Numba CUDA structs and function overloads

I am interested in creating custom structs, creating associated overloaded functions, and compiling numba device functions for a targeted type. I am wondering if function overloading is possible in any fashion.

If a user defines an example function as follows

def sphere(x,y,z):
    return sqrt(x*x + y*y + z*z) - 2

I would like to compile said function where x,y, and z are scalars and again with them as structures. I can define addition and subtraction for the structure but I run into a little trouble when sqrt is used. Is there a way to declare functions (they can all be pre-declared before a user creates their own function) similar to the Templates Numba uses (i.e. ConcreteTemplate) where I specify the input and return types and have the Numba compiler pick the correct sqrt function behind the scenes depending on what types I use? The reason I even ask is it looks like the key to a template is the function itself so it doesn’t look like I can have the same function key point to multiple overloaded functions based on the signature.

If I deferred decorating the device functions and had something like

def sphere(x,y,z):
	return cusom_module.sqrt(x * x + y*y + z*z) - 2

would it be possible to hot swap the module that is referred in the function where one contains the struct version of the sqrt and the other contains the normal scalar versions once the decorator is applied?

Hi @uchytilc,

I think this is possible by defining all the bits and pieces roughly following this example: Example: an interval type — Numba 0.52.0-py3.7-linux-x86_64.egg documentation but it would likely need some adaptation for CUDA.

Another option might be to wait for Extending the extension API for hardware targets by stuartarchibald · Pull Request #6598 · numba/numba · GitHub which will be a while but should make this sort of thing easier.

I think this is possible by defining all the bits and pieces roughly following this example: Example: an interval type — Numba 0.52.0-py3.7-linux-x86_64.egg documentation but it would likely need some adaptation for CUDA.

This notebook contains an adaptation of the Interval example for CUDA: https://github.com/gmarkall/extending-numba-cuda/blob/main/Extending%20Numba’s%20CUDA%20Target.ipynb

Hey @stuartarchibald, @gmarkall was kind enough to send me the interval example on gitter earlier and I have a working example of a float2 and float3 struct within CUDA. What I am trying to do now is define most of the basic math operations on the struct without having to create struct specific versions (i.e. sqrt_vec2() vs just sqrt()) that also work on scalars. I’ve been thinking about how to do this over the last few days and have a few ideas that might work but wanted to check here first to see if it was possible.

What I am trying to do now is define most of the basic math operations on the struct without having to create struct specific versions (i.e. sqrt_vec2() vs just sqrt()) that also work on scalars. I’ve been thinking about how to do this over the last few days and have a few ideas that might work but wanted to check here first to see if it was possible.

I don’t see an obvious way of doing this, but if you’ve had an idea that will work within the existing infrastructure, then it would be great to see it - if it works out, it would be great if you could post some example code!

I have a few quick questions about Numba’s compilation pipeline (sorry if any of my terminology is off, I am still learning how all of this is put together).

  1. I noticed that in libdevice.py there are a number of stub functions that eventually point to libdevice functions. However, if I do either numba.sqrtf or cuda.sqrtf neither successfully compiles. math.sqrt works so I am assuming that at some point the stub functions and the math equivalent functions are tied together somehow. How is this done?

  2. In the interval example given you can define how a class’ property/initialization should be handled when compiled (i.e. you specify how the lowering takes place and define the llvm directly). Is there a way to do this with functions? For example if I wanted to define the sqrt for a vec2 I could do

def sqrt(vec):
    return float2(sqrt(vec.x),sqrt(vec.y))

but is there a way to do it with something like cuda_lower_method (if such a thing exists) where I specify the llvm in a similar fashion.

After looking through libdevicedecl.py, libdevicefuncs.py, and libdeviceimpl.py I think some of my above questions are answered. I am trying to define a custom module based around how the 3 files listed are structured. I am having a little trouble but will probably eventually stumble my way to a solution. If there is a guide on defining your own module that might be helpful but if not no worries.

@gmarkall @stuartarchibald Thanks for all your help so far, I’ve made some great progress. I have one quick question regarding the typer functions. I’ve created my own CUDA module that will contain the libdevice/math library functions. I’ve made each function within the library a CallableTemplate. These seem to require a typer function. Messing with this typer function causes errors quite frequently (for example it isn’t very happy when I use *args/**kwargs). I’ve defined them to work with my current type list but want to be able to extend the functions to new types. Is there any mechanism that would allow me to edit or extend the return values of the typer function? For the custom classes I’ve created I can use the type_callable method from extending and that seems to allow me to define multiple typers for the class allowing for multiple constructing methods. Is there something similar for the CallableTemplete (Can I just use the extending module for this as well)?

No problem, glad it’s going ok.

Indeed, this is a known “issue”, there’s various ways to deal with it but it’s just plain difficult as I expect you’ve found!

I’ve a feeling what you really want is @overloads, which is essentially a mechanism of defining a custom “typing” region (a bit like a typer function) and then an implementation: A guide to using @overload — Numba 0+untagged.4124.gd4460fe.dirty documentation. These don’t exist for CUDA yet. This PR/idea Extending the extension API for hardware targets by stuartarchibald · Pull Request #6598 · numba/numba · GitHub will make it possible.

Presume you want to do something like:

from numba import njit, types
import numpy as np
from numba.extending import overload

def foo(x):
    pass

# register an int specialisation
@overload(foo)
def ol_foo1(x):
    if isinstance(x, types.Integer):
        def impl(x):
            return 'x is int', x
        return impl

# later... register a float specialisation
@overload(foo)
def ol_foo2(x):
    if isinstance(x, types.Float):
        def impl(x):
            return 'x is float', x
        return impl


@njit
def call_foo():
    print(foo(1))
    print(foo(1.2345))

call_foo()

you might need something like numba.core.typing.templates.make_overload_template as an interim, this is getting really tricky though. Also, this is starting to reach into Numba internals a lot so stability/using private API warnings apply.

This is exactly what I am looking for. I’ll keep my eye out for the PR. My current workaround is a little hacky but I think it works ok. For each function I want to register, within the typer I check if the input type is a numba defined type or not. If it isn’t I check the class if a typer method is defined, which specifies the return type.

@cuda_registry.register
class Cuda_abs(CallableTemplate):
    key = libdevice.abs

    def generic(self):
        func = type(self).__name__[5:]
        def typer(typ):
            if not typ.is_internal:
                rtrn = getattr(typ, "typer_" + func, None)()
                if rtrn is None:
                    raise TypeError(f"{typ} does not support this operaion")
                return rtrn
            return typ
        return typer


class Long2(Vec2):
    def __init__(self):
        super().__init__(name = 'long2')

    def typer_abs(self):
        return self

I’m almost finished with the vector classes, I just have one last thing to figure out and it doesn’t look like the guides that are linked cover how to do this. If I create a vector, say a vec3, I can access the elements via .x, .y, and .z, and I’ve also implemented methods to get vec2s containing elements of the vec3 via .xy, .xz, and .yz. The only thing I can’t figure out is how to set elements via these attributes i.e.

vec3 = float3(1.2, 3.2, 4.3)
vec3.xy = float2(1,2)
vec3.z = 5

Is there a way to do this?

Here is a snippet of what I have currently

			@cuda_registry.register_attr
			class Vec3_attrs(AttributeTemplate):
				key = vectype

				def resolve_xy(self, mod):
					return vectype.vec2

				def resolve_xz(self, mod):
					return vectype.vec2

				def resolve_yz(self, mod):
					return vectype.vec2

			@lower_attr(vectype, 'xy')
			def vec3_xy(context, builder, sig, args):
				vec = cgutils.create_struct_proxy(sig)(context, builder, value=args)
				xy = cgutils.create_struct_proxy(sig.vec2)(context, builder)
				xy.x = vec.x
				xy.y = vec.y
				return xy._getvalue()

The code below is where I am at right now. It looks like I need a way to store the input arg into the struct. builder has a store function but I can’t quite figure out how it works with a struct in terms of picking the appropriate member of the struct to store into. I found some code used for setting attrs of a Record and I am currently trying to mess around with that code but I’m not quite there yet.

		@impl_registry.lower_setattr(vectype, 'xy')
		def vec3_xy(context, builder, sig, args):
			typ, valty = sig.args
			target, val = args

			vec3 = cgutils.create_struct_proxy(typ)(context, builder, value=target)
			vec2 = cgutils.create_struct_proxy(valty)(context, builder, value=val)

            #this doesn't work
            vec3.x = vec2.x 
            vec3.y = vec2.y

            #this doesn't seem to work either
           	builder.store(vec2.x, vec3._get_ptr_by_index(0))
           	builder.store(vec2.y, vec3._get_ptr_by_index(1))

EDIT: After spending some time with this it really seams like the _get_ptr_by_index should work. @stuartarchibald @gmarkall let me know if either of you have some insight into why this isn’t working. Does create_struct_proxy create a copy of the original, meaning I am copying the input into the copy not the original?