Extending Numba for CUDA inside Cppyy

I’m not really familiar with cppyy’s numba_ext, so without sitting down and starting to hack on cppyy myself, it’s not really clear to me what the immediate next steps to fix this are. Maybe someone who knows cppyy better than I do can provide some pointers here.

“Normal” C++ functions, i.e. those without a __device__ annotation, cannot be called from within kernels. In CUDA C++, you always need to mark a function __device__ (or __host__ __device__) if you want it to be compiled for the GPU - this is what I was referring to here:
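To illustrate the distinction, here is a minimal sketch, assuming a CUDA-enabled cppyy/Cling build (the function names are made up):

```python
import cppyy

# Assumption: this requires a cppyy/Cling build with CUDA support.
# Only the __device__ (or __host__ __device__) function is compiled
# for the GPU and is therefore callable from within kernels.
cppyy.cppdef(r"""
__device__ float scale_dev(float x)  { return 2.0f * x; }  // GPU-callable
float            scale_host(float x) { return 2.0f * x; }  // host-only
""")
```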

I think there should be some logic that calls cuda.declare_device() with appropriate arguments for each C++ function you want to make accessible from @cuda.jit-decorated functions. I don’t think you need to copy the logic of declare_device() into cppyy.
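For example, something along these lines (the function name and signature here are hypothetical):

```python
from numba import cuda

# Declare an external device function named 'mul', taking two
# float32 arguments and returning a float32, so that @cuda.jit
# kernels can reference it.
mul = cuda.declare_device('mul', 'float32(float32, float32)')
```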


@gmarkall I tried following what you mentioned here:

I think there should be some logic that calls cuda.declare_device() with appropriate arguments for each C++ function you want to make accessible from @cuda.jit-decorated functions.

cuda.declare_device() returns a descriptor, and when I print it, it shows something like this: <numba.cuda.compiler.ExternFunction object at 0x7f54215cc5e0>

Here, as described in the Numba documentation under “Calling foreign functions from Python kernels”, the external code is linked in by passing the .cu/PTX file to the kernel. However, this won’t help in the numba extension, since we don’t have the information on what to link against prior to compilation time.
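For reference, the documented linking pattern looks roughly like this (functions.cu is a hypothetical file containing the device-side definition of mul):

```python
from numba import cuda

mul = cuda.declare_device('mul', 'float32(float32, float32)')

# The link= argument tells Numba which external code to compile and
# link against - exactly the information cppyy's numba extension
# doesn't have ahead of time.
@cuda.jit(link=['functions.cu'])
def kernel(r, x, y):
    i = cuda.grid(1)
    if i < len(r):
        r[i] = mul(x[i], y[i])
```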

So, I just want to know whether there is some alternate way to use the descriptor obtained above to call a function from a Numba kernel. Can we look into some way of using it to schedule the call - for example, obtaining the function pointer behind the descriptor and then using Numba’s CUDA helper functions to schedule the function?

I am stuck, so any suggestions on how to move forward with this task would help.

Where does the information to construct the arguments to declare_device() come from?

At what point do you know exactly what code the kernel will call?

The only way to use the returned descriptor is to call it from inside a Numba kernel. You can’t obtain a usable function pointer from it.

I imagine at some point prior to launching the kernel, you have all the information available about what code is to be called and executed on the device, but (since I don’t know cppyy I’m inferring / guessing from the content of our discussion) it sounds like the present architecture of its Numba extension is not a straightforward match for the requirements for linking external code in the CUDA target. Therefore, we need to either:

  • Make some modifications to the way cppyy works in this regard, or
  • Modify / add to the ways in which you can interact with external code from the Numba CUDA target.

The first can be done independently of Numba CUDA - the second, I am happy to work on with you, but I’d need you to drive the direction of it towards something that meets cppyy’s needs (and is still feasible within the CUDA execution model).