Numba cuda works just fine with the original 3D CuPy array. But it interprets the tuple of cupy arrays as a python object and crashes. As per documentation, Numba Cuda supports tuple data types.
What am I missing?
The fix won’t be generally available in Numba until the next release. However, Numba does allow you to extend the way it handles arguments to kernels with external code, so once I’ve worked out the fix I can post some code here that you will be able to use to make tuples of CuPy arrays work with existing released versions of Numba.
It’s a bit difficult to estimate - it turns out to be harder to solve this than I first thought - I’m going to bring it up at today’s dev meeting to explore some ideas for resolving it, and will update this post following the dev meeting.
This was discussed in the dev meeting on Tuesday 11th Oct (minutes).
The reason we have the problem now is that Numba recognises argument types based only on their class, but there is no common base class for objects implementing the CUDA Array Interface. The CUDA target gets around this by special-casing arguments that implement the interface:
As this mechanism operates outside of the typeof() function, it doesn’t work when the CAI-implementing object is inside another object, because the typeof() function has no way to get back to the hack in typeof_pyval(). A proper fix for this is to add another way for us to register argument typing functions, which will be protocol-based instead of class-based - these protocol-based functions will be called based on the attributes an argument has (such as __cuda_array_interface__) rather than their class.
In the meantime, you can work around your typing issue by registering a typeof_impl() function specifically for CuPy arrays. An example of this looks like:
from numba import cuda
from numba.core.typing.typeof import typeof, typeof_impl
import cupy as cp
# Class-based registration of typing for CuPy arrays
def typeof_cupy_ndarray(val, c):
# Zero-copy convert to Numba device array and use its type
return typeof(cuda.as_cuda_array(val, sync=False), c)
"""A kernel that copies elements into the first array in the tuple
from the second array.
i = cuda.grid(1)
tup[i] = tup[i]
N = 32
x = cp.zeros(N)
y = cp.arange(N)
# Will be zeroes
f[1, N]((x, y))
# Should be the same as the result of arange() after the kernel
# Sanity check - are the arrays equal?
The code above tells Numba how to type CuPy arrays - it works by converting the CuPy array to a device array and then determining the type of the Numba array, which it knows how to do. This works specifically only for CuPy and not for the CUDA Array Interface in general - if you are using other libraries (e.g. PyTorch, JAX etc.) you will also need to register their tensor / array / matrix objects in the same way as the CuPy registration above.
Hi @gmarkall, thanks for the proposed solution. I’ve tried implementing it but am running into another issue:
numba.cuda.cudadrv.error.NvvmError: Failed to compile
Deprecated data layout, please use one of:
(): Error: Formal parameter space overflowed (35640 bytes required, max 4096 bytes allowed) in function _ZN6cudapy8cuda_test_244B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dE8UniTupleI5ArrayIdLi2E1C7mutable7alignedELi495EE
Do you have any suggestions what this could be related to?
For context, grouped_data is a tuple of cupy arrays. Each cuppy array has same shape (aka columns) but shape (aka rows) vary from array to array. In other words two-dimensional arrays inside tuple are not quite homogenous.
Thanks for the info. This is like passing in 495 arrays as parameters (or up to 3K+). There are a couple of ways of looking at this:
One is that you need to find another way of passing the parameters. For example, packing all your data into a single contiguous array and passing in an array of offsets to the start of each sub-array.
The other is that Numba’s calling convention on CUDA could support passing a tuple as a pointer to the tuple (and therefore a single parameter) rather than as all of its individual members. However, this would either require a data transfer from host to GPU and therefor a synchronization at each kernel launch (bad for performance in general) or for there to be a way to construct a tuple on the device that resides on it instead of the host. How dynamic is the tuple? Does it tend to consist of the same arrays for many kernel invocations (or even the life of the program), or does it tend to be a different set of arrays for many/most/all calls?
Also, the Awkward Array library seems to be designed for this use case, and has some support for CUDA - I don’t know if the CUDA support will be flexible enough for your use case (I know that there is current work to bring the CUDA support closer to the CPU support) but it might be worth looking at. (cc @jpivarski in case he has some thoughts / guidance to offer here).
which is assigned to @ianna. The bigger picture status is that fast iteration over Awkward Arrays is possible in @numba.jit functions (CPU), and Awkward Arrays can be backed by CuPy arrays, rather than NumPy. What remains is to test, add any specializations to work around CPU-vs-GPU context issues, and develop some demonstrations.
Knowing that there’s an interested user waiting to try it out helps a lot. At the moment, @ianna is busy with another Numba-related project as a way of getting up to speed with how Numba and Awkward’s Numba interface work. @cuda_enthusiast, if you have a project that would benefit from the Awkward-Numba-CUDA interface being complete, we can use your sample code as a motivating example and work with you to get it working.
Maybe we should take this over to GitHub, since the issues and PRs would be able to track it there.
A first step could be to turn your problem into one expressed as an Awkward Array, rather than tuples of arrays. It just has to have the structure of your problem, but not necessarily the scale or any private information. It can be generated on the fly (doesn’t have to come from files, but can if they’re small enough). If it is a problem that needs to scale to large datasets (it must be; why else would you be using GPUs?), then it might be necessary to indicate how it scales, like “We’ll have about the same number of nested arrays, but they’ll all be larger” or “we’ll have a lot more nested arrays, but they’ll be about the size shown in this example.”
At that point, we should be able to get the arrays into a Numba-compiled function and test the kind of workflow that you want to do, on the CPU, rather than the GPU. A function like
index = nb.cuda.threadIdx.x
if index > len(array): return
can be simulated with
for index in len(array):
and we should be able to do the @nb.jit one now. There may be some caveats about which functions are expected to work in JIT-compiled code and which ones are not, and that would be the first thing to work through.
Making it work in @nb.cuda.jit would be a matter of trying the example that is known to work on the CPU and solving any GPU issues we encounter along the way.
Hi @jpivarski, @cuda_enthusiast. Here is a draft PR where we can continue the discussion. @cuda_enthusiast please, do share some sample data. I think, it’s would be more productive to start with a realistic example. Thanks!