Tuple of CuPy arrays - Numba Cuda

cuda_enthusiast · October 3, 2022, 6:30am

Hello,

Bellow creates tuple of 2D cupy arrays. For more context, _cpd is a 3 dimensional cupy array and the entire operation below is similar to Pandas’ groupby operation.

tuple(cp.split(_cpd, cp.asnumpy(cp.unique(_cpd[:, col_index], return_index=True)[1][1:])))

Numba cuda works just fine with the original 3D CuPy array. But it interprets the tuple of cupy arrays as a python object and crashes. As per documentation, Numba Cuda supports tuple data types.
What am I missing?

Thanks.

gmarkall · October 3, 2022, 11:55am

Thanks for bringing this up - you’ve stumbled upon a bug which I’ve recorded as: CUDA: Tuples of CAI exporters that aren't Numba device arrays are not recognized by typing · Issue #8483 · numba/numba · GitHub

My intuition is that this shouldn’t be too difficult to resolve so it should get fixed relatively quickly - I will post updates on the issue as it progresses.

cuda_enthusiast · October 3, 2022, 4:23pm

Thank you, @gmarkall. What does Numba release cycles look like? I’m assuming that fix won’t be available till next scheduled release? Thanks again.

gmarkall · October 4, 2022, 10:21am

The fix won’t be generally available in Numba until the next release. However, Numba does allow you to extend the way it handles arguments to kernels with external code, so once I’ve worked out the fix I can post some code here that you will be able to use to make tuples of CuPy arrays work with existing released versions of Numba.

cuda_enthusiast · October 4, 2022, 3:36pm

Thank you @gmarkall, that would be much appreciated!

cuda_enthusiast · October 11, 2022, 4:04am

Hello @gmarkall, would it be possible to provide a very rough estimate of ETA on this issue? Thanks a lot.

gmarkall · October 11, 2022, 10:19am

It’s a bit difficult to estimate - it turns out to be harder to solve this than I first thought - I’m going to bring it up at today’s dev meeting to explore some ideas for resolving it, and will update this post following the dev meeting.

cuda_enthusiast · October 11, 2022, 11:25pm

sounds good @gmarkall, thank you.

gmarkall · October 14, 2022, 8:50am

This was discussed in the dev meeting on Tuesday 11th Oct (minutes).

The reason we have the problem now is that Numba recognises argument types based only on their class, but there is no common base class for objects implementing the CUDA Array Interface. The CUDA target gets around this by special-casing arguments that implement the interface:

github.com

numba/numba/blob/f4674f98795d0c1bbeba693a075885c2e1eace06/numba/cuda/dispatcher.py#L649-L661


      
          def typeof_pyval(self, val):
              # Based on _DispatcherBase.typeof_pyval, but differs from it to support
              # the CUDA Array Interface.
              try:
                  return typeof(val, Purpose.argument)
              except ValueError:
                  if cuda.is_cuda_array(val):
                      # When typing, we don't need to synchronize on the array's
                      # stream - this is done when the kernel is launched.
                      return typeof(cuda.as_cuda_array(val, sync=False),
                                    Purpose.argument)
                  else:
                      raise

As this mechanism operates outside of the typeof() function, it doesn’t work when the CAI-implementing object is inside another object, because the typeof() function has no way to get back to the hack in typeof_pyval(). A proper fix for this is to add another way for us to register argument typing functions, which will be protocol-based instead of class-based - these protocol-based functions will be called based on the attributes an argument has (such as __cuda_array_interface__) rather than their class.

Workaround

In the meantime, you can work around your typing issue by registering a typeof_impl() function specifically for CuPy arrays. An example of this looks like:

from numba import cuda
from numba.core.typing.typeof import typeof, typeof_impl
import cupy as cp


# Class-based registration of typing for CuPy arrays
@typeof_impl.register(cp.ndarray)
def typeof_cupy_ndarray(val, c):
    # Zero-copy convert to Numba device array and use its type
    return typeof(cuda.as_cuda_array(val, sync=False), c)


@cuda.jit
def f(tup):
    """A kernel that copies elements into the first array in the tuple
    from the second array.
    """
    i = cuda.grid(1)
    tup[0][i] = tup[1][i]


N = 32
x = cp.zeros(N)
y = cp.arange(N)

# Will be zeroes
print(x)

f[1, N]((x, y))

# Should be the same as the result of arange() after the kernel
print(x)

# Sanity check - are the arrays equal?
cp.testing.assert_array_equal(x, y)

The code above tells Numba how to type CuPy arrays - it works by converting the CuPy array to a device array and then determining the type of the Numba array, which it knows how to do. This works specifically only for CuPy and not for the CUDA Array Interface in general - if you are using other libraries (e.g. PyTorch, JAX etc.) you will also need to register their tensor / array / matrix objects in the same way as the CuPy registration above.

cuda_enthusiast · October 17, 2022, 4:57am

Hi @gmarkall, thanks for the proposed solution. I’ve tried implementing it but am running into another issue:

numba.cuda.cudadrv.error.NvvmError: Failed to compile
Deprecated data layout, please use one of:
32-bit: e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-i128:128:128-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64
64-bit: e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-i128:128:128-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64
(): Error: Formal parameter space overflowed (35640 bytes required, max 4096 bytes allowed) in function _ZN6cudapy8cuda_test_244B96cw51cXTLSUwv1kAPW1tQPAP9CY9GJAHUqIFJIBltW60OjnB1KwUDHQV1kNNAcQ_2fkQgNFHeY_2fJCGgXiDPuFYTAA_3d_3dE8UniTupleI5ArrayIdLi2E1C7mutable7alignedELi495EE
NVVM_ERROR_COMPILATION

Do you have any suggestions what this could be related to?

Below code sample reproduces the issue.

def fill_arrays(cpd, cuda_array_of_arrays, col_index):
    _cpd = cpd.copy()
    _cpd = _cpd[_cpd[:, col_index].argsort()]
    grouped_data = tuple(cp.split(_cpd, cp.unique(_cpd[:, col_index], return_index=True)[1][1:].get()))
    _blocks, _threads = config_cuda(len(grouped_data))
    cuda_test[_blocks, _threads](grouped_data)


@cuda.jit
def cuda_test(grouped_data):
    _index = cuda.grid(1)
    if _index < len(grouped_data):
        d2_array = grouped_data[_index]
        for i in range(d2_array.shape[0]):
            d2_array[i][0] = 0

For context, grouped_data is a tuple of cupy arrays. Each cuppy array has same shape[1] (aka columns) but shape[0] (aka rows) vary from array to array. In other words two-dimensional arrays inside tuple are not quite homogenous.

Thank you!

gmarkall · October 17, 2022, 11:11am

How many arrays are in your tuple? Passing the tuple gets converted to passing an individual argument for each parameter, so if you have a lot then that will overflow the parameter space.

cuda_enthusiast · October 17, 2022, 11:34am

495 in this particular case. Could go up all the way to 3K+ arrays in some use cases. Thanks.

gmarkall · October 17, 2022, 4:17pm

Thanks for the info. This is like passing in 495 arrays as parameters (or up to 3K+). There are a couple of ways of looking at this:

One is that you need to find another way of passing the parameters. For example, packing all your data into a single contiguous array and passing in an array of offsets to the start of each sub-array.
The other is that Numba’s calling convention on CUDA could support passing a tuple as a pointer to the tuple (and therefore a single parameter) rather than as all of its individual members. However, this would either require a data transfer from host to GPU and therefor a synchronization at each kernel launch (bad for performance in general) or for there to be a way to construct a tuple on the device that resides on it instead of the host. How dynamic is the tuple? Does it tend to consist of the same arrays for many kernel invocations (or even the life of the program), or does it tend to be a different set of arrays for many/most/all calls?

Also, the Awkward Array library seems to be designed for this use case, and has some support for CUDA - I don’t know if the CUDA support will be flexible enough for your use case (I know that there is current work to bring the CUDA support closer to the CPU support) but it might be worth looking at. (cc @jpivarski in case he has some thoughts / guidance to offer here).

jpivarski · October 17, 2022, 5:39pm

Awkward Array is designed for this, and it will eventually allow iteration over ragged arrays in @numba.cuda.jit functions. The issue tracking its status is:

What is the state of Awkward Arrays in Numba's CUDA target? · Issue #1658 · scikit-hep/awkward · GitHub

which is assigned to @ianna. The bigger picture status is that fast iteration over Awkward Arrays is possible in @numba.jit functions (CPU), and Awkward Arrays can be backed by CuPy arrays, rather than NumPy. What remains is to test, add any specializations to work around CPU-vs-GPU context issues, and develop some demonstrations.

Knowing that there’s an interested user waiting to try it out helps a lot. At the moment, @ianna is busy with another Numba-related project as a way of getting up to speed with how Numba and Awkward’s Numba interface work. @cuda_enthusiast, if you have a project that would benefit from the Awkward-Numba-CUDA interface being complete, we can use your sample code as a motivating example and work with you to get it working.

cuda_enthusiast · October 17, 2022, 6:57pm

Hi @jpivarski that would be great! We are dealing with [python] lists/tuples of non-homogeneous CuPy arrays. We could share some sample data and snippets as well. Thanks.

jpivarski · October 17, 2022, 7:20pm

Maybe we should take this over to GitHub, since the issues and PRs would be able to track it there.

A first step could be to turn your problem into one expressed as an Awkward Array, rather than tuples of arrays. It just has to have the structure of your problem, but not necessarily the scale or any private information. It can be generated on the fly (doesn’t have to come from files, but can if they’re small enough). If it is a problem that needs to scale to large datasets (it must be; why else would you be using GPUs?), then it might be necessary to indicate how it scales, like “We’ll have about the same number of nested arrays, but they’ll all be larger” or “we’ll have a lot more nested arrays, but they’ll be about the size shown in this example.”

At that point, we should be able to get the arrays into a Numba-compiled function and test the kind of workflow that you want to do, on the CPU, rather than the GPU. A function like

@nb.cuda.jit
def something(array):
    index = nb.cuda.threadIdx.x
    if index > len(array): return
    ...

can be simulated with

@nb.jit
def something(array):
    for index in len(array):
        ...

and we should be able to do the @nb.jit one now. There may be some caveats about which functions are expected to work in JIT-compiled code and which ones are not, and that would be the first thing to work through.

Making it work in @nb.cuda.jit would be a matter of trying the example that is known to work on the CPU and solving any GPU issues we encounter along the way.

ianna · October 18, 2022, 7:43pm

Hi @jpivarski, @cuda_enthusiast. Here is a draft PR where we can continue the discussion. @cuda_enthusiast please, do share some sample data. I think, it’s would be more productive to start with a realistic example. Thanks!

Topic		Replies	Views
Usage of CUDA Python, Linear Algebra on GPU and Computational Code Community Support	7	2873	December 31, 2021
Cannot create a shared array in a kernel using kernel parameters Community Support	3	882	February 5, 2021
Numba for CUDA Programmers course released Announcements	0	663	April 23, 2021
Random array generation : numba cuda slower than cupy? Support: How do I do ...?	3	1473	July 23, 2021
Built-in Vector Types in Numba Cuda Support: How do I do ...?	1	399	October 18, 2021

Tuple of CuPy arrays - Numba Cuda

Workaround

Related Topics