Is there a way to pass a list of arrays to a CUDA kernel?

I need to access several DeviceNDArrays from one kernel.
Previously, when using @njit, I was passing a List() of NDArray references.
But with @cuda.jit this doesn’t seem to work.

Exception: Failed in cuda mode pipeline (step: native lowering)
NRT required but not enabled
During: lowering “array = getitem(value=arrayOfArrays, index=i, fn=<built-in function getitem>)”

Here’s a test example of what I’m trying to do:

import numpy as np
from numba import cuda
from numba.typed import List

@cuda.jit()
def testKernel(arrayOfArrays, numArrays):
    itemIndex = cuda.threadIdx.x
    if (itemIndex < numInputItems):
        for i in range(numArrays):
            array = arrayOfArrays[i] # This line causes "NRT required but not enabled" error
            array[itemIndex] += 1

numInputArrays = 5 # the number may change at runtime
numInputItems = 128 # 1024*1024
#buffersList = List() # apparently it's not allowed to append a DeviceNDArray to typed List
buffersList = [] # use reflected list then
for i in range(numInputArrays):
    buffer_h = np.zeros(numInputItems, dtype=np.float32)
    #buffer_d = cuda.device_array_like(buffer_h) # same error anyway
    buffer_d = cuda.to_device(buffer_h)
    buffersList.append(buffer_d)

testKernel[1, numInputItems](buffersList, numInputArrays)
cuda.synchronize()

Is it possible to pass an array of arrays to a CUDA kernel?

I’m using Numba 0.57.

You are actually really close! Numba supports passing a tuple of arrays, but not a list of them. A list is a reference-counted object that needs the Numba Runtime (NRT) for memory management, and the CUDA target doesn’t enable NRT, which is exactly what the error is telling you. A tuple, on the other hand, has a length and element types that are fixed at compile time, so it can be passed by value.

So if you just pass a tuple instead of a list, your original kernel works almost verbatim (with a minor bug fix):

@cuda.jit()
def testKernel(arrayOfArrays, numArrays):
    itemIndex = cuda.threadIdx.x
    numInputItems = len(arrayOfArrays[0]) # Bugfix. Assumes they all have the same size
    if (itemIndex < numInputItems):
        for i in range(numArrays):
            array = arrayOfArrays[i] # This line works now
            array[itemIndex] += 1

a = cuda.to_device(np.zeros(10))
b = cuda.to_device(np.ones(10))
testKernel[1, min(len(a), len(b))]((a, b), 2)
a_host = a.copy_to_host()
b_host = b.copy_to_host()
np.testing.assert_allclose(a_host, 1)
np.testing.assert_allclose(b_host, 2)
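
One detail worth noting about why this compiles: indexing the tuple with a runtime i only type-checks because all the arrays share a dtype and dimensionality, so Numba types the argument as a homogeneous tuple (a UniTuple). If you mix dtypes or arrays of different ndim, the tuple becomes heterogeneous and runtime indexing will no longer compile.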

So that works! I would also suggest a few changes which I think will improve your kernel:

1. You don’t need to pass numArrays, since you can get it from the tuple with len() at compile time.
2. You only index with cuda.threadIdx.x, so if you launch with more than one block, threads in different blocks will update the same elements. cuda.grid(1) gives each thread a unique global index instead.
3. There is also an issue with the sizes of the arrays: if you want to allow them to differ, you need to swap the if and the for, checking each array’s size inside the loop.
4. As written, the kernel only works on 1D arrays. You can make it handle ND arrays by indexing through the .flat property.

Changing all this yields:

@cuda.jit()
def testKernelNew(tupleOfArrays):
    numArrays = len(tupleOfArrays)
    itemIndex = cuda.grid(1)
    for i in range(numArrays):
        array = tupleOfArrays[i]
        numInputItems = array.size
        if itemIndex < numInputItems:
            array.flat[itemIndex] += 1


my_range = range(100, 128, 2)
arrays_for_input = tuple(
    [cuda.to_device(i * np.ones(i, dtype=np.float32).reshape(2, -1)) for i in my_range]
)
testKernelNew[(max(a.size for a in arrays_for_input) + 31) // 32, 32](arrays_for_input)
for a, n in zip(arrays_for_input, my_range):
    np.testing.assert_allclose(a.copy_to_host(), n + 1)
    assert a.size == n
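
One caveat, since you mentioned the number of arrays may change at runtime: the tuple’s length and element types become part of the kernel’s signature, so every new tuple length triggers a fresh JIT compilation. If I recall the dispatcher API correctly, you can observe this through its overloads dict, which holds one entry per compiled specialization:

testKernelNew[1, 32]((a,))           # compiles a 1-tuple specialization
testKernelNew[1, 32]((a, b))         # compiles a separate 2-tuple specialization
print(len(testKernelNew.overloads))  # one entry per distinct signature compiled so far

If numInputArrays only ever takes a few distinct values, this is a one-off cost per value and nothing to worry about.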

Thanks! I didn’t even consider using tuples before.

So my test case works by just converting the list to a tuple when passing it to the kernel:

testKernel[1, numInputItems](tuple(buffersList), numInputArrays)