I need to access several DeviceNDArrays from one kernel.
Previously, when using @njit, I was passing a List() of NDArray references.
But with @cuda.jit this doesn’t seem to work.
Exception: Failed in cuda mode pipeline (step: native lowering)
NRT required but not enabled
During: lowering “array = getitem(value=arrayOfArrays, index=i, fn=)”
Here’s a test example of what I’m trying to do:
import numpy as np
from numba import cuda
from numba.typed import List
@cuda.jit()
def testKernel(arrayOfArrays, numArrays):
    itemIndex = cuda.threadIdx.x
    if itemIndex < numInputItems:
        for i in range(numArrays):
            array = arrayOfArrays[i]  # This line causes "NRT required but not enabled" error
            array[itemIndex] += 1
numInputArrays = 5 # the number may change at runtime
numInputItems = 128 # 1024*1024
#buffersList = List() # apparently it's not allowed to append a DeviceNDArray to typed List
buffersList = [] # use reflected list then
for i in range(numInputArrays):
    buffer_h = np.zeros(numInputItems, dtype=np.float32)
    #buffer_d = cuda.device_array_like(buffer_h)  # same error anyway
    buffer_d = cuda.to_device(buffer_h)
    buffersList.append(buffer_d)
testKernel[1, numInputItems](buffersList, numInputArrays)
cuda.synchronize()
Is it possible to pass an array of arrays to CUDA kernel?
If you just pass a tuple instead of a list, your original kernel works almost verbatim (with a minor bug fix):
@cuda.jit()
def testKernel(arrayOfArrays, numArrays):
    itemIndex = cuda.threadIdx.x
    numInputItems = len(arrayOfArrays[0])  # Bugfix. Assumes they all have the same size
    if itemIndex < numInputItems:
        for i in range(numArrays):
            array = arrayOfArrays[i]  # This line works now
            array[itemIndex] += 1
a = cuda.to_device(np.zeros(10))
b = cuda.to_device(np.ones(10))
testKernel[1, min(len(a), len(b))]((a, b), 2)
a_host = a.copy_to_host()
b_host = b.copy_to_host()
np.testing.assert_allclose(a_host, 1)
np.testing.assert_allclose(b_host, 2)
So that works! I would also suggest a few changes which I think will improve your kernel.
First, you don’t need to pass numArrays, since you can infer it from the tuple with len(). Second, you only index along the thread dimension, so if you launch with more than one block, several blocks will update the same elements; using cuda.grid(1) gives each thread a unique global index instead. There is also an issue with the sizes of the arrays: if you want to allow the arrays to be of different sizes, you need to swap the if and the for. Finally, this kernel only works on 1D arrays; you can make it work for ND arrays by using the .flat property. Changing all this yields:
@cuda.jit()
def testKernelNew(tupleOfArrays):
    numArrays = len(tupleOfArrays)
    itemIndex = cuda.grid(1)
    for i in range(numArrays):
        array = tupleOfArrays[i]
        numInputItems = array.size
        if itemIndex < numInputItems:
            array.flat[itemIndex] += 1
my_range = range(100, 128, 2)
arrays_for_input = tuple(
    [cuda.to_device(i * np.ones(i, dtype=np.float32).reshape(2, -1)) for i in my_range]
)
testKernelNew[(max(a.size for a in arrays_for_input) + 31) // 32, 32](arrays_for_input)
for a, n in zip(arrays_for_input, my_range):
    np.testing.assert_allclose(a.copy_to_host(), n + 1)
    assert a.size == n
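For readers unfamiliar with cuda.grid(1) used above: it is just shorthand for the usual global-thread-index arithmetic. A plain-Python sketch of the equivalent computation (the function name here is illustrative, not part of the Numba API):

```python
def grid_1d(block_idx, block_dim, thread_idx):
    # Equivalent of cuda.grid(1): blockIdx.x * blockDim.x + threadIdx.x
    return block_idx * block_dim + thread_idx

# Thread 5 of block 2, with 32 threads per block, gets global index 69.
index = grid_1d(2, 32, 5)
```

This is why the launch above sizes the grid as `(max_size + 31) // 32` blocks of 32 threads: it guarantees at least one thread per element of the largest array.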
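One caveat worth noting (an addition beyond the answer above): Numba specializes a kernel on its argument types, and a tuple's length is part of its type, so each distinct number of arrays triggers a fresh kernel compilation. If your buffers all share a shape, a common workaround is to stack them into one 2D array, so the kernel takes a single argument whose first dimension is the buffer count and is never recompiled when that count changes. A minimal host-side sketch with NumPy (`stack_buffers` is a hypothetical helper, not from the answer):

```python
import numpy as np

def stack_buffers(buffers):
    # Requires all buffers to have the same shape and dtype;
    # result has shape (numArrays, numItems).
    return np.stack(buffers)

buffers = [np.full(128, i, dtype=np.float32) for i in range(5)]
stacked = stack_buffers(buffers)
# stacked_d = cuda.to_device(stacked)  # a kernel would then index stacked_d[i, j]
```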