Unable to create an empty array inside a device function

Hello,

I am not able to create an empty array inside a device function that I later want to have filled and returned within the same device function. A short snippet that reproduces the issue is below.

The exception message is also provided. I tried the same with cuda.local.array, without success.
It must be something very obvious, but I’m not sure what exactly is wrong.

Thanks.

from numba import cuda


@cuda.jit(device=True)
def empty_array(input_array):
    # Attempt to allocate a new array inside a device function.
    return cuda.device_array(shape=input_array.shape)


@cuda.jit
def test_cuda(array_of_arrays):
    _index = cuda.grid(1)
    if _index < array_of_arrays.shape[0]:
        dated_array = array_of_arrays[_index]
        a = empty_array(dated_array)

Exception: Failed in cuda mode pipeline (step: nopython frontend)
Failed in cuda mode pipeline (step: nopython frontend)
Unknown attribute 'device_array' of type Module(<module 'numba.cuda' from 'venv\lib\site-packages\numba\cuda\__init__.py'>)

def empty_array(input_array):
    return cuda.device_array(shape=input_array.shape)
    ^

During: typing of get attribute at …
File "…", line 35:
def empty_array(input_array):
    return cuda.device_array(shape=input_array.shape)
    ^

During: resolving callee type: type(<numba.cuda.compiler.Dispatcher object at 0x00000283948DE900>)
During: typing of call at …
File "…", line 43:
def test_cuda(array_of_arrays):

    dated_d2_array = array_of_arrays[_index]
    a = empty_array(dated_d2_array)
    ^

Hi,

I’m not very familiar with your problem; probably @gmarkall can shed some light here.

Anyway, I think the problem here is the dynamic size passed in as a parameter: as with cuda.local.array, probably only a constant size is allowed. As far as I know this is not actually a Numba limitation but a limitation of CUDA itself, since thread-local memory has to be statically allocated by the compiler (I think). But let’s hear someone else’s opinion, as I’m not even sure where cuda.device_array allocates the memory.

It’s not possible to allocate an array inside a device function - any arrays that you want to use in a kernel need to be passed in.
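A minimal sketch of that pattern, with illustrative shapes and a placeholder doubling operation (both my own, not from the original post) - the output array is allocated on the host and passed in alongside the input:

import numpy as np
from numba import cuda

@cuda.jit
def double_rows(array_of_arrays, out):
    # Each thread fills one pre-allocated row of `out` instead of
    # allocating a new array inside the kernel.
    i = cuda.grid(1)
    if i < array_of_arrays.shape[0]:
        for j in range(array_of_arrays.shape[1]):
            out[i, j] = array_of_arrays[i, j] * 2.0

n, m = 8, 16
arr = cuda.to_device(np.arange(n * m, dtype=np.float32).reshape(n, m))
out = cuda.device_array((n, m), dtype=np.float32)  # allocated on the host side
double_rows[1, n](arr, out)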

Ah, @noeliarico beat me to it! :slight_smile:

as with cuda.local.array, probably only a constant size is allowed. As far as I know this is not actually a Numba limitation but a limitation of CUDA itself, since thread-local memory has to be statically allocated by the compiler (I think).

It is correct that only constant-sized local arrays can be used at present. However, this is a limitation of Numba only - you can allocate dynamically-sized local memory in a CUDA C/C++ kernel.
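For reference, a minimal sketch of the constant-size case that does compile today (kernel name and size are illustrative):

from numba import cuda, float32

@cuda.jit
def local_scratch(out):
    # The shape must be a compile-time constant; taking it from an
    # argument (as in the snippet above) is what fails to compile.
    scratch = cuda.local.array(shape=32, dtype=float32)
    i = cuda.grid(1)
    if i < out.shape[0]:
        scratch[0] = out[i]
        out[i] = scratch[0] + 1.0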

We had a session after a Numba dev meeting where I started working on the implementation of dynamic local arrays as a demo of working on Numba (https://www.youtube.com/watch?v=VdqwDyu1lNw), but I never quite finished it up - I did a bit more work on it after that session, and the code in its present form is still waiting to be picked up and finished in my gmarkall/numba branch on GitHub.

I’m not even sure where cuda.device_array allocates the memory.

Internally, cuda.device_array calls the driver function cuMemAlloc on the host to allocate memory. However, you can replace the memory allocator with something else using the External Memory Management (EMM) plugin interface, which you would do if you wanted Numba to share a memory pool with another library.
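In other words, the same cuda.device_array call that fails inside a device function is fine on the host. A sketch - the EMM swap is shown commented out with a hypothetical manager class name, since the concrete class depends on the library you plug in:

import numpy as np
from numba import cuda

# Host-side allocation - this is where cuda.device_array is meant to be
# called; under the default allocator it is backed by cuMemAlloc.
d_arr = cuda.device_array(shape=(1024,), dtype=np.float32)

# Hypothetical EMM swap: MyMemoryManager is a placeholder for a class
# implementing numba.cuda.BaseCUDAMemoryManager, and it must be set
# before the CUDA context is initialized.
# cuda.set_memory_manager(MyMemoryManager)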

@gmarkall, @noeliarico - got it! Thank you - I ended up creating an empty CuPy placeholder with the right shape and passing it to the kernel as an argument. The kernel then passes it down to a device function, and everything works fine.
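For anyone who finds this later, a minimal sketch of that workaround (names and shapes are illustrative, not my exact code) - Numba kernels accept CuPy arrays directly via the CUDA Array Interface:

import cupy as cp
from numba import cuda

@cuda.jit(device=True)
def fill_row(row, out_row):
    # The device function only writes into the pre-allocated row.
    for j in range(row.shape[0]):
        out_row[j] = row[j]

@cuda.jit
def test_cuda(array_of_arrays, out):
    i = cuda.grid(1)
    if i < array_of_arrays.shape[0]:
        fill_row(array_of_arrays[i], out[i])

arr = cp.arange(128, dtype=cp.float32).reshape(8, 16)
out = cp.empty_like(arr)  # empty CuPy placeholder with the right shape
test_cuda[1, 8](arr, out)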

Dynamic local memory allocation would be a nice enhancement for Numba.

Thanks again.
