There are a couple of things to notice. First, if I run with the simulator, I get:
$ NUMBA_ENABLE_CUDASIM=1 python repro.py
Traceback (most recent call last):
File "/home/gmarkall/numbadev/issues/discourse-828/repro.py", line 789, in <module>
rl_cal[threadsPerBlock, blocksPerGrid](
File "/home/gmarkall/numbadev/numba/numba/cuda/simulator/kernel.py", line 121, in __call__
bm.run(grid_point, *fake_args)
File "/home/gmarkall/numbadev/numba/numba/cuda/simulator/kernel.py", line 281, in run
raise t.exception[0].with_traceback(t.exception[1])
File "/home/gmarkall/numbadev/numba/numba/cuda/simulator/kernel.py", line 175, in run
super(BlockThread, self).run()
File "/home/gmarkall/miniconda3/envs/numbanp120/lib/python3.9/threading.py", line 892, in run
self._target(*self._args, **self._kwargs)
File "/home/gmarkall/numbadev/numba/numba/cuda/simulator/kernel.py", line 259, in target
self._f(*args)
File "/home/gmarkall/numbadev/issues/discourse-828/repro.py", line 305, in rl_cal
crd_d[atom2_up - 1][temp_idx]
File "/home/gmarkall/numbadev/numba/numba/cuda/simulator/cudadrv/devicearray.py", line 62, in __getitem__
return self.__wrap_if_fake(self._item.__getitem__(idx))
File "/home/gmarkall/numbadev/numba/numba/cuda/simulator/cudadrv/devicearray.py", line 111, in __getitem__
ret = self._ary.__getitem__(idx)
IndexError: tid=[0, 0, 0] ctaid=[4, 0, 0]: index 12 is out of bounds for axis 0 with size 10
At line 305 of your code (I reformatted the source for ease of reading, so your line numbers may differ) we have:
crd_d[atom2_up - 1][temp_idx] - crd[atom1 - 1][temp_idx]
Here crd_d is being indexed as crd_d[12][0], which is out of bounds for it, as its shape is (10, 3).
Secondly, this (along with any similar errors) will be the source of the link error: the second term of the expression references crd, not crd_d, so it will be converted into a const array. If your data size is much larger (e.g. 23558) and crd is correspondingly larger, then this will definitely require more than the available constant memory.
So:
- First, ensure that there are no out-of-bounds accesses in the data, using the simulator with the data of size 20 (running with the environment variable NUMBA_ENABLE_CUDASIM=1 set).
- Replace all the references in the kernel to crd with crd_d, and fix any other accidental references to non-device arrays.
- Check that the code works with the 23558-size data. The simulator may be too slow for this, so if you want to check for out-of-bounds accesses, use compute-sanitizer instead. For example:
$ compute-sanitizer python repro.py
========= COMPUTE-SANITIZER
/home/gmarkall/numbadev/numba/numba/cuda/compiler.py:865: NumbaPerformanceWarning: Grid size (64) < 2 * SM count (144) will likely result in GPU under utilization due to low occupancy.
warn(NumbaPerformanceWarning(msg))
If there are bad memory accesses, it reports the function, thread, and address, along with a stack trace, like:
========= Invalid __global__ read of size 8 bytes
========= at 0x6340 in cudapy::__main__::rl_cal$241(Array<double, (int)1, C, mutable, aligned>, Array<double, (int)1, C, mutable, aligned>, Array<long long, (int)2, C, mutable, aligned>, Array<long long, (int)2, C, mutable, aligned>, Array<long long, (int)1, C, mutable, aligned>, Array<double, (int)2, C, mutable, aligned>, Array<double, (int)2, C, mutable, aligned>, Array<long long, (int)2, C, mutable, aligned>, Array<long long, (int)2, C, mutable, aligned>, Array<double, (int)1, C, mutable, aligned>, Array<long long, (int)1, C, mutable, aligned>, Array<double, (int)3, C, mutable, aligned>, Array<double, (int)3, C, mutable, aligned>, Array<double, (int)1, C, mutable, aligned>, Array<double, (int)1, C, mutable, aligned>, long long)
========= by thread (0,0,0) in block (4,0,0)
========= Address 0x7ff9ad202320 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x20b63a]
========= in /usr/lib/x86_64-linux-gnu/libcuda.so.1
...