There are a couple of things to notice. First, if I run with the simulator, I get:
$ NUMBA_ENABLE_CUDASIM=1 python repro.py
Traceback (most recent call last):
File "/home/gmarkall/numbadev/issues/discourse-828/repro.py", line 789, in <module>
rl_cal[threadsPerBlock, blocksPerGrid](
File "/home/gmarkall/numbadev/numba/numba/cuda/simulator/kernel.py", line 121, in __call__
bm.run(grid_point, *fake_args)
File "/home/gmarkall/numbadev/numba/numba/cuda/simulator/kernel.py", line 281, in run
raise t.exception[0].with_traceback(t.exception[1])
File "/home/gmarkall/numbadev/numba/numba/cuda/simulator/kernel.py", line 175, in run
super(BlockThread, self).run()
File "/home/gmarkall/miniconda3/envs/numbanp120/lib/python3.9/threading.py", line 892, in run
self._target(*self._args, **self._kwargs)
File "/home/gmarkall/numbadev/numba/numba/cuda/simulator/kernel.py", line 259, in target
self._f(*args)
File "/home/gmarkall/numbadev/issues/discourse-828/repro.py", line 305, in rl_cal
crd_d[atom2_up - 1][temp_idx]
File "/home/gmarkall/numbadev/numba/numba/cuda/simulator/cudadrv/devicearray.py", line 62, in __getitem__
return self.__wrap_if_fake(self._item.__getitem__(idx))
File "/home/gmarkall/numbadev/numba/numba/cuda/simulator/cudadrv/devicearray.py", line 111, in __getitem__
ret = self._ary.__getitem__(idx)
IndexError: tid=[0, 0, 0] ctaid=[4, 0, 0]: index 12 is out of bounds for axis 0 with size 10
At line 305 of your code (I reformatted the source for ease of reading, so your line numbers may differ) we have:
crd_d[atom2_up - 1][temp_idx] - crd[atom1 - 1][temp_idx]
Here crd_d is being indexed as crd_d[12][0], which is out of bounds for it, as its shape is (10, 3).
Secondly, this (along with any similar errors) will be the source of the link error: the second term of the expression references crd, not crd_d, so it will be converted into a const array. If your data size is much larger (e.g. 23558) and crd is correspondingly larger, then this will definitely require more than the available constant memory.
So:
- First, ensure that there are no out-of-bounds accesses in the data, using the simulator with the data of size 20 (running with the environment variable NUMBA_ENABLE_CUDASIM=1 set).
- Replace all the references in the kernel to crd with crd_d, and fix any other accidental references to non-device arrays.
- Check that the code works with the 23558-size data. The simulator may be too slow for this, so if you want to check for out-of-bounds accesses, use compute-sanitizer instead. For example:
$ compute-sanitizer python repro.py
========= COMPUTE-SANITIZER
/home/gmarkall/numbadev/numba/numba/cuda/compiler.py:865: NumbaPerformanceWarning: Grid size (64) < 2 * SM count (144) will likely result in GPU under utilization due to low occupancy.
warn(NumbaPerformanceWarning(msg))
If there are bad memory accesses, it reports the function, thread, and address, along with a stack trace, like:
========= Invalid __global__ read of size 8 bytes
========= at 0x6340 in cudapy::__main__::rl_cal$241(Array<double, (int)1, C, mutable, aligned>, Array<double, (int)1, C, mutable, aligned>, Array<long long, (int)2, C, mutable, aligned>, Array<long long, (int)2, C, mutable, aligned>, Array<long long, (int)1, C, mutable, aligned>, Array<double, (int)2, C, mutable, aligned>, Array<double, (int)2, C, mutable, aligned>, Array<long long, (int)2, C, mutable, aligned>, Array<long long, (int)2, C, mutable, aligned>, Array<double, (int)1, C, mutable, aligned>, Array<long long, (int)1, C, mutable, aligned>, Array<double, (int)3, C, mutable, aligned>, Array<double, (int)3, C, mutable, aligned>, Array<double, (int)1, C, mutable, aligned>, Array<double, (int)1, C, mutable, aligned>, long long)
========= by thread (0,0,0) in block (4,0,0)
========= Address 0x7ff9ad202320 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x20b63a]
========= in /usr/lib/x86_64-linux-gnu/libcuda.so.1
...