I understand that, in Cuda, “shared memory draws from a pool shared with the L1 cache” and that “the programmer can configure how much is allocated to shared memory, up to 96KB.” Can the Numba Cuda programmer configure this? If so, how?
Thanks!
The quick answer is that yes, the Numba CUDA programmer can configure this, but there’s no good API for it. I’m sure I’ve written about this somewhere before, but I haven’t been able to locate the guidance on this that I gave to someone else who asked the same question.
I’ll continue searching and if I don’t find it, I’ll put together another example illustrating how to configure the shared memory / L1 carve out with Numba-CUDA.
I would greatly appreciate it!
Were you able to find the guidance? I'm a bit stuck: I've augmented the algorithm to handle larger data, but in the modified code the number of registers is too large, so I get CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE.
I think configuring the shared memory to be slightly larger would enable the code to handle the dataset I need it to handle.
I’d be very grateful to get guidance on this as soon as convenient for you.
I didn’t find an example. There are some bits of functionality for this in Numba-CUDA, but it turns out they’re untested and perhaps bitrotted. I was trying:
```python
from numba import cuda, types

sig = (types.float32[::1], types.float32[::1])

@cuda.jit(sig)
def add_one(r, x):
    i = cuda.grid(1)
    if i < len(r):
        r[i] = x[i] + 1

kernel = add_one.overloads[sig]
cufunc = kernel._codelibrary.get_cufunc()
cufunc.cache_config(prefer_shared=True)
```
but this fails with:
```
Traceback (most recent call last):
  File "/home/gmarkall/numbadev/issues/discourse-3080/repro.py", line 13, in <module>
    cufunc.cache_config(prefer_shared=True)
  File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/cudadrv/driver.py", line 2418, in cache_config
    flag = attr.CU_FUNC_CACHE_PREFER_SHARED
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: type object 'CUfunction_attribute' has no attribute 'CU_FUNC_CACHE_PREFER_SHARED'
```
I suspect it worked with the ctypes bindings but no longer works now that we use the NVIDIA cuda-python bindings. (In the driver API, CU_FUNC_CACHE_PREFER_SHARED belongs to the CUfunc_cache enum, not CUfunction_attribute, which would explain the AttributeError.)
I’ve created a feature request on the Numba-CUDA issue tracker for this: [FEA] Support for L1 cache / shared memory configuration · Issue #617 · NVIDIA/numba-cuda · GitHub
Thanks. Is there even a hacky approach to doing it that will tide me over until the feature is added to Numba? Maybe involving calling some C code?
You're right, there will be a hacky way, but it won't need C code; just using ctypes should be enough. Let me try a couple of things and get back to you.
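In the meantime, here's a rough, untested sketch of the kind of thing I mean: calling cuFuncSetCacheConfig from the CUDA driver API directly via ctypes, bypassing the broken cache_config() path. The enum values are copied from the CUfunc_cache enum in cuda.h; how you get the raw CUfunction handle out of Numba is an internal detail that may differ between versions, so treat this as a starting point rather than a verified solution.

```python
# Untested sketch: set the L1 / shared memory preference for a compiled
# Numba kernel by calling the CUDA driver API directly through ctypes.
import ctypes

# Values from the CUfunc_cache enum in cuda.h
CU_FUNC_CACHE_PREFER_NONE = 0
CU_FUNC_CACHE_PREFER_SHARED = 1  # favour shared memory over L1
CU_FUNC_CACHE_PREFER_L1 = 2      # favour L1 over shared memory
CU_FUNC_CACHE_PREFER_EQUAL = 3


def set_cache_config(cufunc_handle, config=CU_FUNC_CACHE_PREFER_SHARED):
    """Call cuFuncSetCacheConfig(handle, config) via the driver library.

    `cufunc_handle` is assumed to be convertible to an integer pointer
    value (e.g. the handle of a Numba-compiled kernel's cufunc object).
    """
    libcuda = ctypes.CDLL("libcuda.so")
    err = libcuda.cuFuncSetCacheConfig(
        ctypes.c_void_p(int(cufunc_handle)), ctypes.c_int(config)
    )
    if err != 0:  # CUDA_SUCCESS == 0
        raise RuntimeError(f"cuFuncSetCacheConfig failed with error {err}")
```

You'd call it as something like `set_cache_config(kernel._codelibrary.get_cufunc().handle)` after compilation; I haven't verified this against a live GPU yet.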
I found this workaround: [FEA] Provide a way to set the maximum dynamic shared memory size · Issue #94 · NVIDIA/numba-cuda · GitHub - I need to try it to make sure it still works, but it provides a starting point for me to fix up if not.
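Separately from whatever that issue suggests, a similar ctypes sketch for raising a kernel's dynamic shared memory limit via cuFuncSetAttribute (the driver-API opt-in for going beyond the default allocation on recent GPUs) might look like the following. The attribute values are copied from the CUfunction_attribute enum in cuda.h, and this is untested in the same way as the cache-config sketch above:

```python
# Untested sketch: opt a compiled kernel in to a larger dynamic shared
# memory allocation by calling cuFuncSetAttribute through ctypes.
import ctypes

# Values from the CUfunction_attribute enum in cuda.h
CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES = 8
CU_FUNC_ATTRIBUTE_PREFERRED_SHARED_MEMORY_CARVEOUT = 9


def set_max_dynamic_shared(cufunc_handle, nbytes):
    """Raise the kernel's dynamic shared memory cap to `nbytes` bytes.

    `cufunc_handle` is assumed to be convertible to an integer pointer
    value, as for the cache-config sketch.
    """
    libcuda = ctypes.CDLL("libcuda.so")
    err = libcuda.cuFuncSetAttribute(
        ctypes.c_void_p(int(cufunc_handle)),
        ctypes.c_int(CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES),
        ctypes.c_int(nbytes),
    )
    if err != 0:  # CUDA_SUCCESS == 0
        raise RuntimeError(f"cuFuncSetAttribute failed with error {err}")
```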