I understand that, in Cuda, “shared memory draws from a pool shared with the L1 cache” and that “the programmer can configure how much is allocated to shared memory, up to 96KB.” Can the Numba Cuda programmer configure this? If so, how?
Thanks!
The quick answer is that yes, the Numba CUDA programmer can configure this, but there’s no good API for it. I’m sure I’ve written about this somewhere before, but I haven’t been able to locate the guidance on this that I gave to someone else who asked the same question.
I’ll continue searching and if I don’t find it, I’ll put together another example illustrating how to configure the shared memory / L1 carve out with Numba-CUDA.
I would greatly appreciate it!
Were you able to find the guidance? I'm a bit stuck: I've augmented the algorithm to handle larger data, but in the modified code the number of registers is too large, so I get CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE.
I think configuring the shared memory to be slightly larger would enable the code to handle the dataset I need it to handle.
I’d be very grateful to get guidance on this as soon as convenient for you.
I didn’t find an example. There are some bits of functionality for this in Numba-CUDA, but it turns out they’re untested and perhaps bitrotted. I was trying:
```python
from numba import cuda, types

sig = (types.float32[::1], types.float32[::1])

@cuda.jit(sig)
def add_one(r, x):
    i = cuda.grid(1)
    if i < len(r):
        r[i] = x[i] + 1

kernel = add_one.overloads[sig]
cufunc = kernel._codelibrary.get_cufunc()
cufunc.cache_config(prefer_shared=True)
```
but this fails with:
```
Traceback (most recent call last):
  File "/home/gmarkall/numbadev/issues/discourse-3080/repro.py", line 13, in <module>
    cufunc.cache_config(prefer_shared=True)
  File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/cudadrv/driver.py", line 2418, in cache_config
    flag = attr.CU_FUNC_CACHE_PREFER_SHARED
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: type object 'CUfunction_attribute' has no attribute 'CU_FUNC_CACHE_PREFER_SHARED'
```
I suspect it worked with the ctypes bindings but no longer works now that we use the NVIDIA cuda-python bindings. (In the driver API, CU_FUNC_CACHE_PREFER_SHARED belongs to the CUfunc_cache enum, not CUfunction_attribute, which would explain the AttributeError.)
I’ve created a feature request on the Numba-CUDA issue tracker for this: [FEA] Support for L1 cache / shared memory configuration · Issue #617 · NVIDIA/numba-cuda · GitHub
Thanks. Is there even a hacky approach to doing it that will tide me over until the feature is added to Numba? Maybe involving calling some C code?
You're right, there will be a hacky way, but it won't need C code; just using ctypes should be enough. Let me try a couple of things and get back to you.
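In the meantime, here's a rough, untested sketch of the kind of thing I mean: calling cuFuncSetCacheConfig from the CUDA driver API directly via ctypes, bypassing the broken cache_config() path. The enum values are copied from the CUfunc_cache enum in cuda.h; how you get the raw CUfunction handle out of Numba is an internal detail that may differ between versions, so treat this as a starting point rather than a verified solution.

```python
# Untested sketch: set the L1 / shared memory preference for a compiled
# Numba kernel by calling the CUDA driver API directly through ctypes.
import ctypes

# Values from the CUfunc_cache enum in cuda.h
CU_FUNC_CACHE_PREFER_NONE = 0
CU_FUNC_CACHE_PREFER_SHARED = 1  # favour shared memory over L1
CU_FUNC_CACHE_PREFER_L1 = 2      # favour L1 over shared memory
CU_FUNC_CACHE_PREFER_EQUAL = 3


def set_cache_config(cufunc_handle, config=CU_FUNC_CACHE_PREFER_SHARED):
    """Call cuFuncSetCacheConfig(handle, config) via the driver library.

    `cufunc_handle` is assumed to be convertible to an integer pointer
    value (e.g. the handle of a Numba-compiled kernel's cufunc object).
    """
    libcuda = ctypes.CDLL("libcuda.so")
    err = libcuda.cuFuncSetCacheConfig(
        ctypes.c_void_p(int(cufunc_handle)), ctypes.c_int(config)
    )
    if err != 0:  # CUDA_SUCCESS == 0
        raise RuntimeError(f"cuFuncSetCacheConfig failed with error {err}")
```

You'd call it as something like `set_cache_config(kernel._codelibrary.get_cufunc().handle)` after compilation; I haven't verified this against a live GPU yet.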
I found this workaround: [FEA] Provide a way to set the maximum dynamic shared memory size · Issue #94 · NVIDIA/numba-cuda · GitHub - I need to try it to make sure it still works, but it provides a starting point for me to fix up if not.
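Separately from whatever that issue suggests, a similar ctypes sketch for raising a kernel's dynamic shared memory limit via cuFuncSetAttribute (the driver-API opt-in for going beyond the default allocation on recent GPUs) might look like the following. The attribute values are copied from the CUfunction_attribute enum in cuda.h, and this is untested in the same way as the cache-config sketch above:

```python
# Untested sketch: opt a compiled kernel in to a larger dynamic shared
# memory allocation by calling cuFuncSetAttribute through ctypes.
import ctypes

# Values from the CUfunction_attribute enum in cuda.h
CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES = 8
CU_FUNC_ATTRIBUTE_PREFERRED_SHARED_MEMORY_CARVEOUT = 9


def set_max_dynamic_shared(cufunc_handle, nbytes):
    """Raise the kernel's dynamic shared memory cap to `nbytes` bytes.

    `cufunc_handle` is assumed to be convertible to an integer pointer
    value, as for the cache-config sketch.
    """
    libcuda = ctypes.CDLL("libcuda.so")
    err = libcuda.cuFuncSetAttribute(
        ctypes.c_void_p(int(cufunc_handle)),
        ctypes.c_int(CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES),
        ctypes.c_int(nbytes),
    )
    if err != 0:  # CUDA_SUCCESS == 0
        raise RuntimeError(f"cuFuncSetAttribute failed with error {err}")
```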