Reducing number of registers used by Numba kernel?

My Numba kernel is using too many registers, resulting in CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE. Admittedly, the kernel does have a lot of local int32 variables. What’s strange is:

  1. If I comment out the last line of the kernel (which is a call to a device function with a modest number of variables), the number of registers used is under the limit, but
  2. by the time that last line executes, almost all of the kernel's variables are no longer relevant (they are not arguments to the device function).

It seems the compiler should recognize point 2 and reclaim those registers.

I tried moving almost all of the lines of the kernel, except that last one, into a device function and having the kernel call it. I thought that would signal to the compiler that those variables go out of scope before the final call. However, it didn't reduce the number of registers used. It's as if the compiler inlined the device function (despite it being decorated with inline=False) and ignored the scoping.
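
To illustrate, the restructuring looked roughly like this (a simplified sketch with made-up names and a made-up workload, not my actual kernel):

```python
from numba import cuda

# Device function holding the bulk of the original kernel body (many locals);
# inline=False is meant to keep it out-of-line, but it seems to get inlined anyway.
@cuda.jit(device=True, inline=False)
def heavy_body(data, i):
    acc = 0
    for j in range(data.shape[1]):
        acc += data[i, j]
    return acc

# Device function called on the "last line"; it only needs a couple of values.
@cuda.jit(device=True, inline=False)
def final_step(out, i, value):
    out[i] = value

@cuda.jit
def kernel(data, out):
    i = cuda.grid(1)
    if i < data.shape[0]:
        value = heavy_body(data, i)   # locals of heavy_body should be dead here...
        final_step(out, i, value)     # ...yet register usage is unchanged
```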

Is there anything I can do to guide the compiler towards reclaiming some registers so as to reduce the total register usage?

Another interesting observation is that the assembly code seems to use a lot of 64-bit registers, even though there is no explicit use of 64-bit values in the code. I've tried inserting lots of casts to int32, but that didn't make any difference. Maybe the 64-bit values arise in array indexing? Is there a way to address this problem?
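
For example, the casts I've been inserting look along these lines (again just a sketch of the pattern, not the real code):

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_one(arr):
    # Cast the thread index to int32 in the hope of keeping index arithmetic
    # in 32-bit registers; 64-bit registers still show up, presumably because
    # the array indexing itself is performed with 64-bit offsets internally.
    i = np.int32(cuda.grid(1))
    if i < np.int32(arr.shape[0]):
        arr[i] = arr[i] + 1
```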

Thank you!

A quick note on a potential workaround: does passing the max_registers keyword argument to the jit decorator help? (See CUDA Kernel API — Numba CUDA)
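
For example (the cap of 32 is just a placeholder value to illustrate the keyword):

```python
from numba import cuda

# Ask the compiler to limit register usage per thread; values that don't fit
# spill to local memory, which may cost some performance but can let the
# cooperative launch fit on the device.
@cuda.jit(max_registers=32)
def my_kernel(arr):
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] += 1
```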

There is an example of this in the test suite: numba-cuda/numba_cuda/numba/cuda/tests/cudadrv/test_linker.py at main · NVIDIA/numba-cuda · GitHub

In general you don’t want to fiddle with things just to make a launch fit on a particular device, because different devices have different maximum cooperative launch sizes (the limit may even vary across toolkit versions if they produce different register allocations).

Instead I would suggest trying to find a way to write your kernel such that it can work correctly regardless of the number of blocks launched, and then use the max_cooperative_grid_blocks() function to check what size grid can be used: Cooperative Groups — Numba CUDA (with an example of usage at the bottom of that page).
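
Roughly following the example from that page, the check looks like this (a sketch; the kernel is just the docs' illustration, not your use case):

```python
import numpy as np
from numba import cuda, int32

sig = (int32[:, ::1],)

@cuda.jit(sig)
def sequential_rows(M):
    # Each column is handled by one thread; rows are processed in lockstep
    # using a grid-wide sync, so this requires a cooperative launch.
    col = cuda.grid(1)
    g = cuda.cg.this_grid()
    rows, cols = M.shape
    for row in range(1, rows):
        opposite = cols - col - 1
        M[row, col] = M[row - 1, opposite] + 1
        g.sync()

A = np.zeros((1024, 1024), dtype=np.int32)
blockdim = 32
griddim = A.shape[1] // blockdim

# Query how many blocks the device supports for a cooperative launch of this
# kernel before launching it.
overload = sequential_rows.overloads[sig]
max_blocks = overload.max_cooperative_grid_blocks(blockdim)
if griddim > max_blocks:
    raise RuntimeError("Grid too large for a cooperative launch on this device")

sequential_rows[griddim, blockdim](A)
```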