Reducing number of registers used by Numba kernel?

My Numba kernel is using too many registers, resulting in CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE. Admittedly, the kernel does have a lot of local int32 variables. What’s strange is:

  1. If I comment out the last line of the kernel (which is a call to a device function with a modest number of variables), the number of registers used is under the limit, but
  2. by the time that last line executes, almost all of the kernel's variables are no longer relevant (they are not arguments to the device function).

It seems the compiler should recognize point 2 and reclaim those registers.

I tried moving almost all of the lines of the kernel, except that last one, into a device function and having the kernel call it. I thought that would signal to the compiler that those variables go out of scope before the final call. However, it didn't reduce the number of registers used. It's as if the compiler inlined the device function (despite it being decorated with inline=False) and ignored the scoping.
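
To illustrate, the restructuring looked roughly like this (a simplified sketch with made-up names and a made-up workload, not my actual kernel):

```python
from numba import cuda

# Device function holding the bulk of the original kernel body (many locals);
# inline=False is meant to keep it out-of-line, but it seems to get inlined anyway.
@cuda.jit(device=True, inline=False)
def heavy_body(data, i):
    acc = 0
    for j in range(data.shape[1]):
        acc += data[i, j]
    return acc

# Device function called on the "last line"; it only needs a couple of values.
@cuda.jit(device=True, inline=False)
def final_step(out, i, value):
    out[i] = value

@cuda.jit
def kernel(data, out):
    i = cuda.grid(1)
    if i < data.shape[0]:
        value = heavy_body(data, i)   # locals of heavy_body should be dead here...
        final_step(out, i, value)     # ...yet register usage is unchanged
```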

Is there anything I can do to guide the compiler towards reclaiming some registers so as to reduce the total register usage?

Another interesting observation is that the assembly code seems to use a lot of 64-bit registers, even though there is no explicit use of 64-bit values in the code. I've tried inserting lots of casts to int32, but that didn't make any difference. Maybe the 64-bit values arise in array indexing? Is there a way to address this problem?
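
For example, the casts I've been inserting look along these lines (again just a sketch of the pattern, not the real code):

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_one(arr):
    # Cast the thread index to int32 in the hope of keeping index arithmetic
    # in 32-bit registers; 64-bit registers still show up, presumably because
    # the array indexing itself is performed with 64-bit offsets internally.
    i = np.int32(cuda.grid(1))
    if i < np.int32(arr.shape[0]):
        arr[i] = arr[i] + 1
```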

Thank you!

A quick note on a potential workaround: does passing the max_registers keyword argument to the jit decorator help? (See CUDA Kernel API — Numba CUDA)
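
For example (the cap of 32 is just a placeholder value to illustrate the keyword):

```python
from numba import cuda

# Ask the compiler to limit register usage per thread; values that don't fit
# spill to local memory, which may cost some performance but can let the
# cooperative launch fit on the device.
@cuda.jit(max_registers=32)
def my_kernel(arr):
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] += 1
```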

There is an example of this in the test suite: numba-cuda/numba_cuda/numba/cuda/tests/cudadrv/test_linker.py at main · NVIDIA/numba-cuda · GitHub

In general you don’t want to fiddle with things just to make a launch fit on a particular device, because different devices have different maximum cooperative launch sizes (the limit may even vary across toolkit versions if they produce different register allocations).

Instead I would suggest trying to find a way to write your kernel such that it can work correctly regardless of the number of blocks launched, and then use the max_cooperative_grid_blocks() function to check what size grid can be used: Cooperative Groups — Numba CUDA (with an example of usage at the bottom of that page).
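
Roughly following the example from that page, the check looks like this (a sketch; the kernel is just the docs' illustration, not your use case):

```python
import numpy as np
from numba import cuda, int32

sig = (int32[:, ::1],)

@cuda.jit(sig)
def sequential_rows(M):
    # Each column is handled by one thread; rows are processed in lockstep
    # using a grid-wide sync, so this requires a cooperative launch.
    col = cuda.grid(1)
    g = cuda.cg.this_grid()
    rows, cols = M.shape
    for row in range(1, rows):
        opposite = cols - col - 1
        M[row, col] = M[row - 1, opposite] + 1
        g.sync()

A = np.zeros((1024, 1024), dtype=np.int32)
blockdim = 32
griddim = A.shape[1] // blockdim

# Query how many blocks the device supports for a cooperative launch of this
# kernel before launching it.
overload = sequential_rows.overloads[sig]
max_blocks = overload.max_cooperative_grid_blocks(blockdim)
if griddim > max_blocks:
    raise RuntimeError("Grid too large for a cooperative launch on this device")

sequential_rows[griddim, blockdim](A)
```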