My Numba kernel is using too many registers, resulting in CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE. Admittedly, the kernel does have a lot of local int32 variables. What’s strange is:
- If I comment out the last line of the kernel (which is a call to a device function with a modest number of variables), the number of registers used is under the limit, but
- by the time the last line is to be executed, almost all of the variables of the kernel are no longer relevant (they are not arguments to the device function).
It seems that the compiler should recognize the second point and reclaim the registers.
I tried making a device function consisting of almost all the lines of the kernel except that last line, and having the kernel call this device function. I thought that would signal to the compiler that those variables are no longer in scope. However, it didn’t reduce the number of registers used. It’s as if the compiler inlined the device function (despite inline=False in the decorator) and forgot about scoping.
Is there anything I can do to guide the compiler towards reclaiming some registers so as to reduce the total register usage?
Another interesting observation is that the assembly code seems to use a lot of 64-bit registers, even though there is no explicit use of 64-bit values in the code. I’ve tried inserting lots of casts to int32 but that didn’t make any difference. Maybe the 64-bit values arise in array indexing? Is there a way to address this problem?
Thank you!