ValueError:
Kernel launch configuration was not specified. Use the syntax:
kernel_function[blockspergrid, threadsperblock](arg0, arg1, ..., argn)
See https://numba.pydata.org/numba-doc/latest/cuda/kernels.html#kernel-invocation for help.
I’ve read the document but I’m struggling to understand the change to the code I need to make.
I see this line
i, j = numba.cuda.grid(2)
Does this not meet the kernel launch configuration? I’m lacking a fundamental understanding.
I can edit the code to make it execute:
def smooth(x):
    out = cupy.empty_like(x)
    smooth_gpu[2, 32](x, out)
    return out
However, I just picked those values for blockspergrid and threadsperblock. I don’t understand how they should be derived.
In short, the error message is clear “kernel_function[blockspergrid, threadsperblock]”, but I’m just not sure what values I should be providing and why i, j = numba.cuda.grid(2) is insufficient.
The line smooth_gpu[2, 32](x, out) is what gave the kernel launch configuration. i, j = numba.cuda.grid(2) doesn’t set the configuration; it only reads each thread’s position within the grid that the launch configuration defined. The choice of values - the number of blocks (2 in your example) and the number of threads in each block (32 in your example) - is a bit of a complicated topic, but roughly:
You want a multiple of 32 threads in a block (32 threads execute together as a warp), and you need enough of them that there are several warps per block, so that when one warp stalls on a memory access it can be swapped out for another warp that can continue using the execution units. So generally I pick a value like 256 as a starting point.
You need to have enough blocks so that the grid covers all your input / output data (because you index into the data using cuda.grid() generally). The number of blocks also ought to be more than 2 times the number of streaming multiprocessors (SMs) on your device. You can launch a lot of blocks, but if you have really large input data it can be better to limit the number of blocks you launch, and have a loop in the kernel to ensure all input is handled - see the add_kernel in this notebook for an example of how to use a strided loop to do this.
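For the smooth kernel above, which indexes with cuda.grid(2), one way to put those two points into practice is to fix a 2-D block of 16 × 16 threads (a multiple of 32, 256 in total) and then compute just enough blocks to cover the array with ceiling division. This is only a sketch of the host-side arithmetic; the helper name and the example shape are mine, and smooth_gpu is the kernel from the question:

```python
import math

def launch_config_2d(shape, threads_per_block=(16, 16)):
    """Compute a 2-D launch configuration covering an array of `shape`.

    Ceiling division guarantees the grid covers every element; the
    kernel must still guard against out-of-bounds indices.
    """
    blocks_per_grid = tuple(
        math.ceil(n / t) for n, t in zip(shape, threads_per_block)
    )
    return blocks_per_grid, threads_per_block

# For a 100 x 1000 array:
blocks, threads = launch_config_2d((100, 1000))
# blocks == (7, 63), threads == (16, 16)
# The kernel would then be launched as:
#   smooth_gpu[blocks, threads](x, out)
```

Because the grid is rounded up, some threads fall past the edge of the array, so the kernel needs the usual bounds check (e.g. if i < out.shape[0] and j < out.shape[1]:) before writing.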
Table 15 in the Compute Capabilities section of the CUDA programming guide gives the limits for various configuration parameters, like the size of the grid and thread blocks for different devices.
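As a quick sanity check against those limits, the headline number from that table is 1024 threads per block on every recent device (the product of the block dimensions must not exceed it). A small helper following the advice above - this uses the commonly documented 1024 limit as an assumption rather than querying the device; consult the table (or numba.cuda.get_current_device()) for your actual hardware:

```python
from functools import reduce
from operator import mul

# Assumed limit: 1024 threads per block holds for every compute
# capability since 2.0, but check Table 15 for your device.
MAX_THREADS_PER_BLOCK = 1024

def valid_block(threads_per_block):
    """Return True if a block shape stays within the assumed thread-count
    limit and its total is a multiple of the 32-thread warp size."""
    if isinstance(threads_per_block, int):
        threads_per_block = (threads_per_block,)
    total = reduce(mul, threads_per_block, 1)
    return total <= MAX_THREADS_PER_BLOCK and total % 32 == 0
```

For example, valid_block((16, 16)) accepts a 256-thread block (8 warps), while valid_block((32, 64)) rejects 2048 threads as over the limit.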