Specify Kernel Launch Configuration

I’m running the following code on a Dask CUDA cluster: pipeline-cpus.ipynb · GitHub

Numba is throwing the following exception:

ValueError: 
Kernel launch configuration was not specified. Use the syntax:

kernel_function[blockspergrid, threadsperblock](arg0, arg1, ..., argn)

See https://numba.pydata.org/numba-doc/latest/cuda/kernels.html#kernel-invocation for help.

I’ve read the document but I’m struggling to understand the change to the code I need to make.

I see this line

i, j = numba.cuda.grid(2)

Does this not meet the kernel launch configuration? I’m lacking a fundamental understanding.

I can edit the code to make it execute:

def smooth(x):
    out = cupy.empty_like(x)
    smooth_gpu[2, 32](x, out)  # [blockspergrid, threadsperblock]
    return out

However, I just picked those values for blockspergrid and threadsperblock. I don’t understand how they should be derived.

In short, the error message is clear about the syntax (“kernel_function[blockspergrid, threadsperblock]”), but I’m just not sure what values I should be providing, or why i, j = numba.cuda.grid(2) is insufficient.

Your edit to the line:

    smooth_gpu[2, 32](x, out)

is what gave the kernel launch configuration. The choice of values - the number of blocks (2 in your example) and the number of threads in each block (32 in your example) - is a bit of a complicated topic, but roughly:

  • You want to have a multiple of 32 threads in a block (32 threads execute together as a warp), and you need enough of them that there are several warps per block, so that when one warp stalls on a memory access it can be swapped out for another warp that can keep the execution units busy. So generally I pick a value like 256 as a starting point.
  • You need enough blocks that the grid covers all of your input / output data (because you generally index into the data using cuda.grid()). The number of blocks also ought to be more than 2 times the number of streaming multiprocessors (SMs) on your device, so that every SM has work to do. You can launch a lot of blocks, but if you have really large input data it can be better to limit the number of blocks you launch and use a loop in the kernel to ensure all the input is handled - see the add_kernel in this notebook for an example of how to use a strided loop to do this, and the second sketch after this list.
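To make the first point concrete, here is a minimal sketch of how the two values are usually derived from the shape of the data rather than hard-coded. The kernel body is a hypothetical 5-point average standing in for your smooth_gpu (I don’t know what the notebook’s kernel actually computes), but the ceiling-division pattern for blockspergrid is the standard one:

import math

import numpy as np
from numba import cuda

@cuda.jit
def smooth_gpu(x, out):
    # cuda.grid(2) converts the launch configuration into this thread's
    # global (i, j) index - it reads the configuration, it does not
    # provide one.
    i, j = cuda.grid(2)
    # Guard: the grid is rounded up, so some threads fall outside the data.
    if 1 <= i < x.shape[0] - 1 and 1 <= j < x.shape[1] - 1:
        out[i, j] = (x[i - 1, j] + x[i + 1, j] + x[i, j - 1]
                     + x[i, j + 1] + x[i, j]) / 5

x = np.random.rand(1024, 1024).astype(np.float32)
out = np.zeros_like(x)

threadsperblock = (16, 16)  # 256 threads per block, a multiple of 32
blockspergrid = (math.ceil(x.shape[0] / threadsperblock[0]),
                 math.ceil(x.shape[1] / threadsperblock[1]))
smooth_gpu[blockspergrid, threadsperblock](x, out)

The bounds check inside the kernel matters because the ceiling division rounds the grid up, so the grid usually covers slightly more elements than the array contains.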
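And here is a sketch of the strided-loop idea from the second point, along the lines of the add_kernel in the notebook linked above: a fixed-size grid where each thread loops over the input with a stride equal to the total number of threads, so the same launch configuration handles input of any length.

from numba import cuda
import numpy as np

@cuda.jit
def add_kernel(x, y, out):
    start = cuda.grid(1)       # this thread's global index
    stride = cuda.gridsize(1)  # total number of threads launched
    # Stride through the data so a modest, fixed number of blocks
    # still covers arbitrarily large input.
    for i in range(start, x.shape[0], stride):
        out[i] = x[i] + y[i]

n = 1_000_000
x = np.arange(n, dtype=np.float32)
y = 2 * x
out = np.zeros_like(x)

# 80 blocks of 256 threads is just an illustrative choice - pick the
# block count relative to your device's SM count, not the data size.
add_kernel[80, 256](x, y, out)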

This Stack Overflow post gives some pointers about optimizing the launch configuration.

Table 15 in the Compute Capabilities section of the CUDA programming guide gives the limits for various configuration parameters, like the size of the grid and thread blocks for different devices.
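You can also query those limits (and the SM count mentioned above) at runtime: Numba exposes the CUDA device attributes as uppercase attributes on the current device object. A quick sketch:

from numba import cuda

dev = cuda.get_current_device()
# Device attribute names follow Numba's uppercase CUDA-attribute convention.
print("Name:", dev.name)
print("Compute capability:", dev.compute_capability)
print("SM count:", dev.MULTIPROCESSOR_COUNT)
print("Warp size:", dev.WARP_SIZE)
print("Max threads per block:", dev.MAX_THREADS_PER_BLOCK)
print("Max grid dims:", dev.MAX_GRID_DIM_X, dev.MAX_GRID_DIM_Y, dev.MAX_GRID_DIM_Z)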