Can't use @cuda.reduce decorator inside Kernel

I have a sum reduction function:

```python
@cuda.reduce
def sum_reduce(a, b):
    return a + b
```

which I want to use inside a kernel:

```python
@cuda.jit
def IC(res):
    res[0] = sum_reduce(res)

```

However, it raises an error:

```
Failed in cuda mode pipeline (step: nopython frontend)
Untyped global name 'sum_reduce': Cannot determine Numba type of <class 'numba.cuda.kernels.reduction.Reduce'>
```

Is there a way to use @cuda.reduce-decorated functions inside a kernel?

Unfortunately you can't use a @cuda.reduce function inside a kernel, because the granularity of parallelism differs inside a kernel and outside of it: a @cuda.reduce function uses an entire grid to compute the reduction, whereas inside a kernel you program from the perspective of a single thread.
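For reference, this is roughly how a @cuda.reduce function is meant to be used: called from the host, where it launches its own grid. A minimal sketch:

```python
import numpy as np
from numba import cuda

@cuda.reduce
def sum_reduce(a, b):
    return a + b

a = np.arange(1024, dtype=np.float64)
total = sum_reduce(a)   # called from the host; launches a grid internally
assert total == a.sum()
```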

You would need to implement your own in-kernel reduction. A strategy that gives reasonable performance is to reduce within a warp, then within a block, then across the whole grid. This gist computes multiple reductions using warp-aggregated atomics; it's not exactly what you need, but it might be a useful reference.
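To make the block-level part of that strategy concrete, here is a minimal sketch (not the gist itself) of a shared-memory tree reduction per block, with each block's partial sum combined into the result via an atomic add. `TPB` and `block_sum` are made-up names for the example, and `result` must be zero-initialized before launch:

```python
import numpy as np
from numba import cuda, float64

TPB = 128  # threads per block; must match the launch configuration

@cuda.jit
def block_sum(arr, result):
    # Each block reduces its slice of arr in shared memory, then one
    # thread per block atomically adds the partial sum into result[0].
    sm = cuda.shared.array(TPB, dtype=float64)
    tid = cuda.threadIdx.x
    i = cuda.grid(1)

    sm[tid] = arr[i] if i < arr.size else 0.0
    cuda.syncthreads()

    # Tree reduction within the block.
    s = TPB // 2
    while s > 0:
        if tid < s:
            sm[tid] += sm[tid + s]
        cuda.syncthreads()
        s //= 2

    if tid == 0:
        cuda.atomic.add(result, 0, sm[0])

a = np.arange(1000, dtype=np.float64)
result = np.zeros(1, dtype=np.float64)   # must start at zero
blocks = (a.size + TPB - 1) // TPB
block_sum[blocks, TPB](a, result)
```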

If you need all threads to be able to synchronize and use the result of the reduction, you will need to use a grid group to synchronize once the reduction is complete. See the Cooperative Groups section of the Numba documentation.
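A minimal sketch of what that looks like, assuming a device and grid size that support cooperative launch (Numba launches the kernel cooperatively when `this_grid()` is used, and the grid must fit within the cooperative launch limits). The kernel name and inputs are hypothetical:

```python
import numpy as np
from numba import cuda

@cuda.jit
def normalize_by_sum(arr, total):
    # Every thread contributes to the sum, then uses the final result.
    i = cuda.grid(1)
    if i < arr.size:
        cuda.atomic.add(total, 0, arr[i])

    # Grid-wide barrier: no thread proceeds until every block has
    # finished contributing to total[0].
    g = cuda.cg.this_grid()
    g.sync()

    if i < arr.size:
        arr[i] /= total[0]

arr = np.arange(1.0, 9.0)            # sums to 36
total = np.zeros(1)                  # must start at zero
normalize_by_sum[2, 4](arr, total)
```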


I forgot to mention that better support for in-kernel reductions is a work in progress. Unfortunately I don't have anything I can share right now, but we would like it to be easy to write in-kernel reductions in the future.
