Numba cuda slower than cuda c -- turn off memory safety checks?

Hi all,

I’d like to use numba cuda to write fast cuda kernels, because of its nice dev experience.

However, I find that numba-cuda is consistently slower than cuda-c. Comparing the PTX, this seems to be because numba-cuda adds memory safety checks.

My question: Is it possible to get cuda-c speed (by disabling memory safety checks), or is numba-cuda not meant as an alternative to cuda-c?


For example, for this toy kernel

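from numba import cuda, float32

@cuda.jit()
def mul2(x):
    # each thread doubles one element
    x[cuda.threadIdx.x] *= 2.0

signature = (float32[:],)
# compile_ptx_for_current_device takes the plain Python function
# and returns a (ptx, return_type) tuple
ptx, _ = cuda.compile_ptx_for_current_device(mul2.py_func, signature)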

Here’s a comparison of its PTX with the PTX of the equivalent cuda-c code:

In the numba-cuda PTX, the section “compute memory address” is much larger.
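
One rough way to put a number on the difference is to tally the opcodes in each PTX dump and compare the totals. Below is a minimal sketch in plain Python. The names count_ptx_opcodes and SAMPLE_PTX are made up for illustration, and the sample listing is hand-written, not actual compiler output from either toolchain.

```python
import re
from collections import Counter

# Optional predicate guard in front of an instruction, e.g. "@%p1" or "@!%p1".
_GUARD = re.compile(r"^@!?%\w+\s+")


def count_ptx_opcodes(ptx: str) -> Counter:
    """Tally base opcodes (e.g. 'ld', 'mul', 'cvta') in a PTX listing."""
    counts = Counter()
    for raw in ptx.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing comments
        # Skip blank lines, directives (.version, .reg, ...), braces and labels.
        if not line or line[0] in ".{}(" or line.endswith(":"):
            continue
        line = _GUARD.sub("", line)
        if not line.endswith(";"):
            continue  # not a complete instruction line
        mnemonic = line.split(None, 1)[0].rstrip(";")
        counts[mnemonic.split(".", 1)[0]] += 1  # "ld.global.f32" -> "ld"
    return counts


# Hand-written illustrative listing (an assumption, not real compiler output).
SAMPLE_PTX = """
.visible .entry mul2(
    .param .u64 mul2_param_0
)
{
    .reg .f32 %f<3>;
    ld.param.u64 %rd1, [mul2_param_0];
    cvta.to.global.u64 %rd2, %rd1;
    mov.u32 %r1, %tid.x;
    mul.wide.u32 %rd3, %r1, 4;
    add.s64 %rd4, %rd2, %rd3;
    ld.global.f32 %f1, [%rd4];
    add.f32 %f2, %f1, %f1;
    st.global.f32 [%rd4], %f2;
    ret;
}
"""

sample_counts = count_ptx_opcodes(SAMPLE_PTX)
print(sum(sample_counts.values()), dict(sample_counts))
```

Running the same function on the numba-cuda PTX string and on the cuda-c PTX file, then diffing the two counters, should show which instruction types account for the extra address computation.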


Thanks!