Hi all,
I’d like to use Numba CUDA to write fast CUDA kernels, because of its nice dev experience.
However, I find that Numba CUDA is consistently slower than CUDA C. Comparing the PTX, this seems to be because Numba CUDA adds memory safety checks.
My question: is it possible to get CUDA C speed (by disabling the memory safety checks), or is Numba CUDA not meant as an alternative to CUDA C?
For example, take this toy kernel:
from numba import cuda, float32

@cuda.jit
def mul2(x):
    x[cuda.threadIdx.x] *= 2.0

signature = (float32[:],)
# compile_ptx_for_current_device returns (ptx, return_type)
ptx, resty = cuda.compile_ptx_for_current_device(mul2, signature)
Comparing its PTX with the PTX of the equivalent CUDA C code, the section “compute memory address” is way larger in the Numba version.
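To put a rough number on "way larger", one option is to count opcodes in each PTX listing instead of eyeballing them. This is only a sketch: `count_ptx_opcodes` is a hypothetical helper I made up for illustration (not part of Numba), it assumes the PTX is available as a plain string (such as `ptx` from the snippet above), and its line filtering is crude, so treat the counts as approximate.

```python
from collections import Counter


def count_ptx_opcodes(ptx_text):
    """Roughly count PTX opcodes, e.g. 'ld.param.u64' is counted as 'ld'.

    Hypothetical helper for illustration; not a real PTX parser.
    """
    counts = Counter()
    for raw_line in ptx_text.splitlines():
        # Drop trailing '//' comments and surrounding whitespace.
        line = raw_line.split("//", 1)[0].strip()
        # Skip blanks, directives (.version, .reg, ...), braces and labels.
        if not line or line.startswith((".", "{", "}", "(", ")")):
            continue
        if line.endswith(":"):
            continue
        tokens = line.split()
        # Skip a predicate guard such as '@%p1' in front of the opcode.
        if tokens[0].startswith("@"):
            tokens = tokens[1:]
        if not tokens:
            continue
        # Keep only the base opcode: 'mul.wide.s32' -> 'mul', 'ret;' -> 'ret'.
        opcode = tokens[0].split(".", 1)[0].rstrip(";")
        counts[opcode] += 1
    return counts


# Usage (assumes `ptx` holds the Numba PTX and `ptx_c` the CUDA C PTX):
# print(count_ptx_opcodes(ptx).most_common())
# print(count_ptx_opcodes(ptx_c).most_common())
```

Running it on both listings and comparing the totals for the integer opcodes would show how much of the extra code is address arithmetic.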
Thanks!