CUDA: Experimental branch with support for -dlcm=cg

I have an experimental (quickly hacked together) branch that adds support for an option equivalent to the -dlcm=cg option of ptxas (the default load cache modifier; see "PTX Compiler APIs" in the CUDA Toolkit Documentation). The branch is cuda-linker-options in gmarkall/numba on GitHub.

You can use the dlcm kwarg to the @cuda.jit decorator with it, like so:

# Use with

from numba import cuda, float32, void

def axpy(r, a, x, y):
    start = cuda.grid(1)
    step = cuda.gridsize(1)

    for i in range(start, len(r), step):
        r[i] = a * x[i] + y[i]

sig = void(float32[::1], float32, float32[::1], float32[::1])
with_dlcm_cg = cuda.jit(sig, dlcm='cg')(axpy)
with_dlcm_ca = cuda.jit(sig, dlcm='ca')(axpy)
with_dlcm_xx = cuda.jit(sig)(axpy)

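# Dump the SASS of each variant (inspect_sass() needs nvdisasm on the PATH)
with open('dlcm_cg.sass', 'w') as f:
    f.write(with_dlcm_cg.inspect_sass())

with open('dlcm_ca.sass', 'w') as f:
    f.write(with_dlcm_ca.inspect_sass())

with open('dlcm_xx.sass', 'w') as f:
    f.write(with_dlcm_xx.inspect_sass())
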
Then you can inspect the differences between the SASS dumps, e.g.:

-        /*0c60*/                   LDG.E R9, [R8.64] ;
-        /*0c70*/                   LDG.E R16, [R16.64] ;
+        /*0c60*/                   LDG.E.STRONG.SM R9, [R8.64] ;
+        /*0c70*/                   LDG.E.STRONG.SM R16, [R16.64] ;
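
If you don't want to eyeball the dumps, a diff like the one above can be pulled out with the standard library. This is only a rough sketch: changed_lines is a made-up helper name, not part of Numba or the branch, and the two sample lines are copied from the diff above.

```python
import difflib

def changed_lines(before_text, after_text):
    """Return only the added/removed lines between two SASS listings."""
    diff = difflib.unified_diff(
        before_text.splitlines(),
        after_text.splitlines(),
        lineterm="",
        n=0,  # no context lines, only the changes themselves
    )
    # Drop the '---'/'+++' file headers and the '@@' hunk markers
    return [
        line for line in diff
        if line.startswith(("+", "-"))
        and not line.startswith(("+++", "---"))
    ]

# Tiny in-memory demo using one instruction from the diff above
before = "/*0c60*/ LDG.E R9, [R8.64] ;"
after = "/*0c60*/ LDG.E.STRONG.SM R9, [R8.64] ;"
for line in changed_lines(before, after):
    print(line)
```

For the real files you would pass in the contents of two of the dumps instead, e.g. open('dlcm_ca.sass').read() and open('dlcm_cg.sass').read().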

(or even better, measure performance!)
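
For the performance side, something along these lines might be a starting point. It is a sketch under assumptions: time_launch is a hypothetical helper, it uses host wall-clock time rather than CUDA events, and the launch configuration in the usage comment (blocks, threads) is whatever you would normally pick. The helper takes plain callables so it does not depend on the branch.

```python
import statistics
import time

def time_launch(launch, synchronize, repeats=20, warmup=3):
    """Median wall-clock seconds for launch() followed by synchronize()."""
    # Warm-up launches so one-off costs are not part of the measurement
    for _ in range(warmup):
        launch()
    synchronize()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        launch()
        synchronize()  # kernel launches are asynchronous, so wait for completion
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Intended use with the kernels above (assumes device arrays r, x, y,
# a scalar a, and a launch configuration blocks, threads already exist):
#
#   from numba import cuda
#   t_cg = time_launch(lambda: with_dlcm_cg[blocks, threads](r, a, x, y),
#                      cuda.synchronize)
#   t_ca = time_launch(lambda: with_dlcm_ca[blocks, threads](r, a, x, y),
#                      cuda.synchronize)
```

Using the median rather than the mean keeps a single slow run from skewing the comparison between the cg, ca and default variants.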


  • Is this useful for others / should it be made into a proper PR?
  • Would other ptxas / linker options be helpful? E.g. dscm?