CUDA: Experimental branch with support for -dlcm=cg

I have an experimental (quickly hacked together) branch that adds support for an option like ptxas's -dlcm=cg, which sets the default load cache modifier (see PTX Compiler APIs :: CUDA Toolkit Documentation). The branch is gmarkall/numba at cuda-linker-options on GitHub.

With that branch, you can pass the dlcm kwarg to the @cuda.jit decorator, like so:

# Use with https://github.com/gmarkall/numba/tree/cuda-linker-options

from numba import cuda, float32, void


def axpy(r, a, x, y):
    start = cuda.grid(1)
    step = cuda.gridsize(1)

    for i in range(start, len(r), step):
        r[i] = a * x[i] + y[i]


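sig = void(float32[::1], float32, float32[::1], float32[::1])
# 'cg' caches loads at the global level (L2, bypassing L1);
# 'ca' caches loads at all levels (L1 and L2).
with_dlcm_cg = cuda.jit(sig, dlcm='cg')(axpy)
with_dlcm_ca = cuda.jit(sig, dlcm='ca')(axpy)
# No dlcm kwarg: whatever the default load cache modifier is.
with_dlcm_xx = cuda.jit(sig)(axpy)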

with open('dlcm_cg.sass', 'w') as f:
    f.write(with_dlcm_cg.inspect_sass()[sig.args])

with open('dlcm_ca.sass', 'w') as f:
    f.write(with_dlcm_ca.inspect_sass()[sig.args])

with open('dlcm_xx.sass', 'w') as f:
    f.write(with_dlcm_xx.inspect_sass()[sig.args])

Then you can inspect the differences between the SASS, e.g.:

-        /*0c60*/                   LDG.E R9, [R8.64] ;
-        /*0c70*/                   LDG.E R16, [R16.64] ;
+        /*0c60*/                   LDG.E.STRONG.SM R9, [R8.64] ;
+        /*0c70*/                   LDG.E.STRONG.SM R16, [R16.64] ;
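
If you'd rather not eyeball the dumped files, something like the following sketch pulls out only the lines that differ between two SASS listings. It uses only the standard library; the helper name changed_sass_lines is made up for illustration, and the file names are the ones written by the script above.

```python
import difflib
from itertools import islice
from pathlib import Path


def changed_sass_lines(old_sass, new_sass):
    """Return only the removed ('-') and added ('+') lines between two listings."""
    diff = difflib.unified_diff(
        old_sass.splitlines(), new_sass.splitlines(), lineterm="", n=0
    )
    # The first two diff lines are the '---' / '+++' file headers; skip them,
    # then drop the '@@' hunk markers by keeping only '+' / '-' lines.
    return [line for line in islice(diff, 2, None) if line.startswith(("+", "-"))]


if __name__ == "__main__":
    ca, cg = Path("dlcm_ca.sass"), Path("dlcm_cg.sass")
    # Only runs if the SASS files from the script above are present.
    if ca.exists() and cg.exists():
        for line in changed_sass_lines(ca.read_text(), cg.read_text()):
            print(line)
```

Swapping in dlcm_xx.sass for either path shows which of the two modifiers the default build matches.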

(or even better, measure performance!)
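
For the measuring part, a rough sketch of a timing helper is below. It is not part of the branch: median_launch_time, launch and synchronize are hypothetical names, and it measures host-side wall-clock time (launch overhead included), so treat the numbers as a coarse comparison rather than a precise kernel timing.

```python
import time
from statistics import median


def median_launch_time(launch, synchronize, repeats=20, warmup=3):
    """Median wall-clock seconds for launch() followed by synchronize()."""
    # Warm-up launches so one-off costs (e.g. module loading) are not timed.
    for _ in range(warmup):
        launch()
    synchronize()

    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        launch()
        synchronize()  # kernel launches are asynchronous; wait for completion
        times.append(time.perf_counter() - start)
    return median(times)
```

With the kernels above it might be called as median_launch_time(lambda: with_dlcm_cg[blocks, threads](r, a, x, y), cuda.synchronize), and likewise for with_dlcm_ca and with_dlcm_xx, where blocks, threads and the device arrays are whatever launch configuration and data you want to test with.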

Questions:

  • Is this useful for others / should it be made into a proper PR?
  • Would other ptxas / linker options be helpful? E.g. dscm?