I have an experimental (quickly hacked together) branch with support for an option like the -dlcm=cg option to ptxas (the default load cache modifier; see "PTX Compiler APIs" in the CUDA Toolkit Documentation). This is the branch: cuda-linker-options in gmarkall/numba (https://github.com/gmarkall/numba/tree/cuda-linker-options)
You can use the dlcm kwarg to the @cuda.jit decorator with it, like so:
# Use with https://github.com/gmarkall/numba/tree/cuda-linker-options
from numba import cuda, float32, void
def axpy(r, a, x, y):
    # Grid-stride loop: each thread handles elements start, start + step, ...
    start = cuda.grid(1)
    step = cuda.gridsize(1)
    for i in range(start, len(r), step):
        r[i] = a * x[i] + y[i]
sig = void(float32[::1], float32, float32[::1], float32[::1])
with_dlcm_cg = cuda.jit(sig, dlcm='cg')(axpy)
with_dlcm_ca = cuda.jit(sig, dlcm='ca')(axpy)
with_dlcm_xx = cuda.jit(sig)(axpy)
with open('dlcm_cg.sass', 'w') as f:
    f.write(with_dlcm_cg.inspect_sass()[sig.args])
with open('dlcm_ca.sass', 'w') as f:
    f.write(with_dlcm_ca.inspect_sass()[sig.args])
with open('dlcm_xx.sass', 'w') as f:
    f.write(with_dlcm_xx.inspect_sass()[sig.args])
Then you can inspect the differences between the SASS, e.g.:
- /*0c60*/ LDG.E R9, [R8.64] ;
- /*0c70*/ LDG.E R16, [R16.64] ;
+ /*0c60*/ LDG.E.STRONG.SM R9, [R8.64] ;
+ /*0c70*/ LDG.E.STRONG.SM R16, [R16.64] ;
(or even better, measure performance!)
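For the SASS comparison step, here is a minimal sketch using only the standard library's difflib. The helper name sass_diff is made up for illustration (it is not part of Numba or the branch), and the commented usage assumes the .sass files written by the script above are in the current directory.

```python
import difflib


def sass_diff(a_text, b_text, a_name="a.sass", b_name="b.sass"):
    """Return a unified diff, as a list of lines, between two SASS listings.

    sass_diff is a hypothetical helper for illustration only.
    """
    return list(
        difflib.unified_diff(
            a_text.splitlines(),
            b_text.splitlines(),
            fromfile=a_name,
            tofile=b_name,
            lineterm="",
            n=0,  # no context lines: show only the instructions that differ
        )
    )


# Example usage (assumes the files from the script above exist):
#
# with open("dlcm_ca.sass") as f_ca, open("dlcm_cg.sass") as f_cg:
#     for line in sass_diff(f_ca.read(), f_cg.read(),
#                           "dlcm_ca.sass", "dlcm_cg.sass"):
#         print(line)
```

Setting n=0 drops the context lines, so only the instructions that actually changed (such as the LDG variants) show up in the output.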
Questions:
- Is this useful for others / should it be made into a proper PR?
- Would other ptxas / linker options be helpful? E.g. dscm?