I am writing CUDA kernels and would like to do some ad-hoc timing of various subroutines without having to go to the trouble of setting up profiling. Is it possible to get the current time, or the elapsed time since a kernel started executing, from within a kernel?
There is a %clock64 special register in PTX, which you can read by defining an intrinsic:
from numba import types
from numba.cuda.extending import intrinsic
from llvmlite import ir

@intrinsic
def cuda_clock64(typingctx):
    # Typing: takes no arguments and returns an unsigned 64-bit integer.
    sig = types.uint64()

    def codegen(context, builder, sig, args):
        # Inline PTX that copies the %clock64 special register into the result.
        function_type = ir.FunctionType(ir.IntType(64), [])
        instruction = "mov.u64 $0, %clock64;"
        clock64 = ir.InlineAsm(function_type, instruction, "=l",
                               side_effect=True)
        return builder.call(clock64, [])

    return sig, codegen
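Then call cuda_clock64() in your kernel to read the cycle counter. Reading it at two points and subtracting the first value from the second gives the elapsed clock cycles between them.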
This should probably be upstreamed into Numba as a convenience, but I never quite found the time to do it.
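Since the question asks about elapsed time rather than cycles, here is a rough sketch of turning two counter readings into milliseconds on the host. The helper name cycles_to_ms and its parameters are made up for illustration and are not part of Numba. It assumes you know the SM clock rate in kHz (I believe cuda.get_current_device().CLOCK_RATE reports it in that unit, but check on your setup) and that the clock stays fixed during the measurement, which may not hold with dynamic frequency scaling, so treat the result as approximate.

```python
def cycles_to_ms(start_cycles, stop_cycles, clock_rate_khz):
    """Convert two %clock64 readings into approximate milliseconds.

    clock_rate_khz is the SM clock in kHz, assumed constant between
    the two readings (an assumption, not something the GPU guarantees).
    """
    if clock_rate_khz <= 0:
        raise ValueError("clock_rate_khz must be positive")
    # The counter is 64 bits wide and wraps silently; masking keeps the
    # difference correct across a single wrap-around.
    elapsed = (stop_cycles - start_cycles) & 0xFFFFFFFFFFFFFFFF
    # kHz means cycles per millisecond, so cycles / kHz yields milliseconds.
    return elapsed / clock_rate_khz
```

In practice you would store the two cuda_clock64() readings from the kernel into an output array, copy it back to the host, and pass the values to this helper. Only compare readings taken by the same thread, since the counter is not guaranteed to be synchronized across multiprocessors.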