CUDA platform - get current (or elapsed) time

I am writing CUDA kernels and would like to do some ad-hoc timing of various subroutines without having to go to the trouble of setting up profiling. Is it possible, from within a kernel, to get the current time or the elapsed time since the kernel started executing?

There is a %clock64 special register in PTX, which you can read from a Numba kernel by wrapping it in an intrinsic:

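from numba import types
from numba.cuda.extending import intrinsic
from llvmlite import ir

@intrinsic
def cuda_clock64(typingctx):
    # Typing: takes no arguments and returns an unsigned 64-bit integer.
    sig = types.uint64()

    def codegen(context, builder, sig, args):
        # Inline PTX that copies the %clock64 special register into a
        # 64-bit output register (the "=l" constraint).
        function_type = ir.FunctionType(ir.IntType(64), [])
        instruction = "mov.u64 $0, %clock64;"
        # side_effect=True keeps LLVM from merging or dropping repeated reads.
        clock64 = ir.InlineAsm(function_type, instruction, "=l",
                               side_effect=True)
        return builder.call(clock64, [])

    return sig, codegen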

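Then call cuda_clock64() in your kernel to read the cycle counter. The value is a raw count of clock cycles, not a time, so read it before and after the code you want to measure and subtract the two readings to get the elapsed cycles.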

This should probably be upstreamed into Numba as a convenience, but I never quite found the time to do it.
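
If you want a time rather than a cycle count, you can convert the difference between two cuda_clock64() readings on the host. Below is a minimal sketch of that arithmetic in plain Python. The helper name elapsed_ms and its parameters are hypothetical, not part of Numba, and the clock rate is something you have to supply yourself. Since the SM clock can change while a kernel runs, treat the result as an estimate.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF  # %clock64 is a 64-bit unsigned counter


def elapsed_ms(start_cycles, end_cycles, clock_khz):
    """Estimate milliseconds between two cycle-counter readings.

    elapsed_ms is a hypothetical helper, not a Numba API.
    clock_khz is the assumed SM clock rate in kHz, i.e. cycles per
    millisecond; the caller must provide it.
    """
    if clock_khz <= 0:
        raise ValueError("clock_khz must be positive")
    # Masking to 64 bits keeps the difference correct even if the
    # counter wrapped around between the two reads.
    cycles = (end_cycles - start_cycles) & MASK_64
    return cycles / clock_khz


# Made-up readings, as if copied back from a kernel that stored
# cuda_clock64() before and after the code under test.
print(elapsed_ms(1_000, 1_501_000, 1_500_000))  # 1.0 (ms at 1.5 GHz)
```

For clock_khz, I would assume the value reported by cuda.get_current_device().CLOCK_RATE is suitable, but check that attribute against your Numba version before relying on it.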