How do I profile code that includes Numba kernels decorated with @cuda.jit? Is there a good way to apply the Nsight Systems and Nsight Compute tools that replace nvprof?

Does anyone have up-to-date advice on how to profile numba.cuda kernels? I used to be able to get some basic info using nvprof, which has since been replaced by Nsight Systems and Nsight Compute. Can anyone offer advice (or, better yet, an example) of how to use the new tools?

Alternatively, is there something new incorporated into Numba that provides kernel profiling info? Any suggestions would be appreciated.

This Deep Learning Institute workshop from the recent GTC23 explains how to use Nsight Compute with Numba, including correlating Python source code with the profiling info:

The Numba part starts at about 1h04 in.
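In short, the key step the workshop demonstrates is compiling the kernel with line information so Nsight Compute can map metrics back to the Python source. A minimal sketch, assuming a recent Numba version where `@cuda.jit` accepts `lineinfo=True` (the script name and kernel are just illustrative):

```python
# profile_me.py - illustrative example for profiling with Nsight Compute
import numpy as np
from numba import cuda

# lineinfo=True embeds source-line information in the compiled kernel,
# which lets Nsight Compute correlate metrics with the Python source
@cuda.jit(lineinfo=True)
def axpy(r, a, x, y):
    i = cuda.grid(1)
    if i < r.size:
        r[i] = a * x[i] + y[i]

n = 1_000_000
x = cuda.to_device(np.random.rand(n))
y = cuda.to_device(np.random.rand(n))
r = cuda.device_array(n)

threads = 256
blocks = (n + threads - 1) // threads
axpy[blocks, threads](r, 2.0, x, y)
cuda.synchronize()
```

Then profile it from the command line, e.g.:

```
ncu --set full -o axpy-report python profile_me.py
```

Opening the resulting axpy-report.ncu-rep in the Nsight Compute UI should show the Python source correlation on the Source page.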

Just what I was looking for. Thank you for the prompt and very helpful response. (I guess I should be looking to expand my DLI certification…)

I appreciated the previous help on accessing Nsight Compute. Now I am trying to analyze the impact of using multiple streams to overlap compute and data transfer, so I am really looking for something like the timelines produced by Nsight Systems.
Is there a good way to get that for Python/Numba?
Does the early part of the same DLI presentation that discusses NVTX provide a way to access such information?
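To make the question concrete, this is the kind of pattern I want to see on a timeline - a minimal sketch using pinned host buffers and one stream per chunk (the kernel and sizes are just illustrative):

```python
# overlap.py - illustrative multi-stream copy/compute overlap
import numpy as np
from numba import cuda

@cuda.jit
def scale(out, x):
    i = cuda.grid(1)
    if i < x.size:
        out[i] = 2.0 * x[i]

n = 1 << 20
chunks = 4
streams = [cuda.stream() for _ in range(chunks)]

# pinned host memory is required for truly asynchronous copies
h_in = cuda.pinned_array(n, dtype=np.float64)
h_out = cuda.pinned_array(n, dtype=np.float64)
h_in[:] = np.random.rand(n)

step = n // chunks
threads = 256
blocks = (step + threads - 1) // threads
for c, s in zip(range(chunks), streams):
    lo, hi = c * step, (c + 1) * step
    d_in = cuda.to_device(h_in[lo:hi], stream=s)   # async H2D on stream s
    d_out = cuda.device_array(step, stream=s)
    scale[blocks, threads, s](d_out, d_in)         # kernel on stream s
    d_out.copy_to_host(h_out[lo:hi], stream=s)     # async D2H on stream s
cuda.synchronize()
```

I assume the invocation would be something like:

```
nsys profile -o overlap-report python overlap.py
```

and then opening overlap-report.nsys-rep in the Nsight Systems GUI to inspect the overlap, but I'm not sure what else is needed on the Python side to make the timeline informative.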

I’m not sure, as I haven’t watched the earlier part of the presentation - I was only aware of the later part about Nsight Compute because I provided some assistance in putting it together.

NVTX can be used from Python with the NVTX Python wrapper - perhaps this is a useful starting point? References:
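As a rough sketch of what the nvtx package looks like in use (the labels and function here are just placeholders) - ranges annotated this way show up as named spans on the NVTX row of the Nsight Systems timeline:

```python
import time
import nvtx

# nvtx.annotate works both as a decorator and as a context manager
@nvtx.annotate("compute step", color="green")
def compute():
    time.sleep(0.1)  # stand-in for real work / kernel launches

with nvtx.annotate("whole run", color="blue"):
    for _ in range(3):
        compute()
```

Capture it with NVTX tracing enabled, e.g.:

```
nsys profile --trace=nvtx,cuda -o nvtx-report python my_script.py
```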