Blog: 28000x speedup with Numba.CUDA

RagnarGrootKoerkamp · April 13, 2021, 1:24pm

Hi!

Just wanted to say thanks for all the help here! Numba.CUDA managed to speed up my code a lot.

Since it wasn’t always easy to find all the information I needed, I ended up writing a blog post about my experience with Numba.CUDA: 28000x speedup with Numba.CUDA | Curious Coding

gmarkall · April 23, 2021, 2:58pm

This is a fantastic and detailed writeup, and a great example for anyone looking to understand the whole trajectory from Python code to an optimized CUDA implementation - many thanks for sharing it!

I have a couple of thoughts / comments on specific sections, which I’ll add below:

My CC 5.0 GPU (GTX 960M) supposedly can run 32 kernels in parallel but in my runs it’s always capped at 16. I have no idea why it doesn’t go higher.

This could be because the occupancy limit for the kernel is already reached when 16 kernels are launched - did you happen to look at the occupancy calculator with this version of the kernel?

starts = np.array(np.cumsum(np.array([0] + [len(seq) for seq in seqs]), dtype=np.int32), dtype=np.int32)
d_starts = cuda.to_device(starts, stream=stream)

It’s worth noting that asynchronous transfers can be made on a stream by passing the stream keyword argument to to_device as you have done here, but only a synchronous transfer will be made if the host memory is not pinned or mapped. The creation of pinned and mapped arrays is done with functions listed in the Memory Management documentation. For this particular use case I’d imagine that pinning the host memory wouldn’t have made too much difference, but if you were looking to overlap data transfer and kernel launches then it would have been necessary.
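For anyone who wants to try the pinned variant, here is a rough sketch of what it could look like. The helper names (`build_starts`, `copy_starts_async`) and the example sequences are made up for illustration and are not from the blog post; the device part is skipped when Numba or a GPU is not available.

```python
import numpy as np

try:
    from numba import cuda
except ImportError:  # lets the host-side part run without Numba installed
    cuda = None


def build_starts(seqs):
    """Start offset of each sequence in the concatenated buffer, plus one end entry."""
    lengths = [len(seq) for seq in seqs]
    return np.cumsum([0] + lengths, dtype=np.int32)


def copy_starts_async(starts):
    """Copy `starts` to the device from pinned host memory on a new stream.

    Returns (device_array, pinned_host_array, stream), or None without a usable GPU.
    The pinned buffer is returned so it stays alive until the caller synchronizes.
    """
    if cuda is None or not cuda.is_available():
        return None
    stream = cuda.stream()
    # Page-locked host buffer: required for the copy to be truly asynchronous.
    pinned = cuda.pinned_array(starts.shape, dtype=starts.dtype)
    pinned[:] = starts
    d_starts = cuda.to_device(pinned, stream=stream)
    return d_starts, pinned, stream


seqs = ["ACGT", "AC", "GATTACA"]  # hypothetical example data
starts = build_starts(seqs)
result = copy_starts_async(starts)
if result is not None:
    d_starts, pinned, stream = result
    # Kernels launched on other streams could overlap with the copy here.
    stream.synchronize()
```

Wrapping an existing array in the `cuda.pinned(...)` context manager should be an alternative to allocating a separate pinned buffer, at the cost of pinning and unpinning on every use.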

If we first copy all sequences to the device, we can then pass in a tuple of device sequences, and just take an index into that tuple in the kernel function.

Sadly, this actually generates slightly slower kernels than the sequence concatenation code, and I’m unsure why. The total runtime goes up to 1.39s.

I still need to look at your other thread, but my gut instinct is that this increases register pressure or otherwise decreases efficiency. The underlying implementation of passing a tuple of arrays expands the tuple argument into multiple arguments, one for each tuple element (this goes on recursively if you have nested tuples).
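To make the expansion concrete, here is a small pure-Python model of it. This is not Numba's actual lowering code, just an illustration of how a nested tuple turns into a flat argument list; `flatten_args` and the placeholder strings standing in for device arrays are hypothetical.

```python
def flatten_args(arg):
    """Recursively expand a (possibly nested) tuple into a flat list of leaf arguments."""
    if isinstance(arg, tuple):
        flat = []
        for element in arg:
            flat.extend(flatten_args(element))
        return flat
    return [arg]


# Placeholder names standing in for device arrays; the last element is a nested tuple.
seqs = ("seq0", "seq1", ("seq2", "seq3"))
flat = flatten_args(seqs)
print(len(flat), flat)
```

With one tuple holding many sequences, the kernel ends up with one argument per sequence, whereas the concatenated version only needs the data array and the `starts` array.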
