This is a fantastic and detailed writeup, and a great example for anyone looking to understand the whole trajectory from Python code to an optimized CUDA implementation - many thanks for putting it together!
I have a couple of thoughts / comments on specific sections, which I’ll add below:
My CC 5.0 GPU (GTX 960M) supposedly can run 32 kernels in parallel but in my runs it’s always capped at 16. I have no idea why it doesn’t go higher.
This could be because the occupancy limit for the kernel is already reached when 16 kernels are launched - did you happen to look at the occupancy calculator with this version of the kernel?
starts = np.array(np.cumsum(np.array([0] + [len(seq) for seq in seqs]), dtype=np.int32), dtype=np.int32)
d_starts = cuda.to_device(starts, stream=stream)
It’s worth noting that asynchronous transfers can be made on a stream by passing the stream keyword argument to to_device as you have done here, but only a synchronous transfer will be made if the host memory is not pinned or mapped. Pinned and mapped arrays are created with the functions listed in the Memory Management section of the documentation. For this particular use case I’d imagine that pinning the host memory wouldn’t have made much difference, but if you were looking to overlap data transfers with kernel launches then it would have been necessary.
If we first copy all sequences to the device, we can then pass in a tuple of device sequences, and just take an index into that tuple in the kernel function.
Sadly, this actually generates slightly slower kernels than the sequence concatenation code, and I’m unsure why. The total runtime goes up to 1.39s.
I still need to look at your other thread, but my gut instinct is that this increases register pressure (or otherwise decreases efficiency) because of the way tuple arguments are implemented: a tuple of arrays is expanded into multiple arguments, one for each tuple element, and this expansion applies recursively if you have nested tuples.