Tuple-of-sequences argument to a CUDA kernel is slower than passing a concatenation

I have a CUDA kernel that takes a list of np.array sequences (of different sizes) as an argument.

From testing, it seems that passing in a single array that is the concatenation of all sequences is around 8% faster than passing in a tuple of device arrays.

Some experimental code:
Concatenation: http://ragnargrootkoerkamp.nl/upload/experiment_concat.py
Tuple of device array: http://ragnargrootkoerkamp.nl/upload/experiment_tuple.py
(Run a diff to compare them; the only difference is in how the sequences are passed to the kernel.)
Concatenation runs in 0.85s.
Tuple of device arrays runs in 0.94s.
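
For context, the difference between the two variants boils down to something like the sketch below. The kernel body and all names here are simplified placeholders of my own, not the actual code from the linked scripts:

```python
import numpy as np
from numba import cuda

# Variant A: one concatenated array plus an offsets array.
@cuda.jit
def kernel_concat(seqs_flat, offsets, out):
    b = cuda.blockIdx.x                      # one block per sequence
    start = offsets[b]
    end = offsets[b + 1]
    acc = 0
    for pos in range(start + cuda.threadIdx.x, end, cuda.blockDim.x):
        acc += seqs_flat[pos]                # placeholder per-element work
    cuda.atomic.add(out, b, acc)

# Variant B: a tuple of per-sequence device arrays.
@cuda.jit
def kernel_tuple(seqs, out):
    b = cuda.blockIdx.x
    seq = seqs[b]                            # this block's sequence
    acc = 0
    for pos in range(cuda.threadIdx.x, seq.shape[0], cuda.blockDim.x):
        acc += seq[pos]                      # placeholder per-element work
    cuda.atomic.add(out, b, acc)

# Host side:
sequences = [np.random.randint(0, 4, size=n).astype(np.int32)
             for n in (100, 250, 80)]
out_a = cuda.to_device(np.zeros(len(sequences), dtype=np.int64))
out_b = cuda.to_device(np.zeros(len(sequences), dtype=np.int64))

# A: a single HtoD copy of the concatenation, plus the offsets.
flat = cuda.to_device(np.concatenate(sequences))
offsets = cuda.to_device(
    np.cumsum([0] + [len(s) for s in sequences]).astype(np.int64))
kernel_concat[len(sequences), 32](flat, offsets, out_a)

# B: one device array per sequence, passed as a homogeneous tuple.
d_seqs = tuple(cuda.to_device(s) for s in sequences)
kernel_tuple[len(sequences), 32](d_seqs, out_b)
```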

Naturally the HtoD memory copy is faster for a single concatenated sequence, but from profiling it seems that this is not the limiting factor:

[nvvp timeline screenshot]
Left/purple is the concatenation, and right/blue is the tuple of device arrays.
In this case I’m running 100 kernels of 20 blocks (= sequences) each, on a GTX 960M with 5 SMs.

There’s also a small gap in the CPU timeline during the memory transfers, which is probably garbage collection, but it doesn’t seem large enough to explain the difference either.

It would be great to know where the slowdown in the kernels that take a tuple argument comes from.

The tuple variant uses slightly more registers, but occupancy (blocks/SM, or warps/SM) is currently limited by shared memory usage on my GPU, so this shouldn’t be the issue:
http://ragnargrootkoerkamp.nl/upload/profile_concat.nvvp
http://ragnargrootkoerkamp.nl/upload/profile_tuple.nvvp
(I can only put 2 links as a new user.)
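
For what it’s worth, the register and shared-memory numbers can also be double-checked outside of nvvp, assuming a recent Numba version that exposes the dispatcher query methods; this uses the placeholder kernels from the sketch above, not the real ones from the linked scripts:

```python
# Hedged sketch: get_regs_per_thread, get_shared_mem_per_block and
# inspect_asm exist on the CUDA dispatcher in recent Numba releases;
# on older versions these numbers are only visible in the profiler.
for name, kernel in [("concat", kernel_concat), ("tuple", kernel_tuple)]:
    # With a single compiled specialization the signature argument can be
    # omitted; otherwise pass the signature of interest.
    print(name,
          "regs/thread:", kernel.get_regs_per_thread(),
          "shared mem/block:", kernel.get_shared_mem_per_block())
    # Dumping and diffing the PTX of both variants may show where the extra
    # address arithmetic in the tuple version comes from.
    for sig, ptx in kernel.inspect_asm().items():
        with open(f"ptx_{name}.ptx", "w") as f:
            f.write(ptx)
```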

My only remaining guess is that accessing seqs[cuda.blockIdx.x][pos] is just a bit slower than seqs[offset + pos], but ideally, setting seq = seqs[cuda.blockIdx.x] once and then indexing seq[pos] should be just as efficient.

I optimized the code a bit to first do all the memory transfers and only then start all the kernels (sketched at the end of this post). The tuple version is still ~8% slower than the concatenation version, but now we can be sure that the memory transfers are not the limiting factor.
Also, launching the kernels is now very fast compared to the total time, so it really must be the kernels themselves that get somewhat less efficient code generation:
http://ragnargrootkoerkamp.nl/upload/experiment_concat_2.py
http://ragnargrootkoerkamp.nl/upload/experiment_tuple_2.py

http://ragnargrootkoerkamp.nl/upload/profile_concat_2.nvvp
http://ragnargrootkoerkamp.nl/upload/profile_tuple_2.nvvp
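
For reference, the restructuring in the _2 scripts amounts to roughly the following sketch; make_batch and kernel_tuple are placeholder names standing in for the real code in the linked files:

```python
import numpy as np
from numba import cuda

# 100 kernel launches of 20 sequences each (make_batch is a placeholder).
batches = [make_batch(i) for i in range(100)]

# 1) Do every HtoD copy up front.
device_batches = [tuple(cuda.to_device(s) for s in seqs) for seqs in batches]
outs = [cuda.to_device(np.zeros(len(seqs), dtype=np.int64)) for seqs in batches]

# 2) Only then launch all kernels, so copies and launch overhead no longer
#    interleave with kernel execution in the timeline.
for d_seqs, out in zip(device_batches, outs):
    kernel_tuple[len(d_seqs), 32](d_seqs, out)
cuda.synchronize()
```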