I have a CUDA kernel that takes a list of np.array sequences (of different sizes) as an argument.
From testing, it seems that passing in a single array that is the concatenation of all sequences is around 8% faster than passing in a tuple of device arrays.
Some experimental code:
Concatenation: http://ragnargrootkoerkamp.nl/upload/experiment_concat.py
Tuple of device arrays: http://ragnargrootkoerkamp.nl/upload/experiment_tuple.py
(Run a diff to see the differences - the only difference is in how the sequences are passed to the kernel.)
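For reference, the concatenated layout can be built host-side like this. This is just a minimal sketch of the idea (the helper name `concat_with_offsets` is mine, not from the linked scripts): flatten all sequences into one array and keep an offsets array so block `b` can read `flat[offsets[b] + pos]` instead of `seqs[b][pos]`.

```python
import numpy as np

def concat_with_offsets(seqs):
    """Flatten variable-length 1-D sequences into one array plus start offsets.

    Returns (flat, offsets) such that seqs[i] == flat[offsets[i]:offsets[i+1]].
    The kernel then receives flat and offsets instead of a tuple of arrays.
    """
    lengths = [len(s) for s in seqs]
    offsets = np.zeros(len(seqs) + 1, dtype=np.int64)
    np.cumsum(lengths, out=offsets[1:])  # offsets[0] stays 0
    flat = np.concatenate(seqs)
    return flat, offsets
```
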
Concatenation runs in 0.85s.
Tuple of device arrays runs in 0.94s.
Naturally the HtoD memory copy is faster for a single concatenated sequence, but from profiling it seems that this is not the limiting factor:
Left/purple is the concatenation, and right/blue is the tuple of device sequences.
In this case I’m running 100 kernels of 20 blocks (= sequences) each, on a GTX 960M with 5 SMs.
There’s also a small gap in the CPU timeline during the memory transfer, which is probably garbage collection, but it doesn’t seem large enough to explain the difference either.
It would be great to know where this slowdown in the kernels that take a tuple argument is coming from.
The tuple variant uses slightly more registers, but occupancy (blocks/SM, or warps/SM) is currently limited by shared memory usage on my GPU, so this shouldn’t be the issue:
http://ragnargrootkoerkamp.nl/upload/profile_concat.nvvp
http://ragnargrootkoerkamp.nl/upload/profile_tuple.nvvp
(I can only put 2 links as a new user.)
My only remaining guess is that accessing `seqs[cuda.blockIdx.x][pos]` is just a bit slower than `seqs[offset + pos]`, but ideally setting `seq = seqs[cuda.blockIdx.x]` once, followed by `seq[pos]` accesses, should be just as efficient.
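To make the three access patterns concrete, here they are as plain NumPy indexing on the host (standing in for the device code; `block` plays the role of `cuda.blockIdx.x`). All three read the same element:

```python
import numpy as np

# Variable-length sequences, as a tuple (the tuple-of-arrays variant)
seqs = tuple(np.arange(n, dtype=np.int32) for n in (3, 5, 2))

# The concatenation variant: one flat array plus per-sequence start offsets
flat = np.concatenate(seqs)
offsets = np.concatenate(([0], np.cumsum([len(s) for s in seqs])))

block = 1          # stands in for cuda.blockIdx.x
pos = 2
offset = offsets[block]

a = seqs[block][pos]    # tuple variant: double indexing on every access
b = flat[offset + pos]  # concatenation variant: single indexing
seq = seqs[block]       # hoisting the row once per block...
c = seq[pos]            # ...then single indexing, ideally as fast as b
assert a == b == c
```
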