I have a CUDA kernel that takes a list of np.array sequences (of different sizes) as an argument.
From testing, passing a single array that is the concatenation of all sequences appears to be around 8% faster than passing a tuple of device arrays.
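For reference, a minimal pure-Python sketch of the flattening used for the concatenated variant (the function name `concat_with_offsets` is mine, not from the experiment code): each sequence is copied into one flat buffer, and a per-sequence offsets array lets block `b` recover its slice as `buffer[offsets[b]:offsets[b+1]]`.

```python
from itertools import accumulate

def concat_with_offsets(seqs):
    """Flatten variable-length sequences into one buffer plus offsets.

    offsets has len(seqs) + 1 entries; offsets[b] is where block b's
    sequence starts, so its slice is buffer[offsets[b]:offsets[b+1]].
    """
    offsets = [0] + list(accumulate(len(s) for s in seqs))
    buffer = [x for s in seqs for x in s]
    return buffer, offsets

# Example: three "sequences" of different lengths.
seqs = [[1, 2, 3], [4], [5, 6]]
buf, offs = concat_with_offsets(seqs)
# buf  == [1, 2, 3, 4, 5, 6]
# offs == [0, 3, 4, 6]
```

In the real code the buffer and offsets would of course be np.arrays copied to the device once, instead of Python lists.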
Some experimental code:
Tuple of device arrays: http://ragnargrootkoerkamp.nl/upload/experiment_tuple.py
(Run a diff to see the differences - the only difference is in how the sequences are passed to the kernel.)
Concatenation runs in:
Tuple of device arrays runs in:
Naturally the host-to-device (HtoD) memory copy is faster for a single concatenated array, but profiling suggests this is not the limiting factor:
Left/purple is the concatenation, and right/blue is the tuple of device sequences.
In this case I’m running 100 kernels of 20 blocks (= sequences) each, on a GTX 960M with 5 SMs.
There’s also a small gap in the CPU timeline during the memory transfer, probably some garbage collection, but it doesn’t seem large enough to explain the difference.
It would be great to know where the slowdown in the kernels that take a tuple argument comes from.
The tuple variant uses slightly more registers, but the number of blocks/SM (or warps/SM) is currently limited by shared memory usage on my GPU, so this shouldn’t be the issue:
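As a sanity check on that occupancy claim, this is the kind of arithmetic involved. The shared-memory and register numbers below are made up for illustration (substitute whatever ptxas reports for the two variants); the per-SM limits are the Maxwell sm_50 values of 64 KiB shared memory and 64 K registers.

```python
def blocks_per_sm(smem_per_block, regs_per_thread, threads_per_block,
                  smem_per_sm=64 * 1024, regs_per_sm=64 * 1024,
                  max_blocks=32):
    """Estimate the per-SM block limit imposed by each resource."""
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks
    by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    return min(by_smem, by_regs, max_blocks)

# Hypothetical numbers: 20 KiB shared memory per block,
# 40 registers/thread (concat) vs 44 registers/thread (tuple).
limit_concat = blocks_per_sm(20 * 1024, 40, 128)  # -> 3
limit_tuple = blocks_per_sm(20 * 1024, 44, 128)   # -> 3
# Both variants end up at 3 blocks/SM, limited by shared memory,
# so the extra registers of the tuple variant don't change occupancy.
```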
(I can only put 2 links as new user.)
My only remaining guess is that accessing
`seqs[cuda.blockIdx.x][pos]` is just a bit slower than
`seqs[offset + pos]`, but ideally setting
`seq = seqs[cuda.blockIdx.x]` once, followed by
`seq[pos]`, should be just as efficient.
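To make that guess concrete, here is a pure-Python sketch (not the actual kernel code) of the three access patterns. In the CUDA version the question would be whether the compiler hoists the `seqs[cuda.blockIdx.x]` lookup out of the loop, or re-derives the row on every access.

```python
def sum_tuple_indexing(seqs, b):
    # Pattern 1: index the outer container on every access,
    # like seqs[cuda.blockIdx.x][pos] inside the loop.
    total = 0
    for pos in range(len(seqs[b])):
        total += seqs[b][pos]
    return total

def sum_hoisted(seqs, b):
    # Pattern 2: hoist the row lookup once, then index locally,
    # like seq = seqs[cuda.blockIdx.x]; ... seq[pos].
    seq = seqs[b]
    total = 0
    for pos in range(len(seq)):
        total += seq[pos]
    return total

def sum_concat(buf, offsets, b):
    # Pattern 3: flat buffer plus per-block offset,
    # like seqs[offset + pos].
    total = 0
    for pos in range(offsets[b + 1] - offsets[b]):
        total += buf[offsets[b] + pos]
    return total

# All three compute the same per-block sums.
seqs = ([1, 2, 3], [4], [5, 6])
buf, offsets = [1, 2, 3, 4, 5, 6], [0, 3, 4, 6]
results = [(sum_tuple_indexing(seqs, b),
            sum_hoisted(seqs, b),
            sum_concat(buf, offsets, b)) for b in range(3)]
# results == [(6, 6, 6), (4, 4, 4), (11, 11, 11)]
```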