Tuple of sequence argument to cuda kernel is slower than passing concatenation

I have a cuda kernel that takes a list of np.array sequences (of different sizes) as argument.

From testing, it seems that passing in a single array that is the concatenation of all sequences is around 8% faster than passing in a tuple of device arrays.

Some experimental code:
Concatenation: http://ragnargrootkoerkamp.nl/upload/experiment_concat.py
Tuple of device array: http://ragnargrootkoerkamp.nl/upload/experiment_tuple.py
(Run a diff to see the differences - the only difference is in how the sequences are passed to the kernel.)
Concatenation runs in 0.85s.
Tuple of device array runs in 0.94s.

Naturally the HtoD memory copy is faster for a single concatenated sequence, but from profiling it seems that this is not the limiting factor:

Left/purple is the concatenation, and right/blue is the tuple of device sequences.
In this case I’m running 100 kernels of 20 blocks(=sequences) each, on a GTX 960M with 5SMs.

There’s also a little gap in the CPU timeline during the memory transfer which is probably some garbage collection but also doesn’t seem large enough to explain the difference.

It would be great to know where this slowdown in the kernels that take a tuple argument is coming from.

The tuple variant uses slightly more registers, but the blocks/SM (or warps/SM) is currently limited by shared memory usage on my GPU, so this shouldn’t be the issue:
(I can only put 2 links as new user.)

My only remaining guess is that accessing seqs[cuda.blockIdx.x][pos] is just a bit slower than seqs[offset + pos], but ideally setting seq = seqs[cuyda.blockIdx.x] followed by seq[pos] should be just as efficient.

I optimized the code a bit to first do all the memory transfers and only then start all the kernels. The tuple version is still ~8% slower than the concatenation version, but now we can be sure that this is not a limiting factor.
Also, firing up the kernels is very fast now compared to the total time, so it really is the kernels themselves that somehow must have somewhat less efficient code generations: