CUDA shared memory on 1d arrays

Hi,
all the examples of shared memory preloading I can find are based on 2D arrays.
Does it make any sense, and is there any performance gain, if my calculation is C = A*B, where A and B are 1D arrays?
Or doesn’t it?
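
To be concrete, by C = A*B I mean something like the following (a minimal sketch, assuming an elementwise product; the kernel name, block size, and array sizes are just placeholders):

```python
import numpy as np
from numba import cuda

@cuda.jit
def multiply(a, b, c):
    # One thread per output element, plain global-memory reads
    i = cuda.grid(1)
    if i < c.size:
        c[i] = a[i] * b[i]

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
c = np.zeros_like(a)

threads = 256
blocks = (n + threads - 1) // threads
multiply[blocks, threads](a, b, c)
```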

It’s very hard to give a general answer to this question. Sometimes adjusting the way indexing is done can affect register usage, and therefore occupancy and performance.

I would suggest first writing the code in the most natural / readable way and making sure it works; then, if performance isn’t what you need, experiment with converting it from 2D indexing to 1D indexing. Note that there might be other things worth optimizing instead (depending on your kernel).
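
For what it’s worth, a 1D shared memory preload looks much the same as the 2D examples, just with a single index. Here is a minimal sketch (the block size and names are my own; note that for a pure elementwise product each input element is read exactly once, so staging through shared memory is unlikely to gain anything by itself):

```python
from numba import cuda, float32

TPB = 256  # threads per block; an arbitrary choice for this sketch

@cuda.jit
def multiply_shared(a, b, c):
    # Per-block staging buffers in shared memory
    sa = cuda.shared.array(TPB, float32)
    sb = cuda.shared.array(TPB, float32)
    i = cuda.grid(1)
    tx = cuda.threadIdx.x
    if i < c.size:
        # Each thread preloads one element of A and one of B
        sa[tx] = a[i]
        sb[tx] = b[i]
    cuda.syncthreads()
    if i < c.size:
        c[i] = sa[tx] * sb[tx]
```

Shared memory pays off when threads in a block reuse each other’s loads (as in a tiled matrix multiply), which is part of why most of the examples you’ll find are 2D.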

To get an idea of where the hotspots in your kernel are, you can use Nsight Compute to profile your kernels. With the latest Numba 0.54 RC2 (and 0.54 when it is released) you can pass lineinfo=True to the CUDA JIT decorator so that Nsight can highlight the time spent on each Python source line, as shown in the PR that added it: Add lineinfo flag to PTX and SASS compilation by maxpkatz · Pull Request #6802 · numba/numba · GitHub
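
Usage is just the extra keyword argument on the decorator:

```python
from numba import cuda

@cuda.jit(lineinfo=True)
def multiply(a, b, c):
    i = cuda.grid(1)
    if i < c.size:
        c[i] = a[i] * b[i]
```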