CUDA shared memory on 1d arrays

Hi,
all the examples of shared memory preloading I can find are based on 2D arrays.
Does it make any sense, and is there any performance gain, if my calculation is C = A*B, where A and B are 1D arrays?
Or doesn’t it?
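
To be concrete, by C = A*B I mean something like the following (a minimal sketch, assuming an elementwise product; the kernel name, block size, and array sizes are just placeholders):

```python
import numpy as np
from numba import cuda

@cuda.jit
def multiply(a, b, c):
    # One thread per output element, plain global-memory reads
    i = cuda.grid(1)
    if i < c.size:
        c[i] = a[i] * b[i]

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
c = np.zeros_like(a)

threads = 256
blocks = (n + threads - 1) // threads
multiply[blocks, threads](a, b, c)
```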

It’s very hard to give a general answer to this question. Sometimes adjusting the way indexing is done can affect register usage, and therefore occupancy and performance.

I would suggest first writing the code in the most natural / readable way and making sure it works; then, if performance isn’t what you need, experiment with converting it from 2D indexing to 1D indexing. Note that there might be other things worth optimizing instead (depending on your kernel).
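
For what it’s worth, a 1D shared memory preload looks much the same as the 2D examples, just with a single index. Here is a minimal sketch (the block size and names are my own; note that for a pure elementwise product each input element is read exactly once, so staging through shared memory is unlikely to gain anything by itself):

```python
from numba import cuda, float32

TPB = 256  # threads per block; an arbitrary choice for this sketch

@cuda.jit
def multiply_shared(a, b, c):
    # Per-block staging buffers in shared memory
    sa = cuda.shared.array(TPB, float32)
    sb = cuda.shared.array(TPB, float32)
    i = cuda.grid(1)
    tx = cuda.threadIdx.x
    if i < c.size:
        # Each thread preloads one element of A and one of B
        sa[tx] = a[i]
        sb[tx] = b[i]
    cuda.syncthreads()
    if i < c.size:
        c[i] = sa[tx] * sb[tx]
```

Shared memory pays off when threads in a block reuse each other’s loads (as in a tiled matrix multiply), which is part of why most of the examples you’ll find are 2D.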

To get an idea of where the hotspots in your kernel are, you can use Nsight Compute to profile your kernels. With the latest Numba 0.54 RC2 (and 0.54 when it is released) you can pass lineinfo=True to the CUDA JIT decorator so that Nsight can highlight the time spent on each Python source line, as shown in the PR that added it: Add lineinfo flag to PTX and SASS compilation by maxpkatz · Pull Request #6802 · numba/numba · GitHub
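
Usage is just the extra keyword argument on the decorator:

```python
from numba import cuda

@cuda.jit(lineinfo=True)
def multiply(a, b, c):
    i = cuda.grid(1)
    if i < c.size:
        c[i] = a[i] * b[i]
```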