Hello Numba community,
I am a new Numba user and have a question about slow global memory write performance. It may have a simple answer, but I could not figure it out myself and would appreciate an expert opinion.
The issue is as follows: When I generate random numbers "on the fly" without storing them in global memory, the GPU kernel performs very well, nearly as fast as CUDA C. However, when I store the generated random numbers in global memory, performance drops significantly, whereas an equivalent CUDA C kernel does not suffer the same penalty.
For example, I ran the following experiment on a GTX 1050 (a simplified sketch of the two kernels I am comparing follows the timings below):
Generating N=10^8 random numbers without storing them in global memory took approximately 1.79 ms.
Generating the same N=10^8 random numbers and storing them in global memory took around 37.98 ms.
I also repeated the experiment on A100 and V100 GPUs, and the same performance pattern persisted.
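For reference, the pattern I am comparing looks roughly like the sketch below. This is a simplified reconstruction rather than my exact code: the kernel names (`generate_only`, `generate_and_store`), the grid configuration, and the use of Numba's built-in xoroshiro128p RNG are illustrative assumptions.

```python
import time
import numpy as np
from numba import cuda
from numba.cuda.random import (create_xoroshiro128p_states,
                               xoroshiro128p_uniform_float32)

N = 10**8
threads_per_block = 256
blocks = 512
n_threads = threads_per_block * blocks

@cuda.jit
def generate_only(rng_states, n):
    # Each thread draws its share of numbers via a grid-stride loop,
    # but the results are deliberately never written to global memory.
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(tid, n, stride):
        x = xoroshiro128p_uniform_float32(rng_states, tid)

@cuda.jit
def generate_and_store(rng_states, out, n):
    # Same loop, but every draw is written to an output array
    # in global memory.
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(tid, n, stride):
        out[i] = xoroshiro128p_uniform_float32(rng_states, tid)

rng_states = create_xoroshiro128p_states(n_threads, seed=1)
out = cuda.device_array(N, dtype=np.float32)

# Warm up once so JIT compilation is excluded from the timings.
generate_only[blocks, threads_per_block](rng_states, N)
generate_and_store[blocks, threads_per_block](rng_states, out, N)
cuda.synchronize()

t0 = time.perf_counter()
generate_only[blocks, threads_per_block](rng_states, N)
cuda.synchronize()
t1 = time.perf_counter()
generate_and_store[blocks, threads_per_block](rng_states, out, N)
cuda.synchronize()
t2 = time.perf_counter()
print(f"generate only:      {(t1 - t0) * 1e3:.2f} ms")
print(f"generate and store: {(t2 - t1) * 1e3:.2f} ms")
```

One caveat I am aware of: in a no-store kernel like `generate_only`, the compiler is in principle free to eliminate work whose result is never used, so the "generate only" timing may not reflect the full cost of the RNG calls.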
My question is: Is this the expected behavior for Numba, or is there something wrong with my implementation?
I would greatly appreciate any help or insights you can provide.
Thank you.