Writing to global memory is slow?

Hello Numba community,

I am a new Numba user and have a question about slow global memory writes. It may have a simple answer, but I haven't been able to figure it out myself, and I would like an expert opinion.

The issue is as follows: When I generate random numbers “on the fly” without storing them in global memory, the GPU kernel performs very well, nearly as fast as CUDA C. However, when I store the generated random numbers in global memory, performance drops significantly, while the equivalent CUDA C code does not suffer from this issue.

For example, I conducted the following experiment on a GTX 1050 GPU:

Generating N=10^8 random numbers without storing them in global memory took approximately 1.79 ms.
Generating the same N=10^8 random numbers and storing them in global memory took around 37.98 ms.

I also repeated the experiment on A100 and V100 GPUs, and the same performance pattern persisted.

My question is: Is this the expected behavior for Numba, or is there something wrong with my implementation?

I would highly appreciate any help or insights you can provide.

Thank you.

I don’t know the answer, but posting the example code you’re using might help someone who does.

Thanks for your reply.

Here are my code examples for generating random numbers with Numba.

With storing to global memory:

import numpy as np
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float64

@cuda.jit
def gpu_random_numbers(rng_states, n, out):
    # Grid-stride loop: each thread handles every `stride`-th element.
    thread_id = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    stride = cuda.blockDim.x * cuda.gridDim.x
    for i in range(thread_id, n, stride):
        out[i] = xoroshiro128p_uniform_float64(rng_states, thread_id)

num_numbers = 100_000_000  # N = 10**8
block_size = 512
grid_size = 100

rng_states = create_xoroshiro128p_states(grid_size * block_size, seed=1234)  # one RNG state per thread
d_out = cuda.device_array(num_numbers, dtype=np.float64)  # output buffer in global memory
gpu_random_numbers[grid_size, block_size](rng_states, num_numbers, d_out)
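
For completeness: if the host needs the values, they can be copied back afterwards, e.g.:

host_out = d_out.copy_to_host()  # transfer the generated numbers back to host memory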

Without storing to global memory:

import numpy as np
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float64

@cuda.jit
def gpu_random_numbers(rng_states, n):
    thread_id = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    stride = cuda.blockDim.x * cuda.gridDim.x
    for i in range(thread_id, n, stride):
        # Generate the number but discard it; nothing is written to global memory.
        out = xoroshiro128p_uniform_float64(rng_states, thread_id)

num_numbers = 100_000_000  # N = 10**8
block_size = 512
grid_size = 100

rng_states = create_xoroshiro128p_states(grid_size * block_size, seed=1234)
gpu_random_numbers[grid_size, block_size](rng_states, num_numbers)

Would appreciate any help.

Thank you!

Could you also share the C code and how you are timing the functions?
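
In particular, note that kernel launches in Numba are asynchronous, so without a cuda.synchronize() (or event-based timing) before stopping the clock, you may only be measuring the launch overhead. Something along these lines (just a sketch, reusing the names from your first example; your actual harness may differ):

import time
from numba import cuda

# Warm-up launch so JIT compilation time is excluded from the measurement.
gpu_random_numbers[grid_size, block_size](rng_states, num_numbers, d_out)
cuda.synchronize()

start = time.perf_counter()
gpu_random_numbers[grid_size, block_size](rng_states, num_numbers, d_out)
cuda.synchronize()  # wait for the kernel to finish before reading the clock
print(f"kernel time: {(time.perf_counter() - start) * 1000:.2f} ms")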