Writing to global memory is slow?

Hello Numba community,

I am new to Numba and have a question about slow global memory write performance. It may be an easy one, but I couldn’t work it out myself and would appreciate an expert opinion.

The issue is as follows: when I generate random numbers “on the fly” without storing them in global memory, the GPU kernel performs very well, nearly as fast as CUDA C. However, when I store the generated random numbers in global memory, performance drops significantly, while the equivalent CUDA C code does not suffer from this issue.

For example, I conducted the following experiment on a GTX 1050 GPU:

Generating N=10^8 random numbers without storing them in global memory took approximately 1.79 ms.
Generating the same N=10^8 random numbers and storing them in global memory took around 37.98 ms.

I also repeated the experiment on A100 and V100 GPUs, and the same performance pattern persisted.

My question is: Is this the expected behavior for Numba, or is there something wrong with my implementation?

I would highly appreciate any help or insights you can provide.

Thank you.

I don’t know the answer, but posting the example code you’re using might help someone who does.


Thanks for your reply.

Here are the code examples for generating random numbers with Numba.

With storing to global memory:

import numpy as np
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float64

@cuda.jit
def gpu_random_numbers(rng_states, n, out):
    # Grid-stride loop: each thread generates values and writes them
    # to the global-memory output array.
    thread_id = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    stride = cuda.blockDim.x * cuda.gridDim.x
    for i in range(thread_id, n, stride):
        out[i] = xoroshiro128p_uniform_float64(rng_states, thread_id)

num_numbers = 100000000
block_size = 512
grid_size = 100

rng_states = create_xoroshiro128p_states(grid_size * block_size, seed=1234)
d_out = cuda.device_array(num_numbers, dtype=np.float64)
gpu_random_numbers[grid_size, block_size](rng_states, num_numbers, d_out)

Without storing to global memory:

from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float64

@cuda.jit
def gpu_random_numbers(rng_states, n):
    # Same grid-stride loop, but the generated value is only kept in a
    # local variable and never written to global memory.
    thread_id = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    stride = cuda.blockDim.x * cuda.gridDim.x
    for i in range(thread_id, n, stride):
        out = xoroshiro128p_uniform_float64(rng_states, thread_id)

num_numbers = 100000000
block_size = 512
grid_size = 100

rng_states = create_xoroshiro128p_states(grid_size * block_size, seed=1234)
gpu_random_numbers[grid_size, block_size](rng_states, num_numbers)

Would appreciate any help.

Thank you!

Could you also share the CUDA C code and how you are timing the kernels?
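
In particular, kernel launches from Numba are asynchronous, so the timer has to wait for the kernel to finish before stopping. Here is a minimal sketch of one way to time the kernel from your post (an illustration only, not necessarily how you measured; the warm-up launch is my addition to keep JIT compilation out of the measurement):

from time import perf_counter
from numba import cuda

# Warm-up launch so JIT compilation is not included in the timing.
gpu_random_numbers[grid_size, block_size](rng_states, num_numbers, d_out)
cuda.synchronize()

t0 = perf_counter()
gpu_random_numbers[grid_size, block_size](rng_states, num_numbers, d_out)
cuda.synchronize()  # wait for the asynchronous kernel launch to complete
print(f"kernel time: {(perf_counter() - t0) * 1000:.2f} ms")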