Numba CUDA: strange behavior with xoroshiro128p

Hello to all,
It’s me again with my Numba CUDA Brownian dynamics program.
I have found a strange bug where the interpretation of an input variable seems to interfere with the random number generator:

import numpy as np
from numba import cuda
import numba as nb
from numba.cuda.random import (create_xoroshiro128p_states,
                               xoroshiro128p_normal_float32)
import math, time

N = 4*1024 # Number of particles
r0 = np.random.uniform(size=N) # Initial positions (1D case)
d_r0 = cuda.to_device(r0.astype(np.float32)) # Transfer to device memory as float32
N_steps = 1_000_000 # Number of simulation steps
freq_dumps = 0 # Dump frequency for particle positions (0 = no dumping)
trajectory = cuda.device_array(shape=(N, 0)) # Placeholder that stores positions if dumping is enabled

rng_states = create_xoroshiro128p_states(N, seed=0)

threadsperblock = 128
blockspergrid = math.ceil(N / threadsperblock)
@cuda.jit
def test_kernel(r0, freq_dumps, N_steps, rng_states, trajectory):
    pos = cuda.grid(1)
    i_dump = 0
    if pos < r0.shape[0]:
        x0 = r0[pos]
        for step in range(N_steps):
            x1 = x0 + xoroshiro128p_normal_float32(rng_states, pos)

            # dumping block
            if freq_dumps != 0:
                if (step + 1) % freq_dumps == 0:
                    trajectory[pos, i_dump] = x1
                    i_dump += 1
            x0 = x1

dt_cpu = []
for i in range(4):
    t0_cpu = time.perf_counter()
    test_kernel[blockspergrid, threadsperblock](d_r0, freq_dumps, N_steps, rng_states, trajectory)
    cuda.synchronize()
    t1_cpu = time.perf_counter()
    dt_cpu.append(t1_cpu-t0_cpu)
print(f'{np.mean(dt_cpu[1:])/N/N_steps*1E9:.3e} ns/step/particles')

1.367e-01 ns/step/particles

@cuda.jit
def test_kernel(r0, freq_dumps, N_steps, rng_states, trajectory):
    pos = cuda.grid(1)
    i_dump = 0
    freq_dumps = 0 # <------- If I define freq_dumps here, there is a 10x boost?!
    if pos < r0.shape[0]:
        x0 = r0[pos]
        for step in range(N_steps):
            x1 = x0 + xoroshiro128p_normal_float32(rng_states, pos)

            # dumping block
            if freq_dumps != 0:
                if (step + 1) % freq_dumps == 0:
                    trajectory[pos, i_dump] = x1
                    i_dump += 1
            x0 = x1

dt_cpu = []
for i in range(4):
    t0_cpu = time.perf_counter()
    test_kernel[blockspergrid, threadsperblock](d_r0, freq_dumps, N_steps, rng_states, trajectory)
    cuda.synchronize()
    t1_cpu = time.perf_counter()
    dt_cpu.append(t1_cpu-t0_cpu)
print(f'{np.mean(dt_cpu[1:])/N/N_steps*1E9:.3e} ns/step/particles')

9.296e-03 ns/step/particles

Notice that if I replace x1 = x0 + xoroshiro128p_normal_float32(rng_states, pos) with x1 = x0 + 1, the difference between the two programs vanishes. The same happens if I delete trajectory[pos, i_dump] = x1, as if a read/write of this variable were required to trigger it.

Can anyone tell me what’s going on? Did I miss something? Should I create two kernel functions, one for freq_dumps == 0 and one for freq_dumps != 0?
Thanks!

Geoffrey.

I believe what’s happening here is that your second kernel is much better optimized because freq_dumps is a known constant. Since it is 0, the block:

            # dumping block
            if freq_dumps != 0:
                if (step + 1) % freq_dumps == 0:
                    trajectory[pos, i_dump] = x1
                    i_dump += 1

is completely eliminated.

You can get a similar speed boost in the first kernel by commenting out the “dumping block” code; in the second case, the compiler is effectively commenting it out for you.
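
Rather than maintaining two hand-written kernels, you can also let Numba do the specialization for you: closure variables are frozen as compile-time constants when a kernel is compiled. Here is a minimal sketch reusing the imports from the snippets above; the factory name make_test_kernel is my own, not part of the original code:

def make_test_kernel(freq_dumps):
    # freq_dumps is captured as a closure variable, so Numba freezes it
    # as a compile-time constant; when it is 0, the whole dumping block
    # is eliminated, just like in the second kernel above.
    @cuda.jit
    def kernel(r0, N_steps, rng_states, trajectory):
        pos = cuda.grid(1)
        i_dump = 0
        if pos < r0.shape[0]:
            x0 = r0[pos]
            for step in range(N_steps):
                x1 = x0 + xoroshiro128p_normal_float32(rng_states, pos)
                if freq_dumps != 0:
                    if (step + 1) % freq_dumps == 0:
                        trajectory[pos, i_dump] = x1
                        i_dump += 1
                x0 = x1
    return kernel

fast_kernel = make_test_kernel(0)    # dumping block compiled away
dump_kernel = make_test_kernel(100)  # dumping block kept

Each call to make_test_kernel compiles a separate specialization, so you pay one extra compilation per distinct freq_dumps value but keep a single source definition.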

OK, I see, thank you Graham.
It seems I still don’t fully understand how the compiler optimizes the Python code.
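
For reference, one way to see this optimization at work (assuming a reasonably recent Numba, where the CUDA dispatcher exposes inspect_asm()) is to compare the PTX generated for the two kernels; the eliminated dumping branch is simply absent from the second one:

# Sketch: print the PTX that Numba generated for each compiled signature.
# inspect_asm() returns a {signature: PTX source} mapping once the kernel
# has been compiled (e.g. after its first launch).
for sig, ptx in test_kernel.inspect_asm().items():
    print(sig)
    print(ptx[:1000])  # the start of the PTX is enough to compare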