Numba CUDA: strange behavior with xoroshiro128p

Hello to all,
It’s me again with my Numba CUDA Brownian dynamics program.
I have found a strange bug where the interpretation of an input variable seems to interfere with the random number generator:

import numpy as np
from numba import cuda
import numba as nb
from numba.cuda.random import (create_xoroshiro128p_states,
                               xoroshiro128p_normal_float32)
import math, time

N = 4*1024 # Number of particles
r0 = np.random.uniform(size=N) # Initial positions (1D case)
d_r0 = cuda.to_device(r0.astype(np.float32)) # Transfer to device memory as float32
N_steps = 1_000_000 # Number of simulation steps
freq_dumps = 0 # Dump frequency for particle positions (0 = no dumping)
trajectory = cuda.device_array(shape=(N, 0)) # Placeholder that stores positions if dumping is enabled

rng_states = create_xoroshiro128p_states(N, seed=0)

threadsperblock = 128
blockspergrid = math.ceil(N / threadsperblock)
@cuda.jit
def test_kernel(r0, freq_dumps, N_steps, rng_states, trajectory):
    pos = cuda.grid(1)
    i_dump = 0
    if pos < r0.shape[0]:
        x0 = r0[pos]
        for step in range(N_steps):
            x1 = x0 + xoroshiro128p_normal_float32(rng_states, pos)

            # dumping block
            if freq_dumps != 0:
                if (step + 1) % freq_dumps == 0:
                    trajectory[pos, i_dump] = x1
                    i_dump += 1
            x0 = x1

dt_cpu = []
for i in range(4):
    t0_cpu = time.perf_counter()
    test_kernel[blockspergrid, threadsperblock](d_r0, freq_dumps, N_steps, rng_states, trajectory)
    cuda.synchronize()
    t1_cpu = time.perf_counter()
    dt_cpu.append(t1_cpu-t0_cpu)
print(f'{np.mean(dt_cpu[1:])/N/N_steps*1E9:.3e} ns/step/particles')

1.367e-01 ns/step/particles

@cuda.jit
def test_kernel(r0, freq_dumps, N_steps, rng_states, trajectory):
    pos = cuda.grid(1)
    i_dump = 0
    freq_dumps = 0 # <------- If I define freq_dumps here, there is a 10x boost?!
    if pos < r0.shape[0]:
        x0 = r0[pos]
        for step in range(N_steps):
            x1 = x0 + xoroshiro128p_normal_float32(rng_states, pos)

            # dumping block
            if freq_dumps != 0:
                if (step + 1) % freq_dumps == 0:
                    trajectory[pos, i_dump] = x1
                    i_dump += 1
            x0 = x1

dt_cpu = []
for i in range(4):
    t0_cpu = time.perf_counter()
    test_kernel[blockspergrid, threadsperblock](d_r0, freq_dumps, N_steps, rng_states, trajectory)
    cuda.synchronize()
    t1_cpu = time.perf_counter()
    dt_cpu.append(t1_cpu-t0_cpu)
print(f'{np.mean(dt_cpu[1:])/N/N_steps*1E9:.3e} ns/step/particles')

9.296e-03 ns/step/particles

Notice that if I replace x1 = x0 + xoroshiro128p_normal_float32(rng_states, pos) with x1 = x0 + 1, the difference between the two programs vanishes. The same happens if I delete trajectory[pos, i_dump] = x1, as if a read/write of this variable were required to trigger it.

Can anyone tell me what’s going on? Did I miss something? Should I create two kernel functions, one for freq_dumps == 0 and one for freq_dumps != 0?
Thanks!

Geoffrey.

I believe what’s happening here is that your second kernel is much better optimized because freq_dumps is a known constant. Since it is 0, the block:

            # dumping block
            if freq_dumps != 0:
                if (step + 1) % freq_dumps == 0:
                    trajectory[pos, i_dump] = x1
                    i_dump += 1

is completely eliminated.

You can get a similar speed boost in the first kernel by commenting out the “dumping block” code; in the second case, the compiler is effectively commenting it out for you.
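
Rather than maintaining two hand-written kernels, you can also let Numba do the specialization for you: closure variables are frozen as compile-time constants when a kernel is compiled. Here is a minimal sketch reusing the imports from the snippets above; the factory name make_test_kernel is my own, not part of the original code:

def make_test_kernel(freq_dumps):
    # freq_dumps is captured as a closure variable, so Numba freezes it
    # as a compile-time constant; when it is 0, the whole dumping block
    # is eliminated, just like in the second kernel above.
    @cuda.jit
    def kernel(r0, N_steps, rng_states, trajectory):
        pos = cuda.grid(1)
        i_dump = 0
        if pos < r0.shape[0]:
            x0 = r0[pos]
            for step in range(N_steps):
                x1 = x0 + xoroshiro128p_normal_float32(rng_states, pos)
                if freq_dumps != 0:
                    if (step + 1) % freq_dumps == 0:
                        trajectory[pos, i_dump] = x1
                        i_dump += 1
                x0 = x1
    return kernel

fast_kernel = make_test_kernel(0)    # dumping block compiled away
dump_kernel = make_test_kernel(100)  # dumping block kept

Each call to make_test_kernel compiles a separate specialization, so you pay one extra compilation per distinct freq_dumps value but keep a single source definition.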

OK, I see, thank you Graham.
It seems I still don’t fully understand how the compiler optimizes the Python code.
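
For reference, one way to see this optimization at work (assuming a reasonably recent Numba, where the CUDA dispatcher exposes inspect_asm()) is to compare the PTX generated for the two kernels; the eliminated dumping branch is simply absent from the second one:

# Sketch: print the PTX that Numba generated for each compiled signature.
# inspect_asm() returns a {signature: PTX source} mapping once the kernel
# has been compiled (e.g. after its first launch).
for sig, ptx in test_kernel.inspect_asm().items():
    print(sig)
    print(ptx[:1000])  # the start of the PTX is enough to compare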