Hello to all,
It’s me again with my program numba.cuda of Brownian dynamics.
I have found a strange bug where the interpretation of an input variable seems to interfere with the random number generator:
import numpy as np
from numba import cuda
import numba as nb
from numba.cuda.random import (create_xoroshiro128p_states,
xoroshiro128p_normal_float32)
import math, time
N = 4*1024 # Number of particles
r0 = np.random.uniform(size=(N)) # Initial position (1D case)
d_r0 = cuda.to_device(r0.astype(np.float32)) # Transfert to device memory as float32
N_steps = 1_000_000 # Number of simulation steps
freq_dumps = 0 # Frequency of dumping for the positions of particles (0 = no dumping)
trajectory = cuda.device_array(shape=(N,0)) # Foo variable that will store position if there is a dumping
rng_states = create_xoroshiro128p_states(N, seed=0)
threadsperblock = 128
blockspergrid = math.ceil(N / threadsperblock)
@cuda.jit
def test_kernel(r0, freq_dumps, N_steps, rng_states, trajectory):
pos = cuda.grid(1)
i_dump = 0
if pos < r0.shape[0]:
x0 = r0[pos]
for step in range(N_steps):
x1 = x0 + xoroshiro128p_normal_float32(rng_states, pos)
# dumping block
if freq_dumps !=0:
if (step + 1)%freq_dumps == 0:
trajectory[pos, i_dump] = x1
i_dump += 1
x0=x1
dt_cpu = []
for i in range(4):
t0_cpu = time.perf_counter()
test_kernel[blockspergrid, threadsperblock](r0, freq_dumps, N_steps, rng_states, trajectory)
cuda.synchronize()
t1_cpu = time.perf_counter()
dt_cpu.append(t1_cpu-t0_cpu)
print(f'{np.mean(dt_cpu[1:])/N/N_steps*1E9:.3e} ns/step/particles')
1.367e-01 ns/step/particles
@cuda.jit
def test_kernel(r0, freq_dumps, N_steps, rng_states, trajectory):
pos = cuda.grid(1)
i_dump = 0
freq_dumps = 0 # <------- If i define freq_dumps here, there is x10 boost ?!
if pos < r0.shape[0]:
x0 = r0[pos]
for step in range(N_steps):
x1 = x0 + xoroshiro128p_normal_float32(rng_states, pos)
# dumping block
if freq_dumps !=0:
if (step + 1)%freq_dumps == 0:
trajectory[pos, i_dump] = x1
i_dump += 1
x0=x1
dt_cpu = []
for i in range(4):
t0_cpu = time.perf_counter()
test_kernel[blockspergrid, threadsperblock](r0, freq_dumps, N_steps, rng_states, trajectory)
cuda.synchronize()
t1_cpu = time.perf_counter()
dt_cpu.append(t1_cpu-t0_cpu)
print(f'{np.mean(dt_cpu[1:])/N/N_steps*1E9:.3e} ns/step/particles')
9.296e-03 ns/step/particles
Notice that if I replace x1 = x0 + xoroshiro128p_normal_float32(rng_states, pos)
by x1=x0+1
, the difference between the two programs vanishes. Same if I delete trajectory[pos, i_dump] = x1
as if there is a read/write of this variable.
Can anyone tell me what’s going on? Did I miss something? Should I create two kernel functions based on freq_jumps=0 or !=0 ?
Thanks !
Geoffrey.