Error when multiplying device array elements with a scalar (mul(float32, array(float32, 0d, C)))

@cuda.jit
def velocity_verlet(aa, dt_gpu, L_gpu, r_gpu, dr_gpu, a_gpu, v_gpu):
    start_x = cuda.grid(1)
    stride_x = cuda.gridsize(1)
    tid = start_x
    for i1 in range(start_x, r_gpu.shape[0], stride_x):
        for i2 in range(i1+1, r_gpu.shape[0]):
            for k in range(0,3): #Periodic boundary conditions
                dr_gpu[k] = r_gpu[i1][k] - r_gpu[i2][k]
                if dr_gpu[k] > L_gpu[k]/2:
                    dr_gpu[k] = dr_gpu[k] - L_gpu[k]
                elif dr_gpu[k] < -L_gpu[k]/2:
                    dr_gpu[k] = dr_gpu[k] + L_gpu[k]
            #LJ_potential_gpu(dr_gpu, aa)
            #a_gpu[i1] = + aa # from i2 on i1
            #a_gpu[i2] = - aa # from i1 on i2
        for k in range(3):
            v_gpu[i1][k] += a_gpu[i1][k] * dt_gpu/2

The above is a CUDA kernel, and all of its inputs and outputs are device arrays defined as follows:

dt = 0.001
N = 400
r = np.zeros((N,3))
v = np.zeros((N,3))
dt_gpu = cuda.to_device(np.float32(dt))
L_gpu = cuda.to_device(np.float32(np.array([5, 5, 5])))
r_gpu = cuda.to_device(np.float32(r))
a_gpu = cuda.device_array(v.shape, dtype=np.float32) # Store calculated accelerations
v_gpu = cuda.device_array(v.shape, dtype=np.float32) # Store calculated speeds
dr_gpu = cuda.device_array(3, dtype=np.float32) # Store calculated dr
aa = cuda.device_array(3, dtype=np.float32)
velocity_verlet[128, 128](aa, dt_gpu, L_gpu, r_gpu, dr_gpu, a_gpu, v_gpu)

However, I get an error from the last line of my kernel: Numba cannot multiply the scalar dt_gpu with the elements of a_gpu, which it perceives as 0-d float arrays (array(float32, 0d, C)), as shown in the traceback below. I am pretty sure this is a very simple issue, but I am new to this and could not find anything on it online. Can someone tell me how to fix this?

TypingError: Failed in cuda mode pipeline (step: nopython frontend)
No implementation of function Function(<built-in function mul>) found for signature:
 
 >>> mul(float32, array(float32, 0d, C))
 
There are 8 candidate implementations:
      - Of which 2 did not match due to:
      Operator Overload in function 'mul': File: unknown: Line unknown.
        With argument(s): '(float32, array(float32, 0d, C))':
       No match for registered cases:
        * (int64, int64) -> int64
        * (int64, uint64) -> int64
        * (uint64, int64) -> int64
        * (uint64, uint64) -> uint64
        * (float32, float32) -> float32
        * (float64, float64) -> float64
        * (complex64, complex64) -> complex64
        * (complex128, complex128) -> complex128
      - Of which 6 did not match due to:
      Overload of function 'mul': File: <numerous>: Line N/A.
        With argument(s): '(float32, array(float32, 0d, C))':
       No match.

What are the shapes of dt, L, r, and v?

Sorry I forgot to mention!
r and v are (500, 3), L is (3,), and dt is (); all are NumPy float32.

An example function call is this:

dt = 0.001
N = 400
r = np.zeros((N,3))
v = np.zeros((N,3))
dt_gpu = cuda.to_device(np.float32(dt))
L_gpu = cuda.to_device(np.float32(np.array([5, 5, 5])))
r_gpu = cuda.to_device(np.float32(r))
a_gpu = cuda.device_array(v.shape, dtype=np.float32) # Store calculated accelerations
v_gpu = cuda.device_array(v.shape, dtype=np.float32) # Store calculated speeds
dr_gpu = cuda.device_array(3, dtype=np.float32) # Store calculated dr
aa = cuda.device_array(3, dtype=np.float32)
velocity_verlet[128, 128](aa, dt_gpu, L_gpu, r_gpu, dr_gpu, a_gpu, v_gpu)

I now understand that dt_gpu becomes an array, but I don't know how to fix this. In other words, how do I pass a scalar to the CUDA kernel?

You can pass scalars by value to kernels. So your call to velocity_verlet can become:

velocity_verlet[128, 128](aa, dt, L_gpu, r_gpu, dr_gpu, a_gpu, v_gpu)

However, if you'd like to keep the scalar on the GPU (so that you don't force a transfer of data between the host and device between kernels that use the scalar; for example, if one kernel computed the dt value and wrote it to device memory, and the velocity_verlet kernel then used that computed value), you can keep the call the same and modify your kernel to read the 0-d array as a scalar:

@cuda.jit
def velocity_verlet(aa, dt_gpu, L_gpu, r_gpu, dr_gpu, a_gpu, v_gpu):
    start_x = cuda.grid(1)
    stride_x = cuda.gridsize(1)
    tid = start_x
    for i1 in range(start_x, r_gpu.shape[0], stride_x):
        for i2 in range(i1+1, r_gpu.shape[0]):
            for k in range(0,3): #Periodic boundary conditions
                dr_gpu[k] = r_gpu[i1][k] - r_gpu[i2][k]
                if dr_gpu[k] > L_gpu[k]/2:
                    dr_gpu[k] = dr_gpu[k] - L_gpu[k]
                elif dr_gpu[k] < -L_gpu[k]/2:
                    dr_gpu[k] = dr_gpu[k] + L_gpu[k]
            #LJ_potential_gpu(dr_gpu, aa)
            #a_gpu[i1] = + aa # from i2 on i1
            #a_gpu[i2] = - aa # from i1 on i2
        for k in range(3):
            v_gpu[i1][k] += a_gpu[i1][k] * dt_gpu[()]/2

velocity_verlet[128, 128](aa, dt_gpu, L_gpu, r_gpu, dr_gpu, a_gpu, v_gpu)

(Note that the change to the kernel is to access dt_gpu as dt_gpu[()]. This is briefly mentioned in the NumPy documentation under "Indexing on ndarrays"; see the note that begins:

An empty (tuple) index is a full scalar index into a zero-dimensional array.)
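You can reproduce the 0-d behaviour with plain NumPy on the host (a standalone sketch, independent of CUDA). Note that host-side NumPy would happily broadcast a 0-d array in arithmetic; it is Numba's kernel typing that requires the explicit scalar extraction:

```python
import numpy as np

dt = np.asarray(np.float32(0.001))  # wrapping a scalar yields a 0-d array, much like to_device() does

print(dt.shape)  # () -- zero-dimensional
print(dt.ndim)   # 0

# dt[()] extracts the underlying float32 scalar, so '*' types as (float32, float32)
x = np.float32(2.0) * dt[()]
```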


Thanks! That basically solved my problem.