Cuda.jit vs guvectorize

Hello everyone, I’m new to Numba and to CUDA programming. It’s easy to get lost in all the information about CUDA, and sometimes it’s hard to reproduce the examples and actually see the speed-up.

I tried to write two functions that do the same thing, one with the @cuda.jit decorator and the other with @guvectorize. From what I’ve read so far, with @guvectorize it isn’t necessary to specify threadsperblock and blockspergrid (I assumed guvectorize somehow picks those in an optimal way), so I didn’t specify them when writing my function. My doubt is about the difference between these two decorators: I’m afraid my guvectorized function is running in a completely serial way.
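Just to check my understanding, here is a toy example (made-up names, not from my real code) of how I read the calling convention for a target="cuda" gufunc: the core function sees one slice along the core dimensions, the leading dimensions are broadcast, and no [blockspergrid, threadsperblock] is passed when calling it:

import numpy as np
from numba import guvectorize, float64

# Toy gufunc: the core function handles one row of length n; the leading
# dimension (4 in this call) is the broadcast dimension, and the launch
# configuration is chosen by Numba instead of being given explicitly.
@guvectorize([(float64[:], float64[:], float64[:])], "(n),(n)->(n)",
             target="cuda")
def add_rows(x, y, out):
    for i in range(x.shape[0]):
        out[i] = x[i] + y[i]

a = np.arange(12.0).reshape(4, 3)
b = np.ones((4, 3))
print(add_rows(a, b))   # no explicit threadsperblock / blockspergrid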

I’ll leave my code below, and I would like to know the differences and how I could reproduce the @cuda.jit behaviour with @guvectorize. One thing that might be important to mention: in the full code I’m trying to speed up, this function and others are executed many times in a loop; I don’t know if that should change how I write my functions.

My code:

from timeit import default_timer as timer
from numba import jit, guvectorize, int32, int64, float64
from numba import cuda
import numpy as np
from numpy import fromfunction, sin, pi
import math

D = 9
nx = 20000
ny = 1000
ly = ny-1
uLB = 0.04

cx = np.array([0, 1, -1, 0, 0, 1, -1, 1, -1], dtype=np.float64)
cy = np.array([0, 0, 0, 1, -1, 1, -1, -1, 1], dtype=np.float64)
c = np.array([cx, cy])
w = np.array([4/9, 1/9, 1/9, 1/9, 1/9, 1/36, 1/36, 1/36, 1/36], dtype=np.float64)


def inivel(d, x, y):
    # Initial velocity: small sinusoidal perturbation on the x component only.
    return (1 - d) * uLB * (1 + 1e-4 * sin(y / ly * 2 * pi))

@guvectorize(["void(float64[:,:], float64[:,:,:], float64[:,:], float64[:], float64[:,:,:])"], "(nx,ny),(D,nx,ny),(D,i),(i) -> (i,nx,ny)", target="cuda")
def equilibrium_gpu(rho, u, c, w, feq ):              # Equilibrium distribution function.
    nx2 = rho.shape[0]
    ny2 = rho.shape[1]
    
    for i in range(9):
        for j in range(0,nx2):
            for k in range(0,ny2):
                feq[i,j,k] = rho[j,k]*w[i] * (1 + (3 * (c[0,i]*u[0,j,k] + c[1,i]*u[1,j,k])) + 0.5*(3 * (c[0,i]*u[0,j,k] + c[1,i]*u[1,j,k]))**2 - (3/2 * (u[0,j,k]**2 + u[1,j,k]**2)))


@cuda.jit
def equilibrium_gpu2(rho, u, c, w, feq):       # Equilibrium distribution function, one thread per (j, k) site.
    nx2 = rho.shape[0]
    ny2 = rho.shape[1]

    j, k = cuda.grid(2)
    if (j < nx2) & (k < ny2):
        for i in range(9):
            feq[i, j, k] = rho[j,k]*w[i] * (1 + (3 * (c[0,i]*u[0,j,k] + c[1,i]*u[1,j,k])) + 0.5*(3 * (c[0,i]*u[0,j,k] + c[1,i]*u[1,j,k]))**2 - (3/2 * (u[0,j,k]**2 + u[1,j,k]**2)))


vel = fromfunction(inivel, (2,nx,ny))
rho = np.ones([nx, ny], dtype='float64')
res = np.zeros([D, nx, ny], dtype='float64')

rho_device = cuda.to_device(rho)
u_device = cuda.to_device(vel)
c_device = cuda.to_device(c)
w_device = cuda.to_device(w)
feq_device = cuda.device_array(shape=(D,nx,ny,), dtype=np.float64)
feq_device2 = cuda.device_array(shape=(D,nx,ny,), dtype=np.float64)

threadsperblock = (64, 16)
blockspergrid_x = math.ceil(nx / threadsperblock[0])
blockspergrid_y = math.ceil(ny / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

s = timer()
equilibrium_gpu(rho_device,u_device,c_device,w_device,out=feq_device)
cuda.synchronize()
print(timer() - s)

s = timer()
equilibrium_gpu2[blockspergrid, threadsperblock](rho_device,u_device,c_device,w_device,feq_device2)
cuda.synchronize()
print(timer() - s)

s = timer()
feq_host = feq_device.copy_to_host()
print(timer()-s)

s = timer()
feq_host2 = feq_device2.copy_to_host()
print(timer()-s)

feq_host == feq_host2

The output looks like this:

57.212994
0.21147969999999816
0.29927709999999763
0.2992276999999959
array([[[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       ...,

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]]])

So they really do compute the same thing, but the @guvectorize version takes much longer (about 57 s versus roughly 0.2 s for the @cuda.jit kernel).
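My guess (please correct me if I’m wrong) is that the problem is my layout string: since the whole (nx, ny) grid is part of the core dimensions, every input matches its core shape exactly, there is no broadcast dimension left, and the whole triple loop runs inside a single core-function call, essentially on one thread. Below is an untested sketch of an alternative layout where the core only handles one row of the lattice, so the leading nx dimension becomes the broadcast dimension; note that it assumes the arrays are rearranged so the core dimensions are trailing (u as (nx, 2, ny) and feq as (nx, 9, ny) instead of (2, nx, ny) and (9, nx, ny)):

# Untested sketch: the core signature covers one row (one x index), so the
# broadcast, and therefore the GPU parallelism, is over the leading nx axis.
@guvectorize(
    ["void(float64[:], float64[:,:], float64[:,:], float64[:], float64[:,:])"],
    "(ny),(d,ny),(d,i),(i)->(i,ny)",
    target="cuda",
)
def equilibrium_gpu3(rho, u, c, w, feq):
    # Per call: rho is (ny,), u is (2, ny), c is (2, 9), w is (9,), feq is (9, ny).
    for k in range(rho.shape[0]):
        usqr = u[0, k]**2 + u[1, k]**2
        for i in range(w.shape[0]):
            cu = 3 * (c[0, i] * u[0, k] + c[1, i] * u[1, k])
            feq[i, k] = rho[k] * w[i] * (1 + cu + 0.5 * cu**2 - 1.5 * usqr)

# Hypothetical call, with u and feq rearranged so the core dims are trailing:
u_swapped = cuda.to_device(np.ascontiguousarray(np.moveaxis(vel, 1, 0)))  # (nx, 2, ny)
feq_swapped = cuda.device_array(shape=(nx, D, ny), dtype=np.float64)      # (nx, 9, ny)
equilibrium_gpu3(rho_device, u_swapped, c_device, w_device, out=feq_swapped)
# np.moveaxis(feq_swapped.copy_to_host(), 1, 0) should then match feq_host.

Even then, I suppose the parallelism would only be over the nx rows (20000 of them) instead of over every lattice site like in the @cuda.jit kernel, so I don’t know if it can match the hand-written kernel, and I’m not sure this is the idiomatic way to write it.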
Any help would be awesome; I’m really trying to learn all this.

Thanks
Bruno