Cuda.jit vs guvectorize

Hello everyone, I’m new to Numba and to CUDA programming. It’s easy to get lost in all the information about CUDA, and sometimes it’s hard to reproduce the examples and actually see the speed-up.

I tried to write two functions that do the same thing, one with the @cuda.jit decorator and the other with @guvectorize. From what I’ve read so far, with @guvectorize it isn’t necessary to specify threadsperblock and blockspergrid (I assumed guvectorize somehow picks those in an optimal way), so I didn’t specify them when writing my function. My doubt is about the difference between these two decorators: I’m afraid my guvectorized function is running in a completely serial way.
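Just to check my understanding, here is a toy example (made-up names, not from my real code) of how I read the calling convention for a target="cuda" gufunc: the core function sees one slice along the core dimensions, the leading dimensions are broadcast, and no [blockspergrid, threadsperblock] is passed when calling it:

import numpy as np
from numba import guvectorize, float64

# Toy gufunc: the core function handles one row of length n; the leading
# dimension (4 in this call) is the broadcast dimension, and the launch
# configuration is chosen by Numba instead of being given explicitly.
@guvectorize([(float64[:], float64[:], float64[:])], "(n),(n)->(n)",
             target="cuda")
def add_rows(x, y, out):
    for i in range(x.shape[0]):
        out[i] = x[i] + y[i]

a = np.arange(12.0).reshape(4, 3)
b = np.ones((4, 3))
print(add_rows(a, b))   # no explicit threadsperblock / blockspergrid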

I’ll leave my code below, and I would like to know the differences and how I could reproduce the @cuda.jit behaviour with @guvectorize. One thing that might be important to mention: in the full code I’m trying to speed up, this function and others are executed many times in a loop; I don’t know if that should change how I write my functions.

My code:

from timeit import default_timer as timer
from numba import jit, guvectorize, int32, int64, float64
from numba import cuda
import numpy as np
from numpy import fromfunction, sin, pi
import math

D = 9
nx = 20000
ny = 1000
ly = ny-1
uLB = 0.04

cx = np.array([0, 1, -1, 0, 0, 1, -1, 1, -1], dtype=np.float64)
cy = np.array([0, 0, 0, 1, -1, 1, -1, -1, 1], dtype=np.float64)
c = np.array([cx, cy])
w = np.array([4/9, 1/9, 1/9, 1/9, 1/9, 1/36, 1/36, 1/36, 1/36], dtype=np.float64)


def inivel(d, x, y):
    # Initial velocity: small sinusoidal perturbation on the x component only.
    return (1 - d) * uLB * (1 + 1e-4 * sin(y / ly * 2 * pi))

@guvectorize(["void(float64[:,:], float64[:,:,:], float64[:,:], float64[:], float64[:,:,:])"], "(nx,ny),(D,nx,ny),(D,i),(i) -> (i,nx,ny)", target="cuda")
def equilibrium_gpu(rho, u, c, w, feq ):              # Equilibrium distribution function.
    nx2 = rho.shape[0]
    ny2 = rho.shape[1]
    
    for i in range(9):
        for j in range(0,nx2):
            for k in range(0,ny2):
                feq[i,j,k] = rho[j,k]*w[i] * (1 + (3 * (c[0,i]*u[0,j,k] + c[1,i]*u[1,j,k])) + 0.5*(3 * (c[0,i]*u[0,j,k] + c[1,i]*u[1,j,k]))**2 - (3/2 * (u[0,j,k]**2 + u[1,j,k]**2)))


@cuda.jit
def equilibrium_gpu2(rho, u, c, w, feq):       # Equilibrium distribution function, one thread per (j, k) site.
    nx2 = rho.shape[0]
    ny2 = rho.shape[1]

    j, k = cuda.grid(2)
    if (j < nx2) & (k < ny2):
        for i in range(9):
            feq[i, j, k] = rho[j,k]*w[i] * (1 + (3 * (c[0,i]*u[0,j,k] + c[1,i]*u[1,j,k])) + 0.5*(3 * (c[0,i]*u[0,j,k] + c[1,i]*u[1,j,k]))**2 - (3/2 * (u[0,j,k]**2 + u[1,j,k]**2)))


vel = fromfunction(inivel, (2,nx,ny))
rho = np.ones([nx, ny], dtype='float64')
res = np.zeros([D, nx, ny], dtype='float64')

rho_device = cuda.to_device(rho)
u_device = cuda.to_device(vel)
c_device = cuda.to_device(c)
w_device = cuda.to_device(w)
feq_device = cuda.device_array(shape=(D,nx,ny,), dtype=np.float64)
feq_device2 = cuda.device_array(shape=(D,nx,ny,), dtype=np.float64)

threadsperblock = (64, 16)
blockspergrid_x = math.ceil(nx / threadsperblock[0])
blockspergrid_y = math.ceil(ny / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

s = timer()
equilibrium_gpu(rho_device,u_device,c_device,w_device,out=feq_device)
cuda.synchronize()
print(timer() - s)

s = timer()
equilibrium_gpu2[blockspergrid, threadsperblock](rho_device,u_device,c_device,w_device,feq_device2)
cuda.synchronize()
print(timer() - s)

s = timer()
feq_host = feq_device.copy_to_host()
print(timer()-s)

s = timer()
feq_host2 = feq_device2.copy_to_host()
print(timer()-s)

feq_host == feq_host2

The output looks like this:

57.212994
0.21147969999999816
0.29927709999999763
0.2992276999999959
array([[[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       ...,

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]],

       [[ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        ...,
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True],
        [ True,  True,  True, ...,  True,  True,  True]]])

So they really do compute the same thing, but the @guvectorize version takes much longer (about 57 s versus roughly 0.2 s for the @cuda.jit kernel).
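My guess (please correct me if I’m wrong) is that the problem is my layout string: since the whole (nx, ny) grid is part of the core dimensions, every input matches its core shape exactly, there is no broadcast dimension left, and the whole triple loop runs inside a single core-function call, essentially on one thread. Below is an untested sketch of an alternative layout where the core only handles one row of the lattice, so the leading nx dimension becomes the broadcast dimension; note that it assumes the arrays are rearranged so the core dimensions are trailing (u as (nx, 2, ny) and feq as (nx, 9, ny) instead of (2, nx, ny) and (9, nx, ny)):

# Untested sketch: the core signature covers one row (one x index), so the
# broadcast, and therefore the GPU parallelism, is over the leading nx axis.
@guvectorize(
    ["void(float64[:], float64[:,:], float64[:,:], float64[:], float64[:,:])"],
    "(ny),(d,ny),(d,i),(i)->(i,ny)",
    target="cuda",
)
def equilibrium_gpu3(rho, u, c, w, feq):
    # Per call: rho is (ny,), u is (2, ny), c is (2, 9), w is (9,), feq is (9, ny).
    for k in range(rho.shape[0]):
        usqr = u[0, k]**2 + u[1, k]**2
        for i in range(w.shape[0]):
            cu = 3 * (c[0, i] * u[0, k] + c[1, i] * u[1, k])
            feq[i, k] = rho[k] * w[i] * (1 + cu + 0.5 * cu**2 - 1.5 * usqr)

# Hypothetical call, with u and feq rearranged so the core dims are trailing:
u_swapped = cuda.to_device(np.ascontiguousarray(np.moveaxis(vel, 1, 0)))  # (nx, 2, ny)
feq_swapped = cuda.device_array(shape=(nx, D, ny), dtype=np.float64)      # (nx, 9, ny)
equilibrium_gpu3(rho_device, u_swapped, c_device, w_device, out=feq_swapped)
# np.moveaxis(feq_swapped.copy_to_host(), 1, 0) should then match feq_host.

Even then, I suppose the parallelism would only be over the nx rows (20000 of them) instead of over every lattice site like in the @cuda.jit kernel, so I don’t know if it can match the hand-written kernel, and I’m not sure this is the idiomatic way to write it.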
Any help would be awesome; I’m really trying to learn all this.

Thanks
Bruno