Numba signatures for @guvectorize with target='cuda'

Dear all,
I am trying to learn how to properly use signatures for guvectorize so I can target CUDA, but I am running into some issues and I was hoping someone might have a better clue than I do.

From here (Types and signatures — Numba 0.52.0.dev0+274.g626b40e-py3.7-linux-x86_64.egg documentation) I understand how to declare the element types of each argument, e.g. float32[:,:] for a 2D float32 array.
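As I understand it, the @guvectorize declaration has two parts: a list of type signatures (the element types of each argument) and a layout string describing the dimensions. A minimal sketch of what I mean (just a made-up example, not my actual function):

from numba import guvectorize

@guvectorize(['void(float32[:], float32[:])'],  # type signature: element types of each argument
             '(n)->(n)',                        # layout string for the dimensions
             target='cpu')
def copy_vec(x, out):
    for i in range(x.shape[0]):
        out[i] = x[i]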

So I created this example:

import numpy as np

a = np.ones((8,4), dtype=np.float32)
b = np.ones((8,4), dtype=np.float32)
c = np.ones((8,4,10), dtype=np.float32)
d = np.ones(shape=(10,), dtype=np.float32)

def update_a(a,b,c,d):  # plain-Python reference version, works properly
    i_c,j_c,k_c = c.shape
    for k in range(k_c):#Loop in k_c
        num_sum=0
        for j in range(j_c):#Loop in j
            for i in range(i_c):#Loop in i
                a_val=a[i,j]
                b_val=b[i,j]
                c_val=c[i,j,k]
                # num_sum=a[i_a,j_a]+b[i_b,j_b]+c[i_c,j_c,k_c_l]
                num_sum=a_val+b_val+c_val
        d[k]=num_sum
    return d
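
For reference, calling this plain-Python version on the arrays above gives the output I expect (every element of d ends up as 3.0, since a, b and c are all ones):

print(update_a(a, b, c, d))
# -> [3. 3. 3. 3. 3. 3. 3. 3. 3. 3.]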

Then I first tried to check how it works targeting the CPU, which already raised a TypeError:

TypeError: type and shape signature mismatch for arg #1

@guvectorize(['void(float32[:,:], float32[:,:], float32[:,:,:], float32[:])'],
             '(n,m,z)->(z)', nopython=True, target='cpu')
def update_a_cuda(a,b,c,d):
    i_c,j_c,k_c = c.shape
    for k in range(k_c):#Loop in k_c
        num_sum=0
        for j in range(j_c):#Loop in j
            for i in range(i_c):#Loop in i
                a_val=a[i,j]
                b_val=b[i,j]
                c_val=c[i,j,k]
                num_sum=a_val+b_val+c_val
        d[k]=num_sum

Unsurprisingly, when I then target CUDA I also get an error:

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<DeviceFunction py_func=<function update_a_cuda at 0x0000024A28F31B80> signature=(array(float32, 2d, A), array(float32, 2d, A), array(float32, 3d, A), array(float32, 1d, A)) -> none>) found for signature:
>>> <unknown function>(array(float32, 3d, A), array(float32, 1d, A))
There are 2 candidate implementations:
- Of which 2 did not match due to:
  Type Restricted Function in function 'unknown': File: unknown: Line unknown.
  With argument(s): '(array(float32, 3d, A), array(float32, 1d, A))':
  No match for registered cases:
   * (array(float32, 2d, A), array(float32, 2d, A), array(float32, 3d, A), array(float32, 1d, A)) -> none
During: resolving callee type: Function(<DeviceFunction py_func=<function update_a_cuda at 0x0000024A28F31B80> signature=(array(float32, 2d, A), array(float32, 2d, A), array(float32, 3d, A), array(float32, 1d, A)) -> none>)
During: typing of call at <string> (5)

from numba import guvectorize
@guvectorize(['void(float32[:,:], float32[:,:], float32[:,:,:], float32[:])'],
             '(n,m,z)->(z)', nopython=True, target='cuda')
def update_a_cuda(a,b,c,d):
    i_c,j_c,k_c = c.shape
    for k in range(k_c):#Loop in k_c
        num_sum=0
        for j in range(j_c):#Loop in j
            for i in range(i_c):#Loop in i
                a_val=a[i,j]
                b_val=b[i,j]
                c_val=c[i,j,k]
                num_sum=a_val+b_val+c_val
        d[k]=num_sum

As an alternative I have also tried to use cuda.jit, however I am missing something, because the result is different from what I expected.

from numba import cuda
@cuda.jit('void(float32[:,:], float32[:,:], float32[:,:,:], float32[:])')
def add_cuda_func(a,b,c,d):
    i_c,j_c,k_c = cuda.grid(3)
    for k in range(k_c):#Loop in k_c
        num_sum=0
        for j in range(j_c):#Loop in j
            for i in range(i_c):#Loop in i
                a_val=a[i,j]
                b_val=b[i,j]
                c_val=c[i,j,k]
                num_sum=a_val+b_val+c_val
        d[k]=num_sum

import math

device = cuda.get_current_device()

a = np.ones((8,4), dtype=np.float32)
b = np.ones((8,4), dtype=np.float32)
c = np.ones((8,4,10), dtype=np.float32)
d = np.ones(shape=(10,), dtype=np.float32)

# max_tpb = device.WARP_SIZE  # warp size in threads, for the 1D case
# bpg = int(np.ceil(float(n)/max_tpb))  # blocks per grid for the 1D case

tpb = (10,10,10)  # threads per block; rule of thumb: sum up to 32

blockspergrid_x = math.ceil(c.shape[0] / tpb[0])
blockspergrid_y = math.ceil(c.shape[1] / tpb[1])
blockspergrid_z = math.ceil(c.shape[2] / tpb[2])
bpg = (blockspergrid_x, blockspergrid_y, blockspergrid_z)  # blocks per grid in x, y, z

print('Blocks per grid:', bpg)
print('Threads per block', tpb)

# Outputs the wrong array([0., 0., 0., 0., 0., 0., 0., 0., 0., 1.], dtype=float32);
# expected was array([3., 3., 3., 3., 3., 3., 3., 3., 3., 3.], dtype=float32)
add_cuda_func[bpg, tpb](a, b, c, d)

Does anyone have a clue about what I could try? Or any material I could read?

Thank you a lot!

Hi,

I think there is just a small problem with the “dimension string” (the layout string).
The following seems to work for me (I can’t test CUDA on my laptop):

@guvectorize(['void(float32[:,:], float32[:,:], float32[:,:,:], float32[:])'],
             '(n, m), (n, m), (n,m,z)->(z)', nopython=True, target='cpu')
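
For completeness, the full version is just your function with that layout string; nothing else changed (and I would expect target='cuda' to only need that keyword swapped, although I can’t verify it here):

import numpy as np
from numba import guvectorize

@guvectorize(['void(float32[:,:], float32[:,:], float32[:,:,:], float32[:])'],
             '(n, m), (n, m), (n,m,z)->(z)', nopython=True, target='cpu')
def update_a_cuda(a, b, c, d):
    i_c, j_c, k_c = c.shape
    for k in range(k_c):
        num_sum = 0
        for j in range(j_c):
            for i in range(i_c):
                num_sum = a[i, j] + b[i, j] + c[i, j, k]
        d[k] = num_sum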

OMG, @Hannes, you are a genius! Thank you so much! It worked like a charm!
Just to make sure I got the logic:
The entries in the layout string relate to the dimensions of each of the passed arrays. So in this case:

(n, m) relates to float32[:,:] (first array),
(n, m) relates to float32[:,:] (second array),
(n,m,z) relates to float32[:,:,:] (third array),
->(z) indicates the dimensions of my output (the fourth array, d).
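
So, if I follow the logic, a hypothetical matrix-vector product (names made up, just to check my understanding) would be written like this:

from numba import guvectorize

@guvectorize(['void(float32[:,:], float32[:], float32[:])'],
             '(n,m),(m)->(n)', target='cpu')
def matvec(mat, vec, out):
    # (n,m) matrix times (m) vector gives an (n) vector
    for i in range(mat.shape[0]):
        acc = 0.0
        for j in range(mat.shape[1]):
            acc += mat[i, j] * vec[j]
        out[i] = acc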

You’re welcome :slight_smile:

Yes, that is correct. I assumed the dimensions based on the assignments in your initial code, so I hope I got them right :stuck_out_tongue: Think of them like the row/col/layer dimensions of matrices/tensors. The gufunc needs to know how the arrays relate to each other shape-wise.

One more tip, because it isn’t always obvious:
The function is still called with only a, b and c and returns d, even though there is no explicit return statement in your function. That part is organised by Numba. (At least when working with the CPU; I would be surprised if it were different on the GPU.)

d = update_a_cuda(a, b, c)
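
If I remember correctly, you can also pre-allocate the output yourself and pass it in explicitly, like with any NumPy gufunc:

d = np.empty(10, dtype=np.float32)
update_a_cuda(a, b, c, d)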

@Hannes Thanks! You really saved my day! Also thanks for the tip, now it makes a lot more sense how the return works haha, I was almost starting to believe in CUDA magic.