Dear all,
I am trying to learn how to properly use signatures for guvectorize in order to leverage CUDA, but I am running into some issues and I was hoping someone has a better clue than I do.
From the Numba documentation (Types and signatures — Numba documentation) I understand that:
- a signature such as "float64(int32, int32)" specifies a function taking two 32-bit integers and returning a double-precision float;
- in the case of guvectorize I must always return void, as explained here: python - Numba.vectorize for CUDA: What is the correct signature to return arrays? - Stack Overflow
So I created this example:
```python
import numpy as np

a = np.ones((8, 4), dtype=np.float32)
b = np.ones((8, 4), dtype=np.float32)
c = np.ones((8, 4, 10), dtype=np.float32)
d = np.ones(shape=(10,), dtype=np.float32)

def update_a(a, b, c, d):  # Works properly
    i_c, j_c, k_c = c.shape
    for k in range(k_c):  # Loop over k
        num_sum = 0
        for j in range(j_c):  # Loop over j
            for i in range(i_c):  # Loop over i
                a_val = a[i, j]
                b_val = b[i, j]
                c_val = c[i, j, k]
                # num_sum = a[i_a, j_a] + b[i_b, j_b] + c[i_c, j_c, k_c_l]
                num_sum = a_val + b_val + c_val
        d[k] = num_sum
    return d
```
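One thing I noticed while testing: inside the loops num_sum is assigned, not accumulated, so each d[k] ends up as the value from the last (i, j) pair. A quick NumPy sanity check of the value I expect (the name d_ref is my own):

```python
import numpy as np

a = np.ones((8, 4), dtype=np.float32)
b = np.ones((8, 4), dtype=np.float32)
c = np.ones((8, 4, 10), dtype=np.float32)

# num_sum is overwritten on every iteration, so only the last (i, j) pair
# survives: d[k] == a[-1, -1] + b[-1, -1] + c[-1, -1, k]
d_ref = a[-1, -1] + b[-1, -1] + c[-1, -1, :]
print(d_ref)  # [3. 3. 3. 3. 3. 3. 3. 3. 3. 3.]
```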
Then I first checked how it works targeting the CPU, which already raised a TypeError:

```
TypeError: type and shape signature mismatch for arg #1
```
```python
from numba import guvectorize

@guvectorize(['void(float32[:,:], float32[:,:], float32[:,:,:], float32[:])'],
             '(n,m,z)->(z)', nopython=True, target='cpu')
def update_a_cuda(a, b, c, d):
    i_c, j_c, k_c = c.shape
    for k in range(k_c):  # Loop over k
        num_sum = 0
        for j in range(j_c):  # Loop over j
            for i in range(i_c):  # Loop over i
                a_val = a[i, j]
                b_val = b[i, j]
                c_val = c[i, j, k]
                num_sum = a_val + b_val + c_val
        d[k] = num_sum
```
Of course, when I target CUDA instead, I also get an error:

```
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<DeviceFunction py_func=<function update_a_cuda at 0x0000024A28F31B80>
signature=(array(float32, 2d, A), array(float32, 2d, A), array(float32, 3d, A), array(float32, 1d, A)) -> none>)
found for signature:

>>> <unknown function>(array(float32, 3d, A), array(float32, 1d, A))

There are 2 candidate implementations:
- Of which 2 did not match due to:
  Type Restricted Function in function 'unknown': File: unknown: Line unknown.
  With argument(s): '(array(float32, 3d, A), array(float32, 1d, A))':
  No match for registered cases:
  * (array(float32, 2d, A), array(float32, 2d, A), array(float32, 3d, A), array(float32, 1d, A)) -> none

During: resolving callee type: Function(<DeviceFunction py_func=<function update_a_cuda at 0x0000024A28F31B80>
signature=(array(float32, 2d, A), array(float32, 2d, A), array(float32, 3d, A), array(float32, 1d, A)) -> none>)
During: typing of call at <string> (5)
```
```python
from numba import guvectorize

@guvectorize(['void(float32[:,:], float32[:,:], float32[:,:,:], float32[:])'],
             '(n,m,z)->(z)', nopython=True, target='cuda')
def update_a_cuda(a, b, c, d):
    i_c, j_c, k_c = c.shape
    for k in range(k_c):  # Loop over k
        num_sum = 0
        for j in range(j_c):  # Loop over j
            for i in range(i_c):  # Loop over i
                a_val = a[i, j]
                b_val = b[i, j]
                c_val = c[i, j, k]
                num_sum = a_val + b_val + c_val
        d[k] = num_sum
```
As an alternative I have also tried cuda.jit, but I am missing something there too, because the result is different from what I expected.
```python
import math
import numpy as np
from numba import cuda

@cuda.jit('void(float32[:,:], float32[:,:], float32[:,:,:], float32[:])')
def add_cuda_func(a, b, c, d):
    i_c, j_c, k_c = cuda.grid(3)
    for k in range(k_c):  # Loop over k
        num_sum = 0
        for j in range(j_c):  # Loop over j
            for i in range(i_c):  # Loop over i
                a_val = a[i, j]
                b_val = b[i, j]
                c_val = c[i, j, k]
                num_sum = a_val + b_val + c_val
        d[k] = num_sum

device = cuda.get_current_device()
a = np.ones((8, 4), dtype=np.float32)
b = np.ones((8, 4), dtype=np.float32)
c = np.ones((8, 4, 10), dtype=np.float32)
d = np.ones(shape=(10,), dtype=np.float32)

# max_tpb = device.WARP_SIZE  # Warp size in threads, for the 1-D case
# bpg = int(np.ceil(float(n) / max_tpb))  # Blocks per grid, for the 1-D case
tpb = (10, 10, 10)  # Threads per block
blockspergrid_x = math.ceil(c.shape[0] / tpb[0])
blockspergrid_y = math.ceil(c.shape[1] / tpb[1])
blockspergrid_z = math.ceil(c.shape[2] / tpb[2])
bpg = (blockspergrid_x, blockspergrid_y, blockspergrid_z)  # Blocks per grid
print('Blocks per grid:', bpg)
print('Threads per block:', tpb)

# Outputs a wrong array([0., 0., 0., 0., 0., 0., 0., 0., 0., 1.], dtype=float32);
# expected was array([3., 3., 3., 3., 3., 3., 3., 3., 3., 3.], dtype=float32)
add_cuda_func[bpg, tpb](a, b, c, d)
```
Does anyone have a clue about what I could try, or any material I could look at?
Thank you a lot!