Dear all,
I am trying to learn how to properly use signatures for guvectorize in order to leverage CUDA, but I am running into some issues and I was hoping someone has a better clue than I do.
From the Numba documentation (Types and signatures — Numba documentation) I understand that:
- a signature such as "float64(int32, int32)" specifies a function taking two 32-bit integers and returning a double-precision float;
- in the case of guvectorize I must always return void, as explained here: python - Numba.vectorize for CUDA: What is the correct signature to return arrays? - Stack Overflow
So I created this example:
```python
import numpy as np

a = np.ones((8, 4), dtype=np.float32)
b = np.ones((8, 4), dtype=np.float32)
c = np.ones((8, 4, 10), dtype=np.float32)
d = np.ones(shape=(10,), dtype=np.float32)

def update_a(a, b, c, d):  # Works properly
    i_c, j_c, k_c = c.shape
    for k in range(k_c):  # Loop over k
        num_sum = 0
        for j in range(j_c):  # Loop over j
            for i in range(i_c):  # Loop over i
                a_val = a[i, j]
                b_val = b[i, j]
                c_val = c[i, j, k]
                # num_sum = a[i_a, j_a] + b[i_b, j_b] + c[i_c, j_c, k_c_l]
                num_sum = a_val + b_val + c_val
        d[k] = num_sum
    return d
```
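One thing I noticed while testing: inside the loops num_sum is assigned, not accumulated, so each d[k] ends up as the value from the last (i, j) pair. A quick NumPy sanity check of the value I expect (the name d_ref is my own):

```python
import numpy as np

a = np.ones((8, 4), dtype=np.float32)
b = np.ones((8, 4), dtype=np.float32)
c = np.ones((8, 4, 10), dtype=np.float32)

# num_sum is overwritten on every iteration, so only the last (i, j) pair
# survives: d[k] == a[-1, -1] + b[-1, -1] + c[-1, -1, k]
d_ref = a[-1, -1] + b[-1, -1] + c[-1, -1, :]
print(d_ref)  # [3. 3. 3. 3. 3. 3. 3. 3. 3. 3.]
```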
Then I first checked how it works targeting the CPU, which already raised a TypeError:

```
TypeError: type and shape signature mismatch for arg #1
```
```python
from numba import guvectorize

@guvectorize(['void(float32[:,:], float32[:,:], float32[:,:,:], float32[:])'],
             '(n,m,z)->(z)', nopython=True, target='cpu')
def update_a_cuda(a, b, c, d):
    i_c, j_c, k_c = c.shape
    for k in range(k_c):  # Loop over k
        num_sum = 0
        for j in range(j_c):  # Loop over j
            for i in range(i_c):  # Loop over i
                a_val = a[i, j]
                b_val = b[i, j]
                c_val = c[i, j, k]
                num_sum = a_val + b_val + c_val
        d[k] = num_sum
```
Of course, when I target CUDA instead, I also get an error:

```
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<DeviceFunction py_func=<function update_a_cuda at 0x0000024A28F31B80>
signature=(array(float32, 2d, A), array(float32, 2d, A), array(float32, 3d, A), array(float32, 1d, A)) -> none>)
found for signature:

>>> <unknown function>(array(float32, 3d, A), array(float32, 1d, A))

There are 2 candidate implementations:
- Of which 2 did not match due to:
  Type Restricted Function in function 'unknown': File: unknown: Line unknown.
  With argument(s): '(array(float32, 3d, A), array(float32, 1d, A))':
  No match for registered cases:
  * (array(float32, 2d, A), array(float32, 2d, A), array(float32, 3d, A), array(float32, 1d, A)) -> none

During: resolving callee type: Function(<DeviceFunction py_func=<function update_a_cuda at 0x0000024A28F31B80>
signature=(array(float32, 2d, A), array(float32, 2d, A), array(float32, 3d, A), array(float32, 1d, A)) -> none>)
During: typing of call at <string> (5)
```
```python
from numba import guvectorize

@guvectorize(['void(float32[:,:], float32[:,:], float32[:,:,:], float32[:])'],
             '(n,m,z)->(z)', nopython=True, target='cuda')
def update_a_cuda(a, b, c, d):
    i_c, j_c, k_c = c.shape
    for k in range(k_c):  # Loop over k
        num_sum = 0
        for j in range(j_c):  # Loop over j
            for i in range(i_c):  # Loop over i
                a_val = a[i, j]
                b_val = b[i, j]
                c_val = c[i, j, k]
                num_sum = a_val + b_val + c_val
        d[k] = num_sum
```
As an alternative I have also tried cuda.jit, but I am missing something there too, because the result is different from what I expected.
```python
import math
import numpy as np
from numba import cuda

@cuda.jit('void(float32[:,:], float32[:,:], float32[:,:,:], float32[:])')
def add_cuda_func(a, b, c, d):
    i_c, j_c, k_c = cuda.grid(3)
    for k in range(k_c):  # Loop over k
        num_sum = 0
        for j in range(j_c):  # Loop over j
            for i in range(i_c):  # Loop over i
                a_val = a[i, j]
                b_val = b[i, j]
                c_val = c[i, j, k]
                num_sum = a_val + b_val + c_val
        d[k] = num_sum

device = cuda.get_current_device()
a = np.ones((8, 4), dtype=np.float32)
b = np.ones((8, 4), dtype=np.float32)
c = np.ones((8, 4, 10), dtype=np.float32)
d = np.ones(shape=(10,), dtype=np.float32)

# max_tpb = device.WARP_SIZE  # Warp size in threads, for the 1-D case
# bpg = int(np.ceil(float(n) / max_tpb))  # Blocks per grid, for the 1-D case
tpb = (10, 10, 10)  # Threads per block
blockspergrid_x = math.ceil(c.shape[0] / tpb[0])
blockspergrid_y = math.ceil(c.shape[1] / tpb[1])
blockspergrid_z = math.ceil(c.shape[2] / tpb[2])
bpg = (blockspergrid_x, blockspergrid_y, blockspergrid_z)  # Blocks per grid
print('Blocks per grid:', bpg)
print('Threads per block:', tpb)

# Outputs a wrong array([0., 0., 0., 0., 0., 0., 0., 0., 0., 1.], dtype=float32);
# expected was array([3., 3., 3., 3., 3., 3., 3., 3., 3., 3.], dtype=float32)
add_cuda_func[bpg, tpb](a, b, c, d)
```
Does anyone have a clue about what I could try, or any material I could look at?
Thank you a lot!