I’m trying to optimize the following piece of code to have it run faster. I’m currently using the @njit decorator and while it provides some speedup, it is still slower than I’d like.
So I wanted to write the following code into something that can be run off a cuda GPU but I’m unsure as to how to accomplish it.
I know I’ll need to use the following decorator @cuda.jit(argtypes=[int64, float64[:,:], float64[:,:], float64[:,:], int64, float64[:,:,:], int64]) but past that I’m unsure as to how to use cuda.gridsize, cuda.grid and other required pieces of code to get this running on a gpu.
import numpy as np
from numba import njit, prange, vectorize
@njit(cache=True,fastmath=True,nogil=True)
def tauLoop(Ptrain, TrainData, C, Mi, numFeats, detection_stream, channels):
for tau in range(1, 121):
print(tau)
for i in range (0, Ptrain):
for n in range(1, channels+1):
test_stat = np.empty((1, 8))
firstPos = numFeats*(n-1)
secondPos = numFeats*n
f_sensor_n = TrainData[firstPos:secondPos, i]
C_sensor_n = (C[firstPos:secondPos, firstPos:secondPos])
invertCsensor = np.linalg.inv(C_sensor_n)
for k in range (0, 8):
Mi_sensor_n = Mi[firstPos:secondPos, k]
subTracted = ((f_sensor_n-Mi_sensor_n))
leftSide = (subTracted).conj().T
rightSide = invertCsensor.dot(subTracted)
calcVal = np.dot(leftSide, rightSide)
test_stat[0, k] = (calcVal)
min_ts = min(test_stat[0])
if min_ts > tau:
detection_stream[tau-1,n-1,i] = 1
else:
detection_stream[tau-1,n-1,i] = 0
return detection_stream
Ptrain = 3800
channels = 192
TrainData = np.random.rand(576, Ptrain)
C = np.random.rand(576, 576)
Mi = np.random.rand(576, 576)
detection_stream = np.zeros((120, channels,Ptrain), np.float64)
detection_stream_found = tauLoop(Ptrain, TrainData, C, Mi, 3, detection_stream, channels)
Any help would be super appreciated, online tutorials are a little scarce and documentation is a hard to follow to implement in this use case.
EDIT: I’ve made some progress but here is where I’m getting stuck:
@cuda.jit
def tauLoop(Ptrain, TrainData, C, Mi, numFeats, detection_stream, channels):
nz,nx, ny = detection_stream.shape
tau, i, n = cuda.grid(3)
if i < ny and n < nx and tau < nz:
firstPos = numFeats*(n-1)
secondPos = numFeats*n
f_sensor_n = TrainData[firstPos:secondPos, i]
C_sensor_n = (C[firstPos:secondPos, firstPos:secondPos])
#invertCsensor = np.linalg.inv(C_sensor_n)
for k in range (0, 8):
Mi_sensor_n = Mi[firstPos:secondPos, k]
"""
#subTracted = f_sensor_n - Mi_sensor_n
leftSide = (subTracted).conj().T
rightSide = invertCsensor.dot(subTracted)
calcVal = np.dot(leftSide, rightSide)
test_stat.append(calcVal)
"""
min_ts = 1
if min_ts > 0:
detection_stream[tau-1, n-1,i] = 1
else:
detection_stream[tau-1, n-1,i] = 0
Ptrain = 3800
channels = 192
TrainData = np.random.rand(576, Ptrain)
test_stat = np.empty((1, 8))
C = np.random.rand(576, 576)
Mi = np.random.rand(576, 576)
an_array = np.zeros((120, channels,Ptrain), np.int32)
threadsperblock = (8,8,8)
blockspergrid_x = math.ceil(an_array.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(an_array.shape[1] / threadsperblock[1])
blockspergrid_z = math.ceil(an_array.shape[2] / threadsperblock[2])
blockspergrid = (blockspergrid_x, blockspergrid_y,blockspergrid_z)
tauLoop[blockspergrid, threadsperblock](Ptrain, TrainData, C, Mi, 3, an_array, channels)
So this little snippet works but numpy’s linalg inverse is not supported and the operations taking place within the k loop doesn’t work either. Any ideas on how to overcome this?