Optimizing Code Further, CUDA Jit?

I’m trying to optimize the following piece of code to have it run faster. I’m currently using the @njit decorator and while it provides some speedup, it is still slower than I’d like.

So I want to rewrite the following code into something that can run on a CUDA GPU, but I’m unsure how to accomplish it.

I know I’ll need a decorator along the lines of @cuda.jit(argtypes=[int64, float64[:,:], float64[:,:], float64[:,:], int64, float64[:,:,:], int64]), but past that I’m unsure how to use cuda.gridsize, cuda.grid, and the other required pieces to get this running on a GPU.

import numpy as np
from numba import njit

@njit(cache=True, fastmath=True, nogil=True)
def tauLoop(Ptrain, TrainData, C, Mi, numFeats, detection_stream, channels):
    for tau in range(1, 121):
        print(tau)  # progress indicator
        for i in range(0, Ptrain):
            for n in range(1, channels + 1):
                test_stat = np.empty((1, 8))
                firstPos = numFeats * (n - 1)
                secondPos = numFeats * n
                f_sensor_n = TrainData[firstPos:secondPos, i]
                C_sensor_n = C[firstPos:secondPos, firstPos:secondPos]
                invertCsensor = np.linalg.inv(C_sensor_n)

                for k in range(0, 8):
                    Mi_sensor_n = Mi[firstPos:secondPos, k]
                    subTracted = f_sensor_n - Mi_sensor_n
                    leftSide = subTracted.conj().T
                    rightSide = invertCsensor.dot(subTracted)
                    calcVal = np.dot(leftSide, rightSide)
                    test_stat[0, k] = calcVal

                min_ts = min(test_stat[0])

                if min_ts > tau:
                    detection_stream[tau - 1, n - 1, i] = 1
                else:
                    detection_stream[tau - 1, n - 1, i] = 0

    return detection_stream

Ptrain = 3800
channels = 192
TrainData = np.random.rand(576, Ptrain)
C = np.random.rand(576, 576)
Mi = np.random.rand(576, 576)
detection_stream = np.zeros((120, channels,Ptrain), np.float64)
detection_stream_found = tauLoop(Ptrain, TrainData, C, Mi, 3, detection_stream, channels)

Any help would be super appreciated; online tutorials are a little scarce and the documentation is hard to follow for this use case.

EDIT: I’ve made some progress but here is where I’m getting stuck:

import math
import numpy as np
from numba import cuda

@cuda.jit
def tauLoop(Ptrain, TrainData, C, Mi, numFeats, detection_stream, channels):

    nz, nx, ny = detection_stream.shape
    tau, n, i = cuda.grid(3)  # axis order matches the (tau, channel, sample) grid below

    if tau < nz and n < nx and i < ny:
        firstPos = numFeats * n        # n starts at 0 on the GPU, so no (n - 1)
        secondPos = numFeats * (n + 1)
        f_sensor_n = TrainData[firstPos:secondPos, i]
        C_sensor_n = C[firstPos:secondPos, firstPos:secondPos]
        #invertCsensor = np.linalg.inv(C_sensor_n)   # not supported in the CUDA target

        for k in range(0, 8):
            Mi_sensor_n = Mi[firstPos:secondPos, k]
            """
            #subTracted = f_sensor_n - Mi_sensor_n
            leftSide = (subTracted).conj().T
            rightSide = invertCsensor.dot(subTracted)
            calcVal = np.dot(leftSide, rightSide)
            test_stat.append(calcVal)
            """

        min_ts = 1

        if min_ts > 0:
            detection_stream[tau, n, i] = 1
        else:
            detection_stream[tau, n, i] = 0

Ptrain = 3800
channels = 192
TrainData = np.random.rand(576, Ptrain)
C = np.random.rand(576, 576)
Mi = np.random.rand(576, 576)
an_array = np.zeros((120, channels, Ptrain), np.int32)
threadsperblock = (8, 8, 8)

blockspergrid_x = math.ceil(an_array.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(an_array.shape[1] / threadsperblock[1])
blockspergrid_z = math.ceil(an_array.shape[2] / threadsperblock[2])
blockspergrid = (blockspergrid_x, blockspergrid_y, blockspergrid_z)

tauLoop[blockspergrid, threadsperblock](Ptrain, TrainData, C, Mi, 3, an_array, channels)

So this little snippet runs, but NumPy’s linalg.inv is not supported on the CUDA target, and the operations taking place within the k loop don’t work either. Any ideas on how to overcome this?

NumPy operations that are not supported in the CUDA target need to be rewritten in terms of loops and operations on individual elements. I haven’t had time to translate all of the above, but a couple of ideas for starting points might be (noting that the below code is untested, but illustrates the general idea):

  1. numpy.linalg.inv: It looks like your matrix is 3x3, so you could use a function like this:
@njit
def invert_3x3_matrix(m, res):
    a = m[0, 0]
    b = m[0, 1]
    c = m[0, 2]
    d = m[1, 0]
    e = m[1, 1]
    f = m[1, 2]
    g = m[2, 0]
    h = m[2, 1]
    i = m[2, 2]

    D = 1.0 / (a*(e*i - f*h) - b*(d*i - f*g) + c*(d*h - e*g))

    a1 = D * (e*i - f*h)
    b1 = D * (c*h - b*i)
    c1 = D * (b*f - c*e)
    d1 = D * (f*g - d*i)
    e1 = D * (a*i - c*g)
    f1 = D * (c*d - a*f)
    g1 = D * (d*h - e*g)
    h1 = D * (b*g - a*h)
    i1 = D * (a*e - b*d)

    res[0, 0] = a1
    res[0, 1] = b1
    res[0, 2] = c1
    res[1, 0] = d1
    res[1, 1] = e1
    res[1, 2] = f1
    res[2, 0] = g1
    res[2, 1] = h1
    res[2, 2] = i1

Then call it by declaring a local array for the result, e.g.:

invertCsensor = cuda.local.array((3, 3), np.float64)
invert_3x3_matrix(C_sensor_n, invertCsensor)
  2. np.dot: For a dot product, you could rewrite it like:
calcVal = 0.0
for i in range(len(leftSide)):
    calcVal += leftSide[i] * rightSide[i]

I hope this helps give the idea - I think you’ll also need to rewrite conj with a loop as well.
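It can also help to validate the element-wise arithmetic on the host before moving it into a kernel. The sketch below is plain Python/NumPy (no numba, so it runs anywhere); the same scalar loops transfer directly into a @cuda.jit kernel with cuda.local.array scratch space. quadratic_form is just an illustrative helper name:

```python
import numpy as np

def invert_3x3_matrix(m, res):
    # adjugate / determinant formula, written element-wise so it can be
    # transplanted into a device function unchanged
    a, b, c = m[0, 0], m[0, 1], m[0, 2]
    d, e, f = m[1, 0], m[1, 1], m[1, 2]
    g, h, i = m[2, 0], m[2, 1], m[2, 2]
    D = 1.0 / (a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g))
    res[0, 0] = D * (e * i - f * h)
    res[0, 1] = D * (c * h - b * i)
    res[0, 2] = D * (b * f - c * e)
    res[1, 0] = D * (f * g - d * i)
    res[1, 1] = D * (a * i - c * g)
    res[1, 2] = D * (c * d - a * f)
    res[2, 0] = D * (d * h - e * g)
    res[2, 1] = D * (b * g - a * h)
    res[2, 2] = D * (a * e - b * d)

def quadratic_form(f, m, invC):
    # (f - m)^T @ invC @ (f - m) using only scalar loops
    n = len(f)
    sub = [0.0] * n                  # a cuda.local.array in the kernel
    for q in range(n):
        sub[q] = f[q] - m[q]         # conj() is a no-op for real data
    acc = 0.0
    for q in range(n):
        for b in range(n):
            acc += sub[q] * invC[q, b] * sub[b]
    return acc

# host-side check against NumPy
rng = np.random.default_rng(0)
Cn = rng.random((3, 3)) + 3.0 * np.eye(3)   # keep it comfortably invertible
inv = np.empty((3, 3))
invert_3x3_matrix(Cn, inv)
print(np.allclose(inv, np.linalg.inv(Cn)))

fv, mv = rng.random(3), rng.random(3)
ref = (fv - mv) @ np.linalg.inv(Cn) @ (fv - mv)
print(abs(quadratic_form(fv, mv, inv) - ref) < 1e-10)
```

Note that the difference vector is filled completely before the double accumulation loop runs; interleaving the two reads entries that haven’t been set yet.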

Thank you for all the guidance! It’s given me a lot of pointers to work towards a cuda implementation of this code.

This is what I have so far. I believe it’s working, but the calcVal values don’t match up with the regular version of this code and I’m not quite sure why.

from numba import cuda, float64

@cuda.jit
def tauLoop(Ptrain, TrainData, C, Mi, numFeats, detection_stream, channels):

    nz, nx, ny = detection_stream.shape
    tau, n, i = cuda.grid(3)

    if tau < nz and n < nx and i < ny:
        firstPos = numFeats * n
        secondPos = numFeats * (n + 1)
        f_sensor_n = TrainData[firstPos:secondPos, i]
        C_sensor_n = C[firstPos:secondPos, firstPos:secondPos]
        invertCsensor = cuda.local.array((3, 3), dtype=float64)
        invert_3x3_matrix(C_sensor_n, invertCsensor)

        test_stat = cuda.local.array(8, dtype=float64)
        subTracted = cuda.local.array(3, dtype=float64)
        for k in range(0, 8):
            Mi_sensor_n = Mi[firstPos:secondPos, k]

            # fill the whole difference vector first -- accumulating in the
            # same loop reads entries of subTracted before they are set
            for q in range(3):
                subTracted[q] = f_sensor_n[q] - Mi_sensor_n[q]

            # full quadratic form: subTracted^T @ invertCsensor @ subTracted
            calcVal = 0.0
            for q in range(3):
                for b in range(3):
                    calcVal += subTracted[q] * invertCsensor[q, b] * subTracted[b]

            test_stat[k] = calcVal

        # min over test_stat, written as a loop for the CUDA target
        min_ts = test_stat[0]
        for k in range(1, 8):
            if test_stat[k] < min_ts:
                min_ts = test_stat[k]

        if min_ts > tau:
            detection_stream[tau, n, i] = 1
        else:
            detection_stream[tau, n, i] = 0
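One way to track down a calcVal mismatch like this is to compute the same statistic for a single (i, n, k) triple on the host with plain NumPy and compare it against what the kernel stores. A minimal reference sketch (reference_test_stat is a hypothetical helper name; shapes assumed as in the snippets above):

```python
import numpy as np

def reference_test_stat(TrainData, C, Mi, numFeats, i, n, k):
    # host-side reference: (f - m)^T @ inv(C_n) @ (f - m) for one (i, n, k)
    first = numFeats * n
    second = numFeats * (n + 1)
    f = TrainData[first:second, i]
    invC = np.linalg.inv(C[first:second, first:second])
    sub = f - Mi[first:second, k]
    return sub @ invC @ sub
```

Comparing this against the kernel’s test_stat values for a few indices narrows the problem down to either the inversion or the loop structure. Two pitfalls worth checking in the original loop: filling subTracted and accumulating the product inside the same q loop reads entries before they are set, and invertCsensor[b][q] indexes the transposed inverse; both skew calcVal relative to the CPU version.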