Optimizing Code Further, CUDA Jit?

CodingNoob21 · July 21, 2020, 5:53pm

I’m trying to optimize the following piece of code to have it run faster. I’m currently using the @njit decorator and while it provides some speedup, it is still slower than I’d like.

So I wanted to write the following code into something that can be run off a cuda GPU but I’m unsure as to how to accomplish it.

I know I’ll need to use the following decorator @cuda.jit(argtypes=[int64, float64[:,:], float64[:,:], float64[:,:], int64, float64[:,:,:], int64]) but past that I’m unsure as to how to use cuda.gridsize, cuda.grid and other required pieces of code to get this running on a gpu.

import numpy as np
from numba import njit, prange, vectorize

@njit(cache=True,fastmath=True,nogil=True)
def tauLoop(Ptrain, TrainData, C, Mi, numFeats, detection_stream, channels):
   for tau in range(1, 121):
       print(tau)
       for i in range (0, Ptrain):
           for n in range(1, channels+1):
               test_stat = np.empty((1, 8))
               firstPos = numFeats*(n-1)
               secondPos = numFeats*n
               f_sensor_n = TrainData[firstPos:secondPos, i]
               C_sensor_n = (C[firstPos:secondPos, firstPos:secondPos])
               invertCsensor = np.linalg.inv(C_sensor_n)

               for k in range (0, 8):
                   Mi_sensor_n = Mi[firstPos:secondPos, k]
                   subTracted = ((f_sensor_n-Mi_sensor_n))
                   leftSide = (subTracted).conj().T
                   rightSide = invertCsensor.dot(subTracted)
                   calcVal = np.dot(leftSide, rightSide)
                   test_stat[0, k] = (calcVal)

               min_ts = min(test_stat[0])

               if min_ts > tau:
                   detection_stream[tau-1,n-1,i] = 1
               else:
                   detection_stream[tau-1,n-1,i] = 0


   return detection_stream

Ptrain = 3800
channels = 192
TrainData = np.random.rand(576, Ptrain)
C = np.random.rand(576, 576)
Mi = np.random.rand(576, 576)
detection_stream = np.zeros((120, channels,Ptrain), np.float64)
detection_stream_found = tauLoop(Ptrain, TrainData, C, Mi, 3, detection_stream, channels)

Any help would be super appreciated, online tutorials are a little scarce and documentation is a hard to follow to implement in this use case.

EDIT: I’ve made some progress but here is where I’m getting stuck:

@cuda.jit
def tauLoop(Ptrain, TrainData, C, Mi, numFeats, detection_stream, channels):

    nz,nx, ny = detection_stream.shape
    tau, i, n = cuda.grid(3)

    if i < ny and n < nx and tau < nz:
        firstPos = numFeats*(n-1)
        secondPos = numFeats*n
        f_sensor_n = TrainData[firstPos:secondPos, i]
        C_sensor_n = (C[firstPos:secondPos, firstPos:secondPos])
        #invertCsensor = np.linalg.inv(C_sensor_n)

        for k in range (0, 8):
            Mi_sensor_n = Mi[firstPos:secondPos, k]
            """
            #subTracted = f_sensor_n - Mi_sensor_n
            leftSide = (subTracted).conj().T
            rightSide = invertCsensor.dot(subTracted)
            calcVal = np.dot(leftSide, rightSide)
            test_stat.append(calcVal)
            """

        min_ts = 1

        if min_ts > 0:
            detection_stream[tau-1, n-1,i] = 1
        else:
            detection_stream[tau-1, n-1,i] = 0

Ptrain = 3800
channels = 192
TrainData = np.random.rand(576, Ptrain)
test_stat = np.empty((1, 8))
C = np.random.rand(576, 576)
Mi = np.random.rand(576, 576)
an_array = np.zeros((120, channels,Ptrain), np.int32)
threadsperblock = (8,8,8)

blockspergrid_x = math.ceil(an_array.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(an_array.shape[1] / threadsperblock[1])
blockspergrid_z = math.ceil(an_array.shape[2] / threadsperblock[2])
blockspergrid = (blockspergrid_x, blockspergrid_y,blockspergrid_z)

tauLoop[blockspergrid, threadsperblock](Ptrain, TrainData, C, Mi, 3, an_array, channels)

So this little snippet works but numpy’s linalg inverse is not supported and the operations taking place within the k loop doesn’t work either. Any ideas on how to overcome this?

gmarkall · July 23, 2020, 9:29am

For NumPy operations that are not supported in the CUDA target, they need to be rewritten in terms of loops and operations on individual elements. I haven’t had time to try and translate all of the above, but a couple of ideas for starting points might be (noting that the below code is untested, but illustrates the general idea):

numpy.linalg.inv: It looks like your matrix is 3x3, so you could use a function like this

@njit
def invert_3x3_matrix(m, res):
    a = m[0, 0]
    b = m[0, 1]
    c = m[0, 2]
    d = m[1, 0]
    e = m[1, 1]
    f = m[1, 2]
    g = m[2, 0]
    h = m[2, 1]
    i = m[2, 2]

    D = 1.0 / (a*(e*i - f*h) - b*(d*i - f*g) + c*(d*h - e*g))

    a1 = D * (e*i - f*h)
    b1 = D * (c*h - b*i)
    c1 = D * (b*f - c*e)
    d1 = D * (f*g - d*i)
    e1 = D * (a*i - c*g)
    f1 = D * (c*d - a*f)
    g1 = D * (d*h - e*g)
    h1 = D * (b*g - a*h)
    i1 = D * (a*e - b*d)

    res[0, 0] = a1
    res[0, 1] = b1
    res[0, 2] = c1
    res[1, 0] = d1
    res[1, 1] = e1
    res[1, 2] = f1
    res[2, 0] = g1
    res[2, 1] = h1
    res[2, 2] = i1

Then call it by declaring a local array for the result, e.g.:

invertCsensor = cuda.local.array((3, 3), np.float64)
invert_3x3_matrix(C_sensor_n, invertCsensor)

np.dot: For a dot product, you could rewrite like:

calcVal = 0.0
for i in range(len(leftSide)):
    calcVal += leftSide[i] * rightSide[i]

I hope this helps give the idea - I think you’ll also need to rewrite conj with a loop as well.

CodingNoob21 · July 23, 2020, 4:48pm

Thank you for all the guidance! It’s given me a lot of pointers to work towards a cuda implementation of this code.

This is what I have so far and I believe it’s working but the calcVal values don’t match up with the regular version of this code and I’m not quite too sure why.

@cuda.jit
def tauLoop(Ptrain, TrainData, C, Mi, numFeats, detection_stream, channels):

    nz,nx, ny = detection_stream.shape
    tau, i, n = cuda.grid(3)

    if i < ny and n < nx and tau < nz:
        firstPos = numFeats*(n)
        secondPos = numFeats*(n+1)
        f_sensor_n = TrainData[firstPos:secondPos, i]
        C_sensor_n = (C[firstPos:secondPos, firstPos:secondPos])
        invertCsensor = cuda.local.array((3, 3),dtype=float64)
        invert_3x3_matrix(C_sensor_n, invertCsensor)

        test_stat = cuda.local.array((1, 8),dtype=float64)
        for k in range (0, 8):

            Mi_sensor_n = Mi[firstPos:secondPos, k]
            
            subTracted = cuda.local.array((1, 3),dtype=float64)

            calcVal = 0.0
            for q in range(len(f_sensor_n)):
                subTracted[0,q] = f_sensor_n[q] - Mi_sensor_n[q]

                tempVal = 0.0
                for b in range(len(f_sensor_n)):
                    tempVal += (invertCsensor[b][q] * subTracted[0,q]) * subTracted[0,q]

                calcVal += tempVal
            print(calcVal)
            test_stat[0,k] = calcVal

        
        min_ts = 1 #min function test_stat

        if min_ts > tau:
            detection_stream[tau, n,i] = 1
        else:
            detection_stream[tau, n,i] = 0

Topic		Replies	Views
CUDA.jit - Higher Order Convolution Optimizations (Volterra Operator) Support: How do I do ...?	1	369	October 5, 2020
Cuda.jit vs guvectorize Support: How do I do ...?	0	654	June 17, 2021
How to call a cuda.jit function in a njit function? Community Support	3	460	June 16, 2021
Hi...I am very much eager to convert the following code into cuda.jit, what is the correct way to do that? Support: How do I do ...?	2	373	November 15, 2021
@Cuda.jit is there cache=True? Support: How do I do ...?	3	1126	August 7, 2020

Optimizing Code Further, CUDA Jit?

Related Topics