Optimizing Code Further, CUDA Jit?

NumPy operations that are not supported in the CUDA target need to be rewritten as loops over individual elements. I haven’t had time to translate all of the above, but here are a couple of starting points (the code below is untested, but illustrates the general idea):

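  1. numpy.linalg.inv: It looks like your matrix is 3x3, so you could use a function like this:
from numba import njit

@njit
def invert_3x3_matrix(m, res):
    # Writes the inverse of the 3x3 matrix m into res.
    # There is no check for a singular matrix (zero determinant).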
    a = m[0, 0]
    b = m[0, 1]
    c = m[0, 2]
    d = m[1, 0]
    e = m[1, 1]
    f = m[1, 2]
    g = m[2, 0]
    h = m[2, 1]
    i = m[2, 2]

    # Reciprocal of the determinant (cofactor expansion along the first row)
    D = 1.0 / (a*(e*i - f*h) - b*(d*i - f*g) + c*(d*h - e*g))

    a1 = D * (e*i - f*h)
    b1 = D * (c*h - b*i)
    c1 = D * (b*f - c*e)
    d1 = D * (f*g - d*i)
    e1 = D * (a*i - c*g)
    f1 = D * (c*d - a*f)
    g1 = D * (d*h - e*g)
    h1 = D * (b*g - a*h)
    i1 = D * (a*e - b*d)

    res[0, 0] = a1
    res[0, 1] = b1
    res[0, 2] = c1
    res[1, 0] = d1
    res[1, 1] = e1
    res[1, 2] = f1
    res[2, 0] = g1
    res[2, 1] = h1
    res[2, 2] = i1

Then, inside your kernel, declare a local array for the result and call the function, e.g.:

invertCsensor = cuda.local.array((3, 3), np.float64)
invert_3x3_matrix(C_sensor_n, invertCsensor)

  2. np.dot: A dot product could be rewritten as a loop like this:
calcVal = 0.0
for i in range(len(leftSide)):
    calcVal += leftSide[i] * rightSide[i]
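
If the np.dot call in your code is actually a matrix-vector product rather than a product of two 1D vectors, the same idea extends to a nested loop. The following is only a sketch: the name matvec and its arguments are made up for illustration, and I have left off the decorator so it runs as plain Python. In your kernel it would presumably need the same kind of decorator as invert_3x3_matrix, with the output array declared beforehand.

```python
import numpy as np

def matvec(mat, vec, out):
    # Hypothetical helper: computes out = mat @ vec using explicit loops only.
    # mat is 2D with shape (rows, cols), vec has length cols, out has length rows.
    for r in range(mat.shape[0]):
        acc = 0.0
        for c in range(mat.shape[1]):
            acc += mat[r, c] * vec[c]
        out[r] = acc

# CPU-side check against NumPy before moving the loop onto the GPU
mat = np.array([[1.0, 2.0], [3.0, 4.0]])
vec = np.array([5.0, 6.0])
out = np.zeros(2)
matvec(mat, vec, out)
print(np.allclose(out, mat @ vec))
```

Checking the loop version against NumPy on the CPU first makes it easier to tell a logic mistake apart from a CUDA-specific problem later.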

I hope this gives you the idea. I think you’ll also need to rewrite conj as a loop.
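
For conj, a loop along these lines might be a starting point. Again this is a sketch that I have not run under the CUDA target: the name conj_into and its arguments are hypothetical, the decorator is omitted so it runs as plain Python, and I am assuming a 1D sequence of complex values with a preallocated output of the same length.

```python
def conj_into(src, dst):
    # Hypothetical helper: writes the complex conjugate of each element of src into dst.
    # Only per-element operations are used, so no call to np.conj is needed.
    for k in range(len(src)):
        dst[k] = complex(src[k].real, -src[k].imag)

# Plain-Python check with a small example
src = [1 + 2j, 3 - 4j]
dst = [0j, 0j]
conj_into(src, dst)
print(dst)
```

Writing into a preallocated output follows the same pattern as invert_3x3_matrix, where the caller declares the result array and the function only fills it in.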