I’ve been trying to optimize a computation built from 1st-, 2nd-, and 3rd-order convolutions. First I tried applying @njit to everything, but it was too slow once the 3rd-order term got reasonably large, so it made sense to try @cuda.jit. My understanding is that I should apply @cuda.jit to each function up the call tree so that data doesn’t cross the CPU–GPU–CPU boundary unnecessarily, but I don’t understand how to apply @cuda.jit to the function get_y. I have other, more complicated code elsewhere that I’m also trying to move over to cuda.jit, but this convolution is currently the bottleneck, so I’m focusing here.

The array x could stay on the GPU the whole time and doesn’t need to be repeatedly copied, but apparently that’s not much of an issue (the commented-out loop in test() is what I used to compare). I also saw a post about Awkward Array, which might be a neat way to store K and generalize this further.

Running get_y takes about 200 ms, but I want it around 2 ms, so this implementation just isn’t working for me, and I’m having trouble identifying what’s slowing it down. How do I speed this up further? And is there a way to do that while consolidating the nth-order functions into one?
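To make that last question concrete, here’s a rough CPU-side sketch of the consolidated, recursive nth-order function I have in mind (conv_n is a made-up name, and this is plain Python — not something I’ve tried under @cuda.jit):

```python
import numpy as np

def conv_n(x, Ks, s):
    """One function for every order: Ks[r] holds the (r+1)-order
    kernel (1-D, 2-D, 3-D, ...).  Plain-Python sketch, CPU only."""
    b_offset = s * 3 - 1
    m = Ks[0].shape[0]
    y = np.zeros_like(x)

    def term(n, k, prod, lo):
        # k: remaining kernel slice; prod: product of x factors so far;
        # lo: bound carried from the previous loop index (None at top level)
        if k.ndim == 0:                 # scalar: all indices consumed
            return prod * k
        hi = k.shape[0] if lo is None else min(lo, k.shape[0])
        acc = 0.0
        for i in range(hi):
            acc += term(n, k[i], prod * x[n - i], i + b_offset)
        return acc

    for n in range(m - 1, x.size):
        for K in Ks:
            y[n] += term(n, K, 1.0, None)
    return y
```

The min(i + b_offset, …) bounds are meant to match the kernels below; the recursion just replaces the hand-written loop nesting per order.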

```
import time
from numba import cuda, njit, jit
import numpy as np
@cuda.jit
def nbconv_1d1o_cuda(x, k, y, s):  # numba_conv, 1 dimension, 1st order
    n = cuda.grid(1)
    if 0 <= n < y.size:
        d = n + k.shape[0] - 1
        for i in range(k.shape[0]):
            y[n] += x[d - i] * k[i]

@cuda.jit
def nbconv_1d2o_cuda(x, k, y, s):  # numba_conv, 1 dimension, 2nd order
    n = cuda.grid(1)
    b_offset = s * 3 - 1
    if 0 <= n < y.size:
        d = n + k.shape[0] - 1
        for i in range(k.shape[0]):
            for j in range(min(i + b_offset, k.shape[1])):
                y[n] += x[d - i] * x[d - j] * k[i, j]

@cuda.jit
def nbconv_1d3o_cuda(x, k, y, s):  # numba_conv, 1 dimension, 3rd order
    b_offset = s * 3 - 1
    n = cuda.grid(1)
    if 0 <= n < y.size:
        d = n + k.shape[0] - 1
        for i in range(k.shape[0]):
            for j in range(min(i + b_offset, k.shape[1])):
                for l in range(min(j + b_offset, k.shape[2])):
                    y[n] += x[d - i] * x[d - j] * x[d - l] * k[i, j, l]
# @cuda.jit
# @njit
def get_y(x, K1, K2, K3, y, b, th, s):
    if 0 < K1.shape[0]:
        nbconv_1d1o_cuda[b, th](x, K1, y[K1.shape[0] - 1:], s)
    if 0 < K2.shape[0]:
        nbconv_1d2o_cuda[b, th](x, K2, y[K2.shape[0] - 1:], s)
    if 0 < K3.shape[0]:
        nbconv_1d3o_cuda[b, th](x, K3, y[K3.shape[0] - 1:], s)
# @njit
def get_y_helper(x, K, s):
    xc = cuda.to_device(x)
    K1 = cuda.to_device(K[1])
    K2 = cuda.to_device(K[2])
    K3 = cuda.to_device(K[3])
    th = 128
    b = x.size // th + 1
    # device_array_like() leaves the buffer uninitialized; zero it so
    # the kernels' += accumulation starts from a known state
    y = cuda.to_device(np.zeros_like(x))
    for i in range(10):
        get_y(xc, K1, K2, K3, y, b, th, s)
    return y.copy_to_host()
def test():
    np.random.seed(0)
    x = np.random.uniform(-1, 1, 10000)
    d = 100
    K = [1, np.random.uniform(-1, 1, (d,)),
         np.random.uniform(-1, 1, (d, d)),
         np.random.uniform(-1, 1, (d, d, d))]
    smoothness = 4
    # for i in range(10):
    y = get_y_helper(x, K, smoothness)
    print(y[-5:])

if __name__ == '__main__':
    test()  # warm-up call so JIT compilation isn't included in the timing
    start = time.time()
    test()
    print("time elapsed = ", time.time() - start)
```
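And in case it’s useful for anyone answering: here’s a plain-NumPy/loop reference I use to sanity-check one call of get_y (assuming y starts zeroed; conv_ref is my own name, not part of the code above, and it assumes K1, K2, K3 share the same leading size, as in test()):

```python
import numpy as np

def conv_ref(x, K1, K2, K3, s):
    # Direct-loop reference for a single get_y call: y[n] accumulates
    # the 1st-, 2nd-, and 3rd-order terms with the same loop bounds
    # as the kernels (b_offset = s*3 - 1).
    m = K1.shape[0]
    b_offset = s * 3 - 1
    y = np.zeros_like(x)
    for n in range(m - 1, x.size):
        acc = 0.0
        for i in range(m):
            acc += x[n - i] * K1[i]
            for j in range(min(i + b_offset, m)):
                acc += x[n - i] * x[n - j] * K2[i, j]
                for l in range(min(j + b_offset, m)):
                    acc += x[n - i] * x[n - j] * x[n - l] * K3[i, j, l]
        y[n] = acc
    return y
```

It’s far too slow at d = 100, so I only run it on small kernels, but it catches bound mistakes.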