CUDA.jit - Higher Order Convolution Optimizations (Volterra Operator)

I’ve been trying to optimize a sum of three orders of convolutions (a Volterra series). First I applied @njit to everything, but it was too slow once the 3rd-order term got reasonably large, so it made sense to try @cuda.jit. My understanding is that I should apply @cuda.jit to each function up the call tree so that data doesn’t cross the CPU-GPU-CPU boundary unnecessarily, but I don’t understand how to apply @cuda.jit to the function get_y. I’ve got other, more complicated code elsewhere that I’m trying to shift over to cuda.jit as well, but this convolution is currently the bottleneck, so I’m focusing here. The array x could stay on the GPU and wouldn’t need to be transferred repeatedly, but apparently that isn’t much of an issue (the commented-out for loop in test() lets you compare). I also saw a post about Awkward Array, which might be neat to use for K to generalize this further. Running get_y takes about 200 ms, but I want it around 2 ms, so this implementation just isn’t working for me and I’m having trouble identifying what’s slowing it down. How do I speed this up further? And is there a way to do that while consolidating the n-order functions into one?

import time
from numba import cuda, njit, jit
import numpy as np

@cuda.jit
def nbconv_1d1o_cuda(x, k, y, s):  # numba conv, 1 dimension, 1st order (s is unused; kept for a uniform signature)
    n = cuda.grid(1)  # one thread per output sample
    if (0 <= n) and (n < y.size):
        d = n + k.shape[0] - 1  # index of the newest input sample in this window
        for i in range(k.shape[0]):
            y[n] += x[d-i] * k[i]

@cuda.jit
def nbconv_1d2o_cuda(x, k, y, s):  # numba conv, 1 dimension, 2nd order
    n = cuda.grid(1)
    b_offset = s*3 - 1  # band width derived from the smoothness parameter
    if (0 <= n) and (n < y.size):
        d = n + k.shape[0] - 1
        for i in range(k.shape[0]):
            for j in range(min(i + b_offset, k.shape[1])):  # j restricted to a band around i
                y[n] += x[d-i] * x[d-j] * k[i, j]

@cuda.jit
def nbconv_1d3o_cuda(x, k, y, s):  # numba conv, 1 dimension, 3rd order
    n = cuda.grid(1)
    b_offset = s*3 - 1
    if (0 <= n) and (n < y.size):
        d = n + k.shape[0] - 1
        for i in range(k.shape[0]):
            for j in range(min(i + b_offset, k.shape[1])):
                for l in range(min(j + b_offset, k.shape[2])):
                    y[n] += x[d-i] * x[d-j] * x[d-l] * k[i, j, l]

# @cuda.jit
# @njit
def get_y(x, K1, K2, K3, y, b, th, s):
    # one kernel launch per order; the y slices align each order's valid output
    # with the tail of y so the contributions accumulate in place
    if 0 < K1.shape[0]:
        nbconv_1d1o_cuda[b, th](x, K1, y[K1.shape[0]-1:], s)
    if 0 < K2.shape[0]:
        nbconv_1d2o_cuda[b, th](x, K2, y[K2.shape[0]-1:], s)
    if 0 < K3.shape[0]:
        nbconv_1d3o_cuda[b, th](x, K3, y[K3.shape[0]-1:], s)

# @njit
def get_y_helper(x, K, s):
    xc = cuda.to_device(x)  # move the input to the GPU once, up front
    K1 = cuda.to_device(K[1])
    K2 = cuda.to_device(K[2])
    K3 = cuda.to_device(K[3])
    th = 128              # threads per block
    b = x.size//th + 1    # enough blocks to cover every output sample
    y = cuda.to_device(np.zeros_like(x))  # zero-filled: the kernels accumulate with +=

    for i in range(10):   # repeated launches for timing; y keeps accumulating across them
        get_y(xc, K1, K2, K3, y, b, th, s)
    return y.copy_to_host()

def test():
    np.random.seed(0)
    x = np.random.uniform(-1, 1, 10000)
    d = 100
    K = [1,                                    # K[0]: unused placeholder for the 0th-order term
         np.random.uniform(-1, 1, (d,)),       # 1st-order kernel, shape (d,)
         np.random.uniform(-1, 1, (d, d)),     # 2nd-order kernel, shape (d, d)
         np.random.uniform(-1, 1, (d, d, d))]  # 3rd-order kernel, shape (d, d, d)
    smoothness = 4
    # for i in range(10):  # uncomment (and indent the next line) to compare repeated calls
    y = get_y_helper(x, K, smoothness)
    print(y[-5:])

if __name__ == '__main__':
    test()  # warm-up run so JIT compilation is excluded from the timing
    start = time.time()
    test()
    print("time elapsed = ", time.time() - start)

Since Numba directly exposes the CUDA execution model, you usually cannot simply replace @njit with @cuda.jit. In the CUDA execution model, the kernel function is executed by every thread, so the code needs to be adapted to distribute the work across the threads.
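
For example, here is a minimal sketch of that difference (scale_cpu and scale_gpu are hypothetical names, not from your code): the @njit version loops over the whole array in a single call, while the @cuda.jit version is executed once per thread, and each thread handles one element via its grid index.

import numpy as np
from numba import cuda, njit

@njit
def scale_cpu(x, out):
    # CPU version: one call iterates over the whole array
    for i in range(x.size):
        out[i] = 2.0 * x[i]

@cuda.jit
def scale_gpu(x, out):
    # GPU version: this body runs once per thread;
    # each thread computes its global index and handles one element
    i = cuda.grid(1)
    if i < out.size:
        out[i] = 2.0 * x[i]

x = np.arange(8, dtype=np.float64)
out = np.zeros_like(x)
threads = 128
blocks = (x.size + threads - 1) // threads
scale_gpu[blocks, threads](x, out)  # Numba copies the host arrays to the device implicitly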

With that said, if you are new to CUDA programming, it might be more productive to use CuPy, which gives you a NumPy-like API on the GPU. Here is the CuPy documentation on convolution, which might have what you need: https://docs.cupy.dev/en/stable/reference/signal.html?highlight=convolution#convolution.
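
For instance, a rough sketch of just the first-order (linear) term in CuPy (the array names here are illustrative, and the 2nd- and 3rd-order terms would likely still need custom kernels or a tensor-contraction formulation):

import cupy as cp

x = cp.random.uniform(-1, 1, 10000)    # input signal, resident on the GPU
k1 = cp.random.uniform(-1, 1, 100)     # 1st-order (linear) Volterra kernel
y1 = cp.convolve(x, k1, mode='valid')  # cp.convolve mirrors np.convolve
y1_host = cp.asnumpy(y1)               # copy back to the host only when needed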
