CUDA.jit - Higher Order Convolution Optimizations (Volterra Operator)

I’ve been trying to optimize a sum of three orders of convolutions (a Volterra series). First I applied @njit to everything, but it was too slow once the 3rd-order term got reasonably large, so it made sense to try @cuda.jit. My understanding is that I should apply @cuda.jit to each function up the call tree so that data doesn’t cross the CPU-GPU-CPU boundary unnecessarily, but I don’t understand how to apply @cuda.jit to the function get_y. I’ve got other, more complicated code elsewhere that I’m trying to shift over to cuda.jit as well, but this convolution is currently the bottleneck, so I’m focusing here. The array x could stay on the GPU and wouldn’t need to be transferred repeatedly, but apparently that isn’t much of an issue (the commented-out for loop in test() lets you compare). I also saw a post about Awkward Array, which might be neat to use for K to generalize this further. Running get_y takes about 200 ms, but I want it around 2 ms, so this implementation just isn’t working for me and I’m having trouble identifying what’s slowing it down. How do I speed this up further? And is there a way to do that while consolidating the n-order functions into one?

import time
from numba import cuda, njit, jit
import numpy as np

@cuda.jit
def nbconv_1d1o_cuda(x, k, y, s):  # numba conv, 1 dimension, 1st order (s is unused; kept for a uniform signature)
    n = cuda.grid(1)  # one thread per output sample
    if (0 <= n) and (n < y.size):
        d = n + k.shape[0] - 1  # index of the newest input sample in this window
        for i in range(k.shape[0]):
            y[n] += x[d-i] * k[i]

@cuda.jit
def nbconv_1d2o_cuda(x, k, y, s):  # numba conv, 1 dimension, 2nd order
    n = cuda.grid(1)
    b_offset = s*3 - 1  # band width derived from the smoothness parameter
    if (0 <= n) and (n < y.size):
        d = n + k.shape[0] - 1
        for i in range(k.shape[0]):
            for j in range(min(i + b_offset, k.shape[1])):  # j restricted to a band around i
                y[n] += x[d-i] * x[d-j] * k[i, j]

@cuda.jit
def nbconv_1d3o_cuda(x, k, y, s):  # numba conv, 1 dimension, 3rd order
    n = cuda.grid(1)
    b_offset = s*3 - 1
    if (0 <= n) and (n < y.size):
        d = n + k.shape[0] - 1
        for i in range(k.shape[0]):
            for j in range(min(i + b_offset, k.shape[1])):
                for l in range(min(j + b_offset, k.shape[2])):
                    y[n] += x[d-i] * x[d-j] * x[d-l] * k[i, j, l]

# @cuda.jit
# @njit
def get_y(x, K1, K2, K3, y, b, th, s):
    # one kernel launch per order; the y slices align each order's valid output
    # with the tail of y so the contributions accumulate in place
    if 0 < K1.shape[0]:
        nbconv_1d1o_cuda[b, th](x, K1, y[K1.shape[0]-1:], s)
    if 0 < K2.shape[0]:
        nbconv_1d2o_cuda[b, th](x, K2, y[K2.shape[0]-1:], s)
    if 0 < K3.shape[0]:
        nbconv_1d3o_cuda[b, th](x, K3, y[K3.shape[0]-1:], s)

# @njit
def get_y_helper(x, K, s):
    xc = cuda.to_device(x)  # move the input to the GPU once, up front
    K1 = cuda.to_device(K[1])
    K2 = cuda.to_device(K[2])
    K3 = cuda.to_device(K[3])
    th = 128              # threads per block
    b = x.size//th + 1    # enough blocks to cover every output sample
    y = cuda.to_device(np.zeros_like(x))  # zero-filled: the kernels accumulate with +=

    for i in range(10):   # repeated launches for timing; y keeps accumulating across them
        get_y(xc, K1, K2, K3, y, b, th, s)
    return y.copy_to_host()

def test():
    np.random.seed(0)
    x = np.random.uniform(-1, 1, 10000)
    d = 100
    K = [1,                                    # K[0]: unused placeholder for the 0th-order term
         np.random.uniform(-1, 1, (d,)),       # 1st-order kernel, shape (d,)
         np.random.uniform(-1, 1, (d, d)),     # 2nd-order kernel, shape (d, d)
         np.random.uniform(-1, 1, (d, d, d))]  # 3rd-order kernel, shape (d, d, d)
    smoothness = 4
    # for i in range(10):  # uncomment (and indent the next line) to compare repeated calls
    y = get_y_helper(x, K, smoothness)
    print(y[-5:])

if __name__ == '__main__':
    test()  # warm-up run so JIT compilation is excluded from the timing
    start = time.time()
    test()
    print("time elapsed = ", time.time() - start)

Since Numba directly exposes the CUDA execution model, you usually cannot simply replace @njit with @cuda.jit. In the CUDA execution model, the kernel function is executed by every thread, so the code needs to be adapted to distribute the work across the threads.
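
For example, here is a minimal sketch of that difference (scale_cpu and scale_gpu are hypothetical names, not from your code): the @njit version loops over the whole array in a single call, while the @cuda.jit version is executed once per thread, and each thread handles one element via its grid index.

import numpy as np
from numba import cuda, njit

@njit
def scale_cpu(x, out):
    # CPU version: one call iterates over the whole array
    for i in range(x.size):
        out[i] = 2.0 * x[i]

@cuda.jit
def scale_gpu(x, out):
    # GPU version: this body runs once per thread;
    # each thread computes its global index and handles one element
    i = cuda.grid(1)
    if i < out.size:
        out[i] = 2.0 * x[i]

x = np.arange(8, dtype=np.float64)
out = np.zeros_like(x)
threads = 128
blocks = (x.size + threads - 1) // threads
scale_gpu[blocks, threads](x, out)  # Numba copies the host arrays to the device implicitly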

With that said, if you are new to CUDA programming, it might be more productive to use CuPy, which gives you a NumPy-like API on the GPU. Here is the CuPy documentation on convolution, which might have what you need: https://docs.cupy.dev/en/stable/reference/signal.html?highlight=convolution#convolution.
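
For instance, a rough sketch of just the first-order (linear) term in CuPy (the array names here are illustrative, and the 2nd- and 3rd-order terms would likely still need custom kernels or a tensor-contraction formulation):

import cupy as cp

x = cp.random.uniform(-1, 1, 10000)    # input signal, resident on the GPU
k1 = cp.random.uniform(-1, 1, 100)     # 1st-order (linear) Volterra kernel
y1 = cp.convolve(x, k1, mode='valid')  # cp.convolve mirrors np.convolve
y1_host = cp.asnumpy(y1)               # copy back to the host only when needed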
