Numba convolutions

I’ve been comparing np.convolve(x, k, 'valid') to other possible implementations for a signal processing application, and I eventually want to compare against scipy.ndimage.convolve for higher-dimensional versions. Further below is one possible implementation, but ideally it would be something that is fast and generalizes to n dimensions (audio, image, video) and n orders (e.g. np.einsum('ij,ij', np.outer(audio_slice, audio_slice), k_past_audio_factor_window)). The general form is likely beyond the scope here; what I’m wondering is what the best implementations might be using Numba with a GPU.
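
For concreteness, here is a minimal NumPy sketch of the second-order einsum term above; the window length and kernel values are made-up toy values, only there to make the shape of the higher-order problem concrete:

import numpy as np

w = 8                                                          # toy window length
audio_slice = np.random.uniform(-1, 1, w)                      # one window of the signal
k_past_audio_factor_window = np.random.uniform(-1, 1, (w, w))  # hypothetical second-order kernel

# sum over i, j of audio_slice[i] * audio_slice[j] * k[i, j]
second_order_term = np.einsum('ij,ij', np.outer(audio_slice, audio_slice), k_past_audio_factor_window)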

from numba import cuda

@cuda.jit
def numba_cuda_conv(x, k, y):
    for i in range(k.size - 1, x.size):  # slide the dot product over the valid region
        for j in range(k.size):  # this inner loop is just a dot product
            y[i - k.size + 1] += x[i - j] * k[j]

# use case, toy sizes for demonstration
import numpy as np
import cupy as cp

x = cp.zeros(10)
k = cp.zeros(5)
y = cp.zeros(x.size - k.size + 1)

# reference result on the host (np.convolve expects NumPy arrays)
npc = np.convolve(cp.asnumpy(x), cp.asnumpy(k), 'valid')

# Numba CUDA kernels accept CuPy arrays via __cuda_array_interface__
numba_cuda_conv[1, 32](x, k, y)

print(npc)
print(y)
print(np.allclose(npc, cp.asnumpy(y)))

Consider a possible use case that simply filters a signal with filters of various sizes. As you might imagine, efficiency becomes increasingly necessary as the size of x increases, for example when x.shape is (1000, 1000, 50000) (which might correspond to a video).

import cupy as cp

x = cp.random.uniform(-1, 1, 50000)
nthreads = 128
nblocks = x.size // nthreads + 1

for step in range(1000):
    K = [cp.random.uniform(-1, 1, 100 * n) for n in range(1, 100)]  # filters of various sizes
    Y = [cp.zeros(x.size - k.size + 1) for k in K]                  # one 'valid' output per filter
    for k, y in zip(K, Y):
        numba_cuda_conv[nblocks, nthreads](x, k, y)
    K = [updateK(x, k, y) for k, y in zip(K, Y)]  # updateK: some application-specific filter update

What follows is just a start at converting the code to run on the CUDA target, but it is one way of doing it:

from numba import cuda
import numpy as np
import time


@cuda.jit
def numba_cuda_conv(x, k, y):
    i = cuda.grid(1)  # absolute index of this thread across the whole grid

    if (i >= k.size - 1) and (i < x.size):
        for j in range(k.size):  # this loop is just a dot product
            y[i - k.size + 1] += x[i - j] * k[j]


# use case, toy sizes for demonstration

x = np.random.random(1000000)
k = np.random.random(5)
y = np.zeros(x.size - k.size + 1)

np_start = time.time()
npc = np.convolve(x, k, "valid")
np_end = time.time()

nthreads = 128
nblocks = (len(x) // nthreads) + 1

# Run once to warm up JIT
numba_cuda_conv[nblocks, nthreads](x, k, y)

# Copy / allocate data to device so we don't time the copies
x_d = cuda.to_device(x)
k_d = cuda.to_device(k)
y_d = cuda.device_array_like(y)

cuda_start = time.time()
numba_cuda_conv[nblocks, nthreads](x_d, k_d, y_d)

# Synchronize because kernel launches return asynchronously
cuda.synchronize()

cuda_end = time.time()

# Sanity check
np.testing.assert_allclose(npc, y_d.copy_to_host())

print(f"CPU time: {np_end - np_start}")
print(f"GPU time: {cuda_end - cuda_start}")

Which gives for me (i7-6700K vs RTX 8000):

CPU time: 0.0023627281188964844
GPU time: 0.0005590915679931641

A bit of speedup, but not a massive amount. The changes I’ve made:

  • Distribute the for loop across threads using cuda.grid(1). It would be more efficient to use a strided loop, but it is simpler to launch one thread per element.
  • Time both implementations. To avoid timing the JIT compilation, I do a warm-up call to the CUDA kernel first.
  • Copy data to the device before timing, so that the host-to-device copies are not included in the measurement - typically you would keep data on the device for as much of its lifetime as possible, so for a small benchmark like this it is more representative to time only the kernel execution and not the copies.
  • Compute the launch configuration to ensure there is at least one thread per element.
  • Make the data size a bit bigger (10 elements is not enough to measure anything, but I understand you only had 10 in there for the sake of example).

Things I haven’t done but could improve performance more:

  • Implement a strided loop with a smaller grid size (see the first sketch after this list). Launching more threads takes longer, so for short-lived kernels the overhead can be noticeable.
  • Preload data into shared memory (see the second sketch after this list) - since adjacent threads are all sharing data in their neighbourhood, cooperating to stream data in and out of shared memory and then computing on shared memory might be more efficient.
  • Implement a more complex kernel - this kernel is quite simple, and might well be bandwidth-limited. Perhaps more complex convolutions/operations will be easier to obtain bigger speedups with a GPU.
  • Streaming data on/off the GPU asynchronously - you mentioned on Gitter that you want to use this for video processing - if the video is realtime, streams and asynchronous operations can be used to overlap communication and computation to get better performance / lower latency.
  • Check whether memory accesses are coalesced - coalesced memory accesses will utilise available bandwidth better. I haven’t given this any thought, but you can determine whether there are improvements to be made either by thinking hard and experimenting with modifying data layout / access patterns, or by using a tool like Nsight Compute.
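
As a rough starting point for the first item, here is a grid-stride sketch of the kernel; the grid size becomes a tuning parameter rather than a function of the input length, and the launch in the final comment is only an arbitrary example:

from numba import cuda

@cuda.jit
def numba_cuda_conv_strided(x, k, y):
    # each thread starts at its global index and advances by the total number of
    # threads in the grid, so a small fixed-size grid can cover any input length
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(start + k.size - 1, x.size, stride):
        acc = 0.0
        for j in range(k.size):  # dot product for one output element
            acc += x[i - j] * k[j]
        y[i - k.size + 1] = acc

# example launch with a fixed grid: numba_cuda_conv_strided[80, 128](x_d, k_d, y_d)

And for the second item, a sketch of the shared-memory idea. The TPB and MAX_K constants are assumptions made only because Numba’s shared arrays need compile-time-constant sizes; the sketch also assumes float64 data and k.size <= MAX_K:

from numba import cuda, float64

TPB = 128               # threads per block (compile-time constant)
MAX_K = 129             # assumed maximum filter length
TILE = TPB + MAX_K - 1  # a block's outputs plus the halo of preceding elements

@cuda.jit
def numba_cuda_conv_shared(x, k, y):
    # each block computes TPB consecutive outputs; the inputs those outputs touch
    # are the block's TPB elements of x plus a halo of k.size - 1 preceding elements
    tile = cuda.shared.array(TILE, dtype=float64)

    tx = cuda.threadIdx.x
    i = cuda.grid(1)
    halo = k.size - 1

    # cooperative load: every thread loads its own element, and the first
    # `halo` threads also load the halo in front of the block
    if i < x.size:
        tile[tx + halo] = x[i]
    if tx < halo:
        left = i - halo
        tile[tx] = x[left] if left >= 0 else 0.0
    cuda.syncthreads()

    # compute the dot product entirely from shared memory
    if (i >= halo) and (i < x.size):
        acc = 0.0
        for j in range(k.size):
            acc += tile[tx + halo - j] * k[j]
        y[i - halo] = acc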

I hope this helps for getting started - please let me know if you’d like me to expand on any of the above!

I put together various implementations into a GitHub repo and graphed their performance. I added a link to your GitHub on the corresponding implementation. Thanks for all of the tips!

Hello, have you considered testing against cuSignal? It is Numba-based and makes use of shared memory. There are windowing, 1D, and 2D convolution operations and other filtering ops that may be useful.

The library is used for signal processing, and is built for online/streaming data.

EDIT: I just saw that this thread is 10 days old, so this suggestion may be out of date. The cuSignal library is pretty highly optimized, and it would be interesting to see how it holds up against your micro-benchmarks.
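
In case it helps, here is a minimal sketch of what a cuSignal comparison might look like, assuming cusignal.convolve mirrors scipy.signal.convolve’s signature and accepts CuPy arrays:

import numpy as np
import cupy as cp
import cusignal

x = cp.random.uniform(-1, 1, 50000)
k = cp.random.uniform(-1, 1, 500)

# assumed to mirror scipy.signal.convolve(in1, in2, mode=...)
y = cusignal.convolve(x, k, mode='valid')

# check against the NumPy reference on the host
npc = np.convolve(cp.asnumpy(x), cp.asnumpy(k), 'valid')
print(np.allclose(npc, cp.asnumpy(y)))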

@andrewpalumbo I heard about that after the fact and have considered it, but haven’t gotten to it yet. It’d likely only help for first-order implementations. Feel free to try it and make a pull request; I’m eager to know how it turns out.

@randompast I cloned your benchmark repo… I’ll try to get some cuSignal numbers up next time I have an EC2 instance up.

Just FYI, CuPy recently added cupy.convolve, which switches between an FFT-based method (ongoing) and a dot product: https://github.com/cupy/cupy/pull/3371.
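
A minimal usage sketch, assuming cupy.convolve mirrors the numpy.convolve signature:

import numpy as np
import cupy as cp

x = cp.random.uniform(-1, 1, 50000)
k = cp.random.uniform(-1, 1, 500)

y = cp.convolve(x, k, 'valid')  # runs entirely on the GPU

# check against the NumPy reference on the host
npc = np.convolve(cp.asnumpy(x), cp.asnumpy(k), 'valid')
print(np.allclose(npc, cp.asnumpy(y)))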

I got an email that mentioned that - thanks for reaching out and letting me know. Hopefully I’ll get around to adding these additional implementations to the benchmarks soon.