Numba convolutions

This is just a start at converting the code to run on the CUDA target, but one way of doing it is the following:

from numba import cuda
import numpy as np
import time


@cuda.jit
def numba_cuda_conv(x, k, y):
    i = cuda.grid(1)

    if (i >= k.size - 1) and (i < x.size):
        for j in range(k.size):  # this loop is just a dot product
            y[i - k.size + 1] += x[i - j] * k[j]


# use case, toy sizes for demonstration

x = np.random.random(1000000)
k = np.random.random(5)
y = np.zeros(x.size - k.size + 1)

np_start = time.time()
npc = np.convolve(x, k, "valid")
np_end = time.time()

nthreads = 128
nblocks = (len(x) // nthreads) + 1

# Run once to warm up JIT
numba_cuda_conv[nblocks, nthreads](x, k, y)

# Copy / allocate data to device so we don't time the copies
x_d = cuda.to_device(x)
k_d = cuda.to_device(k)
y_d = cuda.to_device(np.zeros_like(y))  # must be zero-filled, since the kernel accumulates with +=

cuda_start = time.time()
numba_cuda_conv[nblocks, nthreads](x_d, k_d, y_d)

# Synchronize because kernel launches return asynchronously
cuda.synchronize()

cuda_end = time.time()

# Sanity check
np.testing.assert_allclose(npc, y_d.copy_to_host())

print(f"CPU time: {np_end - np_start}")
print(f"GPU time: {cuda_end - cuda_start}")

Which gives for me (i7-6700K vs RTX 8000):

CPU time: 0.0023627281188964844
GPU time: 0.0005590915679931641

A bit of speedup, but not a massive amount. The changes I’ve made:

  • Distribute the for loop across threads using cuda.grid(1). It would be more efficient to use a strided loop, but it is simpler to launch one thread per element.
  • Time both implementations. To avoid timing the JIT compilation, I do a warm-up call to the CUDA kernel first.
  • Copy data to the device before timing, so the transfers aren’t included in the measurement. Typically you would keep data on the device for as much of its lifetime as possible, so for a small benchmark like this it is more representative to time only the kernel execution and not the copies.
  • Compute the launch configuration to ensure there is at least one thread per element.
  • Make the data size a bit bigger (10 elements is not enough to measure anything, but I understand you only had 10 in there for the sake of example).
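To make the kernel’s indexing easier to follow, here is a small CPU-side check (a sketch I’ve added, not part of the code above) that reproduces the same per-element arithmetic in plain Python and compares it against np.convolve:

```python
import numpy as np

x = np.random.random(20)
k = np.random.random(5)
y = np.zeros(x.size - k.size + 1)

# Same indexing as the kernel: "thread" i writes output element i - (k.size - 1)
for i in range(k.size - 1, x.size):
    for j in range(k.size):
        y[i - k.size + 1] += x[i - j] * k[j]

np.testing.assert_allclose(y, np.convolve(x, k, "valid"))
```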

Things I haven’t done but could improve performance more:

  • Implement a strided loop with a smaller grid size. Launching more threads takes longer, so for short-lived kernels the overhead can be noticeable.
  • Preload data into shared memory - since adjacent threads are all sharing data in their neighbourhood, cooperating to stream data in and out of shared memory then computing on shared memory might be more efficient.
  • Implement a more complex kernel - this kernel is quite simple, and might well be bandwidth-limited. Perhaps more complex convolutions/operations will be easier to obtain bigger speedups with a GPU.
  • Streaming data on/off the GPU asynchronously - you mentioned in Gitter about using this for video processing - if the video is realtime, streams and asynchronous operations can be used to overlap communication and computation to get better performance / lower latency.
  • Check whether memory accesses are coalesced - coalesced memory accesses will utilise available bandwidth better. I haven’t given this any thought, but you can determine whether there are improvements to be made either by reasoning about and experimenting with the data layout / access patterns, or by using a tool like Nsight Compute.

I hope this helps for getting started - please let me know if you’d like me to expand on any of the above!
