This is just a start at converting the code to run on the CUDA target, but one way of doing it is the following:
```python
from numba import cuda
import numpy as np
import time


@cuda.jit
def numba_cuda_conv(x, k, y):
    i = cuda.grid(1)
    if (i >= k.size - 1) and (i < x.size):
        for j in range(k.size):  # this loop is just a dot product
            y[i - k.size + 1] += x[i - j] * k[j]


# Use case, toy sizes for demonstration
x = np.random.random(1000000)
k = np.random.random(5)
y = np.zeros(x.size - k.size + 1)

np_start = time.time()
npc = np.convolve(x, k, "valid")
np_end = time.time()

# Launch configuration: at least one thread per input element
nthreads = 128
nblocks = (len(x) // nthreads) + 1

# Run once to warm up the JIT (this call also implicitly copies the host
# arrays to the device and back)
numba_cuda_conv[nblocks, nthreads](x, k, y)

# Copy data to the device so we don't time the copies. The output is
# zero-initialized on the device because the kernel accumulates with +=.
x_d = cuda.to_device(x)
k_d = cuda.to_device(k)
y_d = cuda.to_device(np.zeros_like(y))

cuda_start = time.time()
numba_cuda_conv[nblocks, nthreads](x_d, k_d, y_d)
# Synchronize because kernel launches return asynchronously
cuda.synchronize()
cuda_end = time.time()

# Sanity check
np.testing.assert_allclose(npc, y_d.copy_to_host())

print(f"CPU time: {np_end - np_start}")
print(f"GPU time: {cuda_end - cuda_start}")
```
Which gives for me (i7-6700K vs RTX 8000):
```
CPU time: 0.0023627281188964844
GPU time: 0.0005590915679931641
```
A bit of speedup, but not a massive amount. The changes I’ve made:
- Distribute the `for` loop across threads using `cuda.grid(1)`. It would be more efficient to use a strided loop (see the sketch just after this list), but it is simpler to launch one thread per element.
- Time both implementations. To avoid timing the JIT compilation, I do a warm-up call to the CUDA kernel first.
- Copy data to the device before timing, so that the host-to-device transfers are excluded from the measurement - typically you would keep data resident on the device for as much of its lifetime as possible, so for a small benchmark like this it is more representative to time only the kernel execution and not the copies.
- Compute the launch configuration to ensure there is at least one thread per element.
- Make the data size a bit bigger (10 elements is not enough to measure anything, but I understand you only had 10 in there for the sake of example).
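For reference, a grid-stride version of the kernel could look something like the sketch below. This isn't part of the code above - the kernel name and the 80-block grid are just placeholders you would tune - but it shows the idea: each thread loops over outputs with a stride of the total grid size, and accumulates into a local variable so each output is written once.

```python
@cuda.jit
def numba_cuda_conv_strided(x, k, y):
    # Each thread starts at its global index and strides by the total
    # number of threads in the grid, so a small grid covers all outputs.
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for n in range(start, y.size, stride):
        acc = 0.0
        for j in range(k.size):
            # Valid convolution: y[n] = sum_j x[n + k.size - 1 - j] * k[j]
            acc += x[n + k.size - 1 - j] * k[j]
        y[n] = acc


# Reusing the device arrays from above; 80 blocks is an arbitrary choice -
# in practice you'd tune it (e.g. a multiple of the SM count).
numba_cuda_conv_strided[80, 128](x_d, k_d, y_d)
```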
Things I haven’t done but could improve performance more:
- Implement a strided loop with a smaller grid size. Launching more threads takes longer, so for short-lived kernels the overhead can be noticeable.
- Preload data into shared memory - since adjacent threads all share data in their neighbourhood, cooperating to stream data into shared memory and then computing on it might be more efficient (there's a sketch of this after this list).
- Implement a more complex kernel - this kernel is quite simple, and may well be bandwidth-limited. Perhaps it will be easier to obtain bigger speedups on the GPU with more complex convolutions / operations.
- Stream data on/off the GPU asynchronously - you mentioned on Gitter that you're using this for video processing - if the video is realtime, streams and asynchronous operations can be used to overlap communication and computation for better performance / lower latency (a minimal sketch also follows this list).
- Check whether memory accesses are coalesced - coalesced memory accesses will utilise the available bandwidth better. I haven't given this any thought, but you can determine whether there are improvements to be made either by thinking hard and experimenting with modifying the data layout / access patterns, or by using a tool like Nsight Compute.
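To illustrate the shared memory point, here is a sketch of a tiled version. It's only a sketch under the assumption that the filter length `K` and the block size `TPB` are fixed, because `cuda.shared.array` needs a compile-time constant size; the kernel name is a placeholder.

```python
from numba import cuda, float64
import numpy as np

# Assumptions for this sketch (not in the code above): K and TPB are fixed
# so the shared array size is a compile-time constant.
K = 5
TPB = 128
TILE = TPB + K - 1  # inputs needed per block: TPB outputs plus a halo of K - 1


@cuda.jit
def numba_cuda_conv_shared(x, k, y):
    tile = cuda.shared.array(TILE, float64)

    tx = cuda.threadIdx.x
    block_start = cuda.blockIdx.x * cuda.blockDim.x  # first output of this block

    # Cooperative load: the block streams its tile of x into shared memory,
    # with threads 0 .. K-2 each loading a second (halo) element.
    for idx in range(tx, TILE, TPB):
        if block_start + idx < x.size:
            tile[idx] = x[block_start + idx]
    cuda.syncthreads()

    n = block_start + tx
    if n < y.size:
        acc = 0.0
        for j in range(K):
            # Same valid convolution as before, but reading from shared memory
            acc += tile[tx + K - 1 - j] * k[j]
        y[n] = acc


# Usage sketch: one thread per output element
x = np.random.random(1000000)
k = np.random.random(K)
y = np.zeros(x.size - K + 1)
nblocks = (y.size + TPB - 1) // TPB
numba_cuda_conv_shared[nblocks, TPB](x, k, y)
np.testing.assert_allclose(y, np.convolve(x, k, "valid"))
```

Whether this actually helps will depend on how well the cached global loads were already doing - for a filter this short the gain may be small, but the pattern becomes more useful as the neighbourhood grows.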
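And to give a flavour of the streams point, a minimal sketch is below. It reuses the kernel, sizes and launch configuration from the example above; in a real video pipeline you'd have several buffers and streams in flight rather than a single one.

```python
from numba import cuda
import numpy as np

stream = cuda.stream()

# Pinned (page-locked) host memory is needed for copies to be truly asynchronous
x_pinned = cuda.pinned_array(x.shape, dtype=x.dtype)
x_pinned[:] = x

# Enqueue copies, the kernel, and the copy back on the same stream; none of
# these block the host until we synchronize.
x_d = cuda.to_device(x_pinned, stream=stream)
k_d = cuda.to_device(k, stream=stream)
# Zero-initialized because the kernel accumulates with +=
y_d = cuda.to_device(np.zeros_like(y), stream=stream)

numba_cuda_conv[nblocks, nthreads, stream](x_d, k_d, y_d)

result = cuda.pinned_array(y.shape, dtype=y.dtype)
y_d.copy_to_host(result, stream=stream)

# The host is free to do other work here (e.g. decode the next frame)
# before waiting for the stream to finish.
stream.synchronize()
```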
I hope this helps for getting started - please let me know if you’d like me to expand on any of the above!