This is just a start at converting the code to run on the CUDA target, but one way of doing it is the following:
```python
from numba import cuda
import numpy as np
import time


@cuda.jit
def numba_cuda_conv(x, k, y):
    i = cuda.grid(1)
    if (i >= k.size - 1) and (i < x.size):
        for j in range(k.size):  # this loop is just a dot product
            y[i - k.size + 1] += x[i - j] * k[j]


# Use case, toy sizes for demonstration
x = np.random.random(1000000)
k = np.random.random(5)
y = np.zeros(x.size - k.size + 1)

np_start = time.time()
npc = np.convolve(x, k, "valid")
np_end = time.time()

# Launch configuration: at least one thread per input element
nthreads = 128
nblocks = (len(x) // nthreads) + 1

# Run once to warm up the JIT (this call also implicitly copies the host
# arrays to the device and back)
numba_cuda_conv[nblocks, nthreads](x, k, y)

# Copy data to the device so we don't time the copies. The output is
# zero-initialized on the device because the kernel accumulates with +=.
x_d = cuda.to_device(x)
k_d = cuda.to_device(k)
y_d = cuda.to_device(np.zeros_like(y))

cuda_start = time.time()
numba_cuda_conv[nblocks, nthreads](x_d, k_d, y_d)
# Synchronize because kernel launches return asynchronously
cuda.synchronize()
cuda_end = time.time()

# Sanity check
np.testing.assert_allclose(npc, y_d.copy_to_host())

print(f"CPU time: {np_end - np_start}")
print(f"GPU time: {cuda_end - cuda_start}")
```
Which gives for me (i7-6700K vs RTX 8000):
```
CPU time: 0.0023627281188964844
GPU time: 0.0005590915679931641
```
A bit of speedup, but not a massive amount. The changes I’ve made:
- Distribute the `for` loop across threads using `cuda.grid(1)`. It would be more efficient to use a strided loop (see the sketch just after this list), but it is simpler to launch one thread per element.
- Time both implementations. To avoid timing the JIT compilation, I do a warm-up call to the CUDA kernel first.
- Copy data to the device before timing, so that the host-to-device transfers are excluded from the measurement - typically you would keep data resident on the device for as much of its lifetime as possible, so for a small benchmark like this it is more representative to time only the kernel execution and not the copies.
- Compute the launch configuration to ensure there is at least one thread per element.
- Make the data size a bit bigger (10 elements is not enough to measure anything, but I understand you only had 10 in there for the sake of example).
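For reference, a grid-stride version of the kernel could look something like the sketch below. This isn't part of the code above - the kernel name and the 80-block grid are just placeholders you would tune - but it shows the idea: each thread loops over outputs with a stride of the total grid size, and accumulates into a local variable so each output is written once.

```python
@cuda.jit
def numba_cuda_conv_strided(x, k, y):
    # Each thread starts at its global index and strides by the total
    # number of threads in the grid, so a small grid covers all outputs.
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for n in range(start, y.size, stride):
        acc = 0.0
        for j in range(k.size):
            # Valid convolution: y[n] = sum_j x[n + k.size - 1 - j] * k[j]
            acc += x[n + k.size - 1 - j] * k[j]
        y[n] = acc


# Reusing the device arrays from above; 80 blocks is an arbitrary choice -
# in practice you'd tune it (e.g. a multiple of the SM count).
numba_cuda_conv_strided[80, 128](x_d, k_d, y_d)
```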
Things I haven’t done but could improve performance more:
- Implement a strided loop with a smaller grid size. Launching more threads takes longer, so for short-lived kernels the overhead can be noticeable.
- Preload data into shared memory - since adjacent threads all share data in their neighbourhood, cooperating to stream data into shared memory and then computing on it might be more efficient (there's a sketch of this after this list).
- Implement a more complex kernel - this kernel is quite simple, and may well be bandwidth-limited. Perhaps it will be easier to obtain bigger speedups on the GPU with more complex convolutions / operations.
- Stream data on/off the GPU asynchronously - you mentioned on Gitter that you're using this for video processing - if the video is realtime, streams and asynchronous operations can be used to overlap communication and computation for better performance / lower latency (a minimal sketch also follows this list).
- Check whether memory accesses are coalesced - coalesced memory accesses will utilise the available bandwidth better. I haven't given this any thought, but you can determine whether there are improvements to be made either by thinking hard and experimenting with modifying the data layout / access patterns, or by using a tool like Nsight Compute.
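To illustrate the shared memory point, here is a sketch of a tiled version. It's only a sketch under the assumption that the filter length `K` and the block size `TPB` are fixed, because `cuda.shared.array` needs a compile-time constant size; the kernel name is a placeholder.

```python
from numba import cuda, float64
import numpy as np

# Assumptions for this sketch (not in the code above): K and TPB are fixed
# so the shared array size is a compile-time constant.
K = 5
TPB = 128
TILE = TPB + K - 1  # inputs needed per block: TPB outputs plus a halo of K - 1


@cuda.jit
def numba_cuda_conv_shared(x, k, y):
    tile = cuda.shared.array(TILE, float64)

    tx = cuda.threadIdx.x
    block_start = cuda.blockIdx.x * cuda.blockDim.x  # first output of this block

    # Cooperative load: the block streams its tile of x into shared memory,
    # with threads 0 .. K-2 each loading a second (halo) element.
    for idx in range(tx, TILE, TPB):
        if block_start + idx < x.size:
            tile[idx] = x[block_start + idx]
    cuda.syncthreads()

    n = block_start + tx
    if n < y.size:
        acc = 0.0
        for j in range(K):
            # Same valid convolution as before, but reading from shared memory
            acc += tile[tx + K - 1 - j] * k[j]
        y[n] = acc


# Usage sketch: one thread per output element
x = np.random.random(1000000)
k = np.random.random(K)
y = np.zeros(x.size - K + 1)
nblocks = (y.size + TPB - 1) // TPB
numba_cuda_conv_shared[nblocks, TPB](x, k, y)
np.testing.assert_allclose(y, np.convolve(x, k, "valid"))
```

Whether this actually helps will depend on how well the cached global loads were already doing - for a filter this short the gain may be small, but the pattern becomes more useful as the neighbourhood grows.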
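And to give a flavour of the streams point, a minimal sketch is below. It reuses the kernel, sizes and launch configuration from the example above; in a real video pipeline you'd have several buffers and streams in flight rather than a single one.

```python
from numba import cuda
import numpy as np

stream = cuda.stream()

# Pinned (page-locked) host memory is needed for copies to be truly asynchronous
x_pinned = cuda.pinned_array(x.shape, dtype=x.dtype)
x_pinned[:] = x

# Enqueue copies, the kernel, and the copy back on the same stream; none of
# these block the host until we synchronize.
x_d = cuda.to_device(x_pinned, stream=stream)
k_d = cuda.to_device(k, stream=stream)
# Zero-initialized because the kernel accumulates with +=
y_d = cuda.to_device(np.zeros_like(y), stream=stream)

numba_cuda_conv[nblocks, nthreads, stream](x_d, k_d, y_d)

result = cuda.pinned_array(y.shape, dtype=y.dtype)
y_d.copy_to_host(result, stream=stream)

# The host is free to do other work here (e.g. decode the next frame)
# before waiting for the stream to finish.
stream.synchronize()
```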
I hope this helps for getting started - please let me know if you’d like me to expand on any of the above!