I've been comparing np.convolve(x, k, 'valid') to other possible implementations for a signal processing application, and I eventually want to compare against scipy.ndimage.convolve for higher-dimensional versions. Here's one possible implementation, but ideally I'd like something fast that generalizes to n dimensions (audio, image, video) and higher orders (e.g. np.einsum('ij,ij', np.outer(audio_slice, audio_slice), k_past_audio_factor_window)). The fully general form is probably beyond the scope here; what I'm wondering is what the best implementations might be using Numba on a GPU.
from numba import cuda

@cuda.jit
def numba_cuda_conv(x, k, y):
    # One thread per output element. (As originally written, the serial
    # loop ran in full on every thread, so all 32 threads accumulated
    # into y and the result was racy and over-counted.)
    m = cuda.grid(1)
    if m < y.size:
        acc = 0.0
        for j in range(k.size):  # dot product over one valid window
            acc += x[m + k.size - 1 - j] * k[j]
        y[m] = acc
# use case, toy sizes for demonstration
import numpy as np
import cupy as cp
x = cp.random.uniform(-1, 1, 10)  # random data; all-zeros would pass the check trivially
k = cp.random.uniform(-1, 1, 5)
y = cp.zeros(x.size - k.size + 1)
npc = np.convolve(cp.asnumpy(x), cp.asnumpy(k), 'valid')  # np.convolve needs host arrays
numba_cuda_conv[1, 32](x, k, y)
print(npc)
print(y)
print(np.all(np.isclose(npc, cp.asnumpy(y))))
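As a CPU reference (and a pointer toward the nD generalization I mentioned), valid convolution can also be written as windows contracted with the flipped kernel; numpy's sliding_window_view accepts nD window shapes, so the same pattern extends to images and video:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def valid_conv(x, k):
    # Valid-mode convolution as a windowed dot product: take all
    # length-k windows of x, then contract each with the flipped
    # kernel (convolution reverses the kernel).
    windows = sliding_window_view(x, k.size)  # shape (x.size - k.size + 1, k.size)
    return np.einsum('ij,j->i', windows, k[::-1])

x = np.random.uniform(-1, 1, 200)
k = np.random.uniform(-1, 1, 7)
ref = np.convolve(x, k, 'valid')
```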
Consider a possible use case which simply filters a signal with various sized filters. As you might imagine, efficiency becomes increasingly important as x grows, for example when x.shape is (1000, 1000, 50000) (which might correspond to a video).
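At those sizes the algorithm matters as much as the device: the sliding dot product costs O(N*M) per filter, while FFT-based convolution is O(N log N) and wins for long kernels. scipy.signal.fftconvolve does this on the CPU, and I believe CuPy mirrors much of this API under cupyx.scipy.signal (depending on CuPy version):

```python
import numpy as np
from scipy.signal import fftconvolve

# For a long kernel, compare the direct O(N*M) result against the
# FFT-based O(N log N) one; they agree to floating-point tolerance.
x = np.random.uniform(-1, 1, 50_000)
k = np.random.uniform(-1, 1, 4_096)
direct = np.convolve(x, k, 'valid')
fast = fftconvolve(x, k, mode='valid')
```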
import cupy as cp
x = cp.random.uniform(-1, 1, 50000)
K = [cp.random.uniform(-1, 1, 100 * i) for i in range(1, 100)]  # filters of various sizes
Y = [cp.zeros(x.size - k.size + 1) for k in K]
threads = 128
for _ in range(1000):
    for i in range(len(K)):
        blocks = (Y[i].size + threads - 1) // threads  # enough threads to cover each output
        numba_cuda_conv[blocks, threads](x, K[i], Y[i])
    K = [updateK(x, K[i], Y[i]) for i in range(len(K))]  # updateK: some filter-update rule, not shown here
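Since the best implementation presumably depends on signal and kernel sizes, I've been picking between candidates with a small timing harness (the bench helper below is just something I wrote for this, not a library function; for GPU kernels the device would also need to be synchronized before stopping the clock):

```python
import time
import numpy as np

def bench(fn, *args, reps=5):
    # Hypothetical helper: best-of-reps wall-clock time for one call.
    # For CUDA kernels, add a device synchronize before reading the clock,
    # since kernel launches return before the work finishes.
    best = float('inf')
    for _ in range(reps):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

x = np.random.uniform(-1, 1, 50_000)
k = np.random.uniform(-1, 1, 513)
t_direct = bench(np.convolve, x, k, 'valid')
```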