When computing a sliding dot product between two 1D arrays, `Q` and `T`, one can naively do:
```python
import numpy as np

def naive_sliding_dot_product(Q, T):
    m = len(Q)
    l = len(T) - m + 1
    out = np.empty(l)
    for i in range(l):
        out[i] = np.dot(Q, T[i : i + m])
    return out
```
Since a sliding dot product is basically a convolution, one can do this computation quickly in pure `numpy` via:
```python
def numpy_sliding_dot_product(Q, T):
    return np.convolve(np.ascontiguousarray(Q[::-1]), T, mode='valid')
```
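To see why the `Q[::-1]` reversal is needed: convolution flips its kernel, so reversing `Q` up front turns the convolution back into a cross-correlation (i.e., a sliding dot product). A quick sanity check with small, made-up arrays confirms the two approaches agree:

```python
import numpy as np

Q = np.array([1.0, 2.0, 3.0])
T = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

# Naive sliding dot product over every length-3 window of T
naive = np.array([np.dot(Q, T[i:i + 3]) for i in range(len(T) - 3 + 1)])

# np.convolve flips its first argument, so pre-reversing Q cancels the flip
conv = np.convolve(Q[::-1], T, mode='valid')

print(np.allclose(naive, conv))  # True
```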
In the STUMPY package, we've been using a `numba` `njit` version of the "naive" code, which gets about the same performance as the pure `numpy` version on small to medium array inputs:
```python
from numba import njit

@njit(fastmath=True)
def stumpy_sliding_dot_product(Q, T):
    m = Q.shape[0]
    l = T.shape[0] - m + 1
    out = np.empty(l)
    for i in range(l):
        out[i] = np.dot(Q, T[i : i + m])
    return out
```
This has served us well over the years. However, on larger array inputs, an FFT-based overlap-add convolution (`scipy.signal.oaconvolve`) reigns supreme and significantly outperforms both the `numpy` discrete convolution and the `stumpy` versions above:
```python
from scipy.signal import oaconvolve

def oaconvolve_sliding_dot(Q, T):
    return oaconvolve(np.ascontiguousarray(Q[::-1]), T, mode='valid')
```
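As a sanity check that the FFT-based route computes the same thing (the sizes below are arbitrary, just large enough to be representative):

```python
import numpy as np
from scipy.signal import oaconvolve

rng = np.random.default_rng(0)
Q = rng.standard_normal(64)
T = rng.standard_normal(10_000)

# Naive sliding dot product as the ground truth
naive = np.array([np.dot(Q, T[i:i + 64]) for i in range(len(T) - 64 + 1)])

# FFT overlap-add convolution of the reversed query
fft_based = oaconvolve(Q[::-1], T, mode='valid')

print(np.allclose(naive, fft_based))  # True (up to FFT round-off)
```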
However, realizing that the `stumpy` version depends heavily upon `np.dot`, which may or may not use BLAS in the backend (I'm using an M2 Mac, so no Intel chip at play here), I decided to swap out the `np.dot` for a simple for-loop:
```python
@njit(fastmath=True)
def stumpy_sliding_dot_product_V2(Q, T):
    m = Q.shape[0]
    l = T.shape[0] - m + 1
    out = np.zeros(l)
    for i in range(l):
        for j in range(m):
            out[i] += Q[j] * T[i + j]
    return out
```
Alas, this turned out to be much, much slower. However, instead of accumulating with `out[i]`, I tried accumulating the summation in an intermediate `result` variable:
```python
@njit(fastmath=True)
def stumpy_sliding_dot_product_V3(Q, T):
    m = Q.shape[0]
    l = T.shape[0] - m + 1
    out = np.empty(l)
    for i in range(l):
        result = 0.0
        for j in range(m):
            result += Q[j] * T[i + j]
        out[i] = result
    return out
```
Surprisingly, this `stumpy_sliding_dot_product_V3` was over 2x faster than the (now dethroned!) `numpy_sliding_dot_product` approach for small to medium sized array inputs AND also slightly faster than `oaconvolve_sliding_dot` for larger array inputs!
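For reference, comparisons like these can be reproduced with a minimal timing harness along the following lines (the array sizes are purely illustrative, and the `numba` variants are omitted here to keep the sketch dependency-light; they would be timed the same way after a warm-up call to trigger JIT compilation):

```python
import timeit
import numpy as np
from scipy.signal import oaconvolve

def numpy_sdp(Q, T):
    return np.convolve(np.ascontiguousarray(Q[::-1]), T, mode='valid')

def oaconvolve_sdp(Q, T):
    return oaconvolve(np.ascontiguousarray(Q[::-1]), T, mode='valid')

rng = np.random.default_rng(0)
Q = rng.standard_normal(100)
T = rng.standard_normal(1_000_000)

for fn in (numpy_sdp, oaconvolve_sdp):
    # Average over a few repetitions; absolute numbers are machine-dependent
    t = timeit.timeit(lambda: fn(Q, T), number=3)
    print(f"{fn.__name__}: {t / 3:.4f} s per call")
```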
My question is: why does using an intermediate `result` variable in `stumpy_sliding_dot_product_V3` improve the performance so much?