Understanding Sliding Dot Product Performance

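For reference, all of the snippets below assume the usual imports, roughly:

import numpy as np
from numba import njit
from scipy.signal import oaconvolve
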
When computing a sliding dot product between two 1D arrays, Q and T, one can naively do:

def naive_sliding_dot_product(Q, T):
    m = len(Q)
    l = len(T) - m + 1
    out = np.empty(l)
    for i in range(l):
        out[i] = np.dot(Q, T[i:i+m])
    return out

Since a sliding dot product is basically a convolution, one can do this computation quickly in pure numpy via:

def numpy_sliding_dot_product(Q, T):
    return np.convolve(np.ascontiguousarray(Q[::-1]), T, mode='valid')
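
As a quick sanity check (the array sizes here are just illustrative), the reversed-kernel convolution really does match the naive version:

Q = np.random.rand(50)
T = np.random.rand(10_000)
assert np.allclose(naive_sliding_dot_product(Q, T), numpy_sliding_dot_product(Q, T))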

In the STUMPY package, we’ve been using a numba njit version of the “naive” code, which gets about the same performance as the pure numpy version on small to medium array inputs:

@njit(fastmath=True)
def stumpy_sliding_dot_product(Q, T):
    m = Q.shape[0]
    l = T.shape[0] - m + 1
    out = np.empty(l)
    for i in range(l):
        out[i] = np.dot(Q, T[i : i + m])

    return out

This has served us well over the years. However, on larger array inputs, a pure numpy FFT overlap-add convolution (oaconvolve) reigns supreme and significantly outperforms both the numpy discrete convolution and stumpy versions above:

def oaconvolve_sliding_dot_product(Q, T):
    return oaconvolve(np.ascontiguousarray(Q[::-1]), T, mode='valid')

However, realizing that the stumpy version depends heavily upon np.dot, which may or may not use BLAS in the backend (I’m using an M2 Mac, so there’s no Intel chip at play here), I decided to swap out the np.dot for a simple explicit for-loop:

@njit(fastmath=True)
def stumpy_sliding_dot_product_V2(Q, T):
    m = Q.shape[0]
    l = T.shape[0] - m + 1
    out = np.zeros(l)
    for i in range(l):
        for j in range(m):
            out[i] += Q[j] * T[i + j]

    return out

Alas, this turned out to be much, much slower. So, instead of accumulating directly into out[i], I tried accumulating the sum in an intermediate result variable:

@njit(fastmath=True)
def stumpy_sliding_dot_product_V3(Q, T):
    m = Q.shape[0]
    l = T.shape[0] - m + 1
    out = np.empty(l)
    for i in range(l):
        result = 0.0
        for j in range(m):
            result += Q[j] * T[i + j]
        out[i] = result
    return out

Surprisingly, this stumpy_sliding_dot_product_V3 was over 2x faster than the (now dethroned!) numpy_sliding_dot_product approach for small to medium-sized array inputs AND also slightly faster than oaconvolve_sliding_dot_product for larger array inputs!
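
For anyone who wants to reproduce the comparison, a minimal timing harness looks roughly like this (the array sizes are illustrative, not the exact ones I benchmarked):

import timeit

def time_all(n, m):
    T = np.random.rand(n)
    Q = np.random.rand(m)
    funcs = (
        numpy_sliding_dot_product,
        stumpy_sliding_dot_product,
        oaconvolve_sliding_dot_product,
        stumpy_sliding_dot_product_V3,
    )
    for func in funcs:
        func(Q, T)  # warm up (and trigger njit compilation)
        t = min(timeit.repeat(lambda: func(Q, T), number=10, repeat=3)) / 10
        print(f"{func.__name__}: {t:.6f} s per call")

time_all(n=1_000_000, m=100)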

My question is “why does using an intermediate result variable in stumpy_sliding_dot_product_V3 improve the performance so much”?

I suspect it’s because the write to the real buffer only occurs once per i; using result means the accumulation can happen in a register, and maybe the CPU/compiler can optimize more aggressively. I’m guessing it’s doing some SIMD stuff, since the assembly looks quite linear?

I feel like parallel=True should work here but it doesn’t?

Here’s the assembly for both versions, for reference:

V2: assembly listing (Pastebin)
V3: assembly listing (Pastebin)


I guess it doesn’t make sense to me why writing to out[i] would be more expensive than writing to result. I would’ve thought that the cache locality is better for out[i].

I did try adding parallel=True but it’s actually slower because these computations finish quickly and adding additional threads has a high(er) overhead cost.
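
For completeness, the parallel variant I tried looked roughly like this (just prange over the outer loop):

from numba import njit, prange

@njit(fastmath=True, parallel=True)
def stumpy_sliding_dot_product_V3_parallel(Q, T):
    m = Q.shape[0]
    l = T.shape[0] - m + 1
    out = np.empty(l)
    for i in prange(l):  # each output element is independent
        result = 0.0
        for j in range(m):
            result += Q[j] * T[i + j]
        out[i] = result
    return out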

@hbina’s insight that result can be treated as a register is on point. Sometimes compilers refuse to apply what seem like obvious and reasonable optimizations (including, sometimes, vectorization) because there isn’t a guarantee that a memory address will be written to from exactly one source. It’s often easier to make these guarantees when you’re dealing with a local variable instead of heap memory like a numpy array.

One interesting case that I don’t think you can dig into with numba is that sometimes assembly code will contain what seem like pointless reads from addresses that were just written to. This can happen when the compiler cannot guarantee that some other part of a subroutine couldn’t have written to that address.

You should look up how the restrict annotation works in C/C++ compilers. I can’t say for certain that it holds the key to understanding what’s going on here, but it certainly may help you develop an intuition for some of the possibilities. Dot products are also a classic example of where that annotation can help speed up code.


Thank you for your comments and insights @hbina and @DannyWeitekamp. In case others were wondering, explicit loop unrolling made things worse and took about 4x as long:

@njit(fastmath=True)
def njit_unroll_sliding_dot(Q, T):
    m = len(Q)
    l = T.shape[0] - m + 1
    out = np.empty(l)

    n_unroll = 4
    # Process outputs in blocks of n_unroll, stopping before any leftover tail
    for i in range(0, l - l % n_unroll, n_unroll):
        result_0 = 0.0
        result_1 = 0.0
        result_2 = 0.0
        result_3 = 0.0
        for j in range(m):
            result_0 += Q[j] * T[i + j]
            result_1 += Q[j] * T[i + 1 + j]
            result_2 += Q[j] * T[i + 2 + j]
            result_3 += Q[j] * T[i + 3 + j]
        out[i] = result_0
        out[i + 1] = result_1
        out[i + 2] = result_2
        out[i + 3] = result_3

    # Handle any remaining iterations if `l` is not a multiple of `n_unroll`
    for i in range(l - l % n_unroll, l):
        result = 0.0
        for j in range(m):
            result += Q[j] * T[i + j]
        out[i] = result

    return out

Can you try removing the bounds check?
See the Environment variables section of the Numba documentation.

Yes, I tried that (when I experimented with parallel=True) by setting @njit(fastmath=True, boundscheck=False) but this neither increased nor decreased the compute time.
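
In other words, the decorator I tested was simply:

@njit(fastmath=True, boundscheck=False)
def stumpy_sliding_dot_product_V3(Q, T):
    ...  # same body as stumpy_sliding_dot_product_V3 above, only the decorator changes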