Understanding Sliding Dot Product Performance

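For reference, all of the snippets below assume the usual imports, roughly:

import numpy as np
from numba import njit
from scipy.signal import oaconvolve
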
When computing a sliding dot product between two 1D arrays, Q and T, one can naively do:

def naive_sliding_dot_product(Q, T):
    m = len(Q)
    l = len(T) - m + 1
    out = np.empty(l)
    for i in range(l):
        out[i] = np.dot(Q, T[i:i+m])
    return out

Since a sliding dot product is basically a convolution, one can do this computation quickly in pure numpy via:

def numpy_sliding_dot_product(Q, T):
    return np.convolve(np.ascontiguousarray(Q[::-1]), T, mode='valid')
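
As a quick sanity check (the array sizes here are just illustrative), the reversed-kernel convolution really does match the naive version:

Q = np.random.rand(50)
T = np.random.rand(10_000)
assert np.allclose(naive_sliding_dot_product(Q, T), numpy_sliding_dot_product(Q, T))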

In the STUMPY package, we’ve been using a numba njit version of the “naive” code, which gets about the same performance as the pure numpy version on small to medium array inputs:

@njit(fastmath=True)
def stumpy_sliding_dot_product(Q, T):
    m = Q.shape[0]
    l = T.shape[0] - m + 1
    out = np.empty(l)
    for i in range(l):
        out[i] = np.dot(Q, T[i : i + m])

    return out

This has served us well over the years. However, on larger array inputs, a pure numpy FFT overlap-add convolution (oaconvolve) reigns supreme and significantly outperforms both the numpy discrete convolution and stumpy versions above:

def oaconvolve_sliding_dot_product(Q, T):
    return oaconvolve(np.ascontiguousarray(Q[::-1]), T, mode='valid')

However, realizing that the stumpy version depends heavily upon np.dot, which may or may not use BLAS in the backend (I’m using an M2 Mac, so there’s no Intel chip at play here), I decided to swap out the np.dot for a simple explicit for-loop:

@njit(fastmath=True)
def stumpy_sliding_dot_product_V2(Q, T):
    m = Q.shape[0]
    l = T.shape[0] - m + 1
    out = np.zeros(l)
    for i in range(l):
        for j in range(m):
            out[i] += Q[j] * T[i + j]

    return out

Alas, this turned out to be much, much slower. So, instead of accumulating directly into out[i], I tried accumulating the sum in an intermediate result variable:

@njit(fastmath=True)
def stumpy_sliding_dot_product_V3(Q, T):
    m = Q.shape[0]
    l = T.shape[0] - m + 1
    out = np.empty(l)
    for i in range(l):
        result = 0.0
        for j in range(m):
            result += Q[j] * T[i + j]
        out[i] = result
    return out

Surprisingly, this stumpy_sliding_dot_product_V3 was over 2x faster than the (now dethroned!) numpy_sliding_dot_product approach for small to medium-sized array inputs AND also slightly faster than oaconvolve_sliding_dot_product for larger array inputs!
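
For anyone who wants to reproduce the comparison, a minimal timing harness looks roughly like this (the array sizes are illustrative, not the exact ones I benchmarked):

import timeit

def time_all(n, m):
    T = np.random.rand(n)
    Q = np.random.rand(m)
    funcs = (
        numpy_sliding_dot_product,
        stumpy_sliding_dot_product,
        oaconvolve_sliding_dot_product,
        stumpy_sliding_dot_product_V3,
    )
    for func in funcs:
        func(Q, T)  # warm up (and trigger njit compilation)
        t = min(timeit.repeat(lambda: func(Q, T), number=10, repeat=3)) / 10
        print(f"{func.__name__}: {t:.6f} s per call")

time_all(n=1_000_000, m=100)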

My question is “why does using an intermediate result variable in stumpy_sliding_dot_product_V3 improve the performance so much”?

I suspect it’s because the write to the real buffer only occurs once per i; using result means the accumulation can happen in a register, and maybe the CPU/compiler can optimize more aggressively. I’m guessing it’s doing some SIMD stuff, since the assembly looks quite linear?

I feel like parallel=True should work here but it doesn’t?

Here’s the assembly for both versions, for reference:

V2: assembly listing (Pastebin)
V3: assembly listing (Pastebin)


I guess it doesn’t make sense to me why writing to out[i] would be more expensive than writing to result. I would’ve thought that the cache locality is better for out[i].

I did try adding parallel=True but it’s actually slower because these computations finish quickly and adding additional threads has a high(er) overhead cost.
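
For completeness, the parallel variant I tried looked roughly like this (just prange over the outer loop):

from numba import njit, prange

@njit(fastmath=True, parallel=True)
def stumpy_sliding_dot_product_V3_parallel(Q, T):
    m = Q.shape[0]
    l = T.shape[0] - m + 1
    out = np.empty(l)
    for i in prange(l):  # each output element is independent
        result = 0.0
        for j in range(m):
            result += Q[j] * T[i + j]
        out[i] = result
    return out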

@hbina’s insight that result can be treated as a register is on point. Sometimes compilers refuse to apply what seem like obvious and reasonable optimizations (including, sometimes, vectorization) because there isn’t a guarantee that a memory address will be written to from exactly one source. It’s often easier to make these guarantees when you’re dealing with a local variable instead of heap memory like a numpy array.

One interesting case that I don’t think you can dig into with numba is that sometimes assembly code will contain what seem like pointless reads from addresses that were just written to. This can happen when the compiler cannot guarantee that some other part of a subroutine couldn’t have written to that address.

You should look up how the restrict annotation works in C/C++ compilers. I can’t say for certain that it holds the key to understanding what’s going on here, but it certainly may help you develop an intuition for some of the possibilities. Dot products are also a classic example of where that annotation can help speed up code.


Thank you for your comments and insights @hbina and @DannyWeitekamp. In case others were wondering, explicit loop unrolling made things worse and took about 4x as long:

@njit(fastmath=True)
def njit_unroll_sliding_dot(Q, T):
    m = len(Q)
    l = T.shape[0] - m + 1
    out = np.empty(l)

    n_unroll = 4
    # Process outputs in blocks of n_unroll, stopping before any leftover tail
    for i in range(0, l - l % n_unroll, n_unroll):
        result_0 = 0.0
        result_1 = 0.0
        result_2 = 0.0
        result_3 = 0.0
        for j in range(m):
            result_0 += Q[j] * T[i + j]
            result_1 += Q[j] * T[i + 1 + j]
            result_2 += Q[j] * T[i + 2 + j]
            result_3 += Q[j] * T[i + 3 + j]
        out[i] = result_0
        out[i + 1] = result_1
        out[i + 2] = result_2
        out[i + 3] = result_3

    # Handle any remaining iterations if `l` is not a multiple of `n_unroll`
    for i in range(l - l % n_unroll, l):
        result = 0.0
        for j in range(m):
            result += Q[j] * T[i + j]
        out[i] = result

    return out

Can you try removing the bounds check?
See the Environment variables section of the Numba documentation.

Yes, I tried that (when I experimented with parallel=True) by setting @njit(fastmath=True, boundscheck=False) but this neither increased nor decreased the compute time.
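
In other words, the decorator I tested was simply:

@njit(fastmath=True, boundscheck=False)
def stumpy_sliding_dot_product_V3(Q, T):
    ...  # same body as stumpy_sliding_dot_product_V3 above, only the decorator changes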