Unstable performance of arithmetic ops inside a rolling window

Hi all, I have a 2-dimensional numpy array and trying to apply a rolling window on each column separately. The following snippet gives me some strange performance:

from numba import njit
import numpy as np

@njit
def rolling_apply_1d_nb(out, a, window):
    for i in range(a.shape[0]):
        from_i = max(0, i + 1 - window)
        to_i = i + 1
        window_a = a[from_i:to_i]
        out[i] = np.sum(window_a + 1)
    return out


@njit
def rolling_apply_nb(a, window):
    out = np.empty_like(a, dtype=np.float_)
    for col in range(a.shape[1]):
        rolling_apply_1d_nb(out[:, col], a[:, col], window)
    return out

a = np.random.uniform(size=(1000, 1000))
%timeit rolling_apply_nb(a, 10)
98.9 ms ± 2.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now, if you remove the addition of one inside np.sum, the execution time drops to 4ms. Any ideas?

If the addition is a costly operation and this performance is somehow justifiable, I’m curious why the above snippet using parallel=True executes twice slower than without parallelization:

from numba import njit, prange
import numpy as np

@njit
def rolling_apply_1d_nb(out, a, window):
    for i in range(a.shape[0]):
        from_i = max(0, i + 1 - window)
        to_i = i + 1
        window_a = a[from_i:to_i]
        out[i] = np.sum(window_a + 1)
    return out


@njit(parallel=True)
def rolling_apply_nb(a, window):
    out = np.empty_like(a, dtype=np.float_)
    for col in prange(a.shape[1]):
        rolling_apply_1d_nb(out[:, col], a[:, col], window)
    return out

a = np.random.uniform(size=(1000, 1000))
%timeit rolling_apply_nb(a, 10)
235 ms ± 4.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

While without addition it now takes 1ms.

Hi @polakowo

Welcome on board :slight_smile:

I am not certain, but I have an idea what could be going on here.
When you add 1 to window_a, new memory has to be allocated for the result.
That amounts to 1 Million array allocations for a single function call. (Since you do this for every element in the input array). All the other lines of code that are executed in the inner loop can be stack allocated or simply use existing buffers as far as I can tell.

In this example you could probably get away by manually “pulling the +1” out of the sum. Since you know how long the window is you can just write it out explicitly to prevent heap allocation for the intermediate result.

Maybe someone else can back me up here.

If you are working with rolling windows, maybe numba’s stencil decorator could be interesting to you? See here: Using the @stencil decorator — Numba 0.54.0+0.g5888895d3.dirty-py3.7-linux-x86_64.egg documentation

Hi @Hannes, thank you for the reply. The addition op indeed appears to unnecessarily allocate memory. Any ideas why the parallelization takes twice longer? Even though we create an array for each element, the outermost loop that iterates over the column axis should still offer some speedup… The new @stencil indicator is promising but doesn’t fit well into my package where most calculations require an array and cannot be expressed using relative indexing, unfortunately.

Mhmm, I ran the following test, but the numbers are a bit inconclusive.
There seems to be some sweet spot where the parallel version is faster.
Threading comes with some overhead. So it is somewhat expected to perform worse for very small problems. But I don’t know why it is slower than the serial execution for the huge test case. Maybe the different threads are competing for CPU cache?
(Could also be that I started with full power and then the CPU throttled due to temperatures going bazookas)

NB! When timing numba functions you should always trigger their compilation first by calling them once, to avoid the compilation time being measured by timeit

from numba import njit, prange
import numpy as np

@njit
def rolling_apply_1d_nb(out, a, window):
    for i in range(a.shape[0]):
        from_i = max(0, i + 1 - window)
        to_i = i + 1
        window_a = a[from_i:to_i]
        out[i] = np.sum(window_a) + window_a.shape[0]
    return out



def rolling_apply_nb(a, window):
    out = np.empty_like(a, dtype=np.float_)
    for col in prange(a.shape[1]):
        rolling_apply_1d_nb(out[:, col], a[:, col], window)
    return out

fun_serial = njit(parallel=False)(rolling_apply_nb)
fun_parallel = njit(parallel=True)(rolling_apply_nb)

# Trigger compilation
fun_serial(np.random.uniform(size=(20,20)), 10)
fun_parallel(np.random.uniform(size=(20,20)), 10)

for n in (10, 100, 1000, 10000):
    a = np.random.uniform(size=(n,n))
    print(f"{n = }")
    print("Serial: ", end="")
    %timeit fun_serial(a, 10)
    print("Parallel: ", end="")
    %timeit fun_parallel(a, 10)

yields

n = 10
Serial: 1.3 µs ± 34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Parallel: 143 µs ± 2.77 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
n = 100
Serial: 68.5 µs ± 620 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Parallel: 76.9 µs ± 1.25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
n = 1000
Serial: 7.8 ms ± 32.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Parallel: 2.94 ms ± 611 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
n = 10000
Serial: 1.48 s ± 84.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Parallel: 1.96 s ± 255 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Yeah, that’s what I’m experiencing too - medium-sized arrays perform best. I will try to parallelize using Ray or some other multi-threading backend and compare the results. But anyway, thank you for reminding me to avoid creating new arrays all over the place; temporary arrays do the trick.

I would also checkout the guvectorize decorator instead of njit. It looks like it would fit your use case. In my experience that always gives better parallelization compared to njit+prange (which is more flexible). And it additionally also gives a lot of extra features related to ufuncs, like specifying the axis over which to parallelize (which also removes the need for the “wrapper” function you’re currently using).

https://numba.pydata.org/numba-doc/dev/user/vectorize.html#the-guvectorize-decorator

For example something like:

from numba import guvectorize

@guvectorize(
    [
        "void(float32[:], int16, float32[:])",
        "void(float64[:], int16, float64[:])",
    ],
    "(n),()->(n)", target="parallel", nopython=True)
def rolling_apply_guvec(a, window, out):

    for i in range(a.size):
        from_i = max(0, i + 1 - window)
        to_i = i + 1
        window_a = a[from_i:to_i]
        out[i] = np.sum(window_a + 1)
        
a = np.random.uniform(size=(1000, 1000))
out = np.empty_like(a, dtype=np.float32)

rolling_apply_guvec(a, 10, out, axis=0)

You could also have a look at the Numbagg package, even if it doesn’t fit your use case out of the box, it might still provide some inspiration on how to implement rolling functions etc.

Making sure your computation is done one float32, can also help a bit, if the reduced accuracy is not a problem of course.