Hi all, I have a 2-dimensional NumPy array and am trying to apply a rolling window to each column separately. The following snippet shows some strange performance:

```
from numba import njit
import numpy as np

@njit
def rolling_apply_1d_nb(out, a, window):
    for i in range(a.shape[0]):
        from_i = max(0, i + 1 - window)
        to_i = i + 1
        window_a = a[from_i:to_i]
        out[i] = np.sum(window_a + 1)
    return out

@njit
def rolling_apply_nb(a, window):
    out = np.empty_like(a, dtype=np.float_)
    for col in range(a.shape[1]):
        rolling_apply_1d_nb(out[:, col], a[:, col], window)
    return out

a = np.random.uniform(size=(1000, 1000))
%timeit rolling_apply_nb(a, 10)
98.9 ms ± 2.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Now, if you remove the addition of one inside `np.sum`, the execution time drops to 4 ms. Any ideas?
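To make the comparison concrete, here is a minimal sketch of the 1-D step in both forms, written in plain NumPy with `@njit` left off so it runs without Numba. The function names are hypothetical and not part of my original code. The only certain point it illustrates is that `window_a + 1` creates a new temporary array on every iteration, while summing the slice and adding its length as a scalar gives the same value without that temporary. Whether this accounts for the timing gap is an assumption I have not verified.

```python
import numpy as np

def rolling_sum_plus_one(a, window):
    # Form used in the post: `window_a + 1` builds a temporary array each step.
    out = np.empty(a.shape[0], dtype=np.float64)
    for i in range(a.shape[0]):
        window_a = a[max(0, i + 1 - window):i + 1]
        out[i] = np.sum(window_a + 1)
    return out

def rolling_sum_scalar_offset(a, window):
    # Hypothetical rewrite: sum the slice (a view), then add the element
    # count as a scalar, since sum(w + 1) == sum(w) + len(w).
    out = np.empty(a.shape[0], dtype=np.float64)
    for i in range(a.shape[0]):
        window_a = a[max(0, i + 1 - window):i + 1]
        out[i] = np.sum(window_a) + window_a.size
    return out

col = np.random.uniform(size=50)
# Both forms should agree up to floating-point rounding.
assert np.allclose(rolling_sum_plus_one(col, 10), rolling_sum_scalar_offset(col, 10))
```

Timing the two forms under `@njit` would show whether the temporary array is what matters here.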

If the addition is a costly operation and this performance is somehow justifiable, I’m curious why the above snippet with `parallel=True` runs more than twice as slow as the version without parallelization:

```
from numba import njit, prange
import numpy as np

@njit
def rolling_apply_1d_nb(out, a, window):
    for i in range(a.shape[0]):
        from_i = max(0, i + 1 - window)
        to_i = i + 1
        window_a = a[from_i:to_i]
        out[i] = np.sum(window_a + 1)
    return out

@njit(parallel=True)
def rolling_apply_nb(a, window):
    out = np.empty_like(a, dtype=np.float_)
    for col in prange(a.shape[1]):
        rolling_apply_1d_nb(out[:, col], a[:, col], window)
    return out

a = np.random.uniform(size=(1000, 1000))
%timeit rolling_apply_nb(a, 10)
235 ms ± 4.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Without the addition, the parallel version now takes 1 ms.