Hi all,
First post here, so please bear with me on any style mistakes. I originally posted this topic on the Numba GitHub issues page, but I'm hoping people here might be able to help as well. You can see my original post, with a much more extensive example, here: jit/njit parallelization of np.roll() · Issue #7842 · numba/numba · GitHub
I'm using Python/Numba/jit to prototype a CFD code, so I need to do element-wise arithmetic on large arrays, including shift operations to get neighboring values for estimating the spatial derivatives. When I use np.roll(), the parallel diagnostics show that the shift operations are not recognized as parallel loops; e.g., an operation like sarr = np.roll(arr, -1) is not given a loop number. I defined my own local version, roll(), using prange, and if I use that for the shift operations it is recognized as parallel, but those loops can't be fused with other operations, so the code is not efficient.
Below is a short code for comparison. It contains two versions of the same function, with different implementations of roll().
import numpy as np
from numba import njit, prange
import timeit

# version with a local roll() definition, which places arr[i] at index
# (i + shift) % arr.size so that it matches np.roll(arr, shift)
@njit(parallel=True)
def rs_roll_wrapper(arr):
    def roll(arr,shift):
        out = np.empty(arr.size, dtype=arr.dtype)
        for i in prange(arr.size):
            idx = (i + shift)%arr.size
            out[idx] = arr[i]
        return(out)

    temp1 = roll(arr,-1)
    temp1 *= arr
    temp2 = roll(arr,+1)
    temp2 = arr + temp1
    out = temp2 * temp2
    return(out)
# version with np.roll()
@njit(parallel=True)
def np_roll_wrapper(arr):
    roll = np.roll
    temp1 = roll(arr,-1)
    temp1 *= arr
    temp2 = roll(arr,+1)
    temp2 = arr + temp1
    out = temp2 * temp2
    return(out)
And here's a script to test them:
# warm-up calls to trigger compilation, then timed runs on larger arrays
np_roll_wrapper(np.arange(int(1e2)))
stime = timeit.default_timer()
np_roll_wrapper(np.arange(int(1e6)))
print(timeit.default_timer() - stime)

rs_roll_wrapper(np.arange(int(1e2)))
stime = timeit.default_timer()
rs_roll_wrapper(np.arange(int(1e6)))
print(timeit.default_timer() - stime)
which outputs (the np.roll() version first, my version second):
0.013500430999556556
0.00415644699933182
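For anyone who wants to reproduce listings like the ones below: once the functions have been compiled (e.g. by the runs above), Numba can print its parallel transforms via the parallel_diagnostics method; something like this, where level=4 is the most verbose setting:

np_roll_wrapper.parallel_diagnostics(level=4)
rs_roll_wrapper.parallel_diagnostics(level=4)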
Looking at the parallel diagnostics, the two optimized versions end up looking like this.

The NumPy version:
Parallel loop listing for Function np_roll_wrapper, <ipython-input-84-57f1e14f0d7f> (18)
-----------------------------|loop #ID
@njit(parallel=True)         |
def np_roll_wrapper(arr):    |
                             |
    roll = np.roll           |
                             |
    temp1 = roll(arr,-1)     |
    temp1 *= arr-------------| #1740
    temp2 = roll(arr,+1)     |
    temp2 = arr + temp1------| #1738
    out = temp2 * temp2------| #1739
                             |
    return(out)              |
--------------------------------- Fusing loops ---------------------------------
Attempting fusion of parallel loops (combines loops with similar properties)...
Fused loop summary:
+--1738 has the following loops fused into it:
+--1739 (fused)
Following the attempted fusion of parallel for-loops there are 2 parallel for-
loop(s) (originating from loops labelled: #1740, #1738).
My version:
Parallel loop listing for Function rs_roll_wrapper, <ipython-input-84-57f1e14f0d7f> (1)
--------------------------------------------------------|loop #ID
@njit(parallel=True)                                    |
def rs_roll_wrapper(arr):                               |
    def roll(arr,shift):                                |
        out = np.empty(arr.size, dtype=arr.dtype)       |
        for i in prange(arr.size): ---------------------| #1744, 1743
            idx = (i + shift)%arr.size                  |
            out[idx] = arr[i]                           |
        return(out)                                     |
                                                        |
    temp1 = roll(arr,-1)                                |
    temp1 *= arr----------------------------------------| #1745
    temp2 = roll(arr,+1)                                |
    temp2 = arr + temp1---------------------------------| #1741
    out = temp2 * temp2---------------------------------| #1742
                                                        |
    return(out)                                         |
--------------------------------- Fusing loops ---------------------------------
Attempting fusion of parallel loops (combines loops with similar properties)...
Fused loop summary:
+--1741 has the following loops fused into it:
+--1742 (fused)
Following the attempted fusion of parallel for-loops there are 4 parallel for-
loop(s) (originating from loops labelled: #1744, #1745, #1743, #1741).
So, the first key difference is that in my version each call to roll() implies a loop inside the function definition, while the same is not true for np.roll(). Does that mean that np.roll() is not a parallel loop, or that it is not a loop at all? And if the NumPy version ends up with fewer parallel loops (2 vs. 4), why is it about three times slower in the timings above?
The second issue is systemic to both versions: every call to roll(), whichever implementation is used, cannot be fused with other loops. So even though the loops are all the same size and involve at most basic arithmetic, execution is slowed by the work being broken up into separate loops. By comparison, if I explicitly rewrite everything in terms of indices instead of using vector notation, I can do it all in a single loop that is approximately 4 times faster (being 1 loop instead of 4); see the sketch below.
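For concreteness, here is roughly what the explicitly indexed version looks like. This is just a sketch (fused_wrapper is a placeholder name of mine), assuming the same arithmetic as the wrappers above, with the whole computation in one prange loop:

@njit(parallel=True)
def fused_wrapper(arr):
    n = arr.size
    out = np.empty(n, dtype=arr.dtype)
    for i in prange(n):
        nb = arr[(i + 1) % n]     # the value np.roll(arr, -1) places at index i
        t = arr[i] + nb * arr[i]  # temp1 = roll(arr,-1)*arr; temp2 = arr + temp1
        out[i] = t * t            # out = temp2 * temp2
        # note: the roll(arr,+1) result in the wrappers is immediately
        # overwritten by temp2 = arr + temp1, so it drops out here
    return out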
So, that's all to say: has anybody worked on or implemented a version of np.roll() that is natively parallelizable and can be fused with other loops?