# Numba Prange Not Working as Expected

Hello everyone.
I have been working on a new project and I found an issue with Numba's `prange` feature.
For the sake of reproducibility, I have added a simple example below. Setting `parallel=False` on the `Main` function yields a faster result than `parallel=True`.

```python
from numba import jit, prange
import numpy as np

@jit(nopython=True)
def Run(val):
    N = 40
    for i in range(N):
        for i1 in range(N):
            for i2 in range(N):
                for i3 in range(N):
                    arr = np.asarray([1, 1])

@jit(nopython=True, parallel=True)
def Main(arr):
    for i in prange(len(arr)):
        Run(arr[i])

arr = list(range(0, 30))
arr = np.asarray(arr)
Main(arr)
```

I believe `prange` is not leveraging the array slicing operation. Any help/suggestions on this?

I think the example you provided is probably a little too simple, since nothing that affects the final result actually happens within the `Run` function. The input `val` isn't used, neither are the loop counters `i<x>`, and `arr` is always the same regardless of the iteration.

So the timings you experience might be due to the non-parallel version benefiting from more effective optimizations, since those are easier to apply in the non-parallel case. I'm just speculating about that, and have no idea what actually happens in either case.

In my experience, when creating a toy example, it's best to make sure the innermost code actually does some calculation, and that the result is returned/assigned. For example:

```python
import numba
import numpy as np

@numba.njit
def Run(val):
    N = 40
    for i in range(N):
        for i1 in range(N):
            for i2 in range(N):
                for i3 in range(N):
                    val += val // (i3 + 1) + val // (i2 + 1)

def Main(arr):
    for i in numba.prange(1, len(arr)):
        Run(arr[i-1:i+1])

main_no_par = numba.njit(parallel=False)(Main)
main_par = numba.njit(parallel=True)(Main)

arr = np.asarray(list(range(0, 30)))

# compile once before running timeit
main_no_par(arr.copy())
main_par(arr.copy())

%timeit main_no_par(arr.copy())
%timeit main_par(arr.copy())
```

This makes the parallel case run about 3x faster for me:

```
1 s ± 21.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
314 ms ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

There will always be some overhead for the parallel case, so if the `Run` function becomes computationally light (e.g. `N=5`), the parallel case instead becomes 3x slower. Results will probably vary depending on the amount of parallelization your hardware can do (cores/threads etc.).
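That trade-off can be sketched with a toy cost model (the function name and all numbers here are hypothetical, just to illustrate the break-even behaviour, not Numba's actual overhead):

```python
def model_speedup(t_work, t_overhead, n_cores):
    """Estimated speedup when work t_work is split over n_cores,
    with a fixed parallelization overhead t_overhead (toy model)."""
    serial = t_work
    parallel = t_work / n_cores + t_overhead
    return serial / parallel

# Heavy work: overhead is negligible, speedup approaches the core count
print(model_speedup(1.0, 0.01, 4))    # close to 4x

# Light work: overhead dominates, "parallel" is slower than serial
print(model_speedup(0.001, 0.01, 4))  # well below 1x
```

This is why the heavy `N=40` version benefits from `prange` while the light `N=5` version regresses.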

The above of course doesn’t rule out that prange is lacking some optimization.


I had a similar experience just now, with `numba.prange` not providing the expected speedup (a factor of 1.3 on a 4-core machine). With a little more testing, I find that I see a very small speedup on my laptop (even when freshly rebooted, with nothing else using significant CPU), while I get a more normal speedup on a Linux desktop (a factor of 5-6 on a 10-core machine).

Here is my code (a 1D diffusion solver):

```python
import numpy as np
from numba import jit, prange

@jit(nopython=True, parallel=False)
def diffusion(Nt):
    alpha = 0.49
    x = np.linspace(0, 1, 100000000)
    # Initial condition
    C = 1/(0.25*np.sqrt(2*np.pi)) * np.exp(-0.5*((x-0.5)/0.25)**2)
    # Temporary work array
    C_ = np.zeros_like(C)
    # Loop over time (normal for-loop)
    for j in range(Nt):
        # Loop over array elements (space, parallel for-loop)
        for i in prange(1, len(C)-1):
            C_[i] = C[i] + alpha*(C[i+1] - 2*C[i] + C[i-1])
        C[:] = C_
    return C

# Run once to just-in-time compile
C = diffusion(1)

# Check timing
%timeit C = diffusion(10)
```
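On the earlier remark about array slicing: the inner spatial loop in this stencil can also be written as a plain NumPy slicing operation, which makes a useful correctness baseline to compare the `prange` version against. A minimal sketch (the function name and the smaller default grid size are my own, chosen so it runs quickly):

```python
import numpy as np

def diffusion_vectorized(Nt, Nx=100_000):
    """Same diffusion update as above, expressed with array slicing."""
    alpha = 0.49
    x = np.linspace(0, 1, Nx)
    # Initial condition: a Gaussian centred at x = 0.5
    C = 1/(0.25*np.sqrt(2*np.pi)) * np.exp(-0.5*((x-0.5)/0.25)**2)
    for j in range(Nt):
        # Fresh work array, so the boundaries end up zero as in the loop version
        C_ = np.zeros_like(C)
        # One diffusion step over the interior points, via slicing
        C_[1:-1] = C[1:-1] + alpha*(C[2:] - 2*C[1:-1] + C[:-2])
        C = C_
    return C
```

Since the right-hand side is evaluated before assignment, the slice expression reads only old values, matching the explicit temporary array in the loop version.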