Numba Prange Not Working as Expected

Hello everyone,
I have been working on a new project and ran into an issue with Numba's prange feature. For the sake of reproducibility, I have added a simple example below: setting parallel=False on the Main function yields a faster result than parallel=True.

from numba import jit
import numba
import numpy as np


@jit(nopython=True)
def Run(val):
    N = 40
    for i in range(N):
        for i1 in range(N):
            for i2 in range(N):
                for i3 in range(N):
                    arr = np.asarray([1, 1])


@jit(nopython=True, parallel=True)
def Main(arr):
    for i in numba.prange(len(arr)):
        Run(arr[i])


arr = list(range(0, 30))
arr = np.asarray(arr)
Main(arr)

I believe prange is not leveraging the array slicing operation. Any help or suggestions on this would be appreciated.

cross-link to gitter conversation

I think the example you provide is probably a little too simple, since nothing that happens inside the Run function affects the final result: the input val isn't used, neither are the loop counters i<x>, and arr is always the same regardless of the iteration.

So the timings you see might be due to the non-parallel version having more effective optimizations, since optimizing is easier in the non-parallel case. I'm just speculating about that and have no idea what actually happens in either case.
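
One way to check is to ask Numba what its parallel transforms actually did. This is only a sketch; it assumes the Main function from your example has been compiled with parallel=True and called at least once:

# Report what the parallel backend did with the prange loop
# (level ranges from 1 to 4, with 4 being the most verbose)
Main(arr)
Main.parallel_diagnostics(level=4)

The same report can be produced by setting the NUMBA_PARALLEL_DIAGNOSTICS environment variable before running.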

In my experience, when creating a toy example, it's best to make sure the innermost code actually does some calculation, and that the result is returned or assigned. For example:

import numba
import numpy as np

@numba.njit
def Run(val):
    N = 40
    for i in range(N):
        for i1 in range(N):
            for i2 in range(N):
                for i3 in range(N):
                    val[0] += val[0]//(i3+1) + val[1]//(i2+1)

def Main(arr):
    for i in numba.prange(1, len(arr)):
        Run(arr[i-1:i+1])

arr = np.arange(30)  # same input array as in your example

# compile Main twice: once without and once with parallelization
main_serial = numba.njit(parallel=False)(Main)
main_parallel = numba.njit(parallel=True)(Main)

# compile once before running timeit
main_serial(arr.copy())
main_parallel(arr.copy())

%timeit main_serial(arr.copy())
%timeit main_parallel(arr.copy())

This makes the parallel case run about 3x faster for me:

1 s ± 21.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
314 ms ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

There will always be some overhead for the parallel case, so if the Run function becomes computationally cheap (e.g. with N=5), the parallel case ends up about 3x slower instead. Results will probably vary depending on how much parallelism your hardware offers (cores, threads, etc.).
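
If you want to see how much parallel hardware Numba will actually use, you can query and adjust the threading layer at runtime. A minimal sketch, assuming a Numba version recent enough to expose get_num_threads/set_num_threads:

import numba

# Default thread count chosen when Numba is imported
print(numba.config.NUMBA_NUM_THREADS)

# Threads currently used for prange loops; can be lowered for scaling tests
print(numba.get_num_threads())
numba.set_num_threads(2)

Re-timing the parallel version with different thread counts makes it easy to see whether the loop scales at all or is dominated by threading overhead.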

The above of course doesn’t rule out that prange is lacking some optimization.


I had a similar experience just now, with numba.prange not providing the expected speedup (only a factor of about 1.3 on a 4-core machine). With a little more testing, I find that I see a very small speedup on my laptop (even when freshly rebooted, with nothing else using significant CPU), while I get a more normal speedup on a Linux desktop (a factor of 5-6 on a 10-core machine).

Here is my code (a 1D diffusion solver):

import numpy as np
from numba import jit, prange

@jit(nopython=True, parallel=False)  # toggled between True and False for the comparison
def diffusion(Nt):
    alpha = 0.49
    x = np.linspace(0, 1, 100000000)
    # Initial condition
    C = 1/(0.25*np.sqrt(2*np.pi)) * np.exp(-0.5*((x-0.5)/0.25)**2)
    # Temporary work array
    C_ = np.zeros_like(C)
    # Loop over time (normal for-loop)
    for j in range(Nt):
        # Loop over array elements (space, parallel for-loop)
        for i in prange(1, len(C)-1):
            C_[i] = C[i] + alpha*(C[i+1] - 2*C[i] + C[i-1])
        C[:] = C_
    return C

# Run once to just-in-time compile
C = diffusion(1)

# Check timing
%timeit C = diffusion(10)
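
For reference, here is one way to run the same comparison outside of IPython, following the pattern from the earlier reply: compile a serial and a parallel version of the solver and time both with the standard timeit module. This is only a sketch; the function body is the diffusion solver above, just without the decorator:

import timeit
import numpy as np
from numba import njit, prange

def diffusion(Nt):
    alpha = 0.49
    x = np.linspace(0, 1, 100000000)
    # Initial condition
    C = 1/(0.25*np.sqrt(2*np.pi)) * np.exp(-0.5*((x-0.5)/0.25)**2)
    # Temporary work array
    C_ = np.zeros_like(C)
    for j in range(Nt):
        # prange falls back to an ordinary range when parallel=False
        for i in prange(1, len(C)-1):
            C_[i] = C[i] + alpha*(C[i+1] - 2*C[i] + C[i-1])
        C[:] = C_
    return C

# Compile the same function twice, with and without parallelization
diffusion_serial = njit(parallel=False)(diffusion)
diffusion_parallel = njit(parallel=True)(diffusion)

# Run once to just-in-time compile
diffusion_serial(1)
diffusion_parallel(1)

print(timeit.timeit(lambda: diffusion_serial(10), number=3))
print(timeit.timeit(lambda: diffusion_parallel(10), number=3))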