Help improving performance of embarrassingly parallel loop

I am trying to understand how numba manages threads when parallelizing, especially when dealing with loops that call functions of numpy arrays.

I have started with a simple example code:

import numpy as np
from numba import njit, prange, get_num_threads
import time

@njit
def f(i, r):
  t = i * r
  for j in range(int(1e5)):
    t = (t * r) % (i+1)
  return 1.1, t**2

@njit(parallel=True)
#@njit(parallel=False)
def iu_loop(rs, a):

  bad = 101

  for i in prange(len(rs)):

    if i==bad:
        continue

    temp = f(i, rs[i, 0])  # pass a scalar element; the second example passes the whole array

    a[0] *= temp[0]
    a[1] += temp[1]

  return a

a = np.array([1., 0.])
rs = np.ones((1,1))
_ = iu_loop(rs, a)  # warm-up call to trigger compilation

a = np.array([1., 0.])
rs = np.ones((1000,1000))

print("threads=", get_num_threads())

t_start = time.time()
result = iu_loop(rs, a)
t_end = time.time()
run_time = t_end - t_start
print(f"computed result in: {round(run_time, 3)}s")
print(result)

This works very well and as expected: running the code with parallel=True and 8 threads reduces computation time by almost exactly a factor of 8, which is what I would expect given that the prange'd loop is embarrassingly parallel.

Now I’ve tried replacing the argument of f with an array:

import numpy as np
from numba import njit, prange, get_num_threads
import time

@njit
def f(i, rs):
  r = np.sum(rs)
  t = i * r
  for j in range(int(1e5)):
    t = (t * r) % (i+1)
  return 1.1, t**2

@njit(parallel=True)
#@njit(parallel=False)
def iu_loop(rss, a):

  bad = 101

  for i in prange(len(rss)):

    if i==bad:
        continue

    temp = f(i, rss)

    a[0] *= temp[0]
    a[1] += temp[1]

  return a

a = np.array([1., 0.])
rss = np.ones((1,1))
_ = iu_loop(rss, a)

a = np.array([1., 0.])
rss = np.ones((1000,1000))

print("threads=", get_num_threads())

t_start = time.time()
result = iu_loop(rss, a)
t_end = time.time()
run_time = t_end - t_start
print(f"computed result in: {round(run_time, 3)}s")
print(result)

The parallel=True code is still faster, but the increase in speed is only a factor of 3 instead of 8. In a more complicated setting, which is what prompted me to ask the question in the first place, I am experiencing almost no speed increase at all.

I feel that I am missing something crucial here since, in the second case, I would still expect the same speedup given that the code is no less embarrassingly parallel than before. I’ve also run the parallel_diagnostics function but I have trouble understanding the information it provides.

Essentially I would like to get an understanding of what is happening, and hopefully find a way to fix the code in the second case to achieve the full speed increase :slight_smile:

Hey @Wlos ,

Numba uses compiler optimization levels to translate Python code into efficient machine code. The default optimization level (NUMBA_OPT=3) includes optimizations such as loop unrolling, function inlining, vectorization etc. which can significantly influence the performance of your code out of the box.
Parts of your real-life code might already be parallelized…

In addition to these optimizations, you can set parallel=True to enable parallel execution of loops across multiple threads. It is important, though, to consider the overhead introduced by parallelization: thread start-up, scheduling, and synchronization can limit the scalability of parallel execution, and predicting the net effect in advance can be difficult.

In your scenario the loops are embarrassingly parallel, so there should be a significant speedup.
In my test run, however, the overhead seems to diminish only up to a point, with the ratio of single- vs. multi-core runtime peaking at around 3 (with 4 cores available). It is possible that at some point memory constraints become a limiting factor, causing the ratio to plateau.

Have you tried increasing the array size? Maybe you have not reached the peak ratio yet.

[Figure: single- vs. multi-core speedup ratio as a function of array size]


Hi, thank you for your reply! I appreciate the effort of testing the code against different array input sizes and it is definitely interesting to see it laid out like that. I still have many questions, however, and I would like to figure out what is happening in more detail.

The default optimization level (NUMBA_OPT=3) includes optimizations such as loop unrolling, function inlining, vectorization etc. which can significantly influence the performance of your code out of the box.
Parts of your real-life code might already be parallelized…

I find this slightly confusing, so just to clarify: are you saying that, even if I never set parallel=True anywhere in my code, the program will still do some parallelization under the hood and use multiple threads? If this were the case, I would expect it to happen in the first example as well.

I understand that overhead (as a generic, vague concept) can get in the way. What I have trouble understanding is why the result is so different in the two cases; I would have expected overhead to be significant in the first example too, where instead I observe the 8x speedup. I can assume that this is due to having a two-dimensional array in the second example and that this complicates things somehow (maybe the optimizations in the single-thread case become more aggressive?). If that is the case, I would definitely like to understand why it becomes an issue. Presently, though, I don't see why it should not be possible to simply split the iterations between threads just like in the first case.

Turning it around, I would expect that it should be possible to reduce the overhead to just “assigning iterations to different threads” which in this case seems like the most direct way to have a speed improvement. If numba is doing other things under the hood, they don’t seem to be working as well. Then, what other factors are coming into play? And, if taking control of the parallelization process more closely is necessary, what is the right way to go about it?

Hey @Wlos,

  1. Numba’s default optimization level includes techniques like function inlining, loop unrolling, and vectorization (SIMD), which can significantly enhance code performance without(!) explicit parallelization.
    Frequently Asked Questions — Numba documentation

  2. Overheads associated with parallelized array operations can indeed be larger than for scalar operations, due to the extra communication and synchronization among threads. However, for sufficiently large data sizes the benefits of parallelization can outweigh these overheads.

  3. Numba’s primary role is to provide LLVM with enough information to compile efficient machine code. It does nothing “under the hood” that slows down the execution of compiled functions.

  4. As an experiment, you could set the environment variable NUMBA_OPT to 0 to reduce automatic optimizations. I would expect the single- vs. multi-core ratio to increase.
    Environment variables — Numba documentation

Happy coding!

Thank you again for your reply!

From the page you linked at point 1 I read that Numba can automatically parallelize code in two cases:

  • Ufuncs and gufuncs with the target=“parallel” option will run on multiple threads.
  • The parallel=True option to @jit will attempt to optimize array operations and run them in parallel. It also adds support for prange() to explicitly parallelize a loop.

Then I guess it would be possible that some implicit parallelization is happening with that np.sum function in the second example… somehow.

I will run some tests as you suggested and see what comes out of it.

Many numba/numpy functions (like sum, dot) are parallelized by default, if I remember correctly, unrelated to parallel=True. There are a handful of environment variables that can be used to disable that.

In my experience this is a common cause of inaccurate benchmarking comparisons.

Have you tried enclosing the body of the loop in an `if i != bad:` block instead of using continue? Of course this would be functionally equivalent and ought to boil down to the same compiled assembly, but I wonder if the semantics of continue are somehow keeping the parallelization pass from doing the right thing.

Edit: whoops, it looks like that bit is used in both cases, so perhaps that isn't the secret here.

Another factor that can slow down parallelization, and may be playing a role here, is that the different threads compete for L1 cache resources. This is perhaps more pronounced in the second case, because the .sum() call streams the whole rss array through the cache on every iteration, whereas in the original case the innermost loop that updates t can mostly work out of thread-private registers.

Here is an experiment:

In the original "iu_loop (parallel)" function, the sum of the rss array is recalculated inside f on every iteration of the parallel loop, which results in redundant computation.
You can optimize the function as in "iu_loop_par_2", where the sum of the rss array is computed only once, outside the parallel loop, and stored in the variable r.

By eliminating the redundant summation in each iteration, iu_loop_par_2 removes some overhead and should produce more efficient code, so you would expect it to be faster.
However, the difference in performance does not become apparent immediately, because compilers can sometimes mask such inefficiencies through additional optimizations and extra machine code generation.

As the data size increases, the advantage of the optimization in iu_loop_par_2 becomes more obvious.
The reduced overhead of summing the array only once helps to improve scalability and performance compared to the original iu_loop, where the sum is recalculated in each iteration.

The same applies to the original question: the single-threaded function might be partially vectorized/optimized, especially for smaller data sizes.

@njit
def f_2(i, r):
    t = i * r
    for j in range(100_000):
        t = (t * r) % (i+1)
    return 1.1, t**2

@njit
def iu_loop_2(rss, a):
    bad = 101
    r = np.sum(rss)  # Calculate r only once
    for i in range(len(rss)):
        if i == bad:
            continue
        tmp = f_2(i, r)
        a[0] *= tmp[0]
        a[1] += tmp[1]
    return a

@njit(parallel=True)
def iu_loop_par_2(rss, a):
    bad = 101
    r = np.sum(rss)  # Calculate r only once
    for i in prange(len(rss)):
        if i == bad:
            continue
        tmp = f_2(i, r)
        a[0] *= tmp[0]
        a[1] += tmp[1]
    return a

Surprisingly, the inefficient code shows the same performance as the more efficient version (single- and multithreaded). Only at some point, with increasing data size, does the performance of the original versions start to deteriorate, a similar picture to the earlier single- vs. multithreaded comparison.

Edit:
The work within the inner loop is constant, while the work in sum depends on the array size. That could also be a cause of the divergence.
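A quick back-of-the-envelope count makes that concrete (K is the fixed inner-loop length, N the side length of rss; this is pure bookkeeping, no Numba involved):

```python
def ops(N, K=100_000):
    """Rough operation counts for the two variants on an N x N array."""
    # original: every one of the N outer iterations re-sums the N*N array
    original = N * (N * N + K)
    # hoisted: one N*N sum up front, then N iterations of K inner steps
    hoisted = N * N + N * K
    return original, hoisted


print(ops(1000))  # the O(N^3) sum term dominates the original at N = 1000
```

At N = 1000 the original does roughly an order of magnitude more work, which matches the divergence appearing only at larger sizes.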

[Figure: runtime of the original vs. optimized versions as the data size increases]


Another experiment…

Changing the “NUMBA_OPT” environment variable to modify the LLVM compiler setup had no impact on my results.
Instead, I conducted tests using both single- and multithreaded functions with a GCC compiler under different optimization setups.
Let’s assume that both compilers use similar kinds of optimizations.

Contrary to my expectation, the higher optimization level (GCC -c=-O3) resulted in a higher ratio between single-threaded and multithreaded calculation times than the lower optimization level (GCC -c=-O0). There seems to be no consistent correlation between the degree of optimization and the single- vs. multithreaded ratio.
My theory is busted…

It appears that the complexity of the interaction between compiler optimizations and multithreading is not straightforward.
The ratio between single-threaded and multithreaded runtimes at optimization level 3 was ~3.0 (for 4 CPUs) using Cython, which is similar to the earlier Numba result of ~2.9.

#******************************************************************************
# Comparison of single-threaded and multithreaded functions under different
# optimization levels
#******************************************************************************
GCC compiler
cpus = 4
N = 1000
a = np.array([1., 0.])
rss = np.ones((N, N))

# High degree of optimization
# -c=-O3 --compile-args=-fopenmp --link-args=-fopenmp
                    Test 1   Test 2   Test 3   Avg
call_iu_loop        4.39s    4.39s    4.44s    4.41s
call_iu_loop_par    1.49s    1.46s    1.46s    1.47s
ratio               2.95     3.01     3.04     3.00 <=

# Low degree of optimization
# -c=-O0 --compile-args=-fopenmp --link-args=-fopenmp
                    Test 1   Test 2   Test 3   Avg
call_iu_loop        5.68s    5.68s    5.72s    5.69s
call_iu_loop_par    2.55s    2.65s    3.27s    2.82s
ratio               2.23     2.14     1.75     2.04 <=