I am trying to understand how numba manages threads when parallelizing, especially when dealing with loops that call functions of numpy arrays.
I started with a simple example:
import numpy as np
from numba import njit, prange, get_num_threads
import time

@njit
def f(i, r):
    # scalar version: r is a single float
    t = i * r
    for j in range(int(1e5)):
        t = (t * r) % (i + 1)
    return 1.1, t**2

@njit(parallel=True)
#@njit(parallel=False)
def iu_loop(rs, a):
    bad = 101
    for i in prange(len(rs)):
        if i == bad:
            continue
        temp = f(i, rs[i, 0])  # pass a single element, not the whole array
        a[0] *= temp[0]
        a[1] += temp[1]
    return a

# warm-up call so the timed run below excludes compilation
a = np.array([1., 0.])
rs = np.ones((1, 1))
_ = iu_loop(rs, a)

a = np.array([1., 0.])
rs = np.ones((1000, 1000))
print("threads=", get_num_threads())
t_start = time.time()
result = iu_loop(rs, a)
t_end = time.time()
run_time = t_end - t_start
print(f"computed result in: {round(run_time, 3)}s")
print(result)
This works very well and as expected: running the code with parallel=True and 8 threads reduces the computation time by almost exactly a factor of 8, which is what I would expect, since the prange’d loop is embarrassingly parallel.
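For reference, this is roughly how I check and pin the thread count before timing (a minimal sketch; get_num_threads/set_num_threads are numba's documented helpers, and 8 is simply my machine's core count):

import numba

print(numba.get_num_threads())  # defaults to the number of cores, 8 here
numba.set_num_threads(8)        # pin it explicitly so runs are comparable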
Now I’ve tried replacing the scalar argument r of f with the full array:
import numpy as np
from numba import njit, prange, get_num_threads
import time

@njit
def f(i, rs):
    # array version: f now receives the whole array and reduces it to a scalar itself
    r = np.sum(rs)
    t = i * r
    for j in range(int(1e5)):
        t = (t * r) % (i + 1)
    return 1.1, t**2

@njit(parallel=True)
#@njit(parallel=False)
def iu_loop(rss, a):
    bad = 101
    for i in prange(len(rss)):
        if i == bad:
            continue
        temp = f(i, rss)
        a[0] *= temp[0]
        a[1] += temp[1]
    return a

# warm-up call so the timed run below excludes compilation
a = np.array([1., 0.])
rss = np.ones((1, 1))
_ = iu_loop(rss, a)

a = np.array([1., 0.])
rss = np.ones((1000, 1000))
print("threads=", get_num_threads())
t_start = time.time()
result = iu_loop(rss, a)
t_end = time.time()
run_time = t_end - t_start
print(f"computed result in: {round(run_time, 3)}s")
print(result)
The parallel=True code is still faster, but the speedup is only a factor of about 3 instead of 8. In a more complicated setting, which is what prompted this question in the first place, I see almost no speedup at all.
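For what it's worth, here is a sketch of how I time the two settings side by side instead of editing the decorator by hand (it assumes the script above has already run; iu_loop.py_func is the dispatcher attribute holding the original, undecorated Python function):

from numba import njit

# re-jit the same loop body with the parallel backend disabled
iu_loop_serial = njit(parallel=False)(iu_loop.py_func)

for name, fn in (("parallel", iu_loop), ("serial", iu_loop_serial)):
    a = np.array([1., 0.])
    fn(np.ones((1, 1)), a)              # warm-up / compilation
    a = np.array([1., 0.])
    t0 = time.time()
    fn(np.ones((1000, 1000)), a)
    print(name, round(time.time() - t0, 3), "s")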
I feel that I am missing something crucial here: in the second case I would still expect the same speedup, since the code is no less embarrassingly parallel than before. I have also run the parallel_diagnostics function, but I have trouble interpreting the information it provides.
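For completeness, this is how I called it (parallel_diagnostics is a method of the compiled dispatcher; level=4 should be the most verbose output):

iu_loop.parallel_diagnostics(level=4)  # prints which parts of iu_loop were parallelized/fused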
Essentially, I would like to understand what is happening and, hopefully, find a way to fix the code in the second case so that it achieves the full speedup.