After some further tests, I've got a clearer idea of what is going on. First of all, I modified the above examples by adding some more 'number crunching' inside the inner loop.
Example A is now as follows:
# Example A
def f(num_threads):
    out = np.empty((num_rows, 3, 300))
    # calculate number of rows per thread
    num_rows_per_thread = int(math.ceil(num_rows / num_threads))
    for index_thread in numba.prange(num_threads):
        # Loop over loan parts
        i_row_begin = index_thread * num_rows_per_thread
        i_row_end = min(num_rows, (index_thread + 1) * num_rows_per_thread)
        tmp = np.empty(5)
        for i_row in range(i_row_begin, i_row_end):
            # Do some calculation on a single row
            tmp[0] = i_row
            for index_s in range(out.shape[1]):
                tmp[1] = index_s
                for index_t in range(out.shape[2]):
                    tmp[2] = index_t + index_s
                    tmp[3] = index_t / (1.2 + i_row / (index_s + 1))
                    tmp[4] = index_t / (1.8 + i_row / (index_s + 1))
                    tmp[3] = tmp[4] * math.sin(1.3 + tmp[3])
                    tmp[4] = tmp[3] * math.cos(tmp[3])
                    tmp[3] = tmp[4] + math.sin(1.3 + tmp[3])
                    tmp[4] = tmp[3] + math.cos(tmp[3])
                    tmp[2] = math.cos(tmp[1]) * math.sin(tmp[4] - tmp[2])
                    tmp[3] = math.exp(tmp[2] * tmp[4]) + math.cos(tmp[3])
                    if index_t % 2 == 1:
                        tmp[3] = tmp[3] - tmp[2] * out[i_row, index_s, index_t - 1]
                    out[i_row, index_s, index_t] = np.sum(tmp / (1 + np.sum(tmp)))
    return out
print()
print("Example A")
print_time(f)
Leading to:
Example A
Using parallel=False
2.33 s ± 208 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(1)
1.18 s ± 7.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(2)
645 ms ± 48.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(4)
321 ms ± 7.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(8)
171 ms ± 7.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using parallel=True and prange(16)
90.2 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using parallel=True and prange(32)
82.2 ms ± 4.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using parallel=True and prange(46)
74.6 ms ± 2.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using parallel=True and prange(64)
70.5 ms ± 2.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So now there is clear scaling with the prange loop. Still, parallel=False takes almost twice as long as parallel=True with prange(1), even though in the latter case the code is still using only one core.
Then, the outcome of example B is exactly as before, i.e. no parallelization is used:
# Example B
def calc_bunch_rows(i_row_begin, i_row_end, out):
    tmp = np.empty(5)
    for i_row in range(i_row_begin, i_row_end):
        tmp[0] = i_row
        for index_s in range(out.shape[1]):
            tmp[1] = index_s
            for index_t in range(out.shape[2]):
                tmp[2] = index_t + index_s
                tmp[3] = index_t / (1.2 + i_row / (index_s + 1))
                tmp[4] = index_t / (1.8 + i_row / (index_s + 1))
                tmp[3] = tmp[4] * math.sin(1.3 + tmp[3])
                tmp[4] = tmp[3] * math.cos(tmp[3])
                tmp[3] = tmp[4] + math.sin(1.3 + tmp[3])
                tmp[4] = tmp[3] + math.cos(tmp[3])
                tmp[2] = math.cos(tmp[1]) * math.sin(tmp[4] - tmp[2])
                tmp[3] = math.exp(tmp[2] * tmp[4]) + math.cos(tmp[3])
                if index_t % 2 == 1:
                    tmp[3] = tmp[3] - tmp[2] * out[i_row, index_s, index_t - 1]
                out[i_row, index_s, index_t] = np.sum(tmp / (1 + np.sum(tmp)))

calc_bunch_rows = numba.njit(calc_bunch_rows, nogil=True)
def f(num_threads):
    out = np.empty((num_rows, 3, 300))
    # calculate number of rows per thread
    num_rows_per_thread = int(math.ceil(num_rows / num_threads))
    for index_thread in numba.prange(num_threads):
        # Loop over loan parts
        i_row_begin = index_thread * num_rows_per_thread
        i_row_end = min(num_rows, (index_thread + 1) * num_rows_per_thread)
        calc_bunch_rows(i_row_begin, i_row_end, out)
    return out
print()
print("Example B")
print_time(f)
Leading to:
Example B
Using parallel=False
2.25 s ± 29.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(1)
2.23 s ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(2)
3.13 s ± 22.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(4)
2.86 s ± 27.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(8)
2.63 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(16)
2.47 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(32)
3.4 s ± 50.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(46)
3.85 s ± 42.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(64)
3.88 s ± 35.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Finally, example C now has exactly the same performance as example A:
# Example C
def calc_bunch_rows(i_row_begin, i_row_end, out):
    tmp = np.empty(5)
    tmp2 = np.empty(5)
    for i_row in range(i_row_begin, i_row_end):
        tmp[0] = i_row
        for index_s in range(out.shape[1]):
            tmp[1] = index_s
            for index_t in range(out.shape[2]):
                tmp[2] = index_t + index_s
                tmp[3] = index_t / (1.2 + i_row / (index_s + 1))
                tmp[4] = index_t / (1.8 + i_row / (index_s + 1))
                tmp[3] = tmp[4] * math.sin(1.3 + tmp[3])
                tmp[4] = tmp[3] * math.cos(tmp[3])
                tmp[3] = tmp[4] + math.sin(1.3 + tmp[3])
                tmp[4] = tmp[3] + math.cos(tmp[3])
                tmp[2] = math.cos(tmp[1]) * math.sin(tmp[4] - tmp[2])
                tmp[3] = math.exp(tmp[2] * tmp[4]) + math.cos(tmp[3])
                if index_t % 2 == 1:
                    tmp[3] = tmp[3] - tmp[2] * out[i_row, index_s, index_t - 1]
                # manual, allocation-free version of np.sum(tmp / (1 + np.sum(tmp)))
                tmp_sum = 0.0
                for tmp_i in range(tmp.shape[0]):
                    tmp_sum += tmp[tmp_i]
                for tmp_i in range(tmp.shape[0]):
                    tmp2[tmp_i] = tmp[tmp_i] / (1 + tmp_sum)
                tmp_sum = 0.0
                for tmp_i in range(tmp.shape[0]):
                    tmp_sum += tmp2[tmp_i]
                out[i_row, index_s, index_t] = tmp_sum

calc_bunch_rows = numba.njit(calc_bunch_rows, nogil=True)
def f(num_threads):
    out = np.empty((num_rows, 3, 300))
    # calculate number of rows per thread
    num_rows_per_thread = int(math.ceil(num_rows / num_threads))
    for index_thread in numba.prange(num_threads):
        # Loop over loan parts
        i_row_begin = index_thread * num_rows_per_thread
        i_row_end = min(num_rows, (index_thread + 1) * num_rows_per_thread)
        calc_bunch_rows(i_row_begin, i_row_end, out)
    return out
print()
print("Example C")
print_time(f)
Example C
Using parallel=False
1.18 s ± 697 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(1)
1.18 s ± 4.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(2)
612 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(4)
309 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(8)
172 ms ± 8.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using parallel=True and prange(16)
127 ms ± 33.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using parallel=True and prange(32)
83.6 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using parallel=True and prange(46)
75.4 ms ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using parallel=True and prange(64)
70.4 ms ± 3.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Summarizing: in examples A and C, n cores are used with the statement prange(n) (verified by inspecting CPU load).
However, if the number crunching is made simpler, then only in example A is Numba able to perform some extra optimization and parallelize more than just the prange loop. E.g., if in the following example I remove the two lines
if index_t % 2 == 1:
    tmp[3] = tmp[3] - tmp[2] * out[i_row, index_s, index_t - 1]
it seems Numba is able to parallelize even the loops over index_t and index_s. Look here:
# Example A2 (modification of example A)
def f(num_threads):
    out = np.empty((num_rows, 3, 300))
    # calculate number of rows per thread
    num_rows_per_thread = int(math.ceil(num_rows / num_threads))
    for index_thread in numba.prange(num_threads):
        # Loop over loan parts
        i_row_begin = index_thread * num_rows_per_thread
        i_row_end = min(num_rows, (index_thread + 1) * num_rows_per_thread)
        tmp = np.empty(5)
        for i_row in range(i_row_begin, i_row_end):
            # Do some calculation on a single row
            tmp[0] = i_row
            for index_s in range(out.shape[1]):
                tmp[1] = index_s
                for index_t in range(out.shape[2]):
                    tmp[2] = index_t + index_s
                    tmp[3] = index_t / (1.2 + i_row / (index_s + 1))
                    tmp[4] = index_t / (1.8 + i_row / (index_s + 1))
                    tmp[3] = tmp[4] * math.sin(1.3 + tmp[3])
                    tmp[4] = tmp[3] * math.cos(tmp[3])
                    tmp[3] = tmp[4] + math.sin(1.3 + tmp[3])
                    tmp[4] = tmp[3] + math.cos(tmp[3])
                    tmp[2] = math.cos(tmp[1]) * math.sin(tmp[4] - tmp[2])
                    tmp[3] = math.exp(tmp[2] * tmp[4]) + math.cos(tmp[3])
                    out[i_row, index_s, index_t] = np.sum(tmp / (1 + np.sum(tmp)))
    return out
print()
print("Example A2")
print_time(f)
Example A2
Using parallel=False
2.22 s ± 9.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(1)
502 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(2)
293 ms ± 30.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(4)
165 ms ± 5.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using parallel=True and prange(8)
123 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using parallel=True and prange(16)
99.5 ms ± 8.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using parallel=True and prange(32)
80.7 ms ± 6.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using parallel=True and prange(46)
84.2 ms ± 9.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using parallel=True and prange(64)
81.6 ms ± 9.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Indeed, inspecting the CPU load reveals that more than n cores are used with the statement prange(n).
Finally, as might be expected, Numba does not parallelize the inner loops if they are moved to an external function, as follows (a modification of example C):
# Example C2 (modification of C)
def calc_bunch_rows(i_row_begin, i_row_end, out):
    tmp = np.empty(5)
    tmp2 = np.empty(5)
    for i_row in range(i_row_begin, i_row_end):
        tmp[0] = i_row
        for index_s in range(out.shape[1]):
            tmp[1] = index_s
            for index_t in range(out.shape[2]):
                tmp[2] = index_t + index_s
                tmp[3] = index_t / (1.2 + i_row / (index_s + 1))
                tmp[4] = index_t / (1.8 + i_row / (index_s + 1))
                tmp[3] = tmp[4] * math.sin(1.3 + tmp[3])
                tmp[4] = tmp[3] * math.cos(tmp[3])
                tmp[3] = tmp[4] + math.sin(1.3 + tmp[3])
                tmp[4] = tmp[3] + math.cos(tmp[3])
                tmp[2] = math.cos(tmp[1]) * math.sin(tmp[4] - tmp[2])
                tmp[3] = math.exp(tmp[2] * tmp[4]) + math.cos(tmp[3])
                # manual, allocation-free version of np.sum(tmp / (1 + np.sum(tmp)))
                tmp_sum = 0.0
                for tmp_i in range(tmp.shape[0]):
                    tmp_sum += tmp[tmp_i]
                for tmp_i in range(tmp.shape[0]):
                    tmp2[tmp_i] = tmp[tmp_i] / (1 + tmp_sum)
                tmp_sum = 0.0
                for tmp_i in range(tmp.shape[0]):
                    tmp_sum += tmp2[tmp_i]
                out[i_row, index_s, index_t] = tmp_sum

calc_bunch_rows = numba.njit(calc_bunch_rows, nogil=True)
def f(num_threads):
    out = np.empty((num_rows, 3, 300))
    # calculate number of rows per thread
    num_rows_per_thread = int(math.ceil(num_rows / num_threads))
    for index_thread in numba.prange(num_threads):
        # Loop over loan parts
        i_row_begin = index_thread * num_rows_per_thread
        i_row_end = min(num_rows, (index_thread + 1) * num_rows_per_thread)
        calc_bunch_rows(i_row_begin, i_row_end, out)
    return out
print()
print("Example C2")
print_time(f)
Example C2
Using parallel=False
1.17 s ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(1)
1.19 s ± 31.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(2)
598 ms ± 6.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(4)
307 ms ± 6.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using parallel=True and prange(8)
175 ms ± 7.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using parallel=True and prange(16)
140 ms ± 4.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using parallel=True and prange(32)
116 ms ± 7.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using parallel=True and prange(46)
109 ms ± 9.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using parallel=True and prange(64)
88 ms ± 6.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So things are much clearer to me now. Still, I'm asking: how can I get any parallelization out of example B, and why can't Numba parallelize it? Is this a bug I should report, or intended behavior?
The lack of parallelization in example B is really frustrating. I currently need to parallelize an algorithm involving multiplications of many medium-sized matrices (say, 100x100). The multiplications are independent of each other, so they could be parallelized very easily. However, Numba does not support numpy.dot when the 'out' argument is specified, which means NumPy arrays are created dynamically (much like with numpy.sum in example B), preventing Numba from parallelizing the prange. Also, the matrices are large enough that if I move away from BLAS and write my own matrix multiplication in Numba, things will be much slower.
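For the matrix-multiplication case specifically, a possible interim workaround (a sketch; batched_dot is a made-up name) is to skip Numba for that part entirely: NumPy's BLAS-backed np.dot releases the GIL, so the independent products can be spread over an ordinary thread pool:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def batched_dot(a_stack, b_stack, num_threads=4):
    # Compute out[i] = a_stack[i] @ b_stack[i] for every i, running the
    # independent BLAS calls on a thread pool; np.dot releases the GIL,
    # so the threads genuinely run in parallel.
    n, m, _ = a_stack.shape
    p = b_stack.shape[2]
    out = np.empty((n, m, p))
    def one(i):
        # out[i] is a C-contiguous float64 slice, so it is a valid 'out='
        np.dot(a_stack[i], b_stack[i], out=out[i])
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(one, range(n)))
    return out
```

This keeps BLAS for the inner kernel while still using multiple cores for the outer loop over independent products.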