# Numba performance doesn't scale as well as NumPy in vectorized max function

Was working on some `aesara` -related benchmarking and came across this comparison. Tried running vectorized `max` function, reduced on `axis=1` , on various sizes of 2d arrays in `Numba` and `NumPy` :

``````@numba.vectorize(
[
"float32(float32, float32)",
"float64(float64, float64)",
]
)
def custom_op_fn(x, y):
if x > y:
return x
else:
return y

@numba.njit
def max_reduce_axis_1(x):
x_transpose = np.transpose(x)

res = np.full((x.shape[0]), -np.inf, dtype=np.float64)
for m in range(x.shape[1]):
custom_op_fn(res, x_transpose[m], res)

return res

# Make sure the underlying Numba function is compiled
# custom_op_res = custom_op_fn.reduce(test_data[0], axis=1).reshape(-1, 1)
custom_op_res = max_reduce_axis_1(test_data[0])

def np_max_reduce_axis_1(x):
return np.maximum.reduce(x, axis=1)

np_res = np_max_reduce_axis_1(test_data[0])
``````

and got the following results:

``````| data shape     | Numba                      | NumPy                       |
|:---------------|:---------------------------|:----------------------------|
| (3, 3)         | 531 ns Â± 12.8 ns (1000000) | 1.59 Âµs Â± 9.25 ns (1000000) |
| (25, 25)       | 784 ns Â± 11.8 ns (1000000) | 2.47 Âµs Â± 37 ns (100000)    |
| (100, 100)     | 4.27 Âµs Â± 23.5 ns (100000) | 8.15 Âµs Â± 37.1 ns (100000)  |
| (1000, 1000)   | 1.13 ms Â± 10.7 Âµs (1000)   | 321 Âµs Â± 7.05 Âµs (1000)     |
| (10000, 10000) | 1.02 s Â± 12 ms (1)         | 51.4 ms Â± 101 Âµs (10)       |
``````

Hereâ€™s the gist i used to generate the output above. Wonder why the `Numba` results didnâ€™t scale as well as `NumPy` in this case.

1 Like

Interesting question! I donâ€™t know; I started digging through the numpy repo and immediately got lost. I donâ€™t know enough about C and how Python wraps C. This is as far as I got numpy/umath.py at 623bc1fae1d47df24e7f1e29321d0c0ba2771ce0 Â· numpy/numpy Â· GitHub.

I wonder if numbaâ€™s `vectorize` decorator actually automatically builds vectorized functions in the sense that all elements are loaded into memory simultaneously and passed through a processing pipeline (i.e. single-instruction multiple data). My read of the numba vectorize docs is not clear.

Can someone with a bit more experience help out here? Is `vectorize` automatic SIMD? Does it depend on llvm?

1 Like

do you need to use `vectorize`? I wonder what would happen if you used a double loop instead.

A double loop turns out a lot faster than `vectorize` for larger arrays, but `NumPy` still performs better:

``````| data shape     | Numba                      | NumPy                      |
|:---------------|:---------------------------|:---------------------------|
| (3, 3)         | 543 ns Â± 3.57 ns (1000000) | 2 Âµs Â± 28.1 ns (100000)    |
| (25, 25)       | 949 ns Â± 8.13 ns (1000000) | 2.87 Âµs Â± 15.9 ns (100000) |
| (100, 100)     | 7.35 Âµs Â± 37.1 ns (100000) | 8.84 Âµs Â± 86.1 ns (100000) |
| (1000, 1000)   | 643 Âµs Â± 8.75 Âµs (1000)    | 328 Âµs Â± 6.53 Âµs (1000)    |
| (10000, 10000) | 91 ms Â± 200 Âµs (10)        | 51.9 ms Â± 104 Âµs (10)      |
``````

Hereâ€™s the code to reproduce the above:

``````@numba.njit
def max_double_loop_axis_1(x):

res = np.full((x.shape[0]), -np.inf, dtype=np.float64)
for i in range(x.shape[0]):
for j in range(x.shape[1]):
if x[i][j] > res[i]:
res[i] = x[i][j]

return res

custom_op_res = max_reduce_axis_1(test_data[0])
``````
1 Like

I suspect thereâ€™s some missed optimisation as the root cause. IIRC NumPy generates SIMD loop specialisations for its `reduce` so Numba would need to do the same to match. Switching this:

``````        self._mpm_cheap = self._module_pass_manager(loop_vectorize=True,
slp_vectorize=True,
opt=3,
cost="cheap")
``````

recovers most of the performance, but looking at the machine code thereâ€™s still SIMD issues.

Changing the function to read as:

``````@numba.njit(error_model='numpy')
def max_reduce_axis_1(x):

res = np.empty((x.shape[0]), dtype=x.dtype)
for i in range(x.shape[0]):
res[i] = -np.inf
for j in range(x.shape[1]):
tmp = x[i, j]
if res[i] < tmp:
res[i] = tmp
return res
``````

pretty much matches NumPy speed for me on the `(10000, 10000)` case (so long as the â€ścheapâ€ť pass is set up as the above) and the machine code looks reasonable.

I suspect the underlying cause of the missed optimisation could be down to some of the over generalisation of `range(n)` loop code generation which adds complexity thatâ€™s not needed in simple cases where the loop bound and stride are simple.

3 Likes

It looks like the gap can be closed a little more (for the initial `vectorize` approach) using `identity="reorderable"`, but it still doesnâ€™t compare with NumPy or the non-`vectorized` version:

``````import numpy as np

import numba

@numba.vectorize(["float64(float64, float64)"], identity="reorderable")
def custom_op_fn(x, y):
if x > y:
return x
else:
return y

@numba.njit
def max_reduce_axis_1(x):
res = np.full((x.shape[0],), -np.inf, dtype=x.dtype)
x_transpose = np.transpose(x)
for m in range(x.shape[1]):
custom_op_fn(res, x_transpose[m], res)
return res

X = np.random.normal(size=(5000, 5000))
res_1 = max_reduce_axis_1(X)

@numba.njit(error_model='numpy')
def max_reduce_axis_2(x):
res = np.empty((x.shape[0],), dtype=x.dtype)
for i in range(x.shape[0]):
res[i] = -np.inf
for j in range(x.shape[1]):
tmp = x[i, j]
if res[i] < tmp:
res[i] = tmp
return res

res_2 = max_reduce_axis_2(X)
assert np.array_equal(res_1, res_2)
assert np.array_equal(res_1, np.max(X, axis=1))

%timeit np.max(X, axis=1)
# 14.3 ms Â± 408 Âµs per loop (mean Â± std. dev. of 7 runs, 100 loops each)

%timeit max_reduce_axis_1(X)
# 58 ms Â± 1.72 ms per loop (mean Â± std. dev. of 7 runs, 10 loops each)
# With forced `loop_vectorization=True`, `slp_vectorization=True`, and `opt=3`:
# 59.3 ms Â± 3.61 ms per loop (mean Â± std. dev. of 7 runs, 10 loops each)

%timeit max_reduce_axis_2(X)
# 36 ms Â± 1.03 ms per loop (mean Â± std. dev. of 7 runs, 10 loops each)
# With forced `loop_vectorization=True`, `slp_vectorization=True`, and `opt=3`:
# 16 ms Â± 44.6 Âµs per loop (mean Â± std. dev. of 7 runs, 100 loops each)
``````

Forcing those optimizations that @stuartarchibald mentioned really seems to help the non-`vectorize` version, but apparently doesnâ€™t affect the `vectorize`d approach positively.

1 Like

@stuartarchibald thanks! This was super helpful. Any insights on why the optimizations are set as they currently stand?

Also, any comments on how the performance of the `vectorize`d approach can be further improved (especially for those larger arrays)?

Are you referring to the code in `build_ufunc_wrapper`?

@brandonwillard, Iâ€™m thinking about the way loops are represented in Numba in general and how the `ufunc` loop nests (as referenced) are quite involved. The control flow graphs for the LLVM IR can be seen with:

``````import numpy as np
import numba

@numba.vectorize(["float64(float64, float64)"], identity="reorderable")
def custom_op_fn(x, y):
if x > y:
return x
else:
return y

@numba.njit(debug=True)
def max_reduce_axis_1(x):
res = np.full((x.shape[0],), -np.inf, dtype=x.dtype)
x_transpose = np.transpose(x)
for m in range(x.shape[1]):
custom_op_fn(res, x_transpose[m], res)
return res

X = np.random.normal(size=(5000, 5000))
res_1 = max_reduce_axis_1(X)

@numba.njit(debug=True, error_model='numpy')
def max_reduce_axis_2(x):
res = np.empty((x.shape[0],), dtype=x.dtype)
for i in range(x.shape[0]):
res[i] = -np.inf
for j in range(x.shape[1]):
tmp = x[i, j]
if res[i] < tmp:
res[i] = tmp
return res

res_2 = max_reduce_axis_2(X)
assert np.array_equal(res_1, res_2)
assert np.array_equal(res_1, np.max(X, axis=1))

def show_llvm_cfg(func):
func.inspect_cfg(func.signatures[0]).pretty_printer(interleave=True, view=True)

show_llvm_cfg(max_reduce_axis_1)
show_llvm_cfg(max_reduce_axis_2)
``````
1 Like

@fanshi118

No problem.

Numba does a two pass LLVM based optimisation. The first pass is â€ścheapâ€ť and is intended to inline as much as possible, do some basic things that help loop vectorization, and then run the all important â€śreference count pruningâ€ť pass. This reference count pruning pass removes redundant reference counting (refcount) operations in many common cases and as a result removes the side effects that having refcounts would cause, this lets the LLVM standard optimisations do a lot more. The second pass is like a traditional `-O3 -vectorize` and is run on the refcount pruned IR.

The reason the â€ścheapâ€ť pass is not `O3` is that it would be expensive in compile time for what is, most of the time, relatively little gain (the `O3` pass is run immediately afterwards on recount pruned IR and the optimisers will struggle if thereâ€™s refcounts in the IR). Obviously there are outlier cases such as the one presented here (and a few in the issue tracker) where running `O3` in both has benefit.

Compilation is all about trade-offs, thereâ€™s no pass sequence that will generate optimally performing output for any input, and running more optimisations does not necessarily lead to better performance, but it does lead to higher compile times! As a result Numbaâ€™s current optimisation sequence is basically what is considered a â€śreasonable defaultâ€ť to try and get reasonable performance from a reasonable compilation effort to satisfy the majority of use cases.

I have some long-term ideas about how to put some of this tuning in the hands of users and this is something Iâ€™d like to explore more. Being able to say â€śIâ€™m in HPC production, spend as much time tuning as possibleâ€ť is very useful in some cases, as is â€śthis function is essentially cold, just compile it quickly donâ€™t waste time trying to optimise itâ€ť.

This probably needs the CFGs and LLVM vectorizer debug output analysing. Numba performance doesn't scale as well as NumPy in vectorized max function - #9 by stuartarchibald helps with the former and:

``````from llvmlite import binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')
``````

shows the latter.

Hope this helps.

2 Likes

After reading @stuartarchibald replies and reviewing the low-level code with @gmarkall a little bit about this, it seemed like we could accomplish NumPy performance for `ufunc.reduce`-like operations within Numba if we add some user-configurable flags for the optimization passes in Numba (e.g. the lines that @stuartarchibald changed).

Just to clarify and recap the entire situation, when Aesara is asked to convert a graph representing `np.max(x, y, axis=1)` to a Numba-`njit`ed function, it does so piece-by-piece. That starts with a scalar `max` function, for which we can generate a custom vectorized function like the following (well, at least after Add support for `np.broadcast_to` by guilhermeleobas Â· Pull Request #7119 Â· numba/numba Â· GitHub goes through):

``````import numpy as np

import numba

@numba.njit
def vectorized_max(x, y, out=None):
if out is None:
out = np.empty((x.shape[0],), dtype=np.float64)

for i in range(out.shape[0]):
if x[i] > y[i]:
out[i] = x[i]
else:
out[i] = y[i]
return out
``````

With this function, Aesara can then implement the `axis=1` part using something like the following:

``````@numba.njit
def max_reduce_axis_1(x):
x_transpose = np.transpose(x)

res = np.full((x.shape[0]), -np.inf, dtype=np.float64)
for m in range(x.shape[1]):
vectorized_max(res, x_transpose[m], res)

return res
``````

It seems like the resulting `max_reduce_axis_1` can be currently optimized to the same degree as @stuartarchibaldâ€™s all-in-one example. Apparently, all thatâ€™s left is to get the extra SIMD-related optimizations that were enabled by setting `loop_vectorize=True`, `slp_vectorize=True`, and possibly `opt=3` in the â€ścheapâ€ť optimization pass.

Again, a quick fix might involve additional user-configurable options that allow the adjustment in the â€ścheapâ€ť pass; however, since we definitely donâ€™t want to increase the overall compile time when the Numba backend is used (especially when the plan is to make it the default backend), it would be best if this option could be enabled only for the compilation of these specific `ufunc.reduce`-like functions.

Iâ€™m not sure if thatâ€™s possible via `numba.config` options (e.g. if we temporarily set those options, force immediate `njit` compilation of the function and unset the options), but, if it is, I would willing to put in a PR for this.

I just posted this discussion to this issue, which seems very similar (if not the same).

I was wondering if declaring the argument as a contiguous array would help (it helps a bit) and if a C implementation with AVX instructions would be faster (apparently twice as fast as Numpy). The AVX implementation might serve as a stronger baseline for Numba to target, since it runs approximately at the memory bandwidth limit. There is not much reason why such a simple function should be slower than memcpy since there is not much computation being done here.

Implementation Time (Microseconds)
Numba 1014.748 Â± 160.301
Numba, contiguous 792.670 Â± 47.724
Numpy 429.551 Â± 42.410
C with AVX 235.201 Â± 31.141
``````import os, time, ctypes, numba, numpy as np

with open("mylib.c", "w") as f:
f.write("""
#include <immintrin.h>
#include <math.h>

// Note: Only works for array dimensions which are multiples of 4
void reduce(double *result, double *x, int n){
for (int i = 0; i < n; i += 4){
__m256d res[4];

for (int k = 0; k < 4; k++){
res[k] = _mm256_set1_pd(-INFINITY);
}

for (int j = 0; j < n; j += 4){
for (int k = 0; k < 4; k++){
res[k] = _mm256_max_pd(res[k], _mm256_loadu_pd(x + (i + k) * n + j));
}
}

for (int k = 0; k < 4; k++){
result[i + k] = fmax(fmax(res[k][0], res[k][1]), fmax(res[k][2], res[k][3]));
}
}
}
""")

# Compile and build C function
os.system("gcc -mavx -O3 -shared mylib.c -o mylib.so")
lib = ctypes.CDLL("./mylib.so")
_reduce = lib.reduce
c_double_p = ctypes.POINTER(ctypes.c_double)
_reduce.argtypes = [c_double_p, c_double_p, ctypes.c_int]
def c_avx_max_reduce_axis_1(x):
result = np.zeros(len(x))
_reduce(np.ctypeslib.as_ctypes(result), np.ctypeslib.as_ctypes(x.ravel()), len(x))
return result

@numba.njit("f8[:](f8[:, :])")
def numba_max_reduce_axis_1(x):
res = np.full((x.shape[0]), -np.inf, dtype=x.dtype)

for i in range(x.shape[0]):
for j in range(x.shape[1]):
if res[i] < x[i, j]:
res[i] = x[i, j]

return res

@numba.njit("f8[:](f8[:, ::1])")
def numba_c_order_max_reduce_axis_1(x):
res = np.full((x.shape[0]), -np.inf, dtype=x.dtype)

for i in range(x.shape[0]):
for j in range(x.shape[1]):
if res[i] < x[i, j]:
res[i] = x[i, j]

return res

def np_max_reduce_axis_1(x):
return x.max(axis=1)

test_data = np.random.normal(size=(1024, 1024))

functions = [
numba_max_reduce_axis_1,
numba_c_order_max_reduce_axis_1,
np_max_reduce_axis_1,
c_avx_max_reduce_axis_1,
]

results = [max_reduce_axis_1(test_data) for max_reduce_axis_1 in functions]

for res in results:
assert(np.allclose(results[0], res))

for max_reduce_axis_1 in functions:
timings = []
for _ in range(1000):
t = time.perf_counter()

max_reduce_axis_1(test_data)

timings.append((time.perf_counter() - t) * 1e6)

print(f"| {max_reduce_axis_1.__name__:31} | {np.mean(timings):9.3f} +- {np.std(timings):8.3f} us |")

``````
1 Like

@sklam, @stuartarchibald, @gmarkall, per our conversation today, hereâ€™s an updated script with all the Numba implementations weâ€™ve tried for `np.max.reduce(..., axis=1)`:

``````from contextlib import contextmanager

import numba
import numpy as np

X = np.random.normal(size=(5000, 6000))

#
# This is what we're trying to implement in Numba:
#
numpy_res = np.max(X, axis=1)

# This is what we're trying to match/beat:
# %timeit np.max(X, axis=1)
# 15.6 ms Â± 129 Âµs per loop (mean Â± std. dev. of 7 runs, 100 loops each)

#
# NOTE: We do *not* want to enable `parallel=True`, `fastmath=True`, or any
# other options that are neither generalizable nor used by NumPy.
#

@numba.njit(inline="always")
def custom_max(x, y):
if x > y:
return x
else:
return y

#
# Case 1: Using `vectorize` along the reduced dimension
#
@numba.vectorize(["float64(float64, float64)"])
def vectorized_max(x, y):
return custom_max(x, y)

@numba.njit
def vectorized_max_reduce_axis_1(x):
res = np.full((x.shape[0],), -np.inf, dtype=x.dtype)
x_transpose = np.transpose(x)
for m in range(x.shape[1]):
vectorized_max(res, x_transpose[m], res)

return res

# Confirm that it works
assert np.array_equal(numpy_res, vectorized_max_reduce_axis_1(X))

# %timeit vectorized_max_reduce_axis_1(X)
# 94.2 ms Â± 435 Âµs per loop (mean Â± std. dev. of 7 runs, 10 loops each)

#
# Case 2: Manually written Numba reduction loops
#
@numba.njit
def manual_max_reduce_axis_1(x):
res = np.full((x.shape[0],), -np.inf, dtype=x.dtype)
for i in range(x.shape[0]):
for j in range(x.shape[1]):
res[i] = custom_max(res[i], x[i, j])
return res

assert np.array_equal(numpy_res, manual_max_reduce_axis_1(X))

# %timeit manual_max_reduce_axis_1(X)
# 38.3 ms Â± 1.01 ms per loop (mean Â± std. dev. of 7 runs, 10 loops each)

#
# Case 3: Calls to NumPy's `ufunc.reduce` via `objmode`
#
@numba.njit
def objmode_max_reduce_axis_1(x):
with numba.objmode(res="float64[:]"):
res = vectorized_max.reduce(x, axis=1)
return res

assert np.array_equal(numpy_res, objmode_max_reduce_axis_1(X))

# %timeit objmode_max_reduce_axis_1(X)
# 73.7 ms Â± 262 Âµs per loop (mean Â± std. dev. of 7 runs, 10 loops each)

#
# Case 4: Recompile Case 2 with more LLVM optimizations in the "cheap" pass
#
# This comes from the discussions in
# https://numba.discourse.group/t/numba-performance-doesnt-scale-as-well-as-numpy-in-vectorized-max-function/782
#
@contextmanager
def use_optimized_cheap_pass(*args, **kwargs):
"""Temporarily replace the cheap optimization pass with a better one."""
from numba.core.registry import cpu_target

context = cpu_target.target_context._internal_codegen
old_pm = context._mpm_cheap
new_pm = context._module_pass_manager(
loop_vectorize=True, slp_vectorize=True, opt=3, cost="cheap"
)
context._mpm_cheap = new_pm
try:
yield
finally:
context._mpm_cheap = old_pm

with use_optimized_cheap_pass():

@numba.njit
def opt_manual_max_reduce_axis_1(x):
res = np.full((x.shape[0],), -np.inf, dtype=x.dtype)
for i in range(x.shape[0]):
for j in range(x.shape[1]):
res[i] = custom_max(res[i], x[i, j])
return res

assert np.array_equal(numpy_res, opt_manual_max_reduce_axis_1(X))

# %timeit opt_manual_max_reduce_axis_1(X)
# 37.6 ms Â± 872 Âµs per loop (mean Â± std. dev. of 7 runs, 10 loops each)

#
# Case 5: This is Stuart's original form of the reduction
#
# Apparently, removing the use of `custom_max` and manually "in-lining" the
# `max` operation makes a large difference; however, we can't reasonably do
# this for every binary function that will perform a reduction.
#
with use_optimized_cheap_pass():

@numba.njit
def orig_opt_max_reduce_axis_1(x):
res = np.full((x.shape[0],), -np.inf, dtype=x.dtype)
for i in range(x.shape[0]):
for j in range(x.shape[1]):
tmp = x[i, j]
if res[i] < tmp:
res[i] = tmp
return res

assert np.array_equal(numpy_res, orig_opt_max_reduce_axis_1(X))

# %timeit orig_opt_max_reduce_axis_1(X)
# 20.3 ms Â± 233 Âµs per loop (mean Â± std. dev. of 7 runs, 10 loops each)
``````

It looks like the â€ścheapâ€ť pass optimization hack only helps when the binary function is manually in-lined; however, we need to generalize these reduction loops by calling nearly arbitrary binary functions (e.g. like `custom_max` in the example above), so we canâ€™t reasonably in-line the binary operation as in @stuartarchibaldâ€™s example. The performance for that example is acceptable for our use-case, though, so, if we could get `res[i] = custom_max(res[i], x[i, j])` to be treated more or less the same as

``````tmp = x[i, j]
if res[i] < tmp:
res[i] = tmp
``````

we might be fine.