Numba performance doesn't scale as well as NumPy in vectorized max function

Was working on some aesara -related benchmarking and came across this comparison. Tried running vectorized max function, reduced on axis=1 , on various sizes of 2d arrays in Numba and NumPy :

@numba.vectorize(
    [
        "float32(float32, float32)",
        "float64(float64, float64)",
    ]
)
def custom_op_fn(x, y):
    if x > y:
        return x
    else:
        return y


@numba.njit
def max_reduce_axis_1(x):
    x_transpose = np.transpose(x)

    res = np.full((x.shape[0]), -np.inf, dtype=np.float64)
    for m in range(x.shape[1]):
        custom_op_fn(res, x_transpose[m], res)

    return res


# Make sure the underlying Numba function is compiled
# custom_op_res = custom_op_fn.reduce(test_data[0], axis=1).reshape(-1, 1)
custom_op_res = max_reduce_axis_1(test_data[0])


def np_max_reduce_axis_1(x):
    return np.maximum.reduce(x, axis=1)


np_res = np_max_reduce_axis_1(test_data[0])

and got the following results:

| data shape     | Numba                      | NumPy                       |
|:---------------|:---------------------------|:----------------------------|
| (3, 3)         | 531 ns ± 12.8 ns (1000000) | 1.59 µs ± 9.25 ns (1000000) |
| (25, 25)       | 784 ns ± 11.8 ns (1000000) | 2.47 µs ± 37 ns (100000)    |
| (100, 100)     | 4.27 µs ± 23.5 ns (100000) | 8.15 µs ± 37.1 ns (100000)  |
| (1000, 1000)   | 1.13 ms ± 10.7 µs (1000)   | 321 µs ± 7.05 µs (1000)     |
| (10000, 10000) | 1.02 s ± 12 ms (1)         | 51.4 ms ± 101 µs (10)       |

Here’s the gist i used to generate the output above. Wonder why the Numba results didn’t scale as well as NumPy in this case.

1 Like

Interesting question! I don’t know; I started digging through the numpy repo and immediately got lost. I don’t know enough about C and how Python wraps C. This is as far as I got numpy/umath.py at 623bc1fae1d47df24e7f1e29321d0c0ba2771ce0 · numpy/numpy · GitHub.

I wonder if numba’s vectorize decorator actually automatically builds vectorized functions in the sense that all elements are loaded into memory simultaneously and passed through a processing pipeline (i.e. single-instruction multiple data). My read of the numba vectorize docs is not clear.

Can someone with a bit more experience help out here? Is vectorize automatic SIMD? Does it depend on llvm?

1 Like

do you need to use vectorize? I wonder what would happen if you used a double loop instead.

A double loop turns out a lot faster than vectorize for larger arrays, but NumPy still performs better:

| data shape     | Numba                      | NumPy                      |
|:---------------|:---------------------------|:---------------------------|
| (3, 3)         | 543 ns ± 3.57 ns (1000000) | 2 µs ± 28.1 ns (100000)    |
| (25, 25)       | 949 ns ± 8.13 ns (1000000) | 2.87 µs ± 15.9 ns (100000) |
| (100, 100)     | 7.35 µs ± 37.1 ns (100000) | 8.84 µs ± 86.1 ns (100000) |
| (1000, 1000)   | 643 µs ± 8.75 µs (1000)    | 328 µs ± 6.53 µs (1000)    |
| (10000, 10000) | 91 ms ± 200 µs (10)        | 51.9 ms ± 104 µs (10)      |

Here’s the code to reproduce the above:

@numba.njit
def max_double_loop_axis_1(x):

    res = np.full((x.shape[0]), -np.inf, dtype=np.float64)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            if x[i][j] > res[i]:
                res[i] = x[i][j]

    return res


custom_op_res = max_reduce_axis_1(test_data[0])
1 Like

I suspect there’s some missed optimisation as the root cause. IIRC NumPy generates SIMD loop specialisations for its reduce so Numba would need to do the same to match. Switching this:

to read as:

        self._mpm_cheap = self._module_pass_manager(loop_vectorize=True,
                                                    slp_vectorize=True,
                                                    opt=3,
                                                    cost="cheap")

recovers most of the performance, but looking at the machine code there’s still SIMD issues.

Changing the function to read as:

@numba.njit(error_model='numpy')
def max_reduce_axis_1(x):

    res = np.empty((x.shape[0]), dtype=x.dtype)
    for i in range(x.shape[0]):
        res[i] = -np.inf
        for j in range(x.shape[1]):
            tmp = x[i, j]
            if res[i] < tmp:
                res[i] = tmp
    return res

pretty much matches NumPy speed for me on the (10000, 10000) case (so long as the “cheap” pass is set up as the above) and the machine code looks reasonable.

I suspect the underlying cause of the missed optimisation could be down to some of the over generalisation of range(n) loop code generation which adds complexity that’s not needed in simple cases where the loop bound and stride are simple.

3 Likes

It looks like the gap can be closed a little more (for the initial vectorize approach) using identity="reorderable", but it still doesn’t compare with NumPy or the non-vectorized version:

import numpy as np

import numba


@numba.vectorize(["float64(float64, float64)"], identity="reorderable")
def custom_op_fn(x, y):
    if x > y:
        return x
    else:
        return y


@numba.njit
def max_reduce_axis_1(x):
    res = np.full((x.shape[0],), -np.inf, dtype=x.dtype)
    x_transpose = np.transpose(x)
    for m in range(x.shape[1]):
        custom_op_fn(res, x_transpose[m], res)
    return res


X = np.random.normal(size=(5000, 5000))
res_1 = max_reduce_axis_1(X)


@numba.njit(error_model='numpy')
def max_reduce_axis_2(x):
    res = np.empty((x.shape[0],), dtype=x.dtype)
    for i in range(x.shape[0]):
        res[i] = -np.inf
        for j in range(x.shape[1]):
            tmp = x[i, j]
            if res[i] < tmp:
                res[i] = tmp
    return res


res_2 = max_reduce_axis_2(X)
assert np.array_equal(res_1, res_2)
assert np.array_equal(res_1, np.max(X, axis=1))


%timeit np.max(X, axis=1)
# 14.3 ms ± 408 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit max_reduce_axis_1(X)
# 58 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# With forced `loop_vectorization=True`, `slp_vectorization=True`, and `opt=3`:
# 59.3 ms ± 3.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit max_reduce_axis_2(X)
# 36 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# With forced `loop_vectorization=True`, `slp_vectorization=True`, and `opt=3`:
# 16 ms ± 44.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Forcing those optimizations that @stuartarchibald mentioned really seems to help the non-vectorize version, but apparently doesn’t affect the vectorized approach positively.

1 Like

@stuartarchibald thanks! This was super helpful. Any insights on why the optimizations are set as they currently stand?

Also, any comments on how the performance of the vectorized approach can be further improved (especially for those larger arrays)?

Are you referring to the code in build_ufunc_wrapper?

@brandonwillard, I’m thinking about the way loops are represented in Numba in general and how the ufunc loop nests (as referenced) are quite involved. The control flow graphs for the LLVM IR can be seen with:

import numpy as np
import numba

@numba.vectorize(["float64(float64, float64)"], identity="reorderable")
def custom_op_fn(x, y):
    if x > y:
        return x
    else:
        return y


@numba.njit(debug=True)
def max_reduce_axis_1(x):
    res = np.full((x.shape[0],), -np.inf, dtype=x.dtype)
    x_transpose = np.transpose(x)
    for m in range(x.shape[1]):
        custom_op_fn(res, x_transpose[m], res)
    return res


X = np.random.normal(size=(5000, 5000))
res_1 = max_reduce_axis_1(X)


@numba.njit(debug=True, error_model='numpy')
def max_reduce_axis_2(x):
    res = np.empty((x.shape[0],), dtype=x.dtype)
    for i in range(x.shape[0]):
        res[i] = -np.inf
        for j in range(x.shape[1]):
            tmp = x[i, j]
            if res[i] < tmp:
                res[i] = tmp
    return res


res_2 = max_reduce_axis_2(X)
assert np.array_equal(res_1, res_2)
assert np.array_equal(res_1, np.max(X, axis=1))

def show_llvm_cfg(func):
    func.inspect_cfg(func.signatures[0]).pretty_printer(interleave=True, view=True)

show_llvm_cfg(max_reduce_axis_1)
show_llvm_cfg(max_reduce_axis_2)
1 Like

@fanshi118

No problem.

Numba does a two pass LLVM based optimisation. The first pass is “cheap” and is intended to inline as much as possible, do some basic things that help loop vectorization, and then run the all important “reference count pruning” pass. This reference count pruning pass removes redundant reference counting (refcount) operations in many common cases and as a result removes the side effects that having refcounts would cause, this lets the LLVM standard optimisations do a lot more. The second pass is like a traditional -O3 -vectorize and is run on the refcount pruned IR.

The reason the “cheap” pass is not O3 is that it would be expensive in compile time for what is, most of the time, relatively little gain (the O3 pass is run immediately afterwards on recount pruned IR and the optimisers will struggle if there’s refcounts in the IR). Obviously there are outlier cases such as the one presented here (and a few in the issue tracker) where running O3 in both has benefit.

Compilation is all about trade-offs, there’s no pass sequence that will generate optimally performing output for any input, and running more optimisations does not necessarily lead to better performance, but it does lead to higher compile times! As a result Numba’s current optimisation sequence is basically what is considered a “reasonable default” to try and get reasonable performance from a reasonable compilation effort to satisfy the majority of use cases.

I have some long-term ideas about how to put some of this tuning in the hands of users and this is something I’d like to explore more. Being able to say “I’m in HPC production, spend as much time tuning as possible” is very useful in some cases, as is “this function is essentially cold, just compile it quickly don’t waste time trying to optimise it”.

This probably needs the CFGs and LLVM vectorizer debug output analysing. Numba performance doesn't scale as well as NumPy in vectorized max function - #9 by stuartarchibald helps with the former and:

from llvmlite import binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')

shows the latter.

Hope this helps.

2 Likes

After reading @stuartarchibald replies and reviewing the low-level code with @gmarkall a little bit about this, it seemed like we could accomplish NumPy performance for ufunc.reduce-like operations within Numba if we add some user-configurable flags for the optimization passes in Numba (e.g. the lines that @stuartarchibald changed).

Just to clarify and recap the entire situation, when Aesara is asked to convert a graph representing np.max(x, y, axis=1) to a Numba-njited function, it does so piece-by-piece. That starts with a scalar max function, for which we can generate a custom vectorized function like the following (well, at least after Add support for `np.broadcast_to` by guilhermeleobas · Pull Request #7119 · numba/numba · GitHub goes through):

import numpy as np

import numba


@numba.njit
def vectorized_max(x, y, out=None):
    if out is None:
        out = np.empty((x.shape[0],), dtype=np.float64)

    for i in range(out.shape[0]):
        if x[i] > y[i]:
            out[i] = x[i]
        else:
            out[i] = y[i]
    return out

With this function, Aesara can then implement the axis=1 part using something like the following:

@numba.njit
def max_reduce_axis_1(x):
    x_transpose = np.transpose(x)

    res = np.full((x.shape[0]), -np.inf, dtype=np.float64)
    for m in range(x.shape[1]):
        vectorized_max(res, x_transpose[m], res)

    return res

It seems like the resulting max_reduce_axis_1 can be currently optimized to the same degree as @stuartarchibald’s all-in-one example. Apparently, all that’s left is to get the extra SIMD-related optimizations that were enabled by setting loop_vectorize=True, slp_vectorize=True, and possibly opt=3 in the “cheap” optimization pass.

Again, a quick fix might involve additional user-configurable options that allow the adjustment in the “cheap” pass; however, since we definitely don’t want to increase the overall compile time when the Numba backend is used (especially when the plan is to make it the default backend), it would be best if this option could be enabled only for the compilation of these specific ufunc.reduce-like functions.

I’m not sure if that’s possible via numba.config options (e.g. if we temporarily set those options, force immediate njit compilation of the function and unset the options), but, if it is, I would willing to put in a PR for this.

I just posted this discussion to this issue, which seems very similar (if not the same).

I was wondering if declaring the argument as a contiguous array would help (it helps a bit) and if a C implementation with AVX instructions would be faster (apparently twice as fast as Numpy). The AVX implementation might serve as a stronger baseline for Numba to target, since it runs approximately at the memory bandwidth limit. There is not much reason why such a simple function should be slower than memcpy since there is not much computation being done here.

Implementation Time (Microseconds)
Numba 1014.748 ± 160.301
Numba, contiguous 792.670 ± 47.724
Numpy 429.551 ± 42.410
C with AVX 235.201 ± 31.141
import os, time, ctypes, numba, numpy as np

with open("mylib.c", "w") as f:
    f.write("""
#include <immintrin.h>
#include <math.h>

// Note: Only works for array dimensions which are multiples of 4
void reduce(double *result, double *x, int n){
    for (int i = 0; i < n; i += 4){
        __m256d res[4];

        for (int k = 0; k < 4; k++){
            res[k] = _mm256_set1_pd(-INFINITY);
        }

        for (int j = 0; j < n; j += 4){
            for (int k = 0; k < 4; k++){
                res[k] = _mm256_max_pd(res[k], _mm256_loadu_pd(x + (i + k) * n + j));
            }
        }

        for (int k = 0; k < 4; k++){
            result[i + k] = fmax(fmax(res[k][0], res[k][1]), fmax(res[k][2], res[k][3]));
        }
    }
}
""")

# Compile and build C function
os.system("gcc -mavx -O3 -shared mylib.c -o mylib.so")
lib = ctypes.CDLL("./mylib.so")
_reduce = lib.reduce
c_double_p = ctypes.POINTER(ctypes.c_double)
_reduce.argtypes = [c_double_p, c_double_p, ctypes.c_int]
def c_avx_max_reduce_axis_1(x):
    result = np.zeros(len(x))
    _reduce(np.ctypeslib.as_ctypes(result), np.ctypeslib.as_ctypes(x.ravel()), len(x))
    return result

@numba.njit("f8[:](f8[:, :])")
def numba_max_reduce_axis_1(x):
    res = np.full((x.shape[0]), -np.inf, dtype=x.dtype)

    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            if res[i] < x[i, j]:
                res[i] = x[i, j]

    return res

@numba.njit("f8[:](f8[:, ::1])")
def numba_c_order_max_reduce_axis_1(x):
    res = np.full((x.shape[0]), -np.inf, dtype=x.dtype)

    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            if res[i] < x[i, j]:
                res[i] = x[i, j]

    return res

def np_max_reduce_axis_1(x):
    return x.max(axis=1)

test_data = np.random.normal(size=(1024, 1024))

functions = [
    numba_max_reduce_axis_1,
    numba_c_order_max_reduce_axis_1,
    np_max_reduce_axis_1,
    c_avx_max_reduce_axis_1,
]

results = [max_reduce_axis_1(test_data) for max_reduce_axis_1 in functions]

for res in results:
    assert(np.allclose(results[0], res))

for max_reduce_axis_1 in functions:
    timings = []
    for _ in range(1000):
        t = time.perf_counter()

        max_reduce_axis_1(test_data)

        timings.append((time.perf_counter() - t) * 1e6)

    print(f"| {max_reduce_axis_1.__name__:31} | {np.mean(timings):9.3f} +- {np.std(timings):8.3f} us |")

1 Like

@sklam, @stuartarchibald, @gmarkall, per our conversation today, here’s an updated script with all the Numba implementations we’ve tried for np.max.reduce(..., axis=1):

from contextlib import contextmanager

import numba
import numpy as np


X = np.random.normal(size=(5000, 6000))

#
# This is what we're trying to implement in Numba:
#
numpy_res = np.max(X, axis=1)

# This is what we're trying to match/beat:
# %timeit np.max(X, axis=1)
# 15.6 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#
# NOTE: We do *not* want to enable `parallel=True`, `fastmath=True`, or any
# other options that are neither generalizable nor used by NumPy.
#

@numba.njit(inline="always")
def custom_max(x, y):
    if x > y:
        return x
    else:
        return y


#
# Case 1: Using `vectorize` along the reduced dimension
#
@numba.vectorize(["float64(float64, float64)"])
def vectorized_max(x, y):
    return custom_max(x, y)


@numba.njit
def vectorized_max_reduce_axis_1(x):
    res = np.full((x.shape[0],), -np.inf, dtype=x.dtype)
    x_transpose = np.transpose(x)
    for m in range(x.shape[1]):
        vectorized_max(res, x_transpose[m], res)

    return res


# Confirm that it works
assert np.array_equal(numpy_res, vectorized_max_reduce_axis_1(X))

# %timeit vectorized_max_reduce_axis_1(X)
# 94.2 ms ± 435 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

#
# Case 2: Manually written Numba reduction loops
#
@numba.njit
def manual_max_reduce_axis_1(x):
    res = np.full((x.shape[0],), -np.inf, dtype=x.dtype)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            res[i] = custom_max(res[i], x[i, j])
    return res


assert np.array_equal(numpy_res, manual_max_reduce_axis_1(X))

# %timeit manual_max_reduce_axis_1(X)
# 38.3 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


#
# Case 3: Calls to NumPy's `ufunc.reduce` via `objmode`
#
@numba.njit
def objmode_max_reduce_axis_1(x):
    with numba.objmode(res="float64[:]"):
        res = vectorized_max.reduce(x, axis=1)
    return res


assert np.array_equal(numpy_res, objmode_max_reduce_axis_1(X))

# %timeit objmode_max_reduce_axis_1(X)
# 73.7 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


#
# Case 4: Recompile Case 2 with more LLVM optimizations in the "cheap" pass
#
# This comes from the discussions in
# https://numba.discourse.group/t/numba-performance-doesnt-scale-as-well-as-numpy-in-vectorized-max-function/782
#
@contextmanager
def use_optimized_cheap_pass(*args, **kwargs):
    """Temporarily replace the cheap optimization pass with a better one."""
    from numba.core.registry import cpu_target

    context = cpu_target.target_context._internal_codegen
    old_pm = context._mpm_cheap
    new_pm = context._module_pass_manager(
        loop_vectorize=True, slp_vectorize=True, opt=3, cost="cheap"
    )
    context._mpm_cheap = new_pm
    try:
        yield
    finally:
        context._mpm_cheap = old_pm


with use_optimized_cheap_pass():

    @numba.njit
    def opt_manual_max_reduce_axis_1(x):
        res = np.full((x.shape[0],), -np.inf, dtype=x.dtype)
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                res[i] = custom_max(res[i], x[i, j])
        return res

    assert np.array_equal(numpy_res, opt_manual_max_reduce_axis_1(X))

# %timeit opt_manual_max_reduce_axis_1(X)
# 37.6 ms ± 872 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


#
# Case 5: This is Stuart's original form of the reduction
#
# Apparently, removing the use of `custom_max` and manually "in-lining" the
# `max` operation makes a large difference; however, we can't reasonably do
# this for every binary function that will perform a reduction.
#
with use_optimized_cheap_pass():

    @numba.njit
    def orig_opt_max_reduce_axis_1(x):
        res = np.full((x.shape[0],), -np.inf, dtype=x.dtype)
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                tmp = x[i, j]
                if res[i] < tmp:
                    res[i] = tmp
        return res

    assert np.array_equal(numpy_res, orig_opt_max_reduce_axis_1(X))

# %timeit orig_opt_max_reduce_axis_1(X)
# 20.3 ms ± 233 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

It looks like the “cheap” pass optimization hack only helps when the binary function is manually in-lined; however, we need to generalize these reduction loops by calling nearly arbitrary binary functions (e.g. like custom_max in the example above), so we can’t reasonably in-line the binary operation as in @stuartarchibald’s example. The performance for that example is acceptable for our use-case, though, so, if we could get res[i] = custom_max(res[i], x[i, j]) to be treated more or less the same as

tmp = x[i, j]
if res[i] < tmp:
    res[i] = tmp

we might be fine.