A huge performance penalty for this simple function

I’m benchmarking the getCenters function shown below with Python 3.10.6, Numba 0.56.0, and Ubuntu 22.04.1 on an x64 CPU:

import numpy as np
import numba as nb
import timeit as ti


@nb.njit(fastmath=True)
def getCenters(im, ce, w):
  # For each pixel, sort its 2x2 neighborhood and average the two middle values.
  for i in range(ce.size-w-1):
    square = [im[i], im[i+1], im[i+w], im[i+w+1]]
    square.sort()
    ce[i] = (square[1]+square[2])/2


w, h = 640, 480
im = np.random.rand(w*h).astype(np.float32)
ce = np.empty_like(im)

fun = 'getCenters(im, ce, w)'
# Running the statement once as setup triggers JIT compilation, so it isn't timed.
t = 1000 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=10))
print(f'{fun}:  {np.amin(t):6.3f}ms  {np.median(t):6.3f}ms')

and getting about 151ms execution time. The equivalent C++ version, shown here:

#include <algorithm>
#include <array>
#include <vector>

using namespace std;

void getCenters(vector<float> &im, vector<float> &ce, int w) {
  for (int i=0; i < im.size()-w-1; i++) {
    array<float, 4> square = {im[i], im[i+1], im[i+w], im[i+w+1]};

    sort(square.begin(), square.end());
    ce[i] = (square[1]+square[2])/2;
  }
}

executes in 6.33ms when compiled with gcc 11.2. Why is the Numba version 24 times slower?

It seems that sort() is a high-cost operation. Commenting it out lowers the time from 151ms to 22ms. Still far from the C++ version, though.

Hi @pauljurczak,

Have you tried using NumPy arrays instead of a list? array.sort might have a better algorithm.

Please also take into account that you are using a very different data structure. A fixed-size array in C++ will always be faster than the more dynamic Python structures.

Luk

Hi @pauljurczak,

As noted by @luk-f-a, in Numba, data access through NumPy arrays is often faster than through Python containers. This is in part because there is no bounds check by default, and also because NumPy arrays are fixed size and, in the case above, can be guaranteed contiguous in memory.
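(Aside: the check can be re-enabled per function while debugging, at some performance cost - this uses the documented boundscheck option; first is just a made-up example:)

@nb.njit(boundscheck=True)  # bounds checking is off by default; this turns it back on
def first(im):
    return im[0]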

The memory allocation for the variable square is loop invariant (it’s always a container of size 4), so perhaps try moving the allocation out of the loop and just writing to it in each iteration. This will hopefully speed things up (in the C++ code, the container is declared with a constant size and, were I to guess, is probably hoisted out of the loop and allocated on the stack).

With respect to sort(), IIRC Numba’s sort() isn’t quite as efficient as NumPy’s (or indeed other heavily optimised sorting algorithms; there’s a warning in Numba’s docs about this!). Given there are only 4 values and the algorithm just needs the two in the middle, perhaps just write something custom?

Hope this helps?

Yes, I tried that. This version:

@nb.njit(fastmath=True)
def getCenters(im, ce, w):
  for i in range(ce.size-w-1):
    square = np.array([im[i], im[i+1], im[i+w], im[i+w+1]])
    square.sort()  # in-place; np.sort(square) would return a copy and discard the result
    ce[i] = (square[1]+square[2])/2

is slower, with 176ms execution time.

Taking square out of the loop doesn’t help. This version:

@nb.njit(fastmath=True)
def getCenters(im, ce, w):
  square = np.zeros(4, dtype=np.float32)

  for i in range(ce.size-w-1):
    square[:] = im[i], im[i+1], im[i+w], im[i+w+1]
    square.sort()  # in-place sort
    ce[i] = (square[1]+square[2])/2

executes slower, taking 158ms.

Almost all of the cost seems to be the result of the sort function.

@nb.njit(fastmath=True)
def getCenters(im, ce, w):
  square = np.zeros(4, dtype=np.float32)
  for i in range(ce.size-w-1):
    square[:] = im[i], im[i+1], im[i+w], im[i+w+1]
    # square.sort()
    ce[i] = (square[1]+square[2])/2

… gives:

getCenters(im, ce, w):  0.493ms  0.531ms

Likely square is still being allocated (I assume).

You might want to profile the sort alone for size-4 arrays, to see how it compares to the C++ version; a sketch of how to isolate it is below.
Perhaps the numpy/numba version is much slower than the C++ version for small arrays.
Bubble sort (or other non-recursive) sorting algorithms are also likely faster for very small inputs.
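For instance, something along these lines isolates the sort cost (sortOnly is a made-up name; it can go through the same timeit harness and np/nb imports as the first post):

@nb.njit(fastmath=True)
def sortOnly(im, ce, w):
    square = np.zeros(4, dtype=np.float32)
    for i in range(ce.size-w-1):
        square[:] = im[i], im[i+1], im[i+w], im[i+w+1]
        square.sort()       # the operation under test
        ce[i] = square[0]   # keep a data dependency so the sort isn't dead code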

In any case, because you are always sorting over 4 items here, a special case implementation that identifies the min/max and uses what’s left may be faster.

Re: special casing: there’s a very good chance this also explains the C++ version - the compiler has knowledge of the array size being sorted. I can’t be sure though, since it’s not clear which sort function you are using in the C++ version. (Plus, it’s allocated on the stack, which is generally much faster than heap allocation, and certainly faster than the overhead of numpy array allocation.)

If you super-duper want to know what’s going on, check out the C++ Compiler Explorer - see Jason Turner’s C++ Weekly YouTube channel. No permissions to include links … sigh.
You can easily see what happened to the C++ implementation this way, and numba has a similar feature to output the LLVM IR or the assembly. Perhaps the C++ sort has become non-recursive because of the small, fixed array size.
I’d be very interested to know if numba can achieve similar speed-ups if given similar hints.
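For reference, the dispatcher methods inspect_llvm and inspect_asm do this (assuming the getCenters and arguments from the first post; the function must be called once so a compiled signature exists):

getCenters(im, ce, w)  # trigger compilation

# LLVM IR, one entry per compiled signature
for sig, ir in getCenters.inspect_llvm().items():
    print(sig)
    print(ir)

# final machine code for the first compiled signature
print(getCenters.inspect_asm(getCenters.signatures[0]))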

Right - an inline, stack-based sort algorithm along these lines would likely make a substantial difference.


Just for fun, here’s what I get with such an implementation:

getCenters(im, ce, w): 131.477ms 136.278ms
getCenters1(im, ce, w): 0.394ms 0.412ms

@nb.njit(inline='always')
def sort4(d0, d1, d2, d3):
    # 6-comparison sorting network for 4 values
    if d2 < d1: d1, d2 = d2, d1
    if d2 < d0: d0, d2 = d2, d0
    if d1 < d0: d0, d1 = d1, d0
    if d3 < d0: d0, d3 = d3, d0
    if d3 < d1: d1, d3 = d3, d1
    if d3 < d2: d2, d3 = d3, d2
    return d0, d1, d2, d3


@nb.njit(fastmath=True)
def getCenters1(im, ce, w):
    for i in range(ce.size - w - 1):
        a, b, c, d = sort4(im[i], im[i + 1], im[i + w], im[i + w + 1])
        ce[i] = (b + c) / 2

True, but there is huge instability in how the optimizer handles the IR passed to it, or in what IR Numba generates. I commented out the sort in my original post and the execution time was 22ms. Commenting it out in my latest example, as you did, drops the execution time to 0.5ms. I think both cases should give the optimizer the same opportunities to produce fast binaries. Additionally, commenting out the sort in the C++ code reduces execution time to 0.054ms, which makes Numba 10x slower. I’m curious if there is anything I can do to close this gap, while using essentially the same code on the Python and C++ sides.

Nelson’s sort4 above seems to do a good job.
There’s also always going to be some overhead from calling from Python land and from figuring out which optimized implementation to call. Also, I imagine C++'s optimizer gives it advantages that numba might not make use of yet.

Excellent! This is a notch faster than the equivalent C++ implementation. On my PC: 0.6ms for Python and 0.63ms for C++. Here is my C++ code:

#include <tuple>
#include <utility>
#include <vector>

using namespace std;

tuple<float, float, float, float> sort4(float d0, float d1, float d2, float d3) {
    if (d2 < d1)  swap(d1, d2);
    if (d2 < d0)  swap(d0, d2);
    if (d1 < d0)  swap(d0, d1);
    if (d3 < d0)  swap(d0, d3);
    if (d3 < d1)  swap(d1, d3);
    if (d3 < d2)  swap(d2, d3);
    return {d0, d1, d2, d3};
}


void getCenters(vector<float> &im, vector<float> &ce, int w) {
  for (int i=0; i < im.size()-w-1; i++) {
    auto [a, b, c, d] = sort4(im[i], im[i+1], im[i+w], im[i+w+1]);
    ce[i] = (b+c)/2;
  }
}

I’m still unsettled by the amount of tinkering and algorithm changes needed to get performance parity between Numba and C++.

I’d look at it another way… the numba solution was 10x faster than your original C++ implementation, right? Are you concerned about the amount of tinkering you had to do to get C++ as fast as numba? :wink:


I’m concerned about the lack of predictability. Some algorithms, e.g. sort4, perform equally efficiently in both languages; others produce a huge performance discrepancy.


Welcome to optimization, generally.


It would be nice to know why in-place sort is so slow in numba. It might help guide our intuitions while tuning.

Looking at the IR, numba generates a ~200-instruction generic quicksort, which ultimately prevents the optimizer from inferring that the array is small and fixed-size, and from eliminating a lot of code. An inlined sort4 implementation, by contrast, generates almost exactly the same code as optimized C++ std::sort.

The optimization opportunity is detected for the C++ version; the std::sort implementation that I looked at insertion-sorts containers fewer than 16 elements long, which loop-unrolls very nicely for fixed sizes.

I guess if misc/quicksort.py included a similar array-length test - roughly the pattern sketched below - it might well be possible to do the same in numba, assuming that the IR optimization can detect that the size is constant?
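In plain Python, the length test would look something like this (the threshold of 16 mirrors the std::sort cutoff mentioned above; hybrid_sort is an illustrative sketch, not numba’s actual misc/quicksort.py):

INSERTION_THRESHOLD = 16  # hypothetical cutoff, mirroring std::sort

def hybrid_sort(a, lo, hi):
    # Sort a[lo:hi] in place: insertion sort for small ranges, quicksort otherwise.
    if hi - lo < INSERTION_THRESHOLD:
        # Insertion sort: no recursion, and it unrolls well when the
        # range length is a compile-time constant.
        for i in range(lo + 1, hi):
            v = a[i]
            j = i - 1
            while j >= lo and a[j] > v:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = v
    else:
        # Plain Lomuto-partition quicksort for larger ranges.
        pivot = a[hi - 1]
        p = lo
        for i in range(lo, hi - 1):
            if a[i] < pivot:
                a[i], a[p] = a[p], a[i]
                p += 1
        a[p], a[hi - 1] = a[hi - 1], a[p]
        hybrid_sort(a, lo, p)
        hybrid_sort(a, p + 1, hi)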

I guess it’s fair that I can’t post links, but godbolt[dot]org/z/Yco3zvcM1 is where I tested this:

top: numba np.sort
middle: numba sort4
bottom: C++ calling std::sort

@folded Super interesting. Thanks!

@stuartarchibald
Could it be made possible for numba to see when arrays declared in the local scope are initialized with a size that is constant in that scope, and pass that down to the optimizer? I can imagine there are many optimizations this would unlock.
A lot of my time in the 00’s was spent providing integer template parameters to enable C++ loop unrolling with template routines that do exactly this.
(Think rotation matrices, NxM convolutions, … the speedup can be profound.) C++, it seems, can now do this more directly.
Numba should ultimately be able to do the same, no? Re: convolutions, optimizations for fixed sizes: DNN layers have fixed sizes, and there is often a desire to do more complex operations than dot products. A sketch of how this can be approximated today follows below.
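(Something in this direction already works in Numba: values captured in a closure are frozen as compile-time constants, so the inner trip count below is known to the compiler. make_conv is a made-up name for illustration:)

import numba as nb

def make_conv(n):
    # n is captured in the closure and frozen as a compile-time constant,
    # so the compiled inner loop has a fixed trip count and can be unrolled.
    @nb.njit(fastmath=True)
    def conv(x, kernel, out):
        for i in range(x.size - n + 1):
            acc = 0.0
            for j in range(n):  # constant trip count n
                acc += x[i + j] * kernel[j]
            out[i] = acc
    return conv

conv3 = make_conv(3)  # specialized for kernel size 3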

This flexibility seems like a good step to expand the relevance and differentiation of the project (and perhaps attract funding from some large entity who would benefit from easy optimization?).
My company’s too small for $$, but I am interested to contribute.

Fair. But C++ also suffers from a lack of predictability. It benefitted from seeing len=4; if it had not, you may have been asking, “why is C++ so slow?”

Generally, to know why things are slow, you need to investigate in detail. This is the nature of optimization in any language.


Hi everyone

I really like the development of this topic and the discussion it has triggered. Just for fun, I would like to propose a very different algorithm for solving the initial problem.
I would never think of this problem as a sorting problem. In reality, it is a matter of finding the sum of the two middle values from a set of four (let’s forget about dividing by two). This is equivalent to subtracting the smallest and largest values from the sum of all four. Implementing this with Numba gives a solution that is even faster than @nelson2005’s.
A very naive implementation, which can certainly be made more efficient with a little thought, might look like this:

@nb.njit(inline='always')
def get_min_max(v0, v1, v2, v3):
    vmin = v0
    if vmin > v1: vmin = v1
    if vmin > v2: vmin = v2
    if vmin > v3: vmin = v3
    vmax = v0
    if vmax < v1: vmax = v1
    if vmax < v2: vmax = v2
    if vmax < v3: vmax = v3
    return vmin, vmax
    

@nb.njit(fastmath=True)
def getCenters(im, ce, w):
    for i in range(ce.size - w - 1):
        v0, v1, v2, v3 = im[i], im[i + 1], im[i + w], im[i + w + 1]
        vmin, vmax = get_min_max(v0, v1, v2, v3)
        ce[i] = (v0 + v1 + v2 + v3 - vmin - vmax) / 2

Maybe someone else has yet another idea or likes to improve on this one.

Edit: Regardless of the algorithm, you can gain a lot of performance by specifying w as unsigned:

@nb.njit(fastmath=True, locals=dict(w=nb.uint32))
def getCenters(im, ce, w):
    ...

This is mind-bending! Specialized sort version:

@nb.njit(inline='always')
def mid2(d0, d1, d2, d3):
    # same sorting network as sort4 above, returning only the two middle values
    if d2 < d1: d1, d2 = d2, d1
    if d2 < d0: d0, d2 = d2, d0
    if d1 < d0: d0, d1 = d1, d0
    if d3 < d0: d0, d3 = d3, d0
    if d3 < d1: d1, d3 = d3, d1
    if d3 < d2: d2, d3 = d3, d2
    return d1, d2


@nb.njit(fastmath=True)
def getCenters(im, ce, w):
  for i in range(ce.size-w-1):
    a, b = mid2(im[i], im[i+1], im[i+w], im[i+w+1])
    ce[i] = (a+b)/2

takes 0.6ms, but after specifying the type of w with:

@nb.njit(fastmath=True, locals=dict(w=nb.uint32))

it takes only 0.085ms. This is magic! OTOH, when I specify:

@nb.njit(fastmath=True, locals=dict(w=nb.int32))

it runs slower than the original, taking 0.67ms.

To make it even more convoluted, the C++ version with int32_t w takes 0.63ms, but uint32_t w takes 3.6ms (compiled with gcc 11.2). Go figure. I’m going to try newer C++ compilers.