A huge performance penalty for this simple function

I’m benchmarking the getCenters function shown below with Python 3.10.6, Numba 0.56.0, and Ubuntu 22.04.1 on an x64 CPU:

import numpy as np
import numba as nb
import timeit as ti


@nb.njit(fastmath=True)
def getCenters(im, ce, w):
  # For each pixel, sort its 2x2 neighborhood and average the two middle values.
  for i in range(ce.size-w-1):
    square = [im[i], im[i+1], im[i+w], im[i+w+1]]
    square.sort()
    ce[i] = (square[1]+square[2])/2


w, h = 640, 480
im = np.random.rand(w*h).astype(np.float32)
ce = np.empty_like(im)

fun = 'getCenters(im, ce, w)'
# Running the statement once as setup triggers JIT compilation, so it isn't timed.
t = 1000 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=10))
print(f'{fun}:  {np.amin(t):6.3f}ms  {np.median(t):6.3f}ms')

and getting about 151ms execution time. The equivalent C++ version, shown here:

#include <algorithm>
#include <array>
#include <vector>

using namespace std;

void getCenters(vector<float> &im, vector<float> &ce, int w) {
  for (int i=0; i < im.size()-w-1; i++) {
    array<float, 4> square = {im[i], im[i+1], im[i+w], im[i+w+1]};

    sort(square.begin(), square.end());
    ce[i] = (square[1]+square[2])/2;
  }
}

executes in 6.33ms when compiled with gcc 11.2. Why is the Numba version 24 times slower?

It seems that sort() is a high-cost operation. Commenting it out lowers the time from 151ms to 22ms. Still far from the C++ version, though.

Hi @pauljurczak,

Have you tried using NumPy arrays instead of a list? array.sort might have a better algorithm.

Please also take into account that you are using a very different data structure. A fixed-size array in C++ will always be faster than the more dynamic Python structures.

Luk

Hi @pauljurczak,

As noted by @luk-f-a, in Numba, data access through NumPy arrays is often faster than through Python containers. This is in part because there is no bounds check by default, and also because NumPy arrays are fixed size and, in the case above, can be guaranteed contiguous in memory.
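(Aside: the check can be re-enabled per function while debugging, at some performance cost - this uses the documented boundscheck option; first is just a made-up example:)

@nb.njit(boundscheck=True)  # bounds checking is off by default; this turns it back on
def first(im):
    return im[0]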

The memory allocation for the variable square is loop invariant (it’s always a container of size 4), so perhaps try moving the allocation out of the loop and just writing to it in each iteration. This will hopefully speed things up (in the C++ code, the container is declared with a constant size and, were I to guess, is probably hoisted out of the loop and allocated on the stack).

With respect to sort(), IIRC Numba’s sort() isn’t quite as efficient as NumPy’s (or indeed other heavily optimised sorting algorithms; there’s a warning in Numba’s docs about this!). Given there are only 4 values and the algorithm just needs the two in the middle, perhaps just write something custom?

Hope this helps?

Yes, I tried that. This version:

@nb.njit(fastmath=True)
def getCenters(im, ce, w):
  for i in range(ce.size-w-1):
    square = np.array([im[i], im[i+1], im[i+w], im[i+w+1]])
    square.sort()  # in-place; np.sort(square) would return a copy and discard the result
    ce[i] = (square[1]+square[2])/2

is slower, with 176ms execution time.

Taking square out of the loop doesn’t help. This version:

@nb.njit(fastmath=True)
def getCenters(im, ce, w):
  square = np.zeros(4, dtype=np.float32)

  for i in range(ce.size-w-1):
    square[:] = im[i], im[i+1], im[i+w], im[i+w+1]
    square.sort()  # in-place sort
    ce[i] = (square[1]+square[2])/2

executes slower, taking 158ms.

Almost all of the cost seems to be the result of the sort function.

@nb.njit(fastmath=True)
def getCenters(im, ce, w):
  square = np.zeros(4, dtype=np.float32)
  for i in range(ce.size-w-1):
    square[:] = im[i], im[i+1], im[i+w], im[i+w+1]
    # square.sort()
    ce[i] = (square[1]+square[2])/2

… gives:

getCenters(im, ce, w):  0.493ms  0.531ms

Likely square is still being allocated (I assume).

You might want to profile the sort alone for size-4 arrays, to see how it compares to the C++ version; a sketch of how to isolate it is below.
Perhaps the numpy/numba version is much slower than the C++ version for small arrays.
Bubble sort (or other non-recursive) sorting algorithms are also likely faster for very small inputs.
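For instance, something along these lines isolates the sort cost (sortOnly is a made-up name; it can go through the same timeit harness and np/nb imports as the first post):

@nb.njit(fastmath=True)
def sortOnly(im, ce, w):
    square = np.zeros(4, dtype=np.float32)
    for i in range(ce.size-w-1):
        square[:] = im[i], im[i+1], im[i+w], im[i+w+1]
        square.sort()       # the operation under test
        ce[i] = square[0]   # keep a data dependency so the sort isn't dead code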

In any case, because you are always sorting over 4 items here, a special case implementation that identifies the min/max and uses what’s left may be faster.

Re: special casing: there’s a very good chance this also explains the C++ version - the compiler has knowledge of the array size being sorted. I can’t be sure though, since it’s not clear which sort function you are using in the C++ version. (Plus, it’s allocated on the stack, which is generally much faster than heap allocation, and certainly faster than the overhead of numpy array allocation.)

If you super-duper want to know what’s going on, check out the C++ Compiler Explorer - see Jason Turner’s C++ Weekly YouTube channel. No permissions to include links … sigh.
You can easily see what happened to the C++ implementation this way, and numba has a similar feature to output the LLVM IR or the assembly. Perhaps the C++ sort has become non-recursive because of the small, fixed array size.
I’d be very interested to know if numba can achieve similar speed-ups if given similar hints.
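For reference, the dispatcher methods inspect_llvm and inspect_asm do this (assuming the getCenters and arguments from the first post; the function must be called once so a compiled signature exists):

getCenters(im, ce, w)  # trigger compilation

# LLVM IR, one entry per compiled signature
for sig, ir in getCenters.inspect_llvm().items():
    print(sig)
    print(ir)

# final machine code for the first compiled signature
print(getCenters.inspect_asm(getCenters.signatures[0]))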

Right - an inline, stack-based sort algorithm along these lines would likely make a substantial difference.


Just for fun, here’s what I get with such an implementation:

getCenters(im, ce, w): 131.477ms 136.278ms
getCenters1(im, ce, w): 0.394ms 0.412ms

@nb.njit(inline='always')
def sort4(d0, d1, d2, d3):
    # 6-comparison sorting network for 4 values
    if d2 < d1: d1, d2 = d2, d1
    if d2 < d0: d0, d2 = d2, d0
    if d1 < d0: d0, d1 = d1, d0
    if d3 < d0: d0, d3 = d3, d0
    if d3 < d1: d1, d3 = d3, d1
    if d3 < d2: d2, d3 = d3, d2
    return d0, d1, d2, d3


@nb.njit(fastmath=True)
def getCenters1(im, ce, w):
    for i in range(ce.size - w - 1):
        a, b, c, d = sort4(im[i], im[i + 1], im[i + w], im[i + w + 1])
        ce[i] = (b + c) / 2

True, but there is huge instability in how the optimizer handles the IR passed to it, or in what IR Numba generates. I commented out the sort in my original post and the execution time was 22ms. Commenting it out in my latest example, as you did, drops the execution time to 0.5ms. I think both cases should give the optimizer the same opportunities to produce fast binaries. Additionally, commenting out the sort in the C++ code reduces execution time to 0.054ms, which makes Numba 10x slower. I’m curious if there is anything I can do to close this gap, while using essentially the same code on the Python and C++ sides.

Nelson’s sort4 above seems to do a good job.
There’s also always going to be some overhead from calling from Python land and from figuring out which optimized implementation to call. Also, I imagine C++'s optimizer gives it advantages that numba might not make use of yet.

Excellent! This is a notch faster than the equivalent C++ implementation. On my PC: 0.6ms for Python and 0.63ms for C++. Here is my C++ code:

#include <tuple>
#include <utility>
#include <vector>

using namespace std;

tuple<float, float, float, float> sort4(float d0, float d1, float d2, float d3) {
    if (d2 < d1)  swap(d1, d2);
    if (d2 < d0)  swap(d0, d2);
    if (d1 < d0)  swap(d0, d1);
    if (d3 < d0)  swap(d0, d3);
    if (d3 < d1)  swap(d1, d3);
    if (d3 < d2)  swap(d2, d3);
    return {d0, d1, d2, d3};
}


void getCenters(vector<float> &im, vector<float> &ce, int w) {
  for (int i=0; i < im.size()-w-1; i++) {
    auto [a, b, c, d] = sort4(im[i], im[i+1], im[i+w], im[i+w+1]);
    ce[i] = (b+c)/2;
  }
}

I’m still unsettled by the amount of tinkering and algorithm changes needed to get performance parity between Numba and C++.

I’d look at it another way… the numba solution was 10x faster than your original C++ implementation, right? Are you concerned about the amount of tinkering you had to do to get C++ as fast as numba? :wink:


I’m concerned about the lack of predictability. Some algorithms, e.g. sort4, perform equally efficiently in both languages; others produce a huge performance discrepancy.


Welcome to optimization, generally.


It would be nice to know why in-place sort is so slow in numba. It might help guide our intuitions while tuning.

Looking at the IR, numba generates a ~200-instruction generic quicksort, which ultimately prevents the optimizer from inferring that the array is small and fixed-size, and from eliminating a lot of code. An inlined sort4 implementation, by contrast, generates almost exactly the same code as optimized C++ std::sort.

The optimization opportunity is detected for the C++ version; the std::sort implementation that I looked at insertion-sorts containers fewer than 16 elements long, which loop-unrolls very nicely for fixed sizes.

I guess if misc/quicksort.py included a similar array-length test - roughly the pattern sketched below - it might well be possible to do the same in numba, assuming that the IR optimization can detect that the size is constant?
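In plain Python, the length test would look something like this (the threshold of 16 mirrors the std::sort cutoff mentioned above; hybrid_sort is an illustrative sketch, not numba’s actual misc/quicksort.py):

INSERTION_THRESHOLD = 16  # hypothetical cutoff, mirroring std::sort

def hybrid_sort(a, lo, hi):
    # Sort a[lo:hi] in place: insertion sort for small ranges, quicksort otherwise.
    if hi - lo < INSERTION_THRESHOLD:
        # Insertion sort: no recursion, and it unrolls well when the
        # range length is a compile-time constant.
        for i in range(lo + 1, hi):
            v = a[i]
            j = i - 1
            while j >= lo and a[j] > v:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = v
    else:
        # Plain Lomuto-partition quicksort for larger ranges.
        pivot = a[hi - 1]
        p = lo
        for i in range(lo, hi - 1):
            if a[i] < pivot:
                a[i], a[p] = a[p], a[i]
                p += 1
        a[p], a[hi - 1] = a[hi - 1], a[p]
        hybrid_sort(a, lo, p)
        hybrid_sort(a, p + 1, hi)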

I guess it’s fair that I can’t post links, but godbolt[dot]org/z/Yco3zvcM1 is where I tested this:

top: numba np.sort
middle: numba sort4
bottom: C++ calling std::sort

@folded Super interesting. Thanks!

@stuartarchibald
Could it be made possible for numba to see when arrays declared in the local scope are initialized with a size that is constant in that scope, and pass that down to the optimizer? I can imagine there are many optimizations this would unlock.
A lot of my time in the 00’s was spent providing integer template parameters to enable C++ loop unrolling with template routines that do exactly this.
(Think rotation matrices, NxM convolutions, … the speedup can be profound.) C++, it seems, can now do this more directly.
Numba should ultimately be able to do the same, no? Re: convolutions, optimizations for fixed sizes: DNN layers have fixed sizes, and there is often a desire to do more complex operations than dot products. A sketch of how this can be approximated today follows below.
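(Something in this direction already works in Numba: values captured in a closure are frozen as compile-time constants, so the inner trip count below is known to the compiler. make_conv is a made-up name for illustration:)

import numba as nb

def make_conv(n):
    # n is captured in the closure and frozen as a compile-time constant,
    # so the compiled inner loop has a fixed trip count and can be unrolled.
    @nb.njit(fastmath=True)
    def conv(x, kernel, out):
        for i in range(x.size - n + 1):
            acc = 0.0
            for j in range(n):  # constant trip count n
                acc += x[i + j] * kernel[j]
            out[i] = acc
    return conv

conv3 = make_conv(3)  # specialized for kernel size 3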

This flexibility seems like a good step to expand the relevance and differentiation of the project (and perhaps attract funding from some large entity who would benefit from easy optimization?).
My company’s too small for $$, but I am interested to contribute.

Fair. But C++ also suffers from a lack of predictability. It benefitted from seeing len=4; if it had not, you may have been asking, “why is C++ so slow?”

Generally, to know why things are slow, you need to investigate in detail. This is the nature of optimization in any language.


Hi everyone

I really like the development of this topic and the discussion it has triggered. Just for fun, I would like to propose a very different algorithm for solving the initial problem.
I would never think of this problem as a sorting problem. In reality, it is a matter of finding the sum of the two middle values from a set of four (let’s forget about dividing by two). This is equivalent to subtracting the smallest and largest values from the sum of all four. Implementing this with Numba gives a solution that is even faster than @nelson2005’s.
A very naive implementation, which can certainly be made more efficient with a little thought, might look like this:

@nb.njit(inline='always')
def get_min_max(v0, v1, v2, v3):
    vmin = v0
    if vmin > v1: vmin = v1
    if vmin > v2: vmin = v2
    if vmin > v3: vmin = v3
    vmax = v0
    if vmax < v1: vmax = v1
    if vmax < v2: vmax = v2
    if vmax < v3: vmax = v3
    return vmin, vmax
    

@nb.njit(fastmath=True)
def getCenters(im, ce, w):
    for i in range(ce.size - w - 1):
        v0, v1, v2, v3 = im[i], im[i + 1], im[i + w], im[i + w + 1]
        vmin, vmax = get_min_max(v0, v1, v2, v3)
        ce[i] = (v0 + v1 + v2 + v3 - vmin - vmax) / 2

Maybe someone else has yet another idea or likes to improve on this one.

Edit: Regardless of the algorithm, you can gain a lot of performance by specifying w as unsigned:

@nb.njit(fastmath=True, locals=dict(w=nb.uint32))
def getCenters(im, ce, w):
    ...

This is mind-bending! Specialized sort version:

@nb.njit(inline='always')
def mid2(d0, d1, d2, d3):
    # same sorting network as sort4 above, returning only the two middle values
    if d2 < d1: d1, d2 = d2, d1
    if d2 < d0: d0, d2 = d2, d0
    if d1 < d0: d0, d1 = d1, d0
    if d3 < d0: d0, d3 = d3, d0
    if d3 < d1: d1, d3 = d3, d1
    if d3 < d2: d2, d3 = d3, d2
    return d1, d2


@nb.njit(fastmath=True)
def getCenters(im, ce, w):
  for i in range(ce.size-w-1):
    a, b = mid2(im[i], im[i+1], im[i+w], im[i+w+1])
    ce[i] = (a+b)/2

takes 0.6ms, but after specifying the type of w with:

@nb.njit(fastmath=True, locals=dict(w=nb.uint32))

it takes only 0.085ms. This is magic! OTOH, when I specify:

@nb.njit(fastmath=True, locals=dict(w=nb.int32))

it runs slower than the original, taking 0.67ms.

To make it even more convoluted, the C++ version with int32_t w takes 0.63ms, but uint32_t w takes 3.6ms (compiled with gcc 11.2). Go figure. I’m going to try newer C++ compilers.