A huge performance penalty for this simple function

Edit: adding quotes that I should have referenced in the first place…

Me too, a very stimulating conversation!

Both great insights… I got anchored to the ‘sort question’ and googled for a fast sort algorithm. @sschaer’s implementation broke through that and reduced the problem to core principles, demonstrating yet again that algorithms are the key to performance.

Right, Numba runs faster with unsigned NumPy array indexing because it doesn’t have to check for wraparound (negative indices)… explicit casting works as well:

# sort4 (from the original post) sorts the four values; b and c are the middle two
a, b, c, d = sort4(im[uint64(i)], im[uint64(i + 1)], im[uint64(i + w)], im[uint64(i + w + 1)])
ce[uint64(i)] = (b + c) / 2
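
To make the casting idea self-contained, here is a minimal sketch of it. The function name getCentersCast is mine, and the body is simplified to a plain four-neighbour average just to show the indexing; np.uint64 is used as the cast, which Numba also accepts in nopython mode:

import numpy as np
import numba as nb

@nb.njit
def getCentersCast(im, ce, w):
  # Casting the indices to an unsigned type is meant to let Numba drop the
  # negative-index (wraparound) handling it otherwise emits for signed indices.
  one = np.uint64(1)
  wj = np.uint64(w)
  for i in range(ce.size - w - 1):
    j = np.uint64(i)
    ce[j] = (im[j] + im[j + one] + im[j + wj] + im[j + wj + one]) / 4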

I tweaked it a bit. Your version takes 0.085ms on my PC. This version:

import numba as nb

@nb.njit(fastmath=True, inline='always')
def avgMid2(d0, d1, d2, d3):
  mi = min(d0, d1, d2, d3)
  ma = max(d0, d1, d2, d3)
  return (d0+d1+d2+d3-mi-ma)/2


@nb.njit(fastmath=True, locals=dict(w=nb.uint32))
def getCenters(im, ce, w):
  for i in range(ce.size-w-1):
    ce[i] = avgMid2(im[i], im[i+1], im[i+w], im[i+w+1])

takes 0.072ms. It is also roughly 2x faster than the C++ version compiled with Clang 15.0.7, which takes 0.149ms:

#include <cstdint>
#include <vector>
#include <algorithm>

using namespace std;

typedef int32_t Int;

inline float avgMid2(float d0, float d1, float d2, float d3) {
  float mi = min(min(d0, d1), min(d2, d3));
  float ma = max(max(d0, d1), max(d2, d3));
  return (d0+d1+d2+d3-mi-ma)/2;
}


void getCenters(vector<float> &im, vector<float> &ce, Int w) {
  Int limit = im.size()-w-1;

  for (Int i=0; i < limit; i++)
    ce[i] = avgMid2(im[i], im[i+1], im[i+w], im[i+w+1]);
}

GCC 12.1.0 does a bit better, at 0.128ms.
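
For anyone who wants to reproduce these numbers, a rough timing harness along these lines should work. The image size and repetition count are my assumptions (the posts above don’t state the input size), so absolute times will differ:

import time
import numpy as np

# assumes the njit-compiled getCenters from the post above is already defined
w = 256                                    # assumed width; not stated in the thread
im = np.random.rand(w * w).astype(np.float32)
ce = np.zeros_like(im)

getCenters(im, ce, w)                      # warm-up call so JIT compilation is excluded

reps = 1000
t0 = time.perf_counter()
for _ in range(reps):
  getCenters(im, ce, w)
t1 = time.perf_counter()
print((t1 - t0) / reps * 1e3, "ms per call")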

Somewhat related to this kind of unpredictable performance variation: I posted a question a while ago about parallel performance.
https://numba.discourse.group/t/floyd-warshalls-in-numba-vs-c-openmp/1719

The variation from just moving the minimum computation into a separate function was very noticeable; I’m not sure why. The two variants compared there were roughly of the shape sketched below.
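
This is my own simplified, single-threaded sketch of Floyd–Warshall, not the code from the linked post (that one used prange for the parallel version); the names relax, fw_with_helper and fw_inline are mine:

import numpy as np
import numba as nb

@nb.njit(inline='always')
def relax(dik, dkj, dij):
  # "separate function" variant: the min lives in its own (inlined) helper
  return min(dij, dik + dkj)

@nb.njit
def fw_with_helper(d):
  n = d.shape[0]
  for k in range(n):
    for i in range(n):
      for j in range(n):
        d[i, j] = relax(d[i, k], d[k, j], d[i, j])

@nb.njit
def fw_inline(d):
  n = d.shape[0]
  for k in range(n):
    for i in range(n):
      dik = d[i, k]
      for j in range(n):
        # same update written directly in the loop body
        s = dik + d[k, j]
        if s < d[i, j]:
          d[i, j] = s

# usage: d is an n x n distance matrix, e.g. d = np.random.rand(500, 500)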