Curious performance loss of NumPy function with Numba

I noticed a strange performance loss (4x to 8x, depending on the CPU (x86-64) and the dtype used) between f0() (slower) and f1() (faster) with Python 3.11.6 and Numba 0.58.1 in this script:

import numpy as np, numba as nb, timeit as ti

@nb.njit(fastmath=True)
def f0(img):
  return np.sum(np.argmax(img, axis=1))

def f1(img):
  return np.sum(np.argmax(img, axis=1))

H, W = 816, 256
img = (np.arange(H*W) % 255).reshape(H, W).astype('u1')

for fun in ('f0(img)', 'f1(img)'):
  t = 10**6 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=100))
  print(f'{fun}={eval(fun)}  time: {np.amin(t):6.3f}  {np.median(t):6.3f}  {np.amax(t):6.3f} [us]')

Is there an explanation or a solution to this performance problem? My use case is a bit more complex than this example, but is using argmax.

Using your configuration, it’s more like 2x for me. But the fastmath version seems to scale poorly with increasing array size.

import numpy as np
import numba as nb
import timeit as ti

@nb.njit(fastmath=True)
def f0(img):
    return np.sum(np.argmax(img, axis=1))

def f1(img):
    return np.sum(np.argmax(img, axis=1))

# warm up
H, W = 816, 256
img = (np.arange(H*W) % 255).reshape(H, W).astype('u1')
f0(img)
f1(img)

for size in [64, 128, 256, 512, 1024]:
    
    img = (np.arange(size*size) % 255).reshape(size, size).astype('u1')    
    
    print(img.shape)    
    for fun in ('f0(img)', 'f1(img)'):
        t = 10**6 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=100))
        print(f'{fun}={eval(fun)}  time: {np.amin(t):6.3f}  {np.median(t):6.3f}  {np.amax(t):6.3f} [us]')
        
    print()

For small arrays it’s actually faster, but on my machine the break-even point seems to be around 256x256 for this test:

[plot: f0 vs. f1 timings across the array sizes above]


Thanks for testing it. You added:

f0(img)
f1(img)

My reading of:

ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=100)

is that setup=fun will effectively perform the warm-up, which is confirmed by the timings I’m getting:

f0(img)=108834  time: 340.801  342.003  347.754 [us]
f1(img)=108834  time: 56.666  57.097  61.796 [us]

The maximum time (last number) doesn’t indicate compilation.

Regarding your question about whether there is an explanation for this, it might be worth looking at Numba’s implementation of argmax (this is what makes your jitted function slower than the non-jitted one): numba/numba/np/arraymath.py at main · numba/numba · GitHub

The Numpy implementation can be found here:

The Numpy implementation is a bit messy to read, but it seems that they don’t use an explicit SIMD implementation. However, playing with your example (especially the data type, but also the aspect ratio of the matrix), it appears that it nevertheless compiles to a function that utilizes SIMD very efficiently. I suppose LLVM has a hard time optimizing the Numba implementation. Still, I have some hope that it is possible to write an argmax that LLVM can optimize better, maybe even one that can keep up with Numpy. Something that is already much faster for your particular case is this:

@nb.njit
def rowwise_argmax(mat):
    n, m = mat.shape
    out = np.empty(n, dtype=np.int64)
    for i in range(n):
        arg = 0
        val = mat[i, 0]
        for j in range(1, m):
            if mat[i, j] > val:
                arg = j
                val = mat[i, j]
        out[i] = arg
    return out

I’ve had a quick look at the generated IR and it seems that both this and the Numba implementation use SIMD, but not in the way I would expect. It’s a very interesting case!
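
In case anyone wants to check this on their own machine, here is a minimal sketch of how the generated code can be inspected, assuming rowwise_argmax and img from the posts above are already defined:

# Trigger compilation first; inspect_* only reports compiled signatures.
rowwise_argmax(img)

# One entry per compiled signature; look for packed SIMD instructions
# (ymm/zmm registers, vpmaxub, ...) in the assembly listing.
for sig, asm in rowwise_argmax.inspect_asm().items():
    print(sig)
    print(asm[:2000])  # print only the beginning, the full listing is long

# The LLVM IR is available the same way via rowwise_argmax.inspect_llvm().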


Hi @pauljurczak

I did a little testing and found something that beats Numpy. It’s not very intuitive at first, but since it uses SIMD much more efficiently (exactly how you would like it to!), the function below performs extremely well. It first finds the maximum value and then checks where that value is located within the array. There is redundant work, but since each step is compiler-friendly, it still comes out ahead. This idea can likely be improved further with explicit SIMD, but that’s probably not necessary at this stage. However, it may be worth checking whether this version would also be a good solution to use within Numba itself.

import numpy as np
import numba as nb

@nb.njit
def find(a, val):
  # Return the index of the first element equal to val
  # (val is guaranteed to be present when called from argmax below).
  for i in range(a.size):
    if a[i] == val:
      return i

@nb.njit
def argmax(a):
  # Two passes: first find the maximum value (which vectorizes well),
  # then locate its first occurrence.
  val = np.max(a)
  idx = find(a, val)
  return idx

@nb.njit
def rowwise_argmax(mat):
  nrows = mat.shape[0]
  out = np.empty(nrows, dtype=np.int64)
  for i in range(nrows):
    out[i] = argmax(mat[i])
  return out

One more thing. I noticed that you have a very specific example. Is such a case actually representative or would a random arrangement of the elements be more representative? I ask because it may be possible to take advantage of the fact that you may actually encounter the maximum possible (pixel?) value relatively early.
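
To illustrate, here is a minimal sketch of that early-exit idea, assuming uint8 data where 255 is the largest possible value:

@nb.njit
def argmax_u8_early_exit(a):
  # Scan for the maximum, but stop as soon as the largest possible
  # uint8 value (255) is seen, since it cannot be beaten.
  best_idx = 0
  best_val = a[0]
  if best_val == 255:
    return 0
  for i in range(1, a.size):
    v = a[i]
    if v > best_val:
      best_val = v
      best_idx = i
      if best_val == 255:
        break
  return best_idx

Note that a data-dependent early exit like this typically prevents the compiler from vectorizing the loop, so it only pays off when 255 really does show up early.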


Thank you for investigating this issue. The image size given by H, W = 816, 256 is my real use case. The test data is artificial, but it resembles the real-world data, which has one peak per row (on average).

I will be looking at an explicit SIMD implementation, but I don’t know a suitable Python library. Cython, perhaps? It takes just eight 256-bit registers to hold a whole image row, and finding the argmax should take only a few operations per register, so it should be really fast.

If you find a good C/C++ implementation, it is relatively easy to use it in Numba. Two good options are ctypes or emitting LLVM IR with Clang. The first is easier to set up and works with any compiler, but the second has some other advantages, e.g. the function can be inlined and is cacheable.
If you find a good implementation that you want to use in Numba, I can tell you more about it. The Clang route in particular is a bit tricky because of version conflicts, since LLVM updates its IR so frequently, and you may need to load a C/C++ runtime library…
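
As a rough illustration of the ctypes route (the library name, the build command and the extern "C" function argmax_sum are placeholders, not an existing project):

import ctypes
import numpy as np

# Hypothetical shared library exposing
#   extern "C" int argmax_sum(const uint8_t* img);
# built e.g. with: clang++ -O3 -march=native -shared -fPIC argmax.cpp -o libargmax.so
lib = ctypes.CDLL('./libargmax.so')
lib.argmax_sum.restype = ctypes.c_int
lib.argmax_sum.argtypes = (ctypes.POINTER(ctypes.c_uint8),)

H, W = 816, 256
img = (np.arange(H * W) % 255).reshape(H, W).astype(np.uint8)

# Pass the raw contiguous uint8 buffer to the C function.
result = lib.argmax_sum(img.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)))

Numba can also call a ctypes function like this directly from nopython code, which is what makes this route convenient.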

I’m going to have a look at GitHub - ispc/ispc: Intel® Implicit SPMD Program Compiler. How hard will it be to use it with Python/Numba?

If you compile it to a .so or .dll file, then it should be easy to use in Python and Numba. But I’m not sure that switching to ISPC is enough. Writing an argmax for uint8 that uses SIMD efficiently is very difficult, because it is not a well-supported data type. It may also depend on which instruction sets you have available. I would probably look for an existing implementation.

I implemented this function using the C++ SIMD libraries VCL and eve. The source code is here: simd-benchmarks/main-5-vcl-eve.cpp at main · pauljurczak/simd-benchmarks · GitHub. My implementation takes advantage of the fixed image size, which matches my use case. The speedup is dramatic: over 15x. Your rowwise_argmax(img) executes in 62 us and f1(img) in 57 us, while the best of the C++ SIMD versions takes 3.6 us:

int loop_vc_nested(const array<uint8_t, H*W> &img, const array<Vec32uc, 8> &idx) {
  int sum = 0;
  Vec32uc vMax, iMax, vCurr, iCurr;

  // One iteration per image row (W = 256 bytes = 8 chunks of 32).
  for (int i=0; i<H*W; i+=W) {
    iMax.load(&idx[0]);
    vMax.load(&img[i]);

    // Track the per-lane maximum and the column index where it occurred.
    for (int j=1; j<8; j++) {
      iCurr.load(&idx[j]);
      vCurr.load(&img[i+j*32]);
      iMax = select(vCurr > vMax, iCurr, iMax);
      vMax = max(vMax, vCurr);
    }

    // Broadcast the row maximum, find the first lane holding it and
    // accumulate the column index stored for that lane.
    Vec32uc vMaxAll{horizontal_max(vMax)};
    sum += iMax[horizontal_find_first(vMax == vMaxAll)];
  }

  return sum;
}

Very nice work!

I quickly exposed your function to Numba and ran the comparison. I had to rewrite your function slightly for that (see below), and I used clang 14.0.0 to emit the LLVM IR:

clang -emit-llvm -march=native -std=c++20 -S ...

It is indeed much faster, but for a proper comparison you also have to give Numba the same additional information (the fixed shape and dtype). You would probably end up with something like this:

# NOTE: the compiler is very sensitive to even tiny changes here
import numba as nb 

NROWS = 816
NCOLS = 256
DTYPE = nb.uint8

@nb.njit
def _argmax(a):
    v = DTYPE(0)
    for i in range(NCOLS):
        if a[i] > v:
            v = a[i]
    for i in range(NCOLS):
        if a[i] == v:
            return i

@nb.njit
def argmax_sum(img):
    acc = 0
    for i in range(NROWS):
        acc += _argmax(img[i])
    return acc

Also, I think (correct me if I am wrong) your function behaves differently from Numpy: if the maximum value occurs more than once, there is no guarantee that you get the first occurrence. At least with random inputs the results do not agree with Numpy’s.
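
For reference, Numpy documents that argmax returns the first occurrence of the maximum:

import numpy as np

a = np.array([3, 7, 7, 1], dtype=np.uint8)
print(np.argmax(a))  # 1 -- the first occurrence of the maximum value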

Anyway, with all this I get the following on the deterministic input from your initial post:

  • Numba: 19.2 us
  • Numpy: 120 us
  • C++: 6.4 us

C++ source:
#include "vectorclass/vectorclass.h"
#include <array>
#include <eve/module/algo.hpp>
#include <eve/module/core.hpp>
#include <eve/wide.hpp>

using V32u1 = eve::wide<uint8_t, eve::fixed<32>>;

constexpr size_t H = 816;
constexpr size_t W = 256;


template <std::size_t... I>
constexpr std::array<uint8_t, sizeof...(I)> arange(std::index_sequence<I...>) {
    return { { static_cast<uint8_t>(I)... } };
}
constexpr auto idx = arange(std::make_index_sequence<W> {});


extern "C" int argmax_sum(const uint8_t* img) {
    int sum = 0;
    Vec32uc vMax, iMax, vCurr, iCurr;

    for (int i = 0; i < H * W; i += W) {
        iMax.load(&idx[0]);
        vMax.load(&img[i]);

        for (int j = 1; j < 8; j++) {
            iCurr.load(&idx[j * 32]);
            vCurr.load(&img[i + j * 32]);
            iMax = select(vCurr > vMax, iCurr, iMax);
            vMax = max(vMax, vCurr);
        }

        Vec32uc vMaxAll { horizontal_max(vMax) };
        sum += iMax[horizontal_find_first(vMax == vMaxAll)];
    }

    return sum;
}

Thank you for doing the work. In my use case it doesn’t matter which maximum position is counted. Real-world input data almost always has zero or one “interesting” maximum. This function is just a computationally expensive prelude to processing the detected peaks.