I noticed a strange performance loss (4x to 8x, depending on the CPU (x86-64) and the dtype used) with Python 3.11.6 and Numba 0.58.1 between f1() (faster) and f0() (slower) in this script:
```python
import numpy as np, numba as nb, timeit as ti

@nb.njit(fastmath=True)
def f0(img):
    return np.sum(np.argmax(img, axis=1))

def f1(img):
    return np.sum(np.argmax(img, axis=1))

H, W = 816, 256
img = (np.arange(H*W) % 255).reshape(H, W).astype('u1')

for fun in ('f0(img)', 'f1(img)'):
    t = 10**6 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=100))
    print(f'{fun}={eval(fun)} time: {np.amin(t):6.3f} {np.median(t):6.3f} {np.amax(t):6.3f} [us]')
```
Is there an explanation for or a solution to this performance problem? My use case is a bit more complex than this example, but it also relies on argmax.
Regarding your question as to whether there is an explanation: it may be worth looking at the Numba implementation of argmax (this is what makes your function slower than the non-jitted one): numba/numba/np/arraymath.py at main · numba/numba · GitHub
The Numpy implementation can be found here:
The Numpy implementation is a bit messy to read, but it seems that they don’t use an explicit SIMD implementation. However, playing with your example (especially the data type, but also the aspect ratio of the matrix), it appears that it actually compiles to a function that utilizes SIMD very efficiently. I suppose LLVM has a hard time optimizing the Numba implementation. Still, I have some hope that it is possible to write an argmax that LLVM can optimize better, maybe even one that can keep up with Numpy. Something that is already much faster for your particular case is this:
```python
@nb.njit
def rowwise_argmax(mat):
    n, m = mat.shape
    out = np.empty(n, dtype=np.int64)
    for i in range(n):
        arg = 0
        val = mat[i, 0]
        for j in range(1, m):
            if mat[i, j] > val:
                arg = j
                val = mat[i, j]
        out[i] = arg
    return out
```
I’ve had a quick look at the generated IR and it seems that both this and the Numba implementation use SIMD, but not in the way I would expect. It’s a very interesting case!
I did a little testing and found something that beats Numpy. It’s not very intuitive at first, but since it uses SIMD so much more efficiently (exactly how you would like it to!), the function below performs extremely well. It first finds the maximum value and then locates it within the array. There is redundant work, but since each step is compiler friendly, it still wins. This idea can likely be improved further by using SIMD explicitly, but that’s probably not necessary at this stage. However, it may be worth checking whether this version would also be a good solution to use within Numba itself.
```python
import numpy as np
import numba as nb

@nb.njit
def find(a, val):
    for i in range(a.size):
        if a[i] == val:
            return i

@nb.njit
def argmax(a):
    val = np.max(a)
    idx = find(a, val)
    return idx

@nb.njit
def rowwise_argmax(mat):
    nrows = mat.shape[0]
    out = np.empty(nrows, dtype=np.int64)
    for i in range(nrows):
        out[i] = argmax(mat[i])
    return out
```
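As a quick sanity check, here is a pure-NumPy sketch of the same max-then-locate idea (so it runs without compilation); on the deterministic test image from the original post it returns the same first-occurrence indices as np.argmax:

```python
import numpy as np

def argmax_maxfirst(a):
    # Max-then-locate: find the maximum, then its first position
    val = a.max()
    return int(np.nonzero(a == val)[0][0])

img = (np.arange(816 * 256) % 255).reshape(816, 256).astype(np.uint8)
ref = np.argmax(img, axis=1)
got = np.array([argmax_maxfirst(row) for row in img])
assert (got == ref).all()  # agrees on this input
```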
One more thing. I noticed that you have a very specific example. Is such a case actually representative or would a random arrangement of the elements be more representative? I ask because it may be possible to take advantage of the fact that you may actually encounter the maximum possible (pixel?) value relatively early.
Thank you for investigating this issue. The image size given by H, W = 816, 256 is my real use case. The test data is artificial, but resembles the real world data, which has one peak per row (on average).
I will be looking at an explicit SIMD implementation, but I don’t know a suitable Python library. Cython, perhaps? It takes just 8 256-bit registers to hold a whole image row, and finding the argmax should take only a few operations per register. It should be really fast.
If you find a good C/C++ implementation, it is relatively easy to use it in Numba. Two good options are ctypes or emitting LLVM IR with Clang. The first option is easier to implement and you can use any compiler, but the second option has some other advantages, e.g. it can be inlined and the function is cacheable.
If you find a good implementation that you want to use in Numba, I can tell you more about it. The Clang solution in particular is a bit tricky because of version conflicts, since LLVM updates its IR so frequently. And you may need to load a C/C++ runtime library…
If you compile it to a .so or .dll file, then it should be easy to use it in Python and Numba. But I’m not sure if switching to ISPC is enough. Writing an argmax for uint8 that uses SIMD efficiently is very difficult, because it is not a well-supported data type. It may also depend on which instruction sets you have available. I would probably look for an existing implementation.
I implemented this function using two C++ SIMD libraries: VCL and eve. Source code is here: simd-benchmarks/main-5-vcl-eve.cpp at main · pauljurczak/simd-benchmarks · GitHub. My implementation takes advantage of the fixed image size, which matches my use case. The speedup is dramatic: over 15x. Your version rowwise_argmax(img) executes in 62us, f1(img) in 57us, while the best of the C++ SIMD versions takes 3.6us:
```cpp
int loop_vc_nested(const array<uint8_t, H*W> &img, const array<Vec32uc, 8> &idx) {
    int sum = 0;
    Vec32uc vMax, iMax, vCurr, iCurr;

    for (int i = 0; i < H*W; i += W) {
        iMax.load(&idx[0]);
        vMax.load(&img[i]);

        for (int j = 1; j < 8; j++) {
            iCurr.load(&idx[j]);
            vCurr.load(&img[i + j*32]);
            iMax = select(vCurr > vMax, iCurr, iMax);
            vMax = max(vMax, vCurr);
        }

        Vec32uc vMaxAll{horizontal_max(vMax)};
        sum += iMax[horizontal_find_first(vMax == vMaxAll)];
    }

    return sum;
}
```
I quickly exposed your C++ function to Numba and ran a comparison. I had to rewrite it slightly for that (see below), and I used Clang 14.0.0 to emit the LLVM IR:
```
clang -emit-llvm -march=native -std=c++20 -S ...
```
It is indeed much faster, but for a fair comparison you also have to give Numba the same additional information (the fixed shape and dtype). You would probably end up with something like this:
```python
# NOTE: the compiler is very sensitive to even tiny changes here
import numba as nb

NROWS = 816
NCOLS = 256
DTYPE = nb.uint8

@nb.njit
def _argmax(a):
    v = DTYPE(0)
    for i in range(NCOLS):
        if a[i] > v:
            v = a[i]
    for i in range(NCOLS):
        if a[i] == v:
            return i

@nb.njit
def argmax_sum(img):
    acc = 0
    for i in range(NROWS):
        acc += _argmax(img[i])
    return acc
```
Also, I think (correct me if I am wrong) your function behaves differently from Numpy: if the maximum value occurs more than once, you have no guarantee of getting the first occurrence. At least with random inputs, the results do not agree with Numpy.
Anyways, with all this I get on the deterministic input from your initial post:

- Numba: 19.2 us
- Numpy: 120 us
- C++: 6.4 us
C++ source:

```cpp
#include "vectorclass/vectorclass.h"
#include <array>
#include <eve/module/algo.hpp>
#include <eve/module/core.hpp>
#include <eve/wide.hpp>

using V32u1 = eve::wide<uint8_t, eve::fixed<32>>;

constexpr size_t H = 816;
constexpr size_t W = 256;

template <std::size_t... I>
constexpr std::array<uint8_t, sizeof...(I)> arange(std::index_sequence<I...>) {
    return { { static_cast<uint8_t>(I)... } };
}

constexpr auto idx = arange(std::make_index_sequence<W> {});

extern "C" int argmax_sum(const uint8_t* img) {
    int sum = 0;
    Vec32uc vMax, iMax, vCurr, iCurr;

    for (int i = 0; i < H * W; i += W) {
        iMax.load(&idx[0]);
        vMax.load(&img[i]);

        for (int j = 1; j < 8; j++) {
            iCurr.load(&idx[j * 32]);
            vCurr.load(&img[i + j * 32]);
            iMax = select(vCurr > vMax, iCurr, iMax);
            vMax = max(vMax, vCurr);
        }

        Vec32uc vMaxAll { horizontal_max(vMax) };
        sum += iMax[horizontal_find_first(vMax == vMaxAll)];
    }

    return sum;
}
```
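For completeness, a hypothetical build line for turning that file into a shared library loadable via ctypes (the include paths to the header-only VCL and EVE checkouts and the file names are assumptions):

```shell
# Adjust the include paths to wherever vectorclass and eve are checked out;
# EVE requires C++20, and -march=native enables the vector instructions used here
g++ -O3 -march=native -std=c++20 -shared -fPIC \
    -I vectorclass -I eve/include \
    argmax_sum.cpp -o libargmax.so
```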
Thank you for doing the work. In my use case it doesn’t matter which maximum position is counted. Real-world input data almost always has at most one “interesting” maximum per row; this function is just a computationally expensive prelude to processing the detected peaks.