Timings for arr[:, i] seem much slower in numba

I was looking at timings for parts of my code and came across the following.

import numpy as np
import numba as nb
@nb.njit(cache=True)
def foo_nb(A):
    return A[:, 100]

I then run the following in ipython:

A = np.ones((1000, 1000))

%timeit foo_nb(A)
774 ns ± 7.99 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

I ran it once beforehand to make sure that it wasn’t the compilation I was timing.

If I do the same timings without numba, that is with

def foo(A):
    return A[:, 100]

I get

%timeit foo(A)
209 ns ± 4.02 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Is there a simple explanation for the difference in timings?

Calling a numba-compiled function has some overhead, which often amounts to a little more time per call than a plain numpy function. This is partially because numba does some type checking to support multiple dispatch, where the compiled code that runs depends on the argument types. While numba is great for speeding up complex numerical code, you are unlikely to see speedups for single atomic operations (in this case array slicing) because of the added per-call overhead. A good rule of thumb for getting the best speedup is to execute larger compute-intensive sections of code inside jitted functions, as in the sketch below, to reduce the number of times the program switches between the Python interpreter and jitted code.
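
A minimal sketch of that rule of thumb (sum_all_columns is a hypothetical example, not from your post): the loop over columns runs entirely inside compiled code, so the per-call overhead is paid once rather than once per column.

import numpy as np
import numba as nb

@nb.njit(cache=True)
def sum_all_columns(A):
    out = np.empty(A.shape[1])
    for i in range(A.shape[1]):
        # Slicing inside jitted code has no per-call dispatch cost.
        out[i] = A[:, i].sum()
    return out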

It can’t be just the overhead of calling the function as this is much faster:

@nb.njit(cache=True)
def foo2d_nb(A, i):
    return A[100, i]

%timeit foo2d_nb(A, 100)
241 ns ± 1.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Without numba this is faster still, but when we do return A[:, 100] there is some extra cost on top of the time numpy takes, beyond the function call overhead. Is there a numba overhead for returning a numpy array?

If numpy is faster in both cases then that is consistent with numba having more per-call overhead than pure numpy. The fact that this example is faster than the earlier slicing one is likely because retrieving a single float from an array only requires a simple read, while constructing a new slice (even though it is just a view and not a copy) requires allocating a new ndarray object. Returning a float does not, and that is true whether or not numba is involved.
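
One way to see this in the same ipython session, with no numba at all (timings will vary by machine, so run it yourself rather than trusting any particular numbers):

A = np.ones((1000, 1000))

%timeit A[100, 100]  # scalar read: no new ndarray is created
%timeit A[:, 100]    # view: cheap, but a new ndarray object must be allocated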

In any case I think you may be splitting hairs with these benchmarks. These speed differences may well be negligible in the context of a larger jitted program. If you really want to cut out numba's extra overhead there are ways of shortcutting past numba's type checking, but I wouldn't recommend them if you are only concerned with simple standalone atomic numpy operations. Numba won't be any faster than pure numpy at these; however, it can be significantly faster when the program executes many complex operations in loops.

I just gave a MWE. My actual code has this in the middle of a function. I noticed that the function was slower with numba than without, so I tested individual parts to try to see why. This is the only part I have found so far that might be the culprit.

Hey @lesshaste ,

Numba uses low-level operations to minimize Python overhead, but its approach to array handling can add some cost of its own.
While NumPy arrays are Python objects, Numba works on a view of the underlying raw data so it can compute at a lower level.
When you slice an array inside Numba, the result is also a view into that memory, but the base of the sliced array is not the original array itself; it is a pointer to the array's memory.

For example:

anb = foo_nb(A)
anp = foo(A)

print(type(A.base))   # <class 'NoneType'>              # A owns its data, so it has no base
print(type(anb.base)) # <class '_nrt_python._MemInfo'>  # Base of foo_nb result is a memory pointer
print(type(anp.base)) # <class 'numpy.ndarray'>         # Base of foo result is ndarray A

You typically need a wrapper function to pass the raw data of the NumPy ndarray to the low-level function and then transform the result back into a NumPy ndarray.
In Cython you would do that manually; I guess Numba does it under the hood. This wrapping and transformation can introduce some overhead, especially if the computation itself is relatively simple.
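
If the boxing of a freshly created array on return is indeed the cost, one possible workaround is to have the jitted function write into a preallocated output buffer and return nothing (col_into below is a hypothetical sketch; note it copies the column rather than returning a view, so whether it wins depends on the column length and is worth timing rather than assuming):

@nb.njit(cache=True)
def col_into(A, i, out):
    # Write into caller-owned memory; no new ndarray is created or boxed.
    out[:] = A[:, i]

out = np.empty(1000)
col_into(A, 100, out)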
The simpler the piece of code you profile, the more dominant that overhead becomes, and you can end up with misleading conclusions.
Profiling a function within a broader context would help to identify potential bottlenecks. Using a line profiler like “Profila”, developed by @itamarst for Numba, could be useful in this regard.
Have you tried that already?

If you suspect that the overhead of slice operations in Numba is indeed impacting performance, reaching out to the Numba developers could be a good idea.
