Argument Parsing in Numba njit'd functions

I’ve been reading this blog post benchmarking the impact of the interpreter on Python code.

The author finds that interpreter overhead is significantly less impactful on runtimes (around 3×) than Python’s argument parsing.

He explains that argument parsing is expensive in Python due to the allocation of a tuple to hold the arguments, as well as the parsing of a format string on the callee side to determine how to unpack them.

Can anyone explain whether Numba differs in the way it parses arguments, or does it use the same mechanisms as CPython? I’m developing applications where I’m passing a lot of parameters.

Playing around with the following example:

import numba

@numba.njit
def foo():
    res = 0
    for i in range(100):
        for j in range(10000):
            res += 1
    return res

@numba.njit
def bar(a):
    res = 0
    for i in range(100):
        for j in range(10000):
            res += 1
    return res

For any integer value of a, these benchmark at roughly 50ns vs 114ns per call on my laptop (Intel i7 CPU), implying that argument parsing basically doubles the runtime for the ‘same computational work’ being done by the function.
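(For reference, a minimal sketch of how such per-call numbers can be measured; the warm-up calls keep compilation time out of the measurement, and exact figures will vary by machine.)

import timeit

foo()     # warm-up: triggers compilation, so it is excluded from the timing
bar(1)

n = 1_000_000
print(timeit.timeit(foo, number=n) / n)             # ~50 ns per call
print(timeit.timeit(lambda: bar(1), number=n) / n)  # ~114 ns per call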

Is this behaviour expected? How does one explain it?

Hi @skailasa,
When thinking about arguments in Numba, you need to distinguish between arguments that cross the Python-Numba boundary (interpreted code calling compiled code) and arguments that pass from one compiled function to another.
When you are crossing the boundary (as in your examples above), you are paying the Python parsing cost (because foo is a Python wrapper around a compiled function), plus the cost of unboxing your arguments and dispatching to the correct compiled function. For that to make sense, the time savings from the compiled code need to be greater than the cost you pay to cross the boundary.
When you don’t cross the boundary, that is, when one jitted function calls another jitted function, there is no unboxing or dispatching cost. The call is extremely fast, as in a compiled language.
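A quick illustrative sketch (the function names here are made up): inner is only ever called from compiled code, so its arguments never cross the boundary; only the single call to outer pays the parsing, unboxing and dispatch cost.

import numba

@numba.njit
def inner(a, b):
    # called only from compiled code: no boxing, no dispatch
    return a * b

@numba.njit
def outer(n):
    total = 0
    for i in range(n):
        total += inner(i, i)   # compiled-to-compiled call, essentially free
    return total

outer(1000)   # one boundary crossing, a thousand internal calls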

So, if your application passes a lot of parameters within jitted code, you’re most likely fine. If you are passing a lot of parameters into jitted code, you’ll pay a cost for each call. If your functions are big enough that compilation saves execution time, then you’ll come out net positive; otherwise, you might even be slower than normal Python.

Luk

Hi Luk,

Thank you for your detailed answer; this clears up my question. I do have another related question, though, which you may be able to answer.

Let’s consider a simple C++ function

#include <cstdint>

struct Vec2 {
    int64_t x;
    int64_t y;
};

int64_t normSquared(Vec2 v) {
    return v.x * v.x + v.y * v.y;
}

and a Numba equivalent

@numba.njit
def norm_squared(a):
    return a[0]*a[0] + a[1]*a[1]

I want to understand the difference in performance between these two functions. Obviously the Numba function will have overhead to:

(1) check the arguments provided by the interpreter and dispatch to the correct compiled function;

(2) interface with the Python interpreter for error handling and return values.

Examining the assembly code, this seems to be the case. Are there any other sources of overhead that I am missing in my analysis?

From this simple example it seems intuitive that performance will be similar for Numba and C++, as there are no other calls to jit’d functions. However, if there were, does that result in much more complex Numba IR? Or are the functions ‘inlined’ in some sense and passed on to LLVM to figure out how to optimize? Apologies for the imprecise language; I’m fairly new to this material.

are there any other sources of overhead that I am missing in my analysis?

I think we’d need a core developer to answer this accurately. I’d say that in general you captured both main overheads: entry and exit. There is a third kind, which is internal: for example, some objects are reference-counted within the function code (I guess to preserve Python compatibility), and that adds a performance cost. In the latest version, 0.53, there was some work to improve the ref-count code, which resulted in this overhead being smaller (I saw improvements in real-world code).
Besides the ref-count cost, there might be other costs not present in C++, but I’m not sure.

if there were [calls to jit’d functions], does it result in much more complex Numba IR?

I haven’t spent much time working with Numba IR. I think for your purposes, the Numba IR does not matter much. If you are interested in a comparison versus C++, you could check out the LLVM IR produced by Clang vs the LLVM IR produced by Numba. Or the assembly code.
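For example, a Numba dispatcher exposes the generated LLVM IR and machine code directly; a sketch, assuming norm_squared has been called at least once so a compiled specialization exists:

norm_squared((3, 4))   # force compilation of one specialization

# dicts mapping each compiled signature to its LLVM IR / assembly
for sig, llvm_ir in norm_squared.inspect_llvm().items():
    print(sig, llvm_ir)

for sig, asm in norm_squared.inspect_asm().items():
    print(sig, asm)

On the C++ side, something like clang++ -O3 -S -emit-llvm norm_squared.cpp (filename hypothetical) gives you Clang’s LLVM IR for comparison.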

Or are the functions ‘inlined’ in some sense, and passed onto LLVM to figure out how to optimize?

Functions are not inlined by Numba itself, except in special cases where the user explicitly requests it; it’s a niche use case, not worth spending much time on here. The traditional inlining you might be thinking of may or may not happen in LLVM: Numba sends all the functions as the user declared them, and LLVM has its own heuristics to decide whether inlining is worthwhile. Inlining is one of the LLVM optimizations, among many others.
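For reference, the explicit request looks like this; a minimal sketch with made-up names, where inline='always' splices the callee’s body into the caller at the Numba IR level, before LLVM sees the code:

import numba

@numba.njit(inline='always')   # inlined at the Numba IR level on request
def sq(x):
    return x * x

@numba.njit
def norm_squared2(x, y):
    return sq(x) + sq(y)   # sq's body is spliced in here before LLVM runs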

A final note: your norm_squared function is polymorphic. You could pass a list, a tuple, a numpy array, even a dictionary. The resulting code and performance characteristics will be very different in each case. You have to be careful about the type of the input you pass before you can make a meaningful comparison against C++ code.
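To see this concretely, each distinct argument type triggers a separate compilation, which you can inspect on the dispatcher; a sketch:

import numpy as np

norm_squared((3, 4))                 # UniTuple(int64, 2)
norm_squared((3.0, 4.0))             # UniTuple(float64, 2)
norm_squared(np.array([3.0, 4.0]))   # array(float64, 1d, C)

# one compiled specialization per distinct argument type
print(norm_squared.signatures)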

Luk