Thanks for profiling this @luk-f-a, I think this makes sense: there's some cost to getting a new Python process to the point where it can start doing the work requested, which means importing modules etc. to get to that state.
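If you want to see where that time goes on a given machine, a minimal sketch like the following splits a cold run into import, first-call (compile) and second-call (execute) phases; `add_one` is just a placeholder function:

```python
import time

t0 = time.perf_counter()
import numba  # also pulls in NumPy (and llvmlite/LLVM) as side effects
t1 = time.perf_counter()

@numba.njit
def add_one(x):  # placeholder; compilation is deferred until the first call
    return x + 1

t2 = time.perf_counter()
add_one(1)  # first call: type inference + LLVM compile (and NRT setup)
t3 = time.perf_counter()
add_one(1)  # second call: dispatch straight to the compiled binary
t4 = time.perf_counter()

print(f"import numba: {t1 - t0:.3f}s")
print(f"first call:   {t3 - t2:.3f}s")
print(f"second call:  {t4 - t3:.6f}s")
```

This only means much in a fresh process (a warm interpreter already has everything in `sys.modules`); `python -X importtime -c "import numba"` gives a per-module breakdown of the import side.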
Some specific details about Numba…
- The Numba runtime (NRT) would also be compiled at the point a Numba `@njit`-decorated function is encountered; this has to happen before Numba can execute anything. (It's being delayed until JIT compilation is requested in Numba PR #8438, which should speed up pure import time, but the NRT compilation still has to occur before execution.)
- To get Numba to a point where it can compile something, it has to import NumPy (and probably SciPy if it's in the environment), which takes a while. Then it has to import `llvmlite`, which triggers loading LLVM (which is large), and LLVM then has to be initialised before anything can be compiled (a rough per-module timing is sketched after this list).
- Functions that are cached on disk still have a cost. First, the argument types have to be checked to work out the type signature the function is being called with. Then the data on disk has to be loaded and checked to see if there's a suitable cached version (whether the signature, CPU and some other things match), and then the binary data, which is the compiled function, has to be wired in so that it can be executed. Once this is done, subsequent executions with the same type signature will be much quicker, as they are mostly just some type checking and dictionary lookups to get to the point of running the compiled function (see the dispatch and caching sketches below).
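To see the incremental cost of each layer in that import chain, one rough approach is to import the pieces in dependency order and time each step; later imports only pay for what earlier ones haven't already put into `sys.modules`:

```python
import time

# Dependency order: NumPy first, then llvmlite.binding (which loads the
# LLVM shared library), then numba itself. Each timing is incremental.
for mod in ("numpy", "llvmlite.binding", "numba"):
    t0 = time.perf_counter()
    __import__(mod)
    print(f"import {mod:<16} {time.perf_counter() - t0:.3f}s")
```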
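On the type-checking side, each distinct argument type triggers its own compilation, and the dispatcher keeps the results keyed by type signature, so repeat calls with known types are just a lookup. A small sketch with a throwaway `double` function:

```python
from numba import njit

@njit
def double(x):  # throwaway example
    return x * 2

double(1)    # compiles a specialisation for int64
double(1.0)  # new argument type, so a second compilation for float64
double(2)    # matches the first signature: lookup only, no compile

# One entry per compiled type signature.
print(double.signatures)  # e.g. [(int64,), (float64,)]
```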
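And for the on-disk cache, `cache=True` persists the compiled binary (by default under `__pycache__` next to the source file), so a fresh process pays the load-and-validate cost described above instead of a full recompile. A minimal sketch with a placeholder `total` function:

```python
import numpy as np
from numba import njit

@njit(cache=True)  # persist the compiled binary on disk after the first compile
def total(arr):    # placeholder reduction, just something worth caching
    s = 0.0
    for x in arr:
        s += x
    return s

arr = np.ones(1_000_000)
total(arr)  # fresh process: load the cache entry, check signature/CPU, wire it in
total(arr)  # same type signature: type check + dictionary lookup, then run
```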