I’m encountering an unexpected performance drop when calling an @njit-compiled function that uses np.cov. While I expect the first call to be slow due to JIT compilation, I noticed that changing the input shape for the first time also incurs a significant execution-time increase. This is surprising because the function signature remains the same across calls.
Code to Reproduce the Issue
from contextlib import contextmanager
from time import perf_counter
from typing import Callable, Generator
from numba import njit
import numpy as np
@contextmanager
def timer(f: Callable[[float], object] = lambda _: None) -> Generator[Callable[[], float], None, None]:
    _ = perf_counter()

    def t() -> float:
        return perf_counter() - _

    yield t
    f(t())
    del _


@njit
def test(x, y):
    return np.cov(x, y)
with timer(print):
test(np.array([0.]), np.array([0.])) # First call (JIT compilation expected)
# Example output: 3.549982499331236
with timer(print):
test(np.array([0.]), np.array([0.])) # Second call (should be fast)
# Example output: 1.730024814605713e-05
with timer(print):
test(np.array([0., 0.]), np.array([0., 0.])) # First shape change
# Example output: 0.02204340137541294 <-- Unexpectedly high execution time
with timer(print):
test(np.array([0., 0.]), np.array([0., 0.])) # Second call with same new shape (should be fast)
# Example output: 2.8500333428382874e-05
print(test.nopython_signatures)
# Output: [(Array(float64, 1, 'C', False, aligned=True), Array(float64, 1, 'C', False, aligned=True)) -> array(float64, 2d, C)]
Observed Behavior
First execution (compilation overhead expected) → Slow.
Second execution (same input shape) → Fast, as expected.
Third execution (first shape change) → Unexpectedly slow, even though the function signature remains the same.
Fourth execution (same new shape) → Fast again, as expected.
Questions
Why does the first execution with a new input shape result in a significant performance decrease, even though the function signature does not change?
How can I investigate what is happening internally?
Any insights or debugging strategies would be greatly appreciated!
Numba’s overload of np.cov selects between different implementations based on the shape of the input arrays. Even though both cases ultimately return a 2D array, the internal code path is different. Numba must compile a separate version (specialization) for each case, which is why you see additional compilation overhead when the input shape changes.
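As a loose pure-Python analogy (this is not Numba's actual machinery, just an illustration of the idea), you can picture a dispatcher that lazily builds one implementation per code path, where the path is chosen from the input's shape rather than its type. The first input that selects a new path pays a one-time build cost even though the "type signature" never changes:

```python
# Hypothetical sketch: a cache keyed by the code path an input selects,
# not by its type. Numba's np.cov overload similarly picks a path based
# on the input's shape, so a new shape can trigger a one-time build.
builds = []   # records each one-time "specialization" build
_impls = {}   # path -> built implementation

def _path_for(shape):
    # Hypothetical: a length-1 input takes a different branch than longer ones.
    return "single" if shape == (1,) else "general"

def get_impl(shape):
    key = _path_for(shape)
    if key not in _impls:
        builds.append(key)  # stand-in for the one-time specialization cost
        _impls[key] = key
    return _impls[key]
```

Repeated calls with a shape that maps to an already-built path are cheap; only the first call that hits a new path pays the build cost, which mirrors the slow third call in the question.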
Edit:
Running your code, the first call (which has the compilation overhead) takes much longer than the subsequent calls.
Once compiled, each specialization seems to be cached and reused, so separate compilations aren’t causing the delays.
Can you set a cache directory, cache the function, and check which files are created?
If test(x, y) is compiled only once, there should be a single .nbc file.
If test(x, y) is recompiled for different input shapes, multiple .nbc files should appear.
import os
os.environ["NUMBA_CACHE_DIR"] = "/SomePath/numba_cache"
from numba import njit
import numpy as np
@njit(cache=True)
def test(x, y):
    return np.cov(x, y)
test(np.array([0.]), np.array([0.]))
test(np.array([0., 0.]), np.array([0., 0.]))
".../test.test-7.py312.1.nbc",
".../test.test-7.py312.nbi"
If you run the code line by line in Numba debug mode, you should be able to see when compilations occur.
For me, there is only one compilation, even when the input array sizes change.
The timings of the test function suggest that np.cov might be compiled twice depending on the shape of the input arrays. I couldn’t reproduce this behavior with the same versions of Python and Numba.
To investigate further, you can set os.environ["NUMBA_DEBUG"] = "1" and check whether a compilation is triggered when executing each line.
What you should see when running test(np.array([0.]), np.array([0.])) is a series of long compilation logs (the content doesn’t really matter, just the fact that some compilation appears). After that, when you run test(np.array([0., 0.]), np.array([0., 0.])), the result should return quickly without the lengthy compilation logs.
If you still see a long compilation log for the second call, it indicates that recompilation is occurring for some reason.
Great, this is exactly how it should behave.
Could you try the same with the original code, using os.environ["NUMBA_DEBUG"] = "1"? What you should see is similar behavior: there should only be one compilation log. The first function call will take longer, while subsequent calls should be much faster.
Here’s an example of what the output should look like:
...some long compilation logs...
.long 178
.zero 4
.quad .const.pickledata.125757979843904.sha1
.quad 0
.long 0
.zero 4
.size .const.picklebuf.125757979843904, 40
.section ".note.GNU-stack","",@progbits
===========================================================
7.919435390998842 # (<= longer execution time for the first call)
2.1211017156019807e-05
1.015502493828535e-05
7.238006219267845e-06
[(Array(float64, 1, 'C', False, aligned=True), Array(float64, 1, 'C', False, aligned=True)) -> array(float64, 2d, C)]
If you ran the code all at once, it’s possible that two compilation logs appeared before the actual output. Have you tried executing each function call line by line to ensure that step 3 doesn’t trigger a recompile?
The longer third call suggests that Numba might be compiling a specialized version of np.cov for the new input shape. Normally you’d expect to see compilation logs for that, but if the visible signature doesn’t change, those logs might be suppressed, though I’m not sure that actually happens.
I couldn’t reproduce this delay under the same Python/Numba environment, so it might be due to environmental factors or even a bug. It’s hard to say…