Numba very slow with default arguments depending on which arguments are provided

Crossposting my Stack Overflow question. I'm implementing a Numba-compiled function in which some arguments are required and others have default values. Depending on the values of the defaults, the order in which the different types appear, and which arguments are actually provided, I get very different timings:

import numba as nb

@nb.njit
def function(a, b, c, d=1.49012e-8, e=1.49012000000001e-8, f=0.0, g=None):
    ...
        
@nb.njit
def function2(a, b, c, d=1.49012e-8, e=1.49012000000001e-8, f=0.0):
    ...
        
@nb.njit
def function3(a, b, c, d=1.49012e-8, e=1.49012000000001e-8, f=None, g=0.0):
    ...

@nb.njit
def function4(a, b, c, d=1.49012e-8, e=1.49012e-8, f=0.0, g=None):
    ...

I then time each function with different numbers of positional arguments, and with keyword arguments leaving out one at a time:

d = 1.49012e-8
e = 1.49012000000001e-8
f = 0.0
g = 1000

args = (d, e, f, g)
kwargs = {'d': d, 'e': e, 'f': f, 'g': g}

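# %timeit is an IPython magic, so run this in IPython or Jupyter.
def time_func(func, args, kwargs):
    # Compile the defaults-only call up front.
    func(1, 2, 3)

    print(func.__name__)
    print("time *args")
    # Pass the first i optional arguments positionally (i = 0 .. len(args) - 1).
    for i, _ in enumerate(args):
        func(1, 2, 3, *args[:i])  # warm-up call so compilation is not timed
        %timeit -n 1000 func(1, 2, 3, *args[:i])
    print("time **kwargs")
    # Pass every optional argument by keyword except one, omitting each in turn.
    for omitted in kwargs:
        _kwargs = {k: v for k, v in kwargs.items() if k != omitted}
        func(1, 2, 3, **_kwargs)  # warm-up call so compilation is not timed
        %timeit -n 1000 func(1, 2, 3, **_kwargs)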

time_func(function, args, kwargs)
time_func(function2, args[:-1], {k: v for k, v in kwargs.items() if k != 'g'})
time_func(function3, args, kwargs)
time_func(function4, args, kwargs)
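
The timing lines in the output are not labelled, so it can help to list which optional arguments each call actually passes. The following is a small sketch that reuses the `args` and `kwargs` values from above and mirrors the two loops in `time_func`; the names `positional_calls` and `keyword_calls` are only illustrative.

```python
d = 1.49012e-8
e = 1.49012000000001e-8
f = 0.0
g = 1000

args = (d, e, f, g)
kwargs = {'d': d, 'e': e, 'f': f, 'g': g}
names = tuple(kwargs)

# "time *args" loop: call i passes the first i optional arguments positionally.
positional_calls = [names[:i] for i, _ in enumerate(args)]

# "time **kwargs" loop: each call passes every keyword except one.
keyword_calls = [tuple(k for k in kwargs if k != omitted) for omitted in kwargs]

for passed in positional_calls:
    print("positional:", passed or "(defaults only)")
for passed, omitted in zip(keyword_calls, kwargs):
    print("keyword:", passed, "omitted:", omitted)
```

Because the positional loop stops at `args[:len(args) - 1]`, the last optional argument is never passed positionally, so every line under "time *args" relies on at least one default.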

Output:

function
time *args
26.3 µs ± 425 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
25.4 µs ± 266 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
24 µs ± 175 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
241 ns ± 4.94 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
time **kwargs
235 ns ± 2.03 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
23.7 µs ± 62.6 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
23.3 µs ± 203 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
241 ns ± 5.25 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
function2
time *args
24.1 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
23.3 µs ± 172 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
22.1 µs ± 428 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
time **kwargs
210 ns ± 1.31 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
22.6 µs ± 97.4 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
21.9 µs ± 98.5 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
function3
time *args
26.3 µs ± 149 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
25.2 µs ± 81.4 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
24 µs ± 160 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
23.3 µs ± 416 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
time **kwargs
237 ns ± 4.64 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
25 µs ± 290 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
255 ns ± 12.5 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
24.2 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
function4
time *args
26.2 µs ± 238 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
25.1 µs ± 95.6 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
24.1 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
240 ns ± 5.87 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
time **kwargs
231 ns ± 11.9 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
233 ns ± 3.1 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
23.4 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
230 ns ± 3.43 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

I've tested with Numba 0.58.0 on Python 3.11.5 and Numba 0.57.1 on Python 3.10.12, and I get similar results with both.
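
For anyone who wants to check another version outside IPython, here is a rough sketch of the same measurement using the standard-library `timeit` module. The helper `per_call_ns` and the function body `a + b + c` are placeholders I made up (the real body is elided above), and the script falls back to plain Python if Numba is not installed, in which case the numbers only show that the harness runs.

```python
import timeit

try:
    import numba as nb
    jit = nb.njit
except ImportError:
    # Assumption: fall back to an undecorated function so the script still runs.
    def jit(func):
        return func


@jit
def function(a, b, c, d=1.49012e-8, e=1.49012000000001e-8, f=0.0, g=None):
    # Placeholder body; the original post elides the real one.
    return a + b + c


def per_call_ns(func, *args, number=1000, repeat=7, **kwargs):
    """Return the best per-call time in nanoseconds over `repeat` runs."""
    func(*args, **kwargs)  # warm-up call so compilation is not timed
    totals = timeit.repeat(
        lambda: func(*args, **kwargs), number=number, repeat=repeat
    )
    return min(totals) / number * 1e9


optional = (1.49012e-8, 1.49012000000001e-8, 0.0, 1000)
results = {}
for i in range(len(optional)):
    # Same call shapes as the "time *args" loop: first i optional arguments.
    results[i] = per_call_ns(function, 1, 2, 3, *optional[:i])
    print(f"{i} optional args: {results[i]:.0f} ns per call")
```

This reports the minimum over the repeats rather than the mean ± std. dev. that `%timeit` prints, and the `lambda` adds a little call overhead, so absolute values will not match the output above exactly. The gap between the microsecond and nanosecond cases should still be visible if it is present.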

Hey @Nin17, heads up: I have a PR that introduces a fix for this here.

Before

function
time *args
24.5 μs ± 2.34 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
23.6 μs ± 1.27 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
20.5 μs ± 1.87 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
425 ns ± 7.72 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
time **kwargs
223 ns ± 31.9 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
20.2 μs ± 1.56 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
22.1 μs ± 2.56 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
413 ns ± 47.4 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
function2
time *args
20.5 μs ± 1.81 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
22.7 μs ± 2.53 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
19.3 μs ± 1.97 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
time **kwargs
207 ns ± 35.1 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
19.8 μs ± 1.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
18.2 μs ± 1.31 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
function3
time *args
22.4 μs ± 2.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
21.6 μs ± 1.95 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
21.9 μs ± 3.02 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
19.3 μs ± 572 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
time **kwargs
375 ns ± 77.6 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
22 μs ± 1.99 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
232 ns ± 4.97 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
19.2 μs ± 791 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
function4
time *args
24 μs ± 2.79 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
21.5 μs ± 1.38 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
21.6 μs ± 2.05 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
334 ns ± 112 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
time **kwargs
356 ns ± 58.3 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
302 ns ± 84.3 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
21.1 μs ± 1.94 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
334 ns ± 88 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

After

function
time *args
300 ns ± 44.8 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
308 ns ± 35 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
370 ns ± 82.4 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
272 ns ± 69.4 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
time **kwargs
301 ns ± 93 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
254 ns ± 27.9 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
274 ns ± 77.4 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
288 ns ± 82.5 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
function2
time *args
320 ns ± 85.1 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
389 ns ± 83.2 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
383 ns ± 69.7 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
time **kwargs
213 ns ± 55 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
210 ns ± 27 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
189 ns ± 3.69 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
function3
time *args
360 ns ± 96.5 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
456 ns ± 170 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
484 ns ± 11.3 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
373 ns ± 72.9 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
time **kwargs
209 ns ± 12.2 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
305 ns ± 122 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
385 ns ± 53.1 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
234 ns ± 13.2 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
function4
time *args
359 ns ± 159 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
492 ns ± 30.7 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
349 ns ± 107 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
247 ns ± 16.3 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
time **kwargs
274 ns ± 74.1 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
319 ns ± 73.7 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
343 ns ± 87.4 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
223 ns ± 8.47 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
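
To put a number on the difference, the mean from a "Before" line can be divided by the mean from the matching "After" line. Below is a small sketch that parses two of the lines above (the first "time *args" line for `function` in each run); `mean_ns` is just an illustrative helper, and it accepts both micro-sign characters that appear in this thread.

```python
import re

# Conversion factors to nanoseconds. Both micro signs are accepted:
# U+00B5 (micro sign) and U+03BC (Greek small letter mu).
UNIT_TO_NS = {"ns": 1.0, "\u00b5s": 1e3, "\u03bcs": 1e3, "ms": 1e6, "s": 1e9}

# Try longer unit strings first so "ns" is not matched as a bare "s".
_units = "|".join(re.escape(u) for u in sorted(UNIT_TO_NS, key=len, reverse=True))
PATTERN = re.compile(r"^\s*([\d.]+)\s*(" + _units + r")\b")


def mean_ns(line):
    """Extract the leading mean from a %timeit output line, in nanoseconds."""
    match = PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a %timeit line: {line!r}")
    value, unit = match.groups()
    return float(value) * UNIT_TO_NS[unit]


before = mean_ns("24.5 μs ± 2.34 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)")
after = mean_ns("300 ns ± 44.8 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)")
speedup = before / after
print(f"before: {before:.0f} ns, after: {after:.0f} ns, ratio: {speedup:.1f}x")
```

For this pair the ratio is a little under 82x. The standard deviations in the "After" run are fairly large relative to the means, so the ratio is only a rough indication.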
