Major slowdown when adding one more layer of function calls

Well, I have to admit the title doesn’t give much useful info, but below is a minimal example that reproduces the issue. Basically I have two functions (rbf, linear) that take slightly different arguments, so I wrote another layer (kernel) that chooses which one to call depending on the value of a. I wrote benchmark1 to see how fast it is to call rbf/linear directly, and benchmark2 to test the speed of going through the extra kernel layer. The result was quite unexpected: on my machine benchmark2 takes 100x longer than benchmark1. Even more curiously, if I change the body of linear to something much simpler, e.g. x1[0] + x2[0], benchmark2 runs much faster, even though linear is never actually called (since a=1.0).

import numpy as np
import numba as nb
import time

@nb.njit
def rbf(x1, x2, a) -> float:
    s = 0
    for i in range(x1.shape[0]):
        d = x1[i] - x2[i]
        s += d**2
    return np.exp(-a * s)

@nb.njit
def linear(x1, x2) -> float:
    s = 0
    for i in range(x1.shape[0]):
        s += (x1[i] * x2[i])
    return s
    # return x1[0] + x2[0]

@nb.njit
def benchmark1(x1, x2, a):
    for i in range(x1.shape[0]):
        if a > 1e-6:
            _ = rbf(x1[i], x2[i], a)
        else:
            _ = linear(x1[i], x2[i])

@nb.njit
def kernel(x1, x2, a):
    if a > 1e-6:
        return rbf(x1, x2, a)
    else:
        return linear(x1, x2)

@nb.njit
def benchmark2(x1, x2, a):
    for i in range(x1.shape[0]):
        _ = kernel(x1[i], x2[i], a)

# warming up
benchmark1(np.random.random((10, 2)), np.random.random((10, 2)), 1.0)
benchmark2(np.random.random((10, 2)), np.random.random((10, 2)), 1.0)

size = (10000, 100)

X1 = np.random.random(size)
X2 = np.random.random(size)
t0 = time.perf_counter()
benchmark1(X1, X2, 1.0)
t1 = time.perf_counter()
print(f'Benchmark1: {(t1-t0)*1e6:.3f} us.')
time.sleep(2)

X1 = np.random.random(size)
X2 = np.random.random(size)
t0 = time.perf_counter()
benchmark2(X1, X2, 1.0)
t1 = time.perf_counter()
print(f'Benchmark2: {(t1-t0)*1e6:.3f} us.')

Can someone help me to understand why this is happening?

One more note: the slowdown is not due to the extra function call through kernel. Even if I manually inline the bodies of linear and rbf into kernel, it still takes a similar time, so something else must be causing this behavior.

Likely benchmark1 is doing nothing: the results are discarded, so the compiler can optimize the calls away entirely. Try this:

@nb.njit
def benchmark1(x1, x2, a):
    total = 0
    for i in range(x1.shape[0]):
        if a > 1e-6:
            total += rbf(x1[i], x2[i], a)
        else:
            total += linear(x1[i], x2[i])
    return total

Thank you. Now the benchmark results make much more sense. I guess in my original code the compiler is smart enough to optimize away the unused code.

And two more notes for anyone who may be interested:

  1. The result is platform/version-dependent. The 100x-slower result was measured on my Windows laptop; when I tested again on my Mac M1, benchmark2 was also “optimized”, just not as aggressively as benchmark1.
  2. I need to add time.sleep(2) between the benchmark calls to get reliable results. I suppose it’s due to thermal throttling of the CPU (?).
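As an alternative to sleeping between runs, repeating each measurement and taking the minimum (the approach the stdlib timeit module recommends) tends to be more robust against throttling and other transient noise. A sketch; the helper name best_of is hypothetical:

```python
import time

def best_of(func, *args, repeats=5):
    # Run func several times and keep the best time: the minimum
    # is least affected by transient slowdowns such as thermal
    # throttling or background load.
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        func(*args)
        times.append(time.perf_counter() - t0)
    return min(times)
```

Usage would look like best_of(benchmark2, X1, X2, 1.0), reported in microseconds as in the original script.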