I have to admit the title doesn't convey much useful information, but below is a minimal example that reproduces the issue. I have two functions (rbf and linear) that take slightly different arguments, so I wrote a wrapper (kernel) that chooses which one to call depending on the value of a. benchmark1 measures how fast it is to call rbf/linear directly, and benchmark2 measures the cost of going through the extra kernel layer. The result was quite unexpected: on my machine, benchmark2 takes about 100x longer than benchmark1. Even more curiously, if I replace the body of linear with something much simpler, for example return x1[0] + x2[0], benchmark2 runs much faster, even though linear is never actually called (since a = 1.0).
import numpy as np
import numba as nb
import time


@nb.njit
def rbf(x1, x2, a) -> float:
    s = 0
    for i in range(x1.shape[0]):
        d = x1[i] - x2[i]
        s += d**2
    return np.exp(-a * s)


@nb.njit
def linear(x1, x2) -> float:
    s = 0
    for i in range(x1.shape[0]):
        s += x1[i] * x2[i]
    return s
    # return x1[0] + x2[0]


@nb.njit
def benchmark1(x1, x2, a):
    for i in range(x1.shape[0]):
        if a > 1e-6:
            _ = rbf(x1[i], x2[i], a)
        else:
            _ = linear(x1[i], x2[i])


@nb.njit
def kernel(x1, x2, a):
    if a > 1e-6:
        return rbf(x1, x2, a)
    else:
        return linear(x1, x2)


@nb.njit
def benchmark2(x1, x2, a):
    for i in range(x1.shape[0]):
        _ = kernel(x1[i], x2[i], a)


# warming up (trigger JIT compilation before timing)
benchmark1(np.random.random((10, 2)), np.random.random((10, 2)), 1.0)
benchmark2(np.random.random((10, 2)), np.random.random((10, 2)), 1.0)

size = (10000, 100)
X1 = np.random.random(size)
X2 = np.random.random(size)

t0 = time.perf_counter()
benchmark1(X1, X2, 1.0)
t1 = time.perf_counter()
print(f'Benchmark1: {(t1-t0)*1e6:.3f} us.')

time.sleep(2)

X1 = np.random.random(size)
X2 = np.random.random(size)

t0 = time.perf_counter()
benchmark2(X1, X2, 1.0)
t1 = time.perf_counter()
print(f'Benchmark2: {(t1-t0)*1e6:.3f} us.')
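For context, both kernels are equivalent to simple vectorized NumPy expressions. The sketch below copies the two loop bodies without the @nb.njit decorator (pure Python, to check what they compute, not their speed) and compares them against np.sum/np.dot:

```python
import numpy as np


def rbf_loop(x1, x2, a):
    # same loop as rbf in the question, minus the decorator
    s = 0.0
    for i in range(x1.shape[0]):
        d = x1[i] - x2[i]
        s += d**2
    return np.exp(-a * s)


def linear_loop(x1, x2):
    # same loop as linear in the question, minus the decorator
    s = 0.0
    for i in range(x1.shape[0]):
        s += x1[i] * x2[i]
    return s


x1 = np.random.random(100)
x2 = np.random.random(100)

# rbf is a Gaussian of the squared distance; linear is a dot product
rbf_vec = np.exp(-1.0 * np.sum((x1 - x2) ** 2))
linear_vec = np.dot(x1, x2)

print(np.isclose(rbf_loop(x1, x2, 1.0), rbf_vec))
print(np.isclose(linear_loop(x1, x2), linear_vec))
```

So the branch in kernel only selects between two cheap, well-defined reductions; the question is purely about why routing them through the extra layer is so much slower.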
Can someone help me understand why this is happening?