Major slowdown when adding one more layer of function calls

Well, I have to admit the title doesn’t give much useful info, but below is a minimal example that reproduces the issue. Basically I have two functions (rbf, linear) that take slightly different arguments, so I wrote another layer (kernel) that chooses which one to call depending on the value of a. I wrote benchmark1 to see how fast it is to call rbf/linear directly, and benchmark2 to test the speed of going through the extra kernel layer. The result was quite unexpected: on my machine benchmark2 takes 100x longer than benchmark1. Even more curiously, if I change the body of linear to something much simpler, e.g. x1[0] + x2[0], benchmark2 runs much faster, even though linear is never actually called (since a=1.0).

import numpy as np
import numba as nb
import time

@nb.njit
def rbf(x1, x2, a) -> float:
    s = 0
    for i in range(x1.shape[0]):
        d = x1[i] - x2[i]
        s += d**2
    return np.exp(-a * s)

@nb.njit
def linear(x1, x2) -> float:
    s = 0
    for i in range(x1.shape[0]):
        s += (x1[i] * x2[i])
    return s
    # return x1[0] + x2[0]

@nb.njit
def benchmark1(x1, x2, a):
    for i in range(x1.shape[0]):
        if a > 1e-6:
            _ = rbf(x1[i], x2[i], a)
        else:
            _ = linear(x1[i], x2[i])

@nb.njit
def kernel(x1, x2, a):
    if a > 1e-6:
        return rbf(x1, x2, a)
    else:
        return linear(x1, x2)

@nb.njit
def benchmark2(x1, x2, a):
    for i in range(x1.shape[0]):
        _ = kernel(x1[i], x2[i], a)

# warming up
benchmark1(np.random.random((10, 2)), np.random.random((10, 2)), 1.0)
benchmark2(np.random.random((10, 2)), np.random.random((10, 2)), 1.0)

size = (10000, 100)

X1 = np.random.random(size)
X2 = np.random.random(size)
t0 = time.perf_counter()
benchmark1(X1, X2, 1.0)
t1 = time.perf_counter()
print(f'Benchmark1: {(t1-t0)*1e6:.3f} us.')
time.sleep(2)

X1 = np.random.random(size)
X2 = np.random.random(size)
t0 = time.perf_counter()
benchmark2(X1, X2, 1.0)
t1 = time.perf_counter()
print(f'Benchmark2: {(t1-t0)*1e6:.3f} us.')

Can someone help me to understand why this is happening?

One more note: the slowdown is not due to the extra function call through kernel. Even if I manually inline the bodies of linear and rbf into kernel, it still takes a similar time, so something else must be causing this behavior.

Likely benchmark1 is doing nothing: the results are discarded, so the compiler can optimize the calls away entirely. Try this:

@nb.njit
def benchmark1(x1, x2, a):
    total = 0
    for i in range(x1.shape[0]):
        if a > 1e-6:
            total += rbf(x1[i], x2[i], a)
        else:
            total += linear(x1[i], x2[i])
    return total

Thank you. Now the benchmark results make much more sense. I guess in my original code the compiler is smart enough to optimize away the unused code.

And two more notes for anyone who may be interested:

  1. The result is platform/version-dependent. The 100x-slower result was measured on my Windows laptop; when I tested again on my Mac M1, benchmark2 was also “optimized”, just not as aggressively as benchmark1.
  2. I need to add time.sleep(2) between the benchmark calls to get reliable results. I suppose it’s due to thermal throttling of the CPU (?).
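As an alternative to sleeping between runs, repeating each measurement and taking the minimum (the approach the stdlib timeit module recommends) tends to be more robust against throttling and other transient noise. A sketch; the helper name best_of is hypothetical:

```python
import time

def best_of(func, *args, repeats=5):
    # Run func several times and keep the best time: the minimum
    # is least affected by transient slowdowns such as thermal
    # throttling or background load.
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        func(*args)
        times.append(time.perf_counter() - t0)
    return min(times)
```

Usage would look like best_of(benchmark2, X1, X2, 1.0), reported in microseconds as in the original script.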