Why is this function call faster than the inlined version?

This script:

import numpy as np, numba as nb, timeit as ti

xmin, xmax, xn = -2.25, 0.75, 450
ymin, ymax, yn = -1.25, 1.25, 375
imax = 200


@nb.njit(fastmath=True, locals=dict(x=nb.complex64))
def abs2(x):
  return x.real**2 + x.imag**2


@nb.njit(fastmath=True, locals=dict(c=nb.complex64))
def kernel(c):
    z = c

    for i in range(imax):
        z = z * z + c
        if abs2(z) > 4:
        # if (z.real**2 + z.imag**2) > 4:
            return i
        
    return imax


@nb.njit(fastmath=True)
def mandelbrot():
    result = np.zeros((yn, xn), dtype=np.uint32)

    for j, y in zip(range(yn), np.arange(ymin, ymax, (ymax-ymin)/yn)):
        for i, x in zip(range(xn), np.arange(xmin, xmax, (xmax-xmin)/xn)):
            result[j, i] = kernel(np.csingle(x+y*1j))
            
    return result


if __name__ == "__main__":
  fun = 'mandelbrot()'
  t = 1000 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=100))
  print(f'{fun}:  {np.amin(t):6.3f}ms  {np.median(t):6.3f}ms')

runs in 17.3ms vs. 19.1ms for the inlined version (commented out). Why is calling abs2 faster?
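For context, the escape test compares the squared magnitude against 4 rather than |z| against 2, which skips the square root inside abs(). A quick pure-Python sanity check (not Numba-specific) that the two conditions agree:

```python
# |z|^2 > 4 is equivalent to |z| > 2, but avoids computing a square root.
def abs2(z):
    return z.real**2 + z.imag**2

for z in (3 + 0j, 1 + 1j, 2 + 2j, 0.5 - 0.5j):
    assert (abs2(z) > 4) == (abs(z) > 2)
```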

Hello @pauljurczak

Why manual inlining leads to a slowdown is indeed not obvious at first glance, and it is not fully clear to me either. Anyway, here is what I assume is happening.

Unless you force Numba to inline abs2 by setting inline="always", it will not be inlined at Numba's IR level. You can verify this by looking at the unoptimized intermediate representation, which Numba dumps when this environment variable is set (it must be set before Numba is imported):

import os
os.environ["NUMBA_DUMP_LLVM"] = "1"  # set before importing numba

However, after Numba's own passes come the LLVM optimization passes, and they inline abs2 regardless. You can check this by setting the environment variable:

os.environ["NUMBA_DUMP_OPTIMIZED"] = "1"

It looks to me like early inlining gets in the way of later optimizations. A loss of performance from manually inlining (or having Numba inline) very small functions is something I have also observed in the past.

I haven’t looked further into what exactly happens (what optimizations are missed or what passes would recover performance), but perhaps some of the Numba devs or experienced LLVM users can shine more light on this.


I didn’t observe any significant difference between the two (hopefully I faithfully reproduced your code). I did a run first to factor out compile time.

numba.__version__='0.56.4', sys.version_info=sys.version_info(major=3, minor=9, micro=16, releaselevel='final', serial=0)
mandelbrot(): 16.617ms 16.758ms
mandelbrot_inl(): 16.628ms 17.048ms

It may or may not be related, but I got slightly more intuitive results with Numba 0.53.1. Tracker here

numba.__version__='0.53.1', sys.version_info=sys.version_info(major=3, minor=9, micro=16, releaselevel='final', serial=0)
mandelbrot(): 16.467ms 16.821ms
mandelbrot_inl(): 16.396ms 16.560ms

import sys

import numba
import numpy as np, numba as nb, timeit as ti

xmin, xmax, xn = -2.25, 0.75, 450
ymin, ymax, yn = -1.25, 1.25, 375
imax = 200


@nb.njit(fastmath=True, locals=dict(x=nb.complex64))
def abs2(x):
    return x.real ** 2 + x.imag ** 2


@nb.njit(fastmath=True, locals=dict(c=nb.complex64))
def kernel(c):
    z = c

    for i in range(imax):
        z = z * z + c
        if abs2(z) > 4:
            # if (z.real**2 + z.imag**2) > 4:
            return i

    return imax

@nb.njit(fastmath=True, locals=dict(c=nb.complex64))
def kernel_inl(c):
    z = c

    for i in range(imax):
        z = z * z + c
        if (z.real**2 + z.imag**2) > 4:
            return i

    return imax


@nb.njit(fastmath=True)
def mandelbrot():
    result = np.zeros((yn, xn), dtype=np.uint32)

    for j, y in zip(range(yn), np.arange(ymin, ymax, (ymax - ymin) / yn)):
        for i, x in zip(range(xn), np.arange(xmin, xmax, (xmax - xmin) / xn)):
            result[j, i] = kernel(np.csingle(x + y * 1j))

    return result

@nb.njit(fastmath=True)
def mandelbrot_inl():
    result = np.zeros((yn, xn), dtype=np.uint32)

    for j, y in zip(range(yn), np.arange(ymin, ymax, (ymax - ymin) / yn)):
        for i, x in zip(range(xn), np.arange(xmin, xmax, (xmax - xmin) / xn)):
            result[j, i] = kernel_inl(np.csingle(x + y * 1j))

    return result


if __name__ == "__main__":
    funs = ['mandelbrot()', 'mandelbrot_inl()']
    # warm-up pass so JIT compilation is excluded from the timed runs below
    for fun in funs:
        ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=100)

    print(f'{numba.__version__=}, {sys.version_info=}')
    for fun in funs:
        t = 1000 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=100))
        print(f'{fun}:  {np.amin(t):6.3f}ms  {np.median(t):6.3f}ms')

I forgot to list my system info: Numba 0.57.0, Python 3.11.3, Ubuntu 22.04.2, AMD Ryzen 7 3800X. Perhaps there is a performance regression between Numba 0.56 and 0.57.