Why is this function call faster than the inlined version?

This script:

import numpy as np, numba as nb, timeit as ti

xmin, xmax, xn = -2.25, 0.75, 450
ymin, ymax, yn = -1.25, 1.25, 375
imax = 200


@nb.njit(fastmath=True, locals=dict(x=nb.complex64))
def abs2(x):
  return x.real**2 + x.imag**2


@nb.njit(fastmath=True, locals=dict(c=nb.complex64))
def kernel(c):
    z = c

    for i in range(imax):
        z = z * z + c
        if abs2(z) > 4:
        # if (z.real**2 + z.imag**2) > 4:
            return i
        
    return imax


@nb.njit(fastmath=True)
def mandelbrot():
    result = np.zeros((yn, xn), dtype=np.uint32)

    for j, y in zip(range(yn), np.arange(ymin, ymax, (ymax-ymin)/yn)):
        for i, x in zip(range(xn), np.arange(xmin, xmax, (xmax-xmin)/xn)):
            result[j, i] = kernel(np.csingle(x+y*1j))
            
    return result


if __name__ == "__main__":
  fun = 'mandelbrot()'
  t = 1000 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=100))
  print(f'{fun}:  {np.amin(t):6.3f}ms  {np.median(t):6.3f}ms')

runs in 17.3ms vs. 19.1ms for the inlined version (commented out). Why is calling abs2 faster?
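For context, the escape test compares the squared magnitude against 4 rather than |z| against 2, which skips the square root inside abs(). A quick pure-Python sanity check (not Numba-specific) that the two conditions agree:

```python
# |z|^2 > 4 is equivalent to |z| > 2, but avoids computing a square root.
def abs2(z):
    return z.real**2 + z.imag**2

for z in (3 + 0j, 1 + 1j, 2 + 2j, 0.5 - 0.5j):
    assert (abs2(z) > 4) == (abs(z) > 2)
```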

Hello @pauljurczak

Why manual inlining leads to a slowdown is indeed not obvious at first glance, and it is not fully clear to me either. Anyway, here is what I assume is happening.

Unless you force Numba to inline abs2 by setting inline="always", it will not be inlined at Numba's IR level. You can verify this by looking at the unoptimized intermediate representation, which Numba dumps when this environment variable is set (it must be set before Numba is imported):

import os
os.environ["NUMBA_DUMP_LLVM"] = "1"  # set before importing numba

However, after Numba's own passes come the LLVM optimization passes, and they inline abs2 regardless. You can check this by setting the environment variable:

os.environ["NUMBA_DUMP_OPTIMIZED"] = "1"

It looks to me like early inlining gets in the way of later optimizations. A loss of performance from manually inlining (or having Numba inline) very small functions is something I have also observed in the past.

I haven’t looked further into what exactly happens (what optimizations are missed or what passes would recover performance), but perhaps some of the Numba devs or experienced LLVM users can shine more light on this.


I didn’t observe any significant difference between the two (hopefully I faithfully reproduced your code). I did a run first to factor out compile time.

numba.__version__='0.56.4', sys.version_info=sys.version_info(major=3, minor=9, micro=16, releaselevel='final', serial=0)
mandelbrot(): 16.617ms 16.758ms
mandelbrot_inl(): 16.628ms 17.048ms

It may or may not be related, but I got slightly more intuitive results with Numba 0.53.1. Tracker here

numba.__version__='0.53.1', sys.version_info=sys.version_info(major=3, minor=9, micro=16, releaselevel='final', serial=0)
mandelbrot(): 16.467ms 16.821ms
mandelbrot_inl(): 16.396ms 16.560ms

import sys

import numba
import numpy as np, numba as nb, timeit as ti

xmin, xmax, xn = -2.25, 0.75, 450
ymin, ymax, yn = -1.25, 1.25, 375
imax = 200


@nb.njit(fastmath=True, locals=dict(x=nb.complex64))
def abs2(x):
    return x.real ** 2 + x.imag ** 2


@nb.njit(fastmath=True, locals=dict(c=nb.complex64))
def kernel(c):
    z = c

    for i in range(imax):
        z = z * z + c
        if abs2(z) > 4:
            # if (z.real**2 + z.imag**2) > 4:
            return i

    return imax

@nb.njit(fastmath=True, locals=dict(c=nb.complex64))
def kernel_inl(c):
    z = c

    for i in range(imax):
        z = z * z + c
        if (z.real**2 + z.imag**2) > 4:
            return i

    return imax


@nb.njit(fastmath=True)
def mandelbrot():
    result = np.zeros((yn, xn), dtype=np.uint32)

    for j, y in zip(range(yn), np.arange(ymin, ymax, (ymax - ymin) / yn)):
        for i, x in zip(range(xn), np.arange(xmin, xmax, (xmax - xmin) / xn)):
            result[j, i] = kernel(np.csingle(x + y * 1j))

    return result

@nb.njit(fastmath=True)
def mandelbrot_inl():
    result = np.zeros((yn, xn), dtype=np.uint32)

    for j, y in zip(range(yn), np.arange(ymin, ymax, (ymax - ymin) / yn)):
        for i, x in zip(range(xn), np.arange(xmin, xmax, (xmax - xmin) / xn)):
            result[j, i] = kernel_inl(np.csingle(x + y * 1j))

    return result


if __name__ == "__main__":
    funs = ['mandelbrot()', 'mandelbrot_inl()']
    # warm-up pass so JIT compilation is excluded from the timed runs below
    for fun in funs:
        ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=100)

    print(f'{numba.__version__=}, {sys.version_info=}')
    for fun in funs:
        t = 1000 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=100))
        print(f'{fun}:  {np.amin(t):6.3f}ms  {np.median(t):6.3f}ms')

I forgot to list my system info: Numba 0.57.0, Python 3.11.3, Ubuntu 22.04.2, AMD Ryzen 7 3800X. Perhaps there is a performance regression between Numba 0.56 and 0.57.