numba.pycc.CC appears to ignore the fastmath flag during AOT compilation

I have a function for solving a specific type of sparse matrix, and when I compile it with the flags fastmath=True, nogil=True, the fastmath flag seems to be ignored when the function is compiled ahead of time with numba.pycc.CC. Benchmarking with %timeit, the AOT version is consistently 10-25% slower than the JIT-compiled one. I tried setting cc.target_cpu = 'host', thinking the generic build was the problem, but saw no difference in performance. The function and the way I've set it up are:

import numpy as np
from numba import njit
from numba.pycc import CC


def solve(A, b, results, corner_upper, corner_lower):
    # A holds the three bands (A[0]: sub-, A[1]: main, A[2]: supra-diagonal);
    # corner_upper/corner_lower are the top-right and bottom-left corner entries.
    n = b.shape[0]
    m = b.shape[1]
    for i in range(m):
        results[0, i] = b[0, i]
        results[n - 1, i] = b[n - 1, i]
    supra_diag = A[2].copy()
    last_col = np.zeros(n - 1)
    last_col[0] = corner_upper
    last_col[n - 2] = supra_diag[n - 2]
    last_row_val = corner_lower
    current_main = A[1][0]
    next_main = A[1][1]
    last_main = A[1][n - 1]

    # Forward elimination; last_col and last_row_val track the fill-in
    # caused by the corner entries.
    for i in range(n - 2):
        a = A[0][i]
        for j in range(m):
            results[i, j] /= current_main
            supra_diag[i] = A[2][i + 1] / current_main
            results[i + 1, j] = b[i + 1, j] - a * results[i, j]
            results[n - 1, j] -= last_row_val * results[i, j]
        last_col[i] /= current_main
        last_col[i + 1] -= a * last_col[i]
        next_main -= a * supra_diag[i]
        last_main -= last_row_val * last_col[i]
        last_row_val = (A[0][i + 1] if i == n - 3 else 0.0) - last_row_val * supra_diag[i]

        current_main = next_main
        next_main = A[1][i + 2]

    i = n - 2
    last_col[i] /= current_main
    results[i] /= current_main
    results[n - 1] -= last_row_val * results[i]
    last_main -= last_row_val * last_col[i]
    results[n - 1] /= last_main
    # Back substitution: remove the last column's contribution,
    # then sweep back along the supra-diagonal.
    for i in range(n - 1):
        for j in range(m):
            results[i, j] -= last_col[i] * results[n - 1, j]
    for i in range(n - 2, 0, -1):
        for j in range(m):
            results[i - 1, j] -= supra_diag[i - 1] * results[i, j]


cc = CC('aot_module')

solve_njit = njit('void(f8[:, :], f8[:, :], f8[:, :], f8, f8)', fastmath=True, nogil=True)(solve)
solve_njit_noflags = njit('void(f8[:, :], f8[:, :], f8[:, :], f8, f8)')(solve)
solve_aot = cc.export('solve_aot', 'void(f8[:, :], f8[:, :], f8[:, :], f8, f8)')(solve_njit)
solve_aot_noflags = cc.export('solve_aot_noflags', 'void(f8[:, :], f8[:, :], f8[:, :], f8, f8)')(solve_njit_noflags)
cc.compile()

from aot_module import solve_aot as solve_aot_imported
from aot_module import solve_aot_noflags as solve_aot_noflags_imported

np.random.seed(0)
A = np.r_['0,2', np.full(500, -1.0), np.full(500, 3.0), np.full(500, -1.0)]
b = np.random.random((500, 500))
x = np.empty_like(b)

Benchmarking the four versions of the function with %timeit, I get the following results:

%timeit solve_aot_imported(A, b, x, -1.0, -1.0)
1.27 ms ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit solve_njit(A, b, x, -1.0, -1.0)
1.01 ms ± 23.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit solve_njit_noflags(A, b, x, -1.0, -1.0)
1.23 ms ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit solve_aot_noflags_imported(A, b, x, -1.0, -1.0)
1.27 ms ± 18.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Is there something in the way I have written the function that is causing this? This is the first time I have ever seen a significant difference in performance between JIT and AOT versions of my functions. Any explanation on this would be much appreciated.