I have a function that solves a specific type of sparse linear system, and when I compile it with the flags `fastmath=True, nogil=True`, the `fastmath` flag seems to be ignored when the function is compiled with `numba.pycc.CC`. Measuring with `%timeit`, the AOT version is consistently 10-25% slower than the JIT-compiled one. I tried setting `cc.target_cpu = 'host'`, thinking the generic target was the cause, but that made zero difference in performance. The function and the way I've set it up are:
```python
import numpy as np
from numba import njit
from numba.pycc import CC


def solve(A, b, results, corner_upper, corner_lower):
    n = b.shape[0]
    m = b.shape[1]
    for i in range(m):
        results[0, i] = b[0, i]
        results[n - 1, i] = b[n - 1, i]
    supra_diag = A[2].copy()
    last_col = np.zeros(n - 1)
    last_col[0] = corner_upper
    last_col[n - 2] = supra_diag[n - 2]
    last_row_val = corner_lower
    current_main = A[1][0]
    next_main = A[1][1]
    last_main = A[1][n - 1]
    # Forward elimination, folding the corner terms into the last row/column.
    for i in range(n - 2):
        a = A[0][i]
        for j in range(m):
            results[i, j] /= current_main
            supra_diag[i] = A[2][i + 1] / current_main
            results[i + 1, j] = b[i + 1, j] - a * results[i, j]
            results[n - 1, j] -= last_row_val * results[i, j]
        last_col[i] /= current_main
        last_col[i + 1] -= a * last_col[i]
        next_main -= a * supra_diag[i]
        last_main -= last_row_val * last_col[i]
        last_row_val = (A[0][i + 1] if i == n - 3 else 0.0) - last_row_val * supra_diag[i]
        current_main = next_main
        next_main = A[1][i + 2]
    i = n - 2
    last_col[i] /= current_main
    results[i] /= current_main
    results[n - 1] -= last_row_val * results[i]
    last_main -= last_row_val * last_col[i]
    results[n - 1] /= last_main
    # Back substitution.
    for i in range(n - 1):
        for j in range(m):
            results[i, j] -= last_col[i] * results[n - 1, j]
    for i in range(n - 2, 0, -1):
        for j in range(m):
            results[i - 1, j] -= supra_diag[i - 1] * results[i, j]


cc = CC('aot_module')
solve_njit = njit('void(f8[:, :], f8[:, :], f8[:, :], f8, f8)', fastmath=True, nogil=True)(solve)
solve_njit_noflags = njit('void(f8[:, :], f8[:, :], f8[:, :], f8, f8)')(solve)
solve_aot = cc.export('solve_aot', 'void(f8[:, :], f8[:, :], f8[:, :], f8, f8)')(solve_njit)
solve_aot_noflags = cc.export('solve_aot_noflags', 'void(f8[:, :], f8[:, :], f8[:, :], f8, f8)')(solve_njit_noflags)
cc.compile()

from aot_module import solve_aot as solve_aot_imported
from aot_module import solve_aot_noflags as solve_aot_noflags_imported

np.random.seed(0)
A = np.r_['0,2', np.full(500, -1.0), np.full(500, 3.0), np.full(500, -1.0)]
b = np.random.random((500, 500))
x = np.empty_like(b)
```
Benchmarking the four versions of the function with `%timeit`, I get the following results:
```
%timeit solve_aot_imported(A, b, x, -1.0, -1.0)
1.27 ms ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit solve_njit(A, b, x, -1.0, -1.0)
1.01 ms ± 23.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit solve_njit_noflags(A, b, x, -1.0, -1.0)
1.23 ms ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit solve_aot_noflags_imported(A, b, x, -1.0, -1.0)
1.27 ms ± 18.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
Is there something in the way I have written the function that is causing this? This is the first time I have seen a significant performance difference between the JIT and AOT versions of one of my functions. Any explanation would be much appreciated.