numba.pycc.CC appears to ignore the fastmath flag during AOT compilation

I have a function for solving a specific type of sparse matrix, and when I compile it with the flags fastmath=True, nogil=True, the fastmath flag seems to be ignored when the function is compiled ahead of time with numba.pycc.CC. Benchmarking with %timeit, the AOT version is consistently 10-25% slower than the JIT-compiled one. I tried setting cc.target_cpu = 'host', thinking the generic build was the problem, but saw no difference in performance. The function and the way I've set it up are:

import numpy as np
from numba import njit
from numba.pycc import CC


def solve(A, b, results, corner_upper, corner_lower):
    # A holds the three bands (A[0]: sub-, A[1]: main, A[2]: supra-diagonal);
    # corner_upper/corner_lower are the top-right and bottom-left corner entries.
    n = b.shape[0]
    m = b.shape[1]
    for i in range(m):
        results[0, i] = b[0, i]
        results[n - 1, i] = b[n - 1, i]
    supra_diag = A[2].copy()
    last_col = np.zeros(n - 1)
    last_col[0] = corner_upper
    last_col[n - 2] = supra_diag[n - 2]
    last_row_val = corner_lower
    current_main = A[1][0]
    next_main = A[1][1]
    last_main = A[1][n - 1]

    # Forward elimination; last_col and last_row_val track the fill-in
    # caused by the corner entries.
    for i in range(n - 2):
        a = A[0][i]
        for j in range(m):
            results[i, j] /= current_main
            supra_diag[i] = A[2][i + 1] / current_main
            results[i + 1, j] = b[i + 1, j] - a * results[i, j]
            results[n - 1, j] -= last_row_val * results[i, j]
        last_col[i] /= current_main
        last_col[i + 1] -= a * last_col[i]
        next_main -= a * supra_diag[i]
        last_main -= last_row_val * last_col[i]
        last_row_val = (A[0][i + 1] if i == n - 3 else 0.0) - last_row_val * supra_diag[i]

        current_main = next_main
        next_main = A[1][i + 2]

    i = n - 2
    last_col[i] /= current_main
    results[i] /= current_main
    results[n - 1] -= last_row_val * results[i]
    last_main -= last_row_val * last_col[i]
    results[n - 1] /= last_main
    # Back substitution: remove the last column's contribution,
    # then sweep back along the supra-diagonal.
    for i in range(n - 1):
        for j in range(m):
            results[i, j] -= last_col[i] * results[n - 1, j]
    for i in range(n - 2, 0, -1):
        for j in range(m):
            results[i - 1, j] -= supra_diag[i - 1] * results[i, j]


cc = CC('aot_module')

solve_njit = njit('void(f8[:, :], f8[:, :], f8[:, :], f8, f8)', fastmath=True, nogil=True)(solve)
solve_njit_noflags = njit('void(f8[:, :], f8[:, :], f8[:, :], f8, f8)')(solve)
solve_aot = cc.export('solve_aot', 'void(f8[:, :], f8[:, :], f8[:, :], f8, f8)')(solve_njit)
solve_aot_noflags = cc.export('solve_aot_noflags', 'void(f8[:, :], f8[:, :], f8[:, :], f8, f8)')(solve_njit_noflags)
cc.compile()

from aot_module import solve_aot as solve_aot_imported
from aot_module import solve_aot_noflags as solve_aot_noflags_imported

np.random.seed(0)
A = np.r_['0,2', np.full(500, -1.0), np.full(500, 3.0), np.full(500, -1.0)]
b = np.random.random((500, 500))
x = np.empty_like(b)

Benchmarking the four versions of the function with %timeit, I get the following results:

%timeit solve_aot_imported(A, b, x, -1.0, -1.0)
1.27 ms ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit solve_njit(A, b, x, -1.0, -1.0)
1.01 ms ± 23.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit solve_njit_noflags(A, b, x, -1.0, -1.0)
1.23 ms ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit solve_aot_noflags_imported(A, b, x, -1.0, -1.0)
1.27 ms ± 18.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Is there something in the way I have written the function that is causing this? This is the first time I have ever seen a significant difference in performance between JIT and AOT versions of my functions. Any explanation on this would be much appreciated.