I am profiling my code and see performance differences that I do not understand and therefore cannot fix. Below are three code snippets. The first establishes a timing baseline for calling the function and for declaring and initializing the variables. The second and third contain the actual loops I am trying to compare and optimize.
baseline:
import numpy as np
from numba import njit, types

@njit(cache=True, fastmath=True)
def baseline():
    sig_update = np.ones(6, dtype=types.float64)
    eps = np.ones(6, dtype=types.float64)
    bmat = np.ones((6, 30), dtype=types.float64)
    elv = np.ones(30, dtype=types.float64)
    u10 = np.ones(30, dtype=types.float64)
    ip = 1.0
    xsj = 1.0
153 ns ± 0.171 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
fast loop:
@njit(cache=True, fastmath=True)
def elv():
    sig_update = np.ones(6, dtype=types.float64)
    eps = np.ones(6, dtype=types.float64)
    bmat = np.ones((6, 30), dtype=types.float64)
    elv = np.ones(30, dtype=types.float64)
    u10 = np.ones(30, dtype=types.float64)
    ip = 1.0
    xsj = 1.0
    for j in range(30):
        elv[j] = 0.0
        for k in range(6):
            elv[j] += bmat[k, j] * sig_update[k] * ip * abs(xsj)
188 ns ± 1.7 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
The net time of the fast loop is therefore 188 - 153 = 35 ns.
slow loop:
@njit(cache=True, fastmath=True)
def eps():
    sig_update = np.ones(6, dtype=types.float64)
    eps = np.ones(6, dtype=types.float64)
    bmat = np.ones((6, 30), dtype=types.float64)
    elv = np.ones(30, dtype=types.float64)
    u10 = np.ones(30, dtype=types.float64)
    ip = 1.0
    xsj = 1.0
    for j in range(6):
        eps[j] = 0.0
        for k in range(30):
            eps[j] += bmat[j, k] * u10[k]
242 ns ± 2.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
The net time of the slow loop is therefore 242 - 153 = 89 ns.
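To make the comparison explicit, here is the same subtraction as a small sketch. It uses only the mean timings quoted above; the variable names are just for illustration.

```python
# Mean timings in nanoseconds, copied from the measurements above.
baseline_ns = 153.0
fast_total_ns = 188.0
slow_total_ns = 242.0

# Net cost of each loop after subtracting the call/allocation baseline.
net_fast_ns = fast_total_ns - baseline_ns
net_slow_ns = slow_total_ns - baseline_ns

# How many times slower the "slow" loop is, net of the baseline.
ratio = net_slow_ns / net_fast_ns

print(net_fast_ns, net_slow_ns, round(ratio, 2))  # 35.0 89.0 2.54
```

So, net of the baseline, the slow loop takes roughly 2.5 times as long as the fast one.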
The fast loop comprises 6 * 30 * 3 = 540 float64 multiplications and the slow loop 6 * 30 * 1 = 180 float64 multiplications. So I assume the difference comes from how the arrays are accessed. However, even then I would expect bmat to be accessed more efficiently in the slow loop (along rows) than in the fast loop (along columns).
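The counting and the layout assumption behind this expectation can be checked with a short NumPy-only sketch. It assumes bmat has the default C (row-major) layout that np.ones produces; the names mults_fast, mults_slow, row_step_bytes and col_step_bytes are only for illustration.

```python
import numpy as np

bmat = np.ones((6, 30), dtype=np.float64)

# Multiplications per call, counted from the loop bodies.
mults_fast = 6 * 30 * 3  # bmat[k, j] * sig_update[k] * ip * abs(xsj)
mults_slow = 6 * 30 * 1  # bmat[j, k] * u10[k]

# In a C-ordered array, moving along a row steps by one float64 (8 bytes),
# while moving down a column steps by a whole row (30 * 8 bytes).
row_step_bytes = bmat.strides[1]
col_step_bytes = bmat.strides[0]

print(mults_fast, mults_slow)                               # 540 180
print(bmat.flags["C_CONTIGUOUS"], row_step_bytes, col_step_bytes)  # True 8 240
```

The inner loop of the slow version (bmat[j, k] with k varying) therefore walks contiguous memory, while the inner loop of the fast version (bmat[k, j] with k varying) jumps 240 bytes per step, which is why the measured ordering surprises me.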
Any help with explaining the issue and optimizing the slow loop would be appreciated.
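For reference, this is what the two loops are meant to compute, written as plain NumPy matrix-vector products. It is only a sketch to pin down the expected results (the names elv_ref and eps_ref are made up here); I am not claiming this form would be faster inside Numba.

```python
import numpy as np

# Same inputs as in the snippets above.
sig_update = np.ones(6, dtype=np.float64)
bmat = np.ones((6, 30), dtype=np.float64)
u10 = np.ones(30, dtype=np.float64)
ip = 1.0
xsj = 1.0

# Fast loop: elv[j] = sum over k of bmat[k, j] * sig_update[k] * ip * abs(xsj)
elv_ref = (bmat.T @ sig_update) * ip * abs(xsj)

# Slow loop: eps[j] = sum over k of bmat[j, k] * u10[k]
eps_ref = bmat @ u10

print(elv_ref.shape, eps_ref.shape)  # (30,) (6,)
```

With all-ones inputs every entry of elv_ref is 6.0 and every entry of eps_ref is 30.0, which gives a simple value to compare any rewritten loop against.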