Array multiplication - speed comparison

I am profiling my code and find performance differences that I do not understand and therefore cannot improve. Below are three code snippets. The first creates a timing baseline for the function call and for declaring and initializing the variables. The second and third contain the actual loops I am trying to compare and optimize.

baseline:

from numba import njit, types
import numpy as np

@njit(cache=True, fastmath=True)
def baseline():
    sig_update = np.ones(6, dtype=types.float64) 
    eps = np.ones(6, dtype=types.float64)
    bmat = np.ones((6, 30), dtype=types.float64)
    elv = np.ones(30, dtype=types.float64)
    u10 = np.ones(30, dtype=types.float64)
    ip = 1.0
    xsj = 1.0

153 ns Ā± 0.171 ns per loop (mean Ā± std. dev. of 7 runs, 10,000,000 loops each)

fast loop:

@njit(cache=True, fastmath=True)
def elv():
    sig_update = np.ones(6, dtype=types.float64) 
    eps = np.ones(6, dtype=types.float64)
    bmat = np.ones((6, 30), dtype=types.float64)
    elv = np.ones(30, dtype=types.float64)
    u10 = np.ones(30, dtype=types.float64)
    ip = 1.0
    xsj = 1.0

    for j in range(30):
        elv[j] = 0.0
        for k in range(6):
            elv[j] += bmat[k, j] * sig_update[k] * ip * abs(xsj)

188 ns Ā± 1.7 ns per loop (mean Ā± std. dev. of 7 runs, 10,000,000 loops each)
net timing of the fast loop is therefore 188 - 153 = 35 ns

slow loop:

@njit(cache=True, fastmath=True)
def eps():
    sig_update = np.ones(6, dtype=types.float64) 
    eps = np.ones(6, dtype=types.float64)
    bmat = np.ones((6, 30), dtype=types.float64)
    elv = np.ones(30, dtype=types.float64)
    u10 = np.ones(30, dtype=types.float64)
    ip = 1.0
    xsj = 1.0

    for j in range(6):
        eps[j] = 0.0
        for k in range(30):
            eps[j] += bmat[j, k] * u10[k]

242 ns Ā± 2.7 ns per loop (mean Ā± std. dev. of 7 runs, 1,000,000 loops each)
net timing of the slow loop is therefore 242 - 153 = 89 ns

The fast loop comprises 6 * 30 * 3 = 540 float64 multiplications and the slow loop 6 * 30 * 1 = 180 float64 multiplications. So I assume the difference comes from how the arrays are accessed?! However, even then I would expect bmat to be accessed more efficiently in the slow loop (by row, i.e. contiguously) than in the fast loop (by column).

Any help with explaining the issue and optimizing the slow loop would be appreciated.

ā€¦ and stranger even. Why would this:

@njit(cache=True, fastmath=True)
def test2():
    a = np.ones((100,10), dtype=np.float64)
    b = np.ones(10, dtype=np.float64)
    c = np.ones(100, dtype=np.float64)
    tmp=0.0
    for j in range(100):
        tmp = 0.0
        for k in range(10):
            tmp += a[j,k]*b[k]
        c[j] = tmp

414 ns Ā± 5.42 ns per loop (mean Ā± std. dev. of 7 runs, 1,000,000 loops each)

be twice as fast as this:

@njit(cache=True, fastmath=True)
def test2():
    a = np.ones((100,10), dtype=np.float64)
    b = np.ones(10, dtype=np.float64)
    c = np.ones(100, dtype=np.float64)
    tmp=0.0
    for j in range(100):
#        tmp=0.0
        for k in range(10): 
            tmp += a[j,k]*b[k]
        c[j] = tmp

1.05 Āµs Ā± 11.6 ns per loop (mean Ā± std. dev. of 7 runs, 1,000,000 loops each)

With somewhat larger arrays, where the array shape isn't known at compile time, the timings don't look that weird.

Measuring tiny functions that compute practically nothing, or whose whole calculation can be optimized away, is tricky. At the very least, return something from the function.

from numba import njit
from numba import types
import numpy as np

@njit(cache=True, fastmath=True)
def elv_func(sig_update,eps,bmat,elv,u10):

    ip = 1.0
    xsj = 1.0

    for j in range(bmat.shape[1]):
        elv[j] = 0.0
        for k in range(bmat.shape[0]):
            elv[j] += bmat[k, j] * sig_update[k] * ip * abs(xsj)


@njit(cache=True, fastmath=True)
def eps_func_1(sig_update,eps,bmat,elv,u10):
    ip = 1.0
    xsj = 1.0

    for j in range(bmat.shape[0]):
        eps[j] = 0.0
        for k in range(bmat.shape[1]):
            eps[j] += bmat[j, k] * u10[k]
            
@njit(cache=True, fastmath=True)
def eps_func_2(sig_update,eps,bmat,elv,u10):
    ip = 1.0
    xsj = 1.0

    for j in range(bmat.shape[0]):
        acc = 0.0  # float accumulator; avoids initializing with an integer
        for k in range(bmat.shape[1]):
            acc += bmat[j, k] * u10[k]
        eps[j] = acc
        
sig_update = np.ones(600, dtype=np.float64) 
eps  = np.ones(600, dtype=np.float64)
bmat = np.ones((600, 3000), dtype=np.float64)
elv  = np.ones(3000, dtype=np.float64)
u10  = np.ones(3000, dtype=np.float64)

%timeit elv_func(sig_update,eps,bmat,elv,u10)
#1.96 ms Ā± 84.7 Āµs per loop (mean Ā± std. dev. of 7 runs, 100 loops each)
%timeit eps_func_1(sig_update,eps,bmat,elv,u10)
#1.56 ms Ā± 31.7 Āµs per loop (mean Ā± std. dev. of 7 runs, 1,000 loops each)
%timeit eps_func_2(sig_update,eps,bmat,elv,u10)
#448 Āµs Ā± 6.9 Āµs per loop (mean Ā± std. dev. of 7 runs, 1,000 loops each)
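
To act on the "return something" advice for the original self-contained benchmarks, here is a minimal sketch (a hypothetical variant of the question's elv function, reusing the imports above):

@njit(cache=True, fastmath=True)
def elv_bench():
    sig_update = np.ones(6, dtype=np.float64)
    bmat = np.ones((6, 30), dtype=np.float64)
    elv = np.ones(30, dtype=np.float64)
    ip = 1.0
    xsj = 1.0
    for j in range(30):
        elv[j] = 0.0
        for k in range(6):
            elv[j] += bmat[k, j] * sig_update[k] * ip * abs(xsj)
    # Returning a value that depends on the whole computation forces the
    # compiler to actually perform it instead of eliminating it.
    return elv.sum()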

Youā€™re touching a rather complex topic. It is not possible to do it justice with such a post, but Iā€™ll try to provide you some input to help understand why the outcomes of our tests might not be predictable, yet are not entirely surprising.

Numba-compiled code is different from Python, which executes all code in the order it’s written. In compiled code, there’s no guarantee that every piece of code will be executed, nor that it will be executed the way it’s written. When optimization comes into play, the compiler has significant freedom to rearrange and modify your code to improve performance, as long as the final outcome remains the same. This leads us to the first issue: your code contains a substantial amount of dead code. The compiler may remove parts of it, so breaking the runtime down into a ā€˜baselineā€™ and a ā€˜loopā€™ is not advisable. Additionally, the fact that your function doesn’t do anything observable (it takes no arguments and returns nothing) can lead to unexpected results.
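
As an illustration (not actual compiler output, and assuming the usual from numba import njit): since nothing in baseline() is ever read or returned, the optimizer may legally shrink the body to nothing, so the 153 ns is mostly call and dispatch overhead:

@njit(cache=True, fastmath=True)
def baseline_after_dce():
    # Hypothetical result of dead-code elimination: all the array
    # allocations in baseline() are dead stores, so nothing remains.
    pass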

Another important factor in your examples is the use of literals: the loop bounds and array shapes (6, 30) are compile-time constants. This gives the compiler a lot of information to work with, making it difficult to anticipate its actions (it will significantly alter your original code, e.g. by loop unrolling). So, if you’re not planning to use literals in your actual application, it’s best not to use them in these tests. Generally, it’s more effective to test your real application rather than toy functions, as there’s no assurance the compiler will behave the same way when these snippets are part of a larger codebase.
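
As a hypothetical illustration of unrolling: a literal trip count like range(6) lets the compiler expand the inner loop of the fast variant into straight-line code along these lines:

def inner_unrolled(bmat, sig_update, j):
    # Hand-written sketch of a fully unrolled inner loop: six products
    # and five additions, with no loop counter or branch left over.
    return (bmat[0, j] * sig_update[0] + bmat[1, j] * sig_update[1]
            + bmat[2, j] * sig_update[2] + bmat[3, j] * sig_update[3]
            + bmat[4, j] * sig_update[4] + bmat[5, j] * sig_update[5])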

As for why elv is faster than eps, this could be due to the small size of your arrays. They likely fit into your CPU’s L1 cache, so cache misses might not be a primary concern. Also, sig_update is smaller than u10, which is beneficial. You should also consider that ip * abs(xsj) needs to be computed only once (loop-invariant code motion).
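
Here is a sketch of that hoisting done by hand (a hypothetical rewrite of the elv loop; applying the factor once per j instead of once per term is a reassociation that fastmath permits):

@njit(cache=True, fastmath=True)
def elv_hoisted(sig_update, bmat, elv):
    ip = 1.0
    xsj = 1.0
    scale = ip * abs(xsj)  # loop invariant, computed once up front
    for j in range(bmat.shape[1]):
        acc = 0.0
        for k in range(bmat.shape[0]):
            acc += bmat[k, j] * sig_update[k]
        elv[j] = acc * scale  # invariant factor applied once per column
    return elv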

The second example isnā€™t surprising because a fully independent inner loop is generally preferable. If you remove the initial declaration of tmp = 0.0, it becomes more apparent why this is the better case.


Thanks @max9111 and @sschaer. Profiling compiled code is clearly not an easy task. I went about it by progressively commenting out code snippets and ended up with the ā€œsurpriseā€ I describe above. So the difference between the elv and eps timings was also obtained while profiling the entire code by elimination. After splitting the small arrays (6 elements or fewer) into individual variables, the difference in timing disappears. More verbose, but faster.
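
For illustration, a hypothetical sketch of the kind of rewrite meant here (assuming the same imports as above): the 6-element sig_update array becomes six scalar locals, which the compiler can keep in registers.

@njit(cache=True, fastmath=True)
def elv_split(bmat, elv, sig_update):
    # Unpack the small array into individual variables once, up front.
    s0, s1, s2, s3, s4, s5 = (sig_update[0], sig_update[1], sig_update[2],
                              sig_update[3], sig_update[4], sig_update[5])
    for j in range(bmat.shape[1]):
        elv[j] = (bmat[0, j] * s0 + bmat[1, j] * s1 + bmat[2, j] * s2
                  + bmat[3, j] * s3 + bmat[4, j] * s4 + bmat[5, j] * s5)
    return elv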