Array multiplication - speed comparison

I am profiling my code and find performance differences that I do not understand and therefore cannot improve. Below are three code snippets. The first creates a timing baseline for the function call and for declaring and initializing the variables. The second and third contain the actual loops I am trying to compare and optimize.

baseline:

from numba import njit, types
import numpy as np

@njit(cache=True, fastmath=True)
def baseline():
    sig_update = np.ones(6, dtype=types.float64) 
    eps = np.ones(6, dtype=types.float64)
    bmat = np.ones((6, 30), dtype=types.float64)
    elv = np.ones(30, dtype=types.float64)
    u10 = np.ones(30, dtype=types.float64)
    ip = 1.0
    xsj = 1.0

153 ns Ā± 0.171 ns per loop (mean Ā± std. dev. of 7 runs, 10,000,000 loops each)

fast loop:

@njit(cache=True, fastmath=True)
def elv():
    sig_update = np.ones(6, dtype=types.float64) 
    eps = np.ones(6, dtype=types.float64)
    bmat = np.ones((6, 30), dtype=types.float64)
    elv = np.ones(30, dtype=types.float64)
    u10 = np.ones(30, dtype=types.float64)
    ip = 1.0
    xsj = 1.0

    for j in range(30):
        elv[j] = 0.0
        for k in range(6):
            elv[j] += bmat[k, j] * sig_update[k] * ip * abs(xsj)

188 ns Ā± 1.7 ns per loop (mean Ā± std. dev. of 7 runs, 10,000,000 loops each)
net timing of the fast loop is therefore 188 - 153 = 35 ns

slow loop:

@njit(cache=True, fastmath=True)
def eps():
    sig_update = np.ones(6, dtype=types.float64) 
    eps = np.ones(6, dtype=types.float64)
    bmat = np.ones((6, 30), dtype=types.float64)
    elv = np.ones(30, dtype=types.float64)
    u10 = np.ones(30, dtype=types.float64)
    ip = 1.0
    xsj = 1.0

    for j in range(6):
        eps[j] = 0.0
        for k in range(30):
            eps[j] += bmat[j, k] * u10[k]

242 ns Ā± 2.7 ns per loop (mean Ā± std. dev. of 7 runs, 1,000,000 loops each)
net timing of the slow loop is therefore 242 - 153 = 89 ns

The fast loop comprises 6 * 30 * 3 = 540 float64 multiplications and the slow loop 6 * 30 * 1 = 180 float64 multiplications. So I assume the difference comes from how the arrays are accessed?! However, even then I would expect bmat to be accessed more efficiently in the slow loop (by row, i.e. contiguously) than in the fast loop (by column).

Any help with explaining the issue and optimizing the slow loop would be appreciated.

ā€¦ and stranger even. Why would this:

@njit(cache=True, fastmath=True)
def test2():
    a = np.ones((100,10), dtype=np.float64)
    b = np.ones(10, dtype=np.float64)
    c = np.ones(100, dtype=np.float64)
    tmp=0.0
    for j in range(100):
        tmp = 0.0
        for k in range(10):
            tmp += a[j,k]*b[k]
        c[j] = tmp

414 ns Ā± 5.42 ns per loop (mean Ā± std. dev. of 7 runs, 1,000,000 loops each)

be twice as fast as this:

@njit(cache=True, fastmath=True)
def test2():
    a = np.ones((100,10), dtype=np.float64)
    b = np.ones(10, dtype=np.float64)
    c = np.ones(100, dtype=np.float64)
    tmp=0.0
    for j in range(100):
#        tmp=0.0
        for k in range(10): 
            tmp += a[j,k]*b[k]
        c[j] = tmp

1.05 Āµs Ā± 11.6 ns per loop (mean Ā± std. dev. of 7 runs, 1,000,000 loops each)

With somewhat larger arrays, where the array shape isn't known at compile time, the timings don't look that weird.

Measuring tiny functions that compute practically nothing, or whose whole calculation can be optimized away, is tricky. At the very least, return something from the function.

from numba import njit
from numba import types
import numpy as np

@njit(cache=True, fastmath=True)
def elv_func(sig_update,eps,bmat,elv,u10):

    ip = 1.0
    xsj = 1.0

    for j in range(bmat.shape[1]):
        elv[j] = 0.0
        for k in range(bmat.shape[0]):
            elv[j] += bmat[k, j] * sig_update[k] * ip * abs(xsj)


@njit(cache=True, fastmath=True)
def eps_func_1(sig_update,eps,bmat,elv,u10):
    ip = 1.0
    xsj = 1.0

    for j in range(bmat.shape[0]):
        eps[j] = 0.0
        for k in range(bmat.shape[1]):
            eps[j] += bmat[j, k] * u10[k]
            
@njit(cache=True, fastmath=True)
def eps_func_2(sig_update,eps,bmat,elv,u10):
    ip = 1.0
    xsj = 1.0

    for j in range(bmat.shape[0]):
        acc = 0.0  # float accumulator; avoids initializing with an integer
        for k in range(bmat.shape[1]):
            acc += bmat[j, k] * u10[k]
        eps[j] = acc
        
sig_update = np.ones(600, dtype=np.float64) 
eps  = np.ones(600, dtype=np.float64)
bmat = np.ones((600, 3000), dtype=np.float64)
elv  = np.ones(3000, dtype=np.float64)
u10  = np.ones(3000, dtype=np.float64)

%timeit elv_func(sig_update,eps,bmat,elv,u10)
#1.96 ms Ā± 84.7 Āµs per loop (mean Ā± std. dev. of 7 runs, 100 loops each)
%timeit eps_func_1(sig_update,eps,bmat,elv,u10)
#1.56 ms Ā± 31.7 Āµs per loop (mean Ā± std. dev. of 7 runs, 1,000 loops each)
%timeit eps_func_2(sig_update,eps,bmat,elv,u10)
#448 Āµs Ā± 6.9 Āµs per loop (mean Ā± std. dev. of 7 runs, 1,000 loops each)
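
To act on the "return something" advice for the original self-contained benchmarks, here is a minimal sketch (a hypothetical variant of the question's elv function, reusing the imports above):

@njit(cache=True, fastmath=True)
def elv_bench():
    sig_update = np.ones(6, dtype=np.float64)
    bmat = np.ones((6, 30), dtype=np.float64)
    elv = np.ones(30, dtype=np.float64)
    ip = 1.0
    xsj = 1.0
    for j in range(30):
        elv[j] = 0.0
        for k in range(6):
            elv[j] += bmat[k, j] * sig_update[k] * ip * abs(xsj)
    # Returning a value that depends on the whole computation forces the
    # compiler to actually perform it instead of eliminating it.
    return elv.sum()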

Youā€™re touching a rather complex topic. It is not possible to do it justice with such a post, but Iā€™ll try to provide you some input to help understand why the outcomes of our tests might not be predictable, yet are not entirely surprising.

Numba-compiled code is different from Python, which executes all code in the order it’s written. In compiled code, there’s no guarantee that every piece of code will be executed, nor that it will be executed the way it’s written. When optimization comes into play, the compiler has significant freedom to rearrange and modify your code to improve performance, as long as the final outcome remains the same. This leads us to the first issue: your code contains a substantial amount of dead code. The compiler may remove parts of it, so breaking the runtime down into a ā€˜baselineā€™ and a ā€˜loopā€™ is not advisable. Additionally, the fact that your function doesn’t do anything observable (it takes no arguments and returns nothing) can lead to unexpected results.
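
As an illustration (not actual compiler output, and assuming the usual from numba import njit): since nothing in baseline() is ever read or returned, the optimizer may legally shrink the body to nothing, so the 153 ns is mostly call and dispatch overhead:

@njit(cache=True, fastmath=True)
def baseline_after_dce():
    # Hypothetical result of dead-code elimination: all the array
    # allocations in baseline() are dead stores, so nothing remains.
    pass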

Another important factor in your examples is the use of literals: the loop bounds and array shapes (6, 30) are compile-time constants. This gives the compiler a lot of information to work with, making it difficult to anticipate its actions (it will significantly alter your original code, e.g. by loop unrolling). So, if you’re not planning to use literals in your actual application, it’s best not to use them in these tests. Generally, it’s more effective to test your real application rather than toy functions, as there’s no assurance the compiler will behave the same way when these snippets are part of a larger codebase.
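
As a hypothetical illustration of unrolling: a literal trip count like range(6) lets the compiler expand the inner loop of the fast variant into straight-line code along these lines:

def inner_unrolled(bmat, sig_update, j):
    # Hand-written sketch of a fully unrolled inner loop: six products
    # and five additions, with no loop counter or branch left over.
    return (bmat[0, j] * sig_update[0] + bmat[1, j] * sig_update[1]
            + bmat[2, j] * sig_update[2] + bmat[3, j] * sig_update[3]
            + bmat[4, j] * sig_update[4] + bmat[5, j] * sig_update[5])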

As for why elv is faster than eps, this could be due to the small size of your arrays. They likely fit into your CPU’s L1 cache, so cache misses might not be a primary concern. Also, sig_update is smaller than u10, which is beneficial. You should also consider that ip * abs(xsj) needs to be computed only once (loop-invariant code motion).
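
Here is a sketch of that hoisting done by hand (a hypothetical rewrite of the elv loop; applying the factor once per j instead of once per term is a reassociation that fastmath permits):

@njit(cache=True, fastmath=True)
def elv_hoisted(sig_update, bmat, elv):
    ip = 1.0
    xsj = 1.0
    scale = ip * abs(xsj)  # loop invariant, computed once up front
    for j in range(bmat.shape[1]):
        acc = 0.0
        for k in range(bmat.shape[0]):
            acc += bmat[k, j] * sig_update[k]
        elv[j] = acc * scale  # invariant factor applied once per column
    return elv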

The second example isnā€™t surprising because a fully independent inner loop is generally preferable. If you remove the initial declaration of tmp = 0.0, it becomes more apparent why this is the better case.


Thanks @max9111 and @sschaer. Profiling compiled code is clearly not an easy task. I went about it by progressively commenting out code snippets and ended up with the ā€œsurpriseā€ I describe above. So the difference between the elv and eps timings was also obtained while profiling the entire code by elimination. After splitting the small arrays (6 elements or fewer) into individual variables, the difference in timing disappears. More verbose, but faster.
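
For illustration, a hypothetical sketch of the kind of rewrite meant here (assuming the same imports as above): the 6-element sig_update array becomes six scalar locals, which the compiler can keep in registers.

@njit(cache=True, fastmath=True)
def elv_split(bmat, elv, sig_update):
    # Unpack the small array into individual variables once, up front.
    s0, s1, s2, s3, s4, s5 = (sig_update[0], sig_update[1], sig_update[2],
                              sig_update[3], sig_update[4], sig_update[5])
    for j in range(bmat.shape[1]):
        elv[j] = (bmat[0, j] * s0 + bmat[1, j] * s1 + bmat[2, j] * s2
                  + bmat[3, j] * s3 + bmat[4, j] * s4 + bmat[5, j] * s5)
    return elv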