Numba with dot product is 10x slower than Numba with loop

I am very surprised that np.dot is 10x slower than the loop in Numba. Does someone have an idea? In my example below, go_fast uses a loop instead of a dot product, and go_fast_np uses the array dot product directly. go_fast_np is 10x slower than go_fast. Here are the timings (go_fast first, then go_fast_np; the first call of each includes compilation):

0.653007984161377
0.012694835662841797
0.0072171688079833984
0.0067157745361328125
0.007045269012451172

0.543813943862915
0.041455984115600586
0.029541015625
0.03484296798706055
0.03132915496826172

import numpy as np
from numba import njit
import time

@njit
def go_fast(a, b):
    # Function is compiled and runs in machine code
    m = a.shape[0]
    n = b.shape[0]
    f = a.shape[1]
    trace = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for k in range(f):
                trace[i, j] += a[i, k] * b[j, k]
    return trace

@njit
def go_fast_np(a, b):
    # why is it 10x slower than go_fast?
    m = a.shape[0]
    n = b.shape[0]
    trace = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            trace[i, j] = a[i, :].dot(b[j, :])
    return trace

if __name__ == '__main__':
    x = np.arange(20000).astype(float).reshape(1000, 20)
    y = np.arange(10000).astype(float).reshape(500, 20)
    t = time.time()
    for _ in range(5):
        s = go_fast(x, y)
        prev_t = t
        t = time.time()
        print(t - prev_t)
    print('=' * 60)
    for _ in range(5):
        s1 = go_fast_np(x, y)
        prev_t = t
        t = time.time()
        print(t - prev_t)


Hi, it’s very common for pure-loop code to perform better than array methods or NumPy functions.

Earlier this week we had another example: Using @njit with numpy.tensordot

So is it always recommended to use loops instead of vectorization within functions decorated with njit?

“Recommended” would be too strong. First, it depends on what you care about. Speed is not the only thing that matters for code. Loops can be harder to read than vectorized code in many cases, so maybe you care about that more than speed. In other cases loops might be easier to read; it depends on the case.

Even for speed, it is not possible to give an absolute rule. In general, loops perform well in Numba. However, in some cases, like matrix multiplication, it’s impossible to beat the BLAS libraries that back np.dot. So a matrix multiplication written as an explicit loop will be slower than using np.dot.
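To make that concrete, here is a minimal sketch (the sizes and function names are mine, not from this thread) comparing an njit-compiled triple loop against np.dot, which Numba dispatches to BLAS (SciPy's BLAS bindings must be available):

import numpy as np
from numba import njit
import time

@njit
def matmul_loop(a, b):
    # Naive triple loop: n^3 scalar multiply-adds
    m, k = a.shape
    n = b.shape[1]
    out = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for p in range(k):
                out[i, j] += a[i, p] * b[p, j]
    return out

@njit
def matmul_blas(a, b):
    # Inside njit, np.dot on 2D arrays calls the BLAS GEMM routine
    return np.dot(a, b)

a = np.random.rand(500, 500)
b = np.random.rand(500, 500)
matmul_loop(a, b)    # warm up: trigger compilation
matmul_blas(a, b)
t = time.time(); matmul_loop(a, b); print('loop:', time.time() - t)
t = time.time(); matmul_blas(a, b); print('blas:', time.time() - t)
# Expect BLAS to win by a wide margin at this size.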

hope it helps,
Luk

Both of your functions are slower than they could be.

Very small matrix-matrix multiplications can be done faster if you inline everything. Example
In your case you have only one (quite small) dot product per output element, where an optimized Numba function is approximately as fast as a BLAS call. On a larger problem the BLAS call will be significantly faster; on a smaller problem a single-threaded Numba function can be faster (e.g. a lot of (2x2) dot calls, as shown in the link).

import numpy as np
import numba as nb

@nb.njit()
def go_fast(a, b):
    # Function is compiled and runs in machine code
    m = a.shape[0]
    n = b.shape[0]
    f = a.shape[1]
    trace = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for k in range(f):
                trace[i, j] += a[i, k] * b[j, k]
    return trace

@nb.njit()
def go_fast_np(a, b):
    # why is it 10x slower than go_fast
    m = a.shape[0]
    n = b.shape[0]
    trace = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            trace[i, j] = a[i, :].dot(b[j, :])
    return trace

@nb.njit(fastmath=True, parallel=True)
def go_fast_opt(a, b):
    # Function is compiled and runs in machine code
    m = a.shape[0]
    n = b.shape[0]
    f = a.shape[1]
    trace = np.empty((m, n))
    for i in nb.prange(m):
        for j in range(n):
            # Use a scalar accumulator here
            acc = 0.
            for k in range(f):
                acc += a[i, k] * b[j, k]
            trace[i, j] = acc
    return trace


@nb.njit()
def go_fast_np_opt(a, b):
    # One BLAS (GEMM) call for the whole result instead of one
    # vector-vector call per output element
    trace = np.dot(a, b.T)
    return trace

Timings

a = np.arange(20000).astype(float).reshape(1000, 20)
b = np.arange(10000).astype(float).reshape(500, 20)

%timeit go_fast(a, b)
#6.64 ms ± 50.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit go_fast_opt(a, b)
#1.2 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit go_fast_np(a, b)
#33.4 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit go_fast_np_opt(a, b)
#1.1 ms ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

a = np.arange(20000).astype(float).reshape(100, 200)
b = np.arange(10000).astype(float).reshape(50, 200)

%timeit go_fast(a, b)
#922 µs ± 1.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit go_fast_opt(a, b)
#79.4 µs ± 1.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit go_fast_np(a, b)
#431 µs ± 3.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit go_fast_np_opt(a, b)
#14.6 µs ± 291 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@max9111 Thanks for your reply. It is very helpful. np.dot on the whole matrix seems to be very fast, and three loops with a simple elementwise multiply-add are also fast. What bothers me is why two loops with a vector dot (go_fast_np) are always much slower. Does it have something to do with indexing?

The slicing might be creating intermediate variables (each a[i, :] and b[j, :] materializes a small array view per iteration).
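If that were the main cost, batching the slices should help. As a sketch (go_fast_np_rows is a hypothetical variant, not from the thread), replacing the per-element vector dot with one matrix-vector product per row cuts 500,000 small BLAS calls down to 1,000 larger ones:

import numpy as np
from numba import njit

@njit
def go_fast_np_rows(a, b):
    m = a.shape[0]
    n = b.shape[0]
    trace = np.empty((m, n))
    for i in range(m):
        # One BLAS matrix-vector product (GEMV) per row of `a`:
        # 1,000 calls instead of 500,000 vector-vector calls
        trace[i, :] = np.dot(b, a[i, :])
    return trace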

If there is a specialized BLAS routine (GEMM, the matrix-matrix product), it is usually recommended to use it. Only in some special cases (really small arrays) can a handwritten, inlined Numba implementation be faster, by avoiding the function call overhead and some checks that are done inside the BLAS routine.
Some BLAS packages, e.g. Intel MKL, also have optimizations for very small dot products, but these are not available through Numba.

On larger arrays, e.g. (1000x1000)x(1000x1000), the naive Numba implementation (go_fast_opt) will be much slower than the highly optimized BLAS routine (a simple np.dot).

Why is go_fast_np so slow?

In this example you call a BLAS routine (a vector-vector product) 500,000 times, once per element of the 1000 x 500 result. Each function call comes with fixed overhead while doing only a few calculations in the called function; the call itself is more costly than the whole computation it performs.
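A minimal sketch of that overhead argument (the harness, names, and sizes are my own, not from the thread): doing 500,000 length-20 dot products once through per-pair BLAS calls and once as an inlined loop.

import numpy as np
from numba import njit
import time

@njit
def rows_blas(A, B):
    # One BLAS vector-vector call per row pair: 500,000 dispatches
    s = 0.0
    for i in range(A.shape[0]):
        s += A[i, :].dot(B[i, :])
    return s

@njit
def rows_loop(A, B):
    # Inlined multiply-add loop: same arithmetic, no call overhead
    s = 0.0
    for i in range(A.shape[0]):
        acc = 0.
        for k in range(A.shape[1]):
            acc += A[i, k] * B[i, k]
        s += acc
    return s

A = np.random.rand(500_000, 20)
B = np.random.rand(500_000, 20)
rows_blas(A, B); rows_loop(A, B)   # warm up: trigger compilation
t = time.time(); rows_blas(A, B); print('blas:', time.time() - t)
t = time.time(); rows_loop(A, B); print('loop:', time.time() - t)
# The difference between the two is essentially the per-call
# dispatch cost, the same effect behind go_fast_np vs go_fast.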