Feedback on tips for first-timers

Hi all,

I’m writing a post with some suggestions to help first time Numba users get their best results. Provisionally I’ve come up with these recommendations, based on my own try-it-and-time-it experiments, and not based on any understanding of the nuts and bolts of Numba or LLVM.

Do these sound like they’re close to right? Anything you would modify? Add? Remove?

Thanks,
Brandon

Suggestions for working with Numba

  1. Use for loops instead of array operators.
  2. Don’t use Numpy functions.
  3. Don’t initialize Numpy arrays.
  4. Pass in pre-allocated arrays for holding results.
  5. But if you’re doing matrix multiplication, skip Numba and stick with Numpy.

Hello, and thank you for your post. I am not convinced that this list is anywhere close to right, and I am quite surprised you ended up with this set of recommendations. Numba has extensive support for NumPy and can actually accelerate many NumPy functions beyond what NumPy alone can do. In essence, recommending that users not combine Numba and NumPy appears counter-productive and potentially misleading.
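
As a minimal sketch of that point (the function name, logic and data below are made up for illustration, not taken from this thread), a jitted function can freely mix NumPy calls with plain Python loops:

import numpy as np
from numba import njit

# Illustrative only: a jitted function combining a plain loop with a NumPy call.
@njit
def rms_of_positives(x):
    total = 0.0
    count = 0
    for v in x:          # plain Python loop over a NumPy array
        if v > 0.0:
            total += v * v
            count += 1
    if count == 0:
        return 0.0
    return np.sqrt(total / count)   # NumPy function inside nopython mode

x = np.random.standard_normal(1_000_000)
print(rms_of_positives(x))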

Having said that, Numba is quite sensitive to use-cases, so I think any tips for first-timers should be organized around use-cases: you want to do X, hence you need to think about Y. My suggestions for moving forward on this would be:

a) Please post your try-it-and-time-it experiments, so that we can see how you managed to derive these rather surprising tips.

b) Enumerate and establish a list of common use-cases.

c) Derive a set of best-practices from these use-cases.

Nonetheless, this sounds like a very useful piece of documentation to have, and it would probably be a good addition to the Numba documentation as a whole.

1 Like

I would say that a generally true restatement of 1-3, in my experience, is that one can often speed things up considerably by writing loops that accomplish the equivalent of several numpy vector/tensor operations item by item in the body of the loop, instead of applying each operation over all of the data before moving on to the next one. I think this is counterintuitive to many Python/Matlab users because they are often told to use vectorized operations to speed things up, and they interpret this advice to mean that there is some magical principle of optimization called vectorization. In fact, one of the best things to do (when you can compile code) is just to make sure you have nice contiguous memory access patterns and to accomplish as much as possible in each pass through the data, so that the compiler can keep intermediate results in fast CPU registers instead of allocating new arrays for them.

I think there are some examples of this in the docs. But some more comprehensive resources along these lines would probably help folks whose heuristics for making things faster come from a Python/Matlab background where writing loops is the worst possible thing to do.
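
As a minimal sketch of that idea (the names below are illustrative, not from this thread): the NumPy version makes several whole-array passes and allocates temporaries, while the jitted loop does the same arithmetic in a single pass.

import numpy as np
from numba import njit


def hypot_numpy(x, y):
    # four whole-array operations, three temporary arrays
    return np.sqrt(x ** 2 + y ** 2)


@njit
def hypot_fused(x, y):
    # one pass over the data; the intermediates stay in registers
    out = np.empty_like(x)
    for i in range(x.size):
        out[i] = (x[i] * x[i] + y[i] * y[i]) ** 0.5
    return out


x = np.random.random(10_000_000)
y = np.random.random(10_000_000)
assert np.allclose(hypot_numpy(x, y), hypot_fused(x, y))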

2 Likes

Thanks esc, this is exactly the type of insight I was hoping to get. As a start to the process you recommend, here’s a particularly illustrative case. The function

def add(A, B, C):
    C = A + B
    return C

adds two 10-million-element arrays of np.float64 values. With care taken to avoid caching effects and to exclude compilation time from the measurement, I get that this takes 39 ms with the @njit decorator in place and 17 ms without it.

Hypothesizing that the time for allocating the results array was significant, I also tried

def add(A, B, C):
    np.add(A, B, out=C)
    return C

This took 14 ms without the @njit decorator. With @njit it failed to compile, throwing error

numba.core.errors.LoweringError: Failed in nopython mode pipeline (step: native lowering)
unsupported keyword arguments when calling Function(<ufunc 'add'>)

File "sum_array_numba.py", line 20:
def add(A, B, C):
    np.add(A, B, out=C)

Interestingly, doing this addition with an element-wise jitted loop also takes 14 ms:

@njit
def add(A, B, C):
    for i in range(A.size):
        C[i] = A[i] + B[i]
    return C

Based on this, I concluded that Numba and Numpy in this case were working at cross purposes. Any other interpretations to offer?

Numba version 0.56.4
Numpy version 1.21.5

Thanks for validating the experience, DannyWeitekamp. It was wildly counterintuitive for me at first too, for exactly the reason you described. Your interpretation was my tentative hypothesis too, so I’m glad to hear we were thinking along the same lines.

@Brohrer - Thank you for adding these details about the array-sum use-case. Do you think it would be possible to summarize your findings in a self-contained script that can be copied and pasted to run easily? It would make it easier for everyone interested to reproduce your findings. Thank you!

No problem. Hopefully this works for you.
Here’s the original.

from time import perf_counter
import numpy as np
from numba import njit

n_reps = 10
n_elements = int(1e8)


@njit
def add_numpy_jit(A, B, C):
    C = A + B
    return C


@njit
def add_elements(A, B, C):
    for i in range(A.size):
        C[i] = A[i] + B[i]
    return C


def time_function(fcn, msg):
    print()
    print(msg)
    total_time = 0
    for i_rep in range(n_reps):
        print(f"iter {i_rep + 1} of {n_reps}", end="\r")

        start = perf_counter()
        fcn(A, B, C)
        end = perf_counter()
        total_time += end - start

    print(f"{int(1000 * total_time / n_reps)} ms           ")


# Initialize and pre-allocate the arrays
A = np.random.sample(n_elements)
B = np.random.sample(n_elements)
C = np.random.sample(n_elements)

# Ensure jitted functions are pre-compiled
add_numpy_jit(A, B, C)
add_elements(A, B, C)

time_function(np.add, "Numpy array addition")
time_function(add_numpy_jit, "Numpy array addition, jitted by Numba ")
time_function(add_elements, "Numba-jitted element-wise array addition")

On contiguous arrays, np.dot in both Numba and Numpy calls the appropriate BLAS routine (with a bit less calling overhead in Numba). On non-contiguous arrays the behaviour is different: Numba uses its own implementation (which looks quite similar to what you have tried), while numpy.dot copies the arrays and calls a BLAS routine.
Which approach is faster depends on the array size.

For very small arrays a BLAS call can also be much slower than hand-written code.
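
A minimal sketch of that comparison (the sizes and names below are arbitrary, not from the post above), using the same warm-up-then-time pattern as the script earlier in the thread:

from time import perf_counter
import numpy as np
from numba import njit

# Note: Numba's np.dot relies on SciPy's BLAS bindings, so SciPy must be installed.
@njit
def dot_jit(a, b):
    return np.dot(a, b)


def time_it(fcn, a, b, msg, n_reps=10):
    start = perf_counter()
    for _ in range(n_reps):
        fcn(a, b)
    print(f"{msg}: {1000 * (perf_counter() - start) / n_reps:.1f} ms")


n = 500
a_c = np.random.random((n, n))                    # contiguous operands
b_c = np.random.random((n, n))
a_s = np.random.random((2 * n, 2 * n))[::2, ::2]  # strided, non-contiguous views
b_s = np.random.random((2 * n, 2 * n))[::2, ::2]

dot_jit(a_c, b_c)   # warm-up calls so compilation time is excluded
dot_jit(a_s, b_s)

time_it(np.dot, a_c, b_c, "np.dot, contiguous")
time_it(dot_jit, a_c, b_c, "jitted np.dot, contiguous")
time_it(np.dot, a_s, b_s, "np.dot, strided")
time_it(dot_jit, a_s, b_s, "jitted np.dot, strided")

Which of the four comes out ahead depends on n, as described above.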

@Brohrer thank you for adding the script for reproduction. I played around with it and will share my discoveries below:

I initially downloaded the script as compare.py and ran it and got the following results:

 💣 zsh» python compare.py

Numpy array addition
32 ms

Numpy array addition, jitted by Numba
97 ms

Numba-jitted element-wise array addition
31 ms

I then applied the following diff to compare against the non-jitted version of C = A + B:

diff --git i/compare.py w/compare.py
index 9ac7a73b9b..b18cb6587f 100644
--- i/compare.py
+++ w/compare.py
@@ -6,7 +6,7 @@ n_reps = 10
 n_elements = int(1e8)


-@njit
+#@njit
 def add_numpy_jit(A, B, C):
     C = A + B
     return C

Running this again:

 💣 zsh» python compare.py

Numpy array addition
32 ms

Numpy array addition, jitted by Numba
96 ms

Numba-jitted element-wise array addition
31 ms

So this is a bit weird, I thought to myself… I then realized that the way you benchmark compares quite different things: np.add is being tested as the three-argument version fcn(A, B, C), while add_numpy_jit is doing a two-argument version and so has to allocate its result array. Adding the njit decorator doesn’t really make any difference here. Here is a patch that compares the jitted and non-jitted functions, using the py_func attribute of the compiled function to execute the non-jitted version:

diff --git i/compare.py w/compare.py
index 9ac7a73b9b..cfb585e75b 100644
--- i/compare.py
+++ w/compare.py
@@ -8,8 +8,7 @@ n_elements = int(1e8)

 @njit
 def add_numpy_jit(A, B, C):
-    C = A + B
-    return C
+    return np.add(A, B, C)


 @njit
@@ -43,6 +42,6 @@ C = np.random.sample(n_elements)
 add_numpy_jit(A, B, C)
 add_elements(A, B, C)

-time_function(np.add, "Numpy array addition")
+time_function(add_numpy_jit.py_func, "Numpy array addition py_func")
 time_function(add_numpy_jit, "Numpy array addition, jitted by Numba ")
 time_function(add_elements, "Numba-jitted element-wise array addition")

Running it:

 💣 zsh» python compare.py

Numpy array addition py_func
32 ms

Numpy array addition, jitted by Numba
32 ms

Numba-jitted element-wise array addition
31 ms

Though I am not sure what conclusions to draw from this. I do wonder, though, why Numba seems to jit the three-argument variant but rejects an explicit use of the out= kwarg. I will think about this some more.

I looked into this some more and came up with the following next version:

from time import perf_counter
import numpy as np
from numba import njit

n_reps = 10
n_elements = int(1e8)


def add_numpy_results(A, B, C):
    np.add(A, B, out=C)


@njit
def add_numpy_njit(A, B, C):
    return np.add(A, B)


@njit
def add_njit(A, B, C):
    return A + B


@njit
def add_elements(A, B, C):
    for i in range(A.size):
        C[i] = A[i] + B[i]


def time_function(fcn, msg):
    print()
    print(msg)
    total_time = 0
    for i_rep in range(n_reps):
        print(f"iter {i_rep + 1} of {n_reps}", end="\r")

        start = perf_counter()
        fcn(A, B, C)
        end = perf_counter()
        total_time += end - start

    print(f"{int(1000 * total_time / n_reps)} ms           ")


# Initialize and pre-allocate the arrays
A = np.random.sample(n_elements)
B = np.random.sample(n_elements)
C = np.random.sample(n_elements)

add_numpy_njit(A, B, C)
add_njit(A, B, C)
add_elements(A, B, C)

time_function(add_numpy_results, "np.add out= as kwarg")
time_function(add_numpy_njit.py_func, "np.add py_func")
time_function(add_numpy_njit, "np.add jitted")
time_function(add_njit.py_func, "addition operator py_func")
time_function(add_njit, "addition operator jitted")
time_function(add_elements, "Numba-jitted element-wise array addition")
time_function(add_elements.py_func, "pure Python element-wise array addition")

When I run it on my M1 MacBook I get:

 💣 zsh» python compare.py                                                                                                                                              

np.add out= as kwarg
32 ms

np.add py_func
95 ms

np.add jitted
96 ms

addition operator py_func
96 ms

addition operator jitted
95 ms

Numba-jitted element-wise array addition
31 ms

pure Python element-wise array addition
14148 ms

So that seems to indicate that using the out= kwarg or doing a hand-rolled Numba loop are the fastest options. I wonder if Numba could support the out= kwarg.

1 Like

Hey esc, thanks for giving this such a careful look! You’ve taken it a level deeper than I did, beyond helpful tips for first timers and into “what’s actually going on here and how can we make it better?” I appreciate that.

That said, you’ve left me in your dust. The next step on this is deeper than I can help with right now :slight_smile:

FWIW I also did this investigation into matrix multiplication. It is consistent with your explanation.

from time import perf_counter
import numpy as np
from numba import njit, prange

n_reps = 10
n_rows = 1000
n_mid = 1000
n_cols = 1000


@njit
def matmul_elements(A, B, C):
    n_row, n_col = C.shape
    n_mid = A.shape[1]
    for i in range(n_row):
        for j in range(n_col):
            for k in range(n_mid):
                C[i, j] += A[i, k] * B[k, j]


@njit(parallel=True)
def matmul_parallel(A, B, C):
    n_row, n_col = C.shape
    n_mid = A.shape[1]
    for i in prange(n_row):
        for j in prange(n_col):
            for k in prange(n_mid):
                C[i, j] += A[i, k] * B[k, j]


def time_function(fcn, msg):
    print()
    print(msg)
    total_time = 0
    for i_rep in range(n_reps):
        print(f"iter {i_rep + 1} of {n_reps}", end="\r")

        start = perf_counter()
        fcn(A, B, C)
        end = perf_counter()
        total_time += end - start

    print(f"{int(1000 * total_time / n_reps)} ms           ")


# Initialize and pre-allocate the arrays
A = np.random.sample((n_rows, n_mid))
B = np.random.sample((n_mid, n_cols))
C = np.random.sample((n_rows, n_cols))

# Ensure jitted functions are pre-compiled
matmul_elements(A, B, C)
matmul_parallel(A, B, C)

time_function(np.matmul, "Numpy matrix multiply")
time_function(matmul_elements, "Numba, single-threaded matrix multiply")
time_function(matmul_parallel, "Numba, parallelized matrix multiply")

It gives me this on my machine:

Numpy matrix multiply
579 ms           

Numba, single-threaded matrix multiply
1270 ms           

Numba, parallelized matrix multiply
298 ms           

So I guess the next-level guidance on matrix multiplication might be “Numba with parallel=True”.

@Brohrer

I think there is something wrong with your timing; 580 ms is way too long to multiply two 1000-by-1000 contiguous matrices. Your implementation also has a bug: the output matrix must be set to zero first.
Also, @max9111 referred to non-contiguous arrays in the first part of his post, and to arrays much smaller than one million elements in the second part.

I suggest that after you fix the bug, you add the following function to your set of test functions:

numba_matmul = njit(lambda a, b, c: a @ b)

and then run it with two different inputs:

# Initialize and pre-allocate contiguous arrays
A1 = np.random.sample((n_rows, n_mid))
B1 = np.random.sample((n_mid, n_cols))
C1 = np.empty((n_rows, n_cols))

# Initialize and pre-allocate strided (non-contiguous) arrays
A2 = np.random.sample((2*n_rows, 2*n_mid))[::2, ::2]
B2 = np.random.sample((2*n_mid, 2*n_cols))[::2, ::2]
C2 = np.empty((n_rows, n_cols))

where you make the sizes much smaller than they are now. This will show you more clearly what @max9111 meant.

The key to fast matrix multiplication is not parallelism, although it is helpful. The key to fast matrix multiplication is optimizing memory access patterns, maximizing cache usage, and minimizing unnecessary operations (e.g. loop overhead).
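
As one sketch of the memory-access point (assuming the same A, B, C arrays as in the script above, and not meant as a recommendation to hand-roll matrix multiplication in general), reordering the loops so that the innermost index walks contiguously along rows of B and C, and zeroing C first, already helps the naive kernel:

from numba import njit

@njit
def matmul_ikj(A, B, C):
    n_row, n_col = C.shape
    n_mid = A.shape[1]
    for i in range(n_row):
        # zero this row of the output first (the bug mentioned above)
        for j in range(n_col):
            C[i, j] = 0.0
        for k in range(n_mid):
            a_ik = A[i, k]
            # the innermost loop now runs contiguously along B[k, :] and C[i, :]
            for j in range(n_col):
                C[i, j] += a_ik * B[k, j]

Even so, for anything but small matrices a plain a @ b inside an njit function, which hands the work to BLAS, will be hard to beat.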

But I also think that what you are trying to show here is already way beyond what beginners should be concerned with. With linear algebra, you usually get along very well if you just trust Numba; the developers did a very good job there :slight_smile:

2 Likes

Here is a related case where a NumPy function is slower when jitted:

Thanks for flagging this. Good to know it’s not just me :slight_smile: