Missing apparent optimizations

Hello Numba community! Writing with a question. We’ve recently re-implemented some of our algorithms in Numba and are wondering whether we missed out on any apparent optimizations. The source is already quite performant, but we would be grateful to hear from Numba experts about possible caveats/pitfalls we missed.

The example stand-alone algorithm is the following:

We would be interested in high-level comments (e.g., you could just use x instead of y), but also in specifics (e.g., Numba supports this and that, and we wonder whether it would work here).

Thanks!

Hi @pehtran,

I’m not sure what you understand by “apparent,” but I can pinpoint some potential suboptimalities. I explicitly say “potential” because I did not run your code and only looked at the function in isolation, not holistically. Therefore, I have no idea which parts of your code are actually performance-relevant. Nevertheless, here is what I noticed:

  • Do you really need the boundscheck?
  • You’re using many vectorized operations. Don’t get me wrong; this is perfectly fine if you do it to keep the code size small and it helps with readability. But it is often not optimal for performance because it can result in either a lot of redundant work (we will see some examples later) or repeated memory allocations (we will also see examples of this).
  • Lines 26-27: You can do this more efficiently with an explicit loop: traverse container once to count the non-zero elements, then allocate unique_values and unique_counts, and finally traverse it again to fill them in (see the first sketch below this list). This approach also avoids astype, which in your case copies the data yet again into another array.
  • Line 41: Do not loop over the array if you don’t use c. Just use range(class_values.size). Then index also becomes your loop variable, which makes it a bit more readable.
  • Line 83: X == f_values[f_index] creates an intermediate array (actually even two: one for f_values[f_index] and another for the equality check) that you don’t actually need. Use an explicit for loop instead, as in the case of lines 26-27 (the second sketch below shows the pattern, together with the point about lines 93-94).
  • Line 86: np.roll seems innocent but is a costly function because it allocates a new array and copies data. The Numba implementation makes it even a bit worse because it uses a modulo operation to compute every single index (you can find the code here). There is a better way to do it here. However, in your case, I would avoid it entirely: just grab the values you need (x_value_subspace) and do not roll the entire array. Indexing with x_value_subspace makes it even worse because it creates yet another array from the rolled one (see the third sketch below).
  • Lines 93-94: Again, there’s an issue with the intermediate arrays. Since you declared the loop in line 76 as prange, I assume this is quite a hot loop, and you really want to minimize memory allocation inside it. I understand that preallocating and reusing memory is often less readable, but in this case you can at least partially avoid the allocations without sacrificing readability (again, see the second sketch below).
  • Line 130: Again, an intermediate array is created. Also, confirm that this is really what you want: should this condition only be true if all values of X and Y are the same? If yes, loop over the arrays and break as soon as you find the first item that is different (fourth sketch below); this will save you a lot of work. But maybe positive and negative values are allowed to cancel each other out…
  • You often use arrays for indexing (or variables you got from an array, like container[a[el]]). I am not sure if this is still relevant, but at least in earlier versions of Numba, using unsigned integers to index into arrays was significantly faster. This makes sense because if the index is unsigned, Numba does not have to check for “wraparound” (negative indices). So you may want to make such arrays of dtype np.uint32 instead of np.int32 (last sketch below).
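
To make the bullet points more concrete, here are a few sketches. They are based on guesses about your data and about what the surrounding code does (I still have not run anything), so treat them as illustrations of the patterns rather than drop-in replacements.

First, the two-pass idea for lines 26-27, assuming container holds per-value counts and you want the values with non-zero counts:

```python
import numpy as np
from numba import njit

@njit
def collect_nonzero(container):
    # First pass: count the non-zero entries.
    n = 0
    for i in range(container.size):
        if container[i] != 0:
            n += 1
    # Allocate the outputs once, directly in the dtype you need (no astype copy).
    unique_values = np.empty(n, dtype=np.int64)
    unique_counts = np.empty(n, dtype=np.int64)
    # Second pass: fill them.
    k = 0
    for i in range(container.size):
        if container[i] != 0:
            unique_values[k] = i
            unique_counts[k] = container[i]
            k += 1
    return unique_values, unique_counts
```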
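
Second, for lines 83 and 93-94: the idea is to fuse the comparison and whatever consumes it into a single loop body, so no mask or temporary array is materialized inside the prange loop and only scalars live in the hot path. The function below is a made-up stand-in for your actual computation; the pattern is what matters:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def per_value_totals(X, f_values):
    out = np.zeros(f_values.size)
    for f_index in prange(f_values.size):
        value = f_values[f_index]   # read the scalar once
        total = 0.0
        for i in range(X.size):
            if X[i] == value:       # no boolean mask array is created
                total += X[i]
        out[f_index] = total        # each iteration writes its own slot
    return out
```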
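
Third, for the np.roll on line 86: you can map the requested positions back into the unrolled array instead of rolling everything and then indexing into the result. The names here are placeholders for whatever your code actually uses:

```python
import numpy as np
from numba import njit

@njit
def take_rolled(arr, shift, x_value_subspace):
    # Equivalent to np.roll(arr, shift)[x_value_subspace], but it only touches
    # the elements you actually need and allocates only the output array.
    out = np.empty(x_value_subspace.size, arr.dtype)
    n = arr.size
    for k in range(x_value_subspace.size):
        # Element j of the rolled array is element (j - shift) % n of arr.
        out[k] = arr[(x_value_subspace[k] - shift) % n]
    return out
```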
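
Fourth, for line 130: if the intent really is “all elements are equal”, an early-exit loop avoids both the intermediate array and most of the comparisons:

```python
from numba import njit

@njit
def all_equal(X, Y):
    if X.size != Y.size:
        return False
    for i in range(X.size):
        if X[i] != Y[i]:
            return False  # stop at the first mismatch
    return True
```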
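
And finally the unsigned-indexing point, shown with a hypothetical gather. The only real change is building (or casting) the index array as np.uint32/np.uint64 up front:

```python
import numpy as np
from numba import njit

@njit
def gather(values, idx):
    # With an unsigned idx, Numba can skip the negative-index ("wraparound")
    # handling when it computes values[idx[i]].
    out = np.empty(idx.size, values.dtype)
    for i in range(idx.size):
        out[i] = values[idx[i]]
    return out

# idx = idx.astype(np.uint32)  # do the cast once, outside the hot code
```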

As I mentioned at the beginning, I did not run your code. Maybe some of my points are not applicable because I missed some details, or the improvements are not worth implementing because the overall runtime of these code parts is negligible. Still, I hope you find something useful in these suggestions.

Hello @sschaer! Thank you kindly for the prompt and extensive response; these are very good comments. We will test the optimizations/ideas and report back here so others are aware of what had the biggest impact. Really great forum, I must say.

Hi, a mini update, should anyone be interested in the results of implementing the suggestions. As correctly pointed out, the np.roll op is substantially heavy. By rewriting it in pure Numba and caching intermediary counts (we missed that last time around), we were able to speed up the algorithm by around 100%. Really good advice, @sschaer, thanks!
