Tips or advice for a function which is slower in Numba

Hi amazing team!
I am trying to overcome a bottleneck in a process by implementing Numba’s njit. There are two simple functions in this bottleneck, both ready to work with Numba. See the test with and without njit below:

import numpy as np
from numba import njit  # numba.__version__ returns 0.55.1


def weib_2params_sampl(shape, scale, num_samples=1):
    arr = np.random.weibull(shape, size=num_samples)
    return scale * arr

def weib_2params(y, iters=1000, eps=1e-6):
    ## Checkers
    # Check for distribution ndim
    if len(y.shape) != 1:
        print("Weibull 2 params fit Error. Function requires 'y' to have ndim==1")
        return np.nan, np.nan
    # Check for nans ~ automatically erase them
    if np.isnan(y).any():
        y = np.ravel(y)[~np.isnan(y)]
    
    ## 2 params-Weibull fit k via MLE
    # Initialization
    ln_y = np.log(y)
    k = 1.
    k_t_1 = k

    for t in range(iters):
        y_k = y ** k
        y_k_ln_y = y_k * ln_y
        ff = np.sum(y_k_ln_y)
        fg = np.sum(y_k)
        f = ff / fg - np.mean(ln_y) - (1. / k)

        # Calculate second derivative d^2f/dk^2
        ff_prime = np.sum(y_k_ln_y * ln_y)
        fg_prime = ff
        f_prime = (ff_prime/fg - (ff/fg * fg_prime/fg)) + (1. / (k*k))

        # Newton-Raphson method k = k - f(k;x)/f'(k;x)
        k -= f/f_prime

        if np.isnan(f):
            print("Weibull 2 params fit warning. Fitting the 2 parameters Weibull is impossible. Returning NaN for shape and scale")
            return np.nan, np.nan
        if abs(k - k_t_1) < eps:
            break
        # Update the MLE
        k_t_1 = k

    lam = np.mean(y ** k) ** (1.0 / k)
    return k, lam

def test():
    arr = weib_2params_sampl(*np.random.uniform(.1, 6, 2), num_samples=15000)
    weib_2params(arr)


######## Numba implementation

@njit
def weib_2params_sampl_nb(shape, scale, num_samples=1):
    return scale * np.random.weibull(shape, size=num_samples)

@njit
def weib_2params_nb(y, iters=1000, eps=1e-6):
    ## Checkers
    # Check for distribution ndim
    if len(y.shape) != 1:
        print("Weibull 2 params fit Error. Function requires 'y' to have ndim==1")
        return np.nan, np.nan
    # Check for nans ~ automatically erase them
    if np.isnan(y).any():
        y = np.ravel(y)[~np.isnan(y)]
    
    ## 2 params-Weibull fit k via MLE
    # Initialization
    ln_y = np.log(y)
    k = 1.
    k_t_1 = k

    for t in range(iters):
        y_k = y ** k
        y_k_ln_y = y_k * ln_y
        ff = np.sum(y_k_ln_y)
        fg = np.sum(y_k)
        f = ff / fg - np.mean(ln_y) - (1. / k)

        # Calculate second derivative d^2f/dk^2
        ff_prime = np.sum(y_k_ln_y * ln_y)
        fg_prime = ff
        f_prime = (ff_prime/fg - (ff/fg * fg_prime/fg)) + (1. / (k*k))

        # Newton-Raphson method k = k - f(k;x)/f'(k;x)
        k -= f/f_prime

        if np.isnan(f):
            print("Weibull 2 params fit warning. Fitting the 2 parameters Weibull is impossible. Returning NaN for shape and scale")
            return np.nan, np.nan
        if abs(k - k_t_1) < eps:
            break
        # Update the MLE
        k_t_1 = k

    lam = np.mean(y ** k) ** (1.0 / k)
    return k, lam

@njit
def test_nb():
    arr = weib_2params_sampl_nb(*np.random.uniform(.1, 6, 2), num_samples=15000)
    weib_2params_nb(arr)

The results from running the tests (in a notebook) are:

%timeit test
"21.3 ns ± 0.353 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)"
%timeit test_nb # with numba already compiled
"23.3 ns ± 0.491 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)"

Actually, both functions “weib_2params_sampl_nb” and “weib_2params_nb” are a bit slower than the Python/NumPy versions.

Is there anything I can do to increase speed?

Thanks a lot guys!
Best!

Hi @joueswant, thanks for the clear post with a full reproducer.
When benchmarking something that takes nanoseconds, your measurement is distorted by small overhead costs. Surely you care about more than nanoseconds, and I’m guessing your real use-case is larger in some dimension, but I cannot guess which one (larger num_samples? weib_2params_sampl_nb inside another loop?).
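As an aside, and this is only a guess from the ~20 ns figures: %timeit test without parentheses only measures looking up the name test, not executing the function. To benchmark the actual call you would write:

%timeit test()     # times the full call: sampling plus fitting
%timeit test_nb()  # same for the compiled version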
If you could make an example that takes 10-30 seconds, the next step would be to profile the pure-Python version. I suspect a big part of the time is taken by np.random.weibull, which is very likely already implemented in C. I couldn’t 100% confirm that by looking at NumPy’s code, but my gut feeling is that it is indeed in C. If that’s the case, i.e. most of the time is spent in weibull and that is already in C, then there is not much more that Numba can do.
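For example, a minimal profiling sketch (test_big is a hypothetical larger reproducer; shape 2.0, scale 1.0 and 10_000_000 samples are arbitrary choices to make the run long enough):

def test_big():
    # larger pure-Python run so the profile reflects real work
    arr = weib_2params_sampl(2.0, 1.0, num_samples=10_000_000)
    return weib_2params(arr)

%prun test_big()  # per-function timings, including the weibull sampler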
If, however, a big part of the time is spent in weib_2params_nb, then there are a few things that could be done. For example, the part below could be rewritten as an explicit loop. Depending on the size of y you could save some time on memory allocations, and maybe even achieve SIMD parallelism (not easy though).

        y_k = y ** k
        y_k_ln_y = y_k * ln_y
        ff = np.sum(y_k_ln_y)
        fg = np.sum(y_k)
        f = ff / fg - np.mean(ln_y) - (1. / k)

        # Calculate second derivative d^2f/dk^2
        ff_prime = np.sum(y_k_ln_y * ln_y)
        fg_prime = ff
        f_prime = (ff_prime/fg - (ff/fg * fg_prime/fg)) + (1. / (k*k))

Hope this helps,
Luk


Hi,

Thanks a lot for your quick reply. You are right: the first part of the code, np.random.weibull, seems to perform slightly better with NumPy than with Numba. See the image below, with num_samples drastically increased.

Regarding the second part of the code, forgive me for my ignorance, but what is an explicit loop?

Thanks a lot!!

No worries @joueswant, this is how it works:

y_k = y ** k is the vectorized way of writing this explicit loop:

y_k = np.zeros_like(y)
for i in range(len(y)):
    y_k[i] = y[i] ** k

It’s shorter and more convenient, but it can be slightly inefficient, and sometimes very inefficient. You could try something like this:

    for t in range(iters):
        ff = 0.0
        fg = 0.0
        ff_prime = 0.0
        for i in range(len(y)):
            y_k = y[i] ** k
            fg += y_k
            y_k_ln_y = y_k * ln_y[i]
            ff += y_k_ln_y
            ff_prime += y_k_ln_y * ln_y[i]
            
        f = ff / fg - np.mean(ln_y) - (1. / k)

        # Calculate second derivative d^2f/dk^2
        fg_prime = ff
        f_prime = (ff_prime/fg - (ff/fg * fg_prime/fg)) + (1. / (k*k))

        # Newton-Raphson method k = k - f(k;x)/f'(k;x)
        k -= f/f_prime

Please check my math, I might have made some mistakes in the substitutions, as I did them a bit quickly. I’m sure you get the idea: replace every calculation that operates on a vector (like sum, which is an implicit loop) with an operation inside an explicit loop (for i ...).
This could make a difference, depending on several factors. Sometimes you just have to try it and see if it works.
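For instance, a quick comparison could look like this (a sketch: weib_2params_loop_nb is a hypothetical name for a compiled version using the explicit loop above):

arr = weib_2params_sampl(2.0, 1.0, num_samples=15000)

# call each once first so compilation time is excluded
weib_2params_nb(arr)
weib_2params_loop_nb(arr)

%timeit weib_2params_nb(arr)       # vectorized body
%timeit weib_2params_loop_nb(arr)  # explicit-loop body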

Luk


Thanks a lot Luk!
I double-checked the math and tested it, and it's all fine. Good job!
Numba is not accelerating the code that much with the explicit loop (after a few benchmarks, the improvement is about 5% in speed, and in some cases it is about 3% slower). I guess NumPy is pretty well optimized in this particular case.

In any case, the std of the benchmarks looks favorable to Numba (higher stability, I guess), so who knows, maybe if I njit more code I will see the difference. It's a good start!

Thanks for your help and prompt answer!
Best!

Have you checked how much time is spent in weib_2params_sampl and how much in weib_2params? I am sure Numba will accelerate weib_2params, but if it's only a small part of the total time, it won't make a large difference in the end.
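Something like this would show the split (a sketch; shape 2.0 and scale 1.0 are arbitrary values):

arr = weib_2params_sampl(2.0, 1.0, num_samples=15000)

%timeit weib_2params_sampl(2.0, 1.0, num_samples=15000)  # sampling only
%timeit weib_2params(arr)                                # fitting only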

Yes, the overall improvement comes from weib_2params. As you commented, since weib_2params_sampl is just np.random.weibull, which is already optimized, there is not much improvement there. Check the image:


I can share the notebook with you if you are interested, but I believe (with my limited experience in Numba) we have reached the maximum speed we could.

Yes, based on those numbers it seems you have reached the maximum single-threaded speed. There is always the option of parallelization (via Numba's prange, Dask, or joblib).
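For example, if you need to fit many independent samples, a minimal prange sketch could look like this (fit_many is a hypothetical helper, assuming the weib_2params_nb from above):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def fit_many(samples):
    # samples: 2-D array with one distribution per row
    n = samples.shape[0]
    out = np.empty((n, 2))
    for i in prange(n):  # rows are fitted on different threads
        k, lam = weib_2params_nb(samples[i])
        out[i, 0] = k
        out[i, 1] = lam
    return out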