Tips or advice for a function which is slower in Numba

Hi amazing team!
I am trying to overcome a bottleneck in a process by implementing Numba’s njit. There are two simple functions in this bottleneck, both ready to work with Numba. See the test with and without njit below:

import numpy as np
from numba import njit  # numba.__version__ returns 0.55.1


def weib_2params_sampl(shape, scale, num_samples=1):
    arr = np.random.weibull(shape, size=num_samples)
    return scale * arr

def weib_2params(y, iters=1000, eps=1e-6):
    ## Checkers
    # Check for distribution ndim
    if len(y.shape) != 1:
        print("Weibull 2 params fit Error. Function requires 'y' to have ndim==1")
        return np.nan, np.nan
    # Check for nans ~ automatically erase them
    if np.isnan(y).any():
        y = np.ravel(y)[~np.isnan(y)]
    
    ## 2 params-Weibull fit k via MLE
    # Initialization
    ln_y = np.log(y)
    k = 1.
    k_t_1 = k

    for t in range(iters):
        y_k = y ** k
        y_k_ln_y = y_k * ln_y
        ff = np.sum(y_k_ln_y)
        fg = np.sum(y_k)
        f = ff / fg - np.mean(ln_y) - (1. / k)

        # Calculate second derivative d^2f/dk^2
        ff_prime = np.sum(y_k_ln_y * ln_y)
        fg_prime = ff
        f_prime = (ff_prime/fg - (ff/fg * fg_prime/fg)) + (1. / (k*k))

        # Newton-Raphson method k = k - f(k;x)/f'(k;x)
        k -= f/f_prime

        if np.isnan(f):
            print("Weibull 2 params fit warning. Fitting the 2 parameters Weibull is impossible. Returning NaN for shape and scale")
            return np.nan, np.nan
        if abs(k - k_t_1) < eps:
            break
        # Update the MLE
        k_t_1 = k

    lam = np.mean(y ** k) ** (1.0 / k)
    return k, lam

def test():
    arr = weib_2params_sampl(*np.random.uniform(.1, 6, 2), num_samples=15000)
    weib_2params(arr)


######## Numba implementation

@njit
def weib_2params_sampl_nb(shape, scale, num_samples=1):
    return scale * np.random.weibull(shape, size=num_samples)

@njit
def weib_2params_nb(y, iters=1000, eps=1e-6):
    ## Checkers
    # Check for distribution ndim
    if len(y.shape) != 1:
        print("Weibull 2 params fit Error. Function requires 'y' to have ndim==1")
        return np.nan, np.nan
    # Check for nans ~ automatically erase them
    if np.isnan(y).any():
        y = np.ravel(y)[~np.isnan(y)]
    
    ## 2 params-Weibull fit k via MLE
    # Initialization
    ln_y = np.log(y)
    k = 1.
    k_t_1 = k

    for t in range(iters):
        y_k = y ** k
        y_k_ln_y = y_k * ln_y
        ff = np.sum(y_k_ln_y)
        fg = np.sum(y_k)
        f = ff / fg - np.mean(ln_y) - (1. / k)

        # Calculate second derivative d^2f/dk^2
        ff_prime = np.sum(y_k_ln_y * ln_y)
        fg_prime = ff
        f_prime = (ff_prime/fg - (ff/fg * fg_prime/fg)) + (1. / (k*k))

        # Newton-Raphson method k = k - f(k;x)/f'(k;x)
        k -= f/f_prime

        if np.isnan(f):
            print("Weibull 2 params fit warning. Fitting the 2 parameters Weibull is impossible. Returning NaN for shape and scale")
            return np.nan, np.nan
        if abs(k - k_t_1) < eps:
            break
        # Update the MLE
        k_t_1 = k

    lam = np.mean(y ** k) ** (1.0 / k)
    return k, lam

@njit
def test_nb():
    arr = weib_2params_sampl_nb(*np.random.uniform(.1, 6, 2), num_samples=15000)
    weib_2params_nb(arr)

The results from running the tests (in a notebook) are:

%timeit test
"21.3 ns ± 0.353 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)"
%timeit test_nb # with numba already compiled
"23.3 ns ± 0.491 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)"

Actually, both functions “weib_2params_sampl_nb” and “weib_2params_nb” are a bit slower than the Python/NumPy versions.

Is there anything I can do to increase speed?

Thanks a lot guys!
Best!

Hi @joueswant, thanks for the clear post with a full reproducer.
When benchmarking something that takes nanoseconds, your measurement is distorted by small overhead costs. Surely you care about more than nanoseconds, and I’m guessing your real use-case is larger in some dimension, but I cannot guess which one (larger num_samples? weib_2params_sampl_nb inside another loop?).
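As an aside, and this is only a guess from the ~20 ns figures: %timeit test without parentheses only measures looking up the name test, not executing the function. To benchmark the actual call you would write:

%timeit test()     # times the full call: sampling plus fitting
%timeit test_nb()  # same for the compiled version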
If you could make an example that takes 10-30 seconds, the next step would be to profile the pure-Python version. I suspect a big part of the time is taken by np.random.weibull, which is very likely already implemented in C. I couldn’t 100% confirm that by looking at NumPy’s code, but my gut feeling is that it is indeed in C. If that’s the case, i.e. most of the time is spent in weibull and that is already in C, then there is not much more that Numba can do.
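For example, a minimal profiling sketch (test_big is a hypothetical larger reproducer; shape 2.0, scale 1.0 and 10_000_000 samples are arbitrary choices to make the run long enough):

def test_big():
    # larger pure-Python run so the profile reflects real work
    arr = weib_2params_sampl(2.0, 1.0, num_samples=10_000_000)
    return weib_2params(arr)

%prun test_big()  # per-function timings, including the weibull sampler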
If, however, a big part of the time is spent in weib_2params_nb, then there are a few things that could be done. For example, the part below could be rewritten as an explicit loop. Depending on the size of y you could save some time on memory allocations, and maybe even achieve SIMD parallelism (not easy though).

        y_k = y ** k
        y_k_ln_y = y_k * ln_y
        ff = np.sum(y_k_ln_y)
        fg = np.sum(y_k)
        f = ff / fg - np.mean(ln_y) - (1. / k)

        # Calculate second derivative d^2f/dk^2
        ff_prime = np.sum(y_k_ln_y * ln_y)
        fg_prime = ff
        f_prime = (ff_prime/fg - (ff/fg * fg_prime/fg)) + (1. / (k*k))

Hope this helps,
Luk


Hi,

Thanks a lot for your quick reply. You are right: the first part of the code, np.random.weibull, seems to perform slightly better with NumPy than with Numba. See the image below, with num_samples drastically increased.

Regarding the second part of the code, forgive me for my ignorance, but what is an explicit loop?

Thanks a lot!!

No worries @joueswant, this is how it works:

y_k = y ** k is the vectorized way of writing this explicit loop:

y_k = np.zeros_like(y)
for i in range(len(y)):
    y_k[i] = y[i] ** k

It’s shorter and more convenient, but it can be slightly inefficient, and sometimes very inefficient. You could try something like this:

    for t in range(iters):
        ff = 0.0
        fg = 0.0
        ff_prime = 0.0
        for i in range(len(y)):
            y_k = y[i] ** k
            fg += y_k
            y_k_ln_y = y_k * ln_y[i]
            ff += y_k_ln_y
            ff_prime += y_k_ln_y * ln_y[i]
            
        f = ff / fg - np.mean(ln_y) - (1. / k)

        # Calculate second derivative d^2f/dk^2
        fg_prime = ff
        f_prime = (ff_prime/fg - (ff/fg * fg_prime/fg)) + (1. / (k*k))

        # Newton-Raphson method k = k - f(k;x)/f'(k;x)
        k -= f/f_prime

Please check my math, I might have made some mistakes in the substitutions, as I did them a bit quickly. I’m sure you get the idea: replace every calculation that operates on a vector (like sum, which is an implicit loop) with an operation inside an explicit loop (for i ...).
This could make a difference, depending on several factors. Sometimes you just have to try it and see if it works.
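For instance, a quick comparison could look like this (a sketch: weib_2params_loop_nb is a hypothetical name for a compiled version using the explicit loop above):

arr = weib_2params_sampl(2.0, 1.0, num_samples=15000)

# call each once first so compilation time is excluded
weib_2params_nb(arr)
weib_2params_loop_nb(arr)

%timeit weib_2params_nb(arr)       # vectorized body
%timeit weib_2params_loop_nb(arr)  # explicit-loop body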

Luk


Thanks a lot Luk!
I double-checked the math and tested it, and it's all fine. Good job!
Numba is not accelerating the code that much with the explicit loop (after a few benchmarks, the improvement is about 5% in speed, and in some cases it is about 3% slower). I guess NumPy is pretty well optimized in this particular case.

In any case, the std of the benchmarks looks favorable to Numba (higher stability, I guess), so who knows, maybe if I njit more code I will see the difference. It's a good start!

Thanks for your help and prompt answer!
Best!

Have you checked how much time is spent in weib_2params_sampl and how much in weib_2params? I am sure Numba will accelerate weib_2params, but if it's only a small part of the total time, it won't make a large difference in the end.
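Something like this would show the split (a sketch; shape 2.0 and scale 1.0 are arbitrary values):

arr = weib_2params_sampl(2.0, 1.0, num_samples=15000)

%timeit weib_2params_sampl(2.0, 1.0, num_samples=15000)  # sampling only
%timeit weib_2params(arr)                                # fitting only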

Yes, the overall improvement comes from weib_2params. As you commented, since weib_2params_sampl is just np.random.weibull, which is already optimized, there is not much improvement there. Check the image:


I can share the notebook with you if you are interested, but I believe (with my limited experience in Numba) we have reached the maximum speed we could.

Yes, based on those numbers it seems you have reached the maximum single-threaded speed. There is always the option of parallelization (via Numba's prange, Dask, or joblib).
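For example, if you need to fit many independent samples, a minimal prange sketch could look like this (fit_many is a hypothetical helper, assuming the weib_2params_nb from above):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def fit_many(samples):
    # samples: 2-D array with one distribution per row
    n = samples.shape[0]
    out = np.empty((n, 2))
    for i in prange(n):  # rows are fitted on different threads
        k, lam = weib_2params_nb(samples[i])
        out[i, 0] = k
        out[i, 1] = lam
    return out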