Why am I getting different performance speeds for the "same" decorator?


I am using just the @jit decorator and getting the warning “Compilation is falling back to object mode WITH looplifting enabled etc”

I run again and get a good run time 0.037s.

I like the object mode with loop lifting enabled because my function has some python types that I need but also a loop that can be optimized.

So to get rid of the warning I decorated my function with @jit(forceobj=True, looplift=True). I was expecting the same time, but I get 0.167s. Why is the speed reduced? Is Numba not doing the same thing?

Any help would be greatly appreciated.



When profiling jit-decorated functions, the first run is always significantly slower than subsequent runs. This is because Numba compiles the function on the first call, and this can take some time. From the second call onwards, the already-compiled function is executed directly:

In [1]: import numba as nb

In [2]: @nb.njit
   ...: def foo():
   ...:     acc = 0.0
   ...:     for i in range(1000000):
   ...:         acc += i
   ...:     return i

In [3]: %time foo()
CPU times: user 135 ms, sys: 32.5 ms, total: 168 ms
Wall time: 211 ms
Out[3]: 999999

In [4]: %time foo()
CPU times: user 91 µs, sys: 1 µs, total: 92 µs
Wall time: 92.3 µs
Out[4]: 999999


Thank you for replying to my query.
Yes I was aware of the compilation time so I know it is not that.

import time
import numpy as np
from numba import jit, njit

def kernel(zr, zi, cr, ci, lim, cutoff):
    count = 0
    while ((zr*zr + zi*zi) < (lim*lim)) and count < cutoff:
        zr, zi = zr * zr - zi * zi + cr, 2 * zr * zi + ci
        count += 1
    return count
kernel_njit = njit()(kernel)
def plot_mandel(mandel):
def compute_mandel_py(cr, ci, N, bound=1.0, lim=1000.0, cutoff=1e6):
    mandel = np.empty((N, N), dtype=int)
    grid_x = np.linspace(-bound, bound, N)
    t0 = time.time()
    for i, x in enumerate(grid_x):
        for j, y in enumerate(grid_x):
            mandel[i,j] = kernel(x, y, cr, ci, lim, cutoff)
    return mandel, time.time() - t0
def compute_mandel_njit(cr, ci, N, bound=1.0, lim=1000.0, cutoff=1e6):
    mandel = np.empty((N, N))
    grid_x = np.linspace(-bound, bound, N)
    t0 = time.time()
    for i, x in enumerate(grid_x):
        for j, y in enumerate(grid_x):
            mandel[i,j] = kernel_njit(x, y, cr, ci, lim, cutoff)
    return mandel, time.time() - t0
compute_mandel_njit_jit1 = jit()(compute_mandel_njit)
compute_mandel_njit_jit2 = jit(forceobj=True, looplift=True)(compute_mandel_njit)
def python_run():
    kwargs = dict(cr=0.285, ci=0.01,
    print("Using pure Python")
    mandel_func = compute_mandel_py       
    mandel_set, runtime = mandel_func(**kwargs)
    print("Mandelbrot set generated in {} seconds".format(runtime))
def njit_run():
    kwargs = dict(cr=0.285, ci=0.01,
    print("Using njitted kernel")
    mandel_func = compute_mandel_njit       
    mandel_set, runtime = mandel_func(**kwargs)
    print("Mandelbrot set generated in {} seconds".format(runtime))
def njit_jit_run1():
    kwargs = dict(cr=0.285, ci=0.01,
    print("Using njitted kernel and jitted compute function")
    mandel_func = compute_mandel_njit_jit1       
    mandel_set, runtime = mandel_func(**kwargs)
    print("Mandelbrot set generated in {} seconds".format(runtime))
def njit_jit_run2():
    kwargs = dict(cr=0.285, ci=0.01,
    print("Using njitted kernel and jitted compute function in object mode & looplift")
    mandel_func = compute_mandel_njit_jit2       
    mandel_set, runtime = mandel_func(**kwargs)
    print("Mandelbrot set generated in {} seconds".format(runtime))

Then, running each of these at least twice (to account for compilation), I get these times:

Using njitted kernel
Mandelbrot set generated in 0.15392279624938965 seconds
Using njitted kernel and jitted compute function
Mandelbrot set generated in 0.028262853622436523 seconds
Using njitted kernel and jitted compute function in object mode & looplift
Mandelbrot set generated in 0.15626192092895508 seconds

What I don’t understand is the difference between these two:
compute_mandel_njit_jit1 = jit()(compute_mandel_njit)
compute_mandel_njit_jit2 = jit(forceobj=True, looplift=True)(compute_mandel_njit)

Why is the first (plain jit, no options set) much faster, when the warning says it is using object mode with loop lifting enabled? If that were true, both functions should give similar performance.

Is my question clearer now?

Thanks again for replying.


Hi @fionnualasolomon

I agree with your point that those timing differences seem a bit odd if one expects the functions to work the same - and interestingly enough I cannot reproduce this behaviour on my own system. For me all implementations run in about 150 ms.

I wonder if this has something to do with certain library versions or hardware - would you mind sharing your output of numba -s?

Here is mine for comparison

No errors reported.

Warning log
Warning (cuda): CUDA driver library cannot be found or no CUDA enabled devices are present.
Exception class: <class 'numba.cuda.cudadrv.error.CudaSupportError'>
Warning (roc): Error initialising ROC: No ROC toolchains found.
Warning (roc): No HSA Agents found, encountered exception when searching: Error at driver init:
NUMBA_HSA_DRIVER /opt/rocm/lib/libhsa-runtime64.so is not a valid file path. Note it must be a filepath of the .so/.dll/.dylib or the driver:

If requested, please copy and paste the information between
the dashed (----) lines, or from a given specific section as

IMPORTANT: Please ensure that you are happy with sharing the
contents of the information present, any information that you
wish to keep private you should remove before sharing.

Hi @Hannes,

Thank you for responding. I was using JupyterLab and thought that might have been contributing, but even when run as a single script from my terminal I was getting different timings.

No errors reported.

__Warning log__
Warning (cuda): CUDA driver library cannot be found or no CUDA enabled devices are present.
Exception class: <class 'numba.cuda.cudadrv.error.CudaSupportError'>
Warning (roc): Error initialising ROC: No ROC toolchains found.
Warning (roc): No HSA Agents found, encountered exception when searching: Error at driver init: 

HSA is not currently supported on this platform (darwin).
Warning (psutil): psutil cannot be imported. For more accuracy, consider installing it.
If requested, please copy and paste the information between
the dashed (----) lines, or from a given specific section as

IMPORTANT: Please ensure that you are happy with sharing the
contents of the information present, any information that you
wish to keep private you should remove before sharing.

That is odd you get all the same timings - that is what I would expect. :woman_shrugging:


The most obvious difference I see is that you are using an older numba / llvmlite version.

I find your timing for the explicit object mode / looplift run almost suspiciously fast.
On my own machine I tried removing the timer from the jitted function, compiling it in nopython mode, and measuring the time outside the nopython-compiled function. That should be very close to optimal, since object mode should slow things down a little.
Even with that setup I cannot get the run time much below 150 ms. I don’t know why the explicit looplifting version would be almost 10 times faster on your machine.

Have you checked whether that version actually returns the correct result? Maybe something is buggy and the loop is cut short or similar.


I’m not sure that calling time.time() inside an njitted function can be relied on - optimizations might move code around inside the function so that the code being timed may not correlate with the same code from within the source.

If you move your timing outside the jitted function, do you get more consistent results? (Also, object mode + loop lifting might not be needed if the timing is moved outside the jitted function)
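A minimal sketch of what timing from the outside could look like. The helper name `time_call` and the stand-in workload `sum_to` are made up for illustration and are not part of Numba; any jitted function could be passed in place of `sum_to`. The warm-up call is there so that compilation time is not counted.

```python
import time

def time_call(func, *args, repeats=5, **kwargs):
    """Hypothetical helper: time func from the outside, after one warm-up call."""
    # Warm-up call: if func is jit-decorated, compilation happens here
    result = func(*args, **kwargs)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - t0)
    return result, best

def sum_to(n):
    # Stand-in workload; replace with the jitted function under test
    acc = 0
    for i in range(n):
        acc += i
    return acc

value, seconds = time_call(sum_to, 100_000)
print(f"result={value}, best of 5 runs: {seconds:.6f} s")
```

Because the timer wraps the whole call, the measurement does not depend on how the code inside the function is compiled or reordered.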


We have the same LLVM version, no? 10.0.1? And I have numba 0.51.2, which I had thought was the most up to date.

I have also now removed the time.time from within the function and get these times:

Using njitted kernel and jitted compute function
CPU times: user 34.3 ms, sys: 694 µs, total: 35 ms
Wall time: 35.1 ms
Using njitted kernel and njitted compute function
CPU times: user 29 ms, sys: 283 µs, total: 29.3 ms
Wall time: 29.4 ms
Using njitted kernel and jitted compute function in object mode & looplift
CPU times: user 174 ms, sys: 1.33 ms, total: 175 ms
Wall time: 175 ms

These are consistent with what I expect: in the first one, jit is choosing nopython mode; the second one is forced to use nopython mode, and both have similar times; the third is forced to use object mode and so is slower.

So I think, despite the warning saying it was running in object mode with looplifting, it was somehow running in nopython mode or at least getting the speed of nopython mode when the time.time was included in the function.

The Mandelbrot set images they all generated were correct too.

Hi @gmarkall,

Yes, I did get much more consistent results after removing the time.time (see my other reply to @Hannes).

The time.time in the function was definitely confusing things :sweat_smile:


Glad things seem more consistent now :slight_smile:

The latest version at the moment is 0.52.0 - though, 0.53.0 should be out in a few days.


Seems like Graham’s guess hit dead center - now everything looks right, I’d say :slight_smile:

Just FYI: numba is currently at version 0.52 with version 0.53RC2 recently released. And I was referring to the version of llvmlite - a small Python wrapper used by numba to interact with LLVM afaik.

EDIT: Graham, you beat me by like 2 minutes :stuck_out_tongue:

oh ok thanks for letting me know (@hannes too). I’m all up to date now and will watch for the new updates. Cheers for the help :smiley: