How to turn off wraparound?

Wraparound checking is quite costly and often unnecessary, provided the programmer can guarantee that negative indices never occur.
In principle this could be detected automatically in simple cases, but the main question is: how can it be turned off completely?

Example

import numpy as np
import numba as nb

A=np.random.rand(1_000_000)
res=np.zeros(1_000_000)

@nb.njit()
def func_1(A,res):
    for i in range(1,A.shape[0]-1):
        res[i]=A[i-1]+A[i+1]
    return res

@nb.njit()
def func_2(A,res):
    for i in range(A.shape[0]-2):
        res[i+1]=A[i]+A[i+2]
    return res

#Unnecessary wraparound check enabled
%timeit func_1(A,res)
#1.36 ms ± 7.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#No wraparound
%timeit func_2(A,res)
#637 µs ± 6.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Hi @max9111, I think it depends on what you mean by “completely”.

You can “disable” it for any given loop by casting the loop variable to an unsigned integer, as described in this discussion: Uint64 vs int64 indexing performance difference
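Roughly, a sketch along these lines (using the arrays from your example; the function name is just for illustration):

import numpy as np
import numba as nb

A = np.random.rand(1_000_000)
res = np.zeros(1_000_000)

@nb.njit()
def func_1_unsigned(A, res):
    for i in range(1, A.shape[0]-1):
        # Casting each index to an unsigned type tells the compiler the index
        # cannot be negative, so the wraparound handling can be dropped.
        res[nb.uint64(i)] = A[nb.uint64(i-1)] + A[nb.uint64(i+1)]
    return res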

I’m not aware of a way to turn it off globally.

Luk

Thanks @luk-f-a, but I dug a bit further. It looks like the main performance impact comes from loops that do not start at zero.

Example

import numpy as np
import numba as nb

A=np.random.rand(1_000_000)
res=np.zeros(1_000_000)

@nb.njit()
def func_1(A,res):
    for i in range(1,A.shape[0]-1):
        res[i]=A[i]+A[i+1]
    return res

@nb.njit()
def func_2(A,res):
    for i in range(A.shape[0]-2):
        res[i+1]=A[i+1]+A[i+2]
    return res

#Loop starts at one (slow)
%timeit func_1(A,res)
#1.36 ms ± 7.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#Expected performance
%timeit func_2(A,res)
#637 µs ± 6.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Another example: workarounds 1 & 2 work as expected, while the code in the question is about a factor of 10 slower. Maybe I should open an issue, or is this behavior somehow expected?

Hi @max9111

Unfortunately, I cannot tell you what happened in your case. However, it might be a useful hint that I could not reproduce your results (Numba 0.56.0 and Python 3.9.2): I get exactly the same timings for both functions. In any case, make sure that both functions have already been called before you time them. As you probably know, the functions are compiled on the first call, and there is also a relatively high overhead (some 100 ms) for the very first call of a jitted function.
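For example, a minimal warm-up sketch with your functions:

func_1(A, res)   # first call triggers compilation; keep it out of the measurement
func_2(A, res)

%timeit func_1(A, res)
%timeit func_2(A, res)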

I also cannot reproduce on RHEL 7 / Python 3.7.9 / Numba 0.50.1.

It was a version issue

The simple example above was run on version 0.53, which was used because of these issues:
https://github.com/numba/numba/issues/8172
https://github.com/numba/numba/issues/8398

On newer versions (0.56), both simple examples work as expected.

Actual code on version 0.53:

import numba as nb
import numpy as np

float_type = np.float32
#float_type = np.float64

itot = 384
jtot = 384
ktot = 384
ncells = itot*jtot*ktot

at = np.zeros((ktot, jtot, itot), dtype=float_type)

a = np.random.rand(ktot, jtot, itot)
a = a.astype(float_type)
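
# The scalar parameters were not given in the post; placeholder values so the
# snippet runs stand-alone:
visc = float_type(0.1)
dxidxi = float_type(0.1)
dyidyi = float_type(0.1)
dzidzi = float_type(0.1)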

@nb.njit(["(float32[:,:,::1])(float32[:,:,::1], float32[:,:,::1], float32, float32, float32, float32)",
          "(float64[:,:,::1])(float64[:,:,::1], float64[:,:,::1], float64, float64, float64, float64)"],)
def diff_1(at, a, visc, dxidxi, dyidyi, dzidzi):
    ktot, jtot, itot=at.shape
    for k in range(1, ktot-1):
        for j in range(1, jtot-1):
            for i in range(1, itot-1):
                at[k, j, i] = visc * ( 
                        + ( (a[k+1, j  , i  ] - a[k  , j  , i  ])  
                          - (a[k  , j  , i  ] - a[k-1, j  , i  ]) ) * dxidxi
                        + ( (a[k  , j+1, i  ] - a[k  , j  , i  ])  
                          - (a[k  , j  , i  ] - a[k  , j-1, i  ]) ) * dyidyi
                        + ( (a[k  , j  , i+1] - a[k  , j  , i  ])  
                          - (a[k  , j  , i  ] - a[k  , j  , i-1]) ) * dzidzi )
    return at

@nb.njit(["(float32[:,:,::1])(float32[:,:,::1], float32[:,:,::1], float32, float32, float32, float32)",
          "(float64[:,:,::1])(float64[:,:,::1], float64[:,:,::1], float64, float64, float64, float64)"])
def diff_2(at, a, visc, dxidxi, dyidyi, dzidzi):
    ktot, jtot, itot=at.shape
    for k in range(ktot-2):
        for j in range(jtot-2):
            for i in range(itot-2):
                at[k+1, j+1, i+1] += visc * ( 
                        + ( (a[k+2, j+1, i+1] - a[k+1, j+1 , i+1])  
                          - (a[k+1, j+1, i+1] - a[k  , j+1 , i+1]) ) * dxidxi
                        + ( (a[k+1, j+2, i+1] - a[k+1, j+1 , i+1])  
                          - (a[k+1, j+1, i+1] - a[k+1, j   , i+1]) ) * dyidyi
                        + ( (a[k+1, j+1, i+2] - a[k+1, j+1 , i+1])  
                          - (a[k+1, j+1, i+1] - a[k+1, j+1 , i  ]) ) * dzidzi )
    return at
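
The timings below are for calls along the lines of (using the placeholder scalars defined above):

%timeit diff_1(at, a, visc, dxidxi, dyidyi, dzidzi)
%timeit diff_2(at, a, visc, dxidxi, dyidyi, dzidzi)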

There, only diff_2 performed as expected (~30 ms); diff_1 was much slower at ~200 ms. That is why I suspected a wraparound problem.

Version 0.56

Both implementations are quite slow, which wasn't unexpected given the issues above.
With this fix https://github.com/numba/numba/issues/8172#issuecomment-1160474583, both implementations show the expected performance of ~30 ms.

Dear @max9111

Glad you found the issue and thanks for the clarification.

There is also another way to fix this problem in Numba 0.56 that should not increase compile time:

@nb.njit(..., locals={"k": nb.uint32, "j": nb.uint32, "i": nb.uint32})
def diff_2(at, a, visc, dxidxi, dyidyi, dzidzi):
    ...
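
Applied to the simpler 1D example from earlier in the thread, the idea looks roughly like this (a sketch, not your exact code):

import numpy as np
import numba as nb

A = np.random.rand(1_000_000)
res = np.zeros(1_000_000)

# With "i" forced to an unsigned type, every index used here (i, i+1, i+2) is
# provably non-negative, so the compiler can drop the wraparound handling.
@nb.njit(locals={"i": nb.uint32})
def func_2_locals(A, res):
    for i in range(A.shape[0]-2):
        res[i+1] = A[i] + A[i+2]
    return res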

Interesting, but this only works for diff_2 and not fully for diff_1. I am also wondering why locals={"k": nb.uint64, "j": nb.uint64, "i": nb.uint64} doesn’t work in both cases.

I guess turning the first O3 optimization pass on is more predictable. The compile times do not differ by much in this example, and cache=True is always an option to avoid long compile times after restarting the interpreter, especially with given signatures.

%timeit diff_1.recompile()
#opt=0 457 ms ± 4.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#opt=2 533 ms ± 6.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#opt=3 545 ms ± 4.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
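
For completeness, one way to experiment with the optimisation level is the documented NUMBA_OPT environment variable, which has to be set before Numba is imported (a sketch; the linked comment above describes the actual workaround in detail):

import os
os.environ["NUMBA_OPT"] = "3"   # optimisation level passed to LLVM; set before importing numba

import numba as nb              # Numba reads NUMBA_OPT at import time
print(nb.config.OPT)            # -> 3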

I suspect the reason it only works for diff_2 is that some indices in diff_1 are potentially negative (e.g. k-1). I don’t know why nb.uint64 doesn’t work. A closer look at the disassembly may bring clarity.
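
Numba exposes the generated code directly, so a first look could be something like:

# Dump the generated assembly for every compiled signature of diff_1
for sig, asm in diff_1.inspect_asm().items():
    print(sig)
    print(asm[:2000])   # the first part is usually enough to spot the inner loop

# diff_1.inspect_llvm() gives the LLVM IR in the same way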

It’s a little bit surprising (to me, at least) that O3 isn’t on by default. Speed is the main reason people use Numba.

I’d think a big performance regression would be classified as a bug to fix, not a feature request.

Unless I’m missing something, which wouldn’t be unusual :slight_smile:

@nelson2005 rest assured that -O3 is on by default, it’s just that it used to effectively get run twice! See Numba issue #8430 for an explanation and the history of the optimisation sequence.

Thanks @stuartarchibald for that pointer. I had read the underlying issues but hadn’t noticed #8430 that pulled it all together.

I’m firmly in the HPC use case: compilation takes over an hour for me already. A little more or less doesn’t make much difference, since the program is run many times with the JIT cache. That naturally puts me in the ‘program should run as fast as possible’ camp :slight_smile: