How to turn off wraparound?

Wraparound checking is quite costly and often unnecessary, provided the programmer can guarantee that negative indices never occur.
In principle this could be detected automatically in simple cases, but the main question is: how can it be turned off completely?

Example

import numpy as np
import numba as nb

A=np.random.rand(1_000_000)
res=np.zeros(1_000_000)

@nb.njit()
def func_1(A,res):
    for i in range(1,A.shape[0]-1):
        res[i]=A[i-1]+A[i+1]
    return res

@nb.njit()
def func_2(A,res):
    for i in range(A.shape[0]-2):
        res[i+1]=A[i]+A[i+2]
    return res

#Unnecessary wraparound check enabled
%timeit func_1(A,res)
#1.36 ms ± 7.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#No wraparound
%timeit func_2(A,res)
#637 µs ± 6.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Hi @max9111, I think it depends on what you mean by “completely”.

You can “disable” it for any given loop by casting the loop variable to an unsigned integer, as described in this discussion: Uint64 vs int64 indexing performance difference
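Roughly, a sketch along these lines (using the arrays from your example; the function name is just for illustration):

import numpy as np
import numba as nb

A = np.random.rand(1_000_000)
res = np.zeros(1_000_000)

@nb.njit()
def func_1_unsigned(A, res):
    for i in range(1, A.shape[0]-1):
        # Casting each index to an unsigned type tells the compiler the index
        # cannot be negative, so the wraparound handling can be dropped.
        res[nb.uint64(i)] = A[nb.uint64(i-1)] + A[nb.uint64(i+1)]
    return res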

I’m not aware of a way to turn it off globally.

Luk

Thanks @luk-f-a, but I dug a bit further. It looks like the main performance impact comes from loops that do not start at zero.

Example

import numpy as np
import numba as nb

A=np.random.rand(1_000_000)
res=np.zeros(1_000_000)

@nb.njit()
def func_1(A,res):
    for i in range(1,A.shape[0]-1):
        res[i]=A[i]+A[i+1]
    return res

@nb.njit()
def func_2(A,res):
    for i in range(A.shape[0]-2):
        res[i+1]=A[i+1]+A[i+2]
    return res

#Loop starts at one (slow)
%timeit func_1(A,res)
#1.36 ms ± 7.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#Expected performance
%timeit func_2(A,res)
#637 µs ± 6.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Another example: workarounds 1 & 2 work as expected, while the code in the question is about a factor of 10 slower. Maybe I should open an issue, or is this behavior somehow expected?

Hi @max9111

Unfortunately, I cannot tell you what happened in your case. However, it might be a useful hint that I could not reproduce your results (Numba 0.56.0 and Python 3.9.2): I get exactly the same timings for both functions. In any case, make sure that both functions have already been called before you time them. As you probably know, the functions are compiled on the first call, and there is also a relatively high overhead (some 100 ms) for the very first call of a jitted function.
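For example, a minimal warm-up sketch with your functions:

func_1(A, res)   # first call triggers compilation; keep it out of the measurement
func_2(A, res)

%timeit func_1(A, res)
%timeit func_2(A, res)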

I also cannot reproduce on RHEL 7 / Python 3.7.9 / Numba 0.50.1.

It was a version issue

The simple example above was run on version 0.53, which was used because of these issues:
https://github.com/numba/numba/issues/8172
https://github.com/numba/numba/issues/8398

On newer versions (0.56), both simple examples work as expected.

Actual code on version 0.53:

import numba as nb
import numpy as np

float_type = np.float32
#float_type = np.float64

itot = 384
jtot = 384
ktot = 384
ncells = itot*jtot*ktot

at = np.zeros((ktot, jtot, itot), dtype=float_type)

a = np.random.rand(ktot, jtot, itot)
a = a.astype(float_type)
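
# The scalar parameters were not given in the post; placeholder values so the
# snippet runs stand-alone:
visc = float_type(0.1)
dxidxi = float_type(0.1)
dyidyi = float_type(0.1)
dzidzi = float_type(0.1)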

@nb.njit(["(float32[:,:,::1])(float32[:,:,::1], float32[:,:,::1], float32, float32, float32, float32)",
          "(float64[:,:,::1])(float64[:,:,::1], float64[:,:,::1], float64, float64, float64, float64)"],)
def diff_1(at, a, visc, dxidxi, dyidyi, dzidzi):
    ktot, jtot, itot=at.shape
    for k in range(1, ktot-1):
        for j in range(1, jtot-1):
            for i in range(1, itot-1):
                at[k, j, i] = visc * ( 
                        + ( (a[k+1, j  , i  ] - a[k  , j  , i  ])  
                          - (a[k  , j  , i  ] - a[k-1, j  , i  ]) ) * dxidxi
                        + ( (a[k  , j+1, i  ] - a[k  , j  , i  ])  
                          - (a[k  , j  , i  ] - a[k  , j-1, i  ]) ) * dyidyi
                        + ( (a[k  , j  , i+1] - a[k  , j  , i  ])  
                          - (a[k  , j  , i  ] - a[k  , j  , i-1]) ) * dzidzi )
    return at

@nb.njit(["(float32[:,:,::1])(float32[:,:,::1], float32[:,:,::1], float32, float32, float32, float32)",
          "(float64[:,:,::1])(float64[:,:,::1], float64[:,:,::1], float64, float64, float64, float64)"])
def diff_2(at, a, visc, dxidxi, dyidyi, dzidzi):
    ktot, jtot, itot=at.shape
    for k in range(ktot-2):
        for j in range(jtot-2):
            for i in range(itot-2):
                at[k+1, j+1, i+1] += visc * ( 
                        + ( (a[k+2, j+1, i+1] - a[k+1, j+1 , i+1])  
                          - (a[k+1, j+1, i+1] - a[k  , j+1 , i+1]) ) * dxidxi
                        + ( (a[k+1, j+2, i+1] - a[k+1, j+1 , i+1])  
                          - (a[k+1, j+1, i+1] - a[k+1, j   , i+1]) ) * dyidyi
                        + ( (a[k+1, j+1, i+2] - a[k+1, j+1 , i+1])  
                          - (a[k+1, j+1, i+1] - a[k+1, j+1 , i  ]) ) * dzidzi )
    return at
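
The timings below are for calls along the lines of (using the placeholder scalars defined above):

%timeit diff_1(at, a, visc, dxidxi, dyidyi, dzidzi)
%timeit diff_2(at, a, visc, dxidxi, dyidyi, dzidzi)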

There, only diff_2 performed as expected (~30 ms); diff_1 was much slower at ~200 ms. That is why I suspected a wraparound problem.

Version 0.56

Both implementations are quite slow, which wasn't unexpected given the issues above.
With this fix https://github.com/numba/numba/issues/8172#issuecomment-1160474583, both implementations show the expected performance of ~30 ms.

Dear @max9111

Glad you found the issue and thanks for the clarification.

There is also another way to fix this problem in Numba 0.56 that should not increase compile time:

@nb.njit(..., locals={"k": nb.uint32, "j": nb.uint32, "i": nb.uint32})
def diff_2(at, a, visc, dxidxi, dyidyi, dzidzi):
    ...
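
Applied to the simpler 1D example from earlier in the thread, the idea looks roughly like this (a sketch, not your exact code):

import numpy as np
import numba as nb

A = np.random.rand(1_000_000)
res = np.zeros(1_000_000)

# With "i" forced to an unsigned type, every index used here (i, i+1, i+2) is
# provably non-negative, so the compiler can drop the wraparound handling.
@nb.njit(locals={"i": nb.uint32})
def func_2_locals(A, res):
    for i in range(A.shape[0]-2):
        res[i+1] = A[i] + A[i+2]
    return res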

Interesting, but this only works for diff_2 and not fully for diff_1. I am also wondering why locals={"k": nb.uint64, "j": nb.uint64, "i": nb.uint64} doesn’t work in both cases.

I guess turning the first O3 optimization pass on is more predictable. The compile times do not differ by much in this example, and cache=True is always an option to avoid long compile times after restarting the interpreter, especially with given signatures.

%timeit diff_1.recompile()
#opt=0 457 ms ± 4.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#opt=2 533 ms ± 6.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#opt=3 545 ms ± 4.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
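
For completeness, one way to experiment with the optimisation level is the documented NUMBA_OPT environment variable, which has to be set before Numba is imported (a sketch; the linked comment above describes the actual workaround in detail):

import os
os.environ["NUMBA_OPT"] = "3"   # optimisation level passed to LLVM; set before importing numba

import numba as nb              # Numba reads NUMBA_OPT at import time
print(nb.config.OPT)            # -> 3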

I suspect the reason it only works for diff_2 is that some indices in diff_1 are potentially negative (e.g. k-1). I don’t know why nb.uint64 doesn’t work. A closer look at the disassembly may bring clarity.
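
Numba exposes the generated code directly, so a first look could be something like:

# Dump the generated assembly for every compiled signature of diff_1
for sig, asm in diff_1.inspect_asm().items():
    print(sig)
    print(asm[:2000])   # the first part is usually enough to spot the inner loop

# diff_1.inspect_llvm() gives the LLVM IR in the same way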

It’s a little bit surprising (to me, at least) that O3 isn’t on by default. Speed is the main reason people use Numba.

I’d think a big performance regression would be classified as a bug to fix, not a feature request.

Unless I’m missing something, which wouldn’t be unusual :slight_smile:

@nelson2005 rest assured that -O3 is on by default, it’s just that it used to effectively get run twice! See Numba issue #8430 for an explanation and the history of the optimisation sequence.

Thanks @stuartarchibald for that pointer. I had read the underlying issues but hadn’t noticed #8430 that pulled it all together.

I’m firmly in the HPC use case: compilation takes over an hour for me already. A little more or less doesn’t make much difference, since the program is run many times with the JIT cache. That naturally puts me in the ‘program should run as fast as possible’ camp :slight_smile: