This is a similar topic to my previous post A huge performance penalty for this simple function, but with a different code example. The objective here is to develop a faster version of atan2()
at the cost of accuracy. The version shown here has the worst approximation error of about 4° and median error below 3°, which is acceptable for my use case. Here is a test code, which places atan2()
in a context it will be ultimately used with:
import numpy as np
import numba as nb
import timeit as ti
from math import pi, atan2, sqrt
@nb.njit(fastmath=True, locals=dict(w=nb.uint32))
def test1(im, grad, a, w):
for i in range(im.size-w-1):
dx = im[i+w+1]-im[i]
dy = im[i+w]-im[i+1]
grad[i] = sqrt(dx**2+dy**2)
a[i] = atan2(dy, dx)
w, h = 640, 480
im = np.random.rand(w*h).astype('f4')
grad = np.empty_like(im)
a = np.empty_like(im, dtype='f4')
fun = f'test1(im, grad, a, w)'
t = 1000 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=100))
print(f'{fun}: {np.amin(t):6.3f}ms {np.median(t):6.3f}ms {np.sum(grad)}')
test1()
executes in 8.37ms on my PC running Python 3.10.6 and numba 0.56.4 on Ubuntu 22.04.1. A C++ equivalent is about as fast, executing in 8.22ms when compiled with gcc 12.1.0. So far, so good.
When benchmarking with an atan2fast
substituting for atan2
:
c = pi/4
@nb.njit(fastmath=True, inline='always', locals=dict(x=nb.float32, y=nb.float32, c=nb.float32))
def atan2fast(y, x):
if y == 0 and x == 0:
return 0
if x >= 0:
angle = c*(1-(x-abs(y))/(x+abs(y)))
else:
angle = c*(3-(x+abs(y))/(abs(y)-x))
return -angle if y < 0 else angle
I’m getting a huge performance discrepancy:
- Python: 2.69ms
- C++: 0.338ms
Here is the C++ version:
inline constexpr float c = numbers::pi_v<float>/4;
static inline float atan2fast(float y, float x) {
float angle;
if (y == 0 && x == 0)
return 0;
if (x >= 0)
angle = c*(1-(x-abs(y))/(x+abs(y)));
else
angle = c*(3-(x+abs(y))/(abs(y)-x));
return (y < 0) ? -angle : angle;
}
I tried several simple code transformations of atan2fast
and performance gap didn’t budge.