Mistaken poor optimization of atan2

This is a similar topic to my previous post A huge performance penalty for this simple function, but with a different code example. The objective here is to develop a faster version of atan2() at the cost of accuracy. The version shown here has the worst approximation error of about 4° and median error below 3°, which is acceptable for my use case. Here is a test code, which places atan2() in a context it will be ultimately used with:

import numpy as np
import numba as nb
import timeit as ti
from math import pi, atan2, sqrt


@nb.njit(fastmath=True, locals=dict(w=nb.uint32))
def test1(im, grad, a, w):
  for i in range(im.size-w-1):
    dx = im[i+w+1]-im[i]
    dy = im[i+w]-im[i+1]
    grad[i] = sqrt(dx**2+dy**2)
    a[i] = atan2(dy, dx)


w, h = 640, 480
im = np.random.rand(w*h).astype('f4')
grad = np.empty_like(im)
a = np.empty_like(im, dtype='f4')

fun = f'test1(im, grad, a, w)'
t = 1000 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=100))
print(f'{fun}:  {np.amin(t):6.3f}ms  {np.median(t):6.3f}ms  {np.sum(grad)}')

test1() executes in 8.37ms on my PC running Python 3.10.6 and numba 0.56.4 on Ubuntu 22.04.1. A C++ equivalent is about as fast, executing in 8.22ms when compiled with gcc 12.1.0. So far, so good.

When benchmarking with an atan2fast substituting for atan2:

c = pi/4

@nb.njit(fastmath=True, inline='always', locals=dict(x=nb.float32, y=nb.float32, c=nb.float32))
def atan2fast(y, x): 
  if y == 0 and x == 0:
    return 0

  if x >= 0:
    angle = c*(1-(x-abs(y))/(x+abs(y)))
  else:
    angle = c*(3-(x+abs(y))/(abs(y)-x))

  return -angle if y < 0 else angle

I’m getting a huge performance discrepancy:

  • Python: 2.69ms
  • C++: 0.338ms

Here is the C++ version:

inline constexpr float c = numbers::pi_v<float>/4;

static inline float atan2fast(float y, float x) {
	float angle;

	if (y == 0 && x == 0)
		return 0;

	if (x >= 0)
        angle = c*(1-(x-abs(y))/(x+abs(y)));
	else
		angle = c*(3-(x+abs(y))/(abs(y)-x));

	return (y < 0) ? -angle : angle;
}

I tried several simple code transformations of atan2fast and performance gap didn’t budge.

Well, this was a false alarm. C++ optimizer was eliding most of the computation. After some changes to prevent “cheating”, C++ version clocks at 2.60ms, which is almost the same as Numba. I was thinking about deleting this post, but decided to leave it. Maybe someone will post a faster version of atan2fast.

1 Like

Slightly slower, but more accurate version with maximum error about 0.284°: atan2fast1 takes 3.05ms vs. 2.69ms for atan2fast. Performance is close to C++ version (2.91ms with gcc 11.1.0 and 2.49ms with clang 15.0.7).

@nb.njit(fastmath=True, inline='always', locals=dict(z=nb.float32))
def atanFast(z):
  return (0.97239411-0.19194795*z*z)*z


@nb.njit(fastmath=True, inline='always', locals=dict(x=nb.float32, y=nb.float32))
def atan2fast1(y, x): 
  if x != 0:
    if abs(x) > abs(y):
      if x > 0:
        return atanFast(y/x)
      else: 
        return atanFast(y/x) + (pi if y >= 0 else -pi)
    else: 
      return -atanFast(x/y) + (pi/2 if y > 0 else -pi/2)
  else:
    return pi/2 if y > 0 else (-pi/2 if y < 0 else 0)