Mistaken poor optimization of atan2

pauljurczak · February 9, 2023, 2:17am

This is a similar topic to my previous post A huge performance penalty for this simple function, but with a different code example. The objective here is to develop a faster version of atan2() at the cost of accuracy. The version shown here has the worst approximation error of about 4° and median error below 3°, which is acceptable for my use case. Here is a test code, which places atan2() in a context it will be ultimately used with:

import numpy as np
import numba as nb
import timeit as ti
from math import pi, atan2, sqrt


@nb.njit(fastmath=True, locals=dict(w=nb.uint32))
def test1(im, grad, a, w):
  for i in range(im.size-w-1):
    dx = im[i+w+1]-im[i]
    dy = im[i+w]-im[i+1]
    grad[i] = sqrt(dx**2+dy**2)
    a[i] = atan2(dy, dx)


w, h = 640, 480
im = np.random.rand(w*h).astype('f4')
grad = np.empty_like(im)
a = np.empty_like(im, dtype='f4')

fun = f'test1(im, grad, a, w)'
t = 1000 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=100))
print(f'{fun}:  {np.amin(t):6.3f}ms  {np.median(t):6.3f}ms  {np.sum(grad)}')

test1() executes in 8.37ms on my PC running Python 3.10.6 and numba 0.56.4 on Ubuntu 22.04.1. A C++ equivalent is about as fast, executing in 8.22ms when compiled with gcc 12.1.0. So far, so good.

When benchmarking with an atan2fast substituting for atan2:

c = pi/4

@nb.njit(fastmath=True, inline='always', locals=dict(x=nb.float32, y=nb.float32, c=nb.float32))
def atan2fast(y, x): 
  if y == 0 and x == 0:
    return 0

  if x >= 0:
    angle = c*(1-(x-abs(y))/(x+abs(y)))
  else:
    angle = c*(3-(x+abs(y))/(abs(y)-x))

  return -angle if y < 0 else angle

I’m getting a huge performance discrepancy:

Python: 2.69ms
C++: 0.338ms

Here is the C++ version:

inline constexpr float c = numbers::pi_v<float>/4;

static inline float atan2fast(float y, float x) {
	float angle;

	if (y == 0 && x == 0)
		return 0;

	if (x >= 0)
        angle = c*(1-(x-abs(y))/(x+abs(y)));
	else
		angle = c*(3-(x+abs(y))/(abs(y)-x));

	return (y < 0) ? -angle : angle;
}

I tried several simple code transformations of atan2fast and performance gap didn’t budge.

pauljurczak · February 9, 2023, 2:47am

Well, this was a false alarm. C++ optimizer was eliding most of the computation. After some changes to prevent “cheating”, C++ version clocks at 2.60ms, which is almost the same as Numba. I was thinking about deleting this post, but decided to leave it. Maybe someone will post a faster version of atan2fast.

pauljurczak · February 9, 2023, 6:01am

Slightly slower, but more accurate version with maximum error about 0.284°: atan2fast1 takes 3.05ms vs. 2.69ms for atan2fast. Performance is close to C++ version (2.91ms with gcc 11.1.0 and 2.49ms with clang 15.0.7).

@nb.njit(fastmath=True, inline='always', locals=dict(z=nb.float32))
def atanFast(z):
  return (0.97239411-0.19194795*z*z)*z


@nb.njit(fastmath=True, inline='always', locals=dict(x=nb.float32, y=nb.float32))
def atan2fast1(y, x): 
  if x != 0:
    if abs(x) > abs(y):
      if x > 0:
        return atanFast(y/x)
      else: 
        return atanFast(y/x) + (pi if y >= 0 else -pi)
    else: 
      return -atanFast(x/y) + (pi/2 if y > 0 else -pi/2)
  else:
    return pi/2 if y > 0 else (-pi/2 if y < 0 else 0)

Topic		Replies	Views
A huge performance penalty for this simple function Numba	22	1891	January 28, 2023
Curious performance loss of NumPy function with Numba Numba	11	392	December 26, 2023
Why this function call is faster than inlined version? Numba	3	323	May 15, 2023
Try to understand why numba is 4x faster than c++ code in the same task Community Support	4	1519	November 9, 2022
Why this assignment of subarray is so slow? Numba	3	271	July 25, 2022

Mistaken poor optimization of atan2

Related Topics