Return type "types.UniTuple(float64, 3)" vs float64[:]

I am in the process of optimizing a small function that is called millions of times during the execution of my code, and I found that returning the array elements as a tuple is 3-4 times faster than returning the array itself. My question therefore is: is this real, and if so, why? I found this by chance, so is there perhaps an even faster way of returning (small) array results?

Toy example:

import numpy as np
from numba import njit, types, float64

@njit("types.UniTuple(float64, 3)(float64)", cache=True)
def foo_a(s): 
    r = np.zeros(3, dtype = float64)
    return r[0], r[1], r[2]

@njit("float64[:](float64)", cache=True)
def foo_b(s):
    r = np.zeros(3, dtype = float64)
    return r

%timeit foo_a(0.0)

123 ns ± 0.662 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

%timeit foo_b(0.0)

423 ns ± 4.24 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

I would try to make a benchmark that better captures your use case. Without inspecting the actual compiled code, my guess would be that the compiler can optimize this unrealistically well, because you always return zeros regardless of the input. Returning the array probably isn’t the “issue”; creating/allocating the array is. And only one of the two functions actually returns that array, so only there does it really have to be allocated.
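Numba’s dispatcher objects expose the compiled output, so this guess can be checked directly; a minimal sketch, assuming foo_a and foo_b above have already been compiled:

# Sketch: inspect what Numba generated for each variant (assumes foo_a
# and foo_b from the example above have already been compiled).
foo_a.inspect_types()                  # typed Numba IR per compiled signature
for sig, asm in foo_b.inspect_asm().items():
    print(sig)
    print("\n".join(asm.splitlines()[:20]))   # first lines of the native assembly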

If you’re going to call this millions of times, it’s probably best to avoid that allocation (as the tuple version also shows). You can, for example, add another function that fills a caller-provided array instead of allocating one; it has similar performance to the tuple version, which suggests it’s the allocation rather than the return type that causes the slowdown.

For example:

@njit("void(float64, float64[:])", cache=True)
def foo_c(s, r):
    r[0] = 0
    r[1] = 0
    r[2] = 0

%timeit foo_a(0.0)
%timeit foo_b(0.0)

r = np.zeros(3, dtype=np.float64)
%timeit foo_c(0.0, r)
280 ns ± 55.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
739 ns ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
278 ns ± 11.9 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

But as said, all of these are prone to optimizations to the point where they are probably no longer representative of what you’re trying to do. I would at least make sure the returned results actually depend on the input to the function.
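For instance, a hypothetical foo_d in the same style as foo_c, but with stored values that depend on the argument, leaves the compiler nothing to fold away:

# Hypothetical variant of foo_c: the stored values depend on the input,
# so the compiler cannot constant-fold the whole body away.
@njit("void(float64, float64[:])", cache=True)
def foo_d(s, r):
    r[0] = s
    r[1] = 2.0 * s
    r[2] = s * s

Timing it the same way as foo_c above should then give a more representative number.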

Thanks Rutger, here are the actual functions. The one that returns a tuple is still twice as fast.

@njit("types.UniTuple(float64, 3)(float64[:])", cache=True)
def PS2(s):
    
    ps = np.zeros(3, dtype=types.float64)
    p = (s[0] + s[1] + s[2]) / 3.0
    sd0 = s[0] - p
    sd1 = s[1] - p
    sd2 = s[2] - p
    sd3 = s[3]
    sd4 = s[4]
    sd5 = s[5]
    
    J2 = ((sd0 ** 2 + sd1 ** 2 + sd2 ** 2)/2.0 + (sd3 ** 2 + sd4 ** 2 + sd5 ** 2))
    if J2 < 1.0e-10: return 0.0, 0.0, 0.0
    J3 = sd0 * (sd1*sd2-sd5**2) - sd3 * (sd2*sd3 - sd4*sd5) + sd4 * (sd3*sd5 - sd1*sd4)

    sqrt3 = math.sqrt(3.0)
    sqrtJ2 = math.sqrt(J2)
    pi23 = 2.0 * math.pi / 3.0

    alpha = math.asin(-1.5 * sqrt3 * J3 / J2 / sqrtJ2) / 3.0
    fac = 2.0 * sqrtJ2 / sqrt3

    ps[0] = fac * math.sin(alpha + pi23) + p
    ps[1] = fac * math.sin(alpha) + p
    ps[2] = fac * math.sin(alpha - pi23) + p

    return ps[0], ps[1], ps[2]
@njit("float64[:](float64[:])", cache=True)
def PS3(s):
    
    ps = np.zeros(3, dtype=types.float64)
    p = (s[0] + s[1] + s[2]) / 3.0
    sd0 = s[0] - p
    sd1 = s[1] - p
    sd2 = s[2] - p
    sd3 = s[3]
    sd4 = s[4]
    sd5 = s[5]
    
    J2 = ((sd0 ** 2 + sd1 ** 2 + sd2 ** 2)/2.0 + (sd3 ** 2 + sd4 ** 2 + sd5 ** 2))
    if J2 < 1.0e-10: return ps
    J3 = sd0 * (sd1*sd2-sd5**2) - sd3 * (sd2*sd3 - sd4*sd5) + sd4 * (sd3*sd5 - sd1*sd4)

    sqrt3 = math.sqrt(3.0)
    sqrtJ2 = math.sqrt(J2)
    pi23 = 2.0 * math.pi / 3.0

    alpha = math.asin(-1.5 * sqrt3 * J3 / J2 / sqrtJ2) / 3.0
    fac = 2.0 * sqrtJ2 / sqrt3

    ps[0] = fac * math.sin(alpha + pi23) + p
    ps[1] = fac * math.sin(alpha) + p
    ps[2] = fac * math.sin(alpha - pi23) + p

    return ps

s = np.array([1.0,2.0,3.0,0.0,0.0,0.0])
%timeit PS2(s)

204 ns ± 3.22 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

s = np.array([1.0,2.0,3.0,0.0,0.0,0.0])
%timeit PS3(s)

500 ns ± 6.96 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Thanks for elaborating on the example. Using this code I also see the slowdown, even when the output array is passed in instead of allocated inside the function. Given that an array is mutable, it’s not unreasonable that it’s slower, but I don’t know if that justifies the total difference in this case.

It’s still faster to avoid the allocation altogether if that’s possible, like:

@njit("types.UniTuple(float64, 3)(float64[::1])", cache=True)
def PS5(s):
    
    sd0, sd1, sd2, sd3, sd4, sd5 = s
    p = (sd0 + sd1 + sd2) / 3.0
    sd0 -= p
    sd1 -= p
    sd2 -= p
    
    J2 = ((sd0 ** 2 + sd1 ** 2 + sd2 ** 2)/2.0 + (sd3 ** 2 + sd4 ** 2 + sd5 ** 2))
    if J2 < 1.0e-10: return 0.0, 0.0, 0.0
    J3 = sd0 * (sd1*sd2-sd5**2) - sd3 * (sd2*sd3 - sd4*sd5) + sd4 * (sd3*sd5 - sd1*sd4)

    sqrt3 = math.sqrt(3.0)
    sqrtJ2 = math.sqrt(J2)
    pi23 = 2.0 * math.pi / 3.0

    alpha = math.asin(-1.5 * sqrt3 * J3 / J2 / sqrtJ2) / 3.0
    fac = 2.0 * sqrtJ2 / sqrt3

    ps0 = fac * math.sin(alpha + pi23) + p
    ps1 = fac * math.sin(alpha) + p
    ps2 = fac * math.sin(alpha - pi23) + p

    return ps0, ps1, ps2

For me that’s about 10% faster compared to allocating a new array in the function.
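If the caller does occasionally need the result as an array, the returned tuple can simply be wrapped on the calling side; a small usage sketch:

s = np.array([1.0, 2.0, 3.0, 0.0, 0.0, 0.0])
p0, p1, p2 = PS5(s)         # plain tuple unpacking
ps = np.array(PS5(s))       # or wrap into an array only when one is needed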

Thanks, for me it is also 10% faster. I guess this is as good as it gets :slight_smile:

Maybe it’s something similar to what happens in Julia when using StaticArrays, where they are allocated on the stack instead of the heap: Code Optimization for Differential Equations · DifferentialEquations.jl

@Rutger, avoiding creating and returning an array in the first place is indeed the fastest way to return small array results. However, in other areas of my code I make millions of calls to a function that creates a smallish array of size 6x30, so there I cannot avoid the array itself. But since arrays are passed by reference, it is possible to pre-allocate the array outside the function, fill it in place, and skip the return altogether. The following tests show that this reduces the execution time by a factor of 2-3 compared to returning the array.

import numpy as np
import numba as nb
from numba import njit, types, float64

@njit("none()", cache=True)
def empty():
    pass

%timeit empty()

33.1 ns ± 0.228 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

@njit("float64()", cache=True)
def floatreturn():
    a = 1.0
    return a

%timeit floatreturn()

35.5 ns ± 0.306 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

@njit("float64()", cache=True)
def arrayelementreturn():
    a = np.array([1.0])
    return a[0]

%timeit arrayelementreturn()

52.1 ns ± 0.318 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

@njit("float64[:]()", cache=True)
def arrayreturn():
    a = np.array([1.0])
    return a

%timeit arrayreturn()

361 ns ± 5.21 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

@njit("none(float64[:])", cache=True)
def byreference_no_return(a):
    a[0] = 1.0

a = np.array([0.0])
%timeit byreference_no_return(a)
print(a)

119 ns ± 0.41 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
[1.]

@njit("float64[:](float64[:])", cache=True)
def byreference_return(a):
    a[0] = 1.0
    return a

a = np.array([0.0])
%timeit byreference_return(a)
print(a)

316 ns ± 3.98 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
[1.]
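Applied to the 6x30 case mentioned above, the same pattern would look roughly like the sketch below; fill_table and its body are hypothetical stand-ins for the real computation:

# Hypothetical sketch of the 6x30 case: allocate the work array once,
# outside the hot loop, and let the jitted function fill it in place.
@njit("none(float64[:, ::1], float64)", cache=True)
def fill_table(out, s):
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = s * (i + j)    # placeholder computation

table = np.zeros((6, 30), dtype=np.float64)
fill_table(table, 2.0)                 # reuse `table` across millions of calls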