Rotating Cartesian coordinates, Numba vs C extension

I’m trying to replace a few C extensions with numba, but I’m having some issues with performance.

The problem is simple: rotate an xyz point by a series of angles, using rotation matrices.
I have an old C extension that does the job like this

void rotation(double *xyz, double lon, double lat, double inc, double phase, double *xiyizi)
{
    double g2 = (phase - 1.0) * (2 * M_PI);
    double i = inc * M_PI / 180.;
    double b = lat * M_PI / 180.;
    double g = -lon * M_PI / 180.;
    double b2 = -(M_PI / 2. - i);

    double R[3][3] = {{(1 - cos(g2)) * cos(i) * cos(i) + cos(g2), sin(g2) * sin(i), (1 - cos(g2)) * cos(i) * sin(i)},
                      {-sin(g2) * sin(i), cos(g2), sin(g2) * cos(i)},
                      {(1 - cos(g2)) * sin(i) * cos(i), -sin(g2) * cos(i), (1 - cos(g2)) * sin(i) * sin(i) + cos(g2)}};

    double R2[3][3] = {{cos(b) * cos(g) * cos(b2) - sin(b2) * sin(b), -sin(g) * cos(b), cos(b) * cos(g) * sin(b2) + sin(b) * cos(b2)},
                       {sin(g) * cos(b2), cos(g), sin(g) * sin(b2)},
                       {-sin(b) * cos(g) * cos(b2) - cos(b) * sin(b2), sin(b) * sin(g), -sin(b) * cos(g) * sin(b2) + cos(b) * cos(b2)}};

    double R3[3][3] = {{R2[0][0] * R[0][0] + R2[0][1] * R[1][0] + R2[0][2] * R[2][0],
                        R2[0][0] * R[0][1] + R2[0][1] * R[1][1] + R2[0][2] * R[2][1],
                        R2[0][0] * R[0][2] + R2[0][1] * R[1][2] + R2[0][2] * R[2][2]},
                       {R2[1][0] * R[0][0] + R2[1][1] * R[1][0] + R2[1][2] * R[2][0],
                        R2[1][0] * R[0][1] + R2[1][1] * R[1][1] + R2[1][2] * R[2][1],
                        R2[1][0] * R[0][2] + R2[1][1] * R[1][2] + R2[1][2] * R[2][2]},
                       {R2[2][0] * R[0][0] + R2[2][1] * R[1][0] + R2[2][2] * R[2][0],
                        R2[2][0] * R[0][1] + R2[2][1] * R[1][1] + R2[2][2] * R[2][1],
                        R2[2][0] * R[0][2] + R2[2][1] * R[1][2] + R2[2][2] * R[2][2]}};

    xiyizi[0] = R3[0][0] * xyz[0] + R3[0][1] * xyz[1] + R3[0][2] * xyz[2];
    xiyizi[1] = R3[1][0] * xyz[0] + R3[1][1] * xyz[1] + R3[1][2] * xyz[2];
    xiyizi[2] = R3[2][0] * xyz[0] + R3[2][1] * xyz[1] + R3[2][2] * xyz[2];
}

which I tried to replace with the following Numba function, which returns a NumPy array:

import numba
import numpy as np
from numpy import pi, sin, cos, zeros, ndarray

@numba.njit()
def rotation_numba(xyz: ndarray, lon: float, lat: float, inc: float, phase: float):

    g2 = (-1.0 + phase) * (2.0 * pi)
    i = inc * pi / 180.
    b = lat * pi / 180.
    g = -lon * pi / 180.
    b2 = -(pi / 2.0 - i)

    R = [
        [(1 - cos(g2)) * cos(i) * cos(i) + cos(g2), sin(g2) * sin(i), (1 - cos(g2)) * cos(i) * sin(i)],
        [-sin(g2) * sin(i), cos(g2), sin(g2) * cos(i)],
        [(1 - cos(g2)) * sin(i) * cos(i), -sin(g2) * cos(i), (1 - cos(g2)) * sin(i) * sin(i) + cos(g2)]
    ]

    R2 = [
        [cos(b) * cos(g) * cos(b2) - sin(b2) * sin(b), -sin(g) * cos(b), cos(b) * cos(g) * sin(b2) + sin(b) * cos(b2)],
        [sin(g) * cos(b2), cos(g), sin(g) * sin(b2)],
        [-sin(b) * cos(g) * cos(b2) - cos(b) * sin(b2), sin(b) * sin(g), -sin(b) * cos(g) * sin(b2) + cos(b) * cos(b2)]
    ]

    R3 = [
        [R2[0][0] * R[0][0] + R2[0][1] * R[1][0] + R2[0][2] * R[2][0],
         R2[0][0] * R[0][1] + R2[0][1] * R[1][1] + R2[0][2] * R[2][1],
         R2[0][0] * R[0][2] + R2[0][1] * R[1][2] + R2[0][2] * R[2][2]
        ],
        [R2[1][0] * R[0][0] + R2[1][1] * R[1][0] + R2[1][2] * R[2][0],
         R2[1][0] * R[0][1] + R2[1][1] * R[1][1] + R2[1][2] * R[2][1],
         R2[1][0] * R[0][2] + R2[1][1] * R[1][2] + R2[1][2] * R[2][2],
        ],
        [R2[2][0] * R[0][0] + R2[2][1] * R[1][0] + R2[2][2] * R[2][0],
         R2[2][0] * R[0][1] + R2[2][1] * R[1][1] + R2[2][2] * R[2][1],
         R2[2][0] * R[0][2] + R2[2][1] * R[1][2] + R2[2][2] * R[2][2]
        ]
    ]

    xyz_out = zeros(3)
    xyz_out[0] = R3[0][0] * xyz[0] + R3[0][1] * xyz[1] + R3[0][2] * xyz[2]
    xyz_out[1] = R3[1][0] * xyz[0] + R3[1][1] * xyz[1] + R3[1][2] * xyz[2]
    xyz_out[2] = R3[2][0] * xyz[0] + R3[2][1] * xyz[1] + R3[2][2] * xyz[2]
    return xyz_out

However, I’m not able to achieve the same performance with the two functions:

xyz = np.array([-0.8, 0.1, 0.5])

%timeit rotation_c(xyz, 180.0, 30.0, 90.0, 0.1)
# 931 ns ± 5.49 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit rotation_numba(xyz, 180.0, 30.0, 90.0, 0.1)
# 2.58 µs ± 45.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Is there something I’m doing wrong, or a different implementation in numba that would give me similar performance to the C extension?

At least part of the difference might come from including compile time in the timing… what do you get if you test it like this?

xyz = np.array([-0.8, 0.1, 0.5])
rotation_numba(xyz, 180.0, 30.0, 90.0, 0.1)  # call once to compile
%timeit rotation_numba(xyz, 180.0, 30.0, 90.0, 0.1)  # test the speed

You may also see some improvement if you pass in NumPy arrays for R, R2, R3, and xyz_out.

Hi @j-faria

I agree with @nelson2005 that the performance difference likely comes from memory allocation. Most of the execution time of the Numba function is spent allocating the lists and the output array. The comparison is not fair: in the C function you pass the output array as an argument, while in the Numba version the allocation happens inside the jitted function. Your C function will also allocate R, R2 and R3 on the stack, whereas Numba will probably allocate them on the heap.

What @sschaer said… I did a quick test making the structure comparable to the C implementation, by passing in xyz_out and making R, R2, R3 local variables.

original 4.51 µs ± 162 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
optimized 1.46 µs ± 129 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Hi @nelson2005, @sschaer,
Thank you for your suggestions and insight!

The times I showed were obtained after one initial call to the numba function to compile, so I think you’re right that the difference is in the memory allocation.

I understand that R, R2, R3 might be allocated on the stack on the C side. @nelson2005, is this what you meant by making the arrays local? Can you tell me how you achieved that?

For the output array, I thought the wrapper around the C function, written with the C extensions API, still needs to do the allocation, so I wasn’t expecting a difference there. In other words, since I call rotation_c from Python in the same way, without pre-allocating the output array, that allocation must still happen somewhere inside the wrapper, and so it would still be included in the C timings.

Right… you didn’t show that in your code, or I missed it. Posting more complete code would have saved some of this back-and-forth.

Maybe, or maybe not. The in and out arrays could be on the stack. That code also wasn’t posted; all @sschaer and I were saying is that the posted C code doesn’t contain a heap allocation, while the posted Python does.

There’s nothing in this problem that requires arrays. You could write the code with plain local variables, like:

R_0_0 = (1 - cos(g2)) * cos(i) * cos(i) + cos(g2)
...
R3_0_0 = R2_0_0 * R_0_0 + R2_0_1 * R_1_0 + R2_0_2 * R_2_0

Hi @j-faria

If you are really trying to get as much out of it as possible, I would go for something like the version below.
On my machine it reduces the execution time from 2.8 µs to 500 ns without allocation of the output array, and to 900 ns with the allocation (I hope that’s not just due to typos 🙂). If you allocate the output array within the function, use an uninitialized array: xyz_out = np.empty(3).

Note that it is also partially optimized for compile time (1.1 s vs. 300 ms), though I admit that it sacrifices readability.

import numba as nb
from numpy import pi, sin, cos, ndarray

@nb.njit
def rotation_numba(xyz: ndarray, lon: float, lat: float, inc: float, phase: float, xyz_out: ndarray):

    g2 = (-1.0 + phase) * (2.0 * pi)
    deg_to_rad = pi / 180.
    i = inc * deg_to_rad 
    b = lat * deg_to_rad 
    g = -lon * deg_to_rad 
    b2 = -(pi / 2.0 - i)

    cos_g2 = cos(g2)
    cos_i = cos(i)
    sin_g2 = sin(g2)
    sin_i = sin(i)
    one_neg_cos_g2 = 1 - cos_g2
    
    R_00 = one_neg_cos_g2 * cos_i * cos_i + cos_g2 
    R_01 = sin_g2 * sin_i
    R_02 = one_neg_cos_g2 * cos_i * sin_i
    R_10 = -sin_g2 * sin_i
    R_11 = cos_g2
    R_12 = sin_g2 * cos_i
    R_20 = one_neg_cos_g2 * sin_i * cos_i
    R_21 = -sin_g2 * cos_i
    R_22 = one_neg_cos_g2 * sin_i * sin_i + cos_g2

    cos_b2 = cos(b2)
    cos_b = cos(b)
    sin_b2 = sin(b2)
    sin_b = sin(b)
    cos_g = cos(g)
    sin_g = sin(g)

    R2_00 = cos_b * cos_g * cos_b2 - sin_b2 * sin_b
    R2_01 = -sin_g * cos_b
    R2_02 = cos_b * cos_g * sin_b2 + sin_b * cos_b2
    R2_10 = sin_g * cos_b2
    R2_11 = cos_g
    R2_12 = sin_g * sin_b2
    R2_20 = -sin_b * cos_g * cos_b2 - cos_b * sin_b2
    R2_21 = sin_b * sin_g
    R2_22 = -sin_b * cos_g * sin_b2 + cos_b * cos_b2
    
    R3_00 = R2_00 * R_00 + R2_01 * R_10 + R2_02 * R_20
    R3_01 = R2_00 * R_01 + R2_01 * R_11 + R2_02 * R_21
    R3_02 = R2_00 * R_02 + R2_01 * R_12 + R2_02 * R_22
    R3_10 = R2_10 * R_00 + R2_11 * R_10 + R2_12 * R_20
    R3_11 = R2_10 * R_01 + R2_11 * R_11 + R2_12 * R_21
    R3_12 = R2_10 * R_02 + R2_11 * R_12 + R2_12 * R_22
    R3_20 = R2_20 * R_00 + R2_21 * R_10 + R2_22 * R_20
    R3_21 = R2_20 * R_01 + R2_21 * R_11 + R2_22 * R_21
    R3_22 = R2_20 * R_02 + R2_21 * R_12 + R2_22 * R_22

    xyz_out[0] = R3_00 * xyz[0] + R3_01 * xyz[1] + R3_02 * xyz[2]
    xyz_out[1] = R3_10 * xyz[0] + R3_11 * xyz[1] + R3_12 * xyz[2]
    xyz_out[2] = R3_20 * xyz[0] + R3_21 * xyz[1] + R3_22 * xyz[2]

You can also just replace the intermediate lists with tuples and pass the output array into the function. Then the performance should be similar to C. Tuples are allocated on the stack, comparable to the stack-allocated arrays in C.

Or you can further optimize the code like sschaer did.

Using the modifications proposed by @sschaer and also passing in the output array, I was able to bring the times to within 0.1 µs of the C code.
Since the function is to be called many times and independently within a tight loop, there’s no problem pre-allocating the output array, and the gains are substantial!

Thank you very much for the help!