Returning an array of 3D or 6D vectors from `guvectorize` on CUDA (cross-compiling for CPU & CUDA)

Background

I am currently working on poliastro, more specifically on introducing array types, i.e. Orbit arrays, and array semantics into the package.

poliastro heavily relies on numba for accelerating computations. One of my design goals is to avoid duplicate code or duplicate implementations for “scalar” operations on single orbits and operations on arrays of orbits, so ideally I want both types of operations to use the same code base. After a lot of digging and testing, I arrived at the conclusion that I should build infrastructure on top of numba.vectorize and numba.guvectorize. This decision has a few interesting side-effects. First, functions decorated by numba.vectorize work just as well on scalars as they work on arrays. This is really cool. Second, both decorators support multiple compiler targets: cpu, parallel and cuda. This is where it got even more interesting. What if I could set an environment variable and my code would suddenly run on the GPU instead of the CPU? “Cross-compiling stuff with numba” sounds like an awesome idea.

MVE

In a very simplified form, I am doing something along the following lines:

from numba import guvectorize, jit, cuda
import numpy as np

TARGET = 'cuda'  # controlled via environment variable
assert TARGET in ('cpu', 'parallel', 'cuda')

if TARGET == 'cuda':  # select decorator and configuration for "helper" function(s)
    hjit = cuda.jit
    hkwargs = dict(device = True, inline = True)
else:
    hjit = jit
    hkwargs = dict(nopython = True, inline = 'always')

def _parse_signature(s):  # simplify complicated signatures (tuples because, well, returning arrays does not work for device functions)
    s = s.replace('M', 'Tuple([V,V,V])')
    s = s.replace('V', 'Tuple([f,f,f])')
    return s.replace('f', 'f8')

hjit is a decorator for “helper” functions, i.e. functions that can be called by other functions decorated by vectorize or guvectorize. Depending on the target, hjit is therefore either a regular njit decorator or a decorator generating CUDA device functions. _parse_signature assists with complicated function signatures: since CUDA device functions cannot return arrays, I ended up playing with tuples, full Haskell-style. Sometimes the implementation looks old-school, but it’s really fast :slight_smile: So tuples it is. Besides, and this surprised me even more after having read the documentation, once I had refactored a lot of code away from ndarrays to tuples, activating inlining for the CPU target gave me a really decent performance boost (while compilation times also became crazy long).
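To make the signature shorthand concrete, here is what _parse_signature (duplicated standalone from the snippet above, no numba required) expands 'V(V,M)' into — 'V' becomes a 3-tuple of f8 and 'M' a 3-tuple of such 'V' tuples:

```python
def _parse_signature(s):  # copy of the helper from the snippet above
    s = s.replace('M', 'Tuple([V,V,V])')
    s = s.replace('V', 'Tuple([f,f,f])')
    return s.replace('f', 'f8')

V = 'Tuple([f8,f8,f8])'  # the expansion of a single 'V'
expanded = _parse_signature('V(V,M)')
assert expanded == f'{V}({V},Tuple([{V},{V},{V}]))'
print(expanded)
```

Note that the replacement order matters: 'M' must be expanded before 'V', and 'f' last, so the intermediate placeholders get rewritten correctly.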

A helper function may then look as follows:

@hjit(_parse_signature('V(V,M)'), **hkwargs)
def matmul_VM_(a, b):
    return (
        a[0] * b[0][0] + a[1] * b[1][0] + a[2] * b[2][0],
        a[0] * b[0][1] + a[1] * b[1][1] + a[2] * b[2][1],
        a[0] * b[0][2] + a[1] * b[1][2] + a[2] * b[2][2],
    )

An actually vectorized function using the above helper could look as follows:

@guvectorize(
    _parse_signature('void(f[:],f[:],f[:],f[:],f[:],f[:])'),
    '(n),(n),(n)->(n),(n),(n)',  # "layout" in numba terms
    target = TARGET,
    nopython = True,
)
def foo(a, b, c, x, y, z):
    R = (
        (0.2, 0.8, 0.3),
        (0.3, 0.5, 0.6),
        (0.4, 0.1, 0.8),
    )
    for idx in range(a.shape[0]):
        x[idx], y[idx], z[idx] = matmul_VM_((a[idx], b[idx], c[idx]), R)

… and be tested as follows:

LEN = 100_000_000

data = np.arange(0, 3 * LEN, dtype = 'f8').reshape(3, LEN)
res = np.zeros_like(data)

foo(data[0], data[1], data[2], res[0], res[1], res[2])  # providing views on `res`
print(res)
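For checking results, the whole of foo collapses to one NumPy expression: foo writes res[:, idx] = (a[idx], b[idx], c[idx]) @ R, which vectorizes to res = R.T @ data. A numba-free sketch with a tiny LEN to keep it cheap:

```python
import numpy as np

R = np.array([
    [0.2, 0.8, 0.3],
    [0.3, 0.5, 0.6],
    [0.4, 0.1, 0.8],
])
LEN = 5
data = np.arange(0, 3 * LEN, dtype='f8').reshape(3, LEN)

# replicate foo's kernel body in plain Python ...
res = np.zeros_like(data)
for idx in range(LEN):
    v = data[:, idx]
    res[:, idx] = v @ R

# ... and compare with the vectorized equivalent
assert np.allclose(res, R.T @ data)
```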

If I understand this stuff correctly, functions decorated by vectorize are limited to one return value, a scalar. They cannot return multiple scalars (or anything else for that matter). In contrast, in poliastro I am dealing with “arrays of vectors”, e.g. 6 orbital elements per orbit, which can be understood as a vector of length 6. “Array of orbits” == “array of vectors”, i.e. an ndarray of shape (6,n) / (n,6). In one example use case, these need to be converted to a different type of orbital elements, again 6 numbers per orbit. This is where guvectorize became interesting and the above code example was born. It works perfectly for the cpu and parallel targets, but it breaks on the cuda target. Note that the above example deals with vectors of length 3 instead of 6 to keep things simple.

What’s the problem?

The “layout string”, in numba terms. The example above uses (n),(n),(n)->(n),(n),(n). With the target set to cuda, numba fails to compile: AssertionError: only support 1 output

So I figured I really only need one output, a fixed-length vector. But signatures along the lines of (n),(n),(n)->(3,n) or (n),(n),(n)->(n,3) also cause numba to fail to compile: ValueError: bad token in signature "3"

Any ideas on how to move forward with this?
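For what it’s worth, one direction I have been considering (an assumption on my part, not confirmed to work on the cuda target): pack x, y and z into a single 2-D output of shape (n, 3), and since literal sizes are not allowed in layouts, smuggle the fixed dimension in via a dummy input, giving a layout like '(n),(n),(n),(m)->(n,m)'. A numba-free mock of what that kernel body would do:

```python
import numpy as np

R = (
    (0.2, 0.8, 0.3),
    (0.3, 0.5, 0.6),
    (0.4, 0.1, 0.8),
)

def foo_kernel(a, b, c, dummy, out):
    # out has shape (n, len(dummy)); the dummy input only carries the size 3
    for idx in range(a.shape[0]):
        v = (a[idx], b[idx], c[idx])
        for j in range(out.shape[1]):
            out[idx, j] = sum(v[i] * R[i][j] for i in range(3))

n = 4
data = np.arange(0, 3 * n, dtype='f8').reshape(3, n)
out = np.zeros((n, 3))
foo_kernel(data[0], data[1], data[2], np.empty(3), out)
assert np.allclose(out, data.T @ np.array(R))
```

Whether the cuda target accepts a 2-D output in a guvectorize layout at all is exactly what I am unsure about.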


MVEs / notebooks

The above code can be found as an MVE in a complete notebook here.

A real-life example showing actual conversion of orbital elements can be found as a full test case in another notebook here. Dependencies and their installation are documented at the top of the notebook.



Quick notes (following discussion in triage meeting):


@gmarkall Thanks for looking into this! Happy to test anything new and shaky :wink:

Just for reference: I issued the following feature request: Allow multiple outputs for `guvectorize` on CUDA target · Issue #8303 · numba/numba · GitHub
