Returning an array of 3D or 6D vectors from `guvectorize` on CUDA (cross-compiling for CPU & CUDA)

Background

I am currently working on poliastro, more specifically on introducing array types, i.e. Orbit arrays, and array semantics into the package.

poliastro heavily relies on numba for accelerating computations. One of my design goals is to avoid duplicate code or duplicate implementations for “scalar” operations on single orbits and operations on arrays of orbits, so ideally I want both types of operations to use the same code base. After a lot of digging and testing, I arrived at the conclusion that I should build infrastructure on top of numba.vectorize and numba.guvectorize. This decision has a few interesting side-effects. First, functions decorated by numba.vectorize work just as well on scalars as they work on arrays. This is really cool. Second, both decorators support multiple compiler targets: cpu, parallel and cuda. This is where it got even more interesting. What if I could set an environment variable and my code would suddenly run on the GPU instead of the CPU? “Cross-compiling stuff with numba” sounds like an awesome idea.

MVE

In a very simplified form, I am doing something along the following lines:

from numba import guvectorize, jit, cuda
import numpy as np

TARGET = 'cuda'  # controlled via environment variable
assert TARGET in ('cpu', 'parallel', 'cuda')

if TARGET == 'cuda':  # select decorator and configuration for "helper" function(s)
    hjit = cuda.jit
    hkwargs = dict(device = True, inline = True)
else:
    hjit = jit
    hkwargs = dict(nopython = True, inline = 'always')

def _parse_signature(s):  # simplify complicated signatures (tuples because, well, returning arrays does not work for device functions)
    s = s.replace('M', 'Tuple([V,V,V])')
    s = s.replace('V', 'Tuple([f,f,f])')
    return s.replace('f', 'f8')

hjit is a decorator for “helper” functions, i.e. functions that can be called by other functions decorated by vectorize or guvectorize. Depending on the target, hjit is therefore either a regular njit decorator or a decorator generating CUDA device functions. _parse_signature assists with complicated function signatures: since CUDA device functions cannot return arrays, I ended up playing with tuples, full Haskell-style. Sometimes the implementation looks old-school, but it’s really fast :slight_smile: So tuples it is. Besides, and this surprised me even more after having read the documentation, once I had refactored a lot of code away from ndarrays to tuples, activating inlining for the CPU target gave me a really decent performance boost (while compilation times also became crazy long).
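To make the signature shorthand concrete, here is what _parse_signature (duplicated standalone from the snippet above, no numba required) expands 'V(V,M)' into — 'V' becomes a 3-tuple of f8 and 'M' a 3-tuple of such 'V' tuples:

```python
def _parse_signature(s):  # copy of the helper from the snippet above
    s = s.replace('M', 'Tuple([V,V,V])')
    s = s.replace('V', 'Tuple([f,f,f])')
    return s.replace('f', 'f8')

V = 'Tuple([f8,f8,f8])'  # the expansion of a single 'V'
expanded = _parse_signature('V(V,M)')
assert expanded == f'{V}({V},Tuple([{V},{V},{V}]))'
print(expanded)
```

Note that the replacement order matters: 'M' must be expanded before 'V', and 'f' last, so the intermediate placeholders get rewritten correctly.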

A helper function may then look as follows:

@hjit(_parse_signature('V(V,M)'), **hkwargs)
def matmul_VM_(a, b):
    return (
        a[0] * b[0][0] + a[1] * b[1][0] + a[2] * b[2][0],
        a[0] * b[0][1] + a[1] * b[1][1] + a[2] * b[2][1],
        a[0] * b[0][2] + a[1] * b[1][2] + a[2] * b[2][2],
    )

An actually vectorized function using the above helper could look as follows:

@guvectorize(
    _parse_signature('void(f[:],f[:],f[:],f[:],f[:],f[:])'),
    '(n),(n),(n)->(n),(n),(n)',  # "layout" in numba terms
    target = TARGET,
    nopython = True,
)
def foo(a, b, c, x, y, z):
    R = (
        (0.2, 0.8, 0.3),
        (0.3, 0.5, 0.6),
        (0.4, 0.1, 0.8),
    )
    for idx in range(a.shape[0]):
        x[idx], y[idx], z[idx] = matmul_VM_((a[idx], b[idx], c[idx]), R)

… and be tested as follows:

LEN = 100_000_000

data = np.arange(0, 3 * LEN, dtype = 'f8').reshape(3, LEN)
res = np.zeros_like(data)

foo(data[0], data[1], data[2], res[0], res[1], res[2])  # providing views on `res`
print(res)
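For checking results, the whole of foo collapses to one NumPy expression: foo writes res[:, idx] = (a[idx], b[idx], c[idx]) @ R, which vectorizes to res = R.T @ data. A numba-free sketch with a tiny LEN to keep it cheap:

```python
import numpy as np

R = np.array([
    [0.2, 0.8, 0.3],
    [0.3, 0.5, 0.6],
    [0.4, 0.1, 0.8],
])
LEN = 5
data = np.arange(0, 3 * LEN, dtype='f8').reshape(3, LEN)

# replicate foo's kernel body in plain Python ...
res = np.zeros_like(data)
for idx in range(LEN):
    v = data[:, idx]
    res[:, idx] = v @ R

# ... and compare with the vectorized equivalent
assert np.allclose(res, R.T @ data)
```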

If I understand this stuff correctly, functions decorated by vectorize are limited to one return value, a scalar. They cannot return multiple scalars (or anything else for that matter). In contrast, in poliastro I am dealing with “arrays of vectors”, e.g. 6 orbital elements per orbit, which can be understood as a vector of length 6. “Array of orbits” == “array of vectors”, i.e. an ndarray of shape (6,n) / (n,6). In one example use case, these need to be converted to a different type of orbital elements, again 6 numbers per orbit. This is where guvectorize became interesting and the above code example was born. It works perfectly for the cpu and parallel targets, but it breaks on the cuda target. Note that the above example deals with vectors of length 3 instead of 6 to keep things simple.

What’s the problem?

The “layout string”, in numba terms. The example above uses (n),(n),(n)->(n),(n),(n). With the target set to cuda, numba fails to compile: AssertionError: only support 1 output

So I figured I really only need one output, a fixed-length vector. But signatures along the lines of (n),(n),(n)->(3,n) or (n),(n),(n)->(n,3) also cause numba to fail to compile: ValueError: bad token in signature "3"

Any ideas on how to move forward with this?
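For what it’s worth, one direction I have been considering (an assumption on my part, not confirmed to work on the cuda target): pack x, y and z into a single 2-D output of shape (n, 3), and since literal sizes are not allowed in layouts, smuggle the fixed dimension in via a dummy input, giving a layout like '(n),(n),(n),(m)->(n,m)'. A numba-free mock of what that kernel body would do:

```python
import numpy as np

R = (
    (0.2, 0.8, 0.3),
    (0.3, 0.5, 0.6),
    (0.4, 0.1, 0.8),
)

def foo_kernel(a, b, c, dummy, out):
    # out has shape (n, len(dummy)); the dummy input only carries the size 3
    for idx in range(a.shape[0]):
        v = (a[idx], b[idx], c[idx])
        for j in range(out.shape[1]):
            out[idx, j] = sum(v[i] * R[i][j] for i in range(3))

n = 4
data = np.arange(0, 3 * n, dtype='f8').reshape(3, n)
out = np.zeros((n, 3))
foo_kernel(data[0], data[1], data[2], np.empty(3), out)
assert np.allclose(out, data.T @ np.array(R))
```

Whether the cuda target accepts a 2-D output in a guvectorize layout at all is exactly what I am unsure about.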


MVEs / notebooks

The above code can be found as an MVE in a complete notebook here.

A real-life example showing actual conversion of orbital elements can be found as a full test case in another notebook here. Dependencies and their installation are documented at the top of the notebook.



Quick notes (following discussion in triage meeting):


@gmarkall Thanks for looking into this! Happy to test anything new and shaky :wink:

Just for reference: I issued the following feature request: Allow multiple outputs for `guvectorize` on CUDA target · Issue #8303 · numba/numba · GitHub
