Background
I am currently working on poliastro; more specifically, I am introducing array types, i.e. `Orbit` arrays, and array semantics into the package. poliastro relies heavily on numba for accelerating computations. One of my design goals is to avoid duplicate code or duplicate implementations for “scalar” operations on single orbits versus operations on arrays of orbits, so ideally both kinds of operations share the same code base. After a lot of digging and testing, I concluded that building infrastructure on top of `numba.vectorize` and `numba.guvectorize` was the way to go. This decision has a few interesting side-effects. First, functions decorated by `numba.vectorize` work just as well on scalars as they do on arrays. This is really cool. Second, both decorators support multiple compiler targets: `cpu`, `parallel` and `cuda`. This is where it got even more interesting: what if I could set an environment variable and my code would suddenly run on the GPU as opposed to the CPU? “Cross-compiling stuff with numba” sounds like an awesome idea.
MVE
In a very simplified form, I am doing something along the following lines:
```python
from numba import guvectorize, jit, cuda
import numpy as np

TARGET = 'cuda'  # controlled via environment variable
assert TARGET in ('cpu', 'parallel', 'cuda')

if TARGET == 'cuda':  # select decorator and configuration for "helper" function(s)
    hjit = cuda.jit
    hkwargs = dict(device = True, inline = True)
else:
    hjit = jit
    hkwargs = dict(nopython = True, inline = 'always')

def _parse_signature(s):  # simplify complicated signatures (tuples because, well, returning arrays does not work for device functions)
    s = s.replace('M', 'Tuple([V,V,V])')
    s = s.replace('V', 'Tuple([f,f,f])')
    return s.replace('f', 'f8')
```
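For illustration, here is what the shorthand expands to for a helper signature like `V(V,M)` (plain Python, no numba required; the function body is repeated here so the snippet is self-contained):

```python
def _parse_signature(s):
    # same expansion as above: M -> 3x3 matrix (tuple of row tuples),
    # V -> 3-tuple of floats, f -> numba's f8 (float64)
    s = s.replace('M', 'Tuple([V,V,V])')
    s = s.replace('V', 'Tuple([f,f,f])')
    return s.replace('f', 'f8')

print(_parse_signature('V(V,M)'))
# Tuple([f8,f8,f8])(Tuple([f8,f8,f8]),Tuple([Tuple([f8,f8,f8]),Tuple([f8,f8,f8]),Tuple([f8,f8,f8])]))
```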
`hjit` is a decorator for “helper” functions, i.e. functions that can be called by other functions decorated with `vectorize` or `guvectorize`. Depending on the target, `hjit` is therefore either a regular `njit` decorator or a decorator generating CUDA device functions. `_parse_signature` assists with complicated function signatures. As CUDA device functions can not return arrays, I ended up playing with tuples, full Haskell-style. Sometimes the implementation looks old-school, but it is really fast. So tuples it is. Besides, and this surprised me even more after having read the documentation, once I had refactored a lot of code away from `ndarray`s to tuples, activating inlining for the CPU target suddenly gave me a really decent performance boost (while the compilation times also became crazy long).
A helper function may then look as follows:
```python
@hjit(_parse_signature('V(V,M)'), **hkwargs)
def matmul_VM_(a, b):
    return (
        a[0] * b[0][0] + a[1] * b[1][0] + a[2] * b[2][0],
        a[0] * b[0][1] + a[1] * b[1][1] + a[2] * b[2][1],
        a[0] * b[0][2] + a[1] * b[1][2] + a[2] * b[2][2],
    )
```
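As a plain-Python sanity check (my addition, no numba involved), the tuple arithmetic is just a row-vector-times-matrix product and matches numpy:

```python
import numpy as np

def matmul_VM_plain(a, b):  # same arithmetic as matmul_VM_, plain Python
    return (
        a[0] * b[0][0] + a[1] * b[1][0] + a[2] * b[2][0],
        a[0] * b[0][1] + a[1] * b[1][1] + a[2] * b[2][1],
        a[0] * b[0][2] + a[1] * b[1][2] + a[2] * b[2][2],
    )

v = (1.0, 2.0, 3.0)
M = ((0.2, 0.8, 0.3), (0.3, 0.5, 0.6), (0.4, 0.1, 0.8))
assert np.allclose(matmul_VM_plain(v, M), np.array(v) @ np.array(M))
```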
An actually vectorized function using the above helper could look as follows:
```python
@guvectorize(
    _parse_signature('void(f[:],f[:],f[:],f[:],f[:],f[:])'),
    '(n),(n),(n)->(n),(n),(n)',  # "layout" in numba terms
    target = TARGET,
    nopython = True,
)
def foo(a, b, c, x, y, z):
    R = (
        (0.2, 0.8, 0.3),
        (0.3, 0.5, 0.6),
        (0.4, 0.1, 0.8),
    )
    for idx in range(a.shape[0]):
        x[idx], y[idx], z[idx] = matmul_VM_((a[idx], b[idx], c[idx]), R)
```
… and be tested as follows:
```python
LEN = 100_000_000
data = np.arange(0, 3 * LEN, dtype = 'f8').reshape(3, LEN)
res = np.zeros_like(data)
foo(data[0], data[1], data[2], res[0], res[1], res[2])  # providing views on `res`
print(res)
```
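For reference, and as a quick way to check results on a small input (plain numpy, my addition): since `foo` effectively computes `res[:, idx] = data[:, idx] @ R` for every index, the whole call is equivalent to a single matrix product:

```python
import numpy as np

R = np.array([
    (0.2, 0.8, 0.3),
    (0.3, 0.5, 0.6),
    (0.4, 0.1, 0.8),
])
data = np.arange(0, 3 * 4, dtype='f8').reshape(3, 4)  # toy size instead of 100_000_000

# column idx of the result holds data[:, idx] @ R
expected = R.T @ data
print(expected)
```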
If I understand this stuff correctly, functions decorated by `vectorize` are limited to a single return value, a scalar. They can not return multiple scalars (or anything else for that matter). In contrast, in poliastro I am dealing with “arrays of vectors”, e.g. 6 orbital elements per orbit, which can be understood as a vector of length 6. “Array of orbits” == “array of vectors”, or an `ndarray` of shape `(6,n)` / `(n,6)`. In one example use case, these need to be converted to a different type of orbital elements, again 6 numbers per orbit. This is where `guvectorize` became interesting and the above code example was born. It works perfectly for the `cpu` and `parallel` targets, but it breaks on the `cuda` target. Note that the above example deals with vectors of length 3 instead of 6 to keep things simple.
What’s the problem?
The “layout string”, in numba terms. The code above uses `(n),(n),(n)->(n),(n),(n)`. In this case, if the target is set to `cuda`, numba fails to compile: `AssertionError: only support 1 output`

So I thought I really only need one output, a fixed-length vector. But signatures along the lines of `(n),(n),(n)->(3,n)` or `(n),(n),(n)->(n,3)` also cause numba to fail to compile: `ValueError: bad token in signature "3"`

Any ideas on how to move forward with this?
MVEs / notebooks
The above code can be found as an MVE in a complete notebook here.
A real-life example showing actual conversion of orbital elements can be found as a full test case in another notebook here. Dependencies and their installation are documented at the top of the notebook.
References
- Does not mention that only one output is allowed for the cuda target: Just-in-Time compilation — Numba 0.56.0+0.gf75c45a8d.dirty-py3.7-linux-x86_64.egg documentation
- Also missing this crucial detail: CUDA Ufuncs and Generalized Ufuncs — Numba 0.56.0+0.gf75c45a8d.dirty-py3.7-linux-x86_64.egg documentation
- Appears to allow const numbers in signatures (or “layout strings” in numba terms): Generalized Universal Function API — NumPy v1.23 Manual
- Unrelated, but one of the very few places where `AssertionError: only support 1 output` shows up in the issue tracker / github repository: CUDA guvectorize target does not copy back modified input arrays · Issue #4952 · numba/numba · GitHub
- Where the assertion is located: numba/numba/np/ufunc/deviceufunc.py at 94916113d92cb519a28adb10c30d8afda8049d07 · numba/numba · GitHub
- Going for structured types to avoid the limitation of only one output does not appear to be an option either: Vectorization of structured types appears to be broken · Issue #5329 · numba/numba · GitHub
- A potential “work-around” (???) by providing at least one additional dummy array that carries the length information for the vectors (3 or 6 in my case) in its shape: Vectorize with arbitrary output shape - #2 by Rutger