I am currently working on poliastro. More specifically, I am working on introducing array types, i.e. Orbit arrays, and array semantics into the package.
poliastro heavily relies on numba for accelerating computations. One of my design goals is to avoid duplicate code or duplicate implementations for “scalar” operations on single orbits versus operations on arrays of orbits, so ideally both types of operations should use the same code base. I did tons of digging and testing and arrived at the conclusion to build infrastructure on top of numba.guvectorize. There are a few interesting side effects to this decision. First, functions decorated by numba.vectorize and numba.guvectorize work just as well on scalars as they do on arrays. This is really cool. Second, both decorators support multiple compiler targets: cpu, parallel and cuda. This is where it got even more interesting. What if I could set an environment variable and my code would suddenly run on the GPU as opposed to the CPU? “Cross-compiling stuff with numba” sounds like an awesome idea.
In a very simplified form, I am doing something along the following lines:
```python
from numba import guvectorize, jit, cuda
import numpy as np

TARGET = 'cuda'  # controlled via environment variable
assert TARGET in ('cpu', 'parallel', 'cuda')

if TARGET == 'cuda':
    # select decorator and configuration for "helper" function(s)
    hjit = cuda.jit
    hkwargs = dict(device = True, inline = True)
else:
    hjit = jit
    hkwargs = dict(nopython = True, inline = 'always')

def _parse_signature(s):
    # simplify complicated signatures (tuples because, well,
    # returning arrays does not work for device functions)
    s = s.replace('M', 'Tuple([V,V,V])')
    s = s.replace('V', 'Tuple([f,f,f])')
    return s.replace('f', 'f8')
```
hjit is a decorator for “helper” functions, i.e. functions that can be called by other functions decorated by guvectorize. Depending on the target, hjit is therefore either a regular njit decorator or a decorator generating CUDA device functions.
_parse_signature assists with complicated function signatures. As CUDA device functions can not return arrays, I ended up playing with tuples, full Haskell-style. Sometimes the implementation looks old-school, but it is really fast, so tuples it is. Besides, and this surprised me even more after having read the documentation, once I had refactored a lot of code away from ndarrays to tuples, activating inlining for the CPU target suddenly gave me a really decent performance boost (while the compilation times also became crazy long).
A helper function may then look as follows:
```python
@hjit(_parse_signature('V(V,M)'), **hkwargs)
def matmul_VM_(a, b):
    return (
        a[0] * b[0][0] + a[1] * b[1][0] + a[2] * b[2][0],
        a[0] * b[0][1] + a[1] * b[1][1] + a[2] * b[2][1],
        a[0] * b[0][2] + a[1] * b[1][2] + a[2] * b[2][2],
    )
```
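As a quick sanity check of the tuple arithmetic (pure Python, no numba; assuming the row-vector-times-matrix convention), the helper agrees with NumPy:

```python
import numpy as np

# Plain-Python copy of the helper above, assuming the row-vector
# times matrix convention, checked against NumPy's matmul.
def matmul_VM(a, b):
    return (
        a[0] * b[0][0] + a[1] * b[1][0] + a[2] * b[2][0],
        a[0] * b[0][1] + a[1] * b[1][1] + a[2] * b[2][1],
        a[0] * b[0][2] + a[1] * b[1][2] + a[2] * b[2][2],
    )

v = (1.0, 2.0, 3.0)
M = ((0.2, 0.8, 0.3), (0.3, 0.5, 0.6), (0.4, 0.1, 0.8))
print(np.allclose(matmul_VM(v, M), np.array(v) @ np.array(M)))  # True
```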
An actually vectorized function using the above helper could look as follows:
```python
@guvectorize(
    _parse_signature('void(f[:],f[:],f[:],f[:],f[:],f[:])'),
    '(n),(n),(n)->(n),(n),(n)',  # "layout" in numba terms
    target = TARGET,
    nopython = True,
)
def foo(a, b, c, x, y, z):
    R = (
        (0.2, 0.8, 0.3),
        (0.3, 0.5, 0.6),
        (0.4, 0.1, 0.8),
    )
    for idx in range(a.shape[0]):
        x[idx], y[idx], z[idx] = matmul_VM_((a[idx], b[idx], c[idx]), R)
```
… and be tested as follows:
```python
LEN = 100_000_000
data = np.arange(0, 3 * LEN, dtype = 'f8').reshape(3, LEN)
res = np.zeros_like(data)
foo(data, data, data, res, res, res)  # providing views on `res`
print(res)
```
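For a plausibility check on a small input (pure NumPy, no numba; again assuming the row-vector convention of the helper), the expected result can be computed directly:

```python
import numpy as np

# Reference computation for a small input: stack the three input
# streams into rows of 3-vectors and multiply each by the fixed
# matrix R used inside foo.
R = np.array([
    [0.2, 0.8, 0.3],
    [0.3, 0.5, 0.6],
    [0.4, 0.1, 0.8],
])
n = 5
a = b = c = np.arange(n, dtype = 'f8')
expected = np.stack([a, b, c], axis = 1) @ R  # shape (n, 3)
x, y, z = expected.T  # what foo would write into x, y, z
print(x)
```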
If I understand this stuff correctly, functions decorated by vectorize are limited to one return value, a scalar. They can not return multiple scalars (or anything else for that matter). In contrast, in poliastro I am dealing with “arrays of vectors”, e.g. 6 orbital elements per orbit, which can be understood as a vector of length 6. “Array of orbits” == “array of vectors”, or an ndarray of shape (n,6). In one example use case, they need to be converted to a different type of orbital elements, again 6 numbers per orbit. This is where guvectorize became interesting and the above code example was born. It works perfectly for the cpu and parallel targets, but it breaks on the cuda target. Notice that the above example deals with vectors of length 3 as opposed to 6 to keep things simple.
The “layout string”, in numba terms, used above is (n),(n),(n)->(n),(n),(n). In this case, if the target is set to cuda, numba fails to compile:

```
AssertionError: only support 1 output
```
So I thought: I really only need one output, a fixed-length vector. But signatures along the lines of (n),(n),(n)->(n,3) also cause numba to fail to compile:

```
ValueError: bad token in signature "3"
```
Any ideas on how to move forward with this?
The above code can be found as an MVE in a complete notebook here.
A real-life example showing actual conversion of orbital elements can be found as a full test case in another notebook here. Dependencies and their installation are documented at the top of the notebook.
- Does not mention that only one output is allowed for the cuda target: Just-in-Time compilation — Numba 0.56.0+0.gf75c45a8d.dirty-py3.7-linux-x86_64.egg documentation
- Also missing this crucial detail: CUDA Ufuncs and Generalized Ufuncs — Numba 0.56.0+0.gf75c45a8d.dirty-py3.7-linux-x86_64.egg documentation
- Appears to allow const numbers in signatures (or “layout strings” in numba terms): Generalized Universal Function API — NumPy v1.23 Manual
- Unrelated, but one of the very few places where `AssertionError: only support 1 output` shows up in the issue tracker / github repository: CUDA guvectorize target does not copy back modified input arrays · Issue #4952 · numba/numba · GitHub
- Where the assertion is located: numba/numba/np/ufunc/deviceufunc.py at 94916113d92cb519a28adb10c30d8afda8049d07 · numba/numba · GitHub
- Going for structured types to avoid the limitation of only one output does not appear to be an option either: Vectorization of structured types appears to be broken · Issue #5329 · numba/numba · GitHub
- A potential “work-around” (???) by providing at least one additional dummy array that carries the length information for the vectors (3 or 6 in my case) in its shape: Vectorize with arbitrary output shape - #2 by Rutger