A further question: does Awkward support string-related APIs similar to the ones CPython has?
When a string is materialized in Numba code (for instance, you have an array of strings and write an expression like array[i] for some integer i), that string is materialized as Numba’s lowered nb.types.string, which is a Python unicode object. Numba provides string operations on that string type, though they are CPython API calls and they therefore acquire the GIL.
(I just went to check on it and couldn’t find where that is implemented; the relevant issue is “Present Awkward strings to Numba as Numba strings”, Issue #1917 in scikit-hep/awkward on GitHub. That’s very strange, but I did verify that it does happen: Awkward strings are presented as strings.)
I found this doc (Home · scikit-hep/awkward Wiki · GitHub): "@martindurant has plans for wrapping Rust’s Unicode-aware string library as vectorized string functions in
It hasn’t been implemented yet, and the plan is for it to be implemented using pyarrow’s string handling, rather than Rust’s. Also, these would be new ak.str.* functions, such as ak.str.capitalize(array) to capitalize every string in array, even if they are nested within lists or other data structures.
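To illustrate what "even if they are nested" means, here is a pure-Python stand-in for the intended semantics. It is only a sketch: capitalize_nested is a hypothetical helper, not the Awkward implementation, and it assumes Python's own str.capitalize as the per-string operation. The planned ak.str.capitalize would do the equivalent in one vectorized call rather than by recursion.

```python
def capitalize_nested(data):
    """Capitalize every string inside nested lists and dicts.

    Hypothetical helper for illustration only; it mimics the *result*
    described for ak.str.capitalize, not how Awkward computes it.
    """
    if isinstance(data, str):
        return data.capitalize()
    if isinstance(data, list):
        return [capitalize_nested(item) for item in data]
    if isinstance(data, dict):
        return {key: capitalize_nested(value) for key, value in data.items()}
    return data  # numbers, None, etc. pass through unchanged


# Strings at different nesting depths, including inside a record-like dict.
nested = [["hello", "world"], [], [{"name": "awkward", "n": 1}]]
result = capitalize_nested(nested)
print(result)
```

The structure (list lengths, field names, non-string values) is untouched; only the strings change.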
However, the plan is not to make those functions available within Numba-JIT-compiled code, just as the ak.* functions are not available in Numba-JIT-compiled code, either. The design philosophy is “in Python, you use vectorized functions (e.g. ak.str.capitalize) and in Numba-compiled code, you use imperative loops.” Actually implementing ak.str.* functions in Numba-compiled code would be a giant project, since it could not share any implementations with the vectorized ones. The same is true of Numba’s implementation of NumPy functions: they’re all reimplementations. If they were calls into NumPy’s C code (impossible for some NumPy functions, which are written in Python that calls other NumPy C functions), then LLVM would see each one as an external function pointer and would not be able to optimize around it, which undermines the purpose of using Numba. And as you can see from the Numba project, reimplementing every NumPy function in Numba-lowered code is a big project!
And from my limited experience with the interaction between C++ and Python code, it’s hard to maintain good performance with frequent type conversions at the boundary between C++ and Python.
I guess you know that already! In Python, the ak.* and future ak.str.* functions operate on columnar data that is already in arrays (not C++ objects), with a single call to a vectorized kernel that iterates over the columnar data purely in C. In Numba, the ak.str.* functions won’t be implemented, but loops over Awkward data are accelerated, and you’re encouraged to write for-loop style implementations to do what you need to do. Neither of these involves more than one pass through the interface between Python and compiled code (C or Numba), unless you count Numba’s implementation of strings via the CPython API to be a step through the interface.
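As a rough picture of what "columnar data" means for strings, the sketch below stores a list of strings as one flat byte buffer plus a list of offsets, and a single loop (standing in for the compiled kernel) processes all of them. The names to_columnar, capitalize_kernel and from_columnar are made up for this illustration, the kernel only handles ASCII letters, and a real kernel would run in compiled code over array buffers rather than over Python lists.

```python
def to_columnar(strings):
    """Pack strings into (offsets, content): one flat UTF-8 buffer plus boundaries."""
    offsets = [0]
    content = bytearray()
    for s in strings:
        content.extend(s.encode("utf-8"))
        offsets.append(len(content))
    return offsets, bytes(content)


def capitalize_kernel(offsets, content):
    """One pass over the flat buffer; an ASCII-only stand-in for a compiled kernel."""
    out = bytearray(content)
    for i in range(len(offsets) - 1):
        start, stop = offsets[i], offsets[i + 1]
        for j in range(start, stop):
            b = out[j]
            if j == start and 0x61 <= b <= 0x7A:    # first byte: a-z -> A-Z
                out[j] = b - 32
            elif j > start and 0x41 <= b <= 0x5A:   # later bytes: A-Z -> a-z
                out[j] = b + 32
    return offsets, bytes(out)  # offsets are unchanged: byte lengths stay the same


def from_columnar(offsets, content):
    """Unpack (offsets, content) back into a list of Python strings."""
    return [
        content[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(len(offsets) - 1)
    ]


offsets, content = to_columnar(["hello", "", "wORLD"])
new_offsets, new_content = capitalize_kernel(offsets, content)
print(offsets, content)
print(from_columnar(new_offsets, new_content))
```

The point of the layout is that the boundary is crossed once per call (buffers in, buffers out), never once per string, so no per-element type conversion happens.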
So I guess your aim is to use only C and Python in Awkward 2.0?
Awkward 1.x was the same way: each vectorized ak.* function made only one step through the Python-C++ interface. What differs is that some data types and metadata were handled on the C++ side in Awkward 1.x, and all of that has been moved to Python in Awkward 2.x. The motivation isn’t performance (although all indications so far find that the startup time for each ak.* function is faster in Awkward 2.x… not the part that scales with array size, which is what we care about more). The motivation is compatibility with other Python libraries that need to see more of what Awkward is doing; C++ code is a black box to Python. Examples include Dask, which wants to build an execution graph, so it needs to know how to predict the data types of Awkward operations, and JAX autodiff, which needs to pass a Tracer over all low-level array (1-dimensional array buffer) operations. The details are here: ACAT 2021, Lessons learned in Python-C++ integration.