Feature request about supporting Arrow in Numba

Hi, guys.

I guess we don’t have any specific plan to support Arrow-related types natively in Numba. At the same time, more and more projects depend on Arrow: e.g., Pandas has ArrowStringArray, Vaex supports some Arrow formats, Ray heavily uses a vendored Arrow, and so on. So I think maybe we could discuss whether Numba should support it, its benefits and disadvantages, and what internal limitations in Numba hinder us from easily supporting a brand-new, complicated set of types and operations like Arrow’s.

In a sense, it is supported. The Awkward Array library has the same columnar data types as Arrow, so an Arrow → Awkward conversion is zero-copy[1], and Awkward Arrays are a registered Numba extension, so you can iterate over them in JIT-compiled functions.

Here’s an example, starting from a Pandas ArrowStringArray:

>>> import numpy as np
>>> import pandas as pd
>>> import pyarrow as pa
>>> import awkward as ak
>>> import numba as nb

>>> x = pd.array(
...     ['This is', 'some text', None, 'data.'], dtype="string[pyarrow]"
... )

>>> pa.array(x)
<pyarrow.lib.ChunkedArray object at 0x7fea6c47d770>
[
  [
    "This is",
    "some text",
    null,
    "data."
  ]
]

>>> ak.from_arrow(pa.array(x))
<Array ['This is', 'some text', None, 'data.'] type='4 * ?string'>

>>> @nb.njit
... def f(strings):
...     out = np.zeros(len(strings), dtype=np.int64)
...     for i, s in enumerate(strings):
...         if s is not None:
...             out[i] = ord(s[0])
...     return out

>>> f(ak.from_arrow(pa.array(x)))
array([ 84, 115,   0, 100])

Strings are a very specific data type; Arrow can represent nested structs, variable length lists, missing data, and heterogeneous data, too. These data types can be zero-copy converted[1:1] to Awkward Arrays and those Awkward Arrays can be iterated over in Numba, which means iterating over the original Arrow buffers. In the Numba function, Arrow structs (Awkward records) appear as objects with attributes, variable-length lists appear as sequence types, and missing values as None.[2]

Going the other way—producing non-rectilinear data structures in Numba and sending them to an Arrow buffer—can be done using Awkward’s ArrayBuilder and ak.to_arrow, but it’s not as slick and it requires more data copies.

This capability doesn’t reside within the Numba project; a user has to piece these things together with different libraries, but it is possible.


  1. Well, there’s exactly one exception to zero-copy conversion from Arrow: Arrow’s sparse unions have to be converted into dense unions first, so one index buffer per sparse union needs to be allocated and filled.

  2. Awkward Arrays with union types can’t be iterated over in Numba-compiled functions, so that’s another limitation, also related to unions. Generally speaking, union types are the rough edge of support.


Awesome! I will look into the Awkward project very carefully. I knew it existed, but didn’t notice how much it has already done! Thrilled!

BTW, I am the one who asked this question (person(dlee992) == person(Kang_Ge), :slight_smile: ); I just registered another account to keep my name consistent with my GitHub username.

I basically know Awkward can convert between the Arrow data format and a Numba-understandable native data format. A further question: does Awkward support string-related APIs similar to CPython’s? Do you have a plan, or is it already well supported?

========== Updates =============
I found this doc Home · scikit-hep/awkward Wiki · GitHub, " @martindurant has plans for wrapping Rust’s Unicode-aware string library as vectorized string functions in ak.str.* ."

But I didn’t find any source code in Awkward supporting CPython-unicode-like APIs, so I feel you haven’t implemented this yet.

And from my limited experience with interaction between C++ and Python code, it’s hard to maintain good performance with frequent type conversions across the C++/Python boundary. So I guess that’s why Awkward 2.0 uses only C and Python? @jpivarski

A further question: does Awkward support string-related APIs similar to CPython’s?

When a string is materialized in Numba code (for instance, you have an array of strings and write an expression like array[i] for some integer i), that string is materialized as Numba’s lowered nb.types.string, which is a Python unicode object. Numba provides string operations on that string type, though they are CPython API calls and therefore capture the GIL.

(I just went to check up on it and I couldn’t find out where that was implemented (Present Awkward strings to Numba as Numba strings · Issue #1917 · scikit-hep/awkward · GitHub), which is very strange, but I did verify that it does happen: Awkward strings are presented as strings.)

I found this doc Home · scikit-hep/awkward Wiki · GitHub, " @martindurant has plans for wrapping Rust’s Unicode-aware string library as vectorized string functions in ak.str.* ."

It hasn’t been implemented yet, and the plan is for it to be implemented using pyarrow’s string handling, rather than Rust’s. Also, these would be new ak.str.* functions, such as ak.str.capitalize(array) to capitalize every string in array, even if they are nested within lists or other data structures.

However, the plan is not to make those functions available within Numba-JIT-compiled code, just as the ak.* functions are not available in Numba-JIT-compiled code, either. The design philosophy is “in Python, you use vectorized functions (e.g. ak.fill_none or ak.str.capitalize) and in Numba-compiled code, you use imperative loops.” Actually implementing ak.* and ak.str.* functions in Numba-compiled code would be a giant project, since it could not share any implementations with the vectorized ones. The same is true of Numba’s implementation of NumPy functions: they’re all reimplementations—if they were calls into NumPy’s C code (impossible for some NumPy functions, which are written in Python that calls other NumPy C functions), then LLVM would see it as an external function pointer and would not be able to optimize around it, which undermines the purpose of using Numba. And as you can see from the Numba project, reimplementing every NumPy function in Numba-lowered code is a big project!

And from my limited exprience about the interaction between C++ and python code, it’s hard to maintain a good performance with frequent type conversion between the edges/borders in C++ and Python.

I guess you know that already! In Python, the ak.* and future ak.str.* functions operate on columnar data, already in arrays (not C++ objects), with a single call to a vectorized kernel that iterates over the columnar data purely in C. In Numba, the ak.*/ak.str.* functions won’t be implemented, but loops over Awkward data are accelerated, and you’re encouraged to write for-loop-style implementations of what you need. Neither involves more than one pass through the Python-compiled (C or Numba) interface, unless you count Numba’s implementation of strings via the CPython API as a step through the interface.

So I guess it’s your purpose to only using C and Python in Awkward 2.0 ?

Awkward 1.x was the same way: each vectorized ak.* function made only one step through the Python-C++ interface. What differs is that some data types and metadata were handled on the C++ side in Awkward 1.x, and all of that has been moved to Python in Awkward 2.x. The motivation isn’t performance (although all indications so far find that the startup time for each ak.* function is faster in Awkward 2.x… not the part that scales with array size, which is what we care about more). The motivation is for compatibility with other Python libraries that need to see more of what Awkward is doing; C++ code is a black box to Python. Examples include Dask, which wants to build an execution graph, so it needs to know how to predict the data types of Awkward operations, and JAX autodiff, which needs to pass a Tracer over all low-level array (1-dimensional array buffer) operations. The details are here: ACAT 2021, Lessons learned in Python-C++ integration.


For the record, @dlee992 corrected my misinformation about Numba holding the GIL for Unicode-handling (in the GitHub thread).

UnicodeModel does not hold the GIL, which is definitely better than the alternative. The implementations all seem to be in numba/unicode_support.py at main · numba/numba · GitHub and there’s no match to the string “gil”.

(I was just remembering wrong.)

Yeah, thanks for your helpful reply. @jpivarski

  1. About the string issue, I left a comment on it; we can discuss this on the issue page.

  2. Yeah, indeed, I learned the pain/lessons from the interaction between Numba and C++, rather than Python and C++, but I think the underlying problem is the same. Numba has its own NRT to manage heap memory, and C++ has its own shared_ptr or customized memory-management systems to handle this. And I think your changes in Awkward 2.0 are a good direction for better interaction with third-party libraries and also for performance.

  3. I watched the video quickly, and I am interested in JAX but haven’t gotten my hands dirty with it yet; it’s in my future plans. And indeed, my colleagues are developing a project, Mars (GitHub - mars-project/mars: Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and Python functions.), which does some of the same things as Dask, although I don’t know much about Mars or Dask. Maybe there are some opportunities for interaction between Awkward and Mars as well. So I want to ask: what do you want to do with Dask, for distributed and lazy execution?

It looks interesting! We have a dask-awkward project to put Awkward operations (those ak.* functions) into a Dask graph, and the triple combination of Awkward + Dask + Numba has been tested.

Since Awkward operations can be delayed and run on Dask workers, does that mean that Mars would be able to run them on a variety of backends? (I know that there’s a particular integration for Dask graphs to run on Ray, developed within Anyscale, but Mars looks like it can run them on a lot more backends.)