I’d like to present charex, a project that may be useful for the Numba ecosystem. The package extends Numba’s capabilities by making NumPy’s string operations usable inside Numba-compiled functions.
Overview
charex stands for “Character Extensions”. It gives Numba-compiled functions access to the string comparison operations and the occurrence and property methods in NumPy’s char module, all listed below. The package is imported as usual:
import charex
For those interested in trying out charex, you can find the package and more details on GitHub at the charex repository. The benchmarks and tests are also available within the repository, providing insights into the package’s performance.
Comparison operations:
char.equal
char.not_equal
char.greater_equal
char.less_equal
char.greater
char.less
char.compare_chararrays
Occurrence and Property information:
char.count
char.endswith
char.startswith
char.find
char.rfind
char.index
char.rindex
char.str_len
char.isalpha
char.isalnum
char.isspace
char.isdecimal
char.isdigit
char.isnumeric
char.istitle
char.isupper
char.islower
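For readers less familiar with these functions, here is a small sketch of what a few of them return in plain NumPy. It deliberately uses NumPy only; the array names are made up for illustration, and nothing here shows charex's own internals. Per the description above, charex is meant to make the same calls usable inside Numba-compiled functions.

```python
import numpy as np

# Illustrative inputs (hypothetical, not taken from the charex test suite).
words = np.array(["numba", "numpy", "charex"])
other = np.array(["numba", "scipy", "charex"])

eq = np.char.equal(words, other)            # element-wise string equality
starts = np.char.startswith(words, "num")   # prefix test per element
counts = np.char.count(words, "n")          # non-overlapping occurrences of "n"
pos = np.char.find(words, "m")              # lowest index of "m", or -1 if absent
lengths = np.char.str_len(words)            # length of each element
alpha = np.char.isalpha(words)              # True where every character is alphabetic

print(eq, starts, counts, pos, lengths, alpha)
```

Each call returns an array with the same shape as the input: booleans for the comparisons and predicates, integers for the occurrence and length functions.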
Further Information
I’ve tested charex to ensure compatibility and performance, with the latest tests conducted using Numba 0.59.0 and NumPy 1.26.3. Please feel free to contribute, validate, provide feedback, and discuss how charex can be further improved. The aim is to gather feedback and further refine the project for integration into Numba (see: #8500).
Kudos to you, @nmehran, for introducing charex, a great extension that enhances Numba’s capabilities with NumPy string operations.
Charex currently aligns with the existing NumPy “char” namespace. As outlined in NEP-55, NumPy will gradually transition from the “char” namespace to the “strings” namespace, and the new namespace may arrive as early as NumPy 2.0, so the timing might be a bit unfortunate.
Quick update, roughly two years later: charex has now been refreshed for the current Numba / NumPy string landscape.
The short version: the NEP-55 / np.strings concern that @Oyibo raised was exactly right. As of charex 0.5.0, the project is no longer limited to NumPy’s np.char namespace.
Updates compatibility to Numba 0.65.1, llvmlite 0.47.x, and Python >=3.10,<3.15.
Supports NumPy >=1.22,<1.27 and >=2.0,<2.5.
Keeps np.char support for fixed-width S / U arrays.
Adds np.strings support for fixed-width S / U.
Adds NumPy 2.x StringDType support through np.strings, including variable-width strings and na_object variants.
Extends the read-only catalog to scalar, 0-D, 1-D, N-D, strided, reversed, read-only, zero-stride, empty, and broadcast-compatible shapes.
Preserves the semantic differences between np.char and np.strings, especially around trailing whitespace / NUL behavior.
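To make the trailing-whitespace point concrete, here is a small NumPy-only sketch of the difference that has to be preserved. It assumes the documented NumPy behavior that np.char.equal strips trailing whitespace before comparing, while np.strings.equal (NumPy 2.0 and later) compares the strings as stored. The array names are illustrative.

```python
import numpy as np

# Pairs that differ only by trailing spaces.
padded = np.array(["abc  ", "abc"])
plain = np.array(["abc", "abc "])

# np.char.equal ignores trailing whitespace, so both pairs compare equal.
char_eq = np.char.equal(padded, plain)

# compare_chararrays exposes the same switch explicitly via its rstrip argument.
compare = np.char.compare_chararrays(padded, plain, "==", False)

# np.strings only exists in NumPy 2.0+, so guard the call.
has_strings = hasattr(np, "strings")
if has_strings:
    # No stripping here: the trailing spaces make the strings different.
    strings_eq = np.strings.equal(padded, plain)
    print(char_eq, compare, strings_eq)
else:
    print(char_eq, compare)
```

A Numba-side implementation that routed both namespaces through one comparison routine would get one of the two results wrong, which is why the distinction is worth auditing.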
The supported read-only catalog now includes comparisons, occurrence/search operations, str_len, predicates, and np.char.compare_chararrays.
A full shape/behavior audit currently reports:
1702 audit rows
1702 matching rows
0 mismatches
0 NumPy-accepted cases rejected by charex
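As a rough idea of what a single family of audit rows might look like, the sketch below runs one operation over several of the array layouts mentioned above and counts mismatches against a reference. This is not the charex audit itself: the helper names are hypothetical, and a pure-Python reference stands in for the Numba-compiled function that a real audit would compare against NumPy.

```python
import numpy as np

def layouts(base):
    """Yield (label, array) pairs covering several layouts of a 1-D string array."""
    yield "1-D", base
    yield "strided", base[::2]
    yield "reversed", base[::-1]
    yield "N-D", base.reshape(2, -1)
    yield "zero-stride", np.broadcast_to(base[:1], (3,))
    yield "empty", base[:0]
    yield "0-D", base[0:1].reshape(())
    read_only = base.copy()
    read_only.setflags(write=False)
    yield "read-only", read_only

def reference_str_len(arr):
    """Hypothetical stand-in for the implementation under test: len() per element."""
    flat = [len(s) for s in arr.ravel().tolist()]
    return np.array(flat, dtype=np.int64).reshape(arr.shape)

base = np.array(["a", "bb ", "", "dddd"])

# One row per layout: does NumPy's result match the candidate's result?
rows = [
    (label, np.array_equal(np.char.str_len(arr), reference_str_len(arr)))
    for label, arr in layouts(base)
]
mismatches = [label for label, ok in rows if not ok]
print(len(rows), "rows,", len(mismatches), "mismatches")
```

The reported totals would then simply be the number of rows, the number that matched, and the number that did not, summed over every operation, dtype, and layout.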
The current benchmark matrix covers 135 fixed-width and StringDType cases on Python 3.12.8, NumPy 2.4.6, Numba 0.65.1, and llvmlite 0.47.0. In that matrix, charex ranges from 1.02x to 6.51x NumPy speed, with a 1.60x median.
Still out of scope for now, though feasible: transformation operations such as replace, case conversion, strip, pad, join, split, encode, and decode.
The project is much closer to the original goal now: a focused compatibility and performance layer for NumPy string operations in Numba, while also keeping the implementation and audit results available for official Numba string-support integration.