Charex - Enhance Numba with NumPy's string operations

I'd like to present charex, a project that may be useful for the Numba ecosystem. The package extends Numba by making NumPy’s string operations usable inside Numba-compiled functions.

Overview

charex stands for “Character Extensions”. It gives Numba-compiled code access to the string comparison operations and the occurrence and property methods in NumPy’s char module:

import charex

For those interested in trying out charex, you can find the package and more details on GitHub at the charex repository. The benchmarks and tests are also available within the repository, providing insights into the package’s performance.

Comparison operations:

  • char.equal
  • char.not_equal
  • char.greater_equal
  • char.less_equal
  • char.greater
  • char.less
  • char.compare_chararrays

Occurrence and Property information:

  • char.count
  • char.endswith
  • char.startswith
  • char.find
  • char.rfind
  • char.index
  • char.rindex
  • char.str_len
  • char.isalpha
  • char.isalnum
  • char.isspace
  • char.isdecimal
  • char.isdigit
  • char.isnumeric
  • char.istitle
  • char.isupper
  • char.islower
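
As a rough illustration of the kind of function this targets, here is a minimal sketch that combines two of the listed operations. The function name count_matches and the sample data are made up for this example. The NumPy call is the reference behaviour; the jitted variant assumes that importing charex is enough to register the Numba overloads, and it is only created when both numba and charex are importable.

```python
import numpy as np


def count_matches(words, prefix, needle):
    """Count entries that start with `prefix` and also contain `needle`."""
    starts = np.char.startswith(words, prefix)
    found = np.char.find(words, needle) >= 0
    return np.sum(starts & found)


words = np.array(["numba", "numpy", "nutmeg", "charex"])

# Plain NumPy result: the behaviour a jitted version is expected to match.
expected = int(count_matches(words, "nu", "m"))
print(expected)

try:
    import numba
    import charex  # noqa: F401  (assumption: the import registers the overloads)

    # njit compiles lazily, so nothing is compiled until the first call.
    count_matches_jit = numba.njit(count_matches)
except ImportError:
    # numba or charex is not installed; only the NumPy path is available.
    count_matches_jit = None
```

If count_matches_jit is not None, calling count_matches_jit(words, "nu", "m") should give the same count as the NumPy version.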

Further Information

I’ve tested charex for compatibility and performance, most recently with Numba 0.59.0 and NumPy 1.26.3. Please feel free to contribute, validate the results, and discuss how charex can be improved. The aim is to gather feedback and refine the project for integration into Numba (see: #8500).

Kudos to you, @nmehran, for introducing charex, a great extension that enhances Numba’s capabilities with NumPy string operations.

Charex currently aligns with the existing NumPy “char” namespace. As outlined in NEP-55, NumPy will gradually move from the “char” namespace to the “strings” namespace, and the new namespace may arrive as early as NumPy 2.0. The timing might be a bit unfortunate.

Thank you for your valuable contribution.

Hi @Oyibo, I appreciate your insight with regard to NEP-55.

Quick update, roughly two years later: charex has now been refreshed for the current Numba / NumPy string landscape.

The short version: the NEP-55 / np.strings concern that @Oyibo raised was exactly right. As of charex 0.5.0, the project is no longer limited to NumPy’s np.char namespace.

Release: charex 0.5.0 (nmehran/charex on GitHub)

Notable changes since the original 2024 version:

  • Updated compatibility to Numba 0.65.1, llvmlite 0.47.x, and Python >=3.10, <3.15.
  • Supports NumPy >=1.22, <1.27 and >=2.0, <2.5.
  • Keeps np.char support for fixed-width S / U arrays.
  • Adds np.strings support for fixed-width S / U.
  • Adds NumPy 2.x StringDType support through np.strings, including variable-width strings and na_object variants.
  • Extends the read-only catalog to scalar, 0-D, 1-D, N-D, strided, reversed, read-only, zero-stride, empty, and broadcast-compatible shapes.
  • Preserves the semantic differences between np.char and np.strings, especially around trailing whitespace / NUL behavior.
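
To make the last point concrete, here is a small plain-NumPy sketch of the trailing-whitespace difference. It relies on the documented behaviour that np.char.equal strips trailing whitespace before comparing, while np.strings.equal is the ordinary elementwise comparison. The sample arrays are invented for the example, and the np.strings part is skipped on NumPy versions that do not ship that namespace.

```python
import numpy as np

padded = np.array(["abc  ", "abc"])  # first entry has trailing spaces
plain = np.array(["abc", "abd"])

# np.char.equal strips trailing whitespace before comparing.
char_eq = np.char.equal(padded, plain)

# np.strings only exists in NumPy 2.x; fall back to None on older versions.
strings_mod = getattr(np, "strings", None)
strings_eq = strings_mod.equal(padded, plain) if strings_mod is not None else None

print("np.char.equal   :", char_eq)
print("np.strings.equal:", strings_eq)
```

The first pair compares equal under np.char and unequal under np.strings, so a Numba implementation has to keep the two namespaces separate rather than routing both through one comparison.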

The supported read-only catalog now includes comparisons, occurrence/search operations, str_len, predicates, and np.char.compare_chararrays.
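
For readers who have not used the variable-width dtype yet, the following sketch shows the NumPy-side call for one catalog entry, str_len, on both a fixed-width U array and a StringDType array. This is plain NumPy and not charex code; the sample words are made up, and the StringDType branch only runs where NumPy provides np.strings and np.dtypes.StringDType.

```python
import numpy as np

words = ["a", "numba", "variable-width strings"]

# Fixed-width unicode array: every element is stored at the longest width.
fixed = np.array(words)
fixed_lengths = np.char.str_len(fixed)

# Variable-width strings need NumPy 2.x (np.strings and StringDType).
string_dtype = getattr(getattr(np, "dtypes", None), "StringDType", None)
if string_dtype is not None and hasattr(np, "strings"):
    variable = np.array(words, dtype=string_dtype())
    variable_lengths = np.strings.str_len(variable)
else:
    variable_lengths = None

print(fixed.dtype, fixed_lengths)
print(variable_lengths)
```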

A full shape/behavior audit currently reports:

  • 1702 audit rows
  • 1702 matching rows
  • 0 mismatches
  • 0 NumPy-accepted cases rejected by charex
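
To give an idea of what one audit row checks, here is a hypothetical, much smaller harness. It is not the charex audit itself: candidate_find is a pure-Python stand-in for the implementation under test, and the case names and data are invented. Each row compares shape and values against the NumPy reference for one input layout.

```python
import numpy as np


def reference_find(a, sub):
    """NumPy reference result, normalised to an ndarray."""
    return np.asarray(np.char.find(a, sub))


def candidate_find(a, sub):
    """Stand-in for the implementation under audit (plain Python loop)."""
    a = np.asarray(a)
    out = np.empty(a.shape, dtype=np.int64)
    for idx in np.ndindex(a.shape):
        out[idx] = str(a[idx]).find(sub)
    return out


base = np.array(["numba", "numpy", "charex", "strings"])
cases = {
    "0-D": np.array("numba"),
    "1-D": base,
    "2-D": base.reshape(2, 2),
    "strided": base[::2],
    "reversed": base[::-1],
    "empty": base[:0],
}

rows = []
for name, arr in cases.items():
    ref = reference_find(arr, "um")
    got = candidate_find(arr, "um")
    # A row matches only if both the shape and every value agree.
    rows.append((name, ref.shape == got.shape and np.array_equal(ref, got)))

matching = sum(ok for _, ok in rows)
mismatches = len(rows) - matching
print(f"{len(rows)} audit rows, {matching} matching, {mismatches} mismatches")
```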

The current benchmark matrix covers 135 fixed-width and StringDType cases on Python 3.12.8, NumPy 2.4.6, Numba 0.65.1, and llvmlite 0.47.0. Across that matrix, charex runs between 1.02x and 6.51x as fast as NumPy, with a median of 1.60x.

Still out of scope for now, though feasible: transformation operations such as replace, case conversion, strip, pad, join, split, encode, and decode.

The project is much closer to the original goal now: a focused compatibility and performance layer for NumPy string operations in Numba, while also keeping the implementation and audit results available for official Numba string-support integration.