Hi,
I work with the CPU compiler team at NVIDIA, and we are interested in improving Numba’s performance on CPUs. We undertook the project of upgrading Numba’s LLVM dependency from LLVM 14 to LLVM 20 (some pull requests are currently under review), which resulted in approximately a 2× speedup on numba-benchmarks. Going forward, we want to keep contributing to making Numba faster as a compiler team, and are therefore looking for workloads that directly leverage Numba for Python acceleration on CPUs. I would greatly appreciate any guidance on the questions below.
- Are there workloads that use Numba for ML inference or HPC on CPUs?
- Are there other users of Numba on CPUs who would want compiler-related features (e.g. fp16 or fp8 support) or more performance out of Numba?
- Benchmarks (either standalone or embedded in workloads) where we can experiment with Numba?
- Any other areas where we, as a compiler team, can chip in to improve Numba or LLVM (leveraging Numba’s Python frontend)?
Thanks
Yashwant 
Hi, @yashssh! I don’t know if this falls within what you’re looking for, but Numba is used extensively in the stumpy package for both CPU and GPU workloads. Our main goal is to improve the performance of computing something called a “matrix profile”, and this can be done via:
import numpy as np
import stumpy
m = 50
T = np.random.rand(100_000)
mp = stumpy.stump(T, m) # Compute the matrix profile for time series, `T`
It would be great to speed up the computational time for longer time series.
I’m sure automatic differentiation would be a very nice addition to Numba, especially for machine learning topics. If you’re curious, there’s already been some work done on this using Enzyme… Autodiff support in Numba · Issue #8565 · numba/numba
Thanks for your contributions to Numba! Will your LLVM upgrade effort enable ORC JIT soon? Asking since lazy compilation in ORC JIT can potentially be very beneficial.
If you are interested in distributed high-performance data processing (and data prep compute for AI/ML), Bodo JIT and BodoSQL are built on top of Numba and can greatly benefit from Numba improvements. The main problem is Numba’s compilation time, which can be very slow (sometimes up to 30 minutes for very large workloads). Ideally, we need some kind of AOT compilation support that works seamlessly with regular overloads and is easy to package, like Cython. The compiler itself can benefit from acceleration too, since not everything can be compiled ahead of time.
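For reference, Numba’s existing AOT path (numba.pycc) covers only a small part of what we need and has been deprecated upstream, but it shows the kind of packaging workflow we’d like to see generalized; a minimal sketch:

from numba.pycc import CC

cc = CC('my_module')  # name of the generated extension module

@cc.export('mult', 'f8(f8, f8)')
def mult(a, b):
    return a * b

if __name__ == '__main__':
    cc.compile()  # builds my_module as an importable extension, much like a Cython build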
There are several Bodo open-source benchmarks you can use for your work, such as TPC-H and NYC Taxi. There are also Spark/Dask/Modin-Ray versions to compare performance against (see the blogs for previous results). I would be happy to help if you have any questions.
Thanks @ErolBa! It does look interesting. Do you happen to know of any projects that want Numba to support autodiff?
@gmarkall I see the linked GitHub thread has died down a bit; do you know if any work has been started on that front? I am happy to move this discussion to the GitHub thread if you think that’s more appropriate.
Will your LLVM upgrade effort enable ORC JIT soon? Asking since lazy compilation in ORC JIT can potentially be very beneficial.
I suppose ORC JIT migration work can be started once the LLVM upgrades are merged, but unfortunately the work I have submitted for review still uses MCJIT.
If you are interested in distributed high performance data processing (and data prep compute for AI/ML), Bodo JIT and BodoSQL are built on top of Numba and can greatly benefit from Numba improvements.
Thank you! Let me take a look around; I’ll reply back once I have a deeper understanding.
The main problem is Numba’s compilation time, which can be very slow (sometimes up to 30 minutes for very large workloads).
Hmm, that might be a bug in Numba; I remember seeing similar threads before. Can you point me to, or share, a reproducer for a function with questionable compile time?
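In case it helps, this is the kind of minimal pattern I’d use to isolate compile time from run time (a rough sketch with a toy kernel; the real reproducer would of course use your actual workload):

import time
import numpy as np
from numba import njit, float64

@njit
def toy_kernel(a):
    s = 0.0
    for x in a:
        s += x * x
    return s

t0 = time.perf_counter()
toy_kernel.compile((float64[:],))  # trigger compilation for this signature without running it
print(f"compile time: {time.perf_counter() - t0:.3f}s")

a = np.random.rand(1_000_000)
t0 = time.perf_counter()
toy_kernel(a)  # already compiled above, so this measures pure run time
print(f"run time: {time.perf_counter() - t0:.3f}s")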
Thanks again!!
Thanks for the responses. Looking forward to hearing more from you.
Hmm, that might be a bug in Numba; I remember seeing similar threads before. Can you point me to, or share, a reproducer for a function with questionable compile time?
The generated code from SQL was large, and we didn’t find anything wrong in our profiling. I don’t have an exact reproducer, but TPC-H queries (both SQL and Python) are typically good for benchmarking. There are lots of examples on the net.
The program I work on daily takes on the order of an hour to compile.
I understand the compilation is entirely single-threaded, so perhaps that’s part of a potential solution.
I don’t know of any specific project off the top of my head, but I’m sure autodiff has an important use case in both machine learning pipelines and some scientific applications. It’s one of the reasons why JAX (JAX: High performance array computing — JAX documentation) is such a popular library.
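To illustrate the kind of thing JAX users get today, and that autodiff support in Numba could bring to @njit code, here is a minimal JAX sketch (not anything Numba currently supports):

import jax
import jax.numpy as jnp

def loss(w, x, y):
    pred = x @ w
    return jnp.mean((pred - y) ** 2)  # mean squared error

grad_loss = jax.grad(loss)  # gradient of loss with respect to w, generated automatically

w = jnp.ones(3)
x = jnp.arange(12.0).reshape(4, 3)
y = jnp.ones(4)
print(grad_loss(w, x, y))  # gradient array of shape (3,)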
I can say that I would definitely be interested in fp16 and fp8 support. There are certainly compute tasks where fp16 would be accurate enough, and the performance and memory benefits would be noticeable. Also, nearest-neighbor search with pynndescent would benefit from being able to use fp16 and fp8 for some level of quantization. Since pynndescent is used in UMAP and openTSNE (among others), that could bring significant benefits to dimension-reduction workloads as well.
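To make the quantization idea concrete, the pattern I have in mind is roughly the following (a plain-NumPy sketch with made-up sizes; the hope is that native fp16/fp8 support would let a Numba kernel do the distance computation without the upcast copies):

import numpy as np

# Store the database at half precision to halve the memory footprint,
# upcast to float32 only for the arithmetic.
data = np.random.rand(10_000, 64).astype(np.float16)
query = np.random.rand(64).astype(np.float16)

diff = data.astype(np.float32) - query.astype(np.float32)
dists = np.einsum('ij,ij->i', diff, diff)  # squared Euclidean distances
nearest = int(np.argmin(dists))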
Hello @yashssh. This is admittedly a very niche application, but a package I co-maintain, Pyleoclim, uses Numba to speed up some loop-heavy computations associated with the weighted wavelet Z-transform (WWZ) method. To benchmark:
import pyleoclim as pyleo
ts = pyleo.utils.load_dataset('SOI').standardize()
psd = ts.spectral(method='wwz')
psd_sig = psd.signif_test(number=10) # can increase number to increase the computational load and make speedup more apparent
fig, ax = psd_sig.plot(title='PSD using WWZ method')
The part of the code that actually uses Numba is here.
It may very well be that Numba (with these particular options) is not the best choice for speeding up this computation. If anyone has other suggestions, I am all ears.