Significant speedup with pyOMP

pauljurczak · May 21, 2026, 4:03pm

This simple function shows a speedup of more than 2x (single-threaded) when compiled with pyOMP rather than plain Numba:

@mp.njit(fastmath=True)
def f0(nSteps):
  step = 1.0/nSteps
  sum = 0.0

  with openmp("parallel for reduction(+:sum) schedule(static)"):
    for j in range(nSteps):
      x = ((j-1)-0.5)*step
      sum += 4.0/(1.0+x*x)

  pi = step*sum
  return pi
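
For readers unfamiliar with the clause, here is a minimal pure-Python sketch of what `reduction(+:sum) schedule(static)` does conceptually: the iteration space is cut into contiguous chunks, each thread accumulates a private partial sum, and the partial sums are added together at the end. The name `pi_midpoint` and the `n_threads` parameter are illustrative, not part of pyOMP, and the sketch assumes the textbook midpoint `x = (j + 0.5)*step` for a zero-based loop, which differs from the offset used in the post.

```python
import math

def pi_midpoint(n_steps, n_threads=4):
    """Serially emulate a static-schedule parallel loop with a + reduction."""
    step = 1.0 / n_steps
    # schedule(static): contiguous chunks of roughly equal size, one per thread.
    chunk = (n_steps + n_threads - 1) // n_threads
    partial = []
    for t in range(n_threads):
        local_sum = 0.0  # reduction(+:sum): each thread starts with a private 0
        for j in range(t * chunk, min((t + 1) * chunk, n_steps)):
            x = (j + 0.5) * step  # midpoint of the j-th subinterval (assumption)
            local_sum += 4.0 / (1.0 + x * x)
        partial.append(local_sum)
    # The private copies are combined with + when the parallel region ends.
    return step * sum(partial)

print(abs(pi_midpoint(1000) - math.pi))
```

The result does not depend on the number of emulated threads beyond floating-point rounding, which is what makes the reduction safe to parallelize.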

I used Python 3.14.5, Numba 0.63.1 and pyOMP 0.5.1 on an amd64 CPU. Interestingly, the code generated by the pyOMP version uses XMM and YMM registers 202 and 147 times, respectively. The plain Numba version uses only XMM registers, 51 times, and its performance stays the same with the newer Numba 0.65.1. Full code is here: python-benchmarks/so-4-omp.py at main · pauljurczak/python-benchmarks · GitHub.
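
One way such register counts could be reproduced is to count textual occurrences of register names in the assembly listing. The sketch below is an assumption about the method, not necessarily how the numbers above were obtained; `count_simd_registers` and the sample listing are hypothetical. For a plain Numba function, the assembly text is available from the dispatcher's `inspect_asm()` method, which returns a dict keyed by signature; whether a pyOMP-compiled function exposes the same method is an assumption worth checking.

```python
import re

def count_simd_registers(asm):
    """Count textual occurrences of xmm/ymm register names in x86-64 assembly."""
    counts = {"xmm": 0, "ymm": 0}
    for kind in re.findall(r"\b([xy]mm)\d+\b", asm.lower()):
        counts[kind] += 1
    return counts

# Hypothetical assembly fragment, used only to demonstrate the counter.
# With Numba, the text could come from e.g. next(iter(f0.inspect_asm().values())).
sample = """
    vmovupd (%rdi), %ymm0
    vaddpd  %ymm0, %ymm1, %ymm1
    vaddsd  %xmm2, %xmm3, %xmm3
"""
print(count_simd_registers(sample))  # {'xmm': 3, 'ymm': 4}
```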

Unfortunately, the speedup on an arm64 CPU is negligible.

kh-abd-kh · May 25, 2026, 11:47pm

Thank you very much. So now SIMD can be used through pyOMP.
I shall test it, thanks.
Just to be sure, a quick question:
numba.njit(parallel=True) with prange is broken when pyOMP is loaded, correct?
But numba.njit with a plain range still works?

pauljurczak · May 26, 2026, 2:09am

I don’t think you can mix Numba and OpenMP parallel expressions, but I’m not an expert. Choose one or the other.

kh-abd-kh · May 26, 2026, 8:32am

OK, it was just a quick test. No problem.
Something I have always missed in Numba is
hybrid offloading: using the CPU and GPU together at the same time.
Even a “naive” half/half split of the job, with no dynamic chunking,
just naively dividing the job between the CPU and GPU.
Is something like this ever possible?

pauljurczak · May 26, 2026, 9:00pm

OpenMP has a task concept. Tasks can be run in parallel, some on the host, some on the devices. I don’t know if pyOMP supports this. Test it out and let us know if it does.
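
To make the idea concrete without relying on any pyOMP feature, here is a minimal pure-Python sketch of a naive static split: the data is cut once and the two parts are submitted to two workers that run concurrently. `host_kernel`, `device_kernel` and `run_split` are hypothetical names; the "device" function is only a stand-in, since a real one would launch a GPU kernel and copy the result back.

```python
from concurrent.futures import ThreadPoolExecutor

def host_kernel(chunk):
    # Hypothetical stand-in for a CPU implementation.
    return sum(v * v for v in chunk)

def device_kernel(chunk):
    # Hypothetical stand-in for a GPU implementation.
    return sum(v * v for v in chunk)

def run_split(data, host_fraction=0.5):
    """Statically split data between two backends and run both concurrently."""
    cut = int(len(data) * host_fraction)
    with ThreadPoolExecutor(max_workers=2) as pool:
        host = pool.submit(host_kernel, data[:cut])
        device = pool.submit(device_kernel, data[cut:])
        # Combine the two partial results once both workers have finished.
        return host.result() + device.result()

total = run_split(list(range(10)))
print(total)  # 285
```

With pure-Python kernels the two threads do not actually overlap because of the GIL; the pattern only pays off when both calls release it, for example a compiled CPU kernel and an asynchronous GPU launch.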

docharri · June 4, 2026, 12:22pm

@kh-abd-kh

Something I have always missed in Numba is
hybrid offloading: using the CPU and GPU together at the same time.
Even a “naive” half/half split of the job, with no dynamic chunking,
just naively dividing the job between the CPU and GPU.
Is something like this ever possible?

I’ve seen some research in this direction (with working code) at my company. They were looking for a good use case. If you have one, feel free to drop a link here or send me a direct message.

kh-abd-kh · June 6, 2026, 12:55am

I can do it case by case, but using C:
a. compile the library for both the CPU and the GPU
b. use C fork/join by hand
But I want a universal, dynamic solution.
I am doing it for something called the Riemann theta function, in the multi-genus case above g=9.
Up to g=9 the GPU alone is OK, but I hope to get further by distributing the load over the CPU and GPU.

Doing it by hand in C is possible, but a universal wrapper with dynamic load distribution
is the challenge. Of course, it first needs a pre-run for timing.
Thanks anyway.
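
The pre-run idea can be sketched in a few lines of Python: time each backend on a small sample, then split the full job in proportion to the measured throughput so that both sides finish at about the same time. `measure_throughput` and `split_point` are hypothetical helper names, and the sketch assumes the cost per item is roughly uniform, which may not hold for every workload.

```python
import time

def measure_throughput(kernel, sample):
    """Items per second for one backend on a small pre-run sample."""
    start = time.perf_counter()
    kernel(sample)
    elapsed = time.perf_counter() - start
    return len(sample) / max(elapsed, 1e-9)  # guard against a zero reading

def split_point(n_items, cpu_rate, gpu_rate):
    """Index where the CPU share ends so both backends finish together."""
    return round(n_items * cpu_rate / (cpu_rate + gpu_rate))

# Example: a GPU measured at three times the CPU rate gets 3/4 of the items.
print(split_point(1000, 1.0, 3.0))  # 250
```

A single pre-run gives a fixed ratio; a truly dynamic scheme would instead hand out small chunks on demand, at the cost of more launch overhead.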
