Significant speedup with pyOMP

This simple function shows over 2x speedup (single threaded) when compiled with pyOMP versus plain Numba:

@mp.njit(fastmath=True)
def f0(nSteps):
  step = 1.0/nSteps
  sum = 0.0

  with openmp("parallel for reduction(+:sum) schedule(static)"):
    for j in range(nSteps):
      x = ((j-1)-0.5)*step
      sum += 4.0/(1.0+x*x)

  pi = step*sum
  return pi

I used Python 3.14.5, Numba 0.63.1 and pyOMP 0.5.1 on an amd64 CPU. Interestingly, the code generated by the pyOMP version references XMM and YMM registers 202 and 147 times respectively. The plain Numba version references only XMM registers, 51 times, and its performance stays the same with the newer version 0.65.1. Full code is here: python-benchmarks/so-4-omp.py at main · pauljurczak/python-benchmarks · GitHub.
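
The post does not say how the register counts were obtained. One way to get comparable numbers is sketched below, assuming the generated assembly is available as a plain string. For a function compiled with plain Numba, `inspect_asm()` on the dispatcher returns a dict with one such string per compiled signature; whether the pyOMP dispatcher exposes the same method is an assumption. The helper name `count_simd_registers` is hypothetical.

```python
import re
from collections import Counter

# Matches SSE/AVX register names such as xmm0 or ymm15,
# in either AT&T (%xmm0) or Intel (xmm0) syntax.
_SIMD_REG = re.compile(r"\b([xy]mm)\d+\b", re.IGNORECASE)


def count_simd_registers(asm: str) -> Counter:
    """Count every textual occurrence of an XMM or YMM register in `asm`."""
    return Counter(m.group(1).lower() for m in _SIMD_REG.finditer(asm))


if __name__ == "__main__":
    # Tiny hand-written listing, only to show the counting.
    sample = (
        "vmovupd (%rdi), %ymm0\n"
        "vaddpd  %ymm0, %ymm1, %ymm1\n"
        "movsd   %xmm0, (%rsi)\n"
    )
    print(count_simd_registers(sample))
```

With a compiled Numba function `f0`, the same helper could be applied to each string in `f0.inspect_asm().values()`. This counts textual occurrences, not distinct instructions, so the totals depend on how the assembly is printed.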

Unfortunately, the speedup on an arm64 CPU is negligible.

Thank you very much, so now SIMD can be used through pyOMP.
I will test it, thanks.
Just to be sure, a quick question:
with pyOMP loaded, numba.njit(parallel=True) with prange is broken, correct?
But numba.njit with a plain range still works?

I don’t think you can mix Numba and OpenMP parallel expressions, but I’m not an expert. Choose one or the other.

OK, it was just a quick test, no problem.
Something I have always missed in Numba is hybrid offloading: using the CPU and GPU together at the same time.
Even a “naive” half/half division of the job between the CPU and GPU, with no dynamic chunking, would do.
Is something like this ever possible?

OpenMP has a task concept. Tasks can be run in parallel, some on the host, some on the devices. I don’t know if pyOMP supports this. Test it out and let us know if it does.

@kh-abd-kh

I’ve seen some research in this direction (with working code) at my company. They were looking for a good use case. If you have one, feel free to drop a link here or send me a direct message.

I can do it case by case, but only in C:
a - compile the library for both CPU and GPU
b - do the fork/join by hand in C
What I want is a universal, dynamic solution.
I am doing this for something called the Riemann theta function, in the multi-genus case with genus higher than g=9.
Up to g=9 the GPU alone is fine, but I hope to get further by distributing the load over the CPU and GPU.

Doing it by hand in C is possible. The challenge is a universal wrapper with dynamic load distribution, which of course first needs a pre-run for timing.
Thanks anyway.
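
For illustration, the naive static variant (a timing pre-run followed by a proportional split) can be sketched in plain Python. This is not the universal dynamic wrapper asked for above, and all names here (`measure_rate`, `split_point`, `run_split`, `cpu_worker`, `gpu_worker`) are hypothetical. Both workers are pure-Python stand-ins; in a real setup they would be a CPU kernel and a GPU kernel that release the GIL while running, which is assumed here, not shown.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_rate(worker, n_probe):
    """Pre-run: estimate items per second for `worker` on a small probe range."""
    t0 = time.perf_counter()
    worker(0, n_probe)
    elapsed = time.perf_counter() - t0
    return n_probe / max(elapsed, 1e-9)


def split_point(n_items, rate_a, rate_b):
    """Index at which to cut the range so both workers should finish together."""
    return round(n_items * rate_a / (rate_a + rate_b))


def run_split(n_items, worker_a, worker_b, cut):
    """Run worker_a on [0, cut) and worker_b on [cut, n_items) concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        fa = pool.submit(worker_a, 0, cut)
        fb = pool.submit(worker_b, cut, n_items)
        return fa.result() + fb.result()


# Hypothetical stand-ins for a CPU kernel and a GPU kernel.
def cpu_worker(lo, hi):
    return sum(i * i for i in range(lo, hi))


def gpu_worker(lo, hi):
    return sum(i * i for i in range(lo, hi))


if __name__ == "__main__":
    n = 200_000
    cut = split_point(n, measure_rate(cpu_worker, 10_000),
                      measure_rate(gpu_worker, 10_000))
    print(cut, run_split(n, cpu_worker, gpu_worker, cut))
```

The split is fixed once from the pre-run, so it cannot react if one device slows down mid-run. A dynamic version would hand out smaller chunks from a shared queue instead.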