This simple function shows a speedup of over 2x (single-threaded) when compiled with pyOMP versus plain Numba:
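    # Imports assumed here; the linked file has the exact ones.
    import numba.openmp as mp
    from numba.openmp import openmp_context as openmp

    @mp.njit(fastmath=True)
    def f0(nSteps):
        step = 1.0/nSteps
        sum = 0.0
        # Parallel loop with a sum reduction and static scheduling
        with openmp("parallel for reduction(+:sum) schedule(static)"):
            for j in range(nSteps):
                x = ((j-1)-0.5)*step
                sum += 4.0/(1.0+x*x)
        pi = step*sum
        return pi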
I used Python 3.14.5, Numba 0.63.1 and pyOMP 0.5.1 on an amd64 CPU. Interestingly, the generated code for the pyOMP version references XMM and YMM registers 202 and 147 times, respectively. The plain Numba version references only XMM registers (51 times), and its performance is unchanged with the newer Numba 0.65.1. Full code is here: so-4-omp.py in the pauljurczak/python-benchmarks repository on GitHub.
Unfortunately, the speedup on an arm64 CPU is negligible.
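To sanity-check whichever compiled variant is being timed, a pure-Python reference of the same loop can be handy. This is only a sketch: `pi_reference` and its `offset` parameter are names made up for illustration, not part of the linked benchmark. With `offset=-1.5` it samples the same points as `f0`; with `offset=0.5` it uses the textbook midpoint rule for a 0-based index, which should land closer to pi for the same step count.

```python
import math

def pi_reference(n_steps, offset=-1.5):
    """Pure-Python version of the loop in f0.

    offset=-1.5 reproduces x = ((j-1)-0.5)*step from the post;
    offset=0.5 is the usual midpoint rule for j starting at 0.
    """
    step = 1.0 / n_steps
    total = 0.0
    for j in range(n_steps):
        x = (j + offset) * step
        total += 4.0 / (1.0 + x * x)
    return step * total

n = 100_000
as_posted = pi_reference(n)              # same sample points as f0
midpoint = pi_reference(n, offset=0.5)   # textbook midpoint rule
print(abs(as_posted - math.pi), abs(midpoint - math.pi))
```

Comparing a compiled result against `pi_reference` with a small tolerance is usually enough, since `fastmath=True` and a parallel reduction may change the summation order slightly.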
Thank you very much, so SIMD can now be used through pyOMP.
I'll test it, thanks.
Just to be sure, a quick question:
numba.njit(parallel=True) with prange is broken once pyOMP is loaded, correct?
But numba.njit with a plain range works.
I don’t think you can mix Numba and OpenMP parallel expressions, but I’m not an expert. Choose one or the other.
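For comparison, here is a minimal sketch of the Numba-only route for the same kind of loop, using `prange` instead of an OpenMP directive. It is meant for a script that does not use pyOMP at all; `pi_prange` is a hypothetical name, the midpoint formula is the textbook one rather than the exact expression from the first post, and the `except` branch is just a serial fallback so the snippet still runs where Numba is not installed.

```python
try:
    from numba import njit, prange
except ImportError:
    # Fallback so the sketch still runs (serially) without Numba.
    prange = range

    def njit(*args, **kwargs):
        def wrap(fn):
            return fn
        return wrap

@njit(parallel=True, fastmath=True)
def pi_prange(n_steps):
    step = 1.0 / n_steps
    total = 0.0
    # Numba treats "total +=" inside a prange loop as a reduction.
    for j in prange(n_steps):
        x = (j + 0.5) * step
        total += 4.0 / (1.0 + x * x)
    return step * total

print(pi_prange(1_000_000))
```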
OK, it was just a quick test, no problem.
Something I have always missed in Numba is hybrid offloading: using the CPU and GPU together at the same time.
Even a "naive" half/half split of the job would do, with no dynamic chunking,
just a static division of the work between the CPU and the GPU.
Is something like this ever going to be possible?
OpenMP has a task concept. Tasks can be run in parallel, some on the host, some on the devices. I don’t know if pyOMP supports this. Test it out and let us know if it does.
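The naive static split asked about above can be sketched in plain Python, independent of whether pyOMP supports tasks or device offload. Everything here is illustrative: `partial_pi`, `split_pi` and `host_fraction` are hypothetical names, and the "device" half is just a second call to the same CPU function standing in for a GPU kernel.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_pi(start, stop, n_steps):
    """Midpoint-rule contribution of steps [start, stop) to the pi integral."""
    step = 1.0 / n_steps
    total = 0.0
    for j in range(start, stop):
        x = (j + 0.5) * step
        total += 4.0 / (1.0 + x * x)
    return total * step

def split_pi(n_steps, host_fraction=0.5):
    """Statically split the index range in two and run both parts concurrently."""
    cut = int(n_steps * host_fraction)
    with ThreadPoolExecutor(max_workers=2) as pool:
        host = pool.submit(partial_pi, 0, cut, n_steps)
        # Stand-in for the GPU part; a real version would launch a device kernel here.
        device = pool.submit(partial_pi, cut, n_steps, n_steps)
        return host.result() + device.result()

print(split_pi(200_000))
```

With pure-Python workers the two halves do not truly overlap because of the GIL. Real overlap would need workers that release it, for example a Numba function compiled with `nogil=True` on the host side and an asynchronous kernel launch on the device side. Tuning `host_fraction` to the relative speed of the two processors is the obvious next step beyond a fixed half/half split.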