Use Numba with PySpark

We use Spark through the Spark API with pyspark. I know Numba can make Python code very fast, so it seems like a good fit for our UDFs (user-defined functions), but I'm not sure Numba can help me with map or mapPartitions.

I'm mapping with Spark over all our data, but the function itself doesn't have a for loop inside (which is what Numba likes), and I can't replace the map/mapPartitions with a plain for loop because that's the PySpark machinery that distributes the computation across all the workers (executors).
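
To make this concrete, here is roughly what we do today (heavily simplified; the `score()` function and the values are made up for illustration):

```python
import math

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()
sc = spark.sparkContext

def score(point):
    # One element at a time: there is no loop in here for Numba to
    # compile, because map() itself does the iteration.
    x, y = point
    return math.sqrt(x * x + y * y)

rdd = sc.parallelize([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)])
result = rdd.map(score).collect()
```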

Does anyone here have a suggestion on how to use Numba with PySpark? Is there something I'm missing?
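
For what it's worth, the only idea I've had so far is to batch each partition into a NumPy array inside mapPartitions so that Numba has an actual loop to compile, something like the sketch below, but I don't know whether this is a sensible approach (`kernel()` and the data are made up):

```python
import numpy as np
from numba import njit
from pyspark.sql import SparkSession

@njit
def kernel(arr):
    # The hot loop lives here, where Numba can compile it.
    out = np.empty(arr.shape[0])
    for i in range(arr.shape[0]):
        out[i] = arr[i] * arr[i] + 1.0
    return out

def process_partition(rows):
    # Materialize the partition so the compiled kernel sees a whole
    # array instead of one element per call.
    arr = np.fromiter(rows, dtype=np.float64)
    return iter(kernel(arr))

spark = SparkSession.builder.appName("numba-sketch").getOrCreate()
rdd = spark.sparkContext.parallelize([1.0, 2.0, 3.0, 4.0], 2)
print(rdd.mapPartitions(process_partition).collect())
```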

Edit: the link in my message is incorrect; Graham's link below is the intended link.

It’s not entirely clear to me what your question is… Do you have a minimal example of what you’re trying to do?

There’s another related discussion here

Python UDFs are generally quite slow in Spark 2.x, though I've heard that there have been improvements in 3.x. Going through the Python layer (even with Numba) will likely not provide much benefit unless you're doing something computationally intensive.
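
If the computation really is that heavy, one pattern worth trying, sketched here under my own assumptions rather than anything from the original post, is to pair a Numba-compiled kernel with a vectorized pandas UDF (Spark 3.x style), so Numba gets a whole batch to loop over instead of one row per call:

```python
import numpy as np
import pandas as pd
from numba import njit
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

@njit
def kernel(xs):
    # Numba compiles this loop over the whole batch.
    out = np.empty(xs.shape[0])
    for i in range(xs.shape[0]):
        out[i] = xs[i] * xs[i] + 1.0
    return out

@pandas_udf("double")
def fast_udf(col: pd.Series) -> pd.Series:
    # Spark hands us a whole batch as a pandas Series.
    return pd.Series(kernel(col.to_numpy()))

spark = SparkSession.builder.appName("numba-pandas-udf").getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["x"])
df.select(fast_udf("x").alias("y")).show()
```

Keep in mind that each executor pays Numba's JIT compilation cost on the first call, so this only pays off on reasonably large data.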

See also: Numba and PySpark users?