Use Numba with PySpark

We use Spark through its Python API, pyspark. I know Numba can make Python code very fast, so it seems like a good fit for our UDFs (user-defined functions), but I’m not sure whether Numba can help with map or mapPartitions.

I’m mapping with Spark over all our data, but the mapped function itself doesn’t contain a for loop (which is what Numba optimizes best), and I can’t replace map/mapPartitions with an explicit for loop because that’s the pyspark mechanism that distributes the computation across all the workers (executors).

Does anyone here have a suggestion on how to use Numba with pyspark? Is there something I’m missing?

It’s not entirely clear to me what your question is… Do you have a minimal example of what you’re trying to do?

There’s another related discussion here

Python UDFs are generally relatively slow in Spark 2.x, though I’ve heard there have been improvements in 3.x. Involving the Python layer (even with Numba) will likely not provide much benefit unless your workload is computationally intensive.
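One pattern that might help (a sketch of my own, not something from this thread): keep map/mapPartitions as the distribution mechanism, but push the per-element loop *inside* a Numba-jitted function, so the hot loop runs as compiled code on each executor. The names `process_partition` and `_sum_of_squares` below are hypothetical, and the `njit` fallback is only there so the sketch runs even without Numba installed:

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs where Numba isn't installed;
    # on the executors you'd want real Numba, of course.
    def njit(f):
        return f

@njit
def _sum_of_squares(arr):
    # The explicit for loop that Numba compiles well lives here,
    # not in the Spark-level map call.
    total = 0.0
    for x in arr:
        total += x * x
    return total

def process_partition(rows):
    # Materialize the partition into a NumPy array once, then hand it
    # to the compiled kernel; yield one result per partition.
    arr = np.fromiter(rows, dtype=np.float64)
    yield _sum_of_squares(arr)

# With a SparkContext `sc`, this would be wired up roughly as:
#   rdd = sc.parallelize(range(1_000_000), numSlices=8)
#   total = rdd.mapPartitions(process_partition).sum()
```

mapPartitions fits this better than map because the JIT compilation cost and the Python-to-NumPy conversion are paid once per partition rather than once per row; whether it beats plain pyspark still depends on the serialization overhead the previous post mentions.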