We use Spark through the PySpark API. I know Numba can make Python code very fast, so it seems like a good fit for our UDFs (user-defined functions), but I'm not sure Numba can help with map or mapPartitions.
I'm mapping with Spark over all our data, but the function itself doesn't have a for loop inside (which is what Numba optimises best), and I can't replace the map/mapPartitions with a plain for loop, because map/mapPartitions is the PySpark mechanism that distributes the computation across all the workers (executors).
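To make the question concrete, here is a minimal sketch of the kind of thing I have in mind (the names `batch_kernel` and `process_partition` are hypothetical, and the guarded import just lets the sketch run even where Numba isn't installed): the idea would be to collect each partition into a NumPy array and run one Numba-compiled loop over it, instead of calling a Python function per row.

```python
import numpy as np

# Guarded import: fall back to a no-op decorator if Numba is absent,
# so the sketch still runs; with Numba installed, the loop is JIT-compiled.
try:
    from numba import njit
except ImportError:
    def njit(func):
        return func

@njit
def batch_kernel(values):
    # The tight numeric loop that Numba compiles well.
    out = np.empty_like(values)
    for i in range(values.shape[0]):
        out[i] = values[i] * values[i] + 1.0
    return out

def process_partition(rows):
    # Hypothetical mapPartitions callable: materialise the partition
    # into a NumPy array, run the compiled kernel once per partition,
    # then yield plain Python floats back to Spark.
    batch = np.fromiter(rows, dtype=np.float64)
    for value in batch_kernel(batch):
        yield float(value)

# In Spark this would be wired up as: rdd.mapPartitions(process_partition)
print(list(process_partition([1.0, 2.0, 3.0])))
```

Is this batching-per-partition approach the right way to get Numba's benefits under PySpark, or is there a better pattern?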
Does anyone here have a suggestion on how to use Numba with PySpark? Is there something I'm missing?