Real-world optimization using numba

Graham Markall asked me to share this post here. A few days ago I asked on the numba gitter for help with a scaling issue: running multiple numba processes in parallel caused poor performance. We eventually found the root cause, which was the way the numba code was written: numpy style, creating large intermediate numpy arrays. These intermediates overloaded the processor cache. We solved the problem by rewriting parts of the numba code to use nested loops instead (a minimal sketch of this kind of rewrite follows at the end of this post).

The application is a real-world optimization task (interferometry) used by ESA to analyze images of galaxies and stars. It runs parallel optimization threads, so each numba call executes single-threaded. The performance of the whole optimization is dominated by the numba code at the "heart" of the function to optimize.

To check whether Python/numba can compete with Java, the same optimization was also implemented in Java, including the FFT library available there. The result: there is only a 10-20% overall performance penalty for Python/numba, so the outcome is dominated by the choice of optimization algorithm, not by the language used. There still seem to be minor CPU cache effects, since the optimal number of parallel threads was 16 for Python and 32 for Java on a 16-core AMD 5950X CPU, but this did not cause significant performance degradation for numba.

You can find the code at fast-cma-es/interferometryudp.py (dietmarwo/fast-cma-es on GitHub) and fcmaes-java/Interferometry.java (dietmarwo/fcmaes-java on GitHub).
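To illustrate the general pattern (this is not the actual interferometry code, which is linked above), here is a minimal sketch of the rewrite. The function names and the computation itself are hypothetical; the point is that the numpy-style version allocates full-size temporary arrays on every call, while the loop version keeps everything in scalar registers:

```python
import numpy as np
from numba import njit

# numpy-style version: every expression allocates a full-size temporary
# array; with many parallel processes these temporaries compete for the
# shared CPU cache
@njit
def residual_numpy_style(a, b, c):
    diff = a - b           # temporary array
    scaled = diff * c      # another temporary array
    return np.sum(scaled * scaled)

# loop version: the same computation with scalar accumulation,
# no intermediate arrays are allocated
@njit
def residual_loops(a, b, c):
    acc = 0.0
    for i in range(a.shape[0]):
        v = (a[i] - b[i]) * c[i]
        acc += v * v
    return acc

if __name__ == "__main__":
    n = 1_000_000
    rng = np.random.default_rng(42)
    a, b, c = rng.random(n), rng.random(n), rng.random(n)
    # both versions compute the same value; the loop version just
    # avoids the temporaries
    print(residual_numpy_style(a, b, c))
    print(residual_loops(a, b, c))
```

For a single process the two versions may perform similarly, since numba compiles both; the difference shows up when many such processes run in parallel and the temporary arrays start evicting each other's hot data from the cache.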
