Recently, I ventured into Python and decided to build a project while learning. The theme is parallelization, which is my passion. So what better way to consume CPU cores than to explore the Mandelbrot Set in real time?
What’s exciting about this project is Numba, which JIT-compiles Python code to reach C-like performance. I benchmarked not only Numba 0.54.1 but also the autochunk branch from Intel Labs. The test system is a Linux box with an AMD Threadripper 3970X CPU.
JIT compilation disabled entirely
The captured time is the time taken to display the initial screen.
$ NUMBA_DISABLE_JIT=1 \
py38 mandel_queue.py --width=1280 --height=720 --num-threads=N
N time in seconds
32 : 8.7
16 : 15.9
8 : 31.1
4 : 63.1
2 : 119.8
Commenting out @njit in app/mandel_for.py
For this, I wanted to see how long Python takes to display the initial screen without Numba.
$ py38 mandel_queue.py --width=1280 --height=720 --num-threads=N
N time in seconds
32 : 5.6
16 : 9.9
8 : 20.1
4 : 39.3
2 : 75.1
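For readers who have not seen the decorator being toggled here, the following is a minimal sketch of what an @njit-decorated escape-time function could look like. It is not the contents of app/mandel_for.py; the function name escape_count and its parameters are hypothetical, and the fallback decorator is only there so that the sketch also runs where Numba is not installed.

```python
try:
    from numba import njit
except ImportError:
    # Assumption: no Numba available, so fall back to a do-nothing decorator.
    # The function then runs as ordinary (slow) Python.
    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@njit  # comment this line out to time the plain-Python version
def escape_count(cr, ci, max_iter):
    """Count iterations of z = z*z + c before |z| exceeds 2 (hypothetical helper)."""
    zr = 0.0
    zi = 0.0
    for i in range(max_iter):
        zr2 = zr * zr
        zi2 = zi * zi
        if zr2 + zi2 > 4.0:
            return i
        # Update the imaginary part first, since it needs the old zr.
        zi = 2.0 * zr * zi + ci
        zr = zr2 - zi2 + cr
    return max_iter


if __name__ == "__main__":
    print(escape_count(0.0, 0.0, 100))  # the origin never escapes
    print(escape_count(3.0, 0.0, 100))  # a point far outside the set
```

The tight scalar loop with no Python objects is the kind of code Numba handles well, which is why removing the decorator costs so much.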
At last, Numba
I unset the NUMBA_DISABLE_JIT environment variable and uncommented @njit in app/mandel_for.py. As before, this is the time taken to display the initial screen. The results show what a difference Numba can make for your application.
$ py38 mandel_queue.py --width=1280 --height=720 --num-threads=N
N time in seconds
32 : 0.014
16 : 0.014
8 : 0.022
4 : 0.039
2 : 0.076
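To put a number on that difference, the short script below divides the times from the tables above. It is just arithmetic on the figures already reported; the dictionary and function names are my own.

```python
# Initial-screen times in seconds, copied from the three tables above.
jit_disabled = {32: 8.7, 16: 15.9, 8: 31.1, 4: 63.1, 2: 119.8}
njit_commented = {32: 5.6, 16: 9.9, 8: 20.1, 4: 39.3, 2: 75.1}
numba_jit = {32: 0.014, 16: 0.014, 8: 0.022, 4: 0.039, 2: 0.076}


def speedups(baseline, fast):
    """Return baseline/fast for every thread count in the baseline table."""
    return {n: baseline[n] / fast[n] for n in baseline}


if __name__ == "__main__":
    for n, factor in speedups(njit_commented, numba_jit).items():
        print(f"{n:>2} threads: about {factor:.0f}x faster with @njit")
```

Against the run with @njit commented out, the factor ranges from about 400x at 32 threads to about 1000x at 4 threads. Note that the Numba times at 16 and 32 threads are identical (0.014 s), so at that point the initial screen is probably no longer limited by the compute kernel.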
Running Parallel, Auto Zoom Results
Press the letter x to begin the auto zoom session. The app will zoom beyond 200 levels. The captured time is the benchmark duration. Python 3.7 is using Numba from the autochunk branch. Python 3.8 is using Numba 0.54.1.
$ py37 mandel_parfor.py --width=1280 --height=720 --num-threads=N
$ py38 mandel_parfor.py --width=1280 --height=720 --num-threads=N
$ py38 mandel_stream.py --width=1280 --height=720 --num-threads=N
$ py38 mandel_queue.py --width=1280 --height=720 --num-threads=N
$ py38 mandel_ocl.py --width=1280 --height=720 --num-threads=N
N time in seconds (variant)
62 : 4.311 parfor autochunk
62 : 4.503 stream
62 : 4.504 opencl
62 : 4.882 queue
62 : 5.214 parfor 0.54.1
32 : 6.082 parfor autochunk
32 : 6.144 stream
32 : 6.322 queue
32 : 8.176 parfor 0.54.1
16 : 10.483 parfor autochunk
16 : 10.558 stream
16 : 11.084 queue
16 : 13.710 parfor 0.54.1
8 : 19.556 parfor autochunk
8 : 19.532 stream
8 : 19.641 queue
8 : 23.684 parfor 0.54.1
4 : 39.045 parfor autochunk
4 : 39.343 stream
4 : 39.380 queue
4 : 44.098 parfor 0.54.1
2 : 72.232 stream
2 : 73.454 queue
2 : 76.724 parfor autochunk
2 : 76.758 parfor 0.54.1
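One way to read these numbers is as parallel efficiency: the measured speedup relative to the 2-thread run, divided by the ideal speedup for that thread count. The sketch below does this for the stream variant using the times from the table above; the helper name efficiency is my own, and taking 2 threads as the baseline is an assumption forced by the lack of a single-thread measurement.

```python
# Auto-zoom durations in seconds for the stream variant, from the table above.
stream = {2: 72.232, 4: 39.343, 8: 19.532, 16: 10.558, 32: 6.144, 62: 4.503}


def efficiency(times, base_threads=2):
    """Measured speedup over the baseline divided by the ideal speedup."""
    base = times[base_threads]
    result = {}
    for n, seconds in sorted(times.items()):
        speedup = base / seconds
        ideal = n / base_threads
        result[n] = speedup / ideal
    return result


if __name__ == "__main__":
    for n, eff in efficiency(stream).items():
        print(f"{n:>2} threads: {eff:.0%} of ideal scaling")
```

By this measure the stream variant stays above 90% efficiency up to 8 threads, drops to roughly 73% at 32 threads, and to roughly 52% at 62 threads. The 3970X has 32 physical cores, so the 62-thread runs rely on SMT, which likely explains part of that last drop.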