Explore the Mandelbrot Set in real time with Numba

Recently, I ventured into Python and decided to build a project while learning. The theme is parallelization, which is my passion. So, what better way to consume CPU cores than to explore the Mandelbrot Set in real time?

What’s exciting about this project is Numba, which JIT-compiles Python code to reach C-like performance. I benchmarked both Numba 0.54.1 and the autochunk branch from Intel Labs. The test system is a Linux box with an AMD Threadripper 3970X CPU.

JIT compilation disabled entirely

The captured time is the time taken to display the initial screen.

$ NUMBA_DISABLE_JIT=1 \
  py38 mandel_queue.py --width=1280 --height=720 --num-threads=N
   N   time in seconds
  32 :   8.7
  16 :  15.9
   8 :  31.1
   4 :  63.1
   2 : 119.8

Commenting out @njit in app/mandel_for.py

For this, I wanted to see how long Python takes to display the initial screen without Numba.

$ py38 mandel_queue.py --width=1280 --height=720 --num-threads=N
   N   time in seconds
  32 :   5.6
  16 :   9.9
   8 :  20.1
   4 :  39.3
   2 :  75.1

At last, Numba

I unset the NUMBA_DISABLE_JIT environment variable and uncommented @njit in app/mandel_for.py. As before, this is the time taken to display the initial screen. The numbers show what a difference Numba can make for your application.

$ py38 mandel_queue.py --width=1280 --height=720 --num-threads=N
   N   time in seconds
  32 :  0.014
  16 :  0.014
   8 :  0.022
   4 :  0.039
   2 :  0.076
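
For readers who have not seen @njit before, here is a minimal sketch of the kind of escape-time loop that benefits from it. This is not the code from app/mandel_for.py; the function name escape_count and its parameters are made up for illustration, and the fallback decorator is only there so the snippet still runs where Numba is not installed.

```python
try:
    from numba import njit
except ImportError:
    # Hypothetical fallback: without Numba, run the function as plain Python.
    def njit(func):
        return func


@njit
def escape_count(cr, ci, max_iter):
    """Count iterations of z -> z*z + c until |z| > 2, capped at max_iter."""
    zr = 0.0
    zi = 0.0
    for i in range(max_iter):
        zr2 = zr * zr
        zi2 = zi * zi
        if zr2 + zi2 > 4.0:  # |z|^2 > 4 means the point has escaped
            return i
        zi = 2.0 * zr * zi + ci
        zr = zr2 - zi2 + cr
    return max_iter  # never escaped: treated as inside the set


print(escape_count(0.0, 0.0, 100))   # origin stays bounded
print(escape_count(2.0, 2.0, 100))   # far outside, escapes almost at once
```

The first call pays the compilation cost when Numba is present; later calls with the same argument types reuse the compiled code.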

Running Parallel, Auto Zoom Results

Press the letter x to begin the auto zoom session. The app will zoom beyond 200 levels. The captured time is the benchmark duration. Python 3.7 is using Numba from the autochunk branch. Python 3.8 is using Numba 0.54.1.

$ py37 mandel_parfor.py --width=1280 --height=720 --num-threads=N
$ py38 mandel_parfor.py --width=1280 --height=720 --num-threads=N
$ py38 mandel_stream.py --width=1280 --height=720 --num-threads=N
$ py38 mandel_queue.py  --width=1280 --height=720 --num-threads=N
$ py38 mandel_ocl.py    --width=1280 --height=720 --num-threads=N

   N   time in seconds
  62 :  4.311  parfor autochunk
  62 :  4.503  stream
  62 :  4.504  opencl
  62 :  4.882  queue
  62 :  5.214  parfor 0.54.1

  32 :  6.082  parfor autochunk
  32 :  6.144  stream
  32 :  6.322  queue
  32 :  8.176  parfor 0.54.1

  16 : 10.483  parfor autochunk
  16 : 10.558  stream
  16 : 11.084  queue
  16 : 13.710  parfor 0.54.1

   8 : 19.556  parfor autochunk
   8 : 19.532  stream
   8 : 19.641  queue
   8 : 23.684  parfor 0.54.1

   4 : 39.045  parfor autochunk
   4 : 39.343  stream
   4 : 39.380  queue
   4 : 44.098  parfor 0.54.1

   2 : 72.232  stream
   2 : 73.454  queue
   2 : 76.724  parfor autochunk
   2 : 76.758  parfor 0.54.1
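
One rough way to read the table is to turn the timings into speedup and parallel efficiency relative to the 2-thread run. The sketch below does that for the "stream" rows only; the helper name scaling is made up for illustration, and the baseline choice (2 threads, since no 1-thread run was measured) is an assumption.

```python
# "stream" timings in seconds, copied from the table above, keyed by thread count.
stream_times = {2: 72.232, 4: 39.343, 8: 19.532, 16: 10.558, 32: 6.144, 62: 4.503}


def scaling(times, base_threads=2):
    """Return {N: (speedup, efficiency)} relative to the base_threads run."""
    base = times[base_threads]
    result = {}
    for n, t in sorted(times.items()):
        speedup = base / t
        ideal = n / base_threads  # what perfect linear scaling would give
        result[n] = (speedup, speedup / ideal)
    return result


for n, (speedup, eff) in scaling(stream_times).items():
    print(f"{n:3d} threads: {speedup:5.2f}x speedup, {eff:6.1%} efficiency")
```

Swapping in the "queue" or "parfor" rows gives the same comparison for the other variants.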

Below are the complementary results from an NVIDIA GeForce RTX 2070 GPU running the same auto zoom session. Press the letter x after the initial display. Although Numba isn’t used for the GPU demonstrations, the results may be useful one day for comparing against @cuda.jit.

$ py38 mandel_{cuda,ocl}.py --width=1280 --height=720
$ py38 mandel_{cuda,ocl}.py --width=1280 --height=720 --mixed-prec=1
$ py38 mandel_{cuda,ocl}.py --width=1280 --height=720 --mixed-prec=2
$ py38 mandel_{cuda,ocl}.py --width=1280 --height=720 --mixed-prec=2 --fma=1

  time in seconds
        9.0  CUDA
        8.4  CUDA mixed-prec=1
        7.1  CUDA mixed-prec=2
        6.3  CUDA mixed-prec=2 fma=1

        9.4  OpenCL
        8.8  OpenCL mixed-prec=1
        7.5  OpenCL mixed-prec=2
        6.7  OpenCL mixed-prec=2 fma=1

Very interesting work. Thank you for posting! The plan is for the autochunk PR to be merged into Numba mainline in the next few weeks.

That’s great news, thank you! Congrats on the autochunk branch; I look forward to it. I added a cuda.jit demonstration named mandel_kernel.py, since the name mandel_cuda.py is already taken by the PyCUDA version.

The repo has a demo folder consisting of non-parallel demonstrations that compute the Mandelbrot Set and apply Anti-Aliasing, Gaussian Blur, and Unsharp Mask via a step-by-step approach.
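
To give an idea of what the last step does, here is a minimal pure-Python sketch of an unsharp mask on a small grayscale image. It is not taken from the repo: the function names blur_3x3 and unsharp_mask, the [1, 2, 1] / 4 kernel, the clamped edges, and the 0 to 255 value range are all assumptions made for illustration.

```python
def blur_3x3(img):
    """Separable [1, 2, 1] / 4 blur (a small Gaussian approximation), edges clamped."""
    h, w = len(img), len(img[0])

    def clamp(v, hi):
        return max(0, min(hi, v))

    # Horizontal pass.
    tmp = [[(img[y][clamp(x - 1, w - 1)] + 2 * img[y][x]
             + img[y][clamp(x + 1, w - 1)]) / 4.0
            for x in range(w)] for y in range(h)]
    # Vertical pass.
    return [[(tmp[clamp(y - 1, h - 1)][x] + 2 * tmp[y][x]
              + tmp[clamp(y + 1, h - 1)][x]) / 4.0
             for x in range(w)] for y in range(h)]


def unsharp_mask(img, amount=1.0):
    """Sharpen by adding back the difference between the image and its blur."""
    blurred = blur_3x3(img)
    return [[max(0.0, min(255.0, p + amount * (p - b)))
             for p, b in zip(row, brow)]
            for row, brow in zip(img, blurred)]


# A vertical edge between a dark and a bright region.
image = [[50.0, 50.0, 200.0, 200.0] for _ in range(4)]
for row in unsharp_mask(image):
    print(row)
```

Flat regions are left unchanged, while pixels next to the edge are pushed further apart, which is what makes the result look sharper.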