Walkthrough from pure Python implementation to multi-GPU Numba-jitted version

The talk " Evaluating Your Options for Accelerated Numerical Computing in Pure Python" by my colleague Matthew Penn includes a walkthrough starting with a pure Python implementation of a k-Nearest Neighbours operand and goes through to a a multi-GPU jitted version with Numba. The trajectory it follows is:

  • The Pure Python version
  • CPU JIT with Numba
  • Parallel CPU JIT with prange
  • GPU JIT with Numba’s @cuda.jit
  • Multi-GPU JIT with Numba and Dask
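
For context, here is roughly what the pure Python starting point looks like - a brute-force k-nearest-neighbours sketch of my own, not the talk's actual notebook code:

```python
import math

def knn_pure_python(points, query, k):
    """Brute-force kNN: compute the Euclidean distance from `query`
    to every point, then return the indices of the k smallest."""
    distances = []
    for i, p in enumerate(points):
        d = 0.0
        for a, b in zip(p, query):
            d += (a - b) ** 2
        distances.append((math.sqrt(d), i))
    distances.sort()
    return [idx for _, idx in distances[:k]]

points = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.2], [3.0, 3.0]]
print(knn_pure_python(points, [0.4, 0.3], k=2))  # indices of the 2 closest points
```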
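
The first acceleration step is simply decorating the hot loop with @njit, so Numba compiles it to machine code on first call. Again a minimal sketch (function names are mine):

```python
import numpy as np
from numba import njit

@njit
def pairwise_sq_distances(points, query):
    """Squared Euclidean distance from `query` to each row of `points`,
    compiled by Numba the first time it is called."""
    n, dim = points.shape
    out = np.empty(n, dtype=np.float64)
    for i in range(n):
        d = 0.0
        for j in range(dim):
            diff = points[i, j] - query[j]
            d += diff * diff
        out[i] = d
    return out

points = np.random.rand(10_000, 3)
query = np.random.rand(3)
dists = pairwise_sq_distances(points, query)
nearest = np.argsort(dists)[:5]  # indices of the 5 nearest points
```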
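
Next, parallel=True plus prange tells Numba to split the independent outer-loop iterations across CPU threads:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def pairwise_sq_distances_parallel(points, query):
    """Same distance computation; prange distributes the outer loop
    across CPU threads, which is safe because iterations are independent."""
    n, dim = points.shape
    out = np.empty(n, dtype=np.float64)
    for i in prange(n):
        d = 0.0
        for j in range(dim):
            diff = points[i, j] - query[j]
            d += diff * diff
        out[i] = d
    return out
```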
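
The GPU step rewrites the loop as a @cuda.jit kernel, with one thread per point and an explicit launch configuration. A rough sketch of the pattern:

```python
import numpy as np
from numba import cuda

@cuda.jit
def sq_distance_kernel(points, query, out):
    """One GPU thread per point: each thread computes the squared
    distance from `query` to its assigned row of `points`."""
    i = cuda.grid(1)  # absolute index of this thread across the grid
    if i < points.shape[0]:
        d = 0.0
        for j in range(points.shape[1]):
            diff = points[i, j] - query[j]
            d += diff * diff
        out[i] = d

points = np.random.rand(100_000, 3)
query = np.random.rand(3)
d_points = cuda.to_device(points)          # explicit host-to-device copies
d_query = cuda.to_device(query)
d_out = cuda.device_array(points.shape[0], dtype=np.float64)

threads_per_block = 128
blocks = (points.shape[0] + threads_per_block - 1) // threads_per_block
sq_distance_kernel[blocks, threads_per_block](d_points, d_query, d_out)
dists = d_out.copy_to_host()
```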
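
Finally, the multi-GPU step combines Numba with Dask. Here is a sketch of the general pattern using dask_cuda's LocalCUDACluster (one worker process per visible GPU) - the talk's actual notebook may structure this differently:

```python
import numpy as np
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from numba import cuda

@cuda.jit
def sq_distance_kernel(points, query, out):
    # One GPU thread per point, as in the single-GPU sketch above.
    i = cuda.grid(1)
    if i < points.shape[0]:
        d = 0.0
        for j in range(points.shape[1]):
            diff = points[i, j] - query[j]
            d += diff * diff
        out[i] = d

def gpu_sq_distances(chunk, query):
    # Runs on whichever worker (and therefore GPU) Dask assigns this chunk to.
    d_chunk = cuda.to_device(np.ascontiguousarray(chunk))
    d_query = cuda.to_device(query)
    d_out = cuda.device_array(chunk.shape[0], dtype=np.float64)
    tpb = 128
    blocks = (chunk.shape[0] + tpb - 1) // tpb
    sq_distance_kernel[blocks, tpb](d_chunk, d_query, d_out)
    return d_out.copy_to_host()

if __name__ == "__main__":
    cluster = LocalCUDACluster()  # starts one Dask worker per visible GPU
    client = Client(cluster)

    # Chunks of the Dask array are scattered across the workers/GPUs.
    points = da.random.random((1_000_000, 3), chunks=(250_000, 3))
    query = np.random.rand(3)
    dists = points.map_blocks(
        gpu_sq_distances, query, drop_axis=1, dtype=np.float64
    ).compute()
```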

It also includes the use of an External Memory Manager (RMM, the RAPIDS Memory Manager) with Numba, and explains some optimization strategies for the GPU kernels.
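
For illustration, plugging RMM into Numba looks roughly like this, assuming a recent RMM release that ships the Numba allocator under rmm.allocators.numba (older releases expose it differently):

```python
from numba import cuda
import rmm
from rmm.allocators.numba import RMMNumbaManager

# Serve all of Numba's device allocations from an RMM memory pool,
# avoiding repeated cudaMalloc/cudaFree calls. The memory manager
# must be set before the CUDA context is initialized.
rmm.reinitialize(pool_allocator=True)
cuda.set_memory_manager(RMMNumbaManager)

d_arr = cuda.device_array(1_000_000)  # now allocated from the RMM pool
```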

I think this could be helpful for anyone looking to understand how to port pure Python code all the way up to multi-GPU high-performance code, because it follows the typical steps one needs to take to get there: Python → CPU JIT → GPU JIT → Parallel/multi-GPU.

The talk also discusses some other options (CuPy, NumPy, etc.), but a large portion of it focuses on Numba, so I felt it was worthwhile sharing here. Note that although Matthew is a colleague of mine, I didn’t have any involvement in the preparation of the talk - it is all his work :slight_smile:.

Recording: Attendee Portal (free NVIDIA GTC registration required)
Slides: https://static.rainfocus.com/nvidia/gtcspring2022/sess/1638480642908001OycX/SessionFile/Evaluating%20Your%20Options%20for%20Accelerated%20Numerical%20Computing%20in%20Pure%20Python_1647528023707001MkTJ.pdf (not sure if registration is required here)
Example code (on GitHub, accessible without restriction): https://github.com/rapidsai-community/event-notebooks/tree/main/GTC_Spring_2022/numerical-computing