Numba for CUDA Programmers course released

A new tutorial on using Numba for CUDA programming is now available at:

This is an adapted version of a course delivered internally at NVIDIA. Its primary audience is those who are familiar with CUDA C/C++ programming but perhaps less so with Python and its ecosystem. That said, it should also be useful to those who already know the Python and PyData ecosystem. Readers unfamiliar with CUDA may want to build a base understanding first by working through Mark Harris’s blog post "An Even Easier Introduction to CUDA" and briefly reading Chapters 1 and 2 of the CUDA Programming Guide (Introduction and Programming Model).

The course is divided into five sessions:

Session 1: An introduction to Numba and CUDA Python

Covers the basics:

  • An introduction to Numba
  • CUDA kernels and ufuncs
  • CUDA memory management basics

Session 2: Typing

Explains Numba’s type system and how type inference works:

  • How to understand what the typing is doing
  • CUDA-specific typing issues and performance optimization through typing

Session 3: Porting strategies, performance, interoperability, debugging

Covers tools and techniques for going from unoptimized Python code to an optimized CUDA implementation:

  • Red flags: watching out for code that won’t port well to CUDA
  • Step-by-step porting process: pure Python → Object mode → Nopython mode → CUDA → Optimization
  • Dealing with NumPy array operations and using CuPy
  • Interoperability with other CUDA Python libraries
  • Managing data movement
  • Useful components in the CUDA target
  • Performance measurement
  • Debugging

Session 4: Extending Numba

Explains how to write an extension for the Numba CUDA target so that you can use your own data types and classes in CUDA kernels. For class members, attributes, properties, methods, and so on, the session covers:

  • Typing
  • Data models
  • Lowering

Session 5: Memory Management

Explains how Numba’s internal memory management works, and how to replace it with your own memory management:

  • Internals: garbage collection, finalizers, deferred deallocation
  • Using an External Memory Management (EMM) Plugin such as the RAPIDS Memory Manager (RMM)
  • Writing an EMM Plugin, with examples:
    • Using CuPy’s memory pool
    • A simple wrapper around the C runtime API
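
To give a feel for the "finalizers and deferred deallocation" idea, here is a conceptual sketch in pure Python. It is not Numba's implementation and does not touch the GPU: `PendingDeallocs`, `Buffer`, and the `max_pending` threshold are hypothetical names chosen for illustration. The point is only that a finalizer can queue a free instead of performing it, and that queued frees are then released in a batch.

```python
import weakref
from collections import deque


class PendingDeallocs:
    """Toy queue that batches 'frees' instead of running them immediately."""

    def __init__(self, max_pending=3):
        self.max_pending = max_pending
        self._queue = deque()
        self.freed = []  # handles that have actually been released

    def __len__(self):
        return len(self._queue)

    def add(self, handle):
        # Called from a finalizer: record the free, but defer doing it.
        self._queue.append(handle)
        if len(self._queue) > self.max_pending:
            self.flush()

    def flush(self):
        # Release everything that is pending in one batch.
        while self._queue:
            self.freed.append(self._queue.popleft())


class Buffer:
    """Stand-in for a device allocation identified by an integer handle."""

    def __init__(self, handle, deallocs):
        self.handle = handle
        # The callback must not reference `self`, or the object would
        # never become collectable.
        weakref.finalize(self, deallocs.add, handle)


deallocs = PendingDeallocs(max_pending=3)
buffers = [Buffer(h, deallocs) for h in range(4)]

# Drop three buffers. On CPython, reference counting runs their finalizers
# right away, but the frees are only queued.
del buffers[:3]
pending_after_three = len(deallocs)
freed_after_three = list(deallocs.freed)

# Dropping the fourth pushes the queue past the threshold and triggers a flush.
del buffers[:]
pending_after_four = len(deallocs)
freed_after_four = sorted(deallocs.freed)
```

Batching frees this way avoids paying for a deallocation every time an array goes out of scope, at the cost of holding on to memory a little longer.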