New Numba Tutorials

Introduction

Previously in other posts, I have briefly discussed the need for Numba tutorials with @sseibert and @esc. I would like to extend that discussion here, and perhaps other people have a wish-list of things in Numba they would like explained in tutorials.

I am using Numba more and more. Whenever my calculations don’t fit neatly into a vectorized Numpy framework, or would allocate unnecessary temporary memory for large matrices, or are simply easier to understand as old-fashioned for-loops, I just write @jit above the function, and it is truly amazing that this can give a 50x speed-up or more! But there is a catch: it often requires you to know how to prepare the data properly and to write only code that Numba supports.

For several years, whenever I tried Numba, I didn’t know its little “quirks”, so I rarely got any speed-up and almost never used it. It took some time and effort to learn how to write Python code that Numba can work with, especially because there don’t seem to be any real tutorials, so you have to piece the know-how together from lots of searching on the internet. So I think it would be a good idea to make tutorials that teach the best practices for using Numba with different kinds of data and algorithms.

Existing Tutorials

Before writing this post I first looked at the docs to see if you already have any tutorials that I might have missed - and there are actually a few GitHub repos all the way down at the very last line of the doc-pages. They link to this and this GitHub repo with Notebooks that seem to be prepared for some conferences several years ago, and hence rely on a speaker to explain a lot of what goes on in the code. Perhaps they don’t even work anymore with the new versions of Numba.

Example Tutorials

When writing stand-alone tutorials, it is important to explain every single code-line in plain English so everyone can understand it. People read tutorials because they don’t yet know the subject, so nothing should be left unexplained.

Here are a few tutorials I recently wrote, and here are some tutorials I wrote a few years ago that also have explanations on how to use Pandas (because there weren’t any good tutorials on a lot of Pandas features either), and here are my older TensorFlow tutorials which were quite popular. All of these tutorials explain complicated things in a manner that is fairly easy to understand, with nearly all code-lines being commented.

I would suggest that you write Numba tutorials in Notebooks using the same style, and then give the tutorials a prominent placement on your doc-pages right at the top so people can find them, instead of all the way down at the very last line of the index-page.

The tutorials should also have an easier-to-find and more “official” location on GitHub, like www.github.com/numba/numba-tutorials/. You already have a GitHub repo called numba/numba-examples, but that shows how to use Numba to implement certain algorithms and is not really a tutorial.

You can also make GitHub automatically test that the Notebooks still run whenever the Numba package is being updated, and send you an e-mail if they fail. That way you can make sure they are always working with the latest Numba version.

Topics

These are the topics I would personally like to see covered in the tutorials, because I have spent a long time trying to figure them out myself from searching the internet and experimenting with Numba:

  • What kind of data does Numba support and what are the best practices for passing data to jitted functions? This should include both simple and more complex data-structures like list-of-lists. For example, it took me quite a while to figure out that a list-of-lists should be passed as a numba.typed.List of Numpy arrays, as that is not an obvious solution, especially since numba.typed.List is still considered experimental. So it would have been nice if I could have read an official guide on how to best pass different data-structures to jitted functions.

  • Why isinstance and other common things in Python are not supported by Numba, and best practices for checking that input data is correct. If I need to do type-checking, error-handling, etc., I usually split the function into a Python wrapper and a Numba implementation that does the actual work, but again it took a while to figure out how to do this properly, and tutorials with best practices would have helped.

  • A jitted function cannot return different types from different code paths, nor can a single variable hold different types, as Python normally allows. Even returned Numpy arrays cannot differ in their number of dimensions, which seems kind of strange. What are the best practices for working around these limitations?

  • Best practices for making jitted code run in parallel, and how to make the code switch between parallel and serial versions. The switching took me a while to figure out, so a solution could be included in the tutorial.

  • Some functions in Pandas support Numba; I think it is groupby and apply, and maybe more. What is the proper way of using these and what kind of speed-up is possible? A few years ago I had big problems making Pandas run fast with groupby and apply, and I wish I had known how to speed it up with Numba.

  • It would also be great if you interleaved the tutorials with brief explanations of why Numba works in this way. For example, why can’t jitted functions use Python lists directly? That would help the user appreciate the limitations of Numba rather than just finding them frustrating. But long and detailed explanations of how Numba works should probably be in their own tutorial Notebooks, because most people probably just want to know how to use Numba properly, with a brief explanation as to why, and don’t want to read pages of detailed explanations on how Numba works, while other people might find that very interesting and useful.

Benefits

The main benefit is that both new and old users can quickly learn how to use Numba for data-structures and algorithms that resemble their own problem, and learn best practices on how to structure Python code when using Numba.

I have probably “wasted” 50-100 hours trying to figure out how Numba works, and I am still learning how to use it properly.

If tutorials could save just 1 hour for each person trying to learn Numba, and 10k people save 1 hour each, then that is 10,000 hours, or about 5 work-years at roughly 2,000 working hours per year! And that’s a very low estimate of the amount of time that could be saved by making good tutorials. The real number is probably several orders of magnitude higher. And this will also lead to greater use of Numba in more projects, which could give maybe 50x speed-up for a lot of Python code.

So it would be well worth your time and effort to make good tutorials for Numba!

My Contribution

I would actually like to help write good Numba tutorials like this, because they could have a very large impact on the world. But I don’t really have the time, and I also don’t have the necessary know-how, as I am still very much a Numba beginner fumbling my way around. So I would much rather offer to help by “reviewing” the tutorials you write and commenting on what I still find hard to understand.

I hope you will consider writing new tutorials like this.

This is great! Once this happens, it would be good to create the materials in such a way that students could either work through them in self-study mode at home or have them presented as part of a conference tutorial or similar.