Numba code-base is hard to understand

Let me first say that I am a super-fan of Numba and I think you guys are super-heroes, so I hope you will take this as constructive feedback.

I need the nan (not-a-number) version of np.argmax but it is currently not implemented in Numba. I figured it would be quite easy, so I took a look in your code-base to see if I could add it myself. The relevant Python module is here. But it is almost completely undocumented, so it takes great effort for outsiders to understand.

Furthermore, some of the code is very strange (buggy?), like this for-loop which exists in all of these argmax/argmin functions:

for v in arry.flat:
    min_value = v
    min_idx = 0
    break
if np.isnan(min_value):
    return min_idx

This looks bizarre to me. And as there is no comment on what you intend to do here, I am really confused when I see strange code like this. Because it looks like you are basically just doing the following:

if np.isnan(arry.flat[0]):
    return 0
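
For what it's worth, a quick check in plain NumPy suggests the two forms agree on non-empty arrays. This runs outside Numba's compiled mode, so it says nothing about why the loop was written that way. It is only a sketch: both helper names below are hypothetical and made up for this comparison, and both forms assume the array is non-empty.

```python
import numpy as np


def first_is_nan_loop(arry):
    """Mimic the loop-and-break pattern quoted above (hypothetical helper)."""
    # The loop only ever runs one iteration: it grabs the first
    # element in flat order and then stops immediately.
    for v in arry.flat:
        first_value = v
        break
    return bool(np.isnan(first_value))


def first_is_nan_index(arry):
    """Direct-indexing form suggested above (hypothetical helper)."""
    return bool(np.isnan(arry.flat[0]))


# Compare both forms on a few non-empty arrays, including a 2-D one.
samples = [
    np.array([np.nan, 1.0, 2.0]),
    np.array([1.0, np.nan, 2.0]),
    np.array([[np.nan, 3.0], [4.0, 5.0]]),
]
results = [(first_is_nan_loop(a), first_is_nan_index(a)) for a in samples]
```

If the loop form exists for a reason specific to Numba's compiler, that is exactly the kind of thing a one-line comment in the source would settle.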

I would now like to argue for more extensive documentation of the code-base itself, not just the user-facing API.

I actually program in two languages that are interleaved: I write clear English comments on what is going on for nearly all the Python-lines. This may seem excessive, but it means that practically anyone can read my code and understand what is going on in a few minutes. I can read code that I wrote 5-10 years ago and understand it very quickly, even though I’ve completely forgotten what it does. And if there is bizarre / buggy code, then my intention is explained clearly in English so it is more obvious what is wrong.

Writing good comments is actually a rare and very under-appreciated skill, just like writing good docs or good text-books. Not everyone can do it. But everyone can make a serious effort to explain what they are doing to the next person who will look at the code. It makes it much easier for others to understand what is going on in the code, and in the long run it is well worth the invested time. And writing code that is beautiful and easy for others to understand often means that problems are polished away, because ugly code is hard to explain.

Let me give you a few examples of my code.

The first example is from TensorFlow where I added a small function several years ago. Compare it to the other functions in that file, whose code-lines are only sparsely documented, if at all. Which do you find easier to understand?

The second example is more recent, where I use Numba to speed up a new algorithm I made. See e.g. this function and this function where I go into great detail explaining what happens in the algorithms, and even “trivial” code is explained in the comments as well. The idea is that you can read the English comments alone without reading any of the Python code, and you will understand exactly what is going on in the code.

I find it much easier to read the code when nearly everything is commented really well in English, rather than having to switch between reading a few short broken English comments and Python code. That is why I say that I program in two interleaved languages: English and Python.

I probably cannot convince you to go into “full Shakespeare mode” like I do with my comments. But please consider whether your code is easy or hard to understand for the next person who will have to read and maintain it.

A few of the functions in arraymath.py actually do have extensive code-comments, but some of them seem to have been written in a hurry and are quite confusing, like these that seem to belong to different code-lines.

As a bare minimum I would like to suggest that every source-file has a header that explains what it contains and how it fits in with the rest of the project. And each function has a doc-string that explains what the function does. Ideally the crucial / difficult code-lines would also be explained.

In my opinion, good code comments are almost as important as good code itself.

Once again, I meant for this to be constructive feedback from a Numba super-fan.

Oh cool, feel free to submit some PRs!

Also, maybe something like suckless.org (“software that sucks less”) would be useful: something that goes beyond pure formatting and also mentions general tips and tricks for developers.

Hehe! … Well … My point was exactly that I and others in the community are unable to help you improve and extend Numba, because it would take immense effort to try and understand even fairly simple things in the Numba code-base, due to the almost complete lack of code-comments.

I already have my own important R&D projects I need to work on (nearly all of which are open-source, and for which I have never been paid anything, I might add!). So I cannot justify spending several weeks trying to figure out how to add a simple np.nanargmax function to Numba, because I don’t understand what is going on in your code-base.

So this was a plea that you might want to consider improving the code-docs and comments in the future, so outsiders can understand it more easily and then be able to help you extend and improve the code-base, without having to invest massive amounts of time and effort.

Thanks!

Another thought I had on this one: we have a gitter chat, so next time you are stuck with a piece of code that should be simple but doesn’t appear so, don’t waste much time on it, but come chat:

There are usually community members online that will help you reverse engineer any weird code you may encounter.

I think this is actually a really important issue and I have some more thoughts I would like to share.

What I suggest the Numba core dev-team do is take one of your existing Python modules / files and polish it up to the standard that you think everything in the code-base ideally ought to meet, in terms of the code-quality and certainly also in terms of the code-comments.

It would be a massive task to do this to the whole code-base. But going forward you can ask people who commit new changes to first review your polished Python-file and to make their new contribution to the same high standard. Don’t be afraid to go through a couple of iterations where you ask people to improve their contributions so they are higher quality and easier for others to understand. Numba is now a pillar of the Python eco-system, and having contributed to it is something people can be proud of, so you can easily afford to make polite but firm demands on the quality-level you expect from new contributions.

Regarding the code-comments, I think the following should be minimum requirements:

  • Every file should have a header that briefly explains what the file contains and how it fits in with the overall system. That way I know if I am even looking in the right file.

  • Each function / class / etc. should have a doc-string explaining what it does, even if it is internal and not intended to be used in the external API. Because without proper doc-strings we have to guess / infer what the function does from reading its code, which is not only very slow to do, but if there are problems / bugs in the code, then we may not understand what it does, or even worse, we may think that the bugs are actually features.

  • Personally I would prefer if you go into “full Shakespeare mode” like I do with my own code-comments, but I don’t think I can convince you to do that. But you should as a bare minimum explain what is going on in non-trivial parts of the code, using proper English explanations and not short cryptic comments in broken English that may be just as hard to understand as the code itself.
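
To make the three points above concrete, here is a minimal sketch of what a file header, a doc-string and comments on the non-trivial lines might look like together. The module and the function name `nanargmax_sketch` are hypothetical: this is plain Python with NumPy written for illustration, not code from Numba. Raising ValueError on all-NaN input is my own assumption, modeled on what np.nanargmax does.

```python
"""
Hypothetical module: NaN-aware reductions (illustration only, not Numba code).

This header says what the file contains and where it fits in:
it holds small helper functions that search arrays while ignoring NaN.
"""
import numpy as np


def nanargmax_sketch(arry):
    """
    Return the flat index of the largest non-NaN element of `arry`.

    :param arry: Non-empty NumPy array of floats.
    :return: Integer index into `arry.flat` of the largest non-NaN value.
    :raises ValueError: If every element is NaN (an assumption that
        mirrors what np.nanargmax does for all-NaN input).
    """
    # Index and value of the best candidate found so far.
    # An index of -1 means no non-NaN element has been seen yet.
    best_idx = -1
    best_value = 0.0

    # Visit the elements in flat (row-major) order.
    for idx, v in enumerate(arry.flat):
        # Skip NaN, because every comparison with NaN is False and
        # would otherwise silently corrupt the search.
        if np.isnan(v):
            continue

        # Take the first non-NaN element unconditionally, and after
        # that only strictly larger ones, so ties keep the first index.
        if best_idx == -1 or v > best_value:
            best_value = v
            best_idx = idx

    # No usable element was found, so there is no valid index to return.
    if best_idx == -1:
        raise ValueError("All elements are NaN")

    return best_idx
```

With comments like these, a reader can tell that skipping NaN and keeping the first index on ties are intended behavior rather than accidents, without having to infer it from the code.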

If you start doing this going forward, then I think it will be a great help to others who are trying to understand and contribute to your code-base, and eventually more and more of your code-base will be easier to understand for others.

As an example, have a look at the R&D project I just finished here which has lots of non-trivial things going on - including some non-obvious data-conversion to make Numba run even faster (altogether I got a nearly 50x speed-up over pure Python thanks to Numba!) Without the extensive code-comments this would be quite hard to understand - even for myself if I look at the code again in a few years. There’s also some code that handles floating-point rounding errors, which would look bizarre if it wasn’t explained.

Just look at how beautiful this code is! I simply don’t understand why software engineers don’t take more pride in their work, like master carpenters would do. And it’s not hard to do. This entire R&D project took me 4 weeks to complete, including the invention of the new algorithm with several iterations of rewriting and improving it; optimizing it for Numba; polishing and documenting the code; making a tutorial; writing the paper, etc.