Support for parallel mode in different OS

As usual, thanks very much for making Numba, it is a fantastic tool!

I have a Python package which uses Numba's JIT on some of its functions. The functions are implemented so that they can also run in parallel, simply by switching on parallel mode when decorating them with Numba's JIT. But I currently have parallel mode switched off, because I don't know how well it is supported on different operating systems. I am using Linux myself and I don't really have the possibility to test it on Windows and Mac.

We have previously discussed switching between parallel and serial mode (see post #1125, as I am not allowed to write a link here). The conclusion there was that it could be done simply by setting the number of parallel threads with Numba's set_num_threads function, and a test showed that there was no performance penalty when doing this with only one thread, compared to using the JIT in serial mode. But that was only tested on Linux.

These are my questions now:

  1. If I decorate a function in my Python package with Numba Jit in parallel mode, can I expect it to work flawlessly on all Operating Systems: Linux, Windows and Mac?

  2. Is there significant overhead in parallel mode on some operating systems, e.g. on Windows? If so, will the overhead presumably also be there when the number of threads is set to 1?

Thanks!

Hi @Hvass-Labs,

Numba puts in a lot of effort to make sure that it works the same way, where possible, on all operating systems. The Numba build farm tests approximately 4 Python versions x 4 NumPy versions x 7 OS/chip combinations to make sure that this is the case! Numba's public CI system checks a smaller selection than the build farm, but it's still comparatively large.

The function should behave the same way regardless of the operating system. There are limitations to what can be done in terms of thread/fork/spawn, but these are a function of the operating systems (and sometimes the threading layers) themselves. The documentation on the threading layers is in the "The Threading Layers" section of the Numba documentation; you can choose one that suits your needs. All threading layers are implemented to the fullest extent possible for a given OS.

I presume you are wanting to compare @njit against @njit(parallel=True) with set_num_threads(1)? If so, the compiled code paths are often quite different, so you are very unlikely to be comparing the same compiled code, even before considering threading layers and overheads. I would anticipate the overhead of threading would be fairly small, but it will be there even when there is a single thread (the entire parallel region would just be run as a single chunk on a single thread). It's often recommended to just measure what is important for your use case: a small amount of scheduling overhead might well be acceptable if, say, 99% of the execution time is spent in BLAS.

Thanks very much for the detailed response, I appreciate it!

My functions in question are fairly simple: there is just a Numba prange() outer loop, and nothing fancy in terms of synchronization etc. So I think I will take the chance and enable Numba's parallel mode by default in my Python package.