Compilation pipeline, compile time and vectorization

Thank you for the discussion in the meeting on Tuesday.

I experimented a bit more with the alternative compilation structure, and I think the results are quite interesting.

First, I think it makes sense to distinguish between two different changes:

  1. Separate typing from compilation, so that machine code is only generated for functions that actually need it. From the discussion in the meeting, my understanding is that pretty much everyone agrees this would be a positive change; the main concern is that it might be quite a bit of work.
  2. Optimize only after linking modules: right now, each function is translated to an LLVM module, which is then optimized. Relevant modules are then linked together and the result is optimized again. This could be changed (relatively easily) so that the individual modules are not optimized before linking; instead, the optimizer runs only once, on the linked result.
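To make the difference between the two pipeline shapes concrete, here is a purely schematic sketch in plain Python (no LLVM involved): `optimize` and `link` are stand-ins, a "module" is just a dict mapping function names to the calls in their bodies, and "optimizing" inlines any call whose target is visible in the same module. None of these names correspond to numba internals.

```python
# Schematic model of the two pipeline shapes. A module maps each
# function name to the set of calls remaining in its body.

def optimize(module):
    # "Inline" every call whose target is defined in this module;
    # calls into other modules are opaque and stay unresolved.
    return {name: {c for c in calls if c not in module}
            for name, calls in module.items()}

def link(mods):
    # Merge several modules into one combined module.
    merged = {}
    for m in mods:
        merged.update(m)
    return merged

# One module per compiled function, as in the current pipeline:
mods = [{"f": {"g"}}, {"g": {"h"}}, {"h": set()}]

# Current shape: optimize each module first (cross-module calls are
# invisible, so nothing happens), then link and optimize again.
current = optimize(link([optimize(m) for m in mods]))

# Proposed shape: link the unoptimized modules, then optimize once
# with the whole program visible.
proposed = optimize(link(mods))

print(current)   # {'f': set(), 'g': set(), 'h': set()}
print(proposed)  # {'f': set(), 'g': set(), 'h': set()}
```

Both pipelines end up resolving all the calls, because the current one also optimizes after linking; the difference is that the proposed shape runs the optimizer once over the whole program instead of once per module plus once after linking, which changes both the total compile-time cost and what each optimization pass gets to see.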

In the crazy-one-module-compile branch I linked above I implemented both of these ideas together, but in a very messy way. From playing with it, it seems that, combined, they solve the extra-long compile times we observe in the function chain of the original post. Implementing only the first change helps with the problem, and implementing both helps even more, but implementing only the second actually makes things a bit worse.

But because the second change is pretty straightforward to implement on its own, and should fix a lot of vectorization issues, I separated it out from the combined branch above, cleaned it up a little, and ran some benchmarks to see how much performance benefit and extra compilation cost we would get.

I extracted a couple of benchmarks from the numba-benchmarks repository (I’d be happy to add more if someone is interested in specific cases; @stuartarchibald, you mentioned you had an idea of something where it might perform worse?) and set things up to collect both compile-time and runtime performance data.

The following plot shows the change in performance between the numba main branch and the new branch that includes only the second change from above.

I have to admit I’m pretty surprised that so many functions benefit from this global-only optimization approach. Compile time does increase (by a surprisingly constant factor), but only modestly, I’d say, at least for my benchmarks.


The changes in numba and the benchmarks can be found here:

The branch that implements the second change:

Tested against this branch (which also includes benchmarks and an analysis notebook)