Compilation pipeline, compile time and vectorization

Hello everyone :slight_smile:,

Over the holidays I had some time to look in a bit more detail at two issues that have come up a few times in the numba backend of pytensor. (Just for background: pytensor is a fork of aesara that uses numba to compile computation graphs with autodiff, similar to jax or tensorflow, and is used as the backend in pymc.) I'd be curious to hear what you think about this. I hope it makes at least some sense. :slight_smile:

The issues are:

  • Long compile times. We often see compile times longer than 10 s, and I've even seen a very large model where compilation took 30 min.
  • Missed vectorization opportunities. From my experience with C or rust I often expect llvm's loop vectorizer to transform a loop, but in many cases this doesn't happen in numba.

They seem rather independent, but I think they might actually have similar causes, which is why I’m combining them here.

Let's start with compilation time: In pytensor we tend to have a relatively large number of functions that each do relatively little work and get combined into one big function that is then exposed to the user. As a numba-only proxy, let's look at this:

# Make sure we have an empty cache (ipython syntax....)
%env NUMBA_CACHE_DIR=/home/adr/.cache/numba
!rm -rf ~/.cache/numba

import numba
import numpy as np

def make_func(func):
    # A small function that calls a previous function and reshapes the result
    @numba.njit(no_cpython_wrapper=True)
    def foo(x):
        return func(x).reshape((3, 3))

    return foo

@numba.njit(no_cpython_wrapper=True)
def first(x):
    return x

# We build a chain of 50 of those simple functions
current = first
for i in range(50):
    current = make_func(current)

# Wrap it in a final function with cpython wrapper
@numba.njit
def outer_logp_wrapper(x):
    return current(x)

# Compile the function
%time outer_logp_wrapper(np.zeros(9))
# CPU times: user 18.7 s, sys: 53.6 ms, total: 18.8 s
# Wall time: 18.8 s

So compiling this function chain takes about 20 s. What is it doing in all that time? From what I currently understand it does the following:

For each of the 50 functions it creates an llvm module containing the function itself plus declarations for the functions it uses, and a CodeLibrary. That code library is then finalized. During finalization it optimizes each function in the module separately using the rewrites llvm provides, then links in the modules of the functions it uses (i.e. the previous function's finalized, and thus already optimized, module), and then optimizes the whole linked module. (There are also some other passes that I don't think change anything for my purposes here.) Finally, it generates machine code for that linked module.

So by the time it has produced the code for the final function, it has optimized and generated independent machine code for each of the previous functions. The first function will have run through the llvm optimization pipeline and codegen 50 times (I think :slight_smile:)!
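If I'm reading the Dispatcher API right (so take this with a grain of salt), this is easy to check from python: even the innermost function of the chain ends up with its own finished machine code, although we never call it from python directly:

# The innermost function was only ever needed as a callee,
# but it already has a compiled specialization and its own machine code:
print(first.signatures)
asm = first.inspect_asm(first.signatures[0])
print(len(asm.splitlines()), "lines of assembly just for `first`")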

Would there be an alternative? I think it should be possible to split finalization into two parts: a first part (let's call it pre-finalization), where we just optimize the individual functions and make sure nothing further can be added to the module, and a second part (post-finalization) that recursively links in all required modules, runs the module-level optimizations, and does the llvm codegen.
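To make the idea a bit more concrete, here is a toy sketch of that two-step flow using llvmlite directly, with hand-written IR and the legacy pass manager API. This is not how numba's internals are structured, just an illustration of "function passes per module first, module passes and codegen only once after linking":

import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

callee_ir = """
define double @callee(double %x) {
entry:
  %y = fadd double %x, 1.0
  ret double %y
}
"""

caller_ir = """
declare double @callee(double)

define double @caller(double %x) {
entry:
  %y = call double @callee(double %x)
  %z = fmul double %y, 2.0
  ret double %z
}
"""

mod_callee = llvm.parse_assembly(callee_ir)
mod_caller = llvm.parse_assembly(caller_ir)

pmb = llvm.create_pass_manager_builder()
pmb.opt_level = 3
pmb.inlining_threshold = 100  # make sure the inliner is enabled

# "pre-finalization": run only function-level passes on each module,
# no module-level optimization and no codegen yet
for mod in (mod_callee, mod_caller):
    fpm = llvm.create_function_pass_manager(mod)
    pmb.populate(fpm)
    fpm.initialize()
    for func in mod.functions:
        fpm.run(func)
    fpm.finalize()

# "post-finalization": link everything into one module, run the
# module-level optimizations once, and do codegen once
mod_caller.link_in(mod_callee)
mpm = llvm.create_module_pass_manager()
pmb.populate(mpm)
mpm.run(mod_caller)

tm = llvm.Target.from_default_triple().create_target_machine()
print(tm.emit_assembly(mod_caller))

In numba terms, the pre-finalization part would roughly correspond to what happens per function today minus the linking and codegen, and the post-finalization part would only run once, for the function that is actually exposed to python.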

In an act of vandalism I went through the code and broke everything but this example, but made it so that we only link at the very end and avoid some (though I don't think all) of the intermediate codegen. See code here. This got the compile time of the example down to ~4 s! And for pytensor functions I saw similar improvements (and some segfaults; I clearly broke stuff).

So how does this affect the runtime of the compiled function? Surprisingly, I think it often improves it: from what I can tell, running the module-level optimizations on each function's module first, and then again after linking, runs into pass-ordering issues and misses optimization opportunities because of it.

Let’s look at this example (I’m sure there are simpler ones, but still):

%env NUMBA_CACHE_DIR=/home/adr/.cache/numba
!rm -rf ~/.cache/numba

# To see vectorization output of llvm
#import llvmlite.binding as llvm
#llvm.set_option('', '--debug-only=loop-vectorize')

import numba
import numpy as np


@numba.njit(error_model="numpy")
def inner(a1, a2, b1):
    return (
        np.exp((a1 / a2) / a2),
        np.exp(((a1 / a2) * b1) / a2),
    )

@numba.njit(error_model="numpy")
def outer_logp_wrapper(N, a1, a2, b1):
    out1 = np.empty(N)
    out2 = np.empty(N)
    for i in range(N):
        val1, val2 = inner(a1[i], a2[i], b1[i])
        out1[i] = val1
        out2[i] = val2
    return out1, out2

N = 10_000
a1, a2, b1 = np.random.randn(3, N)
_ = outer_logp_wrapper(N, a1, a2, b1)
%timeit outer_logp_wrapper(N, a1, a2, b1)

The loop in outer_logp_wrapper doesn’t get vectorized on main numba:

LV: Not vectorizing: Found unvectorizable type   %0 = insertelement <2 x double> undef, double %.10.i, i32 0

The reason seems to be that when the module for the inner function is optimized, this includes both the loop-vectorize and the slp-vectorize passes. The first is supposed to have higher priority, so it is executed first (I think…), but because there is no loop it does nothing. llvm then runs the slp-vectorizer, which successfully uses some vector instructions in inner. The module for outer_logp_wrapper and the already optimized inner module are then merged, and the optimizer runs again. It inlines the inner function, but because that already contains vector instructions, it can no longer vectorize the loop from outer_logp_wrapper.
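If I'm not misreading the dispatcher API, you can see part of this directly from python: the finalized module of inner on its own should already contain vector instructions (assuming slp-vectorize kicked in on your machine the way it did on mine):

# `inner` got its (float64, float64, float64) specialization while the wrapper
# was being compiled; check its optimized IR for <2 x double> operations:
inner_ir = next(iter(inner.inspect_llvm().values()))
print("<2 x double>" in inner_ir)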

If we use the crazy vandalism branch again, which only runs the module-level optimization passes on the final linked module, we don't have that problem: the inner module is only optimized with the function passes, and those don't include slp-vectorize. llvm then inlines the inner function, and the loop vectorizer succeeds. After that it also runs slp-vectorize, which has nothing left to vectorize at that point.

This then leads to much better performance (at least in this particular case, not sure how well this generalizes to different things):
[plot: runtime of the example on numba main vs. the patched branch]

Sorry for the long post and I hope it is more-or-less understandable.

And by the way: Happy new year!

Hi @aseyboldt, happy new year.

I work on a framework for statistical calculations and we also see very long compilation times (2-5 minutes) when working with 100s of functions. In our case the nesting is not as deep as in your example; we have a main loop that runs 10-100 functions, each of which might go 2-10 levels deep in calling other functions.

Some months ago I started working on a shareable reproducer (the real thing is not open source); if I can finish it, I would like to run it on your branch.

Do you think it's possible to split the problem such that the final solution is independent of the intention of the user/developer? I.e., numba does not know whether a function foo will be used "stand alone" or will only be called by another function. That's what I mean by intention, and each option might lead to a different optimal compilation strategy.
In this case it seems that numba prepared everything for the "stand alone" case, which leads to slower compilation times. no_cpython_wrapper=True could have served as a way to convey the intention not to use the function stand alone, and to make certain decisions based on that.

Thanks a lot for sharing the analysis and a possible solution. I don’t know much about CodeLibrary or the finalization step to be able to comment on your solution, but I’ll follow this topic as a way to learn about that.

Luk


It’s good to hear we are not alone with that problem. :slight_smile:
I’d also be curious to hear what the branch does with different functions, but keep in mind that I called it vandalism for a reason…
The final function has to be called outer_logp_wrapper right now.

I don't think the intention necessarily has to be shared with numba. As I understand it, it could simply not generate machine code for a function until it is actually called from python. When it is called later, the code could be produced then. This shouldn't (?) be slower in total if you want to call several functions; you just pay the cost for each function when it is called, instead of up-front when the last function in the chain is called for the first time.
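From the python side that would behave much like the per-signature laziness that dispatchers already have today, just extended to the machine code of callees. A tiny illustration of the existing behaviour (nothing new here):

import numba

@numba.njit
def f(x):
    return x * 2

# A dispatcher already compiles each specialization only when it is first
# called with that signature from python:
f(1)      # the int64 version is compiled here
f(1.0)    # the float64 version is only compiled at this point
print(f.signatures)  # [(int64,), (float64,)]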

I really might be missing important details here, however. While I've been reading a lot of numba source code, there is a lot more to read and to understand…
Right now most of the core devs seem to be on a well-deserved break till next week; maybe they can share some more informed opinions when they are back :slight_smile:

Some quick thoughts about this:

When Numba is compiling a function, it needs type information for all of the callees of that function, so it can infer types in the current function based on the return types of the callees.

Unfortunately the way Numba is set up, it completely compiles the callee in order to get type information, which is why it ends up going all the way to finalizing the callee’s module and running codegen. It would probably be better overall if it initially only ran type inference on the callee when compiling the caller.
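A small example of that coupling, as it looks from the outside:

import numba

@numba.njit
def callee(x):
    return x + 1.0

@numba.njit
def caller(x):
    return callee(x) * 2.0

# Compiling the caller needs the callee's return type, and currently that means
# the callee is compiled all the way to machine code as a side effect:
caller(1.0)
print(callee.signatures)  # the callee already has a compiled specialization,
                          # even though it was never called from Python itself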

This coupling is quite difficult to break, and is a pain for various reasons, the performance issue you have identified being one of them. We have been thinking a bit about how to break this link (I think @guilherme was the one most recently thinking about it).

Are those issues related? I would have expected that for typing purposes the results would be cached and it wouldn't need to be compiled again.

Ah, that makes sense; I was wondering why compilation is triggered somewhere during type inference.
Maybe splitting the finalization would still work? It might be enough to run pre-finalization (as defined above) during typing. That still costs quite a bit of time, but only does things that need to be done anyway. And all errors that are not due to linking should still get triggered?
This wouldn't completely decouple the llvm module from the typing pass, but it would at least avoid the codegen in llvm.
I don't fully understand how they are coupled, so this might very well not be true at all…

FYI:
The vectorization and compilation-time issues in this thread (and what @gmarkall mentioned about typing) are a couple of the reasons why I (representing the Aesara group from which @aseyboldt and PyMC Labs forked) started looking into Numba's compilation pipeline and, eventually, its typing system a while back. The typing/compilation-related dependencies mentioned above are also part of the impetus for recent discussions regarding the typing system's external use and refactoring.

More specifically, some directly related concerns involving the compilation and vectorization of Aesara-generated Numba code came up in Numba performance doesn't scale as well as NumPy in vectorized max function - #11 by brandonwillard, where the topic of Numba-generated loop IR and its connection to vectorizability was covered and the exact same areas of the compilation pipeline were similarly altered.

Thank you for the discussion in the meeting on Tuesday.

I experimented a bit more with the alternative compilation structure, and I think the results are quite interesting.

First, I think it makes sense to distinguish between two different changes:

  1. Separate typing from compilation, so that machine code is only generated for those functions where it is needed. From what people said in the meeting, my understanding is that pretty much everyone agrees this would be a positive change, but that it might be quite a bit of work.
  2. Optimize only after linking modules: Right now, each function is translated to an llvm module, which is then optimized. Relevant modules are then linked together and the result is optimized again. This could be changed (relatively easily) to something where the individual modules are not optimized before they are linked, but only once after linking.

In the crazy-one-module-compile branch I linked above I implemented both of those ideas together, but in a very messy way. From playing with it, it seems that, when combined, they solve the extra-long compile times we observe in the function chain of the original post. Implementing only the first change helps with the problem, implementing both helps even more, but implementing only the second actually makes things a bit worse.

But because the second change is pretty straightforward to implement on its own, and should fix a lot of vectorization issues, I separated it out from the combined branch above, cleaned it up a little, and ran some benchmarks to see how much performance benefit and extra compilation cost we would get.

I extracted a couple of benchmarks from the numba-benchmarks repository (I'd be happy to add more if someone is interested in specific cases; @stuartarchibald, you mentioned you had an idea of something where it might perform worse?), and set things up so that I collect both compile-time and runtime performance data.

The following plot shows the change in performance between the numba main branch and the new branch that includes only the second change from above.

I have to admit I’m pretty surprised that there are so many functions that benefit from this global-only optimization approach. Compile time does increase (by a surprisingly constant factor), but pretty modestly, I’d say, at least for my benchmarks.


The changes in numba and the benchmarks can be found here:

The branch that implements the second change:

Tested against this branch (which also includes benchmarks and an analysis notebook)

The new llvm pass manager API seems to support passes that are designed to be run multiple times as well, which would probably help a lot to keep compile time in check while also making sure we don’t mess up optimizations further down the line:
https://llvm.org/doxygen/classllvm_1_1PassBuilder.html#ad6f258d31ffa2d2e4dfaf990ba596d0d
There is an old issue in llvmlite to support that, but it looks like everything there is using the legacy pass managers? (Investigate new LLVM pass manager API · Issue #5 · numba/llvmlite · GitHub)