Statically linking in dependencies resulting in large memory needs

This is going to be a bit lengthy.

In our project, we have a large number of JIT’ed functions performing laborious calculations, and it’s common for one JIT’ed function to call a number of other JIT’ed functions, which in turn have their own callees, and so on.

For performance reasons, we cache as much of the JIT compiled code as possible.

I’ve noticed that the cached file sizes are getting pretty large, with megabytes or even tens of megabytes per .nbc file becoming a common occurrence (the total cache can take up several GB on disk).

The memory needs of the application are pretty large too. It can peak at 10 GB+ on the first run - when the compilation is performed and the cache is being built - and at about half of that on subsequent runs, when the cache is loaded.

Looking into that, apart from other possible causes of excess (such as the lengthy mangled names of overloaded functions, which were also discussed here), I found out that numba links the code libraries of the callees into the callers. The verification pass asserts that all the dependencies have been linked in correctly.

To this effect, when I inspect the LLVM code of a caller function that invokes a callee function, I find the latter’s LLVM code embedded essentially verbatim in the former (as a full function definition rather than as inlined instructions; see also the comment after the code example below). It comes with the define linkonce_odr linkage, so duplicate definitions are safely avoided and all that.
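
One can observe this directly, for instance on the caller / callee example further below (a minimal sketch, to be run after caller has been invoked once):

caller_ir = next(iter(caller.inspect_llvm().values()))
# each statically linked-in jitted dependency appears in the caller's IR
# as a full `define linkonce_odr ...` block; a dynamically linked one
# would only leave a `declare` behind
print(caller_ir.count('define linkonce_odr'))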

The fact remains, though, that if a caller has several callees, each with callees of their own, and so on, the LLVM code can grow pretty large.

As a consequence, the machine code that actually gets stored in the .nbc for a given function’s overload will in fact contain the instructions of every single JIT’ed function needed to run that function, the functions those functions need in turn, and so on. Needless to say, across an entire project comprising a large infrastructure of JIT’ed code, this results in rather excessive repetition of machine code, manifesting in the aforementioned disk and memory costs.

My understanding is that statically linking in the code libraries for a given function’s dependencies is done mostly so that the function can be cached. This way, every cached function is essentially an independently executable program (within the numba / llvmlite runtime), as it doesn’t have to rely on any other JIT’ed functions having been independently made available in the context beforehand.

If not for the cache needs, I can simply rip out the add / link-in clause here and relax the verification to make it compile. Every single define linkonce_odr for the dependencies then disappears from the caller’s LLVM code; expectedly, a declare instruction for every needed dependency of the caller is found in its LLVM code instead. Numba will then link all dependencies dynamically instead of statically.

This applies not only to user-defined jitted functions, but also to the numba-provided overloads for, e.g., numpy’s API.

However, if I cache the callee, then the first run of the program (when the cache is being built and functions are made available in the context just in time) is fine, but on a subsequent run, when the cache is getting loaded, an error is predictably thrown from the caller - its declared callee cannot be resolved.

Indeed, if the callee didn’t happen to have been called before the caller within the same program run, then it’s not available in the context for numba to dynamically link against. The same goes for, e.g., any needed numpy overloads and such - whatever is needed is now declared rather than defined, so if things aren’t called in the right order, a ‘symbol not found’ error ensues. The LLVM runtime simply has no idea where to even look for the requested dependencies.

A workaround is to type explicit signatures for every JIT’ed function and to have some auxiliary routine bring the other overloads into the context. Here is an example (ignore the detailed calculations done by the functions there) which illustrates the point with the two numba changes described above, i.e., comment out this and this for the full experience:

import numba
import numpy


CACHE_CONFIG = True


sig = numba.int64(numba.int64)


# not cached on purpose: compiled eagerly (explicit signature), which populates
# the numpy overloads used below into the context on every run
@numba.njit(sig)
def aux(n):
    a = numpy.empty(n)
    a[0] = 1
    b = numpy.full(1, 2.17)
    return numpy.sum(numpy.zeros(1)) + a[0] + b[0]


@numba.njit(sig, cache=CACHE_CONFIG, inline='never')
def callee(n):
    """ About just laborious enough to prevent inlining... """
    arr1 = numpy.zeros(n)
    for i in range(n):
        arr1[i] = i * 1.1
    arr2 = numpy.full(2 * n, 3.14)
    arr3 = numpy.empty(2 * n)
    for i in range(2 * n):
        arr3[i] = arr2[i] + arr1[i // 2]
    return numpy.sum(arr1) + numpy.sum(arr3)


@numba.njit(sig, cache=CACHE_CONFIG)
def caller(n):
    arr = numpy.zeros(n)
    for i in range(n):
        arr[i] = callee(i) * 1.1
    return numpy.sum(arr)


def save(filename, content):
    with open(filename, "w") as f:
        f.write(content)


if __name__ == '__main__':
    _ = caller(1)
    callee_llvm = next(iter(callee.inspect_llvm().values()))
    caller_llvm = next(iter(caller.inspect_llvm().values()))

    save('callee_llvm', callee_llvm)
    save('caller_llvm', caller_llvm)

(Numba tends to ignore inline='never' for simple functions. For my purposes, I needed to make sure inlining does not happen, so I made callee and caller just complex enough to skip inlining. What I am after is the phenomenon of the callee’s entire LLVM code being embedded into the caller - the full function definition, rather than just some of its statements inlined into the caller.)

This works fine as-is: one can run it and have the callee cached. The second run works fine too - since the callee is typed with the explicit signature sig, it gets populated into the context before the caller, making it available for the caller’s dynamic symbol resolution.

Notice that I also had to put together an auxiliary aux function, which is not cached, as it needs to make various numpy overloads (defined in the numba codebase) available for the callee itself.

Try setting sig = None and watch a segfault ensue on the second run. The same happens if aux gets cached. With CACHE_CONFIG = False, on the other hand, all is fine again on every run. All of this is explained by the discussion above.

Does anyone have best-practice suggestions on how to proceed here - particularly without having to surgically change numba as described above? The crux of the issue is the large memory demands of projects with an elaborate infrastructure of inter-dependent JIT’ed functions that have to be cached for performance reasons.

One idea was to swap JIT’ed functions in place for a declaration proxy, like here.


Hey @milton,

It does sound really complicated…
Here is an idea:
LLVM’s optimizer is very aggressive about inlining functions, but if you use a static function pointer loaded from a global, the call becomes indirect.
This should/could(?) prevent LLVM from inlining the callee’s code since the pointer obscures the direct call target.
The idea is to “freeze” function pointers in a static global variable so that a JIT-compiled function only holds a small pointer instead of the full callee code. This avoids duplicating the machine code for every callee across all callers.
What do you think?

Hi @Oyibo,

Thanks for your comment; I am happy to see this topic getting some attention after all - despite its lengthiness at that! The story is rather complex indeed, and I do maintain that it takes a certain amount of prose to explain what the problem is all about.

I believe what you’re proposing is actually related to the line of thinking I had been pursuing. I briefly touched on it at the end of the OP above, but here are some additional details. I will go over the code I had put together to address this.

A JIT’ed function (or a function loaded from cache) is available in the LLVM global context by its mangled name. Taking advantage of that, a JIT’ed function func can be swapped in place with a proxy func_proxy that looks up the original func in the global context and calls it. This decorator does the swap, and it ‘fools’ numba into inlining a cheap declaration of func into the LLVM code of any potential caller of func, rather than copy-pasting the entirety of func’s LLVM code into it.

Here it’s important that I’m talking about copy-pasting the callee’s (func’s) LLVM code in its entirety, rather than inlining its instructions into the body of the caller. It appears to always be one or the other, and when it’s the former, the severity of the code duplication it manifests is particularly jarring.

When caching is used (which is when this problem is most relevant), on subsequent program runs this instruction makes the func name available in the global context. This is important because this step is exactly what would have been missing had I simply plucked the library-linking instructions out of the numba source code, as described in the OP.
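
To illustrate the gist of the mechanism, here is a simplified sketch using public numba / llvmlite APIs (this is not numbox’s actual implementation; it assumes func has already been compiled for a scalar signature sig, and routes the call through func’s C wrapper for ABI simplicity):

from llvmlite import binding as llvm
from numba import njit, types

# grab the compiled artifact of `func` for its (only) signature
cres = func.overloads[func.signatures[0]]
symbol = cres.fndesc.llvm_cfunc_wrapper_name
addr = cres.library.get_pointer_to_function(symbol)

# register the address process-wide so that a bare `declare` of this
# symbol can be resolved when a caller gets linked
llvm.add_symbol(symbol, addr)

# callers of `func_external` emit a small declare + call instead of
# linking in func's full body
func_external = types.ExternalFunction(symbol, sig)

@njit(sig)
def func_proxy(n):
    return func_external(n)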

Hey Milton,

I see the problem, but I’m not sure I fully understand your point.

I made some modifications to your example and compared the uncached version of the proxy against a ctypes function pointer used to generate indirect function calls (as a sanity check). Both approaches seem to prevent LLVM from over-optimizing, and the generated LLVM IR files are smaller than those of the regular caller.

For the cached versions, I compared only the normal caller and the proxy. Both generate or update their cached files on the first call. If the cached files don’t already exist, there’s a cache miss on the first run, but subsequent runs result in cache hits for both callers. Interestingly, the cached file generated for the proxy is smaller than that of the regular caller.

From this simple example, it looks like the proxy is working well.

Here is the setup:

import numpy as np
from numba import njit, float64
import ctypes
from numbox.core.proxy import proxy

CACHE = True
sig = float64(float64)

@njit(sig)  # , cache=CACHE)
def foo(x):
    for _ in range(10):
        x += 2
        for _ in range(10):
            x += 1
    return x

@proxy(sig)  # , jit_options={'cache': CACHE})
def foo_proxy(x):
    for _ in range(10):
        x += 2
        for _ in range(10):
            x += 1
    return x

# prepare indirect call via function pointer
foo_f64 = foo.overloads[foo.signatures[0]]
addr = foo_f64.library.get_pointer_to_function(foo_f64.fndesc.llvm_cfunc_wrapper_name)
proto_f64 = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double)
bind_f64 = proto_f64(addr)

@njit  # can't cache ctypes
def caller_indirect(a: float):
    return bind_f64(a)

@njit(cache=CACHE)
def caller_proxy(a: float):
    return foo_proxy(a)

@njit(cache=CACHE)
def caller(a: float):
    return foo(a)

def save(filename, content):
    with open(filename, "w") as f:
        f.write(content)

Here are the checks:

if __name__ == '__main__':
    functions = ["caller", "caller_indirect", "caller_proxy"]

    for func_name in functions:
        func = globals()[func_name]  # Get function by name
        func(1.0)  # Call the function

        llvm_code = next(iter(func.inspect_llvm().values()))  # Extract LLVM IR
        save(f"z_{func_name}_llvm", llvm_code)  # Save the LLVM output
        # caller_llvm: File has 400+ lines
        # caller_indirect_llvm: File has 84 lines
        # caller_proxy_llvm: File has 84 lines

        # Note: Cache has to be activated
        if func_name != "caller_indirect":
            print(f"{func_name}:")
            print("hits:", func.stats.cache_hits)
            print("misses:", func.stats.cache_misses)
            print()
        # 1st run:
            # caller:
            # hits: Counter()
            # misses: Counter({(float64,): 1})

            # caller_proxy:
            # hits: Counter()
            # misses: Counter({(float64,): 1})

        # subsequent runs:
            # caller:
            # hits: Counter({(float64,): 1})
            # misses: Counter()
            # UserWarning: Inspection disabled for cached code. Invalid result is returned.

            # caller_proxy:
            # hits: Counter({(float64,): 1})
            # misses: Counter()
            # UserWarning: Inspection disabled for cached code. Invalid result is returned.

These are the generated files & sizes:


# files:
# z_caller_indirect_llvm:    4 kb
# z_caller_proxy_llvm:       4 kb
# z_caller_llvm:            21 kb

# cached files:
# indirect_function_call.caller_proxy-37.py312.1.nbc:   7 kb
# indirect_function_call.caller_proxy-37.py312.nbi:     1 kb
# indirect_function_call.caller-41.py312.1.nbc:        16 kb
# indirect_function_call.caller-41.py312.nbi:           1 kb

Hi @Oyibo,

Great, thanks a lot for these tests - your observations make sense to me:

  1. Looking at ‘z_caller_llvm’ I see the entire definition of foo (define linkonce_odr i32 @_ZN8__main__3fooB2v1B38c8tJTIeFIjxB2IKSgI4CrvQClQZ6FczSBAA_3dEd...) copy-pasted in there. The call instruction to it can be found within the body of the caller function block, as expected.
  2. Looking at ‘z_caller_proxy_llvm’ I see instead the declaration of foo_proxy (declare double @cfunc._ZN8__main__9foo_proxyB2v2B38c8tJTIeFIjxB2IKSgI4CrvQClQZ6FczSBAA_3dEd...). The instruction to call foo_proxy is once again predictably found in the caller_proxy.
  3. Both foo and foo_proxy are compiled to the same code (by intention) and, consistently with that, take up the same space on disk if cached. The foo_proxy, however, also compiles (and caches, if asked to) __foo_proxy, the actual proxy function to it, which takes noticeably less space. That is expected, as __foo_proxy is simply a wrapper that calls foo_proxy.
  4. Finally, the caller’s disk cache is significantly more voluminous than caller_proxy’s, since - as discussed above - the latter does not contain any of foo_proxy’s computation instructions (neither copy-pasted / statically linked nor inlined). Of course, it’s subtle whether the O(1) memory and disk-space overhead of the __foo_proxy kind is small enough not to offset whatever was saved in caller_proxy vs. caller, as the latter actually grows with the complexity of the callee (foo).

Now these points illustrate the phenomenon of numba statically linking all the callees into the caller’s executable / compiled code. This creates a memory bloat problem (as well as a cache disk-storage problem), which becomes severe enough for large codebases with hundreds of jitted functions. The proxy tool kind of solves / alleviates that, at least for the category of use cases considered.

The question is then whether this is the best one can do, or whether there’s a more numba-native way to solve this problem.


For some historical context - and perhaps a forthcoming solution - this bloat issue was something that the PIXIE project proposed to address. However, I’m not sure what the state of that effort is these days.

@milton If we cache the worker function foo and let it perform more computations, then assign the same work to foo_proxy and cache it too, I think we can see more clearly what happens.

@njit(sig, cache=CACHE)
def foo(x):
    for _ in range(20):
        for _ in range(10):
            x += 2
            for _ in range(10):
                x += 1
        dtype = np.array(x).dtype.type
        ONE, TWO = dtype(1), dtype(2)
        epsilon = ONE
        while ONE + epsilon != ONE:
            epsilon /= TWO
        machine_eps = epsilon * TWO
        x += machine_eps - np.finfo(dtype).eps
    return x

@proxy(sig, jit_options={'cache': CACHE})
def foo_proxy(x):
    ...  # same code as in foo

The proxy decorator generates two files: foo_proxy.nbc and __foo_proxy.nbc.
Every cached function that calls foo_proxy includes instructions for both versions.
At first glance, foo_proxy and __foo_proxy have different sizes, and the subsequent callers’ cached files are smaller, too.
The proxy decorator seems to introduce overhead for the worker function and reduce overhead for the callers, but it also reduces the degree of optimization.

# foo-11.py312.1.nbc:          31.25 KB
# foo-11.py312.nbi:             1.23 KB

# caller-57.py312.1.nbc:       22.67 KB
# caller-57.py312.nbi:          1.18 KB

# foo_proxy-27.py312.1.nbc:    31.38 KB
# foo_proxy-27.py312.nbi:       1.24 KB
# __foo_proxy-13.py312.1.nbc:   8.84 KB
# __foo_proxy-13.py312.nbi:     1.24 KB

# caller_proxy-53.py312.1.nbc:  7.78 KB
# caller_proxy-53.py312.nbi:    1.19 KB

It’s unclear how or if this provides any real benefit and how it would scale in a real-world scenario. It probably depends on the use case.

For testing you could switch the proxy on and off via a custom decorator and compare the results (or observe the pitfalls):

from numba import jit
from numbox.core.proxy import proxy

# global settings
USE_PROXY = True
NUMBA_CACHE = True
DEFAULT_ARGS_JIT = {
    "nopython": True,
    "cache": NUMBA_CACHE,
}

def jit_or_proxy(*args, **kwargs):
    """Decorate with either numba.jit or the proxy, depending on USE_PROXY."""
    if not args:
        raise ValueError("The first positional argument must be the signature.")
    kwargs_ = DEFAULT_ARGS_JIT.copy()
    kwargs_.update(kwargs)
    if not USE_PROXY:
        return jit(*args, **kwargs_)
    else:
        return proxy(args[0], jit_options={**kwargs_})
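
Usage is then uniform across the codebase, e.g. (a sketch; worker and its body are made up):

@jit_or_proxy(sig)
def worker(x):
    return x + 1.0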

@DannyWeitekamp an update on the PIXIE project and Numba AOT would be great. Hopefully, this issue will be resolved by then.


@Oyibo I am interested in caching all (or most of) the functions in my project. To that end, in my runs of your code I had cached foo and foo_proxy.

Notice that __foo_proxy, created behind the scenes by the proxy decorator, is always inlined into the caller. This means that __foo_proxy’s only instruction, ‘call foo_proxy’, gets inlined into the caller. That’s why in caller_proxy’s LLVM code you see the call to foo_proxy (but nothing related to __foo_proxy).

Notice also that caller_proxy cannot possibly even get access to foo_proxy’s calculation (your nested for-loop), because it has been completely severed from it via the re-assignment of the foo_proxy name to the __foo_proxy dispatcher.

About this:

I don’t think that’s the case, in light of what was just mentioned above. In fact, you can see that caller_proxy’s LLVM code does not contain any of foo_proxy’s actual calculations (as I mentioned previously, __foo_proxy is just one instruction, which is easily inlined into the caller’s function block). This is to be contrasted with the caller’s LLVM code, which contains essentially verbatim the entirety of foo’s definition block.

About this:

Indeed, as I mentioned above, it does depend on the use case. The proxy declaration is constant-space (it does not depend on the callee’s number of instructions), so the benefit of proxying outweighs the cost for large enough callees.

Thanks again for your comments and careful investigation of this problem!


@milton that’s correct: foo_proxy.nbc and __foo_proxy.nbc will not be created if you just cache caller_proxy; they will be created if you use cache in the jit_options of the proxy decorator.
I found it illustrative because it shows on a surface level what happens behind the scenes, without having to investigate the LLVM IR.

@DannyWeitekamp Very interesting, thanks for the reference. Basically, the idea is that if func_1 is compiled and cached, and func_2 calls func_1 (both being JIT functions), then the dispatcher for func_2 should be able to find func_1 and load it from cache into the LLVM context - even if func_1 was not compiled (or loaded from cache) eagerly. This would allow func_2 to merely declare func_1 and expect it to be available in the shared scope when needed.

Right now, to the best of my knowledge, such a mechanism is lacking in numba; that’s why func_2’s LLVM code (and any code that needs func_1 without inlining func_1’s instructions) pretty much has to carry its own copy of func_1.
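
In the meantime, about the closest user-level approximation I can think of is a warm-up routine that generalizes the explicit-signature workaround from the OP - a sketch, with DEPENDENCIES being a hypothetical, hand-maintained list ordered leaves-first:

# hypothetical registry of (dispatcher, signature) pairs, ordered so that
# callees precede their callers
DEPENDENCIES = [(callee, sig)]

def warm_up():
    for fn, signature in DEPENDENCIES:
        fn.compile(signature)  # loads from cache (or compiles), populating the context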

@milton That sounds about right to me (I’m not a dev so I don’t really have a deep understanding of these things). @sklam or other devs involved in the PIXIE plan might be able to give you a better idea.

I believe the impetus for not having this functionality from the get-go was to lean into numba always doing full-program optimization on every end-point (i.e., any jitted function called from Python), instead of treating each function as part of a separate translation unit and cross-linking them. The benefit of this is that the compiler can be really aggressive about inlining and other optimizations, because each subprocedure called by the end-point can be specialized for its particular needs. But of course, this creates a lot of redundant compiled instructions. What I gather from the PIXIE proposal is that it aims to strike more of a balance and reuse intermediate pieces of the compilation pipeline more aggressively between end-points - some balance between AOT-compiling a library and JIT-compiling individual functions.
