This is going to be a bit lengthy.
In our project, we have a large number of JIT’ed functions performing laborious calculations, and it’s common for one JIT’ed function to call a number of other JIT’ed functions, which in turn have their own callees, and so on.
For performance reasons, we cache as much of the JIT compiled code as possible.
I’ve noticed that the cached file sizes are getting pretty large, with MBs or even tens of MBs per .nbc file becoming a common occurrence (the total cache can take up several GBs on disk).
The memory needs of the application are pretty large too: it can peak at 10 GB+ on the first run, when the compilation is performed and the cache is being built, and at about half of that on subsequent runs when the cache is loaded.
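To put numbers on the disk side of this, here is a minimal sketch that tallies the cache footprint, assuming the default cache location (the __pycache__ directories next to the source files; the ROOT path is a placeholder, and one would point it at NUMBA_CACHE_DIR instead if that is set):

# Minimal sketch: tally the numba cache (.nbc data files) under a project tree.
# Assumes the default cache location inside __pycache__ directories; ROOT is a
# placeholder for the actual project root.
import pathlib

ROOT = pathlib.Path(".")  # hypothetical project root
nbc_files = sorted(ROOT.rglob("__pycache__/*.nbc"), key=lambda p: p.stat().st_size)
total = sum(p.stat().st_size for p in nbc_files)
for p in nbc_files[-10:]:  # the ten largest cached overloads
    print(f"{p.stat().st_size / 2**20:8.1f} MB  {p}")
print(f"total: {total / 2**30:.2f} GB across {len(nbc_files)} .nbc files")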
Looking into that, apart from other possible causes of excess (such as lengthy mangled names of overloaded functions, which were also discussed here), I found that numba links the code libraries of the callees into the callers. The verification pass asserts that all dependencies have been linked in correctly.
To this effect, when I inspect the LLVM code of a caller function that invokes a callee function, I find the latter’s LLVM code embedded essentially verbatim in the former (as a full function definition rather than as inlined instructions; see also the comment after the code example below). It comes with a define linkonce_odr directive, so duplicate definitions are safely avoided and all that.
The fact remains though that if a caller has several callees, and so on, the LLVM code can grow pretty large.
As a consequence, the machine code that actually gets stored in the .nbc for a given function’s overload will in fact contain the instructions of every single JIT’ed function needed to run that function, the functions those functions need in turn, and so on. Needless to say, across a project comprising a large infrastructure of JIT’ed code this results in rather excessive repetition of machine code, manifesting in the aforementioned disk and memory costs.
My understanding is that statically linking in the code libraries for a given function’s dependencies is done mostly so that the function can be cached. This way, every cached function is essentially an independently executable program (within the numba / llvmlite runtime), as it doesn’t have to rely on any other JIT’ed functions having been made available in the context beforehand.
If not for the cache needs, I can simply rip out the add / link-in clause here and relax the verification to make it compile. Every define linkonce_odr for a dependency then disappears from the caller’s LLVM code; expectedly, a declare statement for every needed dependency of the caller appears in its LLVM code instead. Numba will then link all dependencies dynamically instead of statically.
This applies not only to user-defined jitted functions, but also to the numba-provided overloads for, e.g., numpy’s API.
However, if I cache the callee, then the first run of the program (when the cache is being built and functions are made available in the context just-in-time) is fine, but on a subsequent run, when the cache is getting loaded, an error is predictably thrown from the caller: its declare of callee cannot be resolved.
Indeed, if callee didn’t happen to be called prior to the caller within the same program run, then it isn’t available in the context for numba to dynamically link against. The same goes for, e.g., any needed numpy overloads and such: whatever is needed is now declared rather than defined, so if things are not called in the right order, a ‘symbol not found’ error ensues. The LLVM runtime simply isn’t aware where to even look for the requested dependencies.
A workaround is to type explicit signatures for every JIT’ed function and have an auxiliary routine bring the other overloads into the context. Here is an example (ignore the detailed calculations done by the functions there) which illustrates the point with the two numba changes made as described above, i.e., comment out this and this for the full experience:
import numba
import numpy

CACHE_CONFIG = True

sig = numba.int64(numba.int64)


@numba.njit(sig)
def aux(n):
    a = numpy.empty(n)
    a[0] = 1
    b = numpy.full(1, 2.17)
    return numpy.sum(numpy.zeros(1)) + a[0] + b[0]


@numba.njit(sig, cache=CACHE_CONFIG, inline='never')
def callee(n):
    """ About just laborious enough to prevent inlining... """
    arr1 = numpy.zeros(n)
    for i in range(n):
        arr1[i] = i * 1.1
    arr2 = numpy.full(2 * n, 3.14)
    arr3 = numpy.empty(2 * n)
    for i in range(2 * n):
        arr3[i] = arr2[i] + arr1[i // 2]
    return numpy.sum(arr1) + numpy.sum(arr3)


@numba.njit(sig, cache=CACHE_CONFIG)
def caller(n):
    arr = numpy.zeros(n)
    for i in range(n):
        arr[i] = callee(i) * 1.1
    return numpy.sum(arr)


def save(filename, content):
    with open(filename, "w") as f:
        f.write(content)


if __name__ == '__main__':
    _ = caller(1)

    callee_llvm = next(iter(callee.inspect_llvm().values()))
    caller_llvm = next(iter(caller.inspect_llvm().values()))

    save('callee_llvm', callee_llvm)
    save('caller_llvm', caller_llvm)
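Instead of eyeballing the saved dumps, the same check can be scripted. A small sketch that, given the callee_llvm and caller_llvm files written by the example above, counts full definitions versus plain declarations (with stock numba the caller’s module should contain define linkonce_odr bodies for its dependencies; with the two changes above applied, those turn into declare lines):

# Sketch: count embedded definitions vs. mere declarations in the dumped IR
# files produced by the example above.
def summarize(path):
    with open(path) as f:
        lines = f.read().splitlines()
    defines = sum(1 for line in lines if line.startswith("define linkonce_odr"))
    declares = sum(1 for line in lines if line.startswith("declare"))
    print(f"{path}: {defines} linkonce_odr definitions, {declares} declarations")

summarize("callee_llvm")
summarize("caller_llvm")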
(Numba tends to ignore inline='never' for simple functions. For my purposes, I needed to make sure inlining does not happen, so I made callee and caller just complex enough to skip inlining. What I am after is the phenomenon of the entire LLVM code of the callee being embedded into the caller, i.e. the full function definition rather than just some of its statements inlined into the caller.)
This works fine as-is: one can run it and have the callee cached. The second run will work fine too, since the callee is typed with an explicit signature sig, so it gets populated into the context before the caller, making it available for the caller’s dynamic symbol resolution.
Notice I also had to put together an auxiliary aux function, which is not cached, as it needs to make various numpy overloads (defined in the numba codebase) available for the callee itself.
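The explicit-signature aspect is just standard numba eager compilation and can be seen on a stock install; a tiny sketch showing that a dispatcher with an explicit signature is compiled at decoration time, i.e. before anything calls it:

# Sketch: with an explicit signature the dispatcher compiles eagerly, so the
# overload exists before anything calls it; a lazily-typed function has no
# compiled signatures until its first call.
import numba

@numba.njit(numba.int64(numba.int64))
def eager(n):
    return n + 1

@numba.njit
def lazy(n):
    return n + 1

print(eager.signatures)  # [(int64,)] - compiled at decoration time
print(lazy.signatures)   # [] - nothing compiled yet
lazy(1)
print(lazy.signatures)   # [(int64,)] - compiled on first call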
Try setting sig = None and see a segfault ensue on the second run. The same happens if aux gets cached. With CACHE_CONFIG = False, all is fine again on every run. All of this is explained by the discussion above.
Does anyone have best-practice suggestions on how to proceed here, particularly without having to surgically change numba as described above? The crux of the issue is the large memory demand in projects with an elaborate infrastructure of inter-dependent JIT’ed functions that have to be cached for performance reasons.
One idea was to swap JIT’ed functions in place for a declaration proxy, like here.
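For completeness, here is a rough sketch of what I imagine such a proxy could look like, built from numba.types.ExternalFunction plus llvmlite.binding.add_symbol; the symbol name callee_impl, the function names, and the cfunc detour for obtaining an address are my own illustrative assumptions, not necessarily how the linked approach does it:

# Rough sketch of a "declaration proxy": the caller only emits a `declare` for
# an external symbol, and the callee's address is re-registered with the JIT
# linker at startup on every run, so nothing of the callee's machine code gets
# linked into the caller's cached .nbc. Names here are illustrative only.
import numba
import llvmlite.binding as llvm
from numba import types

sig = types.int64(types.int64)

@numba.cfunc(sig, cache=True)
def callee_impl(n):
    return n * 2

# Make the compiled callee's entry point visible to the JIT linker under a
# fixed name; this must happen on every run, before the caller executes.
llvm.add_symbol("callee_impl", callee_impl.address)

# The proxy is only a typed declaration of the external symbol.
callee_proxy = types.ExternalFunction("callee_impl", sig)

@numba.njit(sig, cache=True)
def caller(n):
    return callee_proxy(n) + 1

print(caller(21))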