Cache behaviour

hi,

I’m working to fix the current lack of cache invalidation when a secondary file (a module from which other jit functions are imported and then called by a main jit function) is modified.
I already have a branch where overloads are picked from the cache according to the code signature of the calling function and all of its dependencies. This means that a change in the code of a dependency no longer leads to incorrect calculations.
So far so good. While testing, I noticed that the cache of the main function keeps growing, i.e. it keeps accumulating overloads.
I found that this is because the code of the main function has not changed, so the cached overloads are not considered stale, merely unused. This could be a problem if the cache keeps growing indefinitely, so I started looking into it. In the process, I noticed that the cache is currently very pessimistic about changes in the main file: even changes outside the function invalidate the cache.
In short: the cache currently ignores changes in secondary files, yet invalidates on every change to the main file, even changes outside the main function or its closure variables.

I would like to propose moving away from invalidating the cache index based on the timestamp of the file, and using only the code+closure signature of the function itself. Would anyone see a problem with that? By code+closure signature I mean exactly the same information that is used to select the overload from within the cache.
The cache index is currently keyed by a combination of (function signature, target context, code hash, closures hash). My proposal is to use the code hash + closure hash to decide whether the index should be reset (implicitly discarding every existing cached compilation). When saving an overload in the cache, the cache machinery would compare the hashes of the latest overload against the existing ones, and reset the index if they don’t match.
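To make the proposal concrete, here is a minimal sketch. The names `combined_hash` and `CacheIndex` are hypothetical (not Numba’s actual cache classes), and a real implementation would key the index by more than this; the point is only the reset-on-hash-mismatch rule replacing the timestamp rule.

```python
import hashlib

def combined_hash(py_func):
    """Hash the function's bytecode and its closure cell contents.

    Hypothetical sketch: hashing only co_code misses changes to
    constants (co_consts), so a real implementation would likely
    need to cover more of the code object.
    """
    h = hashlib.sha256(py_func.__code__.co_code)
    for cell in (py_func.__closure__ or ()):
        h.update(repr(cell.cell_contents).encode())
    return h.hexdigest()

class CacheIndex:
    """Reset the index when the code+closure hash changes,
    instead of comparing file timestamps (sketch only)."""

    def __init__(self):
        self._hash = None
        self._overloads = {}

    def save(self, py_func, signature, overload):
        new_hash = combined_hash(py_func)
        if new_hash != self._hash:
            # Source changed: implicitly discard all cached compilations.
            self._overloads.clear()
            self._hash = new_hash
        self._overloads[signature] = overload
```

Saving overloads for an unchanged function accumulates them as before; saving one whose hash differs wipes the index first, which bounds the growth described above.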
All feedback and suggestions welcome.

Luk

I don’t know what it would break, but I’m in favor of removing the timestamp from the cache.

This would facilitate zipping up source trees together with the cache and having them ‘still work’ after unzipping, given the limited timestamp fidelity of the pkzip format.

There’s a discussion about that here

thanks @nelson2005, from the related post it seems you have been digging into the cache quite a lot too.

The cache takes a “snapshot” of the code (from py_func.__code__.co_code) and of the closures (from py_func.__closure__). If we knew this were enough (i.e., as safe as monitoring the file timestamp), I think I could find a way to replace the timestamp rule with a hashing rule.
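As a small illustration of why both parts of the snapshot matter: two functions can share identical bytecode and differ only in their closure cells, so hashing `co_code` alone would not distinguish them. (`make_scaler` is an invented example, not Numba code.)

```python
def make_scaler(factor):
    # Both returned functions share the same code object;
    # only the captured `factor` differs.
    def scale(x):
        return x * factor
    return scale

double = make_scaler(2)
triple = make_scaler(3)

# Identical bytecode...
assert double.__code__.co_code == triple.__code__.co_code
# ...but different closure contents, hence different behaviour.
assert double.__closure__[0].cell_contents == 2
assert triple.__closure__[0].cell_contents == 3
```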

Luk

Hi @luk-f-a,

The cache takes a “snapshot” of the code (from py_func.__code__.co_code) and of the closures (from py_func.__closure__). If we knew this is enough (ie, as safe as monitoring the file timestamp), I think I can find a way to replace the timestamp rule with a hashing rule.

If the closure is capturing another function, that function may need to be considered in the hash too, if I remember correctly. I implemented a simple recursive function for this in

Although I’m not sure whether this is enough. Hopefully a fix for Caching: functions captured in closure · Issue #6264 · numba/numba · GitHub can be included while you are touching this part of the code :slight_smile:
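A minimal sketch of such a recursive hash (the name `recursive_code_hash` and the cycle guard are my assumptions, not the implementation referenced above): it recurses into functions captured in the closure, so that an edit to a captured function changes the hash of the outer one.

```python
import hashlib

def recursive_code_hash(py_func, _seen=None):
    """Hash bytecode, recursing into functions captured in the closure.

    Sketch only: `_seen` guards against reference cycles, and non-function
    cell contents are folded in via repr().
    """
    if _seen is None:
        _seen = set()
    if id(py_func) in _seen:
        return b''  # already hashed along this path
    _seen.add(id(py_func))
    h = hashlib.sha256(py_func.__code__.co_code)
    for cell in (py_func.__closure__ or ()):
        value = cell.cell_contents
        if callable(value) and hasattr(value, '__code__'):
            # A captured function: its own code must contribute,
            # or edits to it would go unnoticed.
            h.update(recursive_code_hash(value, _seen))
        else:
            h.update(repr(value).encode())
    return h.digest()
```

With this, two outer functions that are textually identical but capture different inner functions hash differently, which is exactly the case issue #6264 is about.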

Thanks,

Alex

I’ve added this to the agenda for the dev meeting today, as I think it will be helpful for thinking about kickstarting progress on those PRs: Numba Meeting: 2022-08-23 - HackMD