Cache behaviour

hi,

I’m working to fix the current lack of cache invalidation when a secondary file (a module from which other jit functions are imported and then called by a main jit function) is modified.
I already have a branch where overloads are picked from the cache according to the code signature of the calling function and all of its dependencies. This means that a change in the code of a dependency no longer leads to incorrect calculations.
So far so good. While testing, I noticed that the cache of the main function keeps growing, i.e. it keeps accumulating overloads.
I found that this is because the code of the main function has not changed, so the cached overloads are not considered stale, merely unused. This could be a problem if the cache keeps growing indefinitely, so I started looking into it. In the process, I noticed that the cache is currently very pessimistic about changes in the main file: even changes outside the function invalidate the cache.
In short: the cache currently ignores changes in secondary files, yet invalidates on every change to the main file, even changes outside the main function or its closure variables.

I would like to propose moving away from invalidating the cache index based on the timestamp of the file, and using only the code+closure signature of the function itself. Would anyone see a problem with that? By code+closure signature I mean exactly the same information that is used to select the overload from within the cache.
The cache index is currently keyed by a combination of (function signature, target context, code hash, closures hash). My proposal is to use the code hash + closure hash to decide whether the index should be reset (implicitly discarding every existing cached compilation). When saving an overload in the cache, the cache machinery would compare the hashes of the latest overload against the existing ones, and reset the index if they don’t match.
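To make the proposal concrete, here is a minimal sketch. The names `combined_hash` and `CacheIndex` are hypothetical (not Numba’s actual cache classes), and a real implementation would key the index by more than this; the point is only the reset-on-hash-mismatch rule replacing the timestamp rule.

```python
import hashlib

def combined_hash(py_func):
    """Hash the function's bytecode and its closure cell contents.

    Hypothetical sketch: hashing only co_code misses changes to
    constants (co_consts), so a real implementation would likely
    need to cover more of the code object.
    """
    h = hashlib.sha256(py_func.__code__.co_code)
    for cell in (py_func.__closure__ or ()):
        h.update(repr(cell.cell_contents).encode())
    return h.hexdigest()

class CacheIndex:
    """Reset the index when the code+closure hash changes,
    instead of comparing file timestamps (sketch only)."""

    def __init__(self):
        self._hash = None
        self._overloads = {}

    def save(self, py_func, signature, overload):
        new_hash = combined_hash(py_func)
        if new_hash != self._hash:
            # Source changed: implicitly discard all cached compilations.
            self._overloads.clear()
            self._hash = new_hash
        self._overloads[signature] = overload
```

Saving overloads for an unchanged function accumulates them as before; saving one whose hash differs wipes the index first, which bounds the growth described above.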
All feedback and suggestions welcome.

Luk

I don’t know what it would break, but I’m in favor of removing the timestamp from the cache.

This would facilitate zipping up source trees together with the cache and having them ‘still work’ after unzipping, given the limited timestamp fidelity of the pkzip format.

There’s a discussion about that here

thanks @nelson2005, from the related post it seems you have been digging into the cache quite a lot too.

The cache takes a “snapshot” of the code (from py_func.__code__.co_code) and of the closures (from py_func.__closure__). If we knew this were enough (i.e., as safe as monitoring the file timestamp), I think I could find a way to replace the timestamp rule with a hashing rule.
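As a small illustration of why both parts of the snapshot matter: two functions can share identical bytecode and differ only in their closure cells, so hashing `co_code` alone would not distinguish them. (`make_scaler` is an invented example, not Numba code.)

```python
def make_scaler(factor):
    # Both returned functions share the same code object;
    # only the captured `factor` differs.
    def scale(x):
        return x * factor
    return scale

double = make_scaler(2)
triple = make_scaler(3)

# Identical bytecode...
assert double.__code__.co_code == triple.__code__.co_code
# ...but different closure contents, hence different behaviour.
assert double.__closure__[0].cell_contents == 2
assert triple.__closure__[0].cell_contents == 3
```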

Luk

Hi @luk-f-a,

The cache takes a “snapshot” of the code (from py_func.__code__.co_code) and of the closures (from py_func.__closure__). If we knew this is enough (ie, as safe as monitoring the file timestamp), I think I can find a way to replace the timestamp rule with a hashing rule.

If the closure is capturing another function, that function may need to be considered in the hash too, if I remember correctly. I implemented a simple recursive function for this in

Although I’m not sure whether this is enough. Hopefully a fix for Caching: functions captured in closure · Issue #6264 · numba/numba · GitHub can be included while you are touching this part of the code :slight_smile:
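A minimal sketch of such a recursive hash (the name `recursive_code_hash` and the cycle guard are my assumptions, not the implementation referenced above): it recurses into functions captured in the closure, so that an edit to a captured function changes the hash of the outer one.

```python
import hashlib

def recursive_code_hash(py_func, _seen=None):
    """Hash bytecode, recursing into functions captured in the closure.

    Sketch only: `_seen` guards against reference cycles, and non-function
    cell contents are folded in via repr().
    """
    if _seen is None:
        _seen = set()
    if id(py_func) in _seen:
        return b''  # already hashed along this path
    _seen.add(id(py_func))
    h = hashlib.sha256(py_func.__code__.co_code)
    for cell in (py_func.__closure__ or ()):
        value = cell.cell_contents
        if callable(value) and hasattr(value, '__code__'):
            # A captured function: its own code must contribute,
            # or edits to it would go unnoticed.
            h.update(recursive_code_hash(value, _seen))
        else:
            h.update(repr(value).encode())
    return h.digest()
```

With this, two outer functions that are textually identical but capture different inner functions hash differently, which is exactly the case issue #6264 is about.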

Thanks,

Alex

I’ve added this to the agenda for the dev meeting today, as I think it will be helpful for thinking about kickstarting progress on those PRs: Numba Meeting: 2022-08-23 - HackMD