Hacking NUMBA_CACHE_DIR and _UserProvidedCacheLocator to package jit-cache in egg

Continuing here from gitter because the issue seems like it may be longer-form than I hoped :slight_smile:

I’m using numba 0.52.0 with Python 3.7.1 on Windows.

I have a Jenkins CI/CD pipeline that packages up my application into an egg, and I’d like to use this egg as the mechanism to deliver the application.

The jit caching for the application takes a long time, on the order of an hour. For this reason I’d like to find a way to pre-cache the functions and deliver them inside the egg. However, there’s a problem: the _UserProvidedCacheLocator uses a subdirectory under NUMBA_CACHE_DIR whose name is derived from a hash of the entire path. The CI/CD machine and the execution machine will have the code in different paths, so they will compute different cache directories. I think this means that I’ll need to take control of the path to the cached files a little more explicitly.
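To make the path dependence concrete, here is a minimal sketch of how a hash-derived cache subdirectory behaves. It is modeled on my reading of get_suitable_cache_subpath() in numba.core.caching (parent directory name plus a SHA-1 of the containing directory). The standalone function name cache_subpath and the example paths are made up for illustration, so check the numba source for the version you actually run.

```python
import hashlib
import os


def cache_subpath(py_file: str) -> str:
    """Hypothetical stand-in for numba's path-derived cache subdirectory name."""
    # Directory that contains the source file, as an absolute path.
    directory = os.path.dirname(os.path.abspath(py_file))
    # Keep the last directory component so the name stays human-readable.
    parent_name = os.path.split(directory)[-1]
    # Hash the full directory path; any change in location changes the digest.
    digest = hashlib.sha1(directory.encode()).hexdigest()
    return "_".join([parent_name, digest])


# Same package, two hypothetical install locations (build machine vs. user machine).
ci_path = cache_subpath("/jenkins/workspace/app/kernels.py")
user_path = cache_subpath("/home/user/site-packages/app/kernels.py")
print(ci_path)
print(user_path)
```

Both names start with the same parent directory name but end in different digests, which is why a cache built on the CI machine would not be looked up on the execution machine without some kind of override.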

My initial thought was to monkey-patch a get_suitable_cache_subpath() function into _UserProvidedCacheLocator. This feels a bit hackish to me… does anyone have advice about a better/cleaner way?

I have no helpful advice for this other than to consider AOT compilation if you haven’t already. In any case I would love to know what your solution is when you figure it out. Making an extension module with AOT compilation of course has the drawback that the available overloads are fixed for the end user, but JIT compilation would start from scratch on a fresh install. It seems like there should be something that accomplishes the best of both. I haven’t reached the point in developing my own projects where they are ready to distribute, but I have this looming concern that in cases where JIT is required it would be a bit annoying for the end user if everything needed to compile after a pip install. This seems like a creative solution. Best of luck.

Thanks. What I ended up doing was using the module’s filename to figure out whether it was running from an egg and, if so, unzipping the egg into a temporary directory and pointing the jit cache there.
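A rough sketch of that detection step, using only the standard library. The helper names find_enclosing_egg and unpack_egg are hypothetical (they are not the helpers from my project), and the approach assumes that a module loaded from an egg reports a filename of the form <path-to-egg>/<package>/<module>.py.

```python
import os
import zipfile
from typing import Optional


def find_enclosing_egg(module_file: str) -> Optional[str]:
    """Return the zip archive that contains module_file, or None if there is none."""
    path = os.path.abspath(module_file)
    while True:
        # A path component that is a regular file *and* a zip archive is the egg.
        if os.path.isfile(path) and zipfile.is_zipfile(path):
            return path
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root without finding an archive
            return None
        path = parent


def unpack_egg(egg_path: str, target_dir: str) -> str:
    """Extract the whole egg (sources plus any pre-built cache files) into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(egg_path) as zf:
        zf.extractall(target_dir)
    return target_dir


# Typical use from inside a packaged module (tempfile.mkdtemp() would give a target):
#     egg = find_enclosing_egg(__file__)
#     if egg is not None:
#         unpack_egg(egg, some_temporary_directory)
```

Walking up the path instead of looking for a ".egg" suffix keeps the check independent of how the archive happens to be named.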

Not super elegant but it seems to work well enough and avoids the hour of jitting for every run. The downside is that the unit test that builds the cache now takes an hour :slight_smile:

Edit:

Much of this example code was adapted from numba’s cache module. It does use a library function to unzip the egg but I hope the general idea is clear.

comments/criticisms welcome!

def monkey_patch_cache_locator(cache_root_dir: str, logr):
    """
    Unzips the egg files in sys.path and redirects the numba cache to the unzipped location
    changes the cache naming lookup strategy in numba.core.caching in order to better control the
    location of cache files that were unzipped from egg(s)
    :param cache_root_dir:  the directory that the egg(s) will be into unzipped into
    :return:  set of egg files that were unzipped
    """
    import os
    import sys
    import zipfile

    import numba
    import numba.core.caching
    import common_utils as cu  # my own helper module; provides unzip_egg()

    eggs = set()

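    for path in reversed(sys.path):
        path = os.path.abspath(path)
        if not zipfile.is_zipfile(path):
            continue
        # an egg is a zip archive that carries EGG-INFO/PKG-INFO metadata
        with zipfile.ZipFile(path) as zf:
            is_egg = 'EGG-INFO/PKG-INFO' in zf.namelist()
        if is_egg:
            eggs.add(cu.unzip_egg(path, cache_root_dir, logr))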

    # no eggs, no need to patch
    if not eggs:
        return eggs

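    def from_function(cls, py_func, py_file: str):
        # a source file that exists on disk is not inside an egg, so decline
        # (return None) and let numba try its other cache locators
        if os.path.isfile(py_file):
            return None
        return cls(py_func, py_file)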

    def init(self, py_func, py_file: str):
        self._lineno = py_func.__code__.co_firstlineno
        self._cache_path = None
        for egg in eggs:
            if py_file.startswith(egg):
                self._py_file = cache_root_dir + py_file[len(egg):]
                if not os.path.isfile(self._py_file):
                    raise RuntimeError(
                        f"Real python file '{self._py_file}' does not exist"
                        f" for {py_file} (available eggs are {','.join(eggs)})"
                    )
                self._cache_path = os.path.join(os.path.dirname(self._py_file), '__pycache__')

        if not self._cache_path:
            raise RuntimeError(
                f"unable to find cache target for {py_file} (available eggs are {','.join(eggs)})"
            )
        if not os.path.isdir(self._cache_path):
            raise RuntimeError(
                f"cache directory '{self._cache_path}' found "
                f"for {py_file} does not exist (available eggs are {','.join(eggs)})"
            )
    # monkey-patch
    numba.core.caching._UserProvidedCacheLocator.from_function = classmethod(from_function)
    numba.core.caching._UserProvidedCacheLocator.__init__ = init

    return eggs

Okay, there’s a little more to the story. The zip format used by eggs doesn’t preserve the st_mtime file timestamp granularity that numba records in the .nbi index files, so in my setup.py I stored a file containing a pickled dict that maps each filename to its timestamp, and I reset the timestamps when I unpacked the egg. There’s more discussion about my egg-caching odyssey here
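For anyone wanting to try the same trick, here is a minimal sketch of the save/restore idea. This is not my actual setup.py code: the manifest name timestamps.pkl and both function names are invented for the example. It assumes, as far as I understand it, that zip entries only keep modification times at 2-second resolution while numba compares the source file's st_mtime against the value stored in the index, so the original timestamps have to be carried alongside the files and re-applied after extraction.

```python
import os
import pickle

MANIFEST = "timestamps.pkl"  # hypothetical manifest file name


def save_timestamps(root: str) -> dict:
    """Record the modification time (in ns) of every file under root in a pickled dict."""
    stamps = {}
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            if name == MANIFEST:  # never record a manifest file itself
                continue
            full = os.path.join(dirpath, name)
            # Store relative, forward-slash paths so the manifest is location independent.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            stamps[rel] = os.stat(full).st_mtime_ns
    with open(os.path.join(root, MANIFEST), "wb") as fh:
        pickle.dump(stamps, fh)
    return stamps


def restore_timestamps(root: str) -> None:
    """Re-apply the recorded modification times after the egg has been unzipped into root."""
    with open(os.path.join(root, MANIFEST), "rb") as fh:
        stamps = pickle.load(fh)
    for rel, mtime_ns in stamps.items():
        full = os.path.join(root, *rel.split("/"))
        if os.path.isfile(full):
            # Set both access and modification time to the recorded value.
            os.utime(full, ns=(mtime_ns, mtime_ns))
```

The idea would be to call save_timestamps() on the build tree after the cache has been generated and before packaging, then call restore_timestamps() on the unzipped directory before numba first looks at the cache.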