Numba caching on a batch cluster?

Hi all,

I was wondering if anyone here has experience with using numba on a batch system.

Specifically I am facing the following situation:

  • We have a central HTCondor system with thousands of servers that presumably have a range of different hardware configurations. I am planning to submit hundreds of smaller jobs to the cluster, which would make use of some numba-accelerated code that I have written.

  • I know that the initial compilation of my code takes a while and may actually exceed the runtime of a single job, which would hardly be an efficient use of the resources.

I see two approaches to tackle this problem:

  1. Bundle together several small jobs into one big job, and provide a copy of my code that is local to a single worker instance. Since I am using numba’s caching feature, JIT compilation should only happen once per job collection. A possible downside I see here is that there is less flexibility in scheduling the individual jobs.

  2. Leave all the jobs individual and provide my code from a central network file system location that is accessible from the cluster instances. Ideally this would allow the first workers that spin up to take care of the compilation for their own architecture, while later runs would find the cached compiled versions in the centrally hosted package. However, I have no idea whether numba’s caching system is designed to deal with simultaneous access from various Python sessions running on different machines. Are there any file-based locking mechanisms? (A rough sketch of how either setup could be wired up follows this list.)
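
For context, here is a minimal sketch of the kind of setup I have in mind. The function name and the cache path are made up for illustration; the only real pieces are `cache=True` and the `NUMBA_CACHE_DIR` environment variable, which Numba reads to decide where to put its on-disk cache:

```python
import os

# Redirect the on-disk cache before importing numba. For approach 1 this
# could point at a job-local scratch directory; for approach 2 at a path on
# the shared network file system. (The path itself is just a placeholder.)
os.environ.setdefault("NUMBA_CACHE_DIR", "/scratch/numba_cache")

import numpy as np
from numba import njit


@njit(cache=True)  # cache=True persists the compiled machine code to disk
def my_kernel(x):
    # hypothetical stand-in for the actual accelerated code
    total = 0.0
    for v in x:
        total += v * v
    return total


if __name__ == "__main__":
    # The first call triggers JIT compilation (or loads a cached binary);
    # later processes on compatible hardware should reuse the cache files.
    print(my_kernel(np.arange(1_000_000, dtype=np.float64)))
```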

I will happily admit that these things are way out of my comfort zone. My experience working with the batch system is very limited, and I also don’t really know under which circumstances a recompilation of numba code is actually required (i.e. how different the hardware has to be for that to be necessary).

If somebody has any experience to share, I will happily soak it up. I’ll also report back at a later point if I figure out a way to do this well.

Cheers
Hannes

Hi @Hannes

RE 1. This would probably work, but as noted comes at the cost of flexibility in individual job scheduling.

RE 2. Without knowing the cluster setup it’s hard to guess at this. From memory, it should be OK to cache across architectures, since system information (LLVM triple, CPU and target machine features) is part of the cache key. Consideration probably needs to be given to how much concurrent access is made to the shared space!
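
If you want to gauge how much that system information actually varies across your workers, one option would be a small throwaway diagnostic job along these lines. It prints the same kind of details that go into the cache key, using llvmlite (which ships alongside Numba); the output format here is arbitrary:

```python
# Throwaway diagnostic: print the host's LLVM triple, CPU name and CPU
# features, to compare how much the hardware differs between cluster nodes.
import socket

from llvmlite import binding as llvm

# Standard LLVM initialisation calls, as used by Numba itself.
llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

print("host    :", socket.gethostname())
print("triple  :", llvm.get_process_triple())
print("cpu     :", llvm.get_host_cpu_name())
print("features:", llvm.get_host_cpu_features().flatten())
```

Submitting that to a handful of nodes should give a rough idea of how many distinct hardware configurations (and therefore how many separate compilations) to expect.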