Best practices for developing large Numba-dependent projects with both CPU and GPU compatibility

From @gmarkall (CUDA Simulator IndexError when using @jit instead of @cuda.jit(device=True) functions inside of kernels · Issue #7494 · numba/numba · GitHub)

I don’t think support for mixing jitted and non-jitted code (jitted CPU, non-jitted CUDA) is something I’d encourage support for, given that the visibility into functions when debugging will depend on which decorator was used - I think the way forward here is to recommend in the simulator documentation that NUMBA_DISABLE_JIT is used in conjunction with NUMBA_ENABLE_CUDASIM if @jit functions are called from kernels. (I wouldn’t want to enable this by default, because there are use cases where CPU JIT and CUDA JIT are separate, and one would want the speed for the CPU jitted code whilst still being able to debug the CUDA jitted code).
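For reference, a minimal sketch of that suggested debugging setup, assuming the flags are set from inside the script rather than in the shell (they must be set before Numba is imported):

```python
import os

# Run kernels on the CUDA simulator and leave @jit functions as plain Python,
# so that both are visible to the debugger. These flags only take effect if
# they are set before numba is imported.
os.environ["NUMBA_ENABLE_CUDASIM"] = "1"
os.environ["NUMBA_DISABLE_JIT"] = "1"

from numba import cuda, jit  # imported only after the flags are set
```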

I’ve been thinking more about what you said above, in the context of your Discourse suggestion to use nested functions. I’m in a bit of a tricky spot: I want to expose not only a top-level function but also at least one of the lower-level functions. Since I have a CUDA and a CPU version of each of two functions, I need the following four combinations.

|      | dist_matrix         | wasserstein_distance         |
|------|---------------------|------------------------------|
| CUDA | cuda_dist_matrix.py | cuda_wasserstein_distance.py |
| CPU  | cpu_dist_matrix.py  | cpu_wasserstein_distance.py  |

In other words, some people might want only dist_matrix, only wasserstein_distance, or both, and the choice of CUDA vs. CPU will largely depend on the hardware available to them.

My instinct has been to write one set of lower-level functions (e.g. in a helper.py) that can then be used by the higher-level functions dist_matrix() and wasserstein_distance() without copy-pasting them. If I use your Discourse suggestion to get around the painful environment-variable hack for defining local array sizes, then I either need the lower-level functions to be essentially inaccessible or need to define an f() and f2(local_size) pair of functions (2nd example). When dist_matrix has all of its lower-level functions in the same file, it’s ~1000 lines; if wasserstein_distance had all of its lower-level functions, it would be ~700 lines, I think. For dist_matrix(), there are really only ~50-100 lines that are specific to CUDA or CPU, plus a few calls that need to be swapped at various points (cuda.local.array vs. np.zeros, and occasionally dealing with cuda.grid()), so ~1200 lines of “distinct” code overall.
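As a rough sketch of the nested-function approach I’m referring to (the names here are just illustrative), the local array size is baked in as a closure variable, which Numba treats as a compile-time constant, so no environment variable is needed:

```python
from numba import cuda, float64

def make_kernel(local_size):
    # 'local_size' is closed over, so it is frozen as a compile-time constant
    # when the kernel is compiled -- cuda.local.array requires a constant shape.
    @cuda.jit
    def kernel(out):
        scratch = cuda.local.array(local_size, dtype=float64)
        i = cuda.grid(1)
        if i < out.size:
            acc = 0.0
            for j in range(local_size):
                scratch[j] = j  # placeholder work; real code would fill scratch
                acc += scratch[j]
            out[i] = acc

    return kernel

# Usage: build a kernel specialized for a particular scratch size.
kernel_100 = make_kernel(100)
```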

In order to split the four combinations into their own standalone files, I would need ~3400 lines in total, i.e. roughly 3x “boilerplate” copying relative to the distinct code. If I needed to update documentation, improve performance, or fix a bug in non-target-specific code, I would have to keep all four files consistent with each other. On the other hand, I might save a lot of time that I’d otherwise spend trying to get everything to play nicely together. For now, and especially since the code has been fairly well debugged, I’m thinking I’ll try one standalone file for each use case. Naturally, some of this will change as Numba continues to progress, but I’m also thinking about the long term, since I don’t know how many more times I’ll want to do something similar and face these same design decisions.
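For comparison, the shared-helper alternative I keep weighing would look roughly like this (names are made up, and it only works for helpers that avoid target-specific calls such as cuda.local.array):

```python
from numba import cuda, njit

# helper.py -- target-agnostic numerical core, written once
def _sq_diff(x, y):
    return (x - y) * (x - y)

# cpu_dist_matrix.py would wrap it for the CPU target:
cpu_sq_diff = njit(_sq_diff)

# cuda_dist_matrix.py would wrap it as a device function:
cuda_sq_diff = cuda.jit(device=True)(_sq_diff)
```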

I’m interested to hear your thoughts. What would you consider optimal given the various trade-offs? Are there alternatives that I’m overlooking?