In the CUDA target, Numba uses llvmlite’s version of LLVM to build and optimize IR before sending it to NVVM. The current issues / concerns with this are:
- A patch to LLVM 9 was required to stop the IR auto-upgrader replacing some `@llvm.nvvm.atomic.*` intrinsics with the `atomicrmw` instruction, which is unrecognized by NVVM: https://github.com/numba/llvmlite/pull/593. This means that the CUDA target can’t be used with an LLVM that wasn’t built as part of the llvmlite conda recipe, or that doesn’t otherwise include these patches. Since most users use the conda package this hasn’t been a great problem so far, but distro packagers are unlikely to apply these patches to their LLVM. (A minimal sketch of the auto-upgrade follows this list.)
- Optimizing the IR with one LLVM version and then sending it to an earlier one potentially introduces correctness issues: https://github.com/numba/numba/issues/5576#issuecomment-646548553
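To make the auto-upgrade issue concrete, here’s a minimal sketch (not Numba’s actual code, and the module is a hand-written stand-in) of parsing IR that calls one of the affected intrinsics with llvmlite:

```python
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

# A hand-written module calling one of the affected intrinsics - an
# illustrative stand-in, not IR that Numba actually emits.
ir_source = """
declare float @llvm.nvvm.atomic.load.add.f32.p0f32(float*, float)

define float @atomic_add(float* %p, float %v) {
entry:
  %old = call float @llvm.nvvm.atomic.load.add.f32.p0f32(float* %p, float %v)
  ret float %old
}
"""

mod = llvm.parse_assembly(ir_source)
# On a stock LLVM 9+ the call above is auto-upgraded to `atomicrmw fadd`,
# which NVVM rejects; on llvmlite's patched LLVM the intrinsic is preserved.
print(mod)
```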
The process of optimizing the IR prior to sending it to NVVM was originally introduced as an attempt to work around what appeared to be an NVVM bug:
- PR: https://github.com/numba/numba/pull/1344/files
- Issue and discussion: https://github.com/numba/numba/issues/1341
- Test script to check for the bug: https://gist.github.com/sklam/f62f1f48bb0be78f9ceb
This was practical at the time, but it doesn’t look like a good long-term solution. One beneficial side-effect of running optimization passes first is that it makes debugging code generation by inspecting the IR much more practical: the optimized IR is much shorter than the unoptimized version emitted by Numba, and I can usually get a sense of where a problem is by looking at it. Tracing through unoptimized IR is quite exhausting and requires a lot of cross-referencing with the Numba codebase to work out what the generated code does.
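For reference, the IR in question can be dumped from a compiled kernel - a minimal example, assuming a CUDA-capable GPU is present (the `axpy` kernel here is purely illustrative):

```python
from numba import cuda
import numpy as np

# A trivial kernel used only to have something to compile.
@cuda.jit
def axpy(r, a, x, y):
    i = cuda.grid(1)
    if i < r.size:
        r[i] = a * x[i] + y[i]

n = 16
r = np.zeros(n, dtype=np.float32)
x = np.ones(n, dtype=np.float32)
y = np.ones(n, dtype=np.float32)
axpy[1, n](r, np.float32(2.0), x, y)  # triggers compilation

# One entry per compiled signature. With the optimization step in place this
# shows the optimized IR; without it, it would be the raw IR Numba emits.
for sig, ir in axpy.inspect_llvm().items():
    print(sig)
    print(ir)
```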
In an ideal world, we’d resolve the above issues with the following actions:
- Check whether the NVVM issue described in issue 1341 is still present - if so, raise an issue with NVIDIA about it.
- If not / when it is fixed, remove the IR optimization step from the CUDA target (a sketch of the resulting pipeline follows this list).
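For illustration, with the optimization step removed the pipeline would amount to handing unoptimized (NVVM-compatible) IR straight to libNVVM. A hedged sketch, assuming the CUDA toolkit’s libNVVM is available - `llvm_to_ptx` is Numba’s internal wrapper around the libNVVM API and may change between versions, and the module below is a hand-written stand-in for the IR Numba emits:

```python
from numba.cuda.cudadrv import nvvm

# A hand-written NVVM-compatible module standing in for the IR Numba emits.
# The exact IR dialect accepted depends on the CUDA toolkit's libNVVM version.
ir_source = """
target triple = "nvptx64-nvidia-cuda"

define void @store_one(float* %out) {
entry:
  store float 1.0, float* %out
  ret void
}

!nvvm.annotations = !{!0}
!0 = !{void (float*)* @store_one, !"kernel", i32 1}
"""

# No llvmlite optimization beforehand - NVVM performs its own optimization
# at the requested level.
ptx = nvvm.llvm_to_ptx(ir_source, opt=3)
print(ptx.decode())
```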
Removing the optimization step would have a couple of useful side-effects:
- No patch would be required for the LLVM that llvmlite is built with, because the IR wouldn’t get auto-upgraded to use `atomicrmw`.
- Issue 5576 would potentially be fixed (and I don’t relish the thought of trying to actually debug that problem - I’ve spent quite some time on it already).
The downsides would be:
- Debugging by looking at IR may be more difficult, because the libNVVM API doesn’t provide access to its IR after optimization. This would be an inconvenience, but not a show-stopper: it could be mitigated by providing a way to optimize the IR with llvmlite just to see what it looks like after optimization, though the resulting IR would not be compatible with NVVM (a sketch of such a helper follows this list).
- If there are some IR optimizations that work better in upstream LLVM, these would no longer be applied in the CUDA target.
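The mitigation mentioned above might look something like this - a hedged sketch of a helper (the name `optimize_for_inspection` is hypothetical) that optimizes IR with llvmlite purely so a human can read it:

```python
import llvmlite.binding as llvm

def optimize_for_inspection(llvm_ir: str, opt_level: int = 3) -> str:
    """Optimize IR with llvmlite purely for human inspection.

    The result may have been auto-upgraded (e.g. to use `atomicrmw`), so it
    must never be fed back to NVVM.
    """
    llvm.initialize()
    llvm.initialize_native_target()
    llvm.initialize_native_asmprinter()

    mod = llvm.parse_assembly(llvm_ir)
    mod.verify()

    # Populate and run the standard module-level optimization pipeline.
    pmb = llvm.create_pass_manager_builder()
    pmb.opt_level = opt_level
    pm = llvm.create_module_pass_manager()
    pmb.populate(pm)
    pm.run(mod)
    return str(mod)
```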
I’m on leave until 20th July, but on my return I’d like to fix these issues by removing the IR optimization step. This is mainly a testing effort, to ensure that removing it doesn’t have negative performance or correctness effects.
In the meantime, I’d like to solicit any thoughts on the situation / plan - perhaps @sklam @stuartarchibald you have some thoughts?