LLVM upstream version vs. NVVM LLVM version

Numba uses llvmlite’s version of LLVM for building and optimizing IR before sending it to NVVM in the CUDA target. The current issues / concerns with this are:

  • A patch to LLVM 9 was required to stop the IR auto-upgrader replacing some @llvm.nvvm.atomic.* intrinsics with the atomicrmw instruction, which NVVM does not recognize: https://github.com/numba/llvmlite/pull/593 - this means the CUDA target can't be used with an LLVM that wasn't built as part of the llvmlite conda recipe, or that doesn't otherwise have these patches applied. Since most users use the conda package, this hasn't been a big problem so far, but distro packaging is unlikely to carry these patches for its LLVM.
  • Optimizing the IR with one LLVM version and then sending it to an earlier version potentially introduces correctness issues: https://github.com/numba/numba/issues/5576#issuecomment-646548553

The process of optimizing the IR prior to sending it to NVVM was originally introduced as an attempt to work around what appeared to be an NVVM bug (see issue 1341).

This was practical at the time, but it doesn't look like a good long-term solution. One useful side effect of running the optimization passes first is that it makes debugging code generation by inspecting the IR much more practical: the optimized IR is much shorter than the unoptimized IR emitted by Numba, and I can usually get a sense of where a problem is by looking at it. Tracing through unoptimized IR is quite exhausting and requires a lot of reference back to the Numba codebase to work out what the generated code does.

In an ideal world, we’d resolve the above issues with the following actions:

  • Check whether the NVVM issue described in issue 1341 is still present - if so, raise an issue with NVIDIA about it.
  • If not / when it is fixed, remove the IR optimization step from the CUDA target.

This would have a couple of useful side-effects:

  • No patch would be required for the LLVM that llvmlite is built with, because the IR wouldn’t get auto-upgraded to use atomicrmw.
  • Issue 5576 would potentially be fixed (and I don’t relish the thought of trying to actually debug that problem - I’ve spent quite some time on it already).

The downsides would be:

  • Debugging by looking at IR may become more difficult, because the libNVVM API doesn't provide access to its IR after optimization. This would be an inconvenience, but not a show-stopper: it could be mitigated by providing a way to optimize the IR with llvmlite purely to see what it looks like after optimization (see the sketch after this list), although the resulting IR would not be compatible with NVVM.
  • If there are some IR optimizations that work better in upstream LLVM, these would no longer be applied in the CUDA target.
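As a rough illustration of that mitigation, here is a minimal sketch of an inspection helper built on llvmlite's binding layer. It uses only the public llvmlite API; the optimize_for_inspection name and the default opt level are my own choices rather than existing Numba functionality, and the output is optimized by llvmlite's LLVM rather than NVVM's, so it is only useful for reading, not for feeding to NVVM.

from llvmlite import binding as llvm

# One-time LLVM initialization, as required before using the llvmlite bindings.
llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

def optimize_for_inspection(llvm_ir, opt_level=3):
    """Run LLVM module passes over textual IR purely to make it easier to read.

    The result reflects llvmlite's LLVM, not NVVM's, so treat it as a debugging
    aid rather than as what NVVM will actually produce.
    """
    mod = llvm.parse_assembly(llvm_ir)
    mod.verify()

    pmb = llvm.create_pass_manager_builder()
    pmb.opt_level = opt_level
    pm = llvm.ModulePassManager()
    pmb.populate(pm)
    pm.run(mod)

    return str(mod)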

I’m on leave until 20th July but on return I’d like to fix the issues by removing the IR optimization step - this is mainly a testing effort, to ensure that when it is removed there aren’t negative performance/correctness effects.

In the meantime, I’d like to solicit any thoughts on the situation / plan - perhaps @sklam @stuartarchibald you have some thoughts?

Would this make it possible for extensions like Awkward Array to run in numba.cuda.jit-ed functions? We're currently developing a CUDA backend for Awkward, which uses CUDA pointers in place of main-memory pointers and converts to and from CuPy, rather than NumPy, for the low-level buffers. I suspect some modifications would be necessary, but would Numba based on NVVM reduce barriers to something like the following in the future?

import numba as nb

@nb.cuda.jit
def something(awkward_array, output_numpy):
    i = nb.cuda.grid(1)
    data_structure = awkward_array[i]
    # ... walk over data_structure, do something ...
    output_numpy[i] = 0.0  # ... store whatever the walk computed ...

For context, Awkward’s DataModel in Numba is a set of arrays and integer indexes into those arrays that are interpreted by type-specific, hard-coded implementations of __getitem__, __len__, etc. So it’s all arrays and numbers internally; the pyobject is only used for reference counting and boxing return values (which won’t be the case for CUDA kernels, since they act in place). With the Awkward CUDA backend, these arrays can be GPU-resident/CuPy.
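To make that concrete, here is a deliberately simplified, self-contained sketch (not Awkward's actual classes; ListOfFloats is a made-up name) of the kind of layout being described: a jagged list stored as a flat values array plus integer offsets, with a type-specific __getitem__ and __len__ doing the interpretation. With the CUDA backend, the same arrays could be CuPy rather than NumPy.

import numpy as np

class ListOfFloats:
    """Toy stand-in for an Awkward-style jagged list of floats, stored as a
    flat content array plus integer offsets (not Awkward's real classes)."""

    def __init__(self, offsets, content):
        self.offsets = offsets  # integer array with (number of lists + 1) entries
        self.content = content  # flat float array holding all the values

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        # Type-specific, hard-coded interpretation: slice the flat content
        # using integer indexes; no per-element Python objects are involved.
        start, stop = self.offsets[i], self.offsets[i + 1]
        return self.content[start:stop]

# Represents [[1.1, 2.2], [], [3.3, 4.4, 5.5]]
jagged = ListOfFloats(np.array([0, 2, 2, 5]),
                      np.array([1.1, 2.2, 3.3, 4.4, 5.5]))
assert len(jagged) == 3
assert list(jagged[2]) == [3.3, 4.4, 5.5]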

Would this make it possible for extensions like Awkward Array to run in numba.cuda.jit-ed functions?

This change wouldn’t do anything for extensions - the CUDA target already uses NVVM, it’s just that the pipeline is currently:

Bytecode → Numba IR → LLVM IR → Optimized LLVM IR → NVVM → …

and I’m suggesting making it:

Bytecode → Numba IR → LLVM IR → NVVM → …

NVVM is itself based on LLVM, so it also runs LLVM optimization passes. The proposed change means we only run the optimizations once, instead of the current situation where we run them twice - first with LLVM 9, then with an earlier LLVM version inside NVVM - which causes some problems.

There are other facilities we should look at to support Awkward Arrays in CUDA jitted functions - I’ve started this topic for that discussion, and I hope we can make some progress there.

Here are my notes on the LLVM bitcode compatibility issue, after a long googling session:

From the LLVM Developer Policy documentation:

The current LLVM version supports loading any bitcode since version 3.0.

Newer releases can ignore features from older releases, but they cannot miscompile them. For example, if nsw is ever replaced with something else, dropping it would be a valid way to upgrade the IR.

The doc says that bitcode as old as version 3.0 can be upgraded to the latest version. This makes me wonder how hard it would be to make NVVM read the latest LLVM bitcode.
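To make the direction of the problem concrete, here is a small sketch using only the public llvmlite binding API: the LLVM that llvmlite is built against can round-trip its own bitcode (and would auto-upgrade older bitcode when reading it), but the bitcode it writes is exactly what NVVM's older LLVM would have to be able to read. The trivial add function is just something to serialize.

from llvmlite import binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

# A trivial module, just to have something to serialize.
ir_text = """
define i32 @add(i32 %a, i32 %b) {
entry:
  %c = add i32 %a, %b
  ret i32 %c
}
"""

mod = llvm.parse_assembly(ir_text)
bitcode = mod.as_bitcode()  # bytes in *this* LLVM version's bitcode format

# Reading it back with the same (newer) LLVM is fine; the open question above
# is whether NVVM's older LLVM could be made to accept bitcode written by a
# newer one, since upgrades are only guaranteed in the old-to-new direction.
roundtripped = llvm.parse_bitcode(bitcode)
print(roundtripped)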

Another interesting thing I found is that it is possible to combine the bitcode reader and writer from different LLVM versions: the libbcc project has a bitcode translator (source code) that upgrades pre-LLVM 3 bitcode to LLVM 3.

I forgot to mention the LLVM thread about downgrading bitcode: https://lists.llvm.org/pipermail/llvm-dev/2016-August/103384.html