How to use the `cfunc.<mangled name>` wrapper from the LLVM IR of `nb.njit(f)`

In numba/numba/core/funcdesc.py at f0d24824fcd6a454827e3c108882395d00befc04 (numba/numba on GitHub), Numba exposes a C-compatible wrapper in the LLVM IR:

    def llvm_cfunc_wrapper_name(self):
        """
        The LLVM-registered name for a C-compatible wrapper of the
        raw function.
        """
        return 'cfunc.' + self.mangled_name

Is there documentation on how to properly invoke the C-compatible wrapper? For example, given:

import numba as nb
import numpy as np

@nb.njit
def f(a: np.ndarray):
    return a + 1

f(np.zeros((3, 3), dtype=np.float32))

What will be the signature of the C-compatible wrapper in LLVM IR in this case?

The cfunc.<...> wrapper currently generated for f above looks like this:


define { ptr, ptr, i64, i64, ptr, [2 x i64], [2 x i64] } @cfunc._ZN8__main__1fB2v1B38c8tJTIeFIjxB2IKSgI4CrvQClQZ6FczSBAA_3dE5ArrayIdLi2E1C7mutable7alignedE({ ptr, ptr, i64, i64, ptr, [2 x i64], [2 x i64] } %.1) local_unnamed_addr {
entry:
  %.3 = alloca { ptr, ptr, i64, i64, ptr, [2 x i64], [2 x i64] }, align 8
  %.fca.1.gep = getelementptr inbounds nuw i8, ptr %.3, i64 8
  %.fca.2.gep = getelementptr inbounds nuw i8, ptr %.3, i64 16
  %.fca.3.gep = getelementptr inbounds nuw i8, ptr %.3, i64 24
  %.fca.4.gep = getelementptr inbounds nuw i8, ptr %.3, i64 32
  %.fca.5.0.gep = getelementptr inbounds nuw i8, ptr %.3, i64 40
  %.fca.5.1.gep = getelementptr inbounds nuw i8, ptr %.3, i64 48
  %.fca.6.0.gep = getelementptr inbounds nuw i8, ptr %.3, i64 56
  %.fca.6.1.gep = getelementptr inbounds nuw i8, ptr %.3, i64 64
  %excinfo = alloca ptr, align 8
  call void @llvm.memset.p0.i64(ptr noundef nonnull align 8 dereferenceable(72) %.3, i8 0, i64 72, i1 false)
  store ptr null, ptr %excinfo, align 8
  %extracted.meminfo = extractvalue { ptr, ptr, i64, i64, ptr, [2 x i64], [2 x i64] } %.1, 0
  %extracted.data = extractvalue { ptr, ptr, i64, i64, ptr, [2 x i64], [2 x i64] } %.1, 4
  %extracted.shape = extractvalue { ptr, ptr, i64, i64, ptr, [2 x i64], [2 x i64] } %.1, 5
  %.7 = extractvalue [2 x i64] %extracted.shape, 0
  %.8 = extractvalue [2 x i64] %extracted.shape, 1
  %.11 = call i32 @_ZN8__main__1fB2v1B38c8tJTIeFIjxB2IKSgI4CrvQClQZ6FczSBAA_3dE5ArrayIdLi2E1C7mutable7alignedE(ptr nonnull %.3, ptr nonnull %excinfo, ptr %extracted.meminfo, ptr poison, i64 poison, i64 poison, ptr %extracted.data, i64 %.7, i64 %.8, i64 poison, i64 poison) #3
  %.12 = load ptr, ptr %excinfo, align 8
  %.21.fca.0.load = load ptr, ptr %.3, align 8
  %.21.fca.1.load = load ptr, ptr %.fca.1.gep, align 8
  %.21.fca.2.load = load i64, ptr %.fca.2.gep, align 8
  %.21.fca.3.load = load i64, ptr %.fca.3.gep, align 8
  %.21.fca.4.load = load ptr, ptr %.fca.4.gep, align 8
  %.21.fca.5.0.load = load i64, ptr %.fca.5.0.gep, align 8
  %.21.fca.5.1.load = load i64, ptr %.fca.5.1.gep, align 8
  %.21.fca.6.0.load = load i64, ptr %.fca.6.0.gep, align 8
  %.21.fca.6.1.load = load i64, ptr %.fca.6.1.gep, align 8
  %inserted.meminfo = insertvalue { ptr, ptr, i64, i64, ptr, [2 x i64], [2 x i64] } undef, ptr %.21.fca.0.load, 0
  %inserted.parent = insertvalue { ptr, ptr, i64, i64, ptr, [2 x i64], [2 x i64] } %inserted.meminfo, ptr %.21.fca.1.load, 1
  %inserted.nitems = insertvalue { ptr, ptr, i64, i64, ptr, [2 x i64], [2 x i64] } %inserted.parent, i64 %.21.fca.2.load, 2
  %inserted.itemsize = insertvalue { ptr, ptr, i64, i64, ptr, [2 x i64], [2 x i64] } %inserted.nitems, i64 %.21.fca.3.load, 3
  %inserted.data = insertvalue { ptr, ptr, i64, i64, ptr, [2 x i64], [2 x i64] } %inserted.itemsize, ptr %.21.fca.4.load, 4
  %.30 = insertvalue [2 x i64] undef, i64 %.21.fca.5.0.load, 0
  %.32 = insertvalue [2 x i64] %.30, i64 %.21.fca.5.1.load, 1
  %inserted.shape = insertvalue { ptr, ptr, i64, i64, ptr, [2 x i64], [2 x i64] } %inserted.data, [2 x i64] %.32, 5
  %.34 = insertvalue [2 x i64] undef, i64 %.21.fca.6.0.load, 0
  %.36 = insertvalue [2 x i64] %.34, i64 %.21.fca.6.1.load, 1
  %inserted.strides = insertvalue { ptr, ptr, i64, i64, ptr, [2 x i64], [2 x i64] } %inserted.shape, [2 x i64] %.36, 6
  %.38 = alloca i32, align 4
  store i32 0, ptr %.38, align 4
  %cond = icmp eq i32 %.11, 0
  br i1 %cond, label %common.ret, label %entry.if.if

common.ret:                                       ; preds = %entry, %entry.if.if.if.if, %.41
  %common.ret.op = phi { ptr, ptr, i64, i64, ptr, [2 x i64], [2 x i64] } [ zeroinitializer, %entry.if.if.if.if ], [ %inserted.strides, %.41 ], [ %inserted.strides, %entry ]
  ret { ptr, ptr, i64, i64, ptr, [2 x i64], [2 x i64] } %common.ret.op

.41:                                              ; preds = %entry.if.if.endif.if, %entry.if.if.endif
  %.89 = call ptr @PyUnicode_FromString(ptr nonnull @".const.<numba.core.cpu.CPUContext object at 0x53af7c21d190>")
  call void @PyErr_WriteUnraisable(ptr %.89)
  call void @Py_DecRef(ptr %.89)
  call void @numba_gil_release(ptr nonnull %.38)
  br label %common.ret

entry.if.if:                                      ; preds = %entry
  call void @numba_gil_ensure(ptr nonnull %.38)
  call void @PyErr_Clear()
  %.44 = load { ptr, i32, ptr, ptr, i32 }, ptr %.12, align 8
  %.45 = extractvalue { ptr, i32, ptr, ptr, i32 } %.44, 4
  %.46 = icmp sgt i32 %.45, 0
  %.49 = extractvalue { ptr, i32, ptr, ptr, i32 } %.44, 0
  %.51 = extractvalue { ptr, i32, ptr, ptr, i32 } %.44, 1
  br i1 %.46, label %entry.if.if.if, label %entry.if.if.else

entry.if.if.if:                                   ; preds = %entry.if.if
  %.52 = sext i32 %.51 to i64
  %.53 = call ptr @PyBytes_FromStringAndSize(ptr %.49, i64 %.52)
  %.54 = load { ptr, i32, ptr, ptr, i32 }, ptr %.12, align 8
  %.55 = extractvalue { ptr, i32, ptr, ptr, i32 } %.54, 2
  %.57 = extractvalue { ptr, i32, ptr, ptr, i32 } %.54, 3
  %.59 = call ptr %.57(ptr %.55)
  %.60 = icmp eq ptr %.59, null
  br i1 %.60, label %entry.if.if.if.if, label %entry.if.if.if.endif, !prof !0

entry.if.if.else:                                 ; preds = %entry.if.if
  %.73 = extractvalue { ptr, i32, ptr, ptr, i32 } %.44, 2
  %.74 = call ptr @numba_unpickle(ptr %.49, i32 %.51, ptr %.73)
  br label %entry.if.if.endif

entry.if.if.endif:                                ; preds = %entry.if.if.if.endif, %entry.if.if.else
  %.76 = phi ptr [ %.64, %entry.if.if.if.endif ], [ %.74, %entry.if.if.else ]
  %.77.not = icmp eq ptr %.76, null
  br i1 %.77.not, label %.41, label %entry.if.if.endif.if, !prof !0

entry.if.if.if.if:                                ; preds = %entry.if.if.if
  call void @PyErr_SetString(ptr nonnull @PyExc_RuntimeError, ptr nonnull @".const.Error creating Python tuple from runtime exception arguments.1")
  br label %common.ret

entry.if.if.if.endif:                             ; preds = %entry.if.if.if
  %.64 = call ptr @numba_runtime_build_excinfo_struct(ptr %.53, ptr nonnull %.59)
  call void @NRT_Free(ptr nonnull %.12)
  br label %entry.if.if.endif

entry.if.if.endif.if:                             ; preds = %entry.if.if.endif
  call void @numba_do_raise(ptr nonnull %.76)
  br label %.41
}

How does { ptr, ptr, i64, i64, ptr, [2 x i64], [2 x i64] } %.1 translate into a C function signature?

Full LLVM IR generated: LLVM IR - JustPaste.it

Maybe I should look into the no_cfunc_wrapper flag.

This struct, { ptr, ptr, i64, i64, ptr, [2 x i64], [2 x i64] }, is laid out according to the numba array model.
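As a rough illustration of that layout, the sketch below mirrors the seven fields of the 2-D struct with Python's ctypes. The class name `Array2D` is hypothetical; the field names are taken from the `%extracted.*` and `%inserted.*` value names in the IR above, and a 64-bit platform is assumed.

```python
import ctypes

class Array2D(ctypes.Structure):
    """Hypothetical ctypes mirror of the 2-D array struct seen in the IR."""
    _fields_ = [
        ("meminfo", ctypes.c_void_p),     # ptr
        ("parent", ctypes.c_void_p),      # ptr
        ("nitems", ctypes.c_int64),       # i64
        ("itemsize", ctypes.c_int64),     # i64
        ("data", ctypes.c_void_p),        # ptr
        ("shape", ctypes.c_int64 * 2),    # [2 x i64], stored inline
        ("strides", ctypes.c_int64 * 2),  # [2 x i64], stored inline
    ]

# Byte offset of every field, for comparison with the getelementptr offsets.
offsets = {name: getattr(Array2D, name).offset for name, _ in Array2D._fields_}
total_size = ctypes.sizeof(Array2D)

print(offsets)
print(total_size)
```

On a 64-bit platform the total size should come out as 72 bytes, which agrees with the `memset` of 72 bytes in the wrapper, and the shape and strides arrays occupy offsets 40 to 71, matching the `%.fca.5.*` and `%.fca.6.*` offsets.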

Thanks for the pointer, @milton !

In that case, I will assume that as long as I follow the numba array model you shared, I should be able to:

1. nb.njit a Python function
2. Export its LLVM IR
3. Compile the LLVM IR into a C function
4. Call the C function at cfunc._ZN8__main__1fB2v1B38c8tJTIeFIjxB2IKSgI4CrvQClQZ6FczSBAA_3dE5ArrayIdLi2E1C7mutable7alignedE as f(meminfo, parent, nitems, itemsize, data, shape, strides)
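For the step between exporting and compiling, one possible way to locate the wrapper symbol in the exported IR text is a small regular-expression scan. This is only a sketch: `SAMPLE_IR` is a shortened, hypothetical excerpt standing in for the real string returned by `inspect_llvm()`, and `find_cfunc_wrappers` is an illustrative helper, not part of Numba.

```python
import re

# Shortened, hypothetical IR excerpt (assumption). Real input would be the
# IR string for one signature, which is much longer.
SAMPLE_IR = """
define i32 @_ZN8__main__1fE(ptr %retptr, ptr %excinfo, i64 %arg.x) {
  ret i32 0
}
define i64 @cfunc._ZN8__main__1fE(i64 %.1) local_unnamed_addr {
  ret i64 0
}
"""

# Matches `define <ret> @cfunc.<name>(<args>)` on a single line.
# Limitation: argument lists containing parentheses are cut at the first ')'.
CFUNC_DEFINE = re.compile(
    r"^define[ \t]+(?P<ret>.+?)[ \t]+@(?P<name>cfunc\.[^\s(]+)\((?P<args>.*?)\)",
    re.MULTILINE,
)

def find_cfunc_wrappers(llvm_ir):
    """Return (symbol name, return type, argument list) per cfunc wrapper."""
    return [(m["name"], m["ret"], m["args"]) for m in CFUNC_DEFINE.finditer(llvm_ir)]

wrappers = find_cfunc_wrappers(SAMPLE_IR)
for name, ret, args in wrappers:
    print(name, "|", ret, "|", args)
```

The symbol name found this way is what would be looked up in the execution engine, and the captured return and argument types are the ones that need a C equivalent.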

A follow-up question: does every one of them have a corresponding C type?

For meminfo, I assume it corresponds to NRT_MemInfo in core/runtime/nrt_external.h.

What about the other types? What does types.UniTuple map to in C? According to the LLVM IR documentation, I assume I can pass something like

int64_t a[2] = {...};

as the argument for a [2 x i64] field. Is that correct?

@jimlin I think your steps are reasonable. Indeed, meminfo refers to the NRT struct type you mentioned, which in this case wraps the NumPy array (it keeps the reference count, the pointer to the array itself, etc.; here is a simple demo).

Gotcha thanks Milton!

Also, it looks like cfunc.<mangled name> is always generated.

Is there any existing example in the Numba codebase that calls cfunc.<mangled name> directly?

For context, I am currently trying to run this example:

@nb.njit
def my_array_add(arr: np.ndarray):
  x = arr.size
  for i in range(arr.shape[0]):
    arr[i] = arr[i] + 1345 + x
  return arr

my_array_add(np.array([1,2,3], dtype=np.int32))

I exported the LLVM IR via my_array_add.inspect_llvm()[my_array_add.signatures[0]]

and then JIT-compiled it in a C++ environment via the LLVM execution engine.

  FuncAddress func_addr = compile_llvm_ir(engine, llvm_ir);

  // NOTE:
  // I learned what the inputs to the cfunc are by inspecting/printing values
  // from the `NRT_adapt_ndarray_from_python` function call.
  int32_t data[] = {1, 2, 3};
  int64_t nitems = 3;
  int64_t itemsize = 4;
  int64_t shape[1] = {3};
  int64_t strides[1] = {4};

  NRT_MemInfo *meminfo = NRT_MemInfo_new(reinterpret_cast<void *>(data), 0,
                                         pyobject_dtor, nullptr);
  std::cout << "meminfo: " << meminfo << std::endl;

  RetType (*f)(NRT_MemInfo *, /*parent*/ void *, /*nitems*/ int64_t,
           /*itemsize*/ int64_t, /*data*/ int32_t *, /*shape*/ int64_t[1],
           /*strides*/ int64_t[1]) =
      (reinterpret_cast<RetType (*)(NRT_MemInfo *, /*parent*/ void *,
                                /*nitems*/ int64_t, /*itemsize*/ int64_t,
                                /*data*/ int32_t *, /*shape*/ int64_t[1],
                                /*strides*/ int64_t[1])>(func_addr));

  RetType ret = f(meminfo, nullptr, nitems, itemsize, data, shape, strides);
  std::cerr << "Returned meminfo: " << ret.meminfo << std::endl;
  for(int i = 0; i < nitems; ++i){
    std::cerr << "data[" << i << "] = " << data[i] << "\n";
  }

I am currently stuck here: the program crashes with Segmentation fault (core dumped). I think the fault happens inside _ZN8__main__12my_array_addB2v1B38c8tJTIeFIjxB2IKSgI4CrvQClQZ6FczSBAA_3dE5ArrayIiLi1E1C7mutable7alignedE.

Still trying to debug it.

; Function Attrs: nofree norecurse nounwind memory(argmem: readwrite)
define noundef i32 @_ZN8__main__12my_array_addB2v1B38c8tJTIeFIjxB2IKSgI4CrvQClQZ6FczSBAA_3dE5ArrayIiLi1E1C7mutable7alignedE(ptr noalias writeonly captures(none) %retptr, ptr noalias readnone captures(none) %excinfo, ptr %arg.arr.0, ptr %arg.arr.1, i64 %arg.arr.2, i64 %arg.arr.3, ptr %arg.arr.4, i64 %arg.arr.5.0, i64 %arg.arr.6.0) local_unnamed_addr #0 {
B0.endif:
  tail call void @NRT_incref(ptr %arg.arr.0)
  %.120113.not = icmp slt i64 %arg.arr.5.0, 1
  br i1 %.120113.not, label %B110, label %iter.check

iter.check:                                       ; preds = %B0.endif
  %0 = trunc i64 %arg.arr.2 to i32
  %1 = add i32 %0, 1345
  %min.iters.check = icmp ult i64 %arg.arr.5.0, 4
  br i1 %min.iters.check, label %B72.preheader, label %vector.main.loop.iter.check

vector.main.loop.iter.check:                      ; preds = %iter.check
  %const = bitcast i64 9223372036854775776 to i64
  %min.iters.check1 = icmp ult i64 %arg.arr.5.0, 32
  br i1 %min.iters.check1, label %vec.epilog.ph, label %vector.ph

vector.ph:                                        ; preds = %vector.main.loop.iter.check
  %n.vec = and i64 %arg.arr.5.0, %const
  %broadcast.splatinsert = insertelement <8 x i32> poison, i32 %1, i64 0
  %broadcast.splat = shufflevector <8 x i32> %broadcast.splatinsert, <8 x i32> poison, <8 x i32> zeroinitializer
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %sunkaddr = mul i64 %index, 4
  %sunkaddr22 = getelementptr i8, ptr %arg.arr.4, i64 %sunkaddr
  %wide.load = load <8 x i32>, ptr %sunkaddr22, align 4
  %sunkaddr23 = mul i64 %index, 4
  %sunkaddr24 = getelementptr i8, ptr %arg.arr.4, i64 %sunkaddr23
  %sunkaddr25 = getelementptr i8, ptr %sunkaddr24, i64 32
  %wide.load2 = load <8 x i32>, ptr %sunkaddr25, align 4
  %sunkaddr26 = mul i64 %index, 4
  %sunkaddr27 = getelementptr i8, ptr %arg.arr.4, i64 %sunkaddr26
  %sunkaddr28 = getelementptr i8, ptr %sunkaddr27, i64 64
  %wide.load3 = load <8 x i32>, ptr %sunkaddr28, align 4
  %sunkaddr29 = mul i64 %index, 4
  %sunkaddr30 = getelementptr i8, ptr %arg.arr.4, i64 %sunkaddr29
  %sunkaddr31 = getelementptr i8, ptr %sunkaddr30, i64 96
  %wide.load4 = load <8 x i32>, ptr %sunkaddr31, align 4
  %2 = add <8 x i32> %broadcast.splat, %wide.load
  %3 = add <8 x i32> %broadcast.splat, %wide.load2
  %4 = add <8 x i32> %broadcast.splat, %wide.load3
  %5 = add <8 x i32> %broadcast.splat, %wide.load4
  store <8 x i32> %2, ptr %sunkaddr22, align 4
  store <8 x i32> %3, ptr %sunkaddr25, align 4
  store <8 x i32> %4, ptr %sunkaddr28, align 4
  store <8 x i32> %5, ptr %sunkaddr31, align 4
  %index.next = add nuw i64 %index, 32
  %6 = icmp eq i64 %n.vec, %index.next
  br i1 %6, label %middle.block, label %vector.body, !llvm.loop !0

middle.block:                                     ; preds = %vector.body
  %cmp.n = icmp eq i64 %arg.arr.5.0, %n.vec
  br i1 %cmp.n, label %B110, label %vec.epilog.iter.check

vec.epilog.iter.check:                            ; preds = %middle.block
  %n.vec.remaining = and i64 %arg.arr.5.0, 28
  %min.epilog.iters.check = icmp eq i64 %n.vec.remaining, 0
  br i1 %min.epilog.iters.check, label %B72.preheader, label %vec.epilog.ph

vec.epilog.ph:                                    ; preds = %vec.epilog.iter.check, %vector.main.loop.iter.check
  %vec.epilog.resume.val = phi i64 [ %n.vec, %vec.epilog.iter.check ], [ 0, %vector.main.loop.iter.check ]
  %const_mat = add i64 %const, 28
  %n.vec6 = and i64 %arg.arr.5.0, %const_mat
  %broadcast.splatinsert7 = insertelement <4 x i32> poison, i32 %1, i64 0
  %broadcast.splat8 = shufflevector <4 x i32> %broadcast.splatinsert7, <4 x i32> poison, <4 x i32> zeroinitializer
  br label %vec.epilog.vector.body

vec.epilog.vector.body:                           ; preds = %vec.epilog.vector.body, %vec.epilog.ph
  %index9 = phi i64 [ %vec.epilog.resume.val, %vec.epilog.ph ], [ %index.next11, %vec.epilog.vector.body ]
  %7 = shl i64 %index9, 2
  %scevgep13 = getelementptr i8, ptr %arg.arr.4, i64 %7
  %wide.load10 = load <4 x i32>, ptr %scevgep13, align 4
  %8 = add <4 x i32> %broadcast.splat8, %wide.load10
  store <4 x i32> %8, ptr %scevgep13, align 4
  %index.next11 = add nuw i64 %index9, 4
  %9 = icmp eq i64 %n.vec6, %index.next11
  br i1 %9, label %vec.epilog.middle.block, label %vec.epilog.vector.body, !llvm.loop !3

vec.epilog.middle.block:                          ; preds = %vec.epilog.vector.body
  %cmp.n12 = icmp eq i64 %arg.arr.5.0, %n.vec6
  br i1 %cmp.n12, label %B110, label %B72.preheader

B72.preheader:                                    ; preds = %vec.epilog.iter.check, %vec.epilog.middle.block, %iter.check
  %.126112114.ph = phi i64 [ 0, %iter.check ], [ %n.vec, %vec.epilog.iter.check ], [ %n.vec6, %vec.epilog.middle.block ]
  br label %B72

B72:                                              ; preds = %B72.preheader, %B72
  %.126112114 = phi i64 [ %.133, %B72 ], [ %.126112114.ph, %B72.preheader ]
  %.133 = add nuw nsw i64 %.126112114, 1
  %10 = shl i64 %.126112114, 2
  %scevgep = getelementptr i8, ptr %arg.arr.4, i64 %10
  %.187 = load i32, ptr %scevgep, align 4
  %.221 = add i32 %1, %.187
  store i32 %.221, ptr %scevgep, align 4
  %exitcond.not = icmp eq i64 %arg.arr.5.0, %.133
  br i1 %exitcond.not, label %B110, label %B72, !llvm.loop !4

B110:                                             ; preds = %B72, %middle.block, %vec.epilog.middle.block, %B0.endif
  store ptr %arg.arr.0, ptr %retptr, align 8
  %retptr.repack99 = getelementptr inbounds nuw i8, ptr %retptr, i64 8
  store ptr %arg.arr.1, ptr %retptr.repack99, align 8
  %retptr.repack101 = getelementptr inbounds nuw i8, ptr %retptr, i64 16
  store i64 %arg.arr.2, ptr %retptr.repack101, align 8
  %retptr.repack103 = getelementptr inbounds nuw i8, ptr %retptr, i64 24
  store i64 %arg.arr.3, ptr %retptr.repack103, align 8
  %retptr.repack105 = getelementptr inbounds nuw i8, ptr %retptr, i64 32
  store ptr %arg.arr.4, ptr %retptr.repack105, align 8
  %retptr.repack107 = getelementptr inbounds nuw i8, ptr %retptr, i64 40
  store i64 %arg.arr.5.0, ptr %retptr.repack107, align 8
  %retptr.repack109 = getelementptr inbounds nuw i8, ptr %retptr, i64 48
  store i64 %arg.arr.6.0, ptr %retptr.repack109, align 8
  ret i32 0
}

Full LLVM IR: LLVM IR from `my_array_add` - JustPaste.it

This is why I am asking whether there is an example I can refer to. Thanks!

Any thoughts on how to debug this, or on why the segmentation fault happens?

Is the function signature correct? Or is there padding I need to apply to data?

It feels like an out-of-bounds access, but the LLVM IR is really hard to follow, especially the loop part.

NRT_MemInfo *, /*parent*/ void *, /*nitems*/ int64_t,
           /*itemsize*/ int64_t, /*data*/ int32_t *, /*shape*/ int64_t[1],
           /*strides*/ int64_t[1]

@jimlin I made a little demo here, feel free to take a look. There are actually two demos in one: calculation is compiled from a cfunc's LLVM code and then statically linked into the executable main, while another_calculation is compiled from njit's LLVM into a shared lib and loaded dynamically at main's runtime.

Thanks @milton, really appreciate it!

In your example:

another_calculation_signature = float64(float64)

@njit(another_calculation_signature)
def another_calculation(param):
    return 3.141 * param ** 2

takes a float as its input argument, but what about the case where the inputs are np.ndarray?

Something like:

@nb.njit
def my_array_add(arr: np.ndarray):
  x = arr.size
  for i in range(arr.shape[0]):
    arr[i] = arr[i] + 1345 + x
  return arr

my_array_add(np.array([1,2,3], dtype=np.int32))

Would you mind also showing me how to do it for my_array_add?

I can’t figure out why:

  int32_t data[] = {1, 2, 3};
  int64_t nitems = 3;
  int64_t itemsize = 4;
  int64_t shape[1] = {3};
  int64_t strides[1] = {4};

are not valid inputs in C++.
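For reference, the scalar values listed above are what a C-contiguous 1-D int32 array of three elements would carry, so the values themselves look consistent. The sketch below derives nitems and strides from a shape and an item size under the usual C-order (row-major) convention; `c_contiguous_fields` is a hypothetical helper name, not a Numba or NumPy function.

```python
from math import prod

def c_contiguous_fields(shape, itemsize):
    """Return (nitems, strides in bytes) for a C-contiguous array."""
    strides = []
    step = itemsize
    # The last axis is densest: its stride is the item size, and every
    # earlier axis strides over everything to its right.
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return prod(shape), tuple(reversed(strides))

one_d = c_contiguous_fields((3,), 4)    # np.int32, shape (3,)
two_d = c_contiguous_fields((3, 3), 8)  # np.float64, shape (3, 3)
print(one_d)  # (3, (4,))
print(two_d)  # (9, (24, 8))
```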

Thanks so much!

Thanks!

So I just realized something from reading more about the generated LLVM IR (https://g.co/gemini/share/ffee717658df):

I think Numba might implicitly assume a particular memory alignment, because the LLVM IR contains some SIMD vector instructions.

Could this be the reason I am seeing segmentation faults?
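One way to probe this hypothesis is to read the `align` annotation on each vector load and store, since that is the alignment the IR actually promises. The sketch below does so with a regular expression over a few lines copied from the IR above; `vector_alignments` is a hypothetical helper, and the pattern only covers integer vector accesses of the form seen here.

```python
import re

# Lines copied from the my_array_add IR earlier in the thread.
IR_LINES = """
  %wide.load = load <8 x i32>, ptr %sunkaddr22, align 4
  %wide.load10 = load <4 x i32>, ptr %scevgep13, align 4
  store <8 x i32> %2, ptr %sunkaddr22, align 4
  %.187 = load i32, ptr %scevgep, align 4
"""

# Captures the operation, lane count, lane width in bits, and declared alignment.
VECTOR_ACCESS = re.compile(r"\b(load|store) <(\d+) x i(\d+)>.*?, align (\d+)")

def vector_alignments(llvm_ir):
    """Return (op, vector size in bytes, declared alignment) per vector access."""
    result = []
    for m in VECTOR_ACCESS.finditer(llvm_ir):
        lanes, bits, align = int(m.group(2)), int(m.group(3)), int(m.group(4))
        result.append((m.group(1), lanes * bits // 8, align))
    return result

accesses = vector_alignments(IR_LINES)
for op, size, align in accesses:
    print(f"{op}: {size}-byte vector, align {align}")
```

In these lines every vector access is declared `align 4`, the natural alignment of int32, rather than 16 or 32 bytes, which suggests the vectorized loop does not by itself demand a more strictly aligned data pointer.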

Sorry for spamming, but I just figured it out!

Turns out the function signature should have been:

struct RetType {
  NRT_MemInfo *meminfo;
  void *parent;
  int64_t nitems;
  int64_t itemsize;
  int32_t *data;
  int64_t shape;    // <-- I thought this should be `int64_t[1]`
  int64_t strides;  // <-- I thought this should be `int64_t[1]`
};

RetType (*f)(NRT_MemInfo *, /*parent*/ void *, /*nitems*/ int64_t,
             /*itemsize*/ int64_t, /*data*/ int32_t *,
             /*shape*/ int64_t,   // by value, not a pointer to int64_t[1]
             /*strides*/ int64_t  // by value, not a pointer to int64_t[1]
            );

Basically, I misinterpreted { ptr, ptr, i64, i64, ptr, [1 x i64], [1 x i64] } %.1. I thought [1 x i64] meant I needed to pass a pointer to an int64_t[1].

It turns out it just means I need to pass a single int64_t by value.
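The same conclusion can be cross-checked from Python with ctypes. In the sketch below (class and variable names are hypothetical, and a 64-bit platform is assumed), a struct whose shape and strides are one-element inline arrays has the same size and field offsets as one using plain 64-bit integers, and the wrapper prototype takes seven scalar arguments.

```python
import ctypes

COMMON_FIELDS = [
    ("meminfo", ctypes.c_void_p),
    ("parent", ctypes.c_void_p),
    ("nitems", ctypes.c_int64),
    ("itemsize", ctypes.c_int64),
    ("data", ctypes.POINTER(ctypes.c_int32)),
]

class Array1D(ctypes.Structure):
    """Mirror of { ptr, ptr, i64, i64, ptr, [1 x i64], [1 x i64] }."""
    _fields_ = COMMON_FIELDS + [
        ("shape", ctypes.c_int64 * 1),    # [1 x i64], stored inline
        ("strides", ctypes.c_int64 * 1),  # [1 x i64], stored inline
    ]

class FlatArray1D(ctypes.Structure):
    """Mirror of the RetType struct, with plain int64_t members."""
    _fields_ = COMMON_FIELDS + [
        ("shape", ctypes.c_int64),
        ("strides", ctypes.c_int64),
    ]

# Prototype matching the corrected function pointer: seven scalars in,
# struct returned by value. It is only declared here, never called.
WrapperProto = ctypes.CFUNCTYPE(
    FlatArray1D,
    ctypes.c_void_p, ctypes.c_void_p, ctypes.c_int64, ctypes.c_int64,
    ctypes.POINTER(ctypes.c_int32), ctypes.c_int64, ctypes.c_int64,
)

print(ctypes.sizeof(Array1D), ctypes.sizeof(FlatArray1D))
print(Array1D.shape.offset, Array1D.strides.offset)
```

On a 64-bit platform both structs should be 56 bytes with shape at offset 40 and strides at offset 48, which lines up with the `%retptr` stores at offsets 40 and 48 in the my_array_add IR.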

Credit to Gemini's translation of the LLVM IR into C++ code, which helped me throughout the debugging process!