# Loop-vectorize debug

I see comments suggesting adding this to understand how loops are being handled by Numba, and it is also in their own FAQ (https://numba.pydata.org/numba-doc/latest/user/faq.html):

``````
from llvmlite import binding as llvm
llvm.set_option('','--debug-only=loop-vectorize')
``````

I find that this code changes nothing - there is no additional debug output from llvm indicating whether a loop is being vectorized / enhanced with SSE instructions, or if not why not.

I’ve tried this on Windows, Linux and Mac, all using Anaconda distributions of Python 3.8 and with the conda, conda-forge and numba channels.

Any ideas? Does it work for you? Is there a hidden extra step?


Hi @rhjmoore,

This works as intended locally (I’ve also been using this debug output for years and don’t recall it ever not working, this makes your case something that needs investigating!):

``````
$ cat issue7358.py
from numba import njit
import numpy as np
from llvmlite import binding as llvm
llvm.set_option('','--debug-only=loop-vectorize')

@njit
def test(a):
    for i in range(a.shape[0]):
        a[i] = a[i] * 2

print(test(np.asarray([1.0, 2.0, 3.0])))

$ python issue7358.py

LV: Checking a loop in "_ZN8__main__8test$241E5ArrayIdLi1E1C7mutable7alignedE" from test
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B18
LV: Found an induction variable.
LV: Found FP op with unsafe algebra.
LV: We can vectorize this loop!
LV: Found trip count: 0
LV: The Smallest and Widest types: 64 / 64 bits.
LV: The Widest register safe to use is: 128 bits.
LV: Found uniform instruction:   %exitcond.not = icmp eq i64 %.127, %arg.a.5.0
LV: Found uniform instruction:   %.180 = getelementptr double, double* %arg.a.4, i64 %.56.07
LV: Found uniform instruction:   %.56.07 = phi i64 [ %.127, %B18 ], [ 0, %B18.preheader ]
LV: Found uniform instruction:   %.127 = add nuw nsw i64 %.56.07, 1
LV: Found scalar instruction:   %.56.07 = phi i64 [ %.127, %B18 ], [ 0, %B18.preheader ]
LV: Found scalar instruction:   %.127 = add nuw nsw i64 %.56.07, 1
LV: Scalarizing:  %.180 = getelementptr double, double* %arg.a.4, i64 %.56.07
LV: Scalarizing:  %.181 = load double, double* %.180, align 8
LV: Scalarizing:  %.183 = fmul double %.181, 2.000000e+00
LV: Scalarizing:  store double %.183, double* %.180, align 8
LV: Scalarizing:  %.180 = getelementptr double, double* %arg.a.4, i64 %.56.07
digraph VPlan {
graph [labelloc=t, fontsize=30; label="Vectorization Plan\nInitial VPlan for VF=\{1\},UF\>=1"]
node [shape=rect, fontname=Courier, fontsize=30]
edge [fontname=Courier, fontsize=30]
compound=true
N0 [label =
"B18:\n" +
"WIDEN-INDUCTION %.56.07 = phi %.127, 0\l" +
"CLONE %.180 = getelementptr %arg.a.4, %.56.07\l" +
"CLONE %.181 = load %.180\l" +
"CLONE %.183 = fmul %.181, 2.000000e+00\l" +
"CLONE store %.183, %.180\l"
]
}
digraph VPlan {
graph [labelloc=t, fontsize=30; label="Vectorization Plan\nInitial VPlan for VF=\{2\},UF\>=1"]
node [shape=rect, fontname=Courier, fontsize=30]
edge [fontname=Courier, fontsize=30]
compound=true
N0 [label =
"B18:\n" +
"WIDEN-INDUCTION %.56.07 = phi %.127, 0\l" +
"CLONE %.180 = getelementptr %arg.a.4, %.56.07\l" +
"WIDEN %.181 = load %.180, ir<%.180>\l" +
"WIDEN\l""  %.183 = fmul %.181, 2.000000e+00\l" +
"WIDEN store %.183, %.180, ir<%.180>\l"
]
}
LV: Found an estimated cost of 0 for VF 1 For instruction:   %.56.07 = phi i64 [ %.127, %B18 ], [ 0, %B18.preheader ]
LV: Found an estimated cost of 1 for VF 1 For instruction:   %.127 = add nuw nsw i64 %.56.07, 1
LV: Found an estimated cost of 0 for VF 1 For instruction:   %.180 = getelementptr double, double* %arg.a.4, i64 %.56.07
LV: Found an estimated cost of 1 for VF 1 For instruction:   %.181 = load double, double* %.180, align 8
LV: Found an estimated cost of 2 for VF 1 For instruction:   %.183 = fmul double %.181, 2.000000e+00
LV: Found an estimated cost of 1 for VF 1 For instruction:   store double %.183, double* %.180, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction:   %exitcond.not = icmp eq i64 %.127, %arg.a.5.0
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %exitcond.not, label %B38.loopexit, label %B18
LV: Scalar loop costs: 6.
LV: Found an estimated cost of 0 for VF 2 For instruction:   %.56.07 = phi i64 [ %.127, %B18 ], [ 0, %B18.preheader ]
LV: Found an estimated cost of 1 for VF 2 For instruction:   %.127 = add nuw nsw i64 %.56.07, 1
LV: Found an estimated cost of 0 for VF 2 For instruction:   %.180 = getelementptr double, double* %arg.a.4, i64 %.56.07
LV: Found an estimated cost of 1 for VF 2 For instruction:   %.181 = load double, double* %.180, align 8
LV: Found an estimated cost of 2 for VF 2 For instruction:   %.183 = fmul double %.181, 2.000000e+00
LV: Found an estimated cost of 1 for VF 2 For instruction:   store double %.183, double* %.180, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction:   %exitcond.not = icmp eq i64 %.127, %arg.a.5.0
LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1 %exitcond.not, label %B38.loopexit, label %B18
LV: Vector loop of width 2 costs: 3.
LV: Selecting VF: 2.
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #3 Interval # 2
LV(REG): At #4 Interval # 3
LV(REG): At #6 Interval # 1
LV(REG): VF = 2
LV(REG): Found max usage: 2 item
LV(REG): RegisterClass: Generic::ScalarRC, 2 registers
LV(REG): RegisterClass: Generic::VectorRC, 1 registers
LV(REG): Found invariant usage: 0 item
LV: The target has 16 registers of Generic::ScalarRC register class
LV: The target has 16 registers of Generic::VectorRC register class
LV: Loop cost is 6
LV: Interleaving to reduce branch cost.
LV: Found a vectorizable loop (2) in test
LV: Interleave Count is 2
Setting best plan to VF=2, UF=2
LV: Interleaving disabled by the pass manager
None
``````
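For longer logs it can help to boil the output down to the decisions that matter per loop. The following is only a rough sketch, not part of Numba or llvmlite: `summarize_lv_log` and the regular expressions are hypothetical names of my own, and they assume the line formats shown in the output above (`LV: Checking a loop in ...`, `LV: Scalar loop costs: ...`, `LV: Vector loop of width ... costs: ...`, `LV: Selecting VF: ...`), which may differ between LLVM versions.

```python
import re

# Patterns assume the LV line formats seen in the output above; other LLVM
# versions may word these lines differently.
CHECK_RE = re.compile(r'^LV: Checking a loop in "(?P<func>[^"]+)" from (?P<where>.+)$')
COST_RE = re.compile(
    r'^LV: (?:Scalar loop costs|Vector loop of width (?P<width>\d+) costs): (?P<cost>\d+)'
)
VF_RE = re.compile(r'^LV: Selecting VF: (?P<vf>\d+)')


def summarize_lv_log(text):
    """Return one dict per loop with its origin, per-width costs and chosen VF."""
    loops = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = CHECK_RE.match(line)
        if match:
            # A "Checking a loop" line starts the report for a new loop.
            current = {
                "function": match["func"],
                "where": match["where"],
                "costs": {},
                "selected_vf": None,
            }
            loops.append(current)
            continue
        if current is None:
            continue
        match = COST_RE.match(line)
        if match:
            # The scalar loop is recorded as width 1.
            width = int(match["width"]) if match["width"] else 1
            current["costs"][width] = int(match["cost"])
            continue
        match = VF_RE.match(line)
        if match:
            current["selected_vf"] = int(match["vf"])
    return loops
```

Feeding it the captured debug text gives a list with one entry per `LV: Checking a loop` line, so a selected VF of 1 next to a higher vector cost shows at a glance that vectorization was judged not beneficial.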

How are you running the sample code (command line/notebook/other) and on what architecture?

Thanks for picking this up. I first ran this in a notebook and then, suspecting that Jupyter might be making it fail, saved the code to a Python file and executed it at the command line as you show above.

This is true on Windows, Linux and Mac for me, all with Anaconda installations. Do you have a recommended alternative route for getting this to work? Should I have full LLVM / Clang installed separately or something?

One more note: I found a reference somewhere on the internet (I can’t currently locate the link) to one other person having this problem. They had numba v0.28 installed, and subsequently moved to a developer build of v0.29, which fixed it.
I remember there was a brief discussion of whether this might be because of how the package / wheel was built in conda-forge, but it was never conclusive.

at Apr 17 2019 19:00

I can see you, @stuartarchibald, in the conversation!

I think I’ve got to the bottom of this, it comes down to how LLVM was built. This being the important bit:

The LLVM used by the packages from the `numba` channel is built with assertions on:

whereas the build from the conda-forge feedstock does not do this:

As a result, you will need to use an `llvmlite` that is linked against an LLVM built with assertions. One such `llvmlite`/`numba` combination is available from the Numba channel, e.g.:

``````
conda create -n my_numba_env_with_assertions -c numba numba
``````

I guess the reason I’ve never seen it fail is because I have LLVM built with assertions (for debugging!). Having just tried builds from conda-forge and the Anaconda distribution I can reproduce the lack of output against these packages.
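If it's useful, here is a rough way to check whether a given environment produces the LV output at all. It is only a sketch under a few assumptions: `PROBE`, `has_lv_output` and `probe_current_environment` are hypothetical names, the probe script is just the example from earlier in this thread, and it assumes the LLVM debug stream is written to stderr of the process.

```python
import subprocess
import sys
import textwrap

# Small probe script, based on the example earlier in this thread.
PROBE = textwrap.dedent("""
    from llvmlite import binding as llvm
    llvm.set_option('', '--debug-only=loop-vectorize')
    from numba import njit
    import numpy as np

    @njit
    def double_in_place(a):
        for i in range(a.shape[0]):
            a[i] = a[i] * 2

    double_in_place(np.arange(8.0))
""")


def has_lv_output(stderr_text):
    """Return True if any line looks like loop-vectorize debug output."""
    return any(line.startswith("LV:") for line in stderr_text.splitlines())


def probe_current_environment():
    """Run the probe in a fresh interpreter of this environment and check stderr."""
    result = subprocess.run(
        [sys.executable, "-c", PROBE],
        capture_output=True,
        text=True,
    )
    return has_lv_output(result.stderr)
```

Calling `probe_current_environment()` from inside the environment you want to test should return `True` when the linked LLVM emits the debug lines and `False` otherwise (including when Numba is not importable there).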

I think the Numba docs could do with a fix to note this.

Hope this helps.

Thanks, that makes sense. I’ll try and create a new environment with the versions of numba, llvm etc. and confirm whether this fixes it for me.

For the documentation update do you want to re-open my original issue to link a patch against or shall I open a new issue?

Great, thanks for testing it.

I’ve reopened the original issue with a note about what the problem is, a patch to the documentation would be welcomed. Thanks!

Confirmed, that works.

Great, thanks for confirming, glad this is resolved.

@stuartarchibald
Sorry, I have one more question on the output of this now that I have it working:
Is there a way in which one can link each set of output about a loop to individual loops in your function if you have multiple loops?

Yes, but it’s a bit involved…

Setting `debug=True` in the `@jit` decorators means that `DWARF` info is emitted and this links the Python source to the generated LLVM through `DILocation` entries.

Example:

``````
from llvmlite import binding as llvm
llvm.set_option('','--debug-only=loop-vectorize')

from numba import njit
import numpy as np

@njit(fastmath=True, debug=True)  # debug=True emits the DWARF info
def foo(n):
    acc = 0.0
    acc2 = 0.0

    for i in range(n):
        acc += np.sqrt(i) * np.sin(i)

    for i in range(n - 1):
        acc -= np.sqrt(i + 2) * np.cos(i)

    return acc + acc2

foo(1)  # trigger compilation for an integer argument

print(foo.inspect_llvm(foo.signatures[0]))
``````

This produces two sets of output.

1. the loop vectorize debug info (LV debug).
``````
LV: Checking a loop in "_ZN8__main__7foo$241Ex" from <elided>.py:12:1
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: for.end
LV: Found an induction variable.
LV: We can vectorize this loop!
LV: Found trip count: 0
LV: The Smallest and Widest types: 64 / 64 bits.
LV: The Widest register safe to use is: 128 bits.
LV: Found uniform instruction:   %exitcond33.not = icmp eq i64 %.136, %arg.n, !dbg !19
LV: Scalarizing:  %.201.le = sitofp i64 %.57.028 to double, !dbg !20
LV: Scalarizing:  %.244 = fmul fast double %.202.le, %.230.le, !dbg !20
LV: Scalarizing:  %.254 = fadd fast double %.244, %acc.3.030, !dbg !20
LV: Scalarizing:  %.202.le = tail call fast double @sqrt(double %.201.le), !dbg !20
digraph VPlan {
graph [labelloc=t, fontsize=30; label="Vectorization Plan\nInitial VPlan for VF=\{1\},UF\>=1"]
node [shape=rect, fontname=Courier, fontsize=30]
edge [fontname=Courier, fontsize=30]
compound=true
N0 [label =
"for.end:\n" +
"WIDEN-PHI %acc.3.030 = phi %.254, 0.000000e+00\l" +
"WIDEN-INDUCTION %.57.028 = phi %.136, 0\l" +
"CLONE %.201.le = sitofp %.57.028\l" +
"WIDEN-CALL %.202.le = call %.201.le, @sqrt\l" +
"WIDEN-CALL %.230.le = call %.201.le, @llvm.sin.f64\l" +
"CLONE %.244 = fmul %.202.le, %.230.le\l" +
"CLONE %.254 = fadd %.244, %acc.3.030\l"
]
}
digraph VPlan {
graph [labelloc=t, fontsize=30; label="Vectorization Plan\nInitial VPlan for VF=\{2\},UF\>=1"]
node [shape=rect, fontname=Courier, fontsize=30]
edge [fontname=Courier, fontsize=30]
compound=true
N0 [label =
"for.end:\n" +
"WIDEN-PHI %acc.3.030 = phi %.254, 0.000000e+00\l" +
"WIDEN-INDUCTION %.57.028 = phi %.136, 0\l" +
"WIDEN\l""  %.201.le = sitofp %.57.028\l" +
"REPLICATE %.202.le = call %.201.le, @sqrt\l" +
"WIDEN-CALL %.230.le = call %.201.le, @llvm.sin.f64\l" +
"WIDEN\l""  %.244 = fmul %.202.le, %.230.le\l" +
"WIDEN\l""  %.254 = fadd %.244, %acc.3.030\l"
]
}
LV: Found an estimated cost of 0 for VF 1 For instruction:   %acc.3.030 = phi double [ %.254, %for.end ], [ 0.000000e+00, %for.end.preheader ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %.57.028 = phi i64 [ %.136, %for.end ], [ 0, %for.end.preheader ]
LV: Found an estimated cost of 1 for VF 1 For instruction:   %.136 = add nuw nsw i64 %.57.028, 1, !dbg !19
LV: Found an estimated cost of 1 for VF 1 For instruction:   %.201.le = sitofp i64 %.57.028 to double, !dbg !20
LV: Found an estimated cost of 10 for VF 1 For instruction:   %.202.le = tail call fast double @sqrt(double %.201.le), !dbg !20
LV: Found an estimated cost of 10 for VF 1 For instruction:   %.230.le = tail call fast double @llvm.sin.f64(double %.201.le), !dbg !20
LV: Found an estimated cost of 2 for VF 1 For instruction:   %.244 = fmul fast double %.202.le, %.230.le, !dbg !20
LV: Found an estimated cost of 2 for VF 1 For instruction:   %.254 = fadd fast double %.244, %acc.3.030, !dbg !20
LV: Found an estimated cost of 1 for VF 1 For instruction:   %exitcond33.not = icmp eq i64 %.136, %arg.n, !dbg !19
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %exitcond33.not, label %B48, label %for.end, !dbg !19
LV: Scalar loop costs: 27.
LV: Found an estimated cost of 0 for VF 2 For instruction:   %acc.3.030 = phi double [ %.254, %for.end ], [ 0.000000e+00, %for.end.preheader ]
LV: Found an estimated cost of 0 for VF 2 For instruction:   %.57.028 = phi i64 [ %.136, %for.end ], [ 0, %for.end.preheader ]
LV: Found an estimated cost of 1 for VF 2 For instruction:   %.136 = add nuw nsw i64 %.57.028, 1, !dbg !19
LV: Found an estimated cost of 20 for VF 2 For instruction:   %.201.le = sitofp i64 %.57.028 to double, !dbg !20
LV: Found an estimated cost of 22 for VF 2 For instruction:   %.202.le = tail call fast double @sqrt(double %.201.le), !dbg !20
LV: Found an estimated cost of 22 for VF 2 For instruction:   %.230.le = tail call fast double @llvm.sin.f64(double %.201.le), !dbg !20
LV: Found an estimated cost of 2 for VF 2 For instruction:   %.244 = fmul fast double %.202.le, %.230.le, !dbg !20
LV: Found an estimated cost of 2 for VF 2 For instruction:   %.254 = fadd fast double %.244, %acc.3.030, !dbg !20
LV: Found an estimated cost of 1 for VF 2 For instruction:   %exitcond33.not = icmp eq i64 %.136, %arg.n, !dbg !19
LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1 %exitcond33.not, label %B48, label %for.end, !dbg !19
LV: Vector loop of width 2 costs: 35.
LV: Selecting VF: 1.
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #3 Interval # 3
LV(REG): At #4 Interval # 3
LV(REG): At #5 Interval # 4
LV(REG): At #6 Interval # 4
LV(REG): At #7 Interval # 3
LV(REG): At #8 Interval # 2
LV(REG): VF = 1
LV(REG): Found max usage: 1 item
LV(REG): RegisterClass: Generic::ScalarRC, 4 registers
LV(REG): Found invariant usage: 0 item
LV: The target has 16 registers of Generic::ScalarRC register class
LV: Loop cost is 27
LV: Not Interleaving.
LV: Vectorization is possible but not beneficial.
LV: Interleaving is not beneficial.

LV: Checking a loop in "_ZN8__main__7foo$241Ex" from <elided>.py:15:1
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B62
LV: Found an induction variable.
LV: We can vectorize this loop!
LV: Found trip count: 0
LV: The Smallest and Widest types: 64 / 64 bits.
LV: The Widest register safe to use is: 128 bits.
LV: Found uniform instruction:   %exitcond.not = icmp eq i64 %.411, %spec.select23, !dbg !21
LV: Scalarizing:  %.473 = add nuw nsw i64 %.334.025, 2, !dbg !22
LV: Scalarizing:  %.487.le = sitofp i64 %.473 to double, !dbg !22
LV: Scalarizing:  %.517.le = sitofp i64 %.334.025 to double, !dbg !22
LV: Scalarizing:  %.532 = fmul fast double %.488.le, %.518.le, !dbg !22
LV: Scalarizing:  %.542 = fsub fast double %acc.4.027, %.532, !dbg !22
LV: Scalarizing:  %.488.le = tail call fast double @sqrt(double %.487.le), !dbg !22
digraph VPlan {
graph [labelloc=t, fontsize=30; label="Vectorization Plan\nInitial VPlan for VF=\{1\},UF\>=1"]
node [shape=rect, fontname=Courier, fontsize=30]
edge [fontname=Courier, fontsize=30]
compound=true
N0 [label =
"B62:\n" +
"WIDEN-PHI %acc.4.027 = phi %.542, %.254.lcssa\l" +
"WIDEN-INDUCTION %.334.025 = phi %.411, 0\l" +
"CLONE %.473 = add %.334.025, 2\l" +
"CLONE %.487.le = sitofp %.473\l" +
"WIDEN-CALL %.488.le = call %.487.le, @sqrt\l" +
"CLONE %.517.le = sitofp %.334.025\l" +
"WIDEN-CALL %.518.le = call %.517.le, @llvm.cos.f64\l" +
"CLONE %.532 = fmul %.488.le, %.518.le\l" +
"CLONE %.542 = fsub %acc.4.027, %.532\l"
]
}
digraph VPlan {
graph [labelloc=t, fontsize=30; label="Vectorization Plan\nInitial VPlan for VF=\{2\},UF\>=1"]
node [shape=rect, fontname=Courier, fontsize=30]
edge [fontname=Courier, fontsize=30]
compound=true
N0 [label =
"B62:\n" +
"WIDEN-PHI %acc.4.027 = phi %.542, %.254.lcssa\l" +
"WIDEN-INDUCTION %.334.025 = phi %.411, 0\l" +
"WIDEN\l""  %.473 = add %.334.025, 2\l" +
"WIDEN\l""  %.487.le = sitofp %.473\l" +
"REPLICATE %.488.le = call %.487.le, @sqrt\l" +
"WIDEN\l""  %.517.le = sitofp %.334.025\l" +
"WIDEN-CALL %.518.le = call %.517.le, @llvm.cos.f64\l" +
"WIDEN\l""  %.532 = fmul %.488.le, %.518.le\l" +
"WIDEN\l""  %.542 = fsub %acc.4.027, %.532\l"
]
}
LV: Found an estimated cost of 0 for VF 1 For instruction:   %acc.4.027 = phi double [ %.542, %B62 ], [ %.254.lcssa, %B62.preheader ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %.334.025 = phi i64 [ %.411, %B62 ], [ 0, %B62.preheader ]
LV: Found an estimated cost of 1 for VF 1 For instruction:   %.411 = add nuw nsw i64 %.334.025, 1, !dbg !21
LV: Found an estimated cost of 1 for VF 1 For instruction:   %.473 = add nuw nsw i64 %.334.025, 2, !dbg !22
LV: Found an estimated cost of 1 for VF 1 For instruction:   %.487.le = sitofp i64 %.473 to double, !dbg !22
LV: Found an estimated cost of 10 for VF 1 For instruction:   %.488.le = tail call fast double @sqrt(double %.487.le), !dbg !22
LV: Found an estimated cost of 1 for VF 1 For instruction:   %.517.le = sitofp i64 %.334.025 to double, !dbg !22
LV: Found an estimated cost of 10 for VF 1 For instruction:   %.518.le = tail call fast double @llvm.cos.f64(double %.517.le), !dbg !22
LV: Found an estimated cost of 2 for VF 1 For instruction:   %.532 = fmul fast double %.488.le, %.518.le, !dbg !22
LV: Found an estimated cost of 2 for VF 1 For instruction:   %.542 = fsub fast double %acc.4.027, %.532, !dbg !22
LV: Found an estimated cost of 1 for VF 1 For instruction:   %exitcond.not = icmp eq i64 %.411, %spec.select23, !dbg !21
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %exitcond.not, label %B94.loopexit, label %B62, !dbg !21
LV: Scalar loop costs: 29.
LV: Found an estimated cost of 0 for VF 2 For instruction:   %acc.4.027 = phi double [ %.542, %B62 ], [ %.254.lcssa, %B62.preheader ]
LV: Found an estimated cost of 0 for VF 2 For instruction:   %.334.025 = phi i64 [ %.411, %B62 ], [ 0, %B62.preheader ]
LV: Found an estimated cost of 1 for VF 2 For instruction:   %.411 = add nuw nsw i64 %.334.025, 1, !dbg !21
LV: Found an estimated cost of 1 for VF 2 For instruction:   %.473 = add nuw nsw i64 %.334.025, 2, !dbg !22
LV: Found an estimated cost of 20 for VF 2 For instruction:   %.487.le = sitofp i64 %.473 to double, !dbg !22
LV: Found an estimated cost of 22 for VF 2 For instruction:   %.488.le = tail call fast double @sqrt(double %.487.le), !dbg !22
LV: Found an estimated cost of 20 for VF 2 For instruction:   %.517.le = sitofp i64 %.334.025 to double, !dbg !22
LV: Found an estimated cost of 22 for VF 2 For instruction:   %.518.le = tail call fast double @llvm.cos.f64(double %.517.le), !dbg !22
LV: Found an estimated cost of 2 for VF 2 For instruction:   %.532 = fmul fast double %.488.le, %.518.le, !dbg !22
LV: Found an estimated cost of 2 for VF 2 For instruction:   %.542 = fsub fast double %acc.4.027, %.532, !dbg !22
LV: Found an estimated cost of 1 for VF 2 For instruction:   %exitcond.not = icmp eq i64 %.411, %spec.select23, !dbg !21
LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1 %exitcond.not, label %B94.loopexit, label %B62, !dbg !21
LV: Vector loop of width 2 costs: 45.
LV: Selecting VF: 1.
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #3 Interval # 3
LV(REG): At #4 Interval # 4
LV(REG): At #5 Interval # 4
LV(REG): At #6 Interval # 4
LV(REG): At #7 Interval # 4
LV(REG): At #8 Interval # 4
LV(REG): At #9 Interval # 3
LV(REG): At #10 Interval # 2
LV(REG): VF = 1
LV(REG): Found max usage: 1 item
LV(REG): RegisterClass: Generic::ScalarRC, 4 registers
LV(REG): Found invariant usage: 1 item
LV(REG): RegisterClass: Generic::ScalarRC, 2 registers
LV: The target has 16 registers of Generic::ScalarRC register class
LV: Loop cost is 29
LV: Not Interleaving.
LV: Vectorization is possible but not beneficial.
LV: Interleaving is not beneficial.

``````
2. the LLVM IR from the module that Numba generated.
``````
; ModuleID = 'foo'
source_filename = "<string>"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@"_ZN08NumbaEnv8__main__7foo$241Ex" = common local_unnamed_addr global i8* null
@.const.foo = internal constant [4 x i8] c"foo\00"
@PyExc_RuntimeError = external global i8
@".const.missing Environment: _ZN08NumbaEnv8__main__7foo$241Ex" = internal constant [54 x i8] c"missing Environment: _ZN08NumbaEnv8__main__7foo$241Ex\00"

; Function Attrs: nofree noinline nounwind
define i32 @"_ZN8__main__7foo$241Ex"(double* noalias nocapture %retptr, { i8*, i32, i8* }** noalias nocapture readnone %excinfo, i64 %arg.n) local_unnamed_addr #0 !dbg !4 {
entry:
%.71.inv = icmp sgt i64 %arg.n, 0
br i1 %.71.inv, label %for.end.preheader, label %B94, !dbg !19

%0 = add i64 %arg.n, -1, !dbg !19
%xtraiter39 = and i64 %arg.n, 3, !dbg !19
%1 = icmp ult i64 %0, 3, !dbg !19
br i1 %1, label %B48.unr-lcssa, label %for.end.preheader.new, !dbg !19

%unroll_iter42 = and i64 %arg.n, -4, !dbg !19
br label %for.end, !dbg !19

B48.unr-lcssa:                                    ; preds = %for.end, %for.end.preheader
%.254.lcssa.ph = phi double [ undef, %for.end.preheader ], [ %.254.3, %for.end ]
%acc.3.030.unr = phi double [ 0.000000e+00, %for.end.preheader ], [ %.254.3, %for.end ]
%.57.028.unr = phi i64 [ 0, %for.end.preheader ], [ %.136.3, %for.end ]
%lcmp.mod40.not = icmp eq i64 %xtraiter39, 0, !dbg !19
br i1 %lcmp.mod40.not, label %B48, label %for.end.epil.preheader, !dbg !19

br label %for.end.epil, !dbg !19

for.end.epil:                                     ; preds = %for.end.epil.preheader, %for.end.epil
%acc.3.030.epil = phi double [ %.254.epil, %for.end.epil ], [ %acc.3.030.unr, %for.end.epil.preheader ]
%.57.028.epil = phi i64 [ %.136.epil, %for.end.epil ], [ %.57.028.unr, %for.end.epil.preheader ]
%epil.iter = phi i64 [ %epil.iter.sub, %for.end.epil ], [ %xtraiter39, %for.end.epil.preheader ]
%.136.epil = add nuw nsw i64 %.57.028.epil, 1, !dbg !19
%.201.le.epil = sitofp i64 %.57.028.epil to double, !dbg !20
%.202.le.epil = tail call fast double @sqrt(double %.201.le.epil), !dbg !20
%.230.le.epil = tail call fast double @llvm.sin.f64(double %.201.le.epil), !dbg !20
%.244.epil = fmul fast double %.202.le.epil, %.230.le.epil, !dbg !20
%.254.epil = fadd fast double %.244.epil, %acc.3.030.epil, !dbg !20
%epil.iter.sub = add i64 %epil.iter, -1, !dbg !19
%epil.iter.cmp.not = icmp eq i64 %epil.iter.sub, 0, !dbg !19
br i1 %epil.iter.cmp.not, label %B48, label %for.end.epil, !dbg !19, !llvm.loop !21

B48:                                              ; preds = %for.end.epil, %B48.unr-lcssa
%.254.lcssa = phi double [ %.254.lcssa.ph, %B48.unr-lcssa ], [ %.254.epil, %for.end.epil ], !dbg !20
%.348 = icmp slt i64 %arg.n, 2, !dbg !23
%.294 = add nsw i64 %arg.n, -1, !dbg !23
%spec.select23 = select i1 %.348, i64 0, i64 %.294
%.39824 = icmp sgt i64 %spec.select23, 0, !dbg !23
br i1 %.39824, label %B62.preheader, label %B94, !dbg !23

%2 = icmp eq i64 %spec.select23, 1, !dbg !23
br i1 %2, label %B94.loopexit.unr-lcssa, label %B62.preheader.new, !dbg !23

%unroll_iter = and i64 %spec.select23, -2, !dbg !23
br label %B62, !dbg !23

B62:                                              ; preds = %B62, %B62.preheader.new
%acc.4.027 = phi double [ %.254.lcssa, %B62.preheader.new ], [ %.542.1, %B62 ]
%.334.025 = phi i64 [ 0, %B62.preheader.new ], [ %.411.1, %B62 ]
%3 = add i64 %.334.025, 1, !dbg !24
%4 = add i64 %.334.025, 2, !dbg !24
%.487.le = sitofp i64 %4 to double, !dbg !24
%.488.le = tail call fast double @sqrt(double %.487.le), !dbg !24
%.517.le = sitofp i64 %.334.025 to double, !dbg !24
%.518.le = tail call fast double @llvm.cos.f64(double %.517.le), !dbg !24
%.532 = fmul fast double %.488.le, %.518.le, !dbg !24
%.411.1 = add nuw nsw i64 %.334.025, 2, !dbg !23
%5 = add i64 %.334.025, 3, !dbg !24
%.487.le.1 = sitofp i64 %5 to double, !dbg !24
%.488.le.1 = tail call fast double @sqrt(double %.487.le.1), !dbg !24
%.517.le.1 = sitofp i64 %3 to double, !dbg !24
%.518.le.1 = tail call fast double @llvm.cos.f64(double %.517.le.1), !dbg !24
%.532.1 = fmul fast double %.488.le.1, %.518.le.1, !dbg !24
%6 = fadd fast double %.532, %.532.1, !dbg !24
%.542.1 = fsub fast double %acc.4.027, %6, !dbg !24
%niter.ncmp.1 = icmp eq i64 %unroll_iter, %.411.1, !dbg !23
br i1 %niter.ncmp.1, label %B94.loopexit.unr-lcssa, label %B62, !dbg !23

B94.loopexit.unr-lcssa:                           ; preds = %B62, %B62.preheader
%.542.lcssa.ph = phi double [ undef, %B62.preheader ], [ %.542.1, %B62 ]
%acc.4.027.unr = phi double [ %.254.lcssa, %B62.preheader ], [ %.542.1, %B62 ]
%.334.025.unr = phi i64 [ 0, %B62.preheader ], [ %.411.1, %B62 ]
%7 = and i64 %spec.select23, 1, !dbg !23
%lcmp.mod.not = icmp eq i64 %7, 0, !dbg !23
br i1 %lcmp.mod.not, label %B94, label %B94.loopexit.epilog-lcssa, !dbg !23

B94.loopexit.epilog-lcssa:                        ; preds = %B94.loopexit.unr-lcssa
%.473.epil = add nuw nsw i64 %.334.025.unr, 2, !dbg !24
%.487.le.epil = sitofp i64 %.473.epil to double, !dbg !24
%.488.le.epil = tail call fast double @sqrt(double %.487.le.epil), !dbg !24
%.517.le.epil = sitofp i64 %.334.025.unr to double, !dbg !24
%.518.le.epil = tail call fast double @llvm.cos.f64(double %.517.le.epil), !dbg !24
%.532.epil = fmul fast double %.488.le.epil, %.518.le.epil, !dbg !24
%.542.epil = fsub fast double %acc.4.027.unr, %.532.epil, !dbg !24
br label %B94, !dbg !25

B94:                                              ; preds = %B94.loopexit.epilog-lcssa, %B94.loopexit.unr-lcssa, %entry, %B48
%acc.4.0.lcssa = phi double [ %.254.lcssa, %B48 ], [ 0.000000e+00, %entry ], [ %.542.lcssa.ph, %B94.loopexit.unr-lcssa ], [ %.542.epil, %B94.loopexit.epilog-lcssa ], !dbg !9
store double %acc.4.0.lcssa, double* %retptr, align 8, !dbg !25
ret i32 0, !dbg !25

for.end:                                          ; preds = %for.end, %for.end.preheader.new
%acc.3.030 = phi double [ 0.000000e+00, %for.end.preheader.new ], [ %.254.3, %for.end ]
%.57.028 = phi i64 [ 0, %for.end.preheader.new ], [ %11, %for.end ]
%.201.le = sitofp i64 %.57.028 to double, !dbg !20
%.202.le = tail call fast double @sqrt(double %.201.le), !dbg !20
%.230.le = tail call fast double @llvm.sin.f64(double %.201.le), !dbg !20
%.244 = fmul fast double %.202.le, %.230.le, !dbg !20
%.254 = fadd fast double %.244, %acc.3.030, !dbg !20
%8 = add i64 %.57.028, 1, !dbg !20
%.201.le.1 = sitofp i64 %8 to double, !dbg !20
%.202.le.1 = tail call fast double @sqrt(double %.201.le.1), !dbg !20
%.230.le.1 = tail call fast double @llvm.sin.f64(double %.201.le.1), !dbg !20
%.244.1 = fmul fast double %.202.le.1, %.230.le.1, !dbg !20
%.254.1 = fadd fast double %.244.1, %.254, !dbg !20
%9 = add i64 %8, 1, !dbg !20
%.201.le.2 = sitofp i64 %9 to double, !dbg !20
%.202.le.2 = tail call fast double @sqrt(double %.201.le.2), !dbg !20
%.230.le.2 = tail call fast double @llvm.sin.f64(double %.201.le.2), !dbg !20
%.244.2 = fmul fast double %.202.le.2, %.230.le.2, !dbg !20
%.254.2 = fadd fast double %.244.2, %.254.1, !dbg !20
%.136.3 = add nuw nsw i64 %.57.028, 4, !dbg !19
%10 = add i64 %9, 1, !dbg !20
%.201.le.3 = sitofp i64 %10 to double, !dbg !20
%.202.le.3 = tail call fast double @sqrt(double %.201.le.3), !dbg !20
%.230.le.3 = tail call fast double @llvm.sin.f64(double %.201.le.3), !dbg !20
%.244.3 = fmul fast double %.202.le.3, %.230.le.3, !dbg !20
%.254.3 = fadd fast double %.244.3, %.254.2, !dbg !20
%niter43.ncmp.3 = icmp eq i64 %unroll_iter42, %.136.3, !dbg !19
%11 = add i64 %10, 1, !dbg !19
br i1 %niter43.ncmp.3, label %B48.unr-lcssa, label %for.end, !dbg !19
}

; Function Attrs: nounwind readnone speculatable willreturn

; Function Attrs: nofree nounwind readonly

; Function Attrs: nounwind readnone speculatable willreturn
declare double @llvm.sin.f64(double) #1

; Function Attrs: nounwind readnone speculatable willreturn
declare double @llvm.cos.f64(double) #1

entry:
%.5 = alloca i8*, align 8
%.6 = call i32 (i8*, i8*, i64, i64, ...) @PyArg_UnpackTuple(i8* %py_args, i8* getelementptr inbounds ([4 x i8], [4 x i8]* @.const.foo, i64 0, i64 0), i64 1, i64 1, i8** nonnull %.5)
%.7 = icmp eq i32 %.6, 0
%.36 = alloca double, align 8
store double 0.000000e+00, double* %.36, align 8
br i1 %.7, label %entry.if, label %entry.endif, !prof !26

entry.if:                                         ; preds = %entry.endif.endif.endif, %entry
ret i8* null

entry.endif:                                      ; preds = %entry
%.11 = load i8*, i8** @"_ZN08NumbaEnv8__main__7foo$241Ex", align 8
%.16 = icmp eq i8* %.11, null
br i1 %.16, label %entry.endif.if, label %entry.endif.endif, !prof !26

entry.endif.if:                                   ; preds = %entry.endif
call void @PyErr_SetString(i8* nonnull @PyExc_RuntimeError, i8* getelementptr inbounds ([54 x i8], [54 x i8]* @".const.missing Environment: _ZN08NumbaEnv8__main__7foo$241Ex", i64 0, i64 0))
ret i8* null

entry.endif.endif:                                ; preds = %entry.endif
%.20 = load i8*, i8** %.5, align 8
%.23 = call i8* @PyNumber_Long(i8* %.20)
%.24.not = icmp eq i8* %.23, null
br i1 %.24.not, label %entry.endif.endif.endif, label %entry.endif.endif.if, !prof !26

entry.endif.endif.if:                             ; preds = %entry.endif.endif
%.26 = call i64 @PyLong_AsLongLong(i8* nonnull %.23)
call void @Py_DecRef(i8* nonnull %.23)
br label %entry.endif.endif.endif

entry.endif.endif.endif:                          ; preds = %entry.endif.endif, %entry.endif.endif.if
%.21.0 = phi i64 [ %.26, %entry.endif.endif.if ], [ 0, %entry.endif.endif ]
%.31 = call i8* @PyErr_Occurred()
%.32.not = icmp eq i8* %.31, null
br i1 %.32.not, label %entry.endif.endif.endif.endif, label %entry.if, !prof !27

entry.endif.endif.endif.endif:                    ; preds = %entry.endif.endif.endif
store double 0.000000e+00, double* %.36, align 8
%.40 = call i32 @"_ZN8__main__7foo$241Ex"(double* nonnull %.36, { i8*, i32, i8* }** undef, i64 %.21.0) #5
%.50 = load double, double* %.36, align 8
%.55 = call i8* @PyFloat_FromDouble(double %.50)
ret i8* %.55
}

declare i32 @PyArg_UnpackTuple(i8*, i8*, i64, i64, ...) local_unnamed_addr

; Function Attrs: nofree nounwind
define double @"cfunc._ZN8__main__7foo$241Ex"(i64 %.1) local_unnamed_addr #3 {
entry:
%.3 = alloca double, align 8
store double 0.000000e+00, double* %.3, align 8
%.7 = call i32 @"_ZN8__main__7foo$241Ex"(double* nonnull %.3, { i8*, i32, i8* }** undef, i64 %.1) #5
%.17 = load double, double* %.3, align 8
ret double %.17
}

; Function Attrs: nounwind
declare void @llvm.stackprotector(i8*, i8**) #4

attributes #0 = { nofree noinline nounwind }
attributes #1 = { nounwind readnone speculatable willreturn }
attributes #2 = { nofree nounwind readonly }
attributes #3 = { nofree nounwind }
attributes #4 = { nounwind }
attributes #5 = { noinline }

!llvm.dbg.cu = !{!0}
!llvm.module.flags = !{!2, !3}

!0 = distinct !DICompileUnit(language: DW_LANG_C_plus_plus, file: !1, producer: "Numba", isOptimized: true, runtimeVersion: 0, emissionKind: FullDebug)
!1 = !DIFile(filename: "<elided>.py", directory: "<elided>")
!2 = !{i32 2, !"Dwarf Version", i32 4}
!3 = !{i32 2, !"Debug Info Version", i32 3}
!4 = distinct !DISubprogram(name: "foo", linkageName: "_ZN8__main__7foo$241Ex", scope: !1, file: !1, line: 7, type: !5, scopeLine: 7, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0)
!5 = !DISubroutineType(types: !6)
!6 = !{}
!7 = !DILocalVariable(name: "n", scope: !4, file: !1, line: 7, type: !8)
!8 = !DIBasicType(name: "i64", size: 64, encoding: DW_ATE_unsigned)
!9 = !DILocation(line: 0, scope: !4)
!10 = !DILocalVariable(name: "acc", scope: !4, file: !1, line: 9, type: !11)
!11 = !DIBasicType(name: "double", size: 64, encoding: DW_ATE_float)
!12 = !DILocalVariable(name: "acc$3", scope: !4, file: !1, line: 9, type: !11)
!13 = !DILocalVariable(name: "acc2", scope: !4, file: !1, line: 10, type: !11)
!14 = !DILocalVariable(name: "i", scope: !4, file: !1, line: 12, type: !8)
!15 = !DILocalVariable(name: "acc$1", scope: !4, file: !1, line: 13, type: !11)
!16 = !DILocalVariable(name: "acc$4", scope: !4, file: !1, line: 12, type: !11)
!17 = !DILocalVariable(name: "i$1", scope: !4, file: !1, line: 15, type: !8)
!18 = !DILocalVariable(name: "acc$2", scope: !4, file: !1, line: 16, type: !11)
!19 = !DILocation(line: 12, column: 1, scope: !4)
!20 = !DILocation(line: 13, column: 1, scope: !4)
!21 = distinct !{!21, !22}
!22 = !{!"llvm.loop.unroll.disable"}
!23 = !DILocation(line: 15, column: 1, scope: !4)
!24 = !DILocation(line: 16, column: 1, scope: !4)
!25 = !DILocation(line: 18, column: 1, scope: !4)
!26 = !{!"branch_weights", i32 1, i32 99}
!27 = !{!"branch_weights", i32 99, i32 1}
``````

In the above, the LV debug output has e.g. `LV: Found a loop: for.end`. If you look in the LLVM IR there are labels like `for.end.preheader:` and `for.end.epil:`, which are derived from the original loop label `for.end`. If you then look at the `!dbg` markers on the instructions in the associated blocks you’ll see `!19` and `!20` (the `!21` on the `for.end.epil` branch is `!llvm.loop` metadata, not a debug location). Referring to the metadata section at the bottom of the LLVM IR, `!19` and `!20` are `DILocation`s and show that this loop is from Python source lines 12 and 13.

``````!19 = !DILocation(line: 12, column: 1, scope: !4)
!20 = !DILocation(line: 13, column: 1, scope: !4)
``````

which correspond to:

``````
    for i in range(n):
        acc += np.sqrt(i) * np.sin(i)
``````
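The manual lookup described above can be roughly automated with a bit of text processing. This is just a sketch and not a Numba feature: `source_lines_by_block` and the regular expressions are hypothetical names, and the parsing assumes the textual IR layout shown above (block labels ending in `:`, `!dbg !N` markers on instructions, and `!N = !DILocation(line: ...)` metadata entries).

```python
import re

# Patterns assume the textual LLVM IR layout shown in the listing above.
DILOC_RE = re.compile(r'^!(\d+) = !DILocation\(line: (\d+), column: (\d+)')
DBG_RE = re.compile(r'!dbg !(\d+)')
LABEL_RE = re.compile(r'^([A-Za-z0-9_.$-]+):')


def source_lines_by_block(ir_text):
    """Map each basic block label to the sorted Python source lines it references."""
    lines = [line.strip() for line in ir_text.splitlines()]

    # First pass: collect metadata id -> source line from the DILocation entries.
    dilocs = {}
    for line in lines:
        match = DILOC_RE.match(line)
        if match:
            dilocs[int(match[1])] = int(match[2])

    # Second pass: walk the blocks and resolve each !dbg reference.
    blocks = {}
    current = None
    for line in lines:
        match = LABEL_RE.match(line)
        if match:
            current = match[1]
            blocks.setdefault(current, set())
            continue
        if line == "}":
            current = None  # end of a function body
        if current is None:
            continue
        for ref in DBG_RE.findall(line):
            source_line = dilocs.get(int(ref), 0)
            if source_line:  # skip unknown ids and the "line: 0" placeholder
                blocks[current].add(source_line)
    return {label: sorted(found) for label, found in blocks.items()}
```

With the example above, passing `foo.inspect_llvm(foo.signatures[0])` to this function and looking up the labels derived from `for.end` should point back at source lines 12 and 13.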

Presenting this sort of thing in an easy to use manner is something that we hope to get working one day!

Hope this helps?!