Compilation timing information

hi everyone, as mentioned on the weekly call, I’m working on measuring compilation time.

There’s a part that puzzles me, I get these pieces of information:

  • most of the time is spent in type inference (but I suspect that’s because compilation of called functions is included there)

  • {'llvm_lock': 84.81026899999961, 'compiler_lock': 138.9255492}

  • Printing pass timings for JITCodeLibrary('run_cg_fc') Total time: 16.8268

So, if LLVM took 16 seconds, what happened in the other 84 seconds where there was an llvm_lock? What actions are included in the llvm_lock but not in the pass timings?

It’s also puzzling that cProfiler says that 75% of the time is spent in the method get_call_type of dispatcher type (2812 individual calls). However, get_call_template which is part of get_call_type is only called 2 times. I don’t know what to make of it.

Luk

Can you include some sample code / commands that shows how you’re generating this info? (It doesn’t have to be the whole code that you’re compiling here, I just want to be able to see how you’re generating the info to try and explain the output)

sorry, I should have included that to begin with.
The main function is called run_cg_fc, and from that dispatcher I get the timers:

print(run_cg_fc.get_metadata()[tuple(parameter_types.values())]['timers'])
[out] {'llvm_lock': 84.81026899999961, 'compiler_lock': 138.9255492}

The second piece of information needs setting first a config variable, and then it can be retrieved as follows:

# top of the file
from numba.core import config
config.LLVM_PASS_TIMINGS = 1
###
.... rest of the file ....
###
# pass timings
print(run_cg_funcs.get_metadata()[tuple(parameter_types.values())]['llvm_pass_timings'])

[out]
Printing pass timings for JITCodeLibrary('run_cg_fc')
Total time: 16.8268
== #0 Function passes on '_ZN8__main__14run_cg_fc$<VERY LONG STRING HERE>'
 Percent: 27.1%
 Total 4.5661s
 Top timings:
   4.4474s ( 97.4%) SROA #31
   0.0502s (  1.1%) Early CSE #777
   0.0274s (  0.6%) Debug Variable Analysis #244
   0.0137s (  0.3%) Prune NRT refops #778
   0.0091s (  0.2%) Post-Dominator Tree Construction #1435
== #1 Function passes on '_ZN7cpython8__main__14run_cg_fc<VERY LONG STRING HERE>'
 Percent: 0.6%
 Total 0.1063s
 Top timings:
   0.0942s ( 88.6%) SROA #25
   0.0047s (  4.4%) Early CSE #778
   0.0027s (  2.5%) Simplify the CFG #301
   0.0014s (  1.3%) Post-Dominator Tree Construction #1436
   0.0012s (  1.1%) Dominator Tree Construction #3155
== #2 Function passes on 'cfunc._ZN8__main__14run_cg_fc$242 <VERY LONG STRING HERE> '
 Percent: 0.0%
 Total 0.0008s
 Top timings:
   0.0005s ( 59.2%) Early CSE #779
   0.0002s ( 21.3%) Simplify the CFG #302
   0.0000s (  4.1%) Prune NRT refops #780
   0.0000s (  3.8%) Dominator Tree Construction #3156
   0.0000s (  3.5%) Post-Dominator Tree Construction #1437
== #3 Module passes (cheap optimization for refprune)
 Percent: 2.6%
 Total 0.4424s
 Top timings:
   0.1504s ( 34.0%) Loop Invariant Code Motion
   0.0597s ( 13.5%) Memory SSA
   0.0522s ( 11.8%) Function Integration/Inlining
   0.0504s ( 11.4%) Simplify the CFG #2
   0.0181s (  4.1%) Post-Dominator Tree Construction #2
== #4 Module passes (full optimization)
 Percent: 33.7%
 Total 5.6748s
 Top timings:
   0.5221s (  9.2%) Induction Variable Simplification
   0.4937s (  8.7%) Combine redundant instructions
   0.3462s (  6.1%) Global Value Numbering
   0.2667s (  4.7%) Unroll loops #2
   0.2497s (  4.4%) Interprocedural Sparse Conditional Constant Propagation
== #5 Finalize object
 Percent: 35.9%
 Total 6.0364s
 Top timings:
   1.8049s ( 29.9%) Greedy Register Allocator #253
   1.5151s ( 25.1%) X86 vzeroupper inserter #68
   0.6399s ( 10.6%) Loop Strength Reduction #181
   0.2958s (  4.9%) X86 Execution Dependency Fix #40
   0.2233s (  3.7%) Induction Variable Users #183