hi everyone, as mentioned on the weekly call, I’m working on measuring compilation time.
There’s a part that puzzles me, I get these pieces of information:
-
most of the time is spent in type inference (but I suspect that’s because compilation of called functions is included there)
-
{'llvm_lock': 84.81026899999961, 'compiler_lock': 138.9255492}
-
Printing pass timings for JITCodeLibrary('run_cg_fc') Total time: 16.8268
So, if LLVM took 16 seconds, what happened in the other 84 seconds where there was an llvm_lock? What actions are included in the llvm_lock but not in the pass timings?
It’s also puzzling that cProfiler says that 75% of the time is spent in the method get_call_type of dispatcher type (2812 individual calls). However, get_call_template which is part of get_call_type is only called 2 times. I don’t know what to make of it.
Luk
Can you include some sample code / commands that shows how you’re generating this info? (It doesn’t have to be the whole code that you’re compiling here, I just want to be able to see how you’re generating the info to try and explain the output)
sorry, I should have included that to begin with.
The main function is called run_cg_fc, and from that dispatcher I get the timers:
print(run_cg_fc.get_metadata()[tuple(parameter_types.values())]['timers'])
[out] {'llvm_lock': 84.81026899999961, 'compiler_lock': 138.9255492}
The second piece of information needs setting first a config variable, and then it can be retrieved as follows:
# top of the file
from numba.core import config
config.LLVM_PASS_TIMINGS = 1
###
.... rest of the file ....
###
# pass timings
print(run_cg_funcs.get_metadata()[tuple(parameter_types.values())]['llvm_pass_timings'])
[out]
Printing pass timings for JITCodeLibrary('run_cg_fc')
Total time: 16.8268
== #0 Function passes on '_ZN8__main__14run_cg_fc$<VERY LONG STRING HERE>'
Percent: 27.1%
Total 4.5661s
Top timings:
4.4474s ( 97.4%) SROA #31
0.0502s ( 1.1%) Early CSE #777
0.0274s ( 0.6%) Debug Variable Analysis #244
0.0137s ( 0.3%) Prune NRT refops #778
0.0091s ( 0.2%) Post-Dominator Tree Construction #1435
== #1 Function passes on '_ZN7cpython8__main__14run_cg_fc<VERY LONG STRING HERE>'
Percent: 0.6%
Total 0.1063s
Top timings:
0.0942s ( 88.6%) SROA #25
0.0047s ( 4.4%) Early CSE #778
0.0027s ( 2.5%) Simplify the CFG #301
0.0014s ( 1.3%) Post-Dominator Tree Construction #1436
0.0012s ( 1.1%) Dominator Tree Construction #3155
== #2 Function passes on 'cfunc._ZN8__main__14run_cg_fc$242 <VERY LONG STRING HERE> '
Percent: 0.0%
Total 0.0008s
Top timings:
0.0005s ( 59.2%) Early CSE #779
0.0002s ( 21.3%) Simplify the CFG #302
0.0000s ( 4.1%) Prune NRT refops #780
0.0000s ( 3.8%) Dominator Tree Construction #3156
0.0000s ( 3.5%) Post-Dominator Tree Construction #1437
== #3 Module passes (cheap optimization for refprune)
Percent: 2.6%
Total 0.4424s
Top timings:
0.1504s ( 34.0%) Loop Invariant Code Motion
0.0597s ( 13.5%) Memory SSA
0.0522s ( 11.8%) Function Integration/Inlining
0.0504s ( 11.4%) Simplify the CFG #2
0.0181s ( 4.1%) Post-Dominator Tree Construction #2
== #4 Module passes (full optimization)
Percent: 33.7%
Total 5.6748s
Top timings:
0.5221s ( 9.2%) Induction Variable Simplification
0.4937s ( 8.7%) Combine redundant instructions
0.3462s ( 6.1%) Global Value Numbering
0.2667s ( 4.7%) Unroll loops #2
0.2497s ( 4.4%) Interprocedural Sparse Conditional Constant Propagation
== #5 Finalize object
Percent: 35.9%
Total 6.0364s
Top timings:
1.8049s ( 29.9%) Greedy Register Allocator #253
1.5151s ( 25.1%) X86 vzeroupper inserter #68
0.6399s ( 10.6%) Loop Strength Reduction #181
0.2958s ( 4.9%) X86 Execution Dependency Fix #40
0.2233s ( 3.7%) Induction Variable Users #183