How do I find out what the compiler is spending its time doing?

A few questions are often asked with respect to performance of the compiler:

  • What is the compiler spending its time doing?
  • What is the compiler doing in terms of “compiler passes”?
  • Which compiler passes are taking the most time?

For reference, Numba is a multi-pass compiler, it generates IR and then repeatedly transforms it. Not all compiler passes in Numba are about optimisation, quite a few are to enable features or to make it so it’s possible to compile more of the Python language.

The information about the time spent in each compiler pass, on a per-compiled-function basis, can be accessed like this:

from numba import njit, literal_unroll
import numpy as np

# This is a function which deliberately loads up the work on the
# `literal_unroll` compiler pass.
def foo():
    a = (1, 'a', 2, np.zeros(5), 6j)
    letters = [chr(x) for x in range(ord('a'), ord('a') + len(a))]
    count = 0
    for x in literal_unroll(a):
        print(letters[count].upper() * 3)
        count += 1

# Run it once such that it is compiled.

# Look up the signature of the first thing that's been compiled.
signature = foo.signatures[0]
# Find the "overload" for it, this is what Numba uses internally to represent
# the compiled function.
overload = foo.overloads[signature]

# This is the pipeline we want to look at, @njit = nopython pipeline.
pipeline = 'nopython'

# Print the information, it's in the overload.metadata dictionary.
width = 20
for name, time in overload.metadata['pipeline_times'][pipeline].items():
    fmt = (f'{name: <{40}}:'

the output of which is (for example):

<print statements from the program elided>


0_translate_bytecode                    :0.000004            0.004511            0.000002            
1_fixup_args                            :0.000002            0.000008            0.000002            
2_ir_processing                         :0.000002            0.000921            0.000002            
3_with_lifting                          :0.000002            0.000965            0.000002            
4_rewrite_semantic_constants            :0.000002            0.000127            0.000002            
5_dead_branch_prune                     :0.000002            0.000094            0.000002            
6_generic_rewrites                      :0.000002            0.002261            0.000002            
7_inline_closure_likes                  :0.000002            0.019027            0.000003            
8_make_function_op_code_to_jit_function :0.000002            0.000036            0.000002            
9_inline_inlinables                     :0.000002            0.000188            0.000002            
10_dead_branch_prune                    :0.000002            0.000111            0.000002            
11_find_literally                       :0.000002            0.000104            0.000002            
12_literal_unroll                       :0.000002            3.487060            0.000003            
13_reconstruct_ssa                      :0.000002            0.011669            0.000004            
14_nopython_type_inference              :0.000002            0.048368            0.000003            
15_annotate_types                       :0.000002            0.005212            0.000002            
16_strip_phis                           :0.000002            0.004182            0.000002            
17_inline_overloads                     :0.000002            0.002399            0.000002            
18_nopython_rewrites                    :0.000002            0.012225            0.000003            
19_ir_legalization                      :0.000002            0.005183            0.000002            
20_nopython_backend                     :0.000002            1.097940            0.000003            
21_dump_parfor_diagnostics              :0.000002            0.000011            0.000001            

The three columns of floats correspond to “pass initialisation time”, “pass run time”, “pass finalisation time” respectively. From the above, it’s clear that literal_unroll is taking quite a lot of run time, this is expected as it’s an involved pass and is used directly in the input source. The nopython_backend pass is also taking quite a bit of time, and this is expected as both strings and literal_unroll generate a lot of work for the compiler backend that Numba uses to do JIT compilation to machine code (LLVM).

Compiler passes which are known to sometimes be quite involved, and therefore take a load of time, include:

  • nopython_type_inference: this is working out the types of all the variables used in the input source, it is iterative and often tries multiple things before giving up and producing a TypingError.
  • reconstruct_ssa: this is converting Numba’s IR into SSA form, the algorithms are that do this are necessarily complex.
  • nopython_backend: this is Numba “lowering” it’s IR to LLVM IR and then using LLVM to compile and link it such that it can be executed in place of the @jit decorated Python function. Lowering Numba IR to LLVM IR is usually quite quick, but LLVM optimising and linking can take some time.
  • pre_parfor_pass and parfor_pass (if parallel=True is set in @jit): these passes respectively do analysis and transforms for parallel execution, analysis is quite involved and the transforms are complicated, involved and create a lot of work for LLVM.
  • literal_unroll: this pass does loop versioning, which creates a new version of the loop for each type in the thing being unrolled/iterated over, it in itself is complicated and also creates a lot of work for LLVM.

Most other passes are relatively quick in terms of execution.

1 Like