A few questions are often asked about the performance of the compiler:
- What is the compiler spending its time doing?
- What is the compiler doing in terms of “compiler passes”?
- Which compiler passes are taking the most time?
For reference, Numba is a multi-pass compiler: it generates IR and then repeatedly transforms it. Not all compiler passes in Numba are about optimisation; quite a few exist to enable features or to make it possible to compile more of the Python language.
The information about the time spent in each compiler pass, on a per-compiled-function basis, can be accessed like this:
```python
from numba import njit, literal_unroll
import numpy as np

# This is a function which deliberately loads up the work on the
# `literal_unroll` compiler pass.
@njit
def foo():
    a = (1, 'a', 2, np.zeros(5), 6j)
    letters = [chr(x) for x in range(ord('a'), ord('a') + len(a))]
    count = 0
    for x in literal_unroll(a):
        print(x)
        print(letters[count].upper() * 3)
        count += 1

# Run it once such that it is compiled.
foo()

# Look up the signature of the first thing that's been compiled.
signature = foo.signatures[0]

# Find the "overload" for it; this is what Numba uses internally to represent
# the compiled function.
overload = foo.overloads[signature]

# This is the pipeline we want to look at, @njit = nopython pipeline.
pipeline = 'nopython'

# Print the information; it's in the overload.metadata dictionary.
width = 20
print("\n\nTimings:\n")
for name, time in overload.metadata['pipeline_times'][pipeline].items():
    fmt = (f'{name: <{40}}:'
           f'{time.init:<{width}.6f}'
           f'{time.run:<{width}.6f}'
           f'{time.finalize:<{width}.6f}')
    print(fmt)
```
The output of which is (for example):
```
<print statements from the program elided>

Timings:

0_translate_bytecode                    :0.000004            0.004511            0.000002
1_fixup_args                            :0.000002            0.000008            0.000002
2_ir_processing                         :0.000002            0.000921            0.000002
3_with_lifting                          :0.000002            0.000965            0.000002
4_rewrite_semantic_constants            :0.000002            0.000127            0.000002
5_dead_branch_prune                     :0.000002            0.000094            0.000002
6_generic_rewrites                      :0.000002            0.002261            0.000002
7_inline_closure_likes                  :0.000002            0.019027            0.000003
8_make_function_op_code_to_jit_function :0.000002            0.000036            0.000002
9_inline_inlinables                     :0.000002            0.000188            0.000002
10_dead_branch_prune                    :0.000002            0.000111            0.000002
11_find_literally                       :0.000002            0.000104            0.000002
12_literal_unroll                       :0.000002            3.487060            0.000003
13_reconstruct_ssa                      :0.000002            0.011669            0.000004
14_nopython_type_inference              :0.000002            0.048368            0.000003
15_annotate_types                       :0.000002            0.005212            0.000002
16_strip_phis                           :0.000002            0.004182            0.000002
17_inline_overloads                     :0.000002            0.002399            0.000002
18_nopython_rewrites                    :0.000002            0.012225            0.000003
19_ir_legalization                      :0.000002            0.005183            0.000002
20_nopython_backend                     :0.000002            1.097940            0.000003
21_dump_parfor_diagnostics              :0.000002            0.000011            0.000001
```
The three columns of floats correspond to "pass initialisation time", "pass run time" and "pass finalisation time" respectively. From the above, it's clear that `literal_unroll` is taking quite a lot of run time; this is expected as it's an involved pass and is used directly in the input source. The `nopython_backend` pass is also taking quite a bit of time, which is expected as both strings and `literal_unroll` generate a lot of work for the compiler backend that Numba uses to do JIT compilation to machine code (LLVM).
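To spot the dominant passes without scanning the table by eye, the `pipeline_times` mapping can be sorted by run time. A minimal sketch of that ranking follows; for illustration the mapping is mocked with a stdlib `namedtuple` carrying the same `init`/`run`/`finalize` fields, using a few of the values from the table above (real data would come from `overload.metadata['pipeline_times'][pipeline]`):

```python
from collections import namedtuple

# Stand-in for the timing records held in
# overload.metadata['pipeline_times'][pipeline]; each record exposes
# .init, .run and .finalize fields.
Timings = namedtuple('Timings', ['init', 'run', 'finalize'])

pass_timings = {
    '0_translate_bytecode': Timings(0.000004, 0.004511, 0.000002),
    '12_literal_unroll': Timings(0.000002, 3.487060, 0.000003),
    '14_nopython_type_inference': Timings(0.000002, 0.048368, 0.000003),
    '20_nopython_backend': Timings(0.000002, 1.097940, 0.000003),
}

# Rank the passes by run time to see where compilation effort goes.
ranked = sorted(pass_timings.items(), key=lambda kv: kv[1].run, reverse=True)
total = sum(t.run for t in pass_timings.values())
for name, t in ranked:
    print(f'{name:<30} {t.run:>10.6f}s ({100 * t.run / total:5.1f}%)')
```

With the real mapping substituted in, this prints the passes most worth investigating first.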
Compiler passes which are known to sometimes be quite involved, and therefore take a considerable amount of time, include:
- `nopython_type_inference`: this works out the types of all the variables used in the input source; it is iterative and often tries multiple things before giving up and raising a `TypingError`.
- `reconstruct_ssa`: this converts Numba's IR into SSA form; the algorithms that do this are necessarily complex.
- `nopython_backend`: this is Numba "lowering" its IR to LLVM IR and then using LLVM to compile and link it such that it can be executed in place of the `@jit`-decorated Python function. Lowering Numba IR to LLVM IR is usually quite quick, but LLVM optimisation and linking can take some time.
- `pre_parfor_pass` and `parfor_pass` (if `parallel=True` is set in `@jit`): these passes respectively do analysis and transforms for parallel execution; the analysis is quite involved, and the transforms are complicated and create a lot of work for LLVM.
- `literal_unroll`: this pass does loop versioning, which creates a new version of the loop for each type in the thing being unrolled/iterated over; it is in itself complicated and also creates a lot of work for LLVM.
Most other passes are relatively quick in terms of execution.
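When a function has been compiled for several type signatures, a separate set of pass timings exists for each overload, and summing all three columns gives a rough per-signature compile cost. A sketch of that aggregation, assuming the same nested shape as `foo.overloads[sig].metadata['pipeline_times'][pipeline]` (mocked here with stdlib types and two hypothetical specializations):

```python
from collections import namedtuple

Timings = namedtuple('Timings', ['init', 'run', 'finalize'])

# Mocked stand-in for {signature: pipeline_times mapping}; in real code the
# signatures come from foo.signatures and the timings from foo.overloads.
per_signature = {
    '()': {
        '12_literal_unroll': Timings(0.000002, 3.487060, 0.000003),
        '20_nopython_backend': Timings(0.000002, 1.097940, 0.000003),
    },
    '(int64,)': {
        '14_nopython_type_inference': Timings(0.000002, 0.048368, 0.000003),
        '20_nopython_backend': Timings(0.000002, 0.512000, 0.000003),
    },
}

# Sum init + run + finalize across all passes for an approximate total
# compile cost of each specialization.
totals = {sig: sum(t.init + t.run + t.finalize for t in passes.values())
          for sig, passes in per_signature.items()}
for sig, total in totals.items():
    print(f'{sig:<12} compiled in ~{total:.3f}s')
```

This kind of summary makes it easy to see which specializations of a function are expensive to compile, which is useful when deciding whether caching or reducing the number of compiled signatures is worthwhile.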