I have been playing around with the “official” example code that showcases a nice Numba use case by calculating the Lennard-Jones interaction energy [Example: Lennard Jones]. When I modified the (non-parallel) numba_scalar version so that the two functions lj_numba_scalar and distance_numba_scalar are combined into a single function, execution as measured with IPython’s timeit became slower by a factor of 6!
By investigating further I noticed that the minimal change that produced this effect is moving the calculation of the inverse of the atomic distance into the distance function.
This behavior was consistent over multiple systems and architectures (i5-11600K/Ubuntu, Microsoft Surface Pro 7, Raspberry Pi 4), although on the Raspberry Pi the difference was smaller (a factor of 3).
When I turned on parallelization, the slowdown was much smaller (roughly a factor of 1.5), but still present.
Any idea what is going on? Such huge performance drops would be nice to understand.
To give you some numbers, on a 10000x3 array the original ran in 145 ms on my desktop.
Here’s the minimally modified code (901 ms runtime):
import numba

# moved the calculation of the inverse
@numba.njit
def lj_numba_scalar_prange(r_inv):
    sr6 = r_inv**6
    pot = 4.*(sr6*sr6 - sr6)
    return pot

@numba.njit
def distance_numba_scalar_prange(atom1, atom2):
    dx = atom2[0] - atom1[0]
    dy = atom2[1] - atom1[1]
    dz = atom2[2] - atom1[2]
    r = (dx * dx + dy * dy + dz * dz) ** 0.5
    r_inv = (1./r)
    return r_inv
And here’s the code with merged functions (917 ms runtime):
import numba

# merge distance and potential calculation functions
@numba.njit
def lj_numba_scalar(atom1, atom2):
    dx = atom2[0] - atom1[0]
    dy = atom2[1] - atom1[1]
    dz = atom2[2] - atom1[2]
    r = (dx * dx + dy * dy + dz * dz) ** 0.5
    sr6 = (1./r)**6
    pot = 4.*(sr6*sr6 - sr6)
    return pot

@numba.njit
def potential_numba_scalar(cluster):
    energy = 0.0
    for i in range(len(cluster)-1):
        for j in range(i + 1, len(cluster)):
            e = lj_numba_scalar(cluster[i], cluster[j])
            energy += e
    return energy
It would be nice if you could provide the exact code you ran so others can reproduce. There could be a mistake in the profiling, like including compilation time.
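To rule out compilation time, one option is to call the function once before timing it. The following is only a minimal sketch of that idea, not the code from the original post: the helper `time_after_warmup`, the cluster size, and the box scale are my own assumptions, and the no-op `njit` fallback is there only so the script also runs where Numba is not installed.

```python
import time

import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to plain Python so the sketch still runs without Numba.
    def njit(func):
        return func


@njit
def lj_numba_scalar(atom1, atom2):
    dx = atom2[0] - atom1[0]
    dy = atom2[1] - atom1[1]
    dz = atom2[2] - atom1[2]
    r = (dx * dx + dy * dy + dz * dz) ** 0.5
    sr6 = (1.0 / r) ** 6
    return 4.0 * (sr6 * sr6 - sr6)


@njit
def potential_numba_scalar(cluster):
    energy = 0.0
    for i in range(len(cluster) - 1):
        for j in range(i + 1, len(cluster)):
            energy += lj_numba_scalar(cluster[i], cluster[j])
    return energy


def time_after_warmup(func, arg, repeats=3):
    """Return the best wall-clock time in seconds, excluding the first call."""
    func(arg)  # warm-up: with Numba, this call includes JIT compilation
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical test input: 300 random atoms in a 10 x 10 x 10 box.
rng = np.random.default_rng(0)
cluster = rng.random((300, 3)) * 10.0
elapsed = time_after_warmup(potential_numba_scalar, cluster)
print(f"best of 3 runs: {elapsed * 1e3:.3f} ms")
```

If the timings with and without the warm-up call differ a lot, compilation was probably part of the original measurement.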
What I think is more likely, however, is the following:
You observe slower execution when you increase the code size of a function that is called many, many times.
The slowdown is less when the loop is executed in parallel.
This strongly suggests to me that this is related to inlining. I suspect that Numba stops inlining lj_numba_scalar when it gets too big. Have you tested using numba.njit(inline="always")?
You can also manually inline everything into potential_numba_scalar, like you did with lj_numba_scalar. If this recovers performance it also supports this assumption.
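Both tests could look roughly like this. Treat it as a sketch under assumptions: the names `lj_inlined`, `potential_inline_always`, and `potential_manual` are made up for illustration, the input is random test data, and the fallback decorator (which simply ignores `inline`) only exists so the script runs without Numba.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: no-op stand-in that accepts both @njit and @njit(...) forms.
    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


# Variant 1: ask Numba to inline the helper at the Numba IR level.
@njit(inline="always")
def lj_inlined(atom1, atom2):
    dx = atom2[0] - atom1[0]
    dy = atom2[1] - atom1[1]
    dz = atom2[2] - atom1[2]
    r = (dx * dx + dy * dy + dz * dz) ** 0.5
    sr6 = (1.0 / r) ** 6
    return 4.0 * (sr6 * sr6 - sr6)


@njit
def potential_inline_always(cluster):
    energy = 0.0
    for i in range(len(cluster) - 1):
        for j in range(i + 1, len(cluster)):
            energy += lj_inlined(cluster[i], cluster[j])
    return energy


# Variant 2: the helper body written out by hand inside the double loop.
@njit
def potential_manual(cluster):
    energy = 0.0
    for i in range(len(cluster) - 1):
        for j in range(i + 1, len(cluster)):
            dx = cluster[j, 0] - cluster[i, 0]
            dy = cluster[j, 1] - cluster[i, 1]
            dz = cluster[j, 2] - cluster[i, 2]
            r = (dx * dx + dy * dy + dz * dz) ** 0.5
            sr6 = (1.0 / r) ** 6
            energy += 4.0 * (sr6 * sr6 - sr6)
    return energy


# Hypothetical test input: 200 random atoms in a 10 x 10 x 10 box.
rng = np.random.default_rng(0)
cluster = rng.random((200, 3)) * 10.0
e_inline = potential_inline_always(cluster)
e_manual = potential_manual(cluster)
print(e_inline, e_manual)
```

Both variants should return the same energy up to rounding; timing each of them against the version with the plain @numba.njit helper then shows whether the missing inlining explains the slowdown.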
Thanks for the quick answer, your suggestion was absolutely correct. With numba.njit(inline="always") all execution times were essentially the same.
In my understanding (and correct me if I’m wrong), inlining speeds up execution by removing the overhead of jumps/function calls at the machine level. However, I have two open questions:
According to the docs, Numba’s own inlining is off by default and all inlining is left to LLVM. Why does LLVM’s inlining logic not trigger when the function call introduces a 6x overhead?
I do not understand why parallel execution changes the behavior. Why is function call overhead different in parallel? After all, the relative speed of execution with and without the function call should not change?
Sorry for not making a complete working example available. If anyone is still interested, there’s one here: numba_lj_inlining_test
I’m not an expert on this topic, so please consider my response with caution.
The overhead of the function call itself is typically minimal and not the primary reason for choosing to inline a function. Inlining offers the compiler the opportunity to make optimizations that would otherwise not be possible, and this can have a significant impact on performance.
Deciding whether inlining makes sense is not straightforward because it comes with drawbacks too, such as increased code size and longer compilation times. In some cases, inlining can even have the opposite effect and slow down your code. I don’t know much about LLVM’s internals and therefore cannot give you more information on that.
If your system is constrained by memory bandwidth, the optimization level of your function may not be the primary concern. This is because, in such cases, the various cores might spend a lot of time waiting for data from memory, rendering the potential performance gains from function inlining less significant.