Comparison between Numba and Fortran code

I am comparing some new code I wrote in Numba with an older Fortran implementation. The code is fairly basic (mostly arithmetic operations and vector manipulation), so I would expect that achieving similar efficiency is possible in principle, although this may be naive of me. What I see instead is that the Numba version takes at least twice as long in the case study I attempted.

The original code is a bit too large to post here, but I have tried to replicate its basic features in examples written in both languages. I noticed some basic optimizations along the way; for example, avoiding .shape in Numba seems to speed up the code quite a bit. Nonetheless, even in this sample case the Numba implementation takes twice as much time (see below) as the Fortran code.

The purpose of this question is multifaceted:

  • First of all, I might be missing some obvious improvements to the Numba function that would make it faster (I'm particularly suspicious of the random number call). Figuring these out would likely help me with the original code as well.
  • If optimizations are not possible, or not significant, I would like to understand whether there is some fundamental reason why Numba's JIT compilation makes it slower than the Fortran version. I know that Fortran compilers are highly optimized for specific hardware architectures; could it be that Numba is at a disadvantage here?
  • I have read other threads about profiling in Numba, but the conclusion they seem to lead to is that no very advanced form of profiling exists. I hope that answering this question can help me better understand what strategies are available to investigate Numba code and look for possible bottlenecks or pitfalls (the inspection hooks sketched below are the only tools I have found so far).
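
For reference, these hooks show what the JIT generated rather than where the time is spent (a minimal sketch; f is just a stand-in function):

from numba import njit

@njit
def f(x):
  return 2. * x

f(1.)                  # call once so a specialization is compiled
f.inspect_types()      # prints the inferred Numba types, line by line
ir = f.inspect_llvm()  # dict mapping signatures to LLVM IR strings
asm = f.inspect_asm()  # dict mapping signatures to generated assembly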

Below, I post the code I am using to run examples. The Fortran code is compiled with gfortran.

The Fortran code:

! In a single source file, the module must come before the program unit
! that uses it; otherwise gfortran cannot find functions.mod.
module functions
  implicit none
contains

  function pot(ra, rb, d3)
    integer, intent(in) :: d3
    real(8), intent(in), dimension(d3) :: ra, rb
    real(8) :: e, f, a, b
    real(8), dimension(2) :: pot

    f = sum((ra-rb)**2)
    e = exp(f)
    a = 1 - f
    b = sqrt(f * e)

    pot = (/a, b/)
  end function pot

  function total(r, rs, d1, d2, d3)
    integer, intent(in) :: d1, d2, d3
    real(8), intent(in), dimension(d2,d3) :: r
    real(8), intent(in), dimension(d1,d2,d3) :: rs
    real(8), dimension(2) :: empty, temp
    real(8), dimension(2) :: total
    integer :: i, j

    empty = (/1., 0./)

    do i = 1, d1
      do j = 1, d2
        temp = pot(r(j,:), rs(i,j,:), d3)
        empty(1) = empty(1) * temp(1)
        empty(2) = empty(2) + temp(2)
      end do
    end do

    total = empty
  end function total
end module functions


program comparison

use functions, only: total

implicit none

real :: start, finish

integer :: d1, d2, d3
integer :: i
real(8), dimension(2) :: result, temp
real(8), dimension(:,:), allocatable :: r
real(8), dimension(:,:,:), allocatable :: rs

call cpu_time(start)

d1 = 4
d2 = 8
d3 = 3

result = (/1., 0./)

allocate (r(d2, d3))
allocate (rs(d1, d2, d3))

do i = 1, 1000000
  call random_number(r)
  call random_number(rs)

  temp = total(r, rs, d1, d2, d3)

  result(1) = result(1) * temp(1)
  result(2) = result(2) + temp(2)
end do

call cpu_time(finish)

print '("Time = ",f6.3," seconds.")', finish-start

end program comparison

When compiled with gfortran numba_comparison.f90, this takes about 3.5 seconds; the optimization flag -O3 lowers this time to about 2.5 seconds.

The Python code is as follows:

from numba import njit
import numpy as np
import time

@njit
def pot(ra, rb):
  f = np.sum((ra-rb)**2)
  e = np.exp(f)
  a = 1 - f
  b = np.sqrt(f * e)
  return a, b

@njit
def total(r, rs, d1, d2):
  empty = np.array([1., 0.])
  for i in range(d1):
    for j in range(d2):
      temp = pot(r[j], rs[i,j])
      empty[0] *= temp[0]
      empty[1] += temp[1]
  return empty

### Passing d1 and d2 here is necessary, as the function runs much slower if
### rs.shape[0] and rs.shape[1] are used. This is reminiscent of the Fortran version.

@njit
def runner(N):
  d1 = 4
  d2 = 8
  d3 = 3

  result = np.array([1., 0.])
  r = np.zeros((d2, d3))
  rs = np.zeros((d1, d2, d3))

  for i in range(N):
    r = np.random.rand(d2, d3)
    rs = np.random.rand(d1, d2, d3)  # d1 rows, matching the (d1, d2, d3) allocation above
    temp = total(r, rs, d1, d2)
    result[0] *= temp[0]
    result[1] += temp[1]  # accumulate the second component, as in the Fortran code


t_start = time.time()
runner(1)
t_end = time.time()
run_time = t_end - t_start
print(f"Compiled in {round(run_time, 3)} seconds.")

t_start = time.time()
runner(1000000)
t_end = time.time()
run_time = t_end - t_start
print(f"Time = {round(run_time, 3)} seconds.")

This takes about 1 second to compile, and about 4.5 seconds to run, which is almost double the time of the Fortran version.

Hi @Wlos
I have no idea about Fortran, but the Numba code can certainly be made much, much faster. Here’s what I notice:

  • f = np.sum((ra-rb)**2): Calculate this value in an explicit loop. The subtraction creates a new array, as does the squaring, and this happens in the hottest loop. With an explicit loop you can avoid the memory allocation completely. I'm pretty sure you can make big gains with this!
  • empty = np.array([1., 0.]): Use two scalar variables. You don't need an array here, and again you avoid expensive heap allocations (I sketch this below, after the random helper).
  • r = np.random.rand(d2, d3): I think you're trying to do the right thing here, but it doesn't work as you may expect. If you want to generate the random numbers in place, you have to do it explicitly in a loop. If you do this, you should flatten the arrays first (see below).
  • Are d1, d2, d3 always hard-coded? Maybe then you can give the compiler some hints about loop unrolling. But I would have to test that first.
@njit
def random_inplace(a):
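    # .flat indexes the elements linearly, whatever the array's shape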
    for i in range(a.size):
        a.flat[i] = np.random.rand()
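
On the second point, here is total rewritten with two scalar accumulators instead of the empty array (a sketch, keeping the signature from your post):

@njit
def total(r, rs, d1, d2):
  # two scalars replace the heap-allocated two-element array
  prod = 1.
  acc = 0.
  for i in range(d1):
    for j in range(d2):
      a, b = pot(r[j], rs[i, j])  # pot already returns a plain tuple
      prod *= a
      acc += b
  return prod, acc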

Thank you for your comments, especially the first one. I knew that using numpy functions such as np.sum tends to slow the code down, but it was not clear to me that this would also be the case for simpler operations. Sure enough, by defining

@njit
def my_sum(v1, v2):
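  # accumulate the squared differences explicitly; no temporary (v1 - v2) array is allocated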
  v0 = 0.
  for i in range(len(v1)):
    v0 += (v1[i]-v2[i])**2
  return v0

# ...

f = my_sum(ra, rb)

the computation time drops from 4.5 seconds to about 1.7 seconds (this is faster than the Fortran code, although that can certainly be optimized further as well).

Generating the random arrays through a loop was also a pretty big improvement, which is somewhat surprising given that it takes place in the outer loop. Using two functions

@njit
def make_random2(r):
  for i in range(r.shape[0]):
    for j in range(r.shape[1]):
      r[i,j] = np.random.rand()

@njit
def make_random3(r):
  for i in range(r.shape[0]):
    for j in range(r.shape[1]):
      for k in range(r.shape[2]):
        r[i,j,k] = np.random.rand()

the execution time goes down by about 0.4 seconds, and passing the dimensions to the function instead of using .shape takes away another 0.1 seconds. This leads to a comfortable 1.2 seconds for execution.
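
For clarity, the two helpers slot into runner like this, reusing the arrays preallocated with np.zeros (a sketch):

# inside runner's loop, instead of allocating fresh arrays each iteration:
make_random2(r)    # fills the preallocated (d2, d3) array in place
make_random3(rs)   # fills the preallocated (d1, d2, d3) array in place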

I actually expected the allocation of empty to be time-consuming, which is why I already used two separate numbers (the tuple returned by pot) in the inner loop. Since empty is only allocated in the outer loop, replacing it with two scalar variables gains just some 0.01 s, although that is still noticeable.

Finally, I have no idea what exactly loop unrolling would entail, but the actual use case is definitely more involved, so even if it works in this reduced example it might not be useful.

Overall, your comments have been really useful! In my python code, I had a line where an array was repeatedly assigned the difference of two other arrays in a loop. Defining a specialized function for that cut the execution time by half and brought it close to the performance of the original fortran script.
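
For illustration, that specialized function looks roughly like this (diff_into and the argument names are hypothetical, since the original code is not posted):

@njit
def diff_into(out, v1, v2):
  # write v1 - v2 into a preallocated buffer; no temporary array is created
  for i in range(len(v1)):
    out[i] = v1[i] - v2[i]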

The code definitely gets clunkier, though, if new functions must be defined for something as simple as taking the difference of two arrays. I can understand why plain Python works this way, but I was convinced that Numba would take care of it and remove the allocation of the intermediate array; this is what happens in Fortran (where defining a function such as my_sum leads to no improvement). It is not very clear to me why Numba would not collapse the operations in this case. Is it by design, or are there intrinsic limitations that make it unfeasible?

Have you tried using the random_inplace function I suggested instead of two separate make_random2 and make_random3 functions? The shape doesn’t matter here, so assuming that the arrays are 1-dimensional makes the code more concise and possibly a bit faster.
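
For example, inside runner's loop (a sketch; the one helper covers both arrays as posted):

random_inplace(r)
random_inplace(rs)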

On your last point, I can only offer some thoughts, since I am neither a Numba nor an LLVM developer. Dynamic arrays are not a native structure in LLVM, but they probably are in Fortran, so I would expect that it is not trivial to get LLVM to optimize this. It is probably possible to get LLVM to fuse the difference loop and the squaring loop (maybe it already does; that would need to be checked). However, loop fusion plus a reduction sounds very ambitious to me. Maybe it is possible to write a custom compiler pass that rewrites this pattern; I had something like that in mind a while ago, because I see the pattern so often. Either a compiler pass, or a new "lazy array" data structure.
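
For concreteness, here is the hand-fused kernel, i.e. roughly what such a pass would have to produce from the original pot (a sketch, assuming 1-D inputs and the imports from the original script):

@njit
def pot_fused(ra, rb):
  # the difference, the squaring and the reduction are fused into one loop,
  # so no temporary arrays are allocated
  f = 0.
  for i in range(len(ra)):
    d = ra[i] - rb[i]
    f += d * d
  e = np.exp(f)
  return 1 - f, np.sqrt(f * e)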