I am comparing some new code I wrote with Numba against an older Fortran implementation. The code is fairly basic (mostly arithmetic operations and vector manipulation), so I would expect that achieving similar efficiency is possible in principle, although this may be naive of me. What I see instead is that the Numba version takes at least twice as long in the case study I attempted.
The original code is a bit too large to post here, but I have tried replicating its basic features in examples written in both languages. I noticed some basic optimizations along the way; for example, avoiding `.shape` lookups in the Numba functions speeds up the code quite a bit. Nonetheless, even in this sample case the Numba implementation ends up taking twice as long as the Fortran code (see below).
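The `.shape` observation can be checked in isolation with a small harness like the following (the function names are mine, and this is only a sketch; it assumes Numba is installed, with a plain-Python fallback so it still runs without it):

```python
import time
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback so the sketch still runs without Numba
    njit = lambda f: f

@njit
def walk_with_shape(rs):
    # Reads the loop bounds from the array itself.
    s = 0.0
    for i in range(rs.shape[0]):
        for j in range(rs.shape[1]):
            s += rs[i, j, 0]
    return s

@njit
def walk_with_dims(rs, d1, d2):
    # Takes the loop bounds as explicit arguments, like the Fortran code.
    s = 0.0
    for i in range(d1):
        for j in range(d2):
            s += rs[i, j, 0]
    return s

rs = np.random.rand(4, 8, 3)
walk_with_shape(rs)          # warm-up: triggers compilation
walk_with_dims(rs, 4, 8)

t0 = time.time()
for _ in range(100_000):
    walk_with_shape(rs)
print(f".shape version:        {time.time() - t0:.3f} s")

t0 = time.time()
for _ in range(100_000):
    walk_with_dims(rs, 4, 8)
print(f"explicit-dims version: {time.time() - t0:.3f} s")
```

Both functions traverse the array in the same order, so any timing gap would come from the bounds lookup alone.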
The purpose of this question is multifaceted:
- First of all, I might be missing some obvious improvements to the Numba function that would make it faster (I am particularly suspicious of the random-number call). Figuring these out would likely help me with the original code as well.
- If no significant optimizations are possible, I would like to understand whether something fundamental in the way Numba's JIT compilation works makes it slower than the Fortran version. I know that Fortran compilers are heavily optimized for specific hardware architectures; could Numba be at a disadvantage on this side?
- I have read other questions about profiling Numba code, but the conclusion they lead to seems to be that no advanced form of profiling is available. I hope that answering this question can help me better understand what strategies exist for investigating Numba code and finding possible bottlenecks or pitfalls.
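To probe the random-number suspicion from the first point, one cheap strategy is to time the RNG calls in isolation and subtract that from the full runtime. A minimal sketch (the function name `rng_only` is mine; it assumes Numba is installed, with a plain-Python fallback):

```python
import time
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback so the sketch still runs without Numba
    njit = lambda f: f

@njit
def rng_only(N, d1, d2, d3):
    # Same RNG calls as the benchmark loop, but no arithmetic on the results.
    s = 0.0
    for _ in range(N):
        r = np.random.rand(d2, d3)
        rs = np.random.rand(d1, d2, d3)
        # Touch the arrays so the calls cannot be optimized away.
        s += r[0, 0] + rs[0, 0, 0]
    return s

rng_only(1, 4, 8, 3)  # warm-up: triggers compilation
t0 = time.time()
rng_only(100_000, 4, 8, 3)
rng_time = time.time() - t0
print(f"RNG-only loop: {rng_time:.3f} s")
```

The difference between this and the full loop's time gives a rough upper bound on what the arithmetic itself costs.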
Below is the code I am using to run the examples. The Fortran code is compiled with gfortran.
The Fortran code:
module functions
contains
  function pot(ra, rb, d3)
    integer, intent(in) :: d3
    real(8), intent(in), dimension(d3) :: ra, rb
    real(8) :: e, f, a, b
    real(8), dimension(2) :: pot
    f = sum((ra - rb)**2)
    e = exp(f)
    a = 1 - f
    b = sqrt(f * e)
    pot = (/a, b/)
  end function pot

  function total(r, rs, d1, d2, d3)
    integer, intent(in) :: d1, d2, d3
    real(8), intent(in), dimension(d2, d3) :: r
    real(8), intent(in), dimension(d1, d2, d3) :: rs
    real(8), dimension(2) :: empty, temp
    real(8), dimension(2) :: total
    integer :: i, j
    empty = (/1., 0./)
    do i = 1, d1
      do j = 1, d2
        temp = pot(r(j, :), rs(i, j, :), d3)
        empty(1) = empty(1) * temp(1)
        empty(2) = empty(2) + temp(2)
      end do
    end do
    total = empty
  end function total
end module functions

program comparison
  use functions, only: total
  real :: start, finish
  integer :: d1, d2, d3
  integer :: i
  real(8), dimension(2) :: result, temp
  real(8), dimension(:, :), allocatable :: r
  real(8), dimension(:, :, :), allocatable :: rs
  call cpu_time(start)
  d1 = 4
  d2 = 8
  d3 = 3
  result = (/1., 0./)
  allocate (r(d2, d3))
  allocate (rs(d1, d2, d3))
  do i = 1, 1000000
    call random_number(r)
    call random_number(rs)
    temp = total(r, rs, d1, d2, d3)
    result(1) = result(1) * temp(1)
    result(2) = result(2) + temp(2)
  end do
  call cpu_time(finish)
  print '("Time = ",f6.3," seconds.")', finish - start
end program comparison
When compiled with `gfortran numba_comparison.f90`, this takes about 3.5 seconds; the `-O3` optimization flag lowers the time to about 2.5 seconds. (Note that the module must precede the program in the file for the single-file compile to work.)
The Python code is as follows:
from numba import njit
import numpy as np
import time
@njit
def pot(ra, rb):
    f = np.sum((ra - rb) ** 2)
    e = np.exp(f)
    a = 1 - f
    b = np.sqrt(f * e)
    return a, b

@njit
def total(r, rs, d1, d2):
    empty = np.array([1., 0.])
    for i in range(d1):
        for j in range(d2):
            temp = pot(r[j], rs[i, j])
            empty[0] *= temp[0]
            empty[1] += temp[1]
    return empty
### Passing d1 and d2 here is necessary, as the function runs much slower if
### rs.shape[0] and rs.shape[1] are used. This is reminiscent of the Fortran version.
@njit
def runner(N):
    d1 = 4
    d2 = 8
    d3 = 3
    result = np.array([1., 0.])
    r = np.zeros((d2, d3))
    rs = np.zeros((d1, d2, d3))
    for i in range(N):
        r = np.random.rand(d2, d3)
        rs = np.random.rand(d1, d2, d3)
        temp = total(r, rs, d1, d2)
        result[0] *= temp[0]
        result[1] += temp[1]
t_start = time.time()
runner(1)
t_end = time.time()
run_time = t_end - t_start
print(f"Compiled in {round(run_time, 3)} seconds.")
t_start = time.time()
runner(1000000)
t_end = time.time()
run_time = t_end - t_start
print(f"Time = {round(run_time, 3)} seconds.")
This takes about 1 second to compile and about 4.5 seconds to run, which is almost double the time of the Fortran version.
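One variation I would also consider (a sketch of my own, not part of the code above): have `total` accumulate into plain scalars and return a tuple, instead of allocating a small NumPy array on every call. Allocating tiny arrays inside a hot loop is overhead the Fortran version does not pay. The names `pot` and `total_scalar` follow the code above; the fallback lets this run without Numba:

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback so the sketch still runs without Numba
    njit = lambda f: f

@njit
def pot(ra, rb):
    f = np.sum((ra - rb) ** 2)
    e = np.exp(f)
    return 1 - f, np.sqrt(f * e)

@njit
def total_scalar(r, rs, d1, d2):
    # Scalar accumulators: no per-call np.array allocation.
    prod = 1.0
    acc = 0.0
    for i in range(d1):
        for j in range(d2):
            a, b = pot(r[j], rs[i, j])
            prod *= a
            acc += b
    return prod, acc

# Quick check that it computes the same quantities as the array-based version.
r = np.random.rand(8, 3)
rs = np.random.rand(4, 8, 3)
print(total_scalar(r, rs, 4, 8))
```

Whether this actually closes the gap is exactly what I would like to learn how to measure.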