Slowdown when porting an anisotropic diffusion filter to Numba

Hey there,
I am just starting with Numba, and I am trying to port an anisotropic diffusion filter to it. The function looks something like this:

import numpy as np
from numba import jit

@jit(nopython=True)
def anisodiff(img, niter=1, kappa=50, gamma=0.1, step=(1.,1.), option=1):
    """
    Anisotropic diffusion.
    Usage:
    imgout = anisodiff(img, niter, kappa, gamma, step, option)
    Arguments:
            img    - input image
            niter  - number of iterations
            kappa  - conduction coefficient, typically 20-100
            gamma  - max value of .25 for stability
            step   - tuple, the distance between adjacent pixels in (y,x)
            option - 1 Perona Malik diffusion equation No 1
                     2 Perona Malik diffusion equation No 2
    Returns:
            imgout   - diffused image.
    kappa controls conduction as a function of gradient.  If kappa is low
    small intensity gradients are able to block conduction and hence diffusion
    across step edges.  A large value reduces the influence of intensity
    gradients on conduction.
    gamma controls speed of diffusion (you usually want it at a maximum of
    0.25)
    step is used to scale the gradients in case the spacing between adjacent
    pixels differs in the x and y axes
    Diffusion equation 1 favours high contrast edges over low contrast ones.
    Diffusion equation 2 favours wide regions over smaller ones.
    Reference:
    P. Perona and J. Malik.
    Scale-space and edge detection using anisotropic diffusion.
    IEEE Transactions on Pattern Analysis and Machine Intelligence,
    12(7):629-639, July 1990.
    Original MATLAB code by Peter Kovesi
    School of Computer Science & Software Engineering
    The University of Western Australia
    pk @ csse uwa edu au
    <http://www.csse.uwa.edu.au>
    Translated to Python and optimised by Alistair Muldal
    Department of Pharmacology
    University of Oxford
    <alistair.muldal@pharm.ox.ac.uk>
    June 2000  original version.
    March 2002 corrected diffusion eqn No 2.
    July 2012 translated to Python
    April 2019 - Corrected for Python 3.7 - AvW 
    """

    # initialize output array
    imgout = img.copy()

    # initialize some internal variables
    deltaS = np.zeros_like(imgout)
    deltaE = deltaS.copy()
    E = deltaS.copy()
    S = deltaS.copy()
    NS = deltaS.copy()
    EW = deltaS.copy()
    gS = np.ones_like(imgout)
    gE = gS.copy()

    for ii in range(niter):

        # calculate the diffs
        deltaS[:-1, :] = np.diff(imgout.T).T
        deltaE[:, :-1] = np.diff(imgout)

        # conduction gradients (only need to compute one per dim!)
        if option == 1:
            gS[:, :] = np.exp(- np.square(np.divide(deltaS, kappa))) / step[0]
            gE[:, :] = np.exp(- np.square(np.divide(deltaE, kappa))) / step[1]
        elif option == 2:
            gS[:, :] = 1./(1. + np.square(np.divide(deltaS, kappa))) / step[0]
            gE[:, :] = 1./(1. + np.square(np.divide(deltaE, kappa))) / step[1]

        # update matrices
        E[:, :] = gE*deltaE
        S[:, :] = gS*deltaS

        # subtract a copy that has been shifted 'North/West' by one
        # pixel. don't ask questions. just do it. trust me.
        NS[:] = S
        EW[:] = E
        NS[1:,:] -= S[:-1,:]
        EW[:,1:] -= E[:,:-1]

        # update the image
        imgout += gamma*(NS+EW)

    return imgout

I’m using it in a Jupyter notebook, calling and timing it with the following cell:

%%timeit
_ = anisodiff(np.random.rand(4096, 128).astype(np.float32), niter=10, kappa=10, gamma=1, step=np.asarray((.5, .5), dtype=np.float32), option=1)

I took care to run the above command once before timing, so that the cached compiled function was being timed rather than the compilation itself. I also tried to make sure the code is type-stable (I need all my computations done in float32). I was surprised to see no improvement over the NumPy version at all; in fact, performance decreases significantly with Numba.
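Concretely, the warm-up cell that runs before the timing looks like this:

_ = anisodiff(np.random.rand(4096, 128).astype(np.float32), niter=10, kappa=10, gamma=1, step=np.asarray((.5, .5), dtype=np.float32), option=1)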

This leads to two questions:

  1. What is the best way to profile Numba code and identify bottlenecks?
  2. Could someone look at the code above and tell me whether something obvious is making the Numba version much slower than the NumPy version?

Any help appreciated!
-Guillaume

Very interesting. Initially I thought the cause might be the small number of iterations (only 10), but the difference is still there with niter=1000.
Finding which lines cause the difference might require deleting one line at a time and comparing the speed.
I don’t see anything obviously wrong or slow in the code. Some of the copies could probably be avoided, but even then the change would affect the NumPy and Numba versions equally.
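A sketch of that bisection approach (diffs_only is a hypothetical name; it keeps only the diff step from the loop body, so it can be timed in isolation and compared against the same two lines in plain NumPy):

from numba import njit
import numpy as np

# Stripped-down variant for bisecting: everything except the diff step
# has been deleted from the iteration loop. The result is unused; we only
# care about how long the diffs themselves take.
@njit
def diffs_only(imgout, deltaS, deltaE, niter):
    for _ in range(niter):
        deltaS[:-1, :] = np.diff(imgout.T).T
        deltaE[:, :-1] = np.diff(imgout)

Repeating this with each step (diffs, conduction, update) kept in turn should point at the expensive line.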

I was able to get some improvement by using explicit loops, but I couldn’t match NumPy’s speed.
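Roughly along these lines (a sketch of the explicit-loop idea for option 1 only; anisodiff_loops is a made-up name, and this is not the exact code I benchmarked):

from numba import njit
import numpy as np

@njit
def anisodiff_loops(img, niter=1, kappa=50., gamma=0.1, step=(1., 1.)):
    imgout = img.copy()
    rows, cols = imgout.shape
    # gS/gE hold conduction * gradient; the last row/column are never
    # written and stay zero, matching the zero-padded deltaS/deltaE above
    gS = np.zeros_like(imgout)
    gE = np.zeros_like(imgout)
    for _ in range(niter):
        # south differences and conduction (Perona-Malik equation 1)
        for i in range(rows - 1):
            for j in range(cols):
                dS = imgout[i + 1, j] - imgout[i, j]
                gS[i, j] = np.exp(-(dS / kappa) ** 2) / step[0] * dS
        # east differences and conduction
        for i in range(rows):
            for j in range(cols - 1):
                dE = imgout[i, j + 1] - imgout[i, j]
                gE[i, j] = np.exp(-(dE / kappa) ** 2) / step[1] * dE
        # subtract the copy shifted north/west by one pixel, then update
        for i in range(rows):
            for j in range(cols):
                ns = gS[i, j] - (gS[i - 1, j] if i > 0 else 0.)
                ew = gE[i, j] - (gE[i, j - 1] if j > 0 else 0.)
                imgout[i, j] += gamma * (ns + ew)
    return imgout

This avoids the intermediate arrays that np.diff, np.square, np.divide, and np.exp allocate on every iteration, yet it still didn’t catch up with NumPy for me.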

Interestingly, with float64 Numba is slightly faster. When switching to float32, Numba’s version becomes 2x faster, but NumPy’s gets 4x faster. I think that’s a hint at what’s happening, but I’m not sure what the exact problem is.
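Roughly how I compared them (a sketch, not my exact harness; anisodiff_numpy is a hypothetical name for the same function without the @jit decorator):

import timeit
import numpy as np

for dtype in (np.float64, np.float32):
    img = np.random.rand(4096, 128).astype(dtype)
    step = np.asarray((.5, .5), dtype=dtype)
    for name, fn in (('numba', anisodiff), ('numpy', anisodiff_numpy)):
        fn(img, niter=1, kappa=10, gamma=1, step=step, option=1)  # warm-up
        t = timeit.timeit(
            lambda: fn(img, niter=10, kappa=10, gamma=1, step=step, option=1),
            number=5)
        print(name, np.dtype(dtype).name, t / 5)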

Thanks so much for looking into this!