Scaling of prange parallelization in a long calculation

Hi everybody,

I’m using Numba to speed up the core calculation in a physical model, which works well in general, as I see around 3x-4x improvements over an optimized NumPy implementation.

What puzzles me is the scaling with the number of threads when using prange parallelization. I understand that there is some overhead involved in multithreading, so I was expecting these statements to hold in general (a minimal skeleton illustrating point 1 follows the list):

  1. Parallelization should be fastest if the outermost loop uses prange.
  2. The less the loop body accesses resources from outside the loop, the higher the performance.
  3. The longer the calculation within each loop iteration takes, the better it should scale, as the overhead of transferring memory and other resources becomes less relevant.
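
For concreteness, here is a minimal skeleton of what I mean by point 1 (the function and arrays below are arbitrary, not my model code); only the outermost loop is parallelized:

import numpy as np
import numba

# Skeleton for point 1: prange on the outermost loop only,
# the inner loop stays serial.
@numba.njit(parallel=True, cache=True)
def outer_prange(data, out):
    for i in numba.prange(data.shape[0]):   # parallel over points
        acc = 0.0
        for j in range(data.shape[1]):      # serial inner work
            acc += data[i, j] * data[i, j]
        out[i] = acc
    return out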

Now I have two functions with very different calculation complexity but similar structure. Each loops over x-points in an array and has inner loops that depend on the input model size.

The first one is relatively simple and takes around 5 ms to compute on a single thread (data N=400, model N=200); the second one does more complex matrix calculations with several local matrix variables within the loop and takes around 30 ms for the same input.

What I do not understand is that the first function scales well with the number of threads, needing around 1/9th of the single-thread time on 16 hyper-threaded cores, with an efficiency that goes down monotonically as threads are added. The complex function, on the other hand, never gets much below 1/3rd of the single-thread time, with most of the gains already present at 3 threads.
Interestingly, the behavior is exactly the same if I decrease the time spent within each loop iteration (data N=4000, model N=20).

As the functions are a bit long to post here with correct typesetting, you can find them on GitHub:
function_1 (ReflQNB)
function_2 (ReflNBSigma)

I’d appreciate any suggestions on how to find out what causes the second function to scale so badly, although it would be expected to scale better than the first.

A full working example with some arbitrary inputs (it just has to work) would be good.

Some ideas to improve performance (maybe not directly related to scaling):

  • You are declaring the inputs like `numba.complex128[:, :]`. This means that they can be non-contiguous, which usually has some impact on performance. When declaring types explicitly, it is recommended to also include a signature with contiguous arrays, e.g. `numba.complex128[:, ::1]` (a minimal sketch after this list combines this with `error_model`).

  • Always think twice about whether you really have to allocate memory. For example, this may not be necessary in the dot4 function. Very small functions can also be inlined with `inline="always"`.

  • Also think about whether you really want to check for zero-division errors, which can slow down the code; `error_model="numpy"` disables this check.
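
A minimal sketch of the first and third suggestions combined (the function body below is an arbitrary example, not code from the model):

import numba

# Explicit signature with C-contiguous arrays ([:, ::1] instead of [:, :]);
# error_model="numpy" removes the zero-division check on the division below.
@numba.njit(numba.complex128[:, ::1](numba.complex128[:, ::1], numba.complex128[:, ::1]),
            error_model="numpy", cache=True)
def elementwise_div(A, B):
    return A / B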

Thanks for your reply, @max9111.
I added your suggestions, but unfortunately they did not lead to noticeable differences in run time.
If I understand correctly, the declaration of `::1` is only needed for arrays with more than one dimension, right?

I’ve uploaded a test example with timing here.

I guess I have found the main bottleneck. I assume you were on the right track, because you have written out the dot product in the function dot4. The problem there is the allocation of the output array, which costs quite some time and doesn’t scale with the number of cores.

The following function is simpler and should provide more or less the same performance; a BLAS call for 4x4 matrices is usually slower than a manual implementation. Please note that the output array is also passed to the function rather than allocated inside.

The main idea is to avoid memory allocations as much as possible. In your case you can allocate a temporary array only once, like you did with the array PX. We also need a function to copy one array into another.

import numba

@numba.njit(cache=True, inline="always", fastmath=True)
def dot4(A, B, D):
    # Hand-written 4x4 matrix product; the result is written into the
    # preallocated output D instead of allocating a new array.
    for i in range(4):
        for j in range(4):
            acc = 0
            for k in range(4):
                acc += A[i, k] * B[k, j]
            D[i, j] = acc
    return D

@numba.njit(inline="always")
def copyto(dest, source):
    # Element-wise 4x4 copy into the preallocated destination.
    for i in range(4):
        for j in range(4):
            dest[i, j] = source[i, j]
    return dest

The part that calls dot4 repeatedly can now look as follows:

        ##### ass_P ####
        # TMP is a (4, 4) scratch array allocated once further up,
        # e.g. TMP = np.empty((4, 4), dtype=np.complex128)
        M = PX[-2]  # view into PX; copyto below writes back into it
        for linv in range(layers - 3):
            # Multiply up the sample matrix
            TMP = dot4(M, PX[layers - 3 - linv], TMP)
            M = copyto(M, TMP)
        M = dot4(X, M, TMP)
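
For reference, here is a minimal sketch of how these pieces could fit together inside the parallel loop, using dot4 and copyto from above; the names X, PX_all, out and the shapes are placeholders I made up, not the original ReflNBSigma code:

import numpy as np
import numba

@numba.njit(parallel=True, cache=True)
def assemble_all(X, PX_all, out, layers):
    for p in numba.prange(out.shape[0]):
        PX = PX_all[p]                               # (layers, 4, 4) scratch per point
        TMP = np.empty((4, 4), dtype=np.complex128)  # allocated once per point, like PX
        M = PX[-2]                                   # view; copyto writes back into it
        for linv in range(layers - 3):
            dot4(M, PX[layers - 3 - linv], TMP)
            copyto(M, TMP)
        dot4(X, M, TMP)
        out[p] = TMP[0, 0]                           # arbitrary reduction for the sketch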