Prange cannot avoid race condition!

Hi,

I am trying to use numba to parallelize an operation on 2d array, where each row can be handled separately. However, it seems prange causes wrong result. I took a look at some related topic on Numba’s github page and here, but I couldn’t find my answer (or, maybe there exist some related questions but I didn’t understand them!!!)

Below, I am providing a simple test so that it can be reproduced.

# python= 3.8.5, # numpy= 1.20.3, # numba 0.54.1
def naive_func(PA, PB, IA, IB):
    """
    Merge two 2-dim arrays PA and PB, each with shape (n, m).
    The values in each row is sorted in ascending ordrer. 
    This function update PA[i] by finding the top-m smallest values
    in both PA[i] and PB[i], while prioritizing the values of PA[i].
    
    Example:
    For a row in PA and its corresponding row in PB:
    Merging [0, 1, 2, 3, 4] and [0', 3', 5, 7, 10] is: [0, 0', 1, 2, 3]  
    (Note: the sign prime is used to just distinguish between the two zeros and two threes in two arrays 
     PA and PB)

    Then we update IA based on the rearrangement that occured on 
    the merging process of PA and PB.
    
    Parameters
    ----------
    PA : numpy.ndarray
        a 2-dim array, with shape (n, m), each row sorted in ascending order

    PB : numpy.ndarray
        a 2-dim array, with shape (n, m), each row sorted in ascending order

    IA : numpy.ndarray
        a 2-dim array, with shape (n, m)

    IB : numpy.ndarray
        a 2-dim array, with shape (n, m)

    Returns
    -------
    None
    """
    n, m = PA.shape
    
    profile = np.column_stack((PA, PB))
    indices = np.column_stack((IA, IB))
        
    idx = np.argsort(profile, axis=1)
    profile = np.take_along_axis(profile, idx, axis=1)
    indices = np.take_along_axis(indices, idx, axis=1)

    PA[:, :] = profile[:, : m]
    IA[:, :] = profile[:, : m]

################
@njit(parallel=True)
def performant_func(PA, PB, IA, IB):
    """
    n, m = PA.shape
    
    for i in prange(n): # Updating rows in parallel mode
        for j in range(m): # for each row, I think this inner loop is executed serially.
            if PB[i, j] < PA[i, -1]:
                idx = np.searchsorted(PA[i], PB[i, j], side="right")
                
                PA[i, idx + 1 :] = PA[i, idx:-1]
                PA[i, idx] = PB[i, j]

                IA[i, idx + 1 :] = IA[i, idx:-1]
                IA[i, idx] = IB[i, j]
                
################
def test_func():
    PA = np.array(
        [[0, 3, 3, 3, 4],
       [0, 1, 2, 3, 4]]
    )
    
    IA = np.array(
        [[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]]
    )
    
    
    PB = np.array(
        [[0, 1, 3, 3, 4],
       [0, 0, 1, 4, 4]]
    )
    
    IB = np.array(
        [[250, 251, 252, 253, 254],
       [255, 256, 257, 258, 259]]
    )
    

    ref_P = PA.copy()
    ref_I = IA.copy()

    comp_P = PA.copy()
    comp_I = IA.copy()

    naive_func(ref_P, PB, ref_I, IB)
    performant_func(comp_P, PB, comp_I, IB)

    ref = np.column_stack((ref_P, ref_I))
    comp = np.column_stack((comp_P, comp_I))

    npt.assert_array_equal(ref, comp)
        
test_func()

>>>
AssertionError: 
Arrays are not equal

Mismatched elements: 9 / 20 (45%)
Max absolute difference: 256
Max relative difference: 2.
 x: array([[0, 0, 1, 3, 3, 0, 0, 1, 3, 3],
       [0, 0, 0, 1, 1, 0, 0, 0, 1, 1]])
 y: array([[  0,   0,   1,   3,   3,   0, 250, 251,   1,   2],
       [  0,   0,   0,   1,   1,   5, 255, 256,   6,   7]])

One solution is to add .copy() operation when updating slice of PA and IA, as follows:

PA[i, idx + 1 :] = PA[i, idx:-1].copy()
###
IA[i, idx + 1 :] = IA[i, idx:-1].copy()
###

However, theoretically speaking, there should be no problem with `prange` even without use of `.copy()`. 

Please let me know if further information is required.