Euclidean distance in toroidal geometry: serial vs parallel

I am trying to calculate distance between two sets of 3D vectors in a toroidal (periodic boundaries) geometry:

#@njit('float32[:](float32[:,:], float32[:,:], float32[:])',cache=True,fastmath=True)
@njit(cache=True, parallel=False, fastmath=True, boundscheck=False, nogil=True)
def pair_vec_dist_test(pIList,pJList,cellsize):
    #assert pIList.shape[1] == pJList.shape[1]
    #assert cellsize.size == 3

    dr2 = np.empty(int(pIList.shape[0]*pJList.shape[0]),dtype=np.float32)
    inv_cellsize = 1.0 / cellsize

    for i in numba.prange(pIList.shape[0]):
        for j in range(pJList.shape[0]):
            index = j + pJList.shape[0] * i
            xdist = pJList[j,0]-pIList[i,0]
            ydist = pJList[j,1]-pIList[i,1]
            zdist = pJList[j,2]-pIList[i,2]
            xk =  xdist-cellsize[0]*np.rint(xdist*inv_cellsize[0])
            yk =  ydist-cellsize[1]*np.rint(ydist*inv_cellsize[1])
            zk =  zdist-cellsize[2]*np.rint(zdist*inv_cellsize[2])
            dr2[index] = xk**2+yk**2+zk**2

    return dr2


N1, N2, dim = 700, 1000,  3  #  particles in 3D
cellsize=np.array([26.4,26.4,70.0]).astype("float32")
vecI = np.random.random((N1, dim)).astype("float32")
vecJ = np.random.random((N2, dim)).astype("float32")
pair_vec_dist_test(vecI,vecJ,cellsize)

I am trying to optimize this function as much as possible and so I have unrolled the loops, and applied the fastmath and nogil options. When I tried for parallel option, my exec time jumped from 472 microsec to 11ms which is a bit surprising. Can someone help me understand what has gone wrong in this scenario?

I am also looking for further optimizations. I was suggested to make the code cache-friendly by loop tiling/blocking though I couldn’t see any significant changes. If you have any suggestions in mind, please let me know.