Lexsort-like operation using Numba Cuda?

I have two 3-dimensional arrays called the ‘Feature_Arr’ and ‘Centroids_Arr’.

Each 2d array in the Feature_Arr is like:

[[-0.16578992,  1.302     ,  1.        ,  1.        ],
 [ 0.        ,  0.        ,  0.        ,  1.        ],
 [ 0.00392716,  0.2551    ,  1.        ,  0.        ],
 [ 0.16186276,  1.1104    ,  1.        ,  1.        ]]

Each 2d array in the Centroid_Arr is like:

[[ -4.3150377 ,   2.586869  ,   0.9627161 ],
 [ -1.0964384 ,   1.350665  ,   0.28387254],
 [ -6.1319566 ,  -2.8796217 ,  -1.9866784 ],
 [ -8.299599  ,  -7.908891  ,  -4.6172056 ]]

What I want:

I’d like to sort each 2D Feature_Arr by the ascending of the first column, if there are equal numbers, then sorting by the corresponding numbers from the second column. Then swap the rows of both Feature_Arr and Centroid_Arr based on the sorting result of the Feature_Arr .

This is what I’ve tried:

import numpy as np
from numba import cuda

@cuda.jit(device=True)
def sort_2d_arrays(feature_arr, centroids_arr):
    for i in range(feature_arr.shape[0] - 1):
        for j in range(feature_arr.shape[0] - i - 1):
            if (feature_arr[j, 0] > feature_arr[j + 1, 0]) or \
               (feature_arr[j, 0] == feature_arr[j + 1, 0] and 
                feature_arr[j, 1] > feature_arr[j + 1, 1]):
                for k in range(feature_arr.shape[1]):
                    temp = feature_arr[j, k]
                    feature_arr[j, k] = feature_arr[j + 1, k]
                    feature_arr[j + 1, k] = temp
                for k in range(centroids_arr.shape[1]):
                    temp = centroids_arr[j, k]
                    centroids_arr[j, k] = centroids_arr[j + 1, k]
                    centroids_arr[j + 1, k] = temp
        
@cuda.jit
def sort_features_and_centroids(pharmacophores, centroids):
    x = cuda.grid(1)
    if x < pharmacophores.shape[0]:
        sort_2d_arrays(pharmacophores[x], centroids[x])
d_features = cuda.to_device(test_features)
d_centroid_coords = cuda.to_device(test_coords)

threadsperblock = (16, 16)
blockspergrid_x = int(np.ceil(test_features.shape[0] / threadsperblock[0]))
blockspergrid_y = int(np.ceil(test_features.shape[1] / threadsperblock[1]))
blockspergrid = (blockspergrid_x, blockspergrid_y)

sort_features_and_centroids[blockspergrid, threadsperblock](d_features, d_centroid_coords)

sorted_features = d_features.copy_to_host()
sorted_centroid_coords = d_centroid_coords.copy_to_host()

However, some of 2D arrays were not correctly sorted, for example:

One of the original 2D arrays:

 [[ 0.14940944  1.5281      1.          1.        ]
  [ 0.          0.          0.          0.        ]
  [-0.2631746   0.8732      1.          1.        ]
  [ 0.1137652   0.6502      0.          1.        ]]

After being sorted:

[[-0.2631746  0.         0.         0.       ]
 [ 0.         0.8732     1.         1.       ]
 [-0.2631746  0.8732     1.         1.       ]
 [ 0.1137652  0.6502     0.         1.       ]]

one row was copied to another row but no swap. And some other arrays were just incorrectly sorted.

Would greatly appreciate if you could help.