How to compute a Wasserstein distance matrix via a Numba/CUDA kernel?

I’m interested in rewriting the following procedure (source) so that it is compatible with Numba’s @cuda.jit decorator and can run on a GPU, with the goal of improving runtime performance through parallelization. I plan to use it in conjunction with a centroid distance to build sparse distance matrices on the order of 1e6 x 1e6 or larger (roughly 1e8 distance calculations in total, using 100 nearest neighbors per row). I’m looking for something very fast with minimal effort (said every data scientist).

import numpy as np

def _cdf_distance(p, u_values, v_values, u_weights=None, v_weights=None):

    u_sorter = np.argsort(u_values)
    v_sorter = np.argsort(v_values)

    all_values = np.concatenate((u_values, v_values))
    all_values.sort(kind='mergesort')

    # Compute the differences between pairs of successive values of u and v.
    deltas = np.diff(all_values)

    # Get the respective positions of the values of u and v among the values of
    # both distributions.
    u_cdf_indices = u_values[u_sorter].searchsorted(all_values[:-1], 'right')
    v_cdf_indices = v_values[v_sorter].searchsorted(all_values[:-1], 'right')

    # Calculate the CDFs of u and v using their weights, if specified.
    if u_weights is None:
        u_cdf = u_cdf_indices / u_values.size
    else:
        u_sorted_cumweights = np.concatenate(([0], np.cumsum(u_weights[u_sorter])))
        u_cdf = u_sorted_cumweights[u_cdf_indices] / u_sorted_cumweights[-1]

    if v_weights is None:
        v_cdf = v_cdf_indices / v_values.size
    else:
        v_sorted_cumweights = np.concatenate(([0], np.cumsum(v_weights[v_sorter])))
        v_cdf = v_sorted_cumweights[v_cdf_indices] / v_sorted_cumweights[-1]

    # Compute the value of the integral based on the CDFs.
    # If p = 1 or p = 2, we avoid using np.power, which introduces an overhead
    # of about 15%.
    if p == 1:
        return np.sum(np.multiply(np.abs(u_cdf - v_cdf), deltas))
    if p == 2:
        return np.sqrt(np.sum(np.multiply(np.square(u_cdf - v_cdf), deltas)))
    return np.power(np.sum(np.multiply(np.power(np.abs(u_cdf - v_cdf), p),
                                       deltas)), 1/p)
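
For p = 1, the result should match scipy.stats.wasserstein_distance (this routine appears to be the internal helper behind it), which gives a quick way to validate any rewrite. A minimal check, with placeholder test data:

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(42)   # placeholder test data
u = rng.random(500)
v = rng.random(300)

assert np.isclose(_cdf_distance(1, u, v), wasserstein_distance(u, v))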

The general steps are as follows (a loop-based sketch that fuses them into a single pass follows the list):

  • Sort the individual vectors and the combined vector
  • Compute the finite differences between successive values of the combined vector
  • Find the positions of each sorted vector's values within the combined vector
  • Calculate the cumulative distribution function (CDF) of each vector
  • Integrate the absolute difference of the CDFs (raised to the power p)
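
As far as I can tell, the unweighted case of these steps can be fused into a single two-pointer merge over the two pre-sorted inputs, which removes the concatenation, the searchsorted calls, and the temporary arrays. A rough sketch of such a device function (the name cdf_distance_sorted is mine; it assumes both inputs are non-empty and already sorted ascending, and the weighted case would still need the cumulative-weight variant):

import math
from numba import cuda


@cuda.jit(device=True)
def cdf_distance_sorted(p, u, v):
    # Unweighted CDF distance for two pre-sorted 1D arrays. Fuses the
    # concatenate / diff / searchsorted steps into one merge pass, so no
    # scratch arrays are needed inside the kernel.
    nu = u.shape[0]
    nv = v.shape[0]
    i = 0
    j = 0
    total = 0.0
    prev = u[0] if u[0] < v[0] else v[0]   # smallest value of the merged sequence
    while i < nu or j < nv:
        # Next value in the (virtual) merged, sorted sequence.
        if j >= nv or (i < nu and u[i] <= v[j]):
            cur = u[i]
            take_u = True
        else:
            cur = v[j]
            take_u = False
        # Area between the two empirical CDFs over [prev, cur).
        d = cur - prev
        if d > 0.0:
            diff = abs(i / nu - j / nv)
            if p == 1:
                total += diff * d
            elif p == 2:
                total += diff * diff * d
            else:
                total += math.pow(diff, p) * d
        if take_u:
            i += 1
        else:
            j += 1
        prev = cur
    if p == 1:
        return total
    if p == 2:
        return math.sqrt(total)
    return math.pow(total, 1.0 / p)

The same body should also compile under numba.njit, which would make it easy to unit-test on the CPU against _cdf_distance before moving to the device.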

One of the main problems I’ve run into is that the procedure deals with multiple arrays of several different sizes, and I may have been thinking about this in a way that doesn’t map well onto GPU computing, i.e. trying to find or write replacements for NumPy functions that are compatible with the @cuda.jit() decorator without severe performance degradation (though I don’t seem to be alone in this [1]). As I worked toward implementing this approach, I quickly found that it required many cuda.local.array() calls (15 in total per call to _cdf_distance), and I imagine this could bog things down.
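
An alternative I’m considering, to avoid the local arrays entirely, is a kernel structured roughly like the sketch below: one thread per (sample, neighbor) pair, reading rows that were sorted on the host, with a fused device function like cdf_distance_sorted above doing the per-pair work. All names here (pairwise_wasserstein_kernel, X_sorted, neighbor_idx) are placeholders, and it assumes equal-length rows and precomputed nearest-neighbor indices:

from numba import cuda


@cuda.jit
def pairwise_wasserstein_kernel(p, X_sorted, neighbor_idx, out):
    # X_sorted     : (n_samples, n_features) device array, rows sorted ascending
    # neighbor_idx : (n_samples, k) indices of each row's k nearest neighbors
    # out          : (n_samples, k) output distances
    row, col = cuda.grid(2)
    if row < neighbor_idx.shape[0] and col < neighbor_idx.shape[1]:
        other = neighbor_idx[row, col]
        out[row, col] = cdf_distance_sorted(p, X_sorted[row], X_sorted[other])


# Untested launch sketch: sort rows on the host, copy to the GPU, then cover
# the (n_samples, k) output with a 2D grid.
# X_sorted_d = cuda.to_device(np.sort(X, axis=1))
# nbr_d = cuda.to_device(neighbor_idx)
# out_d = cuda.device_array(neighbor_idx.shape, dtype=np.float64)
# threads = (16, 16)
# blocks = (math.ceil(X.shape[0] / threads[0]),
#           math.ceil(neighbor_idx.shape[1] / threads[1]))
# pairwise_wasserstein_kernel[blocks, threads](1, X_sorted_d, nbr_d, out_d)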

For this specific task, how would you approach implementing this in Numba while retaining speed?