How to compute a Wasserstein distance matrix via a Numba/CUDA kernel?

I’m interested in rewriting the following procedure (source) so that it is compatible with Numba’s @cuda.jit decorator and can run on a GPU, with the goal of improving runtime performance through parallelization. I plan to use it in conjunction with a centroid distance to build sparse distance matrices on the order of 1e6 x 1e6 or larger (roughly 1e8 distance calculations in total, using 100 nearest neighbors per row). I’m looking for something very fast with minimal effort (said every data scientist).

import numpy as np

def _cdf_distance(p, u_values, v_values, u_weights=None, v_weights=None):

    u_sorter = np.argsort(u_values)
    v_sorter = np.argsort(v_values)

    all_values = np.concatenate((u_values, v_values))
    all_values.sort(kind='mergesort')

    # Compute the differences between pairs of successive values of u and v.
    deltas = np.diff(all_values)

    # Get the respective positions of the values of u and v among the values of
    # both distributions.
    u_cdf_indices = u_values[u_sorter].searchsorted(all_values[:-1], 'right')
    v_cdf_indices = v_values[v_sorter].searchsorted(all_values[:-1], 'right')

    # Calculate the CDFs of u and v using their weights, if specified.
    if u_weights is None:
        u_cdf = u_cdf_indices / u_values.size
    else:
        u_sorted_cumweights = np.concatenate(([0], np.cumsum(u_weights[u_sorter])))
        u_cdf = u_sorted_cumweights[u_cdf_indices] / u_sorted_cumweights[-1]

    if v_weights is None:
        v_cdf = v_cdf_indices / v_values.size
    else:
        v_sorted_cumweights = np.concatenate(([0], np.cumsum(v_weights[v_sorter])))
        v_cdf = v_sorted_cumweights[v_cdf_indices] / v_sorted_cumweights[-1]

    # Compute the value of the integral based on the CDFs.
    # If p = 1 or p = 2, we avoid using np.power, which introduces an overhead
    # of about 15%.
    if p == 1:
        return np.sum(np.multiply(np.abs(u_cdf - v_cdf), deltas))
    if p == 2:
        return np.sqrt(np.sum(np.multiply(np.square(u_cdf - v_cdf), deltas)))
    return np.power(np.sum(np.multiply(np.power(np.abs(u_cdf - v_cdf), p),
                                       deltas)), 1/p)
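
For p = 1, the result should match scipy.stats.wasserstein_distance (this routine appears to be the internal helper behind it), which gives a quick way to validate any rewrite. A minimal check, with placeholder test data:

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(42)   # placeholder test data
u = rng.random(500)
v = rng.random(300)

assert np.isclose(_cdf_distance(1, u, v), wasserstein_distance(u, v))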

The general steps are as follows (a loop-based sketch that fuses them into a single pass follows the list):

  • Sort the individual vectors and the combined vector
  • Compute the finite differences between successive values of the combined vector
  • Find the positions of each sorted vector's values within the combined vector
  • Calculate the cumulative distribution function (CDF) of each vector
  • Integrate the absolute difference of the CDFs (raised to the power p)
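
As far as I can tell, the unweighted case of these steps can be fused into a single two-pointer merge over the two pre-sorted inputs, which removes the concatenation, the searchsorted calls, and the temporary arrays. A rough sketch of such a device function (the name cdf_distance_sorted is mine; it assumes both inputs are non-empty and already sorted ascending, and the weighted case would still need the cumulative-weight variant):

import math
from numba import cuda


@cuda.jit(device=True)
def cdf_distance_sorted(p, u, v):
    # Unweighted CDF distance for two pre-sorted 1D arrays. Fuses the
    # concatenate / diff / searchsorted steps into one merge pass, so no
    # scratch arrays are needed inside the kernel.
    nu = u.shape[0]
    nv = v.shape[0]
    i = 0
    j = 0
    total = 0.0
    prev = u[0] if u[0] < v[0] else v[0]   # smallest value of the merged sequence
    while i < nu or j < nv:
        # Next value in the (virtual) merged, sorted sequence.
        if j >= nv or (i < nu and u[i] <= v[j]):
            cur = u[i]
            take_u = True
        else:
            cur = v[j]
            take_u = False
        # Area between the two empirical CDFs over [prev, cur).
        d = cur - prev
        if d > 0.0:
            diff = abs(i / nu - j / nv)
            if p == 1:
                total += diff * d
            elif p == 2:
                total += diff * diff * d
            else:
                total += math.pow(diff, p) * d
        if take_u:
            i += 1
        else:
            j += 1
        prev = cur
    if p == 1:
        return total
    if p == 2:
        return math.sqrt(total)
    return math.pow(total, 1.0 / p)

The same body should also compile under numba.njit, which would make it easy to unit-test on the CPU against _cdf_distance before moving to the device.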

One of the main problems I’ve run into is that the procedure deals with multiple arrays of several different sizes, and I may have been thinking about this in a way that doesn’t map well onto GPU computing, i.e. trying to find or write replacements for NumPy functions that are compatible with the @cuda.jit() decorator without severe performance degradation (though I don’t seem to be alone in this [1]). As I worked toward implementing this approach, I quickly found that it required many cuda.local.array() calls (15 in total per call to _cdf_distance), and I imagine this could bog things down.
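
An alternative I’m considering, to avoid the local arrays entirely, is a kernel structured roughly like the sketch below: one thread per (sample, neighbor) pair, reading rows that were sorted on the host, with a fused device function like cdf_distance_sorted above doing the per-pair work. All names here (pairwise_wasserstein_kernel, X_sorted, neighbor_idx) are placeholders, and it assumes equal-length rows and precomputed nearest-neighbor indices:

from numba import cuda


@cuda.jit
def pairwise_wasserstein_kernel(p, X_sorted, neighbor_idx, out):
    # X_sorted     : (n_samples, n_features) device array, rows sorted ascending
    # neighbor_idx : (n_samples, k) indices of each row's k nearest neighbors
    # out          : (n_samples, k) output distances
    row, col = cuda.grid(2)
    if row < neighbor_idx.shape[0] and col < neighbor_idx.shape[1]:
        other = neighbor_idx[row, col]
        out[row, col] = cdf_distance_sorted(p, X_sorted[row], X_sorted[other])


# Untested launch sketch: sort rows on the host, copy to the GPU, then cover
# the (n_samples, k) output with a 2D grid.
# X_sorted_d = cuda.to_device(np.sort(X, axis=1))
# nbr_d = cuda.to_device(neighbor_idx)
# out_d = cuda.device_array(neighbor_idx.shape, dtype=np.float64)
# threads = (16, 16)
# blocks = (math.ceil(X.shape[0] / threads[0]),
#           math.ceil(neighbor_idx.shape[1] / threads[1]))
# pairwise_wasserstein_kernel[blocks, threads](1, X_sorted_d, nbr_d, out_d)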

For this specific task, how would you approach implementing this in Numba while retaining speed?