Implement Euclidean distance as a CUDA kernel

I’m new to Numba and CUDA and I want to implement a fairly straightforward function on a GPU using numba.cuda.jit.

This is the function I am trying to parallelize:

def dist(points, centroid):
    """Computes the squared Euclidean distance between a set of points and a centroid.
       points: 'number of dimensions' x n array
       centroid: 1D array of length 'number of dimensions'
    """
    distances = np.zeros([1, points.shape[1]])
    for i in range(points.shape[0]):
        distances += np.power(points[i, :] - centroid[i], 2)
    return distances
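
For reference, I believe the loop above is equivalent to this vectorized NumPy expression (with points as a dims x n array and centroid a length-dims 1D array):

# vectorized form of dist(); centroid[:, None] broadcasts across the n points
distances = np.sum((points - centroid[:, None]) ** 2, axis=0, keepdims=True)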

What I have come up with so far is this:

@cuda.jit
def dist_GPU1(points, centroid, distances):
    """Version 1"""
    # note that points here is the transpose (points.T) of the points used in the function above
    # Get thread position
    x, y = cuda.grid(2)
    # Check boundaries
    x_bound = points.shape[0]
    y_bound = points.shape[1]
    if x < x_bound and y < y_bound:
        dist = ((points[x, 0] - centroid[0]) ** 2 +
                (points[x, 1] - centroid[1]) ** 2) ** 0.5
        distances[0, x] = dist

The points are in 2D space; however, I plan to generalize this to higher-dimensional spaces later on. Here is the data I am testing with:

p = np.array([[-0.30085331,  0.92613909, -0.00674861, -0.52885546,  0.41127246,
                3.73846658,  3.17136828,  2.88435172,  2.6988963,   1.52147801,
                5.32408397,  4.61491772,  4.323078,    5.61167629,  6.03099952],
              [-3.22084365, -1.7911364,  -3.95967012, -3.32818605, -1.80313876,
                2.6400779,   2.76968061,  3.52856111,  3.17180914,  2.11847992,
               -0.06871988, -1.83921752, -1.30921238, -0.66873657, -0.02445487]])

c = np.array([-0.521854,  0.50810636])

Comparing the two versions:

t1 = time.time()
dists_cpu = dist(p, c)
t2 = time.time()
# t2 - t1 = 0.0009181499481201172

TPB = 32
threadsperblock = (TPB, TPB)
distances = np.zeros((1, p.shape[1]))
grid_size = int(np.ceil(p.T.size / TPB))
blockspergrid = (grid_size, grid_size)
dist_GPU1[blockspergrid, threadsperblock](p.T, c, distances)
t4 = time.time()
# t4 - t2 = 0.17550301551818848

I’m running this in Google Colab, with the following CPU and GPU:

CPU:
model name	: Intel(R) Xeon(R) CPU @ 2.20GHz
model name	: Intel(R) Xeon(R) CPU @ 2.20GHz
GPU:
Wed Feb  3 17:23:37 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   63C    P8    11W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

As you can see, my GPU version is slower, and it gets even slower for my actual problem, which has many more points.
Can someone clarify what the problem with this code is and point me in the right direction?
Also, when I use more points, I run into an error:
CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
What is this error and how can I avoid it?
I guess it’s related to how I implemented the CUDA kernel, but googling the error did not help much, and I’m not familiar enough with CUDA to know how to improve my function.

Any help is appreciated!

Hi @CvG,

I think in the code above you are using a really small amount of data (GPUs tend to do well with lots of data), and you are timing not just the execution on the GPU but also the time taken to transfer the data to and from the GPU.

Example showing the effect of moving data on timings:

from numba import cuda
import numpy as np
import time


@cuda.jit
def dist_GPU1(points, centroid, distances):
    x, y = cuda.grid(2)
    x_bound = points.shape[0]
    y_bound = points.shape[1]
    if x < x_bound and y < y_bound:
        dist = ( (points[x, 0] - centroid[0])**2 +
                 (points[x, 1] - centroid[1])**2 ) ** 0.5
        distances[0, x] = dist


points = np.array([[-0.30085331,  0.92613909, -0.00674861, -0.52885546,
                    0.41127246,  3.73846658, 3.17136828,  2.88435172,
                    2.6988963,   1.52147801,  5.32408397,  4.61491772,
                    4.323078,    5.61167629,  6.03099952],
                   [-3.22084365, -1.7911364,  -3.95967012, -3.32818605,
                    -1.80313876,  2.6400779, 2.76968061,  3.52856111,
                    3.17180914, 2.11847992, -0.06871988, -1.83921752,
                    -1.30921238, -0.66873657, -0.02445487]])

c = np.array([-0.521854,  0.50810636])

expected = (np.sum((points.T - c)**2, axis=1) ** 0.5).reshape((1, -1))

TPB = 32
threadsperblock = (TPB, TPB)
distances = np.zeros((1, points.shape[1]))
grid_size = int(np.ceil(points.T.size / TPB))
blockspergrid = (grid_size, grid_size)

# run once to get the function compiled.
dist_GPU1[blockspergrid, threadsperblock](points.T, c, distances)
# synchronize the device
cuda.synchronize()
# reset "distances" to zeros
distances[:] = 0

# Time running the kernel with data copy included
ts = time.time()
dist_GPU1[blockspergrid, threadsperblock](points.T, c, distances)
cuda.synchronize()
te = time.time()

# check
np.testing.assert_allclose(distances, expected)
print("Run time, including data transfer:", te - ts)

# reset "distances" to zeros
distances[:] = 0
# explicitly copy the NumPy data to the device before doing the work
d_distances = cuda.to_device(distances)
d_points = cuda.to_device(points.T)  # Copy the transpose!
d_c = cuda.to_device(c)
# sync data transfer
cuda.synchronize()

# Time running the kernel on data that's already on the GPU
ts = time.time()
dist_GPU1[blockspergrid, threadsperblock](d_points, d_c, d_distances)
cuda.synchronize()
te = time.time()

# copy data back to the host and check the results are correct
np.testing.assert_allclose(d_distances.copy_to_host(), expected)
print("Run time, not including data transfer:", te - ts)

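As a side note on your other questions: with blockspergrid = (grid_size, grid_size) you launch a 2D grid even though only one index is actually used, so the number of blocks grows much faster than the number of points, and once a grid dimension exceeds the hardware limit (65535 for the y and z dimensions) the launch fails, which I suspect is where your CUDA_ERROR_INVALID_VALUE comes from. A 1D launch with one thread per point avoids that, and looping over the columns inside the kernel would also generalize to any number of dimensions. A rough, untested sketch (dist_GPU_nd is just a name I made up here):

@cuda.jit
def dist_GPU_nd(points, centroid, distances):
    # points is n x dims (i.e. points.T from above), distances is a 1 x n array
    i = cuda.grid(1)
    if i < points.shape[0]:
        acc = 0.0
        # accumulate the squared difference over all dimensions for point i
        for d in range(points.shape[1]):
            diff = points[i, d] - centroid[d]
            acc += diff * diff
        distances[0, i] = acc ** 0.5

This would be launched with something like dist_GPU_nd[blocks, 128](d_points, d_c, d_distances), where blocks = ceil(n / 128).
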
CC @gmarkall is there a resource you’d recommend for introductory material?