I’m new to numba and cuda and I want to implement a fairly straightforward function on a GPU using `numba.cuda.jit`

This is the function I am trying to parallelize:

```
import numpy as np

def dist(points, centroid):
    """Computes the squared Euclidean distance between a set of points and a centroid.
    points: 'number of dimensions' * n array
    centroid: 1 * 'number of dimensions' array
    """
    distances = np.zeros([1, points.shape[1]])
    for i in range(points.shape[0]):
        distances += np.power(points[i, ] - centroid[i], 2)
    return distances
```
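For concreteness, a small usage example of the function above (the sample values are my own, chosen so the expected result is easy to verify by hand):

```
import numpy as np

# the dist function from above, repeated so this snippet is self-contained
def dist(points, centroid):
    distances = np.zeros([1, points.shape[1]])
    for i in range(points.shape[0]):
        distances += np.power(points[i, ] - centroid[i], 2)
    return distances

# two points in 2D, (0, 0) and (3, 4), stored as a dims x n array
points = np.array([[0.0, 3.0],
                   [0.0, 4.0]])
centroid = np.array([0.0, 0.0])

print(dist(points, centroid))  # squared distances: [[ 0. 25.]]
```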

What I have come up with so far is this:

```
from numba import cuda

@cuda.jit
def dist_GPU1(points, centroid, distances):
    """Version 1"""
    # note that points here is a transposed version of points (points.T),
    # in contrast to the function above
    # get thread position
    x, y = cuda.grid(2)
    # check boundaries
    x_bound = points.shape[0]
    y_bound = points.shape[1]
    if x < x_bound and y < y_bound:
        dist = ((points[x, 0] - centroid[0]) ** 2 + (points[x, 1] - centroid[1]) ** 2) ** 0.5
        distances[0, x] = dist
    return  # do nothing
```

The points are in 2D space; however, I plan to generalize this to higher dimensions later on.

```
p = np.array([[-0.30085331, 0.92613909, -0.00674861, -0.52885546, 0.41127246,
               3.73846658, 3.17136828, 2.88435172, 2.6988963, 1.52147801,
               5.32408397, 4.61491772, 4.323078, 5.61167629, 6.03099952],
              [-3.22084365, -1.7911364, -3.95967012, -3.32818605, -1.80313876,
               2.6400779, 2.76968061, 3.52856111, 3.17180914, 2.11847992,
               -0.06871988, -1.83921752, -1.30921238, -0.66873657, -0.02445487]])
c = np.array([-0.521854, 0.50810636])
```
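As a sanity check, the loop-based function should agree with a vectorized NumPy one-liner on data shaped like this (the one-liner is my own, not part of the code above):

```
import numpy as np

# the loop-based function from above
def dist(points, centroid):
    distances = np.zeros([1, points.shape[1]])
    for i in range(points.shape[0]):
        distances += np.power(points[i, ] - centroid[i], 2)
    return distances

# a few sample points in the same dims x n layout
p = np.array([[-0.30085331, 0.92613909, -0.00674861],
              [-3.22084365, -1.7911364, -3.95967012]])
c = np.array([-0.521854, 0.50810636])

# broadcasting: subtract the centroid per dimension, square, sum over dimensions
vectorized = ((p - c[:, None]) ** 2).sum(axis=0, keepdims=True)
assert np.allclose(dist(p, c), vectorized)
```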

Comparing the two versions:

```
import time

t1 = time.time()
dists_cpu = dist(p, c)
t2 = time.time()
# t2 - t1 = 0.0009181499481201172
distances = np.zeros((1, p.shape[1]))
TPB = 32
threadsperblock = (TPB, TPB)
grid_size = int(np.ceil(p.T.size / TPB))
blockspergrid = (grid_size, grid_size)
dist_GPU1[blockspergrid, threadsperblock](p.T, c, distances)
t3 = time.time()
# t3 - t2 = 0.17550301551818848
```
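As an aside, the timing itself may be misleading (this is my assumption about what it measures): the first call to a `@cuda.jit` kernel also pays the JIT-compilation cost, and kernel launches return before the GPU has finished, so a bare `time.time()` around one launch can measure almost anything. A sketch of a warm-up-then-average harness, shown here on the CPU function (on a real GPU one would also call `cuda.synchronize()` before reading the clock):

```
import time
import numpy as np

# the loop-based CPU function from above
def dist(points, centroid):
    distances = np.zeros([1, points.shape[1]])
    for i in range(points.shape[0]):
        distances += np.power(points[i, ] - centroid[i], 2)
    return distances

points = np.random.rand(2, 10_000)
centroid = np.array([0.5, 0.5])

dist(points, centroid)  # warm-up call (matters most for JIT-compiled code)

runs = 100
t0 = time.perf_counter()
for _ in range(runs):
    dist(points, centroid)
elapsed = (time.perf_counter() - t0) / runs
print(f"avg per call: {elapsed:.6f} s")
```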

I’m running this on Google Colab, with the following CPU and GPU:

```
CPU:
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
GPU:
Wed Feb 3 17:23:37 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 63C P8 11W / 70W | 0MiB / 15079MiB | 0% Default |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
```

One can see that my GPU version is slower, and it gets even slower for my actual problem, which has far more points.

Can someone clarify what the problem with that code is and point me in the right direction?

Also, when I use more points, I run into an error:

`CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE`

What is this error and how can I avoid it?

I guess it’s related to how I implemented the CUDA kernel, but googling the error did not help much, and I’m not familiar enough with CUDA to work out how to improve my function.

Any help is appreciated!