I’m new to numba and cuda and I want to implement a fairly straightforward function on a GPU using `numba.cuda.jit`

This is the function I am trying to parallelize:

```
import numpy as np

def dist(points, centroid):
    """Computes the squared Euclidean distance between a set of points and a centroid.
    points: 'number of dimensions' * n array
    centroid: 1 * 'number of dimensions' array
    """
    distances = np.zeros([1, points.shape[1]])
    for i in range(points.shape[0]):
        distances += np.power(points[i, ] - centroid[i], 2)
    return distances
```
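For concreteness, a small usage example of the function above (the sample values are my own, chosen so the expected result is easy to verify by hand):

```
import numpy as np

# the dist function from above, repeated so this snippet is self-contained
def dist(points, centroid):
    distances = np.zeros([1, points.shape[1]])
    for i in range(points.shape[0]):
        distances += np.power(points[i, ] - centroid[i], 2)
    return distances

# two points in 2D, (0, 0) and (3, 4), stored as a dims x n array
points = np.array([[0.0, 3.0],
                   [0.0, 4.0]])
centroid = np.array([0.0, 0.0])

print(dist(points, centroid))  # squared distances: [[ 0. 25.]]
```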

What I have come up with so far is this:

```
from numba import cuda

@cuda.jit
def dist_GPU1(points, centroid, distances):
    """Version 1"""
    # note that points here is a transposed version of points (points.T),
    # in contrast to the function above
    # get thread position
    x, y = cuda.grid(2)
    # check boundaries
    x_bound = points.shape[0]
    y_bound = points.shape[1]
    if x < x_bound and y < y_bound:
        dist = ((points[x, 0] - centroid[0]) ** 2 + (points[x, 1] - centroid[1]) ** 2) ** 0.5
        distances[0, x] = dist
    return  # do nothing
```

The points are in 2D space; however, I plan to generalize this to higher dimensions later on.

```
p = np.array([[-0.30085331, 0.92613909, -0.00674861, -0.52885546, 0.41127246,
               3.73846658, 3.17136828, 2.88435172, 2.6988963, 1.52147801,
               5.32408397, 4.61491772, 4.323078, 5.61167629, 6.03099952],
              [-3.22084365, -1.7911364, -3.95967012, -3.32818605, -1.80313876,
               2.6400779, 2.76968061, 3.52856111, 3.17180914, 2.11847992,
               -0.06871988, -1.83921752, -1.30921238, -0.66873657, -0.02445487]])
c = np.array([-0.521854, 0.50810636])
```
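As a sanity check, the loop-based function should agree with a vectorized NumPy one-liner on data shaped like this (the one-liner is my own, not part of the code above):

```
import numpy as np

# the loop-based function from above
def dist(points, centroid):
    distances = np.zeros([1, points.shape[1]])
    for i in range(points.shape[0]):
        distances += np.power(points[i, ] - centroid[i], 2)
    return distances

# a few sample points in the same dims x n layout
p = np.array([[-0.30085331, 0.92613909, -0.00674861],
              [-3.22084365, -1.7911364, -3.95967012]])
c = np.array([-0.521854, 0.50810636])

# broadcasting: subtract the centroid per dimension, square, sum over dimensions
vectorized = ((p - c[:, None]) ** 2).sum(axis=0, keepdims=True)
assert np.allclose(dist(p, c), vectorized)
```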

Comparing the two versions:

```
import time

t1 = time.time()
dists_cpu = dist(p, c)
t2 = time.time()
# t2 - t1 = 0.0009181499481201172
distances = np.zeros((1, p.shape[1]))
TPB = 32
threadsperblock = (TPB, TPB)
grid_size = int(np.ceil(p.T.size / TPB))
blockspergrid = (grid_size, grid_size)
dist_GPU1[blockspergrid, threadsperblock](p.T, c, distances)
t3 = time.time()
# t3 - t2 = 0.17550301551818848
```
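As an aside, the timing itself may be misleading (this is my assumption about what it measures): the first call to a `@cuda.jit` kernel also pays the JIT-compilation cost, and kernel launches return before the GPU has finished, so a bare `time.time()` around one launch can measure almost anything. A sketch of a warm-up-then-average harness, shown here on the CPU function (on a real GPU one would also call `cuda.synchronize()` before reading the clock):

```
import time
import numpy as np

# the loop-based CPU function from above
def dist(points, centroid):
    distances = np.zeros([1, points.shape[1]])
    for i in range(points.shape[0]):
        distances += np.power(points[i, ] - centroid[i], 2)
    return distances

points = np.random.rand(2, 10_000)
centroid = np.array([0.5, 0.5])

dist(points, centroid)  # warm-up call (matters most for JIT-compiled code)

runs = 100
t0 = time.perf_counter()
for _ in range(runs):
    dist(points, centroid)
elapsed = (time.perf_counter() - t0) / runs
print(f"avg per call: {elapsed:.6f} s")
```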

I’m running this on Google Colab, with the following CPU and GPU:

```
CPU:
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
GPU:
Wed Feb 3 17:23:37 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 63C P8 11W / 70W | 0MiB / 15079MiB | 0% Default |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
```

One can see that my GPU version is slower, and it gets even slower for my actual problem, which has far more points.

Can someone clarify what the problem with that code is and point me in the right direction?

Also, when I use more points, I run into an error:

`CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE`

What is this error and how can I avoid it?

I guess it’s related to how I implemented the CUDA kernel, but googling the error did not help much, and I’m not familiar enough with CUDA to work out how to improve my function.

Any help is appreciated!