Hey guys, I noticed the problem of my last post, which is that kernels do run asynchronously, which is why it seemed like computations were happening after the kernel finished. I realize now, my code is just slow, and wanted to know if anyone knows how I can speed it up!
The mahalanobis distance requires $(x-y)^T S^{-1} (x-y)$, which is an easy calculation to do with numpy, but I don’t think matmul is implemented for cuda.jit(). So I programmed it manually like this:
import numpy as np
from numba import cuda
import math
from tqdm import tqdm
# Randomly generate some example vectors
n_1 = 100
n_2 = 300000
D = 128
set_1 = np.random.random((n_1, D))
set_2 = np.random.random((n_2, D))
# For mahalanobis distance, find inverse covariance matrix
icov = np.linalg.inv(np.cov(set_2.T))
dists = np.zeros((n_1, n_2))
# put data on GPU
set_1_d = cuda.to_device(set_1)
set_2_d = cuda.to_device(set_2)
icov_d = cuda.to_device(icov)
dists_d = cuda.to_device(dists)
@cuda.jit()
def mahalanobis(set_1, set_2, icov, dists):
tx = cuda.threadIdx.x
bx = cuda.blockIdx.x
bw = cuda.blockDim.x
ind_s2 = bx * bw + tx
D = set_1.shape[1]
if ind_s2 < dists.shape[1]:
# Loop through all the set_1 vectors
for ind_s1 in range(dists.shape[0]):
tmp_sum = 0
# These for loops do the matrix calculation stuff
for j in range(D):
a_j = set_1[ind_s1, j] - set_2[ind_s2, j]
for i in range(D):
a_i = set_1[ind_s1, i] - set_2[ind_s2, i]
c_ij = icov[i, j]
tmp_sum += a_j * a_i * c_ij
dists[ind_s1, ind_s2] = math.sqrt(tmp_sum)
# silly way to get run time
block_size = 128
n_blocks = int(np.ceil(set_2.shape[0] / block_size))
for i in tqdm(range(1)):
mahalanobis[n_blocks, block_size](set_1_d, set_2_d, icov_d, dists_d)
cuda.synchronize()
dists = dists_d.copy_to_host()
This runs in about 1:30-2:00 on my GPU (Tesla V100-SXM3-32GB). The reason why I am surprised (and originally thought I found a bug), is because using Euclidean distance (which is pretty much just one less for loop), it takes about half a second. I suspect it’s because everything is in global memory. So I tried putting the covariance matrix in shared memory, like so:
@cuda.jit()
def mahalanobis(set_1, set_2, icov, dists):
tx = cuda.threadIdx.x
bx = cuda.blockIdx.x
bw = cuda.blockDim.x
ind_s2 = bx * bw + tx
D = set_1.shape[1]
icov_s = cuda.shared.array((128, 128), dtype=np.float32)
for j in range(128):
icov_s[tx, j] = icov[tx, j]
cuda.syncthreads()
if ind_s2 < dists.shape[1]:
# Loop through all the set_1 vectors
for ind_s1 in range(dists.shape[0]):
tmp_sum = 0
# These for loops do the matrix calculation stuff
for j in range(D):
a_j = set_1[ind_s1, j] - set_2[ind_s2, j]
for i in range(D):
a_i = set_1[ind_s1, i] - set_2[ind_s2, i]
c_ij = icov_s[i, j]
tmp_sum += a_j * a_i * c_ij
dists[ind_s1, ind_s2] = math.sqrt(tmp_sum)
But I keep getting CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE.
I thought since the covariance is 128 x 128, and I have 128 threads per block, then we could just use tx (threadidx.x) to have each thread fill in a row in shared memory.
In summary:
- How can I make this faster?
- When using shared memory, why do I get this error?
Thanks!