Can this code be more efficient

Hey guys, I noticed the problem of my last post, which is that kernels do run asynchronously, which is why it seemed like computations were happening after the kernel finished. I realize now, my code is just slow, and wanted to know if anyone knows how I can speed it up!

The mahalanobis distance requires $(x-y)^T S^{-1} (x-y)$, which is an easy calculation to do with numpy, but I don’t think matmul is implemented for cuda.jit(). So I programmed it manually like this:

import numpy as np
from numba import cuda
import math
from tqdm import tqdm

# Randomly generate some example vectors
n_1 = 100
n_2 = 300000
D = 128
set_1 = np.random.random((n_1, D))
set_2 = np.random.random((n_2, D))
# For mahalanobis distance, find inverse covariance matrix
icov = np.linalg.inv(np.cov(set_2.T))
dists = np.zeros((n_1, n_2))

# put data on GPU
set_1_d = cuda.to_device(set_1)
set_2_d = cuda.to_device(set_2)
icov_d = cuda.to_device(icov)
dists_d = cuda.to_device(dists)

@cuda.jit()
def mahalanobis(set_1, set_2, icov, dists):
    tx = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x
    ind_s2 = bx * bw + tx
    D = set_1.shape[1]

    if ind_s2 < dists.shape[1]:
        # Loop through all the set_1 vectors
        for ind_s1 in range(dists.shape[0]):
            tmp_sum = 0
            # These for loops do the matrix calculation stuff
            for j in range(D):
                a_j = set_1[ind_s1, j] - set_2[ind_s2, j]
                for i in range(D):
                    a_i = set_1[ind_s1, i] - set_2[ind_s2, i]
                    c_ij = icov[i, j] 
                    tmp_sum += a_j * a_i * c_ij 
            dists[ind_s1, ind_s2] = math.sqrt(tmp_sum)

# silly way to get run time            
block_size = 128
n_blocks = int(np.ceil(set_2.shape[0] / block_size))
for i in tqdm(range(1)):
    mahalanobis[n_blocks, block_size](set_1_d, set_2_d, icov_d, dists_d)
    cuda.synchronize()
    dists = dists_d.copy_to_host()

This runs in about 1:30-2:00 on my GPU (Tesla V100-SXM3-32GB). The reason why I am surprised (and originally thought I found a bug), is because using Euclidean distance (which is pretty much just one less for loop), it takes about half a second. I suspect it’s because everything is in global memory. So I tried putting the covariance matrix in shared memory, like so:

@cuda.jit()
def mahalanobis(set_1, set_2, icov, dists):
    tx = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x
    ind_s2 = bx * bw + tx
    D = set_1.shape[1]

    icov_s = cuda.shared.array((128, 128), dtype=np.float32)
    for j in range(128):
        icov_s[tx, j] = icov[tx, j]
    cuda.syncthreads()


    if ind_s2 < dists.shape[1]:
        # Loop through all the set_1 vectors
        for ind_s1 in range(dists.shape[0]):
            tmp_sum = 0
            # These for loops do the matrix calculation stuff
            for j in range(D):
                a_j = set_1[ind_s1, j] - set_2[ind_s2, j]
                for i in range(D):
                    a_i = set_1[ind_s1, i] - set_2[ind_s2, i]
                    c_ij = icov_s[i, j] 
                    tmp_sum += a_j * a_i * c_ij 
            dists[ind_s1, ind_s2] = math.sqrt(tmp_sum)

But I keep getting CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE.
I thought since the covariance is 128 x 128, and I have 128 threads per block, then we could just use tx (threadidx.x) to have each thread fill in a row in shared memory.

In summary:

  1. How can I make this faster?
  2. When using shared memory, why do I get this error?

Thanks!