Here’s a basic attempt:
# Python3 implementation of above approach
from math import floor
from numba import cuda, guvectorize, njit
import numpy as np
import datetime
# Function to calculate the
# Jaro Similarity of two strings
@njit
def jaro_distance(s1, s2, hash_s1, hash_s2):
    # Lengths of the two strings
    len1 = len(s1)
    len2 = len(s2)

    # Maximum distance up to which matching
    # is allowed
    max_dist = floor(max(len1, len2) / 2) - 1

    # Count of matches
    match = 0

    # Traverse through the first string
    for i in range(len1):
        # Check if there is any match
        for j in range(max(0, i - max_dist), min(len2, i + max_dist + 1)):
            # If there is a match
            if s1[i] == s2[j] and hash_s2[j] == 0:
                hash_s1[i] = 1
                hash_s2[j] = 1
                match += 1
                break

    # If there is no match
    if match == 0:
        return 0.0

    # Number of transpositions
    t = 0
    point = 0

    # Count the number of occurrences where two characters match but
    # there is a third matched character in between the indices
    for i in range(len1):
        if hash_s1[i]:
            # Find the next matched character in the second string
            while hash_s2[point] == 0:
                point += 1
            if s1[i] != s2[point]:
                point += 1
                t += 1
    t = t // 2

    # Return the Jaro Similarity
    return (match / len1 + match / len2 + (match - t + 1) / match) / 3.0
# CPU implementation: compute the Jaro similarity for every pair of
# strings in vector1 and vector2
def full_jaro(vector1, vector2, hashv1, hashv2, result):
    for i in range(len(vector1)):
        for j in range(len(vector2)):
            # Reset the scratch arrays before each comparison
            for k in range(len(hashv1)):
                hashv1[k] = 0
            for l in range(len(hashv2)):
                hashv2[l] = 0
            result[i][j] = jaro_distance(vector1[i], vector2[j], hashv1, hashv2)
@cuda.jit
def full_jaro_gpu(vector1, vector2, result):
    # For loop distributed across threads - cuda.grid(2) gives us 2D indices
    # based on the thread ID.
    i, j = cuda.grid(2)

    # Ensure that threads beyond the size of the inputs don't attempt to
    # compute out of bounds
    if i >= len(vector1) or j >= len(vector2):
        return

    hashv1 = cuda.local.array(HASH_SIZE, dtype=np.float32)
    hashv2 = cuda.local.array(HASH_SIZE, dtype=np.float32)
    for k in range(len(hashv1)):
        hashv1[k] = 0
    for l in range(len(hashv2)):
        hashv2[l] = 0

    result[i][j] = jaro_distance(vector1[i], vector2[j], hashv1, hashv2)
types = "(float32[:,:], float32[:,:], float32[:], float32[:], float32[:,:])"
layout = "(m,n),(p,q),(t),(u)->(m,p)"
full_jaro_cpu = guvectorize([types], layout, target='cpu')(full_jaro)
# Convert each line of a text file into a list of numeric character codes
def textToNumpy(filename):
    def convertStringToNum(string):
        string = string.lower()
        if string == "a":
            return 11
        if string == "b":
            return 37
        if string == "c":
            return 35
        if string == "d":
            return 12
        if string == "e":
            return 13
        if string == "f":
            return 14
        if string == "g":
            return 15
        if string == "h":
            return 16
        if string == "i":
            return 17
        if string == "j":
            return 18
        if string == "k":
            return 19
        if string == "l":
            return 20
        if string == "m":
            return 21
        if string == "n":
            return 22
        if string == "o":
            return 23
        if string == "p":
            return 24
        if string == "q":
            return 25
        if string == "r":
            return 26
        if string == "s":
            return 27
        if string == "t":
            return 28
        if string == "u":
            return 29
        if string == "v":
            return 30
        if string == "w":
            return 31
        if string == "x":
            return 32
        if string == "y":
            return 33
        if string == "z":
            return 34
        else:
            print(string)

    def listAngka(string):
        data = [float(convertStringToNum(x)) for x in string]
        return data

    with open(filename) as file:
        datas = file.readlines()
    datas = [x.replace("\n", "") for x in datas]
    datas = [listAngka(data) for data in datas]
    return datas
data = [x for x in textToNumpy("small")]
data2 = [x for x in textToNumpy("small")]
length = max(map(len, data))
length2 = max(map(len, data2))
A = np.array([xi + [0] * (length - len(xi)) for xi in data],
dtype=np.float32)
B = np.array([xi + [0] * (length - len(xi)) for xi in data2],
dtype=np.float32)
hash1 = np.zeros(A.shape[1], dtype=np.float32)
hash2 = np.zeros(B.shape[1], dtype=np.float32)
# Allocate data on the GPU, and copy data over
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)
d_result = cuda.device_array((len(A), len(B)), dtype=np.float64)
nthreads = (16, 16)
nblocks = ((len(A) // nthreads[0]) + 1, (len(B) // nthreads[1]) + 1)
HASH_SIZE = (A.shape[1])
# Run once to compile and cache
full_jaro_gpu[nblocks, nthreads](d_A, d_B, d_result)
print("CPU Run...")
cpu_begin = datetime.datetime.now()
cpu_result = full_jaro_cpu(A, B, hash1, hash2)
cpu_end = datetime.datetime.now()
print("GPU Run...")
gpu_begin = datetime.datetime.now()
full_jaro_gpu[nblocks, nthreads](d_A, d_B, d_result)
cuda.synchronize()
gpu_end = datetime.datetime.now()
print("Sanity check...")
np.testing.assert_allclose(cpu_result, d_result.copy_to_host())
cpu_time = cpu_end - cpu_begin
gpu_time = gpu_end - gpu_begin
ratio = cpu_time / gpu_time
print()
print("CPU Time:", cpu_end - cpu_begin)
print("GPU Time:", gpu_end - gpu_begin)
print("Speedup factor: ", ratio)
A quick walkthrough of the changes:

- I replaced `cuda.jit(device=True)` for the `jaro_distance` function with `@njit` - this compiles it in nopython mode, and nopython-mode functions can be called like CUDA device functions, as long as they only use CUDA-supported operations. This means the same function can be used with both the CPU target and the GPU target (there is a minimal sketch of this pattern after this list).
- I implemented the outer loops on the GPU using `@cuda.jit`. I think part of the problem was that the way the layout and signature were set up forced all the work onto a single GPU thread. It might have been possible to fix this and still use `@guvectorize`, but in my experience you get much better performance (and more flexibility) with `@cuda.jit` than with `@guvectorize(target='cuda')` anyway. In the GPU implementation, `full_jaro_gpu`:
  - The `for` loops over `i` and `j` are distributed across threads using `cuda.grid(2)` - this assigns one GPU thread to each iteration of the loops. The kernel is launched with a 2D grid, which maps onto the 2D iteration space of these loops.
  - Because the grid size is computed, it can be larger than the inputs - threads that fall outside the bounds of `vector1` or `vector2` need to exit early, so we return when `i >= len(vector1) or j >= len(vector2)`.
  - Each thread needs its own copy of `hashv1` and `hashv2`, so these are replaced with local arrays, since they're not that large. The size of these arrays needs to be a compile-time constant, which is expected to be available as the global `HASH_SIZE` when the function is called for the first time (see the local array sketch after this list).
- For small benchmarks it's important to time only the execution of the kernel on the GPU - in the context of a large program, you would expect to keep data on the device across many kernel launches and avoid bottlenecking everything on transfers between the host and the device. Some changes are needed to ensure we only time the kernel execution (a CUDA-events alternative is also sketched after this list):
  - `d_A` and `d_B` are copies of `A` and `B` on the device, made prior to timing and passed into the timed kernel.
  - The result, `d_result`, is pre-allocated too.
  - Outside of the timing, the kernel is called once first to force a compilation - this ensures that when we time the execution, we're timing only the kernel, not the compilation. Subsequent runs retrieve the compiled code from the cache instead of recompiling.
- Computation of the grid size: I have assumed one element of the output per thread - the computation of `nthreads` and `nblocks` gives a grid just large enough (maybe slightly larger than necessary) to have one thread per element.
- After launching the GPU kernel, `cuda.synchronize()` is called to make sure the kernel has finished - kernel launches return asynchronously.
- Sanity checking is added to make sure the CPU and GPU produce the same result.
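As a minimal, stand-alone sketch of the first point - the function and variable names here (`add_one`, `use_on_cpu`, `use_on_gpu`) are just made up for illustration - the same `@njit` function can be called from both CPU-jitted code and a CUDA kernel:

from numba import cuda, njit
import numpy as np

# A nopython-mode function - callable from CPU-jitted code and from CUDA
# kernels, because it only uses CUDA-supported operations.
@njit
def add_one(x):
    return x + 1.0

@njit
def use_on_cpu(arr, out):
    for i in range(len(arr)):
        out[i] = add_one(arr[i])

@cuda.jit
def use_on_gpu(arr, out):
    i = cuda.grid(1)
    if i < len(arr):
        out[i] = add_one(arr[i])

a = np.arange(8, dtype=np.float32)
out_cpu = np.zeros_like(a)
use_on_cpu(a, out_cpu)

out_gpu = cuda.device_array_like(a)
use_on_gpu[1, 32](cuda.to_device(a), out_gpu)
np.testing.assert_allclose(out_cpu, out_gpu.copy_to_host())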
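And a small sketch of the local array sizing point - the size given to `cuda.local.array` has to resolve to a constant when the kernel is compiled, so a module-level global works, but a value passed in at runtime would not (the names here are again only illustrative):

from numba import cuda
import numpy as np

SCRATCH_SIZE = 64  # must be defined before the kernel is first compiled

@cuda.jit
def fill_from_scratch(out):
    # The size argument to cuda.local.array must be a compile-time
    # constant - a module-level global like SCRATCH_SIZE works, but a
    # value passed in as a kernel argument would not.
    scratch = cuda.local.array(SCRATCH_SIZE, dtype=np.float32)
    i = cuda.grid(1)
    if i < len(out):
        for k in range(SCRATCH_SIZE):
            scratch[k] = i
        out[i] = scratch[SCRATCH_SIZE - 1]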
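If you want to time just the kernel more precisely than with `datetime`, CUDA events are one option - a sketch, assuming `d_A`, `d_B`, `d_result`, `nblocks` and `nthreads` from the code above are already set up:

start = cuda.event(timing=True)
end = cuda.event(timing=True)

start.record()
full_jaro_gpu[nblocks, nthreads](d_A, d_B, d_result)
end.record()
end.synchronize()

print("Kernel time (ms):", cuda.event_elapsed_time(start, end))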
On my system, with an i7-6700K, RTX 8000, CUDA 11 RC, and using this top 1000 English words list with the apostrophe removed from n't, I get:
$ python repro.py
CPU Run...
GPU Run...
Sanity check...
CPU Time: 0:00:00.257595
GPU Time: 0:00:00.001633
Speedup factor: 157.74341702388242
This is not too bad, but some notes apply:
- I haven't really tried to optimize the code as such, just to map it onto the GPU execution model. Some of the things I mentioned in this comment probably apply here too, particularly the parts about strided loops and coalescing (a grid-stride sketch follows this list).
- Perhaps there are other ways of writing the algorithm that map better onto the GPU - I'm not familiar with the algorithm and what it does, so I don't have many thoughts here, but I do wonder whether `hashv1` and `hashv2` need to be of `float32` type, for example.
- Maybe the CPU implementation could be improved too. I haven't used all the cores, but even if I had, it would theoretically only be around 4-8x faster than the CPU time shown here (a `prange` sketch also follows this list).
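For reference, a grid-stride variant of the kernel (as mentioned in the first note) might look something like the sketch below - untested here, just to illustrate the pattern. Each thread then handles several output elements rather than exactly one, so the launch configuration no longer has to exactly cover the output:

@cuda.jit
def full_jaro_gpu_strided(vector1, vector2, result):
    # Starting indices and strides for this thread in the 2D grid
    i0, j0 = cuda.grid(2)
    si, sj = cuda.gridsize(2)

    # Per-thread scratch arrays, reused (and re-zeroed) for each pair
    hashv1 = cuda.local.array(HASH_SIZE, dtype=np.float32)
    hashv2 = cuda.local.array(HASH_SIZE, dtype=np.float32)

    for i in range(i0, len(vector1), si):
        for j in range(j0, len(vector2), sj):
            for k in range(HASH_SIZE):
                hashv1[k] = 0
                hashv2[k] = 0
            result[i, j] = jaro_distance(vector1[i], vector2[j],
                                         hashv1, hashv2)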
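And a sketch of a multi-core CPU version using `prange` - this allocates per-thread scratch arrays inside the parallel loop rather than sharing `hash1`/`hash2`; I haven't benchmarked it, it's just to show the shape of the change:

from numba import njit, prange

@njit(parallel=True)
def full_jaro_parallel(vector1, vector2, result):
    # The outer loop is distributed across all available CPU cores
    for i in prange(len(vector1)):
        # Per-iteration scratch arrays, so parallel iterations don't share
        # state (these could probably be a narrower type such as int8)
        hashv1 = np.zeros(vector1.shape[1], dtype=np.float32)
        hashv2 = np.zeros(vector2.shape[1], dtype=np.float32)
        for j in range(len(vector2)):
            for k in range(len(hashv1)):
                hashv1[k] = 0
            for l in range(len(hashv2)):
                hashv2[l] = 0
            result[i, j] = jaro_distance(vector1[i], vector2[j],
                                         hashv1, hashv2)

result_parallel = np.zeros((len(A), len(B)), dtype=np.float32)
full_jaro_parallel(A, B, result_parallel)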
Hopefully this provides a helpful start - do post back if you are able to optimize it further (or have more questions)!