Here’s a basic attempt:
# Python3 implementation of above approach
from math import floor
from numba import cuda, guvectorize, njit
import numpy as np
import datetime
# Function to calculate the
# Jaro Similarity of two strings
@njit
def jaro_distance(s1, s2, hash_s1, hash_s2):
    # Lengths of the two strings
    len1 = len(s1)
    len2 = len(s2)

    # Maximum distance up to which matching
    # is allowed
    max_dist = floor(max(len1, len2) / 2) - 1

    # Count of matches
    match = 0

    # Traverse through the first string
    for i in range(len1):
        # Check if there is any match
        for j in range(max(0, i - max_dist), min(len2, i + max_dist + 1)):
            # If there is a match
            if s1[i] == s2[j] and hash_s2[j] == 0:
                hash_s1[i] = 1
                hash_s2[j] = 1
                match += 1
                break

    # If there is no match
    if match == 0:
        return 0.0

    # Number of transpositions
    t = 0
    point = 0

    # Count the number of occurrences where two characters match but
    # there is a third matched character in between the indices
    for i in range(len1):
        if hash_s1[i]:
            # Find the next matched character in the second string
            while hash_s2[point] == 0:
                point += 1
            if s1[i] != s2[point]:
                point += 1
                t += 1
    t = t // 2

    # Return the Jaro Similarity
    return (match / len1 + match / len2 + (match - t + 1) / match) / 3.0
# CPU implementation: compute the Jaro similarity for every pair of
# strings in vector1 and vector2
def full_jaro(vector1, vector2, hashv1, hashv2, result):
    for i in range(len(vector1)):
        for j in range(len(vector2)):
            # Reset the scratch arrays before each comparison
            for k in range(len(hashv1)):
                hashv1[k] = 0
            for l in range(len(hashv2)):
                hashv2[l] = 0
            result[i][j] = jaro_distance(vector1[i], vector2[j], hashv1, hashv2)
@cuda.jit
def full_jaro_gpu(vector1, vector2, result):
    # For loop distributed across threads - cuda.grid(2) gives us 2D indices
    # based on the thread ID.
    i, j = cuda.grid(2)

    # Ensure that threads beyond the size of the inputs don't attempt to
    # compute out of bounds
    if i >= len(vector1) or j >= len(vector2):
        return

    hashv1 = cuda.local.array(HASH_SIZE, dtype=np.float32)
    hashv2 = cuda.local.array(HASH_SIZE, dtype=np.float32)
    for k in range(len(hashv1)):
        hashv1[k] = 0
    for l in range(len(hashv2)):
        hashv2[l] = 0

    result[i][j] = jaro_distance(vector1[i], vector2[j], hashv1, hashv2)
types = "(float32[:,:], float32[:,:], float32[:], float32[:], float32[:,:])"
layout = "(m,n),(p,q),(t),(u)->(m,p)"
full_jaro_cpu = guvectorize([types], layout, target='cpu')(full_jaro)
# Convert each line of a text file into a list of numeric character codes
def textToNumpy(filename):
    def convertStringToNum(string):
        string = string.lower()
        if string == "a":
            return 11
        if string == "b":
            return 37
        if string == "c":
            return 35
        if string == "d":
            return 12
        if string == "e":
            return 13
        if string == "f":
            return 14
        if string == "g":
            return 15
        if string == "h":
            return 16
        if string == "i":
            return 17
        if string == "j":
            return 18
        if string == "k":
            return 19
        if string == "l":
            return 20
        if string == "m":
            return 21
        if string == "n":
            return 22
        if string == "o":
            return 23
        if string == "p":
            return 24
        if string == "q":
            return 25
        if string == "r":
            return 26
        if string == "s":
            return 27
        if string == "t":
            return 28
        if string == "u":
            return 29
        if string == "v":
            return 30
        if string == "w":
            return 31
        if string == "x":
            return 32
        if string == "y":
            return 33
        if string == "z":
            return 34
        else:
            print(string)

    def listAngka(string):
        data = [float(convertStringToNum(x)) for x in string]
        return data

    with open(filename) as file:
        datas = file.readlines()
    datas = [x.replace("\n", "") for x in datas]
    datas = [listAngka(data) for data in datas]
    return datas
data = [x for x in textToNumpy("small")]
data2 = [x for x in textToNumpy("small")]
length = max(map(len, data))
length2 = max(map(len, data2))
A = np.array([xi + [0] * (length - len(xi)) for xi in data],
dtype=np.float32)
B = np.array([xi + [0] * (length - len(xi)) for xi in data2],
dtype=np.float32)
hash1 = np.zeros(A.shape[1], dtype=np.float32)
hash2 = np.zeros(B.shape[1], dtype=np.float32)
# Allocate data on the GPU, and copy data over
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)
d_result = cuda.device_array((len(A), len(B)), dtype=np.float64)
nthreads = (16, 16)
nblocks = ((len(A) // nthreads[0]) + 1, (len(B) // nthreads[1]) + 1)
HASH_SIZE = (A.shape[1])
# Run once to compile and cache
full_jaro_gpu[nblocks, nthreads](d_A, d_B, d_result)
print("CPU Run...")
cpu_begin = datetime.datetime.now()
cpu_result = full_jaro_cpu(A, B, hash1, hash2)
cpu_end = datetime.datetime.now()
print("GPU Run...")
gpu_begin = datetime.datetime.now()
full_jaro_gpu[nblocks, nthreads](d_A, d_B, d_result)
cuda.synchronize()
gpu_end = datetime.datetime.now()
print("Sanity check...")
np.testing.assert_allclose(cpu_result, d_result.copy_to_host())
cpu_time = cpu_end - cpu_begin
gpu_time = gpu_end - gpu_begin
ratio = cpu_time / gpu_time
print()
print("CPU Time:", cpu_end - cpu_begin)
print("GPU Time:", gpu_end - gpu_begin)
print("Speedup factor: ", ratio)
A quick walkthrough of the changes:

- I replaced `cuda.jit(device=True)` for the `jaro_distance` function with `@njit` - this compiles it in nopython mode, and nopython-mode functions can be called like CUDA device functions, as long as they only use CUDA-supported operations. This means the same function can be used with both the CPU target and the GPU target (there is a minimal sketch of this pattern after this list).
- I implemented the outer loops on the GPU using `@cuda.jit`. I think part of the problem was that the way the layout and signature were set up forced all the work onto a single GPU thread. It might have been possible to fix this and still use `@guvectorize`, but in my experience you get much better performance (and more flexibility) with `@cuda.jit` than with `@guvectorize(target='cuda')` anyway. In the GPU implementation, `full_jaro_gpu`:
  - The `for` loops over `i` and `j` are distributed across threads using `cuda.grid(2)` - this assigns one GPU thread to each iteration of the loops. The kernel is launched with a 2D grid, which maps onto the 2D iteration space of these loops.
  - Because the grid size is computed, it can be larger than the inputs - threads that fall outside the bounds of `vector1` or `vector2` need to exit early, so we return when `i >= len(vector1) or j >= len(vector2)`.
  - Each thread needs its own copy of `hashv1` and `hashv2`, so these are replaced with local arrays, since they're not that large. The size of these arrays needs to be a compile-time constant, which is expected to be available as the global `HASH_SIZE` when the function is called for the first time (see the local array sketch after this list).
- For small benchmarks it's important to time only the execution of the kernel on the GPU - in the context of a large program, you would expect to keep data on the device across many kernel launches and avoid bottlenecking everything on transfers between the host and the device. Some changes are needed to ensure we only time the kernel execution (a CUDA-events alternative is also sketched after this list):
  - `d_A` and `d_B` are copies of `A` and `B` on the device, made prior to timing and passed into the timed kernel.
  - The result, `d_result`, is pre-allocated too.
  - Outside of the timing, the kernel is called once first to force a compilation - this ensures that when we time the execution, we're timing only the kernel, not the compilation. Subsequent runs retrieve the compiled code from the cache instead of recompiling.
- Computation of the grid size: I have assumed one element of the output per thread - the computation of `nthreads` and `nblocks` gives a grid just large enough (maybe slightly larger than necessary) to have one thread per element.
- After launching the GPU kernel, `cuda.synchronize()` is called to make sure the kernel has finished - kernel launches return asynchronously.
- Sanity checking is added to make sure the CPU and GPU produce the same result.
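As a minimal, stand-alone sketch of the first point - the function and variable names here (`add_one`, `use_on_cpu`, `use_on_gpu`) are just made up for illustration - the same `@njit` function can be called from both CPU-jitted code and a CUDA kernel:

from numba import cuda, njit
import numpy as np

# A nopython-mode function - callable from CPU-jitted code and from CUDA
# kernels, because it only uses CUDA-supported operations.
@njit
def add_one(x):
    return x + 1.0

@njit
def use_on_cpu(arr, out):
    for i in range(len(arr)):
        out[i] = add_one(arr[i])

@cuda.jit
def use_on_gpu(arr, out):
    i = cuda.grid(1)
    if i < len(arr):
        out[i] = add_one(arr[i])

a = np.arange(8, dtype=np.float32)
out_cpu = np.zeros_like(a)
use_on_cpu(a, out_cpu)

out_gpu = cuda.device_array_like(a)
use_on_gpu[1, 32](cuda.to_device(a), out_gpu)
np.testing.assert_allclose(out_cpu, out_gpu.copy_to_host())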
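And a small sketch of the local array sizing point - the size given to `cuda.local.array` has to resolve to a constant when the kernel is compiled, so a module-level global works, but a value passed in at runtime would not (the names here are again only illustrative):

from numba import cuda
import numpy as np

SCRATCH_SIZE = 64  # must be defined before the kernel is first compiled

@cuda.jit
def fill_from_scratch(out):
    # The size argument to cuda.local.array must be a compile-time
    # constant - a module-level global like SCRATCH_SIZE works, but a
    # value passed in as a kernel argument would not.
    scratch = cuda.local.array(SCRATCH_SIZE, dtype=np.float32)
    i = cuda.grid(1)
    if i < len(out):
        for k in range(SCRATCH_SIZE):
            scratch[k] = i
        out[i] = scratch[SCRATCH_SIZE - 1]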
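If you want to time just the kernel more precisely than with `datetime`, CUDA events are one option - a sketch, assuming `d_A`, `d_B`, `d_result`, `nblocks` and `nthreads` from the code above are already set up:

start = cuda.event(timing=True)
end = cuda.event(timing=True)

start.record()
full_jaro_gpu[nblocks, nthreads](d_A, d_B, d_result)
end.record()
end.synchronize()

print("Kernel time (ms):", cuda.event_elapsed_time(start, end))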
On my system, with an i7-6700K, RTX 8000, CUDA 11 RC, and using this top 1000 English words list with the apostrophe removed from n't, I get:
$ python repro.py
CPU Run...
GPU Run...
Sanity check...
CPU Time: 0:00:00.257595
GPU Time: 0:00:00.001633
Speedup factor: 157.74341702388242
This is not too bad, but some notes apply:
- I haven't really tried to optimize the code as such, just to map it onto the GPU execution model. Some of the things I mentioned in this comment probably apply here too, particularly the parts about strided loops and coalescing (a grid-stride sketch follows this list).
- Perhaps there are other ways of writing the algorithm that map better onto the GPU - I'm not familiar with the algorithm and what it does, so I don't have many thoughts here, but I do wonder whether `hashv1` and `hashv2` need to be of `float32` type, for example.
- Maybe the CPU implementation could be improved too. I haven't used all the cores, but even if I had, it would theoretically only be around 4-8x faster than the CPU time shown here (a `prange` sketch also follows this list).
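For reference, a grid-stride variant of the kernel (as mentioned in the first note) might look something like the sketch below - untested here, just to illustrate the pattern. Each thread then handles several output elements rather than exactly one, so the launch configuration no longer has to exactly cover the output:

@cuda.jit
def full_jaro_gpu_strided(vector1, vector2, result):
    # Starting indices and strides for this thread in the 2D grid
    i0, j0 = cuda.grid(2)
    si, sj = cuda.gridsize(2)

    # Per-thread scratch arrays, reused (and re-zeroed) for each pair
    hashv1 = cuda.local.array(HASH_SIZE, dtype=np.float32)
    hashv2 = cuda.local.array(HASH_SIZE, dtype=np.float32)

    for i in range(i0, len(vector1), si):
        for j in range(j0, len(vector2), sj):
            for k in range(HASH_SIZE):
                hashv1[k] = 0
                hashv2[k] = 0
            result[i, j] = jaro_distance(vector1[i], vector2[j],
                                         hashv1, hashv2)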
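And a sketch of a multi-core CPU version using `prange` - this allocates per-thread scratch arrays inside the parallel loop rather than sharing `hash1`/`hash2`; I haven't benchmarked it, it's just to show the shape of the change:

from numba import njit, prange

@njit(parallel=True)
def full_jaro_parallel(vector1, vector2, result):
    # The outer loop is distributed across all available CPU cores
    for i in prange(len(vector1)):
        # Per-iteration scratch arrays, so parallel iterations don't share
        # state (these could probably be a narrower type such as int8)
        hashv1 = np.zeros(vector1.shape[1], dtype=np.float32)
        hashv2 = np.zeros(vector2.shape[1], dtype=np.float32)
        for j in range(len(vector2)):
            for k in range(len(hashv1)):
                hashv1[k] = 0
            for l in range(len(hashv2)):
                hashv2[l] = 0
            result[i, j] = jaro_distance(vector1[i], vector2[j],
                                         hashv1, hashv2)

result_parallel = np.zeros((len(A), len(B)), dtype=np.float32)
full_jaro_parallel(A, B, result_parallel)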
Hopefully this provides a helpful start - do post back if you are able to optimize it further (or have more questions)!