Please help, I am new to Numba CUDA programming. My binary search program is not showing any speedup when I increase the number of threads & blocks

akumari · January 31, 2021, 9:16am

Here is my code:

# GPU-based binary search kernel: counts how many times a search item occurs in the database.

from numba import cuda

@cuda.jit
def cuda_BinarySearch(srcitem, dsrcDB, dsrange, threadCount):

    tid = cuda.grid(1)  # global thread id

    # Threads beyond threadCount have no chunk to search; exit before indexing dsrcDB
    if tid >= threadCount:
        return

    last = first = -1

    # Each thread searches its own contiguous chunk [low, high] of the sorted database
    low = (tid * len(dsrcDB) // threadCount)
    high = ((tid + 1) * len(dsrcDB) // threadCount) - 1

    # First loop: find the first occurrence of the key in this chunk
    while low <= high:
        # Calculate mid to divide search domain
        mid = low + (high - low) // 2

        # if key is found, record it and keep searching the left half
        if srcitem == dsrcDB[mid][1]:
            first = mid
            high = mid - 1
        # if key is less than the mid element, discard right half
        elif srcitem < dsrcDB[mid][1]:
            high = mid - 1
        # if key is more than the mid element, discard left half
        else:
            low = mid + 1
    # End of first while loop

    if first != -1:
        # Reinitialize low & high
        low = first
        high = ((tid + 1) * len(dsrcDB) // threadCount) - 1

        # Second loop: find the last occurrence of the key in this chunk
        while low <= high:
            # Calculate mid to divide search domain
            mid = low + (high - low) // 2

            # if key is found, record it and keep searching the right half
            if srcitem == dsrcDB[mid][1]:
                last = mid
                low = mid + 1
            # if key is less than the mid element, discard right half
            elif srcitem < dsrcDB[mid][1]:
                high = mid - 1
            # if key is more than the mid element, discard left half
            else:
                low = mid + 1
        # End of last while loop

    # Store this thread's count; dsrange[tid] is left untouched when the key is absent
    if first != -1 and last != -1:
        dsrange[tid] = (last - first + 1)

#Driver’s Code

cuda_BinarySearch[1,8](srcItem, dsrcDB, dsrange, noOfThreads)
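The per-thread partitioning used by the kernel can be checked on the CPU without a GPU. The following is only a minimal sketch using the standard library: `chunk_bounds`, `count_in_chunk`, `emulate_kernel` and the sample `keys` list are hypothetical names and made-up data, not part of the original program. It assumes the key column is sorted and that `dsrange` starts out zeroed, so that summing it gives the total count.

```python
from bisect import bisect_left, bisect_right

def chunk_bounds(tid, n, thread_count):
    """Inclusive [low, high] index range searched by thread `tid` (same formula as the kernel)."""
    low = tid * n // thread_count
    high = (tid + 1) * n // thread_count - 1
    return low, high

def count_in_chunk(keys, item, low, high):
    """Occurrences of `item` in the sorted slice keys[low..high]."""
    first = bisect_left(keys, item, low, high + 1)
    last = bisect_right(keys, item, low, high + 1)
    return last - first

def emulate_kernel(keys, item, thread_count):
    """Per-thread counts, mirroring what each thread would store in dsrange[tid]."""
    n = len(keys)
    return [count_in_chunk(keys, item, *chunk_bounds(t, n, thread_count))
            for t in range(thread_count)]

# Made-up sorted key column standing in for dsrcDB[:, 1]
keys = [1, 3, 3, 3, 5, 7, 7, 9, 9, 9, 9, 12]
per_thread = emulate_kernel(keys, 9, thread_count=4)
total = sum(per_thread)  # total occurrences across all chunks
print(per_thread, total)
```

Comparing `total` against the sum of `dsrange` copied back from the device is one way to confirm that a given thread count still produces correct results before timing it.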

Note:
My database “dsrcDB” is huge, almost 255 MB.
I am logging the time taken by both the GPU and the CPU.
E.g., to search for 55 items:
Time taken by GPU (1 grid, 1 block, 8 threads): 0.04008 sec
Time taken by CPU: 0.25186 sec

The problem is that when I increase the number of threads and blocks to 16, 32, 64, 128, 256, 512, 1024, 2048, I see no visible change in the time taken by the GPU.
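A rough way to reason about this, offered as a sketch and not as a diagnosis: a binary search over a chunk of `m` elements needs at most `floor(log2(m)) + 1` loop iterations, so splitting the database across more threads shrinks each thread's work only logarithmically. The row count below is a pure assumption (the post gives only the size in MB), and `max_steps` is a hypothetical helper.

```python
def max_steps(chunk_len):
    """Worst-case iterations of one `while low <= high` loop over chunk_len elements.

    For chunk_len > 0 this equals floor(log2(chunk_len)) + 1, which is the
    bit length of the integer; an empty chunk needs 0 iterations.
    """
    return chunk_len.bit_length()

n_rows = 30_000_000  # ASSUMPTION: hypothetical row count, not taken from the post

for threads in (8, 64, 512, 2048):
    chunk = n_rows // threads  # approximate rows per thread
    print(f"{threads:5d} threads: {chunk:9d} rows per thread, "
          f"at most {max_steps(chunk)} steps per search loop")
```

Under that assumed row count, going from 8 to 2048 threads lowers the worst case from 22 to 14 iterations per loop, which may be too small a difference to show up next to fixed costs such as kernel launch and data transfer.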

GPU Machine details:
Device 1: “GeForce GTX 1080 Ti”
CUDA Driver Version / Runtime Version 11.0 / 11.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 11178 MBytes (11721506816 bytes)
(28) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores

MemoryInfo(free=10626727936, total=11721506816)
numba version: 0.50.1
NumPy version: 1.18.5
llvmlite version: 0.33.0+1.g022ab0f

Any help appreciated. Thank you!

sgbaird · August 22, 2021, 5:40am

Consider looking at this guide for Markdown to improve the readability of your post and perhaps attract more attention.
