Greetings again, and many thanks to this community for all the help it provides! I’m currently working on a very computationally intensive calculation. Fortunately it parallelizes well, and I’m now switching from CPU parallelization to GPU parallelization for a further speedup (my boss purchased a very nice Nvidia GeForce RTX 3090). However, I’m not getting the performance I expected based on my CPU and GPU specs, and I suspect the problem is the performance of my code on the individual GPU threads. There are probably better ways to do these comparisons, but here’s a summary of how I’m measuring CPU vs GPU performance with some very simple code.
Below is my GPU code. I realize it generally doesn’t make much sense to use a GPU for a single-thread calculation, but I’m not sure how else to do an apples-to-apples comparison for this purpose.
import numpy as np
from numba import cuda

@cuda.jit
def mykernel(myarr, aa):
    # aa is a placeholder argument and is currently unused
    ix, iy, iz = cuda.grid(3)
    # guard against threads that fall outside the array bounds
    if ix >= myarr.shape[0] or iy >= myarr.shape[1]:
        return
    nloops = int(1e7)
    val = 0.0
    for k in range(nloops):
        val += 0.001
    myarr[ix, iy] += val

# set n = m = 1 for a single-thread calculation:
n = 1
m = 1
myarr = np.zeros((n, m), dtype=np.double)
d_myarr = cuda.to_device(myarr)
# launch config: (n, 1, 1) blocks per grid, (1, m, 1) threads per block
mykernel[(n, 1, 1), (1, m, 1)](d_myarr, 123)
myarr = d_myarr.copy_to_host()
print(myarr[0, 0])
And here is my equivalent CPU code:
from numba import njit

@njit
def getValue():
    # same arithmetic as the GPU kernel, on a single CPU thread
    nloops = int(1e7)
    val = 0.0
    for k in range(nloops):
        val += 0.001
    return val

print(getValue())
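In case the measurement method matters, here’s a minimal sketch of the kind of timing harness I’m using (it assumes both snippets live in the same script; the warm-up calls are there so Numba’s JIT compilation isn’t counted, and cuda.synchronize() makes sure the kernel has actually finished before the clock stops):

import time

# warm-up calls so JIT compilation time is excluded from the measurement
getValue()
mykernel[(n, 1, 1), (1, m, 1)](d_myarr, 123)
cuda.synchronize()

# time the GPU kernel; launches are asynchronous, so wait for
# completion before stopping the clock
t0 = time.perf_counter()
mykernel[(n, 1, 1), (1, m, 1)](d_myarr, 123)
cuda.synchronize()
print("gpu code time =", time.perf_counter() - t0)

# time the CPU version
t0 = time.perf_counter()
getValue()
print("cpu code time =", time.perf_counter() - t0)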
To see how the relative performance varies with the complexity of the calculation, I change the variable “nloops” and measure the wall-clock time of each version as sketched above. Here are my findings:
nloops   gpu code time   cpu code time   ratio (gpu/cpu)
1e7      0.95 s          0.45 s          2.11
1e8      3.43 s          0.54 s          6.35
1e9      29.45 s         1.23 s          23.94
I understand that the advantage of a GPU is not that each thread is faster than a CPU core but that there are vastly more threads. Still, doesn’t that advantage diminish when the performance of each individual thread is this poor? Can anyone explain why the relative performance of a single GPU thread seems to get worse and worse as the calculation gets longer? Perhaps there’s a best practice I’m ignoring here? For context, the sketch below shows the kind of many-thread launch I’m ultimately aiming for (the sizes here are hypothetical placeholders, not my real problem dimensions):
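# hypothetical full-scale launch: one thread per array element
n = 1024
m = 1024
myarr = np.zeros((n, m), dtype=np.double)
d_myarr = cuda.to_device(myarr)

# 64 x 64 blocks of 16 x 16 threads covers the whole (n, m) array
threads_per_block = (16, 16, 1)
blocks_per_grid = (n // 16, m // 16, 1)
mykernel[blocks_per_grid, threads_per_block](d_myarr, 123)
cuda.synchronize()
myarr = d_myarr.copy_to_host()

Thanks in advance!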