Single thread GPU vs CPU performance as a function of calculation complexity

Greetings again, and many thanks to this community for all the help it provides! I’m currently working on a very computationally intensive calculation. Fortunately it’s one that can be parallelized, and I’m now switching from CPU parallelization to GPU parallelization in hopes of an even bigger speedup (my boss purchased a very nice Nvidia GeForce RTX 3090 GPU). However, I’m not getting the performance I was expecting based on my CPU and GPU specs, and I suspect it has something to do with the performance of my code on individual GPU threads. There are probably better ways to do these comparisons, but here’s a summary of how I’m measuring CPU vs GPU performance with some very simple code.

Below is my GPU code. I realize it doesn’t make much sense to use a GPU for a single thread calculation in general, but I’m not sure how else to do an apples-to-apples comparison for this purpose.

import numpy as np
from numba import cuda

@cuda.jit
def mykernel(myarr, aa):
  ix, iy, iz = cuda.grid(3)
  nloops = int(1e7)
  val = 0.0 
  for k in range(0, nloops):
    val += 0.001

  myarr[ix, iy] += val 

# set n = m = 1 for a single thread calculation:
n = 1 
m = 1 

myarr = np.zeros((n,m), dtype=np.double)

d_myarr = cuda.to_device(myarr)
mykernel[(n, 1, 1), (1, m, 1)](d_myarr, 123)
myarr = d_myarr.copy_to_host()

print(myarr[0][0])

And here is my equivalent CPU code:

from numba import njit

@njit
def getValue():
  nloops = int(1e7)
  val = 0.0 
  for k in range(0, nloops):
    val += 0.001
  return val 

print(getValue())

To see how the relative performance varies as a function of how complex the calculation is, I’m changing the variable “nloops” and measuring how long it takes the code to complete. Here are my findings:

nloops = 1e7:
gpu code time = 0.95 s
cpu code time = 0.45 s
ratio = 2.11

nloops = 1e8:
gpu code time = 3.43 s
cpu code time = 0.54 s
ratio = 6.35

nloops = 1e9:
gpu code time = 29.45 s
cpu code time = 1.23 s
ratio = 23.94

I understand that the advantage of a GPU is not that each thread is faster than a CPU thread, but that it has a very large number of threads. Doesn’t that advantage diminish, though, when the performance of each individual thread is poor? Can anyone explain why the relative performance of a GPU thread seems to get worse and worse as the complexity of the calculation increases? Perhaps there’s a best practice that I’m ignoring here? Thanks in advance!

I’ve been digging into this a little bit deeper and thought I’d share some of my results. Below is a plot of CPU and GPU execution time as a function of nloops, as well as the ratio of the two.

[plot: CPU and GPU execution time vs. nloops, and their ratio]

It seems that when nloops is small, the ratio mostly reflects the relative overheads of the two calculations, and when nloops gets large it reflects the ratio of the clock speeds (sort of). I say “sort of” because my CPU’s clock speed (3.70 GHz) is only about 2.5 times faster than my GPU’s (1.4 GHz), not 30 times faster.
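One way to see this is with a rough cost model (my own sketch; the constants below are made up to illustrate the shape, not fitted to my measurements): each run costs a fixed overhead plus a per-iteration cost, so the ratio is dominated by the overheads when nloops is small and tends toward the per-iteration cost ratio when nloops is large.

def run_time(nloops, overhead_s, per_iter_s):
    # total time ≈ fixed startup/launch overhead + nloops * cost of one loop iteration
    return overhead_s + nloops * per_iter_s


# Made-up constants: a bigger fixed overhead and a much bigger per-iteration
# cost for the single GPU thread than for the CPU thread.
for nloops in (1e7, 1e8, 1e9):
    gpu = run_time(nloops, overhead_s=0.5, per_iter_s=3.0e-8)
    cpu = run_time(nloops, overhead_s=0.4, per_iter_s=1.0e-9)
    print(f"nloops={nloops:.0e}  ratio={gpu / cpu:.1f}")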

Is this behavior normal? Or is there something I can do to have the ratio approach something closer to 2.5 rather than 30?

I realize it doesn’t make much sense to use a GPU for a single thread calculation in general,

Here you have identified the central issue with this comparison - it does indeed make no sense to do this. If you had to move 40 tons of dirt and wanted to work out whether it would be faster to use a Lamborghini or a lorry to move it, would you compare the top speeds of the two vehicles?

but I’m not sure how else to do an apples-to-apples comparison for this purpose.

I think your end goal is “write a program that computes the results I need as quickly as possible”, so the apples-to-apples comparison is between the fastest CPU implementation you can write and the fastest GPU implementation you can write.

Since your example is so abstract, I can’t rewrite it to show how parallelism on the GPU could provide a result faster. Assuming you have some array of data you need to do the same items of work on though, I’d try to write the GPU version to compute all the results in parallel. I’d also attempt to parallelise the CPU implementation, since you likely have several CPU cores you could be utilizing, and it also wouldn’t be a fair comparison to benchmark CPU single-threaded performance against GPU parallel performance.
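To make that concrete, here is a minimal sketch of the shape I mean, assuming your real problem boils down to applying the same operation to every element of an array (gpu_work, cpu_work and the 0.001 multiply are placeholders I’ve invented, not your actual calculation): one GPU thread per element, and a prange-parallelised CPU version that uses all cores.

import numpy as np
from numba import cuda, njit, prange


@cuda.jit
def gpu_work(data, out):
    # One thread per element; guard against threads past the end of the array.
    i = cuda.grid(1)
    if i < data.shape[0]:
        out[i] = data[i] * 0.001  # placeholder for the real per-element work


@njit(parallel=True)
def cpu_work(data, out):
    # prange lets Numba spread the iterations across all CPU cores.
    for i in prange(data.shape[0]):
        out[i] = data[i] * 0.001  # same placeholder work


data = np.arange(1_000_000, dtype=np.float64)
out = np.zeros_like(data)

d_data = cuda.to_device(data)
d_out = cuda.to_device(out)
threads_per_block = 256
blocks = (data.shape[0] + threads_per_block - 1) // threads_per_block

gpu_work[blocks, threads_per_block](d_data, d_out)
cuda.synchronize()

cpu_work(data, out)
np.testing.assert_allclose(d_out.copy_to_host(), out)

Timed this way, with both sides fully parallel, the comparison reflects what you would actually run, rather than a single artificially restricted thread.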

That said…

I have modified your example above to rectify some issues in it, to help demonstrate how to measure things with Numba and CUDA. My modified version of the code is:

import numpy as np
from numba import config, cuda, njit
from time import perf_counter
import sys

# Suppress low occupancy warnings due to tiny grid
config.CUDA_LOW_OCCUPANCY_WARNINGS = 0

if len(sys.argv) > 1:
    nloops = int(10 ** int(sys.argv[1]))
else:
    nloops = int(1e7)

print(f"Running with {nloops} loops...")

sig = '(float64[:,::1], int32)'


@cuda.jit(sig)
def mykernel(myarr, aa):
    ix, iy, iz = cuda.grid(3)
    val = 0.0
    for k in range(0, nloops):
        val += 0.001

    myarr[ix, iy] += val


# set n = m = 1 for a single thread calculation:
n = 1
m = 1

myarr = np.zeros((n, m), dtype=np.double)
d_myarr = cuda.to_device(myarr)

gpu_start = perf_counter()
mykernel[(n, 1, 1), (1, m, 1)](d_myarr, 123)
cuda.synchronize()
gpu_end = perf_counter()

myarr = d_myarr.copy_to_host()


@njit(sig)
def getValue(myarr, aa):
    ix, iy = 0, 0
    val = 0.0
    for k in range(0, nloops):
        val += 0.001
    myarr[ix, iy] += val


myarr = np.zeros((n, m), dtype=np.double)

cpu_start = perf_counter()
getValue(myarr, 123)
cpu_end = perf_counter()

np.testing.assert_allclose(d_myarr.copy_to_host(), myarr)

gpu_time = gpu_end - gpu_start
cpu_time = cpu_end - cpu_start

print(f"GPU time: {gpu_time}")
print(f"CPU time: {cpu_time}")
print(f"Ratio: {gpu_time / cpu_time}")

Notes on the changes:

  • The two functions weren’t really doing the same thing. I’ve tried to make them as similar as possible.
  • With lazy compilation, if you time the whole program (which I assume you might have been doing, as there’s no timing code in your examples) you are also measuring the compilation time. Although it’s not recommended in general, I added a signature to the decorators so that compilation happens prior to the timing and execution (a warm-up-call alternative is sketched just after this list).
  • The perf_counter() function should be used to record times, along with the use of cuda.synchronize() to ensure GPU execution is finished prior to recording the end time.
  • I’ve added a check that the results match.
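For reference, this is the warm-up-call alternative mentioned in the note above, as a self-contained sketch (add_one and add_one_cpu are toy functions I’ve made up, not the code above): call each compiled function once before the timed region so the lazy compilation cost is paid up front.

import numpy as np
from numba import cuda, njit
from time import perf_counter


@cuda.jit
def add_one(arr):
    i = cuda.grid(1)
    if i < arr.shape[0]:
        arr[i] += 1.0


@njit
def add_one_cpu(arr):
    for i in range(arr.shape[0]):
        arr[i] += 1.0


arr = np.zeros(1024, dtype=np.float64)
d_arr = cuda.to_device(arr)

# Warm-up calls trigger lazy compilation outside the timed region.
add_one[4, 256](d_arr)
cuda.synchronize()
add_one_cpu(arr)

# Reset the data so the timed runs start from the same state.
arr[:] = 0.0
d_arr = cuda.to_device(arr)

start = perf_counter()
add_one[4, 256](d_arr)
cuda.synchronize()  # make sure the kernel has finished before reading the clock
print(f"GPU time: {perf_counter() - start}")

start = perf_counter()
add_one_cpu(arr)
print(f"CPU time: {perf_counter() - start}")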

With these changes, I generally observe the single-thread performance ratio sitting at around 28 for large enough workloads:

$ python repro.py 7
Running with 10000000 loops...
GPU time: 0.32721387100036736
CPU time: 0.009616430000278342
Ratio: 34.026543217274636

$ python repro.py 8
Running with 100000000 loops...
GPU time: 2.7633673889999955
CPU time: 0.09569300299972383
Ratio: 28.877423660828896

$ python repro.py 9
Running with 1000000000 loops...
GPU time: 27.383181880999928
CPU time: 0.9571818690001237
Ratio: 28.608128473645785

(This is with a Quadro RTX 8000 and an i7-6700K). Extrapolating the single-thread performance across all the cores available (also not a very valid comparison, but makes the point that parallelism is needed for the GPU), one might expect:

Quadro RTX 8000 SMs (72) * cores per SM (64) = 4608
CPU cores: 4
Core ratio: 4608 / 4 = 1152

Aggregate performance available, GPU vs. CPU: core ratio (1152) / observed single-thread ratio (~30) ≈ 38.4

Now this calculation is very rough, assumes maximum performance, all kinds of ideal things going on, etc, but suggests there should be approximately 40 times more performance available from my GPU than my CPU - but not if I artificially restrict myself to a single thread.
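If you want to run the same back-of-the-envelope estimate on your own hardware, Numba can report the SM count directly (a small sketch; the cores-per-SM figure is architecture-specific and has to be looked up for your GPU generation, and cpu_cores and the observed ratio here are just the values from my runs above):

from numba import cuda

dev = cuda.get_current_device()
sm_count = dev.MULTIPROCESSOR_COUNT  # 72 on the Quadro RTX 8000 used here
cores_per_sm = 64                    # FP32 cores per SM; look this up for your architecture
cpu_cores = 4                        # physical cores on the i7-6700K used here

core_ratio = (sm_count * cores_per_sm) / cpu_cores
observed_single_thread_ratio = 30    # roughly, from the runs above
print(f"Core ratio: {core_ratio:.0f}")
print(f"Rough aggregate GPU advantage: {core_ratio / observed_single_thread_ratio:.1f}x")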


Thanks so much for another great answer! Just to clarify/summarize – I was naively expecting that my GPU vs CPU performance improvement would be equal to ((number of GPU threads) / (number of CPU threads)) * ((GPU clock speed) / (CPU clock speed)). However, it appears that both my example and your example show that there is (unfortunately) an extra factor of roughly 1/20 in that equation. Does that sound correct? I’m somewhat curious where this extra factor comes from, but as long as it is expected behavior then I am satisfied to know that I’m not doing something egregiously wrong in my code.

I’m somewhat curious where this extra factor comes from, but as long as it is expected behavior then I am satisfied to know that I’m not doing something egregiously wrong in my code.

CPUs and GPUs are architected completely differently to optimize for different goals.

Broadly, CPUs are engineered for extremely high single-thread performance, using many architectural features that increase the number of instructions that can be executed per cycle. This makes the cores very large and complex, so fewer of them fit on a die, but each core can race through a single stream of instructions very quickly.

GPUs are instead engineered for the largest possible parallelism: each individual core must be small so that many fit on a die, and many of the techniques CPU cores rely on (aggressive instruction-level parallelism, very large caches, superscalar / out-of-order execution, etc.) cannot be used. Single-thread performance is still a factor in the design, but it is not optimized at the expense of parallelism.

This is covered in brief in the introduction to the CUDA C++ Programming Guide.

For a more complete treatment of the architectures, Computer Architecture: A Quantitative Approach (6th Edition) is a fairly standard textbook. This lecture series is taught partially from the book (note I haven’t watched these lecture videos, but I did take this course many years ago): https://www.youtube.com/playlist?list=PLS6iudSQ-kj5YIhnM1nXxXyfjPBmqRy8l