Unusual 20x slowdown between nearly identical calculations with CUDA

Greetings. I’m new to CUDA, so I hope this isn’t a stupid question. The image below shows code for three nearly identical calculations. I’m completely baffled as to why example (A) takes nearly 20x longer to run than the other two. This is of course just a simplified example of something else that I’m working on. In my real application I’m not able to simply switch to example (B) or (C). Any insights would be greatly appreciated.


Here’s a text version for copying/pasting:

import numpy as np
import math
from numba import cuda, f8, uint32


@cuda.jit((f8[:,:], uint32))
def mykernel(myarr, aa):
    ix, iy, iz = cuda.grid(3)

    bb = 0.0
    for k in range(0, 2500):
        bb += 0.01

    myarr[ix, iy] += math.cos(bb)


n = 5000
m = 10000

myarr = np.zeros((n,m), dtype=np.double)

d_myarr = cuda.to_device(myarr)
mykernel[(n, m, 1), (1, 1, 1)](d_myarr, 5)
myarr = d_myarr.copy_to_host()

After posting this I realized I should probably provide a few more details. The speed of the math.cos calculation seems to depend on the history of how the variable “bb” was calculated, which seems very strange to me. In example A, “bb” is calculated by starting at 0.0 and adding 0.01 to it 2500 times until it reaches 25.0. In example B, “bb” is calculated by simply doing 2500*0.01. And in example C, “bb” is calculated the same way as in A, but it is never used again afterwards. Again, any guidance would be very much appreciated. Thank you!
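
Since only example A is pasted above, here’s roughly what the kernel bodies for B and C look like, reconstructed from the description (the exact code is in the image, so treat these as a sketch rather than the literal code):

import math
from numba import cuda, f8, uint32


# Example B (reconstructed): bb is computed directly, with no loop
@cuda.jit((f8[:,:], uint32))
def mykernel_b(myarr, aa):
    ix, iy, iz = cuda.grid(3)

    bb = 2500 * 0.01

    myarr[ix, iy] += math.cos(bb)


# Example C (reconstructed): bb is computed with the same loop as in A,
# but it is never read afterwards, so the loop contributes nothing to the result
@cuda.jit((f8[:,:], uint32))
def mykernel_c(myarr, aa):
    ix, iy, iz = cuda.grid(3)

    bb = 0.0
    for k in range(0, 2500):
        bb += 0.01

    myarr[ix, iy] += math.cos(aa)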

What’s happening in the three cases:

A. The loop executes, and accounts for the majority of execution time.
B. The loop is not present, so execution time is greatly reduced.
C. The loop exists in your code, but the compiler spots that the loop isn’t doing anything useful, so it removes it, and the compiled code is the same as (or extremely similar to) case B - you can confirm this by looking at the generated PTX, as shown below.
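
If you want to verify the dead-code elimination yourself, you can dump the PTX that Numba generates and look for the loop - a minimal sketch, assuming a reasonably recent Numba where CUDA dispatchers expose inspect_asm():

# Print the PTX generated for each compiled signature of the kernel.
# For case A you should see a backwards branch implementing the loop;
# for case C the loop should be gone and the code should closely match case B.
for sig, ptx in mykernel.inspect_asm().items():
    print(sig)
    print(ptx)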

I’m guessing that this is a minimal reproducer of something else you are actually doing (and I really appreciate you taking the time to craft this minimal example, because it makes it so much easier to analyse the situation and provide some feedback), so it’s hard to recommend a single path by which you could speed things up - are you having trouble getting the performance you need, or is this more of an observation / question?

Ah, I just spotted that your grid configuration is very inefficient - you’re launching with one thread per block, which will have a huge impact on performance - let me craft an example of a more efficient configuration…

Here’s a timed comparison of your original version and a version that uses a much more efficient mapping of threads to work items:

import numpy as np
import math
from numba import cuda, f8, uint32
from time import perf_counter

n = 5000
m = 10000


# Original version

@cuda.jit((f8[:,:], uint32))
def mykernel(myarr, aa):
    ix, iy, iz = cuda.grid(3)

    bb = 0.0
    for k in range(0, 2500):
        bb += 0.01

    myarr[ix, iy] += math.cos(bb)


# Create arrays and transfer to device
myarr = np.zeros((n, m), dtype=np.float64)
d_myarr = cuda.to_device(myarr)

# Time kernel execution - we synchronize after the launch to ensure we wait for
# the kernel to finish running, because launches are asynchronous.
start = perf_counter()
mykernel[(n, m, 1), (1, 1, 1)](d_myarr, 5)
cuda.synchronize()
end = perf_counter()

original_time = end - start

# Copy results to host for verification
myarr_original = d_myarr.copy_to_host()


# Optimized version

@cuda.jit((f8[:,:], uint32))
def mykernel_optimized(myarr, aa):
    ix = cuda.blockIdx.x
    iy = cuda.threadIdx.x
    ny = cuda.blockDim.x

    for y in range(iy, myarr.shape[1], ny):
        bb = 0.0
        for k in range(0, 2500):
            bb += 0.01

        myarr[ix, y] += math.cos(bb)


# Create arrays and transfer to device
myarr = np.zeros((n, m), dtype=np.float64)
d_myarr = cuda.to_device(myarr)

# A reasonable number of threads per block
block_dim = 256

# Time the optimized version
start = perf_counter()
mykernel_optimized[n, block_dim](d_myarr, 5)
cuda.synchronize()
end = perf_counter()

optimized_time = end - start

# Copy results to host
myarr_optimized = d_myarr.copy_to_host()

# Ensure that the results agree before we print the results
np.testing.assert_allclose(myarr_original, myarr_optimized)

print(f"Original version:  {original_time}")
print(f"Optimized version: {optimized_time}")

Some notes:

  • Your original configuration launched n * m blocks with one thread each. Since each thread block executes as at least one warp (32 threads), 31 of those 32 threads sit idle. In practice you also want several warps per block, so that the device can switch between them when one stalls on a load.
  • The modified version launches n blocks, each with 256 threads (256 is a reasonable number to try in the absence of other constraints). Each block is assigned to a single row of the output (the ix variable), and each thread within a block handles several elements of that row, working through them in a block-size-strided loop.
  • To time the code, I surround only the kernel launch and a synchronize with calls to perf_counter() - this is to avoid measuring any overhead of transferring data and only compare the kernel times. The synchronize is needed because launches are asynchronous, so this forces the CPU to wait for the kernel to finish executing before recording the end time.
  • The assert at the end is used to make sure we’re comparing like-for-like output.
  • I’ve left the signature on the @cuda.jit decorator ((f8[:,:], uint32)). In general I’d recommend not supplying a signature, because Numba can work it out and supplying one just introduces the possibility of errors (e.g. declaring a signature that doesn’t match the arguments). However, I’ve kept it here because providing the signature forces Numba to compile when the function is defined rather than on the first call - this keeps the compilation time out of the timing in this example. An alternative approach is sketched below.
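
If you’d prefer to omit the signature, one common alternative (just a sketch, not something from the code above - it reuses n, block_dim, d_myarr and perf_counter from the script) is to let Numba compile lazily and make a warm-up launch before starting the timer, so the first-call compilation cost is excluded from the measurement:

# Lazily-compiled variant of the optimized kernel - no signature supplied,
# so Numba infers the argument types on the first call.
@cuda.jit
def mykernel_lazy(myarr, aa):
    ix = cuda.blockIdx.x
    iy = cuda.threadIdx.x
    ny = cuda.blockDim.x

    for y in range(iy, myarr.shape[1], ny):
        bb = 0.0
        for k in range(0, 2500):
            bb += 0.01

        myarr[ix, y] += math.cos(bb)


# Warm-up launch: triggers compilation for these argument types. Note that it
# also updates d_myarr, so reset the array afterwards if you want to verify
# the results as well as the timing.
mykernel_lazy[n, block_dim](d_myarr, 5)
cuda.synchronize()

# Timed launch: compilation has already happened, so only kernel execution
# time is measured.
start = perf_counter()
mykernel_lazy[n, block_dim](d_myarr, 5)
cuda.synchronize()
end = perf_counter()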

On my system, the original/optimized comparison above prints:

Original version:  17.29377248399942
Optimized version: 0.5464405929997156

A speedup of 31.64x, in almost exact agreement with the first kernel using only 1 of the 32 available threads in each warp.

Thank you so much! I didn’t realize compilers are smart enough to skip over parts of code that aren’t useful. This definitely makes so much more sense now. I also appreciate the additional tip about grid configuration - that’s another topic I’m still in the process of learning about. I do have a few follow-up questions about GPU performance/optimization, but I may open a new thread for those so we don’t deviate too much from the original topic here. Thank you again!
