Understanding CUDA Streams

Let’s say I have a very large n x m matrix and a single kernel that processes each of the n rows of the matrix. However, the (i+1)th row depends on the ith row (i.e., the ith row must be processed first), so, currently, I launch a new kernel to process each row:

for i in range(n):
    process_ith_row_kernel[blocks_per_grid, threads_per_block](i)

Note that the columns within the ith row can be processed asynchronously; only the rows need to be processed synchronously (one after another). This is fine, but when the number of rows, n, is large, I end up spending a lot of time just launching kernels. I have looked at other ways to break free from the sequential kernel invocations, but they rely on atomically updating two variables at the same time (which is not possible), and the number of atomic operations becomes very large when the data is large.
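
For concreteness, here is a simplified sketch of what such a kernel might look like - this isn’t my real code (the names and the row update are made up), but it shows the pattern: one thread per column, with row i reading values from row i - 1 that the previous launch has already computed. Here, d_matrix would be a device array that the loop also passes into each launch:

from numba import cuda


@cuda.jit
def process_ith_row_kernel(d_matrix, i):
    # One thread per column of row i
    j = cuda.grid(1)
    if i > 0 and j < d_matrix.shape[1]:
        # Row i only reads row i - 1, which the previous kernel launch
        # has already finished computing
        d_matrix[i, j] = d_matrix[i - 1, j] + d_matrix[i, j]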

I came across Numba CUDA streams, whose documentation mentions a command queue. However, it isn’t clear whether something like this would still suffer from the same kernel-launch overhead described above:

stream = cuda.stream()
with stream.auto_synchronize():
    for i in range(n):
        process_ith_row_kernel[blocks_per_grid, threads_per_block, stream](i)

Can anybody clarify what a stream would be doing in this case, or suggest a more appropriate way to get help optimizing my Numba CUDA code?

Launching kernels is asynchronous (see https://numba.readthedocs.io/en/stable/cuda/kernels.html#kernel-invocation). For example, running the following:

from numba import cuda, float64
import numpy as np
import math
import time


# A kernel that takes some time to run
@cuda.jit
def f(x):
    tmp = x[0]
    for i in range(100000):
        tmp += math.sin(float64(i))
    x[0] = tmp


x = np.zeros(1)
d_x = cuda.to_device(x)

# First call compiles the kernel and warms things up
f[1, 1](d_x)
cuda.synchronize()  # make sure the warm-up kernel has finished before timing

# Time the second call
start = time.perf_counter()
f[1, 1](d_x)
after_launch = time.perf_counter()

cuda.synchronize()
after_sync = time.perf_counter()

print(f"Time after launch: {after_launch - start}")
print(f"Time after sync: {after_sync - start}")

results in:

$ python repro.py 
Time after launch: 0.00016877400048542768
Time after sync: 0.13036533699778374

The cuda.synchronize() call is required to make the CPU wait until all pending kernels have finished execution.

I can think of two possible causes for the code apparently being synchronous:

  1. The queue depth for pending kernels is finite - it’s possible that you’re filling up the queue if you have a lot of rows.
  2. Something is happening that’s forcing synchronization - for example, if some implicit data transfer occurs due to an input / output on the host. Is your loop exactly as simple as:
for i in range(n):
    process_ith_row_kernel[blocks_per_grid, threads_per_block](i)

or is it possible that you’re doing a synchronous operation in the loop, or accidentally passing some data from the host? (A short sketch of that host-array pitfall follows the log excerpt below.) Running with NUMBA_CUDA_LOG_LEVEL=DEBUG should help identify whether something like this is going on - if your loop is only launching kernels asynchronously, you should see an unbroken stream of cuCtxGetCurrent / cuCtxGetDevice / cuLaunchKernel calls, like:

== CUDA [483] DEBUG -- call driver api: cuCtxGetCurrent
== CUDA [483] DEBUG -- call driver api: cuCtxGetDevice
== CUDA [483] DEBUG -- call driver api: cuLaunchKernel
== CUDA [483] DEBUG -- call driver api: cuCtxGetCurrent
== CUDA [483] DEBUG -- call driver api: cuCtxGetDevice
== CUDA [483] DEBUG -- call driver api: cuCtxGetCurrent
== CUDA [483] DEBUG -- call driver api: cuCtxGetDevice
== CUDA [483] DEBUG -- call driver api: cuLaunchKernel
== CUDA [483] DEBUG -- call driver api: cuCtxGetCurrent
== CUDA [483] DEBUG -- call driver api: cuCtxGetDevice
== CUDA [483] DEBUG -- call driver api: cuCtxGetCurrent
== CUDA [483] DEBUG -- call driver api: cuCtxGetDevice
== CUDA [484] DEBUG -- call driver api: cuLaunchKernel
== CUDA [484] DEBUG -- call driver api: cuCtxGetCurrent
== CUDA [484] DEBUG -- call driver api: cuCtxGetDevice
...
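
To illustrate the second cause: if you pass a host NumPy array to the kernel, Numba copies it to the device before the launch and copies it back afterwards, which forces the host to wait for the kernel on every iteration (and you’d see extra memory-transfer driver calls between the cuLaunchKernel entries in the log). Here’s a minimal sketch of the difference - the kernel and sizes are made up, not your code:

import numpy as np
from numba import cuda


@cuda.jit
def add_one(x):
    i = cuda.grid(1)
    if i < x.size:
        x[i] += 1.0


host_arr = np.zeros(1024)
d_arr = cuda.to_device(host_arr)

# Asynchronous: the kernel gets a device array, so each launch just
# queues work and returns to the host immediately
for i in range(100):
    add_one[4, 256](d_arr)

# Effectively synchronous: Numba transfers host_arr to the device before
# each launch and back to the host afterwards, so the CPU waits for every
# kernel to finish
for i in range(100):
    add_one[4, 256](host_arr)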

Streams

Using a stream for the example you posted above would not make any difference - the behaviour should be the same. Streams are more useful for overlapping sequences of operations, rather than making a sequence of operations run asynchronously with respect to the host, which can be done without streams. See e.g. https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/ for an introduction to the use of streams for overlapping data transfer and kernel launches.
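
As a rough sketch of what that looks like in Numba (illustrative only - the kernel, sizes, and names are made up): each chunk of work gets its own stream so that its copy-in, kernel, and copy-out can overlap with those of other chunks, and the host buffers are pinned so the copies can be truly asynchronous:

import numpy as np
from numba import cuda


@cuda.jit
def scale(x, factor):
    i = cuda.grid(1)
    if i < x.size:
        x[i] *= factor


n_chunks = 4
chunk_size = 1 << 20
threads = 256
blocks = (chunk_size + threads - 1) // threads

# Pinned (page-locked) host buffers allow asynchronous copies
host_chunks = [cuda.pinned_array(chunk_size, dtype=np.float64)
               for _ in range(n_chunks)]
for chunk in host_chunks:
    chunk[:] = 1.0

streams = [cuda.stream() for _ in range(n_chunks)]

# Queue copy-in, kernel, and copy-out for each chunk on its own stream;
# work queued on different streams may overlap on the device
d_chunks = []
for chunk, stream in zip(host_chunks, streams):
    d_chunk = cuda.to_device(chunk, stream=stream)
    scale[blocks, threads, stream](d_chunk, 2.0)
    d_chunk.copy_to_host(chunk, stream=stream)
    d_chunks.append(d_chunk)  # keep device buffers alive until we synchronize

cuda.synchronize()  # wait for all streams to finish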

Thank you @gmarkall. More concretely, this is the function in our STUMPY GPU code that I’m trying to speed up. Would you mind taking a look and letting me know if you see anything that could be done to improve its performance? I know you’re super busy, so absolutely no pressure.

Just to reiterate, in our current implementation it is important that the rows are processed sequentially, one after another. However, within a row, each column can be processed independently on a different thread. Any help or guidance would be greatly appreciated!

I can take a look but I’m not familiar with STUMPY - is there a quick list of directions I can use to execute that kernel / reproduce the performance issue?

Awesome! Let me spin up a GitHub issue with some instructions for getting things up and running, along with a simple example that executes that kernel. Will that work?

The conversation is continued in this STUMPY GitHub issue.

That sounds great!