# Understanding concurrency of streams in Numba CUDA correctly

I’m not sure I really understand how streams work with asynchronous kernel launches.

```python
import pandas as pd
from datetime import datetime, timedelta
from numba import cuda
import numba
import numpy as np
import math

@numba.jit
def min_max_of_arrays(x, Output):

    for i in range(x.shape[1]):

        Column = x[:, i]

        Maximum = Column[0]
        Minimum = Column[0]

        for j in Column:
            if j > Maximum:
                Maximum = j
            elif j < Minimum:
                Minimum = j

        Output[0, i] = Maximum
        Output[1, i] = Minimum

    return Output

@cuda.jit
def normalise_arrays(FeatureSlices, MinMaxArray, NormalisedArray):

    x, y = cuda.grid(2)

    if x < NormalisedArray.shape[0] and y < NormalisedArray.shape[1]:

        Min = MinMaxArray[1, y]
        Max = MinMaxArray[0, y]

        NormalisedArray[x, y] = (FeatureSlices[x, y] - Min) / (Max - Min)

# For this dummy example I'll use random numbers. I want to find the min and max of
# every column of every array in my list so that I can perform a min-max scaler on
# the "InputArrayList".

InputArrayList = [np.random.rand(60, 98450) for j in range(18)]

# I find the min and max of every column.

MinMaxArrayList = []
for FeatureSlice in InputArrayList:

    MinMaxArray = np.empty([2, 98450])

    min_max_of_arrays(FeatureSlice, MinMaxArray)

    MinMaxArrayList.append(MinMaxArray)

# Now that I have my arrays of mins and maxes, I can hand them to my kernel function
# using multiple streams across multiple GPUs. This is where I'm unsure of my
# understanding of how streams work. I don't know if using "del" at the end of each
# loop is really helping anything. I'm also unclear on how the loop manages to start
# a kernel on its own stream and then continue running to start a new kernel before
# the previous one has finished. I basically think I'm still running everything
# synchronously rather than asynchronously.

threadsperblock = (16, 16)

with cuda.defer_cleanup():

    NumberOfStreams = 18

    for i, (MinMax, arr) in enumerate(zip(MinMaxArrayList, InputArrayList)):

        if len(cuda.gpus) == 3:
            if i < NumberOfStreams / 3:
                cuda.select_device(0)
            elif i < 2 * NumberOfStreams / 3:
                cuda.select_device(1)
            else:
                cuda.select_device(2)

        stream = cuda.stream()

        with cuda.pinned(arr):

            # We put our input array on to the device.
            Input = cuda.to_device(arr, stream=stream)
            MinMaxInput = cuda.to_device(MinMax, stream=stream)

            # We create an empty array on the device, ready to be filled with all
            # the normalised columns from our input array.
            NormalisedArray = cuda.device_array(
                (60, 98450),
                dtype=Input.dtype, stream=stream)

            blockspergrid_x = math.ceil(NormalisedArray.shape[0] / threadsperblock[0])
            blockspergrid_y = math.ceil(NormalisedArray.shape[1] / threadsperblock[1])
            blockspergrid = (blockspergrid_x, blockspergrid_y)

            NormalisedArray.copy_to_host(arr, stream=stream)

        del arr, MinMaxInput, NormalisedArray
```
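As an aside, the per-column min/max loop (and the normalisation the kernel is meant to produce) can be verified on the host with plain NumPy; a small sketch, using a reduced column count purely for illustration:

```python
import numpy as np

# Small stand-in for one FeatureSlice (the real arrays are 60 x 98450).
arr = np.random.rand(60, 100)

# Row 0 holds the per-column maxima, row 1 the minima, matching the
# output layout of min_max_of_arrays.
minmax = np.stack([arr.max(axis=0), arr.min(axis=0)])

# Host-side reference for what normalise_arrays should produce:
# every column is rescaled to span [0, 1].
normalised = (arr - minmax[1]) / (minmax[0] - minmax[1])
```

Comparing the kernel's `NormalisedArray` against `normalised` like this is a quick sanity check that the device code is correct before worrying about stream concurrency.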

I’ve also used Nsight Systems to check what the GPUs are doing. It looks as though I’ve managed to use two of the GPUs, but the kernels seem to wait for one another: a kernel only starts once the previous one has finished.

The kernel also needs to be launched on the stream:

```python
normalise_arrays[blockspergrid, threadsperblock, stream](Input, MinMaxInput, NormalisedArray)
```

This isn’t very prominently documented, but passing a stream as part of the kernel launch configuration is mentioned here in the docs: