I’m trying to implement a runs-based kernel where I need to traverse the elements of a 3D matrix on a row-per-thread basis, i.e. each thread is assigned a row and iterates over all the elements of that row.

For example, if, for simplicity, we were to use a 2D matrix with 50 rows and 100 columns, I would need to create 50 threads that would go through the 100 elements of their respective row.

Can someone give me some pointers on how to do this?

Run your code with a prange loop over the matrix row index, e.g.

from numba import njit, prange
import numpy as np
@njit(parallel=True)
def foo(nthreads, n):
    x = np.empty((nthreads, n))
    for i in prange(x.shape[0]):
        # Demonstrate with some example work
        for j in range(x.shape[1]):
            x[i, j] = j
    return x

print(foo(50, 4))

However, I’d perhaps consider whether this is a good strategy, as having more threads than cores is not particularly cheap due to the amount of context switching and scheduling required. One pattern that’s often used is to have one thread per CPU core and let prange automatically divide the work up between them. For example, if you had 4 cores and a matrix with 20 rows, each thread would on average end up working on 5 of the rows.
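Just to illustrate that split (the scheduling itself is internal to Numba, and `split_rows` is my own sketch, not a Numba API), a static division of 20 rows across 4 threads looks roughly like this:

```python
# Illustrative only: prange's real scheduling is internal to Numba,
# but a static split of n_rows across n_threads looks roughly like this.
def split_rows(n_rows, n_threads):
    # Each thread gets a contiguous chunk of rows; earlier threads
    # absorb the remainder when n_rows isn't divisible by n_threads.
    base, extra = divmod(n_rows, n_threads)
    chunks, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

for t, rows in enumerate(split_rows(20, 4)):
    print(f"thread {t}: rows {rows.start}..{rows.stop - 1}")  # 5 rows each
```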

There’s more information on setting threads (in addition to the environment variable above) here.

Thanks a lot for the reply @stuartarchibald. I should have clarified further, though.

This kernel is to be run on the GPU, not the CPU. It is an implementation of a CCL (connected-component labelling) algorithm originally presented in

This is why I’m specifically asking for this way of traversing a matrix on a GPU.
Just to give you an idea of what is required, let me show you a picture.

On the left you have a binary image of unspecified size. On the right there’s a spans matrix. This spans matrix holds “spans”, defined as (p1, p2), with p1 the first point in a sequence of 1s in a given row and p2 the last point of that sequence.
For instance, take the point (0,2) (shaded for convenience). Its span is ((0,2), (0,2)), i.e. we now know that there is a sequence or “span” of points with value 1 from (0,2) to (0,2) - the row is implicit, as all points in a span share it.
The idea is then to have a thread per row that iterates over the image matrix and creates the spans in the spans matrix.
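To make that concrete, here is a plain-Python sketch of the per-row pass over the image (the function name and output layout are my own, not from the paper; the GPU version would do one row per thread instead of this outer loop):

```python
import numpy as np

def row_spans(image):
    # For each row, collect (first_column, last_column) pairs for every
    # maximal run of 1s; the row index is implicit, as in the spans matrix.
    spans = []
    for row in image:
        row_result, start = [], None
        for col, value in enumerate(row):
            if value == 1 and start is None:
                start = col                          # a new span begins
            elif value == 0 and start is not None:
                row_result.append((start, col - 1))  # the span just ended
                start = None
        if start is not None:                        # span runs to the row's end
            row_result.append((start, len(row) - 1))
        spans.append(row_result)
    return spans

image = np.array([[0, 0, 1, 0],
                  [1, 1, 0, 1]])
print(row_spans(image))  # [[(2, 2)], [(0, 1), (3, 3)]]
```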

It is not a Numba implementation, and the specific algorithm likely does not exactly match the one above, but if you can use CuPy there is a GPU-based CCL implementation matching scipy.ndimage.label at cupyx.scipy.ndimage.label. I think recent Numba releases can take CuPy arrays as input, so it should still be able to interoperate with any other Numba-based code you may have.
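For reference, here is the CPU counterpart it mirrors; as far as I know the GPU version has the same interface, so swapping `scipy` for `cupyx.scipy` and NumPy arrays for CuPy arrays should be the main change (I haven’t verified the CuPy side here):

```python
import numpy as np
from scipy import ndimage

image = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 0],
                  [1, 0, 0, 0]])

# label() returns an array where each connected component of 1s gets a
# distinct integer id, plus the number of components found.
labels, num_features = ndimage.label(image)
print(num_features)  # 2 with the default 4-connectivity
print(labels)
```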

Thanks for the suggestion @grlee77. I can give that variant a shot.
It would be particularly useful to compare its performance to the one I’m trying to implement.
I’m also aware of a couple github repositories like

and

the latter of which contains a battery of 2D and 3D CCL algorithms for both the CPU and the GPU.

As an example of what I’m trying to achieve, here’s a very simple kernel that traverses a 2D matrix in thread-per-row order and stores each row’s number of spans:

from numba import cuda
from math import ceil
import numpy as np

@cuda.jit
def find_spans(arr, out):
    row = cuda.grid(1)
    cc_found = False
    if row < arr.shape[0]:
        for column in range(arr.shape[1]):
            if arr[row, column] == 1 and not cc_found:  # Connected component found
                cc_found = True
                out[row] = out[row] + 1
            elif arr[row, column] == 0 and cc_found:
                cc_found = False

# Some binary 2D array and a simple structure to hold the spans
# Notice that the size of this structure is equal to the amount of rows
binary_array = np.random.randint(2, size=(500, 500))
amount_of_spans = np.zeros(binary_array.shape[0])

threads = 32
find_spans[ceil(binary_array.shape[0] / threads), threads](binary_array, amount_of_spans)
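As a host-side sanity check for a kernel like this (my own helper, not part of the algorithm): the per-row span count equals the number of 0→1 transitions along the row once a zero column is prepended, which is easy to compute in pure NumPy and compare against the kernel’s output.

```python
import numpy as np

def count_spans_numpy(binary_array):
    # Pad each row with a leading 0 so a span starting at column 0 still
    # shows up as a 0 -> 1 transition, then count those transitions per row.
    padded = np.pad(binary_array, ((0, 0), (1, 0)))
    return np.count_nonzero(np.diff(padded, axis=1) == 1, axis=1)

binary_array = np.array([[0, 1, 1, 0, 1],
                         [1, 1, 1, 1, 1],
                         [0, 0, 0, 0, 0]])
print(count_spans_numpy(binary_array))  # [2 1 0]
```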