Traversing a matrix on a thread per row basis

JRibeiro · December 2, 2020, 7:12pm

I’m trying to implement a runs-based kernel where I need to traverse the elements of a 3D matrix on a row per thread basis i.e. each thread is assigned a row and iterates over all elements of that row.

For example, if, for simplicity, we were to use a 2D matrix with 50 rows and 100 columns, I would need to create 50 threads that would go through the 100 elements of their respective row.

Can someone give me some pointers on how to do this?

stuartarchibald · December 3, 2020, 10:28am

Hi @JRibeiro,

The way to do this would be to:

Set the maxium number of threads to e.g. 50 with this environment variable.

Run your code with a prange loop over the matrix row index. i.e.

from numba import njit, prange
import numpy as np

@njit(parallel=True)
def foo(nthreads, n):
    x = np.empty((nthreads, n))
    for i in prange(x.shape[0]):
        # Demonstrate with some example work
        for j in range(x.shape[1]):
            x[i, j] = j
    return x

print(foo(50, 4))

However, I’d perhaps consider whether this is a good strategy as having more threads than cores is not particularly cheap due to the amount of context switching and scheduling required. One pattern that’s often used is to have one thread per CPU core and let prange automatically divide the work up between them. For example if you had 4 cores and a matrix with 20 rows, each thread would end up on average working on 5 of the rows.

There’s more information on setting threads (in addition to the environment variable above) here.

Hope this helps?

JRibeiro · December 3, 2020, 2:14pm

Thanks a lot for the reply @stuartarchibald . I should have clarified further though.

This kernel is the be run on the GPU not the CPU. It is an implementation of a CCL algorithm originally presented in

This is why I’m specifically asking for this way of traversing a matrix on a GPU.
Just to give you an idea of what is required, let me show you a picture.

On the left you have a binary image of unspecified size. On the right there’s a spans matrix. This spans matrix holds these “spans” defined as (p1,p2) with p1 the first point in a sequence of 1s of a given row and p2 the last point of said sequence.
For instance, take the point (0,2) (shaded for convenience). It has span of ((0,2), (0,2)) i.e. we now know that there is a sequence or “span” of points with value 1 from (0,2) to (0,2) - the row is implicit as their height is always the same.
The idea is then to have a thread per row that iterates over the image matrix and creates the spans in the spans matrix.

Do you think you can help me with this?

Any help is appreciated

grlee77 · December 4, 2020, 7:34pm

It is not a Numba implementation, and the specific algorithm likely does not exactly match the one above, but if you can use CuPy there is a GPU-based CCL implementation matching scipy.ndimage.label at cupyx.scipy.ndimage.label. I think recent Numba releases can take CuPy arrays as input, so hopefully it would still be able to interoperate with any other Numba-based code you may have.

JRibeiro · December 11, 2020, 5:41pm

Thanks for the suggestion @grlee77. I can give that variant a shot.
It can be particularly useful to compare its performance to the one I’m trying to implement.
I’m also aware of a couple github repositories like

and

the latter of which contains a battery of 2D and 3D CCL algorithms both for the CPU and the GPU.

As an example of what I’m trying to achieve, here’s a very simple kernel that traverses a 2D matrix in a thread per row order and stores each row’s amount of spans:

from numba import cuda
from math import ceil
import numpy as np

@cuda.jit
def find_spans(arr, out):
    row = cuda.grid(1)
    cc_found = False
    if row < arr.shape[0]:
        for column in range(arr.shape[1]):
            if arr[row, column] == 1 and not cc_found:  # Connected Component found
                cc_found = True
                out[row] = out[row] + 1
            elif arr[row, column] == 0 and cc_found:
                cc_found = False

# Some binary 2D array and a simple structure to hold the spans
# Notice that the size of this structure is equal to the amount of rows
binary_array = np.random.randint(2, size=(500,500))
amount_of_spans = np.zeros(binary_array.shape[0])

threads = 32
find_spans[ceil(binary_array.shape[0] / threads), threads](binary_array, amount_of_spans)

Topic		Replies	Views
Scaling of prange parallelization in long calculation Community Support	3	546	August 12, 2021
Possible reason for prange slowdown Community Support	0	67	April 15, 2024
Weird parallel prange behaviour Community Support	14	2121	July 22, 2020
Parallelise large loops inside jitted function Support: How do I do ...?	1	216	July 19, 2023
Prange, race condition and writing to arrays : part 2 Support: How do I do ...?	6	274	January 27, 2023

Traversing a matrix on a thread per row basis

Related Topics