Variable not changing inside while loop in GPU

I’m seeing very strange behavior inside a while loop on the GPU.
I don’t get an error, but I do get an unexpected result that makes no sense given the logic inside the loop:

@cuda.jit(device=True)
def union(labels_matrix, p1, p2):
    done = False

    while not done:
        p1_t, p2_t = find(labels_matrix, p1), find(labels_matrix, p2)
        if p1_t[1] < p2_t[1]:
            old = cuda.atomic.min(labels_matrix, p2_t[0], p1_t[1])
            done = old == p2_t[1]
            p2 = get_3d_index(labels_matrix.shape, old)
        elif p2_t[1] < p1_t[1]:
            old = cuda.atomic.min(labels_matrix, p1_t[0], p2_t[1])
            done = old == p1_t[1]
            p1 = get_3d_index(labels_matrix.shape, old)
        else:
            done = True

The only condition to exit the loop is for the variable done to be true.
As an example, according to some prints, p1_t = ((0, 2, 2), 1442) and p2_t = ((0, 0, 2), 2). It should exit the loop after only one iteration but, for some reason:
[screenshot of the printed output]

The same thing happens even if I add done = True as the last instruction after the if/else sequence.
Can someone help me with this?

It’s very hard to help based on just the code of the device function and the provided screenshot. How did you modify the device function to print? What’s the kernel that calls the device function? How was the kernel invoked?

Thanks for the reply :)

The kernel is called merge. I’m (trying :) ) to implement a block-based union-find CCL algorithm, hence the union and find functions you see above.
Long story short, I have an unsigned 64-bit integer bitmask where each bit represents a coordinate in a 4x4x4 matrix: the neighborhood of a block (blocks are all 2x2x2). If a certain bit is set and is relevant to the neighborhood, I call union like this, for example:

if has_bit(bitmask, 0) and image[z - 1, y - 1, x - 1] == 0:
    union(labels_matrix, (z, y, x), (z - 2, y - 2, x - 2))
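
The definition of has_bit is not shown in the thread. As a minimal sketch, assuming bit i of the mask corresponds to position i of the flattened 4x4x4 neighborhood with x varying fastest (both the bit ordering and the bit_to_coord helper are assumptions for illustration), it might look like this in plain Python. On the device, such helpers would additionally carry the @cuda.jit(device=True) decorator:

```python
def has_bit(bitmask, pos):
    """Return True if bit `pos` (0..63) of the 64-bit mask is set."""
    return ((bitmask >> pos) & 1) == 1


def bit_to_coord(pos):
    """Hypothetical mapping from a bit index to (z, y, x) inside the
    4x4x4 neighborhood, assuming x varies fastest."""
    return pos // 16, (pos // 4) % 4, pos % 4


mask = (1 << 0) | (1 << 21)          # bits 0 and 21 set
print(has_bit(mask, 0), has_bit(mask, 1), bit_to_coord(21))
```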

The kernel invocation itself is simple. With hys_res being some 3D image and hys_res_zeros its labeled copy (d_image and labels are their device counterparts respectively):

threadsperblock = (8, 4, 4)  # 128 threads
blockspergrid = (ceil((hys_res_zeros.shape[0] / 2) / threadsperblock[0]),
                 ceil((hys_res_zeros.shape[1] / 2) / threadsperblock[1]),
                 ceil((hys_res_zeros.shape[2] / 2) / threadsperblock[2]))

d_image = cuda.to_device(hys_res)
labels = cuda.to_device(hys_res_zeros)
buf_merge[blockspergrid, threadsperblock](d_image, labels)
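
To make the grid arithmetic above concrete, here is a small CPU-only sketch. The blocks_per_grid helper and the example shape (64, 100, 100) are made up for illustration; the formula is the one from the post, where each thread handles one 2x2x2 block, so each axis needs shape / 2 threads:

```python
from math import ceil


def blocks_per_grid(shape, threads_per_block, block_size=2):
    """One thread per 2x2x2 block: cover shape / block_size threads per axis."""
    return tuple(ceil((dim / block_size) / tpb)
                 for dim, tpb in zip(shape, threads_per_block))


threadsperblock = (8, 4, 4)          # 128 threads, as in the post
shape = (64, 100, 100)               # hypothetical image shape
print(blocks_per_grid(shape, threadsperblock))
```

Because the grid is rounded up, some threads can fall outside the image (here 13 * 4 = 52 threads cover 50 blocks along y and x), so the kernel would typically need a bounds check before touching the arrays.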

Here’s what I tried:

  • Changing done to 0/1 instead of True/False just to see if it would make a difference - it didn’t

  • Unpacking the results of find just to make sure the calculations were correct - they are

  • Printing done inside the if/else branches to make sure it’s 1/True - it is 1/True

These are all seen in the screenshot above.

@cuda.jit(device=True)
def union(labels_matrix, p1, p2):
    done = 0

    while not done:
        if done:
            print('done is True')
        else:
            print('done is False')
        p1_t, p2_t = find(labels_matrix, p1), find(labels_matrix, p2)
        a1, p1idx = p1_t
        p1z, p1y, p1x = a1
        a2, p2idx = p2_t
        p2z, p2y, p2x = a2
        print('p1:', p1z, p1y, p1x, p1idx, '| p2:', p2z, p2y, p2x, p2idx)
        if p1_t[1] < p2_t[1]:
            old = cuda.atomic.min(labels_matrix, p2_t[0], p1_t[1])
            if old == p2_t[1]:
                done = 1
                print('old is equal to previous label | done =', done, '\n')
            p2 = get_3d_index(labels_matrix.shape, old)
        elif p2_t[1] < p1_t[1]:
            old = cuda.atomic.min(labels_matrix, p1_t[0], p2_t[1])
            if old == p1_t[1]:
                done = 1
                print('old is equal to previous label | done =', done, '\n')
            p1 = get_3d_index(labels_matrix.shape, old)
        else:
            done = 1

Is this helpful?

How do you know which threads the lines of output are coming from?

I’m restricting the guard in the merge kernel:

@cuda.jit()
def buf_merge(image, labels_matrix):
    z, y, x = cuda.grid(3)
    z, y, x = z * 2, y * 2, x * 2

    if z == 0 and y == 2 and x == 2:
        ...

Argh! It ended up being a simple error on my part. Sorry for the confusion.
Turns out I was calling union several times. It had nothing to do with the while loop or the logic inside it.
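
The resolution can be illustrated with a sequential, CPU-only sketch. This is not the poster's code: flat integer indices replace the 3D coordinates, atomic_min is a plain-Python stand-in for cuda.atomic.min, and the call_id and log parameters are hypothetical additions. Tagging every logged line with a call id and an iteration counter shows whether repeated output comes from one loop that fails to terminate or from several separate calls that each exit after a single iteration:

```python
def find(labels, i):
    """Follow parent links until reaching a root (labels[r] == r)."""
    while labels[i] != i:
        i = labels[i]
    return i


def atomic_min(labels, idx, value):
    """Sequential stand-in for cuda.atomic.min: store the min, return the old value."""
    old = labels[idx]
    labels[idx] = min(old, value)
    return old


def union(labels, a, b, log, call_id):
    """Same loop structure as the device function, on flat indices."""
    done = False
    iteration = 0
    while not done:
        iteration += 1
        ra, rb = find(labels, a), find(labels, b)
        log.append((call_id, iteration, ra, rb))   # tagged "print"
        if ra < rb:
            old = atomic_min(labels, rb, ra)
            done = old == rb
            b = old
        elif rb < ra:
            old = atomic_min(labels, ra, rb)
            done = old == ra
            a = old
        else:
            done = True


labels = list(range(8))              # every element starts as its own root
log = []
union(labels, 5, 2, log, call_id=0)
union(labels, 6, 5, log, call_id=1)
union(labels, 7, 6, log, call_id=2)
for entry in log:
    print(entry)
```

Each call logs exactly one line with iteration 1. Without the call id, the three lines could easily be mistaken for three iterations of a single loop.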