CUDA Atomic Operations On Multiple Values

I have an iterative function that repeatedly returns a floating-point value, x, and an integer, y, that represents an array index. You can think of x and y as a min() and argmin() pair. Each CUDA thread frequently checks and replaces P and I like this:

if x < P:    # the new candidate value beats the current minimum
    P = x    # update the minimum value
    I = y    # update its array index

I understand that I can perform an atomic min to update P with x, but I am concerned that I have a race condition here, and I am hoping that there is some way to lock both P and I.

Another trick would be to encode P in 32 bits and I in 32 bits, join them together into a single 64-bit value, and then perform an atomic min on that combined value. But I haven’t seen this done anywhere.

Is there a better way to accomplish this?


I can’t think of a way to do this unless you take a lock and, inside the critical section, compute and update P and I.

You can either write a lock with CAS, or do what you suggested:

Use .view to reinterpret-cast the 32-bit float as a uint32, extend it to uint64, and bitwise-join the two numbers. Then use CAS to update a single uint64 slot holding the combined P and I.
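
To make that concrete, here is a minimal sketch of the packed-uint64 CAS loop in Numba CUDA. It assumes a recent Numba where numba.cuda.libdevice (used here for the float/int bit reinterpretation in device code) is available and where cuda.atomic.compare_and_swap supports uint64 operands; pack_pair, unpack_min, and atomic_min_pair are hypothetical helper names, and slot is a one-element uint64 device array that you would initialize on the host to the packed (inf, sentinel-index) pair before launching the kernel.

from numba import cuda, int32, uint32, uint64
from numba.cuda import libdevice


@cuda.jit(device=True)
def pack_pair(x, y):
    # x is assumed to be a float32 and y a non-negative index that fits in 32 bits.
    # Reinterpret the float32 bits as an unsigned int (like .view on the host),
    # put them in the upper 32 bits and the index in the lower 32 bits.
    xbits = uint64(uint32(libdevice.float_as_int(x)))
    return (xbits << uint64(32)) | uint64(uint32(y))


@cuda.jit(device=True)
def unpack_min(packed):
    # Recover the float32 stored in the upper 32 bits.
    return libdevice.int_as_float(int32(packed >> uint64(32)))


@cuda.jit(device=True)
def atomic_min_pair(slot, x, y):
    # slot is a one-element uint64 device array holding the packed (P, I).
    proposed = pack_pair(x, y)
    current = slot[0]
    while x < unpack_min(current):
        # compare_and_swap returns the value that was actually in slot[0];
        # if it matches what we read, our packed pair was installed.
        observed = cuda.atomic.compare_and_swap(slot, current, proposed)
        if observed == current:
            break
        current = observed  # another thread got there first -- re-check and retry

Because the retry loop re-reads slot[0] whenever the CAS fails, the float comparison is always done against the latest value in float space, so no separate lock is needed.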


@sklam Aside from atomic operations and reductions, do you know of any tricks/research on alternative programming paradigms for situations where you may have multiple threads updating the same array element?

@seanlaw, the alternatives often involve rethinking the algorithm; for example, going from multiple threads writing to one location to one thread reading from multiple locations. The moderngpu library by Sean Baxter at NVIDIA Research has a lot of amazing ideas. However, I found that moderngpu has a steep learning curve because its ideas are unconventional compared to how I was trained. Also, Numba’s CUDA support is not expressive enough to build reusable components from the ideas in moderngpu.

I forgot to mention the advantages of moderngpu. I have implemented ad-hoc Python versions of its algorithms (e.g. load-balancing search) to solve parallel problems with synchronization issues. The ideas from moderngpu allowed me to achieve the required performance, which is otherwise blocked by synchronization. The techniques can be applied beyond the GPU.
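
As a simplified illustration of turning many-writers-one-location into one-reader-many-locations for this particular min/argmin problem (my own hedged sketch, not code from moderngpu): each block reduces its slice in shared memory and only thread 0 of each block writes a per-block result, so no atomics are needed, and the small per-block arrays are then reduced by a single reader on the host or in a second tiny kernel. The kernel name, TPB, and the output array names below are illustrative.

import numpy as np
from numba import cuda, float32, int32

TPB = 128  # threads per block; must be a power of two for this reduction


@cuda.jit
def block_min_argmin(values, block_min, block_arg):
    # Each block reduces its slice of `values` in shared memory; only
    # thread 0 of the block writes to global memory.
    s_val = cuda.shared.array(TPB, float32)
    s_idx = cuda.shared.array(TPB, int32)

    i = cuda.grid(1)
    tx = cuda.threadIdx.x

    if i < values.shape[0]:
        s_val[tx] = values[i]
        s_idx[tx] = i
    else:
        s_val[tx] = np.inf  # out-of-range threads contribute a neutral element
        s_idx[tx] = -1
    cuda.syncthreads()

    # Standard tree reduction in shared memory, carrying the index along.
    stride = TPB // 2
    while stride > 0:
        if tx < stride and s_val[tx + stride] < s_val[tx]:
            s_val[tx] = s_val[tx + stride]
            s_idx[tx] = s_idx[tx + stride]
        cuda.syncthreads()
        stride //= 2

    if tx == 0:
        block_min[cuda.blockIdx.x] = s_val[0]
        block_arg[cuda.blockIdx.x] = s_idx[0]

The per-block minima can then be folded by a single reader (e.g. np.argmin over block_min on the host), which is the gather side of the idea.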

I’m facing exactly the same problem. Since two years have passed since the last answer, is there any update on how to do this?

If there isn’t, could you @seanlaw please share how the comparison should be done if the two numbers are encoded together?

Thanks in advance!

@noeliarico I was not able to make any progress on this, but I am still very much interested in a solution. The only related reference that I could find (though it is still lacking details) is on pages 19-20 of this paper, though they are writing a CUDA kernel directly rather than using Numba. I’d be interested in what you find!

I wonder if @gmarkall or @stuartarchibald might have some other suggestions here?

What section is the description in? I don’t see page numbers and I’m not quite sure where we’re counting 1 from :slight_smile:

@gmarkall It is in Section 4.5, below Figure 11, in the paragraph starting with

The first strategy aims to accelerate each atomic write…

@seanlaw Sorry for the long delay - the two suggestions in the paper are as good as any I’m aware of right now.

@gmarkall Would you happen to know if it’s possible to apply those concepts directly in Numba? I looked at it a while ago but I couldn’t wrap my head around how to combine the int32 with the float32 and store them in a single 64-bit value inside of Numba :sweat:

I’m guessing that it might be easier in raw CUDA.

There is also another piece that I am unsure of. In order to update the combined 64-bit value from above, one would need to lock it, deconstruct it into its 32-bit components, perform the necessary comparison, combine the new components back into the 64-bit value, and release the lock. Maybe I’m missing something, but it doesn’t seem like this is possible based on the description in the paper?
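
For what it’s worth, one way to avoid the lock/deconstruct/recombine cycle entirely (my own hedged sketch, not something spelled out in the paper): if the float values are guaranteed non-negative, the IEEE-754 bit pattern of a float32 orders the same way as the float itself, so packing the float bits into the upper 32 bits of a uint64 and the index into the lower 32 bits lets a single cuda.atomic.min on the packed slot pick the smaller (P, I) pair in one atomic operation. This assumes numba.cuda.libdevice is available and that cuda.atomic.min supports uint64 operands in your Numba version; the function name is illustrative.

from numba import cuda, uint32, uint64
from numba.cuda import libdevice


@cuda.jit(device=True)
def atomic_min_pair_nonneg(slot, x, y):
    # Only valid when x >= 0: for non-negative float32 values the raw bit
    # pattern compares in the same order as the float, so the packed uint64
    # (float bits high, index low) can be min-reduced in one atomic.
    xbits = uint64(uint32(libdevice.float_as_int(x)))
    packed = (xbits << uint64(32)) | uint64(uint32(y))
    cuda.atomic.min(slot, 0, packed)

Reading the result back is just the reverse: shift the winning uint64 right by 32 and view those bits as a float32 to get P, and mask the low 32 bits to get I. Negative values would need the usual sign-flip transform on the bits before packing, at which point the CAS retry loop sketched earlier in the thread is probably the simpler option.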