CUDA Atomic Operations On Multiple Values

I have an iterative function that repeatedly returns a floating-point value, x, and an integer, y, that represents an array index. You can think of x and y as a min() and argmin() pair. Each CUDA thread frequently checks and replaces P and I like this:

if x < P:    # the new candidate value beats the current minimum
    P = x    # update the minimum value
    I = y    # update its array index

I understand that I can perform an atomic min to update P with x, but I am concerned that I have a race condition here, and I am hoping that there is some way to lock both P and I.

Another trick would be to encode P in 32 bits and I in 32 bits, join them together into a single 64-bit value, and then perform an atomic min on that combined value. But I haven’t seen this done anywhere.

Is there a better way to accomplish this?


I can’t think of a way to do this unless you take a lock and, inside the critical section, compute and update P and I.

You can either write a lock with CAS, or do what you suggested:

Use .view to reinterpret-cast the 32-bit float as a uint32, extend it to uint64, and bitwise-join the two numbers. Then use CAS to update a single uint64 slot holding the combined P and I.
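
To make that concrete, here is a minimal sketch of the packed-uint64 CAS loop in Numba CUDA. It assumes a recent Numba where numba.cuda.libdevice (used here for the float/int bit reinterpretation in device code) is available and where cuda.atomic.compare_and_swap supports uint64 operands; pack_pair, unpack_min, and atomic_min_pair are hypothetical helper names, and slot is a one-element uint64 device array that you would initialize on the host to the packed (inf, sentinel-index) pair before launching the kernel.

from numba import cuda, int32, uint32, uint64
from numba.cuda import libdevice


@cuda.jit(device=True)
def pack_pair(x, y):
    # x is assumed to be a float32 and y a non-negative index that fits in 32 bits.
    # Reinterpret the float32 bits as an unsigned int (like .view on the host),
    # put them in the upper 32 bits and the index in the lower 32 bits.
    xbits = uint64(uint32(libdevice.float_as_int(x)))
    return (xbits << uint64(32)) | uint64(uint32(y))


@cuda.jit(device=True)
def unpack_min(packed):
    # Recover the float32 stored in the upper 32 bits.
    return libdevice.int_as_float(int32(packed >> uint64(32)))


@cuda.jit(device=True)
def atomic_min_pair(slot, x, y):
    # slot is a one-element uint64 device array holding the packed (P, I).
    proposed = pack_pair(x, y)
    current = slot[0]
    while x < unpack_min(current):
        # compare_and_swap returns the value that was actually in slot[0];
        # if it matches what we read, our packed pair was installed.
        observed = cuda.atomic.compare_and_swap(slot, current, proposed)
        if observed == current:
            break
        current = observed  # another thread got there first -- re-check and retry

Because the retry loop re-reads slot[0] whenever the CAS fails, the float comparison is always done against the latest value in float space, so no separate lock is needed.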


@sklam Aside from atomic operations and reductions, do you know of any tricks/research on alternative programming paradigms for situations where you may have multiple threads updating the same array element?

@seanlaw, the alternatives often involve rethinking the algorithm; for example, going from multiple threads writing to one location to one thread reading from multiple locations. The moderngpu library by Sean Baxter at NVIDIA Research has a lot of amazing ideas. However, I found that moderngpu has a steep learning curve because its ideas are unconventional compared to how I was trained. Also, Numba’s CUDA support is not expressive enough to build reusable components from the ideas in moderngpu.

I forgot to mention the advantages of moderngpu. I have implemented ad-hoc Python versions of its algorithms (e.g. load-balancing search) to solve parallel problems with synchronization issues. The ideas from moderngpu allowed me to achieve the required performance, which is otherwise blocked by synchronization. The techniques can be applied beyond the GPU.
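
As a simplified illustration of turning many-writers-one-location into one-reader-many-locations for this particular min/argmin problem (my own hedged sketch, not code from moderngpu): each block reduces its slice in shared memory and only thread 0 of each block writes a per-block result, so no atomics are needed, and the small per-block arrays are then reduced by a single reader on the host or in a second tiny kernel. The kernel name, TPB, and the output array names below are illustrative.

import numpy as np
from numba import cuda, float32, int32

TPB = 128  # threads per block; must be a power of two for this reduction


@cuda.jit
def block_min_argmin(values, block_min, block_arg):
    # Each block reduces its slice of `values` in shared memory; only
    # thread 0 of the block writes to global memory.
    s_val = cuda.shared.array(TPB, float32)
    s_idx = cuda.shared.array(TPB, int32)

    i = cuda.grid(1)
    tx = cuda.threadIdx.x

    if i < values.shape[0]:
        s_val[tx] = values[i]
        s_idx[tx] = i
    else:
        s_val[tx] = np.inf  # out-of-range threads contribute a neutral element
        s_idx[tx] = -1
    cuda.syncthreads()

    # Standard tree reduction in shared memory, carrying the index along.
    stride = TPB // 2
    while stride > 0:
        if tx < stride and s_val[tx + stride] < s_val[tx]:
            s_val[tx] = s_val[tx + stride]
            s_idx[tx] = s_idx[tx + stride]
        cuda.syncthreads()
        stride //= 2

    if tx == 0:
        block_min[cuda.blockIdx.x] = s_val[0]
        block_arg[cuda.blockIdx.x] = s_idx[0]

The per-block minima can then be folded by a single reader (e.g. np.argmin over block_min on the host), which is the gather side of the idea.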

I’m facing exactly the same problem. Since two years have passed since the last answer, is there any update on how to do this?

If there isn’t, could you @seanlaw please share how the comparison should be done if the two numbers are encoded together?

Thanks in advance!

@noeliarico I was not able to make any progress on this, but I am still very much interested in a solution. The only related reference that I could find (though it is still lacking details) is on pages 19-20 of this paper, though they are writing a CUDA kernel directly rather than using Numba. I’d be interested in what you find!

I wonder if @gmarkall or @stuartarchibald might have some other suggestions here?

What section is the description in? I don’t see page numbers and I’m not quite sure where we’re counting 1 from :slight_smile:

@gmarkall It is in Section 4.5, below Figure 11, in the paragraph starting with

The first strategy aims to accelerate each atomic write…

@seanlaw Sorry for the long delay - the two suggestions in the paper are as good as any I’m aware of right now.

@gmarkall Would you happen to know if it’s possible to apply those concepts directly in Numba? I looked at it a while ago but I couldn’t wrap my head around how to combine the int32 with the float32 and store them in a single 64-bit value inside of Numba :sweat:

I’m guessing that it might be easier in raw CUDA.

There is also another piece that I am unsure of. In order to update the combined 64-bit value from above, one would need to lock it, deconstruct it into its 32-bit components, perform the necessary comparison, combine the new components back into the 64-bit value, and release the lock. Maybe I’m missing something, but it doesn’t seem like this is possible based on the description in the paper?
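
For what it’s worth, one way to avoid the lock/deconstruct/recombine cycle entirely (my own hedged sketch, not something spelled out in the paper): if the float values are guaranteed non-negative, the IEEE-754 bit pattern of a float32 orders the same way as the float itself, so packing the float bits into the upper 32 bits of a uint64 and the index into the lower 32 bits lets a single cuda.atomic.min on the packed slot pick the smaller (P, I) pair in one atomic operation. This assumes numba.cuda.libdevice is available and that cuda.atomic.min supports uint64 operands in your Numba version; the function name is illustrative.

from numba import cuda, uint32, uint64
from numba.cuda import libdevice


@cuda.jit(device=True)
def atomic_min_pair_nonneg(slot, x, y):
    # Only valid when x >= 0: for non-negative float32 values the raw bit
    # pattern compares in the same order as the float, so the packed uint64
    # (float bits high, index low) can be min-reduced in one atomic.
    xbits = uint64(uint32(libdevice.float_as_int(x)))
    packed = (xbits << uint64(32)) | uint64(uint32(y))
    cuda.atomic.min(slot, 0, packed)

Reading the result back is just the reverse: shift the winning uint64 right by 32 and view those bits as a float32 to get P, and mask the low 32 bits to get I. Negative values would need the usual sign-flip transform on the bits before packing, at which point the CAS retry loop sketched earlier in the thread is probably the simpler option.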