I want to parallelize an algorithm using a cuda.jit kernel. All threads share an input array w of type float32; each thread i chooses a random pair (i, j) (the same i and j can appear in other threads too) and then modifies w[i] and w[j]. In pseudo-code, it would go something like this:

# inside cuda.jit kernel
i = threadIdx
j = random(1, N)
if w[i] > wmin and w[j] > wmin:
    dw = ...  # some calculation involving both w's
    winner, loser = choose (i, j) or (j, i)
    w[winner] += dw
    w[loser] -= dw

The problem with this algorithm is race conditions: w[i] and w[j] could be modified by another thread at any time (creating unique pairs (i, j) beforehand is not an option). So my question is this: is there a way to lock both w[i] and w[j] before the if statement and unlock them after they are updated? I thought of using a mutex lock, but I am not sure how to use one to lock two variables at the same time.

I’m finding it hard to give a good generic suggestion here - can you provide some more concrete details about the algorithm / implementation? That might help to see if there’s a solution that could work well in your specific scenario.

I’m trying to simulate a system of N=1000 agents that have a property called ‘wealth’, given by vector w. At each time step, each agent i is paired with an opponent j and they play a game where they exchange some wealth dw (which is a function of w[i] and w[j]). The winner is chosen with probability .5. I thought of parallelizing this algorithm by making each thread in a block responsible for one agent, with the thread taking the task of choosing its opponent (with this scheme I could leverage the use of shared memory). This implementation would be pretty easy if I could generate the 500 unique random pairs (i, j) beforehand for each time step. But the problem is that these agents are embedded in a complex network, so an agent is only allowed to play with its neighbors (and the number of neighbors depends on the agent), and it’s probably not possible to generate these unique pairs.

So instead I thought that each thread representing an agent would choose one opponent randomly from its set of neighbors. This makes it possible for an agent to be drawn more than once at every time step, generating race conditions. Maybe with the main loop of my serial code you can understand it better:

def MCS(self, steps):
    """
    Main MC loop
    """
    for mcs in range(steps):
        # Chooses a possible opponent for each agent i.
        # If the agent is isolated, returns -1.
        j = self.get_opponents()
        for i, ji in enumerate(j):
            # Check the node is not isolated and both agents have w > w_min
            if ji != -1 and self.is_valid(i, ji):
                dw = np.minimum(r[i] * w[i], r[ji] * w[ji])
                winner = np.random.choice([i, ji])
                loser = ji if winner == i else i
                w[winner] += dw
                w[loser] -= dw
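For context, the opponent-selection step can be sketched like this on a toy network (this is an illustrative stand-in, not my actual get_opponents, and the adjacency-list representation is hypothetical):

```python
import random

def get_opponents(neighbors):
    # neighbors[i] is the list of agents that agent i may play against;
    # an isolated agent (empty list) gets opponent -1.
    return [random.choice(nbrs) if nbrs else -1 for nbrs in neighbors]

# Toy network: agents 0 and 1 are linked, agent 2 is isolated.
neighbors = [[1], [0], []]
print(get_opponents(neighbors))  # -> [1, 0, -1]
```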

I’m sorry if this is too long, let me know if something is not clear.
Thank you.

I have found a solution that works for my problem. I leave it here in case anyone encounters a similar situation in the future.

@cuda.jit(device=True)
def double_lock(mutex, i, j):
    # Always acquire the two locks in a consistent global order
    # (smaller index first). Otherwise one thread could hold mutex[i]
    # while waiting for mutex[j] and another hold mutex[j] while
    # waiting for mutex[i], deadlocking both. Assumes i != j.
    first, second = (i, j) if i < j else (j, i)
    while cuda.atomic.cas(mutex, first, 0, 1) != 0:
        pass
    while cuda.atomic.cas(mutex, second, 0, 1) != 0:
        pass
    cuda.threadfence()

@cuda.jit(device=True)
def double_unlock(mutex, i, j):
    cuda.threadfence()
    cuda.atomic.exch(mutex, j, 0)
    cuda.atomic.exch(mutex, i, 0)
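To convince myself that the ordering rule is what prevents deadlock, here is a plain-Python threading analogue of the same idea (illustrative only; threading.Lock stands in for the atomic-CAS mutex):

```python
import threading

def double_lock(locks, i, j):
    # Same rule as the CUDA version: acquire in a fixed global order
    # (smaller index first) so two threads locking overlapping
    # pairs can never deadlock.
    first, second = (i, j) if i < j else (j, i)
    locks[first].acquire()
    locks[second].acquire()

def double_unlock(locks, i, j):
    first, second = (i, j) if i < j else (j, i)
    locks[second].release()
    locks[first].release()

N = 4
w = [100.0] * N
locks = [threading.Lock() for _ in range(N)]

def exchange(i, j, dw, iters):
    for _ in range(iters):
        double_lock(locks, i, j)
        w[i] += dw   # critical section: w[i], w[j] are protected
        w[j] -= dw
        double_unlock(locks, i, j)

# The two threads request the same pair in opposite order; with the
# ordering rule this cannot deadlock, and total wealth is conserved.
t1 = threading.Thread(target=exchange, args=(0, 1, 1.0, 10000))
t2 = threading.Thread(target=exchange, args=(1, 0, 1.0, 10000))
t1.start(); t2.start(); t1.join(); t2.join()
print(w)  # -> [100.0, 100.0, 100.0, 100.0]
```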

The above functions can be called from a kernel like this:

@cuda.jit
def my_kernel(w, ...):
    ...
    double_lock(mutex, i, j)
    # Sensitive code that needs w[i] and w[j] to not be
    # modified by other threads.
    double_unlock(mutex, i, j)
    ...
...