CUDA Python - Multiple Threads Operating on Same Array Location

Hello all,

Let's say I have some particular values stored in a vector of size 5. These 5 values came from a calculation, and I want to add them up into a global array.

Example code

import numba.cuda as cuda
import cupy as cp

size = 5
global_sum = cp.zeros((1,1))
random_vec = cp.random.randn(size)

@cuda.jit
def add_all_values(global_sum, random_vec, size):
    tidx = cuda.grid(1)
    if tidx < size:
        global_sum[0,0] += random_vec[tidx]
        cuda.syncthreads()


add_all_values[1,32](global_sum, random_vec, size)
   
print(f"Vector: {random_vec}")     
print(f"True Answer: {random_vec.sum()}")
print(f"Kernel Answer: {global_sum}")

If you execute the code, you can see that my kernel is not working properly. Instead of adding all the random_vec[tidx] values to global_sum[0,0], it only adds the first value, random_vec[0].

My aim is not simply to perform this addition; as far as I know, the @cuda.reduce decorator can be used to sum all elements in a vector (see the sketch below). My aim is to force all threads to work in parallel, perform their additions in parallel, and give me a final array at the end of these calculations.
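
For reference, here is a minimal sketch of the kind of reduction I mean, following the Numba reduce pattern (my own toy array, and not what I want here, because I need each thread to add its own value during my kernel):

import numba.cuda as cuda
import numpy as np

@cuda.reduce
def sum_reduce(a, b):
    # Binary operation that Numba applies pairwise on the GPU to reduce the array
    return a + b

vec = np.random.randn(5)
total = sum_reduce(vec)  # copies the array to the device and sums it there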

I created this simple example because my original problem is much bigger. My code calculates an element stiffness matrix for every element and adds them all up into a global stiffness matrix. So, if two elements share a node, they contribute to the same array location in the global stiffness matrix. Because the element calculations are independent and do not affect each other, I want them to run in parallel: calculate in parallel, add in parallel.

In summary: N threads are trying to add their particular values to the same array location, but I can't make it work. If thread1_value is 3.2 and thread2_value is 6.1, I want both of them to add their values to global_array[0,0], so when I check global_array[0,0] I should see 9.3. However, I only see 3.2 in global_array[0,0]. It seems like only one thread affects the global_array.

How can I overcome this issue?

You can use an atomic operation instead:

import numba.cuda as cuda
import cupy as cp

size = 5
global_sum = cp.zeros((1, 1))
random_vec = cp.random.randn(size)


@cuda.jit
def add_all_values(global_sum, random_vec, size):
    tidx = cuda.grid(1)
    if tidx < size:
        # Atomically add this thread's value, so concurrent updates cannot race
        cuda.atomic.add(global_sum, (0, 0), random_vec[tidx])


add_all_values[1, 32](global_sum, random_vec, size)

print(f"Vector: {random_vec}")
print(f"True Answer: {random_vec.sum()}")
print(f"Kernel Answer: {global_sum}")

For global assembly I’d expect the contention to be relatively low so most of the atomic operations should avoid getting serialized, and this should be a reasonably efficient approach.

When I run this, I get:

Vector: [ 0.01499315 -1.33878368  0.03578643  0.97101383 -0.46703511]
True Answer: -0.7840253738953421
Kernel Answer: [[-0.78402537]]

Note that you can also avoid doing the global assembly entirely, and instead compute the matrix-vector product during the solve using the local matrices, mapping your global vector into local element space and back (as sketched in the code after the list below). This is outlined in Section 4.1 of http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.405.3888&rep=rep1&type=pdf - the experiments / results there are a bit old, but the general criteria are:

  • Low order: global assembly likely faster
  • High order: local assembly + local-global transform and local element matrix multiplication likely faster
  • Local assembly with the local-global transform does need more memory (because you need to keep all the local matrices around) - the overhead is greater for low-order, because it’s a factor of the proportion of nodes shared between elements.
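
As a rough sketch of that idea, assuming an iterative solver that only needs the product y = A @ x (the names K_local and element_nodes are assumptions for illustration, not from the paper):

import cupy as cp
import cupyx

def apply_global_operator(K_local, element_nodes, x):
    # K_local: (n_elements, npe, npe) stacked local element matrices
    # element_nodes: (n_elements, npe) global node index of each local node
    # Gather the global vector into local element space
    x_local = x[element_nodes]
    # Apply each local element matrix to its local vector
    y_local = cp.einsum('eij,ej->ei', K_local, x_local)
    # Scatter-add the local contributions back into global space
    y = cp.zeros_like(x)
    cupyx.scatter_add(y, element_nodes.ravel(), y_local.ravel())
    return y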

You could also colour the mesh and assemble each colour separately so that you don’t need an atomic operation (this is described in S4.2 of the publication).
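
A rough sketch of how the colouring approach might be driven from the host (colours, assemble_colour, global_K and element_data are hypothetical names for illustration):

import cupy as cp

def assemble_by_colour(assemble_colour, global_K, element_data, colours):
    # colours[e] gives the colour of element e; assemble_colour is assumed to
    # be a @cuda.jit kernel that assembles only the listed elements.
    for c in range(int(colours.max()) + 1):
        elems = cp.nonzero(colours == c)[0]
        n = int(elems.size)
        if n == 0:
            continue
        # Within one colour no two elements share a node, so the kernel can
        # use a plain "+=" instead of atomics.
        assemble_colour[(n + 127) // 128, 128](global_K, element_data, elems, n)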

Graham Markall,

Thank you so much. After I used the cuda.atomic.add() function you mentioned, I successfully implemented my assembly procedure. It seems that the "+=" operator does not work correctly when many threads update the same location in parallel. Also, for those looking for help: the cuda.atomic module contains more functions than just addition, e.g. subtraction, increment, max, min, etc.
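
For example, here is a minimal toy sketch of my own (not from the docs) using cuda.atomic.max in the same way as cuda.atomic.add above:

import numba.cuda as cuda
import cupy as cp

@cuda.jit
def find_max(result, values, size):
    tidx = cuda.grid(1)
    if tidx < size:
        # Atomically keep the largest value seen by any thread
        cuda.atomic.max(result, 0, values[tidx])

values = cp.random.randn(5)
result = cp.array([float('-inf')])
find_max[1, 32](result, values, values.size)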

The code is already written in Python + NumPy form, and I am already busy converting it to the GPU. My algorithm might be a classical one: every element stiffness matrix is calculated separately, and every value is added to the corresponding global matrix index in parallel. Unfortunately I could not implement the technique you mentioned, but thank you so much for the suggestion, I will read it.
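
To illustrate, here is a simplified sketch of the assembly pattern I ended up with (the DOF mapping and array names are placeholders, not my real code):

import numba.cuda as cuda

@cuda.jit
def assemble(global_K, K_elements, element_dofs, n_elements):
    # One thread per element; element_dofs[e, i] maps local DOF i of element e
    # to its row/column in the global stiffness matrix.
    e = cuda.grid(1)
    if e < n_elements:
        npe = element_dofs.shape[1]
        for i in range(npe):
            for j in range(npe):
                # Elements sharing a node hit the same global entry,
                # so the update has to be atomic
                cuda.atomic.add(global_K,
                                (element_dofs[e, i], element_dofs[e, j]),
                                K_elements[e, i, j])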

Excellent, I’m glad to hear that! :slightly_smiling_face:

This is a very general issue in parallel programming: multiple threads concurrently operating on common items of data must coordinate in some way to ensure that their behaviour results in a consistent outcome. Failure to do so results in incorrect output. This is usually referred to as a race condition; see e.g. multithreading - What is a race condition? - Stack Overflow.

Indeed, these are all listed in Synchronization and Atomic Operations.

The technique I mentioned would take quite a lot of work and requires that you are using an iterative solver, so this can be a perfectly reasonable implementation decision.

When your GPU porting work is finished, do you plan to open-source it? It would be great if there were an open source Numba + CUDA FEM assembly example to be able to show people (as FEM is something I’m very interested in, but not actively working on at the moment :slightly_smiling_face:)

Graham Markall,

I would like to share my/our stiffness assembly kernel as an example open-source project, of course. We decided to put the code on GitHub, but I am not so experienced with it. Maybe the repository's name will be "CUDA Python Accelerated FEM Analysis" or something similar. However, as I mentioned before, it is not an advanced analysis program; it was a project for an undergraduate FEM course. Here I give an example: the analysis of a hollow metal plate, where the graph shows the displacement_x data. The left side is fixed to the wall and the right side is pulled by 0.1 meters. All the analysis is done using Python capabilities, with computations mainly done on the GPU for speed. I will inform you when we have uploaded the code to GitHub as an open-source project.

I would also like to donate my inside-kernel linear algebra library. Maybe people will use it in the future by calling something like numba.cuda.insidekernel_linalg.[function]. That is not a problem for me; it would even be better, because the library is not even at version 0.0.0 and is completely for personal usage. I believe the community could create a much better library than mine by extending and reinforcing it. Here is the list of functions I use for my analysis: just the basics, but I can use them inside kernels for computations. Do you have any suggestions?


Although I don't do much FEM, I just today ran into the issue of wanting to perform linear algebra operations such as matrix multiply on the device, inside a kernel.
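
For instance, the kind of thing I mean is a small fixed-size matrix multiply written as a device function so it can be called from inside a kernel (a minimal sketch; the 3x3 size is just assumed for illustration):

import numba.cuda as cuda

@cuda.jit(device=True)
def matmul_3x3(A, B, C):
    # C = A @ B for 3x3 matrices, callable from inside other kernels
    for i in range(3):
        for j in range(3):
            acc = 0.0
            for k in range(3):
                acc += A[i, k] * B[k, j]
            C[i, j] = acc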

+1 to sharing the source code for those operations if you can.