Hello all,

Let's say I have some particular values stored in a vector of size 5. These 5 values come from a calculation, and I want to add them all up into a global array.

Example code

```
import numba.cuda as cuda
import cupy as cp

size = 5
global_sum = cp.zeros((1, 1))
random_vec = cp.random.randn(size)

@cuda.jit
def add_all_values(global_sum, random_vec, size):
    tidx = cuda.grid(1)
    if tidx < size:
        global_sum[0, 0] += random_vec[tidx]
    cuda.syncthreads()

add_all_values[1, 32](global_sum, random_vec, size)
print(f"Vector: {random_vec}")
print(f"True Answer: {random_vec.sum()}")
print(f"Kernel Answer: {global_sum}")
```

If you execute the code, you can see that my kernel is not working properly. Instead of adding all `random_vec[tidx]` values to `global_sum[0,0]`, it only adds the first value, `random_vec[0]`.

My aim is not simply to perform this addition; as far as I know, the `@cuda.reduce` decorator can be used to sum all elements of a vector. My aim is to force all threads to work in parallel, make their additions in parallel, and give me a final array at the end of these calculations.

I created this simple example because my original problem is much bigger. My code calculates an element stiffness matrix for every element and adds them into a global stiffness matrix. So if 2 elements share a node, they contribute to the same location in the global stiffness matrix. Because the element calculations are independent and do not affect each other, I want the threads to calculate in parallel and add in parallel.

In summary: N threads are trying to add their particular values to the same array location, but I can't get it to work. If thread1's value is 3.2 and thread2's value is 6.1, I want them both to add their values to `global_array[0,0]`, so when I check `global_array[0,0]` I should see 9.3. However, I only see 3.2 in `global_array[0,0]`. It seems like only one thread affects `global_array`.

How can I overcome this issue?