# Shared Memory Reduction example assuming 1 block

This example in the documentation (can't add the link: …/examples.html#shared-memory-reduction) seems to assume a single block.

```python
import numpy as np
from numba import cuda
from numba.types import int32

# generate data
a = cuda.to_device(np.arange(1024))
nelem = len(a)

@cuda.jit
def array_sum(data):
    tid = cuda.threadIdx.x
    size = len(data)

    if tid < size:
        i = cuda.grid(1)

        # Declare an array in shared memory
        shr = cuda.shared.array(nelem, int32)
        shr[tid] = data[i]

        # Ensure writes to shared memory are visible
        # to all threads before reducing
        cuda.syncthreads()

        s = 1
        while s < cuda.blockDim.x:
            if tid % (2 * s) == 0:
                # Stride by `s` and add
                shr[tid] += shr[tid + s]
            s *= 2
            cuda.syncthreads()

        # After the loop, the zeroth element contains the sum
        if tid == 0:
            data[tid] = shr[tid]

array_sum[2, 512](a)
print(a[0])                  # the docs say 523776
print(sum(np.arange(1024)))  # 523776
```
The actual output:

```
130816
523776
```

With `array_sum[2, 1024](a)`, the result is 0.

I am confused by this part:

```python
tid = cuda.threadIdx.x
size = len(data)

if tid < size:
    i = cuda.grid(1)
```

If there is one block, then `i` simply equals `tid`.

Here is an equivalent implementation that also works with array lengths that are not powers of two.

```python
import numpy as np
from numba import cuda, int32

a = np.arange(1_000)
a_gpu = cuda.to_device(a)
n = len(a)

@cuda.jit
def sum_reduce_gpu(a):  # assume 1 block
    # thread id within a 1D block (position relative to a block)
    thread_id = cuda.threadIdx.x

    # Create an array in shared memory
    # (the shape needs to be a "simple constant", len(a) won't work)
    shared_array = cuda.shared.array(n, int32)

    # each thread is responsible for copying one value
    shared_array[thread_id] = a[thread_id]
    # wait for all threads to finish
    cuda.syncthreads()

    step_size = 1
    while step_size < n:
        if thread_id % (2 * step_size) == 0 and thread_id + step_size < n:
            shared_array[thread_id] += shared_array[thread_id + step_size]
        step_size *= 2
        # wait for all threads to finish to make sure
        # shared_array[thread_id + step_size] has been updated
        cuda.syncthreads()

    # one thread is responsible for storing the result in a
    if thread_id == 0:
        a[0] = shared_array[0]

sum_reduce_gpu[1, n](a_gpu)
print(a.sum(), a_gpu[0])  # both should be 499500
```

If I wanted to use multiple blocks, would I need a 2 step process? Each block will be responsible for one part of the array, then at the end, one of the threads will combine the results of each block.

You can also use the `@cuda.reduce` decorator to generate a reduction kernel (see "GPU Reduction" in the Numba documentation).