Hi !

I’m an inexperienced in GPU programming (and Numba !), and I’m trying to implement a kernel to perform the following operation:

```
def op_slow(input, gamma, mu, c, weight_shape):
for bi in range(input.shape[0]):
for c0 in range(input.shape[1]):
for i0 in range(input.shape[2]):
for j0 in range(input.shape[3]):
for i0p in range(input.shape[5]):
for j0p in range(input.shape[6]):
for ci in range(weight_shape[1]):
for ii in range(weight_shape[2]):
for ji in range(weight_shape[3]):
input[bi, c0, i0, j0, c0, i0p, j0p] += c[c0] * (mu[bi, ci, i0+ii, j0+ji] * mu[bi, ci, i0p+ii, j0p+ji] + gamma[bi, ci, i0+ii, j0+ji, ci, i0p+ii, j0p+ji])
return input
```

After carefully reviewing the do’s and don’t of GPU programming, I thought a good implementation would:

- parallelize on bi and c0 (I understood it was an heuristic choice, so I couldn’t come up with anything better)
- use shared memory to prevent multiple back and forth with the off-chip memory, at least for mu and c which are reused

And some unknows:

- is the use of @guvectorize recommanded in these cases ? (I’m not sure I understood the use case of this decorator)
- should the reduction happen in a cooperative group ?

Here is what I eventually wrote:

```
@cuda.jit
def op_numba_c(input, gamma, mu, c, weight_shape_1, weight_shape_2, weight_shape_3):
sharedI = cuda.shared.array((0, 0, 0, 0, 0, 0, 0), dtype=numba.float32)
sharedM = cuda.shared.array((0, 0, 0, 0), dtype=numba.float32)
sharedG = cuda.shared.array((0, 0, 0, 0, 0, 0, 0), dtype=numba.float32)
sharedC = cuda.shared.array(0, dtype=numba.float32)
bi, c0 = cuda.grid(2)
threadB, threadC = cuda.threadIdx.x, cuda.threadIdx.y
x_gridsize = cuda.gridDim.x * cuda.blockDim.x
y_gridsize = cuda.gridDim.y * cuda.blockDim.y
while bi < input.shape[0]:
while c0 < input.shape[1]:
if bi < input.shape[0] and c0 < input.shape[1]:
sharedI[threadB, threadC, :, :, 0] = input[bi, c0, :, :, c0]
sharedM[threadB] = mu[bi]
sharedG[threadB] = gamma[bi]
sharedC = c[c0]
cuda.syncthreads()
for i0 in range(input.shape[2]):
for j0 in range(input.shape[3]):
for i0p in range(input.shape[5]):
for j0p in range(input.shape[6]):
v = numba.float32(0.)
for ci in range(weight_shape_1):
for ii in range(weight_shape_2):
for ji in range(weight_shape_3):
v += (sharedM[threadB, ci, i0+ii, j0+ji] * sharedM[threadB, ci, i0p+ii, j0p+ji] + sharedG[threadB, ci, i0+ii, j0+ji, ci, i0p+ii, j0p+ji])
sharedI[threadB, threadC, i0, j0, 0, i0p, j0p] += sharedC * v
cuda.syncthreads()
if bi < input.shape[0] and c0 < input.shape[1]:
input[bi, c0, :, :, c0, :, :] = sharedI[bi, c0, :, :, 0, :, :]
c0 += y_gridsize
bi += x_gridsize
```

However there are issues with this implementation:

- the CUDA shared array should be switched to dynamic shared array as per the discussion here: No implementation of CUDA shared.array error · Issue #8300 · numba/numba · GitHub – I however find no trace of this in the documentation. There is also a problem of shared memory size limitation.
- I am not sure how to set a “correct” threadsperblock/blockspergrid kernel launch parameter

If anyone would be so kind as help understand what’s wrong and how to improve it I would greatly appreciate it Thanks !

Edit: updated tentative