Using dynamic shared memory

Hi!
I'm inexperienced in GPU programming (and Numba!), and I'm trying to implement a kernel that performs the following operation:

def op_slow(input, gamma, mu, c, weight_shape):
    for bi in range(input.shape[0]):
        for c0 in range(input.shape[1]):
            for i0 in range(input.shape[2]):
                for j0 in range(input.shape[3]):
                    for i0p in range(input.shape[5]):
                        for j0p in range(input.shape[6]):
                            for ci in range(weight_shape[1]):
                                for ii in range(weight_shape[2]):
                                    for ji in range(weight_shape[3]):
                                        input[bi, c0, i0, j0, c0, i0p, j0p] += c[c0] * (mu[bi, ci, i0+ii, j0+ji] * mu[bi, ci, i0p+ii, j0p+ji] + gamma[bi, ci, i0+ii, j0+ji, ci, i0p+ii, j0p+ji])
    return input

After carefully reviewing the do's and don'ts of GPU programming, I thought a good implementation would:

  • parallelize over bi and c0 (I understood this is a heuristic choice; I couldn't come up with anything better)
  • use shared memory to avoid repeated round trips to off-chip memory, at least for mu and c, which are reused (see the dynamic shared memory sketch just after these lists)

And some unknowns:

  • is the use of @guvectorize recommended in these cases? (I'm not sure I understand the use case of this decorator)
  • should the reduction happen in a cooperative group?
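Since the whole question revolves around dynamic shared memory, here is a minimal, self-contained sketch of how I understand it is declared and sized in Numba (toy kernel, 128 threads per block assumed; the size in bytes goes in the fourth element of the launch configuration). Please correct me if this mental model is wrong:

# Toy sketch of dynamic shared memory in Numba (not the kernel below).
import numpy as np
import numba
from numba import cuda

@cuda.jit
def scale_with_shared(out, arr, factor):
    # Shape 0 marks the array as dynamic: its actual size comes from the
    # launch configuration. Note that all dynamic shared arrays declared in
    # a kernel alias the same memory.
    buf = cuda.shared.array(0, dtype=numba.float32)
    i = cuda.grid(1)
    if i < arr.size:
        buf[cuda.threadIdx.x] = arr[i]
    cuda.syncthreads()
    if i < arr.size:
        out[i] = buf[cuda.threadIdx.x] * factor

x = np.arange(256, dtype=np.float32)
out = np.zeros_like(x)
threads = 128
blocks = (x.size + threads - 1) // threads
nbytes = threads * x.dtype.itemsize  # dynamic shared memory size, in bytes
# Launch configuration: kernel[blocks, threads, stream, dynamic_shared_mem_bytes]
scale_with_shared[blocks, threads, 0, nbytes](out, x, np.float32(2.0))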

Here is what I eventually wrote:

import numba
from numba import cuda

@cuda.jit
def op_numba_c(input, gamma, mu, c, weight_shape_1, weight_shape_2, weight_shape_3):
    sharedI = cuda.shared.array((0, 0, 0, 0, 0, 0, 0), dtype=numba.float32)
    sharedM = cuda.shared.array((0, 0, 0, 0), dtype=numba.float32)
    sharedG = cuda.shared.array((0, 0, 0, 0, 0, 0, 0), dtype=numba.float32)
    sharedC = cuda.shared.array(0, dtype=numba.float32)
    bi, c0 = cuda.grid(2)
    threadB, threadC = cuda.threadIdx.x, cuda.threadIdx.y
    x_gridsize = cuda.gridDim.x * cuda.blockDim.x
    y_gridsize = cuda.gridDim.y * cuda.blockDim.y

    while bi < input.shape[0]:
        while c0 < input.shape[1]:
            if bi < input.shape[0] and c0 < input.shape[1]:
                sharedI[threadB, threadC, :, :, 0] = input[bi, c0, :, :, c0]
                sharedM[threadB] = mu[bi]
                sharedG[threadB] = gamma[bi]
                sharedC = c[c0]
            cuda.syncthreads()
            for i0 in range(input.shape[2]):
                for j0 in range(input.shape[3]):
                    for i0p in range(input.shape[5]):
                        for j0p in range(input.shape[6]):
                            v = numba.float32(0.)
                            for ci in range(weight_shape_1):
                                for ii in range(weight_shape_2):
                                    for ji in range(weight_shape_3):
                                        v += (sharedM[threadB, ci, i0+ii, j0+ji] * sharedM[threadB, ci, i0p+ii, j0p+ji] + sharedG[threadB, ci, i0+ii, j0+ji, ci, i0p+ii, j0p+ji])
                            sharedI[threadB, threadC, i0, j0, 0, i0p, j0p] += sharedC * v
            
            cuda.syncthreads()
            if bi < input.shape[0] and c0 < input.shape[1]:
                input[bi, c0, :, :, c0, :, :] = sharedI[bi, c0, :, :, 0, :, :]

            c0 += y_gridsize
        bi += x_gridsize

However, there are still issues with this implementation.

If anyone would be so kind as to help me understand what's wrong and how to improve it, I would greatly appreciate it :slight_smile: Thanks!

Edit: updated tentative

I can’t quite get my head around the example to figure out what the mathematical expression being implemented here is (other than that it’s some sort of reduction), but I do wonder whether Numba is providing a less-than-ideal level of abstraction here: can your operation be written as a NumPy array expression?

If so, perhaps it would be easier to build a GPU implementation with good performance by implementing that expression using CuPy (using Numba kernels only for operations that don’t fit into an array computing framework).
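For instance (made-up shapes, purely to illustrate the kind of code I mean), an array expression written against the NumPy API typically runs on the GPU with CuPy more or less unchanged:

# Hypothetical illustration: CuPy mirrors the NumPy API, so array expressions
# usually port to the GPU with little more than the import changed.
import cupy as cp

a = cp.random.rand(3, 9, 9, dtype=cp.float32)   # allocated in GPU memory
b = cp.random.rand(3, 9, 9, dtype=cp.float32)
out = cp.einsum('cij,ckl->ijkl', a, b)          # contraction runs as GPU kernels
host = cp.asnumpy(out)                          # explicit copy back to the host if needed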

Indeed, that’s what I wanted to do at first. However, I failed to translate the equation into a NumPy-able expression: my best attempt at implementing it was op_slow, which is not practical in our use case.
The expression is a component of an equation for the covariance of the output of a 2D convolution running on unreliable hardware. In the non-batched case, we have dimensions channel × width × height × channel × width × height, with indices [c0, i0, j0, c0p, i0p, j0p]. However, there is a special condition c0 == c0p. We then perform a convolution-like operation, with c, i and j the dimensions of the convolution kernel, on gamma, mu and coef, which are respectively the previous layer's covariance, expectation and a regularization coefficient vector:
input[c0, i0, j0, c0, i0p, j0p] = coef[c0] * \sum_c \sum_i \sum_j ( gamma[c, i0+i, j0+j, c, i0p+i, j0p+j] + mu[c, i0+i, j0+j] * mu[c, i0p+i, j0p+j] )

If you have an idea of how to implement this using regular NumPy expressions, I'd love to hear it; I couldn't manage on my side. In particular, the shared strides along dimensions (c, i, j) combined with different starting indices (i0 vs. i0p and j0 vs. j0p) are a pain point. Going low level lets me manage the indices as in op_slow while trying to achieve some interesting performance. Eventually, I thought I could also integrate the other components of the equation into the Numba kernel in order to save time/memory.

For confirmation, what is input.shape, and what is weight_shape? They seem to differ between the original op_slow() and the convolution described in the previous post, and I’m trying to put together a working code example to experiment with. (If you could post a sample that sets up a call to op_slow() with the data shapes/types you’re calling it with, that would be really helpful.)

Sure! Here is a toy example:

import torch

device = torch.device('cuda:0')
input = torch.rand(2, 3, 7, 7, 3, 7, 7).to(device)  # (B, C, H_out, W_out, C, H_out, W_out)
gamma = torch.rand(2, 3, 9, 9, 3, 9, 9).to(device)  # previous layer covariance
mu = torch.rand(2, 3, 9, 9).to(device)              # previous layer expectation
c = torch.tensor(1.).to(device)
if c.dim() == 0:
    c = c.repeat(input.shape[1])                     # one regularization coefficient per channel
weight_shape = [3, 3, 3, 3]                          # convolution weight shape; dims 1-3 are the ones used by op_slow
ref = op_slow(input.clone(), gamma, mu, c, weight_shape)

To be precise, in the general case we have convolution weights in 4 dimensions that operate on mu/gamma. input is made from all the other components of the covariance matrix, and thus has the shape of the convolution output with its last three dimensions (C×W×H) repeated (C×W×H×C×W×H).

The data type is left at the default (float32) in this sample, though the user may have cast everything to half precision for memory reasons (gamma and input quickly saturate the GPU VRAM), hence the in-place addition on input.

The signature is different between op_slow and the tentative kernel, as I directly pass the individual dimensions, e.g. weight_shape_1 = weight_shape[1] (I thought passing a torch.Size object would confuse the compiler).

Hey!
I have been making progress on the implementation and on my understanding of Numba. I now have a working tentative implementation that matches the output of op_slow; here are the main changes with respect to the original post:

  • restricted the parallelism to the batch dimension only, to simplify my approach
  • removed the shared arrays for mu, gamma and input for now

Now the catch is that it performs worse than the CPU implementation by 2–3 orders of magnitude. I'm looking toward profiling-based improvement as the next step. My main source is curiouscoding.nl/phd/2021/03/24/numba-cuda-speedup/
I understand the main culprit is probably sub-par data management. I'd take any suggestion!

import numba
from numba import cuda

@cuda.jit
def op_numba_c(input, gamma, mu, c, weight_shape_1, weight_shape_2, weight_shape_3):
    # Stage the per-channel coefficients in shared memory (1024 is a static upper bound on the channel count).
    sharedC = cuda.shared.array(shape=1024, dtype=numba.float32)
    bi = cuda.grid(1)
    x_gridsize = cuda.gridsize(1)
    threadB = cuda.threadIdx.x

    if threadB < c.shape[0]:
        sharedC[threadB] = c[threadB]
    if cuda.blockDim.x < c.shape[0] and threadB == 0:  # block smaller than c: thread 0 loads the remaining values
        for i in range(cuda.blockDim.x, c.shape[0]):
            sharedC[i] = c[i]

    cuda.syncthreads()

    if bi >= input.shape[0]:
        return

    # Grid-stride loop over the batch dimension
    while bi < input.shape[0]:
        # Iterate over i0, j0, i0p, j0p
        for i0 in range(input.shape[2]):
            for j0 in range(input.shape[3]):
                for i0p in range(input.shape[5]):
                    for j0p in range(input.shape[6]):
                        v = numba.float32(0.)
                        for ci in range(weight_shape_1):
                            threadM = mu[bi, ci]
                            threadG = gamma[bi, ci]
                            for ii in range(weight_shape_2):
                                for ji in range(weight_shape_3):
                                    v += (threadM[i0+ii, j0+ji] * threadM[i0p+ii, j0p+ji] + threadG[i0+ii, j0+ji, ci, i0p+ii, j0p+ji])
                        # For each c0, update input
                        for c0 in range(input.shape[1]):
                            #input[bi, c0, i0, j0, c0, i0p, j0p] += v * sharedC[c0]
                            cuda.atomic.add(input, (bi, c0, i0, j0, c0, i0p, j0p), v * sharedC[c0])
        bi += x_gridsize
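For completeness, a minimal sketch of how such a kernel can be launched and roughly timed on the toy example above (64 threads per block is an arbitrary choice for this sketch; the PyTorch CUDA tensors are passed as-is, which works because they expose __cuda_array_interface__):

from numba import cuda

threads_per_block = 64                      # arbitrary choice for this sketch
blocks = (input.shape[0] + threads_per_block - 1) // threads_per_block

start, end = cuda.event(), cuda.event()
start.record()
op_numba_c[blocks, threads_per_block](
    input, gamma, mu, c,                    # CUDA tensors from the toy example
    weight_shape[1], weight_shape[2], weight_shape[3],
)
end.record()
end.synchronize()
print('kernel time (ms):', cuda.event_elapsed_time(start, end))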

Hi, you shouldn’t need atomic add or shared memory at all here. In your last comment you mentioned that only the batch dimension is parallelized; however, you typically need a lot of parallelism to keep the GPU fully occupied, otherwise it is under-utilized.

Your original code can be parallelized as follows:

bi, c0, i0, j0, c0, i0p, j0p => dimensions that can be mapped to thread blocks
ci, ii, ji => dimensions that can be mapped to parallel threads

You will need a block reduction among the threads. Each thread block will be responsible for one data point of input[bi, c0, i0, j0, c0, i0p, j0p]. This way, as far as I can see, no inter-block communication such as atomic add is needed.
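To make the idea concrete, here is an untested sketch of what such a kernel could look like, assuming a 1D launch with one thread block per output element and 128 threads per block (names and constants here are illustrative, not taken from your code):

import numba
from numba import cuda

TPB = 128  # threads per block (assumed; must be a power of two for this reduction)

@cuda.jit
def op_block_reduce(input, gamma, mu, c, ws1, ws2, ws3):
    # Recover (bi, c0, i0, j0, i0p, j0p) from the flat block index.
    n_c0, n_i0, n_j0 = input.shape[1], input.shape[2], input.shape[3]
    n_i0p, n_j0p = input.shape[5], input.shape[6]
    flat = cuda.blockIdx.x
    j0p = flat % n_j0p; flat //= n_j0p
    i0p = flat % n_i0p; flat //= n_i0p
    j0 = flat % n_j0;   flat //= n_j0
    i0 = flat % n_i0;   flat //= n_i0
    c0 = flat % n_c0
    bi = flat // n_c0

    # Each thread accumulates a strided part of the (ci, ii, ji) reduction.
    n_inner = ws1 * ws2 * ws3
    v = numba.float32(0.)
    t = cuda.threadIdx.x
    while t < n_inner:
        ji = t % ws3
        ii = (t // ws3) % ws2
        ci = t // (ws2 * ws3)
        v += (mu[bi, ci, i0 + ii, j0 + ji] * mu[bi, ci, i0p + ii, j0p + ji]
              + gamma[bi, ci, i0 + ii, j0 + ji, ci, i0p + ii, j0p + ji])
        t += cuda.blockDim.x

    # Block-level tree reduction in shared memory; no atomics are needed
    # because exactly one block owns this output element.
    buf = cuda.shared.array(TPB, dtype=numba.float32)
    buf[cuda.threadIdx.x] = v
    cuda.syncthreads()
    s = TPB // 2
    while s > 0:
        if cuda.threadIdx.x < s:
            buf[cuda.threadIdx.x] += buf[cuda.threadIdx.x + s]
        cuda.syncthreads()
        s //= 2
    if cuda.threadIdx.x == 0:
        input[bi, c0, i0, j0, c0, i0p, j0p] += c[c0] * buf[0]

# Launch with one block per output element, e.g. on the toy shapes:
# n_blocks = 2 * 3 * 7 * 7 * 7 * 7   (product of input dims 0, 1, 2, 3, 5, 6)
# op_block_reduce[n_blocks, TPB](input, gamma, mu, c, 3, 3, 3)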