Using dynamic shared memory

Hi!
I'm inexperienced in GPU programming (and Numba!), and I'm trying to implement a kernel that performs the following operation:

def op_slow(input, gamma, mu, c, weight_shape):
    for bi in range(input.shape[0]):
        for c0 in range(input.shape[1]):
            for i0 in range(input.shape[2]):
                for j0 in range(input.shape[3]):
                    for i0p in range(input.shape[5]):
                        for j0p in range(input.shape[6]):
                            for ci in range(weight_shape[1]):
                                for ii in range(weight_shape[2]):
                                    for ji in range(weight_shape[3]):
                                        input[bi, c0, i0, j0, c0, i0p, j0p] += c[c0] * (mu[bi, ci, i0+ii, j0+ji] * mu[bi, ci, i0p+ii, j0p+ji] + gamma[bi, ci, i0+ii, j0+ji, ci, i0p+ii, j0p+ji])
    return input

After carefully reviewing the do's and don'ts of GPU programming, I thought a good implementation would:

  • parallelize over bi and c0 (I understood this is a heuristic choice; I couldn't come up with anything better)
  • use shared memory to avoid repeated round trips to off-chip memory, at least for mu and c, which are reused (see the dynamic shared memory sketch just after these lists)

And some unknowns:

  • is the use of @guvectorize recommended in these cases? (I'm not sure I understand the use case of this decorator)
  • should the reduction happen in a cooperative group?
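Since the whole question revolves around dynamic shared memory, here is a minimal, self-contained sketch of how I understand it is declared and sized in Numba (toy kernel, 128 threads per block assumed; the size in bytes goes in the fourth element of the launch configuration). Please correct me if this mental model is wrong:

# Toy sketch of dynamic shared memory in Numba (not the kernel below).
import numpy as np
import numba
from numba import cuda

@cuda.jit
def scale_with_shared(out, arr, factor):
    # Shape 0 marks the array as dynamic: its actual size comes from the
    # launch configuration. Note that all dynamic shared arrays declared in
    # a kernel alias the same memory.
    buf = cuda.shared.array(0, dtype=numba.float32)
    i = cuda.grid(1)
    if i < arr.size:
        buf[cuda.threadIdx.x] = arr[i]
    cuda.syncthreads()
    if i < arr.size:
        out[i] = buf[cuda.threadIdx.x] * factor

x = np.arange(256, dtype=np.float32)
out = np.zeros_like(x)
threads = 128
blocks = (x.size + threads - 1) // threads
nbytes = threads * x.dtype.itemsize  # dynamic shared memory size, in bytes
# Launch configuration: kernel[blocks, threads, stream, dynamic_shared_mem_bytes]
scale_with_shared[blocks, threads, 0, nbytes](out, x, np.float32(2.0))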

Here is what I eventually wrote:

import numba
from numba import cuda

@cuda.jit
def op_numba_c(input, gamma, mu, c, weight_shape_1, weight_shape_2, weight_shape_3):
    sharedI = cuda.shared.array((0, 0, 0, 0, 0, 0, 0), dtype=numba.float32)
    sharedM = cuda.shared.array((0, 0, 0, 0), dtype=numba.float32)
    sharedG = cuda.shared.array((0, 0, 0, 0, 0, 0, 0), dtype=numba.float32)
    sharedC = cuda.shared.array(0, dtype=numba.float32)
    bi, c0 = cuda.grid(2)
    threadB, threadC = cuda.threadIdx.x, cuda.threadIdx.y
    x_gridsize = cuda.gridDim.x * cuda.blockDim.x
    y_gridsize = cuda.gridDim.y * cuda.blockDim.y

    while bi < input.shape[0]:
        while c0 < input.shape[1]:
            if bi < input.shape[0] and c0 < input.shape[1]:
                sharedI[threadB, threadC, :, :, 0] = input[bi, c0, :, :, c0]
                sharedM[threadB] = mu[bi]
                sharedG[threadB] = gamma[bi]
                sharedC = c[c0]
            cuda.syncthreads()
            for i0 in range(input.shape[2]):
                for j0 in range(input.shape[3]):
                    for i0p in range(input.shape[5]):
                        for j0p in range(input.shape[6]):
                            v = numba.float32(0.)
                            for ci in range(weight_shape_1):
                                for ii in range(weight_shape_2):
                                    for ji in range(weight_shape_3):
                                        v += (sharedM[threadB, ci, i0+ii, j0+ji] * sharedM[threadB, ci, i0p+ii, j0p+ji] + sharedG[threadB, ci, i0+ii, j0+ji, ci, i0p+ii, j0p+ji])
                            sharedI[threadB, threadC, i0, j0, 0, i0p, j0p] += sharedC * v
            
            cuda.syncthreads()
            if bi < input.shape[0] and c0 < input.shape[1]:
                input[bi, c0, :, :, c0, :, :] = sharedI[bi, c0, :, :, 0, :, :]

            c0 += y_gridsize
        bi += x_gridsize

However, there are still issues with this implementation.

If anyone would be so kind as to help me understand what's wrong and how to improve it, I would greatly appreciate it :slight_smile: Thanks!

Edit: updated tentative

I can’t quite get my head around the example to figure out what the mathematical expression being implemented here is (other than that it’s some sort of reduction), but I do wonder whether Numba is providing a less-than-ideal level of abstraction here: can your operation be written as a NumPy array expression?

If so, perhaps it would be easier to build a GPU implementation with good performance by implementing that expression using CuPy (using Numba kernels only for operations that don’t fit into an array computing framework).
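For instance (made-up shapes, purely to illustrate the kind of code I mean), an array expression written against the NumPy API typically runs on the GPU with CuPy more or less unchanged:

# Hypothetical illustration: CuPy mirrors the NumPy API, so array expressions
# usually port to the GPU with little more than the import changed.
import cupy as cp

a = cp.random.rand(3, 9, 9, dtype=cp.float32)   # allocated in GPU memory
b = cp.random.rand(3, 9, 9, dtype=cp.float32)
out = cp.einsum('cij,ckl->ijkl', a, b)          # contraction runs as GPU kernels
host = cp.asnumpy(out)                          # explicit copy back to the host if needed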

Indeed, that’s what I wanted to do at first. However, I failed to translate the equation into a NumPy-able expression: my best attempt at implementing it was op_slow, which is not practical in our use case.
The expression is a component of an equation for the covariance of the output of a 2D convolution running on unreliable hardware. In the non-batched case, we have dimensions channel × width × height × channel × width × height, with indices [c0, i0, j0, c0p, i0p, j0p]. However, there is a special condition c0 == c0p. We then perform a convolution-like operation, with c, i and j the dimensions of the convolution kernel, on gamma, mu and coef, which are respectively the previous layer's covariance, expectation and a regularization coefficient vector:
input[c0, i0, j0, c0, i0p, j0p] = coef[c0] * \sum_c \sum_i \sum_j ( gamma[c, i0+i, j0+j, c, i0p+i, j0p+j] + mu[c, i0+i, j0+j] * mu[c, i0p+i, j0p+j] )

If you have an idea of how to implement this using regular NumPy expressions, I'd love to hear it; I couldn't manage on my side. In particular, the shared strides along dimensions (c, i, j) combined with different starting indices (i0 vs. i0p and j0 vs. j0p) are a pain point. Going low level lets me manage the indices as in op_slow while trying to achieve some interesting performance. Eventually, I thought I could also integrate the other components of the equation into the Numba kernel in order to save time/memory.

For confirmation, what is input.shape, and what is weight_shape? They seem to differ between the original op_slow() and the convolution described in the previous post, and I’m trying to put together a working code example to experiment with. (If you could post a sample that sets up a call to op_slow() with the data shapes/types you’re calling it with, that would be really helpful.)

Sure! Here is a toy example:

import torch

device = torch.device('cuda:0')
input = torch.rand(2, 3, 7, 7, 3, 7, 7).to(device)  # (B, C, H_out, W_out, C, H_out, W_out)
gamma = torch.rand(2, 3, 9, 9, 3, 9, 9).to(device)  # previous layer covariance
mu = torch.rand(2, 3, 9, 9).to(device)              # previous layer expectation
c = torch.tensor(1.).to(device)
if c.dim() == 0:
    c = c.repeat(input.shape[1])                     # one regularization coefficient per channel
weight_shape = [3, 3, 3, 3]                          # convolution weight shape; dims 1-3 are the ones used by op_slow
ref = op_slow(input.clone(), gamma, mu, c, weight_shape)

To be precise, in the general case we have convolution weights in 4 dimensions that operate on mu/gamma. input is made from all the other components of the covariance matrix, and thus has the shape of the convolution output with its last three dimensions (C×W×H) repeated (C×W×H×C×W×H).

The data type is left at the default (float32) in this sample, though the user may have cast everything to half precision for memory reasons (gamma and input quickly saturate the GPU VRAM), hence the in-place addition on input.

The signature is different between op_slow and the tentative kernel, as I directly pass the individual dimensions, e.g. weight_shape_1 = weight_shape[1] (I thought passing a torch.Size object would confuse the compiler).

Hey!
I have been making progress on the implementation and on my understanding of Numba. I now have a working tentative implementation that matches the output of op_slow; here are the main changes with respect to the original post:

  • restricted the parallelism to the batch dimension only, to simplify my approach
  • removed the shared arrays for mu, gamma and input for now

Now the catch is that it performs worse than the CPU implementation by 2–3 orders of magnitude. I'm looking toward profiling-based improvement as the next step. My main source is curiouscoding.nl/phd/2021/03/24/numba-cuda-speedup/
I understand the main culprit is probably sub-par data management. I'd take any suggestion!

import numba
from numba import cuda

@cuda.jit
def op_numba_c(input, gamma, mu, c, weight_shape_1, weight_shape_2, weight_shape_3):
    # Stage the per-channel coefficients in shared memory (1024 is a static upper bound on the channel count).
    sharedC = cuda.shared.array(shape=1024, dtype=numba.float32)
    bi = cuda.grid(1)
    x_gridsize = cuda.gridsize(1)
    threadB = cuda.threadIdx.x

    if threadB < c.shape[0]:
        sharedC[threadB] = c[threadB]
    if cuda.blockDim.x < c.shape[0] and threadB == 0:  # block smaller than c: thread 0 loads the remaining values
        for i in range(cuda.blockDim.x, c.shape[0]):
            sharedC[i] = c[i]

    cuda.syncthreads()

    if bi >= input.shape[0]:
        return

    # Grid-stride loop over the batch dimension
    while bi < input.shape[0]:
        # Iterate over i0, j0, i0p, j0p
        for i0 in range(input.shape[2]):
            for j0 in range(input.shape[3]):
                for i0p in range(input.shape[5]):
                    for j0p in range(input.shape[6]):
                        v = numba.float32(0.)
                        for ci in range(weight_shape_1):
                            threadM = mu[bi, ci]
                            threadG = gamma[bi, ci]
                            for ii in range(weight_shape_2):
                                for ji in range(weight_shape_3):
                                    v += (threadM[i0+ii, j0+ji] * threadM[i0p+ii, j0p+ji] + threadG[i0+ii, j0+ji, ci, i0p+ii, j0p+ji])
                        # For each c0, update input
                        for c0 in range(input.shape[1]):
                            #input[bi, c0, i0, j0, c0, i0p, j0p] += v * sharedC[c0]
                            cuda.atomic.add(input, (bi, c0, i0, j0, c0, i0p, j0p), v * sharedC[c0])
        bi += x_gridsize
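For completeness, a minimal sketch of how such a kernel can be launched and roughly timed on the toy example above (64 threads per block is an arbitrary choice for this sketch; the PyTorch CUDA tensors are passed as-is, which works because they expose __cuda_array_interface__):

from numba import cuda

threads_per_block = 64                      # arbitrary choice for this sketch
blocks = (input.shape[0] + threads_per_block - 1) // threads_per_block

start, end = cuda.event(), cuda.event()
start.record()
op_numba_c[blocks, threads_per_block](
    input, gamma, mu, c,                    # CUDA tensors from the toy example
    weight_shape[1], weight_shape[2], weight_shape[3],
)
end.record()
end.synchronize()
print('kernel time (ms):', cuda.event_elapsed_time(start, end))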

Hi, you shouldn’t need atomic add or shared memory at all here. In your last comment you mentioned that only the batch dimension is parallelized; however, you typically need a lot of parallelism to keep the GPU fully occupied, otherwise it is under-utilized.

Your original code can be parallelized as follows:

bi, c0, i0, j0, c0, i0p, j0p => dimensions that can be mapped to thread blocks
ci, ii, ji => dimensions that can be mapped to parallel threads

You will need a block reduction among the threads. Each thread block will be responsible for one data point of input[bi, c0, i0, j0, c0, i0p, j0p]. This way, as far as I can see, no inter-block communication such as atomic add is needed.
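To make the idea concrete, here is an untested sketch of what such a kernel could look like, assuming a 1D launch with one thread block per output element and 128 threads per block (names and constants here are illustrative, not taken from your code):

import numba
from numba import cuda

TPB = 128  # threads per block (assumed; must be a power of two for this reduction)

@cuda.jit
def op_block_reduce(input, gamma, mu, c, ws1, ws2, ws3):
    # Recover (bi, c0, i0, j0, i0p, j0p) from the flat block index.
    n_c0, n_i0, n_j0 = input.shape[1], input.shape[2], input.shape[3]
    n_i0p, n_j0p = input.shape[5], input.shape[6]
    flat = cuda.blockIdx.x
    j0p = flat % n_j0p; flat //= n_j0p
    i0p = flat % n_i0p; flat //= n_i0p
    j0 = flat % n_j0;   flat //= n_j0
    i0 = flat % n_i0;   flat //= n_i0
    c0 = flat % n_c0
    bi = flat // n_c0

    # Each thread accumulates a strided part of the (ci, ii, ji) reduction.
    n_inner = ws1 * ws2 * ws3
    v = numba.float32(0.)
    t = cuda.threadIdx.x
    while t < n_inner:
        ji = t % ws3
        ii = (t // ws3) % ws2
        ci = t // (ws2 * ws3)
        v += (mu[bi, ci, i0 + ii, j0 + ji] * mu[bi, ci, i0p + ii, j0p + ji]
              + gamma[bi, ci, i0 + ii, j0 + ji, ci, i0p + ii, j0p + ji])
        t += cuda.blockDim.x

    # Block-level tree reduction in shared memory; no atomics are needed
    # because exactly one block owns this output element.
    buf = cuda.shared.array(TPB, dtype=numba.float32)
    buf[cuda.threadIdx.x] = v
    cuda.syncthreads()
    s = TPB // 2
    while s > 0:
        if cuda.threadIdx.x < s:
            buf[cuda.threadIdx.x] += buf[cuda.threadIdx.x + s]
        cuda.syncthreads()
        s //= 2
    if cuda.threadIdx.x == 0:
        input[bi, c0, i0, j0, c0, i0p, j0p] += c[c0] * buf[0]

# Launch with one block per output element, e.g. on the toy shapes:
# n_blocks = 2 * 3 * 7 * 7 * 7 * 7   (product of input dims 0, 1, 2, 3, 5, 6)
# op_block_reduce[n_blocks, TPB](input, gamma, mu, c, 3, 3, 3)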