Optimize 3D binary dilation - any tips?

I would like to optimize a 3D binary dilation on the GPU. I suspect the code is not optimal:

@cuda.jit("void(uint8[:,:,:],uint8[:,:,:],uint8[:,:,:])")
def cuda_binary_dilate_u8(vol, out, kern):
    z,y,x = cuda.grid(3)
    d,h,w = vol.shape
    a,b,c = kern.shape
    pa,pb,pc = a//2,b//2,c//2
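    # pa, pb, pc are the half-widths used to centre the kernel on each voxel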
    if z >= 0 and z < d and y >= 0 and y < h and x >= 0 and x < w:
        out[z,y,x] = False
        
        for i in range(a):
            for j in range(b):
                for k in range(c):
                    kv = kern[i,j,k]
                    zz = z+i-pa
                    yy = y+j-pb
                    xx = x+k-pc
                    if zz < 0 or zz >= d or yy < 0 or yy >= h or xx < 0 or xx >= w: 
                        continue
                    if vol[zz,yy,xx] and kv:
                       out[z,y,x] = True

def binary_dilate_u8_cuda(vol, kern, iterations=1):
    xcu = cuda.to_device(vol)
    kcu = cuda.to_device(kern)
    sizes = vol.shape 
    block_dim = (4,4,4)
    grid_dim = tuple(int(np.ceil(a/b)) for a, b in zip(sizes, block_dim))

    ycu = cuda.device_array(shape=vol.shape, dtype=np.uint8)
    a,b = ycu, xcu
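    # ping-pong the two buffers so each iteration dilates the previous result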
    for i in range(iterations):
        a, b = b, a
        cuda_binary_dilate_u8[grid_dim, block_dim](a, b, kcu)
    return b

Do you have a notebook or something runnable that uses the kernel? It’s much easier to get started helping someone optimize their code when it is ready to run.

Sure!

"""
Binary dilation in Numba CUDA.

$ python cuda_binary_dilate_3d.py 512 512 512 1
runtime: 0.001s on an Nvidia Titan Xp
"""
import numba
import numpy as np
import time
from numba import cuda 


@cuda.jit("void(uint8[:,:,:],uint8[:,:,:],uint8[:,:,:])")
def cuda_binary_dilate_u8(vol, out, kern):
    z,y,x = cuda.grid(3)
    d,h,w = vol.shape
    a,b,c = kern.shape
    pa,pb,pc = a//2,b//2,c//2
    if z >= 0 and z < d and y >= 0 and y < h and x >= 0 and x < w:
        out[z,y,x] = False
        
        #put kern in shared memory?
        shared = cuda.shared.array((3,3,3),dtype=numba.boolean)
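        # note: the shared array shape is fixed at compile time, so this assumes kern is 3x3x3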
        for i in range(a):
            for j in range(b):
                for k in range(c):
                    shared[i,j,k] = kern[i,j,k]
        
        for i in range(a):
            for j in range(b):
                for k in range(c):
                    zz = z+i-pa
                    yy = y+j-pb
                    xx = x+k-pc
                    if zz < 0 or zz >= d or yy < 0 or yy >= h or xx < 0 or xx >= w: 
                        continue
                    if vol[zz,yy,xx] and shared[i,j,k]:
                       out[z,y,x] = True

def binary_dilate_u8_cuda(vol, kern, iterations=1):
    sizes = vol.shape 
    block_dim = (4,4,4)
    grid_dim = tuple(int(np.ceil(a/b)) for a, b in zip(sizes, block_dim))

    # allocate a separate output buffer: the dilation cannot run in place
    ycu = cuda.device_array(shape=vol.shape, dtype=np.uint8)
    a,b = ycu, vol 
    for i in range(iterations):
        a, b = b, a
        cuda_binary_dilate_u8[grid_dim, block_dim](a, b, kern)
    return b


def test(d=256,h=256,w=256,niter=1):
    vol = np.random.randn(d,h,w) > 1
    kern = np.random.randn(3,3,3) > 1

    vol = cuda.to_device(vol)
    kern = cuda.to_device(kern)
    
    t1 = time.time()
    for i in range(10):
        binary_dilate_u8_cuda(vol, kern, niter)
    t2 = time.time()
    
    print(f'runtime: {(t2 - t1) / 10}s')  # I measure ~1 ms


if __name__ == '__main__':
    import fire;fire.fire(test)

I think I was not timing properly before (in fact I was probably timing the CUDA-CPU transfer…).
The runtime now comes out around 1 ms. Still, do you have any advice, or a course you would recommend on Numba CUDA and GPU programming? Thanks!

Thanks for posting the code. What’s the fire module you’re using? I did pip install fire and I get:

$ python repro.py 
Traceback (most recent call last):
  File "/home/gmarkall/numbadev/issues/discourse-1100/repro.py", line 74, in <module>
    import fire;fire.fire(test)
AttributeError: module 'fire' has no attribute 'fire'

Before I get into looking at optimizing the kernel, a couple of points:

You need to add cuda.synchronize() before you stop timing, otherwise you’re just timing the CPU enqueuing the kernel launches - the kernels themselves execute asynchronously. So you need something like:

    t1 = time.time()
    for i in range(10):
        binary_dilate_u8_cuda(vol, kern, niter)
    cuda.synchronize()
    t2 = time.time()

in your timing loop.
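
CUDA events are another option for timing if you prefer not to block on a device-wide synchronize; a sketch using Numba’s event API, with the same setup as above:

    start = cuda.event(timing=True)
    end = cuda.event(timing=True)
    start.record()
    for i in range(10):
        binary_dilate_u8_cuda(vol, kern, niter)
    end.record()
    end.synchronize()
    # event_elapsed_time reports milliseconds between the two recorded events
    print(f'runtime: {cuda.event_elapsed_time(start, end) / 10:.3f} ms')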

Secondly, I think you have a potential mismatch between the dtype of your signatures and what you’re actually passing in. The signature specifies uint8 arrays, but vol and kern are bool_ arrays. I would remove the signature so the kernel is declared as:

@cuda.jit
def cuda_binary_dilate_u8(vol, out, kern):

And then have one launch before the timing loop to force compilation of the kernel (because adding the signature also forced compilation at the time of definition):

    # Warm up / compile
    binary_dilate_u8_cuda(vol, kern, niter)

    t1 = time.time()
    for i in range(10):
        binary_dilate_u8_cuda(vol, kern, niter)
        # ...

Is there a reason you added the signature? Generally we suggest not using signatures, and letting Numba determine argument types instead.
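
If you did want to keep the uint8 signature instead, the arrays passed in would need to match it, e.g. by building the test data as:

    # cast the boolean test arrays to uint8 to match the declared signature
    vol = (np.random.randn(d, h, w) > 1).astype(np.uint8)
    kern = (np.random.randn(3, 3, 3) > 1).astype(np.uint8)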

Sorry for the fire part, it is a typo. It should be:

import fire;fire.Fire(test)

Ah, thanks for this, I did not know that. So in fact the timing is closer to 20 ms with cuda.synchronize().
Indeed, the signature does not bring much, except that without it the first pass seems slower because of compilation:

    for i in range(10):
        if i == 1:
            t1 = time.time() # do not count the first time
        binary_dilate_u8_cuda(vol, kern, niter)
    cuda.synchronize()
    t2 = time.time()

I had a chance to make a few quick changes:

  • Hardcoding the sizes (NKERN) allows the compiler to optimize a little better
  • Using threads to cooperatively load into shared memory instead of every thread writing all shared memory values
  • Removing some redundant checks (lower bounds of x, y, and z, for example); a sketch combining all three changes follows below
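
Roughly, the combined changes look something like this sketch (along the lines of the repo version rather than the exact code; it assumes a 3x3x3 kernel and a block with at least NKERN threads in each dimension, which holds for the (4, 4, 4) block used here):

    NKERN = 3  # hardcoded kernel size, so the compiler can unroll the loops

    @cuda.jit
    def cuda_binary_dilate_u8(vol, out, kern):
        z, y, x = cuda.grid(3)
        d, h, w = vol.shape
        p = NKERN // 2

        # cooperative load: each kernel element is copied by a different thread
        shared = cuda.shared.array((NKERN, NKERN, NKERN), dtype=numba.boolean)
        tz, ty, tx = cuda.threadIdx.z, cuda.threadIdx.y, cuda.threadIdx.x
        if tz < NKERN and ty < NKERN and tx < NKERN:
            shared[tz, ty, tx] = kern[tz, ty, tx]
        cuda.syncthreads()

        # cuda.grid() indices are never negative, so only upper bounds need checking
        if z < d and y < h and x < w:
            result = False
            for i in range(NKERN):
                zz = z + i - p
                if zz < 0 or zz >= d:
                    continue
                for j in range(NKERN):
                    yy = y + j - p
                    if yy < 0 or yy >= h:
                        continue
                    for k in range(NKERN):
                        xx = x + k - p
                        if xx >= 0 and xx < w and vol[zz, yy, xx] and shared[i, j, k]:
                            result = True
            out[z, y, x] = result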

I pushed my modified version to: numba-discourse-1100/repro.py at main · gmarkall/numba-discourse-1100 · GitHub

You can run it with python repro.py - note I also added checking of the output to ensure that my changes didn’t alter the output. I couldn’t commit the reference output as it is too large, but you should be able to regenerate it with python original.py.
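
If you want an additional, independent check, scipy’s binary dilation can serve as a reference as well. A sketch, where vol_host and kern_host are hypothetical host-side boolean copies of the inputs and res is the device array returned by binary_dilate_u8_cuda:

    from scipy import ndimage

    # compare the GPU result against scipy's CPU reference implementation
    expected = ndimage.binary_dilation(vol_host, structure=kern_host, iterations=niter)
    np.testing.assert_array_equal(res.copy_to_host().astype(bool), expected)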

Timings on my system (Quadro RTX 8000) were:

  1 Original:                 0.04547548294067383
  2 hardcode NKERN:           0.04254336357116699
  3 coop load shared:         0.04084956645965576
  4 remove redundant checks:  0.03867783546447754

There is definitely more that can be done here, but this is what I could do in a short amount of time.


Thanks a lot! In fact, on my GPU the “coop load shared” change reduces the runtime by a factor of 2! I will have a deeper look at the numba-discourse examples 🙂
