Optimize 3D binary dilation - any tips?

I would like to optimize a 3D binary dilation on the GPU. I suspect the code is not optimal:

@cuda.jit("void(uint8[:,:,:],uint8[:,:,:],uint8[:,:,:])")
def cuda_binary_dilate_u8(vol, out, kern):
    z,y,x = cuda.grid(3)
    d,h,w = vol.shape
    a,b,c = kern.shape
    pa,pb,pc = a//2,b//2,c//2
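    # pa, pb, pc are the half-widths used to centre the kernel on each voxel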
    if z >= 0 and z < d and y >= 0 and y < h and x >= 0 and x < w:
        out[z,y,x] = False
        
        for i in range(a):
            for j in range(b):
                for k in range(c):
                    kv = kern[i,j,k]
                    zz = z+i-pa
                    yy = y+j-pb
                    xx = x+k-pc
                    if zz < 0 or zz >= d or yy < 0 or yy >= h or xx < 0 or xx >= w: 
                        continue
                    if vol[zz,yy,xx] and kv:
                       out[z,y,x] = True

def binary_dilate_u8_cuda(vol, kern, iterations=1):
    xcu = cuda.to_device(vol)
    kcu = cuda.to_device(kern)
    sizes = vol.shape 
    block_dim = (4,4,4)
    grid_dim = tuple(int(np.ceil(a/b)) for a, b in zip(sizes, block_dim))

    ycu = cuda.device_array(shape=vol.shape, dtype=np.uint8)
    a,b = ycu, xcu
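    # ping-pong the two buffers so each iteration dilates the previous result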
    for i in range(iterations):
        a, b = b, a
        cuda_binary_dilate_u8[grid_dim, block_dim](a, b, kcu)
    return b

Do you have a notebook or something runnable that uses the kernel? It’s much easier to get started helping someone optimize their code when it is ready to run.

Sure!

"""
Binary dilation in Numba CUDA.

$ python cuda_binary_dilate_3d.py 512 512 512 1
runtime: 0.001s on an Nvidia Titan Xp
"""
import numba
import numpy as np
import time
from numba import cuda 


@cuda.jit("void(uint8[:,:,:],uint8[:,:,:],uint8[:,:,:])")
def cuda_binary_dilate_u8(vol, out, kern):
    z,y,x = cuda.grid(3)
    d,h,w = vol.shape
    a,b,c = kern.shape
    pa,pb,pc = a//2,b//2,c//2
    if z >= 0 and z < d and y >= 0 and y < h and x >= 0 and x < w:
        out[z,y,x] = False
        
        #put kern in shared memory?
        shared = cuda.shared.array((3,3,3),dtype=numba.boolean)
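        # note: the shared array shape is fixed at compile time, so this assumes kern is 3x3x3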
        for i in range(a):
            for j in range(b):
                for k in range(c):
                    shared[i,j,k] = kern[i,j,k]
        
        for i in range(a):
            for j in range(b):
                for k in range(c):
                    zz = z+i-pa
                    yy = y+j-pb
                    xx = x+k-pc
                    if zz < 0 or zz >= d or yy < 0 or yy >= h or xx < 0 or xx >= w: 
                        continue
                    if vol[zz,yy,xx] and shared[i,j,k]:
                       out[z,y,x] = True

def binary_dilate_u8_cuda(vol, kern, iterations=1):
    sizes = vol.shape 
    block_dim = (4,4,4)
    grid_dim = tuple(int(np.ceil(a/b)) for a, b in zip(sizes, block_dim))

    # allocate a separate output buffer: the dilation cannot run in place
    ycu = cuda.device_array(shape=vol.shape, dtype=np.uint8)
    a,b = ycu, vol 
    for i in range(iterations):
        a, b = b, a
        cuda_binary_dilate_u8[grid_dim, block_dim](a, b, kern)
    return b


def test(d=256,h=256,w=256,niter=1):
    vol = np.random.randn(d,h,w) > 1
    kern = np.random.randn(3,3,3) > 1

    vol = cuda.to_device(vol)
    kern = cuda.to_device(kern)
    
    t1 = time.time()
    for i in range(10):
        binary_dilate_u8_cuda(vol, kern, niter)
    t2 = time.time()
    
    print(f'runtime: {(t2 - t1) / 10}s')  # I measure ~1 ms


if __name__ == '__main__':
    import fire;fire.fire(test)

I think I was not timing properly before (in fact I was probably timing the CUDA-CPU transfer…).
The runtime now comes out around 1 ms. Still, do you have any advice, or a course you would recommend on Numba CUDA and GPU programming? Thanks!

Thanks for posting the code. What’s the fire module you’re using? I did pip install fire and I get:

$ python repro.py 
Traceback (most recent call last):
  File "/home/gmarkall/numbadev/issues/discourse-1100/repro.py", line 74, in <module>
    import fire;fire.fire(test)
AttributeError: module 'fire' has no attribute 'fire'

Before I get into looking at optimizing the kernel, a couple of points:

You need to add cuda.synchronize() before you stop timing, otherwise you’re just timing the CPU enqueuing the kernel launches - the kernels themselves execute asynchronously. So you need something like:

    t1 = time.time()
    for i in range(10):
        binary_dilate_u8_cuda(vol, kern, niter)
    cuda.synchronize()
    t2 = time.time()

in your timing loop.
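
CUDA events are another option for timing if you prefer not to block on a device-wide synchronize; a sketch using Numba’s event API, with the same setup as above:

    start = cuda.event(timing=True)
    end = cuda.event(timing=True)
    start.record()
    for i in range(10):
        binary_dilate_u8_cuda(vol, kern, niter)
    end.record()
    end.synchronize()
    # event_elapsed_time reports milliseconds between the two recorded events
    print(f'runtime: {cuda.event_elapsed_time(start, end) / 10:.3f} ms')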

Secondly, I think you have a potential mismatch between the dtype of your signatures and what you’re actually passing in. The signature specifies uint8 arrays, but vol and kern are bool_ arrays. I would remove the signature so the kernel is declared as:

@cuda.jit
def cuda_binary_dilate_u8(vol, out, kern):

And then have one launch before the timing loop to force compilation of the kernel (because adding the signature also forced compilation at the time of definition):

    # Warm up / compile
    binary_dilate_u8_cuda(vol, kern, niter)

    t1 = time.time()
    for i in range(10):
        binary_dilate_u8_cuda(vol, kern, niter)
        # ...

Is there a reason you added the signature? Generally we suggest not using signatures, and letting Numba determine argument types instead.
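
If you did want to keep the uint8 signature instead, the arrays passed in would need to match it, e.g. by building the test data as:

    # cast the boolean test arrays to uint8 to match the declared signature
    vol = (np.random.randn(d, h, w) > 1).astype(np.uint8)
    kern = (np.random.randn(3, 3, 3) > 1).astype(np.uint8)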

Sorry for the fire part, it is a typo. It should be:

import fire;fire.Fire(test)

Ah, thanks for this, I did not know that. So in fact the timing is closer to 20 ms with cuda.synchronize().
Indeed, the signature does not bring much, except that without it the first pass seems slower because of compilation:

    for i in range(10):
        if i == 1:
            t1 = time.time() # do not count the first time
        binary_dilate_u8_cuda(vol, kern, niter)
    cuda.synchronize()
    t2 = time.time()

I had a chance to make a few quick changes:

  • Hardcoding the sizes (NKERN) allows the compiler to optimize a little better
  • Using threads to cooperatively load into shared memory instead of every thread writing all shared memory values
  • Removing some redundant checks (lower bounds of x, y, and z, for example); a sketch combining all three changes follows below
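
Roughly, the combined changes look something like this sketch (along the lines of the repo version rather than the exact code; it assumes a 3x3x3 kernel and a block with at least NKERN threads in each dimension, which holds for the (4, 4, 4) block used here):

    NKERN = 3  # hardcoded kernel size, so the compiler can unroll the loops

    @cuda.jit
    def cuda_binary_dilate_u8(vol, out, kern):
        z, y, x = cuda.grid(3)
        d, h, w = vol.shape
        p = NKERN // 2

        # cooperative load: each kernel element is copied by a different thread
        shared = cuda.shared.array((NKERN, NKERN, NKERN), dtype=numba.boolean)
        tz, ty, tx = cuda.threadIdx.z, cuda.threadIdx.y, cuda.threadIdx.x
        if tz < NKERN and ty < NKERN and tx < NKERN:
            shared[tz, ty, tx] = kern[tz, ty, tx]
        cuda.syncthreads()

        # cuda.grid() indices are never negative, so only upper bounds need checking
        if z < d and y < h and x < w:
            result = False
            for i in range(NKERN):
                zz = z + i - p
                if zz < 0 or zz >= d:
                    continue
                for j in range(NKERN):
                    yy = y + j - p
                    if yy < 0 or yy >= h:
                        continue
                    for k in range(NKERN):
                        xx = x + k - p
                        if xx >= 0 and xx < w and vol[zz, yy, xx] and shared[i, j, k]:
                            result = True
            out[z, y, x] = result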

I pushed my modified version to: numba-discourse-1100/repro.py at main · gmarkall/numba-discourse-1100 · GitHub

You can run it with python repro.py - note I also added checking of the output to ensure that my changes didn’t alter the output. I couldn’t commit the reference output as it is too large, but you should be able to regenerate it with python original.py.
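
If you want an additional, independent check, scipy’s binary dilation can serve as a reference as well. A sketch, where vol_host and kern_host are hypothetical host-side boolean copies of the inputs and res is the device array returned by binary_dilate_u8_cuda:

    from scipy import ndimage

    # compare the GPU result against scipy's CPU reference implementation
    expected = ndimage.binary_dilation(vol_host, structure=kern_host, iterations=niter)
    np.testing.assert_array_equal(res.copy_to_host().astype(bool), expected)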

Timings on my system (Quadro RTX 8000) were:

  1 Original:                 0.04547548294067383
  2 hardcode NKERN:           0.04254336357116699
  3 coop load shared:         0.04084956645965576
  4 remove redundant checks:  0.03867783546447754

There is definitely more that can be done here, but this is what I could do in a short amount of time.


Thanks a lot! In fact, on my GPU the “coop load shared” change reduces the runtime by a factor of 2! I will have a deeper look at the numba-discourse examples 🙂
