Optimize 3D binary dilation - any tips?

I would like to optimize a 3D binary dilation on the GPU. I suspect the code below is not optimal:

@cuda.jit("void(uint8[:,:,:],uint8[:,:,:],uint8[:,:,:])")
def cuda_binary_dilate_u8(vol, out, kern):
    z, y, x = cuda.grid(3)
    d, h, w = vol.shape
    a, b, c = kern.shape
    pa, pb, pc = a // 2, b // 2, c // 2
    if z >= 0 and z < d and y >= 0 and y < h and x >= 0 and x < w:
        out[z, y, x] = False

        for i in range(a):
            for j in range(b):
                for k in range(c):
                    kv = kern[i, j, k]
                    zz = z + i - pa
                    yy = y + j - pb
                    xx = x + k - pc
                    if zz < 0 or zz >= d or yy < 0 or yy >= h or xx < 0 or xx >= w:
                        continue
                    if vol[zz, yy, xx] and kv:
                        out[z, y, x] = True

def binary_dilate_u8_cuda(vol, kern, iterations=1):
    xcu = cuda.to_device(vol)
    kcu = cuda.to_device(kern)
    sizes = vol.shape
    block_dim = (4, 4, 4)
    grid_dim = tuple(int(np.ceil(a / b)) for a, b in zip(sizes, block_dim))

    ycu = cuda.device_array(shape=vol.shape, dtype=np.uint8)
    a, b = ycu, xcu
    for i in range(iterations):
        a, b = b, a
        cuda_binary_dilate_u8[grid_dim, block_dim](a, b, kcu)
    return b
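For reference, the same dilation can be computed on the CPU by OR-ing shifted copies of the volume, one shift per nonzero kernel element. This pure-NumPy sketch is a helper I am adding for checking correctness (the name `binary_dilate_ref` is made up, not from any library); it follows the kernel's zero-padded border convention:

```python
import numpy as np

def binary_dilate_ref(vol, kern):
    """CPU reference: OR of shifted copies of vol, one per nonzero
    kernel element, with zero-padded borders (matches the CUDA kernel)."""
    d, h, w = vol.shape
    a, b, c = kern.shape
    pa, pb, pc = a // 2, b // 2, c // 2
    padded = np.pad(vol.astype(bool), ((pa, pa), (pb, pb), (pc, pc)))
    out = np.zeros(vol.shape, dtype=bool)
    for i in range(a):
        for j in range(b):
            for k in range(c):
                if kern[i, j, k]:
                    # padded[z + i, ...] lines up with vol[z + i - pa, ...]
                    out |= padded[i:i + d, j:j + h, k:k + w]
    return out
```

Comparing `binary_dilate_ref(vol, kern)` against the device result (copied back and cast to bool) should then show whether any GPU-side change altered the output.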

Do you have a notebook or something runnable that uses the kernel? It’s much easier to get started helping someone optimize their code when it is ready to run.

Sure!
[binary 3d dilate cuda]

"""
binary dilation in numba cuda
>>> python cuda_binary_dilate_3d.py 512 512 512 1
runtime: 0.001s on an Nvidia Titan XP.
"""
import numba
import numpy as np
import time
from numba import cuda

@cuda.jit("void(uint8[:,:,:],uint8[:,:,:],uint8[:,:,:])")
def cuda_binary_dilate_u8(vol, out, kern):
    z, y, x = cuda.grid(3)
    d, h, w = vol.shape
    a, b, c = kern.shape
    pa, pb, pc = a // 2, b // 2, c // 2
    if z >= 0 and z < d and y >= 0 and y < h and x >= 0 and x < w:
        out[z, y, x] = False

        # put kern in shared memory?
        shared = cuda.shared.array((3, 3, 3), dtype=numba.boolean)
        for i in range(a):
            for j in range(b):
                for k in range(c):
                    shared[i, j, k] = kern[i, j, k]

        for i in range(a):
            for j in range(b):
                for k in range(c):
                    zz = z + i - pa
                    yy = y + j - pb
                    xx = x + k - pc
                    if zz < 0 or zz >= d or yy < 0 or yy >= h or xx < 0 or xx >= w:
                        continue
                    if vol[zz, yy, xx] and shared[i, j, k]:
                        out[z, y, x] = True

def binary_dilate_u8_cuda(vol, kern, iterations=1):
    sizes = vol.shape
    block_dim = (4, 4, 4)
    # stride
    sizes = [v for v in sizes]
    grid_dim = tuple(int(np.ceil(a / b)) for a, b in zip(sizes, block_dim))

    # non inplace
    ycu = cuda.device_array(shape=vol.shape, dtype=np.uint8)
    a, b = ycu, vol
    for i in range(iterations):
        a, b = b, a
        cuda_binary_dilate_u8[grid_dim, block_dim](a, b, kern)
    return b

def test(d=256, h=256, w=256, niter=1):
    vol = np.random.randn(d, h, w) > 1
    kern = np.random.randn(3, 3, 3) > 1

    vol = cuda.to_device(vol)
    kern = cuda.to_device(kern)

    t1 = time.time()
    for i in range(10):
        binary_dilate_u8_cuda(vol, kern, niter)
    t2 = time.time()

    print(f'runtime: {(t2 - t1) / 10}')  # I measure 1 ms

if __name__ == '__main__':
    import fire;fire.fire(test)

I think I was not timing properly before (in fact I was probably timing the CUDA-CPU transfer…).
The runtime now indicates around 1 ms. Still, if you have any advice, or a course on Numba CUDA and GPU programming in general? Thanks!

Thanks for posting the code. What’s the fire module you’re using? I did pip install fire and I get

$ python repro.py
Traceback (most recent call last):
File "/home/gmarkall/numbadev/issues/discourse-1100/repro.py", line 74, in <module>
import fire;fire.fire(test)
AttributeError: module 'fire' has no attribute 'fire'

Before I get into looking at optimizing the kernel, a couple of points:

You need to add cuda.synchronize() before you stop timing, otherwise you’re just timing the CPU enqueuing kernels - the actual launches happen asynchronously. So you need something like:

t1 = time.time()
for i in range(10):
    binary_dilate_u8_cuda(vol, kern, niter)
cuda.synchronize()
t2 = time.time()
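One generic way to package this pattern is to pass both the launch and the synchronize as callables (this `time_kernel` helper is a hypothetical sketch for illustration, not a Numba API):

```python
import time

def time_kernel(launch, sync, repeats=10, warmup=1):
    """Average wall-clock time per launch of an asynchronous operation.

    launch: callable that enqueues the work (e.g. a CUDA kernel launch).
    sync:   callable that blocks until all queued work has finished
            (e.g. numba.cuda.synchronize).
    """
    for _ in range(warmup):  # trigger JIT compilation outside the timing
        launch()
    sync()
    t0 = time.perf_counter()
    for _ in range(repeats):
        launch()
    sync()  # crucial: wait for all queued work before reading the clock
    return (time.perf_counter() - t0) / repeats
```

Here that would be `time_kernel(lambda: binary_dilate_u8_cuda(vol, kern, niter), cuda.synchronize)`; without the final `sync()` you only measure how fast the CPU can enqueue launches.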

Secondly, I think you have a potential mismatch between the dtype of your signatures and what you’re actually passing in. The signature specifies uint8 arrays, but vol and kern are bool_ arrays. I would remove the signature so the kernel is declared as:

@cuda.jit
def cuda_binary_dilate_u8(vol, out, kern):

And then have one launch before the timing loop to force compilation of the kernel (because adding the signature also forced compilation at the time of definition):

# Warm up / compile
binary_dilate_u8_cuda(vol, kern, niter)

t1 = time.time()
for i in range(10):
    binary_dilate_u8_cuda(vol, kern, niter)
# ...

Is there a reason you added the signature? Generally we suggest not using signatures, and letting Numba determine argument types instead.

Sorry for the fire part, it is a typo. It should be:

import fire;fire.Fire(test)

Ah, thanks for this, I did not know. So in fact the timing is closer to 20 ms with cuda.synchronize().
Indeed, the signature does not bring much, except that the first pass seems slower without it (compilation happens on the first call instead of at definition time).

for i in range(10):
    if i == 1:
        t1 = time.time()  # do not count the first time
    binary_dilate_u8_cuda(vol, kern, niter)
cuda.synchronize()
t2 = time.time()

I had a chance to make a few quick changes:

• Hardcoding the sizes (NKERN) allows the compiler to optimize a little better
• Removing some redundant checks (lower bounds of x, y, and z, for example)

I pushed my modified version to: numba-discourse-1100/repro.py at main · gmarkall/numba-discourse-1100 · GitHub

You can run it with python repro.py - note I also added checking of the output to ensure that my changes didn’t alter the output. I couldn’t commit the reference output as it is too large, but you should be able to regenerate it with python original.py.

Timings on my system (Quadro RTX 8000) were:

Original:        0.04547548294067383
hardcode NKERN:  0.04254336357116699

Thanks a lot! In fact, on my GPU the “coop load shared” version reduces the runtime by a factor of 2! I will have a deeper look at the numba-discourse examples.