# Optimize 3D binary dilation - any tips?

I would like to optimize a 3D binary dilation on the GPU. I suspect the code is not optimal:

```
@cuda.jit("void(uint8[:,:,:],uint8[:,:,:],uint8[:,:,:])")
def cuda_binary_dilate_u8(vol, out, kern):
    z, y, x = cuda.grid(3)
    d, h, w = vol.shape
    a, b, c = kern.shape
    pa, pb, pc = a // 2, b // 2, c // 2
    if z >= 0 and z < d and y >= 0 and y < h and x >= 0 and x < w:
        out[z, y, x] = False

        for i in range(a):
            for j in range(b):
                for k in range(c):
                    kv = kern[i, j, k]
                    zz = z + i - pa
                    yy = y + j - pb
                    xx = x + k - pc
                    if zz < 0 or zz >= d or yy < 0 or yy >= h or xx < 0 or xx >= w:
                        continue
                    if vol[zz, yy, xx] and kv:
                        out[z, y, x] = True

def binary_dilate_u8_cuda(vol, kern, iterations=1):
    xcu = cuda.to_device(vol)
    kcu = cuda.to_device(kern)
    sizes = vol.shape
    block_dim = (4, 4, 4)
    grid_dim = tuple(int(np.ceil(a / b)) for a, b in zip(sizes, block_dim))

    ycu = cuda.device_array(shape=vol.shape, dtype=np.uint8)
    a, b = ycu, xcu
    for i in range(iterations):
        a, b = b, a
        cuda_binary_dilate_u8[grid_dim, block_dim](a, b, kcu)
    return b
```

Do you have a notebook or something runnable that uses the kernel? It's much easier to get started helping someone optimize their code if it is ready to run.

Sure!
[binary 3d dilate cuda]

```
"""
binary dilation in numba cuda

$ python cuda_binary_dilate_3d.py 512 512 512 1
runtime: 0.001s on an Nvidia Titan Xp.
"""
import numba
import numpy as np
import time
from numba import cuda

@cuda.jit("void(uint8[:,:,:],uint8[:,:,:],uint8[:,:,:])")
def cuda_binary_dilate_u8(vol, out, kern):
    z, y, x = cuda.grid(3)
    d, h, w = vol.shape
    a, b, c = kern.shape
    pa, pb, pc = a // 2, b // 2, c // 2
    if z >= 0 and z < d and y >= 0 and y < h and x >= 0 and x < w:
        out[z, y, x] = False

        # put kern in shared memory?
        shared = cuda.shared.array((3, 3, 3), dtype=numba.boolean)
        for i in range(a):
            for j in range(b):
                for k in range(c):
                    shared[i, j, k] = kern[i, j, k]

        for i in range(a):
            for j in range(b):
                for k in range(c):
                    zz = z + i - pa
                    yy = y + j - pb
                    xx = x + k - pc
                    if zz < 0 or zz >= d or yy < 0 or yy >= h or xx < 0 or xx >= w:
                        continue
                    if vol[zz, yy, xx] and shared[i, j, k]:
                        out[z, y, x] = True

def binary_dilate_u8_cuda(vol, kern, iterations=1):
    sizes = vol.shape
    block_dim = (4, 4, 4)
    # stride
    sizes = [v for v in sizes]
    grid_dim = tuple(int(np.ceil(a / b)) for a, b in zip(sizes, block_dim))

    # non inplace
    ycu = cuda.device_array(shape=vol.shape, dtype=np.uint8)
    a, b = ycu, vol
    for i in range(iterations):
        a, b = b, a
        cuda_binary_dilate_u8[grid_dim, block_dim](a, b, kern)
    return b

def test(d=256, h=256, w=256, niter=1):
    vol = np.random.randn(d, h, w) > 1
    kern = np.random.randn(3, 3, 3) > 1

    vol = cuda.to_device(vol)
    kern = cuda.to_device(kern)

    t1 = time.time()
    for i in range(10):
        binary_dilate_u8_cuda(vol, kern, niter)
    t2 = time.time()

    print(f'runtime: {(t2 - t1) / 10}')  # I measure 1 ms

if __name__ == '__main__':
    import fire;fire.fire(test)
```

I think I was not properly timing before (in fact I was probably timing the CUDA-to-CPU transfer…).
The runtime indicates around 1 ms. Still, do you have any advice, or a course on Numba CUDA and GPU programming? Thanks!

Thanks for posting the code. What's the `fire` module you're using? I did `pip install fire` and I get

```
$ python repro.py
Traceback (most recent call last):
  File "/home/gmarkall/numbadev/issues/discourse-1100/repro.py", line 74, in <module>
    import fire;fire.fire(test)
AttributeError: module 'fire' has no attribute 'fire'
```

Before I get into looking at optimizing the kernel, a couple of points:

You need to add `cuda.synchronize()` before you stop timing, otherwise you're just timing the CPU enqueueing kernels - the actual launches happen asynchronously. So you need something like:

```
t1 = time.time()
for i in range(10):
    binary_dilate_u8_cuda(vol, kern, niter)
cuda.synchronize()
t2 = time.time()
```

Secondly, I think you have a potential mismatch between the dtype in your signature and what you're actually passing in. The signature specifies `uint8` arrays, but `vol` and `kern` are `bool_` arrays. I would remove the signature so the kernel is declared as:

```
@cuda.jit
def cuda_binary_dilate_u8(vol, out, kern):
```

Then do one launch before the timing loop to force compilation of the kernel (adding the signature had forced compilation at definition time):

```
# Warm up / compile
binary_dilate_u8_cuda(vol, kern, niter)

t1 = time.time()
for i in range(10):
    binary_dilate_u8_cuda(vol, kern, niter)
# ...
```

Is there a reason you added the signature? Generally we suggest not using signatures, and letting Numba determine argument types instead.
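To make the mismatch concrete, here is a minimal plain-NumPy sketch (the variable names are illustrative): a comparison like `randn(...) > 1` yields a `bool_` array, which does not match a `uint8[:,:,:]` signature, and an explicit cast would be needed to keep the signature.

```python
import numpy as np

# Comparison operators on float arrays yield bool_, not uint8
vol = np.random.randn(4, 4, 4) > 1
print(vol.dtype)  # bool

# An explicit cast would be needed to match a uint8[:,:,:] signature
vol_u8 = vol.astype(np.uint8)
print(vol_u8.dtype)  # uint8
```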

Sorry for the `fire` part, it is a typo. It should be:

```
import fire; fire.Fire(test)
```

Ah, thanks for this, I did not know. So in fact the timing is closer to 20 ms with `cuda.synchronize()`.
Indeed, the signature does not bring much, except that without it the first pass seems slower because of compilation.

```
for i in range(10):
    if i == 1:
        t1 = time.time()  # do not count the first iteration
    binary_dilate_u8_cuda(vol, kern, niter)
cuda.synchronize()
t2 = time.time()
```

I had a chance to make a few quick changes:

• Hardcoding the sizes (`NKERN`) allows the compiler to optimize a little better
• Removing some redundant checks (lower bounds of `x`, `y`, and `z`, for example)

I pushed my modified version to: numba-discourse-1100/repro.py at main · gmarkall/numba-discourse-1100 · GitHub

You can run it with `python repro.py` - note I also added checking of the output to ensure that my changes didn't alter the output. I couldn't commit the reference output as it is too large, but you should be able to regenerate it with `python original.py`.
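If regenerating a large reference file is inconvenient, one alternative (a sketch, assuming SciPy is available) is to check against `scipy.ndimage.binary_dilation` directly; setting the kernel's origin element keeps the input contained in the output, matching the usual dilation convention:

```python
import numpy as np
from scipy.ndimage import binary_dilation

vol = np.random.randn(32, 32, 32) > 1
kern = np.random.randn(3, 3, 3) > 1
kern[1, 1, 1] = True  # set the origin so the input is contained in the output

# Reference result computed on the CPU
expected = binary_dilation(vol, structure=kern)

# The CUDA result could then be compared against it, e.g.:
# got = binary_dilate_u8_cuda(cuda.to_device(vol.astype(np.uint8)),
#                             cuda.to_device(kern.astype(np.uint8)))
# assert np.array_equal(got.copy_to_host().astype(bool), expected)
```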

Timings on my system (Quadro RTX 8000) were:

```
1 Original:       0.04547548294067383
2 hardcode NKERN: 0.04254336357116699