I’m trying to copy the content of a 3D matrix to another using the following code:
import numpy as np
from math import ceil
from numba import cuda
@cuda.jit
def device_copy(src, dst):
    z, x, y = cuda.grid(3)
    dst[z, x, y] = src[z, x, y]
arr1 = cuda.to_device(np.arange(1000000).reshape(100,100,100))
arr2 = cuda.device_array_like(arr1)
tpb = (8,8,8)
bpg = (ceil(arr1.shape[0]/tpb[0]), ceil(arr1.shape[1]/tpb[1]), ceil(arr1.shape[2]/tpb[2]))
device_copy[bpg, tpb](arr1,arr2)
This seems to work, but if I copy the data back to the host, I get the following error:
numba.cuda.cudadrv.driver.CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR
Can someone tell me what I’m doing wrong?
If you run this under cuda-memcheck with:
cuda-memcheck python repro.py
you’ll see a lot of invalid memory accesses: the shape of your grid does not exactly match the shape of the array you’re copying, so threads past the edges read and write out of bounds. You can add a guard to the kernel so that only threads within the bounds of the arrays do any work:
import numpy as np
from math import ceil
from numba import cuda
@cuda.jit
def device_copy(src, dst):
    z, x, y = cuda.grid(3)
    if z < src.shape[0] and x < src.shape[1] and y < src.shape[2]:
        dst[z, x, y] = src[z, x, y]
host_arr1 = np.arange(1000000).reshape(100,100,100)
arr1 = cuda.to_device(host_arr1)
arr2 = cuda.device_array_like(arr1)
tpb = (8,8,8)
bpg = (ceil(arr1.shape[0]/tpb[0]), ceil(arr1.shape[1]/tpb[1]), ceil(arr1.shape[2]/tpb[2]))
device_copy[bpg, tpb](arr1,arr2)
host_arr2 = arr2.copy_to_host()
# Sanity check
np.testing.assert_equal(host_arr1, host_arr2)
print(host_arr2)
This then produces no errors when run under cuda-memcheck:
$ cuda-memcheck python repro.py
========= CUDA-MEMCHECK
[[[ 0 1 2 ... 97 98 99]
[ 100 101 102 ... 197 198 199]
[ 200 201 202 ... 297 298 299]
...
[ 9700 9701 9702 ... 9797 9798 9799]
[ 9800 9801 9802 ... 9897 9898 9899]
[ 9900 9901 9902 ... 9997 9998 9999]]
[[ 10000 10001 10002 ... 10097 10098 10099]
[ 10100 10101 10102 ... 10197 10198 10199]
[ 10200 10201 10202 ... 10297 10298 10299]
...
[ 19700 19701 19702 ... 19797 19798 19799]
[ 19800 19801 19802 ... 19897 19898 19899]
[ 19900 19901 19902 ... 19997 19998 19999]]
[[ 20000 20001 20002 ... 20097 20098 20099]
[ 20100 20101 20102 ... 20197 20198 20199]
[ 20200 20201 20202 ... 20297 20298 20299]
...
[ 29700 29701 29702 ... 29797 29798 29799]
[ 29800 29801 29802 ... 29897 29898 29899]
[ 29900 29901 29902 ... 29997 29998 29999]]
...
[[970000 970001 970002 ... 970097 970098 970099]
[970100 970101 970102 ... 970197 970198 970199]
[970200 970201 970202 ... 970297 970298 970299]
...
[979700 979701 979702 ... 979797 979798 979799]
[979800 979801 979802 ... 979897 979898 979899]
[979900 979901 979902 ... 979997 979998 979999]]
[[980000 980001 980002 ... 980097 980098 980099]
[980100 980101 980102 ... 980197 980198 980199]
[980200 980201 980202 ... 980297 980298 980299]
...
[989700 989701 989702 ... 989797 989798 989799]
[989800 989801 989802 ... 989897 989898 989899]
[989900 989901 989902 ... 989997 989998 989999]]
[[990000 990001 990002 ... 990097 990098 990099]
[990100 990101 990102 ... 990197 990198 990199]
[990200 990201 990202 ... 990297 990298 990299]
...
[999700 999701 999702 ... 999797 999798 999799]
[999800 999801 999802 ... 999897 999898 999899]
[999900 999901 999902 ... 999997 999998 999999]]]
========= ERROR SUMMARY: 0 errors
Note that you get UNKNOWN_CUDA_ERROR from the copy to the host because kernel launches are asynchronous, whereas copying data to the host is synchronous - so the error from your kernel only surfaces during the call that copies the data back. If you call cuda.synchronize() immediately after the launch, the error is raised at that point instead, which makes it easier to pin down which kernel faulted.
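As a quick illustration of the mismatch (host-side arithmetic only, no GPU required): with 8 threads per block per dimension and an extent of 100, the grid rounds up to 13 blocks, so 104 threads run in each dimension and indices 100 through 103 fall outside the array:

```python
from math import ceil

shape = (100, 100, 100)  # array extents from the example above
tpb = (8, 8, 8)          # threads per block in each dimension

for extent, block in zip(shape, tpb):
    blocks = ceil(extent / block)  # blocks launched in this dimension
    threads = blocks * block       # threads actually launched
    overshoot = threads - extent   # threads with no matching element
    print(f"extent {extent}: {blocks} blocks x {block} threads = "
          f"{threads} threads, {overshoot} out of bounds")
```

Each dimension launches 104 threads for only 100 elements, which is exactly why the guard in the kernel is needed.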
Oh right! Completely forgot about the guard. Thanks for the reply.