Copying the content of a device array to a device array

I’m trying to copy the contents of one 3D device array to another using the following code:

import numpy as np
from math import ceil
from numba import cuda

@cuda.jit
def device_copy(src, dst):
    z, x, y = cuda.grid(3)
    dst[z, x, y] = src[z, x, y]

arr1 = cuda.to_device(np.arange(1000000).reshape(100,100,100))
arr2 = cuda.device_array_like(arr1)

tpb = (8,8,8)
bpg = (ceil(arr1.shape[0]/tpb[0]), ceil(arr1.shape[1]/tpb[1]), ceil(arr1.shape[2]/tpb[2]))
device_copy[bpg, tpb](arr1,arr2)

This seems to work, but when I copy the data back to the host I get the following error:

numba.cuda.cudadrv.driver.CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR

Can someone tell me what I’m doing wrong?

If you run this under cuda-memcheck with:

cuda-memcheck python repro.py

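You’ll see a lot of invalid memory accesses, because the grid does not line up exactly with the array you’re copying: ceil(100/8) is 13 blocks per axis, so 13 × 8 = 104 threads are launched along each axis, and the threads with an index of 100 to 103 read and write past the end of both arrays. You can add a guard to the kernel so that only elements within the bounds of the arrays are written: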

import numpy as np
from math import ceil
from numba import cuda

@cuda.jit
def device_copy(src, dst):
    z, x, y = cuda.grid(3)
    # Guard: threads whose index falls outside the array do nothing
    if z < src.shape[0] and x < src.shape[1] and y < src.shape[2]:
        dst[z, x, y] = src[z, x, y]

host_arr1 = np.arange(1000000).reshape(100,100,100)
arr1 = cuda.to_device(host_arr1)
arr2 = cuda.device_array_like(arr1)

tpb = (8,8,8)
bpg = (ceil(arr1.shape[0]/tpb[0]), ceil(arr1.shape[1]/tpb[1]), ceil(arr1.shape[2]/tpb[2]))
device_copy[bpg, tpb](arr1,arr2)

host_arr2 = arr2.copy_to_host()

# Sanity check
np.testing.assert_equal(host_arr1, host_arr2)
print(host_arr2)

This version runs without errors under cuda-memcheck:

$ cuda-memcheck python repro.py 
========= CUDA-MEMCHECK
[[[     0      1      2 ...     97     98     99]
  [   100    101    102 ...    197    198    199]
  [   200    201    202 ...    297    298    299]
  ...
  [  9700   9701   9702 ...   9797   9798   9799]
  [  9800   9801   9802 ...   9897   9898   9899]
  [  9900   9901   9902 ...   9997   9998   9999]]

 [[ 10000  10001  10002 ...  10097  10098  10099]
  [ 10100  10101  10102 ...  10197  10198  10199]
  [ 10200  10201  10202 ...  10297  10298  10299]
  ...
  [ 19700  19701  19702 ...  19797  19798  19799]
  [ 19800  19801  19802 ...  19897  19898  19899]
  [ 19900  19901  19902 ...  19997  19998  19999]]

 [[ 20000  20001  20002 ...  20097  20098  20099]
  [ 20100  20101  20102 ...  20197  20198  20199]
  [ 20200  20201  20202 ...  20297  20298  20299]
  ...
  [ 29700  29701  29702 ...  29797  29798  29799]
  [ 29800  29801  29802 ...  29897  29898  29899]
  [ 29900  29901  29902 ...  29997  29998  29999]]

 ...

 [[970000 970001 970002 ... 970097 970098 970099]
  [970100 970101 970102 ... 970197 970198 970199]
  [970200 970201 970202 ... 970297 970298 970299]
  ...
  [979700 979701 979702 ... 979797 979798 979799]
  [979800 979801 979802 ... 979897 979898 979899]
  [979900 979901 979902 ... 979997 979998 979999]]

 [[980000 980001 980002 ... 980097 980098 980099]
  [980100 980101 980102 ... 980197 980198 980199]
  [980200 980201 980202 ... 980297 980298 980299]
  ...
  [989700 989701 989702 ... 989797 989798 989799]
  [989800 989801 989802 ... 989897 989898 989899]
  [989900 989901 989902 ... 989997 989998 989999]]

 [[990000 990001 990002 ... 990097 990098 990099]
  [990100 990101 990102 ... 990197 990198 990199]
  [990200 990201 990202 ... 990297 990298 990299]
  ...
  [999700 999701 999702 ... 999797 999798 999799]
  [999800 999801 999802 ... 999897 999898 999899]
  [999900 999901 999902 ... 999997 999998 999999]]]
========= ERROR SUMMARY: 0 errors
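
To see where the out-of-bounds accesses in the original kernel come from, the launch arithmetic can be worked through on the host. This is only a rough sketch in plain Python (no GPU needed); the names `launched` and `out_of_bounds` are made up for illustration, and the shape and block size are the ones from the question:

```python
from math import ceil

shape = (100, 100, 100)  # array shape from the question
tpb = (8, 8, 8)          # threads per block from the question

# Blocks per grid, rounded up so that the whole array is covered.
bpg = tuple(ceil(n / t) for n, t in zip(shape, tpb))

# Threads actually launched along each axis (blocks * threads per block).
launched = tuple(b * t for b, t in zip(bpg, tpb))

# Threads with at least one index past the end of the array.
total_threads = launched[0] * launched[1] * launched[2]
total_elements = shape[0] * shape[1] * shape[2]
out_of_bounds = total_threads - total_elements

print(bpg)            # (13, 13, 13)
print(launched)       # (104, 104, 104)
print(out_of_bounds)  # 124864
```

Rounding the block count up is what guarantees full coverage, so the surplus threads are expected whenever a dimension is not a multiple of the block size; the guard is what makes them harmless.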

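Note that the UNKNOWN_CUDA_ERROR is reported by the copy to the host rather than by the kernel launch because kernels launch asynchronously, whereas copying data to the host is a synchronous operation. The launch returns before the kernel has finished running, so the error in your kernel only shows up at the next call that waits for the device, which here is the call to copy the data.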

Oh right! Completely forgot about the guard. Thanks for the reply.
