CUDA partial copy from device to host

When developing CUDA code with Numba, is it possible to copy only part of a device array to the host?

I have a two-dimensional array on the device. I need the complete matrix for some computations, but I'm only interested in the resulting values in the first column of every row. Since the array is huge, I wonder whether it is possible to copy only this column back to the host; copying just that part would be a great improvement in memory terms.

You can copy slices of arrays, but be aware that if your slice is non-contiguous then you might still be copying a lot more than you want. This can be demonstrated with the following code:

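from numba import cuda

# 128 x 128 array of float64 (the default dtype), allocated on the device
arr = cuda.device_array((128, 128))
row = arr[1, :]  # contiguous slice: one row
col = arr[:, 1]  # strided (non-contiguous) slice: one column

print("Copying whole array")
arr.copy_to_host()
print("Copying row")
row.copy_to_host()
print("Copying column")
col.copy_to_host()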

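When run with API and argument logging enabled (for example with the environment variables NUMBA_CUDA_LOG_LEVEL=DEBUG and NUMBA_CUDA_LOG_API_ARGS=1), I see the following (only the relevant lines are shown; other API calls are omitted):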

Copying whole array
== CUDA [219] DEBUG -- call driver api: cuMemcpyDtoH_v2(94779505341152, 140629260632064, 131072)
Copying row
== CUDA [219] DEBUG -- call driver api: cuMemcpyDtoH_v2(94779505340112, 140629260633088, 1024)
Copying column
== CUDA [219] DEBUG -- call driver api: cuMemcpyDtoH_v2(94779505341152, 140629260632072, 130056)

The last argument of each cuMemcpyDtoH_v2 call is the number of bytes transferred, so in each case we have:

  • Whole matrix: 131072 bytes (128 rows * 128 columns of 8-byte values)
  • Whole row: 1024 bytes (128 values of 8 bytes)
  • Whole column: 130056 bytes (128 * 127 + 1 = 16257 values of 8 bytes, i.e. everything from the first to the last element of the column)
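The three byte counts in the log are consistent with copying the whole memory span between the first and last element of each slice. The arithmetic can be checked with a small sketch in plain Python; span_bytes is a hypothetical helper (not part of Numba), and the shapes and strides assume the C-ordered 128 x 128 float64 array from the example:

```python
def span_bytes(shape, strides, itemsize):
    """Bytes from the first to the last element of a strided view, inclusive."""
    if any(n == 0 for n in shape):
        return 0
    # Offset of the last element from the first, plus the last element itself.
    return sum((n - 1) * abs(s) for n, s in zip(shape, strides)) + itemsize


ITEMSIZE = 8          # float64
ROW_STRIDE = 128 * 8  # bytes between consecutive rows of the 128 x 128 array

whole = span_bytes((128, 128), (ROW_STRIDE, ITEMSIZE), ITEMSIZE)
row = span_bytes((128,), (ITEMSIZE,), ITEMSIZE)   # arr[1, :]
col = span_bytes((128,), (ROW_STRIDE,), ITEMSIZE)  # arr[:, 1]
print(whole, row, col)  # 131072 1024 130056
```

For a real array, the same shape and strides can be read from its .shape and .strides attributes.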

So copying a single column still transfers almost the whole matrix. If you really want to copy only the first column, you will need to gather it into a contiguous array on the device and then copy that.