Reading a DeviceNDArray on the GPU

Can someone tell me if there’s a proper way to read data from a DeviceNDArray without copying/transferring to the host? Or is it never intended?
For example, I’m able to read the data if I make a CuPy reference:

some_arr = cuda.to_device(np.array(0))  # can't read this
some_arr_cupy = cupy.asarray(some_arr)  # can read this

asarray copies the array to the host. You cannot read from a device array until you transfer it to the host.

What do you mean by “read” it? Can you provide a bit more context about what you’re trying to achieve?

… Perhaps a Managed Array is what you want: Memory Management — Numba documentation

For a description of Managed Memory, see Unified Memory for CUDA Beginners.

Oops. Just found out the problem and it’s quite clear from the example code.

some_arr = cuda.to_device(np.array(0))

This will never work as it creates a zero-dimensional array. I guess I was thinking of something like np.zeros() or np.ones(), where you don’t need to worry about that. Of course, it just needs to change to this:

some_arr = cuda.to_device(np.array([0]))
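The difference is easy to see in plain NumPy, independent of Numba:

```python
import numpy as np

scalar_like = np.array(0)    # zero-dimensional: no axes to index into
one_element = np.array([0])  # one-dimensional with a single element

print(scalar_like.ndim, scalar_like.shape)  # 0 ()
print(one_element.ndim, one_element.shape)  # 1 (1,)
print(one_element[0])                       # indexing works on the 1-D array
```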

This is an edited reply after finding out the issue


Thanks for the link on Unified Memory. It might come in handy. What does “considered experimental on Windows” mean though?

I’m trying to write a kernel using 2 CuPy arrays - a1 to read and a2 to write - and it is critical that these writes happen in-place. I preallocated a2 with the typical zeros function and then thought of a way to access those preallocated indices. The data are 3D coordinates.
I thought of a sequential atomic counter plus CuPy’s put() to do this.
(In the meantime, I found out CuPy’s functions don’t seem to work in a numba cuda kernel).

some_arr = cuda.to_device(np.array(0))

This will never work as it creates a zero-dimensional array.

This will work with Numba 0.53.

What does “considered experimental on Windows” mean though?

It means that you can use it and it will probably work, but is not tested as well as on Linux. If you use it on Windows and run into issues, then please report them (or report that it works well for you) so that we can increase the level of assurance of Unified Memory support on Windows, and move it towards being properly supported instead of being considered experimental.

I’m trying to write a kernel using 2 CuPy arrays - a1 to read and a2 to write - and it is critical that these writes happen in-place. I preallocated a2 with the typical zeros function and then thought of a way to access those preallocated indices. The data are 3D coordinates.
I thought of a sequential atomic counter plus CuPy’s put() to do this.

I’m having a really hard time understanding the issue here - perhaps you can present an example of the code that you’d like to work, and explain what you observe that shows it doesn’t presently work?

(In the meantime, I found out CuPy’s functions don’t seem to work in a numba cuda kernel).

That’s correct - CuPy functions cannot be called from within kernels.

Oh! Let me correct the solution then.

Will do :+1:

The original issue has been overcome. Here’s the kernel after some changes:

@cuda.jit
def get_surface(binary_image, surface, counter):
    row, image_slice = cuda.grid(2)
    span_found = False
    if row < binary_image.shape[0] and image_slice < binary_image.shape[2]:  # guard for rows and slices
        for column in range(binary_image.shape[1]):
            if binary_image[row, column, image_slice] == 1:
                if not span_found:  # Connected Component found
                    span_found = True
                    cuda.atomic.add(counter, 0, 1)
                    surface[counter[0]] = (row, column, image_slice)
                    if column == binary_image.shape[1] - 1:  # Case where span starts in the last column
                        cuda.atomic.add(counter, 0, 1)
                        surface[counter[0]] = (row, column, image_slice)
                else:  # Case where span ends in the last column
                    cuda.atomic.add(counter, 0, 1)
                    surface[counter[0]] = (row, column, image_slice)
            elif binary_image[row, column, image_slice] == 0 and span_found:
                cuda.atomic.add(counter, 0, 1)
                surface[counter[0]] = (row, column, image_slice)

To add some context to the kernel:
binary_image and surface are the CuPy arrays I read from and write to, respectively; counter is a Numba device array I’m using as a counter. binary_image is a 3D binary volume with 0 representing the background and 1 the object, while surface is arranged as (number_of_points, 3).
The goal is to extract the points that make up the object’s surface, which I’m doing using spans. As far as I know, Numba CUDA does not allow Python lists due to their dynamic nature, and NumPy/CuPy arrays cannot be resized dynamically. Therefore, a solution could be to preallocate surface using something like np.zeros().
However, I still need to ensure that these positions are indexed hence the counter:

surface[counter[0]] = (row, column, image_slice)

I now have a new issue where this instruction doesn’t work. This is what I get:

No implementation of function Function(<built-in function setitem>) found for signature:
 
 >>> setitem(array(uint16, 2d, C), int32, Tuple(int32, int64, int32))
 
There are 16 candidate implementations:
      - Of which 16 did not match due to:
      Overload of function 'setitem': File: <numerous>: Line N/A.
        With argument(s): '(array(uint16, 2d, C), int32, Tuple(int32, int64, int32))':
       No match.

def get_surface(binary_image, surface, counter):
    <source elided>
                    cuda.atomic.add(counter, 0, 1)
                    surface[counter[0]] = (row, column, image_slice)
                    ^

Is there some fundamental mistake I’m making?

Oops. Just found out the problem and it’s quite clear from the example code.

some_arr = cuda.to_device(np.array(0))

This will never work as it creates a zero-dimensional array.
Note: @gmarkall pointed out this will work with Numba 0.53.

Of course, it just needs to change to this:

some_arr = cuda.to_device(np.array([0]))

The error message is saying:

  • surface is a 2D array of uint16
  • It is being indexed with a 1D index (the int32 in (array(uint16, 2d, C), int32, Tuple(int32, int64, int32)))
  • The RHS of the assignment is a tuple of (int32, int64, int32)
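For contrast, at the NumPy level the same row assignment is accepted regardless of the widths of the tuple elements, which points at Numba's stricter typing rather than the operation itself (a small CPU-side check, assuming nothing beyond NumPy):

```python
import numpy as np

surface = np.zeros((4, 3), dtype=np.uint16)
# NumPy casts the mixed-width tuple elements on assignment without complaint.
surface[0] = (1, 2, 3)
print(surface[0])  # [1 2 3]
```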

Assigning a tuple to a column of a 2D array generally seems to work in the CUDA target, e.g.:

from numba import cuda
import numpy as np


@cuda.jit
def assign_2d(x):
    i = cuda.grid(1)
    x[i] = (i * 2, i * 2 + 1)


x = np.zeros((32, 2))
assign_2d[1, 32](x)
print(x)

produces:

$ python repro.py 
[[ 0.  1.]
 [ 2.  3.]
 [ 4.  5.]
...
 [58. 59.]
 [60. 61.]
 [62. 63.]]

so I suspect the problem might be the widths of the types being assigned - if you change (row, column, image_slice) to (uint16(row), uint16(column), uint16(image_slice)), where uint16 is imported via from numba.types import uint16, does that help?

If not, can you please post some code I can run to reproduce the problem? Then I can try to be more specific about the issue.

Thanks for the all the help so far :smile:

The good news first: Your assumption was correct. The casts worked and I no longer get any errors.
I assume that, if possible, I just need to change the array I’m reading from to uint16 and that’ll take care of the issue.

The bad news: The insertions seem… random? The counter is sequential. Each insertion can only fall onto one case meaning that it will add +1 to the counter. However, instead of a sequential array of coordinates, this is what I get:
[screenshot of the scrambled output array omitted]

Any idea as to why?

This is how I’m initializing the whole thing:

# A CuPy reference of a 3D binary image
binary_image_cupy = cp.asarray(binary_image)
# label_data_characterization retrieves the bounding box of the particle with a specific label
bbox, bbox_dim = label_data_characterization(binary_image_cupy, 185284128)
bbox_surface = cp.zeros((15000, 3), dtype=np.uint16)
# the counter (starts at -1; the kernel increments before reading)
counter = cuda.to_device(np.array([-1]))

bb_threads = (16, 16)
bb_blocks = (ceil(bbox_dim[0] / bb_threads[0]), ceil(bbox_dim[2] / bb_threads[1]))
get_surface[bb_blocks, bb_threads](bbox, bbox_surface, counter)

Turns out it was a good old concurrency issue.
The original function had pretty severe issues. Here’s the current version:

@cuda.jit
def get_surface(binary_image, surface, counter):
    row, image_slice = cuda.grid(2)
    span_found = False
    if row < binary_image.shape[0] and image_slice < binary_image.shape[2]:  
        for column in range(binary_image.shape[1]):
            if binary_image[row, column, image_slice] == 1:
                if not span_found:  # Span found
                    span_found = True
                    surface[cuda.atomic.add(counter, 0, 1)] = (u2(row), u2(column), u2(image_slice))
                elif column == binary_image.shape[1]-1 and span_found:  # Span ends in last column
                    span_found = False
                    surface[cuda.atomic.add(counter, 0, 1)] = (u2(row), u2(column), u2(image_slice))
            elif binary_image[row, column, image_slice] == 0 and span_found:  # Span is over
                span_found = False
                surface[cuda.atomic.add(counter, 0, 1)] = (u2(row), u2(column-1), u2(image_slice))

The problem was what happened between the atomic add and the insertion:

cuda.atomic.add(counter, 0, 1)
surface[counter[0]] = (u2(row), u2(column), u2(image_slice))

I changed it to the function above and things are now working properly.
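The key property is that cuda.atomic.add returns the value the counter held before the increment, so each thread reserves a unique slot in one indivisible step; reading counter[0] afterwards, as in the broken version, lets other threads bump the counter in between. The same fetch-and-add idea can be sketched on the CPU with threads (a plain-Python analogy, not CUDA code; the FetchAndAdd class is a made-up stand-in for the atomic):

```python
import threading

class FetchAndAdd:
    """CPU analogy of cuda.atomic.add: increment and return the OLD value."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def add(self, amount=1):
        with self._lock:
            old = self._value
            self._value += amount
            return old  # the slot this caller has reserved

counter = FetchAndAdd()
slots = []
slots_lock = threading.Lock()

def worker():
    for _ in range(1000):
        slot = counter.add(1)  # reserve-and-read in one atomic step
        with slots_lock:
            slots.append(slot)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every reserved slot is unique -- no two writers ever collide.
print(len(slots), len(set(slots)))  # 4000 4000
```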