Making Awkward Arrays work in the CUDA target

The CUDA Array Interface facilitates passing data between different CUDA-aware libraries, and is supported by CuPy, Numba, and others. I have an example of implementing the interface for Runtime API-allocated pointers, which I’ll post later - I’m on a personal machine at the moment and I don’t have it handy.
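
In the meantime, here's a quick illustration of what the interface enables (a minimal sketch, assuming CuPy is installed and a CUDA device is present) - a CuPy allocation can be handed straight to a Numba kernel with no copy:

import cupy as cp
from numba import cuda

# CuPy allocates device memory and exposes __cuda_array_interface__ on it
d_arr = cp.zeros(16, dtype=cp.float32)

@cuda.jit
def add_one(x):
    i = cuda.grid(1)
    if i < len(x):
        x[i] += 1.0

# Numba consumes the interface directly - the kernel writes into CuPy's memory
add_one[1, 16](d_arr)
print(d_arr)  # [1. 1. ... 1.]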

Got the example now - it is:

from numba import cuda

# Get access to the CUDA Runtime API via ctypes

from ctypes import CDLL, POINTER, byref, c_void_p, c_size_t

cudart = CDLL('libcudart.so')

cudaMalloc = cudart.cudaMalloc
cudaMalloc.argtypes = [POINTER(c_void_p), c_size_t]

cudaFree = cudart.cudaFree
cudaFree.argtypes = [c_void_p]


# Create a class to provide the CUDA Array interface.
#
# Simple example for read-write contiguous data.

FLOAT32_SIZE = 4


class MyFloatArray:
    """An array of floats that allocates memory on the device when constructed,
    and frees it when it is deleted.
    """

    def __init__(self, size):
        self._size = size
        self._typestr = 'f4'
        self._ptr = c_void_p()

        alloc_size = size * FLOAT32_SIZE
        ret = cudaMalloc(byref(self._ptr), alloc_size)
        if ret:
            raise RuntimeError(f'Unexpected return code {ret} from cudaMalloc')

    def __del__(self):
        cudaFree(self._ptr)

    @property
    def __cuda_array_interface__(self):
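        # 'strides' is omitted below; the interface treats a missing 'strides'
        # entry as indicating C-contiguous data. The False in the 'data' tuple
        # marks the memory as writable (not read-only).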
        return {
            'shape': (self._size,),
            'typestr': self._typestr,
            'data': (self._ptr.value, False),
            'version': 2,
        }


# Create an instance of the array object that allocates some memory

nelems = 32
arr = MyFloatArray(nelems)


# For sanity checks / inspection

arr_cai = arr.__cuda_array_interface__

cai_ptr = arr_cai['data'][0]
print(f"Pointer from CAI is 0x{cai_ptr:x}")

print(f"Shape from CAI is {arr_cai['shape']}")
print(f"Type string from CAI is '{arr_cai['typestr']}'")
print(f"Version from CAI is {arr_cai['version']}")


# Launch a kernel on the array holding the Runtime API-allocated memory

@cuda.jit
def initialize(x):
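    # Compute this thread's global index and guard against out-of-bounds
    # accesses when there are more threads than elements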
    i = cuda.grid(1)
    if i < len(x):
        x[i] = 3.14


initialize[1, 32](arr)


# We can also copy the data back to the host and print it out to check that it
# is as expected. For simplicity in this example, we'll use the as_cuda_array()
# function, which takes any object implementing the CUDA Array Interface and
# creates a Numba device array view of it (without copying), so that we can
# copy the data to the host and print it.

print(cuda.as_cuda_array(arr).copy_to_host())

# Prints:
# [3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14
# 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14
# 3.14 3.14 3.14 3.14]

I had a feeling that exchanging data between contexts required calling a function to get a valid pointer for the current context (or this might have been related to IPC) - I can’t find the documentation for what I was thinking of right now, but I will post back when I have located it. In general, if you’re just using a single context, I think passing pointers between different libraries will be fine.
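
As a concrete illustration of that (a hedged sketch, assuming CuPy is also installed), CuPy can consume the array from the example above through the interface within the same context, with no copy:

import cupy as cp

# cp.asarray() consumes __cuda_array_interface__ without copying, so this view
# aliases the memory allocated with cudaMalloc above and sees the values the
# Numba kernel wrote
cp_view = cp.asarray(arr)
print(cp_view.sum())  # expect 32 * 3.14 ~= 100.48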

Having done some further reading to remind myself, I think I may have been thinking of Peer Access between different devices (CUDA Driver API :: CUDA Toolkit Documentation). If you’re only concerned with pointers on the same device, then I think this isn’t something you need to worry about for Awkward, and you can consider all device pointers allocated by different libraries / APIs interchangeable between those libraries and API functions.
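
For reference, here's a minimal sketch of what checking and enabling Peer Access looks like through the Runtime API, using the same ctypes approach as the example above (error handling is elided and the device numbers are just illustrative):

from ctypes import CDLL, POINTER, byref, c_int, c_uint

cudart = CDLL('libcudart.so')

cudaDeviceCanAccessPeer = cudart.cudaDeviceCanAccessPeer
cudaDeviceCanAccessPeer.argtypes = [POINTER(c_int), c_int, c_int]

cudaDeviceEnablePeerAccess = cudart.cudaDeviceEnablePeerAccess
cudaDeviceEnablePeerAccess.argtypes = [c_int, c_uint]

# Check whether the current device (0) can access memory on device 1
can_access = c_int()
cudaDeviceCanAccessPeer(byref(can_access), 0, 1)

if can_access.value:
    # Enable access from the current device to device 1's memory.
    # The flags argument is reserved and must be 0.
    cudaDeviceEnablePeerAccess(1, 0)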