Graphics API interop

The general idea is to get hold of a device pointer from D3D interop, then use it to construct an instance of the DeviceNDArray class. The RAPIDS Memory Manager (RMM) used to do something similar before EMM Plugins were available; see rmm/rmm.py at branch-0.13 · rapidsai/rmm · GitHub:

import ctypes

import numpy as np
from numba import cuda
import rmm._lib as librmm  # RMM's Cython layer, which provides DeviceBuffer


def device_array(shape, dtype=np.float64, strides=None, order="C", stream=0):
    """
    device_array(shape, dtype=np.float64, strides=None, order='C',
                 stream=0)
    Allocate an empty Numba device array. Clone of Numba's `cuda.device_array`,
    but uses RMM for device memory management.
    """
    shape, strides, dtype = cuda.api._prepare_shape_strides_dtype(
        shape, strides, dtype, order
    )
    datasize = cuda.driver.memory_size_from_info(
        shape, strides, dtype.itemsize
    )

    buf = librmm.DeviceBuffer(size=datasize, stream=stream)

    ctx = cuda.current_context()
    ptr = ctypes.c_uint64(int(buf.ptr))
    mem = cuda.driver.MemoryPointer(ctx, ptr, datasize, owner=buf)
    return cuda.cudadrv.devicearray.DeviceNDArray(
        shape, strides, dtype, gpu_data=mem
    )
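As a sanity check, the layout computation that `_prepare_shape_strides_dtype` and `memory_size_from_info` perform for a C-ordered array can be mimicked in plain NumPy. The helper below is my own illustration of what the strides work out to, not Numba's actual implementation:

```python
import numpy as np

def c_contiguous_strides(shape, itemsize):
    """Row-major (C-order) strides: the stride of each dimension is the
    number of bytes covered by all dimensions to its right."""
    strides = []
    acc = itemsize
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return tuple(reversed(strides))

shape = (4, 3)
itemsize = np.dtype(np.float32).itemsize  # 4 bytes

strides = c_contiguous_strides(shape, itemsize)
datasize = shape[0] * strides[0]  # total bytes for a C-contiguous array

# Compare against NumPy's own layout for the same array
ref = np.empty(shape, dtype=np.float32)
assert strides == ref.strides  # (12, 4)
assert datasize == ref.nbytes  # 48
```

The resulting shape/strides/dtype triple is exactly what gets handed to `DeviceNDArray`, and `datasize` is what the allocation (or the wrapped buffer) needs to cover.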

The above code creates a MemoryPointer that points to RMM-allocated memory, then uses it to initialize a DeviceNDArray instance. Assuming you get the pointer to your D3D buffer as an integer, the above could be modified to create a Numba array pointing to the D3D buffer:

def d3d_device_array(ptr, shape, dtype=np.float32, strides=None, order="C"):
    shape, strides, dtype = cuda.api._prepare_shape_strides_dtype(
        shape, strides, dtype, order
    )
    datasize = cuda.driver.memory_size_from_info(
        shape, strides, dtype.itemsize
    )

    def make_finalizer(ptr):
        def finalize():
            # d3d_free is assumed to be a function that "cleans up" ptr
            # e.g. decrementing a reference count, or freeing it, etc...
            # whatever needs to be done when it is no longer needed by
            # Numba.
            d3d_free(ptr)

        return finalize

    ctx = cuda.current_context()
    c_ptr = ctypes.c_uint64(ptr)
    finalizer = make_finalizer(ptr)
    mem = cuda.driver.MemoryPointer(ctx, c_ptr, datasize, finalizer=finalizer)
    return cuda.cudadrv.devicearray.DeviceNDArray(
        shape, strides, dtype, gpu_data=mem
    )

# Using d3d_device_array:

# A function that gets your D3D buffer (I'm unsure of the implementation
# details here; presumably it would follow the SDK example and make the
# buffer accessible to Python)
ptr, size = my_get_d3d_buf()  # Assume a 1D float32 array; size is the
                              # element count, not the size in bytes
d3d_array = d3d_device_array(ptr, size)

# d3d_array is now ready to be passed to a kernel
kernel[griddim, blockdim](d3d_array, ...)

The finalizer is needed so that when the Numba device array is garbage collected, D3D can somehow be told that the pointer is no longer in use and can be freed (I'm not sure exactly what needs to be done here, but perhaps you know already, or can tell from the SDK example?).
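To make that lifecycle concrete, here's a plain-Python sketch of the finalizer pattern using `weakref.finalize`. Numba drives its finalizers through its own reference-counting machinery rather than `weakref`, and `d3d_free` is still the hypothetical cleanup function from above, but the effect is the same: the closure captures the pointer and runs once the owning object goes away.

```python
import weakref

freed = []

def d3d_free(ptr):
    # Stand-in for the real D3D cleanup call
    freed.append(ptr)

class FakeDeviceArray:
    """Stand-in for the object that owns the external pointer."""
    def __init__(self, ptr):
        self.ptr = ptr
        # Arrange for d3d_free(ptr) to run when this object is collected
        weakref.finalize(self, d3d_free, ptr)

arr = FakeDeviceArray(0xDEADBEEF)
assert freed == []   # array still alive: nothing freed yet
del arr              # last reference dropped; finalizer fires
assert freed == [0xDEADBEEF]
```

In the real `d3d_device_array`, passing `finalizer=finalizer` to `MemoryPointer` is what hooks this cleanup into Numba's own object lifetime tracking.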

I hope this helps illustrate things - are there other areas I should try to sketch out?