Graphics API interop

Hello! I’m very new to Numba, so my question is probably trivial, but are there any possibilities for interop between Numba’s CUDA backend and graphics APIs like D3D12 or Vulkan (the related native CUDA APIs are cudaImportExternalMemory et al.)?
Basically, what I’d like to do is to allocate a buffer on D3D12 side, pass it to Numba (without GPU/CPU roundtrip), perform some calculations on its values and use the modified buffer for rendering.
I’ve had a look at the EMM (External Memory Management) section in the docs, but unfortunately didn’t really understand whether it could help.

Thank you!

I’m not too familiar with D3D12 and Vulkan, but if you can point me towards an example of what you want to do that uses CUDA C/C++, I can probably point you in the right direction for doing the same thing in Numba. Are there some good short examples of this?

Hello, here is a sample from Nvidia https://github.com/NVIDIA/cuda-samples/blob/master/Samples/simpleD3D12/simpleD3D12.cpp
The most interesting parts are around lines 380-430.

Thanks!

Thanks for the pointer - is there a Python wrapper for D3D12 that you were planning to use, or are you hoping to use C/C++ for the D3D12 work and pass pointers from it into Python to use with Numba?

Nope, I was not planning to use Python outside of Numba, so my desired setup is similar to the sample I’ve linked - a C/C++ DX12 renderer that can use CUDA as some sort of plugin to edit DX12 GPU buffers on-chip.

Thanks. I’ll see if I can find the time to put together an example - this does sound like a fun application of Numba! It will be a little bit of a learning curve for me (I’ve never used any graphics APIs before) but my plan is to package up the code from the example you linked to in a DLL that can be loaded with ctypes as a starter.
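A minimal sketch of the ctypes side of that plan might look like the following. The DLL path and the exported functions (get_d3d_buffer, release_d3d_buffer) are hypothetical names I'm using for illustration - the real renderer DLL would define its own exports:

```python
import ctypes

def load_renderer(dll_path):
    """Load a hypothetical renderer DLL and declare its exports.

    Assumes the DLL wraps the D3D12/CUDA interop from the SDK sample
    and hands back a CUDA device pointer for the shared buffer.
    """
    lib = ctypes.CDLL(dll_path)
    # void* get_d3d_buffer(size_t* n_elements)
    # Returns the device pointer obtained via cudaImportExternalMemory
    # on the C++ side, and writes the element count to n_elements.
    lib.get_d3d_buffer.restype = ctypes.c_void_p
    lib.get_d3d_buffer.argtypes = [ctypes.POINTER(ctypes.c_size_t)]
    # void release_d3d_buffer(void* ptr)
    # Tells the renderer that Python no longer needs the buffer.
    lib.release_d3d_buffer.restype = None
    lib.release_d3d_buffer.argtypes = [ctypes.c_void_p]
    return lib
```

The pointer returned by get_d3d_buffer (a plain integer on the Python side) is what would then be wrapped up into a Numba device array.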

BTW, an EMM Plugin is not likely to be good for this use case - it is for replacing Numba’s memory management entirely, which I don’t think you want to do - it sounds more like Numba’s memory management should be kept intact, and only specific buffers shared into Numba device arrays.

Thank you for your help! Please poke me if I can assist you in any way.
Also, may I kindly ask you to outline the general approach to implementing this memory sharing so that I could look into it, too?

The general idea is to get hold of a device pointer from D3D interop, and then use it to construct an instance of the DeviceNDArray class. The Rapids Memory Manager (RMM) used to do this before EMM Plugins could be used: rmm/rmm.py at branch-0.13 · rapidsai/rmm · GitHub

import ctypes
import numpy as np
from numba import cuda
# librmm is RMM's internal module providing DeviceBuffer

def device_array(shape, dtype=np.float64, strides=None, order="C", stream=0):
    """
    Allocate an empty Numba device array. Clone of Numba `cuda.device_array`,
    but uses RMM for device memory management.
    """
    shape, strides, dtype = cuda.api._prepare_shape_strides_dtype(
        shape, strides, dtype, order
    )
    datasize = cuda.cudadrv.driver.memory_size_from_info(
        shape, strides, dtype.itemsize
    )

    buf = librmm.DeviceBuffer(size=datasize, stream=stream)

    ctx = cuda.current_context()
    ptr = ctypes.c_uint64(int(buf.ptr))
    mem = cuda.cudadrv.driver.MemoryPointer(ctx, ptr, datasize, owner=buf)
    return cuda.cudadrv.devicearray.DeviceNDArray(
        shape, strides, dtype, gpu_data=mem
    )

The above code creates a MemoryPointer that points to RMM-allocated memory, then uses it to initialize a DeviceNDArray instance. Assuming you get the pointer to your D3D buffer as an integer, the above could be modified to create a Numba array pointing to the D3D buffer:

def d3d_device_array(ptr, shape, dtype=np.float32, strides=None, order="C"):
    shape, strides, dtype = cuda.api._prepare_shape_strides_dtype(
        shape, strides, dtype, order
    )
    datasize = cuda.cudadrv.driver.memory_size_from_info(
        shape, strides, dtype.itemsize
    )

    def make_finalizer(ptr):
        def finalize():
            # d3d_free is assumed to be a function that "cleans up" ptr
            # e.g. decrementing a reference count, or freeing it, etc...
            # whatever needs to be done when it is no longer needed by
            # Numba.
            d3d_free(ptr)

        return finalize

    ctx = cuda.current_context()
    c_ptr = ctypes.c_uint64(ptr)
    finalizer = make_finalizer(ptr)
    mem = cuda.cudadrv.driver.MemoryPointer(ctx, c_ptr, datasize, finalizer=finalizer)
    return cuda.cudadrv.devicearray.DeviceNDArray(
        shape, strides, dtype, gpu_data=mem
    )

# Using d3d_device_array:

# A function that gets your D3D buffer (I'm unsure of the implementation
# details here, but will somehow follow the SDK example and make it
# accessible to Python)
ptr, size = my_get_d3d_buf()  # Assume this returns the device pointer and element count of a 1D float32 array
d3d_array = d3d_device_array(ptr, size)

# d3d_array is now ready to be passed to a kernel
kernel[griddim, blockdim](d3d_array, ...)

The finalizer is needed so that when the Numba Device Array is garbage collected, it can somehow let D3D know that the pointer is no longer in use / can be freed (I’m not sure exactly what needs to be done, but perhaps you know already / can tell from the SDK example?).
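To see the lifetime mechanics in isolation (independent of Numba and D3D), the same pattern can be sketched in plain Python with weakref.finalize, which is roughly what Numba's finalizer machinery does for a MemoryPointer. Here d3d_free is just a stand-in that records the released pointer:

```python
import gc
import weakref

released = []

def d3d_free(ptr):
    # Stand-in for the real cleanup call on the D3D side (hypothetical)
    released.append(ptr)

class SharedBuffer:
    """Wrapper that releases its pointer when it is garbage collected."""
    def __init__(self, ptr):
        self.ptr = ptr
        # Register the cleanup; it runs when this object is collected
        weakref.finalize(self, d3d_free, ptr)

buf = SharedBuffer(0x7F00DEAD)
del buf
gc.collect()
# released now contains the pointer, i.e. d3d_free was called
```

In the real d3d_device_array above, the finalizer passed to MemoryPointer plays the role of weakref.finalize here.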

I hope this helps illustrate things - are there other areas I should try to sketch out?

Graham, thanks a lot for the explanation, it seems mostly clear to me :slight_smile: Just one more question though - do I understand correctly that (at least, on Windows) for this to work, I need my renderer to be in the same process as the Python interpreter running Numba (since this device address is virtual)? I think I recall some mentions of IPC in the docs, but I believe they were Linux-specific.

Numba only supports IPC on Linux, but I suspect that limitation exists only because CUDA didn’t support IPC on Windows at the time it was implemented. I’m going to try enabling the IPC tests on Windows and running through the test suite to see if it “just works” - if it works, or works with only minor modifications, we can probably lift that restriction.
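For reference, here is a sketch of how Numba's IPC API is used, following the pattern in Numba's own test suite. It needs a CUDA GPU to actually run, so it is shown as function definitions; the handle from get_ipc_handle is picklable and can be sent to a child process:

```python
import multiprocessing as mp
import numpy as np

def parent():
    # Runs in the process that owns the allocation; requires a CUDA GPU.
    from numba import cuda
    devarr = cuda.to_device(np.arange(10, dtype=np.float32))
    handle = devarr.get_ipc_handle()   # picklable IPC handle
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=child, args=(handle,))
    p.start()
    p.join()

def child(handle):
    # Opens the same device allocation in another process.
    from numba import cuda
    with handle as darr:               # darr is a device array view
        print(darr.copy_to_host())
```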

That sounds awesome, thank you :slight_smile: Would you mind posting here if anything related to both D3D interop sample and Windows IPC comes up?

IPC seems to work on Windows - will make a PR shortly. If you want to use it right away, you can apply this patch:

diff --git a/numba/cuda/cudadrv/driver.py b/numba/cuda/cudadrv/driver.py
index 83e15617d..76666742a 100644
--- a/numba/cuda/cudadrv/driver.py
+++ b/numba/cuda/cudadrv/driver.py
@@ -40,7 +40,6 @@ from numba.cuda.envvars import get_numba_envvar

 VERBOSE_JIT_LOG = int(get_numba_envvar('VERBOSE_CU_JIT_LOG', 1))
 MIN_REQUIRED_CC = (2, 0)
-SUPPORTS_IPC = sys.platform.startswith('linux')


 _py_decref = ctypes.pythonapi.Py_DecRef
@@ -1164,8 +1163,6 @@ class Context(object):
         """
         Returns a *IpcHandle* from a GPU allocation.
         """
-        if not SUPPORTS_IPC:
-            raise OSError('OS does not support CUDA IPC')
         return self.memory_manager.get_ipc_handle(memory)

     def open_ipc_handle(self, handle, size):

diff --git a/numba/cuda/tests/cudapy/test_ipc.py b/numba/cuda/tests/cudapy/test_ipc.py
index 5acee1b00..4733c48e1 100644
--- a/numba/cuda/tests/cudapy/test_ipc.py
+++ b/numba/cuda/tests/cudapy/test_ipc.py
@@ -81,7 +81,6 @@ def ipc_array_test(ipcarr, result_queue):
     result_queue.put((succ, out))


-@linux_only
 @skip_under_cuda_memcheck('Hangs cuda-memcheck')
 @unittest.skipUnless(has_mp_get_context, "requires multiprocessing.get_context")
 @skip_on_cudasim('Ipc not available in CUDASIM')
@@ -190,18 +189,6 @@ class TestIpcMemory(ContextResettingTestCase):
                 self.check_ipc_array(index, foreign)


-@unittest.skipIf(linux, 'Only on OS other than Linux')
-@skip_on_cudasim('Ipc not available in CUDASIM')
-class TestIpcNotSupported(ContextResettingTestCase):
-    def test_unsupported(self):
-        arr = np.arange(10, dtype=np.intp)
-        devarr = cuda.to_device(arr)
-        with self.assertRaises(OSError) as raises:
-            devarr.get_ipc_handle()
-        errmsg = str(raises.exception)
-        self.assertIn('OS does not support CUDA IPC', errmsg)
-
-
 def staged_ipc_handle_test(handle, device_num, result_queue):
     def the_work():
         with cuda.gpus[device_num]:
@@ -244,7 +231,6 @@ def staged_ipc_array_test(ipcarr, device_num, result_queue):
     result_queue.put((succ, out))


-@linux_only
 @skip_under_cuda_memcheck('Hangs cuda-memcheck')
 @unittest.skipUnless(has_mp_get_context, "requires multiprocessing.get_context")
 @skip_on_cudasim('Ipc not available in CUDASIM')

Thanks a lot for the help!

PR for Windows IPC support: CUDA: Support IPC on Windows by gmarkall · Pull Request #6818 · numba/numba · GitHub

BTW, have you considered embedding the Python interpreter in your application so you don’t need to use IPC? 1. Embedding Python in Another Application — Python 3.9.2 documentation - if you can embed Python it would simplify your interoperability layer and avoid the need for IPC at all.

Yep, I’ve considered that, but ideally I’d like to use Numba from a Jupyter notebook, and I’m not sure if I can embed a Jupyter server in my process (but I’ll look into that, so thanks for your suggestion).