CUDA - OpenGL interop

Hi!

I’m exploring CUDA - OpenGL interop examples.

E.g. cuda-samples/Samples/5_Domain_Specific/simpleGL

Using moderngl for OpenGL bits.

I’ve used cuda-python to get a raw C pointer to the VBO (“Vertex Buffer Object”).

How do I pass the raw C pointer to a Numba kernel?

I imagine it has to be wrapped into DeviceNDArray somehow…?

Thanks!

Leaving only relevant CUDA bits:

import numba.cuda as nbcuda
from cuda import cudart
from cuda.cudart import cudaGraphicsRegisterFlags as GLFlags

#...
# bunch of OpenGL prep code for VBO.
#...

err, gl_resource = cudart.cudaGraphicsGLRegisterBuffer(
    vbo.glo, GLFlags.cudaGraphicsRegisterFlagsWriteDiscard
)

(err,) = cudart.cudaGraphicsMapResources(1, gl_resource, 0)

(err, dev_ptr, dev_ptr_size) = cudart.cudaGraphicsResourceGetMappedPointer(gl_resource)

# no errors up to this point and I get
# raw C pointer dev_ptr to the VBO data ...

# run kernel
@nbcuda.jit
def kern(x):
    idx = nbcuda.grid(1)
    x[idx] = x[idx] + 0.2

# What needs to be done with dev_ptr so that the kern call accepts it?
# This fails, as expected:
kern.forall(vertices.size)(dev_ptr)

# ...
# bunch of clean up code

Found a relevant post: graphics-api-interop

It describes D3D, but I think I can wrangle it into an OpenGL example…

I added support for passing raw pointers in CUDA-jitted kernels to Numba (in main currently, will be in 0.57): Calling foreign functions from Python kernels — Numba 0.57.0dev0+1629.g987879043.dirty documentation - however the main aim there was to be able to pass pointers to data to CUDA C / C++ functions that might be expecting them, rather than Numba’s array struct.

However, it should be possible to write an argument-handling extension that takes your raw pointer to the VBO and passes it to the kernel as a uint64 - the only problem is the extension handler API is undocumented (and maybe a bit fiddly). There’s an example of using it for curand:

  • In the jit function, the “registration” of the handler:
  • The handler implementation itself:

Argument handlers are classes with a prepare_args method that accepts the type of an argument and its value, and returns a transformed type and value. If the extension doesn’t handle the given type, it should just return the type and value unchanged. Otherwise, it can transform the argument in whatever way is needed and return the new argument type and value, which should be a type already handled by Numba kernels (e.g. a scalar type, an array, tuple, etc.).
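A minimal sketch of that contract, written in plain Python so it runs without a GPU. The names Boxed and UnboxHandler, and the injected typeof callable, are hypothetical and only for illustration; in real use the returned type would presumably be a Numba type (for example from numba.typeof), and the handler would be registered on the jitted function as in the curand example.

```python
class Boxed:
    """Hypothetical wrapper that a kernel would not understand as-is."""

    def __init__(self, payload):
        self.payload = payload


class UnboxHandler:
    """Hypothetical argument handler following the prepare_args contract."""

    def __init__(self, typeof):
        # typeof: callable mapping a value to its type. Assumed to be
        # numba.typeof in real use; injected here so the sketch runs anywhere.
        self._typeof = typeof

    def prepare_args(self, ty, val, **kwargs):
        if isinstance(val, Boxed):
            # Handled case: return a new (type, value) pair that the
            # kernel launch machinery already knows how to pass.
            new_val = val.payload
            return self._typeof(new_val), new_val
        # Unhandled case: hand the argument back untouched.
        return ty, val


handler = UnboxHandler(typeof=type)
print(handler.prepare_args("int64", 3))           # passed through unchanged
print(handler.prepare_args("boxed", Boxed(1.5)))  # unwrapped and retyped
```

The pass-through branch matters as much as the conversion: every argument of every launch goes through the handler, so anything it does not recognise has to come back unchanged.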

I realise this is all a bit complex and potentially confusing. If you run into trouble, post some executable code that runs up to the point where the kernel launch fails because the VBO pointer is not understood, and I should be able to help figure out the right direction (or possibly a solution).

(Also I’m thinking if we get a solution to this, it will make an excellently-motivated example to go in the docs :slightly_smiling_face: )

:slight_smile: Bait is taken. I’ll be super happy to contribute! Once I figure it out…

If I understand correctly: in my case the extension handler’s prepare_args() would take the pointer as int/uint64 + size + dtype and return a cuda.cudadrv.devicearray.DeviceNDArray that wraps the VBO array?

I think that’s the best way to start - you might want to create a little VertexBufferObject class that holds the pointer, size, and dtype, and pass instances of it to the kernel, so that your extension can do something like isinstance(val, VertexBufferObject) to know whether to convert the value to a Numba device array - otherwise just passing the raw pointer could be difficult to distinguish from any other integer.
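One possible shape for such a class, as a sketch only. The field names, the plain-string dtype and the explicit itemsize are assumptions made so the snippet has no dependencies; a real version would more likely store a NumPy dtype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VertexBufferObject:
    """Hypothetical holder for a mapped VBO: device pointer, byte size, dtype."""

    ptr: int             # raw device pointer as an integer
    nbytes: int          # size of the mapped buffer in bytes
    dtype: str = "float32"
    itemsize: int = 4    # bytes per element; must agree with dtype

    @property
    def count(self):
        """Number of elements in the buffer."""
        if self.nbytes % self.itemsize:
            raise ValueError("buffer size is not a multiple of itemsize")
        return self.nbytes // self.itemsize


def needs_conversion(val):
    """The test an extension's prepare_args could apply before converting."""
    return isinstance(val, VertexBufferObject)


# 60 bytes of float32 is 15 elements, like the triangle data in this thread.
vbo_arg = VertexBufferObject(ptr=0x7F0000000000, nbytes=60)
print(vbo_arg.count, needs_conversion(vbo_arg), needs_conversion(vbo_arg.ptr))
# prints: 15 True False
```

The last value shows the point of the wrapper: the bare pointer is just an int, so nothing marks it as a VBO.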

The code below does what is intended, using code from graphics-api-interop

I’ll proceed with extension improvement…

I have a small problem running the code in a VS Code Jupyter notebook and as a file from VS Code… I’ll describe it in a follow-up post, after I reduce it to a minimal reproducible example.

Runs with no errors if launched as:
python example.py

# example.py
import ctypes
import numpy as np
from PIL import Image
from cuda import cudart
from cuda.cudart import cudaGraphicsRegisterFlags as GLFlags
import numba.cuda as nbcuda
import moderngl as mgl

#####################################################
# add 0.2 to the value
@nbcuda.jit
def shift_02(x):
    idx = nbcuda.grid(1)
    x[idx] = x[idx] + 0.2

###################################################
# wrap the raw pointer into DeviceNDArray
def gl_NDArray(ptr, shape, dtype=np.float32, strides=None, order="C"):
    shape, strides, dtype = nbcuda.api.prepare_shape_strides_dtype(
        shape, strides, dtype, order
    )
    datasize = nbcuda.driver.memory_size_from_info(shape, strides, dtype.itemsize)

    ctx = nbcuda.current_context()
    c_ptr = ctypes.c_uint64(ptr)
    mem = nbcuda.driver.MemoryPointer(ctx, c_ptr, datasize)
    return nbcuda.cudadrv.devicearray.DeviceNDArray(shape, strides, dtype, gpu_data=mem)

#####################################
# save fbo image into file
def save_image(fbo, name):
    data = fbo.read(components=3)
    image = Image.frombytes("RGB", fbo.size, data)
    image.save(name)


##################################################################
# these shaders do nothing but pass the data along the pipeline

vert_code = """
#version 330

in vec2 in_vert;
in vec3 in_color;

out vec3 color;

void main() {
    gl_Position = vec4(in_vert, 0.0, 1.0);
    color = in_color;
}
"""

frag_code = """
#version 330

in vec3 color;

out vec4 fragColor;

void main(){
    fragColor = vec4(color, 1.0);
}
"""

ctx = mgl.create_context(standalone=True)
prog = ctx.program(vertex_shader=vert_code, fragment_shader=frag_code)


# vertices for triangle with Red Green and Blue colors
vertices = np.array(
    # X     Y                   x     y                   x    y
    [-0.6, -0.6, 1.0, 0.0, 0.0, 0.6, -0.6, 0.0, 1.0, 0.0, 0.0, 0.6, 0.0, 0.0, 1.0],
    dtype="f4",
)

# init VBO, VAO and FBO
vbo = ctx.buffer(vertices)
vao = ctx.simple_vertex_array(prog, vbo, "in_vert", "in_color")
fbo = ctx.framebuffer(color_attachments=[ctx.texture((512, 512), 4)])
fbo.use()

# Get CUDA pointer to VBO
err, gl_resource = cudart.cudaGraphicsGLRegisterBuffer(
    vbo.glo, GLFlags.cudaGraphicsRegisterFlagsNone 
)
(err,) = cudart.cudaGraphicsMapResources(1, gl_resource, 0)

# get raw pointer to device array
(err, dev_ptr, dev_ptr_size) = cudart.cudaGraphicsResourceGetMappedPointer(gl_resource)
# wrap into DeviceNDArray
vbo_arr = gl_NDArray(dev_ptr, vertices.size)

# rendering

# 1st render as is
ctx.clear()
vao.render(mgl.TRIANGLES)
save_image(fbo, "image1.jpg")

# 2nd render after numba kernel call
ctx.clear()
shift_02[1,2](vbo_arr) # shift coordinates (x,y) of first point
nbcuda.synchronize()
vao.render(mgl.TRIANGLES)
save_image(fbo, "image2.jpg")

# CUDA cleanup
(err,) = cudart.cudaGraphicsUnmapResources(1, gl_resource, 0)
(err,) = cudart.cudaGraphicsUnregisterResource(gl_resource)
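The example above unpacks err from every cudart call but never inspects it. A small helper along these lines could do the check. It is a sketch, with check being a hypothetical name, and it assumes only what the calls above already show (the error code comes first in the returned tuple) plus the fact that cudaSuccess is 0.

```python
def check(result):
    """Raise if the leading error code of a cuda-python result tuple is nonzero.

    Returns None, the single remaining value, or a tuple of the remaining
    values, depending on how many values follow the error code.
    """
    err, *values = result
    if int(err) != 0:  # 0 is cudaSuccess
        raise RuntimeError(f"CUDA runtime call failed with {err!r}")
    if not values:
        return None
    return values[0] if len(values) == 1 else tuple(values)


# Intended use with the calls from the example (not executed here):
#   gl_resource = check(cudart.cudaGraphicsGLRegisterBuffer(vbo.glo, flags))
#   check(cudart.cudaGraphicsMapResources(1, gl_resource, 0))
#   dev_ptr, size = check(cudart.cudaGraphicsResourceGetMappedPointer(gl_resource))

# Stand-in tuples so the helper can be exercised without a GPU:
print(check((0, 1234, 60)))
try:
    check((1,))
except RuntimeError as exc:
    print(exc)
```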

The error mentioned above is reported here
Error using if numba cuda nvidia bindings are used

It had nothing to do with VSCode.

OK almost there but hitting a problem.

It appears that the type checking in numba.cuda.CUDADispatcher.call() kicks in at the line kernel = _dispatcher.Dispatcher._cuda_call(self, *args) and rejects anything that is not a recognized Numba type. That happens before the custom extension handler has a chance to modify the type, which only starts later, at the line kernel.launch(args, griddim, blockdim, stream, sharedmem).

Does this mean that in kernel[1,1](var) var has to be of type recognizable by Numba to start with?

I’m probably missing something…
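The ordering described above can be pictured with a toy model. This is not Numba’s code: toy_typeof, toy_call and the Fake* classes are made up for illustration, and the model assumes the reading above is right (all arguments are typed first, handlers run only afterwards).

```python
class FakeVBO:
    """Stand-in for an object the type system has no type for."""


class FakeVBOHandler:
    """Handler that would convert a FakeVBO, and counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def prepare_args(self, ty, val, **kwargs):
        self.calls += 1
        if isinstance(val, FakeVBO):
            return "array", [0.0, 0.0]
        return ty, val


def toy_typeof(val):
    """Toy type inference: only numbers and lists are recognised."""
    if isinstance(val, (int, float)):
        return "scalar"
    if isinstance(val, list):
        return "array"
    raise TypeError(f"cannot determine type of {type(val).__name__}")


def toy_call(extensions, *args):
    # Step 1: every argument is typed up front (kernel selection/compilation).
    tys = [toy_typeof(a) for a in args]
    # Step 2: only now do the extension handlers see the arguments.
    prepared = []
    for ty, val in zip(tys, args):
        for ext in extensions:
            ty, val = ext.prepare_args(ty, val)
        prepared.append((ty, val))
    return prepared


handler = FakeVBOHandler()
try:
    toy_call([handler], FakeVBO())
except TypeError as exc:
    print("rejected in step 1:", exc, "| handler calls:", handler.calls)

print(toy_call([handler], [1.0, 2.0]), "| handler calls:", handler.calls)
```

In this model the unrecognised object never reaches the handler, while an already-recognised argument does.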

“Extension-less” solution in the post above does the job for now :slight_smile:

Thanks Graham @gmarkall!

import ctypes
import numpy as np
from PIL import Image
from cuda import cudart
from cuda.cudart import cudaGraphicsRegisterFlags as GLFlags
from numba import typeof
import numba.cuda as nbcuda
import moderngl as mgl


class ModernGL_VBO_Handler:
    def __init__(self, dtype) -> None:
        self.dtype = np.dtype(dtype)

    # wrap the raw pointer into DeviceNDArray
    def _ptr_to_DeviceNDArray(self, ptr, shape, strides=None, order="C"):
        dtype = self.dtype
        shape, strides, dtype = nbcuda.api.prepare_shape_strides_dtype(
            shape, strides, dtype, order
        )
        datasize = nbcuda.driver.memory_size_from_info(shape, strides, dtype.itemsize)

        ctx = nbcuda.current_context()
        c_ptr = ctypes.c_uint64(ptr)
        mem = nbcuda.driver.MemoryPointer(ctx, c_ptr, datasize)
        return nbcuda.cudadrv.devicearray.DeviceNDArray(
            shape, strides, dtype, gpu_data=mem
        )

    # register and map the VBO, then wrap the mapped pointer
    def _vbo_to_DeviceNDArray(self, vbo):
        err, gl_resource = cudart.cudaGraphicsGLRegisterBuffer(
            vbo.glo, GLFlags.cudaGraphicsRegisterFlagsNone
        )
        (err,) = cudart.cudaGraphicsMapResources(1, gl_resource, 0)
        (err, dev_ptr, dev_buff_size) = cudart.cudaGraphicsResourceGetMappedPointer(
            gl_resource
        )
        # dev_buff_size is in bytes; the array shape needs an element count
        return self._ptr_to_DeviceNDArray(dev_ptr, dev_buff_size // self.dtype.itemsize)

    def prepare_args(self, ty, val, **kwargs):
        print(__name__, ty) # THIS NEVER RUNS
        if isinstance(val, mgl.Buffer):
            val = self._vbo_to_DeviceNDArray(val)
            ty = typeof(val)  # Numba type of the device array, not the Python class
            return ty, val
        else:
            return ty, val


@nbcuda.jit(extensions=[ModernGL_VBO_Handler(np.float32)])
def shift_02(x):
    idx = nbcuda.grid(1)
    x[idx] = x[idx] + 0.2


#####################################
# save fbo image into file
def save_image(fbo, name):
    data = fbo.read(components=3)
    image = Image.frombytes("RGB", fbo.size, data)
    image.save(name)


##################################################################
# these shaders do nothing but pass the data along the pipeline

vert_code = """
#version 330

in vec2 in_vert;
in vec3 in_color;

out vec3 color;

void main() {
    gl_Position = vec4(in_vert, 0.0, 1.0);
    color = in_color;
}
"""

frag_code = """
#version 330

in vec3 color;

out vec4 fragColor;

void main(){
    fragColor = vec4(color, 1.0);
}
"""

ctx = mgl.create_context(standalone=True)
prog = ctx.program(vertex_shader=vert_code, fragment_shader=frag_code)


# vertices for triangle with Red Green and Blue colors
vertices = np.array(
    # X     Y                   x     y                   x    y
    [-0.6, -0.6, 1.0, 0.0, 0.0, 0.6, -0.6, 0.0, 1.0, 0.0, 0.0, 0.6, 0.0, 0.0, 1.0],
    dtype="f4",
)

# init VBO, VAO and FBO
vbo = ctx.buffer(vertices)
vao = ctx.simple_vertex_array(prog, vbo, "in_vert", "in_color")
fbo = ctx.framebuffer(color_attachments=[ctx.texture((512, 512), 4)])
fbo.use()

# rendering

# 1st render as is
ctx.clear()
vao.render(mgl.TRIANGLES)
save_image(fbo, "image1.jpg")

# 2nd render using numba kernel
ctx.clear()

some_arr = np.array([1,2,3])

######################################################
# DEBUG: replacing the vbo with `some_arr` gets to the extension handler... (unnecessarily)

shift_02[1,2](vbo) # shift coordinates (x,y) of first point

######################################################
nbcuda.synchronize()
vao.render(mgl.TRIANGLES)
save_image(fbo, "image2.jpg")