CUDA - OpenGL interop

biosckon · March 29, 2023, 4:14pm

Hi!

I’m exploring CUDA - OpenGL interop examples.

E.g. cuda-samples/Samples/5_Domain_Specific/simpleGL

Using moderngl for OpenGL bits.

I’ve used cuda-python to get a raw C pointer to the VBO (“Vertex Buffer Object”).

How do I pass the raw C pointer to a Numba kernel?

I imagine it has to be wrapped in a DeviceNDArray somehow…?

Thanks!

Leaving only relevant CUDA bits:

import numba.cuda as nbcuda
from cuda import cudart
from cuda.cudart import cudaGraphicsRegisterFlags as GLFlags

#...
# bunch of OpenGL prep code for VBO.
#...

err, gl_resource = cudart.cudaGraphicsGLRegisterBuffer(
    vbo.glo, GLFlags.cudaGraphicsRegisterFlagsWriteDiscard
)

(err,) = cudart.cudaGraphicsMapResources(1, gl_resource, 0)

(err, dev_ptr, dev_ptr_size) = cudart.cudaGraphicsResourceGetMappedPointer(gl_resource)

# no errors up to this point and I get
# raw C pointer dev_ptr to the VBO data ...

# run kernel
@nbcuda.jit
def kern(x):
    idx = nbcuda.grid(1)
    x[idx] = x[idx] + 0.2

# What needs to be done with dev_ptr so that the kern call accepts it?
# This fails, as expected ...
kern.forall(vertices.size)(dev_ptr)

# ...
# bunch of clean up code

biosckon · March 29, 2023, 4:41pm

Found relevant post graphics-api-interop

It describes D3D but I think I can wrangle it into OpenGL example…

gmarkall · March 29, 2023, 8:09pm

I added support for passing raw pointers in CUDA-jitted kernels to Numba (in main currently, will be in 0.57): Calling foreign functions from Python kernels — Numba 0.57.0dev0+1629.g987879043.dirty documentation - however the main aim there was to be able to pass pointers to data to CUDA C / C++ functions that might be expecting them, rather than Numba’s array struct.

However, it should be possible to write an argument-handling extension that takes your raw pointer to the VBO and passes it to the kernel as a uint64 - the only problem is the extension handler API is undocumented (and maybe a bit fiddly). There’s an example of using it for curand:

In the jit function, the “registration” of the handler:

github.com/numba/numba-examples/blob/c423f5419a459f5ab8874fda6d39bb5ea05d04b2/examples/cuda/curand/example.py#L54

@cuda.jit(link=['shim.cu'], extensions=[curand_state_arg_handler])
def setup(states):
    i = cuda.grid(1)
    curand_init(1234, i, 0, states, i)


# Random sampling kernel - computes the fraction of numbers with low bits set
# from a random distribution.
@cuda.jit(link=['shim.cu'], extensions=[curand_state_arg_handler])
def count_low_bits_native(states, sample_count, results):
    i = cuda.grid(1)
    count = 0

    # Copy state to local memory
    # XXX: TBC

    # Generate pseudo-random numbers
    for sample in range(sample_count):
        x = curand(states, i)

The handler implementation itself:

github.com/numba/numba-examples/blob/c423f5419a459f5ab8874fda6d39bb5ea05d04b2/examples/cuda/curand/numba_curand.py#L104-L118

# Argument handling. When a CurandStatePointer is passed into a kernel, we
# really only need to pass the pointer to the data, not the whole underlying
# array structure. Our handler here transforms these arguments into a uint64
# holding the pointer.
class CurandStateArgHandler:
    def prepare_args(self, ty, val, **kwargs):
        if isinstance(val, CurandStates):
            assert ty == curand_state_pointer
            return types.uint64, val.data
        else:
            return ty, val


curand_state_arg_handler = CurandStateArgHandler()

Argument handlers are classes with a prepare_args method that accept the type of an argument and its value, returning a transformed type and value. If the extension doesn’t handle the given type, it should just return the type and value unchanged. Otherwise, it can transform the argument in whatever way is needed and return the new argument type and value, which should be a type already handled by Numba kernels (e.g. a scalar type, an array, tuple, etc.).
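
To make the contract above concrete, here is a minimal sketch that runs without Numba or a GPU. Everything in it is hypothetical: `RawPointer`, `RawPointerArgHandler`, `apply_handlers`, and the string type tags are stand-ins (in a real handler the returned type would be a Numba type such as `types.uint64`), and the loop is only a toy model of how a launch might offer each argument to each extension, not Numba's actual launch code.

```python
# Stand-ins so the sketch runs anywhere; real code would use Numba types.
UINT64 = "uint64"          # placeholder for numba.types.uint64
OPAQUE = "opaque-handle"   # placeholder for whatever type the wrapper has


class RawPointer:
    """Hypothetical wrapper that marks an integer as a device address."""

    def __init__(self, address):
        self.address = address


class RawPointerArgHandler:
    def prepare_args(self, ty, val, **kwargs):
        if isinstance(val, RawPointer):
            # Handled: replace the wrapper with a plain integer argument.
            return UINT64, val.address
        # Not ours: hand the type and value back unchanged.
        return ty, val


def apply_handlers(handlers, ty, val):
    """Toy model of a launch: offer the argument to each handler in turn."""
    for handler in handlers:
        ty, val = handler.prepare_args(ty, val)
    return ty, val


handlers = [RawPointerArgHandler()]
converted = apply_handlers(handlers, OPAQUE, RawPointer(0x7F00DEAD0000))
untouched = apply_handlers(handlers, "float32", 1.5)
```

The pass-through branch is what lets several handlers coexist on one kernel: each one only touches the values it recognizes.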

I realise this is all a bit complex and potentially confusing - if you run into trouble and can post some executable code that runs up until the kernel launch fails because the VBO pointer is not understood, then I should be able to help figure out the right direction (or possibly figure out a solution).

gmarkall · March 29, 2023, 8:10pm

(Also I’m thinking if we get a solution to this, it will make an excellently-motivated example to go in the docs)

biosckon · March 29, 2023, 10:33pm

Bait taken. I’ll be super happy to contribute! Once I figure it out…

biosckon · March 29, 2023, 11:43pm

If I understand correctly: in my case the extension handler’s prepare_args() would take the pointer as int/uint64 + size + dtype and return a cuda.cudadrv.devicearray.DeviceNDArray that wraps the VBO array?

gmarkall · March 30, 2023, 10:24am

I think that’s the best way to start - you might want to create a little VertexBufferObject class that holds the pointer, size, and dtype, and pass instances of it to the kernel, so that your extension can do something like isinstance(val, VertexBufferObject) to know whether to convert the value to a Numba device array - otherwise just passing the raw pointer could be difficult to distinguish from any other integer.
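
A rough sketch of that idea, kept free of Numba and OpenGL imports so it runs anywhere. The field names on `VertexBufferObject`, the `VBOArgHandler` class, and the `make_device_array` callback are all assumptions made for illustration; in real use the callback would build a Numba device array from the pointer and return it together with its Numba type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VertexBufferObject:
    """Hypothetical wrapper: what is needed to view a mapped VBO as an array."""
    ptr: int     # device address of the mapped buffer
    size: int    # number of elements
    dtype: str   # element type name, e.g. "float32"


class VBOArgHandler:
    def __init__(self, make_device_array):
        # Caller-supplied: (ptr, size, dtype) -> (type, device_array).
        self._make_device_array = make_device_array

    def prepare_args(self, ty, val, **kwargs):
        if isinstance(val, VertexBufferObject):
            return self._make_device_array(val.ptr, val.size, val.dtype)
        # A bare integer is left alone, so it is never mistaken for a pointer.
        return ty, val


def fake_device_array(ptr, size, dtype):
    """Stand-in for real device-array construction, for demonstration only."""
    return f"array({dtype}, 1d, C)", ("device-array", ptr, size)


handler = VBOArgHandler(fake_device_array)
vbo_result = handler.prepare_args("vbo", VertexBufferObject(0x1000, 15, "float32"))
int_result = handler.prepare_args("int64", 0x1000)
```

Note that the same integer `0x1000` is converted when wrapped and passed through when bare, which is the point of the wrapper class.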

biosckon · March 30, 2023, 10:55am

The code below does what is intended, using code from the Graphics API interop post.

I’ll proceed with extension improvement…

I have a small problem running the code in vscode jupyter notebook and as a file from vscode… I’ll describe it in the follow up post, after I reduce it to the minimum reproducible.

Runs with no errors if launched as:
python example.py

# example.py
import ctypes
import numpy as np
from PIL import Image
from cuda import cudart
from cuda.cudart import cudaGraphicsRegisterFlags as GLFlags
import numba.cuda as nbcuda
import moderngl as mgl

#####################################################
# add 0.2 to the value
@nbcuda.jit
def shift_02(x):
    idx = nbcuda.grid(1)
    x[idx] = x[idx] + 0.2

###################################################
# wrap the raw pointer into DeviceNDArray
def gl_NDArray(ptr, shape, dtype=np.float32, strides=None, order="C"):
    shape, strides, dtype = nbcuda.api.prepare_shape_strides_dtype(
        shape, strides, dtype, order
    )
    datasize = nbcuda.driver.memory_size_from_info(shape, strides, dtype.itemsize)

    ctx = nbcuda.current_context()
    c_ptr = ctypes.c_uint64(ptr)
    mem = nbcuda.driver.MemoryPointer(ctx, c_ptr, datasize)
    return nbcuda.cudadrv.devicearray.DeviceNDArray(shape, strides, dtype, gpu_data=mem)

#####################################
# save fbo image into file
def save_image(fbo, name):
    data = fbo.read(components=3)
    image = Image.frombytes("RGB", fbo.size, data)
    image.save(name)


##################################################################
# these shaders do nothing except pass the data along the pipeline

vert_code = """
#version 330

in vec2 in_vert;
in vec3 in_color;

out vec3 color;

void main() {
    gl_Position = vec4(in_vert, 0.0, 1.0);
    color = in_color;
}
"""

frag_code = """
#version 330

in vec3 color;

out vec4 fragColor;

void main(){
    fragColor = vec4(color, 1.0);
}
"""

ctx = mgl.create_context(standalone=True)
prog = ctx.program(vertex_shader=vert_code, fragment_shader=frag_code)


# triangle vertices, interleaved as x, y, r, g, b (red, green and blue corners)
vertices = np.array(
    [
        -0.6, -0.6, 1.0, 0.0, 0.0,  # vertex 0: red
        0.6, -0.6, 0.0, 1.0, 0.0,  # vertex 1: green
        0.0, 0.6, 0.0, 0.0, 1.0,  # vertex 2: blue
    ],
    dtype="f4",
)

# init VBO, VAO and FBO
vbo = ctx.buffer(vertices)
vao = ctx.simple_vertex_array(prog, vbo, "in_vert", "in_color")
fbo = ctx.framebuffer(color_attachments=[ctx.texture((512, 512), 4)])
fbo.use()

# Get CUDA pointer to VBO
err, gl_resource = cudart.cudaGraphicsGLRegisterBuffer(
    vbo.glo, GLFlags.cudaGraphicsRegisterFlagsNone 
)
(err,) = cudart.cudaGraphicsMapResources(1, gl_resource, 0)

# get raw pointer to device array
(err, dev_ptr, dev_ptr_size) = cudart.cudaGraphicsResourceGetMappedPointer(gl_resource)
# wrap into DeviceNDArray
vbo_arr = gl_NDArray(dev_ptr, vertices.size)

# rendering

# 1st render as is
ctx.clear()
vao.render(mgl.TRIANGLES)
save_image(fbo, "image1.jpg")

# 2nd render after numba kernel call
ctx.clear()
shift_02[1,2](vbo_arr) # shift coordinates (x,y) of first point
nbcuda.synchronize()
vao.render(mgl.TRIANGLES)
save_image(fbo, "image2.jpg")

# CUDA cleanup
(err,) = cudart.cudaGraphicsUnmapResources(1, gl_resource, 0)
(err,) = cudart.cudaGraphicsUnregisterResource(gl_resource)

biosckon · March 30, 2023, 2:27pm

The error mentioned above is reported here
Error using if numba cuda nvidia bindings are used

It had nothing to do with VSCode.

biosckon · March 30, 2023, 10:10pm

OK almost there but hitting a problem.

It appears that type checking in numba.cuda.CUDADispatcher.call() kicks in at the line kernel = _dispatcher.Dispatcher._cuda_call(self, *args) and rejects anything that is not a recognized Numba type. This happens before the custom extension handler has a chance to modify the type, which only starts later, at the line kernel.launch(args, griddim, blockdim, stream, sharedmem).

Does this mean that in kernel[1,1](var) var has to be of type recognizable by Numba to start with?
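
To spell out the ordering I think I’m seeing, here is a toy model. It is not Numba’s code: `toy_typeof`, `toy_launch`, `CountingHandler`, and the table of known types are all made up, and the only thing the sketch claims to show is the sequence "type every argument first, run the handlers second".

```python
KNOWN_TYPES = {int: "int64", float: "float64"}   # made-up type table


def toy_typeof(val):
    """Stand-in for the dispatcher's type check: unknown classes are rejected."""
    try:
        return KNOWN_TYPES[type(val)]
    except KeyError:
        raise TypeError(f"cannot determine type of {type(val).__name__}")


class CountingHandler:
    """Pass-through handler that records how often it is consulted."""

    def __init__(self):
        self.calls = 0

    def prepare_args(self, ty, val, **kwargs):
        self.calls += 1
        return ty, val


def toy_launch(args, handlers):
    # Phase 1: every argument must be typeable before anything else happens.
    tys = [toy_typeof(a) for a in args]
    # Phase 2: only now do the handlers get to see (type, value) pairs.
    prepared = []
    for ty, val in zip(tys, args):
        for handler in handlers:
            ty, val = handler.prepare_args(ty, val)
        prepared.append((ty, val))
    return prepared


class Buffer:
    """Stand-in for an object the type check does not know about."""
```

In this model a float reaches the handler but a `Buffer` raises in phase 1, so the handler is never called for it. The quoted curand handler asserts `ty == curand_state_pointer`, which suggests its wrapper class already has a Numba type by the time the handler sees it.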

I’m probably missing something…

The “extension-less” solution in the post above does the job for now.

Thanks Graham @gmarkall!

import ctypes
import numpy as np
from PIL import Image
from cuda import cudart
from cuda.cudart import cudaGraphicsRegisterFlags as GLFlags
from numba import typeof
import numba.cuda as nbcuda
import moderngl as mgl


class ModernGL_VBO_Handler:
    def __init__(self, dtype) -> None:
        self.dtype = np.dtype(dtype)

    # wrap a raw device pointer into a DeviceNDArray
    def _ptr_to_DeviceNDArray(self, ptr, shape, strides=None, order="C"):
        shape, strides, dtype = nbcuda.api.prepare_shape_strides_dtype(
            shape, strides, self.dtype, order
        )
        datasize = nbcuda.driver.memory_size_from_info(shape, strides, dtype.itemsize)

        ctx = nbcuda.current_context()
        c_ptr = ctypes.c_uint64(ptr)
        mem = nbcuda.driver.MemoryPointer(ctx, c_ptr, datasize)
        return nbcuda.cudadrv.devicearray.DeviceNDArray(
            shape, strides, dtype, gpu_data=mem
        )

    # register and map the VBO, then wrap the mapped pointer
    def _vbo_to_DeviceNDArray(self, vbo):
        err, gl_resource = cudart.cudaGraphicsGLRegisterBuffer(
            vbo.glo, GLFlags.cudaGraphicsRegisterFlagsNone
        )
        (err,) = cudart.cudaGraphicsMapResources(1, gl_resource, 0)
        (err, dev_ptr, dev_buff_size) = cudart.cudaGraphicsResourceGetMappedPointer(
            gl_resource
        )
        # the mapped size is in bytes; the array shape is in elements
        return self._ptr_to_DeviceNDArray(dev_ptr, dev_buff_size // self.dtype.itemsize)

    def prepare_args(self, ty, val, **kwargs):
        print(__name__, ty) # THIS NEVER RUNS
        if isinstance(val, mgl.Buffer):
            val = self._vbo_to_DeviceNDArray(val)
            ty = typeof(val)  # Numba type of the device array, not its Python class
        return ty, val


@nbcuda.jit(extensions=[ModernGL_VBO_Handler(np.float32)])
def shift_02(x):
    idx = nbcuda.grid(1)
    x[idx] = x[idx] + 0.2


#####################################
# save fbo image into file
def save_image(fbo, name):
    data = fbo.read(components=3)
    image = Image.frombytes("RGB", fbo.size, data)
    image.save(name)


##################################################################
# these shaders do nothing except pass the data along the pipeline

vert_code = """
#version 330

in vec2 in_vert;
in vec3 in_color;

out vec3 color;

void main() {
    gl_Position = vec4(in_vert, 0.0, 1.0);
    color = in_color;
}
"""

frag_code = """
#version 330

in vec3 color;

out vec4 fragColor;

void main(){
    fragColor = vec4(color, 1.0);
}
"""

ctx = mgl.create_context(standalone=True)
prog = ctx.program(vertex_shader=vert_code, fragment_shader=frag_code)


# triangle vertices, interleaved as x, y, r, g, b (red, green and blue corners)
vertices = np.array(
    [
        -0.6, -0.6, 1.0, 0.0, 0.0,  # vertex 0: red
        0.6, -0.6, 0.0, 1.0, 0.0,  # vertex 1: green
        0.0, 0.6, 0.0, 0.0, 1.0,  # vertex 2: blue
    ],
    dtype="f4",
)

# init VBO, VAO and FBO
vbo = ctx.buffer(vertices)
vao = ctx.simple_vertex_array(prog, vbo, "in_vert", "in_color")
fbo = ctx.framebuffer(color_attachments=[ctx.texture((512, 512), 4)])
fbo.use()

# rendering

# 1st render as is
ctx.clear()
vao.render(mgl.TRIANGLES)
save_image(fbo, "image1.jpg")

# 2nd render using numba kernel
ctx.clear()

some_arr = np.array([1,2,3])

######################################################
# DEBUG: replacing the vbo with `some_arr` gets to the extension handler... (unnecessarily)

shift_02[1,2](vbo) # shift coordinates (x,y) of first point

######################################################
nbcuda.synchronize()
vao.render(mgl.TRIANGLES)
save_image(fbo, "image2.jpg")
