Using Structref in CUDA

Hi!

I’ve written a ray tracing code using Numba with structrefs to hold a number of variables and methods (e.g. object intersections for different shapes, and material properties). It’s an embarrassingly parallel code (and works fine with OpenMP and MPI), but I want to try running it on a GPU. I’ve done some testing and confirmed I can run kernel and device functions on my GPU, but now I’m having difficulties using structrefs with CUDA. Maybe the answer is that you can’t, but I wanted to ask whether anyone has managed to use them with CUDA before?

Here is the code I’m using to test things:

from numba import cuda
from numba import njit
from numba.experimental import structref
from numba.core import types
import numpy as np

use_cuda = True

if use_cuda:
    JIT = cuda.jit
    DEV_JIT = cuda.jit(device=True)
else:
    JIT = njit
    DEV_JIT = njit

# Create a basic object here
@structref.register
class PhotonType(types.StructRef):
    def preprocess_fields(self, fields):
        return tuple((name, types.unliteral(typ)) for name, typ in fields)


# Now the big class definition
class Photon(structref.StructRefProxy):
    def __new__(cls, ray):
        return structref.StructRefProxy.__new__(cls, ray) 

    @property
    def ray(self):
        return _ray(self)
    
    @ray.setter
    def ray(self, ray):
        return set_ray(self, ray)

@JIT
def _ray(self):
    return self.ray

@JIT
def set_ray(self, ray):
    for ii in range(3):
        self.ray[ii] = ray[ii]


structref.define_proxy(Photon, PhotonType, ["ray"])



@DEV_JIT
def create_Photon(ray):
    return Photon(ray)



dtype = np.float32
Nphoton = 10_000
rays = np.arange(0, 3*Nphoton, dtype=dtype)

if not use_cuda:
    @JIT
    def test_thing(Nphoton, rays):
        for photon_idx in range(Nphoton):
            this_ray = rays[3*photon_idx:3*photon_idx+3]
            photon = create_Photon(this_ray)
        
    to_run = test_thing


else:

    @JIT
    def test_thing(Nphoton, rays):
        photon_idx = cuda.grid(1)

        if photon_idx < Nphoton:
            this_ray = rays[3*photon_idx:3*photon_idx+3]
            photon = create_Photon(this_ray)

    threadsperblock = 32
    blockspergrid = (Nphoton + (threadsperblock - 1)) // threadsperblock
    to_run = test_thing[blockspergrid, threadsperblock]


to_run(Nphoton, rays)

print("I ran!")

I added the setter to check whether that part was causing issues with creating the object. If you set use_cuda = False, the code runs fine. Swapping DEV_JIT for JIT on create_Photon gives the same error message (as far as I can tell).

This is the error message I get with use_cuda=True:

Traceback (most recent call last):
  File "/home/tg/raytrace/.dev_trials/gpu/numba_gpu.py", line 92, in <module>
    to_run(Nphoton, rays)
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/dispatcher.py", line 700, in __call__
    return self.dispatcher.call(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/dispatcher.py", line 1022, in call
    kernel = _dispatcher.Dispatcher._cuda_call(self, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/dispatcher.py", line 1030, in _compile_for_args
    return self.compile(tuple(argtypes))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba/core/compiler_lock.py", line 35, in _acquire_compile_lock
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/dispatcher.py", line 1296, in compile
    kernel = _Kernel(self.py_func, argtypes, **self.targetoptions)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba/core/compiler_lock.py", line 35, in _acquire_compile_lock
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/dispatcher.py", line 145, in __init__
    cres = compile_cuda(
           ^^^^^^^^^^^^^
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba/core/compiler_lock.py", line 35, in _acquire_compile_lock
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/compiler.py", line 749, in compile_cuda
    cres = compile_extra(
           ^^^^^^^^^^^^^^
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/compiler.py", line 590, in compile_extra
    return pipeline.compile_extra(func)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/core/compiler.py", line 125, in compile_extra
    return self._compile_bytecode()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/core/compiler.py", line 193, in _compile_bytecode
    return self._compile_core()
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/core/compiler.py", line 172, in _compile_core
    raise e
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/core/compiler.py", line 161, in _compile_core
    pm.run(self.state)
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/core/compiler_machinery.py", line 399, in run
    raise patched_exception
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/core/compiler_machinery.py", line 386, in run
    self._runPass(idx, pass_inst, state)
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba/core/compiler_lock.py", line 35, in _acquire_compile_lock
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/core/compiler_machinery.py", line 338, in _runPass
    mutated |= check(pss.run_pass, internal_state)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/core/compiler_machinery.py", line 292, in check
    mangled = func(compiler_state)
              ^^^^^^^^^^^^^^^^^^^^
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/core/typed_passes.py", line 145, in run_pass
    typemap, return_type, calltypes, errs = type_inference_stage(
                                            ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/core/typed_passes.py", line 124, in type_inference_stage
    errs = infer.propagate(raise_errors=raise_errors)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tg/raytrace/.venv/lib/python3.12/site-packages/numba_cuda/numba/cuda/core/typeinfer.py", line 1154, in propagate
    raise errors[0]
numba.core.errors.TypingError: Failed in cuda mode pipeline (step: nopython frontend)
Failed in cuda mode pipeline (step: nopython frontend)
No implementation of function Function(<class '__main__.Photon'>) found for signature:
 
 >>> Photon(array(float32, 1d, C))
 
There are 2 candidate implementations:
    - Of which 2 did not match due to:
    Overload in function 'ctor': File: ../../../../.dev_trials/gpu/<unknown> (built from string?): Line 0.
      With argument(s): '(array(float32, 1d, C))':
     Rejected as the implementation raised a specific error:
       NumbaRuntimeError: Failed in cuda mode pipeline (step: cuda native lowering)
     NRT required but not enabled
     During: lowering "st = call $6load_global.0($16load_deref.2, func=$6load_global.0, args=[Var($16load_deref.2, <string>:5)], kws=(), vararg=None, varkwarg=None, target=None)" at <string> (5)
     During: Pass cuda_native_lowering
  raised from /home/tg/raytrace/.venv/lib/python3.12/site-packages/numba/core/runtime/context.py:42

During: resolving callee type: Function(<class '__main__.Photon'>)
During: typing of call at /home/tg/raytrace/.dev_trials/gpu/numba_gpu.py (57)


File "numba_gpu.py", line 57:
def create_Photon(ray):
    return Photon(ray)
    ^

During: Pass nopython_type_inference
During: resolving callee type: type(CUDADispatcher(<function create_Photon at 0x7096601fc7c0>))
During: typing of call at /home/tg/raytrace/.dev_trials/gpu/numba_gpu.py (81)


File "numba_gpu.py", line 81:
    def test_thing(Nphoton, rays):
        <source elided>
            this_ray = rays[3*photon_idx:3*photon_idx+3]
            photon = create_Photon(this_ray)
            ^

During: Pass nopython_type_inference

I understand that `NRT required but not enabled` has to do with dynamic memory allocation, which isn’t possible with CUDA, so this might mean it’s simply not possible to do. Nevertheless, I’m hoping someone has experience doing this and can offer some advice.

Thank you very much in advance! :slight_smile:

I only looked at it very briefly, but I noticed that when you define your `create_Photon` jitted constructor you are implicitly relying on the structref constructor overloaded here. This is because `define_constructor` is invoked by `define_proxy`, and you have called the latter. Perhaps if you `cuda.jit` your own constructor using `new` instead, like here, it could work.

Hi Milton,

Thanks for the advice! It took me a while to look into it, since this side of things is very unfamiliar to me, and I must admit that in the end I couldn’t fully grasp what’s going on in those files.

In the end, I found that custom numpy dtypes do a good job of what I want, namely moving variables around cleanly, and they work in CUDA-jitted functions as well. To automate this a bit, I wrote a function that takes the contents of a class dictionary and converts it into a dtype (ignoring any callables), while the methods I need in the jitted functions are added to the main function by a factory function. This also saves writing out much of the explicit structref boilerplate (e.g. the functions and registering the variables).
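The dtype-building part looks roughly like this (a minimal sketch of the idea, not my actual code; `class_to_dtype` and the `Photon` fields here are illustrative):

```python
import numpy as np

def class_to_dtype(cls):
    """Build a numpy record dtype from a class's plain data attributes.

    Hypothetical helper sketching the approach described above: every
    non-callable, non-dunder attribute becomes a field; methods are skipped.
    """
    fields = []
    for name, value in vars(cls).items():
        if name.startswith("_") or callable(value):
            continue  # skip dunders and methods
        arr = np.asarray(value)
        # array-valued attributes become fixed-size sub-array fields
        fields.append((name, arr.dtype, arr.shape) if arr.shape
                      else (name, arr.dtype))
    return np.dtype(fields)

class Photon:
    ray = np.zeros(3, dtype=np.float32)   # 3-vector field
    weight = np.float32(1.0)              # scalar field

    def intersect(self, shape):           # ignored: callable
        pass

photon_dtype = class_to_dtype(Photon)
photons = np.zeros(4, dtype=photon_dtype)  # record array, usable as a kernel argument
print(photon_dtype.names)  # -> ('ray', 'weight')
```

A record array built this way can be passed straight into a `cuda.jit` kernel and indexed per thread, with fields accessed as `photons[i]["ray"]`.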

Actually, in some testing with regular Numba, I found the performance was substantially worse using structrefs than dtypes. In fact, structrefs caused my code to run at least twice as slow as my original version without them or dtypes, even though the code was structured the same in terms of the steps; the only difference was that some functions were swapped for methods attached to the structs.

Thanks again for the help, and sorry I couldn’t try the suggested solution.


Interesting! If you get a chance to distill and share it, I would also be interested to take a look at a (minimal) reproducer of the performance difference you observed between structref and dtype. (I suppose the latter doesn’t get wrapped in a meminfo, or something…)