Avoid tuple copying (structref?) in CUDA

We have a large kernel that takes complex data structures. (We are using cooperative groups, so we don’t need separate kernels for synchronization.) Currently the data structures are passed as nested NamedTuples. However, we have hit a "Formal Parameter Space Overflowed" error.

Researching this, it seems that NamedTuples are passed by value. Even if we stay within the limit (512 elements in total), copying 512 elements per call is hardly ideal, and we would like to avoid the extra complexity of obscuring our data structures.

Is there any way to avoid this? Especially as these are immutable, it seems very wasteful to copy the structures when a simple pointer would suffice.

We have tried using a custom object built with structref. However, even for trivial examples, we get an error when we invoke a kernel with our structref object as an argument, e.g.:

NRT required but not enabled
During: lowering "$18load_attr.3 = getattr(value=instance, attr=x)"

I suspect that structref is insisting on refcounting our object, and refcounting (NRT) is unavailable in CUDA. Can we simply get a pointer and avoid refcounting? Could we … create a C structure through ctypes and pass it to the kernel, perhaps?

Grateful for suggestions!

For anyone coming across this, my solution has been to convert the NamedTuples into record arrays. (We have a simple automatic utility for this.) They seem to be passed by reference. The issue we have now is that arrays which were embedded in the “context” object are transferred “all at once” whereas before we could create individual device arrays and control which ones we wanted to send back. However, that is much easier to work around.

Can you post the code for the conversion utility, or at least the key bits? That sounds like it could be generally useful.

to_record() is the base entrypoint. named_tuples_to_records is perhaps specific to our use case.
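
For readers who just want the general idea, here is a minimal sketch of converting a nested NamedTuple into a one-element NumPy structured (record) array. This is not the utility mentioned above: `namedtuple_to_record_array`, `Inner` and `Ctx` are hypothetical names made up for illustration, and the sketch assumes every leaf is either a scalar or a fixed-size NumPy array.

```python
from typing import NamedTuple

import numpy as np


def _is_namedtuple(value):
    # NamedTuple instances are tuples that carry a _fields attribute.
    return isinstance(value, tuple) and hasattr(value, "_fields")


def _dtype_of(nt):
    """Build a structured dtype that mirrors the NamedTuple's layout."""
    fields = []
    for name in nt._fields:
        value = getattr(nt, name)
        if _is_namedtuple(value):
            # Nested NamedTuple -> nested structured dtype.
            fields.append((name, _dtype_of(value)))
            continue
        arr = np.asarray(value)
        if arr.shape == ():
            fields.append((name, arr.dtype))             # scalar leaf
        else:
            fields.append((name, arr.dtype, arr.shape))  # embedded array
    return np.dtype(fields)


def _fill(rec, nt):
    """Copy values from the NamedTuple into a 1-element structured array."""
    for name in nt._fields:
        value = getattr(nt, name)
        if _is_namedtuple(value):
            _fill(rec[name], value)   # rec[name] is a view, so writes stick
        else:
            rec[name][0] = value


def namedtuple_to_record_array(nt):
    """Hypothetical helper: nested NamedTuple -> structured array of shape (1,)."""
    rec = np.zeros(1, dtype=_dtype_of(nt))
    _fill(rec, nt)
    return rec


# Illustrative data only; these types are assumptions, not from the thread.
class Inner(NamedTuple):
    a: float
    b: int


class Ctx(NamedTuple):
    n: int
    inner: Inner
    weights: np.ndarray


ctx = Ctx(3, Inner(1.5, 2), np.arange(4, dtype=np.float32))
rec = namedtuple_to_record_array(ctx)
```

The resulting array should be transferable with `cuda.to_device(rec)` and readable inside a kernel as `rec[0].n` or `rec[0].inner.a`, though I have not verified that against every Numba version. Note that embedded arrays such as `weights` become part of the record itself, which is why they travel "all at once" with the rest of the context.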