Avoid tuple copying (structref?) in CUDA

shaunc · August 1, 2023, 4:43pm

We have a large kernel that takes complex data structures. (We are using cooperative groups, so we don't need separate kernels for synchronization.) Currently the data structures are passed as nested NamedTuples. However, we have hit a "Formal Parameter Space Overflowed" error.

Researching this, it seems that NamedTuples are passed by value. Even if we keep within the bounds (512 elements total), copying 512 elements per call is hardly ideal, and we would like to avoid adding complexity by obscuring our data structures.

Is there any way to avoid this? Especially as these are immutable, it seems very wasteful to copy the structures when a simple pointer would suffice.

We have tried using a custom object built with structref.

However, even for trivial examples, we get an error when we invoke a kernel with our structref object as an argument, e.g.:

NRT required but not enabled
During: lowering "$18load_attr.3 = getattr(value=instance, attr=x)"

I suspect that structref is insisting on refcounting our object, which is unavailable in CUDA. Can we simply get a pointer and avoid refcounting? Could we perhaps create a C structure through ctypes and pass it to the kernel?

Grateful for suggestions!

shaunc · August 2, 2023, 3:30pm

For anyone coming across this, my solution has been to convert the NamedTuples into record arrays. (We have a simple automatic utility for this.) They seem to be passed by reference. The issue we have now is that arrays that were embedded in the "context" object are transferred all at once, whereas before we could create individual device arrays and control which ones we wanted to send back. However, that is much easier to work around.
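To illustrate the record-array approach, here is a minimal sketch (not the code from this thread). The field names `scale`, `offset`, and `n` and the kernel `apply` are made up for the example. It packs the parameters into a one-element NumPy structured array, and, if Numba and a CUDA device happen to be available, copies it to the device and reads the fields inside a kernel. Since the kernel receives a device array rather than a tuple, only the array descriptor should count toward the kernel's parameter space.

```python
import numpy as np

# Hypothetical context layout; the field names are assumptions for this sketch.
ctx_dtype = np.dtype([("scale", np.float64),
                      ("offset", np.float64),
                      ("n", np.int64)])

# A one-element structured array plays the role of the former NamedTuple.
ctx = np.zeros(1, dtype=ctx_dtype)
ctx[0] = (2.0, 1.0, 4)

# The CUDA part only runs when Numba and a GPU are present.
try:
    from numba import cuda
    have_gpu = cuda.is_available()
except ImportError:
    have_gpu = False

if have_gpu:
    @cuda.jit
    def apply(ctx, out):
        i = cuda.grid(1)
        p = ctx[0]              # the record lives in device global memory
        if i < p.n:
            out[i] = p.scale * i + p.offset

    d_ctx = cuda.to_device(ctx)                       # one explicit transfer
    d_out = cuda.device_array(4, dtype=np.float64)
    apply[1, 32](d_ctx, d_out)
    result = d_out.copy_to_host()
```

Keeping large arrays out of the record and passing them as separate device arrays is one way to retain control over which buffers are copied back.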

nelson2005 · August 2, 2023, 6:27pm

Can you post the code for the conversion utility, or at least the key bits? That sounds like it could be generally useful.

shaunc · August 12, 2023, 11:50pm


https://gist.github.com/shaunc/d86854d74f4518935beb4d595bbac8e6

record.py

from collections import abc
from typing import Any, Iterator, cast

import numpy as np
from numba import cuda  # type: ignore

TField = tuple[str, Any] | tuple[str, Any, Any]

_EXCLUDED_SEQ_TYPES = (str, bytes, bytearray, memoryview)

This file has been truncated.

to_record() is the base entry point. named_tuples_to_records is perhaps specific to our use case.
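Since the gist preview is cut off, here is a rough, independent sketch of how nested NamedTuples could be turned into a nested structured dtype. It is not the gist's implementation: `sketch_to_record`, `dtype_of`, `Grid`, and `Ctx` are hypothetical names, and only scalar leaf fields are handled (array-valued fields would need extra work).

```python
from typing import NamedTuple

import numpy as np


def _is_namedtuple(value):
    # NamedTuples are tuples that also expose a _fields attribute.
    return isinstance(value, tuple) and hasattr(value, "_fields")


def dtype_of(value):
    """Build a (possibly nested) structured dtype mirroring the value."""
    if _is_namedtuple(value):
        return np.dtype([(name, dtype_of(v))
                         for name, v in zip(value._fields, value)])
    return np.asarray(value).dtype  # scalar leaf


def as_plain_tuple(value):
    """Strip NamedTuple classes so NumPy sees ordinary nested tuples."""
    if _is_namedtuple(value):
        return tuple(as_plain_tuple(v) for v in value)
    return value


def sketch_to_record(nt):
    """Pack a nested NamedTuple of scalars into a one-element record array."""
    rec = np.zeros(1, dtype=dtype_of(nt))
    rec[0] = as_plain_tuple(nt)
    return rec


# Hypothetical nested context, for illustration only.
class Grid(NamedTuple):
    dx: float
    dy: float


class Ctx(NamedTuple):
    grid: Grid
    steps: int


rec = sketch_to_record(Ctx(Grid(0.5, 0.25), 10))
```

The dtype is inferred from the values rather than from type annotations, so an `int` field becomes the platform's default integer type; an explicit dtype mapping would be safer if the kernel depends on exact widths.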
