Making Awkward Arrays work in the CUDA target

From this post:

Would this make it possible for extensions like Awkward Array to run in numba.cuda.jit-ed functions? We’re currently developing a CUDA backend for Awkward, which uses CUDA pointers in place of main memory pointers and converts to and from CuPy, rather than NumPy, for the low-level buffers. I suspect some modifications would be necessary, but would Numba based on NVVM reduce barriers to something like the following in the future?

@nb.cuda.autojit
def something(awkward_array, output_numpy):
    i = nb.cuda.grid(1)
    data_structure = awkward_array[i]
    # walk over data_structure, do something...
    output_numpy[i] = ...  # whatever was computed

For context, Awkward’s DataModel in Numba is a set of arrays and integer indexes into those arrays that are interpreted by type-specific, hard-coded implementations of __getitem__, __len__, etc. So it’s all arrays and numbers internally; the pyobject is only used for reference counting and boxing return values (which won’t be the case for CUDA kernels, since they act in place). With the Awkward CUDA backend, these arrays can be GPU-resident/CuPy.
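
Concretely, here is a minimal sketch of that kind of data model using Numba’s standard extension API. The type name, field names, and field types below are illustrative only, not Awkward’s actual source; the point is just that every member is a plain integer, a pointer table, or the single pyobject kept for reference counting.

from numba.core import types
from numba.extending import models, register_model


class AwkwardViewType(types.Type):
    """Hypothetical Numba type for a read-only view of an Awkward layout."""
    def __init__(self):
        super().__init__(name="AwkwardViewType")


@register_model(AwkwardViewType)
class AwkwardViewModel(models.StructModel):
    def __init__(self, dmm, fe_type):
        members = [
            ("pos", types.int64),                       # which entry of arrayptrs to start from
            ("start", types.int64),                     # first valid index
            ("stop", types.int64),                      # one past the last valid index
            ("arrayptrs", types.CPointer(types.intp)),  # table of raw buffer addresses
            ("pylookup", types.pyobject),               # owner object, used only for refcounting
        ]
        super().__init__(dmm, fe_type, members)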

I’d be really keen to find a way to support this. There are two things worth looking at right now that might enable the use of Awkward Array types and functions in CUDA kernels:

How does this look from your perspective? Do you need me to expand a bit more on either of the items I mentioned above?

Ah—I misunderstood the other thread about NVVM—I didn’t realize that any Numba extensions would work in CUDA mode yet.

The fact that it’s possible now means I ought to look into it. (The timescale on which I can do that is unclear; when I thought a compiler pipeline change was needed, I imagined this becoming possible in a few months, but it’s possible right away!)

One thing might make the porting easier: Awkward Arrays are read-only, so they would always be InParams. Output would have to go through NumPy arrays or anything else Numba supports. The ArrayBuilder that we can use for output in CPU-JITted functions uses too much dynamic memory, tree structures, and C++ shared pointers; it’s very GPU-unfriendly. But the read-only input Arrays ought to be doable.

I’ve reminded myself of the internals of the Awkward-Numba extension. It is all integers and arrays, but some of the relevant arrays are not directly visible in the ArrayViewModel; they’re reached by interpreting elements of the ArrayViewModel’s arrayptrs as pointers. The lifetime of the whole data structure is controlled by the one reference counter on the pyobject, so we can freely jump from pointer to pointer, guaranteed that none of them have been deleted. That will probably make the transition to GPUs easier, but I’ll need to make sure that all the arrays behind the ArrayViewModel’s arrayptrs are actually ready for/on the GPU.
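
As a toy, host-side illustration of that indirection (this is not Awkward’s code, and it ignores GPUs entirely): the arrayptrs table holds raw integer addresses, and following one of them recovers the buffer it points to.

import ctypes
import numpy as np

content = np.arange(10, dtype=np.float64)          # a leaf buffer
offsets = np.array([0, 3, 3, 10], dtype=np.int64)  # a structural buffer

# arrayptrs holds raw addresses; lowered code indexes it and reinterprets an
# element as a pointer to the corresponding buffer
arrayptrs = np.array([offsets.ctypes.data, content.ctypes.data], dtype=np.intp)

# "jump from pointer to pointer": recover the offsets buffer from its address
addr = ctypes.c_void_p(int(arrayptrs[0]))
recovered = np.ctypeslib.as_array(
    ctypes.cast(addr, ctypes.POINTER(ctypes.c_int64)), shape=(len(offsets),)
)
assert (recovered == offsets).all()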

In our current project porting the non-Numba methods to CUDA, we’re labeling all of our buffers by which library should handle them: the default awkward-cpu-kernels.so or the new awkward-cuda-kernels.so. We’re planning on making the distinction user-visible, such that there are explicit “copy to GPU”/“copy to CPU” functions that copy all the buffers of an array structure. When all buffers involved in an operation are in main memory, CPU functions are used internally; when all buffers are on a GPU, GPU functions are used; and if we somehow ended up with a mixture (it’s possible), an error message would tell the user to either call “copy to CPU” or “copy to GPU” (a pass-through for the buffers that are already there).
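
A rough sketch of that dispatch policy, with purely hypothetical names (nothing below is Awkward’s actual API), just to make the three cases explicit:

def which_kernels(buffers):
    """Decide which kernel library should handle an operation.

    `buffers` is a hypothetical list of (label, buffer) pairs, where each
    label is "cpu" (main memory) or "cuda" (GPU-resident).
    """
    labels = {label for label, _ in buffers}
    if labels <= {"cpu"}:
        return "awkward-cpu-kernels.so"
    elif labels <= {"cuda"}:
        return "awkward-cuda-kernels.so"
    raise ValueError(
        "buffers are split between main memory and the GPU; "
        "call 'copy to CPU' or 'copy to GPU' to make them consistent"
    )


print(which_kernels([("cpu", b"..."), ("cpu", b"...")]))   # awkward-cpu-kernels.so
# which_kernels([("cuda", b"..."), ("cpu", b"...")])       # would raise ValueError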

Perhaps, then, if an Awkward Array is passed as an argument to a numba.cuda.jit-ed function, it should complain with a “please call copy to GPU” error if any of its buffers are still in main memory? This would be consistent with the behavior in Awkward outside of Numba, and it would let the user maintain control over where their arrays are. On the other hand, passing an array to a numba.cuda.jit-ed function is unlike the other cases in that it’s unambiguous that the array needs to be on the GPU, so maybe it should be automatically copied, like InParams for NumPy arrays.

I think the example you showed me about InParams, OutParams, and InOutParams was about this copying. If I go with the policy of requiring the user to call a “copy to GPU” function before an array can be used in a numba.cuda.jit-ed function, then perhaps this part just performs a check and doesn’t actually mutate anything.

That brings up a technical question: are all CUDA-allocated pointers interchangeable? If we call CUDA allocate to make a buffer, can lowered code in a numba.cuda.jit-ed function just dereference these pointers, similar to the way the current implementation dereferences main memory pointers? I understand they have to be guaranteed to be CUDA-allocated pointers, but we have labels to track that. But if there are other requirements beyond that, I’m unaware of them.

(One thing we’ll be testing next week is interchangeability of CUDA-allocated pointers with CuPy, which is nearly the same thing.)
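
A minimal version of that kind of interchangeability test, assuming CuPy and a CUDA-capable GPU are available: a buffer allocated by CuPy is handed straight to a numba.cuda.jit-ed kernel, which reads and writes the same device pointer via the CUDA Array Interface.

import cupy as cp
from numba import cuda


@cuda.jit
def scale(x, factor):
    i = cuda.grid(1)
    if i < len(x):
        x[i] *= factor


buf = cp.arange(8, dtype=cp.float32)  # CuPy owns this device allocation
scale[1, 8](buf, 2.0)                 # Numba dereferences the same pointer
print(buf)                            # [ 0.  2.  4.  6.  8. 10. 12. 14.]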

One thing might make the porting easier: Awkward Arrays are read-only, so they would always be InParams. Output would have to go through NumPy arrays or anything else Numba supports. The ArrayBuilder that we can use for output in CPU-JITted functions uses too much dynamic memory, tree structures, and C++ shared pointers; it’s very GPU-unfriendly. But the read-only input Arrays ought to be doable.

I’ve reminded myself of the internals of the Awkward-Numba extension. It is all integers and arrays, but some of the relevant arrays are not directly visible in the ArrayViewModel; they’re reached by interpreting elements of the ArrayViewModel’s arrayptrs as pointers. The lifetime of the whole data structure is controlled by the one reference counter on the pyobject, so we can freely jump from pointer to pointer, guaranteed that none of them have been deleted. That will probably make the transition to GPUs easier, but I’ll need to make sure that all the arrays behind the ArrayViewModel’s arrayptrs are actually ready for/on the GPU.

This sounds promising!

In our current project porting the non-Numba methods to CUDA, we’re labeling all of our buffers by which library should handle them: the default awkward-cpu-kernels.so or the new awkward-cuda-kernels.so.

Do you want some of the code in awkward-cuda-kernels.so to be callable/usable from within @cuda.jit functions? If so, it’s possible to link in code compiled from another language with a @cuda.jit kernel and call it. Unfortunately it’s undocumented and there’s barely any example code, but if this is something that would be useful and you could give a tiny example of an API you might call in awkward-cuda-kernels.so, I can probably illustrate how it would be done.
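
For reference, a rough sketch of that linking mechanism, in which the device function name, its signature, and the PTX file are placeholders rather than anything that exists in awkward-cuda-kernels.so: declare the external device function, then link its compiled code into the kernel.

from numba import cuda, types

# Declare a device function implemented elsewhere, e.g. compiled to PTX from
# C/C++ with `nvcc -ptx`. The name and signature are hypothetical.
vec_add = cuda.declare_device('vec_add', types.float32(types.float32, types.float32))


@cuda.jit(link=['awkward_cuda_helpers.ptx'])  # hypothetical PTX file to link in
def kernel(x, y, out):
    i = cuda.grid(1)
    if i < len(out):
        out[i] = vec_add(x[i], y[i])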

Perhaps, then, if an Awkward Array is passed as an argument to a numba.cuda.jit-ed function, it should complain with a “please call copy to GPU” error if any of its buffers are still in main memory? This would be consistent with the behavior in Awkward outside of Numba, and it would let the user maintain control over where their arrays are. On the other hand, passing an array to a numba.cuda.jit-ed function is unlike the other cases in that it’s unambiguous that the array needs to be on the GPU, so maybe it should be automatically copied, like InParams for NumPy arrays.

My feeling is that if you’re extending the CUDA target to support Awkward Arrays, the better path is to give it semantics that match the rest of Awkward’s CUDA support, and complain when the user needs to copy to the GPU. In my experience, lots of users fail to notice when they’re copying data between host and device, so forcing them to take control of copying is a good thing, given that the rest of Awkward’s CUDA support already works that way.

That brings up a technical question: are all CUDA-allocated pointers interchangeable? If we call CUDA allocate to make a buffer, can lowered code in a numba.cuda.jit-ed function just dereference these pointers, similar to the way the current implementation dereferences main memory pointers? I understand they have to be guaranteed to be CUDA-allocated pointers, but we have labels to track that. But if there are other requirements beyond that, I’m unaware of them.

My understanding is that all CUDA-allocated pointers are interchangeable within the context in which they were created, across both the Runtime and Driver API - I had a feeling that exchanging data between contexts required calling a function to get a valid pointer for the current context (or this might have been related to IPC) - I can’t find the documentation of what I was thinking of right now, but will post back when I have located it. In general if you’re just using a single context, I think passing pointers between different libraries will be fine.

(One thing we’ll be testing next week is interchangeability of CUDA-allocated pointers with CuPy, which is nearly the same thing.)

A couple of items that may be of interest related to this and above discussions:

  • The CUDA Array Interface facilitates passing data between different CUDA-aware libraries, and is supported by CuPy, Numba, and others. I have an example of implementing the interface for Runtime API-allocated pointers, which I’ll post later - I’m on a personal machine at the moment and I don’t have it handy.
  • Documentation on CUDA IPC for sharing data between processes.

The CUDA Array Interface facilitates passing data between different CUDA-aware libraries, and is supported by CuPy, Numba, and others. I have an example of implementing the interface for Runtime API-allocated pointers, which I’ll post later - I’m on a personal machine at the moment and I don’t have it handy.

Got the example now - it is:

from numba import cuda

# Get access to the CUDA Runtime API via ctypes

from ctypes import CDLL, POINTER, byref, c_void_p, c_size_t

cudart = CDLL('libcudart.so')

cudaMalloc = cudart.cudaMalloc
cudaMalloc.argtypes = [POINTER(c_void_p), c_size_t]

cudaFree = cudart.cudaFree
cudaFree.argtypes = [c_void_p]


# Create a class to provide the CUDA Array interface.
#
# Simple example for read-write contiguous data.

FLOAT32_SIZE = 4


class MyFloatArray:
    """An array of floats that allocates memory on the device when constructed,
    and frees it when it is deleted.
    """

    def __init__(self, size):
        self._size = size
        self._typestr = 'f4'
        self._ptr = c_void_p()

        alloc_size = size * FLOAT32_SIZE
        ret = cudaMalloc(byref(self._ptr), alloc_size)
        if ret:
            raise RuntimeError(f'Unexpected return code {ret} from cudaMalloc')

    def __del__(self):
        cudaFree(self._ptr)

    @property
    def __cuda_array_interface__(self):
        return {
            'shape': (self._size,),
            'typestr': self._typestr,
            'data': (self._ptr.value, False),
            'version': 2,
        }


# Create an instance of the array object that allocates some memory

nelems = 32
arr = MyFloatArray(nelems)


# For sanity checks / inspection

arr_cai = arr.__cuda_array_interface__

cai_ptr = arr_cai['data'][0]
print(f"Pointer from CAI is 0x{cai_ptr:x}")

print(f"Shape from CAI is {arr_cai['shape']}")
print(f"Type string from CAI is '{arr_cai['typestr']}'")
print(f"Version from CAI is {arr_cai['version']}")


# Launch a kernel on the array holding the Runtime API-allocated memory

@cuda.jit
def initialize(x):
    i = cuda.grid(1)
    if i < len(x):
        x[i] = 3.14


initialize[1, 32](arr)


# We can also copy the data to the host and print it out to check that it is
# as expected. For simplicity in this example, we'll use the as_cuda_array()
# function, which takes any object exposing the CUDA Array Interface and
# creates a Numba device array from it, so that we can copy the data and
# print it.

print(cuda.as_cuda_array(arr).copy_to_host())

# Prints:
# [3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14
# 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14
# 3.14 3.14 3.14 3.14]

I had a feeling that exchanging data between contexts required calling a function to get a valid pointer for the current context (or this might have been related to IPC) - I can’t find the documentation of what I was thinking of right now, but will post back when I have located it. In general if you’re just using a single context, I think passing pointers between different libraries will be fine.

Having done some further reading to remind myself, I think I may have been thinking of Peer Access between different devices (CUDA Driver API :: CUDA Toolkit Documentation). If you’re only concerned with pointers on the same device, then I think this isn’t something you need to worry about for Awkward, and you could consider all device pointers allocated by different libraries / APIs interchangeable between those libraries and API functions.

(Just cleaning out my email inbox…)

It’s done!