Host Side Memory Allocation (for CUDA)

I am wondering how Numba deals with host-side memory allocation. I am fairly unfamiliar with the inner workings of NumPy and have very little experience extending Python with C. From what I gather, Numba uses something called a memoryview (which seems to be a very common tool when interfacing with C) and simply writes into the buffer of an already-created object, like a NumPy array. This means you can rely on the input object to handle freeing the host memory. I've tried calling cuMemcpyDtoH with the host pointer being a reference to an empty NumPy array (using ary.ctypes.get_data()) and it seems to work fine. Are there scenarios where this might fail, or are they mostly edge cases? If you wanted to create the memory buffers directly via malloc and free, is there a way to call these from Python (like PyMalloc or something?) such that they don't interfere with the Python interpreter?
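Concretely, what I tried looks roughly like the following (a minimal sketch using the public numba.cuda API rather than the raw driver call, so the details may differ slightly from what I described):

```python
import numpy as np
from numba import cuda

src = np.arange(16, dtype=np.float32)
d_ary = cuda.to_device(src)        # device allocation + host-to-device copy

out = np.empty_like(src)           # host buffer owned (and later freed) by NumPy
# out.ctypes.get_data() is the raw address of the NumPy-owned buffer; Numba's
# copy_to_host writes into that buffer via (effectively) cuMemcpyDtoH.
d_ary.copy_to_host(out)

assert (out == src).all()
```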

For pinned memory, I noticed that CUDA itself allocates the host-side pinned memory. Are there issues with how this memory might be handled by the Python interpreter (could it get garbage collected, for example)? I imagine that as long as the pointer isn't garbage collected, and as long as you eventually call cuMemFreeHost, you're good to go.

From what I gather, Numba uses something called a memoryview (which seems to be a very common tool when interfacing with C) and simply writes into the buffer of an already-created object, like a NumPy array. This means you can rely on the input object to handle freeing the host memory. I've tried calling cuMemcpyDtoH with the host pointer being a reference to an empty NumPy array (using ary.ctypes.get_data()) and it seems to work fine. Are there scenarios where this might fail, or are they mostly edge cases?

I think this generally works; I can't think of any edge cases right now, but don't take that as a guarantee that there are none :slight_smile:

If you wanted to create the memory buffers directly via malloc and free, is there a way to call these from Python (like PyMalloc or something?) such that they don't interfere with the Python interpreter?

Perhaps you could call them with ctypes or cffi?
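For example, with ctypes it might look something like this (an untested sketch; the usual caveat applies that NumPy does not own the buffer, so the array must never be used after the buffer is freed):

```python
import ctypes
import numpy as np

libc = ctypes.CDLL(None)  # the process's C library (POSIX; use msvcrt on Windows)
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

nbytes = 16 * np.dtype(np.float32).itemsize
ptr = libc.malloc(nbytes)

# View the malloc'd buffer as a NumPy array (no copy; NumPy does not own it).
buf = (ctypes.c_byte * nbytes).from_address(ptr)
ary = np.frombuffer(buf, dtype=np.float32)
ary[:] = 0.0

libc.free(ptr)  # caller is responsible for freeing; `ary` is dangling after this
```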

For pinned memory, I noticed that CUDA itself allocates the host-side pinned memory. Are there issues with how this memory might be handled by the Python interpreter (could it get garbage collected, for example)? I imagine that as long as the pointer isn't garbage collected, and as long as you eventually call cuMemFreeHost, you're good to go.

Pinned and mapped memory also use a memoryview, which is implemented by the MemAlloc class in mviewbuf.c. This is subclassed by the MappedOwnedPointer and PinnedMemory classes in numba.cuda.cudadrv.driver, which are eventually instantiated when Context.memhostalloc is called; which one is used depends on the mapped argument to memhostalloc. The created memoryview is passed to the ndarray constructor so that the ndarray uses the memoryview created by Numba: e.g. https://github.com/numba/numba/blob/master/numba/cuda/api.py#L141.
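From the user's side, I believe the public entry points that end up in Context.memhostalloc are cuda.pinned_array and cuda.mapped_array (assuming a CUDA-capable device is present):

```python
import numpy as np
from numba import cuda

# Page-locked host array; backed by the memoryview created in memhostalloc.
p_ary = cuda.pinned_array(16, dtype=np.float32)

# Pinned *and* mapped into the GPU address space (memhostalloc with mapped=True).
m_ary = cuda.mapped_array(16, dtype=np.float32)
```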

Finalizers are used to free allocations made by Numba. One example of a finalizer being created and passed to a MappedMemory instance is at https://github.com/numba/numba/blob/master/numba/cuda/cudadrv/driver.py#L739. When the MappedMemory object is garbage collected, the finalizer function runs, which adds a deallocation to Numba's list of pending deallocations: https://github.com/numba/numba/blob/4459e61a3d4f1a8c5e323470b61105f36d039feb/numba/cuda/cudadrv/driver.py#L1262. When Numba actually needs to free up more GPU memory, either because it has run out or because it has decided the pending deallocations are too many or too large, it calls the functions in its pending-deallocation list with the pointers to deallocate. The finalizer could deallocate immediately, but Numba defers deallocations to minimize the chance of a deallocation running in the middle of a sequence of asynchronous operations, since that would force a synchronization. This behavior is detailed further under Deallocation Behavior in the docs.
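The pattern is roughly the following (an illustrative sketch of the finalizer-plus-pending-list idea, not Numba's actual code):

```python
import weakref

pending_deallocs = []

def make_finalizer(ptr):
    def finalizer():
        # Runs when the owning object is collected; just queue the free.
        pending_deallocs.append(ptr)
    return finalizer

class MappedMemoryLike:
    def __init__(self, ptr):
        self.ptr = ptr
        weakref.finalize(self, make_finalizer(ptr))

def flush_pending():
    # Analogous to Numba clearing its deallocation list when memory is needed.
    while pending_deallocs:
        ptr = pending_deallocs.pop()
        print(f"freeing 0x{ptr:x}")  # Numba would call cuMemFreeHost / cuMemFree here

obj = MappedMemoryLike(0xDEADBEEF)
del obj          # the finalizer queues the pointer
flush_pending()  # deallocation actually happens here
```

Relatedly, the cuda.defer_cleanup() context manager can be used to block deallocation entirely around a critical section of asynchronous work.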

Perhaps you could call them with ctypes or cffi?

I was thinking of doing this, but the Python docs say:

To avoid memory corruption, extension writers should never try to operate on Python objects with the functions exported by the C library: malloc(), calloc(), realloc() and free(). This will result in mixed calls between the C allocator and the Python memory manager with fatal consequences, because they implement different algorithms and operate on different heaps.
so I was wondering if there is an equivalent to malloc within Python.h, something like PyMalloc, that I could wrap in a few lines of code to work like malloc (it looks like PyMem_RawMalloc might be what I'm looking for?). It's also possible I'm misinterpreting what is being said: are they saying that you simply shouldn't call malloc from within the C file you're writing as the extension, and that calling malloc with ctypes is fine?
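Something like this is what I had in mind (a sketch; I'm assuming PyMem_RawMalloc/PyMem_RawFree are the right pair to use):

```python
import ctypes

# CPython exports PyMem_RawMalloc/PyMem_RawFree; they are thin wrappers
# around the system allocator, so pairing them should be safe as long as
# a buffer from PyMem_RawMalloc is always released with PyMem_RawFree
# (never with libc's free(), and vice versa).
pyapi = ctypes.pythonapi
pyapi.PyMem_RawMalloc.restype = ctypes.c_void_p
pyapi.PyMem_RawMalloc.argtypes = [ctypes.c_size_t]
pyapi.PyMem_RawFree.argtypes = [ctypes.c_void_p]

ptr = pyapi.PyMem_RawMalloc(1024)   # 1 KiB raw buffer
# ... use the buffer ...
pyapi.PyMem_RawFree(ptr)
```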

The above comment is also why I was wondering whether there might be issues with the pinned memory that CUDA allocates. If you haven't come across any issues with it, then I won't worry about it for now.

I think the Python docs are warning about using the C allocator/free functions on PyObject*s that were created on a different heap. There should be no problem using a different heap in a Python program, as long as you don't accidentally invoke the wrong free() on objects from a different heap.

Thanks for clearing that up Siu!