Cannot reset CUDA context with Numba

I am new to Numba. I am using PyTorch and Numba together. I want to clear the resources in the CUDA context after training one model and then continue training another model in the same script. I am currently using the cuda.current_context().reset() API. However, there is a part of the CUDA memory that I cannot clear with this API. As I train more and more models in the script, the amount of CUDA memory that I cannot clear seems to grow, and it eventually leads to an OOM error.

I am pretty sure that I have deleted the old models and datasets before training another model. I have also tried gc.collect() and torch.cuda.empty_cache().
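
The overall structure is roughly like this (a minimal sketch with a placeholder model, not my real training code):

```python
import gc

import torch
from numba import cuda

def train_one_model():
    # Placeholder model and data; my real training code is more involved.
    model = torch.nn.Linear(1024, 1024).cuda()
    data = torch.randn(4096, 1024, device="cuda")
    model(data).sum().backward()

for i in range(5):
    train_one_model()                   # old model and data go out of scope here
    gc.collect()
    torch.cuda.empty_cache()
    cuda.current_context().reset()      # the call I am asking about
    print(torch.cuda.memory_reserved()) # some memory still remains after all of the above
```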

It seems that cuda.get_current_device().reset() and cuda.close() will clear that part of the memory. But these APIs destroy the CUDA context, and I cannot continue to use torch.distributed APIs afterwards.
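
What I tried there, roughly (illustrative sketch only, not a reduced repro):

```python
import torch
from numba import cuda

x = torch.randn(1024, device="cuda")  # some PyTorch state created earlier

cuda.get_current_device().reset()     # destroys the context for the device
# cuda.close()                        # clears/destroys Numba's contexts entirely

# After either call the context is gone, so tensors like `x` and later
# torch.distributed calls are no longer usable.
```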

I am wondering why cuda.current_context().reset() cannot clean up all the memory in the context. From the docs, I thought this API should clean up all resources in the current context. Is there any way to clear the context without destroying it “for real”?

I would really appreciate any help! Thank you!

cuda.current_context().reset() only cleans up the resources owned by Numba - it can’t clear up things that Numba doesn’t know about.
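
For example (an illustrative sketch, not code from your script):

```python
import torch
from numba import cuda

d_arr = cuda.device_array(1_000_000)        # allocation Numba knows about
t = torch.empty(1_000_000, device="cuda")   # allocation made by PyTorch's caching allocator

cuda.current_context().reset()

# The Numba-owned allocation is released, but the PyTorch one is not -
# Numba has no record of it, so reset() cannot free it.
print(torch.cuda.memory_allocated())        # PyTorch still accounts for `t`
```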

I don’t think there is any safe way to clear up the context without destroying it, because any references to memory in the context held by other libraries (such as PyTorch) would be invalidated without those libraries’ knowledge.

I’m not familiar with PyTorch, but if you can post some code that reproduces the apparent leaks when using it in conjunction with Numba, it might be possible to help pinpoint why resources are not being freed when you think all references to them are gone (perhaps there is a bug somewhere in Numba or PyTorch, from which we can distill an issue).

Thanks for your reply! I just found out that the cause is on the PyTorch side. Thank you very much!
