Cannot reset CUDA context with Numba

I am new to Numba. I am using PyTorch and Numba together. I want to clear the resources in the CUDA context after training one model and then continue training another model in the same script. I am currently using the cuda.current_context().reset() API. However, there is a part of the CUDA memory that I cannot clear with this API. As I train more and more models in the script, the amount of CUDA memory that I cannot clear seems to grow, and it eventually leads to an OOM error.

I am pretty sure that I have deleted the old models and datasets before training another model. I have also tried gc.collect() and torch.cuda.empty_cache().
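
The overall structure is roughly like this (a minimal sketch with a placeholder model, not my real training code):

```python
import gc

import torch
from numba import cuda

def train_one_model():
    # Placeholder model and data; my real training code is more involved.
    model = torch.nn.Linear(1024, 1024).cuda()
    data = torch.randn(4096, 1024, device="cuda")
    model(data).sum().backward()

for i in range(5):
    train_one_model()                   # old model and data go out of scope here
    gc.collect()
    torch.cuda.empty_cache()
    cuda.current_context().reset()      # the call I am asking about
    print(torch.cuda.memory_reserved()) # some memory still remains after all of the above
```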

It seems that cuda.get_current_device().reset() and cuda.close() will clear that part of the memory. But these APIs destroy the CUDA context, and I cannot continue to use torch.distributed APIs afterwards.
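
What I tried there, roughly (illustrative sketch only, not a reduced repro):

```python
import torch
from numba import cuda

x = torch.randn(1024, device="cuda")  # some PyTorch state created earlier

cuda.get_current_device().reset()     # destroys the context for the device
# cuda.close()                        # clears/destroys Numba's contexts entirely

# After either call the context is gone, so tensors like `x` and later
# torch.distributed calls are no longer usable.
```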

I am wondering why cuda.current_context().reset() cannot clean up all the memory in the context. From the docs, I thought this API should clean up all resources in the current context. Is there any way to clear the context without destroying it “for real”?

I would really appreciate any help! Thank you!

cuda.current_context().reset() only cleans up the resources owned by Numba - it can’t clear up things that Numba doesn’t know about.
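
For example (an illustrative sketch, not code from your script):

```python
import torch
from numba import cuda

d_arr = cuda.device_array(1_000_000)        # allocation Numba knows about
t = torch.empty(1_000_000, device="cuda")   # allocation made by PyTorch's caching allocator

cuda.current_context().reset()

# The Numba-owned allocation is released, but the PyTorch one is not -
# Numba has no record of it, so reset() cannot free it.
print(torch.cuda.memory_allocated())        # PyTorch still accounts for `t`
```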

I don’t think there is any safe way to clear up the context without destroying it, because any references to memory in the context held by other libraries (such as PyTorch) would be invalidated without those libraries’ knowledge.

I’m not familiar with PyTorch, but if you can post some code that reproduces the apparent leaks when using it in conjunction with Numba, it might be possible to help pinpoint why resources are not being freed when you think all references to them are gone (perhaps there is a bug somewhere in Numba or PyTorch, from which we can distill an issue).

Thanks for your reply! I just found out that the cause is on the PyTorch side. Thank you very much!
