I am new to Numba. I am using PyTorch and Numba together, and I want to clear the resources in the CUDA context after training one model so that I can train another model in the same script. I am currently using the `cuda.current_context().reset()` API. However, there is a part of the CUDA memory that this API does not clear. As I train more and more models in the script, the amount of CUDA memory I cannot clear keeps growing, and it eventually leads to an OOM error.
I am pretty sure that I delete the old models and datasets before training the next one. I have also already run `gc.collect()` and `torch.cuda.empty_cache()`. A stripped-down sketch of my setup is below.
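To make this concrete, this is roughly what my script does between models; `configs`, `build_dataset`, and `train_model` are placeholders for my actual code:

```python
import gc

import torch
from numba import cuda

for cfg in configs:                    # several models trained back to back
    dataset = build_dataset(cfg)       # placeholder for my data pipeline
    model = train_model(cfg, dataset)  # placeholder for my training code

    # Cleanup between models:
    del model, dataset                 # drop my references to the old objects
    gc.collect()                       # force Python to release them
    torch.cuda.empty_cache()           # return PyTorch's cached blocks to the driver
    cuda.current_context().reset()     # ask Numba to free resources in the current context
```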
It seems that `cuda.get_current_device().reset()` and `cuda.close()` do clear that part of the memory. But these APIs destroy the CUDA context, and I cannot continue to use the `torch.distributed` APIs afterwards.
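In other words, the only calls I have found that free everything also kill the context (sketch):

```python
from numba import cuda

# These do reclaim the leftover memory, but they tear down the CUDA
# context, and torch.distributed stops working afterwards:
cuda.get_current_device().reset()
cuda.close()
```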
I am wondering why `cuda.current_context().reset()` cannot clean up all the memory in the context. From the docs, I understood that this API should clean up all resources owned by the current context. Is there any way to clear the context without destroying it "for real"?
I would really appreciate any help! Thank you!