Prior to this commit, allocations performed by `ncclCommInitRank` had no corresponding call to `ncclCommDestroy`. Although `CCLThreadLocalContext::Clear` does call `ncclCommDestroy`, nothing calls `Clear`. On worker processes, the missing `ncclCommDestroy` typically had little effect: any destruction would have occurred shortly before the process exited, at which point the OS reclaims the resources anyway.
However, worker0 of a Disco session is a separate thread, rather than a separate process. While this allows it to easily receive data from the controller thread, resources allocated by worker0 are not reclaimed by the OS until the entire process terminates. As a result, the `CCLThreadLocalContext` leaked GPU memory, because the communicator created by the `ncclCommInitRank` call at the start of each `tvm.runtime.disco.ProcessSession` was never destroyed. GPU memory usage grew by about 1 gigabyte for each `ProcessSession`.

This commit updates `CCLThreadLocalContext` to have a destructor that calls the `Clear` method. For worker0, this is called when the thread is joined to the main thread.

You can view, comment on, or merge this pull request online at:

  https://github.com/apache/tvm/pull/17078

-- Commit Summary --

  * [Bugfix][NCCL] Release NCCL thread_local resources in destructor

-- File Changes --

    M src/runtime/disco/nccl/nccl.cc (12)
    M src/runtime/disco/nccl/nccl_context.h (15)

-- Patch Links --

https://github.com/apache/tvm/pull/17078.patch
https://github.com/apache/tvm/pull/17078.diff
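
For illustration, here is a minimal sketch of the pattern this fix relies on: a `thread_local` context whose destructor calls `Clear`. It is not the code from the patch. `FakeComm`, `FakeCommInit`, `FakeCommDestroy`, `ThreadLocalContext`, and the `g_live_comms` counter are hypothetical stand-ins for the NCCL communicator, `ncclCommInitRank`, `ncclCommDestroy`, and `CCLThreadLocalContext`, so the example runs without a GPU.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical counter of communicators that have been created but not yet
// destroyed. It exists only so the leak (or its absence) can be observed.
std::atomic<int> g_live_comms{0};

// Hypothetical stand-in for an NCCL communicator handle.
struct FakeComm {};

// Stand-in for ncclCommInitRank: allocates a resource.
FakeComm* FakeCommInit() {
  ++g_live_comms;
  return new FakeComm();
}

// Stand-in for ncclCommDestroy: releases the resource.
void FakeCommDestroy(FakeComm* comm) {
  delete comm;
  --g_live_comms;
}

// Stand-in for CCLThreadLocalContext.
struct ThreadLocalContext {
  FakeComm* comm = nullptr;

  // Without this destructor, Clear() is never invoked and the resource
  // outlives the thread that created it.
  ~ThreadLocalContext() { Clear(); }

  // Safe to call more than once: the pointer is reset after destruction.
  void Clear() {
    if (comm != nullptr) {
      FakeCommDestroy(comm);
      comm = nullptr;
    }
  }

  // One instance per thread, destroyed when that thread exits.
  static ThreadLocalContext* Get() {
    thread_local ThreadLocalContext ctx;
    return &ctx;
  }
};

// Runs one worker0-style thread inside the current process and reports how
// many communicators are still alive after the thread has been joined.
int LiveCommsAfterWorkerThread() {
  std::thread worker([] {
    ThreadLocalContext::Get()->comm = FakeCommInit();
    assert(g_live_comms.load() >= 1);
  });
  // thread_local destructors run as the worker thread exits, which has
  // completed by the time join() returns.
  worker.join();
  return g_live_comms.load();
}
```

With the destructor in place, `LiveCommsAfterWorkerThread()` reports no live communicators after the join. Removing `~ThreadLocalContext()` from the sketch leaves one communicator alive per worker thread, which mirrors the per-session leak described above.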