Prior to this commit, allocations performed by `ncclCommInitRank` had no
corresponding call to `ncclCommDestroy`.  While `ncclCommDestroy` is called in
the `CCLThreadLocalContext::Clear` method, nothing calls that method.  On
worker processes, the missing `ncclCommDestroy` call typically had little
effect.  Any destruction would occur shortly before the process exits, and
the OS reclaims the resources when the process terminates anyway.

However, worker0 of a Disco session is a separate thread, rather than a
separate process.  While this allows it to easily receive data from the
controller thread, resources allocated by worker0 are not reclaimed by the OS
until the entire process terminates.  As a result, the `CCLThreadLocalContext`
leaked GPU memory, because the resources allocated by the `ncclCommInitRank`
call at the start of each `tvm.runtime.disco.ProcessSession` were never
released.  GPU memory usage grew by about 1 gigabyte for each `ProcessSession`.

This commit updates `CCLThreadLocalContext` to have a destructor that calls the 
`Clear` method.  For worker0, this is called when the thread is joined to the 
main thread.
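
The pattern can be illustrated with a minimal, self-contained sketch.  This is
not the code from the patch: `FakeComm`, `FakeCommInit`, `FakeCommDestroy`,
`ThreadLocalContext`, and `LiveCommsAfterWorkerThread` are hypothetical
stand-ins, with a counter replacing the real NCCL communicator so the example
runs without a GPU.  It assumes only standard C++ behavior: the destructor of a
`thread_local` object runs when its owning thread exits.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-ins for ncclCommInitRank / ncclCommDestroy.
// The counter tracks how many "communicators" are currently alive.
static std::atomic<int> g_live_comms{0};

struct FakeComm {
  int rank;
};

FakeComm* FakeCommInit(int rank) {
  ++g_live_comms;
  return new FakeComm{rank};
}

void FakeCommDestroy(FakeComm* comm) {
  delete comm;
  --g_live_comms;
}

// Simplified analogue of a thread-local context that owns a communicator.
struct ThreadLocalContext {
  FakeComm* comm = nullptr;

  // The destructor releases the resource; without it, Clear() would never
  // run for a worker thread and the communicator would leak.
  ~ThreadLocalContext() { Clear(); }

  void Clear() {
    if (comm != nullptr) {
      FakeCommDestroy(comm);
      comm = nullptr;
    }
  }

  static ThreadLocalContext* Get() {
    thread_local ThreadLocalContext ctx;
    return &ctx;
  }
};

// Runs a worker thread that allocates a communicator, joins it, and returns
// the number of communicators still alive afterwards.
int LiveCommsAfterWorkerThread() {
  std::thread worker([] { ThreadLocalContext::Get()->comm = FakeCommInit(0); });
  worker.join();  // the worker's thread_local destructor has run by now
  return g_live_comms.load();
}
```

In this sketch, `LiveCommsAfterWorkerThread()` returns 0 because the
destructor runs as the worker thread exits.  Removing the destructor would
leave the count at 1 after each call, which mirrors the per-session leak
described above.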

You can view, comment on, or merge this pull request online at:

  https://github.com/apache/tvm/pull/17078

-- Commit Summary --

  * [Bugfix][NCCL] Release NCCL thread_local resources in destructor

-- File Changes --

    M src/runtime/disco/nccl/nccl.cc (12)
    M src/runtime/disco/nccl/nccl_context.h (15)

-- Patch Links --

https://github.com/apache/tvm/pull/17078.patch
https://github.com/apache/tvm/pull/17078.diff
