On 2025-06-23 12:59, Philip Yang wrote:
> If the process is exiting, the mmput inside the mmu notifier callback,
> invoked from compaction, fork or numa balancing, could release the last
> reference to the mm struct and call exit_mmap and free_pgtables. This
> triggers a deadlock with the backtrace below.
>
> The deadlock leaks the kfd process because the mmu notifier release is
> never called, which in turn leaks VRAM.
>
> The fix is to take an mm reference with mmget_not_zero when adding a prange
> to the deferred list, to pair with the mmput in the deferred list work.
>
> The backtrace of hung task:
>
>  INFO: task python:348105 blocked for more than 64512 seconds.
>  Call Trace:
>   __schedule+0x1c3/0x550
>   schedule+0x46/0xb0
>   rwsem_down_write_slowpath+0x24b/0x4c0
>   unlink_anon_vmas+0xb1/0x1c0
>   free_pgtables+0xa9/0x130
>   exit_mmap+0xbc/0x1a0
>   mmput+0x5a/0x140
>   svm_range_cpu_invalidate_pagetables+0x2b/0x40 [amdgpu]
>   mn_itree_invalidate+0x72/0xc0
>   __mmu_notifier_invalidate_range_start+0x48/0x60
>   try_to_unmap_one+0x10fa/0x1400
>   rmap_walk_anon+0x196/0x460
>   try_to_unmap+0xbb/0x210
>   migrate_page_unmap+0x54d/0x7e0
>   migrate_pages_batch+0x1c3/0xae0
>   migrate_pages_sync+0x98/0x240
>   migrate_pages+0x25c/0x520
>   compact_zone+0x29d/0x590
>   compact_zone_order+0xb6/0xf0
>   try_to_compact_pages+0xbe/0x220
>   __alloc_pages_direct_compact+0x96/0x1a0
>   __alloc_pages_slowpath+0x410/0x930
>   __alloc_pages_nodemask+0x3a9/0x3e0
>   do_huge_pmd_anonymous_page+0xd7/0x3e0
>   __handle_mm_fault+0x5e3/0x5f0
>   handle_mm_fault+0xf7/0x2e0
>   hmm_vma_fault.isra.0+0x4d/0xa0
>   walk_pmd_range.isra.0+0xa8/0x310
>   walk_pud_range+0x167/0x240
>   walk_pgd_range+0x55/0x100
>   __walk_page_range+0x87/0x90
>   walk_page_range+0xf6/0x160
>   hmm_range_fault+0x4f/0x90
>   amdgpu_hmm_range_get_pages+0x123/0x230 [amdgpu]
>   amdgpu_ttm_tt_get_user_pages+0xb1/0x150 [amdgpu]
>   init_user_pages+0xb1/0x2a0 [amdgpu]
>   amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x543/0x7d0 [amdgpu]
>   kfd_ioctl_alloc_memory_of_gpu+0x24c/0x4e0 [amdgpu]
>   kfd_ioctl+0x29d/0x500 [amdgpu]
>
> Fixes: fa582c6f3684 ("drm/amdkfd: Use mmget_not_zero in MMU notifier")
> Signed-off-by: Philip Yang <[email protected]>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 23 +++++++++++------------
>  1 file changed, 11 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 10d1276f8e1c..5fe92f9a1ce1 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -2392,15 +2392,17 @@ svm_range_add_list_work(struct svm_range_list *svms, struct svm_range *prange,
>                   prange->work_item.op != SVM_OP_UNMAP_RANGE)
>                       prange->work_item.op = op;
>       } else {
> -             prange->work_item.op = op;
> -
> -             /* Pairs with mmput in deferred_list_work */
> -             mmget(mm);
> -             prange->work_item.mm = mm;
> -             list_add_tail(&prange->deferred_list,
> -                           &prange->svms->deferred_range_list);
> -             pr_debug("add prange 0x%p [0x%lx 0x%lx] to work list op %d\n",
> -                      prange, prange->start, prange->last, op);
> +             /* Pairs with mmput in deferred_list_work.
> +              * If process is exiting and mm is gone, don't update mmu notifier.
> +              */
> +             if (mmget_not_zero(mm)) {
> +                     prange->work_item.mm = mm;
> +                     prange->work_item.op = op;
> +                     list_add_tail(&prange->deferred_list,
> +                                   &prange->svms->deferred_range_list);
> +                     pr_debug("add prange 0x%p [0x%lx 0x%lx] to work list op %d\n",
> +                              prange, prange->start, prange->last, op);
> +             }
>       }
>       spin_unlock(&svms->deferred_list_lock);
>  }
> @@ -2568,8 +2570,6 @@ svm_range_cpu_invalidate_pagetables(struct mmu_interval_notifier *mni,
>  
>       if (range->event == MMU_NOTIFY_RELEASE)
>               return true;
> -     if (!mmget_not_zero(mni->mm))
> -             return true;

I looked for other uses of mm inside svm_range_cpu_invalidate_pagetables and
found this call chain: svm_range_cpu_invalidate_pagetables ->
svm_range_unmap_from_cpu -> svm_range_unmap_split -> svm_range_add_child. It
currently assigns mm to a work item without taking a reference. But it seems
that mm isn't even used later, so maybe we should just assign

    pchild->work_item.mm = NULL;

Then we could also remove the mm parameter from svm_range_unmap_split and 
svm_range_add_child.

With that fixed, the patch is

Reviewed-by: Felix Kuehling <[email protected]>


>  
>       start = mni->interval_tree.start;
>       last = mni->interval_tree.last;
> @@ -2596,7 +2596,6 @@ svm_range_cpu_invalidate_pagetables(struct mmu_interval_notifier *mni,
>       }
>  
>       svm_range_unlock(prange);
> -     mmput(mni->mm);
>  
>       return true;
>  }
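
For reference, here is a minimal userspace sketch of the reference pairing the
patch describes. It is only a model, assuming C11 atomics as a stand-in for the
kernel's mm_users counting. Every model_* name is hypothetical and is not a
kernel API. The point it illustrates: the reference is taken (if still
possible) when the work item is queued, and the matching put happens in the
deferred worker, so the notifier callback itself never drops the last
reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for mm_struct: only the user count matters here. */
struct model_mm {
	atomic_int users;	/* plays the role of mm_users */
	bool torn_down;		/* set by whoever drops the last reference */
};

/* Hypothetical stand-in for the prange work item. */
struct model_work_item {
	struct model_mm *mm;
	int op;
	bool queued;
};

/* Models mmget_not_zero(): take a reference only if one still exists. */
static bool model_get_not_zero(struct model_mm *mm)
{
	int old = atomic_load(&mm->users);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&mm->users, &old, old + 1))
			return true;
	}
	return false;
}

/* Models mmput(): dropping the last reference performs the teardown. */
static void model_put(struct model_mm *mm)
{
	if (atomic_fetch_sub(&mm->users, 1) == 1)
		mm->torn_down = true;
}

/* Models the new else-branch: queue only if a reference could be taken. */
static bool model_add_list_work(struct model_work_item *w,
				struct model_mm *mm, int op)
{
	if (!model_get_not_zero(mm))
		return false;	/* mm is already gone, skip the update */

	w->mm = mm;
	w->op = op;
	w->queued = true;
	return true;
}

/* Models the deferred worker: handle the item, then drop the reference. */
static void model_deferred_work(struct model_work_item *w)
{
	if (!w->queued)
		return;

	w->queued = false;
	model_put(w->mm);	/* pairs with model_add_list_work() */
	w->mm = NULL;
}
```

In this model, if the process drops its own reference while an item is queued,
the teardown runs from model_deferred_work rather than from the code path that
queued the item. Once the count has reached zero, model_add_list_work refuses
to queue anything.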
