On 5/8/25 19:25, James Zhu wrote:
>
> On 2025-05-08 11:20, James Zhu wrote:
>>
>> On 2025-05-08 10:50, Christian König wrote:
>>> On 5/8/25 16:46, James Zhu wrote:
>>>> When XNACK is on, a hang or low performance is observed with some test
>>>> cases. The restore-pages process gets unexpectedly stuck during
>>>> eviction/restore if some BO has the KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED
>>>> flag set:
>>>> 1. With XNACK on, a retry pagefault invokes the restore-pages process.
>>>> 2. A. If there is enough VRAM space, pages are simply migrated from RAM
>>>>       to VRAM.
>>>>    B. If there is not enough VRAM space left, the resource LRU list is
>>>>       searched and a new eviction work queue is scheduled to evict the
>>>>       LRU BO from VRAM to RAM first, then the restore-pages process
>>>>       resumes; or it waits for the eviction timeout and tries to
>>>>       schedule eviction of the next LRU BO.
>>>> 3. In case 2B, if the BO has KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED set,
>>>>    a queue eviction is triggered, so the restore work queue is
>>>>    scheduled.
>>>> 4. In step 1, the restore-pages process holds one mm->mmap_lock read
>>>>    until restoring pages is completed.
>>>>    In step 2B, the eviction work queue process holds one mm->mmap_lock
>>>>    read until evicting the BO is completed.
>>>>    In step 3, the restore work queue process tries to acquire the
>>>>    mm->mmap_lock write after the above two mm->mmap_lock reads are
>>>>    released, and in the meantime it blocks all following mm->mmap_lock
>>>>    read requests.
>>>> 5. In step 2, if the first evicted BO's size is big enough for the
>>>>    restore-pages request of step 1, everything is fine. If not, the
>>>>    mm->mmap_lock read from step 1 is not released right away. In step 3,
>>>>    the first evicted BO's restore work queue competes for the
>>>>    mm->mmap_lock write, and the second and following LRU BOs' eviction
>>>>    work queues are blocked trying to acquire the mm->mmap_lock read
>>>>    until timeout. All restore-pages processes get stuck here.
>>>> Using down_write_trylock instead of mmap_write_lock helps avoid
>>>> blocking the second and following eviction work queue processes.
>>>>
>>>> Signed-off-by: James Zhu <[email protected]>
>>>> ---
>>>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 6 +++++-
>>>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>> index 72be6e152e88..5f6ed70559b7 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>> @@ -1794,7 +1794,11 @@ svm_range_list_lock_and_flush_work(struct svm_range_list *svms,
>>>>  {
>>>>  retry_flush_work:
>>>>  	flush_work(&svms->deferred_list_work);
>>>> -	mmap_write_lock(mm);
>>>> +	while (true) {
>>>> +		if (down_write_trylock(&(mm->mmap_lock)))
>>>> +			break;
>>>> +		schedule();
>>>> +	}
>>>
>>> Oh, stuff like that is usually an absolutely clear NAK from upstream.
>>>
>>> As far as I know that is not something we can do so easily.
>>>
>>> Would it be possible to wait for progress instead of calling schedule()
>>> here?
>>
>> [JZ] At first I was thinking of adding synchronization with restore-pages
>> completion, but the original restore work's design philosophy is to
>> blindly schedule it after a certain delay. Changes that add
>> synchronization may take more time and carry more risk. I would like
>> Felix and Philip to comment on whether there is an efficient and safe
>> way to fix this. Thanks!
>
> [JZ] BTW, in the worst case, mmap_write_lock falls into
> rwsem_down_write_slowpath(), schedule_preempt_disabled() and schedule();
Yeah, but drivers are not allowed to re-implement or even bypass that logic.

Regards,
Christian.

>
>>
>>> Regards,
>>> Christian.
>>>
>>>> 	if (list_empty(&svms->deferred_range_list))
>>>> 		return;
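The two approaches debated above — the patch's trylock-plus-schedule() spin versus Christian's suggested "wait for progress" — can be contrasted in a small userspace sketch. This is a hedged toy model in Python, not kernel code: `ProgressLock`, `lock_spinning`, and `lock_waiting` are illustrative names invented here, and a `threading.Condition` stands in for whatever progress-signalling mechanism (e.g. a wait queue) a real kernel fix would use.

```python
import threading
import time

class ProgressLock:
    """Toy model contrasting a trylock+yield spin with waiting for
    progress: here, unlock() notifies waiters so a blocked contender can
    sleep until the holder releases instead of spinning."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cond = threading.Condition()

    def trylock(self):
        # Analogue of down_write_trylock(): never blocks.
        return self._lock.acquire(blocking=False)

    def unlock(self):
        self._lock.release()
        with self._cond:
            self._cond.notify_all()  # signal "progress" to any waiter

    def lock_spinning(self):
        # The pattern from the patch: trylock, yield, retry.
        while not self.trylock():
            time.sleep(0)  # cooperative yield, analogue of schedule()

    def lock_waiting(self):
        # "Wait for progress": sleep until a release is signalled, then
        # retry. The timeout guards against a wakeup lost in the window
        # between a failed trylock and entering wait().
        while not self.trylock():
            with self._cond:
                self._cond.wait(timeout=0.1)

def demo():
    pl = ProgressLock()
    order = []
    pl.lock_spinning()  # uncontended, succeeds immediately

    def contender():
        pl.lock_waiting()  # sleeps instead of spinning while blocked
        order.append("contender acquired")
        pl.unlock()

    t = threading.Thread(target=contender)
    t.start()
    time.sleep(0.05)  # let the contender block on the held lock
    order.append("holder released")
    pl.unlock()  # wakes the contender via notify_all
    t.join()
    return order
```

The spinning variant burns CPU on every failed trylock, which is why upstream pushes back on it; the waiting variant only runs when the holder has actually made progress, mirroring what the blocked-task handling inside rwsem_down_write_slowpath() already provides for mmap_write_lock.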
