On 5/9/25 14:45, James Zhu wrote:
>
> On 2025-05-09 02:00, Christian König wrote:
>> On 5/8/25 19:25, James Zhu wrote:
>>> On 2025-05-08 11:20, James Zhu wrote:
>>>> On 2025-05-08 10:50, Christian König wrote:
>>>>> On 5/8/25 16:46, James Zhu wrote:
>>>>>> When XNACK is on, a hang or low performance is observed with some test
>>>>>> cases. The restoring pages process gets unexpectedly stuck during
>>>>>> evicting/restoring if some bo has the
>>>>>> KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED flag set:
>>>>>> 1. When XNACK is on, a retry pagefault invokes the restoring pages
>>>>>>    process.
>>>>>> 2. A. If there is enough VRAM space, pages are simply migrated from ram
>>>>>>       to vram.
>>>>>>    B. If there is not enough VRAM space left, the resource LRU list is
>>>>>>       searched and a new eviction work queue is scheduled to evict the
>>>>>>       LRU bo from vram to ram first, then the restoring pages process
>>>>>>       resumes, or it waits for the eviction timeout and tries to
>>>>>>       schedule eviction of the next LRU bo.
>>>>>> 3. For case 2B, if the bo has KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED set,
>>>>>>    queue eviction will be triggered, so the restoring work queue will
>>>>>>    be scheduled.
>>>>>> 4. In step 1, the restoring pages process holds one mm->mmap_lock read
>>>>>>    until restoring pages is completed.
>>>>>>    In step 2B, the eviction work queue process holds one mm->mmap_lock
>>>>>>    read until evicting the bo is completed.
>>>>>>    In step 3, the restoring work queue process tries to acquire one
>>>>>>    mm->mmap_lock write after the above two mm->mmap_lock reads are
>>>>>>    released, and in the meantime it blocks all following mm->mmap_lock
>>>>>>    read requests.
>>>>>> 5. In step 2, if the first evicted bo's size is big enough for the
>>>>>>    step 1 restoring pages request, everything is fine. If not, the
>>>>>>    mm->mmap_lock read from step 1 won't be released right away. In
>>>>>>    step 3, the first evicted bo's restoring work queue will compete for
>>>>>>    the mm->mmap_lock write, and the second and following LRU bos'
>>>>>>    eviction work queues will be blocked trying to acquire the
>>>>>>    mm->mmap_lock read until timeout. The whole restoring pages process
>>>>>>    gets stuck here.
>>>>>> Using down_write_trylock instead of mmap_write_lock helps avoid
>>>>>> blocking the second and following eviction work queue processes.
>>>>>>
>>>>>> Signed-off-by: James Zhu <[email protected]>
>>>>>> ---
>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 6 +++++-
>>>>>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>>>> index 72be6e152e88..5f6ed70559b7 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>>>>> @@ -1794,7 +1794,11 @@ svm_range_list_lock_and_flush_work(struct svm_range_list *svms,
>>>>>>  {
>>>>>>  retry_flush_work:
>>>>>>  	flush_work(&svms->deferred_list_work);
>>>>>> -	mmap_write_lock(mm);
>>>>>> +	while (true) {
>>>>>> +		if (down_write_trylock(&(mm->mmap_lock)))
>>>>>> +			break;
>>>>>> +		schedule();
>>>>>> +	}
>>>>> Oh, stuff like that is usually an absolutely clear NAK from upstream.
>>>>>
>>>>> As far as I know that is not something we can do so easily.
>>>>>
>>>>> Would it be possible to wait for progress instead of calling schedule()
>>>>> here?
>>>> [JZ] At the very beginning, I was thinking of adding a sync with the
>>>> restoring pages being done,
>>>>
>>>> but the original restoring work design philosophy is to blindly schedule
>>>> it after a certain delay.
>>>>
>>>> The changes with sync may take more time and risk. I would like Felix
>>>> and Philip to comment on whether there is an efficient and safe way to
>>>> fix it. Thanks!
>>> [JZ] BTW, in the worst case, mmap_write_lock will fall into
>>> rwsem_down_write_slowpath(), schedule_preempt_disabled() and schedule();
>> Yeah, but drivers are not allowed to re-implement or even bypass that logic.
>
> [JZ] This can be taken as a new version of mmap_write_lock that does not
> block the following read requests or compete for the write lock, which
> means reads have higher priority in this case. Logically a sync is the
> better way to replace it. In practice, under the current scenario, a sync
> will increase other burdens and risks.
Yeah, and exactly that is messing with mutex internals and is an absolutely
clear no-go. Stuff like that is not allowed, not even remotely.

Regards,
Christian.

>
>> Regards,
>> Christian.
>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>  	if (list_empty(&svms->deferred_range_list))
>>>>>>  		return;
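
For illustration, a minimal sketch of the "wait for progress instead of
calling schedule()" idea raised in the thread, assuming a hypothetical
driver-owned wait queue (restore_progress_wq) and progress flag signalled by
the eviction worker; none of these names come from the patch or the driver,
and, as the thread concludes, any loop that keeps retrying mm->mmap_lock
still bypasses the rwsem's writer fairness, so a real fix would need a proper
sync rather than this.

#include <linux/mm_types.h>
#include <linux/mmap_lock.h>
#include <linux/wait.h>
#include <linux/jiffies.h>

/* Hypothetical state, for illustration only: in a real driver this would
 * live in the per-process svm_range_list, not in globals. */
static DECLARE_WAIT_QUEUE_HEAD(restore_progress_wq);
static bool eviction_made_progress;

/* The eviction worker would call this after each bo has been evicted. */
static void signal_eviction_progress(void)
{
	WRITE_ONCE(eviction_made_progress, true);
	wake_up_all(&restore_progress_wq);
}

/* Instead of busy-looping with schedule(), sleep until the eviction worker
 * reports progress (or a timeout expires), then retry the trylock. */
static void lock_mmap_waiting_for_progress(struct mm_struct *mm)
{
	while (!mmap_write_trylock(mm)) {
		wait_event_timeout(restore_progress_wq,
				   READ_ONCE(eviction_made_progress),
				   msecs_to_jiffies(20));
		WRITE_ONCE(eviction_made_progress, false);
	}
}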
