Hi David, Lorenzo,

Thank you for the patient and direct feedback.
You are right; I misjudged the scope and abstraction here.

My initial local fix was in the AMD driver path and addressed the failure I was seeing there. I then tried to move the solution into MM core because I guessed similar notifier users might hit the same class of problem. David's explanation makes clear that this was the wrong model: an MMU notifier user must tolerate unmap/remap, and mappings that cannot tolerate that need a different mechanism, such as page pinning, not a driver-controlled THP policy in MM core.

Sorry for the noise and for taking reviewer time. I appreciate the explanation, since it corrected my understanding of the expected MMU notifier and THP semantics.

On the AI assistance: I disclosed it because it was involved, and I did review the generated code against the behavior I thought I wanted. The failure here was my own misunderstanding of the MM core contract, which led to an inappropriate patch despite that review.

I will drop this series and will not send a v2 for this approach. I will re-scope the work to the AMDGPU/KFD side, with a minimal reproducer and a discussion/question first if MM input is needed, rather than proposing MM core changes.

Thanks again,
Yitao

________________________________
From: Lorenzo Stoakes <[email protected]>
Sent: June 25, 2026 7:54
To: David Hildenbrand (Arm) <[email protected]>
Cc: Yitao Jiang <[email protected]>; Alex Deucher <[email protected]>; Christian König <[email protected]>; David Airlie <[email protected]>; Simona Vetter <[email protected]>; Felix Kuehling <[email protected]>; Andrew Morton <[email protected]>; Zi Yan <[email protected]>; Baolin Wang <[email protected]>; Liam R. Howlett <[email protected]>; Nico Pache <[email protected]>; Ryan Roberts <[email protected]>; Dev Jain <[email protected]>; Barry Song <[email protected]>; Lance Yang <[email protected]>; Vlastimil Babka <[email protected]>; Mike Rapoport <[email protected]>; Suren Baghdasaryan <[email protected]>; Michal Hocko <[email protected]>; Jann Horn <[email protected]>; [email protected] <[email protected]>; [email protected] <[email protected]>; [email protected] <[email protected]>; [email protected] <[email protected]>
Subject: Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings

NAK to this or any version of this. This series is insane and the idea is insane.

On Thu, Jun 25, 2026 at 01:47:25PM +0200, David Hildenbrand (Arm) wrote:
> On 6/25/26 12:59, Yitao Jiang wrote:
> > Hi,
> >
> > This series fixes a THP policy problem I found while debugging
> > frequent ROCm GPU failures on an AMD Radeon 780M system during ML
> > training.
> >
> > Some AMDGPU/KFD user mappings are registered through interval
> > notifiers and cannot safely tolerate the backing VMA changing from base
> > pages to a transparent huge page after registration. Userspace can
> > still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
> > collapse the range, after the GPU mapping has been registered.
>
> Huh, why? As a memory notifier user, you must be prepared for memory to get
> unmapped+remapped at random points in time.
>
> What is the precise problem here? How are you handling THPs at registration
> time?
>
> Letting arbitrary drivers make THP policies sounds like the very wrong
> approach.

We absolutely will not _ever_ allow drivers to do this while I still breathe :)

>
> --
> Cheers,
>
> David

Thanks, Lorenzo
