Hi David, Lorenzo,

Thank you for the patient and direct feedback.

You are right; I misjudged the scope and abstraction here. My initial local fix 
was in the AMD driver path and addressed the failure I was seeing there. I then 
tried to move the solution into MM core because I guessed similar notifier 
users might hit the same class of problem. David's explanation makes clear that 
this was the wrong model: an MMU notifier user must tolerate unmap/remap, and 
mappings that cannot tolerate that need a different mechanism, such as page 
pinning, not a driver-controlled THP policy in MM core.

Sorry for the noise and for taking reviewer time. I appreciate the explanation, 
since it corrected my understanding of the expected MMU notifier and THP 
semantics.

On the AI assistance: I disclosed it because it was involved, and I did review 
the generated code against the behavior I thought I wanted. The failure here 
was my own misunderstanding of the MM core contract, which led to an 
inappropriate patch despite that review.

I will drop this series and will not send a v2 for this approach. I will 
re-scope the work to the AMDGPU/KFD side, starting with a minimal reproducer, 
and will ask a question on the list first if MM input is needed, rather than 
proposing MM core changes.

Thanks again,
Yitao
________________________________
From: Lorenzo Stoakes <[email protected]>
Sent: June 25, 2026 7:54
To: David Hildenbrand (Arm) <[email protected]>
Cc: Yitao Jiang <[email protected]>; Alex Deucher 
<[email protected]>; Christian König <[email protected]>; David 
Airlie <[email protected]>; Simona Vetter <[email protected]>; Felix Kuehling 
<[email protected]>; Andrew Morton <[email protected]>; Zi Yan 
<[email protected]>; Baolin Wang <[email protected]>; Liam R . 
Howlett <[email protected]>; Nico Pache <[email protected]>; Ryan Roberts 
<[email protected]>; Dev Jain <[email protected]>; Barry Song 
<[email protected]>; Lance Yang <[email protected]>; Vlastimil Babka 
<[email protected]>; Mike Rapoport <[email protected]>; Suren Baghdasaryan 
<[email protected]>; Michal Hocko <[email protected]>; Jann Horn 
<[email protected]>; [email protected] 
<[email protected]>; [email protected] 
<[email protected]>; [email protected] 
<[email protected]>; [email protected] <[email protected]>
Subject: Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings

NAK to this or any version of this.

This series is insane and the idea is insane.

On Thu, Jun 25, 2026 at 01:47:25PM +0200, David Hildenbrand (Arm) wrote:
> On 6/25/26 12:59, Yitao Jiang wrote:
> > Hi,
> >
> > This series fixes a THP policy problem I found while debugging
> > frequent ROCm GPU failures on an AMD Radeon 780M system during ML
> > training.
> >
> > Some AMDGPU/KFD user mappings are registered through interval
> > notifiers and cannot safely tolerate the backing VMA changing from base
> > pages to a transparent huge page after registration. Userspace can
> > still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
> > collapse the range, after the GPU mapping has been registered.
>
> Huh, why? As a memory notifier user, you must be prepared for memory to get
> unmapped+remapped at random points in time.
>
> What is the precise problem here? How are you handling THPs at registration 
> time?
>
> Letting arbitrary drivers make THP policies sounds like the very wrong 
> approach.

We absolutely will not _ever_ allow drivers to do this while I still breathe :)

>
> --
> Cheers,
>
> David

Thanks, Lorenzo
