On Tue, Mar 31, 2026 at 05:32:28PM +0300, Mike Rapoport wrote:
> Hi Harry,
>
> On Tue, Mar 31, 2026 at 04:03:13PM +0900, Harry Yoo (Oracle) wrote:
> > On Mon, Mar 30, 2026 at 01:11:03PM +0300, Mike Rapoport wrote:
> > > From: "Mike Rapoport (Microsoft)" <[email protected]>
> > >
> > > mfill_atomic() passes a lot of parameters down to its callees.
> > >
> > > Aggregate them all into mfill_state structure and pass this structure to
> > > functions that implement various UFFDIO_ commands.
> > >
> > > Tracking the state in a structure will allow moving the code that retries
> > > copying of data for UFFDIO_COPY into mfill_atomic_pte_copy() and make the
> > > loop in mfill_atomic() identical for all UFFDIO operations on PTE-mapped
> > > memory.
> > >
> > > The mfill_state definition is deliberately local to mm/userfaultfd.c,
> > > hence shmem_mfill_atomic_pte() is not updated.
> > >
> > > [[email protected]: properly initialize mfill_state.len to fix
> > > folio_add_new_anon_rmap() WARN]
> > > Link: https://lkml.kernel.org/r/abehBY7QakYF9bK4@hyeyoo
> > > Signed-off-by: Mike Rapoport (Microsoft) <[email protected]>
> > > Signed-off-by: Harry Yoo <[email protected]>
> > > Acked-by: David Hildenbrand (Arm) <[email protected]>
> > > ---
> > > mm/userfaultfd.c | 148 ++++++++++++++++++++++++++---------------------
> > > 1 file changed, 82 insertions(+), 66 deletions(-)
> > >
> > > @@ -790,12 +804,14 @@ static __always_inline ssize_t mfill_atomic(struct
> > > userfaultfd_ctx *ctx,
> > > uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> > > goto out_unlock;
> > >
> > > - while (src_addr < src_start + len) {
> > > - pmd_t dst_pmdval;
> > > + state.vma = dst_vma;
> >
> > Oh wait, the lock leak was introduced in patch 2.
>
> Lock leak was introduced in patch 4 that moved getting the vma.
Still not sure what I could possibly be missing, so let me try again.
when I check out to this commit "userfaultfd: introduce struct mfill_state" I
see:
| static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
| unsigned long dst_start,
| unsigned long src_start,
| unsigned long len,
| uffd_flags_t flags)
| {
| struct mfill_state state = (struct mfill_state){
| .ctx = ctx,
| .dst_start = dst_start,
| .src_start = src_start, .flags = flags,
| .len = len,
| .src_addr = src_start,
| .dst_addr = dst_start,
| };
|
[ ...snip...]
| retry:
| /*
| * Make sure the vma is not shared, that the dst range is
| * both valid and fully within a single existing vma.
| */
| dst_vma = uffd_mfill_lock(dst_mm, dst_start, len);
It acquires the vma lock (or mmap_lock) here, but doesn't set state.vma.
| if (IS_ERR(dst_vma)) {
| err = PTR_ERR(dst_vma);
| goto out;
| }
|
| /*
| * If memory mappings are changing because of non-cooperative
| * operation (e.g. mremap) running in parallel, bail out and
| * request the user to retry later
| */
| down_read(&ctx->map_changing_lock);
| err = -EAGAIN;
| if (atomic_read(&ctx->mmap_changing))
| goto out_unlock;
|
| err = -EINVAL;
| /*
| * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED
but
| * it will overwrite vm_ops, so vma_is_anonymous must return false.
| */
| if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) &&
| dst_vma->vm_flags & VM_SHARED))
|
| /*
| * validate 'mode' now that we know the dst_vma: don't allow
| * a wrprotect copy if the userfaultfd didn't register as WP.
| */
| if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP))
| goto out_unlock;
|
| /*
| * If this is a HUGETLB vma, pass off to appropriate routine
| */
| if (is_vm_hugetlb_page(dst_vma))
| return mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
| src_start, len, flags);
|
| if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
| goto out_unlock;
| if (!vma_is_shmem(dst_vma) &&
| uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
| goto out_unlock;
|
| state.vma = dst_vma;
It is set here. So if anything before this jumps to `out_unlock`
label due to a sanity check,
[...]
| while (state.src_addr < src_start + len) {
| VM_WARN_ON_ONCE(state.dst_addr >= dst_start + len);
|
| pmd_t dst_pmdval;
| [...]
|
| out_unlock:
| up_read(&ctx->map_changing_lock);
| uffd_mfill_unlock(state.vma);
the `vma` parameter will be NULL?
If I'm not missing something this is introduced in patch 2 and
fixed in patch 4.
| out:
| if (state.folio)
| folio_put(state.folio);
| VM_WARN_ON_ONCE(copied < 0);
| VM_WARN_ON_ONCE(err > 0);
| VM_WARN_ON_ONCE(!copied && !err);
| return copied ? copied : err;
| }
> > > @@ -866,10 +882,10 @@ static __always_inline ssize_t mfill_atomic(struct
> > > userfaultfd_ctx *ctx,
> > >
> > > out_unlock:
> > > up_read(&ctx->map_changing_lock);
> > > - uffd_mfill_unlock(dst_vma);
> > > + uffd_mfill_unlock(state.vma);
> > > out:
> > > - if (folio)
> > > - folio_put(folio);
> > > + if (state.folio)
> > > + folio_put(state.folio);
> >
> > Sashiko raised a concern [2] that it the VMA might be unmapped and
> > a new mapping created as a uffd hugetlb vma and leak the folio by
> > going through
> >
> > `if (is_vm_hugetlb_page(dst_vma))
> > return mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
> > src_start, len, flags);`
> >
> > but it appears to be a false positive (to me) because
> >
> > `if (atomic_read(&ctx->mmap_changing))` check should have detected unmapping
> > and free the folio?
>
> I think it's real, and it's there more or less from the beginning, although
> nobody hit it yet :)
>
> Before retrying the copy we drop all the locks, so if the copy is really
> long the old mapping can be wiped and a new mapping can be created instead.
Oops, perhaps I should have imagined harder :)
> There's already a v4 of a patch that attempts to solve this:
>
> https://lore.kernel.org/all/[email protected]
Thanks for the pointer!
> > [2]
> > https://sashiko.dev/#/patchset/20260330101116.1117699-1-rppt%40kernel.org?patch=13671
> >
> > > VM_WARN_ON_ONCE(copied < 0);
> > > VM_WARN_ON_ONCE(err > 0);
> > > VM_WARN_ON_ONCE(!copied && !err);
--
Cheers,
Harry / Hyeonggon