maps under RCU

Suren Baghdasaryan Tue, 29 Apr 2025 10:14:43 -0700

On Tue, Apr 29, 2025 at 8:40 AM Jann Horn <[email protected]> wrote:
>
> On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan <[email protected]> wrote:
> > With maple_tree supporting vma tree traversal under RCU and vma and
> > its important members being RCU-safe, /proc/pid/maps can be read under
> > RCU and without the need to read-lock mmap_lock. However vma content
> > can change from under us, therefore we make a copy of the vma and we
> > pin pointer fields used when generating the output (currently only
> > vm_file and anon_name). Afterwards we check for concurrent address
> > space modifications, wait for them to end and retry. While we take
> > the mmap_lock for reading during such contention, we do that momentarily
> > only to record new mm_wr_seq counter. This change is designed to reduce
> > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > (often a low priority task, such as monitoring/data collection services)
> > from blocking address space updates.
> [...]
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index b9e4fbbdf6e6..f9d50a61167c 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> [...]
> > +/*
> > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > + * show_map_vma.
> > + */
> > +static int get_vma_snapshot(struct proc_maps_private *priv, struct 
> > vm_area_struct *vma)
> > +{
> > +       struct vm_area_struct *copy = &priv->vma_copy;
> > +       int ret = -EAGAIN;
> > +
> > +       memcpy(copy, vma, sizeof(*vma));
> > +       if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > +               goto out;
>
> I think this uses get_file_rcu() in a different way than intended.
>
> As I understand it, get_file_rcu() is supposed to be called on a
> pointer which always points to a file with a non-zero refcount (except
> when it is NULL). That's why it takes a file** instead of a file* - if
> it observes a zero refcount, it assumes that the pointer must have
> been updated in the meantime, and retries. Calling get_file_rcu() on a
> pointer that points to a file with zero refcount, which I think can
> happen with this patch, will cause an endless loop.
> (Just as background: For other usecases, get_file_rcu() is supposed to
> still behave nicely and not spuriously return NULL when the file* is
> concurrently updated to point to another file*; that's what that loop
> is for.)


Ah, I see. I wasn't aware of this subtlety. I think this is fixable by
checking the return value of get_file_rcu() and retrying speculation
if it changed.

> (If my understanding is correct, maybe we should document that more
> explicitly...)

Good point. I'll add comments for get_file_rcu() as a separate patch.

>
> Also, I think you are introducing an implicit assumption that
> remove_vma() does not NULL out the ->vm_file pointer (because that
> could cause tearing and could theoretically lead to a torn pointer
> being accessed here).
>
> One alternative might be to change the paths that drop references to
> vma->vm_file (search for vma_close to find them) such that they first
> NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that,
> maybe with a new helper like this:
>
> static void vma_fput(struct vm_area_struct *vma)
> {
>   struct file *file = vma->vm_file;
>
>   if (file) {
>     WRITE_ONCE(vma->vm_file, NULL);
>     fput(file);
>   }
> }
>
> Then on the lockless lookup path you could use get_file_rcu() on the
> ->vm_file pointer _of the original VMA_, and store the returned file*
> into copy->vm_file.

Ack. Except for storing the return value of get_file_rcu(). I think
once we detect that  get_file_rcu() returns a different file we should
bail out and retry. The change in file is an indication that the vma
got changed from under us, so whatever we have is stale.

>
> > +       if (!anon_vma_name_get_if_valid(copy))
> > +               goto put_file;
> > +
> > +       if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
> > +               return 0;
>
> We only check for concurrent updates at this point, so up to here,
> anything we read from "copy" could contain torn pointers (both because
> memcpy() is not guaranteed to copy pointers atomically and because the
> updates to the original VMA are not done with WRITE_ONCE()).
> That probably means that something like the preceding
> anon_vma_name_get_if_valid() could crash on an access to a torn
> pointer.
> Please either do another mmap_lock_speculate_retry() check directly
> after the memcpy(), or ensure nothing before this point reads from
> "copy".

Ack. I'll add mmap_lock_speculate_retry() check right after memcpy().

>
> > +       /* Address space got modified, vma might be stale. Re-lock and 
> > retry. */
> > +       rcu_read_unlock();
> > +       ret = mmap_read_lock_killable(priv->mm);
> > +       if (!ret) {
> > +               /* mmap_lock_speculate_try_begin() succeeds when holding 
> > mmap_read_lock */
> > +               mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
> > +               mmap_read_unlock(priv->mm);
> > +               ret = -EAGAIN;
> > +       }
> > +
> > +       rcu_read_lock();
> > +
> > +       anon_vma_name_put_if_valid(copy);
> > +put_file:
> > +       if (copy->vm_file)
> > +               fput(copy->vm_file);
> > +out:
> > +       return ret;
> > +}
> [...]
> > @@ -266,39 +399,41 @@ static void get_vma_name(struct vm_area_struct *vma,
> >                 } else {
> >                         *path = file_user_path(vma->vm_file);
> >                 }
> > -               return;
> > +               goto out;
> >         }
> >
> >         if (vma->vm_ops && vma->vm_ops->name) {
> >                 *name = vma->vm_ops->name(vma);
>
> This seems to me like a big, subtle change of semantics. After this
> change, vm_ops->name() will no longer receive a real VMA; and in
> particular, I think the .name implementation special_mapping_name used
> in special_mapping_vmops will have a UAF because it relies on
> vma->vm_private_data pointing to a live object.

Ah, I see. IOW, vma->vm_private_data might change from under us and I
don't detect that. Moreover this is just an example and .name() might
depend on other things.

>
> I think you'll need to fall back to using the mmap lock and the real
> VMA if you see a non-NULL vma->vm_ops->name pointer.

Yeah, either that or obtain the name and make a copy during
get_vma_snapshot() using original vma. Will need to check which way is
better.

>
> >                 if (*name)
> > -                       return;
> > +                       goto out;
> >         }
> >
> >         *name = arch_vma_name(vma);
> >         if (*name)
> > -               return;
> > +               goto out;
> >
> >         if (!vma->vm_mm) {
> >                 *name = "[vdso]";
> > -               return;
> > +               goto out;
> >         }
> >
> >         if (vma_is_initial_heap(vma)) {
> >                 *name = "[heap]";
> > -               return;
> > +               goto out;
> >         }
> >
> >         if (vma_is_initial_stack(vma)) {
> >                 *name = "[stack]";
> > -               return;
> > +               goto out;
> >         }
> >
> >         if (anon_name) {
> >                 *name_fmt = "[anon:%s]";
> >                 *name = anon_name->name;
> > -               return;
> >         }
> > +out:
> > +       if (anon_name && !mmap_locked)
> > +               anon_vma_name_put(anon_name);
>
> Isn't this refcount drop too early, causing UAF read? We drop the
> reference on the anon_name here, but (on some paths) we're about to
> return anon_name->name to the caller through *name, and the caller
> will read from it.
>
> Ah, but I guess it's actually fine because the refcount increment was
> unnecessary in the first place, because the vma pointer actually
> points to a copy of the original VMA, and the copy has its own
> refcounted reference to the anon_name thanks to get_vma_snapshot()?
> It might be helpful to have some comments documenting which VMA
> pointers can point to copies.

If I follow Andrii's suggestion and avoid vma copying I'll need to
implement careful handling of pointers and allow get_vma_name() to
fail indicating that vma changed from under us. Let me see if this is
still doable in the light of your findings.

Thanks for the insightful review and welcome back!

Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

Reply via email to