On Tue, Apr 29, 2025 at 8:40 AM Jann Horn <[email protected]> wrote:
>
> On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan <[email protected]> wrote:
> > With maple_tree supporting vma tree traversal under RCU and vma and
> > its important members being RCU-safe, /proc/pid/maps can be read under
> > RCU and without the need to read-lock mmap_lock. However vma content
> > can change from under us, therefore we make a copy of the vma and we
> > pin pointer fields used when generating the output (currently only
> > vm_file and anon_name). Afterwards we check for concurrent address
> > space modifications, wait for them to end and retry. While we take
> > the mmap_lock for reading during such contention, we do that momentarily
> > only to record new mm_wr_seq counter. This change is designed to reduce
> > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > (often a low priority task, such as monitoring/data collection services)
> > from blocking address space updates.
> [...]
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index b9e4fbbdf6e6..f9d50a61167c 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> [...]
> > +/*
> > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > + * show_map_vma.
> > + */
> > +static int get_vma_snapshot(struct proc_maps_private *priv, struct
> > vm_area_struct *vma)
> > +{
> > + struct vm_area_struct *copy = &priv->vma_copy;
> > + int ret = -EAGAIN;
> > +
> > + memcpy(copy, vma, sizeof(*vma));
> > + if (copy->vm_file && !get_file_rcu(©->vm_file))
> > + goto out;
>
> I think this uses get_file_rcu() in a different way than intended.
>
> As I understand it, get_file_rcu() is supposed to be called on a
> pointer which always points to a file with a non-zero refcount (except
> when it is NULL). That's why it takes a file** instead of a file* - if
> it observes a zero refcount, it assumes that the pointer must have
> been updated in the meantime, and retries. Calling get_file_rcu() on a
> pointer that points to a file with zero refcount, which I think can
> happen with this patch, will cause an endless loop.
> (Just as background: For other usecases, get_file_rcu() is supposed to
> still behave nicely and not spuriously return NULL when the file* is
> concurrently updated to point to another file*; that's what that loop
> is for.)
Ah, I see. I wasn't aware of this subtlety. I think this is fixable by
checking the return value of get_file_rcu() and retrying speculation
if it changed.
> (If my understanding is correct, maybe we should document that more
> explicitly...)
Good point. I'll add comments for get_file_rcu() as a separate patch.
>
> Also, I think you are introducing an implicit assumption that
> remove_vma() does not NULL out the ->vm_file pointer (because that
> could cause tearing and could theoretically lead to a torn pointer
> being accessed here).
>
> One alternative might be to change the paths that drop references to
> vma->vm_file (search for vma_close to find them) such that they first
> NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that,
> maybe with a new helper like this:
>
> static void vma_fput(struct vm_area_struct *vma)
> {
> struct file *file = vma->vm_file;
>
> if (file) {
> WRITE_ONCE(vma->vm_file, NULL);
> fput(file);
> }
> }
>
> Then on the lockless lookup path you could use get_file_rcu() on the
> ->vm_file pointer _of the original VMA_, and store the returned file*
> into copy->vm_file.
Ack. Except for storing the return value of get_file_rcu(). I think
once we detect that get_file_rcu() returns a different file we should
bail out and retry. The change in file is an indication that the vma
got changed from under us, so whatever we have is stale.
>
> > + if (!anon_vma_name_get_if_valid(copy))
> > + goto put_file;
> > +
> > + if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
> > + return 0;
>
> We only check for concurrent updates at this point, so up to here,
> anything we read from "copy" could contain torn pointers (both because
> memcpy() is not guaranteed to copy pointers atomically and because the
> updates to the original VMA are not done with WRITE_ONCE()).
> That probably means that something like the preceding
> anon_vma_name_get_if_valid() could crash on an access to a torn
> pointer.
> Please either do another mmap_lock_speculate_retry() check directly
> after the memcpy(), or ensure nothing before this point reads from
> "copy".
Ack. I'll add mmap_lock_speculate_retry() check right after memcpy().
>
> > + /* Address space got modified, vma might be stale. Re-lock and
> > retry. */
> > + rcu_read_unlock();
> > + ret = mmap_read_lock_killable(priv->mm);
> > + if (!ret) {
> > + /* mmap_lock_speculate_try_begin() succeeds when holding
> > mmap_read_lock */
> > + mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
> > + mmap_read_unlock(priv->mm);
> > + ret = -EAGAIN;
> > + }
> > +
> > + rcu_read_lock();
> > +
> > + anon_vma_name_put_if_valid(copy);
> > +put_file:
> > + if (copy->vm_file)
> > + fput(copy->vm_file);
> > +out:
> > + return ret;
> > +}
> [...]
> > @@ -266,39 +399,41 @@ static void get_vma_name(struct vm_area_struct *vma,
> > } else {
> > *path = file_user_path(vma->vm_file);
> > }
> > - return;
> > + goto out;
> > }
> >
> > if (vma->vm_ops && vma->vm_ops->name) {
> > *name = vma->vm_ops->name(vma);
>
> This seems to me like a big, subtle change of semantics. After this
> change, vm_ops->name() will no longer receive a real VMA; and in
> particular, I think the .name implementation special_mapping_name used
> in special_mapping_vmops will have a UAF because it relies on
> vma->vm_private_data pointing to a live object.
Ah, I see. IOW, vma->vm_private_data might change from under us and I
don't detect that. Moreover this is just an example and .name() might
depend on other things.
>
> I think you'll need to fall back to using the mmap lock and the real
> VMA if you see a non-NULL vma->vm_ops->name pointer.
Yeah, either that or obtain the name and make a copy during
get_vma_snapshot() using original vma. Will need to check which way is
better.
>
> > if (*name)
> > - return;
> > + goto out;
> > }
> >
> > *name = arch_vma_name(vma);
> > if (*name)
> > - return;
> > + goto out;
> >
> > if (!vma->vm_mm) {
> > *name = "[vdso]";
> > - return;
> > + goto out;
> > }
> >
> > if (vma_is_initial_heap(vma)) {
> > *name = "[heap]";
> > - return;
> > + goto out;
> > }
> >
> > if (vma_is_initial_stack(vma)) {
> > *name = "[stack]";
> > - return;
> > + goto out;
> > }
> >
> > if (anon_name) {
> > *name_fmt = "[anon:%s]";
> > *name = anon_name->name;
> > - return;
> > }
> > +out:
> > + if (anon_name && !mmap_locked)
> > + anon_vma_name_put(anon_name);
>
> Isn't this refcount drop too early, causing UAF read? We drop the
> reference on the anon_name here, but (on some paths) we're about to
> return anon_name->name to the caller through *name, and the caller
> will read from it.
>
> Ah, but I guess it's actually fine because the refcount increment was
> unnecessary in the first place, because the vma pointer actually
> points to a copy of the original VMA, and the copy has its own
> refcounted reference to the anon_name thanks to get_vma_snapshot()?
> It might be helpful to have some comments documenting which VMA
> pointers can point to copies.
If I follow Andrii's suggestion and avoid vma copying I'll need to
implement careful handling of pointers and allow get_vma_name() to
fail indicating that vma changed from under us. Let me see if this is
still doable in the light of your findings.
Thanks for the insightful review and welcome back!