On 3/16/26 22:40, Konstantin Khorenko wrote:
> tcache_attach_page() inserts a page into the per-node radix tree under
> tree_lock, then releases the lock and calls tcache_lru_add().  Between
> releasing tree_lock and completing tcache_lru_add(), the page is visible
> in the radix tree but not yet on the tcache LRU.
> 
> During this window a concurrent tcache_detach_page() on another CPU can:
>   1. Find the page via radix_tree_lookup (RCU)
>   2. page_cache_get_speculative(page): refcount 1 -> 2
>   3. page_ref_freeze(page, 2): refcount 2 -> 0
>   4. Remove the page from the radix tree
>   5. tcache_lru_del(): page not on LRU yet, skipped
>   6. tcache_put_page() -> free_hot_cold_page(): page freed to PCP list
> 
> Now page->lru links into a PCP free list.  When the original CPU then
> executes tcache_lru_add() -> list_add_tail(&page->lru, &pni->lru), it
> overwrites page->lru, destroying the PCP list linkage.  Subsequent PCP
> allocations follow the stale pointer and hit a poisoned or cross-linked
> lru, causing "list_del corruption" warnings and eventually a hard lockup
> when free_pcppages_bulk() holds zone->lock forever.
> 
> Fix by taking an extra page reference before releasing tree_lock.  This
> makes page_ref_freeze(page, 2) fail on any concurrent detach (refcount
> will be 3, not the expected 2), forcing the detach to retry after the
> page is fully set up (in tree AND on LRU).  The extra reference is
> dropped after tcache_lru_add() completes.
> 
> Note: moving tcache_lru_add() inside the tree_lock critical section would
> cause a lock ordering inversion (tree_lock -> pni->lock vs the shrinker's
> pni->lock -> tree_lock path), so the extra-reference approach is used.
> 
> https: //virtuozzo.atlassian.net/browse/PSBM-161840

nit:    ^ excess space

> 

Reviewed-by: Pavel Tikhomirov <[email protected]>

> Signed-off-by: Konstantin Khorenko <[email protected]>
> ---
>  mm/tcache.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/tcache.c b/mm/tcache.c
> index e8ba7ee26cbef..f95b5ed6cb0bc 100644
> --- a/mm/tcache.c
> +++ b/mm/tcache.c
> @@ -810,9 +810,26 @@ tcache_attach_page(struct tcache_node *node, pgoff_t index, struct page *page)
>        */
>       spin_lock_irqsave(&node->tree_lock, flags);
>       err = tcache_page_tree_insert(node, index, page);
> +     if (!err) {
> +             /*
> +              * Take an extra reference while the page is visible in
> +              * the tree but not yet on the LRU.  Without this,
> +              * a concurrent tcache_detach_page() on another CPU can
> +              * find the page via radix_tree_lookup, succeed with
> +              * page_ref_freeze(page, 2) and free the page to PCP.
> +              * When we then call tcache_lru_add() below, we overwrite
> +              * page->lru which now links into a PCP free list,
> +              * corrupting that list.  The extra reference makes the
> +              * freeze fail (refcount will be 3, not 2), so the
> +              * concurrent detach retries after we finish setup.
> +              */
> +             get_page(page);
> +     }
>       spin_unlock(&node->tree_lock);
> -     if (!err)
> +     if (!err) {
>               tcache_lru_add(node->pool, page);
> +             put_page(page);
> +     }
>       local_irq_restore(flags); /* Implies rcu_read_lock_sched() */
>       return err;
>  }

-- 
Best regards, Pavel Tikhomirov
Senior Software Developer, Virtuozzo.

_______________________________________________
Devel mailing list
[email protected]
https://lists.openvz.org/mailman/listinfo/devel