tcache_attach_page() inserts a page into the per-node radix tree under tree_lock, then releases the lock and calls tcache_lru_add(). Between releasing tree_lock and completing tcache_lru_add(), the page is visible in the radix tree but not yet on the tcache LRU.
During this window a concurrent tcache_detach_page() on another CPU can: 1. Find the page via radix_tree_lookup (RCU) 2. page_cache_get_speculative(page): refcount 1 -> 2 3. page_ref_freeze(page, 2): refcount 2 -> 0 4. Remove the page from the radix tree 5. tcache_lru_del(): page not on LRU yet, skipped 6. tcache_put_page() -> free_hot_cold_page(): page freed to PCP list Now page->lru links into a PCP free list. When the original CPU then executes tcache_lru_add() -> list_add_tail(&page->lru, &pni->lru), it overwrites page->lru destroying the PCP list linkage. Subsequent PCP allocations follow the stale pointer and hit a poisoned or cross-linked lru, causing "list_del corruption" warnings and eventually a hard lockup when free_pcppages_bulk() holds zone->lock forever. Fix by taking an extra page reference before releasing tree_lock. This makes page_ref_freeze(page, 2) fail on any concurrent detach (refcount will be 3, not the expected 2), forcing the detach to retry after the page is fully set up (in tree AND on LRU). The extra reference is dropped after tcache_lru_add() completes. Note: moving tcache_lru_add() inside the tree_lock critical section would cause a lock ordering inversion (tree_lock -> pni->lock vs the shrinker's pni->lock -> tree_lock path), so the extra-reference approach is used. https: //virtuozzo.atlassian.net/browse/PSBM-161840 Signed-off-by: Konstantin Khorenko <[email protected]> --- mm/tcache.c | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/mm/tcache.c b/mm/tcache.c index e8ba7ee26cbef..f95b5ed6cb0bc 100644 --- a/mm/tcache.c +++ b/mm/tcache.c @@ -810,9 +810,26 @@ tcache_attach_page(struct tcache_node *node, pgoff_t index, struct page *page) */ spin_lock_irqsave(&node->tree_lock, flags); err = tcache_page_tree_insert(node, index, page); + if (!err) { + /* + * Take an extra reference while the page is visible in + * the tree but not yet on the LRU. Without this, + * a concurrent tcache_detach_page() on another CPU can + * find the page via radix_tree_lookup, succeed with + * page_ref_freeze(page, 2) and free the page to PCP. + * When we then call tcache_lru_add() below, we overwrite + * page->lru which now links into a PCP free list, + * corrupting that list. The extra reference makes the + * freeze fail (refcount will be 3, not 2), so the + * concurrent detach retries after we finish setup. + */ + get_page(page); + } spin_unlock(&node->tree_lock); - if (!err) + if (!err) { tcache_lru_add(node->pool, page); + put_page(page); + } local_irq_restore(flags); /* Implies rcu_read_lock_sched() */ return err; } -- 2.43.0 _______________________________________________ Devel mailing list [email protected] https://lists.openvz.org/mailman/listinfo/devel
