tcache_attach_page() inserts a page into the per-node radix tree under
tree_lock, then releases the lock and calls tcache_lru_add(). Between
releasing tree_lock and completing tcache_lru_add(), the page is visible
in the radix tree but not yet on the tcache LRU.
During this window, a concurrent tcache_detach_page() on another CPU can:
1. Find the page via radix_tree_lookup() (under RCU)
2. Call page_cache_get_speculative(page): refcount 1 -> 2
3. Call page_ref_freeze(page, 2): refcount 2 -> 0
4. Remove the page from the radix tree
5. Call tcache_lru_del(): the page is not on the LRU yet, so nothing is removed
6. Call tcache_put_page() -> free_hot_cold_page(): the page is freed to a PCP list
At this point page->lru links the page into a PCP free list. When the
original CPU then executes tcache_lru_add() ->
list_add_tail(&page->lru, &pni->lru), it overwrites page->lru,
destroying the PCP list linkage. The PCP list head still has a stale
pointer to the page, but page->lru now points into the tcache LRU.
Subsequent PCP allocations follow the stale pointer and hit a poisoned
or cross-linked lru, causing "list_del corruption" warnings and
eventually a hard lockup, with free_pcppages_bulk() never releasing
zone->lock.
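The list damage described above can be replayed in a small userspace
model. This is only a sketch: struct list_head is re-declared locally,
and model_list_init(), model_list_add_tail(), list_is_consistent() and
pcp_broken_by_late_lru_add() are hypothetical helpers written for this
illustration, not code from mm/tcache.c. It assumes that freeing to PCP
and tcache_lru_add() both boil down to a plain tail insertion.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's struct list_head (assumption). */
struct list_head {
	struct list_head *next, *prev;
};

static void model_list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Same pointer updates as a tail insertion into a circular list. */
static void model_list_add_tail(struct list_head *entry,
				struct list_head *head)
{
	struct list_head *prev = head->prev;

	entry->next = head;
	entry->prev = prev;
	prev->next = entry;
	head->prev = entry;
}

/* A list is consistent if every entry's neighbours point back at it. */
static bool list_is_consistent(const struct list_head *head)
{
	const struct list_head *pos = head;

	do {
		if (pos->next->prev != pos || pos->prev->next != pos)
			return false;
		pos = pos->next;
	} while (pos != head);
	return true;
}

/* Replays step 6 followed by the late LRU insertion on the same page. */
static bool pcp_broken_by_late_lru_add(void)
{
	struct list_head pcp, lru, page_lru;

	model_list_init(&pcp);
	model_list_init(&lru);
	model_list_add_tail(&page_lru, &pcp);	/* detach frees page to PCP */
	model_list_add_tail(&page_lru, &lru);	/* attach adds it to the LRU */

	/* pcp.next still points at the page, but page_lru.prev is now &lru. */
	return !list_is_consistent(&pcp);
}
```

In this model the second insertion leaves the PCP head pointing at an
entry whose links belong to the other list, which is the cross-linked
state the "list_del corruption" check complains about.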
Fix by taking an extra page reference before releasing tree_lock. This
makes page_ref_freeze(page, 2) fail on any concurrent detach (refcount
will be 3, not the expected 2), forcing the detach to retry after the
page is fully set up (in tree AND on LRU). The extra reference is
dropped after tcache_lru_add() completes.
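The refcount arithmetic behind the fix can be checked with a userspace
model built on C11 atomics. This is a sketch under assumptions:
model_ref_freeze() mimics the compare-and-exchange semantics of
page_ref_freeze() (set the count to 0 only if it equals the expected
value), and detach_freeze_succeeds() is a hypothetical helper that
replays the counts from the scenario above. Neither is kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Model of page_ref_freeze(): atomically replace the refcount with 0,
 * but only if it currently equals the expected count.
 */
static bool model_ref_freeze(atomic_int *refcount, int expected)
{
	return atomic_compare_exchange_strong(refcount, &expected, 0);
}

/*
 * Replays a detach that races with an attach which has inserted the
 * page into the tree but has not yet added it to the LRU.
 */
static bool detach_freeze_succeeds(bool attach_holds_extra_ref)
{
	atomic_int refcount;

	atomic_init(&refcount, 1);		/* reference held by the tree */
	if (attach_holds_extra_ref)
		atomic_fetch_add(&refcount, 1);	/* get_page() under tree_lock */
	atomic_fetch_add(&refcount, 1);		/* speculative get in detach */

	return model_ref_freeze(&refcount, 2);	/* detach expects exactly 2 */
}
```

Without the extra reference the detach sees 2 and the freeze succeeds.
With it the detach sees 3, the freeze fails, and the page cannot be
freed until the attach path has dropped its reference.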
Note: moving tcache_lru_add() inside the tree_lock critical section would
cause a lock ordering inversion (tree_lock -> pni->lock vs the shrinker's
pni->lock -> tree_lock path), so the extra-reference approach is used.
Fixes: cb855faae6326 ("mm: introduce transcendent file cache (tcache)")
https://virtuozzo.atlassian.net/browse/PSBM-161840
Signed-off-by: Konstantin Khorenko <[email protected]>
Feature: mm: transcendent file cache (tcache)
---
mm/tcache.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/mm/tcache.c b/mm/tcache.c
index 2452ca33e92d4..f5d13b748e2ee 100644
--- a/mm/tcache.c
+++ b/mm/tcache.c
@@ -805,9 +805,26 @@ tcache_attach_page(struct tcache_node *node, pgoff_t index, struct page *page)
 	spin_lock_irqsave(&node->tree_lock, flags);
 	err = tcache_page_tree_insert(node, index, page);
+	if (!err) {
+		/*
+		 * Take an extra reference while the page is visible in
+		 * the tree but not yet on the LRU. Without this,
+		 * a concurrent tcache_detach_page() on another CPU can
+		 * find the page via radix_tree_lookup, succeed with
+		 * page_ref_freeze(page, 2) and free the page to PCP.
+		 * When we then call tcache_lru_add() below, we overwrite
+		 * page->lru which now links into a PCP free list,
+		 * corrupting that list. The extra reference makes the
+		 * freeze fail (refcount will be 3, not 2), so the
+		 * concurrent detach retries after we finish setup.
+		 */
+		get_page(page);
+	}
 	spin_unlock(&node->tree_lock);
-	if (!err)
+	if (!err) {
 		tcache_lru_add(node->pool, page);
+		put_page(page);
+	}
 	local_irq_restore(flags);
 	return err;
 }
--
2.43.0