tcache_attach_page() inserts a page into the per-node radix tree under
tree_lock, then releases the lock and calls tcache_lru_add().  Between
releasing tree_lock and completing tcache_lru_add(), the page is visible
in the radix tree but not yet on the tcache LRU.

During this window a concurrent tcache_detach_page() on another CPU can:
  1. Find the page via radix_tree_lookup (RCU)
  2. page_cache_get_speculative(page): refcount 1 -> 2
  3. page_ref_freeze(page, 2): refcount 2 -> 0
  4. Remove the page from the radix tree
  5. tcache_lru_del(): page not on LRU yet, skipped
  6. tcache_put_page() -> free_hot_cold_page(): page freed to PCP list

Now page->lru links into a PCP free list.  When the original CPU then
executes tcache_lru_add() -> list_add_tail(&page->lru, &pni->lru), it
overwrites page->lru destroying the PCP list linkage.  Subsequent PCP
allocations follow the stale pointer and hit a poisoned or cross-linked
lru, causing "list_del corruption" warnings and eventually a hard lockup
when free_pcppages_bulk() holds zone->lock forever.

Fix by taking an extra page reference before releasing tree_lock.  This
makes page_ref_freeze(page, 2) fail on any concurrent detach (refcount
will be 3, not the expected 2), forcing the detach to retry after the
page is fully set up (in tree AND on LRU).  The extra reference is
dropped after tcache_lru_add() completes.

Note: moving tcache_lru_add() inside the tree_lock critical section would
cause a lock ordering inversion (tree_lock -> pni->lock vs the shrinker's
pni->lock -> tree_lock path), so the extra-reference approach is used.

https: //virtuozzo.atlassian.net/browse/PSBM-161840

Signed-off-by: Konstantin Khorenko <[email protected]>
---
 mm/tcache.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/mm/tcache.c b/mm/tcache.c
index e8ba7ee26cbef..f95b5ed6cb0bc 100644
--- a/mm/tcache.c
+++ b/mm/tcache.c
@@ -810,9 +810,26 @@ tcache_attach_page(struct tcache_node *node, pgoff_t 
index, struct page *page)
         */
        spin_lock_irqsave(&node->tree_lock, flags);
        err = tcache_page_tree_insert(node, index, page);
+       if (!err) {
+               /*
+                * Take an extra reference while the page is visible in
+                * the tree but not yet on the LRU.  Without this,
+                * a concurrent tcache_detach_page() on another CPU can
+                * find the page via radix_tree_lookup, succeed with
+                * page_ref_freeze(page, 2) and free the page to PCP.
+                * When we then call tcache_lru_add() below, we overwrite
+                * page->lru which now links into a PCP free list,
+                * corrupting that list.  The extra reference makes the
+                * freeze fail (refcount will be 3, not 2), so the
+                * concurrent detach retries after we finish setup.
+                */
+               get_page(page);
+       }
        spin_unlock(&node->tree_lock);
-       if (!err)
+       if (!err) {
                tcache_lru_add(node->pool, page);
+               put_page(page);
+       }
        local_irq_restore(flags); /* Implies rcu_read_lock_sched() */
        return err;
 }
-- 
2.43.0

_______________________________________________
Devel mailing list
[email protected]
https://lists.openvz.org/mailman/listinfo/devel

Reply via email to