The per-cfs_rq active_timer (CONFIG_CFS_CPULIMIT) is started by
dec_nr_active_cfs_rqs() to defer the tg->nr_cpus_active decrement.
Its callback sched_cfs_active_timer() dereferences cfs_rq->tg.

When a task group is destroyed, unregister_fair_sched_group() tears
down the per-CPU cfs_rq structures but never cancels the active_timer.
If the timer fires after the cfs_rq and task_group memory have been
freed and reallocated (all three — cfs_rq, sched_entity, and
task_group — live in the kmalloc-1k slab cache), the callback performs
atomic_dec() on an arbitrary kernel address, corrupting memory.

This was observed as a hard lockup + NULL-pointer Oops in
enqueue_task_fair() during task wakeup: the se->parent pointer at
offset 128 in a sched_entity was corrupted because it shares the same
slab offset as cfs_rq->skip (also offset 128) — a classic cross-type
UAF in a shared slab cache.

Fix this in two ways:

1. Cancel the active_timer in unregister_fair_sched_group() before the
   cfs_rq is freed. The cancel must happen outside the rq lock because
   the timer callback sched_cfs_active_timer() acquires it.

2. Move the atomic_dec(&cfs_rq->tg->nr_cpus_active) inside the rq lock
   in sched_cfs_active_timer(). In the original code the callback
   releases the rq lock before executing atomic_dec, creating a window
   where the teardown path can run between the unlock and the
   atomic_dec:

     CPU B (timer callback)              CPU A (teardown)
     ──────────────────────              ────────────────
     sched_cfs_active_timer()
       raw_spin_rq_lock()
       cfs_rq->active = ...
       raw_spin_rq_unlock()
       ← lock released, atomic_dec
         not yet executed
                                         unregister_fair_sched_group()
                                           raw_spin_rq_lock()
                                           list_del_leaf_cfs_rq()
                                           raw_spin_rq_unlock()
                                         sched_free_group()
                                           kfree(tg)
       atomic_dec(&cfs_rq->tg->...)  ← UAF! tg already freed

   With atomic_dec inside the rq lock, teardown's raw_spin_rq_lock()
   in unregister_fair_sched_group() cannot proceed until the callback
   has completed all accesses to cfs_rq->tg, eliminating the window.

   Note: hrtimer_cancel() (fix #1) also independently closes this
   race — the hrtimer core keeps base->running == timer until fn()
   returns, so hrtimer_cancel() waits for the callback to fully
   complete, including atomic_dec.  Moving atomic_dec under the lock
   makes the serialization explicit via the lock protocol without
   relying on hrtimer internals.

https://virtuozzo.atlassian.net/browse/VSTOR-126785

Signed-off-by: Konstantin Khorenko <[email protected]>

Feature: sched: ability to limit number of CPUs available to a CT
---
 kernel/sched/fair.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cc5dceb9c815f..9b0fe4c8a272f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -597,9 +597,8 @@ static enum hrtimer_restart sched_cfs_active_timer(struct hrtimer *timer)
 
        raw_spin_rq_lock_irqsave(rq, flags);
        cfs_rq->active = !list_empty(&cfs_rq->tasks);
-       raw_spin_rq_unlock_irqrestore(rq, flags);
-
        atomic_dec(&cfs_rq->tg->nr_cpus_active);
+       raw_spin_rq_unlock_irqrestore(rq, flags);
 
        return HRTIMER_NORESTART;
 }
@@ -13020,6 +13019,16 @@ void unregister_fair_sched_group(struct task_group *tg)
        destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
 
        for_each_possible_cpu(cpu) {
+#ifdef CONFIG_CFS_CPULIMIT
+               /*
+                * Cancel the per-cfs_rq active timer before freeing.
+                * The callback dereferences cfs_rq->tg, so failing to
+                * cancel leads to use-after-free once the tg is freed.
+                * Must be done outside the rq lock since the callback
+                * acquires it.
+                */
+               hrtimer_cancel(&tg->cfs_rq[cpu]->active_timer);
+#endif
                if (tg->se[cpu])
                        remove_entity_load_avg(tg->se[cpu]);
 
-- 
2.43.0
