The per-cfs_rq active_timer (CONFIG_CFS_CPULIMIT) is started by
dec_nr_active_cfs_rqs() to defer the tg->nr_cpus_active decrement.
Its callback sched_cfs_active_timer() dereferences cfs_rq->tg.

When a task group is destroyed, unregister_fair_sched_group() tears
down the per-CPU cfs_rq structures but never cancels the active_timer.
If the timer fires after the cfs_rq and task_group memory have been
freed and reallocated (all three — cfs_rq, sched_entity, and
task_group — live in the kmalloc-1k slab cache), the callback performs
atomic_dec() on an arbitrary kernel address, corrupting memory.

This was observed as a hard lockup and a NULL-pointer Oops in
enqueue_task_fair() during task wakeup: the se->parent pointer at
offset 128 in a sched_entity was corrupted because it shares the same
slab offset as cfs_rq->skip (also offset 128) — a classic cross-type
UAF in a shared slab cache.
Fix this in two ways:

1. Cancel the active_timer in unregister_fair_sched_group() before the
   cfs_rq is freed. The cancel must happen outside the rq lock because
   the timer callback sched_cfs_active_timer() acquires it.

2. Move the atomic_dec(&cfs_rq->tg->nr_cpus_active) inside the rq lock
   in sched_cfs_active_timer(). In the original code the callback
   releases the rq lock before executing atomic_dec, creating a window
   where the teardown path can run between the unlock and the
   atomic_dec:
   CPU B (timer callback)              CPU A (teardown)
   ──────────────────────              ────────────────
   sched_cfs_active_timer()
     raw_spin_rq_lock()
     cfs_rq->active = ...
     raw_spin_rq_unlock()
     ← lock released, atomic_dec
       not yet executed
                                       unregister_fair_sched_group()
                                         raw_spin_rq_lock()
                                         list_del_leaf_cfs_rq()
                                         raw_spin_rq_unlock()
                                       sched_free_group()
                                         kfree(tg)
     atomic_dec(&cfs_rq->tg->...)  ← UAF! tg already freed
With atomic_dec inside the rq lock, teardown's raw_spin_rq_lock()
in unregister_fair_sched_group() cannot proceed until the callback
has completed all accesses to cfs_rq->tg, eliminating the window.

Note: hrtimer_cancel() (fix #1) also independently closes this
race — the hrtimer core keeps base->running == timer until fn()
returns, so hrtimer_cancel() waits for the callback to fully
complete, including atomic_dec. Moving atomic_dec under the lock
makes the serialization explicit via the lock protocol without
relying on hrtimer internals.
https://virtuozzo.atlassian.net/browse/VSTOR-126785
Signed-off-by: Konstantin Khorenko <[email protected]>
Feature: sched: ability to limit number of CPUs available to a CT
---
kernel/sched/fair.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cc5dceb9c815f..9b0fe4c8a272f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -597,9 +597,8 @@ static enum hrtimer_restart sched_cfs_active_timer(struct hrtimer *timer)
 
 	raw_spin_rq_lock_irqsave(rq, flags);
 	cfs_rq->active = !list_empty(&cfs_rq->tasks);
-	raw_spin_rq_unlock_irqrestore(rq, flags);
-
 	atomic_dec(&cfs_rq->tg->nr_cpus_active);
+	raw_spin_rq_unlock_irqrestore(rq, flags);
 
 	return HRTIMER_NORESTART;
 }
@@ -13020,6 +13019,16 @@ void unregister_fair_sched_group(struct task_group *tg)
 	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
 
 	for_each_possible_cpu(cpu) {
+#ifdef CONFIG_CFS_CPULIMIT
+		/*
+		 * Cancel the per-cfs_rq active timer before freeing.
+		 * The callback dereferences cfs_rq->tg, so failing to
+		 * cancel leads to use-after-free once the tg is freed.
+		 * Must be done outside the rq lock since the callback
+		 * acquires it.
+		 */
+		hrtimer_cancel(&tg->cfs_rq[cpu]->active_timer);
+#endif
 		if (tg->se[cpu])
 			remove_entity_load_avg(tg->se[cpu]);
--
2.43.0
_______________________________________________
Devel mailing list
[email protected]
https://lists.openvz.org/mailman/listinfo/devel