On 3/17/26 09:33, Dmitry Sepp wrote:
Make the control available for the cgroup2 hierarchy as well.
https://virtuozzo.atlassian.net/browse/VSTOR-124385
Signed-off-by: Dmitry Sepp <[email protected]>
---
kernel/sched/core.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f66ee9d07387..3b13fd3a3f7a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10431,6 +10431,13 @@ static struct cftype cpu_files[] = {
.seq_show = cpu_uclamp_max_show,
.write = cpu_uclamp_max_write,
},
+#endif
+#ifdef CONFIG_CFS_CPULIMIT
+ {
+ .name = "nr_cpus",
Maybe add
	.flags = CFTYPE_NOT_ON_ROOT,
here, like most of the other entries in this array?
+ .read_u64 = nr_cpus_read_u64,
+ .write_u64 = nr_cpus_write_u64,
+ },
#endif
{
.name = "proc.stat",
Also, while we are here, could you please fix another related issue?
Bug: Missing `cpus_read_lock()` in `tg_set_cpu_limit()`
tg_set_cpu_limit() calls __tg_set_cfs_bandwidth(), which iterates over for_each_online_cpu(i) and
takes per-CPU rq locks. However, tg_set_cpu_limit() itself does not hold cpus_read_lock():
kernel/sched/core.c lines 10025-10031
mutex_lock(&cfs_constraints_mutex);
ret = __tg_set_cfs_bandwidth(tg, period, quota, burst);
if (!ret) {
tg->cpu_rate = cpu_rate;
tg->nr_cpus = nr_cpus;
}
mutex_unlock(&cfs_constraints_mutex);
Compare with tg_set_cfs_bandwidth(), which does it correctly:
kernel/sched/core.c lines 9734-9743
{
int ret;
guard(cpus_read_lock)();
guard(mutex)(&cfs_constraints_mutex);
ret = __tg_set_cfs_bandwidth(tg, period, quota, burst);
tg_update_cpu_limit(tg);
return ret;
}
The requirement to hold cpus_read_lock() was introduced by upstream commit 0e59bdaea75f
("sched/fair: Disable runtime_enabled on dying rq"), which changed the iteration in
__tg_set_cfs_bandwidth() from for_each_possible_cpu to for_each_online_cpu and added
get_online_cpus()/put_online_cpus() around the call. This was done to prevent a race between setting
cfs_rq->runtime_enabled and unthrottle_offline_cfs_rqs().
If a CPU goes offline while __tg_set_cfs_bandwidth() is executing inside tg_set_cpu_limit(), the
function may re-enable runtime_enabled on a dying CPU's cfs_rq after unthrottle_offline_cfs_rqs() has
already cleared it, leaving tasks stranded on a dead CPU with no way to migrate.
The bug was inherited from the original commit 4514c5835d32f ("sched: Port CONFIG_CFS_CPULIMIT
feature"), where tg_set_cpu_limit() was ported from vz7 (kernel 3.10) without accounting for the
changed locking requirements. In the vz7 kernel, __tg_set_cfs_bandwidth() used for_each_possible_cpu,
so cpus_read_lock() was not needed.
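Something like the following should work (an untested sketch; it assumes tg_set_cpu_limit() has no
early returns between the lock/unlock points shown, since the surrounding context is elided above):

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ tg_set_cpu_limit():
+	cpus_read_lock();
 	mutex_lock(&cfs_constraints_mutex);
 	ret = __tg_set_cfs_bandwidth(tg, period, quota, burst);
 	if (!ret) {
 		tg->cpu_rate = cpu_rate;
 		tg->nr_cpus = nr_cpus;
 	}
 	mutex_unlock(&cfs_constraints_mutex);
+	cpus_read_unlock();

Or, if the function body allows it, the scoped form already used by tg_set_cfs_bandwidth():
	guard(cpus_read_lock)();
placed before taking cfs_constraints_mutex.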
==================================================
+ Another issue, with cpu.max vs cpu.nr_cpus behavior:
Semantics: `cpu.nr_cpus` becomes passive after writing to `cpu.max`
After the first patch, writing to cpu.max no longer resets nr_cpus (which is good), but it does not
re-apply it either.
The code path when writing cpu.max:
cpu_max_write() → tg_set_cfs_bandwidth() → __tg_set_cfs_bandwidth() (sets quota/period directly) →
tg_update_cpu_limit() (recalculates cpu_rate from quota/period; does not touch nr_cpus)
This leads to a confusing scenario:
echo 2 > cpu.nr_cpus # limit = 2 CPUs (via CFS bandwidth)
echo "max 100000" > cpu.max # remove the limit
cat cpu.nr_cpus # reads 2 ← but there is no actual limit!
nr_cpus is stored but has no effect until someone writes to cpu.nr_cpus again. In cgroup v2, where
both files are visible side by side, this can mislead the user into thinking a CPU limit is in place
when it is not.
Possible ways to address this:
• Make tg_update_cpu_limit() take nr_cpus into account (re-apply it when cpu.max is written)
• Reset nr_cpus = 0 when cpu.max is written (as it was before the first patch, though that behavior
was intentionally removed)
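For the first option, a rough sketch (hypothetical: the actual body of tg_update_cpu_limit() is not
quoted here, and the quota derivation is assumed to mirror whatever the existing cpu.nr_cpus write
path already computes):

static void tg_update_cpu_limit(struct task_group *tg)
{
	/* ... existing recalculation of tg->cpu_rate from quota/period ... */

	/*
	 * Sketch: if a nr_cpus limit was set earlier, re-derive the CFS
	 * bandwidth from it (e.g. quota = nr_cpus * period) so the stored
	 * value stays effective after a cpu.max write, instead of lying
	 * dormant until the next cpu.nr_cpus write.
	 */
	if (tg->nr_cpus)
		/* re-apply the nr_cpus-derived bandwidth here */;
}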
_______________________________________________
Devel mailing list
[email protected]
https://lists.openvz.org/mailman/listinfo/devel