On 3/17/26 09:33, Dmitry Sepp wrote:
Make the control available for the cgroup2 hierarchy as well.
https://virtuozzo.atlassian.net/browse/VSTOR-124385
Signed-off-by: Dmitry Sepp <[email protected]>
---
kernel/sched/core.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f66ee9d07387..3b13fd3a3f7a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10431,6 +10431,13 @@ static struct cftype cpu_files[] = {
.seq_show = cpu_uclamp_max_show,
.write = cpu_uclamp_max_write,
},
+#endif
+#ifdef CONFIG_CFS_CPULIMIT
+ {
+ .name = "nr_cpus",
Maybe add
	.flags = CFTYPE_NOT_ON_ROOT,
here, like most of the other entries in this array?
+ .read_u64 = nr_cpus_read_u64,
+ .write_u64 = nr_cpus_write_u64,
+ },
#endif
{
.name = "proc.stat",
Also, while we are here, could you please fix another related issue?
Bug: Missing `cpus_read_lock()` in `tg_set_cpu_limit()`
tg_set_cpu_limit() calls __tg_set_cfs_bandwidth(), which iterates over for_each_online_cpu(i) and
takes per-CPU rq locks. However, tg_set_cpu_limit() itself does not hold cpus_read_lock():
kernel/sched/core.c lines 10025-10031
mutex_lock(&cfs_constraints_mutex);
ret = __tg_set_cfs_bandwidth(tg, period, quota, burst);
if (!ret) {
tg->cpu_rate = cpu_rate;
tg->nr_cpus = nr_cpus;
}
mutex_unlock(&cfs_constraints_mutex);
Compare with tg_set_cfs_bandwidth(), which does it correctly:
kernel/sched/core.c lines 9734-9743
{
int ret;
guard(cpus_read_lock)();
guard(mutex)(&cfs_constraints_mutex);
ret = __tg_set_cfs_bandwidth(tg, period, quota, burst);
tg_update_cpu_limit(tg);
return ret;
}
The requirement to hold cpus_read_lock() was introduced by upstream commit 0e59bdaea75f
("sched/fair: Disable runtime_enabled on dying rq"), which changed the iteration in
__tg_set_cfs_bandwidth() from for_each_possible_cpu to for_each_online_cpu and added
get_online_cpus()/put_online_cpus() around the call. This was done to prevent a race between setting
cfs_rq->runtime_enabled and unthrottle_offline_cfs_rqs().
If a CPU goes offline while __tg_set_cfs_bandwidth() is executing inside tg_set_cpu_limit(), the
function may re-enable runtime_enabled on a dying CPU's cfs_rq after unthrottle_offline_cfs_rqs() has
already cleared it, leaving tasks stranded on a dead CPU with no way to migrate.
The bug was inherited from the original commit 4514c5835d32f ("sched: Port CONFIG_CFS_CPULIMIT
feature"), where tg_set_cpu_limit() was ported from vz7 (kernel 3.10) without accounting for the
changed locking requirements. In the vz7 kernel, __tg_set_cfs_bandwidth() used for_each_possible_cpu,
so cpus_read_lock() was not needed.
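Something like the following should work (an untested sketch; it assumes tg_set_cpu_limit() has no
early returns between the lock/unlock points shown, since the surrounding context is elided above):

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ tg_set_cpu_limit():
+	cpus_read_lock();
 	mutex_lock(&cfs_constraints_mutex);
 	ret = __tg_set_cfs_bandwidth(tg, period, quota, burst);
 	if (!ret) {
 		tg->cpu_rate = cpu_rate;
 		tg->nr_cpus = nr_cpus;
 	}
 	mutex_unlock(&cfs_constraints_mutex);
+	cpus_read_unlock();

Or, if the function body allows it, the scoped form already used by tg_set_cfs_bandwidth():
	guard(cpus_read_lock)();
placed before taking cfs_constraints_mutex.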
==================================================
+ Another issue, with cpu.max vs cpu.nr_cpus behavior:
Semantics: `cpu.nr_cpus` becomes passive after writing to `cpu.max`
After the first patch, writing to cpu.max no longer resets nr_cpus (which is good), but it does not
re-apply it either.
The code path when writing cpu.max:
cpu_max_write() → tg_set_cfs_bandwidth() → __tg_set_cfs_bandwidth() (sets quota/period directly) →
tg_update_cpu_limit() (recalculates cpu_rate from quota/period; does not touch nr_cpus)
This leads to a confusing scenario:
echo 2 > cpu.nr_cpus # limit = 2 CPUs (via CFS bandwidth)
echo "max 100000" > cpu.max # remove the limit
cat cpu.nr_cpus # reads 2 ← but there is no actual limit!
nr_cpus is stored but has no effect until someone writes to cpu.nr_cpus again. In cgroup v2, where
both files are visible side by side, this can mislead the user into thinking a CPU limit is in place
when it is not.
Possible ways to address this:
• Make tg_update_cpu_limit() take nr_cpus into account (re-apply it when cpu.max is written)
• Reset nr_cpus = 0 when cpu.max is written (as it was before the first patch, though that behavior
was intentionally removed)
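For the first option, a rough sketch (hypothetical: the actual body of tg_update_cpu_limit() is not
quoted here, and the quota derivation is assumed to mirror whatever the existing cpu.nr_cpus write
path already computes):

static void tg_update_cpu_limit(struct task_group *tg)
{
	/* ... existing recalculation of tg->cpu_rate from quota/period ... */

	/*
	 * Sketch: if a nr_cpus limit was set earlier, re-derive the CFS
	 * bandwidth from it (e.g. quota = nr_cpus * period) so the stored
	 * value stays effective after a cpu.max write, instead of lying
	 * dormant until the next cpu.nr_cpus write.
	 */
	if (tg->nr_cpus)
		/* re-apply the nr_cpus-derived bandwidth here */;
}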
_______________________________________________
Devel mailing list
[email protected]
https://lists.openvz.org/mailman/listinfo/devel