On 3/20/26 6:40 AM, Li Wang wrote:
On Thu, Mar 19, 2026 at 01:37:46PM -0400, Waiman Long wrote:
The vmstats flush threshold currently increases linearly with the
number of online CPUs. As the number of CPUs increases over time, it
will become increasingly difficult to meet the threshold and update the
vmstats data in a timely manner. These days, systems with hundreds of
CPUs or even thousands of them are becoming more common.

For example, the test_memcg_sock test of test_memcontrol always fails
when running on an arm64 system with 128 CPUs. This is because the
threshold is then 64 * 128 = 8192 update events. With a 4k page size,
that corresponds to changes in 32 MB of memory before a flush is
triggered. It will be even worse with a larger page size like 64k.

To make the output of memory.stat more accurate, it is better to
scale the threshold up logarithmically instead of linearly with the
number of CPUs. With the log2 scale, we can use the possibly larger
num_possible_cpus() instead of num_online_cpus(), whose value may
change at run time.

Although there is supposed to be a periodic and asynchronous flush of
vmstats every 2 seconds, the actual time lag between successive runs
can vary quite a bit. In fact, I have seen time lags of up to tens of
seconds in some cases. So we cannot rely too much on the hope that
there will be an asynchronous vmstats flush every 2 seconds. This may
be something we need to look into.

Signed-off-by: Waiman Long <[email protected]>
---
  mm/memcontrol.c | 17 ++++++++++++-----
  1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 772bac21d155..8d4ede72f05c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -548,20 +548,20 @@ struct memcg_vmstats {
   *    rstat update tree grow unbounded.
   *
   * 2) Flush the stats synchronously on reader side only when there are more than
  - *    (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
  - *    will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
  - *    only for 2 seconds due to (1).
  + *    (MEMCG_CHARGE_BATCH * (ilog2(nr_cpus) + 1)) update events. Though this
  + *    optimization will let stats be out of sync by up to that amount but only
  + *    for 2 seconds due to (1).
   */
  static void flush_memcg_stats_dwork(struct work_struct *w);
  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
  static u64 flush_last_time;
+static int vmstats_flush_threshold __ro_after_init;
  #define FLUSH_TIME (2UL*HZ)

  static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
  {
-       return atomic_read(&vmstats->stats_updates) >
-               MEMCG_CHARGE_BATCH * num_online_cpus();
+       return atomic_read(&vmstats->stats_updates) > vmstats_flush_threshold;
  }

  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
@@ -5191,6 +5191,13 @@ int __init mem_cgroup_init(void)
        memcg_pn_cachep = KMEM_CACHE(mem_cgroup_per_node,
                                     SLAB_PANIC | SLAB_HWCACHE_ALIGN);
+       /*
+        * Logarithmically scale up vmstats flush threshold with the number
+        * of CPUs.
+        * N.B. ilog2(1) = 0.
+        */
+       vmstats_flush_threshold = MEMCG_CHARGE_BATCH *
+                                 (ilog2(num_possible_cpus()) + 1);

Changing the threshold from linear to logarithmic scaling looks smarter,
but my concern is that, on large systems (hundreds or thousands of CPUs),
the threshold drops dramatically.

For example, with 1024 CPUs it goes from 65536 (256 MB) down to only 704
(~2.7 MB), almost a 100x reduction. Could this potentially cause a
performance issue when 'memory.stat' is read frequently on a heavily
loaded system?

Maybe go with MEMCG_CHARGE_BATCH * int_sqrt(num_possible_cpus()),
which sits between linear and log2?

I have also been thinking about scaling faster than log2 but still below
linear. int_sqrt() is a good suggestion, and I will adopt it in the next
version.

Thanks,
Longman
