The Linux kernel provides mechanisms like 'isolcpus' and 'nohz_full' to reduce interference for latency-sensitive workloads. However, these are locked behind the "Reboot Wall" - they can only be configured via boot parameters and require a system restart for changes to take effect.
In modern cloud-native environments, CPU resources often need to be dynamically re-partitioned to accommodate container scaling without the performance penalty and downtime of a full system reboot. Similarly, high-frequency trading (HFT) platforms require the ability to fine-tune CPU isolation at runtime to minimize jitter for critical execution threads based on shifting market demands.

This patch series introduces Dynamic Housekeeping & Enhanced Isolation (DHEI). DHEI allows administrators to reconfigure the kernel's housekeeping boundaries at runtime via a new sysfs interface at /sys/kernel/housekeeping/.

Key Features:
- Fine-grained control: Separate sysfs nodes for timer, rcu, tick, workqueue, kthread, managed_irq, domain, and misc.
- Dynamic NOHZ_FULL: Supports enabling/disabling full dynticks mode on-the-fly.
- SMT Awareness: Optional 'smt_aware_mode' for core-granular isolation.
- Safety Guards: Prevents isolating all CPUs, requires at least one online housekeeping CPU, and enforces CAP_SYS_ADMIN capability.

Core Architecture:
1. Notifier-Driven Synchronization: HK_UPDATE_MASK blocking notifier chain.
2. Decoupled Memory Management: Runtime-safe cpumask allocation.
3. Subsystem Handlers: Dynamic migration for IRQ, RCU, Sched, etc.

The series is organized as follows:
- Patches 01-03: Core infrastructure (dynamic allocation, notifier, enum separation)
- Patches 04-09: Subsystem notifier handlers (genirq, RCU, scheduler, watchdog, workqueue, mm/compaction)
- Patch 10: tick/nohz dynamic full dynticks
- Patches 11-13: SMT-aware isolation, boot-time bridging, sysfs interface
- Patch 14: ABI documentation
- Patch 15: kselftest suite

Tested on x86_64 (8 vCPUs, SMT enabled) with all selftests passing.

As suggested by Joel Fernandes and Thomas Gleixner, this V1 version provides a stronger rationale for dynamic isolation and addresses all RFC feedback regarding naming and notifier robustness.
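
For readers who want a concrete picture of the "Safety Guards" and "SMT Awareness" items above, the following is a minimal userspace sketch, not code from this series. It assumes a system with at most 64 CPUs modelled as a plain bitmask, and the names cpu_bits, hk_mask_is_safe() and hk_mask_is_core_granular() are hypothetical. Reading core-granular isolation as "a core's siblings are either all housekeeping or all isolated" is also an assumption about the intended semantics.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpu_bits;

/*
 * Accept a proposed housekeeping mask only if:
 *  - it is non-empty (i.e. not every CPU is isolated), and
 *  - it contains at least one CPU that is currently online.
 */
static bool hk_mask_is_safe(cpu_bits new_hk, cpu_bits online)
{
	if (new_hk == 0)
		return false;
	if ((new_hk & online) == 0)
		return false;
	return true;
}

/*
 * Assumed SMT-aware rule: for every core, its sibling set must be
 * either fully inside or fully outside the housekeeping mask.
 * siblings[i] is the sibling bitmask of core i.
 */
static bool hk_mask_is_core_granular(cpu_bits new_hk,
				     const cpu_bits *siblings, int ncores)
{
	for (int i = 0; i < ncores; i++) {
		cpu_bits in = new_hk & siblings[i];

		/* A partially covered core splits SMT siblings. */
		if (in != 0 && in != siblings[i])
			return false;
	}
	return true;
}
```

On an 8-CPU box with sibling pairs {0,1}, {2,3}, {4,5}, {6,7}, a housekeeping mask of 0x03 passes both checks in this model, while 0x01 passes the first but fails the core-granular one because it splits core 0.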
To: Ingo Molnar <[email protected]>
To: Peter Zijlstra <[email protected]>
To: Juri Lelli <[email protected]>
To: Vincent Guittot <[email protected]>
To: Dietmar Eggemann <[email protected]>
To: Steven Rostedt <[email protected]>
To: Ben Segall <[email protected]>
To: Mel Gorman <[email protected]>
To: Valentin Schneider <[email protected]>
To: Thomas Gleixner <[email protected]>
To: Paul E. McKenney <[email protected]>
To: Frederic Weisbecker <[email protected]>
To: Neeraj Upadhyay <[email protected]>
To: Joel Fernandes <[email protected]>
To: Josh Triplett <[email protected]>
To: Boqun Feng <[email protected]>
To: Uladzislau Rezki <[email protected]>
To: Mathieu Desnoyers <[email protected]>
To: Lai Jiangshan <[email protected]>
To: Zqiang <[email protected]>
To: Tejun Heo <[email protected]>
To: Andrew Morton <[email protected]>
To: Vlastimil Babka <[email protected]>
To: Suren Baghdasaryan <[email protected]>
To: Michal Hocko <[email protected]>
To: Brendan Jackman <[email protected]>
To: Johannes Weiner <[email protected]>
To: Zi Yan <[email protected]>
To: Anna-Maria Behnsen <[email protected]>
To: Ingo Molnar <[email protected]>
To: Shuah Khan <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Qiliang Yuan <[email protected]>

Changes since RFC:
- Dynamic RCU NOCB rewrite: Perform full runtime offload/deoffload via remove_cpu()/add_cpu() for online CPUs, with lazy initialization.
- Robust Timer Migration: Added logic to dynamically migrate tick_do_timer_cpu when a housekeeper is isolated.
- Enhanced Isolation Safety: Hardened sysfs interface with CAP_SYS_ADMIN checks, 0600 permissions, and strict cpumask validations including SMT subset checks.
- Lifecycle Cleanups: Replaced system_state boot checks with slab_is_available() and added hotplug shutdown guards for clean power-off.
- Testing & Docs: Added comprehensive kselftest suite for isolation scenarios and detailed ABI documentation.
- Link to RFC: https://lore.kernel.org/all/20260206-feature-dynamic_isolcpus_dhei-v1-0-00a711eb0...@gmail.com/

---
Qiliang Yuan (15):
      sched/isolation: Support dynamic allocation for housekeeping masks
      sched/isolation: Introduce housekeeping notifier infrastructure
      sched/isolation: Separate housekeeping types in enum hk_type
      genirq: Support dynamic migration for managed interrupts
      rcu: Support runtime NOCB initialization and dynamic offloading
      sched/core: Dynamically update scheduler domain housekeeping mask
      watchdog: Allow runtime toggle of lockup detector affinity
      workqueue: Support dynamic housekeeping mask updates
      mm/compaction: Support dynamic housekeeping mask updates for kcompactd
      tick/nohz: Transition to dynamic full dynticks state management
      sched/isolation: Implement SMT-aware isolation and safety guards
      sched/isolation: Bridge boot-time parameters with dynamic isolation
      sched/isolation: Implement sysfs interface for dynamic housekeeping
      Documentation: isolation: Document DHEI sysfs interfaces
      selftests: dhei: Add functional tests for dynamic housekeeping

 .../ABI/testing/sysfs-kernel-housekeeping |  22 ++
 include/linux/sched/isolation.h           |  40 +++-
 kernel/irq/manage.c                       |  49 +++++
 kernel/rcu/rcu.h                          |   4 +
 kernel/rcu/tree.c                         |  76 +++++++
 kernel/rcu/tree.h                         |   2 +-
 kernel/rcu/tree_nocb.h                    |  27 ++-
 kernel/sched/core.c                       |  28 +++
 kernel/sched/isolation.c                  | 236 ++++++++++++++++++++-
 kernel/time/tick-sched.c                  | 130 +++++++++---
 kernel/watchdog.c                         |  25 +++
 kernel/workqueue.c                        |  42 ++++
 mm/compaction.c                           |  27 +++
 tools/testing/selftests/Makefile          |   1 +
 tools/testing/selftests/dhei/Makefile     |   4 +
 tools/testing/selftests/dhei/dhei_test.sh | 160 ++++++++++++++
 16 files changed, 818 insertions(+), 55 deletions(-)
---
base-commit: 63804fed149a6750ffd28610c5c1c98cce6bd377
change-id: 20260324-dhei-v12-final-891d1ba62bd3

Best regards,
-- 
Qiliang Yuan <[email protected]>

