From: Wanpeng Li <[email protected]>

This can be reproduced by setting up a CPU for nohz_full and pin two tasks 
that hog the CPU 100% of the time to that CPU, top reports 70% system time 
and 30% user time utilization sometimes. The trace like this:

   hog-11980 [015]   341.494491: function:    enter_from_user_mode <-- 
apic_timer_interrupt
<idle>-0     [000]   341.494492: function:    smp_apic_timer_interrupt <-- 
apic_timer_interrupt
   hog-11980 [015]   341.494492: function:    __context_tracking_exit <-- 
enter_from_user_mode
<idle>-0     [000]   341.494492: function:    irq_enter <-- 
smp_apic_timer_interrupt
   hog-11980 [015]   341.494492: bprint:      vtime_delta: diff=0 
(now=4295008339 vtime_snap=4295008339)
   hog-11980 [015]   341.494492: function:    smp_apic_timer_interrupt <-- 
apic_timer_interrupt
   hog-11980 [015]   341.494492: function:    irq_enter <-- 
smp_apic_timer_interrupt
   hog-11980 [015]   341.494493: function:    tick_sched_timer <-- 
__hrtimer_run_queues
<idle>-0     [000]   341.494493: function:    tick_sched_timer <-- 
__hrtimer_run_queues
<idle>-0     [000]   341.494493: function:    tick_do_update_jiffies64.part.14 
<-- tick_sched_do_timer
<idle>-0     [000]   341.494494: function:    do_timer <-- 
tick_do_update_jiffies64.part.14
   hog-11980 [015]   341.494494: function:    irq_exit <-- 
smp_apic_timer_interrupt
<idle>-0     [000]   341.494494: bprint:      do_timer: updated 
jiffies_64=4295008340 ticks=1
   hog-11980 [015]   341.494494: function:    __context_tracking_enter <-- 
prepare_exit_to_usermode
   hog-11980 [015]   341.494494: function:    vtime_user_enter <-- 
__context_tracking_enter
   hog-11980 [015]   341.494495: bprint:      vtime_delta: diff=1000000 
(now=4295008340 vtime_snap=4295008339)
   hog-11980 [015]   341.494495: function:    __vtime_account_system <-- 
vtime_user_enter
   hog-11980 [015]   341.494495: bprint:      get_vtime_delta: 
vtime_snap=4295008339 now=4295008340
   hog-11980 [015]   341.494495: function:    account_system_time <-- 
__vtime_account_system
   hog-11980 [015]   341.494495: bprint:      account_system_time: 
cputime=995488
<idle>-0     [000]   341.494497: function:    irq_exit <-- 
smp_apic_timer_interrupt

In this trace, we see the following:

 1. On CPU15, we transition from user-space to kernel-space because
    of a timer interrupt (it's the tick)

 2. vtimer_delta() returns 0, because jiffies didn't change since the
    last accounting

 3. While CPU15 is executing in kernel-space, jiffies is updated
    by CPU0

 4. When going back to user-space, vtime_delta() returns non-zero
    and the whole time is accounted for system time (observe how
    the cputime parameter in account_system_time() is less than 1ms)


After commit ff9a9b4c433 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN 
to jiffy granularity"), it reduces timing accuracy on nohz_full CPUs to 
jiffy based sampling. So the issue can happen if all CPUs fire their ticks 
at the same time:


               CPU 0                  CPU 1
               -----                  -----
                                      exit_user() // no cputime update 
tick X         update_jiffies
                                      enter_user() // cputime update


                                      exit_user() // no cputime update
tick X+1       update_jiffies
                                      enter_user() // cputime update

The nohz_full CPU receives an interrupt at the same time the timer 
interrupt fires on the housekeeping CPU.

This patch offsets the tick to avert all ticks alignment in order 
that the vtime sampling does not end up "in phase" with the jiffies 
incrementing.

Reported-by: Luiz Capitulino <[email protected]>
Suggested-by: Rik van Riel <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Luiz Capitulino <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
 kernel/time/tick-sched.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 7fe53be..35d79f7 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1197,8 +1197,12 @@ void tick_setup_sched_timer(void)
        /* Get the next period (per-CPU) */
        hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
 
-       /* Offset the tick to avert jiffies_lock contention. */
-       if (sched_skew_tick) {
+       /*
+        * Offset the tick to avert jiffies_lock contention, and all ticks
+        * alignment in order that the vtime sampling does not end up "in
+        * phase" with the jiffies incrementing.
+        */
+       if (sched_skew_tick || tick_nohz_full_enabled()) {
                u64 offset = ktime_to_ns(tick_period) >> 1;
                do_div(offset, num_possible_cpus());
                offset *= smp_processor_id();
-- 
2.7.4

Reply via email to