Public bug reported:
Hi,
I've run into an issue on ppc64el where guests end up with broken time values
and, due to that, crash in various ways.
### TL;DR ###
- ppc64el guests break due to bad time values
- Very reproducible, but only on one system (HW issue?)
- A lot of debug data is available, but I don't know what the next debug step
would be
### Symptoms ###
- Guest becomes unresponsive
- SSH login hangs, but does not fail outright
  - the system still responds, e.g. to the key exchange
  - it then hangs at "pledge: Network"
- virsh console also hangs on some guests (no response at all)
- GDB debugging shows the guest CPU seems to spin mostly around spinlocks and
timers
  - it is still consuming too many CPU cycles to be just the idle loop
- Guests report their just-initialized password as timed out and demand that it
be changed
- On password change, a Python error reports a time about 10^12 years into the
future:
"Fatal Python error: Py_Initialize: can't initialize time OverflowError:
timestamp too large to convert to C _PyTime_t"
"date: invalid date 'now - 18446744072844558894.77 seconds'"
- In the crash dumps the guests are somewhere in the year 3158
- Sometimes no SSH connection can be made at all
- But still a random few guests work as they should
As a rule of thumb, about 75-90% of the guests I start end up broken.
### Reproduction ###
I found that when starting many guests concurrently, some of them hang almost
for sure.
Started sequentially they might still fail, sometimes later in their life-cycle,
but not as reliably as when starting concurrently.
# cat /home/ubuntu/ppc-template.xml
<domain type='kvm'>
  <os>
    <type>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <devices>
    <interface type='network'>
      <source network='default'/>
      <model type='virtio'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/3'/>
      <target port='0'/>
    </serial>
    <graphics type='vnc' autoport='yes' listen='127.0.0.1'>
      <listen type='address' address='127.0.0.1'/>
    </graphics>
    <video/>
  </devices>
</domain>
# cat repro.sh
#!/bin/bash
set -xu
sudo ppc64_cpu --smt=off
log=$(mktemp)
idx=1
# one guest per cpu should not be too much
max=$(grep '^processor' /proc/cpuinfo -c)
uvt-simplestreams-libvirt sync --source http://cloud-images.ubuntu.com/daily \
    arch=ppc64el label=daily release=xenial
for idx in $(seq 1 ${max}); do
    uvt-kvm create --template /home/ubuntu/ppc-template.xml \
        --password=ubuntu --ssh-public-key-file /home/ubuntu/.ssh/authorized_keys \
        kvmguest-xenial-test${idx} release=xenial arch=ppc64el label=daily
done
# be fair, give it a lot of time to initialize
sleep 500
for idx in $(seq 1 ${max}); do
    uvt-kvm wait --insecure kvmguest-xenial-test${idx}
    if [[ $? != 0 ]]; then
        echo "guest ${idx} not active" >> ${log}
    fi
done
# report failing guests outside of the set -x spam
cat ${log}
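To tell "good" and "broken" guests apart afterwards, one way is to compare
their clocks from the host; roughly along these lines (a sketch, not part of my
actual script - the timeout catches the guests whose SSH hangs):

$ for idx in $(seq 1 ${max}); do
    echo -n "kvmguest-xenial-test${idx}: "
    timeout 15 ssh -o StrictHostKeyChecking=no \
      ubuntu@$(uvt-kvm ip kvmguest-xenial-test${idx}) date || echo "hung/broken"
  done

Anything reporting a date in 3158 (or nothing before the timeout) is one of the
broken guests.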
### GDB-Debugging: ###
Current debug kernel at
http://ddebs.ubuntu.com/ubuntu/pool/main/l/linux/linux-image-4.4.0-64-generic-dbgsym_4.4.0-64.85_ppc64el.ddeb
(or via ddebs in /etc/apt/sources.list.d/ddebs.list)
apt-get update
apt-get install linux-image-4.4.0-63-generic-dbgsym
virsh qemu-monitor-command --hmp kvmguest-xenial-postcopy 'gdbserver'
gdb /usr/lib/debug/boot/vmlinux-4.4.0-63-generic
(gdb) target remote localhost:1234
I see mostly Timer/Locking:
- sched_rt_period_timer
- __run_hrtimer
- __arch_spin_trylock
- timerqueue_del
- ktime_get
- __remove_hrtimer
=> I don't know the ppc idle loop well; timers could be normal, but this looks
like too much locking?
=> A working guest, while idle, never/rarely hits those.
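To compare the timer state of a "good" vs. a "bad" guest without attaching gdb,
the QEMU monitor can also be asked for the vCPU register dump (a sketch; guest
names are from the repro above, and whether TB/DEC show up in the output
depends on the QEMU build):

$ virsh qemu-monitor-command --hmp kvmguest-xenial-test1 'info registers'
$ virsh qemu-monitor-command --hmp kvmguest-xenial-test6 'info registers'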
### Backtraces ###
0xc000000000b0cdf0 in __arch_spin_trylock (lock=<optimized out>) at
/build/linux-JaAPTH/linux-4.4.0/arch/powerpc/include/asm/spinlock.h:75
75 /build/linux-JaAPTH/linux-4.4.0/arch/powerpc/include/asm/spinlock.h: No
such file or directory.
(gdb) bt
#0 0xc000000000b0cdf0 in __arch_spin_trylock (lock=<optimized out>) at
/build/linux-JaAPTH/linux-4.4.0/arch/powerpc/include/asm/spinlock.h:75
#1 arch_spin_lock (lock=<optimized out>) at
/build/linux-JaAPTH/linux-4.4.0/arch/powerpc/include/asm/spinlock.h:125
#2 do_raw_spin_lock (lock=<optimized out>) at
/build/linux-JaAPTH/linux-4.4.0/include/linux/spinlock.h:148
#3 __raw_spin_lock (lock=<optimized out>) at
/build/linux-JaAPTH/linux-4.4.0/include/linux/spinlock_api_smp.h:145
#4 _raw_spin_lock (lock=0xc00000001fe0ad80) at
/build/linux-JaAPTH/linux-4.4.0/kernel/locking/spinlock.c:151
#5 0xc00000000014da28 in __run_hrtimer (now=<optimized out>, timer=<optimized
out>, base=<optimized out>, cpu_base=<optimized out>)
at /build/linux-JaAPTH/linux-4.4.0/kernel/time/hrtimer.c:1250
#6 __hrtimer_run_queues (cpu_base=0xc00000001fe0ad80, now=...) at
/build/linux-JaAPTH/linux-4.4.0/kernel/time/hrtimer.c:1312
#7 0xc00000000014ea3c in hrtimer_interrupt (dev=<optimized out>) at
/build/linux-JaAPTH/linux-4.4.0/kernel/time/hrtimer.c:1346
#8 0xc00000000001f5fc in __timer_interrupt () at
/build/linux-JaAPTH/linux-4.4.0/arch/powerpc/kernel/time.c:502
#9 0xc00000000001f9b0 in timer_interrupt (regs=<optimized out>) at
/build/linux-JaAPTH/linux-4.4.0/arch/powerpc/kernel/time.c:564
#10 0xc000000000002714 in decrementer_common () at
/build/linux-JaAPTH/linux-4.4.0/arch/powerpc/kernel/exceptions-64s.S:790
All the backtraces I got (~10) have decrementer_common as the root of the call
chain.
Might this just be a GDB debugging artifact triggered by the short sleeps that
the interrupts cause - or is the guest trying to catch up on *years* of time in
timer events?
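While attached one can also read the time-related kernel globals directly; a
sketch, assuming these symbols are visible in the debug vmlinux
(tb_ticks_per_sec and jiffies_64 are plain globals, tk_core is static so it
needs the full debug info):

$ gdb -batch /usr/lib/debug/boot/vmlinux-4.4.0-63-generic \
    -ex 'target remote localhost:1234' \
    -ex 'p tb_ticks_per_sec' \
    -ex 'p jiffies_64' \
    -ex 'p tk_core.timekeeper.xtime_sec'

If the bad guests really are years ahead, xtime_sec should differ from a good
guest by roughly the same huge amount as the dates above.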
### System integrity / System Dependent ###
With all that known I deployed another system (Wichita) and there everything
works fine.
So it depends on the system.
We thought about reinstalling, but that would kill our reproduction case.
There is nothing in dmesg that would indicate a HW issue.
Also we rebooted to clear any runtime issues, but the issue persists.
To check the installation I wrote a few commands to see where libvirt/qemu and
everything else come from. All relevant tools turn out to be the unmodified
binaries/configs straight from the archive.
Check if there are any "uncommon" package sources:
$ apt-cache policy $(dpkg-query -l | awk '/^ii/ {print $2}' | xargs) | \
    awk 'BEGIN {hit=0; mirror=""; pocket=""}
         /^[a-zA-Z]/ {printf("%s %s %s\n", mirror, pocket, pkg); pkg=$1}
         {if (hit) {mirror=$2; pocket=$3; hit=0}}
         /^ \*\*\* / {hit=1}
         END {printf("%s %s %s\n", mirror, pocket, pkg)}' | \
    cut -d' ' -f1-2 | sort | uniq -c
3 http://ports.ubuntu.com/ubuntu-ports xenial-backports/main
724 http://ports.ubuntu.com/ubuntu-ports xenial/main
1 http://ports.ubuntu.com/ubuntu-ports xenial/multiverse
72 http://ports.ubuntu.com/ubuntu-ports xenial/universe
315 http://ports.ubuntu.com/ubuntu-ports xenial-updates/main
11 http://ports.ubuntu.com/ubuntu-ports xenial-updates/universe
8 /var/lib/dpkg/status
With the full per-package output redirected to a file, the same command shows
exactly which packages come from PPAs or local sources:
$ apt-cache policy $(dpkg-query -l | awk '/^ii/ {print $2}' | xargs) | \
    awk 'BEGIN {hit=0; mirror=""; pocket=""}
         /^[a-zA-Z]/ {printf("%s %s %s\n", mirror, pocket, pkg); pkg=$1}
         {if (hit) {mirror=$2; pocket=$3; hit=0}}
         /^ \*\*\* / {hit=1}
         END {printf("%s %s %s\n", mirror, pocket, pkg)}' > pkg-source.txt
But the packages listed with "status" as their source are only ones that got
new SRUs since we updated last week (Wichita was working in that same state).
Also, there is no local config difference that could cause this.
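For completeness, the file-level integrity of the relevant packages could also
be double-checked with debsums (not something the numbers above cover, just a
sketch):

$ sudo apt-get install debsums
$ sudo debsums -s qemu-system-ppc qemu-utils libvirt-bin libvirt0

No output from debsums -s means no modified files were found.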
Of course I might just "reinstall" and maybe that fixes it, but as I wrote, we
might then lose the ability to reproduce.
I'd only do so once we give up on understanding the issue.
### Crash ###
A crash dump I took confirms the time craziness.
The file is available to load as:
$ crash /usr/lib/debug/boot/vmlinux-4.4.0-64-generic \
    /home/ubuntu/cpaelzer/kvmguest-xenial-test13.dump
You can find one of my 20 dumps and the matching kernel symbols at:
http://people.canonical.com/~paelzer/ppc64el-timer
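For reference, such a memory-only dump of a hung guest can be taken from the
host roughly like this (a sketch; the exact flags I used may have differed):

$ virsh dump --memory-only --format elf kvmguest-xenial-test13 \
    /home/ubuntu/cpaelzer/kvmguest-xenial-test13.dump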
From the crash header:
DATE: Sun Nov 9 19:12:33 3158
And also:
COMMAND: "swapper/0"
TASK: c000000001598d10 [THREAD_INFO: c0000000015f4000]
STATE: TASK_RUNNING (PANIC)
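For anyone loading the dump: the header values above (and a bit more context)
can be re-derived with standard crash commands, e.g.:

crash> sys    (system info, including the DATE line)
crash> bt     (backtrace of the panicked task)
crash> log    (guest kernel log buffer)
crash> ps     (task list and states)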
I checked various dumps, and the summary is:
- all guests with issues have the same time (no matter when I dumped them)
- all guests with issues hang at the same task in panic state (could be a crash
dump artifact)
- The "good" guests might still have a weird time, but a "different" one than
the broken guests.
  The difference is roughly what a guest would show if it had started with the
same time as the broken ones and then counted up normally until I dumped it.
  E.g. guest 4 showed "Thu Nov 13 22:09:53 3158" after running for about 3
hours since all guests were started.
  E.g. the other "working" guest 6 had "Sun Nov 16 15:13:07 3158", so not
exactly the same.
  The guests that "feel" really good have a correct time in Feb 2017.
- Processes hang in schedule, mostly due to being idle I guess
=> The fact that the password is considered "aged" lets me assume that on boot,
when cloud-init sets the password, things are still ok, but that time warps
later on.
### On boot issue ###
With the idea that the time has to change between cloud-init setting the
password and the guest becoming reachable for login - combined with the fact
that even "good" guests have "broken" time - I went looking through the logs of
those guests.
Since the broken guests can't be logged into or shut down cleanly, there might
be some log loss.
But neither syslog nor the cloud-init log contains the time warp.
Then again, most things might only break after the time warp occurred, so it
could simply be missing from the logs.
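For reference, the kind of check this amounts to is grepping for the absurd
year and for clock-change messages (a sketch, not my literal commands, and the
log paths may differ):

$ grep -n '3158' /var/log/syslog /var/log/cloud-init.log
$ grep -iE 'time has been changed|hwclock' /var/log/syslog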
### Root cause? ###
All of the issues I've identified so far point towards "time".
I don't have a good handle on why this happens though.
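In case it helps with triage, the most basic comparison I can think of is the
timebase/clock info of host vs. guest (a sketch; the guest side obviously needs
a guest that still responds):

On the host:
$ grep -E '^(timebase|clock)' /proc/cpuinfo | sort -u

In a (still working) guest, e.g. via virsh console:
$ grep -E '^(timebase|clock)' /proc/cpuinfo | sort -u
$ date; sleep 10; date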
@bugproxy/IBM:
- Does this sound familiar to you in any way?
- If not please let me know what you'd need to debug with me.
** Affects: qemu (Ubuntu)
Importance: Undecided
Status: New