You have been subscribed to a public bug:

Problem:
While bringing up 2 Ubuntu 24.04 guests, running stress-ng (90% load) on 
both, and triggering a crash on both simultaneously, the 1st guest gets 
stuck and does not boot back up. In one of the attempts, both guests got 
stuck during boot with a console hang. 

Attempts:
Reproducible 3/3 consecutive times
Run 1: L2-1 guest got stuck 
Run 2: L2-1 guest got stuck
Run 3: L2-1 and L2-2 guest got stuck


=================================================================
L1 Host:
1. PowerVM
2. OS: Ubuntu 24.04
3. Kernel: 6.8.0-31-generic
4. Mem (free -mh): 47Gi
5. cpus: 40

Guest L2-1:
1. OS: Ubuntu 24.04
2. Kernel: 6.8.0-31-generic
3. Mem (free -mh): 9.5Gi
4. cpus: 8
5. Stress: stress-ng - 90% load
6. XML configuration:
   <vcpu placement='static' current='8'>16</vcpu>
   <memory unit='KiB'>10971520</memory>
   <topology sockets='8' dies='1' cores='1' threads='2'/>

Guest L2-2:
1. OS: Ubuntu 24.04
2. Kernel: 6.8.0-31-generic
3. Mem (free -mh): 9.5Gi
4. cpus: 8
5. Stress: stress-ng - 90% load
6. XML configuration:
   <vcpu placement='static' current='8'>16</vcpu>
   <memory unit='KiB'>10971520</memory>
   <topology sockets='2' dies='1' cores='1' threads='8'/>


=================================================================
Steps to reproduce:
1. Bring up 2 Ubuntu 24.04 L2 guests with the configuration described above
2. Run the attached stress-ng.sh script on both L2 guests
3. Trigger crash: echo c >/proc/sysrq-trigger on both L2 guests at the same time

After triggering the crash, 1 or both guest consoles will get stuck. At
that point we can neither enter the guest nor shut it down. In order to
boot into the guest again, a virsh destroy of the guest is required.
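For reference, a minimal sketch of what the reproduction amounts to. The worker count, the --cpu-load flag value, and the helper names below are assumptions of mine, not the contents of the attached stress-ng.sh:

```shell
# Sketch only: flags and values are assumptions, not the attached stress-ng.sh.
NCPUS=$(nproc)    # one CPU stressor per online vCPU
LOAD_PCT=90       # cap each stressor at ~90% CPU

# Command that would generate the load on each L2 guest.
stress_cmd() {
    echo "stress-ng --cpu ${NCPUS} --cpu-load ${LOAD_PCT} --timeout 0"
}

# Command that triggers the crash (needs root; run on both guests at once).
crash_cmd() {
    echo "echo c > /proc/sysrq-trigger"
}
```

Running these for real requires root and a configured kdump; the helpers above only print the commands so they can be inspected without crashing the box.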


=================================================================
Run 1: console.log error messages of L2-1
  Booting `Ubuntu'

Loading Linux 6.8.0-31-generic ...
Loading initial ramdisk ...
OF stdout device is: /vdevice/vty@30000000
Preparing to boot Linux version 6.8.0-31-generic (buildd@bos02-ppc64el-018) 
(powerpc64le-linux-gnu-gcc-13 (Ubuntu 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU 
Binutils for Ubuntu) 2.42) #31-Ubuntu SMP Sat Apr 20 00:05:55 UTC 2024 (Ubuntu 
6.8.0-31.31-generic 6.8.1)
Detected machine type: 0000000000000101
command line: BOOT_IMAGE=/vmlinux-6.8.0-31-generic 
root=/dev/mapper/ubuntu--vg-ubuntu--lv ro 
crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M
Max number of cores passed to firmware: 1024 (NR_CPUS = 2048)
Calling ibm,client-architecture-support... done
memory layout at init:
  memory_limit : 0000000000000000 (16 MB aligned)
  alloc_bottom : 0000000009d70000
  alloc_top    : 0000000030000000
  alloc_top_hi : 00000002a0000000
  rmo_top      : 0000000030000000
  ram_top      : 00000002a0000000
instantiating rtas at 0x000000002fff0000... done
prom_hold_cpus: skipped
copying OF device tree...
Building dt strings...
Building dt structure...
Device tree strings 0x0000000009d80000 -> 0x0000000009d80bc6
Device tree struct  0x0000000009d90000 -> 0x0000000009da0000
Quiescing Open Firmware ...
Booting Linux via __start() @ 0x0000000000230000 ...
[    0.000000] random: crng init done
[    0.000000] Reserving 512MB of memory at 512MB for crashkernel (System RAM: 
10752MB)
[    0.000000] radix-mmu: Page sizes from device-tree:
[    0.000000] radix-mmu: Page size shift = 12 AP=0x0
[    0.000000] radix-mmu: Page size shift = 16 AP=0x5
[    0.000000] radix-mmu: Page size shift = 21 AP=0x1
[    0.000000] radix-mmu: Page size shift = 30 AP=0x2
[    0.000000] Activating Kernel Userspace Access Prevention
[    0.000000] Activating Kernel Userspace Execution Prevention
[    0.000000] radix-mmu: Mapped 0x0000000000000000-0x00000000038a0000 with 
64.0 KiB pages (exec)
[    0.000000] radix-mmu: Mapped 0x00000000038a0000-0x00000002a0000000 with 
64.0 KiB pages
[    0.000000] lpar: Using radix MMU under hypervisor
[    0.000000] Linux version 6.8.0-31-generic (buildd@bos02-ppc64el-018) 
(powerpc64le-linux-gnu-gcc-13 (Ubuntu 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU 
Binutils for Ubuntu) 2.42) #31-Ubuntu SMP Sat Apr 20 00:05:55 UTC 2024 (Ubuntu 
6.8.0-31.31-generic 6.8.1)
[    0.000000] Secure boot mode disabled
[    0.000000] Found initrd at 0xc000000006200000:0xc000000009d6da29
[    0.000000] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 
0x800200 0xf000006 of:SLOF,HEAD hv:linux,kvm pSeries
[    0.000000] printk: legacy bootconsole [udbg0] enabled
[    0.000000] Partition configured for 16 cpus.
[    0.000000] CPU maps initialized for 2 threads per core
[    0.000000] numa: Partition configured for 1 NUMA nodes.
[    0.000000] -----------------------------------------------------
[    0.000000] phys_mem_size     = 0x2a0000000
[    0.000000] dcache_bsize      = 0x80
[    0.000000] icache_bsize      = 0x80
[    0.000000] cpu_features      = 0x001400eb8f5f9187
[    0.000000]   possible        = 0x001ffbfbcf5fb187
[    0.000000]   always          = 0x0000000380008181
[    0.000000] cpu_user_features = 0xdc0065c2 0xaef60000
[    0.000000] mmu_features      = 0x3c007641
[    0.000000] firmware_features = 0x00000a85455a445f
[    0.000000] vmalloc start     = 0xc008000000000000
[    0.000000] IO start          = 0xc00a000000000000
[    0.000000] vmemmap start     = 0xc00c000000000000
[    0.000000] -----------------------------------------------------
[    0.000000] numa:   NODE_DATA [mem 0x28ae09c00-0x28ae1197f]
[    0.000000] rfi-flush: fallback displacement flush available
[    0.000000] rfi-flush: ori type flush available
[    0.000000] rfi-flush: mttrig type flush available
[    0.000000] count-cache-flush: hardware flush enabled.
[    0.000000] link-stack-flush: software flush enabled.
[    0.000000] stf-barrier: eieio barrier available
[    0.000000] PPC64 nvram contains 65536 bytes
[    0.000000] barrier-nospec: using ORI speculation barrier
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x000000029fffffff]
[    0.000000]   Device   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000029fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000029fffffff]
[    0.000000] percpu: Embedded 12 pages/cpu s609960 r0 d176472 u786432
[    0.000000] Kernel command line: BOOT_IMAGE=/vmlinux-6.8.0-31-generic 
root=/dev/mapper/ubuntu--vg-ubuntu--lv ro 
crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M
[    0.000000] Unknown kernel command line parameters 
"BOOT_IMAGE=/vmlinux-6.8.0-31-generic", will be passed to user space.
[    0.000000] Dentry cache hash table entries: 2097152 (order: 8, 16777216 
bytes, linear)
[    0.000000] Inode-cache hash table entries: 1048576 (order: 7, 8388608 
bytes, linear)
[    0.000000] Fallback order for Node 0: 0 
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 171864
[    0.000000] Policy zone: Normal
[    0.000000] mem auto-init: stack:all(zero), heap alloc:on, heap free:off
[    0.000000] Memory: 9947840K/11010048K available (23680K kernel code, 4096K 
rwdata, 25472K rodata, 8832K init, 1901K bss, 1062208K reserved, 0K 
cma-reserved)
[    0.000000] SLUB: HWalign=128, Order=0-3, MinObjects=0, CPUs=16, Nodes=1
[    0.000000] ftrace: allocating 51717 entries in 19 pages
[    0.000000] ftrace: allocated 19 pages with 3 groups
[    0.000000] trace event string verifier disabled
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu:     RCU restricting CPUs from NR_CPUS=2048 to nr_cpu_ids=16.
[    0.000000]  Rude variant of Tasks RCU enabled.
[    0.000000]  Tracing variant of Tasks RCU enabled.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 
jiffies.
[    0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=16
[    0.000000] NR_IRQS: 512, nr_irqs: 512, preallocated irqs: 16
[    0.000000] xive: Using IRQ range [0-f]
[    0.000000] xive: Interrupt handling initialized with spapr backend
[    0.000000] xive: Using priority 6 for all interrupts
[    0.000000] xive: Using 64kB queues
[    0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    0.000000] time_init: 56 bit decrementer (max: 7fffffffffffff)
[    0.001027] clocksource: timebase: mask: 0xffffffffffffffff max_cycles: 
0x761537d007, max_idle_ns: 440795202126 ns
[    0.002881] clocksource: timebase mult[1f40000] shift[24] registered


=================================================================
Host side:
Run 1: when the L2-1 guest console got stuck on the first attempt

# top | cat

top - 08:53:11 up 2 days, 14:15,  6 users,  load average: 9.00, 10.53, 12.53
Tasks: 496 total,   1 running, 495 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.2 us,  2.2 sy,  0.0 ni, 76.7 id,  0.0 wa, 20.0 hi,  0.0 si,  0.0 st 
MiB Mem :  48414.8 total,    303.5 free,  24681.1 used,  23777.0 buff/cache     
MiB Swap:   8191.9 total,   7910.1 free,    281.8 used.  23733.7 avail Mem 

USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND          
root      20   0   15.6g  10.5g  15360 S 800.0  22.2 146:46.26 qemu-system-ppc  
root      20   0   15.5g  10.5g  15360 S 100.0  22.1  88:03.53 qemu-system-ppc  

# free -mh
               total        used        free      shared  buff/cache   available
Mem:            47Gi        24Gi       230Mi       2.2Mi        23Gi        23Gi


=================================================================
Debugging logs/dumps:
1. console.logs of both L2 guest consoles (All 3 attempts)
2. virsh dump of both guests (All 3 attempts)

The above logs/dumps are being copied to the june server machines under
/dump/dumps/<bug-number>


=================================================================
Attachments:
1. Run-1 console.log of L2-1 guest getting stuck
2. Run-3 console.log of L2-1 guest getting stuck
3. Run-3 console.log of L2-2 guest getting stuck
4. Stress-ng script to run 90% load: stress-ng.sh


$ pwd
/home/dump/dumps/206735
$ ls
bug-206735-guest-console-logs  bug-206735-guest-virsh-dumps

Thanks.


~/bug-206735-dumps# crash 
/root/.cache/debuginfod_client/475c3a23ac990f64c5a03cf1fe8b229fde9a7692/debuginfo
 ./vmcore-ubuntu_vm1-1

crash 8.0.4
Copyright (C) 2002-2022  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011, 2020-2022  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
Copyright (C) 2015, 2021  VMware, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "powerpc64le-unknown-linux-gnu".
Type "show configuration" for configuration details.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...

      KERNEL: 
/root/.cache/debuginfod_client/475c3a23ac990f64c5a03cf1fe8b229fde9a7692/debuginfo
    DUMPFILE: ./vmcore-ubuntu_vm1-1
        CPUS: 1
        DATE: Fri May 24 08:44:35 UTC 2024
      UPTIME: 00:00:00
LOAD AVERAGE: 0.00, 0.00, 0.00
       TASKS: 1
    NODENAME: (none)
     RELEASE: 6.8.0-31-generic
     VERSION: #31-Ubuntu SMP Sat Apr 20 00:05:55 UTC 2024
     MACHINE: ppc64le  (3450 Mhz)
      MEMORY: 10.5 GB
       PANIC: ""
         PID: 0
     COMMAND: "swapper/0"
        TASK: c000000003bf8900  [THREAD_INFO: c000000003bf8900]
         CPU: 0
       STATE: TASK_RUNNING (ACTIVE)
     WARNING: panic task not found

crash> bt
PID: 0        TASK: c000000003bf8900  CPU: 0    COMMAND: "swapper/0"
 R0:  c0000000000de4f4    R1:  c00000028af13f80    R2:  c000000002254800   
 R3:  c0000000048de000    R4:  c000000003c37bc0    R5:  0000000000000000   
 R6:  0000000000000000    R7:  0000000000000000    R8:  c000000001724d18   
 R9:  000000000000ff00    R10: 0000000286f80000    R11: 0000000053474552   
 R12: c0000000000e4184    R13: c000000003e80000    R14: 0000000000000000   
 R15: 0000000000000000    R16: 0000000000000000    R17: 0000000000000000   
 R18: 0000000000000000    R19: 0000000000000000    R20: 0000000000000000   
 R21: 0000000000000000    R22: 0000000000000000    R23: 0000000000000000   
 R24: 0000000000000000    R25: c000000003c37bc0    R26: c00000028af13fe0   
 R27: c000000003c34000    R28: c000000003c69e88    R29: c000000003c37c80   
 R30: c000000003262bd8    R31: c0000000048de000   
 NIP: c0000000000e41a4    MSR: 8000000000000033    OR3: 0000000000000000
 CTR: c0000000000e4184    LR:  c0000000000de4f4    XER: 0000000000000074
 CCR: 0000000082042840    MQ:  0000000000000000    DAR: 0000000000000000
 DSISR: 0000000000000000     Syscall Result: 0000000000000000
 [NIP  : xive_spapr_update_pending+32]
 [LR   : xive_get_irq+76]
 #0 [c00000028af13f80] (null) at 0  (unreliable)
 #1 [c00000028af13fb0] __do_irq at c000000000017a78
 #2 [c00000028af13fe0] __do_IRQ at c000000000018cd8
 #3 [c000000003c37bc0] (null) at 0  (unreliable)
 #4 [c000000003c37c20] do_IRQ at c000000000018e30
 #5 [c000000003c37c50] hardware_interrupt_common_virt at c00000000000953c
 #6 [c000000003c37f20] (null) at 9d6da29  (unreliable)
 #7 [c000000003c37f50] start_kernel at c00000000300fed0
 #8 [c000000003c37fe0] start_here_common at c00000000000e998
crash> dis xive_spapr_update_pending+32
0xc0000000000e41a4 <xive_spapr_update_pending+32>:      hwsync
crash> dis -s xive_spapr_update_pending+32
FILE: /build/linux-NbDBKx/linux-6.8.0/arch/powerpc/sysdev/xive/spapr.c
LINE: 618

dis: xive_spapr_update_pending+32: source code is not available

crash>


I debugged the 6.9.4-200.fc40 kernel of the evelp2g2 L2 VM as given to me by 
Lekshmi, and I find that stress-ng has nothing
to do with hitting this hang in the same __do_IRQ -> xive_get_irq -> 
xive_spapr_update_pending call stack.

This call stack is hit randomly on one of the L2 vcpus whenever we do an
"echo c > /proc/sysrq-trigger".

The NIP points to the mb() macro in the C code and to hwsync in the GDB
disassembly, but this isn't really a problem with hwsync.

I could reproduce this exact same problem with kernel 6.10.0-rc6+ on FC40 after 
I:
i)      Reset the crashkernel cmdline with the following command:
        grubby --update-kernel ALL --args 
"crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M"
ii)     Enabled kdump via the "systemctl enable kdump" and "systemctl start 
kdump" commands.
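As a sanity check after the two steps above, the crashkernel= string on the running kernel and the kdump service state can be verified. This is a sketch; the helper name is hypothetical, not from the report:

```shell
# Sketch: helper name is hypothetical, not from the report.
# Extracts the crashkernel= ranges from a kernel command line string so
# they can be compared against the value passed to grubby.
extract_crashkernel() {
    printf '%s\n' "$1" | grep -o 'crashkernel=[^ ]*'
}

# Typical use on the running system:
#   extract_crashkernel "$(cat /proc/cmdline)"
#   systemctl is-active kdump    # expect "active"
```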

On debugging the vcpu thread in the Qemu instance running in L1, I find
that the vcpu thread doesn't exit from KVM_RUN with any error code.

I find that xive_get_irq() -> xive_spapr_update_pending() is actually being 
called repeatedly by do_IRQ() -> __do_IRQ() -> __do_irq()
in the arch/powerpc code. This means that we are constantly receiving 
interrupts on this CPU. When I debugged which IRQ number we are constantly
receiving, I saw that it is 0x0, which is not informative as of now (to me at 
least).

I think that there is some problem in the startup sequence of the
secondary CPUs, as I never hit this problem on the boot CPU in any of
my attempts today.

I request the CPU team to investigate the startup sequence of the
secondary SMP CPUs, as they would have a better idea of this area for
powerpc.

The procedure to be followed is simple:
i)    Put logs in the startup code of the secondary and primary CPU(s).
ii)   Investigate the point at which the primary CPU waits for the secondary 
CPUs to come up, and understand what isn't happening on the secondary CPUs
      such that the primary CPU doesn't go past the "smp: Bringing up secondary 
CPUs" log.


(In reply to comment #21)
> I think that there is some problem in the startup sequence of the secondary
> CPUs as I never faced this problem on the boot CPU as long as I tried today.
> 
> I request the CPU team to investigate the startup sequence of the secondary
> SMP CPUs as they would be having a better idea of this for powerpc.
> 
> The procedure to be followed is simple:
> i)    Put logs in the startup code of the secondary and primary CPU(s).
> ii)   Investigate the point at which the primary CPU waits for the secondary
> CPUs to come up and understand what isn't happening at the secondary CPUs
>       such that the primary CPU doesn't go past the "smp: Bringing up
> secondary CPUs" log.

We have done exactly that, and we see that one of the secondary threads is 
stuck in arch_local_irq_restore() during bring-up.
We don't know why it gets stuck there, and that is a function that can't be 
instrumented, since doing so leads to other side effects even before 
reaching the "Bringing up secondary CPUs" point.

Even the addr2line output for the address (NIP) shown in the RCU stall
points to the same arch_local_irq_restore().


Just to level-set:

Gautham's patch helps if we are going to disable xive in L2.
Nick's patch will not throw RCU stalls, but L2 will still hang.

Current interpretation of the investigation:
When we have xive on in L2, CPU bring-up gets stuck at the mb() in 
xive_spapr_update_pending(). However, this is not reproducible with xive 
disabled.

It's probably a bit early to conclude that xive is the problem.


-----------------------------------------------------------------------------------

After reverting the following commits, pointed out by Gautam Menghani, the
hang is no longer seen and kdump in L2 works as expected.

df938a5576f3 KVM: PPC: Book3S HV nestedv2: Do not inject certain interrupts
ec0f6639fa88 KVM: PPC: Book3S HV nestedv2: Ensure LPCR_MER bit is passed to the 
L0
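The reverts can be applied to a kernel tree roughly as follows. This is a sketch: the revert order (newest first) is my assumption, and the rebuild step is the usual one, not taken from the report:

```shell
# Sketch: prints the revert commands; order (newest commit first) is an
# assumption, not stated in the report. Run inside the kernel source tree,
# then rebuild and reinstall the kernel as usual.
revert_cmds() {
    cat <<'EOF'
git revert --no-edit ec0f6639fa88  # nestedv2: Ensure LPCR_MER bit is passed to the L0
git revert --no-edit df938a5576f3  # nestedv2: Do not inject certain interrupts
EOF
}
```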

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
         Status: New


** Tags: architecture-ppc64le bugnameltc-206735 severity-high 
targetmilestone-inin---
-- 
[Ubuntu 24.04] MultiVM - L2 guest(s) running stress-ng getting stuck at booting 
after triggering crash
https://bugs.launchpad.net/bugs/2077722
You received this bug notification because you are a member of Kernel Packages, 
which is subscribed to linux in Ubuntu.
