Public bug reported:

linux-image-5.13.0-39-generic:
  Installed: 5.13.0-39.44~20.04.1

Description:    Ubuntu 20.04.1 LTS
Release:        20.04

I use qemu to run short-lived Linux VMs as part of a CI pipeline, using
nested KVM on Intel CPUs. Frequently, one of the qemu processes managing
the VMs exits without any output. I've been able to track the behaviour
down to the L1 qemu receiving KVM_EXIT_SHUTDOWN from the KVM_RUN ioctl:

    ...
    15268@1647341556.924605:kvm_run_exit cpu_index 0, reason 2
    15268@1647341556.928341:kvm_run_exit cpu_index 0, reason 8

    on QEMU emulator version 4.2.1 (Debian 1:4.2-3ubuntu6.21)
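
For readers who want to decode the numeric exit reasons in a qemu trace
like the one above, here is a minimal sketch. The reason numbers are the
KVM_EXIT_* values from the Linux UAPI header <linux/kvm.h>; only the
first few are listed. The function name decode_run_exits and the regular
expression are illustrative assumptions based on the two trace lines
quoted above, not part of qemu.

```python
import re

# Subset of the KVM_EXIT_* values defined in <linux/kvm.h>.
KVM_EXIT_REASONS = {
    0: "KVM_EXIT_UNKNOWN",
    1: "KVM_EXIT_EXCEPTION",
    2: "KVM_EXIT_IO",
    3: "KVM_EXIT_HYPERCALL",
    4: "KVM_EXIT_DEBUG",
    5: "KVM_EXIT_HLT",
    6: "KVM_EXIT_MMIO",
    7: "KVM_EXIT_IRQ_WINDOW_OPEN",
    8: "KVM_EXIT_SHUTDOWN",
    9: "KVM_EXIT_FAIL_ENTRY",
    10: "KVM_EXIT_INTR",
}

# Assumed line layout, taken from the trace excerpt in this report:
#   <pid>@<timestamp>:kvm_run_exit cpu_index <n>, reason <n>
TRACE_LINE = re.compile(
    r"(?P<pid>\d+)@(?P<ts>\d+\.\d+):kvm_run_exit "
    r"cpu_index (?P<cpu>\d+), reason (?P<reason>\d+)"
)


def decode_run_exits(lines):
    """Return (timestamp, cpu_index, reason_name) for each kvm_run_exit line."""
    decoded = []
    for line in lines:
        m = TRACE_LINE.search(line)
        if m is None:
            continue  # not a kvm_run_exit line
        reason = int(m.group("reason"))
        name = KVM_EXIT_REASONS.get(reason, "unlisted ({})".format(reason))
        decoded.append((m.group("ts"), int(m.group("cpu")), name))
    return decoded


if __name__ == "__main__":
    sample = [
        "15268@1647341556.924605:kvm_run_exit cpu_index 0, reason 2",
        "15268@1647341556.928341:kvm_run_exit cpu_index 0, reason 8",
    ]
    for ts, cpu, name in decode_run_exits(sample):
        print(ts, "cpu", cpu, name)
```

Applied to the two lines above, this maps reason 2 to KVM_EXIT_IO and
reason 8 to KVM_EXIT_SHUTDOWN, which matches the kvm_userspace_exit
event in the L1 kernel trace.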

Digging deeper, I managed to capture the following trace from the L1
kernel (via perf record -a -e "kvm:*"):

    ...
    [001]   770.850287:                      kvm:kvm_entry: vcpu 0, rip 0x100146
    [001]   770.850307:                       kvm:kvm_exit: vcpu 0 reason TRIPLE_FAULT rip 0x100146 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
    [001]   770.850313:                        kvm:kvm_fpu: unload
    [001]   770.850316:             kvm:kvm_userspace_exit: reason KVM_EXIT_SHUTDOWN (8)

    on Linux 5.13.0-30-generic #33~20.04.1-Ubuntu SMP Mon Feb 7 14:25:10 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Immediately prior to the triple fault there are a number of
EXTERNAL_INTERRUPT exits and reads / writes of MSRs and CRs. The crash
seems independent of the Linux version running in L2: I see it across
several LTS kernels. Unfortunately I don't know which version of Linux /
Ubuntu is running in L0.
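
To make the "what happened just before the triple fault" question easier
to answer from a full trace, here is a rough sketch that collects the
kvm events preceding each TRIPLE_FAULT exit. It assumes one event per
line in the layout shown in the excerpt above ("[cpu] timestamp:
kvm:event: details"), as text output from perf script; the function name
context_before_triple_fault and the window size are hypothetical choices
for illustration.

```python
import re
from collections import deque

# Assumed layout, based on the excerpt in this report:
#   [001]   770.850287:   kvm:kvm_entry: vcpu 0, rip 0x100146
EVENT = re.compile(
    r"\[(?P<cpu>\d+)\]\s+(?P<ts>\d+\.\d+):\s+"
    r"(?P<event>kvm:\w+):\s*(?P<detail>.*)"
)


def context_before_triple_fault(lines, window=20):
    """For each TRIPLE_FAULT exit, return up to `window` preceding kvm events.

    Each event is a (timestamp, event_name, detail) tuple of strings.
    """
    recent = deque(maxlen=window)  # sliding window of the latest events
    contexts = []
    for line in lines:
        m = EVENT.search(line)
        if m is None:
            continue  # skip headers, "..." markers and unrelated lines
        event = (m.group("ts"), m.group("event"), m.group("detail").strip())
        is_exit = event[1] == "kvm:kvm_exit"
        if is_exit and "reason TRIPLE_FAULT" in event[2]:
            contexts.append(list(recent))
        recent.append(event)
    return contexts


if __name__ == "__main__":
    sample = [
        "[001]   770.850287:  kvm:kvm_entry: vcpu 0, rip 0x100146",
        "[001]   770.850307:  kvm:kvm_exit: vcpu 0 reason TRIPLE_FAULT rip 0x100146",
        "[001]   770.850313:  kvm:kvm_fpu: unload",
    ]
    for context in context_before_triple_fault(sample):
        for ts, name, detail in context:
            print(ts, name, detail)
```

The sliding window keeps memory use bounded on long traces; a larger
window would show more of the interrupt and MSR / CR activity that leads
up to the fault.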

I've tried to reproduce on other machines I have access to, without much
luck. I've also tried to make sense of rip 0x100146 on my own, but I
don't understand the x86 / qemu boot process well enough. Finally, I've
looked at commits to KVM between 5.13 and master that mention
TRIPLE_FAULT, but nothing stood out.

I've put traces from two failed executions + lscpu at 
https://gist.github.com/lmb/c36479bb67f397ba08319b5e0f752386
For completeness' sake, you can see the failing CI runs at 
https://ebpf.semaphoreci.com/branches/317c3f18-4de0-488b-af6d-2a1fa0967f87

I've tried to get help with this issue via k...@vger.kernel.org but had
no luck. See
https://lore.kernel.org/kvm/95c1dc01-4aa0-46a6-95b1-bbc62588a...@www.fastmail.com/

** Affects: linux-meta-hwe-5.13 (Ubuntu)
     Importance: Undecided
         Status: New

** Package changed: ubuntu => linux-meta-hwe-5.13 (Ubuntu)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-meta-hwe-5.13 in Ubuntu.
https://bugs.launchpad.net/bugs/1970034

Title:
  Intel nested KVM exits L2 due to TRIPLE_FAULT

Status in linux-meta-hwe-5.13 package in Ubuntu:
  New
