Public bug reported:

First encountered with the 5.4 kernel, but still present with the HWE kernel (5.13).

Description:    Ubuntu 20.04.4 LTS
Release:        20.04

We have three of these cards (Radeon Pro W5500) in three identical EPYC
7302P HP DL325 Gen10 servers.

ruben@alpha:~$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.13.0-30-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro amd_iommu=on vfio-pci.ids=1002:7341,1002:ab38 nofb iommu=pt

dmesg excerpt (with vendor-reset):

[  412.868799] vfio-pci 0000:86:00.0: enabling device (0142 -> 0143)
[  412.868980] vfio-pci 0000:86:00.0: AMD_NAVI14: version 1.1
[  412.868982] vfio-pci 0000:86:00.0: AMD_NAVI14: performing pre-reset
[  412.888842] vfio-pci 0000:86:00.0: AMD_NAVI14: performing reset
[  412.925218] ATOM BIOS: 113-D3250100-102
[  412.925221] vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
[  413.171020] vfio-pci 0000:86:00.0: AMD_NAVI14: bus reset disabled? yes
[  413.171028] vfio-pci 0000:86:00.0: AMD_NAVI14: SMU response reg: 0, sol reg: 0, mp1 intr enabled? no, bl ready? yes
[  413.171035] vfio-pci 0000:86:00.0: AMD_NAVI14: performing post-reset
[  413.208794] vfio-pci 0000:86:00.0: AMD_NAVI14: reset result = 0
[  413.208971] vfio-pci 0000:86:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[  413.208985] vfio-pci 0000:86:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[  413.208990] vfio-pci 0000:86:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
[  413.208992] vfio-pci 0000:86:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
[  413.208994] vfio-pci 0000:86:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
[  413.228798] vfio-pci 0000:86:00.1: enabling device (0140 -> 0142)
[  413.296899] vfio-pci 0000:86:00.0: AMD_NAVI14: version 1.1
[  413.296904] vfio-pci 0000:86:00.0: AMD_NAVI14: performing pre-reset
[  413.297096] vfio-pci 0000:86:00.0: AMD_NAVI14: performing reset
[  413.333349] ATOM BIOS: 113-D3250100-102
[  413.333351] vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
[  413.579787] vfio-pci 0000:86:00.0: AMD_NAVI14: bus reset disabled? yes
[  413.579793] vfio-pci 0000:86:00.0: AMD_NAVI14: SMU response reg: 0, sol reg: 0, mp1 intr enabled? no, bl ready? yes
[  413.579797] vfio-pci 0000:86:00.0: AMD_NAVI14: performing post-reset
[  413.616795] vfio-pci 0000:86:00.0: AMD_NAVI14: reset result = 0
[  419.766917] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[  419.766919] Do you have a strange power saving mode enabled?
[  419.766920] Dazed and confused, but trying to continue
[  436.498601] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[  436.498604] Do you have a strange power saving mode enabled?
[  436.498605] Dazed and confused, but trying to continue
[  454.306951] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[  454.306955] Do you have a strange power saving mode enabled?
[  454.306955] Dazed and confused, but trying to continue
[  456.237162] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[  456.237165] Do you have a strange power saving mode enabled?
[  456.237166] Dazed and confused, but trying to continue
[  457.800596] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[  457.800598] Do you have a strange power saving mode enabled?
[  457.800599] Dazed and confused, but trying to continue
[  474.068911] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[  474.068914] Do you have a strange power saving mode enabled?
[  474.068915] Dazed and confused, but trying to continue

This happens both with and without the vendor-reset workaround
(https://github.com/gnif/vendor-reset/).  The GPU works "fine" in a VM
(in OpenStack, KVM), although it frequently generates these spurious
NMIs, especially when booting the VM and when using ROCm (e.g.
clinfo) in the VM.
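
To make the frequency of these NMIs easier to compare between setups, the
messages can be pulled out of a saved dmesg log with a small script.  The
following is only a sketch: the function name parse_nmis and the SAMPLE
text are made up for illustration, and it assumes that the kernel prints
the reason byte in hexadecimal and that bits 0x80 (SERR) and 0x40 (IOCHK)
are the two "known" NMI sources on x86.  Under that assumption, reason 25
(0x25) has neither bit set, which would be consistent with the kernel
calling it "unknown".

```python
import re

# Assumed bit layout of the x86 NMI status byte (port 0x61); these two
# masks are an assumption based on the kernel's NMI handling, not
# something stated in this report.
NMI_REASON_SERR = 0x80
NMI_REASON_IOCHK = 0x40

NMI_LINE = re.compile(
    r"\[\s*(\d+\.\d+)\] Uhhuh\. NMI received for unknown reason "
    r"([0-9a-fA-F]{2}) on CPU (\d+)\."
)

def parse_nmis(log_text):
    """Return one dict per 'unknown reason' NMI line found in log_text."""
    events = []
    for m in NMI_LINE.finditer(log_text):
        reason = int(m.group(2), 16)  # assumed to be printed as hex
        events.append({
            "time": float(m.group(1)),
            "reason": reason,
            "cpu": int(m.group(3)),
            "serr": bool(reason & NMI_REASON_SERR),
            "iochk": bool(reason & NMI_REASON_IOCHK),
        })
    return events

# Two lines copied from the dmesg excerpt above, used as sample input.
SAMPLE = """\
[  419.766917] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[  419.766919] Do you have a strange power saving mode enabled?
[  436.498601] Uhhuh. NMI received for unknown reason 25 on CPU 0.
"""

events = parse_nmis(SAMPLE)
for e in events:
    print(e)
```

Running the same function over a full "dmesg" dump from each machine would
give a per-boot count and the gaps between events.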

I will now move one of these GPUs into an older Intel system and run it
on bare metal because we need a student to work on it.  I'll also test
passthrough on that machine, to see whether it has the same behaviour.

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-5.13.0-30-generic 5.13.0-30.33~20.04.1
ProcVersionSignature: Ubuntu 5.13.0-30.33~20.04.1-generic 5.13.19
Uname: Linux 5.13.0-30-generic x86_64
ApportVersion: 2.20.11-0ubuntu27.21
Architecture: amd64
CasperMD5CheckResult: skip
Date: Mon Mar  7 10:26:25 2022
InstallationDate: Installed on 2022-01-05 (60 days ago)
InstallationMedia: Ubuntu-Server 18.04.6 LTS "Bionic Beaver" - Release amd64 (20210915)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-signed-hwe-5.13
UpgradeStatus: Upgraded to focal on 2022-01-17 (48 days ago)

** Affects: linux-signed-hwe-5.13 (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: amd64 apport-bug focal uec-images

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-signed-hwe-5.13 in Ubuntu.
https://bugs.launchpad.net/bugs/1963893

Title:
  Radeon Pro W5500 in passthrough with vfio generates spurious NMI
  reason 25

Status in linux-signed-hwe-5.13 package in Ubuntu:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe-5.13/+bug/1963893/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
