Public bug reported:

First encountered in 5.4 kernel, but still present in HWE.

Description:    Ubuntu 20.04.4 LTS
Release:        20.04

We have three of these cards (Radeon Pro W5500) in three identical EPYC 7302P HP DL325 Gen10 servers.

ruben@alpha:~$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.13.0-30-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro amd_iommu=on vfio-pci.ids=1002:7341,1002:ab38 nofb iommu=pt

dmesg excerpt (with vendor-reset):

[ 412.868799] vfio-pci 0000:86:00.0: enabling device (0142 -> 0143)
[ 412.868980] vfio-pci 0000:86:00.0: AMD_NAVI14: version 1.1
[ 412.868982] vfio-pci 0000:86:00.0: AMD_NAVI14: performing pre-reset
[ 412.888842] vfio-pci 0000:86:00.0: AMD_NAVI14: performing reset
[ 412.925218] ATOM BIOS: 113-D3250100-102
[ 412.925221] vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
[ 413.171020] vfio-pci 0000:86:00.0: AMD_NAVI14: bus reset disabled? yes
[ 413.171028] vfio-pci 0000:86:00.0: AMD_NAVI14: SMU response reg: 0, sol reg: 0, mp1 intr enabled? no, bl ready? yes
[ 413.171035] vfio-pci 0000:86:00.0: AMD_NAVI14: performing post-reset
[ 413.208794] vfio-pci 0000:86:00.0: AMD_NAVI14: reset result = 0
[ 413.208971] vfio-pci 0000:86:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[ 413.208985] vfio-pci 0000:86:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[ 413.208990] vfio-pci 0000:86:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
[ 413.208992] vfio-pci 0000:86:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
[ 413.208994] vfio-pci 0000:86:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
[ 413.228798] vfio-pci 0000:86:00.1: enabling device (0140 -> 0142)
[ 413.296899] vfio-pci 0000:86:00.0: AMD_NAVI14: version 1.1
[ 413.296904] vfio-pci 0000:86:00.0: AMD_NAVI14: performing pre-reset
[ 413.297096] vfio-pci 0000:86:00.0: AMD_NAVI14: performing reset
[ 413.333349] ATOM BIOS: 113-D3250100-102
[ 413.333351] vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
[ 413.579787] vfio-pci 0000:86:00.0: AMD_NAVI14: bus reset disabled? yes
[ 413.579793] vfio-pci 0000:86:00.0: AMD_NAVI14: SMU response reg: 0, sol reg: 0, mp1 intr enabled? no, bl ready? yes
[ 413.579797] vfio-pci 0000:86:00.0: AMD_NAVI14: performing post-reset
[ 413.616795] vfio-pci 0000:86:00.0: AMD_NAVI14: reset result = 0
[ 419.766917] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[ 419.766919] Do you have a strange power saving mode enabled?
[ 419.766920] Dazed and confused, but trying to continue
[ 436.498601] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[ 436.498604] Do you have a strange power saving mode enabled?
[ 436.498605] Dazed and confused, but trying to continue
[ 454.306951] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[ 454.306955] Do you have a strange power saving mode enabled?
[ 454.306955] Dazed and confused, but trying to continue
[ 456.237162] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[ 456.237165] Do you have a strange power saving mode enabled?
[ 456.237166] Dazed and confused, but trying to continue
[ 457.800596] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[ 457.800598] Do you have a strange power saving mode enabled?
[ 457.800599] Dazed and confused, but trying to continue
[ 474.068911] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[ 474.068914] Do you have a strange power saving mode enabled?
[ 474.068915] Dazed and confused, but trying to continue

This happens both with and without the vendor-reset workaround (https://github.com/gnif/vendor-reset/).

The GPU works "fine" in a VM (in OpenStack, KVM), although it generates these spurious NMIs frequently, especially when booting the VM and when using ROCm (e.g. clinfo) in the VM.

I will now move one of these GPUs into an older Intel system and run it bare metal, because we need a student to work on it. I'll also test passthrough on that machine to see whether it shows the same behaviour.

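To put numbers on "frequently", something like the following could pull the unknown-NMI events out of a saved dmesg dump and report the gaps between them. This is only a sketch: the function names `find_unknown_nmis` and `intervals` and the regular expression are my own, not part of any existing tool, and the pattern assumes the message wording shown in the excerpt above.

```python
import re

# Matches the kernel's unknown-NMI line as it appears in the excerpt, e.g.
# [ 419.766917] Uhhuh. NMI received for unknown reason 25 on CPU 0.
# (Pattern and names are illustrative assumptions, not an existing tool.)
NMI_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+Uhhuh\. NMI received for unknown reason "
    r"(?P<reason>[0-9a-fA-F]+) on CPU (?P<cpu>\d+)\."
)


def find_unknown_nmis(dmesg_text):
    """Return a (timestamp_seconds, reason, cpu) tuple per unknown-NMI line."""
    return [
        (float(m.group("ts")), m.group("reason"), int(m.group("cpu")))
        for m in NMI_RE.finditer(dmesg_text)
    ]


def intervals(events):
    """Return the seconds elapsed between consecutive NMI events."""
    times = [ts for ts, _, _ in events]
    return [round(later - earlier, 6) for earlier, later in zip(times, times[1:])]
```

Calling `find_unknown_nmis()` on the text of a saved `dmesg` dump and passing the result to `intervals()` gives the spacing between events, which can then be compared against VM boot or ROCm activity in the guest.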
ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-5.13.0-30-generic 5.13.0-30.33~20.04.1
ProcVersionSignature: Ubuntu 5.13.0-30.33~20.04.1-generic 5.13.19
Uname: Linux 5.13.0-30-generic x86_64
ApportVersion: 2.20.11-0ubuntu27.21
Architecture: amd64
CasperMD5CheckResult: skip
Date: Mon Mar 7 10:26:25 2022
InstallationDate: Installed on 2022-01-05 (60 days ago)
InstallationMedia: Ubuntu-Server 18.04.6 LTS "Bionic Beaver" - Release amd64 (20210915)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-signed-hwe-5.13
UpgradeStatus: Upgraded to focal on 2022-01-17 (48 days ago)

** Affects: linux-signed-hwe-5.13 (Ubuntu)
     Importance: Undecided
         Status: New

** Tags: amd64 apport-bug focal uec-images

https://bugs.launchpad.net/bugs/1963893

Title:
  Radeon Pro W5500 in passthrough with vfio generates spurious NMI reason 25

Status in linux-signed-hwe-5.13 package in Ubuntu:
  New