Hi Alison,

On Wed, Apr 16, 2025 at 09:53:18PM -0700, Alison Chaiken wrote:
> When the PC has been sitting idle with the monitor off and I try to
> interact with it, I sometimes find that it is locked up and won't respond to
> keystrokes, ping or sysrq.  Based on the syslog, it looks like the PCIe
> endpoint has gone into the low-power D3cold state and can't be woken.  The
> result is
> 
>      watchdog: Watchdog detected hard LOCKUP on cpu 20
> 
> after which I must press the power button to shutdown and reboot.  Below is
> some output from the syslog.
> 
> *********************************************************************************************************
> 
> 2025-03-30T14:54:01.842776-07:00 schallkreis kernel: pcieport 0000:0f:00.0:
> Unable to change power state from D3cold to D0, device inaccessible
> 2025-03-30T14:54:02.504644-07:00 schallkreis kernel: [drm] Fence fallback
> timer expired on ring sdma0
> 2025-03-30T14:54:02.504652-07:00 schallkreis kernel: [drm] Fence fallback
> timer expired on ring gfx_0.0.0
> 2025-03-30T14:54:02.945988-07:00 schallkreis kernel: [drm] Fence fallback
> timer expired on ring sdma0
> 2025-03-30T14:54:03.387173-07:00 schallkreis kernel: [drm] Fence fallback
> timer expired on ring gfx_0.0.0
> 2025-03-30T14:54:04.049130-07:00 schallkreis kernel: [drm] Fence fallback
> timer expired on ring gfx_0.0.0
> 2025-03-30T14:54:04.049136-07:00 schallkreis kernel: [drm] Fence fallback
> timer expired on ring sdma1
> 2025-03-30T14:54:04.931536-07:00 schallkreis kernel: [drm] Fence fallback
> timer expired on ring gfx_0.0.0
> 2025-03-30T14:54:05.373985-07:00 schallkreis kernel: [drm] Fence fallback
> timer expired on ring gfx_0.0.0
> 2025-03-30T14:54:09.785625-07:00 schallkreis kwin_wayland_wrapper[4583]:
> kwin_libinput: Libinput: event22 - Logitech USB Optical Mouse: client bug:
> event processing lagging behind by 220ms, your system is too slow
> 2025-03-30T14:54:10.888484-07:00 schallkreis kernel: watchdog: Watchdog
> detected hard LOCKUP on cpu 20
> 2025-03-30T14:54:10.888491-07:00 schallkreis kernel: Modules linked in:
> dm_mod cpuid tls qrtr binfmt_misc nls_ascii nls_cp437 vfat fat amd_atl
> intel_rapl_msr intel_rapl_common mt7925e mt7925_common mt792x_lib
> mt76_connac_lib edac_mce_amd snd_hda_codec_realtek mt76 kvm_amd
> snd_hda_codec_generic mac80211 snd_hda_scodec_component snd_hda_codec_hdmi
> snd_hda_intel kvm snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec libarc4
> snd_hda_core cfg80211 crct10dif_pclmul ghash_clmulni_intel sha512_ssse3
> snd_hwdep sha256_ssse3 sha1_ssse3 snd_pcm aesni_intel spd5118 gf128mul
> crypto_simd cryptd snd_timer wmi_bmof sp5100_tco rapl snd ccp pcspkr
> watchdog rfkill k10temp soundcore joydev evdev nfsd auth_rpcgss nfs_acl
> lockd grace sunrpc parport_pc ppdev lp parport configfs efi_pstore nfnetlink
> efivarfs ip_tables x_tables autofs4 ext4 mbcache jbd2 crc32c_generic
> hid_generic usbhid hid amdgpu amdxcp drm_exec gpu_sched drm_buddy
> i2c_algo_bit drm_suballoc_helper drm_display_helper cec rc_core
> drm_ttm_helper ttm xhci_pci xhci_hcd drm_kms_helper ahci libahci r8169
> libata drm nvme
> 2025-03-30T14:54:10.888493-07:00 schallkreis kernel:  thunderbolt realtek
> usbcore crc32_pclmul mdio_devres scsi_mod crc32c_intel nvme_core libphy
> i2c_piix4 i2c_smbus usb_common scsi_common nvme_auth crc16 video wmi
> gpio_amdpt gpio_generic butto2025-03-30T14:54:10.888494-07:00 schallkreis
> kernel: CPU: 20 UID: 1000 PID: 4645 Comm: HDMI-A-1 Tainted: G        W
> 6.12.17-amd64 #1  Debian 6.12.17-1
> 2025-03-30T14:54:10.888501-07:00 schallkreis kernel: Tainted: [W]=WARN
> 2025-03-30T14:54:10.888502-07:00 schallkreis kernel: Hardware name: System76
> Thelio Mira/Thelio Mira, BIOS 3.11.SP01 12/05/2024
> 2025-03-30T14:54:10.888503-07:00 schallkreis kernel: RIP:
> 0010:native_queued_spin_lock_slowpath+0x70/0x2a0
> 2025-03-30T14:54:10.888503-07:00 schallkreis kernel: Code: 77 77 f0 0f ba 2b
> 08 0f 92 c2 8b 03 0f b6 d2 c1 e2 08 30 e4 09 d0 3d ff 00 00 00 77 53 85 c0
> 74 10 0f b6 03 84 c0 74 09 f3 90 <0f> b6 03 84 c0 75 f7 b8 01 00 00 00 66 89
> 03 5b 5d 41 5c 41 5d c3
> 2025-03-30T14:54:10.888504-07:00 schallkreis kernel: RSP:
> 0018:ffffb0d1cb007868 EFLAGS: 00000002
> 2025-03-30T14:54:10.888510-07:00 schallkreis kernel: RAX: 0000000000000001
> RBX: ffff8b47eb080178 RCX: ffff8b4a20fd62d2025-03-30T14:54:10.888510-07:00
> schallkreis kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI:
> ffff8b47eb080172025-03-30T14:54:10.888511-07:00 schallkreis kernel: RBP:
> ffff8b4a20fd6280 R08: 0000000000000080 R09:
> ffff8b4a20fd6282025-03-30T14:54:10.888511-07:00 schallkreis kernel: R10:
> ffffb0d1cb0078e0 R11: 0000000000000004 R12:
> ffff8b47eb080172025-03-30T14:54:10.888511-07:00 schallkreis kernel: R13:
> ffff8b4a20fd62d8 R14: ffff8b4a20fd6080 R15:
> ffff8b4a20fd6282025-03-30T14:54:10.888511-07:00 schallkreis kernel: FS:
> 00007fc576ffd6c0(0000) GS:ffff8b565fc00000(0000) knlGS:0000000000000000
> 2025-03-30T14:54:10.888512-07:00 schallkreis kernel: CS:  0010 DS: 0000 ES:
> 0000 CR0: 0000000080050033
> 2025-03-30T14:54:10.888513-07:00 schallkreis kernel: CR2: 00007f8f80d1aef0
> CR3: 0000000129824000 CR4: 0000000000f50ef2025-03-30T14:54:10.888513-07:00
> schallkreis kernel: PKRU: 55555554
> 2025-03-30T14:54:10.888514-07:00 schallkreis kernel: Call Trace:
> 2025-03-30T14:54:10.888514-07:00 schallkreis kernel:  <NMI>
> 2025-03-30T14:54:10.888514-07:00 schallkreis kernel:  ?
> watchdog_hardlockup_check.cold+0x100/0x105
> 2025-03-30T14:54:10.888514-07:00 schallkreis kernel:  ?
> __perf_event_overflow+0x10c/0x320
> 2025-03-30T14:54:10.888514-07:00 schallkreis kernel:  ?
> amd_pmu_v2_handle_irq+0x2a1/0x3c0
> 2025-03-30T14:54:10.888515-07:00 schallkreis kernel:  ?
> perf_event_nmi_handler+0x2a/0x50
> 2025-03-30T14:54:10.888515-07:00 schallkreis kernel:  ?
> nmi_handle+0x5e/0x120
> 2025-03-30T14:54:10.888515-07:00 schallkreis kernel:  ?
> default_do_nmi+0x40/0x130
> 2025-03-30T14:54:10.888515-07:00 schallkreis kernel:  ? exc_nmi+0x122/0x1a0
> 2025-03-30T14:54:10.888516-07:00 schallkreis kernel:  ?
> end_repeat_nmi+0xf/0x53
> 2025-03-30T14:54:10.888516-07:00 schallkreis kernel:  ?
> native_queued_spin_lock_slowpath+0x70/0x2a0
> 2025-03-30T14:54:10.888517-07:00 schallkreis kernel:  ?
> native_queued_spin_lock_slowpath+0x70/0x2a0
> 2025-03-30T14:54:10.888517-07:00 schallkreis kernel:  ?
> native_queued_spin_lock_slowpath+0x70/0x2a0
> 2025-03-30T14:54:10.888517-07:00 schallkreis kernel:
> </NMI2025-03-30T14:54:10.888518-07:00 schallkreis kernel:  <TASK>
> 2025-03-30T14:54:10.888518-07:00 schallkreis kernel:
> _raw_spin_lock_irqsave+0x3d/0x50
> 2025-03-30T14:54:10.888518-07:00 schallkreis kernel:
> drm_event_reserve_init+0x2f/0xc0 [drm]
> 2025-03-30T14:54:10.888518-07:00 schallkreis kernel:
> drm_mode_atomic_ioctl+0x61d/0xcb0 [drm]
> 2025-03-30T14:54:10.888519-07:00 schallkreis kernel:  ?
> __pfx_drm_mode_atomic_ioctl+0x10/0x10 [drm]
> 2025-03-30T14:54:10.888519-07:00 schallkreis kernel:
> drm_ioctl_kernel+0xad/0x100 [drm]
> 2025-03-30T14:54:10.888519-07:00 schallkreis kernel:  drm_ioctl+0x277/0x4f0
> [drm]
> 2025-03-30T14:54:10.888520-07:00 schallkreis kernel:  ?
> __pfx_drm_mode_atomic_ioctl+0x10/0x10 [drm]
> 2025-03-30T14:54:10.888520-07:00 schallkreis kernel:
> amdgpu_drm_ioctl+0x4b/0x80 [amdgpu]
> 2025-03-30T14:54:10.888520-07:00 schallkreis kernel:
> __x64_sys_ioctl+0x91/0xd0
> 2025-03-30T14:54:10.888520-07:00 schallkreis kernel:
> do_syscall_64+0x82/0x190
> 2025-03-30T14:54:10.888521-07:00 schallkreis kernel:  ?
> __x64_sys_poll+0xd0/0x180
> 2025-03-30T14:54:10.888521-07:00 schallkreis kernel:  ?
> syscall_exit_to_user_mode+0x4d/0x210
> 2025-03-30T14:54:10.888521-07:00 schallkreis kernel:  ?
> do_syscall_64+0x8e/0x190
> 2025-03-30T14:54:10.888522-07:00 schallkreis kernel:  ?
> restore_fpregs_from_fpstate+0x3c/0xa0
> 2025-03-30T14:54:10.888522-07:00 schallkreis kernel:  ?
> switch_fpu_return+0x4e/0xd0
> 2025-03-30T14:54:10.888522-07:00 schallkreis kernel:  ?
> syscall_exit_to_user_mode+0x172/0x210
> 2025-03-30T14:54:10.888523-07:00 schallkreis kernel:  ?
> do_syscall_64+0x8e/0x190
> 2025-03-30T14:54:10.888523-07:00 schallkreis kernel:  ? do_futex+0x125/0x190
> 2025-03-30T14:54:10.888524-07:00 schallkreis kernel:  ?
> __x64_sys_futex+0x127/0x1e0
> 2025-03-30T14:54:10.888524-07:00 schallkreis kernel:  ?
> syscall_exit_to_user_mode+0x4d/0x210
> 2025-03-30T14:54:10.888524-07:00 schallkreis kernel:  ?
> do_syscall_64+0x8e/0x190
> 2025-03-30T14:54:10.888524-07:00 schallkreis kernel:  ?
> __x64_sys_poll+0xd0/0x180
> 2025-03-30T14:54:10.888525-07:00 schallkreis kernel:  ?
> syscall_exit_to_user_mode+0x4d/0x210
> 2025-03-30T14:54:10.888525-07:00 schallkreis kernel:  ?
> do_syscall_64+0x8e/0x190
> 2025-03-30T14:54:10.888525-07:00 schallkreis kernel:  ?
> syscall_exit_to_user_mode+0x4d/0x210
> 2025-03-30T14:54:10.888526-07:00 schallkreis kernel:  ?
> do_syscall_64+0x8e/0x190
> 2025-03-30T14:54:10.888526-07:00 schallkreis kernel:  ?
> syscall_exit_to_user_mode+0x4d/0x210
> 2025-03-30T14:54:10.888526-07:00 schallkreis kernel:  ?
> do_syscall_64+0x8e/0x190
> 2025-03-30T14:54:10.888527-07:00 schallkreis kernel:  ?
> do_syscall_64+0x8e/0x190
> 2025-03-30T14:54:10.888527-07:00 schallkreis kernel:  ?
> do_syscall_64+0x8e/0x190
> 2025-03-30T14:54:10.888527-07:00 schallkreis kernel:
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 2025-03-30T14:54:10.888528-07:00 schallkreis kernel: RIP:
> 0033:0x7fc5ce5168db
> 2025-03-30T14:54:10.888528-07:00 schallkreis kernel: Code: 00 48 89 44 24 18
> 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48
> 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00
> f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
> 2025-03-30T14:54:10.888528-07:00 schallkreis kernel: RSP:
> 002b:00007fc576ffc3b0 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000012025-03-30T14:54:10.888529-07:00 schallkreis kernel: RAX:
> ffffffffffffffda RBX: 00007fc56c00bf20 RCX:
> 00007fc5ce5168d2025-03-30T14:54:10.888529-07:00 schallkreis kernel: RDX:
> 00007fc576ffc4a0 RSI: 00000000c03864bc RDI:
> 0000000000000012025-03-30T14:54:10.888529-07:00 schallkreis kernel: RBP:
> 00007fc576ffc4a0 R08: 00007fc56c00ebd0 R09:
> 00007fc56c0084a2025-03-30T14:54:10.888530-07:00 schallkreis kernel: R10:
> 0000000000000000 R11: 0000000000000246 R12: 00000000c03864bc
> 2025-03-30T14:54:10.888530-07:00 schallkreis kernel: R13: 0000000000000012
> R14: 00007fc56c01983c R15: 00007fc56c008520
> 2025-03-30T14:54:10.888530-07:00 schallkreis kernel:  </TASK>
> 2025-03-30T14:54:10.888530-07:00 schallkreis kernel: watchdog: Watchdog
> detected hard LOCKUP on cpu 8
> 2025-03-30T14:54:10.888531-07:00 schallkreis kernel: Modules linked in:
> dm_mod cpuid tls qrtr binfmt_misc nls_ascii nls_cp437 vfat fat amd_atl
> intel_rapl_msr intel_rapl_common mt7925e mt7925_common mt792x_lib
> mt76_connac_lib edac_mce_amd snd_hda_codec_realtek
> mt76 kvm_amd snd_hda_codec_generic mac80211 snd_hda_scodec_component
> snd_hda_codec_hdmi snd_hda_intel kvm snd_intel_dspcfg
> snd_intel_sdw_acpi snd_hda_codec libarc4 snd_hda_core cfg80211
> crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 snd_hwdep sha256_ssse3
> sha1_ssse3 snd_pcm aesni_intel spd5118 gf128mul crypto_simd
> cryptd snd_timer wmi_bmof sp5100_tco rapl snd ccp pcspkr watchdog rfkill
> k10temp soundcore joydev evdev nfsd auth_rpcgss nfs_acl lockd grace sunrpc
> parport_pc ppdev lp parport
> configfs efi_pstore nfnetlink efivarfs ip_tables x_tables autofs4 ext4
> mbcache jbd2 crc32c_generic hid_generic usbhid hid amdgpu amdxcp drm_exec
> gpu_sched drm_buddy i2c_algo_bit drm_suballoc_helper
> drm_display_helper cec rc_core drm_ttm_helper ttm xhci_pci
> xhci_hcd drm_kms_helper ahci libahci r8169 libata drm nvme
> 2025-03-30T14:54:10.888532-07:00 schallkreis kernel:  thunderbolt realtek
> usbcore crc32_pclmul mdio_devres scsi_mod crc32c_intel nvme_core libphy
> i2c_piix4 i2c_smbus usb_common scsi_common
> nvme_auth crc16 video wmi gpio_amdpt gpio_generic
> butto2025-03-30T14:54:10.888532-07:00 schallkreis kernel: CPU: 8 UID: 0 PID:
> 0 Comm: swapper/8 Tainted: G        W
>           6.12.17-amd64 #1  Debian 6.12.17-2025-03-30T14:54:10.888532-07:00
> schallkreis kernel: Tainted: [W]=WARN
> 2025-03-30T14:54:10.888533-07:00 schallkreis kernel: Hardware name: System76
> Thelio Mira/Thelio Mira, BIOS 3.11.SP01 12/05/2024
> 
> -- System Information:
> Debian Release: trixie/sid
>   APT prefers testing-debug
>   APT policy: (500, 'testing-debug'), (500, 'testing')
> Architecture: amd64 (x86_64)
> 
> Kernel: Linux 6.12.19-amd64 (SMP w/24 CPU threads; PREEMPT)
> Kernel taint flags: TAINT_WARN
> Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8),
>  LANGUAGE not set
> Shell: /bin/sh linked to /usr/bin/dash
> Init: systemd (via /run/systemd/system)
> LSM: AppArmor: enabled
> 
> ************************************************************************************
> 
> Since the system is locking up, I have no dmesg.  I uploaded a compressed
> syslog which contains 2 occurrences of the same failure to
> 
>             https://she-devel.com/Thelio-Mira-syslog.tar.xz
> 
> The problem has persisted with 6.12.21-amd64.  Since the failure starts with
> PCIe at 0000:0f, below is some more related output in which that PCIe link
> appears to attach to USB.
> 
> [alison@schallkreis ~]$ lspci -vv -t
> -[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge
> Root Complex
>            +-00.2  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge
> IOMMU
>            +-01.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge
> Dummy Host Bridge
>            +-01.1-[01-03]----00.0-[02-03]----00.0-[03]--+-00.0  Advanced
> Micro Devices, Inc. [AMD/ATI] Navi 33 [Radeon RX 7600/7600 XT/7600M
> XT/7600S/7700S / PRO W7600]
>            |                                            \-00.1  Advanced
> Micro Devices, Inc. [AMD/ATI] Navi 31 HDMI/DP Audio
>            +-01.2-[04]----00.0  Micron/Crucial Technology T700 NVMe PCIe SSD
>            +-02.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge
> Dummy Host Bridge
>            +-02.1-[05-0e]----00.0-[06-0e]--+-00.0-[07]--
>            |                               +-04.0-[08]--
>            |                               +-08.0-[09]--
>            |                               +-09.0-[0a]----00.0  ASMedia
> Technology Inc. ASM1061/ASM1062 Serial ATA Controller
>            |                               +-0a.0-[0b]----00.0  MEDIATEK
> Corp. Device 0717
>            |                               +-0b.0-[0c]----00.0  Realtek
> Semiconductor Co., Ltd. RTL8125 2.5GbE Controller
>            |                               +-0c.0-[0d]----00.0  Advanced
> Micro Devices, Inc. [AMD] Device 43fc
>            |                               \-0d.0-[0e]----00.0  Advanced
> Micro Devices, Inc. [AMD] 600 Series Chipset SATA Controller
>            +-02.2-[0f-72]----00.0-[10-72]--+-00.0-[11-40]--
>            |                               +-01.0-[41-70]--
>            |                               +-02.0-[71]----00.0  ASMedia
> Technology Inc. ASM4242 USB 3.2 xHCI Controller
>            |                               \-03.0-[72]----00.0  ASMedia
> Technology Inc. ASM4242 USB 4 / Thunderbolt 3 Host Router
>            +-03.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge
> Dummy Host Bridge
>            +-04.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge
> Dummy Host Bridge
>            +-08.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge
> Dummy Host Bridge
>            +-08.1-[73]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI]
> Granite Ridge [Radeon Graphics]
>            |            +-00.1  Advanced Micro Devices, Inc. [AMD/ATI]
> Rembrandt Radeon High Definition Audio Controller
>            |            +-00.2  Advanced Micro Devices, Inc. [AMD] Family
> 19h PSP/CCP
>            |            +-00.3  Advanced Micro Devices, Inc. [AMD]
> Raphael/Granite Ridge USB 3.1 xHCI
>            |            +-00.4  Advanced Micro Devices, Inc. [AMD]
> Raphael/Granite Ridge USB 3.1 xHCI
>            |            \-00.6  Advanced Micro Devices, Inc. [AMD] Family
> 17h/19h/1ah HD Audio Controller
>            +-08.3-[74]----00.0  Advanced Micro Devices, Inc. [AMD]
> Raphael/Granite Ridge USB 2.0 xHCI
>            +-14.0  Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
>            +-14.3  Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
>            +-18.0  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge
> Data Fabric; Function 0
>            +-18.1  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge
> Data Fabric; Function 1
>            +-18.2  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge
> Data Fabric; Function 2
>            +-18.3  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge
> Data Fabric; Function 3
>            +-18.4  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge
> Data Fabric; Function 4
>            +-18.5  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge
> Data Fabric; Function 5
>            +-18.6  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge
> Data Fabric; Function 6
>            \-18.7  Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge
> Data Fabric; Function 7
> 
> To investigate, I wrote a bpftrace script and a systemd unit to run it which
> are at
> 
> https://github.com/chaiken/BPF-sandbox/commit/e0dda39cbe92e0f80805a2a06aa80d6fb3b065b2
> 
> With these I'm 99% sure that apcupsd is triggering the problem by waking the
> USB bus every 15 seconds.  I started the tracer and stopped apcupsd.  If
> the system lasts a week without locking up, I'll restart apcupsd and see if I
> can figure out how it's triggering the kernel failure.  So far the system
> has been stable for 4 days.

Thanks for providing those updates. Please let us know once you have
further confidence about the above.

Regards,
Salvatore
