On 5/13/25 21:51, Mario Limonciello wrote:
> On 5/13/2025 2:45 PM, Bjorn Helgaas wrote:
>> From Denis's report at https://bugzilla.kernel.org/show_bug.cgi?id=220110:
>>
>>> I am having problems with my laptop that has a thunderbolt
>>> controller to which I connected an AMD 6750XT.
>>>
>>> The topology of my system is described in this bug:
>>> https://gitlab.freedesktop.org/drm/amd/-/issues/4014 yet I don't
>>> know if this is related or not.
>>>
>>> I experienced PC attempting to enter s2idle while playing a YT
>>> video; PC has become totally unresponsive to input in any
>>> keyboard/mouse and power button after turning off screens attached
>>> to the AMD card (the built-in screen was off already).
>>>
>>> From a look at the logs it appears one uncorrectible AER pci error
>>> triggered a pci root reset, and that comes with a bug where the
>>> usage counter assumes a wrong value; this in turn seems to cause all
>>> sorts of weird bugs.
>>>
>>> That however is my interpretation of the attached log, that might be
>>> very wrong.
>>>
>>> This is the first time I experience this bug in a year with this
>>> laptop and I don't know how easy it is to reproduce.
>>>
>>> The kernel has been compiled from sources and it has
>>>
>>> [PATCH v2] PCI: Explicitly put devices into D0 when initializing
>>> [PATCH v4] PCI/PM: Put devices to low power state on shutdown
>>>
>>> as I am helping testing things. I find unlikely any of those might
>>> cause these issues especially "PCI: Explicitly put devices into D0
>>> when initializing" that has been there for a few weeks now.
>>>
>>> Thanks in advice to whoever will help me.
>
> From the logs the system didn't actually enter s2idle, but because of the
> failure to recover after AER he lost the external GPU.
>
> I don't expect that "PCI/PM: Put devices to low power state on shutdown" has
> anything to do with this issue. This should only affect system shutdown.
> (Tangentially related comment; we have another version of this on the
> linux-pm list now that is more generic [1]).
>
> How readily can this be reproduced? Can you try to reproduce once more?
> Can this reproduce on an unpatched kernel?
>

I have tried many different unpatched and patched 6.14.6 kernels for a
few hours and I could not get this same bug again.
After unsuccessfully attempting to reproduce the bug with the kernel I
have been running, I decided to test the newest "PM: Use hibernate flows
for system power off" patch [1]. That patch seems to help my laptop
power off quickly when combined with the other mentioned patch.

> To confirm if "PCI: Explicitly put devices into D0 when initializing" is the
> cause can you compare the PCI state of all devices from sysfs with and
> without the patch in place after bootup? Basically run this in patched
> kernel and unpatched kernel and let's compare.
>
> $ grep -v foo /sys/bus/pci/devices/*/power_state
>

unpatched: https://pastebin.com/Ym31Vjh6

patched with just "PCI: Explicitly put devices into D0 when initializing":
https://pastebin.com/SSSWLgcs

diff for easy view: https://www.diffchecker.com/y5GVyEG1/

Two devices were in D3hot and two were unknown; with the patch they are
now recognized as D0.

Having those two patches together does not seem to cause any harm, and I
could not reproduce the issue.

I do not believe any of those patches caused the particular crash I
experienced. However, I do believe something is wrong, because at
power-on the amdgpu on the Thunderbolt card is sometimes present and
sometimes not, and I have to unplug and replug it for it to work.

The only patch that alleviates this particular problem is [2]
"[PATCH v3] PCI: Prevent power state transition of erroneous device",
but it comes with a regression where I can no longer wake up the laptop
properly. I will describe this in detail in a response to that patch,
since it is not part of the subject here.

[2] https://lore.kernel.org/linux-pci/[email protected]/T/#m90fb151a4ab4af5ec8c667a27eb98bf43a9942dc

> [1]
> https://lore.kernel.org/linux-pm/[email protected]/T/#u
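
For anyone repeating the power_state comparison requested above, it can
also be done locally instead of through an online diff tool. What follows
is only a minimal sketch, assuming the output of the quoted grep command
was saved as text on each kernel. The function names and the sample
device addresses are hypothetical and are not taken from the pastebin
logs.

```python
def parse_power_states(text):
    """Parse grep output lines of the form
    '/sys/bus/pci/devices/<address>/power_state:<state>'
    into a dict mapping device address to state."""
    states = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # PCI addresses contain colons, so split on the last one only.
        path, _, state = line.rpartition(":")
        device = path.split("/")[-2]
        states[device] = state
    return states


def diff_power_states(before, after):
    """Return {device: (old_state, new_state)} for devices whose state differs."""
    changes = {}
    for device in sorted(set(before) | set(after)):
        old = before.get(device, "absent")
        new = after.get(device, "absent")
        if old != new:
            changes[device] = (old, new)
    return changes


# Made-up sample data, for illustration only.
unpatched = parse_power_states(
    "/sys/bus/pci/devices/0000:00:01.1/power_state:D3hot\n"
    "/sys/bus/pci/devices/0000:00:02.0/power_state:D0\n"
    "/sys/bus/pci/devices/0000:03:00.0/power_state:unknown\n"
)
patched = parse_power_states(
    "/sys/bus/pci/devices/0000:00:01.1/power_state:D0\n"
    "/sys/bus/pci/devices/0000:00:02.0/power_state:D0\n"
    "/sys/bus/pci/devices/0000:03:00.0/power_state:D0\n"
)

for device, (old, new) in diff_power_states(unpatched, patched).items():
    print(f"{device}: {old} -> {new}")
```

Devices that appear in only one of the two captures are reported as
"absent" on the other side, which may help spot a device that did not
enumerate at all on a given boot.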
