On Thu, Apr 3, 2025 at 3:13 PM Mario Limonciello <[email protected]> wrote:
>
> On 4/3/2025 10:48 AM, Alex Deucher wrote:
> > On Wed, Apr 2, 2025 at 11:12 PM Mario Limonciello <[email protected]>
> > wrote:
> >>
> >> From: Mario Limonciello <[email protected]>
> >>
> >> AMD RX580 when added AMD Phenom 2 has problems with overheating. This is
> >> due to
> >
> > I don't think this is entirely accurate. I think the GPU gets hot
> > because the device hangs due to a problem with changing the PCIe
> > clocks.
> >
> >> changes with PCIe dynamic switching introduced by commit 466a7d115326e
> >> ("drm/amd: Use the first non-dGPU PCI device for BW limits").
> >>
> >> To avoid risks of other issues with old hardware require at least Zen
> >> hardware
> >> for AMD side to enable PCIe dynamic switching.
> >
> > I'm pretty sure PCIe reclocking worked on pre-Zen hardware. We've
> > supported this on our GPUs going back at least 15 or more years. I
> > suspect the actual problem is that some links may not reliably train
> > at the full bandwidth on some motherboards. Forcing a higher link
> > speed may cause problems.
>
> That seems odd to me it would advertise a higher link speed than it
> could train at.

That's why we train the link: to determine what speed is reliable.
It could be that there is a marginal trace on the motherboard that has
deteriorated over time or was never reliable to begin with. It would
be interesting to know if the link used to work reliably on this
board.

>
> > Maybe it would be better to limit the max
> > PCIe link rate to whatever the link is currently trained to. IIRC,
> > PCIe links will train at the fastest link possible by default. The
> > previous behavior was to limit the max clock to the slowest link in
> > the topology to save power, but then we changed it to use the fastest
> > link possible based on the PCIe link caps. Perhaps limiting it to the
> > fastest currently trained link rate would be better.
>
> I mean that's essentially what happens when
> amdgpu_device_pcie_dynamic_switching_supported() returns that it doesn't
> work.

I mean rather than checking the PCIe caps, check the current link
speed instead. pcie_bandwidth_available() returns the speed and lanes
of the slowest link in the topology; what we want is the current speed
that the link upstream of the GPU is trained at. If there is no
USB4/TB or limited speed bridge upstream of the GPU, then that
function should return the current speed of the link, which would be
fine. The problem is that
amdgpu_device_pcie_dynamic_switching_supported() returning false
disables PCIe DPM, so we don't dynamically change the PCIe speed/lanes
at runtime. I suspect PCIe DPM would work fine as long as we don't go
past the speed the link is currently trained at.

>
> If your theory is right; maybe what we really need is a pile of DMI
> quirks for M/B that are having this problem.

Depends on whether it's a general problem or something specific to
this particular board, i.e., the slot on this board has deteriorated.
I think what we want is to enable PCIe DPM, but just limit the link to
the max current speed rather than the max speed.
If the links are reliable the links should train at the max speed on
power up.

Alex

>
> >
> > Alex
> >
> >>
> >> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4098
> >> Fixes: 466a7d115326e ("drm/amd: Use the first non-dGPU PCI device for BW
> >> limits")
> >> Signed-off-by: Mario Limonciello <[email protected]>
> >> ---
> >> v2:
> >>  * Cover more hardware
> >> ---
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
> >>  1 file changed, 5 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> index a30111d2c3ea0..caa44ee788c8f 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >> @@ -1854,6 +1854,9 @@ bool amdgpu_device_seamless_boot_supported(struct
> >> amdgpu_device *adev)
> >>   *
> >>   *
> >> https://edc.intel.com/content/www/us/en/design/products/platforms/details/raptor-lake-s/13th-generation-core-processors-datasheet-volume-1-of-2/005/pci-express-support/
> >>   * https://gitlab.freedesktop.org/drm/amd/-/issues/2663
> >> + *
> >> + * AMD Phenom II X6 1090T has a similar issue
> >> + * https://gitlab.freedesktop.org/drm/amd/-/issues/4098
> >>   */
> >>  static bool amdgpu_device_pcie_dynamic_switching_supported(struct
> >> amdgpu_device *adev)
> >>  {
> >> @@ -1866,6 +1869,8 @@ static bool
> >> amdgpu_device_pcie_dynamic_switching_supported(struct amdgpu_device
> >>
> >>         if (c->x86_vendor == X86_VENDOR_INTEL)
> >>                 return false;
> >> +       if (c->x86_vendor == X86_VENDOR_AMD &&
> >> !cpu_feature_enabled(X86_FEATURE_ZEN))
> >> +               return false;
> >>  #endif
> >>         return true;
> >>  }
> >> --
> >> 2.43.0
> >>
>
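
For illustration only: the idea discussed in this thread, keeping PCIe
DPM enabled but capping its ceiling at the speed the link is currently
trained at instead of the advertised capability, could be sketched
roughly as below. This is a standalone userspace model, not amdgpu
code. dpm_speed_ceiling() is a hypothetical helper, and the two masks
are assumed to mirror the Max Link Speed and Current Link Speed fields
(bits 3:0) of the PCIe Link Capabilities and Link Status registers. In
the kernel the raw register values would be read from the port
upstream of the GPU rather than passed in as arguments.

```c
#include <stdint.h>

/* Max Link Speed field of the Link Capabilities register (bits 3:0). */
#define LNKCAP_MAX_SPEED_MASK 0x0000000fu
/* Current Link Speed field of the Link Status register (bits 3:0). */
#define LNKSTA_CUR_SPEED_MASK 0x000fu

/*
 * Hypothetical helper: return the highest link-speed encoding
 * (1 = 2.5 GT/s, 2 = 5 GT/s, 3 = 8 GT/s, ...) that DPM should be
 * allowed to request. The ceiling is the currently trained speed,
 * never more than the advertised capability.
 */
static unsigned int dpm_speed_ceiling(uint32_t lnkcap, uint16_t lnksta)
{
	unsigned int cap = lnkcap & LNKCAP_MAX_SPEED_MASK;
	unsigned int cur = lnksta & LNKSTA_CUR_SPEED_MASK;

	/* 0 is not a valid speed encoding; fall back to the other field. */
	if (cur == 0)
		return cap;
	if (cap == 0)
		return cur;

	return cur < cap ? cur : cap;
}
```

With a link that advertises 8 GT/s but only trained at 2.5 GT/s,
dpm_speed_ceiling(0x3, 0x1) yields 1, so DPM would never request more
than the speed the link actually came up at.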
