Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider: - [High] The driver permanently masks the Surprise Link Down AER event (PCI_ERR_UNC_SURPDN) on the upstream switch port without ever restoring it. - [High] Lockless read-modify-write of the shared Upstream Switch Port's AER mask register creates a race condition. - [High] Unchecked pci_read_config_dword return value may corrupt the AER mask register. - [High] Attempting to mask Surprise Link Down on an Upstream Port violates the PCIe specification and fails to suppress the error. --
commit b4b812ffca5eaea995b148e8a053e6c4fd45bfe9 Author: Mallesh Koujalagi <[email protected]> drm/xe: Suppress Surprise Link Down on device This patch attempts to suppress spurious Surprise Link Down AER events during a recovery cold reset. It does so by masking the PCI_ERR_UNC_SURPDN bit in the upstream switch port's AER Uncorrectable Error Mask register before triggering the cold reset. > diff --git a/drivers/gpu/drm/xe/xe_ras.c b/drivers/gpu/drm/xe/xe_ras.c > index 9a90a7118e896..acdedf403649f 100644 > --- a/drivers/gpu/drm/xe/xe_ras.c > +++ b/drivers/gpu/drm/xe/xe_ras.c [ ... ] > +#ifdef CONFIG_PCIEAER > +static void pcie_suppress_surprise_link_down(struct pci_dev *usp) > +{ > + u32 aer_uncorr_mask; > + u16 aer_cap; > + > + aer_cap = usp->aer_cap; > + if (!aer_cap) { > + dev_dbg(&usp->dev, > + "AER capability not present\n"); > + return; > + } > + > + pci_read_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, > &aer_uncorr_mask); [Severity: High] Does this code risk corrupting the AER mask register if the read fails? If pci_read_config_dword() fails, for example due to temporary bus unavailability, it can leave aer_uncorr_mask uninitialized or set it to all 1s. Writing that value back could inadvertently mask all uncorrectable errors or write garbage to the register. > + aer_uncorr_mask |= PCI_ERR_UNC_SURPDN; > + pci_write_config_dword(usp, aer_cap + PCI_ERR_UNCOR_MASK, > aer_uncorr_mask); [Severity: High] Is it safe to perform a lockless read-modify-write on a shared parent bridge's configuration space? Since the upstream switch port is outside the xe driver's ownership, if the AER core or another driver concurrently modifies this mask register without holding pci_dev_lock() on the upstream port, could one of the updates be silently lost? [Severity: High] Will this permanently mask the Surprise Link Down AER event on the upstream switch port? The configuration space of the port above the cold reset boundary isn't reset by the slot power cycle, and there does not appear to be any corresponding code to unmask this bit later. Could this permanently blind the port to legitimate Surprise Link Down errors until the next system reboot? > + dev_dbg(&usp->dev, "Surprise Link Down masked for cold reset\n"); > +} > +#endif /* CONFIG_PCIEAER */ > + > +static void punit_error_handler(struct xe_device *xe) > +{ > +#ifdef CONFIG_PCIEAER > + struct pci_dev *pdev = to_pci_dev(xe->drm.dev); > + struct pci_dev *vsp, *usp; > + > + /* > + * Device Hierarchy: > + * > + * Root Port --> Upstream Switch Port (USP) --> Virtual Switch Port > (VSP) --> SGunit > + * > + * Cold reset power-cycles the slot, dropping the PCIe link. The > + * slot triggers a spurious Surprise Link Down AER event on the USP. > + */ > + vsp = pci_upstream_bridge(pdev); > + usp = vsp ? pci_upstream_bridge(vsp) : NULL; > + > + if (usp) > + pcie_suppress_surprise_link_down(usp); [Severity: High] Will masking the Surprise Link Down bit on the Upstream Switch Port actually suppress the intended error? According to the PCIe Base Specification, the Surprise Down Error Status bit is mandatory for Downstream Ports and reserved for all other port types. Since the Downstream Port (vsp) facing the slot is what actually detects the link drop and triggers the AER event, does writing to a reserved bit in the upstream port's config space fail to suppress the spurious AER event? > +#endif > xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_COLD_RESET); > xe_device_declare_wedged(xe); > } -- Sashiko AI review ยท https://sashiko.dev/#/patchset/[email protected]?part=5
