On 5/11/22 4:40 PM, Bjorn Helgaas wrote:
> On Mon, Apr 18, 2022 at 03:02:37PM +0000, Kuppuswamy Sathyanarayanan wrote:
> > Currently the aer_irq() handler returns IRQ_NONE when neither
> > PCI_ERR_ROOT_UNCOR_RCV nor PCI_ERR_ROOT_COR_RCV is set. But this
> > assumption is incorrect.
> >
> > Consider a scenario where aer_irq() is triggered for a correctable
> > error. If the same kind of error is triggered again while we process
> > the first one, before we clear the error status in the "Root Error
> > Status" register, the multi-bit error is left intact, because
> > aer_irq() only clears the events it saw. This causes the interrupt
> > to fire again, so aer_irq() is entered with just the multi-bit error
> > logged in the "Root Error Status" register.
> >
> > Repeated AER recovery tests have revealed that this condition does
> > happen, and that it prevents any new interrupt from being triggered.
> > Allow the interrupt to be processed even if only the
> > multi-correctable (BIT 1) or multi-uncorrectable (BIT 3) bit is set.
> >
> > Also note that when only the multi-bit error is set, this is not the
> > first occurrence of the error, so PCI_ERR_ROOT_ERR_SRC may hold zero
> > or some junk value. So we cannot cleanly process this error
> > information using aer_isr_one_error(). All we are attempting with
> > this fix is to make sure error interrupt processing can continue in
> > this scenario.
> >
> > This error can be reproduced by making the following changes to the
> > aer_irq() function and by executing the given test commands.
> >
> >  static irqreturn_t aer_irq(int irq, void *context)
> >  	struct aer_err_source e_src = {};
> >
> >  	pci_read_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, &e_src.status);
> > +	pci_dbg(pdev->port, "Root Error Status: %04x\n",
> > +		e_src.status);
> >  	if (!(e_src.status & AER_ERR_STATUS_MASK))
>
> Do you mean
>
>   if (!(e_src.status & (PCI_ERR_ROOT_UNCOR_RCV|PCI_ERR_ROOT_COR_RCV)))
>
> here? AER_ERR_STATUS_MASK would be after this fix.

Yes, you are correct. Do you want me to update it, along with the Fixes
tag, and send the next version?

> >  		return IRQ_NONE;
> > +	mdelay(5000);
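
For reference, here is a minimal userspace sketch of the status check under discussion. It is not the kernel code. The four bit values match include/uapi/linux/pci_regs.h, but OLD_MASK, NEW_MASK, and aer_irq_would_handle() are hypothetical names used only here, and NEW_MASK assumes the fix simply adds the two multi-bit flags to the mask, as the commit message describes.

```c
#include <assert.h>
#include <stdint.h>

/* Root Error Status bits (values as in include/uapi/linux/pci_regs.h). */
#define PCI_ERR_ROOT_COR_RCV		0x00000001u	/* BIT 0 */
#define PCI_ERR_ROOT_MULTI_COR_RCV	0x00000002u	/* BIT 1 */
#define PCI_ERR_ROOT_UNCOR_RCV		0x00000004u	/* BIT 2 */
#define PCI_ERR_ROOT_MULTI_UNCOR_RCV	0x00000008u	/* BIT 3 */

/* Hypothetical names: the check before the fix ... */
#define OLD_MASK	(PCI_ERR_ROOT_UNCOR_RCV | PCI_ERR_ROOT_COR_RCV)
/* ... and the check assumed after the fix (multi-bit flags included). */
#define NEW_MASK	(OLD_MASK | PCI_ERR_ROOT_MULTI_COR_RCV | \
			 PCI_ERR_ROOT_MULTI_UNCOR_RCV)

/*
 * Returns 1 if a handler using 'mask' would go on to process 'status',
 * 0 if it would bail out early (the IRQ_NONE path in the quoted code).
 */
static int aer_irq_would_handle(uint32_t status, uint32_t mask)
{
	return (status & mask) != 0;
}

/* Walks through the scenario from the commit message. */
static void aer_mask_demo(void)
{
	/* Only the multi-correctable bit is left set after the first pass. */
	uint32_t status = PCI_ERR_ROOT_MULTI_COR_RCV;

	assert(!aer_irq_would_handle(status, OLD_MASK));	/* stuck */
	assert(aer_irq_would_handle(status, NEW_MASK));		/* handled */
}
```

With only BIT 1 set, the old mask takes the early-return path and the status is never cleared, whereas the wider mask lets processing continue.
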

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
