On Wed, 22 Sep 2010, Michael Madore wrote:
> > There are a total of 9 systems with identical hardware. Two are known
> > to have exhibited the error once. On a third system the error is
> > cropping up repeatedly at seemingly random intervals.
> >
> > Before:
> >
> > # cat /proc/interrupts
> >            CPU0       CPU1       CPU2       CPU3
> >   0:     177627          0          0          0   IO-APIC-edge   timer
> >   1:          8          0          0          0   IO-APIC-edge   i8042
> >   8:          0          0          0          0   IO-APIC-edge   rtc
> >   9:          0          0          0          0   IO-APIC-level  acpi
> >  12:          4          0          0          0   IO-APIC-edge   i8042
> > 169:          0          0          0          0   IO-APIC-level  ohci_hcd:usb1, ohci_hcd:usb2
> > 177:       4765          0       2075      12617   IO-APIC-level  sata_mv
> > 185:        293       3314          0        930   IO-APIC-level  eth0
> > 193:       1185          0     652487          0   IO-APIC-level  eth1
> > 201:       1066       3754          0          0   IO-APIC-level  eth2
> > 209:         34       1243          0       9583   IO-APIC-level  eth3
> > NMI:        291         86        102         95
> > LOC:     177279     177207     177135     176937
> > ERR:          0
> > MIS:          0
> >
> >
> > After:
> >
> > cat /proc/interrupts
> >            CPU0       CPU1       CPU2       CPU3
> >   0: 1621763365          0          0          0   IO-APIC-edge   timer
> >   1:          8          0          0          0   IO-APIC-edge   i8042
> >   8:          0          0          0          0   IO-APIC-edge   rtc
> >   9:          0          0          0          0   IO-APIC-level  acpi
> >  12:          4          0          0          0   IO-APIC-edge   i8042
> > 169:          0          0          0          0   IO-APIC-level  ohci_hcd:usb1, ohci_hcd:usb2
> > 177:       4667     445309   58964857   13563559   IO-APIC-level  sata_mv
> > 185:        700   45404048   15552242    2403191   IO-APIC-level  eth0
> > 193:       1368   59938104 1437996227 2249214980   IO-APIC-level  eth1
> > 201:       3164   45011211    3564856    1776851   IO-APIC-level  eth2
> > 209:         32    6293899  170449817   43256252   IO-APIC-level  eth3
> > NMI:      44773      93225      87595     204989
> > LOC: 1621560863 1621570294 1621563260 1621567033
> > ERR:          0
> > MIS:          0

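To make snapshots like the ones quoted above easier to compare, here is a minimal sketch that parses /proc/interrupts-style text and lists which CPUs have serviced each interrupt. The function names (parse_interrupts, cpus_hit) are made up for illustration, and the sample is a few rows copied from the "Before" output; it assumes one row per line, with the counts directly after the "label:" field.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {label: [count per CPU]}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        label, sep, rest = ln.partition(":")
        if not sep:
            continue
        counts = []
        # Take at most one count per CPU; stop at the first non-numeric field.
        for tok in rest.split()[:ncpu]:
            if not tok.isdigit():
                break
            counts.append(int(tok))
        if counts:  # skip wrapped continuation lines that carry no counts
            table[label.strip()] = counts
    return table


def cpus_hit(counts):
    """Return the CPU indexes that have serviced the interrupt at least once."""
    return [cpu for cpu, n in enumerate(counts) if n > 0]


# A few rows copied from the "Before" snapshot.
sample = """\
           CPU0       CPU1       CPU2       CPU3
  0:     177627          0          0          0   IO-APIC-edge   timer
177:       4765          0       2075      12617   IO-APIC-level  sata_mv
193:       1185          0     652487          0   IO-APIC-level  eth1
ERR:          0
"""

table = parse_interrupts(sample)
for label in ("0", "177", "193"):
    print(label, cpus_hit(table[label]))
```

Run against a full snapshot, this gives a quick view of which interrupts are being delivered to more than one processor.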
It looks from this output like you probably have irqbalance(d) loaded; please
make sure it is. Alternatively, use manual control over smp_affinity in
/proc/irq/<nn>/smp_affinity to set the interrupt affinity for each of the
active interrupts in your system.

I have a suspicion that your issue is related to interrupt routing and the
handling of the "pin interrupt" to HyperTransport conversion in the hardware.
When an interrupt fires quickly on two different processors in a row, it is
possible for issues to occur.

> > The systems have the following hardware:
> >
> > Supermicro H8DAR-T
> > 2 X Opteron Model 285 2.6GHz
> > 4 X 2GB PC-3200 (8GB)
> > 2 X Seagate ST3500320NS 500GB
> > Intel Pro 1000MT - PWLA8492MT
> >
> > The systems were originally installed in March-April 2008 with RHEL 4.
> > They were upgraded to RHEL 5 in May 2009. The problem started
> > showing up 6-9 months ago. The systems have been running the same
> > firmware since the beginning. The next time the problem occurs and
> > the user reboots the machine I will find out if it is the latest.

You can use dmidecode to find out the version of the running BIOS while the
machine is online.

> Hi John,
>
> The error just occurred again after about 24 hours. Here is the
> output /proc/interrupts:
>
> cat /proc/interrupts
>            CPU0       CPU1       CPU2       CPU3
>   0:   68937556          0          0          0   IO-APIC-edge   timer
>   1:          8          0          0          0   IO-APIC-edge   i8042
>   8:          0          0          0          0   IO-APIC-edge   rtc
>   9:          0          0          0          0   IO-APIC-level  acpi
>  12:          4          0          0          0   IO-APIC-edge   i8042
> 169:          0          0          0          0   IO-APIC-level  ohci_hcd:usb1, ohci_hcd:usb2
> 177:       4765       2996       2075    3745436   IO-APIC-level  sata_mv
> 185:        293    1871753          0    1278129   IO-APIC-level  eth0
> 193:       1185      22739  377905436          0   IO-APIC-level  eth1
> 201:       1066    2063938        185     187635   IO-APIC-level  eth2
> 209:         34     174965          0    7825001   IO-APIC-level  eth3
> NMI:       2843       3142      10915       1882
> LOC:   68928632   68928547   68928054   68927546
> ERR:          0
> MIS:          0
>
> Is there any other information I should have the user collect after
> this error occurs?

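On the manual smp_affinity option mentioned earlier in this reply: the value written to /proc/irq/<nn>/smp_affinity is a hexadecimal CPU bitmask, where bit N selects CPU N. A rough sketch of building and writing that mask follows. The helper names (affinity_mask, set_irq_affinity) and the proc_root parameter are made up for illustration, and the plain hex string assumes no more than 32 CPUs, which covers the 4-CPU systems here. The write needs root, and a running irqbalance may later rewrite whatever you set by hand.

```python
def affinity_mask(cpus):
    """Return the hex bitmask string for a list of CPU indexes (bit N = CPU N)."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


def set_irq_affinity(irq, cpus, proc_root="/proc"):
    """Write the mask to <proc_root>/irq/<irq>/smp_affinity (needs root)."""
    path = f"{proc_root}/irq/{irq}/smp_affinity"
    with open(path, "w") as fh:
        fh.write(affinity_mask(cpus) + "\n")


print(affinity_mask([2]))           # CPU2 only
print(affinity_mask([0, 1]))        # CPU0 and CPU1
print(affinity_mask([0, 1, 2, 3]))  # all four CPUs
```

For example, set_irq_affinity(193, [2]) would restrict IRQ 193 (eth1 in the output above) to CPU2. Which CPU to pick for each interrupt is a choice for you to make; the sketch does not suggest one.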
BTW, the report-bad-irq will occur when an interrupt (at the APIC) is still
asserted but our driver reads the ICR register and doesn't find any bits set
(indicating that our adapter is not asserting an interrupt).

Is there a chance you can download the ethregs utility and build it for your
distribution (you need pciutils-devel), then run it as root after the issue
has occurred and before you reboot?

ethregs -c -o > irqnoisy.txt

It really does seem like something is going wrong with the hardware (possibly
the chipset) or the slot, keeping the irq line asserted even when our NIC is
not asserting it. I'm with John and think your best intermediate step is a
BIOS upgrade if you don't have the latest.

_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired
