On Wed, 22 Sep 2010, Michael Madore wrote:
> > There are a total of 9 systems with identical hardware.  Two are known
> > to have exhibited the error once.  On a third system the error is
> > cropping up repeatedly at seemingly random intervals.
> >
> > Before:
> >
> > # cat /proc/interrupts
> >            CPU0       CPU1       CPU2       CPU3
> >   0:     177627          0          0          0    IO-APIC-edge  timer
> >   1:          8          0          0          0    IO-APIC-edge  i8042
> >   8:          0          0          0          0    IO-APIC-edge  rtc
> >   9:          0          0          0          0   IO-APIC-level  acpi
> >  12:          4          0          0          0    IO-APIC-edge  i8042
> > 169:          0          0          0          0   IO-APIC-level  ohci_hcd:usb1, ohci_hcd:usb2
> > 177:       4765          0       2075      12617   IO-APIC-level  sata_mv
> > 185:        293       3314          0        930   IO-APIC-level  eth0
> > 193:       1185          0     652487          0   IO-APIC-level  eth1
> > 201:       1066       3754          0          0   IO-APIC-level  eth2
> > 209:         34       1243          0       9583   IO-APIC-level  eth3
> > NMI:        291         86        102         95
> > LOC:     177279     177207     177135     176937
> > ERR:          0
> > MIS:          0
> >
> >
> > After:
> >
> > cat /proc/interrupts
> >            CPU0       CPU1       CPU2       CPU3
> >   0: 1621763365          0          0          0    IO-APIC-edge  timer
> >   1:          8          0          0          0    IO-APIC-edge  i8042
> >   8:          0          0          0          0    IO-APIC-edge  rtc
> >   9:          0          0          0          0   IO-APIC-level  acpi
> >  12:          4          0          0          0    IO-APIC-edge  i8042
> > 169:          0          0          0          0   IO-APIC-level  ohci_hcd:usb1, ohci_hcd:usb2
> > 177:       4667     445309   58964857   13563559   IO-APIC-level  sata_mv
> > 185:        700   45404048   15552242    2403191   IO-APIC-level  eth0
> > 193:       1368   59938104 1437996227 2249214980   IO-APIC-level  eth1
> > 201:       3164   45011211    3564856    1776851   IO-APIC-level  eth2
> > 209:         32    6293899  170449817   43256252   IO-APIC-level  eth3
> > NMI:      44773      93225      87595     204989
> > LOC: 1621560863 1621570294 1621563260 1621567033
> > ERR:          0
> > MIS:          0

From this output it looks like you probably have irqbalance(d) loaded.
Please make sure it is running, or alternatively control the affinity
manually by writing a CPU mask to /proc/irq/<nn>/smp_affinity for each
of the active interrupts in your system.
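
For the manual route, here is a minimal Python sketch of how the mask is
built. smp_affinity takes a hexadecimal bitmask with bit N standing for
CPU N. The helper names affinity_mask and set_irq_affinity are made up
for illustration, and the sketch assumes 32 or fewer CPUs (larger
systems use comma-separated 32-bit groups). Writing the file needs root.

```python
def affinity_mask(cpus):
    """Return the hex bitmask string for a list of CPU numbers (bit N = CPU N)."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


def set_irq_affinity(irq, cpus, proc_root="/proc"):
    """Write the mask to <proc_root>/irq/<irq>/smp_affinity (requires root)."""
    path = f"{proc_root}/irq/{irq}/smp_affinity"
    with open(path, "w") as handle:
        handle.write(affinity_mask(cpus))
    return path
```

For example, set_irq_affinity(193, [2]) would pin IRQ 193 (eth1 in the
output above) to CPU2 by writing the mask 4. Note that irqbalance may
rewrite the mask later if it is still running.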

I have a suspicion that your issue is related to interrupt routing
and handling of the "pin interrupt" to HyperTransport conversion in the
hardware.  When an interrupt fires on two different processors in quick
succession, it is possible for issues to occur.
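
To see which interrupts are actually being serviced on more than one
CPU, something like the following Python sketch could be run against a
saved copy of /proc/interrupts. The function names are hypothetical, and
it assumes the column layout shown in the output above (a CPU header row
followed by "label: counts..." rows).

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into {label: [per-CPU counts]}."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        counts = []
        for field in rest.split()[:ncpu]:
            if not field.isdigit():
                break
            counts.append(int(field))
        if counts:  # skips wrapped device-name lines such as "ohci_hcd:usb1"
            table[label.strip()] = counts
    return table


def multi_cpu_irqs(table):
    """Return numbered IRQs that have non-zero counts on more than one CPU."""
    hits = [label for label, counts in table.items()
            if label.isdigit() and sum(1 for n in counts if n) > 1]
    return sorted(hits, key=int)
```

Any IRQ this reports is one whose handler has run on several CPUs, which
makes it a candidate for pinning to a single CPU while testing.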

> > The systems have the following hardware:
> >
> > Supermicro H8DAR-T
> > 2 X Opteron Model 285 2.6GHz
> > 4 X 2GB PC-3200 (8GB)
> > 2 X Seagate ST3500320NS 500GB
> > Intel Pro 1000MT - PWLA8492MT
> >
> > The systems were originally installed in March-April 2008 with RHEL 4.
> >  They were upgraded to RHEL 5 in May 2009.  The problem started
> > showing up 6-9 months ago.  The systems have been running the same
> > firmware since the beginning.  The next time the problem occurs and
> > the user reboots the machine I will find out if it is the latest.

You can use dmidecode to find out the version of the running BIOS while
the machine is online.
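
A small Python sketch for collecting this across all nine systems. It
assumes the installed dmidecode supports the -s (--string) option with
the bios-vendor, bios-version and bios-release-date keywords, and that
it is run as root. The bios_info name and the injectable run parameter
are illustrative only.

```python
import subprocess


def bios_info(run=subprocess.check_output):
    """Return BIOS vendor, version and release date as reported by dmidecode."""
    info = {}
    for key in ("bios-vendor", "bios-version", "bios-release-date"):
        # Equivalent to running: dmidecode -s <key>
        output = run(["dmidecode", "-s", key])
        info[key] = output.decode().strip()
    return info
```

If -s is not available, plain "dmidecode" output contains the same
fields in its BIOS Information section.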
 
> Hi John,
> 
> The error just occurred again after about 24 hours.  Here is the
> output /proc/interrupts:
> 
> cat /proc/interrupts
>             CPU0       CPU1       CPU2       CPU3
>    0:   68937556          0          0          0    IO-APIC-edge  timer
>    1:          8          0          0          0    IO-APIC-edge  i8042
>    8:          0          0          0          0    IO-APIC-edge  rtc
>    9:          0          0          0          0   IO-APIC-level  acpi
>   12:          4          0          0          0    IO-APIC-edge  i8042
> 169:          0          0          0          0   IO-APIC-level  ohci_hcd:usb1, ohci_hcd:usb2
> 177:       4765       2996       2075    3745436   IO-APIC-level  sata_mv
> 185:        293    1871753          0    1278129   IO-APIC-level  eth0
> 193:       1185      22739  377905436          0   IO-APIC-level  eth1
> 201:       1066    2063938        185     187635   IO-APIC-level  eth2
> 209:         34     174965          0    7825001   IO-APIC-level  eth3
> NMI:       2843       3142      10915       1882
> LOC:   68928632   68928547   68928054   68927546
> ERR:          0
> MIS:          0
> 
> Is there any other information I should have the user collect after
> this error occurs?

BTW, the report-bad-irq message occurs when an interrupt is still
asserted at the APIC, but our driver reads the ICR register and doesn't
find any bits set (indicating that our adapter is not asserting an
interrupt).
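
As a rough illustration of that mechanism, here is a simplified Python
model. It is not the driver or kernel code. The thresholds (more than
99,900 unhandled out of every 100,000 interrupts) are taken from my
reading of the mainline kernel's spurious-interrupt check and are an
assumption for the RHEL 5 kernel; the class and function names are
invented for the sketch.

```python
WINDOW = 100_000           # interrupts examined per check (assumed)
UNHANDLED_LIMIT = 99_900   # more unhandled than this disables the line (assumed)


def handler_claims_irq(icr):
    """Model of the driver's check: no bits set in ICR means 'not ours'."""
    return icr != 0


class IrqLine:
    """Simplified model of the per-IRQ bookkeeping for unhandled interrupts."""

    def __init__(self):
        self.count = 0
        self.unhandled = 0
        self.disabled = False

    def note_interrupt(self, handled):
        if self.disabled:
            return
        if not handled:
            self.unhandled += 1
        self.count += 1
        if self.count < WINDOW:
            return
        if self.unhandled > UNHANDLED_LIMIT:
            # This is the point where the bad-irq report would be logged.
            self.disabled = True
        self.count = 0
        self.unhandled = 0
```

In this model a line that stays asserted while every handler reads an
empty ICR is disabled after the first window of 100,000 interrupts,
while a line where the handler claims a reasonable share is left alone.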

Is there a chance you can download the ethregs utility and build it for
your distribution (you need pciutils-devel), then run it as root after
the issue has occurred and before you reboot?

ethregs -c -o > irqnoisy.txt

It really does seem like something is going wrong with the hardware 
(possibly the chipset) or the slot that is making the irq line stay 
asserted even when our NIC is not asserting the irq line.

I'm with John and think your best intermediate step is a BIOS upgrade if
you don't have the latest.

_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel