One thing to check is that the switch and NIC are negotiating duplex correctly. Duplex mis-negotiation (i.e. switch full, NIC half) used to be a fairly common cause of FCS errors, although it is rare now that drivers have gotten a lot better. What happens is that the full-duplex station transmits while the half-duplex station is sending. The half-duplex station thinks it has seen a collision and stops transmitting, leaving a packet fragment with no FCS. The switch drops frames with FCS errors, so performance will be horrible.
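To check the negotiated settings from the node side, something like the sketch below could be run on each node. It assumes the `ethtool` utility is installed and that its output contains the usual `Speed:`, `Duplex:` and `Auto-negotiation:` lines. The function names and the expected speed of 1000Mb/s are illustrative choices, not anything taken from the cluster described in this thread.

```python
import re
import subprocess


def parse_ethtool(text):
    """Pull speed, duplex and autonegotiation state out of `ethtool <iface>` output."""
    patterns = {
        "speed": r"^\s*Speed:\s*(\S+)",
        "duplex": r"^\s*Duplex:\s*(\S+)",
        # Anchored at line start so "Advertised auto-negotiation:" is not matched.
        "autoneg": r"^\s*Auto-negotiation:\s*(\S+)",
    }
    fields = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text, re.MULTILINE)
        fields[key] = match.group(1) if match else None
    return fields


def link_settings(iface="eth0"):
    """Run ethtool on one interface (assumes ethtool is on PATH; may need root)."""
    result = subprocess.run(
        ["ethtool", iface], capture_output=True, text=True, check=True
    )
    return parse_ethtool(result.stdout)


def looks_suspect(settings, expected_speed="1000Mb/s"):
    """Flag a link that is not full duplex or not at the expected speed.

    expected_speed is an assumption for a gigabit link; adjust as needed.
    """
    return settings["duplex"] != "Full" or settings["speed"] != expected_speed
```

Calling `link_settings("eth0")` on every node and comparing the results against the speed and duplex the switch reports for the matching port should show whether the ports with FCS errors are the ones where the two ends disagree.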
One easy way to fix this is to set speed and duplex on both ends of the connection to 1000/full and retest.

HTH
- Steve P

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Jon Forrest
Sent: Wednesday, March 28, 2007 4:52 PM
To: beowulf@beowulf.org
Subject: [Beowulf] How to Diagnose Cause of Cluster Ethernet Errors?

I've been pulling out what little hair I have left while trying to figure out a bizarre problem with a Linux cluster I'm running. Here's a short description of the problem.

I'm managing a 29-node cluster. All the nodes use the same hardware and boot the same kernel image (Scientific Linux 4.4, Linux 2.6.9). The owner of this cluster runs a multi-node MPI job over and over with different input data. We've been seeing strange performance numbers depending on which nodes the job uses. These variations are not due to the input data. In some combinations the performance is an order of magnitude slower than in others.

Replacing the gigabit ethernet switch, replacing two of the nodes, and running memtest all day long didn't turn up anything interesting. However, today I took a look at the network statistics as shown on the ethernet switch (a Netgear GS748T). What I saw was that 13 of the 29 switch ports had very large numbers of FCS (Frame Check Sequence) errors. In fact, some had more FCS errors than valid frames, and I'm talking about frame counts in the billions. All the other ports showed 0 FCS errors.

So, something is clearly wrong. What I'm wondering is what's causing these FCS errors. The cables are short and the equipment is new. All the nodes use new SuperMicro H8DCR-3 motherboards with onboard ethernet controllers, so I'm having trouble believing that this problem is caused by a faulty ethernet controller, because this would mean that 13 out of 29 controllers are bad.
Running "ifconfig eth0" on the nodes shows no errors, but I'm not sure whether this kind of error is detectable by the sender, and I'm guessing that packets with FCS errors are dropped by the switch. Could the switch be making a mistake while under heavy load when computing the FCS values?

I'd like to find the definitive cause of the problem before I ask the vendor to replace massive amounts of hardware. How would you isolate the cause of this problem?

Cordially,
--
Jon Forrest
Unix Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA 94720-1460
510-643-1032
[EMAIL PROTECTED]

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf