I've been pulling out what little hair I have left while trying to figure out a bizarre problem with a Linux cluster I'm running. Here's a short description of the problem.
I'm managing a 29-node cluster. All the nodes use the same hardware and boot the same kernel image (Scientific Linux 4.4, Linux 2.6.9). The owner of this cluster runs a multi-node MPI job over and over with different input data. We've been seeing strange performance numbers depending on which nodes the job uses. These variations are not due to the input data. With some combinations of nodes the job runs an order of magnitude slower than with others.

Replacing the gigabit ethernet switch, replacing two of the nodes, and running memtest all day long didn't turn up anything interesting. However, today I took a look at the network statistics as shown on the ethernet switch (a Netgear GS748T). What I saw was that 13 of the 29 switch ports had very large numbers of FCS (Frame Check Sequence) errors. In fact, some had more FCS errors than valid frames, and I'm talking about frame counts in the billions. All the other ports showed 0 FCS errors. So, something is clearly wrong.

What I'm wondering is what's causing these FCS errors. The cables are short and the equipment is new. All the nodes use new SuperMicro H8DCR-3 motherboards with onboard ethernet controllers, so I'm having trouble believing that this problem is caused by faulty ethernet controllers, because that would mean that 13 out of 29 controllers are bad.

Running "ifconfig eth0" on the nodes shows no errors, but I'm not sure if this kind of error is detectable by the sender, and I'm guessing that packets with FCS errors are dropped by the switch. Could the switch be making a mistake when computing the FCS values while under heavy load?

I'd like to find the definitive cause of the problem before I ask the vendor to replace massive amounts of hardware. How would you isolate the cause of this problem?
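For reference, here is a minimal sketch of what an FCS check amounts to, assuming the standard Ethernet CRC-32 (the same CRC that Python's zlib.crc32 computes). The helper names append_fcs, fcs_ok, and flip_bit are made up for illustration and don't correspond to anything in the switch or the NIC driver. The point is only that the check is made by whoever receives the frame, which would explain why a sender's own counters could stay at 0.

```python
import zlib


def append_fcs(frame: bytes) -> bytes:
    """Return the frame with a 4-byte FCS appended, as a sending NIC would.

    Assumes the standard Ethernet CRC-32, stored least significant byte first.
    """
    fcs = zlib.crc32(frame) & 0xFFFFFFFF
    return frame + fcs.to_bytes(4, "little")


def fcs_ok(frame_with_fcs: bytes) -> bool:
    """Recompute the CRC on the receiving side and compare it to the trailer."""
    if len(frame_with_fcs) < 4:
        return False
    body, trailer = frame_with_fcs[:-4], frame_with_fcs[-4:]
    return (zlib.crc32(body) & 0xFFFFFFFF) == int.from_bytes(trailer, "little")


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate corruption on the wire by inverting a single bit."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    sent = append_fcs(bytes(range(64)))   # hypothetical 64-byte frame
    print(fcs_ok(sent))                   # frame arrives intact
    print(fcs_ok(flip_bit(sent, 100)))    # one bit damaged in transit
```

In this sketch the sender never learns that the damaged frame failed the check; only the receiving side counts the error.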
Cordially,
--
Jon Forrest
Unix Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA 94720-1460
510-643-1032
[EMAIL PROTECTED]
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf