Bad FDR cables? Is it possible that the switches are running slower due to signaling issues?
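
If it helps narrow down whether a handful of links (and so perhaps a handful of cables) account for most of the waits, here is a rough Python sketch that ranks ports by a counter. It assumes the ibqueryerrors output looks like the line quoted below (`GUID ... port N: [Counter == value]`), possibly with several bracketed counters per line; the function names `parse_counters` and `rank_by` are made up for illustration and are not part of any IB tool.

```python
import re
from typing import Dict, List, Tuple

# Assumed line format, taken from the quoted message:
#   GUID 0x2c90300771450 port 22: [PortXmitWait == 198817026]
PORT_RE = re.compile(r"GUID\s+(0x[0-9a-fA-F]+)\s+port\s+(\d+):")
COUNTER_RE = re.compile(r"\[(\w+)\s*==\s*(\d+)\]")

PortKey = Tuple[str, int]


def parse_counters(text: str) -> Dict[PortKey, Dict[str, int]]:
    """Collect counter values per (GUID, port) from saved ibqueryerrors output."""
    result: Dict[PortKey, Dict[str, int]] = {}
    for line in text.splitlines():
        m = PORT_RE.search(line)
        if not m:
            continue  # skip headers, summaries, and anything unrecognised
        counters = {name: int(val) for name, val in COUNTER_RE.findall(line)}
        if counters:
            key = (m.group(1), int(m.group(2)))
            result.setdefault(key, {}).update(counters)
    return result


def rank_by(counters: Dict[PortKey, Dict[str, int]],
            name: str = "PortXmitWait") -> List[Tuple[str, int, int]]:
    """Return (guid, port, value) rows sorted by the named counter, worst first."""
    rows = [(guid, port, c[name])
            for (guid, port), c in counters.items() if name in c]
    return sorted(rows, key=lambda r: r[2], reverse=True)
```

Something like `rank_by(parse_counters(open("ibqueryerrors.txt").read()))[:10]` on a saved copy of the output would list the ten worst ports, where `ibqueryerrors.txt` is just a placeholder file name.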
Sent from my iPad

On Jun 12, 2013, at 12:03 AM, Christopher Samuel <sam...@unimelb.edu.au> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi folks,
>
> I'm doing the bring up and testing on our SandyBridge IBM iDataplex
> with an FDR switch and as part of that I've been doing burn-in testing
> with HPL and seeing really poor efficiency (~25% over 65 odd nodes
> with 256GB RAM). Simultaneously HPL on the 3 nodes with 512GB RAM
> gives ~70% efficiency.
>
> Checking the switch with ibqueryerrors shows lots of things like:
>
> GUID 0x2c90300771450 port 22: [PortXmitWait == 198817026]
>
> That's about 2 or 3 hours after last clearing the counters. :-(
>
> Doing:
>
> # ibclearcounters && ibclearerrors && sleep 1 && ibqueryerrors
>
> Shows 75 of 94 nodes bad, pretty much all with thousands of
> PortXmitWait, some into the 10's of thousands.
>
> We are running RHEL 6.3, Mellanox OFED 2.0.5, FDR IB and Open-MPI 1.6.4.
>
> Talking with another site who also has the same sort of iDataplex, but
> running RHEL 5.8, Mellanox OFED 1.5 and QDR IB, reveals that they (once
> they started looking) are also seeing high PortXmitWait counters
> shortly after clearing them with user codes.
>
> These are Mellanox MT27500 ConnectX-3 adapters.
>
> We're talking with both IBM and Mellanox directly, but other than
> Mellanox spotting some GPFS NSD file servers that had bad FDR ports
> (which got unplugged last week and fixed today) we've not made any
> progress into the underlying cause. :-(
>
> Has anyone seen anything like this before?
>
> cheers!
> Chris
> - --
> Christopher Samuel Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/ http://twitter.com/vlsci
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>
> iEYEARECAAYFAlG4AQ8ACgkQO2KABBYQAh96awCfRESpDRhVHvpJBqrv33sGlQJm
> NvoAnjg20/xMMcji72eAWI1HzyEQureY
> =GfkH
> -----END PGP SIGNATURE-----
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf