Faraz, I really suggest you look at the Intel Cluster Checker. I guess you cannot take down a production cluster to run a full Cluster Checker pass, but these are exactly the kinds of faults ICC is designed to find. You can define a small set of compute nodes to run on, including this node, and perhaps run ICC on just those?
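A rough sketch of how that might look, assuming a recent clck driver that accepts a nodefile via -f (the nodefile below is hypothetical, using the hostnames from your mail):

    # nodes.txt: the suspect node plus a couple of known-good ones
    cat > nodes.txt <<EOF
    lusytp104
    lusytp113
    lusytp114
    EOF
    clck -f nodes.txt    # run the health checks against just those nodes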
As for the diagnosis, run ethtool <interface name>, where <interface name> is your ethernet interface, and compare the output with ethtool on a properly working compute node. A quick sketch of that comparison is below the quoted message.

On 17 August 2017 at 18:00, Faraz Hussain <i...@feacluster.com> wrote:
> I noticed an mpi job was taking 5X longer to run whenever it got the
> compute node lusytp104. So I ran qperf and found the bandwidth between it
> and any other node was ~100MB/sec. This is much lower than the ~1GB/sec
> between all the other nodes. Any tips on how to debug further? I haven't
> tried rebooting since it is currently running a single-node job.
>
> [hussaif1@lusytp114 ~]$ qperf lusytp104 tcp_lat tcp_bw
> tcp_lat:
>     latency = 17.4 us
> tcp_bw:
>     bw = 118 MB/sec
> [hussaif1@lusytp114 ~]$ qperf lusytp113 tcp_lat tcp_bw
> tcp_lat:
>     latency = 20.4 us
> tcp_bw:
>     bw = 1.07 GB/sec
>
> This is a separate issue from my previous post about a slow compute node.
> I am still investigating that per the helpful replies. Will post an update
> about that once I find the root cause!
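A minimal sketch of the comparison, assuming the interface is eth0 (substitute whichever interface carries the MPI traffic on your nodes):

    # run on lusytp104 and on a known-good node, then diff the two outputs
    ethtool eth0 | grep -E 'Speed|Duplex|Auto-negotiation|Link detected'
    # a healthy GigE link should show: Speed: 1000Mb/s, Duplex: Full, Link detected: yes
    # also worth a look: NIC error counters that point at a bad cable or switch port
    ethtool -S eth0 | grep -iE 'err|drop|crc'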