I noticed an MPI job was taking 5x longer to run whenever it landed on the compute node lusytp104. So I ran qperf and found that the bandwidth between it and any other node was ~100 MB/sec. This is much lower than the ~1 GB/sec I see between all the other nodes. Any tips on how to debug further? I haven't tried rebooting the node yet, since it is currently running a single-node job.

[hussaif1@lusytp114 ~]$ qperf lusytp104 tcp_lat tcp_bw
tcp_lat:
    latency  =  17.4 us
tcp_bw:
    bw  =  118 MB/sec
[hussaif1@lusytp114 ~]$ qperf lusytp113 tcp_lat tcp_bw
tcp_lat:
    latency  =  20.4 us
tcp_bw:
    bw  =  1.07 GB/sec
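
Converting the two qperf figures to bits per second hints at one thing to check. This is only a rough sketch: it assumes qperf reports decimal units (1 MB = 10^6 bytes), and the helper names and the interface name "eth0" are hypothetical, not taken from this cluster. The sysfs file /sys/class/net/<iface>/speed is the standard Linux location for the negotiated link speed in Mbit/sec.

```python
from pathlib import Path
from typing import Optional


def mb_per_sec_to_gbit(mb_per_sec: float) -> float:
    """Convert MB/sec (assumed 10^6 bytes) to Gbit/sec (10^9 bits)."""
    return mb_per_sec * 8 / 1000


def link_speed_mbps(iface: str) -> Optional[int]:
    """Return the negotiated link speed in Mbit/sec, or None if unavailable.

    Reads the Linux sysfs attribute /sys/class/net/<iface>/speed.
    The read can fail or report -1 when the link is down or the
    driver does not expose a speed.
    """
    try:
        speed = int(Path(f"/sys/class/net/{iface}/speed").read_text().strip())
    except (OSError, ValueError):
        return None
    return speed if speed > 0 else None


if __name__ == "__main__":
    # Bandwidth figures copied from the qperf runs above.
    print(f"lusytp104: {mb_per_sec_to_gbit(118):.2f} Gbit/sec")
    print(f"lusytp113: {mb_per_sec_to_gbit(1070):.2f} Gbit/sec")
    # "eth0" is a placeholder; substitute the real interface name on each node.
    print("negotiated speed (Mbit/sec):", link_speed_mbps("eth0"))
```

Under that assumption, 118 MB/sec is about 0.94 Gbit/sec, which is close to the ceiling of a 1 Gbit/sec link, while 1.07 GB/sec is about 8.6 Gbit/sec. That would be consistent with lusytp104 having negotiated a lower link speed than the other nodes (or routing over a different interface), but it is a guess until the negotiated speed is compared on both nodes.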

This is a separate issue from my previous post about a slow compute node. I am still investigating that one per the helpful replies, and will post an update once I find the root cause!

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing