[Beowulf] Poor bandwidth from one compute node

Faraz Hussain Thu, 17 Aug 2017 09:01:35 -0700

I noticed an MPI job was taking 5X longer to run whenever it got the compute node lusytp104. So I ran qperf and found the bandwidth between it and any other node was ~100 MB/sec. This is much lower than the ~1 GB/sec between all the other nodes. Any tips on how to debug further? I haven't tried rebooting since it is currently running a single-node job.


[hussaif1@lusytp114 ~]$ qperf lusytp104 tcp_lat tcp_bw
tcp_lat:
    latency  =  17.4 us
tcp_bw:
    bw  =  118 MB/sec
[hussaif1@lusytp114 ~]$ qperf lusytp113 tcp_lat tcp_bw
tcp_lat:
    latency  =  20.4 us
tcp_bw:
    bw  =  1.07 GB/sec
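
For comparing more than a couple of nodes, the qperf output above could be parsed and screened automatically. The following is only a minimal sketch: the helper names (parse_tcp_bw, to_gbit_per_sec, slow_nodes), the decimal unit table, and the "below half the median" cutoff are all assumptions made for illustration, not part of qperf. Converting to Gbit/sec makes it easier to compare the measured figures against nominal link speeds.

```python
import re
import statistics

# Unit multipliers for the bandwidth line. Decimal units are an assumption
# based on the "MB/sec" and "GB/sec" strings in the output shown above.
_UNITS = {"bytes": 1.0, "KB": 1e3, "MB": 1e6, "GB": 1e9}


def parse_tcp_bw(qperf_output):
    """Return the bandwidth reported in qperf text output, in bytes/sec."""
    match = re.search(r"\bbw\s*=\s*([\d.]+)\s*(bytes|KB|MB|GB)/sec", qperf_output)
    if match is None:
        raise ValueError("no bandwidth line found in qperf output")
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


def to_gbit_per_sec(bytes_per_sec):
    """Convert bytes/sec to Gbit/sec for comparison with link speeds."""
    return bytes_per_sec * 8 / 1e9


def slow_nodes(bandwidths, fraction=0.5):
    """Return hosts whose bandwidth is below `fraction` of the median.

    `bandwidths` maps hostname -> bytes/sec. The cutoff is a hypothetical
    heuristic, not a recommended threshold.
    """
    median = statistics.median(bandwidths.values())
    return sorted(h for h, bw in bandwidths.items() if bw < fraction * median)


if __name__ == "__main__":
    # The two measurements quoted in this post.
    outputs = {
        "lusytp104": "tcp_bw:\n    bw  =  118 MB/sec\n",
        "lusytp113": "tcp_bw:\n    bw  =  1.07 GB/sec\n",
    }
    measured = {host: parse_tcp_bw(text) for host, text in outputs.items()}
    for host, bw in measured.items():
        print(f"{host}: {to_gbit_per_sec(bw):.2f} Gbit/sec")
    print("slow:", slow_nodes(measured))
```

In practice the text would come from running qperf against each host and capturing its standard output; only the parsing and comparison steps are sketched here.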

This is a separate issue from my previous post about a slow compute node. I am still investigating that per the helpful replies. I will post an update about that once I find the root cause!


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
