I ran the qperf command between two compute nodes (b4 and b5) and got:
[hussaif1@lustwzb5 ~]$ qperf lustwzb4 -t 30 rc_lat rc_bi_bw
rc_lat:
latency = 7.73 us
rc_bi_bw:
bw = 9.06 GB/sec
If I understand correctly, I would need to enable IPoIB and then rerun the
test? It would then show ~40 GB/sec, I assume.
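For the IPoIB portion I'm guessing it would be a TCP test over the ib0 addresses, something along these lines once ib0 is up on both nodes (the 10.0.0.4 below is only a placeholder for b4's ib0 address):

    # on lustwzb4: start the qperf server
    qperf

    # on lustwzb5: qperf's TCP tests, which go over the IPoIB interface
    qperf 10.0.0.4 -t 30 tcp_lat tcp_bw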
Quoting Jeff Johnson <jeff.john...@aeoncomputing.com>:
Faraz,
You can test your point to point rdma bandwidth as well.
On host lustwz99 run `qperf`
On any of the hosts lustwzb1-16 run `qperf lustwz99 -t 30 rc_lat rc_bi_bw`
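Concretely, the RDMA side would look something like this (using b1 as the client here just as an example):

    # on lustwz99: start the qperf server -- it just listens until told to stop
    qperf

    # on lustwzb1: 30-second RC latency and bidirectional bandwidth test
    qperf lustwz99 -t 30 rc_lat rc_bi_bw

    # when finished, tell the server to exit
    qperf lustwz99 quit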
Establish that you can pass traffic at expected speeds before going to the
IPoIB portion.
Also make sure that all of your nodes are running in the same mode,
connected or datagram, and that your MTU is the same on all nodes for that
IP interface.
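A quick way to check both on every node, assuming the IPoIB interface is ib0 (adjust the interface name and host list to match your setup):

    # IPoIB mode: prints "datagram" or "connected"
    cat /sys/class/net/ib0/mode

    # MTU on that interface (typically 2044 in datagram mode, up to 65520 in connected mode)
    ip link show ib0 | grep -o 'mtu [0-9]*'

    # or across all the compute nodes at once, if you have pdsh installed
    pdsh -w lustwzb[1-16] 'cat /sys/class/net/ib0/mode; ip link show ib0 | grep -o "mtu [0-9]*"'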
--Jeff
On Wed, Aug 2, 2017 at 10:50 AM, Faraz Hussain <i...@feacluster.com> wrote:
Thanks Joe. Here is the output from the commands you suggested. We have
Open MPI built with the Intel compilers. Is there some benchmark code I can
compile so that we are all comparing the same code?
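For example, would the OSU micro-benchmarks be a reasonable common reference? Roughly something like this (the version number and paths below are from memory, so they may need adjusting):

    # build against the Open MPI compiler wrappers
    tar xzf osu-micro-benchmarks-5.3.2.tar.gz
    cd osu-micro-benchmarks-5.3.2
    ./configure CC=mpicc CXX=mpicxx && make

    # point-to-point bandwidth and latency between two of the compute nodes
    mpirun -np 2 -host lustwzb4,lustwzb5 ./mpi/pt2pt/osu_bw
    mpirun -np 2 -host lustwzb4,lustwzb5 ./mpi/pt2pt/osu_latency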
[hussaif1@lustwzb4 test]$ ibv_devinfo
hca_id: mlx4_0
        transport:      InfiniBand (0)
        fw_ver:         2.11.550
        node_guid:      f452:1403:0016:3b70
        sys_image_guid: f452:1403:0016:3b73
        vendor_id:      0x02c9
        vendor_part_id: 4099
        hw_ver:         0x0
        board_id:       DEL0A40000028
        phys_port_cnt:  2
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         1
                        port_lid:       3
                        port_lmc:       0x00
                        link_layer:     InfiniBand

                port:   2
                        state:          PORT_DOWN (1)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00
                        link_layer:     InfiniBand
[hussaif1@lustwzb4 test]$ ibstat
CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 2
        Firmware version: 2.11.550
        Hardware version: 0
        Node GUID: 0xf452140300163b70
        System image GUID: 0xf452140300163b73
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40 (FDR10)
                Base lid: 3
                LMC: 0
                SM lid: 1
                Capability mask: 0x02514868
                Port GUID: 0xf452140300163b71
                Link layer: InfiniBand
        Port 2:
                State: Down
                Physical state: Disabled
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02514868
                Port GUID: 0xf452140300163b72
                Link layer: InfiniBand
[hussaif1@lustwzb4 test]$ ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:f452:1403:0016:3b71
        base lid:        0x3
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X FDR10)
        link_layer:      InfiniBand

Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:f452:1403:0016:3b72
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      3: Disabled
        rate:            10 Gb/sec (4X)
        link_layer:      InfiniBand
Quoting Joe Landman <joe.land...@gmail.com>:
start with
ibv_devinfo
ibstat
ibstatus
and see what (if anything) they report.
Second, how did you compile/run your MPI code?
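If it's Open MPI, it's also worth checking that it was built with InfiniBand support and that the IB transport is actually selected at run time. For the Open MPI versions of that era, something like this (the application name below is just a placeholder):

    # is the openib BTL built in?
    ompi_info | grep -i openib

    # force the IB transport; this fails loudly if openib isn't usable
    mpirun --mca btl openib,self -np 2 -host lustwzb4,lustwzb5 ./your_mpi_app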
On 08/02/2017 12:44 PM, Faraz Hussain wrote:
I have inherited a 20-node cluster that supposedly has an InfiniBand
network. I am testing some MPI applications and am seeing no performance
improvement with multiple nodes. So I am wondering if the InfiniBand network
even works?
The output of ifconfig -a shows ib0 and ib1 interfaces. I ran `ethtool
ib0` and it shows:
Speed: 40000Mb/s
Link detected: no
and for ib1 it shows:
Speed: 10000Mb/s
Link detected: no
I am assuming this means it is down? Any idea how to debug further and
restart it?
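Would something along these lines be the right way to check further and bring it up, or is there more to it than that?

    # check the HCA port state itself, not just the IP interface
    ibstat

    # bring the IPoIB interface up if it is only administratively down (needs root)
    ip link set ib0 up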
Thanks!
--
Joe Landman
e: joe.land...@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman
--
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing
jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001 f: 858-412-3845
m: 619-204-9061
4170 Morena Boulevard, Suite D - San Diego, CA 92117
High-Performance Computing / Lustre Filesystems / Scale-out Storage
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf