I ran the qperf command between two compute nodes (b4 and b5) and got:

[hussaif1@lustwzb5 ~]$ qperf lustwzb4 -t 30 rc_lat rc_bi_bw
rc_lat:
    latency  =  7.73 us
rc_bi_bw:
    bw  =  9.06 GB/sec

If I understand correctly, I would need to enable IPoIB and then rerun the test? I assume it would then show something closer to the 40 Gb/sec link rate.
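
(If so, the IPoIB side would presumably be exercised with qperf's TCP tests
against whatever address ends up on ib0. A rough sketch, where lustwzb4-ib is
just a placeholder for b4's IPoIB address:)

    # on lustwzb4: leave the qperf server running
    qperf

    # on lustwzb5: TCP latency/bandwidth over the IPoIB interface
    # (lustwzb4-ib is a placeholder for the address configured on ib0)
    qperf lustwzb4-ib -t 30 tcp_lat tcp_bw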

Quoting Jeff Johnson <jeff.john...@aeoncomputing.com>:

Faraz,

You can test your point-to-point RDMA bandwidth as well.

On host lustwz99 run `qperf`
On any of the hosts lustwzb1-16 run `qperf lustwz99 -t 30 rc_lat rc_bi_bw`

Establish that you can pass traffic at the expected speeds before moving on
to the IPoIB portion.

Also make sure that all of your nodes are running in the same mode,
connected or datagram, and that the MTU is the same on all nodes for that
IP interface.
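
A quick way to eyeball both on each node is something along these lines
(assuming the IPoIB interface is ib0):

    # IPoIB transport mode: prints "datagram" or "connected"
    cat /sys/class/net/ib0/mode

    # MTU of the IPoIB interface (typically 2044/4092 in datagram mode,
    # up to 65520 in connected mode)
    cat /sys/class/net/ib0/mtu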

--Jeff

On Wed, Aug 2, 2017 at 10:50 AM, Faraz Hussain <i...@feacluster.com> wrote:

Thanks Joe. Here is the output from the commands you suggested. We have
Open MPI built with the Intel compilers. Is there some benchmark code I can
compile so that we are all comparing the same code?
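
(For concreteness, the sort of thing I had in mind is a standard point-to-point
test such as the OSU micro-benchmarks; a rough sketch of building and running
the bandwidth test with Open MPI, with the version and paths purely
illustrative:)

    # build the OSU micro-benchmarks against the MPI compiler wrappers
    tar xzf osu-micro-benchmarks-5.3.2.tar.gz
    cd osu-micro-benchmarks-5.3.2
    ./configure CC=mpicc CXX=mpicxx && make

    # point-to-point bandwidth between two of the compute nodes
    mpirun -np 2 -host lustwzb4,lustwzb5 ./mpi/pt2pt/osu_bw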

[hussaif1@lustwzb4 test]$ ibv_devinfo
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.11.550
        node_guid:                      f452:1403:0016:3b70
        sys_image_guid:                 f452:1403:0016:3b73
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        board_id:                       DEL0A40000028
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               3
                        port_lmc:               0x00
                        link_layer:             InfiniBand

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             InfiniBand


[hussaif1@lustwzb4 test]$ ibstat
CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 2
        Firmware version: 2.11.550
        Hardware version: 0
        Node GUID: 0xf452140300163b70
        System image GUID: 0xf452140300163b73
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40 (FDR10)
                Base lid: 3
                LMC: 0
                SM lid: 1
                Capability mask: 0x02514868
                Port GUID: 0xf452140300163b71
                Link layer: InfiniBand
        Port 2:
                State: Down
                Physical state: Disabled
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02514868
                Port GUID: 0xf452140300163b72
                Link layer: InfiniBand

[hussaif1@lustwzb4 test]$ ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:f452:1403:0016:3b71
        base lid:        0x3
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X FDR10)
        link_layer:      InfiniBand

Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:f452:1403:0016:3b72
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      3: Disabled
        rate:            10 Gb/sec (4X)
        link_layer:      InfiniBand



Quoting Joe Landman <joe.land...@gmail.com>:

start with

    ibv_devinfo

    ibstat

    ibstatus


and see what (if anything) they report.

Second, how did you compile/run your MPI code?
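
If it's Open MPI, it is also worth confirming that the InfiniBand transport is
actually being picked up rather than silently falling back to TCP. Roughly
(the openib BTL name applies to the 1.x-3.x series, and ./your_mpi_app is a
placeholder):

    # is the verbs (openib) BTL compiled into this Open MPI at all?
    ompi_info | grep -i openib

    # force the verbs transport so a misconfiguration fails loudly
    # (host syntax varies slightly between Open MPI releases)
    mpirun --mca btl openib,self -np 2 -host lustwzb4,lustwzb5 ./your_mpi_app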


On 08/02/2017 12:44 PM, Faraz Hussain wrote:

I have inherited a 20-node cluster that supposedly has an InfiniBand
network. I am testing some MPI applications and am seeing no performance
improvement with multiple nodes, so I am wondering whether the InfiniBand
network even works.

The output of ifconfig -a shows ib0 and ib1 interfaces. I ran `ethtool ib0`
and it shows:

       Speed: 40000Mb/s
       Link detected: no

and for ib1 it shows:

       Speed: 10000Mb/s
       Link detected: no

I am assuming this means it is down? Any idea how to debug further and
restart it?
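
Is it just a matter of bringing the interfaces up and re-checking, along these
lines, or is there more to it?

    # bring the IPoIB interface up (in case it is only administratively down)
    # and re-check the carrier state
    ip link set ib0 up
    ethtool ib0 | grep -E 'Speed|Link detected'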

Thanks!


--
Joe Landman
e: joe.land...@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman


--
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite D - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage



_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
