Hi Faraz

The output of lsmod looks good to me.
It shows that you have verbs, rdma, etc.
Presumably this is the case on all nodes (the output you sent
is likely from just one node, lustwzb4 or something like that).

ompi_info shows that Open MPI was built with openib (InfiniBand)
support. So, another good thing.
Therefore, by default Open MPI will try to use InfiniBand,
unless one of the nodes' IB cards has a problem,
or the IB kernel modules were not loaded, etc.
But you shouldn't worry about it until it happens.
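
If you ever want to see for yourself whether InfiniBand makes a difference, one rough way is to run the same program twice, once restricted to the openib transport and once restricted to TCP. This is only a sketch: ./my_mpi_app and hosts.txt are placeholders I made up, and the BTL names (openib, tcp, sm, self) are the ones I'd expect for the 1.8.4 that your ompi_info output reports. Newer Open MPI releases name things differently.

```shell
# Sketch: compare an InfiniBand run against a TCP-only run of the same program.
# ./my_mpi_app and hosts.txt are hypothetical placeholders.
APP=${APP:-./my_mpi_app}
HOSTFILE=${HOSTFILE:-hosts.txt}

# The "btl" MCA parameter selects the byte-transfer layers:
# "sm" is shared memory within a node, "self" is a process talking to itself.
ib_cmd="mpirun --hostfile $HOSTFILE --mca btl openib,sm,self $APP"
tcp_cmd="mpirun --hostfile $HOSTFILE --mca btl tcp,sm,self $APP"

echo "InfiniBand run: $ib_cmd"
echo "TCP-only run:   $tcp_cmd"

# Only execute when mpirun and the program are actually present.
if command -v mpirun >/dev/null 2>&1 && [ -x "$APP" ]; then
    $ib_cmd
    $tcp_cmd
fi
```

If the first run aborts complaining that it cannot reach the other processes, that would suggest openib is not usable on some node. If both runs take about the same time, the program itself may simply not be communication bound.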


I think ibhosts is just telling you that the IB cards
have two ports each ("ports 2", with a space in between),
not that port number 2 is the one in use.
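
To make that reading of the ibhosts lines concrete, here is a small sketch that turns each "Ca" line into "host number-of-ports". The function name ports_per_host is mine, and the field positions are an assumption based only on the output you pasted.

```shell
# Print "<host> <number of ports>" for each channel adapter line read from stdin.
ports_per_host() {
    awk '$1 == "Ca" {
        n = $5                          # the field right after the word "ports"
        name = $6; gsub(/"/, "", name)  # first word of the quoted description
        print name, n
    }'
}

# Only query the fabric when ibhosts is installed (it may need root).
if command -v ibhosts >/dev/null 2>&1; then
    ibhosts | ports_per_host
fi
```

For the lustwzb9 line in your output this would print "lustwzb9 2", i.e. a card with two ports, which says nothing about which of the two is cabled.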

Also, check the back of the nodes for the IB cable connections.
They're thick cables, should be connected to the IB switch.
You will *probably* find two IB ports in the nodes, with only
one connected. At least that is what your ifconfig output suggests.

ibstat runs only on the node you're in.
If you have a tool such as pdsh (parallel shell),
you can use it to run ibstat on all nodes.
Or just ssh to each node and run ibstat.
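
The ssh route could look something like the sketch below. It assumes the compute nodes are named lustwzb1 to lustwzb16 (my guess from the ibhosts output), that you have passwordless ssh to them, and that ibstat lives in /usr/sbin. The function names are made up for this example.

```shell
# Condense ibstat output (read from stdin) to one line per port.
summarize_ibstat() {
    awk '
        /^[[:space:]]*Port [0-9]+:/    { port = $2; sub(":", "", port) }
        /^[[:space:]]*State:/          { state = $2 }
        /^[[:space:]]*Physical state:/ { phys = $3 }
        /^[[:space:]]*Rate:/           { print "port " port ": " state ", " phys ", rate " $2 }
    '
}

# Visit each node in turn and prefix every summary line with the node name.
ibstat_all_nodes() {
    i=1
    while [ "$i" -le 16 ]; do
        node="lustwzb$i"
        ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$node" /usr/sbin/ibstat 2>/dev/null |
            summarize_ibstat | sed "s/^/$node /"
        i=$((i + 1))
    done
}
```

Calling ibstat_all_nodes from the head node should then give one or two lines per node. A node that prints nothing either could not be reached or has no working ibstat, and is worth a closer look.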

Anyway, I don't see any red flag or problem.
[Well, unless somebody else spots something that I haven't seen,
which is *very* possible.]
It seems to be good to go to run MPI (Open MPI) programs
using InfiniBand.


********

Now some items a bit off topic, not a specific answer to your question, but hopefully they may help.

1) pdsh

Do you have a head/master node in the cluster?
Is it lustwz99 perhaps?
You could run pdsh from there.
It is very helpful for cluster-wide checks, etc.
(You can install it if not there, sometimes there
is also "dsh" already installed, although older.)

https://sourceforge.net/projects/pdsh/

[It may be available as package (rpm or similar)
for your Linux distribution also.]
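
Just to give a flavor of pdsh, here is a minimal sketch of a wrapper. The host range lustwzb[1-16] is an assumption taken from the names in your ibhosts output, cluster_run and DRY_RUN are names I invented, and depending on how pdsh was built you may need to tell it to use ssh (for instance via the PDSH_RCMD_TYPE environment variable).

```shell
# Hypothetical host range, based on the names in the ibhosts output.
NODES=${NODES:-'lustwzb[1-16]'}

# Run one command on every node; with DRY_RUN=1 only print what would be run.
cluster_run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "pdsh -w $NODES $*"
    else
        # dshbak -c (shipped with pdsh) merges identical output from different nodes.
        pdsh -w "$NODES" "$@" | dshbak -c
    fi
}
```

With that in place, something like cluster_run /usr/sbin/ibstat or cluster_run 'lsmod | grep ib_uverbs' would run the check on all nodes at once, and nodes that answer differently from the rest stand out in the merged output.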

2) Open MPI details and customization

I'd suggest that you take a look at the Open MPI FAQ
for more details, especially on how to control things at runtime.
They have zillions of "MCA parameters" that allow a lot of customization, if you care:

https://www.open-mpi.org/faq/

Their README file (you can get it in their tarball) is also
a good source of information.

3) Resource managers and integration with Open MPI

Also, if you have a "resource manager" (a.k.a. job queue system),
such as Torque/PBS, Slurm, SGE, you may want to look into integrating
it with Open MPI (if it is not already this way), and how to
set up the job scripts to take advantage of that integration.
The Open MPI FAQs have some material on this (and the Open MPI README file also), but you may need to consult the "resource manager" documentation as well. [If you're using Torque start with "man qsub".]
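
If it turns out to be Torque, a job script could be sketched roughly as below. The job name, the nodes=2:ppn=8 request, the walltime, and ./my_mpi_app are all placeholders, not something I know about your cluster. The fallbacks for PBS_O_WORKDIR and PBS_NODEFILE are only there so the script does nothing harmful when run outside of a job.

```shell
#!/bin/sh
#PBS -N mpi_test
#PBS -l nodes=2:ppn=8
#PBS -l walltime=00:10:00
#PBS -j oe

# Count the slots Torque assigned; PBS_NODEFILE has one line per slot.
count_slots() {
    if [ -r "$1" ]; then wc -l < "$1" | tr -d ' '; else echo 0; fi
}

# Torque sets PBS_O_WORKDIR to the directory qsub was called from.
cd "${PBS_O_WORKDIR:-.}" || exit 1
nprocs=$(count_slots "${PBS_NODEFILE:-}")

if [ "$nprocs" -gt 0 ] && command -v mpirun >/dev/null 2>&1; then
    # With Torque support built into Open MPI, plain "mpirun ./my_mpi_app"
    # should find the node list by itself; the explicit form is the fallback.
    mpirun -np "$nprocs" --hostfile "$PBS_NODEFILE" ./my_mpi_app
fi
```

You would submit it with qsub. To see whether your Open MPI build has the Torque support, I'd look for "tm" components in the ompi_info output (ompi_info | grep tm).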


4) Open MPI installation: NFS vs. local

You may need to check if Open MPI is installed, say,
in an NFS shared directory, visible to all nodes,
or perhaps installed via package (RPM or similar) on
all nodes.
In the latter case, make sure you have the same exact
version (including the compiler that was used to build it) everywhere.
Installing on NFS makes life easier on small clusters (for updates, etc).
Make sure the NFS directory is exported/mounted to/by all nodes.
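
One way to check the "same version everywhere" part is to ask each node what its mpirun reports and count the distinct answers. This is a sketch only: it assumes pdsh is available, that the nodes are lustwzb[1-16], and that pdsh prefixes each output line with "node: " as it normally does. The name count_versions is mine.

```shell
# Reads "node: text" lines on stdin and prints each distinct text once,
# prefixed by the number of nodes that reported it.
count_versions() {
    sed 's/^[^:]*: //' | sort | uniq -c | sed 's/^ *//'
}

# Only query the cluster when pdsh is installed.
if command -v pdsh >/dev/null 2>&1; then
    pdsh -w 'lustwzb[1-16]' 'mpirun --version 2>&1 | head -n 1' | count_versions
fi
```

A single output line means all nodes agree. More than one line means some nodes have a different Open MPI (or a different mpirun first in the PATH), which is worth sorting out before chasing performance.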

5) Environment variables and "environment modules" package

You may also need to set some environment variables (such as PATH and LD_LIBRARY_PATH) to ensure that Open MPI (and any other software) works. The simplest way is brute force in the .bashrc/.tcshrc initialization files.
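
For bash, the brute force version could be a few lines like these in .bashrc. The install prefix /opt/openmpi-1.8.4 is purely a made-up example, so use whatever directory Open MPI really lives in on your cluster.

```shell
# Hypothetical install prefix; replace with the real Open MPI location.
OMPI_PREFIX=${OMPI_PREFIX:-/opt/openmpi-1.8.4}

# Prepend the Open MPI directories so its mpirun and libraries are found first.
PATH="$OMPI_PREFIX/bin:$PATH"
LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export PATH LD_LIBRARY_PATH
```

The settings have to be in effect on the compute nodes too, for non-interactive logins, otherwise the remote MPI processes may not find their libraries.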

However, I'd recommend taking a look at the "environment modules"
package, which provides a much cleaner solution, and makes it easy
for users to switch from one compiler to another, from one version
of Open MPI to another, etc.
If you provide a variety of versions of software, that is a must.

http://modules.sourceforge.net/


[Available as package in many Linux distros.]

**

I hope this helps,
Gus Correa

On 08/02/2017 01:37 PM, Faraz Hussain wrote:
Thanks for the tips. We have openmpi installed. Here is some relevant output from the commands you suggested. One confusing thing is ibstat shows only port 1 as active. But ibhosts shows port 2 only.

[hussaif1@lustwzb4 test]$ lsmod | grep ib
ib_ucm                 12120  0
ib_ipoib              114971  0
ib_cm                  42214  3 ib_ucm,rdma_cm,ib_ipoib
ib_uverbs              50244  2 rdma_ucm,ib_ucm
ib_umad                12562  0
mlx5_ib               103326  0
mlx5_core              85201  1 mlx5_ib
mlx4_ib               164865  0
ib_sa                  24170  5 rdma_ucm,rdma_cm,ib_ipoib,ib_cm,mlx4_ib
ib_mad                 43241  4 ib_cm,ib_umad,mlx4_ib,ib_sa
ib_core                95458  12 rdma_ucm,ib_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx5_ib,mlx4_ib,ib_sa,ib_mad
ib_addr                 7732  3 rdma_cm,ib_uverbs,ib_core
ipv6                  317829  145 ib_ipoib,mlx4_ib,ib_addr
mlx4_core             258183  2 mlx4_en,mlx4_ib
compat                 23876  17 rdma_ucm,ib_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx5_ib,mlx5_core,mlx4_en,mlx4_ib,ib_sa,ib_mad,ib_core,ib_addr,mlx4_core
libcrc32c               1246  1 bnx2x

[hussaif1@lustwzb4 test]$ ompi_info | grep ib

MCA btl: openib (MCA v2.0, API v2.0, Component v1.8.4)

[hussaif1@lustwzb4 test]$ ibstat
CA 'mlx4_0'
         CA type: MT4099
         Number of ports: 2
         Firmware version: 2.11.550
         Hardware version: 0
         Node GUID: 0xf452140300163b70
         System image GUID: 0xf452140300163b73
         Port 1:
                 State: Active
                 Physical state: LinkUp
                 Rate: 40 (FDR10)
                 Base lid: 3
                 LMC: 0
                 SM lid: 1
                 Capability mask: 0x02514868
                 Port GUID: 0xf452140300163b71
                 Link layer: InfiniBand
         Port 2:
                 State: Down
                 Physical state: Disabled
                 Rate: 10
                 Base lid: 0
                 LMC: 0
                 SM lid: 0
                 Capability mask: 0x02514868
                 Port GUID: 0xf452140300163b72
                 Link layer: InfiniBand

[hussaif1@lustwzb4 test]$ ibhosts
Ca      : 0xf45214030015bf60 ports 2 "lustwzb9 HCA-1"
Ca      : 0xf45214030015c0e0 ports 2 "lustwzb16 HCA-1"
Ca      : 0xf452140300163e20 ports 2 "lustwzb15 HCA-1"
Ca      : 0xf45214030015c080 ports 2 "lustwzb14 HCA-1"
Ca      : 0xf45214030015c290 ports 2 "lustwzb13 HCA-1"
Ca      : 0xf45214030015bf70 ports 2 "lustwzb12 HCA-1"
Ca      : 0xf452140300163bb0 ports 2 "lustwzb11 HCA-1"
Ca      : 0xf452140300163c70 ports 2 "lustwzb10 HCA-1"
Ca      : 0xf452140300163e30 ports 2 "lustwzb8 HCA-1"
Ca      : 0xf452140300163b80 ports 2 "lustwzb7 HCA-1"
Ca      : 0xf452140300163ba0 ports 2 "lustwzb6 HCA-1"
Ca      : 0xf45214030015bfb0 ports 2 "lustwzb5 HCA-1"
Ca      : 0xf45214030015bf90 ports 2 "lustwzb3 HCA-1"
Ca      : 0xf452140300163df0 ports 2 "lustwzb2 HCA-1"
Ca      : 0xf45214030015c0a0 ports 2 "lustwzb1 HCA-1"
Ca      : 0x0002c90300b78240 ports 1 "lustwz99 HCA-1"
Ca      : 0xf452140300163b70 ports 2 "lustwzb4 HCA-1"

[hussaif1@lustwzb4 test]$ ibnetdiscover
#
# Topology file: generated on Wed Aug  2 13:24:40 2017
#
# Initiated from node f452140300163b70 port f452140300163b71

vendid=0x2c9
devid=0xc738
sysimgguid=0x2c9030089cab0
switchguid=0x2c9030089cab0(2c9030089cab0)
Switch 32 "S-0002c9030089cab0" # "SwitchX - Mellanox Technologies" base port 0 lid 2 lmc 0
[16] "H-0002c90300b78240"[1](2c90300b78241) # "lustwz99 HCA-1" lid 1 4xFDR10
[17] "H-f45214030015c0a0"[1](f45214030015c0a1) # "lustwzb1 HCA-1" lid 5 4xFDR10
[18] "H-f452140300163df0"[1](f452140300163df1) # "lustwzb2 HCA-1" lid 6 4xFDR10
[19] "H-f45214030015bf90"[1](f45214030015bf91) # "lustwzb3 HCA-1" lid 4 4xFDR10
[20] "H-f452140300163b70"[1](f452140300163b71) # "lustwzb4 HCA-1" lid 3 4xFDR10
[21] "H-f45214030015bfb0"[1](f45214030015bfb1) # "lustwzb5 HCA-1" lid 7 4xFDR10
[22] "H-f452140300163ba0"[1](f452140300163ba1) # "lustwzb6 HCA-1" lid 8 4xFDR10
[23] "H-f452140300163b80"[1](f452140300163b81) # "lustwzb7 HCA-1" lid 9 4xFDR10
[24] "H-f452140300163e30"[1](f452140300163e31) # "lustwzb8 HCA-1" lid 10 4xFDR10
[25] "H-f45214030015bf60"[1](f45214030015bf61) # "lustwzb9 HCA-1" lid 11 4xFDR10
[26] "H-f452140300163c70"[1](f452140300163c71) # "lustwzb10 HCA-1" lid 12 4xFDR10
[27] "H-f452140300163bb0"[1](f452140300163bb1) # "lustwzb11 HCA-1" lid 13 4xFDR10
[28] "H-f45214030015bf70"[1](f45214030015bf71) # "lustwzb12 HCA-1" lid 14 4xFDR10
[29] "H-f45214030015c290"[1](f45214030015c291) # "lustwzb13 HCA-1" lid 15 4xFDR10
[30] "H-f45214030015c080"[1](f45214030015c081) # "lustwzb14 HCA-1" lid 16 4xFDR10
[31] "H-f452140300163e20"[1](f452140300163e21) # "lustwzb15 HCA-1" lid 17 4xFDR10
[32] "H-f45214030015c0e0"[1](f45214030015c0e1) # "lustwzb16 HCA-1" lid 18 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015c0e3
caguid=0xf45214030015c0e0
Ca      2 "H-f45214030015c0e0"          # "lustwzb16 HCA-1"
[1](f45214030015c0e1) "S-0002c9030089cab0"[32] # lid 18 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163e23
caguid=0xf452140300163e20
Ca      2 "H-f452140300163e20"          # "lustwzb15 HCA-1"
[1](f452140300163e21) "S-0002c9030089cab0"[31] # lid 17 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015c083
caguid=0xf45214030015c080
Ca      2 "H-f45214030015c080"          # "lustwzb14 HCA-1"
[1](f45214030015c081) "S-0002c9030089cab0"[30] # lid 16 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015bf73
caguid=0xf45214030015bf70
Ca      2 "H-f45214030015bf70"          # "lustwzb12 HCA-1"
[1](f45214030015bf71) "S-0002c9030089cab0"[28] # lid 14 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015c293
caguid=0xf45214030015c290
Ca      2 "H-f45214030015c290"          # "lustwzb13 HCA-1"
[1](f45214030015c291) "S-0002c9030089cab0"[29] # lid 15 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015bf63
caguid=0xf45214030015bf60
Ca      2 "H-f45214030015bf60"          # "lustwzb9 HCA-1"
[1](f45214030015bf61) "S-0002c9030089cab0"[25] # lid 11 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163bb3
caguid=0xf452140300163bb0
Ca      2 "H-f452140300163bb0"          # "lustwzb11 HCA-1"
[1](f452140300163bb1) "S-0002c9030089cab0"[27] # lid 13 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163c73
caguid=0xf452140300163c70
Ca      2 "H-f452140300163c70"          # "lustwzb10 HCA-1"
[1](f452140300163c71) "S-0002c9030089cab0"[26] # lid 12 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163e33
caguid=0xf452140300163e30
Ca      2 "H-f452140300163e30"          # "lustwzb8 HCA-1"
[1](f452140300163e31) "S-0002c9030089cab0"[24] # lid 10 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163b83
caguid=0xf452140300163b80
Ca      2 "H-f452140300163b80"          # "lustwzb7 HCA-1"
[1](f452140300163b81) "S-0002c9030089cab0"[23] # lid 9 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015bfb3
caguid=0xf45214030015bfb0
Ca      2 "H-f45214030015bfb0"          # "lustwzb5 HCA-1"
[1](f45214030015bfb1) "S-0002c9030089cab0"[21] # lid 7 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163ba3
caguid=0xf452140300163ba0
Ca      2 "H-f452140300163ba0"          # "lustwzb6 HCA-1"
[1](f452140300163ba1) "S-0002c9030089cab0"[22] # lid 8 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163df3
caguid=0xf452140300163df0
Ca      2 "H-f452140300163df0"          # "lustwzb2 HCA-1"
[1](f452140300163df1) "S-0002c9030089cab0"[18] # lid 6 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015bf93
caguid=0xf45214030015bf90
Ca      2 "H-f45214030015bf90"          # "lustwzb3 HCA-1"
[1](f45214030015bf91) "S-0002c9030089cab0"[19] # lid 4 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf45214030015c0a3
caguid=0xf45214030015c0a0
Ca      2 "H-f45214030015c0a0"          # "lustwzb1 HCA-1"
[1](f45214030015c0a1) "S-0002c9030089cab0"[17] # lid 5 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0x2c90300b78243
caguid=0x2c90300b78240
Ca      1 "H-0002c90300b78240"          # "lustwz99 HCA-1"
[1](2c90300b78241) "S-0002c9030089cab0"[16] # lid 1 lmc 0 "SwitchX - Mellanox Technologies" lid 2 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf452140300163b73
caguid=0xf452140300163b70
Ca      2 "H-f452140300163b70"          # "lustwzb4 HCA-1"
[1](f452140300163b71)   "S-0002c9030089cab0"[20]

Quoting Gus Correa <g...@ldeo.columbia.edu>:

Hi Faraz

1) lsmod | grep ib should show if the InfiniBand kernel modules are loaded.

2) InfiniBand normally uses remote DMA (rdma) through "verbs".
You should see an "ib" module with "verbs" in the name.
That is the preferred/faster mode for MPI.

3) However, you can also use InfiniBand for TCP/IP (slower).
As the output of your ifconfig shows, your ib0 interface is
also configured for TCP/IP.

4) You may have two interfaces (one card with two ports, or two cards) in the nodes. One may not be connected to a switch (ib1). Check the back of your nodes.

5) How to check if MPI is using it depends a bit on which MPI library
you're using.
Which one? Open MPI, MVAPICH2, some vendor/proprietary one?
If it is Open MPI the command "ompi_info" will tell.
With Open MPI there are also ways to enable/disable
InfiniBand at runtime.

6) Some InfiniBand diagnostics may also help (normally in /usr/sbin)

ibstat
ibhosts
ibnetdiscover

etc

OK, this is my pedestrian view of InfiniBand.
Now let's hear the experts on the list for deeper insights. :)

I hope this helps,
Gus Correa


On 08/02/2017 12:44 PM, Faraz Hussain wrote:
I have inherited a 20-node cluster that supposedly has an InfiniBand network. I am testing some MPI applications and am seeing no performance improvement with multiple nodes. So I am wondering if the InfiniBand network even works?

The output of ifconfig -a shows an ib0 and ib1 network. I ran ethtool ib0 and it shows:

        Speed: 40000Mb/s
        Link detected: no

and for ib1 it shows:

        Speed: 10000Mb/s
        Link detected: no

I am assuming this means it is down? Any idea how to debug further and restart it?

Thanks!

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

