What hardware and what InfiniBand switch do you have?
Run these commands: ibdiagnet and smshow

Unfortunately ibdiagnet seems to give some errors:

[hussaif1@lustwzb34 ~]$ ibdiagnet
----------
Load Plugins from:
/usr/share/ibdiagnet2.1.1/plugins/
(You can specify more paths to be looked in with "IBDIAGNET_PLUGINS_PATH" env variable)

Plugin Name                                   Result     Comment
libibdiagnet_cable_diag_plugin-2.1.1          Succeeded  Plugin loaded
libibdiagnet_phy_diag_plugin-2.1.1            Succeeded  Plugin loaded

---------------------------------------------
Discovery
-E- Failed to initialize

-E- Fabric Discover failed, err=IBDiag initialize wasn't done
-E- Fabric Discover failed, MAD err=Failed to umad_open_port

---------------------------------------------
Summary
-I- Stage                     Warnings   Errors     Comment
-I- Discovery                                       NA
-I- Lids Check                                      NA
-I- Links Check                                     NA
-I- Subnet Manager                                  NA
-I- Port Counters                                   NA
-I- Nodes Information                               NA
-I- Speed / Width checks                            NA
-I- Partition Keys                                  NA
-I- Alias GUIDs                                     NA
-I- Temperature Sensing                             NA

-I- You can find detailed errors/warnings in: /var/tmp/ibdiagnet2/ibdiagnet2.log

-E- A fatal error occurred, exiting...


I do not have the smshow command, but I see there is an sminfo. It also gives this error:

[hussaif1@lustwzb34 ~]$ smshow
bash: smshow: command not found...
[hussaif1@lustwzb34 ~]$ sm
smartctl smbcacls smbcquotas smbspool smbtree sm-notify smpdump smtp-sink smartd smbclient smbget smbtar sminfo smparquery smpquery smtp-source
[hussaif1@lustwzb34 ~]$ sminfo
ibwarn: [10407] mad_rpc_open_port: can't open UMAD port ((null):0)
sminfo: iberror: failed: Failed to open '(null)' port '0'
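Both failures above come down to the tools being unable to open a user-space MAD (umad) port. One quick thing to rule out is whether the umad device nodes exist and whether the current user may open them. The sketch below assumes the usual /dev/infiniband/umad* naming created by the ib_umad module; check_umad is a hypothetical helper name, not part of any InfiniBand package. The failure may also simply reflect that the port's link layer is Ethernet rather than InfiniBand, as the ibv_devinfo output further down in this thread reports, in which case there may be no subnet manager to query.

```shell
# Report whether umad device nodes exist and are accessible to the current user.
# The optional directory argument exists only for illustration and testing;
# it defaults to /dev/infiniband (assumed standard location).
check_umad() {
    dir="${1:-/dev/infiniband}"
    found=0
    for dev in "$dir"/umad*; do
        # If the glob matched nothing, the literal pattern is skipped here.
        [ -e "$dev" ] || continue
        found=1
        if [ -r "$dev" ] && [ -w "$dev" ]; then
            echo "$dev: accessible"
        else
            echo "$dev: no read/write permission (try as root)"
        fi
    done
    [ "$found" -eq 1 ] || echo "no umad devices under $dir"
}

check_umad
```

If the nodes exist but are not accessible, rerunning ibdiagnet or sminfo as root is a reasonable next step.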



You originally had the Open MPI provided by CentOS?

Correct.

You compiled Open MPI from source?

Yes, I then compiled it from source and it seems to work (at least it gives reasonable numbers when running latency and bandwidth tests).

How are you bringing the new Open MPI version into your PATH? Are you
using modules or an MPI switcher utility?

Just as follows:

export PATH=/Apps/users/hussaif1/openmpi-4.0.0/bin:$PATH
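When a source build is selected by PATH alone, it is worth confirming that the shell really picks up the new mpirun, and the matching libraries often need to be found as well. The following is a minimal sketch, assuming the install prefix has the usual bin/ and lib/ layout; use_openmpi is a hypothetical helper name, and whether LD_LIBRARY_PATH is needed at all depends on how Open MPI was built.

```shell
# Prepend an Open MPI install prefix to PATH and LD_LIBRARY_PATH.
# Assumes the prefix contains bin/ and lib/ (typical for a source build).
use_openmpi() {
    prefix="$1"
    export PATH="$prefix/bin:$PATH"
    # Only add a ':' separator if LD_LIBRARY_PATH was already set and non-empty.
    export LD_LIBRARY_PATH="$prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

use_openmpi /Apps/users/hussaif1/openmpi-4.0.0
# Show which mpirun the shell now resolves, if any.
command -v mpirun || echo "mpirun not found in PATH"
```

The same settings have to be in effect on every node in the hostfile, otherwise the remote side may still start the distribution's Open MPI.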

Thanks!


On Wed, 1 May 2019 at 09:39, Benson Muite <benson_mu...@emailplus.org> wrote:

Hi Faraz,

Have you tried any other MPI distributions (eg. MPICH, MVAPICH)?

Regards,

Benson
On 4/30/19 11:20 PM, Gus Correa wrote:

It may be using IPoIB (TCP/IP over IB), not verbs/rdma.
You can force it to use openib (verbs, rdma) with (vader is for in-node
shared memory):

mpirun --mca btl openib,self,vader ...


This flag may also help tell which BTL (byte transfer layer) is being used:

 --mca btl_base_verbose 30
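One way to use these two options together is to run the same benchmark once per BTL list and compare the numbers. The sketch below only prints the command lines rather than running them; run_with_btl is a hypothetical helper, and the tcp list is an assumption added here for comparison, not something suggested in this thread.

```shell
# Print an mpirun command line that pins the BTL list and enables BTL verbosity.
# First argument: comma-separated BTL list; the rest is passed through unchanged.
run_with_btl() {
    btl_list="$1"
    shift
    echo mpirun --mca btl "$btl_list" --mca btl_base_verbose 30 "$@"
}

# Verbs/RDMA path as suggested above, then a TCP run for comparison (assumed).
run_with_btl openib,self,vader -np 2 -hostfile ./hostfile ./osu_latency
run_with_btl tcp,self,vader -np 2 -hostfile ./hostfile ./osu_latency
```

If the forced openib run aborts while the tcp run works, that would suggest the verbs path is not usable as configured.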

See these FAQ entries:
https://www.open-mpi.org/faq/?category=openfabrics#ib-btl
https://www.open-mpi.org/faq/?category=all#tcp-routability-1.3

Better to ask for more details on the Open MPI list. They are the pros!

My two cents,
Gus Correa



On Tue, Apr 30, 2019 at 3:57 PM Faraz Hussain <i...@feacluster.com> wrote:

Thanks, after building Open MPI 4 from source, it now works! However, it
still gives the message below when I run Open MPI with the verbose setting:

No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

   Local host:           lustwzb34
   Local device:         mlx4_0
   Local port:           1
   CPCs attempted:       rdmacm, udcm

However, the results from my latency and bandwidth tests seem to be
what I would expect from InfiniBand. See:

[hussaif1@lustwzb34 pt2pt]$ mpirun -v -np 2 -hostfile ./hostfile ./osu_latency
# OSU MPI Latency Test v5.3.2
# Size          Latency (us)
0                       1.87
1                       1.88
2                       1.93
4                       1.92
8                       1.93
16                      1.95
32                      1.93
64                      2.08
128                     2.61
256                     2.72
512                     2.93
1024                    3.33
2048                    3.81
4096                    4.71
8192                    6.68
16384                   8.38
32768                  12.13
65536                  19.74
131072                 35.08
262144                 64.67
524288                122.11
1048576               236.69
2097152               465.97
4194304               926.31

[hussaif1@lustwzb34 pt2pt]$ mpirun -v -np 2 -hostfile ./hostfile ./osu_bw
# OSU MPI Bandwidth Test v5.3.2
# Size      Bandwidth (MB/s)
1                       3.09
2                       6.35
4                      12.77
8                      26.01
16                     51.31
32                    103.08
64                    197.89
128                   362.00
256                   676.28
512                  1096.26
1024                 1819.25
2048                 2551.41
4096                 3886.63
8192                 3983.17
16384                4362.30
32768                4457.09
65536                4502.41
131072               4512.64
262144               4531.48
524288               4537.42
1048576              4510.69
2097152              4546.64
4194304              4565.12
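To relate the peak figure above to a link rate, it helps to convert it to Gbit/s. The sketch below assumes that osu_bw reports MB/s in units of 10^6 bytes per second; mbps_to_gbit is a hypothetical helper name.

```shell
# Convert a bandwidth in MB/s (assumed 10^6 bytes per second) to Gbit/s.
mbps_to_gbit() {
    awk -v mb="$1" 'BEGIN { printf "%.2f\n", mb * 8 / 1000 }'
}

mbps_to_gbit 4565.12   # prints 36.52
```

About 36.5 Gbit/s of payload would be plausible for a 40 Gbit/s-class link, although the actual link speed is not shown anywhere in this thread.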

When I run ibv_devinfo I get:

[hussaif1@lustwzb34 pt2pt]$ ibv_devinfo
hca_id: mlx4_0
         transport:                      InfiniBand (0)
         fw_ver:                         2.36.5000
         node_guid:                      480f:cfff:fff5:c6c0
         sys_image_guid:                 480f:cfff:fff5:c6c3
         vendor_id:                      0x02c9
         vendor_part_id:                 4103
         hw_ver:                         0x0
         board_id:                       HP_1360110017
         phys_port_cnt:                  2
         Device ports:
                 port:   1
                         state:                  PORT_ACTIVE (4)
                         max_mtu:                4096 (5)
                         active_mtu:             1024 (3)
                         sm_lid:                 0
                         port_lid:               0
                         port_lmc:               0x00
                         link_layer:             Ethernet

                 port:   2
                         state:                  PORT_DOWN (1)
                         max_mtu:                4096 (5)
                         active_mtu:             1024 (3)
                         sm_lid:                 0
                         port_lid:               0
                         port_lmc:               0x00
                         link_layer:             Ethernet
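The fields that matter most in this output are the port state and the link layer: both ports report link_layer: Ethernet, not InfiniBand, and only port 1 is active. A minimal sketch for pulling out just those fields, assuming the ibv_devinfo layout shown above; summarize_ports is a hypothetical helper name.

```shell
# Print "port state link_layer" for each port found in ibv_devinfo-style
# text read from standard input.
summarize_ports() {
    awk '
        $1 == "port:"       { port = $2 }
        $1 == "state:"      { state = $2 }
        $1 == "link_layer:" { print port, state, $2 }
    '
}

# Usage on a node with libibverbs-utils installed:
#   ibv_devinfo | summarize_ports
```

With the output above this would list port 1 as PORT_ACTIVE Ethernet and port 2 as PORT_DOWN Ethernet. An Ethernet link layer would also be consistent with sm_lid and port_lid both being 0.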

I will ask the Open MPI mailing list whether my results make sense.


Quoting Gus Correa <g...@ldeo.columbia.edu>:

> Hi Faraz
>
> By all means, download the Open MPI tarball and build from source.
> Otherwise there won't be support for IB (the CentOS Open MPI packages most
> likely rely only on TCP/IP).
>
> Read their README file (it comes in the tarball), and take a careful look
> at their (excellent) FAQ:
> https://www.open-mpi.org/faq/
> Many issues can be solved by just reading these two resources.
>
> If you hit more trouble, subscribe to the Open MPI mailing list and ask
> questions there, because you will get advice directly from the Open MPI
> developers, and the fix will come easy.
> https://www.open-mpi.org/community/lists/ompi.php
>
> My two cents,
> Gus Correa
>
> On Tue, Apr 30, 2019 at 3:07 PM Faraz Hussain <i...@feacluster.com> wrote:
>
>> Thanks, yes I have installed those libraries. See below. Initially I
>> installed the libraries via yum. But then I tried installing the rpms
>> directly from Mellanox website (
>> MLNX_OFED_LINUX-4.5-1.0.1.0-rhel7.5-x86_64.tar ). Even after doing
>> that, I still got the same error with openmpi. I will try your
>> suggestion of building openmpi from source next!
>>
>> root@lustwzb34:/root # yum list | grep ibverbs
>> libibverbs.x86_64                     41mlnx1-OFED.4.5.0.1.0.45101
>> libibverbs-devel.x86_64               41mlnx1-OFED.4.5.0.1.0.45101
>> libibverbs-devel-static.x86_64        41mlnx1-OFED.4.5.0.1.0.45101
>> libibverbs-utils.x86_64               41mlnx1-OFED.4.5.0.1.0.45101
>> libibverbs.i686                       17.2-3.el7                    rhel-7-server-rpms
>> libibverbs-devel.i686                 1.2.1-1.el7                   rhel-7-server-rpms
>>
>> root@lustwzb34:/root # lsmod | grep ib
>> ib_ucm                 22602  0
>> ib_ipoib              168425  0
>> ib_cm                  53141  3 rdma_cm,ib_ucm,ib_ipoib
>> ib_umad                22093  0
>> mlx5_ib               339961  0
>> ib_uverbs             121821  3 mlx5_ib,ib_ucm,rdma_ucm
>> mlx5_core             919178  2 mlx5_ib,mlx5_fpga_tools
>> mlx4_ib               211747  0
>> ib_core               294554  10 rdma_cm,ib_cm,iw_cm,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
>> mlx4_core             360598  2 mlx4_en,mlx4_ib
>> mlx_compat             29012  15 rdma_cm,ib_cm,iw_cm,mlx4_en,mlx4_ib,mlx5_ib,mlx5_fpga_tools,ib_ucm,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib
>> devlink                42368  4 mlx4_en,mlx4_ib,mlx4_core,mlx5_core
>> libcrc32c              12644  3 xfs,nf_nat,nf_conntrack
>> root@lustwzb34:/root #
>>
>>
>>
>> > Did you install libibverbs (and libibverbs-utils, for information and
>> > troubleshooting)?
>>
>> > yum list |grep ibverbs
>>
>> > Are you loading the ib modules?
>>
>> > lsmod |grep ib
>>
>>




_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

