Errrr... you are not running a subnet manager? Do you have an InfiniBand switch, or are you connecting the two servers back-to-back?
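If they are back-to-back over InfiniBand, one of the two nodes has to run opensm. A quick check - just a sketch, and note the service is called opensm with the distro packages but opensmd with Mellanox OFED, so use whichever you actually installed:

    sminfo                      # should report the SM lid and state once a subnet manager is up
    systemctl status opensm     # or: systemctl status opensmd (Mellanox OFED)
    systemctl start opensm      # start it on one node if nothing is running

This only matters if the ports are actually running the InfiniBand link layer, of course.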
Also - have you considered using OpenHPC rather than installing CentOS on two servers? When you expand, this manual installation is going to be painful.

On Wed, 1 May 2019 at 15:05, Faraz Hussain <i...@feacluster.com> wrote:
>
> > What hardware and what InfiniBand switch you have
> > Run these commands: ibdiagnet smshow
>
> Unfortunately ibdiagnet seems to give some errors:
>
> [hussaif1@lustwzb34 ~]$ ibdiagnet
> ----------
> Load Plugins from:
> /usr/share/ibdiagnet2.1.1/plugins/
> (You can specify more paths to be looked in with
> "IBDIAGNET_PLUGINS_PATH" env variable)
>
> Plugin Name                            Result     Comment
> libibdiagnet_cable_diag_plugin-2.1.1   Succeeded  Plugin loaded
> libibdiagnet_phy_diag_plugin-2.1.1     Succeeded  Plugin loaded
>
> ---------------------------------------------
> Discovery
> -E- Failed to initialize
>
> -E- Fabric Discover failed, err=IBDiag initialize wasn't done
> -E- Fabric Discover failed, MAD err=Failed to umad_open_port
>
> ---------------------------------------------
> Summary
> -I- Stage                     Warnings   Errors   Comment
> -I- Discovery                                     NA
> -I- Lids Check                                    NA
> -I- Links Check                                   NA
> -I- Subnet Manager                                NA
> -I- Port Counters                                 NA
> -I- Nodes Information                             NA
> -I- Speed / Width checks                          NA
> -I- Partition Keys                                NA
> -I- Alias GUIDs                                   NA
> -I- Temperature Sensing                           NA
>
> -I- You can find detailed errors/warnings in:
> /var/tmp/ibdiagnet2/ibdiagnet2.log
>
> -E- A fatal error occurred, exiting...
>
> I do not have the smshow command, but I see there is an sminfo. It also
> gives this error:
>
> [hussaif1@lustwzb34 ~]$ smshow
> bash: smshow: command not found...
> [hussaif1@lustwzb34 ~]$ sm
> smartctl    smbcacls    smbcquotas   smbspool   smbtree
> sm-notify   smpdump     smtp-sink
> smartd      smbclient   smbget       smbtar     sminfo
> smparquery  smpquery    smtp-source
> [hussaif1@lustwzb34 ~]$ sminfo
> ibwarn: [10407] mad_rpc_open_port: can't open UMAD port ((null):0)
> sminfo: iberror: failed: Failed to open '(null)' port '0'
>
> > You originally had the OpenMPI which was provided by CentOS?
>
> Correct.
>
> > You compiled the OpenMPI from source?
>
> Yes, I then compiled it from source and it seems to work (at least it
> gives reasonable numbers when running latency and bandwidth tests).
>
> > How are you bringing the new OpenMPI version into your PATH? Are you
> > using modules or an MPI switcher utility?
>
> Just as follows:
>
> export PATH=/Apps/users/hussaif1/openmpi-4.0.0/bin:$PATH
>
> Thanks!
>
> > On Wed, 1 May 2019 at 09:39, Benson Muite <benson_mu...@emailplus.org>
> > wrote:
> >
> >> Hi Faraz,
> >>
> >> Have you tried any other MPI distributions (e.g. MPICH, MVAPICH)?
> >>
> >> Regards,
> >>
> >> Benson
> >>
> >> On 4/30/19 11:20 PM, Gus Correa wrote:
> >>
> >> It may be using IPoIB (TCP/IP over IB), not verbs/rdma.
> >> You can force it to use openib (verbs, rdma) with (vader is for in-node
> >> shared memory):
> >>
> >> mpirun --mca btl openib,self,vader ...
> >>
> >> These flags may also help tell which btl (byte transport layer) is
> >> being used:
> >>
> >> --mca btl_base_verbose 30
> >>
> >> See these FAQ:
> >> https://www.open-mpi.org/faq/?category=openfabrics#ib-btl
> >> https://www.open-mpi.org/faq/?category=all#tcp-routability-1.3
> >>
> >> Better really ask for more details on the Open MPI list. They are the pros!
> >>
> >> My two cents,
> >> Gus Correa
> >>
> >> On Tue, Apr 30, 2019 at 3:57 PM Faraz Hussain <i...@feacluster.com> wrote:
> >>
> >>> Thanks, after building openmpi 4 from source, it now works!
> >>> However, it still gives this message below when I run openmpi with the
> >>> verbose setting:
> >>>
> >>> No OpenFabrics connection schemes reported that they were able to be
> >>> used on a specific port. As such, the openib BTL (OpenFabrics
> >>> support) will be disabled for this port.
> >>>
> >>> Local host:      lustwzb34
> >>> Local device:    mlx4_0
> >>> Local port:      1
> >>> CPCs attempted:  rdmacm, udcm
> >>>
> >>> However, the results from my latency and bandwidth tests seem to be
> >>> what I would expect from InfiniBand. See:
> >>>
> >>> [hussaif1@lustwzb34 pt2pt]$ mpirun -v -np 2 -hostfile ./hostfile ./osu_latency
> >>> # OSU MPI Latency Test v5.3.2
> >>> # Size       Latency (us)
> >>> 0            1.87
> >>> 1            1.88
> >>> 2            1.93
> >>> 4            1.92
> >>> 8            1.93
> >>> 16           1.95
> >>> 32           1.93
> >>> 64           2.08
> >>> 128          2.61
> >>> 256          2.72
> >>> 512          2.93
> >>> 1024         3.33
> >>> 2048         3.81
> >>> 4096         4.71
> >>> 8192         6.68
> >>> 16384        8.38
> >>> 32768        12.13
> >>> 65536        19.74
> >>> 131072       35.08
> >>> 262144       64.67
> >>> 524288       122.11
> >>> 1048576      236.69
> >>> 2097152      465.97
> >>> 4194304      926.31
> >>>
> >>> [hussaif1@lustwzb34 pt2pt]$ mpirun -v -np 2 -hostfile ./hostfile ./osu_bw
> >>> # OSU MPI Bandwidth Test v5.3.2
> >>> # Size       Bandwidth (MB/s)
> >>> 1            3.09
> >>> 2            6.35
> >>> 4            12.77
> >>> 8            26.01
> >>> 16           51.31
> >>> 32           103.08
> >>> 64           197.89
> >>> 128          362.00
> >>> 256          676.28
> >>> 512          1096.26
> >>> 1024         1819.25
> >>> 2048         2551.41
> >>> 4096         3886.63
> >>> 8192         3983.17
> >>> 16384        4362.30
> >>> 32768        4457.09
> >>> 65536        4502.41
> >>> 131072       4512.64
> >>> 262144       4531.48
> >>> 524288       4537.42
> >>> 1048576      4510.69
> >>> 2097152      4546.64
> >>> 4194304      4565.12
> >>>
> >>> When I run ibv_devinfo I get:
> >>>
> >>> [hussaif1@lustwzb34 pt2pt]$ ibv_devinfo
> >>> hca_id: mlx4_0
> >>>         transport:            InfiniBand (0)
> >>>         fw_ver:               2.36.5000
> >>>         node_guid:            480f:cfff:fff5:c6c0
> >>>         sys_image_guid:       480f:cfff:fff5:c6c3
> >>>         vendor_id:            0x02c9
> >>>         vendor_part_id:       4103
> >>>         hw_ver:               0x0
> >>>         board_id:             HP_1360110017
> >>>         phys_port_cnt:        2
> >>>         Device ports:
> >>>                 port:   1
> >>>                         state:        PORT_ACTIVE (4)
> >>>                         max_mtu:      4096 (5)
> >>>                         active_mtu:   1024 (3)
> >>>                         sm_lid:       0
> >>>                         port_lid:     0
> >>>                         port_lmc:     0x00
> >>>                         link_layer:   Ethernet
> >>>
> >>>                 port:   2
> >>>                         state:        PORT_DOWN (1)
> >>>                         max_mtu:      4096 (5)
> >>>                         active_mtu:   1024 (3)
> >>>                         sm_lid:       0
> >>>                         port_lid:     0
> >>>                         port_lmc:     0x00
> >>>                         link_layer:   Ethernet
> >>>
> >>> I will ask the openmpi mailing list if my results make sense?!
> >>>
> >>> Quoting Gus Correa <g...@ldeo.columbia.edu>:
> >>>
> >>> > Hi Faraz
> >>> >
> >>> > By all means, download the Open MPI tarball and build from source.
> >>> > Otherwise there won't be support for IB (the CentOS Open MPI packages
> >>> > most likely rely only on TCP/IP).
> >>> >
> >>> > Read their README file (it comes in the tarball), and take a careful
> >>> > look at their (excellent) FAQ:
> >>> > https://www.open-mpi.org/faq/
> >>> > Many issues can be solved by just reading these two resources.
> >>> >
> >>> > If you hit more trouble, subscribe to the Open MPI mailing list, and
> >>> > ask questions there, because you will get advice directly from the
> >>> > Open MPI developers, and the fix will come easy.
> >>> > https://www.open-mpi.org/community/lists/ompi.php
> >>> >
> >>> > My two cents,
> >>> > Gus Correa
> >>> >
> >>> > On Tue, Apr 30, 2019 at 3:07 PM Faraz Hussain <i...@feacluster.com> wrote:
> >>> >
> >>> >> Thanks, yes I have installed those libraries. See below.
> >>> >> Initially I installed the libraries via yum. But then I tried
> >>> >> installing the rpms directly from the Mellanox website
> >>> >> (MLNX_OFED_LINUX-4.5-1.0.1.0-rhel7.5-x86_64.tar). Even after doing
> >>> >> that, I still got the same error with openmpi. I will try your
> >>> >> suggestion of building openmpi from source next!
> >>> >>
> >>> >> root@lustwzb34:/root # yum list | grep ibverbs
> >>> >> libibverbs.x86_64                41mlnx1-OFED.4.5.0.1.0.45101
> >>> >> libibverbs-devel.x86_64          41mlnx1-OFED.4.5.0.1.0.45101
> >>> >> libibverbs-devel-static.x86_64   41mlnx1-OFED.4.5.0.1.0.45101
> >>> >> libibverbs-utils.x86_64          41mlnx1-OFED.4.5.0.1.0.45101
> >>> >> libibverbs.i686                  17.2-3.el7     rhel-7-server-rpms
> >>> >> libibverbs-devel.i686            1.2.1-1.el7    rhel-7-server-rpms
> >>> >>
> >>> >> root@lustwzb34:/root # lsmod | grep ib
> >>> >> ib_ucm                 22602  0
> >>> >> ib_ipoib              168425  0
> >>> >> ib_cm                  53141  3 rdma_cm,ib_ucm,ib_ipoib
> >>> >> ib_umad                22093  0
> >>> >> mlx5_ib               339961  0
> >>> >> ib_uverbs             121821  3 mlx5_ib,ib_ucm,rdma_ucm
> >>> >> mlx5_core             919178  2 mlx5_ib,mlx5_fpga_tools
> >>> >> mlx4_ib               211747  0
> >>> >> ib_core               294554 10 rdma_cm,ib_cm,iw_cm,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
> >>> >> mlx4_core             360598  2 mlx4_en,mlx4_ib
> >>> >> mlx_compat             29012 15 rdma_cm,ib_cm,iw_cm,mlx4_en,mlx4_ib,mlx5_ib,mlx5_fpga_tools,ib_ucm,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib
> >>> >> devlink                42368  4 mlx4_en,mlx4_ib,mlx4_core,mlx5_core
> >>> >> libcrc32c              12644  3 xfs,nf_nat,nf_conntrack
> >>> >> root@lustwzb34:/root #
> >>> >>
> >>> >> > Did you install libibverbs (and libibverbs-utils, for information
> >>> >> > and troubleshooting)?
> >>> >> >
> >>> >> > yum list |grep ibverbs
> >>> >> >
> >>> >> > Are you loading the ib modules?
> >>> >> >
> >>> >> > lsmod |grep ib
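For what it is worth, the flags Gus suggested above can be combined into a single command. This is only a sketch reusing the install prefix and hostfile from the thread - the lib directory is assumed to sit next to bin, so adjust the paths to your setup:

    export PATH=/Apps/users/hussaif1/openmpi-4.0.0/bin:$PATH
    # the matching lib directory usually needs to be on LD_LIBRARY_PATH as well
    export LD_LIBRARY_PATH=/Apps/users/hussaif1/openmpi-4.0.0/lib:$LD_LIBRARY_PATH
    mpirun -np 2 -hostfile ./hostfile \
        --mca btl openib,self,vader \
        --mca btl_base_verbose 30 \
        ./osu_latency

With btl_base_verbose turned up, the output should show which BTL components get selected on each host, which is the quickest way to see whether the traffic really goes over verbs/RDMA or falls back to TCP.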
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf