I think I was on the wrong track regarding the subnet manager, sorry. What does ibstatus give you?
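The reason I am backing off: the ibv_devinfo output quoted further down reports link_layer: Ethernet on both ports, and as far as I know a port in Ethernet mode (RoCE rather than native InfiniBand) does not use a subnet manager at all. A rough sketch for pulling out just that information is below. The function name port_link_layers is made up for illustration; ibv_devinfo itself is the real tool from libibverbs-utils, and the parsing assumes its usual "key: value" layout.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): read ibv_devinfo-style
# text on stdin and print "<device> port <n>: <state> <link_layer>"
# for every port that reports a link layer.
port_link_layers() {
    awk '
        $1 == "hca_id:"     { dev = $2 }
        $1 == "port:"       { port = $2 }
        $1 == "state:"      { state = $2 }
        $1 == "link_layer:" { printf "%s port %s: %s %s\n", dev, port, state, $2 }
    '
}

# Typical use on a node with libibverbs-utils installed:
#   ibv_devinfo | port_link_layers
```

With the output you posted, this would print "mlx4_0 port 1: PORT_ACTIVE Ethernet" and "mlx4_0 port 2: PORT_DOWN Ethernet".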
On Wed, 1 May 2019 at 15:31, John Hearns <hear...@googlemail.com> wrote:

> Errrr.. you are not running a subnet manager?
> Do you have an Infiniband switch or are you connecting two servers
> back-to-back?
>
> Also - have you considered using OpenHPC rather than installing CentOS on
> two servers?
> When you expand, this manual installation is going to be painful.
>
> On Wed, 1 May 2019 at 15:05, Faraz Hussain <i...@feacluster.com> wrote:
>
>> > What hardware and what Infiniband switch do you have?
>> > Run these commands: ibdiagnet smshow
>>
>> Unfortunately ibdiagnet seems to give some errors:
>>
>> [hussaif1@lustwzb34 ~]$ ibdiagnet
>> ----------
>> Load Plugins from:
>> /usr/share/ibdiagnet2.1.1/plugins/
>> (You can specify more paths to be looked in with
>> "IBDIAGNET_PLUGINS_PATH" env variable)
>>
>> Plugin Name                           Result     Comment
>> libibdiagnet_cable_diag_plugin-2.1.1  Succeeded  Plugin loaded
>> libibdiagnet_phy_diag_plugin-2.1.1    Succeeded  Plugin loaded
>>
>> ---------------------------------------------
>> Discovery
>> -E- Failed to initialize
>>
>> -E- Fabric Discover failed, err=IBDiag initialize wasn't done
>> -E- Fabric Discover failed, MAD err=Failed to umad_open_port
>>
>> ---------------------------------------------
>> Summary
>> -I- Stage Warnings Errors Comment
>> -I- Discovery NA
>> -I- Lids Check NA
>> -I- Links Check NA
>> -I- Subnet Manager NA
>> -I- Port Counters NA
>> -I- Nodes Information NA
>> -I- Speed / Width checks NA
>> -I- Partition Keys NA
>> -I- Alias GUIDs NA
>> -I- Temperature Sensing NA
>>
>> -I- You can find detailed errors/warnings in:
>> /var/tmp/ibdiagnet2/ibdiagnet2.log
>>
>> -E- A fatal error occurred, exiting...
>>
>> I do not have the smshow command, but I see there is an sminfo. It also
>> gives this error:
>>
>> [hussaif1@lustwzb34 ~]$ smshow
>> bash: smshow: command not found...
>> [hussaif1@lustwzb34 ~]$ sm
>> smartctl smbcacls smbcquotas smbspool smbtree
>> sm-notify smpdump smtp-sink
>> smartd smbclient smbget smbtar sminfo
>> smparquery smpquery smtp-source
>> [hussaif1@lustwzb34 ~]$ sminfo
>> ibwarn: [10407] mad_rpc_open_port: can't open UMAD port ((null):0)
>> sminfo: iberror: failed: Failed to open '(null)' port '0'
>>
>> > You originally had the OpenMPI which was provided by CentOS ??
>>
>> Correct.
>>
>> > You compiled the OpenMPI from source??
>>
>> Yes, I then compiled it from source and it seems to work ( at least
>> it gives reasonable numbers when running latency and bandwidth tests ).
>>
>> > How are you bringing the new OpenMPI version into your PATH ?? Are you
>> > using modules or an mpi switcher utility?
>>
>> Just as follows:
>>
>> export PATH=/Apps/users/hussaif1/openmpi-4.0.0/bin:$PATH
>>
>> Thanks!
>>
>> >
>> > On Wed, 1 May 2019 at 09:39, Benson Muite <benson_mu...@emailplus.org>
>> > wrote:
>> >
>> >> Hi Faraz,
>> >>
>> >> Have you tried any other MPI distributions (eg. MPICH, MVAPICH)?
>> >>
>> >> Regards,
>> >>
>> >> Benson
>> >> On 4/30/19 11:20 PM, Gus Correa wrote:
>> >>
>> >> It may be using IPoIB (TCP/IP over IB), not verbs/rdma.
>> >> You can force it to use openib (verbs, rdma) with (vader is for in-node
>> >> shared memory):
>> >>
>> >> mpirun --mca btl openib,self,vader ...
>> >>
>> >> These flags may also help tell which btl (byte transfer layer) is
>> >> being used:
>> >>
>> >> --mca btl_base_verbose 30
>> >>
>> >> See these FAQ entries:
>> >> https://www.open-mpi.org/faq/?category=openfabrics#ib-btl
>> >> https://www.open-mpi.org/faq/?category=all#tcp-routability-1.3
>> >>
>> >> Better to ask for more details on the Open MPI list. They are the pros!
>> >>
>> >> My two cents,
>> >> Gus Correa
>> >>
>> >> On Tue, Apr 30, 2019 at 3:57 PM Faraz Hussain <i...@feacluster.com>
>> >> wrote:
>> >>
>> >>> Thanks, after building openmpi 4 from source, it now works!
>> >>> However it still gives this message below when I run openmpi with
>> >>> verbose setting:
>> >>>
>> >>> No OpenFabrics connection schemes reported that they were able to be
>> >>> used on a specific port. As such, the openib BTL (OpenFabrics
>> >>> support) will be disabled for this port.
>> >>>
>> >>> Local host: lustwzb34
>> >>> Local device: mlx4_0
>> >>> Local port: 1
>> >>> CPCs attempted: rdmacm, udcm
>> >>>
>> >>> However, the results from my latency and bandwidth tests seem to be
>> >>> what I would expect from infiniband. See:
>> >>>
>> >>> [hussaif1@lustwzb34 pt2pt]$ mpirun -v -np 2 -hostfile ./hostfile ./osu_latency
>> >>> # OSU MPI Latency Test v5.3.2
>> >>> # Size        Latency (us)
>> >>> 0                     1.87
>> >>> 1                     1.88
>> >>> 2                     1.93
>> >>> 4                     1.92
>> >>> 8                     1.93
>> >>> 16                    1.95
>> >>> 32                    1.93
>> >>> 64                    2.08
>> >>> 128                   2.61
>> >>> 256                   2.72
>> >>> 512                   2.93
>> >>> 1024                  3.33
>> >>> 2048                  3.81
>> >>> 4096                  4.71
>> >>> 8192                  6.68
>> >>> 16384                 8.38
>> >>> 32768                12.13
>> >>> 65536                19.74
>> >>> 131072               35.08
>> >>> 262144               64.67
>> >>> 524288              122.11
>> >>> 1048576             236.69
>> >>> 2097152             465.97
>> >>> 4194304             926.31
>> >>>
>> >>> [hussaif1@lustwzb34 pt2pt]$ mpirun -v -np 2 -hostfile ./hostfile ./osu_bw
>> >>> # OSU MPI Bandwidth Test v5.3.2
>> >>> # Size        Bandwidth (MB/s)
>> >>> 1                         3.09
>> >>> 2                         6.35
>> >>> 4                        12.77
>> >>> 8                        26.01
>> >>> 16                       51.31
>> >>> 32                      103.08
>> >>> 64                      197.89
>> >>> 128                     362.00
>> >>> 256                     676.28
>> >>> 512                    1096.26
>> >>> 1024                   1819.25
>> >>> 2048                   2551.41
>> >>> 4096                   3886.63
>> >>> 8192                   3983.17
>> >>> 16384                  4362.30
>> >>> 32768                  4457.09
>> >>> 65536                  4502.41
>> >>> 131072                 4512.64
>> >>> 262144                 4531.48
>> >>> 524288                 4537.42
>> >>> 1048576                4510.69
>> >>> 2097152                4546.64
>> >>> 4194304                4565.12
>> >>>
>> >>> When I run ibv_devinfo I get:
>> >>>
>> >>> [hussaif1@lustwzb34 pt2pt]$ ibv_devinfo
>> >>> hca_id: mlx4_0
>> >>>     transport:       InfiniBand (0)
>> >>>     fw_ver:          2.36.5000
>> >>>     node_guid:       480f:cfff:fff5:c6c0
>> >>>     sys_image_guid:  480f:cfff:fff5:c6c3
>> >>>     vendor_id:       0x02c9
>> >>>     vendor_part_id:  4103
>> >>>     hw_ver:          0x0
>> >>>     board_id:        HP_1360110017
>> >>>     phys_port_cnt:   2
>> >>>     Device ports:
>> >>>         port: 1
>> >>>             state:       PORT_ACTIVE (4)
>> >>>             max_mtu:     4096 (5)
>> >>>             active_mtu:  1024 (3)
>> >>>             sm_lid:      0
>> >>>             port_lid:    0
>> >>>             port_lmc:    0x00
>> >>>             link_layer:  Ethernet
>> >>>
>> >>>         port: 2
>> >>>             state:       PORT_DOWN (1)
>> >>>             max_mtu:     4096 (5)
>> >>>             active_mtu:  1024 (3)
>> >>>             sm_lid:      0
>> >>>             port_lid:    0
>> >>>             port_lmc:    0x00
>> >>>             link_layer:  Ethernet
>> >>>
>> >>> I will ask the openmpi mailing list if my results make sense?!
>> >>>
>> >>>
>> >>> Quoting Gus Correa <g...@ldeo.columbia.edu>:
>> >>>
>> >>> > Hi Faraz
>> >>> >
>> >>> > By all means, download the Open MPI tarball and build from source.
>> >>> > Otherwise there won't be support for IB (the CentOS Open MPI packages
>> >>> > most likely rely only on TCP/IP).
>> >>> >
>> >>> > Read their README file (it comes in the tarball), and take a careful
>> >>> > look at their (excellent) FAQ:
>> >>> > https://www.open-mpi.org/faq/
>> >>> > Many issues can be solved by just reading these two resources.
>> >>> >
>> >>> > If you hit more trouble, subscribe to the Open MPI mailing list, and
>> >>> > ask questions there, because you will get advice directly from the
>> >>> > Open MPI developers, and the fix will come easy.
>> >>> > https://www.open-mpi.org/community/lists/ompi.php
>> >>> >
>> >>> > My two cents,
>> >>> > Gus Correa
>> >>> >
>> >>> > On Tue, Apr 30, 2019 at 3:07 PM Faraz Hussain <i...@feacluster.com>
>> >>> > wrote:
>> >>> >
>> >>> >> Thanks, yes I have installed those libraries. See below. Initially I
>> >>> >> installed the libraries via yum. But then I tried installing the rpms
>> >>> >> directly from Mellanox website (
>> >>> >> MLNX_OFED_LINUX-4.5-1.0.1.0-rhel7.5-x86_64.tar ). Even after doing
>> >>> >> that, I still got the same error with openmpi.
>> >>> >> I will try your suggestion of building openmpi from source next!
>> >>> >>
>> >>> >> root@lustwzb34:/root # yum list | grep ibverbs
>> >>> >> libibverbs.x86_64               41mlnx1-OFED.4.5.0.1.0.45101
>> >>> >> libibverbs-devel.x86_64         41mlnx1-OFED.4.5.0.1.0.45101
>> >>> >> libibverbs-devel-static.x86_64  41mlnx1-OFED.4.5.0.1.0.45101
>> >>> >> libibverbs-utils.x86_64         41mlnx1-OFED.4.5.0.1.0.45101
>> >>> >> libibverbs.i686                 17.2-3.el7                    rhel-7-server-rpms
>> >>> >> libibverbs-devel.i686           1.2.1-1.el7                   rhel-7-server-rpms
>> >>> >>
>> >>> >> root@lustwzb34:/root # lsmod | grep ib
>> >>> >> ib_ucm       22602  0
>> >>> >> ib_ipoib    168425  0
>> >>> >> ib_cm        53141  3 rdma_cm,ib_ucm,ib_ipoib
>> >>> >> ib_umad      22093  0
>> >>> >> mlx5_ib     339961  0
>> >>> >> ib_uverbs   121821  3 mlx5_ib,ib_ucm,rdma_ucm
>> >>> >> mlx5_core   919178  2 mlx5_ib,mlx5_fpga_tools
>> >>> >> mlx4_ib     211747  0
>> >>> >> ib_core     294554  10 rdma_cm,ib_cm,iw_cm,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
>> >>> >> mlx4_core   360598  2 mlx4_en,mlx4_ib
>> >>> >> mlx_compat   29012  15 rdma_cm,ib_cm,iw_cm,mlx4_en,mlx4_ib,mlx5_ib,mlx5_fpga_tools,ib_ucm,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib
>> >>> >> devlink      42368  4 mlx4_en,mlx4_ib,mlx4_core,mlx5_core
>> >>> >> libcrc32c    12644  3 xfs,nf_nat,nf_conntrack
>> >>> >> root@lustwzb34:/root #
>> >>> >>
>> >>> >> > Did you install libibverbs (and libibverbs-utils, for information
>> >>> >> > and troubleshooting)?
>> >>> >>
>> >>> >> > yum list |grep ibverbs
>> >>> >>
>> >>> >> > Are you loading the ib modules?
>> >>> >>
>> >>> >> > lsmod |grep ib
>> >>> >>
>> >>
>> >> _______________________________________________
>> >> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>> >> To change your subscription (digest mode or unsubscribe) visit
>> >> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf