Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-01 Thread Christopher Samuel
On 5/1/19 8:50 AM, Faraz Hussain wrote: Unfortunately I get this: root@lustwzb34:/root # systemctl status rdma Unit rdma.service could not be found. You're missing this RPM then, which might explain a lot: $ rpm -qi rdma-core Name: rdma-core Version : 17.2 Release : 3.el7 Arc

Re: [Beowulf] OT: open positions in HPC, Cloud, networking, services and support etc

2019-05-01 Thread Andrew Latham
Congrats! On Wed, May 1, 2019 at 10:36 AM Joe Landman wrote: > Hi folks, > >Apologies for OT conversation, I'll keep it very brief. My team is > looking for excellent HPC/Cloud/networking/support folk. Feel free to > ping me at email below or jlandman over at cray dot com, and I can point >

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-01 Thread Faraz Hussain
What does this say? systemctl status rdma Unfortunately I get this: root@lustwzb34:/root # systemctl status rdma Unit rdma.service could not be found.

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-01 Thread Faraz Hussain
Quoting John Hearns : On the RHEL 6.9 servers run ibstatus also And sminfo Unfortunately, the RHEL 6.9 machines don't appear to have all the Infiniband utilities installed. All I see is: [root@lustwzb25 ~]# ib ibacm ibdiagnet ibdmchkibis

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-01 Thread Faraz Hussain
Quoting John Hearns : What does ibstatus give you [hussaif1@lustwzb33 ~]$ ibstatus Infiniband device 'mlx4_0' port 1 status: default gid: fe80::::4a0f:cfff:fef5:b650 base lid:0x0 sm lid: 0x0 state: 4: ACTIVE

[Beowulf] OT: open positions in HPC, Cloud, networking, services and support etc

2019-05-01 Thread Joe Landman
Hi folks, Apologies for OT conversation, I'll keep it very brief. My team is looking for excellent HPC/Cloud/networking/support folk. Feel free to ping me at email below or jlandman over at cray dot com, and I can point you to the URLs. And, I forgot to mention, I'm now over at Cray as

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-01 Thread Christopher Samuel
On 5/1/19 7:05 AM, Faraz Hussain wrote: [hussaif1@lustwzb34 ~]$ sminfo ibwarn: [10407] mad_rpc_open_port: can't open UMAD port ((null):0) sminfo: iberror: failed: Failed to open '(null)' port '0' Sorry I'm late to this. What does this say? systemctl status rdma You should see something alon

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-01 Thread John Hearns via Beowulf
On the RHEL 6.9 servers run ibstatus also And sminfo On Wed, 1 May 2019 at 16:23, John Hearns wrote: > link_layer: Ethernet > > E…. > > On Wed, 1 May 2019 at 16:18, Faraz Hussain wrote: > >> >> Quoting John Hearns : >> >> > What does ibstatus give you >> >> [hussaif1@lustwzb33

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-01 Thread John Hearns via Beowulf
link_layer: Ethernet E…. On Wed, 1 May 2019 at 16:18, Faraz Hussain wrote: > > Quoting John Hearns : > > > What does ibstatus give you > > [hussaif1@lustwzb33 ~]$ ibstatus > Infiniband device 'mlx4_0' port 1 status: > default gid: fe80::::4a0f:cfff:fef5:b6

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-01 Thread Faraz Hussain
Quoting John Hearns : E.. you are not running a subnet manager? Do you have an Infiniband switch or are you connecting two servers back-to-back? Unfortunately, I am not familiar with a subnet manager. These are sixteen machines in an HP enclosure. Fourteen of them are running RHEL 6

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-01 Thread John Hearns via Beowulf
I think I was on the wrong track regarding the subnet manager, sorry. What does ibstatus give you On Wed, 1 May 2019 at 15:31, John Hearns wrote: > E.. you are not running a subnet manager? > Do you have an Infiniband switch or are you connecting two servers > back-to-back? > > Also - have you

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-01 Thread John Hearns via Beowulf
E.. you are not running a subnet manager? Do you have an Infiniband switch or are you connecting two servers back-to-back? Also - have you considered using OpenHPC rather than installing CentOS on two servers? When you expand, this manual installation is going to be painful. On Wed, 1 May 2

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-01 Thread Faraz Hussain
What hardware and what Infiniband switch you have Run these commands: ibdiagnet smshow Unfortunately ibdiagnet seems to give some errors: [hussaif1@lustwzb34 ~]$ ibdiagnet -- Load Plugins from: /usr/share/ibdiagnet2.1.1/plugins/ (You can specify more paths to be looked in with

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-01 Thread John Hearns via Beowulf
Hi Faraz. Could you make another summary for us? What hardware and what Infiniband switch you have Run these commands: ibdiagnet smshow You originally had the OpenMPI which was provided by CentOS ?? You compiled the OpenMPI from source?? How are you bringing the new OpenMPI version into

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-01 Thread Benson Muite
Hi Faraz, Have you tried any other MPI distributions (e.g. MPICH, MVAPICH)? Regards, Benson On 4/30/19 11:20 PM, Gus Correa wrote: It may be using IPoIB (TCP/IP over IB), not verbs/rdma. You can force it to use openib (verbs, rdma) with (vader is for in-node shared memory): mpirun --mca b