[Beowulf] 3.10.0-957.12.1.el7.x86_64

2019-05-02 Thread Jonathan Engwall
Hello beowulf, I might be jumping to a conclusion but this new kernel and the sudden network problems here are fishy. Ubuntu had network issues with every kernel. My gateway has flipped itself to 192.168.0.255. It insists on staying that way too. I have had other issues tonight though. It is just a
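
A note on the gateway address mentioned above: if the LAN uses the common 255.255.255.0 (/24) netmask, which the post does not state and is assumed here, then 192.168.0.255 is the broadcast address of 192.168.0.0/24 and cannot serve as a usable gateway. A minimal sketch of that check with Python's standard `ipaddress` module follows; the function name `is_broadcast_address` is hypothetical and chosen only for illustration.

```python
import ipaddress


def is_broadcast_address(address: str, network: str) -> bool:
    """Return True if `address` is the broadcast address of `network`.

    `network` is given in CIDR form, e.g. "192.168.0.0/24". The /24 mask
    used below is an assumption; the original post does not state it.
    """
    # strict=False also accepts a host address with a prefix, e.g. "192.168.0.10/24"
    net = ipaddress.ip_network(network, strict=False)
    return ipaddress.ip_address(address) == net.broadcast_address


if __name__ == "__main__":
    # The gateway value reported in the post, checked against an assumed /24
    print(is_broadcast_address("192.168.0.255", "192.168.0.0/24"))
    # A typical router address on the same assumed network, for comparison
    print(is_broadcast_address("192.168.0.1", "192.168.0.0/24"))
```

On a host with a wider mask (for example /16) the same address would be an ordinary host address, so the actual netmask should be checked before drawing a conclusion.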

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-02 Thread Chris Samuel
On 2/5/19 10:50 am, Faraz Hussain wrote: Thanks John. I believe we purchased the enclosure from HPE with only hardware support. I am not aware of any support contract with Mellanox. We are running RHEL 7.5 (I may have accidentally said it was CentOS, but that was a typo). Red Hat do have

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-02 Thread Gus Correa
Google ... https://wiki.archlinux.org/index.php/InfiniBand https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/ch-configure_infiniband_and_rdma_networks#sec-Understanding_InfiniBand_and_RDMA_technologies On Thu, May 2, 2019 at 2:32 PM Faraz Hussain

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-02 Thread Faraz Hussain
Thanks Benson, these are very useful links. I browsed the Mellanox guide and it seems the target audience is experts in networking. I wish there was some quick start guide or InfiniBand for dummies book :-) Quoting Benson Muite: Hi Faraz, Mellanox manuals can be found at: https://docs.

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-02 Thread Gus Correa
yum groupinfo infiniband On Thu, May 2, 2019 at 11:44 AM Faraz Hussain wrote: > Thanks. Before I go down the path of installing things willy-nilly, is > there some guide I should be following instead? I obviously have a > problem with my Mellanox drivers combined with "user error". > > So shoul

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-02 Thread Faraz Hussain
Thanks John. I believe we purchased the enclosure from HPE with only hardware support. I am not aware of any support contract with Mellanox. We are running RHEL 7.5 (I may have accidentally said it was CentOS, but that was a typo). I am more the application guy. We have a hardware/netwo

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-02 Thread Faraz Hussain
Quoting John Hearns via Beowulf : But no-one wants to pay for Bright Cluster Manager, for example. So the end user gets at best a freeware solution like Rocks, or at worst some Kickstarted setup which installs an OS, the CentOS supplied IB drivers and MPI, and Gridengine slapped on top of that.

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-02 Thread Benson Muite
Hi Faraz, Mellanox manuals can be found at: https://docs.mellanox.com/ Example setup instructions (not sure if correct for you as do not have exact details on your hardware): https://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_User_Manual_v4_3.pdf Maybe also helpful (stu

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-02 Thread John Hearns via Beowulf
Warming to my subject now. I really don't want to be specific about any vendor or cluster management package. As I say, I have had experience ranging from national contracts, currently at a company with tens of thousands of CPUs worldwide, down to installing half-rack HPC clusters for customers, and

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-02 Thread John Hearns via Beowulf
Chris, I have to say this. I have worked for smaller companies, and have worked for cluster integrators. For big university-sized and national labs the procurement exercise will end up with a well-defined support arrangement. I have seen, in one company I worked at, an HPC system arrive which I w

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-02 Thread Jonathan Aquilina
Hi John, I think there is a bit of an inaccuracy given you mention HP. What I have learned, working with a local HP and HPE distributor, is that for servers and everything you want to deal with HPE (HP Enterprise), whereas standard consumer hardware is bought from HP, as they have two distinc

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-02 Thread John Hearns via Beowulf
Please tell us the history of the overall system. Was it bought as hardware only from a supplier? Or was it delivered as an already set up system with operating system, applications, InfiniBand drivers etc.? I would also look at Qlustar https://www.qlustar.com/book/qlustar/summary and Bright https:

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-02 Thread Christopher Samuel
On 5/2/19 8:40 AM, Faraz Hussain wrote: So should I be paying Mellanox to help? Or is it a Red Hat issue? Or is it our hardware vendor, HP, who should be involved? I suspect that would be set out in the contract for the HP system. The clusters I've been involved in purchasing in the past have a

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-02 Thread John Hearns via Beowulf
You ask some damned good questions there. I will try to answer them from the point of view of someone who has worked as an HPC systems integrator and supported HPC systems, both for systems integrators and within companies. We will start with HP. Did you buy those systems direct from HP as servers

Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

2019-05-02 Thread Faraz Hussain
Thanks. Before I go down the path of installing things willy-nilly, is there some guide I should be following instead? I obviously have a problem with my Mellanox drivers combined with "user error". So should I be paying Mellanox to help? Or is it a Red Hat issue? Or is it our hardware vendo