Hi Chris,
We are running CX4 cards and have had some issues as well. Which version(s)
of Open MPI are they running?

If you follow the instructions from Mellanox and run with the yalla PML and
MXM, that works(ish) on Open MPI 1.10.3, provided you also set the
appropriate environment variables or config file.
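
For what it's worth, here is a rough sketch of the kind of environment setup
I mean. Treat it as an illustration rather than our exact config: the device
name mlx5_0, the port number and the ./my_mpi_app binary are placeholders I
have made up for the example, so check the Mellanox documentation for your
own cards.

```shell
# Sketch only: select the yalla PML (which drives MXM) via the environment.
# Any OMPI_MCA_<name>=<value> variable sets the MCA parameter <name>.
export OMPI_MCA_pml=yalla

# Assumption: mlx5_0 port 1 is the CX4 port you want; confirm with ibv_devinfo.
export MXM_RDMA_PORTS=mlx5_0:1

# Per-user config file alternative to the OMPI_MCA_pml export above:
#   echo "pml = yalla" >> "$HOME/.openmpi/mca-params.conf"

# Hypothetical launch line; ./my_mpi_app is a placeholder binary:
#   mpirun -np 2 ./my_mpi_app

# Show what a launched job would inherit.
echo "pml=${OMPI_MCA_pml} mxm_ports=${MXM_RDMA_PORTS}"
```

If "ompi_info | grep yalla" prints nothing, the Open MPI build most likely
was not compiled with MXM support and these settings will not help.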

If they are running the Open MPI 2.1 series, there are some issues with
compiling in the Mellanox drivers.

We haven't seen any hard locks like this but we have seen a whole bundle of
other issues.

Cheers,

Lance
--
Dr Lance Wilson
Characterisation Virtual Laboratory (CVL) Coordinator &
Senior HPC Consultant
Ph: 03 99055942 (+61 3 99055942)
Mobile: 0437414123 (+61 4 3741 4123)
Multi-modal Australian ScienceS Imaging and Visualisation Environment
(www.massive.org.au)
Monash University

On 26 October 2017 at 22:42, Chris Samuel <sam...@unimelb.edu.au> wrote:

> Hi folks,
>
> I'm helping another group out and we've found that running an Open-MPI
> program, even just a singleton, will kill nodes with Mellanox ConnectX 4
> and 5 cards using RoCE (the mlx5 driver). The node just locks up hard with
> no OOPS or other diagnostics and has to be power cycled.
>
> Disabling openib/verbs support with:
>
> export OMPI_MCA_btl=tcp,self,vader
>
> stops the crashes, and whilst it's hard to tell, strace seems to imply it
> hangs when trying to probe for openib/verbs devices (or shortly after).
>
> Nodes with ConnectX-3 cards (mlx4 driver) don't seem to have the issue and
> I'm reasonably convinced this has to be a driver bug, or perhaps a bad
> interaction with recent 4.11.x and 4.12.x kernels (they need those for
> CephFS).
>
> They've got a bug open with Mellanox already but I was wondering if anyone
> else had seen anything similar?
>
> cheers!
> Chris
> --
>  Christopher Samuel        Senior Systems Administrator
>  Melbourne Bioinformatics - The University of Melbourne
>  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
