On 26/10/17 22:42, Chris Samuel wrote:

> I'm helping another group out and we've found that running an Open-MPI
> program, even just a singleton, will kill nodes with Mellanox ConnectX 4
> and 5 cards using RoCE (the mlx5 driver). The node just locks up hard
> with no OOPS or other diagnostics and has to be power cycled.

It was indeed a driver bug, and is now fixed in Mellanox OFED 4.2 (which
came out a few days ago).

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
