On 26/10/17 22:42, Chris Samuel wrote:

> I'm helping another group out and we've found that running an Open-MPI
> program, even just a singleton, will kill nodes with Mellanox ConnectX 4
> and 5 cards using RoCE (the mlx5 driver). The node just locks up hard
> with no OOPS or other diagnostics and has to be power cycled.

It was indeed a driver bug, and is now fixed in Mellanox OFED 4.2 (which
came out a few days ago).

cheers,
Chris
--
Christopher Samuel
Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au
Phone: +61 (0)3 903 55545