Where is this driver from? OS, or OFED, or? We use primarily MVAPICH2 but I would be curious to try to duplicate this on our mlx5 equipment.
What model cards do you have?

--
 ____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \\ of NJ      | Office of Advanced Research Computing - MSB C630, Newark
     `'

On Oct 26, 2017, at 07:43, Chris Samuel <sam...@unimelb.edu.au> wrote:

Hi folks,

I'm helping another group out and we've found that running an Open-MPI program, even just a singleton, will kill nodes with Mellanox ConnectX 4 and 5 cards using RoCE (the mlx5 driver). The node just locks up hard with no OOPS or other diagnostics and has to be power cycled.

Disabling openib/verbs support with:

export OMPI_MCA_btl=tcp,self,vader

stops the crashes, and whilst it's hard to tell strace seems to imply it hangs when trying to probe for openib/verbs devices (or shortly after).

Nodes with ConnectX-3 cards (mlx4 driver) don't seem to have the issue and I'm reasonably convinced this has to be a driver bug, or perhaps a bad interaction with recent 4.11.x and 4.12.x kernels (they need those for CephFS).

They've got a bug open with Mellanox already but I was wondering if anyone else had seen anything similar?

cheers!
Chris
--
Christopher Samuel
Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au
Phone: +61 (0)3 903 55545

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf