On 12/10/17 01:12, Michael Di Domenico wrote:
> i'm seeing issues on a mellanox fdr10 cluster where the mpi setup and
> teardown takes longer than i expect it should on larger rank count
> jobs. i'm only trying to run ~1000 ranks and the startup time is over
> a minute. i tested this with both openmpi and intel mpi, both exhibit
> close to the same behavior.
What wire-up protocol are you using for your MPI in your batch system?

With Slurm at least you should be looking at using PMIx or PMI2 (PMIx
needs Slurm to be compiled against it as an external library, PMI2 is a
contrib plugin in the source tree).

Hope that helps.

Chris

--
Christopher Samuel
Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au
Phone: +61 (0)3 903 55545
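P.S. In case it's useful, a rough sketch of checking and selecting the
wire-up plugin under Slurm; the install path, rank count and binary name
below are only placeholders, adjust for your site:

    # see which wire-up (MPI) plugins this Slurm build knows about
    srun --mpi=list

    # launch explicitly with PMIx (or pmi2) instead of the default
    srun --mpi=pmix -n 1024 ./your_mpi_app

    # or make it the cluster-wide default in slurm.conf
    MpiDefault=pmix

    # for the pmix plugin, Slurm itself has to be configured against an
    # external PMIx install, e.g. something along these lines:
    ./configure --with-pmix=/usr/local/pmix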