Hi Ivan,

Ivan Paganini wrote:
The myrinet connection was working right, but sometimes a user program just got stuck - one of the processes was sleeping, and all others were running. Then, the program hangs. Investigating this further,
Unless you are using blocking receives ("--mx-recv blocking" or "--mx-recv hybrid"), the default mode is polling. So a process will only sleep if it is still in the spawning phase (in MPI_Init) or if it is blocking on something outside MPI (like disk I/O).
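To see whether a given process is really sleeping rather than polling, one option on Linux is to read its state from /proc. The following is only a minimal sketch: the helper names parse_proc_state and read_proc_state are made up for illustration, and it assumes the usual "State:" line format of /proc/<pid>/status.

```python
import re


def parse_proc_state(status_text):
    """Extract (code, description) from the text of /proc/<pid>/status.

    Typical codes: R = running (what a polling process should show),
    S = sleeping, D = uninterruptible sleep (often disk I/O).
    """
    match = re.search(r"^State:\s+(\S)\s+\((.*?)\)", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'State:' line found")
    return match.group(1), match.group(2)


def read_proc_state(pid):
    """Read and parse the state of a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_proc_state(handle.read())
```

Calling read_proc_state(pid) on each rank's pid on each node should show which one is in S or D while the others are in R.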
overheat. mpirun.ch_mx -v shows that all the processes are issued ok to the nodes, but somehow one (or more) processes go to sleep or never start, and all the other processes just hang. The mx diagnose tools
All processes wait on everybody at spawn time, so if one process never starts, the rest of the MPI world will wait for it, possibly forever. The root problem is the process not starting.
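The effect of one missing process can be illustrated with a toy model. This is not MPICH-MX code, just a sketch using Python threads and a barrier to stand in for the spawn-time rendezvous; simulate_spawn is a hypothetical name, and the timeout exists only so the demonstration terminates (the real spawn, as described above, may wait forever).

```python
import threading


def simulate_spawn(total_ranks, started_ranks, timeout_s=0.5):
    """Start `started_ranks` threads that all wait for `total_ranks` peers.

    Returns a dict mapping rank -> "ready" or "timed out".
    """
    barrier = threading.Barrier(total_ranks)
    outcomes = {}
    lock = threading.Lock()

    def rank_main(rank):
        try:
            # Every rank waits until all expected ranks have arrived.
            barrier.wait(timeout=timeout_s)
            result = "ready"
        except threading.BrokenBarrierError:
            # Raised when the wait times out or another waiter gave up.
            result = "timed out"
        with lock:
            outcomes[rank] = result

    threads = [
        threading.Thread(target=rank_main, args=(rank,))
        for rank in range(started_ranks)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return outcomes
```

With simulate_spawn(4, 4) every rank gets through; with simulate_spawn(4, 3) the three ranks that did start are all stuck until the timeout, even though nothing is wrong with them individually.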
The spawning phase in MPICH-MX uses sockets and ssh (or rsh). Usually, ssh uses native Ethernet, but it could also use IPoM (Ethernet over Myrinet). Which case is it for you?
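One rough way to check is to see which subnet each node's hostname resolves to, since that is the address ssh would normally connect to. The sketch below assumes the Ethernet and Myrinet interfaces sit on distinct subnets; the subnets and the function names are placeholders for illustration, not values from your cluster.

```python
import ipaddress
import socket

# Placeholder subnets: substitute the ones your interfaces actually use.
EXAMPLE_NETWORKS = {
    "ethernet": "192.168.1.0/24",
    "myrinet": "10.0.0.0/24",
}


def classify_address(address, networks):
    """Return the label of the first network containing `address`."""
    ip = ipaddress.ip_address(address)
    for label, cidr in networks.items():
        if ip in ipaddress.ip_network(cidr):
            return label
    return "unknown"


def classify_host(hostname, networks):
    """Resolve `hostname` (IPv4) and classify the resulting address."""
    return classify_address(socket.gethostbyname(hostname), networks)
```

Running classify_host on each name in the machine file would show whether the spawn traffic goes over native Ethernet or over the Myrinet side.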
Patrick