Hi Ivan,

Ivan Paganini wrote:
The myrinet connection was working right, but sometimes a user program
just got stuck - one of the processes was sleeping, and all others
were running. Then, the program hangs. Investigating this further,

The default receive mode is polling. Unless you are using blocking receives ("--mx-recv blocking" or "--mx-recv hybrid"), a process will only sleep if it is still in the spawning phase (in MPI_Init) or if it is blocked on something outside MPI (like disk I/O).

overheat. mpirun.ch_mx -v shows that all the processes are issued ok
to the nodes, but somehow one (or more) process go to sleep or never
starts, and all the other processes just hangs. The mx diagnose tools

All processes wait for each other at spawn time, so if one process never starts, the rest of the MPI world will wait for it, possibly forever. The hang is only a symptom; the root problem is the process that does not start.
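
To illustrate the effect (this is only a toy model in Python using threads, not MPICH-MX code; the function name and the timeout are made up for the example), a rendezvous that expects N participants stalls every participant that did show up when one is missing:

```python
import threading

def spawn_rendezvous(expected, started, timeout=1.0):
    """Toy model of a spawn-time rendezvous.

    `expected` participants must check in, but only `started` of them
    actually run. Returns the sorted ranks that gave up waiting.
    """
    barrier = threading.Barrier(expected)
    stuck = []
    lock = threading.Lock()

    def worker(rank):
        try:
            # Every participant waits here for all the others.
            barrier.wait(timeout=timeout)
        except threading.BrokenBarrierError:
            # Raised when the wait times out (or another waiter timed out).
            with lock:
                stuck.append(rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(started)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(stuck)

if __name__ == "__main__":
    print(spawn_rendezvous(4, 4))  # everybody arrived
    print(spawn_rendezvous(4, 3))  # one participant never started
```

The toy version has a timeout so that it terminates; without one, the waiting participants would simply stay there, which matches what you describe.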

The spawning phase in MPICH-MX uses sockets and ssh (or rsh). Usually, ssh runs over native Ethernet, but it could also run over IPoM (Ethernet over Myrinet). Which is it in your case?
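
One rough way to check is to look at which subnet a node's hostname resolves to. The sketch below assumes you know the subnets of your two networks; the addresses and labels here are placeholders, not taken from your setup:

```python
import ipaddress
import socket

# Hypothetical subnets: replace with the ones used on your cluster.
NETWORKS = {
    "ethernet": ipaddress.ip_network("192.168.1.0/24"),
    "myrinet": ipaddress.ip_network("10.0.0.0/24"),
}

def classify(addr, networks=NETWORKS):
    """Return the label of the first subnet containing `addr`, else "unknown"."""
    ip = ipaddress.ip_address(addr)
    for name, net in networks.items():
        if ip in net:
            return name
    return "unknown"

def classify_host(hostname, networks=NETWORKS):
    """Resolve `hostname` the way a socket connection would, then classify it."""
    return classify(socket.gethostbyname(hostname), networks)
```

Calling classify_host() with the node names from your machine file, on the node where you run mpirun, would tell you which network the spawning traffic is likely to use.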

Patrick
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
