I found that the blocking send of MPI blocks for the
version of MPICH compiled for Myrinet (at least w.r.t.
the old GM) but does not block for the MPICH from
Argonne compiled with GCC and PGI. Or was it the other
way around? I don't recall. Assuming it was the first
case, it might be relevant. So
Hello Mark, Patrick,
>>The spawning phase in MPICH-MX uses socket and ssh (or rsh). Usually,
>>ssh uses native Ethernet, but it could also use IPoM (Ethernet over
>>Myrinet). Which case is it for you ?
As I said before, I'm also experiencing some ether problems (in the
service network) like TCP w
Ivan,
Ivan Paganini wrote:
I did a strace on the hung process, and the output is this:
"strace -f" to trace the children as well.
Could you also send the output of mpirun.ch_mx -v, to see if the process
starts and sends some info to the mpirun Perl script and hangs later, or
never really st
clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x40046f68) = 31384
waitpid(-1,
this looks like a fork/exec that's failing, as you might expect
if, for instance, your shared FS doesn't supply a binary successfully.
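
To make the pattern in that trace concrete, here is a minimal sketch of a parent that forks, execs a binary in the child, and then waits for it, roughly the clone()/waitpid() sequence strace shows. The function name spawn_and_wait and the exit code 127 for a failed exec are assumptions chosen for illustration (127 follows the usual shell convention); this is not what mpirun.ch_mx actually does internally.

```python
import os
import sys


def spawn_and_wait(argv):
    """Fork, exec argv in the child, and return the child's exit status.

    Hypothetical helper for illustration only.
    """
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the target binary.
        try:
            os.execv(argv[0], argv)
        except OSError:
            # exec failed, e.g. the binary is missing or unreadable
            # on a shared file system. Exit without running cleanup.
            os._exit(127)
    # Parent: block until the child exits, like the waitpid() in the trace.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    # Child was killed by a signal; report it as a negative number.
    return -os.WTERMSIG(status)


if __name__ == "__main__":
    # A path that does not exist stands in for a binary the FS failed to supply.
    print(spawn_and_wait(["/nonexistent/path/to/binary"]))
    print(spawn_and_wait([sys.executable, "-c", "pass"]))
```

If the exec fails, the parent gets a nonzero status back quickly. A parent that stays stuck in waitpid() instead suggests the child started but never finished, which is a different problem from a missing binary.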
note also that ltrace -S often provides
Hi Ivan,
Ivan Paganini wrote:
The myrinet connection was working right, but sometimes a user program
just got stuck - one of the processes was sleeping, and all others
were running. Then, the program hangs. Investigating this further,
Unless you are using blocking receives ("--mx-recv blocking"
Mark Hahn wrote:
here's an idea: configure ip-over-myrinet, and use it exclusively
to start the jobs. if that works, then you know for sure that the
problem is solely on the eth side (switch, perhaps, or maybe a nic
that's jabbering or otherwise misbehaving?)
Ivan may have to stage the binar
Just an update: after several tries, the strace stops at different
points: the one specified in the other email, and here:
___
munmap(0x40176000, 4096) = 0
time([1191243868]) = 1191243868
open("/etc/hosts", O_RDONLY)
Hello Chris, everybody:
I am not using jumbo frames, and I'm now considering this option, but
first I wanted to make sure that there is no other problem,
just to limit the number of variables at hand. But thanks for your
help.
I did a strace on the hung process, and the output is t