Re: [Beowulf] Problems with a JS21 - Ah, the networking...

2007-10-01 Thread Alan Louis Scheinine
I found that the blocking send of MPI blocks for the version of MPICH compiled for Myrinet (at least w.r.t. the old GM) but does not block for the MPICH from Argonne compiled for GCC and PGI. Or was it the other way around? I don't recall. Assuming it was the first case, it might be relevant. So
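The MPI standard allows a "blocking" send to return as soon as the message has been buffered, or to wait until the matching receive is posted, so this behaviour can differ between implementations and message sizes. The sketch below is only an analogy using a local socket pair, not MPI: with no receiver reading, a small payload is accepted in full while a large one stalls once the buffer fills. The function name `accepted_without_receiver` and the payload sizes are made up for illustration.

```python
import socket


def accepted_without_receiver(payload_size):
    """Return how many bytes a sender can hand off while nobody is receiving.

    Analogy only: a socket buffer stands in for an MPI library's internal
    buffering. This is not how any particular MPICH build is implemented.
    """
    sender, receiver = socket.socketpair()
    try:
        sender.setblocking(False)  # report "would block" instead of hanging
        data = memoryview(b"x" * payload_size)
        sent = 0
        try:
            while sent < payload_size:
                sent += sender.send(data[sent:])
        except BlockingIOError:
            # The buffer is full: a blocking send would now wait for the receiver.
            pass
        return sent
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    large = 32 * 1024 * 1024
    print("small payload accepted in full:", accepted_without_receiver(16) == 16)
    print("large payload accepted in full:", accepted_without_receiver(large) == large)
```

A program that only works when the send returns early (for example, two ranks that both send before receiving) can run fine on one MPI build and hang on another.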

Re: [Beowulf] Problems with a JS21 - Ah, the networking...

2007-10-01 Thread Ivan Paganini
Hello Mark, Patrick,
>> The spawning phase in MPICH-MX uses socket and ssh (or rsh). Usually,
>> ssh uses native Ethernet, but it could also use IPoM (Ethernet over
>> Myrinet). Which case is it for you?
As I said before, I'm also experiencing some Ethernet problems (in the service network) like TCP w

Re: [Beowulf] Problems with a JS21 - Ah, the networking...

2007-10-01 Thread Patrick Geoffray
Ivan,
Ivan Paganini wrote:
> I did a strace on the hung process, and the output is this:
Use "strace -f" to trace the children as well. Could you send the output of mpirun.ch_mx -v also, to see if the process starts and sends some info to the mpirun Perl script and hangs later, or never really st

Re: [Beowulf] Problems with a JS21 - Ah, the networking...

2007-10-01 Thread Mark Hahn
> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40046f68) = 31384
> waitpid(-1,
this looks like a fork/exec that's failing, as you might expect if, for instance, your shared FS doesn't supply a binary successfully. note also that ltrace -S often provides
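To make the clone/waitpid pattern in that trace concrete, here is a minimal sketch of a parent that forks, tries to exec a binary in the child, and then waits for it. The helper name `spawn_and_wait`, the exit code 127 for a failed exec (a shell convention), and the path `/no/such/binary` are assumptions for illustration; this is not what the mpirun script actually does.

```python
import os
import sys


def spawn_and_wait(argv):
    """Fork, exec argv in the child, and return the child's exit code.

    Returns 127 if the exec itself fails (e.g. the binary is missing or
    unreadable), and a negative signal number if the child was killed.
    """
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the requested program.
        try:
            os.execv(argv[0], argv)
        except OSError:
            # exec failed; leave immediately without running parent cleanup.
            os._exit(127)
    # Parent: this is the waitpid() call that shows up in the strace output.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)


if __name__ == "__main__":
    print("working binary:", spawn_and_wait([sys.executable, "-c", "pass"]))
    print("missing binary:", spawn_and_wait(["/no/such/binary"]))
```

If the binary lives on a shared file system that stalls rather than returning an error, the child never gets as far as failing and the parent just sits in waitpid, which would look like the hang in the trace.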

Re: [Beowulf] Problems with a JS21 - Ah, the networking...

2007-10-01 Thread Patrick Geoffray
Hi Ivan,
Ivan Paganini wrote:
> The Myrinet connection was working right, but sometimes a user program just got stuck - one of the processes was sleeping, and all others were running. Then, the program hangs. Investigating this further,
Unless you are using blocking receives ("--mx-recv blocking"

Re: [Beowulf] Problems with a JS21 - Ah, the networking...

2007-10-01 Thread Patrick Geoffray
Mark Hahn wrote:
> here's an idea: configure ip-over-myrinet, and use it exclusively to start the jobs. if that works, then you know for sure that the problem is solely on the eth side (switch, perhaps, or maybe a nic that's jabbering or otherwise misbehaving?)
Ivan may have to stage the binar

Re: [Beowulf] Problems with a JS21 - Ah, the networking...

2007-10-01 Thread Ivan Paganini
Just an update: trying several times, the strace stops at different points: the one specified in the other email, and here:
munmap(0x40176000, 4096) = 0
time([1191243868]) = 1191243868
open("/etc/hosts", O_RDONLY)

Re: [Beowulf] Problems with a JS21 - Ah, the networking...

2007-10-01 Thread Ivan Paganini
Hello Chris, everybody: I am not using jumbo frames, and I'm now considering this option, but first I wanted to be sure that there is no other problem, just to limit the number of variables at hand. But thanks for your help. I did a strace on the hung process, and the output is t