Re: [Beowulf] mpirun issue

Reuti Wed, 22 Oct 2008 11:26:07 -0700

On 21.10.2008 at 22:21, Luis Alejandro Del Castillo Riley wrote:

And ps -e f shows that the processes are running fine until they crash with a broken pipe error and a kill signal.

From this I would assume that one process crashed and you are facing only the follow-up error. Maybe it ran out of memory or disk space. It might depend on the application and how it distributes the data; maybe with ten nodes some array or so was getting too big over the runtime of the job.

If you can spot the node which crashes, maybe you can find something in /var/log/messages of that node.
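
As a rough sketch of what to look for in the log, the filter below picks out lines that hint at an out-of-memory kill or a full disk. The function name find_crash_hints and the search patterns are assumptions on my side, based on typical Linux kernel messages; the exact wording in your log may differ.

```shell
# Sketch only: the patterns are assumed typical kernel/syslog wording,
# not taken from the failing cluster.
# Reads log lines on stdin and prints the ones that look like an
# out-of-memory kill or a full filesystem.
find_crash_hints() {
    grep -i -E 'out of memory|oom-killer|killed process|no space left on device'
}
```

On the suspected node it could be run as: find_crash_hints < /var/log/messages (reading that file usually needs root).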


-- Reuti

On Tue, Oct 21, 2008 at 2:50 PM, Luis Alejandro Del Castillo Riley <[EMAIL PROTECTED]> wrote:
hi
yes, I have 10 nodes, each one with an Intel Xeon quad core, so basically there are 4 processors per node
On Tue, Oct 21, 2008 at 7:53 AM, Reuti <[EMAIL PROTECTED]> wrote:
Hi,

On 21.10.2008 at 01:18, Luis Alejandro Del Castillo Riley wrote:
Hi fellows, I have a cluster with 1 master and 10 nodes with Intel Xeon quad core.
Fedora core 6
PGI 7.0-7
mpich 1.2.5.2
The last version of MPICH, from 2005, is 1.2.7p1. For newer installations I would suggest looking into Open MPI.
machines.x86_64 with the 10 node names

Means only the 10 nodes?


when i try to run:
 mpirun -v -arch x86_64  -keep_pg -nolocal -np 9 mm5.mpp

I get no error, but when I run with
 mpirun -v -arch x86_64  -keep_pg -nolocal -np 10 mm5.mpp

it takes around 40 min and then gives me an error:
bm_list_4667: (1526.781250) wakeup_slave: unable to interrupt slave0 pid 4666
With so much time, I would suggest logging in to all nodes and checking with:
$ ps -e f
(f w/o -) the distribution and startup of the processes. Is it doing nothing for 40 minutes or running fine until it crashes?
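
To avoid logging in to each node by hand, something like the following could work. It is only a sketch: it assumes the machines file lists one host per line (optionally with a ":N" CPU count and "#" comments) and that passwordless ssh to the nodes is set up. The function names list_hosts and check_nodes are made up for illustration.

```shell
# Sketch only: assumes an MPICH-style machines file with one host per line,
# an optional ":N" CPU count suffix, and "#" comments.
# Prints the bare host names, one per line.
list_hosts() {
    sed -e 's/#.*//' -e 's/:.*//' -e 's/[[:space:]]//g' "$1" | grep -v '^$'
}

# Runs "ps -e f" on every host from the machines file.
# Assumes passwordless ssh to each node; replace ssh with rsh if that is
# what the cluster uses.
check_nodes() {
    for host in $(list_hosts "$1"); do
        echo "== $host =="
        ssh "$host" ps -e f
    done
}
```

It would be called as check_nodes machines.x86_64, and the output shows per node whether the mm5.mpp processes were started and are still alive.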
-- Reuti


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
