On Wed, 14 Mar 2007, Peter St. John wrote:
I just want to mention (not being a sysadmin professionally, at all) that
you could get exactly this result if something were assigning IP addresses
sequentially, e.g.
node1 = foo.bar.1
node2 = foo.bar.2
...
and something else had already assigned 13 to a public thing, say, a
webserver that is not open on the port that MPI uses.
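A quick way to check for that kind of collision is to resolve each node
name and see whether it lands on the private net. A minimal Python sketch
(node names and subnet are made up -- adjust to your setup):

    # Resolve node1..node16 and flag any name that does not land on the
    # assumed private subnet.
    import socket

    PRIVATE_PREFIX = "192.168.1."   # assumption: your private subnet

    for i in range(1, 17):
        host = "node%d" % i
        addr = socket.gethostbyname(host)
        if not addr.startswith(PRIVATE_PREFIX):
            print("%s -> %s (not on the private net)" % (host, addr))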
I don't know anything about addressing a CPU within a multiprocessor
machine, but if each CPU has its own IP address then it could choke this way.
On the same note, I'm always fond of looking for loose wires, bad
switches, or dying hardware when a network connection is bizarrely
inconsistent. Does this only happen with MPI? Or can you get oddities
using a network testing program, e.g. netpipe (which will let you test
raw sockets, MPI, and PVM in situ)?
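Even a dumb TCP connect probe will tell you whether each node's private
address answers at all before you run a full netpipe sweep. A rough sketch
(node names and port are assumptions, not anything from your setup):

    # Try a TCP connect to each node and report timeouts -- the same
    # symptom as the "connect to address ..." errors quoted below.
    import socket

    NODES = ["node%d" % i for i in range(1, 17)]   # assumed node names
    PORT = 22                                      # pick a port you know is open

    for host in NODES:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(3.0)
        try:
            s.connect((host, PORT))
            print("%s: ok" % host)
        except (socket.timeout, socket.error) as e:
            print("%s: FAILED (%s)" % (host, e))
        s.close()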
rgb
Peter
On 3/14/07, Joshua Baker-LePain <[EMAIL PROTECTED]> wrote:
I have a user trying to run a coupled structural-thermal analysis using
mpp-dyna (mpp971_d_7600.2.398). The underlying OS is CentOS 4 on x86_64
hardware. We use our cluster largely as a COW, so all the cluster nodes
have both public and private network interfaces. All MPI traffic is
passed on the private network.
Running a simulation via 'mpirun -np 12' works just fine. Running the
same sim (on the same virtual machine, even, i.e. in the same 'lamboot'
session) with -np > 12 leads to the following output:
Performing Decomposition -- Phase 3        03/12/2007 11:47:53
*** Error the number of solid elements 13881
defined on the thermal generation control
card is greater than the total number
of solids in the model 12984
*** Error the number of solid elements 13929
defined on the thermal generation control
card is greater than the total number
of solids in the model 12985
connect to address $ADDRESS: Connection timed out
connect to address $ADDRESS: Connection timed out
where $ADDRESS is the IP address of the *public* interface of the node on
which the job was launched. Has anybody seen anything like this? Any
ideas on why it would fail over a specific number of CPUs?
Note that the failure depends on CPU count, not node count. I've tried
on clusters made of both dual-CPU machines and quad-CPU machines, and in
both cases it took 13 CPUs to trigger the failure.
Note also that I *do* have a user writing his own MPI code, and he has no
issues running on >12 CPUs.
Thanks.
--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University
--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:[EMAIL PROTECTED]
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf