On Wed, 14 Mar 2007, Peter St. John wrote:
I just want to mention (not being a sysadmin professionally, at all) that
you could get exactly this result if something were assigning IP addresses
sequentially, e.g.
node1 = foo.bar.1
node2 = foo.bar.2
...
and something else had already assigned 13 to a public thing, say, a
webserver that is not open on the port that MPI uses.
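A quick way to check for that kind of collision is to resolve each node
name and see whether it lands on the private net. A minimal Python sketch
(node names and subnet are made up -- adjust to your setup):

    # Resolve node1..node16 and flag any name that does not land on the
    # assumed private subnet.
    import socket

    PRIVATE_PREFIX = "192.168.1."   # assumption: your private subnet

    for i in range(1, 17):
        host = "node%d" % i
        addr = socket.gethostbyname(host)
        if not addr.startswith(PRIVATE_PREFIX):
            print("%s -> %s (not on the private net)" % (host, addr))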
I don't know anything about addressing a CPU within a multiprocessor
machine, but if each CPU has its own IP address then it could choke this way.
On the same note, I'm always fond of looking for loose wires, bad
switches, or dying hardware when a network connection is bizarrely
inconsistent. Does this only happen with MPI? Or can you get oddities
using a network testing program, e.g. netpipe (which will let you test
raw sockets, MPI, and PVM in situ)?
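Even a dumb TCP connect probe will tell you whether each node's private
address answers at all before you run a full netpipe sweep. A rough sketch
(node names and port are assumptions, not anything from your setup):

    # Try a TCP connect to each node and report timeouts -- the same
    # symptom as the "connect to address ..." errors quoted below.
    import socket

    NODES = ["node%d" % i for i in range(1, 17)]   # assumed node names
    PORT = 22                                      # pick a port you know is open

    for host in NODES:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(3.0)
        try:
            s.connect((host, PORT))
            print("%s: ok" % host)
        except (socket.timeout, socket.error) as e:
            print("%s: FAILED (%s)" % (host, e))
        s.close()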
rgb
Peter
On 3/14/07, Joshua Baker-LePain <[EMAIL PROTECTED]> wrote:
I have a user trying to run a coupled structural-thermal analysis using
mpp-dyna (mpp971_d_7600.2.398). The underlying OS is CentOS 4 on x86_64
hardware. We use our cluster largely as a COW, so all the cluster nodes
have both public and private network interfaces. All MPI traffic is
passed on the private network.
Running a simulation via 'mpirun -np 12' works just fine. Running the
same sim (on the same virtual machine, even, i.e. in the same 'lamboot'
session) with -np > 12 leads to the following output:
Performing Decomposition -- Phase 3        03/12/2007 11:47:53
*** Error the number of solid elements 13881
defined on the thermal generation control
card is greater than the total number
of solids in the model 12984
*** Error the number of solid elements 13929
defined on the thermal generation control
card is greater than the total number
of solids in the model 12985
connect to address $ADDRESS: Connection timed out
connect to address $ADDRESS: Connection timed out
where $ADDRESS is the IP address of the *public* interface of the node on
which the job was launched. Has anybody seen anything like this? Any
ideas on why it would fail over a specific number of CPUs?
Note that the failure depends on CPU count, not node count. I've tried
on clusters made of both dual-CPU machines and quad-CPU machines, and in
both cases it took 13 CPUs to trigger the failure.
Note also that I *do* have a user writing his own MPI code, and he has no
issues running on >12 CPUs.
Thanks.
--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University
--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:[EMAIL PROTECTED]
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf