I have a user trying to run a coupled structural-thermal analysis using mpp-dyna (mpp971_d_7600.2.398). The underlying OS is CentOS 4 on x86_64 hardware. We use our cluster largely as a COW (cluster of workstations), so all the cluster nodes have both public and private network interfaces. All MPI traffic is passed on the private network.
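For reference, we keep MPI traffic off the public side by pointing the LAM boot schema at the private-interface names of the nodes. A rough sketch of the setup (the hostnames, CPU counts, and input deck name below are placeholders, not our real ones):

    # bhost.priv -- placeholder private-interface hostnames
    node01-priv cpu=2
    node02-priv cpu=2
    node03-priv cpu=2

    $ lamboot -v bhost.priv
    $ mpirun -np 12 mpp971_d_7600.2.398 i=input.k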

Running a simulation via 'mpirun -np 12' works just fine. Running the same sim (on the same virtual machine, even, i.e. in the same 'lamboot' session) with -np > 12 leads to the following output:

Performing Decomposition -- Phase 3 03/12/2007 11:47:53


*** Error the number of solid elements 13881
defined on the thermal generation control
card is greater than the total number
of solids in the model 12984

*** Error the number of solid elements 13929
defined on the thermal generation control
card is greater than the total number
of solids in the model 12985
connect to address $ADDRESS: Connection timed out
connect to address $ADDRESS: Connection timed out

where $ADDRESS is the IP address of the *public* interface of the node on which the job was launched. Has anybody seen anything like this? Any ideas on why it would fail above a specific number of CPUs?
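My suspicion is a name-resolution issue, i.e. that above 12 processes something resolves the launch node's hostname to its public address instead of the private one that lamboot was pointed at. The obvious check on the launch node is just (standard tools, nothing LAM-specific):

    $ hostname
    $ getent hosts $(hostname)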

Note that the failure is CPU-count dependent, not node-count dependent.
I've tried on clusters made of both dual-CPU and quad-CPU machines,
and in both cases it took 13 CPUs to trigger the failure.
Note also that I *do* have a user writing his own MPI code, and he has no issues running on >12 CPUs.
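For illustration only (a generic sketch, not his actual code), the bare-bones kind of LAM/MPI test I have in mind as a comparison point is:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* each rank just reports in, so a 13+ rank run is easy to verify */
        printf("rank %d of %d alive\n", rank, size);

        MPI_Finalize();
        return 0;
    }

Since his real code (built with mpicc, run via mpirun in a lamboot session) scales past 12 processes without complaint, my hunch is that the problem lives in mpp-dyna, or in how it picks a network interface, rather than in LAM itself.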

Thanks.

--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University