On Mon, Mar 30, 2009 at 02:14:50PM +0100, Jörg Saßmannshausen wrote:
> Dear all,
>
> I am having this rather annoying problem with the parallel execution of
> one of the programs (GAMESS US version) on our cluster. The error
> message is:
>
> TCP connect error: ECONNREFUSED.
> TCP: Connect failed. comp10 -> comp02.chem.strath.ac.uk:42208.
> A fatal error occurred on DDI Process 0.
> TCP connect error: ECONNREFUSED.
> TCP: Connect failed. comp10 -> comp02.chem.strath.ac.uk:42208.
> A fatal error occurred on DDI Process 60.
> TCP connect error: ECONNREFUSED.
> TCP: Connect failed. comp10 -> comp02.chem.strath.ac.uk:42208.
> A fatal error occurred on DDI Process 2.
> TCP connect error: ECONNREFUSED.
>
> [ ... ]
>
> Eventually, the ddikick tips over and the whole thing crashes. The
> program is using rsh (yes, I know, security, I did not install the
> cluster!) and I can rsh comp10 -> comp02 and there is no firewall
> installed between the nodes (at least, not that I am aware of). Trying
> to run the same job with the same number of nodes will fail X times and
> at X+1 suddenly work. I could not work out a pattern for that (other
> than that I get exponentially annoyed). Right now, there is only one
> gigabit network connecting the cluster, so nfs, mpi etc. are all running
> over one interface (again, I did not set up the cluster).

How rapidly are these rsh connection attempts occurring? The rsh protocol
requires connections to originate from privileged ports, i.e. ports below
1024. If a host makes more connections to another host than there are
privileged ports available within the TCP TIME-WAIT interval, it will run
out of ports and further connections will fail. I've seen this occur with
parallel applications using rsh.

David S.

>
> I have run out of ideas of where to look. I checked (as quickly as
> possible) on some nodes with netstat, and the ddikick program is actually
> running. Changing to ssh did not solve the problem.
>
> I would appreciate any feedback, as it is highly annoying to wait Y days
> to get the job running and then have it crash.
>
> All the best from Glasgow!
>
> Jörg
>
> --
> *************************************************************
> Jörg Saßmannshausen
> Research Fellow
> University of Strathclyde
> Department of Pure and Applied Chemistry
> 295 Cathedral St.
> Glasgow
> G1 1XL
>
> email: jorg.sassmannshau...@strath.ac.uk
> web: http://sassy.formativ.net
>
> Please avoid sending me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.html
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
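
One way to test the privileged-port theory from the reply: while the job is starting up, count the sockets on the launching node that are in TIME_WAIT with a local port below 1024, grouped by remote host. The following is only a minimal Python sketch of that check. It assumes the column layout printed by `netstat -tn` from Linux net-tools. The function name `count_privileged_time_wait`, the `SAMPLE` text and its 10.0.0.x addresses are made up for illustration and are not taken from the cluster discussed here.

```python
from collections import Counter

# Ports below this value are privileged; rsh clients must bind to one of them.
PRIVILEGED_LIMIT = 1024


def count_privileged_time_wait(netstat_output):
    """Count TIME_WAIT sockets that use a privileged local port, per remote host.

    Expects text in the layout of `netstat -tn` on Linux:
    proto, recv-q, send-q, local address, foreign address, state.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip banner/header lines and anything that is not a TCP socket.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "TIME_WAIT":
            continue
        # Split "address:port" on the last colon (also works for tcp6 lines).
        _, _, local_port = fields[3].rpartition(":")
        remote_host, _, _ = fields[4].rpartition(":")
        if local_port.isdigit() and int(local_port) < PRIVILEGED_LIMIT:
            counts[remote_host] += 1
    return counts


# Hypothetical sample output; the addresses are invented for this example.
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.10:1021          10.0.0.2:514            TIME_WAIT
tcp        0      0 10.0.0.10:1020          10.0.0.2:514            TIME_WAIT
tcp        0      0 10.0.0.10:45678         10.0.0.2:42208          TIME_WAIT
tcp        0      0 10.0.0.10:1019          10.0.0.3:514            ESTABLISHED
"""

if __name__ == "__main__":
    for host, n in count_privileged_time_wait(SAMPLE).most_common():
        print(host, n)
```

On a real node you would pass in the actual output of `netstat -tn` instead of `SAMPLE`. If the count for a single peer climbs towards the number of privileged ports the rsh client can use, that would support the port-exhaustion explanation; if it stays small, the cause is probably elsewhere.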