Dear all,

I am having a rather annoying problem with the parallel execution of one of our programs (GAMESS, US version) on our cluster. The error message is:

 TCP connect error: ECONNREFUSED.
 TCP: Connect failed. comp10 -> comp02.chem.strath.ac.uk:42208.
 A fatal error occurred on DDI Process 0.
 TCP connect error: ECONNREFUSED.
 TCP: Connect failed. comp10 -> comp02.chem.strath.ac.uk:42208.
 A fatal error occurred on DDI Process 60.
 TCP connect error: ECONNREFUSED.
 TCP: Connect failed. comp10 -> comp02.chem.strath.ac.uk:42208.
 A fatal error occurred on DDI Process 2.
 TCP connect error: ECONNREFUSED.

[ ... ]

Eventually, ddikick tips over and the whole thing crashes. The program is using rsh (yes, I know, security; I did not install the cluster!). I can rsh from comp10 to comp02, and there is no firewall between the nodes (at least, none that I am aware of). Running the same job with the same number of nodes will fail X times and then suddenly work on attempt X+1. I could not work out a pattern for that (other than that I get exponentially more annoyed). Right now, there is only one gigabit network connecting the cluster, so NFS, MPI, etc. are all running over one interface (again, I did not set up the cluster).

I have run out of ideas of where to look. I checked some of the nodes with netstat (as quickly as I could), and the ddikick program is actually running there. Changing to ssh did not solve the problem.
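
One way to look for a pattern would be to probe the target port repeatedly from the failing node and record how each attempt ends. The following is only a minimal sketch: the function name probe and its parameters are made up for illustration and are not part of GAMESS or DDI. It tells a refused connection (nothing listening on the port yet, or a full listen queue) apart from a timeout (more likely a network or filtering problem).

```python
import errno
import socket
import time


def probe(host, port, attempts=5, delay=0.5, timeout=2.0):
    """Try to open a TCP connection several times and record each outcome.

    Hypothetical helper for illustration; not part of GAMESS or DDI.
    """
    results = []
    for _ in range(attempts):
        try:
            # create_connection resolves the host and performs the TCP handshake
            with socket.create_connection((host, port), timeout=timeout):
                results.append("ok")
        except ConnectionRefusedError:
            # ECONNREFUSED: the host answered, but nothing accepted the connection
            results.append("refused")
        except socket.timeout:
            # No answer at all within the timeout
            results.append("timeout")
        except OSError as exc:
            # Anything else (unreachable host, name resolution failure, ...)
            results.append(errno.errorcode.get(exc.errno, "error"))
        time.sleep(delay)
    return results
```

Run on comp10 as, for example, probe("comp02.chem.strath.ac.uk", 42208). The host and port here are taken from the error message above, and the port presumably changes from run to run, so it would have to be read from the current job's output while the job is still starting up.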

I would appreciate any feedback, as it is highly annoying to wait Y days for the job to start, only for it to crash.

All the best from Glasgow!

Jörg


--
*************************************************************
Jörg Saßmannshausen
Research Fellow
University of Strathclyde
Department of Pure and Applied Chemistry
295 Cathedral St.
Glasgow
G1 1XL

email: jorg.sassmannshau...@strath.ac.uk
web: http://sassy.formativ.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html


