On 8 January 2011 05:01, akshar bhosale <akshar.bhos...@gmail.com> wrote:
> hi,
> we have 100 nodes cluster. we have strange problem on cluster with torque
> 2.4.8
> a job submitted for 256 cores interactively gives following error in pbs
> server :
>
> PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found
> on node07.clust1.in
> PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found
> on node05.clust1.in

Disable both nodes, node05 and node07, in your scheduler, then submit your job again.

When you have time, log into those nodes and look at the system logs and the mom log around the time the failed job started. Are the nodes mounting the users' home directories? Are they authenticating properly, i.e. are they contacting their NIS or LDAP server?

Run ps -eaf --forest on the nodes. Do you see any processes belonging to job 2004?
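
The ps check above can be scripted if you need to repeat it on several nodes. What follows is only a rough sketch: the helper names lines_mentioning_job and ps_forest are made up for illustration, and it assumes that a leftover job process carries the job id somewhere in its command line (for example in the path of the job script), which may not hold on every installation. Child processes started by the job script will usually not match, so still read the tree under any hit by eye.

```python
import subprocess


def lines_mentioning_job(ps_output, job_id):
    """Return the lines of ps output whose text contains job_id.

    Assumption: a stray job process shows the job id in its command
    line. Children of that process are not matched by this filter.
    """
    return [line for line in ps_output.splitlines() if job_id in line]


def ps_forest():
    """Run `ps -eaf --forest` (procps on Linux) and return its output."""
    result = subprocess.run(
        ["ps", "-eaf", "--forest"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On node05 or node07 you would call lines_mentioning_job(ps_forest(), "2004.nodesvr.clust1.in") and print whatever comes back. An empty list does not prove the node is clean; it only means no command line mentioned the job id.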