On 8 January 2011 05:01, akshar bhosale <akshar.bhos...@gmail.com> wrote:
> hi,
> we have 100 nodes cluster. we have strange problem on cluster with torque
> 2.4.8
> a job submitted for 256 cores interactively gives following error in pbs
> server :
>
> PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found
> on node07.clust1.in
> PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found
> on node05.clust1.in

Take both nodes, node05 and node07, offline in your scheduler.
Then resubmit your job.
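
As a rough sketch of that first step: if you take the nodes out at the Torque server level rather than in the scheduler itself, pbsnodes -o marks a node offline so no new jobs land on it. The offline_nodes wrapper and the DRY_RUN variable below are names made up for illustration, and this assumes pbsnodes is on your PATH and you have operator or manager rights on the server.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): mark each named node OFFLINE.
# With DRY_RUN=1 it only prints the commands it would run.
offline_nodes() {
    for node in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "pbsnodes -o $node"
        else
            pbsnodes -o "$node"
        fi
    done
}

# Usage, left commented out so that sourcing this file changes nothing:
# offline_nodes node05.clust1.in node07.clust1.in
# pbsnodes -l                      # list nodes currently marked down/offline
# pbsnodes -c node05.clust1.in     # clear the offline flag once the node is fixed
```

Running it once with DRY_RUN=1 first shows exactly which nodes will be touched.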

When you have time, log into those nodes and look at the system logs
and the pbs_mom log around the time the failed job started.
Are the nodes mounting the users' home directories? Are they
authenticating properly, i.e. are they contacting their NIS or LDAP
server?
Run ps -eaf --forest on the nodes: do you see any processes belonging
to job 2004?
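
The checks above could be scripted roughly as follows, to be run on node05 and node07. This is only a sketch: check_user and job_lines are hypothetical helper names, and the mom log path in the usage comment assumes a Torque home of /var/spool/torque with log files named by date, which may differ on your install.

```shell
#!/bin/sh
# Hypothetical helper: check that a user resolves through NSS (files, NIS
# or LDAP) and that the home directory is reachable on this node.
check_user() {
    entry=$(getent passwd "$1") || { echo "lookup failed: $1"; return 1; }
    home=$(printf '%s\n' "$entry" | cut -d: -f6)
    [ -d "$home" ] || { echo "home missing: $home"; return 1; }
    echo "ok: $1 -> $home"
}

# Hypothetical helper: print the lines on stdin that mention a job id.
# Uses a fixed-string match so the dots in the id are not treated as regex.
job_lines() {
    grep -F -- "$1"
}

# Usage, left commented out (the path is an assumption, adjust as needed):
# check_user someuser
# job_lines 2004.nodesvr < /var/spool/torque/mom_logs/$(date +%Y%m%d)
```

Use the ps -eaf --forest listing alongside this to spot leftover processes owned by the job's user.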
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
