Hi,

We have a 100-node cluster running Torque 2.4.8 and are seeing a strange
problem. An interactive job submitted for 256 cores produces the following
errors in the pbs_server log:

PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found on node07.clust1.in
PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found on node05.clust1.in
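
To see at a glance which nodes the server flags, the stray-job messages can be
grouped by node. This is only a rough sketch: the function name
stray_jobs_by_node and the regular expression are assumptions based on the two
log lines above, not anything that ships with Torque, and other Torque versions
may word the message differently.

```python
import re
from collections import defaultdict

# Assumed pattern, derived only from the pbs_server messages quoted above.
# \s+ before "on" lets a message match even when a mail client has wrapped
# it across two lines.
STRAY_RE = re.compile(r"sync_node_jobs, stray job (\S+) found\s+on (\S+)")


def stray_jobs_by_node(log_text):
    """Return {node: sorted list of stray job ids} found in log_text."""
    found = defaultdict(set)
    for job_id, node in STRAY_RE.findall(log_text):
        found[node].add(job_id)
    return {node: sorted(jobs) for node, jobs in found.items()}


# The two messages from the server log, as they appear in this mail.
sample = """\
PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found
on node07.clust1.in
PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found
on node05.clust1.in
"""

if __name__ == "__main__":
    for node, jobs in sorted(stray_jobs_by_node(sample).items()):
        print(node, ", ".join(jobs))
```

Running it over the full server log instead of the sample should show whether
the stray-job reports are limited to a few nodes or spread across the whole
allocation.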

The master MOM also reports:
pbs_mom: LOG_ERROR::node_bailout, 2004.nodesvr.clust1.in join_job failed from node07.clust1.in 17 - recovery attempted)
pbs_mom: LOG_ERROR::sister could not communicate (15059) in 2004.nodesvr.clust1.in job_start_error from node node0.clust1.in   in job_start_error
Jan  7 08:49:54  node07 pbs_mom: LOG_ERROR::exec_bail, exec_bail: sent 16 ABORT requests, should be 20
node_bailout, node_bailout: received KILL/ABORT request for job 2004.nodesvr.clust1.in from node node07.clust1.in

The log on node07 says:
pbs_mom;Job;2004.nodesvr.clust1.in;JOIN JOB as node 15
pbs_mom;Svr;pbs_mom;LOG_ERROR::Transport endpoint is not connected (107) in im_request, rpp_flush

The job could not get a shell for 40 minutes; after that we finally got a
shell on the master MOM node.

We have not been able to find the root cause. Any help would be appreciated.

--
Akshar B.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
