Hi,

We have a 100-node cluster running Torque 2.4.8, and we are seeing a strange problem. A job submitted interactively for 256 cores produces the following errors in the PBS server log:

PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found on node07.clust1.in
PBS_Server;LOG_ERROR::sync_node_jobs, stray job 2004.nodesvr.clust1.in found on node05.clust1.in

The master mom also says:

pbs_mom: LOG_ERROR::node_bailout, 2004.nodesvr.clust1.in join_job failed from node07.clust1.in 17 - recovery attempted)
pbs_mom: LOG_ERROR::sister could not communicate (15059) in 2004.nodesvr.clust1.in job_start_error from node node0.clust1.in in job_start_error
Jan 7 08:49:54 node07 pbs_mom: LOG_ERROR::exec_bail, exec_bail: sent 16 ABORT requests, should be 20
node_bailout, node_bailout: received KILL/ABORT request for job 2004.nodesvr.clust1.in from node node07.clust1.in

The node07 log says:

pbs_mom;Job;2004.nodesvr.clust1.in;JOIN JOB as node 15
pbs_mom;Svr;pbs_mom;LOG_ERROR::Transport endpoint is not connected (107) in im_request, rpp_flush

The job could not get a shell for 40 minutes, and then we got a shell on the master mom node. We have not been able to find the exact cause. Any help will be appreciated.

--
Akshar B.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org