FYI  (Just to have it posted, in case anybody else ever runs into 
this.)

A little while back I moved some names around in the cluster.  To do
so, a bunch of queues and some hosts were removed from SGE and then
added back.  There was much trial and error in doing so; I make no
claim that the right commands were issued in the proper order.
However, in the end all the queues were as desired, and they all stayed
up and running, until the node was rebooted, at which point SGE came
back up with only two queues present.  After much poking around the
problem was finally located:  some of the old host names and old queues
were still present in files under:

   $SGE_ROOT/default/spool/qmaster/qinstances

and as soon as SGE hit one of those during startup, it stopped creating
any further queues.  The error message that resulted when that happened
was of this form:

   09/21/2011 12:22:56|qmaster|safserver|E|cannot recreate queue all.q from disk because of unknown host mendel

and appeared in:

   $SGE_ROOT/default/spool/qmaster/messages
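
For anyone wanting to check for this after a restart, scanning the
qmaster messages file for that error is one way to get the list of
stale queue/host pairs.  What follows is only a rough Python sketch,
not anything that ships with SGE: the function name is made up for
illustration, and the pattern is modeled solely on the one error line
quoted above, so other SGE versions may word it differently.

```python
import re

# Pattern modeled on the single error line quoted in this post (assumption:
# other SGE versions may phrase the message differently).
STALE_QUEUE_ERROR = re.compile(
    r"cannot recreate queue (\S+)\s+from disk because of unknown host (\S+)"
)


def find_stale_queue_errors(lines):
    """Return the sorted, unique (queue, host) pairs named in matching lines.

    `lines` is any iterable of strings, e.g. an open messages file.
    The function name is hypothetical, chosen for this example only.
    """
    pairs = set()
    for line in lines:
        match = STALE_QUEUE_ERROR.search(line)
        if match:
            pairs.add((match.group(1), match.group(2)))
    return sorted(pairs)
```

To use it, open the messages file named above and pass the file object
in, for example `find_stale_queue_errors(open(path))` with `path` set
to that file's location on the qmaster host.  Each pair returned names
a queue and an old host to look for under the qinstances directory.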

Regards,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf