Greetings,
I am relatively new to cluster environments, and I was given a small cluster (7 nodes + 1 head) to administer. So far I have only had to maintain what was already installed, so there were few problems to solve (or to think about). But new, different machines have arrived (AMD Opteron vs. Intel Xeon), and now I have to expand the cluster (and think, and solve problems).

The (old) cluster is semi-diskless (all machines do have disks, but they boot from a single image on a central server), with NFS for filesystem sharing. The main problems I have had are:

* If the /var filesystem is shared, race conditions happen (all nodes want to write to the same files). I had this problem and moved to a local /var filesystem.
* If /var is local (which it can be, because the disks do exist), the whole point of a central point for easy administration vanishes, because I would have to create, on each node, all the /var structure that packages need in order to work. (It would be easier to run "for node in $nodes; do ssh $node $install_cmd; done" than to guess which directories I need to create or which files to copy.)
* If /var is tmpfs, all forensics are certainly gone after a failure (Murphy told me this one ;).

Everything I have read on the subject underlines the advantages of diskless approaches but fails to warn about this problem and/or to solve it. On the other hand, the tools for the distributed approach (where every node is autonomous) seem to be stalled (like SystemImager, which is used in the OSCAR project), discontinued, or truly overblown for my scale (IBM's xCAT); so it really seems that I'm missing something.

The question is: what do you do about this?

Gil Brandao
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf