> > By the way, the idea of rolling-your-own hardware on a large cluster, and
> > planning on having a small technical team, makes me shiver in horror. If
> > you go that route, you better have *lots* of experience in clusters, and
> > make very good decisions about cluster components and management methods.
> > If you don't, your users will suffer mightily, which means you will suffer
> > mightily too.
I believe that overstates the case significantly. some clusters are just
plain easy. it's entirely possible to buy a significant number of
conservative compute nodes, toss them onto a generic switch or two, and run
the whole thing for a couple of years without any real effort.

I did it, and while I have a lot of experience, I didn't apply any deep
voodoo to the cluster I'm thinking of. it started out with a good solid
login/file/boot server (4U, 6x scsi, dual-xeon 2.4, 1G ram), a single
48-port 100bT switch with a 1G uplink, and 48 dual-xeon nodes (diskful but
not disk-booting). it was a delight to install, maintain and manage. I
originally built it with APC controllable PDUs, but stripped them out in
the process of moving it, since I didn't need them. (I _do_ always require
net-IPMI on anything newly purchased.) I've added more nodes to the
cluster since then - dual-opteron nodes and a couple of GE switches.

> For clusters with more than perhaps 16 nodes, or EVEN 32 if you're
> feeling masochistic and inclined to heartache:

with all respect to rgb, I don't think size is a primary factor in the
effort of building and maintaining a cluster. it certainly does become a
concern eventually, but that's primarily a statistical effect: the
cluster's effective MTBF is roughly the node MTBF divided by the number of
nodes. it's quite possible to choose hardware that maximizes MTBF and
minimizes configuration risk. in the cluster above, I chose a chassis
(AIC) with one large centrifugal blower rather than a bunch of 40mm
axial/muffin fans.

a much larger cluster I'm working on now (768 nodes) has 14 40mm muffin
fans in each node! while I know I can rely on the vendor (HP) to replace
failures promptly and without complaint, there's an interesting
side-effect: power dissipation. 12 of the fans point at the CPUs and are
actually paired inline, and each pair is rated to dissipate up to 20W. so
a node that idles at 210W and draws 265W under full load can easily
consume 340W if the fans are ramped up. ouch!

this is probably the most significant size-dependent factor for me. if
you're doing your own 32-node cluster, it's pretty easy to manage the
cooling: the difference between dissipating 300 and 400W per node is less
than a ton of chiller capacity across the whole cluster. scraping up 10-20
additional tons of capacity for a large cluster is quite a different
proposition.

regards, mark hahn.
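
to put numbers on the chiller arithmetic above, here is a minimal
back-of-envelope sketch in Python. the per-node wattages are the figures
quoted above; the ton-of-refrigeration conversion (1 ton of cooling =
12,000 BTU/hr, about 3.517 kW) and the helper name extra_chiller_tons are
assumptions supplied for illustration, not anything from the original post.

# back-of-envelope check of the chiller-capacity numbers above
KW_PER_TON = 3.517   # 1 ton of cooling = 12,000 BTU/hr ~= 3.517 kW

def extra_chiller_tons(extra_watts_per_node, nodes):
    """additional chiller capacity (tons) for a per-node power increase."""
    return extra_watts_per_node * nodes / 1000.0 / KW_PER_TON

# 32-node DIY cluster: per-node dissipation swinging from ~300W to ~400W
print("32 nodes, +100W each: %.1f tons" % extra_chiller_tons(100, 32))

# 768-node cluster: fans ramping each node from 265W to ~340W (+75W)
print("768 nodes, +75W each: %.1f tons" % extra_chiller_tons(75, 768))

the first case comes out at roughly 0.9 tons, which existing room cooling
will usually absorb without anyone noticing; the second comes out around
16 tons, which is a genuine facilities project - that is the
size-dependent point above.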