On Friday 28 April 2006 05:04, Mark Hahn wrote:
> > Does any one know what types of problems/challenges for big clusters?
>
> cooling, power, manageability, reliability, delivering IO, space.
I'd add: sysadmin or other professional resources to manage the cluster.
Certainly, the more manageable and reliable the cluster is, the less time the
admin(s) will have to spend simply keeping the cluster in good health.  But
given manageability and reliability, the bigger issue is: how many users and
how many different codebases do you have?  Given the variety in individual
needs, you can end up spending quite a bit of time helping users get new code
working well, and/or making adjustments to the cluster software to
accommodate their needs.  At least this has been my experience.

I'm the only admin for a 1024-node cluster with 70+ authorized users (49
unique users in the past 31 days, about 30 of whom are frequent users, I'd
estimate), and probably a couple dozen user applications.  Having other
non-sysadmin local staff helping me, as well as having good hardware and
software vendor support, has been critical in multiplying the force I can
bring to bear in solving problems.

You know all those best practices you hear about when you're a sysadmin
managing a departmental network?  Well, when you have a large cluster, best
practices become critical -- you have to arrange things so that you rarely
have to touch hardware or log in to fix problems on individual nodes.  Such
attention to individual nodes takes far too much time away from more
productive pursuits, and will lead to lower cluster availability, which means
extra frustration and stress for you and your users.

A few elements of manageability that I use all the time (there's a rough
sketch of the first and last of these a bit further down):

* the ability to turn nodes on or off in a remote, scripted, customizable
  manner

* the ability to reinstall the OS on all your nodes, or specific nodes,
  trivially (e.g. as provided by Rocks or Warewulf)

* the ability to get a remote console so you can fix problems without getting
  out the crash cart -- hopefully you don't have to use this much (because it
  means paying attention to individual systems), but when you need it, it
  will speed up your work compared to the alternative

* the ability to gather and analyze node health information trivially, using
  embedded hardware management tools and system software tools

* the ability to administratively close a node that has problems, so that you
  can deal with the problem later, and meanwhile jobs won't get assigned to
  it

Think of your compute nodes not as individuals, but as indistinguishable
members of a Borg Collective.  You shouldn't care very much about individual
nodes, only about the overall health of the cluster.  Is the Collective
running smoothly?  If so, great -- make sure you don't have to sweat the
details very much.

> > we are considering having a 512 node cluster that will be using
> > Myrinet as its main interconnect, and would like to do our homework

I've had excellent experience with Myrinet, in terms of reliability,
functionality, and technical support.  It's probably the most trouble-free
part of my cluster and my best overall vendor experience.  Myrinet gets used
continuously by my users, but I rarely have to pay attention to it at all.
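To come back to that manageability list: here is roughly the shape of the
thing I mean by the first and last items -- a minimal sketch, not my actual
tooling.  The "<node>-bmc" naming convention, the BMC account and password
file, and using LSF's "badmin hclose" for the administrative close are
assumptions you'd replace with whatever your hardware and scheduler actually
provide:

#!/usr/bin/env python
# Sketch only: remote, scripted power control via IPMI, plus an
# administrative close in the scheduler.  Adapt the placeholders
# (BMC naming, credentials, commands) to your own site.

import subprocess
import sys

IPMI_USER = "admin"                  # placeholder BMC account
IPMI_PASS_FILE = "/root/.ipmi_pass"  # placeholder password file

def ipmi_power(node, action):
    # action is one of: "status", "on", "off", "cycle"
    return subprocess.call(["ipmitool", "-I", "lanplus",
                            "-H", node + "-bmc",        # assumed BMC hostname
                            "-U", IPMI_USER, "-f", IPMI_PASS_FILE,
                            "chassis", "power", action])

def close_node(node):
    # Ask the scheduler (LSF here) to stop dispatching jobs to this host,
    # so we can deal with it at our leisure.
    return subprocess.call(["badmin", "hclose", node])

if __name__ == "__main__":
    action = sys.argv[1]
    for node in sys.argv[2:]:
        if action == "close":
            close_node(node)
        else:
            ipmi_power(node, action)

With something like that, power-cycling a rack's worth of flaky nodes, or
fencing one off until Monday, is a single command from your desk instead of a
walk to the machine room.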
> how confident are you at addressing especially the physical issues above?
> cooling and power happen to be prominent in my awareness right now
> because of a 768-node cluster I'm working on.  but even ~200 node
> clusters need to have some careful thought applied to manageability
> (cleaning up dead jobs, making sure the scheduler doesn't let jobs hang
> around consuming myrinet ports, for instance.)  reliability is a fairly
> cut and dried issue, IMO - either you make the right hardware decisions
> at purchase, or not.

A few comments from my personal experience.  On my cluster, perhaps 1 in
10,000 or 100,000 job processes ends up unkilled, taking up compute node
resources.  It hasn't been a big problem for me, although it certainly does
come up.  Generally the undead processes have been a handful out of a set of
processes that have something in common -- a bad run, a user doing something
weird, or some anomalous system state (e.g. the central filesystem going
down).  I've never had a problem with consumed Myrinet ports, but I'm sure
that's going to depend on the details of your local cluster usage patterns.

Most often the problem has been a leftover job spinning on the CPU, slowing
down legitimate jobs.  If I configured my scheduler (LSF) properly, I'm
pretty sure I could avoid even that problem -- just set a threshold on CPU
idleness or load level.

I *have* made a couple of scripts to find nodes that are busier than they
should be, or quieter than they should be, based on the load that the
scheduler has placed on them versus the load they're actually carrying.
That helps identify problems, and more frequently it helps to give
confidence that there *aren't* any problems. :)  (There's a sketch of the
idea at the end of this message.)

I'm not sure I agree with Mark that reliability is cut and dried, depending
only on initial hardware decisions.  (Yes, I removed or changed a couple of
important qualifying words in there from what Mark wrote. :)  Vendor support
methods are critical -- consider them part of the initial hardware choice if
you like.  My point here is that it's hardware and vendor choice taken
together, not just hardware choice.

By the way, the idea of rolling your own hardware on a large cluster while
planning on having a small technical team makes me shiver in horror.  If you
go that route, you'd better have *lots* of experience with clusters, and make
very good decisions about cluster components and management methods.  If you
don't, your users will suffer mightily, which means you will suffer mightily
too.

David
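P.S.  For the curious, the busier/quieter-than-expected check is roughly the
following in spirit.  This is a stripped-down sketch, not my actual script:
it assumes LSF's bhosts and lsload output (RUN slots in the sixth column,
1-minute load average in the fourth), and the 1.5 tolerance is just a
placeholder you'd tune for your own site.

#!/usr/bin/env python
# Sketch: flag nodes whose actual load disagrees with the load the
# scheduler believes it has placed there.  Column positions and the
# tolerance below are assumptions to adapt for your own scheduler.

import subprocess

TOLERANCE = 1.5   # allowed gap between assigned slots and load average

def host_table(cmd, val_col):
    # Run a scheduler query and return {hostname: float(value)} for one column.
    out = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                           universal_newlines=True).communicate()[0]
    table = {}
    for line in out.splitlines()[1:]:        # skip the header line
        fields = line.split()
        try:
            table[fields[0]] = float(fields[val_col])
        except (IndexError, ValueError):
            pass                             # closed or unreachable hosts
    return table

if __name__ == "__main__":
    slots = host_table(["bhosts"], 5)        # RUN: running job slots per host
    load = host_table(["lsload"], 3)         # r1m: 1-minute load average
    for host in sorted(slots):
        if host not in load:
            continue
        diff = load[host] - slots[host]
        if diff > TOLERANCE:
            print("%s busier than expected: load %.1f vs %d slots"
                  % (host, load[host], slots[host]))
        elif diff < -TOLERANCE:
            print("%s quieter than expected: load %.1f vs %d slots"
                  % (host, load[host], slots[host]))

A node that shows up busier than expected is a good candidate for an undead
process; one that is persistently quieter than expected may have a wedged
job.  Either way it tells you where to look without logging into 1024 nodes.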