On Tue, 8 Apr 2008, Jesse Becker wrote:
> Gerry Creager wrote:
> > Yeah, we're using Ganglia.  It's a good start, but not complete...
>
> The next version of Ganglia (3.1.x) is being written to be much easier to
> customize, both on the backend metric collection by allowing custom
> modules for gmond, and on the frontend with some changes to make custom
> reports easier to write.  I've written a small pair of routines to monitor
> SGE jobs, for example, and it could easily be extended to watch multiple
> queues.
It might be useful to consider what we did in the Scyld cluster system.

We found that a significant number of customers (and potential customers) were using Ganglia, or were planning on using it. But those that were using it intensively complained about its resource usage. In some cases it was consuming 20% of CPU time.

We have a design philosophy of running nothing on the compute nodes except the application. A pure philosophy doesn't always fit with a working system, so from the beginning we built in a system called BeoStat (Beowulf State, Status and Statistics). To keep the "pure" appearance of our system we initially hid this in BeoBoot, so that it started immediately at boot time, underneath the rest of the system.

How are these two related? To implement Ganglia we just chopped out the underlying layers (which spend a huge amount of time generating and then parsing XML), and generate the final XML directly from the BeoStat statistics already combined on the master. This gave us the best of both worlds: from BeoStat, no additional load on the compute nodes, lower network load, much higher efficiency, and easy scalability to thousands of nodes; from Ganglia, the ability to log and summarize historical data, good-looking displays, and the ability to monitor multiple clusters.

It might be useful to look at the design of BeoStat. It's superficially similar to other systems out there, but we made decisions that are much different from everyone else's -- ones that most people consider wrong until they understand their value. Some of them are:

  - It's not extensible.
  - It reports values in a binary structure.
  - It's UDP unicast to a single master machine.
  - It has no liveness criteria.
  - The receive side stores only current values.

The first one is the most uncommon. BeoStat is not extensible. You can't add your own stat entries. You can't have it report stats from 64 cores. It reports what it reports... that's it.

Why is this important? We want to deploy cluster systems, not build one-off clusters. We want the stats to be the same on every system we deploy. We want every tool that uses the stats to know that they will be available. Once you allow and encourage a customizable system, every deployment will be different. Tools won't work out of the box, and there is a good chance that tools will require mutually incompatible extensions.

Deploying a fixed-content stat system also enforces discipline. We carefully considered what we need to report, and how to report it. In contrast, look at Ganglia's stats. Why did they choose the set they did? Pretty clearly because the underlying kernel reported those values. What do they mean? The XML DTD doesn't tell you; you have to look at the source code. What do you use them for? They don't know -- they'll figure it out later.

The next question people ask is "but what if I have 8/16/64 cores? You only have 2 [[ now 4 ]] CPU stat slots." The answer is similar to the above -- what are you going to do with all of that data? You're going to summarize it before using it. We just summarize it on the reporting side. We report that there are N CPUs, the overall load average, and then summarize the CPU cores as groups (e.g. per socket). For network adapters we report e.g. eth0, eth1, eth2 and "all the rest added together".

Once we chose a fixed set of stats, we had the ability to make it a fixed-size report. It could be reported as binary values, with any per-kernel-version variation handled on the sending side. Having a small, limited-size report meant that it fit in a single network packet. That makes the network load predictable and very scalable. It gave us the opportunity to effectively use UDP to report, without fragmenting into multiple frames. UDP means that we can switch to and from multicast without changes, even changing in real time.

A fixed-size frame makes the receiving side simple as well. We just receive the incoming network frame into memory. No parsing, no translation, no interpretation. (We actually do a tiny bit more, such as putting on a timestamp, but overall the receiving process does only trivial work.) This matters because the receiver is the master, which could end up with the heaviest workload if the system isn't carefully designed. We've supported 1000+ machines for years, and are now designing around 10K nodes.
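To make the fixed-size idea concrete, here is a minimal sketch in C. This is not BeoStat's actual layout -- the field names, counts and sizes are all hypothetical -- it only illustrates the point: a fixed set of binary values, already summarized on the sender, always fits in one un-fragmented UDP datagram.

    /* Hypothetical sketch of a fixed-content, fixed-size stat report.
     * Not BeoStat's real layout; field names and sizes are invented. */
    #include <stdint.h>

    #define CPU_GROUPS 4        /* cores summarized per group (e.g. socket) */
    #define NIC_SLOTS  4        /* eth0..eth2 plus "all the rest" */

    struct stat_report {
        uint32_t version;       /* layout version; changes only on redesign */
        uint32_t node;          /* sender's node number */
        uint32_t ncpus;         /* total cores present */
        uint32_t loadavg_x100;  /* 1-minute load average, scaled by 100 */
        uint32_t mem_total_kb;
        uint32_t mem_free_kb;
        uint32_t cpu_busy_pct[CPU_GROUPS];  /* summarized on the sender */
        uint64_t net_rx_bytes[NIC_SLOTS];
        uint64_t net_tx_bytes[NIC_SLOTS];
    };

    /* UDP over IPv4 on a 1500-byte Ethernet MTU leaves 1472 bytes of
     * payload (1500 - 20 IP - 8 UDP), so a report can never fragment. */
    _Static_assert(sizeof(struct stat_report) <= 1472,
                   "report must fit in one frame");

The sender just fills in such a structure and sendto()s it each reporting interval. Note that unicast to the master versus multicast is nothing more than a different destination address, which is what makes switching between the two a non-event.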
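And the receive side, again as a hedged sketch built on the hypothetical struct above rather than the real code: the master copies each frame straight into a per-node slot, stamps it with the arrival time, and keeps only the last two reports -- exactly enough to compute rates.

    /* Hypothetical sketch of the master's receive path.  No parsing,
     * no translation: copy the frame into a per-node slot, timestamp
     * it, and retain only the last two reports. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <time.h>

    #define MAX_NODES 1024

    struct slot {
        struct stat_report cur, prev;   /* the last two reports */
        struct timespec cur_t, prev_t;  /* their local arrival times */
    };

    static struct slot board[MAX_NODES];    /* the "scoreboard" */

    void receive_loop(int sock)
    {
        struct stat_report r;
        for (;;) {
            ssize_t n = recv(sock, &r, sizeof r, 0);
            if (n != sizeof r || r.node >= MAX_NODES)
                continue;                /* wrong size or bogus node: drop */
            struct slot *s = &board[r.node];
            s->prev   = s->cur;          /* slide the two-entry history */
            s->prev_t = s->cur_t;
            s->cur    = r;
            clock_gettime(CLOCK_REALTIME, &s->cur_t);
        }
    }

    /* Rates fall out of the two retained reports, e.g. receive
     * bandwidth on the first NIC slot in bytes per second: */
    double rx_rate(const struct slot *s)
    {
        double dt = (s->cur_t.tv_sec  - s->prev_t.tv_sec)
                  + (s->cur_t.tv_nsec - s->prev_t.tv_nsec) / 1e9;
        return dt > 0 ? (double)(s->cur.net_rx_bytes[0]
                                 - s->prev.net_rx_bytes[0]) / dt
                      : 0.0;
    }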
The one extra thing we do when storing a stat packet -- that timestamp -- pays for itself. We can use it to figure out the time skew between the master and the compute node, to verify the network reliability/load, and to decide if the node is live.

This isn't the only liveness test. It's not even the primary liveness test; we document it as only a guideline. Developers should use the underlying cluster management system to decide if a node has died. But if there hasn't been a recent report, a scheduler should avoid using the node. Classifying the world into Live and Dead is wrong. It's at least Live, Dead, and Schrodinger's Still-Boxed Cat.

Finally, this is a State, Status and Statistics system. It's a scoreboard, not a history book. We keep only two values, the last two received. That gives us the current info, and the ability to calculate rates. If any subsystem needs older values (very few do), it can pick a logging, summarization and coalescing approach of its own.

We made many other innovative architectural decisions when designing the system, such as publishing the stats as a read-only shared memory region. But these are less interesting because no one disagrees with them ;-).

-- 
Donald Becker                          [EMAIL PROTECTED]
Penguin Computing / Scyld Software
www.penguincomputing.com               www.scyld.com
Annapolis MD and San Francisco CA

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf