On Fri, 26 Sep 2008, Donald Becker wrote:
But that rule doesn't continue when we move to higher core counts. We still want a little observability, but a number for each of a zillion cores is useless. Perhaps worse than useless, because each tool has to make its own decision about how to summarize the values before using them. A better solution is to have the reporting side summarize the values.
Why is this a better solution? Might not applications NOT wish to summarize or aggregate? And why does the cutoff occur at 2 CPUs (and not 1)? And what do you choose to compute and return? Aggregated activity (not showing how it is distributed), or average activity (even worse, showing just a nominal percentage of total aggregate activity)? And how do you differentiate (or do you) between a single-processor dual core and a dual-processor single core and a single-processor quad core and a dual-processor dual core, etc.?

I'd say that ONE solution is to provide a tool that does the averages and aggregation on the node side, and I initially did it that way, in part to minimize the size of the return and make life "easy", or so I thought. I changed my mind, because I discovered that I DID want a lot of the information detail that was being aggregated and hidden. It actually was very useful. IMO at this point, it is "better" to provide the raw numbers per core (and network device) to the client (monitor program, head node, whatever) side and let IT decide what it wants to display or how it wants to use the numbers.

Here is my reasoning. A network bottleneck on a system with multiple network interfaces shows up not necessarily as the aggregate being saturated, but as a particular interface being saturated. There may be multiple interfaces, and they may not even have the same speed characteristics -- "saturation" on one may be a small fraction of the capacity of another. A CPU bottleneck -- especially in a cluster that isn't doing "the same thing" synchronously on all the cores by hypothesis -- can be one processor core pegged at 100% while all the rest sit near idle (possibly waiting on the one that is pegged!). Even on an 8-core system -- whether it is an 8-core server in a LAN with cores allocated among several completely distinct VMs (some of them running Windows and invisible to "direct" monitoring) or an 8-core node in a cluster -- if the cores aren't running a set of homogeneous tasks, aggregates will not reveal a bottleneck or the reaching of a resource limit at a glance. mysql might be maxed out on a single core and several mysql clients might be tamely waiting on it, even though the AGGREGATE CPU is down there at < 20%, with people wondering why everything is slow.

When you make a top-level design decision like this -- to aggregate and average, or not -- you are basically picking a set of assumptions concerning what people will need. However, people's needs vary, and almost any assumption that you make will NOT fit the needs of some subset. That may be OK -- you make a tool that is right for a restricted subspace of things that DO match your assumptions, and that's fine. In the case of Scyld, you tightly define the cluster architecture, so you MAKE it so the assumptions work on both ends. But the price is that your tool becomes all but useless to people whose problems your assumptions hide -- problems that would be revealed and solvable if you didn't hide them -- and likewise to people whose environments simply don't match your assumptions.

I ran through similar considerations on the network side. One interface? All "ethernet" interfaces? Or just everything the kernel tags as "an interface"? Per interface, do I send back the raw packet counts, or just the rates? In other words, who does the dividing to turn packet count deltas into a rate?
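To put a quick number on the CPU point before going on to how those network questions got settled: the sketch below is plain Python against /proc/stat, not anything from xmlsysd, but it is exactly the kind of arithmetic a client can do for itself once it has the raw per-core counters in hand.

#!/usr/bin/env python3
# Sketch only (not xmlsysd): sample the per-core counters in /proc/stat twice
# and compute per-core and aggregate busy percentages from the raw deltas.
import time

def read_cpu_ticks():
    """Return {name: (busy_ticks, total_ticks)} for 'cpu' and each 'cpuN'."""
    ticks = {}
    with open("/proc/stat") as f:
        for line in f:
            fields = line.split()
            if not fields[0].startswith("cpu"):
                continue
            vals = [int(v) for v in fields[1:]]
            idle = vals[3] + (vals[4] if len(vals) > 4 else 0)  # idle + iowait
            ticks[fields[0]] = (sum(vals) - idle, sum(vals))
    return ticks

before = read_cpu_ticks()
time.sleep(1.0)              # the sampling window is the client's choice
after = read_cpu_ticks()

for name in sorted(after):
    busy = after[name][0] - before[name][0]
    total = after[name][1] - before[name][1]
    pct = 100.0 * busy / total if total else 0.0
    print(f"{name:6s} {pct:5.1f}% busy")

Run that on an 8-core box with one pegged single-threaded job and the aggregate "cpu" line reads roughly 13% busy while one "cpuN" line reads essentially 100%; the averaged number hides precisely the thing you need to see.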
The interface decision was easy -- I started with just one, had a few systems with two where I needed numbers on both, and finally just broke down and sent the information on all interfaces, letting the client decide which ones are "important".

The rate issue was a tougher decision, because xmlsysd doesn't operate on a predetermined schedule. To get an accurate differential rate, one really should sample on a fine-grained time granularity -- average over a time of order 1 second, say. Again, one has the possibility of totally saturating the network for short bursts but having a relatively low AVERAGE rate. I tried various schemes for doing node-side averaging and none of them were very satisfactory. Ultimately -- and this is a decision I could easily reconsider or make a client-controlled option, given a view of the TCP interface as being there primarily as a control interface and using UDP to actually back-transport the information -- I opted to just send the raw numbers again. It is consistent, and the client side can decide whether 1, 5, or 500 second averages are OK (and indeed can decide to sample twice 0.1 seconds apart to generate local deltas and then NOT sample for 10 seconds, so the numbers given are an instant snapshot of the rate, but that rate is only updated occasionally). Not necessarily the "ultimate" solution -- I still think of implementing a fixed-width window and doing a timed delta of perhaps 0.1 or 0.01 seconds across it to produce various rates -- but that requires two interruptions of the background work in place of one, and introduces a longish delay between polling the node and getting the answer, neither desirable on a TCP connection or for all possible sets of work the node might be doing.

Incidentally, avoiding node-side arithmetic minimizes the computational impact on the nodes, sometimes at the expense of a larger return packet. This in turn could be mitigated by adding more controls to permit the remote and dynamic reconfiguration of the daemon, although I haven't yet done this as broadly as I could. I keep it simple unless/until there is a real need to add complexity, lest the tool itself become as complex as /proc itself.

FWIW, the kernel does exactly this for (I'm sure) similar reasons: provide the raw, instantaneous numbers and let userspace clients do whatever selection, aggregation, and rate arithmetic they wish with them. It's not that difficult. All the procps tools do it (starting with a much uglier data interface to parse). It's not the only solution. It MAY not be the "best" solution. But it's not a bad one.
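For concreteness, the division the client ends up doing is trivial. The sketch below is again illustrative Python rather than xmlsysd or its client; reading /proc/net/dev locally stands in for two successive raw returns from the daemon, and the 0.1 second window is simply the client's own choice, as in the example above.

#!/usr/bin/env python3
# Sketch of the client-side rate arithmetic: take two raw counter snapshots a
# short interval apart and divide the deltas by the wall-clock gap. Reading
# /proc/net/dev locally stands in for two raw returns shipped by the daemon.
import time

def read_net_counters():
    """Return {iface: (rx_bytes, tx_bytes)} from /proc/net/dev."""
    counters = {}
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:       # skip the two header lines
            iface, data = line.split(":", 1)
            fields = data.split()
            counters[iface.strip()] = (int(fields[0]), int(fields[8]))
    return counters

window = 0.1                                 # client-chosen delta window (sec)
snap0, t0 = read_net_counters(), time.time()
time.sleep(window)
snap1, t1 = read_net_counters(), time.time()
dt = t1 - t0

for iface in snap1:
    rx_rate = (snap1[iface][0] - snap0[iface][0]) / dt
    tx_rate = (snap1[iface][1] - snap0[iface][1]) / dt
    print(f"{iface:10s} rx {rx_rate / 1e6:8.3f} MB/s  tx {tx_rate / 1e6:8.3f} MB/s")

Sleep ten seconds and repeat, and you have the "instantaneous rate, occasionally refreshed" behavior described above, with the node itself never doing anything but handing over counters.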
Again the problem with XML and other extensible systems is that people use the flexibility to avoid thoughtful design. Sure, it's obvious how to just add a few more records when we go from two to four cores per socket. And the reporting system won't obviously break when we have 64 cores in each of 4 sockets. But it really just shifts the re-design work to the tools and applications.
I absolutely agree with the first point and have been saying it myself, although I wouldn't say it is >>THE<< problem with XML, just >>A<< problem. There are obviously data structures for which XML isn't a good solution period, no matter how thoughtful you are, and IMO its BIGGEST problem is that one cannot easily switch between its default human-readable but extremely "fat" encapsulation and a heavily compressed binary encapsulation by toggling a single switch in the library. Toggle fat to debug; toggle thin to become efficient. Also, at the very least I'd qualify the "people" into "some people".

But that's a UNIVERSAL problem, whether or not you use a library or data encapsulation standard. There's no substitute for thoughtful design, but it is equally true that there's nothing obvious or easy about it either. It's why good programmers are (often) well paid. And truthfully I think it is clearly a BENEFIT (all things being equal) to have a reporting system that won't break (obviously or not) when going from 1 to N cores.

Note well that whether or not this breaks the tools and applications depends on whether or not they were designed to scale across a variable number of CPUs from the beginning. The fact that design decisions on a client-server application pair can be obsoleted by changes beyond one's control on the server side is hardly a general conclusion one can use to indict extensible encapsulations of the data. One (fairly obviously) needs to be aware of variability and code client-side applications to manage it where it exists, if you can TELL where it exists. The problem comes when something that wasn't variable becomes variable, or when something that was in a comfortable and familiar range suddenly isn't. I'll resist making up a car metaphor to illustrate the point, though...;-)

When you have a "sudden" change in the kernel -- some resource that from time immemorial existed on computers one at a time now comes N at a time -- things are going to break. Think of the 2.0.0 Linux kernel -- suddenly there were two processors! Tools that KNEW there was never going to be more than one didn't know what to do! In general, everything was assumed to be 1; now it is N. Everything will have to be fixed (or ignore the change and report only one, or as if it were only one). You are absolutely correct that this change requires a client-side change to manage, but that change needs to occur ONCE to make the CLIENT aware of the N whatevers and arrange for IT to be able to transparently manage 1-N of them as they appear in the return.

If you do it badly, you say "well, now there is EITHER one or at most TWO processors per system. But I'll never need more than two, so all my loops and display and processing decisions can use two as a hard upper bound." The price you pay is that the day dual duals, or single quads, or eight-ways appear, you get to rewrite everything AGAIN -- expensive, especially if you DIDN'T use an extensible data encapsulation scheme (canned or homebrew) -- or tell yourself "well, I'll just hide the eight and average them somehow back to two so my API remains fixed and my client(s) do(es)n't break". Too bad if someone wants to know per-core utilization on a system with lots of cores, not just aggregate utilization, because the aggregate number you fudge will not reveal the core saturation, which could be just what you need to see on servers or nodes that can be rate-limited by a saturated processor handling a single-threaded task.
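To make the "code the client for 1-N from the beginning" point concrete: a client written against an extensible return just iterates over however many per-core records show up, so going from 2 to 8 to 64 cores costs it nothing. The tag and attribute names below are invented for the illustration (they are not xmlsysd's actual schema); the shape of the loop is the point.

#!/usr/bin/env python3
# Illustrative only: the element/attribute names are made up, not xmlsysd's
# real schema. The client loops over whatever number of <cpu> records arrive
# instead of hard-coding "at most two processors".
import xml.etree.ElementTree as ET

reply = """<node name="b12">
  <cpu id="0" idle="1"/>
  <cpu id="1" idle="96"/>
  <cpu id="2" idle="97"/>
  <cpu id="3" idle="94"/>
</node>"""

node = ET.fromstring(reply)
cpus = node.findall("cpu")            # one core or sixty-four: same code
for cpu in cpus:
    print(f"cpu{cpu.get('id')}: {100 - int(cpu.get('idle'))}% busy")

aggregate = sum(100 - int(c.get("idle")) for c in cpus) / len(cpus)
print(f"{len(cpus)} cores, aggregate {aggregate:.0f}% busy")

Note that the aggregate printed at the end (around 28% in this made-up return) is exactly the fudged number that would hide cpu0 being pegged if it were all the daemon sent back.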
It isn't always easy to know what level of detail is going to be useful to at least some users (including yourself, actually), so when one builds a tool like this one has a tendency to support what YOU need, for YOUR purposes -- right now. Putting in "everything" is clearly expensive but gives you everything -- it lets you run a true "cluster-top" or "cluster-ps" that requires access to "all" the contents of /proc on all the nodes to function. Putting in just load averages is cheap but omits lots of detail that many people will need.

There is no unique and obviously best solution to this problem, only tradeoffs. The closest one can come to best is "one that satisfies most of the people, most of the time", where satisfaction has to take into account having all the data they might need at a "cost" in resource utilization they are willing to pay. And it is again obvious that that best solution is usually going to be a somewhat VARIABLE solution -- one that provides at least a bit of choice to the user to be able to select the best tradeoff for their personal needs out of the available (limited) choices -- and one that is extensible in dimensions (like core count or interface count) where extension is likely to occur.

   rgb

--
Robert G. Brown                        Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Web: http://www.phy.duke.edu/~rgb
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977