On Fri, 26 Sep 2008, Robert G. Brown wrote:

> On Fri, 26 Sep 2008, Donald Becker wrote:
>
> > But that rule doesn't continue when we move to higher core counts. We
> > still want a little observability, but a number for each of a zillion
> > cores is useless. Perhaps worse than useless, because each tool has to
> > make its own decision about how to summarize the values before using
> > them. A better solution is to have the reporting side summarize the
> > values.
>
> Why is this a better solution? Might not applications NOT wish to
> summarize or aggregate? And why does the cutoff occur at 2 cpus (and
> not 1).
There are four fixed slots for CPU utilization percentage, not two.

One rule for counting is 0, 1, 2, Many. You have to draw the cut-off
somewhere, and somewhere around 4 or 8 is where the numbers stop being
useful when a human looks at them. After 4 you find that you stop
caring about what each core is doing, and instead ask
  - how many cores are essentially idle / available to do work
  - how close to fully busy are the occupied cores
  - how close to completely idle are the idle cores
(There's a rough sketch of this summarization at the end of this
message.)

> And what do you choose to compute and return? Aggregated
> activity (not showing how it is distributed), or average activity (even
> worse, just showing a nominal percentage of total aggregate activity)?

For the numbers reported per socket or per core, you report and use the
utilization percentage. BeoStat also reports the system load average,
mostly because people expect it. But the length of the run queue isn't
a good indication of how effectively the node is getting work done.

> And how do you differentiate (or do you) between a single processor dual
> core and a dual processor single core and a single processor quad core
> and a dual processor dual core, etc?

The CPU core/socket enumeration naturally groups the cores within a
single socket. That might change next year when we have three cores
per socket. At that point we should do some redesign -- and redesign
here means predicting the future. A good approach is to group cores by
which channels to memory they use, and to start reporting memory
controller utilization and contention. My prediction is that those
memory controller stats will be the best indication of still-available
node capacity. CPU utilization percentages will move from being the
primary stat to a secondary stat -- the CPU/memory busy ratio, used for
reporting how effectively the busy cores are being used.

> A network bottleneck on a system with multiple network interfaces shows
> up not necessarily as the aggregate being saturated, but as a particular
> interface being saturated. There may be multiple interfaces, and they

We support four network reporting slots: 0, 1, 2, and "all of the
rest".

> may not even have the same speed characteristics -- "saturation" on one
> may be a small fraction of the capacity of another.

Finding the speed of a network is problematic. Even if we limit
ourselves to the Ethernet frame format, there are several types of
networks which fake a speed report, or dynamically change speed.
Non-Ethernet-like networks are even more difficult, especially when
they mix RDMA traffic with packet traffic.

[[ OK, I'll admit this as a shortcoming of BeoStat. When we designed
it, I knew we couldn't get accurate network speed numbers. Since I
wrote most of the kernel drivers, I knew all of the shortcomings,
corner cases and caveats. So we didn't even attempt to report a
number, even statically. Someone who knew less would have made a
sleazy assumption, e.g. "100Mbps-HD, 100Mbps-FD or 1Gbps-FD", that
would have been right most of the time. ]]

[[ A secondary problem is that BeoStat doesn't re-order and identify
the networks. It just reports them as they are listed in
/proc/net/dev. It would be better to identify the networks as being
used for booting, control, message communication and file I/O, and
then order them so that unused NICs aren't reported in the first three
stat slots, pushing the important networks into the final, summary
slot. ]]

> counts, or just the rates? In other words, who does the dividing to
> turn packet count deltas into a rate?

I think we have implemented a simple, general-purpose solution: two
reporting slots with good-granularity timestamps. That allows programs
to compute the rate without keeping their own state. (Also sketched at
the end of this message.)

Note that this doesn't attempt to fill the same role as RRDtool (the
Round-Robin Database tool). That system keeps a long record of
historical values, and makes decisions about how to summarize and
collapse the log.

[[ A BeoStat redesign would include a feature to make it easier to
keep historical stats. If we included a ring buffer logging which node
slots had updated values, we could have a daemon that knew which stats
had been updated, instead of one that has to scan the whole table
slightly more often than the one-second update period. That idea is
sketched at the end of this message as well. ]]

> Incidentally, avoiding client-side arithmetic minimizes computational
> impact on the nodes, sometimes at the expense of a larger return packet.

The arithmetic is trivial. We are talking about some additions,
perhaps averaging two numbers. There isn't anything time-consuming.
The biggest cost is probably the floating point register context
switch -- with lazy FP register set switching, the first time you touch
any FP register you pay a big cost. If you do even one FP operation,
even an implicit conversion that doesn't look like real work, you might
as well do a bunch of FP work.
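[[ Here are the sketches promised above. First, the per-core
summarization. The struct, the function name and the 10% "essentially
idle" threshold are all invented for illustration -- this is the shape
of the idea, not actual BeoStat code: ]]

#include <stdio.h>

/* Reduce N per-core utilization percentages (0-100) to the few numbers
 * a human actually asks about once the core count grows past "a few". */
struct cpu_summary {
    int    idle_cores;  /* cores essentially idle / available to do work */
    int    busy_cores;  /* cores doing real work */
    double busy_avg;    /* how close to fully busy the busy cores are */
    double idle_avg;    /* how close to completely idle the idle cores are */
};

static struct cpu_summary summarize_cpus(const double *util, int ncores)
{
    struct cpu_summary s = { 0, 0, 0.0, 0.0 };
    int i;
    for (i = 0; i < ncores; i++) {
        if (util[i] < 10.0) {       /* "essentially idle" cut-off */
            s.idle_cores++;
            s.idle_avg += util[i];
        } else {
            s.busy_cores++;
            s.busy_avg += util[i];
        }
    }
    if (s.busy_cores) s.busy_avg /= s.busy_cores;
    if (s.idle_cores) s.idle_avg /= s.idle_cores;
    return s;
}

int main(void)
{
    double util[8] = { 98, 95, 97, 2, 1, 0, 88, 3 };
    struct cpu_summary s = summarize_cpus(util, 8);
    printf("%d idle (avg %.1f%%), %d busy (avg %.1f%%)\n",
           s.idle_cores, s.idle_avg, s.busy_cores, s.busy_avg);
    return 0;
}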
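[[ Second, the stateless rate computation. Again the names and layout
are hypothetical, not the real BeoStat structures; the point is that
publishing two timestamped raw counters lets any consumer do the
division itself: ]]

#include <stdio.h>

/* Two published snapshots of a raw, monotonically increasing counter,
 * each tagged with a fine-grained timestamp in microseconds. */
struct net_sample {
    unsigned long long rx_packets;
    unsigned long long timestamp_us;
};

/* Delta count over delta time: no consumer-side state required. */
static double packets_per_second(const struct net_sample *prev,
                                 const struct net_sample *cur)
{
    unsigned long long dt_us = cur->timestamp_us - prev->timestamp_us;
    if (dt_us == 0)
        return 0.0;   /* no elapsed time, no meaningful rate */
    return (double)(cur->rx_packets - prev->rx_packets) * 1e6
           / (double)dt_us;
}

int main(void)
{
    struct net_sample prev = { 1000000ULL, 5000000ULL };
    struct net_sample cur  = { 1250000ULL, 6000000ULL };  /* 1 sec later */
    printf("%.1f packets/sec\n", packets_per_second(&prev, &cur));
    return 0;
}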
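[[ Third, the ring buffer from the redesign aside. The collector
appends the index of each node slot it updates; the logging daemon
drains indices at its leisure instead of rescanning the whole stats
table every second. All hypothetical, single-writer/single-reader,
no locking shown: ]]

#include <stdio.h>

#define RING_SIZE 1024U  /* power of two so unsigned arithmetic wraps cleanly */

struct update_ring {
    int      slot[RING_SIZE];  /* node slot indices, oldest overwritten first */
    unsigned head;             /* total writes; the reader keeps its own tail */
};

static void ring_push(struct update_ring *r, int node_slot)
{
    r->slot[r->head % RING_SIZE] = node_slot;
    r->head++;
}

/* Drain everything written since the daemon's last visit. If it fell
 * more than RING_SIZE behind, events were lost and a full rescan of the
 * stats table is the fallback. */
static void ring_drain(const struct update_ring *r, unsigned *tail)
{
    if (r->head - *tail > RING_SIZE) {
        printf("fell behind; full table rescan needed\n");
        *tail = r->head;
        return;
    }
    while (*tail != r->head) {
        printf("node slot %d has fresh stats\n", r->slot[*tail % RING_SIZE]);
        (*tail)++;
    }
}

int main(void)
{
    static struct update_ring r;  /* zero-initialized */
    unsigned tail = 0;            /* daemon's private read position */
    ring_push(&r, 3);
    ring_push(&r, 17);
    ring_drain(&r, &tail);        /* reports slots 3 and 17 */
    return 0;
}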
-- 
Donald Becker                           [EMAIL PROTECTED]
Penguin Computing / Scyld Software
www.penguincomputing.com                www.scyld.com
Annapolis MD and San Francisco CA
