Comment inserted below
On 10/21/10 3:43 AM, "Eugen Leitl" <eu...@leitl.org> wrote:

> In contrast, back at Harvard, there are discussions going on about building
> up new resources for scientific computing, and talk of converting precious
> office and lab space on campus (where space is extremely scarce) into machine
> rooms. I find this idea fairly misdirected, given that we should be able to
> either leverage a third-party cloud infrastructure for most of this, or at
> least host the machines somewhere off-campus (where it would be cheaper to
> get space anyway). There is rarely a need for the users of the machines to be
> anywhere physically close to them anymore.

There *is* a political reason and a funding-stream reason. When you use a remote resource, someone is measuring the use of that resource, and typically you have a budget allocated for it. Perhaps at Google computing resources are free, but that's not the case at most places. So someone who has been given X amount of resources to do task Y can't, on the spur of the moment, use some fraction of that to do task Z (and, in fact, if you're consuming government funds, using resources allocated for Y to do Z is illegal). However, if you've used the dollars to buy a local computer, the "accounting for use" typically stops at that point, and nobody much cares what you use that computer for, as long as Y gets done. In the long term, yes, there will be an evaluation of whether you bought too much or too little for the X amount of resources, but in the short run you've got some potentially "free" excess resources.

This is a bigger deal than you might think. Let's take a real-life example. You have a small project, funded at, say, $150k for a year (enough to support a person working maybe 1/3 time, plus resources) for a couple of years. You decide to use an institutionally provided desktop computer, store all your terabytes of data on an institutional server, and pay the nominal $500/month (which covers backups, etc., and all the admin stuff you shouldn't really be fooling with anyway). You toil happily for the year (spending around $6k of your budget on computing resources), and then the funding runs out a little earlier than you had hoped (oops, the institution decided to retroactively change the chargeback rates, so now that monthly charge is $550). And someone comes to you and says: hey, you're out of money, we're deleting the data you have stored in the cloud, and by the way, give back that computer on your desk. You're going to need to restart your work next year, when next year's money arrives (depending on the funding agency's grant cycle, there is a random delay in this: maybe they're waiting for Congress to pass a continuing resolution, or the California Legislature to pass the budget, or whatever), but in the meantime you're out of luck.

And yes, a wise project manager (even for this $300k task) would have set aside some reserves, etc. But that doesn't always happen. At least if you OWN the computing resources, you have the option of mothballing, deferring maintenance, and so on, to ride through a funding-stream hiccup.
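To put rough numbers on that example, here's a minimal sketch in Python. The $150k, $500/month, and $550/month figures come from the example above; the labor split is an assumption picked only to make the arithmetic concrete.

# Back-of-the-envelope runway for the example above.
# The labor figure is an assumption; the rest are the numbers from the example.

annual_budget = 150_000       # $/year for the task
labor_and_overhead = 144_000  # assumed: ~1/3 FTE plus overheads
compute_rate = 500            # $/month institutional storage + desktop charge

compute_budget = annual_budget - labor_and_overhead   # ~$6k left for computing
print(f"Planned runway: {compute_budget / compute_rate:.1f} months")   # 12.0

# The institution retroactively bumps the chargeback rate.
new_rate = 550
print(f"Actual runway:  {compute_budget / new_rate:.1f} months")       # ~10.9
# About a month short -- and the data and the desktop go away the moment
# the money does.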
> Unless you really don't believe in
> remote management tools, the idea that we're going to displace students or
> faculty lab space to host machines that don't need to be on campus makes no
> sense to me.
>
> The tools are surprisingly good.
>
> Log first, ask questions later. It should come as no surprise that debugging
> a large parallel job running on thousands of remote processors is not easy.
> So, printf() is your friend.

This works in a "resources are free" environment. But what if you are paying for every byte of storage for all those log messages? What if you're paying for compute cycles to scan those logs? (A rough sketch of that arithmetic is at the end of this message.)

Remote computing on a large scale works *great* if the only cost is a "connectivity endpoint". Look at the change in phone costs over the past few decades. Back in the 70s, phone-call and data-line cost was (roughly) proportional to distance, because you were essentially paying for a share of the physical copper (or equivalent) along the way. As soon as there was substantial fiber available, though, there was a huge oversupply of capacity, so the pricing model changed to "pay for the endpoint" (or "point of presence"/POP), leading to "5c/min long distance anywhere in the world". I was at a talk by a guy from AT&T in 1993, and he mentioned that the new fiber link across the Atlantic cost about $3 per phone line (in terms of equivalent capacity, e.g. 64 kbps), and that was the total lifecycle cost. The financial model was: if you paid $3, you'd have 64 kbps across the Atlantic in perpetuity, or close to it. Once you'd paid your $3, nobody cared whether the circuit was busy or idle, so the incremental cost to use it was essentially zero. Compare this to the incredibly expensive physical copper wires with limited bandwidth, where they could charge dollars per minute, which is pretty close to the actual cost of providing the service.

If you go back through the archives of this list, this kind of "step function in costs" has been discussed a lot. You've already got someone sysadmin'ing a cluster with 32 nodes, and they're not fully busy, so adding another cluster only increases your costs by the hardware purchase (since the electrical and HVAC costs are covered by overheads). But the approach of "low incremental cost to consume excess capacity" only lasts so long: when you get to sufficiently large scales, there is *no* excess capacity, because you're able to spend your money in sufficiently small granules (compared to the overall size). Or, returning to my original point, the person giving you your money is able to account for your usage in sufficiently small granules that you have no "hidden excess" to "play with". Rare is the cost-sensitive organization that voluntarily allocates resources to unconstrained "fooling around". Basically, it's the province of patronage.

> Log everything your program does, and if
> something seems to go wrong, scour the logs to figure it out. Disk is cheap,
> so better to just log everything and sort it out later if something seems to
> be broken. There's little hope of doing real interactive debugging in this
> kind of environment, and most developers don't get shell access to the
> machines they are running on anyway. For the same reason I am now a huge
> believer in unit tests -- before launching that job all over the planet, it's
> really nice to see all of the test lights go green.
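Coming back to the "paying for every byte" point above, here's a minimal back-of-the-envelope sketch in Python. Every rate and volume in it is an assumption chosen for illustration, not anyone's actual pricing.

# Rough cost of "log everything" once storage and scanning are metered.
# Every rate and volume here is assumed purely for illustration.

nodes = 1_000            # remote processors in the job
log_rate_kb_s = 10       # KB of printf()-style logging per node per second
run_hours = 24           # length of one run

log_gb = nodes * log_rate_kb_s * run_hours * 3600 / 1e6    # KB -> GB
print(f"Log volume per run: {log_gb:,.0f} GB")             # 864 GB

storage_per_gb_month = 0.10   # $/GB-month of storage (assumed)
scan_per_gb = 0.005           # $/GB scanned when you grep the logs (assumed)
retention_months = 3
scans = 5                     # times you scour the logs chasing a bug

print(f"Storage, {retention_months} months: ${log_gb * storage_per_gb_month * retention_months:,.2f}")
print(f"Scanning, {scans} passes:  ${log_gb * scan_per_gb * scans:,.2f}")
# "Disk is cheap" -- until someone bills you per byte stored, per month,
# and per byte scanned.

Scale the node count or the retention up by an order of magnitude and "disk is cheap" stops being a free assumption.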