On Fri, Mar 1, 2013 at 11:31 PM, Mark Hahn <h...@mcmaster.ca> wrote:
>>
>> http://www.hpcwire.com/hpcwire/2013-02-28/utility_supercomputing_heats_up.html
Hi Mark,

Great points you raise here! Lovely to have this discussion - let me have a crack at some of them in line. This is fun!

> well, it's HPC wire - I always assume their name is acknowledgement that
> their content is much like "HPC PR wire", often or mostly vendor-sponsored.
> call me ivory-tower, but this sort of thing:
>
>     Cycle has seen at least two examples of real-world MPI applications
>     that ran as much as 40 percent better on the Amazon EC2 cloud than
>     on an internal kit that used QDR InfiniBand.
>
> really PISSES ME OFF. it's insulting to the reader. let's first assume
> it's not a lie - next we should ask "how can that be"?

Exactly - we should always be asking questions. What was that old phrase? There are lies, damned lies, and marketing? (or something like that, I forget the exact wording) ;-)

> EC2 has a smallish amount of virt overhead and weak interconnect, so why
> would it be faster? AFAICT, the only possible explanation is that the
> "internal kit" was just plain botched. or else they're comparing
> apples/oranges (say, different vintage cpu/ram, or the app was sensitive
> to the particular cache size, associativity, SSE level, etc.) in other
> words, these examples do not inform the topic of the article, which is
> about the viability of cloud/utility HPC. the article then concludes
> "well, you should try it (us) because it doesn't cost much". instead I
> say: yes, gather data and when it indicates your "kit" is botched, you
> should fix your kit.

In essence, all of the above. Cycle has been a bootstrapped company for the last 7 years, which basically means they *have* to respond and react directly to customers. There's no time for anything in the business that does not directly improve customer workloads.

You are right that there are thousands of things that can affect performance: bad kit, slow kit, old kit, shabby kit, bad admins, poor networks, bad code, bad schedulers, bad hair days, etc. The run we talked about had some of all of this. The cool thing about IaaS providers is that they can (not always) get hold of more recent kit, and if you are smart about how you use it you can see huge benefits - anything from simple changes to CPU spec at one end, to much larger wins if you fix all of the bad/old/shabby/poor stuff.

> I have to add: I've almost never seen a non-fluff quote from IDC. the ones
> in this article are doozies.

I'll take the fifth on that one :-)

>> that are only great in straight lines ;-) Another thing to think of is
>> total cost per unit of science. Given we can now exploit much larger
>
> people say a lot of weaselly things in the guise of TCO. I do not really
> understand why cloud/utility is not viewed with a lot more suspicion.
> AFAICT, people's thinking gets incredibly sloppy in this area, and they
> start accepting articles of faith like "Economies of Scale". yes, there is
> no question that some things get cheaper at large scale. even if we model
> that as a monotonic increase in efficiency, it's highly nonlinear.
>
> 1. capital cost of hardware.
> 2. operating costs: power, cooling, rent, connectivity, licenses.
> 3. staff operating costs.

I hear you. Numbers have to take into account all of these factors; you are absolutely spot on.

> big operations probably get some economy of large-scale HW purchases. but
> it's foolish to think this is giant: why would your HW vendor not want to
> maintain decent margins?

Yup - everyone needs to make a buck at a certain point, and price/performance is a serious issue.
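On the apples/oranges point above: before anyone argues EC2 vs InfiniBand, it is worth two minutes to check what silicon is actually on each side of the comparison. A minimal sketch - just plain /proc/cpuinfo parsing on a Linux node, nothing Cycle- or Amazon-specific, and the flag list is only illustrative:

#!/usr/bin/env python
# Quick sanity check of what silicon is actually being compared: print the
# CPU model string and which SIMD generations the kernel reports.
# Assumes a Linux node with /proc/cpuinfo; the flag list is illustrative only.

def cpu_summary(path="/proc/cpuinfo"):
    model, flags = "unknown", set()
    with open(path) as f:
        for line in f:
            key, sep, value = line.partition(":")
            if not sep:
                continue
            key, value = key.strip(), value.strip()
            if key == "model name" and model == "unknown":
                model = value
            elif key == "flags" and not flags:
                flags = set(value.split())
    simd = [x for x in ("sse2", "ssse3", "sse4_1", "sse4_2", "avx") if x in flags]
    return model, simd

if __name__ == "__main__":
    model, simd = cpu_summary()
    print("model: %s" % model)
    print("SIMD : %s" % (", ".join(simd) or "none reported"))

Run that on the "internal kit" and on the cloud instance, diff the output, and a surprising number of those 40 percent mysteries go away before you ever get to the interconnect.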
In all things you get what you pay for. $/ABV is one of my fave optimization routines, albeit at extreme values of $ the algorithm always resolves to shabby domestic lager, or something random out of the well... but I digress. *chortles*

> power/cooling/rent are mostly strictly linear once you get past trivial
> clusters (say, few tens of racks). certainly there is some economy
> possible, but there isn't much room to work with. since power is about
> 10% of purchase cost per year, mediocre PUE makes that 13%, and because
> we're talking cloud, rent is off the table. I know Google/FB/etc manage
> PUEs of near 1.0 and site their facilities to get better power prices. I
> suspect they do not get half-priced power, though. and besides, that's
> still only going to take the operating component of TCO down to 5%. at
> the rate cpu speed and power is improving, they probably care more about
> accelerated amortization.

PUE is starting to be a big issue. We built the MGHPCC both as a responsible response to sustainable energy costs (hydro) and because on-campus machine rooms were heading towards the 3.0 PUE scale, which is just plain irresponsible.

> staff: box-monkeying is strictly linear with size of cluster, but can be
> extremely low. (do you even bother to replace broken stuff?). actual
> sysadmin/system programming is *not* a function of the size of the
> facility at all, or at least not directly. diversity of nodes and/or
> environments is what costs you system-person time. you can certainly
> model this, but it's not really part of the TCO examination, since you
> have to pay it either way.
>
> in short, scale matters, but not much.

Node diversity happens in IaaS too. We have spun up tens of thousands of nodes in popular IaaS environments. At that scale you have to get extremely clever about how you do it; it's the software equivalent of "box-monkeying", just at the level of systems and what folks call "devops" - the needle is just moving, it's the same concept in software. Smart software costs a whole lot less than fixed infrastructure.

> so in a very crude sense, cloud/utility computing is really just asking
> another company to make a profit from you. if you did it yourself, you'd
> simply not be making money for Amazon - everything else could be the
> same. Amazon has no special sauce, just a fairly large amount of
> DIN-standard ketchup. unless outsource-mania is really a reflection of
> doubts about competence: if we insource, we're vulnerable to having
> incompetent staff.

Re: "special sauce" - there is a fair amount of it at scale. I've seen inside the source for what Cycle does, and have also run 25K+ proc environments; we all need some sauce to make any of this large scale stuff even slightly palatable!

> the one place where cloud/utility outsourcing makes the most sense is at
> small scale. if you don't have enough work to keep tens of racks busy,
> then there are some scaling and granularity effects. you probably can't
> hire 3% of a sysadmin, and some of your nodes will be idle at times...

Agree - I've seen significant wins at both the small end and the large end of things. "Utility" being a phrase that captures how one can turn resource on and off at the drop of a hat. If you are running nodes at 100%, 365 days a year, on-prem is still a serious win. If you have the odd monster run, or are in a small shop with no access to large scale compute, utility is the only way to go.
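To put some toy numbers behind that last point, here is the kind of back-of-envelope I mean. Apart from Mark's ~10% power figure and a mediocre 1.3 PUE, every number below is a placeholder I made up - plug in your own quotes, this is nobody's real pricing model:

# Rough break-even for the "100% busy 365 days a year vs odd monster run" point.
# Apart from the ~10%-of-purchase-price power figure and a mediocre 1.3 PUE,
# every number here is a made-up placeholder - not Cycle's model, not anyone's
# real pricing.

CAPEX_PER_NODE      = 5000.0   # purchase price per node (placeholder)
AMORTIZATION_YEARS  = 4
POWER_FRACTION      = 0.10     # power ~10% of purchase cost per year
PUE                 = 1.3      # mediocre PUE bumps that to ~13%
STAFF_PER_NODE_YEAR = 300.0    # box-monkeying + sysadmin share (wild guess)
CORES_PER_NODE      = 16
CLOUD_PER_CORE_HOUR = 0.05     # hypothetical on-demand price

HOURS_PER_YEAR = 365 * 24

def onprem_core_hour(utilization):
    """Cost per *useful* core-hour at a given utilization (0 < u <= 1)."""
    yearly = (CAPEX_PER_NODE / AMORTIZATION_YEARS
              + CAPEX_PER_NODE * POWER_FRACTION * PUE
              + STAFF_PER_NODE_YEAR)
    return yearly / (CORES_PER_NODE * HOURS_PER_YEAR * utilization)

if __name__ == "__main__":
    for u in (0.10, 0.25, 0.50, 0.75, 1.00):
        onprem = onprem_core_hour(u)
        winner = "on-prem" if onprem < CLOUD_PER_CORE_HOUR else "cloud"
        print("utilization %3d%%: on-prem $%.3f/core-hr vs cloud $%.3f -> %s"
              % (u * 100, onprem, CLOUD_PER_CORE_HOUR, winner))

With those made-up numbers the crossover lands somewhere around a third utilization - which is exactly the "odd monster run vs 100% busy 365 days a year" split above.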
Reminds me of when I first came to the USA: we never owned a car, we did not really need one, but when we did we used a company called Zipcar - they rent by the hour. My, oh my, if I'd used that for a whole year we would have gone broke, but it is totally awesome to use for a few hours and then go park it - the clock stops ticking. It's a great idea. Even better, to stretch that analogy, take the Ferrari you can rent in Vegas - sure, per hour it is crazy pricing, but blasting round the strip for a few hours gets you joy you could never afford if you wanted to buy one of Mr Enzo's motors for yourself. Clearly, John Hearns has a subliminal effect on all my beowulf postings *chuckles!*

> I'm a little surprised there aren't more cloud cooperatives, where smaller
> companies pool their resources to form a non-profit entity to get past
> these dis-economies of very small scale. fundamentally, I think it's just
> that almost anyone thrust into a management position is phobic about risk.
> I certainly see that in the organization where I work (essentially an
> academic HPC coop.)

You guys proved this out: http://webdocs.cs.ualberta.ca/~jonathan/PREVIOUS/Grad/Papers/ciss.pdf

It's a wonderful idea, and the basis for my prior work in .edu for http://www.mghpcc.org

> people really like EC2. that's great. but they shouldn't be deluded into
> thinking it's efficient: Amazon is making a KILLING on EC2.

It's as efficient as "the bench" shows you, no more, no less.

>> systems than some of us have internally, are we starting to see
>> overhead issues vanish due to massive scale, certainly at cost? I know
>
> eh? numbers please. I see significant overheads only on quite small
> systems.

You are right, numbers were missing. Here's an example of a recent Cycle run:

http://blog.cyclecomputing.com/2013/02/built-to-scale-10600-instance-cyclecloud-cluster-39-core-years-of-science-4362.html

So, ignoring any obvious self-promotion here, ~40 years of compute for $4.3K is a pretty awesome number, at least to me (a quick back-of-envelope on that figure is further down). It probably compares with some academic rates, or at least it is getting close, comparing against a single server:

http://www.rhpcs.mcmaster.ca/current-rates

I do have to say - I absolutely LOVE that you guys put in real FTE support for these servers. This is a very cool idea; I never did implement charge-back in my old gig, and what you guys are doing here is awesome!

>> for a fact that what we call "Pleasantly Parallel" workloads all of this
>
> hah. people forget that "embarrassingly parallel" is a reference to the
> "embarrassment of riches" idiom. perhaps a truer name would be
> "independent concurrency".

Nothing quite like an embarrassment of riches! As Hannibal Smith so famously said: "I love it when a plan comes together."

>> I personally think the game is starting to change a little bit yet again
>> here...
>
> I don't. HPC clusters have been "PaaS cloud providers" from the beginning.
> outsourcing is really a statement about an organization's culture: an
> assertion that if it insources, it will do so badly. it's interesting that
> some very large organizations like GE are in the middle of a reaction from
> outsourcing...

My point was more about the crazy "elastic scale" one can now pull off with the appropriate software and engineering available from top tier IaaS and PaaS providers. To me this is a fascinating and very exciting time for grand challenge, large scale, awesome science. It's lovely to have another thing in the tool kit and to be able to exploit both on and off prem compute.
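And since Mark asked for numbers, the promised back-of-envelope on that CycleCloud run linked above - taking the headline figures (39 core-years, $4,362) at face value, which is obviously the generous reading:

# Back of the envelope on the CycleCloud run linked above, taking the
# headline figures (39 core-years for $4,362) at face value.
CORE_YEARS     = 39
DOLLARS        = 4362.0
HOURS_PER_YEAR = 365 * 24                    # 8,760

core_hours = CORE_YEARS * HOURS_PER_YEAR     # 341,640 core-hours
print("core-hours      : %d" % core_hours)
print("cost / core-hour: $%.4f" % (DOLLARS / core_hours))   # ~ $0.0128

Call it roughly 1.3 cents per core-hour of burst capacity, before anyone argues about what the cores actually were or what the data movement cost.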
Although I'm only a few weeks into this gig, I do see this as bridging and enabling technology to help us do what we all want to do - make bigger science happen, provide better outcomes, design better systems and therapeutics, and further understand the insane and inordinately complicated world that we live in.

> regards, mark hahn.

Cool beans - thanks ever so for the follow-up, this is a great conversation to have! Much appreciated!

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf