Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-18 Thread Matt Lawrence
On Tue, 16 Jun 2009, Mike Davis wrote: In my experience, Sysadmins don't want beer or luxurious offices; they want the tools that they need, proper managerial support, and respect. And I'm sure not getting any of that in my current job. And the pay scale is much lower. -- Matt It's not what

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-17 Thread Greg Lindahl
> out of curiosity, why do you call this sort of thing HPC? > I think "batch" or perhaps "throughput" covers it, but why _high_performance_, > if in fact it is composed of all routine-performance pieces? It's a calling-a-duck-a-duck issue. "Warehouse computing" looks a lot like Ethernet-connected HPC

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-17 Thread Mark Hahn
> Believe it or not, but most HPC sites do not use MPI. They are all batch systems where storage I/O is the bottleneck. However, I have tested MPI
out of curiosity, why do you call this sort of thing HPC? I think "batch" or perhaps "throughput" covers it, but why _high_performance_, if in fact i

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-17 Thread Prentice Bisbal
Kilian CAVALOTTI wrote: > On Tuesday 16 June 2009 18:23:31 John Hearns wrote: >> Highly paid, pampered systems admins who must be treated like expensive >> racehorses, and not exercised too much every day. They need cool beers on >> tap and luxurious offices to relax in > > Hey! I've been ripped

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-17 Thread Kilian CAVALOTTI
On Tuesday 16 June 2009 18:23:31 John Hearns wrote: > Highly paid, pampered systems admins who must be treated like expensive > racehorses, and not exercised too much every day. They need cool beers on > tap and luxurious offices to relax in Hey! I've been ripped off... -- Kilian

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-16 Thread Alex Chekholko
On Mon, 15 Jun 2009 20:58:58 +0100 John Hearns wrote: > 2009/6/15 Michael Di Domenico : > > > > > > Course having said all that, if you've been watching the linux-kernel > > mailing list you've probably noticed the Xen/Kvm/Linux HV argument > > that took place last week.  Makes me a little afraid

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-16 Thread Andrew Robbie (Gmail)
On 16/06/2009, at 5:58 AM, John Hearns wrote: 2009/6/15 Michael Di Domenico : Course having said all that, if you've been watching the linux-kernel mailing list you've probably noticed the Xen/Kvm/Linux HV argument that took place last week. Makes me a little afraid to push any Linux HV

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-16 Thread Mike Davis
John Hearns wrote: 2009/6/16 Egan Ford: I have no idea the state of VMs on IB. That can be an issue with MPI. Believe it or not, but most HPC sites do not use MPI. They are all batch systems where storage I/O is the bottleneck. Burn the Witch! Burn the

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-16 Thread Joe Landman
John Hearns wrote: Any HPC installation, if you want to show it off to alumni, august committees from grant awarding bodies etc. and not get sand kicked in your face from the big boys in the Top 500 NEEDS an expensive infrastructure of various MPI libraries. Big, big switches with lots of fl

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-16 Thread Egan Ford
Ha! :-) I've put a few GigE systems in the Top100, and if the stars align you'll see a Top20 GigE system in next week's list. That's ONE GigE to each node oversubscribed 4:1. Sadly no flashing lights, and since it's 100% water cooled with low velocity fans, there is almost no noise. On Tue, Jun

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-16 Thread Egan Ford
The good news... We (IBM) demonstrated such a system at SC08 as a Cloud Computing demo. The setup was a combination of Moab, xCAT, and Xen. xCAT is an open source provisioning system that can control/monitor hardware, discover nodes, and provision stateful/stateless physical nodes and virtual ma

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-16 Thread John Hearns
2009/6/16 Egan Ford > I have no idea the state of VMs on IB. That can be an issue with MPI. > Believe it or not, but most HPC sites do not use MPI. They are all batch > systems where storage I/O is the bottleneck. Burn the Witch! Burn the Witch! Any HPC installation, if you want to show it o

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-16 Thread Greg Keller
Date: Tue, 16 Jun 2009 10:38:55 +0200 From: Kilian CAVALOTTI On Monday 15 June 2009 20:47:40 Michael Di Domenico wrote: It would be nice to be able to just move bad hardware out from under a running job without affecting the run of the job. I may be missing something major here, but if

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-16 Thread John Hearns
2009/6/16 Ashley Pittman > > > > elements (or slots) allocated for the job on the node - if the VM is > > able to adapt itself to such a situation, e.g. by starting several MPI > > ranks and using shared memory for MPI communication. Further, to > > cleanly stop the job, the queueing system will

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-16 Thread Ashley Pittman
On Tue, 2009-06-16 at 12:27 +0200, Bogdan Costescu wrote: > You might be right, at least when talking about the short term. It has > been my experience with several ISVs that they are very slow in > adopting newer features related to system infrastructure in their > software - by system infrastr

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-16 Thread Michael Di Domenico
On Mon, Jun 15, 2009 at 3:58 PM, John Hearns wrote: > 2009/6/15 Michael Di Domenico : >> Course having said all that, if you've been watching the linux-kernel >> mailing list you've probably noticed the Xen/Kvm/Linux HV argument >> that took place last week. Makes me a little afraid to push any Li

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-16 Thread Bogdan Costescu
On Tue, 16 Jun 2009, John Hearns wrote: I believe that if we can get features like live migration of failing machines, plus specialized stripped-down virtual machines specific to job types then we will see virtualization becoming mainstream in HPC clustering. You might be right, at least whe

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-16 Thread John Hearns
2009/6/16 Kilian CAVALOTTI > My take on this is that it's probably more efficient to develop > checkpointing > features and recovery in software (like MPI) rather than adding a > virtualization layer, which is likely to decrease performance. > The performance hits measured by Panda et al. on Inf

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-16 Thread John Hearns
2009/6/16 Kilian CAVALOTTI > > > I may be missing something major here, but if there's bad hardware, chances > are the job has already failed from it, right? Would it be a bad disk (and > the > OS would only notice a bad disk while trying to write on it, likely asked > to > do so by the job), or

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-16 Thread Kilian CAVALOTTI
On Monday 15 June 2009 20:47:40 Michael Di Domenico wrote: > It would be nice to be able to just move bad hardware out from under a > running job without affecting the run of the job. I may be missing something major here, but if there's bad hardware, chances are the job has already failed from

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-15 Thread John Hearns
2009/6/15 Michael Di Domenico : > > > Course having said all that, if you've been watching the linux-kernel > mailing list you've probably noticed the Xen/Kvm/Linux HV argument > that took place last week. Makes me a little afraid to push any Linux > HV solution into production, but it's a fun

Re: [Beowulf] HPC fault tolerance using virtualization

2009-06-15 Thread Michael Di Domenico
On Mon, Jun 15, 2009 at 1:59 PM, John Hearns wrote: > Proactive Fault Tolerance for HPC using Xen virtualization > > It's something I've wanted to see working - doing a Xen live migration > of a 'dodgy' compute node, and the job just keeps on trucking. > Looks as if these guys have it working. Anyon

[Beowulf] HPC fault tolerance using virtualization

2009-06-15 Thread John Hearns
I was doing a search on ganglia + ipmi (I'm looking at doing such a thing for temperature measurement) when I came across this paper: http://www.csm.ornl.gov/~engelman/publications/nagarajan07proactive.ppt.pdf Proactive Fault Tolerance for HPC using Xen virtualization It's something I've wanted to