On Mon, 15 Jun 2009 20:58:58 +0100
John Hearns wrote:
> 2009/6/15 Michael Di Domenico :
> >
> >
> > Course having said all that, if you've been watching the linux-kernel
> > mailing list you've probably noticed the Xen/KVM/Linux HV argument
> > that took place last week. Makes me a little afraid to push any Linux HV
On 16/06/2009, at 5:58 AM, John Hearns wrote:
2009/6/15 Michael Di Domenico :
Course having said all that, if you've been watching the linux-kernel
mailing list you've probably noticed the Xen/KVM/Linux HV argument
that took place last week. Makes me a little afraid to push any Linux HV
John Hearns wrote:
2009/6/16 Egan Ford <e...@sense.net>:
I have no idea the state of VMs on IB. That can be an issue with
MPI. Believe it or not, but most HPC sites do not use MPI. They
are all batch systems where storage I/O is the bottleneck.
Burn the Witch! Burn the Witch!
John Hearns wrote:
Any HPC installation, if you want to show it off to alumni, august
committees from grant awarding bodies, etc., and not get sand kicked in
your face from the big boys in the Top 500 NEEDS an expensive
infrastructure of various MPI libraries. Big, big switches with lots of
flashing lights.
Ha! :-)
I've put a few GigE systems in the Top100, and if the stars align you'll see
a Top20 GigE system in next week's list. That's ONE GigE to each node,
oversubscribed 4:1. Sadly no flashing lights, and since it's 100%
water-cooled with low-velocity fans, there is almost no noise.
On Tue, Jun
The good news...
We (IBM) demonstrated such a system at SC08 as a Cloud Computing demo. The
setup was a combination of Moab, xCAT, and Xen.
xCAT is an open source provisioning system that can control/monitor
hardware, discover nodes, and provision stateful/stateless physical nodes
and virtual machines.
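A minimal sketch of that kind of command-line control from a script, assuming
xCAT's rpower client command is on PATH and that "rpower <noderange> stat"
prints lines like "node01: on" (the output format can vary between releases);
the group name "compute" is illustrative:

import subprocess

def power_state(noderange):
    """Ask xCAT for power state: 'rpower <noderange> stat' -> {node: state}."""
    out = subprocess.run(["rpower", noderange, "stat"],
                         capture_output=True, text=True, check=True)
    states = {}
    for line in out.stdout.splitlines():
        # Assumed line format: "node01: on" (one node per line).
        if ":" in line:
            node, state = line.split(":", 1)
            states[node.strip()] = state.strip()
    return states

if __name__ == "__main__":
    # "compute" stands in for whatever xCAT node group the site defines.
    for node, state in sorted(power_state("compute").items()):
        print(node, state)

The same pattern works for the other xCAT verbs (provisioning, discovery);
only the command name and the parsing of its output change.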
2009/6/16 Egan Ford
> I have no idea the state of VMs on IB. That can be an issue with MPI.
> Believe it or not, but most HPC sites do not use MPI. They are all batch
> systems where storage I/O is the bottleneck.
Burn the Witch! Burn the Witch!
Any HPC installation, if you want to show it off
Date: Tue, 16 Jun 2009 10:38:55 +0200
From: Kilian CAVALOTTI
On Monday 15 June 2009 20:47:40 Michael Di Domenico wrote:
It would be nice to be able to just move bad hardware out from under a
running job without affecting the run of the job.
I may be missing something major here, but if
2009/6/16 Ashley Pittman
> >
> > elements (or slots) allocated for the job on the node - if the VM is
> > able to adapt itself to such a situation, e.g. by starting several MPI
> > ranks and using shared memory for MPI communication. Further, to
> > cleanly stop the job, the queueing system will
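As a minimal illustration of the point about co-located MPI ranks, the sketch
below (assuming mpi4py; the thread names no particular MPI binding) has each
rank discover which peers share its node and build a node-local
sub-communicator, which is the grouping a job would need before switching to
shared-memory communication among ranks on one host:

import socket
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Every rank learns every other rank's hostname.
hostnames = comm.allgather(socket.gethostname())

# Ranks on the same host get the same color, hence the same sub-communicator.
color = sorted(set(hostnames)).index(hostnames[rank])
node_comm = comm.Split(color=color, key=rank)

print("global rank %d is local rank %d of %d on %s"
      % (rank, node_comm.Get_rank(), node_comm.Get_size(), hostnames[rank]))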
On Tue, 2009-06-16 at 12:27 +0200, Bogdan Costescu wrote:
> You might be right, at least when talking about the short term. It has
> been my experience with several ISVs that they are very slow in
> adopting newer features related to system infrastructure in their
> software - by system infrastructure
On Mon, Jun 15, 2009 at 3:58 PM, John Hearns wrote:
> 2009/6/15 Michael Di Domenico :
>> Course having said all that, if you've been watching the linux-kernel
>> mailing list you've probably noticed the Xen/KVM/Linux HV argument
>> that took place last week. Makes me a little afraid to push any Linux HV
On Tue, 16 Jun 2009, John Hearns wrote:
I believe that if we can get features like live migration of failing
machines, plus specialized stripped-down virtual machines specific
to job types then we will see virtualization becoming mainstream in
HPC clustering.
You might be right, at least when talking about the short term.
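A minimal sketch of the live-migration idea using the libvirt Python bindings;
the guest name, the host URIs, and the decision of when a node counts as
"failing" are illustrative assumptions, and shared storage reachable from both
hosts is assumed:

import libvirt

SRC_URI = "qemu+ssh://failing-node/system"   # hypothetical source host
DST_URI = "qemu+ssh://spare-node/system"     # hypothetical destination host
GUEST   = "jobvm01"                          # hypothetical VM running the job

src = libvirt.open(SRC_URI)
dst = libvirt.open(DST_URI)

dom = src.lookupByName(GUEST)
# VIR_MIGRATE_LIVE keeps the guest (and the job inside it) running while
# its memory is copied to the destination host.
dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)

src.close()
dst.close()

The job inside the guest keeps running while memory pages are copied over;
only the final switch-over pauses it briefly.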
John Hearns writes:
> I was doing a search on ganglia + ipmi (I'm looking at doing such a
> thing for temperature measurement)
Like
<http://www.nw-grid.ac.uk/LivScripts?action=AttachFile&do=get&target=freeipmi-gmetric-temp>?
If you want to take action, though, go direct to Nagios or similar with
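A small sketch in the same spirit as the linked freeipmi-gmetric-temp script
(not that script itself): read the temperature sensors with ipmitool and push
each reading into Ganglia with gmetric. It assumes both tools are on PATH and
that "ipmitool sdr type Temperature" prints pipe-separated lines ending in
"... degrees C", which varies by BMC:

import subprocess

def read_temperatures():
    """Yield (sensor_name, celsius) pairs from 'ipmitool sdr type Temperature'."""
    out = subprocess.run(["ipmitool", "sdr", "type", "Temperature"],
                         capture_output=True, text=True, check=True)
    for line in out.stdout.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Assumed format: "Ambient Temp | 32h | ok | 7.1 | 25 degrees C"
        if len(fields) >= 5 and "degrees C" in fields[-1]:
            name = fields[0].replace(" ", "_")
            value = float(fields[-1].split()[0])
            yield name, value

def push_to_ganglia(name, value):
    """Publish one reading as a Ganglia metric via gmetric."""
    subprocess.run(["gmetric", "--name", "temp_" + name, "--value", str(value),
                    "--type", "float", "--units", "Celsius"], check=True)

if __name__ == "__main__":
    for name, value in read_temperatures():
        push_to_ganglia(name, value)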
2009/6/16 Kilian CAVALOTTI
> My take on this is that it's probably more efficient to develop
> checkpointing
> features and recovery in software (like MPI) rather than adding a
> virtualization layer, which is likely to decrease performance.
>
The performance hits measured by Panda et al. on InfiniBand
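As a sketch of what "checkpointing in software" can look like at the
application level (illustrative only, assuming mpi4py; no particular
checkpoint library is implied), each rank periodically writes its own state
to disk so a resubmitted job can resume from the last completed step:

import os
import pickle
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
ckpt = "ckpt_rank%04d.pkl" % rank

# Resume from an existing checkpoint if one is present.
if os.path.exists(ckpt):
    with open(ckpt, "rb") as f:
        state = pickle.load(f)
else:
    state = {"step": 0, "data": 0.0}

while state["step"] < 1000:
    state["data"] += rank          # stand-in for the real computation
    state["step"] += 1
    if state["step"] % 100 == 0:   # checkpoint every 100 steps
        with open(ckpt + ".tmp", "wb") as f:
            pickle.dump(state, f)
        os.rename(ckpt + ".tmp", ckpt)   # atomically replace the old checkpoint
        comm.Barrier()             # keep all ranks' checkpoints at the same step

Writing to a temporary file and renaming it keeps the previous checkpoint
intact if a node dies mid-write.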
2009/6/16 Kilian CAVALOTTI
>
>
> I may be missing something major here, but if there's bad hardware, chances
> are the job has already failed from it, right? Would it be a bad disk (and
> the OS would only notice a bad disk while trying to write on it, likely
> asked to do so by the job), or
On Monday 15 June 2009 20:47:40 Michael Di Domenico wrote:
> It would be nice to be able to just move bad hardware out from under a
> running job without affecting the run of the job.
I may be missing something major here, but if there's bad hardware, chances
are the job has already failed from it, right?