On Mon, Jun 15, 2009 at 1:59 PM, John Hearns <hear...@googlemail.com> wrote:
> Proactive Fault Tolerance for HPC using Xen virtualization
>
> Its something I've wanted to see working - doing a Xen live migration
> of a 'dodgy' compute node, and the job just keeps on trucking.
> Looks as if these guys have it working. Anyone else seen similar?

I haven't seen it in the field yet, but I had hoped to do something similar with a cluster this summer. I hadn't seen the above paper before; I was basing my test on some papers I'd seen about using Xen with cloud computing initiatives (à la AWS or Eucalyptus).

Ideally I'd like to see InfiniBand worked into the mix, so that I could use high-speed messaging within the Xen images and then live migrate an image as the need arises. DK Panda has a paper that shows a little bit of this, but details are few and far between.

It would be nice to be able to just move bad hardware out from under a running job without affecting the run. If the job took an extra ten minutes because of a migration, I think that's a small price to pay for actually having the run go to completion and not having to worry so much about checkpoints.

Of course, having said all that, if you've been watching the linux-kernel mailing list you've probably noticed the Xen/KVM/Linux HV argument that took place last week. It makes me a little afraid to push any Linux HV solution into production, but it's a fun experiment nonetheless...

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf