Lux, Jim (337C) wrote:
> 
> 
> On 12/8/09 9:22 AM, "james bardin" <jbar...@bu.edu> wrote:
> 
>> On Tue, Dec 8, 2009 at 10:50 AM, Prentice Bisbal <prent...@ias.edu> wrote:
>>
>>> You'd hope that. Most of my current cluster users are scientific
>>> researchers in academia, not computer scientists. While some are
>>> extremely computer savvy, others have learned just enough about
>>> programming to do their calculations. Expecting the latter to write code
>>> with checkpointing is unrealistic, and working in academia, I can't
>>> force them to. Which is why taking down 4 nodes instead of just one is
>>> less than ideal.
>>>
>> I find it's still advantageous to push them to learn it. A researcher
>> working with a tight deadline for a grant will often see the light
>> when a hardware failure loses them a month or more of data processing.
>> It really is in their own best interests to learn about their tools.
> 
> 
> What about some form of "image checkpoint" like "hibernation"... Should be
> application unaware, just snapshots memory.

That's fine when the problem is on one system and there's only one
system image to worry about checkpointing. Once you start spreading the
job around to multiple systems, things get complicated, especially if
your nodes are heterogeneous w.r.t. hardware.

I fear we're straying off the topic of the original post...

--
Prentice
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
