On 10/25/2016 02:24 PM, Michael Di Domenico wrote:
> here's an interesting thought exercise and a real problem i have to tackle.
> i have researchers who want to run magma codes for three weeks or
> so at a time. the process is unfortunately sequential in nature and
> magma doesn't support checkpointing (as far as i know; i don't
> know much about magma)
>
> so the question is:
> what kind of a system could one design/buy using any combination of
> hardware/software that would guarantee that this program would run for
> 3 wks or so and not fail
>
> and by "fail" i mean from some system type error, i.e. memory faulted,
> cpu faulted, network io slipped (nfs timeout) as opposed to "there's a
> bug in magma" which already bit us a few times

You'd need to design an HA network and storage system to handle the
possibility of external failure. For internal failure, you'd want to
run this in a kvm very close to the metal, and snapshot/checkpoint the
VM every so often to local/remote VERY FAST storage.
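
A minimal sketch of the periodic-snapshot idea, assuming the job runs in a libvirt-managed KVM guest and that `virsh snapshot-create-as` is available on the host. The domain name `magma-vm`, the interval, and the helper names are hypothetical, and whether a snapshot captures memory state depends on the disk format and libvirt setup, so treat this as an illustration rather than a tested recipe:

```python
import subprocess
import time
from datetime import datetime, timezone


def snapshot_name(prefix, now):
    """Build a timestamped snapshot name, e.g. magma-vm-20161025T142400Z."""
    return f"{prefix}-{now.strftime('%Y%m%dT%H%M%SZ')}"


def snapshot_command(domain, name):
    """Return the virsh command line for one named snapshot of `domain`."""
    # Assumption: the guest is managed by libvirt and virsh is on the PATH.
    return ["virsh", "snapshot-create-as", domain, name]


def snapshot_loop(domain, interval_s, count,
                  run=subprocess.run, sleep=time.sleep):
    """Take `count` snapshots of `domain`, waiting `interval_s` after each.

    `run` and `sleep` are injectable so the loop can be exercised
    without a hypervisor present.
    """
    names = []
    for _ in range(count):
        name = snapshot_name(domain, datetime.now(timezone.utc))
        # check=True raises if virsh exits non-zero, so a failed
        # snapshot is noticed instead of silently skipped.
        run(snapshot_command(domain, name), check=True)
        names.append(name)
        sleep(interval_s)
    return names


if __name__ == "__main__":
    # Hypothetical values: hourly snapshots over roughly three weeks.
    snapshot_loop("magma-vm", interval_s=3600, count=24 * 21)
```

How often to snapshot is a trade-off between the work lost on a failure and the time the guest spends paused while its state is written out, which is why the snapshot target needs to be fast.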
That said, it would help to start with a system that can handle
hard/heavy load for that period of time w/o failure. We have units at
various places around the world that sustain many GB/s of IO
continuously for more than a year of operations, under fairly intense loads.
Choose your systems wisely, and don't let brand names decide the outcome.

> there's probably some commercial or "unreleased" commercial product on
> the market that might fill this need, but i'm also looking for
> something "creative" as well

Start with good. If you ping me about our burn-in test case, I'll be
happy to send it over. It runs y-cruncher to do burn-in on all
CPUs/RAM continuously. It's pretty good at catching bad MB/CPU/RAM.
Previously, I had a GAMESS run I used for this (also very good).

> three weeks isn't a big stretch compared to some of the other codes
> i've heard around the DOE that run for months, but it's still pretty
> painful to have a run go for three weeks and then fail 2.5 weeks in
> and have to restart. most modern day hardware would probably support
> this without issue, but i'm looking for more of a guarantee than a
> prayer
>
> double bonus points for anything that runs at high clock speeds >3 GHz

See above. This is fairly *easy* for various definitions of easy.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
e: land...@scalableinformatics.com
w: http://scalableinformatics.com
t: @scalableinfo
p: +1 734 786 8423 x121
c: +1 734 612 4615