Assuming you can contain a run on a single node, you could use containers and the cgroup freezer controller (plus maybe LVM snapshots) to do checkpoint/restart.
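To make the freezer part concrete, here is a rough sketch of pausing and resuming every task in a cgroup. It assumes the cgroup v1 freezer hierarchy, where the kernel exposes a `freezer.state` file that accepts `FROZEN` and `THAWED`. The mount point, the helper names, and the timeout values are my own assumptions, not anything magma or a particular container runtime provides. Note that this only stops the tasks so that a consistent snapshot can be taken; it does not save process memory by itself.

```python
import time
from pathlib import Path

# Assumption: cgroup v1 freezer hierarchy mounted at its usual location.
FREEZER_ROOT = Path("/sys/fs/cgroup/freezer")


def set_freezer_state(cgroup_dir, state, timeout=30.0, poll=0.1):
    """Write FROZEN or THAWED to freezer.state and wait until it reads back."""
    if state not in ("FROZEN", "THAWED"):
        raise ValueError(f"unsupported freezer state: {state}")
    state_file = Path(cgroup_dir) / "freezer.state"
    state_file.write_text(state)
    deadline = time.monotonic() + timeout
    # The kernel may report FREEZING while tasks are still being stopped,
    # so poll until the requested state is actually reached.
    while True:
        current = state_file.read_text().strip()
        if current == state:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError(f"cgroup stuck in {current}, wanted {state}")
        time.sleep(poll)


def freeze(cgroup_dir):
    """Stop every task in the cgroup (hypothetical helper name)."""
    return set_freezer_state(cgroup_dir, "FROZEN")


def thaw(cgroup_dir):
    """Resume every task in the cgroup (hypothetical helper name)."""
    return set_freezer_state(cgroup_dir, "THAWED")
```

The intended use would be something like `freeze(FREEZER_ROOT / "magma_job")`, take the LVM snapshot of the job's filesystem while nothing is running, then `thaw(FREEZER_ROOT / "magma_job")`. The cgroup name `magma_job` is made up for illustration, and writing to the freezer files needs root.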

Skylar

On 10/25/2016 11:24 AM, Michael Di Domenico wrote:
> here's an interesting thought exercise and a real problem i have to tackle.
>
> i have researchers that want to run magma codes for three weeks or
> so at a time. the process is unfortunately sequential in nature and
> magma doesn't support check pointing (as far as i know, and I don't
> know much about magma)
>
> So the question is;
>
> what kind of a system could one design/buy using any combination of
> hardware/software that would guarantee that this program would run for
> 3 wks or so and not fail
>
> and by "fail" i mean from some system type error, ie memory faulted,
> cpu faulted, network io slipped (nfs timeout) as opposed to "there's a
> bug in magma" which already bit us a few times
>
> there's probably some commercial or "unreleased" commercial product on
> the market that might fill this need, but i'm also looking for
> something "creative" as well
>
> three weeks isn't a big stretch compared to some of the other codes
> i've heard around the DOE that run for months, but it's still pretty
> painful to have a run go for three weeks and then fail 2.5 weeks in
> and have to restart. most modern day hardware would probably support
> this without issue, but i'm looking for more of a guarantee than a
> prayer
>
> double bonus points for anything that runs at high clock speeds >3GHz
>
> any thoughts?
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf