Several points in here.

1. Preemption is one approach I finally got the admin to buy into for forecasting codes.

2. My operational codes for an individual simulation don't take long to run, except that we don't do a 12-hour hurricane sim but an 84-hour sim for the weather side (WRF). The saving grace here is that the nested grids are not too large, so they can run to completion in a couple of wall-clock hours.

3. When one starts twiddling initial conditions statistically to create an ensemble, one then has to run all the ensemble members. One usually starts with the central cases first, especially if one "knows" which are central and which are peripheral. If one run takes 30 minutes on 128 processors and one thinks one needs 57 members, one exceeds a wall-clock day (see the back-of-envelope sketch below) and needs a bigger, faster computer, or at least a bigger queue reservation. If one does this without preemption, one gets all the results back at the end of the hurricane season and declares success after 3 years of analysis instead of providing data in near real time.
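For what it's worth, here is that arithmetic spelled out as a trivial C sketch. The 57 members, 30 minutes per run, and the assumption that each member needs the whole 128-processor allocation (so members run one at a time) are just the numbers quoted above, not measurements.

/* Back-of-envelope ensemble wall-clock estimate.  The member count,
 * per-run time, and "one member at a time on the 128-processor
 * allocation" assumption come from the discussion above. */
#include <stdio.h>

int main(void)
{
    const int    members       = 57;
    const double hours_per_run = 0.5;   /* 30 min on 128 processors */
    const int    concurrent    = 1;     /* one member at a time */

    double wall_hours = (double)members * hours_per_run / concurrent;

    printf("%d members x %.1f h / %d at a time = %.1f wall-clock hours\n",
           members, hours_per_run, concurrent, wall_hours);
    printf("That %s a 24-hour forecast cycle.\n",
           wall_hours > 24.0 ? "exceeds" : "fits within");
    return 0;
}

Which comes to 28.5 hours, hence the need for a bigger reservation, a bigger machine, or preemption.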


So there are 57 jobs of 30 minutes each.  Get your user to rewrite their
scripts so it isn't one job.  That shouldn't be too hard.

Part of this involves the social engineering required on my campus to get HPC efforts to work at all... Alas, none of this has to do with backtraces.

Very true (on both parts).

Craig




gerry

Yeah, we really do that. With boundary-condition munging we can run a statistical set of simulations and see what the probabilities are and where, for instance, maximum storm surge is likely to go. If we don't get sufficient membership in the ensemble, the statistical strength of the forecasting procedure decreases.

Gerry

Part of the reason I got a kick out of this simple backtrace.so is indeed that it's quite possible to conceive of a checkpoint.so which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent job of checkpointing at least serial codes non-intrusively.
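Not that the following is a checkpointer, but a minimal sketch of the /proc-walking half of that idea, written as a standalone program rather than a preloaded .so: it merely enumerates the memory regions and open descriptors a hypothetical checkpoint.so would have to capture.

/* proc_walk_sketch.c - list what a non-intrusive checkpointer would
 * have to capture: the memory map and the open file descriptors of
 * the process.  Dumping and restoring them is the hard part and is
 * not attempted here.  Uses /proc/self/... for simplicity; a real
 * checkpoint.so preloaded into the target would do the same from
 * inside that process (or read /proc/$pid/... from outside). */
#include <stdio.h>
#include <dirent.h>
#include <unistd.h>
#include <limits.h>

static void walk_maps(void)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    char line[512];

    if (!maps)
        return;
    while (fgets(line, sizeof line, maps))
        fputs(line, stdout);   /* address range, perms, offset, backing file */
    fclose(maps);
}

static void walk_fds(void)
{
    DIR *dir = opendir("/proc/self/fd");
    struct dirent *de;
    char link[PATH_MAX], target[PATH_MAX];
    ssize_t n;

    if (!dir)
        return;
    while ((de = readdir(dir)) != NULL) {
        if (de->d_name[0] == '.')
            continue;
        snprintf(link, sizeof link, "/proc/self/fd/%s", de->d_name);
        n = readlink(link, target, sizeof target - 1);
        if (n < 0)
            continue;
        target[n] = '\0';
        printf("fd %s -> %s\n", de->d_name, target);
    }
    closedir(dir);
}

int main(void)
{
    walk_maps();
    walk_fds();
    return 0;
}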


BTW, I like your code.  I had a script written for me in the past
(by Greg Lindahl in a galaxy far, far away).  The one modification
I would make is to print out the MPI ID environment variable (MPI
flavors vary in how it is set).  Then when it crashes, you know which
process actually died.
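Something along these lines, say (the environment variable names below are ones commonly set by Open MPI, MPICH/PMI launchers, MVAPICH2, and Slurm's srun; check what your particular MPI stack actually exports before relying on them):

/* rank_from_env.c - sketch of tagging output with the MPI rank taken
 * from the environment.  MPI flavors differ in which variable carries
 * the rank, so try a few common candidates; adjust for your site. */
#include <stdio.h>
#include <stdlib.h>

static const char *rank_candidates[] = {
    "OMPI_COMM_WORLD_RANK",   /* Open MPI */
    "PMI_RANK",               /* MPICH-family launchers using PMI */
    "MV2_COMM_WORLD_RANK",    /* MVAPICH2 */
    "SLURM_PROCID",           /* Slurm's srun */
    NULL
};

static const char *mpi_rank_string(void)
{
    const char **name;

    for (name = rank_candidates; *name != NULL; name++) {
        const char *val = getenv(*name);
        if (val != NULL)
            return val;
    }
    return "unknown";
}

int main(void)
{
    /* In a backtrace.so this would go in the signal handler, so the
     * dying process identifies itself in its backtrace output. */
    fprintf(stderr, "MPI rank (from environment): %s\n", mpi_rank_string());
    return 0;
}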

Craig

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf