Craig Tierney wrote:
Gerry Creager wrote:
I've tried to stay out of this. Really, I have.
Craig Tierney wrote:
Mark Hahn wrote:
Sorry to start a flame war....
what part do you think was inflamed?
It was when I was trying to say "Real codes have user-level
checkpointing implemented and no code should ever run for 7
days."
A number of my climate simulations will run for 7-10 days to get
century-long simulations to complete. I've run geodesy simulations
that ran for up to 17 days in the past. I like to think that my codes
are real enough!
NCAR and GFDL run climate simulations for weeks as well. But how long
is any single job allowed to run? 8-12 hours. I can verify
these numbers if needed, but I can guarantee you that no one is allowed
to put their job in for 17 days. With explicit permission they may get
24 hours, but that would be for unique situations.
On the p575, we have similar constraints and I do work within those. In
my lab, I can control access a bit more and have considerably fewer (and
truly grateful) users, so if we need to run "forever" we can implement
that.
Real codes do have user-level checkpointing, though. And even better
codes can be restarted without a lot of user intervention by invoking
a run-time flag and going off for coffee.
You mean there are people who bother to implement checkpointing and
then don't code it like:
if (checkpoint files exist in my directory) then
load checkpoint files
else
start from scratch
end
????
Yes, there are. No, I'm not one of them. My stuff does do a restart if
it stops and finds evidence of a need to continue. However, I've seen
this failure time and time again over the years.
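For what it's worth, a minimal C sketch of that auto-restart pattern --
the file name checkpoint.dat, the step counter, and the commented-out
load/save routines are placeholders, not anybody's actual code:

#include <stdio.h>
#include <unistd.h>

/* placeholder file name and commented-out I/O calls -- substitute your own */
#define CKPT_FILE "checkpoint.dat"

int main(void)
{
    long step = 0;

    /* if a checkpoint is sitting in the working directory, resume from it */
    if (access(CKPT_FILE, R_OK) == 0) {
        /* load_checkpoint(CKPT_FILE, &step);  restore state and step count */
        printf("checkpoint found, restarting from %s\n", CKPT_FILE);
    } else {
        printf("no checkpoint, starting from scratch\n");
    }

    for (; step < 1000000; step++) {
        /* do_timestep(); */
        if (step % 10000 == 0) {
            /* save_checkpoint(CKPT_FILE, step);  dump state every N steps */
        }
    }
    return 0;
}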
Set your queue maximums to 6-8 hours. Prevents system hogging,
encourages checkpointing for long runs. Make sure your IO system
can support the checkpointing because it can create a lot of load.
And how do you support my operational requirements with this policy
during hurricane season? Let's see... "Stop that ensemble run now so
the Monte Carlo chemists can play for awhile, then we'll let you back
on. Don't worry about the timeliness of your simulations. No one
needs a 35-member ensemble for statistical forecasting, anyway." Did
I miss something?
You kick off the users who are not running operational codes because
their work is (probably) not as time-constrained. Also, if you take
so long to get your answer in an operational mode that the answer
doesn't matter anymore, you need a faster computer. If you cannot spit
out a 12-hour hurricane forecast within a couple of hours, I would be
concerned about how valuable the answer is.
Several points in here.
1. Preemption is one approach I finally got the admin to buy into for
forecasting codes.
2. MY operational codes for an individual simulation don't take long to
run, save that we don't do a 12-hour hurricane sim but an 84-hour sim
on the weather side (WRF). The saving grace here is that the nested
grids are not too large, so they can run to completion in a couple of
wall-clock hours.
3. When one starts trying to twiddle initial conditions statistically
to create an ensemble, one then has to run all the ensemble members.
One usually starts with central cases first, especially if one "knows"
which are central and which are peripheral. If one run takes 30 min on
128 processors, and one thinks one needs 57 members run, one exceeds a
wall-clock day (57 x 0.5 hr = 28.5 hours), and needs a bigger, faster
computer, or at least a
bigger queue reservation. If one does this without preemption, one gets
all results back at the end of the hurricane season and declares success
after 3 years of analysis instead of providing data in near real time.
Part of this involves the social engineering required on my campus to
get HPC efforts to work at all... Alas, none of this has to do with backtraces.
gerry
Yeah, we really do that. With boundary-condition munging we can run a
statistical set of simulations and see what the probabilities are and
where, for instance, maximum storm surge is likely to go. If we don't
get sufficient membership in the ensemble, the statistical strength of
the forecasting procedure decreases.
Gerry
part of the reason I got a kick out of this simple backtrace.so
is indeed that it's quite possible to conceive of a checkpoint.so
which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent
job of checkpointing at least serial codes non-intrusively.
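for illustration, roughly the /proc scraping such a checkpoint.so would
start from (serial case only; it just lists what it finds -- actually
dumping and restoring the memory regions and file offsets is the hard
part, and this is a sketch, not anything shipped):

/* enumerate memory maps and open fds via /proc */
#include <stdio.h>
#include <dirent.h>
#include <unistd.h>
#include <limits.h>

int main(void)
{
    char line[4096], path[PATH_MAX], target[PATH_MAX];

    /* memory regions: address range, perms, offset, dev, inode, pathname */
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps) {
        while (fgets(line, sizeof(line), maps))
            fputs(line, stdout);
        fclose(maps);
    }

    /* open file descriptors: each entry is a symlink to the real file */
    DIR *fddir = opendir("/proc/self/fd");
    if (fddir) {
        struct dirent *de;
        while ((de = readdir(fddir)) != NULL) {
            if (de->d_name[0] == '.')
                continue;
            snprintf(path, sizeof(path), "/proc/self/fd/%s", de->d_name);
            ssize_t n = readlink(path, target, sizeof(target) - 1);
            if (n > 0) {
                target[n] = '\0';
                printf("fd %s -> %s\n", de->d_name, target);
            }
        }
        closedir(fddir);
    }
    return 0;
}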
BTW, I like your code. I had a script written for me in the past
(by Greg Lindahl in a galaxy far-far away). The one modification
I would make is to print out the MPI ID environment variable (MPI
flavors vary in how it is set). Then when it crashes, you know which
process actually died.
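Something along these lines would do it, I think -- the environment
variable names below are just the ones I happen to know of (Open MPI,
PMI-based MPICH/Intel MPI, MVAPICH2, Slurm); other flavors and
launchers may use different ones:

/* best-effort guess at this process's MPI rank from the environment;
   the variable names below are not an exhaustive list */
#include <stdio.h>
#include <stdlib.h>

static const char *rank_vars[] = {
    "OMPI_COMM_WORLD_RANK",  /* Open MPI */
    "PMI_RANK",              /* MPICH, Intel MPI and friends */
    "MV2_COMM_WORLD_RANK",   /* MVAPICH2 */
    "SLURM_PROCID",          /* Slurm-launched jobs */
    NULL
};

const char *guess_mpi_rank(void)
{
    for (int i = 0; rank_vars[i]; i++) {
        const char *v = getenv(rank_vars[i]);
        if (v)
            return v;
    }
    return "unknown";
}

int main(void)
{
    /* in the backtrace handler you'd prepend this to each line of output */
    printf("rank %s\n", guess_mpi_rank());
    return 0;
}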
Craig
--
Gerry Creager -- [EMAIL PROTECTED]
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf