Craig Tierney wrote:
Gerry Creager wrote:
I've tried to stay out of this. Really, I have.
Craig Tierney wrote:
Mark Hahn wrote:
Sorry to start a flame war....
what part do you think was inflamed?
It was when I was trying to say "Real codes have user-level
checkpointing implemented and no code should ever run for 7
days."
A number of my climate simulations will run for 7-10 days to get
century-long simulations to complete. I've run geodesy simulations
that ran for up to 17 days in the past. I like to think that my codes
are real enough!
NCAR and GFDL run climate simulations for weeks as well. But how long
is any single job allowed to run? 8-12 hours. I can verify
these numbers if needed, but I can guarantee you that no one is allowed
to put their job in for 17 days. With explicit permission they may get
24 hours, but that would be for unique situations.
On the p575, we have similar constraints and I do work within those. In
my lab, I can control access a bit more and have considerably fewer (and
truly grateful) users, so if we need to run "forever" we can implement
that.
Real codes do have user-level checkpointing, though. And even better
codes can be restarted without a lot of user intervention by invoking
a run-time flag and going off for coffee.
You mean there are people who bother to implement checkpointing and
then don't code it like:
if (checkpoint files exist in my directory) then
load checkpoint files
else
start from scratch
end
????
Yes, there are. No, I'm not one of them. My stuff does do a restart if
it stops and finds evidence of a need to continue. However, I've seen
this failure time and time again over the years.
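For what it's worth, a minimal C sketch of that auto-restart pattern --
the file name checkpoint.dat, the step counter, and the commented-out
load/save routines are placeholders, not anybody's actual code:

#include <stdio.h>
#include <unistd.h>

/* placeholder file name and commented-out I/O calls -- substitute your own */
#define CKPT_FILE "checkpoint.dat"

int main(void)
{
    long step = 0;

    /* if a checkpoint is sitting in the working directory, resume from it */
    if (access(CKPT_FILE, R_OK) == 0) {
        /* load_checkpoint(CKPT_FILE, &step);  restore state and step count */
        printf("checkpoint found, restarting from %s\n", CKPT_FILE);
    } else {
        printf("no checkpoint, starting from scratch\n");
    }

    for (; step < 1000000; step++) {
        /* do_timestep(); */
        if (step % 10000 == 0) {
            /* save_checkpoint(CKPT_FILE, step);  dump state every N steps */
        }
    }
    return 0;
}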
Set your queue maximums to 6-8 hours. Prevents system hogging,
encourages checkpointing for long runs. Make sure your IO system
can support the checkpointing because it can create a lot of load.
And how do you support my operational requirements with this policy
during hurricane season? Let's see... "Stop that ensemble run now so
the Monte Carlo chemists can play for awhile, then we'll let you back
on. Don't worry about the timeliness of your simulations. No one
needs a 35-member ensemble for statistical forecasting, anyway." Did
I miss something?
You kick off the users who are not running operational codes because
their work is (probably) not as time-constrained. Also, if you take
so long to get your answer in an operational mode that the answer
doesn't matter anymore, you need a faster computer. If you cannot spit
out a 12-hour hurricane forecast within a couple of hours, I would be
concerned about how valuable the answer is.
Several points in here.
1. Preemption is one approach I finally got the admin to buy into for
forecasting codes.
2. MY operational codes for an individual simulation don't take long to
run, save that we don't do a 12-hour hurricane sim but an 84-hour sim
on the weather side (WRF). The saving grace here is that the nested
grids are not too large, so they can run to completion in a couple of
wall-clock hours.
3. When one starts trying to twiddle initial conditions statistically
to create an ensemble, one then has to run all the ensemble members.
One usually starts with central cases first, especially if one "knows"
which are central and which are peripheral. If one run takes 30 min on
128 processors, and one thinks one needs 57 members run, one exceeds a
wall-clock day (57 x 0.5 hr = 28.5 hours), and needs a bigger, faster
computer, or at least a
bigger queue reservation. If one does this without preemption, one gets
all results back at the end of the hurricane season and declares success
after 3 years of analysis instead of providing data in near real time.
Part of this involves the social engineering required on my campus to
get HPC efforts to work at all... Alas, none of this has to do with backtraces.
gerry
Yeah, we really do that. With boundary-condition munging we can run a
statistical set of simulations and see what the probabilities are and
where, for instance, maximum storm surge is likely to go. If we don't
get sufficient membership in the ensemble, the statistical strength of
the forecasting procedure decreases.
Gerry
part of the reason I got a kick out of this simple backtrace.so
is indeed that it's quite possible to conceive of a checkpoint.so
which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent
job of checkpointing at least serial codes non-intrusively.
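for illustration, roughly the /proc scraping such a checkpoint.so would
start from (serial case only; it just lists what it finds -- actually
dumping and restoring the memory regions and file offsets is the hard
part, and this is a sketch, not anything shipped):

/* enumerate memory maps and open fds via /proc */
#include <stdio.h>
#include <dirent.h>
#include <unistd.h>
#include <limits.h>

int main(void)
{
    char line[4096], path[PATH_MAX], target[PATH_MAX];

    /* memory regions: address range, perms, offset, dev, inode, pathname */
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps) {
        while (fgets(line, sizeof(line), maps))
            fputs(line, stdout);
        fclose(maps);
    }

    /* open file descriptors: each entry is a symlink to the real file */
    DIR *fddir = opendir("/proc/self/fd");
    if (fddir) {
        struct dirent *de;
        while ((de = readdir(fddir)) != NULL) {
            if (de->d_name[0] == '.')
                continue;
            snprintf(path, sizeof(path), "/proc/self/fd/%s", de->d_name);
            ssize_t n = readlink(path, target, sizeof(target) - 1);
            if (n > 0) {
                target[n] = '\0';
                printf("fd %s -> %s\n", de->d_name, target);
            }
        }
        closedir(fddir);
    }
    return 0;
}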
BTW, I like your code. I had a script written for me in the past
(by Greg Lindahl in a galaxy far-far away). The one modification
I would make is to print out the MPI ID environment variable (MPI
flavors vary in how it is set). Then when it crashes, you know which
process actually died.
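Something along these lines would do it, I think -- the environment
variable names below are just the ones I happen to know of (Open MPI,
PMI-based MPICH/Intel MPI, MVAPICH2, Slurm); other flavors and
launchers may use different ones:

/* best-effort guess at this process's MPI rank from the environment;
   the variable names below are not an exhaustive list */
#include <stdio.h>
#include <stdlib.h>

static const char *rank_vars[] = {
    "OMPI_COMM_WORLD_RANK",  /* Open MPI */
    "PMI_RANK",              /* MPICH, Intel MPI and friends */
    "MV2_COMM_WORLD_RANK",   /* MVAPICH2 */
    "SLURM_PROCID",          /* Slurm-launched jobs */
    NULL
};

const char *guess_mpi_rank(void)
{
    for (int i = 0; rank_vars[i]; i++) {
        const char *v = getenv(rank_vars[i]);
        if (v)
            return v;
    }
    return "unknown";
}

int main(void)
{
    /* in the backtrace handler you'd prepend this to each line of output */
    printf("rank %s\n", guess_mpi_rank());
    return 0;
}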
Craig
--
Gerry Creager -- [EMAIL PROTECTED]
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf