OK, off and running with v7.6.6 to test - report later.
Two small observations from the switchover, at restart.
1) another timing glitch
15/07/2015 10:45:22 | Milkyway@Home | Sending scheduler request: To fetch work.
15/07/2015 10:45:25 | Milkyway@Home | Scheduler request completed: got 2 new tasks
15/07/2015 10:46:28 | Milkyway@Home | Sending scheduler request: To fetch work.
15/07/2015 10:46:30 | Milkyway@Home | Not sending work - last request too recent: 59 sec
2) with debug logging and so many tasks, stdoutdae.txt had outgrown its limit.
stdoutdae.old is 70 MB - by my setting, it should have rotated at 50 MB. I
believe there's an outstanding request for logs to rotate while running, not
only at startup.
On Tuesday, 14 July 2015, 19:13, David Anderson <[email protected]>
wrote:
I checked in a workaround in which the client waits until
stderr.txt is not locked before reading it.
Can people please review this change?
-- David
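
The checked-in change itself is not shown in this thread. As a rough sketch of the general idea only (poll until the file is no longer locked, with a time limit), something like the following would do. wait_until_unlocked() and the is_locked callback are hypothetical names; on Windows the callback might, for example, try to open stderr.txt with no sharing and treat a sharing violation (error 32) as "still locked".

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not the actual BOINC change): poll `is_locked` until
// it reports false or `timeout` elapses. Returns true if the file became
// available in time, false if we gave up.
bool wait_until_unlocked(
        const std::function<bool()>& is_locked,
        std::chrono::milliseconds timeout,
        std::chrono::milliseconds poll = std::chrono::milliseconds(100)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (is_locked()) {
        if (std::chrono::steady_clock::now() >= deadline) return false;
        std::this_thread::sleep_for(poll);  // back off before the next probe
    }
    return true;
}
```

The timeout matters: if the lock is never released, the client should still report the task rather than wait forever.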
On 14-Jul-2015 8:54 AM, Richard Haselgrove wrote:
> I've finally managed to capture an orphaned stderr.txt file on disk, and marry it
> up with the 'Validate error' task report at Milkyway.
>
> The copied file on my hard disk has the final few lines that were missing from the
> version reported to the project.
>
> It'll be easiest to read the full report, with embedded screenshots, at
> http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3662&postid=63799#63799
>
>
>
> On Saturday, 11 July 2015, 22:44, Richard Haselgrove
> <[email protected]> wrote:
>
>
>
> Here's a double 'Error 32' with both slot debug and task debug active throughout.
> This one resulted in a completely blank stderr.txt being reported to the project.
> Task is de_modfit_sum_fast_15_3s_136_sim1Jun1_4_1434554402_8900923_1,
> http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1184736496
>
>
> On Saturday, 11 July 2015, 21:35, Jason Groothuis
> <[email protected] <mailto:[email protected]>> wrote:
>
>
>
> Trial correction, this part of boinc_api.cpp, line 778:
>
>     // various platforms have problems shutting down a process
>     // while other threads are still executing,
>     // or triggering endless exit()/atexit() loops.
>     //
>     BOINCINFO("Exit Status: %d", status);
>     fflush(NULL);
> #if defined(_WIN32)
>     // Halt all the threads and clean up.
>     TerminateProcess(GetCurrentProcess(), status);
>     // note: the above CAN return!
>     Sleep(1000);
>     DebugBreak();
> #elif defined(__APPLE_CC__)
>
> Becomes
>
>     // various platforms have problems shutting down a process
>     // while other threads are still executing,
>     // or triggering endless exit()/atexit() loops.
>     //
>     BOINCINFO("Exit Status: %d", status);
>     fflush(NULL);
> #if defined(_WIN32)
>     // JG: Buffered IO is not committed to disk on flush, so commit it;
>     // add other file descriptors if needed
>     _commit(_fileno(stderr));
>     // Halt all the threads and clean up.
>     TerminateProcess(GetCurrentProcess(), status);
>     // note: the above CAN return! [JG: It does; it's asynchronous and
>     // system dependent, and this thread runs on for some time, so:]
>     WaitForSingleObject(GetCurrentProcess(), INFINITE); // That's why you do this
>     Sleep(1000);    // JG: Will never be reached
>     DebugBreak();   // JG: Will never be reached
> #elif defined(__APPLE_CC__)
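
The "flush, then commit" step in the proposal above can be tried in isolation. The sketch below is an illustration under stated assumptions, not BOINC code: flush_and_commit() is a hypothetical helper, and fsync() is used as the non-Windows counterpart of _commit() so that it builds on either platform. Note that both calls take a file descriptor rather than a FILE*.

```cpp
#include <cstdio>
#ifdef _WIN32
#include <io.h>      // _commit, _fileno
#else
#include <unistd.h>  // fsync
#endif

// Hypothetical helper: push a stdio stream's data all the way to disk.
// fflush() only hands the C runtime's buffer to the OS; _commit()/fsync()
// then ask the OS to write its own cached data for that descriptor.
bool flush_and_commit(std::FILE* f) {
    if (std::fflush(f) != 0) return false;
#ifdef _WIN32
    return _commit(_fileno(f)) == 0;
#else
    return fsync(fileno(f)) == 0;
#endif
}
```

Called as flush_and_commit(stderr) just before the process is terminated, this narrows the window in which buffered output can be lost; it does not replace waiting for the process to actually end.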
>
>
>
>------------------------------------------------------------------------------------------------------
> Jason Richard Groothuis
> bSc(compSci)
>
>
>------------------------------------------------------------------------------------------------------
>
>
> > From: [email protected] <mailto:[email protected]>
> > To: [email protected] <mailto:[email protected]>;
> >     [email protected] <mailto:[email protected]>;
> >     [email protected] <mailto:[email protected]>
> > Date: Sun, 12 Jul 2015 05:47:37 +0930
> > Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates Milkyway tasks
> >
> > Comparing versions at leisure, but this same issue dates back, in slight
> > variations, to when I started crunching in 2007, with different symptoms
> > depending on build characteristics and afflicted system (OS version, #cores,
> > and GPU or CPU). You'd be looking for a history of the boinc_exit() function.
> > Don't think it ever had the wait after TerminateProcess(), so IO cancellations
> > are likely, depending on timing/chance.
> >
> >
>
>------------------------------------------------------------------------------------------------------
> > Jason Richard Groothuis
> > bSc(compSci)
> >
> >
>
>------------------------------------------------------------------------------------------------------
> >
> >
> > Date: Sat, 11 Jul 2015 20:07:56 +0000
> > From: [email protected] <mailto:[email protected]>
> > To: [email protected] <mailto:[email protected]>;
> >     [email protected] <mailto:[email protected]>;
> >     [email protected] <mailto:[email protected]>
> > Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates Milkyway tasks
> >
> > The Milkyway application we are mostly observing this with is
> > milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe,
> > which was deployed to their server on 6 Oct 2014, 20:18:34 UTC - the internal
> > signature says "API_VERSION_6.13.0"
> >
> >
> >
> >
> > On Saturday, 11 July 2015, 20:55, Jason Groothuis
> > <[email protected] <mailto:[email protected]>> wrote:
> >
> >
> > "Perhaps the exit process has been invoked in the Milkyway app, but not all
> > consequent OS functions have completed in time."
> >
> > Correct, since the TerminateProcess() call, which is asynchronous and so
> > returns immediately without necessarily doing anything, is missing the
> > WaitForSingleObject() on the current process after it. The process resources
> > will be cleaned up as part of OS garbage collection *sometime* down the road.
> >
> > Doubting the accuracy of the MSDN documentation on these functions is fine,
> > but wondering why it doesn't work as expected when you ignore it is just odd.
> >
> > On Saturday, 11 July 2015, 20:09, Jason Groothuis
> > <[email protected] <mailto:[email protected]>> wrote:
> >
> > Not sure how much detail you'd like on the situation. (Can provide much more.)
> > It's a result of buffered IO implemented in multithreaded C runtimes, in some
> > situations using deferred procedure calls. Internal helper threads are being
> > killed before commits are completed.
> >
> > Least desirable partial workaround (but helps):
> > - disable buffered IO by linking the application with the MS-supplied
> >   COMMODE.OBJ
> > Probably better, but not tested:
> > - initiate a low-level _commit() and add the missing WaitForSingleObject()
> >   after the TerminateProcess() call
> > Best:
> > - do a low-level _commit() and check that the file modification time has
> >   updated, then preferably use a friendly means of exit that allows DLL/thread
> >   cleanup, closing threads/processes using sentinel flags, like while(!done)
> >   instead of while(1) with kills.
> >
> > ------------------------------------------------------------------------------------------------------
> > Jason Richard Groothuis
> > bSc(compSci)
> > ------------------------------------------------------------------------------------------------------
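
The "sentinel flag" style of shutdown mentioned in the last option above can be sketched as follows. This is a generic illustration, not code from BOINC or the Milkyway application; the Worker class and its members are hypothetical. The point is that the thread leaves its loop and returns normally, so pending I/O and cleanup get a chance to run, instead of being killed mid-write.

```cpp
#include <atomic>
#include <thread>

// Hypothetical worker that loops until asked to stop, then exits cleanly.
class Worker {
public:
    Worker() : thread_([this] { run(); }) {}
    ~Worker() { stop(); }

    // Raise the sentinel flag and wait for the thread to finish its loop.
    void stop() {
        done_.store(true);
        if (thread_.joinable()) thread_.join();
    }

    unsigned long iterations() const { return iterations_.load(); }

private:
    void run() {
        while (!done_.load()) {  // while(!done) rather than while(1)
            ++iterations_;       // stand-in for one unit of real work
            std::this_thread::yield();
        }
        // Flush and close output here, before the thread returns.
    }

    std::atomic<bool> done_{false};
    std::atomic<unsigned long> iterations_{0};
    std::thread thread_;  // declared last so the flags exist before it starts
};
```

After stop() returns, the thread has been joined, so the process can exit through the normal path without relying on forced termination.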
> > > Date: Sat, 11 Jul 2015 11:30:32 -0700
> > > From: [email protected] <mailto:[email protected]>
> > > To: [email protected] <mailto:[email protected]>;
> > >     [email protected] <mailto:[email protected]>
> > > Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates Milkyway tasks
> > >
> > > Richard:
> > > Can you please ask him to set <task_debug> as well?
> > >
> > > I have no theories about what could cause this.
> > > The BOINC client learns that a job is finished when its process has exited,
> > > and by that time all files are closed and locks released
> > > (I'm assuming the MW@h app is single-process - is that correct?)
> > >
> > > In this case, when the job finishes, the client successfully reads stderr.txt
> > > (otherwise <stderr_txt> would be absent or there would be an error message)
> > > but it's empty.
> > > This would be the case, e.g., if the writing process hadn't exited yet
> > > and its stderr buffer wasn't flushed.
> > > But the process has exited.
> > >
> > > Anyone have any ideas?
> > >
> > > -- David
> > >
> > > On 09-Jul-2015 7:42 AM, Richard Haselgrove wrote:
> > > > User Keith Myers (UID 147145 at http://milkyway.cs.rpi.edu/milkyway/index.php) has
> > > > asked for my help in identifying task failures at Milkyway.
> > > >
> > > > At my suggestion, he installed Windows client v7.6.2, and the attached message log
> > > > extracts show the enhanced <slot_debug> output that helped identify the CMS-dev
> > > > problem.
> > > >
> > > > In both cases, the task under scrutiny
> > > >
> > > > (1) de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0,
> > > > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181200273
> > > >
> > > > (2) ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0,
> > > > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181298220
> > > >
> > > > was declared 'Validate error', and the <stderr_txt> section is empty. In the
> > > > special case of Milkyway@Home, these two observations are linked, because the
> > > > science result is returned in stderr, not a separate upload file.
> > > >
> > > > Also in both cases, the <slot_debug> log contains
> > > >
> > > > [slot] failed to remove file slots/x/stderr.txt: unlink() failed
> > > >
> > > > between 'handle_exited_app()' and 'Computation for task ... finished'
> > > >
> > > > It appears that there is a race condition, whereby BOINC tries (and fails) to
> > > > delete stderr.txt before the operating system has released the write lock. This
> > > > (I'm presuming) also explains why the file appears empty when read off the disk
> > > > for incorporation into the client_state structure in memory, prior to reporting
> > > > the completed task to the project.
> > > >
> > > > In order to preserve the scientific result at Milkyway (and debug and other
> > > > useful information at other projects), the client should not initiate
> > > > 'handle_exited_app()' until it has confirmed that the write lock on stderr.txt has
> > > > been released.
> > > >
> > > >
> > > > Log 1 also shows that the additional safeguards on cleaning out slots are working
> > > > properly: if both handle_exited_app() and get_free_slot() fail to delete the file,
> > > > the next task isn't started in the not-empty slot (11), but in slot 14 instead.
> > > > And when slot 11 is tested again at the next get_free_slot(), the delete succeeds
> > > > and the now-empty slot is reused.
> > > >
> > > _______________________________________________
> > > boinc_dev mailing list
> > > [email protected] <mailto:[email protected]>
> > > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> > > To unsubscribe, visit the above URL and
> > > (near bottom of page) enter your email address.