I'm pleased to confirm that in something over 36 hours of continuous running since
installing v7.6.6, I've now completed over 2,570 Milkyway tasks without a single
new 'validate error'. Despite David's initial scepticism, it does indeed appear
that the file write wasn't being completed before the BOINC client tried to read
it - and with the additional guard timing, the completed file has been reported
every time (I would have expected ~50 errors by now with the previous client).
I'll maybe confirm by regressing to v7.6.3 tomorrow and seeing if the errors
return, but apart from that I think we can declare another 'case closed'.
On Wednesday, 15 July 2015, 10:58, Richard Haselgrove
<[email protected]> wrote:
OK, off and running with v7.6.6 to test - report later.
Two small observations from the switchover, at restart.
1) another timing glitch
15/07/2015 10:45:22 | Milkyway@Home | Sending scheduler request: To fetch work.
15/07/2015 10:45:25 | Milkyway@Home | Scheduler request completed: got 2 new tasks
15/07/2015 10:46:28 | Milkyway@Home | Sending scheduler request: To fetch work.
15/07/2015 10:46:30 | Milkyway@Home | Not sending work - last request too recent: 59 sec
2) with debug logging and so many tasks, stdoutdae.txt had outgrown its
limit.
stdoutdae.old is 70 MB - by my setting, it should have rotated at 50 MB. I
believe there's an outstanding request for logs to rotate while running, not
only at startup.
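
A minimal sketch of what a size check during normal running could look like, so the log rotates when it reaches the limit rather than only at startup. The function name, its parameters and the idea of calling it periodically are assumptions for illustration; this is not BOINC's logging code.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical helper, not taken from the BOINC client.
// If `log_path` has reached `max_bytes`, move it to `old_path`
// (replacing any previous copy). Returns true if a rotation happened.
bool rotate_if_oversized(const std::string& log_path,
                         const std::string& old_path,
                         long long max_bytes) {
    assert(max_bytes > 0);
    // Open at the end of the file so tellg() reports its size.
    std::ifstream in(log_path.c_str(), std::ios::binary | std::ios::ate);
    if (!in) return false;                 // no log yet: nothing to rotate
    const long long size = static_cast<long long>(in.tellg());
    in.close();                            // release our handle before renaming
    if (size < max_bytes) return false;
    std::remove(old_path.c_str());         // rename() does not overwrite on Windows
    return std::rename(log_path.c_str(), old_path.c_str()) == 0;
}
```

The caller would need to close its own log stream before calling this and reopen it afterwards, since Windows will not rename a file that is still open for writing.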
On Tuesday, 14 July 2015, 19:13, David Anderson <[email protected]> wrote:
I checked in a workaround in which the client waits until
stderr.txt is not locked before reading it.
Can people please review this change?
-- David
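
For readers following the thread, the general shape of a "wait until not locked" guard can be sketched as a bounded polling loop. This is an illustration only, not the change that was checked in: the helper name, the timeout and the caller-supplied `is_locked` predicate are all assumptions, because the message does not say how the client tests for the lock.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not the BOINC client's actual code.
// Polls `is_locked` until it reports false or `timeout` elapses.
// Returns true if the file became available in time, false on timeout.
bool wait_until_unlocked(const std::function<bool()>& is_locked,
                         std::chrono::milliseconds timeout,
                         std::chrono::milliseconds poll_interval) {
    assert(poll_interval.count() > 0);
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (is_locked()) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // still locked: the caller decides what to do next
        }
        std::this_thread::sleep_for(poll_interval);
    }
    return true;
}
```

Bounding the wait matters: if the lock is never released, the caller gets a clear failure instead of hanging.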
On 14-Jul-2015 8:54 AM, Richard Haselgrove wrote:
> I've finally managed to capture an orphaned stderr.txt file on disk, and marry it
> up with the 'Validate error' task report at Milkyway.
>
> The copied file on my hard disk has the final few lines that were missing from the
> version reported to the project.
>
> It'll be easiest to read the full report, with embedded screenshots, at
>
> http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3662&postid=63799#63799
>
>
>
> On Saturday, 11 July 2015, 22:44, Richard Haselgrove <[email protected]> wrote:
>
>
>
> Here's a double 'Error 32' with both slot debug and task debug active throughout.
> This one resulted in a completely blank stderr.txt being reported to the project.
> Task is de_modfit_sum_fast_15_3s_136_sim1Jun1_4_1434554402_8900923_1,
> http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1184736496
>
>
> On Saturday, 11 July 2015, 21:35, Jason Groothuis <[email protected]> wrote:
>
>
>
> Trial correction, this part of boinc_api.cpp, line 778:
>
>     // various platforms have problems shutting down a process
>     // while other threads are still executing,
>     // or triggering endless exit()/atexit() loops.
>     //
>     BOINCINFO("Exit Status: %d", status);
>     fflush(NULL);
> #if defined(_WIN32)
>     // Halt all the threads and clean up.
>     TerminateProcess(GetCurrentProcess(), status);
>     // note: the above CAN return!
>     Sleep(1000);
>     DebugBreak();
> #elif defined(__APPLE_CC__)
>
> Becomes:
>
>     // various platforms have problems shutting down a process
>     // while other threads are still executing,
>     // or triggering endless exit()/atexit() loops.
>     //
>     BOINCINFO("Exit Status: %d", status);
>     fflush(NULL);
> #if defined(_WIN32)
>     // JG: Buffered IO is not committed to disk on flush, so commit it;
>     // add other file descriptors if needed
>     _commit(_fileno(stderr));
>     // Halt all the threads and clean up.
>     TerminateProcess(GetCurrentProcess(), status);
>     // note: the above CAN return! [JG: It does. It's asynchronous and system
>     // dependent, so this thread runs on for some time.]
>     WaitForSingleObject(GetCurrentProcess(), INFINITE); // JG: That's why you do this
>     Sleep(1000);   // JG: Will never be reached
>     DebugBreak();  // JG: Will never be reached
> #elif defined(__APPLE_CC__)
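
The commit step in the trial correction above can be isolated into a small helper, shown here as a sketch rather than as a proposed patch. The name `commit_stream` is hypothetical. It relies on `_commit()` taking a file descriptor (hence `_fileno()`), and uses `fsync()` as the POSIX counterpart so the same idea can be tried outside Windows.

```cpp
#include <cassert>
#include <cstdio>
#ifdef _WIN32
#include <io.h>       // _commit
#else
#include <unistd.h>   // fsync
#endif

// Hypothetical helper, not part of boinc_api.cpp.
// Pushes a stdio stream's buffered data to the OS, then asks the OS
// to write it to disk. Returns true if both steps succeed.
bool commit_stream(std::FILE* stream) {
    assert(stream != nullptr);
    if (std::fflush(stream) != 0) return false;  // C runtime buffer -> OS
#ifdef _WIN32
    return _commit(_fileno(stream)) == 0;        // OS cache -> disk
#else
    return fsync(fileno(stream)) == 0;           // POSIX counterpart
#endif
}
```

Calling something like `commit_stream(stderr)` just before the process terminates is the point of the trial correction: the data is on disk before any helper thread can be killed.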
>
>
>
> ------------------------------------------------------------
> Jason Richard Groothuis
> bSc(compSci)
> ------------------------------------------------------------
>
>
> > From: [email protected]
> > To: [email protected]; [email protected]; [email protected]
> > Date: Sun, 12 Jul 2015 05:47:37 +0930
> > Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates Milkyway tasks
> >
> > Comparing versions at leisure, but this same issue dates back, in slight
> > variations, to when I started crunching in 2007, with different symptoms
> > depending on build characteristics and afflicted system (OS version, #cores,
> > and GPU or CPU). You'd be looking for a history of the boinc_exit() function.
> > Don't think it ever had the wait after TerminateProcess(), so IO cancellations
> > are likely, depending on timing/chance.
> >
> >
> >
> >
> > Date: Sat, 11 Jul 2015 20:07:56 +0000
> > From: [email protected]
> > To: [email protected]; [email protected]; [email protected]
> > Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates Milkyway tasks
> >
> > The Milkyway application we are mostly observing this with is
> > milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe,
> > which was deployed to their server on 6 Oct 2014, 20:18:34 UTC - the internal
> > signature says "API_VERSION_6.13.0"
> >
> >
> >
> >
> > On Saturday, 11 July 2015, 20:55, Jason Groothuis <[email protected]> wrote:
> >
> > "Perhaps the exit process has been invoked in the Milkyway app, but not all
> > consequent OS functions have completed in time."
> >
> > Correct, since the TerminateProcess() call, which is asynchronous and so
> > returns immediately without necessarily doing anything, is missing the
> > WaitForSingleObject() on the current process after it. The process resources
> > will be cleaned up as part of OS garbage collection *sometime* down the road.
> >
> > Doubting the accuracy of the MSDN documentation on these functions is fine,
> > but wondering why it doesn't work as expected when you ignore it is just odd.
> >
> > On Saturday, 11 July 2015, 20:09, Jason Groothuis <[email protected]> wrote:
> >
> > Not sure how much detail you'd like on the situation. (Can provide much more.)
> > It's a result of buffered IO implemented in multithreaded C runtimes, in some
> > situations using deferred procedure calls. Internal helper threads are being
> > killed before commits are completed.
> >
> > Least desirable partial workaround (but helps):
> > - disable buffered IO by linking the application with the MS-supplied
> >   COMMODE.OBJ
> >
> > Probably better, but not tested:
> > - initiate a low-level _commit() and add the missing WaitForSingleObject()
> >   after the TerminateProcess() call
> >
> > Best:
> > - do a low-level _commit() and check that the file modification time has
> >   updated, then preferably use a friendly means of exit that allows DLL/thread
> >   cleanup, closing threads/processes using sentinel flags, like while(!done)
> >   instead of while(1) with kills.
> >
> > > Date: Sat, 11 Jul 2015 11:30:32 -0700
> > > From: [email protected]
> > > To: [email protected]; [email protected]
> > > Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates Milkyway tasks
> > >
> > > Richard:
> > > Can you please ask him to set <task_debug> as well?
> > >
> > > I have no theories about what could cause this.
> > > The BOINC client learns that a job is finished when its process has exited,
> > > and by that time all files are closed and locks released
> > > (I'm assuming the MW@h app is single-process - is that correct?)
> > >
> > > In this case, when the job finishes, the client successfully reads stderr.txt
> > > (otherwise <stderr_txt> would be absent or there would be an error message)
> > > but it's empty.
> > > This would be the case, e.g., if the writing process hadn't exited yet
> > > and its stderr buffer wasn't flushed.
> > > But the process has exited.
> > >
> > > Anyone have any ideas?
> > >
> > > -- David
> > >
> > > On 09-Jul-2015 7:42 AM, Richard Haselgrove wrote:
> > > > User Keith Myers (UID 147145 at
> > > > http://milkyway.cs.rpi.edu/milkyway/index.php) has asked for my help in
> > > > identifying task failures at Milkyway.
> > > >
> > > > At my suggestion, he installed Windows client v7.6.2, and the attached
> > > > message log extracts show the enhanced <slot_debug> output that helped
> > > > identify the CMS-dev problem.
> > > >
> > > > In both cases, the task under scrutiny
> > > >
> > > > (1) de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0,
> > > > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181200273
> > > >
> > > > (2) ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0,
> > > > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181298220
> > > >
> > > > was declared 'Validate error', and the <stderr_txt> section is empty. In
> > > > the special case of Milkyway@Home, these two observations are linked,
> > > > because the science result is returned in stderr, not a separate upload
> > > > file.
> > > >
> > > > Also in both cases, the <slot_debug> log contains
> > > >
> > > > [slot] failed to remove file slots/x/stderr.txt: unlink() failed
> > > >
> > > > between 'handle_exited_app()' and 'Computation for task ... finished'
> > > >
> > > > It appears that there is a race condition, whereby BOINC tries (and fails)
> > > > to delete stderr.txt before the operating system has released the write
> > > > lock. This (I'm presuming) also explains why the file appears empty when
> > > > read off the disk for incorporation into the client_state structure in
> > > > memory, prior to reporting the completed task to the project.
> > > >
> > > > In order to preserve the scientific result at Milkyway (and debug and
> > > > other useful information at other projects), the client should not
> > > > initiate 'handle_exited_app()' until it has confirmed that the write lock
> > > > on stderr.txt has been released.
> > > >
> > > > Log 1 also shows that the additional safeguards on cleaning out slots are
> > > > working properly: if both handle_exited_app() and get_free_slot() fail to
> > > > delete the file, the next task isn't started in the not-empty slot (11),
> > > > but in slot 14 instead.
> > > >
> > > > And when slot 11 is tested again at the next get_free_slot(), the delete
> > > > succeeds and the now-empty slot is reused.
> > >
> > > _______________________________________________
> > > boinc_dev mailing list
> > > [email protected]
> > > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> > > To unsubscribe, visit the above URL and
> > > (near bottom of page) enter your email address.