The Milkyway application we are mostly observing this with is
milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe,
which was deployed to their server on 6 Oct 2014, 20:18:34 UTC - the internal
signature says "API_VERSION_6.13.0".
On Saturday, 11 July 2015, 20:55, Jason Groothuis
<[email protected]> wrote:
"Perhaps the exit process has been invoked in the Milkyway app, but not all
consequent OS functions have completed in time."
Correct, since the TerminateProcess() call, which is asynchronous and so returns
immediately without necessarily having done anything, is missing the
WaitForSingleObject() on the current process after it. The process resources
will be cleaned up as part of OS garbage collection *sometime* down the road.
Doubting the accuracy of the MSDN documentation on these functions is fine,
but wondering why it doesn't work as expected when you ignore it is just odd.
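
A minimal sketch of the terminate-then-wait pattern, written in Python for
brevity and not taken from the Milkyway or BOINC sources. On Windows, CPython's
subprocess.Popen.terminate() calls TerminateProcess() and Popen.wait() waits on
the process handle, so the two calls mirror the pair discussed above. The
helper name terminate_and_wait and the 10-second timeout are assumptions made
for illustration, and the sketch shows a parent terminating a child rather than
a process terminating itself.

```python
import subprocess
import sys


def terminate_and_wait(proc, timeout=10.0):
    """Request termination, then block until the process has really exited.

    Hypothetical helper, for illustration only.
    """
    proc.terminate()  # only *requests* termination; may return before exit
    # Without this wait the caller races against OS cleanup of the process.
    return proc.wait(timeout=timeout)


# Start a child that would otherwise sleep for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = terminate_and_wait(child)
print("child exited with code", exit_code)
```

If the wait times out, subprocess.TimeoutExpired is raised, which makes a
termination that never completes visible instead of silently ignored.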
On Saturday, 11 July 2015, 20:09, Jason Groothuis
<[email protected]> wrote:
Not sure how much detail you'd like on the situation. (Can provide much more.)
It's a result of buffered IO implemented in multithreaded C runtimes, in some
situations using deferred procedure calls. Internal helper threads are being
killed before commits are completed.

Least desirable partial workaround (but helps):
- disable buffered IO by linking the application with the MS-supplied
  COMMODE.OBJ

Probably better, but not tested:
- initiate a low-level _commit() and add the missing WaitForSingleObject()
  after the TerminateProcess() call

Best:
- do a low-level _commit() and check that the file modification time has
  updated, then preferably use a friendly means of exit that allows DLL/thread
  cleanup, closing threads/processes using sentinel flags, like while(!done)
  instead of while(1) with kills.

------------------------------------------------------------------------
Jason Richard Groothuis
bSc(compSci)
------------------------------------------------------------------------
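
The "best" option above (commit, check the modification time, exit via a
sentinel flag) could look roughly like the following. This is a sketch in
Python under stated assumptions, not the Milkyway application's code: the file
name, the timings and the one-second timestamp slack are illustrative. Python's
os.fsync() is documented as calling the MS _commit() function on Windows, which
is why it stands in for _commit() here.

```python
import os
import tempfile
import threading
import time

done = threading.Event()  # sentinel flag: "while not done", not "while True"


def worker(path):
    """Write results until asked to stop, then commit them to disk."""
    with open(path, "w") as f:
        f.write("begin\n")
        count = 0
        while not done.is_set():
            f.write(f"result {count}\n")
            count += 1
            time.sleep(0.01)
        f.flush()             # drain the runtime's user-space buffer
        os.fsync(f.fileno())  # ask the OS to commit the data to disk


path = os.path.join(tempfile.mkdtemp(), "stderr.txt")  # illustrative name
started = time.time()
t = threading.Thread(target=worker, args=(path,))
t.start()
time.sleep(0.05)
done.set()  # request a friendly exit instead of killing the thread
t.join()    # wait for the worker's own cleanup to finish

# Check that the commit is visible on disk before the process exits.
info = os.stat(path)
committed = info.st_size > 0 and info.st_mtime >= started - 1.0
print("committed:", committed, "bytes:", info.st_size)
```

The point of the design is that the writer decides when it is finished, so
nothing is killed while it still holds unflushed data.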
> Date: Sat, 11 Jul 2015 11:30:32 -0700
> From: [email protected]
> To: [email protected]; [email protected]
> Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates
> Milkyway tasks
>
> Richard:
> Can you please ask him to set <task_debug> as well?
>
> I have no theories about what could cause this.
> The BOINC client learns that a job is finished when its process has exited,
> and by that time all files are closed and locks released
> (I'm assuming the MW@h app is single-process - is that correct?)
>
> In this case, when the job finishes, the client successfully reads stderr.txt
> (otherwise <stderr_txt> would be absent or there would be an error message)
> but it's empty.
> This would be the case, e.g., if the writing process hadn't exited yet
> and its stderr buffer wasn't flushed.
> But the process has exited.
>
> Anyone have any ideas?
>
> -- David
>
> On 09-Jul-2015 7:42 AM, Richard Haselgrove wrote:
> > User Keith Myers (UID 147145 at
> > http://milkyway.cs.rpi.edu/milkyway/index.php) has asked for my help in
> > identifying task failures at Milkyway.
> >
> > At my suggestion, he installed Windows client v7.6.2, and the attached
> > message log extracts show the enhanced <slot_debug> output that helped
> > identify the CMS-dev problem.
> >
> > In both cases, the task under scrutiny
> >
> > (1) de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0,
> > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181200273
> >
> > (2) ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0,
> > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181298220
> >
> > was declared 'Validate error', and the <stderr_txt> section is empty. In
> > the special case of Milkyway@Home, these two observations are linked,
> > because the science result is returned in stderr, not a separate upload
> > file.
> >
> > Also in both cases, the <slot_debug> log contains
> >
> > [slot] failed to remove file slots/x/stderr.txt: unlink() failed
> >
> > between 'handle_exited_app()' and 'Computation for task ... finished'
> >
> > It appears that there is a race condition, whereby BOINC tries (and fails)
> > to delete stderr.txt before the operating system has released the write
> > lock. This (I'm presuming) also explains why the file appears empty when
> > read off the disk for incorporation into the client_state structure in
> > memory, prior to reporting the completed task to the project.
> >
> > In order to preserve the scientific result at Milkyway (and debug and other
> > useful information at other projects), the client should not initiate
> > 'handle_exited_app()' until it has confirmed that the write lock on
> > stderr.txt has been released.
> >
> >
> > Log 1 also shows that the additional safeguards on cleaning out slots are
> > working properly: if both handle_exited_app() and get_free_slot() fail to
> > delete the file, the next task isn't started in the not-empty slot (11),
> > but in slot 14 instead.
> > And when slot 11 is tested again at the next get_free_slot(), the delete
> > succeeds and the now-empty slot is reused.
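
One way a client could tolerate the delay described in the original report is
to retry the delete for a short while instead of failing on the first attempt.
The following is a sketch in Python, not the BOINC client's code (which is
C++); the function name remove_with_retry, the attempt count and the delay are
hypothetical choices for illustration.

```python
import os
import tempfile
import time


def remove_with_retry(path, attempts=5, delay=0.2):
    """Try to delete path, pausing between attempts if the OS refuses.

    Returns True once the file is gone, False if every attempt failed.
    """
    for attempt in range(attempts):
        try:
            os.remove(path)
            return True
        except FileNotFoundError:
            return True  # already gone, nothing left to do
        except OSError:
            # e.g. PermissionError while another process still holds the file
            if attempt < attempts - 1:
                time.sleep(delay)
    return False


path = os.path.join(tempfile.mkdtemp(), "stderr.txt")  # illustrative name
with open(path, "w") as f:
    f.write("science result\n")
removed = remove_with_retry(path)
print("removed:", removed)
```

A retry loop only hides the symptom on the client side; it does not replace
flushing and waiting in the application itself.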
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.