Trial correction. This part of boinc_api.cpp, line 778:

    // various platforms have problems shutting down a process
    // while other threads are still executing,
    // or triggering endless exit()/atexit() loops.
    //
    BOINCINFO("Exit Status: %d", status);
    fflush(NULL);

#if defined(_WIN32)
    // Halt all the threads and clean up.
    TerminateProcess(GetCurrentProcess(), status);
    // note: the above CAN return!
    Sleep(1000);
    DebugBreak();
#elif defined(__APPLE_CC__)

Becomes:

    // various platforms have problems shutting down a process
    // while other threads are still executing,
    // or triggering endless exit()/atexit() loops.
    //
    BOINCINFO("Exit Status: %d", status);
    fflush(NULL);

#if defined(_WIN32)
    // JG: Buffered IO is not committed to disk on flush, so commit it;
    // add other file descriptors if needed.
    _commit(_fileno(stderr));
    // Halt all the threads and clean up.
    TerminateProcess(GetCurrentProcess(), status);
    // note: the above CAN return!
    // JG: It does. It's asynchronous, and how long this thread keeps
    // running afterwards is system dependent, so:
    WaitForSingleObject(GetCurrentProcess(), INFINITE);  // That's why you do this
    Sleep(1000);     // JG: Will never be reached
    DebugBreak();    // JG: Will never be reached
#elif defined(__APPLE_CC__)

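For anyone who wants to try the flush-then-commit step outside the BOINC tree, here is a minimal standalone sketch. The name commit_stream is hypothetical (it is not part of the BOINC API). On Windows it calls _commit() on the stream's file descriptor; elsewhere it uses fsync() as a stand-in so the same helper can be tested on other platforms. It assumes the stream is backed by a regular file, such as a stderr that has been redirected to stderr.txt.

```cpp
#include <cstdio>
#if defined(_WIN32)
#include <io.h>       // _commit, _fileno
#else
#include <unistd.h>   // fsync
#endif

// Hypothetical helper, not part of the BOINC API.
// Step 1: push the C runtime's buffer for this stream to the OS.
// Step 2: ask the OS to commit the file contents to disk.
// Returns 0 on success, -1 on failure.
int commit_stream(FILE* f) {
    if (f == nullptr) return -1;
    if (std::fflush(f) != 0) return -1;
#if defined(_WIN32)
    return _commit(_fileno(f));
#else
    return fsync(fileno(f));
#endif
}
```

In the exit path this would be called as commit_stream(stderr) before the process is terminated. If the stream is a console or a pipe rather than a file, the commit step may report failure, so the return value is worth checking rather than assuming.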
------------------------------------------------------------------------------------------------------
Jason Richard Groothuis
bSc(compSci)
------------------------------------------------------------------------------------------------------
> From: [email protected]
> To: [email protected]; [email protected];
> [email protected]
> Date: Sun, 12 Jul 2015 05:47:37 +0930
> Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates
> Milkyway tasks
>
> Comparing versions at leisure, but this same issue dates back, in slight
> variations, to when I started crunching in 2007, with different symptoms
> depending on build characteristics and the afflicted system (OS version,
> number of cores, and GPU or CPU). You'd be looking for a history of the
> boinc_exit() function. Don't think it ever had the wait after
> TerminateProcess(), so IO cancellations are likely, depending on
> timing/chance.
>
> ------------------------------------------------------------------------------------------------------
> Jason Richard Groothuis
> bSc(compSci)
>
> ------------------------------------------------------------------------------------------------------
>
>
> Date: Sat, 11 Jul 2015 20:07:56 +0000
> From: [email protected]
> To: [email protected]; [email protected];
> [email protected]
> Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates
> Milkyway tasks
>
> The Milkyway application we are mostly observing this with is
> milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe,
> which was deployed to their server on 6 Oct 2014, 20:18:34 UTC - the internal
> signature says "API_VERSION_6.13.0"
>
>
>
>
> On Saturday, 11 July 2015, 20:55, Jason Groothuis
> <[email protected]> wrote:
>
>
> "Perhaps the exit process has been invoked in the Milkyway app, but not all
> consequent OS functions have completed in time."
>
> Correct, since the TerminateProcess() call, which is asynchronous and so
> returns immediately without necessarily doing anything, is missing the
> WaitForSingleObject() on the current process after it. The process resources
> will be cleaned up as part of OS garbage collection *sometime* down the road.
>
> Doubting the accuracy of the MSDN documentation on these functions is fine,
> but wondering why it doesn't work as expected when you ignore it is just odd.
>
>
> On Saturday, 11 July 2015, 20:09, Jason Groothuis <[email protected]>
> wrote:
>
> Not sure how much detail you'd like on the situation. (Can provide much
> more.) It's a result of buffered IO implemented in multithreaded C runtimes,
> in some situations using deferred procedure calls. Internal helper threads
> are being killed before commits are completed.
>
> Least desirable partial workaround (but helps):
> - disable buffered IO by linking the application with the MS-supplied
>   COMMODE.OBJ
>
> Probably better, but not tested:
> - initiate a low-level _commit() and add the missing WaitForSingleObject()
>   after the TerminateProcess() call
>
> Best:
> - do a low-level _commit() and check that the file modification time has
>   updated, then preferably use a friendly means of exit that allows
>   DLL/thread cleanup, closing threads/processes using sentinel flags, like
>   while(!done) instead of while(1) with kills.
>
> ------------------------------------------------------------------------------------------------------
> Jason Richard Groothuis
> bSc(compSci)
> ------------------------------------------------------------------------------------------------------
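To make the "sentinel flags, like while(!done)" suggestion above concrete, here is a rough sketch using standard C++ threads. The Worker class and its members are hypothetical names for illustration only; nothing here is taken from the Milkyway application or the BOINC API. The point is only that the loop checks a flag and then runs its own cleanup, instead of spinning in while(1) until it is killed.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical worker, for illustration only.
class Worker {
public:
    // Launch the worker loop on its own thread.
    void start() { thread_ = std::thread([this] { run(); }); }

    // Ask the loop to finish, then wait until it actually has.
    void stop() {
        done_.store(true);
        if (thread_.joinable()) thread_.join();
    }

    // True once the loop has left normally and run its cleanup.
    bool cleaned_up() const { return cleaned_up_.load(); }

    ~Worker() { stop(); }

private:
    void run() {
        while (!done_.load()) {
            // Stand-in for one slice of real work.
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        // Reached only on an orderly exit: flush and commit output,
        // release resources, and so on.
        cleaned_up_.store(true);
    }

    std::atomic<bool> done_{false};
    std::atomic<bool> cleaned_up_{false};
    std::thread thread_;
};
```

A caller would construct a Worker, call start(), and call stop() before leaving main(), so every thread has finished its own cleanup by the time the process exits.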
>
> > Date: Sat, 11 Jul 2015 11:30:32 -0700
> > From: [email protected]
> > To: [email protected]; [email protected]
> > Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates
> > Milkyway tasks
> >
> > Richard:
> > Can you please ask him to set <task_debug> as well?
> >
> > I have no theories about what could cause this.
> > The BOINC client learns that a job is finished when its process has exited,
> > and by that time all files are closed and locks released
> > (I'm assuming the MW@h app is single-process - is that correct?)
> >
> > In this case, when the job finishes, the client successfully reads
> > stderr.txt (otherwise <stderr_txt> would be absent or there would be an
> > error message) but it's empty.
> > This would be the case, e.g., if the writing process hadn't exited yet
> > and its stderr buffer wasn't flushed.
> > But the process has exited.
> >
> > Anyone have any ideas?
> >
> > -- David
> >
> > On 09-Jul-2015 7:42 AM, Richard Haselgrove wrote:
> > > User Keith Myers (UID 147145 at
> > > http://milkyway.cs.rpi.edu/milkyway/index.php) has asked for my help in
> > > identifying task failures at Milkyway.
> > >
> > > At my suggestion, he installed Windows client v7.6.2, and the attached
> > > message log extracts show the enhanced <slot_debug> output that helped
> > > identify the CMS-dev problem.
> > >
> > > In both cases, the task under scrutiny
> > >
> > > (1) de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0,
> > > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181200273
> > >
> > > (2) ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0,
> > > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181298220
> > >
> > > was declared 'Validate error', and the <stderr_txt> section is empty. In
> > > the special case of Milkyway@Home, these two observations are linked,
> > > because the science result is returned in stderr, not a separate upload
> > > file.
> > >
> > > Also in both cases, the <slot_debug> log contains
> > >
> > > [slot] failed to remove file slots/x/stderr.txt: unlink() failed
> > >
> > > between 'handle_exited_app()' and 'Computation for task ... finished'
> > >
> > > It appears that there is a race condition, whereby BOINC tries (and
> > > fails) to delete stderr.txt before the operating system has released the
> > > write lock. This (I'm presuming) also explains why the file appears empty
> > > when read off the disk for incorporation into the client_state structure
> > > in memory, prior to reporting the completed task to the project.
> > >
> > > In order to preserve the scientific result at Milkyway (and debug and
> > > other useful information at other projects), the client should not
> > > initiate 'handle_exited_app()' until it has confirmed that the write lock
> > > on stderr.txt has been released.
> > >
> > >
> > > Log 1 also shows that the additional safeguards on cleaning out slots are
> > > working properly: if both handle_exited_app() and get_free_slot() fail to
> > > delete the file, the next task isn't started in the not-empty slot (11),
> > > but in slot 14 instead. And when slot 11 is tested again at the next
> > > get_free_slot(), the delete succeeds and the now-empty slot is reused.
> > >
> > _______________________________________________
> > boinc_dev mailing list
> > [email protected]
> > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> > To unsubscribe, visit the above URL and
> > (near bottom of page) enter your email address.
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.