I checked in a workaround in which the client waits until
stderr.txt is not locked before reading it.
Can people please review this change?
-- David
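
For anyone reviewing, here is a minimal sketch of the wait-then-read idea. It is not the change that was checked in: `wait_until_unlocked`, its parameters, and the `is_locked` probe are hypothetical names. A real probe on Windows might try to open stderr.txt with exclusive access and report whether that failed.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not the BOINC client's actual code: polls `is_locked`
// until it reports false or `max_tries` attempts have been used.
// Returns true if the file was seen unlocked, false if we gave up.
bool wait_until_unlocked(const std::function<bool()>& is_locked,
                         int max_tries,
                         std::chrono::milliseconds delay) {
    assert(max_tries > 0);
    for (int attempt = 0; attempt < max_tries; ++attempt) {
        if (!is_locked()) return true;          // safe to read the file now
        if (attempt + 1 < max_tries) {
            std::this_thread::sleep_for(delay); // give the OS time to release it
        }
    }
    return false;                               // still locked after all tries
}
```

Bounding the number of tries keeps the caller from stalling indefinitely if the lock is never released; what to do in that case (read anyway, or report an error) is left to the caller.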

On 14-Jul-2015 8:54 AM, Richard Haselgrove wrote:
I've finally managed to capture an orphaned stderr.txt file on disk, and marry it up with the 'Validate error' task report at Milkyway.

The copied file on my hard disk has the final few lines that were missing from the version reported to the project.

It'll be easiest to read the full report, with embedded screenshots, at http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3662&postid=63799#63799



On Saturday, 11 July 2015, 22:44, Richard Haselgrove <[email protected]> wrote:



    Here's a double 'Error 32' with both slot debug and task debug active
    throughout.
    This one resulted in a completely blank stderr.txt being reported to the
    project. Task is de_modfit_sum_fast_15_3s_136_sim1Jun1_4_1434554402_8900923_1,
    http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1184736496


        On Saturday, 11 July 2015, 21:35, Jason Groothuis
    <[email protected] <mailto:[email protected]>> wrote:



    Trial correction, this part of boinc_api.cpp, line 778:

        // various platforms have problems shutting down a process
        // while other threads are still executing,
        // or triggering endless exit()/atexit() loops.
        //
        BOINCINFO("Exit Status: %d", status);
        fflush(NULL);
    #if defined(_WIN32)
        // Halt all the threads and clean up.
        TerminateProcess(GetCurrentProcess(), status);
        // note: the above CAN return!
        Sleep(1000);
        DebugBreak();
    #elif defined(__APPLE_CC__)

    Becomes:

        // various platforms have problems shutting down a process
        // while other threads are still executing,
        // or triggering endless exit()/atexit() loops.
        //
        BOINCINFO("Exit Status: %d", status);
        fflush(NULL);
    #if defined(_WIN32)
        // JG: Buffered IO is not committed to disk on flush, so commit it;
        // add other file descriptors if needed
        _commit(_fileno(stderr));
        // Halt all the threads and clean up.
        TerminateProcess(GetCurrentProcess(), status);
        // note: the above CAN return!
        // [JG: It does; it's asynchronous and system dependent, and this
        // thread runs on for some time, so]
        WaitForSingleObject(GetCurrentProcess(), INFINITE); // That's why you do this
        Sleep(1000);   // JG: Will never be reached
        DebugBreak();  // JG: Will never be reached
    #elif defined(__APPLE_CC__)
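
The flush-then-commit step in the proposal could be wrapped up as below. This is only a sketch under stated assumptions: `flush_and_commit` is a hypothetical helper, not part of boinc_api.cpp, and the POSIX branch (`fsync`) is there only so the example builds outside Windows. Note that `_commit()` takes a file descriptor rather than a `FILE*`, hence the `_fileno()` call.

```cpp
#include <cassert>
#include <cstdio>
#if defined(_WIN32)
#include <io.h>      // _commit
#else
#include <unistd.h>  // fsync
#endif

// Hypothetical helper: flush the C runtime's buffer for `f`, then ask the OS
// to commit the file's data to disk. Returns 0 on success, -1 on failure.
int flush_and_commit(std::FILE* f) {
    assert(f != nullptr);
    if (std::fflush(f) != 0) return -1;  // runtime buffer -> OS
#if defined(_WIN32)
    return _commit(_fileno(f));          // OS cache -> disk (Windows CRT)
#else
    return fsync(fileno(f));             // OS cache -> disk (POSIX)
#endif
}
```

Calling `flush_and_commit(stderr)` just before the process is terminated would correspond to the commit line in the proposal. If stderr is a console or pipe rather than a file, the commit may fail, so the return value is probably best treated as advisory.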


    
------------------------------------------------------------------------------------------------------
    Jason Richard Groothuis
    bSc(compSci)

    
------------------------------------------------------------------------------------------------------


    > From: [email protected] <mailto:[email protected]>
    > To: [email protected] <mailto:[email protected]>;
    [email protected] <mailto:[email protected]>;
    [email protected] <mailto:[email protected]>
    > Date: Sun, 12 Jul 2015 05:47:37 +0930
    > Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates
    Milkyway tasks
    >
    > Comparing versions at leisure, but this same issue dates back in slight
    > variations to when I started crunching back in 2007, with different
    > symptoms depending on build characteristics and the afflicted system (OS
    > version, number of cores, and GPU or CPU). You'd be looking for a history of
    > the boinc_exit() function. I don't think it ever had the wait after
    > TerminateProcess(), so IO cancellations are likely, depending on
    > timing/chance.
    >
    >
    
------------------------------------------------------------------------------------------------------
    > Jason Richard Groothuis
    > bSc(compSci)
    >
    >
    
------------------------------------------------------------------------------------------------------
    >
    >
    > Date: Sat, 11 Jul 2015 20:07:56 +0000
    > From: [email protected] <mailto:[email protected]>
    > To: [email protected] <mailto:[email protected]>;
    [email protected] <mailto:[email protected]>;
    [email protected] <mailto:[email protected]>
    > Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates
    Milkyway tasks
    >
    > The Milkyway application we are mostly observing this with is
    > milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe,
    > which was deployed to their server on 6 Oct 2014, 20:18:34 UTC - the internal
    > signature says "API_VERSION_6.13.0"
    >
    >
    >
    >
    >      On Saturday, 11 July 2015, 20:55, Jason Groothuis
    <[email protected] <mailto:[email protected]>> wrote:
    >
    >
    >  "Perhaps the exit process has been invoked in the Milkyway app, but not 
    > all consequent OS functions have completed in time."
    >
    > Correct, since the TerminateProcess() call, which is asynchronous and so
    > returns immediately without necessarily doing anything, is missing the
    > WaitForSingleObject() on the current process after it. The process
    > resources will be cleaned up as part of OS garbage collection *sometime*
    > down the road.
    >
    > Doubting the accuracy of the MSDN documentation on these functions is fine,
    > but wondering why it doesn't work as expected when you ignore it is just odd.
    >
    >
    >      On Saturday, 11 July 2015, 20:09, Jason Groothuis
    > <[email protected] <mailto:[email protected]>> wrote:
    >
    > Not sure how much detail you'd like on the situation. (Can provide much
    > more.) It's a result of buffered IO implemented in multithreaded C runtimes,
    > in some situations using deferred procedure calls. Internal helper threads
    > are being killed before commits are completed.
    >
    > Least desirable partial workaround (but helps):
    > - disable buffered IO by linking the application with the MS-supplied
    >   COMMODE.OBJ
    >
    > Probably better, but not tested:
    > - initiate a low-level _commit() and add the missing WaitForSingleObject()
    >   after the TerminateProcess() call
    >
    > Best:
    > - do a low-level _commit() and check that the file modification time has
    >   updated, then preferably use a friendly means of exit that allows
    >   DLL/thread cleanup, closing threads/processes using sentinel flags, like
    >   while(!done) instead of while(1) with kills.
    >
    > ------------------------------------------------------------------------------------------------------
    > Jason Richard Groothuis
    > bSc(compSci)
    > ------------------------------------------------------------------------------------------------------
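
The "friendly exit" in the last option might look roughly like this. It is a sketch only: `worker`, `run_and_stop`, and the counter are hypothetical, and standard C++ threads stand in for whatever threading the application really uses.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical worker: loops until `done` is set, then returns normally,
// so its thread can be joined instead of being killed mid-write.
void worker(const std::atomic<bool>& done, std::atomic<long>& iterations) {
    while (!done.load()) {
        ++iterations;               // stand-in for one unit of real work
        std::this_thread::yield();
    }
    // any per-thread cleanup (flushing, closing files) would go here
}

// Starts the worker, waits until it has made some progress, requests a stop
// via the sentinel flag, and joins. Returns the number of iterations done.
long run_and_stop() {
    std::atomic<bool> done{false};
    std::atomic<long> iterations{0};
    std::thread t([&] { worker(done, iterations); });
    while (iterations.load() == 0) std::this_thread::yield();
    done.store(true);               // ask the worker to finish
    t.join();                       // wait for it to exit on its own
    return iterations.load();
}
```

Because the thread leaves its loop and returns, any pending writes it owns can complete before the process exits, which is the point of preferring `while(!done)` over `while(1)` plus a kill.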
    > > Date: Sat, 11 Jul 2015 11:30:32 -0700
    > > From: [email protected] <mailto:[email protected]>
    > > To: [email protected] <mailto:[email protected]>;
    > > [email protected] <mailto:[email protected]>
    > > Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates
    > > Milkyway tasks
    > >
    > > Richard:
    > > Can you please ask him to set <task_debug> as well?
    > >
    > > I have no theories about what could cause this.
    > > The BOINC client learns that a job is finished when its process has exited,
    > > and by that time all files are closed and locks released
    > > (I'm assuming the MW@h app is single-process - is that correct?)
    > >
    > > In this case, when the job finishes, the client successfully reads stderr.txt
    > > (otherwise <stderr_txt> would be absent or there would be an error message)
    > > but it's empty.
    > > This would be the case, e.g., if the writing process hadn't exited yet
    > > and its stderr buffer wasn't flushed.
    > > But the process has exited.
    > >
    > > Anyone have any ideas?
    > >
    > > -- David
    > >
    > > On 09-Jul-2015 7:42 AM, Richard Haselgrove wrote:
    > > > User Keith Myers (UID 147145 at
    > > > http://milkyway.cs.rpi.edu/milkyway/index.php) has
    > > > asked for my help in identifying task failures at Milkyway.
    > > >
    > > > At my suggestion, he installed Windows client v7.6.2, and the attached
    > > > message log extracts show the enhanced <slot_debug> output that helped
    > > > identify the CMS-dev problem.
    > > >
    > > > In both cases, the task under scrutiny
    > > >
    > > > (1) de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0,
    > > > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181200273
    > > >
    > > > (2) ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0,
    > > > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181298220
    > > >
    > > > was declared 'Validate error', and the <stderr_txt> section is empty. In the
    > > > special case of Milkyway@Home, these two observations are linked, because
    > > > the science result is returned in stderr, not a separate upload file.
    > > >
    > > > Also in both cases, the <slot_debug> log contains
    > > >
    > > > [slot] failed to remove file slots/x/stderr.txt:  unlink() failed
    > > >
    > > > between 'handle_exited_app()' and 'Computation for task ... finished'
    > > >
    > > > It appears that there is a race condition, whereby BOINC tries (and fails)
    > > > to delete stderr.txt before the operating system has released the write
    > > > lock. This (I'm presuming) also explains why the file appears empty when
    > > > read off the disk for incorporation into the client_state structure in
    > > > memory, prior to reporting the completed task to the project.
    > > >
    > > > In order to preserve the scientific result at Milkyway (and debug and other
    > > > useful information at other projects), the client should not initiate
    > > > 'handle_exited_app()' until it has confirmed that the write lock on
    > > > stderr.txt has been released.
    > > >
    > > >
    > > > Log 1 also shows that the additional safeguards on cleaning out slots are
    > > > working properly: if both handle_exited_app() and get_free_slot() fail to
    > > > delete the file, the next task isn't started in the not-empty slot (11),
    > > > but in slot 14 instead. And when slot 11 is tested again at the next
    > > > get_free_slot(), the delete succeeds and the now-empty slot is reused.
    > > >
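
The slot behaviour described for Log 1 can be sketched as follows. This is an illustration, not BOINC's get_free_slot(): `pick_free_slot` and the `try_clean` callback are hypothetical, with `try_clean(slot)` assumed to return true only when the slot directory is empty after a cleanup attempt.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical sketch: walk the candidate slot numbers in order and return
// the first one whose directory could be emptied, or -1 if none could.
int pick_free_slot(const std::vector<int>& candidates,
                   const std::function<bool(int)>& try_clean) {
    assert(static_cast<bool>(try_clean));
    for (int slot : candidates) {
        if (try_clean(slot)) return slot;  // empty: safe to start a task here
        // cleanup failed (e.g. stderr.txt still locked): skip it this time
    }
    return -1;                             // no usable slot among candidates
}
```

With candidates {11, 14} and slot 11 still holding a locked stderr.txt, this returns 14; once the lock is gone and the cleanup succeeds, a later call returns 11 again, matching the sequence in the log.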
    > > _______________________________________________
    > > boinc_dev mailing list
    > > [email protected] <mailto:[email protected]>
    > > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
    > > To unsubscribe, visit the above URL and
    > > (near bottom of page) enter your email address.
