I guess nothing that Windows does should surprise me anymore.
-- David

On 16-Jul-2015 4:13 PM, Richard Haselgrove wrote:
I'm pleased to confirm that in something over 36 hours of continuous running since installing v7.6.6, I've now completed over 2,570 Milkyway tasks without a single new 'validate error'. Despite David's initial scepticism, it does indeed appear that the file write wasn't being completed before the BOINC client tried to read it - and with the additional guard timing, the completed file has been reported every time (I would have expected ~50 errors by now with the previous client).

I'll maybe confirm by regressing to v7.6.3 tomorrow and seeing if the errors return, but apart from that I think we can declare another 'case closed'.



On Wednesday, 15 July 2015, 10:58, Richard Haselgrove <[email protected]> wrote:



    OK, off and running with v7.6.6 to test - report later.
    Two small observations from the switchover, at restart.
    1) another timing glitch
    15/07/2015 10:45:22 | Milkyway@Home | Sending scheduler request: To fetch work.
    15/07/2015 10:45:25 | Milkyway@Home | Scheduler request completed: got 2 new tasks
    15/07/2015 10:46:28 | Milkyway@Home | Sending scheduler request: To fetch work.
    15/07/2015 10:46:30 | Milkyway@Home | Not sending work - last request too recent: 59 sec
    2) with debug logging and so many tasks, stdoutdae.txt had outgrown its limit.
    stdoutdae.old is 70 MB - by my setting, it should have rotated at 50 MB. I
    believe there's an outstanding request for logs to rotate while running, not
    only at startup.
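
The request for rotation while running could look roughly like the sketch below. It is only an illustration under stated assumptions: `rotate_if_too_big` is a hypothetical helper, not BOINC code, and it assumes the caller invokes it periodically and reopens its log handle afterwards.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper (not part of BOINC): if `log` has grown past
// `limit_bytes`, move it to `old_log` (replacing any previous copy) and
// leave a fresh, empty `log` behind. Returns true if a rotation happened.
bool rotate_if_too_big(const fs::path& log, const fs::path& old_log,
                       std::uintmax_t limit_bytes) {
    assert(limit_bytes > 0);
    std::error_code ec;
    const std::uintmax_t size = fs::file_size(log, ec);
    if (ec || size <= limit_bytes) {
        return false;  // log missing, unreadable, or still under the limit
    }
    fs::remove(old_log, ec);       // a missing old log is not an error
    fs::rename(log, old_log, ec);  // can fail if another process holds the file
    if (ec) {
        return false;
    }
    std::ofstream fresh(log);      // recreate an empty log for later writes
    return fresh.is_open();
}
```

With a 50 MB setting, a periodic call such as `rotate_if_too_big("stdoutdae.txt", "stdoutdae.old", 50u * 1024 * 1024)` would catch the overflow without waiting for a restart.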


        On Tuesday, 14 July 2015, 19:13, David Anderson <[email protected]> wrote:



    I checked in a workaround in which the client waits until
    stderr.txt is not locked before reading it.
    Can people please review this change?
    -- David
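
The idea of waiting until the file is no longer locked can be sketched as a bounded retry loop. This is not the actual change that was checked in. `wait_until_ready` and its parameters are hypothetical, and the real check for "is stderr.txt still locked?" is left to the `probe` callback (on Windows it might try to open the file and treat a sharing violation as "still locked", which is an assumption here).

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not the actual BOINC workaround): call `probe` until
// it reports success or `max_attempts` is used up, sleeping `delay` between
// tries. Returns the number of attempts used on success, or 0 on failure.
int wait_until_ready(const std::function<bool()>& probe, int max_attempts,
                     std::chrono::milliseconds delay) {
    assert(max_attempts > 0);
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (probe()) {
            return attempt;  // file is readable; safe to go on and read it
        }
        if (attempt < max_attempts) {
            std::this_thread::sleep_for(delay);  // give the OS time to release the lock
        }
    }
    return 0;  // still locked after all attempts; caller decides what to do
}
```

Bounding the number of attempts keeps the client from stalling forever if the lock is never released.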

    On 14-Jul-2015 8:54 AM, Richard Haselgrove wrote:
    > I've finally managed to capture an orphaned stderr.txt file on disk, and marry it
    > up with the 'Validate error' task report at Milkyway.
    >
    > The copied file on my hard disk has the final few lines that were missing from the
    > version reported to the project.
    >
    > It'll be easiest to read the full report, with embedded screenshots, at
    > http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3662&postid=63799#63799
    >
    >
    >
    > On Saturday, 11 July 2015, 22:44, Richard Haselgrove <[email protected]> wrote:
    >
    >
    >
    >    Here's a double 'Error 32' with both slot debug and task debug active throughout.
    >    This one resulted in a completely blank stderr.txt being reported to the project.
    >    Task is de_modfit_sum_fast_15_3s_136_sim1Jun1_4_1434554402_8900923_1,
    >    http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1184736496
    >
    >
    >        On Saturday, 11 July 2015, 21:35, Jason Groothuis <[email protected]> wrote:
    >
    >
    >
    >    Trial correction, this part of boinc_api.cpp, line 778:
    >
    >        // various platforms have problems shutting down a process
    >        // while other threads are still executing,
    >        // or triggering endless exit()/atexit() loops.
    >        //
    >        BOINCINFO("Exit Status: %d", status);
    >        fflush(NULL);
    >    #if defined(_WIN32)
    >        // Halt all the threads and clean up.
    >        TerminateProcess(GetCurrentProcess(), status);
    >        // note: the above CAN return!
    >        Sleep(1000);
    >        DebugBreak();
    >    #elif defined(__APPLE_CC__)
    >
    >    Becomes
    >
    >        // various platforms have problems shutting down a process
    >        // while other threads are still executing,
    >        // or triggering endless exit()/atexit() loops.
    >        //
    >        BOINCINFO("Exit Status: %d", status);
    >        fflush(NULL);
    >    #if defined(_WIN32)
    >        // JG: Buffered IO is not committed to disk on flush, so commit it;
    >        // add other file descriptors if needed.
    >        // (_commit() takes a file descriptor, hence _fileno().)
    >        _commit(_fileno(stderr));
    >        // Halt all the threads and clean up.
    >        TerminateProcess(GetCurrentProcess(), status);
    >        // note: the above CAN return! [JG: It does, it's asynchronous; system
    >        // dependent, this thread runs on for some time, so:]
    >        WaitForSingleObject(GetCurrentProcess(), INFINITE); // That's why you do this
    >        Sleep(1000);   // JG: Will never be reached
    >        DebugBreak();  // JG: Will never be reached
    >    #elif defined(__APPLE_CC__)
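
The flush-then-commit step in the patch above can be isolated into a small portable sketch. `flush_and_commit` is a hypothetical name, not a BOINC function. On Windows it uses `_commit()` on the descriptor from `_fileno()`, and elsewhere it assumes POSIX `fsync()` plays the equivalent role.

```cpp
#include <cassert>
#include <cstdio>
#if defined(_WIN32)
#include <io.h>      // _commit
#else
#include <unistd.h>  // fsync
#endif

// Hypothetical helper: push a stream's buffered data to the OS with
// fflush(), then ask the OS to write it through to disk.
// Returns true only if both steps succeed.
bool flush_and_commit(std::FILE* f) {
    assert(f != nullptr);
    if (std::fflush(f) != 0) {
        return false;  // could not hand the C runtime buffer to the OS
    }
#if defined(_WIN32)
    return _commit(_fileno(f)) == 0;
#else
    return fsync(fileno(f)) == 0;
#endif
}
```

Calling `flush_and_commit(stderr)` just before process termination is the step the patch adds. Note that the commit can fail when the stream is not backed by a regular file (for example a console), so the return value is worth checking rather than assuming success.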
    >
    >
    >
    >    ------------------------------------------------------------------------------------------------------
    >    Jason Richard Groothuis
    >    bSc(compSci)
    >    ------------------------------------------------------------------------------------------------------
    >
    >
    >    > From: [email protected]
    >    > To: [email protected]; [email protected]; [email protected]
    >    > Date: Sun, 12 Jul 2015 05:47:37 +0930
    >    > Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates Milkyway tasks
    >    >
    >    > Comparing versions at leisure, but this same issue dates back, in slight
    >    > variations, to when I started crunching in 2007, with different symptoms
    >    > depending on build characteristics and the afflicted system (OS version,
    >    > number of cores, and GPU or CPU). You'd be looking for a history on the
    >    > boinc_exit() function. I don't think it ever had the wait after
    >    > TerminateProcess(), so IO cancellations are likely, depending on timing/chance.
    >    >
    >    >
    >    > ------------------------------------------------------------------------------------------------------
    >    > Jason Richard Groothuis
    >    > bSc(compSci)
    >    > ------------------------------------------------------------------------------------------------------
    >    >
    >    >
    >    > Date: Sat, 11 Jul 2015 20:07:56 +0000
    >    > From: [email protected]
    >    > To: [email protected]; [email protected]; [email protected]
    >    > Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates Milkyway tasks
    >    >
    >    > The Milkyway application we are mostly observing this with is
    >    > milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe,
    >    > which was deployed to their server on 6 Oct 2014, 20:18:34 UTC - the internal
    >    > signature says "API_VERSION_6.13.0"
    >    >
    >    >
    >    >
    >    >
    >    >      On Saturday, 11 July 2015, 20:55, Jason Groothuis <[email protected]> wrote:
    >    >
    >    >
    >    > "Perhaps the exit process has been invoked in the Milkyway app, but not all
    >    > consequent OS functions have completed in time."
    >    >
    >    > Correct, since the TerminateProcess() call, which is asynchronous and so
    >    > returns immediately without necessarily doing anything, is missing the
    >    > WaitForSingleObject() on the current process after it. The process resources
    >    > will clean up as part of OS garbage collection *sometime* down the road.
    >    >
    >    > Doubting the accuracy of the MSDN documentation on these functions is fine,
    >    > but wondering why it doesn't work as expected when you ignore it is just odd.
    >    >
    >    > On Saturday, 11 July 2015, 20:09, Jason Groothuis <[email protected]> wrote:
    >    >
    >    > Not sure how much detail you'd like on the situation. (Can provide much more.)
    >    > It's a result of buffered IO implemented in multithreaded C runtimes, in some
    >    > situations using deferred procedure calls. Internal helper threads are being
    >    > killed before commits are completed.
    >    >
    >    > Least desirable partial workaround (but helps):
    >    > - disable buffered IO by linking the application with the MS-supplied COMMODE.OBJ
    >    >
    >    > Probably better, but not tested:
    >    > - initiate a low-level _commit() and add the missing WaitForSingleObject()
    >    >   after the TerminateProcess() call
    >    >
    >    > Best:
    >    > - do a low-level _commit() and check that the file modification time has
    >    >   updated, then preferably use a friendly means of exit that allows DLL/thread
    >    >   cleanup, closing threads/processes using sentinel flags, like while(!done)
    >    >   instead of while(1) with kills.
    >    >
    >    > ------------------------------------------------------------------------------------------------------
    >    > Jason Richard Groothuis
    >    > bSc(compSci)
    >    > ------------------------------------------------------------------------------------------------------
    >    >
    >    > > Date: Sat, 11 Jul 2015 11:30:32 -0700
    >    > > From: [email protected]
    >    > > To: [email protected]; [email protected]
    >    > > Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates Milkyway tasks
    >    > >
    >    > > Richard:
    >    > > Can you please ask him to set <task_debug> as well?
    >    > >
    >    > > I have no theories about what could cause this.
    >    > > The BOINC client learns that a job is finished when its process has exited,
    >    > > and by that time all files are closed and locks released
    >    > > (I'm assuming the MW@h app is single-process - is that correct?)
    >    > >
    >    > > In this case, when the job finishes, the client successfully reads stderr.txt
    >    > > (otherwise <stderr_txt> would be absent or there would be an error message)
    >    > > but it's empty.
    >    > > This would be the case, e.g., if the writing process hadn't exited yet
    >    > > and its stderr buffer wasn't flushed.
    >    > > But the process has exited.
    >    > >
    >    > > Anyone have any ideas?
    >    > >
    >    > > -- David
    >    > >
    >    > > On 09-Jul-2015 7:42 AM, Richard Haselgrove wrote:
    >    > > > User Keith Myers (UID 147145 at http://milkyway.cs.rpi.edu/milkyway/index.php) has
    >    > > > asked for my help in identifying task failures at Milkyway.
    >    > > >
    >    > > > At my suggestion, he installed Windows client v7.6.2, and the attached message log
    >    > > > extracts show the enhanced <slot_debug> output that helped identify the CMS-dev
    >    > > > problem.
    >    > > >
    >    > > > In both cases, the task under scrutiny
    >    > > >
    >    > > > (1) de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0,
    >    > > > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181200273
    >    > > >
    >    > > > (2) ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0,
    >    > > > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181298220
    >    > > >
    >    > > > was declared 'Validate error', and the <stderr_txt> section is empty. In the
    >    > > > special case of Milkyway@Home, these two observations are linked, because the
    >    > > > science result is returned in stderr, not a separate upload file.
    >    > > >
    >    > > > Also in both cases, the <slot_debug> log contains
    >    > > >
    >    > > > [slot] failed to remove file slots/x/stderr.txt: unlink() failed
    >    > > >
    >    > > > between 'handle_exited_app()' and 'Computation for task ... finished'
    >    > > >
    >    > > > It appears that there is a race condition, whereby BOINC tries (and fails) to
    >    > > > delete stderr.txt before the operating system has released the write lock. This
    >    > > > (I'm presuming) also explains why the file appears empty when read off the disk
    >    > > > for incorporation into the client_state structure in memory, prior to reporting
    >    > > > the completed task to the project.
    >    > > >
    >    > > > In order to preserve the scientific result at Milkyway (and debug and other
    >    > > > useful information at other projects), the client should not initiate
    >    > > > 'handle_exited_app()' until it has confirmed that the write lock on stderr.txt has
    >    > > > been released.
    >    > > >
    >    > > >
    >    > > > Log 1 also shows that the additional safeguards on cleaning out slots are working
    >    > > > properly: if both handle_exited_app() and get_free_slot() fail to delete the file,
    >    > > > the next task isn't started in the not-empty slot (11), but in slot 14 instead.
    >    > > > And when slot 11 is tested again at the next get_free_slot(), the delete succeeds
    >    > > > and the now-empty slot is reused.
    >    > > >
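
The slot safeguard described in the last paragraph above can be sketched as follows. This is a simplified illustration, not BOINC's real get_free_slot(): `pick_free_slot` and the `try_clean` callback are hypothetical, and the sketch ignores slots that are busy running other tasks.

```cpp
#include <cassert>
#include <functional>

// Hypothetical sketch of the slot-selection behaviour (names are
// illustrative only). `try_clean(slot)` should attempt to empty the slot
// directory and return true only if the slot is now empty.
// Returns the first usable slot index, or -1 if no slot could be emptied.
int pick_free_slot(int num_slots, const std::function<bool(int)>& try_clean) {
    assert(num_slots >= 0);
    for (int slot = 0; slot < num_slots; ++slot) {
        if (try_clean(slot)) {
            return slot;  // empty, so a new task may start here
        }
        // Cleanup failed (for example stderr.txt is still locked): skip this
        // slot for now; it gets another chance the next time a slot is needed.
    }
    return -1;
}
```

Because a slot that cannot be emptied is skipped rather than reused, a leftover stderr.txt never ends up mixed into the next task's files, and the slot becomes available again once the lock is gone.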
    >    > > _______________________________________________
    >    > > boinc_dev mailing list
    >    > > [email protected]
    >    > > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
    >    > > To unsubscribe, visit the above URL and
    >    > > (near bottom of page) enter your email address.
    >    >
    >    >
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
