OK, off and running with v7.6.6 to test - report later.
Two small observations from the switchover, at restart.
1) another timing glitch
15/07/2015 10:45:22 | Milkyway@Home | Sending scheduler request: To fetch work.
15/07/2015 10:45:25 | Milkyway@Home | Scheduler request completed: got 2 new tasks
15/07/2015 10:46:28 | Milkyway@Home | Sending scheduler request: To fetch work.
15/07/2015 10:46:30 | Milkyway@Home | Not sending work - last request too recent: 59 sec
2) with debug logging and so many tasks, stdoutdae.txt had outgrown its limit.
stdoutdae.old is 70 MB - by my setting, it should have rotated at 50 MB. I 
believe there's an outstanding request for logs to rotate while running, not 
only at startup. 


On Tuesday, 14 July 2015, 19:13, David Anderson <[email protected]> wrote:

 I checked in a workaround in which the client waits until
stderr.txt is not locked before reading it.
Can people please review this change?
-- David
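
The general shape of such a wait can be sketched as a bounded retry loop. This is an assumption-laden illustration, not the change that was checked in: `wait_until_unlocked` and its parameters are hypothetical names, and the platform-specific test for "is stderr.txt still locked?" is left to a caller-supplied probe.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not the checked-in BOINC change: call `is_unlocked`
// up to `max_attempts` times, sleeping `delay` after each failed attempt
// except the last. Returns true as soon as the probe succeeds, false if it
// never does, so the caller can still decide what to do on timeout.
bool wait_until_unlocked(const std::function<bool()>& is_unlocked,
                         int max_attempts,
                         std::chrono::milliseconds delay) {
    assert(max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (is_unlocked()) {
            return true;
        }
        if (attempt + 1 < max_attempts) {
            std::this_thread::sleep_for(delay);
        }
    }
    return false;
}
```

Bounding the number of attempts matters here: if the lock is never released, the client should fall through and report what it has rather than hang on one slot.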

On 14-Jul-2015 8:54 AM, Richard Haselgrove wrote:
> I've finally managed to capture an orphaned stderr.txt file on disk, and marry it
> up with the 'Validate error' task report at Milkyway.
>
> The copied file on my hard disk has the final few lines that were missing from the
> version reported to the project.
>
> It'll be easiest to read the full report, with embedded screenshots, at 
> http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3662&postid=63799#63799
>
>
>
> On Saturday, 11 July 2015, 22:44, Richard Haselgrove 
> <[email protected]> wrote:
>
>
>
>    Here's a double 'Error 32' with both slot debug and task debug active throughout.
>    This one resulted in a completely blank stderr.txt being reported to the
>    project. Task is de_modfit_sum_fast_15_3s_136_sim1Jun1_4_1434554402_8900923_1,
>    http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1184736496
>
>
>    On Saturday, 11 July 2015, 21:35, Jason Groothuis <[email protected]> wrote:
>
>
>
>    Trial correction, this part of boinc_api.cpp, line 778:
>
>        // various platforms have problems shutting down a process
>        // while other threads are still executing,
>        // or triggering endless exit()/atexit() loops.
>        //
>        BOINCINFO("Exit Status: %d", status);
>        fflush(NULL);
>    #if defined(_WIN32)
>        // Halt all the threads and clean up.
>        TerminateProcess(GetCurrentProcess(), status);
>        // note: the above CAN return!
>        Sleep(1000);
>        DebugBreak();
>    #elif defined(__APPLE_CC__)
>
>    Becomes
>
>        // various platforms have problems shutting down a process
>        // while other threads are still executing,
>        // or triggering endless exit()/atexit() loops.
>        //
>        BOINCINFO("Exit Status: %d", status);
>        fflush(NULL);
>    #if defined(_WIN32)
>        // JG: Buffered IO is not committed to disk on flush, so commit it;
>        // add other file descriptors if needed.
>        // (_commit() takes a file descriptor, hence _fileno().)
>        _commit(_fileno(stderr));
>        // Halt all the threads and clean up.
>        TerminateProcess(GetCurrentProcess(), status);
>        // note: the above CAN return! [JG: It does; it's asynchronous and
>        // system dependent, and this thread runs on for some time, so:]
>        WaitForSingleObject(GetCurrentProcess(), INFINITE); // That's why you do this
>        Sleep(1000);   // JG: Will never be reached
>        DebugBreak();  // JG: Will never be reached
>    #elif defined(__APPLE_CC__)
>
>
>    ------------------------------------------------------------------------------------------------------
>    Jason Richard Groothuis
>    bSc(compSci)
>
>    ------------------------------------------------------------------------------------------------------
>
>
>    > From: [email protected]
>    > To: [email protected]; [email protected]; [email protected]
>    > Date: Sun, 12 Jul 2015 05:47:37 +0930
>    > Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates Milkyway tasks
>    >
>    > Comparing versions at leisure, but this same issue dates back in slight
>    > variations to when I started crunching back in 2007, with different symptoms
>    > depending on build characteristics and afflicted system (OS version, #cores,
>    > and GPU or CPU). You'd be looking for a history on the boinc_exit() function.
>    > Don't think it ever had the wait after TerminateProcess(), so IO cancellations
>    > are likely, depending on timing/chance.
>    >
>    >
>    > ------------------------------------------------------------------------------------------------------
>    > Jason Richard Groothuis
>    > bSc(compSci)
>    >
>    > ------------------------------------------------------------------------------------------------------
>    >
>    >
>    > Date: Sat, 11 Jul 2015 20:07:56 +0000
>    > From: [email protected]
>    > To: [email protected]; [email protected]; [email protected]
>    > Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates Milkyway tasks
>    >
>    > The Milkyway application we are mostly observing this with is
>    > milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe,
>    > which was deployed to their server on 6 Oct 2014, 20:18:34 UTC - the internal
>    > signature says "API_VERSION_6.13.0"
>    >
>    >
>    >
>    >
>    > On Saturday, 11 July 2015, 20:55, Jason Groothuis <[email protected]> wrote:
>    >
>    >
>    > "Perhaps the exit process has been invoked in the Milkyway app, but not all
>    > consequent OS functions have completed in time."
>    >
>    > Correct, since the TerminateProcess() call, which is asynchronous and so
>    > returns immediately without necessarily doing anything, is missing the
>    > WaitForSingleObject() on the current process after it. The process resources
>    > will clean up as part of OS garbage collection *sometime* down the road.
>    >
>    > Doubting the accuracy of the MSDN documentation on these functions is fine,
>    > but wondering why it doesn't work as expected when you ignore it is just odd.
>    >
>    > On Saturday, 11 July 2015, 20:09, Jason Groothuis <[email protected]> wrote:
>    >
>    > Not sure how much detail you'd like on the situation. (Can provide much more.)
>    > It's a result of buffered IO implemented in multithreaded C runtimes, in some
>    > situations using deferred procedure calls. Internal helper threads are being
>    > killed before commits are completed.
>    >
>    > Least desirable partial workaround (but helps):
>    > - disable buffered IO by linking the application with the MS-supplied
>    >   COMMODE.OBJ
>    > Probably better, but not tested:
>    > - initiate a low-level _commit() and add the missing WaitForSingleObject()
>    >   after the TerminateProcess() call
>    > Best:
>    > - do a low-level _commit() and check that the file modification time updated,
>    >   then preferably use a friendly means of exit that allows DLL/thread
>    >   cleanup, closing threads/processes using sentinel flags, like while(!done)
>    >   instead of while(1) with kills.
>    >
>    > ------------------------------------------------------------------------------------------------------
>    > Jason Richard Groothuis
>    > bSc(compSci)
>    > ------------------------------------------------------------------------------------------------------
>    >
>    > > Date: Sat, 11 Jul 2015 11:30:32 -0700
>    > > From: [email protected]
>    > > To: [email protected]; [email protected]
>    > > Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates Milkyway tasks
>    > >
>    > > Richard:
>    > > Can you please ask him to set <task_debug> as well?
>    > >
>    > > I have no theories about what could cause this.
>    > > The BOINC client learns that a job is finished when its process has exited,
>    > > and by that time all files are closed and locks released
>    > > (I'm assuming the MW@h app is single-process - is that correct?)
>    > >
>    > > In this case, when the job finishes, the client successfully reads stderr.txt
>    > > (otherwise <stderr_txt> would be absent or there would be an error message)
>    > > but it's empty.
>    > > This would be the case, e.g., if the writing process hadn't exited yet
>    > > and its stderr buffer wasn't flushed.
>    > > But the process has exited.
>    > >
>    > > Anyone have any ideas?
>    > >
>    > > -- David
>    > >
>    > > On 09-Jul-2015 7:42 AM, Richard Haselgrove wrote:
>    > > > User Keith Myers (UID 147145 at http://milkyway.cs.rpi.edu/milkyway/index.php) has
>    > > > asked for my help in identifying task failures at Milkyway.
>    > > >
>    > > > At my suggestion, he installed Windows client v7.6.2, and the attached message log
>    > > > extracts show the enhanced <slot_debug> output that helped identify the CMS-dev
>    > > > problem.
>    > > >
>    > > > In both cases, the task under scrutiny
>    > > >
>    > > > (1) de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0,
>    > > > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181200273
>    > > >
>    > > > (2) ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0,
>    > > > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181298220
>    > > >
>    > > > was declared 'Validate error', and the <stderr_txt> section is empty. In the
>    > > > special case of Milkyway@Home, these two observations are linked, because the
>    > > > science result is returned in stderr, not a separate upload file.
>    > > >
>    > > > Also in both cases, the <slot_debug> log contains
>    > > >
>    > > > [slot] failed to remove file slots/x/stderr.txt:  unlink() failed
>    > > >
>    > > > between 'handle_exited_app()' and 'Computation for task ... finished'
>    > > >
>    > > > It appears that there is a race condition, whereby BOINC tries (and fails) to
>    > > > delete stderr.txt before the operating system has released the write lock. This
>    > > > (I'm presuming) also explains why the file appears empty when read off the disk
>    > > > for incorporation into the client_state structure in memory, prior to reporting
>    > > > the completed task to the project.
>    > > >
>    > > > In order to preserve the scientific result at Milkyway (and debug and other
>    > > > useful information at other projects), the client should not initiate
>    > > > 'handle_exited_app()' until it has confirmed that the write lock on stderr.txt has
>    > > > been released.
>    > > >
>    > > >
>    > > > Log 1 also shows that the additional safeguards on cleaning out slots are working
>    > > > properly: if both handle_exited_app() and get_free_slot() fail to delete the file,
>    > > > the next task isn't started in the not-empty slot (11), but in slot 14 instead.
>    > > > And when slot 11 is tested again at the next get_free_slot(), the delete succeeds
>    > > > and the now-empty slot is reused.
>    > >
>    > > _______________________________________________
>    > > boinc_dev mailing list
>    > > [email protected]
>    > > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>    > > To unsubscribe, visit the above URL and
>    > > (near bottom of page) enter your email address.

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.


 
  