I've built a new version of 7.6 with David's latest change to address this 
issue.

http://boinc.berkeley.edu/dl/boinc_7.6.6_windows_intelx86.exe
http://boinc.berkeley.edu/dl/boinc_7.6.6_windows_x86_64.exe

----- Rom

-----Original Message-----
From: boinc_dev [mailto:[email protected]] On Behalf Of 
Richard Haselgrove
Sent: Tuesday, July 14, 2015 11:55 AM
To: Jason Groothuis <[email protected]>; Dave A 
<[email protected]>; BOINC Dev Mailing List <[email protected]>; 
William Stilte <[email protected]>
Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates 
Milkyway tasks

I've finally managed to capture an orphaned stderr.txt file on disk, and marry 
it up with the 'Validate error' task report at Milkyway.
The copied file on my hard disk has the final few lines that were missing from 
the version reported to the project.
It'll be easiest to read the full report, with embedded screenshots, at 
http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3662&postid=63799#63799 


     On Saturday, 11 July 2015, 22:44, Richard Haselgrove 
<[email protected]> wrote:
   
 

 Here's a double 'Error 32' with both slot debug and task debug active 
throughout.
This one resulted in a completely blank stderr.txt being reported to the 
project.
Task is de_modfit_sum_fast_15_3s_136_sim1Jun1_4_1434554402_8900923_1, 
http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1184736496 


    On Saturday, 11 July 2015, 21:35, Jason Groothuis 
<[email protected]> wrote:
  
 

 Trial correction, this part of boinc_api.cpp, line 778:

    // various platforms have problems shutting down a process
    // while other threads are still executing,
    // or triggering endless exit()/atexit() loops.
    //
    BOINCINFO("Exit Status: %d", status);
    fflush(NULL);
#if defined(_WIN32)
    // Halt all the threads and clean up.
    TerminateProcess(GetCurrentProcess(), status);
    // note: the above CAN return!
    Sleep(1000);
    DebugBreak();
#elif defined(__APPLE_CC__)

Becomes:

    // various platforms have problems shutting down a process
    // while other threads are still executing,
    // or triggering endless exit()/atexit() loops.
    //
    BOINCINFO("Exit Status: %d", status);
    fflush(NULL);
#if defined(_WIN32)
    // JG: Buffered IO is not committed to disk on flush, so commit it;
    // add other file descriptors if needed
    _commit(_fileno(stderr));
    // Halt all the threads and clean up.
    TerminateProcess(GetCurrentProcess(), status);
    // note: the above CAN return!
    // [JG: It does; it's asynchronous and system dependent. This thread
    // runs on for some time, so:]
    WaitForSingleObject(GetCurrentProcess(), INFINITE); // That's why you do this
    Sleep(1000);   // JG: Will never be reached
    DebugBreak();  // JG: Will never be reached
#elif defined(__APPLE_CC__)
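
For readers who want to try the flush-then-commit step in isolation: `_commit()` in the Microsoft C runtime operates on a low-level file descriptor, so a `FILE*` such as stderr has to go through `_fileno()` first. The sketch below is only an illustration of that step, not BOINC code. The helper name `commit_stream` is hypothetical, and the POSIX `fsync()` branch is an assumption added so the sketch also builds outside Windows.

```cpp
#include <stdio.h>

#if defined(_WIN32)
#include <io.h>       // _commit, _fileno
#else
#include <unistd.h>   // fsync
#endif

// Hypothetical helper, for illustration only; not part of the BOINC API.
// Step 1: fflush() pushes the C runtime's buffer for `f` down to the OS.
// Step 2: _commit()/fsync() asks the OS to write its cached data for the
//         underlying descriptor to disk.
// Returns 0 on success, -1 on failure.
int commit_stream(FILE* f) {
    if (f == nullptr) return -1;
    if (fflush(f) != 0) return -1;
#if defined(_WIN32)
    return _commit(_fileno(f));
#else
    return fsync(fileno(f));
#endif
}
```

In the Windows branch quoted above, a call such as `commit_stream(stderr)` would sit just before `TerminateProcess()`. On POSIX systems `fsync()` may fail when stderr is a terminal or a pipe rather than a regular file, so the return value is worth checking rather than assuming success.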


------------------------------------------------------------------------------------------------------
Jason Richard Groothuis
bSc(compSci)

------------------------------------------------------------------------------------------------------


> From: [email protected]
> To: [email protected]; [email protected]; 
> [email protected]
> Date: Sun, 12 Jul 2015 05:47:37 +0930
> Subject: Re: [boinc_dev] Client: race condition on stderr.txt 
> invalidates Milkyway tasks
> 
> Comparing versions at leisure, but this same issue dates back, in slight 
> variations, to when I started crunching back in 2007, with different symptoms 
> depending on build characteristics and afflicted system (OS version, 
> #cores, and GPU or CPU).  You'd be looking for a history on the boinc_exit() 
> function.  Don't think it ever had the wait after TerminateProcess(), so IO 
> cancellations are likely depending on timing/chance.
> 
> ------------------------------------------------------------------------------------------------------
> Jason Richard Groothuis
> bSc(compSci)
> 
> ------------------------------------------------------------------------------------------------------
> 
> 
> Date: Sat, 11 Jul 2015 20:07:56 +0000
> From: [email protected]
> To: [email protected]; [email protected]; 
> [email protected]
> Subject: Re: [boinc_dev] Client: race condition on stderr.txt 
> invalidates Milkyway tasks
> 
> The Milkyway application we are mostly observing this with is 
> milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe, 
> which was deployed to their server on 6 Oct 2014, 20:18:34 UTC - the internal 
> signature says "API_VERSION_6.13.0"
> 
>  
> 
> 
>      On Saturday, 11 July 2015, 20:55, Jason Groothuis 
><[email protected]> wrote:
>      
> 
> "Perhaps the exit process has been invoked in the Milkyway app, but not
> all consequent OS functions have completed in time."
>
> Correct, since the TerminateProcess() call, which is asynchronous and so
> returns immediately without necessarily doing anything, is missing the
> WaitForSingleObject() on the current process after it.  The process
> resources will be cleaned up as part of OS garbage collection *sometime*
> down the road.
>
> Doubting the accuracy of the MSDN documentation on these functions is
> fine, but wondering why it doesn't work as expected when you ignore it
> is just odd.
>
>     On Saturday, 11 July 2015, 20:09, Jason Groothuis
>     <[email protected]> wrote:
>
> > Not sure how much detail you'd like on the situation. (Can provide much
> > more.)  It's a result of buffered IO implemented in multithreaded C
> > runtimes, in some situations using deferred procedure calls.  Internal
> > helper threads are being killed before commits are completed.
> >
> > Least desirable partial workaround (but helps):
> > - disable buffered IO by linking the application with the MS-supplied
> >   COMMODE.OBJ
> >
> > Probably better, but not tested:
> > - initiate a low-level _commit() and add the missing
> >   WaitForSingleObject() after the TerminateProcess() call
> >
> > Best:
> > - do a low-level _commit() and check that the file modification time
> >   has updated, then preferably use a friendly means of exit that allows
> >   DLL/thread cleanup, closing threads/processes using sentinel flags,
> >   like while(!done) instead of while(1) with kills.
> >
> > ------------------------------------------------------------------------------------------------------
> > Jason Richard Groothuis
> > bSc(compSci)
> > ------------------------------------------------------------------------------------------------------
> >
> > > Date: Sat, 11 Jul 2015 11:30:32 -0700
> > > From: [email protected]
> > > To: [email protected]; [email protected]
> > > Subject: Re: [boinc_dev] Client: race condition on stderr.txt
> > > invalidates Milkyway tasks
> > >
> > > Richard:
> > > Can you please ask him to set <task_debug> as well?
> > >
> > > I have no theories about what could cause this.
> > > The BOINC client learns that a job is finished when its process has
> > > exited, and by that time all files are closed and locks released
> > > (I'm assuming the MW@h app is single-process - is that correct?)
> > >
> > > In this case, when the job finishes, the client successfully reads
> > > stderr.txt (otherwise <stderr_txt> would be absent or there would be
> > > an error message) but it's empty.
> > > This would be the case, e.g., if the writing process hadn't exited yet
> > > and its stderr buffer wasn't flushed.
> > > But the process has exited.
> > >
> > > Anyone have any ideas?
> > >
> > > -- David
> > >
> > > On 09-Jul-2015 7:42 AM, Richard Haselgrove wrote:
> > > > User Keith Myers (UID 147145 at
> > > > http://milkyway.cs.rpi.edu/milkyway/index.php) has asked for my
> > > > help in identifying task failures at Milkyway.
> > > >
> > > > At my suggestion, he installed Windows client v7.6.2, and the
> > > > attached message log extracts show the enhanced <slot_debug> output
> > > > that helped identify the CMS-dev problem.
> > > >
> > > > In both cases, the task under scrutiny
> > > >
> > > > (1) de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0,
> > > > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181200273
> > > >
> > > > (2) ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0,
> > > > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181298220
> > > >
> > > > was declared 'Validate error', and the <stderr_txt> section is
> > > > empty. In the special case of Milkyway@Home, these two observations
> > > > are linked, because the science result is returned in stderr, not a
> > > > separate upload file.
> > > >
> > > > Also in both cases, the <slot_debug> log contains
> > > >
> > > > [slot] failed to remove file slots/x/stderr.txt:  unlink() failed
> > > >
> > > > between 'handle_exited_app()' and 'Computation for task ... finished'
> > > >
> > > > It appears that there is a race condition, whereby BOINC tries (and
> > > > fails) to delete stderr.txt before the operating system has released
> > > > the write lock. This (I'm presuming) also explains why the file
> > > > appears empty when read off the disk for incorporation into the
> > > > client_state structure in memory, prior to reporting the completed
> > > > task to the project.
> > > >
> > > > In order to preserve the scientific result at Milkyway (and debug
> > > > and other useful information at other projects), the client should
> > > > not initiate 'handle_exited_app()' until it has confirmed that the
> > > > write lock on stderr.txt has been released.
> > > >
> > > > Log 1 also shows that the additional safeguards on cleaning out
> > > > slots are working properly: if both handle_exited_app() and
> > > > get_free_slot() fail to delete the file, the next task isn't started
> > > > in the not-empty slot (11), but in slot 14 instead. And when slot 11
> > > > is tested again at the next get_free_slot(), the delete succeeds and
> > > > the now-empty slot is reused.
> > >
> > > _______________________________________________
> > > boinc_dev mailing list
> > > [email protected]
> > > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> > > To unsubscribe, visit the above URL and
> > > (near bottom of page) enter your email address.
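
The "friendly exit" idea raised in the quoted discussion, a worker loop that polls a sentinel flag (`while(!done)`) so that threads can finish and be joined instead of being killed, might look roughly like the sketch below. It uses standard C++ threads rather than the Win32 calls in boinc_api.cpp, and the names `Worker`, `request_stop`, and `iterations` are hypothetical, chosen only for illustration.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical worker, for illustration only; not BOINC code.
class Worker {
public:
    Worker() : done_(false), iterations_(0), thread_(&Worker::run, this) {}
    ~Worker() { request_stop(); }

    // Raise the sentinel, then wait for the loop to finish its current pass.
    void request_stop() {
        done_.store(true);
        if (thread_.joinable()) thread_.join();
    }

    long iterations() const { return iterations_.load(); }

private:
    void run() {
        while (!done_.load()) {   // sentinel flag instead of while(1)
            ++iterations_;        // stand-in for one unit of real work
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        // The thread leaves the loop on its own, so it can flush buffers
        // and close files here before the process exits.
    }

    // Declaration order matters: the flag and counter must be initialized
    // before the thread that reads them is started.
    std::atomic<bool> done_;
    std::atomic<long> iterations_;
    std::thread thread_;
};
```

Because `request_stop()` joins the thread, no work is still in flight once it returns, which is the property the kill-based approach cannot guarantee.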