Not sure how much detail you'd like on the situation (I can provide much more).
It's a result of buffered I/O as implemented in the multithreaded C runtimes, in
some situations using deferred procedure calls. Internal helper threads are
being killed before commits are completed.

Least desirable partial workaround (but it helps):- make flushes commit straight
to disk by linking the application with the Microsoft-supplied COMMODE.OBJ.

Probably better, but not tested:- issue a low-level _commit() and add the
missing WaitForSingleObject() after the TerminateProcess() call.

Best:- do a low-level _commit() and check that the file modification time has
updated, then preferably use a friendly means of exit that allows DLL/thread
cleanup, closing threads/processes using sentinel flags, like while(!done)
instead of while(1) with kills.
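
A rough sketch of the "best" option, in case it helps. This is not taken from the
BOINC or Milkyway sources: flush_and_commit() and run_worker() are hypothetical
names, and the non-Windows branch (fsync) is only there as my assumed POSIX
stand-in for _commit() so the same file builds on both platforms. The idea is to
flush the runtime's buffer, commit the descriptor, and leave the loop through a
sentinel flag rather than a kill.

```c
#ifndef _WIN32
#define _POSIX_C_SOURCE 200809L   /* for fileno() and fsync() */
#endif

#include <stdio.h>
#include <signal.h>

#ifdef _WIN32
#include <io.h>                   /* _commit(), _fileno() */
#define COMMIT_FD(fd) _commit(fd)
#define FILE_FD(fp)   _fileno(fp)
#else
#include <unistd.h>               /* fsync(): assumed POSIX stand-in for _commit() */
#define COMMIT_FD(fd) fsync(fd)
#define FILE_FD(fp)   fileno(fp)
#endif

/* Sentinel flag: set it to ask the worker to finish, instead of killing it. */
static volatile sig_atomic_t done = 0;

/* Flush the C runtime's buffer, then ask the OS to commit the file to disk.
   Returns 0 on success, -1 on failure. */
static int flush_and_commit(FILE *fp)
{
    if (fflush(fp) != 0)
        return -1;
    return COMMIT_FD(FILE_FD(fp)) == 0 ? 0 : -1;
}

/* Hypothetical worker: loops on while(!done) rather than while(1), and
   commits its output before returning so nothing is left in a buffer.
   Returns the number of steps written, or -1 if the commit failed. */
static int run_worker(FILE *out, int max_steps)
{
    int steps = 0;
    while (!done) {
        fprintf(out, "step %d\n", steps);
        if (++steps >= max_steps)
            done = 1;             /* normal completion also goes via the flag */
    }
    return flush_and_commit(out) == 0 ? steps : -1;
}
```

Another thread or a signal handler can set done to request shutdown. The point is
only that the writer returns normally after the commit, so the runtime gets to
run its own cleanup. The commit can fail on descriptors that are not regular
files (a pipe or console, for example), so the return value is worth checking.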

------------------------------------------------------------------------------------------------------
Jason Richard Groothuis 
bSc(compSci)

------------------------------------------------------------------------------------------------------


> Date: Sat, 11 Jul 2015 11:30:32 -0700
> From: [email protected]
> To: [email protected]; [email protected]
> Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates Milkyway tasks
> 
> Richard:
> Can you please ask him to set <task_debug> as well?
> 
> I have no theories about what could cause this.
> The BOINC client learns that a job is finished when its process has exited,
> and by that time all files are closed and locks released
> (I'm assuming the MW@h app is single-process - is that correct?)
> 
> In this case, when the job finishes, the client successfully reads stderr.txt
> (otherwise <stderr_txt> would be absent or there would be an error message)
> but it's empty.
> This would be the case, e.g., if the writing process hadn't exited yet
> and its stderr buffer wasn't flushed.
> But the process has exited.
> 
> Anyone have any ideas?
> 
> -- David
> 
> On 09-Jul-2015 7:42 AM, Richard Haselgrove wrote:
> > User Keith Myers (UID 147145 at
> > http://milkyway.cs.rpi.edu/milkyway/index.php) has asked for my help in
> > identifying task failures at Milkyway.
> >
> > At my suggestion, he installed Windows client v7.6.2, and the attached
> > message log extracts show the enhanced <slot_debug> output that helped
> > identify the CMS-dev problem.
> >
> > In both cases, the task under scrutiny
> >
> > (1) de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0, 
> > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181200273
> >
> > (2) ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0, 
> > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181298220
> >
> > was declared 'Validate error', and the <stderr_txt> section is empty. In
> > the special case of Milkyway@Home, these two observations are linked,
> > because the science result is returned in stderr, not a separate upload
> > file.
> >
> > Also in both cases, the <slot_debug> log contains
> >
> > [slot] failed to remove file slots/x/stderr.txt: unlink() failed
> >
> > between 'handle_exited_app()' and 'Computation for task ... finished'
> >
> > It appears that there is a race condition, whereby BOINC tries (and fails)
> > to delete stderr.txt before the operating system has released the write
> > lock. This (I'm presuming) also explains why the file appears empty when
> > read off the disk for incorporation into the client_state structure in
> > memory, prior to reporting the completed task to the project.
> >
> > In order to preserve the scientific result at Milkyway (and debug and
> > other useful information at other projects), the client should not
> > initiate 'handle_exited_app()' until it has confirmed that the write lock
> > on stderr.txt has been released.
> >
> >
> > Log 1 also shows that the additional safeguards on cleaning out slots are
> > working properly: if both handle_exited_app() and get_free_slot() fail to
> > delete the file, the next task isn't started in the not-empty slot (11),
> > but in slot 14 instead. And when slot 11 is tested again at the next
> > get_free_slot(), the delete succeeds and the now-empty slot is reused.
> 
> _______________________________________________
> boinc_dev mailing list
> [email protected]
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> To unsubscribe, visit the above URL and
> (near bottom of page) enter your email address.
                                          
