I'm not sure how much detail you'd like on the situation (I can provide much more). It's a result of buffered I/O as implemented in the multithreaded C runtimes, which in some situations uses deferred procedure calls. Internal helper threads are being killed before commits are completed.

Least desirable, partial workaround (but it helps): disable buffered I/O by linking the application with the Microsoft-supplied COMMODE.OBJ.

Probably better, but not tested: initiate a low-level _commit() and add the missing WaitForSingleObject() after the TerminateProcess() call.

Best: do a low-level _commit() and check that the file modification time has been updated, then preferably use a friendly means of exit that allows DLL/thread cleanup, closing threads/processes using sentinel flags, like while(!done) instead of while(1) with kills.
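To make the "best" option a little more concrete, here is a rough sketch in C, not the actual Milkyway or BOINC code. The helper names (flush_and_commit, committed_size, run_worker) and the macros are made up for illustration. On Windows it calls _commit(); elsewhere it falls back to fsync() so the same sketch builds on POSIX systems. It verifies the write by file size, since an empty stderr.txt is the symptom reported; the same stat() call also exposes st_mtime if you want the modification-time check instead.

```c
#ifndef _WIN32
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for fileno() and fsync() */
#endif
#endif

#include <stdio.h>
#include <sys/stat.h>

#ifdef _WIN32
#include <io.h>                   /* _commit(), _fileno() */
#define COMMIT_FD(fd) _commit(fd)
#define FILENO(fp)    _fileno(fp)
#else
#include <unistd.h>               /* fsync() stands in for _commit() */
#define COMMIT_FD(fd) fsync(fd)
#define FILENO(fp)    fileno(fp)
#endif

/* Sentinel flag: the worker loop exits cleanly when this is set,
   instead of being killed in the middle of a write. */
static volatile int done = 0;

/* Hypothetical helper: flush the C runtime buffer, then ask the OS
   to commit the file to disk. Returns 0 on success, -1 on failure. */
static int flush_and_commit(FILE *fp)
{
    if (fflush(fp) != 0)
        return -1;
    return COMMIT_FD(FILENO(fp)) == 0 ? 0 : -1;
}

/* Hypothetical helper: size in bytes of the file on disk, or -1 if
   it cannot be examined. st.st_mtime is available from the same call. */
static long committed_size(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_size;
}

/* Hypothetical worker: while(!done) rather than while(1) plus a kill.
   Commits its output before returning. */
static int run_worker(FILE *log, int steps)
{
    int i = 0;
    while (!done) {
        fprintf(log, "step %d\n", i);
        if (++i >= steps)
            done = 1;             /* normal completion sets the sentinel */
    }
    return flush_and_commit(log);
}
```

The point of the design is only that the writer commits its own output and leaves through a normal return path, so whoever reads the file afterwards is not racing a buffer that was never flushed.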
------------------------------------------------------------------------------------------------------
Jason Richard Groothuis bSc(compSci)
------------------------------------------------------------------------------------------------------

> Date: Sat, 11 Jul 2015 11:30:32 -0700
> From: [email protected]
> To: [email protected]; [email protected]
> Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates Milkyway tasks
>
> Richard:
> Can you please ask him to set <task_debug> as well?
>
> I have no theories about what could cause this.
> The BOINC client learns that a job is finished when its process has exited,
> and by that time all files are closed and locks released
> (I'm assuming the MW@h app is single-process - is that correct?)
>
> In this case, when the job finishes, the client successfully reads stderr.txt
> (otherwise <stderr_txt> would be absent or there would be an error message)
> but it's empty.
> This would be the case, e.g., if the writing process hadn't exited yet
> and its stderr buffer wasn't flushed.
> But the process has exited.
>
> Anyone have any ideas?
>
> -- David
>
> On 09-Jul-2015 7:42 AM, Richard Haselgrove wrote:
> > User Keith Myers (UID 147145 at http://milkyway.cs.rpi.edu/milkyway/index.php) has
> > asked for my help in identifying task failures at Milkyway.
> >
> > At my suggestion, he installed Windows client v7.6.2, and the attached message log
> > extracts show the enhanced <slot_debug> output that helped identify the CMS-dev
> > problem.
> >
> > In both cases, the task under scrutiny
> >
> > (1) de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0,
> > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181200273
> >
> > (2) ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0,
> > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181298220
> >
> > was declared 'Validate error', and the <stderr_txt> section is empty.
> > In the special case of Milkyway@Home, these two observations are linked, because the
> > science result is returned in stderr, not a separate upload file.
> >
> > Also in both cases, the <slot_debug> log contains
> >
> > [slot] failed to remove file slots/x/stderr.txt: unlink() failed
> >
> > between 'handle_exited_app()' and 'Computation for task ... finished'
> >
> > It appears that there is a race condition, whereby BOINC tries (and fails) to
> > delete stderr.txt before the operating system has released the write lock. This
> > (I'm presuming) also explains why the file appears empty when read off the disk
> > for incorporation into the client_state structure in memory, prior to reporting
> > the completed task to the project.
> >
> > In order to preserve the scientific result at Milkyway (and debug and other
> > useful information at other projects), the client should not initiate
> > 'handle_exited_app()' until it has confirmed that the write lock on stderr.txt has
> > been released.
> >
> >
> > Log 1 also shows that the additional safeguards on cleaning out slots are working
> > properly: if both handle_exited_app() and get_free_slot() fail to delete the file,
> > the next task isn't started in the not-empty slot (11), but in slot 14 instead.
> > And when slot 11 is tested again at the next get_free_slot(), the delete succeeds
> > and the now-empty slot is reused.
>
> _______________________________________________
> boinc_dev mailing list
> [email protected]
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> To unsubscribe, visit the above URL and
> (near bottom of page) enter your email address.
