I'm now running Milkyway on my own GTX 670 (64-bit Windows 7, BOINC v7.6.3) and
got 6 invalids from this cause out of 435 tasks in 7 hours.
Three of the invalids matched a log entry of the form
11-Jul-2015 18:57:59 [---] [slot] failed to remove file slots/0/stderr.txt: Error 32
and in one case
11-Jul-2015 19:09:19 [---] [slot] cleaning out slots/1: handle_exited_app()
11-Jul-2015 19:09:19 [---] [slot] removed file slots/1/astronomy_parameters.txt
11-Jul-2015 19:09:19 [---] [slot] removed file slots/1/boinc_finish_called
11-Jul-2015 19:09:19 [---] [slot] removed file slots/1/boinc_task_state.xml
11-Jul-2015 19:09:19 [---] [slot] removed file slots/1/init_data.xml
11-Jul-2015 19:09:19 [---] [slot] removed file slots/1/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
11-Jul-2015 19:09:19 [---] [slot] removed file slots/1/separation_checkpoint
11-Jul-2015 19:09:19 [---] [slot] removed file slots/1/stars.txt
11-Jul-2015 19:09:19 [---] [slot] failed to remove file slots/1/stderr.txt: Error 32
11-Jul-2015 19:09:19 [Milkyway@Home] Computation for task de_fast_15_3s_136_sim1Jun1_1_1434554402_8856510_0 finished
11-Jul-2015 19:09:19 [---] [slot] cleaning out slots/1: get_free_slot()
11-Jul-2015 19:09:19 [---] [slot] failed to remove file slots/1/stderr.txt: Error 32
11-Jul-2015 19:09:19 [Milkyway@Home] [slot] failed to clean out dir: unlink() failed
Error 32 seems to be
    ERROR_SHARING_VIOLATION
    32 (0x20)
    The process cannot access the file because it is being used by another
    process.
(from MSDN)
and to originate in sandbox.cpp
    // delete a file.
    // return success if we deleted it or it didn't exist in the first place
    //
    static int delete_project_owned_file_aux(const char* path) {
    #ifdef _WIN32
        if (DeleteFile(path)) return 0;
        int error = GetLastError();
        if (error == ERROR_FILE_NOT_FOUND) {
            return 0;
        }
        if (error == ERROR_ACCESS_DENIED) {
            SetFileAttributes(path, FILE_ATTRIBUTE_NORMAL);
            if (DeleteFile(path)) return 0;
        }
        return error;
That seems to cast doubt on your presumption
> This would be the case, e.g., if the writing process hadn't exited yet
> and its stderr buffer wasn't flushed.
> But the process has exited.
Perhaps process exit has been initiated in the Milkyway app, but not all of the
consequent OS operations have completed in time.
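
If that is what is happening, one possible client-side mitigation would be to
treat error 32 as transient and retry the delete a few times before giving up.
A minimal sketch of the idea follows. It is not BOINC code: delete_with_retry
and its parameters are hypothetical names of my own, and try_delete stands in
for a call such as delete_project_owned_file_aux(path).

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not part of BOINC): call try_delete() until it returns
// something other than transient_error, or until max_attempts is reached.
// try_delete is expected to return 0 on success or an OS error code.
int delete_with_retry(const std::function<int()>& try_delete,
                      int transient_error,
                      int max_attempts,
                      std::chrono::milliseconds delay) {
    assert(max_attempts > 0);
    int rc = 0;
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        rc = try_delete();
        if (rc != transient_error) {
            return rc;  // success (0) or a non-transient error
        }
        if (attempt < max_attempts) {
            // Give the OS time to release the handle before trying again.
            std::this_thread::sleep_for(delay);
        }
    }
    return rc;  // still the transient error after the last attempt
}
```

Inside sandbox.cpp this could be called as, say,
delete_with_retry([&]{ return delete_project_owned_file_aux(path); },
ERROR_SHARING_VIOLATION, 5, std::chrono::milliseconds(100)); the count and
delay are guesses. Note that this would only address the failed cleanup. The
empty <stderr_txt> presumably needs a similar wait before the file is read.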
On Saturday, 11 July 2015, 20:09, Jason Groothuis
<[email protected]> wrote:
Not sure how much detail you'd like on the situation. (Can provide much more)
It's a result of buffered IO implemented in multithreaded C Runtimes, in some
situations using deferred procedure calls. Internal helper threads are being
killed before commits are completed.
Least desirable partial workaround (but it helps):
- disable buffered IO by linking the application with the Microsoft-supplied
  COMMODE.OBJ

Probably better, but not tested:
- initiate a low-level _commit() and add the missing WaitForSingleObject()
  after the TerminateProcess() call

Best:
- do a low-level _commit() and check that the file modification time has
  updated, then preferably use a friendly means of exit that allows DLL/thread
  cleanup, closing threads/processes using sentinel flags, like while(!done)
  instead of while(1) with kills.
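
A rough sketch of the "best" option, to show the shape of it rather than any
project's actual code. The names flush_and_commit, worker and done are my own,
and the output line is a placeholder. The worker exits by observing a sentinel
flag, and the output is flushed and committed before the thread returns, so
nothing is left in a runtime buffer when the process goes away.

```cpp
#include <atomic>
#include <cassert>
#include <cstdio>
#include <stdio.h>   // fileno on POSIX
#include <thread>
#ifdef _WIN32
#include <io.h>      // _commit, _fileno
#else
#include <unistd.h>  // fsync
#endif

// Flush the C runtime buffer for f, then ask the OS to commit the data.
// Returns 0 on success, nonzero on failure.
int flush_and_commit(std::FILE* f) {
    assert(f != nullptr);
    if (std::fflush(f) != 0) return -1;
#ifdef _WIN32
    return _commit(_fileno(f));
#else
    return fsync(fileno(f));
#endif
}

// Sentinel flag: set by the controlling thread to request a clean exit.
std::atomic<bool> done{false};

// Worker loop in the while(!done) style: no kill is needed, so the thread
// can finish its own output before returning.
void worker(std::FILE* out) {
    while (!done.load()) {
        std::this_thread::yield();  // stand-in for real work
    }
    std::fputs("result written\n", out);
    flush_and_commit(out);
}
```

The controlling thread would set done to true and then join the worker, rather
than terminating it. Checking the file modification time afterwards, as
suggested above, is left out of the sketch.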
------------------------------------------------------------------------------------------------------
Jason Richard Groothuis
bSc(compSci)
------------------------------------------------------------------------------------------------------
> Date: Sat, 11 Jul 2015 11:30:32 -0700
> From: [email protected]
> To: [email protected]; [email protected]
> Subject: Re: [boinc_dev] Client: race condition on stderr.txt invalidates
> Milkyway tasks
>
> Richard:
> Can you please ask him to set <task_debug> as well?
>
> I have no theories about what could cause this.
> The BOINC client learns that a job is finished when its process has exited,
> and by that time all files are closed and locks released
> (I'm assuming the MW@h app is single-process - is that correct?)
>
> In this case, when the job finishes, the client successfully reads stderr.txt
> (otherwise <stderr_txt> would be absent or there would be an error message)
> but it's empty.
> This would be the case, e.g., if the writing process hadn't exited yet
> and its stderr buffer wasn't flushed.
> But the process has exited.
>
> Anyone have any ideas?
>
> -- David
>
> On 09-Jul-2015 7:42 AM, Richard Haselgrove wrote:
> > User Keith Myers (UID 147145 at
> > http://milkyway.cs.rpi.edu/milkyway/index.php) has asked for my help in
> > identifying task failures at Milkyway.
> >
> > At my suggestion, he installed Windows client v7.6.2, and the attached
> > message log extracts show the enhanced <slot_debug> output that helped
> > identify the CMS-dev problem.
> >
> > In both cases, the task under scrutiny
> >
> > (1) de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0,
> > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181200273
> >
> > (2) ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0,
> > http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181298220
> >
> > was declared 'Validate error', and the <stderr_txt> section is empty. In
> > the special case of Milkyway@Home, these two observations are linked,
> > because the science result is returned in stderr, not a separate upload
> > file.
> >
> > Also in both cases, the <slot_debug> log contains
> >
> > [slot] failed to remove file slots/x/stderr.txt: unlink() failed
> >
> > between 'handle_exited_app()' and 'Computation for task ... finished'
> >
> > It appears that there is a race condition, whereby BOINC tries (and fails)
> > to delete stderr.txt before the operating system has released the write
> > lock. This (I'm presuming) also explains why the file appears empty when
> > read off the disk for incorporation into the client_state structure in
> > memory, prior to reporting the completed task to the project.
> >
> > In order to preserve the scientific result at Milkyway (and debug and
> > other useful information at other projects), the client should not
> > initiate 'handle_exited_app()' until it has confirmed that the write lock
> > on stderr.txt has been released.
> >
> >
> > Log 1 also shows that the additional safeguards on cleaning out slots are
> > working properly: if both handle_exited_app() and get_free_slot() fail to
> > delete the file, the next task isn't started in the not-empty slot (11),
> > but in slot 14 instead. And when slot 11 is tested again at the next
> > get_free_slot(), the delete succeeds and the now-empty slot is reused.
>
> _______________________________________________
> boinc_dev mailing list
> [email protected]
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> To unsubscribe, visit the above URL and
> (near bottom of page) enter your email address.