Richard:
Can you please ask him to set <task_debug> as well?

I have no theories about what could cause this.
The BOINC client learns that a job is finished when its process has exited,
and by that time all of the process's files are closed and its locks released.
(I'm assuming the MW@h app is single-process; is that correct?)

In this case, when the job finishes, the client successfully reads stderr.txt
(otherwise <stderr_txt> would be absent or there would be an error message)
but it's empty.
This would be the case, e.g., if the writing process hadn't exited yet
and its stderr buffer wasn't flushed.
But the process has exited.
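
To make the buffering scenario concrete, here is a minimal Python sketch (not BOINC or MW@h code; the function names and the temporary file are made up for illustration). It shows that a second reader sees an empty file while the writer's data is still in its user-space buffer, and sees the data once the buffer is flushed.

```python
import os
import tempfile


def read_text(path):
    """Read the whole file through a separate handle, as a second process would."""
    with open(path, "r") as f:
        return f.read()


def demo_unflushed_buffer():
    """Return (contents seen before flush, contents seen after flush)."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "stderr.txt")  # illustrative name only

    # A regular file is block-buffered, so a short write stays in memory.
    writer = open(path, "w", buffering=8192)
    writer.write("science result\n")

    before = read_text(path)  # writer has not flushed yet

    writer.flush()            # push the buffered bytes to the OS
    after = read_text(path)

    writer.close()
    os.remove(path)
    os.rmdir(tmpdir)
    return before, after


if __name__ == "__main__":
    before, after = demo_unflushed_buffer()
    print("before flush:", repr(before))
    print("after flush: ", repr(after))
```

A normal process exit flushes and closes the stream, which is why this explanation does not obviously fit a process that has already exited.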

Anyone have any ideas?

-- David

On 09-Jul-2015 7:42 AM, Richard Haselgrove wrote:
User Keith Myers (UID 147145 at http://milkyway.cs.rpi.edu/milkyway/index.php) has asked for my help in identifying task failures at Milkyway.

At my suggestion, he installed Windows client v7.6.2, and the attached message log extracts show the enhanced <slot_debug> output that helped identify the CMS-dev problem.

In both cases, the task under scrutiny

(1) de_fast_15_3s_136_sim1Jun1_1_1434554402_7775504_0, http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181200273

(2) ps_fast_15_3s_136_sim1Jun1_1_1434554402_7806437_0, http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=1181298220

was declared 'Validate error', and the <stderr_txt> section is empty. In the special case of Milkyway@Home, these two observations are linked, because the science result is returned in stderr, not a separate upload file.

Also in both cases, the <slot_debug> log contains

[slot] failed to remove file slots/x/stderr.txt: unlink() failed

between 'handle_exited_app()' and 'Computation for task ... finished'

It appears that there is a race condition, whereby BOINC tries (and fails) to delete stderr.txt before the operating system has released the write lock. This (I'm presuming) also explains why the file appears empty when read off the disk for incorporation into the client_state structure in memory, prior to reporting the completed task to the project.

In order to preserve the scientific result at Milkyway (and debug and other useful information at other projects), the client should not initiate 'handle_exited_app()' until it has confirmed that the write lock on stderr.txt has been released.
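
One way to tolerate a lock that is released slightly late is to retry the file operation with a short backoff instead of giving up on the first failure. The following is a rough Python sketch of that idea only; it is not the BOINC client's code, and remove_with_retry, its parameters and its defaults are hypothetical.

```python
import os
import time


def remove_with_retry(path, attempts=5, delay=0.1,
                      remove=os.remove, sleep=time.sleep):
    """Try to delete path, retrying while the OS still reports it as busy.

    Returns True if the file is gone afterwards, False if every attempt failed.
    The remove and sleep arguments are injectable so the logic can be tested
    without a real locked file.
    """
    for attempt in range(attempts):
        try:
            remove(path)
            return True
        except FileNotFoundError:
            # Already gone: nothing left to clean up.
            return True
        except OSError:
            # Typically a sharing violation or permission error while another
            # handle is still open. Wait, then try again with a longer delay.
            if attempt + 1 < attempts:
                sleep(delay)
                delay *= 2
    return False
```

The same retry loop could wrap the read of stderr.txt, so that the contents are only copied into memory after the file can be opened successfully. Whether a fixed number of retries is long enough on a loaded Windows host is an open question.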


Log 1 also shows that the additional safeguards on cleaning out slots are working properly: if both handle_exited_app() and get_free_slot() fail to delete the file, the next task isn't started in the not-empty slot (11), but in slot 14 instead. And when slot 11 is tested again at the next get_free_slot(), the delete succeeds and the now-empty slot is reused.
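
The slot behaviour described above can be sketched roughly as follows. This is an illustrative Python model, not the client's actual get_free_slot(); try_clean_slot, pick_free_slot and their parameters are hypothetical names, and the real client does more bookkeeping than this.

```python
import os


def try_clean_slot(slot_dir, remove=os.remove):
    """Try to delete every file in slot_dir; return True only if all deletes worked."""
    ok = True
    for name in os.listdir(slot_dir):
        try:
            remove(os.path.join(slot_dir, name))
        except OSError:
            # A file that cannot be deleted leaves the slot unusable for now.
            ok = False
    return ok


def pick_free_slot(slots_root, busy, max_slots=64, clean=try_clean_slot):
    """Return the lowest slot number that is not busy and can be emptied.

    A slot that still holds an undeletable file is skipped on this call and
    re-tested on the next call, when the delete may succeed.
    """
    for n in range(max_slots):
        if n in busy:
            continue
        slot_dir = os.path.join(slots_root, str(n))
        os.makedirs(slot_dir, exist_ok=True)
        if clean(slot_dir):
            return n
    return None
```

In this model a slot holding a locked stderr.txt is passed over in favour of a higher-numbered slot, and is picked again on a later call once the file can be removed.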

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
