Thanks; I checked in the fix.
-- David
On 08-Jul-2015 2:17 AM, William Stilte wrote:
Well, well.
In light of recent developments I debated at length whether to follow this up. I'm
not much for flogging a dead horse [unless it's just pretending to be dead].
But I'm a bit protective of this particular three-line fix: I spent a morning
analysing sched/handle_request.cpp to track down the bug and come up with a very
simple fix, so I'm giving this another push.
first, background
For quite a few years SETI (and, to a lesser extent, various other projects) has
been plagued by spontaneous bouts of 'abandoned task' errors.
Users come in (usually very upset) reporting that a full cache has been marked
'abandoned' by the server while their rigs were still merrily crunching away.
Abandoned tasks cannot be reported afterwards. If not caught and manually
aborted by the user, a whole cache will just warm the atmosphere.
second, the hunt
After several failed attempts over the past years to trace this, a joint attempt
last week (see http://setiathome.berkeley.edu/forum_thread.php?id=77597) finally
bore fruit.
It appears _something_ goes wrong in server-client communication and the client
does not receive the server reply. Crucially, the server does not notice that the
reply has not been received and increases rpc_seqno in the DB. The client, however,
thinking it was not heard, resends the request with the same rpc_seqno. That
fatally triggers lines 335 ff. of sched/handle_request.cpp. Why fatally? Because
in line 426 the tasks get unconditionally zapped (marked as abandoned) after line
419 checks it's not a multi-client host and assigns a new hostid. [This new hostid
will in fact be the old hostid, but will trigger the client to reset rpc_seqno.]
The client, however, knows nothing of that decision and continues happily crunching
now-obsolete tasks.
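The sequence described above can be replayed in a small model. This is only a sketch: ServerHost, ClientState, server_handle and client_rpc are hypothetical names invented for illustration, and the "request seqno below the stored value" comparison is an assumption about how the stale-request check behaves, not a copy of the code at lines 335 ff.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT the real BOINC types.
struct ServerHost {
    int rpc_seqno = 0;                     // value stored in the DB (assumed model)
    std::vector<std::string> in_progress;  // tasks the server believes the host holds
};

struct ClientState {
    int rpc_seqno = 0;                     // sequence number the client sends
    std::vector<std::string> tasks;        // tasks the client is actually crunching
};

// Simplified server side. A request whose seqno is below the stored value is
// treated as stale, and every in-progress task is dropped unconditionally.
// Returns true when the tasks were marked abandoned.
bool server_handle(ServerHost& host, int request_seqno) {
    if (request_seqno < host.rpc_seqno) {
        host.in_progress.clear();  // models the unconditional "zap"
        host.rpc_seqno = 0;        // assumed: the host record starts over
        return true;
    }
    host.rpc_seqno = request_seqno + 1;  // server advances its stored value
    return false;
}

// One scheduler RPC. reply_delivered == false models the lost reply: the
// server has already updated its state, but the client never learns of it.
bool client_rpc(ClientState& client, ServerHost& host, bool reply_delivered) {
    const bool abandoned = server_handle(host, client.rpc_seqno);
    if (reply_delivered) {
        // Assumed: a normal reply advances the client, a "new host" reply resets it.
        client.rpc_seqno = abandoned ? 0 : client.rpc_seqno + 1;
    }
    return abandoned;
}
```

In this model, one call with reply_delivered == false followed by a resend is enough to empty host.in_progress, while client.tasks is untouched: the client keeps crunching work the server has already written off.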
third, the fix
Now, we know the client reports whether it has tasks (from that project) on board,
and which ones. It's in 'other_results'. And lo and behold, in line 400, in a
slightly different context, exactly that safety check is made:
(g_request->other_results.size() == 0)
So, please please please, with a cherry on top, can we have an
if (g_request->other_results.size() == 0) applied to line 426?
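As a rough illustration of the requested guard, the decision can be written as a standalone predicate. RequestView and may_mark_results_over are hypothetical names made up for this sketch; only the two conditions themselves (allow_multiple_clients and other_results.size() == 0) come from the discussion, and the real scheduler code is structured differently.

```cpp
#include <cstddef>

// Hypothetical, reduced view of the scheduler request. The field names follow
// the email; the struct itself exists only for this illustration.
struct RequestView {
    int allow_multiple_clients;      // 1 when the host may run several clients
    std::size_t other_results_size;  // stands in for g_request->other_results.size()
};

// Decide whether the stale-seqno path may mark the host's results as over.
bool may_mark_results_over(const RequestView& req) {
    if (req.allow_multiple_clients == 1) {
        return false;  // multi-client hosts are already excluded on this path
    }
    // Proposed extra guard: a client that still reports tasks is alive and
    // working on them, so its results should not be abandoned.
    return req.other_results_size == 0;
}
```

With this predicate, a single-client host reporting five tasks is left alone, and only a request that reports no tasks at all lets the results be marked as abandoned.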
I quote you Blaise Pascal 'I have made this letter longer than usual, only because
I have not had time to make it shorter.'
Apart from that, Richard tried short and fell on deaf ears. I have now told a story
(I hope to the enlightenment and amusement of my fellow list subscribers) and remain
in the hope of a speedy check-in
yours sincerely
William Stilte
2015-07-01 13:30 GMT+02:00 Richard Haselgrove <[email protected]>:
David,
Six years ago, you added an exclusion to 'handle_request.cpp':
- Scheduler: in no-host-ID case, don't mark results as "detached" if request
contains any in-progress results
http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commitdiff;h=4222d744e88163a4b02d269349c51cd22d6826ca
Could you apply the second part of that test to the other call to
mark_results_over(host), in line 426, please? It only needs the
'(g_request->other_results.size() == 0)' test, as the second case already has
the (g_request->allow_multiple_clients != 1) exclusion.
That would greatly help to prevent periodic complaints about wasted computing
time, such as http://setiathome.berkeley.edu/forum_thread.php?id=77597
Eric,
Could you apply that patch locally to SETI and SETI Beta, please, and monitor
to verify that it solves the problem with
http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71141
That's the one which Jason asked you to look at the server logs for (email
"auto abandoned tasks issue"), because it seems to suffer disproportionately
from whatever is triggering the scheduler to follow this code path.
_______________________________________________
boinc_dev mailing list
[email protected] <mailto:[email protected]>
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.