Well, well.
In the light of recent developments I debated long and hard whether to
follow this up.
I'm not much for flogging a dead horse [unless it's just pretending to be
dead].
But I'm a bit protective of this particular three-line fix, having spent a
morning analysing sched/handle_request.cpp to track down the bug and come
up with a very simple fix, so I'm giving this another push.

first, background

For quite a few years SETI (and to a lesser extent various other
projects) has been plagued by spontaneous bouts of 'abandoned task'
errors.
Users come in (usually very upset) reporting that a full cache has been
marked 'abandoned' by the server while their rigs were merrily crunching
away.
Abandoned tasks cannot be reported afterwards. If not caught and
manually aborted by the user, a whole cache will just warm the atmosphere.

second, the hunt

After several failed attempts over the past years to trace this, a joint
attempt last week (see
http://setiathome.berkeley.edu/forum_thread.php?id=77597) finally bore
fruit.
It appears _something_ goes wrong in server-client communication and the
client does not receive the server reply. Crucially, the server does not
notice that the reply has not been received and increases rpc_seqno in the
DB anyway. The client, however, thinking it was not heard, resends the
request with the same rpc_seqno. And that fatally triggers lines 335 ff. of
sched/handle_request.cpp. Why fatally? Because once line 419 has checked
that it's not a multi-client host and a new hostid has been assigned, line
426 unconditionally zaps the tasks (marks them as abandoned). [This new
hostid will in fact be the old hostid, but it will trigger the client to
reset rpc_seqno.] The client, however, knows nothing of that decision and
happily continues crunching now-obsolete tasks.
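
To make the sequence of events concrete, here is a toy model of the
lost-reply scenario. It is only a sketch: ToyServer, ToyClient, handle_rpc
and client_attempt are hypothetical names invented for illustration, not
BOINC code, and the seqno comparison is my simplified assumption about what
the scheduler does when a request arrives with an rpc_seqno it has already
recorded.

```cpp
// Hypothetical stand-ins; none of this is the real BOINC code.
struct ToyServer {
    int stored_seqno = 0;       // rpc_seqno the server has recorded in the DB
    int tasks_in_progress = 0;  // tasks the server believes the host holds
    bool abandoned = false;     // set once the host's tasks are marked abandoned
};

struct ToyClient {
    int rpc_seqno = 1;          // sequence number sent with each request
    int tasks_on_board = 0;     // tasks the client is actually crunching
};

// Server side of one scheduler RPC; returns the number of tasks granted.
// ASSUMPTION: a request whose seqno is not ahead of the stored value is
// treated as coming from a stale or copied client state.
int handle_rpc(ToyServer& s, int request_seqno, int tasks_wanted) {
    if (request_seqno <= s.stored_seqno) {
        s.abandoned = true;     // the unconditional "zap"
        s.tasks_in_progress = 0;
        return 0;
    }
    s.stored_seqno = request_seqno;  // recorded even if the reply is lost
    s.tasks_in_progress += tasks_wanted;
    return tasks_wanted;
}

// One client attempt. reply_delivered == false models a lost reply:
// the server has processed the request, but the client never learns of it
// and therefore keeps its rpc_seqno unchanged for the retry.
void client_attempt(ToyClient& c, ToyServer& s, int tasks_wanted,
                    bool reply_delivered) {
    int granted = handle_rpc(s, c.rpc_seqno, tasks_wanted);
    if (reply_delivered) {
        c.tasks_on_board += granted;
        c.rpc_seqno++;
    }
}
```

Run one normal request that fetches five tasks, then one request whose
reply is lost, then the retry: the server ends up with abandoned == true
while the client still has all five tasks on board.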

third, the fix

Now, we know the client reports whether it has tasks (from that project)
on board, and which ones. It's in 'other_results'. And lo and behold, in
line 400, in a slightly different context, exactly that safety check is made:
(g_request->other_results.size() == 0)



so, please please please, with a cherry on top, can we have an
if (g_request->other_results.size() == 0)
applied to line 426?
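
For anyone who wants to see the shape of the request without opening the
file, here is a minimal self-contained sketch. ToyRequest, ToyHost and
handle_seqno_mismatch are hypothetical stand-ins of mine; only the names
other_results and mark_results_over come from the scheduler source, and the
real function takes the real host record, not this toy struct. The
multi-client check is assumed to have happened already, as it does in the
code path in question.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the parts of the scheduler request that matter.
struct ToyRequest {
    // Tasks the client says it still has for this project.
    std::vector<std::string> other_results;
};

// Hypothetical stand-in for the host record.
struct ToyHost {
    bool results_marked_over = false;
};

// Toy version of mark_results_over(host): flags the host's in-progress
// results as over (abandoned).
void mark_results_over(ToyHost& host) {
    host.results_marked_over = true;
}

// The proposed guard: only zap the tasks when the client reports that it
// has none on board. A client that is still crunching keeps its cache.
void handle_seqno_mismatch(const ToyRequest& request, ToyHost& host) {
    if (request.other_results.size() == 0) {
        mark_results_over(host);
    }
}
```

With a request that lists two tasks in other_results, the host is left
alone; with an empty other_results, the results are marked over exactly as
today.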



To quote Blaise Pascal: 'I have made this letter longer than usual, only
because I have not had time to make it shorter.'

Apart from that, Richard tried short and it fell on deaf ears. So I have
told a story (I hope to the enlightenment and amusement of my fellow list
subscribers) and remain

                 in the hope of a speedy check-in

         yours sincerely
                                        William Stilte


2015-07-01 13:30 GMT+02:00 Richard Haselgrove <[email protected]>:

> David,
> Six years ago, you added an exclusion to 'handle_request.cpp':
> - Scheduler: in no-host-ID case, don't mark results as "detached" if
> request contains any in-progress results
>
> http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commitdiff;h=4222d744e88163a4b02d269349c51cd22d6826ca
>
> Could you apply the second part of that test to the other call
> to mark_results_over(host), in line 426, please? It only needs the
> '(g_request->other_results.size() == 0)' test, as the second case already
> has the (g_request->allow_multiple_clients != 1) exclusion.
> That would greatly help to prevent periodic complaints about wasted
> computing time, such as
> http://setiathome.berkeley.edu/forum_thread.php?id=77597
>
> Eric,
> Could you apply that patch locally to SETI and SETI Beta, please, and
> monitor to verify that it solves the problem with
> http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71141
> That's the one which Jason asked you to look at the server logs for (email
> "auto abandoned tasks issue"), because it seems to suffer disproportionally
> from whatever is triggering the scheduler to follow this code path.
> _______________________________________________
> boinc_dev mailing list
> [email protected]
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> To unsubscribe, visit the above URL and
> (near bottom of page) enter your email address.