Thanks; I checked in the fix.
-- David

On 08-Jul-2015 2:17 AM, William Stilte wrote:
Well, well.
In light of recent developments I debated at length whether to follow this up. I'm not much for flogging a dead horse [unless it's just pretending to be dead]. But I'm a bit protective of this particular three-line fix, having spent a morning analysing sched/handle_request.cpp to track down the bug and come up with a very simple fix, so I'm giving this another push.

first, background

For quite a few years SETI (and, to a lesser extent, various other projects) has been plagued by spontaneous bouts of 'abandoned task' errors. Users come in (usually very upset) reporting that a full cache has been marked 'abandoned' by the server while their rigs were merrily crunching away. Abandoned tasks cannot be reported afterwards. If not caught and manually aborted by the user, a whole cache will just warm the atmosphere.

second, the hunt

After several failed attempts over the past years to trace this, a joint attempt last week (see http://setiathome.berkeley.edu/forum_thread.php?id=77597 ) finally bore fruit. It appears _something_ goes wrong in server-client communication and the client does not receive the server reply. Crucially, the server does not notice that the reply has not been received and increases rpc_seqno in the DB. The client, however, thinking it was not heard, resends the request with the same rpc_seqno. And that fatally triggers lines 335 ff. of sched/handle_request.cpp. Why fatally? Because in line 426 the tasks get unconditionally zapped (marked as abandoned) after line 419 checks it's not a multi-client host and assigns a new hostid. [This new hostid will in fact be the old hostid, but will trigger the client to reset rpc_seqno.] The client, however, knows nothing of that decision and happily continues crunching now-obsolete tasks.
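
To make the sequence above concrete, here is a toy C++ model of the exchange. It is only a sketch under stated assumptions: ToyServer, ToyClient, and the exact comparison used are hypothetical names and logic chosen for illustration, not the code in sched/handle_request.cpp.

```cpp
#include <cassert>

// Toy model only; these types are hypothetical and not part of BOINC.
struct ToyServer {
    int expected_seqno = 0;   // stands in for the rpc_seqno kept in the DB
    int times_abandoned = 0;  // how often a host's tasks were marked abandoned

    // Process one request. Returns true if it was accepted as in-sequence.
    bool handle(int request_seqno) {
        assert(request_seqno >= 0);
        if (request_seqno < expected_seqno) {
            // Server believes this is a stale or cloned client.
            ++times_abandoned;
            expected_seqno = 0;  // the client will be told to start over
            return false;
        }
        // Advanced whether or not the reply ever reaches the client.
        expected_seqno = request_seqno + 1;
        return true;
    }
};

struct ToyClient {
    int seqno = 0;

    // Send one request. The client only learns the outcome, and only updates
    // its seqno, when the reply actually arrives.
    bool send(ToyServer& server, bool reply_arrives) {
        const bool accepted = server.handle(seqno);
        if (reply_arrives) {
            seqno = accepted ? seqno + 1 : 0;
        }
        return accepted;
    }
};
```

In this model, one normal request followed by a request whose reply is lost leaves the server one step ahead of the client. The client's resend then carries a seqno the server considers too old, and the abandonment branch fires even though the client is the same machine with the same tasks.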

third, the fix

Now, we know the client reports whether it has tasks (from that project) on board, and which ones. It's in 'other_results'. And lo and behold, in line 400, in a slightly different context, exactly that safety check is made: (g_request->other_results.size() == 0)



so, please please please, with a cherry on the top, can we have an if (g_request->other_results.size() == 0) applied to line 426?
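
As a rough illustration of the requested guard, the decision can be written as a small predicate. This is a sketch, not the actual patch: ToyRequest and may_mark_results_over are hypothetical names, and the two fields only mirror g_request->allow_multiple_clients and g_request->other_results.size() as they are described in this thread.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the parts of the scheduler request that matter here.
struct ToyRequest {
    bool allow_multiple_clients;      // mirrors g_request->allow_multiple_clients
    std::size_t other_results_count;  // mirrors g_request->other_results.size()
};

// Decide whether the host's in-progress results may be marked as over
// (abandoned) after a sequence-number mismatch.
inline bool may_mark_results_over(const ToyRequest& req) {
    if (req.allow_multiple_clients) {
        return false;  // multi-client hosts are already excluded
    }
    // Proposed extra check: a client that still reports tasks on board
    // has clearly not lost them, so leave them alone.
    return req.other_results_count == 0;
}
```

With this predicate, a single-client host that reports no tasks is still cleaned up as before, while a host that reports tasks in progress keeps them.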



I quote you Blaise Pascal 'I have made this letter longer than usual, only because I have not had time to make it shorter.'

Apart from that, Richard tried short and fell on deaf ears. I told a story now (I hope to the enlightenment and amusement of my fellow list subscribers) and remain

                 in the hope of a speedy check-in

         yours sincerely
William Stilte


2015-07-01 13:30 GMT+02:00 Richard Haselgrove:

    David,
    Six years ago, you added an exclusion to 'handle_request.cpp':
    - Scheduler: in no-host-ID case, don't mark results as "detached" if request
      contains any in-progress results
    http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commitdiff;h=4222d744e88163a4b02d269349c51cd22d6826ca

    Could you apply the second part of that test to the other call
    to mark_results_over(host), in line 426, please? It only needs the
    '(g_request->other_results.size() == 0)' test, as the second case already has
    the (g_request->allow_multiple_clients != 1) exclusion.
    That would greatly help to prevent periodic complaints about wasted computing
    time, such as http://setiathome.berkeley.edu/forum_thread.php?id=77597

    Eric,
    Could you apply that patch locally to SETI and SETI Beta, please, and monitor
    to verify that it solves the problem with
    http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71141
    That's the one which Jason asked you to look at the server logs for (email
    "auto abandoned tasks issue"), because it seems to suffer disproportionally
    from whatever is triggering the scheduler to follow this code path.
    _______________________________________________
    boinc_dev mailing list
    [email protected] <mailto:[email protected]>
    http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
    To unsubscribe, visit the above URL and
    (near bottom of page) enter your email address.


