Well, well. In the light of recent developments I strongly debated following this up. I'm not much for flogging a dead horse [unless it's just pretending to be dead]. But seeing that I'm a bit protective of this particular three-line fix, having spent a morning analysing sched/handle_request.cpp to track down the bug and come up with a very simple fix, I'm giving this another push.

first, background

For quite a few years SETI (and to some lesser extent various other projects) have been plagued by bouts of spontaneous fits of 'abandoned task' errors. Users come in (usually very upset) reporting a full cache having been marked 'abandoned' by the server, while merrily crunching away on their rigs. Abandoned tasks cannot be reported in afterwards. If not caught and manually aborted by the user, a whole cache will just warm the atmosphere.

second, the hunt

After several failed attempts over the past years of tracing this, a joint attempt last week (see http://setiathome.berkeley.edu/forum_thread.php?id=77597 ) finally bore fruit. It appears _something_ goes wrong in server-client communication and the client does not receive the server reply. And crucially the server does not notice the reply has not been received and increases rpc_seqno in the DB. The client however, thinking it was not heard, resends the request. With the same rpc_seqno. And that fatally triggers lines 335 ff. of sched/handle_request.cpp.

Why fatally? Because in line 426 the tasks get unconditionally zapped (marked as abandoned) after line 419 checks it's not a multi-client host and assigns a new hostid. [This new hostid will in fact be the old hostid, but will trigger the client to reset rpc_seqno]. The client however, knows nothing of that decision and continues happily crunching now obsolete tasks.

third, the fix

Now, we know the client reports if and what tasks (from that project) he has on board. It's in 'other_results'. And lo and behold in line 400 in a slightly different context exactly that safety check is made:

    (g_request->other_results.size() == 0)

so, please please please, with a cherry on the top, can we have an

    if (g_request->other_results.size() == 0)

applied to line 426?

I quote you Blaise Pascal: 'I have made this letter longer than usual, only because I have not had time to make it shorter.' Apart from that, Richard tried short and fell on deaf ears.

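For anyone who would rather read code than prose, here is a toy model of the branch in question. It is a sketch only, not the scheduler source: the Task, Host and Request structs, on_seqno_mismatch() and simulate_resend() are hypothetical stand-ins I made up for illustration. Only the names other_results, allow_multiple_clients and mark_results_over are taken from the discussion, and the model starts at the point where the scheduler has already decided the rpc_seqno does not match.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, stripped-down stand-ins for the scheduler's data.
struct Task {
    std::string name;
    bool abandoned;
};

struct Host {
    std::vector<Task> in_progress;  // what the DB believes the host holds
};

struct Request {
    int allow_multiple_clients;
    std::vector<std::string> other_results;  // tasks the client says it still has
};

// Toy version: marks every in-progress task of the host as abandoned.
void mark_results_over(Host& host) {
    for (Task& t : host.in_progress) {
        t.abandoned = true;
    }
}

// Models only the branch taken once the rpc_seqno mismatch has been detected.
// 'guarded' switches the requested other_results test on or off.
void on_seqno_mismatch(const Request& req, Host& host, bool guarded) {
    if (req.allow_multiple_clients == 1) {
        return;  // multi-client hosts are already excluded
    }
    if (guarded && req.other_results.size() != 0) {
        return;  // client reports work on board: leave its tasks alone
    }
    mark_results_over(host);
}

std::size_t count_abandoned(const Host& host) {
    std::size_t n = 0;
    for (const Task& t : host.in_progress) {
        if (t.abandoned) ++n;
    }
    return n;
}

// A client with three tasks on board resends a request whose reply got lost.
// Returns how many of its tasks end up abandoned.
std::size_t simulate_resend(bool guarded) {
    Host host;
    host.in_progress = {{"wu_1", false}, {"wu_2", false}, {"wu_3", false}};

    Request req;
    req.allow_multiple_clients = 0;
    req.other_results = {"wu_1", "wu_2", "wu_3"};

    assert(count_abandoned(host) == 0);  // nothing is abandoned beforehand
    on_seqno_mismatch(req, host, guarded);
    return count_abandoned(host);
}
```

In this toy model simulate_resend(false) returns 3 (the whole cache is zapped, which is what users keep reporting) and simulate_resend(true) returns 0. A request that reports no tasks at all would still have its results marked over in both cases, so the existing behaviour for genuinely detached hosts stays as it is.
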
I told a story now (I hope to the enlightenment and amusement of my fellow list subscribers) and remain in the hope of a speedy check-in

yours sincerely
William Stilte

2015-07-01 13:30 GMT+02:00 Richard Haselgrove <[email protected]>:

> David,
>
> Six years ago, you added an exclusion to 'handle_request.cpp':
>
> - Scheduler: in no-host-ID case, don't mark results as "detached" if
>   request contains any in-progress results
>
> http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commitdiff;h=4222d744e88163a4b02d269349c51cd22d6826ca
>
> Could you apply the second part of that test to the other call
> to mark_results_over(host), in line 426, please? It only needs the
> '(g_request->other_results.size() == 0)' test, as the second case already
> has the (g_request->allow_multiple_clients != 1) exclusion.
>
> That would greatly help to prevent periodic complaints about wasted
> computing time, such as
> http://setiathome.berkeley.edu/forum_thread.php?id=77597
>
> Eric,
>
> Could you apply that patch locally to SETI and SETI Beta, please, and
> monitor to verify that it solves the problem with
> http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71141
>
> That's the one which Jason asked you to look at the server logs for (email
> "auto abandoned tasks issue"), because it seems to suffer disproportionally
> from whatever is triggering the scheduler to follow this code path.
> _______________________________________________
> boinc_dev mailing list
> [email protected]
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> To unsubscribe, visit the above URL and
> (near bottom of page) enter your email address.
