Re: [boinc_dev] astropulse robustness / abandonned tasks

Richard Haselgrove Fri, 08 Aug 2014 10:22:54 -0700

In what is probably the fullest message board description from the last circuit 
of this merry-go-round,


http://setiathome.berkeley.edu/forum_thread.php?id=70946


we observed a number of occasions where client message logs contained lines like

22.05.2013 13:45:56 | SETI@home | Not sending work - last request too recent: 76 sec


at times close to those when abandonments were recorded for the users' tasks. 
That led us into 'clutching at straws' mode: was another computer sending 
out-of-sequence RPC requests with duplicate credentials? (The users swore 
not.) Were entire RPC requests being delayed in a transit queue and 
arriving out of sequence? Unlikely. Was the server receiving the RPCs in a 
timely fashion, but processing them out of order - perhaps delaying one because 
of incomplete packets?
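
To make the out-of-sequence hypothesis concrete, here is a minimal sketch of the kind of check implied by the server log message quoted further down ("RPC seqno ... less than expected ...; creating new host"). It is an illustration only, not the BOINC scheduler code: HostRecord, handle_rpc and the field names are hypothetical, and the assumption is simply that the server stores the last sequence number it accepted and treats any lower number as evidence of a different client installation.

```python
from dataclasses import dataclass, field


@dataclass
class HostRecord:
    """Hypothetical server-side view of one host."""
    host_id: int
    rpc_seqno: int  # last sequence number the server accepted
    in_progress: list = field(default_factory=list)


def handle_rpc(host: HostRecord, request_seqno: int) -> list:
    """Return the tasks that this request would cause to be abandoned.

    Assumption: a request whose seqno is lower than the stored one is
    treated as coming from a reset or copied client, so every task still
    in progress for the host is given up.
    """
    if request_seqno < host.rpc_seqno:
        abandoned = list(host.in_progress)
        host.in_progress.clear()
        host.rpc_seqno = request_seqno
        return abandoned
    # Normal case: the sequence number has not gone backwards.
    host.rpc_seqno = request_seqno
    return []


if __name__ == "__main__":
    # Request 59643 was processed first; the delayed 59642 arrives afterwards.
    host = HostRecord(host_id=6960982, rpc_seqno=59643,
                      in_progress=["task_a", "task_b"])
    print(handle_rpc(host, 59642))
```

Under that assumption, a single request that is delayed, replayed, or processed late is enough to trigger the abandonment, which is why the ordering of RPCs on arrival matters so much here.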

And so on. Most of this was happening before the server move to CoLo, when the 
SETI data line was heavily congested - we thought the problem might diminish 
with the higher-quality internet service at the bottom of the hill, and so it 
seems to have transpired. But that doesn't help our friends outwith the 
continental USA.

Incidentally, I reported seeing one 'last request too recent' in my own logs, 
and traced it back to an internet time update changing the computer clock. But 
I didn't suffer any abandoned tasks in that event.
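
For anyone wanting to check their own logs for the same pattern, the lines can be picked out mechanically and compared with the times of any abandonments. A minimal sketch, assuming the client message log format shown in the example line above; the day.month.year timestamp is a guess from that one sample and may differ with client locale, and find_too_recent is a hypothetical helper name:

```python
import re
from datetime import datetime

# Matches lines such as:
# 22.05.2013 13:45:56 | SETI@home | Not sending work - last request too recent: 76 sec
PATTERN = re.compile(
    r"^(?P<stamp>\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2}) \| "
    r"(?P<project>[^|]+?) \| "
    r"Not sending work - last request too recent: (?P<secs>\d+) sec"
)


def find_too_recent(lines):
    """Return (timestamp, project, seconds) for each matching log line."""
    hits = []
    for line in lines:
        m = PATTERN.match(line.strip())
        if m:
            # Assumed timestamp layout: day.month.year hour:minute:second
            when = datetime.strptime(m.group("stamp"), "%d.%m.%Y %H:%M:%S")
            hits.append((when, m.group("project"), int(m.group("secs"))))
    return hits


if __name__ == "__main__":
    sample = [
        "22.05.2013 13:45:56 | SETI@home | Not sending work - "
        "last request too recent: 76 sec",
        "22.05.2013 13:46:01 | SETI@home | Some unrelated message",
    ]
    for when, project, secs in find_too_recent(sample):
        print(when.isoformat(), project, secs)
```

The timestamps returned can then be lined up by hand against the abandonment times shown on the project's task pages.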


>________________________________
> From: Eric J Korpela <[email protected]>
>To: Richard Haselgrove <[email protected]> 
>Cc: "[email protected]" <[email protected]> 
>Sent: Friday, 8 August 2014, 17:47
>Subject: Re: [boinc_dev] astropulse robustness / abandonned tasks
> 
>
>His host seems to be losing track of RPC sequence numbers.  Loss of cached
>writes on restart?
>
>2014-08-08 07:13:53.1883 [PID=28339]   [HOST#6960982] [USER#8522684] RPC seqno 59642 less than expected 59643; creating new host
>2014-08-08 07:13:53.1896 [PID=28339]   [HOST#6960982] [USER#8522684] Found similar existing host for this user - assigned.
>2014-08-08 07:13:53.1932 [PID=28339] [CRITICAL]  [HOST#6960982] [RESULT#3670788988] [WU#1562416658] changed CPID: marking in-progress result 03se08ad.16169.8252.438086664200.12.220_0 as client error!
>2014-08-08 07:13:53.1932 [PID=28339]   Request: [USER#8522684] [HOST#6960982] [IP 41.79.224.134] client 7.2.42
>
>
>
>On Fri, Aug 8, 2014 at 9:17 AM, Richard Haselgrove <[email protected]> wrote:
>
>> The same user appears to have suffered another 'abandon' event today:
>>
>> http://setiathome.berkeley.edu/results.php?hostid=6960982&state=6
>>
>> The reasons mentioned by Eric are all valid, but there appears to be an
>> irreducible core of sporadic events which cannot be ascribed to user
>> malfeasance. In earlier reports like this, many (but not all) of the cases
>> appeared to be associated with long-distance and/or poor quality internet
>> connections - again, like this one.
>>
>>   ------------------------------
>>  *From:* Eric J Korpela <[email protected]>
>> *To:* "McLeod, John" <[email protected]>
>> *Cc:* "[email protected]" <[email protected]>; Richard
>> Haselgrove <[email protected]>
>> *Sent:* Friday, 8 August 2014, 16:56
>>
>> *Subject:* Re: [boinc_dev] astropulse robustness / abandonned tasks
>>
>> Astropulse does checkpoint quite frequently, and restarts without problem
>> most of the time.  "Abandoned" is definitely a server side decision that
>> indicates a client detach or a reset or some sort of confusion as to the
>> identity of a host and whether it was working on those results.  (Other
>> possibilities include multiple hosts using a copied or shared BOINC
>> directory, multiple copies of BOINC on one host using the same BOINC client
>> directory, deletion or corruption or bad permissions on files in the BOINC
>> client directory, any of which could confuse client or server).
>>
>>
>> Which client version and OS are you using?
>>
>>
>> On Fri, Aug 8, 2014 at 5:55 AM, McLeod, John <[email protected]> wrote:
>>
>> > BOINC has a checkpointing mechanism built in, but it requires that the
>> > project developers write checkpoint code.  Some projects can checkpoint
>> > almost any time, and others can checkpoint only every few minutes, and
>> > some cannot checkpoint at all.  SETI can checkpoint frequently (and
>> > instigated the mechanism to NOT do every possible checkpoint, but only
>> > once every X minutes).  CPDN always checkpoints every time it can
>> > (typically this is several minutes).  I cannot remember an example of
>> > one that cannot checkpoint at all, but they exist.
>> >
>> > -----Original Message-----
>> > From: boinc_dev [mailto:[email protected]] On Behalf Of
>> > Richard Haselgrove
>> > Sent: Friday, August 08, 2014 4:48 AM
>> > To: Luc A. Germain; [email protected]
>> > Subject: Re: [boinc_dev] astropulse robustness / abandonned tasks
>> >
>> > The abandoning of tasks happens when the BOINC server 'thinks' that it
>> > has 'evidence' that the client has detached from the project and then
>> > re-attached again. This has affected a number of users in the past, but
>> > has proved extremely tricky to diagnose and resolve: not least, because
>> > most of the evidence resides in the server logs.
>> >
>> > We did investigate one suspected case at Albert during credit testing,
>> > but that turned out to be a genuine 'detach' caused by hard disk failure
>> > - it is distinguished from reports like this one because no running
>> > tasks were left on the host computer (they were on the drive that
>> > failed...) to waste time and electricity.
>> >
>> > I would certainly welcome it if we could pair up a developer and a
>> > project administrator with access to server logs to investigate this
>> > problem and cure it at source.
>> >
>> > The checkpointing question is a matter for the project developers, and
>> > I'll leave it to them to respond via this list.
>> >
>> >
>> >
>> > >________________________________
>> > > From: Luc A. Germain <[email protected]>
>> > >To: [email protected]
>> > >Sent: Friday, 8 August 2014, 9:41
>> > >Subject: [boinc_dev] astropulse robustness / abandonned tasks
>> > >
>> > >
>> > >Hi,
>> > >Two things:
>> > >1) A suggestion here for you developers ;-) As astropulse tasks take
>> > >"some" time to complete they are more prone to power failures such as
>> > >we have in the third world. When it happens most of the time the task
>> > >restarts computing from start (this is even more frustrating when the
>> > >task reaches near completion). Could it be possible to introduce
>> > >regular checkpoints by saving intermediate data, or work files, where
>> > >the task computing could restart from, so saving a lot of computing
>> > >time? Maybe this could be an option in the user profile as I guess not
>> > >everyone needs this.
>> > >
>> > >2) Two days ago I sent a message about abandoned tasks. Since then, all
>> > >my computing goes to the garbage bin as the results are not taken into
>> > >account. Which procedure should/could I try to solve this problem?
>> > >Could uninstalling/reinstalling the application from my computers be a
>> > >solution? Should I wait till the problem solves by itself (and would
>> > >this not take ages)?
>> > >
>> > >An answer would be highly appreciated.
>> > >
>> > >Best regards and thanks for your work,
>> > >Luc
>> > >_______________________________________________
>> > >boinc_dev mailing list
>> > >[email protected]
>> > >http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>> > >To unsubscribe, visit the above URL and
>> > >(near bottom of page) enter your email address.
>> > >
>> > >
>> > >
>> >
>>
>>
>>
>
>
>
