Re: [boinc_dev] astropulse robustness / abandonned tasks

McLeod, John Fri, 08 Aug 2014 11:01:39 -0700

Yes, but it is one reason why checking for a sequence number near 0 versus near 
the current sequence number does not work reliably.


Would resetting the sequence number to 0 if the CPU or OS changed work?  These 
are changes that would tend to indicate a new computer.
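
As a rough illustration of that idea (a sketch only, not BOINC client code): the 
fingerprint fields, the function names, and the choice of `platform.machine()` 
and `platform.system()` as stand-ins for "CPU" and "OS" are all assumptions made 
for this example.

```python
import platform
from typing import NamedTuple


class HostFingerprint(NamedTuple):
    """Hypothetical summary of the attributes that would suggest 'a new computer'."""
    cpu: str
    os_name: str


def current_fingerprint() -> HostFingerprint:
    # platform.machine() reports the CPU architecture, platform.system() the OS family.
    return HostFingerprint(cpu=platform.machine(), os_name=platform.system())


def seqno_after_hardware_check(stored_seqno: int,
                               stored_fp: HostFingerprint,
                               current_fp: HostFingerprint) -> int:
    """Return 0 if the CPU or OS differs from the stored values, else keep the stored number."""
    if stored_fp != current_fp:
        return 0  # treat as a new host: restart the sequence
    return stored_seqno
```

A copied data directory on different hardware would then start again from 0, 
while a copy onto an identical CPU and OS would still carry the old number.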

From: Richard Haselgrove [mailto:[email protected]]
Sent: Friday, August 08, 2014 1:54 PM
To: McLeod, John; Eric J Korpela
Cc: [email protected]
Subject: Re: [boinc_dev] astropulse robustness / abandonned tasks

That is certainly true, but in the cases that I investigated - which started 
with posts in various degrees of indignation on the message boards - the users 
were adamant that copying of data directories had *not* taken place.

________________________________
From: "McLeod, John" <[email protected]>
To: Eric J Korpela <[email protected]>; Richard Haselgrove <[email protected]>
Cc: "[email protected]" <[email protected]>
Sent: Friday, 8 August 2014, 18:47
Subject: RE: [boinc_dev] astropulse robustness / abandonned tasks

One technique used to create a new host that is attached to everything that an 
old host is attached to is to copy the entire data directory from old to new 
and install BOINC on the new host.  This would tend to preserve the sequence 
number.

-----Original Message-----
From: boinc_dev [mailto:[email protected]] On Behalf Of Eric J Korpela
Sent: Friday, August 08, 2014 1:44 PM
To: Richard Haselgrove
Cc: [email protected]
Subject: Re: [boinc_dev] astropulse robustness / abandonned tasks

The other possibility is that the sender doesn't think the prior RPC completed
and so didn't update the sequence number (although I haven't looked at the
code to see if this is possible).  With a mismatch of one out of 59643, it
seems like the server is reaching exactly the wrong conclusion (this is a
new host) rather than the right one (there was a communications problem on
the prior contact).  If it were a new host, shouldn't the sequence number
be near 0?
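
The distinction being drawn here can be sketched as follows. This is not the
BOINC scheduler's logic, only an illustration; the function name, the labels,
and both thresholds (`retry_window`, `new_host_ceiling`) are hypothetical.

```python
def classify_seqno(received: int, expected: int,
                   retry_window: int = 1, new_host_ceiling: int = 10) -> str:
    """Guess why a client's RPC sequence number is lower than the server expects."""
    if received >= expected:
        return "in_sequence"
    if expected - received <= retry_window:
        # Off by a request or so: the client probably never saw the last reply.
        return "probable_lost_reply"
    if received <= new_host_ceiling:
        # A fresh installation would start counting from (near) zero.
        return "probable_new_host"
    return "ambiguous"


# The case from the server log quoted in this thread: seqno 59642, expected 59643.
print(classify_seqno(59642, 59643))
```

Note that a host created by copying an old data directory would carry the old
sequence number with it, so the near-zero test can miss a genuinely new host.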


On Fri, Aug 8, 2014 at 10:19 AM, Richard Haselgrove <[email protected]> wrote:

> In probably the fullest message board description on the last circuit
> round this merry-go-round,
>
> http://setiathome.berkeley.edu/forum_thread.php?id=70946
>
> we observed a number of occasions where client message logs contained
> lines like
>
> 22.05.2013 13:45:56 | SETI@home | Not sending work - last request too recent: 76 sec
>
> at times not unadjacent to the times when abandonments were recorded for
> user tasks. That led us into 'clutching at straws' mode: was another
> computer sending out-of-sequence RPC requests with duplicate credentials?
> (the users swore not). Were entire RPC requests being delayed in a transit
> queue and arriving out of sequence? Unlikely. Was the server receiving the
> RPCs in a timely fashion, but processing them out of order - perhaps
> delaying one because of incomplete packets?
>
> And so on. Most of this was happening before the server move to CoLo, when
> the SETI data line was heavily congested - we thought the problem might
> diminish with the higher-quality internet service at the bottom of the
> hill, and so it seems to have transpired. But that doesn't help our friends
> outwith the continental USA.
>
> Incidentally, I reported seeing one 'last request too recent' in my own
> logs, and traced it back to an internet time update changing the computer
> clock. But I didn't suffer any abandoned tasks in that event.
>
>  ------------------------------
>  *From:* Eric J Korpela <[email protected]>
> *To:* Richard Haselgrove <[email protected]>
> *Cc:* "[email protected]" <[email protected]>
> *Sent:* Friday, 8 August 2014, 17:47
>
> *Subject:* Re: [boinc_dev] astropulse robustness / abandonned tasks
>
> His host seems to be losing track of RPC sequence numbers.  Loss of cached
> writes on restart?
>
> 2014-08-08 07:13:53.1883 [PID=28339]  [HOST#6960982] [USER#8522684] RPC seqno 59642 less than expected 59643; creating new host
> 2014-08-08 07:13:53.1896 [PID=28339]  [HOST#6960982] [USER#8522684] Found similar existing host for this user - assigned.
> 2014-08-08 07:13:53.1932 [PID=28339] [CRITICAL]  [HOST#6960982] [RESULT#3670788988] [WU#1562416658] changed CPID: marking in-progress result 03se08ad.16169.8252.438086664200.12.220_0 as client error!
> 2014-08-08 07:13:53.1932 [PID=28339]  Request: [USER#8522684] [HOST#6960982] [IP 41.79.224.134] client 7.2.42
>
>
>
> On Fri, Aug 8, 2014 at 9:17 AM, Richard Haselgrove <[email protected]> wrote:
>
> > The same user appears to have suffered another 'abandon' event today:
> >
> > http://setiathome.berkeley.edu/results.php?hostid=6960982&state=6
> >
> > The reasons mentioned by Eric are all valid, but there appears to be an
> > irreducible core of sporadic events which cannot be ascribed to user
> > malfeasance. In earlier reports like this, many (but not all) of the
> > cases appeared to be associated with long-distance and/or poor quality
> > internet connections - again, like this one.
> >
> >  ------------------------------
> >  *From:* Eric J Korpela <[email protected]>
> > *To:* "McLeod, John" <[email protected]>
> > *Cc:* "[email protected]" <[email protected]>; Richard Haselgrove <[email protected]>
> > *Sent:* Friday, 8 August 2014, 16:56
> >
> > *Subject:* Re: [boinc_dev] astropulse robustness / abandonned tasks
> >
> > Astropulse does checkpoint quite frequently, and restarts without problem
> > most of the time.  "Abandoned" is definitely a server side decision that
> > indicates a client detach or a reset or some sort of confusion as to the
> > identity of a host and whether it was working on those results.  (Other
> > possibilities include multiple hosts using a copied or shared BOINC
> > directory, multiple copies of BOINC on one host using the same BOINC
> > client directory, deletion or corruption or bad permissions on files in
> > the BOINC client directory, any of which could confuse client or server).
> >
> >
> > Which client version and OS are you using?
> >
> >
> > On Fri, Aug 8, 2014 at 5:55 AM, McLeod, John <[email protected]> wrote:
> >
> > > BOINC has a checkpointing mechanism built in, but it requires that the
> > > project developers write checkpoint code.  Some projects can checkpoint
> > > almost any time, and others can checkpoint only every few minutes, and
> > > some cannot checkpoint at all.  SETI can checkpoint frequently (and
> > > instigated the mechanism to NOT do every possible checkpoint, but only
> > > once every X minutes).  CPDN always checkpoints every time it can
> > > (typically this is several minutes).  I cannot remember an example of
> > > one that cannot checkpoint at all, but they exist.
> > >
> > > -----Original Message-----
> > > From: boinc_dev [mailto:[email protected]] On Behalf Of Richard Haselgrove
> > > Sent: Friday, August 08, 2014 4:48 AM
> > > To: Luc A. Germain; [email protected]
> > > Subject: Re: [boinc_dev] astropulse robustness / abandonned tasks
> > >
> > > The abandoning of tasks happens when the BOINC server 'thinks' that it
> > > has 'evidence' that the client has detached from the project and then
> > > re-attached again. This has affected a number of users in the past, but
> > > has proved extremely tricky to diagnose and resolve: not least, because
> > > most of the evidence resides in the server logs.
> > >
> > > We did investigate one suspected case at Albert during credit testing,
> > > but that turned out to be a genuine 'detach' caused by hard disk failure -
> > > it is distinguished from reports like this one because no running tasks
> > > were left on the host computer (they were on the drive that failed...) to
> > > waste time and electricity.
> > >
> > > I would certainly welcome it if we could pair up a developer and a
> > > project administrator with access to server logs to investigate this
> > > problem and cure it at source.
> > >
> > > The checkpointing question is a matter for the project developers, and
> > > I'll leave it to them to respond via this list.
> > >
> > >
> > >
> > > >________________________________
> > > > From: Luc A. Germain <[email protected]>
> > > >To: [email protected]
> > > >Sent: Friday, 8 August 2014, 9:41
> > > >Subject: [boinc_dev] astropulse robustness / abandonned tasks
> > > >
> > > >
> > > >Hi,
> > > >Two things:
> > > >1) A suggestion here for you developers ;-) As Astropulse tasks take
> > > >"some" time to complete, they are more prone to power failures such as
> > > >we have in the third world. When one happens, most of the time the task
> > > >restarts computing from the start (this is even more frustrating when
> > > >the task is near completion). Could it be possible to introduce regular
> > > >checkpoints by saving intermediate data, or work files, from which the
> > > >task could restart, thus saving a lot of computing time? Maybe this
> > > >could be an option in the user profile, as I guess not everyone needs
> > > >this.
> > > >
> > > >2) Two days ago I sent a message about abandoned tasks. Since then, all
> > > >my computing goes to the garbage bin, as the results are not taken into
> > > >account. Which procedure should/could I try to solve this problem? Could
> > > >uninstalling/reinstalling the application from my computers be a
> > > >solution? Should I wait until the problem solves itself (and would this
> > > >not take ages)?
> > > >
> > > >An answer would be highly appreciated.
> > > >
> > > >Best regards and thanks for your work,
> > > >Luc
> > > >_______________________________________________
> > > >boinc_dev mailing list
> > > >[email protected]
> > > >http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> > > >To unsubscribe, visit the above URL and
> > > >(near bottom of page) enter your email address.

> > > >
> > > >
> > > >
> > > _______________________________________________
> > > boinc_dev mailing list
> > > [email protected]<mailto:[email protected]>
> > > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> > > To unsubscribe, visit the above URL and
> > > (near bottom of page) enter your email address.
> > > _______________________________________________
> > > boinc_dev mailing list
> > > [email protected]<mailto:[email protected]>
> > > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> > > To unsubscribe, visit the above URL and
> > > (near bottom of page) enter your email address.
> > >
> > _______________________________________________
> > boinc_dev mailing list
> > [email protected]<mailto:[email protected]>
> > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> > To unsubscribe, visit the above URL and
> > (near bottom of page) enter your email address.
> >
> >
> >
> _______________________________________________
> boinc_dev mailing list
> [email protected]<mailto:[email protected]>
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> To unsubscribe, visit the above URL and
> (near bottom of page) enter your email address.
>
>
>
_______________________________________________
boinc_dev mailing list
[email protected]<mailto:[email protected]>
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
