In probably the fullest message board description on the last circuit round this merry-go-round,
http://setiathome.berkeley.edu/forum_thread.php?id=70946
we observed a number of occasions where client message logs contained lines like

22.05.2013 13:45:56 | SETI@home | Not sending work - last request too recent: 76 sec

at times not unadjacent to the times when abandonments were recorded for user tasks.

That led us into 'clutching at straws' mode: was another computer sending out-of-sequence RPC requests with duplicate credentials? (the users swore not). Were entire RPC requests being delayed in a transit queue and arriving out of sequence? Unlikely. Was the server receiving the RPCs in a timely fashion, but processing them out of order - perhaps delaying one because of incomplete packets? And so on.

Most of this was happening before the server move to CoLo, when the SETI data line was heavily congested - we thought the problem might diminish with the higher-quality internet service at the bottom of the hill, and so it seems to have transpired. But doesn't help our friends outwith the continental USA.

Incidentally, I reported seeing one 'last request too recent' in my own logs, and traced it back to an internet time update changing the computer clock. But I didn't suffer any abandoned tasks in that event.

>________________________________
> From: Eric J Korpela <[email protected]>
>To: Richard Haselgrove <[email protected]>
>Cc: "[email protected]" <[email protected]>
>Sent: Friday, 8 August 2014, 17:47
>Subject: Re: [boinc_dev] astropulse robustness / abandonned tasks
>
>
>His host seems to be losing track of RPC sequence numbers. Loss of cached
>writes on restart?
>
>2014-08-08 07:13:53.1883 [PID=28339] [HOST#6960982] [USER#8522684] RPC
>seqno 59642 less than expected 59643; creating new host
>2014-08-08 07:13:53.1896 [PID=28339] [HOST#6960982] [USER#8522684] Found
>similar existing host for this user - assigned.
>2014-08-08 07:13:53.1932 [PID=28339] [CRITICAL] [HOST#6960982]
>[RESULT#3670788988] [WU#1562416658] changed CPID: marking in-progress
>result 03se08ad.16169.8252.438086664200.12.220_0 as client error!
>2014-08-08 07:13:53.1932 [PID=28339] Request: [USER#8522684]
>[HOST#6960982] [IP 41.79.224.134] client 7.2.42
>
>
>
>On Fri, Aug 8, 2014 at 9:17 AM, Richard Haselgrove <
>[email protected]> wrote:
>
>> The same user appears to have suffered another 'abandon' event today:
>>
>> http://setiathome.berkeley.edu/results.php?hostid=6960982&state=6
>>
>> The reasons mentioned by Eric are all valid, but there appears to be an
>> irreducible core of sporadic events which cannot be ascribed to user
>> malfeasance. In earlier reports like this, many (but not all) of the cases
>> appeared to be associated with long-distance and/or poor quality internet
>> connections - again, like this one.
>>
>> ------------------------------
>> *From:* Eric J Korpela <[email protected]>
>> *To:* "McLeod, John" <[email protected]>
>> *Cc:* "[email protected]" <[email protected]>; Richard
>> Haselgrove <[email protected]>
>> *Sent:* Friday, 8 August 2014, 16:56
>>
>> *Subject:* Re: [boinc_dev] astropulse robustness / abandonned tasks
>>
>> Astropulse does checkpoint quite frequently, and restarts without problem
>> most of the time. "Abandoned" is definitely a server side decision that
>> indicates a client detach or a reset or some sort of confusion as to the
>> identity of a host and whether it was working on those results. (Other
>> possibilities include multiple hosts using a copied or shared BOINC
>> directory, multiple copies of BOINC on one host using the same BOINC client
>> directory, deletion or corruption or bad permissions on files in the BOINC
>> client directory, any of which could confuse client or server).
>>
>>
>> Which client version and OS are you using?
>>
>>
>> On Fri, Aug 8, 2014 at 5:55 AM, McLeod, John <[email protected]> wrote:
>>
>> > BOINC has a checkpointing mechanism built in, but it requires that the
>> > project developers write checkpoint code. Some projects can checkpoint
>> > almost any time, and others can checkpoint only every few minutes, and some
>> > cannot checkpoint at all. SETI can checkpoint frequently (and instigated
>> > the mechanism to NOT do every possible checkpoint, but only once every X
>> > minutes). CPDN always checkpoints every time it can (typically this is
>> > several minutes). I cannot remember an example of one that cannot
>> > checkpoint at all, but they exist.
>> >
>> > -----Original Message-----
>> > From: boinc_dev [mailto:[email protected]] On Behalf Of
>> > Richard Haselgrove
>> > Sent: Friday, August 08, 2014 4:48 AM
>> > To: Luc A. Germain; [email protected]
>> > Subject: Re: [boinc_dev] astropulse robustness / abandonned tasks
>> >
>> > The abandoning of tasks happens when the BOINC server 'thinks' that it has
>> > 'evidence' that the client has detached from the project and then
>> > re-attached again. This has affected a number of users in the past, but has
>> > proved extremely tricky to diagnose and resolve: not least, because most of
>> > the evidence resides in the server logs.
>> >
>> > We did investigate one suspected case at Albert during credit testing, but
>> > that turned out to be a genuine 'detach' caused by hard disk failure - it
>> > is distinguished from reports like this one because no running tasks were
>> > left on the host computer (they were on the drive that failed...) to waste
>> > time and electricity.
>> >
>> > I would certainly welcome it if we could pair up a developer and a project
>> > administrator with access to server logs to investigate this problem and
>> > cure it at source.
>> >
>> > The checkpointing question is a matter for the project developers, and
>> > I'll leave it to them to respond via this list.
>> >
>> >
>> >
>> > >________________________________
>> > > From: Luc A. Germain <[email protected]>
>> > >To: [email protected]
>> > >Sent: Friday, 8 August 2014, 9:41
>> > >Subject: [boinc_dev] astropulse robustness / abandonned tasks
>> > >
>> > >
>> > >Hi,
>> > >Two things:
>> > >1) A suggestion here for you developers ;-) As Astropulse tasks take "some" time to complete, they are more prone to power failures such as we have in the third world. When that happens, most of the time the task restarts computing from the start (this is even more frustrating when the task is near completion). Would it be possible to introduce regular checkpoints by saving intermediate data, or work files, from which the task could restart, thereby saving a lot of computing time? Maybe this could be an option in the user profile, as I guess not everyone needs this.
>> > >
>> > >2) Two days ago I sent a message about abandoned tasks. Since then, all my computing goes to the garbage bin, as the tasks are not taken into account. Which procedure should/could I try to solve this problem? Could uninstalling/reinstalling the application on my computers be a solution? Should I wait until the problem solves itself (and would this not take ages)?
>> > >
>> > >An answer would be highly appreciated.
>> > >
>> > >Best regards and thanks for your work,
>> > >Luc
>> > >_______________________________________________
>> > >boinc_dev mailing list
>> > >[email protected]
>> > >http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>> > >To unsubscribe, visit the above URL and
>> > >(near bottom of page) enter your email address.
>> > >
>> > >
>> > >
>> > _______________________________________________
>> > boinc_dev mailing list
>> > [email protected]
>> > http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>> > To unsubscribe, visit the above URL and
>> > (near bottom of page) enter your email address.
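For anyone trying to follow the server log excerpt quoted in this thread ("RPC seqno 59642 less than expected 59643; creating new host", followed by in-progress results being marked as client errors), here is a minimal sketch of the kind of check that seems to be involved. It is an illustration only, not the actual BOINC scheduler code: the names HostRecord, stored_seqno and handle_rpc are hypothetical, and the real server does considerably more (for example, looking for a similar existing host).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HostRecord:
    """Hypothetical server-side record for one host (illustration only)."""
    host_id: int
    stored_seqno: int                      # sequence number the server has on file
    in_progress: List[str] = field(default_factory=list)


def handle_rpc(host: HostRecord, rpc_seqno: int) -> List[str]:
    """Process one scheduler request and return the tasks marked abandoned.

    Assumption: a request whose sequence number is lower than the one on
    file is read as evidence of a detach/re-attach (or a copied client
    directory), so every task still in progress is written off.
    """
    abandoned: List[str] = []
    if rpc_seqno < host.stored_seqno:
        # The client appears to have "gone back in time": give up on its tasks.
        abandoned = list(host.in_progress)
        host.in_progress.clear()
    # Either way, remember the number the client just used.
    host.stored_seqno = rpc_seqno
    return abandoned


if __name__ == "__main__":
    host = HostRecord(host_id=6960982, stored_seqno=59643,
                      in_progress=["task_a", "task_b"])
    # Same numbers as in the quoted log: 59642 arrives, 59643 was expected.
    print(handle_rpc(host, 59642))
    # A later, correctly ordered request abandons nothing further.
    print(handle_rpc(host, 59643))
```

Under these assumptions, anything that makes the client reuse an old sequence number (a lost cached write on restart, as Eric suggests, or a request arriving out of order) would be enough to trigger the abandonment, even though the host is still crunching the tasks.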
