Part 3: There's a particular problem with the fact that BOINC recalculates DCF 
only on completion of a task.

All of these screenshots show the same group of NumberFields tasks.

I first noticed that my longest-running task was unusual some 63 hours into the 
run:

http://img189.imageshack.us/img189/1185/numberfieldsruntimeesti.png

45 hours later, elapsed time has only moved on by 20 hours, and BOINC has felt 
it safe to pre-empt that single task. This screenshot was taken mid-afternoon 
5/11/2011, so maybe 52 hours before deadline. Note that the estimate for the 
unstarted tasks has decreased - BOINC must have run a task on another core in 
the meantime, and found it was a short-running one.

http://img263.imageshack.us/img263/9779/numberfieldspreempted3.png

About an hour later, the long task completed. Suddenly (but not until now) the 
unstarted tasks are in deadline trouble, with an 84 hour estimate (already 
reduced - luckily - by four rapid exits) and only 60 hours remaining to 
deadline.

http://img823.imageshack.us/img823/3107/numberfieldscompleted.png

I think I've suggested before that BOINC should start increasing its estimate 
of DCF as soon as a running task passes the runtime predicted by the current 
DCF value. The objection usually raised is that the task may be faulty (stuck 
in an infinite loop, say) and may never complete with a 'success' outcome to 
generate a genuine recalculated DCF. But in this case, it did....

BOINC could maintain and apply a 'provisional' DCF for the project while a 
long-running task is still active - that would have brought the additional 
queued tasks forward in high priority sooner than occurred in this case. When 
the individual long task completes, the 'provisional' DCF would be discarded, 
and the 'permanent' DCF would be updated as normal, depending on the task 
outcome (full DCF correction if success outcome, no change if error outcome).
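For concreteness, here's a minimal sketch of that proposal (Python, with 
invented names -- the BOINC client is C++ and has no such function; this is 
only the shape of the idea):

```python
def provisional_dcf(permanent_dcf, elapsed, estimated_runtime):
    """Return a DCF for use only while a task is still running.

    If the task has already run past its DCF-corrected estimate,
    scale the correction factor up to match observed reality;
    otherwise keep the permanent (last-completion) value. The
    provisional value is never stored: on completion it is
    discarded, and the permanent DCF is updated as normal (full
    correction on a success outcome, no change on an error outcome).
    """
    corrected_estimate = permanent_dcf * estimated_runtime
    if elapsed > corrected_estimate:
        # The task has overrun its prediction; assume at least
        # this much correction is needed for the rest of the queue.
        return elapsed / estimated_runtime
    return permanent_dcf
```

With the long task above, a raw estimate of (say) 10 hours and a permanent DCF 
of 1.0 would give a provisional DCF of 6.3 after 63 hours of running, pushing 
the queued tasks into high priority long before the task actually completed.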


----- Original Message ----- 
From: <[email protected]>
To: "Richard Haselgrove" <[email protected]>
Cc: "BOINC Developers Mailing List" <[email protected]>; 
<[email protected]>
Sent: Monday, November 07, 2011 4:36 PM
Subject: Re: [boinc_dev] APR, DCF and non-deterministic projects


> Actually, it isn't hard to get close on average at all.  It does involve
> re-writing some code though.
> 
> Currently DCF is based on "recent results" and falls slowly and increases
> quickly.
> 
> A better method of determining the DCF would be to base it off of the Mean
> and Standard Deviation of the tasks for a particular application (or better
> application version).
> 
> Work fetch would request work based on the mean project DCF for a
> particular resource.  This will do better than the current method that
> always requests too little.
> CPU scheduler would assume that any particular task may be the worst case.
> And therefore use Mean + 3 * standard deviation for the expected runtime.
> This assumes a curve similar to a bell curve which is not the case for LHC,
> but LHC would not miss deadlines because, in my experience, the mean is
> closer to the maximum value than expected in a bell curve.
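As I read jm7's scheme, the client would keep per-application-version runtime 
statistics and use two different figures for two different purposes. A toy 
sketch (Python; names are mine, not BOINC's):

```python
import statistics

def runtime_stats(completed_runtimes):
    """Mean and (population) standard deviation of completed
    task runtimes for one application version."""
    mean = statistics.mean(completed_runtimes)
    sd = statistics.pstdev(completed_runtimes)
    return mean, sd

def work_fetch_estimate(completed_runtimes):
    # Work fetch uses the mean: correct on average, so the cache
    # neither starves nor overfills in the long run.
    mean, _ = runtime_stats(completed_runtimes)
    return mean

def scheduler_estimate(completed_runtimes):
    # The CPU scheduler assumes any task may be near worst case:
    # mean + 3 sigma covers ~99.7% of tasks if runtimes were
    # normally distributed (which, as noted, LHC's are not).
    mean, sd = runtime_stats(completed_runtimes)
    return mean + 3 * sd
```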
> 
> jm7
> 
> 
> From: Richard Haselgrove <[email protected]>
> To: BOINC Developers Mailing List <[email protected]>
> Date: 11/07/2011 11:22 AM
> Subject: [boinc_dev] APR, DCF and non-deterministic projects
> Sent by: <[email protected]>
> 
> The important part of the subject line is "non-deterministic projects" - by
> that, I mean projects where task runtimes can't be predicted in advance. A
> well-known case in point is the original LHC@home, now renamed
> lhcathomeclassic. The problem is clearly stated on their current 'about'
> page:
> 
> "Typically SixTrack simulates 60 particles at a time as they travel around
> the ring, and runs the simulation for 100000 loops (or sometimes 1 million
> loops) around the ring. That may sound like a lot, but it is less than 10s
> in the real world. Still, it is enough to test whether the beam is going to
> remain on a stable orbit for a much longer time, or risks losing control
> and flying off course into the walls of the vacuum tube. Such a beam
> instability would be a very serious problem that could result in the
> machine being stopped for repairs if it happened in real life."
> 
> Sixtrack exits when the simulated beam hits the simulated tunnel wall.
> Within the last week, I've seen runtimes ranging from 4 seconds to over 10
> hours. It's hard to see how the runtime could be known in advance, without
> already knowing the answer to the problem they're trying to research.
> 
> A mathematical project which also exhibits non-deterministic runtime is
> NumberFields@home (http://numberfields.asu.edu/NumberFields/index.php).
> NumberFields is now running the very latest server code (software version:
> 24527, according to their status page), and I've been running it on the
> latest available v6.13.10 client, so we get a clear view of how well
> CreditNew and the server-based runtime estimation process work for
> non-deterministic projects.
> 
> This report will come in several parts, but here's the first problem:
> 
> 05-Nov-2011 09:51:49 [NumberFields@home] Requesting new tasks for CPU
> 05-Nov-2011 09:51:49 [NumberFields@home] [sched_op] CPU work request:
> 365384.35 seconds; 0.00 CPUs
> 05-Nov-2011 09:51:49 [NumberFields@home] [sched_op] NVIDIA work request:
> 0.00 seconds; 0.00 CPUs
> 05-Nov-2011 09:51:52 [NumberFields@home] Scheduler request completed: got
> 43 new tasks
> 05-Nov-2011 09:51:52 [NumberFields@home] [sched_op] Server version 613
> 05-Nov-2011 09:51:52 [NumberFields@home] Project requested delay of 21
> seconds
> 05-Nov-2011 09:51:52 [NumberFields@home] [sched_op] estimated total CPU
> task duration: 2773276 seconds
> 
> That's a request for three hundred thousand seconds, and an estimated
> allocation of over two million. It's not a fluke:
> 
> 06-Nov-2011 17:34:25 [NumberFields@home] [sched_op] CPU work request:
> 318987.60 seconds; 0.00 CPUs
> 06-Nov-2011 17:34:25 [NumberFields@home] [sched_op] NVIDIA work request:
> 0.00 seconds; 0.00 CPUs
> 06-Nov-2011 17:34:28 [NumberFields@home] Scheduler request completed: got
> 85 new tasks
> 06-Nov-2011 17:34:28 [NumberFields@home] [sched_op] Server version 613
> 06-Nov-2011 17:34:28 [NumberFields@home] Project requested delay of 21
> seconds
> 06-Nov-2011 17:34:28 [NumberFields@home] [sched_op] estimated total CPU
> task duration: 1036408 seconds
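Putting numbers on "not a fluke": the over-allocation factor works out directly 
from the figures quoted in the two log excerpts (plain arithmetic, nothing 
assumed beyond the logs themselves):

```python
# Ratio of the server's estimated total task duration to the
# client's work request, using the two log excerpts above.
requests = [(365384.35, 2773276), (318987.60, 1036408)]
for requested, allocated in requests:
    factor = allocated / requested
    print(f"requested {requested:.0f}s, "
          f"allocated {allocated:.0f}s, factor {factor:.2f}x")
```

So the server handed out roughly 7.6x and 3.2x the work actually asked for.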
> 
> BOINC v6.13/v7 is deliberately designed to make these large work requests
> because of the max/min hysteresis fetch policy.
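For anyone unfamiliar with that policy: under max/min hysteresis the client 
lets its buffer drain to the low-water mark, then asks for enough work to 
refill to the high-water mark in a single request. A toy sketch (Python; the 
real client logic is considerably more involved):

```python
def work_request_seconds(buffered_seconds, min_buffer, max_buffer):
    """Hysteresis work fetch: request nothing while the buffer is
    at or above the low-water mark; once it drops below, request
    enough to refill to the high-water mark in one go."""
    if buffered_seconds >= min_buffer:
        return 0.0
    return max_buffer - buffered_seconds
```

A host that has just dipped below a 0.1-day minimum with a 4-day maximum asks 
for nearly four days of work at once, so any error in the per-task duration 
estimate is multiplied through that entire request.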
> 
> The problem here is that non-deterministic runtimes can't be tracked in
> real time by the server APR averaging, and client DCF goes into overdrive
> to try and compensate. I've attached a full log of the DCF changes since
> this host was attached to NumberFields. At the times of the two logs
> above, client DCF was 7.770888 and 4.360754: according to
> http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation, "DCF is no longer
> used", and indeed http://boinc.berkeley.edu/trac/changeset/21153/boinc
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
