Actually, it isn't hard to get close on average at all.  It does involve
rewriting some code, though.

Currently, DCF is based on "recent results": it falls slowly and increases
quickly.
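That asymmetric behaviour could be sketched roughly as follows. This is an
illustration of the idea, not BOINC's actual code; the 10% decay step is an
invented constant.

```python
def update_dcf(dcf, estimated, actual):
    """Illustrative asymmetric DCF update: when a task overruns its
    corrected estimate, DCF jumps up immediately; when a task finishes
    early, DCF decays only a small step toward the observed ratio.
    The 0.1 decay constant is made up for illustration."""
    ratio = actual / estimated
    if ratio > dcf:
        # Overrun: increase DCF to the new ratio at once.
        return ratio
    # Underrun: fall slowly toward the new ratio.
    return dcf + 0.1 * (ratio - dcf)
```

So a single long task doubles the correction factor instantly, while many
short tasks are needed to bring it back down.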

A better method of determining the DCF would be to base it on the mean and
standard deviation of task runtimes for a particular application (or,
better, application version).

Work fetch would request work based on the mean project DCF for a
particular resource.  This would do better than the current method, which
always requests too little.  The CPU scheduler would assume that any
particular task may be the worst case, and therefore use mean + 3 *
standard deviation for the expected runtime.  This assumes a distribution
similar to a bell curve, which is not the case for LHC; but LHC would still
not miss deadlines because, in my experience, its mean is closer to the
maximum value than a bell curve would suggest.
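The per-app-version statistics could be kept incrementally, e.g. with
Welford's online algorithm. A minimal sketch of the proposal (class and
method names are mine, not BOINC's):

```python
import math

class RuntimeStats:
    """Running mean / standard deviation of task runtimes for one
    application version, updated one completed task at a time
    (Welford's online algorithm).  A sketch of the proposal above,
    not actual BOINC code."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def add(self, runtime):
        self.n += 1
        delta = runtime - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (runtime - self.mean)

    def stddev(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def expected_worst_case(self):
        # Scheduler's pessimistic runtime estimate: mean + 3 sigma.
        return self.mean + 3 * self.stddev()
```

Work fetch would then use `mean` per result, while the deadline check in
the CPU scheduler would use `expected_worst_case()`.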

jm7


From:    Richard Haselgrove <[email protected]>
To:      BOINC Developers Mailing List <[email protected]>
Date:    11/07/2011 11:22 AM
Subject: [boinc_dev] APR, DCF and non-deterministic projects
Sent by: <[email protected]>





The important part of the subject line is "non-deterministic projects" - by
that, I mean projects where task runtimes can't be predicted in advance. A
well-known case in point is the original LHC@home, now renamed
lhcathomeclassic. The problem is clearly stated on their current 'about'
page:

"Typically SixTrack simulates 60 particles at a time as they travel around
the ring, and runs the simulation for 100000 loops (or sometimes 1 million
loops) around the ring. That may sound like a lot, but it is less than 10s
in the real world. Still, it is enough to test whether the beam is going to
remain on a stable orbit for a much longer time, or risks losing control
and flying off course into the walls of the vacuum tube. Such a beam
instability would be a very serious problem that could result in the
machine being stopped for repairs if it happened in real life."

SixTrack exits when the simulated beam hits the simulated tunnel wall.
Within the last week, I've seen runtimes ranging from 4 seconds to over 10
hours. It's hard to see how the runtime could be known in advance, without
already knowing the answer to the problem they're trying to research.

A mathematical project which also exhibits non-deterministic runtime is
NumberFields@home (http://numberfields.asu.edu/NumberFields/index.php).
NumberFields is now running the very latest server code (software version:
24527, according to their status page), and I've been running it on the
latest available v6.13.10 client. We get a clear view of how well CreditNew
and the server-based runtime estimation process work for non-deterministic
projects.

This report will come in several parts, but here's the first problem:

05-Nov-2011 09:51:49 [NumberFields@home] Requesting new tasks for CPU
05-Nov-2011 09:51:49 [NumberFields@home] [sched_op] CPU work request:
365384.35 seconds; 0.00 CPUs
05-Nov-2011 09:51:49 [NumberFields@home] [sched_op] NVIDIA work request:
0.00 seconds; 0.00 CPUs
05-Nov-2011 09:51:52 [NumberFields@home] Scheduler request completed: got
43 new tasks
05-Nov-2011 09:51:52 [NumberFields@home] [sched_op] Server version 613
05-Nov-2011 09:51:52 [NumberFields@home] Project requested delay of 21
seconds
05-Nov-2011 09:51:52 [NumberFields@home] [sched_op] estimated total CPU
task duration: 2773276 seconds

That's a request for three hundred thousand seconds, and an estimated
allocation of over two million. It's not a fluke:

06-Nov-2011 17:34:25 [NumberFields@home] [sched_op] CPU work request:
318987.60 seconds; 0.00 CPUs
06-Nov-2011 17:34:25 [NumberFields@home] [sched_op] NVIDIA work request:
0.00 seconds; 0.00 CPUs
06-Nov-2011 17:34:28 [NumberFields@home] Scheduler request completed: got
85 new tasks
06-Nov-2011 17:34:28 [NumberFields@home] [sched_op] Server version 613
06-Nov-2011 17:34:28 [NumberFields@home] Project requested delay of 21
seconds
06-Nov-2011 17:34:28 [NumberFields@home] [sched_op] estimated total CPU
task duration: 1036408 seconds
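The size of the overshoot can be read straight off the logs by dividing
the server's estimated total task duration by the seconds of work the
client requested:

```python
# (seconds requested by client, total duration estimated by server)
# taken from the two scheduler logs above.
requests = [
    (365384.35, 2773276),   # 05-Nov-2011 request
    (318987.60, 1036408),   # 06-Nov-2011 request
]
for requested, estimated in requests:
    print(f"oversupply factor: {estimated / requested:.2f}")
```

That gives oversupply factors of roughly 7.6 and 3.2 for the two
requests.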

BOINC v6.13/v7 is deliberately designed to make these large work requests
because of the max/min hysteresis fetch policy.
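The hysteresis policy itself is simple; roughly (a sketch with invented
parameter names, not BOINC's actual implementation):

```python
def work_request_seconds(buffered, min_buffer, max_buffer):
    """Sketch of a max/min hysteresis fetch policy: request nothing
    while the cached work is above the low-water mark, then refill
    all the way to the high-water mark in a single large request.
    Parameter names are illustrative, not BOINC's."""
    if buffered >= min_buffer:
        return 0.0
    return max_buffer - buffered
```

So a host with a low minimum and a multi-day maximum will, by design,
issue one very large request the moment its cache dips below the minimum.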

The problem here is that non-deterministic runtimes can't be tracked in
real time by the server's APR averaging, and client DCF goes into
overdrive to try to compensate. I've attached a full log of the DCF
changes since this host was attached to NumberFields. At the times of the
two logs above, client DCF was 7.770888 and 4.360754: according to
http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation, "DCF is no longer
used", and indeed http://boinc.berkeley.edu/trac/changeset/21153/boinc




_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
