Actually, it isn't hard to get close on average at all. It does involve rewriting some code, though.
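Concretely, the kind of rewrite involved could be sketched like this (illustrative Python with an assumed function name, not actual BOINC client code): budget each task at the sample mean plus three standard deviations of the runtimes observed for that app version.

```python
from statistics import mean, stdev

def expected_runtime(samples):
    """Estimate a near-worst-case runtime for scheduling purposes:
    sample mean plus three standard deviations of completed-task
    runtimes (illustrative sketch only -- not actual BOINC code)."""
    if not samples:
        return 0.0
    if len(samples) < 2:
        # Too few samples to estimate spread; fall back to the sole value.
        return float(samples[0])
    return mean(samples) + 3 * stdev(samples)

# Example: completed-task runtimes (seconds) for one app version
runtimes = [3600, 4200, 3900, 5100, 4400]
print(f"scheduler budget: {expected_runtime(runtimes):.0f} s")
```

For a roughly bell-shaped distribution this covers ~99.7% of tasks; for a heavy-tailed project like LHC the argument below is that the mean itself already sits near the maximum, so the budget is still safe.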
Currently DCF is based on "recent results": it falls slowly and rises quickly. A better method would be to base it on the mean and standard deviation of task runtimes for a particular application (or, better, application version). Work fetch would request work based on the mean project DCF for a particular resource; this would do better than the current method, which always requests too little. The CPU scheduler would assume that any particular task may be the worst case, and therefore use mean + 3 * standard deviation for the expected runtime. This assumes a roughly bell-shaped distribution, which is not the case for LHC; but LHC would still not miss deadlines because, in my experience, its mean is closer to the maximum value than a bell curve would predict.

jm7

From: Richard Haselgrove <[email protected]>
To: BOINC Developers Mailing List <[email protected]>
Date: 11/07/2011 11:22 AM
Subject: [boinc_dev] APR, DCF and non-deterministic projects

The important part of the subject line is "non-deterministic projects" - by that, I mean projects where task runtimes can't be predicted in advance. A well-known case in point is the original LHC@home, now renamed lhcathomeclassic. The problem is clearly stated on their current 'about' page:

"Typically SixTrack simulates 60 particles at a time as they travel around the ring, and runs the simulation for 100000 loops (or sometimes 1 million loops) around the ring. That may sound like a lot, but it is less than 10s in the real world. Still, it is enough to test whether the beam is going to remain on a stable orbit for a much longer time, or risks losing control and flying off course into the walls of the vacuum tube. Such a beam instability would be a very serious problem that could result in the machine being stopped for repairs if it happened in real life."

SixTrack exits when the simulated beam hits the simulated tunnel wall. Within the last week, I've seen runtimes ranging from 4 seconds to over 10 hours. It's hard to see how the runtime could be known in advance, without already knowing the answer to the problem they're trying to research.
A mathematical project which also exhibits non-deterministic runtimes is NumberFields@home (http://numberfields.asu.edu/NumberFields/index.php). NumberFields is now running the very latest server code (software version 24527, according to their status page), and I've been running it on the latest available v6.13.10 client, so we get a clear view of how well CreditNew and the server-based runtime estimation process work for non-deterministic projects. This report will come in several parts, but here's the first problem:

05-Nov-2011 09:51:49 [NumberFields@home] Requesting new tasks for CPU
05-Nov-2011 09:51:49 [NumberFields@home] [sched_op] CPU work request: 365384.35 seconds; 0.00 CPUs
05-Nov-2011 09:51:49 [NumberFields@home] [sched_op] NVIDIA work request: 0.00 seconds; 0.00 CPUs
05-Nov-2011 09:51:52 [NumberFields@home] Scheduler request completed: got 43 new tasks
05-Nov-2011 09:51:52 [NumberFields@home] [sched_op] Server version 613
05-Nov-2011 09:51:52 [NumberFields@home] Project requested delay of 21 seconds
05-Nov-2011 09:51:52 [NumberFields@home] [sched_op] estimated total CPU task duration: 2773276 seconds

That's a request for some three hundred thousand seconds, and an estimated allocation of over two million. It's not a fluke:

06-Nov-2011 17:34:25 [NumberFields@home] [sched_op] CPU work request: 318987.60 seconds; 0.00 CPUs
06-Nov-2011 17:34:25 [NumberFields@home] [sched_op] NVIDIA work request: 0.00 seconds; 0.00 CPUs
06-Nov-2011 17:34:28 [NumberFields@home] Scheduler request completed: got 85 new tasks
06-Nov-2011 17:34:28 [NumberFields@home] [sched_op] Server version 613
06-Nov-2011 17:34:28 [NumberFields@home] Project requested delay of 21 seconds
06-Nov-2011 17:34:28 [NumberFields@home] [sched_op] estimated total CPU task duration: 1036408 seconds

BOINC v6.13/v7 is deliberately designed to make these large work requests because of the max/min hysteresis fetch policy.
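One plausible reading of the first excerpt's numbers, under the (unverified) assumption that the client's logged "estimated total CPU task duration" already includes its DCF multiplier, using the DCF value quoted below for that moment:

```python
# Worked arithmetic from the first log excerpt (all figures from the logs;
# the DCF value is the one quoted later in this message for that request).
# ASSUMPTION: the logged "estimated total CPU task duration" is the raw
# server-side estimate multiplied by the client's DCF.

requested = 365384.35   # seconds of work requested
estimated = 2773276.0   # seconds the client believes it received
tasks = 43
dcf = 7.770888          # client DCF at the time of the request

per_task = estimated / tasks      # per-task estimate as the client sees it
raw_estimate = estimated / dcf    # estimate with the DCF multiplier removed

print(f"per-task estimate: {per_task:.0f} s")
print(f"estimate with DCF removed: {raw_estimate:.0f} s "
      f"(vs. {requested:.0f} s requested)")
```

If that assumption holds, the server's raw allocation (~357,000 s) is actually close to the ~365,000 s requested, and the apparent seven-fold overshoot is largely the client's own inflated DCF being applied to the new estimates.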
The problem here is that non-deterministic runtimes can't be tracked in real time by the server's APR averaging, and client DCF goes into overdrive trying to compensate. I've attached a full log of the DCF changes since this host was attached to NumberFields. At the times of the two logs above, client DCF was 7.770888 and 4.360754: according to http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation, "DCF is no longer used", and indeed http://boinc.berkeley.edu/trac/changeset/21153/boinc

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and (near bottom of page) enter your email address.
