If we go to a statistics-based approach for CPU scheduling, we should
account for the shrinking confidence interval for large queues.  Where a
single item in a queue has a maximum expected run time of mean + 3*stdev,
a queue of n independent tasks has an expected maximum total run time of
n*mean + 3*sqrt(n)*stdev, i.e. n*(mean + 3*stdev/sqrt(n)), since the
standard deviation of a sum of n independent runtimes grows only as
sqrt(n).  So for an rr_sim based on statistics, the run time per task
would be figured as mean + 3*stdev/sqrt(n) for calculating CPU
scheduling.
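A minimal sketch of that per-task bound, assuming independent task
runtimes (the function name and parameters are illustrative, not
BOINC's):

```python
import math

def per_task_runtime_bound(mean, stdev, n):
    """Effective per-task run time to use when simulating a queue of n
    similar, independent tasks.  The stdev of the total run time is
    sqrt(n)*stdev, so the 3-sigma upper bound on the total is
    n*mean + 3*sqrt(n)*stdev -- i.e. mean + 3*stdev/sqrt(n) per task."""
    return mean + 3.0 * stdev / math.sqrt(n)

# A single task must be planned at the full mean + 3*stdev; with four
# queued tasks the per-task margin is halved.
single = per_task_runtime_bound(100.0, 10.0, 1)   # 130.0
queued = per_task_runtime_bound(100.0, 10.0, 4)   # 115.0
```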

Note that for work fetch, where we are trying to ensure that we have
enough work on hand to avoid idle CPUs, the amount of work on hand should
satisfy n*(mean - 3*stdev/sqrt(n)) >= minimum queue size.  Currently,
work fetch assumes that every task will take nearly the longest time that
the application has seen.  For a large sample this is statistically
incorrect, and for the purpose of keeping the CPUs busy it errs on the
wrong side of the mean.
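A sketch of that work-fetch condition, again assuming independent
runtimes; it finds the smallest queue length whose 3-sigma *lower* bound
on total run time still covers the minimum queue size (solved by
iteration here for clarity rather than the closed-form quadratic; names
are illustrative):

```python
import math

def tasks_needed(mean, stdev, min_queue_seconds):
    """Smallest n such that the 3-sigma lower bound on total queue
    run time, n*mean - 3*sqrt(n)*stdev, covers min_queue_seconds."""
    n = 1
    while n * mean - 3.0 * math.sqrt(n) * stdev < min_queue_seconds:
        n += 1
    return n

# With mean 100 s and stdev 10 s, covering a 1000 s minimum queue takes
# 11 tasks rather than the naive 10, because the lower bound on the
# total falls short of n*mean.
n = tasks_needed(100.0, 10.0, 1000.0)   # 11
```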

I believe that we can successfully get rid of "extra work" if we:
1)  Move to statistics-based CPU scheduling and work fetch.
2)  Gather statistics on all of the following and use them appropriately
in the CPU scheduling and work fetch policies:
    a)  actual run time / original estimated run time (used to correct
duration estimates)
    b)  run time after the application first reaches 100% (per project;
used in computing the computation deadline)
    c)  time from completion to report (per project; used in computing
the computation deadline)
3)  "Connect every X days" could have an effective time limit and would
only be in place for planning purposes; i.e. it could be reset to 0
thirty days after it was last set.  The other possibility would be a
"vacation" planner with local dates and times for planned unusual
network outages.
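The per-project statistics in item 2 could be accumulated online with
Welford's algorithm, one accumulator per statistic per project.  This is
an illustrative sketch, not BOINC's actual bookkeeping:

```python
class RunningStat:
    """Welford's online algorithm for mean and sample standard
    deviation.  One instance per statistic per project, e.g. the
    runtime correction ratio, run time after 100%, and the
    completion-to-report delay."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stdev(self):
        if self.n < 2:
            return 0.0
        return (self.m2 / (self.n - 1)) ** 0.5

# e.g. accumulate completion-to-report delays as tasks are reported
report_delay = RunningStat()
for seconds in (1.0, 2.0, 3.0):
    report_delay.add(seconds)
# report_delay.mean == 2.0, report_delay.stdev() == 1.0
```

Welford's update is preferable to the naive sum-of-squares formula here
because these accumulators run for the life of the client and the naive
formula loses precision over long series.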

So the computation deadline would be:

  report deadline
    - (max("connect every X days",
           mean time to complete after 100%
           + mean time to report after completion
           + 3 * (stdev time to complete after 100%
                  + stdev time to report after completion)
             / sqrt(tasks in queue for project))
       + task_switch_interval)

The task switch interval is the granularity of the scheduler and needs to
be accounted for regardless of how the rest is calculated.  Using the
mean time to report after completion rolls all of the possible causes of
delay into one number.  It should not matter what caused the delay in
uploading and reporting; all causes should be accounted for.  Delays
could include a server outage (like the SETI weekly outage), a network
outage (the computer is only attached to the network a few times per
week), or anything else.  Yes, this means the report deadline calculation
can differ from project to project based on the reliability of the
network connection to that project, but I believe that to be a good idea.
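The deadline calculation above can be sketched as follows (all times in
seconds; the function and parameter names are mine, not BOINC's, and the
task switch interval is applied outside the max, matching the remark
that it must be accounted for no matter how the rest is calculated):

```python
import math

def computation_deadline(report_deadline, connect_every_x,
                         mean_after_100, stdev_after_100,
                         mean_report, stdev_report,
                         tasks_in_queue, task_switch_interval):
    """Proposed computation deadline: back off from the report deadline
    by whichever is larger -- the connect interval or the statistical
    slack for finishing and reporting -- plus the scheduler
    granularity."""
    slack = (mean_after_100 + mean_report
             + 3.0 * (stdev_after_100 + stdev_report)
               / math.sqrt(tasks_in_queue))
    return report_deadline - (max(connect_every_x, slack)
                              + task_switch_interval)

# e.g. 4 queued tasks, no connect interval, 10-minute scheduler slices
d = computation_deadline(report_deadline=100000.0, connect_every_x=0.0,
                         mean_after_100=60.0, stdev_after_100=10.0,
                         mean_report=300.0, stdev_report=30.0,
                         tasks_in_queue=4, task_switch_interval=600.0)
# d == 98980.0
```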

jm7

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev