Re: [boinc_dev] Fwd: Scheduler troubles in conjunction with rate limiting from server

Richard Haselgrove Tue, 11 Feb 2014 06:53:31 -0800

Here's an example of the sort of event which can cause the problems Rytis was 
describing:


11/02/2014 14:28:15 | boincsimap | [sched_op] Starting scheduler request
11/02/2014 14:28:15 | boincsimap | Sending scheduler request: To fetch work.
11/02/2014 14:28:15 | boincsimap | Requesting new tasks for CPU
11/02/2014 14:28:15 | boincsimap | [sched_op] CPU work request: 1.46 seconds; 0.00 devices
11/02/2014 14:28:15 | boincsimap | [sched_op] NVIDIA work request: 0.00 seconds; 0.00 devices
11/02/2014 14:28:17 | boincsimap | Scheduler request completed: got 1 new tasks
11/02/2014 14:28:17 | boincsimap | [sched_op] Server version 703
11/02/2014 14:28:17 | boincsimap | Project requested delay of 7 seconds
11/02/2014 14:28:17 | boincsimap | [sched_op] estimated total CPU task duration: 3680 seconds
11/02/2014 14:28:17 | boincsimap | [sched_op] estimated total NVIDIA task duration: 0 seconds
11/02/2014 14:28:17 | boincsimap | [sched_op] Deferring communication for 00:00:07
11/02/2014 14:28:17 | boincsimap | [sched_op] Reason: requested by project
11/02/2014 14:28:19 | boincsimap | Started download of 20140129.556477
11/02/2014 14:28:24 | boincsimap | Finished download of 20140129.556477
11/02/2014 14:28:47 | boincsimap | Computation for task 20140129.537727_1 finished
11/02/2014 14:28:47 | boincsimap | Starting task 20140129.540879_1
11/02/2014 14:28:47 | boincsimap | [cpu_sched] Starting task 20140129.540879_1 using simap version 512 in slot 1
11/02/2014 14:28:49 | boincsimap | Started upload of 20140129.537727_1_0
11/02/2014 14:29:00 | boincsimap | Finished upload of 20140129.537727_1_0

Because work was requested about 30 seconds *before* the task completed 
(14:28:15 against 14:28:47), neither the old nor the new version of 
"inhibit RPCs during upload" would have prevented it.
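
A minimal sketch of that timing, using only the timestamps from the log above. The rule modelled here ("defer work fetch while an upload is in progress") is a deliberate simplification of the checkins discussed in this thread, and the function names are hypothetical; none of this is BOINC client code.

```python
from datetime import datetime

FMT = "%d/%m/%Y %H:%M:%S"


def ts(text):
    """Parse a timestamp in the format used by the event log."""
    return datetime.strptime(text, FMT)


# Timestamps copied from the log excerpt.
WORK_REQUEST = ts("11/02/2014 14:28:15")     # "Sending scheduler request"
TASK_FINISHED = ts("11/02/2014 14:28:47")    # "Computation for task ... finished"
UPLOAD_STARTED = ts("11/02/2014 14:28:49")   # "Started upload"
UPLOAD_FINISHED = ts("11/02/2014 14:29:00")  # "Finished upload"


def upload_in_progress(now):
    """True while the result file is being uploaded."""
    return UPLOAD_STARTED <= now < UPLOAD_FINISHED


def fetch_would_be_deferred(now):
    """Hypothetical, simplified rule: hold back work fetch during an upload."""
    return upload_in_progress(now)


# Seconds between the work request and the task finishing.
gap_seconds = (TASK_FINISHED - WORK_REQUEST).total_seconds()

print("request preceded task completion by", gap_seconds, "seconds")
print("deferred at request time:", fetch_would_be_deferred(WORK_REQUEST))
print("deferred mid-upload:", fetch_would_be_deferred(ts("11/02/2014 14:28:50")))
```

Under this simplified rule the request at 14:28:15 is not deferred, because no upload exists yet; only a request made while the upload is running would be held back.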

As it happens, SIMAP is one of the projects which could honestly use the 
"estimates are linear and can be trusted" flag, if available.



>________________________________
> From: Richard Haselgrove <[email protected]>
>To: David Anderson <[email protected]>; BOINC Developers Mailing List 
><[email protected]> 
>Sent: Saturday, 8 February 2014, 12:08
>Subject: Re: [boinc_dev] Fwd: Scheduler troubles in conjunction with rate 
>limiting from server
> 
>
>
>I thought we had this protection in place already?
>
>
>Specifically, since your checkin 60fc3d3 of April 2011:
>
>
>"client: defer reporting completed tasks if an upload started recently;
>we might be able to report more tasks once the upload completes."
>
>
>http://boinc.berkeley.edu/trac/changeset/60fc3d3f22f66d7a7b5bb5632d2de322cf2f180a/boinc-v2
>
>
>
>If that works (and in my experience it does), it exactly covers Rytis' 
>problem: by delaying work fetch until the previous task is reportable, an 
>extra slot is made available within the jobs-in-progress limit.
>
>
>It took a few follow-up revisions to get 60fc3d3 working properly: the only 
>remaining loophole that I can see is that occasionally BOINC might slip in a 
>work fetch after a task has exited, but before the upload has even started. 
>The other situation which could lead to Rytis' observation is if BOINC 
>requested new work shortly before his task exited, but we have always resisted 
>the calls to adjust scheduling on the basis of anticipated/estimated 
>completion times.
>
>
>I'm a little worried by the new checkin: if a project completes tasks, and 
>hence starts uploads, more frequently than once every five minutes, will it 
>ever break free of the deferral?
>
>
>>________________________________
>> From: David Anderson <[email protected]>
>>To: BOINC Developers Mailing List <[email protected]> 
>>Sent: Saturday, 8 February 2014, 0:00
>>Subject: [boinc_dev] Fwd: Scheduler troubles in conjunction with rate 
>>limiting from server
>> 
>>
>>I checked in the following change to address the problem
>>Rytis describes below.
>>
>>    client: work fetch policy tweak
>>
>>    If a project has active uploads, defer work fetch from it for 5 minutes
>>    even if there are idle devices (that's the change).
>>    This addresses a situation (reported by Rytis) where
>>    - a project P has a jobs-in-progress limit less than NCPUS
>>    - P's jobs finish and are uploading
>>    - the client asks P for work and doesn't get any because of the limit
>>    - the client does exponential backoff from P
>>    Over the long term, P can get much less than its fair share of work
>>
>>-- David
>>
>>-------- Original Message --------
>>Subject:     Scheduler troubles in conjunction with rate limiting from server
>>Date:     Fri, 7 Feb 2014 12:41:04 +0200
>>From:     Rytis Slatkevičius <[email protected]>
>>To:     David Anderson <[email protected]>
>>CC:     Matthew Blumberg <[email protected]>
>>
>>
>>
>>Hello David,
>>
>>we observed an interesting trouble with task scheduling:
>>
>>Project A (our project) limits number of tasks per proc to 2 and has
>>resource share of 500;
>>Project B (Einstein) does not limit number of tasks and has resource share
>>of 25.
>>
>>B has longer tasks than A, and also longer tasks than the minimum work buffer.
>>
>>When attaching both, A has priority because of resource share. It fetches 2
>>tasks (as the server does not send any more). B then fetches tasks to fill
>>the remaining buffers up to the minimum threshold.
>>
>>When A finishes work, scheduler request happens as there is not enough work
>>available to fill all work slots. However, because the completed tasks have
>>not been uploaded yet, scheduler does not send any new work as it is limited
>>to 2 tasks on host (and it still has them, even though computation is
>>complete). Backoff happens for A as no work is provided, and therefore B is
>>asked for work. Now only B is running.
>>
>>When B finishes work, either A is asked again (if the backoff has completed),
>>two tasks are sent, and process repeats again, or A is not even asked (if the
>>backoff is still in progress) and B is asked again.
>>
>>The end result: system runs work from B almost exclusively, even though A has
>>work available (BOINC just thinks it does not). We increased the job limits
>>to a number higher than the minimum threshold and the issue seems to have
>>disappeared.
>>
>>--
>>Pagarbiai / Sincerely
>>Rytis Slatkevičius
>>+370 670 77777
>>
>>
>>_______________________________________________
>>boinc_dev mailing list
>>[email protected]
>>http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>>To unsubscribe, visit the above URL and
>>(near bottom of page) enter your email address.
>>
>>
>
>
