Re: [PATCH] sched/fair: Skip wake_affine() for core siblings

Kirill Tkhai Mon, 28 Sep 2015 12:20:19 -0700


On 28.09.2015 21:22, Mike Galbraith wrote:
> On Mon, 2015-09-28 at 18:36 +0300, Kirill Tkhai wrote:
> 
>> Mike, one more moment. wake_wide() and current logic confuses me a bit.
>> It makes us to decide if we want affine wakeup or not, but 
>> select_idle_sibling()
>> if a function is not for choosing this_cpu's llc domain only. We use it
>> for searching in prev_cpu llc domain too, and it seems we are not interested
>> in current flips in this case.
> 
> We're always interested in "flips", as the point is to try to identify
> N:M load components, and when they may overload a socket.  The hope is
> to get it more right than wrong, as making the tracking really accurate
> is too expensive for the fast path.
> 
>>  Imagine a situation, when we share a mutex
>> with a task on another NUMA node. When the task is realising the mutex
>> it is waking us, but we definitelly won't use affine logic in this case.
> 
> Why not?  A wakeup is a wakeup is a wakeup, they all do the same thing.
> If wake_wide() doesn't NAK an affine wakeup, we ask wake_affine() for
> its opinion, then look for an idle CPU near the waker's CPU if it says
> OK, or near wakee's previous CPU if it says go away.


But NUMA sd does not have SD_WAKE_AFFINE flag, so this case a new cpu won't
be choosen from previous node. There will be choosen the highest domain
of smp_processor_id(), which has SD_BALANCE_WAKE flag, and the cpu will
be choosen from the idlest group/cpu. And we don't have a deal with old
cache at all. This looks like a completely wrong behaviour...

>> We wake the wakee anywhere and loose hot cache.
> 
> Yeah, sometimes we'll make tasks drag their data to them when we could
> have dragged the task to the data in the name of trying to crank up CPU
> utilization.  At some point, _somebody_ has to drag their data across
> interconnect, but we really don't know if/when the data transport cost
> will pay off in better utilization.
> 
>       -Mike
> 
> (I'll take a peek at below when damn futexes get done kicking my a$$)

This case, you'll be able to analyze new results below :)

ORIGIN  1               2               3               4               avg
C=1     8607,960451     8344,16111      8381,197569     7991,84102      
8331,2900375
C=2     15397,152685    15438,913761    15771,512182    15648,368819    
15563,98686175
C=4     30987,860844    31144,127431    31104,153461    30874,292825    
31027,60864025
C=8     62447,47612     62179,788923    61534,482204    62787,894021    
62237,410317
                                        
                                        
PATCHED 1               2               3               4               avg     
        
C=1     8782,439938     8675,891877     8609,209537     8735,120895     
8700,66556175
C=2     16526,31409     16491,650678    16149,594736    16365,630084    
16383,297397
C=4     32286,341708    32313,536565    32538,285157    32299,427398    
32359,397707
C=8     63860,310019    63187,152569    63400,930755    63210,460753    
63414,713524

This is:
# pgbench -j 1 -S -c X -T 200 test

The test machine has no NUMA, and I suppose, NUMA machine will show much better 
results.
I'll test it tomorrow.

>>  I changed the logic, and
>> tried pgbench 1:8. The results (I threw away 3 first iterations, because
>> they much differ with iter >= 4. Looks like, the reason is in uncached disk 
>> IO).
>>
>>
>> Before:
>>
>>   trans. |  tps (i)     | tps (e)
>> --------------------------------------
>> 12098226 | 60491.067392 | 60500.886373
>> 12030184 | 60150.874285 | 60160.654295
>> 11882977 | 59414.829150 | 59424.830637
>> 12020125 | 60100.579023 | 60111.600176
>> 12161917 | 60809.547906 | 60827.321639
>> 12154660 | 60773.249254 | 60783.085165
>>
>> After:
>>
>>  trans.  | tps (i)      | tps (e)
>> --------------------------------------
>> 12770407 | 63849.883578 | 63860.310019      
>> 12635366 | 63176.399769 | 63187.152569      
>> 12676890 | 63384.396440 | 63400.930755      
>> 12639949 | 63199.526330 | 63210.460753      
>> 12670626 | 63353.079951 | 63363.274143      
>> 12647001 | 63209.613698 | 63219.812331      
>>
>> I'm going to test other cases, but could you tell me (if you remember) are 
>> there reasons
>> we skip prev_cpu, like I described above? Some types of workloads etc.
>>
>> ---
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4df37a4..dfbe06b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4930,8 +4930,13 @@ select_task_rq_fair(struct task_struct *p, int 
>> prev_cpu, int sd_flag, int wake_f
>>      int want_affine = 0;
>>      int sync = wake_flags & WF_SYNC;
>>  
>> -    if (sd_flag & SD_BALANCE_WAKE)
>> -            want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, 
>> tsk_cpus_allowed(p));
>> +    if (sd_flag & SD_BALANCE_WAKE) {
>> +            want_affine = 1;
>> +            if (cpu == prev_cpu || !cpumask_test_cpu(cpu, 
>> tsk_cpus_allowed(p)))
>> +                    goto want_affine;
>> +            if (wake_wide(p))
>> +                    goto want_affine;
>> +    }
>>  
>>      rcu_read_lock();
>>      for_each_domain(cpu, tmp) {
>> @@ -4954,16 +4959,12 @@ select_task_rq_fair(struct task_struct *p, int 
>> prev_cpu, int sd_flag, int wake_f
>>                      break;
>>      }
>>  
>> -    if (affine_sd) {
>> +want_affine:
>> +    if (want_affine) {
>>              sd = NULL; /* Prefer wake_affine over balance flags */
>> -            if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>> +            if (affine_sd && wake_affine(affine_sd, p, sync))
>>                      new_cpu = cpu;
>> -    }
>> -
>> -    if (!sd) {
>> -            if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
>> -                    new_cpu = select_idle_sibling(p, new_cpu);
>> -
>> +            new_cpu = select_idle_sibling(p, new_cpu);
>>      } else while (sd) {
>>              struct sched_group *group;
>>              int weight;
> 
> 

Regards,
Kirill
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] sched/fair: Skip wake_affine() for core siblings

Reply via email to