On 26.09.2015 18:25, Mike Galbraith wrote:
> On Fri, 2015-09-25 at 20:54 +0300, Kirill Tkhai wrote:
>> We are not interested in the actual target if both the prev
>> and curr CPUs share a CPU cache. select_idle_sibling()
>> searches in top-down order; the top level is the same
>> for both of them, and the result will be the same.
>> So we can save a few CPU cycles and cache misses
>> and skip the wake_affine() calculations.
> 
> But, whereas previously wake_affine() could NAK a migration if it would
> create an imbalance, we'll now just go ahead and stack tasks if
> select_idle_sibling() can't find an idle home to override the blanket
> approval.  It doesn't look like a good idea to me to bounce tasks around
> only to then perhaps stack them, as if we do stack waker/wakee, we
> certainly lose concurrency. (microbenchmarks like pipe-test love that,
> but not all that many real applications play ping-pong for a living;)
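For readers following along: the path in question is select_task_rq_fair(), where
wake_affine() can veto pulling the wakee over to the waker's CPU before
select_idle_sibling() scans the shared LLC.  The idea being benchmarked boils
down to something like the sketch below (not the exact posted diff; affine_sd,
prev_cpu etc. are just the usual names from that path):

        if (affine_sd) {
                sd = NULL; /* prefer the affine wakeup over balance flags */
                /*
                 * Sketch: if the waker's CPU and the wakee's previous CPU
                 * already share an LLC, skip the wake_affine() imbalance
                 * check, since select_idle_sibling() scans the same
                 * top-level domain either way.
                 */
                if (cpu != prev_cpu &&
                    (cpus_share_cache(cpu, prev_cpu) ||
                     wake_affine(affine_sd, p, sync)))
                        new_cpu = cpu;
        }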
> 
> I spent most of the day piddling with your little patch, so I'll post
> some condensed mixed load notes.
> 
> concurrent tbench 4 + pgbench, 30 seconds per client count (i4790+smt)
>                                              master                            master+
> pgbench                   1       2       3     avg         1       2       3     avg   comp
> clients 1       tps = 18768   18591   18264   18541     18351   17257   17245   17617   .950
> clients 2       tps = 30779   30661   31016   30818     29112   28026   29026   28721   .931
> clients 4       tps = 54195   55100   54048   54447     53290   52336   52930   52852   .970
> clients 8       tps = 60332   67052   64699   64027     38491   35746   37746   37327   .582!!

Yeah, this is terrible.

> Do the opposite, wake_affine() always NAKs.
>                                              master                            master++
> pgbench                   1       2       3     avg         1       2       3     avg   comp
> clients 1       tps = 18768   18591   18264   18541     16874   16865   16665   16801   .906
> clients 2       tps = 30779   30661   31016   30818     33562   33546   33681   33596  1.090
> clients 4       tps = 54195   55100   54048   54447     61544   61482   61117   61381  1.127
> clients 8       tps = 60332   67052   64699   64027     75171   75524   75318   75337  1.176

Looks like NAK may be better, because it preserves the L1 cache, while the patch
always invalidates it.
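
To be explicit about what "always NAKs" means above, the master++ experiment
presumably amounts to forcing a refusal, something like:

        static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
        {
                /* never approve pulling the wakee to the waker's CPU */
                return 0;
        }

so the wakee keeps prev_cpu as the starting point for select_idle_sibling().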

Could you say whether you run pgbench with just -cX -jY -T30, or something more
elaborate? I've tried it, but the dispersion of the results varies a lot from
run to run.

> 
> ...
> 
> virgin vs your patch again, 2 _minutes_ per client count, as I noticed much
> variance at 8 clients, where wake_wide() is supposed to kick in to keep N:M
> load spread out.
> 
>                                              master                            master+
> pgbench                   1       2       3     avg         1       2       3     avg   comp
> clients 1       tps = 18548   18673   18390   18537     17879   17652   17621   17717   .955
> clients 2       tps = 31083   31110   30859   31017     30274   30003   29796   30024   .967
> clients 4       tps = 53107   53156   53601   53288     52658   53024   53449   53043   .995
> clients 8       tps = 34213   34310   28844   32455     31360   31416   30732   31169   .960
> 
> 30 seconds per run isn't enough, and wake_wide() is not doing a wonderful job 
> for 1:N pgbench.
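
For context on why the 8-client runs are the interesting ones: wake_wide() in
kernels of this vintage decides whether to bypass the affine wakeup from the
switching-partner counts and the LLC size, roughly:

        static int wake_wide(struct task_struct *p)
        {
                unsigned int master = current->wakee_flips;
                unsigned int slave = p->wakee_flips;
                int factor = this_cpu_read(sd_llc_size);

                if (master < slave)
                        swap(master, slave);
                if (slave < factor || master < slave * factor)
                        return 0;       /* keep the wakeup affine */
                return 1;               /* spread it wide */
        }

so both the factor (sd_llc_size) and how wakee_flips are accounted -- which is
exactly what the twiddle below changes -- decide when a 1:N load stops being
treated as a pair.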
> 
> hrmph, twiddle...
> 
> waker/wakee coupling strengthened
> postgres@homer:~> pgbench.sh
> clients 1       tps = 18035
> clients 2       tps = 32525
> clients 4       tps = 53246
> clients 8       tps = 37278
> 
> better, but not enough..  + sd_llc_size = #cores vs #threads
> postgres@homer:~> pgbench.sh
> clients 1       tps = 18482
> clients 2       tps = 32366
> clients 4       tps = 54557
> clients 8       tps = 69643
> 
> Ok, that's what I want to see, full repeat.
> master = twiddle
> master+ = twiddle+patch
> 
> concurrent tbench 4 + pgbench, 2 minutes per client count (i4790+smt)
>                                              master                            master+
> pgbench                   1       2       3     avg         1       2       3     avg   comp
> clients 1       tps = 18599   18627   18532   18586     17480   17682   17606   17589   .946
> clients 2       tps = 32344   32313   32408   32355     25167   26140   23730   25012   .773
> clients 4       tps = 52593   51390   51095   51692     22983   23046   22427   22818   .441
> clients 8       tps = 70354   69583   70107   70014     66924   66672   69310   67635   .966
> 
> Hrm... turn the tables, measure tbench while pgbench 4 client load runs 
> endlessly.
> 
>                                              master                            master+
> tbench                    1       2       3     avg         1       2       3     avg   comp
> pairs 1        MB/s =   430     426     436     430       481     481     494     485  1.127
> pairs 2        MB/s =  1083    1085    1072    1080      1086    1090    1083    1086  1.005
> pairs 4        MB/s =  1725    1697    1729    1717      2023    2002    2006    2010  1.170
> pairs 8        MB/s =  2740    2631    2700    2690      3016    2977    3071    3021  1.123
> 
> tbench without competition
>                master        master+   comp
> pairs 1        MB/s =   694     692    .997 
> pairs 2        MB/s =  1268    1259    .992
> pairs 4        MB/s =  2210    2165    .979
> pairs 8        MB/s =  3586    3526    .983  (yawn, all within routine variance)

Hm, it seems tbench with competition is better only because a busy system makes
the tbench processes get woken on the same CPU.
 
> twiddle:
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6048,14 +6048,18 @@ static void update_top_cache_domain(int
>  {
>       struct sched_domain *sd;
>       struct sched_domain *busy_sd = NULL;
> +     struct sched_group *group;
>       int id = cpu;
>       int size = 1;
>  
>       sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>       if (sd) {
>               id = cpumask_first(sched_domain_span(sd));
> -             size = cpumask_weight(sched_domain_span(sd));
>               busy_sd = sd->parent; /* sd_busy */
> +             group = sd->groups;
> +             /* Set size to the number of cores, not threads */
> +             while (group = group->next, group != sd->groups)
> +                     size++;
>       }
>       rcu_assign_pointer(per_cpu(sd_busy, cpu), busy_sd);
>  
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4421,19 +4421,26 @@ static unsigned long cpu_avg_load_per_ta
>  
>  static void record_wakee(struct task_struct *p)
>  {
> +     unsigned long now = jiffies;
> +
>       /*
>        * Rough decay (wiping) for cost saving, don't worry
>        * about the boundary, really active task won't care
>        * about the loss.
>        */
> -     if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
> +     if (time_after(now, current->wakee_flip_decay_ts + HZ)) {
>               current->wakee_flips >>= 1;
> -             current->wakee_flip_decay_ts = jiffies;
> +             current->wakee_flip_decay_ts = now;
> +     }
> +     if (time_after(now, p->wakee_flip_decay_ts + HZ)) {
> +             p->wakee_flips >>= 1;
> +             p->wakee_flip_decay_ts = now;
>       }
>  
>       if (current->last_wakee != p) {
>               current->last_wakee = p;
>               current->wakee_flips++;
> +             p->wakee_flips++;
>       }
>  }
>  
> 
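If I read the twiddle right, the core.c hunk makes sd_llc_size count cores
instead of SMT threads (4 instead of 8 on the i4790), and the fair.c hunk makes
the wakee_flips accounting symmetric between waker and wakee, so both halves
lower the bar for wake_wide() to fire on the 1:N pgbench load.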

Regards,
Kirill