On 26.09.2015 18:25, Mike Galbraith wrote:
> On Fri, 2015-09-25 at 20:54 +0300, Kirill Tkhai wrote:
>> We are not interested in the actual target if both prev
>> and curr cpus share a CPU cache. select_idle_sibling()
>> searches in top-down order; the top level is the same
>> for both of them, so the result will be the same.
>> So, we can save a few CPU cycles and cache misses
>> by skipping the wake_affine() calculations.
>
> But, whereas previously wake_affine() could NAK a migration if it would
> create an imbalance, we'll now just go ahead and stack tasks if
> select_idle_sibling() can't find an idle home to override the blanket
> approval. It doesn't look like a good idea to me to bounce tasks around
> only to then perhaps stack them, as if we do stack waker/wakee, we
> certainly lose concurrency. (microbenchmarks like pipe-test love that,
> but not all that many real applications play ping-pong for a living;)
>
> I spent most of the day piddling with your little patch, so I'll post
> some condensed mixed load notes.
>
> concurrent tbench 4 + pgbench, 30 seconds per client count (i4790+smt)
>                          master                    master+
> pgbench             1     2     3   avg       1     2     3   avg   comp
> clients 1 tps = 18768 18591 18264 18541   18351 17257 17245 17617   .950
> clients 2 tps = 30779 30661 31016 30818   29112 28026 29026 28721   .931
> clients 4 tps = 54195 55100 54048 54447   53290 52336 52930 52852   .970
> clients 8 tps = 60332 67052 64699 64027   38491 35746 37746 37327   .582!!
Yeah, this is terrible.
> Do the opposite, wake_affine() always NAKs.
>                          master                    master++
> pgbench             1     2     3   avg       1     2     3   avg   comp
> clients 1 tps = 18768 18591 18264 18541   16874 16865 16665 16801    .906
> clients 2 tps = 30779 30661 31016 30818   33562 33546 33681 33596   1.090
> clients 4 tps = 54195 55100 54048 54447   61544 61482 61117 61381   1.127
> clients 8 tps = 60332 67052 64699 64027   75171 75524 75318 75337   1.176
It looks like the NAK may be better because it preserves the L1 cache,
whereas the patch always invalidates it.
Could you tell me whether you run pgbench with just -cX -jY -T30, or with
something special? I've tried it, but the spread of the results varies a
lot from run to run.
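
To put a number on run-to-run spread, here is a minimal sketch in plain
userspace C (not part of either patch). It computes the mean of a few tps
samples, their relative spread, and the ratio of two means as in the
"comp" column. The helper names mean(), rel_spread() and comp() are made
up for illustration, and (max - min) / mean is only one assumed way to
define spread.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples (n must be > 0). */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Relative spread: (max - min) / mean. An assumed metric for this sketch. */
static double rel_spread(const double *v, size_t n)
{
	double lo, hi;

	assert(n > 0);
	lo = hi = v[0];
	for (size_t i = 1; i < n; i++) {
		if (v[i] < lo)
			lo = v[i];
		if (v[i] > hi)
			hi = v[i];
	}
	return (hi - lo) / mean(v, n);
}

/* Ratio of the patched mean to the baseline mean, like the "comp" column. */
static double comp(const double *base, const double *patched, size_t n)
{
	return mean(patched, n) / mean(base, n);
}
```

Fed the 8-client rows of the 30 second table above (60332 67052 64699 vs
38491 35746 37746), comp() comes out at about .58, in line with the quoted
.582, and the master runs alone already spread by roughly 10%.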
>
> ...
>
> virgin vs your patch again, 2 _minutes_ per client count, as I noticed
> much variance at 8 clients, where wake_wide() is supposed to kick in to
> keep N:M load spread out.
>
>                          master                    master+
> pgbench             1     2     3   avg       1     2     3   avg   comp
> clients 1 tps = 18548 18673 18390 18537   17879 17652 17621 17717   .955
> clients 2 tps = 31083 31110 30859 31017   30274 30003 29796 30024   .967
> clients 4 tps = 53107 53156 53601 53288   52658 53024 53449 53043   .995
> clients 8 tps = 34213 34310 28844 32455   31360 31416 30732 31169   .960
>
> 30 seconds per run isn't enough, and wake_wide() is not doing a wonderful job
> for 1:N pgbench.
>
> hrmph, twiddle...
>
> waker/wakee coupling strengthened
> postgres@homer:~> pgbench.sh
> clients 1 tps = 18035
> clients 2 tps = 32525
> clients 4 tps = 53246
> clients 8 tps = 37278
>
> better, but not enough.. + sd_llc_size = #cores vs #threads
> postgres@homer:~> pgbench.sh
> clients 1 tps = 18482
> clients 2 tps = 32366
> clients 4 tps = 54557
> clients 8 tps = 69643
>
> Ok, that's what I want to see, full repeat.
> master = twiddle
> master+ = twiddle+patch
>
> concurrent tbench 4 + pgbench, 2 minutes per client count (i4790+smt)
>                          master                    master+
> pgbench             1     2     3   avg       1     2     3   avg   comp
> clients 1 tps = 18599 18627 18532 18586   17480 17682 17606 17589   .946
> clients 2 tps = 32344 32313 32408 32355   25167 26140 23730 25012   .773
> clients 4 tps = 52593 51390 51095 51692   22983 23046 22427 22818   .441
> clients 8 tps = 70354 69583 70107 70014   66924 66672 69310 67635   .966
>
> Hrm... turn the tables, measure tbench while pgbench 4 client load runs
> endlessly.
>
>                          master                    master+
> tbench              1     2     3   avg       1     2     3   avg   comp
> pairs 1  MB/s =   430   426   436   430     481   481   494   485   1.127
> pairs 2  MB/s =  1083  1085  1072  1080    1086  1090  1083  1086   1.005
> pairs 4  MB/s =  1725  1697  1729  1717    2023  2002  2006  2010   1.170
> pairs 8  MB/s =  2740  2631  2700  2690    3016  2977  3071  3021   1.123
>
> tbench without competition
>                  master  master+   comp
> pairs 1  MB/s =     694      692   .997
> pairs 2  MB/s =    1268     1259   .992
> pairs 4  MB/s =    2210     2165   .979
> pairs 8  MB/s =    3586     3526   .983  (yawn, all within routine variance)
Hm, it seems tbench with competition is better only because a busy system
makes the tbench processes wake up on the same cpu.
> twiddle:
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6048,14 +6048,18 @@ static void update_top_cache_domain(int
>  {
>  	struct sched_domain *sd;
>  	struct sched_domain *busy_sd = NULL;
> +	struct sched_group *group;
>  	int id = cpu;
>  	int size = 1;
>
>  	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>  	if (sd) {
>  		id = cpumask_first(sched_domain_span(sd));
> -		size = cpumask_weight(sched_domain_span(sd));
>  		busy_sd = sd->parent; /* sd_busy */
> +		group = sd->groups;
> +		/* Set size to the number of cores, not threads */
> +		while (group = group->next, group != sd->groups)
> +			size++;
>  	}
>  	rcu_assign_pointer(per_cpu(sd_busy, cpu), busy_sd);
>
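
The group-counting loop in the hunk above can be tried in userspace. This
is only a sketch: struct group is a hypothetical stand-in for the kernel's
struct sched_group that keeps nothing but the circular ->next link, and
link_ring() and count_groups() are names invented here. The loop body has
the same shape as the hunk: start at 1 and count until the walk wraps back
to the first group.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sched_group: only the ring link. */
struct group {
	struct group *next;
};

/* Link n groups into a ring, mimicking a circular sd->groups list. */
static void link_ring(struct group *g, size_t n)
{
	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		g[i].next = &g[(i + 1) % n];
}

/* Same loop shape as the hunk: size starts at 1 for the first group. */
static int count_groups(struct group *first)
{
	struct group *group = first;
	int size = 1;

	while (group = group->next, group != first)
		size++;
	return size;
}
```

Assuming each group at that level is one core's set of SMT siblings, a
4-core/8-thread part would give 4 here, where cpumask_weight() on the
domain span gave 8.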
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4421,19 +4421,26 @@ static unsigned long cpu_avg_load_per_ta
>
>  static void record_wakee(struct task_struct *p)
>  {
> +	unsigned long now = jiffies;
> +
>  	/*
>  	 * Rough decay (wiping) for cost saving, don't worry
>  	 * about the boundary, really active task won't care
>  	 * about the loss.
>  	 */
> -	if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
> +	if (time_after(now, current->wakee_flip_decay_ts + HZ)) {
>  		current->wakee_flips >>= 1;
> -		current->wakee_flip_decay_ts = jiffies;
> +		current->wakee_flip_decay_ts = now;
> +	}
> +	if (time_after(now, p->wakee_flip_decay_ts + HZ)) {
> +		p->wakee_flips >>= 1;
> +		p->wakee_flip_decay_ts = now;
>  	}
>
>  	if (current->last_wakee != p) {
>  		current->last_wakee = p;
>  		current->wakee_flips++;
> +		p->wakee_flips++;
>  	}
>  }
>
>
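
To see what the fair.c hunk does to the flip counters, here is a small
userspace model of it. It is a sketch under assumptions: struct task holds
only the three fields the hunk touches, the waker is passed explicitly
instead of using current, the time is a parameter instead of jiffies,
HZ is fixed at 1000, and after() is a local wrap-safe comparison in the
spirit of time_after().

```c
#include <assert.h>
#include <stddef.h>

#define HZ 1000UL	/* assumed tick rate, for this sketch only */

/* Only the fields the hunk touches; not the real task_struct. */
struct task {
	struct task *last_wakee;
	unsigned long wakee_flips;
	unsigned long wakee_flip_decay_ts;
};

/* Wrap-safe "a is after b". */
static int after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Halve the flip count once more than HZ ticks have passed. */
static void decay(struct task *t, unsigned long now)
{
	if (after(now, t->wakee_flip_decay_ts + HZ)) {
		t->wakee_flips >>= 1;
		t->wakee_flip_decay_ts = now;
	}
}

/* With the twiddle, both waker and wakee decay, and both count a flip. */
static void record_wakee(struct task *waker, struct task *p,
			 unsigned long now)
{
	assert(waker && p);
	decay(waker, now);
	decay(p, now);

	if (waker->last_wakee != p) {
		waker->last_wakee = p;
		waker->wakee_flips++;
		p->wakee_flips++;
	}
}
```

In this model a waker alternating between two wakees accumulates flips on
itself and on each wakee, which is the stronger waker/wakee coupling
described above.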
Regards,
Kirill