Hi Segher,
Segher Boessenkool writes:
> On Fri, Nov 11, 2016 at 02:16:18AM +0100, Nicolai Stange wrote:
>> >> From the discussion on gcc-patches [1] of what is now the aforementioned
>> >> r228318 ("bb-reorder: Add -freorder-blocks-algorithm= and wire it up"),
>> >> it is not clear to me whether this change can actually reduce code size
>> >> beyond those 0.1% given there for -Os.
>> >
>> > There is r228692 as well.
>>
>> Ok, summarizing, that changelog says that the simple algorithm
>> potentially produced even bigger code with -Os than stc did. From that
>> commit on, this remains true only on x86 and mn10300. Right?
>
> x86 and mn10300 use STC at -Os by default.
Ah! This explains why I didn't see such a BB ordering on x86.
>> Sure. Let me restate my original question: assume for a moment that
>> -Os with simple really never produces code more than 0.1% smaller than
>> what -Os with stc creates. I haven't got any idea what the "other
>> things" are able to achieve w.r.t. code size savings, but to me, 0.1%
>> doesn't appear to be that huge. Don't get me wrong: I *really* can't
>> judge whether 0.1% is a significant improvement or not; I'm just
>> assuming that it's not. Under this assumption, the question of whether
>> those saved 0.1% are really worth the significantly decreased
>> performance encountered in some situations seemed only natural...
>
> It all depends on the tradeoff you want. There are many knobs you can
> turn -- for example the inlining params, that has quite some effect on
> code size.
>
> -Os is mostly -O2 except those things that increase code size.
>
> What is the tradeoff in your case? What is a realistic number for the
> slowdown of your kernel? Do you see hotspots in there that should be
> handled better anyhow? Etc.
Well, I do see hotspots in my (admittedly crappy, not yet upstreamed)
RFC code, namely the 0.5 us of extra latency on a memory-stressed
system mentioned initially. The problem is that this happens in
interrupt context...
I'm quite certain that this is due to a mispredicted branch in

  for (...) {
          g();
  }

-- g() lives on another page and the TLB is cold.
However, once my test hardware is free again, I'll try to identify some
hotspots and get some numbers for a vanilla kernel.
For example, __hrtimer_run_queues() suffers from the same BB ordering,
but its code doesn't cross pages. I'd really have to measure it...
>> No, I want small, possibly at the cost of performance, to the extent
>> that's sensible. What "sensible" actually means is what my question is about.
>
> It is different for every use case I'm afraid.
>
>> So, summarizing, I'm not asking whether I should use -O2 or -Os or
>> whatever, but whether the current behaviour I'm seeing with -Os is
>> intended/expected quantitatively.
>
> With simple you get smaller code than with STC, so -Os uses simple.
> If that is ridiculously slower, then you won't hear me complaining if
> you propose defaulting it the other way; but you haven't shown any
> convincing numbers yet?
Hmm, maybe there's a better way than changing the default.
If I read r228692 correctly, for the -Os case
reorder_basic_blocks_simple() is made to retain the edge order as given
by the BB order (whatever that is?) because this reduces code size
(because of reasons).
I think the code size savings come from maximizing the total fallthrough
edge count -- fallthrough edges won't lead to jump insns in the output.
Is this correct?
However, I believe this edge ordering can be relaxed: only a certain
type of edge must come before all the others. "All the others" can then
be sorted by frequency again, leading to better static branch
prediction, especially in the loop case.
Thinking locally, if we have
A
|\
| B
|/
C
we want to have the output order A, B, C, not A, C, B, because:
- A will have a single branch insn at its end either way,
- with the second ordering, B will need one, too.
Now, if the edge B->C is considered first by the chain-assembling part
of reorder_basic_blocks_simple(), this will produce the desired output.
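In C terms, the diamond corresponds to a fragment like this (a
hypothetical example, made up purely for illustration):

  if (cond)         /* A: ends in a conditional branch either way */
          b();      /* B: the conditional arm */
  c();              /* C: the join block */

With the layout A, B, C, both A's not-taken edge and B->C are
fallthroughs; with A, C, B, block B has to end in an explicit "jump C"
insn.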
I've appended a patch that does just this: it puts all the edges
originating from BBs with only one outgoing edge first, and the rest,
sorted by frequency, after them.
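The heart of it is an edge comparator along these lines (a simplified
sketch of the idea only, not the literal hunk from the attached patch;
single_succ_p() and EDGE_FREQUENCY() are the usual helpers from
basic-block.h):

  /* Order the edge array for the chain assembly in
     reorder_basic_blocks_simple(): edges whose source block has
     exactly one outgoing edge come first -- they must become
     fallthroughs to avoid an unconditional jump insn -- and the
     remaining edges are sorted by decreasing frequency as before.  */
  static int
  edge_order (const void *ve1, const void *ve2)
  {
    edge e1 = *(const edge *) ve1;
    edge e2 = *(const edge *) ve2;
    bool s1 = single_succ_p (e1->src);
    bool s2 = single_succ_p (e2->src);

    if (s1 != s2)
      return s1 ? -1 : 1;       /* single-successor sources first */

    return EDGE_FREQUENCY (e2) - EDGE_FREQUENCY (e1);  /* then hotter
                                                          edges first */
  }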
The results for the single test case I've tried so far, namely the
kernel with some config on ARM, look surprisingly good:
W/o patch:

 Idx Name          Size      VMA       LMA       File off  Algn
   0 .head.text    0000026c  c0008000  c0008000  00008000  2**2
   1 .text         002ab370  c0100000  c0100000  00010000  2**5
  16 .init.text    00027270  c06002e0  c06002e0  003f02e0  2**5
  17 .exit.text    00000594  c0627550  c0627550  00417550  2**2

W/ patch:

 Idx Name          Size      VMA       LMA       File off  Algn
   0 .head.text    0000026c  c0008000  c0008000  00008000  2**2
   1 .text         002aaeac  c0100000  c0100000  00010000  2**5
  16 .init.text    0002723c  c06002e0  c06002e0  003f02e0  2**5
  17 .exit.text    00000594  c062751c  c062751c  0041751c  2**2
So even slightly smaller code is produced. But more importantly, it gets
the fallthrough for the loops