> On 26 Aug 2017, at 8:39 PM, Andrew Pinski <pins...@gmail.com> wrote:
>
> On Sat, Aug 26, 2017 at 1:23 AM, Michael Clark <michaeljcl...@mac.com> wrote:
>> Dear GCC folk,
>>
>> I have to say that GCC’s -Os caught me by surprise after several years using Apple GCC and more recently LLVM/Clang in Xcode. Over the last year and a half I have been working on RISC-V development and have been exclusively using GCC for RISC-V builds, and initially I was using -Os. After performing a qualitative/quantitative assessment I don’t believe GCC’s current -Os is particularly useful, at least for my needs, as it doesn’t provide a commensurate saving in size given the sometimes quite huge drop in performance.
>>
>> I’m quoting an extract from Eric’s earlier email on the “Overwhelmed by GCC frustration” thread, as I think Apple’s documentation, which presumably documents Clang/LLVM -Os policy, is what I would call an ideal -Os (perhaps using -O2 as a starting point), with the idea that the current -Os is renamed to -Oz.
>>
>>   -Oz    (APPLE ONLY) Optimize for size, regardless of performance. -Oz
>>          enables the same optimization flags that -Os uses, but -Oz also
>>          enables other optimizations intended solely to reduce code size.
>>          In particular, instructions that encode into fewer bytes are
>>          preferred over longer instructions that execute in fewer cycles.
>>          -Oz on Darwin is very similar to -Os in FSF distributions of GCC.
>>          -Oz employs the same inlining limits and avoids string
>>          instructions just like -Os.
>>
>>   -Os    Optimize for size, but not at the expense of speed. -Os enables
>>          all -O2 optimizations that do not typically increase code size.
>>          However, instructions are chosen for best performance, regardless
>>          of size. To optimize solely for size on Darwin, use -Oz (APPLE
>>          ONLY).
>>
>> I have recently been working on a benchmark suite to test a RISC-V JIT engine. I have performed all testing using GCC 7.1 as the baseline compiler, and during the process I have collected several performance metrics, some that are neutral to the JIT runtime environment. In particular I have made performance comparisons between -Os and -O3 on x86, along with capturing executable file sizes, dynamic retired instruction and micro-op counts for x86, dynamic retired instruction counts for RISC-V, as well as dynamic register and instruction usage histograms for RISC-V, for both -Os and -O3.
>>
>> See the Optimisation section for a charted performance comparison between -O3 and -Os. There are dozens of other plots that show the differences between -Os and -O3.
>>
>> - https://rv8.io/bench
>>
>> The geomean on x86 shows a 19% performance hit for -Os vs -O3. The geomean of course smooths over some pathological cases where -Os performance is severely degraded versus -O3 without significant, or commensurate, savings in size.
>
> First let me put -Os usage into some perspective, with some history:
> 1) -Os is not useful for non-embedded users
> 2) the embedded folks really need the smallest code possible and usually will be willing to afford the performance hit
> 3) -Os was a mistake for Apple to use in the first place; they used it and then GCC got better for PowerPC to use the string instructions, which is why -Oz was added :)
> 4) -Os is used heavily by the arm/thumb2 folks in bare metal applications.
>
> Comparing -O3 to -Os is not totally fair on x86 due to the many different instructions and encodings.
> Compare it on ARM/Thumb2 or MIPS/MIPS16 (or micromips) where size is a big issue.
> I soon have a need to keep overall (bare-metal) application size down to just 256k.
> Micro-controllers are places where -Os matters the most.
Fair points.

- Size at all costs is useful for the embedded case where there is a restricted footprint.
- It’s fair to compare on RISC-V, which has the RVC compressed ISA extension, conceptually similar to Thumb-2.
- I understand renaming -Os to -Oz would cause a few downstream issues for those who expect size at all costs.
- There is an achievable use case for good RVC compression and good performance on RISC-V.

However the question remains: what options does one choose for size, but not size at the expense of speed? -O2 and an -mtune?

I’m probably interested in an -O2 with an -mtune that can favour register allocations that result in better RVC compression for RISC-V. Ideally the dominant register set can be assigned to x8 through x15 using loop frequency information, and this would result in better compression and also reduce dynamic icache pressure. I think I should look more closely at LRA and see how it uses register_priority (there is a sketch of the current hook below).

There is a use case for high-performance code that also makes good use of RVC on RISC-V, while there may also be a use case for the current -Os for bare metal, where the implementor chooses to sacrifice speed for size at all costs. The problem is there is only one -Os flag, whereas having both -Oz and -Os makes the distinction clear between size at all costs and size but not at the expense of speed, i.e. the cases where reduced size improves performance. I guess an -mtune for -O2 might be something worth considering. If the C extension is selected on RISC-V, the compiler should make best use of it, for performance reasons.

Someone has nicely summarised the Clang/LLVM flags. It seems Clang/LLVM retains the distinction between -Os and -Oz (and not just Apple; Google Chrome also uses -Oz):

- https://stackoverflow.com/questions/15548023/clang-optimization-levels

• -Os is the same as -O2
• -Oz is based on -Os
• opt drops: -slp-vectorizer
• clang drops: -vectorize-loops

There may be an argument for flag compatibility with Clang/LLVM. At present one would need to pass -Oz to get the Clang/LLVM equivalent of GCC’s -Os. I guess I should use -O2, which I think is what musl uses (and which seems to have the same meaning between GCC and Clang). There are embedded use cases that are not extremely constrained by size, where size is still an issue, but so is performance.
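To make the register_priority idea concrete, this is roughly the shape of the hook in the RISC-V backend (riscv.c implements TARGET_REGISTER_PRIORITY). Treat it as a sketch from memory rather than an exact copy of the 7.1 source; IN_RANGE, GP_REG_FIRST, FP_REG_FIRST and TARGET_RVC are GCC internals. When all other costs are equal, LRA prefers hard registers with a higher priority, so one obvious experiment is to widen the gap for the RVC set (the value 8 below is an arbitrary number to try, not a tuned one):

  /* Sketch of the TARGET_REGISTER_PRIORITY hook for RISC-V.  x8-x15 and
     f8-f15 are the registers reachable by most RVC (compressed) encodings,
     so giving them a higher priority nudges LRA towards allocations that
     the assembler can later compress.  */

  static int
  riscv_register_priority (int regno)
  {
    if (TARGET_RVC
        && (IN_RANGE (regno, GP_REG_FIRST + 8, GP_REG_FIRST + 15)
            || IN_RANGE (regno, FP_REG_FIRST + 8, FP_REG_FIRST + 15)))
      return 1;   /* experiment: return a larger value, e.g. 8 */

    return 0;
  }

Whether a larger spread actually helps will depend on how LRA weighs this priority against its other costs, which is what I want to dig into.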
>> I don’t currently have -O2 in my results; however, it seems like I should add -O2 to the benchmark suite. If you take a look at the web page you’ll see that there is already a huge amount of data, given we have captured dynamic register frequencies and dynamic instruction frequencies for -Os and -O3. The tables and charts are all generated by scripts, so if there is interest I could add -O2. I can also pretty easily perform runs with new compiler versions as everything is completely automated. The biggest factor is that it currently takes 4 hours for a full run, as we run all of the benchmarks in a simulator to capture dynamic register usage and dynamic instruction usage.
>>
>> After looking at the results, one has to question the utility of -Os in its present form, and indeed question how it is actually used in practice, given the proportion of savings in executable size. After my assessment I would not recommend anyone use -Os because its savings in size are not proportionate to the loss in performance. I feel discouraged from using it after looking at the results. I really don’t believe -Os makes the right trades, e.g. reduced code size can indeed lead to better performance due to reduced icache pressure.
>
> This comment does not help my application usage. It rather hurts it and goes against what -Os is really about. It is not about reducing icache pressure but overall application code size. I really need the code to fit into a specific size.
>
> Thanks,
> Andrew
>
>> I also wonder whether -O2 level optimisations may be a good starting point for a more useful -Os, and how one would proceed towards selecting optimisations to add back to -Os to increase its usability, or rename the current -Os to -Oz and make -Os an alias for -O2. A similar profile to -O2 would probably produce less shock for anyone who does quantitative performance analysis of -Os.
>>
>> In fact there are some interesting issues for the RISC-V backend, given the assembler performs RVC compression and GCC doesn’t really see the size of emitted instructions. It would be an interesting backend to investigate improving -Os, presuming that a backend can opt in to various optimisations for a given optimisation level. RISC-V would gain most of its size and runtime icache pressure reduction improvements by getting the highest frequency registers allocated within the 8-register set that is accessible by the RVC instructions. Merely controlling register allocation to favour the RVC accessible registers would produce the largest savings in executable size, and may indeed be good for performance due to reduced icache pressure.
>>
>> I have Dynamic Register Frequency Charts, but they are not presently labeled or coloured by whether the registers are RVC accessible registers (x8 to x15). I did however work on some crude ASCII histograms that indicate register access frequency and whether the register is RVC accessible. Ideally the register allocator would allocate the highest frequency registers first from the RVC set. The register order is already correctly defined in the RISC-V backend. I have been experimenting with riscv_register_priority to try to nudge LRA but have not yet had success. riscv_register_priority currently returns 1 for RVC registers (if the C extension is present) and 0 for regular registers; however, the loop frequency information is obviously not accurate enough, or LRA does not completely honour the register order and priority. It’s likely it may not make a lot of difference on platforms with very regular register files. See this gist for one of the benchmarks’ register access frequencies, labeled as to whether the register is accessible from compressed instructions:
>>
>> - https://gist.github.com/michaeljclark/8ba727e56084833e4f838c941eeca6be
>>
>> Question. Who uses -Os on GCC?
>>
>> I have for many years used -Os on macOS for Clang builds, as it has been an Xcode default, but I’m considering using -O2 instead of -Os with FSF GCC. I was using FSF GCC’s -Os under the mistaken impression that it operates similarly to -Os in Xcode, i.e. produces code that performs well.
>>
>> In any case, despite my rant, I hope the quantitative stats in the link above prove to be useful.
>>
>> Thanks and Regards,
>> Michael.
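PS: for anyone who wants to play with the register frequency idea without the rv8 tooling, below is a minimal standalone sketch of the kind of histogram described above. It assumes a trace of raw 32-bit RISC-V instruction words, one hex word per line on stdin; compressed instructions and the less common formats are simply skipped, so it is an illustration of the idea rather than the tool used for the charts.

  /* Minimal sketch (not the rv8 tooling): tally register-operand frequency
     from a trace of 32-bit RISC-V instruction words and mark the registers
     reachable by most RVC (compressed) encodings (x8-x15).  */

  #include <stdio.h>
  #include <stdint.h>
  #include <inttypes.h>

  static unsigned long counts[32];

  static void count_insn(uint32_t insn)
  {
      if ((insn & 3) != 3)
          return;                      /* 16-bit (compressed) or invalid: skip */

      uint32_t opcode = insn & 0x7f;
      uint32_t rd  = (insn >> 7)  & 0x1f;
      uint32_t rs1 = (insn >> 15) & 0x1f;
      uint32_t rs2 = (insn >> 20) & 0x1f;

      switch (opcode) {
      case 0x37: case 0x17: case 0x6f:            /* LUI, AUIPC, JAL: rd */
          counts[rd]++;
          break;
      case 0x67: case 0x03: case 0x13: case 0x1b: /* JALR, LOAD, OP-IMM[-32] */
          counts[rd]++; counts[rs1]++;
          break;
      case 0x63: case 0x23:                       /* BRANCH, STORE: rs1, rs2 */
          counts[rs1]++; counts[rs2]++;
          break;
      case 0x33: case 0x3b:                       /* OP, OP-32: rd, rs1, rs2 */
          counts[rd]++; counts[rs1]++; counts[rs2]++;
          break;
      default:                                    /* FP, AMO, SYSTEM, ...: skip */
          break;
      }
  }

  int main(void)
  {
      uint32_t insn;
      while (scanf("%" SCNx32, &insn) == 1)
          count_insn(insn);
      for (int r = 0; r < 32; r++)
          printf("x%-2d %10lu %s\n", r, counts[r],
                 (r >= 8 && r <= 15) ? "RVC" : "");
      return 0;
  }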