Hi Honza,

I experimented with building Coremark with both PGO and LTO at -O3 on an AArch64 machine. First I generated profiles using the recommended seeds in Coremark's readme.txt, then compiled again with -O3 -flto -fprofile-use.

I tried the GCC Linaro compiler (September release), which is based on FSF 4.9, and GCC trunk as of 30-Sep-2014. With the Linaro compiler, perf events show a 5% lower instruction count than with the trunk version I used.

I looked at the generated code and see that IPA inlining has changed between Linaro and trunk. The Linaro compiler does not seem to inline a function called "crcu32", whereas trunk inlines it but does not inline "crcu16". Trunk also does not detect an IPA indirect inlining opportunity for a function called "cmp_complex". The number of ltrans partitions is 3 with the Linaro compiler and is reduced to 2 in trunk.

Eyeballing the dump, it seems the --param max-inline-insns-auto limit is reached, and hence some functions are not inlined. I tried increasing this limit from 40 to 45, 50 and 100. This does not help in inlining "crcu32" in trunk, but "cmp_complex" is inlined once the limit is set to 45. However, this does not reduce the instruction count.

With the Linaro compiler I manually prevented crcu16 from being inlined. The Linaro compiler then behaves the same way as trunk: it inlines crcu32, crcu16 is not inlined, and the instruction count increases. So the inlining decision for "crcu16" seems to be what increases the instruction count in trunk.

I tried the latest trunk only on an x86_64 machine, and the inlining behavior is the same as with the trunk version I used on AArch64.

LTO may not be the best thing to try on Coremark, but I just wanted to check whether trunk (5.0) is better than GCC 4.9.

Can you suggest where I should look in GCC to see why these inlining decisions changed in trunk? Also, has the inline size calculation in IPA changed in trunk compared to FSF 4.9?

Please advise.

Regards,
Venkat.
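
For illustration, one way to keep a function out of line in GCC is the noinline function attribute. The sketch below is an assumption about how the crcu16 experiment could be set up, not the exact change that was made: the function bodies are simplified, hypothetical stand-ins that only reuse the names crcu16 and crcu32 from above, and crcu8 and the CRC polynomial are chosen purely for the example.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the CRC helpers named above.
   They are not copied from Coremark; only the call structure matters. */

/* Fold one byte into a 16-bit CRC, one bit at a time.
   The polynomial (0xA001) is an illustrative choice. */
static inline uint16_t crcu8(uint8_t data, uint16_t crc)
{
    for (int i = 0; i < 8; i++) {
        uint8_t bit = (uint8_t)((data ^ crc) & 1u);
        data >>= 1;
        crc >>= 1;
        if (bit)
            crc ^= 0xA001u;
    }
    return crc;
}

/* noinline asks GCC never to inline this function into its callers,
   regardless of what the inliner's size heuristics would decide. */
__attribute__((noinline))
uint16_t crcu16(uint16_t value, uint16_t crc)
{
    crc = crcu8((uint8_t)value, crc);          /* low byte first */
    crc = crcu8((uint8_t)(value >> 8), crc);   /* then high byte */
    return crc;
}

/* The caller is left to the inliner; both calls to crcu16 stay real calls. */
uint16_t crcu32(uint32_t value, uint16_t crc)
{
    crc = crcu16((uint16_t)value, crc);          /* low half first */
    crc = crcu16((uint16_t)(value >> 16), crc);  /* then high half */
    return crc;
}
```

Compiling a file like this with -O3 -fdump-ipa-inline should make the effect visible in the IPA inline dump, which can then be compared against a build without the attribute.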