> I did some measurement (64bit).
>
> Experiment 1:
>
> -O2 -funroll-loops vs -O2
>
> It improves performance (geomean) by 0.56%, not too much:
>                 O2      O2 unroll-loops
> 164.gzip        1324    1331     0.56%
> 175.vpr         1694    1605    -5.24%
> 176.gcc         2293    2350     2.47%
> 181.mcf         1772    1788     0.90%
> 186.crafty      2320    2326     0.26%
> 197.parser      1166    1162    -0.32%
> 252.eon         2443    2529     3.50%
> 253.perlbmk     2410    2460     2.07%
> 254.gap         1987    2019     1.58%
> 255.vortex      2392    2406     0.58%
> 256.bzip2       1719    1715    -0.25%
> 300.twolf       2288    2308     0.88%
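
For reference, the geomean figure quoted above can be recomputed from the per-benchmark scores. The following is only a minimal sketch: the scores are copied from the Experiment 1 table, while the names `SCORES` and `geomean_gain_percent` are made up for illustration and do not come from any SPEC tooling.

```python
import math

# SPEC scores copied from the Experiment 1 table:
# (benchmark, -O2 score, -O2 -funroll-loops score). Higher is better.
SCORES = [
    ("164.gzip", 1324, 1331),
    ("175.vpr", 1694, 1605),
    ("176.gcc", 2293, 2350),
    ("181.mcf", 1772, 1788),
    ("186.crafty", 2320, 2326),
    ("197.parser", 1166, 1162),
    ("252.eon", 2443, 2529),
    ("253.perlbmk", 2410, 2460),
    ("254.gap", 1987, 2019),
    ("255.vortex", 2392, 2406),
    ("256.bzip2", 1719, 1715),
    ("300.twolf", 2288, 2308),
]


def geomean_gain_percent(rows):
    """Geometric mean of the new/base score ratios, as a percentage gain."""
    # Averaging the logs of the ratios and exponentiating gives the geomean.
    log_sum = sum(math.log(new / base) for _, base, new in rows)
    return (math.exp(log_sum / len(rows)) - 1.0) * 100.0


if __name__ == "__main__":
    print(f"geomean gain: {geomean_gain_percent(SCORES):.2f}%")
```

With the twelve rows above this works out to about 0.56%, matching the quoted number. The same helper can be pointed at any other pair of score columns.
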
Can you also try -funroll-all-loops? For pretty small programs like
spec2k, -funroll-all-loops is often a win. Only in a few loops can we
work out the number of iterations, which -funroll-loops requires.

> Experiment 3: O2 lto vs O2: geomean 0.72%
>                 O2      O2 LTO
> 164.gzip        1324    1317    -0.53%
> 175.vpr         1694    1697     0.18%
> 176.gcc         2293    2291    -0.08%
> 181.mcf         1772    1760    -0.65%
> 186.crafty      2320    2245    -3.26%
> 197.parser      1166    1163    -0.29%
> 252.eon         2443    2576     5.44%
> 253.perlbmk     2410    2433     0.93%
> 254.gap         1987    1995     0.36%
> 255.vortex      2392    2588     8.19%
> 256.bzip2       1719    1729     0.56%
> 300.twolf       2288    2248    -1.77%

You need -O3 -fwhole-program -flto for reasonable cross-module inlining
to happen. -fwhole-program is quite essential to get a reasonable win
from LTO (w/o profile feedback). At least our nightly tester then gets
quite nice improvements on a few spec2k benchmarks; see also my
gccsummit slides.

Honza