Jan Hubicka wrote:
So things seem to work now more or less as expected. I.e. LTO builds
seem similar to combined builds, and whole-program improves code size
quite noticeably.
Runtime results for gzip are pretty much unchanged, but that is
expected. I am quite curious about a full SPEC run.
Before the fix (Jan's two latest patches), the LTO results were
disappointing. In brief, the results when I checked SPEC2000 a week ago
on the lto branch on a Core i7 (-O3 vs -O3 -flto with optional
-fwhole-program) were:
o Using LTO made the compiler 1.9 times slower (in CPU time) for
SPECInt2000 and 2.2 times slower for SPECFP2000 on x86 and x86_64.
o LTO generated 16-17% bigger code for Int2000 and 13-14% bigger for
FP2000.
o There was a 0.6% improvement for SPECFP2000 on x86 and 1% for
SPECInt2000 on x86_64 (only because of a 20% improvement on vortex;
all other tests were actually worse than without LTO).
o No improvement for Int2000 on x86 and FP2000 on x86_64.
o 252.eon and 176.gcc crashed the compiler when LTO was used.
With Jan's latest fixes, the results (for -O3 vs -O3 -flto
-fwhole-program) are:
x86:
o Int2000:
- LTO crashes the compiler on vortex. LTO generates
wrong code for vpr, gcc, perlbmk, and gap.
- Compiler is 1.85 times slower with LTO
- Average code size is almost 6% smaller:
      change       -O3       LTO  benchmark
      4.615%     44287     46331  164.gzip
     -3.145%    144101    139569  175.vpr
      0.261%   1566926   1571009  176.gcc
    -12.118%     12279     10791  181.mcf
     11.130%    209956    233324  186.crafty
    -29.735%    155358    109162  197.parser
    -23.075%    497347    382585  252.eon
      8.904%    552163    601327  253.perlbmk
      1.516%    503006    510630  254.gap
    -20.891%     47465     37549  256.bzip2
     -3.047%    198365    192321  300.twolf
    Average = -5.96236%
- Performance is improved by almost 4%:
    benchmark      -O3     LTO      change
    164.gzip      1668    1629   -2.33813%
    181.mcf       5011    5020    0.17960%
    186.crafty    2268    2277    0.39682%
    197.parser    1928    1925   -0.15560%
    252.eon       2477    2950    19.0957%
    256.bzip2     1894    1956     3.2735%
    300.twolf     2806    3026    7.84034%
    GeoMean       2416    2509    3.84934%
o FP2000
- LTO generates wrong code for mgrid, applu, galgel, facerec,
fma3d, sixtrack, and apsi.
- Compiler is 2.1 times slower with LTO
- Average code size is almost 1.7% smaller:
      change       -O3       LTO  benchmark
     -8.771%     27544     25128  168.wupwise
      2.328%      9108      9320  171.swim
      2.127%     18193     18580  172.mgrid
      0.004%     76584     76587  173.applu
     -5.938%    576270    542049  177.mesa
     -2.046%    183667    179910  178.galgel
    -10.635%     15881     14192  179.art
    -16.292%     28812     24118  183.equake
     -3.177%     67239     65103  187.facerec
     10.989%    125273    139039  188.ammp
     -0.735%     49137     48776  189.lucas
     -0.856%   1144550   1134756  191.fma3d
     11.457%    935941   1043168  200.sixtrack
    Average = -1.65735%
- Performance is improved by almost 6%:
    benchmark      -O3     LTO      change
    168.wupwise   2349    3266    39.0379%
    171.swim      3511    3529    0.51267%
    177.mesa      1970    2008    1.92893%
    179.art       7097    7293    2.76173%
    183.equake    3844    4138    7.64828%
    188.ammp      2423    2401   -0.90796%
    189.lucas     2825    2718   -3.78761%
    GeoMean       3144    3332    5.97964%
x86_64:
o Int2000:
- LTO crashes the compiler on gcc. LTO generates
wrong code for vpr, perlbmk, gap, and vortex
- Compiler is 1.8 times slower with LTO
- Average code size is more than 8% smaller:
      change       -O3       LTO  benchmark
      1.376%     49119     49795  164.gzip
     -4.348%    158389    151503  175.vpr
    -16.964%     14949     12413  181.mcf
     12.875%    195234    220370  186.crafty
    -29.519%    180780    127416  197.parser
    -22.894%    521614    402197  252.eon
      9.507%    645749    707141  253.perlbmk
      6.550%    585164    623492  254.gap
    -22.493%    660414    511866  255.vortex
    -18.343%     55825     45585  256.bzip2
     -5.295%    212727    201463  300.twolf
    Average = -8.14068%
- Performance is improved by 2.1%
    benchmark      -O3     LTO      change
    164.gzip      1804    1773    -1.7184%
    181.mcf       3480    3460    -0.5747%
    186.crafty    3397    3406     0.2649%
    197.parser    1847    1803    -2.3822%
    252.eon       4071    4537    11.4468%
    256.bzip2     2197    2249     2.3668%
    300.twolf     2878    3048     5.9068%
    GeoMean       2688    2744     2.0833%
o FP2000
- LTO crashes the compiler on apsi. LTO generates wrong code for
mgrid, applu, galgel, facerec, fma3d, and sixtrack.
- Compiler is 2.1 times slower with LTO
- Average code size is 2.7% smaller:
      change       -O3       LTO  benchmark
     27.674%     33902     43284  168.wupwise
     -3.107%     15704     15216  171.swim
     -0.685%     22929     22772  172.mgrid
     -1.167%    103280    102075  173.applu
     -8.346%    678724    622079  177.mesa
     -4.304%    249773    239024  178.galgel
    -25.801%     20375     15118  179.art
    -28.805%     37514     26708  183.equake
     -1.577%     76837     75625  187.facerec
      1.570%    168235    170877  188.ammp
     -1.168%     57271     56602  189.lucas
     -0.940%   1276316   1264314  191.fma3d
     10.949%   1106507   1227658  200.sixtrack
    Average = -2.74672%
- Performance is improved by almost 6%:
    benchmark      -O3     LTO      change
    168.wupwise   2532    3708    46.4455%
    171.swim      3740    3729    -0.2941%
    177.mesa      2969    2946    -0.7746%
    179.art       7278    7092    -2.5556%
    183.equake    3978    4227     6.2594%
    188.ammp      2490    2515     1.0040%
    189.lucas     3886    3806    -2.0586%
    GeoMean       3603    3812     5.8007%
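For anyone who wants to re-derive the summary rows, here is a minimal
sketch using the x86 Int2000 numbers above. It assumes (this is inferred
from the arithmetic, not stated in the tables) that the first numeric
column is plain -O3 and the second is -O3 -flto -fwhole-program, that
"Average" is the arithmetic mean of the per-benchmark percentage changes,
and that the GeoMean row is the geometric mean of each column. The
helper names are made up for illustration.

```python
import math

def percent_change(base, lto):
    """Change of the LTO figure relative to the -O3 baseline, in percent."""
    return (lto - base) / base * 100.0

def geomean(values):
    """Geometric mean, computed through logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# x86 Int2000 code sizes copied from the table: (benchmark, -O3, LTO).
X86_INT_SIZE = [
    ("164.gzip", 44287, 46331), ("175.vpr", 144101, 139569),
    ("176.gcc", 1566926, 1571009), ("181.mcf", 12279, 10791),
    ("186.crafty", 209956, 233324), ("197.parser", 155358, 109162),
    ("252.eon", 497347, 382585), ("253.perlbmk", 552163, 601327),
    ("254.gap", 503006, 510630), ("256.bzip2", 47465, 37549),
    ("300.twolf", 198365, 192321),
]

# x86 Int2000 performance figures copied from the table: (benchmark, -O3, LTO).
X86_INT_PERF = [
    ("164.gzip", 1668, 1629), ("181.mcf", 5011, 5020),
    ("186.crafty", 2268, 2277), ("197.parser", 1928, 1925),
    ("252.eon", 2477, 2950), ("256.bzip2", 1894, 1956),
    ("300.twolf", 2806, 3026),
]

# Assumed meaning of "Average": mean of the per-benchmark size changes.
size_changes = [percent_change(b, l) for _, b, l in X86_INT_SIZE]
avg_size_change = sum(size_changes) / len(size_changes)

# Assumed meaning of "GeoMean": geometric mean of each column, then the
# percentage change between the two means.
base_gm = geomean(b for _, b, _ in X86_INT_PERF)
lto_gm = geomean(l for _, _, l in X86_INT_PERF)

print(f"average code size change: {avg_size_change:.3f}%")
print(f"geomean -O3: {base_gm:.0f}, LTO: {lto_gm:.0f}, "
      f"change: {percent_change(base_gm, lto_gm):.2f}%")
```

The same two helpers apply unchanged to the FP2000 and x86_64 tables;
only the data lists need to be swapped.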
LTO is quite promising. Actually it is in line with, or even better
than, the improvement obtained from other compilers (PathScale is the
most convenient compiler for checking LTO separately: there LTO gave up
to 5% improvement on SPECFP2000 and 3.5% on SPECInt2000, while making
the compiler about 50% slower and the generated code up to 30% bigger).
LTO in GCC actually results in a significant code size reduction, which
is quite different from PathScale. That is one of the rare cases I can
think of where a specific optimization actually works better in GCC than
in other optimizing compilers. So congratulations to everyone who worked
on LTO!
I think the biggest winners from LTO will be big C++ programs (eon shows
that). Additional optimizations (like devirtualization) could improve
those results even more. I think the next big thing would be using
subtarget-specialized functions.