http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44688
Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2011.01.19 14:49:39
     Ever Confirmed|0                           |1

--- Comment #2 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-01-19 14:49:39 UTC ---
Confirmed.  Leslie3d code-size almost doubled compared to 4.5 (and is even
worse compared to 4.4).

With -O3 -ffast-math -funroll-loops -fprefetch-loop-arrays

> ls -l
> benchspec/CPU2006/437.leslie3d/run/build_base_amd64-m64-gcc42-nn.0000/leslie3d
> -rwxrwxr-x 1 rguenther suse 572893 Jan 19 13:11
benchspec/CPU2006/437.leslie3d/run/build_base_amd64-m64-gcc42-nn.0000/leslie3d

With -O3 -ffast-math -funroll-loops

> ls -l
> benchspec/CPU2006/437.leslie3d/run/build_base_amd64-m64-gcc42-nn.0000/leslie3d
> -rwxrwxr-x 1 rguenther suse 368093 Jan 19 13:14
benchspec/CPU2006/437.leslie3d/run/build_base_amd64-m64-gcc42-nn.0000/leslie3d

so the regression is mostly due to prefetching being enabled at -O3 for AMD
archs.

prefetching + RTL loop unrolling: 638680
prefetching:                      356736
:                                 274088

There are several issues.

 1) prefetching doesn't re-use the epilogue loop created by the vectorizer
 2) the RTL loop unroller unrolls both epilogue loops
 3) for both epilogue loops we usually know an integer upper bound for the
    number of iterations, but we are not able to compute it

Also, the vectorizer checks use various different variables to test bounds
against, which doesn't even allow us to simplify the effective
niter == 0 || niter <= 6 style tests ... that obviously does not help
the situation.

On the tree level we see things like

<bb 7>: vectorizer check
  if (bnd.24_140 <= 1)
    goto <bb 12>; // unvectorized loop
  else
    goto <bb 8>;

<bb 8>: prefetcher check
  if (bnd.24_140 > 4)
    goto <bb 9>;
  else
    goto <bb 14>;

<bb 9>:

<bb 10>:
  # ivtmp.36_174 = PHI <0(9), ivtmp.36_197(10)>
  ivtmp.36_197 = ivtmp.36_174 + 4;
  if (...)
    goto <bb 10>;
  else
    goto <bb 14>;

<bb 14>:
  # ivtmp.36_176 = PHI <0(8), ivtmp.36_197(10)>

<bb 15>:
  # ivtmp.36_193 = PHI <ivtmp.36_176(14), ivtmp.36_192(15)>
  ivtmp.36_192 = ivtmp.36_193 + 1;
  if (bnd.24_140 <= ivtmp.36_192)
    goto <bb 11>;
  else
    goto <bb 15>;

and we should be able to derive that the epilogue loop runs at most 3 times.
On RTL this seems to be difficult, also because we changed IVs again to
pointers.

So the things to do are:

 1) preserve loop information across expand (and up to loop2_init)
 2) compute number of iteration information right before expand
 3) make IPA inlining integration be performed before tree loop optimizers
 4) preserve loop information starting with tree loop optimizers
 5) ...

In the end this regression shows up at -O3, an optimization flag setting that
is documented to possibly have this kind of effect.  P2.
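
For reference, the size figures quoted above can be turned into ratios with a
few lines of Python.  This is only an arithmetic sketch: the byte counts are
copied from the comment, the names are made up for illustration, and the
third figure of the second set has no label in the comment, so treating it as
the build with neither prefetching nor RTL unrolling is an assumption.

```python
# Executable sizes from the two "ls -l" listings in the comment.
FILE_SIZE_WITH_PREFETCH = 572893     # -O3 -ffast-math -funroll-loops -fprefetch-loop-arrays
FILE_SIZE_WITHOUT_PREFETCH = 368093  # -O3 -ffast-math -funroll-loops

# Second set of figures from the comment.
# ASSUMPTION: the unlabeled figure is the build with neither
# prefetching nor RTL loop unrolling; the comment does not say so.
SIZES = {
    "prefetching + RTL loop unrolling": 638680,
    "prefetching": 356736,
    "unlabeled (assumed baseline)": 274088,
}


def growth(new_size, old_size):
    """Return the size ratio new_size / old_size."""
    return new_size / old_size


ratio = growth(FILE_SIZE_WITH_PREFETCH, FILE_SIZE_WITHOUT_PREFETCH)
print(f"ls -l, with vs. without prefetching: {ratio:.2f}x")

baseline = SIZES["unlabeled (assumed baseline)"]
for label, size in SIZES.items():
    print(f"{label}: {growth(size, baseline):.2f}x of assumed baseline")
```

Under that assumption, prefetching alone adds roughly 30% and prefetching
plus RTL loop unrolling more than doubles the figure, while the two ls -l
listings differ by about 56%.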