https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61481
Richard Biener <rguenth at gcc dot gnu.org> changed: What |Removed |Added ---------------------------------------------------------------------------- Keywords| |missed-optimization Status|UNCONFIRMED |NEW Last reconfirmed| |2014-06-12 Ever confirmed|0 |1 Severity|normal |enhancement --- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> --- for(int i=0;i<N;i++) for(int j=0;j<N;j++) for(int k=0;k<N;k++) z[i*N+j]+=x[i*N+k]*y[k*N+j]; this one probably hits limits for loop unrolling. It's not easy to prove that fully unrolling to 4**3 statements will be profitable. A related testcase is calculix from SPEC CPU 2006, core loop extracted in testsuite/gfortran.dg/reassoc_4.f. Fully unrolling the loop nest of depth 5 would be profitable, but we don't compute that (I have several patches that try to make it so, but they are not effective). A #pragma force_unroll (or similar) would probably help in such situations (where you control the source, not for benchmarks of course). In this case the non-unrolled loops are vectorized before "final" unrolling is applied which obfuscates the code enough to make it not very well optimized. In particular the permutations are hard to get rid of. There are other cases where unrolling prevents vectorization and thus reduces speed. I think you are expecting too much from the compiler here ;) But yes, forcefully unrolling everything early would get you what you want. This is also sth for the vectorizer cost-model as I doubt the vectorization of the outer loop is profitable the way we do it (given the low number of iterations and the high setup cost): t.ii:10:8: note: Cost model analysis: Vector inside of loop cost: 68 Vector prologue cost: 16 Vector epilogue cost: 0 Scalar iteration cost: 44 Scalar outside cost: 0 Vector outside cost: 16 prologue iterations: 0 epilogue iterations: 0 Calculated minimum iters for profitability: 1 so it computes that, given that we execute 4 scalar iterations in parallel, this is all profitable. The outside cost is from vect_cst_.16_95 = { 2, 2, 2, 2 }; vect_cst_.19_81 = {pretmp_82, pretmp_82, pretmp_82, pretmp_82}; vect_cst_.22_74 = {pretmp_68, pretmp_68, pretmp_68, pretmp_68}; vect_cst_.25_53 = {pretmp_158, pretmp_158, pretmp_158, pretmp_158}; vect_cst_.28_18 = {pretmp_160, pretmp_160, pretmp_160, pretmp_160}; vect_cst_.31_2 = { 2, 2, 2, 2 }; vect_cst_.34_168 = {pretmp_88, pretmp_88, pretmp_88, pretmp_88}; vect_cst_.37_69 = {pretmp_47, pretmp_47, pretmp_47, pretmp_47}; vect_cst_.40_32 = {pretmp_54, pretmp_54, pretmp_54, pretmp_54}; vect_cst_.43_11 = {pretmp_151, pretmp_151, pretmp_151, pretmp_151}; vect_cst_.46_143 = { 2, 2, 2, 2 }; vect_cst_.49_55 = {pretmp_171, pretmp_171, pretmp_171, pretmp_171}; vect_cst_.52_83 = {pretmp_8, pretmp_8, pretmp_8, pretmp_8}; vect_cst_.55_66 = {pretmp_122, pretmp_122, pretmp_122, pretmp_122}; vect_cst_.58_217 = {pretmp_115, pretmp_115, pretmp_115, pretmp_115}; vect_cst_.61_220 = { 2, 2, 2, 2 }; that is, 16 splats for vector constants (yeah, some duplicates here). What the vector inside cost doesn't measure is IV maintainance, also the costs of the permutes vect_perm_even_57 = VEC_PERM_EXPR <vect__180.5_58, vect__180.6_21, { 0, 2, 4, 6 }>; vect_perm_odd_94 = VEC_PERM_EXPR <vect__180.5_58, vect__180.6_21, { 1, 3, 5, 7 }>; vect_perm_even_198 = VEC_PERM_EXPR <vect__180.7_109, vect__180.8_38, { 0, 2, 4, 6 }>; vect_perm_odd_212 = VEC_PERM_EXPR <vect__180.7_109, vect__180.8_38, { 1, 3, 5, 7 }>; vect_perm_even_222 = VEC_PERM_EXPR <vect_perm_even_57, vect_perm_even_198, { 0, 2, 4, 6 }>; vect_perm_odd_242 = VEC_PERM_EXPR <vect_perm_even_57, vect_perm_even_198, { 1, 3, 5, 7 }>; vect_perm_even_256 = VEC_PERM_EXPR <vect_perm_odd_94, vect_perm_odd_212, { 0, 2, 4, 6 }>; vect_perm_odd_266 = VEC_PERM_EXPR <vect_perm_odd_94, vect_perm_odd_212, { 1, 3, 5, 7 }>; ... vect_perm_even_195 = VEC_PERM_EXPR <vect__175.11_259, vect__175.12_225, { 0, 2, 4, 6 }>; vect_perm_odd_194 = VEC_PERM_EXPR <vect__175.11_259, vect__175.12_225, { 1, 3, 5, 7 }>; vect_perm_even_193 = VEC_PERM_EXPR <vect__175.13_207, vect__175.14_196, { 0, 2, 4, 6 }>; vect_perm_odd_192 = VEC_PERM_EXPR <vect__175.13_207, vect__175.14_196, { 1, 3, 5, 7 }>; vect_perm_even_191 = VEC_PERM_EXPR <vect_perm_even_195, vect_perm_even_193, { 0, 2, 4, 6 }>; vect_perm_odd_161 = VEC_PERM_EXPR <vect_perm_even_195, vect_perm_even_193, { 1, 3, 5, 7 }>; vect_perm_even_129 = VEC_PERM_EXPR <vect_perm_odd_194, vect_perm_odd_192, { 0, 2, 4, 6 }>; vect_perm_odd_107 = VEC_PERM_EXPR <vect_perm_odd_194, vect_perm_odd_192, { 1, 3, 5, 7 }>; are underestimated. Note that given we only access a subset of all computed values we have a hard time removing those computations that are not necessary from the vectorized code: MEM[(int *)&z] = vect_inter_high_239; MEM[(int *)&z + 16B] = vect_inter_low_240; MEM[(int *)&z + 32B] = vect_inter_high_250; MEM[(int *)&z + 48B] = vect_inter_low_251; _73 = z[0]; _79 = z[5]; ret_80 = _73 + _79; _85 = z[10]; ret_86 = ret_80 + _85; _91 = z[15]; ret_92 = ret_86 + _91; eventually FRE could look through z[15] -> MEM[(int *)&z + 48B] -> VEC_PERM_EXPR <vect_inter_low_236, vect_inter_low_238, { 2, 6, 3, 7 }> and avoid that last permutations, but FRE doesn't run after vectorization. We could also trick DCE to partially remove the dead vector stores by storing a vector extract to the relevant field. But there is nothing to clean that up after that (well, apart from RTL). You should adjust your code and make z an array for the diagnoal and only compute that.