[Bug tree-optimization/61481] Poor optimization of simple small-sized matrix routines with constant data

rguenth at gcc dot gnu.org Thu, 12 Jun 2014 02:17:28 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61481


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2014-06-12
     Ever confirmed|0                           |1
           Severity|normal                      |enhancement

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
    for(int i=0;i<N;i++) for(int j=0;j<N;j++) for(int k=0;k<N;k++)
z[i*N+j]+=x[i*N+k]*y[k*N+j];

this one probably hits limits for loop unrolling.  It's not easy to prove
that fully unrolling to 4**3 statements will be profitable.

A related testcase is calculix from SPEC CPU 2006, core loop extracted in
testsuite/gfortran.dg/reassoc_4.f.  Fully unrolling the loop nest of depth
5 would be profitable, but we don't compute that (I have several patches
that try to make it so, but they are not effective).

A #pragma force_unroll (or similar) would probably help in such situations
(where you control the source, not for benchmarks of course).

In this case the non-unrolled loops are vectorized before "final" unrolling
is applied which obfuscates the code enough to make it not very well optimized.
In particular the permutations are hard to get rid of.

There are other cases where unrolling prevents vectorization and thus
reduces speed.

I think you are expecting too much from the compiler here ;)  But yes,
forcefully unrolling everything early would get you what you want.

This is also sth for the vectorizer cost-model as I doubt the vectorization
of the outer loop is profitable the way we do it (given the low number of
iterations and the high setup cost):

t.ii:10:8: note: Cost model analysis:
  Vector inside of loop cost: 68
  Vector prologue cost: 16
  Vector epilogue cost: 0
  Scalar iteration cost: 44
  Scalar outside cost: 0
  Vector outside cost: 16
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1

so it computes that, given that we execute 4 scalar iterations in parallel,
this is all profitable.

The outside cost is from

  vect_cst_.16_95 = { 2, 2, 2, 2 };
  vect_cst_.19_81 = {pretmp_82, pretmp_82, pretmp_82, pretmp_82};
  vect_cst_.22_74 = {pretmp_68, pretmp_68, pretmp_68, pretmp_68};
  vect_cst_.25_53 = {pretmp_158, pretmp_158, pretmp_158, pretmp_158};
  vect_cst_.28_18 = {pretmp_160, pretmp_160, pretmp_160, pretmp_160};
  vect_cst_.31_2 = { 2, 2, 2, 2 };
  vect_cst_.34_168 = {pretmp_88, pretmp_88, pretmp_88, pretmp_88};
  vect_cst_.37_69 = {pretmp_47, pretmp_47, pretmp_47, pretmp_47};
  vect_cst_.40_32 = {pretmp_54, pretmp_54, pretmp_54, pretmp_54};
  vect_cst_.43_11 = {pretmp_151, pretmp_151, pretmp_151, pretmp_151};
  vect_cst_.46_143 = { 2, 2, 2, 2 };
  vect_cst_.49_55 = {pretmp_171, pretmp_171, pretmp_171, pretmp_171};
  vect_cst_.52_83 = {pretmp_8, pretmp_8, pretmp_8, pretmp_8};
  vect_cst_.55_66 = {pretmp_122, pretmp_122, pretmp_122, pretmp_122};
  vect_cst_.58_217 = {pretmp_115, pretmp_115, pretmp_115, pretmp_115};
  vect_cst_.61_220 = { 2, 2, 2, 2 };

that is, 16 splats for vector constants (yeah, some duplicates here).

What the vector inside cost doesn't measure is IV maintainance, also
the costs of the permutes

  vect_perm_even_57 = VEC_PERM_EXPR <vect__180.5_58, vect__180.6_21, { 0, 2, 4,
6 }>;
  vect_perm_odd_94 = VEC_PERM_EXPR <vect__180.5_58, vect__180.6_21, { 1, 3, 5,
7 }>;
  vect_perm_even_198 = VEC_PERM_EXPR <vect__180.7_109, vect__180.8_38, { 0, 2,
4, 6 }>;
  vect_perm_odd_212 = VEC_PERM_EXPR <vect__180.7_109, vect__180.8_38, { 1, 3,
5, 7 }>;
  vect_perm_even_222 = VEC_PERM_EXPR <vect_perm_even_57, vect_perm_even_198, {
0, 2, 4, 6 }>;
  vect_perm_odd_242 = VEC_PERM_EXPR <vect_perm_even_57, vect_perm_even_198, {
1, 3, 5, 7 }>;
  vect_perm_even_256 = VEC_PERM_EXPR <vect_perm_odd_94, vect_perm_odd_212, { 0,
2, 4, 6 }>;
  vect_perm_odd_266 = VEC_PERM_EXPR <vect_perm_odd_94, vect_perm_odd_212, { 1,
3, 5, 7 }>;
...
  vect_perm_even_195 = VEC_PERM_EXPR <vect__175.11_259, vect__175.12_225, { 0,
2, 4, 6 }>;
  vect_perm_odd_194 = VEC_PERM_EXPR <vect__175.11_259, vect__175.12_225, { 1,
3, 5, 7 }>;
  vect_perm_even_193 = VEC_PERM_EXPR <vect__175.13_207, vect__175.14_196, { 0,
2, 4, 6 }>;
  vect_perm_odd_192 = VEC_PERM_EXPR <vect__175.13_207, vect__175.14_196, { 1,
3, 5, 7 }>;
  vect_perm_even_191 = VEC_PERM_EXPR <vect_perm_even_195, vect_perm_even_193, {
0, 2, 4, 6 }>;
  vect_perm_odd_161 = VEC_PERM_EXPR <vect_perm_even_195, vect_perm_even_193, {
1, 3, 5, 7 }>;
  vect_perm_even_129 = VEC_PERM_EXPR <vect_perm_odd_194, vect_perm_odd_192, {
0, 2, 4, 6 }>;
  vect_perm_odd_107 = VEC_PERM_EXPR <vect_perm_odd_194, vect_perm_odd_192, { 1,
3, 5, 7 }>;

are underestimated.

Note that given we only access a subset of all computed values we have
a hard time removing those computations that are not necessary from the
vectorized code:

  MEM[(int *)&z] = vect_inter_high_239;
  MEM[(int *)&z + 16B] = vect_inter_low_240;
  MEM[(int *)&z + 32B] = vect_inter_high_250;
  MEM[(int *)&z + 48B] = vect_inter_low_251;
  _73 = z[0];
  _79 = z[5];
  ret_80 = _73 + _79;
  _85 = z[10];
  ret_86 = ret_80 + _85;
  _91 = z[15];
  ret_92 = ret_86 + _91;

eventually FRE could look through z[15] -> MEM[(int *)&z + 48B] ->
VEC_PERM_EXPR <vect_inter_low_236, vect_inter_low_238, { 2, 6, 3, 7 }>
and avoid that last permutations, but FRE doesn't run after vectorization.

We could also trick DCE to partially remove the dead vector stores
by storing a vector extract to the relevant field.  But there is nothing
to clean that up after that (well, apart from RTL).

You should adjust your code and make z an array for the diagnoal and only
compute that.

[Bug tree-optimization/61481] Poor optimization of simple small-sized matrix routines with constant data

Reply via email to