https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80232

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-03-28
             Blocks|                            |53947
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
So this is vectorization related.  Gathers are known to be notoriously slow,
but cost-wise they are not properly represented; in this case they are just
accounted for as an unaligned load...

It would be more appropriate to account them as VMAT_ELEMENTWISE I suppose
(N scalar loads plus gathering into a vector).  Thus:

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c       (revision 246500)
+++ gcc/tree-vect-stmts.c       (working copy)
@@ -929,7 +929,8 @@ vect_model_store_cost (stmt_vec_info stm

   tree vectype = STMT_VINFO_VECTYPE (stmt_info);
   /* Costs of the stores.  */
-  if (memory_access_type == VMAT_ELEMENTWISE)
+  if (memory_access_type == VMAT_ELEMENTWISE
+      || memory_access_type == VMAT_GATHER_SCATTER)
     /* N scalar stores plus extracting the elements.  */
     inside_cost += record_stmt_cost (body_cost_vec,
                                     ncopies * TYPE_VECTOR_SUBPARTS (vectype),
@@ -938,7 +939,8 @@ vect_model_store_cost (stmt_vec_info stm
     vect_get_store_cost (dr, ncopies, &inside_cost, body_cost_vec);

   if (memory_access_type == VMAT_ELEMENTWISE
-      || memory_access_type == VMAT_STRIDED_SLP)
+      || memory_access_type == VMAT_STRIDED_SLP
+      || memory_access_type == VMAT_GATHER_SCATTER)
     inside_cost += record_stmt_cost (body_cost_vec,
                                     ncopies * TYPE_VECTOR_SUBPARTS (vectype),
                                     vec_to_scalar, stmt_info, 0, vect_body);
@@ -1056,7 +1058,8 @@ vect_model_load_cost (stmt_vec_info stmt
     }

   /* The loads themselves.  */
-  if (memory_access_type == VMAT_ELEMENTWISE)
+  if (memory_access_type == VMAT_ELEMENTWISE
+      || memory_access_type == VMAT_GATHER_SCATTER)
     {
       /* N scalar loads plus gathering them into a vector.  */
       tree vectype = STMT_VINFO_VECTYPE (stmt_info);
@@ -1069,7 +1072,8 @@ vect_model_load_cost (stmt_vec_info stmt
                        &inside_cost, &prologue_cost, 
                        prologue_cost_vec, body_cost_vec, true);
   if (memory_access_type == VMAT_ELEMENTWISE
-      || memory_access_type == VMAT_STRIDED_SLP)
+      || memory_access_type == VMAT_STRIDED_SLP
+      || memory_access_type == VMAT_GATHER_SCATTER)
     inside_cost += record_stmt_cost (body_cost_vec, ncopies, vec_construct,
                                     stmt_info, 0, vect_body);
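In effect the patched code accounts a gathered load as N scalar loads plus
the cost of assembling them into a vector, instead of a single unaligned
vector load.  A rough sketch of the accounting (hypothetical helper, not the
real GCC API; the per-statement unit costs are target-defined and made up
here):

```c
/* Hypothetical sketch: inside-loop cost of a VMAT_GATHER_SCATTER load
   under the patch above.  nunits stands in for
   TYPE_VECTOR_SUBPARTS (vectype); the two unit costs would really come
   from the target's cost hooks.  */
unsigned
gather_load_inside_cost (unsigned ncopies, unsigned nunits,
                         unsigned scalar_load_cost,
                         unsigned vec_construct_cost)
{
  /* N scalar loads ...  */
  unsigned cost = ncopies * nunits * scalar_load_cost;
  /* ... plus gathering them into a vector (vec_construct).  */
  cost += ncopies * vec_construct_cost;
  return cost;
}
```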


This changes the cost model for Haswell from

SparseCompRow.c:37:17: note: Cost model analysis:
  Vector inside of loop cost: 13
  Vector prologue cost: 25
  Vector epilogue cost: 26
  Scalar iteration cost: 5
  Scalar outside cost: 7
  Vector outside cost: 51
  prologue iterations: 4
  epilogue iterations: 4
  Calculated minimum iters for profitability: 10

to

SparseCompRow.c:37:17: note: Cost model analysis:
  Vector inside of loop cost: 23
  Vector prologue cost: 25
  Vector epilogue cost: 26
  Scalar iteration cost: 5
  Scalar outside cost: 7
  Vector outside cost: 51
  prologue iterations: 4
  epilogue iterations: 4
  Calculated minimum iters for profitability: 10
SparseCompRow.c:37:17: note:   Runtime profitability threshold = 9
SparseCompRow.c:37:17: note:   Static estimate profitability threshold = 15

so no change in overall profitability...

We seem to peel for alignment, which makes the runtime cost check quite
expensive:

  <bb 6> [12.75%]:
  _114 = (unsigned int) rowRp1_34;
  _112 = (unsigned int) rowR_33;
  niters.6_54 = _114 - _112;
  _92 = (long unsigned int) rowR_33;
  _91 = _92 * 4;
  vectp.7_93 = col_38(D) + _91;
  _89 = (unsigned long) vectp.7_93;
  _88 = _89 >> 2;
  _87 = -_88;
  _86 = (unsigned int) _87;
  prolog_loop_niters.8_90 = _86 & 7;
  _44 = (unsigned int) rowRp1_34;
  _43 = (unsigned int) rowR_33;
  _27 = _44 - _43;
  _26 = _27 + 4294967295;
  _25 = prolog_loop_niters.8_90 + 7;
  _24 = MAX_EXPR <_25, 8>;
  if (_26 < _24)
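Rendered as C, the prologue computation and the combined check above amount
to roughly the following (a hedged sketch; the comments map back to the SSA
names in the dump):

```c
#include <stdint.h>

/* Hedged C rendering of the GIMPLE above: the number of 4-byte elements
   to peel so that &col[rowR] becomes 32-byte aligned (VF 8), and the
   combined prologue/profitability check.  */
static unsigned
prolog_niters (const int *col, unsigned rowR)
{
  uintptr_t addr = (uintptr_t) (col + rowR);   /* vectp.7_93 */
  return (unsigned) (-(addr >> 2)) & 7;        /* prolog_loop_niters.8_90 */
}

static int
take_scalar_path (unsigned niters, unsigned peel)
{
  unsigned max_niter = peel + 7;               /* _25 */
  if (max_niter < 8)
    max_niter = 8;                             /* MAX_EXPR <_25, 8> */
  return niters - 1 < max_niter;               /* if (_26 < _24) */
}
```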

given that rowRp1 - rowR is 5 for the small case and 10 for the large one,
runtime profitability is not reached for the small case and is borderline
for the large one...  Also the col[] setup is such that the accesses to
x are contiguous, which means gather is overkill here.  Of course we have
no way to vectorize it otherwise (we don't "open-code" gather).
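For context, the loop in question is the SparseCompRow kernel (from
SciMark2); a sketch of it, with names matching the SSA names in the dump
above:

```c
/* Sketch of the SparseCompRow sparse matrix-vector multiply loop being
   vectorized; x[col[i]] is the indexed load the vectorizer turns into a
   gather.  As noted above, col[] is set up so that within one row these
   accesses are actually contiguous.  */
void
sparse_matmult (int M, double *y, const double *val,
                const int *row, const int *col, const double *x)
{
  for (int r = 0; r < M; r++)
    {
      double sum = 0.0;
      int rowR = row[r], rowRp1 = row[r + 1];
      for (int i = rowR; i < rowRp1; i++)
        sum += x[col[i]] * val[i];
      y[r] = sum;
    }
}
```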

My first suggestion would be to split the profitability check from the
prologue niter computation.  And of course fix the cost computation as
suggested above.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
