https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80232
Richard Biener <rguenth at gcc dot gnu.org> changed:
               What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-03-28
             Blocks|                            |53947
     Ever confirmed|0                           |1
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
So this is vectorization related. Gathers are known to be notoriously slow, but
cost-wise they are not properly represented; in this case they are just
accounted for as an unaligned load...
It would be more appropriate to account for them as VMAT_ELEMENTWISE, I suppose
(N scalar loads plus gathering the elements into a vector). Thus:
Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	(revision 246500)
+++ gcc/tree-vect-stmts.c	(working copy)
@@ -929,7 +929,8 @@ vect_model_store_cost (stmt_vec_info stm
   tree vectype = STMT_VINFO_VECTYPE (stmt_info);
   /* Costs of the stores. */
-  if (memory_access_type == VMAT_ELEMENTWISE)
+  if (memory_access_type == VMAT_ELEMENTWISE
+      || memory_access_type == VMAT_GATHER_SCATTER)
     /* N scalar stores plus extracting the elements. */
     inside_cost += record_stmt_cost (body_cost_vec,
                                      ncopies * TYPE_VECTOR_SUBPARTS (vectype),
@@ -938,7 +939,8 @@ vect_model_store_cost (stmt_vec_info stm
     vect_get_store_cost (dr, ncopies, &inside_cost, body_cost_vec);
   if (memory_access_type == VMAT_ELEMENTWISE
-      || memory_access_type == VMAT_STRIDED_SLP)
+      || memory_access_type == VMAT_STRIDED_SLP
+      || memory_access_type == VMAT_GATHER_SCATTER)
     inside_cost += record_stmt_cost (body_cost_vec,
                                      ncopies * TYPE_VECTOR_SUBPARTS (vectype),
                                      vec_to_scalar, stmt_info, 0, vect_body);
@@ -1056,7 +1058,8 @@ vect_model_load_cost (stmt_vec_info stmt
     }
   /* The loads themselves. */
-  if (memory_access_type == VMAT_ELEMENTWISE)
+  if (memory_access_type == VMAT_ELEMENTWISE
+      || memory_access_type == VMAT_GATHER_SCATTER)
     {
       /* N scalar loads plus gathering them into a vector. */
       tree vectype = STMT_VINFO_VECTYPE (stmt_info);
@@ -1069,7 +1072,8 @@ vect_model_load_cost (stmt_vec_info stmt
                          &inside_cost, &prologue_cost,
                          prologue_cost_vec, body_cost_vec, true);
   if (memory_access_type == VMAT_ELEMENTWISE
-      || memory_access_type == VMAT_STRIDED_SLP)
+      || memory_access_type == VMAT_STRIDED_SLP
+      || memory_access_type == VMAT_GATHER_SCATTER)
     inside_cost += record_stmt_cost (body_cost_vec, ncopies, vec_construct,
                                      stmt_info, 0, vect_body);
This changes the cost model for haswell from
SparseCompRow.c:37:17: note: Cost model analysis:
Vector inside of loop cost: 13
Vector prologue cost: 25
Vector epilogue cost: 26
Scalar iteration cost: 5
Scalar outside cost: 7
Vector outside cost: 51
prologue iterations: 4
epilogue iterations: 4
Calculated minimum iters for profitability: 10
to
SparseCompRow.c:37:17: note: Cost model analysis:
Vector inside of loop cost: 23
Vector prologue cost: 25
Vector epilogue cost: 26
Scalar iteration cost: 5
Scalar outside cost: 7
Vector outside cost: 51
prologue iterations: 4
epilogue iterations: 4
Calculated minimum iters for profitability: 10
SparseCompRow.c:37:17: note: Runtime profitability threshold = 9
SparseCompRow.c:37:17: note: Static estimate profitability threshold = 15
so there is no change in overall profitability...
We seem to peel for alignment, which makes the runtime cost check quite
expensive:
<bb 6> [12.75%]:
_114 = (unsigned int) rowRp1_34;
_112 = (unsigned int) rowR_33;
niters.6_54 = _114 - _112;
_92 = (long unsigned int) rowR_33;
_91 = _92 * 4;
vectp.7_93 = col_38(D) + _91;
_89 = (unsigned long) vectp.7_93;
_88 = _89 >> 2;
_87 = -_88;
_86 = (unsigned int) _87;
prolog_loop_niters.8_90 = _86 & 7;
_44 = (unsigned int) rowRp1_34;
_43 = (unsigned int) rowR_33;
_27 = _44 - _43;
_26 = _27 + 4294967295;
_25 = prolog_loop_niters.8_90 + 7;
_24 = MAX_EXPR <_25, 8>;
if (_26 < _24)
Given that rowRp1 - rowR is 5 for the small case and 10 for the large one,
runtime profitability is not given for the small case and is on the border for
the large one... Also, the col[] setup is such that the accesses to
x are contiguous, which means gather is overkill here. Of course we have
no way to vectorize it otherwise (we don't "open-code" gather).
My first suggestion would be to split the profitability check from the
prologue niter computation. And of course to fix the cost computation as
suggested above.
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations