https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80232
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-03-28
             Blocks|                            |53947
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
So this is vectorization related.  Gathers are known to be notoriously slow,
but cost-wise they are not properly represented; in this case they are just
accounted for as an unaligned load...

It would be more appropriate to cost them like VMAT_ELEMENTWISE, I suppose
(N scalar loads plus gathering them into a vector).  Thus:

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c       (revision 246500)
+++ gcc/tree-vect-stmts.c       (working copy)
@@ -929,7 +929,8 @@ vect_model_store_cost (stmt_vec_info stm
   tree vectype = STMT_VINFO_VECTYPE (stmt_info);
 
   /* Costs of the stores.  */
-  if (memory_access_type == VMAT_ELEMENTWISE)
+  if (memory_access_type == VMAT_ELEMENTWISE
+      || memory_access_type == VMAT_GATHER_SCATTER)
     /* N scalar stores plus extracting the elements.  */
     inside_cost += record_stmt_cost (body_cost_vec,
                                      ncopies * TYPE_VECTOR_SUBPARTS (vectype),
@@ -938,7 +939,8 @@ vect_model_store_cost (stmt_vec_info stm
     vect_get_store_cost (dr, ncopies, &inside_cost, body_cost_vec);
 
   if (memory_access_type == VMAT_ELEMENTWISE
-      || memory_access_type == VMAT_STRIDED_SLP)
+      || memory_access_type == VMAT_STRIDED_SLP
+      || memory_access_type == VMAT_GATHER_SCATTER)
     inside_cost += record_stmt_cost (body_cost_vec,
                                      ncopies * TYPE_VECTOR_SUBPARTS (vectype),
                                      vec_to_scalar, stmt_info, 0, vect_body);
@@ -1056,7 +1058,8 @@ vect_model_load_cost (stmt_vec_info stmt
     }
 
   /* The loads themselves.  */
-  if (memory_access_type == VMAT_ELEMENTWISE)
+  if (memory_access_type == VMAT_ELEMENTWISE
+      || memory_access_type == VMAT_GATHER_SCATTER)
     {
       /* N scalar loads plus gathering them into a vector.  */
       tree vectype = STMT_VINFO_VECTYPE (stmt_info);
@@ -1069,7 +1072,8 @@ vect_model_load_cost (stmt_vec_info stmt
                         &inside_cost, &prologue_cost,
                         prologue_cost_vec, body_cost_vec, true);
 
   if (memory_access_type == VMAT_ELEMENTWISE
-      || memory_access_type == VMAT_STRIDED_SLP)
+      || memory_access_type == VMAT_STRIDED_SLP
+      || memory_access_type == VMAT_GATHER_SCATTER)
     inside_cost += record_stmt_cost (body_cost_vec, ncopies, vec_construct,
                                      stmt_info, 0, vect_body);

This changes the cost model for Haswell from

SparseCompRow.c:37:17: note: Cost model analysis:
  Vector inside of loop cost: 13
  Vector prologue cost: 25
  Vector epilogue cost: 26
  Scalar iteration cost: 5
  Scalar outside cost: 7
  Vector outside cost: 51
  prologue iterations: 4
  epilogue iterations: 4
  Calculated minimum iters for profitability: 10

to

SparseCompRow.c:37:17: note: Cost model analysis:
  Vector inside of loop cost: 23
  Vector prologue cost: 25
  Vector epilogue cost: 26
  Scalar iteration cost: 5
  Scalar outside cost: 7
  Vector outside cost: 51
  prologue iterations: 4
  epilogue iterations: 4
  Calculated minimum iters for profitability: 10
SparseCompRow.c:37:17: note:   Runtime profitability threshold = 9
SparseCompRow.c:37:17: note:   Static estimate profitability threshold = 15

so there is no change in overall profitability...
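For context, the loop the cost dump refers to is an indexed-load kernel of
roughly the following shape (a hypothetical reconstruction, presumably
SciMark2's sparse matrix-vector product; the report does not quote the
source):

  /* Hypothetical reconstruction of the kernel around SparseCompRow.c:37
     (presumably SciMark2's sparse row times dense vector product); the
     report does not quote the source, so take this as an illustration.
     The indexed load x[col[i]] is the access that is implemented as a
     gather when the loop is vectorized.  */
  static double
  sparse_row_dot (const double *val, const int *col, const double *x,
                  int rowR, int rowRp1)
  {
    double sum = 0.0;
    for (int i = rowR; i < rowRp1; i++)
      sum += val[i] * x[col[i]];
    return sum;
  }

Its trip count, rowRp1 - rowR, is what the runtime checks below end up
comparing against.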
We seem to peel for alignment, which makes the runtime cost check quite
expensive:

  <bb 6> [12.75%]:
  _114 = (unsigned int) rowRp1_34;
  _112 = (unsigned int) rowR_33;
  niters.6_54 = _114 - _112;
  _92 = (long unsigned int) rowR_33;
  _91 = _92 * 4;
  vectp.7_93 = col_38(D) + _91;
  _89 = (unsigned long) vectp.7_93;
  _88 = _89 >> 2;
  _87 = -_88;
  _86 = (unsigned int) _87;
  prolog_loop_niters.8_90 = _86 & 7;
  _44 = (unsigned int) rowRp1_34;
  _43 = (unsigned int) rowR_33;
  _27 = _44 - _43;
  _26 = _27 + 4294967295;
  _25 = prolog_loop_niters.8_90 + 7;
  _24 = MAX_EXPR <_25, 8>;
  if (_26 < _24)

Given that rowRp1 - rowR is 5 for the small case and 10 for the large one,
runtime profitability is not given for the small case and is on the border
for the large one (the vector path requires niters - 1 >=
MAX_EXPR <prolog_loop_niters.8 + 7, 8>).  Also, the col[] setup is such that
the accesses to x are contiguous, which means a gather is overkill here.  Of
course we have no way to vectorize it otherwise (we don't "open-code"
gather).

My first suggestion would be to split the profitability check from the
prologue niter computation.  And of course fix the cost computation as
suggested above.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
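As an aside on the runtime check in <bb 6> above, here is a minimal C sketch
of the arithmetic it performs.  It is illustrative only; the names follow the
GIMPLE dump, and the constants 7 and 8 correspond to the vectorization factor
of 8 ints per vector:

  #include <stdint.h>

  /* Illustrative only: mirrors the runtime guard in <bb 6> above.  */
  static int
  scalar_fallback_p (const int32_t *col, unsigned int rowR,
                     unsigned int rowRp1)
  {
    unsigned int niters = rowRp1 - rowR;
    /* prolog_loop_niters.8_90: number of elements to peel so that
       &col[rowR] becomes 32-byte aligned.  */
    unsigned int prolog
      = (unsigned int) (-((uintptr_t) (col + rowR) >> 2)) & 7;
    unsigned int threshold = prolog + 7 > 8 ? prolog + 7 : 8;  /* MAX_EXPR */
    /* niters == 5: 4 < 8, so the scalar loop is always taken.
       niters == 10: 9 < threshold only if prolog >= 3, so whether the
       vector path runs depends on the alignment of &col[rowR].  */
    return niters - 1 < threshold;
  }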