https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64844
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #2)
> t.c:8:5: note: === vect_update_slp_costs_according_to_vf ===
> t.c:8:5: note: cost model: the vector iteration cost = 26 divided by the
> scalar iteration cost = 10 is greater or equal to the vectorization factor =
> 2.
> t.c:8:5: note: not vectorized: vectorization not profitable.
> t.c:8:5: note: not vectorized: vector version will never be profitable.
> t.c:8:5: note: bad operation or unsupported loop bound.
>
>
> A cost model issue with cortex-a57.  The cost model changed in GCC 5 for
> cortex-a57.  I think this is due to unaligned loads.

But we (should) have known misalignment here and thus peeling for alignment
should be able to arrange for aligned vectors.  Iff aarch64 can align
the stack properly.  If not then the first loop should behave the same
as the 2nd...

Right:

t.c:7:5: note: vect_model_load_cost: aligned.
t.c:7:5: note: vect_get_data_access_cost: inside_cost = 5, outside_cost = 0.
t.c:7:5: note: vect_model_load_cost: aligned.
t.c:7:5: note: vect_get_data_access_cost: inside_cost = 10, outside_cost = 0.
t.c:7:5: note: Try peeling by 1
t.c:7:5: note: Alignment of access forced using peeling.
t.c:7:5: note: Peeling for alignment will be applied.

but the costs are odd.  For the first (aligned) loop we get

t.c:7:5: note: Cost model analysis:
  Vector inside of loop cost: 16
  Vector prologue cost: 8
  Vector epilogue cost: 11
  Scalar iteration cost: 10
  Scalar outside cost: 0
  Vector outside cost: 19
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 10

while for the 2nd:

t.c:7:5: note: Cost model analysis:
  Vector inside of loop cost: 26
  Vector prologue cost: 18
  Vector epilogue cost: 21
  Scalar iteration cost: 10
  Scalar outside cost: 0
  Vector outside cost: 39
  prologue iterations: 1
  epilogue iterations: 1

while the vector inside of loop cost should be the same.
The issue is that both vect_enhance_data_refs_alignment at analysis time
and vectorizable_load at transform time account for the cost via the
add_stmt_cost hook.  With that fixed we get

t.c:7:5: note: Cost model analysis:
  Vector inside of loop cost: 16
  Vector prologue cost: 18
  Vector epilogue cost: 21
  Scalar iteration cost: 10
  Scalar outside cost: 0
  Vector outside cost: 39
  prologue iterations: 1
  epilogue iterations: 1
  Calculated minimum iters for profitability: 12

which is more reasonable and vectorizes both loops as expected.
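
For readers who want to see how the dumped costs turn into the profitability
verdict, here is a minimal sketch in C.  It is a simplified model based on my
reading of vect_estimate_min_profitable_iters in tree-vect-loop.c of that
era, not a copy of the GCC source, so treat the formula as an assumption.
The struct vect_costs and the function min_profitable_iters are hypothetical
names introduced only for this illustration; the fields mirror the lines
printed under "Cost model analysis", plus the vectorization factor.

```c
#include <stdbool.h>

/* Hypothetical container for the values printed in the vect dump. */
struct vect_costs {
    int vec_inside_cost;      /* Vector inside of loop cost */
    int vec_outside_cost;     /* Vector outside cost */
    int scalar_iter_cost;     /* Scalar iteration cost */
    int scalar_outside_cost;  /* Scalar outside cost */
    int prologue_iters;       /* prologue iterations */
    int epilogue_iters;       /* epilogue iterations */
    int vf;                   /* vectorization factor */
};

/* Assumed model: returns false when the vector version can never win,
   i.e. when one vector iteration costs at least as much as the vf
   scalar iterations it replaces.  Otherwise stores the minimum
   iteration count in *out and returns true. */
static bool min_profitable_iters(const struct vect_costs *c, int *out)
{
    int scalar_per_vec_iter = c->scalar_iter_cost * c->vf;

    if (scalar_per_vec_iter <= c->vec_inside_cost)
        return false;  /* "vector version will never be profitable" */

    if (c->vec_outside_cost <= 0) {
        *out = 1;
        return true;
    }

    int outside_delta =
        (c->vec_outside_cost - c->scalar_outside_cost) * c->vf;
    int n = (outside_delta
             - c->vec_inside_cost * c->prologue_iters
             - c->vec_inside_cost * c->epilogue_iters)
            / (scalar_per_vec_iter - c->vec_inside_cost);

    /* Integer division truncates; bump once if the vector version is
       still not strictly cheaper at n iterations. */
    if (scalar_per_vec_iter * n <= c->vec_inside_cost * n + outside_delta)
        n++;

    *out = n;
    return true;
}
```

Under this model, with a vectorization factor of 2, the first dump
(inside 16, outside 19, no peeling) gives 10, the double-counted dump
(inside 26) is rejected because 26 >= 10 * 2, and the fixed dump
(inside 16, outside 39, one prologue and one epilogue iteration) gives 12,
which agrees with the figures quoted above.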