https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116562
Bug ID: 116562
Summary: wrong cost of gather load preventing loop from vectored
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: kugan at gcc dot gnu.org
Target Milestone: ---

typedef int real_t;

extern __attribute__((aligned(64))) real_t a[32000],b[32000],c[32000],d[32000];
void s4117()
{
  for (int i = 0; i < 32000; i++) {
    a[i] = b[i] + c[i/2] * d[i];
  }
}

is not vectorized for AdvSIMD due to a wrong cost calculation.

Compiler options used:
cc1plus -Ofast -fdump-tree-vect-all -mcpu=neoverse-v2 --param=aarch64-autovec-preference=1

tt.c:6:21: note:  Cost model analysis:
  Vector inside of loop cost: 64
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar iteration cost: 15
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
tt.c:6:21: missed:  cost model: the vector iteration cost = 64 divided by the scalar iteration cost = 15 is greater or equal to the vectorization factor = 4.
tt.c:6:21: missed:  not vectorized: vectorization not profitable.
tt.c:6:21: missed:  not vectorized: vector version will never be profitable.
tt.c:6:21: missed:  Loop costings may not be worthwhile.
tt.c:6:21: note:  ***** Analysis failed with vector mode V4SI

We cost c[i/2] as 4 loads and one construct. Should we special-case this sort of gather load, which has a lower cost in practice?

if (costing_p)
  {
    /* For emulated gathers N offset vector element
       offset add is consumed by the load).  */
    inside_cost = record_stmt_cost (cost_vec, const_nunits,
                                    vec_to_scalar, stmt_info,
                                    0, vect_body);
    /* N scalar loads plus gathering them into a
       vector.  */
    inside_cost
      = record_stmt_cost (cost_vec, const_nunits, scalar_load,
                          stmt_info, 0, vect_body);
    inside_cost
      = record_stmt_cost (cost_vec, 1, vec_construct,
                          stmt_info, 0, vect_body);
    continue;
  }
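
To make the numbers in the dump concrete, here is a minimal C sketch. It is not GCC code: every function name below is hypothetical, and the profitability test only paraphrases the dump message (the vector body is rejected when its cost is at least the scalar cost of VF iterations). The second half rests on an assumption the report does not spell out, namely that with VF = 4 the lanes of c[i/2] read only two adjacent elements, each used twice, so the access could be served by one contiguous load plus a duplicating permute instead of four scalar loads and a construct.

```c
#include <stdbool.h>
#include <stddef.h>

/* Paraphrase of the check reported in the dump (a sketch, not GCC's code):
   reject the vector body when it costs at least as much as running the
   scalar body for the same number of iterations.  */
bool
vector_body_never_profitable (int vec_inside_cost, int scalar_iter_cost,
                              int vf)
{
  return vec_inside_cost >= scalar_iter_cost * vf;
}

/* Reference: the c[i/2] operand of the loop, one scalar load per lane.  */
void
gather_scalar (int *out, const int *c, size_t n)
{
  for (size_t i = 0; i < n; i++)
    out[i] = c[i / 2];
}

/* Same values per block of 4 lanes (VF = 4): two adjacent elements are
   read once and each is duplicated into two neighbouring lanes.
   Assumes n is a multiple of 4.  */
void
gather_load_and_duplicate (int *out, const int *c, size_t n)
{
  for (size_t i = 0; i < n; i += 4)
    {
      int lo = c[i / 2];      /* feeds lanes i and i + 1 */
      int hi = c[i / 2 + 1];  /* feeds lanes i + 2 and i + 3 */
      out[i] = lo;
      out[i + 1] = lo;
      out[i + 2] = hi;
      out[i + 3] = hi;
    }
}

/* Hypothetical helper: compare both versions on the first n lanes.
   n must be a multiple of 4 and at most 64.  */
bool
gather_versions_agree (size_t n)
{
  int c[32], a[64], b[64];

  if (n > 64 || n % 4 != 0)
    return false;
  for (size_t i = 0; i < 32; i++)
    c[i] = (int) (3 * i + 1);
  gather_scalar (a, c, n);
  gather_load_and_duplicate (b, c, n);
  for (size_t i = 0; i < n; i++)
    if (a[i] != b[i])
      return false;
  return true;
}
```

With the values from the dump, vector_body_never_profitable (64, 15, 4) is true because 64 >= 15 * 4 = 60, so the body would need to cost less than 60 for the loop to be considered. Whether a load-plus-permute costing gets under that threshold depends on the target cost tables, which this sketch does not model.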