https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125174
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
AVX2

shell2.fppized.f90:971:24: note: Cost model analysis:
  Vector inside of loop cost: 1084
  Vector prologue cost: 56
  Vector epilogue cost: 736
  Scalar iteration cost: 360
  Scalar outside cost: 8
  Vector outside cost: 792
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 3

vs. SSE2

shell2.fppized.f90:971:24: note: Cost model analysis:
  Vector inside of loop cost: 460
  Vector prologue cost: 56
  Vector epilogue cost: 376
  Scalar iteration cost: 360
  Scalar outside cost: 8
  Vector outside cost: 432
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 2

the tipping point is mainly that we use elementwise loads:

MEM[(real(kind=8) *)_149] 1 times scalar_load costs 12 in body
MEM[(real(kind=8) *)_149] 1 times scalar_load costs 12 in body
MEM[(real(kind=8) *)_149] 1 times vec_construct costs 12 in body

vs.

MEM[(real(kind=8) *)_149] 1 times scalar_load costs 12 in body
MEM[(real(kind=8) *)_149] 1 times scalar_load costs 12 in body
MEM[(real(kind=8) *)_149] 1 times scalar_load costs 12 in body
MEM[(real(kind=8) *)_149] 1 times scalar_load costs 12 in body
MEM[(real(kind=8) *)_149] 1 times vec_construct costs 60 in body

plus the complex loads are "nice" for SSE:

REALPART_EXPR <MEM[(complex(kind=8) *)_144]> 1 times vec_perm costs 4 in body
REALPART_EXPR <MEM[(complex(kind=8) *)_144]> 2 times unaligned_load (misalign -1) costs 24 in body

but again pairwise element for AVX:

REALPART_EXPR <MEM[(complex(kind=8) *)_144]> 1 times vec_construct costs 60 in body
REALPART_EXPR <MEM[(complex(kind=8) *)_144]> 1 times vec_construct costs 60 in body
Detected avx256 cross-lane permutation: REALPART_EXPR <MEM[(complex(kind=8) *)_144]>
REALPART_EXPR <MEM[(complex(kind=8) *)_144]> 1 times vec_perm costs 4 in body
REALPART_EXPR <MEM[(complex(kind=8) *)_144]> 4 times unaligned_load (misalign -1) costs 48 in body

where the construct cost looks over-costed (it doesn't currently know that it
constructs from two V2DF elements rather than four DF elements).

That we do not cost the relatively expensive SSE/AVX sin/cos SIMD calls (we
cost them only as simple scalar_stmt on the scalar costing side) is probably
not the most important issue, but if we'd cost a scalar, SSE and AVX sin()
call all the same but, say, with cost 200, that would tip costs in favor of
AVX.
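
The arithmetic behind that last remark can be checked with a rough sketch. It is not how GCC compares vector modes internally. It only divides the "inside of loop" costs quoted above by an assumed vectorization factor (2 doubles per SSE2 vector, 4 per AVX2 vector; the dumps do not state the factor) and adds a hypothetical flat call cost once per loop body, on top of the existing numbers. Prologue and epilogue costs are ignored, and the helper names are illustrative only.

```python
# Loop-body costs copied from the two cost-model dumps above.
# The vectorization factors (vf) are ASSUMPTIONS, not taken from the dumps:
# 2 doubles per SSE2 vector, 4 doubles per AVX2 vector.
COSTS = {
    "scalar": {"body": 360, "vf": 1},
    "SSE2": {"body": 460, "vf": 2},
    "AVX2": {"body": 1084, "vf": 4},
}


def per_iteration_cost(body, vf, extra_call_cost=0):
    """Body cost per scalar iteration, with a flat call cost added once per body."""
    return (body + extra_call_cost) / vf


def ranking(extra_call_cost=0):
    """Return variant names sorted cheapest first, plus the per-iteration costs."""
    costs = {
        name: per_iteration_cost(c["body"], c["vf"], extra_call_cost)
        for name, c in COSTS.items()
    }
    return sorted(costs, key=costs.get), costs


# Compare the current costing (no extra call cost) with the hypothetical
# "every sin() call costs 200" scenario from the comment.
for extra in (0, 200):
    order, costs = ranking(extra)
    print(f"extra call cost {extra}: cheapest first {order}, costs {costs}")
```

Under these assumptions SSE2 has the cheapest body per scalar iteration with the current numbers, and AVX2 becomes cheapest once the same 200 is added to every variant, because the wider vector spreads the fixed call cost over more lanes.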
