https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125174

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
AVX2

shell2.fppized.f90:971:24: note:  Cost model analysis: 
  Vector inside of loop cost: 1084
  Vector prologue cost: 56 
  Vector epilogue cost: 736
  Scalar iteration cost: 360
  Scalar outside cost: 8
  Vector outside cost: 792 
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 3

vs. SSE2

shell2.fppized.f90:971:24: note:  Cost model analysis: 
  Vector inside of loop cost: 460
  Vector prologue cost: 56 
  Vector epilogue cost: 376
  Scalar iteration cost: 360
  Scalar outside cost: 8
  Vector outside cost: 432 
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 2
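
As a rough sanity check on the two dumps, the loop body cost can be normalized per scalar iteration. The sketch below assumes a vectorization factor of 4 for AVX2 and 2 for SSE2 (real(kind=8) elements in 256-bit and 128-bit vectors). Those factors are not printed in the dumps, and this is only an illustration, not the vectorizer's actual comparison.

```python
# "inside" values are the "Vector inside of loop cost" lines from the dumps.
# The "vf" values are an assumption: 4 doubles per 256-bit AVX2 vector,
# 2 doubles per 128-bit SSE2 vector.
dumps = {
    "AVX2": {"inside": 1084, "vf": 4},
    "SSE2": {"inside": 460, "vf": 2},
}
SCALAR_ITER_COST = 360  # "Scalar iteration cost" in both dumps


def body_cost_per_scalar_iter(d):
    """Vector loop body cost divided by the scalar iterations it covers."""
    return d["inside"] / d["vf"]


per_iter = {name: body_cost_per_scalar_iter(d) for name, d in dumps.items()}
for name, cost in per_iter.items():
    print(f"{name}: {cost:.0f} per scalar iteration (scalar: {SCALAR_ITER_COST})")
# AVX2: 271 per scalar iteration (scalar: 360)
# SSE2: 230 per scalar iteration (scalar: 360)
```

Under these assumptions both variants beat the scalar loop, but the SSE2 body is cheaper per scalar iteration than the AVX2 body.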

the tipping point is mainly that we use elementwise loads, for SSE:

MEM[(real(kind=8) *)_149] 1 times scalar_load costs 12 in body
MEM[(real(kind=8) *)_149] 1 times scalar_load costs 12 in body
MEM[(real(kind=8) *)_149] 1 times vec_construct costs 12 in body

vs. AVX:

MEM[(real(kind=8) *)_149] 1 times scalar_load costs 12 in body
MEM[(real(kind=8) *)_149] 1 times scalar_load costs 12 in body
MEM[(real(kind=8) *)_149] 1 times scalar_load costs 12 in body
MEM[(real(kind=8) *)_149] 1 times scalar_load costs 12 in body
MEM[(real(kind=8) *)_149] 1 times vec_construct costs 60 in body

plus the complex loads are "nice" for SSE:

REALPART_EXPR <MEM[(complex(kind=8) *)_144]> 1 times vec_perm costs 4 in body
REALPART_EXPR <MEM[(complex(kind=8) *)_144]> 2 times unaligned_load (misalign -1) costs 24 in body

but again pairwise element for AVX:

REALPART_EXPR <MEM[(complex(kind=8) *)_144]> 1 times vec_construct costs 60 in body
REALPART_EXPR <MEM[(complex(kind=8) *)_144]> 1 times vec_construct costs 60 in body
Detected avx256 cross-lane permutation: REALPART_EXPR <MEM[(complex(kind=8) *)_144]>
REALPART_EXPR <MEM[(complex(kind=8) *)_144]> 1 times vec_perm costs 4 in body
REALPART_EXPR <MEM[(complex(kind=8) *)_144]> 4 times unaligned_load (misalign -1) costs 48 in body

where the vec_construct cost looks too high (the cost model doesn't currently
know that it constructs from two V2DF elements rather than four DF elements).

That we do not cost the relatively expensive SSE/AVX sin/cos SIMD calls
(we cost them only as a simple scalar_stmt on the scalar costing side)
is probably not the most important issue. But if we costed a scalar,
SSE and AVX sin() call all the same at, say, cost 200, that would tip
the costs in favor of AVX.
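
The effect of such a uniform call cost can be checked with a little arithmetic. This is only a sketch: the vectorization factors (2 for SSE2, 4 for AVX2) and the single sin() call per loop body are assumptions, and the 200 is the hypothetical figure from the paragraph above, not a real cost table entry.

```python
# Vector body costs taken from the cost model dumps in this comment.
VEC_BODY = {"SSE2": 460, "AVX2": 1084}
# Assumptions, not from the dumps: vectorization factors, and exactly one
# sin() call per loop body, costed at 200 for each vector variant.
VF = {"SSE2": 2, "AVX2": 4}
CALL_COST = 200


def per_scalar_iter(mode, call_cost=0):
    """Vector body cost per scalar iteration, with an optional call cost."""
    return (VEC_BODY[mode] + call_cost) / VF[mode]


before = {m: per_scalar_iter(m) for m in VEC_BODY}
after = {m: per_scalar_iter(m, CALL_COST) for m in VEC_BODY}
print("without call cost:", before)  # SSE2 230.0, AVX2 271.0
print("with call cost:   ", after)   # SSE2 330.0, AVX2 321.0
print("cheapest with call cost:", min(after, key=after.get))  # AVX2
```

Because the wider vector amortizes the same call cost over twice as many scalar iterations, the per-iteration comparison flips from SSE2 to AVX2 under these assumptions.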
