https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102750
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
So when we have BB vectorization and costs like

+su3_proj.c:44:24: note: Cost model analysis for part in loop 1:
+  Vector cost: 776
+  Scalar cost: 836

we decide it's worthwhile to vectorize.  But that does not account for the
fact that the scalar code represents multiple vector lanes that can execute
in parallel, while the vector code, by combining the lanes, makes them data
dependent.  So considering a VF of at least 2, the scalar code could in the
ideal case run with a latency of 836/2 = 418 while the vector code would
have a latency of 776.  Of course the cost number we compute doesn't really
model overall latency; I will see if we can improve that for GCC 13.

What I'm after is that it's maybe not a good idea to compare scalar vs.
vector cost for BB vectorization in the way we currently do.  Without any
information on the dependences between the individually costed stmts it's
hard to do better, of course.  Still, the target could try simple
adjustments based on available issue width and execution resources, for
example assuming that two scalar ops can always issue and execute in
parallel and thus dividing the scalar cost by two (for BB vectorization
only, or for scalar stmts participating in SLP in the case of loop
vectorization; that information is not readily available, though).

There's always the possibility of special-casing more constrained resources
(integer/fp dividers, stores or shifts), but that already gets a bit
complicated because we cost scalar stmts independently, so the code doesn't
know how many of them will be combined into a single vector lane.  The
alternative is to do the biasing on the vector side, but there you do not
know the original scalar operations involved (consider patterns).  So it
might not be easy to adjust the current heuristics in a good way.
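To make the arithmetic concrete, here is a minimal self-contained C++ sketch
of the issue-width adjustment suggested above.  This is not actual GCC code:
the names cost_pair, bb_vect_profitable_p and issue_width are made up for
illustration, and the heuristic simply divides the summed scalar cost by an
assumed issue width before comparing against the vector cost.

/* Hypothetical sketch, not GCC internals.  Models the idea that
   independent scalar stmts can issue in parallel, so the scalar
   cost sum overstates the scalar code's latency.  */

#include <cstdio>

struct cost_pair
{
  unsigned scalar_cost;
  unsigned vector_cost;
};

/* Return true if BB vectorization looks profitable.  With
   issue_width == 1 this is the current raw comparison; a larger
   issue_width divides the scalar cost to approximate scalar lanes
   executing in parallel.  */
static bool
bb_vect_profitable_p (cost_pair c, unsigned issue_width)
{
  unsigned adjusted_scalar = c.scalar_cost / issue_width;
  return c.vector_cost < adjusted_scalar;
}

int
main ()
{
  cost_pair c = { 836, 776 };   /* Numbers from the dump above.  */
  printf ("raw comparison:      vectorize = %d\n",
	  bb_vect_profitable_p (c, 1));
  printf ("issue width of two:  vectorize = %d\n",
	  bb_vect_profitable_p (c, 2));
  return 0;
}

With the raw comparison the example vectorizes (776 < 836); with an assumed
issue width of 2 the adjusted scalar cost becomes 418 and vectorization is
rejected, matching the latency argument made above.  Whether a uniform
divisor is a good model is exactly what the constrained-resource caveat
(dividers, stores, shifts) calls into question.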