https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100076
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-04-15
             Status|UNCONFIRMED                 |NEW
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note that even when avoiding the STLF (store-to-load forwarding) hit, the
vectorized version is slower. You can use
-mtune-ctl=^sse_unaligned_load_optimal to force loading the lower and upper
halves of vectors separately.
The reason is that, without -ffast-math, we are using an in-order reduction,
which doesn't save us much and instead just combines dependence chains
here. We do have a related bug for this somewhere.
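
To illustrate the in-order vs. reassociated distinction, here is a minimal
sketch. It assumes a plain sum over an array of doubles, which is not the
kernel from this report; the function names sum_in_order and sum_reassoc are
made up for the example. Without -ffast-math the compiler has to keep the
additions in source order, so every add waits on the previous one. With
reassociation allowed it may split the sum into independent partial
accumulators, roughly like the second function.

```c
#include <stddef.h>

/* In-order reduction: the adds form one serial dependence chain,
   because strict FP semantics forbid reordering them. */
double sum_in_order(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hand-reassociated reduction (illustrative only): two independent
   partial sums, combined once at the end.  This can round differently
   from sum_in_order, which is why the compiler only does this kind of
   transformation itself when reassociation is permitted. */
double sum_reassoc(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];      /* chain 0 */
        s1 += a[i + 1];  /* chain 1, independent of chain 0 */
    }
    if (i < n)           /* odd element count: handle the tail */
        s0 += a[i];
    return s0 + s1;
}
```

Vectorizing the first form still has to feed each lane into the accumulator
one after another, so the vector loads buy little; the second form is the
shape that maps onto one vector add per iteration.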
With -ffast-math, the versions with and without
-mtune-ctl=^sse_unaligned_load_optimal
run at about the same speed, so STLF is a red herring here (on Zen2).
Still, not vectorizing is a lot faster.
Can you check if -mtune-ctl=^sse_unaligned_load_optimal helps on CLX?