https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115845
Richard Biener <rguenth at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |crazylht at gmail dot com
Keywords| |missed-optimization
Target| |x86_64-*-*
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
I'm not sure whether there's an AMD-specific performance counter; it would be
interesting to see if Icelake or later behaves similarly (with 512-bit vector size)
and whether there's a performance counter that helps identify the issue.
Maybe it's just bad to have that masked back-to-back load-add-store
sequence, while it's of course "fine" without masking. Well, better at least:
  24 │ vmovupd (%r9),%ymm0
  98 │ vaddpd -0xbcd0(%rbp,%r15,8),%ymm0,%ymm0
1176 │ vmovupd %ymm0,(%r9)
The fact seems to be that there is _exactly_ one %ymm worth of data each iteration:
the %zmm vector loop gets no hits, nor does the epilogue (for the
non-masked vectorization run). That possibly makes for bad backward
branch prediction (interestingly, for the non-masked case we seem to
"unroll"), and possibly the trip-count check is the point where we
incur the bogus prediction.