https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115845
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |crazylht at gmail dot com
           Keywords|        |missed-optimization
             Target|        |x86_64-*-*

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
I'm not sure whether there's an AMD-specific performance counter; it would be
interesting to see if Icelake or later behaves similarly (with 512-bit vector
size) and whether there's a performance counter that helps identify the issue.

Maybe it's just bad to have those masked back-to-back load-add-store
sequences, but it's of course "fine" without masking.  Well, better at least:

   24 │ vmovupd (%r9),%ymm0
   98 │ vaddpd  -0xbcd0(%rbp,%r15,8),%ymm0,%ymm0
 1176 │ vmovupd %ymm0,(%r9)

An interesting fact is that there is _exactly_ one %ymm worth of data each
iteration: the %zmm vector loop gets no hits, nor does the epilogue (for the
non-masked vectorization run).  That possibly makes for bad backward branch
prediction (interestingly, for the non-masked case we seem to "unroll"), and
possibly the count is the point where we determine the bogus prediction.