https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115845

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com
           Keywords|                            |missed-optimization
             Target|                            |x86_64-*-*

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
I'm not sure if there's any AMD specific performance counter, it would be
interesting to see if Icelake or later behave similarly (with 512 vector size)
and whether there's a performance counter that helps identifying the issue.

Maybe it's just bad to have that masked back-to-back load-add-store
sequence, but it's of course "fine" without masking.  Well, better at least:

    24 │        vmovupd       (%r9),%ymm0
    98 │        vaddpd        -0xbcd0(%rbp,%r15,8),%ymm0,%ymm0
  1176 │        vmovupd       %ymm0,(%r9)

The interesting fact seems to be that there is _exactly_ a %ymm of data
each iteration: the %zmm vector loop gets no hits, nor does the epilogue
(for the non-masked vectorization run).  That possibly makes for bad
backward branch prediction (interestingly, for the non-masked case we
seem to "unroll"), and possibly the trip count is the point where we
end up with the bogus prediction.
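For reference, the hot sequence above corresponds to a plain load-add-store
loop over doubles.  A minimal sketch of that shape (hypothetical, not the
bug's original testcase; function name and signature are mine) -- with n == 4
each call touches exactly one %ymm (4 doubles) of data, so a %zmm vector
loop would never be entered:

```c
#include <stddef.h>

/* Hypothetical reproducer sketch: an in-place load-add-store loop.
   When n is exactly 4 (one %ymm of doubles), an 8-double %zmm main
   vector loop gets no iterations at all, matching the profile in
   this comment.  Compile with -O3 on x86_64 and inspect the asm. */
void add_inplace(double *restrict dst, const double *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];   /* load, add, store: vmovupd / vaddpd / vmovupd */
}
```

With masked (AVX-512) epilogue vectorization the same loop body would be
emitted with mask registers, which is the variant that profiles worse here.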
