https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69622

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2016-02-02
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
A workaround is -fno-schedule-insns2.  I suppose the scheduler is trying to
increase the distance between the loads and the stores (in a greedy way) to
reduce the impact of load latency, on the general premise of moving loads up
and stores down.

In fact, with -fno-schedule-insns2 you can see that we end up with

.L5:
        vmovdqa 32(%rsi), %ymm2
        vmovdqa 64(%rsi), %ymm1
        vmovdqa 96(%rsi), %ymm0
        vmovdqa (%rsi), %ymm3
        vmovntdq        %ymm3, (%rdi)
        vmovntdq        %ymm2, 32(%rdi)
        vmovntdq        %ymm1, 64(%rdi)
        vmovntdq        %ymm0, 96(%rdi)

which is because we TER the zero-offset load (and thus RTL-expand it right
before the store that uses it).  Possibly scheduling tries to fix that up but
does a miserable job.
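
For reference, a loop of roughly the following shape would be expected to
produce four vmovdqa loads followed by four vmovntdq stores per iteration.
This is only a minimal sketch, not the testcase from the report: the function
names copy_stream and copy_stream_avx, the 128-byte step and the memcpy
fallback are all assumptions made for illustration.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reconstruction, NOT the testcase attached to the report.
   Copies `bytes` (a multiple of 128) between 32-byte-aligned buffers using
   four aligned 256-bit loads followed by four non-temporal stores. */
__attribute__((target("avx")))
static void copy_stream_avx(void *dst, const void *src, size_t bytes)
{
    const __m256i *s = (const __m256i *)src;
    __m256i *d = (__m256i *)dst;

    for (size_t i = 0; i < bytes / 32; i += 4) {
        /* Source order: all loads first ... */
        __m256i a = _mm256_load_si256(s + i);
        __m256i b = _mm256_load_si256(s + i + 1);
        __m256i c = _mm256_load_si256(s + i + 2);
        __m256i e = _mm256_load_si256(s + i + 3);
        /* ... then the stores in ascending address order. */
        _mm256_stream_si256(d + i, a);
        _mm256_stream_si256(d + i + 1, b);
        _mm256_stream_si256(d + i + 2, c);
        _mm256_stream_si256(d + i + 3, e);
    }
    _mm_sfence(); /* order the non-temporal stores before later stores */
}

/* Wrapper (also an assumption): checks the preconditions and falls back to
   memcpy when the CPU lacks AVX, so the sketch runs on any x86 machine. */
void copy_stream(void *dst, const void *src, size_t bytes)
{
    assert((uintptr_t)dst % 32 == 0 && (uintptr_t)src % 32 == 0);
    assert(bytes % 128 == 0);
    if (__builtin_cpu_supports("avx"))
        copy_stream_avx(dst, src, bytes);
    else
        memcpy(dst, src, bytes);
}
```

Compiling something like this at -O2 with and without -fno-schedule-insns2
and comparing the order of the vmovntdq instructions in the loop should show
whether the second scheduling pass is what moves the stores around.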
