https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69622
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2016-02-02
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
A workaround is -fno-schedule-insns2.  I suppose the compiler is trying to
increase the distance of the loads and stores (in a greedy way) to reduce
the impact on load latency in the general premise of moving loads up and
stores down.

In fact with -fno-schedule-insns2 you can see that we end up with

.L5:
        vmovdqa 32(%rsi), %ymm2
        vmovdqa 64(%rsi), %ymm1
        vmovdqa 96(%rsi), %ymm0
        vmovdqa (%rsi), %ymm3
        vmovntdq        %ymm3, (%rdi)
        vmovntdq        %ymm2, 32(%rdi)
        vmovntdq        %ymm1, 64(%rdi)
        vmovntdq        %ymm0, 96(%rdi)

which is because we do TER the zero-offset load (thus RTL expand it right
before the store).  Possibly scheduling tries to fix that up but does a
miserable job.
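
For readers who want to look at the scheduling themselves, the following is a minimal sketch of a loop of the same shape: four aligned 32-byte loads followed by four non-temporal stores. It is not the test case attached to the bug. The function name copy_nt, the 128-byte stride and the alignment requirements are assumptions made for illustration only.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical reproducer (not the bug's attached test case).
   Assumes dst and src are 32-byte aligned and n is a multiple of 128. */
__attribute__((target("avx")))
void copy_nt(void *dst, const void *src, size_t n)
{
    __m256i *d = (__m256i *)dst;
    const __m256i *s = (const __m256i *)src;

    for (size_t i = 0; i < n / 32; i += 4) {
        /* Four aligned loads, written in source order 0, 32, 64, 96. */
        __m256i a = _mm256_load_si256(s + i);
        __m256i b = _mm256_load_si256(s + i + 1);
        __m256i c = _mm256_load_si256(s + i + 2);
        __m256i e = _mm256_load_si256(s + i + 3);

        /* Four non-temporal stores, again in ascending address order. */
        _mm256_stream_si256(d + i, a);
        _mm256_stream_si256(d + i + 1, b);
        _mm256_stream_si256(d + i + 2, c);
        _mm256_stream_si256(d + i + 3, e);
    }

    /* Order the streaming stores before any later stores. */
    _mm_sfence();
}
```

Compiling this to assembly twice, for example with gcc -O2 -mavx -S and then again with -fno-schedule-insns2 added, should make it possible to compare the instruction order in the loop body. The exact output depends on the GCC version and target tuning, so it may not match the listing above.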