https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54939
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
With AVX2 we indeed generate

.L4:
        vmovupd (%rdx,%rax), %ymm3
        addl    $1, %r9d
        vpermpd $177, %ymm3, %ymm4
        vmovapd %ymm3, %ymm2
        vmulpd  %ymm6, %ymm4, %ymm4
        vfmsub132pd     %ymm5, %ymm4, %ymm2
        vfmadd132pd     %ymm5, %ymm4, %ymm3
        vshufpd $10, %ymm3, %ymm2, %ymm2
        vaddpd  (%rcx,%rax), %ymm2, %ymm2
        vmovupd %ymm2, (%rcx,%rax)
        addq    $32, %rax
        cmpl    %esi, %r9d
        jb      .L4

thus either there is no addsub for %ymm or there is insufficient pattern
support for it.

Note that with AVX2 the above is what is generated even with the cost model
enabled, as it's now considered a profitable vectorization.
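
For readers following the lane arithmetic, here is a rough scalar sketch in C of what one iteration of the loop above appears to compute on four doubles. It is a reading of the listing, not code from GCC or from the testcase. The function name `model_l4_iteration` and the parameters `cr` and `ci` are hypothetical; in particular, treating %ymm5 and %ymm6 as broadcasts of two loop-invariant scalars is an assumption, since they are set up outside the loop shown. The point is that the vfmsub132pd/vfmadd132pd pair followed by vshufpd $10 yields "subtract in even lanes, add in odd lanes", which is the addsub lane pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of one iteration of the .L4 loop, on 4 doubles
 * (two interleaved pairs: x[0], x[1] and x[2], x[3]).
 * ASSUMPTION: cr and ci stand for the values broadcast in %ymm5 and
 * %ymm6; their setup is outside the loop and not shown in the comment.
 * The single rounding of the fused multiply-add forms is not modelled. */
static void model_l4_iteration(const double x[4], double cr, double ci,
                               double y[4])
{
    double sub[4], add[4];

    assert(x != NULL && y != NULL);

    for (size_t i = 0; i < 4; i++) {
        /* vpermpd $177: swap the two doubles within each pair */
        double swapped = x[i ^ 1];
        /* vmulpd %ymm6: t = swapped * ci */
        double t = swapped * ci;
        /* vfmsub132pd %ymm5: x * cr - t */
        sub[i] = x[i] * cr - t;
        /* vfmadd132pd %ymm5: x * cr + t */
        add[i] = x[i] * cr + t;
    }

    for (size_t i = 0; i < 4; i++) {
        /* vshufpd $10: even lanes from the fmsub result,
         * odd lanes from the fmadd result */
        double blended = (i % 2 == 0) ? sub[i] : add[i];
        /* vaddpd + vmovupd: accumulate into the destination */
        y[i] += blended;
    }
}
```

Under that assumption, x = {1, 2, 3, 4} with cr = 2 and ci = 3 gives blended lanes {-4, 7, -6, 17}, i.e. the complex products (1+2i)(2+3i) and (3+4i)(2+3i), which are then added to y.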