https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90579
--- Comment #20 from Richard Biener <rguenth at gcc dot gnu.org> ---
So on the GIMPLE level we have

  vect__5.9_37 = MEM <vector(4) double> [(double *)&r + 16B];
  vect__5.10_38 = VEC_PERM_EXPR <vect__5.9_37, vect__5.9_37, { 3, 2, 1, 0 }>;
  stmp_t_11.11_39 = BIT_FIELD_REF <vect__5.10_38, 64, 0>;
  stmp_t_11.11_40 = stmp_t_11.11_39 + 0.0;
  stmp_t_11.11_41 = BIT_FIELD_REF <vect__5.10_38, 64, 64>;
  stmp_t_11.11_42 = stmp_t_11.11_40 + stmp_t_11.11_41;
  stmp_t_11.11_43 = BIT_FIELD_REF <vect__5.10_38, 64, 128>;
  stmp_t_11.11_44 = stmp_t_11.11_42 + stmp_t_11.11_43;
  stmp_t_11.11_45 = BIT_FIELD_REF <vect__5.10_38, 64, 192>;

where forwprop elides the VEC_PERM_EXPR.  It would also have elided the
vector load, replacing it by component loads, if it had processed the stmts
in the proper order and the VEC_PERM_EXPR elision had actually removed the
VEC_PERM_EXPR stmt.  The result of applying both is

loop:
.LFB0:
        .cfi_startproc
        movslq  %edi, %rdi
        vbroadcastsd    %xmm0, %ymm1
        vmovddup        %xmm0, %xmm0
        vmulpd  a(,%rdi,8), %ymm1, %ymm1
        vmovupd %ymm1, r(%rip)
        vunpckhpd       %xmm1, %xmm1, %xmm2
        vmulpd  a+32(,%rdi,8), %xmm0, %xmm0
        vmovupd %xmm0, r+32(%rip)
        vxorpd  %xmm0, %xmm0, %xmm0
        vaddsd  r+40(%rip), %xmm0, %xmm0
        vaddsd  r+32(%rip), %xmm0, %xmm0
        vaddsd  r+24(%rip), %xmm0, %xmm0
        vaddsd  r+16(%rip), %xmm0, %xmm0
        vaddsd  %xmm2, %xmm0, %xmm0
        vaddsd  %xmm0, %xmm1, %xmm0
        vzeroupper
        ret
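For illustration only (this is not the testcase attached to this PR; the
function and type names below are made up), the GIMPLE pattern above can be
sketched in C using GCC's vector extensions: a 32-byte vector load, a
reversing permute (the { 3, 2, 1, 0 } VEC_PERM_EXPR), and per-lane extracts
feeding scalar adds (the BIT_FIELD_REFs), next to the component-load form
the two forwprop simplifications should reduce it to.

/* Illustration, not the PR testcase.  Uses GCC vector extensions.  */
typedef double v4df __attribute__ ((vector_size (32)));
typedef long long v4di __attribute__ ((vector_size (32)));

/* Shape of the vectorized reduction tail shown above.  */
double
sum_permuted (const double *p)
{
  v4df v;
  __builtin_memcpy (&v, p, sizeof v);                       /* vector load   */
  v4df rev = __builtin_shuffle (v, (v4di) { 3, 2, 1, 0 });  /* VEC_PERM_EXPR */
  return 0.0 + rev[0] + rev[1] + rev[2] + rev[3];           /* lane extracts */
}

/* What is left once both the permute and the vector load are elided:
   plain scalar component loads feeding the reduction.  */
double
sum_components (const double *p)
{
  return 0.0 + p[3] + p[2] + p[1] + p[0];
}

With both simplifications applied in the right order, sum_permuted ought to
compile to the same scalar loads as sum_components, which is what the vaddsd
sequence from r+40(%rip) down to r+16(%rip) in the assembly above reflects.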