https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756
--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
Good:

  <bb 2> [local count: 1073741824]:
  _1 = *m_12(D);
  _14 = VEC_PERM_EXPR <v_13(D), v_13(D), { 0, 0, 0, 0 }>;
  _2 = _1 * _14;
  _3 = MEM[(__v4sf *)m_12(D) + 16B];
  _15 = VEC_PERM_EXPR <v_13(D), v_13(D), { 1, 1, 1, 1 }>;
  _4 = _3 * _15;
  _5 = _2 + _4;
  _6 = MEM[(__v4sf *)m_12(D) + 32B];
  _16 = VEC_PERM_EXPR <v_13(D), v_13(D), { 2, 2, 2, 2 }>;
  _7 = _6 * _16;
  _8 = _5 + _7;
  _9 = MEM[(__v4sf *)m_12(D) + 48B];
  _17 = VEC_PERM_EXPR <v_13(D), v_13(D), { 3, 3, 3, 3 }>;
  _10 = _9 * _17;
  _18 = _8 + _10;
  return _18;

Bad:

  <bb 2> [local count: 1073741824]:
  _1 = *m_12(D);
  _30 = BIT_FIELD_REF <v_13(D), 32, 0>;
  v_28 = BIT_INSERT_EXPR <v_27(D), _30, 0>;
  _29 = VEC_PERM_EXPR <v_28, v_28, { 0, 0, 0, 0 }>;
  _2 = _1 * _29;
  _3 = MEM[(__v4sf *)m_12(D) + 16B];
  _26 = BIT_FIELD_REF <v_13(D), 32, 32>;
  v_24 = BIT_INSERT_EXPR <v_23(D), _26, 0>;
  _25 = VEC_PERM_EXPR <v_24, v_24, { 0, 0, 0, 0 }>;
  _4 = _3 * _25;
  _5 = _2 + _4;
  _6 = MEM[(__v4sf *)m_12(D) + 32B];
  _14 = BIT_FIELD_REF <v_13(D), 32, 64>;
  v_16 = BIT_INSERT_EXPR <v_17(D), _14, 0>;
  _15 = VEC_PERM_EXPR <v_16, v_16, { 0, 0, 0, 0 }>;
  _7 = _6 * _15;
  _8 = _5 + _7;
  _9 = MEM[(__v4sf *)m_12(D) + 48B];
  _18 = BIT_FIELD_REF <v_13(D), 32, 96>;
  v_20 = BIT_INSERT_EXPR <v_21(D), _18, 0>;
  _19 = VEC_PERM_EXPR <v_20, v_20, { 0, 0, 0, 0 }>;
  _10 = _9 * _19;
  _22 = _8 + _10;
  return _22;

So what's missing is converting the extract element, insert at 0 & splat
into splat element N.

  _30 = BIT_FIELD_REF <v_13(D), 32, 0>;
  v_28 = BIT_INSERT_EXPR <v_27(D), _30, 0>;
  _29 = VEC_PERM_EXPR <v_28, v_28, { 0, 0, 0, 0 }>;

Shows a missing no-op (insert into default-def at 0 from extract from same
position can simply return the vector we extract from).

  _26 = BIT_FIELD_REF <v_13(D), 32, 32>;
  v_24 = BIT_INSERT_EXPR <v_23(D), _26, 0>;
  _25 = VEC_PERM_EXPR <v_24, v_24, { 0, 0, 0, 0 }>;

is a bit more complicated - the VEC_PERM_EXPR indices need to be rewritten
based on the fact that we only pick the just inserted element and that
element was extracted from another (compatible) vector.
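
To make the index rewrite concrete, here is a rough model in plain C. It is
not GCC code and not a proposed patch: the names perm_same and remap_selector
are made up for illustration, vectors are modelled as 4-element arrays, and it
assumes that with two identical VEC_PERM_EXPR operands a selector index i
reads lane i % 4. The sketch conservatively gives up as soon as any selector
index would read a lane of the default-def (undefined) vector.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NLANES = 4 };  /* models a 4 x float vector such as __v4sf */

/* Model of VEC_PERM_EXPR <v, v, sel> with both operands identical.
   Assumption: selector indices wrap, so index i reads lane i % NLANES.  */
void perm_same(const float v[NLANES], const unsigned sel[NLANES],
               float out[NLANES])
{
    for (size_t i = 0; i < NLANES; i++)
        out[i] = v[sel[i] % NLANES];
}

/* Hypothetical fold for the pattern
     tmp = BIT_FIELD_REF <src, 32, 32 * src_lane>;
     vec = BIT_INSERT_EXPR <undef, tmp, ins_lane>;
     res = VEC_PERM_EXPR <vec, vec, sel>;
   If every selector index picks the inserted lane, store in new_sel a
   selector that reads src directly and return true.  Otherwise return
   false and leave new_sel untouched.  */
bool remap_selector(unsigned src_lane, unsigned ins_lane,
                    const unsigned sel[NLANES], unsigned new_sel[NLANES])
{
    assert(src_lane < NLANES && ins_lane < NLANES);

    /* Only the inserted lane has a known value.  */
    for (size_t i = 0; i < NLANES; i++)
        if (sel[i] % NLANES != ins_lane)
            return false;

    /* Every result lane is the element extracted from src_lane.  */
    for (size_t i = 0; i < NLANES; i++)
        new_sel[i] = src_lane;
    return true;
}
```

With src_lane = 1, ins_lane = 0 and sel = { 0, 0, 0, 0 } the rewritten
selector is { 1, 1, 1, 1 } applied to the source vector, which matches the
shape of _15 in the "Good" dump. With src_lane = 0 the selector is unchanged
and only the operand is replaced, which is the no-op case.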