https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118984
--- Comment #11 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Let me try again. So we have:

  __v4di v4 = ymm0
  __v2di tmp  = _mm256_extracti128_si256(v4, 1); // vextracti128
  __v2di tmp1 = _mm256_castsi256_si128(v4);      // subreg
  __v2di v2  = tmp + tmp1;
  __v2di v3  = _mm_shuffle_epi32(v2, 0b1110);
  __v2di res = v2 + v3;

The register allocator allocates res to xmm0, then v3 to xmm1, and v2 to xmm0.
And this is where the problem comes in: tmp gets allocated to xmm0 and tmp1 to
xmm1. Then a move from ymm0 to ymm1 (or xmm0 to xmm1) has to be added, because
tmp needs to be in xmm0 but v4 was in ymm0. That is where the extra register
copy (move) comes from. The zeroing effect of the instruction is just a side
effect of the instruction rather than anything else.

Now if the register allocator instead allocates tmp to xmm1 and tmp1 to xmm0,
no extra move is needed and there is no conflict between xmm0 and v4.

Does that make sense now? Getting this right in this case needs a global view
that tries to remove as many moves as possible, and even then it might still
end up in the wrong valley.