https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118984

--- Comment #11 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Let me try again:

So we have:
__v4di v4 = ymm0
__v2di tmp = _mm256_extracti128_si256(v4, 1); // vextracti128
__v2di tmp1 = _mm256_castsi256_si128(v4); // subreg
__v2di v2 = tmp + tmp1;
__v2di v3 = _mm_shuffle_epi32(v2, 0b1110);
__v2di res = v2 + v3;
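For reference, the sketch above corresponds to roughly the following self-contained horizontal-sum function (the function name and test values are mine, not from the report; needs an AVX2-capable CPU to run):

```c
#include <immintrin.h>
#include <stdint.h>

/* Horizontal sum of the four 64-bit lanes of a 256-bit vector,
   following the reduction sequence sketched in the comment above. */
__attribute__((target("avx2")))
int64_t hsum_epi64(__m256i v4)
{
    __m128i tmp  = _mm256_extracti128_si256(v4, 1); /* high 128 bits: vextracti128 */
    __m128i tmp1 = _mm256_castsi256_si128(v4);      /* low 128 bits: just a subreg, no insn */
    __m128i v2   = _mm_add_epi64(tmp, tmp1);
    __m128i v3   = _mm_shuffle_epi32(v2, 0b1110);   /* move the upper qword of v2 down */
    __m128i res  = _mm_add_epi64(v2, v3);
    return _mm_cvtsi128_si64(res);                  /* low 64 bits hold the total */
}
```

Compiled at -O2, this is the kind of function where the extra vmovdqa described below shows up or disappears depending on how the allocator assigns tmp and tmp1.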


So the register allocator allocates res to xmm0, then v3 to xmm1, and v2 to xmm0.

And this is where the problem comes in: tmp gets allocated to xmm0 and tmp1 to xmm1.
Now a move from ymm0 to ymm1 (or xmm0 to xmm1) has to be inserted, because tmp needs
to end up in xmm0 while v4 lives in ymm0: writing the extract result into xmm0 would
clobber the low half of v4 before tmp1 (the subreg) can read it, so v4 has to be
copied out of the way first.

This is where the extra register copy (move) comes from. The zeroing effect of
the instruction is just a side effect of the instruction, nothing more.


Now, if the register allocator instead allocates tmp to xmm1 and tmp1 to xmm0, no
extra move is needed, and there is no conflict between xmm0 and v4.

Does that make sense now?

Getting this right in this case needs a global view that tries to remove as many
moves as possible, and even then it might still end up in the wrong valley.
