https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118984
--- Comment #9 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Maxim Egorushkin from comment #8)
> (In reply to Andrew Pinski from comment #6)
> > If you look at the difference between the 2 functions:
> >
> >   vextracti128 xmm1, ymm0, 0x1
> >
> > vs
> >
> >   vmovdqa xmm1, xmm0
> >   vextracti128 xmm0, ymm0, 0x1
> >
> > The register allocator is allocating the result of
> > _mm256_extracti128_si256 in the first case to xmm1 but in the second
> > case to xmm0. That means in the second case we need a move instruction
> > to copy what was in ymm0, but only 128 bits of it. And that is where
> > vmovdqa is coming from.
>
> I am sorry for being thick, but I fail to see what requires/causes
>
> > That means in the second case we need a move instruction to copy what
> > was in ymm0 but only 128 bits of it
>
> What exactly needs moving only 128 bits of ymm0 and why, please?

Because you have a conflict. Register allocation happens locally in many
cases, and if you need a value from a register that will be clobbered by a
different instruction, a move will be inserted. It just happens that in this
case we only need the lower part of the register, so we can use the
"vmovdqa xmm*" instruction to do the copying instead of copying the full
register.
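
A minimal sketch (not the reporter's exact testcase, just an assumed shape of
the code under discussion) of the situation described above: both the low and
the high 128-bit lane of a 256-bit value are live at the same time, so if the
allocator assigns the extract result to xmm0, the low half of ymm0 has to be
copied away first with a 128-bit vmovdqa.

#include <immintrin.h>

/* Hypothetical example: sum the two 128-bit lanes of a __m256i.
   Compile with -mavx2. */
__m128i sum_lanes(__m256i v)
{
    __m128i lo = _mm256_castsi256_si128(v);      /* low lane, aliased to xmm0 (low half of ymm0) */
    __m128i hi = _mm256_extracti128_si256(v, 1); /* high lane */
    /* If 'hi' is allocated to xmm0, the extract would clobber 'lo', so a
       "vmovdqa xmm1, xmm0" is emitted first to preserve the low lane;
       if 'hi' goes to xmm1 instead, no extra move is needed. */
    return _mm_add_epi32(lo, hi);
}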