https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118984
Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|101926                      |

--- Comment #6 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Look at the difference between the two functions:

        vextracti128    xmm1, ymm0, 0x1

vs.

        vmovdqa         xmm1, xmm0
        vextracti128    xmm0, ymm0, 0x1

The register allocator assigns the result of _mm256_extracti128_si256 to xmm1
in the first case but to xmm0 in the second. In the second case xmm0 gets
overwritten, so we need a move instruction to copy what was in ymm0 first,
but only the low 128 bits of it. That is where the vmovdqa comes from.

IIRC there are a few other examples of this issue, and it comes down to
subregs not being handled well by the register allocator. As I mentioned,
register allocation is an NP-complete problem, so an extra move (register
copy) might/will happen in some cases if registers are allocated in the wrong
order. It happens more often with vector instructions/registers because a
vector register can hold values in several different "modes" (subregs).


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101926
[Bug 101926] [meta-bug] struct/complex/other argument passing and return should be improved
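
A toy illustration of the extra copy described in comment #6, not GCC's actual
allocator: the function name emit_extract_high and its parameters are
hypothetical. It assumes ymm0 holds the 256-bit input and that its low half
(xmm0) is still needed after the high half is extracted. Choosing xmm0 as the
destination then forces a save of the low half first.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the instruction sequence into `out` and
   returns the number of instructions emitted. `dest` is the xmm register
   number (0 or 1) chosen for the extracted high half. */
static int emit_extract_high(int dest, char *out, size_t cap)
{
    int count = 0;
    size_t len;

    assert(dest == 0 || dest == 1);
    assert(cap > 0);
    out[0] = '\0';

    if (dest == 0) {
        /* The destination overlaps the still-live low half of ymm0,
           so the low 128 bits must be copied out before the extract. */
        snprintf(out, cap, "vmovdqa xmm1, xmm0\n");
        count++;
    }

    len = strlen(out);
    snprintf(out + len, cap - len, "vextracti128 xmm%d, ymm0, 0x1\n", dest);
    return count + 1;
}
```

Calling emit_extract_high(1, buf, sizeof buf) yields the single-instruction
form, while emit_extract_high(0, buf, sizeof buf) yields the two-instruction
form with the leading vmovdqa. The model only shows why the destination choice
matters; it says nothing about how the real allocator arrives at that choice.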