https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118984
--- Comment #10 from Maxim Egorushkin <maxim.yegorushkin at gmail dot com> ---
(In reply to Andrew Pinski from comment #9)
> (In reply to Maxim Egorushkin from comment #8)
> > (In reply to Andrew Pinski from comment #6)
> > > If you look at the difference between the 2 functions:
> > >   vextracti128 xmm1, ymm0, 0x1
> > >
> > > vs
> > >
> > >   vmovdqa      xmm1, xmm0
> > >   vextracti128 xmm0, ymm0, 0x1
> > >
> > > The register allocator is allocating the result of the
> > > _mm256_extracti128_si256 in the first case to xmm1 but in the second
> > > case to xmm0. That means in the second case we need a move instruction
> > > to copy what was in ymm0, but only 128 bits of it. And that is where
> > > vmovdqa is coming from.
> >
> > I am sorry for being thick, but I fail to see what requires/causes
> >
> > > That means in the second case we need a move instruction to copy what
> > > was in ymm0, but only 128 bits of it.
> >
> > What exactly needs moving only 128 bits of ymm0 and why, please?
>
> Because you have a conflict. Register allocation happens localized in many
> cases, and if you need a value from a register that will be clobbered by a
> different instruction, a move will be inserted (it just happens that in this
> case we only need the lower part of the register, so we can use the "vmovdqa
> xmm*" instruction to do the copying instead of copying the full register).

The code sums the two xmm halves of the ymm0 register. One scratch register is
needed for the high xmm half of ymm0 in both cases. All the instructions
emitted have an explicit destination register, distinct from the instruction
arguments. Only the last vpaddq instruction needs to store its result into
xmm0 to avoid further register moves. We have the argument in ymm0, the result
must be in xmm0, and we have 15 spare ymm/xmm registers at our disposal. What
conflicts with what here, please?