https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118984

--- Comment #10 from Maxim Egorushkin <maxim.yegorushkin at gmail dot com> ---
(In reply to Andrew Pinski from comment #9)
> (In reply to Maxim Egorushkin from comment #8)
> > (In reply to Andrew Pinski from comment #6)
> > > If you look at the difference between the 2 functions.
> > >         vextracti128    xmm1, ymm0, 0x1
> > > 
> > > vs
> > >         vmovdqa xmm1, xmm0
> > >         vextracti128    xmm0, ymm0, 0x1
> > > 
> > > The register allocator is allocating the result of the
> > > _mm256_extracti128_si256 in the first case to xmm1 but in the second case 
> > > to
> > > xmm0. That means in the second case we need to a move instruction to copy
> > > what was in ymm0 but only 128bits of it. And that is where vmovdqa is 
> > > coming
> > > from.
> > 
> > I am sorry for being thick, but I fail to see what requires/causes 
> > 
> > > That means in the second case we need to a move instruction to copy what 
> > > was in ymm0 but only 128bits of it_
> > 
> > What exactly needs moving only 128 bits of ymm0 and why, please?
> 
> 
> Because you have a conflict. Register allocation happens localized in many
> cases and if you need a value from a register that will be clobbered by a
> different instruction, a move will be inserted (it just happens in this case
> we only need the lower part of the register so we can use the "vmovdqa xmm*"
> instruction to do the copying instead of copying the full register).

The code sums the two xmm halves of the ymm0 register. One scratch register is
needed for the high xmm half of ymm0, in both cases.

All the instructions emitted have an explicit destination register, which is
distinct from the instruction arguments.

Only the last vpaddq instruction needs to store its result into the xmm0
register, so no extra register moves should be required.

We have the argument in ymm0 and the result must end up in xmm0, with 15 spare
ymm/xmm registers at our disposal.

What conflicts with what here, please?
