https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494

--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
I suspect dep-chains are the problem, and branching to skip work is a Good
Thing when it's predictable.

(In reply to Richard Biener from comment #2)
> On Skylake it's better (1uop, 1 cycle latency) while on Ryzen even better.
> On Bulldozer it also isn't that bad (comparable to Skylake I guess).

SKL: AVX VBLENDVPS x,x,x,x  is 2 uops, 2c latency, ~1c throughput.  (Same for
ymm)
SKL: SSE4 BLENDVPS x,x,xmm0 is 1 uop,  1c latency, ~0.36c throughput in my
testing, or maybe 0.333c when breaking dep chains.  (IDK how Agner got 1c.
Maybe that was an editing mistake, and he copied the 1c from the VEX
version.)


[V](P)BLENDV(B|PS|PD) is funny: the SSE versions are 1 uop on SKL, I assume
because they only have 3 register operands (including implicit XMM0).  But the
VEX encoding has 4 operands: 1 output and 3 inputs.  I think this is too many
for 1 uop to encode, and that's why VBLENDVPS is 2 uops even on Skylake.

(The blend-control register is encoded by an imm8 in the VEX version instead of
being implicit xmm0, but I don't think that's what stops the decoders from
making it 1 uop.  I think it's simply having 4 total operands.)
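
For reference, both encodings are reachable from the same intrinsic; here's a
minimal sketch (the function name is just for illustration, and the exact
register allocation and encoding are up to the compiler and flags):

    #include <immintrin.h>

    /* Select elements of b where the sign bit of mask is set, else a.
       Built with -msse4.1 (no AVX) this should become the 1-uop form
           blendvps  xmm_a, xmm_b            ; blend control implicit in xmm0
       while -mavx gives the 4-operand form
           vblendvps xmm_dst, xmm_a, xmm_b, xmm_mask   ; 2 uops on SKL      */
    __m128 sel(__m128 a, __m128 b, __m128 mask)
    {
        return _mm_blendv_ps(a, b, mask);
    }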

On Skylake, the uop(s) for [V]BLENDVPS/D and [V]PBLENDVB can run on any of p015
(instead of only p5 on BDW and earlier), but the 2-uop VEX version still has
2-cycle latency.  The VEX version has a bias towards port 5, but less than half
the total uops run on p5, so it's not one p015 uop plus one p5-only uop.  The
SSE version seems equally distributed across p015.

----

On SKL, the optimal choice might be to use the SSE encoding, if we can deal
with a destructive destination and having the blend control in xmm0.
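
Something like this hypothetical inline-asm sketch (not what GCC emits today;
the name blendv_sse is made up) shows the constraints the compiler would have
to honor: blend control pinned to xmm0, and a destructive destination:

    #include <immintrin.h>

    /* Hypothetical sketch: force the non-VEX encoding from inline asm. */
    static inline __m128 blendv_sse(__m128 a, __m128 b, __m128 mask)
    {
        register __m128 ctrl __asm__("xmm0") = mask; /* implicit 3rd operand */
        __asm__ ("blendvps %2, %0"      /* xmm0 (ctrl) is the implicit mask  */
                 : "+x"(a)              /* destructive destination           */
                 : "x"(ctrl), "x"(b));
        return a;
    }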

The SSE/AVX penalty on SKL is output dependencies for write-only SSE
instructions (like movaps or cvtps2dq) writing to an XMM register that has a
dirty upper 128.  It's a per-register thing, not like Haswell, where mixing
VEX and non-VEX triggers a slow state transition.
(https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake)
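
So if 256-bit AVX code and SSE-encoded blendvps ended up near each other, the
uppers would have to be cleaned first.  A minimal sketch, assuming -mavx (and
note GCC already inserts vzeroupper automatically at ABI boundaries; this just
illustrates the state the linked answer describes):

    #include <immintrin.h>

    void sketch(float *p)
    {
        __m256 v = _mm256_loadu_ps(p);             /* 256-bit work dirties the
                                                      upper halves            */
        _mm256_storeu_ps(p, _mm256_add_ps(v, v));
        _mm256_zeroupper();                        /* clean uppers so later
                                                      SSE-encoded blendvps has
                                                      no false output dep     */
        /* ... SSE4.1 blendvps-based code here ... */
    }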

---

Footnote: VBLENDVPS throughput is only 1c for a big block of it back-to-back,
even though it's only 2 uops that can run on any of 3 ports.  So why isn't it
0.66c throughput?

VBLENDVPS throughput (for back-to-back vblendvps) seems to be limited by some
front-end effect.  In an unrolled loop with 20 vblendvps (and no loop-carried
dependencies), there are a negligible number of cycles where the front-end
delivers the full 4 uops.  In most cycles only 2 are issued.
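
Roughly the shape of the test (a sketch, not my exact harness; the function
name is made up and it obviously needs AVX hardware):

    #include <stdint.h>

    /* 20 independent vblendvps per iteration plus dec/jnz.  They all write
       xmm3 but none reads it, so renaming removes any dependency; xmm0..xmm2
       just hold whatever garbage is in the registers.                       */
    void vblendv_tput(uint64_t iters)
    {
        __asm__ volatile (
            "1:\n\t"
            ".rept 20\n\t"
            "vblendvps %%xmm0, %%xmm1, %%xmm2, %%xmm3\n\t"
            ".endr\n\t"
            "dec %0\n\t"
            "jnz 1b"
            : "+r"(iters)
            :
            : "xmm3", "cc");
    }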

This is not a general problem for 2-uop instructions or anything: 9x bextr +
dec/jnz = 19 uops total runs at 5.00c / iter, or 3.8 uops / clock, with the
only cycle not to issue 4 uops being (I think) the group of 3 including the
loop branch.  Playing around with other 2-uop instructions, I didn't see
front-end bottlenecks.  I saw some back-end bottlenecks because other 2-uop
instructions aren't so nicely distributed over ports, but perf counts for
idq_uops_not_delivered.cycles_fe_was_ok:u generally equaled total cycles.
(It counts cycles where either the FE delivered 4 uops, or the back end was
stalled and thus it's not the front-end's fault.)

A 1 uop instruction following a vblendvps can issue with it in the same cycle,
so this effect is probably not horrible for normal cases where we're using
vblendvps mixed with normal instructions.

I haven't investigated further whether this is a front-end effect (a uop-cache
fetch problem?) or an allocation bottleneck.  Possibly being a 4-operand
instruction has something to do with it, although I don't think each individual
uop can have that many operands.
