https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494
--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
I suspect dep-chains are the problem, and branching to skip work is a Good
Thing when it's predictable.

(In reply to Richard Biener from comment #2)
> On Skylake it's better (1uop, 1 cycle latency) while on Ryzen even better.
> On Bulldozer it also isn't that bad (comparable to Skylake I guess).

SKL: AVX VBLENDVPS x,x,x,x is 2 uops, 2c latency, ~1c throughput.  (Same for ymm.)

SKL: SSE4 BLENDVPS x,x,xmm0 is 1 uop, 1c latency, ~0.36c throughput in my
testing, or maybe 0.333c with dep chains broken.  (IDK how Agner got 1c.
Maybe that was an editing mistake and he copied the 1c from the VEX version.)

[V](P)BLENDV(B|PS|PD) is funny: the SSE versions are 1 uop on SKL, I assume
because they only have 3 register operands (including the implicit XMM0).
But the VEX encoding has 4 operands: 1 output and 3 inputs.  I think that's
too many for 1 uop to encode, and that's why VBLENDVPS is 2 uops even on
Skylake.  (The blend-control register is encoded by an imm8 in the VEX
version instead of being implicit xmm0, but I don't think that's what stops
the decoders from making it 1 uop.  I think it's simply having 4 total
operands.)

On Skylake, the uop(s) for [V]BLENDVPS/D and [V]PBLENDVB can run on any of
p015 (instead of only p5 as on BDW and earlier), but the 2-uop VEX version
still has 2 cycle latency.  The VEX version has a bias towards port 5, but
less than half of its total uops run on p5, so it's not p015 + p5.  The SSE
version seems equally distributed across p015.

----

On SKL, the optimal choice might be to use the SSE encoding, if we can deal
with a destructive destination and with having the blend control in xmm0.

The SSE/AVX penalty on SKL is a false output dependency for write-only SSE
instructions (like movaps or cvtps2dq) writing to an XMM register that has a
dirty upper 128.  It's a per-register thing, not like Haswell where it
triggers a slow state transition.
(https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake)
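For reference, a minimal sketch of what the two forms look like when
selecting between two vectors by a compare mask (NASM syntax, arbitrary
register choices, not actual GCC output):

    ; SSE4.1: blend control implicitly in xmm0, destructive destination
    ; (1 uop on SKL)
    cmpltps   xmm0, xmm2         ; mask = (xmm0 < xmm2); must end up in xmm0
    movaps    xmm1, xmm3         ; extra copy if xmm3 has to stay live
    blendvps  xmm1, xmm4         ; xmm1 = xmm0-mask ? xmm4 : xmm1

    ; AVX: non-destructive, 4 operands, control can be any register
    ; (2 uops / 2c latency on SKL)
    vcmpltps  xmm5, xmm6, xmm2       ; mask in any register
    vblendvps xmm1, xmm3, xmm4, xmm5 ; xmm1 = xmm5-mask ? xmm4 : xmm3

The movaps for the destructive destination is the usual extra cost of the
SSE form; whether that's worth the uop savings depends on register pressure
around the blend.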
---

Footnote: VBLENDVPS throughput is only 1c for a big block of it back-to-back,
even though it's only 2 uops that can run on any of 3 ports.  So why isn't it
0.66c throughput?

VBLENDVPS throughput (for back-to-back vblendvps) seems to be limited by some
front-end effect.  In an unrolled loop with 20 vblendvps (with no loop-carried
dependencies), there are a negligible number of cycles where the front-end
delivered the full 4 uops; in most cycles only 2 are issued.

This is not a general problem for 2-uop instructions or anything: 9x bextr +
dec/jnz = 19 uops total runs at 5.00c / iter, or 3.8 uops / clock, with the
only cycle not to issue 4 uops being (I think) the group of 3 including the
loop branch.  Playing around with other 2-uop instructions, I didn't see
front-end bottlenecks.  I saw some back-end bottlenecks because other 2-uop
instructions aren't so nicely distributed over ports, but perf counts for
idq_uops_not_delivered.cycles_fe_was_ok:u generally equaled total cycles.
(It counts cycles where either the FE delivered 4 uops, or the back end was
stalled and thus it's not the front-end's fault.)

A 1-uop instruction following a vblendvps can issue with it in the same
cycle, so this effect is probably not horrible for normal cases where we're
using vblendvps mixed with normal instructions.

I haven't investigated further, whether this is a front-end effect (uop-cache
fetch problem?) or whether it's an allocation bottleneck.  Possibly being a
4-operand instruction has something to do with it, although I don't think
each individual uop can have that many operands.
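A rough sketch of the kind of test described above (my reconstruction, not
the exact harness): 20 independent vblendvps per iteration, with no
loop-carried dependency through the blends, so any bottleneck is issue /
front-end rather than latency.

    ; NASM, x86-64 Linux: nasm -felf64 blend.asm && ld -o blend blend.o
    global _start
    section .text
    _start:
        mov       ecx, 100000000          ; enough iterations to dominate startup
        vzeroupper                        ; start with clean upper halves
    align 32
    blendloop:
    %rep 20
        vblendvps ymm1, ymm2, ymm3, ymm4  ; inputs are loop-invariant and ymm1 is
    %endrep                               ;  write-only: no loop-carried dep chain
        dec       ecx
        jnz       blendloop
        xor       edi, edi
        mov       eax, 231                ; exit_group(0)
        syscall

Running something like this under perf stat with cycles, uops_issued.any,
and idq_uops_not_delivered.cycles_fe_was_ok is how the issue-rate numbers
quoted above would be obtained.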