https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636
--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
> The same possibly applies to all "zero-extending" moves?

Yes. If a vmovdqa %xmm0,%xmm1 will work, it's the best choice on AMD CPUs, and doesn't hurt on Intel CPUs. So use it in any case where you need to copy a register and the upper lane(s) are known to be zero.

If you're copying just to zero the upper lane, you don't have a choice (when you don't know that the source register's upper lane is already zeroed).

In general, when all else is equal, use narrower vectors. (e.g. in a horizontal sum, the first step should be a vextractf128 to reduce down to 128-bit vectors.)

Quoting the Bulldozer section of Agner Fog's microarch.pdf (section 18.10, Bulldozer AVX):

> 128-bit register-to-register moves have zero latency, while 256-bit
> register-to-register moves have a latency of 2 clocks plus a penalty of
> 2-3 clocks for using a different domain (see below) on Bulldozer and
> Piledriver.

On Ryzen: the low 128-bit lane is renamed with zero latency, but the upper lane needs an execution unit. Despite this, vectorizing with 256-bit vectors *is* worth it on Ryzen, because the core is so wide and decodes double-uop instructions efficiently. AVX's 3-operand instructions also make register-copy moves rarer in the first place.

On Jaguar: 128-bit moves (with implicit zeroing of the upper lane) are 1 uop; 256-bit moves are 2 uops. 128-bit moves from zeroed registers are eliminated (no execution port needed, but they still have to decode/issue/retire). David Kanter's writeup (http://www.realworldtech.com/jaguar/4/) explains that the physical register file has an "is-zero" bit which can be set efficiently. This is how 128-bit moves are able to zero the upper lane of the destination in the rename stage without an extra uop (and how xor-zeroing uops avoid needing an execution port).