https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039
--- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Ah, in that sense. The extra load is problematic in cold code where it's likely a TLB miss. For hot code: the load does not depend on any previous computations and so does not increase dependency chains. So it's ok from latency point of view; from throughput point of view, there's a tradeoff, one extra load per chain may be ok, but if every other instruction in a chain needs a different load, that's probably excessive. So it needs to be costed somehow.

That said, sufficiently simple constants can be synthesized with SSE in-place without loading them from memory, for example the constant in the opening example:

    pcmpeqd %xmm1, %xmm1    // xmm1 = ~0
    pslld   $31, %xmm1      // xmm1 <<= 31

(again, if we need to synthesize just one constant per chain that's preferable, if we need many, the extra work would need to be costed against the latency improvement of keeping the chain on SSE)
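
For readers who want to check the value the two-instruction sequence produces, here is a minimal sketch in C with SSE2 intrinsics, assuming an x86 target with SSE2 enabled. The function names (sign_mask_synthesized, sign_mask_loaded, masks_match, sign_mask_lane0, check_masks) are hypothetical and not taken from the bug report. It only illustrates the arithmetic: the compiler is free to constant-fold the intrinsics and emit a memory load anyway, so this is not a statement about what GCC actually generates.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: build 0x80000000 in every 32-bit lane using only
   register operations, mirroring the pcmpeqd + pslld pair. */
static __m128i sign_mask_synthesized(void)
{
    __m128i zero = _mm_setzero_si128();
    __m128i ones = _mm_cmpeq_epi32(zero, zero);  /* pcmpeqd: all bits set */
    return _mm_slli_epi32(ones, 31);             /* pslld $31: keep only bit 31 */
}

/* Hypothetical reference: the same value written as a plain constant,
   which a compiler would typically materialize from memory. */
static __m128i sign_mask_loaded(void)
{
    return _mm_set1_epi32(INT32_MIN);            /* bit pattern 0x80000000 */
}

/* Returns nonzero when all four lanes of the two masks are equal. */
static int masks_match(void)
{
    __m128i eq = _mm_cmpeq_epi32(sign_mask_synthesized(), sign_mask_loaded());
    return _mm_movemask_epi8(eq) == 0xFFFF;
}

/* Extracts the lowest 32-bit lane of the synthesized mask. */
static uint32_t sign_mask_lane0(void)
{
    return (uint32_t)_mm_cvtsi128_si32(sign_mask_synthesized());
}

/* Small self-check tying the pieces together. */
static void check_masks(void)
{
    assert(masks_match());
    assert(sign_mask_lane0() == 0x80000000u);
}
```

Calling check_masks() from a test harness confirms that the synthesized value and the constant agree lane by lane; whether synthesis is actually cheaper than a load is the costing question raised in the comment.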