https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039

--- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Ah, in that sense. The extra load is problematic in cold code, where it is
likely to miss in the TLB. For hot code, the load does not depend on any
previous computation and so does not lengthen dependency chains. So it's OK
from a latency point of view. From a throughput point of view there's a
tradeoff: one extra load per chain may be OK, but if every other instruction in
a chain needs a different load, that's probably excessive. So it needs to be
costed somehow.

That said, sufficiently simple constants can be synthesized with SSE in-place
without loading them from memory, for example the constant in the opening
example:

  pcmpeqd %xmm1, %xmm1  // xmm1 = ~0 (all bits set)
  pslld   $31, %xmm1    // each 32-bit lane <<= 31, i.e. 0x80000000

(Again, if we need to synthesize just one constant per chain, that's
preferable; if we need many, the extra work would need to be costed against the
latency improvement of keeping the chain on SSE.)
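
For illustration only, the same two-instruction sequence can be sketched with
SSE2 intrinsics. The function names below (synth_sign_mask, lane, negate) are
made up for this sketch, and using the mask to flip a float's sign bit is an
assumed use case, not something taken from the bug report. Note also that a
compiler is free to constant-fold these intrinsics back into a load from
memory, so this shows the arithmetic rather than guaranteeing the generated
code.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */
#include <stdint.h>

/* Build 0x80000000 in every 32-bit lane without a constant from memory.
   Hypothetical helper name; mirrors pcmpeqd + pslld $31. */
__m128i synth_sign_mask(void)
{
    __m128i x = _mm_setzero_si128();
    __m128i ones = _mm_cmpeq_epi32(x, x);  /* x == x in every lane: all bits set */
    return _mm_slli_epi32(ones, 31);       /* shift each lane left by 31 bits */
}

/* Read one 32-bit lane back out, for inspecting the result. */
uint32_t lane(__m128i v, int i)
{
    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}

/* Assumed use case: negate a float by XOR-ing its sign bit with the mask. */
float negate(float f)
{
    __m128 v = _mm_set_ss(f);
    __m128 m = _mm_castsi128_ps(synth_sign_mask());
    return _mm_cvtss_f32(_mm_xor_ps(v, m));
}
```

Here the whole chain (build mask, XOR, extract) stays in SSE registers, which
is the case where synthesizing a single constant is described above as
preferable.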
