https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103554

--- Comment #3 from Avi Kivity <avi at scylladb dot com> ---
> _Note_ that likely the suboptimal solution presented here is faster because
> it avoids STLF penalties from the calls stack setup which very likely uses
> scalar or differently aligned vector moves.

Interesting point. Agner says (Icelake):

> A read that is bigger than the write, or a read that covers both written and
> unwritten bytes, fails to forward. The write-to-read latency is 19-20 clock
> cycles.

However, the same code is generated when `in` is a reference, in which case it
may not be in the store queue at all, so we're paying two extra instructions
for nothing. movhps is also 2 uops, so we're paying 3 uops to load 2 elements.
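
To make the by-value versus by-reference distinction concrete, here is a minimal
sketch. The types `vec4` and `vec2` and both functions are hypothetical and are
not the code from this report. The only assumption is that the by-value argument
is large enough to be passed in memory, so the caller has just written it, while
the referenced object may have been written long before the call.

```cpp
// Hypothetical types for illustration only; not the reporter's code.
struct vec4 { double a, b, c, d; };
struct vec2 { double x, y; };

// By value: the caller typically stores `in` to the stack right before the
// call, so a wider or differently aligned load here can overlap stores that
// are still in flight. That is the case the forwarding penalty applies to.
vec2 pick_by_value(vec4 in)
{
    return vec2{in.d, in.b};
}

// By reference: the object may have been written long ago and need not be in
// the store queue at all, so there may be no forwarding penalty to avoid.
vec2 pick_by_ref(const vec4& in)
{
    return vec2{in.d, in.b};
}
```

Both functions compute the same result; the difference is only in where the
source bytes are likely to come from at the time of the load.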
