https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103554
--- Comment #3 from Avi Kivity <avi at scylladb dot com> ---
> _Note_ that likely the suboptimal solution presented here is faster because
> it avoids STLF penalties from the calls stack setup which very likely uses
> scalar or differently aligned vector moves.

Interesting point. Agner says (Icelake):

> A read that is bigger than the write, or a read that covers both written and
> unwritten bytes, fails to forward. The write-to-read latency is 19-20 clock
> cycles.

However, the same code is generated when `in` is a reference, in which case it
may not be in the store queue at all, so we're paying two extra instructions
for nothing.

movhps is also 2 uops, so we're paying 3 uops to load 2 elements.
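
To make the quoted forwarding rule concrete, here is a rough C++ sketch of it as a predicate. It only encodes the two failure cases from the quote (read bigger than the write, or read covering both written and unwritten bytes); real hardware has further conditions that are not modelled here. The names `Access` and `can_forward` are made up for illustration and do not come from GCC or from Agner's manual.

```cpp
#include <cstddef>

// Hypothetical model of one memory access: the byte range [addr, addr + size).
struct Access {
    std::size_t addr;
    std::size_t size;
};

// Simplified reading of the quoted rule: a load can be forwarded from a
// single earlier store only if every byte it reads was written by that
// store. A load that is bigger than the store, or that straddles written
// and unwritten bytes, is treated as failing to forward.
constexpr bool can_forward(Access store, Access load) {
    return load.addr >= store.addr &&
           load.addr + load.size <= store.addr + store.size;
}

// An 8-byte load of exactly what an 8-byte store wrote: forwards.
static_assert(can_forward(Access{0, 8}, Access{0, 8}));

// A 16-byte vector load after an 8-byte store: read is bigger than the write.
static_assert(!can_forward(Access{0, 8}, Access{0, 16}));

// An 8-byte load that starts inside a 4-byte store: covers both written and
// unwritten bytes.
static_assert(!can_forward(Access{0, 4}, Access{0, 8}));
```

Under this model, a single wide load of an argument that was stored to the stack as separate scalars would fail to forward, while two narrow loads matching the stores would not, which is the tradeoff being discussed. None of this applies if the data was never in the store queue to begin with.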