https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103554

--- Comment #4 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 6 Dec 2021, avi at scylladb dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103554
> 
> --- Comment #3 from Avi Kivity <avi at scylladb dot com> ---
> > _Note_ that the suboptimal solution presented here is likely faster because
> > it avoids STLF (store-to-load forwarding) penalties from the call's stack
> > setup, which very likely uses scalar or differently aligned vector moves.
> 
> Interesting point. Agner says (Icelake):
> 
> > A read that is bigger than the write, or a read that covers both written
> > and unwritten bytes, fails to forward. The write-to-read latency is 19-20
> > clock cycles.

Note that the penalty is usually much bigger, since the CPU speculatively
issues the load rather than using the data in the store buffers; when
the store retires, the CPU has to flush and restart.
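
For illustration only, here is a minimal C sketch of the access pattern
under discussion: two narrow stores followed by one wider load that covers
both. The names `vec128` and `narrow_stores_wide_load` are hypothetical and
not taken from the bug report. An optimizing compiler may well fold the
buffer away entirely, so this shows the memory access shape rather than
guaranteed generated code.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte type standing in for a vector register. */
typedef struct { uint64_t lo, hi; } vec128;

/* Two 8-byte stores followed by one 16-byte load spanning both.
   If emitted literally, the wide load reads more bytes than either
   store wrote, which is the case described as failing to forward. */
vec128 narrow_stores_wide_load(uint64_t a, uint64_t b)
{
    unsigned char buf[16];
    vec128 v;

    memcpy(buf, &a, 8);      /* narrow store, bytes 0..7  */
    memcpy(buf + 8, &b, 8);  /* narrow store, bytes 8..15 */
    memcpy(&v, buf, 16);     /* wide load covering both stores */
    return v;
}
```

Loading the two halves separately would keep each read no wider than the
store that produced it, at the cost of extra instructions.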

> However, the same code is generated when `in` is a reference, in which case it
> may not be in the store queue at all, so we're paying two extra instructions
> for nothing. movhps is also 2 uops, so we're paying 3 uops to load 2 elements.

Yes - across function boundaries it's difficult to weigh a possible STLF
failure against less optimal code (we've talked about trying to use IPA
analysis to discover the likelihood of a STLF failure).

I just wanted to say that looking at such small code in isolation may
fail to cover important parts of the bigger picture ;)
