https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #40 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 14 Mar 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
> 
> --- Comment #39 from Hongtao.liu <crazylht at gmail dot com> ---
> 
> > I'll see if I get around to prototype some argument classification
> > in the vectorizer (looking how hard it is to use
> > INIT_CUMULATIVE_ARGS in a context where we are not expanding to RTL),
> > unfortunately stack passing is done by code in function.cc (plus
> > extra target hooks of course), but it might be easy enough to figure
> > alignment and size at least (and whether arguments are passed on
> > the stack or not).
> 
> According to the Intel software optimization guide:
> When using an unmasked store instruction, and load instruction after it, data
> forwarding depends on ***load type, size and address offset from store
> address***, and does not depend on the store address itself (i.e., the store
> address does not have to be aligned to or fit into cache line, forwarding will
> occur for nonaligned and even line-split stores).
> The figure below describes all possible cases when data forwarding will occur.
> 
> I'm not sure whether we can get the store size in the vectorizer; how the
> parameter is pushed to the stack by the caller also matters for STLF.

Yes, but since we now use _by_pieces for stack pushing we can try aligning
heuristics on both sides.  The main point of using INIT_CUMULATIVE_ARGS
is of course to figure out whether a decl is passed in registers - there
are plenty of PRs where we get costs wrong for that case.

My additional worry is that we're going to be too pessimistic for
cases that execute long after the argument setup and thus will fetch
from L1 instead of forwarding from the store buffer.
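
To make the store-size vs. load-size concern concrete, here is a minimal
sketch in C of a deliberately simplified forwarding check.  It assumes a
load can only forward from a single earlier store that wrote every byte
the load reads; the real rules in the optimization guide have more cases.
The names (struct access, load_contained_in_store, may_forward) are
hypothetical and not taken from GCC.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of store-to-load forwarding: a load is
   assumed to forward only from ONE earlier store that covers all of its
   bytes.  Real hardware has additional cases and restrictions.  */
struct access {
  size_t offset;  /* byte offset from a common base, e.g. the stack slot */
  size_t size;    /* access width in bytes */
};

/* True if every byte read by LOAD was written by STORE.  */
static bool
load_contained_in_store (struct access store, struct access load)
{
  return load.offset >= store.offset
         && load.offset + load.size <= store.offset + store.size;
}

/* True if some single store among the N entries of STORES covers LOAD.  */
static bool
may_forward (const struct access *stores, size_t n, struct access load)
{
  for (size_t i = 0; i < n; i++)
    if (load_contained_in_store (stores[i], load))
      return true;
  return false;
}
```

Under this model, a caller that pushes a 16-byte argument as two 8-byte
stores followed by a callee doing one 16-byte vector load does not
forward, while a single 16-byte store followed by the same load does.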
