[Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2

rguenther at suse dot de via Gcc-bugs Mon, 28 Mar 2022 23:47:52 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #45 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 29 Mar 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
> 
> --- Comment #43 from Hongtao.liu <crazylht at gmail dot com> ---
> One thing I found by experiments:
> If I insert 64 vaddps %xmm18, %xmm19, %xmm20 instructions (no dependences
> between them, just to emulate pipeline work) before the stalled load, the
> STLF stall case is as fast as the no-stall case on CLX. I guess this is the
> "distance" you mean.

Yes - on the micro-architecture that's likely the point the data is
then available from L1-D.  The "distance" might depend on the store
workload (# of stores that can issue / retire / flush to L1 per cycle).

> Is there any existing structure in GCC from which I can get the latency from
> function entry to the load instruction?

There's the DFA description used by the instruction scheduler.  I'm
not familiar with that part of GCC but IIRC the dependence and DFA
query part should be sufficiently separate.  For OOO
uarchs we can compute a minimum distance based purely on frontend
cycles.  Doing better would need to look at instruction dependences.
I'm not sure if the CPUs we care about use forwarding possibilities
in the decision to OOO schedule loads/stores but IIRC store buffer
entries are allocated early at insn issue time and memory dependences
are taken into account.

Since we have no idea about the instruction sequence before function
entry, going into too much detail will probably suffer from GIGO, so
I'd resort to approximating the frontend side of the pipeline only
by some manual bean counting.
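
A minimal sketch of what such bean counting could look like, written in
Python for brevity rather than as GCC code. The function names, the issue
width and the distance threshold are all hypothetical; none of them come
from GCC or from a measured CPU model.

```python
def frontend_cycles(num_insns: int, issue_width: int) -> int:
    """Lower bound on the cycles the frontend needs to issue num_insns.

    Ignores dependences, decode limits and cache misses: only the
    issue width is modelled (an assumption of this sketch).
    """
    if issue_width <= 0:
        raise ValueError("issue_width must be positive")
    # Integer ceiling division.
    return (num_insns + issue_width - 1) // issue_width


def may_hit_stlf_stall(insns_before_load: int,
                       issue_width: int,
                       min_distance_cycles: int) -> bool:
    """True if the load is closer to function entry than the assumed
    minimum distance, i.e. a store from the caller might still be in
    flight when the load executes."""
    return frontend_cycles(insns_before_load, issue_width) < min_distance_cycles
```

For example, with an assumed issue width of 4, 64 independent instructions
account for at least 16 frontend cycles, so
`may_hit_stlf_stall(64, 4, 16)` is false while
`may_hit_stlf_stall(8, 4, 16)` is true. The threshold of 16 is chosen only
to make the example concrete.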

> And of course for a loop with an unknown trip count, the latency can't be
> estimated exactly. Similarly, when the load is in a join_bb, I guess we need
> to calculate the "average" latency among all possible predecessors?

I'd have simply stopped at backwards reachable blocks since whether
or not a load will forward from a store before function entry will
depend on the iteration number.

Likewise for CFG joins - I suppose one could conservatively assume
the shorter or longer path is taken, dependent on what side we want
to err on (maybe look at the edge probabilities even and choose the
most probable incoming path length).
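
The two points above (ignore paths through loops, pick one incoming path
at joins) could be sketched roughly as follows. This is a Python
illustration only, not GCC code: the CFG representation, the
`entry_distance` name, the `policy` values and the per-edge weights are
hypothetical, and back edges are assumed to be flagged by the caller.

```python
def entry_distance(block, entry, insns, preds, policy="min"):
    """Instructions executed between function entry and the start of `block`.

    insns:  dict block -> number of instructions in that block
    preds:  dict block -> list of (pred_block, weight, is_back_edge)
    policy: "min"    shortest incoming path (conservative: assume a stall)
            "max"    longest incoming path
            "likely" path through the incoming edge with the largest weight
    Back edges are skipped, so only the first trip into a loop is counted.
    """
    if policy not in ("min", "max", "likely"):
        raise ValueError("unknown policy: %r" % policy)
    memo = {}

    def walk(b):
        if b == entry:
            return 0
        if b in memo:
            return memo[b]
        candidates = []
        for src, weight, is_back in preds.get(b, []):
            if is_back:
                continue  # stop at loops: distance depends on the iteration
            candidates.append((walk(src) + insns[src], weight))
        if not candidates:
            raise ValueError("block %r not reachable from entry" % b)
        if policy == "min":
            dist = min(d for d, _ in candidates)
        elif policy == "max":
            dist = max(d for d, _ in candidates)
        else:
            dist = max(candidates, key=lambda c: c[1])[0]
        memo[b] = dist
        return dist

    return walk(block)


# Diamond CFG with a loop latch feeding back into the join block.
insns = {"entry": 2, "then": 10, "else": 3, "join": 1, "latch": 4}
preds = {
    "then": [("entry", 0.8, False)],
    "else": [("entry", 0.2, False)],
    "join": [("then", 0.8, False), ("else", 0.2, False), ("latch", 0.9, True)],
    "latch": [("join", 1.0, False)],
}
shortest = entry_distance("join", "entry", insns, preds, "min")
longest = entry_distance("join", "entry", insns, preds, "max")
likely = entry_distance("join", "entry", insns, preds, "likely")
```

Here `shortest` is 5 (via "else"), while `longest` and `likely` are both 12
(via "then"). The resulting instruction count would then still have to be
converted to frontend cycles before comparing it against a distance
threshold.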
