https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
--- Comment #45 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 29 Mar 2022, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
>
> --- Comment #43 from Hongtao.liu <crazylht at gmail dot com> ---
> One thing I found by experiments:
> Insert 64 vaddps %xmm18, %xmm19, %xmm20(no dependence between each other, just
> emulate for pipeline) before stalled load, stlf stall case is as fast as no
> stall cases on CLX. I guess this is "distance" you mean.

Yes - on the micro-architecture that's likely the point the data is then
available from L1-D.  The "distance" might depend on the store workload
(# of stores that can issue / retire / flush to L1 per cycle).

> Is there any existed structure in GCC I can get latency from entry to the load
> instruction?

There's the DFA description used by the instruction scheduler.  I'm not
familiar with that part of GCC but IIRC the dependence and DFA query part
should be sufficiently separate.

For OOO uarchs we can compute a minimum distance based purely on frontend
cycles.  Doing better would need to look at instruction dependences.  I'm
not sure if the CPUs we care about use forwarding possibilities in the
decision to OOO schedule loads/stores but IIRC store buffer entries are
allocated early at insn issue time and memory dependences are taken into
account.

Since we have no idea about the instruction sequence before function entry
going into too much detail will probably suffer from GIGO so I'd resort to
approximating the frontend side of the pipeline only by some manual bean
counting.

> And of course for loop with unknown trip count, latency can't be
> exactly estimated. Similar for cases when load is in join_bb, guess we need to
> calculate "average" latency among all possible predecessors?

I'd have simply stopped at backwards reachable blocks since whether or not
a load will forward from a store before function entry will depend on the
iteration number.
Likewise for CFG joins - I suppose one could conservatively assume the shorter or longer path is taken, dependent on what side we want to err on (maybe look at the edge probabilities even and choose the most probable incoming path length).
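
To make the bean counting concrete, here is a rough standalone sketch of the idea, not GCC code. The Edge/Block types, the JoinPolicy enum and both function names are made up for illustration, the per-edge probability is assumed to mean "chance the block is entered via this edge", and blocks are assumed to be numbered in reverse post-order so that a back edge is simply one whose source index is not smaller than the destination. One possible reading of "stop at backwards reachable blocks" is used: as soon as a back edge shows up on the way to the load, the estimate is abandoned.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical, simplified CFG types; these are NOT GCC's basic_block/edge.
struct Edge {
  int src;      // index of the predecessor block
  double prob;  // assumed: probability that the block is entered via this edge
};

struct Block {
  int n_insns = 0;          // instructions the frontend must issue for this block
  std::vector<Edge> preds;  // incoming edges; empty for the function entry
};

// Which incoming path to believe at a CFG join.
enum class JoinPolicy { Shortest, Longest, MostProbable };

// Number of instructions issued between function entry and the start of
// block `b`, along the incoming path selected by `policy`.  Blocks are
// assumed to be numbered in reverse post-order, so an edge with src >= b is
// a back edge; in that case we give up and return std::nullopt.
// Plain recursion revisits shared predecessors, which is fine for a sketch.
std::optional<int> insns_before(const std::vector<Block>& cfg, int b,
                                JoinPolicy policy) {
  if (cfg[b].preds.empty()) return 0;  // function entry
  std::optional<int> best;
  double best_prob = -1.0;
  for (const Edge& e : cfg[b].preds) {
    if (e.src >= b) return std::nullopt;  // back edge: distance is unknown
    std::optional<int> before = insns_before(cfg, e.src, policy);
    if (!before) return std::nullopt;     // a loop further up the path
    int len = *before + cfg[e.src].n_insns;
    bool take = false;
    switch (policy) {
      case JoinPolicy::Shortest:     take = !best || len < *best; break;
      case JoinPolicy::Longest:      take = !best || len > *best; break;
      case JoinPolicy::MostProbable: take = e.prob > best_prob;   break;
    }
    if (take) {
      best = len;
      best_prob = e.prob;
    }
  }
  return best;
}

// Minimum entry-to-load distance in frontend cycles, assuming the frontend
// issues `issue_width` instructions per cycle and ignoring dependences.
std::optional<int> min_frontend_cycles(const std::vector<Block>& cfg,
                                       int load_block, int insns_before_load,
                                       int issue_width, JoinPolicy policy) {
  assert(issue_width > 0 && insns_before_load >= 0);
  std::optional<int> before = insns_before(cfg, load_block, policy);
  if (!before) return std::nullopt;
  int total = *before + insns_before_load;
  return (total + issue_width - 1) / issue_width;  // round up to whole cycles
}
```

A caller would compare the returned cycle count against some per-uarch threshold and treat std::nullopt as "no estimate". Whether Shortest or Longest is the conservative choice depends on which side one wants to err on.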