On Thu, Aug 29, 2019 at 7:36 PM Alexander Monakov <amona...@ispras.ru> wrote:
>
> On Thu, 29 Aug 2019, Maxim Kuvyrkov wrote:
>
> > >> r1 = [rb + 0]
> > >> <math with r1>
> > >> r2 = [rb + 8]
> > >> <math with r2>
> > >> r3 = [rb + 16]
> > >> <math with r3>
> > >>
> > >> which, apparently, cortex-a53 autoprefetcher doesn't recognize.  This
> > >> schedule happens because r2= load gets lower priority than the
> > >> "irrelevant" <math with r1> due to the above patch.
> > >>
> > >> If we think about it, the fact that "r1 = [rb + 0]" can be scheduled
> > >> means that true dependencies of all similar base+offset loads are
> > >> resolved.  Therefore, for autoprefetcher-friendly schedule we should
> > >> prioritize memory reads before "irrelevant" instructions.
> > >
> > > But isn't there also max number of load issues in a fetch window to
> > > consider?  So interleaving arithmetic with loads might be profitable.
> >
> > It appears that cores with autoprefetcher hardware prefer loads and stores
> > bundled together, not interspersed with other instructions to occupy the
> > rest of CPU units.
>
> Let me point out that the motivating example has a bigger effect in play:
>
> (1) r1 = [rb + 0]
> (2) <math with r1>
> (3) r2 = [rb + 8]
> (4) <math with r2>
> (5) r3 = [rb + 16]
> (6) <math with r3>
>
> here Cortex-A53, being an in-order core, cannot issue the load at (3) until
> after the load at (1) has completed, because the use at (2) depends on it.
> The good schedule allows the three loads to issue in a pipelined fashion.

OK, so with dispatch/issue constraints I was thinking of scheduling independent
work like AGU ops in between the loads.  It might be that some in-order
cores like to see two adjacent loads to trigger auto-prefetching, but any such
heuristic should probably be very sub-architecture specific.
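
To make the in-order argument quoted above concrete, here is a toy model.
It is only a sketch, not a model of the real Cortex-A53 pipeline: it assumes
a single-issue in-order core, a 3-cycle load latency and 1-cycle ALU ops
(all illustrative numbers), and the names m1..m3, a1 and a2 are hypothetical
result registers invented for the example.

```python
LOAD_LATENCY = 3  # assumed for illustration, not a measured A53 value
ALU_LATENCY = 1   # assumed


def simulate(schedule):
    """Return the issue cycle of each instruction on a toy single-issue,
    in-order core.  An instruction issues no earlier than one cycle after
    its predecessor and no earlier than the cycle all its sources are ready."""
    ready = {}         # register -> cycle at which its value is available
    issue_cycles = []
    next_slot = 0      # earliest cycle the next instruction may issue
    for dest, srcs, latency in schedule:
        cycle = max([next_slot] + [ready.get(s, 0) for s in srcs])
        issue_cycles.append(cycle)
        ready[dest] = cycle + latency
        next_slot = cycle + 1
    return issue_cycles


def load(dest):
    # r = [rb + offset]; the base register rb is always ready in this model
    return (dest, ["rb"], LOAD_LATENCY)


def math(dest, src):
    # <math with src>
    return (dest, [src], ALU_LATENCY)


def indep(dest):
    # independent work (e.g. an address computation) with no pending inputs
    return (dest, [], ALU_LATENCY)


# The schedule from the motivating example: each load is followed by its use.
interleaved = [load("r1"), math("m1", "r1"),
               load("r2"), math("m2", "r2"),
               load("r3"), math("m3", "r3")]

# Loads bundled together, uses afterwards.
bundled = [load("r1"), load("r2"), load("r3"),
           math("m1", "r1"), math("m2", "r2"), math("m3", "r3")]

# Independent work placed between the loads, uses afterwards.
independent = [load("r1"), indep("a1"), load("r2"), indep("a2"), load("r3"),
               math("m1", "r1"), math("m2", "r2"), math("m3", "r3")]

print(simulate(interleaved))  # [0, 3, 4, 7, 8, 11]
print(simulate(bundled))      # [0, 1, 2, 3, 4, 5]
print(simulate(independent))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Under these assumptions the interleaved schedule stalls on every use, while
both the bundled loads and the loads separated by independent work issue
without a stall.  The model says nothing about the autoprefetcher itself,
only about the in-order issue effect.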

> So essentially the main issue is not a hardware peculiarity, but rather the
> bad schedule being totally wrong (it could only make sense if loads had
> 1-cycle latency, which they do not).
>
> I think this highlights how implementing this autoprefetch heuristic via the
> dfa_lookahead_guard interface looks questionable in the first place, but the
> patch itself makes sense to me.
>
> Alexander
