On Thu, Aug 29, 2019 at 7:36 PM Alexander Monakov <amona...@ispras.ru> wrote:
>
> On Thu, 29 Aug 2019, Maxim Kuvyrkov wrote:
>
> > >> r1 = [rb + 0]
> > >> <math with r1>
> > >> r2 = [rb + 8]
> > >> <math with r2>
> > >> r3 = [rb + 16]
> > >> <math with r3>
> > >>
> > >> which, apparently, the cortex-a53 autoprefetcher doesn't recognize.
> > >> This schedule happens because the r2= load gets lower priority than
> > >> the "irrelevant" <math with r1> due to the above patch.
> > >>
> > >> If we think about it, the fact that "r1 = [rb + 0]" can be scheduled
> > >> means that the true dependencies of all similar base+offset loads are
> > >> resolved.  Therefore, for an autoprefetcher-friendly schedule we
> > >> should prioritize memory reads ahead of "irrelevant" instructions.
> > >
> > > But isn't there also a max number of load issues per fetch window to
> > > consider?  So interleaving arithmetic with loads might be profitable.
> >
> > It appears that cores with autoprefetcher hardware prefer loads and
> > stores bundled together, not interspersed with other instructions that
> > occupy the rest of the CPU units.
>
> Let me point out that the motivating example has a bigger effect in play:
>
> (1) r1 = [rb + 0]
> (2) <math with r1>
> (3) r2 = [rb + 8]
> (4) <math with r2>
> (5) r3 = [rb + 16]
> (6) <math with r3>
>
> Here Cortex-A53, being an in-order core, cannot issue the load at (3) until
> after the load at (1) has completed, because the use at (2) depends on it.
> The good schedule allows the three loads to issue in a pipelined fashion.
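For reference, the "good" schedule being described would look roughly like
this, with the loads hoisted above the dependent math so each load can issue
while the previous one is still in flight (an illustrative sketch in the same
notation as the example above, not necessarily the exact order the scheduler
produces):

  r1 = [rb + 0]
  r2 = [rb + 8]
  r3 = [rb + 16]
  <math with r1>
  <math with r2>
  <math with r3>

This also keeps the three base+offset loads adjacent, which is the pattern
the autoprefetch heuristic is trying to preserve.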
OK, so regarding dispatch/issue constraints I was thinking of scheduling
independent work like AGU ops in between the loads (see the sketch at the
end of this mail).  It might be that some in-order cores like to see two
adjacent loads to fire auto-prefetching, but any such heuristic should
probably be very sub-architecture specific.

> So essentially the main issue is not a hardware peculiarity, but rather
> the bad schedule being totally wrong (it could only make sense if loads
> had 1-cycle latency, which they do not).
>
> I think this highlights how implementing this autoprefetch heuristic via
> the dfa_lookahead_guard interface looks questionable in the first place,
> but the patch itself makes sense to me.
>
> Alexander
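To make the interleaving idea above concrete, here is a sketch of what
filling the second issue slot of a dual-issue in-order core with independent
address arithmetic, rather than with math that depends on the load just
issued, might look like (register names and offsets are made up for
illustration):

  r1 = [rb + 0]
  ra = rb + 64        <- independent AGU op, no dependence on r1
  r2 = [rb + 8]
  rc = rb + 128       <- more independent address math
  r3 = [rb + 16]
  <math with r1>
  <math with r2>
  <math with r3>

The loads stay close together and their latency is still hidden, but whether
an autoprefetcher still recognizes the stream with other instructions mixed
in between is exactly the sub-architecture-specific question.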