On August 29, 2019 5:40:47 PM GMT+02:00, Maxim Kuvyrkov <maxim.kuvyr...@linaro.org> wrote:
>Hi,
>
>This patch tweaks autoprefetcher heuristic in scheduler to better group
>memory loads and stores together.
>
>From https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91598:
>
>There are two separate changes, both related to instruction scheduler,
>that cause the regression. The first change in r253235 is responsible
>for 70% of the regression.
>===
> haifa-sched: fix autopref_rank_for_schedule qsort comparator
>
> * haifa-sched.c (autopref_rank_for_schedule): Order 'irrelevant' insns
> first, always call autopref_rank_data otherwise.
>
>git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@253235
>138bc75d-0d04-0410-961f-82ee72b054a4
>===
>
>After this change instead of
>r1 = [rb + 0]
>r2 = [rb + 8]
>r3 = [rb + 16]
>r4 = <math with r1>
>r5 = <math with r2>
>r6 = <math with r3>
>
>we get
>r1 = [rb + 0]
><math with r1>
>r2 = [rb + 8]
><math with r2>
>r3 = [rb + 16]
><math with r3>
>
>which, apparently, cortex-a53 autoprefetcher doesn't recognize. This
>schedule happens because r2= load gets lower priority than the
>"irrelevant" <math with r1> due to the above patch.
>
>If we think about it, the fact that "r1 = [rb + 0]" can be scheduled
>means that true dependencies of all similar base+offset loads are
>resolved. Therefore, for autoprefetcher-friendly schedule we should
>prioritize memory reads before "irrelevant" instructions.

But isn't there also a maximum number of loads that can issue per fetch
window to consider? If so, interleaving arithmetic with loads might
still be profitable.

>On the other hand, following similar logic, we want to delay memory
>stores as much as possible to start scheduling them only after all
>potential producers are scheduled. I.e., for autoprefetcher-friendly
>schedule we should prioritize "irrelevant" instructions before memory
>writes.
>
>Obvious patch to implement the above is attached. It brings 70% of
>regressed performance on this testcase back.
>
>OK to commit?
>
>Regards,
>
>--
>Maxim Kuvyrkov
>www.linaro.org
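
For reference, the ordering described in the quoted text (memory reads
first, "irrelevant" instructions next, memory writes last) can be
sketched as a plain qsort comparator. This is only an illustration under
stated assumptions, not the attached patch and not the code in
haifa-sched.c: struct insn, enum insn_kind, kind_rank, cmp_insn and
sort_ready are all hypothetical names, and the real
autopref_rank_for_schedule is one of several criteria applied to the
ready list. It also does not model any per-window limit on load issue.

```c
#include <stdlib.h>

/* Hypothetical instruction classes; not GCC's actual data structures. */
enum insn_kind { INSN_LOAD, INSN_OTHER, INSN_STORE };

struct insn {
  enum insn_kind kind;
  int offset; /* base+offset for memory ops; ignored for INSN_OTHER */
  int seq;    /* original position, used as the final tie-break */
};

/* Loads rank first, "irrelevant" insns in the middle, stores last. */
static int kind_rank(enum insn_kind k)
{
  switch (k) {
  case INSN_LOAD:  return 0;
  case INSN_OTHER: return 1;
  default:         return 2;
  }
}

/* qsort comparator: order by class, then by offset for memory ops,
   then by original position so the result is deterministic.  */
static int cmp_insn(const void *pa, const void *pb)
{
  const struct insn *a = pa;
  const struct insn *b = pb;
  int ra = kind_rank(a->kind);
  int rb = kind_rank(b->kind);

  if (ra != rb)
    return ra - rb;
  if (a->kind != INSN_OTHER && a->offset != b->offset)
    return (a->offset > b->offset) - (a->offset < b->offset);
  return (a->seq > b->seq) - (a->seq < b->seq);
}

/* Sort a set of ready instructions into the order sketched above. */
void sort_ready(struct insn *v, size_t n)
{
  qsort(v, n, sizeof *v, cmp_insn);
}
```

With three loads at offsets 0, 8 and 16, two "irrelevant" instructions
and one store in the ready set, this places the loads first in ascending
offset order, then the two other instructions, then the store.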