On 05/25/2018 11:54 AM, Richard Biener wrote:
> On May 25, 2018 6:57:13 PM GMT+02:00, Jeff Law <l...@redhat.com> wrote:
>> On 05/25/2018 03:49 AM, Bin.Cheng wrote:
>>> On Fri, May 25, 2018 at 10:23 AM, Prathamesh Kulkarni
>>> <prathamesh.kulka...@linaro.org> wrote:
>>>> On 23 May 2018 at 18:37, Jeff Law <l...@redhat.com> wrote:
>>>>> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
>>>>>> On 23 May 2018 at 13:58, Richard Biener <rguent...@suse.de> wrote:
>>>>>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I am trying to work on PR80155, which exposes a problem with code
>>>>>>>> hoisting and register pressure on a leading embedded benchmark
>>>>>>>> for ARM cortex-m7, where code hoisting causes an extra register
>>>>>>>> spill.
>>>>>>>>
>>>>>>>> I have attached two test-cases which (hopefully) are
>>>>>>>> representative of the original test-case. The first one
>>>>>>>> (trans_dfa.c) is bigger and somewhat similar to the original
>>>>>>>> test-case, and trans_dfa_2.c is a hand-reduced version of
>>>>>>>> trans_dfa.c. There are two spills with trans_dfa.c and one spill
>>>>>>>> with trans_dfa_2.c, due to its smaller number of cases. The
>>>>>>>> test-cases in the PR are probably not relevant.
>>>>>>>>
>>>>>>>> Initially I thought the spill was happening because of "too many
>>>>>>>> hoistings" taking place in the original test-case, thus
>>>>>>>> increasing the register pressure, but it seems the spill is
>>>>>>>> possibly caused because an expression gets hoisted out of a block
>>>>>>>> that is on a loop exit.
>>>>>>>>
>>>>>>>> For example, the following hoistings take place with
>>>>>>>> trans_dfa_2.c:
>>>>>>>>
>>>>>>>> (1) Inserting expression in block 4 for code hoisting:
>>>>>>>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
>>>>>>>>
>>>>>>>> (2) Inserting expression in block 4 for code hoisting:
>>>>>>>> {plus_expr,_4,1} (0006)
>>>>>>>>
>>>>>>>> (3) Inserting expression in block 4 for code hoisting:
>>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>>
>>>>>>>> (4) Inserting expression in block 3 for code hoisting:
>>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>>
>>>>>>>> The issue seems to be the hoisting of (*tab + 1), which consists
>>>>>>>> of the first two hoistings into block 4 from blocks 5 and 9, and
>>>>>>>> which causes the extra spill. I verified that by disabling
>>>>>>>> hoisting into block 4, which resulted in no extra spills.
>>>>>>>>
>>>>>>>> I wonder if that's because the expression (*tab + 1) is getting
>>>>>>>> hoisted from blocks 5 and 9, which are on a loop exit? So the
>>>>>>>> expression that was previously computed in a block on a loop exit
>>>>>>>> gets hoisted outside that block, which possibly makes the
>>>>>>>> allocator more defensive? Similarly, disabling hoisting of
>>>>>>>> expressions which appeared in blocks on a loop exit in the
>>>>>>>> original test-case prevented the extra spill. The other hoistings
>>>>>>>> didn't seem to matter.
>>>>>>>
>>>>>>> I think that's simply coincidence. The only thing that makes a
>>>>>>> block that also exits from the loop special is that an expression
>>>>>>> could be sunk out of the loop, and hoisting (commoning with
>>>>>>> another path) could prevent that. But that isn't what is happening
>>>>>>> here, and it would be a pass ordering issue, as the sinking pass
>>>>>>> runs only after hoisting (no idea why exactly, but I guess there
>>>>>>> are cases where we want to prefer CSE over sinking). So you could
>>>>>>> try whether re-ordering PRE and sinking helps your testcase.
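[The attached test-cases are not reproduced in this thread. As a rough
orientation for readers without the attachments, the pattern under
discussion is approximately the following hypothetical C sketch -- the
function name, characters, and state updates are made up, and this is an
approximation of trans_dfa_2.c, not the actual file:

    /* Both branches read, increment, and store *tab, so GIMPLE code
       hoisting can common the load of *tab and the addition into the
       block that computes the branch, leaving only the two stores
       behind -- tearing the read-modify-write apart and keeping the
       incremented value live across the branch.  */
    int
    trans (const char *s, int *tab)
    {
      int state = 0;
      while (*s)
        {
          char c = *s++;
          if (c == 'A')
            {
              (*tab)++;             /* RMW on the first path...   */
              state = state * 2 + 1;
            }
          else
            {
              (*tab)++;             /* ...and on the second path. */
              state = state * 3 + 2;
            }
        }
      return state;
    }
]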
>>>>>> Thanks for the suggestions. Placing the sink pass before PRE works
>>>>>> for both these test-cases! Sadly, it still causes the spill for the
>>>>>> benchmark :-(
>>>>>> I will try to create a better approximation of the original
>>>>>> test-case.
>>>>>>>
>>>>>>> What I do see is a missed opportunity to merge the successors
>>>>>>> of BB 4. After PRE we have
>>>>>>>
>>>>>>> <bb 4> [local count: 159303558]:
>>>>>>> <L1>:
>>>>>>> pretmp_123 = *tab_37(D);
>>>>>>> _87 = pretmp_123 + 1;
>>>>>>> if (c_36 == 65)
>>>>>>>   goto <bb 5>; [34.00%]
>>>>>>> else
>>>>>>>   goto <bb 8>; [66.00%]
>>>>>>>
>>>>>>> <bb 5> [local count: 54163210]:
>>>>>>> *tab_37(D) = _87;
>>>>>>> _96 = MEM[(char *)s_57 + 1B];
>>>>>>> if (_96 != 0)
>>>>>>>   goto <bb 7>; [89.00%]
>>>>>>> else
>>>>>>>   goto <bb 6>; [11.00%]
>>>>>>>
>>>>>>> <bb 8> [local count: 105140348]:
>>>>>>> *tab_37(D) = _87;
>>>>>>> _56 = MEM[(char *)s_57 + 1B];
>>>>>>> if (_56 != 0)
>>>>>>>   goto <bb 10>; [89.00%]
>>>>>>> else
>>>>>>>   goto <bb 9>; [11.00%]
>>>>>>>
>>>>>>> Here at least the stores and loads can be hoisted. Note this may
>>>>>>> also point at the real issue with the code hoisting, which is
>>>>>>> tearing apart the RMW operation?
>>>>>> Indeed, this possibility seems much more likely than the block
>>>>>> being on a loop exit. I will try to "hardcode" the load/store
>>>>>> hoists into block 4 for this specific test-case to check if that
>>>>>> prevents the spill.
>>>>> Even if it prevents the spill in this case, it's likely a good thing
>>>>> to do. The statements prior to the conditional in bb5 and bb8 should
>>>>> be hoisted, leaving bb5 and bb8 with just their conditionals.
>>>> Hi,
>>>> It seems disabling forwprop somehow manages to avoid the extra spills
>>>> on the original test-case.
>>>>
>>>> For instance,
>>>> Hoisting without forwprop:
>>>>
>>>> bb 3:
>>>> _1 = tab_1(D) + 8
>>>> pretmp_268 = MEM[tab_1(D) + 8B];
>>>> _2 = pretmp_268 + 1;
>>>> goto <bb 4> or <bb 5>
>>>>
>>>> bb 4:
>>>> *_1 = _2
>>>>
>>>> bb 5:
>>>> *_1 = _2
>>>>
>>>> Hoisting with forwprop:
>>>>
>>>> bb 3:
>>>> pretmp_164 = MEM[tab_1(D) + 8B];
>>>> _2 = pretmp_164 + 1
>>>> goto <bb 4> or <bb 5>
>>>>
>>>> bb 4:
>>>> MEM[tab_1(D) + 8] = _2;
>>>>
>>>> bb 5:
>>>> MEM[tab_1(D) + 8] = _2;
>>>>
>>>> Although in both cases we aren't hoisting stores, the issue with
>>>> forwprop for this case seems to be the folding of
>>>> *_1 = _2
>>>> into
>>>> MEM[tab_1(D) + 8] = _2 ?
>>>
>>> This isn't an issue, right? IIUC, tab_1(D) is used all over the loop,
>>> thus propagating _1 as (tab_1(D) + 8) actually removes one live range.
>>>
>>>> Disabling the folding to mem_ref[base + offset] in forwprop "works"
>>>> in the sense that it creates the same set of hoistings as without
>>>> forwprop; however, it still results in additional spills (albeit of
>>>> different registers).
>>>>
>>>> That's because forwprop seems to be increasing the live range of
>>>> prephitmp_217 by substituting _221 + 1 with prephitmp_217 + 2 (_221
>>>> is defined as prephitmp_217 + 1).
>>> Hmm, it's hard to discuss private benchmarks; I am not sure in which
>>> dump I should find the prephitmp_221/prephitmp_217 stuff.
>>>
>>>> On the other hand, Bin pointed out to me in private that forwprop
>>>> also helps to restrict register pressure by propagating
>>>> "tab + const_int" for the same test-case.
>>>>
>>>> So I am not really sure if there's an easier fix than having
>>>> heuristics for estimating register pressure at the TREE level? I
>>>> would be
>>> An easy fix, maybe not.
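[A hypothetical source-level sketch of the live-range effect described
above -- the names and constants are made up, not taken from the private
benchmark:

    int
    f (int a, int *out)
    {
      int b = a + 1;   /* without forwprop, 'a' could die here...    */
      out[0] = b;
      int c = b + 1;   /* ...but forwprop folds this to c = a + 2,
                          so 'a' now stays live down to this point
                          and overlaps with 'b'.                     */
      out[1] = c;
      return b;
    }

Whether the substitution helps or hurts depends on which of the two
names has other uses nearby, which may be why the same pass both caused
and avoided spills in the experiments above.]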
>>> OTOH, I am more convinced that passes like forwprop/sink/hoisting can
>>> be improved by taking live ranges into consideration -- specifically,
>>> to guide such passes when moving code around different basic blocks,
>>> because inter-block register pressure is hard to resolve afterwards.
>>>
>>> As suggested by Jeff and Richi, I guess the first step would be doing
>>> experiments, collecting more benchmark data for reordering sink
>>> before pre? It enables code sinking as well as decreasing register
>>> pressure in the original reduced cases, IIRC.
>> We might even consider re-evaluating Bernd's work on what is
>> effectively a gimple scheduler to minimize register pressure.
>
> Sure. The main issue I see here is the interaction with TER, which we
> unfortunately still rely on. Enough GIMPLE instruction selection might
> help to get rid of the remaining pieces...
I really wonder how bad it would be to walk over expr.c and change the
expanders to be able to walk SSA_NAME_DEF_STMT to potentially get at the
more complex statements rather than relying on TER. That's really all
TER is supposed to be doing anyway.
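[A rough sketch of the shape such an expander-side walk could take. This
is hypothetical code, not anything in expr.c, and it is not compilable
outside GCC's own source tree; the function name is made up, and the
constraints TER enforces today (same-BB, no intervening stores or side
effects) are only hinted at:

    /* If NAME is an SSA name whose defining statement could be
       expanded directly into the use, return that statement,
       otherwise NULL.  Mirrors TER's single-use substitution, but
       performed lazily at expansion time via SSA_NAME_DEF_STMT.  */
    static gimple *
    get_expandable_def (tree name)
    {
      if (TREE_CODE (name) != SSA_NAME
          || !has_single_use (name))
        return NULL;
      gimple *def = SSA_NAME_DEF_STMT (name);
      if (!is_gimple_assign (def))
        return NULL;
      return def;
    }
]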

Jeff