On May 25, 2018 9:25:51 PM GMT+02:00, Jeff Law <l...@redhat.com> wrote:
>On 05/25/2018 11:54 AM, Richard Biener wrote:
>> On May 25, 2018 6:57:13 PM GMT+02:00, Jeff Law <l...@redhat.com> wrote:
>>> On 05/25/2018 03:49 AM, Bin.Cheng wrote:
>>>> On Fri, May 25, 2018 at 10:23 AM, Prathamesh Kulkarni
>>>> <prathamesh.kulka...@linaro.org> wrote:
>>>>> On 23 May 2018 at 18:37, Jeff Law <l...@redhat.com> wrote:
>>>>>> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
>>>>>>> On 23 May 2018 at 13:58, Richard Biener <rguent...@suse.de> wrote:
>>>>>>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> I am trying to work on PR80155, which exposes a problem with
>>>>>>>>> code hoisting and register pressure on a leading embedded
>>>>>>>>> benchmark for ARM cortex-m7, where code hoisting causes an
>>>>>>>>> extra register spill.
>>>>>>>>>
>>>>>>>>> I have attached two test cases which (hopefully) are
>>>>>>>>> representative of the original test case.
>>>>>>>>> The first one (trans_dfa.c) is bigger and somewhat similar to
>>>>>>>>> the original test case, and trans_dfa_2.c is a hand-reduced
>>>>>>>>> version of trans_dfa.c. There are two spills caused with
>>>>>>>>> trans_dfa.c and one spill with trans_dfa_2.c, due to the
>>>>>>>>> smaller number of cases. The test cases in the PR are
>>>>>>>>> probably not relevant.
>>>>>>>>>
>>>>>>>>> Initially I thought the spill was happening because of "too
>>>>>>>>> many hoistings" taking place in the original test case, thus
>>>>>>>>> increasing the register pressure, but it seems the spill is
>>>>>>>>> possibly caused because the expression gets hoisted out of a
>>>>>>>>> block that is on a loop exit.
>>>>>>>>>
>>>>>>>>> For example, the following hoistings take place with
>>>>>>>>> trans_dfa_2.c:
>>>>>>>>>
>>>>>>>>> (1) Inserting expression in block 4 for code hoisting:
>>>>>>>>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
>>>>>>>>>
>>>>>>>>> (2) Inserting expression in block 4 for code hoisting:
>>>>>>>>> {plus_expr,_4,1} (0006)
>>>>>>>>>
>>>>>>>>> (3) Inserting expression in block 4 for code hoisting:
>>>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>>>
>>>>>>>>> (4) Inserting expression in block 3 for code hoisting:
>>>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>>>
>>>>>>>>> The issue seems to be the hoisting of (*tab + 1), which
>>>>>>>>> consists of the first two hoistings into block 4 from blocks
>>>>>>>>> 5 and 9, and which causes the extra spill. I verified that by
>>>>>>>>> disabling hoisting into block 4, which resulted in no extra
>>>>>>>>> spills.
>>>>>>>>>
>>>>>>>>> I wonder if that's because the expression (*tab + 1) is
>>>>>>>>> getting hoisted from blocks 5 and 9, which are on a loop
>>>>>>>>> exit? So the expression that was previously computed in a
>>>>>>>>> block on a loop exit gets hoisted outside that block, which
>>>>>>>>> possibly makes the allocator more defensive? Similarly,
>>>>>>>>> disabling the hoisting of expressions which appeared in
>>>>>>>>> blocks on loop exits in the original test case prevented the
>>>>>>>>> extra spill. The other hoistings didn't seem to matter.
>>>>>>>>
>>>>>>>> I think that's simply coincidence. The only thing that makes a
>>>>>>>> block that also exits from the loop special is that an
>>>>>>>> expression could be sunk out of the loop, and hoisting
>>>>>>>> (commoning with another path) could prevent that. But that
>>>>>>>> isn't what is happening here, and it would be a pass ordering
>>>>>>>> issue, as the sinking pass runs only after hoisting (no idea
>>>>>>>> why exactly, but I guess there are cases where we want to
>>>>>>>> prefer CSE over sinking). So you could try whether re-ordering
>>>>>>>> PRE and sinking helps your testcase.
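
As an illustration of the reordering experiment suggested above, a
minimal sketch of what the swap might look like in gcc/passes.def.
The surrounding pass names (pass_split_crit_edges, pass_pre,
pass_sink_code) are real, but their exact neighbourhood varies between
releases, so treat the placement shown here as an assumption, not a
vetted patch:

    /* gcc/passes.def, inside pass_all_optimizations (sketch only).
       Stock order runs PRE first and sinking afterwards:  */
    NEXT_PASS (pass_split_crit_edges);
    NEXT_PASS (pass_pre);
    NEXT_PASS (pass_sink_code);

    /* The experiment: run sinking before PRE instead, so values
       that can be sunk out of the loop are moved before hoisting
       gets a chance to common them with another path.  */
    NEXT_PASS (pass_split_crit_edges);
    NEXT_PASS (pass_sink_code);
    NEXT_PASS (pass_pre);
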
>>>>>>> Thanks for the suggestions. Placing the sink pass before PRE
>>>>>>> works for both these test cases! Sadly, it still causes the
>>>>>>> spill for the benchmark :-(
>>>>>>> I will try to create a better approximation of the original
>>>>>>> test case.
>>>>>>>>
>>>>>>>> What I do see is a missed opportunity to merge the successors
>>>>>>>> of BB 4. After PRE we have
>>>>>>>>
>>>>>>>> <bb 4> [local count: 159303558]:
>>>>>>>> <L1>:
>>>>>>>> pretmp_123 = *tab_37(D);
>>>>>>>> _87 = pretmp_123 + 1;
>>>>>>>> if (c_36 == 65)
>>>>>>>>   goto <bb 5>; [34.00%]
>>>>>>>> else
>>>>>>>>   goto <bb 8>; [66.00%]
>>>>>>>>
>>>>>>>> <bb 5> [local count: 54163210]:
>>>>>>>> *tab_37(D) = _87;
>>>>>>>> _96 = MEM[(char *)s_57 + 1B];
>>>>>>>> if (_96 != 0)
>>>>>>>>   goto <bb 7>; [89.00%]
>>>>>>>> else
>>>>>>>>   goto <bb 6>; [11.00%]
>>>>>>>>
>>>>>>>> <bb 8> [local count: 105140348]:
>>>>>>>> *tab_37(D) = _87;
>>>>>>>> _56 = MEM[(char *)s_57 + 1B];
>>>>>>>> if (_56 != 0)
>>>>>>>>   goto <bb 10>; [89.00%]
>>>>>>>> else
>>>>>>>>   goto <bb 9>; [11.00%]
>>>>>>>>
>>>>>>>> Here at least the stores and loads can be hoisted. Note this
>>>>>>>> may also point at the real issue with the code hoisting, which
>>>>>>>> is tearing apart the RMW operation?
>>>>>>> Indeed, this possibility seems much more likely than the block
>>>>>>> being on a loop exit.
>>>>>>> I will try to "hardcode" the load/store hoists into block 4 for
>>>>>>> this specific test case to check if that prevents the spill.
>>>>>> Even if it prevents the spill in this case, it's likely a good
>>>>>> thing to do. The statements prior to the conditionals in bb5 and
>>>>>> bb8 should be hoisted, leaving bb5 and bb8 with just their
>>>>>> conditionals.
>>>>> Hi,
>>>>> It seems disabling forwprop somehow works for causing no extra
>>>>> spills on the original test case.
>>>>>
>>>>> For instance,
>>>>> Hoisting without forwprop:
>>>>>
>>>>> bb 3:
>>>>> _1 = tab_1(D) + 8
>>>>> pretmp_268 = MEM[tab_1(D) + 8B];
>>>>> _2 = pretmp_268 + 1;
>>>>> goto <bb 4> or <bb 5>
>>>>>
>>>>> bb 4:
>>>>> *_1 = _2
>>>>>
>>>>> bb 5:
>>>>> *_1 = _2
>>>>>
>>>>> Hoisting with forwprop:
>>>>>
>>>>> bb 3:
>>>>> pretmp_164 = MEM[tab_1(D) + 8B];
>>>>> _2 = pretmp_164 + 1
>>>>> goto <bb 4> or <bb 5>
>>>>>
>>>>> bb 4:
>>>>> MEM[tab_1(D) + 8] = _2;
>>>>>
>>>>> bb 5:
>>>>> MEM[tab_1(D) + 8] = _2;
>>>>>
>>>>> Although in both cases we aren't hoisting stores, the issue with
>>>>> forwprop in this case seems to be the folding of
>>>>> *_1 = _2
>>>>> into
>>>>> MEM[tab_1(D) + 8] = _2 ?
>>>>
>>>> This isn't an issue, right? IIUC, tab_1(D) is used all over the
>>>> loop, thus propagating _1 as (tab_1(D) + 8) actually removes one
>>>> live range.
>>>>
>>>>> Disabling the folding to mem_ref[base + offset] in forwprop
>>>>> "works" in the sense that it created the same set of hoistings as
>>>>> without forwprop; however, it still results in additional spills
>>>>> (albeit of different registers).
>>>>>
>>>>> That's because forwprop seems to be increasing the live range of
>>>>> prephitmp_217 by substituting _221 + 1 with prephitmp_217 + 2
>>>>> (_221 is defined as prephitmp_217 + 1).
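
To make that live-range effect concrete, here is a small hypothetical
C analogue (the function and variable names are invented; the
benchmark itself is private). base plays the role of prephitmp_217
and t1 the role of _221:

    /* Hypothetical analogue of the substitution described above.
       Before: base can die at the definition of t1.  After a
       forwprop-style rewrite of "t1 + 1" into "base + 2", base
       stays live through the loop alongside t1 (which still has a
       later use), raising register pressure.  */
    int
    live_range_demo (int base, int n)
    {
      int t1 = base + 1;  /* t1 has another use below, so it lives on */
      int sum = 0;
      for (int i = 0; i < n; i++)
        sum += t1 + 1;    /* rewritten as base + 2: now BOTH base and
                             t1 are live across the loop              */
      return sum + t1;    /* the remaining use of t1                  */
    }
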
>>>> Hmm, it's hard to discuss private benchmarks; I am not sure in
>>>> which dump I should find the prephitmp_221/prephitmp_217 stuff.
>>>>
>>>>> On the other hand, Bin pointed out to me in private that forwprop
>>>>> also helps to restrict register pressure by propagating
>>>>> "tab + const_int" for the same test case.
>>>>>
>>>>> So I am not really sure if there's an easier fix than having
>>>>> heuristics for estimating register pressure at TREE level? I
>>>>> would be
>>>> Easy fix, maybe not. OTOH, I am more convinced that passes like
>>>> forwprop/sink/hoisting can be improved by taking live ranges into
>>>> consideration, specifically to direct such passes when they move
>>>> code across basic blocks, because inter-block register pressure is
>>>> hard to resolve afterwards.
>>>>
>>>> As suggested by Jeff and Richi, I guess the first step would be
>>>> doing experiments, collecting more benchmark data for reordering
>>>> sink before PRE? It enables code sinking as well as decreasing
>>>> register pressure in the original reduced cases, IIRC.
>>> We might even consider re-evaluating Bernd's work on what is
>>> effectively a gimple scheduler to minimize register pressure.
>>
>> Sure. The main issue I see here is the interaction with TER, which
>> we unfortunately still rely on. Enough GIMPLE instruction selection
>> might help to get rid of the remaining pieces...
>I really wonder how bad it would be to walk over expr.c and change the
>expanders to be able to walk SSA_NAME_DEF_STMT to potentially get at
>the more complex statements rather than relying on TER.
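
A rough sketch of that idea, using standard GCC-internal accessors
(SSA_NAME_DEF_STMT, is_gimple_assign, gimple_assign_rhs*). The helper
name feeding_plus_p is invented for illustration; nothing like it is
claimed to exist in expr.c, and as noted below, a real caller would
still have to perform the validity checks TER performs:

    /* Assumes GCC-internal headers, e.g. in a plugin:
       #include "gcc-plugin.h"
       #include "tree.h"
       #include "gimple.h"
       #include "ssa.h"  */

    /* Look through an SSA name at its defining statement to
       recognize a feeding addition, instead of relying on TER
       having substituted the whole tree at the use site.  */
    static bool
    feeding_plus_p (tree op, tree *rhs1, tree *rhs2)
    {
      if (TREE_CODE (op) != SSA_NAME)
        return false;
      gimple *def = SSA_NAME_DEF_STMT (op);
      if (!is_gimple_assign (def)
          || gimple_assign_rhs_code (def) != PLUS_EXPR)
        return false;
      /* NOTE: a real implementation must still verify what TER
         verifies: that expanding DEF at the use point is valid and
         does not conflict with register coalescing.  */
      *rhs1 = gimple_assign_rhs1 (def);
      *rhs2 = gimple_assign_rhs2 (def);
      return true;
    }
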
But that's what they do... TER computes when this is valid and does
not break due to coalescing.

>That's really all TER is supposed to be doing anyway.

Yes. Last year I posted patches to apply the scheduling TER does on
top of the above, but it was difficult to dissect out of it... Maybe
we need to try harder here...

Richard.

>Jeff
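
For context, a tiny hypothetical example of the validity constraint
TER enforces when deciding whether a definition may be substituted
into its use:

    /* t's definition cannot simply be re-expanded at its use,
       because the store to *p intervenes.  This is the kind of
       validity (plus coalescing conflicts) that TER tracks.  */
    int
    ter_validity_demo (int *p, int q)
    {
      int t = *p;    /* definition of t                              */
      *p = q;        /* intervening store clobbers *p                */
      return t + 1;  /* expanding t as "*p" here would read q, not t */
    }
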