On Fri, 25 May 2018, Bin.Cheng wrote:

> On Fri, May 25, 2018 at 10:23 AM, Prathamesh Kulkarni
> <prathamesh.kulka...@linaro.org> wrote:
> > On 23 May 2018 at 18:37, Jeff Law <l...@redhat.com> wrote:
> >> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
> >>> On 23 May 2018 at 13:58, Richard Biener <rguent...@suse.de> wrote:
> >>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
> >>>>
> >>>>> Hi,
> >>>>> I am trying to work on PR80155, which exposes a problem with code
> >>>>> hoisting and register pressure on a leading embedded benchmark for
> >>>>> ARM cortex-m7, where code hoisting causes an extra register spill.
> >>>>>
> >>>>> I have attached two test-cases which (hopefully) are representative
> >>>>> of the original test-case. The first one (trans_dfa.c) is bigger and
> >>>>> somewhat similar to the original test-case, and trans_dfa_2.c is a
> >>>>> hand-reduced version of trans_dfa.c. There are two spills with
> >>>>> trans_dfa.c and one spill with trans_dfa_2.c, due to its smaller
> >>>>> number of cases. The test-cases in the PR are probably not relevant.
> >>>>>
> >>>>> Initially I thought the spill was happening because of "too many
> >>>>> hoistings" taking place in the original test-case, thus increasing
> >>>>> the register pressure, but it seems the spill is possibly caused
> >>>>> because an expression gets hoisted out of a block that is on a loop
> >>>>> exit.
> >>>>>
> >>>>> For example, the following hoistings take place with trans_dfa_2.c:
> >>>>>
> >>>>> (1) Inserting expression in block 4 for code hoisting:
> >>>>>     {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
> >>>>>
> >>>>> (2) Inserting expression in block 4 for code hoisting:
> >>>>>     {plus_expr,_4,1} (0006)
> >>>>>
> >>>>> (3) Inserting expression in block 4 for code hoisting:
> >>>>>     {pointer_plus_expr,s_33,1} (0023)
> >>>>>
> >>>>> (4) Inserting expression in block 3 for code hoisting:
> >>>>>     {pointer_plus_expr,s_33,1} (0023)
> >>>>>
> >>>>> The issue seems to be the hoisting of (*tab + 1), which consists of
> >>>>> the first two hoistings into block 4 from blocks 5 and 9, and which
> >>>>> causes the extra spill. I verified that by disabling hoisting into
> >>>>> block 4, which resulted in no extra spills.
> >>>>>
> >>>>> I wonder if that's because the expression (*tab + 1) is getting
> >>>>> hoisted from blocks 5 and 9, which are on loop exits? So the
> >>>>> expression that was previously computed in a block on a loop exit
> >>>>> gets hoisted outside that block, which possibly makes the allocator
> >>>>> more defensive? Similarly, disabling hoisting of expressions which
> >>>>> appeared in blocks on loop exits in the original test-case prevented
> >>>>> the extra spill. The other hoistings didn't seem to matter.
> >>>>
> >>>> I think that's simply coincidence. The only thing that makes a block
> >>>> that also exits from the loop special is that an expression could be
> >>>> sunk out of the loop, and hoisting (commoning with another path)
> >>>> could prevent that. But that isn't what is happening here, and it
> >>>> would be a pass ordering issue, as the sinking pass runs only after
> >>>> hoisting (no idea why exactly, but I guess there are cases where we
> >>>> want to prefer CSE over sinking). So you could try whether
> >>>> re-ordering PRE and sinking helps your testcase.
> >>> Thanks for the suggestions. Placing the sink pass before PRE works
> >>> for both these test-cases! Sadly it still causes the spill for the
> >>> benchmark -:(
> >>> I will try to create a better approximation of the original test-case.
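For readers without the attachments, here is a minimal sketch of the kind
of loop under discussion (illustrative only, assuming the shape of the
reduced test-case; this is not the attached trans_dfa_2.c and the function
and variable names are made up):

  /* Each branch performs the same read-modify-write of *tab and then
     advances s; code hoisting can then common the load of *tab and the
     "+ 1" above the branch while the stores stay in the successors.  */
  int
  walk (const char *s, int *tab)
  {
    int state = 0;
    while (*s)
      {
        if (*s == 'A')        /* the c_36 == 65 comparison quoted below */
          {
            *tab = *tab + 1;  /* the RMW that hoisting tears apart */
            state = 1;
          }
        else
          {
            *tab = *tab + 1;
            state = 2;
          }
        s++;                  /* the {pointer_plus_expr,s_33,1} hoisting */
      }
    return state;
  }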
> >>>>
> >>>> What I do see is a missed opportunity to merge the successors
> >>>> of BB 4. After PRE we have
> >>>>
> >>>> <bb 4> [local count: 159303558]:
> >>>> <L1>:
> >>>> pretmp_123 = *tab_37(D);
> >>>> _87 = pretmp_123 + 1;
> >>>> if (c_36 == 65)
> >>>>   goto <bb 5>; [34.00%]
> >>>> else
> >>>>   goto <bb 8>; [66.00%]
> >>>>
> >>>> <bb 5> [local count: 54163210]:
> >>>> *tab_37(D) = _87;
> >>>> _96 = MEM[(char *)s_57 + 1B];
> >>>> if (_96 != 0)
> >>>>   goto <bb 7>; [89.00%]
> >>>> else
> >>>>   goto <bb 6>; [11.00%]
> >>>>
> >>>> <bb 8> [local count: 105140348]:
> >>>> *tab_37(D) = _87;
> >>>> _56 = MEM[(char *)s_57 + 1B];
> >>>> if (_56 != 0)
> >>>>   goto <bb 10>; [89.00%]
> >>>> else
> >>>>   goto <bb 9>; [11.00%]
> >>>>
> >>>> Here at least the stores and loads can be hoisted. Note this may
> >>>> also point at the real issue of the code hoisting, which is tearing
> >>>> apart the RMW operation?
> >>> Indeed, this possibility seems much more likely than the block being
> >>> on a loop exit. I will try to "hardcode" the load/store hoists into
> >>> block 4 for this specific test-case to check if that prevents the
> >>> spill.
> >> Even if it prevents the spill in this case, it's likely a good thing
> >> to do. The statements prior to the conditional in bb5 and bb8 should
> >> be hoisted, leaving bb5 and bb8 with just their conditionals.
> > Hi,
> > It seems disabling forwprop somehow works, causing no extra spills on
> > the original test-case.
> >
> > For instance, hoisting without forwprop:
> >
> > bb 3:
> > _1 = tab_1(D) + 8
> > pretmp_268 = MEM[tab_1(D) + 8B];
> > _2 = pretmp_268 + 1;
> > goto <bb 4> or <bb 5>
> >
> > bb 4:
> > *_1 = _2
> >
> > bb 5:
> > *_1 = _2
> >
> > Hoisting with forwprop:
> >
> > bb 3:
> > pretmp_164 = MEM[tab_1(D) + 8B];
> > _2 = pretmp_164 + 1
> > goto <bb 4> or <bb 5>
> >
> > bb 4:
> > MEM[tab_1(D) + 8] = _2;
> >
> > bb 5:
> > MEM[tab_1(D) + 8] = _2;
> >
> > Although in both cases we aren't hoisting stores, the issue with
> > forwprop for this case seems to be the folding of
> >   *_1 = _2
> > into
> >   MEM[tab_1(D) + 8] = _2 ?
> This isn't an issue, right? IIUC, tab_1(D) is used all over the loop,
> thus propagating _1 using (tab_1(D) + 8) actually removes one live
> range.
> >
> > Disabling folding to mem_ref[base + offset] in forwprop "works" in the
> > sense that it created the same set of hoistings as without forwprop;
> > however, it still results in additional spills (albeit of different
> > registers).
> >
> > That's because forwprop seems to be increasing the live range of
> > prephitmp_217 by substituting _221 + 1 with prephitmp_217 + 2 (_221 is
> > defined as prephitmp_217 + 1).
> Hmm, it's hard to discuss private benchmarks; I am not sure in which
> dump I should look for the prephitmp_221/prephitmp_217 stuff.
> >
> > On the other hand, Bin pointed out to me in private that forwprop also
> > helps to restrict register pressure by propagating "tab + const_int"
> > for the same test-case.
> >
> > So I am not really sure if there's an easier fix than having
> > heuristics for estimating register pressure at TREE level? I would be
> Easy fix, maybe not. OTOH, I am more convinced that passes like
> forwprop/sink/hoisting can be improved by taking live ranges into
> consideration. Specifically, to direct such passes when moving code
> around different basic blocks, because inter-block register pressure
> is hard to resolve afterwards.
>
> As suggested by Jeff and Richi, I guess the first step would be doing
> experiments, collecting more benchmark data for reordering sink before
> PRE? It enables code sinking as well as decreasing register pressure
> in the original reduced cases, IIRC.
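To make the live-range trade-off concrete, here is a small hand-written
illustration (an assumed shape, not taken from the private benchmark;
function and variable names are made up):

  /* Part 1: the address folding discussed above.  Keeping the address in
     a separate pointer means both stores reuse that one value; once
     forwprop folds it back into MEM[tab + 8B] = _2, 'tab' itself has to
     stay live across the branch instead.  Which variant is cheaper
     depends on what else keeps 'tab' live in the surrounding loop.  */
  void
  bump (int *tab, int c)
  {
    int *p = tab + 2;   /* _1 = tab_1(D) + 8 with 4-byte int */
    int v = *p + 1;     /* pretmp_268 + 1 */
    if (c)
      *p = v;           /* *_1 = _2 */
    else
      *p = v;           /* after forwprop: MEM[tab_1(D) + 8B] = _2 */
  }

  /* Part 2: the substitution that extends a live range.  Rewriting
     't + 1' as 'p + 2' (like prephitmp_217 + 2 above) keeps 'p' live
     past what would otherwise have been its last use through 't'.  */
  int
  chain (int p, int q)
  {
    int t = p + 1;      /* _221 = prephitmp_217 + 1 */
    int u = t + 1;      /* forwprop rewrites this as p + 2 */
    return t * q + u;
  }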
Note that sinking also doesn't have a cost model that takes register
pressure into account. But yes, I can't think of a reason to do sinking
after PRE (well, apart from PRE performing value-numbering, which of
course might expose sinking opportunities). So yes, perform some
experiments: you should be able to use -fdump-statistics[-stats] and
look at the number of sunk stmts reported (and also the number of PRE
eliminations ultimately performed).

Richard.

> Thanks,
> bin
> > grateful for suggestions on how to proceed from here.
> > Thanks!
> >
> > Regards,
> > Prathamesh
> >>
> >> Jeff
> >

--
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nuernberg)
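For the reordering experiment itself, a sketch of what it would presumably
touch in gcc/passes.def (the surrounding entries and exact context vary by
revision, so treat this as illustrative): move pass_sink_code in front of
pass_pre, rebuild, and then compare the sunk-statement and PRE-elimination
counts reported by -fdump-statistics before and after on the benchmark.

  /* Current order (sketch):  */
      NEXT_PASS (pass_split_crit_edges);
      NEXT_PASS (pass_pre);
      NEXT_PASS (pass_sink_code);

  /* Experiment: sink before PRE.  */
      NEXT_PASS (pass_split_crit_edges);
      NEXT_PASS (pass_sink_code);
      NEXT_PASS (pass_pre);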