Re: PR80155: Code hoisting and register pressure

Richard Biener Fri, 25 May 2018 10:54:53 -0700

On May 25, 2018 6:57:13 PM GMT+02:00, Jeff Law <[email protected]> wrote:
>On 05/25/2018 03:49 AM, Bin.Cheng wrote:
>> On Fri, May 25, 2018 at 10:23 AM, Prathamesh Kulkarni
>> <[email protected]> wrote:
>>> On 23 May 2018 at 18:37, Jeff Law <[email protected]> wrote:
>>>> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
>>>>> On 23 May 2018 at 13:58, Richard Biener <[email protected]> wrote:
>>>>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I am trying to work on PR80155, which exposes a problem with
>code
>>>>>>> hoisting and register pressure on a leading embedded benchmark
>for ARM
>>>>>>> cortex-m7, where code-hoisting causes an extra register spill.
>>>>>>>
>>>>>>> I have attached two test-cases which (hopefully) are
>representative of
>>>>>>> the original test-case.
>>>>>>> The first one (trans_dfa.c) is bigger and somewhat similar to
>the
>>>>>>> original test-case and trans_dfa_2.c is hand-reduced version of
>>>>>>> trans_dfa.c. There's 2 spills caused with trans_dfa.c
>>>>>>> and one spill with trans_dfa_2.c due to lesser amount of cases.
>>>>>>> The test-cases in the PR are probably not relevant.
>>>>>>>
>>>>>>> Initially I thought the spill was happening because of "too many
>>>>>>> hoistings" taking place in original test-case thus increasing
>the
>>>>>>> register pressure, but it seems the spill is possibly caused
>because
>>>>>>> expression gets hoisted out of a block that is on loop exit.
>>>>>>>
>>>>>>> For example, the following hoistings take place with
>trans_dfa_2.c:
>>>>>>>
>>>>>>> (1) Inserting expression in block 4 for code hoisting:
>>>>>>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
>>>>>>>
>>>>>>> (2) Inserting expression in block 4 for code hoisting:
>{plus_expr,_4,1} (0006)
>>>>>>>
>>>>>>> (3) Inserting expression in block 4 for code hoisting:
>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>
>>>>>>> (4) Inserting expression in block 3 for code hoisting:
>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>
>>>>>>> The issue seems to be hoisting of (*tab + 1) which consists of
>first
>>>>>>> two hoistings in block 4
>>>>>>> from blocks 5 and 9, which causes the extra spill. I verified
>that by
>>>>>>> disabling hoisting into block 4,
>>>>>>> which resulted in no extra spills.
>>>>>>>
>>>>>>> I wonder if that's because the expression (*tab + 1) is getting
>>>>>>> hoisted from blocks 5 and 9,
>>>>>>> which are on loop exit ? So the expression that was previously
>>>>>>> computed in a block on loop exit, gets hoisted outside that
>block
>>>>>>> which possibly makes the allocator more defensive ? Similarly
>>>>>>> disabling hoisting of expressions which appeared in blocks on
>loop
>>>>>>> exit in original test-case prevented the extra spill. The other
>>>>>>> hoistings didn't seem to matter.
>>>>>>
>>>>>> I think that's simply co-incidence.  The only thing that makes
>>>>>> a block that also exits from the loop special is that an
>>>>>> expression could be sunk out of the loop and hoisting (commoning
>>>>>> with another path) could prevent that.  But that isn't what is
>>>>>> happening here and it would be a pass ordering issue as
>>>>>> the sinking pass runs only after hoisting (no idea why exactly
>>>>>> but I guess there are cases where we want to prefer CSE over
>>>>>> sinking).  So you could try if re-ordering PRE and sinking helps
>>>>>> your testcase.
>>>>> Thanks for the suggestions. Placing sink pass before PRE works
>>>>> for both these test-cases! Sadly it still causes the spill for the
>benchmark -:(
>>>>> I will try to create a better approximation of the original
>test-case.
>>>>>>
>>>>>> What I do see is a missed opportunity to merge the successors
>>>>>> of BB 4.  After PRE we have
>>>>>>
>>>>>> <bb 4> [local count: 159303558]:
>>>>>> <L1>:
>>>>>> pretmp_123 = *tab_37(D);
>>>>>> _87 = pretmp_123 + 1;
>>>>>> if (c_36 == 65)
>>>>>>   goto <bb 5>; [34.00%]
>>>>>> else
>>>>>>   goto <bb 8>; [66.00%]
>>>>>>
>>>>>> <bb 5> [local count: 54163210]:
>>>>>> *tab_37(D) = _87;
>>>>>> _96 = MEM[(char *)s_57 + 1B];
>>>>>> if (_96 != 0)
>>>>>>   goto <bb 7>; [89.00%]
>>>>>> else
>>>>>>   goto <bb 6>; [11.00%]
>>>>>>
>>>>>> <bb 8> [local count: 105140348]:
>>>>>> *tab_37(D) = _87;
>>>>>> _56 = MEM[(char *)s_57 + 1B];
>>>>>> if (_56 != 0)
>>>>>>   goto <bb 10>; [89.00%]
>>>>>> else
>>>>>>   goto <bb 9>; [11.00%]
>>>>>>
>>>>>> here at least the stores and loads can be hoisted.  Note this
>>>>>> may also point at the real issue of the code hoisting which is
>>>>>> tearing apart the RMW operation?
>>>>> Indeed, this possibility seems much more likely than block being
>on loop exit.
>>>>> I will try to "hardcode" the load/store hoists into block 4 for
>this
>>>>> specific test-case to check
>>>>> if that prevents the spill.
>>>> Even if it prevents the spill in this case, it's likely a good
>thing to
>>>> do.  The statements prior to the conditional in bb5 and bb8 should
>be
>>>> hoisted, leaving bb5 and bb8 with just their conditionals.
>>> Hi,
>>> It seems disabling forwprop somehow works for causing no extra
>spills
>>> on the original test-case.
>>>
>>> For instance,
>>> Hoisting without forwprop:
>>>
>>> bb 3:
>>> _1 = tab_1(D) + 8
>>> pretmp_268 = MEM[tab_1(D) + 8B];
>>> _2 = pretmp_268 + 1;
>>> goto <bb 4> or <bb 5>
>>>
>>> bb 4:
>>>  *_1 = _ 2
>>>
>>> bb 5:
>>> *_1 = _2
>>>
>>> Hoisting with forwprop:
>>>
>>> bb 3:
>>> pretmp_164 = MEM[tab_1(D) + 8B];
>>> _2 = pretmp_164 + 1
>>> goto <bb 4> or <bb 5>
>>>
>>> bb 4:
>>> MEM[tab_1(D) + 8] = _2;
>>>
>>> bb 5:
>>> MEM[tab_1(D) + 8] = _2;
>>>
>>> Although in both cases, we aren't hoisting stores, the issues with
>forwprop
>>> for this case seems to be the folding of
>>> *_1 = _2
>>> into
>>> MEM[tab_1(D) + 8] = _2  ?
>> 
>> This isn't an issue, right?  IIUC, tab_1(D) used all over the loop
>> thus propagating _1 using (tab_1(D) + 8) actually removes one live
>> range.
>> 
>>>
>>> Disabling folding to mem_ref[base + offset] in forwprop "works" in
>the
>>> sense it created same set of hoistings as without forwprop, however
>it
>>> still results in additional spills (albeit different registers).
>>>
>>> That's because forwprop seems to be increasing live range of
>>> prephitmp_217 by substituting
>>> _221 + 1 with prephitmp_217 + 2 (_221 is defined as prephitmp_217 +
>1).
>> Hmm, it's hard to discuss private benchmarks, not sure which dump
>> shall I find prephitmp_221/prephitmp_217 stuff.
>> 
>>> On the other hand, Bin pointed out to me in private that forwprop
>also
>>> helps to restrict register pressure by propagating "tab + const_int"
>>> for same test-case.
>>>
>>> So I am not really sure if there's an easier fix than having
>>> heuristics for estimating register pressure at TREE level ? I would
>be
>> Easy fix, maybe not.  OTOH, I am more convinced passes like
>> forwprop/sink/hoisting can be improved by taking live range into
>> consideration.  Specifically, to direct such passes when moving code
>> around different basic blocks, because inter-block register pressure
>> is hard to resolve afterwards.
>> 
>> As suggested by Jeff and Richi, I guess the first step would be doing
>> experiments, collecting more benchmark data for reordering sink
>before
>> pre?  It enables code sink as well as decreases register pressure in
>> the original reduced cases IIRC.
>We might even consider re-evaluating Bernd's work on what is
>effectively
>a gimple scheduler to minimize register pressure.


Sure. The main issue here I see is with the interaction with TER which we 
unfortunately still rely on. Enough GIMPLE instruction selection might help to 
get rid of the remaining pieces... 

>Or we could look to extend your work into a generalized pressure
>reducing pass that we could run near the gimple/rtl border.
>
>The final possibility would be Click's algorithm from '95 adjusted to
>just do pressure reduction.
>
>jeff

Re: PR80155: Code hoisting and register pressure

Reply via email to