On May 25, 2018 9:25:51 PM GMT+02:00, Jeff Law <l...@redhat.com> wrote:
>On 05/25/2018 11:54 AM, Richard Biener wrote:
>> On May 25, 2018 6:57:13 PM GMT+02:00, Jeff Law <l...@redhat.com> wrote:
>>> On 05/25/2018 03:49 AM, Bin.Cheng wrote:
>>>> On Fri, May 25, 2018 at 10:23 AM, Prathamesh Kulkarni
>>>> <prathamesh.kulka...@linaro.org> wrote:
>>>>> On 23 May 2018 at 18:37, Jeff Law <l...@redhat.com> wrote:
>>>>>> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
>>>>>>> On 23 May 2018 at 13:58, Richard Biener <rguent...@suse.de> wrote:
>>>>>>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> I am trying to work on PR80155, which exposes a problem with
>>>>>>>>> code hoisting and register pressure on a leading embedded
>>>>>>>>> benchmark for ARM cortex-m7, where code hoisting causes an
>>>>>>>>> extra register spill.
>>>>>>>>>
>>>>>>>>> I have attached two test cases which (hopefully) are
>>>>>>>>> representative of the original test case.
>>>>>>>>> The first one (trans_dfa.c) is bigger and somewhat similar to
>>>>>>>>> the original test case, and trans_dfa_2.c is a hand-reduced
>>>>>>>>> version of trans_dfa.c. There are two spills caused with
>>>>>>>>> trans_dfa.c and one spill with trans_dfa_2.c, due to the
>>>>>>>>> smaller number of cases. The test cases in the PR are
>>>>>>>>> probably not relevant.
>>>>>>>>>
>>>>>>>>> Initially I thought the spill was happening because of "too
>>>>>>>>> many hoistings" taking place in the original test case, thus
>>>>>>>>> increasing the register pressure, but it seems the spill is
>>>>>>>>> possibly caused because the expression gets hoisted out of a
>>>>>>>>> block that is on a loop exit.
>>>>>>>>>
>>>>>>>>> For example, the following hoistings take place with
>>>>>>>>> trans_dfa_2.c:
>>>>>>>>>
>>>>>>>>> (1) Inserting expression in block 4 for code hoisting:
>>>>>>>>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
>>>>>>>>>
>>>>>>>>> (2) Inserting expression in block 4 for code hoisting:
>>>>>>>>> {plus_expr,_4,1} (0006)
>>>>>>>>>
>>>>>>>>> (3) Inserting expression in block 4 for code hoisting:
>>>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>>>
>>>>>>>>> (4) Inserting expression in block 3 for code hoisting:
>>>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>>>
>>>>>>>>> The issue seems to be the hoisting of (*tab + 1), which
>>>>>>>>> consists of the first two hoistings into block 4 from blocks
>>>>>>>>> 5 and 9, and which causes the extra spill. I verified that by
>>>>>>>>> disabling hoisting into block 4, which resulted in no extra
>>>>>>>>> spills.
>>>>>>>>>
>>>>>>>>> I wonder if that's because the expression (*tab + 1) is
>>>>>>>>> getting hoisted from blocks 5 and 9, which are on a loop
>>>>>>>>> exit? So the expression that was previously computed in a
>>>>>>>>> block on a loop exit gets hoisted outside that block, which
>>>>>>>>> possibly makes the allocator more defensive? Similarly,
>>>>>>>>> disabling the hoisting of expressions which appeared in
>>>>>>>>> blocks on loop exits in the original test case prevented the
>>>>>>>>> extra spill. The other hoistings didn't seem to matter.
>>>>>>>>
>>>>>>>> I think that's simply coincidence. The only thing that makes a
>>>>>>>> block that also exits from the loop special is that an
>>>>>>>> expression could be sunk out of the loop, and hoisting
>>>>>>>> (commoning with another path) could prevent that. But that
>>>>>>>> isn't what is happening here, and it would be a pass ordering
>>>>>>>> issue, as the sinking pass runs only after hoisting (no idea
>>>>>>>> why exactly, but I guess there are cases where we want to
>>>>>>>> prefer CSE over sinking). So you could try whether re-ordering
>>>>>>>> PRE and sinking helps your testcase.
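
As an illustration of the reordering experiment suggested above, a
minimal sketch of what the swap might look like in gcc/passes.def.
The surrounding pass names (pass_split_crit_edges, pass_pre,
pass_sink_code) are real, but their exact neighbourhood varies between
releases, so treat the placement shown here as an assumption, not a
vetted patch:

    /* gcc/passes.def, inside pass_all_optimizations (sketch only).
       Stock order runs PRE first and sinking afterwards:  */
    NEXT_PASS (pass_split_crit_edges);
    NEXT_PASS (pass_pre);
    NEXT_PASS (pass_sink_code);

    /* The experiment: run sinking before PRE instead, so values
       that can be sunk out of the loop are moved before hoisting
       gets a chance to common them with another path.  */
    NEXT_PASS (pass_split_crit_edges);
    NEXT_PASS (pass_sink_code);
    NEXT_PASS (pass_pre);
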
>>>>>>> Thanks for the suggestions. Placing the sink pass before PRE
>>>>>>> works for both these test cases! Sadly, it still causes the
>>>>>>> spill for the benchmark :-(
>>>>>>> I will try to create a better approximation of the original
>>>>>>> test case.
>>>>>>>>
>>>>>>>> What I do see is a missed opportunity to merge the successors
>>>>>>>> of BB 4. After PRE we have
>>>>>>>>
>>>>>>>> <bb 4> [local count: 159303558]:
>>>>>>>> <L1>:
>>>>>>>> pretmp_123 = *tab_37(D);
>>>>>>>> _87 = pretmp_123 + 1;
>>>>>>>> if (c_36 == 65)
>>>>>>>>   goto <bb 5>; [34.00%]
>>>>>>>> else
>>>>>>>>   goto <bb 8>; [66.00%]
>>>>>>>>
>>>>>>>> <bb 5> [local count: 54163210]:
>>>>>>>> *tab_37(D) = _87;
>>>>>>>> _96 = MEM[(char *)s_57 + 1B];
>>>>>>>> if (_96 != 0)
>>>>>>>>   goto <bb 7>; [89.00%]
>>>>>>>> else
>>>>>>>>   goto <bb 6>; [11.00%]
>>>>>>>>
>>>>>>>> <bb 8> [local count: 105140348]:
>>>>>>>> *tab_37(D) = _87;
>>>>>>>> _56 = MEM[(char *)s_57 + 1B];
>>>>>>>> if (_56 != 0)
>>>>>>>>   goto <bb 10>; [89.00%]
>>>>>>>> else
>>>>>>>>   goto <bb 9>; [11.00%]
>>>>>>>>
>>>>>>>> Here at least the stores and loads can be hoisted. Note this
>>>>>>>> may also point at the real issue with the code hoisting, which
>>>>>>>> is tearing apart the RMW operation?
>>>>>>> Indeed, this possibility seems much more likely than the block
>>>>>>> being on a loop exit.
>>>>>>> I will try to "hardcode" the load/store hoists into block 4 for
>>>>>>> this specific test case to check if that prevents the spill.
>>>>>> Even if it prevents the spill in this case, it's likely a good
>>>>>> thing to do. The statements prior to the conditionals in bb5 and
>>>>>> bb8 should be hoisted, leaving bb5 and bb8 with just their
>>>>>> conditionals.
>>>>> Hi,
>>>>> It seems disabling forwprop somehow works for causing no extra
>>>>> spills on the original test case.
>>>>>
>>>>> For instance,
>>>>> Hoisting without forwprop:
>>>>>
>>>>> bb 3:
>>>>> _1 = tab_1(D) + 8
>>>>> pretmp_268 = MEM[tab_1(D) + 8B];
>>>>> _2 = pretmp_268 + 1;
>>>>> goto <bb 4> or <bb 5>
>>>>>
>>>>> bb 4:
>>>>> *_1 = _2
>>>>>
>>>>> bb 5:
>>>>> *_1 = _2
>>>>>
>>>>> Hoisting with forwprop:
>>>>>
>>>>> bb 3:
>>>>> pretmp_164 = MEM[tab_1(D) + 8B];
>>>>> _2 = pretmp_164 + 1
>>>>> goto <bb 4> or <bb 5>
>>>>>
>>>>> bb 4:
>>>>> MEM[tab_1(D) + 8] = _2;
>>>>>
>>>>> bb 5:
>>>>> MEM[tab_1(D) + 8] = _2;
>>>>>
>>>>> Although in both cases we aren't hoisting stores, the issue with
>>>>> forwprop in this case seems to be the folding of
>>>>> *_1 = _2
>>>>> into
>>>>> MEM[tab_1(D) + 8] = _2 ?
>>>>
>>>> This isn't an issue, right? IIUC, tab_1(D) is used all over the
>>>> loop, thus propagating _1 as (tab_1(D) + 8) actually removes one
>>>> live range.
>>>>
>>>>> Disabling the folding to mem_ref[base + offset] in forwprop
>>>>> "works" in the sense that it created the same set of hoistings as
>>>>> without forwprop; however, it still results in additional spills
>>>>> (albeit of different registers).
>>>>>
>>>>> That's because forwprop seems to be increasing the live range of
>>>>> prephitmp_217 by substituting _221 + 1 with prephitmp_217 + 2
>>>>> (_221 is defined as prephitmp_217 + 1).
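
To make that live-range effect concrete, here is a small hypothetical
C analogue (the function and variable names are invented; the
benchmark itself is private). base plays the role of prephitmp_217
and t1 the role of _221:

    /* Hypothetical analogue of the substitution described above.
       Before: base can die at the definition of t1.  After a
       forwprop-style rewrite of "t1 + 1" into "base + 2", base
       stays live through the loop alongside t1 (which still has a
       later use), raising register pressure.  */
    int
    live_range_demo (int base, int n)
    {
      int t1 = base + 1;  /* t1 has another use below, so it lives on */
      int sum = 0;
      for (int i = 0; i < n; i++)
        sum += t1 + 1;    /* rewritten as base + 2: now BOTH base and
                             t1 are live across the loop              */
      return sum + t1;    /* the remaining use of t1                  */
    }
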
>>>> Hmm, it's hard to discuss private benchmarks; I am not sure in
>>>> which dump I should find the prephitmp_221/prephitmp_217 stuff.
>>>>
>>>>> On the other hand, Bin pointed out to me in private that forwprop
>>>>> also helps to restrict register pressure by propagating
>>>>> "tab + const_int" for the same test case.
>>>>>
>>>>> So I am not really sure if there's an easier fix than having
>>>>> heuristics for estimating register pressure at TREE level? I
>>>>> would be
>>>> Easy fix, maybe not. OTOH, I am more convinced that passes like
>>>> forwprop/sink/hoisting can be improved by taking live ranges into
>>>> consideration, specifically to direct such passes when they move
>>>> code across basic blocks, because inter-block register pressure is
>>>> hard to resolve afterwards.
>>>>
>>>> As suggested by Jeff and Richi, I guess the first step would be
>>>> doing experiments, collecting more benchmark data for reordering
>>>> sink before PRE? It enables code sinking as well as decreasing
>>>> register pressure in the original reduced cases, IIRC.
>>> We might even consider re-evaluating Bernd's work on what is
>>> effectively a gimple scheduler to minimize register pressure.
>>
>> Sure. The main issue I see here is the interaction with TER, which
>> we unfortunately still rely on. Enough GIMPLE instruction selection
>> might help to get rid of the remaining pieces...
>I really wonder how bad it would be to walk over expr.c and change the
>expanders to be able to walk SSA_NAME_DEF_STMT to potentially get at
>the more complex statements rather than relying on TER.
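
A rough sketch of that idea, using standard GCC-internal accessors
(SSA_NAME_DEF_STMT, is_gimple_assign, gimple_assign_rhs*). The helper
name feeding_plus_p is invented for illustration; nothing like it is
claimed to exist in expr.c, and as noted below, a real caller would
still have to perform the validity checks TER performs:

    /* Assumes GCC-internal headers, e.g. in a plugin:
       #include "gcc-plugin.h"
       #include "tree.h"
       #include "gimple.h"
       #include "ssa.h"  */

    /* Look through an SSA name at its defining statement to
       recognize a feeding addition, instead of relying on TER
       having substituted the whole tree at the use site.  */
    static bool
    feeding_plus_p (tree op, tree *rhs1, tree *rhs2)
    {
      if (TREE_CODE (op) != SSA_NAME)
        return false;
      gimple *def = SSA_NAME_DEF_STMT (op);
      if (!is_gimple_assign (def)
          || gimple_assign_rhs_code (def) != PLUS_EXPR)
        return false;
      /* NOTE: a real implementation must still verify what TER
         verifies: that expanding DEF at the use point is valid and
         does not conflict with register coalescing.  */
      *rhs1 = gimple_assign_rhs1 (def);
      *rhs2 = gimple_assign_rhs2 (def);
      return true;
    }
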
But that's what they do... TER computes when this is valid and does
not break due to coalescing.

>That's really all TER is supposed to be doing anyway.

Yes. Last year I posted patches to apply the scheduling TER does on
top of the above, but it was difficult to dissect out of it... Maybe
we need to try harder here...

Richard.

>Jeff
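
For context, a tiny hypothetical example of the validity constraint
TER enforces when deciding whether a definition may be substituted
into its use:

    /* t's definition cannot simply be re-expanded at its use,
       because the store to *p intervenes.  This is the kind of
       validity (plus coalescing conflicts) that TER tracks.  */
    int
    ter_validity_demo (int *p, int q)
    {
      int t = *p;    /* definition of t                              */
      *p = q;        /* intervening store clobbers *p                */
      return t + 1;  /* expanding t as "*p" here would read q, not t */
    }
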