On Wed, May 23, 2018 at 9:28 AM, Richard Biener <rguent...@suse.de> wrote:
> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
>
>> Hi,
>> I am trying to work on PR80155, which exposes a problem with code
>> hoisting and register pressure on a leading embedded benchmark for ARM
>> cortex-m7, where code-hoisting causes an extra register spill.
>>
>> I have attached two test-cases which (hopefully) are representative of
>> the original test-case.
>> The first one (trans_dfa.c) is bigger and somewhat similar to the
>> original test-case and trans_dfa_2.c is hand-reduced version of
>> trans_dfa.c. There's 2 spills caused with trans_dfa.c
>> and one spill with trans_dfa_2.c due to lesser amount of cases.
>> The test-cases in the PR are probably not relevant.
>>
>> Initially I thought the spill was happening because of "too many
>> hoistings" taking place in original test-case thus increasing the
>> register pressure, but it seems the spill is possibly caused because
>> expression gets hoisted out of a block that is on loop exit.
>>
>> For example, the following hoistings take place with trans_dfa_2.c:
>>
>> (1) Inserting expression in block 4 for code hoisting:
>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
>>
>> (2) Inserting expression in block 4 for code hoisting: {plus_expr,_4,1} 
>> (0006)
>>
>> (3) Inserting expression in block 4 for code hoisting:
>> {pointer_plus_expr,s_33,1} (0023)
>>
>> (4) Inserting expression in block 3 for code hoisting:
>> {pointer_plus_expr,s_33,1} (0023)
>>
>> The issue seems to be hoisting of (*tab + 1) which consists of first
>> two hoistings in block 4
>> from blocks 5 and 9, which causes the extra spill. I verified that by
>> disabling hoisting into block 4,
>> which resulted in no extra spills.
>>
>> I wonder if that's because the expression (*tab + 1) is getting
>> hoisted from blocks 5 and 9,
>> which are on loop exit ? So the expression that was previously
>> computed in a block on loop exit, gets hoisted outside that block
>> which possibly makes the allocator more defensive ? Similarly
>> disabling hoisting of expressions which appeared in blocks on loop
>> exit in original test-case prevented the extra spill. The other
>> hoistings didn't seem to matter.
>
> I think that's simply co-incidence.  The only thing that makes
> a block that also exits from the loop special is that an
> expression could be sunk out of the loop and hoisting (commoning
> with another path) could prevent that.  But that isn't what is
> happening here and it would be a pass ordering issue as
> the sinking pass runs only after hoisting (no idea why exactly
> but I guess there are cases where we want to prefer CSE over
> sinking).  So you could try if re-ordering PRE and sinking helps
> your testcase.
>
> What I do see is a missed opportunity to merge the successors
> of BB 4.  After PRE we have
>
> <bb 4> [local count: 159303558]:
> <L1>:
> pretmp_123 = *tab_37(D);
> _87 = pretmp_123 + 1;
> if (c_36 == 65)
>   goto <bb 5>; [34.00%]
> else
>   goto <bb 8>; [66.00%]
>
> <bb 5> [local count: 54163210]:
> *tab_37(D) = _87;
> _96 = MEM[(char *)s_57 + 1B];
> if (_96 != 0)
>   goto <bb 7>; [89.00%]
> else
>   goto <bb 6>; [11.00%]
>
> <bb 8> [local count: 105140348]:
> *tab_37(D) = _87;
> _56 = MEM[(char *)s_57 + 1B];
> if (_56 != 0)
>   goto <bb 10>; [89.00%]
> else
>   goto <bb 9>; [11.00%]
>
> here at least the stores and loads can be hoisted.  Note this
> may also point at the real issue of the code hoisting which is
> tearing apart the RMW operation?
>
>> If that's the case, would it make sense to add another constraint to
>> hoisting to not hoist expression if it appears in a block that is on
>> loop exit (exception is if block has only single successor), at-least
>> for targets like cortex-m7 where the effect of spill is more
>> pronounced ?
>>
>> I tried to write an untested prototype patch (approach-8.diff) on
>> these lines, to refuse hoisting if an expression appears in block on
>> loop exit. The patch adds a new map pre_expr_block_d, that maps
>> pre_expr to the set of blocks (as a bitmap) it appears in and are on
>> loop exit, which is computed during compute_avail.
>> During do_hoist_insertion, it simply checks if the bitmap of blocks is
>> not empty.
>> It works for the attached test-cases and passes ssa-pre-*.c and
>> ssa-hoist-*.c tests.
>
> As said to me the heuristic doesn't make much sense.  There is
> btw loop_exits_from_bb_p ().
>
> If the issue in the end is a RA one (that is, there _is_ a better
> allocation possible?) then you may also want to look at out-of-SSA
> coalescing.
>
> So overall I'm not convinced the hoisting decision is wrong.
> It doesn't increase register pressure at all.  It does expose
Not quite.  There are two hoisting in case of trans_dfa_2.c

For the first one:

Inserting expression in block 4 for code hoisting:
{mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
Inserted pretmp_30 = *tab_20(D);
 in predecessor 4 (0005)
Inserting expression in block 4 for code hoisting: {plus_expr,_4,1} (0006)
Inserted _12 = pretmp_30 + 1;
 in predecessor 4 (0006)

I agree hoisting separates RM from W, but for the second one:

Inserting expression in block 4 for code hoisting:
{pointer_plus_expr,s_33,1} (0023)
Inserted _14 = s_33 + 1;
 in predecessor 4 (0023)
Inserting expression in block 3 for code hoisting:
{pointer_plus_expr,s_33,1} (0023)
Inserted _62 = s_33 + 1;
 in predecessor 3 (0023)

It does increase register pressure because _62(originally s_34 and
s_27) is extended and s_33 is still alive.

Thanks,
bin

> a missed store hoisting and load CSE opportunity (but we don't
> have a way to "PRE" or hoist stores at the moment).  Stores
> do not fit data flow problems well given they need to be kept
> in order with respect to other stores and loads and appearantly
> *tab aliases *s (yeah, s is char *...  make tab restrict and I
> guess things will work much smoother).
>
> Richard.
>
>> Alternatively, we could restrict replacing expression by it's leader
>> in eliminate_dom_walker::before_dom_children if the expression appears
>> in a block on loop exit.
>> In principle this is more general than hoisting, but I have restricted
>> it to only hoisted expressions to avoid being overly conservative.
>> Similarly, this constrained could be made conditional, only for
>> targets like cortex-m7. I have attached a prototype patch based on
>> this approach (approach-9.diff). Although it works for attached
>> test-cases, unfortunately it ICE's with the original test-case in
>> tail-merge, I am working on that.
>>
>> I am not sure though if either of these approaches are in the correct
>> direction and would
>> be grateful for suggestions on the issue!
>>
>> Thanks,
>> Prathamesh
>>
>
> --
> Richard Biener <rguent...@suse.de>
> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
> 21284 (AG Nuernberg)

Reply via email to