Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting

Richard Biener Wed, 17 Dec 2014 07:55:12 -0800

On Mon, Dec 15, 2014 at 4:29 PM, Jiong Wang <jiong.w...@arm.com> wrote:
>
> On 15/12/14 15:28, Jiong Wang wrote:
>>
>> On 04/12/14 19:32, Jiong Wang wrote:
>>>
>>> On 04/12/14 11:07, Richard Biener wrote:
>>>
>>>> On Thu, Dec 4, 2014 at 12:07 PM, Richard Biener
>>>> <richard.guent...@gmail.com> wrote:
>>>>>
>>>>> On Thu, Dec 4, 2014 at 12:00 PM, Jiong Wang <jiong.w...@arm.com> wrote:
>>>>>
>>>>>
>>>>> which means re-associate the constant imm with the virtual frame
>>>>> pointer.
>>>>>
>>>>> transform
>>>>>
>>>>>         RA <- fixed_reg + RC
>>>>>         RD <- MEM (RA + const_offset)
>>>>>
>>>>>      into:
>>>>>
>>>>>         RA <- fixed_reg + const_offset
>>>>>         RD <- MEM (RA + RC)
>>>>>
>>>>> then RA <- fixed_reg + const_offset is actually loop invariant, so the
>>>>> later
>>>>> RTL GCSE PRE pass could catch it and do the hoisting, and thus
>>>>> ameliorate
>>>>> what tree
>>>>> level ivopts could not sort out.
>>>>> There is a LIM pass after gimple ivopts - if the invariantness is
>>>>> already
>>>>> visible there why not handle it there similar to the special-cases in
>>>>> rewrite_bittest and rewrite_reciprocal?
>>>
>>> maybe, needs further check.
>>>
>>>>> And of course similar tricks could be applied on the RTL level to
>>>>> RTL invariant motion?
>>>
>>> Thanks. I just checked the code, yes, loop invariant motion pass
>>> is the natural place to integrate such multi-insns invariant analysis
>>> trick.
>>>
>>> those code could be integrated into loop-invariant.c cleanly, but
>>> then I found although the invariant detected successfully but it's not
>>> moved
>>> out of loop because of cost issue, and looks like the patch committed to
>>> fix PR33928
>>> is trying to prevent such cheap address be hoisted to reduce register
>>> pressure.
>>>
>>>
>>>     805       /* ??? Try to determine cheapness of address computation.
>>> Unfortunately
>>>     806          the address cost is only a relative measure, we can't
>>> really compare
>>>     807          it with any absolute number, but only with other address
>>> costs.
>>>     808          But here we don't have any other addresses, so compare
>>> with a magic
>>>     809          number anyway.  It has to be large enough to not regress
>>> PR33928
>>>     810          (by avoiding to move reg+8,reg+16,reg+24 invariants),
>>> but small
>>>     811          enough to not regress 410.bwaves either (by still moving
>>> reg+reg
>>>     812          invariants).
>>>     813          See
>>> http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html .  */
>>>     814       inv->cheap_address = address_cost (SET_SRC (set),
>>> word_mode,
>>>     815                                          ADDR_SPACE_GENERIC,
>>> speed) < 3;
>>>
>>>
>>> I think that maybe necessary for x86 which is short of register, while
>>> for RISC,
>>> it may not be that necessary, especially the whole register pressure is
>>> not big.
>>>
>>> currently, what I could think of is for this transformation below, we
>>> should
>>> increase the costs:
>>>     A
>>> ==
>>>         RA <- virtual_stack_var + RC
>>>         RD <- MEM (RA + const_offset)
>>>
>>>      into:
>>>
>>> B
>>> ==
>>>         RA <- virtual_stack_var + const_offset  <--B
>>>         RD <- MEM (RA + RC)
>>>
>>> because the cost is not that cheap, if there is not re-assocation of
>>> virtual_stack_var
>>> with const_offset, then lra elimination will create another instruction
>>> to hold the
>>> elimination result, so format A will actually be
>>>
>>>         RT <- real_stack_pointer + elimination_offset
>>>         RA <- RT + RC
>>>         RD <- MEM (RA + const_offset)
>>>
>>> so, the re-assocation and later hoisting of invariant B could actually
>>> save two instructions
>>> in the loop, this is why there are 15% perf gap for bzip2 under some
>>> situation.
>>
>> updated patch.
>>
>> moved this instruction shuffling trick to rtl loop invariant pass.
>> as described above, this patch tries to transform A to B, so that
>> after the transformation:
>>
>>     * RA <- virtual_stack_var + const_offset  could be hoisted out of the
>> loop
>>     * easy the work of lra elimination as virtual_stack_var is associated
>> with const_offset
>>       that the elimination offset could be combined with const_offset
>> automatically.
>>
>> current rtl loop invariant pass treat "reg <- reg + off" as cheap address,
>> while although
>> "reg <- virtual_stack_var + offset" fall into the same format, but it's
>> not that cheap as
>> we could also save one lra elimination instruction. so this patch will
>> mark
>> "reg <- virtual_stack_var + offset" transformed from A to be expensive, so
>> that it could be
>> hoisted later.
>>
>> after patch, pr62173 is fixed on powerpc64, while *still not on aarch64*.
>> because there are one
>> glitch in aarch64_legitimize_address which cause unnecessary complex
>> instructions sequences
>> generated when legitimize some addresses. and if we fix that, we will get
>> cheaper address for
>> those cases which is generally good, and the cheaper address will cause
>> tree level IVOPT do
>> more IVOPT optimization which is generally good also, but from the speck2k
>> result, there
>> are actually around 1% code size regression on two cases, the reason is
>> for target support
>> post-index address, doing IVOPT may not always be the best choice because
>> we lost the advantage
>> of using post-index addressing.
>>
>> on aarch64, for the following testcase, the ivopted version is complexer
>> than not ivopted version.
>>
>>       while (oargc--) *(nargv++) = *(oargv++);
>>    so, I sent the generic fix here only, as it's an independent patch, and
>> could benefit other targets
>> like powerpc64 although the issue on aarch64 is still not resolved before
>> we figure out how to let
>> the correct fix on aarch64_legitimize_address do not cause code size
>> regression on benchmark caused
>> by sub-optimal tree level IVOPT.
>>
>> and the testcase is marked to run on powerpc64 only at current time.
>>
>> bootstrap OK on x86-64, no regression.
>> bootstrap OK on AArch64, and from speck2k compile dump, there do have a
>> few more RTL loop
>> invariants get hoisted.
>>
>> ok for trunk?
>>
>> gcc/
>> PR62173
>>     loop-invariant.c.c (expensive_addr): New hash_table.
>>     (need_expensive_addr_check_p): New bool.
>>     (find_exits): Rename to "find_exists_and_reshuffle.
>>     Support re-shuffle instructions for better loop invariant hoisting.
>>     (create_new_invariant): Mark address as expensive if it's generated by
>> re-shuffle.
>>     (init_inv_motion_data): Initialize expensive_addr and
>> need_expensive_addr_check_p.
>>     (free_inv_motion_data): Release expensive_addr.
>>
>> gcc/testssuite/
>>     PR62173
>>     gcc.dg/pr62173.c: New test.
>
>
> sorry, patch uploaded.


+             do
+               {
...
+               } while ((next_insn = next_real_insn (next_insn))
+                        && body[i] == BLOCK_FOR_INSN (next_insn));

ick.  I realize we don't have SSA form on RTL but doesn't DF provide
at least some help in looking up definition statements for pseudos?
In fact we want to restrict the transform to single-use pseudos, thus
hopefully it can at least tell us that... (maybe not and this is what
LOG_LINKS are for in combine...?)  At least loop-invariant alreadly
computes df_chain with DF_UD_CHAIN which seems exactly what
is needed (apart from maybe getting at single-use info).

The meat of this should be factored out to a separate function I guess.

Otherwise my RTL fu is too weak to review this properly.

Thanks,
Richard.

>
>>
>>>> Oh, and the patch misses a testcase.
>>>
>>> It's at the start of the email, will include in the patch.
>>>
>>> void bar(int i) {
>>>      char A[10];
>>>      int d = 0;
>>>      while (i > 0)
>>>      A[d++] = i--;
>>>
>>>      while (d > 0)
>>>      foo(A[d--]);
>>> }
>>>
>>>>> Thanks,
>>>>> Richard.
>>>>>
>>>>>> and this patch only tries to re-shuffle instructions within single
>>>>>> basic
>>>>>> block which
>>>>>> is a inner loop which is perf critical.
>>>>>>
>>>>>> I am reusing the loop info in fwprop because there is loop info and
>>>>>> it's run
>>>>>> before
>>>>>> GCSE.
>>>>>>
>>>>>> verified on aarch64 and mips64, the array base address hoisted out of
>>>>>> loop.
>>>>>>
>>>>>> bootstrap ok on x86-64 and aarch64.
>>>>>>
>>>>>> comments?
>>>>>>
>>>>>> thanks.
>>>>>>
>>>>>> gcc/
>>>>>>      PR62173
>>>>>>      fwprop.c (prepare_for_gcse_pre): New function.
>>>>>>      (fwprop_done): Call it.
>>>
>>>
>>
>>
>

Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting

Reply via email to