On 17/12/14 15:54, Richard Biener wrote:
On Mon, Dec 15, 2014 at 4:29 PM, Jiong Wang <jiong.w...@arm.com> wrote:
On 15/12/14 15:28, Jiong Wang wrote:
On 04/12/14 19:32, Jiong Wang wrote:
On 04/12/14 11:07, Richard Biener wrote:

On Thu, Dec 4, 2014 at 12:07 PM, Richard Biener
<richard.guent...@gmail.com> wrote:
On Thu, Dec 4, 2014 at 12:00 PM, Jiong Wang <jiong.w...@arm.com> wrote:


which means re-associate the constant imm with the virtual frame
pointer.

transform

         RA <- fixed_reg + RC
         RD <- MEM (RA + const_offset)

      into:

         RA <- fixed_reg + const_offset
         RD <- MEM (RA + RC)

then RA <- fixed_reg + const_offset is actually loop invariant, so the
later
RTL GCSE PRE pass could catch it and do the hoisting, and thus
ameliorate
what tree
level ivopts could not sort out.
There is a LIM pass after gimple ivopts - if the invariantness is
already
visible there why not handle it there similar to the special-cases in
rewrite_bittest and rewrite_reciprocal?
maybe, needs further check.

And of course similar tricks could be applied on the RTL level to
RTL invariant motion?
Thanks. I just checked the code, yes, loop invariant motion pass
is the natural place to integrate such multi-insns invariant analysis
trick.

those code could be integrated into loop-invariant.c cleanly, but
then I found although the invariant detected successfully but it's not
moved
out of loop because of cost issue, and looks like the patch committed to
fix PR33928
is trying to prevent such cheap address be hoisted to reduce register
pressure.


     805       /* ??? Try to determine cheapness of address computation.
Unfortunately
     806          the address cost is only a relative measure, we can't
really compare
     807          it with any absolute number, but only with other address
costs.
     808          But here we don't have any other addresses, so compare
with a magic
     809          number anyway.  It has to be large enough to not regress
PR33928
     810          (by avoiding to move reg+8,reg+16,reg+24 invariants),
but small
     811          enough to not regress 410.bwaves either (by still moving
reg+reg
     812          invariants).
     813          See
http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html .  */
     814       inv->cheap_address = address_cost (SET_SRC (set),
word_mode,
     815                                          ADDR_SPACE_GENERIC,
speed) < 3;


I think that maybe necessary for x86 which is short of register, while
for RISC,
it may not be that necessary, especially the whole register pressure is
not big.

currently, what I could think of is for this transformation below, we
should
increase the costs:
     A
==
         RA <- virtual_stack_var + RC
         RD <- MEM (RA + const_offset)

      into:

B
==
         RA <- virtual_stack_var + const_offset  <--B
         RD <- MEM (RA + RC)

because the cost is not that cheap, if there is not re-assocation of
virtual_stack_var
with const_offset, then lra elimination will create another instruction
to hold the
elimination result, so format A will actually be

         RT <- real_stack_pointer + elimination_offset
         RA <- RT + RC
         RD <- MEM (RA + const_offset)

so, the re-assocation and later hoisting of invariant B could actually
save two instructions
in the loop, this is why there are 15% perf gap for bzip2 under some
situation.
updated patch.

moved this instruction shuffling trick to rtl loop invariant pass.
as described above, this patch tries to transform A to B, so that
after the transformation:

     * RA <- virtual_stack_var + const_offset  could be hoisted out of the
loop
     * easy the work of lra elimination as virtual_stack_var is associated
with const_offset
       that the elimination offset could be combined with const_offset
automatically.

current rtl loop invariant pass treat "reg <- reg + off" as cheap address,
while although
"reg <- virtual_stack_var + offset" fall into the same format, but it's
not that cheap as
we could also save one lra elimination instruction. so this patch will
mark
"reg <- virtual_stack_var + offset" transformed from A to be expensive, so
that it could be
hoisted later.

after patch, pr62173 is fixed on powerpc64, while *still not on aarch64*.
because there are one
glitch in aarch64_legitimize_address which cause unnecessary complex
instructions sequences
generated when legitimize some addresses. and if we fix that, we will get
cheaper address for
those cases which is generally good, and the cheaper address will cause
tree level IVOPT do
more IVOPT optimization which is generally good also, but from the speck2k
result, there
are actually around 1% code size regression on two cases, the reason is
for target support
post-index address, doing IVOPT may not always be the best choice because
we lost the advantage
of using post-index addressing.

on aarch64, for the following testcase, the ivopted version is complexer
than not ivopted version.

       while (oargc--) *(nargv++) = *(oargv++);
    so, I sent the generic fix here only, as it's an independent patch, and
could benefit other targets
like powerpc64 although the issue on aarch64 is still not resolved before
we figure out how to let
the correct fix on aarch64_legitimize_address do not cause code size
regression on benchmark caused
by sub-optimal tree level IVOPT.

and the testcase is marked to run on powerpc64 only at current time.

bootstrap OK on x86-64, no regression.
bootstrap OK on AArch64, and from speck2k compile dump, there do have a
few more RTL loop
invariants get hoisted.

ok for trunk?

gcc/
PR62173
     loop-invariant.c.c (expensive_addr): New hash_table.
     (need_expensive_addr_check_p): New bool.
     (find_exits): Rename to "find_exists_and_reshuffle.
     Support re-shuffle instructions for better loop invariant hoisting.
     (create_new_invariant): Mark address as expensive if it's generated by
re-shuffle.
     (init_inv_motion_data): Initialize expensive_addr and
need_expensive_addr_check_p.
     (free_inv_motion_data): Release expensive_addr.

gcc/testssuite/
     PR62173
     gcc.dg/pr62173.c: New test.

sorry, patch uploaded.
+             do
+               {
...
+               } while ((next_insn = next_real_insn (next_insn))
+                        && body[i] == BLOCK_FOR_INSN (next_insn));

ick.  I realize we don't have SSA form on RTL but doesn't DF provide
at least some help in looking up definition statements for pseudos?
In fact we want to restrict the transform to single-use pseudos, thus
hopefully it can at least tell us that... (maybe not and this is what
LOG_LINKS are for in combine...?)  At least loop-invariant alreadly
computes df_chain with DF_UD_CHAIN which seems exactly what
is needed (apart from maybe getting at single-use info).

thanks very much for these inspiring questions.

yes, we want to restrict the transformation on single-use pseudo only,
and it's better the transformation could re-use existed info and helper
function to avoid increase compile time. but I haven't found anything I
can reuse at the stage the transformation happen.

the info similar as LOG_LINKS is what I want, but maybe simpler. I'd study
the code about build LOG_LINKS, and try to see if we can do some factor out.

thanks.

Regards,
Jiong


The meat of this should be factored out to a separate function I guess.

Otherwise my RTL fu is too weak to review this properly.

Thanks,
Richard.

Oh, and the patch misses a testcase.
It's at the start of the email, will include in the patch.

void bar(int i) {
      char A[10];
      int d = 0;
      while (i > 0)
      A[d++] = i--;

      while (d > 0)
      foo(A[d--]);
}

Thanks,
Richard.

and this patch only tries to re-shuffle instructions within single
basic
block which
is a inner loop which is perf critical.

I am reusing the loop info in fwprop because there is loop info and
it's run
before
GCSE.

verified on aarch64 and mips64, the array base address hoisted out of
loop.

bootstrap ok on x86-64 and aarch64.

comments?

thanks.

gcc/
      PR62173
      fwprop.c (prepare_for_gcse_pre): New function.
      (fwprop_done): Call it.




Reply via email to