[Bug target/70359] [6/7/8 Regression] Code size increase for x86/ARM/others compared to gcc-5.3.0

rguenther at suse dot de Thu, 08 Mar 2018 02:32:53 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359


--- Comment #31 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 8 Mar 2018, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> 
> Aldy Hernandez <aldyh at gcc dot gnu.org> changed:
> 
>            What    |Removed                       |Added
> ----------------------------------------------------------------------------
>            Assignee|unassigned at gcc dot gnu.org |aldyh at gcc dot gnu.org
> 
> --- Comment #30 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> Created attachment 43597
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43597&action=edit
> untested patch implementing suggestion in comment 26
> 
> The attached untested patch attempts to implement the suggestion in comment 26
> of replacing the out-of-loop pre-inc with post-inc values.
> 
> Richi, is this more or less what you had in mind?

Yes.

> Assuming this:
> 
> LOOP:
>   # p_8 = PHI <p_16(2), p_19(3)>
>   ...
>   p_19 = p_8 + 4294967295;
>   goto LOOP;
> 
> The patch replaces:
>   p_22 = p_8 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;
> with:
>   p_22 = p_19 + 4294967295;
>   *p_22 = 45;
> 
> This allows the backend to use auto-dec in two places:
> 
> strb    r1, [r4, #-1]!
> ...
> strblt  r3, [r4, #-1]!
> 
> ...reducing the byte count from 116 to 104, but just shy of the 96 needed to
> eliminate the regression.  I will discuss the missing bytes in a follow-up
> comment, as they are unrelated to this IV adjustment patch.
> 
> It is worth noting that x86 also benefits from a reduction of 3 bytes with
> this patch, as we remove 2 lea instructions: one within the loop, and one
> before returning.  Thus, I believe this is a regression across the board, or
> at least in multiple architectures.
> 
> A few comments...
> 
> While I see the benefit of hijacking insert_backedge_copies() for this, I am
> not a big fan of changing the IL after the last tree dump (*t.optimized), as
> the modified IL would only be visible in *r.expand.  Could we perhaps move
> this to another spot?  Say after the last forwprop pass, or perhaps right
> before expand?  Or perhaps have a *t.final dump right before expand?

I don't see a big problem here - but yes, for example doing it
during uncprop would be possible as well (moving all of
insert_backedge_copies then).  I'd not do this at this point though.

> As mentioned, this is only a proof of concept.  I made the test rather
> restrictive.  I suppose we could relax the conditions and generalize it a
> bit.  There are comments throughout showing what I had in mind.

I would not have restricted the out-of-loop IV use to IV +- CST but
would instead have done the transform

+   LOOP:
+     # p_8 = PHI <p_16(2), p_INC(3)>
+     ...
+     p_INC = p_8 - 1;
+     goto LOOP;
+     ... p_8 uses ...

to

+   LOOP:
+     # p_8 = PHI <p_16(2), p_INC(3)>
+     ...
+     p_INC = p_8 - 1;
+     goto LOOP;
      newtem_12 = p_INC + 1; // undo IV increment
      ... out-of-loop p_8 uses replaced with newtem_12 ...

so it would always work if we can undo the IV increment.
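
As a rough source-level illustration of the two forms above (not taken
from the patch; the function names use_pre_inc and use_post_inc, the
loop body and the trip count n are all made up for this sketch), the
out-of-loop use can be written against either the pre-decrement value
or the post-decrement value with the decrement undone:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical analogue of the first form: the value used after the
   loop is p_8, the IV value from before the last decrement.  */
char *use_pre_inc(char *p_16, size_t n)
{
    assert(n >= 1);             /* the sketch assumes at least one iteration */
    char *p_8 = p_16;           /* p_8 = PHI <p_16, p_inc> */
    char *p_inc;
    for (;;) {
        p_inc = p_8 - 1;        /* IV decrement */
        *p_inc = 'x';           /* stand-in for the loop body */
        if (--n == 0)
            break;
        p_8 = p_inc;            /* backedge */
    }
    return p_8;                 /* out-of-loop use of the pre-decrement value */
}

/* Hypothetical analogue of the second form: only p_inc is needed after
   the loop, and the old p_8 use is rebuilt by undoing the decrement.  */
char *use_post_inc(char *p_16, size_t n)
{
    assert(n >= 1);
    char *p_8 = p_16;
    char *p_inc;
    for (;;) {
        p_inc = p_8 - 1;
        *p_inc = 'x';
        if (--n == 0)
            break;
        p_8 = p_inc;
    }
    char *newtem = p_inc + 1;   /* undo IV increment */
    return newtem;              /* replaces the out-of-loop p_8 use */
}
```

Both functions return the same pointer for any n >= 1; the second one
simply no longer needs p_8 to be live after the loop exit.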

The disadvantage might be that we then rely on RTL optimizations
to combine the original out-of-loop constant add with the
newtem computation but I guess that's not too much to ask ;)
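
To spell out the combination being relied on, here is a small sketch
that assumes 32-bit wrap-around arithmetic, so that the constants
4294967295 and 4294967294 in the quoted dump act as -1 and -2.  The
function names are hypothetical and only model the offsets, not the
actual RTL:

```c
#include <stdint.h>

/* newtem = p_INC + 1 followed by the original out-of-loop add of -2.  */
uint32_t offset_uncombined(uint32_t p_inc)
{
    uint32_t newtem = p_inc + 1u;      /* undo IV increment */
    return newtem + 4294967294u;       /* original p_8 + (-2) */
}

/* What the two adds should fold into: a single add of -1.  */
uint32_t offset_combined(uint32_t p_inc)
{
    return p_inc + 4294967295u;        /* p_INC + (-1) */
}
```

The two functions agree for every input, which is the same result as
the p_22 = p_19 + 4294967295 form in the quoted example.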

[Bug target/70359] [6/7/8 Regression] Code size increase for x86/ARM/others compared to gcc-5.3.0
