https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
--- Comment #31 from rguenther at suse dot de <rguenther at suse dot de> --- On Thu, 8 Mar 2018, aldyh at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359 > > Aldy Hernandez <aldyh at gcc dot gnu.org> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > Assignee|unassigned at gcc dot gnu.org |aldyh at gcc dot > gnu.org > > --- Comment #30 from Aldy Hernandez <aldyh at gcc dot gnu.org> --- > Created attachment 43597 > --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43597&action=edit > untested patch implementing suggestion in comment 26 > > The attached untested patch attempts to implement the suggestion in comment 26 > of replacing the out-of-loop pre-inc with post-inc values. > > Richi, is this more or less what you had in mind? Yes. > Assuming this: > > LOOP: > # p_8 = PHI <p_16(2), p_19(3)> > ... > p_19 = p_8 + 4294967295; > goto LOOP: > > The patch replaces: > p_22 = p_8 + 4294967294; > MEM[(char *)p_19 + 4294967295B] = 45; > into: > p_22 = p_19 + 4294967295; > *p_22 = 45; > > This allows the backend to use auto-dec in two places: > > strb r1, [r4, #-1]! > ... > strblt r3, [r4, #-1]! > > ...reducing the byte count from 116 to 104, but just shy of the 96 needed to > eliminate the regression. I will discuss the missing bytes in a follow-up > comment, as they are unrelated to this IV adjustment patch. > > It is worth noting that x86 also benefits from a reduction of 3 bytes with > this > patch, as we remove 2 lea instructions: one within the loop, and one before > returning. Thus, I believe this is a regression across the board, or at least > in multiple architectures. > > A few comments... > > While I see the benefit of hijacking insert_backedge_copies() for this, I am > not a big fan of changing the IL after the last tree dump (*t.optimized), as > the modified IL would only be visible in *r.expand. Could we perhaps move > this > to another spot? Say after the last forwprop pass, or perhaps right before > expand? Or perhaps have a *t.final dump right before expand? I don't see a big problem here - but yes, for example doing it during uncprop would be possible as well (moving all of insert_backedge_copies then). I'd not do this at this point though. > As mentioned, this is only a proof of concept. I made the test rather > restrictive. I suppose we could relax the conditions and generalize it a > bit. > There are comments throughout showing what I had in mind. I'd have not restricted the out-of-loop IV use to IV +- CST but instead did the transform + LOOP: + # p_8 = PHI <p_16(2), p_INC(3)> + ... + p_INC = p_8 - 1; + goto LOOP; + ... p_8 uses ... to + LOOP: + # p_8 = PHI <p_16(2), p_INC(3)> + ... + p_INC = p_8 - 1; + goto LOOP; newtem_12 = p_INC + 1; // undo IV increment ... p_8 out-of-loop p_8 uses replaced with newtem_12 ... so it would always work if we can undo the IV increment. The disadvantage might be that we then rely on RTL optimizations to combine the original out-of-loop constant add with the newtem computation but I guess that's not too much to ask ;) k