Richard Sandiford <richard.sandif...@arm.com> writes:
> Richard Biener <rguent...@suse.de> writes:
>> On Wed, 2 Apr 2025, Richard Sandiford wrote:
>>
>>> Richard Biener <rguent...@suse.de> writes:
>>> > On Tue, 1 Apr 2025, Richard Sandiford wrote:
>>> >
>>> >> The problem in PR101523 was that, after each successful 2->2
>>> >> combination attempt, distribute_links would search further and
>>> >> further for the next combinable use of the i2 destination.
>>> >> The original patch for the PR dealt with that by disallowing such
>>> >> combinations.  However, that led to various optimisation regressions,
>>> >> so there was interest in allowing the combinations again, at least
>>> >> until an alternative way of getting the same results is in place.
>>> >>
>>> >> This patch does that, but limits distribute_links to a certain number
>>> >> of instructions when i2 is unchanged.  As Segher said in the PR trail,
>>> >> it would make more conceptual sense to apply the limit unconditionally,
>>> >> but I thought it would be better to change as little as possible at
>>> >> this development stage.  Logically, in stage 1, the --param should
>>> >> be applied directly by distribute_links with no input from callers.
>>> >>
>>> >> As I mentioned in:
>>> >>
>>> >>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116398#c28
>>> >>
>>> >> I think it's safe to drop log links even if a use exists.  All
>>> >> processing of log links seems to handle the absence of a link
>>> >> for a particular register in a conservative way.
>>> >>
>>> >> The initial set-up errs on the side of dropping links, since for example
>>> >> create_log_links has:
>>> >>
>>> >>       /* flow.c claimed:
>>> >>
>>> >>          We don't build a LOG_LINK for hard registers contained
>>> >>          in ASM_OPERANDs.  If these registers get replaced,
>>> >>          we might wind up changing the semantics of the insn,
>>> >>          even if reload can make what appear to be valid
>>> >>          assignments later.  */
>>> >>       if (regno < FIRST_PSEUDO_REGISTER
>>> >>           && asm_noperands (PATTERN (use_insn)) >= 0)
>>> >>         continue;
>>> >>
>>> >> which excludes combinations by dropping log links, rather than during
>>> >> try_combine.  And:
>>> >>
>>> >>   /* If this register is being initialized using itself, and the
>>> >>      register is uninitialized in this basic block, and there are
>>> >>      no LOG_LINKS which set the register, then part of the
>>> >>      register is uninitialized.  In that case we can't assume
>>> >>      anything about the number of nonzero bits.
>>> >>
>>> >>      ??? We could do better if we checked this in
>>> >>      reg_{nonzero_bits,num_sign_bit_copies}_for_combine.  Then we
>>> >>      could avoid making assumptions about the insn which initially
>>> >>      sets the register, while still using the information in other
>>> >>      insns.  We would have to be careful to check every insn
>>> >>      involved in the combination.  */
>>> >>
>>> >>   if (insn
>>> >>       && reg_referenced_p (x, PATTERN (insn))
>>> >>       && !REGNO_REG_SET_P (DF_LR_IN (BLOCK_FOR_INSN (insn)),
>>> >>                            REGNO (x)))
>>> >>     {
>>> >>       struct insn_link *link;
>>> >>
>>> >>       FOR_EACH_LOG_LINK (link, insn)
>>> >>         if (dead_or_set_p (link->insn, x))
>>> >>           break;
>>> >>       if (!link)
>>> >>         {
>>> >>           rsp->nonzero_bits = GET_MODE_MASK (mode);
>>> >>           rsp->sign_bit_copies = 1;
>>> >>           return;
>>> >>         }
>>> >>     }
>>> >>
>>> >> treats the lack of a log link as a possible sign of uninitialised data,
>>> >> but that would be a missed optimisation rather than a correctness issue.
>>> >>
>>> >> One question is what the default --param value should be.  I went with
>>> >> Jakub's suggestion of 3000 from:
>>> >>
>>> >>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116398#c25
>>> >>
>>> >> Also, to answer Jakub's question in that comment, I tried bisecting:
>>> >>
>>> >>   int limit = atoi (getenv ("BISECT"));
>>> >>
>>> >> (so applying the limit for all calls from try_combine) with an
>>> >> abort in distribute_links if the limit caused a link to be skipped.
>>> >> The minimum BISECT value that allowed an aarch64-linux-gnu bootstrap
>>> >> to succeed with --enable-languages=all --enable-checking=yes,rtl,extra
>>> >> was 142, so much lower than the parameter value.  I realised too late
>>> >> that --enable-checking=release would probably have been a more
>>> >> interesting test.
>>> >>
>>> >> Some of the try_combine changes come from Richi's patch in:
>>> >>
>>> >>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523#c53
>>> >>
>>> >> Bootstrapped & regression-tested on aarch64-linux-gnu,
>>> >> powerpc64le-linux-gnu and x86_64-linux-gnu.  OK to install?
>>> >>
>>> >> Richard
>>> >>
>>> >>
>>> >> gcc/
>>> >>         PR testsuite/116398
>>> >>         * params.opt (-param=max-combine-search-insns=): New param.
>>> >>         * doc/invoke.texi: Document it.
>>> >>         * combine.cc (distribute_links): Add a limit parameter.
>>> >>         (try_combine): Use the new param to limit distribute_links
>>> >>         when i2 is unchanged.  Return i3 rather than i2 if i2 is unchanged.
>>> >>
>>> >> gcc/testsuite/
>>> >>         * gcc.target/aarch64/popcnt-le-1.c: Account for commutativity of TST.
>>> >>         * gcc.target/aarch64/popcnt-le-3.c: Likewise AND.
>>> >>         * gcc.target/aarch64/sve/pred-not-gen-1.c: Revert previous patch.
>>> >>         * gcc.target/aarch64/sve/pred-not-gen-4.c: Likewise.
>>> >>         * gcc.target/aarch64/sve/var_stride_2.c: Likewise.
>>> >>         * gcc.target/aarch64/sve/var_stride_4.c: Likewise.
>>> >>
>>> >> Co-authored-by: Richard Biener <rguent...@suse.de>
>>> >> ---
>>> >>  gcc/combine.cc                              | 30 +++++++++----------
>>> >>  gcc/doc/invoke.texi                         |  9 ++++++
>>> >>  gcc/params.opt                              |  4 +++
>>> >>  .../gcc.target/aarch64/popcnt-le-1.c        |  4 +--
>>> >>  .../gcc.target/aarch64/popcnt-le-3.c        |  4 +--
>>> >>  gcc/testsuite/gcc.target/aarch64/pr100056.c |  4 ++-
>>> >>  .../gcc.target/aarch64/sve/pred-not-gen-1.c |  4 +--
>>> >>  .../gcc.target/aarch64/sve/pred-not-gen-4.c |  4 +--
>>> >>  .../gcc.target/aarch64/sve/var_stride_2.c   |  3 +-
>>> >>  .../gcc.target/aarch64/sve/var_stride_4.c   |  3 +-
>>> >>  10 files changed, 42 insertions(+), 27 deletions(-)
>>> >>
>>> >> diff --git a/gcc/combine.cc b/gcc/combine.cc
>>> >> index ef13f5d5e90..28403d1f9e0 100644
>>> >> --- a/gcc/combine.cc
>>> >> +++ b/gcc/combine.cc
>>> >> @@ -472,7 +472,7 @@ static void move_deaths (rtx, rtx, int, rtx_insn *, rtx *);
>>> >>  static bool reg_bitfield_target_p (rtx, rtx);
>>> >>  static void distribute_notes (rtx, rtx_insn *, rtx_insn *, rtx_insn *,
>>> >>                                rtx, rtx, rtx);
>>> >> -static void distribute_links (struct insn_link *);
>>> >> +static void distribute_links (struct insn_link *, int limit = INT_MAX);
>>> >>  static void mark_used_regs_combine (rtx);
>>> >>  static void record_promoted_value (rtx_insn *, rtx);
>>> >>  static bool unmentioned_reg_p (rtx, rtx);
>>> >> @@ -4208,16 +4208,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>>> >>        adjust_for_new_dest (i3);
>>> >>      }
>>> >>
>>> >> -  /* If I2 didn't change, this is not a combination (but a simplification or
>>> >> -     canonicalisation with context), which should not be done here.  Doing
>>> >> -     it here explodes the algorithm.  Don't.  */
>>> >> -  if (rtx_equal_p (newi2pat, PATTERN (i2)))
>>> >> -    {
>>> >> -      if (dump_file)
>>> >> -        fprintf (dump_file, "i2 didn't change, not doing this\n");
>>> >> -      undo_all ();
>>> >> -      return 0;
>>> >> -    }
>>> >> +  bool i2_unchanged = rtx_equal_p (newi2pat, PATTERN (i2));
>>> >>
>>> >>    /* We now know that we can do this combination.  Merge the insns and
>>> >>       update the status of registers and LOG_LINKS.  */
>>> >> @@ -4593,10 +4584,11 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>>> >>                            NULL_RTX, NULL_RTX, NULL_RTX);
>>> >>      }
>>> >>
>>> >> -  distribute_links (i3links);
>>> >> -  distribute_links (i2links);
>>> >> -  distribute_links (i1links);
>>> >> -  distribute_links (i0links);
>>> >> +  int limit = i2_unchanged ? param_max_combine_search_insns : INT_MAX;
>>> >> +  distribute_links (i3links, limit);
>>> >> +  distribute_links (i2links, limit);
>>> >> +  distribute_links (i1links, limit);
>>> >> +  distribute_links (i0links, limit);
>>> >
>>> > I'm not sure I understand how the distribute_links mixes in with
>>> > the change below.
>>> >
>>> > The original issue reported in the PR was that returning i2
>>> > caused every insn between i2 and i3 to be re-processed by the
>>> > main loop, re-trying previously failed combine attempts and
>>> > producing garbage.
>>> >
>>> > The disadvantage of directly returning i3 is that if added_links_insn
>>> > was in between i2 and i3, we miss re-trying an insn with changed
>>> > links.
>>>
>>> Can that happen though?  Log links are supposed to be attached to the
>>> next use of the register (only), so I wouldn't expect the next use to be
>>> before i3 (putatively the previous first use).
>>
>> But log-links on i2 are for uses in i2 for defs before i2; given
>> i0/i1/i2 changed/have gone away, we adjust those to eventually end
>> up on insns between i2 and i3 (or indeed after i3).  Combine then
>> wants to try combining the insns with changed log-links.
>
> Right.  But I meant in the case where i2 hasn't changed, which
> I thought was the case in contention.  If i2 hasn't changed then
> its uses are the same, and so there is no need to distribute its
> log links.
>
>>> > But what you do now also limits the insn walk in distribute_links
>>> > (IMO a good thing on its own), but doesn't take advantage of that
>>> > when i2_unchanged (distributing the link to insns between i2 and i3
>>> > is useless in that case).
>>>
>>> Yeah, it's true that distribute_links could potentially start the
>>> search at a better place.  But that's IMO a third, somewhat independent
>>> optimisation.  And it seems more dangerous than the proposed patch,
>>> since if it turns out that the next use is before i3 in some current
>>> or future circumstances, we could end up generating wrong code.
>>>
>>> > Ideally we'd distribute links to insns between i2 and i3 in a backward
>>> > walk from i3, but pick up the first use if within the limit, as we
>>> > want to limit the number of insns we reprocess in the try_combine
>>> > caller's loop.
>>> >
>>> > Really ideally we'd want to remember all added_links_insn and only
>>> > re-process those, not all other unaffected insns between i2 and i3
>>> > (or earlier).
>>> >
>>> > That said, your change looks like two independent changes?  IIRC Jakub
>>> > at some point proposed to limit the 'return i3' below to cases where
>>> > i2 is too far away or sth like that.
>>>
>>> Yes, it's two independent changes.  The return i3 one is taken directly
>>> from your comment in:
>>>
>>>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523#c53
>>>
>>> I think that's the right thing to do, for the reason you explained there
>>> and above, so I didn't want to lose it.  But like you say in the PR,
>>> combine did still take a large slice of time with that change in isolation.
>>>
>>> Jakub discussed the idea of augmenting that change with one that makes
>>> try_combine check luids.  The distribute_links part of the patch is
>>> supposed to be an alternative to that.
>>
>> I see.  I understood Jakub thinks my change is too aggressive since
>> it does indeed miss combine attempts from changed log-links.
>>
>> I remember that when allowing added_links_insn to override the
>> i3 return for i2_unchanged_p the bad compile-time resurfaced.  So
>> now I wonder whether, if i2_unchanged, distribute_links (i2links)
>> does (should do?) anything?  I don't see anything that would avoid
>> adjusting links when there's still a use in newi2pat (IIRC
>> i0 and i1 are always gone?).
>
> Yeah, that's also what I meant.  But, once the "return i3" is in place,
> the remaining large compile time partly comes from distributing i3's link
> for i2's destination, and I suppose from the general "walk all instructions
> between A and B" that try_combine uses to check dataflow, with A and B
> getting gradually further apart.
>
> FWIW, with:
>
> diff --git a/gcc/combine.cc b/gcc/combine.cc
> index ef13f5d5e90..e12a78c8d89 100644
> --- a/gcc/combine.cc
> +++ b/gcc/combine.cc
> @@ -4208,16 +4208,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>        adjust_for_new_dest (i3);
>      }
>
> -  /* If I2 didn't change, this is not a combination (but a simplification or
> -     canonicalisation with context), which should not be done here.  Doing
> -     it here explodes the algorithm.  Don't.  */
> -  if (rtx_equal_p (newi2pat, PATTERN (i2)))
> -    {
> -      if (dump_file)
> -        fprintf (dump_file, "i2 didn't change, not doing this\n");
> -      undo_all ();
> -      return 0;
> -    }
> +  bool only_i3_changed = !i0 && !i1 && rtx_equal_p (newi2pat, PATTERN (i2));
>
>    /* We now know that we can do this combination.  Merge the insns and
>       update the status of registers and LOG_LINKS.  */
> @@ -4785,6 +4776,9 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>    combine_successes++;
>    undo_commit ();
>
> +  if (only_i3_changed)
> +    return i3;
> +
>    rtx_insn *ret = newi2pat ? i2 : i3;
>    if (added_links_insn && DF_INSN_LUID (added_links_insn) < DF_INSN_LUID (ret))
>      ret = added_links_insn;
>
> so essentially your patch, I see an 11.9x improvement in combine's
> compile time in the original testcase for PR101523, going from 88%
> to 41% of total compile time.  Memory usage improves 110x, from
> 97% to 25%.  The output is unchanged.
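As an aside, in case it helps anyone skimming the thread: both variants are
really about the forward walk that distribute_links does to find the next use
of a register.  The first patch above bounds how many insns that walk may
visit; the second patch quoted below starts the walk at a point the next use
cannot precede.  A deliberately simplified, self-contained toy model of those
two ideas (all names invented; this is not the combine.cc code):

  /* Toy model only, not GCC code.  Each "insn" sets one register number
     and uses one register number.  find_next_use models the walk that
     distribute_links does when moving a LOG_LINK to the next use of a
     register: LIMIT bounds how many insns it will look at, and START,
     if nonnull, is a known lower bound on where that use can be.  */

  #include <stdio.h>

  struct toy_insn
  {
    int uid;                   /* instruction number */
    int sets;                  /* register this insn sets */
    int uses;                  /* register this insn uses */
    struct toy_insn *next;     /* next insn in the block */
  };

  static struct toy_insn *
  find_next_use (struct toy_insn *after, struct toy_insn *start,
                 int regno, int limit)
  {
    struct toy_insn *insn = start ? start : after->next;
    for (; insn && limit > 0; insn = insn->next, --limit)
      if (insn->uses == regno)
        return insn;
    /* Not found within the limit: the caller simply drops the link,
       which is safe but may miss a later combination.  */
    return NULL;
  }

  int
  main (void)
  {
    struct toy_insn i3 = { 3, 12, 11, NULL };
    struct toy_insn i2 = { 2, 11, 10, &i3 };
    struct toy_insn i1 = { 1, 10, 0, &i2 };

    struct toy_insn *a = find_next_use (&i1, NULL, 10, 3000);
    struct toy_insn *b = find_next_use (&i2, &i3, 11, 3000);
    printf ("next use of r10: insn %d, next use of r11: insn %d\n",
            a ? a->uid : -1, b ? b->uid : -1);
    return 0;
  }

The point is just that redistributing a link then costs at most LIMIT insns
and never re-scans insns before START, instead of potentially walking to the
end of the basic block.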
> Adding:
>
> diff --git a/gcc/combine.cc b/gcc/combine.cc
> index e12a78c8d89..becfd0a65ff 100644
> --- a/gcc/combine.cc
> +++ b/gcc/combine.cc
> @@ -472,7 +472,7 @@ static void move_deaths (rtx, rtx, int, rtx_insn *, rtx *);
>  static bool reg_bitfield_target_p (rtx, rtx);
>  static void distribute_notes (rtx, rtx_insn *, rtx_insn *, rtx_insn *,
>                                rtx, rtx, rtx);
> -static void distribute_links (struct insn_link *);
> +static void distribute_links (struct insn_link *, rtx_insn * = nullptr);
>  static void mark_used_regs_combine (rtx);
>  static void record_promoted_value (rtx_insn *, rtx);
>  static bool unmentioned_reg_p (rtx, rtx);
> @@ -4209,6 +4209,8 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>      }
>
>    bool only_i3_changed = !i0 && !i1 && rtx_equal_p (newi2pat, PATTERN (i2));
> +  if (only_i3_changed)
> +    swap_i2i3 = false;

Oops, this was supposed to be split_i2i3.  With that fixed...

>
>    /* We now know that we can do this combination.  Merge the insns and
>       update the status of registers and LOG_LINKS.  */
> @@ -4584,8 +4586,8 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>                            NULL_RTX, NULL_RTX, NULL_RTX);
>      }
>
> -  distribute_links (i3links);
> -  distribute_links (i2links);
> +  distribute_links (i3links, only_i3_changed ? i3 : nullptr);
> +  distribute_links (i2links, !i0 && !i1 ? i2 : nullptr);
>    distribute_links (i1links);
>    distribute_links (i0links);
>
> @@ -14978,10 +14980,13 @@ distribute_notes (rtx notes, rtx_insn *from_insn, rtx_insn *i3, rtx_insn *i2,
>
>  /* Similarly to above, distribute the LOG_LINKS that used to be present on
>     I3, I2, and I1 to new locations.  This is also called to add a link
> -   pointing at I3 when I3's destination is changed.  */
> +   pointing at I3 when I3's destination is changed.
> +
> +   If START is nonnull and an insn, we know that the next location for
> +   each link is no earlier than START.  */
>
>  static void
> -distribute_links (struct insn_link *links)
> +distribute_links (struct insn_link *links, rtx_insn *start)
>  {
>    struct insn_link *link, *next_link;
>
> @@ -15047,7 +15052,10 @@ distribute_links (struct insn_link *links)
>       I3 to I2.  Also note that not much searching is typically done here
>       since most links don't point very far away.  */
>
> -      for (insn = NEXT_INSN (link->insn);
> +      insn = start;
> +      if (!insn || NOTE_P (insn))
> +        insn = NEXT_INSN (link->insn);
> +      for (;
>            (insn && (this_basic_block->next_bb == EXIT_BLOCK_PTR_FOR_FN (cfun)
>                      || BB_HEAD (this_basic_block->next_bb) != insn));
>            insn = NEXT_INSN (insn))
>
> improves compile time a further 17.7%, going from 41% to 33% of

[I meant 37%, gah]

> compile time.

...the improvement becomes 34.6%, going from 41% to 34%.

Thanks,
Richard
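P.S. For anyone who wants to experiment with the limit once this is in: it is
an ordinary --param, so adding something like
--param=max-combine-search-insns=142 (the bisected minimum from the original
post; the proposed default is 3000) to the usual options should be enough to
try a much stricter cut-off on a testcase of interest.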