Richard Biener <rguent...@suse.de> writes:
> On Wed, 2 Apr 2025, Richard Sandiford wrote:
>
>> Richard Biener <rguent...@suse.de> writes:
>> > On Tue, 1 Apr 2025, Richard Sandiford wrote:
>> >
>> >> The problem in PR101523 was that, after each successful 2->2
>> >> combination attempt, distribute_links would search further and
>> >> further for the next combinable use of the i2 destination.
>> >> The original patch for the PR dealt with that by disallowing such
>> >> combinations.  However, that led to various optimisation regressions,
>> >> so there was interest in allowing the combinations again, at least
>> >> until an alternative way of getting the same results is in place.
>> >>
>> >> This patch does that, but limits distribute_links to a certain number
>> >> of instructions when i2 is unchanged.  As Segher said in the PR trail,
>> >> it would make more conceptual sense to apply the limit unconditionally,
>> >> but I thought it would be better to change as little as possible at
>> >> this development stage.  Logically, in stage 1, the --param should
>> >> be applied directly by distribute_links with no input from callers.
>> >>
>> >> As I mentioned in:
>> >>
>> >>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116398#c28
>> >>
>> >> I think it's safe to drop log links even if a use exists.  All
>> >> processing of log links seems to handle the absence of a link
>> >> for a particular register in a conservative way.
>> >>
>> >> The initial set-up errs on the side of dropping links, since for
>> >> example create_log_links has:
>> >>
>> >>       /* flow.c claimed:
>> >>
>> >>          We don't build a LOG_LINK for hard registers contained
>> >>          in ASM_OPERANDs.  If these registers get replaced,
>> >>          we might wind up changing the semantics of the insn,
>> >>          even if reload can make what appear to be valid
>> >>          assignments later.  */
>> >>       if (regno < FIRST_PSEUDO_REGISTER
>> >>           && asm_noperands (PATTERN (use_insn)) >= 0)
>> >>         continue;
>> >>
>> >> which excludes combinations by dropping log links, rather than during
>> >> try_combine.  And:
>> >>
>> >>   /* If this register is being initialized using itself, and the
>> >>      register is uninitialized in this basic block, and there are
>> >>      no LOG_LINKS which set the register, then part of the
>> >>      register is uninitialized.  In that case we can't assume
>> >>      anything about the number of nonzero bits.
>> >>
>> >>      ??? We could do better if we checked this in
>> >>      reg_{nonzero_bits,num_sign_bit_copies}_for_combine.  Then we
>> >>      could avoid making assumptions about the insn which initially
>> >>      sets the register, while still using the information in other
>> >>      insns.  We would have to be careful to check every insn
>> >>      involved in the combination.  */
>> >>
>> >>   if (insn
>> >>       && reg_referenced_p (x, PATTERN (insn))
>> >>       && !REGNO_REG_SET_P (DF_LR_IN (BLOCK_FOR_INSN (insn)),
>> >>                            REGNO (x)))
>> >>     {
>> >>       struct insn_link *link;
>> >>
>> >>       FOR_EACH_LOG_LINK (link, insn)
>> >>         if (dead_or_set_p (link->insn, x))
>> >>           break;
>> >>       if (!link)
>> >>         {
>> >>           rsp->nonzero_bits = GET_MODE_MASK (mode);
>> >>           rsp->sign_bit_copies = 1;
>> >>           return;
>> >>         }
>> >>     }
>> >>
>> >> treats the lack of a log link as a possible sign of uninitialised data,
>> >> but that would be a missed optimisation rather than a correctness issue.
>> >>
>> >> One question is what the default --param value should be.  I went with
>> >> Jakub's suggestion of 3000 from:
>> >>
>> >>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116398#c25
>> >>
>> >> Also, to answer Jakub's question in that comment, I tried bisecting:
>> >>
>> >>   int limit = atoi (getenv ("BISECT"));
>> >>
>> >> (so applying the limit for all calls from try_combine) with an
>> >> abort in distribute_links if the limit caused a link to be skipped.
>> >> The minimum BISECT value that allowed an aarch64-linux-gnu bootstrap
>> >> to succeed with --enable-languages=all --enable-checking=yes,rtl,extra
>> >> was 142, so much lower than the parameter value.  I realised too late
>> >> that --enable-checking=release would probably have been a more
>> >> interesting test.
>> >>
>> >> Some of the try_combine changes come from Richi's patch in:
>> >>
>> >>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523#c53
>> >>
>> >> Bootstrapped & regression-tested on aarch64-linux-gnu,
>> >> powerpc64le-linux-gnu and x86_64-linux-gnu.  OK to install?
>> >>
>> >> Richard
>> >>
>> >>
>> >> gcc/
>> >>   PR testsuite/116398
>> >>   * params.opt (-param=max-combine-search-insns=): New param.
>> >>   * doc/invoke.texi: Document it.
>> >>   * combine.cc (distribute_links): Add a limit parameter.
>> >>   (try_combine): Use the new param to limit distribute_links
>> >>   when i2 is unchanged.  Return i3 rather than i2 if i2 is unchanged.
>> >>
>> >> gcc/testsuite/
>> >>   * gcc.target/aarch64/popcnt-le-1.c: Account for commutativity of TST.
>> >>   * gcc.target/aarch64/popcnt-le-3.c: Likewise AND.
>> >>   * gcc.target/aarch64/sve/pred-not-gen-1.c: Revert previous patch.
>> >>   * gcc.target/aarch64/sve/pred-not-gen-4.c: Likewise.
>> >>   * gcc.target/aarch64/sve/var_stride_2.c: Likewise.
>> >>   * gcc.target/aarch64/sve/var_stride_4.c: Likewise.
>> >>
>> >> Co-authored-by: Richard Biener <rguent...@suse.de>
>> >> ---
>> >>  gcc/combine.cc                                | 30 +++++++++----------
>> >>  gcc/doc/invoke.texi                           |  9 ++++++
>> >>  gcc/params.opt                                |  4 +++
>> >>  .../gcc.target/aarch64/popcnt-le-1.c          |  4 +--
>> >>  .../gcc.target/aarch64/popcnt-le-3.c          |  4 +--
>> >>  gcc/testsuite/gcc.target/aarch64/pr100056.c   |  4 ++-
>> >>  .../gcc.target/aarch64/sve/pred-not-gen-1.c   |  4 +--
>> >>  .../gcc.target/aarch64/sve/pred-not-gen-4.c   |  4 +--
>> >>  .../gcc.target/aarch64/sve/var_stride_2.c     |  3 +-
>> >>  .../gcc.target/aarch64/sve/var_stride_4.c     |  3 +-
>> >>  10 files changed, 42 insertions(+), 27 deletions(-)
>> >>
>> >> diff --git a/gcc/combine.cc b/gcc/combine.cc
>> >> index ef13f5d5e90..28403d1f9e0 100644
>> >> --- a/gcc/combine.cc
>> >> +++ b/gcc/combine.cc
>> >> @@ -472,7 +472,7 @@ static void move_deaths (rtx, rtx, int, rtx_insn *, rtx *);
>> >>  static bool reg_bitfield_target_p (rtx, rtx);
>> >>  static void distribute_notes (rtx, rtx_insn *, rtx_insn *, rtx_insn *,
>> >>                              rtx, rtx, rtx);
>> >> -static void distribute_links (struct insn_link *);
>> >> +static void distribute_links (struct insn_link *, int limit = INT_MAX);
>> >>  static void mark_used_regs_combine (rtx);
>> >>  static void record_promoted_value (rtx_insn *, rtx);
>> >>  static bool unmentioned_reg_p (rtx, rtx);
>> >> @@ -4208,16 +4208,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>> >>        adjust_for_new_dest (i3);
>> >>      }
>> >>
>> >> -  /* If I2 didn't change, this is not a combination (but a simplification or
>> >> -     canonicalisation with context), which should not be done here.  Doing
>> >> -     it here explodes the algorithm.  Don't.  */
>> >> -  if (rtx_equal_p (newi2pat, PATTERN (i2)))
>> >> -    {
>> >> -      if (dump_file)
>> >> -        fprintf (dump_file, "i2 didn't change, not doing this\n");
>> >> -      undo_all ();
>> >> -      return 0;
>> >> -    }
>> >> +  bool i2_unchanged = rtx_equal_p (newi2pat, PATTERN (i2));
>> >>
>> >>    /* We now know that we can do this combination.  Merge the insns and
>> >>       update the status of registers and LOG_LINKS.  */
>> >> @@ -4593,10 +4584,11 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>> >>                          NULL_RTX, NULL_RTX, NULL_RTX);
>> >>      }
>> >>
>> >> -  distribute_links (i3links);
>> >> -  distribute_links (i2links);
>> >> -  distribute_links (i1links);
>> >> -  distribute_links (i0links);
>> >> +  int limit = i2_unchanged ? param_max_combine_search_insns : INT_MAX;
>> >> +  distribute_links (i3links, limit);
>> >> +  distribute_links (i2links, limit);
>> >> +  distribute_links (i1links, limit);
>> >> +  distribute_links (i0links, limit);
>> >
>> > I'm not sure I understand how the distribute_links mixes in with
>> > the change below.
>> >
>> > The original issue reported in the PR was that returning i2
>> > caused every insn between i2 and i3 to be re-processed by the
>> > main loop, re-trying previously failed combine attempts and
>> > producing garbage.
>> >
>> > The disadvantage of directly returning i3 is that if added_links_insn
>> > was in between i2 and i3, we miss re-trying the insns with changed
>> > links.
>>
>> Can that happen though?  log links are supposed to be attached to the
>> next use of the register (only), so I wouldn't expect the next use to be
>> before i3 (putatively the previous first use).
>
> But log-links on i2 are for uses in i2 for defs before i2, given
> i0/i1/i2 changed/have gone away we adjust those to eventually end
> up on insns between i2 and i3 (or indeed after i3).  Combine then
> wants to try combining the insns with changed log-links.
Right.  But I meant in the case where i2 hasn't changed, which I thought
was the case in contention.  If i2 hasn't changed then its uses are the
same, and so there is no need to distribute its log links.

>> > But what you do now also limits the insn walk in distribute_links
>> > (IMO a good thing on its own), but not taking advantage of that
>> > when i2_unchanged (distributing the link to insns between i2 and i3
>> > is useless in that case).
>>
>> Yeah, it's true that distribute_links could potentially start the
>> search at a better place.  But that's IMO a third, somewhat independent
>> optimisation.  And it seems more dangerous than the proposed patch,
>> since if it turns out that the next use is before i3 in some current
>> or future circumstances, we could end up generating wrong code.
>>
>> > Ideally we'd distribute links to insns between i2 and i3 in a backward
>> > walk from i3, but pick up the first use if within the limit, as we
>> > want to limit the number of insns we reprocess in the try_combine
>> > callers loop.
>> >
>> > Really ideally we'd want to remember all added_links_insn and only
>> > re-process those, not all other unaffected insns between i2 and i3
>> > (or earlier).
>> >
>> > That said, your change looks like two independent changes?  IIRC Jakub
>> > at some point proposed to limit the 'return i3' below to cases where
>> > i2 is too far away or sth like that.
>>
>> Yes, it's two independent changes.  The return i3 one is taken directly
>> from your comment in:
>>
>>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523#c53
>>
>> I think that's the right thing to do, for the reason you explained there
>> and above, so I didn't want to lose it.  But like you say in the PR,
>> combine did still take a large slice of time with that change in
>> isolation.
>>
>> Jakub discussed the idea of augmenting that change with one that makes
>> try_combine check luids.  The distribute_links part of the patch is
>> supposed to be an alternative to that.
>
> I see.  I understood Jakub thinks my change is too aggressive since
> it does indeed miss combine attempts from changed log-links.
>
> I remember that when allowing added_links_insn to override the
> i3 return for i2_unchanged, the bad compile time re-surfaced.  So
> now I wonder whether, if i2_unchanged, distribute_links (i2links)
> does (should do?) anything?  I don't see anything that would avoid
> adjusting links when there's still a use in newi2pat (IIRC
> i0 and i1 are always gone?).

Yeah, that's also what I meant.  But, once the "return i3" is in place,
the remaining large compile time partly comes from distributing i3's
link for i2's destination, and I suppose from the general "walk all
instructions between A and B" that try_combine uses to check dataflow,
with A and B getting gradually further apart.

FWIW, with:

diff --git a/gcc/combine.cc b/gcc/combine.cc
index ef13f5d5e90..e12a78c8d89 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -4208,16 +4208,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
       adjust_for_new_dest (i3);
     }
 
-  /* If I2 didn't change, this is not a combination (but a simplification or
-     canonicalisation with context), which should not be done here.  Doing
-     it here explodes the algorithm.  Don't.  */
-  if (rtx_equal_p (newi2pat, PATTERN (i2)))
-    {
-      if (dump_file)
-       fprintf (dump_file, "i2 didn't change, not doing this\n");
-      undo_all ();
-      return 0;
-    }
+  bool only_i3_changed = !i0 && !i1 && rtx_equal_p (newi2pat, PATTERN (i2));
 
   /* We now know that we can do this combination.  Merge the insns and
      update the status of registers and LOG_LINKS.  */
@@ -4785,6 +4776,9 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
   combine_successes++;
   undo_commit ();
 
+  if (only_i3_changed)
+    return i3;
+
   rtx_insn *ret = newi2pat ? i2 : i3;
   if (added_links_insn
       && DF_INSN_LUID (added_links_insn) < DF_INSN_LUID (ret))
     ret = added_links_insn;

(so essentially your patch), I see an 11.9x improvement in combine's
compile time on the original testcase for PR101523, going from 88% to
41% of total compile time.  Memory usage improves 110x, from 97% to 25%.
The output is unchanged.

Adding:

diff --git a/gcc/combine.cc b/gcc/combine.cc
index e12a78c8d89..becfd0a65ff 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -472,7 +472,7 @@ static void move_deaths (rtx, rtx, int, rtx_insn *, rtx *);
 static bool reg_bitfield_target_p (rtx, rtx);
 static void distribute_notes (rtx, rtx_insn *, rtx_insn *, rtx_insn *,
                              rtx, rtx, rtx);
-static void distribute_links (struct insn_link *);
+static void distribute_links (struct insn_link *, rtx_insn * = nullptr);
 static void mark_used_regs_combine (rtx);
 static void record_promoted_value (rtx_insn *, rtx);
 static bool unmentioned_reg_p (rtx, rtx);
@@ -4209,6 +4209,8 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
     }
 
   bool only_i3_changed = !i0 && !i1 && rtx_equal_p (newi2pat, PATTERN (i2));
+  if (only_i3_changed)
+    swap_i2i3 = false;
 
   /* We now know that we can do this combination.  Merge the insns and
      update the status of registers and LOG_LINKS.  */
@@ -4584,8 +4586,8 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
                          NULL_RTX, NULL_RTX, NULL_RTX);
     }
 
-  distribute_links (i3links);
-  distribute_links (i2links);
+  distribute_links (i3links, only_i3_changed ? i3 : nullptr);
+  distribute_links (i2links, !i0 && !i1 ? i2 : nullptr);
   distribute_links (i1links);
   distribute_links (i0links);
 
@@ -14978,10 +14980,13 @@ distribute_notes (rtx notes, rtx_insn *from_insn, rtx_insn *i3, rtx_insn *i2,
 
 /* Similarly to above, distribute the LOG_LINKS that used to be present on
    I3, I2, and I1 to new locations.  This is also called to add a link
-   pointing at I3 when I3's destination is changed.  */
+   pointing at I3 when I3's destination is changed.
+
+   If START is nonnull and an insn, we know that the next location for
+   each link is no earlier than START.  */
 
 static void
-distribute_links (struct insn_link *links)
+distribute_links (struct insn_link *links, rtx_insn *start)
 {
   struct insn_link *link, *next_link;
@@ -15047,7 +15052,10 @@ distribute_links (struct insn_link *links)
        I3 to I2.  Also note that not much searching is typically done here
        since most links don't point very far away.  */
 
-      for (insn = NEXT_INSN (link->insn);
+      insn = start;
+      if (!insn || NOTE_P (insn))
+       insn = NEXT_INSN (link->insn);
+      for (;
           (insn && (this_basic_block->next_bb == EXIT_BLOCK_PTR_FOR_FN (cfun)
                     || BB_HEAD (this_basic_block->next_bb) != insn));
           insn = NEXT_INSN (insn))

improves compile time a further 17.7%, going from 41% to 33% of compile
time.  This still generates the same code for the testcase as the
unpatched compiler, and I checked that a bootstrap and regression test
succeeded with the (independent rather than cumulative) patch:

diff --git a/gcc/combine.cc b/gcc/combine.cc
index ef13f5d5e90..ba9649b9b13 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -472,7 +472,7 @@ static void move_deaths (rtx, rtx, int, rtx_insn *, rtx *);
 static bool reg_bitfield_target_p (rtx, rtx);
 static void distribute_notes (rtx, rtx_insn *, rtx_insn *, rtx_insn *,
                              rtx, rtx, rtx);
-static void distribute_links (struct insn_link *);
+static void distribute_links (struct insn_link *, rtx_insn * = nullptr);
 static void mark_used_regs_combine (rtx);
 static void record_promoted_value (rtx_insn *, rtx);
 static bool unmentioned_reg_p (rtx, rtx);
@@ -4208,16 +4208,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
       adjust_for_new_dest (i3);
     }
 
-  /* If I2 didn't change, this is not a combination (but a simplification or
-     canonicalisation with context), which should not be done here.  Doing
-     it here explodes the algorithm.  Don't.  */
-  if (rtx_equal_p (newi2pat, PATTERN (i2)))
-    {
-      if (dump_file)
-       fprintf (dump_file, "i2 didn't change, not doing this\n");
-      undo_all ();
-      return 0;
-    }
+  bool only_i3_changed = !i0 && !i1 && rtx_equal_p (newi2pat, PATTERN (i2));
 
   /* We now know that we can do this combination.  Merge the insns and
      update the status of registers and LOG_LINKS.  */
@@ -4301,6 +4292,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
       FOR_EACH_LOG_LINK (link, insn)
         if (link->insn == i3 && link->regno == regno)
           {
+            gcc_assert (!only_i3_changed);
             link->insn = i2;
             done = true;
             break;
@@ -4593,8 +4585,8 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
                          NULL_RTX, NULL_RTX, NULL_RTX);
     }
 
-  distribute_links (i3links);
-  distribute_links (i2links);
+  distribute_links (i3links, only_i3_changed ? i3 : nullptr);
+  distribute_links (i2links, !i0 && !i1 ? i2 : nullptr);
   distribute_links (i1links);
   distribute_links (i0links);
 
@@ -14984,10 +14976,13 @@ distribute_notes (rtx notes, rtx_insn *from_insn, rtx_insn *i3, rtx_insn *i2,
 
 /* Similarly to above, distribute the LOG_LINKS that used to be present on
    I3, I2, and I1 to new locations.  This is also called to add a link
-   pointing at I3 when I3's destination is changed.  */
+   pointing at I3 when I3's destination is changed.
+
+   If START is nonnull and an insn, we know that the next location for
+   each link is no earlier than START.  */
 
 static void
-distribute_links (struct insn_link *links)
+distribute_links (struct insn_link *links, rtx_insn *start)
 {
   struct insn_link *link, *next_link;
@@ -15053,10 +15048,15 @@ distribute_links (struct insn_link *links)
        I3 to I2.  Also note that not much searching is typically done here
        since most links don't point very far away.  */
 
+      if (link->insn == start)
+       start = nullptr;
       for (insn = NEXT_INSN (link->insn);
           (insn && (this_basic_block->next_bb == EXIT_BLOCK_PTR_FOR_FN (cfun)
                     || BB_HEAD (this_basic_block->next_bb) != insn));
           insn = NEXT_INSN (insn))
+       {
+         if (insn == start)
+           start = nullptr;
          if (DEBUG_INSN_P (insn))
            continue;
          else if (INSN_P (insn)
                   && reg_overlap_mentioned_p (reg, PATTERN (insn)))
            {
@@ -15073,6 +15073,7 @@ distribute_links (struct insn_link *links)
            }
          else if (INSN_P (insn) && reg_set_p (reg, insn))
            break;
+       }
 
       /* If we found a place to put the link, place it there unless there
         is already a link to the same insn as LINK at that point.  */
@@ -15081,6 +15082,7 @@ distribute_links (struct insn_link *links)
        {
          struct insn_link *link2;
 
+         gcc_assert (!start || NOTE_P (start));
          FOR_EACH_LOG_LINK (link2, place)
            if (link2->insn == link->insn && link2->regno == link->regno)
              break;

But still, 33% is a lot for one pass.  So I think we still need a limit
of some kind, even though that would drop optimisations in extreme
cases.  If we don't put a limit on distribute_links, and don't put a
limit on try_combine (as for Jakub's luid patch), we could instead limit
the number of times that distribute_links distributes a particular link.
There is a 32-bit hole in insn_link on LP64 hosts.

That said, I'm kind-of becoming more sympathetic to Segher's view that
these kinds of 2->2 combination were a mistake, although I still stand
by my original question of why we don't just revert
c4c5ad1d6d1e1e1fe7a1c2b3bb097cc269dc7306.

Thanks,
Richard