Richard Biener <rguent...@suse.de> writes:
> On Wed, 2 Apr 2025, Richard Sandiford wrote:
>
>> Richard Biener <rguent...@suse.de> writes:
>> > On Tue, 1 Apr 2025, Richard Sandiford wrote:
>> >
>> >> The problem in PR101523 was that, after each successful 2->2
>> >> combination attempt, distribute_links would search further and
>> >> further for the next combinable use of the i2 destination.
>> >> The original patch for the PR dealt with that by disallowing such
>> >> combinations.  However, that led to various optimisation regressions,
>> >> so there was interest in allowing the combinations again, at least
>> >> until an alternative way of getting the same results is in place.
>> >>
>> >> This patch does that, but limits distribute_links to a certain number
>> >> of instructions when i2 is unchanged.  As Segher said in the PR trail,
>> >> it would make more conceptual sense to apply the limit unconditionally,
>> >> but I thought it would be better to change as little as possible at
>> >> this development stage.  Logically, in stage 1, the --param should
>> >> be applied directly by distribute_links with no input from callers.
>> >>
>> >> As I mentioned in:
>> >>
>> >>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116398#c28
>> >>
>> >> I think it's safe to drop log links even if a use exists.  All
>> >> processing of log links seems to handle the absence of a link
>> >> for a particular register in a conservative way.
>> >>
>> >> The initial set-up errs on the side of dropping links, since for
>> >> example create_log_links has:
>> >>
>> >>       /* flow.c claimed:
>> >>
>> >>          We don't build a LOG_LINK for hard registers contained
>> >>          in ASM_OPERANDs.  If these registers get replaced,
>> >>          we might wind up changing the semantics of the insn,
>> >>          even if reload can make what appear to be valid
>> >>          assignments later.  */
>> >>       if (regno < FIRST_PSEUDO_REGISTER
>> >>           && asm_noperands (PATTERN (use_insn)) >= 0)
>> >>         continue;
>> >>
>> >> which excludes combinations by dropping log links, rather than during
>> >> try_combine.  And:
>> >>
>> >>   /* If this register is being initialized using itself, and the
>> >>      register is uninitialized in this basic block, and there are
>> >>      no LOG_LINKS which set the register, then part of the
>> >>      register is uninitialized.  In that case we can't assume
>> >>      anything about the number of nonzero bits.
>> >>
>> >>      ??? We could do better if we checked this in
>> >>      reg_{nonzero_bits,num_sign_bit_copies}_for_combine.  Then we
>> >>      could avoid making assumptions about the insn which initially
>> >>      sets the register, while still using the information in other
>> >>      insns.  We would have to be careful to check every insn
>> >>      involved in the combination.  */
>> >>
>> >>   if (insn
>> >>       && reg_referenced_p (x, PATTERN (insn))
>> >>       && !REGNO_REG_SET_P (DF_LR_IN (BLOCK_FOR_INSN (insn)),
>> >>                            REGNO (x)))
>> >>     {
>> >>       struct insn_link *link;
>> >>
>> >>       FOR_EACH_LOG_LINK (link, insn)
>> >>         if (dead_or_set_p (link->insn, x))
>> >>           break;
>> >>       if (!link)
>> >>         {
>> >>           rsp->nonzero_bits = GET_MODE_MASK (mode);
>> >>           rsp->sign_bit_copies = 1;
>> >>           return;
>> >>         }
>> >>     }
>> >>
>> >> treats the lack of a log link as a possible sign of uninitialised data,
>> >> but that would be a missed optimisation rather than a correctness issue.
>> >>
>> >> One question is what the default --param value should be.  I went with
>> >> Jakub's suggestion of 3000 from:
>> >>
>> >>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116398#c25
>> >>
>> >> Also, to answer Jakub's question in that comment, I tried bisecting:
>> >>
>> >>   int limit = atoi (getenv ("BISECT"));
>> >>
>> >> (so applying the limit for all calls from try_combine) with an
>> >> abort in distribute_links if the limit caused a link to be skipped.
>> >> The minimum BISECT value that allowed an aarch64-linux-gnu bootstrap
>> >> to succeed with --enable-languages=all --enable-checking=yes,rtl,extra
>> >> was 142, so much lower than the parameter value.  I realised too late
>> >> that --enable-checking=release would probably have been a more
>> >> interesting test.
>> >>
>> >> Some of the try_combine changes come from Richi's patch in:
>> >>
>> >>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523#c53
>> >>
>> >> Bootstrapped & regression-tested on aarch64-linux-gnu,
>> >> powerpc64le-linux-gnu and x86_64-linux-gnu.  OK to install?
>> >>
>> >> Richard
>> >>
>> >>
>> >> gcc/
>> >>   PR testsuite/116398
>> >>   * params.opt (-param=max-combine-search-insns=): New param.
>> >>   * doc/invoke.texi: Document it.
>> >>   * combine.cc (distribute_links): Add a limit parameter.
>> >>   (try_combine): Use the new param to limit distribute_links
>> >>   when i2 is unchanged.  Return i3 rather than i2 if i2 is unchanged.
>> >>
>> >> gcc/testsuite/
>> >>   * gcc.target/aarch64/popcnt-le-1.c: Account for commutativity of TST.
>> >>   * gcc.target/aarch64/popcnt-le-3.c: Likewise AND.
>> >>   * gcc.target/aarch64/sve/pred-not-gen-1.c: Revert previous patch.
>> >>   * gcc.target/aarch64/sve/pred-not-gen-4.c: Likewise.
>> >>   * gcc.target/aarch64/sve/var_stride_2.c: Likewise.
>> >>   * gcc.target/aarch64/sve/var_stride_4.c: Likewise.
>> >>
>> >> Co-authored-by: Richard Biener <rguent...@suse.de>
>> >> ---
>> >>  gcc/combine.cc                                | 30 +++++++++----------
>> >>  gcc/doc/invoke.texi                           |  9 ++++++
>> >>  gcc/params.opt                                |  4 +++
>> >>  .../gcc.target/aarch64/popcnt-le-1.c          |  4 +--
>> >>  .../gcc.target/aarch64/popcnt-le-3.c          |  4 +--
>> >>  gcc/testsuite/gcc.target/aarch64/pr100056.c   |  4 ++-
>> >>  .../gcc.target/aarch64/sve/pred-not-gen-1.c   |  4 +--
>> >>  .../gcc.target/aarch64/sve/pred-not-gen-4.c   |  4 +--
>> >>  .../gcc.target/aarch64/sve/var_stride_2.c     |  3 +-
>> >>  .../gcc.target/aarch64/sve/var_stride_4.c     |  3 +-
>> >>  10 files changed, 42 insertions(+), 27 deletions(-)
>> >>
>> >> diff --git a/gcc/combine.cc b/gcc/combine.cc
>> >> index ef13f5d5e90..28403d1f9e0 100644
>> >> --- a/gcc/combine.cc
>> >> +++ b/gcc/combine.cc
>> >> @@ -472,7 +472,7 @@ static void move_deaths (rtx, rtx, int, rtx_insn *, rtx *);
>> >>  static bool reg_bitfield_target_p (rtx, rtx);
>> >>  static void distribute_notes (rtx, rtx_insn *, rtx_insn *, rtx_insn *,
>> >>                              rtx, rtx, rtx);
>> >> -static void distribute_links (struct insn_link *);
>> >> +static void distribute_links (struct insn_link *, int limit = INT_MAX);
>> >>  static void mark_used_regs_combine (rtx);
>> >>  static void record_promoted_value (rtx_insn *, rtx);
>> >>  static bool unmentioned_reg_p (rtx, rtx);
>> >> @@ -4208,16 +4208,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>> >>        adjust_for_new_dest (i3);
>> >>      }
>> >>
>> >> -  /* If I2 didn't change, this is not a combination (but a simplification or
>> >> -     canonicalisation with context), which should not be done here.  Doing
>> >> -     it here explodes the algorithm.  Don't.  */
>> >> -  if (rtx_equal_p (newi2pat, PATTERN (i2)))
>> >> -    {
>> >> -      if (dump_file)
>> >> -        fprintf (dump_file, "i2 didn't change, not doing this\n");
>> >> -      undo_all ();
>> >> -      return 0;
>> >> -    }
>> >> +  bool i2_unchanged = rtx_equal_p (newi2pat, PATTERN (i2));
>> >>
>> >>    /* We now know that we can do this combination.  Merge the insns and
>> >>       update the status of registers and LOG_LINKS.  */
>> >> @@ -4593,10 +4584,11 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>> >>                          NULL_RTX, NULL_RTX, NULL_RTX);
>> >>      }
>> >>
>> >> -  distribute_links (i3links);
>> >> -  distribute_links (i2links);
>> >> -  distribute_links (i1links);
>> >> -  distribute_links (i0links);
>> >> +  int limit = i2_unchanged ? param_max_combine_search_insns : INT_MAX;
>> >> +  distribute_links (i3links, limit);
>> >> +  distribute_links (i2links, limit);
>> >> +  distribute_links (i1links, limit);
>> >> +  distribute_links (i0links, limit);
>> >
>> > I'm not sure I understand how the distribute_links mixes in with
>> > the change below.
>> >
>> > The original issue reported in the PR was that returning i2
>> > caused every insn between i2 and i3 to be re-processed by the
>> > main loop, re-trying previously failed combine attempts and
>> > producing garbage.
>> >
>> > The disadvantage of directly returning i3 is that if added_links_insn
>> > was in between i2 and i3, we miss re-trying the insns with changed
>> > links.
>>
>> Can that happen though?  log links are supposed to be attached to the
>> next use of the register (only), so I wouldn't expect the next use to be
>> before i3 (putatively the previous first use).
>
> But log-links on i2 are for uses in i2 for defs before i2, given
> i0/i1/i2 changed/have gone away we adjust those to eventually end
> up on insns between i2 and i3 (or indeed after i3).  Combine then
> wants to try combining the insns with changed log-links.
Right.  But I meant in the case where i2 hasn't changed, which I thought
was the case in contention.  If i2 hasn't changed then its uses are the
same, and so there is no need to distribute its log links.

>> > But what you do now also limits the insn walk in distribute_links
>> > (IMO a good thing on its own), but not taking advantage of that
>> > when i2_unchanged (distributing the link to insns between i2 and i3
>> > is useless in that case).
>>
>> Yeah, it's true that distribute_links could potentially start the
>> search at a better place.  But that's IMO a third, somewhat independent
>> optimisation.  And it seems more dangerous than the proposed patch,
>> since if it turns out that the next use is before i3 in some current
>> or future circumstances, we could end up generating wrong code.
>>
>> > Ideally we'd distribute links to insns between i2 and i3 in a backward
>> > walk from i3, but pick up the first use if within the limit, as we
>> > want to limit the number of insns we reprocess in the try_combine
>> > callers loop.
>> >
>> > Really ideally we'd want to remember all added_links_insn and only
>> > re-process those, not all other unaffected insns between i2 and i3
>> > (or earlier).
>> >
>> > That said, your change looks like two independent changes?  IIRC Jakub
>> > at some point proposed to limit the 'return i3' below to cases where
>> > i2 is too far away or sth like that.
>>
>> Yes, it's two independent changes.  The return i3 one is taken directly
>> from your comment in:
>>
>>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523#c53
>>
>> I think that's the right thing to do, for the reason you explained there
>> and above, so I didn't want to lose it.  But like you say in the PR,
>> combine did still take a large slice of time with that change in
>> isolation.
>>
>> Jakub discussed the idea of augmenting that change with one that makes
>> try_combine check luids.  The distribute_links part of the patch is
>> supposed to be an alternative to that.
>
> I see.  I understood Jakub thinks my change is too aggressive since
> it does indeed miss combine attempts from changed log-links.
>
> I remember that when allowing added_links_insn to override the
> i3 return for i2_unchanged, the bad compile time re-surfaced.  So
> now I wonder whether, if i2_unchanged, distribute_links (i2links)
> does (should do?) anything?  I don't see anything that would avoid
> adjusting links when there's still a use in newi2pat (IIRC
> i0 and i1 are always gone?).

Yeah, that's also what I meant.  But, once the "return i3" is in place,
the remaining large compile time partly comes from distributing i3's
link for i2's destination, and I suppose from the general "walk all
instructions between A and B" that try_combine uses to check dataflow,
with A and B getting gradually further apart.

FWIW, with:

diff --git a/gcc/combine.cc b/gcc/combine.cc
index ef13f5d5e90..e12a78c8d89 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -4208,16 +4208,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
       adjust_for_new_dest (i3);
     }
 
-  /* If I2 didn't change, this is not a combination (but a simplification or
-     canonicalisation with context), which should not be done here.  Doing
-     it here explodes the algorithm.  Don't.  */
-  if (rtx_equal_p (newi2pat, PATTERN (i2)))
-    {
-      if (dump_file)
-       fprintf (dump_file, "i2 didn't change, not doing this\n");
-      undo_all ();
-      return 0;
-    }
+  bool only_i3_changed = !i0 && !i1 && rtx_equal_p (newi2pat, PATTERN (i2));
 
   /* We now know that we can do this combination.  Merge the insns and
      update the status of registers and LOG_LINKS.  */
@@ -4785,6 +4776,9 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
   combine_successes++;
   undo_commit ();
 
+  if (only_i3_changed)
+    return i3;
+
   rtx_insn *ret = newi2pat ? i2 : i3;
   if (added_links_insn
       && DF_INSN_LUID (added_links_insn) < DF_INSN_LUID (ret))
     ret = added_links_insn;

(so essentially your patch), I see an 11.9x improvement in combine's
compile time on the original testcase for PR101523, going from 88% to
41% of total compile time.  Memory usage improves 110x, from 97% to 25%.
The output is unchanged.

Adding:

diff --git a/gcc/combine.cc b/gcc/combine.cc
index e12a78c8d89..becfd0a65ff 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -472,7 +472,7 @@ static void move_deaths (rtx, rtx, int, rtx_insn *, rtx *);
 static bool reg_bitfield_target_p (rtx, rtx);
 static void distribute_notes (rtx, rtx_insn *, rtx_insn *, rtx_insn *,
                              rtx, rtx, rtx);
-static void distribute_links (struct insn_link *);
+static void distribute_links (struct insn_link *, rtx_insn * = nullptr);
 static void mark_used_regs_combine (rtx);
 static void record_promoted_value (rtx_insn *, rtx);
 static bool unmentioned_reg_p (rtx, rtx);
@@ -4209,6 +4209,8 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
     }
 
   bool only_i3_changed = !i0 && !i1 && rtx_equal_p (newi2pat, PATTERN (i2));
+  if (only_i3_changed)
+    swap_i2i3 = false;
 
   /* We now know that we can do this combination.  Merge the insns and
      update the status of registers and LOG_LINKS.  */
@@ -4584,8 +4586,8 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
                          NULL_RTX, NULL_RTX, NULL_RTX);
     }
 
-  distribute_links (i3links);
-  distribute_links (i2links);
+  distribute_links (i3links, only_i3_changed ? i3 : nullptr);
+  distribute_links (i2links, !i0 && !i1 ? i2 : nullptr);
   distribute_links (i1links);
   distribute_links (i0links);
 
@@ -14978,10 +14980,13 @@ distribute_notes (rtx notes, rtx_insn *from_insn, rtx_insn *i3, rtx_insn *i2,
 
 /* Similarly to above, distribute the LOG_LINKS that used to be present on
    I3, I2, and I1 to new locations.  This is also called to add a link
-   pointing at I3 when I3's destination is changed.  */
+   pointing at I3 when I3's destination is changed.
+
+   If START is nonnull and an insn, we know that the next location for
+   each link is no earlier than START.  */
 
 static void
-distribute_links (struct insn_link *links)
+distribute_links (struct insn_link *links, rtx_insn *start)
 {
   struct insn_link *link, *next_link;
@@ -15047,7 +15052,10 @@ distribute_links (struct insn_link *links)
        I3 to I2.  Also note that not much searching is typically done here
        since most links don't point very far away.  */
 
-      for (insn = NEXT_INSN (link->insn);
+      insn = start;
+      if (!insn || NOTE_P (insn))
+       insn = NEXT_INSN (link->insn);
+      for (;
           (insn && (this_basic_block->next_bb == EXIT_BLOCK_PTR_FOR_FN (cfun)
                     || BB_HEAD (this_basic_block->next_bb) != insn));
           insn = NEXT_INSN (insn))

improves compile time a further 17.7%, going from 41% to 33% of compile
time.  This still generates the same code for the testcase as the
unpatched compiler, and I checked that a bootstrap and regression test
succeeded with the (independent rather than cumulative) patch:

diff --git a/gcc/combine.cc b/gcc/combine.cc
index ef13f5d5e90..ba9649b9b13 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -472,7 +472,7 @@ static void move_deaths (rtx, rtx, int, rtx_insn *, rtx *);
 static bool reg_bitfield_target_p (rtx, rtx);
 static void distribute_notes (rtx, rtx_insn *, rtx_insn *, rtx_insn *,
                              rtx, rtx, rtx);
-static void distribute_links (struct insn_link *);
+static void distribute_links (struct insn_link *, rtx_insn * = nullptr);
 static void mark_used_regs_combine (rtx);
 static void record_promoted_value (rtx_insn *, rtx);
 static bool unmentioned_reg_p (rtx, rtx);
@@ -4208,16 +4208,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
       adjust_for_new_dest (i3);
     }
 
-  /* If I2 didn't change, this is not a combination (but a simplification or
-     canonicalisation with context), which should not be done here.  Doing
-     it here explodes the algorithm.  Don't.  */
-  if (rtx_equal_p (newi2pat, PATTERN (i2)))
-    {
-      if (dump_file)
-       fprintf (dump_file, "i2 didn't change, not doing this\n");
-      undo_all ();
-      return 0;
-    }
+  bool only_i3_changed = !i0 && !i1 && rtx_equal_p (newi2pat, PATTERN (i2));
 
   /* We now know that we can do this combination.  Merge the insns and
      update the status of registers and LOG_LINKS.  */
@@ -4301,6 +4292,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
       FOR_EACH_LOG_LINK (link, insn)
         if (link->insn == i3 && link->regno == regno)
           {
+            gcc_assert (!only_i3_changed);
             link->insn = i2;
             done = true;
             break;
@@ -4593,8 +4585,8 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
                          NULL_RTX, NULL_RTX, NULL_RTX);
     }
 
-  distribute_links (i3links);
-  distribute_links (i2links);
+  distribute_links (i3links, only_i3_changed ? i3 : nullptr);
+  distribute_links (i2links, !i0 && !i1 ? i2 : nullptr);
   distribute_links (i1links);
   distribute_links (i0links);
 
@@ -14984,10 +14976,13 @@ distribute_notes (rtx notes, rtx_insn *from_insn, rtx_insn *i3, rtx_insn *i2,
 
 /* Similarly to above, distribute the LOG_LINKS that used to be present on
    I3, I2, and I1 to new locations.  This is also called to add a link
-   pointing at I3 when I3's destination is changed.  */
+   pointing at I3 when I3's destination is changed.
+
+   If START is nonnull and an insn, we know that the next location for
+   each link is no earlier than START.  */
 
 static void
-distribute_links (struct insn_link *links)
+distribute_links (struct insn_link *links, rtx_insn *start)
 {
   struct insn_link *link, *next_link;
@@ -15053,10 +15048,15 @@ distribute_links (struct insn_link *links)
        I3 to I2.  Also note that not much searching is typically done here
        since most links don't point very far away.  */
 
+      if (link->insn == start)
+       start = nullptr;
       for (insn = NEXT_INSN (link->insn);
           (insn && (this_basic_block->next_bb == EXIT_BLOCK_PTR_FOR_FN (cfun)
                     || BB_HEAD (this_basic_block->next_bb) != insn));
           insn = NEXT_INSN (insn))
+       {
+         if (insn == start)
+           start = nullptr;
          if (DEBUG_INSN_P (insn))
            continue;
          else if (INSN_P (insn)
                   && reg_overlap_mentioned_p (reg, PATTERN (insn)))
            {
@@ -15073,6 +15073,7 @@ distribute_links (struct insn_link *links)
            }
          else if (INSN_P (insn) && reg_set_p (reg, insn))
            break;
+       }
 
       /* If we found a place to put the link, place it there unless there
         is already a link to the same insn as LINK at that point.  */
@@ -15081,6 +15082,7 @@ distribute_links (struct insn_link *links)
        {
          struct insn_link *link2;
 
+         gcc_assert (!start || NOTE_P (start));
          FOR_EACH_LOG_LINK (link2, place)
            if (link2->insn == link->insn && link2->regno == link->regno)
              break;

But still, 33% is a lot for one pass.  So I think we still need a limit
of some kind, even though that would drop optimisations in extreme
cases.  If we don't put a limit on distribute_links, and don't put a
limit on try_combine (as for Jakub's luid patch), we could instead limit
the number of times that distribute_links distributes a particular link.
There is a 32-bit hole in insn_link on LP64 hosts.

That said, I'm kind-of becoming more sympathetic to Segher's view that
these kinds of 2->2 combination were a mistake, although I still stand
by my original question of why we don't just revert
c4c5ad1d6d1e1e1fe7a1c2b3bb097cc269dc7306.

Thanks,
Richard