Richard Sandiford <richard.sandif...@arm.com> writes:
> Richard Biener <rguent...@suse.de> writes:
>> On Wed, 2 Apr 2025, Richard Sandiford wrote:
>>
>>> Richard Biener <rguent...@suse.de> writes:
>>> > On Tue, 1 Apr 2025, Richard Sandiford wrote:
>>> >
>>> >> The problem in PR101523 was that, after each successful 2->2
>>> >> combination attempt, distribute_links would search further and
>>> >> further for the next combinable use of the i2 destination.
>>> >> The original patch for the PR dealt with that by disallowing such
>>> >> combinations.  However, that led to various optimisation regressions,
>>> >> so there was interest in allowing the combinations again, at least
>>> >> until an alternative way of getting the same results is in place.
>>> >> 
>>> >> This patch does that, but limits distribute_links to a certain number
>>> >> of instructions when i2 is unchanged.  As Segher said in the PR trail,
>>> >> it would make more conceptual sense to apply the limit unconditionally,
>>> >> but I thought it would be better to change as little as possible at
>>> >> this development stage.  Logically, in stage 1, the --param should
>>> >> be applied directly by distribute_links with no input from callers.
>>> >> 
>>> >> As I mentioned in:
>>> >> 
>>> >>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116398#c28
>>> >> 
>>> >> I think it's safe to drop log links even if a use exists.  All
>>> >> processing of log links seems to handle the absence of a link
>>> >> for a particular register in a conservative way.
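[Editor's note: a toy sketch of the idea above, outside the thread.  The names (toy_insn, find_next_use) are hypothetical and are not GCC's real rtx_insn or log-link data structures; the point is only that giving up on a search and returning "no link" is a conservative answer that costs a missed combination, never correctness.]

```cpp
#include <cstddef>

// Toy model of a forward insn chain.  Each insn either uses a given
// register number or not.
struct toy_insn
{
  int uses_regno;   // register used by this insn, or -1 for none
  toy_insn *next;   // next insn in the chain
};

// Walk forward from START looking for the next insn that uses REGNO,
// but give up after visiting LIMIT insns.  Returning nullptr models
// "drop the log link": later consumers simply see no link and skip the
// optimisation, which is safe.
toy_insn *
find_next_use (toy_insn *start, int regno, int limit)
{
  for (toy_insn *insn = start; insn; insn = insn->next)
    {
      if (limit-- == 0)
	return nullptr;   // searched too far: drop the link
      if (insn->uses_regno == regno)
	return insn;
    }
  return nullptr;
}
```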
>>> >> 
>>> >> The initial set-up errs on the side of dropping links, since for example
>>> >> create_log_links has:
>>> >> 
>>> >>              /* flow.c claimed:
>>> >> 
>>> >>                  We don't build a LOG_LINK for hard registers contained
>>> >>                  in ASM_OPERANDs.  If these registers get replaced,
>>> >>                  we might wind up changing the semantics of the insn,
>>> >>                  even if reload can make what appear to be valid
>>> >>                  assignments later.  */
>>> >>               if (regno < FIRST_PSEUDO_REGISTER
>>> >>                   && asm_noperands (PATTERN (use_insn)) >= 0)
>>> >>                 continue;
>>> >> 
>>> >> which excludes combinations by dropping log links, rather than during
>>> >> try_combine.  And:
>>> >> 
>>> >>       /* If this register is being initialized using itself, and the
>>> >>          register is uninitialized in this basic block, and there are
>>> >>          no LOG_LINKS which set the register, then part of the
>>> >>          register is uninitialized.  In that case we can't assume
>>> >>          anything about the number of nonzero bits.
>>> >> 
>>> >>          ??? We could do better if we checked this in
>>> >>          reg_{nonzero_bits,num_sign_bit_copies}_for_combine.  Then we
>>> >>          could avoid making assumptions about the insn which initially
>>> >>          sets the register, while still using the information in other
>>> >>          insns.  We would have to be careful to check every insn
>>> >>          involved in the combination.  */
>>> >> 
>>> >>       if (insn
>>> >>           && reg_referenced_p (x, PATTERN (insn))
>>> >>           && !REGNO_REG_SET_P (DF_LR_IN (BLOCK_FOR_INSN (insn)),
>>> >>                                REGNO (x)))
>>> >>         {
>>> >>           struct insn_link *link;
>>> >> 
>>> >>           FOR_EACH_LOG_LINK (link, insn)
>>> >>             if (dead_or_set_p (link->insn, x))
>>> >>               break;
>>> >>           if (!link)
>>> >>             {
>>> >>               rsp->nonzero_bits = GET_MODE_MASK (mode);
>>> >>               rsp->sign_bit_copies = 1;
>>> >>               return;
>>> >>             }
>>> >>         }
>>> >> 
>>> >> treats the lack of a log link as a possible sign of uninitialised data,
>>> >> but that would be a missed optimisation rather than a correctness issue.
>>> >> 
>>> >> One question is what the default --param value should be.  I went with
>>> >> Jakub's suggestion of 3000 from:
>>> >> 
>>> >>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116398#c25
>>> >> 
>>> >> Also, to answer Jakub's question in that comment, I tried bisecting:
>>> >> 
>>> >>   int limit = atoi (getenv ("BISECT"));
>>> >> 
>>> >> (so applying the limit for all calls from try_combine) with an
>>> >> abort in distribute_links if the limit caused a link to be skipped.
>>> >> The minimum BISECT value that allowed an aarch64-linux-gnu bootstrap
>>> >> to succeed with --enable-languages=all --enable-checking=yes,rtl,extra
>>> >> was 142, so much lower than the parameter value.  I realised too late
>>> >> that --enable-checking=release would probably have been a more
>>> >> interesting test.
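[Editor's note: a minimal sketch of the environment-variable bisection hack described above.  get_bisect_limit is a hypothetical helper, not code from the patch; the one-liner in the mail applies atoi directly and so assumes BISECT is always set, whereas this version falls back to a default when it is unset.]

```cpp
#include <cstdlib>

// Read an instruction-walk limit from the BISECT environment variable,
// falling back to FALLBACK (e.g. the --param default of 3000) when the
// variable is unset.
int
get_bisect_limit (int fallback)
{
  const char *s = std::getenv ("BISECT");
  return s ? std::atoi (s) : fallback;
}
```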
>>> >> 
>>> >> Some of the try_combine changes come from Richi's patch in:
>>> >> 
>>> >>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523#c53
>>> >> 
>>> >> Bootstrapped & regression-tested on aarch64-linux-gnu,
>>> >> powerpc64le-linux-gnu and x86_64-linux-gnu.  OK to install?
>>> >> 
>>> >> Richard
>>> >> 
>>> >> 
>>> >> gcc/
>>> >>  PR testsuite/116398
>>> >>  * params.opt (-param=max-combine-search-insns=): New param.
>>> >>  * doc/invoke.texi: Document it.
>>> >>  * combine.cc (distribute_links): Add a limit parameter.
>>> >>  (try_combine): Use the new param to limit distribute_links
>>> >>  when i2 is unchanged.  Return i3 rather than i2 if i2 is unchanged.
>>> >> 
>>> >> gcc/testsuite/
>>> >>  * gcc.target/aarch64/popcnt-le-1.c: Account for commutativity of TST.
>>> >>  * gcc.target/aarch64/popcnt-le-3.c: Likewise AND.
>>> >>  * gcc.target/aarch64/sve/pred-not-gen-1.c: Revert previous patch.
>>> >>  * gcc.target/aarch64/sve/pred-not-gen-4.c: Likewise.
>>> >>  * gcc.target/aarch64/sve/var_stride_2.c: Likewise.
>>> >>  * gcc.target/aarch64/sve/var_stride_4.c: Likewise.
>>> >> 
>>> >> Co-authored-by: Richard Biener <rguent...@suse.de>
>>> >> ---
>>> >>  gcc/combine.cc                                | 30 +++++++++----------
>>> >>  gcc/doc/invoke.texi                           |  9 ++++++
>>> >>  gcc/params.opt                                |  4 +++
>>> >>  .../gcc.target/aarch64/popcnt-le-1.c          |  4 +--
>>> >>  .../gcc.target/aarch64/popcnt-le-3.c          |  4 +--
>>> >>  gcc/testsuite/gcc.target/aarch64/pr100056.c   |  4 ++-
>>> >>  .../gcc.target/aarch64/sve/pred-not-gen-1.c   |  4 +--
>>> >>  .../gcc.target/aarch64/sve/pred-not-gen-4.c   |  4 +--
>>> >>  .../gcc.target/aarch64/sve/var_stride_2.c     |  3 +-
>>> >>  .../gcc.target/aarch64/sve/var_stride_4.c     |  3 +-
>>> >>  10 files changed, 42 insertions(+), 27 deletions(-)
>>> >> 
>>> >> diff --git a/gcc/combine.cc b/gcc/combine.cc
>>> >> index ef13f5d5e90..28403d1f9e0 100644
>>> >> --- a/gcc/combine.cc
>>> >> +++ b/gcc/combine.cc
>>> >> @@ -472,7 +472,7 @@ static void move_deaths (rtx, rtx, int, rtx_insn *, rtx *);
>>> >>  static bool reg_bitfield_target_p (rtx, rtx);
>>> >>  static void distribute_notes (rtx, rtx_insn *, rtx_insn *, rtx_insn *,
>>> >>                                rtx, rtx, rtx);
>>> >> -static void distribute_links (struct insn_link *);
>>> >> +static void distribute_links (struct insn_link *, int limit = INT_MAX);
>>> >>  static void mark_used_regs_combine (rtx);
>>> >>  static void record_promoted_value (rtx_insn *, rtx);
>>> >>  static bool unmentioned_reg_p (rtx, rtx);
>>> >> @@ -4208,16 +4208,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>>> >>        adjust_for_new_dest (i3);
>>> >>      }
>>> >>  
>>> >> -  /* If I2 didn't change, this is not a combination (but a simplification or
>>> >> -     canonicalisation with context), which should not be done here.  Doing
>>> >> -     it here explodes the algorithm.  Don't.  */
>>> >> -  if (rtx_equal_p (newi2pat, PATTERN (i2)))
>>> >> -    {
>>> >> -      if (dump_file)
>>> >> -        fprintf (dump_file, "i2 didn't change, not doing this\n");
>>> >> -      undo_all ();
>>> >> -      return 0;
>>> >> -    }
>>> >> +  bool i2_unchanged = rtx_equal_p (newi2pat, PATTERN (i2));
>>> >>  
>>> >>    /* We now know that we can do this combination.  Merge the insns and
>>> >>       update the status of registers and LOG_LINKS.  */
>>> >> @@ -4593,10 +4584,11 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>>> >>                              NULL_RTX, NULL_RTX, NULL_RTX);
>>> >>        }
>>> >>  
>>> >> -    distribute_links (i3links);
>>> >> -    distribute_links (i2links);
>>> >> -    distribute_links (i1links);
>>> >> -    distribute_links (i0links);
>>> >> +    int limit = i2_unchanged ? param_max_combine_search_insns : INT_MAX;
>>> >> +    distribute_links (i3links, limit);
>>> >> +    distribute_links (i2links, limit);
>>> >> +    distribute_links (i1links, limit);
>>> >> +    distribute_links (i0links, limit);
>>> >
>>> > I'm not sure I understand how the distribute_links change mixes in
>>> > with the change below.
>>> >
>>> > The original issue reported in the PR was that returning i2
>>> > caused every insn between i2 and i3 to be re-processed by the
>>> > main loop, re-trying previously failed combine attempts and
>>> > producing garbage.
>>> >
>>> > The disadvantage of directly returning i3 is that if added_links_insn
>>> > was in between i2 and i3, we've missed re-trying an insn with changed
>>> > links.
>>> 
>>> Can that happen though?  log links are supposed to be attached to the
>>> next use of the register (only), so I wouldn't expect the next use to be
>>> before i3 (putatively the previous first use).
>>
>> But log-links on i2 are for uses in i2 of defs before i2; given that
>> i0/i1/i2 have changed or gone away, we adjust those to eventually end
>> up on insns between i2 and i3 (or indeed after i3).  Combine then
>> wants to try combining the insns with changed log-links.
>
> Right.  But I meant in the case where i2 hasn't changed, which
> I thought was the case in contention.  If i2 hasn't changed then
> its uses are the same, and so there is no need to distribute its
> log links.
>
>>> > But what you do now also limits the insn walk in distribute_links
>>> > (IMO a good thing on its own), without taking advantage of that
>>> > when i2_unchanged (distributing the link to insns between i2 and i3
>>> > is useless in that case).
>>> 
>>> Yeah, it's true that distribute_links could potentially start the
>>> search at a better place.  But that's IMO a third, somewhat independent
>>> optimisation.  And it seems more dangerous than the proposed patch,
>>> since if it turns out that the next use is before i3 in some current
>>> or future circumstances, we could end up generating wrong code.
>>> 
>>> > Ideally we'd distribute links to insns between i2 and i3 in a backward
>>> > walk from i3, but pick up the first use if within the limit, as we
>>> > want to limit the number of insns we reprocess in the try_combine
>>> > caller's loop.
>>> >
>>> > Really ideally we'd want to remember all added_links_insn and only
>>> > re-process those, not all other unaffected insns between i2 and i3
>>> > (or earlier).
>>> >
>>> > That said, your change looks like two independent changes?  IIRC Jakub
>>> > at some point proposed to limit the 'return i3' below to cases where
>>> > i2 is too far away or sth like that.
>>> 
>>> Yes, it's two independent changes.  The return i3 one is taken directly
>>> from your comment in:
>>> 
>>>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101523#c53
>>> 
>>> I think that's the right thing to do, for the reason you explained there
>>> and above, so I didn't want to lose it.  But like you say in the PR,
>>> combine did still take a large slice of time with that change in isolation.
>>> 
>>> Jakub discussed the idea of augmenting that change with one that makes
>>> try_combine check luids.  The distribute_links part of the patch is
>>> supposed to be an alternative to that.
>>
>> I see.  I understood Jakub thinks my change is too aggressive since
>> it does indeed miss combine attempts from changed log-links.
>>
>> I remember that when allowing added_links_insn to override the
>> i3 return when i2 is unchanged, the bad compile time re-surfaced.  So
>> now I wonder whether, if i2 is unchanged, distribute_links (i2links)
>> does (should do?) anything?  I don't see anything that would avoid
>> adjusting links when there's still a use in newi2pat (IIRC
>> i0 and i1 are always gone?).
>
> Yeah, that's also what I meant.  But, once the "return i3" is in place,
> the remaining large compile time partly comes from distributing i3's link
> for i2's destination, and I suppose from the general "walk all instructions
> between A and B" that try_combine uses to check dataflow, with A and B
> getting gradually further apart.
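[Editor's note: a toy cost model, not GCC code, illustrating why the re-scan above gets expensive.  The functions and the assumption that the gap grows by one insn per combination are hypothetical, chosen to match the PR101523 pattern: if try_combine returns i2, the caller re-processes every insn in a steadily widening gap after each success, giving quadratic work; returning i3 makes the work linear.]

```cpp
// Total insn visits when the caller re-walks the whole i2..i3 gap after
// each of N successful combinations, the gap growing by one each time.
long
visits_if_returning_i2 (int n_combinations)
{
  long total = 0;
  for (int gap = 1; gap <= n_combinations; ++gap)
    total += gap;              // re-walk the entire widening gap
  return total;                // 1 + 2 + ... + N, i.e. quadratic in N
}

// Total insn visits when the caller resumes at i3: each insn is
// processed once, so the work is linear in N.
long
visits_if_returning_i3 (int n_combinations)
{
  return n_combinations;
}
```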
>
> FWIW, with:
>
> diff --git a/gcc/combine.cc b/gcc/combine.cc
> index ef13f5d5e90..e12a78c8d89 100644
> --- a/gcc/combine.cc
> +++ b/gcc/combine.cc
> @@ -4208,16 +4208,7 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>        adjust_for_new_dest (i3);
>      }
>  
> -  /* If I2 didn't change, this is not a combination (but a simplification or
> -     canonicalisation with context), which should not be done here.  Doing
> -     it here explodes the algorithm.  Don't.  */
> -  if (rtx_equal_p (newi2pat, PATTERN (i2)))
> -    {
> -      if (dump_file)
> -     fprintf (dump_file, "i2 didn't change, not doing this\n");
> -      undo_all ();
> -      return 0;
> -    }
> +  bool only_i3_changed = !i0 && !i1 && rtx_equal_p (newi2pat, PATTERN (i2));
>  
>    /* We now know that we can do this combination.  Merge the insns and
>       update the status of registers and LOG_LINKS.  */
> @@ -4785,6 +4776,9 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>    combine_successes++;
>    undo_commit ();
>  
> +  if (only_i3_changed)
> +    return i3;
> +
>    rtx_insn *ret = newi2pat ? i2 : i3;
>    if (added_links_insn && DF_INSN_LUID (added_links_insn) < DF_INSN_LUID (ret))
>      ret = added_links_insn;
>
> so essentially your patch, I see an 11.9x improvement in combine's
> compile time in the original testcase for PR101523, going from 88%
> to 41% of total compile time.  Memory usage improves 110x, from
> 97% to 25%.  The output is unchanged.  Adding:
>
> diff --git a/gcc/combine.cc b/gcc/combine.cc
> index e12a78c8d89..becfd0a65ff 100644
> --- a/gcc/combine.cc
> +++ b/gcc/combine.cc
> @@ -472,7 +472,7 @@ static void move_deaths (rtx, rtx, int, rtx_insn *, rtx *);
>  static bool reg_bitfield_target_p (rtx, rtx);
>  static void distribute_notes (rtx, rtx_insn *, rtx_insn *, rtx_insn *,
>                             rtx, rtx, rtx);
> -static void distribute_links (struct insn_link *);
> +static void distribute_links (struct insn_link *, rtx_insn * = nullptr);
>  static void mark_used_regs_combine (rtx);
>  static void record_promoted_value (rtx_insn *, rtx);
>  static bool unmentioned_reg_p (rtx, rtx);
> @@ -4209,6 +4209,8 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>      }
>  
>    bool only_i3_changed = !i0 && !i1 && rtx_equal_p (newi2pat, PATTERN (i2));
> +  if (only_i3_changed)
> +    swap_i2i3 = false;

Oops, this was supposed to be split_i2i3.  With that fixed...

>  
>    /* We now know that we can do this combination.  Merge the insns and
>       update the status of registers and LOG_LINKS.  */
> @@ -4584,8 +4586,8 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
>                           NULL_RTX, NULL_RTX, NULL_RTX);
>        }
>  
> -    distribute_links (i3links);
> -    distribute_links (i2links);
> +    distribute_links (i3links, only_i3_changed ? i3 : nullptr);
> +    distribute_links (i2links, !i0 && !i1 ? i2 : nullptr);
>      distribute_links (i1links);
>      distribute_links (i0links);
>  
> @@ -14978,10 +14980,13 @@ distribute_notes (rtx notes, rtx_insn *from_insn, rtx_insn *i3, rtx_insn *i2,
>  
>  /* Similarly to above, distribute the LOG_LINKS that used to be present on
>     I3, I2, and I1 to new locations.  This is also called to add a link
> -   pointing at I3 when I3's destination is changed.  */
> +   pointing at I3 when I3's destination is changed.
> +
> +   If START is nonnull and an insn, we know that the next location for
> +   each link is no earlier than START.  */
>  
>  static void
> -distribute_links (struct insn_link *links)
> +distribute_links (struct insn_link *links, rtx_insn *start)
>  {
>    struct insn_link *link, *next_link;
>  
> @@ -15047,7 +15052,10 @@ distribute_links (struct insn_link *links)
>        I3 to I2.  Also note that not much searching is typically done here
>        since most links don't point very far away.  */
>  
> -      for (insn = NEXT_INSN (link->insn);
> +      insn = start;
> +      if (!insn || NOTE_P (insn))
> +     insn = NEXT_INSN (link->insn);
> +      for (;
>          (insn && (this_basic_block->next_bb == EXIT_BLOCK_PTR_FOR_FN (cfun)
>                    || BB_HEAD (this_basic_block->next_bb) != insn));
>          insn = NEXT_INSN (insn))
>
> improves compile time a further 17.7%, going from 41% to 33% of

[I meant 37%, gah]

> compile time.

...the improvement becomes 34.6%, going from 41% to 34%.

Thanks,
Richard
