Christoph Müllner <christoph.muell...@vrull.eu> writes:
> This patch converts the fold-mem-offsets pass from DF to RTL-SSA.
> Along with this conversion, the way the pass collects information
> was completely reworked.  Instead of visiting each instruction multiple
> times, this is now done only once.
>
> Most significant changes are:
> * The pass operates mainly on insn_info objects from RTL-SSA.
> * Single iteration over all nondebug INSNs for identification
>   of fold-mem-roots, followed by a walk of each fold-mem-root's
>   DEF-chain to collect foldable constants.
> * The class fold_mem_info holds vectors for the DEF-chain of
>   the to-be-folded INSNs (fold_agnostic_insns, which don't need
>   to be adjusted, and fold_insns, which need their constant to
>   be set to zero).
> * Introduction of a single-USE mode, which only collects DEFs
>   that have a single USE and therefore are safe to transform
>   (the fold-mem-root will be the final USE).  This mode is fast
>   and will always run (unless disabled via -fno-fold-mem-offsets).
> * Introduction of a multi-USE mode, which allows DEFs to have
>   multiple USEs, but all USEs must be part of any fold-mem-root's
>   DEF-chain.  The analysis of all USEs is expensive and therefore,
>   this mode is disabled for highly connected CFGs.  Note that
>   multi-USE mode will miss some opportunities that the single-USE
>   mode finds (e.g. multi-USE mode fails for fold-mem-offsets-3.c).
>
> The following testing was done:
> * Bootstrapped and regtested on aarch64-linux and x86-64-linux.
> * Regtested on riscv64-linux.
> * SPEC CPU 2017 tested on aarch64 and riscv64-linux.
>
> The number of applied optimizations of different versions/modes
> of fold-mem-offsets in SPEC CPU2017 on RISC-V rv64gc_zba_zbb_zbs
> is as follows:
>
> Upstream:
>   500.perlbench_r: 169
>   502.gcc_r: 624
>   520.omnetpp_r: 1301
>   523.xalancbmk_r: 23
>   525.x264_r: 705
>   531.deepsjeng_r: 36
>   541.leela_r: 19
>   548.exchange2: 90
>   557.xz_r: 11
>   SUM: 2151
>
> New single-USE:
>   500.perlbench_r: 70
>   502.gcc_r: 263
>   520.omnetpp_r: 1100
>   523.xalancbmk_r: 10
>   525.x264_r: 95
>   531.deepsjeng_r: 19
>   541.leela_r: 252
>   548.exchange2: 13
>   557.xz_r: 11
>   SUM: 1833
>
> New multi-USE:
>   500.perlbench_r: 186
>   502.gcc_r: 744
>   520.omnetpp_r: 1187
>   523.xalancbmk_r: 22
>   525.x264_r: 985
>   531.deepsjeng_r: 21
>   541.leela_r: 87
>   548.exchange2: 63
>   557.xz_r: 23
>   SUM: 3318
>
> New single-USE then multi-USE:
>   500.perlbench_r: 192
>   502.gcc_r: 761
>   520.omnetpp_r: 1673
>   523.xalancbmk_r: 22
>   525.x264_r: 995
>   531.deepsjeng_r: 21
>   541.leela_r: 252
>   548.exchange2: 63
>   557.xz_r: 23
>   SUM: 4002
>
> A compile-time analysis with `/bin/time -v ./install/usr/local/bin/gcc -O2 all.i`
> (all.i from PR117922) shows:
> * -fno-fold-mem-offsets:  289 s (user time) / 15572436 kBytes (max resident set size)
> * -ffold-mem-offsets:     339 s (user time) / 23606516 kBytes (max resident set size)
> Adding -fexpensive-optimizations to enable multi-USE mode does not have
> an impact on the duration or the memory footprint.
>
> SPEC CPU 2017 showed no significant performance impact on aarch64-linux.
>
> gcc/ChangeLog:
>
>       PR rtl-optimization/117922
>       * fold-mem-offsets.cc (INCLUDE_ALGORITHM): Added definition.
>       (INCLUDE_FUNCTIONAL): Likewise.
>       (INCLUDE_ARRAY): Likewise.
>       (class pass_fold_mem_offsets): Moved to bottom of file.
>       (get_fold_mem_offset_root): Converted to RTL-SSA.
>       (get_single_def_in_bb): Converted to RTL-SSA.
>       (get_uses): New.
>       (has_foldable_uses_p): Converted to RTL-SSA.
>       (fold_offsets): Converted to RTL-SSA.
>       (fold_offsets_1): Converted to RTL-SSA.
>       (get_fold_mem_root): Removed.
>       (do_check_validity): New.
>       (do_analysis): Removed.
>       (insn_uses_not_in_bitmap): New.
>       (do_fold_info_calculation): Removed.
>       (drop_unsafe_candidates): New.
>       (do_commit_offset): Converted to RTL-SSA.
>       (compute_validity_closure): Removed.
>       (do_commit_insn): Changed to change INSN in place.
>       (fold_mem_offsets_single_use): New.
>       (fold_mem_offsets_multi_use): New.
>       (pass_fold_mem_offsets::execute): Moved to bottom of file.
>       (fold_mem_offsets): New.
>
> Signed-off-by: Christoph Müllner <christoph.muell...@vrull.eu>
> ---
>  gcc/fold-mem-offsets.cc | 1138 ++++++++++++++++++++-------------------
>  1 file changed, 596 insertions(+), 542 deletions(-)

Thanks for doing this.  I have some specific comments below, but my
main general comment is this:

When using rtl-ssa, changes to instructions should happen through the
rtl-ssa changes framework, since that keeps the metadata up-to-date.
It also ensures that (for example) uses in debug instructions are not
accidentally left behind.

As part of this, I think it would be better to make all the rtl insn
changes tentatively *before* checking validity, rather than building
up a list of rtl insns that needs to be changed later.  That is,
each attempt like:

      /* Walk DEF-chain and collect info->fold_insns, info->fold_agnostic_insns
         and the resulting offset.  */
      info->added_offset = fold_offsets (info->insn, info->reg, info, false);
      if (info->added_offset == 0 || !do_check_validity (info))
        {
          delete info;
          continue;
        }

should first create an insn_change_watermark, then use validate_change
& co (from recog.cc) to change the instructions as soon as the desired
changes are known, passing true to the in_group parameter.  Then, if the
final validity checks fail, the code can continue without committing the
changes, so that everything reverts back to the way that it was before.

This also makes it easier to get the target conditions right.  The patch
checks whether something is a member of GENERAL_REGS and checks whether
the new memory meets memory_address_addr_space_p.  But what ultimately
matters is whether the new instruction is recognised and whether it
matches its constraints.  In the case of rtl-ssa, this is all done by
rtl-ssa's version of recog.  The register class checks and the
memory_address_addr_space_p checks should then not be needed,
and could actively prevent valid optimisations.

Using the rtl-ssa framework also means that changes_are_worthwhile
can be used to test whether each set of changes is beneficial.
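
For example, the flow for a candidate could look something like this
(just an untested sketch, shown for a single insn; in practice each
modified instruction would get its own insn_change and the final
commit would go through crtl->ssa->change_insns):

  obstack_watermark attempt = crtl->ssa->new_change_attempt ();
  insn_change change (insn);

  /* Tentatively apply the rtl change as part of a change group,
     where MEM_LOC and NEW_ADDR stand for whatever the pass has
     computed for the fold.  */
  insn_change_watermark rtl_watermark;
  validate_change (insn->rtl (), mem_loc, new_addr, true);

  /* Let rtl-ssa's recog verify the result; this subsumes the
     register-class and memory_address_addr_space_p checks.  */
  insn_change *changes[] = { &change };
  if (!rtl_ssa::recog (attempt, change)
      || !rtl_ssa::changes_are_worthwhile (changes))
    /* Destroying rtl_watermark cancels the tentative rtl changes.  */
    return false;

  confirm_change_group ();
  crtl->ssa->change_insn (change);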

Some other general comments:

fold_mem_offsets <-> fold_mem_offsets_1 seems to have unbounded mutual
recursion, which could lead to stack overflow on hosts with small stacks.
It would probably be better to use a worklist instead.

Alternatively, and perhaps more simply, the pass could record offset
information as part of the current instruction walk, storing the result
in a hash_map.  The offset information for the current instruction
(that is, the information that gets entered into the hash_map for
the current instruction) would then depend on the contents of that
instruction and the current contents of the hash_map.  That's just a
suggestion though; gathering the data lazily is ok with me too if
gathering it eagerly is likely to generate a lot of redundant work.
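
FWIW, the eager version could be as simple as this (untested, with
the names invented for illustration):

  /* Map each candidate definition to the constant that could be
     folded out of it.  */
  hash_map<insn_info *, HOST_WIDE_INT> foldable_offsets;

  for (insn_info *insn : crtl->ssa->nondebug_insns ())
    {
      if (insn->is_artificial ())
        continue;
      rtx set = single_set (insn->rtl ());
      if (!set)
        continue;
      rtx src = SET_SRC (set);
      /* Handle (plus (reg X) (const_int N)): the foldable offset is
         N plus whatever is foldable out of the definition of X.  */
      if (GET_CODE (src) == PLUS
          && REG_P (XEXP (src, 0))
          && CONST_INT_P (XEXP (src, 1)))
        if (use_info *use = find_access (insn->uses (),
                                         REGNO (XEXP (src, 0))))
          if (set_info *def = use->def ())
            {
              HOST_WIDE_INT offset = INTVAL (XEXP (src, 1));
              if (HOST_WIDE_INT *prev = foldable_offsets.get (def->insn ()))
                offset += *prev;
              foldable_offsets.put (insn, offset);
            }
    }

Since the walk is in reverse postorder, a definition is always visited
before its uses within a block, so the map is complete by the time it
is queried.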

> diff --git a/gcc/fold-mem-offsets.cc b/gcc/fold-mem-offsets.cc
> index c1c94472a071..0e777c32ee31 100644
> --- a/gcc/fold-mem-offsets.cc
> +++ b/gcc/fold-mem-offsets.cc
> [...]
>  /* This pass tries to optimize memory offset calculations by moving constants

Question about a pre-existing comment, sorry:

   For example it can transform code like this:

     add  t4, sp, 16
     add  t2, a6, t4
     shl  t3, t2, 1
     ld   a2, 0(t3)
     add  a2, 1
     sd   a2, 8(t2)

   into the following (one instruction less):

     add  t2, a6, sp
     shl  t3, t2, 1
     ld   a2, 32(t3)
     add  a2, 1
     sd   a2, 24(t2)

Is that code possible in practice?  It seems to be multiplying a pointer
by 2 (that is, the base seems to be sp * 2 rather than sp).

> [...]
> +      /* Don't fold when we have unspec / volatile.  */
> +      if (GET_CODE (src) == UNSPEC
> +       || GET_CODE (src) == UNSPEC_VOLATILE
> +       || GET_CODE (dest) == UNSPEC
> +       || GET_CODE (dest) == UNSPEC_VOLATILE)

This is pre-existing, sorry, but: destinations can't be UNSPECs or
UNSPEC_VOLATILEs, so I think we should remove the last two lines.
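
I.e. the check would reduce to:

      /* Don't fold when we have unspec / volatile.  */
      if (GET_CODE (src) == UNSPEC
          || GET_CODE (src) == UNSPEC_VOLATILE)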

> [...]
> +/* Get the single reaching definition of REG used in INSN within a BB.
> +   Return the definition or NULL if there's no definition with the desired
> +   criteria.  If SINGLE_USE is set to true the DEF must have exactly one
> +   USE resulting in a 1:1 DEF-USE relationship.  If set to false, then a
> +   1:n DEF-USE relationship is accepted and the caller must take care to
> +   ensure all USEs are safe for folding.  */
> +static set_info *
> +get_single_def_in_bb (insn_info *insn, rtx reg, bool single_use)
>  {
> -  df_ref use;
> -  struct df_link *ref_chain, *ref_link;
> -
> -  FOR_EACH_INSN_USE (use, insn)
> +  /* Get the use_info of the base register.  */
> +  for (use_info *use : insn->uses ())
>      {
> -      if (GET_CODE (DF_REF_REG (use)) == SUBREG)
> +      /* Other USEs can be ignored and multiple equal USEs are fine.  */
> +      if (use->regno () != REGNO (reg))
> +     continue;

This can use:

  use_info *use = find_access (insn->uses (), REGNO (reg));
  if (!use)
    return NULL;

and so avoid the loop.

> +
> +      /* Don't handle subregs for now.  */
> +      if (use->includes_subregs ())
>       return NULL;
> -      if (REGNO (DF_REF_REG (use)) == REGNO (reg))
> -     break;
> -    }
>  
> -  if (!use)
> -    return NULL;
> +      /* Get the DEF of the register.  */
> +      set_info *def = use->def ();
> +      if (!def)
> +     return NULL;
>  
> -  ref_chain = DF_REF_CHAIN (use);
> +      /* Limit the amount of USEs of DEF to 1.  */
> +      auto iter = def->nondebug_insn_uses ();
> +      if (single_use && ++iter.begin () != iter.end ())

This can use def->single_nondebug_use ()
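
i.e. something like:

  if (single_use && !def->single_nondebug_use ())
    return NULL;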

> +     return NULL;
>  
> -  if (!ref_chain)
> -    return NULL;
> +      /* Don't handle multiregs for now.  */
> +      if (def->includes_multiregs ())
> +     return NULL;
>  
> -  for (ref_link = ref_chain; ref_link; ref_link = ref_link->next)
> -    {
> -      /* Problem getting some definition for this instruction.  */
> -      if (ref_link->ref == NULL)
> +      /* Only consider uses whose definition comes from a real instruction
> +      and has no notes attached.  */
> +      insn_info *def_insn = def->insn ();
> +      if (def_insn->is_artificial () || def_insn->first_note ())
>       return NULL;

What's the reason for the first_note check?

> -      if (DF_REF_INSN_INFO (ref_link->ref) == NULL)
> +
> +      /* No parallel expressions or clobbers.  */
> +      if (def_insn->num_defs () != 1)
>       return NULL;
> -      if (global_regs[REGNO (reg)]
> -       && !set_of (reg, DF_REF_INSN (ref_link->ref)))
> +
> +      rtx_insn *def_rtl = def_insn->rtl ();
> +      if (!NONJUMP_INSN_P (def_rtl) || RTX_FRAME_RELATED_P (def_rtl))
>       return NULL;
> -    }
>  
> -  if (ref_chain->next)
> -    return NULL;
> +      /* Check if the DEF is a SET of the expected form.  */
> +      rtx def_set = simple_regno_set (PATTERN (def_rtl), def->regno ());
> +      if (!def_set)
> +     return NULL;
>  
> -  rtx_insn *def = DF_REF_INSN (ref_chain->ref);
> +      /* Ensure DEF and USE are in the same BB.  */
> +      if (def->bb () != insn->bb ())
> +     return NULL;
>  
> -  if (BLOCK_FOR_INSN (def) != BLOCK_FOR_INSN (insn))
> -    return NULL;
> +      /* Ensure the definition comes before the insn.  */
> +      if (*use->insn () < *def_insn)
> +     return NULL;

This is redundant: the definition always comes earlier (in RPO) than the use.
>  
> -  if (DF_INSN_LUID (def) > DF_INSN_LUID (insn))
> -    return NULL;
> +      return def;
> +    }
>  
> -  return def;
> +  return NULL;
>  }
>  
> -/* Get all uses of REG which is set in INSN.  Return the use list or NULL if a
> -   use is missing / irregular.  If SUCCESS is not NULL then set it to false if
> -   there are missing / irregular uses and true otherwise.  */
> -static df_link *
> -get_uses (rtx_insn *insn, rtx reg, bool *success)
> +/* Test if all USEs of DEF (which defines REG) meet certain criteria to be
> +   foldable.  Returns true iff all USEs are fine or false otherwise.  */
> +static bool
> +has_foldable_uses_p (set_info *def, rtx reg)
>  {
> -  df_ref def;
> -
> -  if (success)
> -    *success = false;
> -
> -  FOR_EACH_INSN_DEF (def, insn)
> -    if (REGNO (DF_REF_REG (def)) == REGNO (reg))
> -      break;
> -
> -  if (!def)
> -    return NULL;
> +  /* We only fold through instructions that are transitively used as
> +     memory addresses and do not have other uses.  Use the same logic
> +     from offset calculation to visit instructions that can propagate
> +     offsets and keep track of them in CAN_FOLD_INSNS.  */
> +  for (use_info *use : def->nondebug_insn_uses ())
> +    {

Would use->includes_address_uses () be useful here, as a short-cut?

> +      insn_info *use_insn = use->insn ();
> +      if (use_insn->is_artificial ())
> +     return false;
>  
> -  df_link *ref_chain = DF_REF_CHAIN (def);
> -  int insn_luid = DF_INSN_LUID (insn);
> -  basic_block insn_bb = BLOCK_FOR_INSN (insn);
> +      /* Punt if the use is anything more complicated than a set
> +      (clobber, use, etc).  */
> +      rtx_insn *use_rtl = use_insn->rtl ();
> +      if (!NONJUMP_INSN_P (use_rtl) || GET_CODE (PATTERN (use_rtl)) != SET)
> +     return false;
>  
> -  for (df_link *ref_link = ref_chain; ref_link; ref_link = ref_link->next)
> -    {
> -      /* Problem getting a use for this instruction.  */
> -      if (ref_link->ref == NULL)
> -     return NULL;
> -      if (DF_REF_CLASS (ref_link->ref) != DF_REF_REGULAR)
> -     return NULL;
> +      /* Special case: A foldable memory store is not foldable if it
> +      mentions DEST outside of the address calculation.  */
> +      rtx use_set = PATTERN (use_rtl);
> +      if (use_set && MEM_P (SET_DEST (use_set))
> +       && reg_mentioned_p (reg, SET_SRC (use_set)))
> +     return false;
>  
> -      rtx_insn *use = DF_REF_INSN (ref_link->ref);
> -      if (DEBUG_INSN_P (use))
> -     continue;
> +      if (use->bb () != def->bb ())
> +     return false;
>  
> -      /* We do not handle REG_EQUIV/REG_EQ notes for now.  */
> -      if (DF_REF_FLAGS (ref_link->ref) & DF_REF_IN_NOTE)
> -     return NULL;
> -      if (BLOCK_FOR_INSN (use) != insn_bb)
> -     return NULL;
>        /* Punt if use appears before def in the basic block.  See PR111601.  */
> -      if (DF_INSN_LUID (use) < insn_luid)
> -     return NULL;
> +      if (*use_insn < *def->insn ())
> +     return false;

As above, this check is redundant.

> [...]
> @@ -856,69 +897,82 @@ pass_fold_mem_offsets::execute (function *fn)
>              "fold-mem-offsets: %d basic blocks and %d edges/basic block",
>              n_basic_blocks_for_fn (cfun),
>              n_edges_for_fn (cfun) / n_basic_blocks_for_fn (cfun));
> -      return 0;
> +      multi_use_mode = false;
>      }
>  
> -  df_set_flags (DF_EQ_NOTES + DF_RD_PRUNE_DEAD_DEFS + DF_DEFER_INSN_RESCAN);
> -  df_chain_add_problem (DF_UD_CHAIN + DF_DU_CHAIN);
> +  /* There is a conflict between this pass and RISCV's shorten-memrefs
> +     pass.  For now disable folding if optimizing for size because
> +     otherwise this cancels the effects of shorten-memrefs.  */
> +  cgraph_node *n = cgraph_node::get (fn->decl);
> +  if (n && n->optimize_for_size_p ())
> +    return 0;
> +
> +  /* Initialise RTL SSA.  */
> +  calculate_dominance_info (CDI_DOMINATORS);
>    df_analyze ();
> +  crtl->ssa = new rtl_ssa::function_info (cfun);
>  
> -  bitmap_initialize (&can_fold_insns, NULL);
> -  bitmap_initialize (&candidate_fold_insns, NULL);
> -  bitmap_initialize (&cannot_fold_insns, NULL);
> +  /* The number of instructions that were simplified or eliminated.  */
> +  int stats_fold_count = 0;
>  
> -  stats_fold_count = 0;
> +  /* Fold mem offsets with DEFs that have a single USE.  */
> +  stats_fold_count += fold_mem_offsets_single_use ();
>  
> -  basic_block bb;
> -  rtx_insn *insn;
> -  FOR_ALL_BB_FN (bb, fn)
> +  /* Fold mem offsets with DEFs that have multiple USEs.  */
> +  if (multi_use_mode || flag_expensive_optimizations)
>      {
> -      /* There is a conflict between this pass and RISCV's shorten-memrefs
> -      pass.  For now disable folding if optimizing for size because
> -      otherwise this cancels the effects of shorten-memrefs.  */
> -      if (optimize_bb_for_size_p (bb))
> -     continue;
> -
> -      fold_info_map fold_info;
> +      if (dump_file)
> +     fprintf (dump_file, "Starting multi-use analysis\n");
> +      stats_fold_count += fold_mem_offsets_multi_use ();
> +    }
>  
> -      bitmap_clear (&can_fold_insns);
> -      bitmap_clear (&candidate_fold_insns);
> -      bitmap_clear (&cannot_fold_insns);
> +  statistics_counter_event (cfun, "Number of folded instructions",
> +                         stats_fold_count);
>  
> -      FOR_BB_INSNS (bb, insn)
> -     do_analysis (insn);
> +  free_dominance_info (CDI_DOMINATORS);
> +  cleanup_cfg (0);

With the changes suggested in the general comments, this would become:

  if (crtl->ssa->perform_pending_updates ())
    cleanup_cfg (0);

>  
> -      FOR_BB_INSNS (bb, insn)
> -     do_fold_info_calculation (insn, &fold_info);
> +  delete crtl->ssa;
> +  crtl->ssa = nullptr;
>  
> -      FOR_BB_INSNS (bb, insn)
> -     if (fold_mem_info **info = fold_info.get (insn))
> -       do_check_validity (insn, *info);
> +  delete_trivially_dead_insns (get_insns (), max_reg_num ());

Which cases is this for?  Going back to the example:

   For example it can transform code like this:

     add  t4, sp, 16
     add  t2, a6, t4
     shl  t3, t2, 1
     ld   a2, 0(t3)
     add  a2, 1
     sd   a2, 8(t2)

   into the following (one instruction less):

     add  t2, a6, sp
     shl  t3, t2, 1
     ld   a2, 32(t3)
     add  a2, 1
     sd   a2, 24(t2)

I would have expected the pass to be able to delete the t4 instruction
on the fly, rather than need a separate instruction walk.

Thanks,
Richard
