Re: [PATCH v2] ipa-cp: Speculatively call specialized functions

Manolis Tsamis Mon, 23 Jan 2023 02:12:31 -0800

On Fri, Jan 13, 2023 at 7:49 PM Martin Jambor <mjam...@suse.cz> wrote:
>
> Hello,
>
> sorry for getting to this quite late.  I have only had a quick glance at
> ipa-cp.cc hunks so far.
>


Hi Martin,

Thanks for taking the time to review these.

> On Fri, Dec 16 2022, Manolis Tsamis wrote:
> > The IPA CP pass offers a wide range of optimizations, where most of them
> > lead to specialized functions that are called from a call site.
> > This can lead to multiple specialized function clones, if more than
> > one call-site allows such an optimization.
> > If not all call-sites can be optimized, the program might end
> > up with call-sites to the original function.
> >
> > This pass assumes that non-optimized call-sites (i.e. call-sites
> > that don't call specialized functions) are likely to be called
> > with arguments that would allow calling specialized clones.
> > Since we cannot guarantee this (for obvious reasons), we can't
> > replace the existing calls. However, we can introduce dynamic
> > guards that test the arguments for the collected constants
> > and calls the specialized function if there is a match.
> >
> > To demonstrate the effect, let's consider the following program part:
> >
> >   func_1()
> >     myfunc(1)
> >   func_2()
> >     myfunc(2)
> >   func_i(i)
> >     myfunc(i)
> >
> > In this case the transformation would do the following:
> >
> >   func_1()
> >     myfunc.constprop.1() // myfunc() with arg0 == 1
> >   func_2()
> >     myfunc.constprop.2() // myfunc() with arg0 == 2
> >   func_i(i)
> >     if (i == 1)
> >       myfunc.constprop.1() // myfunc() with arg0 == 1
> >     else if (i == 2)
> >       myfunc.constprop.2() // myfunc() with arg0 == 2
> >     else
> >       myfunc(i)
>
> My understanding of the code, however, is that it rather creates
>
>   func_i(i)
>     if (i == 1)
>       myfunc.constprop.1_1() // mostly equivalent but separate from 
> myfunc.constprop.1
>     else if (i == 2)
>       myfunc.constprop.2_1() // mostly equivalent but separate from 
> myfunc.constprop.2
>     else
>       myfunc(i)
>
> Which I find difficult to justify.  From comments it looked like the
> reason is avoiding calling find_more_scalar_values, is that correct?
>
> I'd like to know more about the cases you are targeting and cases where
> adding the additional known scalar constants were an issue.  I think it
> needs to be tackled differently.
>
> By the way, as IPA-CP works now (it would be nice but difficult to lift
> that limitation), all but up to one constant in known_csts are constants
> in all call contexts, so without calling find_more_scalar_values you
> should need just one run-time condition per speculative call.  So
> tracking which constant is which might be better than avoiding
> find_more_scalar_values?
>

First of all what you say about the clones being mostly equivalent but
separate is true.
I have also noted this in the V2 changes but the description is based
on V1 where the
clones were indeed shared with ipa-cp. Allow me to provide some context here:

The implementation is based on the assumption that the constant
arguments from an
ipa-cp specialization are likely to appear in non-constant call sites
as well, and in that
case it is worthwhile to speculatively specialize for these. In the
first implementation
the speculative guards called the same specialized functions that were
created by
ipa-cp but this turned out to be an issue for two reasons.

The first issue was find_more_scalar_values. Whereas the constant chosen for the
ipa-cp clone is a good indicator for the likeliness of a value in the
non-constant callsites,
the constants added by find_more_scalar_values are usually not. Adding
more constants
to the specialization is of course an improvement for the call sites
that involve these
constants, but for speculatively specializing they make things worse
by increasing the
complexity of the guard and also decreasing the probability the guard
will be true (by being
more restrictive). This is especially true for pointer constants added
by find_more_scalar_values,
which are useful for the specialized function but greatly limit the
usefulness of the speculative one.

The second reason for creating separate clones is that we realized
that ipa-cp clones are more
than just specializing for a constant. The specialized clones haves a
list of known contexts, which
from my understanding it is incorrect to call the specialized function
from a context not included in
these, and a list of aggregate replacements which didn't work well
when we tried to use them as
speculative specializations.

These are the main reasons that although we wanted to avoid
excessively clone function we had
to create separate clones from ipa-cp's.

Additionally it is true that with this approach, as you mention, just
one run-time condition per
speculative call is needed. The implementation is more general and
supports any number of them,
but currently only one is used. In case this was not implied by your
last sentence there, I want to
clarify that if the specialized function has more than just one
specialized value
(through find_more_scalar_values) then testing for just the single
constant is not enough to
make calling the specialization legal.

I would be interested in ideas and feedback on how that could be
improved and if there is a way
to avoid creating separate function clones here.

> Also growth limits in ipa-cp are not updated appropriately.
>

Because I'm not really familiar with how the growth limits work, can
you please point out what
I should look for to address this?

> Some more comments inline:
>
> >
> > The pass consists of two main parts:
> > * collecting all specialized functions and the argument/constant pair(s)
> > * insertion of the guards during materialization
> >
> > The patch integrates well into ipa-cp and related IPA functionality.
> > Given the nature of IPA, the changes are touching many IPA-related
> > files as well as call-graph data structures.
> >
> > The impact of the dynamic guard is expected to be less than the speedup
> > gained by enabled optimizations (e.g. inlining or constant propagation).
> >
> > gcc/Changelog:
> >
> >         * cgraph.cc (cgraph_add_edge_to_call_site_hash): Add support for 
> > guarded specialized edges.
> >         (cgraph_edge::set_call_stmt): Likewise.
> >         (symbol_table::create_edge): Likewise.
> >         (cgraph_edge::remove): Likewise.
> >         (cgraph_edge::make_speculative): Likewise.
> >         (cgraph_edge::make_specialized): Likewise.
> >         (cgraph_edge::remove_specializations): Likewise.
> >         (cgraph_edge::redirect_call_stmt_to_callee): Likewise.
> >         (cgraph_edge::dump_edge_flags): Likewise.
> >         (verify_speculative_call): Likewise.
> >         (verify_specialized_call): Likewise.
> >         (cgraph_node::verify_node): Likewise.
> >         * cgraph.h (class GTY): Add new class that contains info of 
> > specialized edges.
> >         * cgraphclones.cc (cgraph_edge::clone): Add support for guarded 
> > specialized edges.
> >         (cgraph_node::set_call_stmt_including_clones): Likewise.
> >         * ipa-cp.cc (want_remove_some_param_p): Likewise.
> >         (create_specialized_node): Likewise.
> >         (add_specialized_edges): Likewise.
> >         (ipcp_driver): Likewise.
> >         * ipa-fnsummary.cc (redirect_to_unreachable): Likewise.
> >         (ipa_fn_summary_t::duplicate): Likewise.
> >         (analyze_function_body): Likewise.
> >         (estimate_edge_size_and_time): Likewise.
> >         (remap_edge_summaries): Likewise.
> >         * ipa-inline-transform.cc (inline_transform): Likewise.
> >         * ipa-inline.cc (edge_badness): Likewise.
> >          lto-cgraph.cc (lto_output_edge): Likewise.
> >         (input_edge): Likewise.
> >         * tree-inline.cc (copy_bb): Likewise.
> >         * value-prof.cc (gimple_sc): Add function to create guarded 
> > specializations.
> >         * value-prof.h (gimple_sc): Likewise.
>
> Please also include test-cases.
>

Will do.

> >
> > Signed-off-by: Manolis Tsamis <manolis.tsa...@vrull.eu>
> >
> > ---
> >
>
> [...]
>
> > diff --git a/gcc/ipa-cp.cc b/gcc/ipa-cp.cc
> > index cc031ebed0f..31d01ada928 100644
> > --- a/gcc/ipa-cp.cc
> > +++ b/gcc/ipa-cp.cc
> > @@ -119,6 +119,7 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "symbol-summary.h"
> >  #include "tree-vrp.h"
> >  #include "ipa-prop.h"
> > +#include "gimple-pretty-print.h"
> >  #include "tree-pretty-print.h"
> >  #include "tree-inline.h"
> >  #include "ipa-fnsummary.h"
> > @@ -5239,16 +5240,20 @@ want_remove_some_param_p (cgraph_node *node, 
> > vec<tree> known_csts)
> >    return false;
> >  }
> >
> > +static hash_map<cgraph_node*, vec<cgraph_node*>> 
> > *available_specializations;
> > +
> >  /* Create a specialized version of NODE with known constants in KNOWN_CSTS,
> >     known contexts in KNOWN_CONTEXTS and known aggregate values in AGGVALS 
> > and
> > -   redirect all edges in CALLERS to it.  */
> > +   redirect all edges in CALLERS to it.  If IS_SPECULATIVE is true then 
> > this
> > +   node is created to be part of a guarded specialization edge.  */
> >
> >  static struct cgraph_node *
> >  create_specialized_node (struct cgraph_node *node,
> >                        vec<tree> known_csts,
> >                        vec<ipa_polymorphic_call_context> known_contexts,
> >                        vec<ipa_argagg_value, va_gc> *aggvals,
> > -                      vec<cgraph_edge *> &callers)
> > +                      vec<cgraph_edge *> &callers,
> > +                      bool is_speculative)
> >  {
> >    ipa_node_params *new_info, *info = ipa_node_params_sum->get (node);
> >    vec<ipa_replace_map *, va_gc> *replace_trees = NULL;
> > @@ -5383,7 +5388,7 @@ create_specialized_node (struct cgraph_node *node,
> >    for (const ipa_argagg_value &av : aggvals)
> >      new_node->maybe_create_reference (av.value, NULL);
> >
> > -  if (dump_file && (dump_flags & TDF_DETAILS))
> > +  if (dump_file && (dump_flags & TDF_DETAILS) && !is_speculative)
> >      {
> >        fprintf (dump_file, "     the new node is %s.\n", 
> > new_node->dump_name ());
> >        if (known_contexts.exists ())
> > @@ -5409,6 +5414,13 @@ create_specialized_node (struct cgraph_node *node,
> >    new_info->known_csts = known_csts;
> >    new_info->known_contexts = known_contexts;
> >
> > +  if (is_speculative && !info->ipcp_orig_node)
>
> What is the reason for testing !info->ipcp_orig_node node here?
>

The idea is to limit excessive speculation by not doing it for when a
non-specialized
function is specialized for the first time. This does not affect
correctness, so it can be
removed if my assumption doesn't hold.

>
> > +    {
> > +      vec<cgraph_node*> &spec_nodes
> > +     = available_specializations->get_or_insert (node);
> > +      spec_nodes.safe_push (new_node);
> > +    }
> > +
> >    ipcp_discover_new_direct_edges (new_node, known_csts, known_contexts,
> >                                 aggvals);
> >
> > @@ -6104,6 +6116,21 @@ decide_about_value (struct cgraph_node *node, int 
> > index, HOST_WIDE_INT offset,
> >        known_csts = avals->m_known_vals.copy ();
> >        known_contexts = copy_useful_known_contexts 
> > (avals->m_known_contexts);
> >      }
> > +
> > +  /* If guarded specialization is enabled then we create an additional
> > +     clone with KNOWN_CSTS and no known contexts or aggregates.
> > +     We don't want find_more_scalar_values because adding more constants
> > +     instreases the complexity of the guard and reduces the chance
> > +     that it is used.  */
> > +  if (flag_ipa_guarded_specialization && !val->self_recursion_generated_p 
> > ())
> > +    {
> > +      vec<cgraph_edge *> no_callers = vNULL;
> > +      cgraph_node *guarded_spec_node
> > +     = create_specialized_node (node, known_csts.copy (), vNULL,
> > +                                              NULL, no_callers, true);
>
> It looks like that if the value being considered is an aggregate value
> (offset is non-negative) or polymorphic_context (yeah, the whole
> function is a template), neither of which is recorded in known_csts,
> you'll end up creating a clone with no specialization at all (other than
> that for all direct calls).
>

Thanks for pointing out, I will add a condition to check if there are
any known constants.

>
> > +      update_profiling_info (node, guarded_spec_node);
>
> I must say I don't know what is the best way to distribute profiling
> counts in the transformation you propose, but this is not going to do
> the right thing.  update_profiling_info tries to divide the counts
> proportionally depending on sum of counts of calls to the original and
> the clone and since the clone has no callers at this point, it will
> become quite cold.
>

Indeed. Would it be fine to assume that for all non-constant call sites a small
percentage of them (say ~10%) will end to the speculative specializations and
calculate the profiles based on that?

>
> > +    }
> > +
> >    find_more_scalar_values_for_callers_subset (node, known_csts, callers);
> >    find_more_contexts_for_caller_subset (node, &known_contexts, callers);
> >    vec<ipa_argagg_value, va_gc> *aggvals
> > @@ -6111,7 +6138,7 @@ decide_about_value (struct cgraph_node *node, int 
> > index, HOST_WIDE_INT offset,
> >    gcc_checking_assert (ipcp_val_agg_replacement_ok_p (aggvals, index,
> >                                                     offset, val->value));
> >    val->spec_node = create_specialized_node (node, known_csts, 
> > known_contexts,
> > -                                         aggvals, callers);
> > +                                         aggvals, callers, false);
> >
> >    if (val->self_recursion_generated_p ())
> >      self_gen_clones->safe_push (val->spec_node);
> > @@ -6270,7 +6297,7 @@ decide_whether_version_node (struct cgraph_node *node)
> >         known_contexts = vNULL;
> >       }
> >        clone = create_specialized_node (node, known_csts, known_contexts,
> > -                                    aggvals, callers);
> > +                                    aggvals, callers, false);
> >        info->do_clone_for_all_contexts = false;
> >        ipa_node_params_sum->get (clone)->is_all_contexts_clone = true;
> >        ret = true;
> > @@ -6546,6 +6573,135 @@ ipcp_store_vr_results (void)
> >      }
> >  }
> >
> > +/* Add new edges to the call graph to represent the available 
> > specializations
> > +   of each specialized function.  */
> > +static void
> > +add_specialized_edges (void)
> > +{
> > +  cgraph_edge *e;
> > +  cgraph_node *n, *spec_n;
> > +  tree known_cst;
> > +  unsigned i, j;
> > +
> > +  FOR_EACH_DEFINED_FUNCTION (n)
> > +    {
> > +      if (dump_file && n->callees)
> > +     fprintf (dump_file,
> > +              "Procesing function %s for specialization of edges.\n",
> > +              n->dump_name ());
> > +
> > +      if (n->ipcp_clone)
> > +     continue;
> > +
> > +      bool update = false;
> > +      for (e = n->callees; e; e = e->next_callee)
> > +     {
> > +       if (!e->callee || e->recursive_p ())
> > +         continue;
> > +
> > +       vec<cgraph_node*> *specialization_nodes
> > +         = available_specializations->get (e->callee);
> > +
> > +       /* Even if the calle is a specialized node it is still valid to
> > +          further create guarded specializations based on the original 
> > node.
> > +          If the existing specialized node doesn't have any known constants
> > +          then it is probably profitable to specialize further.  */
>
> So you are saying thar scalar constants specializations are always
> better than aggregate or polymorphic_context ones?  IMHO this should be
> at least driven by some heuristics like the number that
> good_cloning_opportunity_p uses.
>

Ok, I will look into good_cloning_opportunity_p and see if I can
create a heuristic
that makes sense for that case. The reason this was made like that is
that in our
tests the constants ended up being more important.

> > +       if (e->callee->ipcp_clone && !specialization_nodes)
> > +         {
> > +           ipa_node_params *info
> > +             = ipa_node_params_sum->get (e->callee);
> > +           gcc_checking_assert (info->ipcp_orig_node);
> > +
> > +           bool has_known_constant = false;
> > +           FOR_EACH_VEC_ELT (info->known_csts, i, known_cst)
> > +             if (known_cst != NULL_TREE)
> > +               {
> > +                 has_known_constant = true;
> > +                 break;
> > +               }
> > +
> > +           if (!has_known_constant)
> > +             specialization_nodes
> > +               = available_specializations->get (info->ipcp_orig_node);
> > +         }
> > +
> > +       if (!specialization_nodes)
> > +         continue;
> > +
> > +       unsigned num_of_specializations = 0;
> > +       unsigned max_num_of_specializations = opt_for_fn (n->decl,
> > +                                               
> > param_ipa_spec_max_per_edge);
> > +
> > +       FOR_EACH_VEC_ELT (*specialization_nodes, i, spec_n)
> > +         {
> > +           if (dump_file)
> > +             fprintf (dump_file,
> > +                      "Edge has available specialization %s.\n",
> > +                      spec_n->dump_name ());
> > +
> > +           ipa_node_params *spec_params = ipa_node_params_sum->get 
> > (spec_n);
> > +           vec<cgraph_specialization_info> replaced_args = vNULL;
> > +           bool failed = false;
> > +
> > +           FOR_EACH_VEC_ELT (spec_params->known_csts, j, known_cst)
>
> As I wrote before, I think you are testing also constants which we know
> are there.

If you mean that guards which are trivially false can be created (e.g.
if (4 == 6))
that is true. I thought that letting later optimizations take care of
that is fine
but maybe it's also wasteful. I can improve it by testing if the argument
is constant.

>
> The idea is interesting, thanks for exploring these options.  As I said,
> knowing a bit more about what motivated you might help us to reason
> about it.
>

This implementation is part of an optimization effort that was inspired by
patterns in SPEC2017's x264 benchmark that were then made into more
general optimization approaches. The idea is that these are useful if they
can offer improved performance as long they don't cause important
performance regressions. Combined with another (similar in nature)
proposed optimization our current measurements show a ~2-3%
improvement in that benchmark.

Thanks,
Manolis

> Martin

Re: [PATCH v2] ipa-cp: Speculatively call specialized functions

Reply via email to