> Index: gcc/auto-profile.c
> ===================================================================
> --- gcc/auto-profile.c        (revision 0)
> +++ gcc/auto-profile.c        (revision 0)
> @@ -0,0 +1,1584 @@
> +/* Calculate branch probabilities, and basic block execution counts.

Update the toplevel comment, too.

> +
> +/* The following routines implement AutoFDO optimization.
> +
> +   This optimization uses sampling profiles to annotate basic block counts
> +   and uses heuristics to estimate branch probabilities.
> +
> +   There are three phases in AutoFDO:
> +
> +   Phase 1: Read profile from the profile data file.
> +     The following info is read from the profile datafile:
> +     * string_table: a map between function name and its index.
> +     * autofdo_source_profile: a map from function_instance name to
> +       function_instance. This is represented as a forest of
> +       function_instances.
> +     * WorkingSet: a histogram of how many instructions are covered for a
> +     given percentage of total cycles.
> +
> +   Phase 2: Early inline.
> +     Early inline uses autofdo_source_profile to find if a callsite is:
> +     * inlined in the profiled binary.
> +     * callee body is hot in the profiling run.
> +     If both conditions are satisfied, early inline will inline the
> +     callsite regardless of the code growth.
> +
> +   Phase 3: Annotate control flow graph.
> +     AutoFDO uses a separate pass to:
> +     * Annotate basic block count
> +     * Estimate branch probability
> +
> +   After the above 3 phases, all profile is readily annotated on the GCC IR.
> +   AutoFDO tries to reuse all FDO infrastructure as much as possible to make
> +   use of the profile. E.g. it uses existing mechanism to calculate the basic
> +   block/edge frequency, as well as the cgraph node/edge count.
> +*/

I suppose the phases can be made explicit to the pass manager, so the early
inliner can stay a pass.
> +/* Store a string array, indexed by string position in the array.  */
> +class string_table {
> +public:
> +  static string_table *create ();
> +
> +  /* For a given string, returns its index.  */
> +  int get_index (const char *name) const;
> +
> +  /* For a given decl, returns the index of the decl name.  */
> +  int get_index_by_decl (tree decl) const;
> +
> +  /* For a given index, returns the string.  */
> +  const char *get_name (int index) const;

I wonder how these should work with LTO.  I suppose we can tell the user not
to LTO for the train run, but we should teach the table to ignore
LTO-generated suffixes as well as random seeds (like the coverage code
already does).

What happens when two static functions have the same name?

> +/* Profile for all functions.  */
> +class autofdo_source_profile {
> +public:
> +  static autofdo_source_profile *create ()
> +    {
> +      autofdo_source_profile *map = new autofdo_source_profile ();

I think we still want empty lines here.  Otherwise I trust you on following
the GCC C++ coding standards :)
> +      if (map->read ())
> +     return map;
> +      delete map;
> +      return NULL;
> +    }

> +/* Helper functions.  */
> +
> +/* Return the original name of NAME: strip the suffix that starts
> +   with '.'  */
> +
> +static const char *get_original_name (const char *name)
> +{
> +  char *ret = xstrdup (name);
> +  char *find = strchr (ret, '.');
> +  if (find != NULL)
> +    *find = 0;
> +  return ret;
> +}

Stripping all suffixes will likely make conflicts on static functions, right?
An example like this will get you a conflict:

m ()
{
  void t (void)
  {
  }
  t ();
}
m2 ()
{
  void t2 (void)
  {
  }
  t2 ();
}

That is of course not that important given that nested functions are not a
top priority, but it would be nice to handle it gracefully.
I would definitely merge the logic here with the logic in gcov that already
knows how to skip random seeds and other cases that make names diverge...
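Something along these lines could reduce the conflicts (a standalone sketch, not GCC code: the suffix list and the helper name are made up for illustration) - strip only known clone suffixes and trailing numeric seeds, instead of truncating at the first '.':

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: instead of truncating at the first '.', strip only
// known clone suffixes and a trailing ".<digits>" part ("t.1234" -> "t"),
// so distinct names such as "a.b" survive intact.  The suffix list here is
// illustrative, not exhaustive.
static std::string
strip_known_suffixes (const std::string &name)
{
  static const char *const suffixes[] = { ".constprop", ".isra", ".part" };
  std::string ret = name;
  for (const char *suf : suffixes)
    {
      std::string::size_type pos = ret.find (suf);
      if (pos != std::string::npos)
        ret.erase (pos);        // drop the clone suffix and what follows
    }
  // Drop a trailing ".<digits>", as emitted for nested/cloned functions.
  std::string::size_type dot = ret.rfind ('.');
  if (dot != std::string::npos
      && dot + 1 < ret.size ()
      && ret.find_first_not_of ("0123456789", dot + 1) == std::string::npos)
    ret.erase (dot);
  return ret;
}
```

With this, "t.1234" becomes "t" and "foo.constprop.0" becomes "foo", while a name that merely contains a dot is left alone.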
> +
> +/* Return the combined location, which is a 32bit integer in which
> +   higher 16 bits stores the line offset of LOC to the start lineno
> +   of DECL.  The lower 16 bits stores the discriminator.  */
> +
> +static unsigned
> +get_combined_location (location_t loc, tree decl)
> +{
> +  return ((LOCATION_LINE (loc) - DECL_SOURCE_LINE (decl)) << 16);
> +}

What happens on overflow here?  I guess it is possible - 65536 lines is not
infinity these days.
The discriminator is added later?
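For reference, the packing the quoted comment describes could clamp both fields (a standalone sketch; the function name and the saturation are mine, not the patch's - without clamping, an offset >= 65536 silently spills past the 16-bit field):

```cpp
#include <algorithm>
#include <cassert>

// Sketch of the 32-bit combined location described above: line offset in
// the high 16 bits, discriminator in the low 16 bits.  Saturating both
// fields addresses the overflow question raised here.
static unsigned
combine_location (unsigned line_offset, unsigned discriminator)
{
  line_offset = std::min (line_offset, 0xffffu);     // clamp to 16 bits
  discriminator = std::min (discriminator, 0xffffu);
  return (line_offset << 16) | discriminator;
}
```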
> +
> +int
> +string_table::get_index (const char *name) const
> +{
> +  if (name == NULL)
> +    return -1;
> +  string_index_map::const_iterator iter = map_.find (name);
> +  if (iter == map_.end())
> +    return -1;
> +  else
> +    return iter->second;
> +}
> +
> +int
> +string_table::get_index_by_decl (tree decl) const

Comments missing here.
> +
> +/* Read the profile and create a function_instance with head count as
> +   HEAD_COUNT. Recursively read callsites to create nested function_instances
> +   too. STACK is used to track the recursive creation process.  */
> +
> +function_instance *
> +function_instance::read_function_instance (
> +    function_instance_stack *stack, gcov_type head_count)
> +{
> +  unsigned name = gcov_read_unsigned ();
> +  unsigned num_pos_counts = gcov_read_unsigned ();
> +  unsigned num_callsites = gcov_read_unsigned ();
> +  function_instance *s = new function_instance (name, head_count);
> +  stack->push_back(s);
> +
> +  for (unsigned i = 0; i < num_pos_counts; i++)
> +    {
> +      unsigned offset = gcov_read_unsigned () & 0xffff0000;
> +      unsigned num_targets = gcov_read_unsigned ();
> +      gcov_type count = gcov_read_counter ();
> +      s->pos_counts[offset].count = count;
> +      for (unsigned j = 0; j < stack->size(); j++)
> +     (*stack)[j]->total_count_ += count;
> +      for (unsigned j = 0; j < num_targets; j++)
> +     {
> +       /* Only indirect call target histogram is supported now.  */
> +       gcov_read_unsigned ();
> +       gcov_type target_idx = gcov_read_counter ();
> +       s->pos_counts[offset].targets[target_idx] =
> +           gcov_read_counter ();
> +     }
> +    }
> +  for (unsigned i = 0; i < num_callsites; i++) {
> +    unsigned offset = gcov_read_unsigned ();
> +    function_instance *callee_function_instance =
> +     read_function_instance (stack, 0);
> +    s->callsites[std::make_pair (offset, callee_function_instance->name ())] =
> +     callee_function_instance;
> +  }
> +  stack->pop_back();
> +  return s;

Can you please add a summary of the file format and what is read in here?
> +/* Find count_info for a given gimple STMT. If found, store the count_info
> +   in INFO and return true; otherwise return false.  */
> +
> +bool
> +autofdo_source_profile::get_count_info (gimple stmt, count_info *info) const
> +{
> +  if (LOCATION_LOCUS (gimple_location (stmt)) == cfun->function_end_locus)
> +    return false;
> +
> +  inline_stack stack;
> +  get_inline_stack (gimple_location (stmt), &stack);
> +  if (stack.size () == 0)
> +    return false;
> +  function_instance *s = get_function_instance_by_inline_stack (stack);
> +  if (s == NULL)
> +    return false;
> +  return s->get_count_info (stack[0].second, info);
> +}
> +
> +/* Mark LOC as annotated.  */
> +
> +void
> +autofdo_source_profile::mark_annotated (location_t loc) {

Please review the file for the coding standards.
> +
> +  /* Program behavior changed, original promoted (and inlined) target is not
> +     hot any more.  We will avoid promoting the original target.
> +
> +     To check if original promoted target is still hot, we check the total
> +     count of the unpromoted targets (stored in old_info). If it is no less
> +     than half of the callsite count (stored in INFO), the original promoted
> +     target is considered not hot any more.  */
> +  if (total >= info->count * 0.5)

Probably /2; I am not sure if we can hit roundoff errors with
long long->double->long long, but it is better to avoid it anyway.
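The round-off is in fact reachable for 64-bit counts above 2^53, where a double can no longer represent every integer (a standalone demonstration; the helper names and the 2^53 example are mine, not the patch's):

```cpp
#include <cassert>
#include <cstdint>

// "total >= count * 0.5" is evaluated in double: count converts to double
// and may round up, while "total >= count / 2" stays in exact integer
// arithmetic.  For counts above 2^53 the two can disagree.
static bool
hot_via_double (int64_t total, int64_t count)
{
  return total >= count * 0.5;   // count -> double, may round
}

static bool
hot_via_integer (int64_t total, int64_t count)
{
  return total >= count / 2;     // exact
}
```

With count = 2^53 + 3 (which rounds up to 2^53 + 4 as a double) and total = count / 2, the double version answers false while the integer version answers true.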
> +/* Module profile is only used by LIPO. Here we simply ignore it.  */
> +
> +static void
> +fake_read_autofdo_module_profile ()

Probably could get in only after LIPO?
> +{
> +  /* Read in the module info.  */
> +  gcov_read_unsigned ();
> +
> +  /* Skip the length of the section.  */
> +  gcov_read_unsigned ();

You probably want to check that length is 0 or so.
> +  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +    {
> +      count_info info;
> +      gimple stmt = gsi_stmt (gsi);
> +      if (stmt->code == GIMPLE_DEBUG)
> +     continue;

Clobbers can probably also be skipped.

> +/* If a basic block's count is known, and only one of its in/out edges' count
> +   is unknown, its count can be calculated.
> +   Meanwhile, if all of the in/out edges' counts are known, then the basic
> +   block's unknown count can also be calculated.
> +   IS_SUCC is true if out edges of a basic blocks are examined.
> +   Return TRUE if any basic block/edge count is changed.  */

Interesting profile cleanup.
I wonder why EDGE_ANNOTATED is exported now (i.e. why it is not maintained
only within auto-FDO).  Do you have more use for it outside auto-FDO?

I was thinking that having a flag marking the profile as reliable may be a
useful idea, but I never really tried to implement it.
> +
> +/* Special propagation for circuit expressions. Because GCC translates
> +   control flow into data flow for circuit expressions. E.g.
> +   BB1:
> +   if (a && b)
> +     BB2
> +   else
> +     BB3
> +
> +   will be translated into:
> +
> +   BB1:
> +     if (a)
> +       goto BB.t1
> +     else
> +       goto BB.t3
> +   BB.t1:
> +     if (b)
> +       goto BB.t2
> +     else
> +       goto BB.t3
> +   BB.t2:
> +     goto BB.t3
> +   BB.t3:
> +     tmp = PHI (0 (BB1), 0 (BB.t1), 1 (BB.t2)
> +     if (tmp)
> +       goto BB2
> +     else
> +       goto BB3
> +
> +   In this case, we need to propagate through PHI to determine the edge
> +   count of BB1->BB.t1, BB.t1->BB.t2.  */

So here you have known probabilities of BB1, BB2 and BB3 and you need to
guess all the rest?
Why does it need so much gimple matching?
> +
> +static void
> +afdo_propagate_circuit (void)
> +{
> +  basic_block bb;
> +  FOR_ALL_BB_FN (bb, cfun)
> +    {
> +      gimple phi_stmt;
> +      tree cmp_rhs, cmp_lhs;
> +      gimple cmp_stmt = last_stmt (bb);
> +      edge e;
> +      edge_iterator ei;
> +
> +      if (!cmp_stmt || gimple_code (cmp_stmt) != GIMPLE_COND)
> +     continue;
> +      cmp_rhs = gimple_cond_rhs (cmp_stmt);
> +      cmp_lhs = gimple_cond_lhs (cmp_stmt);
> +      if (!TREE_CONSTANT (cmp_rhs)
> +       || !(integer_zerop (cmp_rhs) || integer_onep (cmp_rhs)))
> +     continue;
> +      if (TREE_CODE (cmp_lhs) != SSA_NAME)
> +     continue;
> +      if ((bb->flags & BB_ANNOTATED) == 0)
> +     continue;
> +      phi_stmt = SSA_NAME_DEF_STMT (cmp_lhs);
> +      while (phi_stmt && gimple_code (phi_stmt) == GIMPLE_ASSIGN
> +          && gimple_assign_single_p (phi_stmt)
> +          && TREE_CODE (gimple_assign_rhs1 (phi_stmt)) == SSA_NAME)
> +     phi_stmt = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (phi_stmt));
> +      if (!phi_stmt || gimple_code (phi_stmt) != GIMPLE_PHI)
> +     continue;
> +      FOR_EACH_EDGE (e, ei, bb->succs)
> +     {
> +       unsigned i, total = 0;
> +       edge only_one;
> +       bool check_value_one = (((integer_onep (cmp_rhs))
> +                 ^ (gimple_cond_code (cmp_stmt) == EQ_EXPR))
> +                 ^ ((e->flags & EDGE_TRUE_VALUE) != 0));
> +       if ((e->flags & EDGE_ANNOTATED) == 0)
> +         continue;
> +       for (i = 0; i < gimple_phi_num_args (phi_stmt); i++)
> +         {
> +           tree val = gimple_phi_arg_def (phi_stmt, i);
> +           edge ep = gimple_phi_arg_edge (phi_stmt, i);
> +
> +           if (!TREE_CONSTANT (val) || !(integer_zerop (val)
> +               || integer_onep (val)))
> +             continue;
> +           if (check_value_one ^ integer_onep (val))
> +             continue;
> +           total++;
> +           only_one = ep;
> +           if (e->probability == 0 && (ep->flags & EDGE_ANNOTATED) == 0)
> +             {
> +               ep->probability = 0;
> +               ep->count = 0;
> +               ep->flags |= EDGE_ANNOTATED;
> +             }
> +         }
> +       if (total == 1 && (only_one->flags & EDGE_ANNOTATED) == 0)
> +         {
> +           only_one->probability = e->probability;
> +           only_one->count = e->count;
> +           only_one->flags |= EDGE_ANNOTATED;
> +         }
> +     }
> +    }
> +}
> +
> +/* Propagate the basic block count and edge count on the control flow
> +   graph.  We do the propagation iteratively until it stabilizes.  */
> +
> +static void
> +afdo_propagate (void)
> +{
> +  basic_block bb;
> +  bool changed = true;
> +  int i = 0;
> +
> +  FOR_ALL_BB_FN (bb, cfun)
> +    {
> +      bb->count = ((basic_block) bb->aux)->count;
> +      if ((((basic_block) bb->aux)->flags & BB_ANNOTATED) != 0)
> +     bb->flags |= BB_ANNOTATED;
> +    }
> +
> +  while (changed && i++ < 10)
> +    {
> +      changed = false;
> +
> +      if (afdo_propagate_edge (true))
> +     changed = true;
> +      if (afdo_propagate_edge (false))
> +     changed = true;
> +      afdo_propagate_circuit ();
> +    }

I wonder if there is any code to share with the same (except for the circuit
thing) code in profile.c and gcov.c?  profile.c also has the minimal flow
algorithm for fixing broken profiles; perhaps it could be useful here?
> +  afdo_find_equiv_class ();
> +  afdo_propagate ();
> +
> +  FOR_EACH_BB_FN (bb, cfun)
> +    {
> +      edge e;
> +      edge_iterator ei;
> +      int num_unknown_succ = 0;
> +      gcov_type total_count = 0;
> +
> +      FOR_EACH_EDGE (e, ei, bb->succs)
> +     {
> +       if ((e->flags & EDGE_ANNOTATED) == 0)
> +         num_unknown_succ ++;
> +       else
> +         total_count += e->count;
> +     }
> +      if (num_unknown_succ == 0 && total_count > 0)
> +     {
> +       FOR_EACH_EDGE (e, ei, bb->succs)
> +         e->probability =
> +             (double) e->count * REG_BR_PROB_BASE / total_count;
> +     }
> +    }
> +  FOR_ALL_BB_FN (bb, cfun)
> +    {
> +      edge e;
> +      edge_iterator ei;
> +
> +      FOR_EACH_EDGE (e, ei, bb->succs)
> +     e->count =
> +             (double) bb->count * e->probability / REG_BR_PROB_BASE;
> +      bb->aux = NULL;
> +    }
> +
> +  loop_optimizer_finalize ();
> +  free_dominance_info (CDI_DOMINATORS);
> +  free_dominance_info (CDI_POST_DOMINATORS);
> +}

I wonder if using the estimated profile to fill in the missing gaps would
make sense?
I.e. running the standard branch probabilities->frequencies propagation code,
but this time taking the determined counts as known values?
> +
> +/* Perform value profile transformation using AutoFDO profile. Add the
> +   promoted stmts to PROMOTED_STMTS. Return TRUE if there is any
> +   indirect call promoted.  */
> +
> +static bool
> +afdo_vpt_for_early_inline (stmt_set *promoted_stmts)

What exactly does this function do?  It inlines according to the inline
decisions made at train-run time and recorded in the profile feedback?

The patch looks very reasonable in general.  Let's break it up into smaller
chunks and get it merged.  It is interesting stuff!

Thanks a lot!
Honza
