For Transactional Memory support, we also create versions of functions
(see code in trunk, e.g., in trans-mem.c).  Right now, that's a single
instrumented version of the original code, but having different
transactional instrumentations available might be worthwhile in the
future.

Is there a chance that we can share any code between TM and what you are
working on?


Torvald

On Thu, 2011-12-15 at 22:53 -0800, Sriraman Tallam wrote:
> Hi,
> 
>   I am working on user-directed and compiler-directed function
> multiversioning which has been discussed in these threads:
> 
> 1) http://gcc.gnu.org/ml/gcc-patches/2011-04/msg02344.html
> 2)  http://gcc.gnu.org/ml/gcc/2011-08/msg00298.html
> 
> The gist of the discussions for user-directed multiversioning is that
> it should use a function overloading mechanism to allow the user to
> specify multiple versions of the same function and/or use a function
> attribute to specify that a particular function must be multiversioned.
> Afterwards, a call to such a function is appropriately dispatched by
> the compiler. This work is in progress. However, this patch is *not*
> about this.
> 
> This patch implements compiler-directed multiversioning, which lets
> the compiler automatically version functions in order to exploit uArch
> features and maximize performance for a set of selected target
> platforms within the same binary. I have added a new flag, -mvarch, to
> let the user specify the arches on which the generated binary will
> run.
> More than one arch name is allowed, for instance
> -mvarch=core2,corei7 (the same arch names accepted by -march). The
> compiler will
> then automatically create function versions that are specialized for
> these arches by tagging "-mtune=<arch>" on the versions. It will only
> create versions of those functions where it sees opportunities for
> performance improvement.
> 
> As a use case, I have added versioning for core2, where the function
> will be optimized for vectorization. I recently submitted a patch to
> disallow vectorization of loops on core2 when unaligned vector
> loads/stores would be generated, as these are very slow:
> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00955.html
> With mvarch=core2, the compiler will identify functions with
> vectorizable loads/stores in loops and create a version for core2. The
> core2 version will not have unaligned vector load/stores whereas the
> default will.
> 
> It is also easy to add versioning criteria for other arches or add more
> versioning criteria for core2. I already experimented with one other
> versioning criterion for corei7 and I plan to add that in a follow-up
> patch. Basically, any mtune-specific optimization can be plugged in
> as a versioning criterion. For this patch, only the vectorization-based
> versioning is shown.
> 
> The version dispatch happens via the IFUNC mechanism to keep the
> run-time overhead of version dispatch minimal. When the compiler has
> to version a function foo, it makes two copies, foo.autoclone.original
> and foo.autoclone.clone0. It then modifies foo by replacing its body
> with an ifunc call that selects one of the two versions based on the
> outcome of a run-time check for the processor type.
> 
> The function cloning for preventing vectorization on core2 is done
> aggressively, since it only conservatively checks whether a particular
> function could generate unaligned accesses. This is necessary because
> the cloning pass runs early, whereas the actual vectorization happens
> much later, in the loop optimization passes. Hence, the vectorization
> pass sees a different IR, and the checks it uses to detect unaligned
> accesses cannot be reused here. As a result, it can turn out that the
> final code generated in all the function versions is identical. I am
> working on solutions to this problem, but the ICF feature in the gold
> linker can already detect identical function bodies and merge them.
> Note that this need not always be the case for other optimizations, if
> it is possible to detect with high accuracy whether a function will
> benefit from versioning.
> 
> Regarding the placement of the versioning pass in the pass order, it
> comes after inlining; otherwise, the ifunc calls would prevent
> inlining of the functions. Also, it was desirable that all versioning
> decisions happen at one point, before any target-specific
> optimizations kick in. Hence, the pass was placed just after the
> ipa-inline pass.
> 
> This optimization was tested on one of our image processing related
> benchmarks, and the text size bloat from auto-versioning was about
> 20%. The performance improvement from tree vectorization was ~22% on
> corei7, ~10% on AMD Istanbul, and ~2% on core2. Without this
> versioning, tree vectorization degraded performance by ~6%.
> 
> 
>       * mversn-dispatch.c (make_name): Use '.' to concatenate the suffix
>       to mangled names.
>       (clone_function): Do not make clones ctors/dtors. Recompute dominance
>       info.
>       (make_bb_flow): New function.
>       (get_selector_gimple_seq): New function.
>       (make_selector_function): New function.
>       (make_attribute): New function.
>       (make_ifunc_function): New function.
>       (copy_decl_attributes): New function.
>       (dispatch_using_ifunc): New function.
>       (purge_function_body): New function.
>       (function_can_make_abnormal_goto): New function.
>       (make_function_not_cloneable): New function.
>       (do_auto_clone): New function.
>       (pass_auto_clone): New gimple pass.
>       * passes.c (init_optimization_passes): Add pass_auto_clone to list.
>       * tree-pass.h (pass_auto_clone): New pass.
>       * params.def (PARAM_MAX_FUNCTION_SIZE_FOR_AUTO_CLONING): New param.
>       * target.def (mversion_function): New target hook.
>       * config/i386/i386.c (ix86_option_override_internal): Check correctness
>       of ix86_mv_arch_string.
>       (add_condition_to_bb): New function.
>       (make_empty_function): New function.
>       (make_condition_function): New function.
>       (is_loop_form_vectorizable): New function.
>       (is_loop_stmts_vectorizable): New function.
>       (any_loops_vectorizable_with_load_store): New function.
>       (mversion_for_core2): New function.
>       (ix86_mversion_function): New function.
>       * config/i386/i386.opt (mvarch): New option.
>       * doc/tm.texi (TARGET_MVERSION_FUNCTION): Document.
>       * doc/tm.texi.in (TARGET_MVERSION_FUNCTION): Document.
>       * testsuite/gcc.dg/automversn_1.c: New testcase.
> 
> Index: doc/tm.texi
> ===================================================================
> --- doc/tm.texi       (revision 182355)
> +++ doc/tm.texi       (working copy)
> @@ -10927,6 +10927,11 @@ The result is another tree containing a simplified
>  call's result.  If @var{ignore} is true the value will be ignored.
>  @end deftypefn
>  
> +@deftypefn {Target Hook} int TARGET_MVERSION_FUNCTION (tree @var{fndecl}, 
> tree *@var{optimization_node_chain}, tree *@var{cond_func_decl})
> +Check if a function needs to be multi-versioned to support variants of
> +this architecture.  @var{fndecl} is the declaration of the function.
> +@end deftypefn
> +
>  @deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_VECTOR_MEMOP (void)
>  Return true if unaligned vector memory load/store is a slow operation
>  on this target.
> Index: doc/tm.texi.in
> ===================================================================
> --- doc/tm.texi.in    (revision 182355)
> +++ doc/tm.texi.in    (working copy)
> @@ -10873,6 +10873,11 @@ The result is another tree containing a simplified
>  call's result.  If @var{ignore} is true the value will be ignored.
>  @end deftypefn
>  
> +@hook TARGET_MVERSION_FUNCTION
> +Check if a function needs to be multi-versioned to support variants of
> +this architecture.  @var{fndecl} is the declaration of the function.
> +@end deftypefn
> +
>  @hook TARGET_SLOW_UNALIGNED_VECTOR_MEMOP
>  Return true if unaligned vector memory load/store is a slow operation
>  on this target.
> Index: target.def
> ===================================================================
> --- target.def        (revision 182355)
> +++ target.def        (working copy)
> @@ -1277,6 +1277,12 @@ DEFHOOK
>   "",
>   bool, (void), NULL)
>  
> +/* Target hook to check if this function should be versioned.  */
> +DEFHOOK
> +(mversion_function,
> + "",
> + int, (tree fndecl, tree *optimization_node_chain, tree *cond_func_decl), 
> NULL)
> +
>  /* Returns a code for a target-specific builtin that implements
>     reciprocal of the function, or NULL_TREE if not available.  */
>  DEFHOOK
> Index: tree-pass.h
> ===================================================================
> --- tree-pass.h       (revision 182355)
> +++ tree-pass.h       (working copy)
> @@ -449,6 +449,7 @@ extern struct gimple_opt_pass pass_split_functions
>  extern struct gimple_opt_pass pass_feedback_split_functions;
>  extern struct gimple_opt_pass pass_threadsafe_analyze;
>  extern struct gimple_opt_pass pass_tree_convert_builtin_dispatch;
> +extern struct gimple_opt_pass pass_auto_clone;
>  
>  /* IPA Passes */
>  extern struct simple_ipa_opt_pass pass_ipa_lower_emutls;
> Index: testsuite/gcc.dg/automversn_1.c
> ===================================================================
> --- testsuite/gcc.dg/automversn_1.c   (revision 0)
> +++ testsuite/gcc.dg/automversn_1.c   (revision 0)
> @@ -0,0 +1,27 @@
> +/* Check that the auto_clone pass works correctly.  Function foo must be 
> cloned
> +   because it is hot and has a vectorizable store.  */
> +
> +/* { dg-options "-O2 -ftree-vectorize -mvarch=core2 -fdump-tree-auto_clone" 
> } */
> +/* { dg-do run } */
> +
> +char a[16];
> +
> +int __attribute__ ((hot)) __attribute__ ((noinline))
> +foo (void)
> +{
> +  int i;
> +  for (i = 0; i< 16; i++)
> +    a[i] = 0;
> +  return 0;
> +}
> +
> +int
> +main ()
> +{
> +  return foo ();
> +}
> +
> +
> +/* { dg-final { scan-tree-dump "foo\.autoclone\.original" "auto_clone" } } */
> +/* { dg-final { scan-tree-dump "foo\.autoclone\.0" "auto_clone" } } */
> +/* { dg-final { cleanup-tree-dump "auto_clone" } } */
> Index: mversn-dispatch.c
> ===================================================================
> --- mversn-dispatch.c (revision 182355)
> +++ mversn-dispatch.c (working copy)
> @@ -135,6 +135,8 @@ along with GCC; see the file COPYING3.  If not see
>  #include "output.h"
>  #include "vecprim.h"
>  #include "gimple-pretty-print.h"
> +#include "target.h"
> +#include "cfgloop.h"
>  
>  typedef struct cgraph_node* NODEPTR;
>  DEF_VEC_P (NODEPTR);
> @@ -212,8 +214,7 @@ function_args_count (tree fntype)
>    return num;
>  }
>  
> -/* Return the variable name (global/constructor) to use for the
> -   version_selector function with name of DECL by appending SUFFIX. */
> +/* Return a new name by appending SUFFIX to the DECL name. */
>  
>  static char *
>  make_name (tree decl, const char *suffix)
> @@ -226,7 +227,8 @@ make_name (tree decl, const char *suffix)
>  
>    name_len = strlen (name) + strlen (suffix) + 2;
>    global_var_name = (char *) xmalloc (name_len);
> -  snprintf (global_var_name, name_len, "%s_%s", name, suffix);
> +  /* Use '.' to concatenate names as it is demangler friendly.  */
> +  snprintf (global_var_name, name_len, "%s.%s", name, suffix);
>    return global_var_name;
>  }
>  
> @@ -246,9 +248,9 @@ static char*
>  make_feature_test_global_name (tree decl, bool is_constructor)
>  {
>    if (is_constructor)
> -    return make_name (decl, "version_selector_constructor");
> +    return make_name (decl, "version.selector.constructor");
>  
> -  return make_name (decl, "version_selector_global");
> +  return make_name (decl, "version.selector.global");
>  }
>  
>  /* This function creates a new VAR_DECL with attributes set
> @@ -865,6 +867,9 @@ empty_function_body (tree fndecl)
>    e = make_edge (new_bb, EXIT_BLOCK_PTR, 0);
>    gcc_assert (e != NULL);
>  
> +  if (dump_file)
> +    dump_function_to_file (current_function_decl, dump_file, TDF_BLOCKS);
> +
>    current_function_decl = old_current_function_decl;
>    pop_cfun ();
>    return new_bb;
> @@ -921,6 +926,10 @@ clone_function (tree orig_fndecl, const char *name
>    push_cfun (DECL_STRUCT_FUNCTION (new_decl));
>    current_function_decl = new_decl;
>  
> +  /* The clones should not be ctors or dtors.  */
> +  DECL_STATIC_CONSTRUCTOR (new_decl) = 0;
> +  DECL_STATIC_DESTRUCTOR (new_decl) = 0;
> +
>    TREE_READONLY (new_decl) = TREE_READONLY (orig_fndecl);
>    TREE_STATIC (new_decl) = TREE_STATIC (orig_fndecl);
>    TREE_USED (new_decl) = TREE_USED (orig_fndecl);
> @@ -954,6 +963,12 @@ clone_function (tree orig_fndecl, const char *name
>    cgraph_call_function_insertion_hooks (new_version);
>    cgraph_mark_needed_node (new_version);
>  
> +  
> +  free_dominance_info (CDI_DOMINATORS);
> +  free_dominance_info (CDI_POST_DOMINATORS);
> +  calculate_dominance_info (CDI_DOMINATORS); 
> +  calculate_dominance_info (CDI_POST_DOMINATORS);
> +
>    pop_cfun ();
>    current_function_decl = old_current_function_decl;
>  
> @@ -1034,9 +1049,9 @@ make_specialized_call_to_clone (gimple generic_stm
>    gcc_assert (generic_fndecl != NULL);
>  
>    if (side == 0)
> -    new_name = make_name (generic_fndecl, "clone_0");
> +    new_name = make_name (generic_fndecl, "clone.0");
>    else
> -    new_name = make_name (generic_fndecl, "clone_1");
> +    new_name = make_name (generic_fndecl, "clone.1");
>  
>    slot = htab_find_slot_with_hash (name_decl_htab, new_name,
>                                     htab_hash_string (new_name), NO_INSERT);
> @@ -1764,3 +1779,700 @@ struct gimple_opt_pass pass_tree_convert_builtin_d
>    TODO_update_ssa | TODO_verify_ssa
>   }
>  };
> +
> +/* This function generates gimple code in NEW_BB to check if COND_VAR
> +   is equal to WHICH_VERSION and return FN_VER pointer if it is equal.
> +   The basic block returned is the block where the control flows if
> +   the equality is false.  */
> +
> +static basic_block
> +make_bb_flow (basic_block new_bb, tree cond_var, tree fn_ver,
> +           int which_version, tree bindings)
> +{
> +  tree result_var;
> +  tree convert_expr;
> +
> +  basic_block bb1, bb2, bb3;
> +  edge e12, e23;
> +
> +  gimple if_else_stmt;
> +  gimple if_stmt;
> +  gimple return_stmt;
> +  gimple_seq gseq = bb_seq (new_bb);
> +
> +  /* Check if the value of cond_var is equal to which_version.  */
> +  if_else_stmt = gimple_build_cond (EQ_EXPR, cond_var,
> +                                 build_int_cst (NULL, which_version),
> +                                 NULL_TREE, NULL_TREE);
> +
> +  mark_symbols_for_renaming (if_else_stmt);
> +  gimple_seq_add_stmt (&gseq, if_else_stmt);
> +  gimple_set_block (if_else_stmt, bindings);
> +  gimple_set_bb (if_else_stmt, new_bb);
> +
> +  result_var = create_tmp_var (ptr_type_node, NULL);
> +  add_referenced_var (result_var);
> +
> +  convert_expr = build1 (CONVERT_EXPR, ptr_type_node, fn_ver);
> +  if_stmt = gimple_build_assign (result_var, convert_expr);
> +  mark_symbols_for_renaming (if_stmt);
> +  gimple_seq_add_stmt (&gseq, if_stmt);
> +  gimple_set_block (if_stmt, bindings);
> +
> +  return_stmt = gimple_build_return (result_var);
> +  mark_symbols_for_renaming (return_stmt);
> +  gimple_seq_add_stmt (&gseq, return_stmt);
> +
> +  set_bb_seq (new_bb, gseq);
> +
> +  bb1 = new_bb;
> +  e12 = split_block (bb1, if_else_stmt);
> +  bb2 = e12->dest;
> +  e12->flags &= ~EDGE_FALLTHRU;
> +  e12->flags |= EDGE_TRUE_VALUE;
> +
> +  e23 = split_block (bb2, return_stmt);
> +  gimple_set_bb (if_stmt, bb2);
> +  gimple_set_bb (return_stmt, bb2);
> +  bb3 = e23->dest;
> +  make_edge (bb1, bb3, EDGE_FALSE_VALUE); 
> +
> +  remove_edge (e23);
> +  make_edge (bb2, EXIT_BLOCK_PTR, 0);
> +
> +  return bb3;
> +}
> +
> +/* Given the pointer to the condition function COND_FUNC_ARG, whose return
> +   value decides the version that gets executed, and the pointers to the
> +   function versions, FN_VER_LIST, this function generates control-flow to
> +   return the appropriate function version pointer based on the return value
> +   of the conditional function.   The condition function is assumed to return
> +   values 0, 1, 2, ... */
> +
> +static gimple_seq
> +get_selector_gimple_seq (tree cond_func_arg, tree fn_ver_list, tree 
> default_ver,
> +                      basic_block new_bb, tree bindings)
> +{
> +  basic_block final_bb;
> +
> +  gimple return_stmt, default_stmt;
> +  gimple_seq gseq = NULL;
> +  gimple_seq gseq_final = NULL;
> +  gimple call_cond_stmt;
> +
> +  tree result_var;
> +  tree convert_expr;
> +  tree p;
> +  tree cond_var;
> +
> +  int which_version;
> +
> +  /* Call the condition function once and store the outcome in cond_var.  */
> +  cond_var = create_tmp_var (integer_type_node, NULL);
> +  call_cond_stmt = gimple_build_call (cond_func_arg, 0);
> +  gimple_call_set_lhs (call_cond_stmt, cond_var);
> +  add_referenced_var (cond_var);
> +  mark_symbols_for_renaming (call_cond_stmt);
> +
> +  gimple_seq_add_stmt (&gseq, call_cond_stmt);
> +  gimple_set_block (call_cond_stmt, bindings);
> +  gimple_set_bb (call_cond_stmt, new_bb);
> +
> +  set_bb_seq (new_bb, gseq);
> +
> +  final_bb = new_bb;
> +
> +  which_version = 0; 
> +  for (p = fn_ver_list; p != NULL_TREE; p = TREE_CHAIN (p))
> +    {
> +      tree ver = TREE_PURPOSE (p);
> +      /* Return this version's pointer, VER, if the value returned by the
> +      condition function is equal to WHICH_VERSION.  */
> +      final_bb = make_bb_flow (final_bb, cond_var, ver, which_version,
> +                            bindings);
> +      which_version++;
> +    }
> +
> +  result_var = create_tmp_var (ptr_type_node, NULL);
> +  add_referenced_var (result_var);
> +
> +  /* Return the default version function pointer as the default.  */
> +  convert_expr = build1 (CONVERT_EXPR, ptr_type_node, default_ver);
> +  default_stmt = gimple_build_assign (result_var, convert_expr);
> +  mark_symbols_for_renaming (default_stmt);
> +  gimple_seq_add_stmt (&gseq_final, default_stmt);
> +  gimple_set_block (default_stmt, bindings);
> +  gimple_set_bb (default_stmt, final_bb);
> +
> +  return_stmt = gimple_build_return (result_var);
> +  mark_symbols_for_renaming (return_stmt);
> +  gimple_seq_add_stmt (&gseq_final, return_stmt);
> +  gimple_set_bb (return_stmt, final_bb);
> +
> +  set_bb_seq (final_bb, gseq_final);
> +
> +  return gseq; 
> +}
> +
> +/* Make the ifunc selector function which calls function pointed to by
> +   COND_FUNC_ARG and checks the value to return the appropriate function
> +   version pointer.  */
> +
> +static tree
> +make_selector_function (const char *name, tree cond_func_arg,
> +                     tree fn_ver_list, tree default_ver)
> +{
> +  tree decl, type, t;
> +  basic_block new_bb;
> +  tree old_current_function_decl;
> +  tree decl_name;
> +
> +  /* The selector function should return a (void *). */
> +  type = build_function_type_list (ptr_type_node, NULL_TREE);
> + 
> +  decl = build_fn_decl (name, type);
> +
> +  decl_name = get_identifier (name);
> +  SET_DECL_ASSEMBLER_NAME (decl, decl_name);
> +  DECL_NAME (decl) = decl_name;
> +  gcc_assert (cgraph_node (decl) != NULL);
> +
> +  TREE_USED (decl) = 1;
> +  DECL_ARTIFICIAL (decl) = 1;
> +  DECL_IGNORED_P (decl) = 0;
> +  TREE_PUBLIC (decl) = 0;
> +  DECL_UNINLINABLE (decl) = 1;
> +  DECL_EXTERNAL (decl) = 0;
> +  DECL_CONTEXT (decl) = NULL_TREE;
> +  DECL_INITIAL (decl) = make_node (BLOCK);
> +  DECL_STATIC_CONSTRUCTOR (decl) = 0;
> +  TREE_READONLY (decl) = 0;
> +  DECL_PURE_P (decl) = 0;
> + 
> +  /* Build result decl and add to function_decl. */
> +  t = build_decl (UNKNOWN_LOCATION, RESULT_DECL, NULL_TREE, ptr_type_node);
> +  DECL_ARTIFICIAL (t) = 1;
> +  DECL_IGNORED_P (t) = 1;
> +  DECL_RESULT (decl) = t;
> +
> +  gimplify_function_tree (decl);
> +
> +  old_current_function_decl = current_function_decl;
> +  push_cfun (DECL_STRUCT_FUNCTION (decl));
> +  current_function_decl = decl;
> +  init_empty_tree_cfg_for_function (DECL_STRUCT_FUNCTION (decl));
> +
> +  cfun->curr_properties |=
> +    (PROP_gimple_lcf | PROP_gimple_leh | PROP_cfg | PROP_referenced_vars |
> +     PROP_ssa);
> +
> +  new_bb = create_empty_bb (ENTRY_BLOCK_PTR);
> +  make_edge (ENTRY_BLOCK_PTR, new_bb, EDGE_FALLTHRU);
> +  make_edge (new_bb, EXIT_BLOCK_PTR, 0);
> +
> +  /* This call is very important if this pass runs when the IR is in
> +     SSA form.  It breaks things in strange ways otherwise. */
> +  init_tree_ssa (DECL_STRUCT_FUNCTION (decl));
> +  init_ssa_operands ();
> +
> +  /* Make the body of the selector function.  */
> +  get_selector_gimple_seq (cond_func_arg, fn_ver_list, default_ver, new_bb,
> +                        DECL_INITIAL (decl));
> +
> +  cgraph_add_new_function (decl, true);
> +  cgraph_call_function_insertion_hooks (cgraph_node (decl));
> +  cgraph_mark_needed_node (cgraph_node (decl));
> +
> +  if (dump_file)
> +    dump_function_to_file (decl, dump_file, TDF_BLOCKS);
> +
> +  pop_cfun ();
> +  current_function_decl = old_current_function_decl;
> +  return decl;
> +}
> +
> +/* Makes a function attribute of the form NAME(ARG_NAME) and chains
> +   it to CHAIN.  */
> +
> +static tree
> +make_attribute (const char *name, const char *arg_name, tree chain)
> +{
> +  tree attr_name;
> +  tree attr_arg_name;
> +  tree attr_args;
> +  tree attr;
> +
> +  attr_name = get_identifier (name);
> +  attr_arg_name = build_string (strlen (arg_name), arg_name);
> +  attr_args = tree_cons (NULL_TREE, attr_arg_name, NULL_TREE);
> +  attr = tree_cons (attr_name, attr_args, chain);
> +  return attr;
> +}
> +
> +/* This creates the ifunc function IFUNC_NAME whose selector function is
> +   SELECTOR_NAME. */
> +
> +static tree
> +make_ifunc_function (const char* ifunc_name, const char *selector_name,
> +                  tree fn_type)
> +{
> +  tree type;
> +  tree decl;
> +
> +  /* The signature of the ifunc function is set to the
> +     type of any version.  */
> +  type = build_function_type (TREE_TYPE (fn_type), TYPE_ARG_TYPES (fn_type));
> +  decl = build_fn_decl (ifunc_name, type);
> +
> +  DECL_CONTEXT (decl) = NULL_TREE;
> +  DECL_INITIAL (decl) = error_mark_node;
> +
> +  /* Set ifunc attribute */
> +  DECL_ATTRIBUTES (decl)
> +    = make_attribute ("ifunc", selector_name, DECL_ATTRIBUTES (decl));
> +
> +  assemble_alias (decl, get_identifier (selector_name)); 
> +
> +  return decl;
> +}
> +
> +/* Copy the decl attributes from from_decl to to_decl, except
> +   DECL_ARTIFICIAL and TREE_PUBLIC.  */
> +
> +static void
> +copy_decl_attributes (tree to_decl, tree from_decl)
> +{
> +  TREE_READONLY (to_decl) = TREE_READONLY (from_decl);
> +  TREE_USED (to_decl) = TREE_USED (from_decl);
> +  DECL_ARTIFICIAL (to_decl) = 1;
> +  DECL_IGNORED_P (to_decl) = DECL_IGNORED_P (from_decl);
> +  TREE_PUBLIC (to_decl) = 0;
> +  DECL_CONTEXT (to_decl) = DECL_CONTEXT (from_decl);
> +  DECL_EXTERNAL (to_decl) = DECL_EXTERNAL (from_decl);
> +  DECL_COMDAT (to_decl) = DECL_COMDAT (from_decl);
> +  DECL_COMDAT_GROUP (to_decl) = DECL_COMDAT_GROUP (from_decl);
> +  DECL_VIRTUAL_P (to_decl) = DECL_VIRTUAL_P (from_decl);
> +  DECL_WEAK (to_decl) = DECL_WEAK (from_decl);
> +}
> +
> +/* This function does the multi-version run-time dispatch using IFUNC.  Given
> +   NUM_VERSIONS versions of a function with the decls in FN_VER_LIST along
> +   with a default version in DEFAULT_VER.  Also given is a condition 
> function,
> +   COND_FUNC_ADDR, whose return value decides the version that gets executed.
> +   This function generates the necessary code to dispatch the right function
> +   version and returns this as a GIMPLE_SEQ.  The decls of the ifunc function and
> +   the selector function that are created are stored in IFUNC_DECL and
> +   SELECTOR_DECL.  */
> +
> +static gimple_seq
> +dispatch_using_ifunc (int num_versions, tree orig_func_decl,
> +                   tree cond_func_addr, tree fn_ver_list,
> +                   tree default_ver, tree *selector_decl,
> +                   tree *ifunc_decl)
> +{
> +  char *selector_name;
> +  char *ifunc_name;
> +  tree ifunc_function;
> +  tree selector_function;
> +  tree return_type;
> +  VEC (tree, heap) *nargs = NULL;
> +  tree arg;
> +  gimple ifunc_call_stmt;
> +  gimple return_stmt;
> +  gimple_seq gseq = NULL;
> +
> +  gcc_assert (cond_func_addr != NULL
> +           && num_versions > 0
> +           && orig_func_decl != NULL
> +           && fn_ver_list != NULL);
> +
> +  /* The return type of any function version.  */
> +  return_type = TREE_TYPE (TREE_TYPE (orig_func_decl));
> +
> +  nargs = VEC_alloc (tree, heap, 4);
> +
> +  for (arg = DECL_ARGUMENTS (orig_func_decl);
> +       arg; arg = TREE_CHAIN (arg))
> +    {
> +      VEC_safe_push (tree, heap, nargs, arg);
> +      add_referenced_var (arg);
> +    }
> +
> +  /* Assign names to ifunc and ifunc_selector functions. */
> +  selector_name = make_name (orig_func_decl, "ifunc.selector");
> +  ifunc_name = make_name (orig_func_decl, "ifunc");
> +
> +  /* Make a selector function which returns the appropriate function
> +     version pointer based on the outcome of the condition function
> +     execution.  */
> +  selector_function = make_selector_function (selector_name, cond_func_addr,
> +                                           fn_ver_list, default_ver);
> +  *selector_decl = selector_function;
> +
> +  /* Make a new ifunc function.  */
> +  ifunc_function = make_ifunc_function  (ifunc_name, selector_name,
> +                                      TREE_TYPE (orig_func_decl));
> +  *ifunc_decl = ifunc_function;
> +
> +  /* Make selector and ifunc shadow the attributes of the original function. 
>  */
> +  copy_decl_attributes (ifunc_function, orig_func_decl);
> +  copy_decl_attributes (selector_function, orig_func_decl);
> + 
> +  ifunc_call_stmt = gimple_build_call_vec (ifunc_function, nargs);
> +  gimple_seq_add_stmt (&gseq, ifunc_call_stmt); 
> +
> +  /* Make the function return the value if it is a non-void type.  */
> +  if (TREE_CODE (return_type) != VOID_TYPE)
> +    {
> +      tree lhs_var;
> +      tree lhs_var_ssa_name;
> +      tree result_decl;
> +
> +      result_decl = DECL_RESULT (orig_func_decl);
> +
> +      if (result_decl
> +       && aggregate_value_p (result_decl, orig_func_decl)
> +       && !TREE_ADDRESSABLE (result_decl))
> +     {
> +       /* Build a RESULT_DECL rather than a VAR_DECL for this case.
> +          See tree-nrv.c: tree_nrv. It checks if the DECL_RESULT and the
> +          return value are the same.  */
> +       lhs_var = build_decl (UNKNOWN_LOCATION, RESULT_DECL, NULL,
> +                             return_type);
> +       DECL_ARTIFICIAL (lhs_var) = 1;
> +       DECL_IGNORED_P (lhs_var) = 1;
> +       TREE_READONLY (lhs_var) = 0;
> +       DECL_EXTERNAL (lhs_var) = 0;
> +       TREE_STATIC (lhs_var) = 0;
> +       TREE_USED (lhs_var) = 1;
> +
> +          add_referenced_var (lhs_var);
> +          DECL_RESULT (orig_func_decl) = lhs_var;
> +     }
> +     else if (!TREE_ADDRESSABLE (return_type)
> +              && COMPLETE_TYPE_P (return_type))
> +        {
> +          lhs_var = create_tmp_var (return_type, NULL);
> +          add_referenced_var (lhs_var);
> +        }
> +      else
> +     {
> +          lhs_var = create_tmp_var_raw (return_type, NULL);
> +       TREE_ADDRESSABLE (lhs_var) = 1;
> +       gimple_add_tmp_var (lhs_var);
> +          add_referenced_var (lhs_var);
> +     }
> +
> +      if (AGGREGATE_TYPE_P (return_type)
> +       || TREE_CODE (return_type) == COMPLEX_TYPE)
> +        {
> +          gimple_call_set_lhs (ifunc_call_stmt, lhs_var);
> +          return_stmt = gimple_build_return (lhs_var);
> +     }
> +      else
> +     {
> +       lhs_var_ssa_name = make_ssa_name (lhs_var, ifunc_call_stmt);
> +       gimple_call_set_lhs (ifunc_call_stmt, lhs_var_ssa_name);
> +       return_stmt = gimple_build_return (lhs_var_ssa_name);
> +     }
> +    }
> +  else
> +    {
> +      return_stmt = gimple_build_return (NULL_TREE);
> +    }
> +
> +  mark_symbols_for_renaming (ifunc_call_stmt);
> +  mark_symbols_for_renaming (return_stmt);
> +  gimple_seq_add_stmt (&gseq, return_stmt); 
> +
> +  VEC_free (tree, heap, nargs);
> +  return gseq;
> +}
> +
> +/* Empty the function body of function fndecl.  Retain just one basic block
> +   along with the ENTRY and EXIT block.  Return the retained basic block.  */
> +
> +static basic_block
> +purge_function_body (tree fndecl)
> +{
> +  basic_block bb, new_bb;
> +  edge first_edge, last_edge;
> +  tree old_current_function_decl;
> +
> +  old_current_function_decl = current_function_decl;
> +  push_cfun (DECL_STRUCT_FUNCTION (fndecl));
> +  current_function_decl = fndecl;
> +
> +  /* Set new_bb to be the first block after ENTRY_BLOCK_PTR. */
> +
> +  first_edge  = VEC_index (edge, ENTRY_BLOCK_PTR->succs, 0);
> +  new_bb = first_edge->dest;
> +  gcc_assert (new_bb != NULL);
> +
> +  for (bb = ENTRY_BLOCK_PTR; bb != NULL;)
> +    {
> +      edge_iterator ei;
> +      edge e;
> +      basic_block bb_next;
> +      bb_next = bb->next_bb;
> +      if (bb == EXIT_BLOCK_PTR)
> +        {
> +       VEC_truncate (edge, EXIT_BLOCK_PTR->preds, 0);
> +        }
> +      else if (bb == ENTRY_BLOCK_PTR)
> +     {
> +       VEC_truncate (edge, ENTRY_BLOCK_PTR->succs, 0);
> +     }
> +      else
> +        {
> +          remove_phi_nodes (bb);
> +          if (bb_seq (bb) != NULL)
> +            {
> +              gimple_stmt_iterator i;
> +              for (i = gsi_start_bb (bb); !gsi_end_p (i);)
> +             {
> +               gimple stmt = gsi_stmt (i);
> +               unlink_stmt_vdef (stmt);
> +               reset_debug_uses (stmt);
> +                  gsi_remove (&i, true);
> +               release_defs (stmt);
> +             }
> +            }
> +       FOR_EACH_EDGE (e, ei, bb->succs)
> +         {
> +           n_edges--;
> +           ggc_free (e);
> +         }
> +       VEC_truncate (edge, bb->succs, 0);
> +       VEC_truncate (edge, bb->preds, 0);
> +          bb->prev_bb = NULL;
> +          bb->next_bb = NULL;
> +       if (bb == new_bb)
> +         {
> +           bb = bb_next;
> +           continue;
> +         }
> +          bb->il.gimple = NULL;
> +          SET_BASIC_BLOCK (bb->index, NULL);
> +          n_basic_blocks--;
> +        }
> +      bb = bb_next;
> +    }
> +
> +
> +  /* This is to allow iterating over the basic blocks. */
> +  new_bb->next_bb = EXIT_BLOCK_PTR;
> +  EXIT_BLOCK_PTR->prev_bb = new_bb;
> +
> +  new_bb->prev_bb = ENTRY_BLOCK_PTR;
> +  ENTRY_BLOCK_PTR->next_bb = new_bb;
> +
> +  gcc_assert (find_edge (new_bb, EXIT_BLOCK_PTR) == NULL);
> +  last_edge = make_edge (new_bb, EXIT_BLOCK_PTR, 0);
> +  gcc_assert (last_edge);
> +
> +  gcc_assert (find_edge (ENTRY_BLOCK_PTR, new_bb) == NULL);
> +  last_edge = make_edge (ENTRY_BLOCK_PTR, new_bb, EDGE_FALLTHRU);
> +  gcc_assert (last_edge);
> +
> +  free_dominance_info (CDI_DOMINATORS);
> +  free_dominance_info (CDI_POST_DOMINATORS);
> +  calculate_dominance_info (CDI_DOMINATORS); 
> +  calculate_dominance_info (CDI_POST_DOMINATORS);
> +
> +  current_function_decl = old_current_function_decl;
> +  pop_cfun ();
> +
> +  return new_bb;
> +}
> +
> +/* Returns true if function FUNC_DECL contains abnormal goto statements.  */
> +
> +static bool
> +function_can_make_abnormal_goto (tree func_decl)
> +{
> +  basic_block bb;
> +  FOR_EACH_BB_FN (bb, DECL_STRUCT_FUNCTION (func_decl))
> +    {
> +      gimple_stmt_iterator gsi;
> +      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +        {
> +       gimple stmt = gsi_stmt (gsi);
> +       if (stmt_can_make_abnormal_goto (stmt))
> +         return true;
> +     }
> +    }
> +  return false;
> +}
> +
> +/* Has an entry for every cloned function and auxiliaries that have been
> +   generated by auto cloning.  These cannot be further cloned.  */
> +
> +htab_t cloned_function_decls_htab = NULL;
> +
> +/* Adds function FUNC_DECL to the cloned_function_decls_htab.  */
> +
> +static void
> +mark_function_not_cloneable (tree func_decl)
> +{
> +  void **slot;
> +
> +  slot = htab_find_slot_with_hash (cloned_function_decls_htab, func_decl,
> +                                htab_hash_pointer (func_decl), INSERT);
> +  gcc_assert (*slot == NULL);
> +  *slot = func_decl;
> +}
> +
> +/* Entry point for the auto clone pass.  Calls the target hook to
> +   determine if this function must be cloned.  */
> +
> +static unsigned int
> +do_auto_clone (void)
> +{
> +  tree opt_node = NULL_TREE;
> +  int num_versions = 0;
> +  int i = 0;
> +  tree fn_ver_addr_chain = NULL_TREE;
> +  tree default_ver = NULL_TREE;
> +  tree cond_func_decl = NULL_TREE;
> +  tree cond_func_addr;
> +  tree default_decl;
> +  basic_block empty_bb;
> +  gimple_seq gseq = NULL;
> +  gimple_stmt_iterator gsi;
> +  tree selector_decl;
> +  tree ifunc_decl;
> +  void **slot;
> +  struct cgraph_node *node;
> +
> +  node = cgraph_node (current_function_decl);
> +
> +  if (lookup_attribute ("noclone", DECL_ATTRIBUTES (current_function_decl))
> +      != NULL)
> +    {
> +      if (dump_file)
> +        fprintf (dump_file, "Not cloning, noclone attribute set\n");
> +      return 0;
> +    }
> +
> +  /* Check if function size is within permissible limits for cloning.  */
> +  if (node->global.size
> +      > PARAM_VALUE (PARAM_MAX_FUNCTION_SIZE_FOR_AUTO_CLONING))
> +    {
> +      if (dump_file)
> +        fprintf (dump_file,
> +                 "Function size exceeds auto cloning threshold.\n");
> +      return 0;
> +    }
> +
> +  if (cloned_function_decls_htab == NULL)
> +    cloned_function_decls_htab = htab_create (10, htab_hash_pointer,
> +                                           htab_eq_pointer, NULL);
> +
> +
> +  /* If this function is a clone or an auxiliary, like the selector
> +     function, skip it.  */
> +  slot = htab_find_slot_with_hash (cloned_function_decls_htab,
> +                                current_function_decl,
> +                                htab_hash_pointer (current_function_decl),
> +                                INSERT);
> +
> +  if (*slot != NULL)
> +    return 0;
> +
> +  if (profile_status == PROFILE_READ
> +      && !hot_function_p (cgraph_node (current_function_decl)))
> +    return 0;
> +
> +  /* Ignore functions with abnormal gotos; it is not correct to clone
> +     them.  */
> +  if (function_can_make_abnormal_goto (current_function_decl))
> +    return 0;
> +
> +  if (!targetm.mversion_function)
> +    return 0;
> +
> +  /* Call the target hook to see if this function needs to be versioned.  */
> +  num_versions = targetm.mversion_function (current_function_decl, &opt_node,
> +                                         &cond_func_decl);
> +      
> +  /* Nothing more to do if versions are not to be created.  */
> +  if (num_versions == 0)
> +    return 0;
> +
> +  mark_function_not_cloneable (cond_func_decl);
> +  copy_decl_attributes (cond_func_decl, current_function_decl);
> +
> +  /* Make as many clones as requested.  */
> +  for (i = 0; i < num_versions; ++i)
> +    {
> +      tree cloned_decl;
> +      char clone_name[100];
> +
> +      sprintf (clone_name, "autoclone.%d", i);
> +      cloned_decl = clone_function (current_function_decl, clone_name);
> +      gcc_assert (cloned_decl != NULL);
> +      fn_ver_addr_chain = tree_cons (build_fold_addr_expr (cloned_decl),
> +                                     NULL_TREE, fn_ver_addr_chain);
> +      mark_function_not_cloneable (cloned_decl);
> +      DECL_FUNCTION_SPECIFIC_TARGET (cloned_decl)
> +        = TREE_PURPOSE (opt_node);
> +      opt_node = TREE_CHAIN (opt_node);
> +    }
> +
> +  /* The current function is replaced by an ifunc call to the right version.
> +     Make another clone for the default.  */
> +  default_decl = clone_function (current_function_decl,
> +                                 "autoclone.original");
> +  mark_function_not_cloneable (default_decl);
> +  /* Empty the body of the current function.  */
> +  empty_bb = purge_function_body (current_function_decl);
> +  default_ver = build_fold_addr_expr (default_decl);
> +  cond_func_addr = build_fold_addr_expr (cond_func_decl);
> +
> +  /* Get the gimple sequence to replace the current function's body with
> +     an ifunc dispatch call to the right version.  */
> +  gseq = dispatch_using_ifunc (num_versions, current_function_decl,
> +                            cond_func_addr, fn_ver_addr_chain,
> +                            default_ver, &selector_decl, &ifunc_decl);
> +
> +  mark_function_not_cloneable (selector_decl);
> +  mark_function_not_cloneable (ifunc_decl);
> +
> +  for (gsi = gsi_start (gseq); !gsi_end_p (gsi); gsi_next (&gsi))
> +    gimple_set_bb (gsi_stmt (gsi), empty_bb);
> +   
> +  set_bb_seq (empty_bb, gseq);
> +
> +  if (dump_file)
> +    dump_function_to_file (current_function_decl, dump_file, TDF_BLOCKS);
> +
> +  update_ssa (TODO_update_ssa_no_phi);
> +  
> +  return 0;
> +}
> +
> +static bool
> +gate_auto_clone (void)
> +{
> +  /* Turned on at -O2 and above.  */
> +  return optimize >= 2;
> +}
> +
> +struct gimple_opt_pass pass_auto_clone =
> +{
> + {
> +  GIMPLE_PASS,
> +  "auto_clone",                              /* name */
> +  gate_auto_clone,                   /* gate */
> +  do_auto_clone,                     /* execute */
> +  NULL,                                      /* sub */
> +  NULL,                                      /* next */
> +  0,                                 /* static_pass_number */
> +  TV_MVERSN_DISPATCH,                        /* tv_id */
> +  PROP_cfg,                          /* properties_required */
> +  PROP_cfg,                          /* properties_provided */
> +  0,                                 /* properties_destroyed */
> +  0,                                 /* todo_flags_start */
> +  TODO_dump_func |                   /* todo_flags_finish */
> +  TODO_cleanup_cfg | TODO_dump_cgraph |
> +  TODO_update_ssa | TODO_verify_ssa
> + }
> +};
> Index: passes.c
> ===================================================================
> --- passes.c  (revision 182355)
> +++ passes.c  (working copy)
> @@ -1278,6 +1278,7 @@ init_optimization_passes (void)
>    /* These passes are run after IPA passes on every function that is being
>       output to the assembler file.  */
>    p = &all_passes;
> +  NEXT_PASS (pass_auto_clone);
>    NEXT_PASS (pass_direct_call_profile);
>    NEXT_PASS (pass_lower_eh_dispatch);
>    NEXT_PASS (pass_all_optimizations);
> Index: config/i386/i386.opt
> ===================================================================
> --- config/i386/i386.opt      (revision 182355)
> +++ config/i386/i386.opt      (working copy)
> @@ -101,6 +101,10 @@ march=
>  Target RejectNegative Joined Var(ix86_arch_string)
>  Generate code for given CPU
>  
> +mvarch=
> +Target RejectNegative Joined Var(ix86_mv_arch_string)
> +Multiversion for the given CPU(s)
> +
>  masm=
>  Target RejectNegative Joined Var(ix86_asm_string)
>  Use given assembler dialect
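
Just to make the intended use concrete for readers of the thread: with this
option definition, an invocation would presumably look like the following
(hypothetical command line; it requires this patch applied to GCC, and the
source file name is a placeholder):

```shell
# Ask the compiler to auto-clone hot, vectorizable functions, tuned for
# core2 and corei7, with a raised size limit for cloning candidates.
gcc -O2 -ftree-vectorize -mvarch=core2,corei7 \
    --param autoclone-function-size-limit=1000 \
    hot_loops.c -o hot_loops
```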
> Index: config/i386/i386.c
> ===================================================================
> --- config/i386/i386.c        (revision 182355)
> +++ config/i386/i386.c        (working copy)
> @@ -60,7 +60,11 @@ along with GCC; see the file COPYING3.  If not see
>  #include "fibheap.h"
>  #include "tree-flow.h"
>  #include "tree-pass.h"
> +#include "tree-dump.h"
> +#include "gimple-pretty-print.h"
>  #include "cfgloop.h"
> +#include "tree-scalar-evolution.h"
> +#include "tree-vectorizer.h"
>  
>  enum upper_128bits_state
>  {
> @@ -2353,6 +2357,8 @@ enum processor_type ix86_tune;
>  /* Which instruction set architecture to use.  */
>  enum processor_type ix86_arch;
>  
> +char ix86_varch[PROCESSOR_max];
> +
>  /* true if sse prefetch instruction is not NOOP.  */
>  int x86_prefetch_sse;
>  
> @@ -2492,6 +2498,7 @@ static enum calling_abi ix86_function_abi (const_t
>  /* Whether -mtune= or -march= were specified */
>  static int ix86_tune_defaulted;
>  static int ix86_arch_specified;
> +static int ix86_varch_specified;
>  
>  /* A mask of ix86_isa_flags that includes bit X if X
>     was set or cleared on the command line.  */
> @@ -4316,6 +4323,36 @@ ix86_option_override_internal (bool main_args_p)
>        /* Disable vzeroupper pass if TARGET_AVX is disabled.  */
>        target_flags &= ~MASK_VZEROUPPER;
>      }
> +
> +  /* Handle ix86_mv_arch_string.  The values allowed are the same as
> +     -march=<>.  More than one value is allowed and values must be
> +     comma separated.  */
> +  if (ix86_mv_arch_string)
> +    {
> +      char *token;
> +      char *varch;
> +      int i;
> +
> +      ix86_varch_specified = 1;
> +      memset (ix86_varch, 0, sizeof (ix86_varch));
> +      token = XNEWVEC (char, strlen (ix86_mv_arch_string) + 1);
> +      strcpy (token, ix86_mv_arch_string);
> +      varch = strtok (token, ",");
> +      while (varch != NULL)
> +        {
> +          for (i = 0; i < pta_size; i++)
> +            if (!strcmp (varch, processor_alias_table[i].name))
> +              {
> +                ix86_varch[processor_alias_table[i].processor] = 1;
> +                break;
> +              }
> +          if (i == pta_size)
> +            error ("bad value (%s) for %svarch=%s %s",
> +                   varch, prefix, suffix, sw);
> +          varch = strtok (NULL, ",");
> +        }
> +      free (token);
> +    }
>  }
>  
>  /* Return TRUE if VAL is passed in register with 256bit AVX modes.  */
> @@ -26120,6 +26157,489 @@ ix86_fold_builtin (tree fndecl, int n_args ATTRIBU
>    return NULL_TREE;
>  }
>  
> +/* This adds a condition to the basic block NEW_BB in function
> +   FUNCTION_DECL to return integer VERSION_NUM if the outcome of the
> +   function PREDICATE_DECL is true (or false if INVERT_CHECK is true).
> +   This function will be called during version dispatch to decide which
> +   function version to execute.  */
> +
> +static basic_block
> +add_condition_to_bb (tree function_decl, int version_num,
> +                  basic_block new_bb, tree predicate_decl,
> +                  bool invert_check)
> +{
> +  gimple return_stmt;
> +  gimple call_cond_stmt;
> +  gimple if_else_stmt;
> +
> +  basic_block bb1, bb2, bb3;
> +  edge e12, e23;
> +
> +  tree cond_var;
> +  gimple_seq gseq;
> +
> +  tree old_current_function_decl;
> +
> +  old_current_function_decl = current_function_decl;
> +  push_cfun (DECL_STRUCT_FUNCTION (function_decl));
> +  current_function_decl = function_decl;
> +
> +  gcc_assert (new_bb != NULL);
> +  gseq = bb_seq (new_bb);
> +
> +  if (predicate_decl == NULL_TREE)
> +    {
> +      return_stmt = gimple_build_return (build_int_cst (NULL, version_num));
> +      gimple_seq_add_stmt (&gseq, return_stmt);
> +      set_bb_seq (new_bb, gseq);
> +      gimple_set_bb (return_stmt, new_bb);
> +      pop_cfun ();
> +      current_function_decl = old_current_function_decl;
> +      return new_bb;
> +    }
> +
> +  cond_var = create_tmp_var (integer_type_node, NULL);
> +  call_cond_stmt = gimple_build_call (predicate_decl, 0);
> +  gimple_call_set_lhs (call_cond_stmt, cond_var);
> +  add_referenced_var (cond_var);
> +  mark_symbols_for_renaming (call_cond_stmt); 
> +
> +  gimple_set_block (call_cond_stmt, DECL_INITIAL (function_decl));
> +  gimple_set_bb (call_cond_stmt, new_bb);
> +  gimple_seq_add_stmt (&gseq, call_cond_stmt);
> +
> +  if (!invert_check)
> +    if_else_stmt = gimple_build_cond (GT_EXPR, cond_var,
> +                                   integer_zero_node,
> +                                   NULL_TREE, NULL_TREE);
> +  else
> +    if_else_stmt = gimple_build_cond (LE_EXPR, cond_var,
> +                                   integer_zero_node,
> +                                   NULL_TREE, NULL_TREE);
> +
> +  mark_symbols_for_renaming (if_else_stmt);
> +  gimple_set_block (if_else_stmt, DECL_INITIAL (function_decl));
> +  gimple_set_bb (if_else_stmt, new_bb);
> +  gimple_seq_add_stmt (&gseq, if_else_stmt);
> +
> +  return_stmt = gimple_build_return (build_int_cst (NULL, version_num));
> +  gimple_seq_add_stmt (&gseq, return_stmt);
> +
> + 
> +  set_bb_seq (new_bb, gseq);
> +
> +  bb1 = new_bb;
> +  e12 = split_block (bb1, if_else_stmt);
> +  bb2 = e12->dest;
> +  e12->flags &= ~EDGE_FALLTHRU;
> +  e12->flags |= EDGE_TRUE_VALUE;
> +
> +  e23 = split_block (bb2, return_stmt);
> +  gimple_set_bb (return_stmt, bb2);
> +  bb3 = e23->dest;
> +  make_edge (bb1, bb3, EDGE_FALSE_VALUE); 
> +
> +  remove_edge (e23);
> +  make_edge (bb2, EXIT_BLOCK_PTR, 0);
> +
> +  free_dominance_info (CDI_DOMINATORS);
> +  free_dominance_info (CDI_POST_DOMINATORS);
> +  calculate_dominance_info (CDI_DOMINATORS);
> +  calculate_dominance_info (CDI_POST_DOMINATORS);
> +  rebuild_cgraph_edges ();
> +  update_ssa (TODO_update_ssa);
> +  if (dump_file)
> +    dump_function_to_file (current_function_decl, dump_file, TDF_BLOCKS);
> +
> +  pop_cfun ();
> +  current_function_decl = old_current_function_decl;
> +
> +  return bb3;
> +}
> +
> +/* This makes an empty function with a single empty basic block,
> +   returned in *CREATED_BB, apart from the ENTRY and EXIT blocks.  */
> +
> +static tree
> +make_empty_function (basic_block *created_bb)
> +{
> +  tree decl, type, t;
> +  basic_block new_bb;
> +  tree old_current_function_decl;
> +  tree decl_name;
> +  char name[1000];
> +  static int num = 0;
> +
> +  /* The condition function should return an integer. */
> +  type = build_function_type_list (integer_type_node, NULL_TREE);
> + 
> +  sprintf (name, "cond_%d", num);
> +  num++;
> +  decl = build_fn_decl (name, type);
> +
> +  decl_name = get_identifier (name);
> +  SET_DECL_ASSEMBLER_NAME (decl, decl_name);
> +  DECL_NAME (decl) = decl_name;
> +  gcc_assert (cgraph_node (decl) != NULL);
> +
> +  TREE_USED (decl) = 1;
> +  DECL_ARTIFICIAL (decl) = 1;
> +  DECL_IGNORED_P (decl) = 0;
> +  TREE_PUBLIC (decl) = 0;
> +  DECL_UNINLINABLE (decl) = 1;
> +  DECL_EXTERNAL (decl) = 0;
> +  DECL_CONTEXT (decl) = NULL_TREE;
> +  DECL_INITIAL (decl) = make_node (BLOCK);
> +  DECL_STATIC_CONSTRUCTOR (decl) = 0;
> +  TREE_READONLY (decl) = 0;
> +  DECL_PURE_P (decl) = 0;
> + 
> +  /* Build the result decl and add it to the function decl.  The result
> +     type must match the function type's integer return type.  */
> +  t = build_decl (UNKNOWN_LOCATION, RESULT_DECL, NULL_TREE,
> +                  integer_type_node);
> +  DECL_ARTIFICIAL (t) = 1;
> +  DECL_IGNORED_P (t) = 1;
> +  DECL_RESULT (decl) = t;
> +
> +  gimplify_function_tree (decl);
> +
> +  old_current_function_decl = current_function_decl;
> +  push_cfun (DECL_STRUCT_FUNCTION (decl));
> +  current_function_decl = decl;
> +  init_empty_tree_cfg_for_function (DECL_STRUCT_FUNCTION (decl));
> +
> +  cfun->curr_properties |=
> +    (PROP_gimple_lcf | PROP_gimple_leh | PROP_cfg | PROP_referenced_vars |
> +     PROP_ssa);
> +
> +  new_bb = create_empty_bb (ENTRY_BLOCK_PTR);
> +  make_edge (ENTRY_BLOCK_PTR, new_bb, EDGE_FALLTHRU);
> +  make_edge (new_bb, EXIT_BLOCK_PTR, 0);
> +
> +  /* This call is very important if this pass runs when the IR is in
> +     SSA form.  It breaks things in strange ways otherwise. */
> +  init_tree_ssa (DECL_STRUCT_FUNCTION (decl));
> +  init_ssa_operands ();
> +
> +  cgraph_add_new_function (decl, true);
> +  cgraph_call_function_insertion_hooks (cgraph_node (decl));
> +  cgraph_mark_needed_node (cgraph_node (decl));
> +
> +  if (dump_file)
> +    dump_function_to_file (decl, dump_file, TDF_BLOCKS);
> +
> +  pop_cfun ();
> +  current_function_decl = old_current_function_decl;
> +  *created_bb = new_bb;
> +  return decl; 
> +}
> +
> +/* This function conservatively checks if loop LOOP is tree vectorizable.
> +   The code is adapted from tree-vectorizer.c and tree-vect-stmts.c.  */
> +
> +static bool
> +is_loop_form_vectorizable (struct loop *loop)
> +{
> +  /* Innermost loops should have exactly 2 basic blocks.  */
> +  if (!loop->inner)
> +    {
> +      /* This is an innermost loop.  */
> +      if (loop->num_nodes != 2)
> +        return false;
> +      /* Empty loop.  */
> +      if (empty_block_p (loop->header))
> +        return false;
> +    }
> +  else
> +    {
> +      /* Bail if there are multiple nested loops.  */
> +      if ((loop->inner)->inner || (loop->inner)->next)
> +        return false;
> +      /* Recursive call for the inner loop.  */
> +      if (!is_loop_form_vectorizable (loop->inner))
> +        return false;
> +      if (loop->num_nodes != 5)
> +        return false;
> +      /* The loop has 0 iterations.  */
> +      if (TREE_INT_CST_LOW (number_of_latch_executions (loop)) == 0)
> +        return false;
> +    }
> +
> +  return true;
> +}
> +
> +/* This function checks if there is at least one vectorizable
> +   load/store in loop LOOP.  Code adapted from tree-vect-stmts.c.  */
> +
> +static bool
> +is_loop_stmts_vectorizable (struct loop *loop)
> +{
> +  basic_block *body;
> +  unsigned int i;
> +  bool vect_load_store = false;
> +
> +  body = get_loop_body (loop);
> +
> +  for (i = 0; i < loop->num_nodes; i++)
> +    {
> +      gimple_stmt_iterator gsi;
> +      for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); gsi_next (&gsi))
> +        {
> +          gimple stmt = gsi_stmt (gsi);
> +          enum gimple_code code = gimple_code (stmt);
> +
> +          if (gimple_has_volatile_ops (stmt))
> +            {
> +              free (body);
> +              return false;
> +            }
> +
> +          /* Does it have a vectorizable store or load in a hot bb?  */
> +          if (code == GIMPLE_ASSIGN)
> +            {
> +              enum tree_code lhs_code = TREE_CODE (gimple_assign_lhs (stmt));
> +              enum tree_code rhs_code = gimple_assign_rhs_code (stmt);
> +
> +              /* Only look at hot vectorizable loads/stores.  */
> +              if (profile_status == PROFILE_READ
> +                  && !maybe_hot_bb_p (body[i]))
> +                continue;
> +
> +              if (lhs_code == ARRAY_REF
> +                  || lhs_code == INDIRECT_REF
> +                  || lhs_code == COMPONENT_REF
> +                  || lhs_code == IMAGPART_EXPR
> +                  || lhs_code == REALPART_EXPR
> +                  || lhs_code == MEM_REF)
> +                vect_load_store = true;
> +              else if (rhs_code == ARRAY_REF
> +                       || rhs_code == INDIRECT_REF
> +                       || rhs_code == COMPONENT_REF
> +                       || rhs_code == IMAGPART_EXPR
> +                       || rhs_code == REALPART_EXPR
> +                       || rhs_code == MEM_REF)
> +                vect_load_store = true;
> +            }
> +        }
> +    }
> +
> +  /* Free the array allocated by get_loop_body before returning.  */
> +  free (body);
> +  return vect_load_store;
> +}
> +
> +/* This function checks if there are any vectorizable loops present
> +   in CURRENT_FUNCTION_DECL.  This function is called before the
> +   loop optimization passes and is therefore very conservative in
> +   checking for vectorizable loops.  Also, not all of the checks used
> +   in the vectorizer pass can be used here, since many loop
> +   optimizations that could change the loop structure and the stmts
> +   have not yet occurred.
> +
> +   The conditions for a loop being vectorizable are adapted from
> +   tree-vectorizer.c, tree-vect-stmts.c. */
> +
> +static bool
> +any_loops_vectorizable_with_load_store (void)
> +{
> +  unsigned int vect_loops_num;
> +  loop_iterator li;
> +  struct loop *loop;
> +  bool vectorizable_loop_found = false;
> +
> +  loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS);
> +
> +  vect_loops_num = number_of_loops ();
> +
> +  /* Bail out if there are no loops.  */
> +  if (vect_loops_num <= 1)
> +    {
> +      loop_optimizer_finalize ();
> +      return false;
> +    }
> +
> +  scev_initialize ();
> +
> +  /* This is iterating over all loops.  */
> +  FOR_EACH_LOOP (li, loop, 0)
> +    if (optimize_loop_nest_for_speed_p (loop))
> +      {
> +        if (!is_loop_form_vectorizable (loop))
> +          continue;
> +        if (!is_loop_stmts_vectorizable (loop))
> +          continue;
> +        vectorizable_loop_found = true;
> +        break;
> +      }
> +
> +
> +  loop_optimizer_finalize ();
> +  scev_finalize ();
> +
> +  return vectorizable_loop_found;
> +}
> +
> +/* This makes the condition function that chooses which version of the
> +   function to execute.  It should look like this:
> +
> +   int cond_i ()
> +   {
> +      __builtin_cpu_init (); // Get the cpu type.
> +      a =  __builtin_cpu_is_<type1> ();
> +      if (a)
> +        return 1; // first version created.
> +      a =  __builtin_cpu_is_<type2> ();
> +      if (a)
> +        return 2; // second version created.
> +      ...
> +      return 0; // the default version.
> +   }
> +
> +   NEW_BB is the new last basic block of this function and to which more
> +   conditions can be added.  It is updated by this function.  */
> +
> +static tree
> +make_condition_function (basic_block *new_bb)
> +{
> +  gimple ifunc_cpu_init_stmt;
> +  gimple_seq gseq;
> +  tree cond_func_decl;
> +  tree old_current_function_decl;
> + 
> +
> +  cond_func_decl = make_empty_function (new_bb);
> +
> +  old_current_function_decl = current_function_decl;
> +  push_cfun (DECL_STRUCT_FUNCTION (cond_func_decl));
> +  current_function_decl = cond_func_decl;
> +
> +  gseq = bb_seq (*new_bb);
> +
> +  /* Since this is possibly dispatched with IFUNC, call builtin_cpu_init
> +     explicitly, as the constructor will only fire after IFUNC
> +     initializers. */
> +  ifunc_cpu_init_stmt = gimple_build_call_vec (
> +                     ix86_builtins [(int) IX86_BUILTIN_CPU_INIT], NULL);
> +  gimple_seq_add_stmt (&gseq, ifunc_cpu_init_stmt);
> +  gimple_set_bb (ifunc_cpu_init_stmt, *new_bb);
> +  set_bb_seq (*new_bb, gseq);
> +      
> +  pop_cfun ();
> +  current_function_decl = old_current_function_decl;
> +  return cond_func_decl;
> +}
> +
> +/* Create a new target optimization node with tune set to ARCH_TUNE.  */
> +
> +static tree
> +create_mtune_target_opt_node (const char *arch_tune)
> +{
> +  struct cl_target_option target_options;
> +  const char *old_tune_string;
> +  tree optimization_node;
> +  
> +  /* Build an optimization node that is the same as the current one
> +     except with "tune=arch_tune".  */
> +  cl_target_option_save (&target_options, &global_options);
> +  old_tune_string = ix86_tune_string;
> +
> +  ix86_tune_string = arch_tune;
> +  ix86_option_override_internal (false);
> +
> +  optimization_node = build_target_option_node ();
> +
> +  ix86_tune_string = old_tune_string;
> +  cl_target_option_restore (&global_options, &target_options);
> +
> +  return optimization_node;
> +}
> +
> +/* Should a version of this function be specially optimized for core2?
> +
> +   This function should have checks to see if there are any opportunities for
> +   core2 specific optimizations, otherwise do not create a clone.  The
> +   following opportunities are checked.
> +   
> +   * Check if this function has vectorizable loads/stores as it is known that
> +     unaligned 128-bit movs to/from memory (movdqu) are very expensive on
> +     core2 whereas the later generations like corei7 have no additional
> +     overhead.
> +
> +     This versioning is triggered only when -ftree-vectorize is turned on
> +     and when multi-versioning for core2 is requested using -mvarch=core2.
> +
> +   Return false if no versioning is required.  Return true if a version
> +   must be created.  Generate the *OPTIMIZATION_NODE that must be used
> +   to optimize the newly created version, that is, tag "tune=core2" on
> +   the new version.  */
> +
> +static bool
> +mversion_for_core2 (tree *optimization_node,
> +                 tree *cond_func_decl, basic_block *new_bb)
> +{
> +  tree predicate_decl;
> +  bool is_mversion_target_core2 = false;
> +  bool create_version = false;
> +
> +  if (ix86_varch_specified
> +      && ix86_varch[PROCESSOR_CORE2_64])
> +    is_mversion_target_core2 = true;
> +
> +  /* Check for criteria to create a new version for core2.  */
> +
> +  /* If -ftree-vectorize is not enabled or MV for core2 is not
> +     requested, bail.  */
> +  if (flag_tree_vectorize && is_mversion_target_core2)
> +    {
> +      /* Check if there is at least one loop that has a vectorizable
> +         load/store.  These are the ones that can generate the
> +         unaligned mov which is known to be very slow on core2.  */
> +      if (any_loops_vectorizable_with_load_store ())
> +        create_version = true;
> +    }
> +  /* else if XXX: Add more criteria to version for core2.  */
> +
> +  if (!create_version)
> +    return false;
> +
> +  /* If the condition function's body has not been created, create it
> +     now.  */
> +  if (*cond_func_decl == NULL)
> +    *cond_func_decl = make_condition_function (new_bb);
> +
> +  *optimization_node = create_mtune_target_opt_node ("core2");
> +
> +  predicate_decl = ix86_builtins [(int) IX86_BUILTIN_CPU_IS_INTEL_CORE2];
> +  *new_bb = add_condition_to_bb (*cond_func_decl, 0, *new_bb,
> +                              predicate_decl, false);
> +  return true;
> +}
> +
> +/* Decide if this function, CURRENT_FUNCTION_DECL, should be
> +   multi-versioned; if so, return the number of versions to be created
> +   (other than the original).  The outcome of COND_FUNC_DECL will
> +   decide the version to be executed.  OPTIMIZATION_NODE_CHAIN has a
> +   unique node for each version to be created.  */
> +
> +static int
> +ix86_mversion_function (tree fndecl ATTRIBUTE_UNUSED,
> +                     tree *optimization_node_chain,
> +                     tree *cond_func_decl)
> +{
> +  basic_block new_bb;
> +  tree optimization_node;
> +  int num_versions_created = 0;
> +
> +  if (ix86_mv_arch_string == NULL)
> +    return 0;
> +
> +  if (mversion_for_core2 (&optimization_node, cond_func_decl, &new_bb))
> +    num_versions_created++;
> +
> +  if (!num_versions_created)
> +    return 0;
> +
> +  *optimization_node_chain = tree_cons (optimization_node,
> +                                     NULL_TREE, *optimization_node_chain);
> +
> +  /* Make the last stmt in COND_FUNC_DECL return the default version's
> +     number.  */
> +  if (*cond_func_decl != NULL)
> +    new_bb = add_condition_to_bb (*cond_func_decl, num_versions_created,
> +                                new_bb, NULL_TREE, false);
> +
> +  return num_versions_created;
> +}
> +
>  /* A builtin to init/return the cpu type or feature.  Returns an
>     integer and the type is a const if IS_CONST is set. */
>  
> @@ -35608,6 +36128,9 @@ ix86_loop_unroll_adjust (unsigned nunroll, struct
>  #undef TARGET_FOLD_BUILTIN
>  #define TARGET_FOLD_BUILTIN ix86_fold_builtin
>  
> +#undef TARGET_MVERSION_FUNCTION
> +#define TARGET_MVERSION_FUNCTION ix86_mversion_function
> +
>  #undef TARGET_SLOW_UNALIGNED_VECTOR_MEMOP
>  #define TARGET_SLOW_UNALIGNED_VECTOR_MEMOP ix86_slow_unaligned_vector_memop
>  
> Index: params.def
> ===================================================================
> --- params.def        (revision 182355)
> +++ params.def        (working copy)
> @@ -1037,6 +1037,11 @@ DEFPARAM (PARAM_PMU_PROFILE_N_ADDRESS,
>         "While doing PMU profiling symbolize this many top addresses.",
>         50, 1, 10000)
>  
> +DEFPARAM (PARAM_MAX_FUNCTION_SIZE_FOR_AUTO_CLONING,
> +       "autoclone-function-size-limit",
> +       "Do not auto clone functions beyond this size.",
> +       450, 0, 100000)
> +
>  /*
>  Local variables:
>  mode:c
> 
> --
> This patch is available for review at http://codereview.appspot.com/5490054

