Re: [PATCH] Fix -fcompare-debug issue in cross-jumping (PR rtl-optimization/65980)

2015-12-16 Thread Eric Botcazou
> Given the ill-formed nature of __builtin_unreachable, I think the
> sensible thing is to require BARRIERs to match as well.  If we don't,
> then we could consider paths which differ only in the existence of a
> BARRIER.  If we cross jump them, then we either lose the
> BARRIER/__builtin_unreachable property or we add it where it wasn't before.

All right, although the counter-argument could be that __builtin_unreachable 
shouldn't prevent an optimization that would have happened without it.  Then I 
think that a new predicate should be added to emit-rtl.c with the head comment 
explaining this and invoked from rtx_renumbered_equal_p.

-- 
Eric Botcazou


[PING][Patch 3/3][Arm] Add support for IEEE-conformant versions of scalar fmin* and fmax*

2015-12-16 Thread David Sherwood
Hi,

Here is the last patch of the fmin/fmax change, which adds the optabs
to the arm backend.

Tested:

arm-none-eabi: no regressions

Good to go?
David Sherwood.

ChangeLog:

2015-12-08  David Sherwood  

    gcc/
    * config/arm/iterators.md: New iterators.
    * config/arm/unspecs.md: New unspecs.
    * config/arm/neon.md: New pattern.
    * config/arm/vfp.md: Likewise.
    gcc/testsuite/
    * gcc.target/arm/fmaxmin.c: New test.



fmaxmin3.patch
Description: fmaxmin3.patch


Re: [PATCHES, PING*5] Enhance standard DWARF for Ada

2015-12-16 Thread Pierre-Marie de Rodat

On 12/11/2015 09:25 PM, Jason Merrill wrote:

Hmm, can we generate the DWARF procedures during finalize_size_functions
to avoid the need for preserve_body?


Good idea, thank you! Here’s the updated patch (bootstrapped and 
regtested on x86_64-linux, as usual).


--
Pierre-Marie de Rodat
>From 38576de649614743646bb052f5570dc1973a804a Mon Sep 17 00:00:00 2001
From: Pierre-Marie de Rodat 
Date: Thu, 3 Jul 2014 14:16:09 +0200
Subject: [PATCH 2/8] DWARF: handle variable-length records and variant parts

Enhance the DWARF back-end to emit proper descriptions for
variable-length records as well as variant parts in records.

In order to achieve this, generate DWARF expressions ("location
descriptions" in dwarf2out's parlance) for size and data member location
attributes.  Also match QUAL_UNION_TYPE data types as variant parts,
assuming the former appear only to implement the latter (which is the
case at the moment: only the Ada front-end emits them).

Note that very few debuggers can handle these descriptions (GDB does not
yet), so in order to ease the transition enable these only when 
-fgnat-encodings=minimal.

gcc/ada/ChangeLog:

	* gcc-interface/decl.c (gnat_to_gnu_entity): Disable ___XVS GNAT
	encodings when -fgnat-encodings=minimal.
	(components_to_record): Disable ___XVE, ___XVN, ___XVU and
	___XVZ GNAT encodings when -fgnat-encodings=minimal.
	* gcc-interface/utils.c (maybe_pad_type): Disable __XVS GNAT
	encodings when -fgnat-encodings=minimal.

gcc/ChangeLog:

	* debug.h (struct gcc_debug_hooks): Add a new function_body
	hook.
	* debug.c (do_nothing_debug_hooks): Set the function_body field
	to no-op.
	* dbxout.c (dbx_debug_hooks, xcoff_debug_hooks): Likewise.
	* sdbout.c (sdb_debug_hooks): Likewise.
	* vmsdbgout.c (vmsdbg_debug_hooks): Likewise.
	* stor-layout.c (finalize_size_functions): Let the debug info
	back-end know about the implementation of size functions.
	* dwarf2out.h (dw_discr_list_ref): New typedef.
	(enum dw_val_class): Add value classes for discriminant values
	and discriminant lists.
	(struct dw_discr_value): New structure.
	(struct dw_val_node): Add discriminant values and discriminant
	lists to the union.
	(struct dw_loc_descr_node): Add frame_offset_rel and
	dw_loc_frame_offset (only for checking) fields to handle DWARF
	procedures generation.
	(struct dw_discr_list_node): New structure.
	* dwarf2out.c (dwarf2out_function_body): New.
	(dwarf2_debug_hooks): Set the function_body field to
	dwarf2out_function_body.
	(dwarf2_lineno_debug_hooks): Set the function_body field to
	no-op.
	(new_loc_descr): Initialize the
	dw_loc_frame_offset field.
	(dwarf_proc_stack_usage_map): New.
	(dw_val_equal_p): Handle discriminants.
	(size_of_discr_value): New.
	(size_of_discr_list): New.
	(size_of_die): Handle discriminants.
	(add_loc_descr_to_each): New.
	(add_loc_list): New.
	(print_discr_value): New.
	(print_dw_val): Handle discriminants.
	(value_format): Handle discriminants.
	(output_discr_value): New.
	(output_die): Handle discriminants.
	(output_loc_operands): Handle DW_OP_call2 and DW_OP_call4.
	(uint_loc_descriptor): New.
	(uint_comparison_loc_list): New.
	(loc_list_from_uint_comparison): New.
	(add_discr_value): New.
	(add_discr_list): New.
	(AT_discr_list): New.
	(loc_descr_to_next_no_op): New.
	(free_loc_descr): New.
	(loc_descr_without_nops): New.
	(struct loc_descr_context): Add a dpi field.
	(struct dwarf_procedure_info): New helper structure.
	(new_dwarf_proc_die): New.
	(is_handled_procedure_type): New.
	(resolve_args_picking_1): New.
	(resolve_args_picking): New.
	(function_to_dwarf_procedure): New.
	(copy_dwarf_procedure): New.
	(copy_dwarf_procs_ref_in_attrs): New.
	(copy_dwarf_procs_ref_in_dies): New.
	(break_out_comdat_types): Copy DWARF procedures along with the
	types that reference them.
	(loc_list_from_tree): Rename into loc_list_from_tree_1.  Handle
	CALL_EXPR in the cases suitable for DWARF procedures.  Handle
	for PARM_DECL when generating a location description for a DWARF
	procedure.  Handle big unsigned INTEGER_CST nodes.  Handle
	NON_LVALUE_EXPR, EXACT_DIV_EXPR and all unsigned comparison
	operators.  Add a wrapper for loc_list_from_tree that strips
	DW_OP_nop operations from the result.
	(type_byte_size): New.
	(struct vlr_context): New helper structure.
	(field_byte_offset): Change signature to return either a
	constant offset or a location description for dynamic ones.
	Handle dynamic byte offsets with constant bit offsets and handle
	fields in variant parts.
	(add_data_member_location): Change signature to handle dynamic
	member offsets and fields in variant parts.  Update call to
	field_byte_offset.  Handle location lists.  Emit a variable data
	member location only when -fgnat-encodings=minimal.
	(add_bound_info): Emit self-referential bounds only when
	-fgnat-encodings=minimal.
	(add_byte_size_attribute): Use type_byte_size in order to handle
	dynamic type sizes.  Emit variable byte size only when
	-fgnat-encodings=minimal and when the target DWARF versio

[PATCH, ARM] Fix gcc.c-torture/execute/loop-2b.c execution failure on cortex-m0

2015-12-16 Thread Thomas Preud'homme
During the reorg pass, thumb1_reorg () rewrites mov rd, rn into subs rd, rn, 0
so that a comparison-against-0 instruction can be avoided before a conditional
branch based on it. The actual elision of the cmp is done in the C output
template of the cbranchsi4_insn instruction. When the condition is met, the
source register (rn) is also propagated into the comparison in place of the
destination register (rd).

However, right now thumb1_reorg () only looks for a mov followed by a
cbranchsi but does not check whether the comparison in cbranchsi is against
the constant 0. This is not safe, because a non-clobbering instruction that
modifies the source register could exist between the mov and the comparison.
This is what happens here with a post-increment of the source register after
the mov, which skips the &a[i] == &a[1] comparison for iteration i == 1.

This patch fixes the issue by checking that the comparison is against constant 
0.

ChangeLog entry is as follows:


*** gcc/ChangeLog ***

2015-12-07  Thomas Preud'homme  

* config/arm/arm.c (thumb1_reorg): Check that the comparison is
against the constant 0.


diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 42bf272..49c0a06 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -17195,7 +17195,7 @@ thumb1_reorg (void)
   FOR_EACH_BB_FN (bb, cfun)
 {
   rtx dest, src;
-  rtx pat, op0, set = NULL;
+  rtx cmp, op0, op1, set = NULL;
   rtx_insn *prev, *insn = BB_END (bb);
   bool insn_clobbered = false;
 
@@ -17208,8 +17208,13 @@ thumb1_reorg (void)
continue;
 
   /* Get the register with which we are comparing.  */
-  pat = PATTERN (insn);
-  op0 = XEXP (XEXP (SET_SRC (pat), 0), 0);
+  cmp = XEXP (SET_SRC (PATTERN (insn)), 0);
+  op0 = XEXP (cmp, 0);
+  op1 = XEXP (cmp, 1);
+
+  /* Check that comparison is against ZERO.  */
+  if (!CONST_INT_P (op1) || INTVAL (op1) != 0)
+   continue;
 
   /* Find the first flag setting insn before INSN in basic block BB.  */
   gcc_assert (insn != BB_HEAD (bb));
@@ -17249,7 +17254,7 @@ thumb1_reorg (void)
  PATTERN (prev) = gen_rtx_SET (dest, src);
  INSN_CODE (prev) = -1;
  /* Set test register in INSN to dest.  */
- XEXP (XEXP (SET_SRC (pat), 0), 0) = copy_rtx (dest);
+ XEXP (cmp, 0) = copy_rtx (dest);
  INSN_CODE (insn) = -1;
}
 }


Testsuite shows no regression when run for arm-none-eabi with -mcpu=cortex-m0 
-mthumb

Is this ok for trunk?

Best regards,

Thomas



Re: ipa-cp heuristics fixes

2015-12-16 Thread Dominik Vogt
On Thu, Dec 10, 2015 at 08:30:37AM +0100, Jan Hubicka wrote:
>   * ipa-cp.c (ipcp_cloning_candidate_p): Use node->optimize_for_size_p.
>   (good_cloning_opportunity_p): Likewise.
>   (gather_context_independent_values): Do not return true when
>   polymorphic call context is known or when we have known aggregate
>   value of unused parameter.
>   (estimate_local_effects): Try to create clone for all context
>   when either some params are substituted or devirtualization is possible
>   or some params can be removed; use local flag instead of
>   node->will_be_removed_from_program_if_no_direct_calls_p.
>   (identify_dead_nodes): Likewise.

This commit breaks several guality tests on S/390x:

+FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 16 arg3 == 3
+FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 16 arg4 == 4
+FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 16 arg5 == 5
+FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 16 arg6 == 6
+FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 16 y == 2
+FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 18 arg3 == 3
+FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 18 arg4 == 4
+FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 18 arg5 == 5
+FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 18 arg6 == 6
+FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 18 y == 2
+FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 16 arg3 == 3
+FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 16 arg4 == 4
+FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 16 arg5 == 5
+FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 16 arg6 == 6
+FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 16 y == 2
+FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 18 arg3 == 3
+FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 18 arg4 == 4
+FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 18 arg5 == 5
+FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 18 arg6 == 6
+FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 18 y == 2
 FAIL: gcc.dg/guality/pr36728-2.c   -O2  line 18 *x == (char) 25
 FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fno-use-linker-plugin 
-flto-partition=none  line 18 *x == (char) 25
+FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 16 arg3 == 3
+FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 16 arg4 == 4
+FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 16 arg5 == 5
+FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 16 arg6 == 6
+FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 16 arg7 == 30
+FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 16 y == 2
 FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 18 *x == (char) 25
+FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 18 arg3 == 3
+FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 18 arg4 == 4
+FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 18 arg5 == 5
+FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 18 arg6 == 6
+FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 18 arg7 == 30
+FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 18 y == 2
+FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 16 arg3 == 3
+FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 16 arg4 == 4
+FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 16 arg5 == 5
+FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 16 arg6 == 6
+FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 16 arg7 == 30
+FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 16 y == 2
 FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 18 *x == (char) 25
+FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 18 arg3 == 3
+FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 18 arg4 == 4
+FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 18 arg5 == 5
+FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 18 arg6 == 6
+FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 18 arg7 == 30
+FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 18 y == 2
 FAIL: gcc.dg/guality/pr36728-2.c   -Os  line 18 *x == (char) 25
+FAIL: gcc.dg/guality/pr36728-3.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  line 14 arg3 == 3
+FAIL: gcc.dg/guality/pr367

Re: [PING][Patch 3/3][Arm] Add support for IEEE-conformant versions of scalar fmin* and fmax*

2015-12-16 Thread Kyrill Tkachov

Hi David,

On 16/12/15 08:53, David Sherwood wrote:

Hi,

Here is the last patch of the fmin/fmax change, which adds the optabs
to the arm backend.

Tested:

arm-none-eabi: no regressions

Good to go?
David Sherwood.

ChangeLog:

2015-12-08  David Sherwood  

 gcc/
 * config/arm/iterators.md: New iterators.
 * config/arm/unspecs.md: New unspecs.
 * config/arm/neon.md: New pattern.
 * config/arm/vfp.md: Likewise.


Please list the new entities you add in parentheses.
For example:
* config/arm/iterators.md (VMAXMINFNM): New int iterator.
(fmaxmin): New int attribute.
(fmaxmin): Likewise.

Same for the other files. That way one can grep through the ChangeLogs to
find when any particular pattern/iterator/whatever was modified.

+;; Auto-vectorized forms for the IEEE-754 fmax()/fmin() functions
+(define_insn "3"
+  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+   (unspec:VCVTF [(match_operand:VCVTF 1 "s_register_operand" "w")
+  (match_operand:VCVTF 2 "s_register_operand" "w")]
+  VMAXMINFNM))]
+  "TARGET_NEON && TARGET_FPU_ARMV8"
+  ".\t%0, %1, %2"
+  [(set_attr "type" "neon_fp_minmax_s")]
+)

I would just say "Vector forms" rather than "Auto-vectorized".
In principle we can get vector types through other means besides
auto-vectorisation.

+;; Scalar forms for the IEEE-754 fmax()/fmin() functions
+(define_insn "3"
+  [(set (match_operand:SDF 0 "s_register_operand" "=")
+   (unspec:SDF [(match_operand:SDF 1 "s_register_operand" "")
+(match_operand:SDF 2 "s_register_operand" 
"")]
+VMAXMINFNM))]
+  "TARGET_HARD_FLOAT && TARGET_VFP5 "
+  ".\\t%0, %1, %2"
+  [(set_attr "type" "f_minmax")
+   (set_attr "conds" "unconditional")]
+)
+

I notice your new test doesn't test the SF variant of this.
Could you please add something to test it?
Looks good to me otherwise.

Thanks,
Kyrill



Re: [PATCH][AArch64] Add vector permute cost

2015-12-16 Thread James Greenhalgh
On Tue, Dec 15, 2015 at 11:35:45AM +, Wilco Dijkstra wrote:
> 
> Add support for vector permute cost since various permutes can expand into a 
> complex
> sequence of instructions.  This fixes major performance regressions due to 
> recent changes
> in the SLP vectorizer (which now vectorizes more aggressively and emits many 
> complex 
> permutes).
> 
> Set the cost to > 1 for all microarchitectures so that the number of permutes 
> is usually zero
> and regressions disappear.  An example of the kind of code that might be 
> emitted for
> VEC_PERM_EXPR {0, 3} where registers happen to be in the wrong order:
> 
> adrpx4, .LC16
> ldr q5, [x4, #:lo12:.LC16]
> eor v1.16b, v1.16b, v0.16b
> eor v0.16b, v1.16b, v0.16b
> eor v1.16b, v1.16b, v0.16b
> tbl v0.16b, {v0.16b - v1.16b}, v5.16b
> 
> Regress passes. This fixes regressions that were introduced recently, so OK 
> for commit?
> 
> 
> ChangeLog:
> 2015-12-15  Wilco Dijkstra  
> 
>   * gcc/config/aarch64/aarch64.c (generic_vector_cost):
>   Set vec_permute_cost.
>   (cortexa57_vector_cost): Likewise.
>   (exynosm1_vector_cost): Likewise.
>   (xgene1_vector_cost): Likewise.
>   (aarch64_builtin_vectorization_cost): Use vec_permute_cost.
>   * gcc/config/aarch64/aarch64-protos.h (cpu_vector_cost):
>   Add vec_permute_cost entry.
> 
> 
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 
> 10754c88c0973d8ef3c847195b727f02b193bbd8..2584f16d345b3d015d577dd28c08a73ee3e0b0fb
>  100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -314,6 +314,7 @@ static const struct cpu_vector_cost generic_vector_cost =
>1, /* scalar_load_cost  */
>1, /* scalar_store_cost  */
>1, /* vec_stmt_cost  */
> +  2, /* vec_permute_cost  */
>1, /* vec_to_scalar_cost  */
>1, /* scalar_to_vec_cost  */
>1, /* vec_align_load_cost  */

Is there any reasoning behind making this 2? Do we now miss vectorization
for some of the cheaper permutes? Across the cost models/pipeline
descriptions that have been contributed to GCC I think that this is a
sensible change to the generic costs, but I just want to check there
was some reasoning/experimentation behind the number you picked.

As permutes can have such wildly different costs, this all seems like a good
candidate for some future much more involved hook from the vectorizer to the
back-end specifying the candidate permute operation and requesting a cost
(part of the bigger gimple costs framework?).

Thanks,
James



Re: [PATCH 2/2] [graphite] update required isl versions

2015-12-16 Thread Richard Biener
On Mon, Dec 14, 2015 at 10:48 PM, Sebastian Pop  wrote:
> we now check the isl version, as there are no real differences in the
> existing files between isl 0.14 and isl 0.15.

Looks good to me.

Richard.

> ---
>  config/isl.m4| 29 +++
>  configure| 23 +--
>  gcc/config.in| 12 
>  gcc/configure| 61 
> ++--
>  gcc/configure.ac | 23 ---
>  gcc/graphite-isl-ast-to-gimple.c | 10 ++-
>  gcc/graphite-optimize-isl.c  | 30 ++--
>  gcc/graphite-poly.c  |  8 --
>  gcc/graphite-sese-to-poly.c  |  8 --
>  gcc/graphite.h   |  1 +
>  10 files changed, 39 insertions(+), 166 deletions(-)
>
> diff --git a/config/isl.m4 b/config/isl.m4
> index 459fac1..7387ff2 100644
> --- a/config/isl.m4
> +++ b/config/isl.m4
> @@ -19,23 +19,23 @@
>
>  # ISL_INIT_FLAGS ()
>  # -
> -# Provide configure switches for ISL support.
> +# Provide configure switches for isl support.
>  # Initialize isllibs/islinc according to the user input.
>  AC_DEFUN([ISL_INIT_FLAGS],
>  [
>AC_ARG_WITH([isl-include],
>  [AS_HELP_STRING(
>[--with-isl-include=PATH],
> -  [Specify directory for installed ISL include files])])
> +  [Specify directory for installed isl include files])])
>AC_ARG_WITH([isl-lib],
>  [AS_HELP_STRING(
>[--with-isl-lib=PATH],
> -  [Specify the directory for the installed ISL library])])
> +  [Specify the directory for the installed isl library])])
>
>AC_ARG_ENABLE(isl-version-check,
>  [AS_HELP_STRING(
>[--disable-isl-version-check],
> -  [disable check for ISL version])],
> +  [disable check for isl version])],
>  ENABLE_ISL_CHECK=$enableval,
>  ENABLE_ISL_CHECK=yes)
>
> @@ -58,15 +58,15 @@ AC_DEFUN([ISL_INIT_FLAGS],
>if test "x${with_isl_lib}" != x; then
>  isllibs="-L$with_isl_lib"
>fi
> -  dnl If no --with-isl flag was specified and there is in-tree ISL
> +  dnl If no --with-isl flag was specified and there is in-tree isl
>dnl source, set up flags to use that and skip any version tests
> -  dnl as we cannot run them before building ISL.
> +  dnl as we cannot run them before building isl.
>if test "x${islinc}" = x && test "x${isllibs}" = x \
>   && test -d ${srcdir}/isl; then
>  isllibs='-L$$r/$(HOST_SUBDIR)/isl/'"$lt_cv_objdir"' '
>  islinc='-I$$r/$(HOST_SUBDIR)/isl/include -I$$s/isl/include'
>  ENABLE_ISL_CHECK=no
> -AC_MSG_WARN([using in-tree ISL, disabling version check])
> +AC_MSG_WARN([using in-tree isl, disabling version check])
>fi
>
>isllibs="${isllibs} -lisl"
> @@ -75,7 +75,7 @@ AC_DEFUN([ISL_INIT_FLAGS],
>
>  # ISL_REQUESTED (ACTION-IF-REQUESTED, ACTION-IF-NOT)
>  # 
> -# Provide actions for failed ISL detection.
> +# Provide actions for failed isl detection.
>  AC_DEFUN([ISL_REQUESTED],
>  [
>AC_REQUIRE([ISL_INIT_FLAGS])
> @@ -106,12 +106,17 @@ AC_DEFUN([ISL_CHECK_VERSION],
>  LDFLAGS="${_isl_saved_LDFLAGS} ${isllibs}"
>  LIBS="${_isl_saved_LIBS} -lisl"
>
> -AC_MSG_CHECKING([for compatible ISL])
> -AC_LINK_IFELSE([AC_LANG_PROGRAM([[#include ]], [[;]])],
> -   [gcc_cv_isl=yes],
> -   [gcc_cv_isl=no])
> +AC_MSG_CHECKING([for isl 0.15 (or deprecated 0.14)])
> +AC_TRY_LINK([#include ],
> +[isl_ctx_get_max_operations (isl_ctx_alloc ());],
> +[gcc_cv_isl=yes],
> +[gcc_cv_isl=no])
>  AC_MSG_RESULT([$gcc_cv_isl])
>
> +if test "${gcc_cv_isl}" = no ; then
> +  AC_MSG_RESULT([recommended isl version is 0.15, minimum required isl 
> version 0.14 is deprecated])
> +fi
> +
>  CFLAGS=$_isl_saved_CFLAGS
>  LDFLAGS=$_isl_saved_LDFLAGS
>  LIBS=$_isl_saved_LIBS
> diff --git a/configure b/configure
> index 090615f..a6495c4 100755
> --- a/configure
> +++ b/configure
> @@ -1492,7 +1492,7 @@ Optional Features:
>build static libjava [default=no]
>--enable-bootstrap  enable bootstrapping [yes if native build]
>--disable-isl-version-check
> -  disable check for ISL version
> +  disable check for isl version
>--enable-ltoenable link time optimization support
>--enable-linker-plugin-configure-flags=FLAGS
>additional flags for configuring linker plugins
> @@ -1553,8 +1553,8 @@ Optional Packages:
>package. Equivalent to
>--with-isl-include=PATH/include plus
>--with-isl-lib=PATH/lib
> -  --with-isl-include=PATH Specify directory for installed ISL include files
> -  --with-isl-lib=PATH Specify the directory for the installed ISL library
> +  --with-isl-include=PATH Specify di

Re: [PATCH][LTO,ARM] Fix vector TYPE_MODE in streaming-out

2015-12-16 Thread Richard Biener
On Tue, Dec 15, 2015 at 5:14 PM, Bernd Schmidt  wrote:
> On 12/15/2015 04:09 PM, Christian Bruel wrote:
>>
>> in "normal" mode, the TYPE_MODE for vector_type __simd64_int8_t is set
>> to V8QImode by arm_vector_mode_supported_p during the builtins type
>> initializations, thanks to TARGET_NEON set by the global flag.
>>
>> Now, in LTO mode the streamer writes the information for this
>> vector_type as a scalar DImode, causing ICEs during arm_expand_builtin.
>> The root cause of this is that the streamer-out uses TYPE_MODE in a
>> context where the target_flags are not known, so TARGET_NEON is false.
>>
>> The streamer-in will then read the wrong mode, which propagates to
>> the back-end.
>
>
>>  static void
>>  pack_ts_type_common_value_fields (struct bitpack_d *bp, tree expr)
>>  {
>> -  bp_pack_machine_mode (bp, TYPE_MODE (expr));
>> +  bp_pack_machine_mode (bp, expr->type_common.mode);
>
>
> This looks sensible given that tree-streamer-in uses SET_TYPE_MODE, which
> just writes expr->type_common.mode.
>
> Make a new macro TYPE_MODE_RAW for this and I think the patch is ok
> (although there's precedent for direct access in vector_type_mode, but I
> think that's just bad).

Yeah, though it's well-hidden ;)

I think the patch is ok if you add a comment why we're not using TYPE_MODE here
and if the patch passes the x86_64 vectorizer testsuite with -m32 -march=i586
with no regressions (I do expect some FAILs with -march=i586 but the
patch shouldn't
regress anything).

Thanks,
Richard.


>
> Bernd


Re: [PATCH][AArch64] Add TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS

2015-12-16 Thread James Greenhalgh
On Tue, Dec 15, 2015 at 10:54:49AM +, Wilco Dijkstra wrote:
> ping
> 
> > -Original Message-
> > From: Wilco Dijkstra [mailto:wilco.dijks...@arm.com]
> > Sent: 06 November 2015 20:06
> > To: 'gcc-patches@gcc.gnu.org'
> > Subject: [PATCH][AArch64] Add TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS
> > 
> > This patch adds support for the TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS
> > hook. When the cost of GENERAL_REGS and FP_REGS is identical, the register
> > allocator always uses ALL_REGS even when it has a much higher cost. The
> > hook changes the class to either FP_REGS or GENERAL_REGS depending on the
> > mode of the register. This results in better register allocation overall,
> > fewer spills and reduced codesize - particularly in SPEC2006 gamess.
> > 
> > GCC regression passes with several minor fixes.
> > 
> > OK for commit?
> > 
> > ChangeLog:
> > 2015-11-06  Wilco Dijkstra  
> > 
> > * gcc/config/aarch64/aarch64.c
> > (TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS): New define.
> > (aarch64_ira_change_pseudo_allocno_class): New function.
> > * gcc/testsuite/gcc.target/aarch64/cvtf_1.c: Build with -O2.
> > * gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c
> > (test_corners_sisd_di): Improve force to SIMD register.
> > (test_corners_sisd_si): Likewise.
> > * gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c: Build with -O2.
> > * gcc/testsuite/gcc.target/aarch64/vect-ld1r-compile-fp.c:
> > Remove scan-assembler check for ldr.

Drop the gcc/ from the ChangeLog.

> > --
> >  gcc/config/aarch64/aarch64.c   | 22 
> > ++
> >  gcc/testsuite/gcc.target/aarch64/cvtf_1.c  |  2 +-
> >  gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c  |  4 ++--
> >  gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c |  2 +-
> >  .../gcc.target/aarch64/vect-ld1r-compile-fp.c  |  1 -

These testsuite changes concern me a bit, and you don't mention them beyond
saying they are minor fixes...

> > diff --git a/gcc/testsuite/gcc.target/aarch64/cvtf_1.c 
> > b/gcc/testsuite/gcc.target/aarch64/cvtf_1.c
> > index 5f2ff81..96501db 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/cvtf_1.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/cvtf_1.c
> > @@ -1,5 +1,5 @@
> >  /* { dg-do run } */
> > -/* { dg-options "-save-temps -fno-inline -O1" } */
> > +/* { dg-options "-save-temps -fno-inline -O2" } */

This one says we have a code-gen regression at -O1 ?

> > 
> >  #define FCVTDEF(ftype,itype) \
> >  void \
> > diff --git a/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c 
> > b/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c
> > index 363f554..8465c89 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c
> > @@ -186,9 +186,9 @@ test_corners_sisd_di (Int64x1 b)
> >  {
> >force_simd_di (b);
> >b = b >> 63;
> > +  force_simd_di (b);
> >b = b >> 0;
> >b += b >> 65; /* { dg-warning "right shift count >= width of type" } */
> > -  force_simd_di (b);

This one I don't understand, but seems to say that we've decided to move
b out of FP_REGS after getting it in there for b = b << 63; ? So this is
another register allocator regression?

> > diff --git a/gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c 
> > b/gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c
> > index a49db3e..c5a9c52 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c
> > @@ -1,6 +1,6 @@
> >  /* Test vdup_lane intrinsics work correctly.  */
> >  /* { dg-do run } */
> > -/* { dg-options "-O1 --save-temps" } */
> > +/* { dg-options "-O2 --save-temps" } */

Another -O1 regression ?

> > 
> >  #include 
> > 
> > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-ld1r-compile-fp.c 
> > b/gcc/testsuite/gcc.target/aarch64/vect-ld1r-compile-fp.c
> > index 66e0168..4711c61 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/vect-ld1r-compile-fp.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/vect-ld1r-compile-fp.c
> > @@ -8,6 +8,5 @@ DEF (float)
> >  DEF (double)
> > 
> >  /* { dg-final { scan-assembler "ld1r\\t\{v\[0-9\]+\.4s"} } */
> > -/* { dg-final { scan-assembler "ldr\\t\x\[0-9\]+"} } */
> >  /* { dg-final { scan-assembler "ld1r\\t\{v\[0-9\]+\.2d"} } */

This one is fine, I don't really understand what it was hoping to catch
in the first place!

Could you go in to some detail about why your testsuite changes are correct?

Thanks,
James



Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

2015-12-16 Thread Richard Biener
On Wed, Dec 16, 2015 at 8:43 AM, Ajit Kumar Agarwal
 wrote:
> Hello Jeff:
>
> Here is more of a data you have asked for.
>
> SPEC FP benchmarks.
> a) No Path Splitting + tracer enabled
> Geomean Score =  4749.726.
> b) Path Splitting enabled + tracer enabled.
> Geomean Score =  4781.655.
>
> Conclusion: With both Path Splitting and tracer enabled we got maximum gains. 
> I think we need to have Path Splitting pass.
>
> SPEC INT benchmarks.
> a) Path Splitting enabled + tracer not enabled.
> Geomean Score =  3745.193.
> b) No Path Splitting + tracer enabled.
> Geomean Score = 3738.558.
> c) Path Splitting enabled + tracer enabled.
> Geomean Score = 3742.833.

I suppose with SPEC you mean SPEC CPU 2006?

Can you disclose the architecture you did the measurements on and the
compile flags you used otherwise?

Note that tracer does a very good job only when paired with FDO so can
you re-run SPEC with FDO and
compare with path-splitting enabled on top of that?

Thanks,
Richard.

> Conclusion: We are getting more gains with Path Splitting as compared to 
> tracer. With both Path Splitting and tracer enabled we are also getting  
> gains.
> I think we should have Path Splitting pass.
>
> One more observation: Richard's concern is the creation of multiple exits
> when splitting paths through duplication. My observation is that the tracer
> pass also creates multiple exits through duplication. Considering the gains
> we are getting from splitting paths (more PRE, CSE and DCE), I don't think
> that is a practical issue.
>
> Thanks & Regards
> Ajit
>
>
>
>
> -Original Message-
> From: Jeff Law [mailto:l...@redhat.com]
> Sent: Wednesday, December 16, 2015 5:20 AM
> To: Richard Biener
> Cc: Ajit Kumar Agarwal; GCC Patches; Vinod Kathail; Shail Aditya Gupta; 
> Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree 
> ssa representation
>
> On 12/11/2015 03:05 AM, Richard Biener wrote:
>> On Thu, Dec 10, 2015 at 9:08 PM, Jeff Law  wrote:
>>> On 12/03/2015 07:38 AM, Richard Biener wrote:

 This pass is now enabled by default with -Os but has no limits on
 the amount of stmts it copies.
>>>
>>> The more statements it copies, the more likely it is that the path
>>> spitting will turn out to be useful!  It's counter-intuitive.
>>
>> Well, it's still not appropriate for -Os (nor -O2 I think).  -ftracer
>> is enabled with -fprofile-use (but it is also properly driven to only
>> trace hot paths) and otherwise not by default at any optimization level.
> Definitely not appropriate for -Os.  But as I mentioned, I really want to 
> look at the tracer code as it may totally subsume path splitting.
>
>>
>> Don't see how this would work for the CFG pattern it operates on
>> unless you duplicate the exit condition into that new block creating
>> an even more obfuscated CFG.
> Agreed, I don't see any way to fix the multiple exit problem.  Then again, 
> this all runs after the tree loop optimizer, so I'm not sure how big of an 
> issue it is in practice.
>
>
>>> It was only after I approved this code after twiddling it for Ajit
>>> that I came across Honza's tracer implementation, which may in fact
>>> be retargettable to these loops and do a better job.  I haven't
>>> experimented with that.
>>
>> Well, I originally suggested to merge this with the tracer pass...
> I missed that, or it didn't sink into my brain.
>
>>> Again, the more statements it copies the more likely it is to be profitable.
>>> Think superblocks to expose CSE, DCE and the like.
>>
>> Ok, so similar to tracer (where I think the main benefit is actually
>> increasing scheduling opportunities for architectures where it matters).
> Right.  They're both building superblocks, which has the effect of larger 
> windows for scheduling, DCE, CSE, etc.
>
>
>>
>> Note that both passes are placed quite late and thus won't see much
>> of the GIMPLE optimizations (DOM mainly).  I wonder why they were
>> not placed adjacent to each other.
> Ajit had it fairly early, but that didn't play well with if-conversion.
>   I just pushed it past if-conversion and vectorization, but before the
> last DOM pass.  That turns out to be where tracer lives too as you noted.
>
>>>
>>> I wouldn't lose any sleep if we disabled by default or removed, particularly
>>> if we can repurpose Honza's code.  In fact, I might strongly support the
>>> former until we hear back from Ajit on performance data.
>>
>> See above for what we do with -ftracer.  path-splitting should at _least_
>> restrict itself to operate on optimize_loop_for_speed_p () loops.
> I think we need to decide if we want the code at all, particularly given
> the multiple-exit problem.
>
> The difficulty is I think Ajit posted some recent data that shows it's
> helping.  So maybe the thing to do is ask Ajit to try the tracer
> independent of path splitting and take the obvious actions based on
> Ajit's data.

Re: ipa-cp heuristics fixes

2015-12-16 Thread Richard Biener
On Wed, Dec 16, 2015 at 10:15 AM, Dominik Vogt  wrote:
> On Thu, Dec 10, 2015 at 08:30:37AM +0100, Jan Hubicka wrote:
>>   * ipa-cp.c (ipcp_cloning_candidate_p): Use node->optimize_for_size_p.
>>   (good_cloning_opportunity_p): Likewise.
>>   (gather_context_independent_values): Do not return true when
>>   polymorphic call context is known or when we have known aggregate
>>   value of unused parameter.
>>   (estimate_local_effects): Try to create clone for all context
>>   when either some params are substituted or devirtualization is possible
>>   or some params can be removed; use local flag instead of
>>   node->will_be_removed_from_program_if_no_direct_calls_p.
>>   (identify_dead_nodes): Likewise.
>
> This commit breaks several guality tests on S/390x:
>
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 arg3 == 3
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 arg4 == 4
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 arg5 == 5
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 arg6 == 6
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 y == 2
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 arg3 == 3
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 arg4 == 4
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 arg5 == 5
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 arg6 == 6
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 y == 2
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 16 arg3 == 3
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 16 arg4 == 4
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 16 arg5 == 5
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 16 arg6 == 6
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 16 y == 2
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 18 arg3 == 3
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 18 arg4 == 4
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 18 arg5 == 5
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 18 arg6 == 6
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 18 y == 2
>  FAIL: gcc.dg/guality/pr36728-2.c   -O2  line 18 *x == (char) 25
>  FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fno-use-linker-plugin 
> -flto-partition=none  line 18 *x == (char) 25
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 arg3 == 3
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 arg4 == 4
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 arg5 == 5
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 arg6 == 6
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 arg7 == 30
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 y == 2
>  FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 *x == (char) 25
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 arg3 == 3
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 arg4 == 4
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 arg5 == 5
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 arg6 == 6
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 arg7 == 30
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 y == 2
> +FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 16 arg3 == 3
> +FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 16 arg4 == 4
> +FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 16 arg5 == 5
> +FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 16 arg6 == 6
> +FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 16 arg7 == 30
> +FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 16 y == 2
>  FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 18 *x == (char) 25
> +FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 18 arg3 == 3
> +FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 18 arg4 == 4
> +FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 18 arg5 == 5
> +FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 18 arg6 == 6
> +FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 18 arg7 == 30
+FAIL: gcc.dg/guality/pr36728-2.c   -O3 -g  line 18 y == 2

Re: [PATCH][ARM] PR target/68648: Fold NOT of CONST_INT in andsi_iorsi3_notsi splitter

2015-12-16 Thread Kyrill Tkachov

Hi Ramana,

On 15/12/15 21:20, Ramana Radhakrishnan wrote:

On 07/12/15 10:39, Kyrill Tkachov wrote:

Hi all,

In this PR we ICE because during post-reload splitting we generate the insn:
(insn 27 26 11 2 (set (reg:SI 0 r0 [orig:121 D.4992 ] [121])
 (and:SI (not:SI (const_int 1 [0x1]))
 (reg:SI 0 r0 [orig:121 D.4992 ] [121])))
  (nil))


The splitter at fault is andsi_iorsi3_notsi that accepts a const_int in 
operands[3]
and outputs (not (match_dup 3)). It should really be trying to constant fold 
the result
first.  This patch does that by calling simplify_gen_unary to generate the 
complement
of operands[3] if it's a register or the appropriate const_int rtx with the 
correct
folded result that will still properly match the arm bic-immediate instruction.

Bootstrapped and tested on arm-none-eabi.

Is this ok for trunk?

This appears on GCC 4.9 and GCC 5 and I'll be testing the fix there as well.
Ok for those branches if testing is successful?

Thanks,
Kyrill

2015-12-07  Kyrylo Tkachov  

 PR target/68648
 * config/arm/arm.md (*andsi_iorsi3_notsi): Try to simplify
 the complement of operands[3] during splitting.

2015-12-07  Kyrylo Tkachov  

 PR target/68648
 * gcc.c-torture/execute/pr68648.c: New test.

arm-not-int.patch


diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 
2b48bbaf034b286d723536ec2aa6fe0f9b312911..dfb75c5f11c66c6b4a34ff3071b5a0957c3512cb
 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -3274,8 +3274,22 @@ (define_insn_and_split "*andsi_iorsi3_notsi"
"#"   ; "orr%?\\t%0, %1, %2\;bic%?\\t%0, %0, %3"
"&& reload_completed"
[(set (match_dup 0) (ior:SI (match_dup 1) (match_dup 2)))
-   (set (match_dup 0) (and:SI (not:SI (match_dup 3)) (match_dup 0)))]
-  ""
+   (set (match_dup 0) (and:SI (match_dup 4) (match_dup 5)))]
+  {
+ /* If operands[3] is a constant make sure to fold the NOT into it
+   to avoid creating a NOT of a CONST_INT.  */
+rtx not_rtx = simplify_gen_unary (NOT, SImode, operands[3], SImode);
+if (CONST_INT_P (not_rtx))
+  {
+   operands[4] = operands[0];
+   operands[5] = not_rtx;
+  }
+else
+  {
+   operands[5] = operands[0];
+   operands[4] = not_rtx;
+  }

Watch out indentation here - I'm not sure if this is your mail client or mine 
causing this sort of thing.


Looks fine in two of my editors and also in the svn diff output, so I think 
there
must be something on your end.



Ok with that double checked.


Thanks, I've committed it as r231675.
Will backport to the branches after some time on trunk.

Kyrill


regards
Ramana


+  }
[(set_attr "length" "8")
 (set_attr "ce_count" "2")
 (set_attr "predicable" "yes")
diff --git a/gcc/testsuite/gcc.c-torture/execute/pr68648.c 
b/gcc/testsuite/gcc.c-torture/execute/pr68648.c
new file mode 100644
index 
..fc66806a99a0abef7bd517ae2f5200b387e69ce4
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/pr68648.c
@@ -0,0 +1,20 @@
+int __attribute__ ((noinline))
+foo (void)
+{
+  return 123;
+}
+
+int __attribute__ ((noinline))
+bar (void)
+{
+  int c = 1;
+  c |= 4294967295 ^ (foo () | 4073709551608);
+  return c;
+}
+
+int
+main ()
+{
+  if (bar () != 0x83fd4005)
+__builtin_abort ();
+}





Re: [PATCH][AArch64] Add vector permute cost

2015-12-16 Thread Richard Biener
On Wed, Dec 16, 2015 at 10:32 AM, James Greenhalgh
 wrote:
> On Tue, Dec 15, 2015 at 11:35:45AM +, Wilco Dijkstra wrote:
>>
>> Add support for vector permute cost since various permutes can expand into a 
>> complex
>> sequence of instructions.  This fixes major performance regressions due to 
>> recent changes
>> in the SLP vectorizer (which now vectorizes more aggressively and emits many 
>> complex
>> permutes).
>>
>> Set the cost to > 1 for all microarchitectures so that the number of 
>> permutes is usually zero
>> and regressions disappear.  An example of the kind of code that might be 
>> emitted for
>> VEC_PERM_EXPR {0, 3} where registers happen to be in the wrong order:
>>
>> adrpx4, .LC16
>> ldr q5, [x4, #:lo12:.LC16]
>> eor v1.16b, v1.16b, v0.16b
>> eor v0.16b, v1.16b, v0.16b
>> eor v1.16b, v1.16b, v0.16b
>> tbl v0.16b, {v0.16b - v1.16b}, v5.16b
>>
>> Regress passes. This fixes regressions that were introduced recently, so OK 
>> for commit?
>>
>>
>> ChangeLog:
>> 2015-12-15  Wilco Dijkstra  
>>
>>   * gcc/config/aarch64/aarch64.c (generic_vector_cost):
>>   Set vec_permute_cost.
>>   (cortexa57_vector_cost): Likewise.
>>   (exynosm1_vector_cost): Likewise.
>>   (xgene1_vector_cost): Likewise.
>>   (aarch64_builtin_vectorization_cost): Use vec_permute_cost.
>>   * gcc/config/aarch64/aarch64-protos.h (cpu_vector_cost):
>>   Add vec_permute_cost entry.
>>
>>
>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index 
>> 10754c88c0973d8ef3c847195b727f02b193bbd8..2584f16d345b3d015d577dd28c08a73ee3e0b0fb
>>  100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -314,6 +314,7 @@ static const struct cpu_vector_cost generic_vector_cost =
>>1, /* scalar_load_cost  */
>>1, /* scalar_store_cost  */
>>1, /* vec_stmt_cost  */
>> +  2, /* vec_permute_cost  */
>>1, /* vec_to_scalar_cost  */
>>1, /* scalar_to_vec_cost  */
>>1, /* vec_align_load_cost  */
>
> Is there any reasoning behind making this 2? Do we now miss vectorization
> for some of the cheaper permutes? Across the cost models/pipeline
> descriptions that have been contributed to GCC I think that this is a
> sensible change to the generic costs, but I just want to check there
> was some reasoning/experimentation behind the number you picked.
>
> As permutes can have such wildly different costs, this all seems like a good
> candidate for some future much more involved hook from the vectorizer to the
> back-end specifying the candidate permute operation and requesting a cost
> (part of the bigger gimple costs framework?).

Yes, the vectorizer side also needs to improve here.  Not sure if it is possible
to represent this kind of complex cost queries with a single gimple cost hook.
After all we don't really want to generate the full gimple stmt just to query
its cost ...

To better represent permute cost in the short term we'd need another vectorizer
specific hook, not sth for stage3 unless we face some serious regressions
on real-world code (thus not microbenchmarks only)

Richard.

> Thanks,
> James
>


Re: ipa-cp heuristics fixes

2015-12-16 Thread Jakub Jelinek
On Wed, Dec 16, 2015 at 10:59:10AM +0100, Richard Biener wrote:
> > What can I do to help fixing this?
> 
> Same on x86_64 btw.  If there isn't a bugreport already please open
> one to track this issue.

PR68860 covers this already and it has been discussed on this ml too
already.

Jakub


RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

2015-12-16 Thread Ajit Kumar Agarwal


-Original Message-
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On 
Behalf Of Richard Biener
Sent: Wednesday, December 16, 2015 3:27 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli 
Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa 
representation

On Wed, Dec 16, 2015 at 8:43 AM, Ajit Kumar Agarwal 
 wrote:
> Hello Jeff:
>
> Here is more of a data you have asked for.
>
> SPEC FP benchmarks.
> a) No Path Splitting + tracer enabled
> Geomean Score =  4749.726.
> b) Path Splitting enabled + tracer enabled.
> Geomean Score =  4781.655.
>
> Conclusion: With both Path Splitting and tracer enabled we got maximum gains. 
> I think we need to have Path Splitting pass.
>
> SPEC INT benchmarks.
> a) Path Splitting enabled + tracer not enabled.
> Geomean Score =  3745.193.
> b) No Path Splitting + tracer enabled.
> Geomean Score = 3738.558.
> c) Path Splitting enabled + tracer enabled.
> Geomean Score = 3742.833.

>>I suppose with SPEC you mean SPEC CPU 2006?

The performance data is with respect to SPEC CPU 2000 benchmarks.

>>Can you disclose the architecture you did the measurements on and the compile 
>>flags you used otherwise?

Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz 
cpu cores   : 10
cache size  : 25600 KB

I have used -O3 and enabled the tracer with -ftracer.

Thanks & Regards
Ajit
>>Note that tracer does a very good job only when paired with FDO so can you 
>>re-run SPEC with FDO and compare with path-splitting enabled on top of that?


Thanks,
Richard.

> Conclusion: We are getting more gains with Path Splitting as compared to 
> tracer. With both Path Splitting and tracer enabled we are also getting  
> gains.
> I think we should have Path Splitting pass.
>
> One more observation: Richard's concern is the creation of multiple
> exits when splitting paths through duplication. My observation is that the
> tracer pass also creates multiple exits through duplication. I don't think
> that's a practical issue, considering the gains we are getting from
> splitting paths with more PRE, CSE and DCE.
>
> Thanks & Regards
> Ajit
>
>
>
>
> -Original Message-
> From: Jeff Law [mailto:l...@redhat.com]
> Sent: Wednesday, December 16, 2015 5:20 AM
> To: Richard Biener
> Cc: Ajit Kumar Agarwal; GCC Patches; Vinod Kathail; Shail Aditya 
> Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on 
> tree ssa representation
>
> On 12/11/2015 03:05 AM, Richard Biener wrote:
>> On Thu, Dec 10, 2015 at 9:08 PM, Jeff Law  wrote:
>>> On 12/03/2015 07:38 AM, Richard Biener wrote:

 This pass is now enabled by default with -Os but has no limits on 
 the amount of stmts it copies.
>>>
>>> The more statements it copies, the more likely it is that the path 
>>> splitting will turn out to be useful!  It's counter-intuitive.
>>
>> Well, it's still not appropriate for -Os (nor -O2 I think).  -ftracer 
>> is enabled with -fprofile-use (but it is also properly driven to only 
>> trace hot paths) and otherwise not by default at any optimization level.
> Definitely not appropriate for -Os.  But as I mentioned, I really want to 
> look at the tracer code as it may totally subsume path splitting.
>
>>
>> Don't see how this would work for the CFG pattern it operates on 
>> unless you duplicate the exit condition into that new block creating 
>> an even more obfuscated CFG.
> Agreed, I don't see any way to fix the multiple exit problem.  Then again, 
> this all runs after the tree loop optimizer, so I'm not sure how big of an 
> issue it is in practice.
>
>
>>> It was only after I approved this code after twiddling it for Ajit 
>>> that I came across Honza's tracer implementation, which may in fact 
>>> be retargettable to these loops and do a better job.  I haven't 
>>> experimented with that.
>>
>> Well, I originally suggested to merge this with the tracer pass...
> I missed that, or it didn't sink into my brain.
>
>>> Again, the more statements it copies the more likely it is to be profitable.
>>> Think superblocks to expose CSE, DCE and the like.
>>
>> Ok, so similar to tracer (where I think the main benefit is actually 
>> increasing scheduling opportunities for architectures where it matters).
> Right.  They're both building superblocks, which has the effect of larger 
> windows for scheduling, DCE, CSE, etc.
>
>
>>
>> Note that both passes are placed quite late and thus won't see much 
>> of the GIMPLE optimizations (DOM mainly).  I wonder why they were not 
>> placed adjacent to each other.
> Ajit had it fairly early, but that didn't play well with if-conversion.
>   I just pushed it past if-conversion and vectorization, but before 
> the last DOM pass.  That turns out to be where tracer lives too as you noted.
>
>>>
> >>> I wouldn't lose any sleep if we disabled by default or removed,
> >>> particularly if we can repurpose Honza's code.

Re: ipa-cp heuristics fixes

2015-12-16 Thread Dominik Vogt
On Wed, Dec 16, 2015 at 10:59:10AM +0100, Richard Biener wrote:
> Same on x86_64 btw.  If there isn't a bugreport already please open
> one to track this issue.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68935

Ciao

Dominik ^_^  ^_^

-- 

Dominik Vogt
IBM Germany



[C++ Patch] Use default arguments in one more place

2015-12-16 Thread Paolo Carlini

Hi,

while looking into a few small issues I noticed that we could use C++ 
defaults here too. Tested x86_64-linux.


Thanks,
Paolo.

///
Index: cp-tree.h
===
--- cp-tree.h   (revision 231672)
+++ cp-tree.h   (working copy)
@@ -6115,7 +6115,8 @@ extern int template_class_depth   (tree);
 extern int is_specialization_of(tree, tree);
 extern bool is_specialization_of_friend(tree, tree);
 extern tree get_pattern_parm   (tree, tree);
-extern int comp_template_args  (tree, tree);
+extern int comp_template_args  (tree, tree, tree * = NULL,
+tree * = NULL);
 extern tree maybe_process_partial_specialization (tree);
 extern tree most_specialized_instantiation (tree);
 extern void print_candidates   (tree);
Index: pt.c
===
--- pt.c(revision 231672)
+++ pt.c(working copy)
@@ -7881,9 +7881,9 @@ template_args_equal (tree ot, tree nt)
template arguments.  Returns 0 otherwise, and updates OLDARG_PTR and
NEWARG_PTR with the offending arguments if they are non-NULL.  */
 
-static int
-comp_template_args_with_info (tree oldargs, tree newargs,
- tree *oldarg_ptr, tree *newarg_ptr)
+int
+comp_template_args (tree oldargs, tree newargs,
+   tree *oldarg_ptr, tree *newarg_ptr)
 {
   int i;
 
@@ -7913,15 +7913,6 @@ template_args_equal (tree ot, tree nt)
   return 1;
 }
 
-/* Returns 1 iff the OLDARGS and NEWARGS are in fact identical sets
-   of template arguments.  Returns 0 otherwise.  */
-
-int
-comp_template_args (tree oldargs, tree newargs)
-{
-  return comp_template_args_with_info (oldargs, newargs, NULL, NULL);
-}
-
 static void
 add_pending_template (tree d)
 {
@@ -19048,8 +19039,8 @@ unify_pack_expansion (tree tparms, tree targs, tre
  tree bad_old_arg = NULL_TREE, bad_new_arg = NULL_TREE;
  tree old_args = ARGUMENT_PACK_ARGS (old_pack);
 
- if (!comp_template_args_with_info (old_args, new_args,
-&bad_old_arg, &bad_new_arg))
+ if (!comp_template_args (old_args, new_args,
+  &bad_old_arg, &bad_new_arg))
/* Inconsistent unification of this parameter pack.  */
return unify_parameter_pack_inconsistent (explain_p,
  bad_old_arg,
2015-12-16  Paolo Carlini  

* pt.c (comp_template_args): Remove.
(comp_template_args_with_info): Rename to comp_template_args;
not static.
(add_pending_template): Adjust call.
* cp-tree.h (comp_template_args): Add default arguments.


Re: [BUILDROBOT] "error: null argument where non-null required" on multiple targets

2015-12-16 Thread Jan-Benedict Glaw
On Tue, 2015-12-15 10:43:58 -0700, Jeff Law  wrote:
> On 12/14/2015 01:07 PM, Jan-Benedict Glaw wrote:
> >On Mon, 2015-12-14 18:54:28 +, Moore, Catherine 
> > wrote:
> >>>avr-rtems  
> >>>http://toolchain.lug-owl.de/buildbot/show_build_details.php?id=478544
> >>>mipsel-elf 
> >>>http://toolchain.lug-owl.de/buildbot/show_build_details.php?id=478844
> >>>mipsisa64r2-sde-elf
> >>>http://toolchain.lug-owl.de/buildbot/show_build_details.php?id=478855
> >>>mipsisa64sb1-elf   
> >>>http://toolchain.lug-owl.de/buildbot/show_build_details.php?id=478865
> >>>mips-rtems 
> >>>http://toolchain.lug-owl.de/buildbot/show_build_details.php?id=478877
> >>>powerpc-eabialtivec
> >>>http://toolchain.lug-owl.de/buildbot/show_build_details.php?id=478922
> >>>powerpc-eabispe
> >>>http://toolchain.lug-owl.de/buildbot/show_build_details.php?id=478932
> >>>powerpc-rtems  
> >>>http://toolchain.lug-owl.de/buildbot/show_build_details.php?id=478956
> >>>ppc-elf
> >>>http://toolchain.lug-owl.de/buildbot/show_build_details.php?id=478968
> >>>sh-superh-elf  
> >>>http://toolchain.lug-owl.de/buildbot/show_build_details.php?id=479077
> >>
> >>Is there an easy way to reproduce the MIPS problems that you
> >>reported?  I don't seem to be able to do it with a cross-compiler
> >>targeting mipsel-elf.
> >
> >What's your build compiler? For these builds, where it showed up, I'm
> >using a freshly compiles HEAD/master version. So basically, compile a
> >current GCC for your build machine:
> Right.  This is something that only shows up when using the trunk to build
> the crosses.
> 
> When I looked, I thought I bisected it to the delayed folding work.

Shall I bisect one of the cases anew, with the "Test value of
_GLIBCXX_USE_C99_WCHAR not whether it is defined" patch that uncovered
it, applied? Starting with some arbitrary old revision?

MfG, JBG

-- 
  Jan-Benedict Glaw  jbg...@lug-owl.de  +49-172-7608481
Signature of:  GDB has a 'break' feature; why doesn't it have 'fix' too?
the second  :


signature.asc
Description: Digital signature


Re: [PATCH] C FE: fix range of primary-expression in c_parser_postfix_expression

2015-12-16 Thread Marek Polacek
On Tue, Dec 15, 2015 at 09:11:38PM -0500, David Malcolm wrote:
> In the C frontend,
>   c_parser_postfix_expression
> after parsing a primary expression passes "loc", the location of the
> *first token* in that expression to
>   c_parser_postfix_expression_after_primary,
> which thus discards any range information we had for primary
> expressions containing more than one token; we get just the range of
> the first token.
> 
> An example of this can be seen in this testcase from:
>   https://gcc.gnu.org/wiki/ClangDiagnosticsComparison
> 
> void foo(char **argP, char **argQ)
> {
>   (argP - argQ)();
>   argP();
> }
> 
> for which trunk currently gives these ranges:
> 
> diagnostic-range-bad-called-object.c:7:3: error: called object is not a 
> function or function pointer
>(argP - argQ)();
>^
> 
> diagnostic-range-bad-called-object.c:14:3: error: called object 'argP' is not 
> a function or function pointer
>argP();
>^~~~
> 
> The second happens to be correct, but the first is missing
> range information.
> 
> The following patch is a one-liner to preserve the expression's location,
> changing the first to:
> 
> diagnostic-range-bad-called-object.c:7:9: error: called object is not a 
> function or function pointer
>(argP - argQ)();
>~~^~~
> 
> and leaving the second unchanged.
> 
> Applying this fix requires tweaking some column numbers for expected
> locations in gcc.dg/cast-function-1.c; the output of trunk was of the
> form:
> 
> cast-function-1.c:21:7: warning: function called through a non-compatible type
>d = ((double (*) (int)) foo1) (i);
>^
> 
> which the patch changes to:
> 
> cast-function-1.c:21:8: warning: function called through a non-compatible type
>d = ((double (*) (int)) foo1) (i);
>~^~~~
> 
> which I feel is an improvement.
> 
> Successfully bootstrapped & regtested on x86_64-pc-linux-gnu.
> Adds 6 new PASS results to gcc.sum
> 
> OK for trunk in stage 3?
> 
> gcc/c/ChangeLog:
>   * c-parser.c (c_parser_postfix_expression): Use EXPR_LOC_OR_LOC
>   to preserve range information for the primary expression
>   in the call to c_parser_postfix_expression_after_primary.
> 
> gcc/testsuite/ChangeLog:
>   * gcc.dg/cast-function-1.c (bar): Update column numbers.
>   * gcc.dg/diagnostic-range-bad-called-object.c: New test case.

Looks ok.

Marek


Re: [PATCH 1/7][ARM] Add support for ARMv8.1.

2015-12-16 Thread Matthew Wahab

On 10/12/15 11:02, Ramana Radhakrishnan wrote:

On Thu, Dec 10, 2015 at 10:43 AM, Ramana Radhakrishnan
 wrote:

On Mon, Dec 7, 2015 at 4:04 PM, Matthew Wahab
 wrote:

Ping. Updated patch attached.
Matthew


On 26/11/15 15:55, Matthew Wahab wrote:


Hello,


ARMv8.1 includes an extension to ARM which adds two Adv.SIMD
instructions, vqrdmlah and vqrdmlsh. This patch set adds support for
ARMv8.1 and for the new instructions, enabling the architecture with
--march=armv8.1-a. The new instructions are enabled when both ARMv8.1
and a suitable fpu options are set, for instance with -march=armv8.1-a
-mfpu=neon-fp-armv8 -mfloat-abi=hard.


[..]

gcc/
2015-11-26  Matthew Wahab  

  * config/arm/arm-arches.def: Add "armv8.1-a" and "armv8.1-a+crc".
  * config/arm/arm-protos.h (FL2_ARCH8_1): New.
  (FL2_FOR_ARCH8_1A): New.
  * config/arm/arm-tables.opt: Regenerate.
  * config/arm/arm.c (arm_arch8_1): New.
  (arm_option_override): Set arm_arch8_1.
  * config/arm/arm.h (TARGET_NEON_RDMA): New.
  (arm_arch8_1): Declare.
  * doc/invoke.texi (ARM Options, -march): Add "armv8.1-a" and
  "armv8.1-a+crc".
  (ARM Options, -mfpu): Fix a typo.





OK.


I couldn't find 0/7 but in addition here you need to update the output
for TAG_FP_SIMD_Arch to be 4.

regards
Ramana


After discussing this offline, it turns out that the relevant attribute 
(Tag_Advanced_SIMD_arch) is set by the assembler.


Matthew






Re: [PATCH 2/4][AArch64] Increase the loop peeling limit

2015-12-16 Thread Richard Earnshaw (lists)
On 15/12/15 23:34, Evandro Menezes wrote:
> On 12/14/2015 05:26 AM, James Greenhalgh wrote:
>> On Thu, Dec 03, 2015 at 03:07:43PM -0600, Evandro Menezes wrote:
>>> On 11/20/2015 05:53 AM, James Greenhalgh wrote:
 On Thu, Nov 19, 2015 at 04:04:41PM -0600, Evandro Menezes wrote:
> On 11/05/2015 02:51 PM, Evandro Menezes wrote:
>> 2015-11-05  Evandro Menezes 
>>
>>gcc/
>>
>>* config/aarch64/aarch64.c
>> (aarch64_override_options_internal):
>>Increase loop peeling limit.
>>
>> This patch increases the limit for the number of peeled insns.
>> With this change, I noticed no major regression in either
>> Geekbench v3 or SPEC CPU2000 while some benchmarks, typically FP
>> ones, improved significantly.
>>
>> I tested this tuning on Exynos M1 and on A57.  ThunderX seems to
>> benefit from this tuning too.  However, I'd appreciate comments
from other stakeholders.
>
> Ping.
 I'd like to leave this for a call from the port maintainers. I can
 see why
 this leads to more opportunities for vectorization, but I'm
 concerned about
 the wider impact on code size. Certainly I wouldn't expect this to
 be our
 default at -O2 and below.

 My gut feeling is that this doesn't really belong in the back-end
 (there are
 presumably good reasons why the default for this parameter across
 GCC has
 fluctuated from 400 to 100 to 200 over recent years), but as I say, I'd
 like Marcus or Richard to make the call as to whether or not we take
 this
 patch.
>>> Please, correct me if I'm wrong, but loop peeling is enabled only
>>> with loop unrolling (and with PGO).  If so, then extra code size is
>>> not a concern, for this heuristic is only active when unrolling
>>> loops, when code size is already of secondary importance.
>> My understanding was that loop peeling is enabled from -O2 upwards, and
>> is also used to partially peel unaligned loops for vectorization
>> (allowing
>> the vector code to be well aligned), or to completely peel inner loops
>> which
>> may then become amenable to SLP vectorization.
>>
>> If I'm wrong then I take back these objections. But I was sure this
>> parameter was used in a number of situations outside of just
>> -funroll-loops/-funroll-all-loops . Certainly I remember seeing
>> performance
>> sensitivities to this parameter at -O3 in some internal workloads I was
>> analysing.
> 
> Vectorization, including SLP, is only enabled at -O3, isn't it?  It
> seems to me that peeling is only used by optimizations which already
> lead to potential increase in code size.
> 
> For instance, with "-Ofast -funroll-all-loops", the total text size for
> the SPEC CPU2000 suite is 26.9MB with this proposed change and 26.8MB
> without it; with just "-O2", it is the same at 23.1MB regardless of this
> setting.
> 
> So it seems to me that this proposal should be neutral for up to -O2.
> 
> Thank you,
> 

My preference would be to not diverge from the global parameter
settings.  I haven't looked in detail at this parameter but it seems to
me there are two possible paths:

1) We could get agreement globally that the parameter should be increased.
2) We could agree that this specific use of the parameter is distinct
from some other uses and deserves a new param in its own right with a
higher value.

R.


Re: [PATCH 1/7][ARM] Add support for ARMv8.1.

2015-12-16 Thread Ramana Radhakrishnan

>>
>> I couldn't find 0/7 but in addition here you need to update the output
>> for TAG_FP_SIMD_Arch to be 4.
>>
>> regards
>> Ramana
> 
> After discussing this offline, it turns out that the relevant attribute 
> (Tag_Advanced_SIMD_arch) is set by the assembler.

Yep - sorry about the noise.

The patch set is OK with the changes suggested.

Ramana

> 
> Matthew
> 
> 
> 
> 


RE: [PATCH][AArch64] Add vector permute cost

2015-12-16 Thread Wilco Dijkstra
Richard Biener wrote:
> On Wed, Dec 16, 2015 at 10:32 AM, James Greenhalgh
>  wrote:
> > On Tue, Dec 15, 2015 at 11:35:45AM +, Wilco Dijkstra wrote:
> >>
> >> Add support for vector permute cost since various permutes can expand into 
> >> a complex
> >> sequence of instructions.  This fixes major performance regressions due to 
> >> recent changes
> >> in the SLP vectorizer (which now vectorizes more aggressively and emits 
> >> many complex
> >> permutes).
> >>
> >> Set the cost to > 1 for all microarchitectures so that the number of 
> >> permutes is usually zero
> >> and regressions disappear.  An example of the kind of code that might be 
> >> emitted for
> >> VEC_PERM_EXPR {0, 3} where registers happen to be in the wrong order:
> >>
> >> adrp x4, .LC16
> >> ldr q5, [x4, #:lo12:.LC16]
> >> eor v1.16b, v1.16b, v0.16b
> >> eor v0.16b, v1.16b, v0.16b
> >> eor v1.16b, v1.16b, v0.16b
> >> tbl v0.16b, {v0.16b - v1.16b}, v5.16b
> >>
> >> Regress passes. This fixes regressions that were introduced recently, so 
> >> OK for commit?
> >>
> >>
> >> ChangeLog:
> >> 2015-12-15  Wilco Dijkstra  
> >>
> >>   * gcc/config/aarch64/aarch64.c (generic_vector_cost):
> >>   Set vec_permute_cost.
> >>   (cortexa57_vector_cost): Likewise.
> >>   (exynosm1_vector_cost): Likewise.
> >>   (xgene1_vector_cost): Likewise.
> >>   (aarch64_builtin_vectorization_cost): Use vec_permute_cost.
> >>   * gcc/config/aarch64/aarch64-protos.h (cpu_vector_cost):
> >>   Add vec_permute_cost entry.
> >>
> >>
> >> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> >> index 
> >> 10754c88c0973d8ef3c847195b727f02b193bbd8..2584f16d345b3d015d577dd28c08a73ee3e0b0fb
> >>  100644
> >> --- a/gcc/config/aarch64/aarch64.c
> >> +++ b/gcc/config/aarch64/aarch64.c
> >> @@ -314,6 +314,7 @@ static const struct cpu_vector_cost 
> >> generic_vector_cost =
> >>1, /* scalar_load_cost  */
> >>1, /* scalar_store_cost  */
> >>1, /* vec_stmt_cost  */
> >> +  2, /* vec_permute_cost  */
> >>1, /* vec_to_scalar_cost  */
> >>1, /* scalar_to_vec_cost  */
> >>1, /* vec_align_load_cost  */
> >
> > Is there any reasoning behind making this 2? Do we now miss vectorization
> > for some of the cheaper permutes? Across the cost models/pipeline
> > descriptions that have been contributed to GCC I think that this is a
> > sensible change to the generic costs, but I just want to check there
> > was some reasoning/experimentation behind the number you picked.

Yes, it blocks vectorization for loops with a high percentage of permutes, even
trivial ones. But that is exactly the goal - when we vectorize we want to
minimize data shuffling that just adds latency and does not contribute to
actual work.

The value 2 was the minimal value that avoided the regressions I noticed due to
the improved SLP vectorization. I tried other possibilities, like increasing the
statement cost, but that affects other cases, so this is the simplest change.

Note that I'm not too convinced about any of the existing vector costs - they
seem like fairly random numbers. The A57 vector costs, for example, block many
simple vectorization cases that are beneficial. The goal of this patch is simply
to fix the regressions rather than to tune vector costs (which is only possible
once we have a better cost model).

> > As permutes can have such wildly different costs, this all seems like a good
> > candidate for some future much more involved hook from the vectorizer to the
> > back-end specifying the candidate permute operation and requesting a cost
> > (part of the bigger gimple costs framework?).
> 
> Yes, the vectorizer side also needs to improve here.  Not sure if it is 
> possible
> to represent this kind of complex cost queries with a single gimple cost hook.
> After all we don't really want to generate the full gimple stmt just to query
> its cost ...
> 
> To better represent permute cost in the short term we'd need another 
> vectorizer
> specific hook, not sth for stage3 unless we face some serious regressions
> on real-world code (thus not microbenchmarks only)

Yes, we need a better cost model for the vectorizer. The sorts of things that
are important are the vector mode (so we can differentiate between 2x, 4x, 8x,
16x vectorization etc.), the specific permute for permutes, and the size and
type of loads/stores (as some do permutes).

Wilco


[RFC, rtl optimization]: Better heuristics for estimate_reg_pressure_cost in presence of call for LICM.

2015-12-16 Thread Ajit Kumar Agarwal

The estimate on target_clobbered_registers based on the call_used arrays is not 
correct. This is the worst case 
heuristics on the estimate on target_clobbered_registers. This disables many of 
the loop Invariant code motion 
opportunities in presence of call. Instead of considering the spill cost we 
consider only the target_reg_cost
aggressively.

With this change, keeping the old logic used for regs_used, we measured the following gains.

diff --git a/gcc/cfgloopanal.c b/gcc/cfgloopanal.c 
--- a/gcc/cfgloopanal.c
+++ b/gcc/cfgloopanal.c
@@ -373,15 +373,23 @@ estimate_reg_pressure_cost (unsigned n_new, unsigned 
n_old, bool speed,
 
   /* If there is a call in the loop body, the call-clobbered registers
  are not available for loop invariants.  */
+
   if (call_p)
 available_regs = available_regs - target_clobbered_regs;
-
+  
   /* If we have enough registers, we should use them and not restrict
  the transformations unnecessarily.  */
   if (regs_needed + target_res_regs <= available_regs)
 return 0;
 
-  if (regs_needed <= available_regs)
+  /* Estimation of target_clobbered register is based on the call_used
+ arrays which is not the right estimate for the clobbered register
+ used in called function. Instead of considering the spill cost we
+ consider only the reg_cost aggressively.  */
+
+  if ((regs_needed <= available_regs) 
+  || (call_p && (regs_needed <= 
+  (available_regs + target_clobbered_regs))))
 /* If we are close to running out of registers, try to preserve
them.  */
 cost = target_reg_cost [speed] * n_new;

SPEC CPU 2000 INT benchmarks
(Geomean Score without the above change vs Geomean score with the above change 
= 3745.193 vs 3752.717)

SPEC INT CPU 2000 benchmarks.
 (Geomean Score without the above change vs Geomean score with the above change 
= 4741.825 vs 4792.085) 

Thanks & Regards
Ajit



[PATCH, ARM, 1/3] Document --with-multilib-list for arm*-*-* targets

2015-12-16 Thread Thomas Preud'homme
Currently, the documentation for --with-multilib-list in gcc/doc/install.texi 
only mentions sh*-*-* and x86-64-*-linux* targets. However, arm*-*-* targets 
also support this option. This patch adds documentation for the meaning of this 
option for arm*-*-* targets.

ChangeLog entry is as follows:


*** gcc/ChangeLog ***

2015-12-09  Thomas Preud'homme  

* doc/install.texi (--with-multilib-list): Describe the meaning of the
option for arm*-*-* targets.


diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi
index 57399ed..2c93eb0 100644
--- a/gcc/doc/install.texi
+++ b/gcc/doc/install.texi
@@ -1102,9 +1102,19 @@ sysv, aix.
 @item --with-multilib-list=@var{list}
 @itemx --without-multilib-list
 Specify what multilibs to build.
-Currently only implemented for sh*-*-* and x86-64-*-linux*.
+Currently only implemented for arm*-*-*, sh*-*-* and x86-64-*-linux*.
 
 @table @code
+@item arm*-*-*
+@var{list} is either @code{default} or @code{aprofile}.  Specifying
+@code{default} is equivalent to omitting this option while specifying
+@code{aprofile} builds multilibs for each combination of ISA (@code{-marm} or
+@code{-mthumb}), architecture (@code{-march=armv7-a}, @code{-march=armv7ve},
+or @code{-march=armv8-a}), FPU available (none, @code{-mfpu=vfpv3-d16},
+@code{neon}, @code{vfpv4-d16}, @code{neon-vfpv4} or @code{neon-fp-armv8}
+depending on architecture) and floating-point ABI (@code{-mfloat-abi=softfp}
+or @code{-mfloat-abi=hard}).
+
 @item sh*-*-*
 @var{list} is a comma separated list of CPU names.  These must be of the
 form @code{sh*} or @code{m*} (in which case they match the compiler option


PDF builds fine out of the updated file and looks as expected.

Is this ok for trunk?

Best regards,

Thomas



[PATCH, GCC/ARM, 2/3] Error out for incompatible ARM multilibs

2015-12-16 Thread Thomas Preud'homme
Currently in config.gcc, only the first multilib in a multilib list is checked 
for validity and the following elements are ignored, due to the break statement, 
which only breaks out of the enclosing loop in shell. A loop is also made over 
the multilib list elements even though no combination of several multilibs is 
legal. This patch reworks the code to address both issues.

ChangeLog entry is as follows:


2015-11-24  Thomas Preud'homme  

* config.gcc: Error out when conflicting multilib is detected.  Do not
loop over multilibs since no combination is legal.


diff --git a/gcc/config.gcc b/gcc/config.gcc
index 59aee2c..be3c720 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -3772,38 +3772,40 @@ case "${target}" in
# Add extra multilibs
if test "x$with_multilib_list" != x; then
arm_multilibs=`echo $with_multilib_list | sed -e 's/,/ 
/g'`
-   for arm_multilib in ${arm_multilibs}; do
-   case ${arm_multilib} in
-   aprofile)
+   case ${arm_multilibs} in
+   aprofile)
# Note that arm/t-aprofile is a
# stand-alone make file fragment to be
# used only with itself.  We do not
# specifically use the
# TM_MULTILIB_OPTION framework because
# this shorthand is more
-   # pragmatic. Additionally it is only
-   # designed to work without any
-   # with-cpu, with-arch with-mode
+   # pragmatic.
+   tmake_profile_file="arm/t-aprofile"
+   ;;
+   default)
+   ;;
+   *)
+   echo "Error: 
--with-multilib-list=${with_multilib_list} not supported." 1>&2
+   exit 1
+   ;;
+   esac
+
+   if test "x${tmake_profile_file}" != x ; then
+   # arm/t-aprofile is only designed to work
+   # without any with-cpu, with-arch, with-mode,
# with-fpu or with-float options.
-   if test "x$with_arch" != x \
-   || test "x$with_cpu" != x \
-   || test "x$with_float" != x \
-   || test "x$with_fpu" != x \
-   || test "x$with_mode" != x ; then
-   echo "Error: You cannot use any of 
--with-arch/cpu/fpu/float/mode with --with-multilib-list=aprofile" 1>&2
-   exit 1
-   fi
-   tmake_file="${tmake_file} 
arm/t-aprofile"
-   break
-   ;;
-   default)
-   ;;
-   *)
-   echo "Error: 
--with-multilib-list=${with_multilib_list} not supported." 1>&2
-   exit 1
-   ;;
-   esac
-   done
+   if test "x$with_arch" != x \
+   || test "x$with_cpu" != x \
+   || test "x$with_float" != x \
+   || test "x$with_fpu" != x \
+   || test "x$with_mode" != x ; then
+   echo "Error: You cannot use any of 
--with-arch/cpu/fpu/float/mode with --with-multilib-list=${arm_multilib}" 1>&2
+   exit 1
+   fi
+
+   tmake_file="${tmake_file} ${tmake_profile_file}"
+   fi
fi
;;
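
The effect of matching the whole list at once (rather than looping over its
elements) can be sketched in plain shell, independent of config.gcc:

```shell
# Standalone sketch of the reworked logic: the entire comma-separated
# value is classified in one case statement, so a mixed list such as
# "aprofile,default" falls through to the error branch.
check_multilib_list () {
  case "$1" in
    aprofile) echo "using arm/t-aprofile" ;;
    default)  echo "default multilibs" ;;
    *)        echo "Error: --with-multilib-list=$1 not supported." ;;
  esac
}
check_multilib_list aprofile
check_multilib_list default
check_multilib_list aprofile,default
```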


Tested with the following multilib lists:
  + foo -> "Error: --with-multilib-list=foo not supported" as expected
  + default,aprofile -> "Error: --with-multilib-list=default,aprofile not 
supported" as expected
  + aprofile,default -> "Error: --with-multilib-list=aprofile,default not 
supported" as expected
  + (nothing) -> libraries in $installdir/arm-none-eabi/lib{,fpu,thumb}
  + default -> libraries in $installdir/arm-none-eabi/lib{,fpu,thumb} as 
expected
  + aprofile -> $installdir/arm-none-eabi/lib contains all supported multilib

Is this ok for trunk?

Best regards,

Thomas



[PATCH, ARM, 3/3] Add multilib support for bare-metal ARM architectures

2015-12-16 Thread Thomas Preud'homme
Hi Ramana,

As suggested in your initial answer to this thread, we updated the multilib 
patch provided in ARM's embedded branch to be up to date with regard to the 
CPUs supported in GCC. As to the need to modify Makefile.in and configure.ac: 
the patch aims to give the user control over which multilibs should be built. 
To this effect, it takes a list of architectures at configure time, and that 
list needs to be passed down to the t-baremetal Makefile to set the multilib 
variables appropriately.

ChangeLog entry is as follows:


*** gcc/ChangeLog ***

2015-12-15  Thomas Preud'homme  

* Makefile.in (with_multilib_list): New variables substituted by
configure.
* config.gcc: Handle bare-metal multilibs in --with-multilib-list
option.
* config/arm/t-baremetal: New file.
* configure.ac (with_multilib_list): New AC_SUBST.
* configure: Regenerate.
* doc/install.texi (--with-multilib-list): Update description for
arm*-*-* targets to mention bare-metal multilibs.


diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 
1f698798aa2df3f44d6b3a478bb4bf48e9fa7372..18b790afa114aa7580be0662d3ac9ffbc94e919d
 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -546,6 +546,7 @@ lang_opt_files=@lang_opt_files@ $(srcdir)/c-family/c.opt 
$(srcdir)/common.opt
 lang_specs_files=@lang_specs_files@
 lang_tree_files=@lang_tree_files@
 target_cpu_default=@target_cpu_default@
+with_multilib_list=@with_multilib_list@
 OBJC_BOEHM_GC=@objc_boehm_gc@
 extra_modes_file=@extra_modes_file@
 extra_opt_files=@extra_opt_files@
diff --git a/gcc/config.gcc b/gcc/config.gcc
index 
af948b5e203f6b4f53dfca38e9d02d060d00c97b..d8098ed3cefacd00cb10590db1ec86d48e9fcdbc
 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -3787,15 +3787,25 @@ case "${target}" in
default)
;;
*)
-   echo "Error: 
--with-multilib-list=${with_multilib_list} not supported." 1>&2
-   exit 1
+   for arm_multilib in ${arm_multilibs}; do
+   case ${arm_multilib} in
+   armv6-m | armv7-m | armv7e-m | armv7-r 
| armv8-m.base | armv8-m.main)
+   
tmake_profile_file="arm/t-baremetal"
+   ;;
+   *)
+   echo "Error: 
--with-multilib-list=${with_multilib_list} not supported." 1>&2
+   exit 1
+   ;;
+   esac
+   done
;;
esac
 
if test "x${tmake_profile_file}" != x ; then
-   # arm/t-aprofile is only designed to work
-   # without any with-cpu, with-arch, with-mode,
-   # with-fpu or with-float options.
+   # arm/t-aprofile and arm/t-baremetal are only
+   # designed to work without any with-cpu,
+   # with-arch, with-mode, with-fpu or with-float
+   # options.
if test "x$with_arch" != x \
|| test "x$with_cpu" != x \
|| test "x$with_float" != x \
diff --git a/gcc/config/arm/t-baremetal b/gcc/config/arm/t-baremetal
new file mode 100644
index 
..ffd29815e6ec22c747e77747ed9b69e0ae21b63a
--- /dev/null
+++ b/gcc/config/arm/t-baremetal
@@ -0,0 +1,130 @@
+# A set of predefined MULTILIB which can be used for different ARM targets.
+# Via the configure option --with-multilib-list, user can customize the
+# final MULTILIB implementation.
+
+comma := ,
+
+with_multilib_list := $(subst $(comma), ,$(with_multilib_list))
+
+MULTILIB_OPTIONS   = mthumb/marm
+MULTILIB_DIRNAMES  = thumb arm
+MULTILIB_OPTIONS  += 
march=armv6s-m/march=armv7-m/march=armv7e-m/march=armv7/march=armv8-m.base/march=armv8-m.main
+MULTILIB_DIRNAMES += armv6-m armv7-m armv7e-m armv7-ar armv8-m.base 
armv8-m.main
+MULTILIB_OPTIONS  += mfloat-abi=softfp/mfloat-abi=hard
+MULTILIB_DIRNAMES += softfp fpu
+MULTILIB_OPTIONS  += 
mfpu=fpv5-sp-d16/mfpu=fpv5-d16/mfpu=fpv4-sp-d16/mfpu=vfpv3-d16
+MULTILIB_DIRNAMES += fpv5-sp-d16 fpv5-d16 fpv4-sp-d16 vfpv3-d16
+
+MULTILIB_MATCHES   = march?armv6s-m=mcpu?cortex-m0
+MULTILIB_MATCHES  += march?armv6s-m=mcpu?cortex-m0.small-multiply
+MULTILIB_MATCHES  += march?armv6s-m=mcpu?cortex-m0plus
+MULTILIB_MATCHES  += march?armv6s-m=mcpu?cortex-m0plus.small-multiply
+MULTILIB_MATCHES  += march?armv6s-m=mcpu?cortex-m1
+MULTILIB_MATCHES  += march?armv6s-m=mcpu?cort
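
The $(subst $(comma), ,...) call near the top of t-baremetal just splits the
configure-time value on commas; an equivalent shell transformation:

```shell
# What the make-level subst does, expressed in shell: turn the
# comma-separated --with-multilib-list value into a space-separated list.
with_multilib_list="armv6-m,armv7-m,armv7e-m"
echo "$with_multilib_list" | tr ',' ' '
```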

[PATCH] Fix PRs 68916 and 68914

2015-12-16 Thread Richard Biener

Testisms, the easiest thing is to require vect_perm.

Tested on x86_64-unknown-linux-gnu, applied.

Richard.

2015-12-16  Richard Biener  

PR testsuite/68916
PR testsuite/68914
* gcc.dg/vect/pr45752.c: Require vect_perm and adjust expected
dump.
* gcc.dg/vect/slp-perm-4.c: Likewise.

Index: gcc/testsuite/gcc.dg/vect/pr45752.c
===
--- gcc/testsuite/gcc.dg/vect/pr45752.c (revision 231675)
+++ gcc/testsuite/gcc.dg/vect/pr45752.c (working copy)
@@ -1,4 +1,5 @@
 /* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target vect_perm } */
 /* { dg-additional-options "--param tree-reassoc-width=1" } */
 
 #include 
@@ -104,6 +105,8 @@ int main (int argc, const char* argv[])
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect"  } } */
+/* Currently interleaving is not supported for a group-size of 5.  */
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
 /* { dg-final { scan-tree-dump-times "gaps requires scalar epilogue loop" 0 
"vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" { 
target vect_perm } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 2 "vect" } 
} */
Index: gcc/testsuite/gcc.dg/vect/slp-perm-4.c
===
--- gcc/testsuite/gcc.dg/vect/slp-perm-4.c  (revision 231675)
+++ gcc/testsuite/gcc.dg/vect/slp-perm-4.c  (working copy)
@@ -1,4 +1,5 @@
 /* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target vect_perm } */
 
 #include 
 #include "tree-vect.h"
@@ -82,6 +83,8 @@ int main (int argc, const char* argv[])
   return 0;
 }
 
+/* Currently interleaving is not supported for a group-size of 5.  */
+
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
 /* { dg-final { scan-tree-dump-times "gaps requires scalar epilogue loop" 0 
"vect" } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { 
target vect_perm } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } 
} */


[PATCH PR68906]

2015-12-16 Thread Yuri Rumyantsev
Hi All,

Here is a simple patch which cures the issue with outer-loop unswitching:
it adds an invocation of number_of_latch_executions() to reject
unswitching for non-iterated loops.

Bootstrapping and regression testing did not show any new failures.
Is it OK for trunk?

ChangeLog:

2015-12-16  Yuri Rumyantsev  

PR tree-optimization/68906
* tree-ssa-loop-unswitch.c : Include couple header files.
(tree_unswitch_outer_loop): Use number_of_latch_executions
to reject non-iterated loops.

gcc/testsuite/ChangeLog
* gcc.dg/torture/pr68906.c: New test.


patch
Description: Binary data


Re: [PATCH][combine] PR rtl-optimization/68651 Try changing rtx from (r + r) to (r << 1) to aid recognition

2015-12-16 Thread Bernd Schmidt

On 12/15/2015 05:21 PM, Kyrill Tkachov wrote:

Then for the shift pattern in the MD file we'd have to dynamically
select the scheduling type depending on whether or not the shift
amount is 1 and the costs line up?


Yes. This isn't unusual, take a look at i386.md where you have a lot of 
switches on attr type to decide which string to print.



The price we pay when trying these substitutions is an iteration over
 the rtx with FOR_EACH_SUBRTX_PTR. recog gets called only if that
iteration actually performed a substitution of x + x into x << 1. Is
that too high a price to pay? (I'm not familiar with the performance
 characteristics of the FOR_EACH_SUBRTX machinery)


It depends on how many of these transforms we are going to try; it also 
feels very hackish, trying to work around the core design of the 
combiner. IMO it would be better for machine descriptions to work with 
the pass rather than against it.


If you can somehow arrange for the (plus x x) to be turned into a shift 
while substituting that might be yet another approach to try.



Bernd


Re: [PATCH] Allow embedded timestamps by C/C++ macros to be set externally (2)

2015-12-16 Thread Bernd Schmidt

On 12/14/2015 03:31 PM, Dhole wrote:

The copyright assignment process is now complete :)
Let me know if I'm required to do anything else regarding the patch I sent.


Right now we're in a bug fixing stage; please wait until stage 1 reopens 
and then resend your patch.



Bernd



gomp_target_fini (was: [gomp4.5] Handle #pragma omp declare target link)

2015-12-16 Thread Thomas Schwinge
Hi!

On Mon, 14 Dec 2015 19:47:36 +0300, Ilya Verbin  wrote:
> On Fri, Dec 11, 2015 at 18:27:13 +0100, Jakub Jelinek wrote:
> > On Tue, Dec 08, 2015 at 05:45:59PM +0300, Ilya Verbin wrote:
> > > +/* This function finalizes all initialized devices.  */
> > > +
> > > +static void
> > > +gomp_target_fini (void)
> > > +{
> > > +  [...]
> > 
> > The question is what will this do if there are async target tasks still
> > running on some of the devices at this point (forgotten #pragma omp taskwait
> > or similar if target nowait regions are started outside of parallel region,
> > or exit inside of parallel, etc.  But perhaps it can be handled 
> > incrementally.
> > Also there is the question that the 
> > So I think the patch is ok with the above mentioned changes.
> 
> Here is what I've committed to trunk.

> --- a/libgomp/libgomp.h
> +++ b/libgomp/libgomp.h
> @@ -888,6 +888,14 @@ typedef struct acc_dispatch_t
>} cuda;
>  } acc_dispatch_t;
>  
> +/* Various state of the accelerator device.  */
> +enum gomp_device_state
> +{
> +  GOMP_DEVICE_UNINITIALIZED,
> +  GOMP_DEVICE_INITIALIZED,
> +  GOMP_DEVICE_FINALIZED
> +};
> +
>  /* This structure describes accelerator device.
> It contains name of the corresponding libgomp plugin, function handlers 
> for
> interaction with the device, ID-number of the device, and information 
> about
> @@ -933,8 +941,10 @@ struct gomp_device_descr
>/* Mutex for the mutable data.  */
>gomp_mutex_t lock;
>  
> -  /* Set to true when device is initialized.  */
> -  bool is_initialized;
> +  /* Current state of the device.  OpenACC allows to move from INITIALIZED 
> state
> + back to UNINITIALIZED state.  OpenMP allows only to move from 
> INITIALIZED
> + to FINALIZED state (at program shutdown).  */
> +  enum gomp_device_state state;

(ACK, but I assume we'll want to make sure that an OpenACC device is
never re-initialized if we're in/after the libgomp finalization phase.)


The issue mentioned above: "exit inside of parallel" is actually a
problem for nvptx offloading: the libgomp.oacc-c-c++-common/abort-1.c,
libgomp.oacc-c-c++-common/abort-3.c, and libgomp.oacc-fortran/abort-1.f90
test cases now run into annoying "WARNING: program timed out".  Here is
what's happening, as I understand it: in
libgomp/plugin/plugin-nvptx.c:nvptx_exec, the cuStreamSynchronize call
returns CUDA_ERROR_LAUNCH_FAILED, upon which we call GOMP_PLUGIN_fatal.

> --- a/libgomp/target.c
> +++ b/libgomp/target.c

> +/* This function finalizes all initialized devices.  */
> +
> +static void
> +gomp_target_fini (void)
> +{
> +  int i;
> +  for (i = 0; i < num_devices; i++)
> +{
> +  struct gomp_device_descr *devicep = &devices[i];
> +  gomp_mutex_lock (&devicep->lock);
> +  if (devicep->state == GOMP_DEVICE_INITIALIZED)
> + {
> +   devicep->fini_device_func (devicep->target_id);
> +   devicep->state = GOMP_DEVICE_FINALIZED;
> + }
> +  gomp_mutex_unlock (&devicep->lock);
> +}
> +}

> @@ -2387,6 +2433,9 @@ gomp_target_init (void)
>if (devices[i].capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
>   goacc_register (&devices[i]);
>  }
> +
> +  if (atexit (gomp_target_fini) != 0)
> +gomp_fatal ("atexit failed");
>  }

Now, with the above change installed, GOMP_PLUGIN_fatal will trigger the
atexit handler, gomp_target_fini, which, with the device lock held, will
call back into the plugin, GOMP_OFFLOAD_fini_device, which will try to
clean up.

Because of the earlier CUDA_ERROR_LAUNCH_FAILED, the associated CUDA
context is now in an inconsistent state, see
:

CUDA_ERROR_LAUNCH_FAILED = 719
An exception occurred on the device while executing a
kernel. Common causes include dereferencing an invalid device
pointer and accessing out of bounds shared memory. The context
cannot be used, so it must be destroyed (and a new one should be
created). All existing device memory allocations from this
context are invalid and must be reconstructed if the program is
to continue using CUDA.

Thus, any cuMemFreeHost invocations that are run during clean-up will now
also/still return CUDA_ERROR_LAUNCH_FAILED, due to which we'll again call
GOMP_PLUGIN_fatal, which again will trigger the same or another
(GOMP_offload_unregister_ver) atexit handler, which will then deadlock
trying to lock the device again, which is still locked.

(Jim, I wonder: after the first CUDA_ERROR_LAUNCH_FAILED and similar
errors, should we destroy the context right away, or toggle a boolean
flag to mark it as unusable, and use that as an indication to avoid the
follow-on failures of cuMemFreeHost just described above, for example?)


tells us:

Since the behavior is undefined if the exit() function is called more
than once, portable applications calling atexit() must ensure that the

[PATCH] Fix PR68915

2015-12-16 Thread Richard Biener

The following will hopefully resolve PR68915

tested on x86_64-linux, applied.

Richard.

2015-12-16  Richard Biener  

PR testsuite/68915
* gcc.dg/vect/pr46032.c: Use dg-additional-options.

Index: gcc/testsuite/gcc.dg/vect/pr46032.c
===
--- gcc/testsuite/gcc.dg/vect/pr46032.c (revision 231683)
+++ gcc/testsuite/gcc.dg/vect/pr46032.c (working copy)
@@ -1,7 +1,7 @@
 /* { dg-do compile } */
 /* { dg-require-effective-target fopenmp } */
 /* { dg-require-effective-target vect_int } */
-/* { dg-options "-O2 -fopenmp -ftree-vectorize -std=c99 -fipa-pta 
-fdump-tree-vect-all" } */
+/* { dg-additional-options "-fopenmp -fipa-pta" } */
 
 extern void abort (void);
 


Re: [PATCH 2/4][AArch64] Increase the loop peeling limit

2015-12-16 Thread Richard Biener
On Wed, Dec 16, 2015 at 12:24 PM, Richard Earnshaw (lists)
 wrote:
> On 15/12/15 23:34, Evandro Menezes wrote:
>> On 12/14/2015 05:26 AM, James Greenhalgh wrote:
>>> On Thu, Dec 03, 2015 at 03:07:43PM -0600, Evandro Menezes wrote:
 On 11/20/2015 05:53 AM, James Greenhalgh wrote:
> On Thu, Nov 19, 2015 at 04:04:41PM -0600, Evandro Menezes wrote:
>> On 11/05/2015 02:51 PM, Evandro Menezes wrote:
>>> 2015-11-05  Evandro Menezes 
>>>
>>>gcc/
>>>
>>>* config/aarch64/aarch64.c
>>> (aarch64_override_options_internal):
>>>Increase loop peeling limit.
>>>
>>> This patch increases the limit for the number of peeled insns.
>>> With this change, I noticed no major regression in either
>>> Geekbench v3 or SPEC CPU2000 while some benchmarks, typically FP
>>> ones, improved significantly.
>>>
>>> I tested this tuning on Exynos M1 and on A57.  ThunderX seems to
>>> benefit from this tuning too.  However, I'd appreciate comments
>> >from other stakeholders.
>>
>> Ping.
> I'd like to leave this for a call from the port maintainers. I can
> see why
> this leads to more opportunities for vectorization, but I'm
> concerned about
> the wider impact on code size. Certainly I wouldn't expect this to
> be our
> default at -O2 and below.
>
> My gut feeling is that this doesn't really belong in the back-end
> (there are
> presumably good reasons why the default for this parameter across
> GCC has
> fluctuated from 400 to 100 to 200 over recent years), but as I say, I'd
> like Marcus or Richard to make the call as to whether or not we take
> this
> patch.
 Please, correct me if I'm wrong, but loop peeling is enabled only
 with loop unrolling (and with PGO).  If so, then extra code size is
 not a concern, for this heuristic is only active when unrolling
 loops, when code size is already of secondary importance.
>>> My understanding was that loop peeling is enabled from -O2 upwards, and
>>> is also used to partially peel unaligned loops for vectorization
>>> (allowing
>>> the vector code to be well aligned), or to completely peel inner loops
>>> which
>>> may then become amenable to SLP vectorization.
>>>
>>> If I'm wrong then I take back these objections. But I was sure this
>>> parameter was used in a number of situations outside of just
>>> -funroll-loops/-funroll-all-loops . Certainly I remember seeing
>>> performance
>>> sensitivities to this parameter at -O3 in some internal workloads I was
>>> analysing.
>>
>> Vectorization, including SLP, is only enabled at -O3, isn't it?  It
>> seems to me that peeling is only used by optimizations which already
>> lead to potential increase in code size.
>>
>> For instance, with "-Ofast -funroll-all-loops", the total text size for
>> the SPEC CPU2000 suite is 26.9MB with this proposed change and 26.8MB
>> without it; with just "-O2", it is the same at 23.1MB regardless of this
>> setting.
>>
>> So it seems to me that this proposal should be neutral for up to -O2.
>>
>> Thank you,
>>
>
> My preference would be to not diverge from the global parameter
> settings.  I haven't looked in detail at this parameter but it seems to
> me there are two possible paths:
>
> 1) We could get agreement globally that the parameter should be increased.
> 2) We could agree that this specific use of the parameter is distinct
> from some other uses and deserves a new param in its own right with a
> higher value.

I think the fix is to improve the unrolled size estimates by better taking into
account constant propagation and CSE opportunities.  I have some ideas
here but not sure if I have enough free cycles to implement this for GCC 7.

Richard.

> R.


[RFC][PATCH] Fix broken handling of LABEL_REF in genrecog + genpreds.

2015-12-16 Thread Dominik Vogt
The attached patch fixes the handling of LABEL_REF in genrecog and
genpreds.

The current code assumes that X can have a different mode than the one
tested by PRED (X, MODE) only if X is a CONST_INT, CONST_DOUBLE or
CONST_WIDE_INT, but that can also be the case for a LABEL_REF with
VOIDmode.  Because of this it is currently necessary to add "const_int"
(or similar) to the match_code of some predicates that handle label_ref
but not const_int.

Ciao

Dominik ^_^  ^_^

-- 

Dominik Vogt
IBM Germany
gcc/ChangeLog

* config/s390/predicates.md ("larl_operand"): Remove now superfluous
const_int and const_double.
* genrecog.c (safe_predicate_mode): Return false for VOIDmode
LABEL_REFs even if the predicate does not handle const_int,
const_double or const_wide_int.
* genpreds.c (add_mode_tests): Treat LABEL_REF like CONST_INT.
>From a211b6e3e2a64bdf47507c0e6151f137bcd4641f Mon Sep 17 00:00:00 2001
From: Dominik Vogt 
Date: Fri, 11 Dec 2015 17:14:25 +0100
Subject: [PATCH] Fix broken handling of LABEL_REF in genrecog + genpreds.

The old code assumed that X can have a different mode than the one tested by
PRED (X, MODE) only if X is CONST_INT, CONST_DOUBLE or CONST_WIDE_INT, but that
can also be the case for LABEL_REF.  Due to this bug it was necessary to add
"const_int" (or similar) to the match_code of some predicates handling
label_ref but not const_int.
---
 gcc/config/s390/predicates.md | 5 +
 gcc/genpreds.c| 1 +
 gcc/genrecog.c| 3 ++-
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/gcc/config/s390/predicates.md b/gcc/config/s390/predicates.md
index 5c462c4..f7ab836 100644
--- a/gcc/config/s390/predicates.md
+++ b/gcc/config/s390/predicates.md
@@ -121,10 +121,7 @@
 ;;  Return true if OP a valid operand for the LARL instruction.
 
 (define_predicate "larl_operand"
-; Note: Although CONST_INT and CONST_DOUBLE are not handled in this predicate,
-; at least one of them needs to appear or otherwise safe_predicate_mode will
-; assume that a VOIDmode LABEL_REF is not accepted either (see genrecog.c).
-  (match_code "label_ref, symbol_ref, const, const_int, const_double")
+  (match_code "label_ref, symbol_ref, const")
 {
   /* Allow labels and local symbols.  */
   if (GET_CODE (op) == LABEL_REF)
diff --git a/gcc/genpreds.c b/gcc/genpreds.c
index eac2180..3791f6d 100644
--- a/gcc/genpreds.c
+++ b/gcc/genpreds.c
@@ -320,6 +320,7 @@ add_mode_tests (struct pred_data *p)
 	{
 	case CONST_INT:
 	case CONST_WIDE_INT:
+	case LABEL_REF:
 	  matches_const_scalar_int_p = true;
 	  break;
 
diff --git a/gcc/genrecog.c b/gcc/genrecog.c
index 599121f..81ea35b 100644
--- a/gcc/genrecog.c
+++ b/gcc/genrecog.c
@@ -3382,7 +3382,8 @@ safe_predicate_mode (const struct pred_data *pred, machine_mode mode)
   if (GET_MODE_CLASS (mode) == MODE_INT
   && (pred->codes[CONST_INT]
 	  || pred->codes[CONST_DOUBLE]
-	  || pred->codes[CONST_WIDE_INT]))
+	  || pred->codes[CONST_WIDE_INT]
+	  || pred->codes[LABEL_REF]))
 return false;
 
   return !pred->special && mode != VOIDmode;
-- 
2.3.0



Re: [RFC, rtl optimization]: Better heuristics for estimate_reg_pressure_cost in presence of call for LICM.

2015-12-16 Thread Bernd Schmidt

On 12/16/2015 12:54 PM, Ajit Kumar Agarwal wrote:


The estimate on target_clobbered_registers based on the call_used arrays is not 
correct. This is the worst case
heuristics on the estimate on target_clobbered_registers. This disables many of 
the loop Invariant code motion
opportunities in presence of call. Instead of considering the spill cost we 
consider only the target_reg_cost
aggressively.


Right now we are in stage 3, and not accepting anything but bug fixes. 
Please resubmit once stage 1 reopens.




diff --git a/gcc/cfgloopanal.c b/gcc/cfgloopanal.c
--- a/gcc/cfgloopanal.c
+++ b/gcc/cfgloopanal.c


Patch submissions should include ChangeLog entries.


@@ -373,15 +373,23 @@ estimate_reg_pressure_cost (unsigned n_new, unsigned 
n_old, bool speed,

/* If there is a call in the loop body, the call-clobbered registers
   are not available for loop invariants.  */
+
if (call_p)
  available_regs = available_regs - target_clobbered_regs;
-
+


Spurious whitespace change.


/* If we have enough registers, we should use them and not restrict
   the transformations unnecessarily.  */
if (regs_needed + target_res_regs <= available_regs)
  return 0;

-  if (regs_needed <= available_regs)
+  /* Estimation of target_clobbered register is based on the call_used
+ arrays which is not the right estimate for the clobbered register
+ used in called function. Instead of considering the spill cost we
+ consider only the reg_cost aggressively.  */
+
+  if ((regs_needed <= available_regs)
+  || (call_p && (regs_needed <=
+  (available_regs + target_clobbered_regs


Formatting issues - unnecessary parens, and bad line split. This would 
be written as


if (regs_needed <= available_regs
|| (call_p
 && regs_needed <= available_regs + target_clobbered_regs))

But then, I think the whole thing could be simplified by just keeping 
the original available_regs value around (before subtracting 
target_clobbered_regs) and have just one comparison here.


Once again I find myself unsure whether this change is actually better 
or just different. The existing code kind of makes sense to me - if a 
reg is call-clobbered and there's a call in the loop, it can't hold an 
invariant. In any case, this is a question for stage 1.



Bernd


Re: [PATCH] S/390: Allow to use r1 to r4 as literal pool base.

2015-12-16 Thread Ulrich Weigand
Dominik Vogt wrote:
> On Mon, Dec 14, 2015 at 04:08:32PM +0100, Ulrich Weigand wrote:
> > I don't think that r1 is actually safe here.  Note that it may be used
> > (unconditionally) as temp register in s390_emit_prologue in certain cases;
> > the upcoming split-stack code will also need to use r1 in some cases.
> 
> How about the attached patch?  It also allows to use r0 as the
> temp register if possible (needs more testing).

This doesn't look safe either.  In particular:

- you use cfun_save_high_fprs_p at a place where its value might not yet
  have been determined (when calling s390_get_prologue_temp_regno from
  s390_init_frame_layout *before* the s390_register_info/s390_frame_info
  calls)

- r0 might hold the incoming static chain value, in which case it cannot
  be used as temp register

> If that's too
> much effort, I'm fine with limiting the original patch to r4 to
> r2.

That seems preferable to me.

> > r2 through r4 should be fine.  [ Not sure if there will be many (any?) cases
> > where one of those is unused but r5 isn't, however. ]
> 
> This can happen if the function only uses register pairs
> (__int128).  Actually I'm not sure whether r2 and r4 are valid
> candidates.

Huh?  Why not?

Bye,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU/Linux compilers and toolchain
  ulrich.weig...@de.ibm.com



Re: [PATCH PR68906]

2015-12-16 Thread Richard Biener
On Wed, Dec 16, 2015 at 1:14 PM, Yuri Rumyantsev  wrote:
> Hi All,
>
> Here is a simple patch which cures the issue with outer-loop unswitching
> - added invocation of number_of_latch_executions() to reject
> unswitching for non-iterated loops.
>
> Bootstrapping and regression testing did not show any new failures.
> Is it OK for trunk?

No, that looks like just papering over the issue.

The issue (with the 2nd testcase at least) is that single_exit () accepts
an exit from the inner loop.

Index: gcc/tree-ssa-loop-unswitch.c
===
--- gcc/tree-ssa-loop-unswitch.c(revision 231686)
+++ gcc/tree-ssa-loop-unswitch.c(working copy)
@@ -431,7 +431,7 @@ tree_unswitch_outer_loop (struct loop *l
 return false;
   /* Accept loops with single exit only.  */
   exit = single_exit (loop);
-  if (!exit)
+  if (!exit || exit->src->loop_father != loop)
 return false;
   /* Check that phi argument of exit edge is not defined inside loop.  */
   if (!check_exit_phi (loop))

fixes the runtime testcase for me (not suitable for the testsuite due
to the infinite
looping though).

Can you please bootstrap/test the above with your testcase?  The above patch is
ok if it passes testing (no time myself right now)

Thanks,
Richard.

> ChangeLog:
>
> 2015-12-16  Yuri Rumyantsev  
>
> PR tree-optimization/68906
> * tree-ssa-loop-unswitch.c: Include a couple of header files.
> (tree_unswitch_outer_loop): Use number_of_latch_executions
> to reject non-iterated loops.
>
> gcc/testsuite/ChangeLog
> * gcc.dg/torture/pr68906.c: New test.


RE: [PATCH][AArch64] Add TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS

2015-12-16 Thread Wilco Dijkstra
James Greenhalgh wrote:
> On Tue, Dec 15, 2015 at 10:54:49AM +, Wilco Dijkstra wrote:
> > ping
> >
> > > -Original Message-
> > > From: Wilco Dijkstra [mailto:wilco.dijks...@arm.com]
> > > Sent: 06 November 2015 20:06
> > > To: 'gcc-patches@gcc.gnu.org'
> > > Subject: [PATCH][AArch64] Add TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS
> > >
> > > This patch adds support for the TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS
> > > hook. When the cost of GENERAL_REGS and FP_REGS is identical, the register
> > > allocator always uses ALL_REGS even when it has a much higher cost. The
> > > hook changes the class to either FP_REGS or GENERAL_REGS depending on the
> > > mode of the register. This results in better register allocation overall,
> > > fewer spills and reduced codesize - particularly in SPEC2006 gamess.
> > >
> > > GCC regression passes with several minor fixes.
> > >
> > > OK for commit?
> > >
> > > ChangeLog:
> > > 2015-11-06  Wilco Dijkstra  
> > >
> > >   * gcc/config/aarch64/aarch64.c
> > >   (TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS): New define.
> > >   (aarch64_ira_change_pseudo_allocno_class): New function.
> > >   * gcc/testsuite/gcc.target/aarch64/cvtf_1.c: Build with -O2.
> > >   * gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c
> > >   (test_corners_sisd_di): Improve force to SIMD register.
> > >   (test_corners_sisd_si): Likewise.
> > >   * gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c: Build with -O2.
> > >   * gcc/testsuite/gcc.target/aarch64/vect-ld1r-compile-fp.c:
> > >   Remove scan-assembler check for ldr.
> 
> Drop the gcc/ from the ChangeLog.
> 
> > > --
> > >  gcc/config/aarch64/aarch64.c   | 22 
> > > ++
> > >  gcc/testsuite/gcc.target/aarch64/cvtf_1.c  |  2 +-
> > >  gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c  |  4 ++--
> > >  gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c |  2 +-
> > >  .../gcc.target/aarch64/vect-ld1r-compile-fp.c  |  1 -
> 
> These testsuite changes concern me a bit, and you don't mention them beyond
> saying they are minor fixes...

Well any changes to register allocator preferencing would cause fallout in 
tests that
are assuming which register is allocated, especially if they use nasty inline 
assembler
hacks to do so...

> > > diff --git a/gcc/testsuite/gcc.target/aarch64/cvtf_1.c 
> > > b/gcc/testsuite/gcc.target/aarch64/cvtf_1.c
> > > index 5f2ff81..96501db 100644
> > > --- a/gcc/testsuite/gcc.target/aarch64/cvtf_1.c
> > > +++ b/gcc/testsuite/gcc.target/aarch64/cvtf_1.c
> > > @@ -1,5 +1,5 @@
> > >  /* { dg-do run } */
> > > -/* { dg-options "-save-temps -fno-inline -O1" } */
> > > +/* { dg-options "-save-temps -fno-inline -O2" } */
> 
> This one says we have a code-gen regression at -O1 ?

It avoids a regalloc bug - see below.

> > >  #define FCVTDEF(ftype,itype) \
> > >  void \
> > > diff --git a/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c 
> > > b/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c
> > > index 363f554..8465c89 100644
> > > --- a/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c
> > > +++ b/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c
> > > @@ -186,9 +186,9 @@ test_corners_sisd_di (Int64x1 b)
> > >  {
> > >force_simd_di (b);
> > >b = b >> 63;
> > > +  force_simd_di (b);
> > >b = b >> 0;
> > >b += b >> 65; /* { dg-warning "right shift count >= width of type" } */
> > > -  force_simd_di (b);
> 
> This one I don't understand, but seems to say that we've decided to move
> b out of FP_REGS after getting it in there for b = b << 63; ? So this is
> another register allocator regression?

No, basically the register allocator is now making better decisions as to where 
to
allocate integer variables. It will only allocate them to FP registers if they 
are primarily
used by other FP operations. The force_simd_di inline assembler tries to mimic 
FP uses,
and if there are enough of them at the right places then everything works as 
expected.
If however you do 3 consecutive integer operations then the allocator will now 
correctly
prefer to allocate them to the integer registers (while previously it wouldn't, 
which is
inefficient).

> > > diff --git a/gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c 
> > > b/gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c
> > > index a49db3e..c5a9c52 100644
> > > --- a/gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c
> > > +++ b/gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c
> > > @@ -1,6 +1,6 @@
> > >  /* Test vdup_lane intrinsics work correctly.  */
> > >  /* { dg-do run } */
> > > -/* { dg-options "-O1 --save-temps" } */
> > > +/* { dg-options "-O2 --save-temps" } */
> 
> Another -O1 regression ?

No, it's triggering a bug in the -O1 register preferencing that causes 
incorrect preferences to be
selected despite the costs being right. The cost calculation with -O1 for e.g. 
wrap_vdupb_lane_s8_0() in vdup_lane_2.c:

Pass 0 for finding pseudo/allocno costs

r79: preferred FP_REGS, alternative GENERAL_REGS, allocno GENERAL_REGS
 

Re: [PATCH, 4/16] Implement -foffload-alias

2015-12-16 Thread Richard Biener
On Mon, 14 Dec 2015, Tom de Vries wrote:

> On 14/12/15 14:26, Richard Biener wrote:
> > On Sun, 13 Dec 2015, Tom de Vries wrote:
> > 
> > > On 11/12/15 14:00, Richard Biener wrote:
> > > > On Fri, 11 Dec 2015, Tom de Vries wrote:
> > > > 
> > > > > On 13/11/15 12:39, Jakub Jelinek wrote:
> > > > > > We simply have some compiler internal interface between the caller
> > > > > > and
> > > > > > callee of the outlined regions, each interface in between those has
> > > > > > its own structure type used to communicate the info;
> > > > > > we can attach attributes on the fields, or some flags to indicate
> > > > > > some
> > > > > > properties interesting from aliasing POV.  We don't really need to
> > > > > > perform
> > > > > > full IPA-PTA, perhaps it would be enough to a) record somewhere in
> > > > > > cgraph
> > > > > > the relationship in between such callers and callees (for offloading
> > > > > > regions
> > > > > > we already have "omp target entrypoint" attribute on the callee and
> > > > > > a
> > > > > > singler caller), tell LTO if possible not to split those into
> > > > > > different
> > > > > > partitions if easily possible, and then just for these pairs perform
> > > > > > aliasing/points-to analysis in the caller and the result record
> > > > > > using
> > > > > > cliques/special attributes/whatever to the callee side, so that the
> > > > > > callee
> > > > > > (outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias
> > > > > > analysis.
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > This work-in-progress patch allows me to use IPA PTA information in
> > > > > the
> > > > > kernels pass group.
> > > > > 
> > > > > Since:
> > > > > -  I'm running IPA PTA before ealias, and IPA PTA does not interpret
> > > > >  restrict, and
> > > > > - compute_may_alias doesn't run if IPA PTA information is present
> > > > > I needed to convince ealias to do the restrict clique/base annotation.
> > > > > 
> > > > > It would be more logical to fit IPA PTA after ealias, but one is an
> > > > > IPA
> > > > > pass,
> > > > > the other a regular one-function pass, so I would have to split the
> > > > > containing
> > > > > pass groups pass_all_early_optimizations and
> > > > > pass_local_optimization_passes.
> > > > > I'll give that a try now.
> > > > > 
> > > 
> > > I've tried this approach, but realized that this changes the order in
> > > which
> > > non-openacc functions are processed in the compiler, so I've abandoned
> > > this
> > > idea.
> > > 
> > > > > Any comments?
> > > > 
> > > > I don't think you want to run IPA PTA before early
> > > > optimizations, it (and ealias) rely on some initial cleanup to
> > > > do anything meaningful with well-spent ressources.
> > > > 
> > > > The local PTA "hack" also looks more like a waste of resources, but well
> > > > ... teaching IPA PTA to honor restrict might be an impossible task
> > > > though I didn't think much about it other than handling it only for
> > > > nonlocal_p functions (for others we should see all incoming args
> > > > if IPA PTA works optimally).  The restrict tags will leak all over
> > > > the place of course and in the end no meaningful cliques may remain.
> > > > 
> > > 
> > > This patch:
> > > - moves the kernels pass group to the first position in the pass list
> > >after ealias where we're back in ipa mode
> > > - inserts an new ipa pass to contain the gimple pass group called
> > >pass_oacc_ipa
> > > - inserts a version of ipa-pta before the pass group.
> > 
> > In principle I like this a lot, but
> > 
> > +  NEXT_PASS (pass_ipa_pta_oacc_kernels);
> > +  NEXT_PASS (pass_oacc_ipa);
> > +  PUSH_INSERT_PASSES_WITHIN (pass_oacc_ipa)
> > 
> > I think you can put pass_ipa_pta_oacc_kernels into the pass_oacc_ipa
> > group and thus just "clone" ipa_pta?
> 
> Done. But using a clone means using the same gate function, and that means
> that this pass_ipa_pta instance no longer runs by default for openacc.
> 
> I've added enabling-by-default of fipa-pta for fopenacc in
> default_options_optimization to fix that.

Hmm, but that enables both IPA PTA passes then?  I suppose that's ok,
and if not enabling the "late" IPA PTA you'd want to re-set 
gimple_df->ipa_pta.

> > sub-passes of IPA passes can
> > be both ipa passes and non-ipa passes.
> 
> Right. It does mean that I need yet another pass (pass_ipa_oacc_kernels) to do
> the IPA/non-IPA transition at pass/sub-pass boundary:
> ...
>   NEXT_PASS (pass_ipa_oacc);
>   PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc)
>   NEXT_PASS (pass_ipa_pta);
>   NEXT_PASS (pass_ipa_oacc_kernels);
>   PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc_kernels)
>  /* out-of-ipa */
>  NEXT_PASS (pass_oacc_kernels);
>  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> ...
> 
> OK for stage3 if bootstrap and reg-test succeeds?

Ok.

Richard.


Re: [RFC, rtl optimization]: Better heuristics for estimate_reg_pressure_cost in presence of call for LICM.

2015-12-16 Thread Bernd Schmidt

On 12/16/2015 12:54 PM, Ajit Kumar Agarwal wrote:

/* If there is a call in the loop body, the call-clobbered registers
   are not available for loop invariants.  */
+
if (call_p)
  available_regs = available_regs - target_clobbered_regs;
-
+
/* If we have enough registers, we should use them and not restrict
   the transformations unnecessarily.  */
if (regs_needed + target_res_regs <= available_regs)
  return 0;


Just a thought, one thing that might make sense here is counting some of 
the target_res_regs among the clobbered ones. This would become


 int res_regs = target_res_regs;
 if (call_p)
   {
 available_regs = available_regs - target_clobbered_regs;
 res_regs /= 2;
   }

 /* If we have enough registers, we should use them and not restrict
the transformations unnecessarily.  */
 if (regs_needed + res_regs <= available_regs)
   return 0;

It's all a bit crude, before and after, but such a change might be 
justifiable.



Bernd


Re: [0/7] Type promotion pass and elimination of zext/sext

2015-12-16 Thread Richard Biener
On Thu, Dec 10, 2015 at 1:27 AM, Kugan
 wrote:
> Hi Richard,
>
> Thanks for the reviews.
>
> I think since we have some unresolved issues here, it is best to aim for
> the next stage1. I however would like any feedback so that I can
> continue to improve this.

Yeah, sorry, I've been distracted lately and am not sure I'll get to
the patch before the
Christmas break.

> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01063.html is also related
> to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67714. I don't think
> there is any agreement on this. Or is there any better place to fix this?

I don't know enough in this area to suggest anything.

Richard.

> Thanks,
> Kugan


[PATCH] OpenACC documentation for libgomp

2015-12-16 Thread James Norris

Hi,

Attached is the patch to add OpenACC documentation for libgomp.

Ok to commit to trunk?

Thanks!
Jim
Index: libgomp.texi
===
--- libgomp.texi	(revision 231662)
+++ libgomp.texi	(working copy)
@@ -94,10 +94,25 @@
 @comment  better formatting.
 @comment
 @menu
+* Enabling OpenACC::   How to enable OpenACC for your
+   applications.
+* OpenACC Runtime Library Routines::
+   The OpenACC runtime application
+   programming interface.
+* OpenACC Environment Variables::
+   Influencing OpenACC runtime behavior with
+   environment variables.
+* CUDA Streams Usage:: Notes on the implementation of
+   asynchronous operations.
+* OpenACC Library Interoperability::
+   OpenACC library interoperability with the
+   NVIDIA CUBLAS library.
 * Enabling OpenMP::How to enable OpenMP for your applications.
-* Runtime Library Routines::   The OpenMP runtime application programming 
+* OpenMP Runtime Library Routines::
+   The OpenMP runtime application programming 
interface.
-* Environment Variables::  Influencing runtime behavior with environment 
+* OpenMP Environment Variables::
+   Influencing runtime behavior with environment 
variables.
 * The libgomp ABI::Notes on the external ABI presented by libgomp.
 * Reporting Bugs:: How to report bugs in the GNU Offloading and
@@ -113,6 +128,643 @@
 
 
 @c -
+@c Enabling OpenACC
+@c -
+
+@node Enabling OpenACC
+@chapter Enabling OpenACC
+
+To activate the OpenACC extensions for C/C++ and Fortran, the compile-time 
+flag @command{-fopenacc} must be specified.  This enables the OpenACC directive
+@code{#pragma acc} in C/C++ and @code{!$acc} directives in free form,
+@code{c$acc}, @code{*$acc} and @code{!$acc} directives in fixed form,
+@code{!$} conditional compilation sentinels in free form and @code{c$},
+@code{*$} and @code{!$} sentinels in fixed form, for Fortran.  The flag also
+arranges for automatic linking of the OpenACC runtime library 
+(@ref{OpenACC Runtime Library Routines}).
+
+A complete description of all OpenACC directives accepted may be found in 
+the @uref{http://www.openacc.org/, OpenACC Application Programming
+Interface} manual, version 2.0.
+
+Note that this is an experimental feature, incomplete, and subject to
+change in future versions of GCC.  See
+@uref{https://gcc.gnu.org/wiki/OpenACC} for more information.
+
+
+
+@c -
+@c OpenACC Runtime Library Routines
+@c -
+
+@node OpenACC Runtime Library Routines
+@chapter OpenACC Runtime Library Routines
+
+The runtime routines described here are defined by section 3 of the OpenACC
+specifications in version 2.0.
+They have C linkage, and do not throw exceptions.
+Generally, they are available only for the host, with the exception of
+@code{acc_on_device}, which is available for both the host and the
+acceleration device.
+
+@menu
+* acc_get_num_devices:: Get number of devices for the given device type
+* acc_set_device_type::
+* acc_get_device_type::
+* acc_set_device_num::
+* acc_get_device_num::
+* acc_init::
+* acc_shutdown::
+* acc_on_device::   Whether executing on a particular device
+* acc_malloc::
+* acc_free::
+* acc_copyin::
+* acc_present_or_copyin::
+* acc_create::
+* acc_present_or_create::
+* acc_copyout::
+* acc_delete::
+* acc_update_device::
+* acc_update_self::
+* acc_map_data::
+* acc_unmap_data::
+* acc_deviceptr::
+* acc_hostptr::
+* acc_is_present::
+* acc_memcpy_to_device::
+* acc_memcpy_from_device::
+
+API routines for target platforms.
+
+* acc_get_current_cuda_device::
+* acc_get_current_cuda_context::
+* acc_get_cuda_stream::
+* acc_set_cuda_stream::
+@end menu
+
+
+
+@node acc_get_num_devices
+@section @code{acc_get_num_devices} -- Get number of devices for given device type
+@table @asis
+@item @emph{Description}
+This routine returns a value indicating the
+number of devices available for the given device type.  It determines
+the number of devices in a @emph{passive} manner.  In other words, it
+does not alter the state within the runtime environment aside from
+possibly initializing an uninitialized device.  This aspect allows
+the routine to be called without concern for altering the interaction
+with an attached accelerator device.
+
+@item @emph{Reference}:
+@uref{http://www.openacc.org/, OpenACC specification v2.0}, section
+3

Re: [PATCH PR68542]

2015-12-16 Thread Richard Biener
On Fri, Dec 11, 2015 at 3:03 PM, Yuri Rumyantsev  wrote:
> Richard.
> Thanks for your review.
> I re-designed the fix for the assert by adding additional checks for vector
> comparison with boolean result to fold_binary_op_with_conditional_arg
> and removed the early exit in combine_cond_expr_cond.
> Unfortunately, I am not able to provide you with a test case since it is
> part of my second patch, related to the back-end patch which I sent earlier
> (12-08).
>
> Bootstrapping and regression testing did not show any new failures.
> Is it OK for trunk?

+  else if (TREE_CODE (type) == VECTOR_TYPE)
 {
   tree testtype = TREE_TYPE (cond);
   test = cond;
   true_value = constant_boolean_node (true, testtype);
   false_value = constant_boolean_node (false, testtype);
 }
+  else
+{
+  test = cond;
+  cond_type = type;
+  true_value = boolean_true_node;
+  false_value = boolean_false_node;
+}

So this is, say, vec1 != vec2 with scalar vs. vector result.  If we have
scalar result and thus, say, scalar + vec1 != vec2.  I believe rather
than doing the above (not seeing how this would not generate wrong
code eventually) we should simply detect the case of mixing vector
and scalar types and bail out.  At least without some comments
your patch makes the function even more difficult to understand than
it is already.

@@ -3448,10 +3448,17 @@ verify_gimple_comparison (tree type, tree op0, tree op1)
   if (TREE_CODE (op0_type) == VECTOR_TYPE
  || TREE_CODE (op1_type) == VECTOR_TYPE)
 {
-  error ("vector comparison returning a boolean");
-  debug_generic_expr (op0_type);
-  debug_generic_expr (op1_type);
-  return true;
+ /* Allow vector comparison returning boolean if operand types
+are boolean or integral and CODE is EQ/NE.  */
+ if (code != EQ_EXPR && code != NE_EXPR
+ && !VECTOR_BOOLEAN_TYPE_P (op0_type)
+ && !VECTOR_INTEGER_TYPE_P (op0_type))
+   {
+ error ("type mismatch for vector comparison returning a boolean");
+ debug_generic_expr (op0_type);
+ debug_generic_expr (op1_type);
+ return true;
+   }
 }
 }
   /* Or a boolean vector type with the same element count

as said before please merge the cascaded if()s.  Better wording for
the error is "unsupported operation or type for vector comparison
returning a boolean"

Otherwise the patch looks sensible to me though it shows that overloading of
EQ/NE_EXPR for scalar result and vector operands might have some more unexpected
fallout (which is why I originally preferred the view-convert to large
integer type variant).

Thanks,
Richard.


> ChangeLog:
> 2015-12-11  Yuri Rumyantsev  
>
> PR middle-end/68542
> * fold-const.c (fold_binary_op_with_conditional_arg): Add checks of
> vector comparison with boolean result to avoid ICE.
> (fold_relational_const): Add handling of vector
> comparison with boolean result.
> * tree-cfg.c (verify_gimple_comparison): Add argument CODE, allow
> comparison of vector operands with boolean result for EQ/NE only.
> (verify_gimple_assign_binary): Adjust call for verify_gimple_comparison.
> (verify_gimple_cond): Likewise.
> * tree-ssa-forwprop.c (combine_cond_expr_cond): Do not perform
> combining for non-compatible vector types.
> * tree-vrp.c (register_edge_assert_for): VRP does not track ranges for
> vector types.
>
> 2015-12-10 16:36 GMT+03:00 Richard Biener :
>> On Fri, Dec 4, 2015 at 4:07 PM, Yuri Rumyantsev  wrote:
>>> Hi Richard.
>>>
>>> Thanks a lot for your review.
>>> Below are my answers.
>>>
>>> You asked why I inserted additional check to
>>> ++ b/gcc/tree-ssa-forwprop.c
>>> @@ -373,6 +373,11 @@ combine_cond_expr_cond (gimple *stmt, enum
>>> tree_code code, tree type,
>>>
>>>gcc_assert (TREE_CODE_CLASS (code) == tcc_comparison);
>>>
>>> +  /* Do not perform combining it types are not compatible.  */
>>> +  if (TREE_CODE (TREE_TYPE (op0)) == VECTOR_TYPE
>>> +  && !tree_int_cst_equal (TYPE_SIZE (type), TYPE_SIZE (TREE_TYPE 
>>> (op0
>>> +return NULL_TREE;
>>> +
>>>
>>> again, how does this happen?
>>>
>>> This is because without it I've got assert in fold_convert_loc
>>>   gcc_assert (TREE_CODE (orig) == VECTOR_TYPE
>>>  && tree_int_cst_equal (TYPE_SIZE (type), TYPE_SIZE (orig)));
>>>
>>> since it tries to convert vector of bool to scalar bool.
>>> Here is essential part of call-stack:
>>>
>>> #0  internal_error (gmsgid=0x1e48397 "in %s, at %s:%d")
>>> at ../../gcc/diagnostic.c:1259
>>> #1  0x01743ada in fancy_abort (
>>> file=0x1847fc3 "../../gcc/fold-const.c", line=2217,
>>> function=0x184b9d0 >> tree_node*)::__FUNCTION__> "fold_convert_loc") at
>>> ../../gcc/diagnostic.c:1332
>>> #2  0x009c8330 in fold_convert_loc (loc=0, type=0x718a9d20,
>>> arg=0x71a7f488) at ../../gcc/fold-const.c:2216
>>> #3  0x009f003f in fold_ternary_loc (loc=0, code=VEC_COND_EXPR,
>>> 

[PATCH] Fix PR68861

2015-12-16 Thread Richard Biener

The following fixes the SLP miscompile in PR68861 which happens because
we didn't think of stmts appering multiple times in a SLP node when
doing the operand swapping support.

Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.

Richard.

2015-12-16  Richard Biener  

PR tree-optimization/68861
* tree-vect-slp.c (vect_build_slp_tree): Properly handle
duplicate stmts when applying swapping to stmts.

Index: gcc/tree-vect-slp.c
===
*** gcc/tree-vect-slp.c (revision 231675)
--- gcc/tree-vect-slp.c (working copy)
*** vect_build_slp_tree (vec_info *vinfo,
*** 1049,1059 
 if we end up building the operand from scalars as
 we'll continue to process swapped operand two.  */
  for (j = 0; j < group_size; ++j)
!   if (!matches[j])
  {
gimple *stmt = SLP_TREE_SCALAR_STMTS (*node)[j];
!   swap_ssa_operands (stmt, gimple_assign_rhs1_ptr (stmt),
!  gimple_assign_rhs2_ptr (stmt));
  }
  
  /* If we have all children of child built up from scalars then
--- 1049,1077 
 if we end up building the operand from scalars as
 we'll continue to process swapped operand two.  */
  for (j = 0; j < group_size; ++j)
!   {
! gimple *stmt = SLP_TREE_SCALAR_STMTS (*node)[j];
! gimple_set_plf (stmt, GF_PLF_1, false);
!   }
! for (j = 0; j < group_size; ++j)
!   {
! gimple *stmt = SLP_TREE_SCALAR_STMTS (*node)[j];
! if (!matches[j])
!   {
! /* Avoid swapping operands twice.  */
! if (gimple_plf (stmt, GF_PLF_1))
!   continue;
! swap_ssa_operands (stmt, gimple_assign_rhs1_ptr (stmt),
!gimple_assign_rhs2_ptr (stmt));
! gimple_set_plf (stmt, GF_PLF_1, true);
!   }
!   }
! /* Verify we swap all duplicates or none.  */
! if (flag_checking)
!   for (j = 0; j < group_size; ++j)
  {
gimple *stmt = SLP_TREE_SCALAR_STMTS (*node)[j];
!   gcc_assert (gimple_plf (stmt, GF_PLF_1) == ! matches[j]);
  }
  
  /* If we have all children of child built up from scalars then



Re: [PATCH] Fix Fortran deviceptr clause.

2015-12-16 Thread James Norris

Hi,

This is an update of my previous patch. Cesar (thanks!)
pointed out some issues with the original patch that
have now been addressed.

Regtested on x86_64

OK for trunk?

Thanks!
Jim


diff --git a/gcc/fortran/openmp.c b/gcc/fortran/openmp.c
index 276f2f1..9350dc4 100644
--- a/gcc/fortran/openmp.c
+++ b/gcc/fortran/openmp.c
@@ -812,19 +812,10 @@ gfc_match_omp_clauses (gfc_omp_clauses **cp, uint64_t mask,
    OMP_MAP_ALLOC))
 	continue;
   if ((mask & OMP_CLAUSE_DEVICEPTR)
-	  && gfc_match ("deviceptr ( ") == MATCH_YES)
-	{
-	  gfc_omp_namelist **list = &c->lists[OMP_LIST_MAP];
-	  gfc_omp_namelist **head = NULL;
-	  if (gfc_match_omp_variable_list ("", list, true, NULL, &head, false)
-	  == MATCH_YES)
-	{
-	  gfc_omp_namelist *n;
-	  for (n = *head; n; n = n->next)
-		n->u.map_op = OMP_MAP_FORCE_DEVICEPTR;
-	  continue;
-	}
-	}
+	  && gfc_match ("deviceptr ( ") == MATCH_YES
+	  && gfc_match_omp_map_clause (&c->lists[OMP_LIST_MAP],
+   OMP_MAP_FORCE_DEVICEPTR))
+	continue;
   if ((mask & OMP_CLAUSE_USE_DEVICE)
 	  && gfc_match_omp_variable_list ("use_device (",
 	  &c->lists[OMP_LIST_USE_DEVICE], true)
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index db7cab3..98982c3 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -49,6 +49,51 @@ find_pset (int pos, size_t mapnum, unsigned short *kinds)
   return kind == GOMP_MAP_TO_PSET;
 }
 
+/* Handle the mapping pairs that are presented when a
+   deviceptr clause is used with Fortran.  */
+
+static void
+handle_ftn_pointers (size_t mapnum, void **hostaddrs, size_t *sizes,
+		 unsigned short *kinds)
+{
+  int i;
+
+  for (i = 0; i < mapnum; i++)
+{
+  unsigned short kind1 = kinds[i] & 0xff;
+
+  /* Handle Fortran deviceptr clause.  */
+  if (kind1 == GOMP_MAP_FORCE_DEVICEPTR)
+	{
+	  unsigned short kind2;
+
+	  if (i < (signed)mapnum - 1)
+	kind2 = kinds[i + 1] & 0xff;
+	  else
+	kind2 = 0x;
+
+	  if (sizes[i] == sizeof (void *))
+	continue;
+
+	  /* At this point, we're dealing with a Fortran deviceptr.
+	 If the next element is not what we're expecting, then
+	 this is an instance of where the deviceptr variable was
+	 not used within the region and the pointer was removed
+	 by the gimplifier.  */
+	  if (kind2 == GOMP_MAP_POINTER
+	  && sizes[i + 1] == 0
+	  && hostaddrs[i] == *(void **)hostaddrs[i + 1])
+	{
+	  kinds[i+1] = kinds[i];
+	  sizes[i+1] = sizeof (void *);
+	}
+
+	  /* Invalidate the entry.  */
+	  hostaddrs[i] = NULL;
+	}
+}
+}
+
 static void goacc_wait (int async, int num_waits, va_list *ap);
 
 
@@ -88,6 +133,8 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
   thr = goacc_thread ();
   acc_dev = thr->dev;
 
+  handle_ftn_pointers (mapnum, hostaddrs, sizes, kinds);
+
   /* Host fallback if "if" clause is false or if the current device is set to
  the host.  */
   if (host_fallback)
@@ -172,8 +219,13 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
 
   devaddrs = gomp_alloca (sizeof (void *) * mapnum);
   for (i = 0; i < mapnum; i++)
-devaddrs[i] = (void *) (tgt->list[i].key->tgt->tgt_start
-			+ tgt->list[i].key->tgt_offset);
+{
+  if (tgt->list[i].key != NULL)
+	devaddrs[i] = (void *) (tgt->list[i].key->tgt->tgt_start
++ tgt->list[i].key->tgt_offset);
+  else
+	devaddrs[i] = NULL;
+}
 
   acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
 			  async, dims, tgt);
@@ -224,6 +276,8 @@ GOACC_data_start (int device, size_t mapnum,
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
+  handle_ftn_pointers (mapnum, hostaddrs, sizes, kinds);
+
   /* Host fallback or 'do nothing'.  */
   if ((acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
   || host_fallback)
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/declare-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/declare-1.f90
index f717d1b..2d4b707 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/declare-1.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/declare-1.f90
@@ -1,29 +1,22 @@
 ! { dg-do run  { target openacc_nvidia_accel_selected } }
 
+! Tests to exercise the declare directive along with
+! the clauses: copy
+!  copyin
+!  copyout
+!  create
+!  present
+!  present_or_copy
+!  present_or_copyin
+!  present_or_copyout
+!  present_or_create
+
 module vars
   implicit none
   integer z
   !$acc declare create (z)
 end module vars
 
-subroutine subr6 (a, d)
-  implicit none
-  integer, parameter :: N = 8
-  integer :: i
-  integer :: a(N)
-  !$acc declare deviceptr (a)
-  integer :: d(N)
-
-  i = 0
-
-  !$acc parallel copy (d)
-do i = 1, N
-  d(i) = a(i) + a(i)
-end do
-  !$acc end parallel
-
-end subroutine
-
 subroutine subr5 (a, b, c, d)
   implicit none
   integer, parameter :: N = 8
@@ 

Re: [RFC] Request for comments on ivopts patch

2015-12-16 Thread Richard Biener
On Tue, Dec 15, 2015 at 12:06 AM, Steve Ellcey  wrote:
> On Mon, 2015-12-14 at 09:57 +0100, Richard Biener wrote:
>
>> I don't know enough to assess the effect of this but
>>
>>  1) not all archs can do auto-incdec so either the comment is misleading
>> or the test should probably be amended
>>  2) I wonder why with the comment ("during the loop") you exclude 
>> IP_NORMAL/END
>>
>> that said, the comment needs to explain the situation better.
>>
>> Of course all such patches need some code-gen effect investigation
>> on more than one arch.
>>
>> [I wonder if a IV cost adjust target hook makes sense at some point]
>>
>> Richard.
>
> I like the idea of a target hook to modify IV costs.  What do you think
> about this?  I had to move some structures from tree-ssa-loop-ivopts.c
> to tree-ssa-loop-ivopts.h in order to give a target hooks access to
> information on the IV candidates.

IMHO it's better to not have empty default implementations but do

if (targetm.adjust_iv_cand_cost)
  targetm.adjust_iv_cand_cost (...);

which saves an unconditional indirect call on most targets.

Generally I think we should prefer target independent heuristics if possible
or heuristics derived from target cost queries.  So this kind of hook
says we've given up (and it makes it too easy to change things just for
a single target).

Anyway, this is for next stage1 of course so there's some time to come up
with good ideas and do testing on more than one target.

Richard.

> Steve Ellcey
> sell...@imgtec.com
>
>
> 2015-12-14  Steve Ellcey  
>
> * doc/tm.texi.in (TARGET_ADJUST_IV_CAND_COST): New target function.
> * target.def (adjust_iv_cand_cost): New target function.
> * target.h (struct iv_cand): New forward declaration.
> * targhooks.c (default_adjust_iv_cand_cost): New default function.
> * targhooks.h (default_adjust_iv_cand_cost): Ditto.
> * tree-ssa-loop-ivopts.c (struct iv, enum iv_position, struct iv_cand)
> Moved to tree-ssa-loop-ivopts.h.
> (determine_iv_cost): Add call to targetm.adjust_iv_cand_cost.
> * tree-ssa-loop-ivopts.h (struct iv, enum iv_position, struct iv_cand)
> Copied here from tree-ssa-loop-ivopts.h.
>
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index a0a0a81..1ad4c2d 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -8221,6 +8221,8 @@ and the associated definitions of those functions.
>
>  @hook TARGET_OFFLOAD_OPTIONS
>
> +@hook TARGET_ADJUST_IV_CAND_COST
> +
>  @defmac TARGET_SUPPORTS_WIDE_INT
>
>  On older ports, large integers are stored in @code{CONST_DOUBLE} rtl
> diff --git a/gcc/target.def b/gcc/target.def
> index d754337..6bdcfcc 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -5846,6 +5846,12 @@ DEFHOOK
>   void, (tree *hold, tree *clear, tree *update),
>   default_atomic_assign_expand_fenv)
>
> +DEFHOOK
> +(adjust_iv_cand_cost,
> +"Allow target to modify the cost of a possible induction variable.",
> +void, (struct iv_cand *cand),
> + default_adjust_iv_cand_cost)
> +
>  /* Leave the boolean fields at the end.  */
>
>  /* True if we can create zeroed data by switching to a BSS section
> diff --git a/gcc/target.h b/gcc/target.h
> index ffc4d6a..6f55575 100644
> --- a/gcc/target.h
> +++ b/gcc/target.h
> @@ -139,6 +139,9 @@ struct ao_ref;
>  /* This is defined in tree-vectorizer.h.  */
>  struct _stmt_vec_info;
>
> +/* This is defined in tree-ivopts.h. */
> +struct iv_cand;
> +
>  /* These are defined in tree-vect-stmts.c.  */
>  extern tree stmt_vectype (struct _stmt_vec_info *);
>  extern bool stmt_in_inner_loop_p (struct _stmt_vec_info *);
> diff --git a/gcc/targhooks.c b/gcc/targhooks.c
> index dcf0863..0d0bbfc 100644
> --- a/gcc/targhooks.c
> +++ b/gcc/targhooks.c
> @@ -1961,4 +1961,10 @@ default_optab_supported_p (int, machine_mode, 
> machine_mode, optimization_type)
>return true;
>  }
>
> +/* Default implementation of TARGET_ADJUST_IV_CAND_COST.  */
> +void
> +default_adjust_iv_cand_cost (struct iv_cand *)
> +{
> +}
> +
>  #include "gt-targhooks.h"
> diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> index 47b5cfc..dd0481d 100644
> --- a/gcc/targhooks.h
> +++ b/gcc/targhooks.h
> @@ -253,4 +253,6 @@ extern void default_setup_incoming_vararg_bounds 
> (cumulative_args_t ca ATTRIBUTE
>  extern bool default_optab_supported_p (int, machine_mode, machine_mode,
>optimization_type);
>
> +extern void default_adjust_iv_cand_cost (struct iv_cand *);
> +
>  #endif /* GCC_TARGHOOKS_H */
> diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
> index 98dc451..5bfd232 100644
> --- a/gcc/tree-ssa-loop-ivopts.c
> +++ b/gcc/tree-ssa-loop-ivopts.c
> @@ -126,21 +126,6 @@ avg_loop_niter (struct loop *loop)
>return niter;
>  }
>
> -/* Representation of the induction variable.  */
> -struct iv
> -{
> -  tree base;   /* Initial value of the iv.  */
> -  tree base_object;/* A memory object to that the induction variable 
> poin

[PATCH, obvious, i386] Remove duplicate check for CLZERO.

2015-12-16 Thread Kirill Yukhin
Hello,
ix86_target_macros_internal () contains a duplicated check of the `clzero'
option.
I've committed to main trunk as obvious patch in the bottom.

gcc/
* config/i386/i386-c.c (ix86_target_macros_internal): Remove
duplicate check (__CLZERO__).

--
Thanks, K

Index: gcc/config/i386/i386-c.c
===
--- gcc/config/i386/i386-c.c(revision 231687)
+++ gcc/config/i386/i386-c.c(revision 231688)
@@ -439,8 +439,6 @@
 def_or_undef (parse_in, "__CLWB__");
   if (isa_flag & OPTION_MASK_ISA_MWAITX)
 def_or_undef (parse_in, "__MWAITX__");
-  if (isa_flag & OPTION_MASK_ISA_CLZERO)
-def_or_undef (parse_in, "__CLZERO__");
   if (TARGET_IAMCU)
 {
   def_or_undef (parse_in, "__iamcu");


Re: [PATCH] S/390: Allow to use r1 to r4 as literal pool base.

2015-12-16 Thread Dominik Vogt
On Wed, Dec 16, 2015 at 01:51:45PM +0100, Ulrich Weigand wrote:
> Dominik Vogt wrote:
> > > r2 through r4 should be fine.  [ Not sure if there will be many (any?) 
> > > cases
> > > where one of those is unused but r5 isn't, however. ]
> > 
> > This can happen if the function only uses register pairs
> > (__int128).  Actually I'm not sure whether r2 and r4 are valid
> > candidates.
> 
> Huh?  Why not?

Because I'm not sure it is possible to write code where r2 (r4) is
free but r3 (r5) is not - at least when s390_emit_prologue is
called.  Writing code that uses r4 and r5 but not r3 was diffucult
enough:

  __int128 gi;
  const int c = 0x12345678u;
  int foo(void)
  {
gi += c;
return c;
  }

Ciao

Dominik ^_^  ^_^

-- 

Dominik Vogt
IBM Germany



[PTX] simplify calling struct

2015-12-16 Thread Nathan Sidwell
PTX's machine_function structure squirrels away the function type to calculate
the presence of variadic args later, rather than calculating it immediately.  It
also uses an rtx field as a boolean.  This patch reorganizes it with less
verbose names and more apt types.


I also noticed that nvptx_hard_regno_mode_ok wasn't being used, so that's
deleted.

nathan
2015-12-16  Nathan Sidwell  

	* config/nvptx/nvptx-protos.h (nvptx_hard_regno_mode_ok): Delete.
	* config/nvptx/nvptx.h (struct machine_function):
	Reimplement. Adjust all users.
	* config/nvptx/nvptx.c (nvptx_declare_function_name): Move stack
	and frame array generation earlier.
	(nvptx_call_args): Reimplement.
	(nvptx_expand_call): Adjust.
	(nvptx_hard_regno_mode_ok): Delete.
	(nvptx_reorg): Revert scan of hard regs.

Index: config/nvptx/nvptx-protos.h
===
--- config/nvptx/nvptx-protos.h	(revision 231689)
+++ config/nvptx/nvptx-protos.h	(working copy)
@@ -41,7 +41,6 @@ extern const char *nvptx_ptx_type_from_m
 extern const char *nvptx_output_mov_insn (rtx, rtx);
 extern const char *nvptx_output_call_insn (rtx_insn *, rtx, rtx);
 extern const char *nvptx_output_return (void);
-extern bool nvptx_hard_regno_mode_ok (int, machine_mode);
 extern rtx nvptx_maybe_convert_symbolic_operand (rtx);
 #endif
 #endif
Index: config/nvptx/nvptx.c
===
--- config/nvptx/nvptx.c	(revision 231689)
+++ config/nvptx/nvptx.c	(working copy)
@@ -147,7 +147,7 @@ static struct machine_function *
 nvptx_init_machine_status (void)
 {
   struct machine_function *p = ggc_cleared_alloc ();
-  p->ret_reg_mode = VOIDmode;
+  p->return_mode = VOIDmode;
   return p;
 }
 
@@ -487,7 +487,7 @@ nvptx_strict_argument_naming (cumulative
 static rtx
 nvptx_libcall_value (machine_mode mode, const_rtx)
 {
-  if (cfun->machine->start_call == NULL_RTX)
+  if (!cfun->machine->doing_call)
 /* Pretend to return in a hard reg for early uses before pseudos can be
generated.  */
 return gen_rtx_REG (mode, NVPTX_RETURN_REGNUM);
@@ -506,7 +506,7 @@ nvptx_function_value (const_tree type, c
 
   if (outgoing)
 {
-  cfun->machine->ret_reg_mode = mode;
+  cfun->machine->return_mode = mode;
   return gen_rtx_REG (mode, NVPTX_RETURN_REGNUM);
 }
 
@@ -678,14 +678,14 @@ write_return_type (std::stringstream &s,
 	 optimization-level specific, so no caller can make use of
 	 this data, but more importantly for us, we must ensure it
 	 doesn't change the PTX prototype.  */
-  mode = (machine_mode) cfun->machine->ret_reg_mode;
+  mode = (machine_mode) cfun->machine->return_mode;
 
   if (mode == VOIDmode)
 	return return_in_mem;
 
-  /* Clear ret_reg_mode to inhibit copy of retval to non-existent
+  /* Clear return_mode to inhibit copy of retval to non-existent
 	 retval parameter.  */
-  cfun->machine->ret_reg_mode = VOIDmode;
+  cfun->machine->return_mode = VOIDmode;
 }
   else
 mode = promote_return (mode);
@@ -989,7 +989,18 @@ nvptx_declare_function_name (FILE *file,
 
   fprintf (file, "%s", s.str().c_str());
 
-  if (regno_reg_rtx[OUTGOING_STATIC_CHAIN_REGNUM] != const0_rtx)
+  /* Declare a local var for outgoing varargs.  */
+  if (cfun->machine->has_varadic)
+init_frame (file, STACK_POINTER_REGNUM,
+		UNITS_PER_WORD, crtl->outgoing_args_size);
+
+  /* Declare a local variable for the frame.  */
+  HOST_WIDE_INT sz = get_frame_size ();
+  if (sz || cfun->machine->has_chain)
+init_frame (file, FRAME_POINTER_REGNUM,
+		crtl->stack_alignment_needed / BITS_PER_UNIT, sz);
+
+  if (cfun->machine->has_chain)
 fprintf (file, "\t.reg.u%d %s;\n", GET_MODE_BITSIZE (Pmode),
 	 reg_names[OUTGOING_STATIC_CHAIN_REGNUM]);
 
@@ -1010,17 +1021,6 @@ nvptx_declare_function_name (FILE *file,
 	}
 }
 
-  /* Declare a local var for outgoing varargs.  */
-  if (cfun->machine->has_call_with_varargs)
-init_frame (file, STACK_POINTER_REGNUM,
-		UNITS_PER_WORD, crtl->outgoing_args_size);
-
-  /* Declare a local variable for the frame.  */
-  HOST_WIDE_INT sz = get_frame_size ();
-  if (sz || cfun->machine->has_call_with_sc)
-init_frame (file, FRAME_POINTER_REGNUM,
-		crtl->stack_alignment_needed / BITS_PER_UNIT, sz);
-
   /* Emit axis predicates. */
   if (cfun->machine->axis_predicate[0])
 nvptx_init_axis_predicate (file,
@@ -1036,7 +1036,7 @@ nvptx_declare_function_name (FILE *file,
 const char *
 nvptx_output_return (void)
 {
-  machine_mode mode = (machine_mode)cfun->machine->ret_reg_mode;
+  machine_mode mode = (machine_mode)cfun->machine->return_mode;
 
   if (mode != VOIDmode)
 fprintf (asm_out_file, "\tst.param%s\t[%s_out], %s;\n",
@@ -1076,20 +1076,28 @@ nvptx_get_drap_rtx (void)
argument to the next call.  */
 
 static void
-nvptx_call_args (rtx arg, tree funtype)
+nvptx_call_args (rtx arg, tree fntype)
 {
-  if (cfun->machine->start_call == NULL_RTX)
+  if (!cfun->machine->doing_call)
 

Re: [PATCH] S/390: Allow to use r1 to r4 as literal pool base.

2015-12-16 Thread Ulrich Weigand
Dominik Vogt wrote:
> On Wed, Dec 16, 2015 at 01:51:45PM +0100, Ulrich Weigand wrote:
> > Dominik Vogt wrote:
> > > > r2 through r4 should be fine.  [ Not sure if there will be many (any?) 
> > > > cases
> > > > where one of those is unused but r5 isn't, however. ]
> > > 
> > > This can happen if the function only uses register pairs
> > > (__int128).  Actually I'm not sure whether r2 and r4 are valid
> > > candidates.
> > 
> > Huh?  Why not?
> 
> Because I'm not sure it is possible to write code where r2 (r4) is
> free but r3 (r5) is not - at least when s390_emit_prologue is
> called.  Writing code that uses r4 and r5 but not r3 was difficult
> enough:

Ah, OK.  I agree that it will rarely happen (it could in more
complex cases where something initially uses r2 but a very late
optimization pass manages to eliminate that use).

However, when it is *is* free, it is a valid candidate in the
sense that it would be safe and correct to use it.

Bye,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU/Linux compilers and toolchain
  ulrich.weig...@de.ibm.com



Re: [PATCH][AArch64] Add TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS

2015-12-16 Thread James Greenhalgh
On Wed, Dec 16, 2015 at 01:05:21PM +, Wilco Dijkstra wrote:
> James Greenhalgh wrote:
> > On Tue, Dec 15, 2015 at 10:54:49AM +, Wilco Dijkstra wrote:
> > > ping
> > >
> > > > -Original Message-
> > > > From: Wilco Dijkstra [mailto:wilco.dijks...@arm.com]
> > > > Sent: 06 November 2015 20:06
> > > > To: 'gcc-patches@gcc.gnu.org'
> > > > Subject: [PATCH][AArch64] Add TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS
> > > >
> > > > This patch adds support for the TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS
> > > > hook. When the cost of GENERAL_REGS and FP_REGS is identical, the 
> > > > register
> > > > allocator always uses ALL_REGS even when it has a much higher cost. The
> > > > hook changes the class to either FP_REGS or GENERAL_REGS depending on 
> > > > the
> > > > mode of the register. This results in better register allocation 
> > > > overall,
> > > > fewer spills and reduced codesize - particularly in SPEC2006 gamess.
> > > >
> > > > GCC regression passes with several minor fixes.
> > > >
> > > > OK for commit?
> > > >
> > > > ChangeLog:
> > > > 2015-11-06  Wilco Dijkstra  
> > > >
> > > > * gcc/config/aarch64/aarch64.c
> > > > (TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS): New define.
> > > > (aarch64_ira_change_pseudo_allocno_class): New function.
> > > > * gcc/testsuite/gcc.target/aarch64/cvtf_1.c: Build with -O2.
> > > > * gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c
> > > > (test_corners_sisd_di): Improve force to SIMD register.
> > > > (test_corners_sisd_si): Likewise.
> > > > * gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c: Build with 
> > > > -O2.
> > > > * gcc/testsuite/gcc.target/aarch64/vect-ld1r-compile-fp.c:
> > > > Remove scan-assembler check for ldr.
> > 
> > Drop the gcc/ from the ChangeLog.
> > 
> > > > --
> > > >  gcc/config/aarch64/aarch64.c   | 22 
> > > > ++
> > > >  gcc/testsuite/gcc.target/aarch64/cvtf_1.c  |  2 +-
> > > >  gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c  |  4 ++--
> > > >  gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c |  2 +-
> > > >  .../gcc.target/aarch64/vect-ld1r-compile-fp.c  |  1 -
> > 
> > These testsuite changes concern me a bit, and you don't mention them beyond
> > saying they are minor fixes...
> 
> Well any changes to register allocator preferencing would cause fallout in
> tests that are assuming which register is allocated, especially if they use
> nasty inline assembler hacks to do so...

Sure, but the testcases here each operate on data that should live in
FP_REGS given the initial conditions that the nasty hacks try to mimic -
that's what makes the regressions notable.

>
> > > >  #define FCVTDEF(ftype,itype) \
> > > >  void \
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c 
> > > > b/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c
> > > > index 363f554..8465c89 100644
> > > > --- a/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c
> > > > @@ -186,9 +186,9 @@ test_corners_sisd_di (Int64x1 b)
> > > >  {
> > > >force_simd_di (b);
> > > >b = b >> 63;
> > > > +  force_simd_di (b);
> > > >b = b >> 0;
> > > >b += b >> 65; /* { dg-warning "right shift count >= width of type" } 
> > > > */
> > > > -  force_simd_di (b);
> > 
> > This one I don't understand, but seems to say that we've decided to move
> > b out of FP_REGS after getting it in there for b = b << 63; ? So this is
> > another register allocator regression?
> 
> No, basically the register allocator is now making better decisions as to
> where to allocate integer variables. It will only allocate them to FP
> registers if they are primarily used by other FP operations. The
> force_simd_di inline assembler tries to mimic FP uses, and if there are
> enough of them at the right places then everything works as expected.  If
> however you do 3 consecutive integer operations then the allocator will now
> correctly prefer to allocate them to the integer registers (while previously
> it wouldn't, which is inefficient).

I'm not sure I understand this argument in the abstract (though I believe
it for some of the supported cores for the AArch64 target). At an abstract
level, given a set of operations which can execute in either FP_REGS or
GENERAL_REGS and initial and post conditions that allocate all input and
output registers from those operations to FP_REGS, I would expect those
operations to take place using FP_REGS? Your patch seems to break this
expectation?

> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c 
> > > > b/gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c
> > > > index a49db3e..c5a9c52 100644
> > > > --- a/gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c
> > > > @@ -1,6 +1,6 @@
> > > >  /* Test vdup_lane intrinsics work correctly.  */
> > > >  /* { dg-do run } */
> > > 

Re: [PATCH PR68906]

2015-12-16 Thread Yuri Rumyantsev
Richard,

Here is updated patch which includes (1) a test on exit proposed by
you and (2) another test from PR68021 which is caught by new check on
counted loop. Outer-loop unswitching is not performed for both new
tests.

Bootstrapping and regression testing did not show any new failures.

Is it OK for trunk?

ChangeLog:
2015-12-16  Yuri Rumyantsev  

PR tree-optimization/68021
PR tree-optimization/68906
* tree-ssa-loop-unswitch.c: Include a couple of header files.
(tree_unswitch_outer_loop): Add check that an exit is not inside inner
loop, use number_of_latch_executions to detect non-iterated loops.

gcc/testsuite/ChangeLog
* gcc.dg/torture/pr68021.c: New test.
* gcc.dg/torture/pr68906.c: Likewise.

2015-12-16 15:51 GMT+03:00 Richard Biener :
> On Wed, Dec 16, 2015 at 1:14 PM, Yuri Rumyantsev  wrote:
>> Hi All,
>>
>> Here is simple patch which cures the issue with outer-loop unswitching
>> - added invocation of number_of_latch_executions() to reject
>> unswitching for non-iterated loops.
>>
>> Bootstrapping and regression testing did not show any new failures.
>> Is it OK for trunk?
>
> No, that looks like just papering over the issue.
>
> The issue (with the 2nd testcase at least) is that single_exit () accepts
> an exit from the inner loop.
>
> Index: gcc/tree-ssa-loop-unswitch.c
> ===
> --- gcc/tree-ssa-loop-unswitch.c(revision 231686)
> +++ gcc/tree-ssa-loop-unswitch.c(working copy)
> @@ -431,7 +431,7 @@ tree_unswitch_outer_loop (struct loop *l
>  return false;
>/* Accept loops with single exit only.  */
>exit = single_exit (loop);
> -  if (!exit)
> +  if (!exit || exit->src->loop_father != loop)
>  return false;
>/* Check that phi argument of exit edge is not defined inside loop.  */
>if (!check_exit_phi (loop))
>
> fixes the runtime testcase for me (not suitable for the testsuite due
> to the infinite
> looping though).
>
> Can you please bootstrap/test the above with your testcase?  The above patch 
> is
> ok if it passes testing (no time myself right now)
>
> Thanks,
> Richard.
>
>> ChangeLog:
>>
>> 2015-12-16  Yuri Rumyantsev  
>>
>> PR tree-optimization/68906
>> * tree-ssa-loop-unswitch.c : Include couple header files.
>> (tree_unswitch_outer_loop): Use number_of_latch_executions
>> to reject non-iterated loops.
>>
>> gcc/testsuite/ChangeLog
>> * gcc.dg/torture/pr68906.c: New test.


patch.1
Description: Binary data


Re: [PATCH][AArch64] Avoid emitting zero immediate as zero register

2015-12-16 Thread James Greenhalgh
On Tue, Dec 15, 2015 at 11:17:35AM +, Wilco Dijkstra wrote:
> ping
> 
> > -Original Message-
> > From: Wilco Dijkstra [mailto:wdijk...@arm.com]
> > Sent: 28 October 2015 17:33
> > To: GCC Patches
> > Subject: [PATCH][AArch64] Avoid emitting zero immediate as zero register
> > 
> > Several instructions accidentally emit wzr/xzr even when the pattern
> > specifies an immediate. Fix this by removing the register specifier in
> > patterns that emit immediates.
> > 
> > Passes regression tests. OK for commit?
> > 
> > ChangeLog:
> > 2015-10-28  Wilco Dijkstra  
> > 
> > * gcc/config/aarch64/aarch64.md (ccmp_and): Emit
> > immediate as %1.
> > (ccmp_ior): Likewise.
> > (add3_compare0): Likewise.
> > (addsi3_compare0_uxtw): Likewise.
> > (add3nr_compare0): Likewise.
> > (compare_neg): Likewise.
> > (3): Likewise.

Remove the gcc/ from the ChangeLog entries. Otherwise, this is OK.

Thanks,
James



Re: [PATCH PR68906]

2015-12-16 Thread Richard Biener
On Wed, Dec 16, 2015 at 3:36 PM, Yuri Rumyantsev  wrote:
> Richard,
>
> Here is updated patch which includes (1) a test on exit proposed by
> you and (2) another test from PR68021 which is caught by new check on
> counted loop. Outer-loop unswitching is not performed for both new
> tests.

As said I don't think

   /* If the loop is not expected to iterate, there is no need
   for unswitching.  */
-  iterations = estimated_loop_iterations_int (loop);
-  if (iterations >= 0 && iterations <= 1)
+  niters = number_of_latch_executions (loop);
+  if (!niters || chrec_contains_undetermined (niters))
 {

is good.  We do want to retain the estimated_loop_iterations check
as it takes into account profile data while yours lets through all
counted loops.

Also I don't see why SCEV needs to be able to analyze the IV to check
for validity.

Can you please split the patch into the part I suggested (which is ok)
and the rest?

Thanks,
Richard.

>
> Bootstrapping and regression testing did not show any new failures.
>
> Is it OK for trunk.
>
> ChangeLog:
> 2015-12-16  Yuri Rumyantsev  
>
> PR tree-optimization/68021
> PR tree-optimization/68906
> * tree-ssa-loop-unswitch.c : Include couple header files.
> (tree_unswitch_outer_loop): Add check that an exit is not inside inner
> loop, use number_of_latch_executions to detect non-iterated loops.
>
> gcc/testsuite/ChangeLog
> * gcc.dg/torture/pr68021.c: New test.
> * gcc.dg/torture/pr68906.c: Likewise.
>
> 2015-12-16 15:51 GMT+03:00 Richard Biener :
>> On Wed, Dec 16, 2015 at 1:14 PM, Yuri Rumyantsev  wrote:
>>> Hi All,
>>>
>>> Here is simple patch which cures the issue with outer-loop unswitching
>>> - added invocation of number_of_latch_executions() to reject
>>> unswitching for non-iterated loops.
>>>
>>> Bootstrapping and regression testing did not show any new failures.
>>> Is it OK for trunk?
>>
>> No, that looks like just papering over the issue.
>>
>> The issue (with the 2nd testcase at least) is that single_exit () accepts
>> an exit from the inner loop.
>>
>> Index: gcc/tree-ssa-loop-unswitch.c
>> ===
>> --- gcc/tree-ssa-loop-unswitch.c(revision 231686)
>> +++ gcc/tree-ssa-loop-unswitch.c(working copy)
>> @@ -431,7 +431,7 @@ tree_unswitch_outer_loop (struct loop *l
>>  return false;
>>/* Accept loops with single exit only.  */
>>exit = single_exit (loop);
>> -  if (!exit)
>> +  if (!exit || exit->src->loop_father != loop)
>>  return false;
>>/* Check that phi argument of exit edge is not defined inside loop.  */
>>if (!check_exit_phi (loop))
>>
>> fixes the runtime testcase for me (not suitable for the testsuite due
>> to the infinite
>> looping though).
>>
>> Can you please bootstrap/test the above with your testcase?  The above patch 
>> is
>> ok if it passes testing (no time myself right now)
>>
>> Thanks,
>> Richard.
>>
>>> ChangeLog:
>>>
>>> 2015-12-16  Yuri Rumyantsev  
>>>
>>> PR tree-optimization/68906
>>> * tree-ssa-loop-unswitch.c : Include couple header files.
>>> (tree_unswitch_outer_loop): Use number_of_latch_executions
>>> to reject non-iterated loops.
>>>
>>> gcc/testsuite/ChangeLog
>>> * gcc.dg/torture/pr68906.c: New test.


Re: [PATCH, 4/16] Implement -foffload-alias

2015-12-16 Thread Tom de Vries

On 16/12/15 14:16, Richard Biener wrote:

On Mon, 14 Dec 2015, Tom de Vries wrote:


On 14/12/15 14:26, Richard Biener wrote:

On Sun, 13 Dec 2015, Tom de Vries wrote:


On 11/12/15 14:00, Richard Biener wrote:

On Fri, 11 Dec 2015, Tom de Vries wrote:


On 13/11/15 12:39, Jakub Jelinek wrote:

We simply have some compiler internal interface between the caller
and
callee of the outlined regions, each interface in between those has
its own structure type used to communicate the info;
we can attach attributes on the fields, or some flags to indicate
some
properties interesting from aliasing POV.  We don't really need to
perform
full IPA-PTA, perhaps it would be enough to a) record somewhere in
cgraph
the relationship in between such callers and callees (for offloading
regions
we already have "omp target entrypoint" attribute on the callee and
a
singler caller), tell LTO if possible not to split those into
different
partitions if easily possible, and then just for these pairs perform
aliasing/points-to analysis in the caller and the result record
using
cliques/special attributes/whatever to the callee side, so that the
callee
(outlined OpenMP/OpenACC/Cilk+ region) can then improve its alias
analysis.


Hi,

This work-in-progress patch allows me to use IPA PTA information in
the
kernels pass group.

Since:
-  I'm running IPA PTA before ealias, and IPA PTA does not interpret
  restrict, and
- compute_may_alias doesn't run if IPA PTA information is present
I needed to convince ealias to do the restrict clique/base annotation.

It would be more logical to fit IPA PTA after ealias, but one is an
IPA
pass,
the other a regular one-function pass, so I would have to split the
containing
pass groups pass_all_early_optimizations and
pass_local_optimization_passes.
I'll give that a try now.



I've tried this approach, but realized that this changes the order in
which
non-openacc functions are processed in the compiler, so I've abandoned
this
idea.


Any comments?


I don't think you want to run IPA PTA before early
optimizations, it (and ealias) rely on some initial cleanup to
do anything meaningful with well-spent resources.

The local PTA "hack" also looks more like a waste of resources, but well
... teaching IPA PTA to honor restrict might be an impossible task
though I didn't think much about it other than handling it only for
nonlocal_p functions (for others we should see all incoming args
if IPA PTA works optimally).  The restrict tags will leak all over
the place of course and in the end no meaningful cliques may remain.



This patch:
- moves the kernels pass group to the first position in the pass list
after ealias where we're back in ipa mode
- inserts an new ipa pass to contain the gimple pass group called
pass_oacc_ipa
- inserts a version of ipa-pta before the pass group.


In principle I like this a lot, but

+  NEXT_PASS (pass_ipa_pta_oacc_kernels);
+  NEXT_PASS (pass_oacc_ipa);
+  PUSH_INSERT_PASSES_WITHIN (pass_oacc_ipa)

I think you can put pass_ipa_pta_oacc_kernels into the pass_oacc_ipa
group and thus just "clone" ipa_pta?


Done. But using a clone means using the same gate function, and that means
that this pass_ipa_pta instance no longer runs by default for openacc.

I've added enabling-by-default of fipa-pta for fopenacc in
default_options_optimization to fix that.


Hmm, but that enables both IPA PTA passes then?


Yes. An alternative could be to:
- have 'NEXT_PASS (pass_ipa_pta, true/false /* oacc_p */)' in the pass
  list,
- declare a new flag fipa-pta-oacc, and
- use fipa-pta or fipa-pta-oacc in the gate function depending on
  oacc_p.


I suppose that's ok,
and if not enabling the "late" IPA PTA you'd want to re-set
gimple_df->ipa_pta.


sub-passes of IPA passes can
be both ipa passes and non-ipa passes.


Right. It does mean that I need yet another pass (pass_ipa_oacc_kernels) to do
the IPA/non-IPA transition at pass/sub-pass boundary:
...
   NEXT_PASS (pass_ipa_oacc);
   PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc)
   NEXT_PASS (pass_ipa_pta);
   NEXT_PASS (pass_ipa_oacc_kernels);
   PUSH_INSERT_PASSES_WITHIN (pass_ipa_oacc_kernels)
  /* out-of-ipa */
  NEXT_PASS (pass_oacc_kernels);
  PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
...

OK for stage3 if bootstrap and reg-test succeeds?


Ok.


Committed as attached, with the following changes:
- test for opt->value of OPT_fopenacc in default_options_optimization,
  to prevent fipa-pta from being switched on by default for -fno-openacc.
- fixed pta -> pta2 scan failures.

Thanks,
- Tom


Add pass_oacc_ipa

2015-12-14  Tom de Vries  

	* opts.c (default_options_optimization): Set fipa-pta on by default for
	fopenacc.
	* passes.def: Move kernels pass group to pass_ipa_oacc.
	* tree-pass.h (make_pass_oacc_kernels2): Remove.
	(make_pass_ipa_oacc, make_pass_ipa_oacc_kernels): Declare.
	* tree-ssa-loop.c (pass_oacc_kernels2, make_pass_oacc_kernels2): Remove.
	(pass_ipa_oacc, pass_ipa_oacc_kernels)

[PATCH] Fix PR68870

2015-12-16 Thread Richard Biener

This extends the previous fix for the CFG cleanup issue WRT dead SSA
defs to properly avoid doing sth fancy with conditons in the first pass.

Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.

Richard.

2015-12-16  Richard Biener  

PR tree-optimization/68870
* tree-cfgcleanup.c (cleanup_control_expr_graph): Add first_p
parameter, if set only perform trivial constant folding.
Queue other blocks with conditions for later processing.
(cleanup_control_flow_bb): Add first_p parameter and pass it through.
(cleanup_tree_cfg_1): Pass true for the first iteration
cleanup_control_expr_graph.

* gcc.dg/torture/pr68870.c: New testcase.

Index: gcc/tree-cfgcleanup.c
===
--- gcc/tree-cfgcleanup.c   (revision 231673)
+++ gcc/tree-cfgcleanup.c   (working copy)
@@ -78,7 +78,8 @@ remove_fallthru_edge (vec *
at block BB.  */
 
 static bool
-cleanup_control_expr_graph (basic_block bb, gimple_stmt_iterator gsi)
+cleanup_control_expr_graph (basic_block bb, gimple_stmt_iterator gsi,
+   bool first_p)
 {
   edge taken_edge;
   bool retval = false;
@@ -95,15 +96,26 @@ cleanup_control_expr_graph (basic_block
   switch (gimple_code (stmt))
{
case GIMPLE_COND:
- {
-   code_helper rcode;
-   tree ops[3] = {};
-   if (gimple_simplify (stmt, &rcode, ops, NULL, no_follow_ssa_edges,
-no_follow_ssa_edges)
-   && rcode == INTEGER_CST)
- val = ops[0];
-   break;
- }
+ /* During a first iteration on the CFG only remove trivially
+dead edges but mark other conditions for re-evaluation.  */
+ if (first_p)
+   {
+ val = const_binop (gimple_cond_code (stmt), boolean_type_node,
+gimple_cond_lhs (stmt),
+gimple_cond_rhs (stmt));
+ if (! val)
+   bitmap_set_bit (cfgcleanup_altered_bbs, bb->index);
+   }
+ else
+   {
+ code_helper rcode;
+ tree ops[3] = {};
+ if (gimple_simplify (stmt, &rcode, ops, NULL, no_follow_ssa_edges,
+  no_follow_ssa_edges)
+ && rcode == INTEGER_CST)
+   val = ops[0];
+   }
+ break;
 
case GIMPLE_SWITCH:
  val = gimple_switch_index (as_a  (stmt));
@@ -176,7 +188,7 @@ cleanup_call_ctrl_altering_flag (gimple
true if anything changes.  */
 
 static bool
-cleanup_control_flow_bb (basic_block bb)
+cleanup_control_flow_bb (basic_block bb, bool first_p)
 {
   gimple_stmt_iterator gsi;
   bool retval = false;
@@ -199,7 +211,7 @@ cleanup_control_flow_bb (basic_block bb)
   || gimple_code (stmt) == GIMPLE_SWITCH)
 {
   gcc_checking_assert (gsi_stmt (gsi_last_bb (bb)) == stmt);
-  retval |= cleanup_control_expr_graph (bb, gsi);
+  retval |= cleanup_control_expr_graph (bb, gsi, first_p);
 }
   else if (gimple_code (stmt) == GIMPLE_GOTO
   && TREE_CODE (gimple_goto_dest (stmt)) == ADDR_EXPR
@@ -680,7 +692,7 @@ cleanup_tree_cfg_1 (void)
 {
   bb = BASIC_BLOCK_FOR_FN (cfun, i);
   if (bb)
-   retval |= cleanup_control_flow_bb (bb);
+   retval |= cleanup_control_flow_bb (bb, true);
 }
 
   /* After doing the above SSA form should be valid (or an update SSA
@@ -708,7 +720,7 @@ cleanup_tree_cfg_1 (void)
   if (!bb)
continue;
 
-  retval |= cleanup_control_flow_bb (bb);
+  retval |= cleanup_control_flow_bb (bb, false);
   retval |= cleanup_tree_cfg_bb (bb);
 }
 
Index: gcc/testsuite/gcc.dg/torture/pr68870.c
===
--- gcc/testsuite/gcc.dg/torture/pr68870.c  (revision 0)
+++ gcc/testsuite/gcc.dg/torture/pr68870.c  (working copy)
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+
+int printf (const char *, ...);
+
+int a, f, g;
+char b, d;
+short c;
+static short e;
+
+char
+fn1 ()
+{
+  for (; b; b++)
+{
+  int h = 5;
+  for (a = 0; a < 1; a++)
+   {
+ for (d = 0; d < 1; d++)
+   for (c = 0; c < 1; c++)
+ for (; e >= 0;)
+   return 5;
+ if (f)
+   h = 0;
+   }
+  if (h)
+   printf ("%d", 0);
+}
+  return g;
+}


Re: [PATCH][AArch64] PR target/68696 FAIL: gcc.target/aarch64/vbslq_u64_1.c scan-assembler-times bif\tv 1

2015-12-16 Thread James Greenhalgh
On Tue, Dec 08, 2015 at 09:21:29AM +, Kyrill Tkachov wrote:
> Hi all,
> 
> The test gcc.target/aarch64/vbslq_u64_1.c started failing recently due to 
> some tree-level changes.
> This just exposed a deficiency in our xor-and-xor pattern for the vector 
> bit-select pattern:
> aarch64_simd_bsl_internal.
> 
> We now fail to match the rtx:
> (set (reg:V4SI 79)
> (xor:V4SI (and:V4SI (xor:V4SI (reg:V4SI 32 v0 [ a ])
> (reg/v:V4SI 77 [ b ]))
> (reg:V4SI 34 v2 [ mask ]))
> (reg/v:V4SI 77 [ b ])))
> 
> whereas before combine attempted:
> (set (reg:V4SI 79)
> (xor:V4SI (and:V4SI (xor:V4SI (reg/v:V4SI 77 [ b ])
> (reg:V4SI 32 v0 [ a ]))
> (reg:V4SI 34 v2 [ mask ]))
> (reg/v:V4SI 77 [ b ])))
> 
> Note that just the order of the operands of the inner XOR has changed.
> This could be solved by making the second operand of the outer XOR a 4th 
> operand
> of the pattern, enforcing that it should be equal to operand 2 or 3 in the 
> pattern
> condition and performing the appropriate swapping in the output template.
> However, the aarch64_simd_bsl_internal pattern is expanded to by other
> places in aarch64-simd.md and updating all the callsites to add a 4th operand 
> is
> wasteful and makes them harder to understand.
> 
> Therefore this patch adds a new define_insn with the match_dup of operand 2 in
> the outer XOR.  I also had to update the alternatives/constraints in the 
> pattern
> and the output template. Basically it involves swapping operands 2 and 3 
> around in the
> constraints and output templates.

Yuck, but OK.

Thanks,
James



C PATCH for c/64637 (better location for -Wunused-value)

2015-12-16 Thread Marek Polacek
The following improves the location for "statement with no effect" warning by
using the location of the expression if available.  Can't use EXPR_LOCATION as
*_DECLs still don't carry a location.

Bootstrapped/regtested on x86_64-linux, ok for trunk?

2015-12-16  Marek Polacek  

PR c/64637
* c-typeck.c (c_process_expr_stmt): Use location of the expression if
available.

* gcc.dg/pr64637.c: New test.

diff --git gcc/c/c-typeck.c gcc/c/c-typeck.c
index 9d6c604..a147ac6 100644
--- gcc/c/c-typeck.c
+++ gcc/c/c-typeck.c
@@ -10131,7 +10131,7 @@ c_process_expr_stmt (location_t loc, tree expr)
  out which is the result.  */
   if (!STATEMENT_LIST_STMT_EXPR (cur_stmt_list)
   && warn_unused_value)
-emit_side_effect_warnings (loc, expr);
+emit_side_effect_warnings (EXPR_LOC_OR_LOC (expr, loc), expr);
 
   exprv = expr;
   while (TREE_CODE (exprv) == COMPOUND_EXPR)
diff --git gcc/testsuite/gcc.dg/pr64637.c gcc/testsuite/gcc.dg/pr64637.c
index e69de29..779ff50 100644
--- gcc/testsuite/gcc.dg/pr64637.c
+++ gcc/testsuite/gcc.dg/pr64637.c
@@ -0,0 +1,25 @@
+/* PR c/64637 */
+/* { dg-do compile } */
+/* { dg-options "-Wunused" } */
+
+void g ();
+
+void
+f (int b)
+{
+  for (int i = 0; i < b; i + b) /* { dg-warning "28:statement with no effect" } */
+g ();
+  // PARM_DECLs still don't have a location, don't expect an exact location.
+  for (int i = 0; i < b; b) /* { dg-warning "statement with no effect" } */
+g ();
+  for (int i = 0; i < b; !i) /* { dg-warning "26:statement with no effect" } */
+g ();
+  for (!b;;) /* { dg-warning "8:statement with no effect" } */
+g ();
+  for (;; b * 2) /* { dg-warning "13:statement with no effect" } */
+g ();
+  ({
+ b / 5; /* { dg-warning "8:statement with no effect" } */
+ b ^ 5;
+   });
+}
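For reference, the EXPR_LOC_OR_LOC fallback used in the c-typeck.c hunk above can be sketched as follows (location_t and UNKNOWN_LOCATION are simplified stand-ins here, not GCC's actual definitions):

```c
/* Prefer the expression's own location; fall back to the statement
   location when the expression (e.g. a bare *_DECL) carries none.  */
typedef unsigned int location_t;
#define UNKNOWN_LOCATION 0u

static location_t
expr_loc_or_loc (location_t expr_loc, location_t fallback)
{
  return expr_loc != UNKNOWN_LOCATION ? expr_loc : fallback;
}
```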

Marek


Re: [PATCH][AArch64] Properly cost zero_extend+ashift forms of ubfi[xz]

2015-12-16 Thread James Greenhalgh
On Fri, Dec 04, 2015 at 09:30:45AM +, Kyrill Tkachov wrote:
> Hi all,
> 
> We don't handle properly the patterns for the [us]bfiz and [us]bfx 
> instructions when they
> have an extend+ashift form. For example, the 
> *_ashl pattern.
> This leads to rtx costs recuring into the extend and assigning a cost to 
> these patterns that is too
> large.
> 
> This patch fixes that oversight.
> I stumbled across this when working on a different combine patch and ended up 
> matching the above
> pattern, only to have it rejected for -mcpu=cortex-a53 due to the erroneous 
> cost.
> 
> Bootstrapped and tested on aarch64.
> 
> Ok for trunk?
> 
> Thanks,
> Kyrill
> 
> 2015-12-04  Kyrylo Tkachov  
> 
> * config/aarch64/aarch64.c (aarch64_extend_bitfield_pattern_p):
> New function.
> (aarch64_rtx_costs, ZERO_EXTEND, SIGN_EXTEND cases): Use the above
> to handle extend+shift rtxes.

> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 
> c97ecdc0859e0a24792a57aeb18b2e4ea35918f4..d180f6f2d37a280ad77f34caad8496ddaa6e01b2
>  100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -5833,6 +5833,50 @@ aarch64_if_then_else_costs (rtx op0, rtx op1, rtx op2, 
> int *cost, bool speed)
>return false;
>  }
>  
> +/* Check whether X is a bitfield operation of the form shift + extend that
> +   maps down to a UBFIZ/SBFIZ/UBFX/SBFX instruction.  If so, return the
> +   operand to which the bitfield operation is applied to.  Otherwise return

No need for that second "to" at the end of the sentence.

> +   NULL_RTX.  */
> +
> +static rtx
> +aarch64_extend_bitfield_pattern_p (rtx x)
> +{
> +  rtx_code outer_code = GET_CODE (x);
> +  machine_mode outer_mode = GET_MODE (x);
> +
> +  if (outer_code != ZERO_EXTEND && outer_code != SIGN_EXTEND
> +  && outer_mode != SImode && outer_mode != DImode)
> +return NULL_RTX;
> +
> +  rtx inner = XEXP (x, 0);
> +  rtx_code inner_code = GET_CODE (inner);
> +  machine_mode inner_mode = GET_MODE (inner);
> +  rtx op = NULL_RTX;
> +
> +  switch (inner_code)
> +{
> +  case ASHIFT:
> + if (CONST_INT_P (XEXP (inner, 1))
> + && (inner_mode == QImode || inner_mode == HImode))
> +   op = XEXP (inner, 0);
> + break;
> +  case LSHIFTRT:
> + if (outer_code == ZERO_EXTEND && CONST_INT_P (XEXP (inner, 1))
> + && (inner_mode == QImode || inner_mode == HImode))
> +   op = XEXP (inner, 0);
> + break;
> +  case ASHIFTRT:
> + if (outer_code == SIGN_EXTEND && CONST_INT_P (XEXP (inner, 1))
> + && (inner_mode == QImode || inner_mode == HImode))
> +   op = XEXP (inner, 0);
> + break;
> +  default:
> + break;
> +}
> +
> +  return op;
> +}
> +
>  /* Calculate the cost of calculating X, storing it in *COST.  Result
> is true if the total cost of the operation has now been calculated.  */
>  static bool
> @@ -6521,6 +6565,14 @@ cost_plus:
> return true;
>   }
>  
> +  op0 = aarch64_extend_bitfield_pattern_p (x);
> +  if (op0)
> + {
> +   *cost += rtx_cost (op0, mode, ZERO_EXTEND, 0, speed);
> +   if (speed)
> + *cost += extra_cost->alu.bfx;
> +   return true;
> + }

Newline here.

>if (speed)
>   {
> if (VECTOR_MODE_P (mode))
> @@ -6552,6 +6604,14 @@ cost_plus:
> return true;
>   }
>  
> +  op0 = aarch64_extend_bitfield_pattern_p (x);
> +  if (op0)
> + {
> +   *cost += rtx_cost (op0, mode, SIGN_EXTEND, 0, speed);
> +   if (speed)
> + *cost += extra_cost->alu.bfx;
> +   return true;
> + }

And here.

>if (speed)
>   {
> if (VECTOR_MODE_P (mode))

OK with those changes.

Thanks,
James
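For illustration, here are hypothetical C-level shapes (not taken from the patch or the testsuite) that can expand to the extend+shift RTL forms being costed above; whether a given compiler release actually emits UBFIZ/UBFX/SBFX for them depends on the target and on costs such as the ones fixed here:

```c
#include <stdint.h>

uint32_t
ubfiz_like (uint8_t x)
{
  /* zero_extend of an ashift of a QImode value: insert at bit 3.  */
  return (uint32_t) x << 3;
}

int32_t
sbfx_like (int16_t x)
{
  /* sign_extend of an ashiftrt of an HImode value: extract the top bits.  */
  return x >> 4;
}
```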




[PATCH] Fix PR68707, 67323

2015-12-16 Thread Richard Biener

The following patch adds a heuristic to prefer store/load-lanes
over SLP when vectorizing.  Compared to the variant attached to
the PR I made the STMT_VINFO_STRIDED_P behavior explicit (matching
what you've tested).

It's a heuristic that may end up vectorizing fewer loops, or vectorizing
loops in a less optimal way.

Thus I wait for your ok (it's essentially ARM specific).

Bootstrapped and tested on x86_64-unknown-linux-gnu.

Ok?  It will require a bunch of vectorizer tests to be adjusted for
ARM I think.

Thanks,
Richard.

2015-12-16  Richard Biener  

PR tree-optimization/68707
PR tree-optimization/67323
* tree-vect-slp.c (vect_analyze_slp_instance): Drop SLP instances
if they can be vectorized using load/store-lane instructions.

Index: gcc/tree-vect-slp.c
===
*** gcc/tree-vect-slp.c (revision 231673)
--- gcc/tree-vect-slp.c (working copy)
*** vect_analyze_slp_instance (vec_info *vin
*** 1808,1813 
--- 1802,1836 
  }
  }
  
+   /* If the loads and stores can be handled with load/store-lane
+  instructions do not generate this SLP instance.  */
+   if (is_a  (vinfo)
+ && loads_permuted
+ && dr && vect_store_lanes_supported (vectype, group_size))
+   {
+ slp_tree load_node;
+ FOR_EACH_VEC_ELT (loads, i, load_node)
+   {
+ gimple *first_stmt = GROUP_FIRST_ELEMENT
+ (vinfo_for_stmt (SLP_TREE_SCALAR_STMTS (load_node)[0]));
+ stmt_vec_info stmt_vinfo = vinfo_for_stmt (first_stmt);
+ if (! STMT_VINFO_STRIDED_P (stmt_vinfo)
+ && ! vect_load_lanes_supported
+(STMT_VINFO_VECTYPE (stmt_vinfo),
+ GROUP_SIZE (stmt_vinfo)))
+   break;
+   }
+ if (i == loads.length ())
+   {
+ if (dump_enabled_p ())
+   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+"Built SLP cancelled: can use "
+"load/store-lanes\n");
+   vect_free_slp_instance (new_instance);
+   return false;
+   }
+   }
+ 
vinfo->slp_instances.safe_push (new_instance);
  
if (dump_enabled_p ())
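For illustration, a hypothetical loop (not taken from the PRs) with the kind of interleaved, permuted accesses this heuristic is about: SLP vectorization of it needs shuffled loads/stores, while targets reporting vect_load_lanes_supported / vect_store_lanes_supported can use ld2/st2-style instructions directly:

```c
/* Each iteration reads and writes an interleaved pair, so the
   vectorized loads are permuted unless load-lanes are available.  */
void
interleave (float *restrict out, const float *restrict in, int n)
{
  for (int i = 0; i < n; i++)
    {
      out[2 * i]     = in[2 * i] + in[2 * i + 1];
      out[2 * i + 1] = in[2 * i] - in[2 * i + 1];
    }
}
```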


Re: C PATCH for c/64637 (better location for -Wunused-value)

2015-12-16 Thread David Malcolm
On Wed, 2015-12-16 at 15:58 +0100, Marek Polacek wrote:
> The following improves the location for "statement with no effect" warning by
> using the location of the expression if available.  Can't use EXPR_LOCATION as
> *_DECLs still don't carry a location.

Out of interest, does it emit sane underlined ranges for these cases,
with the patch?

> Bootstrapped/regtested on x86_64-linux, ok for trunk?
> 
> 2015-12-16  Marek Polacek  
> 
>   PR c/64637
>   * c-typeck.c (c_process_expr_stmt): Use location of the expression if
>   available.
> 
>   * gcc.dg/pr64637.c: New test.
> 
> diff --git gcc/c/c-typeck.c gcc/c/c-typeck.c
> index 9d6c604..a147ac6 100644
> --- gcc/c/c-typeck.c
> +++ gcc/c/c-typeck.c
> @@ -10131,7 +10131,7 @@ c_process_expr_stmt (location_t loc, tree expr)
>   out which is the result.  */
>if (!STATEMENT_LIST_STMT_EXPR (cur_stmt_list)
>&& warn_unused_value)
> -emit_side_effect_warnings (loc, expr);
> +emit_side_effect_warnings (EXPR_LOC_OR_LOC (expr, loc), expr);
>  
>exprv = expr;
>while (TREE_CODE (exprv) == COMPOUND_EXPR)
> diff --git gcc/testsuite/gcc.dg/pr64637.c gcc/testsuite/gcc.dg/pr64637.c
> index e69de29..779ff50 100644
> --- gcc/testsuite/gcc.dg/pr64637.c
> +++ gcc/testsuite/gcc.dg/pr64637.c
> @@ -0,0 +1,25 @@
> +/* PR c/64637 */
> +/* { dg-do compile } */
> +/* { dg-options "-Wunused" } */
> +
> +void g ();
> +
> +void
> +f (int b)
> +{
> +  for (int i = 0; i < b; i + b) /* { dg-warning "28:statement with no effect" } */
> +g ();
> +  // PARM_DECLs still don't have a location, don't expect an exact location.
> +  for (int i = 0; i < b; b) /* { dg-warning "statement with no effect" } */
> +g ();
> +  for (int i = 0; i < b; !i) /* { dg-warning "26:statement with no effect" } */
> +g ();
> +  for (!b;;) /* { dg-warning "8:statement with no effect" } */
> +g ();
> +  for (;; b * 2) /* { dg-warning "13:statement with no effect" } */
> +g ();
> +  ({
> + b / 5; /* { dg-warning "8:statement with no effect" } */
> + b ^ 5;
> +   });
> +}
> 
>   Marek




Re: C PATCH for c/64637 (better location for -Wunused-value)

2015-12-16 Thread David Malcolm
On Wed, 2015-12-16 at 16:09 +0100, Marek Polacek wrote:
> On Wed, Dec 16, 2015 at 10:04:05AM -0500, David Malcolm wrote:
> > On Wed, 2015-12-16 at 15:58 +0100, Marek Polacek wrote:
> > > The following improves the location for "statement with no effect" 
> > > warning by
> > > using the location of the expression if available.  Can't use 
> > > EXPR_LOCATION as
> > > *_DECLs still don't carry a location.
> > 
> > Out of interest, does it emit sane underlined ranges for these cases,
> > with the patch?
> 
> Yes, it emits what I'd expect, e.g.:
> 
> pr64637.c:10:28: warning: statement with no effect [-Wunused-value]
>   for (int i = 0; i < b; i + b)
>                          ~~^~~
> Similarly for the rest.
> 
> (Yes, I could've used dg-begin-multiline-output + dg-end-multiline-output to
> check that, but I think what I have right now in the test should be enough.)

Excellent; thanks!



Re: C PATCH for c/64637 (better location for -Wunused-value)

2015-12-16 Thread Marek Polacek
On Wed, Dec 16, 2015 at 10:04:05AM -0500, David Malcolm wrote:
> On Wed, 2015-12-16 at 15:58 +0100, Marek Polacek wrote:
> > The following improves the location for "statement with no effect" warning 
> > by
> > using the location of the expression if available.  Can't use EXPR_LOCATION 
> > as
> > *_DECLs still don't carry a location.
> 
> Out of interest, does it emit sane underlined ranges for these cases,
> with the patch?

Yes, it emits what I'd expect, e.g.:

pr64637.c:10:28: warning: statement with no effect [-Wunused-value]
  for (int i = 0; i < b; i + b)
                         ~~^~~
Similarly for the rest.

(Yes, I could've used dg-begin-multiline-output + dg-end-multiline-output to
check that, but I think what I have right now in the test should be enough.)

Marek


[Ping]Re: [AArch64] Simplify TLS pattern by hardcoding relocation modifiers into pattern

2015-12-16 Thread Jiong Wang



On 10/09/15 12:28, Jiong Wang wrote:

TLS instruction sequences always have a fixed format, so there is no need
to use operand modifiers; we can hardcode the relocation modifiers into
the instruction patterns, and all those redundant checks in
aarch64_print_operand can be removed.

OK for trunk?

2015-09-10  Jiong Wang  

gcc/
   * config/aarch64/aarch64.md (ldr_got_tiny): Hardcode relocation
   modifers.
   (tlsgd_small): Likewise.
   (tlsgd_tiny): Likewise.
   (tlsie_small_): Likewise.
   (tlsie_small_sidi): Likewise.
   (tlsie_tiny_): Likewise.
   (tlsie_tiny_sidi): Likewise.
   (tlsle12_): Likewise.
   (tlsle24_): Likewise.
   (tlsdesc_small_): Likewise.
   (tlsdesc_small_pseudo_): Likewise.
   (tlsdesc_tiny_): Likewise.
   (tlsdesc_tiny_pseudo_): Likewise.
   * config/aarch64/aarch64.c (aarch64_print_operand): Delete useless
   check on 'A', 'L', 'G'.
   


Ping ~

There is no functional change from this patch, just a cleanup of those
unnecessary uses of output modifiers.

All these instruction sequences have a fixed format, so we can just
hardcode the relocation modifiers into the instruction patterns; then all
those redundant checks in aarch64_print_operand can be removed.




Re: [PATCH, IA64] Fix building a bare-metal ia64 compiler

2015-12-16 Thread Bernd Edlinger


On 16.12.2015 05:59, Bernd Edlinger wrote:
> Hi,
>
> On 16.12.2015 00:55 Bernd Schmidt wrote:
>> On 12/15/2015 10:13 PM, Bernd Edlinger wrote:
>>> due to recent discussion on the basic asm, and the special handling
>>> of ASM_INPUT in ia64, I tried to build a bare-metal cross-compiler
>>> for ia64, but that did not work, because it seems to be impossible to
>>> build it without having a stdlib.h.
>>
>> Actually David Howells has complained to me about this as well, it 
>> seems to be a problem when building a toolchain for kernel compilation.
>
> Yes.  I am not sure whether this is also problematic when building a
> cross glibc: I first need a cross C compiler, then build glibc and
> install the headers and objects into the sysroot, and then build gcc
> again with all languages.
>
>>
>>> With the attached patch, I was finally able to build the cross
>>> compiler, by declaring abort in the way as it is done already in many
>>> other places at libgcc.
>>
>> Can you just use __builtin_abort ()? Ok with that change.
>>
>
> I will try, but ./config/ia64/unwind-ia64.c includes this header,
> and it also uses abort () at many places.  So I'd expect warnings
> there.
>
>

OK, no new warnings, because tsystem.h defines a prototype of abort.

=> checked in as r231697.

But I see that unwind-ia64.c uses not only abort, but also malloc and
free.

If I see that right, the stack walk will crash if malloc fails.  What
happens on this target if new throws out_of_memory?

I think this function should be rewritten to use alloca instead.
(I have an idea how this could be done, but no way to test a patch :(
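A minimal sketch of that idea (illustrative names only, not the actual unwind-ia64.c code): a scratch buffer obtained with alloca lives in the current frame and has no failure path of its own, unlike malloc.

```c
#include <alloca.h>
#include <string.h>

/* Copy SRC into frame-local scratch storage and sum it; the scratch
   buffer is released automatically when the function returns.  */
static int
sum_frame_local (const int *src, int n)
{
  int *tmp = alloca (n * sizeof *tmp);  /* frame storage, cannot fail like malloc */
  memcpy (tmp, src, n * sizeof *tmp);
  int s = 0;
  for (int i = 0; i < n; i++)
    s += tmp[i];
  return s;
}
```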


Bernd.


Re: [C++ Patch] Use default arguments in one more place

2015-12-16 Thread Jason Merrill

OK.

Jason


Re: Ping [PATCH] c++/42121 - diagnose invalid flexible array members

2015-12-16 Thread Martin Sebor

I think this caused PR68932 - FAIL:
obj-c++.dg/property/at-property-23.mm -fgnu-runtime (internal compiler
error)


Sorry about that. I'll look into it today.

Martin


Re: [PATCH] Handle BUILT_IN_GOACC_PARALLEL in ipa-pta

2015-12-16 Thread Tom de Vries

On 10/12/15 14:14, Tom de Vries wrote:

[ copy-pasting-with-quote from
https://gcc.gnu.org/ml/gcc-patches/2015-12/msg00420.html , for some
reason I didn't get this email ]


On Thu, 3 Dec 2015, Tom de Vries wrote:

The flag is set here in expand_omp_target:
...
12682 /* Prevent IPA from removing child_fn as unreachable,
 since there are no
12683refs from the parent function to child_fn in offload
 LTO mode.  */
12684 if (ENABLE_OFFLOADING)
12685   cgraph_node::get (child_fn)->mark_force_output ();
...



How are there no refs from the "parent"?  Are there not refs from
some kind of descriptor that maps fallback CPU and offloaded variants?


That descriptor is the offload table, which is emitted in
omp_finish_file. The function iterates over vectors offload_vars and
offload_funcs.

[ I would guess there's a one-on-one correspondance between
symtab_node::offloadable and membership of either offload_vars or
offload_funcs. ]


I think the above needs sorting out in some way, making the refs
explicit rather than implicit via force_output.


I've tried an approach where I add a test for node->offloadable next to
each test for node->force_output, except for the test in the nonlocal_p
def in ipa_pta_execute. But I didn't (yet) manage to make that work.


I guess setting forced_by_abi instead would also mean child_fn is not
removed
as unreachable, while still allowing optimizations:
...
  /* Like FORCE_OUTPUT, but in the case it is ABI requiring the symbol
 to be exported.  Unlike FORCE_OUTPUT this flag gets cleared to
 symbols promoted to static and it does not inhibit
 optimization.  */
  unsigned forced_by_abi : 1;
...

But I suspect that other optimizations (than ipa-pta) might break
things.


How so?


Probably it's more accurate to say that I do not understand very well the
difference between force_output and forced_by_abi, and what class of
optimizations is enabled by using forced_by_abi instead of force_output.


Essentially we have two situations:
- in the host compiler, there is no need for the forced_output flag,
  and it inhibits optimization
- in the accelerator compiler, it (or some equivalent) is needed


Actually, things are slightly more complicated, I realize now. There's
also the distinction between:
- symbols declared as offloadable in the source code, and
- symbols created by the compiler and marked offloadable


I wonder if setting the force_output flag only when streaming the
bytecode for
offloading would work. That way, it wouldn't be set in the host
compiler,
while being set in the accelerator compiler.


Yeah, that was my original thinking btw.


FTR, I've tried that approach, as attached. It fixed the
goacc/kernels-alias-ipa-pta*.c failures. And I ran target-libgomp (also
using an accelerator configuration) without any regressions.


How about this patch?

We remove the setting of force_output when:
- encountering offloadable symbols in the frontend, or
- creating offloadable symbols in expand-omp.

Instead, we set force_output in input_offload_tables.

This is an improvement because:
- it moves the force_output setting to a single location
- it does the force_output setting ALAP

Thanks,
- Tom
Mark symbols in offload tables with force_output in input_offload_tables

2015-12-15  Tom de Vries  

	* c-parser.c (c_parser_oacc_declare, c_parser_omp_declare_target): Don't
	set force_output.

	* parser.c (cp_parser_oacc_declare, cp_parser_omp_declare_target): Don't
	set force_output.

	* omp-low.c (expand_omp_target): Don't set force_output.
	* varpool.c (varpool_node::get_create): Same.
	* lto-cgraph.c (input_offload_tables): Mark entries in offload_vars and
	offload_funcs with force_output.

---
 gcc/c/c-parser.c | 10 ++
 gcc/cp/parser.c  | 10 ++
 gcc/lto-cgraph.c |  9 +
 gcc/omp-low.c|  5 -
 gcc/varpool.c|  1 -
 5 files changed, 13 insertions(+), 22 deletions(-)

diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c
index 124c30b..6e6f4b8 100644
--- a/gcc/c/c-parser.c
+++ b/gcc/c/c-parser.c
@@ -13527,10 +13527,7 @@ c_parser_oacc_declare (c_parser *parser)
 		{
 		  g->have_offload = true;
 		  if (is_a  (node))
-			{
-			  vec_safe_push (offload_vars, decl);
-			  node->force_output = 1;
-			}
+			vec_safe_push (offload_vars, decl);
 		}
 		}
 	}
@@ -16412,10 +16409,7 @@ c_parser_omp_declare_target (c_parser *parser)
 		{
 		  g->have_offload = true;
 		  if (is_a  (node))
-		{
-		  vec_safe_push (offload_vars, t);
-		  node->force_output = 1;
-		}
+		vec_safe_push (offload_vars, t);
 		}
 	}
 	}
diff --git a/gcc/cp/parser.c b/gcc/cp/parser.c
index a420cf1..340cc4a 100644
--- a/gcc/cp/parser.c
+++ b/gcc/cp/parser.c
@@ -35091,10 +35091,7 @@ cp_parser_oacc_declare (cp_parser *parser, cp_token *pragma_tok)
 		{
 		  g->have_offload = true;
 		  if (is_a  (node))
-			{
-			  vec_safe_push (offload_vars, decl);
-			  node->force_

Re: [gomp4.5] Handle #pragma omp declare target link

2015-12-16 Thread Thomas Schwinge
Hi!

On Mon, 14 Dec 2015 20:17:33 +0300, Ilya Verbin  wrote:
> [updated patch]

This regresses libgomp.oacc-c-c++-common/declare-4.c compilation for
nvptx offloading:

spawn [...]/build-gcc/gcc/xgcc -B[...]/build-gcc/gcc/ 
[...]/source-gcc/libgomp/testsuite/libgomp.oacc-c/../libgomp.oacc-c-c++-common/declare-4.c
 -B[...]/build-gcc/x86_64-pc-linux-gnu/./libgomp/ 
-B[...]/build-gcc/x86_64-pc-linux-gnu/./libgomp/.libs 
-I[...]/build-gcc/x86_64-pc-linux-gnu/./libgomp 
-I[...]/source-gcc/libgomp/testsuite/../../include 
-I[...]/source-gcc/libgomp/testsuite/.. -fmessage-length=0 
-fno-diagnostics-show-caret -fdiagnostics-color=never 
-B/libexec/gcc/x86_64-pc-linux-gnu/6.0.0 -B/bin 
-B[...]/build-gcc/gcc/accel/x86_64-intelmicemul-linux-gnu/fake_install/libexec/gcc/x86_64-pc-linux-gnu/6.0.0
 -B[...]/build-gcc/gcc/accel/x86_64-intelmicemul-linux-gnu/fake_install/bin 
-fopenacc -I[...]/source-gcc/libgomp/testsuite/libgomp.oacc-c-c++-common 
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O2 
-L[...]/build-gcc/x86_64-pc-linux-gnu/./libgomp/.libs -lm -o ./declare-4.exe
ptxas /tmp/ccLXqNjE.o, line 50; error   : State space mismatch between 
instruction and address in instruction 'ld'
ptxas /tmp/ccLXqNjE.o, line 50; error   : Unknown symbol 'b_linkptr'
ptxas /tmp/ccLXqNjE.o, line 50; error   : Label expected for forward 
reference of 'b_linkptr'
ptxas fatal   : Ptx assembly aborted due to errors
nvptx-as: ptxas returned 255 exit status
mkoffload: fatal error: 
[...]/build-gcc/gcc/x86_64-pc-linux-gnu-accel-nvptx-none-gcc returned 1 exit 
status
compilation terminated.

That "b_linkptr" symbol is not declared/referenced in the test case
itself, libgomp/testsuite/libgomp.oacc-c-c++-common/declare-4.c:

/* { dg-do run  { target openacc_nvidia_accel_selected } } */

#include 
#include 

float b;
#pragma acc declare link (b)

#pragma acc routine
int
func (int a)
{
  b = a + 1;

  return b;
}

int
main (int argc, char **argv)
{
  float a;

  a = 2.0;

#pragma acc parallel copy (a)
  {
b = a;
a = 1.0;
a = a + b;
  }

  if (a != 3.0)
abort ();

  a = func (a);

  if (a != 4.0)
abort ();

  return 0;
}

..., but I see that the "b_linkptr" identifier is generated for "b" in
the new gcc/lto/lto.c:offload_handle_link_vars based on whether attribute
"omp declare target link" is set, so maybe we fail to set that one as
appropriate?  Jim, as the main author of the OpenACC declare
implementation, would you please have a look?  I have not yet studied in
detail the thread, starting at
,
that resulted in the trunk r231655 commit:

> gcc/c-family/
>   * c-common.c (c_common_attribute_table): Handle "omp declare target
>   link" attribute.
> gcc/
>   * cgraphunit.c (output_in_order): Do not assemble "omp declare target
>   link" variables in ACCEL_COMPILER.
>   * gimplify.c (gimplify_adjust_omp_clauses): Do not remove mapping of
>   "omp declare target link" variables.
>   * lto/lto.c: Include stringpool.h and fold-const.h.
>   (offload_handle_link_vars): New static function.
>   (lto_main): Call offload_handle_link_vars.
>   * omp-low.c (scan_sharing_clauses): Do not remove mapping of "omp
>   declare target link" variables.
>   (add_decls_addresses_to_decl_constructor): For "omp declare target link"
>   variables output address of the artificial pointer instead of address of
>   the variable.  Set most significant bit of the size to mark them.
>   (pass_data_omp_target_link): New pass_data.
>   (pass_omp_target_link): New class.
>   (find_link_var_op): New static function.
>   (make_pass_omp_target_link): New function.
>   * passes.def: Add pass_omp_target_link.
>   * tree-pass.h (make_pass_omp_target_link): Declare.
>   * varpool.c (symbol_table::output_variables): Do not assemble "omp
>   declare target link" variables in ACCEL_COMPILER.
> libgomp/
>   * libgomp.h (REFCOUNT_LINK): Define.
>   (struct splay_tree_key_s): Add link_key.
>   * target.c (gomp_map_vars): Treat REFCOUNT_LINK objects as not mapped.
>   Replace target address of the pointer with target address of newly
>   mapped object in the splay tree.  Set link pointer on target to the
>   device address of the mapped object.
>   (gomp_unmap_vars): Restore target address of the pointer in the splay
>   tree for REFCOUNT_LINK objects after unmapping.
>   (gomp_load_image_to_device): Set refcount to REFCOUNT_LINK for "omp
>   declare target link" objects.
>   (gomp_unload_image_from_device): Replace j with i.  Force unmap of all
>   "omp declare target link" objects, which were mapped for the image.
>   (gomp_exit_d

Re: ipa-cp heuristics fixes

2015-12-16 Thread Jan Hubicka
> On Thu, Dec 10, 2015 at 08:30:37AM +0100, Jan Hubicka wrote:
> > * ipa-cp.c (ipcp_cloning_candidate_p): Use node->optimize_for_size_p.
> > (good_cloning_opportunity_p): Likewise.
> > (gather_context_independent_values): Do not return true when
> > polymorphic call context is known or when we have known aggregate
> > value of unused parameter.
> > (estimate_local_effects): Try to create clone for all context
> > when either some params are substituted or devirtualization is possible
> > or some params can be removed; use local flag instead of
> > node->will_be_removed_from_program_if_no_direct_calls_p.
> > (identify_dead_nodes): Likewise.
> 
> This commit breaks several guality tests on S/390x:

The patch changes the ipa-cp heuristics so that it will clone in order to
remove an unused parameter.  Previously it knew how to remove an unused
parameter, but would do so only if some parameter of the function (perhaps
the unused one) was also constant.  This was inconsistent.  As a result we
do more ipa-cp cloning, and I suppose we are more likely to hit it in
guality for simplified testcases that often contain unused code:
int __attribute__((noinline))
foo (int arg1, int arg2, int arg3, int arg4, int arg5, int arg6, int arg7)
{
  char *x = __builtin_alloca (arg7);
  int __attribute__ ((aligned(32))) y;

  y = 2;
  asm (NOP : "=m" (y), "=m" (b) : "m" (y));
  x[0] = 25;
  asm (NOP : "=m" (x[0]), "=m" (a) : "m" (x[0]), "m" (b));
  return y;
}

int
main ()
{
  int l = 0;
  asm ("" : "=r" (l) : "0" (l));
  a = foo (l + 1, l + 2, l + 3, l + 4, l + 5, l + 6, l + 30);
  asm volatile ("" :: "r" (l));
  return 0;
}

I am trying to understand Jakub's debug code, and perhaps it can be
improved.  But in the case of optimized-out unused parameters, I think it
is perfectly reasonable to say that the variable was optimized out.

As you can see, the testcase explicitly prevents ipa-cp constant
propagation by the asm statement.  We can just update the testcases to use
the parameters and test the original issue again.
> 
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 arg3 == 3
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 arg4 == 4
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 arg5 == 5
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 arg6 == 6
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 y == 2
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 arg3 == 3
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 arg4 == 4
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 arg5 == 5
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 arg6 == 6
> +FAIL: gcc.dg/guality/pr36728-1.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 y == 2
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 16 arg3 == 3
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 16 arg4 == 4
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 16 arg5 == 5
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 16 arg6 == 6
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 16 y == 2
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 18 arg3 == 3
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 18 arg4 == 4
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 18 arg5 == 5
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 18 arg6 == 6
> +FAIL: gcc.dg/guality/pr36728-1.c   -O3 -g  line 18 y == 2
>  FAIL: gcc.dg/guality/pr36728-2.c   -O2  line 18 *x == (char) 25
>  FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fno-use-linker-plugin 
> -flto-partition=none  line 18 *x == (char) 25
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 arg3 == 3
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 arg4 == 4
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 arg5 == 5
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 arg6 == 6
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 arg7 == 30
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 16 y == 2
>  FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 *x == (char) 25
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 arg3 == 3
> +FAIL: gcc.dg/guality/pr36728-2.c   -O2 -flto -fuse-linker-plugin 
> -fno-fat-lto-objects  line 18 arg4 ==

Re: ipa-cp heuristics fixes

2015-12-16 Thread Jakub Jelinek
On Wed, Dec 16, 2015 at 05:24:25PM +0100, Jan Hubicka wrote:
> I am trying to understand Jakub's debug code and perhaps it can be improved. 
> But in
> the case of optimized out unused parameters I think it is perfectly reasonable
> to
> say that the variable was optimized out.

As long as the values that would be passed to the unused parameters are
> constant or live in memory or registers that aren't clobbered by the call in
the caller, we have the ability to express that in the debug info now, and
we should.

> As you can see, the testcase explicitly prevents ipa-cp constant propagation
> by the
> asm statement.  We can just update the testcases to use the parameters and 
> test
> the original issue again.

No, the testcase intentionally wants to test unused arguments, they happen
in quite a lot of code and "optimized out" is not really a desirable answer if
we can provide the values some other way.

Jakub


Re: C PATCH for c/64637 (better location for -Wunused-value)

2015-12-16 Thread Jeff Law

On 12/16/2015 07:58 AM, Marek Polacek wrote:

The following improves the location for "statement with no effect" warning by
using the location of the expression if available.  Can't use EXPR_LOCATION as
*_DECLs still don't carry a location.

Bootstrapped/regtested on x86_64-linux, ok for trunk?

2015-12-16  Marek Polacek  

PR c/64637
* c-typeck.c (c_process_expr_stmt): Use location of the expression if
available.

* gcc.dg/pr64637.c: New test.

OK.
jeff



RE: [Patch] Fix for MIPS PR target/65604

2015-12-16 Thread Moore, Catherine


> -Original Message-
> From: Steve Ellcey [mailto:sell...@imgtec.com]
> Sent: Tuesday, December 15, 2015 4:09 PM
> To: Moore, Catherine
> Cc: gcc-patches@gcc.gnu.org; matthew.fort...@imgtec.com
> Subject: RE: [Patch] Fix for MIPS PR target/65604
> 
> On Tue, 2015-12-15 at 15:13 +, Moore, Catherine wrote:
> 
> >
> > Hi Steve, the patch is OK.  Will you please add a test case and repost?
> > Thanks,
> > Catherine
> 
> Here is the patch with a test case.
> 

Looks good, thanks.


[PTX] xfail sibcall test

2015-12-16 Thread Nathan Sidwell

PTX doesn't support sibcalls, so this test is doomed to fail.
2015-12-16  Nathan Sidwell  

	* gcc.dg/sibcall-9.c: Xfail for nvptx.

Index: gcc.dg/sibcall-9.c
===
--- gcc.dg/sibcall-9.c	(revision 231689)
+++ gcc.dg/sibcall-9.c	(working copy)
@@ -5,7 +5,7 @@
Copyright (C) 2002 Free Software Foundation Inc.
Contributed by Hans-Peter Nilsson*/
 
-/* { dg-do run { xfail { { cris-*-* crisv32-*-* h8300-*-* hppa*64*-*-* m32r-*-* mcore-*-* mn10300-*-* msp430*-*-* nds32*-*-* xstormy16-*-* v850*-*-* vax-*-* xtensa*-*-* } || { arm*-*-* && { ! arm32 } } } } } */
+/* { dg-do run { xfail { { cris-*-* crisv32-*-* h8300-*-* hppa*64*-*-* m32r-*-* mcore-*-* mn10300-*-* msp430*-*-* nds32*-*-* nvptx-*-* xstormy16-*-* v850*-*-* vax-*-* xtensa*-*-* } || { arm*-*-* && { ! arm32 } } } } } */
 /* -mlongcall disables sibcall patterns.  */
 /* { dg-skip-if "" { powerpc*-*-* } { "-mlongcall" } { "" } } */
 /* { dg-options "-O2 -foptimize-sibling-calls" } */


[PATCH] [graphite] update required isl version

2015-12-16 Thread Sebastian Pop
We check for the isl compute timeout function added in isl 0.13.
That means GCC could still be configured with isl 0.13, 0.14, and 0.15.

* config/isl.m4 (ISL_CHECK_VERSION): Check for
isl_ctx_get_max_operations.
* configure: Regenerate.

gcc/
* config.in: Regenerate.
* configure: Regenerate.
* configure.ac: Remove checks for functions that exist in isl 0.13 or
later.
* graphite-isl-ast-to-gimple.c: Remove #ifdefs and code for isl 0.12.
* graphite-optimize-isl.c: Same.
* graphite-poly.c: Same.
* graphite-sese-to-poly.c: Same.
* graphite.h: Add comment for isl 0.14.
* toplev.c (print_version): Print isl version.
---
 config/isl.m4| 29 +++
 configure| 23 +--
 gcc/config.in| 12 
 gcc/configure| 61 ++--
 gcc/configure.ac | 23 ---
 gcc/graphite-isl-ast-to-gimple.c | 10 ++-
 gcc/graphite-optimize-isl.c  | 34 --
 gcc/graphite-poly.c  |  8 --
 gcc/graphite-sese-to-poly.c  |  8 --
 gcc/graphite.h   |  1 +
 gcc/toplev.c | 10 +--
 11 files changed, 49 insertions(+), 170 deletions(-)

diff --git a/config/isl.m4 b/config/isl.m4
index 459fac1..7387ff2 100644
--- a/config/isl.m4
+++ b/config/isl.m4
@@ -19,23 +19,23 @@
 
 # ISL_INIT_FLAGS ()
 # -
-# Provide configure switches for ISL support.
+# Provide configure switches for isl support.
 # Initialize isllibs/islinc according to the user input.
 AC_DEFUN([ISL_INIT_FLAGS],
 [
   AC_ARG_WITH([isl-include],
 [AS_HELP_STRING(
   [--with-isl-include=PATH],
-  [Specify directory for installed ISL include files])])
+  [Specify directory for installed isl include files])])
   AC_ARG_WITH([isl-lib],
 [AS_HELP_STRING(
   [--with-isl-lib=PATH],
-  [Specify the directory for the installed ISL library])])
+  [Specify the directory for the installed isl library])])
 
   AC_ARG_ENABLE(isl-version-check,
 [AS_HELP_STRING(
   [--disable-isl-version-check],
-  [disable check for ISL version])],
+  [disable check for isl version])],
 ENABLE_ISL_CHECK=$enableval,
 ENABLE_ISL_CHECK=yes)
   
@@ -58,15 +58,15 @@ AC_DEFUN([ISL_INIT_FLAGS],
   if test "x${with_isl_lib}" != x; then
 isllibs="-L$with_isl_lib"
   fi
-  dnl If no --with-isl flag was specified and there is in-tree ISL
+  dnl If no --with-isl flag was specified and there is in-tree isl
   dnl source, set up flags to use that and skip any version tests
-  dnl as we cannot run them before building ISL.
+  dnl as we cannot run them before building isl.
   if test "x${islinc}" = x && test "x${isllibs}" = x \
  && test -d ${srcdir}/isl; then
 isllibs='-L$$r/$(HOST_SUBDIR)/isl/'"$lt_cv_objdir"' '
 islinc='-I$$r/$(HOST_SUBDIR)/isl/include -I$$s/isl/include'
 ENABLE_ISL_CHECK=no
-AC_MSG_WARN([using in-tree ISL, disabling version check])
+AC_MSG_WARN([using in-tree isl, disabling version check])
   fi
 
   isllibs="${isllibs} -lisl"
@@ -75,7 +75,7 @@ AC_DEFUN([ISL_INIT_FLAGS],
 
 # ISL_REQUESTED (ACTION-IF-REQUESTED, ACTION-IF-NOT)
 # 
-# Provide actions for failed ISL detection.
+# Provide actions for failed isl detection.
 AC_DEFUN([ISL_REQUESTED],
 [
   AC_REQUIRE([ISL_INIT_FLAGS])
@@ -106,12 +106,17 @@ AC_DEFUN([ISL_CHECK_VERSION],
 LDFLAGS="${_isl_saved_LDFLAGS} ${isllibs}"
 LIBS="${_isl_saved_LIBS} -lisl"
 
-AC_MSG_CHECKING([for compatible ISL])
-AC_LINK_IFELSE([AC_LANG_PROGRAM([[#include ]], [[;]])],
-   [gcc_cv_isl=yes],
-   [gcc_cv_isl=no])
+AC_MSG_CHECKING([for isl 0.15 (or deprecated 0.14)])
+AC_TRY_LINK([#include ],
+[isl_ctx_get_max_operations (isl_ctx_alloc ());],
+[gcc_cv_isl=yes],
+[gcc_cv_isl=no])
 AC_MSG_RESULT([$gcc_cv_isl])
 
+if test "${gcc_cv_isl}" = no ; then
+  AC_MSG_RESULT([recommended isl version is 0.15, minimum required isl 
version 0.14 is deprecated])
+fi
+
 CFLAGS=$_isl_saved_CFLAGS
 LDFLAGS=$_isl_saved_LDFLAGS
 LIBS=$_isl_saved_LIBS
diff --git a/configure b/configure
index 090615f..a6495c4 100755
--- a/configure
+++ b/configure
@@ -1492,7 +1492,7 @@ Optional Features:
   build static libjava [default=no]
   --enable-bootstrap  enable bootstrapping [yes if native build]
   --disable-isl-version-check
-  disable check for ISL version
+  disable check for isl version
   --enable-ltoenable link time optimization support
   --enable-linker-plugin-configure-flags=FLAGS
   additional flags for configuring linker plugins
@@ -1553,8 +1553,8 @@ Optional Packages:
  

Re: [PATCH][combine] PR rtl-optimization/68651 Try changing rtx from (r + r) to (r << 1) to aid recognition

2015-12-16 Thread Kyrill Tkachov


On 16/12/15 12:18, Bernd Schmidt wrote:

On 12/15/2015 05:21 PM, Kyrill Tkachov wrote:

Then for the shift pattern in the MD file we'd have to dynamically
select the scheduling type depending on whether or not the shift
amount is 1 and the costs line up?


Yes. This isn't unusual, take a look at i386.md where you have a lot of 
switches on attr type to decide which string to print.



I'm just worried that if we take this idea to its logical conclusion, we have 
to add a new canonicalisation rule:
"all (plus x x) expressions shall be expressed as (ashift x 1)".
Such a rule seems too specific to me and all targets would have to special-case 
it in their MD patterns and costs
if they ever wanted to treat an add and a shift differently.
In this particular case we'd have
to conditionalise the scheduling string selection on a particular CPU tuning 
and the shift amount, which will make
the pattern much harder to read.
To implement this properly we'd also have to


The price we pay when trying these substitutions is an iteration over
 the rtx with FOR_EACH_SUBRTX_PTR. recog gets called only if that
iteration actually performed a substitution of x + x into x << 1. Is
that too high a price to pay? (I'm not familiar with the performance
 characteristics of the FOR_EACH_SUBRTX machinery)


It depends on how many of these transforms we are going to try; it also feels 
very hackish, trying to work around the core design of the combiner. IMO it 
would be better for machine descriptions to work with the pass rather than 
against it.



Perhaps I'm lacking the historical context, but what is the core design of the 
combiner?
Why should the backend have to jump through these hoops if it already 
communicates to the midend
(through correct rtx costs) that a shift is more expensive than a plus?
I'd be more inclined to agree that this is perhaps a limitation in recog rather 
than combine,
but still not a backend problem.


If you can somehow arrange for the (plus x x) to be turned into a shift while 
substituting that might be yet another approach to try.



I did investigate where else we could make this transformation.
For the zero_extend+shift case (the ubfiz instruction from the testcase in my 
original submission)
we could fix this by modifying make_extraction to convert its argument to a 
shift from (plus x x)
as, in that context, shifts are undoubtedly more likely to simplify with the 
various extraction
operations that it's trying to perform.

That leaves the other case (orr + shift), where converting to a shift isn't a
simplification in any way, but the backend happens to have an instruction that 
matches the
combined orr+shift form. There we want to perform the transformation purely to 
aid
recognition, not out of any simplification considerations. That's what I'm 
trying to figure out
how to do now.

However, we want to avoid doing it unconditionally because if we have just a 
simple set
of y = x + x we want to leave it as a plus rather than a shift because it's 
cheaper
on that target.

Thanks,
Kyrill


Bernd




Re: [PATCH, PR67627][RFC] broken libatomic multilib parallel build

2015-12-16 Thread Jeff Law

On 12/04/2015 05:39 AM, Szabolcs Nagy wrote:

As described in pr other/67627, the all-multi target can be
built in parallel with the %_.lo targets which generate make
dependencies that are parsed during the build of all-multi.

gcc -MD does not generate the makefile dependencies in an
atomic way so make can fail if it concurrently parses those
half-written files.
(not observed on x86, but happens on arm native builds.)

This workaround forces all-multi to only run after the *_.lo
targets are done, but there might be a better solution using
automake properly. (automake should know about the generated
make dependency files that are included into the makefile so
no manual tinkering is needed to get the right build order,
but I don't know how to do that.)

2015-12-04  Szabolcs Nagy  

 PR other/67627
 * Makefile.am (all-multi): Add dependency.
 * Makefile.in: Regenerate.

ISTM that many of the libraries have this problem.


[law@localhost gcc]$ grep all-am: */Makefile.in | grep -v install
boehm-gc/Makefile.in:all-am: Makefile $(LTLIBRARIES) all-multi
gotools/Makefile.in:all-am: Makefile $(PROGRAMS) $(MANS)
libatomic/Makefile.in:all-am: Makefile $(LTLIBRARIES) all-multi 
auto-config.h

libbacktrace/Makefile.in:all-am: Makefile $(LTLIBRARIES) all-multi config.h
libcc1/Makefile.in:all-am: Makefile $(LTLIBRARIES) cc1plugin-config.h
libcilkrts/Makefile.in:all-am: Makefile $(LTLIBRARIES) all-multi $(HEADERS)
libffi/Makefile.in:all-am: Makefile $(INFO_DEPS) $(LTLIBRARIES) 
all-multi $(DATA) \
libgfortran/Makefile.in:all-am: Makefile $(LTLIBRARIES) all-multi 
$(DATA) $(HEADERS) config.h
libgo/Makefile.in:all-am: Makefile $(LIBRARIES) $(LTLIBRARIES) all-multi 
$(DATA) \
libgomp/Makefile.in:all-am: Makefile $(INFO_DEPS) $(LTLIBRARIES) 
all-multi $(HEADERS) \
libitm/Makefile.in:all-am: Makefile $(INFO_DEPS) $(LTLIBRARIES) 
all-multi $(HEADERS) \
libjava/Makefile.in:all-am: Makefile $(LTLIBRARIES) $(PROGRAMS) 
$(SCRIPTS) all-multi \

libmpx/Makefile.in:all-am: Makefile all-multi $(HEADERS) config.h
liboffloadmic/Makefile.in:all-am: Makefile $(LTLIBRARIES) all-multi 
$(HEADERS) all-local
libquadmath/Makefile.in:all-am: Makefile $(INFO_DEPS) $(LTLIBRARIES) 
all-multi $(HEADERS) \

libsanitizer/Makefile.in:all-am: Makefile all-multi $(HEADERS) config.h
libssp/Makefile.in:all-am: Makefile $(LTLIBRARIES) all-multi $(HEADERS) 
config.h

libstdc++-v3/Makefile.in:all-am: Makefile all-multi config.h
libvtv/Makefile.in:all-am: Makefile $(LTLIBRARIES) all-multi $(HEADERS)
lto-plugin/Makefile.in:all-am: Makefile $(LTLIBRARIES) config.h all-local
zlib/Makefile.in:all-am: Makefile $(LIBRARIES) $(LTLIBRARIES) all-multi


And if you look at the all-multi targets in each of those, all-multi has 
no dependencies.  So any of them which use auto-dependency generation 
are potentially going to bump into this problem.


We can't fix it by twiddling Makefile.in as that's a generated file.  I 
think it has to happen at the automake level.


jeff


Re: [PATCH, PR67627][RFC] broken libatomic multilib parallel build

2015-12-16 Thread Jeff Law

On 12/04/2015 05:39 AM, Szabolcs Nagy wrote:

As described in pr other/67627, the all-multi target can be
built in parallel with the %_.lo targets which generate make
dependencies that are parsed during the build of all-multi.

gcc -MD does not generate the makefile dependencies in an
atomic way so make can fail if it concurrently parses those
half-written files.
(not observed on x86, but happens on arm native builds.)

This workaround forces all-multi to only run after the *_.lo
targets are done, but there might be a better solution using
automake properly. (automake should know about the generated
make dependency files that are included into the makefile so
no manual tinkering is needed to get the right build order,
but I don't know how to do that.)

2015-12-04  Szabolcs Nagy  

 PR other/67627
 * Makefile.am (all-multi): Add dependency.
 * Makefile.in: Regenerate.
So looking at the patch, it looks like you're adding a dependency in 
Makefile.am to pass it through to Makefile.in, which is fine.


So I think you just need to replicate that fix across the other 
libraries which have this problem.


jeff


Fix PR66208

2015-12-16 Thread Bernd Schmidt
This is a relatively straightforward PR where we should mention a macro 
expansion in a warning message. The patch below implements the 
suggestion by Marek to pass a location down from 
build_function_call_vec. Ok if tests pass on x86_64-linux?


One question I have is about -Wformat, which is dealt with in the same 
area. We do get a mention of the macro expansion, but in a different style:


/* { dg-do compile } */
/* { dg-options "-Wformat" } */

int foox (const char *, ...) __attribute__ ((format (printf, 1, 2)));

#define foo(p, x) foox (p, x)
#define foo3(p, x, y) foox (p, x, y)

void
bar (unsigned int x)
{
  foo ("%x", "fish");
  foo3 ("%x", 3, "fish");
}

macroloc2.c: In function ‘bar’:
macroloc2.c:12:8: warning: format ‘%x’ expects argument of type 
‘unsigned int’, but argument 2 has type ‘char *’ [-Wformat=]

   foo ("%x", "fish");
^
macroloc2.c:6:25: note: in definition of macro ‘foo’
 #define foo(p, x) foox (p, x)
 ^
macroloc2.c:13:9: warning: too many arguments for format 
[-Wformat-extra-args]

   foo3 ("%x", 3, "fish");
 ^
macroloc2.c:7:29: note: in definition of macro ‘foo3’
 #define foo3(p, x, y) foox (p, x, y)
 ^

Is this what we're looking for in terms of output?


Bernd
	PR c/66208
	* c-common.c (check_function_nonnull): Remove unnecessary declaration.
	Add new arg loc and pass it down as context.
	(check_nonnull_arg): Don't mark ctx arg as unused. Use it as a pointer
	to the location to use for the warning.
	(check_function_arguments): New arg loc.  All callers changed.  Pass
	it to check_function_nonnull.
	* c-common.h (check_function_arguments): Adjust declaration.

testsuite/
	PR c/66208
	* c-c++-common/pr66208.c: New file.

diff --git a/gcc/c-family/c-common.c b/gcc/c-family/c-common.c
index 369574f..6074d14 100644
--- a/gcc/c-family/c-common.c
+++ b/gcc/c-family/c-common.c
@@ -395,7 +395,6 @@ static tree handle_bnd_variable_size_attribute (tree *, tree, tree, int, bool *)
 static tree handle_bnd_legacy (tree *, tree, tree, int, bool *);
 static tree handle_bnd_instrument (tree *, tree, tree, int, bool *);
 
-static void check_function_nonnull (tree, int, tree *);
 static void check_nonnull_arg (void *, tree, unsigned HOST_WIDE_INT);
 static bool nonnull_check_p (tree, unsigned HOST_WIDE_INT);
 static bool get_nonnull_operand (tree, unsigned HOST_WIDE_INT *);
@@ -9611,11 +9610,10 @@ handle_nonnull_attribute (tree *node, tree ARG_UNUSED (name),
 
 /* Check the argument list of a function call for null in argument slots
that are marked as requiring a non-null pointer argument.  The NARGS
-   arguments are passed in the array ARGARRAY.
-*/
+   arguments are passed in the array ARGARRAY.  */
 
 static void
-check_function_nonnull (tree attrs, int nargs, tree *argarray)
+check_function_nonnull (location_t loc, tree attrs, int nargs, tree *argarray)
 {
   tree a;
   int i;
@@ -9635,7 +9633,7 @@ check_function_nonnull (tree attrs, int nargs, tree *argarray)
 
   if (a != NULL_TREE)
 for (i = 0; i < nargs; i++)
-  check_function_arguments_recurse (check_nonnull_arg, NULL, argarray[i],
+  check_function_arguments_recurse (check_nonnull_arg, &loc, argarray[i],
 	i + 1);
   else
 {
@@ -9651,7 +9649,7 @@ check_function_nonnull (tree attrs, int nargs, tree *argarray)
 	}
 
 	  if (a != NULL_TREE)
-	check_function_arguments_recurse (check_nonnull_arg, NULL,
+	check_function_arguments_recurse (check_nonnull_arg, &loc,
 	  argarray[i], i + 1);
 	}
 }
@@ -9737,9 +9735,10 @@ nonnull_check_p (tree args, unsigned HOST_WIDE_INT param_num)
via check_function_arguments_recurse.  */
 
 static void
-check_nonnull_arg (void * ARG_UNUSED (ctx), tree param,
-		   unsigned HOST_WIDE_INT param_num)
+check_nonnull_arg (void *ctx, tree param, unsigned HOST_WIDE_INT param_num)
 {
+  location_t *ploc = (location_t *) ctx;
+
   /* Just skip checking the argument if it's not a pointer.  This can
  happen if the "nonnull" attribute was given without an operand
  list (which means to check every pointer argument).  */
@@ -9748,8 +9748,8 @@ check_nonnull_arg (void * ARG_UNUSED (ctx), tree param,
 return;
 
   if (integer_zerop (param))
-warning (OPT_Wnonnull, "null argument where non-null required "
-	 "(argument %lu)", (unsigned long) param_num);
+warning_at (*ploc, OPT_Wnonnull, "null argument where non-null required "
+		"(argument %lu)", (unsigned long) param_num);
 }
 
 /* Helper for nonnull attribute handling; fetch the operand number
@@ -10198,15 +10198,17 @@ handle_designated_init_attribute (tree *node, tree name, tree, int,
 
 
 /* Check for valid arguments being passed to a function with FNTYPE.
-   There are NARGS arguments in the array ARGARRAY.  */
+   There are NARGS arguments in the array ARGARRAY.  LOC should be used for
+   diagnostics.  */
 void
-check_function_arguments (const_tree fntype, int nargs, tree *argarray)
+check_function_arguments (location_t loc,

Re: ipa-cp heuristics fixes

2015-12-16 Thread Jan Hubicka
> On Wed, Dec 16, 2015 at 05:24:25PM +0100, Jan Hubicka wrote:
> > I am trying to understand Jakub's debug code and perhaps it can be 
> > improved. But in
> > the case of optimized out unused parameters I think it is perfectly 
> > reasonable to
> > say that the variable was optimized out.
> 
> As long as the values that would be passed to the unused parameters are
> > constant or live in memory or registers that aren't clobbered by the call in
> the caller, we have the ability to express that in the debug info now, and
> we should.
int
main ()
{
  int l = 0;
  asm ("" : "=r" (l) : "0" (l));
  a = foo (l + 1, l + 2, l + 3, l + 4, l + 5, l + 6, l + 30);
  asm volatile ("" :: "r" (l));
  return 0;
}

the unused parameters are not constant because of the asm block and we simply 
do not
compute them if cloning happens.  The following patch to the testcase
Index: testsuite/gcc.dg/guality/pr36728-1.c
===
--- testsuite/gcc.dg/guality/pr36728-1.c(revision 231022)
+++ testsuite/gcc.dg/guality/pr36728-1.c(working copy)
@@ -7,7 +7,7 @@
 int a, b;

 int __attribute__((noinline))
-foo (int arg1, int arg2, int arg3, int arg4, int arg5, int arg6, int arg7)
+foo (int arg1, int arg2, int arg3, int arg4, int arg5, int arg6, int arg7, int 
cst)
 { 
   char *x = __builtin_alloca (arg7);
   int __attribute__ ((aligned(32))) y;
@@ -48,7 +48,7 @@ main ()
 { 
   int l = 0;
   asm ("" : "=r" (l) : "0" (l));
-  a = foo (l + 1, l + 2, l + 3, l + 4, l + 5, l + 6, l + 30);
+  a = foo (l + 1, l + 2, l + 3, l + 4, l + 5, l + 6, l + 30, 7);
   asm volatile ("" :: "r" (l));
   return 0;
 }

makes it fail on GCC 5 tree, too.  The extra unused constant argument makes
GCC 5 clone foo and optimize out l, too.

Honza
> 
> > As you can see, the testcase explicitly prevents ipa-cp constant
> > propagation by the
> > asm statement.  We can just update the testcases to use the parameters and 
> > test
> > the original issue again.
> 
> No, the testcase intentionally wants to test unused arguments, they happen
> in quite a lot of code and "optimized out" is not really a desirable answer if
> we can provide the values some other way.
> 
>   Jakub


Re: [PATCH][combine] PR rtl-optimization/68651 Try changing rtx from (r + r) to (r << 1) to aid recognition

2015-12-16 Thread Jeff Law

On 12/16/2015 10:00 AM, Kyrill Tkachov wrote:


On 16/12/15 12:18, Bernd Schmidt wrote:

On 12/15/2015 05:21 PM, Kyrill Tkachov wrote:

Then for the shift pattern in the MD file we'd have to
dynamically select the scheduling type depending on whether or
not the shift amount is 1 and the costs line up?


Yes. This isn't unusual, take a look at i386.md where you have a
lot of switches on attr type to decide which string to print.



I'm just worried that if we take this idea to its logical conclusion,
we have to add a new canonicalisation rule: "all (plus x x)
expressions shall be expressed as (ashift x 1)". Such a rule seems
too specific to me and all targets would have to special-case it in
their MD patterns and costs if they ever wanted to treat an add and a
shift differently. In this particular case we'd have to
conditionalise the scheduling string selection on a particular CPU
tuning and the shift amount, which will make the pattern much harder
to read. To implement this properly we'd also have to
That's not terribly unusual.  And we've done those kinds of
canonicalization rules before -- most recently to deal with issues in 
combine we settled on canonicalization rules for ashift vs mult.  While 
there was fallout, it's manageable.






The price we pay when trying these substitutions is an iteration
over the rtx with FOR_EACH_SUBRTX_PTR. recog gets called only if
that iteration actually performed a substitution of x + x into x
<< 1. Is that too high a price to pay? (I'm not familiar with the
performance characteristics of the FOR_EACH_SUBRTX machinery)


It depends on how many of these transforms we are going to try; it
 also feels very hackish, trying to work around the core design of
the combiner. IMO it would be better for machine descriptions to
work with the pass rather than against it.



Perhaps I'm lacking the historical context, but what is the core
design of the combiner? Why should the backend have to jump through
these hoops if it already communicates to the midend (through correct
rtx costs) that a shift is more expensive than a plus? I'd be more
inclined to agree that this is perhaps a limitation in recog rather
than combine, but still not a backend problem.

The historical design of combine is pretty simple.

Use data dependence to substitute the definition of an operand in a use
of the operand. Essentially create bigger blobs of RTL. Canonicalize and
simplify that larger blob of RTL, then try to match it with a pattern in
the backend.

Note that costing didn't enter the picture. The assumption was that if
the combination succeeds, then it's profitable (fewer insns).  We 
haven't generally encouraged trying to match multiple forms of 
equivalent expressions; instead we declare a canonical form and make
sure combine uses it.






If you can somehow arrange for the (plus x x) to be turned into a
shift while substituting that might be yet another approach to
try.



I did investigate where else we could make this transformation. For
the zero_extend+shift case (the ubfiz instruction from the testcase
in my original submission) we could fix this by modifying
make_extraction to convert its argument to a shift from (plus x x)
as, in that context, shifts are undoubtedly more likely to simplify
with the various extraction operations that it's trying to perform.
Note that canonicalizing (plus x x) to (ashift x 1) is consistent with 
the canonicalization we do for (mult x C) to (ashift x log2 (C)) where C 
is an exact power of two.


When we made that change consistently (there were cases where we instead 
preferred MULT in the past), we had to fix some backends, but the 
fallout wasn't terrible.


I would think from a representational standpoint canonicalizing (plus x 
x) to (ashift x 1) would be generally a good thing.



jeff


Re: ipa-cp heuristics fixes

2015-12-16 Thread Jakub Jelinek
On Wed, Dec 16, 2015 at 06:15:33PM +0100, Jan Hubicka wrote:
> > On Wed, Dec 16, 2015 at 05:24:25PM +0100, Jan Hubicka wrote:
> > > I am trying to understand Jakub's debug code and perhaps it can be 
> > > improved. But in
> > > the case of optimized out unused parameters I think it is perfectly 
> > > reasonable to
> > > say that the variable was optimized out.
> > 
> > As long as the values that would be passed to the unused parameters are
> > constant or live in memory or registers that aren't clobbered by the call in
> > the caller, we have the ability to express that in the debug info now, and
> > we should.
> int
> main ()
> {
>   int l = 0;
>   asm ("" : "=r" (l) : "0" (l));
>   a = foo (l + 1, l + 2, l + 3, l + 4, l + 5, l + 6, l + 30);
>   asm volatile ("" :: "r" (l));
>   return 0;
> }
> 
> the unused parameters are not constant because of the asm block and we simply 
> do not
> compute them if cloning happens.  The following patch to the testcase

So, we had a debuginfo QoI bug before, but with the change we hit that now
way more often than in the past, which makes it a debuginfo quality regression.

But, we still do emit the right debuginfo stuff for
volatile int vv;

static __attribute__ ((noinline))
int f1 (int x, int y, int z)
{
  int a = x * 2;
  int b = y * 2;
  int c = z * 2;
  vv++;
  return x + z;
}

int
u1 (int x)
{
  return f1 (x, 2, 3);
}

and I really can't see what is the difference between that and say
the pr36728-1.c testcase or the one with your modification to also pass
in a constant.  This one also has one used argument, some unused ones, and
some where a constant is passed.  BTW, if I adjust your modified
pr36728-1.c so that cst is used somewhere, then that parameter is passed,
rather than optimized away as I'd expect (and as happens on the above
testcase).
So, clearly there are multiple ways to perform the same parameter
optimizations, one in isra (though I don't see anything to scalarize above),
another one in ipa-cp or elsewhere, and only some of them hit the debug info
creation for the optimized away arguments.
Thus the question is why we have multiple spots to do the same thing,
what is so different between the testcases, and what do we need to change
to emit the debug info we should, and what we should change to emit
the cst below if it is used, if we already clone it anyway.

> Index: testsuite/gcc.dg/guality/pr36728-1.c
> ===
> --- testsuite/gcc.dg/guality/pr36728-1.c(revision 231022)
> +++ testsuite/gcc.dg/guality/pr36728-1.c(working copy)
> @@ -7,7 +7,7 @@
>  int a, b;
> 
>  int __attribute__((noinline))
> -foo (int arg1, int arg2, int arg3, int arg4, int arg5, int arg6, int arg7)
> +foo (int arg1, int arg2, int arg3, int arg4, int arg5, int arg6, int arg7, 
> int cst)
>  { 
>char *x = __builtin_alloca (arg7);
>int __attribute__ ((aligned(32))) y;
> @@ -48,7 +48,7 @@ main ()
>  { 
>int l = 0;
>asm ("" : "=r" (l) : "0" (l));
> -  a = foo (l + 1, l + 2, l + 3, l + 4, l + 5, l + 6, l + 30);
> +  a = foo (l + 1, l + 2, l + 3, l + 4, l + 5, l + 6, l + 30, 7);
>asm volatile ("" :: "r" (l));
>return 0;
>  }
> 
> makes it fail on GCC 5 tree, too. The extra unused constant argument makes 
> GCC 5
> to clone foo and  optimize out l, too.

Jakub


[PATCH] [ARM] PR68532: Fix VUZP and VZIP recognition on big endian

2015-12-16 Thread Charles Baylis
Hi

This patch addresses incorrect recognition of VEC_PERM_EXPRs as VUZP
and VZIP on armeb-* targets. It also fixes the definition of the
vuzpq_* and vzipq_* NEON intrinsics, which use incorrect lane
specifiers in the use of __builtin_shuffle().

The problem with arm_neon.h can be seen by temporarily altering
arm_expand_vec_perm_const_1() to unconditionally return false. If this
is done, the vuzp/vzip tests in the advsimd execution tests will fail.
With these patches, this is no longer the case.

The problem is caused by the weird mapping of architectural lane order
to gcc lane order in big endian. For 64 bit vectors, the order is
simply reversed, but 128 bit vectors are treated as 2 64 bit vectors
where the lane ordering is reversed inside those. This is due to the
memory ordering defined by the EABI. There is a large comment in
gcc/config/arm/arm.c above output_move_neon(), which describes this in more
detail.

The arm_evpc_neon_vuzp() and arm_evpc_neon_vzip() functions do not
allow for this lane order, instead treating the lane order as simply
reversed in 128 bit vectors. These patches fix this. I have included a
test case for vuzp, but I don't have one for vzip.

Tested with make check on arm-unknown-linux-gnueabihf with no regressions
Tested with make check on armeb-unknown-linux-gnueabihf. Some
gcc.dg/vect tests fail due to no longer being vectorized. I haven't
analysed these, but it is expected since vuzp is not usable for the
shuffle patterns for which it was previously used. There are also a
few new PASSes.


Patch 1 (vuzp):

gcc/ChangeLog:

2015-12-15  Charles Baylis  

* config/arm/arm.c (arm_neon_endian_lane_map): New function.
(arm_neon_vector_pair_endian_lane_map): New function.
(arm_evpc_neon_vuzp): Allow for big endian lane order.
* config/arm/arm_neon.h (vuzpq_s8): Adjust shuffle patterns for big
endian.
(vuzpq_s16): Likewise.
(vuzpq_s32): Likewise.
(vuzpq_f32): Likewise.
(vuzpq_u8): Likewise.
(vuzpq_u16): Likewise.
(vuzpq_u32): Likewise.
(vuzpq_p8): Likewise.
(vuzpq_p16): Likewise.

gcc/testsuite/ChangeLog:

2015-12-15  Charles Baylis  

* gcc.c-torture/execute/pr68532.c: New test.


Patch 2 (vzip)

gcc/ChangeLog:

2015-12-15  Charles Baylis  

* config/arm/arm.c (arm_evpc_neon_vzip): Allow for big endian lane
order.
* config/arm/arm_neon.h (vzipq_s8): Adjust shuffle patterns for big
endian.
(vzipq_s16): Likewise.
(vzipq_s32): Likewise.
(vzipq_f32): Likewise.
(vzipq_u8): Likewise.
(vzipq_u16): Likewise.
(vzipq_u32): Likewise.
(vzipq_p8): Likewise.
(vzipq_p16): Likewise.


0001-ARM-Fix-up-vuzp-for-big-endian.patch
Description: application/download


0002-ARM-Fix-up-vzip-recognition-for-big-endian.patch
Description: application/download


[PATCH] C and C++ FE: better source ranges for binary ops

2015-12-16 Thread David Malcolm
Currently trunk emits range information for most bad binary operations
in the C++ frontend; but not in the C frontend.

The helper function binary_op_error shared by C and C++ takes a
location_t.  In the C++ frontend, a location_t containing the range
has already been built, so we get the underline when it later validates
the expression: e.g. this example from:
   https://gcc.gnu.org/wiki/ClangDiagnosticsComparison

t.cc:9:19: error: no match for ‘operator+’ (operand types are ‘int’ and ‘foo’)
   return P->bar() + *P;
  ~^~~~

In the C frontend, the "full" location_t is only built
after type-checking, so if type-checking fails, we get a caret/range
covering just the operator so e.g. for:

   return (some_function ()
+ some_other_function ());

we get just:

gcc.dg/bad-binary-ops.c:29:4: error: invalid operands to binary + (have 'struct 
s' and 'struct t')
+ some_other_function ());
^

The following patch updates binary_op_error to accept a rich_location *.
For the C++ frontend we populate it with just the location_t as before,
but for the C frontend we can add locations for the operands, giving
this underlining for the example above:

bad-binary-ops.c:29:4: error: invalid operands to binary + (have 'struct s' and 
'struct t')
   return (some_function ()
   
+ some_other_function ());
^ ~~

Additionally, in the C++ frontend, cp_build_binary_op has an "error"
call, which implicitly uses input_location, giving this reduced
information for another test case from
https://gcc.gnu.org/wiki/ClangDiagnosticsComparison:

bad-binary-ops.C:10:14: error: invalid operands of types '__m128 {aka float}' 
and 'const int*' to binary 'operator/'
   myvec[1] / ptr;
  ^~~

The patch updates it to use error_at with the location_t provided,
fixing the above to become:

bad-binary-ops.C:10:12: error: invalid operands of types '__m128 {aka float}' 
and 'const int*' to binary 'operator/'
   myvec[1] / ptr;
   ~^

Finally, since this patch adds a usage of
gcc_rich_location::maybe_add_expr to cc1, we can drop the hacked-in
copy of gcc_rich_location::add_expr from
gcc.dg/plugin/diagnostic_plugin_show_trees.c.

Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu;
adds 15 new PASS results in g++.sum and 7 new PASS results in gcc.sum

OK for trunk in stage 3?

gcc/c-family/ChangeLog:
* c-common.c (binary_op_error): Convert first param from
location_t to rich_location * and use it when emitting an error.
* c-common.h (binary_op_error): Convert first param from
location_t to rich_location *.

gcc/c/ChangeLog:
* c-typeck.c: Include "gcc-rich-location.h".
(build_binary_op): In the two places that call binary_op_error,
create a gcc_rich_location and populate it with the location of
the binary op and its two operands.

gcc/cp/ChangeLog:
* typeck.c (cp_build_binary_op): Update for change in signature
of build_binary_op.  Use error_at to replace an implicit use
of input_location with param "location" in "invalid operands"
error.
(cp_build_binary_op): Replace an error with an error_at, using
"location", rather than implicitly using input_location.

gcc/testsuite/ChangeLog:
* g++.dg/diagnostic/bad-binary-ops.C: New test case.
* gcc.dg/bad-binary-ops.c: New test case.
gcc.dg/plugin/diagnostic_plugin_show_trees.c (get_range_for_expr):
Remove material copied from gcc-rich-location.c
(gcc_rich_location::add_expr): Likewise.
---
 gcc/c-family/c-common.c| 21 +++---
 gcc/c-family/c-common.h|  2 +-
 gcc/c/c-typeck.c   | 11 -
 gcc/cp/typeck.c| 13 --
 gcc/testsuite/g++.dg/diagnostic/bad-binary-ops.C   | 44 
 gcc/testsuite/gcc.dg/bad-binary-ops.c  | 48 ++
 .../gcc.dg/plugin/diagnostic_plugin_show_trees.c   | 44 
 7 files changed, 128 insertions(+), 55 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/diagnostic/bad-binary-ops.C
 create mode 100644 gcc/testsuite/gcc.dg/bad-binary-ops.c

diff --git a/gcc/c-family/c-common.c b/gcc/c-family/c-common.c
index 4250cdf..653d1dc 100644
--- a/gcc/c-family/c-common.c
+++ b/gcc/c-family/c-common.c
@@ -3795,10 +3795,21 @@ c_register_builtin_type (tree type, const char* name)
 
 /* Print an error message for invalid operands to arith operation
CODE with TYPE0 for operand 0, and TYPE1 for operand 1.
-   LOCATION is the location of the message.  */
+   RICHLOC is a rich location for the message, containing either
+   three separate locations for each of the operator and operands
+
+  lhs op rhs
+  ~~~ ^~ ~~~
+
+   (C FE), or one location ranging over all over them
+
+  lhs op rhs
+  ^~
+
+   (C++ FE).  */
 
 void
-binary_op_error (

C++ PATCH for c++/63628 (generic lambdas and variadic capture)

2015-12-16 Thread Jason Merrill
My patch for 63809 fixed non-capturing use of a parameter pack in a 
regular lambda, but not in a generic lambda, where we can't rely on 
being instantiated within the enclosing context.  So we use the existing 
support for references to parameters from trailing return type, another 
situation where we have a reference to a pack from unevaluated context 
and don't have a local_specialization to help us.


With this change, my patch to copy the local_specializations table is no 
longer needed, so I'm reverting it (but leaving the hash table copy 
constructors).


Tested x86_64-pc-linux-gnu, applying to trunk.
commit 46101839657233781f6fc9e824971cfbf8202a1e
Author: Jason Merrill 
Date:   Tue Dec 15 16:43:02 2015 -0500

	PR c++/63628
	* pt.c (tsubst_pack_expansion): Also make dummy decls if
	retrieve_local_specialization fails.

diff --git a/gcc/cp/pt.c b/gcc/cp/pt.c
index a45e6df..58742b0 100644
--- a/gcc/cp/pt.c
+++ b/gcc/cp/pt.c
@@ -10794,12 +10794,16 @@ tsubst_pack_expansion (tree t, tree args, tsubst_flags_t complain,
 	  if (PACK_EXPANSION_LOCAL_P (t) || CONSTRAINT_VAR_P (parm_pack))
 	arg_pack = retrieve_local_specialization (parm_pack);
 	  else
+	/* We can't rely on local_specializations for a parameter
+	   name used later in a function declaration (such as in a
+	   late-specified return type).  Even if it exists, it might
+	   have the wrong value for a recursive call.  */
+	need_local_specializations = true;
+
+	  if (!arg_pack)
 	{
-	  /* We can't rely on local_specializations for a parameter
-		 name used later in a function declaration (such as in a
-		 late-specified return type).  Even if it exists, it might
-		 have the wrong value for a recursive call.  Just make a
-		 dummy decl, since it's only used for its type.  */
+	  /* This parameter pack was used in an unevaluated context.  Just
+		 make a dummy decl, since it's only used for its type.  */
 	  arg_pack = tsubst_decl (parm_pack, args, complain);
 	  if (arg_pack && DECL_PACK_P (arg_pack))
 		/* Partial instantiation of the parm_pack, we can't build
@@ -10807,7 +10811,6 @@ tsubst_pack_expansion (tree t, tree args, tsubst_flags_t complain,
 		arg_pack = NULL_TREE;
 	  else
 		arg_pack = make_fnparm_pack (arg_pack);
-	  need_local_specializations = true;
 	}
 	}
   else if (TREE_CODE (parm_pack) == FIELD_DECL)
diff --git a/gcc/testsuite/g++.dg/cpp1y/lambda-generic-variadic3.C b/gcc/testsuite/g++.dg/cpp1y/lambda-generic-variadic3.C
new file mode 100644
index 000..9b3455a
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp1y/lambda-generic-variadic3.C
@@ -0,0 +1,15 @@
+// PR c++/63628
+// { dg-do compile { target c++14 } }
+
+auto const pack = [](auto&&... t)
+{
+  return [&](auto&& f)->decltype(auto)
+  {
+return f(static_cast(t)...);
+  };
+};
+
+int main(int argc, char** argv) {
+  pack(1)([](int){});
+  return 0;
+}


Re: [PATCH] Better error recovery for merge-conflict markers (v4)

2015-12-16 Thread David Malcolm
On Wed, 2015-12-09 at 18:44 +0100, Bernd Schmidt wrote:
> On 12/09/2015 05:58 PM, David Malcolm wrote:
> > On Wed, 2015-11-04 at 14:56 +0100, Bernd Schmidt wrote:
> >>
> >> This seems like fairly low impact but also low cost, so I'm fine with it
> >> in principle. I wonder whether the length of the marker is the same
> >> across all versions of patch (and VC tools)?
> >
> > It's hardcoded for GNU patch:
> [...]
>  From what I can tell, Perforce is the outlier here.
> 
> Thanks for checking all that.
> 
> >> Just thinking out loud - I guess it would be too much to hope for to
> >> share lexers between frontends so that we need only one copy of this?
> >
> > Probably :(
> 
> Someone slap sense into me, I just thought of deriving C and C++ parsers 
> from a common base class... (no this is not a suggestion for this patch).
> 
> > Would a better wording be:
> >
> > extern short some_var; /* This line would lead to a warning due to the
> >duplicate name, but it is skipped when handling
> >the conflict marker.  */
> 
> I think so, yes.
> 
> > That said, it's not clear they're always at the beginning of a line;
> > this bazaar bug indicates that CVS (and bazaar) can emit them
> > mid-line:
> >https://bugs.launchpad.net/bzr/+bug/36399
> 
> Ok. CVS I think we shouldn't worry about, and it looks like this is one 
> particular bug/corner case where the conflict end marker is the last 
> thing in the file. I think on the whole it's best to check for beginning 
> of the line as you've done.
> 
> > Wording-wise, should it be "merge conflict marker", rather
> > than "patch conflict marker"?
> >
> > Clang spells it:
> > "error: version control conflict marker in file"
> > http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html#merge_conflicts
> 
> Yeah, if another compiler has a similar/identical diagnostic I think we 
> should just copy that unless there's a very good reason not to.
> 
> > Rebased on top of r231445 (from yesterday).
> > Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > Adds 82 new PASSes to g++.sum and 27 new PASSes to gcc.sum.
> >
> > OK for trunk?
> 
> I'm inclined to say yes since it was originally submitted in time and 
> it's hard to imagine how the change could be risky (I'll point out right 
> away that there are one or two other patches in the queue that were also 
> submitted in time which I feel should not be considered for gcc-6 at 
> this point due to risk).
> 
> Let's wait until the end of the week for objections, commit then.

Thanks.  I updated it based on the feedback above, including changing
the wording to match clang's:
 "error: version control conflict marker in file"
I replaced "patch conflict marker" with "conflict marker" in the code
and names of test cases.

I took the liberty of adding:
  gcc_assert (n > 0);
to c_parser_peek_nth_token based on Martin's feedback.

Having verified bootstrap & regrtest (on x86_64-pc-linux-gnu), I've
committed it to trunk as r231712.

I'm attaching what I committed, for reference.
Index: gcc/c-family/ChangeLog
===
--- gcc/c-family/ChangeLog	(revision 231711)
+++ gcc/c-family/ChangeLog	(revision 231712)
@@ -1,3 +1,8 @@
+2015-12-16  David Malcolm  
+
+	* c-common.h (conflict_marker_get_final_tok_kind): New prototype.
+	* c-lex.c (conflict_marker_get_final_tok_kind): New function.
+
 2015-12-15  Ilya Verbin  
 
 	* c-common.c (c_common_attribute_table): Handle "omp declare target
Index: gcc/c-family/c-lex.c
===
--- gcc/c-family/c-lex.c	(revision 231711)
+++ gcc/c-family/c-lex.c	(revision 231712)
@@ -1263,3 +1263,29 @@
 
   return value;
 }
+
+/* Helper function for c_parser_peek_conflict_marker
+   and cp_lexer_peek_conflict_marker.
+   Given a possible conflict marker token of kind TOK1_KIND
+   consisting of a pair of characters, get the token kind for the
+   standalone final character.  */
+
+enum cpp_ttype
+conflict_marker_get_final_tok_kind (enum cpp_ttype tok1_kind)
+{
+  switch (tok1_kind)
+{
+default: gcc_unreachable ();
+case CPP_LSHIFT:
+  /* "<<" and '<' */
+  return CPP_LESS;
+
+case CPP_EQ_EQ:
+  /* "==" and '=' */
+  return CPP_EQ;
+
+case CPP_RSHIFT:
+  /* ">>" and '>' */
+  return CPP_GREATER;
+}
+}
Index: gcc/c-family/c-common.h
===
--- gcc/c-family/c-common.h	(revision 231711)
+++ gcc/c-family/c-common.h	(revision 231712)
@@ -1089,6 +1089,10 @@
 extern int c_gimplify_expr (tree *, gimple_seq *, gimple_seq *);
 extern tree c_build_bind_expr (location_t, tree, tree);
 
+/* In c-lex.c.  */
+extern enum cpp_ttype
+conflict_marker_get_final_tok_kind (enum cpp_ttype tok1_kind);
+
 /* In c-pch.c  */
 extern void pch_init (void);
 extern void pch_cpp_save_state (void);
Index: gcc/c/ChangeLog
=

Re: [PATCH] Better error recovery for merge-conflict markers (v5)

2015-12-16 Thread David Malcolm
On Wed, 2015-12-16 at 00:52 +0100, Bernd Schmidt wrote:
> On 12/15/2015 08:30 PM, David Malcolm wrote:
> 
> > I got thinking about what we'd have to do to support Perforce-style
> > markers, and began to find my token-matching approach to be a little
> > clunky (in conjunction with reading Martin's observations on
> > c_parser_peek_nth_token).
> >
> > Here's a reimplementation of the patch which takes a much simpler
> > approach, and avoids the need to touch the C lexer: check that we're
> > not in a macro expansion and then read in the source line, and
> > textually compare against the various possible conflict markers.
> > This adds the requirement that the source file be readable, so it
> > won't detect conflict markers in a .i file from -save-temps,
> 
> How come? Is source file defined as the one before preprocessing?

Yes, unless you manually strip the #line directives.  So unless the
original source files are still around in the right path relative to
where you're compiling the .i file, the calls to
location_get_source_line will fail.

> And I do think this is an unfortunate limitation (given that we often 
> load .i files into cc1 for debugging and we'd ideally like that to be 
> consistent with normal compilation as much as possible). I'd rather go 
> with the original patch based on this.

(nods)  This can be a pain when debugging diagnostic_show_locus, but
there's not much that can be done about it.
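A rough sketch of the textual comparison discussed in this thread follows. The exact marker strings are assumptions based on the earlier discussion (GNU patch and git emit seven repeated characters at the start of a line, Perforce emits four):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical sketch: does a source line begin with a version-control
   conflict marker?  The marker strings below are assumptions, not
   lifted from the committed patch.  */
static bool
starts_with_conflict_marker (const char *line)
{
  static const char *const markers[] = {
    "<<<<<<< ", "||||||| ", "=======", ">>>>>>> ",  /* diff3/git style */
    ">>>> ", "<<<< ", "==== "                       /* Perforce style */
  };
  for (size_t i = 0; i < sizeof markers / sizeof markers[0]; i++)
    if (strncmp (line, markers[i], strlen (markers[i])) == 0)
      return true;
  return false;
}
```

The check is anchored at the start of the line, consistent with the conclusion above that mid-line markers (the CVS/bazaar corner case) are not worth handling.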



Fix size of enum bitfield in recently added test

2015-12-16 Thread Jeff Law


Matthew pointed out this test was failing for arm-none-eabi because the 
rtx_code enum is represented in 8 bits which causes this error:


pr68619-4.c:42:17: error: width of 'code' exceeds its type
   enum rtx_code code:16;


I changed the size of the bitfield in the obvious way.  I verified all 
the other pr68619 tests on arm-none-eabi as well as verifying x86_64's 
was happy with the change to pr68619-4.c.


Installed on the trunk.

Jeff
commit 7b42c818ec41486da307b50f504d29d20086e8a8
Author: law 
Date:   Wed Dec 16 18:53:25 2015 +

* gcc.dg/tree-ssa/pr68619-4.c: Change size of code bitfield.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@231717 
138bc75d-0d04-0410-961f-82ee72b054a4

diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 83cf7ac..9ce80b1 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,7 @@
+2015-12-13  Jeff Law  
+
+   * gcc.dg/tree-ssa/pr68619-4.c: Change size of code bitfield.
+
 2015-12-16  David Malcolm  
 
* c-c++-common/conflict-markers-1.c: New testcase.
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr68619-4.c 
b/gcc/testsuite/gcc.dg/tree-ssa/pr68619-4.c
index da3cdb9..6c7d180 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr68619-4.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr68619-4.c
@@ -39,7 +39,7 @@ union rtunion_def
 typedef union rtunion_def rtunion;
 struct rtx_def
 {
-  enum rtx_code code:16;
+  enum rtx_code code:8;
   union u
   {
 rtunion fld[1];


Re: [RFA] [PATCH] Fix invalid redundant extension elimination for rl78 port

2015-12-16 Thread Jeff Law

On 12/01/2015 12:32 PM, Richard Sandiford wrote:

Jeff Law  writes:

@@ -1080,6 +1070,18 @@ add_removable_extension (const_rtx expr, rtx_insn *insn,
  }
  }

+  /* Fourth, if the extended version occupies more registers than the
+original and the source of the extension is the same hard register
+as the destination of the extension, then we can not eliminate
+the extension without deep analysis, so just punt.
+
+We allow this when the registers are different because the
+code in combine_reaching_defs will handle that case correctly.  */
+  if ((HARD_REGNO_NREGS (REGNO (dest), mode)
+  != HARD_REGNO_NREGS (REGNO (reg), GET_MODE (reg)))
+ && REGNO (dest) == REGNO (reg))
+   return;
+
/* Then add the candidate to the list and insert the reaching 
definitions
   into the definition map.  */
ext_cand e = {expr, code, mode, insn};


I might be wrong, but the check looks specific to little-endian.  Would
it make sense to use reg_overlap_mentioned_p instead of the REGNO check?

Agreed.  Testing in progress now...

jeff



Re: ipa-cp heuristics fixes

2015-12-16 Thread Jan Hubicka
Hi,
just to summarize a discussion on IRC. The problem is that we produce debug
statements for eliminated arguments only in ipa-sra and ipa-split, while we
don't do anything for cgraph clones. This is a problem on release branches,
too.

It seems we have all the necessary logic, but the callee modification code from
ipa-split should be moved to tree_function_versioning (which is used by both
ipa-split and cgraph clone mechanism) and caller modification copied to
cgraph_edge::redirect_call_stmt_to_callee.

I am trying to do that. It seems a bit difficult as the caller and callee
modifications are tied together and I do not know how chaining of
transformations is going to work.

Honza


Last testcase for PR middle-end/25140

2015-12-16 Thread Jan Hubicka
Hi,
I checked the ipa-pta and pta implementations and these seems to work just
fine with presence of aliases because get_constraint_for_ssa_var already
looks into the alias targets.

This patch adds a testcase I constructed.  Since I am done with auditing
*alias*.c for variable aliases I will close the PR (after 10 years, yay)

Honza

PR middle-end/25140
* gcc.c-torture/execute/alias-4.c: New testcase.
Index: testsuite/gcc.c-torture/execute/alias-4.c
===
--- testsuite/gcc.c-torture/execute/alias-4.c   (revision 0)
+++ testsuite/gcc.c-torture/execute/alias-4.c   (revision 0)
@@ -0,0 +1,19 @@
+/* { dg-require-alias "" } */
+int a = 1;
+extern int b __attribute__ ((alias ("a")));
+int c = 1;
+extern int d __attribute__ ((alias ("c")));
+main (int argc)
+{
+  int *p;
+  int *q;
+  if (argc)
+p = &a, q = &b;
+  else
+p = &c, q = &d;
+  *p = 1;
+  *q = 2;
+  if (*p == 1)
+__builtin_abort ();
+  return 0;
+}


Re: [PATCH] Fix PR c++/21802 (two-stage name lookup fails for operators)

2015-12-16 Thread Michael Matz
Hi,

On Mon, 14 Dec 2015, Patrick Palka wrote:

> >>> >This should use cp_tree_operand_length.
> >> Hmm, I don't immediately see how I can use this function here.  It
> >> expects a tree but I dont have an appropriate tree to give to it, only a
> >> tree_code.
> >
> > True.  So let's introduce cp_tree_code_length next to 
> > cp_tree_operand_length.
> >
> > Jason
> >
> >
> 
> Like this?  Incremental diff followed by patch v4:

Not my turf, but if I may make a suggestion: please use the new function 
in the old one instead of duplicating the implementation.


Ciao,
Michael.


[PTX] function frame emission

2015-12-16 Thread Nathan Sidwell
This patch removes OUTGOING_STATIC_CHAIN_REGNUM -- there's no need for it to be 
distinct from STATIC_CHAIN_REGNUM.  Also, when we have to emit a frame or 
outgoing args, but it's zero sized, there's no need to actually emit the frame 
or arg array.  We can just initialize the appropriate register to zero -- it's 
ill formed for the area to be dereferenced.
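Concretely, judging from the new format strings in init_frame, a 16-byte frame with 8-byte alignment on a 64-bit target would come out along these lines (register name illustrative):

```
	.local .align 8 .b8 %frame_ar[16];
	.reg.u64 %frame;
	cvta.local.u64 %frame, %frame_ar;
```

while a zero-sized frame now needs no local array at all:

```
	.reg.u64 %frame;
	mov.u64 %frame, 0;
```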


Some other minor tidying up included.

nathan
2015-12-16  Nathan Sidwell  

	* config/nvptx/nvptx.h (OUTGOING_STATIC_CHAIN_REGNUM): Remove.
	(REGISTER_NAMES): Adjust.
	* config/nvptx/nvptx.c (nvptx_pass_by_reference): Avoid long line.
	(nvptx_static_chain): Delete.
	(write_arg_mode): Don't emit initializer if argno < 0.
	(write_arg_type): Fix whitespace.
	(init_frame): Initialize reg to zero if frame is zero-sized.
	(nvptx_declare_function_name): Use write_arg_type to emit chain
	decl.
	(nvptx_output_call_insn): Adjust static chain emission.
	(nvptx_goacc_reduction): Make static.
	(TARGET_STATIC_CHAIN): Don't override.

Index: config/nvptx/nvptx.c
===
--- config/nvptx/nvptx.c	(revision 231717)
+++ config/nvptx/nvptx.c	(working copy)
@@ -525,8 +525,9 @@ nvptx_function_value_regno_p (const unsi
reference in memory.  */
 
 static bool
-nvptx_pass_by_reference (cumulative_args_t ARG_UNUSED (cum), machine_mode mode,
-			 const_tree type, bool ARG_UNUSED (named))
+nvptx_pass_by_reference (cumulative_args_t ARG_UNUSED (cum),
+			 machine_mode mode, const_tree type,
+			 bool ARG_UNUSED (named))
 {
   return pass_in_memory (mode, type, false);
 }
@@ -549,18 +550,6 @@ nvptx_promote_function_mode (const_tree
   return promote_arg (mode, for_return || !type || TYPE_ARG_TYPES (funtype));
 }
 
-/* Implement TARGET_STATIC_CHAIN.  */
-
-static rtx
-nvptx_static_chain (const_tree fndecl, bool incoming_p)
-{
-  if (!DECL_STATIC_CHAIN (fndecl))
-return NULL;
-
-  return gen_rtx_REG (Pmode, (incoming_p ? STATIC_CHAIN_REGNUM
-			  : OUTGOING_STATIC_CHAIN_REGNUM));
-}
-
 /* Helper for write_arg.  Emit a single PTX argument of MODE, either
in a prototype, or as copy in a function prologue.  ARGNO is the
index of this argument in the PTX function.  FOR_REG is negative,
@@ -588,12 +577,15 @@ write_arg_mode (std::stringstream &s, in
   else
 	s << "%ar" << argno;
   s << ";\n";
-  s << "\tld.param" << ptx_type << " ";
-  if (for_reg)
-	s << reg_names[for_reg];
-  else
-	s << "%ar" << argno;
-  s << ", [%in_ar" << argno << "];\n";
+  if (argno >= 0)
+	{
+	  s << "\tld.param" << ptx_type << " ";
+	  if (for_reg)
+	s << reg_names[for_reg];
+	  else
+	s << "%ar" << argno;
+	  s << ", [%in_ar" << argno << "];\n";
+	}
 }
   return argno + 1;
 }
@@ -625,7 +617,7 @@ write_arg_type (std::stringstream &s, in
 	{
 	  /* Complex types are sent as two separate args.  */
 	  type = TREE_TYPE (type);
-	  mode  = TYPE_MODE (type);
+	  mode = TYPE_MODE (type);
 	  prototyped = true;
 	}
 
@@ -917,16 +909,20 @@ nvptx_maybe_record_fnsym (rtx sym)
 }
 
 /* Emit a local array to hold some part of a conventional stack frame
-   and initialize REGNO to point to it.  */
+   and initialize REGNO to point to it.  If the size is zero, it'll
+   never be valid to dereference, so we can simply initialize to
+   zero.  */
 
 static void
 init_frame (FILE  *file, int regno, unsigned align, unsigned size)
 {
-  fprintf (file, "\t.reg.u%d %s;\n"
-	   "\t.local.align %d .b8 %s_ar[%u];\n"
-	   "\tcvta.local.u%d %s, %s_ar;\n",
-	   POINTER_SIZE, reg_names[regno],
-	   align, reg_names[regno], size ? size : 1,
+  if (size)
+fprintf (file, "\t.local .align %d .b8 %s_ar[%u];\n",
+	 align, reg_names[regno], size);
+  fprintf (file, "\t.reg.u%d %s;\n",
+	   POINTER_SIZE, reg_names[regno]);
+  fprintf (file, (size ? "\tcvta.local.u%d %s, %s_ar;\n"
+		  :  "\tmov.u%d %s, 0;\n"),
 	   POINTER_SIZE, reg_names[regno], reg_names[regno]);
 }
 
@@ -981,12 +977,14 @@ nvptx_declare_function_name (FILE *file,
 }
 
   if (stdarg_p (fntype))
-argno = write_arg_type (s, ARG_POINTER_REGNUM, argno, ptr_type_node, true);
-
-  if (DECL_STATIC_CHAIN (decl))
-argno = write_arg_type (s, STATIC_CHAIN_REGNUM, argno, ptr_type_node,
+argno = write_arg_type (s, ARG_POINTER_REGNUM, argno, ptr_type_node,
 			true);
 
+  if (DECL_STATIC_CHAIN (decl) || cfun->machine->has_chain)
+write_arg_type (s, STATIC_CHAIN_REGNUM,
+		DECL_STATIC_CHAIN (decl) ? argno : -1, ptr_type_node,
+		true);
+
   fprintf (file, "%s", s.str().c_str());
 
   /* Declare a local var for outgoing varargs.  */
@@ -1000,10 +998,6 @@ nvptx_declare_function_name (FILE *file,
 init_frame (file, FRAME_POINTER_REGNUM,
 		crtl->stack_alignment_needed / BITS_PER_UNIT, sz);
 
-  if (cfun->machine->has_chain)
-fprintf (file, "\t.reg.u%d %s;\n", GET_MODE_BITSIZE (Pmode),
-	 reg_names[OUTGOING_STATIC_CHAIN_REGNUM]);
-
   /* Declare the pseudos we have as ptx registers.  */
   int maxregs = max_reg_num ();
   for (

Re: [PATCH 2/4][AArch64] Increase the loop peeling limit

2015-12-16 Thread Evandro Menezes

On 12/16/2015 05:24 AM, Richard Earnshaw (lists) wrote:

On 15/12/15 23:34, Evandro Menezes wrote:

On 12/14/2015 05:26 AM, James Greenhalgh wrote:

On Thu, Dec 03, 2015 at 03:07:43PM -0600, Evandro Menezes wrote:

On 11/20/2015 05:53 AM, James Greenhalgh wrote:

On Thu, Nov 19, 2015 at 04:04:41PM -0600, Evandro Menezes wrote:

On 11/05/2015 02:51 PM, Evandro Menezes wrote:

2015-11-05  Evandro Menezes 

gcc/

* config/aarch64/aarch64.c
(aarch64_override_options_internal):
Increase loop peeling limit.

This patch increases the limit for the number of peeled insns.
With this change, I noticed no major regression in either
Geekbench v3 or SPEC CPU2000 while some benchmarks, typically FP
ones, improved significantly.

I tested this tuning on Exynos M1 and on A57.  ThunderX seems to
benefit from this tuning too.  However, I'd appreciate comments

from other stakeholders.

Ping.

I'd like to leave this for a call from the port maintainers. I can
see why
this leads to more opportunities for vectorization, but I'm
concerned about
the wider impact on code size. Certainly I wouldn't expect this to
be our
default at -O2 and below.

My gut feeling is that this doesn't really belong in the back-end
(there are
presumably good reasons why the default for this parameter across
GCC has
fluctuated from 400 to 100 to 200 over recent years), but as I say, I'd
like Marcus or Richard to make the call as to whether or not we take
this
patch.

Please, correct me if I'm wrong, but loop peeling is enabled only
with loop unrolling (and with PGO).  If so, then extra code size is
not a concern, for this heuristic is only active when unrolling
loops, when code size is already of secondary importance.

My understanding was that loop peeling is enabled from -O2 upwards, and
is also used to partially peel unaligned loops for vectorization
(allowing
the vector code to be well aligned), or to completely peel inner loops
which
may then become amenable to SLP vectorization.

If I'm wrong then I take back these objections. But I was sure this
parameter was used in a number of situations outside of just
-funroll-loops/-funroll-all-loops . Certainly I remember seeing
performance
sensitivities to this parameter at -O3 in some internal workloads I was
analysing.

Vectorization, including SLP, is only enabled at -O3, isn't it?  It
seems to me that peeling is only used by optimizations which already
lead to potential increase in code size.

For instance, with "-Ofast -funroll-all-loops", the total text size for
the SPEC CPU2000 suite is 26.9MB with this proposed change and 26.8MB
without it; with just "-O2", it is the same at 23.1MB regardless of this
setting.

So it seems to me that this proposal should be neutral for up to -O2.

Thank you,


My preference would be to not diverge from the global parameter
settings.  I haven't looked in detail at this parameter but it seems to
me there are two possible paths:

1) We could get agreement globally that the parameter should be increased.
2) We could agree that this specific use of the parameter is distinct
from some other uses and deserves a new param in its own right with a
higher value.



Here's what I have observed, not only in AArch64: architectures benefit 
differently from certain loop optimizations, especially those dealing 
with vectorization.  Be it because some have plenty of registers for more 
aggressive loop unrolling, or because some have lower costs to 
vectorize.  With this, I'm trying to imply that there may be the case to 
wiggle this parameter to suit loop optimizations better to specific 
targets.  While it is not the only parameter related to loop 
optimizations, it seems to be the one with the desired effects, as 
exemplified by PPC, S390 and x86 (AOSP).  Though there is the 
possibility that they are actually side-effects, as Richard Biener 
perhaps implied in another reply.


Cheers,

--
Evandro Menezes



Re: update_vtable_references segfault

2015-12-16 Thread Nathan Sidwell

On 12/12/15 09:44, Nathan Sidwell wrote:

On 12/11/15 13:15, Jan Hubicka wrote:

Jan,



b) augment can_replace_by_local_alias_in_vtable to check whether
aliases can be created?


I think this is best: can_replace_by_local_alias_in_vtable exists to prevent the
body walk in cases we are not going to create the alias.  This is because in LTO
we may need to stream in the constructor from the object file that is not
copletely
free and thus it is better to not touch it unless necessary.


I went with augmenting can_replace_by_local_alias, which
can_replace_by_local_alias_in_vtable calls.  I also noticed that both should be
static, which  I suspect will encourage the inliner to go inline them and then
determine a bunch of code is unreachable.

tested on x86-linux and ptx-none.

ok?


https://gcc.gnu.org/ml/gcc-patches/2015-12/msg01324.html

ping?

nathan


Re: [RFA] [PATCH] Fix invalid redundant extension elimination for rl78 port

2015-12-16 Thread Jeff Law

On 12/01/2015 12:32 PM, Richard Sandiford wrote:

Jeff Law  writes:

@@ -1080,6 +1070,18 @@ add_removable_extension (const_rtx expr, rtx_insn *insn,
  }
  }

+  /* Fourth, if the extended version occupies more registers than the
+original and the source of the extension is the same hard register
+as the destination of the extension, then we can not eliminate
+the extension without deep analysis, so just punt.
+
+We allow this when the registers are different because the
+code in combine_reaching_defs will handle that case correctly.  */
+  if ((HARD_REGNO_NREGS (REGNO (dest), mode)
+  != HARD_REGNO_NREGS (REGNO (reg), GET_MODE (reg)))
+ && REGNO (dest) == REGNO (reg))
+   return;
+
/* Then add the candidate to the list and insert the reaching 
definitions
   into the definition map.  */
ext_cand e = {expr, code, mode, insn};


I might be wrong, but the check looks specific to little-endian.  Would
it make sense to use reg_overlap_mentioned_p instead of the REGNO check?
Fixed thusly.  Installed on the trunk after the usual testing on 
x86_64-linux-gnu and verifying that rl78-elf doesn't botch pr42833.c.


Jeff
commit 362f406136207b89bddab99f5ff904b0798b7115
Author: law 
Date:   Wed Dec 16 20:34:31 2015 +

* ree.c (add_removable_extension): Use reg_overlap_mentioned_p
rather than testing hard register #s.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@231719 
138bc75d-0d04-0410-961f-82ee72b054a4

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 1d2a994..a8475b7 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2015-12-16  Jeff Law  
+
+   * ree.c (add_removable_extension): Use reg_overlap_mentioned_p
+   rather than testing hard register #s.
+
 2015-12-16  Nathan Sidwell  
 
* config/nvptx/nvptx.h (OUTGOING_STATIC_CHAIN_REGNUM): Remove.
diff --git a/gcc/ree.c b/gcc/ree.c
index 6cfc477..d12e24d 100644
--- a/gcc/ree.c
+++ b/gcc/ree.c
@@ -1085,7 +1085,7 @@ add_removable_extension (const_rtx expr, rtx_insn *insn,
 code in combine_reaching_defs will handle that case correctly.  */
   if ((HARD_REGNO_NREGS (REGNO (dest), mode)
   != HARD_REGNO_NREGS (REGNO (reg), GET_MODE (reg)))
- && REGNO (dest) == REGNO (reg))
+ && reg_overlap_mentioned_p (dest, reg))
return;
 
   /* Then add the candidate to the list and insert the reaching definitions


[PATCH] Remove use of 'struct map' from plugin (nvptx)

2015-12-16 Thread James Norris

Hi,

The attached patch removes the use of the map structure
(struct map) from the NVPTX plugin.

Regtested on x86_64-pc-linux-gnu

Ok for trunk?

Thanks!
Jim

ChangeLog
=

2015-12-XX  James Norris  

libgomp/
* plugin/plugin-nvptx.c (struct map): Removed.
(map_init, map_pop): Remove use of struct map. (map_push):
Likewise and change argument list.
* testsuite/libgomp.oacc-c-c++-common/mapping-1.c: New
Index: plugin/plugin-nvptx.c
===
diff --git a/trunk/libgomp/plugin/plugin-nvptx.c b/trunk/libgomp/plugin/plugin-nvptx.c
--- a/trunk/libgomp/plugin/plugin-nvptx.c	(revision 231649)
+++ b/trunk/libgomp/plugin/plugin-nvptx.c	(working copy)
@@ -91,13 +91,6 @@
   struct ptx_device *ptx_dev;
 };
 
-struct map
-{
-  int async;
-  size_t  size;
-  char    mappings[0];
-};
-
 static void
 map_init (struct ptx_stream *s)
 {
@@ -140,17 +133,13 @@
 static void
 map_pop (struct ptx_stream *s)
 {
-  struct map *m;
-
   assert (s != NULL);
   assert (s->h_next);
   assert (s->h_prev);
   assert (s->h_tail);
 
-  m = s->h_tail;
+  s->h_tail = s->h_next;
 
-  s->h_tail += m->size;
-
   if (s->h_tail >= s->h_end)
 s->h_tail = s->h_begin + (int) (s->h_tail - s->h_end);
 
@@ -167,16 +156,14 @@
 }
 
 static void
-map_push (struct ptx_stream *s, int async, size_t size, void **h, void **d)
+map_push (struct ptx_stream *s, size_t size, void **h, void **d)
 {
   int left;
   int offset;
-  struct map *m;
 
   assert (s != NULL);
 
   left = s->h_end - s->h_next;
-  size += sizeof (struct map);
 
   assert (s->h_prev);
   assert (s->h_next);
@@ -183,22 +170,14 @@
 
   if (size >= left)
 {
-  m = s->h_prev;
-  m->size += left;
-  s->h_next = s->h_begin;
-
-  if (s->h_next + size > s->h_end)
-	GOMP_PLUGIN_fatal ("unable to push map");
+  assert (s->h_next == s->h_prev);
+  s->h_next = s->h_prev = s->h_tail = s->h_begin;
 }
 
   assert (s->h_next);
 
-  m = s->h_next;
-  m->async = async;
-  m->size = size;
+  offset = s->h_next - s->h;
 
-  offset = (void *)&m->mappings[0] - s->h;
-
   *d = (void *)(s->d + offset);
   *h = (void *)(s->h + offset);
 
@@ -904,7 +883,7 @@
   /* This reserves a chunk of a pre-allocated page of memory mapped on both
  the host and the device. HP is a host pointer to the new chunk, and DP is
  the corresponding device pointer.  */
-  map_push (dev_str, async, mapnum * sizeof (void *), &hp, &dp);
+  map_push (dev_str, mapnum * sizeof (void *), &hp, &dp);
 
   GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
 
Index: testsuite/libgomp.oacc-c-c++-common/mapping-1.c
===
diff --git a/trunk/libgomp/testsuite/libgomp.oacc-c-c++-common/mapping-1.c b/trunk/libgomp/testsuite/libgomp.oacc-c-c++-common/mapping-1.c
new file mode 100644
--- /dev/null	(revision 0)
+++ b/trunk/libgomp/testsuite/libgomp.oacc-c-c++-common/mapping-1.c	(working copy)
@@ -0,0 +1,63 @@
+/* { dg-do run } */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+/* Exercise the kernel launch argument mapping.  */
+
+int
+main (int argc, char **argv)
+{
+  int a[256], b[256], c[256], d[256], e[256], f[256];
+  int i;
+  int n;
+
+  /* 48 is the size of the mappings for the first parallel construct.  */
+  n = sysconf (_SC_PAGESIZE) / 48 - 1;
+
+  i = 0;
+
+  for (i = 0; i < n; i++)
+{
+  #pragma acc parallel copy (a, b, c, d)
+	{
+	  int j;
+
+	  for (j = 0; j < 256; j++)
+	{
+	  a[j] = j;
+	  b[j] = j;
+	  c[j] = j;
+	  d[j] = j;
+	}
+	}
+}
+
+#pragma acc parallel copy (a, b, c, d, e, f)
+  {
+int j;
+
+for (j = 0; j < 256; j++)
+  {
+	a[j] = j;
+	b[j] = j;
+	c[j] = j;
+	d[j] = j;
+	e[j] = j;
+	f[j] = j;
+  }
+  }
+
+  for (i = 0; i < 256; i++)
+   {
+ if (a[i] != i) abort();
+ if (b[i] != i) abort();
+ if (c[i] != i) abort();
+ if (d[i] != i) abort();
+ if (e[i] != i) abort();
+ if (f[i] != i) abort();
+   }
+
+  exit (0);
+}

