Re: [PATCH] Implement constant-folding simplifications of reductions.

2022-02-21 Thread Marc Glisse

On Mon, 21 Feb 2022, Roger Sayle wrote:


+/* Fold REDUC (@0 op VECTOR_CST) as REDUC (@0) op REDUC (VECTOR_CST).  */
+(for reduc (IFN_REDUC_PLUS IFN_REDUC_MAX IFN_REDUC_MIN IFN_REDUC_FMAX
+	    IFN_REDUC_FMIN IFN_REDUC_AND IFN_REDUC_IOR IFN_REDUC_XOR)
+ op (plus max min IFN_FMAX IFN_FMIN bit_and bit_ior bit_xor)
+  (simplify (reduc (op @0 VECTOR_CST@1))
+    (op (reduc:type @0) (reduc:type @1))))


I wonder if we need to test flag_associative_math for the 'plus' case,
or if the presence of IFN_REDUC_PLUS is enough to justify the possible
loss of precision.
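
For reference, a minimal sketch (not the patch's testcase) of when a
REDUC_PLUS is introduced at all for 'plus': a float sum reduction is only
vectorized into IFN_REDUC_PLUS once reassociation is allowed, e.g.

  /* Needs -O2 -ffast-math (or -fassociative-math and friends) on a target
     with a reduc_plus_scal pattern, e.g. x86_64 with SSE.  */
  float
  sumf (const float *a, int n)
  {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
      s += a[i];
    return s;
  }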

--
Marc Glisse


[PATCH] c++: Fix up constexpr evaluation of new with zero sized types [PR104568]

2022-02-21 Thread Jakub Jelinek via Gcc-patches
Hi!

The new expression constant expression evaluation right now tries to
deduce how many elts the array it uses for the heap or heap [] vars
should have (or how many elts should its trailing array have if it has
cookie at the start).  As new is lowered at that point to
(some_type *) ::operator new (size)
or so, it computes it by subtracting cookie size if any from size, then
divides the result by sizeof (some_type).
This works fine for most types, except when sizeof (some_type) is 0,
then we divide by zero; size is then equal to cookie_size (or if there
is no cookie, to 0).
The following patch special-cases those so that we don't divide
by zero, and also recovers the original outer_nelts from the expression
by forcing the size not to be folded in that case but to be an explicit
0 * outer_nelts or cookie_size + 0 * outer_nelts.
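
For illustration, a sketch of the kind of input involved (it assumes the
GNU zero-length array extension so that sizeof (S) == 0; the actual test
added is g++.dg/cpp2a/constexpr-new22.C and may differ):

  struct S { int a[0]; };	// zero sized with -std=gnu++20

  constexpr bool
  foo ()
  {
    S *p = new S[3];		// size passed to ::operator new[] is 0
    delete[] p;
    return true;
  }

  constexpr bool ok = foo ();	// evaluation used to divide 0 by sizeof (S)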

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

Note, we have further issues: we accept invalid code in various cases, for
both zero-sized elt_type and even non-zero-sized elts, and we aren't able to
diagnose out-of-bounds POINTER_PLUS_EXPR like:
constexpr bool
foo ()
{
  auto p = new int[2];
  auto q1 = &p[0];
  auto q2 = &p[1];
  auto q3 = &p[2];
  auto q4 = &p[3];
  delete[] p;
  return true;
}
constexpr bool a = foo ();
That doesn't look like a regression, so I think we should resolve that for
GCC 13, but there are 2 problems.  Figure out why
cxx_fold_pointer_plus_expression doesn't deal with the &heap []
etc. cases, and for the zero sized arrays, I think we really need to preserve
whether the user wrote an array ref or pointer addition, because in the
&p[3] case if sizeof(p[0]) == 0 we know that if it has 2 elements it is
out of bounds, while if we see p p+ 0 the information whether it was
p + 2 or p + 3 in the source is lost.
clang++ seems to handle it fine even in the zero sized cases or with
new expressions.

2022-02-21  Jakub Jelinek  

PR c++/104568
* cp-tree.h (build_new_constexpr_heap_type): Add FULL_SIZE_ADJUSTED
argument.
* init.cc (build_new_constexpr_heap_type): Add FULL_SIZE_ADJUSTED
argument.  If true, don't subtract csz from it nor divide by
int_size_in_bytes (elt_type).  Don't do that division if
int_size_in_bytes is zero either.
(maybe_wrap_new_for_constexpr): Pass false to
build_new_constexpr_heap_type.
(build_new_1): If size is 0, change it to 0 * outer_nelts if
outer_nelts is non-NULL.  Pass type rather than elt_type to
maybe_wrap_new_for_constexpr.
* constexpr.cc (cxx_eval_constant_expression) :
If elt_size is zero sized type, try to recover outer_nelts from
the size argument to operator new/new[] and pass that as
var_size to build_new_constexpr_heap_type together with true
for the last argument.

* g++.dg/cpp2a/constexpr-new22.C: New test.

--- gcc/cp/cp-tree.h.jj 2022-02-09 20:13:51.541304861 +0100
+++ gcc/cp/cp-tree.h	2022-02-17 15:34:30.804453673 +0100
@@ -7038,7 +7038,7 @@ extern tree build_offset_ref	(tree, tree, bool, tsubst_flags_t);
 extern tree throw_bad_array_new_length (void);
 extern bool type_has_new_extended_alignment(tree);
 extern unsigned malloc_alignment   (void);
-extern tree build_new_constexpr_heap_type  (tree, tree, tree);
+extern tree build_new_constexpr_heap_type  (tree, tree, tree, bool);
 extern tree build_new  (location_t,
 				vec<tree, va_gc> **, tree,
 				tree, vec<tree, va_gc> **,
--- gcc/cp/init.cc.jj   2022-02-05 10:50:05.0 +0100
+++ gcc/cp/init.cc  2022-02-17 15:56:30.010056499 +0100
@@ -2930,7 +2930,8 @@ std_placement_new_fn_p (tree alloc_fn)
it is computed such that the size of the struct fits into FULL_SIZE.  */
 
 tree
-build_new_constexpr_heap_type (tree elt_type, tree cookie_size, tree full_size)
+build_new_constexpr_heap_type (tree elt_type, tree cookie_size, tree full_size,
+  bool full_size_adjusted)
 {
   gcc_assert (cookie_size == NULL_TREE || tree_fits_uhwi_p (cookie_size));
   gcc_assert (full_size == NULL_TREE || tree_fits_uhwi_p (full_size));
@@ -2939,9 +2940,14 @@ build_new_constexpr_heap_type (tree elt_
   if (full_size)
 {
   unsigned HOST_WIDE_INT fsz = tree_to_uhwi (full_size);
-  gcc_assert (fsz >= csz);
-  fsz -= csz;
-  fsz /= int_size_in_bytes (elt_type);
+  unsigned HOST_WIDE_INT esz = int_size_in_bytes (elt_type);
+  if (!full_size_adjusted)
+   {
+ gcc_assert (fsz >= csz);
+ fsz -= csz;
+ if (esz)
+   fsz /= esz;
+   }
   itype2 = build_index_type (size_int (fsz - 1));
   if (!cookie_size)
return build_cplus_array_type (elt_type, itype2);
@@ -2992,7 +2998,7 @@ maybe_wrap_new_for_constexpr (tree alloc
 return alloc_call;
 
   tree rtype = build_new_constexpr_heap_type (elt_type, cookie_size,
-

RE: [PATCH] Implement constant-folding simplifications of reductions.

2022-02-21 Thread Roger Sayle


Hi Marc,
I'm assuming that the use (semantics) of a REDUC_PLUS expr allow the
reduction to be done in any order, for example the testcase requires 
-ffast-math to allow the REDUC_PLUS to be introduced in the first place.
This also explains why the patch doesn't need to distinguish
negative zeros from positive zeros in ctor_single_nonzero_element
(but it's perhaps something to beware of in uses of VECTOR_CST's 
single_nonzero_element).

Cheers,
Roger
--

> -----Original Message-----
> From: Marc Glisse 
> Sent: 21 February 2022 08:21
> To: Roger Sayle 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] Implement constant-folding simplifications of reductions.
> 
> On Mon, 21 Feb 2022, Roger Sayle wrote:
> 
> > +/* Fold REDUC (@0 op VECTOR_CST) as REDUC (@0) op REDUC (VECTOR_CST).  */
> > +(for reduc (IFN_REDUC_PLUS IFN_REDUC_MAX IFN_REDUC_MIN IFN_REDUC_FMAX
> > +	    IFN_REDUC_FMIN IFN_REDUC_AND IFN_REDUC_IOR IFN_REDUC_XOR)
> > + op (plus max min IFN_FMAX IFN_FMIN bit_and bit_ior bit_xor)
> > +  (simplify (reduc (op @0 VECTOR_CST@1))
> > +    (op (reduc:type @0) (reduc:type @1))))
> 
> I wonder if we need to test flag_associative_math for the 'plus' case, or
> if the presence of IFN_REDUC_PLUS is enough to justify the possible loss
> of precision.
> 
> --
> Marc Glisse



Re: [PATCH 3/3] target/99881 - x86 vector cost of CTOR from integer regs

2022-02-21 Thread Richard Biener via Gcc-patches
On Mon, 21 Feb 2022, Hongtao Liu wrote:

> On Fri, Feb 18, 2022 at 10:01 PM Richard Biener via Gcc-patches
>  wrote:
> >
> > This uses the now passed SLP node to the vectorizer costing hook
> > to adjust vector construction costs for the cost of moving an
> > integer component from a GPR to a vector register when that's
> > required for building a vector from components.  A crucial difference
> > here is whether the component is loaded from memory or extracted
> > from a vector register as in those cases no intermediate GPR is involved.
> >
> > The pr99881.c testcase can be Un-XFAILed with this patch, the
> > pr91446.c testcase now produces scalar code which looks superior
> > to me so I've adjusted it as well.
> >
> > I'm currently re-bootstrapping and testing on x86_64-unknown-linux-gnu
> > after adding the BIT_FIELD_REF vector extracting special casing.
> Does the patch handle PR101929?

The patch will regress the testcase posted in PR101929 again:

 _255 1 times scalar_store costs 12 in body
 _261 1 times scalar_store costs 12 in body
 _258 1 times scalar_store costs 12 in body
 _264 1 times scalar_store costs 12 in body
 t0_247 + t2_251 1 times scalar_stmt costs 4 in body
 t1_472 + t3_444 1 times scalar_stmt costs 4 in body
 t0_406 - t2_451 1 times scalar_stmt costs 4 in body
 t1_472 - t3_444 1 times scalar_stmt costs 4 in body
-node 0x4182f48 1 times vec_construct costs 16 in prologue
-node 0x41882b0 1 times vec_construct costs 16 in prologue
+node 0x4182f48 1 times vec_construct costs 28 in prologue
+node 0x41882b0 1 times vec_construct costs 28 in prologue
 t0_406 + t2_451 1 times vector_stmt costs 4 in body
 t1_472 - t3_444 1 times vector_stmt costs 4 in body
 node 0x41829f8 1 times vec_perm costs 4 in body
 _436 1 times vector_store costs 16 in body
 t.c:37:9: note: Cost model analysis for part in loop 0:
-  Vector cost: 60
+  Vector cost: 84
   Scalar cost: 64
+t.c:37:9: missed: not vectorized: vectorization is not profitable.

We're constructing V4SI from patterns like { _407, _480, _407, _480 }
where the components are results of integer adds (so the result is
definitely in a GPR).  We are costing the construction as
4 * sse_op + 2 * sse_to_integer which with skylake cost is
4 * COSTS_N_INSNS (1) + 2 * 6.
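
(That is 4 * 4 + 2 * 6 = 28, matching the vec_construct cost of 28 in the
dump above; without the sse_to_integer component it was 4 * 4 = 16.)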

Whether the vectorization itself is profitable is likely questionable
but then it's true that the construction of V4SI is more costly
in terms of uops than a construction of V4SF.

Now, we can - for the first time - see the actual construction
pattern, and ideal construction might be two GPR->xmm moves
+ two splats + one unpack, or maybe two GPR->xmm moves + one
unpack + splat of DI (or other means of duplicating the lowpart).
It still wouldn't help here of course, and we would not have RTL
expansion adhere to this scheme.

Trying to capture secondary effects (the FRE opportunity unleashed)
in costing of this particular SLP subgraph is difficult and probably
undesirable though.  Adjusting any of the target specific costs
is likely not going to work because of the original narrow profitability
gap (60 vs. 64).

For the particular kernel the overall vectorization strategy needs
to improve (PR98138 is about this I think).  I know we've reverted
the previous change that attempted to cost for GPR -> XMM.  It did

      case vec_construct:
	{
	  /* N element inserts into SSE vectors.  */
-	  int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
+	  int cost
+	    = TYPE_VECTOR_SUBPARTS (vectype)
+	      * (fp ? ix86_cost->sse_op : ix86_cost->integer_to_sse);

thus treated integer_to_sse as the GPR-insert-into-vector cost for
integer components in general, while the now proposed change
_adds_ integer-to-XMM move costs (but only once for duplicate
elements and not for the cases we know are from memory / vector regs).
With the proposed patch we can probably be more "correct" above
and only cost TYPE_VECTOR_SUBPARTS (vectype) - 1 inserts, given that
for FP the first one is free and for int we're costing it separately.
Again it won't help here.
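
(With the numbers above that would presumably be 3 * 4 + 2 * 6 = 24 per
construction instead of 28, i.e. a vector cost of 76, still above the
scalar cost of 64.)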

Thanks,
Richard.

> >
> > I suppose we can let autotesters look for SPEC performance fallout.
> >
> > OK if testing succeeds?
> >
> > Thanks,
> > Richard.
> >
> > 2022-02-18  Richard Biener  
> >
> > PR tree-optimization/104582
> > PR target/99881
> > * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
> > Cost GPR to vector register moves for integer vector construction.
> >
> > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: New.
> > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-2.c: Likewise.
> > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Likewise.
> > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Likewise.
> > * gcc.target/i386/pr99881.c: Un-XFAIL.
> > * gcc.target/i386/pr91446.c: Adjust to not expect vectorization.
> 

[PATCH] i386: Relax cmpxchg instruction under -mrelax-cmpxchg-loop [PR 103069]

2022-02-21 Thread Hongyu Wang via Gcc-patches
Hi,

cmpxchg is commonly used in spin loops, and some user code such as
pthread directly uses cmpxchg as the loop condition, which causes
heavy cache bouncing.

This patch extends previous implementation to relax all cmpxchg
instruction under -mrelax-cmpxchg-loop with an extra atomic load,
compare and emulate the failed cmpxchg behavior.

For original spin loop which looks like

loop: mov  %eax,%r8d
      or   $1,%r8d
      lock cmpxchg %r8d,(%rdi)
      jne  loop

It will now turn into

loop: mov  %eax,%r8d
      or   $1,%r8d
      mov  (%rdi),%rsi  <--- load lock first
      cmp  %rsi,%rax    <--- compare with expected input
      jne  .L2          <--- lock ne expected
      lock cmpxchg %r8d,(%rdi)
      jne  loop
.L2:  mov  %rsi,%rax    <--- perform the behavior of failed cmpxchg
      jne  loop

under -mrelax-cmpxchg-loop.
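
For reference, a sketch (not from the patch's testsuite) of source that
expands to exactly this kind of loop, since x86 has no instruction that
atomically ORs memory and returns the old value:

  /* Compile with -O2 -mrelax-cmpxchg-loop; using the returned old value
     forces a cmpxchg loop like the one above.  */
  int
  fetch_or (int *lock)
  {
    return __atomic_fetch_or (lock, 1, __ATOMIC_SEQ_CST);
  }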

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}

OK for master?

gcc/ChangeLog:

PR target/103069
* config/i386/i386-expand.cc (ix86_expand_atomic_fetch_op_loop):
Split atomic fetch and loop part.
(ix86_expand_cmpxchg_loop): New expander for cmpxchg loop.
* config/i386/i386-protos.h (ix86_expand_cmpxchg_loop): New
prototype.
* config/i386/sync.md (atomic_compare_and_swap): Call new
expander under TARGET_RELAX_CMPXCHG_LOOP.
(atomic_compare_and_swap): Likewise for doubleword modes.

gcc/testsuite/ChangeLog:

PR target/103069
* gcc.target/i386/pr103069-2.c: Adjust result check.
* gcc.target/i386/pr103069-3.c: New test.
* gcc.target/i386/pr103069-4.c: Likewise.
---
 gcc/config/i386/i386-expand.cc | 153 +++--
 gcc/config/i386/i386-protos.h  |   2 +
 gcc/config/i386/sync.md|  65 +
 gcc/testsuite/gcc.target/i386/pr103069-2.c |   4 +-
 gcc/testsuite/gcc.target/i386/pr103069-3.c |  24 
 gcc/testsuite/gcc.target/i386/pr103069-4.c |  43 ++
 6 files changed, 226 insertions(+), 65 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr103069-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr103069-4.c

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index ce9607e36de..6cf1a0b9cb6 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -23203,16 +23203,14 @@ void ix86_expand_atomic_fetch_op_loop (rtx target, rtx mem, rtx val,
   enum rtx_code code, bool after,
   bool doubleword)
 {
-  rtx old_reg, new_reg, old_mem, success, oldval, new_mem;
-  rtx_code_label *loop_label, *pause_label, *done_label;
+  rtx old_reg, new_reg, old_mem, success;
   machine_mode mode = GET_MODE (target);
+  rtx_code_label *loop_label = NULL;
 
   old_reg = gen_reg_rtx (mode);
   new_reg = old_reg;
-  loop_label = gen_label_rtx ();
-  pause_label = gen_label_rtx ();
-  done_label = gen_label_rtx ();
   old_mem = copy_to_reg (mem);
+  loop_label = gen_label_rtx ();
   emit_label (loop_label);
   emit_move_insn (old_reg, old_mem);
 
@@ -23234,50 +23232,125 @@ void ix86_expand_atomic_fetch_op_loop (rtx target, rtx mem, rtx val,
   if (after)
 emit_move_insn (target, new_reg);
 
-  /* Load memory again inside loop.  */
-  new_mem = copy_to_reg (mem);
-  /* Compare mem value with expected value.  */
+  success = NULL_RTX;
+
+  ix86_expand_cmpxchg_loop (&success, old_mem, mem, old_reg, new_reg,
+   gen_int_mode (MEMMODEL_SYNC_SEQ_CST,
+ SImode),
+   doubleword, loop_label);
+}
+
+/* Relax cmpxchg instruction, param loop_label indicates whether
+   the instruction should be relaxed with a pause loop.  If not,
+   it will be relaxed to an atomic load + compare, and skip
+   cmpxchg instruction if mem != exp_input.  */
+
+void ix86_expand_cmpxchg_loop (rtx *ptarget_bool, rtx target_val,
+  rtx mem, rtx exp_input, rtx new_input,
+  rtx mem_model, bool doubleword,
+  rtx_code_label *loop_label)
+{
+  rtx_code_label *cmp_label = NULL;
+  rtx_code_label *done_label = NULL;
+  rtx target_bool = NULL_RTX, new_mem = NULL_RTX;
+  rtx (*gen) (rtx, rtx, rtx, rtx, rtx) = NULL;
+  rtx (*gendw) (rtx, rtx, rtx, rtx, rtx, rtx) = NULL;
+  machine_mode mode = GET_MODE (target_val), hmode = mode;
+
+  if (*ptarget_bool == NULL)
+target_bool = gen_reg_rtx (QImode);
+  else
+target_bool = *ptarget_bool;
+
+  cmp_label = gen_label_rtx ();
+  done_label = gen_label_rtx ();
+
+  new_mem = gen_reg_rtx (mode);
+  /* Load memory first.  */
+  expand_atomic_load (new_mem, mem, MEMMODEL_SEQ_CST);
+
+  switch (mode)
+{
+case TImode:
+  gendw = gen_atomic_compare_and_swapti_doubleword;
+  hmode = DImode;
+  break;
+case DImode:
+  if (doubleword)
+   {
+ gendw = gen_atomic_compare_and_swapdi_double

Re: [PATCH v2, rs6000] Enable absolute jump table for PPC AIX and Linux

2022-02-21 Thread Kewen.Lin via Gcc-patches
Hi Haochen,

Some minor comments are inlined.

on 2022/2/16 4:42 PM, HAO CHEN GUI via Gcc-patches wrote:
> Hi,
>    This patch enables absolute jump tables on PPC AIX and Linux.  For AIX,
> the jump table is placed in the data section.  For Linux, it is placed in
> the RELRO section when relocation is needed.
> 
>    Bootstrapped and tested on AIX, Linux BE and LE with no regressions.
> Is this okay for trunk?  Any recommendations?  Thanks a lot.
> 

I may be missing something, but there seem to be no test cases in the power
target testsuite that check the expected assembly for absolute and relative
jump tables.  If so, maybe it's better to add one or two?
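
Something along these lines, perhaps (a sketch only; the file name, options
and scan pattern are assumptions that would need checking against the actual
assembly output):

  /* { dg-do compile } */
  /* { dg-options "-O2" } */

  extern int bar (int);

  int
  foo (int x)
  {
    switch (x)
      {
      case 0: return bar (10);
      case 1: return bar (11);
      case 2: return bar (12);
      case 3: return bar (13);
      case 4: return bar (14);
      default: return 0;
      }
  }

  /* A relative jump table emits label differences; an absolute one does not.  */
  /* { dg-final { scan-assembler-not {\.long\t\.L\d+-\.L\d+} } } */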

> ChangeLog
> 2022-02-16 Haochen Gui 
> 
> gcc/
>   * config/rs6000/aix.h (JUMP_TABLES_IN_TEXT_SECTION): Define.
>   * config/rs6000/linux64.h (JUMP_TABLES_IN_TEXT_SECTION): Likewise.
>   * config/rs6000/rs6000.cc (rs6000_option_override_internal): Enable
>   absolute jump tables for AIX and Linux.
>   (rs6000_xcoff_function_rodata_section): Implement.
>   * config/rs6000/xcoff.h (TARGET_ASM_FUNCTION_RODATA_SECTION): Define.
> 
> patch.diff
> diff --git a/gcc/config/rs6000/aix.h b/gcc/config/rs6000/aix.h
> index ad3238bf09a..b52208c2ee7 100644
> --- a/gcc/config/rs6000/aix.h
> +++ b/gcc/config/rs6000/aix.h
> @@ -253,7 +253,7 @@
> 
>  /* Indicate that jump tables go in the text section.  */
> 
> -#define JUMP_TABLES_IN_TEXT_SECTION 1
> +#define JUMP_TABLES_IN_TEXT_SECTION 0
> 
>  /* Define any extra SPECS that the compiler needs to generate.  */
>  #undef  SUBTARGET_EXTRA_SPECS
> diff --git a/gcc/config/rs6000/linux64.h b/gcc/config/rs6000/linux64.h
> index b2a7afabc73..16df9ef167f 100644
> --- a/gcc/config/rs6000/linux64.h
> +++ b/gcc/config/rs6000/linux64.h
> @@ -239,7 +239,7 @@ extern int dot_symbols;
> 
>  /* Indicate that jump tables go in the text section.  */
>  #undef  JUMP_TABLES_IN_TEXT_SECTION
> -#define JUMP_TABLES_IN_TEXT_SECTION TARGET_64BIT
> +#define JUMP_TABLES_IN_TEXT_SECTION 0
> 
>  /* The linux ppc64 ABI isn't explicit on whether aggregates smaller
> than a doubleword should be padded upward or downward.  You could
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index bc3ef0721a4..e9c5552c082 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -4954,6 +4954,10 @@ rs6000_option_override_internal (bool global_init_p)
>  warning (0, "%qs is deprecated and not recommended in any circumstances",
>"-mno-speculate-indirect-jumps");
> 
> +  /* Enable absolute jump tables for AIX and Linux.  */
> +  if (DEFAULT_ABI == ABI_AIX || DEFAULT_ABI == ABI_ELFv2)
> +rs6000_relative_jumptables = 0;
> +
>return ret;
>  }
> 
> @@ -28751,6 +28755,15 @@ constant_generates_xxspltidp (vec_const_128bit_type 
> *vsx_const)
>return sf_value;
>  }
> 
> +section * rs6000_xcoff_function_rodata_section (tree decl ATTRIBUTE_UNUSED,
> + bool relocatable)
> +{
> +  if (relocatable)
> +return data_section;
> +  else
> +return readonly_data_section;
> +}
> +
>  

There is one area where xcoff-related functions are put, guarded with
"#if TARGET_XCOFF"; it looks good to put this function there.  And could we
make this function static?

Besides, it seems good to add some comments to this function describing why
we choose to use the data_section.

>  struct gcc_target targetm = TARGET_INITIALIZER;
> 
> diff --git a/gcc/config/rs6000/xcoff.h b/gcc/config/rs6000/xcoff.h
> index cd0f99cb9c6..0dacd86eed9 100644
> --- a/gcc/config/rs6000/xcoff.h
> +++ b/gcc/config/rs6000/xcoff.h
> @@ -98,7 +98,7 @@
>  #define TARGET_ASM_SELECT_SECTION  rs6000_xcoff_select_section
>  #define TARGET_ASM_SELECT_RTX_SECTION  rs6000_xcoff_select_rtx_section
>  #define TARGET_ASM_UNIQUE_SECTION  rs6000_xcoff_unique_section
> -#define TARGET_ASM_FUNCTION_RODATA_SECTION default_no_function_rodata_section
> +#define TARGET_ASM_FUNCTION_RODATA_SECTION rs6000_xcoff_function_rodata_section


IIUC, the behavior of rs6000_xcoff_function_rodata_section isn't consistent
with what we have in the hook description.  If so, we need to update the hook
description to align with this change.  Otherwise, some future code might
expect this hook to return only a read-only section (never something like a
data section) and get unexpected results.

BR,
Kewen



Re: [PATCH 1/3] tree-optimization/104582 - Simplify vectorizer cost API and fixes

2022-02-21 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> This simplifies the vectorizer cost API by providing overloads
> to add_stmt_cost and record_stmt_cost suitable for scalar stmt
> and branch stmt costing which do not need information like
> a vector type or alignment.  It also fixes two mistakes where
> costs for versioning tests were recorded as vector stmt rather
> than scalar stmt.
>
> This is a first patch to simplify the actual fix for PR104582.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> OK?

LGTM.  Maybe the record_stmt_cost overload should accept scalar_stmt
too though?  It seems inconsistent to allow it only for add_stmt_cost.

Not a strong opinion though, just a suggestion.

Thanks,
Richard

> 2022-02-18  Richard Biener  
>
>   PR tree-optimization/104582
>   * tree-vectorizer.h (add_stmt_cost): New overload.
>   (record_stmt_cost): Likewise.
>   * tree-vect-loop.cc (vect_compute_single_scalar_iteration_cost):
>   Use add_stmt_costs.
>   (vect_get_known_peeling_cost): Use new overloads.
>   (vect_estimate_min_profitable_iters): Likewise.  Consistently
>   use scalar_stmt for costing versioning checks.
>   * tree-vect-stmts.cc (record_stmt_cost): New overload.
> ---
>  gcc/tree-vect-loop.cc  | 39 +++
>  gcc/tree-vect-stmts.cc | 10 ++
>  gcc/tree-vectorizer.h  | 12 
>  3 files changed, 37 insertions(+), 24 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 896218f23ea..5c7b163f01c 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -1323,13 +1323,8 @@ vect_compute_single_scalar_iteration_cost 
> (loop_vec_info loop_vinfo)
>  
>/* Now accumulate cost.  */
>loop_vinfo->scalar_costs = init_cost (loop_vinfo, true);
> -  stmt_info_for_cost *si;
> -  int j;
> -  FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
> - j, si)
> -(void) add_stmt_cost (loop_vinfo->scalar_costs, si->count,
> -   si->kind, si->stmt_info, si->vectype,
> -   si->misalign, si->where);
> +  add_stmt_costs (loop_vinfo->scalar_costs,
> +   &LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo));
>loop_vinfo->scalar_costs->finish_cost (nullptr);
>  }
>  
> @@ -3873,10 +3868,10 @@ vect_get_known_peeling_cost (loop_vec_info 
> loop_vinfo, int peel_iters_prologue,
>iterations are unknown, count a taken branch per peeled loop.  */
>if (peel_iters_prologue > 0)
>   retval = record_stmt_cost (prologue_cost_vec, 1, cond_branch_taken,
> -NULL, NULL_TREE, 0, vect_prologue);
> +vect_prologue);
>if (*peel_iters_epilogue > 0)
>   retval += record_stmt_cost (epilogue_cost_vec, 1, cond_branch_taken,
> - NULL, NULL_TREE, 0, vect_epilogue);
> + vect_epilogue);
>  }
>  
>stmt_info_for_cost *si;
> @@ -3946,8 +3941,7 @@ vect_estimate_min_profitable_iters (loop_vec_info 
> loop_vinfo,
>  {
>/*  FIXME: Make cost depend on complexity of individual check.  */
>unsigned len = LOOP_VINFO_MAY_MISALIGN_STMTS (loop_vinfo).length ();
> -  (void) add_stmt_cost (target_cost_data, len, vector_stmt,
> - NULL, NULL_TREE, 0, vect_prologue);
> +  (void) add_stmt_cost (target_cost_data, len, scalar_stmt, 
> vect_prologue);
>if (dump_enabled_p ())
>   dump_printf (MSG_NOTE,
>"cost model: Adding cost of checks for loop "
> @@ -3959,13 +3953,12 @@ vect_estimate_min_profitable_iters (loop_vec_info 
> loop_vinfo,
>  {
>/*  FIXME: Make cost depend on complexity of individual check.  */
>unsigned len = LOOP_VINFO_COMP_ALIAS_DDRS (loop_vinfo).length ();
> -  (void) add_stmt_cost (target_cost_data, len, vector_stmt,
> - NULL, NULL_TREE, 0, vect_prologue);
> +  (void) add_stmt_cost (target_cost_data, len, scalar_stmt, 
> vect_prologue);
>len = LOOP_VINFO_CHECK_UNEQUAL_ADDRS (loop_vinfo).length ();
>if (len)
>   /* Count LEN - 1 ANDs and LEN comparisons.  */
>   (void) add_stmt_cost (target_cost_data, len * 2 - 1,
> -   scalar_stmt, NULL, NULL_TREE, 0, vect_prologue);
> +   scalar_stmt, vect_prologue);
>len = LOOP_VINFO_LOWER_BOUNDS (loop_vinfo).length ();
>if (len)
>   {
> @@ -3976,7 +3969,7 @@ vect_estimate_min_profitable_iters (loop_vec_info 
> loop_vinfo,
>   if (!LOOP_VINFO_LOWER_BOUNDS (loop_vinfo)[i].unsigned_p)
> nstmts += 1;
> (void) add_stmt_cost (target_cost_data, nstmts,
> - scalar_stmt, NULL, NULL_TREE, 0, vect_prologue);
> + scalar_stmt, vect_prologue);
>   }
>if (dump_enabled_p ())
>   dump_printf (MSG_NOTE,
> @@ -3998,7 +3991,7 @@ vect_estimate_min_profita

Re: [PATCH 2/3] tree-optimization/104582 - make SLP node available in vector cost hook

2022-02-21 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> This adjusts the vectorizer costing API to allow passing down the
> SLP node the vector stmt is created from.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, I've built
> aarch64 and rs6000 cc1 crosses.
>
> OK?

Not sure about the stmt_info + no node overload.  It'll make it too
easy to forget to add the SLP node in new code, in cases where a node is
available.  (This reminds me of how we have sometimes forgotten to pass
a vectype to vect_is_simple_use in cases where the vectype is actually
necessary.)

No strong opinion about the no stmt_info + node overload, but it looks
like it's only used in a couple of places, so maybe it would be better
not to have that either?

LGTM otherwise FWIW.  If you disagree with the above then let's just
go with the patch as-is.

Thanks,
Richard

> Thanks,
> Richard.
>
> 2022-02-18  Richard Biener  
>
>   PR tree-optimization/104582
>   * tree-vectorizer.h (stmt_info_for_cost::node): New field.
>   (vector_costs::add_stmt_cost): Add SLP node parameter.
>   (dump_stmt_cost): Likewise.
>   (add_stmt_cost): Likewise, new overload and adjust.
>   (add_stmt_costs): Adjust.
>   (record_stmt_cost): New overload.
>   * tree-vectorizer.cc (dump_stmt_cost): Dump the SLP node.
>   (vector_costs::add_stmt_cost): Adjust.
>   * tree-vect-loop.cc (vect_estimate_min_profitable_iters):
>   Adjust.
>   * tree-vect-slp.cc (vect_prologue_cost_for_slp): Record
>   the SLP node for costing.
>   (vectorizable_slp_permutation): Likewise.
>   * tree-vect-stmts.cc (record_stmt_cost): Adjust and add
>   new overloads.
>   * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
>   Adjust.
>   * config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost):
>   Adjust.
>   * config/rs6000/rs6000.cc (rs6000_vector_costs::add_stmt_cost):
>   Adjust.
>   (rs6000_cost_data::adjust_vect_cost_per_loop): Likewise.
> ---
>  gcc/config/aarch64/aarch64.cc |  6 +++---
>  gcc/config/i386/i386.cc   |  9 +
>  gcc/config/rs6000/rs6000.cc   | 10 ++
>  gcc/tree-vect-loop.cc | 16 +---
>  gcc/tree-vect-slp.cc  |  4 ++--
>  gcc/tree-vect-stmts.cc| 30 ++
>  gcc/tree-vectorizer.cc| 10 +++---
>  gcc/tree-vectorizer.h | 28 +---
>  8 files changed, 75 insertions(+), 38 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 37ed22bcc94..fb40b7e9c78 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -15008,7 +15008,7 @@ public:
>aarch64_vector_costs (vec_info *, bool);
>  
>unsigned int add_stmt_cost (int count, vect_cost_for_stmt kind,
> -   stmt_vec_info stmt_info, tree vectype,
> +   stmt_vec_info stmt_info, slp_tree, tree vectype,
> int misalign,
> vect_cost_model_location where) override;
>void finish_cost (const vector_costs *) override;
> @@ -15953,8 +15953,8 @@ aarch64_stp_sequence_cost (unsigned int count, 
> vect_cost_for_stmt kind,
>  
>  unsigned
>  aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> -  stmt_vec_info stmt_info, tree vectype,
> -  int misalign,
> +  stmt_vec_info stmt_info, slp_tree,
> +  tree vectype, int misalign,
>vect_cost_model_location where)
>  {
>fractional_cost stmt_cost
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index e4b42fbba6f..0830dbd7dca 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -22982,8 +22982,8 @@ class ix86_vector_costs : public vector_costs
>using vector_costs::vector_costs;
>  
>unsigned int add_stmt_cost (int count, vect_cost_for_stmt kind,
> -   stmt_vec_info stmt_info, tree vectype,
> -   int misalign,
> +   stmt_vec_info stmt_info, slp_tree node,
> +   tree vectype, int misalign,
> vect_cost_model_location where) override;
>  };
>  
> @@ -22997,8 +22997,9 @@ ix86_vectorize_create_costs (vec_info *vinfo, bool 
> costing_for_scalar)
>  
>  unsigned
>  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> -   stmt_vec_info stmt_info, tree vectype,
> -   int misalign, vect_cost_model_location where)
> +   stmt_vec_info stmt_info, slp_tree,
> +   tree vectype, int misalign,
> +   vect_cost_model_location where)
>  {
>unsigned retval = 0;
>bool scalar_p
> diff --git a/gcc/config/rs6000/rs6000.cc b/g

Re: [PATCH 2/3] tree-optimization/104582 - make SLP node available in vector cost hook

2022-02-21 Thread Richard Biener via Gcc-patches
On Mon, 21 Feb 2022, Richard Sandiford wrote:

> Richard Biener  writes:
> > This adjusts the vectorizer costing API to allow passing down the
> > SLP node the vector stmt is created from.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, I've built
> > aarch64 and rs6000 cc1 crosses.
> >
> > OK?
> 
> Not sure about the stmt_info + no node overload.  It'll make it too
> easy to forget to add the SLP node in new code, in cases where a node is
> available.  (This reminds me of how we have sometimes forgotten to pass
> a vectype to vect_is_simple_use in cases where the vectype is actually
> necessary.)
> 
> No strong opinion about the no stmt_info + node overload, but it looks
> like it's only used in a couple of places, so maybe it would be better
> not to have that either?
> 
> LGTM otherwise FWIW.  If you disagree with the above then let's just
> go with the patch as-is.

I've gone with the two overloads because we shouldn't pass down both
SLP node and stmt_info, it should be either-or (or none), but never both.

In the future it might all be different of course but I've tried to
not complicate things here.  In principle we might now ditch the
vectype argument in favor of stmt_info->vectype / node->vectype, but
I've tried to be simple at this point.

So yeah, I think doing better needs a lot more thoughts - a good next
target would be to improve costings for permutes - the patch gets
us access to the ones with explicit SLP nodes but not those from
load permutations (but how we implement those depends on the
vectorizable_load strategy as well).  Permutes are also not so
obvious to use given they need to be "translated" with the actual
number of vector stmts and the vector type in mind.  One of the ideas
I had was to, during vectorizable_* build an actual instantiation SLP
tree matching the code generation strategy and duplicating nodes to
have one for each actual vector (but then one could maybe also just
code generate to GIMPLE?).

Anyway, unless you see stage4 uses for patches 1 + 2 in the series
I'm going to wait for the patch 3 conclusion from the x86 maintainers
and otherwise will push this to stage1 (I still think it's a step
in the right direction).

Thanks,
Richard.


> Thanks,
> Richard
> 
> > Thanks,
> > Richard.
> >
> > 2022-02-18  Richard Biener  
> >
> > PR tree-optimization/104582
> > * tree-vectorizer.h (stmt_info_for_cost::node): New field.
> > (vector_costs::add_stmt_cost): Add SLP node parameter.
> > (dump_stmt_cost): Likewise.
> > (add_stmt_cost): Likewise, new overload and adjust.
> > (add_stmt_costs): Adjust.
> > (record_stmt_cost): New overload.
> > * tree-vectorizer.cc (dump_stmt_cost): Dump the SLP node.
> > (vector_costs::add_stmt_cost): Adjust.
> > * tree-vect-loop.cc (vect_estimate_min_profitable_iters):
> > Adjust.
> > * tree-vect-slp.cc (vect_prologue_cost_for_slp): Record
> > the SLP node for costing.
> > (vectorizable_slp_permutation): Likewise.
> > * tree-vect-stmts.cc (record_stmt_cost): Adjust and add
> > new overloads.
> > * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
> > Adjust.
> > * config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost):
> > Adjust.
> > * config/rs6000/rs6000.cc (rs6000_vector_costs::add_stmt_cost):
> > Adjust.
> > (rs6000_cost_data::adjust_vect_cost_per_loop): Likewise.
> > ---
> >  gcc/config/aarch64/aarch64.cc |  6 +++---
> >  gcc/config/i386/i386.cc   |  9 +
> >  gcc/config/rs6000/rs6000.cc   | 10 ++
> >  gcc/tree-vect-loop.cc | 16 +---
> >  gcc/tree-vect-slp.cc  |  4 ++--
> >  gcc/tree-vect-stmts.cc| 30 ++
> >  gcc/tree-vectorizer.cc| 10 +++---
> >  gcc/tree-vectorizer.h | 28 +---
> >  8 files changed, 75 insertions(+), 38 deletions(-)
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 37ed22bcc94..fb40b7e9c78 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -15008,7 +15008,7 @@ public:
> >aarch64_vector_costs (vec_info *, bool);
> >  
> >unsigned int add_stmt_cost (int count, vect_cost_for_stmt kind,
> > - stmt_vec_info stmt_info, tree vectype,
> > + stmt_vec_info stmt_info, slp_tree, tree vectype,
> >   int misalign,
> >   vect_cost_model_location where) override;
> >void finish_cost (const vector_costs *) override;
> > @@ -15953,8 +15953,8 @@ aarch64_stp_sequence_cost (unsigned int count, 
> > vect_cost_for_stmt kind,
> >  
> >  unsigned
> >  aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> > -stmt_vec_info stmt_info, tree vectype,
> > -int misalign,
> > +stmt_vec_info stmt_info, slp_tr

Re: [pushed] LRA, rs6000, Darwin: Amend lo_sum use for forced constants [PR104117].

2022-02-21 Thread Richard Sandiford via Gcc-patches
Iain Sandoe  writes:
> Hi Folks.
>> On 14 Feb 2022, at 16:58, Vladimir Makarov  wrote:
>> On 2022-02-14 11:00, Richard Sandiford wrote:
>
>>> Vladimir Makarov via Gcc-patches  writes:
 
 Hi, Richard.  Change LRA is mine and I approved it for Iain's patch.
 
 I think there is no need for this code and it is misleading.  If
 'mem[low_sum]' does not work, I don't think that 'reg=low_sum;mem[reg]'
 will help for any existing target.  As machine-dependent code for any
 target most probably (for ppc64 darwin it is exactly the case) checks
 address only in memory, it can wrongly accept wrong address by reloading
 it into reg and use it in memory. So these are my arguments for the
 remove this code from process_address_1.
>>> I'm probably making too much of this, but:
>>> 
>>> I think the code is potentially useful in that existing targets do
>>> forbid lo_sum addresses in certain contexts (due to limited offset range)
>>> while still wanting lo_sum to be used to load the address.  If we
>>> handle the high/lo_sum split in generic code then we have more chance
>>> of being able to optimise things.  So it feels like this is setting an
>>> unfortunate precedent.
>>> 
>>> I still don't understand what went wrong before though (the PR trail
>>> was a bit too long to process :-)).  Is there a case where
>>> (lo_sum (high X) X) != X?  If so, that seems like a target bug to me.
>>> Or does the target accept (set R1 (lo_sum R2 X)) for an X that cannot
>>> be split into a HIGH/LO_SUM pair?  I'd argue that's a target bug too.
>>> 
>> Sometimes it is hard to make a line where an RA bug is a bug in 
>> machine-dependent code or in RA itself.
>> 
>> For this case I would say it is a bug in the both parts.
>> 
>> Low-sum is generated by LRA and it does not know that it should be wrapped 
>> by unspec for darwin. Generally speaking we could avoid the change in LRA 
>> but it would require to do non-trivial analysis in machine dependent code to 
>> find cases when 'reg=low_sum ... mem[reg]' is incorrect code for darwin 
>> (PIC) target (and may be some other PIC targets too). Therefore I believe 
>> the change in LRA is a good solution even if the change can potentially 
>> result in less optimized code for some cases.  Taking your concern into 
>> account we could probably improve the patch by introducing a hook (I never 
>> liked such solutions as we already have too many hooks directing RA) or, 
>> better, make the LRA change work only for PIC targets.  Something like 
>> this (it probably needs better recognition of PIC targets):
>> 
>> --- a/gcc/lra-constraints.cc
>> +++ b/gcc/lra-constraints.cc
>> @@ -3616,21 +3616,21 @@ process_address_1 (int nop, bool check_only_p,
>>   if (HAVE_lo_sum)
>> {
>>   /* addr => lo_sum (new_base, addr), case (2) above.  */
>>   insn = emit_insn (gen_rtx_SET
>> (new_reg,
>>  gen_rtx_HIGH (Pmode, copy_rtx (addr))));
>>   code = recog_memoized (insn);
>>   if (code >= 0)
>> {
>>   *ad.inner = gen_rtx_LO_SUM (Pmode, new_reg, addr);
>> - if (!valid_address_p (op, &ad, cn))
>> + if (!valid_address_p (op, &ad, cn) && !flag_pic)
>
> IMO the PIC aspect of this is possibly misleading
>  - the issue is that we have an invalid address, and that such addresses in 
> this case need to be legitimised by wrapping them in an UNSPEC. 
> - My concern about the generic code was that I would not expect Darwin to be 
> the only platform that might need to wrap an invalid address in an unspec  
> [TOC, TLS, small data etc. spring to mind].

Yeah, that part is pretty common.

> I need some help understanding the expected pathway through this code that 
> could be useful.
>
> we start with an invalid address.
>
> 1. we build (set reg (high invalid_address))
>  - Darwin was allowing this (and the lo_sum) [everywhere, not just here] on 
> the basis that the target legitimizer would be called later to fix it up.  
> (that is why the initial recog passes) - but AFAICT we never call the 
> target’s address legitimizer.

I think it's really the other way around: legitimisers are (or should be)
called before recog, with the legitimisers producing instructions that
recog is known to accept.

Once something has been recog()ed, the target has to be prepared
to generate code for it, either directly from an asm template or
indirectly from a define_split.  There's no backing out beyond
that point.

For:

  (set … (mem (lo_sum (reg Y) (…

the lo_sum is an address that needs to be legitimised (because it's
in a mem) but in:

  (set (reg X) (lo_sum (reg Y) (…)))// A
  (set … (mem (reg X)))

it isn't: it's just an ordinary bit of arithmetic whose result happens
to be used as an address later.  If the target accepts A then it must be
prepared to generate code for it, without other hooks

Re: [PATCH] Implement constant-folding simplifications of reductions.

2022-02-21 Thread Richard Biener via Gcc-patches
On Mon, Feb 21, 2022 at 9:31 AM Roger Sayle  wrote:
>
>
> Hi Marc,
> I'm assuming that the use (semantics) of a REDUC_PLUS expr allow the
> reduction to be done in any order, for example the testcase requires
> -ffast-math to allow the REDUC_PLUS to be introduced in the first place.
> This also applies explains why the patch doesn't need to distinguish
> negative zeros from positive zeros in ctor_single_nonzero_element
> (but it's perhaps something to beware of in uses of VECTOR_CST's
> single_nonzero_element).

.IFN_REDUC_PLUS directly maps to the reduc_plus_scal optab
which does not specify an operation order.  There's also
fold_left_plus which does (left-to-right).

An argument could be made that constant folding should do what
the CPU will end up doing but we're already not doing that for
double arithmetic and flush-to-zero enabled with -ffast-math and
denormals I think.

+/* Return the single non-zero element of a CONSTRUCTOR or NULL_TREE.  */
+tree
+ctor_single_nonzero_element (const_tree t)
+{
+  unsigned HOST_WIDE_INT idx;
+  constructor_elt *ce;
+  tree elt = NULL_TREE;
+
+  if (TREE_CODE (t) == SSA_NAME)
+{
+  gassign *def = dyn_cast <gassign *> (SSA_NAME_DEF_STMT (t));

I think this function should expect a CONSTRUCTOR 'T' and the use in
match.pd be adjusted similar to other CONSTRUCTOR users like

(simplify
 (BIT_FIELD_REF CONSTRUCTOR@0 @1 @2)
...
  (with
   {
 tree ctor = (TREE_CODE (@0) == SSA_NAME
  ? gimple_assign_rhs1 (SSA_NAME_DEF_STMT (@0)) : @0);


+/* Fold reduction of a single nonzero element constructor.  */
+(for reduc (IFN_REDUC_PLUS IFN_REDUC_IOR IFN_REDUC_XOR)
+  (simplify (reduc (CONSTRUCTOR@0))
+(with { tree elt = ctor_single_nonzero_element (@0); }
+  (if (elt)
+(non_lvalue { elt; })))))

is (non_lvalue ...) really necessary?  I highly doubt so.

I wonder if there's a case like { _1, 0., 0., -0. } where, independent of
the order of the operations, this transform relies on !HONOR_SIGNED_ZEROS?
It likely also should demote _1 from sNaN to qNaN and thus relies on
-fno-signaling-nans?
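
(Concretely: if _1 is -0.0 at run time, adding up { -0.0, 0., 0., -0. }
yields +0.0 in any association order under IEEE rules, while returning the
element directly yields -0.0.)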

Otherwise OK.

Thanks,
Richard.

> Cheers,
> Roger
> --
>
> > -----Original Message-----
> > From: Marc Glisse 
> > Sent: 21 February 2022 08:21
> > To: Roger Sayle 
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH] Implement constant-folding simplifications of reductions.
> >
> > On Mon, 21 Feb 2022, Roger Sayle wrote:
> >
> > > +/* Fold REDUC (@0 op VECTOR_CST) as REDUC (@0) op REDUC (VECTOR_CST).  */
> > > +(for reduc (IFN_REDUC_PLUS IFN_REDUC_MAX IFN_REDUC_MIN IFN_REDUC_FMAX
> > > +	    IFN_REDUC_FMIN IFN_REDUC_AND IFN_REDUC_IOR IFN_REDUC_XOR)
> > > + op (plus max min IFN_FMAX IFN_FMIN bit_and bit_ior bit_xor)
> > > +  (simplify (reduc (op @0 VECTOR_CST@1))
> > > +    (op (reduc:type @0) (reduc:type @1))))
> >
> > I wonder if we need to test flag_associative_math for the 'plus' case, or
> > if the presence of IFN_REDUC_PLUS is enough to justify the possible loss
> > of precision.
> >
> > --
> > Marc Glisse
>


[PATCH] OpenMP, libgomp: Add new runtime routines omp_target_memcpy_async and omp_target_memcpy_rect_async

2022-02-21 Thread Marcel Vollweiler

Hi,

This patch adds two new OpenMP runtime routines: omp_target_memcpy_async and
omp_target_memcpy_rect_async. Both functions are introduced in OpenMP 5.1 as
asynchronous variants of omp_target_memcpy and omp_target_memcpy_rect.

In contrast to the synchronous variants, the asynchronous functions have two
additional function parameters to allow the specification of task dependences:

   int depobj_count
   omp_depend_t *depobj_list

   integer(c_int), value :: depobj_count
   integer(omp_depend_kind), optional :: depobj_list(*)

The implementation splits the synchronous functions into two parts: (a) check
and (b) copy. Then (a) is used in the asynchronous functions for the sequential
part, and the actual copy process (b) is executed in a new created task. The
sequential part (a) takes into account the requirements for the return values:

"The routine returns zero if successful. Otherwise, it returns a non-zero
value." (omp_target_memcpy_async, OpenMP 5.1 spec, section 3.8.7)

"An application can determine the number of inclusive dimensions supported by an
implementation by passing NULL pointers (or C_NULL_PTR, for Fortran) for both
dst and src. The routine returns the number of dimensions supported by the
implementation for the specified device numbers. No copy operation is
performed." (omp_target_memcpy_rect_async, OpenMP 5.1 spec, section 3.8.8)

Due to asynchronicity an error is thrown if the asynchronous memcpy is not
successful (in contrast to the synchronous functions which use a return
value unequal to zero).
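
A minimal usage sketch (not part of the patch; the buffer names and sizes
are made up, and it assumes a device is available):

  #include <omp.h>

  #define N 128

  int
  main (void)
  {
    int src[N] = { 0 }, x = 0;
    int host = omp_get_initial_device ();
    int dev = omp_get_default_device ();
    void *dst = omp_target_alloc (N * sizeof (int), dev);

    omp_depend_t dep;
    #pragma omp depobj (dep) depend (inout: x)

    /* The copy runs in a generated task and may still be in flight when
       the call returns; taskwait synchronizes with it.  */
    int rc = omp_target_memcpy_async (dst, src, N * sizeof (int),
                                      0, 0, dev, host, 1, &dep);
    #pragma omp taskwait

    omp_target_free (dst, dev);
    return rc;
  }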

The patch was tested on x86_64-linux with nvptx and amdgcn offloading and with
PowerPC with nvptx offloading. All with no regressions.

Marcel

gcc/ChangeLog:

* omp-low.cc (omp_runtime_api_call): Added target_memcpy_async and
target_memcpy_rect_async to omp_runtime_apis array.

libgomp/ChangeLog:

* libgomp.map: Added omp_target_memcpy_async and
omp_target_memcpy_rect_async.
* libgomp.texi: Both functions are now supported.
* omp.h.in: Added omp_target_memcpy_async and
omp_target_memcpy_rect_async.
* omp_lib.f90.in: Added interfaces for both new functions.
* omp_lib.h.in: Likewise.
* target.c (omp_target_memcpy): Restructured into check and copy part.
(omp_target_memcpy_check): New helper function for omp_target_memcpy and
omp_target_memcpy_async that checks requirements.
(omp_target_memcpy_copy): New helper function for omp_target_memcpy and
omp_target_memcpy_async that performs the memcpy.
(omp_target_memcpy_async_helper): New helper function that is used in
omp_target_memcpy_async for the asynchronous task.
(omp_target_memcpy_async): Added.
(omp_target_memcpy_rect): Restructured into check and copy part.
(omp_target_memcpy_rect_check): New helper function for
omp_target_memcpy_rect and omp_target_memcpy_rect_async that checks
requirements.
(omp_target_memcpy_rect_copy): New helper function for
omp_target_m

Re: [PATCH] PR middle-end/65855: Scalar evolution for quadratic chrecs

2022-02-21 Thread Richard Biener via Gcc-patches
On Fri, Feb 18, 2022 at 7:20 PM Roger Sayle  wrote:
>
>
> This patch improves GCC's scalar evolution and final value replacement
> optimizations by supporting triangular/quadratic/trapezoid chrecs which
> resolves both PR middle-end/65855 and PR c/80852, but alas not (yet)
> PR tree-optimization/46186.
>
> I've listed Richard Biener as co-author, as this solution is based
> heavily on his proposed patch in comment #4 of PR 65855 from 2015,
> but with several significant changes.  The most important change is
> a correction to formula used. For the scalar evolution {a, +, {b, +, c}},
> there was an off-by-one error, so chrec_apply should not return
> a + b*x + c*x*(x+1)/2, but a + b*x + c*x*(x-1)/2, which explains
> why the original patch didn't produce the expected results.
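
(As a quick check of the corrected formula: for {0, +, {1, +, 1}} the
per-iteration steps are 1, 2, 3, ..., so after x iterations the value is
1 + 2 + ... + x = x*(x+1)/2, and indeed 0 + 1*x + 1*x*(x-1)/2 = x*(x+1)/2,
whereas the x*(x+1)/2 variant would have given x + x*(x+1)/2.)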
>
> Another significant set of changes involves handling the various
> type mismatches and overflows.  In chrec_apply the evolution is
> evaluated after x iterations (which is an unsigned integer type,
> called orig_type in this patch) but a, b and c may be any type
> including floating point (see PR tree-opt/23391) and including
> types that trap on overflow with -ftrapv (see PR tree-opt/79721),
> and in the case of pointer arithmetic, b and c may have a
> different type (sizetype) to a!  Additionally, the division by
> two in "c*x*(x-1)/2" requires the correct top bit in modulo
> arithmetic, which means that the multiplication typically needs
> to be performed with more precision (in a wider mode) than
> orig_type [unless c is an even integer constant, or x (the
> number of iterations) is a known constant at compile-time].
>
> Finally, final value replacement is only an effective optimization
> if the expression used to compute the result of the loop is cheaper
> than the loop itself, hence chrec_apply needs to return an optimized
> folded tree with the minimal number of operators.  Hence if b == c,
> this patch calculates "a + c*x*(x+1)/2", when c is even we can
> perform the division at compile-time, and when x is a non-trivial
> expression, we wrap it in a SAVE_EXPR so that the lowering to
> gimple can reuse the common subexpression.
>
> Correctly handling all of the corner cases results in a patch
> significantly larger than the original "proof-of-concept".
> There's one remaining refinement, marked as TODO in the patch,
> which is to also support unsigned 64-bit to 128-bit widening
> multiplications (umulditi3) on targets that support it, such as
> x86_64, as used by LLVM, but this patch provides the core
> target-independent functionality.  [This last piece would
> handle/resolve the testcase in PR tree-opt/46186 which
> requires 128-bit TImode operations, not suitable for all
> backend targets].
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check with no new failures.  Ok for mainline?

+/* Fold the (exact) division of chrec by two.  */
+
+static tree
+chrec_fold_divide_by_2 (tree type, tree op)
+{
+  if (SCALAR_FLOAT_TYPE_P (type))
+return fold_build2 (MULT_EXPR, type, op, build_real (type, dconsthalf));
+  return fold_build2 (RSHIFT_EXPR, type, op, build_int_cst (type, 1));

any reason to not use EXACT_DIV_EXPR?

+/* Indicate that a chrec subexpression is used multiple times.  */
+
+static tree
+chrec_save_expr (tree t)
+{
+  if (is_gimple_min_invariant (t))
+return t;
+  t = save_expr (t);
+  if (TREE_CODE (t) == SAVE_EXPR)
+TREE_SIDE_EFFECTS (t) = 0;
+  return t;

why clear TREE_SIDE_EFFECTS here?  IIRC SCEV analysis
simply re-uses subtrees (thus generates a graph) without
wrapping the shared leafs in SAVE_EXPR (and then have
unshare_expr before gimplification blow it up of course...).

+  /* Determine whether c is even.  */
+  tree half_c = NULL_TREE;
+  if (TREE_CODE (c) == INTEGER_CST
+  && (wi::to_widest (c) & 1) == 0)

wi::to_wide should work as well and is cheaper

+  /* Try to narrow the original type, by stripping zero extensions.  */
+  tree orig_type = TREE_TYPE (x);
+  if (TREE_CODE (x) == NOP_EXPR)
+{

so we do have get_narrower (), can we use that?

+  tree wide_type;
+  /* Prefer to perform multiplications in type CTYPE.  */
+  if (INTEGRAL_TYPE_P (ctype)
+  && TYPE_UNSIGNED (ctype)
+  && TYPE_PRECISION (ctype) > prec)
+wide_type = ctype;
+  else if (TYPE_PRECISION (unsigned_type_node) > prec)
+wide_type = unsigned_type_node;
+  else if (TYPE_PRECISION (long_unsigned_type_node) > prec)
+wide_type = long_unsigned_type_node;
+  else if (TYPE_PRECISION (long_long_unsigned_type_node) > prec)
+wide_type = long_long_unsigned_type_node;
+  else
+/* TODO: Try TImode on targets that support umul_widen_optab.  */
+return chrec_dont_know;

can we use mode_for_size () and type_for_mode () instead of hard-coding
these?  You can look for a umul_optab if we do not want to risk
libgcc calls.

There's some code that looks like it can be factored.  Maybe my eyes
are mis-matching things of course:

+  res = chrec_fold_plus (wide_typ

[PATCH] c++: PR c++/95999: Improved error recovery in enumeration lists.

2022-02-21 Thread Roger Sayle

This patch resolves PR c++/95999 which is an ICE-after-error regression
in the g++ front-end.  When parsing an enumerator list, the C++ parser
assumes that cp_parser_constant_expression always returns either an
INTEGER_CST or error_mark_node, but in testcase reported in the PR
actually returns a VAR_DECL.

The usual (but perhaps controversial) design philosophy is that the
routine that reports the error normally has a duty to indicate this to
the rest of the compiler (via error_mark_node), but here the return
value from calling require_rvalue_constant_expression (parser.cc:10666)
is ignored.  I initially experimented with setting EXPRESSION to
error_mark_node here in cp_parser_constant_expression but (perhaps
conveniently) that's insufficient to resolve the problem.  The simple
fix in this patch is to tweak the two places that require INTEGER_CST
to treat all other tree types as though they are error_mark_node.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check with no new (unexpected) failures.  Ok for mainline?


2022-02-21  Roger Sayle  

gcc/cp/ChangeLog
PR c++/95999
* decl.cc (finish_enum_value_list): If VALUE isn't an INTEGER_CST
consider it to be zero (i.e. treat it like error_mark_node).
(build_enumerator): Likewise, if PREV_VALUE isn't an INTEGER_CST,
set VALUE to error_mark_node.

gcc/testsuite/ChangeLog
PR c++/95999
* g++.dg/pr95999.C: New test case.


Thanks in advance,
Roger
--

diff --git a/gcc/cp/decl.cc b/gcc/cp/decl.cc
index 7b48b56..40b8f0c 100644
--- a/gcc/cp/decl.cc
+++ b/gcc/cp/decl.cc
@@ -16202,7 +16202,7 @@ finish_enum_value_list (tree enumtype)
 
  /* Update the minimum and maximum values, if appropriate.  */
  value = DECL_INITIAL (decl);
- if (value == error_mark_node)
+ if (TREE_CODE (value) != INTEGER_CST)
value = integer_zero_node;
  /* Figure out what the minimum and maximum values of the
 enumerators are.  */
@@ -16491,7 +16491,7 @@ build_enumerator (tree name, tree value, tree enumtype, tree attributes,
 which case the type is an unspecified integral type
 sufficient to contain the incremented value.  */
  prev_value = DECL_INITIAL (TREE_VALUE (TYPE_VALUES (enumtype)));
- if (error_operand_p (prev_value))
+ if (TREE_CODE (prev_value) != INTEGER_CST)
value = error_mark_node;
  else
{
diff --git a/gcc/testsuite/g++.dg/pr95999.C b/gcc/testsuite/g++.dg/pr95999.C
new file mode 100644
index 000..f399f6d
--- /dev/null
+++ b/gcc/testsuite/g++.dg/pr95999.C
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+int a;
+enum struct b;
+template  enum struct b { e };  // { dg-error "explicit specialization" }
+// { dg-excess-errors "note" }
+// { dg-excess-errors "5:" }
+


Re: [RFC][nvptx] Initialize ptx regs

2022-02-21 Thread Tom de Vries via Gcc-patches

On 2/21/22 08:54, Richard Biener wrote:

On Sun, Feb 20, 2022 at 11:50 PM Tom de Vries via Gcc-patches
 wrote:


Hi,

With nvptx target, driver version 510.47.03 and board GT 1030, we run into:
...
FAIL: gcc.c-torture/execute/pr53465.c -O1 execution test
FAIL: gcc.c-torture/execute/pr53465.c -O2 execution test
FAIL: gcc.c-torture/execute/pr53465.c -O3 -g execution test
...
while the test-cases pass with nvptx-none-run -O0.

The problem is that the generated ptx contains a read from an uninitialized
ptx register, and the driver JIT doesn't handle this well.

For -O2 and -O3, we can get rid of the FAIL using --param
logical-op-non-short-circuit=0.  But not for -O1.

At -O1, the test-case minimizes to:
...
void __attribute__((noinline, noclone))
foo (int y) {
   int c;
   for (int i = 0; i < y; i++)
 {
   int d = i + 1;
   if (i && d <= c)
 __builtin_abort ();
   c = d;
 }
}

int main () {
   foo (2); return 0;
}
...

Note that the test-case does not contain an uninitialized use.  In the first
iteration, i is 0 and consequently c is not read.  In the second iteration, c
is read, but by that time it's already initialized by 'c = d' from the first
iteration.

AFAICT the problem is introduced as follows: the conditional use of c in the
loop body is translated into an unconditional use of c in the loop header:
...
   # c_1 = PHI 
...
which forwprop1 propagates the 'c_9 = d_7' assignment into:
...
   # c_1 = PHI 
...
which ends up being translated by expand into an unconditional:
...
(insn 13 12 0 (set (reg/v:SI 22 [ c ])
 (reg/v:SI 23 [ d ])) -1
  (nil))
...
at the start of the loop body, creating an uninitialized read of d on the
path from loop entry.


Ah, interesting case.  Note that some fixup pass inserted a copy in
the loop header
before coalescing:

;;   basic block 3, loop depth 1
;;pred:   6
;;2
   # c_10 = PHI 
   # i_11 = PHI 
   c_2 = c_10;   <--- this one
   i_8 = i_11;
   d_6 = i_11 + 1;
   if (i_8 != 0)
     goto <bb 4>; [64.00%]
   else
     goto <bb 6>; [36.00%]
;;succ:   4
;;6

;;   basic block 4, loop depth 1
;;pred:   3
   if (d_6 <= c_2)
 goto ; [0.00%]
   else
 goto ; [100.00%]

we try to coalesce both c_10 to d_6 and i_11 to d_6, both have the same
cost I think and we succeed with the first which happens to be the one with
the default def arg.

I also think whether we coalesce or not doesn't really matter for the issue at
hand, the copy on entry should be elided anyway but the odd inserted copy
should be investigated (it looks unnecessary and it should be placed before
the single-use, not in the header).



Thanks for looking into this in detail; unfortunately I'm not familiar 
with this part of the compiler so I can't really comment on your findings.



By disabling coalesce_ssa_name, we get the more usual copies on the incoming
edges.  The copy on the loop entry path still does an uninitialized read, but
that one's now initialized by init-regs.  The test-case passes, also when
disabling init-regs, so it's possible that the JIT driver doesn't object to
this type of uninitialized read.

Now that we characterized the problem to some degree, we need to fix this,
because either:
- we're violating an undocumented ptx invariant, and this is a compiler bug,
   or
- this is a driver JIT bug and we need to work around it.


So what does the JIT do that ends up breaking things?  Does the
actual hardware ISA have NaTs and trap?



That's a good question.  I haven't studied the actual hardware in 
detail, so I can't answer that. [ And the driver being closed source and 
the hardware undocumented by nvidia doesn't help in quickly finding a 
reliable answer there. ]


However, my theory is the following: there are one or several 
optimizations in the JIT that take a read from an uninitialized register 
as sufficient proof to delete (some) dependent insns.


We've seen this in action before.  The following is documented at 
WORKAROUND_PTXJIT_BUG in nvptx.cc.


Consider this code:
...
{
.reg .u32 %x; 

mov.u32 %x,%tid.x; 

setp.ne.u32 %rnotvzero,%x,0; 


}

 @%rnotvzero bra Lskip;
 setp.. %rcond,op1,op2;
 Lskip:

 selp.u32 %rcondu32,1,0,%rcond;
 shfl.idx.b32 %rcondu32,%rcondu32,0,31;
 setp.ne.u32 %rcond,%rcondu32,0;
...

My interpretation of what is supposed to happen, as per the ptx 
specification is as follows.


The conditional setp. sets register %rcond, but just for thread 0 in 
the warp (of 32 threads).  The register remains uninitialized for 
threads 1..31.


The selp sets register %rcondu32. For thread 0, it'll contain a defined 
value,  but for threads 1..31 it'll have undefined values.


The shfl propagates the value of thread 0 in register %rcondu32 to 
threads 0..31.  The register %rcond

[PATCH] [gfortran] Set omp_requires_mask for dynamic_allocators.

2022-02-21 Thread Hafiz Abid Qadeer
This patch fixes an issue where, although gfortran accepts
'requires dynamic_allocators', it does not set the omp_requires_mask
accordingly.

gcc/fortran/ChangeLog:

* parse.cc (gfc_parse_file): Set OMP_REQUIRES_DYNAMIC_ALLOCATORS
bit in omp_requires_mask.
---
 gcc/fortran/parse.cc | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gcc/fortran/parse.cc b/gcc/fortran/parse.cc
index db918291b10..821555bd85f 100644
--- a/gcc/fortran/parse.cc
+++ b/gcc/fortran/parse.cc
@@ -6890,6 +6890,9 @@ done:
   break;
 }
 
+  if (omp_requires & OMP_REQ_DYNAMIC_ALLOCATORS)
+omp_requires_mask
+   = (enum omp_requires) (omp_requires_mask | 
OMP_REQUIRES_DYNAMIC_ALLOCATORS);
   /* Do the parse tree dump.  */
   gfc_current_ns = flag_dump_fortran_original ? gfc_global_ns_list : NULL;
 
-- 
2.25.1



[committed] c++: Add testcase for already fixed PR [PR85493]

2022-02-21 Thread Patrick Palka via Gcc-patches
The a1 and a2 cases were fixed (by diagnosing the invalid expression)
with r11-434, and the a3 case with r8-7625.

PR c++/85493

gcc/testsuite/ChangeLog:

* g++.dg/cpp0x/decltype80.C: New test.
---
 gcc/testsuite/g++.dg/cpp0x/decltype80.C | 16 
 1 file changed, 16 insertions(+)
 create mode 100644 gcc/testsuite/g++.dg/cpp0x/decltype80.C

diff --git a/gcc/testsuite/g++.dg/cpp0x/decltype80.C 
b/gcc/testsuite/g++.dg/cpp0x/decltype80.C
new file mode 100644
index 000..6ad140fe548
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/decltype80.C
@@ -0,0 +1,16 @@
+// PR c++/85493
+// { dg-do compile { target c++11 } }
+
+struct no_def {
+  no_def() = delete;
+};
+
+template
+int foo() = delete;
+
+template
+void test() {
+  decltype(no_def()) a1; // { dg-error "deleted" }
+  decltype(no_def(1,2,3)) a2; // { dg-error "no match" }
+  decltype(foo<>()) a3; // { dg-error "deleted" }
+}
-- 
2.35.1.225.ge2ac9141e6



Re: [RFC][nvptx] Initialize ptx regs

2022-02-21 Thread Richard Biener via Gcc-patches
On Mon, Feb 21, 2022 at 2:47 PM Tom de Vries  wrote:
>
> On 2/21/22 08:54, Richard Biener wrote:
> > On Sun, Feb 20, 2022 at 11:50 PM Tom de Vries via Gcc-patches
> >  wrote:
> >>
> >> Hi,
> >>
> >> With nvptx target, driver version 510.47.03 and board GT 1030, we run 
> >> into:
> >> ...
> >> FAIL: gcc.c-torture/execute/pr53465.c -O1 execution test
> >> FAIL: gcc.c-torture/execute/pr53465.c -O2 execution test
> >> FAIL: gcc.c-torture/execute/pr53465.c -O3 -g execution test
> >> ...
> >> while the test-cases pass with nvptx-none-run -O0.
> >>
> >> The problem is that the generated ptx contains a read from an uninitialized
> >> ptx register, and the driver JIT doesn't handle this well.
> >>
> >> For -O2 and -O3, we can get rid of the FAIL using --param
> >> logical-op-non-short-circuit=0.  But not for -O1.
> >>
> >> At -O1, the test-case minimizes to:
> >> ...
> >> void __attribute__((noinline, noclone))
> >> foo (int y) {
> >>int c;
> >>for (int i = 0; i < y; i++)
> >>  {
> >>int d = i + 1;
> >>if (i && d <= c)
> >>  __builtin_abort ();
> >>c = d;
> >>  }
> >> }
> >>
> >> int main () {
> >>foo (2); return 0;
> >> }
> >> ...
> >>
> >> Note that the test-case does not contain an uninitialized use.  In the 
> >> first
> >> iteration, i is 0 and consequently c is not read.  In the second 
> >> iteration, c
> >> is read, but by that time it's already initialized by 'c = d' from the 
> >> first
> >> iteration.
> >>
> >> AFAICT the problem is introduced as follows: the conditional use of c in 
> >> the
> >> loop body is translated into an unconditional use of c in the loop header:
> >> ...
> >># c_1 = PHI 
> >> ...
> >> which forwprop1 propagates the 'c_9 = d_7' assignment into:
> >> ...
> >># c_1 = PHI 
> >> ...
> >> which ends up being translated by expand into an unconditional:
> >> ...
> >> (insn 13 12 0 (set (reg/v:SI 22 [ c ])
> >>  (reg/v:SI 23 [ d ])) -1
> >>   (nil))
> >> ...
> >> at the start of the loop body, creating an uninitialized read of d on the
> >> path from loop entry.
> >
> > Ah, interesting case.  Note that some fixup pass inserted a copy in
> > the loop header
> > before coalescing:
> >
> > ;;   basic block 3, loop depth 1
> > ;;pred:   6
> > ;;2
> ># c_10 = PHI 
> ># i_11 = PHI 
> >c_2 = c_10;   <--- this one
> >i_8 = i_11;
> >d_6 = i_11 + 1;
> >if (i_8 != 0)
> >  goto ; [64.00%]
> >else
> >  goto ; [36.00%]
> > ;;succ:   4
> > ;;6
> >
> > ;;   basic block 4, loop depth 1
> > ;;pred:   3
> >if (d_6 <= c_2)
> >  goto ; [0.00%]
> >else
> >  goto ; [100.00%]
> >
> > we try to coalesce both c_10 to d_6 and i_11 to d_6, both have the same
> > cost I think and we succeed with the first which happens to be the one with
> > the default def arg.
> >
> > I also think whether we coalesce or not doesn't really matter for the issue 
> > at
> > hand, the copy on entry should be elided anyway but the odd inserted copy
> > should be investigated (it looks unnecessary and it should be placed before
> > the single-use, not in the header).
> >
>
> Thanks for looking into this in detail; unfortunately I'm not familiar
> with this part of the compiler so I can't really comment on your findings.
>
> >> By disabling coalesce_ssa_name, we get the more usual copies on the 
> >> incoming
> >> edges.  The copy on the loop entry path still does an uninitialized read, 
> >> but
> >> that one's now initialized by init-regs.  The test-case passes, also when
> >> disabling init-regs, so it's possible that the JIT driver doesn't object to
> >> this type of uninitialized read.
> >>
> >> Now that we characterized the problem to some degree, we need to fix this,
> >> because either:
> >> - we're violating an undocumented ptx invariant, and this is a compiler 
> >> bug,
> >>or
> >> - this is a driver JIT bug and we need to work around it.
> >
> > So what does the JIT do that ends up breaking things?  Does the
> > actual hardware ISA have NaTs and trap?
> >
>
> That's a good question.  I haven't studied the actual hardware in
> detail, so I can't answer that. [ And the driver being closed source and
> the hardware undocumented by nvidia doesn't help in quickly finding a
> reliable answer there. ]
>
> However, my theory is the following: there are one or several
> optimizations in the JIT that take a read from an uninitialized register
> as sufficient proof to delete (some) dependent insns.
>
> We've seen this in action before.  The following is documented at
> WORKAROUND_PTXJIT_BUG in nvptx.cc.
>
> Consider this code:
> ...
>  {
>  .reg .u32 %x;
>
>  mov.u32 %x,%tid.x;
>
>  setp.ne.u32 %rnotvzero,%x,0;
>
>  }
>
>   @%rnotvzero bra Lskip;
>   setp.. %rcond,op1,op2;
>   Lskip:
>
>

[PATCH, OpenMP, C/C++] Handle array reference base-pointers in array sections

2022-02-21 Thread Chung-Lin Tang

Hi Jakub,
as encountered in cases where a program constructs its own deep-copying
for arrays-of-pointers, e.g.:

   #pragma omp target enter data map(to:level->vectors[:N])
   for (i = 0; i < N; i++)
 #pragma omp target enter data map(to:level->vectors[i][:N])

We need to treat the part of the array reference before the array section
as a base-pointer (here 'level->vectors[i]'), providing pointer-attachment 
behavior.
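
For reference, a self-contained sketch of that pattern (the 'Level' type,
its field name and N are invented here for illustration, this is not the
benchmark code):

struct Level { double **vectors; };

void
map_level (struct Level *level, int N)
{
  /* Map the array of pointers itself.  */
  #pragma omp target enter data map(to: level->vectors[:N])
  for (int i = 0; i < N; i++)
    {
      /* 'level->vectors[i]' is the base-pointer of this section and
         needs to be attached to the copy mapped above.  */
      #pragma omp target enter data map(to: level->vectors[i][:N])
    }
}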

This patch adds this inside handle_omp_array_sections(), tracing the whole
sequence of array dimensions, creating a whole base-pointer reference
iteratively using build_array_ref(). The conditions are that each of the
"absorbed" dimensions must be length==1, and the final reference must be
of pointer-type (so that pointer attachment makes sense).

There's also a little patch in gimplify_scan_omp_clauses(), to make sure
the array-ref base-pointer goes down the right path.

This case was encountered when working to make 534.hpgmgfv_t from
SPEChpc 2021 properly compile. Tested without regressions on trunk.
Okay to go in once stage1 opens?

Thanks,
Chung-Lin

2022-02-21  Chung-Lin Tang  

gcc/c/ChangeLog:

* c-typeck.cc (handle_omp_array_sections): Add handling for
creating array-reference base-pointer attachment clause.

gcc/cp/ChangeLog:

* semantics.cc (handle_omp_array_sections): Add handling for
creating array-reference base-pointer attachment clause.

gcc/ChangeLog:

* gimplify.cc (gimplify_scan_omp_clauses): Add case for
attach/detach map kind for ARRAY_REF of POINTER_TYPE.

gcc/testsuite/ChangeLog:

* c-c++-common/gomp/target-enter-data-1.c: Adjust testcase.

libgomp/testsuite/ChangeLog:

* libgomp.c-c++-common/ptr-attach-2.c: New test.

diff --git a/gcc/c/c-typeck.cc b/gcc/c/c-typeck.cc
index 3075c883548..4257e373557 100644
--- a/gcc/c/c-typeck.cc
+++ b/gcc/c/c-typeck.cc
@@ -13649,6 +13649,10 @@ handle_omp_array_sections (tree c, enum 
c_omp_region_type ort)
   if (int_size_in_bytes (TREE_TYPE (first)) <= 0)
maybe_zero_len = true;
 
+  struct dim { tree low_bound, length; };
+  auto_vec dims (num);
+  dims.safe_grow (num);
+
   for (i = num, t = OMP_CLAUSE_DECL (c); i > 0;
   t = TREE_CHAIN (t))
{
@@ -13763,6 +13767,9 @@ handle_omp_array_sections (tree c, enum 
c_omp_region_type ort)
  else
size = size_binop (MULT_EXPR, size, l);
}
+
+ dim d = { low_bound, length };
+ dims[i] = d;
}
   if (side_effects)
size = build2 (COMPOUND_EXPR, sizetype, side_effects, size);
@@ -13802,6 +13809,23 @@ handle_omp_array_sections (tree c, enum 
c_omp_region_type ort)
  OMP_CLAUSE_DECL (c) = t;
  return false;
}
+
+  tree aref = t;
+  for (i = 0; i < dims.length (); i++)
+   {
+ if (dims[i].length && integer_onep (dims[i].length))
+   {
+ tree lb = dims[i].low_bound;
+ aref = build_array_ref (OMP_CLAUSE_LOCATION (c), aref, lb);
+   }
+ else
+   {
+ if (TREE_CODE (TREE_TYPE (aref)) == POINTER_TYPE)
+   t = aref;
+ break;
+   }
+   }
+
   first = c_fully_fold (first, false, NULL);
   OMP_CLAUSE_DECL (c) = first;
   if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_HAS_DEVICE_ADDR)
@@ -13836,7 +13860,8 @@ handle_omp_array_sections (tree c, enum 
c_omp_region_type ort)
  break;
}
   tree c2 = build_omp_clause (OMP_CLAUSE_LOCATION (c), OMP_CLAUSE_MAP);
-  if (TREE_CODE (t) == COMPONENT_REF)
+  if (TREE_CODE (t) == COMPONENT_REF || TREE_CODE (t) == ARRAY_REF
+ || TREE_CODE (t) == INDIRECT_REF)
OMP_CLAUSE_SET_MAP_KIND (c2, GOMP_MAP_ATTACH_DETACH);
   else
OMP_CLAUSE_SET_MAP_KIND (c2, GOMP_MAP_FIRSTPRIVATE_POINTER);
diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc
index 0cb17a6a8ab..646f4883d66 100644
--- a/gcc/cp/semantics.cc
+++ b/gcc/cp/semantics.cc
@@ -5497,6 +5497,10 @@ handle_omp_array_sections (tree c, enum 
c_omp_region_type ort)
   if (processing_template_decl && maybe_zero_len)
return false;
 
+  struct dim { tree low_bound, length; };
+  auto_vec dims (num);
+  dims.safe_grow (num);
+
   for (i = num, t = OMP_CLAUSE_DECL (c); i > 0;
   t = TREE_CHAIN (t))
{
@@ -5604,6 +5608,9 @@ handle_omp_array_sections (tree c, enum c_omp_region_type 
ort)
  else
size = size_binop (MULT_EXPR, size, l);
}
+
+ dim d = { low_bound, length };
+ dims[i] = d;
}
   if (!processing_template_decl)
{
@@ -5647,6 +5654,24 @@ handle_omp_array_sections (tree c, enum 
c_omp_region_type ort)
  OMP_CLAUSE_DECL (c) = t;
  return false;
}
+
+ tree aref = t;
+ for (i = 0; i < dims.length (); i++)
+   {
+ if (dims[i].length && integer_onep (dims[i].length))
+ 

Re: [PATCH] [gfortran] Set omp_requires_mask for dynamic_allocators.

2022-02-21 Thread Jakub Jelinek via Gcc-patches
On Mon, Feb 21, 2022 at 02:24:40PM +, Hafiz Abid Qadeer wrote:
> This patch fixes an issue where, although gfortran accepts
> 'requires dynamic_allocators', it does not set the omp_requires_mask
> accordingly.
> 
> gcc/fortran/ChangeLog:
> 
>   * parse.cc (gfc_parse_file): Set OMP_REQUIRES_DYNAMIC_ALLOCATORS
>   bit in omp_requires_mask.
> ---
>  gcc/fortran/parse.cc | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/gcc/fortran/parse.cc b/gcc/fortran/parse.cc
> index db918291b10..821555bd85f 100644
> --- a/gcc/fortran/parse.cc
> +++ b/gcc/fortran/parse.cc
> @@ -6890,6 +6890,9 @@ done:
>break;
>  }
>  
> +  if (omp_requires & OMP_REQ_DYNAMIC_ALLOCATORS)
> +omp_requires_mask
> + = (enum omp_requires) (omp_requires_mask | 
> OMP_REQUIRES_DYNAMIC_ALLOCATORS);
>/* Do the parse tree dump.  */
>gfc_current_ns = flag_dump_fortran_original ? gfc_global_ns_list : NULL;

I see we do that for !$omp requires atomic_default_mem_order(...)
but it doesn't look correct to me.

The thing is, omp_requires_mask was added for C/C++ from the C/C++ notion of
translation units (and a question is how does that cope with C++20 modules),
with the assumption that once certain #pragma omp requires is seen, it
applies for the rest of the translation unit and there are some restrictions
that require it to appear before certain constructs in the source.
But Fortran, I think, doesn't really have a concept of the translation unit;
the OpenMP term 'compilation unit' is, in Fortran, a 'program unit', so each
function/subroutine should have its own.  So, instead of what gfc_parse_file
does currently, where it computes omp_requires as an OR of the requires from
each function/subroutine (I think especially for atomic_default_mem_order
that can do really weird things; nothing requires that e.g. in different
functions those can't be different in Fortran, while in C/C++ it needs to be
the same), we need to make sure that the omp_requires_mask omp-generic.cc
sees or uses is, for Fortran, the value from the current function/subroutine.
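
For comparison, a minimal C/C++ sketch (my example, not from the patch) of
that translation-unit-wide behavior:

#pragma omp requires atomic_default_mem_order(relaxed)

void
f (int *p)
{
  /* No explicit memory-order clause: this atomic takes the relaxed
     default from the requires directive above, which holds for the
     rest of the translation unit.  */
  #pragma omp atomic update
  (*p)++;
}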

For the yet unimplemented requires unified_address etc., the plan was that
we'd emit the requirement e.g. into the offloading data such that we could
tell the runtime library all the requirements together from whole program or
shared library.  In that case using an or from the various
functions/subroutines is desirable, if at least one function requires
unified_address, the runtime should filter out any devices that don't
satisfy it, etc.

Jakub



[PATCH] tree-optimization/104595 - vectorization of COND_EXPR with bool load

2022-02-21 Thread Richard Biener via Gcc-patches
The following fixes an omission in bool pattern detection that
makes it fail when check_bool_pattern fails for COND_EXPR.  That's
not what it should do; instead, it should still pattern-recog
to var != 0 when var is not a mask already, even if no further
adjustments to the def chain are necessary.

Bootstrapped and tested on x86_64-unknown-linux-gnu, queued for stage1.

There's another piece of the PR not yet fixed.

2022-02-21  Richard Biener  

PR tree-optimization/104595
* tree-vect-patterns.c (vect_recog_bool_pattern): For
COND_EXPR do not fail if check_bool_pattern returns false.

* gcc.dg/vect/pr104595.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/pr104595.c | 24 
 gcc/tree-vect-patterns.cc| 16 
 2 files changed, 32 insertions(+), 8 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr104595.c

diff --git a/gcc/testsuite/gcc.dg/vect/pr104595.c 
b/gcc/testsuite/gcc.dg/vect/pr104595.c
new file mode 100644
index 000..bb7d79aa69f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr104595.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_condition } */
+
+#define N 256
+typedef char T;
+extern T a[N];
+extern T b[N];
+extern T c[N];
+extern _Bool pb[N];
+extern char pc[N];
+
+void predicate_by_bool()
+{
+  for (int i = 0; i < N; i++)
+c[i] = pb[i] ? a[i] : b[i];
+}
+
+void predicate_by_char()
+{
+  for (int i = 0; i < N; i++)
+c[i] = pc[i] ? a[i] : b[i];
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" } } */
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 217bdfd7045..8c61eb965a6 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -4450,18 +4450,18 @@ vect_recog_bool_pattern (vec_info *vinfo,
   if (get_vectype_for_scalar_type (vinfo, type) == NULL_TREE)
return NULL;
 
-  if (!check_bool_pattern (var, vinfo, bool_stmts))
+  if (check_bool_pattern (var, vinfo, bool_stmts))
+   var = adjust_bool_stmts (vinfo, bool_stmts, type, stmt_vinfo);
+  else if (integer_type_for_mask (var, vinfo))
return NULL;
 
-  rhs = adjust_bool_stmts (vinfo, bool_stmts, type, stmt_vinfo);
-
   lhs = vect_recog_temp_ssa_var (TREE_TYPE (lhs), NULL);
   pattern_stmt 
- = gimple_build_assign (lhs, COND_EXPR,
-build2 (NE_EXPR, boolean_type_node,
-rhs, build_int_cst (type, 0)),
-gimple_assign_rhs2 (last_stmt),
-gimple_assign_rhs3 (last_stmt));
+   = gimple_build_assign (lhs, COND_EXPR,
+  build2 (NE_EXPR, boolean_type_node,
+  var, build_int_cst (TREE_TYPE (var), 0)),
+  gimple_assign_rhs2 (last_stmt),
+  gimple_assign_rhs3 (last_stmt));
   *type_out = vectype;
   vect_pattern_detected ("vect_recog_bool_pattern", last_stmt);
 
-- 
2.34.1


[PATCH] ranger: Fix up REALPART_EXPR/IMAGPART_EXPR handling [PR104604]

2022-02-21 Thread Jakub Jelinek via Gcc-patches
Hi!

The following testcase is miscompiled since r12-3328.
That change assumed that if rhs1 of a GIMPLE_ASSIGN is COMPLEX_CST, then
that is the value of the lhs of the stmt, but that is not always the case,
only if it is a GIMPLE_SINGLE_RHS stmt.  If it is e.g.
GIMPLE_UNARY_RHS or GIMPLE_BINARY_RHS (the latter happens in the testcase),
then it can be e.g.
__complex__ (3, 0) / var
and the REALPART_EXPR of that isn't 3, but the realpart of the division.
I assume once the ranger can do complex numbers adjust_*part_expr will just
fetch one or the other range from a underlying complex range, but until
then, we should limit this to what r12-3328 meant to do.
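
To illustrate with a hand-written sketch (the GIMPLE in the comments is my
paraphrase of the statement shapes, not an actual dump):

unsigned
quot (_Complex unsigned c)
{
  _Complex unsigned a = 3;      /* a_1 = COMPLEX_CST (GIMPLE_SINGLE_RHS):
                                   rhs1 really is the value of the lhs.  */
  _Complex unsigned b = a / c;  /* b_2 = COMPLEX_CST / c_3(D)
                                   (GIMPLE_BINARY_RHS): rhs1 is only an
                                   operand of the division.  */
  return __real__ b;            /* REALPART_EXPR of the quotient, not 3.  */
}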

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2022-02-21  Jakub Jelinek  

PR tree-optimization/104604
* gimple-range-fold.cc (adjust_imagpart_expr, adjust_realpart_expr):
Only check if gimple_assign_rhs1 is COMPLEX_CST if
gimple_assign_rhs_code is COMPLEX_CST.

* gcc.c-torture/execute/pr104604.c: New test.

--- gcc/gimple-range-fold.cc.jj 2022-02-16 09:19:15.741557607 +0100
+++ gcc/gimple-range-fold.cc2022-02-21 13:50:56.524481955 +0100
@@ -397,7 +397,8 @@ adjust_imagpart_expr (irange &res, const
}
   return;
 }
-  if (is_gimple_assign (def_stmt))
+  if (is_gimple_assign (def_stmt)
+  && gimple_assign_rhs_code (def_stmt) == COMPLEX_CST)
 {
   tree cst = gimple_assign_rhs1 (def_stmt);
   if (TREE_CODE (cst) == COMPLEX_CST)
@@ -422,7 +423,8 @@ adjust_realpart_expr (irange &res, const
   if (!SSA_NAME_DEF_STMT (name))
 return;
 
-  if (is_gimple_assign (def_stmt))
+  if (is_gimple_assign (def_stmt)
+  && gimple_assign_rhs_code (def_stmt) == COMPLEX_CST)
 {
   tree cst = gimple_assign_rhs1 (def_stmt);
   if (TREE_CODE (cst) == COMPLEX_CST)
--- gcc/testsuite/gcc.c-torture/execute/pr104604.c.jj   2022-02-21 
14:02:01.190303071 +0100
+++ gcc/testsuite/gcc.c-torture/execute/pr104604.c  2022-02-21 
14:01:25.941789588 +0100
@@ -0,0 +1,34 @@
+/* PR tree-optimization/104604 */
+
+unsigned char g;
+
+__attribute__((noipa))
+unsigned char
+foo (_Complex unsigned c)
+{
+  unsigned char v = g;
+  _Complex unsigned t = 3;
+  t /= c;
+  return v + t;
+}
+
+__attribute__((noipa))
+unsigned char
+bar (_Complex unsigned c)
+{
+  unsigned char v = g;
+  _Complex unsigned t = 42;
+  t /= c;
+  return v + t;
+}
+
+int
+main ()
+{
+  unsigned char x = foo (7);
+  if (x)
+__builtin_abort ();
+  if (bar (7) != 6)
+__builtin_abort ();
+  return 0;
+}

Jakub



Re: [PATCH] i386: Relax cmpxchg instruction under -mrelax-cmpxchg-loop [PR 103069]

2022-02-21 Thread Uros Bizjak via Gcc-patches
On Mon, Feb 21, 2022 at 10:29 AM Hongyu Wang  wrote:
>
> Hi,
>
> For cmpxchg, it is commonly used in spin loops, and several pieces of user
> code such as pthread directly take cmpxchg as the loop condition, which
> causes huge cache bouncing.
>
> This patch extends previous implementation to relax all cmpxchg
> instruction under -mrelax-cmpxchg-loop with an extra atomic load,
> compare and emulate the failed cmpxchg behavior.
>
> For an original spin loop which looks like
>
> loop: mov     %eax,%r8d
>       or      $1,%r8d
>       lock cmpxchg %r8d,(%rdi)
>       jne     loop
>
> It will now turn into
>
> loop: mov     %eax,%r8d
>       or      $1,%r8d
>       mov     (%rdi),%rsi  <--- load lock value first
>       cmp     %rsi,%rax    <--- compare with expected input
>       jne     .L2          <--- lock ne expected
>       lock cmpxchg %r8d,(%rdi)
>       jne     loop
>   L2: mov     %rsi,%rax    <--- perform the behavior of failed cmpxchg
>       jne     loop
>
> under -mrelax-cmpxchg-loop.
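>
> A C++ sketch (mine, names invented; any std::atomic CAS loop shows the
> same shape) of the kind of loop that compiles to the sequence above:
>
> #include <atomic>
>
> void
> spin_set_bit (std::atomic<unsigned> &lock)
> {
>   unsigned expected = lock.load ();
>   /* The compare-exchange is the loop condition; on failure 'expected'
>      is refreshed from the current value, like the mov at L2 above.  */
>   while (!lock.compare_exchange_weak (expected, expected | 1))
>     ;
> }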
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
>
> OK for master?
>
> gcc/ChangeLog:
>
> PR target/103069
> * config/i386/i386-expand.cc (ix86_expand_atomic_fetch_op_loop):
> Split atomic fetch and loop part.
> (ix86_expand_cmpxchg_loop): New expander for cmpxchg loop.
> * config/i386/i386-protos.h (ix86_expand_cmpxchg_loop): New
> prototype.
> * config/i386/sync.md (atomic_compare_and_swap): Call new
> expander under TARGET_RELAX_CMPXCHG_LOOP.
> (atomic_compare_and_swap): Likewise for doubleword modes.
>
> gcc/testsuite/ChangeLog:
>
> PR target/103069
> * gcc.target/i386/pr103069-2.c: Adjust result check.
> * gcc.target/i386/pr103069-3.c: New test.
> * gcc.target/i386/pr103069-4.c: Likewise.

OK.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386-expand.cc | 153 +++--
>  gcc/config/i386/i386-protos.h  |   2 +
>  gcc/config/i386/sync.md|  65 +
>  gcc/testsuite/gcc.target/i386/pr103069-2.c |   4 +-
>  gcc/testsuite/gcc.target/i386/pr103069-3.c |  24 
>  gcc/testsuite/gcc.target/i386/pr103069-4.c |  43 ++
>  6 files changed, 226 insertions(+), 65 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr103069-3.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr103069-4.c
>
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index ce9607e36de..6cf1a0b9cb6 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -23203,16 +23203,14 @@ void ix86_expand_atomic_fetch_op_loop (rtx target, 
> rtx mem, rtx val,
>enum rtx_code code, bool after,
>bool doubleword)
>  {
> -  rtx old_reg, new_reg, old_mem, success, oldval, new_mem;
> -  rtx_code_label *loop_label, *pause_label, *done_label;
> +  rtx old_reg, new_reg, old_mem, success;
>machine_mode mode = GET_MODE (target);
> +  rtx_code_label *loop_label = NULL;
>
>old_reg = gen_reg_rtx (mode);
>new_reg = old_reg;
> -  loop_label = gen_label_rtx ();
> -  pause_label = gen_label_rtx ();
> -  done_label = gen_label_rtx ();
>old_mem = copy_to_reg (mem);
> +  loop_label = gen_label_rtx ();
>emit_label (loop_label);
>emit_move_insn (old_reg, old_mem);
>
> @@ -23234,50 +23232,125 @@ void ix86_expand_atomic_fetch_op_loop (rtx target, 
> rtx mem, rtx val,
>if (after)
>  emit_move_insn (target, new_reg);
>
> -  /* Load memory again inside loop.  */
> -  new_mem = copy_to_reg (mem);
> -  /* Compare mem value with expected value.  */
> +  success = NULL_RTX;
> +
> +  ix86_expand_cmpxchg_loop (&success, old_mem, mem, old_reg, new_reg,
> +   gen_int_mode (MEMMODEL_SYNC_SEQ_CST,
> + SImode),
> +   doubleword, loop_label);
> +}
> +
> +/* Relax cmpxchg instruction, param loop_label indicates whether
> +   the instruction should be relaxed with a pause loop.  If not,
> +   it will be relaxed to an atomic load + compare, and skip
> +   cmpxchg instruction if mem != exp_input.  */
> +
> +void ix86_expand_cmpxchg_loop (rtx *ptarget_bool, rtx target_val,
> +  rtx mem, rtx exp_input, rtx new_input,
> +  rtx mem_model, bool doubleword,
> +  rtx_code_label *loop_label)
> +{
> +  rtx_code_label *cmp_label = NULL;
> +  rtx_code_label *done_label = NULL;
> +  rtx target_bool = NULL_RTX, new_mem = NULL_RTX;
> +  rtx (*gen) (rtx, rtx, rtx, rtx, rtx) = NULL;
> +  rtx (*gendw) (rtx, rtx, rtx, rtx, rtx, rtx) = NULL;
> +  machine_mode mode = GET_MODE (target_val), hmode = mode;
> +
> +  if (*ptarget_bool == NULL)
> +target_bool = gen_reg_rtx (QImode);
> +  else
> +target_bool = *ptarget_bool;
> +
> +  cmp_label = gen_label_rtx ();
> +  done_label = gen_label_rtx ();
> +
> +  new_mem = gen_reg_rtx (mode);
> +  /* Lo

[PATCH] i386: Fix up copysign/xorsign expansion [PR104612]

2022-02-21 Thread Jakub Jelinek via Gcc-patches
Hi!

We ICE on the following testcase for -m32 since r12-3435, because
operands[2] is (subreg:SF (reg:DI ...) 0) and
lowpart_subreg (V4SFmode, operands[2], SFmode)
returns NULL, and that is what we use in AND etc. insns we emit.

The following patch (non-attached) fixes that by calling force_reg for the
input operands, to make sure they are really REGs and so lowpart_subreg
will succeed on them - even for theoretical MEMs, using REGs seems
desirable; we don't want to read the following memory slots for the
paradoxical subreg.  For the outputs, I thought we'd get better code by
always computing the result into a new pseudo and then moving the lowpart
of that pseudo into dest.

I've bootstrapped/regtested this version on x86_64-linux and i686-linux,
unfortunately it regressed
FAIL: gcc.target/i386/pr89984-2.c scan-assembler-not vmovaps
on which the patch changes:
vandps  .LC0(%rip), %xmm1, %xmm1
-   vxorps  %xmm0, %xmm1, %xmm0
+   vxorps  %xmm0, %xmm1, %xmm1
+   vmovaps %xmm1, %xmm0
ret
The RA sees:
(insn 8 4 9 2 (set (reg:V4SF 85)
(and:V4SF (subreg:V4SF (reg:SF 90) 0)
(mem/u/c:V4SF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0  S16 
A128]))) "pr89984-2.c":7:12 2838 {*andv4sf3}
 (expr_list:REG_DEAD (reg:SF 90)
(nil)))
(insn 9 8 10 2 (set (reg:V4SF 87)
(xor:V4SF (reg:V4SF 85)
(subreg:V4SF (reg:SF 89) 0))) "pr89984-2.c":7:12 2842 {*xorv4sf3}
 (expr_list:REG_DEAD (reg:SF 89)
(expr_list:REG_DEAD (reg:V4SF 85)
(nil
(insn 10 9 14 2 (set (reg:SF 82 [  ])
(subreg:SF (reg:V4SF 87) 0)) "pr89984-2.c":7:12 142 {*movsf_internal}
 (expr_list:REG_DEAD (reg:V4SF 87)
(nil)))
(insn 14 10 15 2 (set (reg/i:SF 20 xmm0)
(reg:SF 82 [  ])) "pr89984-2.c":8:1 142 {*movsf_internal}
 (expr_list:REG_DEAD (reg:SF 82 [  ])
(nil)))
(insn 15 14 0 2 (use (reg/i:SF 20 xmm0)) "pr89984-2.c":8:1 -1
 (nil))
and doesn't know that if it would use xmm0 not just for pseudo 82
but also for pseudo 87, it could create a noop move in insn 10 and
so could avoid an extra register copy and nothing later on is able
to figure that out either.  I don't know how the RA should know
that though.

Anyway, so that we don't regress, I have an alternative patch in attachment,
which will do this stuff (i.e. use fresh vector pseudo as destination and
then move lowpart of that to dest) over what it used before (i.e.
use paradoxical subreg of the dest) only if lowpart_subreg returns NULL.

Ok for trunk if the attached version passes bootstrap/regtest?

2022-02-21  Jakub Jelinek  

PR target/104612
* config/i386/i386-expand.cc (ix86_expand_copysign): Call force_reg
on input operands before calling lowpart_subreg on it.  For output
operand, use a vmode pseudo as destination and then move its lowpart
subreg into operands[0].
(ix86_expand_xorsign): Likewise.

* gcc.dg/pr104612.c: New test.

--- gcc/config/i386/i386-expand.cc.jj   2022-02-09 20:45:03.463499205 +0100
+++ gcc/config/i386/i386-expand.cc  2022-02-21 13:14:31.756657743 +0100
@@ -2153,7 +2153,7 @@ void
 ix86_expand_copysign (rtx operands[])
 {
   machine_mode mode, vmode;
-  rtx dest, op0, op1, mask, op2, op3;
+  rtx dest, vdest, op0, op1, mask, op2, op3;
 
   mode = GET_MODE (operands[0]);
 
@@ -2174,8 +2174,9 @@ ix86_expand_copysign (rtx operands[])
   return;
 }
 
-  dest = lowpart_subreg (vmode, operands[0], mode);
-  op1 = lowpart_subreg (vmode, operands[2], mode);
+  dest = operands[0];
+  vdest = gen_reg_rtx (vmode);
+  op1 = lowpart_subreg (vmode, force_reg (mode, operands[2]), mode);
   mask = ix86_build_signbit_mask (vmode, 0, 0);
 
   if (CONST_DOUBLE_P (operands[1]))
@@ -2184,7 +2185,8 @@ ix86_expand_copysign (rtx operands[])
   /* Optimize for 0, simplify b = copy_signf (0.0f, a) to b = mask & a.  */
   if (op0 == CONST0_RTX (mode))
{
- emit_move_insn (dest, gen_rtx_AND (vmode, mask, op1));
+ emit_move_insn (vdest, gen_rtx_AND (vmode, mask, op1));
+ emit_move_insn (dest, lowpart_subreg (mode, vdest, vmode));
  return;
}
 
@@ -2193,7 +2195,7 @@ ix86_expand_copysign (rtx operands[])
   op0 = force_reg (vmode, op0);
 }
   else
-op0 = lowpart_subreg (vmode, operands[1], mode);
+op0 = lowpart_subreg (vmode, force_reg (mode, operands[1]), mode);
 
   op2 = gen_reg_rtx (vmode);
   op3 = gen_reg_rtx (vmode);
@@ -2201,7 +2203,8 @@ ix86_expand_copysign (rtx operands[])
gen_rtx_NOT (vmode, mask),
op0));
   emit_move_insn (op3, gen_rtx_AND (vmode, mask, op1));
-  emit_move_insn (dest, gen_rtx_IOR (vmode, op2, op3));
+  emit_move_insn (vdest, gen_rtx_IOR (vmode, op2, op3));
+  emit_move_insn (dest, lowpart_subreg (mode, vdest, vmode));
 }
 
 /* Expand an xorsign operation.  */
@@ -2210,7 +2213,7 @@ void
 ix86_expand_xorsign (rtx operands[])
 {
   machine_mode mode, vmod

[PATCH] Fix clang warning in pt.cc

2022-02-21 Thread Martin Liška

Fixes:

gcc/cp/pt.cc:13755:23: warning: suggest braces around initialization of 
subobject [-Wmissing-braces]
  tree_vec_map in = { fn, nullptr };
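
For context, a stand-alone illustration of the warning (invented types that
only mirror tree_vec_map's layout):

struct base { int from; };
struct vec_map { struct base b; void *to; };

vec_map in1 = { 42, nullptr };      // -Wmissing-braces: unbraced subobject
vec_map in2 = { { 42 }, nullptr };  // quiet: subobject braced explicitly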

Patch can bootstrap on x86_64-linux-gnu and survives regression tests.

Ready to be installed?
Thanks,
Martin

gcc/cp/ChangeLog:

* pt.cc (defarg_insts_for): Use braces for subobject.
---
 gcc/cp/pt.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/cp/pt.cc b/gcc/cp/pt.cc
index 16bedbc4bc7..70f02db8757 100644
--- a/gcc/cp/pt.cc
+++ b/gcc/cp/pt.cc
@@ -13752,7 +13752,7 @@ defarg_insts_for (tree fn)
 {
   if (!defarg_inst)
 defarg_inst = hash_table::create_ggc (13);
-  tree_vec_map in = { fn, nullptr };
+  tree_vec_map in = { { fn }, nullptr };
   tree_vec_map **slot
 = defarg_inst->find_slot_with_hash (&in, DECL_UID (fn), INSERT);
   if (!*slot)
--
2.35.1



Re: [PATCH] i386: Fix up copysign/xorsign expansion [PR104612]

2022-02-21 Thread Uros Bizjak via Gcc-patches
On Mon, Feb 21, 2022 at 5:46 PM Jakub Jelinek  wrote:
>
> Hi!
>
> We ICE on the following testcase for -m32 since r12-3435, because
> operands[2] is (subreg:SF (reg:DI ...) 0) and
> lowpart_subreg (V4SFmode, operands[2], SFmode)
> returns NULL, and that is what we use in AND etc. insns we emit.
>
> The following patch (non-attached) fixes that by calling force_reg for the
> input operands, to make sure they are really REGs and so lowpart_subreg
> will succeed on them - even for theoretical MEMs, using REGs seems
> desirable; we don't want to read the following memory slots for the
> paradoxical subreg.  For the outputs, I thought we'd get better code by
> always computing the result into a new pseudo and then moving the lowpart
> of that pseudo into dest.
>
> I've bootstrapped/regtested this version on x86_64-linux and i686-linux,
> unfortunately it regressed
> FAIL: gcc.target/i386/pr89984-2.c scan-assembler-not vmovaps
> on which the patch changes:
> vandps  .LC0(%rip), %xmm1, %xmm1
> -   vxorps  %xmm0, %xmm1, %xmm0
> +   vxorps  %xmm0, %xmm1, %xmm1
> +   vmovaps %xmm1, %xmm0
> ret
> The RA sees:
> (insn 8 4 9 2 (set (reg:V4SF 85)
> (and:V4SF (subreg:V4SF (reg:SF 90) 0)
> (mem/u/c:V4SF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0  S16 
> A128]))) "pr89984-2.c":7:12 2838 {*andv4sf3}
>  (expr_list:REG_DEAD (reg:SF 90)
> (nil)))
> (insn 9 8 10 2 (set (reg:V4SF 87)
> (xor:V4SF (reg:V4SF 85)
> (subreg:V4SF (reg:SF 89) 0))) "pr89984-2.c":7:12 2842 {*xorv4sf3}
>  (expr_list:REG_DEAD (reg:SF 89)
> (expr_list:REG_DEAD (reg:V4SF 85)
> (nil
> (insn 10 9 14 2 (set (reg:SF 82 [  ])
> (subreg:SF (reg:V4SF 87) 0)) "pr89984-2.c":7:12 142 {*movsf_internal}
>  (expr_list:REG_DEAD (reg:V4SF 87)
> (nil)))
> (insn 14 10 15 2 (set (reg/i:SF 20 xmm0)
> (reg:SF 82 [  ])) "pr89984-2.c":8:1 142 {*movsf_internal}
>  (expr_list:REG_DEAD (reg:SF 82 [  ])
> (nil)))
> (insn 15 14 0 2 (use (reg/i:SF 20 xmm0)) "pr89984-2.c":8:1 -1
>  (nil))
> and doesn't know that if it would use xmm0 not just for pseudo 82
> but also for pseudo 87, it could create a noop move in insn 10 and
> so could avoid an extra register copy and nothing later on is able
> to figure that out either.  I don't know how the RA should know
> that though.
>
> Anyway, so that we don't regress, I have an alternative patch in attachment,
> which will do this stuff (i.e. use fresh vector pseudo as destination and
> then move lowpart of that to dest) over what it used before (i.e.
> use paradoxical subreg of the dest) only if lowpart_subreg returns NULL.

I remember the same issue in the past, so it looks like the fresh
pseudo as destination gives RA some more freedom to do its magic. So,
it is better to do:

  emit_move_insn (dest, gen_lowpart (wmode, t3));

then play with subregs on a destination, like:

  dest = lowpart_subreg (vmode, dest, mode);

The attached patch is OK.

Thanks,
Uros.

>
> Ok for trunk if the attached version passes bootstrap/regtest?
>
> 2022-02-21  Jakub Jelinek  
>
> PR target/104612
> * config/i386/i386-expand.cc (ix86_expand_copysign): Call force_reg
> on input operands before calling lowpart_subreg on it.  For output
> operand, use a vmode pseudo as destination and then move its lowpart
> subreg into operands[0].
> (ix86_expand_xorsign): Likewise.
>
> * gcc.dg/pr104612.c: New test.
>
> --- gcc/config/i386/i386-expand.cc.jj   2022-02-09 20:45:03.463499205 +0100
> +++ gcc/config/i386/i386-expand.cc  2022-02-21 13:14:31.756657743 +0100
> @@ -2153,7 +2153,7 @@ void
>  ix86_expand_copysign (rtx operands[])
>  {
>machine_mode mode, vmode;
> -  rtx dest, op0, op1, mask, op2, op3;
> +  rtx dest, vdest, op0, op1, mask, op2, op3;
>
>mode = GET_MODE (operands[0]);
>
> @@ -2174,8 +2174,9 @@ ix86_expand_copysign (rtx operands[])
>return;
>  }
>
> -  dest = lowpart_subreg (vmode, operands[0], mode);
> -  op1 = lowpart_subreg (vmode, operands[2], mode);
> +  dest = operands[0];
> +  vdest = gen_reg_rtx (vmode);
> +  op1 = lowpart_subreg (vmode, force_reg (mode, operands[2]), mode);
>mask = ix86_build_signbit_mask (vmode, 0, 0);
>
>if (CONST_DOUBLE_P (operands[1]))
> @@ -2184,7 +2185,8 @@ ix86_expand_copysign (rtx operands[])
>/* Optimize for 0, simplify b = copy_signf (0.0f, a) to b = mask & a.  
> */
>if (op0 == CONST0_RTX (mode))
> {
> - emit_move_insn (dest, gen_rtx_AND (vmode, mask, op1));
> + emit_move_insn (vdest, gen_rtx_AND (vmode, mask, op1));
> + emit_move_insn (dest, lowpart_subreg (mode, vdest, vmode));
>   return;
> }
>
> @@ -2193,7 +2195,7 @@ ix86_expand_copysign (rtx operands[])
>op0 = force_reg (vmode, op0);
>  }
>else
> -op0 = lowpart_subreg (vmode, operands[1], mode);
> +op0 = lowpart_subreg (vmode, force_reg (mode, operands[

Re: [PATCH] [gfortran] Set omp_requires_mask for dynamic_allocators.

2022-02-21 Thread Tobias Burnus

Hi Jakub, hi Abid, hi all,

On 21.02.22 16:50, Jakub Jelinek via Fortran wrote:

The thing is, omp_requires_mask was added for C/C++ from the C/C++ notion of
translation units (and a question is how does that cope with C++20 modules),
with the assumption that once certain #pragma omp requires is seen, it
applies for the rest of the translation unit and there are some restrictions
that require it to appear before certain constructs in the source.

But Fortran, I think, doesn't really have a concept of the translation unit;
the OpenMP term 'compilation unit' is, in Fortran, a 'program unit', so each
function/subroutine should have its own.


[Nit picking: "A program unit is a main program, an external subprogram,
a module, a submodule, or a block data program unit." Thus,
subroutines/functions which are contained inside a (sub)module or the
main program or another subroutine/function do not form a program
unit by themselves.]

I wonder whether there is a real problem in terms of the ME, but maybe
there is.

For atomic_default_mem_order: That's purely handled by the FEs by
setting the default – and just using it when parsing the 'atomic'
directive, if there is no explicit mem_order.

And for reverse_offload, unified_address or unified_shared_memory,
it has to be always the same in all 'compilation units' (which use
device-related OpenMP things).

I think both are handled correctly by gfortran and the C/C++ FE.
For gfortran, including pulling those requires by use-association
from a module ("unless the directive appears by referencing a module"
implies that this is intended).

The interesting question is about "requires dynamic_allocators".
However, I think one can still stuff it into a TU-wide
omp_requires_mask. While not all TUs or even Fortran program units
have to set it, as soon as it appears in any 'requires', it affects
the available devices.

Thus, I do not see a problem with treating it like, e.g., unified_shared_memory,
except that there should be no error when not set in another program unit
(that uses OpenMP target stuff).


So, instead of what gfc_parse_file
does currently where it computes omp_requires as or of requires from each
function/subroutine (I think especially for atomic_default_mem_order that
can do really weird things, nothing requires that e.g. in different
functions those can't be different in Fortran, while in C/C++ it needs to be
the same), we need to make sure that omp_requires_mask omp-generic.cc sees
or uses is for Fortran the value from the current function/subroutine.


Cf. above - is this really needed? And do you think there is an issue with
the current implementation?


For the yet unimplemented requires unified_address etc., the plan was that
we'd emit the requirement e.g. into the offloading data such that we could
tell the runtime library all the requirements together from whole program or
shared library.  In that case using an or from the various
functions/subroutines is desirable, if at least one function requires
unified_address, the runtime should filter out any devices that don't
satisfy it, etc.


Regarding the implementation, there is a patch for it on OG11,
https://gcc.gnu.org/g:f5bfc65f9a6 which does an initial implementation.
(It prints a diagnostic instead of doing the filtering – but that could
be fixed easily. I think up to and including 5.2, the spec is not that clear.
Post 5.2, the "supported devices" definition makes it clear that the
device list should be filtered.)

Tobias



Re: [PATCH] ranger: Fix up REALPART_EXPR/IMAGPART_EXPR handling [PR104604]

2022-02-21 Thread Richard Biener via Gcc-patches



> Am 21.02.2022 um 17:15 schrieb Jakub Jelinek via Gcc-patches 
> :
> 
> Hi!
> 
> The following testcase is miscompiled since r12-3328.
> That change assumed that if rhs1 of a GIMPLE_ASSIGN is COMPLEX_CST, then
> that is the value of the lhs of the stmt, but that is not always the case,
> only if it is a GIMPLE_SINGLE_RHS stmt.  If it is e.g.
> GIMPLE_UNARY_RHS or GIMPLE_BINARY_RHS (the latter happens in the testcase),
> then it can be e.g.
> __complex__ (3, 0) / var
> and the REALPART_EXPR of that isn't 3, but the realpart of the division.
> I assume once the ranger can do complex numbers adjust_*part_expr will just
> fetch one or the other range from a underlying complex range, but until
> then, we should limit this to what r12-3328 meant to do.
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

Ok

> 2022-02-21  Jakub Jelinek  
> 
>PR tree-optimization/104604
>* gimple-range-fold.cc (adjust_imagpart_expr, adjust_realpart_expr):
>Only check if gimple_assign_rhs1 is COMPLEX_CST if
>gimple_assign_rhs_code is COMPLEX_CST.
> 
>* gcc.c-torture/execute/pr104604.c: New test.
> 
> --- gcc/gimple-range-fold.cc.jj2022-02-16 09:19:15.741557607 +0100
> +++ gcc/gimple-range-fold.cc2022-02-21 13:50:56.524481955 +0100
> @@ -397,7 +397,8 @@ adjust_imagpart_expr (irange &res, const
>}
>   return;
> }
> -  if (is_gimple_assign (def_stmt))
> +  if (is_gimple_assign (def_stmt)
> +  && gimple_assign_rhs_code (def_stmt) == COMPLEX_CST)
> {
>   tree cst = gimple_assign_rhs1 (def_stmt);
>   if (TREE_CODE (cst) == COMPLEX_CST)
> @@ -422,7 +423,8 @@ adjust_realpart_expr (irange &res, const
>   if (!SSA_NAME_DEF_STMT (name))
> return;
> 
> -  if (is_gimple_assign (def_stmt))
> +  if (is_gimple_assign (def_stmt)
> +  && gimple_assign_rhs_code (def_stmt) == COMPLEX_CST)
> {
>   tree cst = gimple_assign_rhs1 (def_stmt);
>   if (TREE_CODE (cst) == COMPLEX_CST)
> --- gcc/testsuite/gcc.c-torture/execute/pr104604.c.jj2022-02-21 
> 14:02:01.190303071 +0100
> +++ gcc/testsuite/gcc.c-torture/execute/pr104604.c2022-02-21 
> 14:01:25.941789588 +0100
> @@ -0,0 +1,34 @@
> +/* PR tree-optimization/104604 */
> +
> +unsigned char g;
> +
> +__attribute__((noipa))
> +unsigned char
> +foo (_Complex unsigned c)
> +{
> +  unsigned char v = g;
> +  _Complex unsigned t = 3;
> +  t /= c;
> +  return v + t;
> +}
> +
> +__attribute__((noipa))
> +unsigned char
> +bar (_Complex unsigned c)
> +{
> +  unsigned char v = g;
> +  _Complex unsigned t = 42;
> +  t /= c;
> +  return v + t;
> +}
> +
> +int
> +main ()
> +{
> +  unsigned char x = foo (7);
> +  if (x)
> +__builtin_abort ();
> +  if (bar (7) != 6)
> +__builtin_abort ();
> +  return 0;
> +}
> 
>Jakub
> 


Re: [gofrontend-dev] Re: [PATCH] libgo: include asm/ptrace.h for pt_regs definition on PowerPC

2022-02-21 Thread Ian Lance Taylor via Gcc-patches
Note for gofrontend-dev: on gcc-patches only Andreas Schwab suggested
using uc_regs instead of regs, which does look correct to me.

Ian

On Mon, Feb 21, 2022 at 8:47 AM Sören Tempel  wrote:
>
> Ping.
>
> Summary: Fix build of libgo on PPC with musl libc and libucontext by
> explicitly including the Linux header defining `struct pt_regs` instead of
> relying on other libc headers to include it implicitly.
>
> See: https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587520.html
>
> If the patch needs to be revised further please let me know. This patch has
> been applied at Alpine Linux downstream (which uses musl libc) for a while, I
> have not tested it on other systems.
>
> Greetings,
> Sören
>
> Sören Tempel  wrote:
> > Both glibc and musl libc declare pt_regs as an incomplete type. This
> > type has to be completed by inclusion of another header. On Linux, the
> > asm/ptrace.h header file provides this type definition. Without
> > including this header file, it is not possible to access the regs member
> > of the mcontext_t struct as done in libgo/runtime/go-signal.c. On glibc,
> > other headers (e.g. sys/user.h) include asm/ptrace.h but on musl
> > asm/ptrace.h is not included by other headers and thus the
> > aforementioned files do not compile without an explicit include of
> > asm/ptrace.h:
> >
> >   libgo/runtime/go-signal.c: In function 'getSiginfo':
> >   libgo/runtime/go-signal.c:227:63: error: invalid use of undefined 
> > type 'struct pt_regs'
> > 227 | ret.sigpc = 
> > ((ucontext_t*)(context))->uc_mcontext.regs->nip;
> > |
> >
> > See also:
> >
> > * 
> > https://git.musl-libc.org/cgit/musl/commit/?id=c2518a8efb6507f1b41c3b12e03b06f8f2317a1f
> > * https://github.com/kaniini/libucontext/issues/36
> >
> > Signed-off-by: Sören Tempel 
> >
> > ChangeLog:
> >
> >   * libgo/runtime/go-signal.c: Include asm/ptrace.h for the
> > definition of pt_regs (used by mcontext_t) on PowerPC.
> > ---
> >  libgo/runtime/go-signal.c | 6 ++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/libgo/runtime/go-signal.c b/libgo/runtime/go-signal.c
> > index d30d1603adc..fc01e04e4a1 100644
> > --- a/libgo/runtime/go-signal.c
> > +++ b/libgo/runtime/go-signal.c
> > @@ -10,6 +10,12 @@
> >  #include 
> >  #include 
> >
> > +// On PowerPC, ucontext.h uses a pt_regs struct as an incomplete
> > +// type. This type must be completed by including asm/ptrace.h.
> > +#ifdef __PPC__
> > +#include <asm/ptrace.h>
> > +#endif
> > +
> >  #include "runtime.h"
> >
> >  #ifndef SA_RESTART
>
> --
> You received this message because you are subscribed to the Google Groups 
> "gofrontend-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to gofrontend-dev+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/gofrontend-dev/2FICMX0ORZ6O1.3DYXRTDHNGXCN%408pit.net.


Re: [PATCH] i386: Fix up copysign/xorsign expansion [PR104612]

2022-02-21 Thread Jakub Jelinek via Gcc-patches
On Mon, Feb 21, 2022 at 06:01:00PM +0100, Uros Bizjak wrote:
> I remember the same issue in the past, so it looks like the fresh
> pseudo as destination gives RA some more freedom to do its magic. So,
> it is better to do:
> 
>   emit_move_insn (dest, gen_lowpart (wmode, t3));
> 
> then play with subregs on a destination, like:
> 
>   dest = lowpart_subreg (vmode, dest, mode);

That is what I assumed too, but unfortunately on the pr89984-2.c
testcase that extra freedom resulted in worse code.
The inlined patch gave it that freedom; the attached patch does not
(unless dest is already SUBREG).
I think it might be useful to open a PR for GCC 13 and change the
attached patch to the inlined one once we can deal with it in the RA.

Jakub



Re: [PATCH] [gfortran] Set omp_requires_mask for dynamic_allocators.

2022-02-21 Thread Jakub Jelinek via Gcc-patches
On Mon, Feb 21, 2022 at 06:02:06PM +0100, Tobias Burnus wrote:
> I wonder whether there is a real problem in terms of the ME, but maybe
> there is.
> 
> For atomic_default_mem_order: That's purely handled by the FEs by
> setting the default – and just using it when parsing the 'atomic'
> directive, if there is no explicit mem_order.

Well, for !$omp requires atomic_default_mem_order(whatever)
vs. !$omp atomic that is handled purely in the FE and I hope we do it right.
Where ME is involved is
!$omp requires atomic_default_mem_order(whatever) vs.
!$omp declare variant ...atomic_default_mem_order(whatever).
That is handled in omp-generic.cc and right now that is done during
gimplification of a function.
My reading of what gfc_parse_file does is that if I have:
subroutine foo
  !$omp requires atomic_default_mem_order(relaxed)
end subroutine foo
subroutine bar
  !$omp requires atomic_default_mem_order(acq_rel)
end subroutine bar
subroutine baz
  interface
subroutine foo
end subroutine
  end interface
  interface
subroutine bar
end subroutine
!$omp declare variant (foo) &
!$omp & match (implementation={atomic_default_mem_order(seq_cst)})
  end interface
  call bar
end subroutine baz
then it will call foo because omp_requires in one function is
OMP_MEMORY_ORDER_RELAXED aka 1 and in another one
OMP_MEMORY_ORDER_ACQ_REL aka 4, and (1 | 4) == OMP_MEMORY_ORDER_SEQ_CST
when no !$omp requires is visible in the baz program unit, and so
it should just call bar.

And similarly with dynamic_allocators, if I have:
subroutine qux
  !$omp requires dynamic_allocators
end subroutine qux
subroutine corge
  interface
subroutine garply
end subroutine
!$omp declare variant (quux) &
!$omp & match (implementation={dynamic_allocators})
  end interface
  call garply
end subroutine corge
I think with the posted patch it would call quux which it should not.

Jakub



Re: [PATCH][_Hashtable] Fix insertion of range of type convertible to value_type PR 56112

2022-02-21 Thread François Dumont via Gcc-patches

Gentle reminder: it is important to have this for gcc 12.

On 15/02/22 10:05, François Dumont wrote:
We have a regression regarding management of types convertible to 
value_type. It is an occurrence of PR 56112 but for the insert method.


    libstdc++: [_Hashtable] Insert range of types convertible to 
value_type PR 56112


    Fix insertion of range of types convertible to value_type.
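
    A stand-alone illustration of the pattern (assumed, this is not the
    actual 56112.cc testcase):

      #include <unordered_map>
      #include <vector>

      struct Convertible
      {
        operator std::pair<const int, int>() const { return { 1, 2 }; }
      };

      int main()
      {
        std::vector<Convertible> v(2);
        std::unordered_map<int, int> m;
        // must compile: each element converts to value_type on insertion
        m.insert(v.begin(), v.end());
      }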

    libstdc++-v3/ChangeLog:

    PR libstdc++/56112
    * include/bits/hashtable.h
    (_Hashtable<>::_M_insert_unique_aux): New.
    (_Hashtable<>::_S_to_value): New.
    (_Hashtable<>::_M_insert(_Arg&&, const _NodeGenerator&, 
true_type)): Use the latter two.
    * testsuite/23_containers/unordered_map/cons/56112.cc: Use 
dg-do compile.
    * testsuite/23_containers/unordered_set/cons/56112.cc: New 
test.


Tested under Linux x86_64.

Ok to commit ?

François





Re: [PATCH v3 07/15] arm: Implement MVE predicates as vectors of booleans

2022-02-21 Thread Richard Sandiford via Gcc-patches
Christophe Lyon  writes:
> diff --git a/gcc/config/arm/arm-builtins.c b/gcc/config/arm/arm-builtins.c
> index 9c645722230..dd537ec1679 100644
> --- a/gcc/config/arm/arm-builtins.c
> +++ b/gcc/config/arm/arm-builtins.c
> @@ -1553,11 +1553,25 @@ arm_init_simd_builtin_types (void)
>tree eltype = arm_simd_types[i].eltype;
>machine_mode mode = arm_simd_types[i].mode;
>  
> -  if (eltype == NULL)
> +  if (eltype == NULL
> +   /* VECTOR_BOOL is not supported unless MVE is activated, this would
> +  make build_truth_vector_type_for_mode crash.  */
> +   && ((GET_MODE_CLASS (mode) != MODE_VECTOR_BOOL)
> +   ||!TARGET_HAVE_MVE))

For the record: this kind of thing wouldn't be OK in aarch64,
since there we should allow a target to be selected later.
But I agree that here it's valid, since TARGET_HAVE_MVE already
decides whether arm_neon.h or arm_mve.h builtins are registered.

Formatting nit though: missing space after “||”.

>   continue;
>if (arm_simd_types[i].itype == NULL)
>   {
> -   tree type = build_vector_type (eltype, GET_MODE_NUNITS (mode));
> +   tree type;
> +   if (GET_MODE_CLASS (mode) == MODE_VECTOR_BOOL)
> + {
> +   /* Handle MVE predicates: they are internally stored as 16 bits,
> +  but are used as vectors of 1, 2 or 4-bit elements.  */
> +   type = build_truth_vector_type_for_mode (GET_MODE_NUNITS (mode), 
> mode);

Formatting nit: line too long.

OK with those changes, thanks.

Richard

> +   eltype = TREE_TYPE (type);
> + }
> +   else
> + type = build_vector_type (eltype, GET_MODE_NUNITS (mode));
> +
> type = build_distinct_type_copy (type);
> SET_TYPE_STRUCTURAL_EQUALITY (type);
>  
> @@ -1695,6 +1709,11 @@ arm_init_builtin (unsigned int fcode, 
> arm_builtin_datum *d,
>if (qualifiers & qualifier_map_mode)
>   op_mode = d->mode;
>  
> +  /* MVE Predicates use HImode as mandated by the ABI: pred16_t is 
> unsigned
> +  short.  */
> +  if (qualifiers & qualifier_predicate)
> + op_mode = HImode;
> +
>/* For pointers, we want a pointer to the basic type
>of the vector.  */
>if (qualifiers & qualifier_pointer && VECTOR_MODE_P (op_mode))
> @@ -2939,6 +2958,11 @@ arm_expand_builtin_args (rtx target, machine_mode 
> map_mode, int fcode,
>   case ARG_BUILTIN_COPY_TO_REG:
> if (POINTER_TYPE_P (TREE_TYPE (arg[argc])))
>   op[argc] = convert_memory_address (Pmode, op[argc]);
> +
> +   /* MVE uses mve_pred16_t (aka HImode) for vectors of predicates.  
> */
> +   if (GET_MODE_CLASS (mode[argc]) == MODE_VECTOR_BOOL)
> + op[argc] = gen_lowpart (mode[argc], op[argc]);
> +
> /*gcc_assert (GET_MODE (op[argc]) == mode[argc]); */
> if (!(*insn_data[icode].operand[opno].predicate)
> (op[argc], mode[argc]))
> @@ -3144,6 +3168,13 @@ constant_arg:
>else
>  emit_insn (insn);
>  
> +  if (GET_MODE_CLASS (tmode) == MODE_VECTOR_BOOL)
> +{
> +  rtx HItarget = gen_reg_rtx (HImode);
> +  emit_move_insn (HItarget, gen_lowpart (HImode, target));
> +  return HItarget;
> +}
> +
>return target;
>  }
>  
> diff --git a/gcc/config/arm/arm-builtins.h b/gcc/config/arm/arm-builtins.h
> index e5130d6d286..a8ef8aef82d 100644
> --- a/gcc/config/arm/arm-builtins.h
> +++ b/gcc/config/arm/arm-builtins.h
> @@ -84,7 +84,9 @@ enum arm_type_qualifiers
>qualifier_lane_pair_index = 0x1000,
>/* Lane indices selected in quadtuplets - must be within range of previous
>   argument = a vector.  */
> -  qualifier_lane_quadtup_index = 0x2000
> +  qualifier_lane_quadtup_index = 0x2000,
> +  /* MVE vector predicates.  */
> +  qualifier_predicate = 0x4000
>  };
>  
>  struct arm_simd_type_info
> diff --git a/gcc/config/arm/arm-modes.def b/gcc/config/arm/arm-modes.def
> index de689c8b45e..9ed0cd042c5 100644
> --- a/gcc/config/arm/arm-modes.def
> +++ b/gcc/config/arm/arm-modes.def
> @@ -84,6 +84,14 @@ VECTOR_MODE (FLOAT, BF, 2);   /* V2BF.  */
>  VECTOR_MODE (FLOAT, BF, 4);   /*  V4BF.  */
>  VECTOR_MODE (FLOAT, BF, 8);   /*  V8BF.  */
>  
> +/* Predicates for MVE.  */
> +BOOL_MODE (B2I, 2, 1);
> +BOOL_MODE (B4I, 4, 1);
> +
> +VECTOR_BOOL_MODE (V16BI, 16, BI, 2);
> +VECTOR_BOOL_MODE (V8BI, 8, B2I, 2);
> +VECTOR_BOOL_MODE (V4BI, 4, B4I, 2);
> +
>  /* Fraction and accumulator vector modes.  */
>  VECTOR_MODES (FRACT, 4);  /* V4QQ  V2HQ */
>  VECTOR_MODES (UFRACT, 4); /* V4UQQ V2UHQ */
> diff --git a/gcc/config/arm/arm-simd-builtin-types.def 
> b/gcc/config/arm/arm-simd-builtin-types.def
> index 6ba6f211531..d1d6416dad1 100644
> --- a/gcc/config/arm/arm-simd-builtin-types.def
> +++ b/gcc/config/arm/arm-simd-builtin-types.def
> @@ -51,3 +51,7 @@
>ENTRY (Bfloat16x2_t, V2BF, none, 32, bfloat16, 20)
>ENTRY (Bfloat16x4_t, V4BF, none, 64, bfloat16, 20)
>ENTRY (Bflo

Re: [PATCH] Implement constant-folding simplifications of reductions.

2022-02-21 Thread Jeff Law via Gcc-patches




On 2/21/2022 3:55 AM, Richard Biener via Gcc-patches wrote:

On Mon, Feb 21, 2022 at 9:31 AM Roger Sayle  wrote:


Hi Marc,
I'm assuming that the use (semantics) of a REDUC_PLUS expr allow the
reduction to be done in any order, for example the testcase requires
-ffast-math to allow the REDUC_PLUS to be introduced in the first place.
This also applies explains why the patch doesn't need to distinguish
negative zeros from positive zeros in ctor_single_nonzero_element
(but it's perhaps something to beware of in uses of VECTOR_CST's
single_nonzero_element).

.IFN_REDUC_PLUS directly maps to the reduc_plus_scal optab
which does not specify an operation order.  There's also
fold_left_plus which does (left-to-right).
fold-left-plus is meant to be a strictly in-order reduction, which 
implies (at least to me) that REDUC_PLUS allows reassociation.
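
For instance (my example, not from the patch), with -ffast-math a loop like

float
sum (const float *a)
{
  float s = 0.0f;
  for (int i = 0; i < 256; i++)
    s += a[i];
  return s;
}

may be vectorized using IFN_REDUC_PLUS, whose lane order is unspecified,
whereas a strict in-order reduction would have to use the fold-left form.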

An argument could be made that constant folding should do what
the CPU will end up doing but we're already not doing that for
double arithmetic and flush-to-zero enabled with -ffast-math and
denormals I think.
Right.  We can also lose inexact state when folding, but I think that's 
largely considered OK as well.


jeff



Re: [PATCH v2] x86: Add TARGET_OMIT_VZEROUPPER_AFTER_AVX_READ_ZERO

2022-02-21 Thread H.J. Lu via Gcc-patches
On Sun, Feb 20, 2022 at 6:01 PM Hongtao Liu  wrote:
>
> On Thu, Feb 17, 2022 at 9:56 PM H.J. Lu  wrote:
> >
> > On Thu, Feb 17, 2022 at 08:51:31AM +0100, Uros Bizjak wrote:
> > > On Thu, Feb 17, 2022 at 6:25 AM Hongtao Liu via Gcc-patches
> > >  wrote:
> > > >
> > > > On Thu, Feb 17, 2022 at 12:26 PM H.J. Lu via Gcc-patches
> > > >  wrote:
> > > > >
> > > > > Reading YMM registers with all zero bits needs VZEROUPPER on Sandy 
> > > > > Bridge,
> > > > > Ivy Bridge, Haswell, Broadwell and Alder Lake to avoid SSE <-> AVX
> > > > > transition penalty.  Add TARGET_READ_ZERO_YMM_ZMM_NEED_VZEROUPPER to
> > > > > generate vzeroupper instruction after loading all-zero YMM/ZMM 
> > > > > registers
> > > > > and enable it by default.
> > > > Shouldn't TARGET_READ_ZERO_YMM_ZMM_NONEED_VZEROUPPER sound a bit 
> > > > smoother?
> > > > Because originally we needed to add vzeroupper to all avx<->sse cases,
> > > > now it's a tune to indicate that we don't need to add it in some
> > >
> > > Perhaps we should go from the other side and use
> > > X86_TUNE_OPTIMIZE_AVX_READ for new processors?
> > >
> >
> > Here is the v2 patch to add TARGET_OMIT_VZEROUPPER_AFTER_AVX_READ_ZERO.
> >
> The patch LGTM in general, but please rebase against
> https://gcc.gnu.org/pipermail/gcc-patches/2022-February/590541.html
> and resend the patch, also wait a couple days in case Uros(and others)
> have any comments.

I am dropping my patch since it causes the compile-time regression.
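
For reference, a minimal sketch of the situation the tuning targets (my
own illustration, not one of the pr101456 testcases):

#include <immintrin.h>

extern void sse_callee (void);  /* compiled without AVX */

void
f (float *p)
{
  __m256 z = _mm256_setzero_ps ();  /* all-zero YMM read */
  _mm256_storeu_ps (p, z);
  sse_callee ();  /* the upper bits are known zero, so the vzeroupper
                     before this call can be omitted with this tuning */
}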

> >
> > H.J.
> > ---
> > Reading YMM registers with all zero bits needs VZEROUPPER on Sandy Bridge,
> > Ivy Bridge, Haswell, Broadwell and Alder Lake to avoid SSE <-> AVX
> > transition penalty.  Add TARGET_OMIT_VZEROUPPER_AFTER_AVX_READ_ZERO to
> > omit vzeroupper instruction after loading all-zero YMM/ZMM registers.
> >
> > gcc/
> >
> > PR target/101456
> > * config/i386/i386.cc (ix86_avx_u128_mode_needed): Omit
> > vzeroupper after reading all-zero YMM/ZMM registers for
> > TARGET_OMIT_VZEROUPPER_AFTER_AVX_READ_ZERO.
> > * config/i386/i386.h (TARGET_OMIT_VZEROUPPER_AFTER_AVX_READ_ZERO):
> > New.
> > * config/i386/x86-tune.def
> > (X86_TUNE_OMIT_VZEROUPPER_AFTER_AVX_READ_ZERO): New.
> >
> > gcc/testsuite/
> >
> > PR target/101456
> > * gcc.target/i386/pr101456-1.c (dg-options): Add
> > -mtune-ctrl=-mtune-ctrl=omit_vzeroupper_after_avx_read_zero.
> > * gcc.target/i386/pr101456-2.c: Likewise.
> > * gcc.target/i386/pr101456-3.c: New test.
> > * gcc.target/i386/pr101456-4.c: Likewise.
> > ---
> >  gcc/config/i386/i386.cc| 51 --
> >  gcc/config/i386/i386.h |  2 +
> >  gcc/config/i386/x86-tune.def   |  5 +++
> >  gcc/testsuite/gcc.target/i386/pr101456-1.c |  2 +-
> >  gcc/testsuite/gcc.target/i386/pr101456-2.c |  2 +-
> >  gcc/testsuite/gcc.target/i386/pr101456-3.c | 33 ++
> >  gcc/testsuite/gcc.target/i386/pr101456-4.c | 33 ++
> >  7 files changed, 103 insertions(+), 25 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101456-3.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101456-4.c
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index cf246e74e57..60c72ceb72d 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -14502,33 +14502,38 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
> >
> >subrtx_iterator::array_type array;
> >
> > -  rtx set = single_set (insn);
> > -  if (set)
> > +  if (TARGET_OMIT_VZEROUPPER_AFTER_AVX_READ_ZERO)
> >  {
> > -  rtx dest = SET_DEST (set);
> > -  rtx src = SET_SRC (set);
> > -  if (ix86_check_avx_upper_register (dest))
> > +  /* Perform this vzeroupper optimization if target doesn't need
> > +vzeroupper after reading all-zero YMM/ZMM registers.  */
> > +  rtx set = single_set (insn);
> > +  if (set)
> > {
> > - /* This is an YMM/ZMM load.  Return AVX_U128_DIRTY if the
> > -source isn't zero.  */
> > - if (standard_sse_constant_p (src, GET_MODE (dest)) != 1)
> > -   return AVX_U128_DIRTY;
> > + rtx dest = SET_DEST (set);
> > + rtx src = SET_SRC (set);
> > + if (ix86_check_avx_upper_register (dest))
> > +   {
> > + /* This is an YMM/ZMM load.  Return AVX_U128_DIRTY if the
> > +source isn't zero.  */
> > + if (standard_sse_constant_p (src, GET_MODE (dest)) != 1)
> > +   return AVX_U128_DIRTY;
> > + else
> > +   return AVX_U128_ANY;
> > +   }
> >   else
> > -   return AVX_U128_ANY;
> > -   }
> > -  else
> > -   {
> > - FOR_EACH_SUBRTX (iter, array, src, NONCONST)
> > -   if (ix86_check_avx_upper_register (*iter))
> > - {
> > -   int status = ix86_avx_u128_mode_source (insn, *iter);
> > -

Re: [PATCH] [gfortran] Set omp_requires_mask for dynamic_allocators.

2022-02-21 Thread Tobias Burnus

Hi Jakub,

On 21.02.22 18:47, Jakub Jelinek wrote:

Where ME is involved is
!$omp requires atomic_default_mem_order(whatever) vs.
!$omp declare variant ...atomic_default_mem_order(whatever).


Oops, missed that case. (Also because there wasn't 'declare variant' when
implementing 'requires' in Fortran.)

Disclaimer to all of the following remarks: I do not understand context
selectors and their fine print. Thus, my comments may be completely off:


subroutine baz
...
   interface
 subroutine bar
 end subroutine
 !$omp declare variant (foo) &
 !$omp & match (implementation={atomic_default_mem_order(seq_cst)})
   end interface
   call bar
end subroutine baz


I concur that in this case, it needs to know the 'atomic_default_mem_order'
of baz. — But that seems to be not implementable using a module as

module m_foo
  !$omp requires atomic_default_mem_order(...)
contains
  subroutine foo
...
  end
end module m_foo
...
subroutine baz
  use m_foo, only: foo
  ...
end

seems to make the 'requires' available - such that it cannot be overridden
via a local 'requires atomic_default_mem_order'. And having a 'use m_bar'
then has conflicting declarations. Similar probably with C++ modules,
unless the 'requires' does not propagate. (Does it?)

I find it odd to have only code which works when not using modules.
(Explicitly using the mem_order on 'omp atomic' still works.)


And for the other 'requires' in context selectors, I do not really understand
how they are supposed to get used, either. If any 'unified_shared_memory' or
'dynamic_allocators' appears (in linked-in code), it is in principle callable –
the run-time library should then remove all devices which do not support it,
possibly only keeping the host device; for USM, it even has to be present in
all compilation units. Thus, just directly calling the
dynamic_allocators/unified_shared_memory code should have the same effect in
the end, shouldn't it?

Tobias

-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955


Re: [PATCH] i386: Fix up copysign/xorsign expansion [PR104612]

2022-02-21 Thread Uros Bizjak via Gcc-patches
On Mon, Feb 21, 2022 at 6:33 PM Jakub Jelinek  wrote:
>
> On Mon, Feb 21, 2022 at 06:01:00PM +0100, Uros Bizjak wrote:
> > I remember the same issue in the past, so it looks like the fresh
> > pseudo as destination gives RA some more freedom to do its magic. So,
> > it is better to do:
> >
> >   emit_move_insn (dest, gen_lowpart (wmode, t3));
> >
> > then play with subregs on a destination, like:
> >
> >   dest = lowpart_subreg (vmode, dest, mode);
>
> That is what I assumed too, but unfortunately on the pr89984-2.c
> testcase that extra freedom resulted in worse code.
> The inlined patch gave it that freedom, the attached patch does not
> (unless dest is already SUBREG).
> I think it might be useful to open a PR for GCC 13 and change the
> attached patch to the inlined one once we can deal with it in the RA.

Agreed.

Uros.


[PATCH] libstdc++: Implement P2415R2 "What is a view?"

2022-02-21 Thread Patrick Palka via Gcc-patches
Tested on x86_64-pc-linux-gnu, does this look OK for trunk?

libstdc++-v3/ChangeLog:

* include/bits/ranges_base.h (__detail::__is_initializer_list):
Define.
(viewable_range): Adjust as per P2415R2.
* include/std/ranges (owning_view): Define as per P2415R2.
(enable_borrowed_range): Likewise.
(views::__detail::__can_subrange): Replace with ...
(views::__detail::__can_owning_view): ... this.
(views::_All::_S_noexcept): Sync with operator().
(views::_All::operator()): Use owning_view instead of subrange
as per P2415R2.
* testsuite/std/ranges/adaptors/all.cc (test06): Adjust now that
views::all uses owning_view instead of subrange.
(test08): New test.
* testsuite/std/ranges/adaptors/lazy_split.cc (test09): Adjust
now that rvalue non-view non-borrowed ranges are viewable.
* testsuite/std/ranges/adaptors/split.cc (test06): Likewise.
---
 libstdc++-v3/include/bits/ranges_base.h   | 16 +++-
 libstdc++-v3/include/std/ranges   | 89 ++-
 .../testsuite/std/ranges/adaptors/all.cc  | 59 
 .../std/ranges/adaptors/lazy_split.cc | 13 ++-
 .../testsuite/std/ranges/adaptors/split.cc| 13 ++-
 5 files changed, 157 insertions(+), 33 deletions(-)

diff --git a/libstdc++-v3/include/bits/ranges_base.h 
b/libstdc++-v3/include/bits/ranges_base.h
index 3c5f4b1790a..38db33fd2ce 100644
--- a/libstdc++-v3/include/bits/ranges_base.h
+++ b/libstdc++-v3/include/bits/ranges_base.h
@@ -634,7 +634,7 @@ namespace ranges
 template<typename _Tp>
   concept __is_derived_from_view_interface
= requires (_Tp __t) { __is_derived_from_view_interface_fn(__t, __t); };
-  }
+  } // namespace __detail
 
   /// [range.view] The ranges::view_base type.
   struct view_base { };
@@ -689,11 +689,23 @@ namespace ranges
 concept common_range
  = range<_Tp> && same_as<iterator_t<_Tp>, sentinel_t<_Tp>>;
 
+  namespace __detail
+  {
+template<typename _Tp>
+  inline constexpr bool __is_initializer_list = false;
+
+template<typename _Tp>
+  inline constexpr bool __is_initializer_list<initializer_list<_Tp>> = true;
+  } // namespace __detail
+
   /// A range which can be safely converted to a view.
   template<typename _Tp>
 concept viewable_range = range<_Tp>
   && ((view<remove_cvref_t<_Tp>> && constructible_from<remove_cvref_t<_Tp>, _Tp>)
- || (!view<remove_cvref_t<_Tp>> && borrowed_range<_Tp>));
+ || (!view<remove_cvref_t<_Tp>>
+ && (is_lvalue_reference_v<_Tp>
+ || (movable<remove_reference_t<_Tp>>
+ && !__detail::__is_initializer_list<remove_cvref_t<_Tp>>))));
 
   // [range.iter.ops] range iterator operations
 
diff --git a/libstdc++-v3/include/std/ranges b/libstdc++-v3/include/std/ranges
index ac85907f129..3e71ecb32b7 100644
--- a/libstdc++-v3/include/std/ranges
+++ b/libstdc++-v3/include/std/ranges
@@ -1144,6 +1144,87 @@ namespace views::__adaptor
   template<typename _It, typename _Sent, subrange_kind _Kind>
 inline constexpr bool enable_borrowed_range<subrange<_It, _Sent, _Kind>> = true;
 
+  template<range _Range>
+requires movable<_Range>
+  && (!__detail::__is_initializer_list<remove_cvref_t<_Range>>)
+class owning_view : public view_interface<owning_view<_Range>>
+{
+private:
+  _Range _M_r = _Range();
+
+public:
+  owning_view() requires default_initializable<_Range> = default;
+
+  constexpr
+  owning_view(_Range&& __t)
+  noexcept(is_nothrow_move_constructible_v<_Range>)
+   : _M_r(std::move(__t))
+  { }
+
+  owning_view(owning_view&&) = default;
+  owning_view& operator=(owning_view&&) = default;
+
+  constexpr _Range&
+  base() & noexcept
+  { return _M_r; }
+
+  constexpr const _Range&
+  base() const& noexcept
+  { return _M_r; }
+
+  constexpr _Range&&
+  base() && noexcept
+  { return std::move(_M_r); }
+
+  constexpr const _Range&&
+  base() const&& noexcept
+  { return std::move(_M_r); }
+
+  constexpr iterator_t<_Range>
+  begin()
+  { return ranges::begin(_M_r); }
+
+  constexpr sentinel_t<_Range>
+  end()
+  { return ranges::end(_M_r); }
+
+  constexpr auto
+  begin() const requires range<const _Range>
+  { return ranges::begin(_M_r); }
+
+  constexpr auto
+  end() const requires range<const _Range>
+  { return ranges::end(_M_r); }
+
+  constexpr bool
+  empty() requires requires { ranges::empty(_M_r); }
+  { return ranges::empty(_M_r); }
+
+  constexpr bool
+  empty() const requires requires { ranges::empty(_M_r); }
+  { return ranges::empty(_M_r); }
+
+  constexpr auto
+  size() requires sized_range<_Range>
+  { return ranges::size(_M_r); }
+
+  constexpr auto
+  size() const requires sized_range<const _Range>
+  { return ranges::size(_M_r); }
+
+  constexpr auto
+  data() requires contiguous_range<_Range>
+  { return ranges::data(_M_r); }
+
+  constexpr auto
+  data() const requires contiguous_range<const _Range>
+  { return ranges::data(_M_r); }
+};
+
+  template<typename _Tp>
+inline constexpr bool enable_borrowed_range<owning_view<_Tp>>
+  = enable_borrowed_range<_Tp>;
+
   namespace views
   {
 namespace __detail
@@ -1152,7
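
For the archives, a quick sketch of the user-visible effect (my own
example, not taken from the testsuite changes above):

#include <ranges>
#include <vector>

void
demo ()
{
  // Before P2415R2 this was ill-formed: an rvalue non-view, non-borrowed
  // range was not a viewable_range.  Now views::all wraps it in owning_view,
  // so the pipeline below owns the temporary vector.
  auto v = std::vector{1, 2, 3, 4} | std::views::take (2);
  // decltype(v) is ranges::take_view<ranges::owning_view<std::vector<int>>>.
}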

Re: [PATCH] [PATCH, v6, 1/1, AARCH64][PR102768] aarch64: Add compiler support for Shadow Call Stack

2022-02-21 Thread Richard Sandiford via Gcc-patches
Dan Li  writes:
> Shadow Call Stack can be used to protect the return address of a
> function at runtime, and clang already supports this feature[1].
>
> To enable SCS in user mode, in addition to compiler, other support
> is also required (as discussed in [2]). This patch only adds basic
> support for SCS from the compiler side, and provides convenience
> for users to enable SCS.
>
> For linux kernel, only the support of the compiler is required.
>
> [1] https://clang.llvm.org/docs/ShadowCallStack.html
> [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102768
>
> Signed-off-by: Dan Li 
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (SLOT_REQUIRED):
>   Change wb_candidate[12] to wb_push_candidate[12].
>   (aarch64_layout_frame): Likewise, and
>   change callee_adjust when scs is enabled.
>   (aarch64_save_callee_saves):
>   Change wb_candidate[12] to wb_push_candidate[12].
>   (aarch64_restore_callee_saves):
>   Change wb_candidate[12] to wb_pop_candidate[12].
>   (aarch64_get_separate_components):
>   Change wb_candidate[12] to wb_push_candidate[12].
>   (aarch64_expand_prologue): Push x30 onto SCS before it's
>   pushed onto stack.
>   (aarch64_expand_epilogue): Pop x30 from SCS, while
>   preventing it from being popped from the regular stack again.
>   (aarch64_override_options_internal): Add SCS compile option check.
>   (TARGET_HAVE_SHADOW_CALL_STACK): New hook.
>   * config/aarch64/aarch64.h (struct GTY): Add is_scs_enabled,
>   wb_pop_candidate[12], and rename wb_candidate[12] to
>   wb_push_candidate[12].
>   * config/aarch64/aarch64.md (scs_push): New template.
>   (scs_pop): Likewise.
>   * doc/invoke.texi: Document -fsanitize=shadow-call-stack.
>   * doc/tm.texi: Regenerate.
>   * doc/tm.texi.in: Add hook have_shadow_call_stack.
>   * flag-types.h (enum sanitize_code):
>   Add SANITIZE_SHADOW_CALL_STACK.
>   * opts.cc (parse_sanitizer_options): Add shadow-call-stack
>   and exclude SANITIZE_SHADOW_CALL_STACK.
>   * target.def: New hook.
>   * toplev.cc (process_options): Add SCS compile option check.
>   * ubsan.cc (ubsan_expand_null_ifn): Enum type conversion.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/shadow_call_stack_1.c: New test.
>   * gcc.target/aarch64/shadow_call_stack_2.c: New test.
>   * gcc.target/aarch64/shadow_call_stack_3.c: New test.
>   * gcc.target/aarch64/shadow_call_stack_4.c: New test.
>   * gcc.target/aarch64/shadow_call_stack_5.c: New test.
>   * gcc.target/aarch64/shadow_call_stack_6.c: New test.
>   * gcc.target/aarch64/shadow_call_stack_7.c: New test.
>   * gcc.target/aarch64/shadow_call_stack_8.c: New test.

Pushed to trunk, thanks.

Richard
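
For readers following along, a tiny sketch of what the option does (my
own illustration; that x18 holds the shadow-call-stack pointer and must
be reserved with -ffixed-x18 follows the clang convention in [1]):

extern void g (void);

void
f (void)  /* built with -ffixed-x18 -fsanitize=shadow-call-stack */
{
  g ();
}

/* Expected prologue/epilogue shape, roughly:
     str  x30, [x18], #8     // scs_push: save return address off-stack
     ...
     ldr  x30, [x18, #-8]!   // scs_pop: reload it from the shadow stack
     ret
*/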

> ---
> V6:
> - Exclude SANITIZE_SHADOW_CALL_STACK.
> - Fix enum type conversion error.
> - Rebase to the mainline (20220216).
>
> V5:
> - Modify part of wb_push_candidates to wb_pop_candidates.
> - Rebase to the mainline (20220210).
>
> V4:
> - Added wb_[push|pop]_candidates[12] to ensure push/pop can
> emit different registers.
>
> V3:
> - Change scs_push/pop to standard move patterns.
> - Optimize scs_pop to avoid pop x30 twice when shadow stack is enabled.
>
>  gcc/config/aarch64/aarch64.cc | 113 +-
>  gcc/config/aarch64/aarch64.h  |  21 +++-
>  gcc/config/aarch64/aarch64.md |  10 ++
>  gcc/doc/invoke.texi   |  30 +
>  gcc/doc/tm.texi   |   5 +
>  gcc/doc/tm.texi.in|   2 +
>  gcc/flag-types.h  |   2 +
>  gcc/opts.cc   |   4 +-
>  gcc/target.def|   8 ++
>  .../gcc.target/aarch64/shadow_call_stack_1.c  |   6 +
>  .../gcc.target/aarch64/shadow_call_stack_2.c  |   6 +
>  .../gcc.target/aarch64/shadow_call_stack_3.c  |  45 +++
>  .../gcc.target/aarch64/shadow_call_stack_4.c  |  20 
>  .../gcc.target/aarch64/shadow_call_stack_5.c  |  18 +++
>  .../gcc.target/aarch64/shadow_call_stack_6.c  |  18 +++
>  .../gcc.target/aarch64/shadow_call_stack_7.c  |  18 +++
>  .../gcc.target/aarch64/shadow_call_stack_8.c  |  24 
>  gcc/toplev.cc |  10 ++
>  gcc/ubsan.cc  |   4 +-
>  19 files changed, 330 insertions(+), 34 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/shadow_call_stack_1.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/shadow_call_stack_2.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/shadow_call_stack_3.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/shadow_call_stack_4.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/shadow_call_stack_5.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/shadow_call_stack_6.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/shadow_call_stack_7.c
> 

Re: [PATCH][_Hashtable] Fix insertion of range of type convertible to value_type PR 56112

2022-02-21 Thread Jonathan Wakely via Gcc-patches
On Mon, 21 Feb 2022 at 18:00, François Dumont via Libstdc++ <
libstd...@gcc.gnu.org> wrote:

> Gentle reminder, it is important to have this for gcc 12.
>

Well, it's been broken since 4.8, so another year wouldn't be the end of
the world ;-)

I did start reviewing it, but I was trying to find a simpler way to solve
it than adding all those overloads. I'll take another look tomorrow and
either approve your patch or suggest something else.
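
For context, the regression under discussion looks roughly like this (my
own reduction, not the PR's exact testcase):

#include <unordered_map>

struct convertible
{
  operator std::pair<const int, int> () const { return {1, 2}; }
};

int
main ()
{
  convertible c[1];
  std::unordered_map<int, int> m1 (c, c + 1);  // PR 56112: constructor case
  std::unordered_map<int, int> m2;
  m2.insert (c, c + 1);  // the range-insert case this patch fixes
}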



>
> On 15/02/22 10:05, François Dumont wrote:
> > We have a regression regarding management of types convertible to
> > value_type. It is an occurrence of PR 56112 but for the insert method.
> >
> > libstdc++: [_Hashtable] Insert range of types convertible to
> > value_type PR 56112
> >
> > Fix insertion of range of types convertible to value_type.
> >
> > libstdc++-v3/ChangeLog:
> >
> > PR libstdc++/56112
> > * include/bits/hashtable.h
> > (_Hashtable<>::_M_insert_unique_aux): New.
> > (_Hashtable<>::_S_to_value): New.
> > (_Hashtable<>::_M_insert(_Arg&&, const _NodeGenerator&,
> > true_type)): Use latters.
> > * testsuite/23_containers/unordered_map/cons/56112.cc: Use
> > dg-do compile.
> > * testsuite/23_containers/unordered_set/cons/56112.cc: New
> > test.
> >
> > Tested under Linux x86_64.
> >
> > Ok to commit ?
> >
> > François
>
>
>


Re: [PATCH] libstdc++: Implement P2415R2 "What is a view?"

2022-02-21 Thread Jonathan Wakely via Gcc-patches
On Mon, 21 Feb 2022 at 19:39, Patrick Palka via Libstdc++ <
libstd...@gcc.gnu.org> wrote:

> Tested on x86_64-pc-linux-gnu, does this look OK for trunk?
>

OK, thanks.



> [snip: full ChangeLog and patch quoted verbatim, identical to the
> posting above]

[PATCH 0/3] rs6000: Move g++.dg powerpc tests to g++.target

2022-02-21 Thread Paul A. Clarke via Gcc-patches
Some tests in g++.dg are target-specific for powerpc. Move those to
g++.target/powerpc. Update the DejaGnu directives as needed, since
the target restriction is perhaps no longer needed when residing in the
target-specific powerpc subdirectory.

Tested with Linux on Power9, full "make check".

OK for trunk?

Paul A. Clarke (3):
  rs6000: Move g++.dg/ext powerpc tests to g++.target
  rs6000: Move g++.dg powerpc PR tests to g++.target
  rs6000: Move more g++.dg powerpc tests to g++.target

 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-1.C  | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-10.C | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-11.C | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-12.C | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-13.C | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-14.C | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-15.C | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-16.C | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-17.C | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-18.C | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-2.C  | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-3.C  | 4 ++--
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-4.C  | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-5.C  | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-6.C  | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-7.C  | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-8.C  | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-9.C  | 2 +-
 .../{g++.dg/ext => g++.target/powerpc}/altivec-cell-1.C   | 2 +-
 .../{g++.dg/ext => g++.target/powerpc}/altivec-cell-2.C   | 4 ++--
 .../{g++.dg/ext => g++.target/powerpc}/altivec-cell-3.C   | 4 ++--
 .../{g++.dg/ext => g++.target/powerpc}/altivec-cell-4.C   | 4 ++--
 .../{g++.dg/ext => g++.target/powerpc}/altivec-cell-5.C   | 2 +-
 .../{g++.dg/ext => g++.target/powerpc}/altivec-types-1.C  | 2 +-
 .../{g++.dg/ext => g++.target/powerpc}/altivec-types-2.C  | 2 +-
 .../{g++.dg/ext => g++.target/powerpc}/altivec-types-3.C  | 2 +-
 .../{g++.dg/ext => g++.target/powerpc}/altivec-types-4.C  | 2 +-
 .../{g++.dg/debug/dwarf2 => g++.target/powerpc}/const2.C  | 0
 .../other => g++.target/powerpc}/darwin-minversion-1.C| 0
 .../{g++.dg/eh => g++.target/powerpc}/ppc64-sighandle-cr.C| 0
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr65240-1.C  | 4 ++--
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr65240-2.C  | 4 ++--
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr65240-3.C  | 4 ++--
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr65240-4.C  | 4 ++--
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr65240.h| 0
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr65242.C| 4 ++--
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr67211.C| 4 ++--
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr69667.C| 4 ++--
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr71294.C| 2 +-
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr84264.C| 2 +-
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr84279.C| 4 ++--
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr85657.C| 2 +-
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr93974.C| 0
 gcc/testsuite/{g++.dg/eh => g++.target/powerpc}/simd-4.C  | 2 +-
 gcc/testsuite/{g++.dg/eh => g++.target/powerpc}/simd-5.C  | 0
 .../{g++.dg/other => g++.target/powerpc}/spu2vmx-1.C  | 2 +-
 gcc/testsuite/{g++.dg/eh => g++.target/powerpc}/uncaught3.C   | 2 +-
 .../{g++.dg/ext => g++.target/powerpc}/undef-bool-1.C | 2 +-
 48 files changed, 54 insertions(+), 54 deletions(-)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-1.C (83%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-10.C (92%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-11.C (80%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-12.C (87%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-13.C (97%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-14.C (86%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-15.C (92%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-16.C (88%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-17.C (91%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-18.C (83%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-2.C (92%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-3.C (96%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-4.C (81%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-5.C (83%)
 rename gcc/testsuite/{g++.d

[PATCH 2/3] rs6000: Move g++.dg powerpc PR tests to g++.target

2022-02-21 Thread Paul A. Clarke via Gcc-patches
Also adjust DejaGnu directives, as specifically requiring "powerpc*-*-*" is no
longer necessary.

2022-02-21  Paul A. Clarke  

gcc/testsuite
* g++.dg/pr65240.h: Move to g++.target/powerpc.
* g++.dg/pr93974.C: Likewise.
* g++.dg/pr65240-1.C: Move to g++.target/powerpc, adjust dg directives.
* g++.dg/pr65240-2.C: Likewise.
* g++.dg/pr65240-3.C: Likewise.
* g++.dg/pr65240-4.C: Likewise.
* g++.dg/pr65242.C: Likewise.
* g++.dg/pr67211.C: Likewise.
* g++.dg/pr69667.C: Likewise.
* g++.dg/pr71294.C: Likewise.
* g++.dg/pr84264.C: Likewise.
* g++.dg/pr84279.C: Likewise.
* g++.dg/pr85657.C: Likewise.
---
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr65240-1.C | 4 ++--
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr65240-2.C | 4 ++--
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr65240-3.C | 4 ++--
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr65240-4.C | 4 ++--
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr65240.h   | 0
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr65242.C   | 4 ++--
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr67211.C   | 4 ++--
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr69667.C   | 4 ++--
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr71294.C   | 2 +-
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr84264.C   | 2 +-
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr84279.C   | 4 ++--
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr85657.C   | 2 +-
 gcc/testsuite/{g++.dg => g++.target/powerpc}/pr93974.C   | 0
 13 files changed, 19 insertions(+), 19 deletions(-)
 rename gcc/testsuite/{g++.dg => g++.target/powerpc}/pr65240-1.C (76%)
 rename gcc/testsuite/{g++.dg => g++.target/powerpc}/pr65240-2.C (76%)
 rename gcc/testsuite/{g++.dg => g++.target/powerpc}/pr65240-3.C (76%)
 rename gcc/testsuite/{g++.dg => g++.target/powerpc}/pr65240-4.C (75%)
 rename gcc/testsuite/{g++.dg => g++.target/powerpc}/pr65240.h (100%)
 rename gcc/testsuite/{g++.dg => g++.target/powerpc}/pr65242.C (94%)
 rename gcc/testsuite/{g++.dg => g++.target/powerpc}/pr67211.C (92%)
 rename gcc/testsuite/{g++.dg => g++.target/powerpc}/pr69667.C (97%)
 rename gcc/testsuite/{g++.dg => g++.target/powerpc}/pr71294.C (96%)
 rename gcc/testsuite/{g++.dg => g++.target/powerpc}/pr84264.C (79%)
 rename gcc/testsuite/{g++.dg => g++.target/powerpc}/pr84279.C (91%)
 rename gcc/testsuite/{g++.dg => g++.target/powerpc}/pr85657.C (90%)
 rename gcc/testsuite/{g++.dg => g++.target/powerpc}/pr93974.C (100%)

diff --git a/gcc/testsuite/g++.dg/pr65240-1.C 
b/gcc/testsuite/g++.target/powerpc/pr65240-1.C
similarity index 76%
rename from gcc/testsuite/g++.dg/pr65240-1.C
rename to gcc/testsuite/g++.target/powerpc/pr65240-1.C
index d2e25b65fcae..d2f4a229773e 100644
--- a/gcc/testsuite/g++.dg/pr65240-1.C
+++ b/gcc/testsuite/g++.target/powerpc/pr65240-1.C
@@ -1,5 +1,5 @@
-/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
-/* { dg-skip-if "" { powerpc*-*-darwin* } } */
+/* { dg-do compile { target lp64 } } */
+/* { dg-skip-if "" { *-*-darwin* } } */
 /* { dg-require-effective-target powerpc_p8vector_ok } */
 /* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { 
"-mcpu=power8" } } */
 /* { dg-options "-mcpu=power8 -O3 -ffast-math -mcmodel=small -mno-fp-in-toc 
-Wno-return-type" } */
diff --git a/gcc/testsuite/g++.dg/pr65240-2.C 
b/gcc/testsuite/g++.target/powerpc/pr65240-2.C
similarity index 76%
rename from gcc/testsuite/g++.dg/pr65240-2.C
rename to gcc/testsuite/g++.target/powerpc/pr65240-2.C
index 38d5020bd198..12e36994d27b 100644
--- a/gcc/testsuite/g++.dg/pr65240-2.C
+++ b/gcc/testsuite/g++.target/powerpc/pr65240-2.C
@@ -1,5 +1,5 @@
-/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
-/* { dg-skip-if "" { powerpc*-*-darwin* } } */
+/* { dg-do compile { target lp64 } } */
+/* { dg-skip-if "" { *-*-darwin* } } */
 /* { dg-require-effective-target powerpc_p8vector_ok } */
 /* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { 
"-mcpu=power8" } } */
 /* { dg-options "-mcpu=power8 -O3 -ffast-math -mcmodel=small -mfp-in-toc 
-Wno-return-type" } */
diff --git a/gcc/testsuite/g++.dg/pr65240-3.C 
b/gcc/testsuite/g++.target/powerpc/pr65240-3.C
similarity index 76%
rename from gcc/testsuite/g++.dg/pr65240-3.C
rename to gcc/testsuite/g++.target/powerpc/pr65240-3.C
index e8463c914946..9ded3e3ab1d3 100644
--- a/gcc/testsuite/g++.dg/pr65240-3.C
+++ b/gcc/testsuite/g++.target/powerpc/pr65240-3.C
@@ -1,5 +1,5 @@
-/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
-/* { dg-skip-if "" { powerpc*-*-darwin* } } */
+/* { dg-do compile { target lp64 } } */
+/* { dg-skip-if "" { *-*-darwin* } } */
 /* { dg-require-effective-target powerpc_p8vector_ok } */
 /* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { 
"-mcpu=power8" } } */
 /* { dg-options "-mcpu=power8 -O3 -ffast-math -mcmodel=medium 
-Wno-return-type" } */
diff --git a/gcc/testsuite/g++.dg/pr65240-4.C 
b/gcc/testsuite/g++.t

[PATCH 1/3] rs6000: Move g++.dg/ext powerpc tests to g++.target

2022-02-21 Thread Paul A. Clarke via Gcc-patches
Also adjust DejaGnu directives, as specifically requiring "powerpc*-*-*" is no
longer necessary.

2022-02-21  Paul A. Clarke  

gcc/testsuite
* g++.dg/ext/altivec-1.C: Move to g++.target/powerpc, adjust dg
directives.
* g++.dg/ext/altivec-2.C: Likewise.
* g++.dg/ext/altivec-3.C: Likewise.
* g++.dg/ext/altivec-4.C: Likewise.
* g++.dg/ext/altivec-5.C: Likewise.
* g++.dg/ext/altivec-6.C: Likewise.
* g++.dg/ext/altivec-7.C: Likewise.
* g++.dg/ext/altivec-8.C: Likewise.
* g++.dg/ext/altivec-9.C: Likewise.
* g++.dg/ext/altivec-10.C: Likewise.
* g++.dg/ext/altivec-11.C: Likewise.
* g++.dg/ext/altivec-12.C: Likewise.
* g++.dg/ext/altivec-13.C: Likewise.
* g++.dg/ext/altivec-14.C: Likewise.
* g++.dg/ext/altivec-15.C: Likewise.
* g++.dg/ext/altivec-16.C: Likewise.
* g++.dg/ext/altivec-17.C: Likewise.
* g++.dg/ext/altivec-18.C: Likewise.
* g++.dg/ext/altivec-cell-1.C: Likewise.
* g++.dg/ext/altivec-cell-2.C: Likewise.
* g++.dg/ext/altivec-cell-3.C: Likewise.
* g++.dg/ext/altivec-cell-4.C: Likewise.
* g++.dg/ext/altivec-cell-5.C: Likewise.
* g++.dg/ext/altivec-types-1.C: Likewise.
* g++.dg/ext/altivec-types-2.C: Likewise.
* g++.dg/ext/altivec-types-3.C: Likewise.
* g++.dg/ext/altivec-types-4.C: Likewise.
* g++.dg/ext/undef-bool-1.C: Likewise.
---
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-1.C  | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-10.C | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-11.C | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-12.C | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-13.C | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-14.C | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-15.C | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-16.C | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-17.C | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-18.C | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-2.C  | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-3.C  | 4 ++--
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-4.C  | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-5.C  | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-6.C  | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-7.C  | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-8.C  | 2 +-
 gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-9.C  | 2 +-
 .../{g++.dg/ext => g++.target/powerpc}/altivec-cell-1.C   | 2 +-
 .../{g++.dg/ext => g++.target/powerpc}/altivec-cell-2.C   | 4 ++--
 .../{g++.dg/ext => g++.target/powerpc}/altivec-cell-3.C   | 4 ++--
 .../{g++.dg/ext => g++.target/powerpc}/altivec-cell-4.C   | 4 ++--
 .../{g++.dg/ext => g++.target/powerpc}/altivec-cell-5.C   | 2 +-
 .../{g++.dg/ext => g++.target/powerpc}/altivec-types-1.C  | 2 +-
 .../{g++.dg/ext => g++.target/powerpc}/altivec-types-2.C  | 2 +-
 .../{g++.dg/ext => g++.target/powerpc}/altivec-types-3.C  | 2 +-
 .../{g++.dg/ext => g++.target/powerpc}/altivec-types-4.C  | 2 +-
 .../{g++.dg/ext => g++.target/powerpc}/undef-bool-1.C | 2 +-
 28 files changed, 32 insertions(+), 32 deletions(-)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-1.C (83%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-10.C (92%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-11.C (80%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-12.C (87%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-13.C (97%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-14.C (86%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-15.C (92%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-16.C (88%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-17.C (91%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-18.C (83%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-2.C (92%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-3.C (96%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-4.C (81%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-5.C (83%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-6.C (94%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-7.C (96%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-8.C (93%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-9.C (86%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/powerpc}/altivec-cell-1.C (96%)
 rename gcc/testsuite/{g++.dg/ext => g++.target/power

Re: libgo patch committed: Update to Go1.18rc1 release

2022-02-21 Thread Ian Lance Taylor via Gcc-patches
On Sun, Feb 20, 2022 at 2:13 PM Rainer Orth  
wrote:
>
> > This patch updates libgo to the Go1.18rc1 release.  Bootstrapped and
> > ran Go testsuite on x86_64-pc-linux-gnu.  Committed to mainline.
>
> this broke Solaris bootstrap:
>
> ld: fatal: file runtime/internal/.libs/syscall.o: open failed: No such file 
> or directory
> collect2: error: ld returned 1 exit status
>
> Creating a dummy syscall_solaris.go worked around that for now.

Sorry about that.  I committed this patch which should fix the problem.

Ian
e185b127566139818309f8b0c89ee0ce3e42a8d1
diff --git a/gcc/go/gofrontend/MERGE b/gcc/go/gofrontend/MERGE
index 3c0380e8285..7455d01c179 100644
--- a/gcc/go/gofrontend/MERGE
+++ b/gcc/go/gofrontend/MERGE
@@ -1,4 +1,4 @@
-20e74f9ef8206fb02fd28ce3d6e0f01f6fb95dc9
+aee8eddbfc3ef1b03353a060e79e7d668fb229e2
 
 The first line of this file holds the git revision number of the last
 merge done from the gofrontend repository.
diff --git a/libgo/go/runtime/internal/syscall/syscall_other.go 
b/libgo/go/runtime/internal/syscall/syscall_other.go
new file mode 100644
index 000..c0945ceec09
--- /dev/null
+++ b/libgo/go/runtime/internal/syscall/syscall_other.go
@@ -0,0 +1,7 @@
+// Copyright 2022 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+//go:build !linux
+
+package syscall


[PATCH 3/3] rs6000: Move more g++.dg powerpc tests to g++.target

2022-02-21 Thread Paul A. Clarke via Gcc-patches
Also adjust DejaGnu directives, as specifically requiring "powerpc*-*-*" is no
longer necessary.

2022-02-21  Paul A. Clarke  

gcc/testsuite
* g++.dg/debug/dwarf2/const2.C: Move to g++.target/powerpc.
* g++.dg/other/darwin-minversion-1.C: Likewise.
* g++.dg/eh/ppc64-sighandle-cr.C: Likewise.
* g++.dg/eh/simd-5.C: Likewise.
* g++.dg/eh/simd-4.C: Move to g++.target/powerpc, adjust dg directives.
* g++.dg/eh/uncaught3.C: Likewise.
* g++.dg/other/spu2vmx-1.C: Likewise.
---
 .../{g++.dg/debug/dwarf2 => g++.target/powerpc}/const2.C| 0
 .../{g++.dg/other => g++.target/powerpc}/darwin-minversion-1.C  | 0
 .../{g++.dg/eh => g++.target/powerpc}/ppc64-sighandle-cr.C  | 0
 gcc/testsuite/{g++.dg/eh => g++.target/powerpc}/simd-4.C| 2 +-
 gcc/testsuite/{g++.dg/eh => g++.target/powerpc}/simd-5.C| 0
 gcc/testsuite/{g++.dg/other => g++.target/powerpc}/spu2vmx-1.C  | 2 +-
 gcc/testsuite/{g++.dg/eh => g++.target/powerpc}/uncaught3.C | 2 +-
 7 files changed, 3 insertions(+), 3 deletions(-)
 rename gcc/testsuite/{g++.dg/debug/dwarf2 => g++.target/powerpc}/const2.C 
(100%)
 rename gcc/testsuite/{g++.dg/other => 
g++.target/powerpc}/darwin-minversion-1.C (100%)
 rename gcc/testsuite/{g++.dg/eh => g++.target/powerpc}/ppc64-sighandle-cr.C 
(100%)
 rename gcc/testsuite/{g++.dg/eh => g++.target/powerpc}/simd-4.C (95%)
 rename gcc/testsuite/{g++.dg/eh => g++.target/powerpc}/simd-5.C (100%)
 rename gcc/testsuite/{g++.dg/other => g++.target/powerpc}/spu2vmx-1.C (84%)
 rename gcc/testsuite/{g++.dg/eh => g++.target/powerpc}/uncaught3.C (96%)

diff --git a/gcc/testsuite/g++.dg/debug/dwarf2/const2.C 
b/gcc/testsuite/g++.target/powerpc/const2.C
similarity index 100%
rename from gcc/testsuite/g++.dg/debug/dwarf2/const2.C
rename to gcc/testsuite/g++.target/powerpc/const2.C
diff --git a/gcc/testsuite/g++.dg/other/darwin-minversion-1.C 
b/gcc/testsuite/g++.target/powerpc/darwin-minversion-1.C
similarity index 100%
rename from gcc/testsuite/g++.dg/other/darwin-minversion-1.C
rename to gcc/testsuite/g++.target/powerpc/darwin-minversion-1.C
diff --git a/gcc/testsuite/g++.dg/eh/ppc64-sighandle-cr.C 
b/gcc/testsuite/g++.target/powerpc/ppc64-sighandle-cr.C
similarity index 100%
rename from gcc/testsuite/g++.dg/eh/ppc64-sighandle-cr.C
rename to gcc/testsuite/g++.target/powerpc/ppc64-sighandle-cr.C
diff --git a/gcc/testsuite/g++.dg/eh/simd-4.C 
b/gcc/testsuite/g++.target/powerpc/simd-4.C
similarity index 95%
rename from gcc/testsuite/g++.dg/eh/simd-4.C
rename to gcc/testsuite/g++.target/powerpc/simd-4.C
index 8c9b58bf8684..a01f19c27369 100644
--- a/gcc/testsuite/g++.dg/eh/simd-4.C
+++ b/gcc/testsuite/g++.target/powerpc/simd-4.C
@@ -1,4 +1,4 @@
-/* { dg-do run { target powerpc*-*-darwin* } } */
+/* { dg-do run { target *-*-darwin* } } */
 /* { dg-options "-fexceptions -fnon-call-exceptions -O -maltivec" } */
 
 #include 
diff --git a/gcc/testsuite/g++.dg/eh/simd-5.C 
b/gcc/testsuite/g++.target/powerpc/simd-5.C
similarity index 100%
rename from gcc/testsuite/g++.dg/eh/simd-5.C
rename to gcc/testsuite/g++.target/powerpc/simd-5.C
diff --git a/gcc/testsuite/g++.dg/other/spu2vmx-1.C 
b/gcc/testsuite/g++.target/powerpc/spu2vmx-1.C
similarity index 84%
rename from gcc/testsuite/g++.dg/other/spu2vmx-1.C
rename to gcc/testsuite/g++.target/powerpc/spu2vmx-1.C
index d9c8faf94592..496b46c22c95 100644
--- a/gcc/testsuite/g++.dg/other/spu2vmx-1.C
+++ b/gcc/testsuite/g++.target/powerpc/spu2vmx-1.C
@@ -1,4 +1,4 @@
-/* { dg-do compile { target powerpc*-*-* } } */
+/* { dg-do compile } */
 /* { dg-require-effective-target powerpc_spu } */
 /* { dg-options "-maltivec" } */
 
diff --git a/gcc/testsuite/g++.dg/eh/uncaught3.C 
b/gcc/testsuite/g++.target/powerpc/uncaught3.C
similarity index 96%
rename from gcc/testsuite/g++.dg/eh/uncaught3.C
rename to gcc/testsuite/g++.target/powerpc/uncaught3.C
index 1beaab3f..f891401584ec 100644
--- a/gcc/testsuite/g++.dg/eh/uncaught3.C
+++ b/gcc/testsuite/g++.target/powerpc/uncaught3.C
@@ -1,4 +1,4 @@
-// { dg-do compile { target powerpc*-*-darwin* } }
+// { dg-do compile { target *-*-darwin* } }
 // { dg-final { scan-assembler-not "__cxa_get_exception" } }
 // { dg-options "-mmacosx-version-min=10.4" }
 // { dg-additional-options "-Wno-deprecated" { target c++17 } }
-- 
2.27.0



Re: [PATCH][_Hashtable] Fix insertion of range of type convertible to value_type PR 56112

2022-02-21 Thread François Dumont via Gcc-patches

On 21/02/22 21:54, Jonathan Wakely wrote:



On Mon, 21 Feb 2022 at 18:00, François Dumont via Libstdc++
<libstd...@gcc.gnu.org> wrote:


Gentle reminder, it is important to have this for gcc 12.


Well, it's been broken since 4.8, so another year wouldn't be the end 
of the world ;-)


Sorry for the pressure; I thought I had broken it with my fix for PR
96088, which moreover was in gcc 11.


So indeed not mandatory for gcc 12 but still nice to fix eventually.

Thanks




I did start reviewing it, but I was trying to find a simpler way to 
solve it than adding all those overloads. I'll take another look 
tomorrow and either approve your patch or suggest something else.




On 15/02/22 10:05, François Dumont wrote:
> We have a regression regarding management of types convertible to
> value_type. It is an occurrence of PR 56112 but for the insert
method.
>
>     libstdc++: [_Hashtable] Insert range of types convertible to
> value_type PR 56112
>
>     Fix insertion of range of types convertible to value_type.
>
>     libstdc++-v3/ChangeLog:
>
>     PR libstdc++/56112
>     * include/bits/hashtable.h
>     (_Hashtable<>::_M_insert_unique_aux): New.
>     (_Hashtable<>::_S_to_value): New.
> (_Hashtable<>::_M_insert(_Arg&&, const _NodeGenerator&,
> true_type)): Use latters.
>     *
testsuite/23_containers/unordered_map/cons/56112.cc: Use
> dg-do compile.
>     *
testsuite/23_containers/unordered_set/cons/56112.cc: New
> test.
>
> Tested under Linux x86_64.
>
> Ok to commit ?
>
> François




[PATCH] PR fortran/104619 - [10/11/12 Regression] ICE on list comprehension with default derived type constructor

2022-02-21 Thread Harald Anlauf via Gcc-patches
Dear Fortranners,

a recently introduced shape validation for an array constructor
against the declared shape of a DT component failed to punt if
the shape of the constructor cannot be determined at compile time.

Suggested solution: skip the shape check in those cases.

Regtested on x86_64-pc-linux-gnu.  OK for mainline / affected branches?

Thanks,
Harald

From 356d1ab6ddb58559d3641f169f44fd24f7cb00d1 Mon Sep 17 00:00:00 2001
From: Harald Anlauf 
Date: Mon, 21 Feb 2022 22:49:05 +0100
Subject: [PATCH] Fortran: skip compile-time shape check if constructor shape
 is not known

gcc/fortran/ChangeLog:

	PR fortran/104619
	* resolve.cc (resolve_structure_cons): Skip shape check if shape
	of constructor cannot be determined at compile time.

gcc/testsuite/ChangeLog:

	PR fortran/104619
	* gfortran.dg/derived_constructor_comps_7.f90: New test.
---
 gcc/fortran/resolve.cc|  2 ++
 .../derived_constructor_comps_7.f90   | 28 +++
 2 files changed, 30 insertions(+)
 create mode 100644 gcc/testsuite/gfortran.dg/derived_constructor_comps_7.f90

diff --git a/gcc/fortran/resolve.cc b/gcc/fortran/resolve.cc
index 266e41e25b1..451bc97df43 100644
--- a/gcc/fortran/resolve.cc
+++ b/gcc/fortran/resolve.cc
@@ -1472,6 +1472,8 @@ resolve_structure_cons (gfc_expr *expr, int init)
 		  t = false;
 		  break;
 		};
+	  if (cons->expr->shape == NULL)
+		continue;
 	  mpz_set_ui (len, 1);
 	  mpz_add (len, len, comp->as->upper[n]->value.integer);
 	  mpz_sub (len, len, comp->as->lower[n]->value.integer);
diff --git a/gcc/testsuite/gfortran.dg/derived_constructor_comps_7.f90 b/gcc/testsuite/gfortran.dg/derived_constructor_comps_7.f90
new file mode 100644
index 000..238ac3d9f26
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/derived_constructor_comps_7.f90
@@ -0,0 +1,28 @@
+! { dg-do run }
+! PR fortran/104619
+
+module m
+  implicit none
+  type :: item
+ real :: x
+  end type item
+  type :: container
+ type(item) :: items(3)
+  end type container
+end module
+
+program p
+  use m
+  implicit none
+  type(item), allocatable :: items(:)
+  type(container) :: c
+  integer :: i, n
+  items = [item(3.0), item(4.0), item(5.0)]
+  c = container(items=[(items(i), i = 1, size(items))])
+  if (any (c%items% x /= items% x)) stop 1
+  n = size (items)
+  c = container(items=[(items(i), i = 1, n)])
+  if (any (c%items% x /= items% x)) stop 2
+  c = container(items=[(items(i), i = 1, 3)])
+  if (any (c%items% x /= items% x)) stop 3
+end program
--
2.34.1



[PATCH v2] x86: Fix -fsplit-stack feature detection via TARGET_CAN_SPLIT_STACK

2022-02-21 Thread soeren--- via Gcc-patches
From: Sören Tempel 

Since commit c163647ffbc9a20c8feb6e079dbecccfe016c82e -fsplit-stack
is only supported on glibc targets. However, this original commit
required some fixups. As part of the fixup, the changes to the
gnu-user-common.h and gnu.h were partially reverted in commit
60953a23d57b13a672f751bec0c6eefc059eb1ab thus causing TARGET_CAN_SPLIT_STACK
to be defined for non-glibc targets even though -fsplit-stack is
actually not supported and attempting to use it causes a runtime error.

This causes gcc internal code, such as ./gcc/go/gospec.c to not
correctly detect that -fsplit-stack is not supported and thus causes
gccgo to fail compilation on non-glibc targets.

This commit ensures that TARGET_CAN_SPLIT_STACK is only set if the
default libc is glibc. It is presently unclear to me if there is a
better way to detect glibc at pre-processor time.

The proposed changes have been tested on x86 and x86_64 Alpine Linux
(which uses musl libc) and fix compilation of gccgo for this target.

Signed-off-by: Sören Tempel 

gcc/ChangeLog:

* config/i386/gnu-user-common.h (defined): Only define
TARGET_CAN_SPLIT_STACK for glibc targets.
* config/i386/gnu.h (defined): Ditto.
---
Changes since v1: Use (DEFAULT_LIBC == LIBC_GLIBC) instead of
OPTION_GLIBC_P to detect use of glibc in a pre-processor context.

Is there a better way to detect use of glibc in the config header?

 gcc/config/i386/gnu-user-common.h | 5 +++--
 gcc/config/i386/gnu.h | 5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/gcc/config/i386/gnu-user-common.h 
b/gcc/config/i386/gnu-user-common.h
index 00226f5a455..e126c3fa9fa 100644
--- a/gcc/config/i386/gnu-user-common.h
+++ b/gcc/config/i386/gnu-user-common.h
@@ -66,7 +66,8 @@ along with GCC; see the file COPYING3.  If not see
 #define STACK_CHECK_STATIC_BUILTIN 1
 
 /* We only build the -fsplit-stack support in libgcc if the
-   assembler has full support for the CFI directives.  */
-#if HAVE_GAS_CFI_PERSONALITY_DIRECTIVE
+   assembler has full support for the CFI directives.  Also
+   we only support -fsplit-stack on glibc targets.  */
+#if HAVE_GAS_CFI_PERSONALITY_DIRECTIVE && (DEFAULT_LIBC == LIBC_GLIBC)
 #define TARGET_CAN_SPLIT_STACK
 #endif
diff --git a/gcc/config/i386/gnu.h b/gcc/config/i386/gnu.h
index 25fbc07f58c..17494333bb9 100644
--- a/gcc/config/i386/gnu.h
+++ b/gcc/config/i386/gnu.h
@@ -41,8 +41,9 @@ along with GCC.  If not, see .
 #define TARGET_THREAD_SSP_OFFSET0x14
 
 /* We only build the -fsplit-stack support in libgcc if the
-   assembler has full support for the CFI directives.  */
-#if HAVE_GAS_CFI_PERSONALITY_DIRECTIVE
+   assembler has full support for the CFI directives.  Also
+   we only support -fsplit-stack on glibc targets.  */
+#if HAVE_GAS_CFI_PERSONALITY_DIRECTIVE && (DEFAULT_LIBC == LIBC_GLIBC)
 #define TARGET_CAN_SPLIT_STACK
 #endif
 /* We steal the last transactional memory word.  */


Re: [PATCH v2, rs6000] Enable absolute jump table for PPC AIX and Linux

2022-02-21 Thread HAO CHEN GUI via Gcc-patches


Kewen,
  Thanks so much for your advice.

On 21/2/2022 下午 5:42, Kewen.Lin wrote:
> Hi Haochen,
> 
> Some minor comments are inlined.
> 
> on 2022/2/16 下午4:42, HAO CHEN GUI via Gcc-patches wrote:
>> Hi,
>>This patch enables absolute jump tables on PPC AIX and Linux. For AIX, 
>> the jump
>> table is placed in data section. For Linux, it is placed in RELRO section 
>> when
>> relocation is needed.
>>
>>Bootstrapped and tested on AIX,Linux BE and LE with no regressions. Is 
>> this okay for trunk?
>> Any recommendations? Thanks a lot.
>>
> 
> I may miss something, but there seem no test cases in power target testsuite 
> to check expected
> assembly for absolute and relative jump table.  If so, maybe it's better to 
> add one or two?
I will add some testcases.
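
Something along these lines, perhaps (a sketch of the shape such a
testcase could take; the scan-assembler pattern is a placeholder I have
not run, not a tested directive):

/* { dg-do compile } */
/* { dg-options "-O2" } */

int
foo (int i)
{
  switch (i)
    {
    case 0: return 11;
    case 1: return 22;
    case 2: return 33;
    case 3: return 44;
    case 4: return 55;
    case 5: return 66;
    case 6: return 77;
    case 7: return 88;
    default: return 0;
    }
}

/* An absolute jump table stores plain addresses rather than label
   differences, so e.g.:
   { dg-final { scan-assembler-not {\.long\t\.L\d+-\.L\d+} } }  */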

> 
>> ChangeLog
>> 2022-02-16 Haochen Gui 
>>
>> gcc/
>>  * config/rs6000/aix.h (JUMP_TABLES_IN_TEXT_SECTION): Define.
>>  * config/rs6000/linux64.h (JUMP_TABLES_IN_TEXT_SECTION): Likewise.
>>  * config/rs6000/rs6000.cc (rs6000_option_override_internal): Enable
>>  absolute jump tables for AIX and Linux.
>>  (rs6000_xcoff_function_rodata_section): Implement.
>>  * config/rs6000/xcoff.h (TARGET_ASM_FUNCTION_RODATA_SECTION): Define.
>>
>> patch.diff
>> diff --git a/gcc/config/rs6000/aix.h b/gcc/config/rs6000/aix.h
>> index ad3238bf09a..b52208c2ee7 100644
>> --- a/gcc/config/rs6000/aix.h
>> +++ b/gcc/config/rs6000/aix.h
>> @@ -253,7 +253,7 @@
>>
>>  /* Indicate that jump tables go in the text section.  */
>>
>> -#define JUMP_TABLES_IN_TEXT_SECTION 1
>> +#define JUMP_TABLES_IN_TEXT_SECTION 0
>>
>>  /* Define any extra SPECS that the compiler needs to generate.  */
>>  #undef  SUBTARGET_EXTRA_SPECS
>> diff --git a/gcc/config/rs6000/linux64.h b/gcc/config/rs6000/linux64.h
>> index b2a7afabc73..16df9ef167f 100644
>> --- a/gcc/config/rs6000/linux64.h
>> +++ b/gcc/config/rs6000/linux64.h
>> @@ -239,7 +239,7 @@ extern int dot_symbols;
>>
>>  /* Indicate that jump tables go in the text section.  */
>>  #undef  JUMP_TABLES_IN_TEXT_SECTION
>> -#define JUMP_TABLES_IN_TEXT_SECTION TARGET_64BIT
>> +#define JUMP_TABLES_IN_TEXT_SECTION 0
>>
>>  /* The linux ppc64 ABI isn't explicit on whether aggregates smaller
>> than a doubleword should be padded upward or downward.  You could
>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>> index bc3ef0721a4..e9c5552c082 100644
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -4954,6 +4954,10 @@ rs6000_option_override_internal (bool global_init_p)
>>  warning (0, "%qs is deprecated and not recommended in any 
>> circumstances",
>>   "-mno-speculate-indirect-jumps");
>>
>> +  /* Enable absolute jump tables for AIX and Linux.  */
>> +  if (DEFAULT_ABI == ABI_AIX || DEFAULT_ABI == ABI_ELFv2)
>> +rs6000_relative_jumptables = 0;
>> +
>>return ret;
>>  }
>>
>> @@ -28751,6 +28755,15 @@ constant_generates_xxspltidp (vec_const_128bit_type 
>> *vsx_const)
>>return sf_value;
>>  }
>>
>> +section * rs6000_xcoff_function_rodata_section (tree decl ATTRIBUTE_UNUSED,
>> +bool relocatable)
>> +{
>> +  if (relocatable)
>> +return data_section;
>> +  else
>> +return readonly_data_section;
>> +}
>> +
>>  
> 
> There is one area to put xcoff related functions and guarded with "#if 
> TARGET_XCOFF",
> it looks good to put this into? and could we make this function static?
From my understanding, the function should only be called by the xcoff
target. Of course, it's safe with the guard. But we couldn't make the
function static as it will be called in other files.

> 
> Besides, it seems good to put some comments for this function to describe why 
> we
> choose to use the data_section.
> 
>>  struct gcc_target targetm = TARGET_INITIALIZER;
>>
>> diff --git a/gcc/config/rs6000/xcoff.h b/gcc/config/rs6000/xcoff.h
>> index cd0f99cb9c6..0dacd86eed9 100644
>> --- a/gcc/config/rs6000/xcoff.h
>> +++ b/gcc/config/rs6000/xcoff.h
>> @@ -98,7 +98,7 @@
>>  #define TARGET_ASM_SELECT_SECTION  rs6000_xcoff_select_section
>>  #define TARGET_ASM_SELECT_RTX_SECTION  rs6000_xcoff_select_rtx_section
>>  #define TARGET_ASM_UNIQUE_SECTION  rs6000_xcoff_unique_section
>> -#define TARGET_ASM_FUNCTION_RODATA_SECTION 
>> default_no_function_rodata_section
>> +#define TARGET_ASM_FUNCTION_RODATA_SECTION 
>> rs6000_xcoff_function_rodata_section
> 
> 
> IIUC, the behavior of rs6000_xcoff_function_rodata_section isn't consistent 
> with what we have
> in the hook description.  If so, we need to update the hook description to 
> align with this change.
> Otherwise, some future codes might expect this hook only return readonly 
> section (not possible
> like data section) and get unexpected results.
For AIX/xcoff, the section for relocatable data has to be the data section,
even though it's not read-only. I think that's consistent with the definition?

DEFHOOK
(function_rodata_section,
 "Return the readonly data or reloc read

Re: [PATCH v2, rs6000] Enable absolute jump table for PPC AIX and Linux

2022-02-21 Thread Kewen.Lin via Gcc-patches
on 2022/2/22 上午9:11, HAO CHEN GUI wrote:
> 
> Kewen,
>   Thanks so much for your advice.
> 
> On 21/2/2022 下午 5:42, Kewen.Lin wrote:
>> Hi Haochen,
>>
>> Some minor comments are inlined.
>>
>> on 2022/2/16 下午4:42, HAO CHEN GUI via Gcc-patches wrote:
>>> Hi,
>>>This patch enables absolute jump tables on PPC AIX and Linux. For AIX, 
>>> the jump
>>> table is placed in data section. For Linux, it is placed in RELRO section 
>>> when
>>> relocation is needed.
>>>
>>>Bootstrapped and tested on AIX,Linux BE and LE with no regressions. Is 
>>> this okay for trunk?
>>> Any recommendations? Thanks a lot.
>>>
>>
>> I may miss something, but there seem no test cases in power target testsuite 
>> to check expected
>> assembly for absolute and relative jump table.  If so, maybe it's better to 
>> add one or two?
> I will add some testcases.
> 

Thanks!

>>
>>> [snip: ChangeLog and patch quoted verbatim, identical to the quote
>>> above]
>>
>> There is one area for xcoff-related functions, guarded with
>> "#if TARGET_XCOFF"; it looks good to put this there? And could we make
>> this function static?
> From my understanding, the function should only be called by the xcoff
> target.  Of course, it's safe with the guard.  But we couldn't make the
> function static as it will be called in other files.
> 

Yeah, that's why I suggested moving it.  Here the function is used for a
hook implementation; just like some other existing hook functions, it
would only be used for function pointer initialization (address taken),
so I think it's fine to make it static.

You can refer to existing examples like rs6000_xcoff_debug_unwind_info,
etc.
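
For illustration, a hedged sketch of what the static, commented version
could look like (illustrative only, not necessarily the committed hunk):

  /* Implement TARGET_ASM_FUNCTION_RODATA_SECTION.
     On AIX/XCOFF a relocatable jump table cannot be placed in a
     read-only section, since the loader must apply run-time
     relocations to it, so fall back to the writable data section.  */

  static section *
  rs6000_xcoff_function_rodata_section (tree decl ATTRIBUTE_UNUSED,
					bool relocatable)
  {
    if (relocatable)
      return data_section;
    else
      return readonly_data_section;
  }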

>>
>> Besides, it seems good to put some comments for this function to describe 
>> why we
>> choose to use the data_section.
>>
>>>  struct gcc_target targetm = TARGET_INITIALIZER;
>>>
>>> diff --git a/gcc/config/rs6000/xcoff.h b/gcc/config/rs6000/xcoff.h
>>> index cd0f99cb9c6..0dacd86eed9 100644
>>> --- a/gcc/config/rs6000/xcoff.h
>>> +++ b/gcc/config/rs6000/xcoff.h
>>> @@ -98,7 +98,7 @@
>>>  #define TARGET_ASM_SELECT_SECTION  rs6000_xcoff_select_section
>>>  #define TARGET_ASM_SELECT_RTX_SECTION  rs6000_xcoff_select_rtx_section
>>>  #define TARGET_ASM_UNIQUE_SECTION  rs6000_xcoff_unique_section
>>> -#define TARGET_ASM_FUNCTION_RODATA_SECTION 
>>> default_no_function_rodata_section
>>> +#define TARGET_ASM_FUNCTION_RODATA_SECTION 
>>> rs6000_xcoff_function_rodata_section
>>
>>
>> IIUC, the behavior of rs6000_xcoff_function_rodata_section isn't consiste

Re: [PATCH v2] x86: Add TARGET_OMIT_VZEROUPPER_AFTER_AVX_READ_ZERO

2022-02-21 Thread Hongtao Liu via Gcc-patches
On Tue, Feb 22, 2022 at 2:35 AM H.J. Lu  wrote:
>
> On Sun, Feb 20, 2022 at 6:01 PM Hongtao Liu  wrote:
> >
> > On Thu, Feb 17, 2022 at 9:56 PM H.J. Lu  wrote:
> > >
> > > On Thu, Feb 17, 2022 at 08:51:31AM +0100, Uros Bizjak wrote:
> > > > On Thu, Feb 17, 2022 at 6:25 AM Hongtao Liu via Gcc-patches
> > > >  wrote:
> > > > >
> > > > > On Thu, Feb 17, 2022 at 12:26 PM H.J. Lu via Gcc-patches
> > > > >  wrote:
> > > > > >
> > > > > > Reading YMM registers with all zero bits needs VZEROUPPER on
> > > > > > Sandy Bridge, Ivy Bridge, Haswell, Broadwell and Alder Lake to
> > > > > > avoid the SSE <-> AVX transition penalty.  Add
> > > > > > TARGET_READ_ZERO_YMM_ZMM_NEED_VZEROUPPER to generate a
> > > > > > vzeroupper instruction after loading all-zero YMM/ZMM registers
> > > > > > and enable it by default.
> > > > > Shouldn't TARGET_READ_ZERO_YMM_ZMM_NONEED_VZEROUPPER sound a bit
> > > > > smoother?  Because originally we needed to add vzeroupper to all
> > > > > avx<->sse cases, now it's a tune to indicate that we don't need
> > > > > to add it in some
> > > >
> > > > Perhaps we should go from the other side and use
> > > > X86_TUNE_OPTIMIZE_AVX_READ for new processors?
> > > >
> > >
> > > Here is the v2 patch to add TARGET_OMIT_VZEROUPPER_AFTER_AVX_READ_ZERO.
> > >
> > The patch LGTM in general, but please rebase against
> > https://gcc.gnu.org/pipermail/gcc-patches/2022-February/590541.html
> > and resend the patch; also wait a couple of days in case Uros (and others)
> > have any comments.
>
> I am dropping my patch since it causes the compile-time regression.
I think only the vextracti128 part is reverted, but we still have
vmovdqu (below) which should also cause a penalty?
> > > + if (ix86_check_avx_upper_register (dest))
> > > +   {
> > > + /* This is an YMM/ZMM load.  Return AVX_U128_DIRTY if the
> > > +source isn't zero.  */
> > > + if (standard_sse_constant_p (src, GET_MODE (dest)) != 1)
> > > +   return AVX_U128_DIRTY;
> > > + else
> > > +   return AVX_U128_ANY;
> > > +   }
>
> > >
> > > H.J.
> > > ---
> > > Reading YMM registers with all zero bits needs VZEROUPPER on Sandy Bridge,
> > > Ivy Bridge, Haswell, Broadwell and Alder Lake to avoid SSE <-> AVX
> > > transition penalty.  Add TARGET_OMIT_VZEROUPPER_AFTER_AVX_READ_ZERO to
> > > omit vzeroupper instruction after loading all-zero YMM/ZMM registers.
> > >
> > > gcc/
> > >
> > > PR target/101456
> > > * config/i386/i386.cc (ix86_avx_u128_mode_needed): Omit
> > > vzeroupper after reading all-zero YMM/ZMM registers for
> > > TARGET_OMIT_VZEROUPPER_AFTER_AVX_READ_ZERO.
> > > * config/i386/i386.h (TARGET_OMIT_VZEROUPPER_AFTER_AVX_READ_ZERO):
> > > New.
> > > * config/i386/x86-tune.def
> > > (X86_TUNE_OMIT_VZEROUPPER_AFTER_AVX_READ_ZERO): New.
> > >
> > > gcc/testsuite/
> > >
> > > PR target/101456
> > > * gcc.target/i386/pr101456-1.c (dg-options): Add
> > > -mtune-ctrl=omit_vzeroupper_after_avx_read_zero.
> > > * gcc.target/i386/pr101456-2.c: Likewise.
> > > * gcc.target/i386/pr101456-3.c: New test.
> > > * gcc.target/i386/pr101456-4.c: Likewise.
> > > ---
> > >  gcc/config/i386/i386.cc| 51 --
> > >  gcc/config/i386/i386.h |  2 +
> > >  gcc/config/i386/x86-tune.def   |  5 +++
> > >  gcc/testsuite/gcc.target/i386/pr101456-1.c |  2 +-
> > >  gcc/testsuite/gcc.target/i386/pr101456-2.c |  2 +-
> > >  gcc/testsuite/gcc.target/i386/pr101456-3.c | 33 ++
> > >  gcc/testsuite/gcc.target/i386/pr101456-4.c | 33 ++
> > >  7 files changed, 103 insertions(+), 25 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101456-3.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr101456-4.c
> > >
> > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > index cf246e74e57..60c72ceb72d 100644
> > > --- a/gcc/config/i386/i386.cc
> > > +++ b/gcc/config/i386/i386.cc
> > > @@ -14502,33 +14502,38 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
> > >
> > >subrtx_iterator::array_type array;
> > >
> > > -  rtx set = single_set (insn);
> > > -  if (set)
> > > +  if (TARGET_OMIT_VZEROUPPER_AFTER_AVX_READ_ZERO)
> > >  {
> > > -  rtx dest = SET_DEST (set);
> > > -  rtx src = SET_SRC (set);
> > > -  if (ix86_check_avx_upper_register (dest))
> > > +  /* Perform this vzeroupper optimization if target doesn't need
> > > +vzeroupper after reading all-zero YMM/ZMM registers.  */
> > > +  rtx set = single_set (insn);
> > > +  if (set)
> > > {
> > > - /* This is an YMM/ZMM load.  Return AVX_U128_DIRTY if the
> > > -source isn't zero.  */
> > > - if (standard_sse_constant_p (src, GET_MODE (dest)) != 1)
> > > -   return AVX_U128_DIRTY;
> > > + rtx dest =

[PATCH] rs6000: Fix some issues related to Power10 fusion [PR104024]

2022-02-21 Thread Kewen.Lin via Gcc-patches
Hi,

As PR104024 shows, currently the option -mpower10-fusion isn't guarded
by -mcpu=power10, so the compiler can optimize some patterns unexpectedly.
As the option is undocumented, this patch just simply unmasks it.
Some define_insns in fusion.md which have constraint v don't have the
correct conditions, which can cause ICEs if the modes are not supported
there.  Besides, it seems better to use BOOL_128 instead of VM since the
patterns are vector logical operations.
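
For illustration, a hedged sketch of the option-guard part (names are
from the rs6000 port; this is a sketch of the idea, not necessarily the
exact hunk in rs6000.cc):

  /* -mpower10-fusion is undocumented; don't let it stay enabled when
     the target is not Power10.  */
  if (TARGET_P10_FUSION && !TARGET_POWER10)
    rs6000_isa_flags &= ~OPTION_MASK_P10_FUSION;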

Bootstrapped and regtested on powerpc64-linux-gnu P8 and
powerpc64le-linux-gnu P9 and P10.

Is it ok for trunk?

BR,
Kewen
-
PR target/104024

gcc/ChangeLog:

* config/rs6000/fusion.md: Regenerate.
* config/rs6000/genfusion.pl: Add the check for define_insns
with constraint v, use BOOL_128 instead of VM.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/pr104024-1.c: New test.
* gcc.target/powerpc/pr104024-2.c: New test.
---
 gcc/config/rs6000/fusion.md   | 770 +-
 gcc/config/rs6000/genfusion.pl|  14 +-
 gcc/config/rs6000/rs6000.cc   |  11 +-
 gcc/testsuite/gcc.target/powerpc/pr104024-1.c |  16 +
 gcc/testsuite/gcc.target/powerpc/pr104024-2.c |  18 +
 5 files changed, 434 insertions(+), 395 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr104024-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr104024-2.c

diff --git a/gcc/config/rs6000/fusion.md b/gcc/config/rs6000/fusion.md
index 15f0c16f705..11cc3cb0e8e 100644
--- a/gcc/config/rs6000/fusion.md
+++ b/gcc/config/rs6000/fusion.md
@@ -1870,12 +1870,12 @@ (define_insn "*fuse_or_rsubf"
 ;; logical-logical fusion pattern generated by gen_logical_addsubf
 ;; vector vand -> vand
 (define_insn "*fuse_vand_vand"
-  [(set (match_operand:VM 3 "altivec_register_operand" "=&0,&1,&v,v")
-(and:VM (and:VM (match_operand:VM 0 "altivec_register_operand" 
"v,v,v,v")
-  (match_operand:VM 1 "altivec_register_operand" 
"%v,v,v,v"))
- (match_operand:VM 2 "altivec_register_operand" "v,v,v,v")))
-   (clobber (match_scratch:VM 4 "=X,X,X,&v"))]
-  "(TARGET_P10_FUSION)"
+  [(set (match_operand:BOOL_128 3 "altivec_register_operand" "=&0,&1,&v,v")
+(and:BOOL_128 (and:BOOL_128 (match_operand:BOOL_128 0 
"altivec_register_operand" "v,v,v,v")
+  (match_operand:BOOL_128 1 "altivec_register_operand" 
"%v,v,v,v"))
+ (match_operand:BOOL_128 2 "altivec_register_operand" 
"v,v,v,v")))
+   (clobber (match_scratch:BOOL_128 4 "=X,X,X,&v"))]
+  "(VECTOR_UNIT_ALTIVEC_OR_VSX_P (mode) && TARGET_P10_FUSION)"
   "@
vand %3,%1,%0\;vand %3,%3,%2
vand %3,%1,%0\;vand %3,%3,%2
@@ -1888,12 +1888,12 @@ (define_insn "*fuse_vand_vand"
 ;; logical-logical fusion pattern generated by gen_logical_addsubf
 ;; vector vandc -> vand
 (define_insn "*fuse_vandc_vand"
-  [(set (match_operand:VM 3 "altivec_register_operand" "=&0,&1,&v,v")
-(and:VM (and:VM (not:VM (match_operand:VM 0 "altivec_register_operand" 
"v,v,v,v"))
-  (match_operand:VM 1 "altivec_register_operand" 
"v,v,v,v"))
- (match_operand:VM 2 "altivec_register_operand" "v,v,v,v")))
-   (clobber (match_scratch:VM 4 "=X,X,X,&v"))]
-  "(TARGET_P10_FUSION)"
+  [(set (match_operand:BOOL_128 3 "altivec_register_operand" "=&0,&1,&v,v")
+(and:BOOL_128 (and:BOOL_128 (not:BOOL_128 (match_operand:BOOL_128 0 
"altivec_register_operand" "v,v,v,v"))
+  (match_operand:BOOL_128 1 "altivec_register_operand" 
"v,v,v,v"))
+ (match_operand:BOOL_128 2 "altivec_register_operand" 
"v,v,v,v")))
+   (clobber (match_scratch:BOOL_128 4 "=X,X,X,&v"))]
+  "(VECTOR_UNIT_ALTIVEC_OR_VSX_P (mode) && TARGET_P10_FUSION)"
   "@
vandc %3,%1,%0\;vand %3,%3,%2
vandc %3,%1,%0\;vand %3,%3,%2
@@ -1906,12 +1906,12 @@ (define_insn "*fuse_vandc_vand"
 ;; logical-logical fusion pattern generated by gen_logical_addsubf
 ;; vector veqv -> vand
 (define_insn "*fuse_veqv_vand"
-  [(set (match_operand:VM 3 "altivec_register_operand" "=&0,&1,&v,v")
-(and:VM (not:VM (xor:VM (match_operand:VM 0 "altivec_register_operand" 
"v,v,v,v")
-  (match_operand:VM 1 "altivec_register_operand" 
"v,v,v,v")))
- (match_operand:VM 2 "altivec_register_operand" "v,v,v,v")))
-   (clobber (match_scratch:VM 4 "=X,X,X,&v"))]
-  "(TARGET_P10_FUSION)"
+  [(set (match_operand:BOOL_128 3 "altivec_register_operand" "=&0,&1,&v,v")
+(and:BOOL_128 (not:BOOL_128 (xor:BOOL_128 (match_operand:BOOL_128 0 
"altivec_register_operand" "v,v,v,v")
+  (match_operand:BOOL_128 1 "altivec_register_operand" 
"v,v,v,v")))
+ (match_operand:BOOL_128 2 "altivec_register_operand" 
"v,v,v,v")))
+   (clobber (match_scratch:BOOL_128 4 "=X,X,X,&v"))]
+  "(VECTOR_UNIT_ALTIVEC_OR_VSX_P (mode) && TARGET_P10_FUSION)"
   "@
veqv %3,%1,%0\;vand %3,%3,%2

Re: [PATCH] i386: Fix up copysign/xorsign expansion [PR104612]

2022-02-21 Thread Hongtao Liu via Gcc-patches
On Tue, Feb 22, 2022 at 12:46 AM Jakub Jelinek  wrote:
>
> Hi!
>
> We ICE on the following testcase for -m32 since r12-3435. because
> operands[2] is (subreg:SF (reg:DI ...) 0) and
According to validate_subreg, (subreg:V4SF (reg:DI ...) 0) should be
valid (but not sure if it really works).

For the -m64 part:
I saw operands[2] is (subreg:SF (reg:SI ..) 0).
It's a bit weird that (subreg:V4SF (reg:SF ..) 0) and (subreg:SF
(reg:SI ..) 0) are legal but (subreg:V4SF (reg:SI ..) 0) is not, and
last time I tried to remove those weird codes in validate_subreg it
caused a lot of regressions [1].
And it seems we need to be aware of all pre-reload paradoxical subregs
like lowpart_subreg (V*{SF,HF}mode, op, {SF,HF}mode).

 940  /* ??? x86 sse code makes heavy use of *paradoxical* vector subregs,
 941 i.e. (subreg:V4SF (reg:SF) 0) or (subreg:V4SF (reg:V2SF) 0).  This
 942 surely isn't the cleanest way to represent this.  It's questionable
 943 if this ought to be represented at all -- why can't this all be hidden
 944 in post-reload splitters that make arbitrarily mode changes to the
 945 registers themselves.  */
 946  else if (VECTOR_MODE_P (omode)
 947   && GET_MODE_INNER (omode) == GET_MODE_INNER (imode))

Also I'm thinking about (too risky at stage4, leave it to GCC13 stage
1) adjusting GET_MODE_INNER (omode) == GET_MODE_INNER (imode) from
requiring the same mode to requiring the same size.
The same size looks a little more logically coherent.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-August/578186.html
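
For illustration, a minimal sketch of that GCC13 idea (untested),
relaxing the vector paradoxical-subreg check in validate_subreg from
"same inner mode" to "same inner size":

  else if (VECTOR_MODE_P (omode)
	   && GET_MODE_UNIT_SIZE (omode) == GET_MODE_UNIT_SIZE (imode))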

> lowpart_subreg (V4SFmode, operands[2], SFmode)
> returns NULL, and that is what we use in AND etc. insns we emit.
>
> The following patch (non-attached) fixes that by calling force_reg for the
> input operands, to make sure they are really REGs and so lowpart_subreg
> will succeed on them - even for theoretical MEMs, using REGs there seems
> desirable; we don't want to read the following memory slots for the
> paradoxical subreg.  For the outputs, I thought we'd get better code by
> always computing the result into a new pseudo and then moving the
> lowpart of that pseudo into dest.
>
> I've bootstrapped/regtested this version on x86_64-linux and i686-linux,
> unfortunately it regressed
> FAIL: gcc.target/i386/pr89984-2.c scan-assembler-not vmovaps
> on which the patch changes:
> vandps  .LC0(%rip), %xmm1, %xmm1
> -   vxorps  %xmm0, %xmm1, %xmm0
> +   vxorps  %xmm0, %xmm1, %xmm1
> +   vmovaps %xmm1, %xmm0
> ret
> The RA sees:
> (insn 8 4 9 2 (set (reg:V4SF 85)
> (and:V4SF (subreg:V4SF (reg:SF 90) 0)
> (mem/u/c:V4SF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0  S16 
> A128]))) "pr89984-2.c":7:12 2838 {*andv4sf3}
>  (expr_list:REG_DEAD (reg:SF 90)
> (nil)))
> (insn 9 8 10 2 (set (reg:V4SF 87)
> (xor:V4SF (reg:V4SF 85)
> (subreg:V4SF (reg:SF 89) 0))) "pr89984-2.c":7:12 2842 {*xorv4sf3}
>  (expr_list:REG_DEAD (reg:SF 89)
> (expr_list:REG_DEAD (reg:V4SF 85)
> (nil
> (insn 10 9 14 2 (set (reg:SF 82 [  ])
> (subreg:SF (reg:V4SF 87) 0)) "pr89984-2.c":7:12 142 {*movsf_internal}
>  (expr_list:REG_DEAD (reg:V4SF 87)
> (nil)))
> (insn 14 10 15 2 (set (reg/i:SF 20 xmm0)
> (reg:SF 82 [  ])) "pr89984-2.c":8:1 142 {*movsf_internal}
>  (expr_list:REG_DEAD (reg:SF 82 [  ])
> (nil)))
> (insn 15 14 0 2 (use (reg/i:SF 20 xmm0)) "pr89984-2.c":8:1 -1
>  (nil))
> and doesn't know that if it would use xmm0 not just for pseudo 82
> but also for pseudo 87, it could create a noop move in insn 10 and
> so could avoid an extra register copy and nothing later on is able
> to figure that out either.  I don't know how the RA should know
> that though.
>
> Anyway, so that we don't regress, I have an alternative patch in attachment,
> which will do this stuff (i.e. use fresh vector pseudo as destination and
> then move lowpart of that to dest) over what it used before (i.e.
> use paradoxical subreg of the dest) only if lowpart_subreg returns NULL.
>
> Ok for trunk if the attached version passes bootstrap/regtest?
>
> 2022-02-21  Jakub Jelinek  
>
> PR target/104612
> * config/i386/i386-expand.cc (ix86_expand_copysign): Call force_reg
> on input operands before calling lowpart_subreg on it.  For output
> operand, use a vmode pseudo as destination and then move its lowpart
> subreg into operands[0].
> (ix86_expand_xorsign): Likewise.
>
> * gcc.dg/pr104612.c: New test.
>
> --- gcc/config/i386/i386-expand.cc.jj   2022-02-09 20:45:03.463499205 +0100
> +++ gcc/config/i386/i386-expand.cc  2022-02-21 13:14:31.756657743 +0100
> @@ -2153,7 +2153,7 @@ void
>  ix86_expand_copysign (rtx operands[])
>  {
>machine_mode mode, vmode;
> -  rtx dest, op0, op1, mask, op2, op3;
> +  rtx dest, vdest, op0, op1, mask, op2, op3;
>
>mode = GET_MODE (operands[0]);
>
> @@ -2174,8 +2174,9 @@ ix86_expand_copysign (rtx operands[])
>return;
>  }
>
> -  dest = lowpart_subreg (vmode, op

[PATCH] tree-optimization: [PR103855] Fold (type)X / (type)Y

2022-02-21 Thread Zhao Wei Liew via Gcc-patches
Hi,

This is a partial optimization for PR103855.

Initially, I looked into optimizing RTL generation or a more complex
GIMPLE transformation so that we could optimize other cases as well,
such as ((unsigned long long) short / int).

However, that is a bit too complex for now. While I continue to look
into that change, I've decided to implement this simpler match.pd
transformation.

Greatly appreciate any feedback on this patch or guidance for
implementing the more advanced optimizations!

Thanks,
Zhao Wei


0001-tree-optimization-PR103855-Fold-type-X-type-Y.patch
Description: Binary data


Re: [PATCH] tree-optimization: [PR103855] Fold (type)X / (type)Y

2022-02-21 Thread Zhao Wei Liew via Gcc-patches
On Tue, 22 Feb 2022 at 11:57, Zhao Wei Liew  wrote:
>
> Hi,
>
> This is a partial optimization for PR103855.
>
> Initially, I looked into optimizing RTL generation or a more complex
> GIMPLE transformation so that we could optimize other cases as well,
> such as ((unsigned long long) short / int).
>
> However, that is a bit too complex for now. While I continue to look
> into that change, I've decided to implement this simpler match.pd
> transformation.
>
> Greatly appreciate any feedback on this patch or guidance for
> implementing the more advanced optimizations!
>
> Thanks,
> Zhao Wei

Sorry, the original patch wasn't recognized as a text file. I've added
a .txt extension to make it explicit.
From dd3bb05cd7be72d080598cb693549ac74d5cb02d Mon Sep 17 00:00:00 2001
From: Zhao Wei Liew 
Date: Sat, 19 Feb 2022 16:28:38 +0800
Subject: [PATCH] tree-optimization: [PR103855] Fold (type)X / (type)Y

This pattern converts (trunc_div (convert a) (convert b)) to
(convert (trunc_div a b)) when:

1. type, a, and b all have unsigned integral types
2. a and b have the same type precision
3. type has precision at least as large as a and b

This is useful as wider divisions are typically more expensive.

To illustrate the effects, consider the following code snippet:

unsigned long long f(unsigned int a, unsigned int b) {
unsigned long long all = a;
return all / b;
}

Without the pattern, g++ -std=c++20 -O2 generates the following
assembly:

f(unsigned int, unsigned int):
mov eax, edi
mov esi, esi
xor edx, edx
div rsi
ret

With the pattern, it generates this:

f(unsigned int, unsigned int):
mov eax, edi
xor edx, edx
div esi
ret

This is identical to what clang++ -std=c++20 -O2 generates.

Bootstrapped and regression tested on x86_64-pc-linux-gnu.

Signed-off-by: Zhao Wei Liew 

PR tree-optimization/103855

gcc/ChangeLog:

* match.pd: Add pattern for (type)X / (type)Y.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/divide-8.c: New test.
* gcc.dg/tree-ssa/divide-9.c: New test.
---
 gcc/match.pd | 15 +++
 gcc/testsuite/gcc.dg/tree-ssa/divide-8.c |  9 +
 gcc/testsuite/gcc.dg/tree-ssa/divide-9.c |  9 +
 3 files changed, 33 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/divide-8.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/divide-9.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 10f62284862..393b43756dd 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -684,6 +684,21 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (if (INTEGRAL_TYPE_P (type) || VECTOR_INTEGER_TYPE_P (type))
   (convert (trunc_mod @0 @1
 
+/* (type)X / (type)Y -> (type)(X / Y)
+   when the resulting type is at least as precise as the original types
+   and when all the types are unsigned integral types. */
+(simplify
+ (trunc_div (convert @0) (convert @1))
+ (if (INTEGRAL_TYPE_P (type)
+  && INTEGRAL_TYPE_P (TREE_TYPE (@0))
+  && INTEGRAL_TYPE_P (TREE_TYPE (@1))
+  && TYPE_UNSIGNED (type)
+  && TYPE_UNSIGNED (TREE_TYPE (@0))
+  && TYPE_UNSIGNED (TREE_TYPE (@1))
+  && TYPE_PRECISION (TREE_TYPE (@0)) == TYPE_PRECISION (TREE_TYPE (@1))
+  && TYPE_PRECISION (type) >= TYPE_PRECISION (TREE_TYPE (@0)))
+  (convert (trunc_div @0 @1
+
 /* x * (1 + y / x) - y -> x - y % x */
 (simplify
  (minus (mult:cs @0 (plus:s (trunc_div:s @1 @0) integer_onep)) @1)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/divide-8.c 
b/gcc/testsuite/gcc.dg/tree-ssa/divide-8.c
new file mode 100644
index 000..dc3dc9ca769
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/divide-8.c
@@ -0,0 +1,9 @@
+/* PR tree-optimization/103855 */
+/* { dg-options "-O -fdump-tree-optimized" } */
+
+unsigned int f(unsigned int a, unsigned int b) {
+unsigned long long all = a;
+return all / b;
+}
+
+/* { dg-final { scan-tree-dump-not "\\\(long long unsigned int\\\)" 
"optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/divide-9.c 
b/gcc/testsuite/gcc.dg/tree-ssa/divide-9.c
new file mode 100644
index 000..6986b5484e4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/divide-9.c
@@ -0,0 +1,9 @@
+/* PR tree-optimization/103855 */
+/* { dg-options "-O -fdump-tree-optimized" } */
+
+unsigned long long f(unsigned int a, unsigned int b) {
+unsigned long long all = a;
+return all / b;
+}
+
+/* { dg-final { scan-tree-dump-times "\\\(long long unsigned int\\\)" 1 
"optimized" } } */
-- 
2.35.1



Re: [PATCH 3/3] target/99881 - x86 vector cost of CTOR from integer regs

2022-02-21 Thread Hongtao Liu via Gcc-patches
On Mon, Feb 21, 2022 at 5:10 PM Richard Biener  wrote:
>
> On Mon, 21 Feb 2022, Hongtao Liu wrote:
>
> > On Fri, Feb 18, 2022 at 10:01 PM Richard Biener via Gcc-patches
> >  wrote:
> > >
> > > This uses the now passed SLP node to the vectorizer costing hook
> > > to adjust vector construction costs for the cost of moving an
> > > integer component from a GPR to a vector register when that's
> > > required for building a vector from components.  A cruical difference
> > > here is whether the component is loaded from memory or extracted
> > > from a vector register as in those cases no intermediate GPR is involved.
> > >
> > > The pr99881.c testcase can be Un-XFAILed with this patch, the
> > > pr91446.c testcase now produces scalar code which looks superior
> > > to me so I've adjusted it as well.
> > >
> > > I'm currently re-bootstrapping and testing on x86_64-unknown-linux-gnu
> > > after adding the BIT_FIELD_REF vector extracting special casing.
> > Does the patch handle PR101929?
>
> The patch will regress the testcase posted in PR101929 again:
>
>  _255 1 times scalar_store costs 12 in body
>  _261 1 times scalar_store costs 12 in body
>  _258 1 times scalar_store costs 12 in body
>  _264 1 times scalar_store costs 12 in body
>  t0_247 + t2_251 1 times scalar_stmt costs 4 in body
>  t1_472 + t3_444 1 times scalar_stmt costs 4 in body
>  t0_406 - t2_451 1 times scalar_stmt costs 4 in body
>  t1_472 - t3_444 1 times scalar_stmt costs 4 in body
> -node 0x4182f48 1 times vec_construct costs 16 in prologue
> -node 0x41882b0 1 times vec_construct costs 16 in prologue
> +node 0x4182f48 1 times vec_construct costs 28 in prologue
> +node 0x41882b0 1 times vec_construct costs 28 in prologue
>  t0_406 + t2_451 1 times vector_stmt costs 4 in body
>  t1_472 - t3_444 1 times vector_stmt costs 4 in body
>  node 0x41829f8 1 times vec_perm costs 4 in body
>  _436 1 times vector_store costs 16 in body
>  t.c:37:9: note: Cost model analysis for part in loop 0:
> -  Vector cost: 60
> +  Vector cost: 84
>Scalar cost: 64
> +t.c:37:9: missed: not vectorized: vectorization is not profitable.
>
> We're constructing V4SI from patterns like { _407, _480, _407, _480 }
> where the components are results of integer adds (so the result is
> definitely in a GPR).  We are costing the construction as
> 4 * sse_op + 2 * sse_to_integer which with skylake cost is
> 4 * COSTS_N_INSNS (1) + 2 * 6.
>
> Whether the vectorization itself is profitable is likely questionable
> but then it's true that the construction of V4SI is more costly
> in terms of uops than a construction of V4SF.
>
> Now, we can - for the first time - now see the actual construction
> pattern and ideal construction might be two GPR->xmm moves
> + two splats + one unpack or maybe two GPR->xmm moves + one
> unpack + splat of DI (or other means of duplicating the lowpart).
Yes, the patch is technically right. I'm ok with the patch.
> It still wouldn't help here of course, and we would not have RTL
> expansion adhere to this scheme.
>
> Trying to capture secondary effects (the FRE opportunity unleashed)
> in costing of this particular SLP subgraph is difficult and probably
> undesirable though.  Adjusting any of the target specific costs
> is likely not going to work because of the original narrow profitability
> gap (60 vs. 64).
>
> For the particular kernel the overall vectorization strategy needs
> to improve (PR98138 is about this I think).  I know we've reverted
> the previous change that attempted to cost for GPR -> XMM.  It did
>
>case vec_construct:
> {
>   /* N element inserts into SSE vectors.  */
> + int cost
> +   = TYPE_VECTOR_SUBPARTS (vectype) * (fp ?
> +   ix86_cost->sse_op
> +   :
> ix86_cost->integer_to_sse);
> +
> - int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
>
> thus treated integer_to_sse as GPR insert into vector cost for
> integer components in general while the now proposed change
> _adds_ integer to XMM move costs (but only once for duplicate
> elements and not for the cases we know are from memory / vector regs).
> With the proposed patch we can probably be more "correct" above
> and only cost TYPE_VECTOR_SUBPARTS (vectype) - 1 inserts given
> for FP the first one is free and for int we're costing it separately.
> Again it won't help here.
>
> Thanks,
> Richard.
>
> > >
> > > I suppose we can let autotesters look for SPEC performance fallout.
> > >
> > > OK if testing succeeds?
> > >
> > > Thanks,
> > > Richard.
> > >
> > > 2022-02-18  Richard Biener  
> > >
> > > PR tree-optimization/104582
> > > PR target/99881
> > > * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
> > > Cost GPR to vector register moves for integer vector construction.
> > >
> > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: New.
> > > * gcc.dg/vect/costmodel

RE: [PATCH] Fix for GCC '-MF' option cannot deal with long paths in Windows

2022-02-21 Thread Sundeep KOKKONDA via Gcc-patches
Hello,

Can you please review the patch and let me know the feedback?



Thanks,
Sundeep K.

-Original Message-
From: sundeep.kokko...@gmail.com  
Sent: Friday, February 4, 2022 9:30 AM
To: 'Andrew Pinski' 
Cc: 'GCC Patches' 
Subject: RE: [PATCH] Fix for GCC '-MF' option cannot deal with long paths in 
Windows

Hello Andrew,

Can you please let me know your feedback on the patch?


Thanks,
Sundeep K.

-Original Message-
From: sundeep.kokko...@gmail.com 
Sent: Wednesday, January 19, 2022 10:40 AM
To: 'Andrew Pinski' 
Cc: 'GCC Patches' 
Subject: RE: [PATCH] Fix for GCC '-MF' option cannot deal with long paths in 
Windows

Hello Andrew,

I updated the patch. Please see attached patch files and let me know your 
feedback.



Thanks,
Sundeep K.

-Original Message-
From: Andrew Pinski 
Sent: Tuesday, January 18, 2022 9:22 AM
To: sundeep.kokko...@gmail.com
Cc: GCC Patches 
Subject: Re: [PATCH] Fix for GCC '-MF' option cannot deal with long paths in 
Windows

On Mon, Jan 17, 2022 at 8:35 AM Sundeep KOKKONDA via Gcc-patches 
 wrote:
>
> Hello,
>
>
>
> The '-MF' option cannot deal with path sizes >256 characters on
> Windows.  This patch fixes this long-path issue with the -MF option.
>
> Let me know, is it ok to commit?

I don't think this is the right place to put this.
Maybe a better way to implement this is to have a wrapper around fopen for 
windows and do it that way.
Does open have the same issue?
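
For illustration, a hedged sketch of such a wrapper (illustrative only,
the name is made up; a real version would first have to normalize the
path to an absolute, backslash-separated form before adding the \\?\
prefix):

#include <stdio.h>
#include <string.h>

FILE *
long_path_fopen (const char *path, const char *mode)
{
#ifdef _WIN32
  /* Win32 rejects paths longer than MAX_PATH (260) unless they are
     given in the extended \\?\ form.  */
  if (strlen (path) >= 260)
    {
      char buf[32768];
      snprintf (buf, sizeof buf, "\\\\?\\%s", path);
      return fopen (buf, mode);
    }
#endif
  return fopen (path, mode);
}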

Thanks,
Andrew Pinski

>
>
>
>
>
> --
>
> Thanks,
>
> Sundeep K.
>




[PATCH] x86: Update Intel architectures ISA support in documentation.

2022-02-21 Thread Cui,Lili via Gcc-patches
Hi Uros,

This patch updates the ISA support of Intel architectures in the
documentation.  Since the ISA extensions listed for the Intel
architectures in the documentation are inconsistent with what the
hardware actually supports, update them all.

OK for master?


gcc/ChangeLog:

	* doc/invoke.texi: Update documentation for Intel architectures.
---
 gcc/doc/invoke.texi | 185 +++-
 1 file changed, 98 insertions(+), 87 deletions(-)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 635c5f79278..60472a21255 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -31086,66 +31086,69 @@ instruction set is used, so the code runs on all i686 
family chips.
 When used with @option{-mtune}, it has the same meaning as @samp{generic}.
 
 @item pentium2
-Intel Pentium II CPU, based on Pentium Pro core with MMX instruction set
-support.
+Intel Pentium II CPU, based on Pentium Pro core with MMX and FXSR instruction
+set support.
 
 @item pentium3
 @itemx pentium3m
-Intel Pentium III CPU, based on Pentium Pro core with MMX and SSE instruction
-set support.
+Intel Pentium III CPU, based on Pentium Pro core with MMX, FXSR and SSE
+instruction set support.
 
 @item pentium-m
 Intel Pentium M; low-power version of Intel Pentium III CPU
-with MMX, SSE and SSE2 instruction set support.  Used by Centrino notebooks.
+with MMX, SSE, SSE2 and FXSR instruction set support.  Used by Centrino
+notebooks.
 
 @item pentium4
 @itemx pentium4m
-Intel Pentium 4 CPU with MMX, SSE and SSE2 instruction set support.
+Intel Pentium 4 CPU with MMX, SSE, SSE2 and FXSR instruction set support.
 
 @item prescott
-Improved version of Intel Pentium 4 CPU with MMX, SSE, SSE2 and SSE3 
instruction
-set support.
+Improved version of Intel Pentium 4 CPU with MMX, SSE, SSE2, SSE3 and FXSR
+instruction set support.
 
 @item nocona
 Improved version of Intel Pentium 4 CPU with 64-bit extensions, MMX, SSE,
-SSE2 and SSE3 instruction set support.
+SSE2, SSE3 and FXSR instruction set support.
 
 @item core2
-Intel Core 2 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3
-instruction set support.
+Intel Core 2 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3, CX16,
+SAHF and FXSR instruction set support.
 
 @item nehalem
 Intel Nehalem CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3,
-SSE4.1, SSE4.2 and POPCNT instruction set support.
+SSE4.1, SSE4.2, POPCNT, CX16, SAHF and FXSR instruction set support.
 
 @item westmere
 Intel Westmere CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3,
-SSE4.1, SSE4.2, POPCNT, AES and PCLMUL instruction set support.
+SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR and PCLMUL instruction set support.
 
 @item sandybridge
 Intel Sandy Bridge CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3,
-SSE4.1, SSE4.2, POPCNT, AVX, AES and PCLMUL instruction set support.
+SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE and PCLMUL instruction set
+support.
 
 @item ivybridge
 Intel Ivy Bridge CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3,
-SSE4.1, SSE4.2, POPCNT, AVX, AES, PCLMUL, FSGSBASE, RDRND and F16C
-instruction set support.
+SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE, PCLMUL, FSGSBASE, RDRND
+and F16C instruction set support.
 
 @item haswell
 Intel Haswell CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3,
-SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA,
-BMI, BMI2 and F16C instruction set support.
+SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE, PCLMUL, FSGSBASE, RDRND,
+F16C, AVX2, BMI, BMI2, LZCNT, FMA, MOVBE and HLE instruction set support.
 
 @item broadwell
 Intel Broadwell CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3,
-SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA, BMI, 
BMI2,
-F16C, RDSEED ADCX and PREFETCHW instruction set support.
+SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE, PCLMUL, FSGSBASE, RDRND,
+F16C, AVX2, BMI, BMI2, LZCNT, FMA, MOVBE, HLE, RDSEED, ADCX and PREFETCHW
+instruction set support.
 
 @item skylake
 Intel Skylake CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3,
-SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA,
-BMI, BMI2, F16C, RDSEED, ADCX, PREFETCHW, CLFLUSHOPT, XSAVEC and XSAVES
-instruction set support.
+SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE, PCLMUL, FSGSBASE, RDRND,
+F16C, AVX2, BMI, BMI2, LZCNT, FMA, MOVBE, HLE, RDSEED, ADCX, PREFETCHW, AES,
+CLFLUSHOPT, XSAVEC, XSAVES and SGX instruction set support.
 
 @item bonnell
 Intel Bonnell CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3 and SSSE3
@@ -31153,113 +31156,121 @@ instruction set support.
 
 @item silvermont
 Intel Silvermont CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, 
SSSE3,
-SSE4.1, SSE4.2, POPCNT, AES, PREFETCHW, PCLMUL and RDRND instruction set 
support.
+SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, PCLMUL, PREFETCHW and RDRND
+instruction set support.
 
 @item goldmont
 Intel Goldmont CPU with 64-bit extensions, MOV

[PATCH] Check if loading const from mem is faster

2022-02-21 Thread Jiufu Guo via Gcc-patches
Hi,

For constants, there is code that checks whether a constant can be put
into an instruction as an immediate operand or whether it is more
profitable to load it from memory.  There are still some places where
this could be improved for particular platforms.

This patch could handle PR63281/57836.  It does not change too much in
code like force_const_mem and legitimate_constant_p.  We may integrate
these APIs for passes like expand/cse/combine as a whole solution in
the future (maybe better for stage1?).

Bootstrapped and regtested on ppc64le and x86_64.  Is this ok for trunk?
Thanks for comments!
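
As a hedged illustration (not the actual PR testcases), the kind of
constant this is about on ppc64:

unsigned long long
f (void)
{
  /* Materializing this 64-bit constant inline takes a five-instruction
     lis/ori/rldicr/oris/ori sequence; a single load from the constant
     pool can be faster on some cores, which is what the new hook is
     meant to decide.  */
  return 0x1234567887654321ULL;
}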


BR,
Jiufu

gcc/ChangeLog:

PR target/94393
PR rtl-optimization/63281
* config/rs6000/rs6000.cc (TARGET_FASTER_LOADING_CONSTANT):
New hook.
(rs6000_faster_loading_const): New hook.
(rs6000_cannot_force_const_mem): Update.
(rs6000_emit_move): Use rs6000_faster_loading_const.
* cse.cc (cse_insn): Use new hook.
* doc/tm.texi.in: New hook TARGET_FASTER_LOADING_CONSTANT.
* doc/tm.texi: Regenerate.
* target.def: New hook faster_loading_constant.
* targhooks.cc (default_faster_loading_constant): New.
* targhooks.h (default_faster_loading_constant): Declare it.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/medium_offset.c: Update.
* gcc.target/powerpc/pr93012.c: Update.
* gcc.target/powerpc/pr63281.c: New test.

---
 gcc/config/rs6000/rs6000.cc   | 38 ++-
 gcc/cse.cc|  5 +++
 gcc/doc/tm.texi   |  5 +++
 gcc/doc/tm.texi.in|  2 +
 gcc/target.def|  8 
 gcc/targhooks.cc  |  6 +++
 .../gcc.target/powerpc/medium_offset.c|  2 +-
 gcc/testsuite/gcc.target/powerpc/pr63281.c| 14 +++
 gcc/testsuite/gcc.target/powerpc/pr93012.c|  2 +-
 9 files changed, 71 insertions(+), 11 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c

diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index d7a7cfe860f..beb21a0bb2b 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -1361,6 +1361,9 @@ static const struct attribute_spec 
rs6000_attribute_table[] =
 #undef TARGET_CANNOT_FORCE_CONST_MEM
 #define TARGET_CANNOT_FORCE_CONST_MEM rs6000_cannot_force_const_mem
 
+#undef TARGET_FASTER_LOADING_CONSTANT
+#define TARGET_FASTER_LOADING_CONSTANT rs6000_faster_loading_const
+
 #undef TARGET_DELEGITIMIZE_ADDRESS
 #define TARGET_DELEGITIMIZE_ADDRESS rs6000_delegitimize_address
 
@@ -9684,8 +9687,7 @@ rs6000_init_stack_protect_guard (void)
 static bool
 rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
 {
-  if (GET_CODE (x) == HIGH
-  && GET_CODE (XEXP (x, 0)) == UNSPEC)
+  if (GET_CODE (x) == HIGH)
 return true;
 
   /* A TLS symbol in the TOC cannot contain a sum.  */
@@ -10483,6 +10485,30 @@ rs6000_emit_move_si_sf_subreg (rtx dest, rtx source, 
machine_mode mode)
   return false;
 }
 
+
+/* Implement TARGET_FASTER_LOADING_CONSTANT.  */
+
+static bool
+rs6000_faster_loading_const (machine_mode mode, rtx dest, rtx src)
+{
+  gcc_assert (CONSTANT_P (src));
+
+  if (GET_MODE_CLASS (mode) == MODE_CC || mode == VOIDmode)
+return false;
+  if (GET_CODE (src) == HIGH)
+return false;
+  if (toc_relative_expr_p (src, false, NULL, NULL))
+return false;
+  if (rs6000_cannot_force_const_mem (mode, src))
+return false;
+
+  if (REG_P (dest) && FP_REGNO_P (REGNO (dest)))
+return true;
+  if (!CONST_INT_P (src))
+return true;
+  return num_insns_constant (src, mode) > (TARGET_PCREL ? 1 : 2);
+}
+
 /* Emit a move from SOURCE to DEST in mode MODE.  */
 void
 rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
@@ -10860,13 +10886,7 @@ rs6000_emit_move (rtx dest, rtx source, machine_mode 
mode)
operands[1] = create_TOC_reference (operands[1], operands[0]);
   else if (mode == Pmode
   && CONSTANT_P (operands[1])
-  && GET_CODE (operands[1]) != HIGH
-  && ((REG_P (operands[0])
-   && FP_REGNO_P (REGNO (operands[0])))
-  || !CONST_INT_P (operands[1])
-  || (num_insns_constant (operands[1], mode)
-  > (TARGET_CMODEL != CMODEL_SMALL ? 3 : 2)))
-  && !toc_relative_expr_p (operands[1], false, NULL, NULL)
+  && rs6000_faster_loading_const (mode, operands[0], operands[1])
   && (TARGET_CMODEL == CMODEL_SMALL
   || can_create_pseudo_p ()
   || (REG_P (operands[0])
diff --git a/gcc/cse.cc b/gcc/cse.cc
index a18b599d324..c33d3c7e911 100644
--- a/gcc/cse.cc
+++ b/gcc/cse.cc
@@ -5158,6 +5158,11 @@ cse_insn (rtx_insn *insn)
   if (dest == pc_rtx && src_const && GET_CODE (src_const) == LABEL_REF)
src_folded = src_const, src_folded_cost = src_folded_regcost = -1;
 
+   

Re: [PATCH] Check if loading const from mem is faster

2022-02-21 Thread Richard Biener via Gcc-patches
On Tue, 22 Feb 2022, Jiufu Guo wrote:

> Hi,
> 
> For constants, there is code that checks whether a constant can be put
> into an instruction as an immediate operand or whether it is more
> profitable to load it from memory.  There are still some places where
> this could be improved for particular platforms.
> 
> This patch could handle PR63281/57836.  It does not change too much in
> code like force_const_mem and legitimate_constant_p.  We may integrate
> these APIs for passes like expand/cse/combine as a whole solution in
> the future (maybe better for stage1?).
> 
> Bootstrapped and regtested on ppc64le and x86_64.  Is this ok for trunk?
> Thanks for comments!

I'm not sure whether we need a new hook here, but if so, then I think
whether loading a constant (from memory?) is faster or not depends
on the context.  So what's the exact situation and the two variants
you are costing against each other?  I assume (since you are
touching CSE) you are costing

  (set (...) (mem))  (before CSE)

against

  (set (...) (immediate))  (what CSE does now)

vs.

  (set (...) (mem))  (original, no CSE)

?  With the new hook you are skipping _all_ of the following loop's
logic, which does look like quite a bad design and a hack (not that
I am very familiar with the candidate / costing logic in there).

We already have TARGET_INSN_COST which you could ask for a cost.
Like, if we have a single_set, then just temporarily substitute
the RHS with the candidate, cost the insn, and compare against
the original insn cost.  So why exactly do you need a new hook
for this particular situation?
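
For illustration, a minimal sketch (untested, the name is made up) of
the TARGET_INSN_COST-based comparison suggested above, assuming a
single_set insn:

  static bool
  candidate_cheaper_p (rtx_insn *insn, rtx candidate, bool speed)
  {
    rtx set = single_set (insn);
    if (!set)
      return false;
    int old_cost = insn_cost (insn, speed);
    rtx old_src = SET_SRC (set);
    /* Temporarily substitute the RHS with the candidate ...  */
    SET_SRC (set) = candidate;
    int new_cost = insn_cost (insn, speed);
    /* ... and restore the original pattern.  */
    SET_SRC (set) = old_src;
    return new_cost < old_cost;
  }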

Thanks,
Richard.


> 
> BR,
> Jiufu
> 
> gcc/ChangeLog:
> 
>   PR target/94393
>   PR rtl-optimization/63281
>   * config/rs6000/rs6000.cc (TARGET_FASTER_LOADING_CONSTANT):
>   New hook.
>   (rs6000_faster_loading_const): New hook.
>   (rs6000_cannot_force_const_mem): Update.
>   (rs6000_emit_move): Use rs6000_faster_loading_const.
>   * cse.cc (cse_insn): Use new hook.
>   * doc/tm.texi.in: New hook TARGET_FASTER_LOADING_CONSTANT.
>   * doc/tm.texi: Regenerate.
>   * target.def: New hook faster_loading_constant.
>   * targhooks.cc (default_faster_loading_constant): New.
>   * targhooks.h (default_faster_loading_constant): Declare it.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.target/powerpc/medium_offset.c: Update.
>   * gcc.target/powerpc/pr93012.c: Update.
>   * gcc.target/powerpc/pr63281.c: New test.
> 
> ---
>  gcc/config/rs6000/rs6000.cc   | 38 ++-
>  gcc/cse.cc|  5 +++
>  gcc/doc/tm.texi   |  5 +++
>  gcc/doc/tm.texi.in|  2 +
>  gcc/target.def|  8 
>  gcc/targhooks.cc  |  6 +++
>  .../gcc.target/powerpc/medium_offset.c|  2 +-
>  gcc/testsuite/gcc.target/powerpc/pr63281.c| 14 +++
>  gcc/testsuite/gcc.target/powerpc/pr93012.c|  2 +-
>  9 files changed, 71 insertions(+), 11 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c
> 
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index d7a7cfe860f..beb21a0bb2b 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -1361,6 +1361,9 @@ static const struct attribute_spec 
> rs6000_attribute_table[] =
>  #undef TARGET_CANNOT_FORCE_CONST_MEM
>  #define TARGET_CANNOT_FORCE_CONST_MEM rs6000_cannot_force_const_mem
>  
> +#undef TARGET_FASTER_LOADING_CONSTANT
> +#define TARGET_FASTER_LOADING_CONSTANT rs6000_faster_loading_const
> +
>  #undef TARGET_DELEGITIMIZE_ADDRESS
>  #define TARGET_DELEGITIMIZE_ADDRESS rs6000_delegitimize_address
>  
> @@ -9684,8 +9687,7 @@ rs6000_init_stack_protect_guard (void)
>  static bool
>  rs6000_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
>  {
> -  if (GET_CODE (x) == HIGH
> -  && GET_CODE (XEXP (x, 0)) == UNSPEC)
> +  if (GET_CODE (x) == HIGH)
>  return true;
>  
>/* A TLS symbol in the TOC cannot contain a sum.  */
> @@ -10483,6 +10485,30 @@ rs6000_emit_move_si_sf_subreg (rtx dest, rtx source, 
> machine_mode mode)
>return false;
>  }
>  
> +
> +/* Implement TARGET_FASTER_LOADING_CONSTANT.  */
> +
> +static bool
> +rs6000_faster_loading_const (machine_mode mode, rtx dest, rtx src)
> +{
> +  gcc_assert (CONSTANT_P (src));
> +
> +  if (GET_MODE_CLASS (mode) == MODE_CC || mode == VOIDmode)
> +return false;
> +  if (GET_CODE (src) == HIGH)
> +return false;
> +  if (toc_relative_expr_p (src, false, NULL, NULL))
> +return false;
> +  if (rs6000_cannot_force_const_mem (mode, src))
> +return false;
> +
> +  if (REG_P (dest) && FP_REGNO_P (REGNO (dest)))
> +return true;
> +  if (!CONST_INT_P (src))
> +return true;
> +  return num_insns_constant (src, mode) > (TARGET_PCREL ? 1 : 2);
> +}
> +
>  /* Emit a move from SOURCE to DEST in mode MODE.  */
>  void
>  rs6000_emit_move (rtx dest, rtx source, machine_mod

Re: [PATCH] tree-optimization: [PR103855] Fold (type)X / (type)Y

2022-02-21 Thread Richard Biener via Gcc-patches
On Tue, Feb 22, 2022 at 5:01 AM Zhao Wei Liew via Gcc-patches
 wrote:
>
> On Tue, 22 Feb 2022 at 11:57, Zhao Wei Liew  wrote:
> >
> > Hi,
> >
> > This is a partial optimization for PR103855.
> >
> > Initially, I looked into optimizing RTL generation or a more complex
> > GIMPLE transformation so that we could optimize other cases as well,
> > such as ((unsigned long long) short / int).
> >
> > However, that is a bit too complex for now. While I continue to look
> > into that change, I've decided to implement this simpler match.pd
> > transformation.
> >
> > Greatly appreciate any feedback on this patch or guidance for
> > implementing the more advanced optimizations!
> >
> > Thanks,
> > Zhao Wei
>
> Sorry, the original patch wasn't recognized as a text file. I've added
> a .txt extension to make it explicit.

A few comments - note the change is only appropriate for next stage1.
Since we're
currently in a regression fixing period reviews can be slow - do not
hesitate to ping
forgotten patches when stage1 opens again.

+/* (type)X / (type)Y -> (type)(X / Y)
+   when the resulting type is at least as precise as the original types
+   and when all the types are unsigned integral types. */
+(simplify
+ (trunc_div (convert @0) (convert @1))
+ (if (INTEGRAL_TYPE_P (type)
+  && INTEGRAL_TYPE_P (TREE_TYPE (@0))
+  && INTEGRAL_TYPE_P (TREE_TYPE (@1))
+  && TYPE_UNSIGNED (type)
+  && TYPE_UNSIGNED (TREE_TYPE (@0))
+  && TYPE_UNSIGNED (TREE_TYPE (@1))
+  && TYPE_PRECISION (TREE_TYPE (@0)) == TYPE_PRECISION (TREE_TYPE (@1))
+  && TYPE_PRECISION (type) >= TYPE_PRECISION (TREE_TYPE (@0)))
+  (convert (trunc_div @0 @1

since you are requiring the types of @0 and @1 to match it's easier to write

 && types_match (TREE_TYPE(@0), TREE_TYPE (@1))

that allows you to elide checks on either @0 or @1.  I suppose the transform
does not work for signed types because of the -INT_MIN / -1 overflow case?
It might be possible to use expr_not_equal_to (@0, -INT_MIN) ||
expr_not_equal_to (@1, -1)
(correctly written; look up the existing examples in match.pd for the X
% -Y transform)
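
For illustration, a sketch (untested) of the condition rewritten with
types_match as suggested above:

(simplify
 (trunc_div (convert @0) (convert @1))
 (if (INTEGRAL_TYPE_P (type)
      && TYPE_UNSIGNED (type)
      && INTEGRAL_TYPE_P (TREE_TYPE (@0))
      && TYPE_UNSIGNED (TREE_TYPE (@0))
      && types_match (TREE_TYPE (@0), TREE_TYPE (@1))
      && TYPE_PRECISION (type) >= TYPE_PRECISION (TREE_TYPE (@0)))
  (convert (trunc_div @0 @1))))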

I'll note that as written the transform will not catch CST / (T)x or
(T)x / CST since
you'll not see conversions around constants.  I'm not sure whether
using (convert[12]? ...)
or writing special patterns with INTEGER_CST operands is more convenient.
There is int_fits_type_p to check whether a constant will fit in a
type without truncation
or sign change.
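
For illustration, a sketch (untested) of an INTEGER_CST variant using
int_fits_type_p, along the lines mentioned above:

(simplify
 (trunc_div (convert @0) INTEGER_CST@1)
 (if (INTEGRAL_TYPE_P (type)
      && INTEGRAL_TYPE_P (TREE_TYPE (@0))
      && TYPE_UNSIGNED (type)
      && TYPE_UNSIGNED (TREE_TYPE (@0))
      && TYPE_PRECISION (type) >= TYPE_PRECISION (TREE_TYPE (@0))
      && int_fits_type_p (@1, TREE_TYPE (@0)))
  (convert (trunc_div @0 { fold_convert (TREE_TYPE (@0), @1); }))))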

When @0 and @1 do not have the same type there might still be a common type
that can be used and is smaller than 'type'; it might be as simple as using
build_nonstandard_integer_type (MIN (@0-prec, @1-prec), 1 /*unsigned_p*/).

In the past there have been attempts to more globally narrow operations using
a new pass rather than using individual patterns.  So for more complicated cases
that might be the way to go.  There's now also the ISEL pass which does
pre-RTL expansion transforms that need some global context and for example
can look at SSA uses.

Richard.


Re: [PATCH 3/3] target/99881 - x86 vector cost of CTOR from integer regs

2022-02-21 Thread Richard Biener via Gcc-patches
On Tue, 22 Feb 2022, Hongtao Liu wrote:

> On Mon, Feb 21, 2022 at 5:10 PM Richard Biener  wrote:
> >
> > On Mon, 21 Feb 2022, Hongtao Liu wrote:
> >
> > > On Fri, Feb 18, 2022 at 10:01 PM Richard Biener via Gcc-patches
> > >  wrote:
> > > >
> > > > This uses the now passed SLP node to the vectorizer costing hook
> > > > to adjust vector construction costs for the cost of moving an
> > > > integer component from a GPR to a vector register when that's
> > > > required for building a vector from components.  A crucial difference
> > > > here is whether the component is loaded from memory or extracted
> > > > from a vector register as in those cases no intermediate GPR is 
> > > > involved.
> > > >
> > > > The pr99881.c testcase can be Un-XFAILed with this patch, the
> > > > pr91446.c testcase now produces scalar code which looks superior
> > > > to me so I've adjusted it as well.
> > > >
> > > > I'm currently re-bootstrapping and testing on x86_64-unknown-linux-gnu
> > > > after adding the BIT_FIELD_REF vector extracting special casing.
> > > Does the patch handle PR101929?
> >
> > The patch will regress the testcase posted in PR101929 again:
> >
> >  _255 1 times scalar_store costs 12 in body
> >  _261 1 times scalar_store costs 12 in body
> >  _258 1 times scalar_store costs 12 in body
> >  _264 1 times scalar_store costs 12 in body
> >  t0_247 + t2_251 1 times scalar_stmt costs 4 in body
> >  t1_472 + t3_444 1 times scalar_stmt costs 4 in body
> >  t0_406 - t2_451 1 times scalar_stmt costs 4 in body
> >  t1_472 - t3_444 1 times scalar_stmt costs 4 in body
> > -node 0x4182f48 1 times vec_construct costs 16 in prologue
> > -node 0x41882b0 1 times vec_construct costs 16 in prologue
> > +node 0x4182f48 1 times vec_construct costs 28 in prologue
> > +node 0x41882b0 1 times vec_construct costs 28 in prologue
> >  t0_406 + t2_451 1 times vector_stmt costs 4 in body
> >  t1_472 - t3_444 1 times vector_stmt costs 4 in body
> >  node 0x41829f8 1 times vec_perm costs 4 in body
> >  _436 1 times vector_store costs 16 in body
> >  t.c:37:9: note: Cost model analysis for part in loop 0:
> > -  Vector cost: 60
> > +  Vector cost: 84
> >Scalar cost: 64
> > +t.c:37:9: missed: not vectorized: vectorization is not profitable.
> >
> > We're constructing V4SI from patterns like { _407, _480, _407, _480 }
> > where the components are results of integer adds (so the result is
> > definitely in a GPR).  We are costing the construction as
> > 4 * sse_op + 2 * sse_to_integer, which with the skylake cost table is
> > 4 * COSTS_N_INSNS (1) + 2 * 6 = 16 + 12 = 28, matching the
> > vec_construct cost of 28 above.
> >
> > Whether the vectorization itself is profitable is likely questionable
> > but then it's true that the construction of V4SI is more costly
> > in terms of uops than a construction of V4SF.
> >
> > Now, we can - for the first time - now see the actual construction
> > pattern and ideal construction might be two GPR->xmm moves
> > + two splats + one unpack or maybe two GPR->xmm moves + one
> > unpack + splat of DI (or other means of duplicating the lowpart).
> Yes, the patch is technically right. I'm ok with the patch.

Thanks, I've pushed it now.  I've also tested the suggested adjustment
doing

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index b2bf90576d5..acf2cc977b4 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -22595,7 +22595,7 @@ ix86_builtin_vectorization_cost (enum 
vect_cost_for_stmt type_of_cost,
   case vec_construct:
{
  /* N element inserts into SSE vectors.  */
- int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
+ int cost = (TYPE_VECTOR_SUBPARTS (vectype) - 1) * 
ix86_cost->sse_op;
  /* One vinserti128 for combining two SSE vectors for AVX256.  */
  if (GET_MODE_BITSIZE (mode) == 256)
cost += ix86_vec_cost (mode, ix86_cost->addss);

successfully (with no effect on the PR101929 case as expected), I
will queue that for stage1 since it isn't known to fix any
regression (but I will keep it as an option in case something pops up).

I'll also have a more detailed look into the x264_r case to see
if there's something we can do about the regression that will now
show up (and I'll watch autotesters).

Thanks,
Richard.


> > It still wouldn't help here of course, and we would not have RTL
> > expansion adhere to this scheme.
> >
> > Trying to capture secondary effects (the FRE opportunity unleashed)
> > in costing of this particular SLP subgraph is difficult and probably
> > undesirable though.  Adjusting any of the target specific costs
> > is likely not going to work because of the original narrow profitability
> > gap (60 vs. 64).
> >
> > For the particular kernel the overall vectorization strategy needs
> > to improve (PR98138 is about this I think).  I know we've reverted
> > the previous change that attempted to cost for GPR -> XMM.  It did
> >
> >case vec_construct:
> > {
> >   /* N element inserts into SSE vectors.  */
> > + int c