[PATCH] GCSE: Export add_label_notes as global function

2023-07-10 Thread juzhe.zhong
From: Ju-Zhe Zhong 

Since 'add_label_notes' is a generic helper function used by
riscv-vsetvl.cc in the RISC-V port backend, and it will also be used by
riscv.cc in the following patches, export it as a global helper
function.

gcc/ChangeLog:

* config/riscv/riscv-vsetvl.cc (add_label_notes): Remove it.
* gcse.cc (add_label_notes): Export it as global.
(one_pre_gcse_pass): Ditto.
* gcse.h (add_label_notes): Ditto.

---
 gcc/config/riscv/riscv-vsetvl.cc | 48 +---
 gcc/gcse.cc  |  3 +-
 gcc/gcse.h   |  1 +
 3 files changed, 3 insertions(+), 49 deletions(-)

diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc
index ab47901e23f..038ba22362e 100644
--- a/gcc/config/riscv/riscv-vsetvl.cc
+++ b/gcc/config/riscv/riscv-vsetvl.cc
@@ -98,6 +98,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "lcm.h"
 #include "predict.h"
 #include "profile-count.h"
+#include "gcse.h"
 #include "riscv-vsetvl.h"
 
 using namespace rtl_ssa;
@@ -763,53 +764,6 @@ insert_vsetvl (enum emit_type emit_type, rtx_insn *rinsn,
   return VSETVL_DISCARD_RESULT;
 }
 
-/* If X contains any LABEL_REF's, add REG_LABEL_OPERAND notes for them
-   to INSN.  If such notes are added to an insn which references a
-   CODE_LABEL, the LABEL_NUSES count is incremented.  We have to add
-   that note, because the following loop optimization pass requires
-   them.  */
-
-/* ??? If there was a jump optimization pass after gcse and before loop,
-   then we would not need to do this here, because jump would add the
-   necessary REG_LABEL_OPERAND and REG_LABEL_TARGET notes.  */
-
-static void
-add_label_notes (rtx x, rtx_insn *rinsn)
-{
-  enum rtx_code code = GET_CODE (x);
-  int i, j;
-  const char *fmt;
-
-  if (code == LABEL_REF && !LABEL_REF_NONLOCAL_P (x))
-{
-  /* This code used to ignore labels that referred to dispatch tables to
-avoid flow generating (slightly) worse code.
-
-We no longer ignore such label references (see LABEL_REF handling in
-mark_jump_label for additional information).  */
-
-  /* There's no reason for current users to emit jump-insns with
-such a LABEL_REF, so we don't have to handle REG_LABEL_TARGET
-notes.  */
-  gcc_assert (!JUMP_P (rinsn));
-  add_reg_note (rinsn, REG_LABEL_OPERAND, label_ref_label (x));
-
-  if (LABEL_P (label_ref_label (x)))
-   LABEL_NUSES (label_ref_label (x))++;
-
-  return;
-}
-
-  for (i = GET_RTX_LENGTH (code) - 1, fmt = GET_RTX_FORMAT (code); i >= 0; i--)
-{
-  if (fmt[i] == 'e')
-   add_label_notes (XEXP (x, i), rinsn);
-  else if (fmt[i] == 'E')
-   for (j = XVECLEN (x, i) - 1; j >= 0; j--)
- add_label_notes (XVECEXP (x, i, j), rinsn);
-}
-}
-
 /* Add EXPR to the end of basic block BB.
 
This is used by both the PRE and code hoisting.  */
diff --git a/gcc/gcse.cc b/gcc/gcse.cc
index 72832736572..5627fbf127a 100644
--- a/gcc/gcse.cc
+++ b/gcc/gcse.cc
@@ -483,7 +483,6 @@ static void pre_insert_copies (void);
 static int pre_delete (void);
 static int pre_gcse (struct edge_list *);
 static int one_pre_gcse_pass (void);
-static void add_label_notes (rtx, rtx_insn *);
 static void alloc_code_hoist_mem (int, int);
 static void free_code_hoist_mem (void);
 static void compute_code_hoist_vbeinout (void);
@@ -2639,7 +2638,7 @@ one_pre_gcse_pass (void)
then we would not need to do this here, because jump would add the
necessary REG_LABEL_OPERAND and REG_LABEL_TARGET notes.  */
 
-static void
+void
 add_label_notes (rtx x, rtx_insn *insn)
 {
   enum rtx_code code = GET_CODE (x);
diff --git a/gcc/gcse.h b/gcc/gcse.h
index 5582b29eec2..e5ee9b088bd 100644
--- a/gcc/gcse.h
+++ b/gcc/gcse.h
@@ -41,5 +41,6 @@ extern struct target_gcse *this_target_gcse;
 
 void gcse_cc_finalize (void);
 extern bool gcse_or_cprop_is_too_expensive (const char *);
+void add_label_notes (rtx, rtx_insn *);
 
 #endif
-- 
2.36.1
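The recursive walk that add_label_notes performs over an RTX — descend through every operand, and bump a use count each time a label leaf is reached — can be modeled standalone with a toy expression tree. This is a minimal sketch; the names are illustrative and deliberately not GCC's rtx API:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the walk in add_label_notes: a node is either a LABEL
   leaf or an interior expression with children, and the walk increments
   a per-label use count (LABEL_NUSES in the real code) for every
   reference it finds.  */
enum toy_code { TOY_LABEL, TOY_EXPR };

struct toy_rtx
{
  enum toy_code code;
  int nuses;              /* meaningful for TOY_LABEL only */
  struct toy_rtx *ops[2]; /* children; unused slots are NULL */
};

static void
toy_add_label_notes (struct toy_rtx *x)
{
  if (x == NULL)
    return;
  if (x->code == TOY_LABEL)
    {
      x->nuses++;         /* LABEL_NUSES (label_ref_label (x))++ */
      return;
    }
  /* Recurse into every operand, as the 'e'/'E' format loop does.  */
  for (int i = 0; i < 2; i++)
    toy_add_label_notes (x->ops[i]);
}
```

Note that running the walk twice over the same tree doubles the counts, since nothing resets them first — the exported function only ever adds.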



Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

2023-07-10 Thread Richard Biener via Gcc-patches
On Fri, 7 Jul 2023, Jan Hubicka wrote:

> > 
> > Looks good, but I wonder what we can do to at least make the
> > multiple exit case behave reasonably?  The vectorizer keeps track
> 
> > of a "canonical" exit, would it be possible to pass in the main
> > exit edge and use that instead of single_exit (), would other
> > exits then behave somewhat reasonable or would we totally screw
> > things up here?  That is, the "canonical" exit would be the
> > counting exit while the other exits are on data driven conditions
> > and thus wouldn't change probability when we reduce the number
> > of iterations(?)
> 
> I can add a canonical_exit parameter and make the function direct the
> flow to it if possible.  However, overall I think the fixup depends on
> what transformation led to the change.

I think the vectorizer knows there's a single counting IV and all
other exits are dependent on the data processed, so the scaling the
vectorizer does just changes the counting IV.  So I think it makes
sense to pass that exit to the function in all cases.

> Assuming that the vectorizer did no prologues and epilogues and we
> vectorized with factor N, then I think the update could be done more
> specifically as follows.
> 
> We know that the header block count dropped by a factor of 4, so we
> can start from that, and each time we reach a basic block with an exit
> edge, we know the original count of the edge.  This count is
> unchanged, so one can rescale the probabilities out of that BB
> accordingly.  If the loop has no inner loops, we can just walk the
> body in RPO and propagate the scales downwards, and we should arrive
> at the right result.

That should work for alternate exits as well, no?
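The fixup sketched above rests on one invariant: the absolute count of a data-dependent exit edge is unchanged by the scaling, so when its source block's count drops, the exit probability must be scaled up to compensate. A toy sketch of that arithmetic (not GCC's profile-count API; probabilities kept in per-mille so the math stays integral):

```c
#include <assert.h>

/* After a block's count is scaled (e.g. divided by the vectorization
   factor), recompute the probability of an exit edge whose absolute
   count is known to be invariant.  */
static long
rescaled_exit_prob_permille (long old_bb_count, long new_bb_count,
                             long old_exit_prob_permille)
{
  /* The edge's count is unchanged by the scaling...  */
  long exit_count_x1000 = old_bb_count * old_exit_prob_permille;
  /* ...so its new probability is that same count over the new,
     smaller block count.  */
  return exit_count_x1000 / new_bb_count;
}
```

For example, scaling a block from count 4000 down to 1000 turns a 10-per-mille exit edge into a 40-per-mille one, preserving the edge count of 40.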

> I originally added the bound parameter to handle prologues/epilogues
> which gets new artificial bound.  In prologue I think you are right that
> the flow will be probably directed to the conditional counting
> iterations.

I suppose we'd need to scale both main and epilogue together since
the epilogue "steals" from the main loop counts.  Likewise if there's
a skip edge around the vector loop.  I think currently we simply
set the edge probability of those skip conds rather than basing
this off the niter values they work on.  Aka if (niter < VF) goto
epilogue; do {} while (niter / VF); epilogue: do {} while (niter);

There's also the cost model which might require niter > VF to enter
the main loop body.
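The niter bookkeeping behind the `if (niter < VF)` skip above can be sketched concretely: the main loop consumes whole vectors and the scalar epilogue takes whatever remains. A minimal illustration (hypothetical helper, not the vectorizer's own code):

```c
#include <assert.h>

/* Split a scalar iteration count between a vectorized main loop with
   factor VF and its scalar epilogue.  */
struct niter_split
{
  unsigned long main_iters; /* vector iterations of the main loop */
  unsigned long epi_iters;  /* scalar iterations of the epilogue */
};

static struct niter_split
split_iterations (unsigned long niter, unsigned long vf)
{
  struct niter_split s;
  s.main_iters = niter / vf; /* whole vectors */
  s.epi_iters = niter % vf;  /* leftover scalar work */
  return s;
}
```

For the testcase discussed below (niter = 99, VF = 4) this gives 24 vector iterations and 3 epilogue iterations — matching the observation that the vectorized body runs 24 times, and that an epilogue loop with vector size 2 covers 2 of the 3 remaining additions.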

> In the epilogue we add no artificial iteration cap, so maybe it is
> more realistic to simply scale up the probability of all exits?

Probably.

> To see what is going on I tried following testcase:
> 
> int a[99];
> test()
> {
>   for (int i = 0; i < 99; i++)
>   a[i]++;
> }
> 
> What surprises me is that vectorizer at -O2 does nothing and we end up
> unrolling the loop:
> 
> L2:
> addl$1, (%rax)
> addl$1, 4(%rax)
> addl$1, 8(%rax)
> addq$12, %rax
> cmpq$a+396, %rax
> 
> Which seems a silly thing to do.  A vectorized loop with an epilogue
> doing 2 and 1 additions would be better.
> 
> With -O3 we vectorize it:
> 
> 
> .L2:
> movdqa  (%rax), %xmm0
> addq$16, %rax
> paddd   %xmm1, %xmm0
> movaps  %xmm0, -16(%rax)
> cmpq%rax, %rdx
> jne .L2
> movqa+384(%rip), %xmm0
> addl$1, a+392(%rip)
> movq.LC1(%rip), %xmm1
> paddd   %xmm1, %xmm0
> movq%xmm0, a+384(%rip)

The -O2 cost model doesn't want to do epilogues:

  /* If using the "very cheap" model. reject cases in which we'd keep
 a copy of the scalar code (even if we might be able to vectorize it).  
*/
  if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
  && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
  || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
  || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
{
  if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
 "some scalar iterations would need to be 
peeled\n");
  return 0;
}

it's because of the code size increase.

> and correctly drops the vectorized loop body to 24 iterations.
> However the epilogue has a loop for vector size 2 predicted to
> iterate once (it won't):
> 
> ;;   basic block 7, loop depth 0, count 10737416 (estimated locally), maybe 
> hot 
> ;;prev block 5, next block 8, flags: (NEW, VISITED)   
>   
> ;;pred:   3 [4.0% (adjusted)]  count:10737416 (estimated locally) 
> (FALSE_VALUE,EXECUTABLE)
> ;;succ:   8 [always]  count:10737416 (estimated locally) 
> (FALLTHRU,EXECUTABLE)
>   
>   
> ;;   basic block 8, loop depth 1, count 21474835 (estimated locally), maybe 
> hot 
> ;;prev block 7, next block 9, flags: (NEW, REACHABLE, VISITED)
>   
> ;;pred:   9 [always]  count:10737417 (estimated locally) 
> (FALLTHRU,DFS_BACK,EXECUTABLE)
> ;;7 [always]  count:10737416 (estimated l

Re: [PATCH] GCSE: Export add_label_notes as global function

2023-07-10 Thread Richard Biener via Gcc-patches
On Mon, 10 Jul 2023, juzhe.zh...@rivai.ai wrote:

> From: Ju-Zhe Zhong 
> 
> Since 'add_lable_notes' is a generic helper function which is used by 
> riscv-vsetvl.cc
> in RISC-V port backend. And it's also will be used by riscv.cc too by the 
> following patches.
> Export it as global helper function.

I know nothing about this code but grepping shows me the existing
rebuild_jump_labels () API which also properly resets LABEL_NUSES
before incrementing it.  I don't think exporting add_label_notes ()
as-is is good because it at least will wreck those counts.
GCSE uses this function to add the notes for a specific instruction
only, so if we want to export such API the name of the function
should imply it works on a single insn.

Richard.

> gcc/ChangeLog:
> 
> * config/riscv/riscv-vsetvl.cc (add_label_notes): Remove it.
> * gcse.cc (add_label_notes): Export it as global.
> (one_pre_gcse_pass): Ditto.
> * gcse.h (add_label_notes): Ditto.
> 
> ---
>  gcc/config/riscv/riscv-vsetvl.cc | 48 +---
>  gcc/gcse.cc  |  3 +-
>  gcc/gcse.h   |  1 +
>  3 files changed, 3 insertions(+), 49 deletions(-)
> 
> diff --git a/gcc/config/riscv/riscv-vsetvl.cc 
> b/gcc/config/riscv/riscv-vsetvl.cc
> index ab47901e23f..038ba22362e 100644
> --- a/gcc/config/riscv/riscv-vsetvl.cc
> +++ b/gcc/config/riscv/riscv-vsetvl.cc
> @@ -98,6 +98,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "lcm.h"
>  #include "predict.h"
>  #include "profile-count.h"
> +#include "gcse.h"
>  #include "riscv-vsetvl.h"
>  
>  using namespace rtl_ssa;
> @@ -763,53 +764,6 @@ insert_vsetvl (enum emit_type emit_type, rtx_insn *rinsn,
>return VSETVL_DISCARD_RESULT;
>  }
>  
> -/* If X contains any LABEL_REF's, add REG_LABEL_OPERAND notes for them
> -   to INSN.  If such notes are added to an insn which references a
> -   CODE_LABEL, the LABEL_NUSES count is incremented.  We have to add
> -   that note, because the following loop optimization pass requires
> -   them.  */
> -
> -/* ??? If there was a jump optimization pass after gcse and before loop,
> -   then we would not need to do this here, because jump would add the
> -   necessary REG_LABEL_OPERAND and REG_LABEL_TARGET notes.  */
> -
> -static void
> -add_label_notes (rtx x, rtx_insn *rinsn)
> -{
> -  enum rtx_code code = GET_CODE (x);
> -  int i, j;
> -  const char *fmt;
> -
> -  if (code == LABEL_REF && !LABEL_REF_NONLOCAL_P (x))
> -{
> -  /* This code used to ignore labels that referred to dispatch tables to
> -  avoid flow generating (slightly) worse code.
> -
> -  We no longer ignore such label references (see LABEL_REF handling in
> -  mark_jump_label for additional information).  */
> -
> -  /* There's no reason for current users to emit jump-insns with
> -  such a LABEL_REF, so we don't have to handle REG_LABEL_TARGET
> -  notes.  */
> -  gcc_assert (!JUMP_P (rinsn));
> -  add_reg_note (rinsn, REG_LABEL_OPERAND, label_ref_label (x));
> -
> -  if (LABEL_P (label_ref_label (x)))
> - LABEL_NUSES (label_ref_label (x))++;
> -
> -  return;
> -}
> -
> -  for (i = GET_RTX_LENGTH (code) - 1, fmt = GET_RTX_FORMAT (code); i >= 0; 
> i--)
> -{
> -  if (fmt[i] == 'e')
> - add_label_notes (XEXP (x, i), rinsn);
> -  else if (fmt[i] == 'E')
> - for (j = XVECLEN (x, i) - 1; j >= 0; j--)
> -   add_label_notes (XVECEXP (x, i, j), rinsn);
> -}
> -}
> -
>  /* Add EXPR to the end of basic block BB.
>  
> This is used by both the PRE and code hoisting.  */
> diff --git a/gcc/gcse.cc b/gcc/gcse.cc
> index 72832736572..5627fbf127a 100644
> --- a/gcc/gcse.cc
> +++ b/gcc/gcse.cc
> @@ -483,7 +483,6 @@ static void pre_insert_copies (void);
>  static int pre_delete (void);
>  static int pre_gcse (struct edge_list *);
>  static int one_pre_gcse_pass (void);
> -static void add_label_notes (rtx, rtx_insn *);
>  static void alloc_code_hoist_mem (int, int);
>  static void free_code_hoist_mem (void);
>  static void compute_code_hoist_vbeinout (void);
> @@ -2639,7 +2638,7 @@ one_pre_gcse_pass (void)
> then we would not need to do this here, because jump would add the
> necessary REG_LABEL_OPERAND and REG_LABEL_TARGET notes.  */
>  
> -static void
> +void
>  add_label_notes (rtx x, rtx_insn *insn)
>  {
>enum rtx_code code = GET_CODE (x);
> diff --git a/gcc/gcse.h b/gcc/gcse.h
> index 5582b29eec2..e5ee9b088bd 100644
> --- a/gcc/gcse.h
> +++ b/gcc/gcse.h
> @@ -41,5 +41,6 @@ extern struct target_gcse *this_target_gcse;
>  
>  void gcse_cc_finalize (void);
>  extern bool gcse_or_cprop_is_too_expensive (const char *);
> +void add_label_notes (rtx, rtx_insn *);
>  
>  #endif
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)


Re: [PATCH] Fix PR 110539: missed optimization after moving two_value to match.pd

2023-07-10 Thread Richard Biener via Gcc-patches
On Fri, Jul 7, 2023 at 7:57 PM Andrew Pinski via Gcc-patches
 wrote:
>
> When I moved two_value to match.pd, I removed the check for the {0,+-1}
> case, as I had placed it after the {0,+-1} case for cond in match.pd.
> In the case of {0,+-1} and a non-boolean, before we would optimize
> those cases to just `(convert)a`, but afterwards we would get
> `(convert)(a != 0)`, which was not then simplified to `(convert)a`.
> So this adds a pattern to match `(convert)(zeroone != 0)` and simplify
> it to `(convert)zeroone`.
>
> In the bug report, we do finally optimize `(convert)(zeroone != 0)` to
> `zeroone` in VRP2 but that in itself is too late and we miss other
> optimizations that would have happened.
>
> OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions.
>
> gcc/ChangeLog:
>
> PR tree-optimization/110539
> * match.pd ((convert)(zeroone !=/== 0)): Match
> and simplify to ((convert)zeroone)){,^1}.
>
> gcc/testsuite/ChangeLog:
>
> PR tree-optimization/110539
> * gcc.dg/tree-ssa/pr110539-1.c: New test.
> * gcc.dg/tree-ssa/pr110539-2.c: New test.
> * gcc.dg/tree-ssa/pr110539-3.c: New test.
> ---
>  gcc/match.pd   | 15 +
>  gcc/testsuite/gcc.dg/tree-ssa/pr110539-1.c | 12 
>  gcc/testsuite/gcc.dg/tree-ssa/pr110539-2.c | 12 
>  gcc/testsuite/gcc.dg/tree-ssa/pr110539-3.c | 70 ++
>  4 files changed, 109 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr110539-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr110539-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr110539-3.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index c709153217a..87767a7778b 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -2060,6 +2060,21 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> (vec_cond:s (icmp@1 @4 @5) @3 integer_zerop))
>  (vec_cond @0 @2 @3)))
>
> +#if GIMPLE
> +/* This cannot be done on generic as fold has the
> +   exact opposite transformation:
> +   `Fold ~X & 1 as (X & 1) == 0.`
> +   `Fold (X ^ 1) & 1 as (X & 1) == 0.`  */
> +/* (convert)(zeroone != 0) into (convert)zeroone */

Not sure how the comment on GENERIC applies to this one?

> +/* (convert)(zeroone == 0) into (convert)(zeroone^1) */

This OTOH is a canonicalization, and I'm not sure a very good one
since (convert)(zeroone == 0) might be more easily recognized
as setCC and a zero flag might even be present in the def of zeroone?
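Whatever the canonicalization question, the identities behind both patterns can be checked exhaustively, since a zero_one_valued_p operand only takes two values: `(zeroone != 0)` equals `zeroone`, and `(zeroone == 0)` equals `zeroone ^ 1`. A standalone check:

```c
#include <assert.h>

/* The two forms the match.pd pattern rewrites, applied to a value
   known to be 0 or 1.  */
static int ne_form (int zeroone) { return zeroone != 0; }
static int eq_form (int zeroone) { return zeroone == 0; }
```

For zeroone in {0, 1}: ne_form(zeroone) == zeroone and eq_form(zeroone) == (zeroone ^ 1), which is exactly what the `(for neeq (ne eq) ...)` pattern below emits.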

> +(for neeq (ne eq)
> + (simplify
> +  (convert (neeq zero_one_valued_p@0 integer_zerop))
> +  (if (neeq == NE_EXPR)
> +   (convert @0)
> +   (convert (bit_xor @0 { build_one_cst (TREE_TYPE (@0)); } )
> +#endif
> +
>  /* Transform X & -Y into X * Y when Y is { 0 or 1 }.  */
>  (simplify
>   (bit_and:c (convert? (negate zero_one_valued_p@0)) @1)
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr110539-1.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/pr110539-1.c
> new file mode 100644
> index 000..6ba864cdd13
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr110539-1.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O1 -fdump-tree-optimized" } */
> +int f(int a)
> +{
> +int b = a & 1;
> +int c = b != 0;
> +return c == b;
> +}
> +
> +/* This should be optimized to just return 1; */
> +/* { dg-final { scan-tree-dump-not " == " "optimized"} } */
> +/* { dg-final { scan-tree-dump "return 1;" "optimized"} } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr110539-2.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/pr110539-2.c
> new file mode 100644
> index 000..17874d349ef
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr110539-2.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O1 -fdump-tree-optimized" } */
> +int f(int a)
> +{
> +int b = a & 1;
> +int c = b == 0;
> +return c == b;
> +}
> +
> +/* This should be optimized to just return 0; */
> +/* { dg-final { scan-tree-dump-not " == " "optimized"} } */
> +/* { dg-final { scan-tree-dump "return 0;" "optimized"} } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr110539-3.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/pr110539-3.c
> new file mode 100644
> index 000..c8ef6f56dcd
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr110539-3.c
> @@ -0,0 +1,70 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +void foo(void);
> +static int a, c = 1;
> +static short b;
> +static int *d = &c, *e = &a;
> +static int **f = &d;
> +void __assert_fail() __attribute__((__noreturn__));
> +static void g(short h) {
> +if (*d)
> +;
> +else {
> +if (e) __assert_fail();
> +if (a) {
> +__builtin_unreachable();
> +} else
> +__assert_fail();
> +}
> +if ((((0, 0) || h) == h) + b) *f = 0;
> +}
> +int main() {
> +int i = 0 != 10 & a;
> +g(i);
> +*e = 9;
> +e = 0;
> +if (d == 0)
> +;
> +else
> +foo();
> +;
> +}
> +/*

[PATCH] GCSE: Export 'insert_insn_end_basic_block' as global function

2023-07-10 Thread juzhe . zhong
From: Ju-Zhe Zhong 

Since the VSETVL pass in the RISC-V port uses the common part of
'insert_insn_end_basic_block (struct gcse_expr *expr, basic_block bb)',
and we will also use this helper function in riscv.cc in the following
patches, extract the common part of that function into a new function,
also called 'insert_insn_end_basic_block' but taking
'(rtx_insn *pat, basic_block bb)' arguments.  Call the new function
from 'insert_insn_end_basic_block (struct gcse_expr *expr,
basic_block bb)' and from the VSETVL pass in the RISC-V port.

Remove the now-redundant code from the VSETVL pass in the RISC-V port.

gcc/ChangeLog:

* config/riscv/riscv-vsetvl.cc (add_label_notes): Remove it.
(insert_insn_end_basic_block): Ditto.
(pass_vsetvl::commit_vsetvls): Adapt for new helper function.
* gcse.cc (insert_insn_end_basic_block):  Export as global function.
* gcse.h (insert_insn_end_basic_block): Ditto.
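The placement rule the shared helper implements — append the new insn at the end of the block, unless the block ends in a jump (or an abnormal-edge call), in which case insert just before that final insn — can be sketched standalone with a toy insn list (illustrative names, not GCC's rtx_insn chain):

```c
#include <assert.h>
#include <string.h>

enum { MAX_INSNS = 8 };

/* A basic block as a flat list of named insns; ends_in_jump marks a
   block whose last insn is a control-flow insn that must stay last.  */
struct toy_bb
{
  const char *insns[MAX_INSNS];
  int n;
  int ends_in_jump;
};

static void
toy_insert_insn_end_bb (struct toy_bb *bb, const char *insn)
{
  if (bb->ends_in_jump && bb->n > 0)
    {
      /* Insert before the final jump so control flow is preserved.  */
      bb->insns[bb->n] = bb->insns[bb->n - 1];
      bb->insns[bb->n - 1] = insn;
    }
  else
    bb->insns[bb->n] = insn; /* plain append */
  bb->n++;
}
```

For a block ending in "jump", inserting "vsetvl" places it second-to-last; for a fall-through block it simply becomes the new last insn. This is the behavior both PRE/code hoisting and the VSETVL pass rely on.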

---
 gcc/config/riscv/riscv-vsetvl.cc | 128 +--
 gcc/gcse.cc  |  24 --
 gcc/gcse.h   |   1 +
 3 files changed, 23 insertions(+), 130 deletions(-)

diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc
index ab47901e23f..586dc8e5379 100644
--- a/gcc/config/riscv/riscv-vsetvl.cc
+++ b/gcc/config/riscv/riscv-vsetvl.cc
@@ -98,6 +98,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "lcm.h"
 #include "predict.h"
 #include "profile-count.h"
+#include "gcse.h"
 #include "riscv-vsetvl.h"
 
 using namespace rtl_ssa;
@@ -763,127 +764,6 @@ insert_vsetvl (enum emit_type emit_type, rtx_insn *rinsn,
   return VSETVL_DISCARD_RESULT;
 }
 
-/* If X contains any LABEL_REF's, add REG_LABEL_OPERAND notes for them
-   to INSN.  If such notes are added to an insn which references a
-   CODE_LABEL, the LABEL_NUSES count is incremented.  We have to add
-   that note, because the following loop optimization pass requires
-   them.  */
-
-/* ??? If there was a jump optimization pass after gcse and before loop,
-   then we would not need to do this here, because jump would add the
-   necessary REG_LABEL_OPERAND and REG_LABEL_TARGET notes.  */
-
-static void
-add_label_notes (rtx x, rtx_insn *rinsn)
-{
-  enum rtx_code code = GET_CODE (x);
-  int i, j;
-  const char *fmt;
-
-  if (code == LABEL_REF && !LABEL_REF_NONLOCAL_P (x))
-{
-  /* This code used to ignore labels that referred to dispatch tables to
-avoid flow generating (slightly) worse code.
-
-We no longer ignore such label references (see LABEL_REF handling in
-mark_jump_label for additional information).  */
-
-  /* There's no reason for current users to emit jump-insns with
-such a LABEL_REF, so we don't have to handle REG_LABEL_TARGET
-notes.  */
-  gcc_assert (!JUMP_P (rinsn));
-  add_reg_note (rinsn, REG_LABEL_OPERAND, label_ref_label (x));
-
-  if (LABEL_P (label_ref_label (x)))
-   LABEL_NUSES (label_ref_label (x))++;
-
-  return;
-}
-
-  for (i = GET_RTX_LENGTH (code) - 1, fmt = GET_RTX_FORMAT (code); i >= 0; i--)
-{
-  if (fmt[i] == 'e')
-   add_label_notes (XEXP (x, i), rinsn);
-  else if (fmt[i] == 'E')
-   for (j = XVECLEN (x, i) - 1; j >= 0; j--)
- add_label_notes (XVECEXP (x, i, j), rinsn);
-}
-}
-
-/* Add EXPR to the end of basic block BB.
-
-   This is used by both the PRE and code hoisting.  */
-
-static void
-insert_insn_end_basic_block (rtx_insn *rinsn, basic_block cfg_bb)
-{
-  rtx_insn *end_rinsn = BB_END (cfg_bb);
-  rtx_insn *new_insn;
-  rtx_insn *pat, *pat_end;
-
-  pat = rinsn;
-  gcc_assert (pat && INSN_P (pat));
-
-  pat_end = pat;
-  while (NEXT_INSN (pat_end) != NULL_RTX)
-pat_end = NEXT_INSN (pat_end);
-
-  /* If the last end_rinsn is a jump, insert EXPR in front.  Similarly we need
- to take care of trapping instructions in presence of non-call exceptions.
-   */
-
-  if (JUMP_P (end_rinsn)
-  || (NONJUMP_INSN_P (end_rinsn)
- && (!single_succ_p (cfg_bb)
- || single_succ_edge (cfg_bb)->flags & EDGE_ABNORMAL)))
-{
-  /* FIXME: What if something in jump uses value set in new end_rinsn?  */
-  new_insn = emit_insn_before_noloc (pat, end_rinsn, cfg_bb);
-}
-
-  /* Likewise if the last end_rinsn is a call, as will happen in the presence
- of exception handling.  */
-  else if (CALL_P (end_rinsn)
-  && (!single_succ_p (cfg_bb)
-  || single_succ_edge (cfg_bb)->flags & EDGE_ABNORMAL))
-{
-  /* Keeping in mind targets with small register classes and parameters
-in registers, we search backward and place the instructions before
-the first parameter is loaded.  Do this for everyone for consistency
-and a presumption that we'll get better code elsewhere as well.  */
-
-  /* Since different machines initia

Re: Re: [PATCH] GCSE: Export add_label_notes as global function

2023-07-10 Thread juzhe.zh...@rivai.ai
Hi, Richard.

I found out I only need to export 'insert_insn_end_basic_block' for
global use by the RISC-V port (the current riscv-vsetvl.cc and the
future riscv.cc).

Does it look more reasonable?

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-07-10 15:25
To: Ju-Zhe Zhong
CC: gcc-patches; jeffreyalaw
Subject: Re: [PATCH] GCSE: Export add_label_notes as global function
On Mon, 10 Jul 2023, juzhe.zh...@rivai.ai wrote:
 
> From: Ju-Zhe Zhong 
> 
> Since 'add_lable_notes' is a generic helper function which is used by 
> riscv-vsetvl.cc
> in RISC-V port backend. And it's also will be used by riscv.cc too by the 
> following patches.
> Export it as global helper function.
 
I know nothing about this code but grepping shows me the existing
rebuild_jump_labels () API which also properly resets LABEL_NUSES
before incrementing it.  I don't think exporting add_label_notes ()
as-is is good because it at least will wreck those counts.
GCSE uses this function to add the notes for a specific instruction
only, so if we want to export such API the name of the function
should imply it works on a single insn.
 
Richard.
 

Re: Re: [PATCH] GCSE: Export add_label_notes as global function

2023-07-10 Thread juzhe.zh...@rivai.ai
Sorry, I forgot to add the patch:
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/623960.html 




juzhe.zh...@rivai.ai
 
From: juzhe.zh...@rivai.ai
Date: 2023-07-10 15:58
To: rguenther
CC: gcc-patches; jeffreyalaw
Subject: Re: Re: [PATCH] GCSE: Export add_label_notes as global function
Hi, Richard.

I found out I only need to export 'insert_insn_end_basic_block' for
global use by the RISC-V port (the current riscv-vsetvl.cc and the
future riscv.cc).

Does it look more reasonable?

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-07-10 15:25
To: Ju-Zhe Zhong
CC: gcc-patches; jeffreyalaw
Subject: Re: [PATCH] GCSE: Export add_label_notes as global function
On Mon, 10 Jul 2023, juzhe.zh...@rivai.ai wrote:
 
> From: Ju-Zhe Zhong 
> 
> Since 'add_lable_notes' is a generic helper function which is used by 
> riscv-vsetvl.cc
> in RISC-V port backend. And it's also will be used by riscv.cc too by the 
> following patches.
> Export it as global helper function.
 
I know nothing about this code but grepping shows me the existing
rebuild_jump_labels () API which also properly resets LABEL_NUSES
before incrementing it.  I don't think exporting add_label_notes ()
as-is is good because it at least will wreck those counts.
GCSE uses this function to add the notes for a specific instruction
only, so if we want to export such API the name of the function
should imply it works on a single insn.
 
Richard.
 

Re: Re: [PATCH] GCSE: Export add_label_notes as global function

2023-07-10 Thread Richard Biener via Gcc-patches
On Mon, 10 Jul 2023, juzhe.zh...@rivai.ai wrote:

> Hi, Richard.
> 
> I found out I only need to export 'insert_insn_end_basic_block' for
> global use by the RISC-V port (current riscv-vsetvl.cc and future riscv.cc).
> 
> Does it look more reasonable?

Yes, it looks more reasonable - I'll leave review to somebody knowing
the code - it would be nice to better document the API rather than
saying that it's used by gcse and code hoisting ...

> Thanks.
> 
> 
> juzhe.zh...@rivai.ai
>  
> From: Richard Biener
> Date: 2023-07-10 15:25
> To: Ju-Zhe Zhong
> CC: gcc-patches; jeffreyalaw
> Subject: Re: [PATCH] GCSE: Export add_label_notes as global function
> On Mon, 10 Jul 2023, juzhe.zh...@rivai.ai wrote:
>  
> > From: Ju-Zhe Zhong 
> > 
> > Since 'add_label_notes' is a generic helper function which is used by
> > riscv-vsetvl.cc in the RISC-V port backend, and it will also be used by
> > riscv.cc in the following patches, export it as a global helper function.
>  
> I know nothing about this code but grepping shows me the existing
> rebuild_jump_labels () API which also properly resets LABEL_NUSES
> before incrementing it.  I don't think exporting add_label_notes ()
> as-is is good because it at least will wreck those counts.
> GCSE uses this function to add the notes for a specific instruction
> only, so if we want to export such API the name of the function
> should imply it works on a single insn.
>  
> Richard.
>  
> > gcc/ChangeLog:
> > 
> > * config/riscv/riscv-vsetvl.cc (add_label_notes): Remove it.
> > * gcse.cc (add_label_notes): Export it as global.
> > (one_pre_gcse_pass): Ditto.
> > * gcse.h (add_label_notes): Ditto.
> > 
> > ---
> >  gcc/config/riscv/riscv-vsetvl.cc | 48 +---
> >  gcc/gcse.cc  |  3 +-
> >  gcc/gcse.h   |  1 +
> >  3 files changed, 3 insertions(+), 49 deletions(-)
> > 
> > diff --git a/gcc/config/riscv/riscv-vsetvl.cc 
> > b/gcc/config/riscv/riscv-vsetvl.cc
> > index ab47901e23f..038ba22362e 100644
> > --- a/gcc/config/riscv/riscv-vsetvl.cc
> > +++ b/gcc/config/riscv/riscv-vsetvl.cc
> > @@ -98,6 +98,7 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "lcm.h"
> >  #include "predict.h"
> >  #include "profile-count.h"
> > +#include "gcse.h"
> >  #include "riscv-vsetvl.h"
> >  
> >  using namespace rtl_ssa;
> > @@ -763,53 +764,6 @@ insert_vsetvl (enum emit_type emit_type, rtx_insn 
> > *rinsn,
> >return VSETVL_DISCARD_RESULT;
> >  }
> >  
> > -/* If X contains any LABEL_REF's, add REG_LABEL_OPERAND notes for them
> > -   to INSN.  If such notes are added to an insn which references a
> > -   CODE_LABEL, the LABEL_NUSES count is incremented.  We have to add
> > -   that note, because the following loop optimization pass requires
> > -   them.  */
> > -
> > -/* ??? If there was a jump optimization pass after gcse and before loop,
> > -   then we would not need to do this here, because jump would add the
> > -   necessary REG_LABEL_OPERAND and REG_LABEL_TARGET notes.  */
> > -
> > -static void
> > -add_label_notes (rtx x, rtx_insn *rinsn)
> > -{
> > -  enum rtx_code code = GET_CODE (x);
> > -  int i, j;
> > -  const char *fmt;
> > -
> > -  if (code == LABEL_REF && !LABEL_REF_NONLOCAL_P (x))
> > -{
> > -  /* This code used to ignore labels that referred to dispatch tables 
> > to
> > - avoid flow generating (slightly) worse code.
> > -
> > - We no longer ignore such label references (see LABEL_REF handling in
> > - mark_jump_label for additional information).  */
> > -
> > -  /* There's no reason for current users to emit jump-insns with
> > - such a LABEL_REF, so we don't have to handle REG_LABEL_TARGET
> > - notes.  */
> > -  gcc_assert (!JUMP_P (rinsn));
> > -  add_reg_note (rinsn, REG_LABEL_OPERAND, label_ref_label (x));
> > -
> > -  if (LABEL_P (label_ref_label (x)))
> > - LABEL_NUSES (label_ref_label (x))++;
> > -
> > -  return;
> > -}
> > -
> > -  for (i = GET_RTX_LENGTH (code) - 1, fmt = GET_RTX_FORMAT (code); i >= 0; 
> > i--)
> > -{
> > -  if (fmt[i] == 'e')
> > - add_label_notes (XEXP (x, i), rinsn);
> > -  else if (fmt[i] == 'E')
> > - for (j = XVECLEN (x, i) - 1; j >= 0; j--)
> > -   add_label_notes (XVECEXP (x, i, j), rinsn);
> > -}
> > -}
> > -
> >  /* Add EXPR to the end of basic block BB.
> >  
> > This is used by both the PRE and code hoisting.  */
> > diff --git a/gcc/gcse.cc b/gcc/gcse.cc
> > index 72832736572..5627fbf127a 100644
> > --- a/gcc/gcse.cc
> > +++ b/gcc/gcse.cc
> > @@ -483,7 +483,6 @@ static void pre_insert_copies (void);
> >  static int pre_delete (void);
> >  static int pre_gcse (struct edge_list *);
> >  static int one_pre_gcse_pass (void);
> > -static void add_label_notes (rtx, rtx_insn *);
> >  static void alloc_code_hoist_mem (int, int);
> >  static void free_code_hoist_mem (void);
> >  static void compute_code_hoist_vbeinout (void);
> > @@ -2639,7 +2638,7 @@ one_pre_gcse_

[PATCH v2] GCSE: Export 'insert_insn_end_basic_block' as global function

2023-07-10 Thread juzhe . zhong
From: Ju-Zhe Zhong 

The VSETVL pass in the RISC-V port uses the common part of
'insert_insn_end_basic_block (struct gcse_expr *expr, basic_block bb)',
and the following patches will also use this helper function in riscv.cc.

So extract the common part of 'insert_insn_end_basic_block (struct
gcse_expr *expr, basic_block bb)' into a new function.  The new function
is also called 'insert_insn_end_basic_block (rtx_insn *pat, basic_block bb)',
but takes different arguments.  Call 'insert_insn_end_basic_block (rtx_insn
*pat, basic_block bb)' both from 'insert_insn_end_basic_block (struct
gcse_expr *expr, basic_block bb)' and from the VSETVL pass in the RISC-V
port.

Remove the now-redundant code from the VSETVL pass in the RISC-V port.

gcc/ChangeLog:

* config/riscv/riscv-vsetvl.cc (add_label_notes): Remove it.
(insert_insn_end_basic_block): Ditto.
(pass_vsetvl::commit_vsetvls): Adapt for new helper function.
* gcse.cc (insert_insn_end_basic_block):  Export as global function.
* gcse.h (insert_insn_end_basic_block): Ditto.

---
 gcc/config/riscv/riscv-vsetvl.cc | 128 +--
 gcc/gcse.cc  |  29 ---
 gcc/gcse.h   |   1 +
 3 files changed, 25 insertions(+), 133 deletions(-)

diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc
index ab47901e23f..586dc8e5379 100644
--- a/gcc/config/riscv/riscv-vsetvl.cc
+++ b/gcc/config/riscv/riscv-vsetvl.cc
@@ -98,6 +98,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "lcm.h"
 #include "predict.h"
 #include "profile-count.h"
+#include "gcse.h"
 #include "riscv-vsetvl.h"
 
 using namespace rtl_ssa;
@@ -763,127 +764,6 @@ insert_vsetvl (enum emit_type emit_type, rtx_insn *rinsn,
   return VSETVL_DISCARD_RESULT;
 }
 
-/* If X contains any LABEL_REF's, add REG_LABEL_OPERAND notes for them
-   to INSN.  If such notes are added to an insn which references a
-   CODE_LABEL, the LABEL_NUSES count is incremented.  We have to add
-   that note, because the following loop optimization pass requires
-   them.  */
-
-/* ??? If there was a jump optimization pass after gcse and before loop,
-   then we would not need to do this here, because jump would add the
-   necessary REG_LABEL_OPERAND and REG_LABEL_TARGET notes.  */
-
-static void
-add_label_notes (rtx x, rtx_insn *rinsn)
-{
-  enum rtx_code code = GET_CODE (x);
-  int i, j;
-  const char *fmt;
-
-  if (code == LABEL_REF && !LABEL_REF_NONLOCAL_P (x))
-{
-  /* This code used to ignore labels that referred to dispatch tables to
-avoid flow generating (slightly) worse code.
-
-We no longer ignore such label references (see LABEL_REF handling in
-mark_jump_label for additional information).  */
-
-  /* There's no reason for current users to emit jump-insns with
-such a LABEL_REF, so we don't have to handle REG_LABEL_TARGET
-notes.  */
-  gcc_assert (!JUMP_P (rinsn));
-  add_reg_note (rinsn, REG_LABEL_OPERAND, label_ref_label (x));
-
-  if (LABEL_P (label_ref_label (x)))
-   LABEL_NUSES (label_ref_label (x))++;
-
-  return;
-}
-
-  for (i = GET_RTX_LENGTH (code) - 1, fmt = GET_RTX_FORMAT (code); i >= 0; i--)
-{
-  if (fmt[i] == 'e')
-   add_label_notes (XEXP (x, i), rinsn);
-  else if (fmt[i] == 'E')
-   for (j = XVECLEN (x, i) - 1; j >= 0; j--)
- add_label_notes (XVECEXP (x, i, j), rinsn);
-}
-}
-
-/* Add EXPR to the end of basic block BB.
-
-   This is used by both the PRE and code hoisting.  */
-
-static void
-insert_insn_end_basic_block (rtx_insn *rinsn, basic_block cfg_bb)
-{
-  rtx_insn *end_rinsn = BB_END (cfg_bb);
-  rtx_insn *new_insn;
-  rtx_insn *pat, *pat_end;
-
-  pat = rinsn;
-  gcc_assert (pat && INSN_P (pat));
-
-  pat_end = pat;
-  while (NEXT_INSN (pat_end) != NULL_RTX)
-pat_end = NEXT_INSN (pat_end);
-
-  /* If the last end_rinsn is a jump, insert EXPR in front.  Similarly we need
- to take care of trapping instructions in presence of non-call exceptions.
-   */
-
-  if (JUMP_P (end_rinsn)
-  || (NONJUMP_INSN_P (end_rinsn)
- && (!single_succ_p (cfg_bb)
- || single_succ_edge (cfg_bb)->flags & EDGE_ABNORMAL)))
-{
-  /* FIXME: What if something in jump uses value set in new end_rinsn?  */
-  new_insn = emit_insn_before_noloc (pat, end_rinsn, cfg_bb);
-}
-
-  /* Likewise if the last end_rinsn is a call, as will happen in the presence
- of exception handling.  */
-  else if (CALL_P (end_rinsn)
-  && (!single_succ_p (cfg_bb)
-  || single_succ_edge (cfg_bb)->flags & EDGE_ABNORMAL))
-{
-  /* Keeping in mind targets with small register classes and parameters
-in registers, we search backward and place the instructions before
-the first parameter is loaded.  Do this for everyone for consistency
-and a presumption that we'll get better code elsewhere as well.  */
-
-  /* Since different machines initi

Re: Re: [PATCH] GCSE: Export add_label_notes as global function

2023-07-10 Thread juzhe.zh...@rivai.ai
Thanks Richi.

I have sent V2 patch:
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/623964.html 
which addresses your comments:

-/* Add EXPR to the end of basic block BB.
-
-   This is used by both the PRE and code hoisting.  */
+/* Return the INSN added at the end of basic block BB, with the same
+   instruction pattern as PAT.  */
+rtx_insn *
+insert_insn_end_basic_block (rtx_insn *pat, basic_block bb)


Is it better ? 

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-07-10 16:01
To: juzhe.zh...@rivai.ai
CC: gcc-patches; jeffreyalaw
Subject: Re: Re: [PATCH] GCSE: Export add_label_notes as global function
On Mon, 10 Jul 2023, juzhe.zh...@rivai.ai wrote:
 
> Hi, Richard.
> 
> I found out I only need to export 'insert_insn_end_basic_block' for
> global use by the RISC-V port (current riscv-vsetvl.cc and future riscv.cc).
> 
> Does it look more reasonable?
 
Yes, it looks more reasonable - I'll leave review to somebody knowing
the code - it would be nice to better document the API rather than
saying that it's used by gcse and code hoisting ...
 
> Thanks.
> 
> 
> juzhe.zh...@rivai.ai
>  
> From: Richard Biener
> Date: 2023-07-10 15:25
> To: Ju-Zhe Zhong
> CC: gcc-patches; jeffreyalaw
> Subject: Re: [PATCH] GCSE: Export add_label_notes as global function
> On Mon, 10 Jul 2023, juzhe.zh...@rivai.ai wrote:
>  
> > From: Ju-Zhe Zhong 
> > 
> > Since 'add_label_notes' is a generic helper function which is used by
> > riscv-vsetvl.cc in the RISC-V port backend, and it will also be used by
> > riscv.cc in the following patches, export it as a global helper function.
>  
> I know nothing about this code but grepping shows me the existing
> rebuild_jump_labels () API which also properly resets LABEL_NUSES
> before incrementing it.  I don't think exporting add_label_notes ()
> as-is is good because it at least will wreck those counts.
> GCSE uses this function to add the notes for a specific instruction
> only, so if we want to export such API the name of the function
> should imply it works on a single insn.
>  
> Richard.
>  
> > gcc/ChangeLog:
> > 
> > * config/riscv/riscv-vsetvl.cc (add_label_notes): Remove it.
> > * gcse.cc (add_label_notes): Export it as global.
> > (one_pre_gcse_pass): Ditto.
> > * gcse.h (add_label_notes): Ditto.
> > 
> > ---
> >  gcc/config/riscv/riscv-vsetvl.cc | 48 +---
> >  gcc/gcse.cc  |  3 +-
> >  gcc/gcse.h   |  1 +
> >  3 files changed, 3 insertions(+), 49 deletions(-)
> > 
> > diff --git a/gcc/config/riscv/riscv-vsetvl.cc 
> > b/gcc/config/riscv/riscv-vsetvl.cc
> > index ab47901e23f..038ba22362e 100644
> > --- a/gcc/config/riscv/riscv-vsetvl.cc
> > +++ b/gcc/config/riscv/riscv-vsetvl.cc
> > @@ -98,6 +98,7 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "lcm.h"
> >  #include "predict.h"
> >  #include "profile-count.h"
> > +#include "gcse.h"
> >  #include "riscv-vsetvl.h"
> >  
> >  using namespace rtl_ssa;
> > @@ -763,53 +764,6 @@ insert_vsetvl (enum emit_type emit_type, rtx_insn 
> > *rinsn,
> >return VSETVL_DISCARD_RESULT;
> >  }
> >  
> > -/* If X contains any LABEL_REF's, add REG_LABEL_OPERAND notes for them
> > -   to INSN.  If such notes are added to an insn which references a
> > -   CODE_LABEL, the LABEL_NUSES count is incremented.  We have to add
> > -   that note, because the following loop optimization pass requires
> > -   them.  */
> > -
> > -/* ??? If there was a jump optimization pass after gcse and before loop,
> > -   then we would not need to do this here, because jump would add the
> > -   necessary REG_LABEL_OPERAND and REG_LABEL_TARGET notes.  */
> > -
> > -static void
> > -add_label_notes (rtx x, rtx_insn *rinsn)
> > -{
> > -  enum rtx_code code = GET_CODE (x);
> > -  int i, j;
> > -  const char *fmt;
> > -
> > -  if (code == LABEL_REF && !LABEL_REF_NONLOCAL_P (x))
> > -{
> > -  /* This code used to ignore labels that referred to dispatch tables 
> > to
> > - avoid flow generating (slightly) worse code.
> > -
> > - We no longer ignore such label references (see LABEL_REF handling in
> > - mark_jump_label for additional information).  */
> > -
> > -  /* There's no reason for current users to emit jump-insns with
> > - such a LABEL_REF, so we don't have to handle REG_LABEL_TARGET
> > - notes.  */
> > -  gcc_assert (!JUMP_P (rinsn));
> > -  add_reg_note (rinsn, REG_LABEL_OPERAND, label_ref_label (x));
> > -
> > -  if (LABEL_P (label_ref_label (x)))
> > - LABEL_NUSES (label_ref_label (x))++;
> > -
> > -  return;
> > -}
> > -
> > -  for (i = GET_RTX_LENGTH (code) - 1, fmt = GET_RTX_FORMAT (code); i >= 0; 
> > i--)
> > -{
> > -  if (fmt[i] == 'e')
> > - add_label_notes (XEXP (x, i), rinsn);
> > -  else if (fmt[i] == 'E')
> > - for (j = XVECLEN (x, i) - 1; j >= 0; j--)
> > -   add_label_notes (XVECEXP (x, i, j), rinsn);
> > -}
> > -}
> > -
> >  /* Add EXPR

Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

2023-07-10 Thread Jan Hubicka via Gcc-patches
Hi,
over the weekend I found that the vectorizer is missing scale_loop_profile
for epilogues.  It already adjusts loop_info to set the max iterations, so
adding it was easy.  However, it now predicts the first loop to iterate at
most once (which is too much; I suppose it forgets to divide by the epilogue
unrolling factor) and the second never.
> 
> The -O2 cost model doesn't want to do epilogues:
> 
>   /* If using the "very cheap" model. reject cases in which we'd keep
>  a copy of the scalar code (even if we might be able to vectorize it).  
> */
>   if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
>   && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
>   || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>   || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> {
>   if (dump_enabled_p ())
> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>  "some scalar iterations would need to be 
> peeled\n");
>   return 0;
> }
> 
> it's because of the code size increase.

I know, however -O2 is not -Os, and here the tradeoff of
performance vs. code size seems a lot better than for other code-expanding
things we do at -O2 (such as unrolling 3 times).
I think we set the very cheap cost model very conservatively in order to
get -ftree-vectorize enabled with -O2, and there is some room for finding
the right balance.

I get:

jan@localhost:~> cat t.c
int a[99];
__attribute((noipa, weak))
void
test()
{
for (int i = 0 ; i < 99; i++)
a[i]++;
}
void
main()
{
for (int j = 0; j < 1000; j++)
test();
}
jan@localhost:~> gcc -O2 t.c -fno-unroll-loops ; time ./a.out

real0m0.529s
user0m0.528s
sys 0m0.000s

jan@localhost:~> gcc -O2 t.c ; time ./a.out

real0m0.427s
user0m0.426s
sys 0m0.000s
jan@localhost:~> gcc -O3 t.c ; time ./a.out

real0m0.136s
user0m0.135s
sys 0m0.000s
jan@localhost:~> clang -O2 t.c ; time ./a.out


real0m0.116s
user0m0.116s
sys 0m0.000s

Code size (of function test):
 gcc -O2 -fno-unroll-loops 17  bytes
 gcc -O2   29  bytes
 gcc -O3   50  bytes
 clang -O2 510 bytes

So unrolling is 70% code size growth for 23% speedup.
Vectorizing is 294% code size growth for 388% speedup.
Clang does 3000% code size growth for 456% speedup.
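For reference, the percentages above can be recovered from the sizes and
timings quoted earlier (note they appear to be the raw ratios times 100
against the -O2 -fno-unroll-loops baseline, except the first line, which
quotes growth over the baseline: 12/17 is roughly 70%).  A quick
sanity-check sketch:

```python
# Sanity-check the quoted ratios using the measurements above.
# Sizes (bytes) of function 'test' and runtimes (seconds) of ./a.out.
size = {"no-unroll": 17, "O2": 29, "O3": 50, "clang": 510}
time = {"no-unroll": 0.529, "O2": 0.427, "O3": 0.136, "clang": 0.116}

ratio = lambda new, base: round(new / base * 100)

# Vectorizing (-O3 vs. the -fno-unroll-loops baseline):
assert ratio(size["O3"], size["no-unroll"]) == 294      # "294% code size growth"
assert ratio(time["no-unroll"], time["O3"]) == 389      # "~388% speedup"
# Clang's full unrolling:
assert ratio(size["clang"], size["no-unroll"]) == 3000  # "3000% code size growth"
assert ratio(time["no-unroll"], time["clang"]) == 456   # "456% speedup"
```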
> 
> That's clearly much larger code.  On x86 we're also fighting with
> large instruction encodings here, in particular EVEX for AVX512 is
> "bad" here.  We hardly get more than two instructions decoded per
> cycle due to their size.

Agreed, I found it surprising that clang does that much complete unrolling
at -O2.  However, vectorizing and not unrolling here seems like it may be
a better default for -O2 than what we do currently...

Honza
> 
> Richard.


Re: [PATCH] simplify-rtx: Fix invalid simplification with paradoxical subregs [PR110206]

2023-07-10 Thread Richard Biener via Gcc-patches
On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
 wrote:
>
> As shown in the PR, simplify_gen_subreg call in simplify_replace_fn_rtx:
>
> (gdb) list
> 469   if (code == SUBREG)
> 470 {
> 471   op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
> old_rtx, fn, data);
> 472   if (op0 == SUBREG_REG (x))
> 473 return x;
> 474   op0 = simplify_gen_subreg (GET_MODE (x), op0,
> 475  GET_MODE (SUBREG_REG (x)),
> 476  SUBREG_BYTE (x));
> 477   return op0 ? op0 : x;
> 478 }
>
> simplifies with following arguments:
>
> (gdb) p debug_rtx (op0)
> (const_vector:V4QI [
> (const_int -52 [0xffcc]) repeated x4
> ])
> (gdb) p debug_rtx (x)
> (subreg:V16QI (reg:V4QI 98) 0)
>
> to:
>
> (gdb) p debug_rtx (op0)
> (const_vector:V16QI [
> (const_int -52 [0xffcc]) repeated x16
> ])
>
> This simplification is invalid, it is not possible to get V16QImode vector
> from V4QImode vector, even when all elements are duplicates.
>
> The simplification happens in simplify_context::simplify_subreg:
>
> (gdb) list
> 7558  if (VECTOR_MODE_P (outermode)
> 7559  && GET_MODE_INNER (outermode) == GET_MODE_INNER (innermode)
> 7560  && vec_duplicate_p (op, &elt))
> 7561return gen_vec_duplicate (outermode, elt);
>
> but the above simplification is valid only for non-paradoxical registers,
> where outermode <= innermode.  We should not assume that elements outside
> the original register are valid, let alone all duplicates.

Hmm, but looking at the audit trail the x86 backend expects them to be zero?
Isn't that wrong as well?

That is, I think putting any random value into the upper lanes when
constant folding
a paradoxical subreg sounds OK to me, no?

Of course we might choose to not do such constant propagation for
efficiency reasons - at least when the resulting CONST_* would require
a larger constant pool entry or a more costly construction.

Thanks,
Richard.

> PR target/110206
>
> gcc/ChangeLog:
>
> * simplify-rtx.cc (simplify_context::simplify_subreg):
> Avoid returning a vector with duplicated value
> outside the original register.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/torture/pr110206.c: New test.
>
> Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
>
> OK for master and release branches?
>
> Uros.


Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

2023-07-10 Thread Jan Hubicka via Gcc-patches
> On Fri, 7 Jul 2023, Jan Hubicka wrote:
> 
> > > 
> > > Looks good, but I wonder what we can do to at least make the
> > > multiple exit case behave reasonably?  The vectorizer keeps track
> > 
> > > of a "canonical" exit, would it be possible to pass in the main
> > > exit edge and use that instead of single_exit (), would other
> > > exits then behave somewhat reasonable or would we totally screw
> > > things up here?  That is, the "canonical" exit would be the
> > > counting exit while the other exits are on data driven conditions
> > > and thus wouldn't change probability when we reduce the number
> > > of iterations(?)
> > 
> > I can add canonical_exit parameter and make the function to direct flow
> > to it if possible.  However overall I think fixup depends on what
> > transformation led to the change.
> 
> I think the vectorizer knows there's a single counting IV and all
> other exits are dependent on data processed, so the scaling the
> vectorizer just changes the counting IV.  So I think it makes
> sense to pass that exit to the function in all cases.

It really seems to me that a vectorized loop is like N loops happening
in parallel, so the probabilities of the alternative exits grow as well.
But the canonical exit is the right thing for prologues - here we really
add extra conditions to the iteration-counting exit.
> 
> > Assuming that vectorizer did no prologues and apilogues and we
> > vectorized with factor N, then I think the update could be done more
> > specifically as follows.
> > 
> > We know that header block count dropped by 4. So we can start from that
> > and each time we reach basic block with exit edge, we know the original
> > count of the edge.  This count is unchanged, so one can rescale
> > probabilities out of that BB accordingly.  If loop has no inner loops,
> > we can just walk the body in RPO and propagate scales downwards and we
> > sould arrive to right result
> 
> That should work for alternate exits as well, no?
Yes, I think it could mostly work for acyclic bodies.  I ended up
implementing a special case of this for loop-ch in order to handle
loop-invariant conditionals correctly.  Will send the patch after some
cleanups.  (There seem to be more loop-invariant conditionals in real
code than I would have thought.)
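The RPO walk described in the quoted paragraph above can be sketched on a
toy CFG (my own illustration with made-up counts, not the loop-ch
implementation): the header count is scaled, exit edge counts are known
to be unchanged, and out-probabilities are rescaled where the two meet.

```python
# Toy model: header count dropped by 4x, the exit edge's taken count
# is unchanged, and scaled counts are propagated downward in RPO.
# Blocks: header -> body -> latch, with 'body' also holding an exit edge.
orig = {"header": 400, "body": 400}
orig_exit_edge = 100           # taken count of body's exit edge (unchanged)

scale = 0.25                   # header block count dropped by 4
new = {"header": orig["header"] * scale}

# Propagate the scale downward in RPO: body inherits the scaled count.
new["body"] = new["header"]

# At the block with the exit edge the edge count must stay at 100, so
# its out-probability is rescaled against the new block count.
old_exit_prob = orig_exit_edge / orig["body"]   # 0.25
new_exit_prob = orig_exit_edge / new["body"]    # 1.0
assert new_exit_prob == old_exit_prob / scale
```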

Tampering only with loop exit probabilities is not always enough.
If you have:
  while (1)
if (test1)
  {
if (test2)
  break;
  }
increasing the count of the exit may require increasing the probability of
the outer conditional.  Do we support this in vectorization at all, and if
so, do we know something here?
For example, if test1 is true in one of the iterations packed together,
the probability of triggering it also increases by the vectorization
factor.
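A back-of-the-envelope model of that effect (my own sketch, not anything
the vectorizer computes): if a data-dependent test fires with probability
p per scalar iteration, then after vectorizing by factor N the combined
test fires when any of the N packed iterations would have, i.e. with
probability 1 - (1-p)^N, which is roughly N*p for small p.

```python
# Probability that a data-dependent exit condition (like test1 above)
# fires in a vector iteration that packs n scalar iterations together.
def vectorized_exit_prob(p: float, n: int) -> float:
    # Fires if it would have fired in any of the n packed lanes.
    return 1.0 - (1.0 - p) ** n

p, n = 0.01, 4            # 1% per scalar iteration, vectorization factor 4
pv = vectorized_exit_prob(p, n)
assert abs(pv - 0.03940399) < 1e-8   # close to n * p = 0.04 for small p
assert pv < n * p                    # n * p slightly overestimates
```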

We run into this in peeling, i.e. when we prove that test1 will trigger
undefined behaviour after one or two iterations but the original
estimated profile believes in a higher iteration count.  I added a special
case for this yesterday to avoid turning if (test2) to 100% in this case,
as that triggers strange codegen in some of the Fortran testcases.

We also can have
  while (1)
while (test1)
  {
if (test2)
  break;
  }
This is harder because changing the probability of test2 affects the
number of iterations of the inner loop, so I am giving up on this.
I think currently it happens mostly with unlooping.
> 
> > I originally added the bound parameter to handle prologues/epilogues
> > which gets new artificial bound.  In prologue I think you are right that
> > the flow will be probably directed to the conditional counting
> > iterations.
> 
> I suppose we'd need to scale both main and epilogue together since
> the epilogue "steals" from the main loop counts.  Likewise if there's
> a skip edge around the vector loop.  I think currently we simply
> set the edge probability of those skip conds rather than basing
> this off the niter values they work on.  Aka if (niter < VF) goto
> epilogue; do {} while (niter / VF); epilogue: do {} while (niter);
> 
> There's also the cost model which might require niter > VF to enter
> the main loop body.

I think I mostly understand this, since we were playing with it with Ondra's
histograms (which can be used to get some of the unknowns in the
transformation right).  The unknowns (how many times we end up jumping to
the epilogue, for instance) probably can't be reasonably well guessed if we
do not know the loop histogram, which currently we know only if we prove
that the loop has a constant number of iterations.  So I am trying to get
at least this case right first.

The theoretically correct approach would be to first determine the entry
counts of the prologue and epilogue, then produce what we believe to be the
correct profile of those and subtract it from the main loop profile, also
updating probabilities in the basic blocks where we made nontrivial changes
while updating prologues/epilogues.  Finally, scale down the main loop
profile and increase the exit probabilities.
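As a toy illustration of that bookkeeping (my own sketch with invented
counts, not GCC's profile_count machinery): carve the epilogue's share
out of the total scalar iteration count first, then the main loop's
scaled-down count and its new exit probability follow.

```python
# Toy model of the profile split for a loop vectorized by factor VF.
VF = 4
entries = 100          # times the loop is entered
avg_niter = 99         # scalar iterations per entry (cf. the a[99] example)

total_scalar = entries * avg_niter            # 9900 scalar iterations
vector_iters = entries * (avg_niter // VF)    # 2400 vector-body iterations
epilogue_iters = entries * (avg_niter % VF)   # 300 leftover scalar iterations
assert vector_iters * VF + epilogue_iters == total_scalar

# The counting exit of the vector body is still taken once per entry, so
# its probability rises from ~1/avg_niter to ~VF/avg_niter once the header
# count is scaled down.
exit_prob_scalar = entries / total_scalar     # 1/99
exit_prob_vector = entries / vector_iters     # 1/24
assert abs(exit_prob_vector / exit_prob_scalar
           - avg_niter / (avg_niter // VF)) < 1e-9
```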

Honza


Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

2023-07-10 Thread Richard Biener via Gcc-patches
On Mon, 10 Jul 2023, Jan Hubicka wrote:

> Hi,
> over the weekend I found that the vectorizer is missing scale_loop_profile
> for epilogues.  It already adjusts loop_info to set the max iterations, so
> adding it was easy.  However, it now predicts the first loop to iterate at
> most once (which is too much; I suppose it forgets to divide by the epilogue
> unrolling factor) and the second never.
> > 
> > The -O2 cost model doesn't want to do epilogues:
> > 
> >   /* If using the "very cheap" model. reject cases in which we'd keep
> >  a copy of the scalar code (even if we might be able to vectorize it).  
> > */
> >   if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
> >   && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> >   || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> >   || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> > {
> >   if (dump_enabled_p ())
> > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >  "some scalar iterations would need to be 
> > peeled\n");
> >   return 0;
> > }
> > 
> > it's because of the code size increase.
> 
> I know, however -O2 is not -Os, and here the tradeoff of
> performance vs. code size seems a lot better than for other code-expanding
> things we do at -O2 (such as unrolling 3 times).
> I think we set the very cheap cost model very conservatively in order to
> get -ftree-vectorize enabled with -O2, and there is some room for finding
> the right balance.
> 
> I get:
> 
> jan@localhost:~> cat t.c
> int a[99];
> __attribute((noipa, weak))
> void
> test()
> {
> for (int i = 0 ; i < 99; i++)
> a[i]++;
> }
> void
> main()
> {
> for (int j = 0; j < 1000; j++)
> test();
> }
> jan@localhost:~> gcc -O2 t.c -fno-unroll-loops ; time ./a.out
> 
> real0m0.529s
> user0m0.528s
> sys 0m0.000s
> 
> jan@localhost:~> gcc -O2 t.c ; time ./a.out
> 
> real0m0.427s
> user0m0.426s
> sys 0m0.000s
> jan@localhost:~> gcc -O3 t.c ; time ./a.out
> 
> real0m0.136s
> user0m0.135s
> sys 0m0.000s
> jan@localhost:~> clang -O2 t.c ; time ./a.out
> 
> 
> real0m0.116s
> user0m0.116s
> sys 0m0.000s
> 
> Code size (of function test):
>  gcc -O2 -fno-unroll-loops 17  bytes
>  gcc -O2   29  bytes
>  gcc -O3   50  bytes
>  clang -O2 510 bytes
> 
> So unrolling is 70% code size growth for 23% speedup.
> Vectorizing is 294% code size growth for 388% speedup.
> Clang does 3000% code size growth for 456% speedup.
> > 
> > That's clearly much larger code.  On x86 we're also fighting with
> > large instruction encodings here, in particular EVEX for AVX512 is
> > "bad" here.  We hardly get more than two instructions decoded per
> > cycle due to their size.
> 
> Agreed, I found it surprising that clang does that much complete unrolling
> at -O2.  However, vectorizing and not unrolling here seems like it may be
> a better default for -O2 than what we do currently...

I was also playing with AVX512 fully masked loops here, which avoid
the epilogue, but due to the instruction encoding size that doesn't
usually win.  I agree that size isn't everything, at least for -O2.

Richard.


Re: [PATCH] simplify-rtx: Fix invalid simplification with paradoxical subregs [PR110206]

2023-07-10 Thread Uros Bizjak via Gcc-patches
On Mon, Jul 10, 2023 at 11:17 AM Richard Biener
 wrote:
>
> On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
>  wrote:
> >
> > As shown in the PR, simplify_gen_subreg call in simplify_replace_fn_rtx:
> >
> > (gdb) list
> > 469   if (code == SUBREG)
> > 470 {
> > 471   op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
> > old_rtx, fn, data);
> > 472   if (op0 == SUBREG_REG (x))
> > 473 return x;
> > 474   op0 = simplify_gen_subreg (GET_MODE (x), op0,
> > 475  GET_MODE (SUBREG_REG (x)),
> > 476  SUBREG_BYTE (x));
> > 477   return op0 ? op0 : x;
> > 478 }
> >
> > simplifies with following arguments:
> >
> > (gdb) p debug_rtx (op0)
> > (const_vector:V4QI [
> > (const_int -52 [0xffcc]) repeated x4
> > ])
> > (gdb) p debug_rtx (x)
> > (subreg:V16QI (reg:V4QI 98) 0)
> >
> > to:
> >
> > (gdb) p debug_rtx (op0)
> > (const_vector:V16QI [
> > (const_int -52 [0xffcc]) repeated x16
> > ])
> >
> > This simplification is invalid, it is not possible to get V16QImode vector
> > from V4QImode vector, even when all elements are duplicates.
> >
> > The simplification happens in simplify_context::simplify_subreg:
> >
> > (gdb) list
> > 7558  if (VECTOR_MODE_P (outermode)
> > 7559  && GET_MODE_INNER (outermode) == GET_MODE_INNER 
> > (innermode)
> > 7560  && vec_duplicate_p (op, &elt))
> > 7561return gen_vec_duplicate (outermode, elt);
> >
> > but the above simplification is valid only for non-paradoxical registers,
> > where outermode <= innermode.  We should not assume that elements outside
> > the original register are valid, let alone all duplicates.
>
> Hmm, but looking at the audit trail the x86 backend expects them to be zero?
> Isn't that wrong as well?

If you mean Comment #10, it is just an observation that
simplify_replace_rtx simplifies arguments from Comment #9 to:

(gdb) p debug_rtx (src)
(const_vector:V8HI [
(const_int 204 [0xcc]) repeated x4
(const_int 0 [0]) repeated x4
])

instead of:

(gdb) p debug_rtx (src)
(const_vector:V8HI [
(const_int 204 [0xcc]) repeated x8
])

which is in line with the statement below.
>
> That is, I think putting any random value into the upper lanes when
> constant folding
> a paradoxical subreg sounds OK to me, no?

The compiler is putting zero there as can be seen from the above new RTX.
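Modeled on those RTX dumps, the difference between the two foldings can be
shown with plain lists (a toy model of the vector lanes, not RTL; the
zero-fill for the upper lanes mirrors what the dumps above show the
compiler producing):

```python
# Toy model of folding (subreg:V16QI (reg:V4QI ...)) where the V4QI
# source is a duplicated constant; values mirror the RTX dumps above.
inner = [-52] * 4            # (const_vector:V4QI [(const_int -52) repeated x4])

# Invalid fold: pretending the duplication extends across all 16 lanes.
invalid = [-52] * 16

# What the dumps show instead: the 4 lanes covered by the V4QI source,
# with the upper lanes filled with zero rather than more duplicates.
folded = inner + [0] * 12

assert folded[:4] == inner        # low lanes come from the source register
assert folded[4:] == [0] * 12     # upper lanes are not duplicates
assert folded != invalid
```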

> Of course we might choose to not do such constant propagation for
> efficiency reasons - at least when the resulting CONST_* would require
> a larger constant pool entry or a more costly construction.

That would be a follow-up improvement; this patch tries to
fix a specific simplification of simplify_replace_rtx that is
invalid universally.

Uros.


Re: [PATCH 4/19]middle-end: Fix scale_loop_frequencies segfault on multiple-exits

2023-07-10 Thread Richard Biener via Gcc-patches
On Mon, 10 Jul 2023, Jan Hubicka wrote:

> > On Fri, 7 Jul 2023, Jan Hubicka wrote:
> > 
> > > > 
> > > > Looks good, but I wonder what we can do to at least make the
> > > > multiple exit case behave reasonably?  The vectorizer keeps track
> > > 
> > > > of a "canonical" exit, would it be possible to pass in the main
> > > > exit edge and use that instead of single_exit (), would other
> > > > exits then behave somewhat reasonable or would we totally screw
> > > > things up here?  That is, the "canonical" exit would be the
> > > > counting exit while the other exits are on data driven conditions
> > > > and thus wouldn't change probability when we reduce the number
> > > > of iterations(?)
> > > 
> > > I can add canonical_exit parameter and make the function to direct flow
> > > to it if possible.  However overall I think fixup depends on what
> > > transformation led to the change.
> > 
> > I think the vectorizer knows there's a single counting IV and all
> > other exits are dependent on data processed, so the scaling the
> > vectorizer just changes the counting IV.  So I think it makes
> > sense to pass that exit to the function in all cases.
> 
> It really seems to me that a vectorized loop is like N loops happening 
> in parallel, so the probabilities of the alternative exits grow as well.
> But the canonical exit is the right thing to do for prologues - here we
> really add extra conditions to the iteration counting exit.
> > 
> > > Assuming that the vectorizer did no prologues and epilogues and we
> > > vectorized with factor N, then I think the update could be done more
> > > specifically as follows.
> > > 
> > > We know that header block count dropped by 4. So we can start from that
> > > and each time we reach basic block with exit edge, we know the original
> > > count of the edge.  This count is unchanged, so one can rescale
> > > probabilities out of that BB accordingly.  If loop has no inner loops,
> > > we can just walk the body in RPO and propagate scales downwards and we
> > > should arrive at the right result.
> > 
> > That should work for alternate exits as well, no?
> Yes, I think it could mostly work for acyclic bodies. I ended up
> implementing a special case of this for loop-ch in order to handle
> correctly loop invariant conditionals.  Will send patch after some
> cleanups. (There seem to be more loop invariant conditionals in real
> code than I would have thought.)
> 
> Tampering only with loop exit probabilities is not always enough.
> If you have:
>   while (1)
> if (test1)
>   {
> if (test2)
> break;
>   }
> increasing the count of the exit may require increasing the probability of
> the outer conditional.  Do we support this in vectorization at all, and if
> so, do we know something here?

Tamar would need to answer this but without early break vectorization
the if-conversion pass will flatten everything and I think even early
breaks will be in the end a non-nested sequence of BBs with
exit conds at the end (or a loopback branch).

Note the (scalar) epilogue is copied from the original scalar loop
body so it doesn't see any if-conversion.

> For example, test1 triggers if it is true in any one of the iterations
> packed together, so its probability also increases with the vectorization
> factor.  
> 
> We run into this in peeling i.e. when we prove that test1 will trigger
> undefined behaviour after one or two iterations but the original
> estimated profile believes in a higher iteration count.  I added special
> case for this yesterday to avoid turning if (test2) to 100% in this case
> as that triggers strange codegen in some of fortran testcases.
> 
> We also can have
>   while (1)
> while (test1)
>   {
> if (test2)
> break;
>   }
> Which is harder because changing the probability of test2 affects the
> number of iterations of the inner loop.  So I am giving up on this.
> I think currently it happens mostly with unlooping.

What I saw most wrecking the profile is when passes turn
if (cond) into if (0/1) leaving the CFG adjustment to CFG cleanup
which then simply deletes one of the outgoing edges without doing
anything to the (guessed) profile.

> > > I originally added the bound parameter to handle prologues/epilogues
> > > which gets new artificial bound.  In prologue I think you are right that
> > > the flow will be probably directed to the conditional counting
> > > iterations.
> > 
> > I suppose we'd need to scale both main and epilogue together since
> > the epilogue "steals" from the main loop counts.  Likewise if there's
> > a skip edge around the vector loop.  I think currently we simply
> > set the edge probability of those skip conds rather than basing
> > this off the niter values they work on.  Aka if (niter < VF) goto
> > epilogue; do {} while (niter / VF); epilogue: do {} while (niter);
> > 
> > There's also the cost model which might require niter > VF to enter
> > the main loop body.
> 
> I think I mostly understand this since we were playing with it with Ondra

Re: [PATCH] simplify-rtx: Fix invalid simplification with paradoxical subregs [PR110206]

2023-07-10 Thread Richard Biener via Gcc-patches
On Mon, Jul 10, 2023 at 11:26 AM Uros Bizjak  wrote:
>
> On Mon, Jul 10, 2023 at 11:17 AM Richard Biener
>  wrote:
> >
> > On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
> >  wrote:
> > >
> > > As shown in the PR, simplify_gen_subreg call in simplify_replace_fn_rtx:
> > >
> > > (gdb) list
> > > 469   if (code == SUBREG)
> > > 470 {
> > > 471   op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
> > > old_rtx, fn, data);
> > > 472   if (op0 == SUBREG_REG (x))
> > > 473 return x;
> > > 474   op0 = simplify_gen_subreg (GET_MODE (x), op0,
> > > 475  GET_MODE (SUBREG_REG (x)),
> > > 476  SUBREG_BYTE (x));
> > > 477   return op0 ? op0 : x;
> > > 478 }
> > >
> > > simplifies with following arguments:
> > >
> > > (gdb) p debug_rtx (op0)
> > > (const_vector:V4QI [
> > > (const_int -52 [0xffcc]) repeated x4
> > > ])
> > > (gdb) p debug_rtx (x)
> > > (subreg:V16QI (reg:V4QI 98) 0)
> > >
> > > to:
> > >
> > > (gdb) p debug_rtx (op0)
> > > (const_vector:V16QI [
> > > (const_int -52 [0xffcc]) repeated x16
> > > ])
> > >
> > > This simplification is invalid, it is not possible to get V16QImode vector
> > > from V4QImode vector, even when all elements are duplicates.

^^^

I think this simplification is valid.  A simplification to

(const_vector:V16QI [
 (const_int -52 [0xffcc]) repeated x4
 (const_int 0 [0]) repeated x12
 ])

would be valid as well.

> > > The simplification happens in simplify_context::simplify_subreg:
> > >
> > > (gdb) list
> > > 7558  if (VECTOR_MODE_P (outermode)
> > > 7559  && GET_MODE_INNER (outermode) == GET_MODE_INNER 
> > > (innermode)
> > > 7560  && vec_duplicate_p (op, &elt))
> > > 7561return gen_vec_duplicate (outermode, elt);
> > >
> > > but the above simplification is valid only for non-paradoxical registers,
> > > where outermode <= innermode.  We should not assume that elements outside
> > > the original register are valid, let alone all duplicates.
> >
> > Hmm, but looking at the audit trail the x86 backend expects them to be zero?
> > Isn't that wrong as well?
>
> If you mean Comment #10, it is just an observation that
> simplify_replace_rtx simplifies arguments from Comment #9 to:
>
> (gdb) p debug_rtx (src)
> (const_vector:V8HI [
> (const_int 204 [0xcc]) repeated x4
> (const_int 0 [0]) repeated x4
> ])
>
> instead of:
>
> (gdb) p debug_rtx (src)
> (const_vector:V8HI [
> (const_int 204 [0xcc]) repeated x8
> ])
>
> which is in line with the statement below.
> >
> > That is, I think putting any random value into the upper lanes when
> > constant folding
> > a paradoxical subreg sounds OK to me, no?
>
> The compiler is putting zero there as can be seen from the above new RTX.
>
> > Of course we might choose to not do such constant propagation for
> > efficiency reason - at least
> > when the resulting CONST_* would require a larger constant pool entry
> > or more costly
> > construction.
>
> This is probably a follow-up improvement; this patch fixes a specific
> simplification in simplify_replace_rtx that is invalid universally.

How so?  What specifies the values of the paradoxical subreg for the
bytes not covered by the subreg operand?

>
> Uros.


Re: [PATCH v2] x86: Properly find the maximum stack slot alignment

2023-07-10 Thread Richard Biener via Gcc-patches
On Fri, Jul 7, 2023 at 5:14 PM H.J. Lu via Gcc-patches
 wrote:
>
> Don't assume that stack slots can only be accessed by stack or frame
> registers.  We first find all registers defined by stack or frame
> registers.  Then check memory accesses by such registers, including
> stack and frame registers.
>
> gcc/
>
> PR target/109780
> * config/i386/i386.cc (ix86_update_stack_alignment): New.
> (ix86_find_all_reg_use): Likewise.
> (ix86_find_max_used_stack_alignment): Also check memory accesses
> from registers defined by stack or frame registers.
>
> gcc/testsuite/
>
> PR target/109780
> * g++.target/i386/pr109780-1.C: New test.
> * gcc.target/i386/pr109780-1.c: Likewise.
> * gcc.target/i386/pr109780-2.c: Likewise.
> ---
>  gcc/config/i386/i386.cc| 120 +
>  gcc/testsuite/g++.target/i386/pr109780-1.C |  72 +
>  gcc/testsuite/gcc.target/i386/pr109780-1.c |  14 +++
>  gcc/testsuite/gcc.target/i386/pr109780-2.c |  21 
>  4 files changed, 206 insertions(+), 21 deletions(-)
>  create mode 100644 gcc/testsuite/g++.target/i386/pr109780-1.C
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr109780-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr109780-2.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index caca74d6dec..27f349b0ccb 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -8084,6 +8084,63 @@ output_probe_stack_range (rtx reg, rtx end)
>return "";
>  }
>
> +/* Update the maximum stack slot alignment from memory alignment in
> +   PAT.  */
> +
> +static void
> +ix86_update_stack_alignment (rtx, const_rtx pat, void *data)
> +{
> +  /* This insn may reference stack slot.  Update the maximum stack slot
> + alignment.  */
> +  subrtx_iterator::array_type array;
> +  FOR_EACH_SUBRTX (iter, array, pat, ALL)
> +if (MEM_P (*iter))
> +  {
> +   unsigned int alignment = MEM_ALIGN (*iter);
> +   unsigned int *stack_alignment
> + = (unsigned int *) data;
> +   if (alignment > *stack_alignment)
> + *stack_alignment = alignment;
> +   break;
> +  }
> +}
> +
> +/* Find all registers defined with REG.  */
> +
> +static void
> +ix86_find_all_reg_use (HARD_REG_SET &stack_slot_access, int reg)
> +{
> +  for (df_ref ref = DF_REG_USE_CHAIN (reg);
> +   ref != NULL;
> +   ref = DF_REF_NEXT_REG (ref))
> +{
> +  if (DF_REF_IS_ARTIFICIAL (ref))
> +   continue;
> +
> +  rtx_insn *insn = DF_REF_INSN (ref);
> +  if (!NONDEBUG_INSN_P (insn))
> +   continue;
> +
> +  rtx set = single_set (insn);
> +  if (!set)
> +   continue;
> +
> +  rtx src = SET_SRC (set);
> +  if (MEM_P (src))
> +   continue;
> +
> +  rtx dest = SET_DEST (set);
> +  if (!REG_P (dest))
> +   continue;
> +
> +  if (TEST_HARD_REG_BIT (stack_slot_access, REGNO (dest)))
> +   continue;
> +
> +  /* Add this register to stack_slot_access.  */
> +  add_to_hard_reg_set (&stack_slot_access, Pmode, REGNO (dest));
> +}
> +}
> +
>  /* Set stack_frame_required to false if stack frame isn't required.
> Update STACK_ALIGNMENT to the largest alignment, in bits, of stack
> slot used if stack frame is required and CHECK_STACK_SLOT is true.  */
> @@ -8102,10 +8159,6 @@ ix86_find_max_used_stack_alignment (unsigned int 
> &stack_alignment,
>add_to_hard_reg_set (&set_up_by_prologue, Pmode,
>HARD_FRAME_POINTER_REGNUM);
>
> -  /* The preferred stack alignment is the minimum stack alignment.  */
> -  if (stack_alignment > crtl->preferred_stack_boundary)
> -stack_alignment = crtl->preferred_stack_boundary;
> -
>bool require_stack_frame = false;
>
>FOR_EACH_BB_FN (bb, cfun)
> @@ -8117,27 +8170,52 @@ ix86_find_max_used_stack_alignment (unsigned int 
> &stack_alignment,
>set_up_by_prologue))
>   {
> require_stack_frame = true;
> -
> -   if (check_stack_slot)
> - {
> -   /* Find the maximum stack alignment.  */
> -   subrtx_iterator::array_type array;
> -   FOR_EACH_SUBRTX (iter, array, PATTERN (insn), ALL)
> - if (MEM_P (*iter)
> - && (reg_mentioned_p (stack_pointer_rtx,
> -  *iter)
> - || reg_mentioned_p (frame_pointer_rtx,
> - *iter)))
> -   {
> - unsigned int alignment = MEM_ALIGN (*iter);
> - if (alignment > stack_alignment)
> -   stack_alignment = alignment;
> -   }
> - }
> +   break;
>   }
>  }
>
>cfun->machine->stack_frame_required = require_stack_frame;
> +
> +  /* Stop if we don't need to check stack slot.  */
> +  if (!check_stack_slot)
>

Re: [PATCH v2] vect: Fix vectorized BIT_FIELD_REF for signed bit-fields [PR110557]

2023-07-10 Thread Richard Biener via Gcc-patches
On Fri, 7 Jul 2023, Xi Ruoyao wrote:

> If a bit-field is signed and it's narrower than the output type, we must
> ensure the extracted result is sign-extended.  But this was not handled
> correctly.
> 
> For example:
> 
> int x : 8;
> long y : 55;
> bool z : 1;
> 
> The vectorized extraction of y was:
> 
> vect__ifc__49.29_110 =
>   MEM  [(struct Item *)vectp_a.27_108];
> vect_patt_38.30_112 =
>   vect__ifc__49.29_110 & { 9223372036854775552, 9223372036854775552 };
> vect_patt_39.31_113 = vect_patt_38.30_112 >> 8;
> vect_patt_40.32_114 =
>   VIEW_CONVERT_EXPR(vect_patt_39.31_113);
> 
> This is obviously incorrect.  This patch has implemented it as:
> 
> vect__ifc__25.16_62 =
>   MEM  [(struct Item *)vectp_a.14_60];
> vect_patt_31.17_63 =
>   VIEW_CONVERT_EXPR(vect__ifc__25.16_62);
> vect_patt_32.18_64 = vect_patt_31.17_63 << 1;
> vect_patt_33.19_65 = vect_patt_32.18_64 >> 9;

OK.

Thanks,
Richard.

> gcc/ChangeLog:
> 
>   PR tree-optimization/110557
>   * tree-vect-patterns.cc (vect_recog_bitfield_ref_pattern):
>   Ensure the output sign-extended if necessary.
> 
> gcc/testsuite/ChangeLog:
> 
>   PR tree-optimization/110557
>   * g++.dg/vect/pr110557.cc: New test.
> ---
> 
> Change v1 -> v2:
> 
> - Rename two variables for readability.
> - Remove a redundant useless_type_conversion_p check.
> - Edit the comment for early conversion to show the rationale of
>   "|| ref_sext".
> 
> Bootstrapped (with BOOT_CFLAGS="-O3 -mavx2") and regtested on
> x86_64-linux-gnu.  Ok for trunk and gcc-13?
> 
>  gcc/testsuite/g++.dg/vect/pr110557.cc | 37 
>  gcc/tree-vect-patterns.cc | 62 ---
>  2 files changed, 83 insertions(+), 16 deletions(-)
>  create mode 100644 gcc/testsuite/g++.dg/vect/pr110557.cc
> 
> diff --git a/gcc/testsuite/g++.dg/vect/pr110557.cc 
> b/gcc/testsuite/g++.dg/vect/pr110557.cc
> new file mode 100644
> index 000..e1fbe1caac4
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/vect/pr110557.cc
> @@ -0,0 +1,37 @@
> +// { dg-additional-options "-mavx" { target { avx_runtime } } }
> +
> +static inline long
> +min (long a, long b)
> +{
> +  return a < b ? a : b;
> +}
> +
> +struct Item
> +{
> +  int x : 8;
> +  long y : 55;
> +  bool z : 1;
> +};
> +
> +__attribute__ ((noipa)) long
> +test (Item *a, int cnt)
> +{
> +  long size = 0;
> +  for (int i = 0; i < cnt; i++)
> +size = min ((long)a[i].y, size);
> +  return size;
> +}
> +
> +int
> +main ()
> +{
> +  struct Item items[] = {
> +{ 1, -1 },
> +{ 2, -2 },
> +{ 3, -3 },
> +{ 4, -4 },
> +  };
> +
> +  if (test (items, 4) != -4)
> +__builtin_trap ();
> +}
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index 1bc36b043a0..c0832e8679f 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -2566,7 +2566,7 @@ vect_recog_widen_sum_pattern (vec_info *vinfo,
> Widening with mask first, shift later:
> container = (type_out) container;
> masked = container & (((1 << bitsize) - 1) << bitpos);
> -   result = patt2 >> masked;
> +   result = masked >> bitpos;
>  
> Widening with shift first, mask last:
> container = (type_out) container;
> @@ -2578,6 +2578,15 @@ vect_recog_widen_sum_pattern (vec_info *vinfo,
> result = masked >> bitpos;
> result = (type_out) result;
>  
> +   If the bitfield is signed and it's narrower than type_out, we need to
> +   keep the result sign-extended:
> +   container = (type) container;
> +   masked = container << (prec - bitsize - bitpos);
> +   result = (type_out) (masked >> (prec - bitsize));
> +
> +   Here type is the signed variant of the wider of type_out and the type
> +   of container.
> +
> The shifting is always optional depending on whether bitpos != 0.
>  
>  */
> @@ -2636,14 +2645,22 @@ vect_recog_bitfield_ref_pattern (vec_info *vinfo, 
> stmt_vec_info stmt_info,
>if (BYTES_BIG_ENDIAN)
>  shift_n = prec - shift_n - mask_width;
>  
> +  bool ref_sext = (!TYPE_UNSIGNED (TREE_TYPE (bf_ref)) &&
> +TYPE_PRECISION (ret_type) > mask_width);
> +  bool load_widen = (TYPE_PRECISION (TREE_TYPE (container)) <
> +  TYPE_PRECISION (ret_type));
> +
>/* We move the conversion earlier if the loaded type is smaller than the
> - return type to enable the use of widening loads.  */
> -  if (TYPE_PRECISION (TREE_TYPE (container)) < TYPE_PRECISION (ret_type)
> -  && !useless_type_conversion_p (TREE_TYPE (container), ret_type))
> -{
> -  pattern_stmt
> - = gimple_build_assign (vect_recog_temp_ssa_var (ret_type),
> -NOP_EXPR, container);
> + return type to enable the use of widening loads.  And if we need a
> + sign extension, we need to convert the loaded value early to a signed
> + type as well.  */
> +  if (ref_sext || load_widen)
> +{
> +  tree type = load_widen ? ret_type : container_type;
> +  if (ref_sext

Re: [PATCH] VECT: Add COND_LEN_* operations for loop control with length targets

2023-07-10 Thread Richard Biener via Gcc-patches
On Fri, 7 Jul 2023, juzhe.zh...@rivai.ai wrote:

> From: Ju-Zhe Zhong 
> 
> Hi, Richard and Richi.
> 
> This patch adds cond_len_* operation patterns for targets supporting
> loop control with length.

It looks mostly OK - the probably obvious question is with regard
to the "missing" bias argument ...

IBM folks - is there any expectation that the len family of
instructions increases, or will they be accounted as a "mistake" and
future additions happen in different ways?

At the moment I'd say for consistency reasons 'len' should always
come with 'bias'.

Thanks,
Richard.

> These patterns will be used in these following case:
> 
> 1. Integer division:
>void
>f (int32_t *restrict a, int32_t *restrict b, int32_t *restrict c, int n)
>{
>  for (int i = 0; i < n; ++i)
>   {
> a[i] = b[i] / c[i];
>   }
>}
> 
>   ARM SVE IR:
>   
>   ...
>   max_mask_36 = .WHILE_ULT (0, bnd.5_32, { 0, ... });
> 
>   Loop:
>   ...
>   # loop_mask_29 = PHI 
>   ...
>   vect__4.8_28 = .MASK_LOAD (_33, 32B, loop_mask_29);
>   ...
>   vect__6.11_25 = .MASK_LOAD (_20, 32B, loop_mask_29);
>   vect__8.12_24 = .COND_DIV (loop_mask_29, vect__4.8_28, vect__6.11_25, 
> vect__4.8_28);
>   ...
>   .MASK_STORE (_1, 32B, loop_mask_29, vect__8.12_24);
>   ...
>   next_mask_37 = .WHILE_ULT (_2, bnd.5_32, { 0, ... });
>   ...
>   
>   For target like RVV who support loop control with length, we want to see IR 
> as follows:
>   
>   Loop:
>   ...
>   # loop_len_29 = SELECT_VL
>   ...
>   vect__4.8_28 = .LEN_MASK_LOAD (_33, 32B, loop_len_29);
>   ...
>   vect__6.11_25 = .LEN_MASK_LOAD (_20, 32B, loop_len_29);
>   vect__8.12_24 = .COND_LEN_DIV (dummp_mask, vect__4.8_28, vect__6.11_25, 
> vect__4.8_28, loop_len_29);
>   ...
>   .LEN_MASK_STORE (_1, 32B, loop_len_29, vect__8.12_24);
>   ...
>   next_mask_37 = .WHILE_ULT (_2, bnd.5_32, { 0, ... });
>   ...
>   
>   Notice here, we use dummp_mask = { -1, -1, ..., -1 }
> 
> 2. Integer conditional division:
>    Similar to case (1) but with a condition:
>void
>f (int32_t *restrict a, int32_t *restrict b, int32_t *restrict c, int32_t 
> * cond, int n)
>{
>  for (int i = 0; i < n; ++i)
>{
>  if (cond[i])
>  a[i] = b[i] / c[i];
>}
>}
>
>ARM SVE:
>...
>max_mask_76 = .WHILE_ULT (0, bnd.6_52, { 0, ... });
> 
>Loop:
>...
># loop_mask_55 = PHI 
>...
>vect__4.9_56 = .MASK_LOAD (_51, 32B, loop_mask_55);
>mask__29.10_58 = vect__4.9_56 != { 0, ... };
>vec_mask_and_61 = loop_mask_55 & mask__29.10_58;
>...
>vect__6.13_62 = .MASK_LOAD (_24, 32B, vec_mask_and_61);
>...
>vect__8.16_66 = .MASK_LOAD (_1, 32B, vec_mask_and_61);
>vect__10.17_68 = .COND_DIV (vec_mask_and_61, vect__6.13_62, vect__8.16_66, 
> vect__6.13_62);
>...
>.MASK_STORE (_2, 32B, vec_mask_and_61, vect__10.17_68);
>...
>next_mask_77 = .WHILE_ULT (_3, bnd.6_52, { 0, ... });
>
>    Here, ARM SVE uses vec_mask_and_61 = loop_mask_55 & mask__29.10_58; to 
> guarantee the correct result.
>
>    However, a target with length control cannot perform this elegant flow; for 
> RVV, we would expect:
>
>Loop:
>...
>loop_len_55 = SELECT_VL
>...
>mask__29.10_58 = vect__4.9_56 != { 0, ... };
>...
>vect__10.17_68 = .COND_LEN_DIV (mask__29.10_58, vect__6.13_62, 
> vect__8.16_66, vect__6.13_62, loop_len_55);
>...
> 
>Here we expect COND_LEN_DIV predicated by a real mask which is the outcome 
> of comparison: mask__29.10_58 = vect__4.9_56 != { 0, ... };
>and a real length which is produced by loop control : loop_len_55 = 
> SELECT_VL
>
> 3. conditional Floating-point operations (no -ffast-math):
>
> void
> f (float *restrict a, float *restrict b, int32_t *restrict cond, int n)
> {
>   for (int i = 0; i < n; ++i)
> {
>   if (cond[i])
>   a[i] = b[i] + a[i];
> }
> }
>   
>   ARM SVE IR:
>   max_mask_70 = .WHILE_ULT (0, bnd.6_46, { 0, ... });
> 
>   ...
>   # loop_mask_49 = PHI 
>   ...
>   mask__27.10_52 = vect__4.9_50 != { 0, ... };
>   vec_mask_and_55 = loop_mask_49 & mask__27.10_52;
>   ...
>   vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, 
> vect__6.13_56);
>   ...
>   next_mask_71 = .WHILE_ULT (_22, bnd.6_46, { 0, ... });
>   ...
>   
>   For RVV, we would expect IR:
>   
>   ...
>   loop_len_49 = SELECT_VL
>   ...
>   mask__27.10_52 = vect__4.9_50 != { 0, ... };
>   ...
>   vect__9.17_62 = .COND_LEN_ADD (mask__27.10_52, vect__6.13_56, 
> vect__8.16_60, vect__6.13_56, loop_len_49);
>   ...
> 
> 4. Conditional un-ordered reduction:
>
>int32_t
>f (int32_t *restrict a, 
>int32_t *restrict cond, int n)
>{
>  int32_t result = 0;
>  for (int i = 0; i < n; ++i)
>{
>if (cond[i])
>  result += a[i];
>}
>  return result;
>}
>
>ARM SVE IR:
>  
>  Loop:
>  # vect_result_18.7_37 = PH

Re: [PATCH] simplify-rtx: Fix invalid simplification with paradoxical subregs [PR110206]

2023-07-10 Thread Uros Bizjak via Gcc-patches
On Mon, Jul 10, 2023 at 11:47 AM Richard Biener
 wrote:
>
> On Mon, Jul 10, 2023 at 11:26 AM Uros Bizjak  wrote:
> >
> > On Mon, Jul 10, 2023 at 11:17 AM Richard Biener
> >  wrote:
> > >
> > > On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
> > >  wrote:
> > > >
> > > > As shown in the PR, simplify_gen_subreg call in simplify_replace_fn_rtx:
> > > >
> > > > (gdb) list
> > > > 469   if (code == SUBREG)
> > > > 470 {
> > > > 471   op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
> > > > old_rtx, fn, data);
> > > > 472   if (op0 == SUBREG_REG (x))
> > > > 473 return x;
> > > > 474   op0 = simplify_gen_subreg (GET_MODE (x), op0,
> > > > 475  GET_MODE (SUBREG_REG (x)),
> > > > 476  SUBREG_BYTE (x));
> > > > 477   return op0 ? op0 : x;
> > > > 478 }
> > > >
> > > > simplifies with following arguments:
> > > >
> > > > (gdb) p debug_rtx (op0)
> > > > (const_vector:V4QI [
> > > > (const_int -52 [0xffcc]) repeated x4
> > > > ])
> > > > (gdb) p debug_rtx (x)
> > > > (subreg:V16QI (reg:V4QI 98) 0)
> > > >
> > > > to:
> > > >
> > > > (gdb) p debug_rtx (op0)
> > > > (const_vector:V16QI [
> > > > (const_int -52 [0xffcc]) repeated x16
> > > > ])
> > > >
> > > > This simplification is invalid, it is not possible to get V16QImode 
> > > > vector
> > > > from V4QImode vector, even when all elements are duplicates.
>
> ^^^
>
> I think this simplification is valid.  A simplification to
>
> (const_vector:V16QI [
>  (const_int -52 [0xffcc]) repeated x4
>  (const_int 0 [0]) repeated x12
>  ])
>
> would be valid as well.
>
> > > > The simplification happens in simplify_context::simplify_subreg:
> > > >
> > > > (gdb) list
> > > > 7558  if (VECTOR_MODE_P (outermode)
> > > > 7559  && GET_MODE_INNER (outermode) == GET_MODE_INNER 
> > > > (innermode)
> > > > 7560  && vec_duplicate_p (op, &elt))
> > > > 7561return gen_vec_duplicate (outermode, elt);
> > > >
> > > > but the above simplification is valid only for non-paradoxical 
> > > > registers,
> > > > where outermode <= innermode.  We should not assume that elements 
> > > > outside
> > > > the original register are valid, let alone all duplicates.
> > >
> > > Hmm, but looking at the audit trail the x86 backend expects them to be 
> > > zero?
> > > Isn't that wrong as well?
> >
> > If you mean Comment #10, it is just an observation that
> > simplify_replace_rtx simplifies arguments from Comment #9 to:
> >
> > (gdb) p debug_rtx (src)
> > (const_vector:V8HI [
> > (const_int 204 [0xcc]) repeated x4
> > (const_int 0 [0]) repeated x4
> > ])
> >
> > instead of:
> >
> > (gdb) p debug_rtx (src)
> > (const_vector:V8HI [
> > (const_int 204 [0xcc]) repeated x8
> > ])
> >
> > which is in line with the statement below.
> > >
> > > That is, I think putting any random value into the upper lanes when
> > > constant folding
> > > a paradoxical subreg sounds OK to me, no?
> >
> > The compiler is putting zero there as can be seen from the above new RTX.
> >
> > > Of course we might choose to not do such constant propagation for
> > > efficiency reason - at least
> > > when the resulting CONST_* would require a larger constant pool entry
> > > or more costly
> > > construction.
> >
> > This is probably a follow-up improvement; this patch fixes a specific
> > simplification in simplify_replace_rtx that is invalid universally.
>
> How so?  What specifies the values of the paradoxical subreg for the
> bytes not covered by the subreg operand?

I don't know why 0 is generated here (and if it is valid) for
paradoxical bytes, but 0xcc is not correct, since it sets REG_EQUAL to
the wrong constant and triggers unwanted propagation later on.

Uros.


Pushed: [PATCH v2] vect: Fix vectorized BIT_FIELD_REF for signed bit-fields [PR110557]

2023-07-10 Thread Xi Ruoyao via Gcc-patches
On Mon, 2023-07-10 at 10:33 +, Richard Biener wrote:
> On Fri, 7 Jul 2023, Xi Ruoyao wrote:
> 
> > If a bit-field is signed and it's narrower than the output type, we
> > must
> > ensure the extracted result is sign-extended.  But this was not handled
> > correctly.
> > 
> > For example:
> > 
> >     int x : 8;
> >     long y : 55;
> >     bool z : 1;
> > 
> > The vectorized extraction of y was:
> > 
> >     vect__ifc__49.29_110 =
> >   MEM  [(struct Item
> > *)vectp_a.27_108];
> >     vect_patt_38.30_112 =
> >   vect__ifc__49.29_110 & { 9223372036854775552,
> > 9223372036854775552 };
> >     vect_patt_39.31_113 = vect_patt_38.30_112 >> 8;
> >     vect_patt_40.32_114 =
> >   VIEW_CONVERT_EXPR(vect_patt_39.31_113);
> > 
> > This is obviously incorrect.  This patch has implemented it as:
> > 
> >     vect__ifc__25.16_62 =
> >   MEM  [(struct Item
> > *)vectp_a.14_60];
> >     vect_patt_31.17_63 =
> >   VIEW_CONVERT_EXPR(vect__ifc__25.16_62);
> >     vect_patt_32.18_64 = vect_patt_31.17_63 << 1;
> >     vect_patt_33.19_65 = vect_patt_32.18_64 >> 9;
> 
> OK.

Pushed r14-2407 and r13-7553.

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University


[PATCH V2] VECT: Add COND_LEN_* operations for loop control with length targets

2023-07-10 Thread juzhe . zhong
From: Ju-Zhe Zhong 

Hi, Richard and Richi.

This patch adds cond_len_* operation patterns for targets supporting loop 
control with length.

These patterns will be used in these following case:

1. Integer division:
   void
   f (int32_t *restrict a, int32_t *restrict b, int32_t *restrict c, int n)
   {
 for (int i = 0; i < n; ++i)
  {
a[i] = b[i] / c[i];
  }
   }

  ARM SVE IR:
  
  ...
  max_mask_36 = .WHILE_ULT (0, bnd.5_32, { 0, ... });

  Loop:
  ...
  # loop_mask_29 = PHI 
  ...
  vect__4.8_28 = .MASK_LOAD (_33, 32B, loop_mask_29);
  ...
  vect__6.11_25 = .MASK_LOAD (_20, 32B, loop_mask_29);
  vect__8.12_24 = .COND_DIV (loop_mask_29, vect__4.8_28, vect__6.11_25, 
vect__4.8_28);
  ...
  .MASK_STORE (_1, 32B, loop_mask_29, vect__8.12_24);
  ...
  next_mask_37 = .WHILE_ULT (_2, bnd.5_32, { 0, ... });
  ...
  
  For target like RVV who support loop control with length, we want to see IR 
as follows:
  
  Loop:
  ...
  # loop_len_29 = SELECT_VL
  ...
  vect__4.8_28 = .LEN_MASK_LOAD (_33, 32B, loop_len_29);
  ...
  vect__6.11_25 = .LEN_MASK_LOAD (_20, 32B, loop_len_29);
  vect__8.12_24 = .COND_LEN_DIV (dummp_mask, vect__4.8_28, vect__6.11_25, 
vect__4.8_28, loop_len_29, bias);
  ...
  .LEN_MASK_STORE (_1, 32B, loop_len_29, vect__8.12_24);
  ...
  next_mask_37 = .WHILE_ULT (_2, bnd.5_32, { 0, ... });
  ...
  
  Notice here, we use dummp_mask = { -1, -1, ..., -1 }

2. Integer conditional division:
   Similar to case (1) but with a condition:
   void
   f (int32_t *restrict a, int32_t *restrict b, int32_t *restrict c, int32_t * 
cond, int n)
   {
 for (int i = 0; i < n; ++i)
   {
 if (cond[i])
 a[i] = b[i] / c[i];
   }
   }
   
   ARM SVE:
   ...
   max_mask_76 = .WHILE_ULT (0, bnd.6_52, { 0, ... });

   Loop:
   ...
   # loop_mask_55 = PHI 
   ...
   vect__4.9_56 = .MASK_LOAD (_51, 32B, loop_mask_55);
   mask__29.10_58 = vect__4.9_56 != { 0, ... };
   vec_mask_and_61 = loop_mask_55 & mask__29.10_58;
   ...
   vect__6.13_62 = .MASK_LOAD (_24, 32B, vec_mask_and_61);
   ...
   vect__8.16_66 = .MASK_LOAD (_1, 32B, vec_mask_and_61);
   vect__10.17_68 = .COND_DIV (vec_mask_and_61, vect__6.13_62, vect__8.16_66, 
vect__6.13_62);
   ...
   .MASK_STORE (_2, 32B, vec_mask_and_61, vect__10.17_68);
   ...
   next_mask_77 = .WHILE_ULT (_3, bnd.6_52, { 0, ... });
   
   Here, ARM SVE uses vec_mask_and_61 = loop_mask_55 & mask__29.10_58; to 
guarantee the correct result.
   
   However, a target with length control cannot perform this elegant flow; for 
RVV, we would expect:
   
   Loop:
   ...
   loop_len_55 = SELECT_VL
   ...
   mask__29.10_58 = vect__4.9_56 != { 0, ... };
   ...
   vect__10.17_68 = .COND_LEN_DIV (mask__29.10_58, vect__6.13_62, 
vect__8.16_66, vect__6.13_62, loop_len_55, bias);
   ...

   Here we expect COND_LEN_DIV predicated by a real mask which is the outcome 
of comparison: mask__29.10_58 = vect__4.9_56 != { 0, ... };
   and a real length which is produced by loop control : loop_len_55 = SELECT_VL
   
3. conditional Floating-point operations (no -ffast-math):
   
void
f (float *restrict a, float *restrict b, int32_t *restrict cond, int n)
{
  for (int i = 0; i < n; ++i)
{
  if (cond[i])
  a[i] = b[i] + a[i];
}
}
  
  ARM SVE IR:
  max_mask_70 = .WHILE_ULT (0, bnd.6_46, { 0, ... });

  ...
  # loop_mask_49 = PHI 
  ...
  mask__27.10_52 = vect__4.9_50 != { 0, ... };
  vec_mask_and_55 = loop_mask_49 & mask__27.10_52;
  ...
  vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, 
vect__6.13_56);
  ...
  next_mask_71 = .WHILE_ULT (_22, bnd.6_46, { 0, ... });
  ...
  
  For RVV, we would expect IR:
  
  ...
  loop_len_49 = SELECT_VL
  ...
  mask__27.10_52 = vect__4.9_50 != { 0, ... };
  ...
  vect__9.17_62 = .COND_LEN_ADD (mask__27.10_52, vect__6.13_56, vect__8.16_60, 
vect__6.13_56, loop_len_49, bias);
  ...

4. Conditional un-ordered reduction:
   
   int32_t
   f (int32_t *restrict a, 
   int32_t *restrict cond, int n)
   {
 int32_t result = 0;
 for (int i = 0; i < n; ++i)
   {
   if (cond[i])
 result += a[i];
   }
 return result;
   }
   
   ARM SVE IR:
 
 Loop:
 # vect_result_18.7_37 = PHI 
 ...
 # loop_mask_40 = PHI 
 ...
 mask__17.11_43 = vect__4.10_41 != { 0, ... };
 vec_mask_and_46 = loop_mask_40 & mask__17.11_43;
 ...
 vect__33.16_51 = .COND_ADD (vec_mask_and_46, vect_result_18.7_37, 
vect__7.14_47, vect_result_18.7_37);
 ...
 next_mask_58 = .WHILE_ULT (_15, bnd.6_36, { 0, ... });
 ...
   
 Epilogue:
 _53 = .REDUC_PLUS (vect__33.16_51); [tail call]
   
   For RVV, we expect:
 
Loop:
 # vect_result_18.7_37 = PHI 
 ...
 loop_len_40 = SELECT_VL
 ...
 mask__17.11_43 = vect__4.10_41 != { 0, ... };
 ...
 vect__33.16_51 = .COND_LEN_ADD (mask__17.11_43, vect_result_18.7_37, 
vect__7.14_47, vect_result_18.7_37, loop_len_40, bias);
 ...
 next_mask_58 =

Re: Re: [PATCH] VECT: Add COND_LEN_* operations for loop control with length targets

2023-07-10 Thread juzhe.zh...@rivai.ai
Thanks Richi.

I added "BIAS" for you:
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/623978.html 

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-07-10 18:40
To: Ju-Zhe Zhong
CC: gcc-patches; richard.sandiford; linkw; krebbel
Subject: Re: [PATCH] VECT: Add COND_LEN_* operations for loop control with 
length targets
On Fri, 7 Jul 2023, juzhe.zh...@rivai.ai wrote:
 
> From: Ju-Zhe Zhong 
> 
> Hi, Richard and Richi.
> 
> This patch adds cond_len_* operation patterns for targets that support 
> loop control with length.
 
It looks mostly OK - the probably obvious question is with regard
to the "missing" bias argument ...
 
IBM folks - is there any expectation that the set of len family
instructions increases, or will they be accounted a "mistake" and
future additions happen in different ways?
 
At the moment I'd say for consistency reasons 'len' should always
come with 'bias'.
 
Thanks,
Richard.
 
> These patterns will be used in these following case:
> 
> 1. Integer division:
>void
>f (int32_t *restrict a, int32_t *restrict b, int32_t *restrict c, int n)
>{
>  for (int i = 0; i < n; ++i)
>   {
> a[i] = b[i] / c[i];
>   }
>}
> 
>   ARM SVE IR:
>   
>   ...
>   max_mask_36 = .WHILE_ULT (0, bnd.5_32, { 0, ... });
> 
>   Loop:
>   ...
>   # loop_mask_29 = PHI 
>   ...
>   vect__4.8_28 = .MASK_LOAD (_33, 32B, loop_mask_29);
>   ...
>   vect__6.11_25 = .MASK_LOAD (_20, 32B, loop_mask_29);
>   vect__8.12_24 = .COND_DIV (loop_mask_29, vect__4.8_28, vect__6.11_25, 
> vect__4.8_28);
>   ...
>   .MASK_STORE (_1, 32B, loop_mask_29, vect__8.12_24);
>   ...
>   next_mask_37 = .WHILE_ULT (_2, bnd.5_32, { 0, ... });
>   ...
>   
>   For targets like RVV which support loop control with length, we want to see 
> IR as follows:
>   
>   Loop:
>   ...
>   # loop_len_29 = SELECT_VL
>   ...
>   vect__4.8_28 = .LEN_MASK_LOAD (_33, 32B, loop_len_29);
>   ...
>   vect__6.11_25 = .LEN_MASK_LOAD (_20, 32B, loop_len_29);
>   vect__8.12_24 = .COND_LEN_DIV (dummy_mask, vect__4.8_28, vect__6.11_25, 
> vect__4.8_28, loop_len_29);
>   ...
>   .LEN_MASK_STORE (_1, 32B, loop_len_29, vect__8.12_24);
>   ...
>   next_mask_37 = .WHILE_ULT (_2, bnd.5_32, { 0, ... });
>   ...
>   
>   Notice here, we use dummy_mask = { -1, -1, ..., -1 }
> 
> 2. Integer conditional division:
>    Similar case to (1), but with a condition:
>void
>f (int32_t *restrict a, int32_t *restrict b, int32_t *restrict c, int32_t 
> * cond, int n)
>{
>  for (int i = 0; i < n; ++i)
>{
>  if (cond[i])
>  a[i] = b[i] / c[i];
>}
>}
>
>ARM SVE:
>...
>max_mask_76 = .WHILE_ULT (0, bnd.6_52, { 0, ... });
> 
>Loop:
>...
># loop_mask_55 = PHI 
>...
>vect__4.9_56 = .MASK_LOAD (_51, 32B, loop_mask_55);
>mask__29.10_58 = vect__4.9_56 != { 0, ... };
>vec_mask_and_61 = loop_mask_55 & mask__29.10_58;
>...
>vect__6.13_62 = .MASK_LOAD (_24, 32B, vec_mask_and_61);
>...
>vect__8.16_66 = .MASK_LOAD (_1, 32B, vec_mask_and_61);
>vect__10.17_68 = .COND_DIV (vec_mask_and_61, vect__6.13_62, vect__8.16_66, 
> vect__6.13_62);
>...
>.MASK_STORE (_2, 32B, vec_mask_and_61, vect__10.17_68);
>...
>next_mask_77 = .WHILE_ULT (_3, bnd.6_52, { 0, ... });
>
>    Here, ARM SVE uses vec_mask_and_61 = loop_mask_55 & mask__29.10_58; to 
> guarantee the correct result.
>
>    However, targets with length control cannot perform this elegant flow; for 
> RVV, we would expect:
>
>Loop:
>...
>loop_len_55 = SELECT_VL
>...
>mask__29.10_58 = vect__4.9_56 != { 0, ... };
>...
>vect__10.17_68 = .COND_LEN_DIV (mask__29.10_58, vect__6.13_62, 
> vect__8.16_66, vect__6.13_62, loop_len_55);
>...
> 
>Here we expect COND_LEN_DIV predicated by a real mask which is the outcome 
> of comparison: mask__29.10_58 = vect__4.9_56 != { 0, ... };
>    and a real length which is produced by loop control: loop_len_55 = 
> SELECT_VL
>
> 3. Conditional floating-point operations (no -ffast-math):
>
> void
> f (float *restrict a, float *restrict b, int32_t *restrict cond, int n)
> {
>   for (int i = 0; i < n; ++i)
> {
>   if (cond[i])
>   a[i] = b[i] + a[i];
> }
> }
>   
>   ARM SVE IR:
>   max_mask_70 = .WHILE_ULT (0, bnd.6_46, { 0, ... });
> 
>   ...
>   # loop_mask_49 = PHI 
>   ...
>   mask__27.10_52 = vect__4.9_50 != { 0, ... };
>   vec_mask_and_55 = loop_mask_49 & mask__27.10_52;
>   ...
>   vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, 
> vect__6.13_56);
>   ...
>   next_mask_71 = .WHILE_ULT (_22, bnd.6_46, { 0, ... });
>   ...
>   
>   For RVV, we would expect IR:
>   
>   ...
>   loop_len_49 = SELECT_VL
>   ...
>   mask__27.10_52 = vect__4.9_50 != { 0, ... };
>   ...
>   vect__9.17_62 = .COND_LEN_ADD (mask__27.10_52, vect__6.13_56, 
> vect__8.16_60, vect__6.13_56, loop_len_49);
>   ...
> 
> 4. Conditional un-ordered

Re: [PATCH] simplify-rtx: Fix invalid simplification with paradoxical subregs [PR110206]

2023-07-10 Thread Richard Biener via Gcc-patches
On Mon, Jul 10, 2023 at 1:01 PM Uros Bizjak  wrote:
>
> On Mon, Jul 10, 2023 at 11:47 AM Richard Biener
>  wrote:
> >
> > On Mon, Jul 10, 2023 at 11:26 AM Uros Bizjak  wrote:
> > >
> > > On Mon, Jul 10, 2023 at 11:17 AM Richard Biener
> > >  wrote:
> > > >
> > > > On Sun, Jul 9, 2023 at 10:53 AM Uros Bizjak via Gcc-patches
> > > >  wrote:
> > > > >
> > > > > As shown in the PR, simplify_gen_subreg call in 
> > > > > simplify_replace_fn_rtx:
> > > > >
> > > > > (gdb) list
> > > > > 469   if (code == SUBREG)
> > > > > 470 {
> > > > > 471   op0 = simplify_replace_fn_rtx (SUBREG_REG (x),
> > > > > old_rtx, fn, data);
> > > > > 472   if (op0 == SUBREG_REG (x))
> > > > > 473 return x;
> > > > > 474   op0 = simplify_gen_subreg (GET_MODE (x), op0,
> > > > > 475  GET_MODE (SUBREG_REG 
> > > > > (x)),
> > > > > 476  SUBREG_BYTE (x));
> > > > > 477   return op0 ? op0 : x;
> > > > > 478 }
> > > > >
> > > > > simplifies with following arguments:
> > > > >
> > > > > (gdb) p debug_rtx (op0)
> > > > > (const_vector:V4QI [
> > > > > (const_int -52 [0xffcc]) repeated x4
> > > > > ])
> > > > > (gdb) p debug_rtx (x)
> > > > > (subreg:V16QI (reg:V4QI 98) 0)
> > > > >
> > > > > to:
> > > > >
> > > > > (gdb) p debug_rtx (op0)
> > > > > (const_vector:V16QI [
> > > > > (const_int -52 [0xffcc]) repeated x16
> > > > > ])
> > > > >
> > > > > This simplification is invalid, it is not possible to get V16QImode 
> > > > > vector
> > > > > from V4QImode vector, even when all elements are duplicates.
> >
> > ^^^
> >
> > I think this simplification is valid.  A simplification to
> >
> > (const_vector:V16QI [
> >  (const_int -52 [0xffcc]) repeated x4
> >  (const_int 0 [0]) repeated x12
> >  ])
> >
> > would be valid as well.
> >
> > > > > The simplification happens in simplify_context::simplify_subreg:
> > > > >
> > > > > (gdb) list
> > > > > 7558  if (VECTOR_MODE_P (outermode)
> > > > > 7559  && GET_MODE_INNER (outermode) == GET_MODE_INNER 
> > > > > (innermode)
> > > > > 7560  && vec_duplicate_p (op, &elt))
> > > > > 7561return gen_vec_duplicate (outermode, elt);
> > > > >
> > > > > but the above simplification is valid only for non-paradoxical 
> > > > > registers,
> > > > > where outermode <= innermode.  We should not assume that elements 
> > > > > outside
> > > > > the original register are valid, let alone all duplicates.
> > > >
> > > > Hmm, but looking at the audit trail the x86 backend expects them to be 
> > > > zero?
> > > > Isn't that wrong as well?
> > >
> > > If you mean Comment #10, it is just an observation that
> > > simplify_replace_rtx simplifies arguments from Comment #9 to:
> > >
> > > (gdb) p debug_rtx (src)
> > > (const_vector:V8HI [
> > > (const_int 204 [0xcc]) repeated x4
> > > (const_int 0 [0]) repeated x4
> > > ])
> > >
> > > instead of:
> > >
> > > (gdb) p debug_rtx (src)
> > > (const_vector:V8HI [
> > > (const_int 204 [0xcc]) repeated x8
> > > ])
> > >
> > > which is in line with the statement below.
> > > >
> > > > That is, I think putting any random value into the upper lanes when
> > > > constant folding
> > > > a paradoxical subreg sounds OK to me, no?
> > >
> > > The compiler is putting zero there as can be seen from the above new RTX.
> > >
> > > > Of course we might choose to not do such constant propagation for
> > > > efficiency reason - at least
> > > > when the resulting CONST_* would require a larger constant pool entry
> > > > or more costly
> > > > construction.
> > >
> > > This is probably a follow-up improvement, where this patch tries to
> > > fix a specific invalid simplification of simplify_replace_rtx that is
> > > invalid universally.
> >
> > How so?  What specifies the values of the paradoxical subreg for the
> > bytes not covered by the subreg operand?
>
> I don't know why 0 is generated here (and if it is valid) for
> paradoxical bytes, but 0xcc is not correct, since it sets REG_EQUAL to
> the wrong constant and triggers unwanted propagation later on.

Quoting what I wrote in the PR below.  I think pragmatically the fix is
good - we might miss some opportunistic folding this way but we for
sure may not optimistically register an equality via REG_EQUAL without
enforcing it (removing the producer and replacing it with the optimistic
constant).

So consider the patch approved if no other RTL maintainer chimes in
within 48h.

Thanks,
Richard.


I can see cprop1 adds the REG_EQUAL note:

(insn 22 21 23 4 (set (reg:V8HI 100)
(zero_extend:V8HI (vec_select:V8QI (subreg:V16QI (reg:V4QI 98) 0)
(parallel [
(const_int 0 [0])
(const_int 1 [0x1])
(const_int 2 [0x2])
   

[COMMITTED] ada: Add leafy mode for zero-call-used-regs

2023-07-10 Thread Marc Poulhiès via Gcc-patches
From: Alexandre Oliva 

Document leafy mode.

gcc/ada/

* doc/gnat_rm/security_hardening_features.rst (Register
Scrubbing): Document leafy mode.
* gnat_rm.texi: Regenerate.

Tested on x86_64-pc-linux-gnu, committed on master.

---
 gcc/ada/doc/gnat_rm/security_hardening_features.rst | 6 ++
 gcc/ada/gnat_rm.texi| 8 +++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/gcc/ada/doc/gnat_rm/security_hardening_features.rst 
b/gcc/ada/doc/gnat_rm/security_hardening_features.rst
index ad165cd6849..14328598c33 100644
--- a/gcc/ada/doc/gnat_rm/security_hardening_features.rst
+++ b/gcc/ada/doc/gnat_rm/security_hardening_features.rst
@@ -34,6 +34,12 @@ subprograms.
  pragma Machine_Attribute (Bar, "zero_call_used_regs", "all");
  --  Before returning, Bar scrubs all call-clobbered registers.
 
+ function Baz return Integer;
+ pragma Machine_Attribute (Bar, "zero_call_used_regs", "leafy");
+ --  Before returning, Bar scrubs call-clobbered registers, either
+ --  those it uses itself, if it can be identified as a leaf
+ --  function, or all of them otherwise.
+
 
 For usage and more details on the command-line option, on the
 ``zero_call_used_regs`` attribute, and on their use with other
diff --git a/gcc/ada/gnat_rm.texi b/gcc/ada/gnat_rm.texi
index b28e6ebfffa..817ba0b9108 100644
--- a/gcc/ada/gnat_rm.texi
+++ b/gcc/ada/gnat_rm.texi
@@ -19,7 +19,7 @@
 
 @copying
 @quotation
-GNAT Reference Manual , Jul 04, 2023
+GNAT Reference Manual , Jul 10, 2023
 
 AdaCore
 
@@ -29191,6 +29191,12 @@ pragma Machine_Attribute (Foo, "zero_call_used_regs", 
"used");
 function Bar return Integer;
 pragma Machine_Attribute (Bar, "zero_call_used_regs", "all");
 --  Before returning, Bar scrubs all call-clobbered registers.
+
+function Baz return Integer;
+pragma Machine_Attribute (Bar, "zero_call_used_regs", "leafy");
+--  Before returning, Bar scrubs call-clobbered registers, either
+--  those it uses itself, if it can be identified as a leaf
+--  function, or all of them otherwise.
 @end example
 
 For usage and more details on the command-line option, on the
-- 
2.40.0



[COMMITTED] ada: Simplify assertion to remove CodePeer message

2023-07-10 Thread Marc Poulhiès via Gcc-patches
From: Yannick Moy 

CodePeer correctly warns about a test that is always true in an assertion.
The assertion can be rewritten, without loss of proof, to avoid that message.

gcc/ada/

* libgnat/s-aridou.adb (Lemma_Powers_Of_2_Commutation): Rewrite
assertion.

Tested on x86_64-pc-linux-gnu, committed on master.

---
 gcc/ada/libgnat/s-aridou.adb | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/gcc/ada/libgnat/s-aridou.adb b/gcc/ada/libgnat/s-aridou.adb
index 7ebf8682b32..2f1fbd55453 100644
--- a/gcc/ada/libgnat/s-aridou.adb
+++ b/gcc/ada/libgnat/s-aridou.adb
@@ -1456,9 +1456,7 @@ is
 pragma Assert (Big (Double_Uns'(2))**M = Big_2xx (M));
  end if;
   else
- pragma Assert
-   (Big (Double_Uns'(2))**M =
- (if M < Double_Size then Big_2xx (M) else Big_2xxDouble));
+ pragma Assert (Big (Double_Uns'(2))**M = Big_2xx (M));
   end if;
end Lemma_Powers_Of_2_Commutation;
 
-- 
2.40.0



[COMMITTED] ada: Follow-up fix for compilation issue with recent MinGW-w64 versions

2023-07-10 Thread Marc Poulhiès via Gcc-patches
From: Eric Botcazou 

It turns out that adaint.c includes other Windows header files besides
windows.h, so defining WIN32_LEAN_AND_MEAN is not sufficient for it.

gcc/ada/

* adaint.c [_WIN32]: Undefine 'abort' macro.

Tested on x86_64-pc-linux-gnu, committed on master.

---
 gcc/ada/adaint.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gcc/ada/adaint.c b/gcc/ada/adaint.c
index 8522094164e..2a193efc002 100644
--- a/gcc/ada/adaint.c
+++ b/gcc/ada/adaint.c
@@ -227,6 +227,9 @@ UINT __gnat_current_ccs_encoding;
 
 #elif defined (_WIN32)
 
+/* Cannot redefine abort here.  */
+#undef abort
+
 #define WIN32_LEAN_AND_MEAN
 #include 
 #include 
-- 
2.40.0



[COMMITTED] ada: Adapt proof of System.Arith_Double to remove CVC4

2023-07-10 Thread Marc Poulhiès via Gcc-patches
From: Yannick Moy 

The proof of System.Arith_Double still required the use of
CVC4, now replaced by its successor cvc5. Adapt the proof so that
CVC4 can be removed from the proof of run-time units.

gcc/ada/

* libgnat/s-aridou.adb (Lemma_Div_Mult): New simple lemma.
(Lemma_Powers_Of_2_Commutation): State post in else branch.
(Lemma_Div_Pow2): Introduce local lemma and use it.
(Scaled_Divide): Use cut operations in assertions, lemmas, new
assertions. Introduce local lemma and use it.

Tested on x86_64-pc-linux-gnu, committed on master.

---
 gcc/ada/libgnat/s-aridou.adb | 84 
 1 file changed, 75 insertions(+), 9 deletions(-)

diff --git a/gcc/ada/libgnat/s-aridou.adb b/gcc/ada/libgnat/s-aridou.adb
index 831590ce387..7ebf8682b32 100644
--- a/gcc/ada/libgnat/s-aridou.adb
+++ b/gcc/ada/libgnat/s-aridou.adb
@@ -301,6 +301,11 @@ is
  Pre  => A * S = B * S + R and then S /= 0,
  Post => A = B + R / S;
 
+   procedure Lemma_Div_Mult (X : Big_Natural; Y : Big_Positive)
+   with
+ Ghost,
+ Post => X / Y * Y > X - Y;
+
procedure Lemma_Double_Big_2xxSingle
with
  Ghost,
@@ -639,6 +644,7 @@ is
is null;
procedure Lemma_Div_Ge (X, Y, Z : Big_Integer) is null;
procedure Lemma_Div_Lt (X, Y, Z : Big_Natural) is null;
+   procedure Lemma_Div_Mult (X : Big_Natural; Y : Big_Positive) is null;
procedure Lemma_Double_Big_2xxSingle is null;
procedure Lemma_Double_Shift (X : Double_Uns; S, S1 : Double_Uns) is null;
procedure Lemma_Double_Shift (X : Single_Uns; S, S1 : Natural) is null;
@@ -1449,6 +1455,10 @@ is
   (Double_Uns'(2 ** (M - 1)), 2, Double_Uns'(2**M));
 pragma Assert (Big (Double_Uns'(2))**M = Big_2xx (M));
  end if;
+  else
+ pragma Assert
+   (Big (Double_Uns'(2))**M =
+ (if M < Double_Size then Big_2xx (M) else Big_2xxDouble));
   end if;
end Lemma_Powers_Of_2_Commutation;
 
@@ -1537,6 +1547,19 @@ is
"Q is the quotient of X by Div");
 
   procedure Lemma_Div_Pow2 (X : Double_Uns; I : Natural) is
+
+ --  Local lemmas
+
+ procedure Lemma_Mult_Le (X, Y, Z : Double_Uns)
+ with
+   Ghost,
+   Pre  => X <= 1,
+   Post => X * Z <= Z;
+
+ procedure Lemma_Mult_Le (X, Y, Z : Double_Uns) is null;
+
+ --  Local variables
+
  Div1 : constant Double_Uns := Double_Uns'(2) ** I;
  Div2 : constant Double_Uns := Double_Uns'(2);
  Left : constant Double_Uns := X / Div1 / Div2;
@@ -1544,8 +1567,12 @@ is
  pragma Assert (R2 <= Div2 - 1);
  R1   : constant Double_Uns := X - X / Div1 * Div1;
  pragma Assert (R1 < Div1);
+
+  --  Start of processing for Lemma_Div_Pow2
+
   begin
  pragma Assert (X = Left * (Div1 * Div2) + R2 * Div1 + R1);
+ Lemma_Mult_Le (R2, Div2 - 1, Div1);
  pragma Assert (R2 * Div1 + R1 < Div1 * Div2);
  Lemma_Quot_Rem (X, Div1 * Div2, Left, R2 * Div1 + R1);
  pragma Assert (Left = X / (Div1 * Div2));
@@ -2937,7 +2964,10 @@ is
   Big_2xxSingle * Big (Double_Uns (D (3)))
 + Big (Double_Uns (D (4;
  pragma Assert
-   (Big (D (1) & D (2)) < Big (Zu));
+   (By (Big (D (1) & D (2)) < Big (Zu),
+Big_2xxDouble * (Big (Zu) - Big (D (1) & D (2))) >
+  Big_2xxSingle * Big (Double_Uns (D (3)))
++ Big (Double_Uns (D (4);
 
  --  Loop to compute quotient digits, runs twice for Qd (1) and Qd (2)
 
@@ -2962,7 +2992,7 @@ is
 --  Local ghost variables
 
 Qd1  : Single_Uns := 0 with Ghost;
-D234 : Big_Integer with Ghost;
+D234 : Big_Integer with Ghost, Relaxed_Initialization;
 D123 : constant Big_Integer := Big3 (D (1), D (2), D (3))
   with Ghost;
 D4   : constant Big_Integer := Big (Double_Uns (D (4)))
@@ -3015,8 +3045,10 @@ is
   Lemma_Div_Lt
 (Big3 (D (J), D (J + 1), D (J + 2)),
  Big_2xxSingle, Big (Zu));
-  pragma Assert (Big (Double_Uns (Qd (J))) >=
-Big3 (D (J), D (J + 1), D (J + 2)) / Big (Zu));
+  pragma Assert
+(By (Big (Double_Uns (Qd (J))) >=
+   Big3 (D (J), D (J + 1), D (J + 2)) / Big (Zu),
+ Big (Double_Uns (Qd (J))) = Big_2xxSingle - 1));
 
else
   Qd (J) := Lo ((D (J) & D (J + 1)) / Zhi);
@@ -3025,6 +3057,7 @@ is
end if;
 
pragma Assert (for all K in 1 .. J => Qd (K)'Initialized);
+   Lemma_Div_Mult (Big3 (D (J), D (J + 1), D (J + 2)), Big (Zu));
Lemma_Gt_Mult
  (Big (Double_Uns (Qd (J))),
   Big3 (D (J), D (J + 1), D (J + 2)) / Big (Zu),
@@ -3094,6 +3127,11

[COMMITTED] ada: Add typedefs to snames.h-tmpl

2023-07-10 Thread Marc Poulhiès via Gcc-patches
From: Tom Tromey 

A future patch will change snames.h-tmpl to use enums rather than
preprocessor defines.  In order to do this, first introduce some
typedefs that can be used in gcc-interface.

gcc/ada/

* snames.h-tmpl (Name_Id, Attribute_Id, Convention_Id)
(Pragma_Id): New typedefs.
(Get_Attribute_Id, Get_Pragma_Id): Use typedef.

Tested on x86_64-pc-linux-gnu, committed on master.

---
 gcc/ada/snames.h-tmpl | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/gcc/ada/snames.h-tmpl b/gcc/ada/snames.h-tmpl
index b15792a5724..95b3c776197 100644
--- a/gcc/ada/snames.h-tmpl
+++ b/gcc/ada/snames.h-tmpl
@@ -28,6 +28,7 @@
 
 /* Name_Id values */
 
+typedef Int Name_Id;
 #define  Name_ !! TEMPLATE INSERTION POINT
 
 /* Define the function to return one of the numeric values below. Note
@@ -35,8 +36,9 @@
than 256 entries is represented that way in Ada.  The operand is a Chars
field value.  */
 
+typedef Byte Attribute_Id;
 #define Get_Attribute_Id snames__get_attribute_id
-extern unsigned char Get_Attribute_Id (int);
+extern Attribute_Id Get_Attribute_Id (int);
 
 /* Define the numeric values for attributes.  */
 
@@ -44,6 +46,7 @@ extern unsigned char Get_Attribute_Id (int);
 
 /* Define the numeric values for the conventions.  */
 
+typedef Byte Convention_Id;
 #define  Convention_ !! TEMPLATE INSERTION POINT
 
 /* Define the function to check if a Name_Id value is a valid pragma */
@@ -56,8 +59,9 @@ extern Boolean Is_Pragma_Name (Name_Id);
than 256 entries is represented that way in Ada.  The operand is a Chars
field value.  */
 
+typedef Byte Pragma_Id;
 #define Get_Pragma_Id snames__get_pragma_id
-extern unsigned char Get_Pragma_Id (int);
+extern Pragma_Id Get_Pragma_Id (int);
 
 /* Define the numeric values for the pragmas. */
 
-- 
2.40.0



[COMMITTED] ada: hardcfr: mark throw-expected functions

2023-07-10 Thread Marc Poulhiès via Gcc-patches
From: Alexandre Oliva 

Adjust documentation to reflect the introduction of
-fhardcfr-check-noreturn-calls=no-xthrow.

gcc/ada/

* doc/gnat_rm/security_hardening_features.rst (Control Flow
Redundancy): Add -fhardcfr-check-noreturn-calls=no-xthrow.
* gnat_rm.texi: Regenerate.

Tested on x86_64-pc-linux-gnu, committed on master.

---
 .../doc/gnat_rm/security_hardening_features.rst | 17 +
 gcc/ada/gnat_rm.texi| 17 +
 2 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/gcc/ada/doc/gnat_rm/security_hardening_features.rst 
b/gcc/ada/doc/gnat_rm/security_hardening_features.rst
index 14328598c33..cf8c8a2493d 100644
--- a/gcc/ada/doc/gnat_rm/security_hardening_features.rst
+++ b/gcc/ada/doc/gnat_rm/security_hardening_features.rst
@@ -493,17 +493,18 @@ gets modified as follows:
end;
 
 
-Verification may also be performed before No_Return calls, whether
-only nothrow ones, with
-:switch:`-fhardcfr-check-noreturn-calls=nothrow`, or all of them, with
-:switch:`-fhardcfr-check-noreturn-calls=always`.  The default is
-:switch:`-fhardcfr-check-noreturn-calls=never` for this feature, that
-disables checking before No_Return calls.
+Verification may also be performed before No_Return calls, whether all
+of them, with :switch:`-fhardcfr-check-noreturn-calls=always`; all but
+internal subprograms involved in exception-raising or -reraising, with
+:switch:`-fhardcfr-check-noreturn-calls=no-xthrow` (default); only
+nothrow ones, with :switch:`-fhardcfr-check-noreturn-calls=nothrow`;
+or none, with :switch:`-fhardcfr-check-noreturn-calls=never`.
 
 When a No_Return call returns control to its caller through an
 exception, verification may have already been performed before the
-call, if :switch:`-fhardcfr-check-noreturn-calls=always` is in effect.
-The compiler arranges for already-checked No_Return calls without a
+call, if :switch:`-fhardcfr-check-noreturn-calls=always` or
+:switch:`-fhardcfr-check-noreturn-calls=no-xthrow` is in effect.  The
+compiler arranges for already-checked No_Return calls without a
 preexisting handler to bypass the implicitly-added cleanup handler and
 thus the redundant check, but a local exception or cleanup handler, if
 present, will modify the set of visited blocks, and checking will take
diff --git a/gcc/ada/gnat_rm.texi b/gcc/ada/gnat_rm.texi
index 817ba0b9108..988bb779105 100644
--- a/gcc/ada/gnat_rm.texi
+++ b/gcc/ada/gnat_rm.texi
@@ -29634,17 +29634,18 @@ exception
 end;
 @end example
 
-Verification may also be performed before No_Return calls, whether
-only nothrow ones, with
-@code{-fhardcfr-check-noreturn-calls=nothrow}, or all of them, with
-@code{-fhardcfr-check-noreturn-calls=always}.  The default is
-@code{-fhardcfr-check-noreturn-calls=never} for this feature, that
-disables checking before No_Return calls.
+Verification may also be performed before No_Return calls, whether all
+of them, with @code{-fhardcfr-check-noreturn-calls=always}; all but
+internal subprograms involved in exception-raising or -reraising, with
+@code{-fhardcfr-check-noreturn-calls=no-xthrow} (default); only
+nothrow ones, with @code{-fhardcfr-check-noreturn-calls=nothrow};
+or none, with @code{-fhardcfr-check-noreturn-calls=never}.
 
 When a No_Return call returns control to its caller through an
 exception, verification may have already been performed before the
-call, if @code{-fhardcfr-check-noreturn-calls=always} is in effect.
-The compiler arranges for already-checked No_Return calls without a
+call, if @code{-fhardcfr-check-noreturn-calls=always} or
+@code{-fhardcfr-check-noreturn-calls=no-xthrow} is in effect.  The
+compiler arranges for already-checked No_Return calls without a
 preexisting handler to bypass the implicitly-added cleanup handler and
 thus the redundant check, but a local exception or cleanup handler, if
 present, will modify the set of visited blocks, and checking will take
-- 
2.40.0



[COMMITTED] ada: hardcfr: optionally disable in leaf functions

2023-07-10 Thread Marc Poulhiès via Gcc-patches
From: Alexandre Oliva 

Document -fhardcfr-skip-leaf.

gcc/ada/

* doc/gnat_rm/security_hardening_features.rst (Control Flow
Hardening): Document -fhardcfr-skip-leaf.
* gnat_rm.texi: Regenerate.

Tested on x86_64-pc-linux-gnu, committed on master.

---
 gcc/ada/doc/gnat_rm/security_hardening_features.rst | 5 +
 gcc/ada/gnat_rm.texi| 5 +
 2 files changed, 10 insertions(+)

diff --git a/gcc/ada/doc/gnat_rm/security_hardening_features.rst 
b/gcc/ada/doc/gnat_rm/security_hardening_features.rst
index cf8c8a2493d..e057af2ea12 100644
--- a/gcc/ada/doc/gnat_rm/security_hardening_features.rst
+++ b/gcc/ada/doc/gnat_rm/security_hardening_features.rst
@@ -369,6 +369,11 @@ basic blocks take note as control flows through them, and, 
before
 returning, subprograms verify that the taken notes are consistent with
 the control-flow graph.
 
+The performance impact of verification on leaf subprograms can be much
+higher, while the averted risks are much lower on them.
+Instrumentation can be disabled for leaf subprograms with
+:switch:`-fhardcfr-skip-leaf`.
+
 Functions with too many basic blocks, or with multiple return points,
 call a run-time function to perform the verification.  Other functions
 perform the verification inline before returning.
diff --git a/gcc/ada/gnat_rm.texi b/gcc/ada/gnat_rm.texi
index 988bb779105..0d11be0c188 100644
--- a/gcc/ada/gnat_rm.texi
+++ b/gcc/ada/gnat_rm.texi
@@ -29515,6 +29515,11 @@ basic blocks take note as control flows through them, 
and, before
 returning, subprograms verify that the taken notes are consistent with
 the control-flow graph.
 
+The performance impact of verification on leaf subprograms can be much
+higher, while the averted risks are much lower on them.
+Instrumentation can be disabled for leaf subprograms with
+@code{-fhardcfr-skip-leaf}.
+
 Functions with too many basic blocks, or with multiple return points,
 call a run-time function to perform the verification.  Other functions
 perform the verification inline before returning.
-- 
2.40.0



[COMMITTED] ada: Documentation for mixed declarations and statements

2023-07-10 Thread Marc Poulhiès via Gcc-patches
From: Bob Duff 

This patch documents the new feature that allows declarations mixed with
statements, primarily by referring to the RFC.

gcc/ada/

* doc/gnat_rm/gnat_language_extensions.rst
(Local Declarations Without Block): Document the feature very
briefly, and refer the reader to the RFC for details and examples.
* gnat_rm.texi: Regenerate.
* gnat_ugn.texi: Regenerate.

Tested on x86_64-pc-linux-gnu, committed on master.

---
 .../doc/gnat_rm/gnat_language_extensions.rst  |  21 
 gcc/ada/gnat_rm.texi  | 111 +++---
 gcc/ada/gnat_ugn.texi |   4 +-
 3 files changed, 91 insertions(+), 45 deletions(-)

diff --git a/gcc/ada/doc/gnat_rm/gnat_language_extensions.rst 
b/gcc/ada/doc/gnat_rm/gnat_language_extensions.rst
index 220345d9b38..42d64133989 100644
--- a/gcc/ada/doc/gnat_rm/gnat_language_extensions.rst
+++ b/gcc/ada/doc/gnat_rm/gnat_language_extensions.rst
@@ -45,6 +45,27 @@ file, or in a ``.adc`` file corresponding to your project.
 Curated Extensions
 ==
 
+Local Declarations Without Block
+
+
+A basic_declarative_item may appear at the place of any statement.
+This avoids the heavy syntax of block_statements just to declare
+something locally.
+
+Link to the original RFC:
+https://github.com/AdaCore/ada-spark-rfcs/blob/master/prototyped/rfc-local-vars-without-block.md
+For example:
+
+.. code-block:: ada
+
+   if X > 5 then
+  X := X + 1;
+
+  Squared : constant Integer := X**2;
+
+  X := X + Squared;
+   end if;
+
 Conditional when constructs
 ---
 
diff --git a/gcc/ada/gnat_rm.texi b/gcc/ada/gnat_rm.texi
index 0d11be0c188..066c066d19d 100644
--- a/gcc/ada/gnat_rm.texi
+++ b/gcc/ada/gnat_rm.texi
@@ -881,6 +881,7 @@ GNAT language extensions
 
 Curated Extensions
 
+* Local Declarations Without Block:: 
 * Conditional when constructs:: 
 * Case pattern matching:: 
 * Fixed lower bounds for array types and subtypes:: 
@@ -28574,6 +28575,7 @@ for serious projects, and is only means as a 
playground/technology preview.
 
 
 @menu
+* Local Declarations Without Block:: 
 * Conditional when constructs:: 
 * Case pattern matching:: 
 * Fixed lower bounds for array types and subtypes:: 
@@ -28585,8 +28587,31 @@ for serious projects, and is only means as a 
playground/technology preview.
 
 @end menu
 
-@node Conditional when constructs,Case pattern matching,,Curated Extensions
-@anchor{gnat_rm/gnat_language_extensions 
conditional-when-constructs}@anchor{436}
+@node Local Declarations Without Block,Conditional when constructs,,Curated 
Extensions
+@anchor{gnat_rm/gnat_language_extensions 
local-declarations-without-block}@anchor{436}
+@subsection Local Declarations Without Block
+
+
+A basic_declarative_item may appear at the place of any statement.
+This avoids the heavy syntax of block_statements just to declare
+something locally.
+
+Link to the original RFC:
+@indicateurl{https://github.com/AdaCore/ada-spark-rfcs/blob/master/prototyped/rfc-local-vars-without-block.md}
+For example:
+
+@example
+if X > 5 then
+   X := X + 1;
+
+   Squared : constant Integer := X**2;
+
+   X := X + Squared;
+end if;
+@end example
+
+@node Conditional when constructs,Case pattern matching,Local Declarations 
Without Block,Curated Extensions
+@anchor{gnat_rm/gnat_language_extensions 
conditional-when-constructs}@anchor{437}
 @subsection Conditional when constructs
 
 
@@ -28658,7 +28683,7 @@ Link to the original RFC:
 
@indicateurl{https://github.com/AdaCore/ada-spark-rfcs/blob/master/prototyped/rfc-conditional-when-constructs.rst}
 
 @node Case pattern matching,Fixed lower bounds for array types and 
subtypes,Conditional when constructs,Curated Extensions
-@anchor{gnat_rm/gnat_language_extensions case-pattern-matching}@anchor{437}
+@anchor{gnat_rm/gnat_language_extensions case-pattern-matching}@anchor{438}
 @subsection Case pattern matching
 
 
@@ -28790,7 +28815,7 @@ Link to the original RFC:
 
@indicateurl{https://github.com/AdaCore/ada-spark-rfcs/blob/master/prototyped/rfc-pattern-matching.rst}
 
 @node Fixed lower bounds for array types and subtypes,Prefixed-view notation 
for calls to primitive subprograms of untagged types,Case pattern 
matching,Curated Extensions
-@anchor{gnat_rm/gnat_language_extensions 
fixed-lower-bounds-for-array-types-and-subtypes}@anchor{438}
+@anchor{gnat_rm/gnat_language_extensions 
fixed-lower-bounds-for-array-types-and-subtypes}@anchor{439}
 @subsection Fixed lower bounds for array types and subtypes
 
 
@@ -28844,7 +28869,7 @@ Link to the original RFC:
 
@indicateurl{https://github.com/AdaCore/ada-spark-rfcs/blob/master/prototyped/rfc-fixed-lower-bound.rst}
 
 @node Prefixed-view notation for calls to primitive subprograms of untagged 
types,Expression defaults for generic formal functions,Fixed lower bounds for 
array types and subtypes,Curated Extensions
-@anchor{gnat_rm/gnat_language_extensions 
prefixe

Re: [PATCH] libstdc++: Use RAII in std::vector::_M_realloc_insert

2023-07-10 Thread Jonathan Wakely via Gcc-patches
On Wed, 28 Jun 2023 at 08:56, Jan Hubicka  wrote:
>
> > I think the __throw_bad_alloc() and __throw_bad_array_new_length()
> > functions should always be rare, so marking them cold seems fine (users who
> > define their own allocators that want to throw bad_alloc "often" will
> > probably throw it directly, they shouldn't be using our __throw_bad_alloc()
> > function anyway). I don't think __throw_bad_exception is ever used, so that
> > doesn't matter (we could remove it from the header and just keep its
> > definition in the library, but there's no big advantage to doing that).
> > Others like __throw_length_error() should also be very very rare, and could
> > be marked cold.
> >
> > Maybe we should just mark everything in  as cold. If
> > users want to avoid the cost of calls to those functions they can do so by
> > checking function preconditions/arguments to avoid the exceptions. There
> > are very few places where a throwing libstdc++ API doesn't have a way to
> > avoid the exception. The only one that isn't easily avoidable is
> > __throw_bad_alloc but OOM should be rare.
>
> Hi,
> this marks everything in functexcept.h as cold and I also noticed that
> we probably want to mark as such terminate.

Should we do the same for __glibcxx_assert_fail, declared in
libstdc++-v3/include/bits/c++config?



[PATCH v2] arm: Fix MVE intrinsics support with LTO (PR target/110268)

2023-07-10 Thread Christophe Lyon via Gcc-patches
After the recent MVE intrinsics re-implementation, LTO stopped working
because the intrinsics would no longer be defined.

The main part of the patch is simple and similar to what we do for
AArch64:
- call handle_arm_mve_h() from arm_init_mve_builtins to declare the
  intrinsics when the compiler is in LTO mode
- actually implement arm_builtin_decl for MVE.

It was just a bit tricky to handle __ARM_MVE_PRESERVE_USER_NAMESPACE:
its value in the user code cannot be guessed at LTO time, so we always
have to assume that it was not defined.  This led to a few fixes in the
way we register MVE builtins as placeholders or not.  Without this
patch, we would just omit some versions of the intrinsics when
__ARM_MVE_PRESERVE_USER_NAMESPACE is true. In fact, like for the C/C++
placeholders, we need to always keep entries for all of them to ensure
that we have a consistent numbering scheme.

2023-06-26  Christophe Lyon   

PR target/110268
gcc/
* config/arm/arm-builtins.cc (arm_init_mve_builtins): Handle LTO.
(arm_builtin_decl): Handle MVE builtins.
* config/arm/arm-mve-builtins.cc (builtin_decl): New function.
(add_unique_function): Fix handling of
__ARM_MVE_PRESERVE_USER_NAMESPACE.
(add_overloaded_function): Likewise.
* config/arm/arm-protos.h (builtin_decl): New declaration.

gcc/testsuite/
* gcc.target/arm/pr110268-1.c: New test.
* gcc.target/arm/pr110268-2.c: New test.
---
 gcc/config/arm/arm-builtins.cc| 11 +++-
 gcc/config/arm/arm-mve-builtins.cc| 61 ---
 gcc/config/arm/arm-protos.h   |  1 +
 gcc/testsuite/gcc.target/arm/pr110268-1.c | 12 +
 gcc/testsuite/gcc.target/arm/pr110268-2.c | 23 +
 5 files changed, 78 insertions(+), 30 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/pr110268-1.c
 create mode 100644 gcc/testsuite/gcc.target/arm/pr110268-2.c

diff --git a/gcc/config/arm/arm-builtins.cc b/gcc/config/arm/arm-builtins.cc
index 36365e40a5b..fca7dcaf565 100644
--- a/gcc/config/arm/arm-builtins.cc
+++ b/gcc/config/arm/arm-builtins.cc
@@ -1918,6 +1918,15 @@ arm_init_mve_builtins (void)
   arm_builtin_datum *d = &mve_builtin_data[i];
   arm_init_builtin (fcode, d, "__builtin_mve");
 }
+
+  if (in_lto_p)
+{
+  arm_mve::handle_arm_mve_types_h ();
+  /* Under LTO, we cannot know whether
+__ARM_MVE_PRESERVE_USER_NAMESPACE was defined, so assume it
+was not.  */
+  arm_mve::handle_arm_mve_h (false);
+}
 }
 
 /* Set up all the NEON builtins, even builtins for instructions that are not
@@ -2723,7 +2732,7 @@ arm_builtin_decl (unsigned code, bool initialize_p 
ATTRIBUTE_UNUSED)
 case ARM_BUILTIN_GENERAL:
   return arm_general_builtin_decl (subcode);
 case ARM_BUILTIN_MVE:
-  return error_mark_node;
+  return arm_mve::builtin_decl (subcode);
 default:
   gcc_unreachable ();
 }
diff --git a/gcc/config/arm/arm-mve-builtins.cc 
b/gcc/config/arm/arm-mve-builtins.cc
index 7033e41a571..413d8100607 100644
--- a/gcc/config/arm/arm-mve-builtins.cc
+++ b/gcc/config/arm/arm-mve-builtins.cc
@@ -493,6 +493,16 @@ handle_arm_mve_h (bool preserve_user_namespace)
 preserve_user_namespace);
 }
 
+/* Return the function decl with MVE function subcode CODE, or error_mark_node
+   if no such function exists.  */
+tree
+builtin_decl (unsigned int code)
+{
+  if (code >= vec_safe_length (registered_functions))
+return error_mark_node;
+  return (*registered_functions)[code]->decl;
+}
+
 /* Return true if CANDIDATE is equivalent to MODEL_TYPE for overloading
purposes.  */
 static bool
@@ -849,7 +859,6 @@ function_builder::add_function (const function_instance 
&instance,
 ? integer_zero_node
 : simulate_builtin_function_decl (input_location, name, fntype,
  code, NULL, attrs);
-
   registered_function &rfn = *ggc_alloc <registered_function> ();
   rfn.instance = instance;
   rfn.decl = decl;
@@ -889,15 +898,12 @@ function_builder::add_unique_function (const 
function_instance &instance,
   gcc_assert (!*rfn_slot);
   *rfn_slot = &rfn;
 
-  /* Also add the non-prefixed non-overloaded function, if the user namespace
- does not need to be preserved.  */
-  if (!preserve_user_namespace)
-{
-  char *noprefix_name = get_name (instance, false, false);
-  tree attrs = get_attributes (instance);
-  add_function (instance, noprefix_name, fntype, attrs, requires_float,
-   false, false);
-}
+  /* Also add the non-prefixed non-overloaded function, as placeholder
+ if the user namespace does not need to be preserved.  */
+  char *noprefix_name = get_name (instance, false, false);
+  attrs = get_attributes (instance);
+  add_function (instance, noprefix_name, fntype, attrs, requires_float,
+   false, preserve_user_namespace);
 
   /* Also add the function under its overloaded alias, if we w

RE: [PATCH] testsuite: Add _link flavor for several arm_arch* and arm* effective-targets

2023-07-10 Thread Kyrylo Tkachov via Gcc-patches



> -Original Message-
> From: Christophe Lyon 
> Sent: Friday, July 7, 2023 8:52 AM
> To: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
> Richard Earnshaw 
> Cc: Christophe Lyon 
> Subject: [PATCH] testsuite: Add _link flavor for several arm_arch* and arm*
> effective-targets
> 
> For arm targets, we generate many effective-targets with
> check_effective_target_FUNC_multilib and
> check_effective_target_arm_arch_FUNC_multilib which check if we can
> link and execute a simple program with a given set of flags/multilibs.
> 
> In some cases however, it's possible to link but not to execute a
> program, so this patch adds similar _link effective-targets which only
> check if link succeeds.
> 
> The patch does not update the documentation as it already lacks the
> numerous existing related effective-targets.

I think this looks ok but...

> 
> 2023-07-07  Christophe Lyon  
> 
>   gcc/testsuite/
>   * lib/target-supports.exp (arm_*FUNC_link): New effective-targets.
> ---
>  gcc/testsuite/lib/target-supports.exp | 27 +++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-
> supports.exp
> index c04db2be7f9..d33bc077418 100644
> --- a/gcc/testsuite/lib/target-supports.exp
> +++ b/gcc/testsuite/lib/target-supports.exp
> @@ -5129,6 +5129,14 @@ foreach { armfunc armflag armdefs } {
>   return "$flags FLAG"
>   }
> 
> +proc check_effective_target_arm_arch_FUNC_link { } {
> + return [check_no_compiler_messages arm_arch_FUNC_link
> executable {
> + #include 
> + int dummy;
> + int main (void) { return 0; }
> + } [add_options_for_arm_arch_FUNC ""]]
> + }
> +
>   proc check_effective_target_arm_arch_FUNC_multilib { } {
>   return [check_runtime arm_arch_FUNC_multilib {
>   int
> @@ -5906,6 +5914,7 @@ proc add_options_for_arm_v8_2a_bf16_neon {
> flags } {
>  #   arm_v8m_main_cde: Armv8-m CDE (Custom Datapath Extension).
>  #   arm_v8m_main_cde_fp: Armv8-m CDE with FP registers.
>  #   arm_v8_1m_main_cde_mve: Armv8.1-m CDE with MVE.
> +#   arm_v8_1m_main_cde_mve_fp: Armv8.1-m CDE with MVE with FP
> support.
>  # Usage:
>  #   /* { dg-require-effective-target arm_v8m_main_cde_ok } */
>  #   /* { dg-add-options arm_v8m_main_cde } */
> @@ -5965,6 +5974,24 @@ foreach { armfunc armflag armdef arminc } {
>   return "$flags $et_FUNC_flags"
>   }
> 
> +proc check_effective_target_FUNC_link { } {
> + if { ! [check_effective_target_FUNC_ok] } {
> + return 0;
> + }
> + return [check_no_compiler_messages FUNC_link executable {
> + #if !(DEF)
> + #error "DEF failed"
> + #endif
> + #include <arm_cde.h>

... why is arm_cde.h included here?

> + INC
> + int
> + main (void)
> + {
> + return 0;
> + }
> + } [add_options_for_FUNC ""]]
> + }
> +
>   proc check_effective_target_FUNC_multilib { } {
>   if { ! [check_effective_target_FUNC_ok] } {
>   return 0;
> --
> 2.34.1



RE: [PATCH] doc: Document arm_v8_1m_main_cde_mve_fp

2023-07-10 Thread Kyrylo Tkachov via Gcc-patches



> -Original Message-
> From: Christophe Lyon 
> Sent: Friday, July 7, 2023 8:52 AM
> To: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
> Richard Earnshaw 
> Cc: Christophe Lyon 
> Subject: [PATCH] doc: Document arm_v8_1m_main_cde_mve_fp
> 
> The arm_v8_1m_main_cde_mve_fp family of effective targets was not
> documented when it was introduced.
> 
> 2023-07-07  Christophe Lyon  
> 
>   gcc/
>   * doc/sourcebuild.texi (arm_v8_1m_main_cde_mve_fp): Document.
> ---
>  gcc/doc/sourcebuild.texi | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi
> index 526020c7511..03fb2394705 100644
> --- a/gcc/doc/sourcebuild.texi
> +++ b/gcc/doc/sourcebuild.texi
> @@ -2190,6 +2190,12 @@ ARM target supports options to generate
> instructions from ARMv8.1-M with
>  the Custom Datapath Extension (CDE) and M-Profile Vector Extension (MVE).
>  Some multilibs may be incompatible with these options.
> 
> +@item arm_v8_1m_main_cde_mve_fp
> +ARM target supports options to generate instructions from ARMv8.1-M
> +with the Custom Datapath Extension (CDE) and M-Profile Vector
> +Extension (MVE) with floating-point support.  Some multilibs may be
> +incompatible with these options.

I know the GCC source is inconsistent on this but the proper branding these 
days is "ARM" -> "Arm" and "ARMv8.1-M" -> "Armv8.1-M".
Ok with those changes.
Thanks,
Kyrill

> +
>  @item arm_pacbti_hw
>  Test system supports executing Pointer Authentication and Branch Target
>  Identification instructions.
> --
> 2.34.1



RE: [PATCH v2] arm: Fix MVE intrinsics support with LTO (PR target/110268)

2023-07-10 Thread Kyrylo Tkachov via Gcc-patches



> -Original Message-
> From: Christophe Lyon 
> Sent: Monday, July 10, 2023 2:09 PM
> To: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
> Richard Earnshaw 
> Cc: Christophe Lyon 
> Subject: [PATCH v2] arm: Fix MVE intrinsics support with LTO (PR
> target/110268)
> 
> After the recent MVE intrinsics re-implementation, LTO stopped working
> because the intrinsics would no longer be defined.
> 
> The main part of the patch is simple and similar to what we do for
> AArch64:
> - call handle_arm_mve_h() from arm_init_mve_builtins to declare the
>   intrinsics when the compiler is in LTO mode
> - actually implement arm_builtin_decl for MVE.
> 
> It was just a bit tricky to handle __ARM_MVE_PRESERVE_USER_NAMESPACE:
> its value in the user code cannot be guessed at LTO time, so we always
> have to assume that it was not defined.  This led to a few fixes in the
> way we register MVE builtins as placeholders or not.  Without this
> patch, we would just omit some versions of the intrinsics when
> __ARM_MVE_PRESERVE_USER_NAMESPACE is true. In fact, like for the C/C++
> placeholders, we need to always keep entries for all of them to ensure
> that we have a consistent numbering scheme.

Ok.
Thanks,
Kyrill

> 
> 2023-06-26  Christophe Lyon   
> 
>   PR target/110268
>   gcc/
>   * config/arm/arm-builtins.cc (arm_init_mve_builtins): Handle LTO.
>   (arm_builtin_decl): Handle MVE builtins.
>   * config/arm/arm-mve-builtins.cc (builtin_decl): New function.
>   (add_unique_function): Fix handling of
>   __ARM_MVE_PRESERVE_USER_NAMESPACE.
>   (add_overloaded_function): Likewise.
>   * config/arm/arm-protos.h (builtin_decl): New declaration.
> 
>   gcc/testsuite/
>   * gcc.target/arm/pr110268-1.c: New test.
>   * gcc.target/arm/pr110268-2.c: New test.
> ---
>  gcc/config/arm/arm-builtins.cc| 11 +++-
>  gcc/config/arm/arm-mve-builtins.cc| 61 ---
>  gcc/config/arm/arm-protos.h   |  1 +
>  gcc/testsuite/gcc.target/arm/pr110268-1.c | 12 +
>  gcc/testsuite/gcc.target/arm/pr110268-2.c | 23 +
>  5 files changed, 78 insertions(+), 30 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/arm/pr110268-1.c
>  create mode 100644 gcc/testsuite/gcc.target/arm/pr110268-2.c
> 
> diff --git a/gcc/config/arm/arm-builtins.cc b/gcc/config/arm/arm-builtins.cc
> index 36365e40a5b..fca7dcaf565 100644
> --- a/gcc/config/arm/arm-builtins.cc
> +++ b/gcc/config/arm/arm-builtins.cc
> @@ -1918,6 +1918,15 @@ arm_init_mve_builtins (void)
>arm_builtin_datum *d = &mve_builtin_data[i];
>arm_init_builtin (fcode, d, "__builtin_mve");
>  }
> +
> +  if (in_lto_p)
> +{
> +  arm_mve::handle_arm_mve_types_h ();
> +  /* Under LTO, we cannot know whether
> +  __ARM_MVE_PRESERVE_USER_NAMESPACE was defined, so assume
> it
> +  was not.  */
> +  arm_mve::handle_arm_mve_h (false);
> +}
>  }
> 
>  /* Set up all the NEON builtins, even builtins for instructions that are not
> @@ -2723,7 +2732,7 @@ arm_builtin_decl (unsigned code, bool initialize_p
> ATTRIBUTE_UNUSED)
>  case ARM_BUILTIN_GENERAL:
>return arm_general_builtin_decl (subcode);
>  case ARM_BUILTIN_MVE:
> -  return error_mark_node;
> +  return arm_mve::builtin_decl (subcode);
>  default:
>gcc_unreachable ();
>  }
> diff --git a/gcc/config/arm/arm-mve-builtins.cc b/gcc/config/arm/arm-mve-
> builtins.cc
> index 7033e41a571..413d8100607 100644
> --- a/gcc/config/arm/arm-mve-builtins.cc
> +++ b/gcc/config/arm/arm-mve-builtins.cc
> @@ -493,6 +493,16 @@ handle_arm_mve_h (bool
> preserve_user_namespace)
>preserve_user_namespace);
>  }
> 
> +/* Return the function decl with MVE function subcode CODE, or
> error_mark_node
> +   if no such function exists.  */
> +tree
> +builtin_decl (unsigned int code)
> +{
> +  if (code >= vec_safe_length (registered_functions))
> +return error_mark_node;
> +  return (*registered_functions)[code]->decl;
> +}
> +
>  /* Return true if CANDIDATE is equivalent to MODEL_TYPE for overloading
> purposes.  */
>  static bool
> @@ -849,7 +859,6 @@ function_builder::add_function (const
> function_instance &instance,
>  ? integer_zero_node
>  : simulate_builtin_function_decl (input_location, name, fntype,
> code, NULL, attrs);
> -
>registered_function &rfn = *ggc_alloc <registered_function> ();
>rfn.instance = instance;
>rfn.decl = decl;
> @@ -889,15 +898,12 @@ function_builder::add_unique_function (const
> function_instance &instance,
>gcc_assert (!*rfn_slot);
>*rfn_slot = &rfn;
> 
> -  /* Also add the non-prefixed non-overloaded function, if the user
> namespace
> - does not need to be preserved.  */
> -  if (!preserve_user_namespace)
> -{
> -  char *noprefix_name = get_name (instance, false, false);
> -  tree attrs = get_attributes (instance);
> -  add_function (instance, noprefix_nam

Re: [PATCH] testsuite: Add _link flavor for several arm_arch* and arm* effective-targets

2023-07-10 Thread Christophe Lyon via Gcc-patches
On Mon, 10 Jul 2023 at 15:46, Kyrylo Tkachov  wrote:

>
>
> > -Original Message-
> > From: Christophe Lyon 
> > Sent: Friday, July 7, 2023 8:52 AM
> > To: gcc-patches@gcc.gnu.org; Kyrylo Tkachov ;
> > Richard Earnshaw 
> > Cc: Christophe Lyon 
> > Subject: [PATCH] testsuite: Add _link flavor for several arm_arch* and
> arm*
> > effective-targets
> >
> > For arm targets, we generate many effective-targets with
> > check_effective_target_FUNC_multilib and
> > check_effective_target_arm_arch_FUNC_multilib which check if we can
> > link and execute a simple program with a given set of flags/multilibs.
> >
> > In some cases however, it's possible to link but not to execute a
> > program, so this patch adds similar _link effective-targets which only
> > check if link succeeds.
> >
> > The patch does not update the documentation as it already lacks the
> > numerous existing related effective-targets.
>
> I think this looks ok but...
>
> >
> > 2023-07-07  Christophe Lyon  
> >
> >   gcc/testsuite/
> >   * lib/target-supports.exp (arm_*FUNC_link): New effective-targets.
> > ---
> >  gcc/testsuite/lib/target-supports.exp | 27 +++
> >  1 file changed, 27 insertions(+)
> >
> > diff --git a/gcc/testsuite/lib/target-supports.exp
> b/gcc/testsuite/lib/target-
> > supports.exp
> > index c04db2be7f9..d33bc077418 100644
> > --- a/gcc/testsuite/lib/target-supports.exp
> > +++ b/gcc/testsuite/lib/target-supports.exp
> > @@ -5129,6 +5129,14 @@ foreach { armfunc armflag armdefs } {
> >   return "$flags FLAG"
> >   }
> >
> > +proc check_effective_target_arm_arch_FUNC_link { } {
> > + return [check_no_compiler_messages arm_arch_FUNC_link
> > executable {
> > + #include 
> > + int dummy;
> > + int main (void) { return 0; }
> > + } [add_options_for_arm_arch_FUNC ""]]
> > + }
> > +
> >   proc check_effective_target_arm_arch_FUNC_multilib { } {
> >   return [check_runtime arm_arch_FUNC_multilib {
> >   int
> > @@ -5906,6 +5914,7 @@ proc add_options_for_arm_v8_2a_bf16_neon {
> > flags } {
> >  #   arm_v8m_main_cde: Armv8-m CDE (Custom Datapath Extension).
> >  #   arm_v8m_main_cde_fp: Armv8-m CDE with FP registers.
> >  #   arm_v8_1m_main_cde_mve: Armv8.1-m CDE with MVE.
> > +#   arm_v8_1m_main_cde_mve_fp: Armv8.1-m CDE with MVE with FP
> > support.
> >  # Usage:
> >  #   /* { dg-require-effective-target arm_v8m_main_cde_ok } */
> >  #   /* { dg-add-options arm_v8m_main_cde } */
> > @@ -5965,6 +5974,24 @@ foreach { armfunc armflag armdef arminc } {
> >   return "$flags $et_FUNC_flags"
> >   }
> >
> > +proc check_effective_target_FUNC_link { } {
> > + if { ! [check_effective_target_FUNC_ok] } {
> > + return 0;
> > + }
> > + return [check_no_compiler_messages FUNC_link executable {
> > + #if !(DEF)
> > + #error "DEF failed"
> > + #endif
> > + #include <arm_cde.h>
>
> ... why is arm_cde.h included here?
>
It's the very same code as check_effective_target_FUNC_multilib below.

I think it's needed in case the toolchain's default configuration is not
able to support CDE. I believe these tests would fail if the toolchain
defaults to -mfloat-abi=soft (the gnu/stubs-{soft|hard}.h "usual" error).

I added this chunk for consistency with the other one, it's not needed at
the moment.

Christophe



> + INC
> > + int
> > + main (void)
> > + {
> > + return 0;
> > + }
> > + } [add_options_for_FUNC ""]]
> > + }
> > +
> >   proc check_effective_target_FUNC_multilib { } {
> >   if { ! [check_effective_target_FUNC_ok] } {
> >   return 0;
> > --
> > 2.34.1
>
>


RE: [PATCH] testsuite: Add _link flavor for several arm_arch* and arm* effective-targets

2023-07-10 Thread Kyrylo Tkachov via Gcc-patches


> -Original Message-
> From: Christophe Lyon 
> Sent: Monday, July 10, 2023 2:59 PM
> To: Kyrylo Tkachov 
> Cc: gcc-patches@gcc.gnu.org; Richard Earnshaw
> 
> Subject: Re: [PATCH] testsuite: Add _link flavor for several arm_arch* and
> arm* effective-targets
> 
> 
> 
> On Mon, 10 Jul 2023 at 15:46, Kyrylo Tkachov   > wrote:
> 
> 
> 
> 
>   > -Original Message-
>   > From: Christophe Lyon   >
>   > Sent: Friday, July 7, 2023 8:52 AM
>   > To: gcc-patches@gcc.gnu.org  ;
> Kyrylo Tkachov   >;
>   > Richard Earnshaw   >
>   > Cc: Christophe Lyon   >
>   > Subject: [PATCH] testsuite: Add _link flavor for several arm_arch*
> and arm*
>   > effective-targets
>   >
>   > For arm targets, we generate many effective-targets with
>   > check_effective_target_FUNC_multilib and
>   > check_effective_target_arm_arch_FUNC_multilib which check if we
> can
>   > link and execute a simple program with a given set of
> flags/multilibs.
>   >
>   > In some cases however, it's possible to link but not to execute a
>   > program, so this patch adds similar _link effective-targets which only
>   > check if link succeeds.
>   >
>   > The patch does not update the documentation as it already lacks
> the
>   > numerous existing related effective-targets.
> 
>   I think this looks ok but...
> 
>   >
>   > 2023-07-07  Christophe Lyon    >
>   >
>   >   gcc/testsuite/
>   >   * lib/target-supports.exp (arm_*FUNC_link): New effective-
> targets.
>   > ---
>   >  gcc/testsuite/lib/target-supports.exp | 27
> +++
>   >  1 file changed, 27 insertions(+)
>   >
>   > diff --git a/gcc/testsuite/lib/target-supports.exp
> b/gcc/testsuite/lib/target-
>   > supports.exp
>   > index c04db2be7f9..d33bc077418 100644
>   > --- a/gcc/testsuite/lib/target-supports.exp
>   > +++ b/gcc/testsuite/lib/target-supports.exp
>   > @@ -5129,6 +5129,14 @@ foreach { armfunc armflag armdefs } {
>   >   return "$flags FLAG"
>   >   }
>   >
>   > +proc check_effective_target_arm_arch_FUNC_link { } {
>   > + return [check_no_compiler_messages arm_arch_FUNC_link
>   > executable {
>   > + #include 
>   > + int dummy;
>   > + int main (void) { return 0; }
>   > + } [add_options_for_arm_arch_FUNC ""]]
>   > + }
>   > +
>   >   proc check_effective_target_arm_arch_FUNC_multilib { } {
>   >   return [check_runtime arm_arch_FUNC_multilib {
>   >   int
>   > @@ -5906,6 +5914,7 @@ proc
> add_options_for_arm_v8_2a_bf16_neon {
>   > flags } {
>   >  #   arm_v8m_main_cde: Armv8-m CDE (Custom Datapath
> Extension).
>   >  #   arm_v8m_main_cde_fp: Armv8-m CDE with FP registers.
>   >  #   arm_v8_1m_main_cde_mve: Armv8.1-m CDE with MVE.
>   > +#   arm_v8_1m_main_cde_mve_fp: Armv8.1-m CDE with MVE with
> FP
>   > support.
>   >  # Usage:
>   >  #   /* { dg-require-effective-target arm_v8m_main_cde_ok } */
>   >  #   /* { dg-add-options arm_v8m_main_cde } */
>   > @@ -5965,6 +5974,24 @@ foreach { armfunc armflag armdef
> arminc } {
>   >   return "$flags $et_FUNC_flags"
>   >   }
>   >
>   > +proc check_effective_target_FUNC_link { } {
>   > + if { ! [check_effective_target_FUNC_ok] } {
>   > + return 0;
>   > + }
>   > + return [check_no_compiler_messages FUNC_link executable {
>   > + #if !(DEF)
>   > + #error "DEF failed"
>   > + #endif
>   > + #include <arm_cde.h>
> 
>   ... why is arm_cde.h included here?
> 
> 
> 
> It's the very same code as  check_effective_target_FUNC_multilib below.
> 
> I think it's needed in case the toolchain's default configuration is not
> able to support CDE. I believe these tests would fail if the toolchain 
> defaults
> to -mfloat-abi=soft (the gnu/stubs-{soft|hard}.h "usual" error)
> 
> I added this chunk for consistency with the other one, it's not needed at the
> moment.

Ah, this is a CDE-specific region. I couldn't tell from the default diff 
context, but having looked at the code around it, it makes sense.
Ok.
Thanks,
Kyrill

> 
> Christophe
> 
> 
> 
> 
>   > + INC
>   > + int
>   > + main (void)
>   > + {
>   > + return 0;
>   > + }
>   > + } [add_options_for_FUNC ""]]
>   > + }
>   > +
>   >   proc check_effective_target_FUNC_multilib { } {
>   > 

Re: [PATCH v2] GCSE: Export 'insert_insn_end_basic_block' as global function

2023-07-10 Thread Jeff Law via Gcc-patches




On 7/10/23 02:12, juzhe.zh...@rivai.ai wrote:

From: Ju-Zhe Zhong 

Since the VSETVL pass in the RISC-V port uses the common part of
'insert_insn_end_basic_block (struct gcse_expr *expr, basic_block bb)',
and we will also use this helper function in riscv.cc in the following patches,
extract the common code of 'insert_insn_end_basic_block (struct
gcse_expr *expr, basic_block bb)' into a new function.  The new function
is also called 'insert_insn_end_basic_block (rtx_insn *pat,
basic_block bb)' but takes different arguments.
Call 'insert_insn_end_basic_block (rtx_insn *pat, basic_block bb)' from
'insert_insn_end_basic_block (struct gcse_expr *expr, basic_block bb)'
and from the VSETVL pass in the RISC-V port.

Remove redundant code from the VSETVL pass in the RISC-V port.

gcc/ChangeLog:

 * config/riscv/riscv-vsetvl.cc (add_label_notes): Remove it.
 (insert_insn_end_basic_block): Ditto.
 (pass_vsetvl::commit_vsetvls): Adapt for new helper function.
 * gcse.cc (insert_insn_end_basic_block):  Export as global function.
 * gcse.h (insert_insn_end_basic_block): Ditto.

OK.  Thanks for remembering to clean this up.

jeff


[PATCH] doc: Add doc for RISC-V Operand Modifiers

2023-07-10 Thread Kito Cheng via Gcc-patches
Document the `z` and `i` operand modifiers.  We have many more modifiers
than these two, but they are the only two implemented in both GCC and
LLVM; for compatibility, I would like to document these two first, and
then review the other modifiers later to see whether any others should
be exposed and implemented in RISC-V LLVM too.

gcc/ChangeLog:

* doc/extend.texi (RISC-V Operand Modifiers): New.
---
 gcc/doc/extend.texi | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index bfbc1d6cc9f8..d88fd75e06e6 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -11533,6 +11533,16 @@ The list below describes the supported modifiers and 
their effects for LoongArch
 @item @code{z} @tab Print the operand in its unmodified form, followed by a 
comma.
 @end multitable
 
+@anchor{riscvOperandmodifiers}
+@subsubsection RISC-V Operand Modifiers
+
+The list below describes the supported modifiers and their effects for RISC-V.
+
+@multitable @columnfractions .10 .90
+@headitem Modifier @tab Description
+@item @code{z} @tab Print ``@code{zero}'' instead of 0 if the operand is an 
immediate with a value of zero.
+@item @code{i} @tab Print the character ``@code{i}'' if the operand is an 
immediate.
+@end multitable
 
 @lowersections
 @include md.texi
-- 
2.40.1



Re: [PATCH] doc: Add doc for RISC-V Operand Modifiers

2023-07-10 Thread Jeff Law via Gcc-patches




On 7/10/23 08:19, Kito Cheng wrote:

Document the `z` and `i` operand modifiers.  We have many more modifiers
than these two, but they are the only two implemented in both GCC and
LLVM; for compatibility, I would like to document these two first, and
then review the other modifiers later to see whether any others should
be exposed and implemented in RISC-V LLVM too.

gcc/ChangeLog:

* doc/extend.texi (RISC-V Operand Modifiers): New.

OK
jeff


Re: [PATCH] doc: Add doc for RISC-V Operand Modifiers

2023-07-10 Thread Kito Cheng via Gcc-patches
thanks, pushed to trunk :)

On Mon, Jul 10, 2023 at 10:33 PM Jeff Law via Gcc-patches
 wrote:
>
>
>
> On 7/10/23 08:19, Kito Cheng wrote:
> > Document the `z` and `i` operand modifiers.  We have many more modifiers
> > than these two, but they are the only two implemented in both GCC and
> > LLVM; for compatibility, I would like to document these two first, and
> > then review the other modifiers later to see whether any others should
> > be exposed and implemented in RISC-V LLVM too.
> >
> > gcc/ChangeLog:
> >
> >   * doc/extend.texi (RISC-V Operand Modifiers): New.
> OK
> jeff


[pushed] c++: redeclare_class_template and ttps [PR110523]

2023-07-10 Thread Patrick Palka via Gcc-patches
Tested on x86_64-pc-linux-gnu, pushed to trunk as obvious.

-- >8 --

Now that we cache level-lowered ttps we can end up processing the same
ttp multiple times via (multiple calls to) redeclare_class_template, so
we can't assume a ttp's DECL_CONTEXT is initially empty.

PR c++/110523

gcc/cp/ChangeLog:

* pt.cc (redeclare_class_template): Relax the ttp DECL_CONTEXT
assert, and downgrade it to a checking assert.

gcc/testsuite/ChangeLog:

* g++.dg/template/ttp37.C: New test.
---
 gcc/cp/pt.cc  |  3 ++-
 gcc/testsuite/g++.dg/template/ttp37.C | 15 +++
 2 files changed, 17 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/g++.dg/template/ttp37.C

diff --git a/gcc/cp/pt.cc b/gcc/cp/pt.cc
index d7d774fd9e5..076f788281e 100644
--- a/gcc/cp/pt.cc
+++ b/gcc/cp/pt.cc
@@ -6388,7 +6388,8 @@ redeclare_class_template (tree type, tree parms, tree 
cons)
 DECL_CONTEXT of the template for which they are a parameter.  */
   if (TREE_CODE (parm) == TEMPLATE_DECL)
{
- gcc_assert (DECL_CONTEXT (parm) == NULL_TREE);
+ gcc_checking_assert (DECL_CONTEXT (parm) == NULL_TREE
+  || DECL_CONTEXT (parm) == tmpl);
  DECL_CONTEXT (parm) = tmpl;
}
 }
diff --git a/gcc/testsuite/g++.dg/template/ttp37.C 
b/gcc/testsuite/g++.dg/template/ttp37.C
new file mode 100644
index 000..c5f4e99c20a
--- /dev/null
+++ b/gcc/testsuite/g++.dg/template/ttp37.C
@@ -0,0 +1,15 @@
+// PR c++/110523
+
+template <template <typename> class>
+class basic_json;
+
+template <typename>
+struct json_pointer {
+  template <template <typename> class>
+  friend class basic_json;
+};
+
+template struct json_pointer;
+template struct json_pointer;
+template struct json_pointer;
+template struct json_pointer;
-- 
2.41.0.327.gaa9166bcc0



RE: [PATCH v2] GCSE: Export 'insert_insn_end_basic_block' as global function

2023-07-10 Thread Li, Pan2 via Gcc-patches
Committed, thanks Jeff and Richard.

Pan

-Original Message-
From: Gcc-patches  On Behalf 
Of Jeff Law via Gcc-patches
Sent: Monday, July 10, 2023 10:08 PM
To: juzhe.zh...@rivai.ai; gcc-patches@gcc.gnu.org
Cc: rguent...@suse.de
Subject: Re: [PATCH v2] GCSE: Export 'insert_insn_end_basic_block' as global 
function



On 7/10/23 02:12, juzhe.zh...@rivai.ai wrote:
> From: Ju-Zhe Zhong 
> 
> Since the VSETVL pass in the RISC-V port uses the common part of
> 'insert_insn_end_basic_block (struct gcse_expr *expr, basic_block bb)',
> and we will also use this helper function in riscv.cc in the following patches,
> extract the common code of 'insert_insn_end_basic_block (struct
> gcse_expr *expr, basic_block bb)' into a new function.  The new function
> is also called 'insert_insn_end_basic_block (rtx_insn *pat,
> basic_block bb)' but takes different arguments.
> Call 'insert_insn_end_basic_block (rtx_insn *pat, basic_block bb)' from
> 'insert_insn_end_basic_block (struct gcse_expr *expr, basic_block bb)'
> and from the VSETVL pass in the RISC-V port.
> 
> Remove redundant code from the VSETVL pass in the RISC-V port.
> 
> gcc/ChangeLog:
> 
>  * config/riscv/riscv-vsetvl.cc (add_label_notes): Remove it.
>  (insert_insn_end_basic_block): Ditto.
>  (pass_vsetvl::commit_vsetvls): Adapt for new helper function.
>  * gcse.cc (insert_insn_end_basic_block):  Export as global function.
>  * gcse.h (insert_insn_end_basic_block): Ditto.
OK.  Thanks for remembering to clean this up.

jeff


RE: [PATCH 5/19]middle-end: Enable bit-field vectorization to work correctly when we're vectoring inside conds

2023-07-10 Thread Tamar Christina via Gcc-patches
> > -  *type_out = STMT_VINFO_VECTYPE (stmt_info);
> > +  if (cond_cst)
> > +{
> > +  append_pattern_def_seq (vinfo, stmt_info, pattern_stmt, vectype);
> > +  pattern_stmt
> > +   = gimple_build_cond (gimple_cond_code (cond_stmt),
> > +gimple_get_lhs (pattern_stmt),
> > +fold_convert (ret_type, cond_cst),
> > +gimple_cond_true_label (cond_stmt),
> > +gimple_cond_false_label (cond_stmt));
> > +  *type_out = STMT_VINFO_VECTYPE (stmt_info);
> 
> is there any vectype set for a gcond?

No, because gconds can't be code-generated yet; at the moment we must
replace the original gcond when generating code.

However, looking at the diff of this code, I don't think the else is
needed here.  Testing an updated patch.

> 
> I must say the flow of the function is a bit convoluted now.  Is it possible 
> to
> factor out a helper so we can fully separate the gassign vs. gcond handling in
> this function?

I am not sure; the only changes are at the start (e.g. how we determine
bf_stmt and ret_type) and when determining shift_first for the
single-use case.

Now, I can't move the ret_type anywhere else, as I need to decompose
bf_stmt first.  And shift_first can be simplified by moving it up into
the part that determines bf_stmt, but then we would walk the immediate
uses even in cases where we exit early, which seems inefficient.

Then there's the final clause, which just generates an additional gcond
if the original statement was a gcond.  But I'm not sure that helps,
since it's something done *in addition* to the normal assign.

So there doesn't seem to be a big enough divergence to justify a split.
I have, however, made an attempt at cleaning it up a bit; is this one
better?

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* tree-vect-patterns.cc (vect_init_pattern_stmt): Copy STMT_VINFO_TYPE
from original statement.
(vect_recog_bitfield_ref_pattern): Support bitfields in gcond.

Co-Authored-By:  Andre Vieira 

--- inline copy of patch ---

diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 
60bc9be6819af9bd28a81430869417965ba9d82d..b842f7d983405cd04f6760be7d91c1f55b30aac4
 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -128,6 +128,7 @@ vect_init_pattern_stmt (vec_info *vinfo, gimple 
*pattern_stmt,
   STMT_VINFO_RELATED_STMT (pattern_stmt_info) = orig_stmt_info;
   STMT_VINFO_DEF_TYPE (pattern_stmt_info)
 = STMT_VINFO_DEF_TYPE (orig_stmt_info);
+  STMT_VINFO_TYPE (pattern_stmt_info) = STMT_VINFO_TYPE (orig_stmt_info);
   if (!STMT_VINFO_VECTYPE (pattern_stmt_info))
 {
   gcc_assert (!vectype
@@ -2441,6 +2442,10 @@ vect_recog_widen_sum_pattern (vec_info *vinfo,
bf_value = BIT_FIELD_REF (container, bitsize, bitpos);
result = (type_out) bf_value;
 
+   or
+
+   if (BIT_FIELD_REF (container, bitsize, bitpos) `cmp` )
+
where type_out is a non-bitfield type, that is to say, its precision matches
2^(TYPE_SIZE(type_out) - (TYPE_UNSIGNED (type_out) ? 1 : 2)).
 
@@ -2450,6 +2455,10 @@ vect_recog_widen_sum_pattern (vec_info *vinfo,
here it starts with:
result = (type_out) bf_value;
 
+   or
+
+   if (BIT_FIELD_REF (container, bitsize, bitpos) `cmp` )
+
Output:
 
* TYPE_OUT: The vector type of the output of this pattern.
@@ -2482,33 +2491,45 @@ vect_recog_widen_sum_pattern (vec_info *vinfo,
 
The shifting is always optional depending on whether bitpos != 0.
 
+   When the original bitfield was inside a gcond, then a new gcond is also
+   generated with the new `result` as the operand to the comparison.
+
 */
 
 static gimple *
 vect_recog_bitfield_ref_pattern (vec_info *vinfo, stmt_vec_info stmt_info,
 tree *type_out)
 {
-  gassign *first_stmt = dyn_cast  (stmt_info->stmt);
-
-  if (!first_stmt)
-return NULL;
-
-  gassign *bf_stmt;
-  if (CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (first_stmt))
-  && TREE_CODE (gimple_assign_rhs1 (first_stmt)) == SSA_NAME)
+  gimple *bf_stmt = NULL;
+  tree lhs = NULL_TREE;
+  tree ret_type = NULL_TREE;
+  gimple *stmt = STMT_VINFO_STMT (stmt_info);
+  if (gcond *cond_stmt = dyn_cast  (stmt))
+{
+  tree op = gimple_cond_lhs (cond_stmt);
+  if (TREE_CODE (op) != SSA_NAME)
+   return NULL;
+  bf_stmt = dyn_cast  (SSA_NAME_DEF_STMT (op));
+  if (TREE_CODE (gimple_cond_rhs (cond_stmt)) != INTEGER_CST)
+   return NULL;
+}
+  else if (is_gimple_assign (stmt)
+  && CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (stmt))
+  && TREE_CODE (gimple_assign_rhs1 (stmt)) == SSA_NAME)
 {
-  gimple *second_stmt
-   = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (first_stmt));
+  gimple *second_stmt = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmt));
   bf_stmt = dyn_cast  (second_stmt);
-  if (!bf_stmt
-  

[Patch, Fortran] Allow ref'ing PDT's len() in parameter-initializer [PR102003]

2023-07-10 Thread Andre Vehreschild via Gcc-patches
Hi all,

while browsing the pdt meta-bug I came across 102003 and thought to myself:
Well, that one is easy. How foolish of me...

Anyway, the solution attached prevents a pdt_len (or pdt_kind) expression in a
function call (e.g. len() or kind()) from marking the whole expression as a PDT one.
The second part of the patch in simplify.cc then takes care of either generating
the correct component ref, or, when a constant expression is required (i.e.
gfc_init_expr_flag is set), looking this up from the actual symbol
(not from the type, because there the default value is stored).

Regtested ok on x86_64-linux-gnu/Fedora 37.

Regards,
Andre
--
Andre Vehreschild * Email: vehre ad gmx dot de

gcc/fortran/ChangeLog:

* expr.cc (gfc_match_init_expr): Prevent PDT analysis for function
calls.
* simplify.cc (gfc_simplify_len): Replace len() of PDT with pdt
component ref or constant.

gcc/testsuite/ChangeLog:

* gfortran.dg/pdt_33.f03: New test.

diff --git a/gcc/fortran/expr.cc b/gcc/fortran/expr.cc
index e418f1f3301..fb6eb76cda7 100644
--- a/gcc/fortran/expr.cc
+++ b/gcc/fortran/expr.cc
@@ -3229,7 +3229,7 @@ gfc_match_init_expr (gfc_expr **result)
   return m;
 }

-  if (gfc_derived_parameter_expr (expr))
+  if (expr->expr_type != EXPR_FUNCTION && gfc_derived_parameter_expr (expr))
 {
   *result = expr;
   gfc_init_expr_flag = false;
diff --git a/gcc/fortran/simplify.cc b/gcc/fortran/simplify.cc
index 81680117f70..8fb453d0a54 100644
--- a/gcc/fortran/simplify.cc
+++ b/gcc/fortran/simplify.cc
@@ -4580,19 +4580,54 @@ gfc_simplify_len (gfc_expr *e, gfc_expr *kind)
   return range_check (result, "LEN");
 }
   else if (e->expr_type == EXPR_VARIABLE && e->ts.type == BT_CHARACTER
-	   && e->symtree->n.sym
-	   && e->symtree->n.sym->ts.type != BT_DERIVED
-	   && e->symtree->n.sym->assoc && e->symtree->n.sym->assoc->target
-	   && e->symtree->n.sym->assoc->target->ts.type == BT_DERIVED
-	   && e->symtree->n.sym->assoc->target->symtree->n.sym
-	   && UNLIMITED_POLY (e->symtree->n.sym->assoc->target->symtree->n.sym))
-
-/* The expression in assoc->target points to a ref to the _data component
-   of the unlimited polymorphic entity.  To get the _len component the last
-   _data ref needs to be stripped and a ref to the _len component added.  */
-return gfc_get_len_component (e->symtree->n.sym->assoc->target, k);
-  else
-return NULL;
+	   && e->symtree->n.sym)
+{
+  if (e->symtree->n.sym->ts.type != BT_DERIVED
+	 && e->symtree->n.sym->assoc && e->symtree->n.sym->assoc->target
+	 && e->symtree->n.sym->assoc->target->ts.type == BT_DERIVED
+	 && e->symtree->n.sym->assoc->target->symtree->n.sym
+	 && UNLIMITED_POLY (e->symtree->n.sym->assoc->target->symtree
+->n.sym))
+	/* The expression in assoc->target points to a ref to the _data
+	   component of the unlimited polymorphic entity.  To get the _len
+	   component the last _data ref needs to be stripped and a ref to the
+	   _len component added.  */
+	return gfc_get_len_component (e->symtree->n.sym->assoc->target, k);
+  else if (e->symtree->n.sym->ts.type == BT_DERIVED
+	   && e->ref && e->ref->type == REF_COMPONENT
+	   && e->ref->u.c.component->attr.pdt_string
+	   && e->ref->u.c.component->ts.type == BT_CHARACTER
+	   && e->ref->u.c.component->ts.u.cl->length)
+	{
+	  if (gfc_init_expr_flag)
+	{
+	  /* The actual length of a pdt is in its components.  In the
+		 initializer of the current ref is only the default value.
+		 Therefore traverse the chain of components and pick the correct
+		 one's initializer expressions.  */
+	  for (gfc_component *comp = e->symtree->n.sym->ts.u.derived
+		   ->components; comp != NULL; comp = comp->next)
+		{
+		  if (!strcmp (comp->name, e->ref->u.c.component->ts.u.cl
+			   ->length->symtree->name))
+		return gfc_copy_expr (comp->initializer);
+		}
+	}
+	  else
+	{
+	  gfc_expr *len_expr = gfc_copy_expr (e);
+	  gfc_free_ref_list (len_expr->ref);
+	  len_expr->ref = NULL;
+	  gfc_find_component (len_expr->symtree->n.sym->ts.u.derived, e->ref
+  ->u.c.component->ts.u.cl->length->symtree
+  ->name,
+  false, true, &len_expr->ref);
+	  len_expr->ts = len_expr->ref->u.c.component->ts;
+	  return len_expr;
+	}
+	}
+}
+  return NULL;
 }


diff --git a/gcc/testsuite/gfortran.dg/pdt_33.f03 b/gcc/testsuite/gfortran.dg/pdt_33.f03
new file mode 100644
index 000..c12bd9b411c
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/pdt_33.f03
@@ -0,0 +1,18 @@
+! { dg-do run }
+!
+! Test the fix for PR102003, where len parameters were not returned as constants.
+!
+! Contributed by Harald Anlauf  
+!
+program pr102003
+  type pdt(n)
+ integer, len :: n = 8
+ character(len=n) :: c
+  end type pdt
+  type(pdt(42)) :: p
+  integer, parameter :: m = len (p% c)
+
+  if (m /= 42) stop 1
+  if (len (p% c) /= 42) stop 2
+end
+


[x86-64] RFC: Add nosse abi attribute

2023-07-10 Thread Michael Matz via Gcc-patches
Hello,

the ELF psABI for x86-64 doesn't have any callee-saved SSE
registers (there were actual reasons for that, but those don't
matter anymore).  This starts to hurt some uses, as it means that
as soon as you have a call (say to memmove/memcpy, even if
implicit as libcall) in a loop that manipulates floating point
or vector data you get saves/restores around those calls.

But in reality many functions can be written such that they only need
to clobber a subset of the 16 XMM registers (or do the save/restore
themselves in the codepaths that need them, hello memcpy again).
So we want to introduce a way to specify this, via an ABI attribute
that basically says "doesn't clobber the high XMM regs".

I've opted to do only the obvious: do something special only for
xmm8 to xmm15, without a way to specify the clobber set in more detail.
I think such half/half split is reasonable, and as I don't want to
change the argument passing anyway (whose regs are always clobbered)
there isn't that much wiggle room anyway.

I chose to make it possible to write function definitions with that
attribute with GCC adding the necessary callee save/restore code in
the xlogue itself.  Carefully note that this is only possible for
the SSE2 registers, as other parts of them would need instructions
that are only optional.  When a function doesn't contain calls to
unknown functions we can be a bit more lenient: we can make it so that
GCC simply doesn't touch xmm8-15 at all, then no save/restore is
necessary.  If a function contains calls then GCC can't know which
parts of the XMM regset is clobbered by that, it may be parts
which don't even exist yet (say until avx2048 comes out), so we must
restrict ourself to only save/restore the SSE2 parts and then of course
can only claim to not clobber those parts.

To that end I introduce actually two related attributes (for naming
see below):
* nosseclobber: claims (and ensures) that xmm8-15 aren't clobbered
* noanysseclobber: claims (and ensures) that nothing of any of the
  registers overlapping xmm8-15 is clobbered (not even future, as of
  yet unknown, parts)

Ensuring the first is simple: potentially add saves/restore in xlogue
(e.g. when xmm8 is either used explicitly or implicitly by a call).
Ensuring the second comes with more: we must also ensure that no
functions are called that don't guarantee the same thing (in addition
to just removing all xmm8-15 parts altogether from the available
registers).

See also the added testcases for what I intended to support.

I chose to use the new target independent function-abi facility for
this.  I need some adjustments in generic code:
* the "default_abi" is actually more like a "current" abi: it happily
  changes its contents according to conditional_register_usage,
  and other code assumes that such changes do propagate.
  But if that conditional_reg_usage is actually done because the current
  function is of a different ABI, then we must not change default_abi.
* in insn_callee_abi we do look at a potential fndecl for a call
  insn (only set when -fipa-ra), but doesn't work for calls through
  pointers and (as said) is optional.  So, also always look at the
  called functions type (it's always recorded in the MEM_EXPR for
  non-libcalls), before asking the target.
  (The function-abi accessors working on trees were already doing that,
  it's just the RTL accessor that missed this)

Accordingly I also implement some more target hooks for function-abi.
With that it's possible to also move the other ABI-influencing code
of i386 to function-abi (ms_abi and friends).  I have not done so for
this patch.

Regarding the names of the attributes: gah!  I've left them at
my mediocre attempts of names in order to hopefully get input on better
names :-)

I would welcome any comments, about the names, the approach, the attempt
at documenting the intricacies of these attributes and anything.

FWIW, this particular patch was regstrapped on x86-64-linux
with trunk from a week ago (and sniff-tested on current trunk).


Ciao,
Michael.

diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
index 37cb5a0dcc4..92358f4ac41 100644
--- a/gcc/config/i386/i386-options.cc
+++ b/gcc/config/i386/i386-options.cc
@@ -3244,6 +3244,16 @@ ix86_set_indirect_branch_type (tree fndecl)
 }
 }
 
+unsigned
+ix86_fntype_to_abi_id (const_tree fntype)
+{
+  if (lookup_attribute ("nosseclobber", TYPE_ATTRIBUTES (fntype)))
+return ABI_LESS_SSE;
+  if (lookup_attribute ("noanysseclobber", TYPE_ATTRIBUTES (fntype)))
+return ABI_NO_SSE;
+  return ABI_DEFAULT;
+}
+
 /* Establish appropriate back-end context for processing the function
FNDECL.  The argument might be NULL to indicate processing at top
level, outside of any function scope.  */
@@ -3311,6 +3321,12 @@ ix86_set_current_function (tree fndecl)
   else
TREE_TARGET_GLOBALS (new_tree) = save_target_globals_default_opts ();
 }
+
+  unsigned prev_abi_id = 0;
+  if (ix86_previous_fn

Re: [PATCH] Break false dependence for vpternlog by inserting vpxor or setting constraint of input operand to '0'

2023-07-10 Thread Alexander Monakov via Gcc-patches


On Mon, 10 Jul 2023, liuhongt via Gcc-patches wrote:

> False dependency happens when destination is only updated by
> pternlog. There is no false dependency when destination is also used
> in source. So either a pxor should be inserted, or input operand
> should be set with constraint '0'.
> 
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ready to push to trunk.

Shouldn't this patch also remove uses of vpternlog in
standard_sse_constant_opcode?

A couple more questions below:

> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -1382,6 +1382,29 @@ (define_insn "mov_internal"
> ]
> (symbol_ref "true")))])
>  
> +; False dependency happens on destination register which is not really
> +; used when moving all ones to vector register
> +(define_split
> +  [(set (match_operand:VMOVE 0 "register_operand")
> + (match_operand:VMOVE 1 "int_float_vector_all_ones_operand"))]
> +  "TARGET_AVX512F && reload_completed
> +  && ( == 64 || EXT_REX_SSE_REG_P (operands[0]))
> +  && optimize_function_for_speed_p (cfun)"

Yan's patch used optimize_insn_for_speed_p (), which looks more appropriate.
Doesn't it work here as well?

> +  [(set (match_dup 0) (match_dup 2))
> +   (parallel
> + [(set (match_dup 0) (match_dup 1))
> +  (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)])]
> +  "operands[2] = CONST0_RTX (mode);")
> +
> +(define_insn "*vmov_constm1_pternlog_false_dep"
> +  [(set (match_operand:VMOVE 0 "register_operand" "=v")
> + (match_operand:VMOVE 1 "int_float_vector_all_ones_operand" 
> ""))
> +   (unspec [(match_operand:VMOVE 2 "register_operand" "0")] 
> UNSPEC_INSN_FALSE_DEP)]
> +   "TARGET_AVX512VL ||  == 64"
> +   "vpternlogd\t{$0xFF, %0, %0, %0|%0, %0, %0, 0xFF}"
> +  [(set_attr "type" "sselog1")
> +   (set_attr "prefix" "evex")])
> +
>  ;; If mem_addr points to a memory region with less than whole vector size 
> bytes
>  ;; of accessible memory and k is a mask that would prevent reading the 
> inaccessible
>  ;; bytes from mem_addr, add UNSPEC_MASKLOAD to prevent it to be transformed 
> to vpblendd
> @@ -9336,7 +9359,7 @@ (define_expand "_cvtmask2"
>  operands[3] = CONST0_RTX (mode);
>}")
>  
> -(define_insn "*_cvtmask2"
> +(define_insn_and_split "*_cvtmask2"
>[(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v,v")
>   (vec_merge:VI48_AVX512VL
> (match_operand:VI48_AVX512VL 2 "vector_all_ones_operand")
> @@ -9346,11 +9369,35 @@ (define_insn "*_cvtmask2"
>"@
> vpmovm2\t{%1, %0|%0, %1}
> vpternlog\t{$0x81, %0, %0, %0%{%1%}%{z%}|%0%{%1%}%{z%}, 
> %0, %0, 0x81}"
> +  "&& !TARGET_AVX512DQ && reload_completed
> +   && optimize_function_for_speed_p (cfun)"
> +  [(set (match_dup 0) (match_dup 4))
> +   (parallel
> +[(set (match_dup 0)
> +   (vec_merge:VI48_AVX512VL
> + (match_dup 2)
> + (match_dup 3)
> + (match_dup 1)))
> + (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)])]
> +  "operands[4] = CONST0_RTX (mode);"
>[(set_attr "isa" "avx512dq,*")
> (set_attr "length_immediate" "0,1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
>  
> +(define_insn "*_cvtmask2_pternlog_false_dep"
> +  [(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v")
> + (vec_merge:VI48_AVX512VL
> +   (match_operand:VI48_AVX512VL 2 "vector_all_ones_operand")
> +   (match_operand:VI48_AVX512VL 3 "const0_operand")
> +   (match_operand: 1 "register_operand" "Yk")))
> +   (unspec [(match_operand:VI48_AVX512VL 4 "register_operand" "0")] 
> UNSPEC_INSN_FALSE_DEP)]
> +  "TARGET_AVX512F && !TARGET_AVX512DQ"
> +  "vpternlog\t{$0x81, %0, %0, %0%{%1%}%{z%}|%0%{%1%}%{z%}, 
> %0, %0, 0x81}"
> +  [(set_attr "length_immediate" "1")
> +   (set_attr "prefix" "evex")
> +   (set_attr "mode" "")])
> +
>  (define_expand "extendv2sfv2df2"
>[(set (match_operand:V2DF 0 "register_operand")
>   (float_extend:V2DF
> @@ -17166,20 +17213,32 @@ (define_expand "one_cmpl2"
>  operands[2] = force_reg (mode, operands[2]);
>  })
>  
> -(define_insn "one_cmpl2"
> -  [(set (match_operand:VI 0 "register_operand" "=v,v")
> - (xor:VI (match_operand:VI 1 "bcst_vector_operand" "vBr,m")
> - (match_operand:VI 2 "vector_all_ones_operand" "BC,BC")))]
> +(define_insn_and_split "one_cmpl2"
> +  [(set (match_operand:VI 0 "register_operand" "=v,v,v")
> + (xor:VI (match_operand:VI 1 "bcst_vector_operand" " 0, m,Br")
> + (match_operand:VI 2 "vector_all_ones_operand" "BC,BC,BC")))]
>"TARGET_AVX512F
> && (!
> || mode == SImode
> || mode == DImode)"
>  {
> +  if (! && which_alternative
> +  && optimize_function_for_speed_p (cfun))
> +return "#";
> +
>if (TARGET_AVX512VL)
>  return "vpternlog\t{$0x55, %1, %0, 
> %0|%0, %0, %1, 0x55}";
>else
>  return "vpternlog\t{$0x55, %g1, %g0, 
> %g0|%g0, %g0, %g1, 0x55}";
>  }
> +  "&& reload_completed && !REG_P (operands[1]) && !
> +   && optimize_function_for_speed_p (cfun)"
> +

[PATCH, OBVIOUS] rs6000: Remove redundant MEM_P predicate usage

2023-07-10 Thread Peter Bergner via Gcc-patches
While helping someone on the team debug an issue, I noticed some redundant
tests in a couple of our predicates which can be removed.  I'm going to
commit the following as obvious once bootstrap and regtesting come back
clean.

Peter


rs6000: Remove redundant MEM_P predicate usage

The quad_memory_operand and vsx_quad_dform_memory_operand predicates contain
a (match_code "mem") test, making their MEM_P usage redundant.  Remove them.

gcc/
* config/rs6000/predicates.md (quad_memory_operand): Remove redundant
MEM_P usage.
(vsx_quad_dform_memory_operand): Likewise.

diff --git a/gcc/config/rs6000/predicates.md b/gcc/config/rs6000/predicates.md
index 8479331482e..3552d908e9d 100644
--- a/gcc/config/rs6000/predicates.md
+++ b/gcc/config/rs6000/predicates.md
@@ -912,7 +912,7 @@ (define_predicate "quad_memory_operand"
   if (!TARGET_QUAD_MEMORY && !TARGET_SYNC_TI)
 return false;
 
-  if (GET_MODE_SIZE (mode) != 16 || !MEM_P (op) || MEM_ALIGN (op) < 128)
+  if (GET_MODE_SIZE (mode) != 16 || MEM_ALIGN (op) < 128)
 return false;
 
   return quad_address_p (XEXP (op, 0), mode, false);
@@ -924,7 +924,7 @@ (define_predicate "quad_memory_operand"
 (define_predicate "vsx_quad_dform_memory_operand"
   (match_code "mem")
 {
-  if (!TARGET_P9_VECTOR || !MEM_P (op) || GET_MODE_SIZE (mode) != 16)
+  if (!TARGET_P9_VECTOR || GET_MODE_SIZE (mode) != 16)
 return false;
 
   return quad_address_p (XEXP (op, 0), mode, false);


Re: [x86-64] RFC: Add nosse abi attribute

2023-07-10 Thread Richard Biener via Gcc-patches



> Am 10.07.2023 um 17:56 schrieb Michael Matz via Gcc-patches 
> :
> 
> Hello,
> 
> the ELF psABI for x86-64 doesn't have any callee-saved SSE
> registers (there were actual reasons for that, but those don't
> matter anymore).  This starts to hurt some uses, as it means that
> as soon as you have a call (say to memmove/memcpy, even if
> implicit as libcall) in a loop that manipulates floating point
> or vector data you get saves/restores around those calls.
> 
> But in reality many functions can be written such that they only need
> to clobber a subset of the 16 XMM registers (or do the save/restore
> themselves in the codepaths that need them, hello memcpy again).
> So we want to introduce a way to specify this, via an ABI attribute
> that basically says "doesn't clobber the high XMM regs".
> 
> I've opted to do only the obvious: do something special only for
> xmm8 to xmm15, without a way to specify the clobber set in more detail.
> I think such half/half split is reasonable, and as I don't want to
> change the argument passing anyway (whose regs are always clobbered)
> there isn't that much wiggle room anyway.

What about xmm16 to xmm31 which AVX512 adds and any possible future additions 
to the register file?  (I suppose the any variant also covers zmm - and also 
future widened variants?). What about AVX512 mask registers?

> I chose to make it possible to write function definitions with that
> attribute with GCC adding the necessary callee save/restore code in
> the xlogue itself.  Carefully note that this is only possible for
> the SSE2 registers, as other parts of them would need instructions
> that are only optional.  When a function doesn't contain calls to
> unknown functions we can be a bit more lenient: we can make it so that
> GCC simply doesn't touch xmm8-15 at all, then no save/restore is
> necessary.  If a function contains calls then GCC can't know which
> parts of the XMM regset is clobbered by that, it may be parts
> which don't even exist yet (say until avx2048 comes out), so we must
> restrict ourself to only save/restore the SSE2 parts and then of course
> can only claim to not clobber those parts.
> 
> To that end I introduce actually two related attributes (for naming
> see below):
> * nosseclobber: claims (and ensures) that xmm8-15 aren't clobbered
> * noanysseclobber: claims (and ensures) that nothing of any of the
>  registers overlapping xmm8-15 is clobbered (not even future, as of
>  yet unknown, parts)
> 
> Ensuring the first is simple: potentially add saves/restore in xlogue
> (e.g. when xmm8 is either used explicitly or implicitly by a call).
> Ensuring the second comes with more: we must also ensure that no
> functions are called that don't guarantee the same thing (in addition
> to just removing all xmm8-15 parts altogether from the available
> registers).
> 
> See also the added testcases for what I intended to support.
> 
> I chose to use the new target independent function-abi facility for
> this.  I need some adjustments in generic code:
> * the "default_abi" is actually more like a "current" abi: it happily
>  changes its contents according to conditional_register_usage,
>  and other code assumes that such changes do propagate.
>  But if that conditional_reg_usage is actually done because the current
>  function is of a different ABI, then we must not change default_abi.
> * in insn_callee_abi we do look at a potential fndecl for a call
>  insn (only set when -fipa-ra), but doesn't work for calls through
>  pointers and (as said) is optional.  So, also always look at the
>  called functions type (it's always recorded in the MEM_EXPR for
>  non-libcalls), before asking the target.
>  (The function-abi accessors working on trees were already doing that,
>  it's just the RTL accessor that missed this)
> 
> Accordingly I also implement some more target hooks for function-abi.
> With that it's possible to also move the other ABI-influencing code
> of i386 to function-abi (ms_abi and friends).  I have not done so for
> this patch.
> 
> Regarding the names of the attributes: gah!  I've left them at
> my mediocre attempts of names in order to hopefully get input on better
> names :-)
> 
> I would welcome any comments, about the names, the approach, the attempt
> at documenting the intricacies of these attributes and anything.
> 
> FWIW, this particular patch was regstrapped on x86-64-linux
> with trunk from a week ago (and sniff-tested on current trunk).
> 
> 
> Ciao,
> Michael.
> 
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index 37cb5a0dcc4..92358f4ac41 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -3244,6 +3244,16 @@ ix86_set_indirect_branch_type (tree fndecl)
> }
> }
> 
> +unsigned
> +ix86_fntype_to_abi_id (const_tree fntype)
> +{
> +  if (lookup_attribute ("nosseclobber", TYPE_ATTRIBUTES (fntype)))
> +return ABI_LESS_SSE;
> +  if (lookup_attribute ("noanysseclobber", TYPE_ATT

[PATCH, OpenACC 2.7] readonly modifier support in front-ends

2023-07-10 Thread Chung-Lin Tang via Gcc-patches
Hi Thomas,
this patch contains support for the 'readonly' modifier in copyin clauses
and the cache directive.

As we discussed earlier, the work for actually linking this to middle-end
points-to analysis is a somewhat non-trivial issue. This first patch allows
the language feature to be used in OpenACC directives first (with no effect for 
now).
The middle-end changes are probably going to be a later patch.

(Also CCing Tobias because of the Fortran bits)

Tested on powerpc64le-linux with nvptx offloading. Is this okay for trunk?

Thanks,
Chung-Lin

2023-07-10  Chung-Lin Tang  

gcc/c/ChangeLog:
* c-parser.cc (c_parser_omp_var_list_parens):
Add 'bool *readonly = NULL' parameter, add readonly modifier parsing
support.
(c_parser_oacc_data_clause): Adjust c_parser_omp_var_list_parens call
to turn on readonly modifier parsing for copyin clause, set
OMP_CLAUSE_MAP_READONLY if readonly modifier found, update comments.
(c_parser_oacc_cache): Adjust c_parser_omp_var_list_parens call
to turn on readonly modifier parsing, set OMP_CLAUSE__CACHE__READONLY
if readonly modifier found, update comments.

gcc/cp/ChangeLog:
* parser.cc (cp_parser_omp_var_list):
Add 'bool *readonly = NULL' parameter, add readonly modifier parsing
support.
(cp_parser_oacc_data_clause): Adjust cp_parser_omp_var_list call
to turn on readonly modifier parsing for copyin clause, set
OMP_CLAUSE_MAP_READONLY if readonly modifier found, update comments.
(cp_parser_oacc_cache): Adjust cp_parser_omp_var_list call
to turn on readonly modifier parsing, set OMP_CLAUSE__CACHE__READONLY
if readonly modifier found, update comments.

gcc/fortran/ChangeLog:
* gfortran.h (typedef struct gfc_omp_namelist): Adjust map_op as
ENUM_BITFIELD field, add 'bool readonly' field.
* openmp.cc (gfc_match_omp_map_clause): Add 'bool readonly = false'
parameter, set n->u.readonly field.
(gfc_match_omp_clauses): Add readonly modifier parsing for OpenACC
copyin clause, adjust call to gfc_match_omp_map_clause.
(gfc_match_oacc_cache): Add readonly modifier parsing for OpenACC
cache directive, adjust call to gfc_match_omp_map_clause.
* trans-openmp.cc (gfc_trans_omp_clauses): Set OMP_CLAUSE_MAP_READONLY,
OMP_CLAUSE__CACHE__READONLY to 1 when readonly is set.

gcc/ChangeLog:
* tree-pretty-print.cc (dump_omp_clause): Add support for printing
OMP_CLAUSE_MAP_READONLY and OMP_CLAUSE__CACHE__READONLY.
* tree.h (OMP_CLAUSE_MAP_READONLY): New macro.
(OMP_CLAUSE__CACHE__READONLY): New macro.

gcc/testsuite/ChangeLog:
* c-c++-common/goacc/readonly-1.c: New test.
* gfortran.dg/goacc/readonly-1.f90: New test.

diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc
index d4b98d5d8b6..09e1e89d793 100644
--- a/gcc/c/c-parser.cc
+++ b/gcc/c/c-parser.cc
@@ -14059,7 +14059,8 @@ c_parser_omp_variable_list (c_parser *parser,
 
 static tree
 c_parser_omp_var_list_parens (c_parser *parser, enum omp_clause_code kind,
- tree list, bool allow_deref = false)
+ tree list, bool allow_deref = false,
+ bool *readonly = NULL)
 {
   /* The clauses location.  */
   location_t loc = c_parser_peek_token (parser)->location;
@@ -14067,6 +14068,20 @@ c_parser_omp_var_list_parens (c_parser *parser, enum 
omp_clause_code kind,
   matching_parens parens;
   if (parens.require_open (parser))
 {
+  if (readonly != NULL)
+   {
+ c_token *token = c_parser_peek_token (parser);
+ if (token->type == CPP_NAME
+ && !strcmp (IDENTIFIER_POINTER (token->value), "readonly")
+ && c_parser_peek_2nd_token (parser)->type == CPP_COLON)
+   {
+ c_parser_consume_token (parser);
+ c_parser_consume_token (parser);
+ *readonly = true;
+   }
+ else
+   *readonly = false;
+   }
   list = c_parser_omp_variable_list (parser, loc, kind, list, allow_deref);
   parens.skip_until_found_close (parser);
 }
@@ -14084,7 +14099,11 @@ c_parser_omp_var_list_parens (c_parser *parser, enum 
omp_clause_code kind,
OpenACC 2.6:
no_create ( variable-list )
attach ( variable-list )
-   detach ( variable-list ) */
+   detach ( variable-list )
+
+   OpenACC 2.7:
+   copyin (readonly : variable-list )
+ */
 
 static tree
 c_parser_oacc_data_clause (c_parser *parser, pragma_omp_clause c_kind,
@@ -14135,11 +14154,22 @@ c_parser_oacc_data_clause (c_parser *parser, 
pragma_omp_clause c_kind,
 default:
   gcc_unreachable ();
 }
+
+  /* Turn on readonly modifier parsing for copyin clause.  */
+  bool readonly = false, *readonly_ptr = NULL;
+  if (c_kind == PRAGMA_OACC_CLAUSE_COPYIN)
+readonly_ptr = &readonly;
+
   tree nl, c;
-  nl = c_parser_omp_var_list_parens (pars

[PATCH] libgcc: Fix -Wint-conversion warning in find_fde_tail

2023-07-10 Thread Florian Weimer via Gcc-patches
Fixes commit r14-1614-g49310a99330849 ("libgcc: Fix eh_frame fast path
in find_fde_tail").

libgcc/

PR libgcc/110179
* unwind-dw2-fde-dip.c (find_fde_tail): Add cast to avoid
implicit conversion of pointer value to integer.

---
 libgcc/unwind-dw2-fde-dip.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libgcc/unwind-dw2-fde-dip.c b/libgcc/unwind-dw2-fde-dip.c
index 4e0b880513f..28ea0e64e0e 100644
--- a/libgcc/unwind-dw2-fde-dip.c
+++ b/libgcc/unwind-dw2-fde-dip.c
@@ -403,7 +403,7 @@ find_fde_tail (_Unwind_Ptr pc,
 BFD ld generates.  */
   signed value __attribute__ ((mode (SI)));
   memcpy (&value, p, sizeof (value));
-  eh_frame = p + value;
+  eh_frame = (_Unwind_Ptr) (p + value);
   p += sizeof (value);
 }
   else

base-commit: 1f9b18962f2d86abafbb452bf001b72edafb6eef



Re: [Patch, Fortran] Allow ref'ing PDT's len() in parameter-initializer [PR102003]

2023-07-10 Thread Harald Anlauf via Gcc-patches

Hi Andre,

thanks for looking into this!

While it fixes the original PR, here is a minor extension of the
testcase that ICEs with your patch:

program pr102003
  type pdt(n)
 integer, len :: n = 8
 character(len=n) :: c
  end type pdt
  type(pdt(42)) :: p
  integer, parameter :: m = len (p% c)
  integer, parameter :: n = p% c% len

  if (m /= 42) stop 1
  if (len (p% c) /= 42) stop 2
  print *, p% c% len   ! OK
  if (p% c% len  /= 42) stop 3 ! OK
  print *, n   ! ICE
end

I get:

pdt_33.f03:14:27:

   14 |   integer, parameter :: n = p% c% len
  |   1
Error: non-constant initialization expression at (1)
pdt_33.f03:20:31:

   20 |   print *, n   ! ICE
  |   1
internal compiler error: tree check: expected record_type or union_type
or qual_union_type, have integer_type in gfc_conv_component_ref, at
fortran/trans-expr.cc:2757
0x84286c tree_check_failed(tree_node const*, char const*, int, char
const*, ...)
../../gcc-trunk/gcc/tree.cc:8899
0xa6d6fb tree_check3(tree_node*, char const*, int, char const*,
tree_code, tree_code, tree_code)
../../gcc-trunk/gcc/tree.h:3617
0xa90847 gfc_conv_component_ref(gfc_se*, gfc_ref*)
../../gcc-trunk/gcc/fortran/trans-expr.cc:2757
0xa91bbc gfc_conv_variable
../../gcc-trunk/gcc/fortran/trans-expr.cc:3137
0xaa8e9c gfc_conv_expr(gfc_se*, gfc_expr*)
../../gcc-trunk/gcc/fortran/trans-expr.cc:9594
0xaa92ae gfc_conv_expr_reference(gfc_se*, gfc_expr*)
../../gcc-trunk/gcc/fortran/trans-expr.cc:9713
0xad67f6 gfc_trans_transfer(gfc_code*)
../../gcc-trunk/gcc/fortran/trans-io.cc:2607
0xa43cb7 trans_code
../../gcc-trunk/gcc/fortran/trans.cc:2449
0xad37c6 build_dt
../../gcc-trunk/gcc/fortran/trans-io.cc:2051
0xa43cd7 trans_code
../../gcc-trunk/gcc/fortran/trans.cc:2421
0xa84711 gfc_generate_function_code(gfc_namespace*)
../../gcc-trunk/gcc/fortran/trans-decl.cc:7762
0x9d9ca7 translate_all_program_units
../../gcc-trunk/gcc/fortran/parse.cc:6929
0x9d9ca7 gfc_parse_file()
../../gcc-trunk/gcc/fortran/parse.cc:7235
0xa40a1f gfc_be_parse_file
../../gcc-trunk/gcc/fortran/f95-lang.cc:229

The fortran-dump confirms that n is not simplified to a constant.
So while you're at it, do you also see a solution to this variant?

Harald


Am 10.07.23 um 17:48 schrieb Andre Vehreschild via Gcc-patches:

Hi all,

while browsing the pdt meta-bug I came across 102003 and thought to myself:
Well, that one is easy. How foolish of me...

Anyway, the solution attached prevents a pdt_len (or pdt_kind) expression in a
function call (e.g. len() or kind()) to mark the whole expression as a pdt one.
The second part of the patch in simplify.cc then takes care of either generating
the correct component ref or when a constant expression (i.e.
gfc_init_expr_flag is set) is required to look this up from the actual symbol
(not from the type, because there the default value is stored).

Regtested ok on x86_64-linux-gnu/Fedora 37.

Regards,
Andre
--
Andre Vehreschild * Email: vehre ad gmx dot de




Re: [x86-64] RFC: Add nosse abi attribute

2023-07-10 Thread Alexander Monakov via Gcc-patches


On Mon, 10 Jul 2023, Michael Matz via Gcc-patches wrote:

> Hello,
> 
> the ELF psABI for x86-64 doesn't have any callee-saved SSE
> registers (there were actual reasons for that, but those don't
> matter anymore).  This starts to hurt some uses, as it means that
> as soon as you have a call (say to memmove/memcpy, even if
> implicit as libcall) in a loop that manipulates floating point
> or vector data you get saves/restores around those calls.
> 
> But in reality many functions can be written such that they only need
> to clobber a subset of the 16 XMM registers (or do the save/restore
> themselves in the codepaths that need them, hello memcpy again).
> So we want to introduce a way to specify this, via an ABI attribute
> that basically says "doesn't clobber the high XMM regs".

I think the main question is why you're going with this (weak) form
instead of the (strong) form "may only clobber the low XMM regs":
as Richi noted, surely for libcalls we'd like to know they preserve
AVX-512 mask registers as well?

(I realize this is partially answered later)

Note this interacts with anything that interposes between the caller
and the callee, like the Glibc lazy binding stub (which used to
zero out high halves of 512-bit arguments in ZMM registers).
Not an immediate problem for the patch, just something to mind perhaps.

> I've opted to do only the obvious: do something special only for
> xmm8 to xmm15, without a way to specify the clobber set in more detail.
> I think such half/half split is reasonable, and as I don't want to
> change the argument passing anyway (whose regs are always clobbered)
> there isn't that much wiggle room anyway.
> 
> I chose to make it possible to write function definitions with that
> attribute with GCC adding the necessary callee save/restore code in
> the xlogue itself.

But you can't trivially restore if the callee is sibcalling — what
happens then (a testcase might be nice)?

> Carefully note that this is only possible for
> the SSE2 registers, as other parts of them would need instructions
> that are only optional.

What is supposed to happen on 32-bit x86 with -msse -mno-sse2?

> When a function doesn't contain calls to
> unknown functions we can be a bit more lenient: we can make it so that
> GCC simply doesn't touch xmm8-15 at all, then no save/restore is
> necessary.

What if the source code has a local register variable bound to xmm15,
i.e. register double x asm("xmm15"); asm("..." : "+x"(x)); ?
Probably "don't do that", i.e. disallow that in the documentation?

> If a function contains calls then GCC can't know which
> parts of the XMM regset is clobbered by that, it may be parts
> which don't even exist yet (say until avx2048 comes out), so we must
> restrict ourselves to only save/restore the SSE2 parts and then of course
> can only claim to not clobber those parts.

Hm, I guess this is kinda the reason a "weak" form is needed. But this
highlights the difference between the two: the "weak" form will actively
preserve some state (so it cannot preserve future extensions), while
the "strong" form may just passively not touch any state, preserving
any state it doesn't know about.

> To that end I introduce actually two related attributes (for naming
> see below):
> * nosseclobber: claims (and ensures) that xmm8-15 aren't clobbered

This is the weak/active form; I'd suggest "preserve_high_sse".

> * noanysseclobber: claims (and ensures) that nothing of any of the
>   registers overlapping xmm8-15 is clobbered (not even future, as of
>   yet unknown, parts)

This is the strong/passive form; I'd suggest "only_low_sse".
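At the source level the two proposed spellings might look as follows (both names are suggestions from this reply and unknown to current compilers, so they are guarded with `__has_attribute` and otherwise expand to nothing):

```c
#ifdef __has_attribute
# if __has_attribute(preserve_high_sse)
#  define PRESERVE_HIGH_SSE __attribute__((preserve_high_sse))
# endif
# if __has_attribute(only_low_sse)
#  define ONLY_LOW_SSE __attribute__((only_low_sse))
# endif
#endif
#ifndef PRESERVE_HIGH_SSE
# define PRESERVE_HIGH_SSE /* weak form not available */
#endif
#ifndef ONLY_LOW_SSE
# define ONLY_LOW_SSE      /* strong form not available */
#endif

/* Weak/active form: callee saves and restores xmm8-15 as needed. */
PRESERVE_HIGH_SSE
static void scale (double *a, unsigned long n, double s)
{
  for (unsigned long i = 0; i < n; i++)
    a[i] *= s;
}

/* Strong/passive form: callee promises never to touch xmm8-15, nor any
   wider register overlapping them. */
ONLY_LOW_SSE
static double dot (const double *a, const double *b, unsigned long n)
{
  double acc = 0.0;
  for (unsigned long i = 0; i < n; i++)
    acc += a[i] * b[i];
  return acc;
}
```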

> Ensuring the first is simple: potentially add saves/restore in xlogue
> (e.g. when xmm8 is either used explicitly or implicitly by a call).
> Ensuring the second comes with more: we must also ensure that no
> functions are called that don't guarantee the same thing (in addition
> to just removing all xmm8-15 parts altogether from the available
> registers).
> 
> See also the added testcases for what I intended to support.
> 
> I chose to use the new target-independent function-abi facility for
> this.  I need some adjustments in generic code:
> * the "default_abi" is actually more like a "current" abi: it happily
>   changes its contents according to conditional_register_usage,
>   and other code assumes that such changes do propagate.
> But if that conditional_reg_usage is actually done because the current
>   function is of a different ABI, then we must not change default_abi.
> * in insn_callee_abi we do look at a potential fndecl for a call
>   insn (only set when -fipa-ra), but doesn't work for calls through
>   pointers and (as said) is optional.  So, also always look at the
>   called functions type (it's always recorded in the MEM_EXPR for
>   non-libcalls), before asking the target.
> (The function-abi accessors working on trees were already doing that;
>   it's just the RTL accessor that missed this.)
> 
> Accordingly I also implement some more

[PATCH] gcc-14/changes.html: Deprecate a GCC C extension on flexible array members.

2023-07-10 Thread Qing Zhao via Gcc-patches
Hi,

This is the change for the GCC 14 release notes on the deprecation of a C
extension for flexible array members.

Okay for committing?

thanks.

Qing



*htdocs/gcc-14/changes.html (Caveats): Add notice about deprecating a C
extension about flexible array members.
---
 htdocs/gcc-14/changes.html | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
index 3f797642..c7f2ce4d 100644
--- a/htdocs/gcc-14/changes.html
+++ b/htdocs/gcc-14/changes.html
@@ -30,7 +30,15 @@ a work-in-progress.
 
 Caveats
 
-  ...
+  C:
  Support for the GCC extension in which a structure containing a C99
  flexible array member, or a union containing such a structure, is not
  the last field of another structure is deprecated. Refer to
+  https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html";>
+  Zero Length Arrays.
  Any code relying on this extension should be modified to ensure that
+  C99 flexible array members only end up at the ends of structures.
+  
 
 
 
-- 
2.31.1
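For illustration, the deprecated layout versus a supported one (a minimal sketch; the struct names are made up for this example):

```c
#include <stddef.h>

struct flex { int n; int data[]; };       /* C99 flexible array member */

/* Deprecated GCC extension: a structure containing a flexible array
   member used as a non-last field of another structure. */
struct bad  { struct flex f; int tail; };

/* Supported layout: the flexible array member still ends the
   enclosing structure. */
struct good { int head; struct flex f; };
```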



Re: [PATCH ver 2] rs6000, __builtin_set_fpscr_rn add retrun value

2023-07-10 Thread Carl Love via Gcc-patches
On Thu, 2023-07-06 at 17:54 -0500, Peter Bergner wrote:
> On 6/30/23 7:58 PM, Carl Love via Gcc-patches wrote:
> > rs6000, __builtin_set_fpscr_rn add retrun value
> 
> s/retrun/return/
> 
> Maybe better written as:
> 
> rs6000: Add return value to __builtin_set_fpscr_rn

Changed subject, fixed misspelling.
> 
> 
> > Change the return value from void to double.  The return value
> > consists of
> > the FPSCR fields DRN, VE, OE, UE, ZE, XE, NI, RN bit
> > positions.  Add an
> > overloaded version which accepts a double argument.
> 
> You're not adding an overloaded version anymore, so I think you can
> just
> remove the last sentence.

Yup, didn't get that removed when removing the overloaded instance. 
fixed.

> 
> 
> 
> > The test powerpc/test_fpscr_rn_builtin.c is updated to add tests
> > for the
> > double reterun value and the new double argument.
> 
> s/reterun/return/   ...and there is no double argument anymore, so
> that
> part can be removed.

Fixed.  Note, the new return value tests were moved to new test file.
> 
> 
> 
> > * config/rs6000/rs6000.md ((rs6000_get_fpscr_fields): New
> > define_expand.
> 
> Too many '('.

fixed.

> 
> 
> 
> > (rs6000_set_fpscr_rn): Addedreturn argument.  Updated
> > to use new
> 
> Looks like a  after Added instead of a space.
> 
> 
> > rs6000_get_fpscr_fields and rs6000_update_fpscr_rn_field define
> >  _expands.
> 
> Don't split define_expand across two lines.

Fixed.

> 
> 
> 
> > * doc/extend.texi (__builtin_set_fpscr_rn): Update description
> > for
> > the return value and new double argument.  Add descripton for
> > __SET_FPSCR_RN_RETURNS_FPSCR__ macro.
> 
> s/descripton/description/

Fixed.

> 
> 
> 
> 
> 
> 
> > +  /* Tell the user the __builtin_set_fpscr_rn now returns the
> > FPSCR fields
> > + in a double.  Originally the builtin returned void.  */
> 
> Either:
>   1) s/Tell the user the __builtin_set_fpscr_rn/Tell the user
> __builtin_set_fpscr_rn/ 
>   2) s/the __builtin_set_fpscr_rn now/the __builtin_set_fpscr_rn
> built-in now/ 
> 
> 
> > +  if ((flags & OPTION_MASK_SOFT_FLOAT) == 0)
> > +  rs6000_define_or_undefine_macro (define_p,
> > "__SET_FPSCR_RN_RETURNS_FPSCR__");
> 
> This doesn't look like it's indented correctly.
> 
> 

Fixed indentation.

> 
> 
> > +(define_expand "rs6000_get_fpscr_fields"
> > + [(match_operand:DF 0 "gpc_reg_operand")]
> > +  "TARGET_HARD_FLOAT"
> > +{
> > +  /* Extract fields bits 29:31 (DRN) and bits 56:63 (VE, OE, UE,
> > ZE, XE, NI,
> > + RN) from the FPSCR and return them.  */
> > +  rtx tmp_df = gen_reg_rtx (DFmode);
> > +  rtx tmp_di = gen_reg_rtx (DImode);
> > +
> > +  emit_insn (gen_rs6000_mffs (tmp_df));
> > +  tmp_di = simplify_gen_subreg (DImode, tmp_df, DFmode, 0);
> > +  emit_insn (gen_anddi3 (tmp_di, tmp_di, GEN_INT
> > (0x000700FFULL)));
> > +  rtx tmp_rtn = simplify_gen_subreg (DFmode, tmp_di, DImode, 0);
> > +  emit_move_insn (operands[0], tmp_rtn);
> > +  DONE;
> > +})
> 
> This doesn't look correct.  You first set tmp_di to a new reg rtx but
> then
> throw that away with the return value of simplify_gen_subreg().  I'm
> guessing
> you want that tmp_di as a gen_reg_rtx for the destination of the
> gen_anddi3, so
> you probably want a different rtx for the subreg that feeds the
> gen_anddi3.

OK, fixed the use of the tmp values.  Note the define_expand was
inlined into define_expand "rs6000_set_fpscr_rn" per comments from
Kewen.  Inlining allows reusing some of the tmp values.

> 
> 
> 
> > +(define_expand "rs6000_update_fpscr_rn_field"
> > + [(match_operand:DI 0 "gpc_reg_operand")]
> > +  "TARGET_HARD_FLOAT"
> > +{
> > +  /* Insert the new RN value from operands[0] into FPSCR bit
> > [62:63].  */
> > +  rtx tmp_di = gen_reg_rtx (DImode);
> > +  rtx tmp_df = gen_reg_rtx (DFmode);
> > +
> > +  emit_insn (gen_rs6000_mffs (tmp_df));
> > +  tmp_di = simplify_gen_subreg (DImode, tmp_df, DFmode, 0);
> 
> Ditto.

Fixed.

> 
> 
> 
> 
> > +The @code{__builtin_set_fpscr_rn} builtin allows changing both of
> > the floating
> > +point rounding mode bits and returning the various FPSCR fields
> > before the RN
> > +field is updated.  The builtin returns a double consisting of the
> > initial value
> > +of the FPSCR fields DRN, VE, OE, UE, ZE, XE, NI, and RN bit
> > positions with all
> > +other bits set to zero. The builtin argument is a 2-bit value for
> > the new RN
> > +field value.  The argument can either be an @code{const int} or
> > stored in a
> > +variable.  Earlier versions of @code{__builtin_set_fpscr_rn}
> > returned void.  A
> > +@code{__SET_FPSCR_RN_RETURNS_FPSCR__} macro has been added.  If
> > defined, then
> > +the @code{__builtin_set_fpscr_rn} builtin returns the FPSCR
> > fields.  If not
> > +defined, the @code{__builtin_set_fpscr_rn} does not return a
> > vaule.  If the
> > +@option{-msoft-float} option is used, the
> > @code{__builtin_set_fpscr_rn} builtin
> > +will not return a value.
> 
> Multiple occurrences of "builtin" that should be spelled 

Re: [PATCH ver 2] rs6000, __builtin_set_fpscr_rn add retrun value

2023-07-10 Thread Carl Love via Gcc-patches
On Fri, 2023-07-07 at 12:06 +0800, Kewen.Lin wrote:
> Hi Carl,
> 
> Some more minor comments are inline below on top of Peter's
> insightful
> review comments.
> 
> on 2023/7/1 08:58, Carl Love wrote:
> > GCC maintainers:
> > 
> > Ver 2,  Went back thru the requirements and emails.  Not sure where
> > I
> > came up with the requirement for an overloaded version with double
> > argument.  Removed the overloaded version with the double
> > argument. 
> > Added the macro to announce if the __builtin_set_fpscr_rn returns a
> > void or a double with the FPSCR bits.  Updated the documentation
> > file. 
> > Retested on Power 8 BE/LE, Power 9 BE/LE, Power 10 LE.  Redid the
> > test
> > file.  Per request, the original test file functionality was not
> > changed.  Just changed the name from test_fpscr_rn_builtin.c to 
> > test_fpscr_rn_builtin_1.c.  Put new tests for the return values
> > into a
> > new test file, test_fpscr_rn_builtin_2.c.
> > 
> > The GLibC team requested a builtin to replace the mffscrn and
> > mffscrniinline asm instructions in the GLibC code.  Previously
> > there
> > was discussion on adding builtins for the mffscrn instructions.
> > 
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620261.html
> > 
> > In the end, it was felt that it would be to extend the existing
> > __builtin_set_fpscr_rn builtin to return a double instead of a void
> > type.  The desire is that we could have the functionality of the
> > mffscrn and mffscrni instructions on older ISAs.  The two
> > instructions
> > were initially added in ISA 3.0.  The __builtin_set_fpscr_rn has
> > the
> > needed functionality to set the RN field using the mffscrn and
> > mffscrni
> > instructions if ISA 3.0 is supported or fall back to using logical
> > instructions to mask and set the bits for earlier ISAs.  The
> > instructions return the current value of the FPSCR fields DRN, VE,
> > OE,
> > UE, ZE, XE, NI, RN bit positions then update the RN bit positions
> > with
> > the new RN value provided.
> > 
> > The current __builtin_set_fpscr_rn builtin has a return type of
> > void. 
> > So, changing the return type to double and returning the  FPSCR
> > fields
> > DRN, VE, OE, UE, ZE, XE, NI, RN bit positions would then give the
> > functionally equivalent of the mffscrn and mffscrni
> > instructions.  Any
> > current uses of the builtin would just ignore the return value yet
> > any
> > new uses could use the return value.  So the requirement is for the
> > change to the __builtin_set_fpscr_rn builtin to be backwardly
> > compatible and work for all ISAs.
> > 
> > The following patch changes the return type of the
> >  __builtin_set_fpscr_rn builtin from void to double.  The return
> > value
> > is the current value of the various FPSCR fields DRN, VE, OE, UE,
> > ZE,
> > XE, NI, RN bit positions when the builtin is called.  The builtin
> > then
> > updated the RN field with the new value provided as an argument to
> > the
> > builtin.  The patch adds new testcases to test_fpscr_rn_builtin.c
> > to
> > check that the builtin returns the current value of the FPSCR
> > fields
> > and then updates the RN field.
> > 
> > The GLibC team has reviewed the patch to make sure it met their
> > needs
> > as a drop in replacement for the inline asm mffscr and mffscrni
> > statements in the GLibC code.  T
> > 
> > The patch has been tested on Power 8 LE/BE, Power 9 LE/BE and Power
> > 10
> > LE.
> > 
> > Please let me know if the patch is acceptable for
> > mainline.  Thanks.
> > 
> >Carl 
> > 
> > 
> > --
> > rs6000, __builtin_set_fpscr_rn add retrun value
> > 
> > Change the return value from void to double.  The return value
> > consists of
> > the FPSCR fields DRN, VE, OE, UE, ZE, XE, NI, RN bit
> > positions.  Add an
> > overloaded version which accepts a double argument.
> > 
> > The test powerpc/test_fpscr_rn_builtin.c is updated to add tests
> > for the
> > double reterun value and the new double argument.
> > 
> > gcc/ChangeLog:
> > * config/rs6000/rs6000-builtins.def (__builtin_set_fpscr_rn):
> > Update
> > builtin definition return type.
> > * config/rs6000-c.cc(rs6000_target_modify_macros): Add check,
> > define
> > __SET_FPSCR_RN_RETURNS_FPSCR__ macro.
> > * config/rs6000/rs6000.md ((rs6000_get_fpscr_fields): New
> > define_expand.
> > (rs6000_update_fpscr_rn_field): New define_expand.
> > (rs6000_set_fpscr_rn): Addedreturn argument.  Updated
> > to use new
> > rs6000_get_fpscr_fields and rs6000_update_fpscr_rn_field define
> >  _expands.
> > * doc/extend.texi (__builtin_set_fpscr_rn): Update description
> > for
> > the return value and new double argument.  Add descripton for
> > __SET_FPSCR_RN_RETURNS_FPSCR__ macro.
> > 
> > gcc/testsuite/ChangeLog:
> > gcc.target/powerpc/test_fpscr_rn_builtin.c: Renamed to
> > test_fpscr_rn_builtin_1.c.  Added comment.
> > gcc.target/powerpc/test_fpscr_rn_builtin_2

[PATCH ver3] rs6000, Add return value to __builtin_set_fpscr_rn

2023-07-10 Thread Carl Love via Gcc-patches


GCC maintainers:

Ver 3, Renamed the patch per comments on ver 2.  Previous subject line
was " [PATCH ver 2] rs6000, __builtin_set_fpscr_rn add retrun value".  
Fixed spelling mistakes and formatting.  Updated define_expand
"rs6000_set_fpscr_rn" to have the rs6000_get_fpscr_fields and
rs6000_update_fpscr_rn_field define expands inlined.  Optimized the
code and fixed use of temporary register values. Updated the test file
dg-do run arguments and dg-options.  Removed the check for
__SET_FPSCR_RN_RETURNS_FPSCR__. Removed additional references to the
overloaded built-in with double argument.  Fixed up the documentation
file.  Updated patch retested on Power 8 BE/LE, Power 9 BE/LE and Power
10 LE.

Ver 2,  Went back thru the requirements and emails.  Not sure where I
came up with the requirement for an overloaded version with double
argument.  Removed the overloaded version with the double argument. 
Added the macro to announce if the __builtin_set_fpscr_rn returns a
void or a double with the FPSCR bits.  Updated the documentation file. 
Retested on Power 8 BE/LE, Power 9 BE/LE, Power 10 LE.  Redid the test
file.  Per request, the original test file functionality was not
changed.  Just changed the name from test_fpscr_rn_builtin.c to 
test_fpscr_rn_builtin_1.c.  Put new tests for the return values into a
new test file, test_fpscr_rn_builtin_2.c.

The GLibC team requested a builtin to replace the mffscrn and
mffscrni inline asm instructions in the GLibC code.  Previously there
was discussion on adding builtins for the mffscrn instructions.

https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620261.html

In the end, it was felt that it would be best to extend the existing
__builtin_set_fpscr_rn builtin to return a double instead of a void
type.  The desire is that we could have the functionality of the
mffscrn and mffscrni instructions on older ISAs.  The two instructions
were initially added in ISA 3.0.  The __builtin_set_fpscr_rn has the
needed functionality to set the RN field using the mffscrn and mffscrni
instructions if ISA 3.0 is supported or fall back to using logical
instructions to mask and set the bits for earlier ISAs.  The
instructions return the current value of the FPSCR fields DRN, VE, OE,
UE, ZE, XE, NI, RN bit positions then update the RN bit positions with
the new RN value provided.

The current __builtin_set_fpscr_rn builtin has a return type of void. 
So, changing the return type to double and returning the  FPSCR fields
DRN, VE, OE, UE, ZE, XE, NI, RN bit positions would then give the
functionally equivalent of the mffscrn and mffscrni instructions.  Any
current uses of the builtin would just ignore the return value yet any
new uses could use the return value.  So the requirement is for the
change to the __builtin_set_fpscr_rn builtin to be backwardly
compatible and work for all ISAs.

The following patch changes the return type of the
 __builtin_set_fpscr_rn builtin from void to double.  The return value
is the current value of the various FPSCR fields DRN, VE, OE, UE, ZE,
XE, NI, RN bit positions when the builtin is called.  The builtin then
updates the RN field with the new value provided as an argument to the
builtin.  The patch adds new testcases to test_fpscr_rn_builtin.c to
check that the builtin returns the current value of the FPSCR fields
and then updates the RN field.
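As a portable sketch of how a caller might use the new return value (only the builtin itself comes from the patch; the RN-extraction helper below is my own illustration of the documented field layout):

```c
#include <stdint.h>
#include <string.h>

/* The returned double is a bit image of selected FPSCR fields; RN sits
   in FPSCR bits 62:63, i.e. the low two bits of the 64-bit image. */
static unsigned rn_from_fpscr_image (double image)
{
  uint64_t bits;
  memcpy (&bits, &image, sizeof bits);   /* bit-for-bit view of the image */
  return (unsigned) (bits & 3);
}

/* On PowerPC with the patched GCC a caller could then save and restore
   the rounding mode:
     double old = __builtin_set_fpscr_rn (0);             set RN, keep image
     __builtin_set_fpscr_rn (rn_from_fpscr_image (old));  restore RN      */
```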

The GLibC team has reviewed the patch to make sure it met their needs
as a drop-in replacement for the inline asm mffscrn and mffscrni
statements in the GLibC code.

The patch has been tested on Power 8 LE/BE, Power 9 LE/BE and Power 10
LE.

Please let me know if the patch is acceptable for mainline.  Thanks.

   Carl 



-
rs6000, Add return value to __builtin_set_fpscr_rn

Change the return value from void to double for __builtin_set_fpscr_rn.
The return value consists of the FPSCR fields DRN, VE, OE, UE, ZE, XE, NI,
RN bit positions.  A new test file, test powerpc/test_fpscr_rn_builtin_2.c,
is added to test the new return value for the built-in.

gcc/ChangeLog:
* config/rs6000/rs6000-builtins.def (__builtin_set_fpscr_rn): Update
built-in definition return type.
* config/rs6000-c.cc (rs6000_target_modify_macros): Add check,
define __SET_FPSCR_RN_RETURNS_FPSCR__ macro.
* config/rs6000/rs6000.md (rs6000_set_fpscr_rn): Added return
argument to return FPSCR fields.
* doc/extend.texi (__builtin_set_fpscr_rn): Update description for
the return value.  Add description for
__SET_FPSCR_RN_RETURNS_FPSCR__ macro.

gcc/testsuite/ChangeLog:
gcc.target/powerpc/test_fpscr_rn_builtin.c: Renamed to
test_fpscr_rn_builtin_1.c.  Added comment.
gcc.target/powerpc/test_fpscr_rn_builtin_2.c: New test for the
return value of __builtin_set_fpscr_rn builtin.
---
 gcc/config/rs6000/rs6000-builtins.def |   2 +-
 gcc/config/rs6000/rs6000-c.cc |

Re: [PATCH] libgcc: Fix -Wint-conversion warning in find_fde_tail

2023-07-10 Thread Jakub Jelinek via Gcc-patches
On Mon, Jul 10, 2023 at 08:54:54PM +0200, Florian Weimer via Gcc-patches wrote:
> Fixes commit r14-1614-g49310a99330849 ("libgcc: Fix eh_frame fast path
> in find_fde_tail").
> 
> libgcc/
> 
>   PR libgcc/110179
>   * unwind-dw2-fde-dip.c (find_fde_tail): Add cast to avoid
>   implicit conversion of pointer value to integer.

Ok, thanks.

Jakub



Re: [Patch, Fortran] Allow ref'ing PDT's len() in parameter-initializer [PR102003]

2023-07-10 Thread Andre Vehreschild via Gcc-patches

Hi Harald,

I do get why this happens. I still don't get why I have to do this
'optimization' manually. I mean, this rewriting of expressions is needed in
more than one location and most probably already present somewhere. So who
can point me in the right direction?

Regards,
Andre

Andre Vehreschild


Re: [PATCH v3 0/3] c++: Track lifetimes in constant evaluation [PR70331,...]

2023-07-10 Thread Patrick Palka via Gcc-patches
On Sat, 1 Jul 2023, Nathaniel Shead wrote:

> This is an update of the patch series at
> https://gcc.gnu.org/pipermail/gcc-patches/2023-March/614811.html
> 
> Changes since v2:
> 
> - Use a separate 'hash_set' to track expired variables instead of
>   adding a flag to 'lang_decl_base'.
> - Use 'iloc_sentinel' to propagate location information down to
>   subexpressions instead of manually saving and falling back to a
>   parent expression's location.
> - Update more tests with improved error location information.

Thanks very much! This patch series looks good to me.

> 
> Bootstrapped and regtested on x86_64-pc-linux-gnu.
> 
> ---
> 
> Nathaniel Shead (3):
>   c++: Track lifetimes in constant evaluation [PR70331,PR96630,PR98675]
>   c++: Improve constexpr error for dangling local variables
>   c++: Improve location information in constant evaluation
> 
>  gcc/cp/constexpr.cc   | 158 +++---
>  gcc/cp/semantics.cc   |   5 +-
>  gcc/cp/typeck.cc  |   5 +-
>  gcc/testsuite/g++.dg/cpp0x/constexpr-48089.C  |  10 +-
>  gcc/testsuite/g++.dg/cpp0x/constexpr-70323.C  |   8 +-
>  gcc/testsuite/g++.dg/cpp0x/constexpr-70323a.C |   8 +-
>  .../g++.dg/cpp0x/constexpr-delete2.C  |   5 +-
>  gcc/testsuite/g++.dg/cpp0x/constexpr-diag3.C  |   2 +-
>  gcc/testsuite/g++.dg/cpp0x/constexpr-ice20.C  |   1 +
>  .../g++.dg/cpp0x/constexpr-recursion.C|   6 +-
>  gcc/testsuite/g++.dg/cpp0x/overflow1.C|   2 +-
>  gcc/testsuite/g++.dg/cpp1y/constexpr-89285.C  |   5 +-
>  gcc/testsuite/g++.dg/cpp1y/constexpr-89481.C  |   3 +-
>  .../g++.dg/cpp1y/constexpr-lifetime1.C|  14 ++
>  .../g++.dg/cpp1y/constexpr-lifetime2.C|  20 +++
>  .../g++.dg/cpp1y/constexpr-lifetime3.C|  13 ++
>  .../g++.dg/cpp1y/constexpr-lifetime4.C|  11 ++
>  .../g++.dg/cpp1y/constexpr-lifetime5.C|  11 ++
>  .../g++.dg/cpp1y/constexpr-tracking-const14.C |   3 +-
>  .../g++.dg/cpp1y/constexpr-tracking-const16.C |   3 +-
>  .../g++.dg/cpp1y/constexpr-tracking-const18.C |   4 +-
>  .../g++.dg/cpp1y/constexpr-tracking-const19.C |   4 +-
>  .../g++.dg/cpp1y/constexpr-tracking-const21.C |   4 +-
>  .../g++.dg/cpp1y/constexpr-tracking-const22.C |   4 +-
>  .../g++.dg/cpp1y/constexpr-tracking-const3.C  |   3 +-
>  .../g++.dg/cpp1y/constexpr-tracking-const4.C  |   3 +-
>  .../g++.dg/cpp1y/constexpr-tracking-const7.C  |   3 +-
>  gcc/testsuite/g++.dg/cpp1y/constexpr-union5.C |   4 +-
>  gcc/testsuite/g++.dg/cpp1y/pr68180.C  |   4 +-
>  .../g++.dg/cpp1z/constexpr-lambda6.C  |   4 +-
>  .../g++.dg/cpp1z/constexpr-lambda8.C  |   5 +-
>  gcc/testsuite/g++.dg/cpp2a/bit-cast11.C   |  10 +-
>  gcc/testsuite/g++.dg/cpp2a/bit-cast12.C   |  10 +-
>  gcc/testsuite/g++.dg/cpp2a/bit-cast14.C   |  14 +-
>  gcc/testsuite/g++.dg/cpp2a/constexpr-98122.C  |   4 +-
>  .../g++.dg/cpp2a/constexpr-dynamic17.C|   5 +-
>  gcc/testsuite/g++.dg/cpp2a/constexpr-init1.C  |   5 +-
>  gcc/testsuite/g++.dg/cpp2a/constexpr-new12.C  |   6 +-
>  gcc/testsuite/g++.dg/cpp2a/constexpr-new3.C   |  10 +-
>  gcc/testsuite/g++.dg/cpp2a/constinit10.C  |   5 +-
>  .../g++.dg/cpp2a/is-corresponding-member4.C   |   4 +-
>  gcc/testsuite/g++.dg/ext/constexpr-vla2.C |   4 +-
>  gcc/testsuite/g++.dg/ext/constexpr-vla3.C |   4 +-
>  gcc/testsuite/g++.dg/ubsan/pr63956.C  |  23 +--
>  .../g++.dg/warn/Wreturn-local-addr-6.C|   3 -
>  .../25_algorithms/equal/constexpr_neg.cc  |   7 +-
>  .../testsuite/26_numerics/gcd/105844.cc   |  10 +-
>  .../testsuite/26_numerics/lcm/105844.cc   |  14 +-
>  48 files changed, 330 insertions(+), 143 deletions(-)
>  create mode 100644 gcc/testsuite/g++.dg/cpp1y/constexpr-lifetime1.C
>  create mode 100644 gcc/testsuite/g++.dg/cpp1y/constexpr-lifetime2.C
>  create mode 100644 gcc/testsuite/g++.dg/cpp1y/constexpr-lifetime3.C
>  create mode 100644 gcc/testsuite/g++.dg/cpp1y/constexpr-lifetime4.C
>  create mode 100644 gcc/testsuite/g++.dg/cpp1y/constexpr-lifetime5.C
> 
> -- 
> 2.41.0
> 
> 

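The class of code the lifetime tracking in this series targets can be sketched as follows (my own minimal example, not taken from the testsuite):

```cpp
// Fine: x is alive for the whole constant evaluation.
constexpr int ok ()
{
  int x = 5;
  int *p = &x;
  return *p;
}
static_assert (ok () == 5, "x is alive when *p is read");

// Rejected with the series applied: by the time *p is read, x's
// lifetime has ended, so constant evaluation must fail (PR70331).
// constexpr int bad ()
// {
//   int *p = nullptr;
//   { int x = 5; p = &x; }
//   return *p;
// }
```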


[PATCH] Optimize vec_splats of vec_extract for V2DI/V2DF (PR target/99293)

2023-07-10 Thread Michael Meissner via Gcc-patches
This patch optimizes cases like:

vector double v1, v2;
/* ... */
v2 = vec_splats (vec_extract (v1, 0));  /* or  */
v2 = vec_splats (vec_extract (v1, 1));

Previously:

vector long long
splat_dup_l_0 (vector long long v)
{
  return __builtin_vec_splats (__builtin_vec_extract (v, 0));
}

would generate:

mfvsrld 9,34
mtvsrdd 34,9,9
blr

With this patch, GCC generates:

xxpermdi 34,34,34,3
blr

2023-07-10  Michael Meissner  

gcc/

PR target/99293
* config/rs6000/vsx.md (vsx_splat_extract_): New combiner
insn.

gcc/testsuite/

PR target/99293
* gcc.target/powerpc/pr99293.c: New test.
* gcc.target/powerpc/builtins-1.c: Update insn count.
---
 gcc/config/rs6000/vsx.md  | 18 ++
 gcc/testsuite/gcc.target/powerpc/builtins-1.c |  2 +-
 gcc/testsuite/gcc.target/powerpc/pr99293.c| 55 +++
 3 files changed, 74 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr99293.c

diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 0c269e4e8d9..d34c3b21abe 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -4600,6 +4600,24 @@ (define_insn "vsx_splat__mem"
   "lxvdsx %x0,%y1"
   [(set_attr "type" "vecload")])
 
+;; Optimize SPLAT of an extract from a V2DF/V2DI vector with a constant element
+(define_insn "*vsx_splat_extract_"
+  [(set (match_operand:VSX_D 0 "vsx_register_operand" "=wa")
+   (vec_duplicate:VSX_D
+(vec_select:
+ (match_operand:VSX_D 1 "vsx_register_operand" "wa")
+ (parallel [(match_operand 2 "const_0_to_1_operand" "n")]]
+  "VECTOR_MEM_VSX_P (mode)"
+{
+  int which_word = INTVAL (operands[2]);
+  if (!BYTES_BIG_ENDIAN)
+which_word = 1 - which_word;
+
+  operands[3] = GEN_INT (which_word ? 3 : 0);
+  return "xxpermdi %x0,%x1,%x1,%3";
+}
+  [(set_attr "type" "vecperm")])
+
 ;; V4SI splat support
 (define_insn "vsx_splat_v4si"
   [(set (match_operand:V4SI 0 "vsx_register_operand" "=wa,wa")
diff --git a/gcc/testsuite/gcc.target/powerpc/builtins-1.c b/gcc/testsuite/gcc.target/powerpc/builtins-1.c
index 28cd1aa6b1a..98783668bce 100644
--- a/gcc/testsuite/gcc.target/powerpc/builtins-1.c
+++ b/gcc/testsuite/gcc.target/powerpc/builtins-1.c
@@ -1035,4 +1035,4 @@ foo156 (vector unsigned short usa)
 /* { dg-final { scan-assembler-times {\mvmrglb\M} 3 } } */
 /* { dg-final { scan-assembler-times {\mvmrgew\M} 4 } } */
 /* { dg-final { scan-assembler-times {\mvsplth|xxsplth\M} 4 } } */
-/* { dg-final { scan-assembler-times {\mxxpermdi\M} 44 } } */
+/* { dg-final { scan-assembler-times {\mxxpermdi\M} 42 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/pr99293.c b/gcc/testsuite/gcc.target/powerpc/pr99293.c
new file mode 100644
index 000..e5f44bd7346
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr99293.c
@@ -0,0 +1,55 @@
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-options "-O2 -mpower8-vector" } */
+
+/* Test for PR 99293, which wants to do:
+   __builtin_vec_splats (__builtin_vec_extract (v, n))
+
+   where v is a V2DF or V2DI vector and n is either 0 or 1.  Previously the GCC
+   compiler would do a direct move to the GPR registers to select the item and a
+   direct move from the GPR registers to do the splat.
+
+   Before the patch, splat_dup_ll_0 or splat_dup_dbl_0 below would generate:
+
+mfvsrld 9,34
+mtvsrdd 34,9,9
+blr
+
+   and now it generates:
+
+xxpermdi 34,34,34,3
+blr  */
+
+#include 
+
+vector long long
+splat_dup_ll_0 (vector long long v)
+{
+  /* xxpermdi 34,34,34,3 */
+  return __builtin_vec_splats (vec_extract (v, 0));
+}
+
+vector double
+splat_dup_dbl_0 (vector double v)
+{
+  /* xxpermdi 34,34,34,3 */
+  return __builtin_vec_splats (vec_extract (v, 0));
+}
+
+vector long long
+splat_dup_ll_1 (vector long long v)
+{
+  /* xxpermdi 34,34,34,0 */
+  return __builtin_vec_splats (vec_extract (v, 1));
+}
+
+vector double
+splat_dup_dbl_1 (vector double v)
+{
+  /* xxpermdi 34,34,34,0 */
+  return __builtin_vec_splats (vec_extract (v, 1));
+}
+
+/* { dg-final { scan-assembler-times "xxpermdi" 4 } } */
+/* { dg-final { scan-assembler-not   "mfvsrd" } } */
+/* { dg-final { scan-assembler-not   "mfvsrld"} } */
+/* { dg-final { scan-assembler-not   "mtvsrdd"} } */
-- 
2.41.0


-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meiss...@linux.ibm.com


[PATCH] Improve 64->128 bit zero extension on PowerPC (PR target/108958)

2023-07-10 Thread Michael Meissner via Gcc-patches
If we are converting an unsigned DImode to a TImode value, and the TImode value
will go in a vector register, GCC currently does the DImode to TImode conversion
in GPR registers, and then moves the value to the vector register via a mtvsrdd
instruction.

This patch adds a new zero_extendditi2 insn which optimizes moving a GPR to a
vector register using the mtvsrdd instruction with RA=0, and using lxvrdx to
load a 64-bit value into the bottom 64-bits of the vector register.
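In portable terms the new insn implements a plain 64-to-128-bit zero extension; with GCC's `__int128` (assuming the target provides it) that is just:

```c
/* DImode -> TImode zero extension: the input lands in the low 64 bits
   and the high 64 bits become zero. */
static unsigned __int128 zext64to128 (unsigned long long x)
{
  return x;   /* implicit zero extension */
}
```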

2023-07-10  Michael Meissner  

gcc/

PR target/108958
* config/rs6000/rs6000.md (zero_extendditi2): New insn.

gcc/testsuite/

PR target/108958
* gcc.target/powerpc/pr108958.c: New test.
---
 gcc/config/rs6000/rs6000.md | 52 +++
 gcc/testsuite/gcc.target/powerpc/pr108958.c | 57 +
 2 files changed, 109 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108958.c

diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index cdab49fbb91..1a3d6316eab 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -987,6 +987,58 @@ (define_insn_and_split "*zero_extendsi2_dot2"
(set_attr "dot" "yes")
(set_attr "length" "4,8")])
 
+(define_insn_and_split "zero_extendditi2"
+  [(set (match_operand:TI 0 "gpc_reg_operand" "=r,r,wa,wa,wa")
+   (zero_extend:TI
+(match_operand:DI 1 "reg_or_mem_operand" "r,m,b,Z,wa")))
+   (clobber (match_scratch:DI 2 "=X,X,X,X,&wa"))]
+  "TARGET_POWERPC64 && TARGET_P9_VECTOR"
+  "@
+   #
+   #
+   mtvsrdd %x0,0,%1
+   lxvrdx %x0,%y1
+   #"
+  "&& reload_completed
+   && (int_reg_operand (operands[0], TImode)
+   || (vsx_register_operand (operands[0], TImode)
+  && vsx_register_operand (operands[1], DImode)))"
+  [(set (match_dup 2) (match_dup 1))
+   (set (match_dup 3) (const_int 0))]
+{
+  rtx dest = operands[0];
+  rtx src = operands[1];
+
+  /* If we are converting a VSX DImode to VSX TImode, we need to move the upper
+ 64-bits (DImode) to the lower 64-bits.  We can't just do a xxpermdi
+ instruction to swap the two 64-bit words, because can't rely on the bottom
+ 64-bits of the VSX register being 0.  Instead we create a 0 and do the
+ xxpermdi operation to combine the two registers.  */
+  if (vsx_register_operand (dest, TImode)
+  && vsx_register_operand (src, DImode))
+{
+  rtx tmp = operands[2];
+  emit_move_insn (tmp, const0_rtx);
+
+  rtx hi = tmp;
+  rtx lo = src;
+  if (!BYTES_BIG_ENDIAN)
+   std::swap (hi, lo);
+
+  rtx dest_v2di = gen_rtx_REG (V2DImode, reg_or_subregno (dest));
+  emit_insn (gen_vsx_concat_v2di (dest_v2di, hi, lo));
+  DONE;
+}
+
+  /* If we are zero extending to a GPR register either from a GPR register,
+ a VSX register or from memory, do the zero extend operation to the
+ lower DI register, and set the upper DI register to 0.  */
+  operands[2] = gen_lowpart (DImode, dest);
+  operands[3] = gen_highpart (DImode, dest);
+}
+  [(set_attr "type" "*,load,vecexts,vecload,vecperm")
+   (set_attr "isa" "*,*,p9v,p10,*")
+   (set_attr "length" "8,8,*,*,8")])
 
 (define_insn "extendqi2"
   [(set (match_operand:EXTQI 0 "gpc_reg_operand" "=r,?*v")
diff --git a/gcc/testsuite/gcc.target/powerpc/pr108958.c b/gcc/testsuite/gcc.target/powerpc/pr108958.c
new file mode 100644
index 000..85ea0976f91
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr108958.c
@@ -0,0 +1,57 @@
+/* { dg-require-effective-target int128 } */
+/* { dg-require-effective-target power10_ok } */
+/* { dg-options "-mdejagnu-cpu=power10 -O2" } */
+
+/* This patch makes sure the various optimization and code paths are done for
+   zero extending DImode to TImode on power10 (PR target/pr108958).  */
+
+__uint128_t
+gpr_to_gpr (unsigned long long a)
+{
+  return a;  /* li 4,0.  */
+}
+
+__uint128_t
+mem_to_gpr (unsigned long long *p)
+{
+  return *p;   /* ld 3,0(3); li 4,0.  */
+}
+
+__uint128_t
+vsx_to_gpr (double d)
+{
+  return (unsigned long long)d;/* fctiduz 0,1; li 4,0; mfvsrd 3,0.  */
+}
+
+void
+gpr_to_vsx (__uint128_t *p, unsigned long long a)
+{
+  __uint128_t b = a;   /* mtvsrdd 0,0,4; stxv 0,0(3).  */
+  __asm__ (" # %x0" : "+wa" (b));
+  *p = b;
+}
+
+void
+mem_to_vsx (__uint128_t *p, unsigned long long *q)
+{
+  __uint128_t a = *q;  /* lxvrdx 0,0,4; stxv 0,0(3).  */
+  __asm__ (" # %x0" : "+wa" (a));
+  *p = a;
+}
+
+void
+vsx_to_vsx (__uint128_t *p, double d)
+{
+  /* fctiduz 1,1; xxspltib 0,0; xxpermdi 0,0,1,0; stxv 0,0(3).  */
+  __uint128_t a = (unsigned long long)d;
+  __asm__ (" # %x0" : "+wa" (a));
+  *p = a;
+}
+
+/* { dg-final { scan-assembler-times {\mld\M}   1 } } */
+/* { dg-final { scan-assembler-times {\mli\M}   3 } } */
+/* { dg-final { scan-assembler-times {\mlxvrdx\M}   1 } } */
+/* { dg-final { scan-assembler-times {\mmfvsrd\M}   1 } } */
+/* { dg-fin

[committed] reorg: Change return type of predicate functions from int to bool

2023-07-10 Thread Uros Bizjak via Gcc-patches
Also change some internal variables and function arguments from int to bool.

gcc/ChangeLog:

* reorg.cc (stop_search_p): Change return type from int to bool
and adjust function body accordingly.
(resource_conflicts_p): Ditto.
(insn_references_resource_p): Change return type from int to bool.
(insn_sets_resource_p): Ditto.
(redirect_with_delay_slots_safe_p): Ditto.
(condition_dominates_p): Change return type from int to bool
and adjust function body accordingly.
(redirect_with_delay_list_safe_p): Ditto.
(check_annul_list_true_false): Ditto.  Change "annul_true_p"
function argument to bool.
(steal_delay_list_from_target): Change "pannul_p" function
argument to bool pointer.  Change "must_annul" and "used_annul"
variables from int to bool.
(steal_delay_list_from_fallthrough): Ditto.
(own_thread_p): Change return type from int to bool and adjust
function body accordingly.  Change "allow_fallthrough" function
argument to bool.
(reorg_redirect_jump): Change return type from int to bool.
(fill_simple_delay_slots): Change "non_jumps_p" function
argument from int to bool.  Change "maybe_never" variable to bool.
(fill_slots_from_thread): Change "likely", "thread_if_true" and
"own_thread" function arguments to bool.  Change "lose" and
"must_annul" variables to bool.
(delete_from_delay_slot): Change "had_barrier" variable to bool.
(try_merge_delay_insns): Change "annul_p" variable to bool.
(fill_eager_delay_slots): Change "own_target" and "own_fallthrough"
variables to bool.
(rest_of_handle_delay_slots): Change return type from int to void
and adjust function body accordingly.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/reorg.cc b/gcc/reorg.cc
index ed32c91c3fa..81290463833 100644
--- a/gcc/reorg.cc
+++ b/gcc/reorg.cc
@@ -174,10 +174,10 @@ static int *uid_to_ruid;
 /* Highest valid index in `uid_to_ruid'.  */
 static int max_uid;
 
-static int stop_search_p (rtx_insn *, int);
-static int resource_conflicts_p (struct resources *, struct resources *);
-static int insn_references_resource_p (rtx, struct resources *, bool);
-static int insn_sets_resource_p (rtx, struct resources *, bool);
+static bool stop_search_p (rtx_insn *, bool);
+static bool resource_conflicts_p (struct resources *, struct resources *);
+static bool insn_references_resource_p (rtx, struct resources *, bool);
+static bool insn_sets_resource_p (rtx, struct resources *, bool);
 static rtx_code_label *find_end_label (rtx);
 static rtx_insn *emit_delay_sequence (rtx_insn *, const vec &,
  int);
@@ -188,35 +188,35 @@ static void note_delay_statistics (int, int);
 static int get_jump_flags (const rtx_insn *, rtx);
 static int mostly_true_jump (rtx);
 static rtx get_branch_condition (const rtx_insn *, rtx);
-static int condition_dominates_p (rtx, const rtx_insn *);
-static int redirect_with_delay_slots_safe_p (rtx_insn *, rtx, rtx);
-static int redirect_with_delay_list_safe_p (rtx_insn *, rtx,
-   const vec &);
-static int check_annul_list_true_false (int, const vec &);
+static bool condition_dominates_p (rtx, const rtx_insn *);
+static bool redirect_with_delay_slots_safe_p (rtx_insn *, rtx, rtx);
+static bool redirect_with_delay_list_safe_p (rtx_insn *, rtx,
+const vec &);
+static bool check_annul_list_true_false (bool, const vec &);
 static void steal_delay_list_from_target (rtx_insn *, rtx, rtx_sequence *,
  vec *,
  struct resources *,
  struct resources *,
  struct resources *,
- int, int *, int *,
+ int, int *, bool *,
  rtx *);
 static void steal_delay_list_from_fallthrough (rtx_insn *, rtx, rtx_sequence *,
   vec *,
   struct resources *,
   struct resources *,
   struct resources *,
-  int, int *, int *);
+  int, int *, bool *);
 static void try_merge_delay_insns (rtx_insn *, rtx_insn *);
 static rtx_insn *redundant_insn (rtx, rtx_insn *, const vec &);
-static int own_thread_p (rtx, rtx, int);
+static bool own_thread_p (rtx, rtx, bool);
 static void update_block (rtx_insn *, rtx_insn *);
-static int reorg_redirect_jump (rtx_jump_insn *, rtx);
+static bool reorg_redirect_jump (rtx_jump_insn *, rtx);
 static void update_reg_dead_notes (rtx_insn *, rtx_insn *);
 static void fix_reg_dead_note (rtx_insn *, rtx);
 static void update_reg_unus

[PATCH] Fix typo in insn name.

2023-07-10 Thread Michael Meissner via Gcc-patches
In doing other work, I noticed that there was an insn:

vsx_extract_v4sf__load

Which did not have an iterator.  I removed the useless .

I have tested this patch on the following systems and there was no degradation.
Can I check it into the trunk branch?

*   Power10, LE, --with-cpu=power10, IBM 128-bit long double
*   Power9,  LE, --with-cpu=power9,  IBM 128-bit long double
*   Power9,  LE, --with-cpu=power9,  IEEE 128-bit long double
*   Power9,  LE, --with-cpu=power9,  64-bit default long double
*   Power9,  BE, --with-cpu=power9,  IBM 128-bit long double
*   Power8,  BE, --with-cpu=power8,  IBM 128-bit long double

2023-07-10  Michael Meissner  

gcc/

* config/rs6000/vsx.md (vsx_extract_v4sf_load): Rename from
vsx_extract_v4sf__load.
---
 gcc/config/rs6000/vsx.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index d34c3b21abe..aed450e31ec 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -3576,7 +3576,7 @@ (define_insn_and_split "vsx_extract_v4sf"
   [(set_attr "length" "8")
(set_attr "type" "fp")])
 
-(define_insn_and_split "*vsx_extract_v4sf__load"
+(define_insn_and_split "*vsx_extract_v4sf_load"
   [(set (match_operand:SF 0 "register_operand" "=f,v,v,?r")
(vec_select:SF
 (match_operand:V4SF 1 "memory_operand" "m,Z,m,m")
-- 
2.41.0




Re: [PATCH] Optimize vec_splats of vec_extract for V2DI/V2DF (PR target/99293)

2023-07-10 Thread Michael Meissner via Gcc-patches
I forgot to add:

I have tested this patch on the following systems and there was no degradation.
Can I check it into the trunk branch?

*   Power10, LE, --with-cpu=power10, IBM 128-bit long double
*   Power9,  LE, --with-cpu=power9,  IBM 128-bit long double
*   Power9,  LE, --with-cpu=power9,  IEEE 128-bit long double
*   Power9,  LE, --with-cpu=power9,  64-bit default long double
*   Power9,  BE, --with-cpu=power9,  IBM 128-bit long double
*   Power8,  BE, --with-cpu=power8,  IBM 128-bit long double



Re: [PATCH] Improve 64->128 bit zero extension on PowerPC (PR target/108958)

2023-07-10 Thread Michael Meissner via Gcc-patches
I forgot to add:

I have tested this patch on the following systems and there was no degradation.
Can I check it into the trunk branch?

*   Power10, LE, --with-cpu=power10, IBM 128-bit long double
*   Power9,  LE, --with-cpu=power9,  IBM 128-bit long double
*   Power9,  LE, --with-cpu=power9,  IEEE 128-bit long double
*   Power9,  LE, --with-cpu=power9,  64-bit default long double
*   Power9,  BE, --with-cpu=power9,  IBM 128-bit long double
*   Power8,  BE, --with-cpu=power8,  IBM 128-bit long double



Re: [PATCH] Fix typo in insn name.

2023-07-10 Thread Segher Boessenkool
Hi!

On Mon, Jul 10, 2023 at 03:59:44PM -0400, Michael Meissner wrote:
> In doing other work, I noticed that there was an insn:
> 
>   vsx_extract_v4sf__load
> 
> Which did not have an iterator.  I removed the useless .

This patch does that, you mean.

> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -3576,7 +3576,7 @@ (define_insn_and_split "vsx_extract_v4sf"
>[(set_attr "length" "8")
> (set_attr "type" "fp")])
>  
> -(define_insn_and_split "*vsx_extract_v4sf__load"
> +(define_insn_and_split "*vsx_extract_v4sf_load"
>[(set (match_operand:SF 0 "register_operand" "=f,v,v,?r")
>   (vec_select:SF
>(match_operand:V4SF 1 "memory_operand" "m,Z,m,m")

Does this fix any ICEs?  Or do you have some example that makes better
machine code after this change?  Or would a better change perhaps be to
just remove this pattern completely, if it doesn't do anything useful?

I.e., please include a new testcase.


Segher


[PATCH] testsuite: fix allocator-opt1.C FAIL with old ABI

2023-07-10 Thread Marek Polacek via Gcc-patches
Running
$ make check-g++ 
RUNTESTFLAGS='--target_board=unix\{-D_GLIBCXX_USE_CXX11_ABI=0,\} dg.exp=allocator-opt1.C'
yields:

FAIL: g++.dg/tree-ssa/allocator-opt1.C  -std=c++98  scan-tree-dump-times gimple "struct allocator D" 1
FAIL: g++.dg/tree-ssa/allocator-opt1.C  -std=c++14  scan-tree-dump-times gimple "struct allocator D" 1
FAIL: g++.dg/tree-ssa/allocator-opt1.C  -std=c++17  scan-tree-dump-times gimple "struct allocator D" 1
FAIL: g++.dg/tree-ssa/allocator-opt1.C  -std=c++20  scan-tree-dump-times gimple "struct allocator D" 1

=== g++ Summary for unix/-D_GLIBCXX_USE_CXX11_ABI=0 ===

=== g++ Summary for unix ===

because in the old ABI we get two "struct allocator D".  This patch
follows r14-658 although I'm not quite sure I follow the logic there.

Tested on x86_64-pc-linux-gnu, ok for trunk?

gcc/testsuite/ChangeLog:

* g++.dg/tree-ssa/allocator-opt1.C: Force _GLIBCXX_USE_CXX11_ABI to 1.
---
 gcc/testsuite/g++.dg/tree-ssa/allocator-opt1.C | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/gcc/testsuite/g++.dg/tree-ssa/allocator-opt1.C b/gcc/testsuite/g++.dg/tree-ssa/allocator-opt1.C
index e8394c7ad70..9f13eedb604 100644
--- a/gcc/testsuite/g++.dg/tree-ssa/allocator-opt1.C
+++ b/gcc/testsuite/g++.dg/tree-ssa/allocator-opt1.C
@@ -5,8 +5,18 @@
 // Currently the dump doesn't print the allocator template arg in this context.
 // { dg-final { scan-tree-dump-times "struct allocator D" 1 "gimple" } }
 
+// In the pre-C++11 ABI we get two allocator variables.
+#undef _GLIBCXX_USE_CXX11_ABI
+#define _GLIBCXX_USE_CXX11_ABI 1
+
+// When the library is not dual-ABI and defaults to old just compile
+// an empty TU
+#if _GLIBCXX_USE_CXX11_ABI
+
 #include 
 void f (const char *p)
 {
   std::string lst[] = { p, p, p, p };
 }
+
+#endif

base-commit: 2d7c95e31431a297060c94697af84f498abf97a2
-- 
2.41.0



Re: [x86-64] RFC: Add nosse abi attribute

2023-07-10 Thread Alexander Monakov via Gcc-patches
On Mon, 10 Jul 2023, Alexander Monakov wrote:

> > I chose to make it possible to write function definitions with that
> > attribute with GCC adding the necessary callee save/restore code in
> > the xlogue itself.
> 
> But you can't trivially restore if the callee is sibcalling — what
> happens then (a testcase might be nice)?

Sorry, when the caller is doing the sibcall, not the callee.

Alexander


Re: [PATCH ver3] rs6000, Add return value to __builtin_set_fpscr_rn

2023-07-10 Thread Peter Bergner via Gcc-patches
On 7/10/23 2:18 PM, Carl Love wrote:
> +  /* Get the current FPSCR fields, bits 29:31 (DRN) and bits 56:63 (VE, OE, 
> UE,
> +  ZE, XE, NI, RN) from the FPSCR and return them.  */

The 'Z' above should line up directly under the 'G' in Get.


> -  /* Insert new RN mode into FSCPR.  */
> -  emit_insn (gen_rs6000_mffs (tmp_df));
> -  tmp_di = simplify_gen_subreg (DImode, tmp_df, DFmode, 0);
> -  emit_insn (gen_anddi3 (tmp_di, tmp_di, GEN_INT (-4)));
> -  emit_insn (gen_iordi3 (tmp_di, tmp_di, tmp_rn));
> +  /* Insert the new RN value from tmp_rn into FPSCR bit [62:63].  */
> +  emit_insn (gen_anddi3 (tmp_di1, tmp_di2, GEN_INT (-4)));
> +  emit_insn (gen_iordi3 (tmp_di1, tmp_di1, tmp_rn));

This is an expander, so you shouldn't reuse temporaries as multiple
destination pseudos, since that limits the register allocator's freedom.
I know the old code did it, but since you're changing the line, you
might as well use a new temp.


I cannot approve it, but it LGTM with those fixed.

Peter




[Patch] libgomp: Update OpenMP memory allocation doc, fix omp_high_bw_mem_space

2023-07-10 Thread Tobias Burnus

I noted that all memory spaces are supported, some by falling
back to the default ("malloc") - except for omp_high_bw_mem_space
(unless the memkind lib is available).

I think it makes more sense to fallback to 'malloc' also for
omp_high_bw_mem_space.

Additionally, I updated the documentation to more explicitly state
what the current implementation is.

Thoughts? Wording improvement suggestions?

Tobias

PS: I wonder whether it makes sense to use libnuma besides
libmemkind (which depends on libnuma); however, the question is
when. libnuma provides numa_alloc_interleaved(_subset),
numa_alloc_local and numa_alloc_onnode.

In any case, something is odd here. I have two nodes, 0 and 1
(→ 'lscpu') and 'numactl --show' shows "preferred node: current".
I allocate memory and then use the following to find the node:

"get_mempolicy (&node, NULL, 0, ptr, MPOL_F_ADDR|MPOL_F_NODE)"

Result: With malloc'ed data, it shows the same node as the node
running the code (i.e. the same as 'getcpu (NULL, &node1);' ==
'numa_node_of_cpu (sched_getcpu());'). But I get a constant
result of 1 for numa_alloc_local and numa_alloc_onnode, independent
of the passed node number (0 or 1) and on the CPU the thread runs on.
libgomp: Update OpenMP memory allocation doc, fix omp_high_bw_mem_space

libgomp/ChangeLog:

	* allocator.c (omp_init_allocator): Use malloc for
	omp_high_bw_mem_space when the memkind lib is unavailable instead
	of returning omp_null_allocator.
	* libgomp.texi (Memory allocation with libmemkind): Document
	implementation in more details.

 libgomp/allocator.c  |  2 +-
 libgomp/libgomp.texi | 26 +-
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/libgomp/allocator.c b/libgomp/allocator.c
index c49931cbad4..25c0f150302 100644
--- a/libgomp/allocator.c
+++ b/libgomp/allocator.c
@@ -301,7 +301,7 @@ omp_init_allocator (omp_memspace_handle_t memspace, int ntraits,
 	  break;
 	}
 #endif
-  return omp_null_allocator;
+  break;
 case omp_large_cap_mem_space:
 #ifdef LIBGOMP_USE_MEMKIND
   memkind_data = gomp_get_memkind ();
diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 7d27cc50df5..b1f58e74903 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -4634,6 +4634,17 @@ smaller number.  On non-host devices, the value of the
 @node Memory allocation with libmemkind
 @section Memory allocation with libmemkind
 
+For the memory spaces, the following applies:
+@itemize
+@item @code{omp_default_mem_space} is supported
+@item @code{omp_const_mem_space} maps to @code{omp_default_mem_space}
+@item @code{omp_low_lat_mem_space} maps to @code{omp_default_mem_space}
+@item @code{omp_large_cap_mem_space} maps to @code{omp_default_mem_space},
+  unless the memkind library is available
+@item @code{omp_high_bw_mem_space} maps to @code{omp_default_mem_space},
+  unless the memkind library is available
+@end itemize
+
 On Linux systems, where the @uref{https://github.com/memkind/memkind, memkind
 library} (@code{libmemkind.so.0}) is available at runtime, it is used when
 creating memory allocators requesting
@@ -4641,9 +4652,22 @@ creating memory allocators requesting
 @itemize
 @item the memory space @code{omp_high_bw_mem_space}
 @item the memory space @code{omp_large_cap_mem_space}
-@item the partition trait @code{omp_atv_interleaved}
+@item the partition trait @code{omp_atv_interleaved}; note that for
+  @code{omp_large_cap_mem_space} the allocation will not be interleaved
 @end itemize
 
+Additional notes:
+@itemize
+@item The @code{pinned} trait is unsupported.
+@item For the @code{partition} trait, the partition part size will be the same
+  as the requested size (i.e. @code{interleaved} or @code{blocked} has no
+  effect), except for @code{interleaved} when the memkind library is
+  available.  Furthermore, @code{nearest} might not always return memory
+  on the node of the CPU that triggered an allocation.
+@item The @code{access} trait has no effect such that memory is always
+  accessible by all threads.
+@item The @code{sync_hint} trait has no effect.
+@end itemize
 
 @c -
 @c Offload-Target Specifics


Re: [PATCH] Fix typo in insn name.

2023-07-10 Thread Michael Meissner via Gcc-patches
On Mon, Jul 10, 2023 at 03:10:21PM -0500, Segher Boessenkool wrote:
> Hi!
> 
> On Mon, Jul 10, 2023 at 03:59:44PM -0400, Michael Meissner wrote:
> > In doing other work, I noticed that there was an insn:
> > 
> > vsx_extract_v4sf__load
> > 
> > Which did not have an iterator.  I removed the useless .
> 
> This patch does that, you mean.
> 
> > --- a/gcc/config/rs6000/vsx.md
> > +++ b/gcc/config/rs6000/vsx.md
> > @@ -3576,7 +3576,7 @@ (define_insn_and_split "vsx_extract_v4sf"
> >[(set_attr "length" "8")
> > (set_attr "type" "fp")])
> >  
> > -(define_insn_and_split "*vsx_extract_v4sf__load"
> > +(define_insn_and_split "*vsx_extract_v4sf_load"
> >[(set (match_operand:SF 0 "register_operand" "=f,v,v,?r")
> > (vec_select:SF
> >  (match_operand:V4SF 1 "memory_operand" "m,Z,m,m")
> 
> Does this fix any ICEs?  Or do you have some example that makes better
> machine code after this change?  Or would a better change perhaps be to
> just remove this pattern completely, if it doesn't do anything useful?
> 
> I.e., please include a new testcase.

There is absolutely no code change.  It is purely a cleanup patch.  In doing
other patches, I just noticed that pattern had a _ in it when it didn't
have an iterator.  I just cleaned up the code removing _.  I probably
should have changed it to vsx_extract_v4sf_sf_load.



Re: [PATCH, OBVIOUS] rs6000: Remove redundant MEM_P predicate usage

2023-07-10 Thread Peter Bergner via Gcc-patches
On 7/10/23 11:47 AM, Peter Bergner wrote:
> While helping someone on the team debug an issue, I noticed some redundant
> tests in a couple of our predicates which can be removed.  I'm going to
> commit the following as obvious once bootstrap and regtesting come back
> clean.
> 
> Peter
> 
> 
> rs6000: Remove redundant MEM_P predicate usage
> 
> The quad_memory_operand and vsx_quad_dform_memory_operand predicates contain
> a (match_code "mem") test, making their MEM_P usage redundant.  Remove them.
> 
> gcc/
>   * config/rs6000/predicates.md (quad_memory_operand): Remove redundant
>   MEM_P usage.
>   (vsx_quad_dform_memory_operand): Likewise.

Testing was clean as expected.  Pushed to trunk.

Peter



Re: [PATCH] rs6000: Remove redundant initialization [PR106907]

2023-07-10 Thread Peter Bergner via Gcc-patches
On 6/29/23 4:31 AM, Kewen.Lin via Gcc-patches wrote:
> This is okay for trunk (no backports needed btw), this fix can even be
> taken as obvious, thanks!
> 
>>
>> 2023-06-07  Jeevitha Palanisamy  
>>
>> gcc/
>>  PR target/106907
> 
> One curious question is that this PR106907 seemed not to report this issue,
> is there another PR reporting this?  Or do I miss something?

I think Jeevitha just ran cppcheck by hand and noticed the "new" warnings
and added them to the list of things to fixup.  Yeah, it would be nice to
add the new warnings to the PR for historical reasons.

Peter





Re: [PATCH] Break false dependence for vpternlog by inserting vpxor or setting constraint of input operand to '0'

2023-07-10 Thread Hongtao Liu via Gcc-patches
On Tue, Jul 11, 2023 at 12:24 AM Alexander Monakov via Gcc-patches
 wrote:
>
>
> On Mon, 10 Jul 2023, liuhongt via Gcc-patches wrote:
>
> > False dependency happens when destination is only updated by
> > pternlog. There is no false dependency when destination is also used
> > in source. So either a pxor should be inserted, or input operand
> > should be set with constraint '0'.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ready to push to trunk.
>
> Shouldn't this patch also remove uses of vpternlog in
> standard_sse_constant_opcode?
It's still needed when !optimize_function_for_speed_p (cfun).
>
> A couple more questions below:
>
> > --- a/gcc/config/i386/sse.md
> > +++ b/gcc/config/i386/sse.md
> > @@ -1382,6 +1382,29 @@ (define_insn "mov_internal"
> > ]
> > (symbol_ref "true")))])
> >
> > +; False dependency happens on destination register which is not really
> > +; used when moving all ones to vector register
> > +(define_split
> > +  [(set (match_operand:VMOVE 0 "register_operand")
> > + (match_operand:VMOVE 1 "int_float_vector_all_ones_operand"))]
> > +  "TARGET_AVX512F && reload_completed
> > +  && ( == 64 || EXT_REX_SSE_REG_P (operands[0]))
> > +  && optimize_function_for_speed_p (cfun)"
>
> Yan's patch used optimize_insn_for_speed_p (), which looks more appropriate.
> Doesn't it work here as well?
I'm just aligned with the lzcnt/popcnt case.  The difference between
optimize_insn_for_speed_p and optimize_function_for_speed_p is that
the former will consider
!crtl->maybe_hot_insn_p but the latter just returns
!optimize_function_for_size_p (cfun).  It looks like
optimize_insn_for_speed_p () is more reasonable for a single insn.

optimize_insn_for_size_p (void)
{
  enum optimize_size_level ret = optimize_function_for_size_p (cfun);
  if (ret < OPTIMIZE_SIZE_BALANCED && !crtl->maybe_hot_insn_p)
    ret = OPTIMIZE_SIZE_BALANCED;
  return ret;
}

>
> > +  [(set (match_dup 0) (match_dup 2))
> > +   (parallel
> > + [(set (match_dup 0) (match_dup 1))
> > +  (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)])]
> > +  "operands[2] = CONST0_RTX (mode);")
> > +
> > +(define_insn "*vmov_constm1_pternlog_false_dep"
> > +  [(set (match_operand:VMOVE 0 "register_operand" "=v")
> > + (match_operand:VMOVE 1 "int_float_vector_all_ones_operand" 
> > ""))
> > +   (unspec [(match_operand:VMOVE 2 "register_operand" "0")] 
> > UNSPEC_INSN_FALSE_DEP)]
> > +   "TARGET_AVX512VL ||  == 64"
> > +   "vpternlogd\t{$0xFF, %0, %0, %0|%0, %0, %0, 0xFF}"
> > +  [(set_attr "type" "sselog1")
> > +   (set_attr "prefix" "evex")])
> > +
> >  ;; If mem_addr points to a memory region with less than whole vector size 
> > bytes
> >  ;; of accessible memory and k is a mask that would prevent reading the 
> > inaccessible
> >  ;; bytes from mem_addr, add UNSPEC_MASKLOAD to prevent it to be 
> > transformed to vpblendd
> > @@ -9336,7 +9359,7 @@ (define_expand 
> > "_cvtmask2"
> >  operands[3] = CONST0_RTX (mode);
> >}")
> >
> > -(define_insn "*_cvtmask2"
> > +(define_insn_and_split "*_cvtmask2"
> >[(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v,v")
> >   (vec_merge:VI48_AVX512VL
> > (match_operand:VI48_AVX512VL 2 "vector_all_ones_operand")
> > @@ -9346,11 +9369,35 @@ (define_insn 
> > "*_cvtmask2"
> >"@
> > vpmovm2\t{%1, %0|%0, %1}
> > vpternlog\t{$0x81, %0, %0, %0%{%1%}%{z%}|%0%{%1%}%{z%}, 
> > %0, %0, 0x81}"
> > +  "&& !TARGET_AVX512DQ && reload_completed
> > +   && optimize_function_for_speed_p (cfun)"
> > +  [(set (match_dup 0) (match_dup 4))
> > +   (parallel
> > +[(set (match_dup 0)
> > +   (vec_merge:VI48_AVX512VL
> > + (match_dup 2)
> > + (match_dup 3)
> > + (match_dup 1)))
> > + (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)])]
> > +  "operands[4] = CONST0_RTX (mode);"
> >[(set_attr "isa" "avx512dq,*")
> > (set_attr "length_immediate" "0,1")
> > (set_attr "prefix" "evex")
> > (set_attr "mode" "")])
> >
> > +(define_insn "*_cvtmask2_pternlog_false_dep"
> > +  [(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v")
> > + (vec_merge:VI48_AVX512VL
> > +   (match_operand:VI48_AVX512VL 2 "vector_all_ones_operand")
> > +   (match_operand:VI48_AVX512VL 3 "const0_operand")
> > +   (match_operand: 1 "register_operand" "Yk")))
> > +   (unspec [(match_operand:VI48_AVX512VL 4 "register_operand" "0")] 
> > UNSPEC_INSN_FALSE_DEP)]
> > +  "TARGET_AVX512F && !TARGET_AVX512DQ"
> > +  "vpternlog\t{$0x81, %0, %0, %0%{%1%}%{z%}|%0%{%1%}%{z%}, 
> > %0, %0, 0x81}"
> > +  [(set_attr "length_immediate" "1")
> > +   (set_attr "prefix" "evex")
> > +   (set_attr "mode" "")])
> > +
> >  (define_expand "extendv2sfv2df2"
> >[(set (match_operand:V2DF 0 "register_operand")
> >   (float_extend:V2DF
> > @@ -17166,20 +17213,32 @@ (define_expand "one_cmpl2"
> >  operands[2] = force_reg (mode, operands[2]);
> >  })
> >
> > -(define_insn "one_cmpl2"
> > -  [(set (match_operand:VI

Re: [PATCH ver3] rs6000, Add return value to __builtin_set_fpscr_rn

2023-07-10 Thread Carl Love via Gcc-patches
Peter:


On Mon, 2023-07-10 at 16:57 -0500, Peter Bergner wrote:
> On 7/10/23 2:18 PM, Carl Love wrote:
> > +  /* Get the current FPSCR fields, bits 29:31 (DRN) and bits 56:63
> > (VE, OE, UE,
> > +  ZE, XE, NI, RN) from the FPSCR and return them.  */
> 
> The 'Z' above should line up directly under the 'G' in Get.

Yup.  Fixed.

> 
> 
> > -  /* Insert new RN mode into FSCPR.  */
> > -  emit_insn (gen_rs6000_mffs (tmp_df));
> > -  tmp_di = simplify_gen_subreg (DImode, tmp_df, DFmode, 0);
> > -  emit_insn (gen_anddi3 (tmp_di, tmp_di, GEN_INT (-4)));
> > -  emit_insn (gen_iordi3 (tmp_di, tmp_di, tmp_rn));
> > +  /* Insert the new RN value from tmp_rn into FPSCR bit
> > [62:63].  */
> > +  emit_insn (gen_anddi3 (tmp_di1, tmp_di2, GEN_INT (-4)));
> > +  emit_insn (gen_iordi3 (tmp_di1, tmp_di1, tmp_rn));
> 
> This is an expander, so you shouldn't reuse temporaries as multiple
> destination pseudos, since that limits the register allocator's
> freedom.
> I know the old code did it, but since you're changing the line, you
> might as well use a new temp.

OK, wasn't aware that reusing temps was an issue for the register
allocator.  Thanks for letting me know.  So, I think you want something
like:
   
  rtx tmp_rn = gen_reg_rtx (DImode);
  rtx tmp_di3 = gen_reg_rtx (DImode);

  /* Extract new RN mode from operand.  */
  rtx op1 = convert_to_mode (DImode, operands[1], false);
  emit_insn (gen_anddi3 (tmp_rn, op1, GEN_INT (3)));

  /* Insert the new RN value from tmp_rn into FPSCR bit [62:63].  */
  emit_insn (gen_anddi3 (tmp_di1, tmp_di2, GEN_INT (-4)));
  emit_insn (gen_iordi3 (tmp_di3, tmp_di1, tmp_rn));

  /* Need to write to field k=15.  The fields are [0:15].  Hence with
 L=0, W=0, FLM_i must be equal to 8, 16 = i + 8*(1-W).  FLM is an
 8-bit field[0:7]. Need to set the bit that corresponds to the
 value of i that you want [0:7].  */
  tmp_df = simplify_gen_subreg (DFmode, tmp_di3, DImode, 0);

where each destination is a unique register.  Then the register
allocator can decide whether or not to use the same register at
code generation time.

I made the change and did a quick check compiling on Power 10 with
mcpu=power[8,9,10] and it worked fine. I will run the full regression
on each of the processor types just to be sure.

  Carl 



Re: [PATCH V2] VECT: Add COND_LEN_* operations for loop control with length targets

2023-07-10 Thread juzhe.zh...@rivai.ai
Bootstrapped and regression tested on x86 last night with no surprising failures.

This patch already includes the 'BIAS' argument.

Ok for trunk ?


juzhe.zh...@rivai.ai
 
From: juzhe.zhong
Date: 2023-07-10 19:35
To: gcc-patches
CC: richard.sandiford; rguenther; Ju-Zhe Zhong
Subject: [PATCH V2] VECT: Add COND_LEN_* operations for loop control with 
length targets
From: Ju-Zhe Zhong 
 
Hi, Richard and Richi.
 
This patch adds cond_len_* operation patterns for targets that support loop
control with length.
 
These patterns will be used in these following case:
 
1. Integer division:
   void
   f (int32_t *restrict a, int32_t *restrict b, int32_t *restrict c, int n)
   {
 for (int i = 0; i < n; ++i)
  {
a[i] = b[i] / c[i];
  }
   }
 
  ARM SVE IR:
  
  ...
  max_mask_36 = .WHILE_ULT (0, bnd.5_32, { 0, ... });
 
  Loop:
  ...
  # loop_mask_29 = PHI 
  ...
  vect__4.8_28 = .MASK_LOAD (_33, 32B, loop_mask_29);
  ...
  vect__6.11_25 = .MASK_LOAD (_20, 32B, loop_mask_29);
  vect__8.12_24 = .COND_DIV (loop_mask_29, vect__4.8_28, vect__6.11_25, 
vect__4.8_28);
  ...
  .MASK_STORE (_1, 32B, loop_mask_29, vect__8.12_24);
  ...
  next_mask_37 = .WHILE_ULT (_2, bnd.5_32, { 0, ... });
  ...
  
  For targets like RVV that support loop control with length, we want to see IR
as follows:
  
  Loop:
  ...
  # loop_len_29 = SELECT_VL
  ...
  vect__4.8_28 = .LEN_MASK_LOAD (_33, 32B, loop_len_29);
  ...
  vect__6.11_25 = .LEN_MASK_LOAD (_20, 32B, loop_len_29);
  vect__8.12_24 = .COND_LEN_DIV (dummy_mask, vect__4.8_28, vect__6.11_25, 
vect__4.8_28, loop_len_29, bias);
  ...
  .LEN_MASK_STORE (_1, 32B, loop_len_29, vect__8.12_24);
  ...
  next_mask_37 = .WHILE_ULT (_2, bnd.5_32, { 0, ... });
  ...
  
  Notice here, we use dummy_mask = { -1, -1, ..., -1 }
 
2. Integer conditional division:
   Similar to (1) but with a condition:
   void
   f (int32_t *restrict a, int32_t *restrict b, int32_t *restrict c, int32_t * 
cond, int n)
   {
 for (int i = 0; i < n; ++i)
   {
 if (cond[i])
 a[i] = b[i] / c[i];
   }
   }
   
   ARM SVE:
   ...
   max_mask_76 = .WHILE_ULT (0, bnd.6_52, { 0, ... });
 
   Loop:
   ...
   # loop_mask_55 = PHI 
   ...
   vect__4.9_56 = .MASK_LOAD (_51, 32B, loop_mask_55);
   mask__29.10_58 = vect__4.9_56 != { 0, ... };
   vec_mask_and_61 = loop_mask_55 & mask__29.10_58;
   ...
   vect__6.13_62 = .MASK_LOAD (_24, 32B, vec_mask_and_61);
   ...
   vect__8.16_66 = .MASK_LOAD (_1, 32B, vec_mask_and_61);
   vect__10.17_68 = .COND_DIV (vec_mask_and_61, vect__6.13_62, vect__8.16_66, 
vect__6.13_62);
   ...
   .MASK_STORE (_2, 32B, vec_mask_and_61, vect__10.17_68);
   ...
   next_mask_77 = .WHILE_ULT (_3, bnd.6_52, { 0, ... });
   
   Here, ARM SVE uses vec_mask_and_61 = loop_mask_55 & mask__29.10_58; to 
guarantee the correct result.
   
   However, targets with length control cannot perform this elegant flow; for 
RVV, we would expect:
   
   Loop:
   ...
   loop_len_55 = SELECT_VL
   ...
   mask__29.10_58 = vect__4.9_56 != { 0, ... };
   ...
   vect__10.17_68 = .COND_LEN_DIV (mask__29.10_58, vect__6.13_62, 
vect__8.16_66, vect__6.13_62, loop_len_55, bias);
   ...
 
   Here we expect COND_LEN_DIV to be predicated by a real mask, which is the 
outcome of the comparison: mask__29.10_58 = vect__4.9_56 != { 0, ... };
   and a real length, which is produced by the loop control: loop_len_55 = SELECT_VL
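The combination of a real comparison mask and a real length can likewise be modeled 
in scalar C.  This is a hedged sketch with illustrative names: under length control 
the loop mask disappears, so no `loop_mask & cmp_mask` AND is needed:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Scalar sketch of the conditional-division loop body under length
   control: `len` plays the role of the loop mask, and the comparison
   mask cond[i] != 0 predicates the division on its own.  */
static void
cond_len_div_masked (int32_t *a, const int32_t *b, const int32_t *c,
                     const int32_t *cond, size_t len)
{
  for (size_t i = 0; i < len; i++)   /* length control (SELECT_VL)   */
    if (cond[i] != 0)                /* real mask from the compare   */
      a[i] = b[i] / c[i];            /* .COND_LEN_DIV active lane    */
}
```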
   
3. Conditional floating-point operations (no -ffast-math):
   
void
f (float *restrict a, float *restrict b, int32_t *restrict cond, int n)
{
  for (int i = 0; i < n; ++i)
{
  if (cond[i])
  a[i] = b[i] + a[i];
}
}
  
  ARM SVE IR:
  max_mask_70 = .WHILE_ULT (0, bnd.6_46, { 0, ... });
 
  ...
  # loop_mask_49 = PHI 
  ...
  mask__27.10_52 = vect__4.9_50 != { 0, ... };
  vec_mask_and_55 = loop_mask_49 & mask__27.10_52;
  ...
  vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, 
vect__6.13_56);
  ...
  next_mask_71 = .WHILE_ULT (_22, bnd.6_46, { 0, ... });
  ...
  
  For RVV, we would expect IR:
  
  ...
  loop_len_49 = SELECT_VL
  ...
  mask__27.10_52 = vect__4.9_50 != { 0, ... };
  ...
  vect__9.17_62 = .COND_LEN_ADD (mask__27.10_52, vect__6.13_56, vect__8.16_60, 
vect__6.13_56, loop_len_49, bias);
  ...
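The reason the real mask matters here without -ffast-math is that inactive lanes 
must not execute the FP add at all; they simply keep the "else" value.  A scalar 
sketch (names and types are illustrative assumptions):

```c
#include <assert.h>
#include <stddef.h>

/* Scalar sketch of .COND_LEN_ADD for the float case: lanes that are
   beyond the length or whose mask bit is clear never perform the FP
   addition -- they take the "else" value unchanged, so no spurious FP
   exceptions can be raised for inactive elements.  */
static void
cond_len_add (const int *mask, const float *x, const float *y,
              const float *else_vals, float *out, size_t nunits, size_t len)
{
  for (size_t i = 0; i < nunits; i++)
    out[i] = (i < len && mask[i]) ? x[i] + y[i] : else_vals[i];
}
```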
 
4. Conditional unordered reduction:
   
   int32_t
   f (int32_t *restrict a, 
   int32_t *restrict cond, int n)
   {
 int32_t result = 0;
 for (int i = 0; i < n; ++i)
   {
   if (cond[i])
 result += a[i];
   }
 return result;
   }
   
   ARM SVE IR:
 
 Loop:
 # vect_result_18.7_37 = PHI 
 ...
 # loop_mask_40 = PHI 
 ...
 mask__17.11_43 = vect__4.10_41 != { 0, ... };
 vec_mask_and_46 = loop_mask_40 & mask__17.11_43;
 ...
 vect__33.16_51 = .COND_ADD (vec_mask_and_46, vect_result_18.7_37, 
vect__7.14_47, vect_result_18.7_37);
 ...
 next_mask_58 = .WHILE_ULT (_15, bnd.6_36, { 0, ... });
 ...
   
 Epilogue:
 _53 = .REDUC_PLUS (vect__
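The per-lane flow above (masked .COND_ADD in the loop, .REDUC_PLUS in the epilogue) 
can be sketched in scalar C; the fixed vector length VL and the names here are 
illustrative assumptions, not the generated code:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define VL 4  /* illustrative fixed vector length */

/* Scalar sketch of the conditional reduction: per-lane .COND_ADD keeps
   inactive lanes' accumulators unchanged, and the epilogue .REDUC_PLUS
   sums the lane accumulators into the scalar result.  */
static int32_t
cond_reduc_plus (const int32_t *a, const int32_t *cond, size_t n)
{
  int32_t acc[VL] = {0, 0, 0, 0};
  for (size_t i = 0; i < n; i += VL)
    for (size_t l = 0; l < VL && i + l < n; l++)  /* loop mask / length  */
      acc[l] = cond[i + l] ? acc[l] + a[i + l]    /* .COND_ADD           */
                           : acc[l];
  int32_t result = 0;
  for (size_t l = 0; l < VL; l++)                 /* .REDUC_PLUS epilogue */
    result += acc[l];
  return result;
}
```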

Re: [PATCH] VECT: Add COND_LEN_* operations for loop control with length targets

2023-07-10 Thread Kewen.Lin via Gcc-patches
on 2023/7/10 18:40, Richard Biener wrote:
> On Fri, 7 Jul 2023, juzhe.zh...@rivai.ai wrote:
> 
>> From: Ju-Zhe Zhong 
>>
>> Hi, Richard and Richi.
>>
>> This patch is adding cond_len_* operations pattern for target support 
>> loop control with length.
> 
> It looks mostly OK - the probably obvious question is with regard
> to the "missing" bias argument ...
> 
> IBM folks - is there any expectation that the set of len family of
> instructions increases or will they be accounted as "mistake" and
> future additions will happen in different ways?

As far as I know, there is no plan to extend this len family on Power
and I guess future extension very likely adopts a different way.

BR,
Kewen

> 
> At the moment I'd say for consistency reasons 'len' should always
> come with 'bias'.
> 
> Thanks,
> Richard.
> 
>> These patterns will be used in these following case:
>>
>> 1. Integer division:
>>void
>>f (int32_t *restrict a, int32_t *restrict b, int32_t *restrict c, int n)
>>{
>>  for (int i = 0; i < n; ++i)
>>   {
>> a[i] = b[i] / c[i];
>>   }
>>}
>>
>>   ARM SVE IR:
>>   
>>   ...
>>   max_mask_36 = .WHILE_ULT (0, bnd.5_32, { 0, ... });
>>
>>   Loop:
>>   ...
>>   # loop_mask_29 = PHI 
>>   ...
>>   vect__4.8_28 = .MASK_LOAD (_33, 32B, loop_mask_29);
>>   ...
>>   vect__6.11_25 = .MASK_LOAD (_20, 32B, loop_mask_29);
>>   vect__8.12_24 = .COND_DIV (loop_mask_29, vect__4.8_28, vect__6.11_25, 
>> vect__4.8_28);
>>   ...
>>   .MASK_STORE (_1, 32B, loop_mask_29, vect__8.12_24);
>>   ...
>>   next_mask_37 = .WHILE_ULT (_2, bnd.5_32, { 0, ... });
>>   ...
>>   
>>   For target like RVV who support loop control with length, we want to see 
>> IR as follows:
>>   
>>   Loop:
>>   ...
>>   # loop_len_29 = SELECT_VL
>>   ...
>>   vect__4.8_28 = .LEN_MASK_LOAD (_33, 32B, loop_len_29);
>>   ...
>>   vect__6.11_25 = .LEN_MASK_LOAD (_20, 32B, loop_len_29);
>>   vect__8.12_24 = .COND_LEN_DIV (dummy_mask, vect__4.8_28, vect__6.11_25, 
>> vect__4.8_28, loop_len_29);
>>   ...
>>   .LEN_MASK_STORE (_1, 32B, loop_len_29, vect__8.12_24);
>>   ...
>>   next_mask_37 = .WHILE_ULT (_2, bnd.5_32, { 0, ... });
>>   ...
>>   
>>   Notice here, we use dummy_mask = { -1, -1, ..., -1 }
>>
>> 2. Integer conditional division:
>>Similar case with (1) but with condition:
>>void
>>f (int32_t *restrict a, int32_t *restrict b, int32_t *restrict c, int32_t 
>> * cond, int n)
>>{
>>  for (int i = 0; i < n; ++i)
>>{
>>  if (cond[i])
>>  a[i] = b[i] / c[i];
>>}
>>}
>>
>>ARM SVE:
>>...
>>max_mask_76 = .WHILE_ULT (0, bnd.6_52, { 0, ... });
>>
>>Loop:
>>...
>># loop_mask_55 = PHI 
>>...
>>vect__4.9_56 = .MASK_LOAD (_51, 32B, loop_mask_55);
>>mask__29.10_58 = vect__4.9_56 != { 0, ... };
>>vec_mask_and_61 = loop_mask_55 & mask__29.10_58;
>>...
>>vect__6.13_62 = .MASK_LOAD (_24, 32B, vec_mask_and_61);
>>...
>>vect__8.16_66 = .MASK_LOAD (_1, 32B, vec_mask_and_61);
>>vect__10.17_68 = .COND_DIV (vec_mask_and_61, vect__6.13_62, 
>> vect__8.16_66, vect__6.13_62);
>>...
>>.MASK_STORE (_2, 32B, vec_mask_and_61, vect__10.17_68);
>>...
>>next_mask_77 = .WHILE_ULT (_3, bnd.6_52, { 0, ... });
>>
>>Here, ARM SVE use vec_mask_and_61 = loop_mask_55 & mask__29.10_58; to 
>> guarantee the correct result.
>>
>>However, target with length control can not perform this elegant flow, 
>> for RVV, we would expect:
>>
>>Loop:
>>...
>>loop_len_55 = SELECT_VL
>>...
>>mask__29.10_58 = vect__4.9_56 != { 0, ... };
>>...
>>vect__10.17_68 = .COND_LEN_DIV (mask__29.10_58, vect__6.13_62, 
>> vect__8.16_66, vect__6.13_62, loop_len_55);
>>...
>>
>>Here we expect COND_LEN_DIV predicated by a real mask which is the 
>> outcome of comparison: mask__29.10_58 = vect__4.9_56 != { 0, ... };
>>and a real length which is produced by loop control : loop_len_55 = 
>> SELECT_VL
>>
>> 3. conditional Floating-point operations (no -ffast-math):
>>
>> void
>> f (float *restrict a, float *restrict b, int32_t *restrict cond, int n)
>> {
>>   for (int i = 0; i < n; ++i)
>> {
>>   if (cond[i])
>>   a[i] = b[i] + a[i];
>> }
>> }
>>   
>>   ARM SVE IR:
>>   max_mask_70 = .WHILE_ULT (0, bnd.6_46, { 0, ... });
>>
>>   ...
>>   # loop_mask_49 = PHI 
>>   ...
>>   mask__27.10_52 = vect__4.9_50 != { 0, ... };
>>   vec_mask_and_55 = loop_mask_49 & mask__27.10_52;
>>   ...
>>   vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, 
>> vect__6.13_56);
>>   ...
>>   next_mask_71 = .WHILE_ULT (_22, bnd.6_46, { 0, ... });
>>   ...
>>   
>>   For RVV, we would expect IR:
>>   
>>   ...
>>   loop_len_49 = SELECT_VL
>>   ...
>>   mask__27.10_52 = vect__4.9_50 != { 0, ... };
>>   ...
>>   vect__9.17_62 = .COND_LEN_ADD (mask__27.10_52, vect__6.13_56, 
>> vect__8.16_60, vect__6.13_56, loop_len_49);
>>   ...
>>
>> 4. Cond

Re: [PATCH v5] rs6000: Update the vsx-vector-6.* tests.

2023-07-10 Thread Kewen.Lin via Gcc-patches
Hi Carl,

on 2023/7/8 04:40, Carl Love wrote:
> 
> GCC maintainers:
> 
> Ver 5. Removed -compile from the names of the compile only tests. Fixed
> up the reference to the compile file names in the .h file headers. 
> Replaced powerpc_vsx_ok with vsx_hw in the run test files.  Removed the
> -save-temps from all files.  Retested on all of the various platforms
> with no regressions.
> 
> Ver 4. Fixed a few typos.  Redid the tests to create separate run and
> compile tests.
> 
> Ver 3.  Added __attribute__ ((noipa)) to the test files.  Changed some
> of the scan-assembler-times checks to cover multiple similar
> instructions.  Change the function check macro to a macro to generate a
> function to do the test and check the results.  Retested on the various
> processor types and BE/LE versions.
> 
> Ver 2.  Switched to using code macros to generate the call to the
> builtin and test the results.  Added in instruction counts for the key
> instruction for the builtin.  Moved the tests into an additional
> function call to ensure the compile doesn't replace the builtin call
> code with the statically computed results.  The compiler was doing this
> for a few of the simpler tests.  
> 
> The following patch takes the tests in vsx-vector-6-p7.h,  vsx-vector-
> 6-p8.h, vsx-vector-6-p9.h and reorganizes them into a series of smaller
> test files by functionality rather than processor version.
> 
> Tested the patch on Power 8 LE/BE, Power 9 LE/BE and Power 10 LE with
> no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.

This patch is okay for trunk, thanks for the patience!

BR,
Kewen

> 
>Carl
> 
> 
> 
> -
> rs6000: Update the vsx-vector-6.* tests.
> 
> The vsx-vector-6.h file is included into the processor specific test files
> vsx-vector-6.p7.c, vsx-vector-6.p8.c, and vsx-vector-6.p9.c.  The .h file
> contains a large number of vsx vector built-in tests.  The processor
> specific files contain the number of instructions that the tests are
> expected to generate for that processor.  The tests are compile only.
> 
> This patch reworks the tests into a series of files for related tests.
> The new tests consist of a runnable test to verify the built-in argument
> types and the functional correctness of each built-in.  There is also a
> compile only test that verifies the built-ins generate the expected number
> of instructions for the various built-in tests.
> 
> gcc/testsuite/
>   * gcc.target/powerpc/vsx-vector-6-func-1op.h: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-1op-run.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-1op.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-2lop.h: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-2lop-run.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-2lop.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-2op.h: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-2op-run.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-2op.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-3op.h: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-3op-run.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-3op.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-cmp-all.h: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-cmp-all-run.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-cmp-all.c: New test
>   file.
>   * gcc.target/powerpc/vsx-vector-6-func-cmp.h: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-cmp-run.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-cmp.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6.h: Remove test file.
>   * gcc.target/powerpc/vsx-vector-6.p7.c: Remove test file.
>   * gcc.target/powerpc/vsx-vector-6.p8.c: Remove test file.
>   * gcc.target/powerpc/vsx-vector-6.p9.c: Remove test file.
> ---
>  .../powerpc/vsx-vector-6-func-1op-run.c   |  98 
>  .../powerpc/vsx-vector-6-func-1op.c   |  22 ++
>  .../powerpc/vsx-vector-6-func-1op.h   |  43 
>  .../powerpc/vsx-vector-6-func-2lop-run.c  | 177 ++
>  .../powerpc/vsx-vector-6-func-2lop.c  |  14 ++
>  .../powerpc/vsx-vector-6-func-2lop.h  |  47 
>  .../powerpc/vsx-vector-6-func-2op-run.c   |  96 
>  .../powerpc/vsx-vector-6-func-2op.c   |  21 ++
>  .../powerpc/vsx-vector-6-func-2op.h   |  42 
>  .../powerpc/vsx-vector-6-func-3op-run.c   | 229 ++
>  .../powerpc/vsx-vector-6-func-3op.c   |  17 ++
>  .../powerpc/vsx-vector-6-func-3op.h   |  73 ++
>  .../powerpc/vsx-vector-6-func-cmp-all-run.c   | 147 +++
>  .../powerpc/vsx-vector-6-func-cmp-all.c   |  17 ++
>  .../powerpc/vsx-vector-6-func-cmp-all.h   |  76 ++
>  .../p

Re: Re: [PATCH] VECT: Add COND_LEN_* operations for loop control with length targets

2023-07-10 Thread juzhe.zh...@rivai.ai
Hi, Kewen.

You mean the COND_LEN_* family I added is unlikely to be used by Power in the future?
Could you revise them to make them usable by Power in the future, so that 
we won't duplicate too many patterns?
For example, COND_LEN_* has a mask operand; is it possible that Power could also 
use it with a dummy mask = { -1, -1, ..., -1 }?

Thanks.


juzhe.zh...@rivai.ai
 
From: Kewen.Lin
Date: 2023-07-11 09:17
To: Richard Biener
CC: gcc-patches; richard.sandiford; linkw; krebbel; Ju-Zhe Zhong; Segher 
Boessenkool; David Edelsohn; Peter Bergner
Subject: Re: [PATCH] VECT: Add COND_LEN_* operations for loop control with 
length targets
on 2023/7/10 18:40, Richard Biener wrote:
> On Fri, 7 Jul 2023, juzhe.zh...@rivai.ai wrote:
> 
>> From: Ju-Zhe Zhong 
>>
>> Hi, Richard and Richi.
>>
>> This patch is adding cond_len_* operations pattern for target support 
>> loop control with length.
> 
> It looks mostly OK - the probably obvious question is with regard
> to the "missing" bias argument ...
> 
> IBM folks - is there any expectation that the set of len family of
> instructions increases or will they be accounted as "mistake" and
> future additions will happen in different ways?
 
As far as I know, there is no plan to extend this len family on Power
and I guess future extension very likely adopts a different way.
 
BR,
Kewen
 
> 
> At the moment I'd say for consistency reasons 'len' should always
> come with 'bias'.
> 
> Thanks,
> Richard.
> 
>> These patterns will be used in these following case:
>>
>> 1. Integer division:
>>void
>>f (int32_t *restrict a, int32_t *restrict b, int32_t *restrict c, int n)
>>{
>>  for (int i = 0; i < n; ++i)
>>   {
>> a[i] = b[i] / c[i];
>>   }
>>}
>>
>>   ARM SVE IR:
>>   
>>   ...
>>   max_mask_36 = .WHILE_ULT (0, bnd.5_32, { 0, ... });
>>
>>   Loop:
>>   ...
>>   # loop_mask_29 = PHI 
>>   ...
>>   vect__4.8_28 = .MASK_LOAD (_33, 32B, loop_mask_29);
>>   ...
>>   vect__6.11_25 = .MASK_LOAD (_20, 32B, loop_mask_29);
>>   vect__8.12_24 = .COND_DIV (loop_mask_29, vect__4.8_28, vect__6.11_25, 
>> vect__4.8_28);
>>   ...
>>   .MASK_STORE (_1, 32B, loop_mask_29, vect__8.12_24);
>>   ...
>>   next_mask_37 = .WHILE_ULT (_2, bnd.5_32, { 0, ... });
>>   ...
>>   
>>   For target like RVV who support loop control with length, we want to see 
>> IR as follows:
>>   
>>   Loop:
>>   ...
>>   # loop_len_29 = SELECT_VL
>>   ...
>>   vect__4.8_28 = .LEN_MASK_LOAD (_33, 32B, loop_len_29);
>>   ...
>>   vect__6.11_25 = .LEN_MASK_LOAD (_20, 32B, loop_len_29);
>>   vect__8.12_24 = .COND_LEN_DIV (dummy_mask, vect__4.8_28, vect__6.11_25, 
>> vect__4.8_28, loop_len_29);
>>   ...
>>   .LEN_MASK_STORE (_1, 32B, loop_len_29, vect__8.12_24);
>>   ...
>>   next_mask_37 = .WHILE_ULT (_2, bnd.5_32, { 0, ... });
>>   ...
>>   
>>   Notice here, we use dummy_mask = { -1, -1, ..., -1 }
>>
>> 2. Integer conditional division:
>>Similar case with (1) but with condition:
>>void
>>f (int32_t *restrict a, int32_t *restrict b, int32_t *restrict c, int32_t 
>> * cond, int n)
>>{
>>  for (int i = 0; i < n; ++i)
>>{
>>  if (cond[i])
>>  a[i] = b[i] / c[i];
>>}
>>}
>>
>>ARM SVE:
>>...
>>max_mask_76 = .WHILE_ULT (0, bnd.6_52, { 0, ... });
>>
>>Loop:
>>...
>># loop_mask_55 = PHI 
>>...
>>vect__4.9_56 = .MASK_LOAD (_51, 32B, loop_mask_55);
>>mask__29.10_58 = vect__4.9_56 != { 0, ... };
>>vec_mask_and_61 = loop_mask_55 & mask__29.10_58;
>>...
>>vect__6.13_62 = .MASK_LOAD (_24, 32B, vec_mask_and_61);
>>...
>>vect__8.16_66 = .MASK_LOAD (_1, 32B, vec_mask_and_61);
>>vect__10.17_68 = .COND_DIV (vec_mask_and_61, vect__6.13_62, 
>> vect__8.16_66, vect__6.13_62);
>>...
>>.MASK_STORE (_2, 32B, vec_mask_and_61, vect__10.17_68);
>>...
>>next_mask_77 = .WHILE_ULT (_3, bnd.6_52, { 0, ... });
>>
>>Here, ARM SVE use vec_mask_and_61 = loop_mask_55 & mask__29.10_58; to 
>> guarantee the correct result.
>>
>>However, target with length control can not perform this elegant flow, 
>> for RVV, we would expect:
>>
>>Loop:
>>...
>>loop_len_55 = SELECT_VL
>>...
>>mask__29.10_58 = vect__4.9_56 != { 0, ... };
>>...
>>vect__10.17_68 = .COND_LEN_DIV (mask__29.10_58, vect__6.13_62, 
>> vect__8.16_66, vect__6.13_62, loop_len_55);
>>...
>>
>>Here we expect COND_LEN_DIV predicated by a real mask which is the 
>> outcome of comparison: mask__29.10_58 = vect__4.9_56 != { 0, ... };
>>and a real length which is produced by loop control : loop_len_55 = 
>> SELECT_VL
>>
>> 3. conditional Floating-point operations (no -ffast-math):
>>
>> void
>> f (float *restrict a, float *restrict b, int32_t *restrict cond, int n)
>> {
>>   for (int i = 0; i < n; ++i)
>> {
>>   if (cond[i])
>>   a[i] = b[i] + a[i];
>> }
>> }
>>   
>>   ARM SVE IR:
>>   max_mask_70 = .WHILE_ULT (0, bnd.6_46, { 0, ... }

Re: [PATCH] rs6000: Remove redundant initialization [PR106907]

2023-07-10 Thread Kewen.Lin via Gcc-patches
on 2023/7/11 07:11, Peter Bergner wrote:
> On 6/29/23 4:31 AM, Kewen.Lin via Gcc-patches wrote:
>> This is okay for trunk (no backports needed btw), this fix can even be
>> taken as obvious, thanks!
>>
>>>
>>> 2023-06-07  Jeevitha Palanisamy  
>>>
>>> gcc/
>>> PR target/106907
>>
>> One curious question is that this PR106907 seemed not to report this issue,
>> is there another PR reporting this?  Or do I miss something?
> 
> I think Jeevitha just ran cppcheck by hand and noticed the "new" warnings
> and added them to the list of things to fixup.  Yeah, it would be nice to
> add the new warnings to the PR for historical reasons.

Thanks for clarifying it.  Yeah, I noticed Jeevitha added more comments to
that PR. :)

BR,
Kewen


Re: [PATCH] VECT: Add COND_LEN_* operations for loop control with length targets

2023-07-10 Thread Kewen.Lin via Gcc-patches
Hi Juzhe,

> Hi, Kewen.
> 
> You mean COND_LEN_* family I added is unlikely used by power in the future?

Since AFAIK there is no plan to extend the len family on Power, it's very
likely that Power will have no chance to leverage them, so yes.

> Could you revise them to make it possible used by power in the future so that 
> we won't duplicate to many patterns.

Sorry, since we don't have such a plan for this kind of capability, I don't have
any solid inputs or requirements for this patch.  But IMHO the proposed
interfaces look good enough for any potential future uses.

BR,
Kewen

> For example, COND_LEN_* has mask operand, is it possible that power can also 
> use it with dummy mask = { -1, -1, ..., -1 }.
> 
> Thanks.
> 


[PATCH v2] tree-optimization/110279- Check for nested FMA chains in reassoc

2023-07-10 Thread Di Zhao OS via Gcc-patches
Attached is an updated version of the patch.

Based on Philipp's review, some changes:

1. Defined new enum fma_state to describe the state of FMA candidates
   for a list of operands. (Since the tests seem simple after the
   change, I didn't add predicates for it.)
2. Changed return type of convert_mult_to_fma_1 and convert_mult_to_fma
   to tree, to remove the in/out parameter.
3. Added a description of the return values of rank_ops_for_fma.

---
gcc/ChangeLog:

* tree-ssa-math-opts.cc (convert_mult_to_fma_1): Added new parameter
check_only_p. Changed return type to tree.
(struct fma_transformation_info): Moved to header.
(class fma_deferring_state): Moved to header.
(convert_mult_to_fma): Added new parameter check_only_p. Changed
return type to tree.
* tree-ssa-math-opts.h (struct fma_transformation_info): Moved from .cc.
(class fma_deferring_state): Moved from .cc.
(convert_mult_to_fma): Add function decl.
* tree-ssa-reassoc.cc (enum fma_state): Defined new enum to describe
the state of FMA candidates for a list of operands.
(rewrite_expr_tree_parallel): Changed boolean parameter to enum type.
(rank_ops_for_fma): Return enum fma_state.
(reassociate_bb): Avoid rewriting to parallel if nested FMAs are found.

Thanks,
Di Zhao 




0001-Check-for-nested-FMA-chains-in-reassoc.patch
Description: 0001-Check-for-nested-FMA-chains-in-reassoc.patch


[PATCH] Add peephole to eliminate redundant comparison after cmpccxadd.

2023-07-10 Thread liuhongt via Gcc-patches
Similar to what we did for cmpxchg, but extended to all of
ix86_comparison_int_operator, since cmpccxadd sets EFLAGS exactly the same
as CMP.

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,},
Ok for trunk?

gcc/ChangeLog:

PR target/110591
* config/i386/sync.md (cmpccxadd_): Add a new
define_peephole2 after the pattern.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr110591.c: New test.
---
 gcc/config/i386/sync.md  | 56 
 gcc/testsuite/gcc.target/i386/pr110591.c | 66 
 2 files changed, 122 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr110591.c

diff --git a/gcc/config/i386/sync.md b/gcc/config/i386/sync.md
index e1fa1504deb..43f6421bcb8 100644
--- a/gcc/config/i386/sync.md
+++ b/gcc/config/i386/sync.md
@@ -1105,3 +1105,59 @@ (define_insn "cmpccxadd_"
   output_asm_insn (buf, operands);
   return "";
 })
+
+(define_peephole2
+  [(set (match_operand:SWI48x 0 "register_operand")
+   (match_operand:SWI48x 1 "x86_64_general_operand"))
+   (parallel [(set (match_dup 0)
+  (unspec_volatile:SWI48x
+[(match_operand:SWI48x 2 "memory_operand")
+ (match_dup 0)
+ (match_operand:SWI48x 3 "register_operand")
+ (match_operand:SI 4 "const_int_operand")]
+UNSPECV_CMPCCXADD))
+ (set (match_dup 2)
+  (unspec_volatile:SWI48x [(const_int 0)] UNSPECV_CMPCCXADD))
+ (clobber (reg:CC FLAGS_REG))])
+   (set (reg FLAGS_REG)
+   (compare (match_operand:SWI48x 5 "register_operand")
+(match_operand:SWI48x 6 "x86_64_general_operand")))
+   (set (match_operand:QI 7 "nonimmediate_operand")
+   (match_operator:QI 8 "ix86_comparison_int_operator"
+ [(reg FLAGS_REG) (const_int 0)]))]
+  "TARGET_CMPCCXADD && TARGET_64BIT
+   && ((rtx_equal_p (operands[0], operands[5])
+   && rtx_equal_p (operands[1], operands[6]))
+   || ((rtx_equal_p (operands[0], operands[6])
+   && rtx_equal_p (operands[1], operands[5]))
+  && peep2_regno_dead_p (4, FLAGS_REG)))"
+  [(set (match_dup 0)
+   (match_dup 1))
+   (parallel [(set (match_dup 0)
+  (unspec_volatile:SWI48x
+[(match_dup 2)
+ (match_dup 0)
+ (match_dup 3)
+ (match_dup 4)]
+UNSPECV_CMPCCXADD))
+ (set (match_dup 2)
+  (unspec_volatile:SWI48x [(const_int 0)] UNSPECV_CMPCCXADD))
+ (clobber (reg:CC FLAGS_REG))])
+   (set (match_dup 7)
+   (match_op_dup 8
+ [(match_dup 9) (const_int 0)]))]
+{
+  operands[9] = gen_rtx_REG (GET_MODE (XEXP (operands[8], 0)), FLAGS_REG);
+  if (rtx_equal_p (operands[0], operands[6])
+ && rtx_equal_p (operands[1], operands[5])
+ && swap_condition (GET_CODE (operands[8])) != GET_CODE (operands[8]))
+ {
+   operands[8] = shallow_copy_rtx (operands[8]);
+   enum rtx_code ccode = swap_condition (GET_CODE (operands[8]));
+   PUT_CODE (operands[8], ccode);
+   operands[9] = gen_rtx_REG (SELECT_CC_MODE (ccode,
+ operands[6],
+ operands[5]),
+  FLAGS_REG);
+ }
+})
diff --git a/gcc/testsuite/gcc.target/i386/pr110591.c 
b/gcc/testsuite/gcc.target/i386/pr110591.c
new file mode 100644
index 000..32a515b429e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr110591.c
@@ -0,0 +1,66 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-mcmpccxadd -O2" } */
+/* { dg-final { scan-assembler-not {cmp[lq]?[ \t]+} } } */
+/* { dg-final { scan-assembler-times {cmpoxadd[ \t]+} 12 } } */
+
+#include 
+
+_Bool foo_setg (int *ptr, int v)
+{
+return _cmpccxadd_epi32(ptr, v, 1, _CMPCCX_O) > v;
+}
+
+_Bool foo_setl (int *ptr, int v)
+{
+return _cmpccxadd_epi32(ptr, v, 1, _CMPCCX_O) < v;
+}
+
+_Bool foo_sete(int *ptr, int v)
+{
+return _cmpccxadd_epi32(ptr, v, 1, _CMPCCX_O) == v;
+}
+
+_Bool foo_setne(int *ptr, int v)
+{
+return _cmpccxadd_epi32(ptr, v, 1, _CMPCCX_O) != v;
+}
+
+_Bool foo_setge(int *ptr, int v)
+{
+return _cmpccxadd_epi32(ptr, v, 1, _CMPCCX_O) >= v;
+}
+
+_Bool foo_setle(int *ptr, int v)
+{
+return _cmpccxadd_epi32(ptr, v, 1, _CMPCCX_O) <= v;
+}
+
+_Bool fooq_setg (long long *ptr, long long v)
+{
+return _cmpccxadd_epi64(ptr, v, 1, _CMPCCX_O) > v;
+}
+
+_Bool fooq_setl (long long *ptr, long long v)
+{
+return _cmpccxadd_epi64(ptr, v, 1, _CMPCCX_O) < v;
+}
+
+_Bool fooq_sete(long long *ptr, long long v)
+{
+return _cmpccxadd_epi64(ptr, v, 1, _CMPCCX_O) == v;
+}
+
+_Bool fooq_setne(long long *ptr, long long v)
+{
+return _cmpccxadd_epi64(ptr, v, 1, _CMPCCX_O) != v;
+}
+
+_Bool fooq_setge(long long *ptr, long long v)
+{
+return _cmpccxadd_epi64(ptr, v, 1, _CMPCCX_O) >= v;
+}
+
+_Bool

Re: [PATCH] Add peephole to eliminate redundant comparison after cmpccxadd.

2023-07-10 Thread Hongtao Liu via Gcc-patches
Please ignore this patch; I'm testing another patch to separate the
non-swap-operands case, where a setcc is not needed in the peephole2.

On Tue, Jul 11, 2023 at 11:14 AM liuhongt via Gcc-patches
 wrote:
>
> Similar to what we did for cmpxchg, but extended to all of
> ix86_comparison_int_operator, since cmpccxadd sets EFLAGS exactly the same
> as CMP.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,},
> Ok for trunk?
>
> gcc/ChangeLog:
>
> PR target/110591
> * config/i386/sync.md (cmpccxadd_): Add a new
> define_peephole2 after the pattern.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr110591.c: New test.
> ---
>  gcc/config/i386/sync.md  | 56 
>  gcc/testsuite/gcc.target/i386/pr110591.c | 66 
>  2 files changed, 122 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr110591.c
>
> diff --git a/gcc/config/i386/sync.md b/gcc/config/i386/sync.md
> index e1fa1504deb..43f6421bcb8 100644
> --- a/gcc/config/i386/sync.md
> +++ b/gcc/config/i386/sync.md
> @@ -1105,3 +1105,59 @@ (define_insn "cmpccxadd_"
>output_asm_insn (buf, operands);
>return "";
>  })
> +
> +(define_peephole2
> +  [(set (match_operand:SWI48x 0 "register_operand")
> +   (match_operand:SWI48x 1 "x86_64_general_operand"))
> +   (parallel [(set (match_dup 0)
> +  (unspec_volatile:SWI48x
> +[(match_operand:SWI48x 2 "memory_operand")
> + (match_dup 0)
> + (match_operand:SWI48x 3 "register_operand")
> + (match_operand:SI 4 "const_int_operand")]
> +UNSPECV_CMPCCXADD))
> + (set (match_dup 2)
> +  (unspec_volatile:SWI48x [(const_int 0)] UNSPECV_CMPCCXADD))
> + (clobber (reg:CC FLAGS_REG))])
> +   (set (reg FLAGS_REG)
> +   (compare (match_operand:SWI48x 5 "register_operand")
> +(match_operand:SWI48x 6 "x86_64_general_operand")))
> +   (set (match_operand:QI 7 "nonimmediate_operand")
> +   (match_operator:QI 8 "ix86_comparison_int_operator"
> + [(reg FLAGS_REG) (const_int 0)]))]
> +  "TARGET_CMPCCXADD && TARGET_64BIT
> +   && ((rtx_equal_p (operands[0], operands[5])
> +   && rtx_equal_p (operands[1], operands[6]))
> +   || ((rtx_equal_p (operands[0], operands[6])
> +   && rtx_equal_p (operands[1], operands[5]))
> +  && peep2_regno_dead_p (4, FLAGS_REG)))"
> +  [(set (match_dup 0)
> +   (match_dup 1))
> +   (parallel [(set (match_dup 0)
> +  (unspec_volatile:SWI48x
> +[(match_dup 2)
> + (match_dup 0)
> + (match_dup 3)
> + (match_dup 4)]
> +UNSPECV_CMPCCXADD))
> + (set (match_dup 2)
> +  (unspec_volatile:SWI48x [(const_int 0)] UNSPECV_CMPCCXADD))
> + (clobber (reg:CC FLAGS_REG))])
> +   (set (match_dup 7)
> +   (match_op_dup 8
> + [(match_dup 9) (const_int 0)]))]
> +{
> +  operands[9] = gen_rtx_REG (GET_MODE (XEXP (operands[8], 0)), FLAGS_REG);
> +  if (rtx_equal_p (operands[0], operands[6])
> + && rtx_equal_p (operands[1], operands[5])
> + && swap_condition (GET_CODE (operands[8])) != GET_CODE (operands[8]))
> + {
> +   operands[8] = shallow_copy_rtx (operands[8]);
> +   enum rtx_code ccode = swap_condition (GET_CODE (operands[8]));
> +   PUT_CODE (operands[8], ccode);
> +   operands[9] = gen_rtx_REG (SELECT_CC_MODE (ccode,
> + operands[6],
> + operands[5]),
> +  FLAGS_REG);
> + }
> +})
> diff --git a/gcc/testsuite/gcc.target/i386/pr110591.c 
> b/gcc/testsuite/gcc.target/i386/pr110591.c
> new file mode 100644
> index 000..32a515b429e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr110591.c
> @@ -0,0 +1,66 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-mcmpccxadd -O2" } */
> +/* { dg-final { scan-assembler-not {cmp[lq]?[ \t]+} } } */
> +/* { dg-final { scan-assembler-times {cmpoxadd[ \t]+} 12 } } */
> +
> +#include 
> +
> +_Bool foo_setg (int *ptr, int v)
> +{
> +return _cmpccxadd_epi32(ptr, v, 1, _CMPCCX_O) > v;
> +}
> +
> +_Bool foo_setl (int *ptr, int v)
> +{
> +return _cmpccxadd_epi32(ptr, v, 1, _CMPCCX_O) < v;
> +}
> +
> +_Bool foo_sete(int *ptr, int v)
> +{
> +return _cmpccxadd_epi32(ptr, v, 1, _CMPCCX_O) == v;
> +}
> +
> +_Bool foo_setne(int *ptr, int v)
> +{
> +return _cmpccxadd_epi32(ptr, v, 1, _CMPCCX_O) != v;
> +}
> +
> +_Bool foo_setge(int *ptr, int v)
> +{
> +return _cmpccxadd_epi32(ptr, v, 1, _CMPCCX_O) >= v;
> +}
> +
> +_Bool foo_setle(int *ptr, int v)
> +{
> +return _cmpccxadd_epi32(ptr, v, 1, _CMPCCX_O) <= v;
> +}
> +
> +_Bool fooq_setg (long long *ptr, long long v)
> +{
> +return _cmpccxadd_epi64(ptr, v, 1, _CMPCCX_O) 

[PATCH] i386: Guard 128 bit VAES builtins with AVX512VL

2023-07-10 Thread Haochen Jiang via Gcc-patches
Hi all,

Currently on trunk, both intrinsic and builtin usage of the 128-bit VAES
ISA will result in an ICE, since we did not check AVX512VL until the
pattern, which is not what users expect. This patch fixes that ICE and
reports an error in this scenario.

Regtested on x86-64-linux-gnu{-m32,}. Ok for trunk?

BRs,
Haochen

Since commit 24a8acc, 128-bit intrinsics are enabled for VAES. However,
AVX512VL is not checked until we reach the pattern, which results in an
ICE.

Added an AVX512VL guard at the builtin level to report an error when
checking ISA flags.

gcc/ChangeLog:

* config/i386/i386-builtins.cc (ix86_init_mmx_sse_builtins):
Add OPTION_MASK_ISA_AVX512VL.
* config/i386/i386-expand.cc (ix86_check_builtin_isa_match):
Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx512vl-vaes-1.c: New test.
---
 gcc/config/i386/i386-builtins.cc| 12 
 gcc/config/i386/i386-expand.cc  |  4 +++-
 gcc/testsuite/gcc.target/i386/avx512vl-vaes-1.c | 12 
 3 files changed, 23 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-vaes-1.c

diff --git a/gcc/config/i386/i386-builtins.cc b/gcc/config/i386/i386-builtins.cc
index 28f404da288..e436ca4e5b1 100644
--- a/gcc/config/i386/i386-builtins.cc
+++ b/gcc/config/i386/i386-builtins.cc
@@ -662,19 +662,23 @@ ix86_init_mmx_sse_builtins (void)
   VOID_FTYPE_UNSIGNED_UNSIGNED, IX86_BUILTIN_MWAIT);
 
   /* AES */
-  def_builtin_const (OPTION_MASK_ISA_AES | OPTION_MASK_ISA_SSE2,
+  def_builtin_const (OPTION_MASK_ISA_AES | OPTION_MASK_ISA_SSE2
+| OPTION_MASK_ISA_AVX512VL,
 OPTION_MASK_ISA2_VAES,
 "__builtin_ia32_aesenc128",
 V2DI_FTYPE_V2DI_V2DI, IX86_BUILTIN_AESENC128);
-  def_builtin_const (OPTION_MASK_ISA_AES | OPTION_MASK_ISA_SSE2,
+  def_builtin_const (OPTION_MASK_ISA_AES | OPTION_MASK_ISA_SSE2
+| OPTION_MASK_ISA_AVX512VL,
 OPTION_MASK_ISA2_VAES,
 "__builtin_ia32_aesenclast128",
 V2DI_FTYPE_V2DI_V2DI, IX86_BUILTIN_AESENCLAST128);
-  def_builtin_const (OPTION_MASK_ISA_AES | OPTION_MASK_ISA_SSE2,
+  def_builtin_const (OPTION_MASK_ISA_AES | OPTION_MASK_ISA_SSE2
+| OPTION_MASK_ISA_AVX512VL,
 OPTION_MASK_ISA2_VAES,
 "__builtin_ia32_aesdec128",
 V2DI_FTYPE_V2DI_V2DI, IX86_BUILTIN_AESDEC128);
-  def_builtin_const (OPTION_MASK_ISA_AES | OPTION_MASK_ISA_SSE2,
+  def_builtin_const (OPTION_MASK_ISA_AES | OPTION_MASK_ISA_SSE2
+| OPTION_MASK_ISA_AVX512VL,
 OPTION_MASK_ISA2_VAES,
 "__builtin_ia32_aesdeclast128",
 V2DI_FTYPE_V2DI_V2DI, IX86_BUILTIN_AESDECLAST128);
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 567248d6830..9a04bf4455b 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -12626,6 +12626,7 @@ ix86_check_builtin_isa_match (unsigned int fcode,
OPTION_MASK_ISA2_AVXIFMA
  (OPTION_MASK_ISA_AVX512VL | OPTION_MASK_ISA2_AVX512BF16) or
OPTION_MASK_ISA2_AVXNECONVERT
+ OPTION_MASK_ISA_AES or (OPTION_MASK_ISA_AVX512VL | OPTION_MASK_ISA2_VAES)
  where for each such pair it is sufficient if either of the ISAs is
  enabled, plus if it is ored with other options also those others.
  OPTION_MASK_ISA_MMX in bisa is satisfied also if TARGET_MMX_WITH_SSE.  */
@@ -12649,7 +12650,8 @@ ix86_check_builtin_isa_match (unsigned int fcode,
 OPTION_MASK_ISA2_AVXIFMA);
   SHARE_BUILTIN (OPTION_MASK_ISA_AVX512VL, OPTION_MASK_ISA2_AVX512BF16, 0,
 OPTION_MASK_ISA2_AVXNECONVERT);
-  SHARE_BUILTIN (OPTION_MASK_ISA_AES, 0, 0, OPTION_MASK_ISA2_VAES);
+  SHARE_BUILTIN (OPTION_MASK_ISA_AES, 0, OPTION_MASK_ISA_AVX512VL,
+OPTION_MASK_ISA2_VAES);
   isa = tmp_isa;
   isa2 = tmp_isa2;
 
diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-vaes-1.c 
b/gcc/testsuite/gcc.target/i386/avx512vl-vaes-1.c
new file mode 100644
index 000..fabb170a031
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx512vl-vaes-1.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-mvaes -mno-avx512vl -mno-aes" } */
+
+#include 
+
+typedef long long v2di __attribute__((vector_size (16)));
+
+v2di
+f1 (v2di x, v2di y)
+{
+  return __builtin_ia32_aesenc128 (x, y); /* { dg-error "needs isa option" } */
+}
-- 
2.31.1



Re: [PATCH] i386: Guard 128 bit VAES builtins with AVX512VL

2023-07-10 Thread Hongtao Liu via Gcc-patches
On Tue, Jul 11, 2023 at 11:40 AM Haochen Jiang via Gcc-patches
 wrote:
>
> Hi all,
>
> Currently on trunk, using either the intrinsics or the builtins for the
> 128-bit VAES ISA results in an ICE, since we did not check AVX512VL
> until the pattern, which is not what users expect. This patch fixes
> that ICE and reports an error under this scenario.
>
> Regtested on x86-64-linux-gnu{-m32,}. Ok for trunk?
>
Ok.


-- 
BR,
Hongtao


[PATCH v2] Break false dependence for vpternlog by inserting vpxor or setting constraint of input operand to '0'

2023-07-10 Thread liuhongt via Gcc-patches
Here's the updated patch.
1. Use optimize_insn_for_speed_p instead of optimize_function_for_speed_p.
2. Explicitly move memory to the destination register to avoid a false
dependence in the one_cmpl pattern.


A false dependency happens when the destination is only written by
pternlog. There is no false dependency when the destination is also used
as a source. So either a pxor should be inserted, or the input operand
should be given the constraint '0'.

gcc/ChangeLog:

PR target/110438
PR target/110202
* config/i386/predicates.md
(int_float_vector_all_ones_operand): New predicate.
* config/i386/sse.md (*vmov_constm1_pternlog_false_dep): New
define_insn.
(*_cvtmask2_pternlog_false_dep):
Ditto.
(*_cvtmask2_pternlog_false_dep):
Ditto.
(*_cvtmask2): Adjust to
define_insn_and_split to avoid false dependence.
(*_cvtmask2): Ditto.
(one_cmpl2): Adjust constraint
of operands 1 to '0' to avoid false dependence.
(*andnot3): Ditto.
(iornot3): Ditto.
(*3): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr110438.c: New test.
* gcc.target/i386/pr100711.c: Adjust testcase.
---
 gcc/config/i386/predicates.md  |   8 +-
 gcc/config/i386/sse.md | 145 ++---
 gcc/testsuite/gcc.target/i386/pr100711-6.c |   2 +-
 gcc/testsuite/gcc.target/i386/pr110438.c   |  30 +
 4 files changed, 168 insertions(+), 17 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr110438.c

diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
index 7ddbe01a6f9..37d20c6303a 100644
--- a/gcc/config/i386/predicates.md
+++ b/gcc/config/i386/predicates.md
@@ -1192,12 +1192,18 @@ (define_predicate "float_vector_all_ones_operand"
 return false;
 })
 
-/* Return true if operand is a vector constant that is all ones. */
+/* Return true if operand is an integral vector constant that is all ones. */
 (define_predicate "vector_all_ones_operand"
   (and (match_code "const_vector")
(match_test "INTEGRAL_MODE_P (GET_MODE (op))")
(match_test "op == CONSTM1_RTX (GET_MODE (op))")))
 
+/* Return true if operand is a vector constant that is all ones. */
+(define_predicate "int_float_vector_all_ones_operand"
+  (ior (match_operand 0 "vector_all_ones_operand")
+   (match_operand 0 "float_vector_all_ones_operand")
+   (match_test "op == constm1_rtx")))
+
 /* Return true if operand is an 128/256bit all ones vector
that zero-extends to 256/512bit.  */
 (define_predicate "vector_all_ones_zero_extend_half_operand"
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 418c337a775..05485b1792d 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -1382,6 +1382,29 @@ (define_insn "mov_internal"
  ]
  (symbol_ref "true")))])
 
+; False dependency happens on destination register which is not really
+; used when moving all ones to vector register
+(define_split
+  [(set (match_operand:VMOVE 0 "register_operand")
+   (match_operand:VMOVE 1 "int_float_vector_all_ones_operand"))]
+  "TARGET_AVX512F && reload_completed
+  && ( == 64 || EXT_REX_SSE_REG_P (operands[0]))
+  && optimize_insn_for_speed_p ()"
+  [(set (match_dup 0) (match_dup 2))
+   (parallel
+ [(set (match_dup 0) (match_dup 1))
+  (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)])]
+  "operands[2] = CONST0_RTX (mode);")
+
+(define_insn "*vmov_constm1_pternlog_false_dep"
+  [(set (match_operand:VMOVE 0 "register_operand" "=v")
+   (match_operand:VMOVE 1 "int_float_vector_all_ones_operand" 
""))
+   (unspec [(match_operand:VMOVE 2 "register_operand" "0")] 
UNSPEC_INSN_FALSE_DEP)]
+   "TARGET_AVX512VL ||  == 64"
+   "vpternlogd\t{$0xFF, %0, %0, %0|%0, %0, %0, 0xFF}"
+  [(set_attr "type" "sselog1")
+   (set_attr "prefix" "evex")])
+
 ;; If mem_addr points to a memory region with less than whole vector size bytes
 ;; of accessible memory and k is a mask that would prevent reading the 
inaccessible
 ;; bytes from mem_addr, add UNSPEC_MASKLOAD to prevent it to be transformed to 
vpblendd
@@ -9336,7 +9359,7 @@ (define_expand "_cvtmask2"
 operands[3] = CONST0_RTX (mode);
   }")
 
-(define_insn "*_cvtmask2"
+(define_insn_and_split "*_cvtmask2"
   [(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v,v")
(vec_merge:VI48_AVX512VL
  (match_operand:VI48_AVX512VL 2 "vector_all_ones_operand")
@@ -9346,11 +9369,35 @@ (define_insn "*_cvtmask2"
   "@
vpmovm2\t{%1, %0|%0, %1}
vpternlog\t{$0x81, %0, %0, %0%{%1%}%{z%}|%0%{%1%}%{z%}, %0, 
%0, 0x81}"
+  "&& !TARGET_AVX512DQ && reload_completed
+   && optimize_function_for_speed_p (cfun)"
+  [(set (match_dup 0) (match_dup 4))
+   (parallel
+[(set (match_dup 0)
+ (vec_merge:VI48_AVX512VL
+   (match_dup 2)
+   (match_dup 3)
+   (match_dup 1)))
+ (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)])]
+  "operands[4] = CONST0_RTX

[PATCH] riscv: thead: Fix failing XTheadCondMov tests (indirect-rv[32|64])

2023-07-10 Thread Christoph Muellner
From: Christoph Müllner 

Recently, two identical XTheadCondMov tests have been added, which both fail.
Let's fix that by changing the following:
* Merge both files into one (no need for separate tests for rv32 and rv64)
* Drop unrelated attribute check test (we already test for `th.mveqz`
  and `th.mvnez` instructions, so there is little additional value)
* Fix the pattern to allow matching

gcc/testsuite/ChangeLog:

* gcc.target/riscv/xtheadcondmov-indirect-rv32.c: Moved to...
* gcc.target/riscv/xtheadcondmov-indirect.c: ...here.
* gcc.target/riscv/xtheadcondmov-indirect-rv64.c: Removed.

Fixes: a1806f0918c0 ("RISC-V: Optimize TARGET_XTHEADCONDMOV")
Signed-off-by: Christoph Müllner 
---
 .../riscv/xtheadcondmov-indirect-rv32.c   | 104 ---
 .../riscv/xtheadcondmov-indirect-rv64.c   | 104 ---
 .../gcc.target/riscv/xtheadcondmov-indirect.c | 118 ++
 3 files changed, 118 insertions(+), 208 deletions(-)
 delete mode 100644 gcc/testsuite/gcc.target/riscv/xtheadcondmov-indirect-rv32.c
 delete mode 100644 gcc/testsuite/gcc.target/riscv/xtheadcondmov-indirect-rv64.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/xtheadcondmov-indirect.c

diff --git a/gcc/testsuite/gcc.target/riscv/xtheadcondmov-indirect-rv32.c 
b/gcc/testsuite/gcc.target/riscv/xtheadcondmov-indirect-rv32.c
deleted file mode 100644
index d0df59c5e1c..000
--- a/gcc/testsuite/gcc.target/riscv/xtheadcondmov-indirect-rv32.c
+++ /dev/null
@@ -1,104 +0,0 @@
-/* { dg-do compile } */
-/* { dg-options "-O2 -march=rv32gc_xtheadcondmov -mabi=ilp32 
-mriscv-attribute" } */
-/* { dg-skip-if "" { *-*-* } {"-O0" "-O1" "-Os" "-Og" "-O3" "-Oz" "-flto"} } */
-/* { dg-final { check-function-bodies "**" ""  } } */
-
-/*
-**ConEmv_imm_imm_reg:
-** addi\t\s*[a-x0-9]+,\s*[a-x0-9]+,-1000+
-** li\t\s*[a-x0-9]+,10+
-** th.mvnez\t\s*[a-x0-9]+,\s*[a-x0-9]+,\s*[a-x0-9]+
-** ret
-*/
-int ConEmv_imm_imm_reg(int x, int y){
-  if (x == 1000) return 10;
-  return y;
-}
-
-/*
-**ConEmv_imm_reg_reg:
-** addi\t\s*[a-x0-9]+,\s*[a-x0-9]+,-1000+
-** th.mveqz\t\s*[a-x0-9]+,\s*[a-x0-9]+,\s*[a-x0-9]+
-** mv\t\s*[a-x0-9]+,\s*[a-x0-9]+
-** ret
-*/
-int ConEmv_imm_reg_reg(int x, int y, int z){
-  if (x == 1000) return y;
-  return z;
-}
-
-/*
-**ConEmv_reg_imm_reg:
-** sub\t\s*[a-x0-9]+,\s*[a-x0-9]+,\s*[a-x0-9]+
-** li\t\s*[a-x0-9]+,10+
-** th.mvnez\t\s*[a-x0-9]+,\s*[a-x0-9]+,\s*[a-x0-9]+
-** ret
-*/
-int ConEmv_reg_imm_reg(int x, int y, int z){
-  if (x == y) return 10;
-  return z;
-}
-
-/*
-**ConEmv_reg_reg_reg:
-** sub\t\s*[a-x0-9]+,\s*[a-x0-9]+,\s*[a-x0-9]+
-** th.mveqz\t\s*[a-x0-9]+,\s*[a-x0-9]+,\s*[a-x0-9]+
-** mv\t\s*[a-x0-9]+,\s*[a-x0-9]+
-** ret
-*/
-int ConEmv_reg_reg_reg(int x, int y, int z, int n){
-  if (x == y) return z;
-  return n;
-}
-
-/*
-**ConNmv_imm_imm_reg:
-** addi\t\s*[a-x0-9]+,\s*[a-x0-9]+,-1000+
-** li\t\s*[a-x0-9]+,9998336+
-** addi\t\s*[a-x0-9]+,\s*[a-x0-9]+,1664+
-** th.mveqz\t\s*[a-x0-9]+,\s*[a-x0-9]+,\s*[a-x0-9]+
-** ret
-*/
-int ConNmv_imm_imm_reg(int x, int y){
-  if (x != 1000) return 1000;
-  return y;
-}
-
-/*
-**ConNmv_imm_reg_reg:
-** addi\t\s*[a-x0-9]+,\s*[a-x0-9]+,-1000+
-** th.mvnez\t\s*[a-x0-9]+,\s*[a-x0-9]+,\s*[a-x0-9]+
-** mv\t\s*[a-x0-9]+,\s*[a-x0-9]+
-** ret
-*/
-int ConNmv_imm_reg_reg(int x, int y, int z){
-  if (x != 1000) return y;
-  return z;
-}
-
-/*
-**ConNmv_reg_imm_reg:
-** sub\t\s*[a-x0-9]+,\s*[a-x0-9]+,\s*[a-x0-9]+
-** li\t\s*[a-x0-9]+,10+
-** th.mveqz\t\s*[a-x0-9]+,\s*[a-x0-9]+,\s*[a-x0-9]+
-** ret
-*/
-int ConNmv_reg_imm_reg(int x, int y, int z){
-  if (x != y) return 10;
-  return z;
-}
-
-/*
-**ConNmv_reg_reg_reg:
-** sub\t\s*[a-x0-9]+,\s*[a-x0-9]+,\s*[a-x0-9]+
-** th.mvnez\t\s*[a-x0-9]+,\s*[a-x0-9]+,\s*[a-x0-9]+
-** mv\t\s*[a-x0-9]+,\s*[a-x0-9]+
-** ret
-*/
-int ConNmv_reg_reg_reg(int x, int y, int z, int n){
-  if (x != y) return z;
-  return n;
-}
-
-
-/* { dg-final { scan-assembler ".attribute arch, 
\"rv32i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_zicsr2p0_zifencei2p0_xtheadcondmov1p0\"" } 
} */
diff --git a/gcc/testsuite/gcc.target/riscv/xtheadcondmov-indirect-rv64.c 
b/gcc/testsuite/gcc.target/riscv/xtheadcondmov-indirect-rv64.c
deleted file mode 100644
index cc971a75ace..000
--- a/gcc/testsuite/gcc.target/riscv/xtheadcondmov-indirect-rv64.c
+++ /dev/null
@@ -1,104 +0,0 @@
-/* { dg-do compile } */
-/* { dg-options "-O2 -march=rv64gc_xtheadcondmov -mabi=lp64d 
-mriscv-attribute" } */
-/* { dg-skip-if "" { *-*-* } {"-O0" "-O1" "-Os" "-Og" "-O3" "-Oz" "-flto"} } */
-/* { dg-final { check-function-bodies "**" ""  } } */
-
-/*
-**ConEmv_imm_imm_reg:
-** addi\t\s*[a-x0-9]+,\s*[a-x0-9]+,-1000+
-** li\t\s*[a-x0-9]+,10+
-** th.mvnez\t\s*[a-x0-9]+,\s*[a-x0-9]+,\s*[a-x0-9]+
-** ret
-*/
-int ConEmv_imm_imm_reg(int x, int y){
-  if (x == 1000) return 10;
-  return y;
-}
-
-/*
-**ConEmv_imm_reg_reg:
-** addi

Re: [PATCH ver 2] rs6000, __builtin_set_fpscr_rn add retrun value

2023-07-10 Thread Kewen.Lin via Gcc-patches
on 2023/7/11 03:18, Carl Love wrote:
> On Fri, 2023-07-07 at 12:06 +0800, Kewen.Lin wrote:
>> Hi Carl,
>>
>> Some more minor comments are inline below on top of Peter's
>> insightful
>> review comments.
>>
>> on 2023/7/1 08:58, Carl Love wrote:
>>> GCC maintainers:
>>>
>>> Ver 2,  Went back thru the requirements and emails.  Not sure where
>>> I
>>> came up with the requirement for an overloaded version with double
>>> argument.  Removed the overloaded version with the double
>>> argument. 
>>> Added the macro to announce if the __builtin_set_fpscr_rn returns a
>>> void or a double with the FPSCR bits.  Updated the documentation
>>> file. 
>>> Retested on Power 8 BE/LE, Power 9 BE/LE, Power 10 LE.  Redid the
>>> test
>>> file.  Per request, the original test file functionality was not
>>> changed.  Just changed the name from test_fpscr_rn_builtin.c to 
>>> test_fpscr_rn_builtin_1.c.  Put new tests for the return values
>>> into a
>>> new test file, test_fpscr_rn_builtin_2.c.
>>>
>>> The GLibC team requested a builtin to replace the mffscrn and
>>> mffscrni inline asm instructions in the GLibC code.  Previously
>>> there
>>> was discussion on adding builtins for the mffscrn instructions.
>>>
>>> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620261.html
>>>
>>> In the end, it was felt that it would be best to extend the existing
>>> __builtin_set_fpscr_rn builtin to return a double instead of a void
>>> type.  The desire is that we could have the functionality of the
>>> mffscrn and mffscrni instructions on older ISAs.  The two
>>> instructions
>>> were initially added in ISA 3.0.  The __builtin_set_fpscr_rn has
>>> the
>>> needed functionality to set the RN field using the mffscrn and
>>> mffscrni
>>> instructions if ISA 3.0 is supported or fall back to using logical
>>> instructions to mask and set the bits for earlier ISAs.  The
>>> instructions return the current value of the FPSCR fields DRN, VE,
>>> OE,
>>> UE, ZE, XE, NI, RN bit positions then update the RN bit positions
>>> with
>>> the new RN value provided.
>>>
>>> The current __builtin_set_fpscr_rn builtin has a return type of
>>> void. 
>>> So, changing the return type to double and returning the  FPSCR
>>> fields
>>> DRN, VE, OE, UE, ZE, XE, NI, RN bit positions would then give the
>>> functionally equivalent of the mffscrn and mffscrni
>>> instructions.  Any
>>> current uses of the builtin would just ignore the return value yet
>>> any
>>> new uses could use the return value.  So the requirement is for the
>>> change to the __builtin_set_fpscr_rn builtin to be backwardly
>>> compatible and work for all ISAs.
>>>
>>> The following patch changes the return type of the
>>>  __builtin_set_fpscr_rn builtin from void to double.  The return
>>> value
>>> is the current value of the various FPSCR fields DRN, VE, OE, UE,
>>> ZE,
>>> XE, NI, RN bit positions when the builtin is called.  The builtin
>>> then
>>> updated the RN field with the new value provided as an argument to
>>> the
>>> builtin.  The patch adds new testcases to test_fpscr_rn_builtin.c
>>> to
>>> check that the builtin returns the current value of the FPSCR
>>> fields
>>> and then updates the RN field.
>>>
>>> The GLibC team has reviewed the patch to make sure it met their
>>> needs
>>> as a drop in replacement for the inline asm mffscr and mffscrni
>>> statements in the GLibC code.  T
>>>
>>> The patch has been tested on Power 8 LE/BE, Power 9 LE/BE and Power
>>> 10
>>> LE.
>>>
>>> Please let me know if the patch is acceptable for
>>> mainline.  Thanks.
>>>
>>>Carl 
>>>
>>>
>>> --
>>> rs6000, __builtin_set_fpscr_rn add retrun value
>>>
>>> Change the return value from void to double.  The return value
>>> consists of
>>> the FPSCR fields DRN, VE, OE, UE, ZE, XE, NI, RN bit
>>> positions.  Add an
>>> overloaded version which accepts a double argument.
>>>
>>> The test powerpc/test_fpscr_rn_builtin.c is updated to add tests
>>> for the
>>> double reterun value and the new double argument.
>>>
>>> gcc/ChangeLog:
>>> * config/rs6000/rs6000-builtins.def (__builtin_set_fpscr_rn):
>>> Update
>>> builtin definition return type.
>>> * config/rs6000-c.cc(rs6000_target_modify_macros): Add check,
>>> define
>>> __SET_FPSCR_RN_RETURNS_FPSCR__ macro.
>>> * config/rs6000/rs6000.md ((rs6000_get_fpscr_fields): New
>>> define_expand.
>>> (rs6000_update_fpscr_rn_field): New define_expand.
>>> (rs6000_set_fpscr_rn): Added return argument.  Updated
>>> to use new
>>> rs6000_get_fpscr_fields and rs6000_update_fpscr_rn_field define
>>>  _expands.
>>> * doc/extend.texi (__builtin_set_fpscr_rn): Update description
>>> for
>>> the return value and new double argument.  Add descripton for
>>> __SET_FPSCR_RN_RETURNS_FPSCR__ macro.
>>>
>>> gcc/testsuite/ChangeLog:
>>> gcc.target/powerpc/test_fpscr_rn_builtin.c: Renamed to
>>> test_fpscr_rn_builtin_1.c.  Added comment.
>>> gcc.target/pow

Re: [PATCH ver3] rs6000, Add return value to __builtin_set_fpscr_rn

2023-07-10 Thread Kewen.Lin via Gcc-patches
Hi Carl,

Excepting for Peter's review comments, some nits are inline below.

on 2023/7/11 03:18, Carl Love wrote:
> 
> GCC maintainers:
> 
> Ver 3, Renamed the patch per comments on ver 2.  Previous subject line
> was " [PATCH ver 2] rs6000, __builtin_set_fpscr_rn add retrun value".  
> Fixed spelling mistakes and formatting.  Updated define_expand
> "rs6000_set_fpscr_rn to have the rs6000_get_fpscr_fields and
> rs6000_update_fpscr_rn_field define expands inlined.  Optimized the
> code and fixed use of temporary register values. Updated the test file
> dg-do run arguments and dg-options.  Removed the check for
> __SET_FPSCR_RN_RETURNS_FPSCR__. Removed additional references to the
> overloaded built-in with double argument.  Fixed up the documentation
> file.  Updated patch retested on Power 8 BE/LE, Power 9 BE/LE and Power
> 10 LE.
> 
> Ver 2,  Went back thru the requirements and emails.  Not sure where I
> came up with the requirement for an overloaded version with double
> argument.  Removed the overloaded version with the double argument. 
> Added the macro to announce if the __builtin_set_fpscr_rn returns a
> void or a double with the FPSCR bits.  Updated the documentation file. 
> Retested on Power 8 BE/LE, Power 9 BE/LE, Power 10 LE.  Redid the test
> file.  Per request, the original test file functionality was not
> changed.  Just changed the name from test_fpscr_rn_builtin.c to 
> test_fpscr_rn_builtin_1.c.  Put new tests for the return values into a
> new test file, test_fpscr_rn_builtin_2.c.
> 
> The GLibC team requested a builtin to replace the mffscrn and
> mffscrni inline asm instructions in the GLibC code.  Previously there
> was discussion on adding builtins for the mffscrn instructions.
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620261.html
> 
> In the end, it was felt that it would be best to extend the existing
> __builtin_set_fpscr_rn builtin to return a double instead of a void
> type.  The desire is that we could have the functionality of the
> mffscrn and mffscrni instructions on older ISAs.  The two instructions
> were initially added in ISA 3.0.  The __builtin_set_fpscr_rn has the
> needed functionality to set the RN field using the mffscrn and mffscrni
> instructions if ISA 3.0 is supported or fall back to using logical
> instructions to mask and set the bits for earlier ISAs.  The
> instructions return the current value of the FPSCR fields DRN, VE, OE,
> UE, ZE, XE, NI, RN bit positions then update the RN bit positions with
> the new RN value provided.
> 
> The current __builtin_set_fpscr_rn builtin has a return type of void. 
> So, changing the return type to double and returning the  FPSCR fields
> DRN, VE, OE, UE, ZE, XE, NI, RN bit positions would then give the
> functionally equivalent of the mffscrn and mffscrni instructions.  Any
> current uses of the builtin would just ignore the return value yet any
> new uses could use the return value.  So the requirement is for the
> change to the __builtin_set_fpscr_rn builtin to be backwardly
> compatible and work for all ISAs.
> 
> The following patch changes the return type of the
>  __builtin_set_fpscr_rn builtin from void to double.  The return value
> is the current value of the various FPSCR fields DRN, VE, OE, UE, ZE,
> XE, NI, RN bit positions when the builtin is called.  The builtin then
> updated the RN field with the new value provided as an argument to the
> builtin.  The patch adds new testcases to test_fpscr_rn_builtin.c to
> check that the builtin returns the current value of the FPSCR fields
> and then updates the RN field.
> 
> The GLibC team has reviewed the patch to make sure it met their needs
> as a drop in replacement for the inline asm mffscr and mffscrni
> statements in the GLibC code.  T
> 
> The patch has been tested on Power 8 LE/BE, Power 9 LE/BE and Power 10
> LE.
> 
> Please let me know if the patch is acceptable for mainline.  Thanks.
> 
>Carl 
> 
> 
> 
> -
> rs6000, Add return value  to __builtin_set_fpscr_rn

Nit: One more unexpected space.

> 
> Change the return value from void to double for __builtin_set_fpscr_rn.
> The return value consists of the FPSCR fields DRN, VE, OE, UE, ZE, XE, NI,
> RN bit positions.  A new test file, test powerpc/test_fpscr_rn_builtin_2.c,
> is added to test the new return value for the built-in.

Nit: It would be better to note the newly added __SET_FPSCR_RN_RETURNS_FPSCR__
in commit log as well.

> 
> gcc/ChangeLog:
>   * config/rs6000/rs6000-builtins.def (__builtin_set_fpscr_rn): Update
>   built-in definition return type.
>   * config/rs6000-c.cc (rs6000_target_modify_macros): Add check,
>   define __SET_FPSCR_RN_RETURNS_FPSCR__ macro.
>   * config/rs6000/rs6000.md (rs6000_set_fpscr_rn): Added return

Nit: s/Added/Add/

>   argument to return FPSCR fields.
>   * doc/extend.texi (__builtin_set_fpscr_rn): Update description for
>   the return value.  A

[PATCH v3] x86: make better use of VBROADCASTSS / VPBROADCASTD

2023-07-10 Thread Jan Beulich via Gcc-patches
... in vec_dupv4sf / *vec_dupv4si. The respective broadcast insns are
never longer (yet sometimes shorter) than the corresponding VSHUFPS /
VPSHUFD, due to the immediate operand of the shuffle insns balancing the
(uniform) need for VEX3 in the broadcast ones. When EVEX encoding is
in use, the broadcast insns are always shorter.

Add new alternatives to cover the AVX2 and AVX512 cases as appropriate.

While touching this anyway, switch to consistently using "sseshuf1" in
the "type" attributes for all shuffle forms.

gcc/

* config/i386/sse.md (vec_dupv4sf): Make first alternative use
vbroadcastss for AVX2. New AVX512F alternative.
(*vec_dupv4si): New AVX2 and AVX512F alternatives using
vpbroadcastd. Replace sselog1 by sseshuf1 in "type" attribute.

gcc/testsuite/

* gcc.target/i386/avx2-dupv4sf.c: New test.
* gcc.target/i386/avx2-dupv4si.c: Likewise.
* gcc.target/i386/avx512f-dupv4sf.c: Likewise.
* gcc.target/i386/avx512f-dupv4si.c: Likewise.
---
Note that unlike originally intended, "prefix_extra" isn't dropped:
"length_vex" uses it to determine whether 2-byte VEX encoding is
possible (which it isn't for VBROADCASTSS / VPBROADCASTD). "length"
itself specifically does not use it for VEX/EVEX encoded insns.

Especially with the added "enabled" attribute I didn't really see how to
(further) fold alternatives 0 and 1. Instead *vec_dupv4si might benefit
from using sse2_noavx2 instead of sse2 for alternative 2, except that
there is no sse2_noavx2, only sse2_noavx.

I'm working from the assumption that the isa attributes to the original
1st and 2nd alternatives don't need further restricting (to sse2_noavx2
or avx_noavx2 as applicable), as the new earlier alternatives cover all
operand forms already when at least AVX2 is enabled.
---
v3: Testcases for new alternatives. "type" and "prefix_extra"
adjustments.
v2: Correct operand constraints. Respect -mprefer-vector-width=. Fold
two alternatives of vec_dupv4sf.

--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -25969,41 +25969,64 @@
(const_int 1)))])
 
 (define_insn "vec_dupv4sf"
-  [(set (match_operand:V4SF 0 "register_operand" "=v,v,x")
+  [(set (match_operand:V4SF 0 "register_operand" "=v,v,v,x")
(vec_duplicate:V4SF
- (match_operand:SF 1 "nonimmediate_operand" "Yv,m,0")))]
+ (match_operand:SF 1 "nonimmediate_operand" "Yv,v,m,0")))]
   "TARGET_SSE"
   "@
-   vshufps\t{$0, %1, %1, %0|%0, %1, %1, 0}
+   * return TARGET_AVX2 ? \"vbroadcastss\t{%1, %0|%0, %1}\" : \"vshufps\t{$0, 
%d1, %0|%0, %d1, 0}\";
+   vbroadcastss\t{%1, %g0|%g0, %1}
vbroadcastss\t{%1, %0|%0, %1}
shufps\t{$0, %0, %0|%0, %0, 0}"
-  [(set_attr "isa" "avx,avx,noavx")
-   (set_attr "type" "sseshuf1,ssemov,sseshuf1")
-   (set_attr "length_immediate" "1,0,1")
-   (set_attr "prefix_extra" "0,1,*")
-   (set_attr "prefix" "maybe_evex,maybe_evex,orig")
-   (set_attr "mode" "V4SF")])
+  [(set_attr "isa" "avx,*,avx,noavx")
+   (set (attr "type")
+   (cond [(and (eq_attr "alternative" "0")
+   (match_test "!TARGET_AVX2"))
+(const_string "sseshuf1")
+  (eq_attr "alternative" "3")
+(const_string "sseshuf1")
+ ]
+ (const_string "ssemov")))
+   (set (attr "length_immediate")
+   (if_then_else (eq_attr "type" "sseshuf1")
+ (const_string "1")
+ (const_string "0")))
+   (set_attr "prefix_extra" "0,1,1,*")
+   (set_attr "prefix" "maybe_evex,evex,maybe_evex,orig")
+   (set_attr "mode" "V4SF,V16SF,V4SF,V4SF")
+   (set (attr "enabled")
+   (if_then_else (eq_attr "alternative" "1")
+ (symbol_ref "TARGET_AVX512F && !TARGET_AVX512VL
+  && !TARGET_PREFER_AVX256")
+ (const_string "*")))])
 
 (define_insn "*vec_dupv4si"
-  [(set (match_operand:V4SI 0 "register_operand" "=v,v,x")
+  [(set (match_operand:V4SI 0 "register_operand" "=v,v,v,v,x")
(vec_duplicate:V4SI
- (match_operand:SI 1 "nonimmediate_operand" "Yv,m,0")))]
+ (match_operand:SI 1 "nonimmediate_operand" "Yvm,v,Yv,m,0")))]
   "TARGET_SSE"
   "@
+   vpbroadcastd\t{%1, %0|%0, %1}
+   vpbroadcastd\t{%1, %g0|%g0, %1}
%vpshufd\t{$0, %1, %0|%0, %1, 0}
vbroadcastss\t{%1, %0|%0, %1}
shufps\t{$0, %0, %0|%0, %0, 0}"
-  [(set_attr "isa" "sse2,avx,noavx")
-   (set_attr "type" "sselog1,ssemov,sselog1")
-   (set_attr "length_immediate" "1,0,1")
-   (set_attr "prefix_extra" "0,1,*")
-   (set_attr "prefix" "maybe_vex,maybe_evex,orig")
-   (set_attr "mode" "TI,V4SF,V4SF")
+  [(set_attr "isa" "avx2,*,sse2,avx,noavx")
+   (set_attr "type" "ssemov,ssemov,sseshuf1,ssemov,sseshuf1")
+   (set_attr "length_immediate" "0,0,1,0,1")
+   (set_attr "prefix_extra" "1,1,0,1,*")
+   (set_attr "prefix" "maybe_evex,evex,maybe_vex,maybe_evex,orig")
+   (set_attr "mode" "TI,XI,TI,V4SF,V4SF")
(set (attr "preferred_for_speed")
-  

[PATCH] x86: improve fast bfloat->float conversion

2023-07-10 Thread Jan Beulich via Gcc-patches
There's nothing AVX512BW-ish in here, so no reason to use Yw as the
constraints for the AVX alternative. Furthermore by using the 512-bit
form of VPSSLD (in a new alternative) all 32 registers can be used
directly by the insn without AVX512VL needing to be enabled.

Also adjust the originally last alternative's "prefix" attribute to
maybe_evex.

gcc/

* config/i386/i386.md (extendbfsf2_1): Add new AVX512F
alternative. Adjust original last alternative's "prefix"
attribute to maybe_evex.
---
The corresponding expander, "extendbfsf2", looks to have been dead since
its introduction in a1ecc5600464 ("Fix incorrect _mm_cvtsbh_ss"): The
builtin references the insn (extendbfsf2_1), not the expander. Can't the
expander be deleted and the name of the insn then pruned of the _1
suffix? If so, that further raises the question of the significance of
the "!HONOR_NANS (BFmode)" that the expander has, but the insn doesn't
have. Which may instead suggest the builtin was meant to reference the
expander. Yet then I can't see what would the builtin would expand to
when HONOR_NANS (BFmode) it true.

I further wonder whether the nearby "extendhfdf2" expander is really
needed. It doesn't look to specify anything that the corresponding insn
doesn't also specify.

--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -5181,21 +5181,27 @@
 ;; Don't use float_extend since psrlld doesn't raise
 ;; exceptions and turn a sNaN into a qNaN.
 (define_insn "extendbfsf2_1"
-  [(set (match_operand:SF 0 "register_operand"   "=x,Yw")
+  [(set (match_operand:SF 0 "register_operand"   "=x,Yv,v")
(unspec:SF
- [(match_operand:BF 1 "register_operand" " 0,Yw")]
+ [(match_operand:BF 1 "register_operand" " 0,Yv,v")]
  UNSPEC_CVTBFSF))]
  "TARGET_SSE2"
  "@
   pslld\t{$16, %0|%0, 16}
-  vpslld\t{$16, %1, %0|%0, %1, 16}"
-  [(set_attr "isa" "noavx,avx")
+  vpslld\t{$16, %1, %0|%0, %1, 16}
+  vpslld\t{$16, %g1, %g0|%g0, %g1, 16}"
+  [(set_attr "isa" "noavx,avx,*")
(set_attr "type" "sseishft1")
(set_attr "length_immediate" "1")
-   (set_attr "prefix_data16" "1,*")
-   (set_attr "prefix" "orig,vex")
-   (set_attr "mode" "TI")
-   (set_attr "memory" "none")])
+   (set_attr "prefix_data16" "1,*,*")
+   (set_attr "prefix" "orig,maybe_evex,evex")
+   (set_attr "mode" "TI,TI,XI")
+   (set_attr "memory" "none")
+   (set (attr "enabled")
+ (if_then_else (eq_attr "alternative" "2")
+   (symbol_ref "TARGET_AVX512F && !TARGET_AVX512VL
+   && !TARGET_PREFER_AVX256")
+   (const_string "*")))])
 
 (define_expand "extendxf2"
   [(set (match_operand:XF 0 "nonimmediate_operand")


RE: [PATCH v3] x86: make better use of VBROADCASTSS / VPBROADCASTD

2023-07-10 Thread Liu, Hongtao via Gcc-patches


> -Original Message-
> From: Jan Beulich 
> Sent: Tuesday, July 11, 2023 2:04 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Kirill Yukhin ; Liu, Hongtao
> 
> Subject: [PATCH v3] x86: make better use of VBROADCASTSS /
> VPBROADCASTD
> 
> ... in vec_dupv4sf / *vec_dupv4si. The respective broadcast insns are never
> longer (yet sometimes shorter) than the corresponding VSHUFPS / VPSHUFD,
> due to the immediate operand of the shuffle insns balancing the
> (uniform) need for VEX3 in the broadcast ones. When EVEX encoding is
> required, the broadcast insns are always shorter.
> 
> Add new alternatives to cover the AVX2 and AVX512 cases as appropriate.
> 
> While touching this anyway, switch to consistently using "sseshuf1" in the
> "type" attributes for all shuffle forms.
> 
> gcc/
> 
>   * config/i386/sse.md (vec_dupv4sf): Make first alternative use
>   vbroadcastss for AVX2. New AVX512F alternative.
>   (*vec_dupv4si): New AVX2 and AVX512F alternatives using
>   vpbroadcastd. Replace sselog1 by sseshuf1 in "type" attribute.
> 
> gcc/testsuite/
> 
>   * gcc.target/i386/avx2-dupv4sf.c: New test.
>   * gcc.target/i386/avx2-dupv4si.c: Likewise.
>   * gcc.target/i386/avx512f-dupv4sf.c: Likewise.
>   * gcc.target/i386/avx512f-dupv4si.c: Likewise.
> ---
> Note that unlike originally intended, "prefix_extra" isn't dropped:
> "length_vex" uses it to determine whether 2-byte VEX encoding is possible
> (which it isn't for VBROADCASTSS / VPBROADCASTD). "length"
> itself specifically does not use it for VEX/EVEX encoded insns.
> 
> Especially with the added "enabled" attribute I didn't really see how to
> (further) fold alternatives 0 and 1. Instead *vec_dupv4si might benefit from
> using sse2_noavx2 instead of sse2 for alternative 2, except that there is no
> sse2_noavx2, only sse2_noavx.
> 
> I'm working from the assumption that the isa attributes of the original
> 1st and 2nd alternatives don't need further restricting (to sse2_noavx2 or
> avx_noavx2 as applicable), as the new earlier alternatives cover all operand
> forms already when at least AVX2 is enabled.
Yes, the patch LGTM.
> ---
> v3: Testcases for new alternatives. "type" and "prefix_extra"
> adjustments.
> v2: Correct operand constraints. Respect -mprefer-vector-width=. Fold
> two alternatives of vec_dupv4sf.
> 
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -25969,41 +25969,64 @@
>   (const_int 1)))])
> 
>  (define_insn "vec_dupv4sf"
> -  [(set (match_operand:V4SF 0 "register_operand" "=v,v,x")
> +  [(set (match_operand:V4SF 0 "register_operand" "=v,v,v,x")
>   (vec_duplicate:V4SF
> -   (match_operand:SF 1 "nonimmediate_operand" "Yv,m,0")))]
> +   (match_operand:SF 1 "nonimmediate_operand" "Yv,v,m,0")))]
>"TARGET_SSE"
>"@
> -   vshufps\t{$0, %1, %1, %0|%0, %1, %1, 0}
> +   * return TARGET_AVX2 ? \"vbroadcastss\t{%1, %0|%0, %1}\" :
> \"vshufps\t{$0, %d1, %0|%0, %d1, 0}\";
> +   vbroadcastss\t{%1, %g0|%g0, %1}
> vbroadcastss\t{%1, %0|%0, %1}
> shufps\t{$0, %0, %0|%0, %0, 0}"
> -  [(set_attr "isa" "avx,avx,noavx")
> -   (set_attr "type" "sseshuf1,ssemov,sseshuf1")
> -   (set_attr "length_immediate" "1,0,1")
> -   (set_attr "prefix_extra" "0,1,*")
> -   (set_attr "prefix" "maybe_evex,maybe_evex,orig")
> -   (set_attr "mode" "V4SF")])
> +  [(set_attr "isa" "avx,*,avx,noavx")
> +   (set (attr "type")
> + (cond [(and (eq_attr "alternative" "0")
> + (match_test "!TARGET_AVX2"))
> +  (const_string "sseshuf1")
> +(eq_attr "alternative" "3")
> +  (const_string "sseshuf1")
> +   ]
> +   (const_string "ssemov")))
> +   (set (attr "length_immediate")
> + (if_then_else (eq_attr "type" "sseshuf1")
> +   (const_string "1")
> +   (const_string "0")))
> +   (set_attr "prefix_extra" "0,1,1,*")
> +   (set_attr "prefix" "maybe_evex,evex,maybe_evex,orig")
> +   (set_attr "mode" "V4SF,V16SF,V4SF,V4SF")
> +   (set (attr "enabled")
> + (if_then_else (eq_attr "alternative" "1")
> +   (symbol_ref "TARGET_AVX512F && !TARGET_AVX512VL
> +&& !TARGET_PREFER_AVX256")
> +   (const_string "*")))])
> 
>  (define_insn "*vec_dupv4si"
> -  [(set (match_operand:V4SI 0 "register_operand" "=v,v,x")
> +  [(set (match_operand:V4SI 0 "register_operand" "=v,v,v,v,x")
>   (vec_duplicate:V4SI
> -   (match_operand:SI 1 "nonimmediate_operand" "Yv,m,0")))]
> +   (match_operand:SI 1 "nonimmediate_operand" "Yvm,v,Yv,m,0")))]
>"TARGET_SSE"
>"@
> +   vpbroadcastd\t{%1, %0|%0, %1}
> +   vpbroadcastd\t{%1, %g0|%g0, %1}
> %vpshufd\t{$0, %1, %0|%0, %1, 0}
> vbroadcastss\t{%1, %0|%0, %1}
> shufps\t{$0, %0, %0|%0, %0, 0}"
> -  [(set_attr "isa" "sse2,avx,noavx")
> -   (set_attr "type" "sselog1,ssemov,sselog1")
> -   (set_attr "length_immediate" "1,0,1")
> -   (set_attr "prefix_extra" "0,1,*")
> -   (set_attr "p

[PATCH] RISC-V: Optimize permutation codegen with vcompress

2023-07-10 Thread juzhe . zhong
From: Ju-Zhe Zhong 

This patch recognizes a specific permutation pattern to which the
compress approach can be applied.

Consider this following case:
#include 
typedef int8_t vnx64i __attribute__ ((vector_size (64)));
#define MASK_64\
  1, 2, 3, 5, 7, 9, 10, 11, 12, 14, 15, 17, 19, 21, 22, 23, 26, 28, 30, 31,\
37, 38, 41, 46, 47, 53, 54, 55, 60, 61, 62, 63, 76, 77, 78, 79, 80, 81,\
82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,\
100, 101, 102, 103, 104, 105, 106, 107
void __attribute__ ((noinline, noclone)) test_1 (int8_t *x, int8_t *y, int8_t *out)
{
  vnx64i v1 = *(vnx64i*)x;
  vnx64i v2 = *(vnx64i*)y;
  vnx64i v3 = __builtin_shufflevector (v1, v2, MASK_64);
  *(vnx64i*)out = v3;
}

https://godbolt.org/z/P33nev6cW

Before this patch:
        lui     a4,%hi(.LANCHOR0)
        addi    a4,a4,%lo(.LANCHOR0)
        vl4re8.v        v4,0(a4)
        li      a4,64
        vsetvli a5,zero,e8,m4,ta,mu
        vl4re8.v        v20,0(a0)
        vl4re8.v        v16,0(a1)
        vmv.v.x v12,a4
        vrgather.vv     v8,v20,v4
        vmsgeu.vv       v0,v4,v12
        vsub.vv v4,v4,v12
        vrgather.vv     v8,v16,v4,v0.t
        vs4r.v  v8,0(a2)
        ret

After this patch:
        lui     a4,%hi(.LANCHOR0)
        addi    a4,a4,%lo(.LANCHOR0)
        vsetvli a5,zero,e8,m4,ta,ma
        vl4re8.v        v12,0(a1)
        vl4re8.v        v8,0(a0)
        vlm.v   v0,0(a4)
        vslideup.vi     v4,v12,20
        vcompress.vm    v4,v8,v0
        vs4r.v  v4,0(a2)
        ret

gcc/ChangeLog:

* config/riscv/riscv-protos.h (enum insn_type): Add vcompress 
optimization.
* config/riscv/riscv-v.cc (emit_vlmax_compress_insn): Ditto.
(shuffle_compress_patterns): Ditto.
(expand_vec_perm_const_1): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vls-vlmax/compress-1.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress-2.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress-3.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress-4.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress-5.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress-6.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-1.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-2.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-3.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-4.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-5.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-6.c: New test.

---
 gcc/config/riscv/riscv-protos.h   |   1 +
 gcc/config/riscv/riscv-v.cc   | 156 
 .../riscv/rvv/autovec/vls-vlmax/compress-1.c  |  21 ++
 .../riscv/rvv/autovec/vls-vlmax/compress-2.c  |  46 
 .../riscv/rvv/autovec/vls-vlmax/compress-3.c  |  60 +
 .../riscv/rvv/autovec/vls-vlmax/compress-4.c  |  81 +++
 .../riscv/rvv/autovec/vls-vlmax/compress-5.c  |  85 +++
 .../riscv/rvv/autovec/vls-vlmax/compress-6.c  |  95 
 .../rvv/autovec/vls-vlmax/compress_run-1.c|  27 +++
 .../rvv/autovec/vls-vlmax/compress_run-2.c|  51 
 .../rvv/autovec/vls-vlmax/compress_run-3.c|  79 +++
 .../rvv/autovec/vls-vlmax/compress_run-4.c| 117 +
 .../rvv/autovec/vls-vlmax/compress_run-5.c| 149 
 .../rvv/autovec/vls-vlmax/compress_run-6.c| 222 ++
 14 files changed, 1190 insertions(+)
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/compress-1.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/compress-2.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/compress-3.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/compress-4.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/compress-5.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/compress-6.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-1.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-2.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-3.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-4.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-5.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-6.c

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 5766e3597e8..6cd5c6639c9 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -148,6 +148,7 @@ enum insn_type
   RVV_WIDEN_TER

Re: [x86-64] RFC: Add nosse abi attribute

2023-07-10 Thread Richard Biener via Gcc-patches
On Mon, Jul 10, 2023 at 9:08 PM Alexander Monakov via Gcc-patches
 wrote:
>
>
> On Mon, 10 Jul 2023, Michael Matz via Gcc-patches wrote:
>
> > Hello,
> >
> > the ELF psABI for x86-64 doesn't have any callee-saved SSE
> > registers (there were actual reasons for that, but those don't
> > matter anymore).  This starts to hurt some uses, as it means that
> > as soon as you have a call (say to memmove/memcpy, even if
> > implicit as libcall) in a loop that manipulates floating point
> > or vector data you get saves/restores around those calls.
> >
> > But in reality many functions can be written such that they only need
> > to clobber a subset of the 16 XMM registers (or do the save/restore
> > themself in the codepaths that needs them, hello memcpy again).
> > So we want to introduce a way to specify this, via an ABI attribute
> > that basically says "doesn't clobber the high XMM regs".
>
> I think the main question is why you're going with this (weak) form
> instead of the (strong) form "may only clobber the low XMM regs":
> as Richi noted, surely for libcalls we'd like to know they preserve
> AVX-512 mask registers as well?
>
> (I realize this is partially answered later)
>
> Note this interacts with anything that interposes between the caller
> and the callee, like the Glibc lazy binding stub (which used to
> zero out high halves of 512-bit arguments in ZMM registers).
> Not an immediate problem for the patch, just something to mind perhaps.
>
> > I've opted to do only the obvious: do something special only for
> > xmm8 to xmm15, without a way to specify the clobber set in more detail.
> > I think such half/half split is reasonable, and as I don't want to
> > change the argument passing anyway (whose regs are always clobbered)
> > there isn't that much wiggle room anyway.
> >
> > I chose to make it possible to write function definitions with that
> > attribute with GCC adding the necessary callee save/restore code in
> > the xlogue itself.
>
> But you can't trivially restore if the callee is sibcalling — what
> happens then (a testcase might be nice)?
>
> > Carefully note that this is only possible for
> > the SSE2 registers, as other parts of them would need instructions
> > that are only optional.
>
> What is supposed to happen on 32-bit x86 with -msse -mno-sse2?
>
> > When a function doesn't contain calls to
> > unknown functions we can be a bit more lenient: we can make it so that
> > GCC simply doesn't touch xmm8-15 at all, then no save/restore is
> > necessary.
>
> What if the source code has a local register variable bound to xmm15,
> i.e. register double x asm("xmm15"); asm("..." : "+x"(x)); ?
> Probably "don't do that", i.e. disallow that in the documentation?
>
> > If a function contains calls then GCC can't know which
> > parts of the XMM regset is clobbered by that, it may be parts
> > which don't even exist yet (say until avx2048 comes out), so we must
> > restrict ourself to only save/restore the SSE2 parts and then of course
> > can only claim to not clobber those parts.
>
> Hm, I guess this is kinda the reason a "weak" form is needed. But this
> highlights the difference between the two: the "weak" form will actively
> preserve some state (so it cannot preserve future extensions), while
> the "strong" form may just passively not touch any state, preserving
> any state it doesn't know about.
>
> > To that end I introduce actually two related attributes (for naming
> > see below):
> > * nosseclobber: claims (and ensures) that xmm8-15 aren't clobbered
>
> This is the weak/active form; I'd suggest "preserve_high_sse".
>
> > * noanysseclobber: claims (and ensures) that nothing of any of the
> >   registers overlapping xmm8-15 is clobbered (not even future, as of
> >   yet unknown, parts)
>
> This is the strong/passive form; I'd suggest "only_low_sse".
>
> > Ensuring the first is simple: potentially add saves/restores in the xlogue
> > (e.g. when xmm8 is either used explicitly or implicitly by a call).
> > Ensuring the second comes with more: we must also ensure that no
> > functions are called that don't guarantee the same thing (in addition
> > to just removing all xmm8-15 parts altogether from the available
> > registers).
> >
> > See also the added testcases for what I intended to support.
> >
> > I chose to use the new target independend function-abi facility for
> > this.  I need some adjustments in generic code:
> > * the "default_abi" is actually more like a "current" abi: it happily
> >   changes its contents according to conditional_register_usage,
> >   and other code assumes that such changes do propagate.
> >   But if that conditonal_reg_usage is actually done because the current
> >   function is of a different ABI, then we must not change default_abi.
> > * in insn_callee_abi we do look at a potential fndecl for a call
> >   insn (only set when -fipa-ra), but doesn't work for calls through
> >   pointers and (as said) is optional.  So, also always look at the
> >   called functions t

RE: [PATCH] x86: improve fast bfloat->float conversion

2023-07-10 Thread Liu, Hongtao via Gcc-patches


> -Original Message-
> From: Jan Beulich 
> Sent: Tuesday, July 11, 2023 2:08 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Liu, Hongtao ; Kirill Yukhin
> 
> Subject: [PATCH] x86: improve fast bfloat->float conversion
> 
> There's nothing AVX512BW-ish in here, so no reason to use Yw as the
> constraints for the AVX alternative. Furthermore, by using the 512-bit form of
> VPSLLD (in a new alternative) all 32 registers can be used directly by the
> insn without AVX512VL needing to be enabled.
Yes, the instruction vpslld doesn't need AVX512BW; the patch LGTM.
> 
> Also adjust the originally last alternative's "prefix" attribute to 
> maybe_evex.
> 
> gcc/
> 
>   * config/i386/i386.md (extendbfsf2_1): Add new AVX512F
>   alternative. Adjust original last alternative's "prefix"
>   attribute to maybe_evex.
> ---
> The corresponding expander, "extendbfsf2", looks to have been dead since
> its introduction in a1ecc5600464 ("Fix incorrect _mm_cvtsbh_ss"): The builtin
> references the insn (extendbfsf2_1), not the expander. Can't the expander
> be deleted and the name of the insn then pruned of the _1 suffix? If so, that
> further raises the question of the significance of the "!HONOR_NANS
> (BFmode)" that the expander has, but the insn doesn't have. Which may
> instead suggest the builtin was meant to reference the expander. Yet then I
> can't see what the builtin would expand to when HONOR_NANS
> (BFmode) is true.

Quote from what Jakub said in [1].
---
This is not correct.
While using such code for _mm_cvtsbh_ss is fine if it is documented not to
raise exceptions and turn a sNaN into a qNaN, it is not fine for HONOR_NANS
(i.e. when -ffast-math is not on), because a __bf16 -> float conversion
on sNaN should raise invalid exception and turn it into a qNaN.
We could have extendbfsf2 expander that would FAIL; if HONOR_NANS and
emit extendbfsf2_1 otherwise. 
---
[1] https://gcc.gnu.org/pipermail/gcc-patches/2022-November/607108.html
> 
> I further wonder whether the nearby "extendhfdf2" expander is really
> needed. It doesn't look to specify anything that the corresponding insn
> doesn't also specify.
> 
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -5181,21 +5181,27 @@
>  ;; Don't use float_extend since psrlld doesn't raise
>  ;; exceptions and turn a sNaN into a qNaN.
>  (define_insn "extendbfsf2_1"
> -  [(set (match_operand:SF 0 "register_operand"   "=x,Yw")
> +  [(set (match_operand:SF 0 "register_operand"   "=x,Yv,v")
>   (unspec:SF
> -   [(match_operand:BF 1 "register_operand" " 0,Yw")]
> +   [(match_operand:BF 1 "register_operand" " 0,Yv,v")]
> UNSPEC_CVTBFSF))]
>   "TARGET_SSE2"
>   "@
>pslld\t{$16, %0|%0, 16}
> -  vpslld\t{$16, %1, %0|%0, %1, 16}"
> -  [(set_attr "isa" "noavx,avx")
> +  vpslld\t{$16, %1, %0|%0, %1, 16}
> +  vpslld\t{$16, %g1, %g0|%g0, %g1, 16}"
> +  [(set_attr "isa" "noavx,avx,*")
> (set_attr "type" "sseishft1")
> (set_attr "length_immediate" "1")
> -   (set_attr "prefix_data16" "1,*")
> -   (set_attr "prefix" "orig,vex")
> -   (set_attr "mode" "TI")
> -   (set_attr "memory" "none")])
> +   (set_attr "prefix_data16" "1,*,*")
> +   (set_attr "prefix" "orig,maybe_evex,evex")
> +   (set_attr "mode" "TI,TI,XI")
> +   (set_attr "memory" "none")
> +   (set (attr "enabled")
> + (if_then_else (eq_attr "alternative" "2")
> +   (symbol_ref "TARGET_AVX512F && !TARGET_AVX512VL
> + && !TARGET_PREFER_AVX256")
> +   (const_string "*")))])
> 
>  (define_expand "extendxf2"
>[(set (match_operand:XF 0 "nonimmediate_operand")

