Re: [PATCH] lower-subreg, expr: Mitigate inefficiencies derived from "(clobber (reg X))" followed by "(set (subreg (reg X)) (...))"

2022-08-03 Thread Richard Sandiford via Gcc-patches
Takayuki 'January June' Suwa via Gcc-patches  writes:
> Emitting "(clobber (reg X))" before "(set (subreg (reg X)) (...))" keeps
> data flow consistent, but it also increases register allocation pressure
> and thus often creates many unwanted register-to-register moves that
> cannot be optimized away.

There are two things here:

- If emit_move_complex_parts emits a clobber of a hard register,
  then that's probably a bug/misfeature.  The point of the clobber is
  to indicate that the register has no useful contents.  That's useful
  for wide pseudos that are written to in parts, since it avoids the
  need to track the liveness of each part of the pseudo individually.
  But it shouldn't be necessary for hard registers, since subregs of
  hard registers are simplified to hard registers wherever possible
  (which on most targets is "always").

  So I think the emit_move_complex_parts clobber should be restricted
  to !HARD_REGISTER_P, like the lower-subreg clobber is.  If that helps
  (if only partly) then it would be worth doing as its own patch.

- I think it'd be worth looking into more detail why a clobber makes
  a difference to register pressure.  A clobber of a pseudo register R
  shouldn't make R conflict with things that are live at the point of
  the clobber.

>  It seems analogous to a partial register
> stall, a well-known problem on processors that perform register renaming.
>
> In my opinion, when the register to be clobbered is a composite of hard
> ones, we should clobber the individual elements separately; otherwise,
> clear the entire register to zero prior to use, as the "init-regs" pass
> does (like partial register stall workarounds on x86 CPUs).  Such
> redundant zero-constant assignments will be removed later in the
> "cprop_hardreg" pass.

I don't think we should rely on the zero being optimised away later.

Emitting the zero also makes it harder for the register allocator
to elide the move.  For example, if we have:

  (set (subreg:SI (reg:DI P) 0) (reg:SI R0))
  (set (subreg:SI (reg:DI P) 4) (reg:SI R1))

then there is at least a chance that the RA could assign hard registers
R0:R1 to P, which would turn the moves into nops.  If we emit:

  (set (reg:DI P) (const_int 0))

beforehand then that becomes impossible, since R0 and R1 would then
conflict with P.

TBH I'm surprised we still run init_regs for LRA.  I thought there was
a plan to stop doing that, but perhaps I misremember.

Thanks,
Richard

> This patch may give better output code quality for the reasons above,
> especially on architectures that don't have DFmode hard registers
> (On architectures with such hard registers, this patch changes virtually
> nothing).
>
> For example (Espressif ESP8266, Xtensa without FP hard regs):
>
> /* example */
> double _Complex conjugate(double _Complex z) {
>   __imag__(z) *= -1;
>   return z;
> }
>
> ;; before
> conjugate:
> movi.n  a6, -1
> slli    a6, a6, 31
> mov.n   a8, a2
> mov.n   a9, a3
> mov.n   a7, a4
> xor a6, a5, a6
> mov.n   a2, a8
> mov.n   a3, a9
> mov.n   a4, a7
> mov.n   a5, a6
> ret.n
>
> ;; after
> conjugate:
> movi.n  a6, -1
> slli    a6, a6, 31
> xor a6, a5, a6
> mov.n   a5, a6
> ret.n
>
> gcc/ChangeLog:
>
>   * lower-subreg.cc (resolve_simple_move):
>   Add zero clear of the entire register immediately after
>   the clobber.
>   * expr.cc (emit_move_complex_parts):
>   Change to clobber the real and imaginary parts separately
>   instead of the whole complex register if possible.
> ---
>  gcc/expr.cc | 26 --
>  gcc/lower-subreg.cc |  7 ++-
>  2 files changed, 26 insertions(+), 7 deletions(-)
>
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index 80bb1b8a4c5..9732e8fd4e5 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -3775,15 +3775,29 @@ emit_move_complex_push (machine_mode mode, rtx x, rtx 
> y)
>  rtx_insn *
>  emit_move_complex_parts (rtx x, rtx y)
>  {
> -  /* Show the output dies here.  This is necessary for SUBREGs
> - of pseudos since we cannot track their lifetimes correctly;
> - hard regs shouldn't appear here except as return values.  */
> -  if (!reload_completed && !reload_in_progress
> -  && REG_P (x) && !reg_overlap_mentioned_p (x, y))
> -emit_clobber (x);
> +  rtx_insn *re_insn, *im_insn;
>  
>write_complex_part (x, read_complex_part (y, false), false, true);
> +  re_insn = get_last_insn ();
>write_complex_part (x, read_complex_part (y, true), true, false);
> +  im_insn = get_last_insn ();
> +
> +  /* Show the output dies here.  This is necessary for SUBREGs
> + of pseudos since we cannot track their lifetimes correctly.  */
> +  if (can_create_pseudo_p ()
> +  && REG_P (x) && ! reg_overlap_mentioned_p (x, y))
> +{
> +  /* Hard regs shouldn't appear here except as return values.  */
> +  if (HA

RE: [PATCH 2/2]middle-end: Support recognition of three-way max/min.

2022-08-03 Thread Tamar Christina via Gcc-patches

> -Original Message-
> From: Richard Biener 
> Sent: Tuesday, August 2, 2022 10:11 AM
> To: Tamar Christina 
> Cc: Richard Biener ; ja...@redhat.com; nd
> ; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH 2/2]middle-end: Support recognition of three-way
> max/min.
> 
> On Tue, Aug 2, 2022 at 10:33 AM Tamar Christina via Gcc-patches  patc...@gcc.gnu.org> wrote:
> >
> > > > > > When this function replaces the edge it doesn't seem to update
> > > > > > the
> > > > > dominators.
> > > > > > > Since it's replacing the middle BB we then end up with an
> > > > > > error
> > > > > >
> > > > > > gcc/testsuite/gcc.dg/tree-ssa/minmax-14.c:17:1: error:
> > > > > > dominator of 5 should be 4, not 2
> > > > > >
> > > > > > during early verify. So instead, I replace the BB but defer
> > > > > > its deletion until cleanup which removes it and updates the
> dominators.
> > > > >
> > > > > Hmm, for a diamond shouldn't you replace
> > > > >
> > > > >   if (EDGE_SUCC (cond_block, 0)->dest == bb)
> > > > > edge_to_remove = EDGE_SUCC (cond_block, 1);
> > > > >   else
> > > > > edge_to_remove = EDGE_SUCC (cond_block, 0);
> > > > >
> > > > > with
> > > > >
> > > > >   if (EDGE_SUCC (cond_block, 0)->dest == bb)
> > > > > edge_to_remove = EDGE_SUCC (cond_block, 1);
> > > > >   else if (EDGE_SUCC (cond_block, 1)->dest == bb)
> > > > > edge_to_remove = EDGE_SUCC (cond_block, 0);
> > > > >
> > > > > thus, the code expects to be left with a fallthru to the PHI
> > > > > block which is expected to have the immediate dominator being
> > > > > cond_block but with a diamond there's a (possibly empty) block
> > > > > inbetween and dominators are wrong.
> > > >
> > > > Agreed, but the (EDGE_SUCC (cond_block, 1)->dest == bb) doesn't
> > > > seem like the Right one since for a diamond there will be a block
> > > > in between the two.  Did you perhaps mean  EDGE_SUCC (EDGE_SUCC
> > > > (cond_block, 1)->dest, 0)->dest == bb? i.e. that that destination
> > > > across the
> > > diamond be bb, and then you remove the middle block?
> > >
> > > Hmm, I think my condition was correct - the code tries to remove the
> > > edge to the middle-block and checks the remaining edge falls through
> > > to the merge block.  With a true diamond there is no fallthru to the
> > > merge block to keep so we better don't remove any edge?
> > >
> > > > For the minmax diamond we want both edges removed, since all the
> > > > code in the middle BBs are now dead.  But this is probably not
> > > > true in the general
> > > sense.
> >
> > Ah! Sorry I was firing a few cylinders short, I get what you mean now:
> >
> > @@ -425,8 +439,19 @@ replace_phi_edge_with_variable (basic_block
> cond_block,
> >edge edge_to_remove;
> >if (EDGE_SUCC (cond_block, 0)->dest == bb)
> >  edge_to_remove = EDGE_SUCC (cond_block, 1);
> > -  else
> > +  else if (EDGE_SUCC (cond_block, 1)->dest == bb)
> >  edge_to_remove = EDGE_SUCC (cond_block, 0);
> > +  else
> > +{
> > +  /* If neither edge from the conditional is the final bb
> > +then we must have a diamond block, in which case
> > +the true edge was changed by SET_USE above and we must
> > +mark the other edge as the false edge.  */
> > +  gcond *cond = as_a  (last_stmt (cond_block));
> > +  gimple_cond_make_false (cond);
> > +  return;
> > +}
> > +
> 
> Note there is already
> 
>   if (EDGE_COUNT (edge_to_remove->dest->preds) == 1)
> {
> ...
> }
>   else
> {
>   /* If there are other edges into the middle block make
>  CFG cleanup deal with the edge removal to avoid
>  updating dominators here in a non-trivial way.  */
>   gcond *cond = as_a  (last_stmt (cond_block));
>   if (edge_to_remove->flags & EDGE_TRUE_VALUE)
> gimple_cond_make_false (cond);
>   else
> gimple_cond_make_true (cond);
> }
> 
> I'm not sure how you can say 'e' is always the true edge?  May I suggest to
> amend the first condition with edge_to_remove && (and initialize that to
> NULL) and use e->flags instead of edge_to_remove in the else, of course
> also inverting the logic since we're keeping 'e'?

As discussed on IRC, here's the version using keep_edge:

@@ -422,12 +436,17 @@ replace_phi_edge_with_variable (basic_block cond_block,
   SET_USE (PHI_ARG_DEF_PTR (phi, e->dest_idx), new_tree);

   /* Remove the empty basic block.  */
-  edge edge_to_remove;
+  edge edge_to_remove = NULL, keep_edge = NULL;
   if (EDGE_SUCC (cond_block, 0)->dest == bb)
 edge_to_remove = EDGE_SUCC (cond_block, 1);
-  else
+  else if (EDGE_SUCC (cond_block, 1)->dest == bb)
 edge_to_remove = EDGE_SUCC (cond_block, 0);
-  if (EDGE_COUNT (edge_to_remove->dest->preds) == 1)
+  else if ((keep_edge = find_edge (cond_block, e->src)))
+;
+  else
+gcc_unreachable ();
+
+  if (edge_to_remove && EDGE_COUNT (edge_to_remove->dest->preds) == 1)
 {
   e->flags |= EDGE_FALLTHRU;
   e->flags &= ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE);
@@ -4
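For context, this is the shape of code the series aims to recognize (a hypothetical sketch, not the patch's actual testcase): a three-way max written as nested conditionals, which expands to the diamond-shaped CFG that replace_phi_edge_with_variable has to handle.

```c
#include <assert.h>

/* Sketch: a three-way max as nested conditionals.  phiopt should
   collapse the resulting diamond CFG into two MAX_EXPRs; the edge
   bookkeeping discussed above is what makes that collapse safe.  */
static int max3 (int a, int b, int c)
{
  int t = a > b ? a : b;
  return t > c ? t : c;
}

int main (void)
{
  assert (max3 (1, 2, 3) == 3);
  assert (max3 (5, 2, 3) == 5);
  assert (max3 (1, 7, 3) == 7);
  return 0;
}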

[PATCH, rs6000] TARGET_MADDLD should include TARGET_POWERPC64

2022-08-03 Thread HAO CHEN GUI via Gcc-patches
Hi,
  This patch changes the definition of TARGET_MADDLD to include
TARGET_POWERPC64, since maddld is a 64-bit instruction.

  maddld-1.c now checks "has_arch_ppc64".  It depends on a patch which fixes
the empty-TU problem:
https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598744.html

  Bootstrapped and tested on powerpc64-linux BE and LE with no regressions.
Is this okay for trunk? Any recommendations? Thanks a lot.

ChangeLog
2022-08-03  Haochen Gui  

gcc/
* config/rs6000/rs6000.h (TARGET_MADDLD): Define.

gcc/testsuite/
* gcc.target/powerpc/maddld-1.c: Modify target requirement to compile
it on the target which supports 64 bit instructions.


patch.diff
diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
index 7d04556304a..2f15451fd8b 100644
--- a/gcc/config/rs6000/rs6000.h
+++ b/gcc/config/rs6000/rs6000.h
@@ -466,7 +466,7 @@ extern int rs6000_vector_align[];
 #define TARGET_FCTIWUZ TARGET_POPCNTD
 #define TARGET_CTZ TARGET_MODULO
 #define TARGET_EXTSWSLI(TARGET_MODULO && TARGET_POWERPC64)
-#define TARGET_MADDLD  TARGET_MODULO
+#define TARGET_MADDLD  (TARGET_MODULO && TARGET_POWERPC64)

 #define TARGET_XSCVDPSPN   (TARGET_DIRECT_MOVE || TARGET_P8_VECTOR)
 #define TARGET_XSCVSPDPN   (TARGET_DIRECT_MOVE || TARGET_P8_VECTOR)
diff --git a/gcc/testsuite/gcc.target/powerpc/maddld-1.c 
b/gcc/testsuite/gcc.target/powerpc/maddld-1.c
index 4edecf1c86d..0a53658e058 100644
--- a/gcc/testsuite/gcc.target/powerpc/maddld-1.c
+++ b/gcc/testsuite/gcc.target/powerpc/maddld-1.c
@@ -1,4 +1,4 @@
-/* { dg-do compile } */
+/* { dg-do compile { target { has_arch_ppc64 } } } */
 /* { dg-options "-mdejagnu-cpu=power9 -O2" } */

 /* This file tests the maddld instruction can be used in SI mode


RE: [PATCH 2/2]middle-end: Support recognition of three-way max/min.

2022-08-03 Thread Richard Biener via Gcc-patches
On Wed, 3 Aug 2022, Tamar Christina wrote:

> 
> > -Original Message-
> > From: Richard Biener 
> > Sent: Tuesday, August 2, 2022 10:11 AM
> > To: Tamar Christina 
> > Cc: Richard Biener ; ja...@redhat.com; nd
> > ; gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH 2/2]middle-end: Support recognition of three-way
> > max/min.
> > 
> > On Tue, Aug 2, 2022 at 10:33 AM Tamar Christina via Gcc-patches  > patc...@gcc.gnu.org> wrote:
> > >
> > > > > > > When this function replaces the edge it doesn't seem to update
> > > > > > > the
> > > > > > dominators.
> > > > > > > Since It's replacing the middle BB we then end up with an
> > > > > > > error
> > > > > > >
> > > > > > > gcc/testsuite/gcc.dg/tree-ssa/minmax-14.c:17:1: error:
> > > > > > > dominator of 5 should be 4, not 2
> > > > > > >
> > > > > > > during early verify. So instead, I replace the BB but defer
> > > > > > > its deletion until cleanup which removes it and updates the
> > dominators.
> > > > > >
> > > > > > Hmm, for a diamond shouldn't you replace
> > > > > >
> > > > > >   if (EDGE_SUCC (cond_block, 0)->dest == bb)
> > > > > > edge_to_remove = EDGE_SUCC (cond_block, 1);
> > > > > >   else
> > > > > > edge_to_remove = EDGE_SUCC (cond_block, 0);
> > > > > >
> > > > > > with
> > > > > >
> > > > > >   if (EDGE_SUCC (cond_block, 0)->dest == bb)
> > > > > > edge_to_remove = EDGE_SUCC (cond_block, 1);
> > > > > >   else if (EDGE_SUCC (cond_block, 1)->dest == bb)
> > > > > > edge_to_remove = EDGE_SUCC (cond_block, 0);
> > > > > >
> > > > > > thus, the code expects to be left with a fallthru to the PHI
> > > > > > block which is expected to have the immediate dominator being
> > > > > > cond_block but with a diamond there's a (possibly empty) block
> > > > > > inbetween and dominators are wrong.
> > > > >
> > > > > Agreed, but the (EDGE_SUCC (cond_block, 1)->dest == bb) doesn't
> > > > > seem like the Right one since for a diamond there will be a block
> > > > > in between the two.  Did you perhaps mean  EDGE_SUCC (EDGE_SUCC
> > > > > (cond_block, 1)->dest, 0)->dest == bb? i.e. that that destination
> > > > > across the
> > > > diamond be bb, and then you remove the middle block?
> > > >
> > > > Hmm, I think my condition was correct - the code tries to remove the
> > > > edge to the middle-block and checks the remaining edge falls through
> > > > to the merge block.  With a true diamond there is no fallthru to the
> > > > merge block to keep so we better don't remove any edge?
> > > >
> > > > > For the minmax diamond we want both edges removed, since all the
> > > > > code in the middle BBs are now dead.  But this is probably not
> > > > > true in the general
> > > > sense.
> > >
> > > Ah! Sorry I was firing a few cylinders short, I get what you mean now:
> > >
> > > @@ -425,8 +439,19 @@ replace_phi_edge_with_variable (basic_block
> > cond_block,
> > >edge edge_to_remove;
> > >if (EDGE_SUCC (cond_block, 0)->dest == bb)
> > >  edge_to_remove = EDGE_SUCC (cond_block, 1);
> > > -  else
> > > +  else if (EDGE_SUCC (cond_block, 1)->dest == bb)
> > >  edge_to_remove = EDGE_SUCC (cond_block, 0);
> > > +  else
> > > +{
> > > +  /* If neither edge from the conditional is the final bb
> > > +then we must have a diamond block, in which case
> > > +the true edge was changed by SET_USE above and we must
> > > +mark the other edge as the false edge.  */
> > > +  gcond *cond = as_a  (last_stmt (cond_block));
> > > +  gimple_cond_make_false (cond);
> > > +  return;
> > > +}
> > > +
> > 
> > Note there is already
> > 
> >   if (EDGE_COUNT (edge_to_remove->dest->preds) == 1)
> > {
> > ...
> > }
> >   else
> > {
> >   /* If there are other edges into the middle block make
> >  CFG cleanup deal with the edge removal to avoid
> >  updating dominators here in a non-trivial way.  */
> >   gcond *cond = as_a  (last_stmt (cond_block));
> >   if (edge_to_remove->flags & EDGE_TRUE_VALUE)
> > gimple_cond_make_false (cond);
> >   else
> > gimple_cond_make_true (cond);
> > }
> > 
> > I'm not sure how you can say 'e' is always the true edge?  May I suggest to
> > amend the first condition with edge_to_remove && (and initialize that to
> > NULL) and use e->flags instead of edge_to_remove in the else, of course
> > also inverting the logic since we're keeping 'e'?
> 
> As discussed on IRC, here's the version using keep_edge:
> 
> @@ -422,12 +436,17 @@ replace_phi_edge_with_variable (basic_block cond_block,
>SET_USE (PHI_ARG_DEF_PTR (phi, e->dest_idx), new_tree);
> 
>/* Remove the empty basic block.  */
> -  edge edge_to_remove;
> +  edge edge_to_remove = NULL, keep_edge = NULL;
>if (EDGE_SUCC (cond_block, 0)->dest == bb)
>  edge_to_remove = EDGE_SUCC (cond_block, 1);
> -  else
> +  else if (EDGE_SUCC (cond_block, 1)->dest == bb)
>  edge_to_remove = EDGE_SUCC (cond_block, 0);
> -  if (EDGE_COUNT (edge_to_remove->dest->pre

RE: [PATCH] x86: Enable __bf16 type for TARGET_SSE2 and above

2022-08-03 Thread Kong, Lingling via Gcc-patches
Hi,

The old patch had a mistake in `*movbf_internal`; this version disables
BFmode constant double moves in `*movbf_internal`.

Thanks,
Lingling

> -Original Message-
> From: Kong, Lingling 
> Sent: Tuesday, July 26, 2022 9:31 AM
> To: Liu, Hongtao ; gcc-patches@gcc.gnu.org
> Cc: Kong, Lingling 
> Subject: [PATCH] x86: Enable __bf16 type for TARGET_SSE2 and above
> 
> Hi,
> 
> This patch enables the __bf16 scalar type for SSE2 and above, according to
> the psABI (https://gitlab.com/x86-psABIs/x86-64-ABI/-/merge_requests/35/diffs).
> The __bf16 type is a storage-only type, as on Arm.
> 
> OK for master?
> 
> gcc/ChangeLog:
> 
>   * config/i386/i386-builtin-types.def (BFLOAT16): New primitive type.
>   * config/i386/i386-builtins.cc: Support __bf16 type for the i386 backend.
>   (ix86_register_bf16_builtin_type): New function.
>   (ix86_bf16_type_node): New.
>   (ix86_bf16_ptr_type_node): Ditto.
>   (ix86_init_builtin_types): Add ix86_register_bf16_builtin_type function
> call.
>   * config/i386/i386-modes.def (FLOAT_MODE): Add BFmode.
>   (ADJUST_FLOAT_FORMAT): Ditto.
>   * config/i386/i386.cc (merge_classes): Handle BFmode.
>   (classify_argument): Ditto.
>   (examine_argument): Ditto.
>   (construct_container): Ditto.
>   (function_value_32): Return __bf16 by %xmm0.
>   (function_value_64): Return __bf16 by SSE register.
>   (ix86_print_operand): Handle CONST_DOUBLE BFmode.
>   (ix86_secondary_reload): Require gpr as intermediate register
>   to store __bf16 from sse register when sse4 is not available.
>   (ix86_scalar_mode_supported_p): Enable __bf16 under sse2.
>   (ix86_mangle_type): Add mangling for __bf16 type.
>   (ix86_invalid_conversion): New function for target hook.
>   (ix86_invalid_unary_op): Ditto.
>   (ix86_invalid_binary_op): Ditto.
>   (TARGET_INVALID_CONVERSION): New define for target hook.
>   (TARGET_INVALID_UNARY_OP): Ditto.
>   (TARGET_INVALID_BINARY_OP): Ditto.
>   * config/i386/i386.h (host_detect_local_cpu): Add BFmode.
>   * config/i386/i386.md (*pushhf_rex64): Change for BFmode.
>   (*push_rex64): Ditto.
>   (*pushhf): Ditto.
>   (*push): Ditto.
>   (*movhf_internal): Ditto.
>   (*mov_internal): Ditto.
> 
> gcc/testsuite/ChangeLog:
> 
>   * g++.target/i386/bfloat_cpp_typecheck.C: New test.
>   * gcc.target/i386/bfloat16-1.c: Ditto.
>   * gcc.target/i386/sse2-bfloat16-1.c: Ditto.
>   * gcc.target/i386/sse2-bfloat16-2.c: Ditto.
>   * gcc.target/i386/sse2-bfloat16-scalar-typecheck.c: Ditto.
> ---
>  gcc/config/i386/i386-builtin-types.def|   1 +
>  gcc/config/i386/i386-builtins.cc  |  21 ++
>  gcc/config/i386/i386-modes.def|   2 +
>  gcc/config/i386/i386.cc   |  75 +-
>  gcc/config/i386/i386.h|   4 +-
>  gcc/config/i386/i386.md   |  32 +--
>  .../g++.target/i386/bfloat_cpp_typecheck.C|  10 +
>  gcc/testsuite/gcc.target/i386/bfloat16-1.c|  12 +
>  .../gcc.target/i386/sse2-bfloat16-1.c |   8 +
>  .../gcc.target/i386/sse2-bfloat16-2.c |  17 ++
>  .../i386/sse2-bfloat16-scalar-typecheck.c | 215 ++
>  11 files changed, 375 insertions(+), 22 deletions(-)
>  create mode 100644 gcc/testsuite/g++.target/i386/bfloat_cpp_typecheck.C
>  create mode 100644 gcc/testsuite/gcc.target/i386/bfloat16-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sse2-bfloat16-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sse2-bfloat16-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sse2-bfloat16-scalar-typecheck.c
> 
> diff --git a/gcc/config/i386/i386-builtin-types.def b/gcc/config/i386/i386-builtin-types.def
> index 7a2da1db0b0..63a360b0f8b 100644
> --- a/gcc/config/i386/i386-builtin-types.def
> +++ b/gcc/config/i386/i386-builtin-types.def
> @@ -69,6 +69,7 @@ DEF_PRIMITIVE_TYPE (UINT16, short_unsigned_type_node)
>  DEF_PRIMITIVE_TYPE (INT64, long_long_integer_type_node)
>  DEF_PRIMITIVE_TYPE (UINT64, long_long_unsigned_type_node)
>  DEF_PRIMITIVE_TYPE (FLOAT16, ix86_float16_type_node)
> +DEF_PRIMITIVE_TYPE (BFLOAT16, ix86_bf16_type_node)
>  DEF_PRIMITIVE_TYPE (FLOAT, float_type_node)
>  DEF_PRIMITIVE_TYPE (DOUBLE, double_type_node)
>  DEF_PRIMITIVE_TYPE (FLOAT80, float80_type_node)
> diff --git a/gcc/config/i386/i386-builtins.cc b/gcc/config/i386/i386-builtins.cc
> index fe7243c3837..6a04fb57e65 100644
> --- a/gcc/config/i386/i386-builtins.cc
> +++ b/gcc/config/i386/i386-builtins.cc
> @@ -126,6 +126,9 @@ BDESC_VERIFYS (IX86_BUILTIN_MAX,  static GTY(()) tree
> ix86_builtin_type_tab[(int) IX86_BT_LAST_CPTR + 1];
> 
>  tree ix86_float16_type_node = NULL_TREE;
> +tree ix86_bf16_type_node = NULL_TREE;
> +tree ix86_bf16_ptr_type_node = NULL_TREE;
> +
>  /* Retrieve an element from the above table, building some of
> the types lazily.  */
> 
> @@ -1366,6 +1369,22 @@ ix86_register_float16_builtin_type (void

Re: [PATCH] IPA: reduce what we dump in normal mode

2022-08-03 Thread Martin Liška
On 8/2/22 18:27, Jan Hubicka wrote:
> If you disable dumping, you can also disable the collection of stats
> which is guarded by if (dump_file) as well.  Otherwise the patch is OK.

Sure, good point!

I'm going to push the following.

Thanks,
Martin

From 7585e5ecb47761516b8f397002819f2c95b8c32e Mon Sep 17 00:00:00 2001
From: Martin Liska 
Date: Wed, 3 Aug 2022 10:53:22 +0200
Subject: [PATCH] profile: do not collect stats unless TDF_DETAILS

gcc/ChangeLog:

	* profile.cc (compute_branch_probabilities): Do not collect
	stats unless TDF_DETAILS.
---
 gcc/profile.cc | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/gcc/profile.cc b/gcc/profile.cc
index 92de821b8bb..96121d60711 100644
--- a/gcc/profile.cc
+++ b/gcc/profile.cc
@@ -753,7 +753,8 @@ compute_branch_probabilities (unsigned cfg_checksum, unsigned lineno_checksum)
 	bb->count = profile_count::from_gcov_type (bb_gcov_count (bb));
 	  else
 	bb->count = profile_count::guessed_zero ();
-	  if (dump_file && bb->index >= 0)
+
+	  if (dump_file && (dump_flags & TDF_DETAILS) && bb->index >= 0)
 	{
 	  double freq1 = cnt.to_sreal_scale (old_entry_cnt).to_double ();
 	  double freq2 = bb->count.to_sreal_scale
@@ -776,8 +777,8 @@ compute_branch_probabilities (unsigned cfg_checksum, unsigned lineno_checksum)
 	  nsum2 += stat.feedback;
 	  fprintf (dump_file,
 		   " Basic block %4i guessed freq: %12.3f"
-		   " cummulative:%6.2f%% "
-		   " feedback freq: %12.3f cummulative:%7.2f%%"
+		   " cumulative:%6.2f%% "
+		   " feedback freq: %12.3f cumulative:%7.2f%%"
 		   " cnt: 10%" PRId64 "\n", stat.bb->index,
 		   stat.guessed,
 		   nsum1 * 100 / sum1,
-- 
2.37.1



Re: [PATCH v2] cselib: add function to check if SET is redundant [PR106187]

2022-08-03 Thread Richard Earnshaw via Gcc-patches




On 03/08/2022 00:36, Jeff Law wrote:



On 8/2/2022 10:06 AM, Richard Earnshaw wrote:



On 01/08/2022 11:38, Richard Earnshaw via Gcc-patches wrote:



On 30/07/2022 20:57, Jeff Law via Gcc-patches wrote:



On 7/29/2022 7:52 AM, Richard Earnshaw via Gcc-patches wrote:
A SET operation that writes memory may have the same value as an 
earlier store, but if the alias sets of the new and earlier store do 
not conflict, then the set is not truly redundant.  This can happen, 
for example, if objects of different types share a stack slot.


To fix this we define a new function in cselib that first checks for
equality and if that is successful then finds the earlier store in the
value history and checks the alias sets.

The routine is used in two places elsewhere in the compiler. Firstly
in cfgcleanup and secondly in postreload.

gcc/ChangeLog:
* alias.h (mems_same_for_tbaa_p): Declare.
* alias.cc (mems_same_for_tbaa_p): New function.
* dse.cc (record_store): Use it instead of open-coding
alias check.
* cselib.h (cselib_redundant_set_p): Declare.
* cselib.cc: Include alias.h
(cselib_redundant_set_p): New function.
* cfgcleanup.cc: (mark_effect): Use cselib_redundant_set_p instead
of rtx_equal_for_cselib_p.
* postreload.c (reload_cse_simplify): Use cselib_redundant_set_p.
(reload_cse_noop_set_p): Delete.
Seems quite reasonable.   The only question I would have would be 
whether or not you considered including the aliasing info into the 
hashing used by cselib.  You'd probably still need the bulk of this 
patch as well since we could presumably still get a hash conflict 
with two stores of the same value to the same location, but with 
different alias sets (it's just much less likely), so perhaps it 
doesn't really buy us anything.


I thought about this, but if the alias set were included in the hash, 
then surely you'd get every alias set in a different value.  Then 
you'd miss the cases where the alias sets do conflict even though 
they are not the same.  Anyway, the values /are/ the same so in some 
circumstances you might want to know that.




Ideally this would include a testcase.  You might be able to turn 
that non-executable reduced case into something useful by scanning 
the post-reload dumps.


I considered this as well, but the testcase I have is far too 
fragile, I think.  The existing test only fails on Arm, only fails on 
11.2 (not 11.3 or gcc-12 onwards), relies on two objects with the 
same value being in distinct alias sets but still assigned to the 
same stack slot and for some copy dance to end up trying to write 
back the original value to the same slot but with a non-conflicting 
set.  And finally, the scheduler has to then try to move a load past 
the non-aliasing store.





To get anywhere close to this I think we'd need something akin to the 
gimple reader but for RTL so that we could set up all the conditions 
for the failure without the risk of an earlier transform blowing the 
test away.


I wasn't aware of the rtl reader already in the compiler.  But it 
doesn't really get me any closer as it is lacking in so many regards:


- It can't handle (const_double:SF ...) - it tries to handle the 
argument as an int.  This is a consequence, I think, of the reader 
being based on that for reading machine descriptions where FP 
const_double is simply never encountered.


- It doesn't seem to handle anything much more than very basic types, 
and in particular appears to have no way of ensuring that alias sets 
match up with the type system.




I even considered whether we could start with a gimple dump and 
bypassing all the tree/gimple transformations, but even that would be 
still at the mercy of the stack-slot allocation algorithm.


I spent a while trying to get some gimple out of the dumpers in a form 
that was usable, but that's pretty much a non-starter.  To make it 
work we'd need to add support for gimple clobbers on objects - without 
that there's no way to get the stack-slot sharing code to work. 
Furthermore, even feeding fully-optimized gimple directly into expand 
is such a long way from the postreload pass, that I can't believe the 
testcase would remain stable for long.


And the other major issue is that the original testcase is heavily 
templated C++ and neither of the parsers gimple or rtl is supported in 
cc1plus: converting the boilerplate to be C-friendly is probably going 
to be hard.


I can't afford to spend much more time on this, especially given the 
low-quality test we're going to get out of the end of the process.
Understood.  Let's just go with the patch as-is.  That's normal for 
cases where we can't produce a reasonable test.




Thanks, committed to trunk.  Will work on backports if it doesn't throw 
up any issues in the next few days.


R.


jeff


Re: [PATCH, rs6000] TARGET_MADDLD should include TARGET_POWERPC64

2022-08-03 Thread Kewen.Lin via Gcc-patches
Hi Haochen,

on 2022/8/3 16:24, HAO CHEN GUI wrote:
> Hi,
>   This patch changes the definition of TARGET_MADDLD and includes
> TARGET_POWERPC64, since maddld is a 64 bit instruction.
> 
>   maddld-1.c now checks "has_arch_ppc64". It depends on a patch which fixes
> empty TU problem.
> https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598744.html
> 

aha, I'm going to push it if Segher doesn't have further comments.  :)

>   Bootstrapped and tested on powerpc64-linux BE and LE with no regressions.
> Is this okay for trunk? Any recommendations? Thanks a lot.
> 
> ChangeLog
> 2022-08-03  Haochen Gui  
> 
> gcc/
>   * config/rs6000/rs6000.h (TARGET_MADDLD): Define.

May be ": Adjust." or ": Adjust with TARGET_POWERPC64.".

> 
> gcc/testsuite/
>   * gcc.target/powerpc/maddld-1.c: Modify target requirement to compile
>   it on the target which supports 64 bit instructions.
> 

May be ": Add effective target has_arch_ppc64."

OK with these nits adjusted or not.

BR,
Kewen

> 
> patch.diff
> diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
> index 7d04556304a..2f15451fd8b 100644
> --- a/gcc/config/rs6000/rs6000.h
> +++ b/gcc/config/rs6000/rs6000.h
> @@ -466,7 +466,7 @@ extern int rs6000_vector_align[];
>  #define TARGET_FCTIWUZ   TARGET_POPCNTD
>  #define TARGET_CTZ   TARGET_MODULO
>  #define TARGET_EXTSWSLI  (TARGET_MODULO && TARGET_POWERPC64)
> -#define TARGET_MADDLDTARGET_MODULO
> +#define TARGET_MADDLD(TARGET_MODULO && TARGET_POWERPC64)
> 
>  #define TARGET_XSCVDPSPN (TARGET_DIRECT_MOVE || TARGET_P8_VECTOR)
>  #define TARGET_XSCVSPDPN (TARGET_DIRECT_MOVE || TARGET_P8_VECTOR)
> diff --git a/gcc/testsuite/gcc.target/powerpc/maddld-1.c 
> b/gcc/testsuite/gcc.target/powerpc/maddld-1.c
> index 4edecf1c86d..0a53658e058 100644
> --- a/gcc/testsuite/gcc.target/powerpc/maddld-1.c
> +++ b/gcc/testsuite/gcc.target/powerpc/maddld-1.c
> @@ -1,4 +1,4 @@
> -/* { dg-do compile } */
> +/* { dg-do compile { target { has_arch_ppc64 } } } */
>  /* { dg-options "-mdejagnu-cpu=power9 -O2" } */
> 
>  /* This file tests the maddld instruction can be used in SI mode


[PATCH] backwards threader costing TLC and a fix

2022-08-03 Thread Richard Biener via Gcc-patches
The previous change to the backwards threader costing contained a
mistake that can make us reject a path based on size while the
full thread path is not yet known: the full path would be considered
hot, but the partial path is not yet.

Instead of adding another narrow fix for this particular issue I
took the approach of making sure profitable_path_p is only called
for the full path we finally consider threading, and split out the
size computation part, which is useful for limiting the greedy search
early.  That simplifies profitable_path_p.

Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.

Any objections or comments?

Thanks,
Richard.

* tree-ssa-threadbackward.cc
(back_threader::m_path_insns): New.
(back_threader::back_threader): Initialize m_path_insns.
(back_threader_profitability::profitable_path_p): Take
n_insns, do not take name as argument.  taken and
irreducible_loop are now always passed and not NULL.
Split out path size computation to ...
(back_threader::find_paths_to_names): ... the greedy
search, applying the number of block and size limits here.
Do not call profitable_path_p before calling
maybe_register_path.
(back_threader::maybe_register_path): Pass m_path_insns
rather than m_name to profitable_path_p.
---
 gcc/tree-ssa-threadbackward.cc | 337 -
 1 file changed, 160 insertions(+), 177 deletions(-)

diff --git a/gcc/tree-ssa-threadbackward.cc b/gcc/tree-ssa-threadbackward.cc
index ba114e98a41..3fb05c4c75b 100644
--- a/gcc/tree-ssa-threadbackward.cc
+++ b/gcc/tree-ssa-threadbackward.cc
@@ -64,8 +64,8 @@ public:
   back_threader_profitability (bool speed_p)
 : m_speed_p (speed_p)
   { }
-  bool profitable_path_p (const vec &, tree name, edge taken,
- bool *irreducible_loop = NULL);
+  bool profitable_path_p (const vec &, int n_insns,
+ edge taken, bool *irreducible_loop);
 private:
   const bool m_speed_p;
 };
@@ -104,6 +104,7 @@ private:
 
   // Current path being analyzed.
   auto_vec m_path;
+  int m_path_insns;
   // Hash to mark visited BBs while analyzing a path.
   hash_set m_visited_bbs;
   // The set of SSA names, any of which could potentially change the
@@ -133,6 +134,7 @@ const edge back_threader::UNREACHABLE_EDGE = (edge) -1;
 
 back_threader::back_threader (function *fun, unsigned flags, bool first)
   : m_profit (flags & BT_SPEED),
+m_path_insns (0),
 m_first (first)
 {
   if (flags & BT_SPEED)
@@ -242,8 +244,8 @@ back_threader::maybe_register_path ()
   else
{
  bool irreducible = false;
- if (m_profit.profitable_path_p (m_path, m_name, taken_edge,
- &irreducible)
+ if (m_profit.profitable_path_p (m_path, m_path_insns,
+ taken_edge, &irreducible)
  && debug_counter ())
{
  m_registry.register_path (m_path, taken_edge);
@@ -411,58 +413,142 @@ back_threader::find_paths_to_names (basic_block bb, 
bitmap interesting)
   if (m_visited_bbs.add (bb))
 return;
 
-  m_path.safe_push (bb);
-
-  // Try to resolve the path without looking back.
-  if (m_path.length () > 1
-  && (!m_profit.profitable_path_p (m_path, m_name, NULL)
- || maybe_register_path ()))
+  if (m_path.length () >= (unsigned) param_max_fsm_thread_length)
 {
-  m_path.pop ();
-  m_visited_bbs.remove (bb);
+  if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file, "  FAIL: Jump-thread path not considered: "
+"the number of basic blocks on the path "
+"exceeds PARAM_MAX_FSM_THREAD_LENGTH.\n");
   return;
 }
 
-  auto_bitmap processed;
-  bool done = false;
-  // We use a worklist instead of iterating through the bitmap,
-  // because we may add new items in-flight.
-  auto_vec worklist (bitmap_count_bits (interesting));
-  populate_worklist (worklist, interesting);
-  while (!worklist.is_empty ())
+  int old_path_insns = m_path_insns;
+  /* When pushing BB we have to add the previous block size since that is
+ now part of the thread.  When this is the first block do not account
+ for the control stmt at the end since that is going to be elided.
+ ???  It might be worth caching the basic-block sizes since we can
+ end up visiting it more than once.  Slight complication arises from
+ the special-casing of m_name.  */
+  if (!m_path.is_empty ())
 {
-  tree name = worklist.pop ();
-  unsigned i = SSA_NAME_VERSION (name);
-  gimple *def_stmt = SSA_NAME_DEF_STMT (name);
-  basic_block def_bb = gimple_bb (def_stmt);
-
-  // Process any PHIs defined in this block.
-  if (def_bb == bb
- && bitmap_set_bit (processed, i)
- && gimple_code (def_stmt) == GIMPLE_PHI)
+  basic_block new_bb = m_path.last ();
+  int new_bb_insn

[PATCH] Backwards threader greedy search TLC

2022-08-03 Thread Richard Biener via Gcc-patches
I've tried to understand how the greedy search works seeing the
bitmap dances and the split into resolve_phi.  I've summarized
the intent of the algorithm as

  // For further greedy searching we want to remove interesting
  // names defined in BB but add ones on the PHI edges for the
  // respective edges.

but the implementation differs in detail.  In particular, when there
is more than one interesting PHI in BB, it seems to consider only the
first when translating defs across edges.  It also applies the loop
crossing restriction only when there is an interesting PHI.
I've also noticed that while the set of interesting names is rolled
back, m_imports just keeps growing - is that a bug?

The following preserves the loop crossing restriction for the case
of interesting PHIs but merges resolve_phi back, changing interesting
as outlined in the intent above.  It should get more threading
cases when there are multiple interesting PHI defs in a block.
It might be a bit faster due to less bitmap operations but in the
end the main intent was to make what happens more obvious.
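To make the intent concrete, here is a hypothetical mini-model (not the
actual GCC code, which works on bitmaps of SSA versions): when extending
a path backwards from BB across incoming edge E, names defined by PHIs
in BB stop being interesting and the PHI arguments flowing in along E
become interesting instead.

```cpp
#include <set>
#include <utility>
#include <vector>

// Illustrative stand-in for a PHI: RESULT is the defined name, ARGS maps
// an incoming edge index to the name flowing in along that edge.
struct phi { int result; std::vector<std::pair<int, int>> args; };

std::set<int>
translate_across_edge (const std::set<int> &interesting,
		       const std::vector<phi> &phis, int edge)
{
  std::set<int> out = interesting;
  for (const phi &p : phis)
    if (out.erase (p.result))  // defined in BB: no longer interesting
      for (auto [e, name] : p.args)
	if (e == edge)
	  out.insert (name);   // but its arg on this edge now is
  return out;
}
```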

Bootstrap and regtest pending on x86_64-unknown-linux-gnu.

Aldy - you wrote the existing implementation, is the following OK?
Is preserving the grown m_imports intended?  I see it's eventually
used by find_taken_edge_* when finalizing a path passed as
m_solver->compute_ranges (path, m_imports), not sure what the
effect is if we keep growing m_imports here?

Any other comments?

Thanks,
Richard.

* tree-ssa-threadbackward.cc (populate_worklist): Remove.
(back_threader::resolve_phi): Likewise.
(back_threader::find_paths_to_names): Rewrite greedy search.
---
 gcc/tree-ssa-threadbackward.cc | 146 +++--
 1 file changed, 50 insertions(+), 96 deletions(-)

diff --git a/gcc/tree-ssa-threadbackward.cc b/gcc/tree-ssa-threadbackward.cc
index 3fb05c4c75b..3adced6a9c9 100644
--- a/gcc/tree-ssa-threadbackward.cc
+++ b/gcc/tree-ssa-threadbackward.cc
@@ -91,7 +91,6 @@ private:
   edge maybe_register_path ();
   void maybe_register_path_dump (edge taken_edge);
   void find_paths_to_names (basic_block bb, bitmap imports);
-  void resolve_phi (gphi *phi, bitmap imports);
   edge find_taken_edge (const vec &path);
   edge find_taken_edge_cond (const vec &path, gcond *);
   edge find_taken_edge_switch (const vec &path, gswitch *);
@@ -337,71 +336,6 @@ back_threader::find_taken_edge_cond (const 
vec &path,
   return NULL;
 }
 
-// Populate a vector of trees from a bitmap.
-
-static inline void
-populate_worklist (vec &worklist, bitmap bits)
-{
-  bitmap_iterator bi;
-  unsigned i;
-
-  EXECUTE_IF_SET_IN_BITMAP (bits, 0, i, bi)
-{
-  tree name = ssa_name (i);
-  worklist.quick_push (name);
-}
-}
-
-// Find jump threading paths that go through a PHI.
-
-void
-back_threader::resolve_phi (gphi *phi, bitmap interesting)
-{
-  if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (gimple_phi_result (phi)))
-return;
-
-  for (size_t i = 0; i < gimple_phi_num_args (phi); ++i)
-{
-  edge e = gimple_phi_arg_edge (phi, i);
-
-  // This is like path_crosses_loops in profitable_path_p but more
-  // restrictive to avoid peeling off loop iterations (see
-  // tree-ssa/pr14341.c for an example).
-  bool profitable_p = m_path[0]->loop_father == e->src->loop_father;
-  if (!profitable_p)
-   {
- if (dump_file && (dump_flags & TDF_DETAILS))
-   {
- fprintf (dump_file,
-  "  FAIL: path through PHI in bb%d (incoming bb:%d) 
crosses loop\n",
-  e->dest->index, e->src->index);
- fprintf (dump_file, "path: %d->", e->src->index);
- dump_path (dump_file, m_path);
- fprintf (dump_file, "->xx REJECTED\n");
-   }
- continue;
-   }
-
-  tree arg = gimple_phi_arg_def (phi, i);
-  unsigned v = 0;
-
-  if (TREE_CODE (arg) == SSA_NAME
- && !bitmap_bit_p (interesting, SSA_NAME_VERSION (arg)))
-   {
- // Record that ARG is interesting when searching down this path.
- v = SSA_NAME_VERSION (arg);
- gcc_checking_assert (v != 0);
- bitmap_set_bit (interesting, v);
- bitmap_set_bit (m_imports, v);
-   }
-
-  find_paths_to_names (e->src, interesting);
-
-  if (v)
-   bitmap_clear_bit (interesting, v);
-}
-}
-
 // Find jump threading paths to any of the SSA names in the
 // INTERESTING bitmap, and register any such paths.
 //
@@ -505,45 +439,65 @@ back_threader::find_paths_to_names (basic_block bb, 
bitmap interesting)
 
   else
 {
-  auto_bitmap processed;
-  bool done = false;
-  // We use a worklist instead of iterating through the bitmap,
-  // because we may add new items in-flight.
-  auto_vec worklist (bitmap_count_bits (interesting));
-  populate_worklist (worklist, interesting);
-  while (!worklist.is_empty ())
+  // For further greedy searching we want to rem

[PATCH] RISC-V: Avoid redundant sign-extension for SImode SGE, SGEU, SLE, SLEU

2022-08-03 Thread Maciej W. Rozycki
We produce inefficient code for some synthesized SImode conditional set 
operations (i.e. ones that are not directly implemented in hardware) on 
RV64.  For example a piece of C code like this:

int
sleu (unsigned int x, unsigned int y)
{
  return x <= y;
}

gets compiled (at `-O2') to this:

sleu:
sgtua0,a0,a1# 9 [c=4 l=4]  *sgtu_disi
xoria0,a0,1 # 10[c=4 l=4]  *xorsi3_internal/1
sext.w  a0,a0   # 16[c=4 l=4]  extendsidi2/0
ret # 25[c=0 l=4]  simple_return

This is because the middle end expands a SLEU operation missing from 
RISC-V hardware into a sequence of a SImode SGTU operation followed by 
an explicit SImode XORI operation with immediate 1.  And while the SGTU 
machine instruction (alias SLTU with the input operands swapped) gives a 
properly sign-extended 32-bit result, which is valid both as a SImode and 
a DImode operand, the middle end does not see that through a SImode XORI 
operation, because we tell the middle end that the RISC-V target (unlike 
MIPS) may hold values in DImode integer registers that are valid for 
SImode operations even if not properly sign-extended.

However the RISC-V psABI requires that 32-bit function arguments and 
results passed in 64-bit integer registers be properly sign-extended, so 
this is explicitly done at the conclusion of the function.

Fix this by making the backend use a sequence of a DImode SGTU operation 
followed by a SImode SEQZ operation instead.  The latter operation is 
known by the middle end to produce a properly sign-extended 32-bit 
result and therefore combine gets rid of the sign-extension operation 
that follows and actually folds it into the very same XORI machine 
operation resulting in:

sleu:
sgtua0,a0,a1# 9 [c=4 l=4]  *sgtu_didi
xoria0,a0,1 # 16[c=4 l=4]  xordi3/1
ret # 25[c=0 l=4]  simple_return

instead (although the SEQZ alias SLTIU against immediate 1 machine 
instruction would equally do and is actually retained at `-O0').  This 
is handled analogously for the remaining synthesized operations of this 
kind, i.e. `SLE', `SGEU', and `SGE'.

gcc/
* config/riscv/riscv.cc (riscv_emit_int_order_test): Use EQ 0 
rather than XOR 1 for LE and LEU operations. 

gcc/testsuite/
* gcc.target/riscv/sge.c: New test.
* gcc.target/riscv/sgeu.c: New test.
* gcc.target/riscv/sle.c: New test.
* gcc.target/riscv/sleu.c: New test.
---
Hi,

 Regression-tested with the `riscv64-linux-gnu' target.  OK to apply?

  Maciej
---
 gcc/config/riscv/riscv.cc |4 ++--
 gcc/testsuite/gcc.target/riscv/sge.c  |   11 +++
 gcc/testsuite/gcc.target/riscv/sgeu.c |   11 +++
 gcc/testsuite/gcc.target/riscv/sle.c  |   11 +++
 gcc/testsuite/gcc.target/riscv/sleu.c |   11 +++
 5 files changed, 46 insertions(+), 2 deletions(-)

gcc-riscv-int-order-inv-seqz.diff
Index: gcc/gcc/config/riscv/riscv.cc
===
--- gcc.orig/gcc/config/riscv/riscv.cc
+++ gcc/gcc/config/riscv/riscv.cc
@@ -2500,9 +2500,9 @@ riscv_emit_int_order_test (enum rtx_code
}
   else if (invert_ptr == 0)
{
- rtx inv_target = riscv_force_binary (GET_MODE (target),
+ rtx inv_target = riscv_force_binary (word_mode,
   inv_code, cmp0, cmp1);
- riscv_emit_binary (XOR, target, inv_target, const1_rtx);
+ riscv_emit_binary (EQ, target, inv_target, const0_rtx);
}
   else
{
Index: gcc/gcc/testsuite/gcc.target/riscv/sge.c
===
--- /dev/null
+++ gcc/gcc/testsuite/gcc.target/riscv/sge.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target rv64 } */
+/* { dg-skip-if "" { *-*-* } { "-O0" } } */
+
+int
+sge (int x, int y)
+{
+  return x >= y;
+}
+
+/* { dg-final { scan-assembler-not "sext\\.w" } } */
Index: gcc/gcc/testsuite/gcc.target/riscv/sgeu.c
===
--- /dev/null
+++ gcc/gcc/testsuite/gcc.target/riscv/sgeu.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target rv64 } */
+/* { dg-skip-if "" { *-*-* } { "-O0" } } */
+
+int
+sgeu (unsigned int x, unsigned int y)
+{
+  return x >= y;
+}
+
+/* { dg-final { scan-assembler-not "sext\\.w" } } */
Index: gcc/gcc/testsuite/gcc.target/riscv/sle.c
===
--- /dev/null
+++ gcc/gcc/testsuite/gcc.target/riscv/sle.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target rv64 } */
+/* { dg-skip-if "" { *-*-* } { "-O0" } } */
+
+int
+sle (int x, int y)
+{
+  return x <= y;
+}
+
+/* { dg-final { scan-assembler-not "sext\\.w" } } */
Index: gcc/gcc/testsuite/gcc.target/riscv/sleu.c
=

Re: [PATCH] Properly honor param_max_fsm_thread_path_insns in backwards threader

2022-08-03 Thread Richard Biener via Gcc-patches
On Tue, 2 Aug 2022, Jan Hubicka wrote:

> > On Tue, 2 Aug 2022, Aldy Hernandez wrote:
> > 
> > > On Tue, Aug 2, 2022 at 1:45 PM Richard Biener  wrote:
> > > >
> > > > On Tue, 2 Aug 2022, Aldy Hernandez wrote:
> > > >
> > > > > Unfortunately, this was before my time, so I don't know.
> > > > >
> > > > > That being said, thanks for tackling these issues that my work
> > > > > triggered last release.  Much appreciated.
> > > >
> > > > Ah.  But it was your r12-324-g69e5544210e3c0 that did
> > > >
> > > > -  else if (n_insns > 1)
> > > > +  else if (!m_speed_p && n_insns > 1)
> > > >
> > > > causing the breakage on the 12 branch.  That leads to a simpler
> > > > fix I guess.  Will re-test and also backport to GCC 12 if successful.
> > > 
> > > Huh.  It's been a while, but that looks like a typo.  That patch was
> > > supposed to be non-behavior changing.
> > 
> > Exactly my thinking so reverting it shouldn't be a reason for
> > detailed questions.  Now, the contains_hot_bb computation is,
> > that one was introduced by Honza in r7-6476-g0f0c2cc3a17efa
> > together with the comment and a testcase.
> > 
> > So - Honza, what was the reasoning to look at raw BB counts here
> > rather than for example the path entry edge count?
> I think the explanation is in the final comment:
>   /* Threading is profitable if the path duplicated is hot but also
>  in a case we separate cold path from hot path and permit optimization
>  of the hot path later.  Be on the aggressive side here.  In some testcases,
>  as in PR 78407 this leads to noticeable improvements.  */
> 
> If you have non-threadable hot path threading out cold paths will make
> it easier to be optimized since you have fewer meets in the dataflow.

I see.  It does seem that it would be better handled by tracing the
hot path but of course threading the cold path might be cheaper.

That said, the cost modeling checks hotness of the threaded path
by looking at its exit edge (the one we can optimize statically)
but shouldn't hotness of the threaded path be determined by
looking at the path entry edge count instead?  With inconsistent
guessed profile (inconsistent now that we can statically compute
the outgoing branch on one of the paths) it's of course all GIGO, but ...

I'll leave this alone for now.  Still, we now have the exceptions of
predicted never-executed entry/exit to avoid diagnostic fallout.

Richard.


[PATCH] PR106342 - IBM zSystems: Provide vsel for all vector modes

2022-08-03 Thread Ilya Leoshkevich via Gcc-patches
Bootstrapped and regtested on s390x-redhat-linux.  Ok for master?



dg.exp=pr104612.c fails with an ICE on s390x, because copysignv2sf3
produces an insn that vsel is supposed to recognize, but can't,
because it's not defined for V2SF.  Fix by defining it for all vector
modes supported by copysign3.
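A minimal reproducer sketch, under assumptions: a two-lane loop like the
one below is the kind of code the vectorizer turns into a V2SF copysign,
which then expands through vsel.  (The actual pr104612.c testcase may
differ in detail.)

```cpp
#include <cmath>

// Hypothetical kernel: with vectorization on s390x this becomes a V2SF
// copysign, whose expansion needs a V2SF vsel pattern.
void copysign_v2sf (float *dst, const float *a, const float *b)
{
  for (int i = 0; i < 2; i++)
    dst[i] = std::copysign (a[i], b[i]);
}
```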

gcc/ChangeLog:

* config/s390/vector.md (V_HW_FT): New iterator.
* config/s390/vx-builtins.md (vsel): Use V_HW_FT instead of
V_HW.
---
 gcc/config/s390/vector.md  |  6 ++
 gcc/config/s390/vx-builtins.md | 12 ++--
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/gcc/config/s390/vector.md b/gcc/config/s390/vector.md
index a6c4b4eb974..624729814af 100644
--- a/gcc/config/s390/vector.md
+++ b/gcc/config/s390/vector.md
@@ -63,6 +63,12 @@
   V1DF V2DF
   (V1TF "TARGET_VXE") (TF "TARGET_VXE")])
 
+; All modes present in V_HW and VFT.
+(define_mode_iterator V_HW_FT [V16QI V8HI V4SI V2DI (V1TI "TARGET_VXE") V1DF
+  V2DF (V1SF "TARGET_VXE") (V2SF "TARGET_VXE")
+  (V4SF "TARGET_VXE") (V1TF "TARGET_VXE")
+  (TF "TARGET_VXE")])
+
 ; FP vector modes directly supported by the HW.  This does not include
 ; vector modes using only part of a vector register and should be used
 ; for instructions which might trigger IEEE exceptions.
diff --git a/gcc/config/s390/vx-builtins.md b/gcc/config/s390/vx-builtins.md
index d5130799804..98ee08b2683 100644
--- a/gcc/config/s390/vx-builtins.md
+++ b/gcc/config/s390/vx-builtins.md
@@ -517,12 +517,12 @@
 ; swapped in s390-c.cc when we get here.
 
 (define_insn "vsel"
-  [(set (match_operand:V_HW  0 "register_operand" "=v")
-   (ior:V_HW
-(and:V_HW (match_operand:V_HW   1 "register_operand"  "v")
-  (match_operand:V_HW   3 "register_operand"  "v"))
-(and:V_HW (not:V_HW (match_dup 3))
-  (match_operand:V_HW   2 "register_operand"  "v"]
+  [(set (match_operand:V_HW_FT   0 "register_operand" "=v")
+   (ior:V_HW_FT
+(and:V_HW_FT (match_operand:V_HW_FT 1 "register_operand"  "v")
+ (match_operand:V_HW_FT 3 "register_operand"  "v"))
+(and:V_HW_FT (not:V_HW_FT (match_dup 3))
+ (match_operand:V_HW_FT 2 "register_operand"  "v"]
   "TARGET_VX"
   "vsel\t%v0,%1,%2,%3"
   [(set_attr "op_type" "VRR")])
-- 
2.35.3



Re: [PATCH] lower-subreg, expr: Mitigate inefficiencies derived from "(clobber (reg X))" followed by "(set (subreg (reg X)) (...))"

2022-08-03 Thread Takayuki 'January June' Suwa via Gcc-patches
Thanks for your response.

On 2022/08/03 16:52, Richard Sandiford wrote:
> Takayuki 'January June' Suwa via Gcc-patches  writes:
>> Emitting "(clobber (reg X))" before "(set (subreg (reg X)) (...))" keeps
>> data flow consistent, but it also increases register allocation pressure
>> and thus often creates many unwanted register-to-register moves that
>> cannot be optimized away.
> 
> There are two things here:
> 
> - If emit_move_complex_parts emits a clobber of a hard register,
>   then that's probably a bug/misfeature.  The point of the clobber is
>   to indicate that the register has no useful contents.  That's useful
>   for wide pseudos that are written to in parts, since it avoids the
>   need to track the liveness of each part of the pseudo individually.
>   But it shouldn't be necessary for hard registers, since subregs of
>   hard registers are simplified to hard registers wherever possible
>   (which on most targets is "always").
> 
>   So I think the emit_move_complex_parts clobber should be restricted
>   to !HARD_REGISTER_P, like the lower-subreg clobber is.  If that helps
>   (if only partly) then it would be worth doing as its own patch.
> 
> - I think it'd be worth looking into more detail why a clobber makes
>   a difference to register pressure.  A clobber of a pseudo register R
>   shouldn't make R conflict with things that are live at the point of
>   the clobber.

I agree that is worth investigating.
In fact, leaving other ports aside, on the xtensa port RA is terribly bad in 
code with frequent D[FC]mode pseudos.
For example, in __muldc3 in libgcc2, the size of the reserved stack frame 
almost doubles depending on whether or not this patch is applied.

> 
>>  It seems just analogous to partial register
>> stall which is a famous problem on processors that do register renaming.
>>
>> In my opinion, when the register to be clobbered is a composite of hard
>> ones, we should clobber the individual elements separately, otherwise
>> clear the entire to zero prior to use as the "init-regs" pass does (like
>> partial register stall workarounds on x86 CPUs).  Such redundant zero
>> constant assignments will be removed later in the "cprop_hardreg" pass.
> 
> I don't think we should rely on the zero being optimised away later.
> 
> Emitting the zero also makes it harder for the register allocator
> to elide the move.  For example, if we have:
> 
>   (set (subreg:SI (reg:DI P) 0) (reg:SI R0))
>   (set (subreg:SI (reg:DI P) 4) (reg:SI R1))
> 
> then there is at least a chance that the RA could assign hard registers
> R0:R1 to P, which would turn the moves into nops.  If we emit:
> 
>   (set (reg:DI P) (const_int 0))
> 
> beforehand then that becomes impossible, since R0 and R1 would then
> conflict with P.

Ah, certainly; as you pointed out, that holds for targets where "(reg:DI)" 
corresponds to one hard register.

> 
> TBH I'm surprised we still run init_regs for LRA.  I thought there was
> a plan to stop doing that, but perhaps I misremember.

Sorry, I am not sure about the status of LRA, because the xtensa port is still 
using reload.

In conclusion, trying to tweak the common code may have been a bit 
premature.
I'll consider whether I can deal with those issues in the 
target-specific code instead.

> 
> Thanks,
> Richard
> 
>> This patch may give better output code quality for the reasons above,
>> especially on architectures that don't have DFmode hard registers
>> (On architectures with such hard registers, this patch changes virtually
>> nothing).
>>
>> For example (Espressif ESP8266, Xtensa without FP hard regs):
>>
>> /* example */
>> double _Complex conjugate(double _Complex z) {
>>   __imag__(z) *= -1;
>>   return z;
>> }
>>
>> ;; before
>> conjugate:
>> movi.n  a6, -1
>> sllia6, a6, 31
>> mov.n   a8, a2
>> mov.n   a9, a3
>> mov.n   a7, a4
>> xor a6, a5, a6
>> mov.n   a2, a8
>> mov.n   a3, a9
>> mov.n   a4, a7
>> mov.n   a5, a6
>> ret.n
>>
>> ;; after
>> conjugate:
>> movi.n  a6, -1
>> sllia6, a6, 31
>> xor a6, a5, a6
>> mov.n   a5, a6
>> ret.n
>>
>> gcc/ChangeLog:
>>
>>  * lower-subreg.cc (resolve_simple_move):
>>  Add zero clear of the entire register immediately after
>>  the clobber.
>>  * expr.cc (emit_move_complex_parts):
>>  Change to clobber the real and imaginary parts separately
>>  instead of the whole complex register if possible.
>> ---
>>  gcc/expr.cc | 26 --
>>  gcc/lower-subreg.cc |  7 ++-
>>  2 files changed, 26 insertions(+), 7 deletions(-)
>>
>> diff --git a/gcc/expr.cc b/gcc/expr.cc
>> index 80bb1b8a4c5..9732e8fd4e5 100644
>> --- a/gcc/expr.cc
>> +++ b/gcc/expr.cc
>> @@ -3775,15 +3775,29 @@ emit_move_complex_push (machine_mode mode, rtx x, 
>> rtx y)
>>  rtx_insn *
>>  emit_move_complex_parts (rtx x, rtx y)
>>  {
>> -  /* Show the output dies here.

Re: [PATCH] RFC: Extend SLP permutation optimisations

2022-08-03 Thread Richard Biener via Gcc-patches
On Tue, 2 Aug 2022, Richard Sandiford wrote:

> Currently SLP tries to force permute operations "down" the graph
> from loads in the hope of reducing the total number of permutes
> needed or (in the best case) removing the need for the permutes
> entirely.  This patch tries to extend it as follows:
> 
> - Allow loads to take a different permutation from the one they
>   started with, rather than choosing between "original permutation"
>   and "no permutation".
> 
> - Allow changes in both directions, if the target supports the
>   reverse permute operation.
> 
> - Treat the placement of permute operations as a two-way dataflow
>   problem: after propagating information from leaves to roots (as now),
>   propagate information back up the graph.
> 
> - Take execution frequency into account when optimising for speed,
>   so that (for example) permutes inside loops have a higher cost
>   than permutes outside loops.
> 
> - Try to reduce the total number of permutes when optimising for
>   size, even if that increases the number of permutes on a given
>   execution path.
> 
> See the big block comment above vect_optimize_slp_pass for
> a detailed description.
> 
> A while back I posted a patch to extend the existing optimisation
> to non-consecutive loads.  This patch doesn't include that change
> (although it's a possible future addition).
> 
> The original motivation for doing this was to add a framework that would
> allow other layout differences in future.  The two main ones are:
> 
> - Make it easier to represent predicated operations, including
>   predicated operations with gaps.  E.g.:
> 
>  a[0] += 1;
>  a[1] += 1;
>  a[3] += 1;
> 
>   could be a single load/add/store for SVE.  We could handle this
>   by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
>   (depending on what's being counted).  We might need to move
>   elements between lanes at various points, like with permutes.
> 
>   (This would first mean adding support for stores with gaps.)
> 
> - Make it easier to switch between an even/odd and unpermuted layout
>   when switching between wide and narrow elements.  E.g. if a widening
>   operation produces an even vector and an odd vector, we should try
>   to keep operations on the wide elements in that order rather than
>   force them to be permuted back "in order".
> 
> To give some examples of what the current patch does:
> 
> int f1(int *__restrict a, int *__restrict b, int *__restrict c,
>int *__restrict d)
> {
>   a[0] = (b[1] << c[3]) - d[1];
>   a[1] = (b[0] << c[2]) - d[0];
>   a[2] = (b[3] << c[1]) - d[3];
>   a[3] = (b[2] << c[0]) - d[2];
> }
> 
> continues to produce the same code as before when optimising for
> speed: b, c and d are permuted at load time.  But when optimising
> for size we instead permute c into the same order as b+d and then
> permute the result of the arithmetic into the same order as a:
> 
> ldr q1, [x2]
> ldr q0, [x1]
> ext v1.16b, v1.16b, v1.16b, #8 // <--
> sshlv0.4s, v0.4s, v1.4s
> ldr q1, [x3]
> sub v0.4s, v0.4s, v1.4s
> rev64   v0.4s, v0.4s   // <--
> str q0, [x0]
> ret
> 
> The following function:
> 
> int f2(int *__restrict a, int *__restrict b, int *__restrict c,
>int *__restrict d)
> {
>   a[0] = (b[3] << c[3]) - d[3];
>   a[1] = (b[2] << c[2]) - d[2];
>   a[2] = (b[1] << c[1]) - d[1];
>   a[3] = (b[0] << c[0]) - d[0];
> }
> 
> continues to push the reverse down to just before the store,
> like the current code does.
> 
> In:
> 
> int f3(int *__restrict a, int *__restrict b, int *__restrict c,
>int *__restrict d)
> {
>   for (int i = 0; i < 100; ++i)
> {
>   a[0] = (a[0] + c[3]);
>   a[1] = (a[1] + c[2]);
>   a[2] = (a[2] + c[1]);
>   a[3] = (a[3] + c[0]);
>   c += 4;
> }
> }
> 
> the loads of a are hoisted and the stores of a are sunk, so that
> only the load from c happens in the loop.  When optimising for
> speed, we prefer to have the loop operate on the reversed layout,
> changing on entry and exit from the loop:
> 
> mov x3, x0
> adrpx0, .LC0
> add x1, x2, 1600
> ldr q2, [x0, #:lo12:.LC0]
> ldr q0, [x3]
> mov v1.16b, v0.16b
> tbl v0.16b, {v0.16b - v1.16b}, v2.16b// <
> .p2align 3,,7
> .L6:
> ldr q1, [x2], 16
> add v0.4s, v0.4s, v1.4s
> cmp x2, x1
> bne .L6
> mov v1.16b, v0.16b
> adrpx0, .LC0
> ldr q2, [x0, #:lo12:.LC0]
> tbl v0.16b, {v0.16b - v1.16b}, v2.16b// <
> str q0, [x3]
> ret
> 
> Similarly, for the very artificial testcase:
> 
> int f4(int *__restrict a, int *__restrict b, int *__restrict c,
>int *__restrict d)
> {
>   int a0 = a[0];
>   int a1 = a[1];
>   int a2 = a[2];
>   int a3 = a[3];
>   for (i

Re: [PATCH] configure: respect --with-build-time-tools [PR43301]

2022-08-03 Thread Eric Gallager via Gcc-patches
On Tue, Aug 2, 2022 at 11:33 PM Alexandre Oliva  wrote:
>
> On Aug  2, 2022, Eric Gallager  wrote:
>
> > On Tue, Aug 2, 2022 at 1:24 AM Alexandre Oliva  wrote:
>
> >> -elif test -x as$build_exeext; then
> >> +elif test -x as$build_exeext \
> >> +   && { test "x$build_exeext" != "x" \
> >> +|| test "x`grep '^# Invoke as, ld or nm from the build tree' \
> >> + as`" = "x"; }; then
> >>
> >> WDYT?
>
> > Hi, thanks for the feedback; I'm a bit confused, though: where exactly
> > would this proposed change go?
>
> gcc/configure.ac, just before:
>
> # Build using assembler in the current directory.
> gcc_cv_as=./as$build_exeext
>

OK, after reviewing the surrounding code, I think it makes sense; do
you want to commit it, or should I?
Thanks,
Eric

> --
> Alexandre Oliva, happy hackerhttps://FSFLA.org/blogs/lxo/
>Free Software Activist   GNU Toolchain Engineer
> Disinformation flourishes because many people care deeply about injustice
> but very few check the facts.  Ask me about 


[PATCH] middle-end: Allow backend to expand/split double word compare to 0/-1.

2022-08-03 Thread Roger Sayle

This patch to the middle-end's RTL expansion reorders the code in
emit_store_flag_1 so that the backend has more control over how best
to expand/split double word equality/inequality comparisons against
zero or minus one.  With the current implementation, the middle-end
always decides to lower this idiom during RTL expansion using SUBREGs
and word mode instructions, without ever consulting the backend's
machine description.  Hence on x86_64, a TImode comparison against zero
is always expanded as:

(parallel [
  (set (reg:DI 91)
   (ior:DI (subreg:DI (reg:TI 88) 0)
   (subreg:DI (reg:TI 88) 8)))
  (clobber (reg:CC 17 flags))])

(set (reg:CCZ 17 flags)
 (compare:CCZ (reg:DI 91)
  (const_int 0 [0])))

This patch, which makes no changes to the code itself, simply reorders
the clauses in emit_store_flag_1 so that the middle-end first attempts
expansion using the target's doubleword mode cstore optab/expander,
and only if this fails, falls back to lowering to word mode operations.
On x86_64, this allows the expander to produce:

(set (reg:CCZ 17 flags)
 (compare:CCZ (reg:TI 88)
  (const_int 0 [0])))

which is a candidate for scalar-to-vector transformations (and
combine simplifications etc.).  On targets that don't define a cstore
pattern for doubleword integer modes, there should be no change in
behaviour.  For those that do, the current behaviour can be restored
(if desired) by restricting the expander/insn to not apply when the
comparison is EQ or NE, and operand[2] is either const0_rtx or
constm1_rtx.

This change just keeps RTL expansion more consistent (in philosophy).
For other doubleword comparisons, such as with operators LT and GT,
or with constants other than zero or -1, the wishes of the backend
are respected, and only if the optab expansion fails are the default
fall-back implementations using narrower integer mode operations
(and conditional jumps) used.
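For reference, the kind of source pattern affected can be sketched as
follows, on a target like x86_64 where TImode is a doubleword (using the
GCC/Clang __int128 extension):

```cpp
// Equality tests of a doubleword value against zero and against -1: the
// two cases emit_store_flag_1 currently lowers to word_mode operations
// before consulting the backend's cstore expander.
using int128 = __int128;

int is_zero (int128 x)
{
  return x == 0;   // the const0_rtx case
}

int is_minus_one (int128 x)
{
  return x == -1;  // the constm1_rtx case
}
```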

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures. I'm happy to help tweak any backends that notice
a change in their generated code.  Ok for mainline?

2022-08-03  Roger Sayle  

gcc/ChangeLog
* expmed.cc (emit_store_flag_1): Move code to expand double word
equality and inequality against zero or -1, using word operations,
to after trying to use the backend's cstore4 optab/expander.


Thanks in advance,
Roger
--

diff --git a/gcc/expmed.cc b/gcc/expmed.cc
index 9b01b5a..8d7418b 100644
--- a/gcc/expmed.cc
+++ b/gcc/expmed.cc
@@ -5662,63 +5662,9 @@ emit_store_flag_1 (rtx target, enum rtx_code code, rtx 
op0, rtx op1,
   break;
 }
 
-  /* If we are comparing a double-word integer with zero or -1, we can
- convert the comparison into one involving a single word.  */
-  scalar_int_mode int_mode;
-  if (is_int_mode (mode, &int_mode)
-  && GET_MODE_BITSIZE (int_mode) == BITS_PER_WORD * 2
-  && (!MEM_P (op0) || ! MEM_VOLATILE_P (op0)))
-{
-  rtx tem;
-  if ((code == EQ || code == NE)
- && (op1 == const0_rtx || op1 == constm1_rtx))
-   {
- rtx op00, op01;
-
- /* Do a logical OR or AND of the two words and compare the
-result.  */
- op00 = simplify_gen_subreg (word_mode, op0, int_mode, 0);
- op01 = simplify_gen_subreg (word_mode, op0, int_mode, UNITS_PER_WORD);
- tem = expand_binop (word_mode,
- op1 == const0_rtx ? ior_optab : and_optab,
- op00, op01, NULL_RTX, unsignedp,
- OPTAB_DIRECT);
-
- if (tem != 0)
-   tem = emit_store_flag (NULL_RTX, code, tem, op1, word_mode,
-  unsignedp, normalizep);
-   }
-  else if ((code == LT || code == GE) && op1 == const0_rtx)
-   {
- rtx op0h;
-
- /* If testing the sign bit, can just test on high word.  */
- op0h = simplify_gen_subreg (word_mode, op0, int_mode,
- subreg_highpart_offset (word_mode,
- int_mode));
- tem = emit_store_flag (NULL_RTX, code, op0h, op1, word_mode,
-unsignedp, normalizep);
-   }
-  else
-   tem = NULL_RTX;
-
-  if (tem)
-   {
- if (target_mode == VOIDmode || GET_MODE (tem) == target_mode)
-   return tem;
- if (!target)
-   target = gen_reg_rtx (target_mode);
-
- convert_move (target, tem,
-   !val_signbit_known_set_p (word_mode,
- (normalizep ? normalizep
-  : STORE_FLAG_VALUE)));
- return target;
-   }
-}
-
   /* If this is A < 0 or A >= 0, we can do this by taking the ones
  complement of A (for GE) and shifting the sign bi

[committed] libstdc++: Avoid try-catch and O(N) size in std::list::merge for old ABI

2022-08-03 Thread Jonathan Wakely via Gcc-patches
Tested x86_64-linux, pushed to gcc-10 branch.

-- >8 --

The current std::list::merge code calls size() before starting to merge
any elements, so that the _M_size members can be updated after the merge
finishes. The work is done in a try-block so that the sizes can still be
updated in an exception handler if any element comparison throws.

The _M_size members only exist for the cxx11 ABI, so the initial call to
size() and the try-catch are only needed for that ABI. For the old ABI
the size() call performs an O(N) list traversal to get a value that
isn't even used, and catching exceptions just to rethrow them isn't
needed either.

In r11-10123 this code was refactored to use an RAII guard type, but for
the gcc-10 branch a less invasive change using preprocessor conditionals
seems more appropriate.

libstdc++-v3/ChangeLog:

* include/bits/list.tcc (list::merge) [!USE_CXX11_ABI]: Remove
call to size() and try-catch block.
---
 libstdc++-v3/include/bits/list.tcc | 4 
 1 file changed, 4 insertions(+)

diff --git a/libstdc++-v3/include/bits/list.tcc 
b/libstdc++-v3/include/bits/list.tcc
index ce9e983c539..1f27d8b786a 100644
--- a/libstdc++-v3/include/bits/list.tcc
+++ b/libstdc++-v3/include/bits/list.tcc
@@ -406,8 +406,10 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
  iterator __last1 = end();
  iterator __first2 = __x.begin();
  iterator __last2 = __x.end();
+#if _GLIBCXX_USE_CXX11_ABI
  const size_t __orig_size = __x.size();
  __try {
+#endif
while (__first1 != __last1 && __first2 != __last2)
  if (*__first2 < *__first1)
{
@@ -422,6 +424,7 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
 
this->_M_inc_size(__x._M_get_size());
__x._M_set_size(0);
+#if _GLIBCXX_USE_CXX11_ABI
  }
  __catch(...)
{
@@ -430,6 +433,7 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
  __x._M_set_size(__dist);
  __throw_exception_again;
}
+#endif
}
 }
 
-- 
2.37.1



Re: [Patch] OpenMP, libgomp, gimple: omp_get_max_teams, omp_set_num_teams, and omp_{gs}et_teams_thread_limit on offload devices

2022-08-03 Thread Marcel Vollweiler

Hi Jakub,

This patch was reduced a bit and most of your comments were considered in the
last submission of the environment variable syntax extension patch
(https://gcc.gnu.org/pipermail/gcc-patches/2022-August/599175.html). This patch
also builds on that envvar patch version.

The nteams-var related content was moved from this patch to the envvar patch as
that is closely connected. However, additional testing and testing of copying
back device-specific nteams-var ICV values is still included in this patch
together with the teams-thread-limit-var content.


--- a/gcc/gimplify.cc
+++ b/gcc/gimplify.cc
@@ -13994,7 +13994,7 @@ optimize_target_teams (tree target, gimple_seq *pre_p)
struct gimplify_omp_ctx *target_ctx = gimplify_omp_ctxp;

if (teams == NULL_TREE)
-num_teams_upper = integer_one_node;
+num_teams_upper = integer_minus_two_node;


No, please don't introduce this, it is quite costly to have a GC trees
like integer_one_node, so they should stay for the most commonly used
numbers, -2 isn't like that.  Just build_int_cst (integer_type_node, -2).


integer_minus_two_node was replaced by "build_int_cst (integer_type_node, -2)".




--- a/gcc/tree-core.h
+++ b/gcc/tree-core.h
@@ -642,6 +642,7 @@ enum tree_index {
TI_INTEGER_ONE,
TI_INTEGER_THREE,
TI_INTEGER_MINUS_ONE,
+  TI_INTEGER_MINUS_TWO,
TI_NULL_POINTER,

TI_SIZE_ZERO,
diff --git a/gcc/tree.cc b/gcc/tree.cc
index 8f83ea1..8cb474d 100644
--- a/gcc/tree.cc
+++ b/gcc/tree.cc
@@ -9345,6 +9345,7 @@ build_common_tree_nodes (bool signed_char)
integer_one_node = build_int_cst (integer_type_node, 1);
integer_three_node = build_int_cst (integer_type_node, 3);
integer_minus_one_node = build_int_cst (integer_type_node, -1);
+  integer_minus_two_node = build_int_cst (integer_type_node, -2);

size_zero_node = size_int (0);
size_one_node = size_int (1);
diff --git a/gcc/tree.h b/gcc/tree.h
index cea49a5..1aeb009 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -4206,6 +4206,7 @@ tree_strip_any_location_wrapper (tree exp)
  #define integer_one_node   global_trees[TI_INTEGER_ONE]
  #define integer_three_node  global_trees[TI_INTEGER_THREE]
  #define integer_minus_one_node global_trees[TI_INTEGER_MINUS_ONE]
+#define integer_minus_two_node  global_trees[TI_INTEGER_MINUS_TWO]
  #define size_zero_node global_trees[TI_SIZE_ZERO]
  #define size_one_node  global_trees[TI_SIZE_ONE]
  #define bitsize_zero_node  global_trees[TI_BITSIZE_ZERO]


And drop the above 3 hunks.


Removed.




--- a/libgomp/config/gcn/icv-device.c
+++ b/libgomp/config/gcn/icv-device.c
@@ -37,6 +37,7 @@ volatile int GOMP_DEFAULT_DEVICE_VAR;
  volatile int GOMP_MAX_ACTIVE_LEVELS_VAR;
  volatile omp_proc_bind_t GOMP_BIND_VAR;
  volatile int GOMP_NTEAMS_VAR;
+volatile int GOMP_TEAMS_THREAD_LIMIT_VAR;


I really don't like this copying of individual ICVs one by one to the
device, copy a struct containing them and access fields in that struct.


I recently changed this in
https://gcc.gnu.org/pipermail/gcc-patches/2022-August/599175.html. So there is
one struct containing all ICVs that are copied from host to the device and back.




--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -116,6 +116,7 @@ struct addr_pair
  #define GOMP_MAX_ACTIVE_LEVELS_VAR __gomp_max_active_levels
  #define GOMP_BIND_VAR __gomp_bind
  #define GOMP_NTEAMS_VAR __gomp_nteams
+#define GOMP_TEAMS_THREAD_LIMIT_VAR __gomp_teams_thread_limit_var


Likewise here.


Those were all removed.




@@ -527,13 +538,19 @@ struct gomp_icv_list {

  extern void *gomp_get_icv_value_ptr (struct gomp_icv_list **list,
  int device_num);
-extern struct gomp_icv_list *gomp_run_sched_var_dev_list;
-extern struct gomp_icv_list *gomp_run_sched_chunk_size_dev_list;
+extern struct gomp_icv_list* gomp_add_device_specific_icv (int dev_num,
+   size_t size,
+struct gomp_icv_list 
**list);
+extern struct gomp_icv_list *gomp_initial_run_sched_var_dev_list;
+extern struct gomp_icv_list *gomp_initial_run_sched_chunk_size_dev_list;
+extern struct gomp_icv_list *gomp_initial_max_active_levels_var_dev_list;
+extern struct gomp_icv_list *gomp_initial_proc_bind_var_dev_list;
+extern struct gomp_icv_list *gomp_initial_proc_bind_var_list_dev_list;
+extern struct gomp_icv_list *gomp_initial_proc_bind_var_list_len_dev_list;
+extern struct gomp_icv_list *gomp_initial_nteams_var_dev_list;
+
  extern struct gomp_icv_list *gomp_nteams_var_dev_list;
-extern struct gomp_icv_list *gomp_max_active_levels_var_dev_list;
-extern struct gomp_icv_list *gomp_proc_bind_var_dev_list;
-extern struct gomp_icv_list *gomp_proc_bind_var_list_dev_list;
-extern struct gomp_icv_list *gomp_proc_bind_var_list_len_dev_list;
+extern struct gomp_icv_list *gomp_teams_thread_limit_var_dev_list;


Nor these per-var

Re: [PATCH] PR106342 - IBM zSystems: Provide vsel for all vector modes

2022-08-03 Thread Andreas Krebbel via Gcc-patches
On 8/3/22 12:20, Ilya Leoshkevich wrote:
> Bootstrapped and regtested on s390x-redhat-linux.  Ok for master?
> 
> 
> 
> dg.exp=pr104612.c fails with an ICE on s390x, because copysignv2sf3
> produces an insn that vsel is supposed to recognize, but can't,
> because it's not defined for V2SF.  Fix by defining it for all vector
> modes supported by copysign3.
> 
> gcc/ChangeLog:
> 
>   * config/s390/vector.md (V_HW_FT): New iterator.
>   * config/s390/vx-builtins.md (vsel): Use V instead of
>   V_HW.

Ok. There is a typo in the changelog:
"Use *V* instead ..." should probably read "Use V_HW_FT instead ..."

Thanks,

Andreas

> ---
>  gcc/config/s390/vector.md  |  6 ++
>  gcc/config/s390/vx-builtins.md | 12 ++--
>  2 files changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/gcc/config/s390/vector.md b/gcc/config/s390/vector.md
> index a6c4b4eb974..624729814af 100644
> --- a/gcc/config/s390/vector.md
> +++ b/gcc/config/s390/vector.md
> @@ -63,6 +63,12 @@
>  V1DF V2DF
>  (V1TF "TARGET_VXE") (TF "TARGET_VXE")])
>  
> +; All modes present in V_HW and VFT.
> +(define_mode_iterator V_HW_FT [V16QI V8HI V4SI V2DI (V1TI "TARGET_VXE") V1DF
> +V2DF (V1SF "TARGET_VXE") (V2SF "TARGET_VXE")
> +(V4SF "TARGET_VXE") (V1TF "TARGET_VXE")
> +(TF "TARGET_VXE")])
> +
>  ; FP vector modes directly supported by the HW.  This does not include
>  ; vector modes using only part of a vector register and should be used
>  ; for instructions which might trigger IEEE exceptions.
> diff --git a/gcc/config/s390/vx-builtins.md b/gcc/config/s390/vx-builtins.md
> index d5130799804..98ee08b2683 100644
> --- a/gcc/config/s390/vx-builtins.md
> +++ b/gcc/config/s390/vx-builtins.md
> @@ -517,12 +517,12 @@
>  ; swapped in s390-c.cc when we get here.
>  
>  (define_insn "vsel"
> -  [(set (match_operand:V_HW  0 "register_operand" "=v")
> - (ior:V_HW
> -  (and:V_HW (match_operand:V_HW   1 "register_operand"  "v")
> -(match_operand:V_HW   3 "register_operand"  "v"))
> -  (and:V_HW (not:V_HW (match_dup 3))
> -(match_operand:V_HW   2 "register_operand"  "v"]
> +  [(set (match_operand:V_HW_FT   0 "register_operand" "=v")
> + (ior:V_HW_FT
> +  (and:V_HW_FT (match_operand:V_HW_FT 1 "register_operand"  "v")
> +   (match_operand:V_HW_FT 3 "register_operand"  "v"))
> +  (and:V_HW_FT (not:V_HW_FT (match_dup 3))
> +   (match_operand:V_HW_FT 2 "register_operand"  "v"]
>"TARGET_VX"
>"vsel\t%v0,%1,%2,%3"
>[(set_attr "op_type" "VRR")])



Re: [09/23] Add a cut-down version of std::span (array_slice)

2022-08-03 Thread Martin Jambor
Hi Richard,

On Fri, Nov 13 2020, Richard Sandiford via Gcc-patches wrote:
> A later patch wants to be able to pass around subarray views of an
> existing array.  The standard class to do that is std::span, but it's
> a C++20 thing.  This patch just adds a cut-down version of it.

thanks a lot for introducing it.  I hope to use it as a unified view
into something which might be a GC vec, a heap vec or an auto_vec.

But I have one question:

>
> The intention is just to provide what's currently needed.
>
> gcc/
>   * vec.h (array_slice): New class.
> ---
>  gcc/vec.h | 120 ++
>  1 file changed, 120 insertions(+)
>
> diff --git a/gcc/vec.h b/gcc/vec.h
> index f02beddc975..7768de9f518 100644
> --- a/gcc/vec.h
> +++ b/gcc/vec.h
> @@ -2128,6 +2128,126 @@ release_vec_vec (vec > &vec)
>vec.release ();
>  }
>  
> +// Provide a subset of the std::span functionality.  (We can't use std::span
> +// itself because it's a C++20 feature.)
> +//
> +// In addition, provide an invalid value that is distinct from all valid
> +// sequences (including the empty sequence).  This can be used to return
> +// failure without having to use std::optional.
> +//
> +// There is no operator bool because it would be ambiguous whether it is
> +// testing for a valid value or an empty sequence.
> +template
> +class array_slice
> +{
> +  template friend class array_slice;
> +
> +public:
> +  using value_type = T;
> +  using iterator = T *;
> +  using const_iterator = const T *;
> +
> +  array_slice () : m_base (nullptr), m_size (0) {}
> +
> +  template
> +  array_slice (array_slice other)
> +: m_base (other.m_base), m_size (other.m_size) {}
> +
> +  array_slice (iterator base, unsigned int size)
> +: m_base (base), m_size (size) {}
> +
> +  template
> +  array_slice (T (&array)[N]) : m_base (array), m_size (N) {}
> +
> +  template
> +  array_slice (const vec &v)
> +: m_base (v.address ()), m_size (v.length ()) {}
> +

What is the reason for making the parameter const here?

The problem is that if you do for example:

  auto_vec test_base;
  test_base.quick_grow_cleared (10);
  array_slice test(test_base);

the constructor will get a const reference to test_base and so will
invoke the const variant of v.address() which returns a const bool *
which cannot be assigned into non-const qualified base.  AFAICS, the
constructor only works if the array_slice is array_slice.

Is that intentional?  I am not a C++ expert and can be easily
overlooking something.  I understand that users need to be careful not
to cause reallocation of the underlying vector while the array_slice
exists but the const qualifier does not achieve that.  (A wild idea would
be to add an array_slice ref-counter to auto_vec, which seems to be less
space-efficiency-critical than other vecs, and assert on reallocation
when it is not zero, hehe).

Removing the const qualifier in the constructor parameter makes the
error go away - as does adding another constructor without it, which
might be the correct thing to do.

On a related note, would the following constructor be a good addition to
the class (I can make it const too)?

  template
  array_slice (vec *v)
: m_base (v ? v->address () : nullptr), m_size (v ? v->length (): 0) {}


Thanks,

Martin



> +  iterator begin () { return m_base; }
> +  iterator end () { return m_base + m_size; }
> +
> +  const_iterator begin () const { return m_base; }
> +  const_iterator end () const { return m_base + m_size; }
> +
> +  value_type &front ();
> +  value_type &back ();
> +  value_type &operator[] (unsigned int i);
> +
> +  const value_type &front () const;
> +  const value_type &back () const;
> +  const value_type &operator[] (unsigned int i) const;
> +
> +  size_t size () const { return m_size; }
> +  size_t size_bytes () const { return m_size * sizeof (T); }
> +  bool empty () const { return m_size == 0; }
> +
> +  // An invalid array_slice that represents a failed operation.  This is
> +  // distinct from an empty slice, which is a valid result in some contexts.
> +  static array_slice invalid () { return { nullptr, ~0U }; }
> +
> +  // True if the array is valid, false if it is an array like INVALID.
> +  bool is_valid () const { return m_base || m_size == 0; }
> +
> +private:
> +  iterator m_base;
> +  unsigned int m_size;
> +};
> +
> +template
> +inline typename array_slice::value_type &
> +array_slice::front ()
> +{
> +  gcc_checking_assert (m_size);
> +  return m_base[0];
> +}
> +
> +template
> +inline const typename array_slice::value_type &
> +array_slice::front () const
> +{
> +  gcc_checking_assert (m_size);
> +  return m_base[0];
> +}
> +
> +template
> +inline typename array_slice::value_type &
> +array_slice::back ()
> +{
> +  gcc_checking_assert (m_size);
> +  return m_base[m_size - 1];
> +}
> +
> +template
> +inline const typename array_slice::value_type &
> +array_slice::back () const
> +{
> +  gcc_checking_assert (m_size);
> +  return m_base[m_size 

RE: [PATCH 1/2]middle-end: Simplify subtract where both arguments are being bitwise inverted.

2022-08-03 Thread Tamar Christina via Gcc-patches
> -Original Message-
> From: Richard Biener 
> Sent: Tuesday, June 21, 2022 8:43 AM
> To: Tamar Christina 
> Cc: Richard Sandiford ; Richard Biener via Gcc-
> patches ; Richard Guenther
> ; nd 
> Subject: Re: [PATCH 1/2]middle-end: Simplify subtract where both
> arguments are being bitwise inverted.
> 
> On Mon, Jun 20, 2022 at 10:49 AM Tamar Christina
>  wrote:
> >
> > > -Original Message-
> > > From: Richard Sandiford 
> > > Sent: Monday, June 20, 2022 9:19 AM
> > > To: Richard Biener via Gcc-patches 
> > > Cc: Tamar Christina ; Richard Biener
> > > ; Richard Guenther
> ;
> > > nd 
> > > Subject: Re: [PATCH 1/2]middle-end: Simplify subtract where both
> > > arguments are being bitwise inverted.
> > >
> > > Richard Biener via Gcc-patches  writes:
> > > > On Thu, Jun 16, 2022 at 1:10 PM Tamar Christina via Gcc-patches
> > > >  wrote:
> > > >>
> > > >> Hi All,
> > > >>
> > > >> This adds a match.pd rule that drops the bitwwise nots when both
> > > >> arguments to a subtract is inverted. i.e. for:
> > > >>
> > > >> float g(float a, float b)
> > > >> {
> > > >>   return ~(int)a - ~(int)b;
> > > >> }
> > > >>
> > > >> we instead generate
> > > >>
> > > >> float g(float a, float b)
> > > >> {
> > > >>   return (int)a - (int)b;
> > > >> }
> > > >>
> > > >> We already do a limited version of this from the fold_binary fold
> > > >> functions but this makes a more general version in match.pd that
> > > >> applies
> > > more often.
> > > >>
> > > >> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > > >>
> > > >> Ok for master?
> > > >>
> > > >> Thanks,
> > > >> Tamar
> > > >>
> > > >> gcc/ChangeLog:
> > > >>
> > > >> * match.pd: New bit_not rule.
> > > >>
> > > >> gcc/testsuite/ChangeLog:
> > > >>
> > > >> * gcc.dg/subnot.c: New test.
> > > >>
> > > >> --- inline copy of patch --
> > > >> diff --git a/gcc/match.pd b/gcc/match.pd index
> > > >>
> > >
> a59b6778f661cf9121dd3503f43472871e4da445..51b0a1b562409af535e53828a1
> > > 0
> > > >> c30b8a3e1ae2e 100644
> > > >> --- a/gcc/match.pd
> > > >> +++ b/gcc/match.pd
> > > >> @@ -1258,6 +1258,10 @@ DEFINE_INT_AND_FLOAT_ROUND_FN
> (RINT)
> > > >> (simplify
> > > >>   (bit_not (plus:c (bit_not @0) @1))
> > > >>   (minus @0 @1))
> > > >> +/* (~X - ~Y) -> X - Y.  */
> > > >> +(simplify
> > > >> + (minus (bit_not @0) (bit_not @1))  (minus @0 @1))
> > > >
> > > > It doesn't seem correct.
> > > >
> > > > (gdb) p/x ~-1 - ~0x8000
> > > > $3 = 0x8001
> > > > (gdb) p/x -1 - 0x8000
> > > > $4 = 0x7fff
> > > >
> > > > where I was looking for a case exposing undefined integer overflow.
> > >
> > > Yeah, shouldn't it be folding to (minus @1 @0) instead?
> > >
> > >   ~X = (-X - 1)
> > >   -Y = (-Y - 1)
> > >
> > > so:
> > >
> > >   ~X - ~Y = (-X - 1) - (-Y - 1)
> > >   = -X - 1 + Y + 1
> > >   = Y - X
> > >
> >
> > You're right, sorry, I should have paid more attention when I wrote the
> patch.
> 
> You still need to watch out for undefined overflow cases in the result that
> were well-defined in the original expression I think.

The only special thing we do for signed numbers is to do the subtract as
unsigned.  As I mentioned before, GCC already does this transformation as
part of the fold machinery, but that only happens when a very simple tree
is matched and only when single use. i.e. https://godbolt.org/z/EWsdhfrKj

I'm only attempting to make it apply more generally as the result is always 
beneficial.

I've respun the patch to the same as we already do.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* match.pd: New bit_not rule.

gcc/testsuite/ChangeLog:

* gcc.dg/subnot.c: New test.

--- inline copy of patch ---

diff --git a/gcc/match.pd b/gcc/match.pd
index 
330c1db0c8e12b0fb010b1958729444672403866..00b3e07b2a5216b19ed58500923680d83c67d8cf
 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -1308,6 +1308,11 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 (simplify
  (bit_not (plus:c (bit_not @0) @1))
  (minus @0 @1))
+/* (~X - ~Y) -> Y - X.  */
+(simplify
+ (minus (bit_not @0) (bit_not @1))
+  (with { tree utype = unsigned_type_for (type); }
+   (convert (minus (convert:utype @1) (convert:utype @0)
 
 /* ~(X - Y) -> ~X + Y.  */
 (simplify
diff --git a/gcc/testsuite/gcc.dg/subnot.c b/gcc/testsuite/gcc.dg/subnot.c
new file mode 100644
index 
..d621bacd27bd3d19a010e4c9f831aa77d28bd02d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/subnot.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fdump-tree-optimized" } */
+
+float g(float a, float b)
+{
+  return ~(int)a - ~(int)b;
+}
+
+/* { dg-final { scan-tree-dump-not "~" "optimized" } } */


rb15840.patch
Description: rb15840.patch


Re: [09/23] Add a cut-down version of std::span (array_slice)

2022-08-03 Thread Richard Sandiford via Gcc-patches
Martin Jambor  writes:
> Hi Richard,
>
> On Fri, Nov 13 2020, Richard Sandiford via Gcc-patches wrote:
>> A later patch wants to be able to pass around subarray views of an
>> existing array.  The standard class to do that is std::span, but it's
>> a C++20 thing.  This patch just adds a cut-down version of it.
>
> thanks a lot for introducing it.  I hope to use it as a unified view
> into something which might be a GC vec, a heap vec or an auto_vec.
>
> But I have one question:
>
>>
>> The intention is just to provide what's currently needed.
>>
>> gcc/
>>  * vec.h (array_slice): New class.
>> ---
>>  gcc/vec.h | 120 ++
>>  1 file changed, 120 insertions(+)
>>
>> diff --git a/gcc/vec.h b/gcc/vec.h
>> index f02beddc975..7768de9f518 100644
>> --- a/gcc/vec.h
>> +++ b/gcc/vec.h
>> @@ -2128,6 +2128,126 @@ release_vec_vec (vec > &vec)
>>vec.release ();
>>  }
>>  
>> +// Provide a subset of the std::span functionality.  (We can't use std::span
>> +// itself because it's a C++20 feature.)
>> +//
>> +// In addition, provide an invalid value that is distinct from all valid
>> +// sequences (including the empty sequence).  This can be used to return
>> +// failure without having to use std::optional.
>> +//
>> +// There is no operator bool because it would be ambiguous whether it is
>> +// testing for a valid value or an empty sequence.
>> +template
>> +class array_slice
>> +{
>> +  template friend class array_slice;
>> +
>> +public:
>> +  using value_type = T;
>> +  using iterator = T *;
>> +  using const_iterator = const T *;
>> +
>> +  array_slice () : m_base (nullptr), m_size (0) {}
>> +
>> +  template
>> +  array_slice (array_slice other)
>> +: m_base (other.m_base), m_size (other.m_size) {}
>> +
>> +  array_slice (iterator base, unsigned int size)
>> +: m_base (base), m_size (size) {}
>> +
>> +  template
>> +  array_slice (T (&array)[N]) : m_base (array), m_size (N) {}
>> +
>> +  template
>> +  array_slice (const vec &v)
>> +: m_base (v.address ()), m_size (v.length ()) {}
>> +
>
> What is the reason for making the parameter const here?
>
> The problem is that if you do for example:
>
>   auto_vec test_base;
>   test_base.quick_grow_cleared (10);
>   array_slice test(test_base);
>
> the constructor will get a const reference to test_base and so will
> invoke the const variant of v.address() which returns a const bool *
> which cannot be assigned into non-const qualified base.  AFAICS, the
> constructor only works if the array_slice is array_slice.
>
> Is that intentional?  I am not a C++ expert and can be easily
> overlooking something.  I understand that users need to be careful not
> to cause reallocation of the underlying vector while the array_slice
> exists but the const qualifier does not achieve that.  (A wild idea would
> be to add an array_slice ref-counter to auto_vec, which seems to be less
> space-efficiency-critical than other vecs, and assert on reallocation
> when it is not zero, hehe).
>
> Removing the const qualifier in the constructor parameter makes the
> error go away - as does adding another constructor without it, which
> might be the correct thing to do.

Yeah, the latter sounds better to me.  (The existing uses of array_slice
are for const elements, which is why I didn't come across this.)

> On a related note, would the following constructor be a good addition to
> the class (I can make it const too)?
>
>   template
>   array_slice (vec *v)
> : m_base (v ? v->address () : nullptr), m_size (v ? v->length (): 0) {}

LGTM.

Thanks,
Richard

> Thanks,
>
> Martin
>
>
>
>> +  iterator begin () { return m_base; }
>> +  iterator end () { return m_base + m_size; }
>> +
>> +  const_iterator begin () const { return m_base; }
>> +  const_iterator end () const { return m_base + m_size; }
>> +
>> +  value_type &front ();
>> +  value_type &back ();
>> +  value_type &operator[] (unsigned int i);
>> +
>> +  const value_type &front () const;
>> +  const value_type &back () const;
>> +  const value_type &operator[] (unsigned int i) const;
>> +
>> +  size_t size () const { return m_size; }
>> +  size_t size_bytes () const { return m_size * sizeof (T); }
>> +  bool empty () const { return m_size == 0; }
>> +
>> +  // An invalid array_slice that represents a failed operation.  This is
>> +  // distinct from an empty slice, which is a valid result in some contexts.
>> +  static array_slice invalid () { return { nullptr, ~0U }; }
>> +
>> +  // True if the array is valid, false if it is an array like INVALID.
>> +  bool is_valid () const { return m_base || m_size == 0; }
>> +
>> +private:
>> +  iterator m_base;
>> +  unsigned int m_size;
>> +};
>> +
>> +template
>> +inline typename array_slice::value_type &
>> +array_slice::front ()
>> +{
>> +  gcc_checking_assert (m_size);
>> +  return m_base[0];
>> +}
>> +
>> +template
>> +inline const typename array_slice::value_type &
>> +array_slice::front () const
>> +{
>> +  gcc_checking_assert (m_s

[COMMITED] testsuite: btf: fix regexps in btf-int-1.c

2022-08-03 Thread Jose E. Marchesi via Gcc-patches


The regexps in the test btf-int-1.c were not working properly with the
commenting style of at least one target: powerpc64le-linux-gnu.  This
patch changes the test to use better regexps.

Tested in bpf-unknown-none, x86_64-linux-gnu and powerpc64le-linux-gnu.
Pushed to master as obvious.

gcc/testsuite/ChangeLog:

PR testsuite/106515
* gcc.dg/debug/btf/btf-int-1.c: Fix regexps in
scan-assembler-times.
---
 gcc/testsuite/gcc.dg/debug/btf/btf-int-1.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/debug/btf/btf-int-1.c 
b/gcc/testsuite/gcc.dg/debug/btf/btf-int-1.c
index 87d9758e9cb..e1ed198131a 100644
--- a/gcc/testsuite/gcc.dg/debug/btf/btf-int-1.c
+++ b/gcc/testsuite/gcc.dg/debug/btf/btf-int-1.c
@@ -18,10 +18,10 @@
 /* { dg-final { scan-assembler-times "\[\t \]0x100\[\t 
\]+\[^\n\]*btt_info" 9 } } */
 
 /* Check the signed flags, but not bit size. */
-/* { dg-final { scan-assembler-times "\[\t \]0x1..\[\t 
\]+\[^\n\]*bti_encoding" 4 } } */
-/* { dg-final { scan-assembler-times "\[\t \]0x..\[\t \]+\[^\n\]*bti_encoding" 
3 } } */
-/* { dg-final { scan-assembler-times "\[\t \]0x.\[\t \]+\[^\n\]*bti_encoding" 
1 } } */
-/* { dg-final { scan-assembler-times "\[\t \]0x4..\[\t 
\]+\[^\n\]*bti_encoding" 1 } } */
+/* { dg-final { scan-assembler-times "\[\t \]0x1\[0-9a-zA-Z\]{2}\[\t 
\]+\[^\n\]*bti_encoding" 4 } } */
+/* { dg-final { scan-assembler-times "\[\t \]0x\[0-9a-zA-Z\]{2}\[\t 
\]+\[^\n\]*bti_encoding" 3 } } */
+/* { dg-final { scan-assembler-times "\[\t \]0x\[0-9a-zA-Z\]\[\t 
\]+\[^\n\]*bti_encoding" 1 } } */
+/* { dg-final { scan-assembler-times "\[\t \]0x4\[0-9a-zA-Z\]{2}\[\t 
\]+\[^\n\]*bti_encoding" 1 } } */
 
 /* Check that there is a string entry for each type name.  */
 /* { dg-final { scan-assembler-times "ascii \"unsigned char.0\"\[\t 
\]+\[^\n\]*btf_string" 1 } } */
-- 
2.30.2



Re: [PATCH, rs6000] TARGET_MADDLD should include TARGET_POWERPC64

2022-08-03 Thread Segher Boessenkool
Hi!

On Wed, Aug 03, 2022 at 04:24:15PM +0800, HAO CHEN GUI wrote:
>   This patch changes the definition of TARGET_MADDLD to include
> TARGET_POWERPC64, since maddld is a 64-bit instruction.

Hrm.  But the maddld insn is useful for SImode as well, in 32-bit mode,
it is just its name that is a bit confusing then.  Sorry for confusing
things :-(

Add a test for SImode maddld as well?

Please fix things up once again and resend?  Sorry again!


Segher


Re: [PATCH] lower-subreg, expr: Mitigate inefficiencies derived from "(clobber (reg X))" followed by "(set (subreg (reg X)) (...))"

2022-08-03 Thread Jeff Law via Gcc-patches




On 8/3/2022 1:52 AM, Richard Sandiford via Gcc-patches wrote:

Takayuki 'January June' Suwa via Gcc-patches  writes:

Emitting "(clobber (reg X))" before "(set (subreg (reg X)) (...))" keeps
data flow consistent, but it also increases register allocation pressure
and thus often creates many unwanted register-to-register moves that
cannot be optimized away.

There are two things here:

- If emit_move_complex_parts emits a clobber of a hard register,
   then that's probably a bug/misfeature.  The point of the clobber is
   to indicate that the register has no useful contents.  That's useful
   for wide pseudos that are written to in parts, since it avoids the
   need to track the liveness of each part of the pseudo individually.
   But it shouldn't be necessary for hard registers, since subregs of
   hard registers are simplified to hard registers wherever possible
   (which on most targets is "always").

   So I think the emit_move_complex_parts clobber should be restricted
   to !HARD_REGISTER_P, like the lower-subreg clobber is.  If that helps
   (if only partly) then it would be worth doing as its own patch.

Agreed.



- I think it'd be worth looking into more detail why a clobber makes
   a difference to register pressure.  A clobber of a pseudo register R
   shouldn't make R conflict with things that are live at the point of
   the clobber.

Also agreed.




  It seems just analogous to partial register
stall which is a famous problem on processors that do register renaming.

In my opinion, when the register to be clobbered is a composite of hard
ones, we should clobber the individual elements separately, otherwise
clear the entire to zero prior to use as the "init-regs" pass does (like
partial register stall workarounds on x86 CPUs).  Such redundant zero
constant assignments will be removed later in the "cprop_hardreg" pass.

I don't think we should rely on the zero being optimised away later.

Emitting the zero also makes it harder for the register allocator
to elide the move.  For example, if we have:

   (set (subreg:SI (reg:DI P) 0) (reg:SI R0))
   (set (subreg:SI (reg:DI P) 4) (reg:SI R1))

then there is at least a chance that the RA could assign hard registers
R0:R1 to P, which would turn the moves into nops.  If we emit:

   (set (reg:DI P) (const_int 0))

beforehand then that becomes impossible, since R0 and R1 would then
conflict with P.

TBH I'm surprised we still run init_regs for LRA.  I thought there was
a plan to stop doing that, but perhaps I misremember.
I have vague memories of dealing with some of this nonsense a few
release cycles ago.  I don't recall all the details, but init-regs +
lower-subreg + regcprop + splitting all conspired to generate poor code
on the MIPS targets.  See pr87761, though it doesn't include all my
findings -- I can't recall if I walked through the entire tortured
sequence in the gcc-patches discussion or not.


I ended up working around it in the mips backend in conjunction with
some changes to regcprop, IIRC.


Jeff



Re: [PATCH] stack-protector: Check stack canary for noreturn function

2022-08-03 Thread H.J. Lu via Gcc-patches
On Tue, Aug 2, 2022 at 4:34 PM Jeff Law  wrote:
>
>
>
> On 8/2/2022 11:43 AM, H.J. Lu wrote:
> > On Sat, Jul 30, 2022 at 1:30 PM Jeff Law via Gcc-patches
> >  wrote:
> >>
> >>
> >> On 7/14/2022 3:55 PM, H.J. Lu via Gcc-patches wrote:
> >>> Check stack canary for noreturn function to catch stack corruption
> >>> before calling noreturn function.  For C++, check stack canary when
> >>> throwing exception or resuming stack unwind to avoid corrupted stack.
> >>>
> >>> gcc/
> >>>
> >>>PR middle-end/58245
> >>>* calls.cc (expand_call): Check stack canary for noreturn
> >>>function.
> >>>
> >>> gcc/testsuite/
> >>>
> >>>PR middle-end/58245
> >>>* c-c++-common/pr58245-1.c: New test.
> >>>* g++.dg/pr58245-1.C: Likewise.
> >>>* g++.dg/fstack-protector-strong.C: Adjusted.
> >> But is this really something we want?   I'd actually lean towards
> >> eliminating the useless load -- I don't necessarily think we should be
> >> treating non-returning paths specially here.
> >>
> >> The whole point of the stack protector is to prevent the *return* path
> >> from going to an attacker controlled location.  I'm not sure checking
> >> the protector at this point actually does anything particularly useful.
> > throw is marked no return.   Since the unwind library may read
> > the stack contents to unwind the stack, if the stack is corrupted, the
> > exception handling may go wrong.   Should we handle this case?
> That's the question I think we need to answer.  The EH paths are a known
> security issue on Windows and while ours are notably different I'm not
> sure if there's a real attack surface in those paths.  My sense is that
> if we need to tackle this that doing so on the throw side might be
> better as it's closer conceptually to when//how we check the canary for
> a normal return.

Like this?

@@ -3154,7 +3155,10 @@ expand_call (tree exp, rtx target, int ignore)
   if (pass && (flags & ECF_MALLOC))
   start_sequence ();

-  if (pass == 0
+  /* Check the canary value for sibcall or function which doesn't
+   return and could throw.  */
+  if ((pass == 0
+ || ((flags & ECF_NORETURN) != 0 && tree_could_throw_p (exp)))
 && crtl->stack_protect_guard
 && targetm.stack_protect_runtime_enabled_p ())
   stack_protect_epilogue ();

> jeff
> >
> >   --
> > H.J.
>


-- 
H.J.


Ping: [PATCH, V2] Do not enable -mblock-ops-vector-pair.

2022-08-03 Thread Michael Meissner via Gcc-patches
Ping patch.

| Date: Mon, 25 Jul 2022 16:15:05 -0400
| Subject: [PATCH, V2] Do not enable -mblock-ops-vector-pair.
| Message-ID: 

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meiss...@linux.ibm.com


Ping: [PATCH 0/5] IEEE 128-bit built-in overload support.

2022-08-03 Thread Michael Meissner via Gcc-patches
Ping patches.

Patch #1 of 5.
| Date: Thu, 28 Jul 2022 00:47:13 -0400
| Subject: [PATCH 1/5] IEEE 128-bit built-in overload support.
| Message-ID: 

Patch #2 of 5.
| Date: Thu, 28 Jul 2022 00:48:51 -0400
| Subject: [PATCH 2/5] Support IEEE 128-bit overload round_to_odd built-in 
functions.
| Message-ID: 

Patch #3 of 5.
| Date: Thu, 28 Jul 2022 00:50:43 -0400
| Subject: [PATCH 3/5] Support IEEE 128-bit overload comparison built-in 
functions.
| Message-ID: 

Patch #4 of 5.
| Date: Thu, 28 Jul 2022 00:52:38 -0400
| Subject: [PATCH 4/5] Support IEEE 128-bit overload extract and insert 
built-in functions.
| Message-ID: 

Patch #5 of 5.
| Date: Thu, 28 Jul 2022 00:54:15 -0400
| Subject: [PATCH 5/5] Support IEEE 128-bit overload test data built-in 
functions.
| Message-ID: 

-- 
Michael Meissner, IBM
PO Box 98, Ayer, Massachusetts, USA, 01432
email: meiss...@linux.ibm.com


HELP!!How to add a testing case to check a compilation time warning for "cc1"

2022-08-03 Thread Qing Zhao via Gcc-patches
Hi,

My private cc1 issued the following warning:

[opc@qinzhao-ol8u3-x86 gcc]$ sh t
cc1: warning: ‘-fstrict-flex-arrays’ is not supported with a ISO C before C99, 
ignored


I’d like to add a test case for this warning to the gcc.dg directory; however, 
I cannot find a proper testing directive to catch this warning (which is not 
associated with any source line, so “dg-warning” didn’t work).

I checked the online internals doc: 
https://gcc.gnu.org/onlinedocs/gcc-12.1.0/gccint/Test-Directives.html
but cannot find one.

Does anyone know how to do this?

Thanks a lot for your help.

Qing

Re: HELP!!How to add a testing case to check a compilation time warning for "cc1"

2022-08-03 Thread Qing Zhao via Gcc-patches
Never mind, I just found how to do this:

/* { dg-warning "'-fstrict-flex-arrays' is not supported with a ISO C before 
C99, ignored" ""  { target *-*-* } 0 } */

And it worked.

thanks.

Qing
> On Aug 3, 2022, at 2:52 PM, Qing Zhao via Gcc-patches 
>  wrote:
> 
> Hi,
> 
> My private cc1 issued the following warning:
> 
> [opc@qinzhao-ol8u3-x86 gcc]$ sh t
> cc1: warning: ‘-fstrict-flex-arrays’ is not supported with a ISO C before 
> C99, ignored
> 
> 
> I’d like to add a testing case for this warning into gcc.dg directory, 
> however, I cannot find a proper
> testing directive to catch this warning (which is not associated with any 
> source line, so “dg-warning” didn’t work)
> 
> I checked online internal doc: 
> https://gcc.gnu.org/onlinedocs/gcc-12.1.0/gccint/Test-Directives.html
> Cannot find one.
> 
> Does anyone know how to do this?
> 
> Thanks a lot for your help.
> 
> Qing



[PATCH] match.pd: Add bitwise and pattern [PR106243]

2022-08-03 Thread Sam Feifer via Gcc-patches
This patch adds a new optimization to match.pd. The pattern, -x & 1,
now gets simplified to x & 1, reducing the number of instructions
produced.

This patch also adds tests for the optimization rule.

Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?

PR tree-optimization/106243

gcc/ChangeLog:

* match.pd (-x & 1): New simplification.

gcc/testsuite/ChangeLog:

* gcc.dg/pr106243-1.c: New test.
* gcc.dg/pr106243.c: New test.
---
 gcc/match.pd  |  5 
 gcc/testsuite/gcc.dg/pr106243-1.c | 18 +
 gcc/testsuite/gcc.dg/pr106243.c   | 43 +++
 3 files changed, 66 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/pr106243-1.c
 create mode 100644 gcc/testsuite/gcc.dg/pr106243.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 562138a8034..78b32567836 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -8061,3 +8061,8 @@ and,
   (if (TYPE_UNSIGNED (TREE_TYPE (@0)))
 (bit_and @0 @1)
   (cond (le @0 @1) @0 (bit_and @0 @1))
+
+/* -x & 1 -> x & 1.  */
+(simplify 
+  (bit_and:c (negate @0) integer_onep@1)
+  (bit_and @0 @1))
diff --git a/gcc/testsuite/gcc.dg/pr106243-1.c 
b/gcc/testsuite/gcc.dg/pr106243-1.c
new file mode 100644
index 000..b1dbe5cbe44
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr106243-1.c
@@ -0,0 +1,18 @@
+/* PR tree-optimization/106243 */
+/* { dg-do run } */
+/* { dg-options "-O2" } */
+
+#include "pr106243.c"
+
+int main () {
+
+if (foo(3) != 1
+|| bar(-6) != 0
+|| baz(17) != 1
+|| qux(-128) != 0
+|| foo(127) != 1) {
+__builtin_abort();
+}
+
+return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/pr106243.c b/gcc/testsuite/gcc.dg/pr106243.c
new file mode 100644
index 000..ee2706f2bf9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr106243.c
@@ -0,0 +1,43 @@
+/* PR tree-optimization/106243 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+#define vector __attribute__((vector_size(4*sizeof(int))))
+
+/* Test from PR.  */
+__attribute__((noipa)) int foo (int x) {
+return -x & 1;
+}
+
+/* Other test from PR.  */
+__attribute__((noipa)) int bar (int x) {
+return (0 - x) & 1;
+}
+
+/* Forward propagation.  */
+__attribute__((noipa)) int baz (int x) {
+x = -x;
+return x & 1;
+}
+
+/* Commutative property.  */
+__attribute__((noipa)) int qux (int x) {
+return 1 & -x;
+}
+
+/* Vector test case.  */
+__attribute__((noipa)) vector int waldo (vector int x) {
+return -x & 1;
+}
+
+/* Should not simplify.  */
+__attribute__((noipa)) int thud (int x) {
+return -x & 2;
+}
+
+/* Should not simplify.  */
+__attribute__((noipa)) int corge (int x) {
+return -x & -1;
+}
+
+/* { dg-final {scan-tree-dump-times "-" 2 "optimized" } } */

base-commit: 388fbbd895e72669909173c3003ae65c6483a3c2
-- 
2.31.1



[COMMITTED] tree-optimization/106514 - Do not walk equivalence set in path_oracle::killing_def.

2022-08-03 Thread Andrew MacLeod via Gcc-patches
The path oracle's killing_def () routine was removing an ssa-name from 
each equivalence in the set.  This was busy work that did not need to be 
done.


When checking for an equivalence between A and B, the path oracle 
requires that A be in B's set and B be in A's set.  By setting the 
equivalence set for A to contain *just* A, there is no need to visit all 
the equivalences and remove them.


Aldy ran it through the thread counter, and there was no difference in the 
number of threads found.  The testcase in the PR ran on my machine in the 
following times:


There is no difference in the default time, but with the specified options 
-O2 --param max-jump-thread-duplication-stmts=30 :


Before patch:

backwards jump threading   :  11.64 ( 92%)   0.00 (  0%)  11.71 ( 91%)    24  (  0%)

TOTAL   :  12.70  0.07  12.83   47M

After patch:

backwards jump threading   :   1.96 ( 65%)   0.01 ( 10%)   1.99 ( 64%)    24  (  0%)

TOTAL   :   3.00  0.10  3.12   47M

Clearly more work to do, but much better for a start.  Next up: why is 
compute_ranges_in_block () taking over 50% of the time now?


This was bootstrapped on x86_64-pc-linux-gnu with no regressions.  pushed.

Andrew
commit 19ffb35d17474bb4dd3eb78963c28d10b5362321
Author: Andrew MacLeod 
Date:   Wed Aug 3 13:55:42 2022 -0400

Do not walk equivalence set in path_oracle::killing_def.

When killing a def in the path ranger, there is no need to walk the set
of existing equivalences clearing bits.  An equivalence match requires
that both ssa-names be in each other's set.  As killing_def
creates a new set containing only the current def, it already
ensures false equivalences won't happen.

PR tree-optimization/106514
* value-relation.cc (path_oracle::killing_def): Do not walk the
  equivalence set clearing bits.

diff --git a/gcc/value-relation.cc b/gcc/value-relation.cc
index a447021214f..3f0957ccdd6 100644
--- a/gcc/value-relation.cc
+++ b/gcc/value-relation.cc
@@ -1400,16 +1400,7 @@ path_oracle::killing_def (tree ssa)
   unsigned v = SSA_NAME_VERSION (ssa);
 
   bitmap_set_bit (m_killed_defs, v);
-
-  // Walk the equivalency list and remove SSA from any equivalencies.
-  if (bitmap_bit_p (m_equiv.m_names, v))
-{
-  for (equiv_chain *ptr = m_equiv.m_next; ptr; ptr = ptr->m_next)
-   if (bitmap_bit_p (ptr->m_names, v))
- bitmap_clear_bit (ptr->m_names, v);
-}
-  else
-bitmap_set_bit (m_equiv.m_names, v);
+  bitmap_set_bit (m_equiv.m_names, v);
 
   // Now add an equivalency with itself so we don't look to the root oracle.
   bitmap b = BITMAP_ALLOC (&m_bitmaps);


Re: [PATCH, V2] Do not enable -mblock-ops-vector-pair.

2022-08-03 Thread Segher Boessenkool
Hi Mike,

On Mon, Jul 25, 2022 at 04:15:05PM -0400, Michael Meissner wrote:
> Testing has shown that using the load vector pair and store vector pair
> instructions for block moves has some performance issues on power10.

> This patch eliminates the code setting -mblock-ops-vector-pair.  If you
> want to generate load vector pair and store vector pair instructions for
> block moves, you must use -mblock-ops-vector-pair.
> 
> 2022-07-25   Michael Meissner  
> 
> gcc/
> 
>   * config/rs6000/rs6000.cc (rs6000_option_override_internal): Remove
>   code setting -mblock-ops-vector-pair.

Okay for trunk (and any backports you may need).  Thanks!


Segher


Re: [PATCH 2/2]middle-end: Support recognition of three-way max/min.

2022-08-03 Thread H.J. Lu via Gcc-patches
On Wed, Aug 3, 2022 at 1:26 AM Richard Biener via Gcc-patches
 wrote:
>
> On Wed, 3 Aug 2022, Tamar Christina wrote:
>
> >
> > > -Original Message-
> > > From: Richard Biener 
> > > Sent: Tuesday, August 2, 2022 10:11 AM
> > > To: Tamar Christina 
> > > Cc: Richard Biener ; ja...@redhat.com; nd
> > > ; gcc-patches@gcc.gnu.org
> > > Subject: Re: [PATCH 2/2]middle-end: Support recognition of three-way
> > > max/min.
> > >
> > > On Tue, Aug 2, 2022 at 10:33 AM Tamar Christina via Gcc-patches  > > patc...@gcc.gnu.org> wrote:
> > > >
> > > > > > > > When this function replaces the edge it doesn't seem to update
> > > > > > > > the
> > > > > > > dominators.
> > > > > > > > Since It's replacing the middle BB we then end up with an
> > > > > > > > error
> > > > > > > >
> > > > > > > > gcc/testsuite/gcc.dg/tree-ssa/minmax-14.c:17:1: error:
> > > > > > > > dominator of 5 should be 4, not 2
> > > > > > > >
> > > > > > > > during early verify. So instead, I replace the BB but defer
> > > > > > > > its deletion until cleanup which removes it and updates the
> > > dominators.
> > > > > > >
> > > > > > > Hmm, for a diamond shouldn't you replace
> > > > > > >
> > > > > > >   if (EDGE_SUCC (cond_block, 0)->dest == bb)
> > > > > > > edge_to_remove = EDGE_SUCC (cond_block, 1);
> > > > > > >   else
> > > > > > > edge_to_remove = EDGE_SUCC (cond_block, 0);
> > > > > > >
> > > > > > > with
> > > > > > >
> > > > > > >   if (EDGE_SUCC (cond_block, 0)->dest == bb)
> > > > > > > edge_to_remove = EDGE_SUCC (cond_block, 1);
> > > > > > >   else if (EDGE_SUCC (cond_block, 1)->dest == bb)
> > > > > > > edge_to_remove = EDGE_SUCC (cond_block, 0);
> > > > > > >
> > > > > > > thus, the code expects to be left with a fallthru to the PHI
> > > > > > > block which is expected to have the immediate dominator being
> > > > > > > cond_block but with a diamond there's a (possibly empty) block
> > > > > > > inbetween and dominators are wrong.
> > > > > >
> > > > > > Agreed, but the (EDGE_SUCC (cond_block, 1)->dest == bb) doesn't
> > > > > > seem like the Right one since for a diamond there will be a block
> > > > > > in between the two.  Did you perhaps mean  EDGE_SUCC (EDGE_SUCC
> > > > > > (cond_block, 1)->dest, 0)->dest == bb? i.e. that that destination
> > > > > > across the
> > > > > diamond be bb, and then you remove the middle block?
> > > > >
> > > > > Hmm, I think my condition was correct - the code tries to remove the
> > > > > edge to the middle-block and checks the remaining edge falls through
> > > > > to the merge block.  With a true diamond there is no fallthru to the
> > > > > merge block to keep so we better don't remove any edge?
> > > > >
> > > > > > For the minmax diamond we want both edges removed, since all the
> > > > > > code in the middle BBs are now dead.  But this is probably not
> > > > > > true in the general
> > > > > sense.
> > > >
> > > > Ah! Sorry I was firing a few cylinders short, I get what you mean now:
> > > >
> > > > @@ -425,8 +439,19 @@ replace_phi_edge_with_variable (basic_block
> > > cond_block,
> > > >edge edge_to_remove;
> > > >if (EDGE_SUCC (cond_block, 0)->dest == bb)
> > > >  edge_to_remove = EDGE_SUCC (cond_block, 1);
> > > > -  else
> > > > +  else if (EDGE_SUCC (cond_block, 1)->dest == bb)
> > > >  edge_to_remove = EDGE_SUCC (cond_block, 0);
> > > > +  else
> > > > +{
> > > > +  /* If neither edge from the conditional is the final bb
> > > > +then we must have a diamond block, in which case
> > > > +the true edge was changed by SET_USE above and we must
> > > > +mark the other edge as the false edge.  */
> > > > +  gcond *cond = as_a <gcond *> (last_stmt (cond_block));
> > > > +  gimple_cond_make_false (cond);
> > > > +  return;
> > > > +}
> > > > +
> > >
> > > Note there is already
> > >
> > >   if (EDGE_COUNT (edge_to_remove->dest->preds) == 1)
> > > {
> > > ...
> > > }
> > >   else
> > > {
> > >   /* If there are other edges into the middle block make
> > >  CFG cleanup deal with the edge removal to avoid
> > >  updating dominators here in a non-trivial way.  */
> > >   gcond *cond = as_a <gcond *> (last_stmt (cond_block));
> > >   if (edge_to_remove->flags & EDGE_TRUE_VALUE)
> > > gimple_cond_make_false (cond);
> > >   else
> > > gimple_cond_make_true (cond);
> > > }
> > >
> > > I'm not sure how you can say 'e' is always the true edge?  May I suggest 
> > > to
> > > amend the first condition with edge_to_remove && (and initialize that to
> > > NULL) and use e->flags instead of edge_to_remove in the else, of course
> > > also inverting the logic since we're keeping 'e'?
> >
> > As discussed on IRC, here's the version using keep_edge:
> >
> > @@ -422,12 +436,17 @@ replace_phi_edge_with_variable (basic_block 
> > cond_block,
> >SET_USE (PHI_ARG_DEF_PTR (phi, e->dest_idx), new_tree);
> >
> >/* Remove the empty basic block.  */
> > -  edge edge_to_remove;
> >

Re: [PATCH] match.pd: Add bitwise and pattern [PR106243]

2022-08-03 Thread Prathamesh Kulkarni via Gcc-patches
On Thu, 4 Aug 2022 at 00:41, Sam Feifer via Gcc-patches
 wrote:
>
> This patch adds a new optimization to match.pd. The pattern, -x & 1,
> now gets simplified to x & 1, reducing the number of instructions
> produced.
Hi Sam,
No comments on the patch, but I am wondering if we can similarly add another
pattern to simplify abs(x) & 1 -> x & 1 ?
Currently we don't appear to do it on GIMPLE:

__attribute__((noipa))
int f1 (int x)
{
  return __builtin_abs (x) & 1;
}

.optimized dump shows:
  _1 = ABS_EXPR ;
  _3 = _1 & 1;
  return _3;

although combine simplifies it to x & 1 on RTL, resulting in code-gen:
f1:
and w0, w0, 1
ret

Thanks,
Prathamesh
>
> This patch also adds tests for the optimization rule.
>
> Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?
>
> PR tree-optimization/106243
>
> gcc/ChangeLog:
>
> * match.pd (-x & 1): New simplification.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/pr106243-1.c: New test.
> * gcc.dg/pr106243.c: New test.
> ---
>  gcc/match.pd  |  5 
>  gcc/testsuite/gcc.dg/pr106243-1.c | 18 +
>  gcc/testsuite/gcc.dg/pr106243.c   | 43 +++
>  3 files changed, 66 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.dg/pr106243-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/pr106243.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 562138a8034..78b32567836 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -8061,3 +8061,8 @@ and,
>(if (TYPE_UNSIGNED (TREE_TYPE (@0)))
>  (bit_and @0 @1)
>(cond (le @0 @1) @0 (bit_and @0 @1))
> +
> +/* -x & 1 -> x & 1.  */
> +(simplify
> +  (bit_and:c (negate @0) integer_onep@1)
> +  (bit_and @0 @1))
> diff --git a/gcc/testsuite/gcc.dg/pr106243-1.c 
> b/gcc/testsuite/gcc.dg/pr106243-1.c
> new file mode 100644
> index 000..b1dbe5cbe44
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr106243-1.c
> @@ -0,0 +1,18 @@
> +/* PR tree-optimization/106243 */
> +/* { dg-do run } */
> +/* { dg-options "-O2" } */
> +
> +#include "pr106243.c"
> +
> +int main () {
> +
> +if (foo(3) != 1
> +|| bar(-6) != 0
> +|| baz(17) != 1
> +|| qux(-128) != 0
> +|| foo(127) != 1) {
> +__builtin_abort();
> +}
> +
> +return 0;
> +}
> diff --git a/gcc/testsuite/gcc.dg/pr106243.c b/gcc/testsuite/gcc.dg/pr106243.c
> new file mode 100644
> index 000..ee2706f2bf9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr106243.c
> @@ -0,0 +1,43 @@
> +/* PR tree-optimization/106243 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +#define vector __attribute__((vector_size(4*sizeof(int))))
> +
> +/* Test from PR.  */
> +__attribute__((noipa)) int foo (int x) {
> +return -x & 1;
> +}
> +
> +/* Other test from PR.  */
> +__attribute__((noipa)) int bar (int x) {
> +return (0 - x) & 1;
> +}
> +
> +/* Forward propagation.  */
> +__attribute__((noipa)) int baz (int x) {
> +x = -x;
> +return x & 1;
> +}
> +
> +/* Commutative property.  */
> +__attribute__((noipa)) int qux (int x) {
> +return 1 & -x;
> +}
> +
> +/* Vector test case.  */
> +__attribute__((noipa)) vector int waldo (vector int x) {
> +return -x & 1;
> +}
> +
> +/* Should not simplify.  */
> +__attribute__((noipa)) int thud (int x) {
> +return -x & 2;
> +}
> +
> +/* Should not simplify.  */
> +__attribute__((noipa)) int corge (int x) {
> +return -x & -1;
> +}
> +
> +/* { dg-final {scan-tree-dump-times "-" 2 "optimized" } } */
>
> base-commit: 388fbbd895e72669909173c3003ae65c6483a3c2
> --
> 2.31.1
>


Re: [PATCH] match.pd: Add bitwise and pattern [PR106243]

2022-08-03 Thread Jeff Law via Gcc-patches




On 8/3/2022 2:44 PM, Prathamesh Kulkarni via Gcc-patches wrote:

On Thu, 4 Aug 2022 at 00:41, Sam Feifer via Gcc-patches
 wrote:

This patch adds a new optimization to match.pd. The pattern, -x & 1,
now gets simplified to x & 1, reducing the number of instructions
produced.

Hi Sam,
No comments on patch, but wondering if we can similarly add another pattern to
simplify abs(x) & 1 -> x & 1 ?
Currently we don't appear to do it on GIMPLE:

__attribute__((noipa))
int f1 (int x)
{
   return __builtin_abs (x) & 1;
}

.optimized dump shows:
   _1 = ABS_EXPR ;
   _3 = _1 & 1;
   return _3;

although combine simplifies it to x & 1 on RTL, resulting in code-gen:
f1:
 and w0, w0, 1
 ret
Doesn't abs(x) & mask simplify to x & mask for any mask where the 
sign bit of x is off -- including cases where mask isn't necessarily a 
compile-time constant, but we have range data which allows us to know 
that x's sign bit is off?


jeff





RE: [PING][PATCH] Add instruction level discriminator support.

2022-08-03 Thread Eugene Rozenfeld via Gcc-patches
One more ping for this patch 
https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596065.html

CC Jason since this changes discriminators emitted in dwarf.

Thanks,

Eugene

-Original Message-
From: Eugene Rozenfeld 
Sent: Monday, June 27, 2022 12:45 PM
To: gcc-patches@gcc.gnu.org; Andi Kleen ; Jan Hubicka 

Subject: RE: [PING][PATCH] Add instruction level discriminator support.

Another ping for 
https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596065.html .

I got a review from Andi 
(https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596549.html) but I also 
need a review from someone who can approve the changes.

Thanks,

Eugene

-Original Message-
From: Eugene Rozenfeld 
Sent: Friday, June 10, 2022 12:03 PM
To: gcc-patches@gcc.gnu.org; Andi Kleen ; Jan Hubicka 

Subject: [PING][PATCH] Add instruction level discriminator support.

Hello,

I'd like to ping this patch: 
https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596065.html

Thanks,

Eugene

-Original Message-
From: Gcc-patches  On 
Behalf Of Eugene Rozenfeld via Gcc-patches
Sent: Thursday, June 02, 2022 12:22 AM
To: gcc-patches@gcc.gnu.org; Andi Kleen ; Jan Hubicka 

Subject: [EXTERNAL] [PATCH] Add instruction level discriminator support.

This is the first in a series of patches to enable discriminator support in 
AutoFDO.

This patch switches to tracking discriminators per statement/instruction 
instead of per basic block. Tracking per basic block was problematic since not 
all statements in a basic block needed a discriminator and, also, later 
optimizations could move statements between basic blocks making correlation 
during AutoFDO compilation unreliable. Tracking per statement also allows us to 
assign different discriminators to multiple function calls in the same basic 
block. A subsequent patch will add that support.

The idea of this patch is based on commit 
4c311d95cf6d9519c3c20f641cc77af7df491fdf
by Dehao Chen in vendors/google/heads/gcc-4_8 but uses a slightly different 
approach. In Dehao's work special (normally unused) location ids and side 
tables were used to keep track of locations with discriminators. Things have 
changed since then and I don't think we have unused location ids anymore. 
Instead, I made discriminators a part of ad-hoc locations.

The difference from Dehao's work also includes support for discriminator 
reading/writing in lto streaming and in modules.

Tested on x86_64-pc-linux-gnu.


0001-Add-instruction-level-discriminator-support.patch
Description: 0001-Add-instruction-level-discriminator-support.patch


Re: [PATCH, rs6000] TARGET_MADDLD should include TARGET_POWERPC64

2022-08-03 Thread HAO CHEN GUI via Gcc-patches
Hi Segher,

On 4/8/2022 上午 12:54, Segher Boessenkool wrote:
> Hrm.  But the maddld insn is useful for SImode as well, in 32-bit mode,
> it is just its name that is a bit confusing then.  Sorry for confusing
> things :-(
> 
> Add a test for SImode maddld as well?

 Thanks for your comments.

 Just want to double-check that a maddld test case with "-m32" and
"-mpowerpc64" is needed. As far as I understand, maddld is a 64-bit
instruction, and it should be used with "-mpowerpc64" in a 32-bit register
environment.

Thanks again
Gui Haochen


[r13-1950 Regression] FAIL: gfortran.dg/make_unit.f90 -O1 (test for excess errors) on Linux/x86_64

2022-08-03 Thread haochen.jiang via Gcc-patches
On Linux/x86_64,

9bb19e143cfe8863e2e79d4176b5d7e997b08c5f is the first bad commit
commit 9bb19e143cfe8863e2e79d4176b5d7e997b08c5f
Author: Tamar Christina 
Date:   Wed Aug 3 16:00:39 2022 +0100

middle-end: Support recognition of three-way max/min.

caused

FAIL: gcc.dg/analyzer/pr96653.c (internal compiler error: in gimple_phi_arg, at 
gimple.h:4594)
FAIL: gcc.dg/analyzer/pr96653.c (test for excess errors)
FAIL: g++.dg/warn/uninit-pr105562.C  -std=gnu++14 (internal compiler error: in 
gimple_phi_arg, at gimple.h:4594)
FAIL: g++.dg/warn/uninit-pr105562.C  -std=gnu++14 (test for excess errors)
FAIL: g++.dg/warn/uninit-pr105562.C  -std=gnu++17 (internal compiler error: in 
gimple_phi_arg, at gimple.h:4594)
FAIL: g++.dg/warn/uninit-pr105562.C  -std=gnu++17 (test for excess errors)
FAIL: g++.dg/warn/uninit-pr105562.C  -std=gnu++20 (internal compiler error: in 
gimple_phi_arg, at gimple.h:4594)
FAIL: g++.dg/warn/uninit-pr105562.C  -std=gnu++20 (test for excess errors)
FAIL: gfortran.dg/make_unit.f90   -O1  (internal compiler error: in 
gimple_phi_arg, at gimple.h:4594)
FAIL: gfortran.dg/make_unit.f90   -O1  (test for excess errors)

with GCC configured with

../../gcc/configure 
--prefix=/export/users/haochenj/src/gcc-bisect/master/master/r13-1950/usr 
--enable-clocale=gnu --with-system-zlib --with-demangler-in-ld 
--with-fpmath=sse --enable-languages=c,c++,fortran --enable-cet --without-isl 
--enable-libmpx x86_64-linux --disable-bootstrap

To reproduce:

$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="analyzer.exp=gcc.dg/analyzer/pr96653.c 
--target_board='unix{-m32}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="analyzer.exp=gcc.dg/analyzer/pr96653.c --target_board='unix{-m32\ 
-march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="dg.exp=g++.dg/warn/uninit-pr105562.C --target_board='unix{-m32}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="dg.exp=g++.dg/warn/uninit-pr105562.C --target_board='unix{-m32\ 
-march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="dg.exp=gfortran.dg/make_unit.f90 --target_board='unix{-m32}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="dg.exp=gfortran.dg/make_unit.f90 --target_board='unix{-m32\ 
-march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="dg.exp=gfortran.dg/make_unit.f90 --target_board='unix{-m64}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="dg.exp=gfortran.dg/make_unit.f90 --target_board='unix{-m64\ 
-march=cascadelake}'"

(Please do not reply to this email, for question about this report, contact me 
at haochen dot jiang at intel.com)


[RFC: PATCH] Extend vectorizer to handle nonlinear induction for neg, mul/lshift/rshift with a constant.

2022-08-03 Thread liuhongt via Gcc-patches
For neg, the patch creates a vec_init as [ a, -a, a, -a, ... ] and no
vec_step is needed to update the vectorized iv, since vf is always a multiple
of 2 (negative * negative is positive).

For shift, the patch creates a vec_init as [ a, a >> c, a >> 2*c, .. ]
and a vec_step as [ c * nunits, c * nunits, c * nunits, ... ]; the vectorized
iv is updated as vec_def = vec_init >>/<< vec_step.

For mul, the patch creates a vec_init as [ a, a * c, a * pow(c, 2), .. ]
and a vec_step as [ pow(c,nunits), pow(c,nunits), ... ]; the iv is updated as
vec_def = vec_init * vec_step.

The patch handles nonlinear ivs for:
1. Integer types only; floating point is not handled.
2. No slp_node.
3. iv_loop should be the same as the vector loop, not a nested loop.
4. No UD is created: for mul, no UD overflow for pow (c, vf); for
   shift, the shift count should be less than the type precision.

Bootstrapped and regression tested on x86_64-pc-linux-gnu{-m32,}.
There are some cases observed in SPEC2017, but no big performance impact.

Any comments?

gcc/ChangeLog:

PR tree-optimization/103144
* tree-vect-loop.cc (vect_is_nonlinear_iv_evolution): New function.
(vect_analyze_scalar_cycles_1): Detect nonlinear iv by upper function.
(vect_create_nonlinear_iv_init): New function.
(vect_create_nonlinear_iv_step): Ditto.
(vect_create_nonlinear_iv_vec_step): Ditto.
(vect_update_nonlinear_iv): Ditto.
(vectorizable_nonlinear_induction): Ditto.
(vectorizable_induction): Call
vectorizable_nonlinear_induction when induction_type is not
vect_step_op_add.
* tree-vectorizer.h (enum vect_induction_op_type): New enum.
(STMT_VINFO_LOOP_PHI_EVOLUTION_TYPE): New Macro.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr103144-mul-1.c: New test.
* gcc.target/i386/pr103144-mul-2.c: New test.
* gcc.target/i386/pr103144-neg-1.c: New test.
* gcc.target/i386/pr103144-neg-2.c: New test.
* gcc.target/i386/pr103144-shift-1.c: New test.
* gcc.target/i386/pr103144-shift-2.c: New test.
---
 .../gcc.target/i386/pr103144-mul-1.c  |  25 +
 .../gcc.target/i386/pr103144-mul-2.c  |  43 ++
 .../gcc.target/i386/pr103144-neg-1.c  |  25 +
 .../gcc.target/i386/pr103144-neg-2.c  |  36 ++
 .../gcc.target/i386/pr103144-shift-1.c|  34 +
 .../gcc.target/i386/pr103144-shift-2.c|  61 ++
 gcc/tree-vect-loop.cc | 604 +-
 gcc/tree-vectorizer.h |  11 +
 8 files changed, 834 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr103144-mul-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr103144-mul-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr103144-neg-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr103144-neg-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr103144-shift-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr103144-shift-2.c

diff --git a/gcc/testsuite/gcc.target/i386/pr103144-mul-1.c 
b/gcc/testsuite/gcc.target/i386/pr103144-mul-1.c
new file mode 100644
index 000..2357541d95d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr103144-mul-1.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx2 -ftree-vectorize -fvect-cost-model=unlimited 
-fdump-tree-vect-details -mprefer-vector-width=256" } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" } } */
+
+#define N 1
+void
+foo_mul (int* a, int b)
+{
+  for (int i = 0; i != N; i++)
+{
+  a[i] = b;
+  b *= 3;
+}
+}
+
+void
+foo_mul_const (int* a)
+{
+  int b = 1;
+  for (int i = 0; i != N; i++)
+{
+  a[i] = b;
+  b *= 3;
+}
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr103144-mul-2.c 
b/gcc/testsuite/gcc.target/i386/pr103144-mul-2.c
new file mode 100644
index 000..4ea53e44658
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr103144-mul-2.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -mavx2 -ftree-vectorize -fvect-cost-model=unlimited 
-mprefer-vector-width=256" } */
+/* { dg-require-effective-target avx2 } */
+
+#include "avx2-check.h"
+#include 
+#include "pr103144-mul-1.c"
+
+typedef int v8si __attribute__((vector_size(32)));
+
+void
+avx2_test (void)
+{
+  int* epi32_exp = (int*) malloc (N * sizeof (int));
+  int* epi32_dst = (int*) malloc (N * sizeof (int));
+
+  __builtin_memset (epi32_exp, 0, N * sizeof (int));
+  int b = 8;
+  v8si init = __extension__(v8si) { b, b * 3, b * 9, b * 27, b * 81, b * 243, 
b * 729, b * 2187 };
+
+  for (int i = 0; i != N / 8; i++)
+{
+  memcpy (epi32_exp + i * 8, &init, 32);
+  init *= 6561;
+}
+
+  foo_mul (epi32_dst, b);
+  if (__builtin_memcmp (epi32_dst, epi32_exp, N * sizeof (int)) != 0)
+__builtin_abort ();
+
+  init = __extension__(v8si) { 1, 3, 9, 27, 81, 243, 729, 2187 };
+  for (int i = 0; i != N / 8; i++)
+{
+  memcpy (epi32_exp + i * 8, &init, 32);
+  init *= 6561;
+}
+
+ 

Re: [PATCH] x86: Enable __bf16 type for TARGET_SSE2 and above

2022-08-03 Thread Hongtao Liu via Gcc-patches
On Wed, Aug 3, 2022 at 4:41 PM Kong, Lingling via Gcc-patches
 wrote:
>
> Hi,
>
> The old patch had a mistake in `*movbf_internal`; this version disables BFmode
> constant double moves in `*movbf_internal`.
LGTM.
>
> Thanks,
> Lingling
>
> > -Original Message-
> > From: Kong, Lingling 
> > Sent: Tuesday, July 26, 2022 9:31 AM
> > To: Liu, Hongtao ; gcc-patches@gcc.gnu.org
> > Cc: Kong, Lingling 
> > Subject: [PATCH] x86: Enable __bf16 type for TARGET_SSE2 and above
> >
> > Hi,
> >
> > The patch enables the __bf16 scalar type for target SSE2 and above, according
> > to the
> > psABI(https://gitlab.com/x86-psABIs/x86-64-ABI/-/merge_requests/35/diffs).
> > The __bf16 type is a storage type, like Arm's.
> >
> > OK for master?
> >
> > gcc/ChangeLog:
> >
> >   * config/i386/i386-builtin-types.def (BFLOAT16): New primitive type.
> >   * config/i386/i386-builtins.cc : Support __bf16 type for i386 backend.
> >   (ix86_register_bf16_builtin_type): New function.
> >   (ix86_bf16_type_node): New.
> >   (ix86_bf16_ptr_type_node): Ditto.
> >   (ix86_init_builtin_types): Add ix86_register_bf16_builtin_type 
> > function
> > call.
> >   * config/i386/i386-modes.def (FLOAT_MODE): Add BFmode.
> >   (ADJUST_FLOAT_FORMAT): Ditto.
> >   * config/i386/i386.cc (merge_classes): Handle BFmode.
> >   (classify_argument): Ditto.
> >   (examine_argument): Ditto.
> >   (construct_container): Ditto.
> >   (function_value_32): Return __bf16 by %xmm0.
> >   (function_value_64): Return __bf16 by SSE register.
> >   (ix86_print_operand): Handle CONST_DOUBLE BFmode.
> >   (ix86_secondary_reload): Require gpr as intermediate register
> >   to store __bf16 from sse register when sse4 is not available.
> >   (ix86_scalar_mode_supported_p): Enable __bf16 under sse2.
> >   (ix86_mangle_type): Add mangling for __bf16 type.
> >   (ix86_invalid_conversion): New function for target hook.
> >   (ix86_invalid_unary_op): Ditto.
> >   (ix86_invalid_binary_op): Ditto.
> >   (TARGET_INVALID_CONVERSION): New define for target hook.
> >   (TARGET_INVALID_UNARY_OP): Ditto.
> >   (TARGET_INVALID_BINARY_OP): Ditto.
> >   * config/i386/i386.h (host_detect_local_cpu): Add BFmode.
> >   * config/i386/i386.md (*pushhf_rex64): Change for BFmode.
> >   (*push_rex64): Ditto.
> >   (*pushhf): Ditto.
> >   (*push): Ditto.
> >   (*movhf_internal): Ditto.
> >   (*mov_internal): Ditto.
> >
> > gcc/testsuite/ChangeLog:
> >
> >   * g++.target/i386/bfloat_cpp_typecheck.C: New test.
> >   * gcc.target/i386/bfloat16-1.c: Ditto.
> >   * gcc.target/i386/sse2-bfloat16-1.c: Ditto.
> >   * gcc.target/i386/sse2-bfloat16-2.c: Ditto.
> >   * gcc.target/i386/sse2-bfloat16-scalar-typecheck.c: Ditto.
> > ---
> >  gcc/config/i386/i386-builtin-types.def|   1 +
> >  gcc/config/i386/i386-builtins.cc  |  21 ++
> >  gcc/config/i386/i386-modes.def|   2 +
> >  gcc/config/i386/i386.cc   |  75 +-
> >  gcc/config/i386/i386.h|   4 +-
> >  gcc/config/i386/i386.md   |  32 +--
> >  .../g++.target/i386/bfloat_cpp_typecheck.C|  10 +
> >  gcc/testsuite/gcc.target/i386/bfloat16-1.c|  12 +
> >  .../gcc.target/i386/sse2-bfloat16-1.c |   8 +
> >  .../gcc.target/i386/sse2-bfloat16-2.c |  17 ++
> >  .../i386/sse2-bfloat16-scalar-typecheck.c | 215 ++
> >  11 files changed, 375 insertions(+), 22 deletions(-)
> >  create mode 100644 gcc/testsuite/g++.target/i386/bfloat_cpp_typecheck.C
> >  create mode 100644 gcc/testsuite/gcc.target/i386/bfloat16-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/sse2-bfloat16-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/sse2-bfloat16-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/sse2-bfloat16-scalar-typecheck.c
> >
> > diff --git a/gcc/config/i386/i386-builtin-types.def b/gcc/config/i386/i386-builtin-types.def
> > index 7a2da1db0b0..63a360b0f8b 100644
> > --- a/gcc/config/i386/i386-builtin-types.def
> > +++ b/gcc/config/i386/i386-builtin-types.def
> > @@ -69,6 +69,7 @@ DEF_PRIMITIVE_TYPE (UINT16, short_unsigned_type_node)
> >  DEF_PRIMITIVE_TYPE (INT64, long_long_integer_type_node)
> >  DEF_PRIMITIVE_TYPE (UINT64, long_long_unsigned_type_node)
> >  DEF_PRIMITIVE_TYPE (FLOAT16, ix86_float16_type_node)
> > +DEF_PRIMITIVE_TYPE (BFLOAT16, ix86_bf16_type_node)
> >  DEF_PRIMITIVE_TYPE (FLOAT, float_type_node)
> >  DEF_PRIMITIVE_TYPE (DOUBLE, double_type_node)
> >  DEF_PRIMITIVE_TYPE (FLOAT80, float80_type_node)
> > diff --git a/gcc/config/i386/i386-builtins.cc b/gcc/config/i386/i386-builtins.cc
> > index fe7243c3837..6a04fb57e65 100644
> > --- a/gcc/config/i386/i386-builtins.cc
> > +++ b/gcc/config/i386/i386-builtins.cc
> > @@ -126,6 +126,9 @@ BDESC_VERIFYS (IX86_BUILTIN_MAX,
> >  static GTY(()) tree ix86_builtin_type_tab[(int) IX86_BT_LAST_CPTR + 1];
> >
> >  

RE: [PATCH 1/2]middle-end: Simplify subtract where both arguments are being bitwise inverted.

2022-08-03 Thread Richard Biener via Gcc-patches
On Wed, 3 Aug 2022, Tamar Christina wrote:

> > -Original Message-
> > From: Richard Biener 
> > Sent: Tuesday, June 21, 2022 8:43 AM
> > To: Tamar Christina 
> > Cc: Richard Sandiford ; Richard Biener via Gcc-
> > patches ; Richard Guenther
> > ; nd 
> > Subject: Re: [PATCH 1/2]middle-end: Simplify subtract where both
> > arguments are being bitwise inverted.
> > 
> > On Mon, Jun 20, 2022 at 10:49 AM Tamar Christina
> >  wrote:
> > >
> > > > -Original Message-
> > > > From: Richard Sandiford 
> > > > Sent: Monday, June 20, 2022 9:19 AM
> > > > To: Richard Biener via Gcc-patches 
> > > > Cc: Tamar Christina ; Richard Biener
> > > > ; Richard Guenther
> > ;
> > > > nd 
> > > > Subject: Re: [PATCH 1/2]middle-end: Simplify subtract where both
> > > > arguments are being bitwise inverted.
> > > >
> > > > Richard Biener via Gcc-patches  writes:
> > > > > On Thu, Jun 16, 2022 at 1:10 PM Tamar Christina via Gcc-patches
> > > > >  wrote:
> > > > >>
> > > > >> Hi All,
> > > > >>
> > > > >> This adds a match.pd rule that drops the bitwise nots when both
> > > > >> arguments to a subtract are inverted, i.e. for:
> > > > >>
> > > > >> float g(float a, float b)
> > > > >> {
> > > > >>   return ~(int)a - ~(int)b;
> > > > >> }
> > > > >>
> > > > >> we instead generate
> > > > >>
> > > > >> float g(float a, float b)
> > > > >> {
> > > > >>   return (int)a - (int)b;
> > > > >> }
> > > > >>
> > > > >> We already do a limited version of this in the fold_binary fold
> > > > >> functions, but this adds a more general version in match.pd that
> > > > >> applies more often.
> > > > >>
> > > > >> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > > > >>
> > > > >> Ok for master?
> > > > >>
> > > > >> Thanks,
> > > > >> Tamar
> > > > >>
> > > > >> gcc/ChangeLog:
> > > > >>
> > > > >> * match.pd: New bit_not rule.
> > > > >>
> > > > >> gcc/testsuite/ChangeLog:
> > > > >>
> > > > >> * gcc.dg/subnot.c: New test.
> > > > >>
> > > > >> --- inline copy of patch --
> > > > >> diff --git a/gcc/match.pd b/gcc/match.pd
> > > > >> index a59b6778f661cf9121dd3503f43472871e4da445..51b0a1b562409af535e53828a10c30b8a3e1ae2e 100644
> > > > >> --- a/gcc/match.pd
> > > > >> +++ b/gcc/match.pd
> > > > >> @@ -1258,6 +1258,10 @@ DEFINE_INT_AND_FLOAT_ROUND_FN
> > (RINT)
> > > > >> (simplify
> > > > >>   (bit_not (plus:c (bit_not @0) @1))
> > > > >>   (minus @0 @1))
> > > > >> +/* (~X - ~Y) -> X - Y.  */
> > > > >> +(simplify
> > > > >> + (minus (bit_not @0) (bit_not @1))
> > > > >> + (minus @0 @1))
> > > > >
> > > > > It doesn't seem correct.
> > > > >
> > > > > (gdb) p/x ~-1 - ~0x8000
> > > > > $3 = 0x8001
> > > > > (gdb) p/x -1 - 0x8000
> > > > > $4 = 0x7fff
> > > > >
> > > > > where I was looking for a case exposing undefined integer overflow.
> > > >
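(The gdb session above, redone as well-defined 32-bit wrapping arithmetic; helper names are illustrative, not from the patch:)

```c
#include <assert.h>
#include <stdint.h>

/* ~X - ~Y, the originally proposed fold X - Y, and the corrected fold
   Y - X, all evaluated as 32-bit wrapping (unsigned) arithmetic.  */
static uint32_t sub_of_nots (uint32_t x, uint32_t y) { return ~x - ~y; }
static uint32_t fold_wrong (uint32_t x, uint32_t y) { return x - y; }
static uint32_t fold_right (uint32_t x, uint32_t y) { return y - x; }
```

With X = -1 and Y = 0x80000000 the original expression gives 0x80000001, X - Y gives 0x7fffffff, and only Y - X agrees with the original.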
> > > > Yeah, shouldn't it be folding to (minus @1 @0) instead?
> > > >
> > > >   ~X = (-X - 1)
> > > >   -Y = (-Y - 1)
> > > >
> > > > so:
> > > >
> > > >   ~X - ~Y = (-X - 1) - (-Y - 1)
> > > >   = -X - 1 + Y + 1
> > > >   = Y - X
> > > >
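(A brute-force check of that algebra; a sketch, not part of the patch, done in unsigned arithmetic so the comparison is over defined wraparound rather than undefined signed overflow:)

```c
#include <assert.h>

/* Since ~N == -N - 1, the -1 terms cancel: ~X - ~Y == Y - X.  */
static unsigned nots_sub (unsigned x, unsigned y) { return ~x - ~y; }
static unsigned folded (unsigned x, unsigned y) { return y - x; }
```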
> > >
> > > You're right, sorry, I should have paid more attention when I wrote the
> > patch.
> > 
> > You still need to watch out for undefined overflow cases in the result that
> > were well-defined in the original expression, I think.
> 
> The only special thing we do for signed numbers is to do the subtract as
> unsigned.  As I mentioned before, GCC already does this transformation as
> part of the fold machinery, but that only happens when a very simple tree
> is matched and only on a single use, i.e.
> https://godbolt.org/z/EWsdhfrKj
> 
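(In C terms, doing the subtract as unsigned amounts to the following sketch; the function name is illustrative, and the final conversion back to int is implementation-defined in ISO C but well defined in GCC:)

```c
#include <assert.h>
#include <limits.h>

/* ~X - ~Y folded to Y - X, with the subtraction carried out in the
   corresponding unsigned type so no signed-overflow UB is introduced.  */
static int fold_sub_of_nots (int x, int y)
{
  return (int) ((unsigned) y - (unsigned) x);
}
```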
> I'm only attempting to make it apply more generally, as the result is always
> beneficial.
> 
> I've respun the patch to the same as we already do.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?

OK.

> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>   * match.pd: New bit_not rule.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/subnot.c: New test.
> 
> --- inline copy of patch ---
> 
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 330c1db0c8e12b0fb010b1958729444672403866..00b3e07b2a5216b19ed58500923680d83c67d8cf 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -1308,6 +1308,11 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  (simplify
>   (bit_not (plus:c (bit_not @0) @1))
>   (minus @0 @1))
> +/* (~X - ~Y) -> Y - X.  */
> +(simplify
> + (minus (bit_not @0) (bit_not @1))
> +  (with { tree utype = unsigned_type_for (type); }
> +   (convert (minus (convert:utype @1) (convert:utype @0)))))
>  
>  /* ~(X - Y) -> ~X + Y.  */
>  (simplify
> diff --git a/gcc/testsuite/gcc.dg/subnot.c b/gcc/testsuite/gcc.dg/subnot.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..d621bacd27bd3d19a010e4c9f831aa77d28bd02d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/subnot.c
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O -fdump-tree-optim