Re: PING: [PATCH] x86: Add a pass to remove redundant all 0s/1s vector load

2025-04-20 Thread Hongtao Liu
On Sat, Apr 19, 2025 at 1:25 PM H.J. Lu  wrote:
>
> On Sun, Dec 1, 2024 at 7:50 AM H.J. Lu  wrote:
> >
> > For all different modes of all 0s/1s vectors, we can use the single widest
> > all 0s/1s vector register for all 0s/1s vector uses in the whole function.
> > Add a pass to generate a single widest all 0s/1s vector set instruction at
> > entry of the nearest common dominator for basic blocks with all 0s/1s
> > vector uses.  On Linux/x86-64, in cc1plus, this patch reduces the number
> > of vector xor instructions from 4803 to 4714 and pcmpeq instructions from
> > 144 to 142.
> >
> > This change causes a regression:
> >
> > FAIL: gcc.dg/rtl/x86_64/vector_eq.c
> >
> > without the fix for
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117863
> >
> > NB: PR target/92080 and PR target/117839 aren't the same.  PR target/117839
> > is for vectors of all 0s and all 1s with different sizes and different
> > components.  PR target/92080 is for broadcast of the same component to
> > different vector sizes.  This patch covers only all 0s and all 1s cases
> > of PR target/92080.
> >
> > gcc/
> >
> > PR target/92080
> > PR target/117839
> > * config/i386/i386-features.cc (ix86_rrvl_gate): New.
> > (ix86_place_single_vector_set): Likewise.
> > (ix86_get_vector_load_mode): Likewise.
> > (remove_redundant_vector_load): Likewise.
> > (pass_data_remove_redundant_vector_load): Likewise.
> > (pass_remove_redundant_vector_load): Likewise.
> > (make_pass_remove_redundant_vector_load): Likewise.
> > * config/i386/i386-passes.def: Add
> > pass_remove_redundant_vector_load after
> > pass_remove_partial_avx_dependency.
> > * config/i386/i386-protos.h
> > (make_pass_remove_redundant_vector_load): New.
> >
> > gcc/testsuite/
> >
> > PR target/92080
> > PR target/117839
> > * gcc.target/i386/pr117839-1a.c: New test.
> > * gcc.target/i386/pr117839-1b.c: Likewise.
> > * gcc.target/i386/pr117839-2.c: Likewise.
> > * gcc.target/i386/pr92080-1.c: Likewise.
> > * gcc.target/i386/pr92080-2.c: Likewise.
> > * gcc.target/i386/pr92080-3.c: Likewise.
> >
> > Signed-off-by: H.J. Lu 
> > ---
> >  gcc/config/i386/i386-features.cc| 308 
> >  gcc/config/i386/i386-passes.def |   1 +
> >  gcc/config/i386/i386-protos.h   |   2 +
> >  gcc/testsuite/gcc.target/i386/pr117839-1a.c |  35 +++
> >  gcc/testsuite/gcc.target/i386/pr117839-1b.c |   5 +
> >  gcc/testsuite/gcc.target/i386/pr117839-2.c  |  40 +++
> >  gcc/testsuite/gcc.target/i386/pr92080-1.c   |  54 
> >  gcc/testsuite/gcc.target/i386/pr92080-2.c   |  59 
> >  gcc/testsuite/gcc.target/i386/pr92080-3.c   |  48 +++
> >  9 files changed, 552 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr117839-1a.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr117839-1b.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr117839-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr92080-3.c
> >
> > diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
> > index 003b003e09c..7d8d260750d 100644
> > --- a/gcc/config/i386/i386-features.cc
> > +++ b/gcc/config/i386/i386-features.cc
> > @@ -3288,6 +3288,314 @@ make_pass_remove_partial_avx_dependency (gcc::context *ctxt)
> >return new pass_remove_partial_avx_dependency (ctxt);
> >  }
> >
> > +static bool
> > +ix86_rrvl_gate ()
> > +{
> > +  return (TARGET_SSE2
> > + && optimize
> > + && optimize_function_for_speed_p (cfun));
> > +}
> > +
> > +/* Generate a vector set, DEST = SRC, at entry of the nearest dominator
> > +   for basic block map BBS, which is in the fake loop that contains the
> > +   whole function, so that there is only a single vector set in the
> > +   whole function.   */
> > +
> > +static void
> > +ix86_place_single_vector_set (rtx dest, rtx src, bitmap bbs)
> > +{
> > +  basic_block bb = nearest_common_dominator_for_set (CDI_DOMINATORS, bbs);
> > +  while (bb->loop_father->latch
> > +!= EXIT_BLOCK_PTR_FOR_FN (cfun))
> > +bb = get_immediate_dominator (CDI_DOMINATORS,
> > + bb->loop_father->header);
> > +
> > +  rtx set = gen_rtx_SET (dest, src);
> > +
> > +  rtx_insn *insn = BB_HEAD (bb);
> > +  while (insn && !NONDEBUG_INSN_P (insn))
> > +{
> > +  if (insn == BB_END (bb))
> > +   {
> > + insn = NULL;
> > + break;
> > +   }
> > +  insn = NEXT_INSN (insn);
> > +}
> > +
> > +  rtx_insn *set_insn;
> > +  if (insn == BB_HEAD (bb))
> > +set_insn = emit_insn_before (set, insn);
> > +  else
> > +set_insn = emit_insn_after (set,
> > +   insn ? PREV_INSN (insn) : BB_END (bb));
>

[PATCH] [x86] Generate 2 FMA instructions in ix86_expand_swdivsf.

2025-04-20 Thread liuhongt
From: "hongtao.liu" 

When FMA is available, the N-R step can be rewritten as

a / b = (a - (rcp(b) * a * b)) * rcp(b) + rcp(b) * a

which generates 2 FMA instructions. [1]

[1] https://bugs.llvm.org/show_bug.cgi?id=21385
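
A scalar sketch of the algebra (my illustration, not the patch itself):
with x0 = rcp(b), the initial estimate is e0 = x0 * a, its residual is
e1 = e0 * b - a, and the corrected result is e0 - e1 * x0, i.e. one FMA
for the residual and one for the correction:

#include <math.h>

float
swdiv_sketch (float a, float b)
{
  float x0 = 1.0f / b;          /* stands in for the rcp(b) estimate */
  float e0 = x0 * a;            /* e0 = rcp(b) * a */
  float e1 = fmaf (e0, b, -a);  /* e1 = e0 * b - a     (FMA #1) */
  return fmaf (-e1, x0, e0);    /* res = -e1 * x0 + e0 (FMA #2) */
}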

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?


gcc/ChangeLog:

* config/i386/i386-expand.cc (ix86_emit_swdivsf): Generate 2
FMA instructions when TARGET_FMA.

gcc/testsuite/ChangeLog:

* gcc.target/i386/recip-vec-divf-fma.c: New test.
---
 gcc/config/i386/i386-expand.cc| 44 ++-
 .../gcc.target/i386/recip-vec-divf-fma.c  | 12 +
 2 files changed, 44 insertions(+), 12 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/recip-vec-divf-fma.c

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index cdfd94d3c73..4fffbfdd574 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -19256,8 +19256,6 @@ ix86_emit_swdivsf (rtx res, rtx a, rtx b, machine_mode mode)
   e1 = gen_reg_rtx (mode);
   x1 = gen_reg_rtx (mode);
 
-  /* a / b = a * ((rcp(b) + rcp(b)) - (b * rcp(b) * rcp (b))) */
-
   b = force_reg (mode, b);
 
   /* x0 = rcp(b) estimate */
@@ -19270,20 +19268,42 @@ ix86_emit_swdivsf (rtx res, rtx a, rtx b, machine_mode mode)
 emit_insn (gen_rtx_SET (x0, gen_rtx_UNSPEC (mode, gen_rtvec (1, b),
UNSPEC_RCP)));
 
-  /* e0 = x0 * b */
-  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, b)));
+  unsigned vector_size = GET_MODE_SIZE (mode);
 
-  /* e0 = x0 * e0 */
-  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, e0)));
+  /* (a - (rcp(b) * a * b)) * rcp(b) + rcp(b) * a
+ N-R step with 2 fma implementation.  */
+  if (TARGET_FMA
+  || (TARGET_AVX512F && vector_size == 64)
+  || (TARGET_AVX512VL && (vector_size == 32 || vector_size == 16)))
+{
+  /* e0 = x0 * a  */
+  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, a)));
+  /* e1 = e0 * b - a  */
+  emit_insn (gen_rtx_SET (e1, gen_rtx_FMA (mode, e0, b,
+  gen_rtx_NEG (mode, a;
+  /* res = - e1 * x0 + e0  */
+  emit_insn (gen_rtx_SET (res, gen_rtx_FMA (mode,
+  gen_rtx_NEG (mode, e1),
+  x0, e0)));
+}
+/* a / b = a * ((rcp(b) + rcp(b)) - (b * rcp(b) * rcp (b))) */
+  else
+{
+  /* e0 = x0 * b */
+  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, b)));
 
-  /* e1 = x0 + x0 */
-  emit_insn (gen_rtx_SET (e1, gen_rtx_PLUS (mode, x0, x0)));
+  /* e1 = x0 + x0 */
+  emit_insn (gen_rtx_SET (e1, gen_rtx_PLUS (mode, x0, x0)));
 
-  /* x1 = e1 - e0 */
-  emit_insn (gen_rtx_SET (x1, gen_rtx_MINUS (mode, e1, e0)));
+  /* e0 = x0 * e0 */
+  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, e0)));
 
-  /* res = a * x1 */
-  emit_insn (gen_rtx_SET (res, gen_rtx_MULT (mode, a, x1)));
+  /* x1 = e1 - e0 */
+  emit_insn (gen_rtx_SET (x1, gen_rtx_MINUS (mode, e1, e0)));
+
+  /* res = a * x1 */
+  emit_insn (gen_rtx_SET (res, gen_rtx_MULT (mode, a, x1)));
+}
 }
 
 /* Output code to perform a Newton-Rhapson approximation of a
diff --git a/gcc/testsuite/gcc.target/i386/recip-vec-divf-fma.c b/gcc/testsuite/gcc.target/i386/recip-vec-divf-fma.c
new file mode 100644
index 000..ad9e07b1eb6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/recip-vec-divf-fma.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mfma -mavx2" } */
+/* { dg-final { scan-assembler-times {(?n)vfn?m(add|sub)[1-3]*ps} 2 } } */
+
+typedef float v4sf __attribute__((vector_size(16)));
+/* (a - (rcp(b) * a * b)) * rcp(b) + rcp(b) * a  */
+
+v4sf
+foo (v4sf a, v4sf b)
+{
+return a / b;
+}
-- 
2.34.1



Re: [PATCH] [x86] Generate 2 FMA instructions in ix86_expand_swdivsf.

2025-04-20 Thread Uros Bizjak
On Mon, Apr 21, 2025 at 5:43 AM liuhongt  wrote:
>
> From: "hongtao.liu" 
>
> When FMA is available, the N-R step can be rewritten as
>
> a / b = (a - (rcp(b) * a * b)) * rcp(b) + rcp(b) * a
>
> which generates 2 FMA instructions. [1]
>
> [1] https://bugs.llvm.org/show_bug.cgi?id=21385
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
>
> gcc/ChangeLog:
>
> * config/i386/i386-expand.cc (ix86_emit_swdivsf): Generate 2
> FMA instructions when TARGET_FMA.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/recip-vec-divf-fma.c: New test.

OK, with a small nit below.

Thanks,
Uros.

> ---
>  gcc/config/i386/i386-expand.cc| 44 ++-
>  .../gcc.target/i386/recip-vec-divf-fma.c  | 12 +
>  2 files changed, 44 insertions(+), 12 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/recip-vec-divf-fma.c
>
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index cdfd94d3c73..4fffbfdd574 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -19256,8 +19256,6 @@ ix86_emit_swdivsf (rtx res, rtx a, rtx b, machine_mode mode)
>e1 = gen_reg_rtx (mode);
>x1 = gen_reg_rtx (mode);
>
> -  /* a / b = a * ((rcp(b) + rcp(b)) - (b * rcp(b) * rcp (b))) */
> -
>b = force_reg (mode, b);
>
>/* x0 = rcp(b) estimate */
> @@ -19270,20 +19268,42 @@ ix86_emit_swdivsf (rtx res, rtx a, rtx b, machine_mode mode)
>  emit_insn (gen_rtx_SET (x0, gen_rtx_UNSPEC (mode, gen_rtvec (1, b),
> UNSPEC_RCP)));
>
> -  /* e0 = x0 * b */
> -  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, b)));
> +  unsigned vector_size = GET_MODE_SIZE (mode);
>
> -  /* e0 = x0 * e0 */
> -  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, e0)));
> +  /* (a - (rcp(b) * a * b)) * rcp(b) + rcp(b) * a
> + N-R step with 2 fma implementation.  */
> +  if (TARGET_FMA
> +  || (TARGET_AVX512F && vector_size == 64)
> +  || (TARGET_AVX512VL && (vector_size == 32 || vector_size == 16)))
> +{
> +  /* e0 = x0 * a  */
> +  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, a)));
> +  /* e1 = e0 * b - a  */
> +  emit_insn (gen_rtx_SET (e1, gen_rtx_FMA (mode, e0, b,
> +  gen_rtx_NEG (mode, a;
> +  /* res = - e1 * x0 + e0  */
> +  emit_insn (gen_rtx_SET (res, gen_rtx_FMA (mode,
> +  gen_rtx_NEG (mode, e1),
> +  x0, e0)));
> +}
> +/* a / b = a * ((rcp(b) + rcp(b)) - (b * rcp(b) * rcp (b))) */
> +  else

Please put the above comment here, as it applies to the "else" branch.

> +{
> +  /* e0 = x0 * b */
> +  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, b)));
>
> -  /* e1 = x0 + x0 */
> -  emit_insn (gen_rtx_SET (e1, gen_rtx_PLUS (mode, x0, x0)));
> +  /* e1 = x0 + x0 */
> +  emit_insn (gen_rtx_SET (e1, gen_rtx_PLUS (mode, x0, x0)));
>
> -  /* x1 = e1 - e0 */
> -  emit_insn (gen_rtx_SET (x1, gen_rtx_MINUS (mode, e1, e0)));
> +  /* e0 = x0 * e0 */
> +  emit_insn (gen_rtx_SET (e0, gen_rtx_MULT (mode, x0, e0)));
>
> -  /* res = a * x1 */
> -  emit_insn (gen_rtx_SET (res, gen_rtx_MULT (mode, a, x1)));
> +  /* x1 = e1 - e0 */
> +  emit_insn (gen_rtx_SET (x1, gen_rtx_MINUS (mode, e1, e0)));
> +
> +  /* res = a * x1 */
> +  emit_insn (gen_rtx_SET (res, gen_rtx_MULT (mode, a, x1)));
> +}
>  }
>
>  /* Output code to perform a Newton-Rhapson approximation of a
> diff --git a/gcc/testsuite/gcc.target/i386/recip-vec-divf-fma.c b/gcc/testsuite/gcc.target/i386/recip-vec-divf-fma.c
> new file mode 100644
> index 000..ad9e07b1eb6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/recip-vec-divf-fma.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mfma -mavx2" } */
> +/* { dg-final { scan-assembler-times {(?n)vfn?m(add|sub)[1-3]*ps} 2 } } */
> +
> +typedef float v4sf __attribute__((vector_size(16)));
> +/* (a - (rcp(b) * a * b)) * rcp(b) + rcp(b) * a  */
> +
> +v4sf
> +foo (v4sf a, v4sf b)
> +{
> +return a / b;
> +}
> --
> 2.34.1
>


[PATCH] Accept allones or 0 operand for vcond_mask op1.

2025-04-20 Thread liuhongt
ix86_expand_sse_movcc will simplify them into a simple vmov, vpand or
vpandn, so the current register_operand/vector_operand predicates could
lose some optimization opportunities.
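
As a hypothetical example of the kind of code that benefits (function
names are mine; this uses GCC's vector extensions, including vector ?:
as accepted by g++), a select whose arms are all-ones/zero constants can
be reduced to the compare mask itself or to a single pand:

typedef int v4si __attribute__ ((vector_size (16)));

/* (a == b) ? -1 : 0 is exactly the pcmpeqd mask, so no blend is
   needed once the constant arms are allowed through.  */
v4si
select_mask (v4si a, v4si b)
{
  return a == b ? (v4si) { -1, -1, -1, -1 } : (v4si) { 0, 0, 0, 0 };
}

/* (a == b) ? c : 0 reduces to pand of the mask with C.  */
v4si
select_and (v4si a, v4si b, v4si c)
{
  return a == b ? c : (v4si) { 0, 0, 0, 0 };
}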

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?

gcc/ChangeLog:

* config/i386/predicates.md (vector_or_0_or_1s_operand): New predicate.
(nonimm_or_0_or_1s_operand): Ditto.
* config/i386/sse.md (vcond_mask_):
Extend the predicate of operands1 to accept 0 or allones
operands.
(vcond_mask_): Ditto.
(vcond_mask_v1tiv1ti): Ditto.
(vcond_mask_): Ditto.
* config/i386/i386.md (movcc): Ditto for operands[2] and
operands[3].

gcc/testsuite/ChangeLog:

* gcc.target/i386/blendv-to-maxmin.c: New test.
* gcc.target/i386/blendv-to-pand.c: New test.
---
 gcc/config/i386/i386-expand.cc   |  6 ++
 gcc/config/i386/i386.md  |  4 ++--
 gcc/config/i386/predicates.md| 14 ++
 gcc/config/i386/sse.md   | 10 +-
 gcc/testsuite/gcc.target/i386/blendv-to-maxmin.c | 12 
 gcc/testsuite/gcc.target/i386/blendv-to-pand.c   | 16 
 6 files changed, 55 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/blendv-to-maxmin.c
 create mode 100644 gcc/testsuite/gcc.target/i386/blendv-to-pand.c

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index cdfd94d3c73..ef867fb4f82 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -4138,6 +4138,10 @@ ix86_expand_sse_fp_minmax (rtx dest, enum rtx_code code, rtx cmp_op0,
 return false;
 
   mode = GET_MODE (dest);
+  if (immediate_operand (if_false, mode))
+if_false = force_reg (mode, if_false);
+  if (immediate_operand (if_true, mode))
+if_true = force_reg (mode, if_true);
 
   /* We want to check HONOR_NANS and HONOR_SIGNED_ZEROS here,
  but MODE may be a vector mode and thus not appropriate.  */
@@ -4687,6 +4691,8 @@ ix86_expand_fp_movcc (rtx operands[])
   compare_op = ix86_expand_compare (NE, tmp, const0_rtx);
 }
 
+  operands[2] = force_reg (mode, operands[2]);
+  operands[3] = force_reg (mode, operands[3]);
   emit_insn (gen_rtx_SET (operands[0],
  gen_rtx_IF_THEN_ELSE (mode, compare_op,
operands[2], operands[3])));
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index f7f790d2aeb..45c2fe5a58a 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -26576,8 +26576,8 @@ (define_expand "movcc"
   [(set (match_operand:X87MODEF 0 "register_operand")
(if_then_else:X87MODEF
  (match_operand 1 "comparison_operator")
- (match_operand:X87MODEF 2 "register_operand")
- (match_operand:X87MODEF 3 "register_operand")))]
+ (match_operand:X87MODEF 2 "nonimm_or_0_or_1s_operand")
+ (match_operand:X87MODEF 3 "nonimm_or_0_operand")))]
   "(TARGET_80387 && TARGET_CMOVE)
|| (SSE_FLOAT_MODE_P (mode) && TARGET_SSE_MATH)"
   "if (ix86_expand_fp_movcc (operands)) DONE; else FAIL;")
diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
index 3d3848c0a22..4b23e18eaf4 100644
--- a/gcc/config/i386/predicates.md
+++ b/gcc/config/i386/predicates.md
@@ -1267,6 +1267,14 @@ (define_predicate "vector_or_const_vector_operand"
(match_operand 0 "vector_memory_operand")
(match_code "const_vector")))
 
+; Return true when OP is register_operand, vector_memory_operand,
+; const_vector zero or const_vector all ones.
+(define_predicate "vector_or_0_or_1s_operand"
+  (ior (match_operand 0 "register_operand")
+   (match_operand 0 "vector_memory_operand")
+   (match_operand 0 "const0_operand")
+   (match_operand 0 "int_float_vector_all_ones_operand")))
+
 (define_predicate "bcst_mem_operand"
   (and (match_code "vec_duplicate")
(and (match_test "TARGET_AVX512F")
@@ -1333,6 +1341,12 @@ (define_predicate "nonimm_or_0_operand"
   (ior (match_operand 0 "nonimmediate_operand")
(match_operand 0 "const0_operand")))
 
+; Return true when OP is a nonimmediate or zero or all ones.
+(define_predicate "nonimm_or_0_or_1s_operand"
+  (ior (match_operand 0 "nonimmediate_operand")
+   (match_operand 0 "const0_operand")
+   (match_operand 0 "int_float_vector_all_ones_operand")))
+
 ;; Return true for RTX codes that force SImode address.
 (define_predicate "SImode_address_operand"
   (match_code "subreg,zero_extend,and"))
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index ed5ac1abe80..aa192993b50 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -5138,7 +5138,7 @@ (define_mode_iterator VI_256_AVX2 [(V32QI "TARGET_AVX2") (V16HI "TARGET_AVX2")
 (define_expand "vcond_mask_"
   [(set (match_operand:VI_256_AVX2 0 "register_operand")
(vec_merge:VI_256_AVX2
- (match_operand

[PATCH] x86: Properly find the maximum stack slot alignment

2025-04-20 Thread H.J. Lu
Don't assume that stack slots can only be accessed by stack or frame
registers.  We first find all registers defined by stack or frame
registers, then check memory accesses through such registers, including
the stack and frame registers themselves.
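
A hypothetical example of the access pattern this handles (my sketch,
compiled with -mavx; not one of the new tests): the aligned stack slot
is reached through a pointer copied out of the frame register, so
scanning only stack- and frame-register-based memory operands would
miss the 32-byte requirement:

#include <immintrin.h>

extern void consume (void *);

void
foo (void)
{
  int buf[16] __attribute__ ((aligned (32)));  /* the stack slot */
  int *p = buf;            /* a register defined from the frame register */
  *(__m256i *) p = _mm256_setzero_si256 ();  /* 32-byte aligned store via P */
  consume (p);
}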

gcc/

PR target/109780
PR target/109093
* config/i386/i386.cc (stack_access_data): New.
(ix86_update_stack_alignment): Likewise.
(ix86_find_all_reg_use_1): Likewise.
(ix86_find_all_reg_use): Likewise.
(ix86_find_max_used_stack_alignment): Also check memory accesses
from registers defined by stack or frame registers.

gcc/testsuite/

PR target/109780
PR target/109093
* g++.target/i386/pr109780-1.C: New test.
* gcc.target/i386/pr109093-1.c: Likewise.
* gcc.target/i386/pr109780-1.c: Likewise.
* gcc.target/i386/pr109780-2.c: Likewise.
* gcc.target/i386/pr109780-3.c: Likewise.

Signed-off-by: H.J. Lu 
---
 gcc/config/i386/i386.cc| 174 ++---
 gcc/testsuite/g++.target/i386/pr109780-1.C |  72 +
 gcc/testsuite/gcc.target/i386/pr109093-1.c |  33 
 gcc/testsuite/gcc.target/i386/pr109780-1.c |  14 ++
 gcc/testsuite/gcc.target/i386/pr109780-2.c |  21 +++
 gcc/testsuite/gcc.target/i386/pr109780-3.c |  46 ++
 6 files changed, 339 insertions(+), 21 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/i386/pr109780-1.C
 create mode 100644 gcc/testsuite/gcc.target/i386/pr109093-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr109780-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr109780-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr109780-3.c

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 28603c2943e..9e4e76857e6 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -8473,6 +8473,103 @@ output_probe_stack_range (rtx reg, rtx end)
   return "";
 }
 
+/* Data passed to ix86_update_stack_alignment.  */
+struct stack_access_data
+{
+  /* The stack access register.  */
+  const_rtx reg;
+  /* Pointer to stack alignment.  */
+  unsigned int *stack_alignment;
+};
+
+/* Update the maximum stack slot alignment from memory alignment in
+   PAT.  */
+
+static void
+ix86_update_stack_alignment (rtx, const_rtx pat, void *data)
+{
+  /* This insn may reference stack slot.  Update the maximum stack slot
+ alignment if the memory is referenced by the stack access register.
+   */
+  stack_access_data *p = (stack_access_data *) data;
+  subrtx_iterator::array_type array;
+  FOR_EACH_SUBRTX (iter, array, pat, ALL)
+{
+  auto op = *iter;
+  if (GET_CODE (op) == ZERO_EXTEND)
+   op = XEXP (op, 0);
+  if (MEM_P (op) && reg_mentioned_p (p->reg, op))
+   {
+ unsigned int alignment = MEM_ALIGN (op);
+ if (alignment > *p->stack_alignment)
+   *p->stack_alignment = alignment;
+ break;
+   }
+}
+}
+
+/* Helper function for ix86_find_all_reg_use.  */
+
+static void
+ix86_find_all_reg_use_1 (rtx set, HARD_REG_SET &stack_slot_access,
+auto_bitmap &worklist)
+{
+  rtx dest = SET_DEST (set);
+  if (!REG_P (dest))
+return;
+
+  rtx src = SET_SRC (set);
+
+  if (GET_CODE (src) == ZERO_EXTEND)
+src = XEXP (src, 0);
+
+  if (MEM_P (src) || CONST_SCALAR_INT_P (src))
+return;
+
+  if (TEST_HARD_REG_BIT (stack_slot_access, REGNO (dest)))
+return;
+
+  /* Add this register to stack_slot_access.  */
+  add_to_hard_reg_set (&stack_slot_access, Pmode, REGNO (dest));
+  bitmap_set_bit (worklist, REGNO (dest));
+}
+
+/* Find all registers defined with REG.  */
+
+static void
+ix86_find_all_reg_use (HARD_REG_SET &stack_slot_access,
+  unsigned int reg, auto_bitmap &worklist)
+{
+  for (df_ref ref = DF_REG_USE_CHAIN (reg);
+   ref != NULL;
+   ref = DF_REF_NEXT_REG (ref))
+{
+  if (DF_REF_IS_ARTIFICIAL (ref))
+   continue;
+
+  rtx_insn *insn = DF_REF_INSN (ref);
+
+  if (!NONJUMP_INSN_P (insn))
+   continue;
+
+  rtx set = single_set (insn);
+  if (set)
+   ix86_find_all_reg_use_1 (set, stack_slot_access, worklist);
+
+  rtx pat = PATTERN (insn);
+  if (GET_CODE (pat) != PARALLEL)
+   continue;
+
+  for (int i = 0; i < XVECLEN (pat, 0); i++)
+   {
+ rtx exp = XVECEXP (pat, 0, i);
+
+ if (GET_CODE (exp) == SET)
+   ix86_find_all_reg_use_1 (exp, stack_slot_access, worklist);
+   }
+}
+}
+
 /* Set stack_frame_required to false if stack frame isn't required.
Update STACK_ALIGNMENT to the largest alignment, in bits, of stack
slot used if stack frame is required and CHECK_STACK_SLOT is true.  */
@@ -8491,10 +8588,6 @@ ix86_find_max_used_stack_alignment (unsigned int &stack_alignment,
   add_to_hard_reg_set (&set_up_by_prologue, Pmode,
   HARD_FRAME_POINTER_REGNUM);
 
-  /* The preferred stack alignment is the minimum stack alignment.  */
-  if (stack_alignment > c

Re: [PATCH v2] x86: Update memcpy/memset inline strategies for -mtune=generic

2025-04-20 Thread H.J. Lu
On Sun, Apr 20, 2025 at 6:31 PM Jan Hubicka  wrote:
>
> >   PR target/102294
> >   PR target/119596
> >   * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
> >   (generic_memset): Likewise.
> >   (generic_cost): Change CLEAR_RATIO to 17.
> >   * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
> >   Add m_GENERIC.
>
> Looking through the PRs, they are primarily about CLEAR_RATIO being
> lower than on clang, which makes us produce a slower (but smaller)
> initialization sequence for blocks of certain sizes.
> It seems the kernel is discussed there too (-mno-sse).
>
> Bumping it up for SSE makes sense provided that SSE codegen does not
> suffer from the long $0 immediates. I would say it is OK also for
> -mno-sse provided the speedups are quite noticeable, but it would be
> really nice to solve this incrementally.
>
> Concerning X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB, my understanding is
> that Intel chips like stosb for small blocks, since they are not
> optimized for stosw/q.  Zen seems to prefer stosq over stosb for
> blocks up to 128 bytes.
>
> How does the loop version compare to stosb for blocks in the range
> 1...128 bytes on Intel hardware?
>
> Since in this case we prove the block size to be small but do not know
> the exact size, I think using a loop or unrolled code for blocks up to,
> say, 128 bytes may work well for both.
>
> Honza

My patch has a 256 byte threshold.  Are you suggesting changing it
to 128 bytes?

-- 
H.J.


Re: [PATCH v2] x86: Update memcpy/memset inline strategies for -mtune=generic

2025-04-20 Thread H.J. Lu
On Mon, Apr 21, 2025 at 7:24 AM H.J. Lu  wrote:
>
> On Sun, Apr 20, 2025 at 6:31 PM Jan Hubicka  wrote:
> >
> > >   PR target/102294
> > >   PR target/119596
> > >   * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
> > >   (generic_memset): Likewise.
> > >   (generic_cost): Change CLEAR_RATIO to 17.
> > >   * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
> > >   Add m_GENERIC.
> >
> > Looking through the PRs, they are primarily about CLEAR_RATIO being
> > lower than on clang, which makes us produce a slower (but smaller)
> > initialization sequence for blocks of certain sizes.
> > It seems the kernel is discussed there too (-mno-sse).
> >
> > Bumping it up for SSE makes sense provided that SSE codegen does not
> > suffer from the long $0 immediates. I would say it is OK also for
> > -mno-sse provided the speedups are quite noticeable, but it would be
> > really nice to solve this incrementally.
> >
> > Concerning X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB, my understanding is
> > that Intel chips like stosb for small blocks, since they are not
> > optimized for stosw/q.  Zen seems to prefer stosq over stosb for
> > blocks up to 128 bytes.
> >
> > How does the loop version compare to stosb for blocks in the range
> > 1...128 bytes on Intel hardware?
> >
> > Since in this case we prove the block size to be small but do not know
> > the exact size, I think using a loop or unrolled code for blocks up to,
> > say, 128 bytes may work well for both.
> >
> > Honza
>
> My patch has a 256 byte threshold.  Are you suggesting changing it
> to 128 bytes?
>

256 bytes was selected since MOVE_RATIO and CLEAR_RATIO are 17, which
allows 16 moves of 16 bytes (256 bytes).  To lower the threshold to
128 bytes, MOVE_RATIO/CLEAR_RATIO would have to be changed to 9.  Do we
want to do that?


-- 
H.J.


New Swedish PO file for 'gcc' (version 15.1-b20250406)

2025-04-20 Thread Translation Project Robot
Hello, gentle maintainer.

This is a message from the Translation Project robot.

A revised PO file for textual domain 'gcc' has been submitted
by the Swedish team of translators.  The file is available at:

https://translationproject.org/latest/gcc/sv.po

(This file, 'gcc-15.1-b20250406.sv.po', has just now been sent to you in
a separate email.)

All other PO files for your package are available in:

https://translationproject.org/latest/gcc/

Please consider including all of these in your next release, whether
official or a pretest.

Whenever you have a new distribution with a new version number ready,
containing a newer POT file, please send the URL of that distribution
tarball to the address below.  The tarball may be just a pretest or a
snapshot, it does not even have to compile.  It is just used by the
translators when they need some extra translation context.

The following HTML page has been updated:

https://translationproject.org/domain/gcc.html

If any question arises, please contact the translation coordinator.

Thank you for all your work,

The Translation Project robot, in the
name of your translation coordinator.




Re: PING: [PATCH v2] x86: Add pcmpeq splitters

2025-04-20 Thread H.J. Lu
On Sat, Apr 19, 2025 at 4:16 PM Uros Bizjak  wrote:
>
> On Sat, Apr 19, 2025 at 7:22 AM H.J. Lu  wrote:
> >
> > On Mon, Dec 2, 2024 at 6:27 AM H.J. Lu  wrote:
> > >
> > > Add pcmpeq splitters to split
> > >
> > > (insn 5 3 7 2 (set (reg:V4SI 100)
> > > (eq:V4SI (reg:V4SI 98)
> > > (reg:V4SI 98))) 7910 {*sse2_eqv4si3}
> > >  (expr_list:REG_DEAD (reg:V4SI 98)
> > > (expr_list:REG_EQUAL (eq:V4SI (const_vector:V4SI [
> > > (const_int -1 [0x]) repeated x4
> > > ])
> > > (const_vector:V4SI [
> > > (const_int -1 [0x]) repeated x4
> > > ]))
> > > (nil
> > >
> > > to
> > >
> > > (insn 8 3 7 2 (set (reg:V4SI 100)
> > > (const_vector:V4SI [
> > > (const_int -1 [0x]) repeated x4
> > > ])) -1
> > >  (nil))
> > >
> > > gcc/
> > >
> > > PR target/117863
> > > * config/i386/sse.md: Add pcmpeq splitters.
> > >
> > > gcc/testsuite/
> > >
> > > PR target/117863
> > > * gcc.dg/rtl/i386/vector_eq-2.c: New test.
> > >
> > > Signed-off-by: H.J. Lu 
> > > ---
> > >  gcc/config/i386/sse.md  | 36 +++
> > >  gcc/testsuite/gcc.dg/rtl/i386/vector_eq-2.c | 71 +
> > >  2 files changed, 107 insertions(+)
> > >  create mode 100644 gcc/testsuite/gcc.dg/rtl/i386/vector_eq-2.c
> > >
> > > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > > index 498a42d6e1e..4b19bc22a83 100644
> > > --- a/gcc/config/i386/sse.md
> > > +++ b/gcc/config/i386/sse.md
> > > @@ -17943,6 +17943,18 @@ (define_insn "*avx2_eq3"
> > > (set_attr "prefix" "vex")
> > > (set_attr "mode" "OI")])
> > >
> > > +;; Don't remove memory operand to keep volatile memory.
>
> Perhaps we can use MEM_VOLATILE_P to also allow memory operands?
>
> > > +(define_split
> > > +  [(set (match_operand:VI_256 0 "register_operand")
> > > +   (eq:VI_256
> > > + (match_operand:VI_256 1 "register_operand")
> > > + (match_operand:VI_256 2 "register_operand")))]
> > > +  "TARGET_AVX2 && rtx_equal_p (operands[1], operands[2])"
> > > +  [(set (match_dup 0) (match_dup 1))]
> > > +{
> > > +  operands[1] = CONSTM1_RTX (mode);
> > > +})
>
> Single preparation statements should use double quotes, here and in other 
> cases.
>

This isn't needed anymore with

commit 546f28f83ceba74dc8bf84b0435c0159ffca971a
Author: Richard Sandiford 
Date:   Mon Apr 7 08:03:46 2025 +0100

simplify-rtx: Fix shortcut for vector eq/ne

I am checking 2 tests instead.

-- 
H.J.


[PATCH] libstdc++: fix possible undefined atomic lock-free type aliases in module std

2025-04-20 Thread ZENG Hao
When building for 'i386-*' targets, all basic types are only 'sometimes
lock-free' and thus std::atomic_signed_lock_free and
std::atomic_unsigned_lock_free are not declared. In the header <atomic>,
they are guarded by the preprocessor condition
__cpp_lib_atomic_lock_free_type_aliases. In module std, they should be
handled the same way.
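
For reference, a minimal consumer-side sketch of the same guard (my
example, not from the patch): code can only rely on the aliases when the
feature-test macro is defined, whether it uses the header or the module:

#include <atomic>

#ifdef __cpp_lib_atomic_lock_free_type_aliases
std::atomic_signed_lock_free counter{0}; // a signed always-lock-free type
#else
std::atomic<int> counter{0};             // portable fallback
#endif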

libstdc++-v3/ChangeLog:

* src/c++23/std.cc.in: Add preprocessor condition
__cpp_lib_atomic_lock_free_type_aliases for
std::atomic_signed_lock_free and std::atomic_unsigned_lock_free.
---
 libstdc++-v3/src/c++23/std.cc.in | 4 
 1 file changed, 4 insertions(+)

diff --git a/libstdc++-v3/src/c++23/std.cc.in b/libstdc++-v3/src/c++23/std.cc.in
index 5e18ad73908..ea50496b057 100644
--- a/libstdc++-v3/src/c++23/std.cc.in
+++ b/libstdc++-v3/src/c++23/std.cc.in
@@ -599,7 +599,9 @@ export namespace std
   using std::atomic_schar;
   using std::atomic_short;
   using std::atomic_signal_fence;
+#ifdef __cpp_lib_atomic_lock_free_type_aliases
   using std::atomic_signed_lock_free;
+#endif
   using std::atomic_size_t;
   using std::atomic_store;
   using std::atomic_store_explicit;
@@ -622,7 +624,9 @@ export namespace std
   using std::atomic_uintptr_t;
   using std::atomic_ullong;
   using std::atomic_ulong;
+#ifdef __cpp_lib_atomic_lock_free_type_aliases
   using std::atomic_unsigned_lock_free;
+#endif
   using std::atomic_ushort;
   using std::atomic_wait;
   using std::atomic_wait_explicit;
-- 
2.49.0



Re: [PATCH v2] x86: Update memcpy/memset inline strategies for -mtune=generic

2025-04-20 Thread Jan Hubicka
>   PR target/102294
>   PR target/119596
>   * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
>   (generic_memset): Likewise.
>   (generic_cost): Change CLEAR_RATIO to 17.
>   * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
>   Add m_GENERIC.

Looking through the PRs, they are primarily about CLEAR_RATIO being
lower than on clang, which makes us produce a slower (but smaller)
initialization sequence for blocks of certain sizes.
It seems the kernel is discussed there too (-mno-sse).

Bumping it up for SSE makes sense provided that SSE codegen does not
suffer from the long $0 immediates. I would say it is OK also for
-mno-sse provided the speedups are quite noticeable, but it would be
really nice to solve this incrementally.

Concerning X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB, my understanding is
that Intel chips like stosb for small blocks, since they are not
optimized for stosw/q.  Zen seems to prefer stosq over stosb for
blocks up to 128 bytes.

How does the loop version compare to stosb for blocks in the range
1...128 bytes on Intel hardware?

Since in this case we prove the block size to be small but do not know
the exact size, I think using a loop or unrolled code for blocks up to,
say, 128 bytes may work well for both.

Honza
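
A rough harness for that question (my sketch; a real measurement would
need warm-up, multiple sizes, and alignment variations):

#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>

static void
rep_stosb (void *dst, int c, size_t n)
{
  /* AL = value, RCX = count, RDI = destination.  */
  asm volatile ("rep stosb" : "+D" (dst), "+c" (n) : "a" (c) : "memory");
}

static void
loop_set (void *dst, int c, size_t n)
{
  /* Stands in for the inlined loop variant; build with
     -fno-builtin-memset so it is not folded back into memset.  */
  char *p = (char *) dst;
  for (size_t i = 0; i < n; i++)
    p[i] = (char) c;
}

static uint64_t
cycles (void (*f) (void *, int, size_t), void *buf, size_t n)
{
  uint64_t t0 = __rdtsc ();
  for (int i = 0; i < 1000000; i++)
    f (buf, 0, n);
  return (__rdtsc () - t0) / 1000000;  /* average cycles per call */
}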


Re: [PATCH v2] x86: Update memcpy/memset inline strategies for -mtune=generic

2025-04-20 Thread Jan Hubicka
> On Sun, Apr 20, 2025 at 4:19 AM Jan Hubicka  wrote:
> >
> > > On Tue, Apr 8, 2025 at 3:52 AM H.J. Lu  wrote:
> > > >
> > > > Simplify memcpy and memset inline strategies to avoid branches for
> > > > -mtune=generic:
> > > >
> > > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> > > >load and store for up to 16 * 16 (256) bytes when the data size is
> > > >fixed and known.
> >
> > Originally we set CLEAR_RATIO smaller than MOVE_RATIO because to store
> > zeros we use:
> >
> >    0:   48 c7 07 00 00 00 00    movq   $0x0,(%rdi)
> >    7:   48 c7 47 08 00 00 00    movq   $0x0,0x8(%rdi)
> >    e:   00
> >    f:   48 c7 47 10 00 00 00    movq   $0x0,0x10(%rdi)
> >   16:   00
> >   17:   48 c7 47 18 00 00 00    movq   $0x0,0x18(%rdi)
> >   1e:   00
> >
> > so about 8 bytes per instruction.   We could optimize it by loading 0
> 
> This is orthogonal to this patch.

True, I mentioned it mostly because ...
> 
> > to a scratch register but we don't.  The SSE variant is shorter:
> >
> >    4:   0f 11 07                movups %xmm0,(%rdi)
> >    7:   0f 11 47 10             movups %xmm0,0x10(%rdi)
> >
> > So I wonder if we care about code size with -mno-sse (i.e. for building
> > the kernel).
> 
> This patch doesn't change -Os behavior which uses x86_size_cost,
> not generic_cost.

... we need to make code size/speed tradeoffs even at -O2 (and partly
-O3).  A sequence of 17 integer moves will be 136 bytes of code, while
with SSE it will be 68 bytes. It would be nice to understand how often
the longer sequence pays off compared to the shorter ones.

> > SPEC is not very sensitive to string op implementation.  I wonder if you
> > have specific testcases where using loop variant for very small blocks
> > is a loss?
> 
> For small blocks with known sizes, a loop is slower because of branches.

For known sizes we should use a sequence of moves (up to the MOVE/COPY
ratio). Even with the current setting of 6 we should be able to copy all
blocks of size < 32.  To copy a block of 32 we need 4 64-bit moves or 2
128-bit moves.

#include 
char *src;
char *dest;
char *dest2;
void
test ()
{
memcpy (dest, src, 31);
}
void
test2 ()
{
memset (dest2, 0, 31);
}

compiles to
test:
movqsrc(%rip), %rdx
movqdest(%rip), %rax
movdqu  (%rdx), %xmm0
movups  %xmm0, (%rax)
movdqu  15(%rdx), %xmm0
movups  %xmm0, 15(%rax)
ret
test2:
movqdest2(%rip), %rax
pxor%xmm0, %xmm0
movups  %xmm0, (%rax)
movups  %xmm0, 15(%rax)
ret

The copy algorithm tables are mostly used when the block size is greater
than what we can copy/set by COPY/CLEAR_RATIO, or when we know the
expected size from profile feedback. In the relatively rare cases when
value ranges deliver a useful range we use it too (some work would be
needed to make this work better), but it works on simple testcases.  For
example:

#include 
void
test (char *dest, int n)
{
memset (dest, 0, 30 + (n != 0));
}

compiles to a loop instead of library call since we know that the code
will copy 30 or 31 bytes.

This testcase should be compiled that way too:
#include 
char dest[31];
void
test (int n)
{
memset (dest, 0, n);
}

since the upper bound on the block size is 31 bytes, but we fail to
detect that.

Honza
> 
> > We are also better at picking the codegen choice with PGO since we
> > value-profile the size of the block.
> >
> > Inlining memcpy is a bigger win in situations where it prevents spilling
> > data from caller saved registers.  This makes it a bit hard to guess how
> > microbenchmarks relate to more real-world situations where the
> > surrounding code may need to hold data in SSE regs etc.
> > If we had a special entry-point to memcpy/memset that does not clobber
> > registers and does its own callee save, this problem would go away...
> >
> > Honza
> 
> 
> 
> -- 
> H.J.


RE: [PATCH] cobol: Allow for undefined NAME_MAX [PR119217]

2025-04-20 Thread Robert Dubner



> -Original Message-
> From: Sam James 
> Sent: Saturday, April 19, 2025 19:53
> To: Robert Dubner 
> Cc: Jakub Jelinek ; Rainer Orth ; Richard Biener ; Andreas Schwab
> ; gcc-patches@gcc.gnu.org; James K. Lowden
> 
> Subject: Re: [PATCH] cobol: Allow for undefined NAME_MAX [PR119217]
> 
> Robert Dubner  writes:
> 
> >> -Original Message-
> >> From: Jakub Jelinek 
> >> Sent: Friday, April 18, 2025 14:10
> >> To: Rainer Orth 
> >> Cc: Richard Biener ; Andreas Schwab
> >> ; gcc-patches@gcc.gnu.org; Robert Dubner
> >> ; James K. Lowden 
> >> Subject: Re: [PATCH] cobol: Allow for undefined NAME_MAX [PR119217]
> >>
> >> On Fri, Apr 18, 2025 at 06:04:29PM +0200, Rainer Orth wrote:
> >> > That's one option, but maybe it's better the other way round: instead of
> >> > excluding known-bad targets, restrict cobol to known-good ones
> >> > (i.e. x86_64-*-linux* and aarch64-*-linux*) instead.
> >> >
> >> > I've been using the following for this (should be retested for safety).
> >>
> >> I admit I don't really know what works and what doesn't out of the box now,
> >> but your patch looks reasonable to me for 15 branch.
> >>
> >> Richard, Robert and/or James, do you agree?
> >
> > I agree.  At the present time, I have access to only aarch64/x86_64-linux
> > machines, so those are the only ones I know work.  I seem to recall I
> > originally did it that way; only those configurations were white-listed.
> 
> I think you may be mistaken. In r15-7941-g45c281deb7a2e2, aarch64 and
> x86_64 were whitelisted as *architectures*, but the platform (including
> the kernel - Linux) wasn't specified. Rainer is reporting an issue with
> x86_64 Solaris.

I wouldn't be surprised.  I should have stopped after, "I agree."


> 
> thanks,
> sam


[PATCH] [RISC-V]Support -mcpu for Xuantie cpu

2025-04-20 Thread Yixuan Chen
gcc/ChangeLog:

* config/riscv/riscv-cores.def (RISCV_TUNE): Add xt-c908, xt-c908v, 
xt-c910, xt-c910v2, xt-c920, xt-c920v2.
(RISCV_CORE): Add xt-c908, xt-c908v, xt-c910, xt-c910v2, xt-c920,
xt-c920v2.
* config/riscv/riscv.cc: Add xt-c908, xt-c908v, xt-c910, xt-c910v2,
xt-c920, xt-c920v2.
* doc/invoke.texi: Add xt-c908, xt-c908v, xt-c910, xt-c910v2, xt-c920,
xt-c920v2.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/mcpu-xt-c908.c: New test for -mcpu=xt-c908.
* gcc.target/riscv/mcpu-xt-c910.c: New test for -mcpu=xt-c910.
* gcc.target/riscv/mcpu-xt-c920v2.c: New test for -mcpu=xt-c920v2.
* gcc.target/riscv/mcpu-xt-c908v.c: New test for -mcpu=xt-c908v.
* gcc.target/riscv/mcpu-xt-c910v2.c: New test for -mcpu=xt-c910v2.
* gcc.target/riscv/mcpu-xt-c920.c: New test for -mcpu=xt-c920.

Support -mcpu=xt-c908, xt-c908v, xt-c910, xt-c910v2, xt-c920 and
xt-c920v2 for the Xuantie series of CPUs.
Ref: https://www.xrvm.cn/community/download?id=4224248662731067392

Without fmv_cost, vector_unaligned_access, use_divmod_expansion and
overlap_op_by_pieces numbers, the tune info is filled with generic_ooo
for further modification.
---
 gcc/config/riscv/riscv-cores.def  | 48 
 gcc/doc/invoke.texi   |  7 ++-
 gcc/testsuite/gcc.target/riscv/mcpu-xt-c908.c | 48 
 .../gcc.target/riscv/mcpu-xt-c908v.c  | 50 +
 gcc/testsuite/gcc.target/riscv/mcpu-xt-c910.c | 35 
 .../gcc.target/riscv/mcpu-xt-c910v2.c | 51 +
 gcc/testsuite/gcc.target/riscv/mcpu-xt-c920.c | 34 +++
 .../gcc.target/riscv/mcpu-xt-c920v2.c | 56 +++
 8 files changed, 326 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/mcpu-xt-c908.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/mcpu-xt-c908v.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/mcpu-xt-c910.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/mcpu-xt-c910v2.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/mcpu-xt-c920.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/mcpu-xt-c920v2.c

diff --git a/gcc/config/riscv/riscv-cores.def b/gcc/config/riscv/riscv-cores.def
index 2918496bcd0..e31afc3fe70 100644
--- a/gcc/config/riscv/riscv-cores.def
+++ b/gcc/config/riscv/riscv-cores.def
@@ -41,6 +41,12 @@ RISCV_TUNE("sifive-p400-series", sifive_p400, sifive_p400_tune_info)
 RISCV_TUNE("sifive-p600-series", sifive_p600, sifive_p600_tune_info)
 RISCV_TUNE("tt-ascalon-d8", generic_ooo, tt_ascalon_d8_tune_info)
 RISCV_TUNE("thead-c906", generic, thead_c906_tune_info)
+RISCV_TUNE("xt-c908", generic, generic_ooo_tune_info)
+RISCV_TUNE("xt-c908v", generic, generic_ooo_tune_info)
+RISCV_TUNE("xt-c910", generic, generic_ooo_tune_info)
+RISCV_TUNE("xt-c910v2", generic, generic_ooo_tune_info)
+RISCV_TUNE("xt-c920", generic, generic_ooo_tune_info)
+RISCV_TUNE("xt-c920v2", generic, generic_ooo_tune_info)
 RISCV_TUNE("xiangshan-nanhu", xiangshan, xiangshan_nanhu_tune_info)
 RISCV_TUNE("generic-ooo", generic_ooo, generic_ooo_tune_info)
 RISCV_TUNE("size", generic, optimize_size_tune_info)
@@ -93,6 +99,48 @@ RISCV_CORE("thead-c906",  "rv64imafdc_xtheadba_xtheadbb_xtheadbs_xtheadcmo_"
  "xtheadmemidx_xtheadmempair_xtheadsync",
  "thead-c906")

+RISCV_CORE("xt-c908", "rv64imafdc_zicbom_zicbop_zicboz_zicntr_zicsr_"
+ "zifencei_zihintpause_zihpm_zfh_zba_zbb_zbc_zbs_"
+ "sstc_svinval_svnapot_svpbmt_xtheadba_xtheadbb_"
+ "xtheadbs_xtheadcmo_xtheadcondmov_xtheadfmemidx_"
+ "xtheadmac_xtheadmemidx_xtheadmempair_xtheadsync",
+ "xt-c908")
+RISCV_CORE("xt-c908v","rv64imafdcv_zicbom_zicbop_zicboz_zicntr_zicsr_"
+ "zifencei_zihintpause_zihpm_zfh_zba_zbb_zbc_zbs_"
+ "zvfh_sstc_svinval_svnapot_svpbmt__xtheadba_"
+ "xtheadbb_xtheadbs_xtheadcmo_xtheadcondmov_"
+ "xtheadfmemidx_xtheadmac_xtheadmemidx_"
+ "xtheadmempair_xtheadsync_xtheadvdot",
+ "xt-c908")
+RISCV_CORE("xt-c910", "rv64imafdc_zicntr_zicsr_zifencei_zihpm_zfh_"
+ "xtheadba_xtheadbb_xtheadbs_xtheadcmo_"
+ "xtheadcondmov_xtheadfmemidx_xtheadmac_"
+ "xtheadmemidx_xtheadmempair_xtheadsync",
+ "xt-c910")
+RISCV_CORE("xt-c910v2",   "rv64imafdc_zicbom_zicbop_zicboz_zicntr_zicond_"
+ "zicsr_zifencei _zihintntl_zihintpause_zihpm_"
+ "zawrs_zfa_zfbfmin_zfh_zca_zcb_zcd_zba_zbb_zbc_"
+ "zbs_sscofpmf_sstc_svinval_svnapot_svpbmt_"
+ "xtheadba_xtheadbb_xtheadbs_xtheadcmo_"
+   

Ping [PATCH/RFC] target, hooks: Allow a target to trap on unreachable [PR109267].

2025-04-20 Thread FX Coudert
Hi all,

I’d like to ping the patch from April 2024 at 
https://gcc.gnu.org/pipermail/gcc-patches/2024-May/651100.html

As far as I understand, the status of this is:
- Iain posted the patch for review
- Richard said "your target hook is reasonable though I'd name it
expand_unreachable_as_trap maybe"
- Andrew has an idea for handling this in middle-end, but no time to work on it
- meanwhile, this has gained one year of exposure on Iain’s branch, which is
used in all major macOS distributions of GCC; no issues have been reported

Andrew, do you still intend to work on it? It’d be great to fix this issue in
GCC, one way or another, to minimise the number of patches needed for modern
Darwin targets.

Best,
FX

Re: [PATCH] simplify-rtx: Fix shortcut for vector eq/ne

2025-04-20 Thread H.J. Lu
On Tue, Apr 1, 2025 at 8:17 PM Richard Sandiford wrote:
>
> This patch forestalls a regression in gcc.dg/rtl/x86_64/vector_eq.c
> with the patch for PR116398.  The test wants:
>
>   (cinsn 3 (set (reg:V4SI <0>) (const_vector:V4SI [(const_int 0) 
> (const_int 0) (const_int 0) (const_int 0)])))
>   (cinsn 5 (set (reg:V4SI <2>)
> (eq:V4SI (reg:V4SI <0>) (reg:V4SI <1>
>
> to be folded to a vector of -1s.  One unusual thing about the fold
> is that the <1> in the second insn is uninitialised; it looks like
> it should be replaced by <0>, or that there should be an insn 4 that
> copies <0> to <1>.
>
> As it stands, the test relies on init-regs to insert a zero
> initialisation of <1>.  This happens after all the cse/pre/fwprop
> stuff, with only dce passes between init-regs and combine.
> Combine therefore sees:
>
> (insn 3 2 8 2 (set (reg:V4SI 98)
> (const_vector:V4SI [
> (const_int 0 [0]) repeated x4
> ])) 2403 {movv4si_internal}
>  (nil))
> (insn 8 3 9 2 (clobber (reg:V4SI 99)) -1
>  (nil))
> (insn 9 8 5 2 (set (reg:V4SI 99)
> (const_vector:V4SI [
> (const_int 0 [0]) repeated x4
> ])) -1
>  (nil))
> (insn 5 9 7 2 (set (reg:V4SI 100)
> (eq:V4SI (reg:V4SI 98)
> (reg:V4SI 99))) 7874 {*sse2_eqv4si3}
>  (expr_list:REG_DEAD (reg:V4SI 99)
> (expr_list:REG_DEAD (reg:V4SI 98)
> (expr_list:REG_EQUAL (eq:V4SI (const_vector:V4SI [
> (const_int 0 [0]) repeated x4
> ])
> (reg:V4SI 99))
> (nil)
>
> It looks like the test should then pass through a 3, 9 -> 5 combination,
> so that we get an (eq ...) between two zeros and fold it to a vector
> of -1s.  But although the combination is attempted, the fold doesn't
> happen.  Instead, combine is left to match the unsimplified (eq ...)
> between two zeros, which rightly fails.  The test only passes because
> late_combine2 happens to try simplifying an (eq ...) between reg X and
> reg X, which does fold to a vector of -1s.
>
> The different handling of registers and constants is due to this
> code in simplify_const_relational_operation:
>
>   if (INTEGRAL_MODE_P (mode) && trueop1 != const0_rtx
>   && (code == EQ || code == NE)
>   && ! ((REG_P (op0) || CONST_INT_P (trueop0))
> && (REG_P (op1) || CONST_INT_P (trueop1)))
>   && (tem = simplify_binary_operation (MINUS, mode, op0, op1)) != 0
>   /* We cannot do this if tem is a nonzero address.  */
>   && ! nonzero_address_p (tem))
> return simplify_const_relational_operation (signed_condition (code),
> mode, tem, const0_rtx);
>
> INTEGRAL_MODE_P matches vector integer modes, but everything else
> about the condition is written for scalar integers only.  Thus if
> trueop0 and trueop1 are equal vector constants, we'll bypass all
> the exclusions and try simplifying a subtraction.  This will succeed,
> giving a vector of zeros.  The recursive call will then try to simplify
> a comparison between the vector of zeros and const0_rtx, which isn't
> well-formed.  Luckily or unluckily, the ill-formedness doesn't trigger
> an ICE, but it does prevent any simplification from happening.
>
> The least-effort fix would be to replace INTEGRAL_MODE_P with
> SCALAR_INT_MODE_P.  But the fold does make conceptual sense for
> vectors too, so it seemed better to keep the INTEGRAL_MODE_P and
> generalise the rest of the condition to match.
>
> Tested on aarch64-linux-gnu & x86_64-linux-gnu.  OK to install?
>
> I'm hoping to post the actual patch for PR116398 later today.
>
> Richard
>
>
> gcc/
> * simplify-rtx.cc (simplify_const_relational_operation): Generalize
> the constant checks in the fold-via-minus path to match the
> INTEGRAL_MODE_P condition.
> ---
>  gcc/simplify-rtx.cc | 13 +
>  1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
> index fe007bc7d96..6f969effdf9 100644
> --- a/gcc/simplify-rtx.cc
> +++ b/gcc/simplify-rtx.cc
> @@ -6657,15 +6657,20 @@ simplify_const_relational_operation (enum rtx_code code,
>   we do not know the signedness of the operation on either the left or
>   the right hand side of the comparison.  */
>
> -  if (INTEGRAL_MODE_P (mode) && trueop1 != const0_rtx
> +  if (INTEGRAL_MODE_P (mode)
> +  && trueop1 != CONST0_RTX (mode)
>&& (code == EQ || code == NE)
> -  && ! ((REG_P (op0) || CONST_INT_P (trueop0))
> -   && (REG_P (op1) || CONST_INT_P (trueop1)))
> +  && ! ((REG_P (op0)
> +|| CONST_SCALAR_INT_P (trueop0)
> +|| CONST_VECTOR_P (trueop0))
> +   && (REG_P (op1)
> +   || CONST_SCALAR_INT_P (trueop1)
> +   || CONST_VECTOR_P (trueop1)))
>&& (tem = simplify_binary_operation (MINUS, mode, op0, op1)) != 0