Re: [PATCH v2] rs6000: Change bitwise xor to an equality operator [PR106907]

2023-10-16 Thread Kewen.Lin
Hi,

on 2023/10/11 19:50, jeevitha wrote:
> Hi All,
> 
> The following patch has been bootstrapped and regtested on powerpc64le-linux.
> 
> PR106907 has a few warnings spotted by cppcheck, related to the need
> for precedence clarification.  Instead of using xor, an equality
> check is used, which achieves the same result (see the sketch below).
> Additionally, comment indentation has been fixed.
> 
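As a quick equivalence sketch (mine, not part of the patch or review):
for the 0/1 values involved, `swapped ^ !BYTES_BIG_ENDIAN' and
`swapped == BYTES_BIG_ENDIAN' always agree, as a small C check confirms:

  /* Truth-table check that the rewrite preserves semantics for 0/1
     operands; `be' stands in for BYTES_BIG_ENDIAN.  */
  #include <assert.h>

  int
  main (void)
  {
    for (int swapped = 0; swapped <= 1; swapped++)
      for (int be = 0; be <= 1; be++)
        assert ((swapped ^ !be) == (swapped == be));
    return 0;
  }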

Ok for trunk, thanks!

BR,
Kewen

> 2023-10-11  Jeevitha Palanisamy  
> 
> gcc/
>   PR target/106907
>   * config/rs6000/rs6000.cc (altivec_expand_vec_perm_const): Change
>   bitwise xor to an equality and fix comment indentation.
> 
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 2828f01413c..00191f8656b 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -23624,10 +23624,10 @@ altivec_expand_vec_perm_const (rtx target, rtx op0, 
> rtx op1,
> && GET_MODE (XEXP (op0, 0)) != V8HImode)))
>   continue;
>  
> -  /* For little-endian, the two input operands must be swapped
> - (or swapped back) to ensure proper right-to-left numbering
> - from 0 to 2N-1.  */
> -   if (swapped ^ !BYTES_BIG_ENDIAN
> +   /* For little-endian, the two input operands must be swapped
> +  (or swapped back) to ensure proper right-to-left numbering
> +  from 0 to 2N-1.  */
> +   if (swapped == BYTES_BIG_ENDIAN
> && icode != CODE_FOR_vsx_xxpermdi_v16qi)
>   std::swap (op0, op1);
> if (imode != V16QImode)
> 
>


Re: [PATCH-2, rs6000] Enable vector mode for memory equality compare [PR111449]

2023-10-16 Thread Kewen.Lin
Hi,

on 2023/10/10 16:18, HAO CHEN GUI wrote:
> Hi David,
> 
>   Thanks for your review comments.
> 
> 在 2023/10/9 23:42, David Edelsohn 写道:
>>  #define MOVE_MAX (! TARGET_POWERPC64 ? 4 : 8)
>>  #define MAX_MOVE_MAX 8
>> +#define MOVE_MAX_PIECES (!TARGET_POWERPC64 ? 4 : 16)
>> +#define COMPARE_MAX_PIECES (!TARGET_POWERPC64 ? 4 : 16)
>>
>>
>> How are the definitions of MOVE_MAX_PIECES and COMPARE_MAX_PIECES 
>> determined?  The email does not provide any explanation for the 
>> implementation.  The rest of the patch is related to vector support, but 
>> vector support is not dependent on TARGET_POWERPC64.
> 
> By default, MOVE_MAX_PIECES and COMPARE_MAX_PIECES are set to the same
> value as MOVE_MAX.  Move and compare instructions are required by
> compare_by_pieces, and those macros are set to 16 bytes when vector
> mode (V16QImode) is supported.  The problem is that rs6000 doesn't
> support TImode for "-m32".  We discussed it in issue 1307.  TImode
> will be used for moves when MOVE_MAX_PIECES is set to 16, but TImode
> isn't supported with "-m32", which might cause an ICE.

I think David raised a good question.  It sounds to me that the current
handling simply assumes that if MOVE_MAX_PIECES is set to 16, the
required operations for this optimization on TImode are always
available; unfortunately, on rs6000 that assumption doesn't hold, so
could we teach the generic code instead?

BR,
Kewen


[PATCH] vect: Cost adjacent vector loads/stores together [PR111784]

2023-10-17 Thread Kewen.Lin
Hi,

As per the comments [1][2], this patch changes the way some adjacent
vector loads/stores are costed, from costing them one by one to
costing them together with the total number at once.

It helps to fix the regression PR111784 exposed on aarch64, as the
aarch64-specific costing can make different decisions depending on the
costing approach (counting with the total number vs. counting one by
one).  Based on a reduced test case from PR111784, considering only
vec_num already fixes the regression, but vector loads/stores with
regard to ncopies are also adjacent accesses, so they are handled
as well.
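To illustrate why the counting granularity can matter (a hypothetical
cost model of mine, not the actual aarch64 hook): a hook that discounts
pipelined adjacent accesses can only apply the discount when it sees
the whole group at once.

  /* Hypothetical model: the first access pays full latency, each
     further adjacent access is discounted.  */
  static unsigned
  vector_access_cost (unsigned count)
  {
    return count == 0 ? 0 : 4 + (count - 1) * 1;
  }

Costing 4 adjacent loads together gives vector_access_cost (4) == 7,
while costing them one by one gives 4 * vector_access_cost (1) == 16,
so the two approaches can drive different vectorization decisions.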

Btw, this patch leaves the costing of dr_explicit_realign and
dr_explicit_realign_optimized alone to keep things simple.  The
costing change could cause differences for them, since one of their
costs depends on targetm.vectorize.builtin_mask_for_load and is
costed according to the number of calls.  IIUC, these two
dr_alignment_support schemes are mainly used for old Power, I think
(having only 16-byte aligned vector load/store but no unaligned
vector load/store).

Bootstrapped and regtested on x86_64-redhat-linux,
aarch64-linux-gnu, powerpc64-linux-gnu P{7,8,9}
and powerpc64le-linux-gnu P{8,9,10}.

Is it ok for trunk?

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630742.html
[2] https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630744.html

BR,
Kewen
-
gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_store): Adjust costing way for
adjacent vector stores, by costing them with the total number
rather than costing them one by one.
(vectorizable_load): Adjust costing way for adjacent vector
loads, by costing them with the total number rather than costing
them one by one.
---
 gcc/tree-vect-stmts.cc | 137 -
 1 file changed, 95 insertions(+), 42 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index b3a56498595..af134ff2bf7 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -8681,6 +8681,9 @@ vectorizable_store (vec_info *vinfo,
   alias_off = build_int_cst (ref_type, 0);
   stmt_vec_info next_stmt_info = first_stmt_info;
   auto_vec vec_oprnds (ncopies);
+  /* For costing some adjacent vector stores, we'd like to cost with
+     the total number of them once instead of costing each one by one.  */
+  unsigned int n_adjacent_stores = 0;
   for (g = 0; g < group_size; g++)
{
  running_off = offvar;
@@ -8738,10 +8741,7 @@ vectorizable_store (vec_info *vinfo,
 store to avoid ICE like 110776.  */
  if (VECTOR_TYPE_P (ltype)
  && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
-   vect_get_store_cost (vinfo, stmt_info, 1,
-alignment_support_scheme,
-misalignment, &inside_cost,
-cost_vec);
+   n_adjacent_stores++;
  else
inside_cost
  += record_stmt_cost (cost_vec, 1, scalar_store,
@@ -8798,11 +8798,18 @@ vectorizable_store (vec_info *vinfo,
break;
}

-  if (costing_p && dump_enabled_p ())
-   dump_printf_loc (MSG_NOTE, vect_location,
-"vect_model_store_cost: inside_cost = %d, "
-"prologue_cost = %d .\n",
-inside_cost, prologue_cost);
+  if (costing_p)
+   {
+ if (n_adjacent_stores > 0)
+   vect_get_store_cost (vinfo, stmt_info, n_adjacent_stores,
+alignment_support_scheme, misalignment,
+&inside_cost, cost_vec);
+ if (dump_enabled_p ())
+   dump_printf_loc (MSG_NOTE, vect_location,
+"vect_model_store_cost: inside_cost = %d, "
+"prologue_cost = %d .\n",
+inside_cost, prologue_cost);
+   }

   return true;
 }
@@ -8909,6 +8916,9 @@ vectorizable_store (vec_info *vinfo,
 {
   gcc_assert (!slp && grouped_store);
   unsigned inside_cost = 0, prologue_cost = 0;
+  /* For costing some adjacent vector stores, we'd like to cost with
+     the total number of them once instead of costing each one by one.  */
+  unsigned int n_adjacent_stores = 0;
   for (j = 0; j < ncopies; j++)
{
  gimple *new_stmt;
@@ -8974,10 +8984,7 @@ vectorizable_store (vec_info *vinfo,

  if (costing_p)
{
- for (i = 0; i < vec_num; i++)
-   vect_get_store_cost (vinfo, stmt_info, 1,
-alignment_support_scheme, misalignment,
-&inside_cost, cost_vec);
+ n_adjacent_stores += vec_num;
  continue;
}

@@ -906

Re: [PATCH] vect: Cost adjacent vector loads/stores together [PR111784]

2023-10-22 Thread Kewen.Lin
Hi Richard,

on 2023/10/20 06:12, Richard Sandiford wrote:
> "Kewen.Lin"  writes:
>> Hi,
>>
>> As per the comments [1][2], this patch changes the way some adjacent
>> vector loads/stores are costed, from costing them one by one to
>> costing them together with the total number at once.
>>
>> It helps to fix the regression PR111784 exposed on aarch64, as the
>> aarch64-specific costing can make different decisions depending on
>> the costing approach (counting with the total number vs. counting
>> one by one).  Based on a reduced test case from PR111784, considering
>> only vec_num already fixes the regression, but vector loads/stores
>> with regard to ncopies are also adjacent accesses, so they are
>> handled as well.
>>
>> Btw, this patch leaves the costing of dr_explicit_realign and
>> dr_explicit_realign_optimized alone to keep things simple.  The
>> costing change could cause differences for them, since one of their
>> costs depends on targetm.vectorize.builtin_mask_for_load and is
>> costed according to the number of calls.  IIUC, these two
>> dr_alignment_support schemes are mainly used for old Power, I think
>> (having only 16-byte aligned vector load/store but no unaligned
>> vector load/store).
>>
>> Bootstrapped and regtested on x86_64-redhat-linux,
>> aarch64-linux-gnu, powerpc64-linux-gnu P{7,8,9}
>> and powerpc64le-linux-gnu P{8,9,10}.
>>
>> Is it ok for trunk?
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630742.html
>> [2] https://gcc.gnu.org/pipermail/gcc-patches/2023-September/630744.html
>>
>> BR,
>> Kewen
>> -
>> gcc/ChangeLog:
>>
>>  * tree-vect-stmts.cc (vectorizable_store): Adjust costing way for
>>  adjacent vector stores, by costing them with the total number
>>  rather than costing them one by one.
>>  (vectorizable_load): Adjust costing way for adjacent vector
>>  loads, by costing them with the total number rather than costing
>>  them one by one.
> 
> OK.  Thanks for doing this!  Like Richard says, the way that the aarch64
> cost hooks rely on the count isn't super robust, but I think it's the
> best we can easily do in the circumstances.  Hopefully costing will
> become much easier once the non-SLP representation goes away.

Looking forward to that, thanks for the review!  Committed in r14-4842.

BR,
Kewen


Re: [PATCH 1/3]rs6000: update num_insns_constant for 2 insns

2023-10-24 Thread Kewen.Lin
Hi,

on 2023/10/25 10:00, Jiufu Guo wrote:
> Hi,
> 
> Trunk GCC supports building more constants via two instructions, e.g.
> "li/lis; xori/xoris/rldicl/rldicr/rldic".
> num_insns_constant should then be updated accordingly.
> 
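As a concrete illustration (my example, not taken from the patch):
0x7fffffffffffffff is such a constant, buildable as "li 3,-1" followed
by "rldicl 3,3,0,1"; the equivalent bit manipulation in C:

  #include <assert.h>
  #include <stdint.h>

  int
  main (void)
  {
    uint64_t r = (uint64_t) -1;   /* li 3,-1: all ones.  */
    r &= UINT64_MAX >> 1;         /* rldicl 3,3,0,1: keep bits 1..63.  */
    assert (r == 0x7fffffffffffffffULL);
    return 0;
  }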

Thanks for updating this.

> Bootstrap & regtest pass on ppc64{,le}.
> Is this ok for trunk?
> 
> BR,
> Jeff (Jiufu Guo)
> 
> gcc/ChangeLog:
> 
>   * config/rs6000/rs6000.cc (can_be_built_by_lilis_and_rldicX): New
>   function.
>   (num_insns_constant_gpr): Update to return 2 for more cases.
>   (rs6000_emit_set_long_const): Update to use
>   can_be_built_by_lilis_and_rldicX.
> 
> ---
>  gcc/config/rs6000/rs6000.cc | 64 -
>  1 file changed, 41 insertions(+), 23 deletions(-)
> 
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index cc24dd5301e..b23ff3d7917 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -6032,6 +6032,9 @@ direct_return (void)
>return 0;
>  }
>  
> +static bool
> +can_be_built_by_lilis_and_rldicX (HOST_WIDE_INT, int *, HOST_WIDE_INT *);
> +
>  /* Helper for num_insns_constant.  Calculate number of instructions to
> load VALUE to a single gpr using combinations of addi, addis, ori,
> oris, sldi and rldimi instructions.  */
> @@ -6044,35 +6047,41 @@ num_insns_constant_gpr (HOST_WIDE_INT value)
>  return 1;
>  
>/* constant loadable with addis */
> -  else if ((value & 0xffff) == 0
> -&& (value >> 31 == -1 || value >> 31 == 0))
> +  if ((value & 0xffff) == 0 && (value >> 31 == -1 || value >> 31 == 0))
>  return 1;
>  
>/* PADDI can support up to 34 bit signed integers.  */
> -  else if (TARGET_PREFIXED && SIGNED_INTEGER_34BIT_P (value))
> +  if (TARGET_PREFIXED && SIGNED_INTEGER_34BIT_P (value))
>  return 1;
>  
> -  else if (TARGET_POWERPC64)
> -{
> -  HOST_WIDE_INT low = sext_hwi (value, 32);
> -  HOST_WIDE_INT high = value >> 31;
> +  if (!TARGET_POWERPC64)
> +return 2;
>  
> -  if (high == 0 || high == -1)
> - return 2;
> +  HOST_WIDE_INT low = sext_hwi (value, 32);
> +  HOST_WIDE_INT high = value >> 31;
>  
> -  high >>= 1;
> +  if (high == 0 || high == -1)
> +return 2;
>  
> -  if (low == 0 || low == high)
> - return num_insns_constant_gpr (high) + 1;
> -  else if (high == 0)
> - return num_insns_constant_gpr (low) + 1;
> -  else
> - return (num_insns_constant_gpr (high)
> - + num_insns_constant_gpr (low) + 1);
> -}
> +  high >>= 1;
>  
> -  else
> +  HOST_WIDE_INT ud2 = (low >> 16) & 0xffff;
> +  HOST_WIDE_INT ud1 = low & 0xffff;
> +  if (high == -1 && ((!(ud2 & 0x8000) && ud1 == 0) || (ud1 & 0x8000)))
> +return 2;
> +  if (high == 0 && (ud1 == 0 || (!(ud1 & 0x8000
>  return 2;

I was thinking that instead of enumerating all the cases from
rs6000_emit_set_long_const here, we could add an optional argument
like "int *num_insns = nullptr" to function rs6000_emit_set_long_const;
when it's not nullptr, emit nothing but update the count instead.
That helps people remember to update num_insns when updating
rs6000_emit_set_long_const in the future, and it also makes it
clearer where the number comes from.
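A minimal sketch of the suggested shape (my illustration; the
emit-or-count helper is hypothetical, not existing code):

  /* If NUM_INSNS is non-null, only count the insns that would be
     emitted instead of emitting them.  */
  static void
  rs6000_emit_set_long_const (rtx dest, HOST_WIDE_INT c,
                              int *num_insns = nullptr)
  {
    auto count_or_emit = [&] (rtx pat) {
      if (num_insns)
        (*num_insns)++;
      else
        emit_insn (pat);
    };
    /* ... every place that previously called emit_insn directly would
       go through count_or_emit, so the insn count can never drift
       from the emission logic.  */
  }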

Does it sound good to you?

BR,
Kewen

> +
> +  int shift;
> +  HOST_WIDE_INT mask;
> +  if (can_be_built_by_lilis_and_rldicX (value, &shift, &mask))
> +return 2;
> +
> +  if (low == 0 || low == high)
> +return num_insns_constant_gpr (high) + 1;
> +  if (high == 0)
> +return num_insns_constant_gpr (low) + 1;
> +  return (num_insns_constant_gpr (high) + num_insns_constant_gpr (low) + 1);
>  }
>  
>  /* Helper for num_insns_constant.  Allow constants formed by the
> @@ -10492,6 +10501,18 @@ can_be_built_by_li_and_rldic (HOST_WIDE_INT c, int 
> *shift, HOST_WIDE_INT *mask)
>return false;
>  }
>  
> +/* Combine the above checking functions for  li/lis;rldicX. */
> +
> +static bool
> +can_be_built_by_lilis_and_rldicX (HOST_WIDE_INT c, int *shift,
> +   HOST_WIDE_INT *mask)
> +{
> +  return (can_be_built_by_li_lis_and_rotldi (c, shift, mask)
> +   || can_be_built_by_li_lis_and_rldicl (c, shift, mask)
> +   || can_be_built_by_li_lis_and_rldicr (c, shift, mask)
> +   || can_be_built_by_li_and_rldic (c, shift, mask));
> +}
> +
>  /* Subroutine of rs6000_emit_set_const, handling PowerPC64 DImode.
> Output insns to set DEST equal to the constant C as a series of
> lis, ori and shl instructions.  */
> @@ -10538,10 +10559,7 @@ rs6000_emit_set_long_const (rtx dest, HOST_WIDE_INT 
> c)
>   emit_move_insn (dest, gen_rtx_XOR (DImode, temp,
>   GEN_INT ((ud2 ^ 0xffff) << 16)));
>  }
> -  else if (can_be_built_by_li_lis_and_rotldi (c, &shift, &mask)
> -|| can_be_built_by_li_lis_and_rldicl (c, &shift, &mask)
> -|| can_be_built_by_li_lis_and_rldicr (c, &shift, &mask)
> -|| can_be_built_by_li_and_rl

Re: [PATCH 2/3]rs6000: using 'pli' to load 34bit-constant

2023-10-24 Thread Kewen.Lin
on 2023/10/25 10:00, Jiufu Guo wrote:
> Hi,
> 
> For constants with 16-bit values, 'li' or 'lis' can be used to
> generate the value.  For a 34-bit constant, 'pli' can generate the
> value.
> 
> Bootstrap & regtest pass on ppc64{,le}.
> Is this ok for trunk?
> 
> BR,
> Jeff (Jiufu Guo)
> 
> gcc/ChangeLog:
> 
>   * config/rs6000/rs6000.cc (rs6000_emit_set_long_const): Add code to use
>   pli for 34bit constant.
> 
> ---
>  gcc/config/rs6000/rs6000.cc | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index b23ff3d7917..4690384cdbe 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -10530,7 +10530,11 @@ rs6000_emit_set_long_const (rtx dest, HOST_WIDE_INT 
> c)
>   ud3 = (c >> 32) & 0xffff;
>   ud4 = (c >> 48) & 0xffff;
> 
> -  if ((ud4 == 0xffff && ud3 == 0xffff && ud2 == 0xffff && (ud1 & 0x8000))
> +  if (TARGET_PREFIXED && SIGNED_INTEGER_34BIT_P (c))
> +{
> +  emit_move_insn (dest, GEN_INT (c));
> +}

Nit: unexpected formatting, no {} needed.

Is there any test case justifying this change?  I think a single "li"
or "lis" beats "pli", since the latter is a prefixed insn and puts
more burden on insn decoding.

BR,
Kewen

> +  else if ((ud4 == 0xffff && ud3 == 0xffff && ud2 == 0xffff && (ud1 & 0x8000))
>|| (ud4 == 0 && ud3 == 0 && ud2 == 0 && ! (ud1 & 0x8000)))
>  emit_move_insn (dest, GEN_INT (sext_hwi (ud1, 16)));
> 




Re: [PATCH 3/3]rs6000: split complicate constant to constant pool

2023-10-24 Thread Kewen.Lin
Hi,

on 2023/10/25 10:00, Jiufu Guo wrote:
> Hi,
> 
> Sometimes, a complicated constant is built via 3 (or more)
> instructions.  Generally speaking, that would not be as fast as
> loading it from the constant pool (as per a few discussions in
> PR63281).

I may have missed some previous discussions, but I'm curious why we
chose ">=3" here, as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281#c9
indicates that more than 3 (>3) should be considered
with this change.

> 
> For the concern that I raised in:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-August/599676.html
> the micro-cases would not be the major concern, because as
> Segher explained in:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281#c18
> it would just be about the benchmarking method.
> 
> As tested on SPEC2017, for visible performance changes, we find a
> runtime improvement of about 1.8% (-O2) on 500.perlbench_r when
> loading complicated constants from the constant pool is supported,
> and no visible performance regression on the other benchmarks.

The improvement on 500.perlbench_r looks to match what PR63281
mentioned, nice!  I'm curious which options and which kinds of CPUs
you have tested with.  Since this is a general change, I'd expect we
can test with P8/P9/P10 at O2/O3 (or Ofast) at
least.

BR,
Kewen

> 
> Bootstrap & regtest pass on ppc64{,le}.
> Is this ok for trunk?
> 
> BR,
> Jeff (Jiufu Guo)
> 
>   PR target/63281
> 
> gcc/ChangeLog:
> 
>   * config/rs6000/rs6000.cc (rs6000_emit_set_const): Update to split
>   complicate constant to memory.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.target/powerpc/const_anchors.c: Update to test final-rtl. 
>   * gcc.target/powerpc/parall_5insn_const.c: Update to keep original test
>   point.
>   * gcc.target/powerpc/pr106550.c: Likewise.
>   * gcc.target/powerpc/pr106550_1.c: Likewise.
>   * gcc.target/powerpc/pr87870.c: Update according to latest behavior.
>   * gcc.target/powerpc/pr93012.c: Likewise.
> 
> ---
>  gcc/config/rs6000/rs6000.cc | 16 
>  .../gcc.target/powerpc/const_anchors.c  |  5 ++---
>  .../gcc.target/powerpc/parall_5insn_const.c | 14 --
>  gcc/testsuite/gcc.target/powerpc/pr106550.c | 17 +++--
>  gcc/testsuite/gcc.target/powerpc/pr106550_1.c   | 15 +--
>  gcc/testsuite/gcc.target/powerpc/pr87870.c  |  5 -
>  gcc/testsuite/gcc.target/powerpc/pr93012.c  |  4 +++-
>  7 files changed, 65 insertions(+), 11 deletions(-)
> 
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 4690384cdbe..b9562f1ea0f 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -10292,6 +10292,22 @@ rs6000_emit_set_const (rtx dest, rtx source)
> c = sext_hwi (c, 32);
> emit_move_insn (lo, GEN_INT (c));
>   }
> +
> +  /* If it can be stored to the constant pool and profitable.  */
> +  else if (base_reg_operand (dest, mode)
> +&& num_insns_constant (source, mode) > 2)
> + {
> +   rtx sym = force_const_mem (mode, source);
> +   if (TARGET_TOC && SYMBOL_REF_P (XEXP (sym, 0))
> +   && use_toc_relative_ref (XEXP (sym, 0), mode))
> + {
> +   rtx toc = create_TOC_reference (XEXP (sym, 0), copy_rtx (dest));
> +   sym = gen_const_mem (mode, toc);
> +   set_mem_alias_set (sym, get_TOC_alias_set ());
> + }
> +
> +   emit_insn (gen_rtx_SET (dest, sym));
> + }
>else
>   rs6000_emit_set_long_const (dest, c);
>break;
> diff --git a/gcc/testsuite/gcc.target/powerpc/const_anchors.c 
> b/gcc/testsuite/gcc.target/powerpc/const_anchors.c
> index 542e2674b12..188744165f2 100644
> --- a/gcc/testsuite/gcc.target/powerpc/const_anchors.c
> +++ b/gcc/testsuite/gcc.target/powerpc/const_anchors.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile { target has_arch_ppc64 } } */
> -/* { dg-options "-O2" } */
> +/* { dg-options "-O2 -fdump-rtl-final" } */
>  
>  #define C1 0x2351847027482577ULL
>  #define C2 0x2351847027482578ULL
> @@ -16,5 +16,4 @@ void __attribute__ ((noinline)) foo1 (long long *a, long 
> long b)
>if (b)
>  *a++ = C2;
>  }
> -
> -/* { dg-final { scan-assembler-times {\maddi\M} 2 } } */
> +/* { dg-final { scan-rtl-dump-times {\madddi3\M} 2 "final" } } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/parall_5insn_const.c 
> b/gcc/testsuite/gcc.target/powerpc/parall_5insn_const.c
> index e3a9a7264cf..df0690b90be 100644
> --- a/gcc/testsuite/gcc.target/powerpc/parall_5insn_const.c
> +++ b/gcc/testsuite/gcc.target/powerpc/parall_5insn_const.c
> @@ -9,8 +9,18 @@
>  void __attribute__ ((noinline)) foo (unsigned long long *a)
>  {
>/* 2 lis + 2 ori + 1 rldimi for each constant.  */
> -  *a++ = 0x800aabcdc167fa16ULL;
> -  *a++ = 0x7543a876867f616ULL;
> +  {
> +register long long d asm("r0") = 0x800aabcdc167fa16ULL;
> +long long n;
> +asm("mr %0, %1" : "=r"(n) : "r"(d));
> +*a++ = n

[PATCH v3] sched: Change no_real_insns_p to no_real_nondebug_insns_p [PR108273]

2023-10-24 Thread Kewen.Lin
Hi,

This is almost a repost of v2, which was posted at [1] in March,
except for:
  1) rebased on r14-4810, which is relatively up-to-date;
     some conflicts on the "int to bool" return type change
     have been resolved;
  2) adjusted the commit log a bit;
  3) fixed the misspelled "articial" as "artificial" somewhere.

--
*v2 comments*:

By addressing Alexander's comments, against v1 this
patch v2 mainly:

  - Rename no_real_insns_p to no_real_nondebug_insns_p;
  - Introduce enum rgn_bb_deps_free_action for three
kinds of actions to free deps;
  - Change function free_deps_for_bb_no_real_insns_p to
resolve_forw_deps which only focuses on forward deps;
  - Extend the handlings to cover dbg-cnt sched_block,
add one test case for it;
  - Move free_trg_info call in schedule_region to an
appropriate place.

One thing I'm not sure about is the change in function
sched_rgn_local_finish: currently the invocation of
sched_rgn_local_free is guarded with !sel_sched_p (), so I just
followed that, but the initialization of those structures (in
sched_rgn_local_init) isn't guarded with !sel_sched_p (), which
looks odd.

--

As PR108273 shows, when there is a block which has only NOTE_P and
LABEL_P insns in non-debug mode but some extra DEBUG_INSN_P insns
in debug mode, the DFA states after scheduling it differ between
debug and non-debug mode.  In non-debug mode, the block meets
no_real_insns_p and gets skipped; while in debug mode it gets
scheduled, and even though it only has NOTE_P, LABEL_P and
DEBUG_INSN_P insns, the call to function advance_one_cycle
changes the DFA state.  PR108519 also shows this issue
can be exposed by some scheduler changes.

This patch changes function no_real_insns_p into function
no_real_nondebug_insns_p by taking debug insns into account,
which makes us not try to schedule blocks having only NOTE_P,
LABEL_P and DEBUG_INSN_P insns, resulting in consistent DFA
states between non-debug and debug mode.
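A sketch of the new predicate's intent (based on the description
above, not necessarily the exact committed code):

  /* Return true if the block from HEAD to TAIL contains nothing but
     notes, labels and debug insns, so that debug and non-debug
     compilations skip exactly the same blocks.  */
  static bool
  no_real_nondebug_insns_p (const rtx_insn *head, const rtx_insn *tail)
  {
    while (head != NEXT_INSN (tail))
      {
        if (!(NOTE_P (head) || LABEL_P (head) || DEBUG_INSN_P (head)))
          return false;
        head = NEXT_INSN (head);
      }
    return true;
  }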

Changing no_real_insns_p to no_real_nondebug_insns_p caused an ICE
when doing free_block_dependencies; the root cause is that we create
dependencies for debug insns, and those dependencies are expected
to be resolved during insn scheduling but get skipped after this
change.  By checking the code, it looks reasonable to skip computing
block dependences for no_real_nondebug_insns_p blocks.  Another
issue, exposed when building SPEC2017 benchmarks at -O2 -g, is
that we could skip scheduling some block which already has its
dependency graph built, so its dependencies are computed and
rgn_n_insns accumulated; the later verification that the graph
becomes exhausted by scheduling would then fail as follows:

  /* Sanity check: verify that all region insns were scheduled.  */
  gcc_assert (sched_rgn_n_insns == rgn_n_insns);

, and also some forward deps aren't resolved.

As Alexander pointed out, the current debug count handling also
suffers from a similar issue, so this patch handles the two cases
together: one is for blocks that get skipped by !dbg_cnt
(sched_block), the other is for blocks which are not
no_real_nondebug_insns_p initially but become
no_real_nondebug_insns_p due to speculative scheduling.

This patch can be bootstrapped and regress-tested on
x86_64-redhat-linux, aarch64-linux-gnu and
powerpc64{,le}-linux-gnu.

I also verified this patch can pass SPEC2017 both intrate
and fprate bmks building at -g -O2/-O3.

Any thoughts?  Is it ok for trunk?

[1] v2: https://gcc.gnu.org/pipermail/gcc-patches/2023-March/614818.html
[2] v1: https://gcc.gnu.org/pipermail/gcc-patches/2023-March/614224.html

BR,
Kewen
-
PR rtl-optimization/108273

gcc/ChangeLog:

* haifa-sched.cc (no_real_insns_p): Rename to ...
(no_real_nondebug_insns_p): ... this, and consider DEBUG_INSN_P insn.
* sched-ebb.cc (schedule_ebb): Replace no_real_insns_p with
no_real_nondebug_insns_p.
* sched-int.h (no_real_insns_p): Rename to ...
(no_real_nondebug_insns_p): ... this.
* sched-rgn.cc (enum rgn_bb_deps_free_action): New enum.
(bb_deps_free_actions): New static variable.
(compute_block_dependences): Skip for no_real_nondebug_insns_p.
(resolve_forw_deps): New function.
(free_block_dependencies): Check bb_deps_free_actions and call
function resolve_forw_deps for RGN_BB_DEPS_FREE_ARTIFICIAL.
(compute_priorities): Replace no_real_insns_p with
no_real_nondebug_insns_p.
(schedule_region): Replace no_real_insns_p with
no_real_nondebug_insns_p, set RGN_BB_DEPS_FREE_ARTIFICIAL if the block
get dependencies computed before but skipped now, fix up count
sched_rgn_n_insns for it too.  Call free_trg_info when the block
gets scheduled, and move sched_rgn_local_finish after the loop
of free_block_dependencies loop.
(sched_rgn_local_init): Allocate and compute bb_deps_free_actions.
(sched_rgn_local_fini

PING^5 [PATCH 0/9] rs6000: Rework rs6000_emit_vector_compare

2023-10-24 Thread Kewen.Lin
Hi,

Gentle ping this series:

https://gcc.gnu.org/pipermail/gcc-patches/2022-November/607146.html

BR,
Kewen

 on 2022/11/24 17:15, Kewen Lin wrote:
> Hi,
>
> Following Segher's suggestion, this patch series is to rework
> function rs6000_emit_vector_compare for vector float and int
> in multiple steps, it's based on the previous attempts [1][2].
> As mentioned in [1], the need to rework this for float is to
> make a centralized place for vector float comparison handlings
> instead of supporting with swapping ops and reversing code etc.
> dispersedly.  It's also for a subsequent patch to handle
> comparison operators with or without trapping math (PR105480).
> With the handling on vector float reworked, we can further make
> the handling on vector int simplified as shown.
>
> For Segher's concern about whether this rework causes any
> assembly change, I constructed two testcases for vector float [3]
> and int [4] respectively before; they showed most cases are fine,
> except for the difference on LE and UNGT, which is a demonstrated
> improvement since it uses GE instead of GT ior EQ.  The
> associated test case in patch 3/9 is a good example.
>
> Besides, w/ and w/o the whole patch series, I built the whole
> SPEC2017 at options -O3 and -Ofast separately, checked the
> differences on object assembly.  The result showed that the
> most are unchanged, except for:
>
>   * at -O3, 521.wrf_r has 9 object files and 526.blender_r has
> 9 object files with differences.
>
>   * at -Ofast, 521.wrf_r has 12 object files, 526.blender_r has
> one and 527.cam4_r has 4 object files with differences.
>
> By looking into these differences, all significant differences
> are caused by the known improvement mentioned above transforming
> GT ior EQ to GE, which can also affect unrolling decisions due
> to insn count.  Some other trivial differences are branch
> target offset difference, nop difference for alignment, vsx
> register number differences etc.
>
> I also evaluated the runtime performance for these changed
> benchmarks; the result is neutral.
>
> These patches are bootstrapped and regress-tested
> incrementally on powerpc64-linux-gnu P7 & P8, and
> powerpc64le-linux-gnu P9 & P10.
>
> Is it ok for trunk?
>
> BR,
> Kewen
> -
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606375.html
> [2] https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606376.html
> [3] https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606504.html
> [4] https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606506.html
>
> Kewen Lin (9):
>   rs6000: Rework vector float comparison in rs6000_emit_vector_compare - 
> p1
>   rs6000: Rework vector float comparison in rs6000_emit_vector_compare - 
> p2
>   rs6000: Rework vector float comparison in rs6000_emit_vector_compare - 
> p3
>   rs6000: Rework vector float comparison in rs6000_emit_vector_compare - 
> p4
>   rs6000: Rework vector integer comparison in rs6000_emit_vector_compare 
> - p1
>   rs6000: Rework vector integer comparison in rs6000_emit_vector_compare 
> - p2
>   rs6000: Rework vector integer comparison in rs6000_emit_vector_compare 
> - p3
>   rs6000: Rework vector integer comparison in rs6000_emit_vector_compare 
> - p4
>   rs6000: Rework vector integer comparison in rs6000_emit_vector_compare 
> - p5
>
>  gcc/config/rs6000/rs6000.cc | 180 ++--
>  gcc/testsuite/gcc.target/powerpc/vcond-fp.c |  25 +++
>  2 files changed, 74 insertions(+), 131 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vcond-fp.c
>



PING^3 [PATCH v2] rs6000: Don't use optimize_function_for_speed_p too early [PR108184]

2023-10-24 Thread Kewen.Lin
Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2023-January/609993.html

BR,
Kewen

>> on 2023/1/16 17:08, Kewen.Lin via Gcc-patches wrote:
>>> Hi,
>>>
>>> As Honza pointed out in [1], the current uses of function
>>> optimize_function_for_speed_p in rs6000_option_override_internal
>>> are too early, since the query results from the functions
>>> optimize_function_for_{speed,size}_p could be changed later due
>>> to profile feedback and some function attribute handling, etc.
>>>
>>> This patch is to move optimize_function_for_speed_p to all the
>>> use places of the corresponding flags, which follows the existing
>>> practices.  Maybe we can cache it somewhere at an appropriate
>>> timing, but that's another thing.
>>>
>>> Comparing with v1[2], this version added one test case for
>>> SAVE_TOC_INDIRECT as Segher questioned and suggested, and it
>>> also considered the possibility of explicit option (see test
>>> cases pr108184-2.c and pr108184-4.c).  I believe that excepting
>>> for the intentional change on optimize_function_for_{speed,
>>> size}_p, there is no other function change.
>>>
>>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2022-November/607527.html
>>> [2] https://gcc.gnu.org/pipermail/gcc-patches/2023-January/609379.html
>>>
>>> Bootstrapped and regtested on powerpc64-linux-gnu P8,
>>> powerpc64le-linux-gnu P{9,10} and powerpc-ibm-aix.
>>>
>>> Is it ok for trunk?
>>>
>>> BR,
>>> Kewen
>>> -
>>> gcc/ChangeLog:
>>>
>>> * config/rs6000/rs6000.cc (rs6000_option_override_internal): Remove
>>> all optimize_function_for_speed_p uses.
>>> (fusion_gpr_load_p): Call optimize_function_for_speed_p along
>>> with TARGET_P8_FUSION_SIGN.
>>> (expand_fusion_gpr_load): Likewise.
>>> (rs6000_call_aix): Call optimize_function_for_speed_p along with
>>> TARGET_SAVE_TOC_INDIRECT.
>>> * config/rs6000/predicates.md (fusion_gpr_mem_load): Call
>>> optimize_function_for_speed_p along with TARGET_P8_FUSION_SIGN.
>>>
>>> gcc/testsuite/ChangeLog:
>>>
>>> * gcc.target/powerpc/pr108184-1.c: New test.
>>> * gcc.target/powerpc/pr108184-2.c: New test.
>>> * gcc.target/powerpc/pr108184-3.c: New test.
>>> * gcc.target/powerpc/pr108184-4.c: New test.
>>> ---
>>>  gcc/config/rs6000/predicates.md   |  5 +++-
>>>  gcc/config/rs6000/rs6000.cc   | 19 +-
>>>  gcc/testsuite/gcc.target/powerpc/pr108184-1.c | 16 
>>>  gcc/testsuite/gcc.target/powerpc/pr108184-2.c | 15 +++
>>>  gcc/testsuite/gcc.target/powerpc/pr108184-3.c | 25 +++
>>>  gcc/testsuite/gcc.target/powerpc/pr108184-4.c | 24 ++
>>>  6 files changed, 97 insertions(+), 7 deletions(-)
>>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108184-1.c
>>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108184-2.c
>>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108184-3.c
>>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108184-4.c
>>>
>>> diff --git a/gcc/config/rs6000/predicates.md 
>>> b/gcc/config/rs6000/predicates.md
>>> index a1764018545..9f84468db84 100644
>>> --- a/gcc/config/rs6000/predicates.md
>>> +++ b/gcc/config/rs6000/predicates.md
>>> @@ -1878,7 +1878,10 @@ (define_predicate "fusion_gpr_mem_load"
>>>
>>>/* Handle sign/zero extend.  */
>>>if (GET_CODE (op) == ZERO_EXTEND
>>> -  || (TARGET_P8_FUSION_SIGN && GET_CODE (op) == SIGN_EXTEND))
>>> +  || (TARGET_P8_FUSION_SIGN
>>> + && GET_CODE (op) == SIGN_EXTEND
>>> + && (rs6000_isa_flags_explicit & OPTION_MASK_P8_FUSION_SIGN
>>> + || optimize_function_for_speed_p (cfun
>>>  {
>>>op = XEXP (op, 0);
>>>mode = GET_MODE (op);
>>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>>> index 6ac3adcec6b..f47d21980a9 100644
>>> --- a/gcc/config/rs6000/rs6000.cc
>>> +++ b/gcc/config/rs6000/rs6000.cc
>>> @@ -3997,8 +3997,7 @@ rs6000_option_override_internal (bool global_init_p)
>>>/* If we can shrink-wrap the TOC register save separately, then use
>>>   -msave-toc-indirect unless explicitly disabled.  */
>>>if ((rs6000_isa_flags_e

Re: [PATCH 3/3]rs6000: split complicate constant to constant pool

2023-10-25 Thread Kewen.Lin
on 2023/10/25 16:14, Jiufu Guo wrote:
> 
> Hi,
> 
> "Kewen.Lin"  writes:
> 
>> Hi,
>>
>> on 2023/10/25 10:00, Jiufu Guo wrote:
>>> Hi,
>>>
>>> Sometimes, a complicated constant is built via 3(or more)
>>> instructions to build. Generally speaking, it would not be
>>> as faster as loading it from the constant pool (as a few
>>> discussions in PR63281).
>>
>> I may miss some previous discussions, but I'm curious why we
>> chose ">=3" here, as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281#c9
>> which indicates that more than 3 (>3) should be considered
>> with this change.
> 
> Thanks a lot for your great patience in reading the history!
> Yes, there are some discussions about "> 3" vs. "> 2".
> - In theory, "ld" is one instruction.  If we consider the
>   address/TOC adjustment, we may count it as 2 instructions.
>   "pld" may need fewer cycles.

OK, even without prefixed insn support, the high part of the address
computation could be further optimized into a nop by the linker.  It
would be good to say something about this in the commit log; otherwise
people may be confused, as with the PR comment mentioned above.

> - As tested, ">2" seems to give better/more stable runtime results
>   when testing SPEC2017.

OK, if you posted the conclusion previously, it would be good to
mention it here with a link to the result comparisons.

> 
>>
>>>
>>> For the concern that I raised in:
>>> https://gcc.gnu.org/pipermail/gcc-patches/2022-August/599676.html
>>> The micro-cases would not be the major concern. Because as
>>> Segher explained in:
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281#c18
>>> It would just be about the benchmark method.
>>>
>>> As tested on spec2017, for visible performance changes, we
>>> can find the runtime improvement on 500.perlbench_r about
>>> ~1.8% (-O2) when support loading complicates constant from
>>> constant pool. And no visible performance recession on
>>> other benchmarks.
>>
>> The improvement on 500.perlbench_r looks to match what PR63281
>> mentioned, nice!  I'm curious that which options and which kinds
>> of CPUs have you tested with?  Since this is a general change,
>> I'd expect we can test with P8/P9/P10 at O2/O3 (or Ofast) at
>> least.
> 
> Great advice! Thanks for pointing this!
> A few months ago, P8/P9/P10 were tested.  This time, I reran
> SPEC2017 on P10 at -O2 and -O3.  More testing on the latest code
> would be better.

Was it tested previously together with your recent commits on
constant building, or just with the trunk at that time?  Anyway, I
was curious how it was tested; thanks for replying, good to see
those are covered.  :)  I'd leave the further review to Segher and
David.

BR,
Kewen

> 
> 
> BR,
> Jeff (Jiufu Guo)
> 
>>
>> BR,
>> Kewen
>>
>>>
>>> Bootstrap & regtest pass on ppc64{,le}.
>>> Is this ok for trunk?
>>>
>>> BR,
>>> Jeff (Jiufu Guo)
>>>
>>> PR target/63281
>>>
>>> gcc/ChangeLog:
>>>
>>> * config/rs6000/rs6000.cc (rs6000_emit_set_const): Update to split
>>> complicate constant to memory.
>>>
>>> gcc/testsuite/ChangeLog:
>>>
>>> * gcc.target/powerpc/const_anchors.c: Update to test final-rtl. 
>>> * gcc.target/powerpc/parall_5insn_const.c: Update to keep original test
>>> point.
>>> * gcc.target/powerpc/pr106550.c: Likewise.
>>> * gcc.target/powerpc/pr106550_1.c: Likewise.
>>> * gcc.target/powerpc/pr87870.c: Update according to latest behavior.
>>> * gcc.target/powerpc/pr93012.c: Likewise.
>>>
>>> ---
>>>  gcc/config/rs6000/rs6000.cc | 16 
>>>  .../gcc.target/powerpc/const_anchors.c  |  5 ++---
>>>  .../gcc.target/powerpc/parall_5insn_const.c | 14 --
>>>  gcc/testsuite/gcc.target/powerpc/pr106550.c | 17 +++--
>>>  gcc/testsuite/gcc.target/powerpc/pr106550_1.c   | 15 +--
>>>  gcc/testsuite/gcc.target/powerpc/pr87870.c  |  5 -
>>>  gcc/testsuite/gcc.target/powerpc/pr93012.c  |  4 +++-
>>>  7 files changed, 65 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>>> index 4690384cdbe..b9562f1ea0f 100644
>>> --- a/gcc/config/rs6000/rs6000.cc
>>> +++ b/gcc/config/rs6000/rs6000.c

[PATCH] rs6000: Consider inline asm as safe if no assembler complains [PR111828]

2023-10-29 Thread Kewen.Lin
Hi,

As discussed in PR111828, rs6000_update_ipa_fn_target_info is
very conservative: currently, for any non-empty inline asm,
without any parsing, it assumes the inline asm could
have HTM insns.  This means that for a function attributed with
power8 that has inline asm, even if it has no HTM insns, we
don't let a function attributed with power10 inline it.

Peter pointed out that an inline asm parser can be a slippery
slope, and noticed that the current gnu assembler still
allows HTM insns even with power10 machine type, so he
suggested that we can aggressively ignore the handling of
inline asm; this patch follows that suggestion.

Considering that there are a few assembler alternatives
and that an assembler can change its behavior (complaining
about HTM insns at power10 and later CPUs sounds reasonable
from a certain point of view), this patch also checks whether
the assembler complains about HTM insns at power10.  In a case
where a caller attributed power10 calls a callee attributed
power8 that has inline asm with an HTM insn, the compilation
at least succeeds without inlining; but if the assembler
complains about HTM insns at power10, the compilation
would fail after inlining.

The two associated test cases pass both without and with
this patch (whether or not the effective target takes effect).

Bootstrapped and regtested on x86_64-redhat-linux,
powerpc64-linux-gnu P8/P9 and powerpc64le-linux-gnu
P9/P10.

I'm going to push this in a week if there are no objections.

BR,
Kewen
-
PR target/111828

gcc/ChangeLog:

* config.in: Regenerate.
* config/rs6000/rs6000.cc (rs6000_update_ipa_fn_target_info): Guard
inline asm handling under !HAVE_AS_POWER10_HTM.
* configure: Regenerate.
* configure.ac: Detect assembler support for HTM insns at power10.

gcc/testsuite/ChangeLog:

* lib/target-supports.exp
(check_effective_target_powerpc_as_p10_htm): New proc.
* g++.target/powerpc/pr111828-1.C: New test.
* g++.target/powerpc/pr111828-2.C: New test.
---
 gcc/config.in |  6 +++
 gcc/config/rs6000/rs6000.cc   |  5 +-
 gcc/configure | 43 +++
 gcc/configure.ac  | 17 ++
 gcc/testsuite/g++.target/powerpc/pr111828-1.C | 49 +
 gcc/testsuite/g++.target/powerpc/pr111828-2.C | 52 +++
 gcc/testsuite/lib/target-supports.exp | 40 ++
 7 files changed, 211 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/g++.target/powerpc/pr111828-1.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/pr111828-2.C

diff --git a/gcc/config.in b/gcc/config.in
index d04718ad128..c9681351389 100644
--- a/gcc/config.in
+++ b/gcc/config.in
@@ -682,6 +682,12 @@
 #endif


+/* Define if your assembler supports htm insns on power10. */
+#ifndef USED_FOR_TARGET
+#undef HAVE_AS_POWER10_HTM
+#endif
+
+
 /* Define if your assembler supports .ref */
 #ifndef USED_FOR_TARGET
 #undef HAVE_AS_REF
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index cc24dd5301e..6d084069014 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -25682,6 +25682,7 @@ rs6000_need_ipa_fn_target_info (const_tree decl,
 static bool
 rs6000_update_ipa_fn_target_info (unsigned int &info, const gimple *stmt)
 {
+#ifndef HAVE_AS_POWER10_HTM
   /* Assume inline asm can use any instruction features.  */
   if (gimple_code (stmt) == GIMPLE_ASM)
 {
@@ -25693,7 +25694,9 @@ rs6000_update_ipa_fn_target_info (unsigned int &info, 
const gimple *stmt)
info |= RS6000_FN_TARGET_INFO_HTM;
   return false;
 }
-  else if (gimple_code (stmt) == GIMPLE_CALL)
+#endif
+
+  if (gimple_code (stmt) == GIMPLE_CALL)
 {
   tree fndecl = gimple_call_fndecl (stmt);
   if (fndecl && fndecl_built_in_p (fndecl, BUILT_IN_MD))
diff --git a/gcc/configure b/gcc/configure
index c43bde8174b..afad4462dd3 100755
--- a/gcc/configure
+++ b/gcc/configure
@@ -28218,6 +28218,49 @@ if test $gcc_cv_as_powerpc_mfcrf = yes; then

 $as_echo "#define HAVE_AS_MFCRF 1" >>confdefs.h

+fi
+
+
+case $target in
+  *-*-aix*) conftest_s='   .machine "pwr10"
+   .csect .text[PR]
+   tend. 0';;
+  *-*-darwin*) conftest_s='.text
+   tend. 0';;
+  *) conftest_s='  .machine power10
+   .text
+   tend. 0';;
+esac
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking assembler for htm 
support on Power10" >&5
+$as_echo_n "checking assembler for htm support on Power10... " >&6; }
+if ${gcc_cv_as_power10_htm+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  gcc_cv_as_power10_htm=no
+  if test x$gcc_cv_as != x; then
+$as_echo "$conftest_s" > conftest.s
+if { ac_try='$gcc_cv_as $gcc_cv_as_flags  -o conftest.o conftest.s >&5'
+  { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
+  (eval $ac_try) 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status

Re: [PATCH] rs6000, Add missing overloaded bcd builtin tests

2023-10-30 Thread Kewen.Lin
Hi Carl,

on 2023/10/31 08:08, Carl Love wrote:
> GCC maintainers:
> 
> The following patch adds tests for two of the rs6000 overloaded
> built-ins that do not have tests.  Additionally, the GCC documentation file

I just found that they actually do have test coverage, because we have

#define __builtin_bcdcmpeq(a,b)   __builtin_vec_bcdsub_eq(a,b,0)
#define __builtin_bcdcmpgt(a,b)   __builtin_vec_bcdsub_gt(a,b,0)
#define __builtin_bcdcmplt(a,b)   __builtin_vec_bcdsub_lt(a,b,0)
#define __builtin_bcdcmpge(a,b)   __builtin_vec_bcdsub_ge(a,b,0)
#define __builtin_bcdcmple(a,b)   __builtin_vec_bcdsub_le(a,b,0)

in altivec.h and gcc/testsuite/gcc.target/powerpc/bcd-4.c tests all these
__builtin_bcdcmp* ...

> doc/extend.texi is updated to include the built-in definitions as they
> were missing.

... since we already document __builtin_vec_bcdsub_{eq,gt,lt}, I think
it's still good to supplement the documentation and add the explicit
test cases.

> 
> The patch has been tested on a Power 10 system with no regressions. 
> Please let me know if this patch is acceptable for mainline.
> 
>  Carl
> 
> ---
> rs6000, Add missing overloaded bcd builtin tests
> 
> The two BCD overloaded built-ins __builtin_bcdsub_ge and __builtin_bcdsub_le
> do not have a corresponding test.  Add tests to existing test file and update
> the documentation with the built-in definitions.

As noted above, this commit log doesn't describe the actual situation
well; please update it with something like:

Currently we have the documentation for __builtin_vec_bcdsub_{eq,gt,lt} but
not for __builtin_bcdsub_[gl]e; this patch supplements the descriptions
for them.  Although they are mainly for __builtin_bcdcmp{ge,le}, we already
have some testing coverage for __builtin_vec_bcdsub_{eq,gt,lt}; this patch
adds the corresponding explicit test cases as well.

> 
> gcc/ChangeLog:
>   * doc/extend.texi (__builtin_bcdsub_le, __builtin_bcdsub_ge): Add
>   documentation for the builti-ins.
> 
> gcc/testsuite/ChangeLog:
>   * bcd-3.c (do_sub_ge, do_suble): Add functions to test builtins
>   __builtin_bcdsub_ge and __builtin_bcdsub_le).

1) Unexpected ")" at the end.

2) I suppose git gcc-verify would complain about this ChangeLog entry.

Should be starting with:

* gcc.target/powerpc/bcd-3.c (

, no?

OK for trunk with the above comments addressed, thanks!

BR,
Kewen

> ---
>  gcc/doc/extend.texi  |  4 
>  gcc/testsuite/gcc.target/powerpc/bcd-3.c | 22 +-
>  2 files changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index cf0d0c63cce..fa7402813e7 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -20205,12 +20205,16 @@ int __builtin_bcdadd_ov (vector unsigned char, 
> vector unsigned char, const int);
>  vector __int128 __builtin_bcdsub (vector __int128, vector __int128, const 
> int);
>  vector unsigned char __builtin_bcdsub (vector unsigned char, vector unsigned 
> char,
> const int);
> +int __builtin_bcdsub_le (vector __int128, vector __int128, const int);
> +int __builtin_bcdsub_le (vector unsigned char, vector unsigned char, const 
> int);
>  int __builtin_bcdsub_lt (vector __int128, vector __int128, const int);
>  int __builtin_bcdsub_lt (vector unsigned char, vector unsigned char, const 
> int);
>  int __builtin_bcdsub_eq (vector __int128, vector __int128, const int);
>  int __builtin_bcdsub_eq (vector unsigned char, vector unsigned char, const 
> int);
>  int __builtin_bcdsub_gt (vector __int128, vector __int128, const int);
>  int __builtin_bcdsub_gt (vector unsigned char, vector unsigned char, const 
> int);
> +int __builtin_bcdsub_ge (vector __int128, vector __int128, const int);
> +int __builtin_bcdsub_ge (vector unsigned char, vector unsigned char, const 
> int);
>  int __builtin_bcdsub_ov (vector __int128, vector __int128, const int);
>  int __builtin_bcdsub_ov (vector unsigned char, vector unsigned char, const 
> int);
>  @end smallexample
> diff --git a/gcc/testsuite/gcc.target/powerpc/bcd-3.c 
> b/gcc/testsuite/gcc.target/powerpc/bcd-3.c
> index 7948a0c95e2..9891f4ff08e 100644
> --- a/gcc/testsuite/gcc.target/powerpc/bcd-3.c
> +++ b/gcc/testsuite/gcc.target/powerpc/bcd-3.c
> @@ -3,7 +3,7 @@
>  /* { dg-require-effective-target powerpc_p8vector_ok } */
>  /* { dg-options "-mdejagnu-cpu=power8 -O2" } */
>  /* { dg-final { scan-assembler-times "bcdadd\[.\] " 4 } } */
> -/* { dg-final { scan-assembler-times "bcdsub\[.\] " 4 } } */
> +/* { dg-final { scan-assembler-times "bcdsub\[.\] " 6 } } */
>  /* { dg-final { scan-assembler-not   "bl __builtin"   } } */
>  /* { dg-final { scan-assembler-not   "mtvsr"   } } */
>  /* { dg-final { scan-assembler-not   "mfvsr"   } } */
> @@ -93,6 +93,26 @@ do_sub_gt (vector_128_t a, vector_128_t b, int *p)
>return ret;
>  }
>  
> +vector_128_t
> +do_sub_ge (vector

[PATCH 1/3][rs6000] Replace vsx_xvcdpsp by vsx_xvcvdpsp

2019-10-23 Thread Kewen.Lin
Hi,

I noticed that vsx_xvcdpsp and vsx_xvcvdpsp are almost the same, and
vsx_xvcdpsp looks replaceable with vsx_xvcvdpsp, since it's only
used via gen_*.

Bootstrapped and regress tested on powerpc64le-linux-gnu.


gcc/ChangeLog

2019-10-23  Kewen Lin  

* config/rs6000/vsx.md (vsx_xvcdpsp): Remove define_insn.
(UNSPEC_VSX_XVCDPSP): Remove.
* config/rs6000/rs6000.c (rs6000_generate_float2_double_code):
Replace gen_vsx_xvcdpsp by gen_vsx_xvcvdpsp.

From 8c6309c131b7614ed8d6aeb4ca2d3d89ab0b8d38 Mon Sep 17 00:00:00 2001
From: Kewen Lin 
Date: Tue, 8 Oct 2019 01:51:06 -0500
Subject: [PATCH 1/3] Replace vsx_xvcdpsp by vsx_xvcvdpsp

---
 gcc/config/rs6000/rs6000.c | 4 ++--
 gcc/config/rs6000/vsx.md   | 9 -
 2 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index c2834bd..23898b1 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -25549,8 +25549,8 @@ rs6000_generate_float2_double_code (rtx dst, rtx src1, 
rtx src2)
   rtx_tmp2 = gen_reg_rtx (V4SFmode);
   rtx_tmp3 = gen_reg_rtx (V4SFmode);
 
-  emit_insn (gen_vsx_xvcdpsp (rtx_tmp2, rtx_tmp0));
-  emit_insn (gen_vsx_xvcdpsp (rtx_tmp3, rtx_tmp1));
+  emit_insn (gen_vsx_xvcvdpsp (rtx_tmp2, rtx_tmp0));
+  emit_insn (gen_vsx_xvcvdpsp (rtx_tmp3, rtx_tmp1));
 
   if (BYTES_BIG_ENDIAN)
 emit_insn (gen_p8_vmrgew_v4sf (dst, rtx_tmp2, rtx_tmp3));
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index f54d343..d6f079c 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -301,7 +301,6 @@
UNSPEC_VSX_XVCVSXDDP
UNSPEC_VSX_XVCVUXDDP
UNSPEC_VSX_XVCVDPSXDS
-   UNSPEC_VSX_XVCDPSP
UNSPEC_VSX_XVCVDPUXDS
UNSPEC_VSX_SIGN_EXTEND
UNSPEC_VSX_XVCVSPSXWS
@@ -2367,14 +2366,6 @@
   "xvcvuxdsp %x0,%x1"
   [(set_attr "type" "vecdouble")])
 
-(define_insn "vsx_xvcdpsp"
-  [(set (match_operand:V4SF 0 "vsx_register_operand" "=wa")
-   (unspec:V4SF [(match_operand:V2DF 1 "vsx_register_operand" "wa")]
-UNSPEC_VSX_XVCDPSP))]
-  "VECTOR_UNIT_VSX_P (V2DFmode)"
-  "xvcvdpsp %x0,%x1"
-  [(set_attr "type" "vecdouble")])
-
 ;; Convert from 32-bit to 64-bit types
 ;; Provide both vector and scalar targets
 (define_insn "vsx_xvcvsxwdp"
-- 
2.7.4



[PATCH 2/3][rs6000] vector conversion RTL pattern update for same unit size

2019-10-23 Thread Kewen.Lin
Hi,

For fixed-point <-> floating-point vector conversions with the same
element unit size, such as SP <-> SI and DP <-> DI, it's fine
to use the existing RTL operations any_fix/any_float for them.

This patch updates those patterns to use any_fix/any_float.

Bootstrapped and regress tested on powerpc64le-linux-gnu.


gcc/ChangeLog

2019-10-23  Kewen Lin  

* config/rs6000/vsx.md (UNSPEC_VSX_CV[SU]XWSP,
UNSPEC_VSX_XVCV[SU]XDDP, UNSPEC_VSX_XVCVDP[SU]XDS,
UNSPEC_VSX_XVCVSPSXWS): Remove.
(vsx_xvcv[su]xddp, vsx_xvcvdp[su]xds, vsx_xvcvsp[su]xws,
vsx_xvcv[su]xwsp): Update define_insn RTL patterns.
From 39ae875d4ae6ce22e170aeb456ef307a1f5fd1e0 Mon Sep 17 00:00:00 2001
From: Kewen Lin 
Date: Wed, 23 Oct 2019 02:56:48 -0500
Subject: [PATCH 2/3] Update RTL pattern on vector SP<->[SU]W DP<->[SU]D
 conversion

---
 gcc/config/rs6000/vsx.md | 105 +--
 1 file changed, 28 insertions(+), 77 deletions(-)

diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index d6f079c..83e4071 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -277,8 +277,6 @@
UNSPEC_VSX_CVUXDSP
UNSPEC_VSX_CVSPSXDS
UNSPEC_VSX_CVSPUXDS
-   UNSPEC_VSX_CVSXWSP
-   UNSPEC_VSX_CVUXWSP
UNSPEC_VSX_FLOAT2
UNSPEC_VSX_UNS_FLOAT2
UNSPEC_VSX_FLOATE
@@ -298,12 +296,7 @@
UNSPEC_VSX_DIVSD
UNSPEC_VSX_DIVUD
UNSPEC_VSX_MULSD
-   UNSPEC_VSX_XVCVSXDDP
-   UNSPEC_VSX_XVCVUXDDP
-   UNSPEC_VSX_XVCVDPSXDS
-   UNSPEC_VSX_XVCVDPUXDS
UNSPEC_VSX_SIGN_EXTEND
-   UNSPEC_VSX_XVCVSPSXWS
UNSPEC_VSX_XVCVSPSXDS
UNSPEC_VSX_VSLO
UNSPEC_VSX_EXTRACT
@@ -2202,6 +2195,34 @@
 
 ;; Convert and scale (used by vec_ctf, vec_cts, vec_ctu for double/long long)
 
+(define_insn "vsx_xvcvxwsp"
+  [(set (match_operand:V4SF 0 "vsx_register_operand" "=wa")
+ (any_float:V4SF (match_operand:V4SI 1 "vsx_register_operand" "wa")))]
+  "VECTOR_UNIT_VSX_P (V4SFmode)"
+  "xvcvxwsp %x0,%x1"
+  [(set_attr "type" "vecfloat")])
+
+(define_insn "vsx_xvcvxddp"
+  [(set (match_operand:V2DF 0 "vsx_register_operand" "=wa")
+(any_float:V2DF (match_operand:V2DI 1 "vsx_register_operand" "wa")))]
+  "VECTOR_UNIT_VSX_P (V2DFmode)"
+  "xvcvxddp %x0,%x1"
+  [(set_attr "type" "vecdouble")])
+
+(define_insn "vsx_xvcvspxws"
+  [(set (match_operand:V4SI 0 "vsx_register_operand" "=wa")
+(any_fix:V4SI (match_operand:V4SF 1 "vsx_register_operand" "wa")))]
+  "VECTOR_UNIT_VSX_P (V4SFmode)"
+  "xvcvspxws %x0,%x1"
+  [(set_attr "type" "vecfloat")])
+
+(define_insn "vsx_xvcvdpxds"
+  [(set (match_operand:V2DI 0 "vsx_register_operand" "=wa")
+(any_fix:V2DI (match_operand:V2DF 1 "vsx_register_operand" "wa")))]
+  "VECTOR_UNIT_VSX_P (V2DFmode)"
+  "xvcvdpxds %x0,%x1"
+  [(set_attr "type" "vecdouble")])
+
 (define_expand "vsx_xvcvsxddp_scale"
   [(match_operand:V2DF 0 "vsx_register_operand")
(match_operand:V2DI 1 "vsx_register_operand")
@@ -2217,14 +2238,6 @@
   DONE;
 })
 
-(define_insn "vsx_xvcvsxddp"
-  [(set (match_operand:V2DF 0 "vsx_register_operand" "=wa")
-(unspec:V2DF [(match_operand:V2DI 1 "vsx_register_operand" "wa")]
- UNSPEC_VSX_XVCVSXDDP))]
-  "VECTOR_UNIT_VSX_P (V2DFmode)"
-  "xvcvsxddp %x0,%x1"
-  [(set_attr "type" "vecdouble")])
-
 (define_expand "vsx_xvcvuxddp_scale"
   [(match_operand:V2DF 0 "vsx_register_operand")
(match_operand:V2DI 1 "vsx_register_operand")
@@ -2240,14 +2253,6 @@
   DONE;
 })
 
-(define_insn "vsx_xvcvuxddp"
-  [(set (match_operand:V2DF 0 "vsx_register_operand" "=wa")
-(unspec:V2DF [(match_operand:V2DI 1 "vsx_register_operand" "wa")]
- UNSPEC_VSX_XVCVUXDDP))]
-  "VECTOR_UNIT_VSX_P (V2DFmode)"
-  "xvcvuxddp %x0,%x1"
-  [(set_attr "type" "vecdouble")])
-
 (define_expand "vsx_xvcvdpsxds_scale"
   [(match_operand:V2DI 0 "vsx_register_operand")
(match_operand:V2DF 1 "vsx_register_operand")
@@ -2270,26 +2275,6 @@
 })
 
 ;; convert vector of 64-bit floating point numbers to vector of
-;; 64-bit signed integer
-(define_insn "vsx_xvcvdpsxds"
-  [(set (match_operand:V2DI 0 "vsx_register_operand" "=wa")
-(unspec:V2DI [(match_operand:V2DF 1 "vsx_register_operand" "wa")]
- UNSPEC_VSX_XVCVDPSXDS))]
-  "VECTOR_UNIT_VSX_P (V2DFmode)"
-  "xvcvdpsxds %x0,%x1"
-  [(set_attr "type" "vecdouble")])
-
-;; convert vector of 32-bit floating point numbers to vector of
-;; 32-bit signed integer
-(define_insn "vsx_xvcvspsxws"
-  [(set (match_operand:V4SI 0 "vsx_register_operand" "=wa")
-   (unspec:V4SI [(match_operand:V4SF 1 "vsx_register_operand" "wa")]
-UNSPEC_VSX_XVCVSPSXWS))]
-  "VECTOR_UNIT_VSX_P (V4SFmode)"
-  "xvcvspsxws %x0,%x1"
-  [(set_attr "type" "vecfloat")])
-
-;; convert vector of 64-bit floating point numbers to vector of
 ;; 64-bit unsigned integer
 (define_expand "vsx_xvcvdpuxds_scale"
   [(match_operand:V2DI 0 "vsx_register_operand")
@@ -2312,24 +2297,6 @@
   DONE;
 })
 
-;; convert vector of 32-bit floating

[PATCH 3/3][rs6000] vector conversion RTL pattern update for diff unit size

2019-10-23 Thread Kewen.Lin
Hi,

Following the previous patch 2/3, this patch updates the
vector conversions between fixed point and floating point
with different element unit sizes, such as SP <-> DI and DP <-> SI.
Bootstrap and regression testing just launched.


gcc/ChangeLog

2019-10-23  Kewen Lin  

* config/rs6000/rs6000-modes.def (V2SF, V2SI): New modes.
* config/rs6000/vsx.md (UNSPEC_VSX_CVDPSXWS, UNSPEC_VSX_CVSXDSP, 
UNSPEC_VSX_CVUXDSP, UNSPEC_VSX_CVSPSXDS, UNSPEC_VSX_CVSPUXDS): Remove.
(vsx_xvcvspdp): New define_expand, old one split to...
(vsx_xvcvspdp_be): ... this.  New.  And...
(vsx_xvcvspdp_le): ... this.  New.
(vsx_xvcvdpsp): New define_expand, old one split to...
(vsx_xvcvdpsp_be): ... this.  New.  And...
(vsx_xvcvdpsp_le): ... this.  New.
(vsx_xvcvdp[su]xws): New define_expand, old one split to...
(vsx_xvcvdp[su]xws_be): ... this.  New.  And...
(vsx_xvcvdp[su]xws_le): ... this.  New.
(vsx_xvcv[su]xdsp): New define_expand, old one split to...
(vsx_xvcv[su]xdsp_be): ... this.  New.  And...
(vsx_xvcv[su]xdsp_le): ... this.  New.
(vsx_xvcv[su]xwdp): New define_expand, old one split to...
(vsx_xvcv[su]xwdp_be): ... this.  New.  And...
(vsx_xvcv[su]xwdp_le): ... this.  New.
(vsx_xvcvsp[su]xds): New define_expand, old one split to...
(vsx_xvcvsp[su]xds_be): ... this.  New.  And...
(vsx_xvcvsp[su]xds_le): ... this.  New.
From 5315810c391b75661de9027ea2848d31390e1d8b Mon Sep 17 00:00:00 2001
From: Kewen Lin 
Date: Wed, 23 Oct 2019 04:02:00 -0500
Subject: [PATCH 3/3] Update RTL pattern on vector fp/int 32bit <-> 64bit
 conversion

---
 gcc/config/rs6000/rs6000-modes.def |   4 +
 gcc/config/rs6000/vsx.md   | 240 +++--
 2 files changed, 181 insertions(+), 63 deletions(-)

diff --git a/gcc/config/rs6000/rs6000-modes.def 
b/gcc/config/rs6000/rs6000-modes.def
index 677062c..449e176 100644
--- a/gcc/config/rs6000/rs6000-modes.def
+++ b/gcc/config/rs6000/rs6000-modes.def
@@ -74,6 +74,10 @@ VECTOR_MODES (FLOAT, 16); /*   V8HF  V4SF V2DF */
 VECTOR_MODES (INT, 32);   /* V32QI V16HI V8SI V4DI */
 VECTOR_MODES (FLOAT, 32); /*   V16HF V8SF V4DF */
 
+/* Half VMX/VSX vector (for select)  */
+VECTOR_MODE (FLOAT, SF, 2);   /* V2SF  */
+VECTOR_MODE (INT, SI, 2); /* V2SI  */
+
 /* Replacement for TImode that only is allowed in GPRs.  We also use PTImode
for quad memory atomic operations to force getting an even/odd register
combination.  */
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 83e4071..44025f6 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -265,7 +265,6 @@
 ;; Constants for creating unspecs
 (define_c_enum "unspec"
   [UNSPEC_VSX_CONCAT
-   UNSPEC_VSX_CVDPSXWS
UNSPEC_VSX_CVDPUXWS
UNSPEC_VSX_CVSPDP
UNSPEC_VSX_CVHPSP
@@ -273,10 +272,6 @@
UNSPEC_VSX_CVDPSPN
UNSPEC_VSX_CVSXWDP
UNSPEC_VSX_CVUXWDP
-   UNSPEC_VSX_CVSXDSP
-   UNSPEC_VSX_CVUXDSP
-   UNSPEC_VSX_CVSPSXDS
-   UNSPEC_VSX_CVSPUXDS
UNSPEC_VSX_FLOAT2
UNSPEC_VSX_UNS_FLOAT2
UNSPEC_VSX_FLOATE
@@ -2106,22 +2101,69 @@
   "xscvdpsp %x0,%x1"
   [(set_attr "type" "fp")])
 
-(define_insn "vsx_xvcvspdp"
+(define_insn "vsx_xvcvspdp_be"
   [(set (match_operand:V2DF 0 "vsx_register_operand" "=v,?wa")
-   (unspec:V2DF [(match_operand:V4SF 1 "vsx_register_operand" "wa,wa")]
- UNSPEC_VSX_CVSPDP))]
-  "VECTOR_UNIT_VSX_P (V4SFmode)"
+ (float_extend:V2DF
+   (vec_select:V2SF (match_operand:V4SF 1 "vsx_register_operand" "wa,wa")
+	     (parallel [(const_int 0) (const_int 2)]))))]
+  "VECTOR_UNIT_VSX_P (V4SFmode) && BYTES_BIG_ENDIAN"
+  "xvcvspdp %x0,%x1"
+  [(set_attr "type" "vecdouble")])
+
+(define_insn "vsx_xvcvspdp_le"
+  [(set (match_operand:V2DF 0 "vsx_register_operand" "=v,?wa")
+ (float_extend:V2DF
+   (vec_select:V2SF (match_operand:V4SF 1 "vsx_register_operand" "wa,wa")
+	     (parallel [(const_int 1) (const_int 3)]))))]
+  "VECTOR_UNIT_VSX_P (V4SFmode) && !BYTES_BIG_ENDIAN"
   "xvcvspdp %x0,%x1"
   [(set_attr "type" "vecdouble")])
 
-(define_insn "vsx_xvcvdpsp"
+(define_expand "vsx_xvcvspdp"
+  [(match_operand:V2DF 0 "vsx_register_operand")
+   (match_operand:V4SF 1 "vsx_register_operand")]
+  "VECTOR_UNIT_VSX_P (V4SFmode)"
+{
+  if (BYTES_BIG_ENDIAN)
+emit_insn (gen_vsx_xvcvspdp_be (operands[0], operands[1]));
+  else
+emit_insn (gen_vsx_xvcvspdp_le (operands[0], operands[1]));
+  DONE;
+})
+
+(define_insn "vsx_xvcvdpsp_be"
   [(set (match_operand:V4SF 0 "vsx_register_operand" "=wa,?wa")
-   (unspec:V4SF [(match_operand:V2DF 1 "vsx_register_operand" "v,wa")]
- UNSPEC_VSX_CVSPDP))]
-  "VECTOR_UNIT_VSX_P (V2DFmode)"
+ (float_truncate:V4SF
+   (vec_concat:V4DF (match_operand:V2DF 1 "vsx_register_operand" "v,wa")
+(vec_select:V2DF (match_dup 1)
+  (parallel 

[PATCH rs6000]Fix PR92132

2019-10-25 Thread Kewen.Lin
Hi,

To support full condition reduction vectorization, we have to define
vec_cmp_* and vcond_mask_*.  This patch adds the related expands.
It also adds vector_{ungt,unge,unlt,unle} so that every comparison
code has a uniform vector_* interface.

Regression testing just launched.
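
(Editorial aside: a minimal hypothetical loop, not from the patch, showing
the kind of code these expands let the vectorizer handle -- a select whose
mask comes from a vector compare:)

  /* Vectorizing this needs vec_cmp (for a[i] > b[i]) and vcond_mask
     (to blend the two arms).  The loop itself is illustrative only.  */
  void
  f (float *r, float *a, float *b, int n)
  {
    for (int i = 0; i < n; i++)
      r[i] = a[i] > b[i] ? a[i] : b[i];
  }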

gcc/ChangeLog

2019-10-25  Kewen Lin  

PR target/92132
* config/rs6000/rs6000.md (one_cmpl<mode>3_internal): Expose name.
* config/rs6000/vector.md (fpcmpun): New code_iterator.
(vcond_mask_<mode><mode>): New expand.
(vcond_mask_<mode><VEC_int>): Likewise.
(vec_cmp<mode><mode>): Likewise.
(vec_cmpu<mode><mode>): Likewise.
(vec_cmp<mode><VEC_int>): Likewise.
(vector_{ungt,unge,unlt,unle}): Likewise.
(vector_uneq<mode>): Expose name.
(vector_ltgt<mode>): Likewise.
(vector_unordered<mode>): Likewise.
(vector_ordered<mode>): Likewise.

gcc/testsuite/ChangeLog

2019-10-25  Kewen Lin  

PR target/92132
* gcc.target/powerpc/pr92132-fp-1.c: New test.
* gcc.target/powerpc/pr92132-fp-2.c: New test.
* gcc.target/powerpc/pr92132-int-1.c: New test.
* gcc.target/powerpc/pr92132-int-2.c: New test.

diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index d0cca1e..2a68548 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -6800,7 +6800,7 @@
 (const_string "16")))])
 
 ;; 128-bit one's complement
-(define_insn_and_split "*one_cmpl<mode>3_internal"
+(define_insn_and_split "one_cmpl<mode>3_internal"
   [(set (match_operand:BOOL_128 0 "vlogical_operand" "=<BOOL_REGS_OUTPUT>")
	(not:BOOL_128
	  (match_operand:BOOL_128 1 "vlogical_operand" "<BOOL_REGS_UNARY>")))]
diff --git a/gcc/config/rs6000/vector.md b/gcc/config/rs6000/vector.md
index 886cbad..64c3c60 100644
--- a/gcc/config/rs6000/vector.md
+++ b/gcc/config/rs6000/vector.md
@@ -107,6 +107,8 @@
 (smin "smin")
 (smax "smax")])
 
+(define_code_iterator fpcmpun [ungt unge unlt unle])
+
 
 ;; Vector move instructions.  Little-endian VSX loads and stores require
 ;; special handling to circumvent "element endianness."
@@ -493,6 +495,241 @@
 FAIL;
 })
 
+;; To support vector condition vectorization, define vcond_mask and vec_cmp.
+
+;; Same mode for condition true/false values and predicate operand.
+(define_expand "vcond_mask_"
+  [(match_operand:VEC_I 0 "vint_operand")
+   (match_operand:VEC_I 1 "vint_operand")
+   (match_operand:VEC_I 2 "vint_operand")
+   (match_operand:VEC_I 3 "vint_operand")]
+  "VECTOR_UNIT_ALTIVEC_OR_VSX_P (mode)"
+{
+  emit_insn (gen_vector_select_ (operands[0], operands[2], operands[1],
+ operands[3]));
+  DONE;
+})
+
+;; Condition true/false values are float but predicate operand is of
+;; type integer vector with same element size.
+(define_expand "vcond_mask_"
+  [(match_operand:VEC_F 0 "vfloat_operand")
+   (match_operand:VEC_F 1 "vfloat_operand")
+   (match_operand:VEC_F 2 "vfloat_operand")
+   (match_operand: 3 "vint_operand")]
+  "VECTOR_UNIT_ALTIVEC_OR_VSX_P (mode)"
+{
+  emit_insn (gen_vector_select_ (operands[0], operands[2], operands[1],
+ operands[3]));
+  DONE;
+})
+
+;; For signed integer vectors comparison.
+(define_expand "vec_cmp<mode><mode>"
+  [(set (match_operand:VEC_I 0 "vint_operand")
+	(match_operator 1 "comparison_operator"
+	  [(match_operand:VEC_I 2 "vint_operand")
+	   (match_operand:VEC_I 3 "vint_operand")]))]
+  "VECTOR_UNIT_ALTIVEC_OR_VSX_P (<MODE>mode)"
+{
+  enum rtx_code code = GET_CODE (operands[1]);
+  rtx tmp = gen_reg_rtx (<MODE>mode);
+  switch (code)
+    {
+    case NE:
+      emit_insn (gen_vector_eq<mode> (operands[0], operands[2], operands[3]));
+      emit_insn (gen_one_cmpl<mode>2 (operands[0], operands[0]));
+      break;
+    case EQ:
+      emit_insn (gen_vector_eq<mode> (operands[0], operands[2], operands[3]));
+      break;
+    case GE:
+      emit_insn (
+	gen_vector_nlt<mode> (operands[0], operands[2], operands[3], tmp));
+      break;
+    case GT:
+      emit_insn (gen_vector_gt<mode> (operands[0], operands[2], operands[3]));
+      break;
+    case LE:
+      emit_insn (
+	gen_vector_ngt<mode> (operands[0], operands[2], operands[3], tmp));
+      break;
+    case LT:
+      emit_insn (gen_vector_gt<mode> (operands[0], operands[3], operands[2]));
+      break;
+    case GEU:
+      emit_insn (
+	gen_vector_nltu<mode> (operands[0], operands[2], operands[3], tmp));
+      break;
+    case GTU:
+      emit_insn (gen_vector_gtu<mode> (operands[0], operands[2], operands[3]));
+      break;
+    case LEU:
+      emit_insn (
+	gen_vector_ngtu<mode> (operands[0], operands[2], operands[3], tmp));
+      break;
+    case LTU:
+      emit_insn (gen_vector_gtu<mode> (operands[0], operands[3], operands[2]));
+      break;
+    default:
+      gcc_unreachable ();
+      break;
+    }
+  DONE;
+})
+
+;; For unsigned integer vectors comparison.
+(define_expand "vec_cmpu<mode><mode>"
+  [(set (match_operand:VEC_I 0 "vint_operand")
+	(match_operator 1 "comparison_operator"
+	  [(match_operand:VEC_I 2 "vint_operand")
+  

Re: [PATCH rs6000]Fix PR92132

2019-10-28 Thread Kewen.Lin
Fixed one place that used an inconsistent mode.

Bootstrapped and regress testing passed on powerpc64le-linux.


Thanks!
Kewen

---

gcc/ChangeLog

2019-10-25  Kewen Lin  

PR target/92132
* config/rs6000/rs6000.md (one_cmpl<mode>3_internal): Expose name.
* config/rs6000/vector.md (fpcmpun): New code_iterator.
(vcond_mask_<mode><mode>): New expand.
(vcond_mask_<mode><VEC_int>): Likewise.
(vec_cmp<mode><mode>): Likewise.
(vec_cmpu<mode><mode>): Likewise.
(vec_cmp<mode><VEC_int>): Likewise.
(vector_{ungt,unge,unlt,unle}): Likewise.
(vector_uneq<mode>): Expose name.
(vector_ltgt<mode>): Likewise.
(vector_unordered<mode>): Likewise.
(vector_ordered<mode>): Likewise.

gcc/testsuite/ChangeLog

2019-10-25  Kewen Lin  

PR target/92132
* gcc.target/powerpc/pr92132-fp-1.c: New test.
* gcc.target/powerpc/pr92132-fp-2.c: New test.
* gcc.target/powerpc/pr92132-int-1.c: New test.
* gcc.target/powerpc/pr92132-int-2.c: New test.



diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index d0cca1e..2a68548 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -6800,7 +6800,7 @@
 (const_string "16")))])
 
 ;; 128-bit one's complement
-(define_insn_and_split "*one_cmpl<mode>3_internal"
+(define_insn_and_split "one_cmpl<mode>3_internal"
   [(set (match_operand:BOOL_128 0 "vlogical_operand" "=<BOOL_REGS_OUTPUT>")
	(not:BOOL_128
	  (match_operand:BOOL_128 1 "vlogical_operand" "<BOOL_REGS_UNARY>")))]
diff --git a/gcc/config/rs6000/vector.md b/gcc/config/rs6000/vector.md
index 886cbad..0ef64eb 100644
--- a/gcc/config/rs6000/vector.md
+++ b/gcc/config/rs6000/vector.md
@@ -107,6 +107,8 @@
 (smin "smin")
 (smax "smax")])
 
+(define_code_iterator fpcmpun [ungt unge unlt unle])
+
 
 ;; Vector move instructions.  Little-endian VSX loads and stores require
 ;; special handling to circumvent "element endianness."
@@ -493,6 +495,241 @@
 FAIL;
 })
 
+;; To support vector condition vectorization, define vcond_mask and vec_cmp.
+
+;; Same mode for condition true/false values and predicate operand.
+(define_expand "vcond_mask_<mode><mode>"
+  [(match_operand:VEC_I 0 "vint_operand")
+   (match_operand:VEC_I 1 "vint_operand")
+   (match_operand:VEC_I 2 "vint_operand")
+   (match_operand:VEC_I 3 "vint_operand")]
+  "VECTOR_UNIT_ALTIVEC_OR_VSX_P (<MODE>mode)"
+{
+  emit_insn (gen_vector_select_<mode> (operands[0], operands[2], operands[1],
+				       operands[3]));
+  DONE;
+})
+
+;; Condition true/false values are float but predicate operand is of
+;; type integer vector with same element size.
+(define_expand "vcond_mask_<mode><VEC_int>"
+  [(match_operand:VEC_F 0 "vfloat_operand")
+   (match_operand:VEC_F 1 "vfloat_operand")
+   (match_operand:VEC_F 2 "vfloat_operand")
+   (match_operand:<VEC_INT> 3 "vint_operand")]
+  "VECTOR_UNIT_ALTIVEC_OR_VSX_P (<MODE>mode)"
+{
+  emit_insn (gen_vector_select_<mode> (operands[0], operands[2], operands[1],
+				       gen_lowpart (<MODE>mode, operands[3])));
+  DONE;
+})
+
+;; For signed integer vectors comparison.
+(define_expand "vec_cmp<mode><mode>"
+  [(set (match_operand:VEC_I 0 "vint_operand")
+	(match_operator 1 "comparison_operator"
+	  [(match_operand:VEC_I 2 "vint_operand")
+	   (match_operand:VEC_I 3 "vint_operand")]))]
+  "VECTOR_UNIT_ALTIVEC_OR_VSX_P (<MODE>mode)"
+{
+  enum rtx_code code = GET_CODE (operands[1]);
+  rtx tmp = gen_reg_rtx (<MODE>mode);
+  switch (code)
+    {
+    case NE:
+      emit_insn (gen_vector_eq<mode> (operands[0], operands[2], operands[3]));
+      emit_insn (gen_one_cmpl<mode>2 (operands[0], operands[0]));
+      break;
+    case EQ:
+      emit_insn (gen_vector_eq<mode> (operands[0], operands[2], operands[3]));
+      break;
+    case GE:
+      emit_insn (
+	gen_vector_nlt<mode> (operands[0], operands[2], operands[3], tmp));
+      break;
+    case GT:
+      emit_insn (gen_vector_gt<mode> (operands[0], operands[2], operands[3]));
+      break;
+    case LE:
+      emit_insn (
+	gen_vector_ngt<mode> (operands[0], operands[2], operands[3], tmp));
+      break;
+    case LT:
+      emit_insn (gen_vector_gt<mode> (operands[0], operands[3], operands[2]));
+      break;
+    case GEU:
+      emit_insn (
+	gen_vector_nltu<mode> (operands[0], operands[2], operands[3], tmp));
+      break;
+    case GTU:
+      emit_insn (gen_vector_gtu<mode> (operands[0], operands[2], operands[3]));
+      break;
+    case LEU:
+      emit_insn (
+	gen_vector_ngtu<mode> (operands[0], operands[2], operands[3], tmp));
+      break;
+    case LTU:
+      emit_insn (gen_vector_gtu<mode> (operands[0], operands[3], operands[2]));
+      break;
+    default:
+      gcc_unreachable ();
+      break;
+    }
+  DONE;
+})
+
+;; For unsigned integer vectors comparison.
+(define_expand "vec_cmpu<mode><mode>"
+  [(set (match_operand:VEC_I 0 "vint_operand")
+	(match_operator 1 "comparison_operator"
+	  [(match_operand:VEC_I 2 "vint_operand")
+	   (match_operand:VEC_I 3 "vint_operand")]))]
+  "VECTOR_UNIT_ALTIVEC_OR_VSX_P (<MODE>mode)"
+{
+  em

[PATCH, rs6000] Fix PR92127

2019-10-30 Thread Kewen.Lin
Hi,

As PR92127 shows, recent commit r276645 enables more unrolling, and
two ppc vectorization cost model test cases are fragile and failed
after the change.  This patch disables unrolling for the
loops of interest to make the test cases more robust.
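
(Editorial aside: the hunk below is truncated in the archive; a hedged
sketch of the per-loop approach described above.  #pragma GCC unroll is
real GCC syntax, while the loop body here is made up:)

  /* Pin the loop shape so only the vectorization profitability
     threshold is being tested.  */
  #pragma GCC unroll 0
  for (i = 0; i < len; i++)
    exc[i] = exc[i] * interp[i];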

Verified on ppc64-redhat-linux.  It should also be fine on powerpc64le,
which supports hw_misalign and where the check would be XFAILed.

Is it ok for trunk?  Thanks in advance!


Kewen



gcc/testsuite/ChangeLog

2019-10-30  Kewen Lin  

PR testsuite/92127
* gcc.dg/vect/costmodel/ppc/costmodel-pr37194.c: Disable unroll.
* gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c: 
Likewise.


diff --git 
a/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c 
b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c
index a3662e2..34445dc 100644
--- a/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c
@@ -13,6 +13,8 @@ interp_pitch(float *exc, float *interp, int pitch, int len)
for (i=0;i

[PATCH 3/3 V2][rs6000] vector conversion RTL pattern update for diff unit size

2019-10-31 Thread Kewen.Lin
Hi Segher,

Thanks a lot for the comments.

on 2019/10/31 2:49 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Wed, Oct 23, 2019 at 05:42:45PM +0800, Kewen.Lin wrote:
>> Following the previous one 2/3, this patch is to update the
>> vector conversions between fixed point and floating point
>> with different element unit sizes, such as: SP <-> DI, DP <-> SI.
> 
>>  (vsx_xvcvdp[su]xws): New define_expand, old one split to...
> 
> You mean <su> here, please fix (never use wildcards like [su] in changelogs:
> people grep for things in changelogs, which misses entries with wildcards).
> 

OK, will fix it.

>> +/* Half VMX/VSX vector (for select)  */
>> +VECTOR_MODE (FLOAT, SF, 2);   /* V2SF  */
>> +VECTOR_MODE (INT, SI, 2); /* V2SI  */
> 
> Or "for internal use", in general.  What happens if a user tries to create
> something of such a mode?  I hope we don't ICE :-/
> 

I did some testings, it failed (ICE) if we constructed one insn with these 
modes artificially.  But I also checked the existing V8SF/V8SI/V4DF/... etc.,
they have same issues.  It looks more like a new issue to avoid that.

>> +;; Convert vector of 64-bit floating point numbers to vector of
>> +;; 32-bit signed/unsigned integers.
>> +(define_insn "vsx_xvcvdp<su>xws_be"
>>    [(set (match_operand:V4SI 0 "vsx_register_operand" "=v,?wa")
>> -	(unspec:V4SI [(match_operand:V2DF 1 "vsx_register_operand" "wa,wa")]
>> -		     UNSPEC_VSX_CVDPSXWS))]
>> -  "VECTOR_UNIT_VSX_P (V2DFmode)"
>> -  "xvcvdpsxws %x0,%x1"
>> +	(any_fix:V4SI
>> +	  (vec_concat:V4DF (match_operand:V2DF 1 "vsx_register_operand" "wa,wa")
>> +	    (vec_select:V2DF (match_dup 1)
>> +	      (parallel [(const_int 1) (const_int 0)])))))]
>> +  "VECTOR_UNIT_VSX_P (V2DFmode) && BYTES_BIG_ENDIAN"
>> +  "xvcvdp<su>xws %x0,%x1"
>>[(set_attr "type" "vecdouble")])
> 
> This doesn't work, I think: the insns actually leaves words 1 and 3
> undefined, but this pattern gives them a meaning.
> 
> I don't think we can do better than unspecs for such insns.  Or change
> the pattern to only describe the defined parts (this works for e.g. mulhw
> that describes its result as SImode: its DImode result has the high half
> undefined).
> 

Good point, thanks!  I agree, the current implementation of the 64-bit ->
32-bit RTL patterns has different semantics from what the instructions
implement.  I tried to find a way to represent an undefined or random
register value together with a second vec_concat, but failed.  I'll revert
the change for the 64-bit -> 32-bit part.

Updated patch attached, new regression testing just launched.


BR,
Kewen

-

gcc/ChangeLog

2019-10-31  Kewen Lin  

* config/rs6000/rs6000-modes.def (V2SF, V2SI): New modes.
* config/rs6000/vsx.md (UNSPEC_VSX_CVSPSXDS, UNSPEC_VSX_CVSPUXDS): 
Remove.
(vsx_xvcvspdp): New define_expand, old define_insn split to...
(vsx_xvcvspdp_be): ... this.  New.  And...
(vsx_xvcvspdp_le): ... this.  New.
(vsx_xvcv<su>xwdp): New define_expand, old define_insn split to...
(vsx_xvcv<su>xwdp_be): ... this.  New.  And...
(vsx_xvcv<su>xwdp_le): ... this.  New.
(vsx_xvcvsp<su>xds): New define_expand, old define_insn split to...
(vsx_xvcvsp<su>xds_be): ... this.  New.  And...
(vsx_xvcvsp<su>xds_le): ... this.  New.
diff --git a/gcc/config/rs6000/rs6000-modes.def 
b/gcc/config/rs6000/rs6000-modes.def
index 677062c..2051358 100644
--- a/gcc/config/rs6000/rs6000-modes.def
+++ b/gcc/config/rs6000/rs6000-modes.def
@@ -74,6 +74,10 @@ VECTOR_MODES (FLOAT, 16); /*   V8HF  V4SF V2DF */
 VECTOR_MODES (INT, 32);   /* V32QI V16HI V8SI V4DI */
 VECTOR_MODES (FLOAT, 32); /*   V16HF V8SF V4DF */
 
+/* Half VMX/VSX vector (for internal use)  */
+VECTOR_MODE (FLOAT, SF, 2);   /* V2SF  */
+VECTOR_MODE (INT, SI, 2); /* V2SI  */
+
 /* Replacement for TImode that only is allowed in GPRs.  We also use PTImode
for quad memory atomic operations to force getting an even/odd register
combination.  */
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 83e4071..99b51cb 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -275,8 +275,6 @@
UNSPEC_VSX_CVUXWDP
UNSPEC_VSX_CVSXDSP
UNSPEC_VSX_CVUXDSP
-   UNSPEC_VSX_CVSPSXDS
-   UNSPEC_VSX_CVSPUXDS
UNSPEC_VSX_FLOAT2
UNSPEC_VSX_UNS_FLOAT2
UNSPEC_VSX_FLOATE
@@ -2106,14 +2104,36 @@
   "xscvdpsp %x0,%x1"
   [(set_attr "type" "fp")])
 
-(define_insn "vsx_xvcvspdp"
+(define_insn "

Re: [PATCH 3/3 V2][rs6000] vector conversion RTL pattern update for diff unit size

2019-10-31 Thread Kewen.Lin
Hi Segher,

on 2019/11/1 2:49 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Thu, Oct 31, 2019 at 05:35:22PM +0800, Kewen.Lin wrote:
>>>> +/* Half VMX/VSX vector (for select)  */
>>>> +VECTOR_MODE (FLOAT, SF, 2);   /* V2SF  */
>>>> +VECTOR_MODE (INT, SI, 2); /* V2SI  */
>>>
>>> Or "for internal use", in general.  What happens if a user tries to create
>>> something of such a mode?  I hope we don't ICE :-/
>>
>> I did some testings, it failed (ICE) if we constructed one insn with these 
>> modes artificially.  But I also checked the existing V8SF/V8SI/V4DF/... etc.,
>> they have same issues.  It looks more like a new issue to avoid that.
> 
> What does "artificially" mean?  If you had to change the compiler for your
> test, that doesn't count; otherwise, please file a PR.
> 

Yes, I hacked the compiler to emit it directly.  OK, it's fine then.  :)

>>  * config/rs6000/vsx.md (UNSPEC_VSX_CVSPSXDS, UNSPEC_VSX_CVSPUXDS): 
>> Remove.
> 
> (line too long)

Will fix it.

> 
> Okay for trunk.  Thanks!
> 

Thanks for your time!


BR,
Kewen



[PATCH, rs6000] Make load cost more in vectorization cost for P8/P9

2019-11-03 Thread Kewen.Lin
Hi,

To align with rs6000_insn_cost, which costs more for load type insns,
this patch makes load insns cost more in the vectorization cost
function.  Since the result of a load is usually consumed somewhere
later (a true dependence) while a store's is not, we keep the store
cost as before.
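
(Editorial aside: a hypothetical C fragment, not from the patch, just to
illustrate the load/store asymmetry described above:)

  float
  g (float *a, float *b, float s, int i)
  {
    float t = a[i];  /* the multiply below waits out this load's latency */
    s += t * t;      /* true dependence on the load result */
    b[i] = s;        /* nothing downstream waits on the store finishing */
    return s;
  }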

The SPEC2017 performance evaluation on Power8 shows 525.x264_r
+9.56%, 511.povray_r +2.08%, 527.cam4_r +1.16% gains, no
significant degradation, SPECINT geomean +0.88%, SPECFP geomean
+0.26%.

The SPEC2017 performance evaluation on Power9 shows no significant
improvement or degradation, SPECINT geomean +0.04%, SPECFP geomean
+0.04%.

The SPEC2006 performance evaluation on Power8 shows 454.calculix
+4.41% gain but 416.gamess -1.19% and 453.povray -3.83% degradation.
I looked into the two degraded benchmarks; the degradation was NOT
due to hotspot changes from vectorization, but was all side effects.
SPECINT geomean +0.10%, SPECFP geomean unchanged even considering
the degradation.

Bootstrapped and regress tested on powerpc64le-linux-gnu.  
Is OK for trunk?


BR,
Kewen

---

gcc/ChangeLog

2019-11-04  Kewen Lin  

* config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Make
scalar_load, vector_load, unaligned_load and vector_gather_load cost
a bit more on Power8 and up.

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 5876714..876c7ef 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -4763,15 +4763,22 @@ rs6000_builtin_vectorization_cost (enum 
vect_cost_for_stmt type_of_cost,
   switch (type_of_cost)
 {
   case scalar_stmt:
-  case scalar_load:
   case scalar_store:
   case vector_stmt:
-  case vector_load:
   case vector_store:
   case vec_to_scalar:
   case scalar_to_vec:
   case cond_branch_not_taken:
 return 1;
+  case scalar_load:
+  case vector_load:
+   /* Like rs6000_insn_cost, make load insns cost a bit more. FIXME: the
+  benefits were observed on Power8 and up, we can unify it if similar
+  profits are measured on Power6 and Power7.  */
+   if (TARGET_P8_VECTOR)
+ return 2;
+   else
+ return 1;
 
   case vec_perm:
/* Power7 has only one permute unit, make it a bit expensive.  */
@@ -4792,8 +4799,9 @@ rs6000_builtin_vectorization_cost (enum 
vect_cost_for_stmt type_of_cost,
 
   case unaligned_load:
   case vector_gather_load:
+   /* Like rs6000_insn_cost, make load insns cost a bit more.  */
if (TARGET_EFFICIENT_UNALIGNED_VSX)
- return 1;
+ return 2;
 
 if (TARGET_VSX && TARGET_ALLOW_MOVMISALIGN)
   {
@@ -4827,7 +4835,13 @@ rs6000_builtin_vectorization_cost (enum 
vect_cost_for_stmt type_of_cost,
   /* Misaligned loads are not supported.  */
   gcc_unreachable ();
 
-return 2;
+   /* Like rs6000_insn_cost, make load insns cost a bit more. FIXME: the
+  benefits were observed on Power8 and up, we can unify it if similar
+  profits are measured on Power6 and Power7.  */
+   if (TARGET_P8_VECTOR)
+ return 4;
+   else
+ return 2;
 
   case unaligned_store:
   case vector_scatter_store:


Re: [PATCH V3] rs6000: Refine small loop unroll in loop_unroll_adjust hook

2019-11-04 Thread Kewen.Lin
Hi Jeff,

Thanks for the patch, I learned a lot from it.  Some nits embedded.

on 2019/11/4 2:31 PM, Jiufu Guo wrote:
> Hi,
> 
> In this patch, loop unroll adjust hook is introduced for powerpc.  We can do
> target related hueristic adjustment in this hook. In this patch, small loops
> is unrolled 2 times for O2 and O3 by default.  With this patch, we can see
> some improvement for spec2017.  This patch enhanced a little for [Patch V2] to
> enable small loops unroll for O3 by default like O2.
> 
> Bootstrapped and regtested on powerpc64le.  Is this ok for trunk?
> 
> Jiufu
> BR.
> 
> gcc/
> 2019-11-04  Jiufu Guo 
> 
>   PR tree-optimization/88760
>   * config/rs6000/rs6000.c (rs6000_option_override_internal): Remove
>   code which changes PARAM_MAX_UNROLL_TIMES and PARAM_MAX_UNROLLED_INSNS.
>   (TARGET_LOOP_UNROLL_ADJUST): Add loop unroll adjust hook.
>   (rs6000_loop_unroll_adjust): New hook for loop unroll adjust.
>   Unrolling small loop 2 times for -O2 and -O3.
>   (rs6000_function_specific_save): Save unroll_small_loops flag.
>   (rs6000_function_specific_restore): Restore unroll_small_loops flag.
>   * gcc/config/rs6000/rs6000.opt (unroll_small_loops): New internal flag.
> 
>   
> gcc.testsuite/
> 2019-11-04  Jiufu Guo  
> 
>   PR tree-optimization/88760
>   * gcc.dg/pr59643.c: Update back to r277550.
> 
> ---
>  gcc/config/rs6000/rs6000.c | 38 --
>  gcc/config/rs6000/rs6000.opt   |  7 +++
>  gcc/testsuite/gcc.dg/pr59643.c |  3 ---
>  3 files changed, 35 insertions(+), 13 deletions(-)
> 
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index 9ed5151..5e1a75d 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -1428,6 +1428,9 @@ static const struct attribute_spec 
> rs6000_attribute_table[] =
>  #undef TARGET_VECTORIZE_DESTROY_COST_DATA
>  #define TARGET_VECTORIZE_DESTROY_COST_DATA rs6000_destroy_cost_data
>  
> +#undef TARGET_LOOP_UNROLL_ADJUST
> +#define TARGET_LOOP_UNROLL_ADJUST rs6000_loop_unroll_adjust
> +
>  #undef TARGET_INIT_BUILTINS
>  #define TARGET_INIT_BUILTINS rs6000_init_builtins
>  #undef TARGET_BUILTIN_DECL
> @@ -4540,25 +4543,20 @@ rs6000_option_override_internal (bool global_init_p)
>global_options.x_param_values,
>global_options_set.x_param_values);
>  
> -  /* unroll very small loops 2 time if no -funroll-loops.  */
> +  /* If funroll-loops is not enabled explicitly, then enable small loops
> +  unrolling for -O2, and do not turn fweb or frename-registers on.  */

"for -O2" -> "for -O2 and up"? since I noticed it checks "optimize >=2" later.

>if (!global_options_set.x_flag_unroll_loops
> && !global_options_set.x_flag_unroll_all_loops)
>   {
> -   maybe_set_param_value (PARAM_MAX_UNROLL_TIMES, 2,
> -  global_options.x_param_values,
> -  global_options_set.x_param_values);
> -
> -   maybe_set_param_value (PARAM_MAX_UNROLLED_INSNS, 20,
> -  global_options.x_param_values,
> -  global_options_set.x_param_values);
> +   unroll_small_loops = optimize >= 2 ? 1 : 0;

Maybe simpler with "unroll_small_loops = flag_unroll_loops"? 

>  
> -   /* If fweb or frename-registers are not specificed in command-line,
> -  do not turn them on implicitly.  */
> if (!global_options_set.x_flag_web)
>   global_options.x_flag_web = 0;
> if (!global_options_set.x_flag_rename_registers)
>   global_options.x_flag_rename_registers = 0;
>   }
> +  else
> + unroll_small_loops = 0;

Could we initialize this in rs6000.opt as zero?
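
(Editorial aside: an illustrative .opt record for what "initialize in
rs6000.opt as zero" could look like; the exact spelling is a sketch,
the Init(0) part is the point:)

  unroll_small_loops
  Target Undocumented Var(unroll_small_loops) Init(0) Save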


BR,
Kewen



Re: [PATCH, rs6000] Make load cost more in vectorization cost for P8/P9

2019-11-04 Thread Kewen.Lin
Hi Segher,

Thanks for the comments!

on 2019/11/5 4:21 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Mon, Nov 04, 2019 at 03:16:06PM +0800, Kewen.Lin wrote:
>> To align with rs6000_insn_cost costing more for load type insns,
> 
> (Which itself has history in rs6000_rtx_costs).
> 
>> this patch is to make load insns cost more in vectorization cost
>> function.  Considering that the result of load usually is used
>> somehow later (true-dep) but store won't, we keep the store as
>> before.
> 
> The latency of load insns is about twice that of "simple" instructions;
> 2 vs. 1 on older cores, and 4 (or so) vs. 2 on newer cores.
> 

Yes, on the latest Power9, a general load takes 4 cycles and a vsx load takes 5.

>> The SPEC2017 performance evaluation on Power8 shows 525.x264_r
>> +9.56%, 511.povray_r +2.08%, 527.cam4_r 1.16% gains, no 
>> significant degradation, SPECINT geomean +0.88%, SPECFP geomean
>> +0.26%.
> 
> Nice :-)
> 
>> The SPEC2017 performance evaluation on Power9 shows no significant
>> improvement or degradation, SPECINT geomean +0.04%, SPECFP geomean
>> +0.04%.
>>
>> The SPEC2006 performance evaluation on Power8 shows 454.calculix
>> +4.41% gain but 416.gamess -1.19% and 453.povray -3.83% degradation.
>> I looked into the two degradation bmks, the degradation were NOT
>> due to hotspot changes by vectorization, were all side effects.
>> SPECINT geomean +0.10%, SPECFP geomean no changed considering
>> the degradation.
> 
> Also nice.
> 
>> --- a/gcc/config/rs6000/rs6000.c
>> +++ b/gcc/config/rs6000/rs6000.c
>> @@ -4763,15 +4763,22 @@ rs6000_builtin_vectorization_cost (enum 
>> vect_cost_for_stmt type_of_cost,
>>switch (type_of_cost)
>>  {
>>case scalar_stmt:
>> -  case scalar_load:
>>case scalar_store:
>>case vector_stmt:
>> -  case vector_load:
>>case vector_store:
>>case vec_to_scalar:
>>case scalar_to_vec:
>>case cond_branch_not_taken:
>>  return 1;
>> +  case scalar_load:
>> +  case vector_load:
>> +/* Like rs6000_insn_cost, make load insns cost a bit more. FIXME: the
> 
> (two spaces after full stop).
> 

Good catch!  Will fix it (and the others).

>> +   benefits were observed on Power8 and up, we can unify it if similar
>> +   profits are measured on Power6 and Power7.  */
>> +if (TARGET_P8_VECTOR)
>> +  return 2;
>> +else
>> +  return 1;
> 
> Hrm, but you showed benchmark improvements for p9 as well?
> 

No significant gains but no degradation either, so I thought it's fine to
align them together.  Does that make sense?

> What happens if you enable this for everything as well?
> 

My concern was that if we enable it for everything, it's possible to introduce
degradation for some benchmarks on P6 or P7, where we didn't evaluate the
performance impact.  Although it's reasonable from the point of view of load
latency, it's possible to get worse results in the actual benchmarks, as my
earlier fine-grained cost adjustment experiments showed.

Or do you suggest enabling it everywhere and solving any degradation if it
shows up?  I'm also fine with that.  :)


BR,
Kewen



Re: [PATCH v2] PR92090: Fix testcase failures by r276469

2019-11-04 Thread Kewen.Lin
on 2019/11/5 6:57 AM, Joseph Myers wrote:
> On Mon, 4 Nov 2019, luoxhu wrote:
> 
>> -finline-functions is enabled by default for O2 since r276469, update the
>> test cases with -fno-inline-functions.
>>
>> v2: disable inlining for the failed cases.  Add two more failed cases
>> not listed in BZ.  Tested on P8LE, P8BE and P9LE.
> 
> If inlining (or other interprocedural analysis) invalidates a test's 
> intent (e.g. all the code gets optimized away), our normal approach is to 
> use noinline etc. function attributes to prevent that inlining.
> 
> If you're adding such options to work around an ICE, which certainly 
> appears to be the case in the architecture-independent testcases here, you 
> should (a) have comments in the tests saying explicitly that the options 
> are there temporarily to work around the ICE in a bug whose number is 
> given in the comment, and (b) a remark in the open regression bug for the 
> ICE saying that those options have been added as a temporary workaround 
> and that a patch fixing the ICE should remove them again.  The commit 
> message also needs to make very clear that the commit is *not* a fix for 
> that bug and so it must *not* be closed as fixed until there is an actual 
> fix for the ICE.
> 

Hi Joseph,

Very good point!  Since gcc doesn't pursue a 100% testsuite pass rate, I've
noticed there are always a few failures exposed or caused by open PRs.  Could
we just leave the test case as it is, without any pre-workaround, until the
PR gets fixed?  That seems closer to what we usually do.  Or is the benefit
of the pre-workaround that it makes the case sensitive again, so it stays
usable for other testing?
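
(Editorial aside: a hedged sketch of the attribute approach Joseph
describes; hypothetical code, not one of the actual testcases:)

  /* Keep the function out of IPA inlining and cloning so the test
     still exercises what it was written for.  */
  __attribute__ ((noinline, noclone))
  int
  foo (int x)
  {
    return x + 1;
  }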

BR,
Kewen

> So I don't think this patch is OK without having such comments in the 
> tests to explain the issue and a carefully written commit message warning 
> that the patch is a workaround, not a fix and the bug in question must not 
> be closed simply because of the commit mentioning it.
> 



Re: [PATCH rs6000]Fix PR92132

2019-11-05 Thread Kewen.Lin
Hi Segher,

Thanks for the comments!

on 2019/11/2 7:17 AM, Segher Boessenkool wrote:
> On Tue, Oct 29, 2019 at 01:16:53PM +0800, Kewen.Lin wrote:
>>  (vcond_mask_<mode><mode>): New expand.
> 
> Say for which mode please?  Like
>   (vcond_mask_<mode><mode> for VEC_I and VEC_I): New expand.
> 

Fixed as below.

>>  (vcond_mask_<mode><VEC_int>): Likewise.
> 
> "for VEC_I and VEC_F", here, but the actual names in the pattern are for
> vector modes of same-size integer elements.  Maybe it is clear enough like
> this, dunno.

Changed it to: for VEC_F, new expand for float vector modes and same-size
integer vector modes.

> 
>>  (vector_{ungt,unge,unlt,unle}): Likewise.
> 
> Never use wildcards (or shell expansions) in the "what changed" part of a
> changelog, because people try to search for that.

Thanks for the explanation, fixed. 

> 
>>  ;; 128-bit one's complement
>> -(define_insn_and_split "*one_cmpl3_internal"
>> +(define_insn_and_split "one_cmpl3_internal"
> 
> Instead, rename it to "one_cmpl<mode>3" and delete the define_expand that
> serves no function?

Renamed.  Sorry, what's the "define_expand" specified here?  I thought it's
the existing one_cmpl<mode>3 but I didn't find it.

> 
>> +(define_code_iterator fpcmpun [ungt unge unlt unle])
> 
> Why these four?  Should there be more?  Should this be added to some
> existing iterator?

For floating point comparison operator and vector type, currently rs6000
supports eq, gt, ge, *ltgt, *unordered, *ordered, *uneq (* for unnamed).
We can leverage gt, ge, eq for lt, le, ne, then these four left.

I originally wanted to merge them into the existing unordered or uneq, but
I found it's hard to share their existing patterns.  For example, the uneq
looks like:

  [(set (match_dup 3)
(gt:VEC_F (match_dup 1)
  (match_dup 2)))
   (set (match_dup 4)
(gt:VEC_F (match_dup 2)
  (match_dup 1)))
   (set (match_dup 0)
(and:VEC_F (not:VEC_F (match_dup 3))
		   (not:VEC_F (match_dup 4))))]

While ungt looks like:

  [(set (match_dup 3)
(ge:VEC_F (match_dup 1)
  (match_dup 2)))
   (set (match_dup 4)
(ge:VEC_F (match_dup 2)
  (match_dup 1)))
   (set (match_dup 3)
(ior:VEC_F (not:VEC_F (match_dup 3))
		   (not:VEC_F (match_dup 4))))
   (set (match_dup 4)
(gt:VEC_F (match_dup 1)
  (match_dup 2)))
   (set (match_dup 3)
(ior:VEC_F (match_dup 3)
   (match_dup 4)))]
  
> 
> It's not all comparisons including unordered, there are uneq, unordered
> itself, and ne as well.

Yes, it's not all of them; it's just a list of the comparison operators that
are missing support.

> 
>> +;; Same mode for condition true/false values and predicate operand.
>> +(define_expand "vcond_mask_<mode><mode>"
>> +  [(match_operand:VEC_I 0 "vint_operand")
>> +   (match_operand:VEC_I 1 "vint_operand")
>> +   (match_operand:VEC_I 2 "vint_operand")
>> +   (match_operand:VEC_I 3 "vint_operand")]
>> +  "VECTOR_UNIT_ALTIVEC_OR_VSX_P (<MODE>mode)"
>> +{
>> +  emit_insn (gen_vector_select_<mode> (operands[0], operands[2],
>> operands[1],
>> +				       operands[3]));
>> +  DONE;
>> +})
> 
> So is this exactly the same as vsel/xxsel?

Yes, expanded into if_then_else and ne against zero, can match their patterns.
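
(Editorial aside: an illustrative sketch of the if_then_else shape that
the existing vector_select_<mode> expand produces and that vsel/xxsel
match; the operand details are abridged, not quoted from vector.md:)

  (set (match_operand:V4SI 0 "vlogical_operand")
       (if_then_else:V4SI
	 (ne (match_operand:V4SI 3 "vlogical_operand")
	     (const_int 0))
	 (match_operand:V4SI 2 "vlogical_operand")
	 (match_operand:V4SI 1 "vlogical_operand")))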

> 
>> +;; For signed integer vectors comparison.
>> +(define_expand "vec_cmp<mode><mode>"
> 
>> +    case GEU:
>> +      emit_insn (
>> +	gen_vector_nltu<mode> (operands[0], operands[2], operands[3], tmp));
>> +      break;
>> +    case GTU:
>> +      emit_insn (gen_vector_gtu<mode> (operands[0], operands[2],
>> operands[3]));
>> +      break;
>> +    case LEU:
>> +      emit_insn (
>> +	gen_vector_ngtu<mode> (operands[0], operands[2], operands[3], tmp));
>> +      break;
>> +    case LTU:
>> +      emit_insn (gen_vector_gtu<mode> (operands[0], operands[3],
>> operands[2]));
>> +  break;
> 
> You shouldn't allow those for signed comparisons, that will only hide
> problems.

OK, moved into vec_cmpu*.

> 
> You can do all the rest with some iterator / code attribute?  Or two cases,
> one for the codes that need ops 2 and 3 swapped, one for the rest?
> 

Sorry, I tried to use code attributes here but failed.  I think the reason is
that the pattern name doesn't have <code>.  I can only get the code from
operand 1, and then have to use a "switch case"?  I can change it with one
more define_expand, but is that what we wanted?  It looks like we still need
the "case"s.

define_expand "vec_cmp"

[PATCH, rs6000 v2] Make load cost more in vectorization cost for P8/P9

2019-11-06 Thread Kewen.Lin
Hi Segher,

on 2019/11/7 1:38 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Tue, Nov 05, 2019 at 10:14:46AM +0800, Kewen.Lin wrote:
>>>> + benefits were observed on Power8 and up, we can unify it if similar
>>>> + profits are measured on Power6 and Power7.  */
>>>> +  if (TARGET_P8_VECTOR)
>>>> +return 2;
>>>> +  else
>>>> +return 1;
>>>
>>> Hrm, but you showed benchmark improvements for p9 as well?
>>>
>>
>> No significant gains but no degradation as well, so I thought it's fine to 
>> align
>> it together.  Does it make sense?
> 
> It's a bit strange at this point to do tunings for p8 that do we do not
> do for later cpus.
> 
>>> What happens if you enable this for everything as well?
>>
>> My concern was that if we enable it for everything, it's possible to 
>> introduce
>> degradation for some benchmarks on P6 or P7 where we didn't evaluate the
>> performance impact.
> 
> No one cares about p6.

OK.  :)

> 
> We reasonably expect it will work just as well on p7 as on p8 and later.
> That you haven't tested on p7 yet says something about how important that
> platform is now ;-)
> 

Yes, exactly.

>> Although it's reasonable from the point view of load latency,
>> it's possible to get worse result in the actual benchmarks based on my fine 
>> grain
>> cost adjustment experiment before.  
>>
>> Or do you suggest enabling it everywhere and solve the degradation issue if 
>> exposed?
>> I'm also fine with that.  :)
> 
> Yeah, let's just enable it everywhere.

One updated patch to enable it everywhere attached.


BR,
Kewen

---
gcc/ChangeLog

2019-11-07  Kewen Lin  

* config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Make
scalar_load, vector_load, unaligned_load and vector_gather_load cost
more to conform to hardware latency and insn cost settings.
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 5876714..1094fbd 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -4763,15 +4763,17 @@ rs6000_builtin_vectorization_cost (enum 
vect_cost_for_stmt type_of_cost,
   switch (type_of_cost)
 {
   case scalar_stmt:
-  case scalar_load:
   case scalar_store:
   case vector_stmt:
-  case vector_load:
   case vector_store:
   case vec_to_scalar:
   case scalar_to_vec:
   case cond_branch_not_taken:
 return 1;
+  case scalar_load:
+  case vector_load:
+   /* Like rs6000_insn_cost, make load insns cost a bit more.  */
+ return 2;
 
   case vec_perm:
/* Power7 has only one permute unit, make it a bit expensive.  */
@@ -4792,42 +4794,44 @@ rs6000_builtin_vectorization_cost (enum 
vect_cost_for_stmt type_of_cost,
 
   case unaligned_load:
   case vector_gather_load:
+   /* Like rs6000_insn_cost, make load insns cost a bit more.  */
if (TARGET_EFFICIENT_UNALIGNED_VSX)
- return 1;
-
-if (TARGET_VSX && TARGET_ALLOW_MOVMISALIGN)
-  {
-elements = TYPE_VECTOR_SUBPARTS (vectype);
-if (elements == 2)
-  /* Double word aligned.  */
-  return 2;
-
-if (elements == 4)
-  {
-switch (misalign)
-  {
-case 8:
-  /* Double word aligned.  */
-  return 2;
+ return 2;
 
-case -1:
-  /* Unknown misalignment.  */
-case 4:
-case 12:
-  /* Word aligned.  */
-  return 22;
+   if (TARGET_VSX && TARGET_ALLOW_MOVMISALIGN)
+ {
+   elements = TYPE_VECTOR_SUBPARTS (vectype);
+   if (elements == 2)
+ /* Double word aligned.  */
+ return 4;
 
-default:
-  gcc_unreachable ();
-  }
-  }
-  }
+   if (elements == 4)
+ {
+   switch (misalign)
+ {
+ case 8:
+   /* Double word aligned.  */
+   return 4;
+
+ case -1:
+   /* Unknown misalignment.  */
+ case 4:
+ case 12:
+   /* Word aligned.  */
+   return 44;
+
+ default:
+   gcc_unreachable ();
+ }
+ }
+ }
 
-if (TARGET_ALTIVEC)
-  /* Misaligned loads are not supported.  */
-  gcc_unreachable ();
+   if (TARGET_ALTIVEC)
+ /* Misaligned loads are not supported.  */
+ gcc_unreachable ();
 
-return 2;
+   /* Like rs6000_insn_cost, make load insns cost a bit more.  */
+   return 4;
 
   case unaligned_store:
   case vector_scatter_store:


Re: [PATCH rs6000]Fix PR92132

2019-11-07 Thread Kewen.Lin
Hi Segher,

on 2019/11/7 7:49 AM, Segher Boessenkool wrote:
> 
> The expander named "one_cmpl<mode>3":
> 
> Erm.  2, not 3 :-)
> 
> (define_expand "one_cmpl<mode>2"
>   [(set (match_operand:BOOL_128 0 "vlogical_operand")
> (not:BOOL_128 (match_operand:BOOL_128 1 "vlogical_operand")))]
>   ""
>   "")
> 
> while the define_insn is
> 
> (define_insn_and_split "*one_cmpl<mode>3_internal"
>   [(set (match_operand:BOOL_128 0 "vlogical_operand" "=<BOOL_REGS_OUTPUT>")
> 	(not:BOOL_128
> 	  (match_operand:BOOL_128 1 "vlogical_operand" "<BOOL_REGS_UNARY>")))]
>   ""
> {
> 

Ah, sorry, I didn't notice we have one_cmpl**3** for the insn while the
expand is actually one_cmpl**2**; a bit surprising.  Done.  Thanks for
pointing that out.

> etc., so you can just delete the expand and rename the insn to the proper
> name (one_cmpl<mode>2).  It sometimes is useful to have an expand like
> this if there are multiple insns that could implement this, but that is
> not the case here.
> 

OK, example like vector_select?  :)

 +(define_code_iterator fpcmpun [ungt unge unlt unle])
>>>
>>> Why these four?  Should there be more?  Should this be added to some
>>> existing iterator?
>>
>> For floating point comparison operator and vector type, currently rs6000
>> supports eq, gt, ge, *ltgt, *unordered, *ordered, *uneq (* for unnamed).
>> We can leverage gt, ge, eq for lt, le, ne, then these four left.
> 
> There are four conditions for FP: lt/gt/eq/un.  For every comparison,
> exactly one of the four is true.  If not HONOR_NANS for this mode you
> never have un, so it is one of lt/gt/eq then, just like with integers.
> 
> If we have HONOR_NANS(mode) (or !flag_finite_math_only), there are 14
> possible combinations to test for (testing for any of the four or none
> of the four is easy ;-) )
> 
> Four test just if lt, gt, eq, or un is set.  Another four test if one of
> the flags is *not* set, or said differently, if one of three flags is set:
> ordered, ne, unle, unge.  The remaining six test two flags each: ltgt, le,
> unlt, ge, ungt, uneq.

Yes, for these 14, rs6000 current support status:

  ge: vector_ge -> define_expand -> match vsx/altivec insn
  gt: vector_gt -> define_expand -> match vsx/altivec insn
  eq: vector_eq -> define_expand -> match vsx/altivec insn
  
  ltgt: *vector_ltgt -> define_insn_and_split
  ord: *vector_ordered -> define_insn_and_split
  unord: *vector_unordered -> define_insn_and_split
  uneq: *vector_uneq -> define_insn_and_split

  ne: no RTL pattern.
  lt: Likewise.
  le: Likewise.
  unge: Likewise.
  ungt: Likewise.
  unle: Likewise.
  unlt: Likewise.

Since I thought un{ge,gt,le,lt} were a bit more complicated than ne/lt/le (a
wrong assumption, actually), I added specific define_expands for them.
Following your simpler example below, I've added the RTL patterns with
define_expands for the missing ne, lt, le, unge, ungt, unle, unlt.

I didn't use iterators any more, since without further refactoring only a few
patterns (2 per pair) can be shared with iterators, and we'd need to check
<code> to decide whether to swap or not.  Maybe a subsequent uniform
refactoring patch is required to make that work?

> 
>> I originally wanted to merge them into the existing unordered or uneq, but
>> I found it's hard to share their existing patterns.  For example, the uneq
>> looks like:
>>
>>   [(set (match_dup 3)
>>  (gt:VEC_F (match_dup 1)
>>(match_dup 2)))
>>(set (match_dup 4)
>>  (gt:VEC_F (match_dup 2)
>>(match_dup 1)))
>>(set (match_dup 0)
>>  (and:VEC_F (not:VEC_F (match_dup 3))
>> (not:VEC_F (match_dup 4]
> 
> Or ge/ge/eqv, etc. -- there are multiple options.
> 
>> While ungt looks like:
>>
>>   [(set (match_dup 3)
>>  (ge:VEC_F (match_dup 1)
>>(match_dup 2)))
>>(set (match_dup 4)
>>  (ge:VEC_F (match_dup 2)
>>(match_dup 1)))
>>(set (match_dup 3)
>>  (ior:VEC_F (not:VEC_F (match_dup 3))
>> (not:VEC_F (match_dup 4
>>(set (match_dup 4)
>>  (gt:VEC_F (match_dup 1)
>>(match_dup 2)))
>>(set (match_dup 3)
>>  (ior:VEC_F (match_dup 3)
>> (match_dup 4)))]
> 
> (set (match_dup 3)
>  (ge:VEC_F (match_dup 2)
>(match_dup 1)))
> (set (match_dup 0)
>  (not:VEC_F (match_dup 3)))
> 
> should be enough?
> 

Nice!  I was stuck trying to get unordered first.  :(

> 
> So we have only gt/ge/eq.
> 
> I think the following are ooptimal (not tested!):
> 
> lt(a,b) = gt(b,a)
Yes, this is what I used for that operator.

> gt(a,b) = gt(a,b)
> eq(a,b) = eq(a,b)
> un(a,b) = ~(ge(a,b) | ge(b,a))
> 

existing code uses (~ge(a,b) & ~ge(b,a))
but should be the same.

> ltgt(a,b) = ge(a,b) ^ ge(b,a)

existing code uses gt(a,b) | gt(b,a)
but should be the same.

> le(a,b)   = ge(b,a)
> unlt(a,b) = ~ge(a,b)
> ge(a,b)   = ge(a,b)
> ungt(a,b) = ~ge(b,a)
> uneq(a,b) = ~(ge(a,b) ^ ge(b,a))
> 

existing code uses ~gt(a,b) & ~gt(b,a)
but should be the same.

> ord(a,b)  = ge(a,b) | ge(b,a)
> ne(a,b)   = ~eq(a,b)
> unle(a,b) = ~gt(a,b)
> ung

Re: [PATCH, rs6000 v2] Make load cost more in vectorization cost for P8/P9

2019-11-07 Thread Kewen.Lin
Hi Segher,

on 2019/11/8 6:36 AM, Segher Boessenkool wrote:
> On Thu, Nov 07, 2019 at 11:22:12AM +0800, Kewen.Lin wrote:
>> One updated patch to enable it everywhere attached.
> 
>> 2019-11-07  Kewen Lin  
>>
>>  * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Make
>>  scalar_load, vector_load, unaligned_load and vector_gather_load cost
>>  more to conform hardware latency and insn cost settings.
> 
>>case unaligned_load:
>>case vector_gather_load:
> 
> ...
> 
>> -  /* Word aligned.  */
>> -  return 22;
> 
>> +/* Word aligned.  */
>> +return 44;
> 
> I don't think it should go up from 22 all the way to 44 (not all insns here
> are loads).  But exact cost doesn't really matter.  Make it 30 perhaps?
> 

Good point, I'll try the cost 33 (the avg. of 22 and 44).

> 44 (as well as 22) are awfully precise numbers for a very imprecise cost
> like this ;-)

Yep!  ;-)

> 
> With either cost, whatever seems reasonable to you and works well in your
> tests: approved for trunk.  Thanks!

Thanks!  I'll kick off two regression testing on both BE and LE with new cost,
then commit it if everything goes well.


BR,
Kewen



Re: [PATCH rs6000]Fix PR92132

2019-11-07 Thread Kewen.Lin
Hi Segher,

on 2019/11/8 8:07 AM, Segher Boessenkool wrote:
> Hi!
> 
>>> Half are pretty simple:
>>>
>>> lt(a,b) = gt(b,a)
>>> gt(a,b) = gt(a,b)
>>> eq(a,b) = eq(a,b)
>>> le(a,b) = ge(b,a)
>>> ge(a,b) = ge(a,b)
>>>
>>> ltgt(a,b) = ge(a,b) ^ ge(b,a)
>>> ord(a,b)  = ge(a,b) | ge(b,a)
>>>
>>> The other half are the negations of those:
>>>
>>> unge(a,b) = ~gt(b,a)
>>> unle(a,b) = ~gt(a,b)
>>> ne(a,b)   = ~eq(a,b)
>>> ungt(a,b) = ~ge(b,a)
>>> unlt(a,b) = ~ge(a,b)
>>>
>>> uneq(a,b) = ~(ge(a,b) ^ ge(b,a))
>>> un(a,b) = ~(ge(a,b) | ge(b,a))
>>
>> Awesome!  Do you suggest refactoring on them?  :)
> 
> I'd do the first five in one pattern (which then swaps two ops and the
> condition in the lt and le case), and the other five in another pattern.
> And the rest in two or four patterns?  Just try it out, see what works
> well.  It helps to do a bunch together in one pattern, but if that then
> turns into special cases for everything, more might be lost than gained.> 

Got it, I'll make a refactoring patch for this part later.

> 
>>> 8 codes, ordered:    never     lt   gt   ltgt eq   le   ge   ordered
>>> 8 codes, unordered:  unordered unlt ungt ne   uneq unle unge always
>>> 8 codes, fast-math:  never     lt   gt   ne   eq   le   ge   always
>>> 8 codes, non-fp:     never     lt   gt   ne   eq   le   ge   always
>>
>> Sorry, I don't quite follow this table.  What's the column heads?
> 
> The first row is the eight possible fp conditions that are not always
> true if unordered is set; the second row is those that *are* always true
> if it is set.  The other two rows (which are the same) is just the eight
> conditions that do not test unordered at all.
> 
> The tricky one is "ne": for FP *with* NaNs, "ne" means "less than, or
> greater than, or unordered", while without NaNs (i.e. -ffast-math) it
> means "less than, or greater than".
> 
> You could write the column heads as
> --/--/--  lt/--/--  --/gt/--  lt/gt/--  --/--/eq  lt/--/eq  --/gt/eq  lt/gt/eq
> if that helps?  Just the eight combinations of the first three flags.
> 

Thanks a lot for the explanation.  It's helpful!
 

>> +;; For signed integer vectors comparison.
>> +(define_expand "vec_cmp<mode><mode>"
>> +  [(set (match_operand:VEC_I 0 "vint_operand")
>> +	(match_operator 1 "signed_or_equality_comparison_operator"
>> +	  [(match_operand:VEC_I 2 "vint_operand")
>> +	   (match_operand:VEC_I 3 "vint_operand")]))]
>> +  "VECTOR_UNIT_ALTIVEC_OR_VSX_P (<MODE>mode)"
>> +{
>> +  enum rtx_code code = GET_CODE (operands[1]);
>> +  rtx tmp = gen_reg_rtx (<MODE>mode);
>> +  switch (code)
>> +    {
>> +    case NE:
>> +      emit_insn (gen_vector_eq<mode> (operands[0], operands[2],
>> operands[3]));
>> +      emit_insn (gen_one_cmpl<mode>2 (operands[0], operands[0]));
>> +      break;
>> +    case EQ:
>> +      emit_insn (gen_vector_eq<mode> (operands[0], operands[2],
>> operands[3]));
>> +      break;
>> +    case GE:
>> +      emit_insn (gen_vector_nlt<mode> (operands[0], operands[2], operands[3],
>> +				       tmp));
>> +      break;
>> +    case GT:
>> +      emit_insn (gen_vector_gt<mode> (operands[0], operands[2],
>> operands[3]));
>> +      break;
>> +    case LE:
>> +      emit_insn (gen_vector_ngt<mode> (operands[0], operands[2],
>> operands[3],
>> +				       tmp));
>> +      break;
>> +    case LT:
>> +      emit_insn (gen_vector_gt<mode> (operands[0], operands[3],
>> operands[2]));
>> +  break;
>> +default:
>> +  gcc_unreachable ();
>> +  break;
>> +}
>> +  DONE;
>> +})
> 
> I would think this can be done easier, but it is alright for now, it can
> be touched up later if we want.
> 
>> +;; For float point vectors comparison.
>> +(define_expand "vec_cmp<mode><VEC_int>"
> 
> This, too.
> 
>> +  [(set (match_operand:<VEC_INT> 0 "vint_operand")
>> + (match_operator 1 "comparison_operator"
> 
> If you make an iterator for this instead, it is simpler code (you can then
> use <code> to do all these cases in one statement).

If my understanding is correct, and based on some earlier tries, I think we
have to leave these **CASEs** there (at least at the 1st level define_expand
for vec_cmp*), since vec_cmp* doesn't have a <code> field in the pattern name.
The code can only be extracted from operand 1.  I tried to add one dummy
operand to hold <code> but it's impractical.

Sorry, I may miss something here, I'm happy to make a subsequent patch to
uniform these cases if there is a good way to run a code iterator on them.

> 
> But that can be done later.  Okay for trunk.  Thanks!
> 

Many thanks for your time!


BR,
Kewen



Re: [PATCH rs6000]Fix PR92132

2019-11-10 Thread Kewen.Lin
Hi Segher,

on 2019/11/9 1:36 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Fri, Nov 08, 2019 at 10:38:13AM +0800, Kewen.Lin wrote:
>>>> +  [(set (match_operand:<VEC_INT> 0 "vint_operand")
>>>> +	(match_operator 1 "comparison_operator"
>>>
>>> If you make an iterator for this instead, it is simpler code (you can then
>>> use <code> to do all these cases in one statement).
>>
>> If my understanding is correct, and based on some earlier tries, I think we
>> have to leave these **CASEs** there (at least at the 1st level define_expand
>> for vec_cmp*), since vec_cmp* doesn't have a <code> field in the pattern name.
>> The code can only be extracted from operand 1.  I tried to add one dummy
>> operand to hold <code> but it's impractical.
>>
>> Sorry, I may miss something here, I'm happy to make a subsequent patch to
>> uniform these cases if there is a good way to run a code iterator on them.
> 
> Instead of
> 
>   [(set (match_operand:VEC_I 0 "vint_operand")
>   (match_operator 1 "signed_or_equality_comparison_operator"
> [(match_operand:VEC_I 2 "vint_operand")
>  (match_operand:VEC_I 3 "vint_operand")]))]
> 
> you can do
> 
>   [(set (match_operand:VEC_I 0 "vint_operand")
>   (some_iter:VEC_I (match_operand:VEC_I 1 "vint_operand")
>(match_operand:VEC_I 2 "vint_operand")))]
> 

Thanks for your example.  But I'm afraid that it doesn't work for these 
patterns.

I tried it with simple code below:

; For testing
(define_code_iterator some_iter [eq gt])

(define_expand "vec_cmp"
  [(set (match_operand:VEC_I 0 "vint_operand")
(some_iter:VEC_I
  (match_operand:VEC_I 2 "vint_operand")
  (match_operand:VEC_I 3 "vint_operand")))]
  "VECTOR_UNIT_ALTIVEC_OR_VSX_P (mode)"
{
  emit_insn (gen_vector_ (operands[0], operands[2], operands[3]));
  DONE;
})

Error messages were emitted:

/home/linkw/gcc/gcc-git-fix/gcc/config/rs6000/vector.md:531:1: duplicate 
definition of 'vec_cmpv16qiv16qi'
/home/linkw/gcc/gcc-git-fix/gcc/config/rs6000/vector.md:531:1: duplicate 
definition of 'vec_cmpv8hiv8hi'
/home/linkw/gcc/gcc-git-fix/gcc/config/rs6000/vector.md:531:1: duplicate 
definition of 'vec_cmpv4siv4si'
/home/linkw/gcc/gcc-git-fix/gcc/config/rs6000/vector.md:531:1: duplicate 
definition of 'vec_cmpv2div2di'

It's expected, since the pattern here is vec_cmp<mode><mode> rather than
vec_cmp<code><mode>; your example would work perfectly for the latter.
Btw, in that pattern, the comparison operator is passed in operand 1.


BR,
Kewen

> with some_iter some code_iterator, (note you need to renumber), and in the
> body you can then just use <code> (or <CODE>, or some other code_attribute).
> 
> code_iterator is more flexible than match_operator, in most ways.
> 
> 
> Segher
> 



[PATCH, rs6000] Refactor FP vector comparison operators

2019-11-10 Thread Kewen.Lin
Hi,

This is a subsequent patch to refactor the existing floating point
vector comparison operator support.  The patch fixing PR92132
supplemented vector floating point comparison by exposing the names
for unordered/ordered/uneq/ltgt and adding ungt/unge/unlt/unle/ne.
As Segher pointed out, some patterns can be refactored together.
The main link on this is:
https://gcc.gnu.org/ml/gcc-patches/2019-11/msg00452.html


The refactoring mainly follows the below patterns:

pattern 1:
  lt(a,b) = gt(b,a)
  le(a,b) = ge(b,a)

pattern 2:
  unge(a,b) = ~gt(b,a)
  unle(a,b) = ~gt(a,b)
  ne(a,b)   = ~eq(a,b)
  ungt(a,b) = ~ge(b,a)
  unlt(a,b) = ~ge(a,b)

pattern 3:
  ltgt: gt(a,b) | gt(b,a)
  ordered: ge(a,b) | ge(b,a)

pattern 4:
  uneq: ~gt(a,b) & ~gt(b,a)
  unordered: ~ge(a,b) & ~ge(b,a)
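
(Editorial aside: as a concrete instance of pattern 2, a split for unlt
on V4SF could emit just two insns -- an illustrative RTL shape only; the
real define_insn_and_splits follow in the patch:)

  (set (match_dup 3)
       (ge:V4SF (match_dup 1) (match_dup 2)))   ; ge(a,b)
  (set (match_dup 0)
       (not:V4SF (match_dup 3)))                ; unlt(a,b) = ~ge(a,b)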

Naming the code iterators and attributes is really knotty for me :(.

Regression testing just launched.

BR,
Kewen

---
gcc/ChangeLog

2019-11-11 Kewen Lin  

* config/rs6000/vector.md (vec_fp_cmp1): New code iterator.
(vec_fp_cmp2): Likewise.
(vec_fp_cmp3): Likewise.
(vec_fp_cmp4): Likewise.
(vec_fp_cmp1_attr): New code attribute.
(vec_fp_cmp2_attr): Likewise.
(vec_fp_cmp3_attr): Likewise.
(vec_fp_cmp4_attr): Likewise.
(vector_<code><mode> for VEC_F and vec_fp_cmp1): New
define_and_split.
(vector_<code><mode> for VEC_F and vec_fp_cmp2): Likewise.
(vector_<code><mode> for VEC_F and vec_fp_cmp3): Likewise.
(vector_<code><mode> for VEC_F and vec_fp_cmp4): Likewise.
(vector_lt<mode> for VEC_F): Refactor with vec_fp_cmp1.
(vector_le<mode> for VEC_F): Likewise.
(vector_unge<mode> for VEC_F): Refactor with vec_fp_cmp2.
(vector_unle<mode> for VEC_F): Likewise.
(vector_ne<mode> for VEC_F): Likewise.
(vector_ungt<mode> for VEC_F): Likewise.
(vector_unlt<mode> for VEC_F): Likewise.
(vector_ltgt<mode> for VEC_F): Refactor with vec_fp_cmp3.
(vector_ordered<mode> for VEC_F): Likewise.
(vector_uneq<mode> for VEC_F): Refactor with vec_fp_cmp4.
(vector_unordered<mode> for VEC_F): Likewise.
diff --git a/gcc/config/rs6000/vector.md b/gcc/config/rs6000/vector.md
index b132037..be2d425 100644
--- a/gcc/config/rs6000/vector.md
+++ b/gcc/config/rs6000/vector.md
@@ -107,6 +107,31 @@
 (smin "smin")
 (smax "smax")])
 
+;; code iterators and attributes for vector FP comparison operators:
+
+;; 1. lt and le.
+(define_code_iterator vec_fp_cmp1 [lt le])
+(define_code_attr vec_fp_cmp1_attr [(lt "gt")
+   (le "ge")])
+
+; 2. unge, unle, ne, ungt and unlt.
+(define_code_iterator vec_fp_cmp2 [unge unle ne ungt unlt])
+(define_code_attr vec_fp_cmp2_attr [(unge "gt")
+   (unle "gt")
+   (ne   "eq")
+   (ungt "ge")
+   (unlt "ge")])
+
+;; 3. ltgt and ordered.
+(define_code_iterator vec_fp_cmp3 [ltgt ordered])
+(define_code_attr vec_fp_cmp3_attr [(ltgt "gt")
+   (ordered "ge")])
+
+;; 4. uneq and unordered.
+(define_code_iterator vec_fp_cmp4 [uneq unordered])
+(define_code_attr vec_fp_cmp4_attr [(uneq "gt")
+   (unordered "ge")])
+
 
 ;; Vector move instructions.  Little-endian VSX loads and stores require
 ;; special handling to circumvent "element endianness."
@@ -665,88 +690,6 @@
   DONE;
 })
 
-; lt(a,b) = gt(b,a)
-(define_expand "vector_lt<mode>"
-  [(set (match_operand:VEC_F 0 "vfloat_operand")
-	(lt:VEC_F (match_operand:VEC_F 1 "vfloat_operand")
-		  (match_operand:VEC_F 2 "vfloat_operand")))]
-  "VECTOR_UNIT_ALTIVEC_OR_VSX_P (<MODE>mode)"
-{
-  emit_insn (gen_vector_gt<mode> (operands[0], operands[2], operands[1]));
-  DONE;
-})
-
-; le(a,b) = ge(b,a)
-(define_expand "vector_le<mode>"
-  [(set (match_operand:VEC_F 0 "vfloat_operand")
-	(le:VEC_F (match_operand:VEC_F 1 "vfloat_operand")
-		  (match_operand:VEC_F 2 "vfloat_operand")))]
-  "VECTOR_UNIT_ALTIVEC_OR_VSX_P (<MODE>mode)"
-{
-  emit_insn (gen_vector_ge<mode> (operands[0], operands[2], operands[1]));
-  DONE;
-})
-
-; ne(a,b) = ~eq(a,b)
-(define_expand "vector_ne<mode>"
-  [(set (match_operand:VEC_F 0 "vfloat_operand")
-	(ne:VEC_F (match_operand:VEC_F 1 "vfloat_operand")
-		  (match_operand:VEC_F 2 "vfloat_operand")))]
-  "VECTOR_UNIT_ALTIVEC_OR_VSX_P (<MODE>mode)"
-{
-  emit_insn (gen_vector_eq<mode> (operands[0], operands[1], operands[2]));
-  emit_insn (gen_one_cmpl<mode>2 (operands[0], operands[0]));
-  DONE;
-})
-
-; unge(a,b) = ~gt(b,a)
-(define_expand "vector_unge<mode>"
-  [(set (match_operand:VEC_F 0 "vfloat_operand")
-	(unge:VEC_F (match_operand:VEC_F 1 "vfloat_operand")
-		    (match_operand:VEC_F 2 "vfloat_operand")))]
-  "VECTOR_UNIT_ALTIVEC_OR_VSX_P (<MODE>mode)"
-{
-  emit_insn (gen_vector_gt<mode> (operands[0], operands[2], operands[1]));
-  emit_insn (gen_one_cmpl<mode>2 (operands[0], operands[0]));
-  DONE

Re: [PATCH, rs6000] Refactor FP vector comparison operators

2019-11-12 Thread Kewen.Lin
Hi Segher,

on 2019/11/11 8:51 PM, Segher Boessenkool wrote:
> Hi!
> 
>> pattern 1:
>>   lt(a,b) = gt(b,a)
>>   le(a,b) = ge(b,a)
> 
> This is done by swap_condition normally.

Nice!  Done.

> 
>> pattern 2:
>>   unge(a,b) = ~gt(b,a)
>>   unle(a,b) = ~gt(a,b)
>>   ne(a,b)   = ~eq(a,b)
>>   ungt(a,b) = ~ge(b,a)
>>   unlt(a,b) = ~ge(a,b)
> 
> This is reverse_condition_maybe_unordered (and a swap, in two cases).
> 

Nice!  Done.
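
(Editorial aside: a hedged illustration of how the two helpers named
above compose, e.g. for UNGE; both functions are real GCC helpers, the
fragment itself is hypothetical:)

  /* unge(a,b) = ~lt(a,b) = ~gt(b,a):  */
  code = reverse_condition_maybe_unordered (code);  /* UNGE -> LT  */
  code = swap_condition (code);                     /* LT -> GT  */
  std::swap (operands[1], operands[2]);
  /* ... and negate the comparison result afterwards.  */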

> 
>> pattern 4:
>>   uneq: ~gt(a,b) & ~gt(b,a)
>>   unordered: ~ge(a,b) & ~ge(b,a)
> 
> That is 3, reversed.
> 

Yes, merge them.

> 
>> +; 1. For lt and le:
>> +; lt(a,b) = gt(b,a)
>> +; le(a,b) = ge(b,a)
>> +(define_insn_and_split "vector_<code><mode>"
>>    [(set (match_operand:VEC_F 0 "vfloat_operand")
>> +	(vec_fp_cmp1:VEC_F (match_operand:VEC_F 1 "vfloat_operand")
>> +			   (match_operand:VEC_F 2 "vfloat_operand")))]
>>    "VECTOR_UNIT_ALTIVEC_OR_VSX_P (<MODE>mode)"
>>    "#"
>>    ""
>> +  [(set (match_dup 0)
>> +	(<vec_fp_cmp1_attr>:VEC_F (match_dup 2)
>> +				  (match_dup 1)))]
>>  {
>>  })
> 
> Empty preparation statements (the {}) can just be left out completely.
> 

OK, thanks!

> The split condition "" is incorrect, it should be "&& 1": if it starts
> with "&&", the insn condition is included.
> 

Got it, thanks!
> 
> So maybe it is simplest to *do* use match_operator here, handle all of
> lt gt le ge eq unge unle ungt unlt ne  in one define_expand, which then
> swaps the condition and the args, and expand the extra not for the five
> where that is needed?
> 

I still used define_insn_and_split to support the RTL pattern recognition, so
I didn't use match_operator there, but I put le, lt, ne, unge, ungt, unlt and
unle together in one define_insn_and_split as you suggested.

> 
>> +;; 3. ltgt and ordered.
>> +(define_code_iterator vec_fp_cmp3 [ltgt ordered])
>> +(define_code_attr vec_fp_cmp3_attr [(ltgt "gt")
>> +(ordered "ge")])
>> +
>> +;; 4. uneq and unordered.
>> +(define_code_iterator vec_fp_cmp4 [uneq unordered])
>> +(define_code_attr vec_fp_cmp4_attr [(uneq "gt")
>> +(unordered "ge")])
> 
> And then another one for  ltgt uneq ordered unordered  perhaps?
> 

Done.

> So you'll need to define two new predicates then.  Something like
> vector_fp_comparison_operator and, erm, vector_fp_extra_comparison_operator,
> which kind of sucks as a name, but there is only one ;-)

Thanks!  I used those names for the code_iterators in the define_insn_and_splits now.

> 
> Sorry for sending you first one way, and then back the other.
> 

It really doesn't matter!

The updated patch is attached.  Since we check those codes and generate the
related instructions in the preparation statements of the define_insn_and_splits,
I had one idea: factor those two parts out and merge them into one common split
function (moved to rs6000.c)?  Does it sound better?
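
To make the question concrete, here is a minimal sketch of what such a common
helper could look like (the function name and exact shape are hypothetical; it
only covers the single-instruction codes, reusing swap_condition and
reverse_condition_maybe_unordered as discussed above):

/* Hypothetical helper: emit DEST = OP0 <CODE> OP1 for a vector FP
   comparison, using only the gt/ge/eq patterns plus an optional
   complement.  */
static void
rs6000_emit_vector_fp_compare (rtx dest, enum rtx_code code, rtx op0, rtx op1)
{
  machine_mode mode = GET_MODE (dest);
  bool invert = false;

  /* ne and the un* codes are the complements of ordered codes.  */
  if (code == NE || code == UNGE || code == UNLE
      || code == UNGT || code == UNLT)
    {
      code = reverse_condition_maybe_unordered (code);
      invert = true;
    }

  /* lt/le are gt/ge with the operands swapped.  */
  if (code == LT || code == LE)
    {
      code = swap_condition (code);
      std::swap (op0, op1);
    }

  /* Now CODE is one of GT, GE or EQ, which have direct instructions.  */
  emit_insn (gen_rtx_SET (dest, gen_rtx_fmt_ee (code, mode, op0, op1)));
  if (invert)
    emit_insn (gen_rtx_SET (dest, gen_rtx_NOT (mode, copy_rtx (dest))));
}

The ltgt/uneq/ordered/unordered cases would still need their two-comparison
sequences on top of this.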


BR,
Kewen

---
gcc/ChangeLog

2019-11-12 Kewen Lin  

* config/rs6000/vector.md (vector_fp_comparison_operator):
New code iterator.
(vector_fp_extra_comparison_operator): Likewise.
(vector_<code><mode> for VEC_F and
vector_fp_comparison_operator): New define_insn_and_split.
(vector_<code><mode> for VEC_F and
vector_fp_extra_comparison_operator): Likewise.
(vector_lt<mode> for VEC_F): Refactor with
vector_fp_comparison_operator.
(vector_le<mode> for VEC_F): Likewise.
(vector_unge<mode> for VEC_F): Likewise.
(vector_unle<mode> for VEC_F): Likewise.
(vector_ne<mode> for VEC_F): Likewise.
(vector_ungt<mode> for VEC_F): Likewise.
(vector_unlt<mode> for VEC_F): Likewise.
(vector_ltgt<mode> for VEC_F): Refactor with
vector_fp_extra_comparison_operator.
(vector_ordered<mode> for VEC_F): Likewise.
(vector_uneq<mode> for VEC_F): Likewise.
(vector_unordered<mode> for VEC_F): Likewise.
diff --git a/gcc/config/rs6000/vector.md b/gcc/config/rs6000/vector.md
index b132037..9919fd4 100644
--- a/gcc/config/rs6000/vector.md
+++ b/gcc/config/rs6000/vector.md
@@ -107,6 +107,12 @@
 (smin "smin")
 (smax "smax")])
 
+;; code iterators and attributes for vector FP comparison operators:
+(define_code_iterator vector_fp_comparison_operator [lt le ne
+						      ungt unge unlt unle])
+(define_code_iterator vector_fp_extra_comparison_operator [ltgt uneq
+							    unordered ordered])
+
 
 ;; Vector move instructions.  Little-endian VSX loads and stores require
 ;; special handling to circumvent "element endianness."
@@ -665,88 +671,6 @@
   DONE;
 })
 
-; lt(a,b) = gt(b,a)
-(define_expand "vector_lt<mode>"
-  [(set (match_operand:VEC_F 0 "vfloat_operand")
-	(lt:VEC_F (match_operand:VEC_F 1 "vfloat_operand")
-		  (match_operand:VEC_F 2 "vfloat_operand")))]
-  "VECTOR_UNIT_ALTIVEC_OR_VSX_P (<MODE>mode)"
-{
-  emit_insn (gen_vector_gt<mode> (operands[0], operands[2], operands[1]));
-  DONE;
-})
-
-; le(a,b) = ge(b,a)
-(define_expand "vecto

[PATCH, testsuite] Fix PR92464 by adjust test case loop bound

2019-11-12 Thread Kewen.Lin
Hi,

As PR92464 shows, the recent vectorization cost adjustment on load
insns is responsible for this regression.  It changes the minimal
profitable iteration count from 19 to 12, and the case happens to
hit the new threshold.  Actual runtime measurement shows that the
vectorized version performs on par with the previously non-vectorized
version, so vectorizing at 12 iterations is actually fine.  To keep
the case sensitive to high peeling cost, this patch adjusts the loop
bound from 16 to 14, which keeps the iteration count just below the
new threshold.

Verified on ppc64-redhat-linux (BE P7) and powerpc64le-linux-gnu
(LE P8). 


BR,
Kewen

-

gcc/testsuite/ChangeLog

2019-11-13  Kewen Lin  

PR target/92464
* gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c: Adjust
loop bound due to load cost adjustment.


diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c 
b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c
index 4a7da2e..1bb064e 100644
--- a/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c
@@ -4,7 +4,7 @@
 #include 
 #include "../../tree-vect.h"

-#define N 16
+#define N 14
 #define OFF 4

 /* Check handling of accesses for which the "initial condition" -



[PATCH] Fix typo and avoid possible memory leak

2020-01-12 Thread Kewen.Lin
Hi,

Function average_num_loop_insns forgets to free the loop body in its
early return.  Besides, the overflow comparison checks 1000000 (1e6)
but the return value is 100000 (1e5); I guess that's unexpected, a typo?

Bootstrapped and regress tested on powerpc64le-linux-gnu.  
I guess this should go to GCC11? Is it ok?

BR,
Kewen

gcc/ChangeLog

2020-01-13  Kewen Lin  

* cfgloopanal.c (average_num_loop_insns): Free bbs when early return,
fix typo on return value.

 

diff --git a/gcc/cfgloopanal.c b/gcc/cfgloopanal.c
index 199c20b..65d239a 100644
--- a/gcc/cfgloopanal.c
+++ b/gcc/cfgloopanal.c
@@ -219,7 +219,10 @@ average_num_loop_insns (const class loop *loop)
   ninsns += (sreal)binsns * bb->count.to_sreal_scale (loop->header->count);
   /* Avoid overflows.   */
   if (ninsns > 1000000)
-    return 100000;
+    {
+      free (bbs);
+      return 1000000;
+    }
 }
   free (bbs);



Re: [PATCH] Fix typo and avoid possible memory leak

2020-01-14 Thread Kewen.Lin
on 2020/1/13 6:46 PM, Richard Sandiford wrote:
> "Kewen.Lin"  writes:
>> Hi,
>>
>> Function average_num_loop_insns forgets to free the loop body in its early return.
>> Besides, the overflow comparison checks 1000000 (1e6) but the return value is
>> 100000 (1e5); I guess that's unexpected, a typo?
>>
>> Bootstrapped and regress tested on powerpc64le-linux-gnu.  
>> I guess this should go to GCC11? Is it ok?
> 
> OK for GCC 10, thanks.  This is a regression from GCC 7.
> 
Hi Richard,

Thanks for correcting me, it's indeed a regression!  Committed in b38e86ddb7a9.

BR,
Kewen



[PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling

2020-01-16 Thread Kewen.Lin
Hi,

As we discussed in the thread
https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
I'm working to teach IVOPTs to consider D-form group accesses during unrolling.
The difference between D-form and the other forms during unrolling is that with
D-form we can put the stride into the displacement field and so avoid the
additional step increments, e.g.:

With X-form (UF step increments):
  ...
  LD A = baseA, X
  LD B = baseB, X
  ST C = baseC, X
  X = X + stride
  LD A = baseA, X
  LD B = baseB, X
  ST C = baseC, X
  X = X + stride
  LD A = baseA, X
  LD B = baseB, X
  ST C = baseC, X
  X = X + stride
  ...

With D-form (one step increment for each base):
  ...
  LD A = baseA, OFF
  LD B = baseB, OFF
  ST C = baseC, OFF
  LD A = baseA, OFF+stride
  LD B = baseB, OFF+stride
  ST C = baseC, OFF+stride
  LD A = baseA, OFF+2*stride
  LD B = baseB, OFF+2*stride
  ST C = baseC, OFF+2*stride
  ...
  baseA += stride * uf
  baseB += stride * uf
  baseC += stride * uf

Imagine the loop gets unrolled 8 times: then it takes 3 step updates with
D-form vs. 8 step updates with X-form.  Here we only need to check that the
stride meets the D-form field requirement, since if OFF doesn't meet it, we
can construct baseA' from baseA + OFF.
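
As a purely illustrative source-level example, a loop of the shape below
produces exactly this baseA/baseB/baseC access pattern (stride 8 for double):

void
fma_loop (double *baseA, double *baseB, double *baseC, long n)
{
  /* After unrolling by UF, D-form can keep one register per base with
     displacements 0, 8, ..., (UF - 1) * 8, while X-form needs a step
     update of the index register for every unrolled copy.  */
  for (long i = 0; i < n; i++)
    baseC[i] = baseA[i] * baseB[i];
}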

This patch set consists of four parts:
 
  [PATCH 1/4 GCC11] Add middle-end unroll factor estimation

 Add unroll factor estimation in the middle-end.  It mainly follows the
 current RTL unroll factor determination in function decide_unrolling and its
 sub calls.  As Richard B. suggested, we probably can force the unroll factor
 with this and avoid duplicated unroll factor calculation, but I think it
 needs more benchmarking work and should be handled separately.

  [PATCH 2/4 GCC11] Add target hook stride_dform_valid_p 

 Add one target hook to determine whether the current memory access with
 the given mode, stride and other flags has available D-form support.
 
  [PATCH 3/4 GCC11] IVOPTs Consider cost_step on different forms during 
unrolling

 Teach IVOPTs to identify address type iv groups that prefer D-form,
 and flag dform_p on their derived iv cands.  Considering the unroll factor,
 increase the iv cost by (uf - 1) * cost_step if it's not a dform iv cand.
 
  [PATCH 4/4 GCC11] rs6000: P9 D-form test cases

 Add some test cases, mainly copied from Kelvin's patch.

Bootstrapped and regress tested on powerpc64le-linux-gnu.
I'll take two weeks leave soon, please expect late responses.
Thanks a lot in advance!

BR,
Kewen



 gcc/cfgloop.h   |   3 +
 gcc/config/rs6000/rs6000.c  |  56 -
 gcc/doc/tm.texi |  14 +
 gcc/doc/tm.texi.in  |   4 ++
 gcc/target.def  |  21 ++-
 gcc/testsuite/gcc.target/powerpc/p9-dform-0.c   |  43 +
 gcc/testsuite/gcc.target/powerpc/p9-dform-1.c   |  55 +
 gcc/testsuite/gcc.target/powerpc/p9-dform-2.c   |  12 
 gcc/testsuite/gcc.target/powerpc/p9-dform-3.c   |  15 +
 gcc/testsuite/gcc.target/powerpc/p9-dform-4.c   |  12 
 gcc/testsuite/gcc.target/powerpc/p9-dform-generic.h |  34 +++
 gcc/tree-ssa-loop-ivopts.c  |  84 
+-
 gcc/tree-ssa-loop-manip.c   | 254 
+
 gcc/tree-ssa-loop-manip.h   |   3 +-
 gcc/tree-ssa-loop.c |  33 ++
 gcc/tree-ssa-loop.h |   2 +
 16 files changed, 640 insertions(+), 5 deletions(-)



[PATCH 1/4 GCC11] Add middle-end unroll factor estimation

2020-01-16 Thread Kewen.Lin
gcc/ChangeLog

2020-01-16  Kewen Lin  

* cfgloop.h (struct loop): New field estimated_uf.
* config/rs6000/rs6000.c (TARGET_LOOP_UNROLL_ADJUST_TREE): New macro.
(rs6000_loop_unroll_adjust_tree): New function.
* doc/tm.texi: Regenerate.
* doc/tm.texi.in (TARGET_LOOP_UNROLL_ADJUST_TREE): New hook.
* target.def (loop_unroll_adjust_tree): New hook.
* tree-ssa-loop-manip.c (decide_uf_const_iter): New function.
(decide_uf_runtime_iter): Likewise.
(decide_uf_stupid): Likewise.
(estimate_unroll_factor): Likewise.
* tree-ssa-loop-manip.h (estimate_unroll_factor): New declare.
* tree-ssa-loop.c (tree_average_num_loop_insns): New function.
* tree-ssa-loop.h (tree_average_num_loop_insns): New declare.
 gcc/cfgloop.h  |   3 +
 gcc/config/rs6000/rs6000.c |  16 ++-
 gcc/doc/tm.texi|   6 ++
 gcc/doc/tm.texi.in |   2 +
 gcc/target.def |   8 ++
 gcc/tree-ssa-loop-manip.c  | 254 +
 gcc/tree-ssa-loop-manip.h  |   3 +-
 gcc/tree-ssa-loop.c|  33 ++
 gcc/tree-ssa-loop.h|   2 +
 9 files changed, 324 insertions(+), 3 deletions(-)

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index e3590d7..feceed6 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -232,6 +232,9 @@ public:
  Other values means unroll with the given unrolling factor.  */
   unsigned short unroll;
 
+  /* Like unroll field above, but it's estimated in middle-end.  */
+  unsigned short estimated_uf;
+
   /* If this loop was inlined the main clique of the callee which does
  not need remapping when copying the loop body.  */
   unsigned short owned_clique;
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 2995348..0dabaa6 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1431,6 +1431,9 @@ static const struct attribute_spec 
rs6000_attribute_table[] =
 #undef TARGET_LOOP_UNROLL_ADJUST
 #define TARGET_LOOP_UNROLL_ADJUST rs6000_loop_unroll_adjust
 
+#undef TARGET_LOOP_UNROLL_ADJUST_TREE
+#define TARGET_LOOP_UNROLL_ADJUST_TREE rs6000_loop_unroll_adjust_tree
+
 #undef TARGET_INIT_BUILTINS
 #define TARGET_INIT_BUILTINS rs6000_init_builtins
 #undef TARGET_BUILTIN_DECL
@@ -5090,7 +5093,8 @@ rs6000_destroy_cost_data (void *data)
   free (data);
 }
 
-/* Implement targetm.loop_unroll_adjust.  */
+/* Implement targetm.loop_unroll_adjust.  Don't forget to update
+   loop_unroll_adjust_tree for any changes.  */
 
 static unsigned
 rs6000_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
@@ -5109,6 +5113,16 @@ rs6000_loop_unroll_adjust (unsigned nunroll, struct loop 
*loop)
   return nunroll;
 }
 
+/* Implement targetm.loop_unroll_adjust_tree, strictly refers to
+   targetm.loop_unroll_adjust.  */
+
+static unsigned
+rs6000_loop_unroll_adjust_tree (unsigned nunroll, struct loop *loop)
+{
+  /* For now loop_unroll_adjust is simple, just invoke directly.  */
+  return rs6000_loop_unroll_adjust (nunroll, loop);
+}
+
 /* Handler for the Mathematical Acceleration Subsystem (mass) interface to a
library with vectorized intrinsics.  */
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 2244df4..86ad278 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11875,6 +11875,12 @@ is required only when the target has special 
constraints like maximum
 number of memory accesses.
 @end deftypefn
 
+@deftypefn {Target Hook} unsigned TARGET_LOOP_UNROLL_ADJUST_TREE (unsigned 
@var{nunroll}, class loop *@var{loop})
+This target hook is the same as @code{loop_unroll_adjust}, but it's for
+middle-end unroll factor estimation computation. See
+@code{loop_unroll_adjust} for the function description.
+@end deftypefn
+
 @defmac POWI_MAX_MULTS
 If defined, this macro is interpreted as a signed integer C expression
 that specifies the maximum number of floating point multiplications
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 52cd603..fd9769e 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -8008,6 +8008,8 @@ lists.
 
 @hook TARGET_LOOP_UNROLL_ADJUST
 
+@hook TARGET_LOOP_UNROLL_ADJUST_TREE
+
 @defmac POWI_MAX_MULTS
 If defined, this macro is interpreted as a signed integer C expression
 that specifies the maximum number of floating point multiplications
diff --git a/gcc/target.def b/gcc/target.def
index e705c5d..f61c831 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2725,6 +2725,14 @@ number of memory accesses.",
  unsigned, (unsigned nunroll, class loop *loop),
  NULL)
 
+DEFHOOK
+(loop_unroll_adjust_tree,
+ "This target hook is the same as @code{loop_unroll_adjust}, but it's for\n\
+middle-end unroll factor estimation computation. See\n\
+@code{loop_unroll_adjust} for the function description.",
+ unsigned, (unsigned nunroll, class loop *loop),
+ NULL)
+
 /* True if X is a legitimate MODE-mode immediate operand.  */
 DEFHOOK
 (legitimate_constant_p,
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/t

[PATCH 2/4 GCC11] Add target hook stride_dform_valid_p

2020-01-16 Thread Kewen.Lin

gcc/ChangeLog

2020-01-16  Kewen Lin  

* config/rs6000/rs6000.c (TARGET_STRIDE_DFORM_VALID_P): New macro.
(rs6000_stride_dform_valid_p): New function.
* doc/tm.texi: Regenerate.
* doc/tm.texi.in (TARGET_STRIDE_DFORM_VALID_P): New hook.
* target.def (stride_dform_valid_p): New hook.

 gcc/config/rs6000/rs6000.c | 40 
 gcc/doc/tm.texi|  8 
 gcc/doc/tm.texi.in |  2 ++
 gcc/target.def | 13 -
 4 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 0dabaa6..1e41fcf 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1657,6 +1657,9 @@ static const struct attribute_spec 
rs6000_attribute_table[] =
 #undef TARGET_PREDICT_DOLOOP_P
 #define TARGET_PREDICT_DOLOOP_P rs6000_predict_doloop_p
 
+#undef TARGET_STRIDE_DFORM_VALID_P
+#define TARGET_STRIDE_DFORM_VALID_P rs6000_stride_dform_valid_p
+
 #undef TARGET_HAVE_COUNT_REG_DECR_P
 #define TARGET_HAVE_COUNT_REG_DECR_P true
 
@@ -26272,6 +26275,43 @@ rs6000_predict_doloop_p (struct loop *loop)
   return true;
 }
 
+/* Return true if the memory accesses with mode MODE, signedness SIGNED_P and
+   store flag STORE_P at offsets from 0 to (NUNROLL-1) * STRIDE are valid with
+   D-form instructions.  */
+
+static bool
+rs6000_stride_dform_valid_p (machine_mode mode, signed HOST_WIDE_INT stride,
+ bool signed_p, bool store_p, unsigned nunroll)
+{
+  static const HOST_WIDE_INT max_bound = 0x7fff;
+  static const HOST_WIDE_INT min_bound = -0x8000;
+
+  if (!IN_RANGE ((nunroll - 1) * stride, min_bound, max_bound))
+return false;
+
+  /* Check DQ-form for vector mode or float128 mode.  */
+  if (VECTOR_MODE_P (mode) || FLOAT128_VECTOR_P (mode))
+{
+  if (mode_supports_dq_form (mode) && !(stride & 0xF))
+   return true;
+  else
+   return false;
+}
+
+  /* Simply consider non VSX instructions.  */
+  if (mode == QImode || mode == HImode || mode == SFmode || mode == DFmode)
+return true;
+
+  /* lwz/stw is D-form, but lwa is DS-form.  */
+  if (mode == SImode && (!signed_p || store_p || !(stride & 0x03)))
+return true;
+
+  if (mode == DImode && !(stride & 0x03))
+return true;
+
+  return false;
+}
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-rs6000.h"
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 86ad278..0b8bc7c 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11669,6 +11669,14 @@ function version at run-time for a given set of 
function versions.
 body must be generated.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_STRIDE_DFORM_VALID_P (machine_mode 
@var{mode}, signed HOST_WIDE_INT @var{stride}, bool @var{signed_p}, bool 
@var{store_p}, unsigned @var{nunroll})
+For a given memory access, check whether it is valid to put 0, @var{stride},
+2 * @var{stride}, ... , (@var{nunroll} - 1) * @var{stride} into the
+instruction D-form displacement, with mode @var{mode}, signedness
+@var{signed_p} and store flag @var{store_p}.  Return true if valid.
+The default version of this hook returns false.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_PREDICT_DOLOOP_P (class loop *@var{loop})
 Return true if we can predict it is possible to use a low-overhead loop
 for a particular loop.  The parameter @var{loop} is a pointer to the loop.
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index fd9769e..e90d020 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7953,6 +7953,8 @@ to by @var{ce_info}.
 
 @hook TARGET_GENERATE_VERSION_DISPATCHER_BODY
 
+@hook TARGET_STRIDE_DFORM_VALID_P
+
 @hook TARGET_PREDICT_DOLOOP_P
 
 @hook TARGET_HAVE_COUNT_REG_DECR_P
diff --git a/gcc/target.def b/gcc/target.def
index f61c831..ee19a8d 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -4300,7 +4300,18 @@ DEFHOOK
  emits a @code{speculation_barrier} instruction if that is defined.",
 rtx, (machine_mode mode, rtx result, rtx val, rtx failval),
  default_speculation_safe_value)
- 
+
+DEFHOOK
+(stride_dform_valid_p,
+ "For a given memory access, check whether it is valid to put 0, 
@var{stride}\n\
+, 2 * @var{stride}, ... , (@var{nunroll} - 1) to the instruction D-form\n\
+displacement, with mode @var{mode}, signedness @var{signed_p} and store\n\
+@var{store_p}.  Return true if valid.\n\
+The default version of this hook returns false.",
+ bool, (machine_mode mode, signed HOST_WIDE_INT stride, bool signed_p,
+ bool store_p, unsigned nunroll),
+ NULL)
+
 DEFHOOK
 (predict_doloop_p,
  "Return true if we can predict it is possible to use a low-overhead loop\n\
-- 
2.7.4



[PATCH 3/4 GCC11] IVOPTs Consider cost_step on different forms during unrolling

2020-01-16 Thread Kewen.Lin
gcc/ChangeLog

2020-01-16  Kewen Lin  

* tree-ssa-loop-ivopts.c (struct iv_group): New field dform_p.
(struct iv_cand): New field dform_p.
(struct ivopts_data): New field mark_dform_p.
(record_group): Initialize dform_p.
(mark_dform_groups): New function.
(find_interesting_uses): Call mark_dform_groups.
(add_candidate_1): Update dform_p if derived from dform_p group.
(determine_iv_cost): Increase cost by considering unroll factor.
(tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update 
mark_dform_p.

 gcc/tree-ssa-loop-ivopts.c | 84 +-
 1 file changed, 83 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index ab52cbe..a0d29bb 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -429,6 +429,8 @@ struct iv_group
   struct iv_cand *selected;
   /* To indicate this is a doloop use group.  */
   bool doloop_p;
+  /* To indicate this group is D-form preferred.  */
+  bool dform_p;
   /* Uses in the group.  */
   vec vuses;
 };
@@ -470,6 +472,7 @@ struct iv_cand
   struct iv *orig_iv;  /* The original iv if this cand is added from biv with
   smaller type.  */
   bool doloop_p;   /* Whether this is a doloop candidate.  */
+  bool dform_p;	   /* Derived from one D-form preferred group.  */
 };
 
 /* Hashtable entry for common candidate derived from iv uses.  */
@@ -650,6 +653,10 @@ struct ivopts_data
 
   /* Whether the loop has doloop comparison use.  */
   bool doloop_use_p;
+
+  /* Whether the loop is likely to unroll and needs to check and mark
+     D-form groups for better step cost modeling.  */
+  bool mark_dform_p;
 };
 
 /* An assignment of iv candidates to uses.  */
@@ -1575,6 +1582,7 @@ record_group (struct ivopts_data *data, enum use_type 
type)
   group->related_cands = BITMAP_ALLOC (NULL);
   group->vuses.create (1);
   group->doloop_p = false;
+  group->dform_p = false;
 
   data->vgroups.safe_push (group);
   return group;
@@ -2724,6 +2732,59 @@ split_address_groups (struct ivopts_data *data)
 }
 }
 
+/* Go through all address type groups, check and mark D-form preferred.  */
+static void
+mark_dform_groups (struct ivopts_data *data)
+{
+  if (!data->mark_dform_p)
+return;
+
+  class loop *loop = data->current_loop;
+  bool dump_details = (dump_file && (dump_flags & TDF_DETAILS));
+  for (unsigned i = 0; i < data->vgroups.length (); i++)
+{
+  struct iv_group *group = data->vgroups[i];
+  if (address_p (group->type))
+   {
+ bool found = true;
+ for (unsigned j = 0; j < group->vuses.length (); j++)
+   {
+ struct iv_use *use = group->vuses[j];
+ gcc_assert (use->mem_type);
+ /* Ensure the step fit into D-form field.  */
+ if (TREE_CODE (use->iv->step) != INTEGER_CST
+ || !tree_fits_shwi_p (use->iv->step))
+   {
+ found = false;
+ if (dump_details)
+		  fprintf (dump_file,
+			   " Group use %u.%u doesn't "
+			   "have constant step for D-form.\n",
+			   i, j);
+ break;
+   }
+		bool is_store
+		  = TREE_CODE (gimple_assign_lhs (use->stmt)) != SSA_NAME;
+ if (!targetm.stride_dform_valid_p (TYPE_MODE (use->mem_type),
+tree_to_shwi (use->iv->step),
+TYPE_UNSIGNED (use->mem_type),
+is_store, loop->estimated_uf))
+   {
+ found = false;
+ if (dump_details)
+		  fprintf (dump_file,
+			   " Group use %u.%u isn't "
+			   "suitable for D-form.\n",
+			   i, j);
+ break;
+   }
+   }
+ if (found)
+   group->dform_p = true;
+   }
+}
+}
+
 /* Finds uses of the induction variables that are interesting.  */
 
 static void
@@ -2755,6 +2816,8 @@ find_interesting_uses (struct ivopts_data *data)
 
   split_address_groups (data);
 
+  mark_dform_groups (data);
+
   if (dump_file && (dump_flags & TDF_DETAILS))
 {
    fprintf (dump_file, "\n<IV Groups>:\n");
@@ -3137,6 +3200,7 @@ add_candidate_1 (struct ivopts_data *data, tree base, 
tree step, bool important,
   cand->important = important;
   cand->incremented_at = incremented_at;
   cand->doloop_p = doloop;
+  cand->dform_p = false;
   data->vcands.safe_push (cand);
 
   if (!poly_int_tree_p (step))
@@ -3173,7 +3237,11 @@ add_candidate_1 (struct ivopts_data *data, tree base, 
tree step, bool important,
 
   /* Relate candidate to the group for which it is added.  */
   if (use)
-bitmap_set_bi

[PATCH 4/4 GCC11] rs6000: P9 D-form test cases

2020-01-16 Thread Kewen.Lin

gcc/testsuite/ChangeLog

2020-01-16  Kelvin Nilsen  
Kewen Lin  

* gcc.target/powerpc/p9-dform-0.c: New test.
* gcc.target/powerpc/p9-dform-1.c: New test.
* gcc.target/powerpc/p9-dform-2.c: New test.
* gcc.target/powerpc/p9-dform-3.c: New test.
* gcc.target/powerpc/p9-dform-4.c: New test.
* gcc.target/powerpc/p9-dform-generic.h: New test.
 gcc/testsuite/gcc.target/powerpc/p9-dform-0.c  | 43 +
 gcc/testsuite/gcc.target/powerpc/p9-dform-1.c  | 55 ++
 gcc/testsuite/gcc.target/powerpc/p9-dform-2.c  | 12 +
 gcc/testsuite/gcc.target/powerpc/p9-dform-3.c  | 15 ++
 gcc/testsuite/gcc.target/powerpc/p9-dform-4.c  | 12 +
 .../gcc.target/powerpc/p9-dform-generic.h  | 34 +
 6 files changed, 171 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-0.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-2.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-3.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-4.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-generic.h

diff --git a/gcc/testsuite/gcc.target/powerpc/p9-dform-0.c 
b/gcc/testsuite/gcc.target/powerpc/p9-dform-0.c
new file mode 100644
index 000..01f8b69
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-dform-0.c
@@ -0,0 +1,43 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops" } */
+
+/* This test confirms that the dform instructions are selected in the
+   translation of this main program.  */
+
+extern void first_dummy ();
+extern void dummy (double sacc, int n);
+extern void other_dummy ();
+
+extern float opt_value;
+extern char *opt_desc;
+
+#define M 128
+#define N 512
+
+double x [N];
+double y [N];
+
+int main (int argc, char *argv []) {
+  double sacc;
+
+  first_dummy ();
+  for (int j = 0; j < M; j++) {
+
+sacc = 0.00;
+for (unsigned long long int i = 0; i < N; i++) {
+  sacc += x[i] * y[i];
+}
+dummy (sacc, N);
+  }
+  opt_value = ((float) N) * 2 * ((float) M);
+  opt_desc = "flops";
+  other_dummy ();
+}
+
+/* At the time the dform optimization pass was merged with trunk, 12
+   lxv instructions were emitted in place of the same number of lxvx
+   instructions.  No need to require exactly this number, as it may
+   change when other optimization passes evolve.  */
+
+/* { dg-final { scan-assembler {\mlxv\M} } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-dform-1.c 
b/gcc/testsuite/gcc.target/powerpc/p9-dform-1.c
new file mode 100644
index 000..c6f1d76
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-dform-1.c
@@ -0,0 +1,55 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops" } */
+
+/* This test confirms that the dform instructions are selected in the
+   translation of this main program.  */
+
+extern void first_dummy ();
+extern void dummy (double sacc, int n);
+extern void other_dummy ();
+
+extern float opt_value;
+extern char *opt_desc;
+
+#define M 128
+#define N 512
+
+double x [N];
+double y [N];
+double z [N];
+
+int main (int argc, char *argv []) {
+  double sacc;
+
+  first_dummy ();
+  for (int j = 0; j < M; j++) {
+
+sacc = 0.00;
+for (unsigned long long int i = 0; i < N; i++) {
+  z[i] = x[i] * y[i];
+  sacc += z[i];
+}
+dummy (sacc, N);
+  }
+  opt_value = ((float) N) * 2 * ((float) M);
+  opt_desc = "flops";
+  other_dummy ();
+}
+
+
+
+/* At the time the dform optimization pass was merged with trunk, 12
+   lxv instructions were emitted in place of the same number of lxvx
+   instructions.  No need to require exactly this number, as it may
+   change when other optimization passes evolve.  */
+
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+
+/* At the time the dform optimization pass was merged with trunk, 6
+   stxv instructions were emitted in place of the same number of stxvx
+   instructions.  No need to require exactly this number, as it may
+   change when other optimization passes evolve.  */
+
+/* { dg-final { scan-assembler {\mstxv\M} } } */
+
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-dform-2.c 
b/gcc/testsuite/gcc.target/powerpc/p9-dform-2.c
new file mode 100644
index 000..8752f3d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-dform-2.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops" } */
+
+#define TYPE int
+#include "p9-dform-generic.h"
+
+/* The precise number of lxv and stxv instructions may be impacted by
+   complex interactions between optimization passes, but we expect at
+   least one of each.  */
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { 

Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling

2020-02-09 Thread Kewen.Lin
Hi Segher,

on 2020/1/20 8:33 PM, Segher Boessenkool wrote:
> Hi!
> 
> On Thu, Jan 16, 2020 at 05:36:52PM +0800, Kewen.Lin wrote:
>> As we discussed in the thread
>> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
>> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
>> I'm working to teach IVOPTs to consider D-form group access during unrolling.
>> The difference on D-form and other forms during unrolling is we can put the
>> stride into displacement field to avoid additional step increment. eg:
> 
> 
> 
>> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
>> D-form vs. 8 step updates with X-form. Here we only need to check stride
>> meet D-form field requirement, since if OFF doesn't meet, we can construct
>> baseA' with baseA + OFF.
> 
> So why doesn't the existing code do this already?  Why does it make all
> the extra induction variables?  Is the existing cost model bad, are our
> target costs bad, or something like that?
> 

I think the main cause is that IVOPTs runs before RTL unrolling; when it's
determining the IV sets, it can only take the normal step cost into account,
since its input isn't unrolled yet.  After unrolling, the X-form indexing
register has to take UF-1 more step updates, while we can hide them in the
D-form displacement field.  The way I proposed here is to adjust the IV cost
with additional cost_step according to the estimated unroll.  It doesn't
introduce new IV cands but can affect the final optimal set.
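
For reference, the essence of that adjustment in patch 3/4 boils down to
something like the following in determine_iv_cost (simplified, surrounding
code omitted):

  /* A cand that cannot use D-form pays for the UF-1 extra step
     updates implied by the estimated unroll factor.  */
  if (!cand->dform_p && data->current_loop->estimated_unroll > 1)
    cost += (data->current_loop->estimated_unroll - 1) * cost_step;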

BR,
Kewen



[PATCH 1/4 v2 GCC11] Add middle-end unroll factor estimation

2020-02-09 Thread Kewen.Lin
Hi Segher,

Thanks for your comments!  Updated to v2 as below:

  1) Removed unnecessary hook loop_unroll_adjust_tree.
  2) Updated estimated_uf to estimated_unroll and some comments.

gcc/ChangeLog

2020-02-10  Kewen Lin  

* cfgloop.h (struct loop): New field estimated_unroll.
* tree-ssa-loop-manip.c (decide_uf_const_iter): New function.
(decide_uf_runtime_iter): Likewise.
(decide_uf_stupid): Likewise.
(estimate_unroll_factor): Likewise.
* tree-ssa-loop-manip.h (estimate_unroll_factor): New declare.
* tree-ssa-loop.c (tree_average_num_loop_insns): New function.
* tree-ssa-loop.h (tree_average_num_loop_insns): New declare.

BR,
Kewen

on 2020/1/20 9:02 PM, Segher Boessenkool wrote:
> Hi!
> 
> On Thu, Jan 16, 2020 at 05:39:40PM +0800, Kewen.Lin wrote:
>> --- a/gcc/cfgloop.h
>> +++ b/gcc/cfgloop.h
>> @@ -232,6 +232,9 @@ public:
>>   Other values means unroll with the given unrolling factor.  */
>>unsigned short unroll;
>>  
>> +  /* Like unroll field above, but it's estimated in middle-end.  */
>> +  unsigned short estimated_uf;
> 
> Please use full words?  "estimated_unroll" perhaps?  (Similar for other
> new names).
> 

Done.

>> +/* Implement targetm.loop_unroll_adjust_tree, strictly refers to
>> +   targetm.loop_unroll_adjust.  */
>> +
>> +static unsigned
>> +rs6000_loop_unroll_adjust_tree (unsigned nunroll, struct loop *loop)
>> +{
>> +  /* For now loop_unroll_adjust is simple, just invoke directly.  */
>> +  return rs6000_loop_unroll_adjust (nunroll, loop);
>> +}
> 
> Since the two hooks have the same arguments as well, it should really
> just be one hook, and an implementation can check whether
>   current_pass->type == RTL_PASS
> if it needs to do something special for RTL, etc.?  Or it can use some
> more appropriate condition -- the point is you need no extra hook.
> 

Good point, removed it.

>> +  /* Check number of iterations is constant.  */
>> +  if ((niter_desc->may_be_zero && !integer_zerop (niter_desc->may_be_zero))
>> +  || !tree_fits_uhwi_p (niter_desc->niter))
>> +return false;
> 
> Check, and do what?  It's easier to read if you say.
> 

Done.

> 
> "If loop->unroll is set, use that as loop->estimated_unroll"?
> 

Done.

---
 gcc/cfgloop.h |   3 +
 gcc/tree-ssa-loop-manip.c | 253 ++
 gcc/tree-ssa-loop-manip.h |   3 +-
 gcc/tree-ssa-loop.c   |  33 ++
 gcc/tree-ssa-loop.h   |   2 +
 5 files changed, 292 insertions(+), 2 deletions(-)

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 11378ca..c5bcca7 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -232,6 +232,9 @@ public:
  Other values means unroll with the given unrolling factor.  */
   unsigned short unroll;
 
+  /* Like unroll field above, but it's estimated in middle-end.  */
+  unsigned short estimated_unroll;
+
   /* If this loop was inlined the main clique of the callee which does
  not need remapping when copying the loop body.  */
   unsigned short owned_clique;
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index 120b35b..72ac335 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -21,6 +21,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "system.h"
 #include "coretypes.h"
 #include "backend.h"
+#include "target.h"
 #include "tree.h"
 #include "gimple.h"
 #include "cfghooks.h"
@@ -42,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "tree-scalar-evolution.h"
 #include "tree-inline.h"
+#include "wide-int.h"
 
 /* All bitmaps for rewriting into loop-closed SSA go on this obstack,
so that we can free them all at once.  */
@@ -1592,3 +1594,254 @@ canonicalize_loop_ivs (class loop *loop, tree *nit, 
bool bump_in_latch)
 
   return var_before;
 }
+
+/* Try to determine the estimated unroll factor for the given LOOP with a
+   constant number of iterations, mainly referring to
+   decide_unroll_constant_iterations.
+     - NITER_DESC holds the number of iterations description if it isn't NULL.
+     - NUNROLL holds an unroll factor value computed from instruction counts.
+     - ITER holds the estimated or likely max loop iterations.
+   Return true if it succeeds, and also update estimated_unroll.  */
+
+static bool
+decide_uf_const_iter (class loop *loop, const tree_niter_desc *niter_desc,
+ unsigned nunroll, const widest_int *iter)
+{
+  /* Skip big loops.  */
+  if (nunroll <= 1)
+return false;
+
+  gcc_assert (niter_desc && niter_desc->assumptions);
+
+  /* Check number of iterations is consta

[PATCH 4/4 v2 GCC11] rs6000: P9 D-form test cases

2020-02-09 Thread Kewen.Lin
Hi Segher,

Updated as below according to your suggestion.

BR,
Kewen



gcc/testsuite/ChangeLog

2020-02-10  Kelvin Nilsen  
Kewen Lin  

* gcc.target/powerpc/p9-dform-0.c: New test.
* gcc.target/powerpc/p9-dform-1.c: New test.
* gcc.target/powerpc/p9-dform-2.c: New test.
* gcc.target/powerpc/p9-dform-3.c: New test.
* gcc.target/powerpc/p9-dform-4.c: New test.
* gcc.target/powerpc/p9-dform-generic.h: New test.

on 2020/1/20 9:19 PM, Segher Boessenkool wrote:
> Hi!
> 
> On Thu, Jan 16, 2020 at 05:42:41PM +0800, Kewen.Lin wrote:
>> +/* At the time the dform optimization pass was merged with trunk, 12
>> +   lxv instructions were emitted in place of the same number of lxvx
>> +   instructions.  No need to require exactly this number, as it may
>> +   change when other optimization passes evolve.  */
>> +
>> +/* { dg-final { scan-assembler {\mlxv\M} } } */
> 
> Maybe you can also test there ar no lxvx insns generated?
> 

Done, thanks!
---
 gcc/testsuite/gcc.target/powerpc/p9-dform-0.c  | 44 +
 gcc/testsuite/gcc.target/powerpc/p9-dform-1.c  | 57 ++
 gcc/testsuite/gcc.target/powerpc/p9-dform-2.c  | 14 ++
 gcc/testsuite/gcc.target/powerpc/p9-dform-3.c  | 17 +++
 gcc/testsuite/gcc.target/powerpc/p9-dform-4.c  | 14 ++
 .../gcc.target/powerpc/p9-dform-generic.h  | 34 +
 6 files changed, 180 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-0.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-2.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-3.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-4.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/p9-dform-generic.h

diff --git a/gcc/testsuite/gcc.target/powerpc/p9-dform-0.c 
b/gcc/testsuite/gcc.target/powerpc/p9-dform-0.c
new file mode 100644
index 000..68b0434
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-dform-0.c
@@ -0,0 +1,44 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops" } */
+
+/* This test confirms that the dform instructions are selected in the
+   translation of this main program.  */
+
+extern void first_dummy ();
+extern void dummy (double sacc, int n);
+extern void other_dummy ();
+
+extern float opt_value;
+extern char *opt_desc;
+
+#define M 128
+#define N 512
+
+double x [N];
+double y [N];
+
+int main (int argc, char *argv []) {
+  double sacc;
+
+  first_dummy ();
+  for (int j = 0; j < M; j++) {
+
+sacc = 0.00;
+for (unsigned long long int i = 0; i < N; i++) {
+  sacc += x[i] * y[i];
+}
+dummy (sacc, N);
+  }
+  opt_value = ((float) N) * 2 * ((float) M);
+  opt_desc = "flops";
+  other_dummy ();
+}
+
+/* At the time the dform optimization pass was merged with trunk, 12
+   lxv instructions were emitted in place of the same number of lxvx
+   instructions.  No need to require exactly this number, as it may
+   change when other optimization passes evolve.  */
+
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { dg-final { scan-assembler-not {\mlxvx\M} } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/p9-dform-1.c 
b/gcc/testsuite/gcc.target/powerpc/p9-dform-1.c
new file mode 100644
index 000..b80ffbc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/p9-dform-1.c
@@ -0,0 +1,57 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops" } */
+
+/* This test confirms that the dform instructions are selected in the
+   translation of this main program.  */
+
+extern void first_dummy ();
+extern void dummy (double sacc, int n);
+extern void other_dummy ();
+
+extern float opt_value;
+extern char *opt_desc;
+
+#define M 128
+#define N 512
+
+double x [N];
+double y [N];
+double z [N];
+
+int main (int argc, char *argv []) {
+  double sacc;
+
+  first_dummy ();
+  for (int j = 0; j < M; j++) {
+
+sacc = 0.00;
+for (unsigned long long int i = 0; i < N; i++) {
+  z[i] = x[i] * y[i];
+  sacc += z[i];
+}
+dummy (sacc, N);
+  }
+  opt_value = ((float) N) * 2 * ((float) M);
+  opt_desc = "flops";
+  other_dummy ();
+}
+
+
+
+/* At the time the dform optimization pass was merged with trunk, 12
+   lxv instructions were emitted in place of the same number of lxvx
+   instructions.  No need to require exactly this number, as it may
+   change when other optimization passes evolve.  */
+
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { dg-final { scan-assembler-not {\mlxvx\M} } } */
+
+/* At the time the dform optimization pass was merged with trunk, 6
+   stxv instructions were emitted in place of the same number of stxvx
+   

Re: [PATCH 0/4 GCC11] IVOPTs consider step cost for different forms when unrolling

2020-02-10 Thread Kewen.Lin
on 2020/2/11 5:29 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Mon, Feb 10, 2020 at 02:17:04PM +0800, Kewen.Lin wrote:
>> on 2020/1/20 8:33 PM, Segher Boessenkool wrote:
>>> On Thu, Jan 16, 2020 at 05:36:52PM +0800, Kewen.Lin wrote:
>>>> As we discussed in the thread
>>>> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
>>>> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
>>>> I'm working to teach IVOPTs to consider D-form group access during 
>>>> unrolling.
>>>> The difference on D-form and other forms during unrolling is we can put the
>>>> stride into displacement field to avoid additional step increment. eg:
>>>
>>> 
>>>
>>>> Imagining that if the loop get unrolled by 8 times, then 3 step updates 
>>>> with
>>>> D-form vs. 8 step updates with X-form. Here we only need to check stride
>>>> meet D-form field requirement, since if OFF doesn't meet, we can construct
>>>> baseA' with baseA + OFF.
>>>
>>> So why doesn't the existing code do this already?  Why does it make all
>>> the extra induction variables?  Is the existing cost model bad, are our
>>> target costs bad, or something like that?
>>>
>>
>> I think the main cause is IVOPTs runs before RTL unroll, when it's 
>> determining
>> the IV sets, it can only take the normal step cost into account, since its
>> input isn't unrolled yet.  After unrolling, the x-form indexing register has 
>> to
>> play with more UF-1 times update, but we can probably hide them in d-form
>> displacement field.  The way I proposed here is to adjust IV cost with
>> additional cost_step according to estimated unroll.  It doesn't introduce new
>> IV cand but can affect the final optimal set.
> 
> Yes, we should decide how often we want to unroll things somewhere before
> ivopts already, and just use that info here.

Agreed!  If some passes are interested in this unroll factor estimation, we
can move it backward there if they run before IVOPTs.  As in patch 1/4, once
it's set, the later passes can just reuse that info.  As Richard B. suggested,
we can even skip the later RTL unroll factor determination.

> 
> Or are there advantage to doing it *in* ivopts?  It sounds like doing
> it there is probably expensive, but maybe not, and we need to do similar
> analysis there anyway.
> 

Good question.  I didn't consider that; the reason for putting it here is that
we need this information in IVOPTs for some cases.  :)

BR,
Kewen



Re: [PATCH 1/4 v2 GCC11] Add middle-end unroll factor estimation

2020-02-10 Thread Kewen.Lin
Hi Jeff,

on 2020/2/11 10:14 AM, Jiufu Guo wrote:
> "Kewen.Lin"  writes:
> 
>> Hi Segher,
>>
>> Thanks for your comments!  Updated to v2 as below:
>>
>>   1) Removed unnecessary hook loop_unroll_adjust_tree.
>>   2) Updated estimated_uf to estimated_unroll and some comments.
>>
>> gcc/ChangeLog
>>
>> 2020-02-10  Kewen Lin  
>>
>>  * cfgloop.h (struct loop): New field estimated_unroll.
>>  * tree-ssa-loop-manip.c (decide_uf_const_iter): New function.
>>  (decide_uf_runtime_iter): Likewise.
>>  (decide_uf_stupid): Likewise.
>>  (estimate_unroll_factor): Likewise.
> In the RTL unroller, target hooks are also involved when deciding the unroll
> factor, so the result of these decide_uf_xx functions may not match the final
> real unroll factor in RTL.  For example, at O2 by default, small loops will
> be unrolled 2 times.

I didn't quite follow your comments: the patch already calls
targetm.loop_unroll_adjust in these decide_uf_xx functions, exactly
the same as what we have in loop-unroll.c (for RTL unrolling).

Or did I miss anything?

BR,
Kewen



[PATCH 1/4 v3 GCC11] Add middle-end unroll factor estimation

2020-02-10 Thread Kewen.Lin
Hi,

v3 changes:
  - Updated _uf to _unroll for some function names.

By the way, should I guard the current i386/s390 loop_unroll_adjust
early return with (current_pass->type != RTL_PASS)?  I'm inclined not
to: since this analysis isn't enabled by default, if those targets
want to adopt it they will get an ICE immediately, so users can notice
it and add the GIMPLE part of the check.  A guard can make the ICE
disappear, but then the hook silently works poorly and users would
probably miss updating it.
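
For the record, the guard I'm inclined to leave out would look something like
this (i386 used purely as an illustration):

static unsigned
ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
{
  /* Returning early here hides the missing GIMPLE support instead of
     ICEing, so the hook would silently work poorly at GIMPLE level.  */
  if (current_pass->type != RTL_PASS)
    return nunroll;

  /* ... the existing RTL-insn-count based adjustment ... */
  return nunroll;
}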

BR,
Kewen

-

gcc/ChangeLog

2020-02-11  Kewen Lin  

* cfgloop.h (struct loop): New field estimated_unroll.
* tree-ssa-loop-manip.c (decide_unroll_const_iter): New function.
(decide_unroll_runtime_iter): Likewise.
(decide_unroll_stupid): Likewise.
(estimate_unroll_factor): Likewise.
* tree-ssa-loop-manip.h (estimate_unroll_factor): New declaration.
* tree-ssa-loop.c (tree_average_num_loop_insns): New function.
* tree-ssa-loop.h (tree_average_num_loop_insns): New declaration.

on 2020/2/11 7:34 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Mon, Feb 10, 2020 at 02:20:17PM +0800, Kewen.Lin wrote:
>>  * tree-ssa-loop-manip.c (decide_uf_const_iter): New function.
>>  (decide_uf_runtime_iter): Likewise.
>>  (decide_uf_stupid): Likewise.
> 
> These names still use "uf".  (Those are the last I see).
> 

Good catch!

>>  * tree-ssa-loop-manip.h (estimate_unroll_factor): New declare.
> 
> "New declaration."
>
Done.

---
 gcc/cfgloop.h |   3 +
 gcc/tree-ssa-loop-manip.c | 253 ++
 gcc/tree-ssa-loop-manip.h |   3 +-
 gcc/tree-ssa-loop.c   |  33 ++
 gcc/tree-ssa-loop.h   |   2 +
 5 files changed, 292 insertions(+), 2 deletions(-)

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 11378ca..c5bcca7 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -232,6 +232,9 @@ public:
  Other values means unroll with the given unrolling factor.  */
   unsigned short unroll;
 
+  /* Like unroll field above, but it's estimated in middle-end.  */
+  unsigned short estimated_unroll;
+
   /* If this loop was inlined the main clique of the callee which does
  not need remapping when copying the loop body.  */
   unsigned short owned_clique;
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index 120b35b..8a5a1a9 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -21,6 +21,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "system.h"
 #include "coretypes.h"
 #include "backend.h"
+#include "target.h"
 #include "tree.h"
 #include "gimple.h"
 #include "cfghooks.h"
@@ -42,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "tree-scalar-evolution.h"
 #include "tree-inline.h"
+#include "wide-int.h"
 
 /* All bitmaps for rewriting into loop-closed SSA go on this obstack,
so that we can free them all at once.  */
@@ -1592,3 +1594,254 @@ canonicalize_loop_ivs (class loop *loop, tree *nit, 
bool bump_in_latch)
 
   return var_before;
 }
+
+/* Try to determine the estimated unroll factor for the given LOOP with a
+   constant number of iterations, mainly referring to
+   decide_unroll_constant_iterations.
+     - NITER_DESC holds the number of iterations description if it isn't NULL.
+     - NUNROLL holds an unroll factor value computed from instruction counts.
+     - ITER holds the estimated or likely max loop iterations.
+   Return true if it succeeds, and also update estimated_unroll.  */
+
+static bool
+decide_unroll_const_iter (class loop *loop, const tree_niter_desc *niter_desc,
+ unsigned nunroll, const widest_int *iter)
+{
+  /* Skip big loops.  */
+  if (nunroll <= 1)
+return false;
+
+  gcc_assert (niter_desc && niter_desc->assumptions);
+
+  /* Check that the number of iterations is constant; return false if not.  */
+  if ((niter_desc->may_be_zero && !integer_zerop (niter_desc->may_be_zero))
+  || !tree_fits_uhwi_p (niter_desc->niter))
+return false;
+
+  unsigned HOST_WIDE_INT const_niter = tree_to_uhwi (niter_desc->niter);
+
+  /* If unroll factor is set explicitly, use it as estimated_unroll.  */
+  if (loop->unroll > 0 && loop->unroll < USHRT_MAX)
+{
+  /* It should have been peeled instead.  */
+  if (const_niter == 0 || (unsigned) loop->unroll > const_niter - 1)
+   loop->estimated_unroll = 1;
+  else
+   loop->estimated_unroll = loop->unroll;
+  return true;
+}
+
+  /* Check whether the loop rolls enough to consider.  */
+  if (const_niter < 2 * nunroll || wi::ltu_p (*iter, 2 * nunroll))
+return false;
+
+  /* Success; now compute number of iterations to unroll.  */
+  unsigned best_unroll = 0, n_c

[PATCH, IRA] Fix PR91052 by skipping multiple_sets insn in combine_and_move_insns

2020-02-11 Thread Kewen.Lin
Hi,

As PR91052's comments show, commit r272731 exposed one issue in function
combine_and_move_insns: the function performs the unexpected
transformation shown below.

** Before: **

   67: NOTE_INSN_BASIC_BLOCK 8
...
   59: {r184:SF=[sfp:SI-0x190];r121:SI=sfp:SI-0x190;}  ==> move object
  REG_UNUSED r121:SI
   77: {r191:SF=[sfp:SI-0x18c];r121:SI=sfp:SI-0x18c;}
   60: r122:SI=r127:SI
  REG_DEAD r127:SI
   61: [r122:SI]=r184:SF
  REG_DEAD r184:SF
   79: [++r122:SI]=r191:SF
  REG_DEAD r191:SF
  REG_INC r122:SI
   64: r187:SF=[r137:SI+low(`*.LC0')]
   99: r198:SF=[++r121:SI] => with sp-0x18c+4;
  REG_INC r121:SI
  104: r201:SF=[r137:SI+low(`*.LC0')]
   65: [r126:SI]=r187:SF
  REG_DEAD r187:SF
  105: [r126:SI]=r201:SF
  REG_DEAD r201:SF
  101: [++r122:SI]=r198:SF
  REG_DEAD r198:SF
  REG_INC r122:SI
  114: L114:
  113: NOTE_INSN_BASIC_BLOCK 9 

** After: **

   67: NOTE_INSN_BASIC_BLOCK 8
...
   77: {r191:SF=[sfp:SI-0x18c];r121:SI=sfp:SI-0x18c;}
  REG_UNUSED r121:SI
   60: r122:SI=r127:SI
  REG_DEAD r127:SI
  219: {r184:SF=[sfp:SI-0x190];r121:SI=sfp:SI-0x190;}   ==> moved here but
updates the original r121.
   61: [r122:SI]=r184:SF
  REG_DEAD r184:SF
   79: [++r122:SI]=r191:SF
  REG_DEAD r191:SF
  REG_INC r122:SI
   64: r187:SF=[r137:SI+low(`*.LC0')]
  REG_EQUIV [r137:SI+low(`*.LC0')]
   99: r198:SF=[++r121:SI]            => with sp-0x18c; inconsistent
with the above.
  REG_INC r121:SI
  104: r201:SF=[r137:SI+low(`*.LC0')]
  REG_EQUIV [r137:SI+low(`*.LC0')]
   65: [r126:SI]=r187:SF
  REG_DEAD r187:SF
  105: [r126:SI]=r201:SF
  REG_DEAD r201:SF
  101: [++r122:SI]=r198:SF
  REG_DEAD r198:SF
  REG_INC r122:SI
  114: L114:
  113: NOTE_INSN_BASIC_BLOCK 9

Insn 59 is special in having multiple_sets; its movement alters the live
interval of r121 from insn 77 to insn 99 and updates r121 with an
unexpected value.
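
For reference, insn 59 is a PARALLEL with two SETs, roughly (simplified;
-0x190 is -400):

(parallel [(set (reg:SF 184) (mem:SF (plus (reg sfp) (const_int -400))))
           (set (reg:SI 121) (plus (reg sfp) (const_int -400)))])

multiple_sets () in rtlanal.c returns nonzero exactly for such patterns,
which is what the new early continue keys on.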

Bootstrapped/regtested on powerpc64le-linux-gnu (LE) and
ppc64-redhat-linux (BE).

Is it ok for trunk?

BR,
Kewen

---

gcc/ChangeLog

2020-02-11  Kewen Lin  

* ira.c (combine_and_move_insns): Skip multiple_sets def_insn.
diff --git a/gcc/ira.c b/gcc/ira.c
index c8b5f86..a655ae1 100644
--- a/gcc/ira.c
+++ b/gcc/ira.c
@@ -3784,6 +3784,11 @@ combine_and_move_insns (void)
   if (can_throw_internal (def_insn))
continue;
 
+  /* Instructions with multiple sets can only be moved if DF analysis is
+performed for all of the registers set.  See PR91052.  */
+  if (multiple_sets (def_insn))
+   continue;
+
   basic_block use_bb = BLOCK_FOR_INSN (use_insn);
   basic_block def_bb = BLOCK_FOR_INSN (def_insn);
   if (bb_loop_depth (use_bb) > bb_loop_depth (def_bb))


Re: [PATCH, IRA] Fix PR91052 by skipping multiple_sets insn in combine_and_move_insns

2020-02-11 Thread Kewen.Lin
on 2020/2/12 12:24 AM, Vladimir Makarov wrote:
> On 2/11/20 3:01 AM, Kewen.Lin wrote:
>> Hi,
>>
>> As PR91052's comments show, commit r272731 exposed one issue in function
>> combine_and_move_insns.  Function combine_and_move_insns perform the
>> below unexpected transformation.
>>
>> ** Before: **
>>
>>     67: NOTE_INSN_BASIC_BLOCK 8
>> ...
>>     59: {r184:SF=[sfp:SI-0x190];r121:SI=sfp:SI-0x190;}  ==> move object
>>    REG_UNUSED r121:SI
>>     77: {r191:SF=[sfp:SI-0x18c];r121:SI=sfp:SI-0x18c;}
>>     60: r122:SI=r127:SI
>>    REG_DEAD r127:SI
>>     61: [r122:SI]=r184:SF
>>    REG_DEAD r184:SF
>>     79: [++r122:SI]=r191:SF
>>    REG_DEAD r191:SF
>>    REG_INC r122:SI
>>     64: r187:SF=[r137:SI+low(`*.LC0')]
>>     99: r198:SF=[++r121:SI] => with sp-0x18c+4;
>>    REG_INC r121:SI
>>    104: r201:SF=[r137:SI+low(`*.LC0')]
>>     65: [r126:SI]=r187:SF
>>    REG_DEAD r187:SF
>>    105: [r126:SI]=r201:SF
>>    REG_DEAD r201:SF
>>    101: [++r122:SI]=r198:SF
>>    REG_DEAD r198:SF
>>    REG_INC r122:SI
>>    114: L114:
>>    113: NOTE_INSN_BASIC_BLOCK 9
>>
>> ** After: **
>>
>>     67: NOTE_INSN_BASIC_BLOCK 8
>> ...
>>     77: {r191:SF=[sfp:SI-0x18c];r121:SI=sfp:SI-0x18c;}
>>    REG_UNUSED r121:SI
>>     60: r122:SI=r127:SI
>>    REG_DEAD r127:SI
>>    219: {r184:SF=[sfp:SI-0x190];r121:SI=sfp:SI-0x190;}   ==> moved here but 
>> update origin r121.
>>     61: [r122:SI]=r184:SF
>>    REG_DEAD r184:SF
>>     79: [++r122:SI]=r191:SF
>>    REG_DEAD r191:SF
>>    REG_INC r122:SI
>>     64: r187:SF=[r137:SI+low(`*.LC0')]
>>    REG_EQUIV [r137:SI+low(`*.LC0')]
>>     99: r198:SF=[++r121:SI]    => with sp-0x18c; 
>> inconsistent from above.
>>    REG_INC r121:SI
>>    104: r201:SF=[r137:SI+low(`*.LC0')]
>>    REG_EQUIV [r137:SI+low(`*.LC0')]
>>     65: [r126:SI]=r187:SF
>>    REG_DEAD r187:SF
>>    105: [r126:SI]=r201:SF
>>    REG_DEAD r201:SF
>>    101: [++r122:SI]=r198:SF
>>    REG_DEAD r198:SF
>>    REG_INC r122:SI
>>    114: L114:
>>    113: NOTE_INSN_BASIC_BLOCK 9
>>
>> Insn 59 is special in having multiple_sets; its movement alters the live
>> interval of r121 from insn 77 to insn 99 and updates r121 with an
>> unexpected value.
>>
>> Bootstrapped/regtested on powerpc64le-linux-gnu (LE) and
>> ppc64-redhat-linux (BE).
>>
>> Is it ok for trunk?
> 
> Yes. Thank you for working on the PR, Kewen.
> 
> I don't think that any expensive additional analysis is worth to use it for 
> solving the PR.  So I believe your patch is an adequate solution.
> 

Thanks Vladimir!  Committed in 
r10-6591-g4d2248bec5d22061ab252724bd59d45c8a47e009
with the below updated ChangeLog (sorry for missing one PR line).

2020-02-12  Kewen Lin  

PR target/91052
* ira.c (combine_and_move_insns): Skip multiple_sets def_insn.

And yes, I doubt the gain from a more expensive analysis to legalize the
movement; even with that we would need to update notes like REG_UNUSED
(the inconsistent notes are the direct cause of the ICE), which also seems
nontrivial.

BR,
Kewen

> 
>> ---
>>
>> gcc/ChangeLog
>>
>> 2020-02-11  Kewen Lin  
>>
>> * ira.c (combine_and_move_insns): Skip multiple_sets def_insn.
> 
> 



[PATCH, rs6000] Adjust vectorization cost for scalar COND_EXPR

2019-12-11 Thread Kewen.Lin
Hi,

We found that the vectorization cost modeling of scalar COND_EXPR is a bit off
on rs6000.  One typical case is 548.exchange2_r: -Ofast -mcpu=power9 -mrecip
-fvect-cost-model=unlimited is better than -Ofast -mcpu=power9 -mrecip (the
default is -fvect-cost-model=dynamic) by 1.94%.  A scalar COND_EXPR is normally
expanded into compare + branch or compare + isel, either of which should be
priced higher than a simple FXU operation.  This patch adds additional
vectorization cost onto scalar COND_EXPR on top of builtin_vectorization_cost.
The reasons for using the additional cost value 2 rather than other values:
1) trying the possible candidates from 1 to 5, 2 measured best on Power9;
2) from a latency view, compare takes 3 cycles and isel takes 2 on Power9,
i.e. 2.5 times a simple FXU instruction, which takes cost 1 in the current
modeling, so 2 is close; 3) it gives a fine SPEC2017 ratio on Power8 as well.
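
As a purely illustrative example of the kind of statement being repriced:

long
pick (long a, long b, long x, long y)
{
  /* Gimplified to a scalar COND_EXPR; on Power this typically becomes
     compare + isel, or compare + branch.  */
  return a < b ? x : y;
}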

The SPEC2017 performance evaluation on Power9 with explicit unrolling shows a
+2.35% gain on 548.exchange2_r but a -1.99% degradation on 526.blender_r; the
others are trivial.  Further investigation on 526.blender_r shows the assembly
of its 10 hottest functions is unchanged, so the impact should be due to some
side effects.  SPECINT geomean +0.16%, SPECFP geomean -0.16% (mainly due to
blender_r).  Without explicit unrolling, 548.exchange2_r gains +1.78% and the
others are trivial.  SPECINT geomean +0.19%, SPECFP geomean +0.06%.

The SPEC2017 performance evaluation on Power8 shows a +1.32% gain on
500.perlbench_r and a +2.03% gain on 511.povray_r; the others are trivial.
SPECINT geomean +0.08%, SPECFP geomean +0.18%.

Bootstrapped and regress tested on powerpc64le-linux-gnu.  
Is it OK for trunk?

BR,
Kewen
---

gcc/ChangeLog

2019-12-11  Kewen Lin  

* config/rs6000/rs6000.c (adjust_vectorization_cost): New function.
(rs6000_add_stmt_cost): Call adjust_vectorization_cost and update
stmt_cost.

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 2995348..5dad3cc 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5016,6 +5016,29 @@ rs6000_init_cost (struct loop *loop_info)
   return data;
 }
 
+/* Adjust vectorization cost after calling rs6000_builtin_vectorization_cost.
+   For some statements, we would like to further fine-tune the cost on
+   top of rs6000_builtin_vectorization_cost handling, which doesn't have any
+   information on statement operation codes etc.  One typical case here is
+   COND_EXPR: it takes the same cost as a simple FXU instruction when evaluated
+   for scalar cost, but it should be priced higher whether it is transformed
+   into compare + branch or compare + isel instructions.  */
+
+static unsigned
+adjust_vectorization_cost (enum vect_cost_for_stmt kind,
+  struct _stmt_vec_info *stmt_info)
+{
+  if (kind == scalar_stmt && stmt_info && stmt_info->stmt
+  && gimple_code (stmt_info->stmt) == GIMPLE_ASSIGN)
+{
+  tree_code subcode = gimple_assign_rhs_code (stmt_info->stmt);
+  if (subcode == COND_EXPR)
+   return 2;
+}
+
+  return 0;
+}
+
 /* Implement targetm.vectorize.add_stmt_cost.  */
 
 static unsigned
@@ -5031,6 +5054,7 @@ rs6000_add_stmt_cost (void *data, int count, enum 
vect_cost_for_stmt kind,
   tree vectype = stmt_info ? stmt_vectype (stmt_info) : NULL_TREE;
   int stmt_cost = rs6000_builtin_vectorization_cost (kind, vectype,
 misalign);
+  stmt_cost += adjust_vectorization_cost (kind, stmt_info);
   /* Statements in an inner loop relative to the loop being
 vectorized are weighted more heavily.  The value here is
 arbitrary and could potentially be improved with analysis.  */


[RFC/PATCH] IVOPTs select cand with preferred D-form access

2020-01-06 Thread Kewen.Lin
Hi all,

Recently I've been investigating an issue related to the use of D-form/X-form
vector memory accesses; it's the same issue the patch
https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01879.html
was intended to deal with.  Power9 introduces DQ-form instructions for vector
memory access, and we prefer to use DQ-form when unrolling loops.  As the
example in the above link shows, it can save a number of ADDIs and a GPR for
indexing.

Or take the example below:

extern void dummy (double, unsigned n);

void
func (double *x, double *y, unsigned m, unsigned n)
{
  double sacc;
  for (unsigned j = 1; j < m; j++)
  {
sacc = 0.0;
for (unsigned i = 1; i < n; i++)
  sacc = sacc + x[i] * y[i];
dummy (sacc, n);
  }
}

Core loop with X-form (lxvx):
/*
mtctr   r10
lxvxvs12,r31,r9
lxvxvs0,r30,r9
addir10,r9,16
addir9,r9,32
xvmaddadp vs32,vs12,vs0
lxvxvs12,r31,r10
lxvxvs0,r30,r10
xvmaddadp vs11,vs12,vs0
lxvxvs12,r31,r9
lxvxvs0,r30,r9
addir9,r10,32
xvmaddadp vs32,vs12,vs0
lxvxvs12,r31,r9
lxvxvs0,r30,r9
addir9,r10,48
xvmaddadp vs11,vs12,vs0
lxvxvs12,r31,r9
lxvxvs0,r30,r9
addir9,r10,64
xvmaddadp vs32,vs12,vs0
lxvxvs12,r31,r9
lxvxvs0,r30,r9
addir9,r10,80
xvmaddadp vs11,vs12,vs0
lxvxvs12,r31,r9
lxvxvs0,r30,r9
addir9,r10,96
xvmaddadp vs32,vs12,vs0
lxvxvs12,r31,r9
lxvxvs0,r30,r9
addir9,r10,112
xvmaddadp vs11,vs12,vs0
bdnz190 
*/

vs.
/*
Core loop with D-form (lxv)
mtctr   r8
lxv vs12,0(r9)
lxv vs0,0(r10)
addir7,r9,16  // r7, r8 can be eliminated further with r9, r10
addir8,r10,16 // 2 or 4 addi vs. 8 addi above
addir9,r9,128
addir10,r10,128  
xvmaddadp vs32,vs12,vs0
lxv vs12,-112(r9)
lxv vs0,-112(r10)
xvmaddadp vs11,vs12,vs0
lxv vs12,16(r7)
lxv vs0,16(r8)
xvmaddadp vs32,vs12,vs0
lxv vs12,32(r7)
lxv vs0,32(r8)
xvmaddadp vs11,vs12,vs0
lxv vs12,48(r7)
lxv vs0,48(r8)
xvmaddadp vs32,vs12,vs0
lxv vs12,64(r7)
lxv vs0,64(r8)
xvmaddadp vs11,vs12,vs0
lxv vs12,80(r7)
lxv vs0,80(r8)
xvmaddadp vs32,vs12,vs0
lxv vs12,96(r7)
lxv vs0,96(r8)
xvmaddadp vs11,vs12,vs0
bdnz1b0 
*/

We are thinking about whether it can be handled in IVOPTs instead of in an
RTL pass.

While selecting IV cands, IVOPTs doesn't know the loop will be unrolled, so
it doesn't count the possible step cost in with X-form.  If we can teach it
to consider this case, the IV cands which play well with D-form can be
preferred.  Currently unrolling (incomplete) happens in RTL, so it looks
like we have to predict in IVOPTs whether the loop will be unrolled.  Since
there are some parameter checks on RTL insn counts and target hooks, it
seems not easy to get that.  Besides, we need to check that the step is
valid to put into the D-form field (e.g. DQ-form requires the offset to be
divisible by 16 exactly), to ensure no extra ADDIs are needed.
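
As a rough illustration (a hypothetical helper, not part of the attached
patch), such a validity check could look like:

/* Return true if DISP is usable as a DQ-form displacement: it must fit
   the signed 16-bit DQ field and have its low four bits zero, i.e. be
   a multiple of 16.  */
static bool
dq_form_disp_ok_p (HOST_WIDE_INT disp)
{
  return IN_RANGE (disp, -32768, 32767) && (disp & 0xf) == 0;
}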

I'm not sure whether it's a good idea to implement this in IVOPTs, but I did
some changes in IVOPTs to prove it's doable to get the expected code; the
patch is attached.

Any comments/suggestions are highly appreciated!

BR,
Kewen
diff --git a/gcc/common.opt b/gcc/common.opt
index 404b6aa..0d3f8f8 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1465,6 +1465,10 @@ ffinite-loops
 Common Report Var(flag_finite_loops) Optimization
 Assume that loops with an exit will terminate and not loop indefinitely.
 
+fivopts-dform
+Common Report Var(flag_ivopts_dform) Init(1) Optimization
+Assume D-form is preferred in IVOPTS like unrolling.
+
 ffixed-
 Common Joined RejectNegative Var(common_deferred_options) Defer
 -ffixed- Mark  as being unavailable to the compiler.
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 2995348..588feac 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1654,6 +1654,9 @@ static const struct attribute_spec 
rs6000_attribute_table[] =
 #undef TARGET_PREDICT_DOLOOP_P
 #define TARGET_PREDICT_DOLOOP_P rs6000_predict_doloop_p
 
+#undef TARGET_D_FORM_SUITABLE_P
+#define TARGET_D_FORM_SUITABLE_P rs6000_d_form_suitable_p
+
 #undef TARGET_HAVE_COUNT_REG_DECR_P
 #define TARGET_HAVE_COUNT_REG_DECR_P true
 
@@ -26258,6 +26261,28 @@ rs6000_predict_doloop_p (struct loop *loop)
   return true;
 }
 
+static bool
+rs6000_d_form_suitable_p (machine_mode mode, signed HOST_WIDE_INT val,
+ bool is_sig, bool is_store)
+{
+  /* Only Power9 and above supports DQ form

Re: [RFC] IVOPTs select cand with preferred D-form access

2020-01-07 Thread Kewen.Lin
on 2020/1/7 下午5:14, Richard Biener wrote:
> On Mon, 6 Jan 2020, Kewen.Lin wrote:
> 
>> We are thinking whether it can be handled in IVOPTs instead of one RTL pass.
>>
>> During IVOPTs selecting IV cands, it doesn't know the loop will be unrolled 
>> so
>> it doesn't count the possible step cost in with X-form.  If we can teach it 
>> to
>> consider the case, the IV cands which plays with D-form can be preferred.
>> Currently unrolling (incomplete) happens in RTL, it looks we have to predict
>> the loop whether unroll in IVOPTs.  Since there is some parameter checks on 
>> RTL
>> insn counts and target hooks, it seems not easy to get that.  Besides, we 
>> need
>> to check the step is valid to put into D-form field (eg: DQ-form requires 
>> divide
>> 16 exactly), to ensure no extra ADDIs needed.
>>
>> I'm not sure whether it's a good idea to implement in IVOPTs, but I did some
>> changes in IVOPTs to prove it's doable to get expected codes, the patch is 
>> attached.
>>
>> Any comments/suggestions are highly appreiciated!
> 
> Is the unrolled code better than the not unrolled code (assuming
> optimal IV choice)?  Then IMHO IVOPTs should drive the unrolling,
> either by actually doing it or by forcing it via the loop->unroll
> setting.  I don't think second-guessing the RTL unroller at this
> point is going to work.  Alternatively turn X-form into D-form during
> RTL unrolling?
> 

Hi Richard,

Thanks for the comments!

Yes, the unrolled version is better on Power9 for both forms, but D-form
unrolled is better than X-form unrolled.  If we drive unrolling in IVOPTs,
I'm not sure whether it would be a concern that IVOPTs becomes too heavy, or
too rude with a forced UF if the prediction is imprecise.  Do we still have
the plan to introduce one middle-end unroll pass, and would it help if so?
The quoted RTL patch proposes one RTL pass after the RTL loop passes; it
also sounds good to check whether RTL unrolling is a good place!


BR,
Kewen



Re: [RFC] IVOPTs select cand with preferred D-form access

2020-01-07 Thread Kewen.Lin
on 2020/1/7 下午7:25, Richard Biener wrote:
> On Tue, 7 Jan 2020, Kewen.Lin wrote:
> 
>> on 2020/1/7 下午5:14, Richard Biener wrote:
>>> On Mon, 6 Jan 2020, Kewen.Lin wrote:
>>>
>>>> We are thinking whether it can be handled in IVOPTs instead of one RTL 
>>>> pass.
>>>>
>>>> During IVOPTs selecting IV cands, it doesn't know the loop will be 
>>>> unrolled so
>>>> it doesn't count the possible step cost in with X-form.  If we can teach 
>>>> it to
>>>> consider the case, the IV cands which plays with D-form can be preferred.
>>>> Currently unrolling (incomplete) happens in RTL, it looks we have to 
>>>> predict
>>>> the loop whether unroll in IVOPTs.  Since there is some parameter checks 
>>>> on RTL
>>>> insn counts and target hooks, it seems not easy to get that.  Besides, we 
>>>> need
>>>> to check the step is valid to put into D-form field (eg: DQ-form requires 
>>>> divide
>>>> 16 exactly), to ensure no extra ADDIs needed.
>>>>
>>>> I'm not sure whether it's a good idea to implement in IVOPTs, but I did 
>>>> some
>>>> changes in IVOPTs to prove it's doable to get expected codes, the patch is 
>>>> attached.
>>>>
>>>> Any comments/suggestions are highly appreiciated!
>>>
>>> Is the unrolled code better than the not unrolled code (assuming
>>> optimal IV choice)?  Then IMHO IVOPTs should drive the unrolling,
>>> either by actually doing it or by forcing it via the loop->unroll
>>> setting.  I don't think second-guessing the RTL unroller at this
>>> point is going to work.  Alternatively turn X-form into D-form during
>>> RTL unrolling?
>>>
>>
>> Hi Richard,
>>
>> Thanks for the comments!
>>
>> Yes, unrolled version is better on Power9 for both forms, but D-form 
>> unrolled is better than X-form unrolled.  If we drive unrolling in 
>> IVOPTs, not sure it will be a concern that IVOPTs becomes too heavy? or 
>> too rude with forced UF if imprecise? Do we still have the plan to 
>> introduce one middle-end unroll pass, does it help if yes?
> 
> I have the opinion that an isolated unrolling pass is not wanted.
> Instead unrolling should be driven by some profitability metric
> which in your case is better induction variable optimization.
> In the "usual" case it is better scheduling where then scheduling
> should drive unrolling.

OK, it makes sense.  I've heard some compilers consider the unrolling factor
for vectorization and some for modulo scheduling.

> 
>> The quoted 
>> RTL patch is to propose one RTL pass after RTL loop passes, it also 
>> sounds good to check whether RTL unrolling is a good place!
> 
> Why would you need a new RTL pass?  I'd do it during the unroll
> transform itself, ideally on the not unrolled body because that's
> likely simpler than updating N copies?

Good question; I don't have a good understanding of it.  But from the notes
of the patch, I guess the new pass doesn't only handle the cases exposed
by unrolling, but also the ones without unrolling.

Quoted from its note: "This new pass scans existing rtl expressions and
replaces X-form loads and stores with rtl expressions that favor selection
of the D-form instructions in contexts for which the D-form instructions
are preferred.  The new pass runs after the RTL loop optimizations since
loop unrolling often introduces opportunities for beneficial replacements
of X-form addressing instructions."

BR,
Kewen



Re: [RFC] IVOPTs select cand with preferred D-form access

2020-01-08 Thread Kewen.Lin
Hi Bin,

> I am a bit worried that would make IVOPTs heavy too, it might be
> possible to compute heuristics whether loop should be unrolled as a
> post-IVOPTs transformation.  Of course the transformation needs to do
> more work than simply unrolling in order to take advantage of
> aforementioned addressing mode.

Agreed, I prefer to just figure out the unroll factor (UF) by some
heuristics instead of performing actual unrolling as well.  I guess
"post-IVOPTs" is a typo for "pre-IVOPTs"?

> BTW, unrolled loop won't perform as good as ppc if the target doesn't
> support [base + register + offset] addressing mode?
> 

A target which doesn't support D-form would probably still benefit from
unrolling, but the IVOPTs decision won't affect it since X-form doesn't
have an offset field to hide the step updates.  In the next patch, I'll
compute the UF with more heuristics and update the IV cand step cost with
it: for a D-form cand, just one step_cost; for an X-form cand, it would be
UF*step_cost (see the sketch below).
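
To sketch the intended change (the helper names here are hypothetical;
add_cost and the ivopts fields are existing ones):

/* In determine_iv_cost: a cand whose uses can all be D-form keeps a
   single step update, while an X-form cand pays one update per
   unrolled copy.  */
unsigned uf = predicted_unroll_factor (data->current_loop); /* hypothetical */
machine_mode mode = TYPE_MODE (TREE_TYPE (cand->iv->base));
int cost_step = add_cost (data->speed, mode);
if (!cand_ok_for_dform_p (data, cand))  /* hypothetical */
  cost_step *= uf;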

> Another point, in case of multiple passes doing unrolling, the
> "already unrolled" information may need to be recorded as a flag of
> loop properties.

Yes, we can record the computed UF in the loop struct's unroll field.
I'll check the performance impact to make sure the proposed UF computation
isn't poor.

Thanks,
Kewen



Re: [PATCH, rs6000] Add subreg patterns for SImode rotate and mask insert

2024-03-06 Thread Kewen.Lin
Hi,

on 2024/3/1 10:41, HAO CHEN GUI wrote:
> Hi,
>   This patch fixes regression cases in gcc.target/powerpc/rlwimi-2.c. In
> combine pass, SImode (subreg from DImode) lshiftrt is converted to DImode
> lshiftrt with an out AND. It matches a DImode rotate and mask insert on
> rs6000.
> 
> Trying 2 -> 7:
> 2: r122:DI=r129:DI
>   REG_DEAD r129:DI
> 7: r125:SI=r122:DI#0 0>>0x1f
>   REG_DEAD r122:DI
> Failed to match this instruction:
> (set (subreg:DI (reg:SI 125 [ x ]) 0)
> (zero_extract:DI (reg:DI 129)
> (const_int 32 [0x20])
> (const_int 1 [0x1])))
> Successfully matched this instruction:
> (set (subreg:DI (reg:SI 125 [ x ]) 0)
> (and:DI (lshiftrt:DI (reg:DI 129)
> (const_int 31 [0x1f]))
> (const_int 4294967295 [0x])))
> 
> This conversion blocks the further combination which combines to a SImode
> rotate and mask insert insn.
> 
> Trying 9, 7 -> 10:
> 9: r127:SI=r130:DI#0&0xfffe
>   REG_DEAD r130:DI
> 7: r125:SI#0=r129:DI 0>>0x1f&0x
>   REG_DEAD r129:DI
>10: r124:SI=r127:SI|r125:SI
>   REG_DEAD r125:SI
>   REG_DEAD r127:SI
> Failed to match this instruction:
> (set (reg:SI 124)
> (ior:SI (and:SI (subreg:SI (reg:DI 130) 0)
> (const_int -2 [0xfffe]))
> (subreg:SI (zero_extract:DI (reg:DI 129)
> (const_int 32 [0x20])
> (const_int 1 [0x1])) 0)))
> Failed to match this instruction:
> (set (reg:SI 124)
> (ior:SI (and:SI (subreg:SI (reg:DI 130) 0)
> (const_int -2 [0xfffe]))
> (subreg:SI (and:DI (lshiftrt:DI (reg:DI 129)
> (const_int 31 [0x1f]))
> (const_int 4294967295 [0x])) 0)))
> 
>   The root cause of the issue is if it's necessary to do the widen mode for
> lshiftrt when the target already has the narrow mode lshiftrt and its cost
> is not high. My former patch tried to fix the problem but not accepted yet.
> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624852.html

I hope Segher can chime in on this proposal to update the combine pass.  I
can understand that the new proposal of introducing new patterns in target
code is able to fix the issue, but IMHO it's likely there are some other
mis-optimizations which go unnoticed and would need a similar pattern
extension (duplicating some pattern & adjusting with subreg) to optimize;
from this perspective, it would be nice if a more general fix were possible.

Some minor comments for this patch itself are inlined.

> 
>   As it's stage 4 now, I drafted this patch to fix the regression by adding
> subreg patterns of SImode rotate and mask insert. It actually does reversed
> things and narrow the mode for lshiftrt so that it can matches the SImode
> rotate and mask insert.
> 
>   The case "rlwimi-2.c" is fixed and restore the corresponding number of
> insns to original ones. The case "rlwinm-0.c" is also changed and 9 "rlwinm"
> is replaced with 9 "rldicl" as the sequence of combine is changed. It's not
> a regression as the total number of insns isn't changed.
> 
>   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
> regressions. Is it OK for the trunk?
> 
> Thanks
> Gui Haochen
> 
> 
> ChangeLog
> rs6000: Add subreg patterns for SImode rotate and mask insert
> 
> In combine pass, SImode (subreg from DImode) lshiftrt is converted to DImode
> lshiftrt with an AND.  The new pattern matches rotate and mask insert on
> rs6000.  Thus it blocks the pattern to be further combined to a SImode rotate
> and mask insert pattern.  This patch fixes the problem by adding two subreg
> pattern for SImode rotate and mask insert patterns.
> 
> gcc/
>   PR target/93738
>   * config/rs6000/rs6000.md (*rotlsi3_insert_9): New.
>   (*rotlsi3_insert_8): New.
> 
> gcc/testsuite/
>   PR target/93738
>   * gcc.target/powerpc/rlwimi-2.c: Adjust the number of 64bit and 32bit
>   rotate instructions.
>   * gcc.target/powerpc/rlwinm-0.c: Likewise.
> 
> patch.diff
> diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
> index bc8bc6ab060..b0b40f91e3e 100644
> --- a/gcc/config/rs6000/rs6000.md
> +++ b/gcc/config/rs6000/rs6000.md
> @@ -4253,6 +4253,36 @@ (define_insn "*rotl3_insert"
>  ; difference between rlwimi and rldimi.  We also might want dot forms,
>  ; but not for rlwimi on POWER4 and similar processors.
> 
> +; Subreg pattern of insn "*rotlsi3_insert"
> +(define_insn_and_split "*rotlsi3_insert_9"

Nit: "*rotlsi3_insert_subreg" seems a better name, ...

> +  [(set (match_operand:SI 0 "gpc_reg_operand" "=r")
> + (ior:SI (and:SI
> +  (match_operator:SI 8 "lowpart_subreg_operator"
> +   [(and:DI (match_operator:DI 4 "rotate_mask_operator"
> + [(match_operand:DI 1 "gpc_reg_operand" "r")
> +  (match_operand:SI 2 "const_int_operand" "n")])
> +(match_operand:DI 3 "const_int_ope

Re: [PATCH V3] rs6000: Don't ICE when compiling the __builtin_vsx_splat_2di built-in [PR113950]

2024-03-06 Thread Kewen.Lin
Hi,

on 2024/3/4 02:55, jeevitha wrote:
> Hi All,
> 
> The following patch has been bootstrapped and regtested on powerpc64le-linux.
>   
> When we expand the __builtin_vsx_splat_2di function, we were allowing 
> immediate
> value for second operand which causes an unrecognizable insn ICE. Even though
> the immediate value was forced into a register, it wasn't correctly assigned
> to the second operand. So corrected the assignment of op1 to operands[1].
> 
> 2024-02-29  Jeevitha Palanisamy  
> 
> gcc/
>   PR target/113950
>   * config/rs6000/vsx.md (vsx_splat_): Corrected assignment to
>   operand1.

Nit: s/Corrected/Correct/, maybe add "and simplify else if with else.".

> 
> gcc/testsuite/
>   PR target/113950
>   * gcc.target/powerpc/pr113950.c: New testcase.
> 
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index 6111cc90eb7..f135fa079bd 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -4666,8 +4666,8 @@
>rtx op1 = operands[1];
>if (MEM_P (op1))
>  operands[1] = rs6000_force_indexed_or_indirect_mem (op1);
> -  else if (!REG_P (op1))
> -op1 = force_reg (mode, op1);
> +  else
> +operands[1] = force_reg (mode, op1);
>  })
>  
>  (define_insn "vsx_splat__reg"
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr113950.c 
> b/gcc/testsuite/gcc.target/powerpc/pr113950.c
> new file mode 100644
> index 000..64566a580d9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr113950.c
> @@ -0,0 +1,24 @@
> +/* PR target/113950 */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O1 -mvsx" } */

Nit: s/-O1/-O2/.  -O2 is preferred when the failure can be reproduced
with -O2 (not just -O1); since optimizations may change the level at
which the issue is exposed, -O2 is more general.

As per the discussions in the threads of the previous versions, I think
Segher agreed with this approach, so OK for trunk with the above nits
tweaked, thanks!

BR,
Kewen

> +
> +/* Verify we do not ICE on the following.  */
> +
> +void abort (void);
> +
> +int main ()
> +{
> +  int i;
> +  vector signed long long vsll_result, vsll_expected_result;
> +  signed long long sll_arg1;
> +
> +  sll_arg1 = 300;
> +  vsll_expected_result = (vector signed long long) {300, 300};
> +  vsll_result = __builtin_vsx_splat_2di (sll_arg1);  
> +
> +  for (i = 0; i < 2; i++)
> +if (vsll_result[i] != vsll_expected_result[i])
> +  abort();
> +
> +  return 0;
> +}
> 
> 



Re: [PATCH] fix PowerPC < 7 w/ Altivec not to default to power7

2024-03-10 Thread Kewen.Lin
Hi,

on 2024/3/8 19:33, Rene Rebe wrote:
> This might not be the best timing -short before a major release-,
> however, Sam just commented on the bug I filled years ago [1], so here
> we go:
> 
> Glibc uses .machine to determine assembler optimizations to use.
> However, since reworking the rs6000 .machine output selection in
> commit e154242724b084380e3221df7c08fcdbd8460674 22 May 2019, G5 as
> well as Cell, and even power4 w/ -maltivec currently resulted in
> power7. Mask _ALTIVEC away as the .machine selection already did for
> GFX and GPOPT.

Thanks for fixing!  This fix looks reasonable to me: OPTION_MASK_ALTIVEC
is part of POWERPC_7400_MASK, so any specified cpu type which has
POWERPC_7400_MASK by default and isn't handled early in function
rs6000_machine_from_flags can suffer from this issue.
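
For reference, IIRC the mask is defined in rs6000-cpus.def roughly as
below (paraphrased from memory, please double-check the exact form):

#define POWERPC_7400_MASK	(OPTION_MASK_PPC_GFXOPT | OPTION_MASK_ALTIVEC)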

> 
> powerpc64-t2-linux-gnu-gcc  test.c -S -o - -mcpu=G5
>   .file   "test.c"
>   .machine power7
>   .abiversion 2
>   .section".text"
>   .ident  "GCC: (GNU) 10.2.0"
>   .section.note.GNU-stack,"",@progbits
> 

Nit: Could you also add one test case for this?

btw, -mdejagnu-cpu=G5 can force the cpu type in dg-options.
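
A sketch of such a test could be (the exact scan pattern is my guess):

/* { dg-do compile } */
/* { dg-options "-mdejagnu-cpu=G5" } */
/* With the fix, a G5 cpu type must not select .machine power7.  */
/* { dg-final { scan-assembler-not {\.machine power7} } } */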

> We ship this in T2/Linux [2] since 2020 and it is tested on G5, Cell
> and Power8.
> 
> Signed-of-by: René Rebe 
> 
> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97367
> [2] https://t2sde.org
> 
> --- gcc-11.1.0-RC-20210423/gcc/config/rs6000/rs6000.cc.vanilla
> 2021-04-25 22:57:16.964223106 +0200
> +++ gcc-11.1.0-RC-20210423/gcc/config/rs6000/rs6000.cc2021-04-25 
> 22:57:27.193223841 +0200
> @@ -5765,7 +5765,7 @@
>HOST_WIDE_INT flags = rs6000_isa_flags;
>  
>/* Disable the flags that should never influence the .machine selection.  
> */
> -  flags &= ~(OPTION_MASK_PPC_GFXOPT | OPTION_MASK_PPC_GPOPT | 
> OPTION_MASK_ISEL);
> +  flags &= ~(OPTION_MASK_PPC_GFXOPT | OPTION_MASK_PPC_GPOPT | 
> OPTION_MASK_ALTIVEC | OPTION_MASK_ISEL);

Nit: This line is too long and needs re-format.

BR,
Kewen

>  
>if ((flags & (ISA_3_1_MASKS_SERVER & ~ISA_3_0_MASKS_SERVER)) != 0)
>  return "power10";
> 



Re: [PATCH V3] rs6000: Don't ICE when compiling the __builtin_vsx_splat_2di built-in [PR113950]

2024-03-17 Thread Kewen.Lin
Hi,

on 2024/3/16 04:34, Peter Bergner wrote:
> On 3/6/24 3:27 AM, Kewen.Lin wrote:
>> on 2024/3/4 02:55, jeevitha wrote:
>>> The following patch has been bootstrapped and regtested on 
>>> powerpc64le-linux.
>>> 
>>> When we expand the __builtin_vsx_splat_2di function, we were allowing 
>>> immediate
>>> value for second operand which causes an unrecognizable insn ICE. Even 
>>> though
>>> the immediate value was forced into a register, it wasn't correctly assigned
>>> to the second operand. So corrected the assignment of op1 to operands[1].
> [snip]
>> As the discussions in the thread of the previous versions, I think
>> Segher agreed this approach, so OK for trunk with the above nits
>> tweaked, thanks!
> 
> The bogus vsx_splat_ code goes all the way back to GCC 8, so we
> should backport this fix.  Segher and Ke Wen, can we get an approval
> to backport this to all the open release branches (GCC 13, 12, 11)?
> Thanks.

Sure, okay for backporting this to all active branches, thanks!

> 
> Jeevitha, once we get approval, please perform the backports.
> 
> Peter
> 
> 

BR,
Kewen



Re: [PATCH] rs6000: Fix up setup_incoming_varargs [PR114175]

2024-03-19 Thread Kewen.Lin
Hi Jakub,

on 2024/3/19 01:21, Jakub Jelinek wrote:
> Hi!
> 
> The c23-stdarg-8.c test (as well as the new test below added to cover even
> more cases) FAIL on powerpc64le-linux and presumably other powerpc* targets
> as well.
> Like in the r14-9503-g218d174961 change on x86-64 we need to advance
> next_cum after the hidden return pointer argument even in case where
> there are no user arguments before ... in C23.
> The following patch does that.
> 
> There is another TYPE_NO_NAMED_ARGS_STDARG_P use later on:
>   if (!TYPE_NO_NAMED_ARGS_STDARG_P (TREE_TYPE (current_function_decl))
>   && targetm.calls.must_pass_in_stack (arg))
> first_reg_offset += rs6000_arg_size (TYPE_MODE (arg.type), arg.type);
> but I believe it was added there in r13-3549-g4fe34cdc unnecessarily,
> when there is no hidden return pointer argument, arg.type is NULL and
> must_pass_in_stack_var_size as well as must_pass_in_stack_var_size_or_pad
> return false in that case, and for the TYPE_NO_NAMED_ARGS_STDARG_P
> case with hidden return pointer argument that argument should have pointer
> type and it is the first argument, so must_pass_in_stack shouldn't be true
> for it either.
> 
> Bootstrapped/regtested on powerpc64le-linux, bootstrap/regtest on
> powerpc64-linux running, ok for trunk?

Okay for trunk (I guess all the testing should go well), thanks for taking
care of this!

FWIW, I also tested the c23-stdarg-* test cases on AIX with this patch;
all of them worked well.

BR,
Kewen

> 
> 2024-03-18  Jakub Jelinek  
> 
>   PR target/114175
>   * config/rs6000/rs6000-call.cc (setup_incoming_varargs): Only skip
>   rs6000_function_arg_advance_1 for TYPE_NO_NAMED_ARGS_STDARG_P functions
>   if arg.type is NULL.
> 
>   * gcc.dg/c23-stdarg-9.c: New test.
> 
> --- gcc/config/rs6000/rs6000-call.cc.jj   2024-01-03 12:01:19.645532834 
> +0100
> +++ gcc/config/rs6000/rs6000-call.cc  2024-03-18 11:36:02.376846802 +0100
> @@ -2253,7 +2253,8 @@ setup_incoming_varargs (cumulative_args_
>  
>/* Skip the last named argument.  */
>next_cum = *get_cumulative_args (cum);
> -  if (!TYPE_NO_NAMED_ARGS_STDARG_P (TREE_TYPE (current_function_decl)))
> +  if (!TYPE_NO_NAMED_ARGS_STDARG_P (TREE_TYPE (current_function_decl))
> +  || arg.type != NULL_TREE)
>  rs6000_function_arg_advance_1 (&next_cum, arg.mode, arg.type, arg.named,
>  0);
>  
> --- gcc/testsuite/gcc.dg/c23-stdarg-9.c.jj2024-03-18 11:46:17.281200214 
> +0100
> +++ gcc/testsuite/gcc.dg/c23-stdarg-9.c   2024-03-18 11:46:26.826065998 
> +0100
> @@ -0,0 +1,284 @@
> +/* Test C23 variadic functions with no named parameters, or last named
> +   parameter with a declaration not allowed in C17.  Execution tests.  */
> +/* { dg-do run } */
> +/* { dg-options "-O2 -std=c23 -pedantic-errors" } */
> +
> +#include 
> +
> +struct S { int a[1024]; };
> +
> +int
> +f1 (...)
> +{
> +  int r = 0;
> +  va_list ap;
> +  va_start (ap);
> +  r += va_arg (ap, int);
> +  va_end (ap);
> +  return r;
> +}
> +
> +int
> +f2 (...)
> +{
> +  int r = 0;
> +  va_list ap;
> +  va_start (ap);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  va_end (ap);
> +  return r;
> +}
> +
> +int
> +f3 (...)
> +{
> +  int r = 0;
> +  va_list ap;
> +  va_start (ap);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  va_end (ap);
> +  return r;
> +}
> +
> +int
> +f4 (...)
> +{
> +  int r = 0;
> +  va_list ap;
> +  va_start (ap);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  va_end (ap);
> +  return r;
> +}
> +
> +int
> +f5 (...)
> +{
> +  int r = 0;
> +  va_list ap;
> +  va_start (ap);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  va_end (ap);
> +  return r;
> +}
> +
> +int
> +f6 (...)
> +{
> +  int r = 0;
> +  va_list ap;
> +  va_start (ap);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  va_end (ap);
> +  return r;
> +}
> +
> +int
> +f7 (...)
> +{
> +  int r = 0;
> +  va_list ap;
> +  va_start (ap);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  va_end (ap);
> +  return r;
> +}
> +
> +int
> +f8 (...)
> +{
> +  int r = 0;
> +  va_list ap;
> +  va_start (ap);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  r += va_arg (ap, int);
> +  va_end (ap);
> +  return r;
> +}
> +
> +struct S
> +s1 (...)
> +{
> +  int r = 0;
> +  va_list ap;
> +  va_start (ap);
> +  r += va_arg (ap, int);
> +  va_end (ap);
> +  struct S s = {};

Re: [PATCH v1] rs6000: Stackoverflow in optimized code on PPC [PR100799]

2024-04-01 Thread Kewen.Lin
Hi!

on 2024/3/22 17:36, Jakub Jelinek wrote:
> On Fri, Mar 22, 2024 at 02:55:43PM +0530, Ajit Agarwal wrote:
>> rs6000: Stackoverflow in optimized code on PPC [PR100799]
>>
>> When using FlexiBLAS with OpenBLAS we noticed corruption of
>> the parameters passed to OpenBLAS functions. FlexiBLAS
>> basically provides a BLAS interface where each function
>> is a stub that forwards the arguments to a real BLAS lib,
>> like OpenBLAS.
>>
>> Fixes the corruption of caller frame checking number of
>> arguments is less than equal to GP_ARG_NUM_REG (8)
>> excluding hidden unused DECLS.
> 
> Looks mostly good to me except some comment nits, but I'll defer
> the actual ack to the rs6000 maintainers.
> 
>> +  /* Workaround buggy C/C++ wrappers around Fortran routines with
>> + character(len=constant) arguments if the hidden string length arguments
>> + are passed on the stack; if the callers forget to pass those arguments,
>> + attempting to tail call in such routines leads to stack corruption.
> 
> I thought it isn't just tail calls, even normal calls.
> When the buggy C/C++ wrappers call the function with fewer arguments
> than it actually has and doesn't expect the parameter save area on the
> caller side because of that while the callee expects it and the callee
> actually stores something in the parameter save area, it corrupts whatever
> is in the caller stack frame at that location.

I agree it's not just tail calls, but currently the DECL_HIDDEN_STRING_LENGTH
setting is guarded by flag_tail_call_workaround, which was intended to be
only for tail calls.  So I wonder if we should update this option name, or
introduce another option which is more of a C/Fortran interoperability
workaround, set DECL_HIDDEN_STRING_LENGTH under this guard, and also have it
enable the existing flag_tail_call_workaround.

> 
>> + Avoid return stack space for parameters <= 8 excluding hidden string
>> + length argument is passed (partially or fully) on the stack in the
>> + caller and the callee needs to pass any arguments on the stack.  */
>> +  unsigned int num_args = 0;
>> +  unsigned int hidden_length = 0;
>> +
>> +  for (tree arg = DECL_ARGUMENTS (current_function_decl);
>> +   arg; arg = DECL_CHAIN (arg))
>> +{
>> +  num_args++;
>> +  if (DECL_HIDDEN_STRING_LENGTH (arg))
>> +{
>> +  tree parmdef = ssa_default_def (cfun, arg);
>> +  if (parmdef == NULL || has_zero_uses (parmdef))
>> +{
>> +  cum->hidden_string_length = 1;
>> +  hidden_length++;
>> +}
>> +}

As Fortran allows strings with unknown length, it's possible to have test
cases which mix used and unused hidden lengths.  Since the used ones matter,
users may already have modified their C code to pass the required used
hidden lengths, but with this change the modified code could stop working.
For example, suppose the 7th and 8th arguments are unused hidden lengths but
the 9th is used: the 9th is passed by the caller on the stack, but the
callee would expect it to come from r9 instead (as if it were the 7th arg).
So IMHO we should be more conservative and only apply this workaround to
contiguous unused hidden lengths at the end of the arg list.  Someone may
argue that if users already know how to modify their C code to interoperate
with Fortran, they should have modified all their C code and we shouldn't
adopt this workaround; but if this restriction still works for all the
motivating test cases, IMHO staying more conservative is good, as users may
have updated only some "broken" cases, not all.
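
To make the scenario concrete, a hypothetical C prototype (purely
illustrative, not taken from any motivating test case) could look like:

/* Six normal args plus three hidden Fortran string lengths (9 in
   total); the 7th/8th hidden lengths are unused in the callee, the
   9th is used.  An ABI-conforming caller passes the 9th in the
   parameter save area, while the workaround would make the callee
   look for it in a GPR (r9 in the numbering above).  */
void sub_ (double *a, double *b, double *c,
           char *s1, char *s2, char *s3,
           long len1, long len2,  /* unused hidden lengths */
           long len3);            /* used hidden length */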

BR,
Kewen


>> +   }
>> +
>> +  cum->actual_parm_length = num_args - hidden_length;
>> +
>>/* Check for a longcall attribute.  */
>>if ((!fntype && rs6000_default_long_calls)
>>|| (fntype
>> @@ -1857,7 +1884,16 @@ rs6000_function_arg (cumulative_args_t cum_v, const 
>> function_arg_info &arg)
>>  
>>return rs6000_finish_function_arg (mode, rvec, k);
>>  }
>> -  else if (align_words < GP_ARG_NUM_REG)
>> + /* Workaround buggy C/C++ wrappers around Fortran routines with
>> +character(len=constant) arguments if the hidden string length arguments
>> +are passed on the stack; if the callers forget to pass those arguments,
>> +attempting to tail call in such routines leads to stack corruption.
>> +Avoid return stack space for parameters <= 8 excluding hidden string
>> +length argument is passed (partially or fully) on the stack in the
>> +caller and the callee needs to pass any arguments on the stack.  */
>> +  else if (align_words < GP_ARG_NUM_REG
>> +   || (cum->hidden_string_length
>> +   && cum->actual_parm_length <= GP_ARG_NUM_REG))
>>  {
>>if (TARGET_32BIT && TARGET_POWERPC64)
>>  return rs6000_mixed_function_arg (mode, type, align_words);
>> diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
>> index 68bc45d65ba..a1d3ed00b14 100644
>> --- a/gcc/config/rs6000/rs6000.h
>> +++ b/gcc/config/rs6000/rs6000.h
>> @@ -1490,6 +1490,14 @@ typedef struct r

Re: [PATCH v2] rs6000: Stackoverflow in optimized code on PPC [PR100799]

2024-04-01 Thread Kewen.Lin
Hi!

on 2024/3/24 02:37, Ajit Agarwal wrote:
> 
> 
> On 23/03/24 9:33 pm, Peter Bergner wrote:
>> On 3/23/24 4:33 AM, Ajit Agarwal wrote:
> -  else if (align_words < GP_ARG_NUM_REG)
> +  else if (align_words < GP_ARG_NUM_REG
> +|| (cum->hidden_string_length
> +&& cum->actual_parm_length <= GP_ARG_NUM_REG))
 {
   if (TARGET_32BIT && TARGET_POWERPC64)
 return rs6000_mixed_function_arg (mode, type, align_words);

   return gen_rtx_REG (mode, GP_ARG_MIN_REG + align_words);
 }
   else
 return NULL_RTX;

 The old code for the unused hidden parameter (which was the 9th param) 
 would
 fall thru to the "return NULL_RTX;" which would make the callee assume 
 there
 was a parameter save area allocated.  Now instead, we'll return a reg rtx,
 probably of r11 (r3 thru r10 are our param regs) and I'm guessing we'll now
 see a copy of r11 into a pseudo like we do for the other param regs.
 Is that a problem? Given it's an unused parameter, it'll probably get 
 deleted
 as dead code, but could it cause any issues?  What if we have more than one

I think Peter raised a good point: I'm not sure it would really cause
issues, but the assigned reg goes beyond GP_ARG_MAX_REG, which is at least
confusing to people, especially without DCE such as at -O0.  Can we
aggressively remove these candidates from the DECL_ARGUMENTS chain?  Would
that cause any assertion to fail?

BR,
Kewen


 unused hidden parameter and we return r12 and r13 which have specific uses
 in our ABIs (eg, r13 is our TCB pointer), so it may not actually look dead.
 Have you verified what the callee RTL looks like after expand for these
 unused hidden parameters?  Is there a rtx we can return that isn't a 
 NULL_RTX
 which triggers the assumption of a parameter save area, but isn't a reg rtx
 which might lead to some rtl being generated?  Would a (const_int 0) or
 something else work?


>>> For the above use case it will return 
>>>
>>> (reg:DI 5 %r5) and below check entry_parm = 
>>> (reg:DI 5 %r5) and the following check will not return TRUE and hence
>>>parameter save area will not be allocated.
>>
>> Why r5?!?!   The 8th (integer) param would return r10, so I'd assume if
>> the next param was a hidden param, then it'd get the next gpr, so r11.
>> How does it jump back to r5 which may have been used by the 3rd param?
>>
>>
> My mistake its r11 only for hidden param.
>>
>>
>>
>>> It will not generate any rtx in the callee rtl code but it just used to
>>> check whether to allocate parameter save area or not when number of args > 
>>> 8.
>>>
>>> /* If there is no incoming register, we need a stack.  */
>>>   entry_parm = rs6000_function_arg (args_so_far, arg);
>>>   if (entry_parm == NULL)
>>> return true;
>>>
>>>   /* Likewise if we need to pass both in registers and on the stack.  */
>>>   if (GET_CODE (entry_parm) == PARALLEL
>>>   && XEXP (XVECEXP (entry_parm, 0, 0), 0) == NULL_RTX)
>>> return true;
>>
>> Yes, this code in rs6000_parm_needs_stack() uses the rs6000_function_arg()
>> return value as a boolean to tell us whether a parameter save area is 
>> required
>> so what we return is unimportant other than to know it's not NULL_RTX.
>>
>> I'm more concerned about the use of the target hook 
>> targetm.calls.function_arg
>> used in the generic parts of the compiler.  What will that code do 
>> differently
>> now that we return a reg rtx rather than NULL_RTX?  Might that code use
>> the reg rtx to emit something?  I'd feel better if you could verify what
>> happens in that code when we return a reg rtx for that 9th hidden param which
>> isn't really being passed in a register.
>>
> 
> As per my understanding and debugging openBLAS code testcase I see that 
> reg_rtx returned inside the below IF condition is used for check whether 
> paramter save area is needed or not. 
> 
> In the generic code where targetm.calls.function_arg is called 
> in calls.cc returned rtx is used for PARALLEL case so that we can
> check if we need to pass both in registers and stack then they emit
> store with respect to return rtx. If we identify that we need only
> registers for argument then it emits nothing.
> 
> Thanks & Regards
> Ajit
>>
>> Peter
>>
>>



Re: [PATCH v2] rs6000: Stackoverflow in optimized code on PPC [PR100799]

2024-04-02 Thread Kewen.Lin
Hi Jakub,

on 2024/4/2 16:03, Jakub Jelinek wrote:
> On Tue, Apr 02, 2024 at 02:12:04PM +0800, Kewen.Lin wrote:
>>>>>> The old code for the unused hidden parameter (which was the 9th param) 
>>>>>> would
>>>>>> fall thru to the "return NULL_RTX;" which would make the callee assume 
>>>>>> there
>>>>>> was a parameter save area allocated.  Now instead, we'll return a reg 
>>>>>> rtx,
>>>>>> probably of r11 (r3 thru r10 are our param regs) and I'm guessing we'll 
>>>>>> now
>>>>>> see a copy of r11 into a pseudo like we do for the other param regs.
>>>>>> Is that a problem? Given it's an unused parameter, it'll probably get 
>>>>>> deleted
>>>>>> as dead code, but could it cause any issues?  What if we have more than 
>>>>>> one
>>
>> I think Peter raised one good point, not sure it would really cause some 
>> issues,
>> but the assigned reg goes beyond GP_ARG_MAX_REG, at least it is confusing to 
>> people
>> especially without DCE like at -O0.  Can we aggressively remove these 
>> candidates
>> from DECL_ARGUMENTS chain?  Does it cause any assertion to fail?
> 
> I'd prefer not to remove DECL_ARGUMENTS chains, they are valid arguments that 
> just some
> invalid code doesn't pass.  By removing them you basically always create an
> invalid case, this time in the other direction, valid caller passes more
> arguments than the callee (invalidly) expects.

Thanks for the comments.  Do you mean it can affect argument validation when
there is an explicit function declaration with an interface?  Then can we
strip them when we are about to expand them (like checking
currently_expanding_function_start)?  From the perspective of the resulting
assembly, with this workaround, the callee can:
  1) pass the hidden args in an unexpected GPR like r11, ... at -O0;
  2) get rid of such hidden args as they are unused at -O2.
This proposal aims to make the assembly at -O0 not pass with r11... (same as
-O2); compared to the assembly at -O2, the mismatch isn't actually changed.

BR,
Kewen



Re: [PATCH v2] rs6000: Stackoverflow in optimized code on PPC [PR100799]

2024-04-03 Thread Kewen.Lin
Hi Jakub,

on 2024/4/3 16:35, Jakub Jelinek wrote:
> On Wed, Apr 03, 2024 at 01:18:54PM +0800, Kewen.Lin wrote:
>>> I'd prefer not to remove DECL_ARGUMENTS chains, they are valid arguments 
>>> that just some
>>> invalid code doesn't pass.  By removing them you basically always create an
>>> invalid case, this time in the other direction, valid caller passes more
>>> arguments than the callee (invalidly) expects.
>>
>> Thanks for the comments, do you mean it can affect the arguments validation 
>> when there
>> is explicit function declaration with interface?  Then can we strip them 
>> when we are
>> going to expand them (like checking currently_expanding_function_start)?
> 
> I'd prefer not stripping them at all; they are clearly marked as perhaps not
> passed in buggy programs (the DECL_HIDDEN_STRING_LENGTH argument) and
> removing them implies the decl is a throw away, that after expansion

Yes, IMHO it's safe as they are unused.

> nothing will actually look at it anymore.  I believe that is the case of
> function bodies, we expand them into RTL and throw away the GIMPLE, and
> after final clear the bodies, but it is not the case of the FUNCTION_DECL
> or its DECL_ARGUMENTs etc.  E.g. GIMPLE optimizations or expansion of
> callers could be looking at those as well.

At expand time GIMPLE optimizations should already have finished, so it
should be safe to strip them at that time?  It would surprise me if the
expansion of callers looked at the callee's information; isn't that more
like something that should be done in IPA analysis instead?

> 
>> since from the
>> perspective of resulted assembly, with this workaround, the callee can:
>>   1) pass the hidden args in unexpected GPR like r11, ... at -O0;
>>   2) get rid of such hidden args as they are unused at -O2;
>> This proposal aims to make the assembly at -O0 not to pass with r11... (same 
>> as -O2),
>> comparing to the assembly at O2, the mismatch isn't actually changed.
> 
> The aim for the workaround was just avoid assuming there is a argument save
> area in the caller stack when it is sometimes missing.

Yeah, understood.

> If you are looking for optimizations where nothing actually passes the
> unneeded arguments and nothing expects them to be passed, then it is a task
> for IPA optimizations and should be done solely if IPA determines that all
> callers can be adjusted together with the callee; I think IPA already does
> that in that case for years, regardless if it is DECL_HIDDEN_STRING_LENGTH
> PARM_DECL or not.

No, it's not what I was looking for.  Peter's comments made me feel it's not
good to have assembly at -O0 like:

std %r3,112(%r31)
std %r4,120(%r31)
std %r5,128(%r31)
std %r6,136(%r31)
std %r7,144(%r31)
std %r8,152(%r31)
std %r9,160(%r31)
std %r10,168(%r31)
std %r11,176(%r31) // this misleads people into thinking we pass the
   // 9th arg via r11; it would be nice not to have it.

so I was wondering if there is some way to get rid of it.

BR,
Kewen



Re: [PATCH v2] rs6000: Stackoverflow in optimized code on PPC [PR100799]

2024-04-03 Thread Kewen.Lin
Hi!

on 2024/4/3 17:23, Jakub Jelinek wrote:
> On Wed, Apr 03, 2024 at 05:02:40PM +0800, Kewen.Lin wrote:
>> on 2024/4/3 16:35, Jakub Jelinek wrote:
>>> On Wed, Apr 03, 2024 at 01:18:54PM +0800, Kewen.Lin wrote:
>>>>> I'd prefer not to remove DECL_ARGUMENTS chains, they are valid arguments 
>>>>> that just some
>>>>> invalid code doesn't pass.  By removing them you basically always create 
>>>>> an
>>>>> invalid case, this time in the other direction, valid caller passes more
>>>>> arguments than the callee (invalidly) expects.
>>>>
>>>> Thanks for the comments, do you mean it can affect the arguments 
>>>> validation when there
>>>> is explicit function declaration with interface?  Then can we strip them 
>>>> when we are
>>>> going to expand them (like checking currently_expanding_function_start)?
>>>
>>> I'd prefer not stripping them at all; they are clearly marked as perhaps not
>>> passed in buggy programs (the DECL_HIDDEN_STRING_LENGTH argument) and
>>> removing them implies the decl is a throw away, that after expansion
>>
>> Yes, IMHO it's safe as they are unused.
> 
> But they are still passed in the usual case.
> 
>>> nothing will actually look at it anymore.  I believe that is the case of
>>> function bodies, we expand them into RTL and throw away the GIMPLE, and
>>> after final clear the bodies, but it is not the case of the FUNCTION_DECL
>>> or its DECL_ARGUMENTs etc.  E.g. GIMPLE optimizations or expansion of
>>> callers could be looking at those as well.
>>
>> At expand time GIMPLE optimizations should already finish, so it should be
>> safe to strip them at that time?
> 
> No.
> The IPA/post IPA behavior is that IPA optimizations are performed and then
> cgraph finalizes one function at a time, going there from modifications
> needed from IPA passes, post IPA GIMPLE optimizations, expansion to RTL,
> RTL optimizations, emitting assembly, throwing away the body, then picking
> another function and repeating that etc.
> So, when one function makes it to expansion, if you modify its
> DECL_ARGUMENTS etc., all the post IPA GIMPLE optimization passes of other
> functions might still see such changes.

Thanks for explaining, I agree it's risky from this perspective.

> 
>>  It would surprise me if expansions of
>> callers will look at callee's information, it's more like what should be
>> done in IPA analysis instead?
> 
> Depends on what exactly it is.  E.g. C non-prototyped functions have
> just DECL_ARGUMENTS to check how many arguments the call should have vs.
> what is actually passed.

OK.

> 
>> No, it's not what I was looking for.  Peter's comments made me feel it's not
>> good to have assembly at O0 like:
>>
>> std %r3,112(%r31)
>> std %r4,120(%r31)
>> std %r5,128(%r31)
>> std %r6,136(%r31)
>> std %r7,144(%r31)
>> std %r8,152(%r31)
>> std %r9,160(%r31)
>> std %r10,168(%r31)
>> std %r11,176(%r31) // this mislead people that we pass 9th arg via 
>> r11,
>>// it would be nice not to have it.
>>
>> so I was thinking if there is some way to get rid of it.
> 
> You want to optimize at -O0?  Don't.

I don't really want optimization; I'm just trying to get rid of the
unreasonable assembly code.  :)

> That will screw up debugging.  The function does have that argument, it
> should show up in debug info; it should show up also at -O2 in debug info
> etc.  If you remove chains from DECL_ARGUMENTS, because we have early dwarf
> these days, DW_TAG_formal_parameter nodes should have been already created,
> but it would mean that DW_AT_location for those arguments likely isn't
> filled.  Now, for -O2 it might be the case that the argument has useful
> location only at the start of the function, could have
> DW_OP_entry_value(%r11) afterwards, but at -O0 it better should have some
> stack slot into which the argument is saved and DW_AT_location should be
> that stack slot.  All that should change with the workaround is that if the
> stack slot would be normally in the argument save area in the caller's
> frame, if such argument save area can't be counted on, then it needs to be
> saved in some other stack slot, like arguments are saved to when there are
> only <= 8 arguments.

Thanks for the details on debugging support, but IIUC with this workaround
being adopted, the debuggability of hidden args is already broken, isn't it?
Since with a usual caller, the actual argument is passed in the argument
save area, but the callee debug information says the location is %r11 or
some other stack slot.

Re: [PATCH v2] rs6000: Stackoverflow in optimized code on PPC [PR100799]

2024-04-03 Thread Kewen.Lin
on 2024/4/3 19:18, Jakub Jelinek wrote:
> On Wed, Apr 03, 2024 at 07:01:50PM +0800, Kewen.Lin wrote:
>> Thanks for the details on debugging support, but IIUC with this workaround
>> being adopted, the debuggability on hidden args are already broken, aren't?
> 
> No.
> In the correct program case, which should be the usual case, the caller will
> pass the arguments and one should be able to see the values in the debugger
> even if the function doesn't actually use those arguments.
> If the caller is buggy and doesn't pass those arguments, one should be able
> to see garbage values for those arguments and perhaps that way figure out
> that the program is buggy and fix it.

But that's not true with Ajit's current implementation, which lies that the
args are passed in r11...; so whether the caller is usual (passing in the
argument save area) or not (passing no such value), the values are all
broken.

> 
>> Since with a usual caller, the actual argument is passed in argument save
>> area, but the callee debug information says the location is %r11 or some
>> other stack slot.
>>
>> I think the difficulty is that: with this workaround, for some arguments we
>> are lying they are not passed in argument save area, then we have to pretend
>> they are passed in r11,r12..., but in fact these registers are not valid to
>> pass arguments, so it's unreasonable and confusing.  With your explanation,
>> I agree that stripping DECL_ARGUMENTS chains isn't a good way to eliminate
>> this confusion, maybe always using GP_ARG_MIN_REG/GP_ARG_MAX_REG for things
>> exceeding GP_ARG_MAX_REG can reduce the unreasonableness (but still confusing
>> IMHO).
> 
> If those arguments aren't passed in r11/r12, but in memory, the workaround
> shouldn't pretend they are passed somewhere where they aren't actually
> passed.

Unfortunately the current implementation doesn't conform this, I misunderstood
you knew that.

> Instead, it should load them from the memory where they are actually
> normally passed.
> What needs to be ensured though is that those arguments are for -O0 loaded
> from those stack slots and saved to different stack slots (inside of the
> callee frame, rather than in caller's frame), for -O1+ just not loaded at
> all and pretended to just live in the caller's frame, and most importantly
> ensure that the callee doesn't try to think there is a parameter save area
> in the caller's frame which it can use for random saving related or
> unrelated data.  So, e.g. REG_EQUAL/REG_EQUIV shouldn't be used, nor tell
> that the 1st-8th arguments could be saved to the parameter save area.
> So, for the 1st-8th arguments it really should act as if there is no
> parameter save area and for the DECL_HIDDEN_STRING_LENGTH ones after it
> as it those are passed in memory, but as if that memory is owned by the
> caller, not callee, so it is not correct to modify that memory.

Now I get your points.  I like this proposal and also believe it makes more
sense for both the resulting assembly and the debuggability support, though
it sounds like the implementation has to be more complicated than what's
done now.

Thanks for all the inputs!!

BR,
Kewen



Re: [PATCH] rs6000: Replace OPTION_MASK_DIRECT_MOVE with OPTION_MASK_P8_VECTOR [PR101865]

2024-04-08 Thread Kewen.Lin
Hi Peter,

on 2024/4/6 06:28, Peter Bergner wrote:
> This is a cleanup patch in preparation to fixing the real bug in PR101865.
> TARGET_DIRECT_MOVE is redundant with TARGET_P8_VECTOR, so alias it to that.
> Also replace all usages of OPTION_MASK_DIRECT_MOVE with OPTION_MASK_P8_VECTOR
> and delete the now dead mask.
> 
> This passed bootstrap and retesting on powerpc64le-linux with no regressions.
> Ok for trunk?
> 
> Eventually we'll want to backport this along with the follow-on patch that
> actually fixes PR101865.
> 
> Peter
> 
> 
> gcc/
>   PR target/101865
>   * config/rs6000/rs6000.h (TARGET_DIRECT_MOVE): Define.
>   * config/rs6000/rs6000.cc (rs6000_option_override_internal): Replace
>   OPTION_MASK_DIRECT_MOVE with OPTION_MASK_P8_VECTOR.  Delete redundant
>   OPTION_MASK_DIRECT_MOVE usage.  Delete TARGET_DIRECT_MOVE dead code.
>   (rs6000_opt_masks): Neuter the "direct-move" option.
>   * config/rs6000/rs6000-c.cc (rs6000_target_modify_macros): Replace
>   OPTION_MASK_DIRECT_MOVE with OPTION_MASK_P8_VECTOR.  Delete useless
>   comment.
>   * config/rs6000/rs6000-cpus.def (ISA_2_7_MASKS_SERVER): Delete
>   OPTION_MASK_DIRECT_MOVE.
>   (OTHER_VSX_VECTOR_MASKS): Likewise.
>   (POWERPC_MASKS): Likewise.
>   * config/rs6000/rs6000.opt (mno-direct-move): New.
>   (mdirect-move): Remove Mask and Var.
> 
> 
> diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
> index 68bc45d65ba..77d045c9f6e 100644
> --- a/gcc/config/rs6000/rs6000.h
> +++ b/gcc/config/rs6000/rs6000.h
> @@ -471,6 +471,8 @@ extern int rs6000_vector_align[];
>  #define TARGET_EXTSWSLI  (TARGET_MODULO && TARGET_POWERPC64)
>  #define TARGET_MADDLDTARGET_MODULO
>  
> +/* TARGET_DIRECT_MOVE is redundant to TARGET_P8_VECTOR, so alias it to that. 
>  */
> +#define TARGET_DIRECT_MOVE   TARGET_P8_VECTOR
>  #define TARGET_XSCVDPSPN (TARGET_DIRECT_MOVE || TARGET_P8_VECTOR)
>  #define TARGET_XSCVSPDPN (TARGET_DIRECT_MOVE || TARGET_P8_VECTOR)
>  #define TARGET_VADDUQM   (TARGET_P8_VECTOR && TARGET_POWERPC64)
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 6ba9df4f02e..c241371147c 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -3811,7 +3811,7 @@ rs6000_option_override_internal (bool global_init_p)
>   Testing for direct_move matches power8 and later.  */
>if (!BYTES_BIG_ENDIAN
>&& !(processor_target_table[tune_index].target_enable
> -& OPTION_MASK_DIRECT_MOVE))
> +& OPTION_MASK_P8_VECTOR))
>  rs6000_isa_flags |= ~rs6000_isa_flags_explicit & 
> OPTION_MASK_STRICT_ALIGN;
>  
>/* Add some warnings for VSX.  */
> @@ -3853,8 +3853,7 @@ rs6000_option_override_internal (bool global_init_p)
>&& (rs6000_isa_flags_explicit & (OPTION_MASK_SOFT_FLOAT
>  | OPTION_MASK_ALTIVEC
>  | OPTION_MASK_VSX)) != 0)
> -rs6000_isa_flags &= ~((OPTION_MASK_P8_VECTOR | OPTION_MASK_CRYPTO
> -| OPTION_MASK_DIRECT_MOVE)
> +rs6000_isa_flags &= ~((OPTION_MASK_P8_VECTOR | OPTION_MASK_CRYPTO)
>& ~rs6000_isa_flags_explicit);
>  
>if (TARGET_DEBUG_REG || TARGET_DEBUG_TARGET)
> @@ -3939,13 +3938,6 @@ rs6000_option_override_internal (bool global_init_p)
>rs6000_isa_flags &= ~OPTION_MASK_FPRND;
>  }
>  
> -  if (TARGET_DIRECT_MOVE && !TARGET_VSX)
> -{
> -  if (rs6000_isa_flags_explicit & OPTION_MASK_DIRECT_MOVE)
> - error ("%qs requires %qs", "-mdirect-move", "-mvsx");
> -  rs6000_isa_flags &= ~OPTION_MASK_DIRECT_MOVE;
> -}
> -
>if (TARGET_P8_VECTOR && !TARGET_ALTIVEC)
>  rs6000_isa_flags &= ~OPTION_MASK_P8_VECTOR;
>  
> @@ -24429,7 +24421,7 @@ static struct rs6000_opt_mask const 
> rs6000_opt_masks[] =
>   false, true  },
>{ "cmpb",  OPTION_MASK_CMPB,   false, true  },
>{ "crypto",OPTION_MASK_CRYPTO, false, 
> true  },
> -  { "direct-move",   OPTION_MASK_DIRECT_MOVE,false, true  },
> +  { "direct-move",   0,  false, true  },
>{ "dlmzb", OPTION_MASK_DLMZB,  false, true  },
>{ "efficient-unaligned-vsx",   OPTION_MASK_EFFICIENT_UNALIGNED_VSX,
>   false, true  },
> diff --git a/gcc/config/rs6000/rs6000-c.cc b/gcc/config/rs6000/rs6000-c.cc
> index ce0b14a8d37..647f20de7f2 100644
> --- a/gcc/config/rs6000/rs6000-c.cc
> +++ b/gcc/config/rs6000/rs6000-c.cc
> @@ -429,19 +429,7 @@ rs6000_target_modify_macros (bool define_p, 
> HOST_WIDE_INT flags)
>  rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR6");
>if ((flags & OPTION_MASK_POPCNTD) != 0)
>  rs6000_define_or_undefine_macro (define_p, "_ARCH_PWR7");
> -  /* Note 

Re: [PATCH 3/3] Add -mcpu=power11 tests

2024-04-08 Thread Kewen.Lin
Hi Mike,

on 2024/3/20 12:16, Michael Meissner wrote:
> This patch adds some simple tests for -mcpu=power11 support.  In order to run
> these tests, you need an assembler that supports the appropriate option for
> supporting the Power11 processor (-mpower11 under Linux or -mpwr11 under AIX).
> 
> I have tested this patch on a little endian power10 system and a big endian
> power9 system using the latest binutils which includes support for power11.
> There were no regressions, and the 3 power11 tests added ran on both systems.
> Can I check this patch into GCC 15 when it opens up for general patches?
> 
> 2024-03-18  Michael Meissner  
> 
> gcc/testsuite/
> 
>   * gcc.target/powerpc/power11-1.c: New test.
>   * gcc.target/powerpc/power11-2.c: Likewise.
>   * gcc.target/powerpc/power11-3.c: Likewise.
>   * lib/target-supports.exp (check_effective_target_power11_ok): Add new
>   effective target.
> ---
>  gcc/testsuite/gcc.target/powerpc/power11-1.c | 13 +
>  gcc/testsuite/gcc.target/powerpc/power11-2.c | 20 
>  gcc/testsuite/gcc.target/powerpc/power11-3.c | 10 ++
>  gcc/testsuite/lib/target-supports.exp| 17 +
>  4 files changed, 60 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/power11-1.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/power11-2.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/power11-3.c
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/power11-1.c 
> b/gcc/testsuite/gcc.target/powerpc/power11-1.c
> new file mode 100644
> index 000..6a2e802eedf
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/power11-1.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile { target powerpc*-*-* } } */
> +/* { dg-require-effective-target power11_ok } */
> +/* { dg-options "-mdejagnu-cpu=power11 -O2" } */
> +
> +/* Basic check to see if the compiler supports -mcpu=power11.  */
> +
> +#ifndef _ARCH_PWR11
> +#error "-mcpu=power11 is not supported"
> +#endif
> +
> +void foo (void)
> +{
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/power11-2.c 
> b/gcc/testsuite/gcc.target/powerpc/power11-2.c
> new file mode 100644
> index 000..7b9904c1d29
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/power11-2.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile { target powerpc*-*-* } } */
> +/* { dg-require-effective-target power11_ok } */
> +/* { dg-options "-O2" } */
> +
> +/* Check if we can set the power11 target via a target attribute.  */
> +
> +__attribute__((__target__("cpu=power9")))
> +void foo_p9 (void)
> +{
> +}
> +
> +__attribute__((__target__("cpu=power10")))
> +void foo_p10 (void)
> +{
> +}
> +
> +__attribute__((__target__("cpu=power11")))
> +void foo_p11 (void)
> +{
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/power11-3.c 
> b/gcc/testsuite/gcc.target/powerpc/power11-3.c
> new file mode 100644
> index 000..9b2d643cc0f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/power11-3.c
> @@ -0,0 +1,10 @@
> +/* { dg-do compile { target powerpc*-*-* } }  */
> +/* { dg-require-effective-target power11_ok } */
> +/* { dg-options "-mdejagnu-cpu=power8 -O2" }  */
> +
> +/* Check if we can set the power11 target via a target_clones attribute.  */
> +
> +__attribute__((__target_clones__("cpu=power11,cpu=power9,default")))
> +void foo (void)
> +{
> +}
> diff --git a/gcc/testsuite/lib/target-supports.exp 
> b/gcc/testsuite/lib/target-supports.exp
> index 467b539b20d..be80494be80 100644
> --- a/gcc/testsuite/lib/target-supports.exp
> +++ b/gcc/testsuite/lib/target-supports.exp
> @@ -7104,6 +7104,23 @@ proc check_effective_target_power10_ok { } {
>  }
>  }
>  
> +# Return 1 if this is a PowerPC target supporting -mcpu=power11.
> +
> +proc check_effective_target_power11_ok { } {
> +if { ([istarget powerpc*-*-*]) } {
> + return [check_no_compiler_messages power11_ok object {
> + int main (void) {
> + #ifndef _ARCH_PWR11
> + #error "-mcpu=power11 is not supported"
> + #endif
> + return 0;
> + }
> + } "-mcpu=power11"]
> +} else {
> + return 0
> +}
> +}

Sorry that I didn't catch this before, but this effective target looks
useless: its users power11-[123].c are all compile-only tests, and the
compilation doesn't rely on assembler behavior.  power11-1.c already checks
for _ARCH_PWR11; maybe we want some cases with "dg-do assemble" to adopt this?
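
For example, a minimal sketch of such a test (the options and body are
illustrative only, not part of the posted series):

/* { dg-do assemble { target powerpc*-*-* } } */
/* { dg-require-effective-target power11_ok } */
/* { dg-options "-mdejagnu-cpu=power11 -O2" } */

/* Assembling is what can actually fail when the assembler doesn't
   support power11, so power11_ok guards something meaningful here.  */
void foo (void)
{
}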

btw, the other two sub-patches in this series look good to me; as I know this
series has been on Segher's TODO list, I'll leave the approvals to him.

BR,
Kewen



[PATCH] testsuite: Add profile_update_atomic check to gcov-20.c [PR114614]

2024-04-08 Thread Kewen.Lin
Hi,

As PR114614 shows, the newly added test case gcov-20.c from
commit r14-9789-g08a52331803f66 failed on targets which do
not support atomic profile update; there would be a message
like:

  warning: target does not support atomic profile update,
   single mode is selected

Since the test case adopts -fprofile-update=atomic, it
requires the effective target check profile_update_atomic;
this patch adds the check accordingly.

Tested well on x86_64-redhat-linux, powerpc64-linux-gnu P8/P9
and powerpc64le-linux-gnu P9/P10.

Is it ok for trunk?

BR,
Kewen
-
PR testsuite/114614

gcc/testsuite/ChangeLog:

* gcc.misc-tests/gcov-20.c: Add effective target check
profile_update_atomic.
---
 gcc/testsuite/gcc.misc-tests/gcov-20.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/testsuite/gcc.misc-tests/gcov-20.c 
b/gcc/testsuite/gcc.misc-tests/gcov-20.c
index 215faffc980..ca8c12aad2b 100644
--- a/gcc/testsuite/gcc.misc-tests/gcov-20.c
+++ b/gcc/testsuite/gcc.misc-tests/gcov-20.c
@@ -1,5 +1,6 @@
 /* { dg-options "-fcondition-coverage -ftest-coverage -fprofile-update=atomic" 
} */
 /* { dg-do run { target native } } */
+/* { dg-require-effective-target profile_update_atomic } */

 /* Some side effect to stop branches from being pruned */
 int x = 0;
--
2.43.0


[PATCH] rs6000: Fix wrong align passed to build_aligned_type [PR88309]

2024-04-08 Thread Kewen.Lin
Hi,

As the comments in PR88309 show, there are two oversights
in rs6000_gimple_fold_builtin that pass align in bytes to
build_aligned_type, which actually requires align in bits;
this causes an unexpected ICE or hanging in function
is_miss_rate_acceptable due to a zero align_unit value.

This patch fixes them by converting bytes to bits, adds an
assertion that align_unit is positive, and notes in the
function comment that build_aligned_type requires align
measured in bits.
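
For reference, a minimal sketch of the fixed call (the BITS_PER_UNIT
spelling is just an illustrative alternative; the actual patch hardcodes
the value 32):

  /* build_aligned_type takes the alignment in bits, so the required
     4-byte alignment must be passed as 4 * BITS_PER_UNIT, i.e. 32.  */
  tree align_ltype = build_aligned_type (lhs_type, 4 * BITS_PER_UNIT);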

Bootstrapped and regtested on x86_64-redhat-linux, 
powerpc64-linux-gnu P8/P9 and powerpc64le-linux-gnu P9 and P10.

Is it (the generic part code change) ok for trunk?

BR,
Kewen
-
PR target/88309

Co-authored-by: Andrew Pinski 

gcc/ChangeLog:

* config/rs6000/rs6000-builtin.cc (rs6000_gimple_fold_builtin): Fix
wrong align passed to function build_aligned_type.
* tree-ssa-loop-prefetch.cc (is_miss_rate_acceptable): Add an
assertion to ensure align_unit is positive.
* tree.cc (build_aligned_type): Update function comment.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/pr88309.c: New test.
---
 gcc/config/rs6000/rs6000-builtin.cc|  4 ++--
 gcc/testsuite/gcc.target/powerpc/pr88309.c | 27 ++
 gcc/tree-ssa-loop-prefetch.cc  |  2 ++
 gcc/tree.cc|  3 ++-
 4 files changed, 33 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr88309.c

diff --git a/gcc/config/rs6000/rs6000-builtin.cc 
b/gcc/config/rs6000/rs6000-builtin.cc
index 6698274031b..e7d6204074c 100644
--- a/gcc/config/rs6000/rs6000-builtin.cc
+++ b/gcc/config/rs6000/rs6000-builtin.cc
@@ -1900,7 +1900,7 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
tree lhs_type = TREE_TYPE (lhs);
/* In GIMPLE the type of the MEM_REF specifies the alignment.  The
  required alignment (power) is 4 bytes regardless of data type.  */
-   tree align_ltype = build_aligned_type (lhs_type, 4);
+   tree align_ltype = build_aligned_type (lhs_type, 32);
/* POINTER_PLUS_EXPR wants the offset to be of type 'sizetype'.  Create
   the tree using the value from arg0.  The resulting type will match
   the type of arg1.  */
@@ -1944,7 +1944,7 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
tree arg2_type = ptr_type_node;
/* In GIMPLE the type of the MEM_REF specifies the alignment.  The
   required alignment (power) is 4 bytes regardless of data type.  */
-   tree align_stype = build_aligned_type (arg0_type, 4);
+   tree align_stype = build_aligned_type (arg0_type, 32);
/* POINTER_PLUS_EXPR wants the offset to be of type 'sizetype'.  Create
   the tree using the value from arg1.  */
gimple_seq stmts = NULL;
diff --git a/gcc/testsuite/gcc.target/powerpc/pr88309.c 
b/gcc/testsuite/gcc.target/powerpc/pr88309.c
new file mode 100644
index 000..c0078cf2b8c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr88309.c
@@ -0,0 +1,27 @@
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-mvsx -O2 -fprefetch-loop-arrays" } */
+
+/* Verify there is no ICE or hanging.  */
+
+#include 
+
+void b(float *c, vector float a, vector float, vector float)
+{
+  vector float d;
+  vector char ahbc;
+  vec_xst(vec_perm(a, d, ahbc), 0, c);
+}
+
+vector float e(vector unsigned);
+
+void f() {
+  float *dst;
+  int g = 0;
+  for (;; g += 16) {
+vector unsigned m, i;
+vector unsigned n, j;
+vector unsigned k, l;
+b(dst + g * 3, e(m), e(n), e(k));
+b(dst + (g + 4) * 3, e(i), e(j), e(l));
+  }
+}
diff --git a/gcc/tree-ssa-loop-prefetch.cc b/gcc/tree-ssa-loop-prefetch.cc
index bbd98e03254..70073cc4fe4 100644
--- a/gcc/tree-ssa-loop-prefetch.cc
+++ b/gcc/tree-ssa-loop-prefetch.cc
@@ -739,6 +739,8 @@ is_miss_rate_acceptable (unsigned HOST_WIDE_INT 
cache_line_size,
   if (delta >= (HOST_WIDE_INT) cache_line_size)
 return false;

+  gcc_assert (align_unit > 0);
+
   miss_positions = 0;
   total_positions = (cache_line_size / align_unit) * distinct_iters;
   max_allowed_miss_positions = (ACCEPTABLE_MISS_RATE * total_positions) / 1000;
diff --git a/gcc/tree.cc b/gcc/tree.cc
index f801712c9dd..6f8400e6640 100644
--- a/gcc/tree.cc
+++ b/gcc/tree.cc
@@ -5689,7 +5689,8 @@ build_qualified_type (tree type, int type_quals 
MEM_STAT_DECL)
   return t;
 }

-/* Create a variant of type T with alignment ALIGN.  */
+/* Create a variant of type T with alignment ALIGN which
+   is measured in bits.  */

 tree
 build_aligned_type (tree type, unsigned int align)
--
2.43.0


Re: [PATCH] rs6000: Fix wrong align passed to build_aligned_type [PR88309]

2024-04-08 Thread Kewen.Lin
on 2024/4/8 18:47, Richard Biener wrote:
> On Mon, Apr 8, 2024 at 11:22 AM Kewen.Lin  wrote:
>>
>> Hi,
>>
>> As the comments in PR88309 show, there are two oversights
>> in rs6000_gimple_fold_builtin that pass align in bytes to
>> build_aligned_type but which actually requires align in
>> bits, it causes unexpected ICE or hanging in function
>> is_miss_rate_acceptable due to zero align_unit value.
>>
>> This patch is to fix them by converting bytes to bits, add
>> an assertion on positive align_unit value and notes function
>> build_aligned_type requires align measured in bits in its
>> function comment.
>>
>> Bootstrapped and regtested on x86_64-redhat-linux,
>> powerpc64-linux-gnu P8/P9 and powerpc64le-linux-gnu P9 and P10.
>>
>> Is it (the generic part code change) ok for trunk?
> 
> OK

Thanks, pushed as r14-9850.  Is it also OK to backport after some burn-in time?

BR,
Kewen

> 
>> BR,
>> Kewen
>> -
>> PR target/88309
>>
>> Co-authored-by: Andrew Pinski 
>>
>> gcc/ChangeLog:
>>
>> * config/rs6000/rs6000-builtin.cc (rs6000_gimple_fold_builtin): Fix
>> wrong align passed to function build_aligned_type.
>> * tree-ssa-loop-prefetch.cc (is_miss_rate_acceptable): Add an
>> assertion to ensure align_unit is positive.
>> * tree.cc (build_aligned_type): Update function comment.
>>
>> gcc/testsuite/ChangeLog:
>>
>> * gcc.target/powerpc/pr88309.c: New test.
>> ---
>>  gcc/config/rs6000/rs6000-builtin.cc|  4 ++--
>>  gcc/testsuite/gcc.target/powerpc/pr88309.c | 27 ++
>>  gcc/tree-ssa-loop-prefetch.cc  |  2 ++
>>  gcc/tree.cc|  3 ++-
>>  4 files changed, 33 insertions(+), 3 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr88309.c
>>
>> diff --git a/gcc/config/rs6000/rs6000-builtin.cc 
>> b/gcc/config/rs6000/rs6000-builtin.cc
>> index 6698274031b..e7d6204074c 100644
>> --- a/gcc/config/rs6000/rs6000-builtin.cc
>> +++ b/gcc/config/rs6000/rs6000-builtin.cc
>> @@ -1900,7 +1900,7 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
>> tree lhs_type = TREE_TYPE (lhs);
>> /* In GIMPLE the type of the MEM_REF specifies the alignment.  The
>>   required alignment (power) is 4 bytes regardless of data type.  */
>> -   tree align_ltype = build_aligned_type (lhs_type, 4);
>> +   tree align_ltype = build_aligned_type (lhs_type, 32);
>> /* POINTER_PLUS_EXPR wants the offset to be of type 'sizetype'.  
>> Create
>>the tree using the value from arg0.  The resulting type will match
>>the type of arg1.  */
>> @@ -1944,7 +1944,7 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
>> tree arg2_type = ptr_type_node;
>> /* In GIMPLE the type of the MEM_REF specifies the alignment.  The
>>required alignment (power) is 4 bytes regardless of data type.  */
>> -   tree align_stype = build_aligned_type (arg0_type, 4);
>> +   tree align_stype = build_aligned_type (arg0_type, 32);
>> /* POINTER_PLUS_EXPR wants the offset to be of type 'sizetype'.  
>> Create
>>the tree using the value from arg1.  */
>> gimple_seq stmts = NULL;
>> diff --git a/gcc/testsuite/gcc.target/powerpc/pr88309.c 
>> b/gcc/testsuite/gcc.target/powerpc/pr88309.c
>> new file mode 100644
>> index 000..c0078cf2b8c
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr88309.c
>> @@ -0,0 +1,27 @@
>> +/* { dg-require-effective-target powerpc_vsx_ok } */
>> +/* { dg-options "-mvsx -O2 -fprefetch-loop-arrays" } */
>> +
>> +/* Verify there is no ICE or hanging.  */
>> +
>> +#include 
>> +
>> +void b(float *c, vector float a, vector float, vector float)
>> +{
>> +  vector float d;
>> +  vector char ahbc;
>> +  vec_xst(vec_perm(a, d, ahbc), 0, c);
>> +}
>> +
>> +vector float e(vector unsigned);
>> +
>> +void f() {
>> +  float *dst;
>> +  int g = 0;
>> +  for (;; g += 16) {
>> +vector unsigned m, i;
>> +vector unsigned n, j;
>> +vector unsigned k, l;
>> +b(dst + g * 3, e(m), e(n), e(k));
>> +b(dst + (g + 4) * 3, e(i), e(j), e(l));
>> +  }
>> +}
>> diff --git a/gcc/tree-ssa-loop-prefetch.cc b/gcc/tree-ssa-loop-prefetch.cc
>> index bbd98e03254..70073cc4fe4 100644
>> --- a/gcc/tree-ssa-loop-

Re: [PATCH] testsuite: Add profile_update_atomic check to gcov-20.c [PR114614]

2024-04-08 Thread Kewen.Lin
on 2024/4/8 18:47, Richard Biener wrote:
> On Mon, Apr 8, 2024 at 11:23 AM Kewen.Lin  wrote:
>>
>> Hi,
>>
>> As PR114614 shows, the newly added test case gcov-20.c by
>> commit r14-9789-g08a52331803f66 failed on targets which do
>> not support atomic profile update, there would be a message
>> like:
>>
>>   warning: target does not support atomic profile update,
>>single mode is selected
>>
>> Since the test case adopts -fprofile-update=atomic, it
>> requires effective target check profile_update_atomic, this
>> patch is to add the check accordingly.
>>
>> Tested well on x86_64-redhat-linux, powerpc64-linux-gnu P8/P9
>> and powerpc64le-linux-gnu P9/P10.
>>
>> Is it ok for trunk?
> 
> OK

Thanks, pushed as r14-9851.

BR,
Kewen

> 
>> BR,
>> Kewen
>> -
>> PR testsuite/114614
>>
>> gcc/testsuite/ChangeLog:
>>
>> * gcc.misc-tests/gcov-20.c: Add effective target check
>> profile_update_atomic.
>> ---
>>  gcc/testsuite/gcc.misc-tests/gcov-20.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/gcc/testsuite/gcc.misc-tests/gcov-20.c 
>> b/gcc/testsuite/gcc.misc-tests/gcov-20.c
>> index 215faffc980..ca8c12aad2b 100644
>> --- a/gcc/testsuite/gcc.misc-tests/gcov-20.c
>> +++ b/gcc/testsuite/gcc.misc-tests/gcov-20.c
>> @@ -1,5 +1,6 @@
>>  /* { dg-options "-fcondition-coverage -ftest-coverage 
>> -fprofile-update=atomic" } */
>>  /* { dg-do run { target native } } */
>> +/* { dg-require-effective-target profile_update_atomic } */
>>
>>  /* Some side effect to stop branches from being pruned */
>>  int x = 0;
>> --
>> 2.43.0



Re: [PATCH] rs6000: Replace OPTION_MASK_DIRECT_MOVE with OPTION_MASK_P8_VECTOR [PR101865]

2024-04-08 Thread Kewen.Lin
Hi Peter,

on 2024/4/8 21:21, Peter Bergner wrote:
> On 4/8/24 3:55 AM, Kewen.Lin wrote:
>> on 2024/4/6 06:28, Peter Bergner wrote:
>>> +mno-direct-move
>>> +Target Undocumented WarnRemoved
>>> +
>>>  mdirect-move
>>> -Target Undocumented Mask(DIRECT_MOVE) Var(rs6000_isa_flags) WarnRemoved
>>> +Target Undocumented WarnRemoved
>>
>> When reviewing my previous patch to "neuter option -mpower{8,9}-vector",
>> Segher mentioned that we don't need to keep such option warning all the
>> time and can drop it like in a release later as users should be aware of
>> this information then, I agreed and considering that patch disabling
>> -m[no-]direct-move was r8-7845-g57f108f5a1e1b2, I think we can just remove
>> m[no-]direct-move here?  What do you think?
> 
> 
> I'm fine with that if that is what we want.  So something like the following?
> 
> +;; This option existed in the past, but now is always silently ignored.
> mdirect-move
> -Target Undocumented Mask(DIRECT_MOVE) Var(rs6000_isa_flags) WarnRemoved
> +Target Undocumented Ignore

I prefer to remove it completely, that is:

> -mdirect-move
> -Target Undocumented Mask(DIRECT_MOVE) Var(rs6000_isa_flags) WarnRemoved

Is the reason you still kept it to preserve a historical record here?

Segher pointed out to me that this kind of complete option removal should be
stage 1 stuff, so let's defer it to a separate patch in the next release
(including some other options like mfpgpr you showed below etc.). :)

For the original patch,

> +mno-direct-move
> +Target Undocumented WarnRemoved

s/WarnRemoved/Ignore/ to match some other existing practice; there is no
warning now when specifying -mno-direct-move, and it would be good to keep
that behavior the same for users.

OK for trunk and active branches with this tweaked, thanks!

> 
> 
> The above seems to silently ignore both -mdirect-move and -mno-direct-move
> which I think is what we want.  That said, it's not what we've done with
> other options, but maybe those just need to be changed too?

Yes, I think they need to be changed too (next release).

BR,
Kewen



Re: [PATCH] rs6000: Replace OPTION_MASK_DIRECT_MOVE with OPTION_MASK_P8_VECTOR [PR101865]

2024-04-08 Thread Kewen.Lin
on 2024/4/9 11:20, Peter Bergner wrote:
> On 4/8/24 9:37 PM, Kewen.Lin wrote:
>> on 2024/4/8 21:21, Peter Bergner wrote:
>> I prefer to remove it completely, that is:
>>
>>> -mdirect-move
>>> -Target Undocumented Mask(DIRECT_MOVE) Var(rs6000_isa_flags) WarnRemoved
>>
>> The reason why you still kept it is to keep a historical record here?
> 
> I believe we've never completely removed an option before.  I think the

By checking the history, we did remove some options completely, e.g. those
for SPE, paired single and xilinx-fpu, which can be taken as going away with
the feature removal, but also -maltivec={le,be} and -misel={yes,no}.

> thought was, if some software package explicitly used the option, then
> they shouldn't see an 'unrecognized command-line option' error, but
> rather either a warning that the option was removed or just silently
> ignore it.  Ie, we don't want to make a package that used to build with
> an old compiler now break its build because the option doesn't exist
> anymore.

I understand, but an argument is that emitting no errors (even no warnings)
can imply an option still takes effect, which easily causes misunderstanding.
For the release in which we remove the support of an option, we can still
mark it as WarnRemoved, but after a release or so users should be aware of
the change and have modified their build scripts if needed; it's better to
emit errors for them then, to avoid the false appearance that the option is
still supported.

> 
>> Segher pointed out to me that this kind of option complete removal should be
>> stage 1 stuff, so let's defer to make it in a separated patch next release
>> (including some other options like mfpgpr you showed below etc.). :)
> 
> If we're going to completely remove it, then for sure, it's a stage1 thing.
> I'd like to hear Segher's thoughts on whether we should completely remove
> it or just silently ignore it.
> 
> 
> 
>> For the original patch,
>>
>>> +mno-direct-move
>>> +Target Undocumented WarnRemoved
>>
>> s/WarnRemoved/Ignore/ to match some other existing practice, there is no
>> warning now if specifying -mno-direct-move and it would be good to keep
>> the same behavior for users.
> 
> If we want to silently ignore -mdirect-move and -mno-direct-move, then we
> just need to do:
> 
> mdirect-move
> -Target Undocumented Mask(DIRECT_MOVE) Var(rs6000_isa_flags) WarnRemoved
> +Target Undocumented Ignore
> 

Since removing it completely is a stage1 thing, I prefer to keep the
-mdirect-move and -mno-direct-move handlings as before: WarnRemoved and
Ignore respectively.

> There's no need to mention -mno-direct-move at all then.  It was only in the
> case I thought we wanted to warn against it's use that I added 
> -mno-direct-move.
> 
> 

Not mentioning it is fine too; just keep the handlings and defer the removal
to stage 1. :)

BR,
Kewen



[PATCH] testsuite: Adjust pr113359-2_*.c with unsigned long long [PR114662]

2024-04-09 Thread Kewen.Lin
Hi,

pr113359-2_*.c define structs with unsigned long type
members ay and az, which are 4 bytes in size at -m32, while
the related constants CL1 and CL2 used for the equality
checks are always 8 bytes; this makes the compiler consider
the below

  69   if (a.ay != CL1)
  70 __builtin_abort ();

to always abort, and optimize away the following call to
getb, which causes the expected wpa dump of
"Semantic equality" to go missing.

This patch changes the types to unsigned long long
accordingly.  Tested well on powerpc64-linux-gnu.

Is it ok for trunk?

BR,
Kewen
-
PR testsuite/114662

gcc/testsuite/ChangeLog:

* gcc.dg/lto/pr113359-2_0.c: Use unsigned long long instead of
unsigned long.
* gcc.dg/lto/pr113359-2_1.c: Likewise.
---
 gcc/testsuite/gcc.dg/lto/pr113359-2_0.c | 8 
 gcc/testsuite/gcc.dg/lto/pr113359-2_1.c | 8 
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/lto/pr113359-2_0.c 
b/gcc/testsuite/gcc.dg/lto/pr113359-2_0.c
index 8b2d5bdfab2..8495667599d 100644
--- a/gcc/testsuite/gcc.dg/lto/pr113359-2_0.c
+++ b/gcc/testsuite/gcc.dg/lto/pr113359-2_0.c
@@ -8,15 +8,15 @@
 struct SA
 {
   unsigned int ax;
-  unsigned long ay;
-  unsigned long az;
+  unsigned long long ay;
+  unsigned long long az;
 };

 struct SB
 {
   unsigned int bx;
-  unsigned long by;
-  unsigned long bz;
+  unsigned long long by;
+  unsigned long long bz;
 };

 struct ZA
diff --git a/gcc/testsuite/gcc.dg/lto/pr113359-2_1.c 
b/gcc/testsuite/gcc.dg/lto/pr113359-2_1.c
index 61bc0547981..8320f347efe 100644
--- a/gcc/testsuite/gcc.dg/lto/pr113359-2_1.c
+++ b/gcc/testsuite/gcc.dg/lto/pr113359-2_1.c
@@ -5,15 +5,15 @@
 struct SA
 {
   unsigned int ax;
-  unsigned long ay;
-  unsigned long az;
+  unsigned long long ay;
+  unsigned long long az;
 };

 struct SB
 {
   unsigned int bx;
-  unsigned long by;
-  unsigned long bz;
+  unsigned long long by;
+  unsigned long long bz;
 };

 struct ZA
--
2.43.0


Re: [PATCH] testsuite: Adjust pr113359-2_*.c with unsigned long long [PR114662]

2024-04-10 Thread Kewen.Lin
on 2024/4/10 15:11, Richard Biener wrote:
> On Wed, Apr 10, 2024 at 8:24 AM Kewen.Lin  wrote:
>>
>> Hi,
>>
>> pr113359-2_*.c define a struct having unsigned long type
>> members ay and az which have 4 bytes size at -m32, while
>> the related constants CL1 and CL2 used for equality check
>> are always 8 bytes, it makes compiler consider the below
>>
>>   69   if (a.ay != CL1)
>>   70 __builtin_abort ();
>>
>> always to abort and optimize away the following call to
>> getb, which leads to the expected wpa dumping on
>> "Semantic equality" missing.
>>
>> This patch is to modify the types with unsigned long long
>> accordingly.  Tested well on powerpc64-linux-gnu.
>>
>> Is it ok for trunk?
> 
> OK

Thanks!  Pushed as r14-9886.

BR,
Kewen

> 
>> BR,
>> Kewen
>> -
>> PR testsuite/114662
>>
>> gcc/testsuite/ChangeLog:
>>
>> * gcc.dg/lto/pr113359-2_0.c: Use unsigned long long instead of
>> unsigned long.
>> * gcc.dg/lto/pr113359-2_1.c: Likewise.
>> ---
>>  gcc/testsuite/gcc.dg/lto/pr113359-2_0.c | 8 
>>  gcc/testsuite/gcc.dg/lto/pr113359-2_1.c | 8 
>>  2 files changed, 8 insertions(+), 8 deletions(-)
>>
>> diff --git a/gcc/testsuite/gcc.dg/lto/pr113359-2_0.c 
>> b/gcc/testsuite/gcc.dg/lto/pr113359-2_0.c
>> index 8b2d5bdfab2..8495667599d 100644
>> --- a/gcc/testsuite/gcc.dg/lto/pr113359-2_0.c
>> +++ b/gcc/testsuite/gcc.dg/lto/pr113359-2_0.c
>> @@ -8,15 +8,15 @@
>>  struct SA
>>  {
>>unsigned int ax;
>> -  unsigned long ay;
>> -  unsigned long az;
>> +  unsigned long long ay;
>> +  unsigned long long az;
>>  };
>>
>>  struct SB
>>  {
>>unsigned int bx;
>> -  unsigned long by;
>> -  unsigned long bz;
>> +  unsigned long long by;
>> +  unsigned long long bz;
>>  };
>>
>>  struct ZA
>> diff --git a/gcc/testsuite/gcc.dg/lto/pr113359-2_1.c 
>> b/gcc/testsuite/gcc.dg/lto/pr113359-2_1.c
>> index 61bc0547981..8320f347efe 100644
>> --- a/gcc/testsuite/gcc.dg/lto/pr113359-2_1.c
>> +++ b/gcc/testsuite/gcc.dg/lto/pr113359-2_1.c
>> @@ -5,15 +5,15 @@
>>  struct SA
>>  {
>>unsigned int ax;
>> -  unsigned long ay;
>> -  unsigned long az;
>> +  unsigned long long ay;
>> +  unsigned long long az;
>>  };
>>
>>  struct SB
>>  {
>>unsigned int bx;
>> -  unsigned long by;
>> -  unsigned long bz;
>> +  unsigned long long by;
>> +  unsigned long long bz;
>>  };
>>
>>  struct ZA
>> --
>> 2.43.0



Re: Repost [PATCH 4/6] PowerPC: Make MMA insns support DMR registers.

2024-02-03 Thread Kewen.Lin
Hi Mike,

on 2024/1/6 07:39, Michael Meissner wrote:
> This patch changes the MMA instructions to use either FPR registers
> (-mcpu=power10) or DMRs (-mcpu=future).  In this patch, the existing MMA
> instruction names are used.
> 
> A macro (__PPC_DMR__) is defined if the MMA instructions use the DMRs.
> 
> The patches have been tested on both little and big endian systems.  Can I 
> check
> it into the master branch?
> 
> 2024-01-05   Michael Meissner  
> 
> gcc/
> 
>   * config/rs6000/mma.md (mma_): New define_expand to handle
>   mma_ for dense math and non dense math.
>   (mma_ insn): Restrict to non dense math.
>   (mma_xxsetaccz): Convert to define_expand to handle non dense math and
>   dense math.
>   (mma_xxsetaccz_vsx): Rename from mma_xxsetaccz and restrict usage to non
>   dense math.
>   (mma_xxsetaccz_dm): Dense math version of mma_xxsetaccz.
>   (mma_): Add support for dense math.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   * config/rs6000/rs6000-c.cc (rs6000_target_modify_macros): Define
>   __PPC_DMR__ if we have dense math instructions.
>   * config/rs6000/rs6000.cc (print_operand): Make %A handle only DMRs if
>   dense math and only FPRs if not dense math.
>   (rs6000_split_multireg_move): Do not generate the xxmtacc instruction to
>   prime the DMR registers or the xxmfacc instruction to de-prime
>   instructions if we have dense math register support.
> ---
>  gcc/config/rs6000/mma.md  | 247 +-
>  gcc/config/rs6000/rs6000-c.cc |   3 +
>  gcc/config/rs6000/rs6000.cc   |  35 ++---
>  3 files changed, 176 insertions(+), 109 deletions(-)
> 
> diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
> index bb898919ab5..525a85146ff 100644
> --- a/gcc/config/rs6000/mma.md
> +++ b/gcc/config/rs6000/mma.md
> @@ -559,190 +559,249 @@ (define_insn "*mma_disassemble_acc_dm"
>"dmxxextfdmr256 %0,%1,2"
>[(set_attr "type" "mma")])
>  
> -(define_insn "mma_"
> +;; MMA instructions that do not use their accumulators as an input, still 
> must
> +;; not allow their vector operands to overlap the registers used by the
> +;; accumulator.  We enforce this by marking the output as early clobber.  If 
> we
> +;; have dense math, we don't need the whole prime/de-prime action, so just 
> make
> +;; thse instructions be NOPs.

typo: thse.

> +
> +(define_expand "mma_"
> +  [(set (match_operand:XO 0 "register_operand")
> + (unspec:XO [(match_operand:XO 1 "register_operand")]

s/register_operand/accumulator_operand/?

> +MMA_ACC))]
> +  "TARGET_MMA"
> +{
> +  if (TARGET_DENSE_MATH)
> +{
> +  if (!rtx_equal_p (operands[0], operands[1]))
> + emit_move_insn (operands[0], operands[1]);
> +  DONE;
> +}
> +
> +  /* Generate the prime/de-prime code.  */
> +})
> +
> +(define_insn "*mma_"

May be better to name with "*mma__nodm"?

>[(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
>   (unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")]
>   MMA_ACC))]
> -  "TARGET_MMA"
> +  "TARGET_MMA && !TARGET_DENSE_MATH"

I found that "TARGET_MMA && !TARGET_DENSE_MATH" is used a lot (e.g. in the
changes to function rs6000_split_multireg_move in this patch and in some
places in the previous patches); maybe we can introduce a macro named
TARGET_MMA_NODM as shorthand for it?
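
Something like the following sketch (the placement next to the other
TARGET_* helpers is just my assumption):

  /* MMA without dense math, i.e. the original power10 code paths.  */
  #define TARGET_MMA_NODM (TARGET_MMA && !TARGET_DENSE_MATH)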

>" %A0"
>[(set_attr "type" "mma")])
>  
>  ;; We can't have integer constants in XOmode so we wrap this in an
> -;; UNSPEC_VOLATILE.
> +;; UNSPEC_VOLATILE for the non-dense math case.  For dense math, we don't 
> need
> +;; to disable optimization and we can do a normal UNSPEC.
>  
> -(define_insn "mma_xxsetaccz"
> -  [(set (match_operand:XO 0 "fpr_reg_operand" "=d")
> +(define_expand "mma_xxsetaccz"
> +  [(set (match_operand:XO 0 "register_operand")

s/register_operand/accumulator_operand/?

>   (unspec_volatile:XO [(const_int 0)]
>   UNSPECV_MMA_XXSETACCZ))]
>"TARGET_MMA"
> +{
> +  if (TARGET_DENSE_MATH)
> +{
> +  emit_insn (gen_mma_xxsetaccz_dm (operands[0]));
> +  DONE;
> +}
> +})
> +
> +(define_insn "*mma_xxsetaccz_vsx"

s/vsx/nodm/

> +  [(set (match_operand:XO 0 "fpr_reg_operand" "=d")
> + (unspec_volatile:XO [(const_int 0)]
> + UNSPECV_MMA_XXSETACCZ))]
> +  "TARGET_MMA && !TARGET_DENSE_MATH"
>"xxsetaccz %A0"
>[(set_attr "type" "mma")])
>  
> +
> +(define_insn "mma_xxsetaccz_dm"
> +  [(set (match_operand:XO 0 "dmr_operand" "=wD")
> + (unspec:XO [(const_int 0)]
> +UNSPECV_MMA_XXSETACCZ))]
> +  "TARGET_DENSE_MATH"
> +  "dmsetdmrz %0"
> +  [(set_attr "type" "mma")])
> +
>  (define_insn "mma_"
> - 

Re: Repost [PATCH 5/6] PowerPC: Switch to dense math names for all MMA operations.

2024-02-03 Thread Kewen.Lin
Hi Mike,

on 2024/1/6 07:40, Michael Meissner wrote:
> This patch changes the assembler instruction names for MMA instructions from
> the original name used in power10 to the new name when used with the dense 
> math
> system.  I.e. xvf64gerpp becomes dmxvf64gerpp.  The assembler will emit the
> same bits for either spelling.
> 
> The patches have been tested on both little and big endian systems.  Can I 
> check
> it into the master branch?
> 
> 2024-01-05   Michael Meissner  
> 
> gcc/
> 
>   * config/rs6000/mma.md (vvi4i4i8_dm): New int attribute.
>   (avvi4i4i8_dm): Likewise.
>   (vvi4i4i2_dm): Likewise.
>   (avvi4i4i2_dm): Likewise.
>   (vvi4i4_dm): Likewise.
>   (avvi4i4_dm): Likewise.
>   (pvi4i2_dm): Likewise.
>   (apvi4i2_dm): Likewise.
>   (vvi4i4i4_dm): Likewise.
>   (avvi4i4i4_dm): Likewise.
>   (mma_): Add support for running on DMF systems, generating the dense
>   math instruction and using the dense math accumulators.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
>   (mma_): Likewise.
> 
> gcc/testsuite/
> 
>   * gcc.target/powerpc/dm-double-test.c: New test.
>   * lib/target-supports.exp (check_effective_target_ppc_dmr_ok): New
>   target test.
> ---
>  gcc/config/rs6000/mma.md  |  98 +++--
>  .../gcc.target/powerpc/dm-double-test.c   | 194 ++
>  gcc/testsuite/lib/target-supports.exp |  19 ++
>  3 files changed, 299 insertions(+), 12 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/dm-double-test.c
> 
> diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
> index 525a85146ff..f06e6bbb184 100644
> --- a/gcc/config/rs6000/mma.md
> +++ b/gcc/config/rs6000/mma.md
> @@ -227,13 +227,22 @@ (define_int_attr apv[(UNSPEC_MMA_XVF64GERPP 
> "xvf64gerpp")
>  
>  (define_int_attr vvi4i4i8[(UNSPEC_MMA_PMXVI4GER8 "pmxvi4ger8")])
>  
> +(define_int_attr vvi4i4i8_dm [(UNSPEC_MMA_PMXVI4GER8 
> "pmdmxvi4ger8")])

Can we update vvi4i4i8 to

(define_int_attr vvi4i4i8   [(UNSPEC_MMA_PMXVI4GER8 "xvi4ger8")])

to avoid introducing vvi4i4i8_dm?  Then its use places would look like:

-  " %A0,%x1,%x2,%3,%4,%5"
+  "@
+   pmdm %A0,%x1,%x2,%3,%4,%5
+   pm %A0,%x1,%x2,%3,%4,%5
+   pm %A0,%x1,%x2,%3,%4,%5"

and 

- define_insn "mma_"
+ define_insn "mma_pm"

(or update its use in the corresponding bif expander field)

?  

This comment also applies to the other iterator changes.

> +
>  (define_int_attr avvi4i4i8   [(UNSPEC_MMA_PMXVI4GER8PP   
> "pmxvi4ger8pp")])
>  
> +(define_int_attr avvi4i4i8_dm[(UNSPEC_MMA_PMXVI4GER8PP   
> "pmdmxvi4ger8pp")])
> +
>  (define_int_attr vvi4i4i2[(UNSPEC_MMA_PMXVI16GER2"pmxvi16ger2")
>(UNSPEC_MMA_PMXVI16GER2S   "pmxvi16ger2s")
>(UNSPEC_MMA_PMXVF16GER2"pmxvf16ger2")
>(UNSPEC_MMA_PMXVBF16GER2   
> "pmxvbf16ger2")])
>  
> +(define_int_attr vvi4i4i2_dm [(UNSPEC_MMA_PMXVI16GER2"pmdmxvi16ger2")
> +  (UNSPEC_MMA_PMXVI16GER2S   
> "pmdmxvi16ger2s")
> +  (UNSPEC_MMA_PMXVF16GER2"pmdmxvf16ger2")
> +  (UNSPEC_MMA_PMXVBF16GER2   
> "pmdmxvbf16ger2")])
> +
>  (define_int_attr avvi4i4i2   [(UNSPEC_MMA_PMXVI16GER2PP  "pmxvi16ger2pp")
>(UNSPEC_MMA_PMXVI16GER2SPP 
> "pmxvi16ger2spp")
>(UNSPEC_MMA_PMXVF16GER2PP  "pmxvf16ger2pp")
> @@ -245,25 +254,54 @@ (define_int_attr avvi4i4i2  
> [(UNSPEC_MMA_PMXVI16GER2PP  "pmxvi16ger2pp")
>(UNSPEC_MMA_PMXVBF16GER2NP 
> "pmxvbf16ger2np")
>(UNSPEC_MMA_PMXVBF16GER2NN 
> "pmxvbf16ger2nn")])
>  
> +(define_int_attr avvi4i4i2_dm[(UNSPEC_MMA_PMXVI16GER2PP  
> "pmdmxvi16ger2pp")
> +  (UNSPEC_MMA_PMXVI16GER2SPP 
> "pmdmxvi16ger2spp")
> +  (UNSPEC_MMA_PMXVF16GER2PP  
> "pmdmxvf16ger2pp")
> +  (UNSPEC_MMA_PMXVF16GER2PN  
> "pmdmxvf16ger2pn")
> +  (UNSPEC_MMA_PMXVF16GER2NP  
> "pmdmxvf16ger2np")
> +  (UNSPEC_MMA_PMXVF16GER2NN  
> "pmdmxvf16ger2nn")
> +  (UNSPEC_MMA_PMXVBF16GER2PP 
> "pmdmxvbf16ger2pp")
> +  (UNSPEC_MMA_PMXVBF16GER2PN 
> "pmdmxvbf16ger2pn")
> +  (UNSPEC_MMA_PMXVBF16GER2NP 
> "pmdmxvbf16ger2np")
> +  (UNSPEC_MMA_PMXVBF16GER2NN 
> "pmdmxvbf16g

Re: Repost [PATCH 6/6] PowerPC: Add support for 1,024 bit DMR registers.

2024-02-04 Thread Kewen.Lin
Hi Mike,

on 2024/1/6 07:42, Michael Meissner wrote:
> This patch is a prelimianry patch to add the full 1,024 bit dense math 
> register> (DMRs) for -mcpu=future.  The MMA 512-bit accumulators map onto the 
> top of the
> DMR register.
> 
> This patch only adds the new 1,024 bit register support.  It does not add
> support for any instructions that need 1,024 bit registers instead of 512 bit
> registers.
> 
> I used the new mode 'TDOmode' to be the opaque mode used for 1,204 bit

typo: 1,204

> registers.  The 'wD' constraint added in previous patches is used for these
> registers.  I added support to do load and store of DMRs via the VSX 
> registers,
> since there are no load/store dense math instructions.  I added the new 
> keyword
> '__dmr' to create 1,024 bit types that can be loaded into DMRs.  At present, I
> don't have aliases for __dmr512 and __dmr1024 that we've discussed internally.
> 
> The patches have been tested on both little and big endian systems.  Can I 
> check
> it into the master branch?
> 
> 2024-01-05   Michael Meissner  
> 
> gcc/
> 
>   * config/rs6000/mma.md (UNSPEC_DM_INSERT512_UPPER): New unspec.
>   (UNSPEC_DM_INSERT512_LOWER): Likewise.
>   (UNSPEC_DM_EXTRACT512): Likewise.
>   (UNSPEC_DMR_RELOAD_FROM_MEMORY): Likewise.
>   (UNSPEC_DMR_RELOAD_TO_MEMORY): Likewise.
>   (movtdo): New define_expand and define_insn_and_split to implement 1,024
>   bit DMR registers.
>   (movtdo_insert512_upper): New insn.
>   (movtdo_insert512_lower): Likewise.
>   (movtdo_extract512): Likewise.
>   (reload_dmr_from_memory): Likewise.
>   (reload_dmr_to_memory): Likewise.
>   * config/rs6000/rs6000-builtin.cc (rs6000_type_string): Add DMR
>   support.
>   (rs6000_init_builtins): Add support for __dmr keyword.
>   * config/rs6000/rs6000-call.cc (rs6000_return_in_memory): Add support
>   for TDOmode.
>   (rs6000_function_arg): Likewise.
>   * config/rs6000/rs6000-modes.def (TDOmode): New mode.
>   * config/rs6000/rs6000.cc (rs6000_hard_regno_nregs_internal): Add
>   support for TDOmode.
>   (rs6000_hard_regno_mode_ok_uncached): Likewise.
>   (rs6000_hard_regno_mode_ok): Likewise.
>   (rs6000_modes_tieable_p): Likewise.
>   (rs6000_debug_reg_global): Likewise.
>   (rs6000_setup_reg_addr_masks): Likewise.
>   (rs6000_init_hard_regno_mode_ok): Add support for TDOmode.  Setup reload
>   hooks for DMR mode.
>   (reg_offset_addressing_ok_p): Add support for TDOmode.
>   (rs6000_emit_move): Likewise.
>   (rs6000_secondary_reload_simple_move): Likewise.
>   (rs6000_secondary_reload_class): Likewise.
>   (rs6000_mangle_type): Add mangling for __dmr type.
>   (rs6000_dmr_register_move_cost): Add support for TDOmode.
>   (rs6000_split_multireg_move): Likewise.
>   (rs6000_invalid_conversion): Likewise.
>   * config/rs6000/rs6000.h (VECTOR_ALIGNMENT_P): Add TDOmode.
>   (enum rs6000_builtin_type_index): Add DMR type nodes.
>   (dmr_type_node): Likewise.
>   (ptr_dmr_type_node): Likewise.
> 
> gcc/testsuite/
> 
>   * gcc.target/powerpc/dm-1024bit.c: New test.
> ---
>  gcc/config/rs6000/mma.md  | 152 ++
>  gcc/config/rs6000/rs6000-builtin.cc   |  13 ++
>  gcc/config/rs6000/rs6000-call.cc  |  13 +-
>  gcc/config/rs6000/rs6000-modes.def|   4 +
>  gcc/config/rs6000/rs6000.cc   | 135 
>  gcc/config/rs6000/rs6000.h|   7 +-
>  gcc/testsuite/gcc.target/powerpc/dm-1024bit.c |  63 
>  7 files changed, 351 insertions(+), 36 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/dm-1024bit.c
> 
> diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
> index f06e6bbb184..37de9030903 100644
> --- a/gcc/config/rs6000/mma.md
> +++ b/gcc/config/rs6000/mma.md
> @@ -92,6 +92,11 @@ (define_c_enum "unspec"
> UNSPEC_MMA_XXMFACC
> UNSPEC_MMA_XXMTACC
> UNSPEC_DM_ASSEMBLE_ACC
> +   UNSPEC_DM_INSERT512_UPPER
> +   UNSPEC_DM_INSERT512_LOWER
> +   UNSPEC_DM_EXTRACT512
> +   UNSPEC_DMR_RELOAD_FROM_MEMORY
> +   UNSPEC_DMR_RELOAD_TO_MEMORY
>])
>  
>  (define_c_enum "unspecv"
> @@ -879,3 +884,150 @@ (define_insn "mma_"
>[(set_attr "type" "mma")
> (set_attr "prefixed" "yes")
> (set_attr "isa" "dm,not_dm,not_dm")])
> +
> +
> +;; TDOmode (i.e. __dmr).
> +(define_expand "movtdo"
> +  [(set (match_operand:TDO 0 "nonimmediate_operand")
> + (match_operand:TDO 1 "input_operand"))]
> +  "TARGET_DENSE_MATH"
> +{
> +  rs6000_emit_move (operands[0], operands[1], TDOmode);
> +  DONE;
> +})
> +
> +(define_insn_and_split "*movtdo"
> +  [(set (match_operand:TDO 0 "nonimmediate_operand" "=wa,m,wa,wD,wD,wa")
> + (match_operand:TDO 1 "input_operand" "m,wa,wa,wa,wD,wD"))]
> +  "TARGET_DENSE_MATH
> +   && (gpc_reg_operand (operands[0], TDOmode)
> +   || gpc_reg_operand (operands[1], TDOmode))"
> +  "@
> +   #
>

Re: [PATCH v2] rs6000: Rework option -mpowerpc64 handling [PR106680]

2024-02-05 Thread Kewen.Lin
Hi Sebastian,

on 2024/2/5 18:38, Sebastian Huber wrote:
> Hello,
> 
> On 27.12.22 11:16, Kewen.Lin via Gcc-patches wrote:
>> Hi Segher,
>>
>> on 2022/12/24 04:26, Segher Boessenkool wrote:
>>> Hi!
>>>
>>> On Wed, Oct 12, 2022 at 04:12:21PM +0800, Kewen.Lin wrote:
>>>> PR106680 shows that -m32 -mpowerpc64 is different from
>>>> -mpowerpc64 -m32, this is determined by the way how we
>>>> handle option powerpc64 in rs6000_handle_option.
>>>>
>>>> Segher pointed out this difference should be taken as
>>>> a bug and we should ensure that option powerpc64 is
>>>> independent of -m32/-m64.  So this patch removes the
>>>> handlings in rs6000_handle_option and add some necessary
>>>> supports in rs6000_option_override_internal instead.
>>>
>>> Sorry for the late review.
>>>
>>>> +  /* Don't expect powerpc64 enabled on those OSes with 
>>>> OS_MISSING_POWERPC64,
>>>> + since they don't support saving the high part of 64-bit registers on
>>>> + context switch.  If the user explicitly specifies it, we won't 
>>>> interfere
>>>> + with the user's specification.  */
>>>
>>> It depends on the OS, and what you call "context switch".  For example
>>> on Linux the context switches done by the kernel are fine, only things
>>> done by setjmp/longjmp and getcontext/setcontext are not.  So just be a
>>> bit more vague here?  "Since they do not save and restore the high half
>>> of the GPRs correctly in all cases", something like that?
>>>
>>> Okay for trunk like that.  Thanks!
>>>
>>
>> Thanks!  Adjusted as you suggested and committed in r13-4894-gacc727cf02a144.
> 
> I am a bit late, however, this broke the 32-bit support for -mcpu=e6500. For 
> RTEMS, I have the following multilibs:
> 
> MULTILIB_REQUIRED += mcpu=e6500/m32
> MULTILIB_REQUIRED += mcpu=e6500/m32/mvrsave
> MULTILIB_REQUIRED += mcpu=e6500/m32/msoft-float/mno-altivec
> MULTILIB_REQUIRED += mcpu=e6500/m64
> MULTILIB_REQUIRED += mcpu=e6500/m64/mvrsave
> 
> I configured GCC as a bi-arch compiler (32-bit and 64-bit). It seems you 
> removed the -m32 handling, so I am not sure how to approach this issue. I 
> added a test case to the PR:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106680

Thanks for reporting, I'll have a look at it (but I'm starting vacation, so
responses may be slow).

I'm not sure what has happened in bugzilla recently, but I didn't receive any
mail notifications on your comments #c5 and #c6 (sorry for the late response).
Since PR106680 is in the resolved state, maybe it's good to file a new one for
further tracking. :)

BR,
Kewen



Re: Repost [PATCH 1/6] Add -mcpu=future

2024-02-07 Thread Kewen.Lin
on 2024/2/6 14:01, Michael Meissner wrote:
> On Tue, Jan 23, 2024 at 04:44:32PM +0800, Kewen.Lin wrote:
...
>>> diff --git a/gcc/config/rs6000/rs6000-opts.h 
>>> b/gcc/config/rs6000/rs6000-opts.h
>>> index 33fd0efc936..25890ae3034 100644
>>> --- a/gcc/config/rs6000/rs6000-opts.h
>>> +++ b/gcc/config/rs6000/rs6000-opts.h
>>> @@ -67,7 +67,9 @@ enum processor_type
>>> PROCESSOR_MPCCORE,
>>> PROCESSOR_CELL,
>>> PROCESSOR_PPCA2,
>>> -   PROCESSOR_TITAN
>>> +   PROCESSOR_TITAN,
>>> +
>>
>> Nit: unintentional empty line?
>>
>>> +   PROCESSOR_FUTURE
>>>  };
> 
> It was more as a separation.  The MPCCORE, CELL, PPCA2, and TITAN are rather
> old processors.  I don't recall why we kept them after the POWER.
> 
> Logically we should re-order the list and move MPCCORE, etc. earlier, but I
> will delete the blank line in future patches.

Thanks for clarifying; the re-ordering can be done in a separate patch, and
in this context one comment line would be better than a blank line. :)

...

>>> + power10 tuning until future tuning is added.  */
>>>if (rs6000_tune_index >= 0)
>>> -tune_index = rs6000_tune_index;
>>> +{
>>> +  enum processor_type cur_proc
>>> +   = processor_target_table[rs6000_tune_index].processor;
>>> +
>>> +  if (cur_proc == PROCESSOR_FUTURE)
>>> +   {
>>> + static bool issued_future_tune_warning = false;
>>> + if (!issued_future_tune_warning)
>>> +   {
>>> + issued_future_tune_warning = true;
>>
>> This seems to ensure we only warn this once, but I noticed that in rs6000/
>> only some OPT_Wpsabi related warnings adopt this way, I wonder if we don't
>> restrict it like this, for a tiny simple case, how many times it would warn?
> 
> In a simple case, you would only get the warning once.  But if you use
> __attribute__((__target__(...))) or #pragma target ... you might see it more
> than once.

OK, considering we only get this warning once for a simple case, I'm inclined
not to keep a static variable for it; that matches what we currently do when
emitting option-conflict errors.  But I'm fine with either.


>>>else
>>>  {
>>> -  size_t i;
>>>enum processor_type tune_proc
>>> = (TARGET_POWERPC64 ? PROCESSOR_DEFAULT64 : PROCESSOR_DEFAULT);
>>>  
>>> -  tune_index = -1;
>>> -  for (i = 0; i < ARRAY_SIZE (processor_target_table); i++)
>>> -   if (processor_target_table[i].processor == tune_proc)
>>> - {
>>> -   tune_index = i;
>>> -   break;
>>> - }
>>> +  tune_index = rs600_cpu_index_lookup (tune_proc == PROCESSOR_FUTURE
>>> +  ? PROCESSOR_POWER10
>>> +  : tune_proc);
>>
>> This part looks useless, as tune_proc is impossible to be PROCESSOR_FUTURE.
> 
> Well in theory, you could configure the compiler with --with-cpu=future or
> --with-tune=future.

Sorry for the possible confusion here: the "tune_proc" that I referred to is
the variable in the above else branch:

   enum processor_type tune_proc = (TARGET_POWERPC64 ? PROCESSOR_DEFAULT64 : PROCESSOR_DEFAULT);

It's either PROCESSOR_DEFAULT64 or PROCESSOR_DEFAULT, so it has no chance to
be PROCESSOR_FUTURE, and the check "tune_proc == PROCESSOR_FUTURE" is
therefore useless.

That's why I suggested the flow below: it does one final check outside of
those branches, which looks a bit clearer IMHO.

> 
>>>  }
>>
>> Maybe re-structure the above into:
>>
>> bool explicit_tune = false;
>> if (rs6000_tune_index >= 0)
>>   {
>> tune_index = rs6000_tune_index;
>> explicit_tune = true;
>>   }
>> else if (cpu_index >= 0)
>>   // as before
>>   rs6000_tune_index = tune_index = cpu_index;
>> else
>>   {
>>//as before
>>...
>>   }
>>
>> // Check tune_index here instead.
>>
>> if (processor_target_table[tune_index].processor == PROCESSOR_FUTURE)
>>   {
>> tune_index = rs6000_cpu_index_lookup (PROCESSOR_POWER10);
>> if (explicit_tune)
>>   warn ...
>>   }
>>
>> // as before
>> rs6000_tune = processor_target_table[tune_index].processor;
>>
>>>  


BR,
Kewen



Re: Repost [PATCH 3/6] PowerPC: Add support for accumulators in DMR registers.

2024-02-07 Thread Kewen.Lin
on 2024/2/7 08:06, Michael Meissner wrote:
> On Thu, Jan 25, 2024 at 05:28:49PM +0800, Kewen.Lin wrote:
>> Hi Mike,
>>
>> on 2024/1/6 07:38, Michael Meissner wrote:
>>> The MMA subsystem added the notion of accumulator registers as an optional
>>> feature of ISA 3.1 (power10).  In ISA 3.1, these accumulators overlapped 
>>> with
>>> the traditional floating point registers 0..31, but logically the 
>>> accumulator
>>> registers were separate from the FPR registers.  In ISA 3.1, it was 
>>> anticipated
>>
>> Using VSX registers 0..31 rather than traditional floating point registers
>> 0..31 seems clearer, since floating point registers imply 64-bit-wide
>> registers.
> 
> Ok.
> 
>>> that in future systems, the accumulator registers may no overlap with the 
>>> FPR
>>> registers.  This patch adds the support for dense math registers as separate
>>> registers.
>>>
>>> This particular patch does not change the MMA support to use the 
>>> accumulators
>>> within the dense math registers.  This patch just adds the basic support for
>>> having separate DMRs.  The next patch will switch the MMA support to use the
>>> accumulators if -mcpu=future is used.
>>>
>>> For testing purposes, I added an undocumented option '-mdense-math' to 
>>> enable
>>> or disable the dense math support.
>>
>> Can we avoid this and use one macro for it instead?  As you might have 
>> noticed
>> that some previous temporary options like -mpower{8,9}-vector cause ICEs due 
>> to
>> some unexpected combination and we are going to neuter them, so let's try our
>> best to avoid it if possible.  I guess one macro TARGET_DENSE_MATH defined by
>> TARGET_FUTURE && TARGET_MMA matches all use places? and specifying 
>> -mcpu=future
>> can enable it while -mcpu=power10 can disable it.
> 
> That depends on whether there will be other things added in the future power
> that are not in the MMA+ instruction set.
> 
> But I can switch to defining TARGET_DENSE_MATH to testing TARGET_FUTURE and
> TARGET_MMA.  That way if/when a new cpu comes out, we will just have to change
> the definition of TARGET_DENSE_MATH and not all of the uses.

Yes, that's what I expected.  Thanks!
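
Concretely, a sketch of what I expected (where exactly the define lives is
an assumption):

  /* Dense math requires both the future ISA and MMA support.  */
  #define TARGET_DENSE_MATH (TARGET_FUTURE && TARGET_MMA)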

> 
> I will also add TARGET_MMA_NO_DENSE_MATH to handle the existing MMA code for
> assemble and disassemble when we don't have dense math instructions.

Nice, I also found that having such a macro can help when reviewing a later
patch, so I suggested something similar there.

>>> -(define_insn_and_split "*movxo"
>>> +(define_insn_and_split "*movxo_nodm"
>>>[(set (match_operand:XO 0 "nonimmediate_operand" "=d,ZwO,d")
>>> (match_operand:XO 1 "input_operand" "ZwO,d,d"))]
>>> -  "TARGET_MMA
>>> +  "TARGET_MMA && !TARGET_DENSE_MATH
>>> && (gpc_reg_operand (operands[0], XOmode)
>>> || gpc_reg_operand (operands[1], XOmode))"
>>>"@
>>> @@ -366,6 +369,31 @@ (define_insn_and_split "*movxo"
>>> (set_attr "length" "*,*,16")
>>> (set_attr "max_prefixed_insns" "2,2,*")])
>>>  
>>> +(define_insn_and_split "*movxo_dm"
>>> +  [(set (match_operand:XO 0 "nonimmediate_operand" "=wa,QwO,wa,wD,wD,wa")
>>> +   (match_operand:XO 1 "input_operand""QwO,wa, wa,wa,wD,wD"))]
>>
>> Why not adopt ZwO rather than QwO?
> 
> You have to split the address into 2 addresses for loading or storing vector
> pairs (or 4 addresses for loading or storing vectors).  Z would allow
> register+register addresses, and you wouldn't be able to create the second 
> address by adding 128 to it.  Hence it uses 'Q' for register only and 'wo' for
> d-form addresses.

Thanks for clarifying.  But without this patch the define_insn_and_split
*movxo adopts "ZwO"; IMHO that would mean the current "*movxo"
define_insn_and_split has been problematic?  I thought adjust_address can
ensure the new address is still valid after adjusting the offset by 128,
could you double check?
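
That is, in the split I would expect something like this sketch (not the
actual splitter code; the offsets assume the 64-byte XOmode access gets
split into four 16-byte vector accesses):

  /* adjust_address validates the adjusted address and legitimizes it
     when needed, which is exactly the behavior in question here.  */
  rtx v0 = adjust_address (mem, V16QImode, 0);
  rtx v1 = adjust_address (mem, V16QImode, 16);
  rtx v2 = adjust_address (mem, V16QImode, 32);
  rtx v3 = adjust_address (mem, V16QImode, 48);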

> 
>>
>>> +  "TARGET_DENSE_MATH
>>> +   && (gpc_reg_operand (operands[0], XOmode)
>>> +   || gpc_reg_operand (operands[1], XOmode))"
>>> +  "@
>>> +   #
>>> +   #
>>> +   #
>>> +   dmxxinstdmr512 %0,%1,%Y1,0
>>> +   dmmr %0,%1
>>> +   dmxxextfdmr512 %0,%Y0

Re: [PATCH] rs6000: Neuter option -mpower{8,9}-vector [PR109987]

2024-02-20 Thread Kewen.Lin
Hi Segher,

Thanks for the review comments!

on 2024/2/20 02:45, Segher Boessenkool wrote:
> Hi!
> 
> On Tue, Jan 16, 2024 at 10:50:01AM +0800, Kewen.Lin wrote:
>> As PR109987 and its duplicated bugs show, -mno-power8-vector
>> (and -mno-power9-vector) cause some problems and as Segher
>> pointed out in [1] they are workaround options, so this patch
>> is to remove -m{no,}-power{8,9}-options.
> 
> Excellent :-)
> 
>> Like what we did
>> for option -mdirect-move before, this patch still keep the
>> corresponding internal flags and they are automatically set
>> based on -mcpu.
> 
> Yup.  That makes the code nicer, and it what we already have anyway!
> 
>> The test suite update takes some efforts,
> 
> Yeah :-/
> 
>> it consists of some aspects:
>>   - effective target powerpc_p{8,9}vector_ok are removed
>> and replaced with powerpc_vsx_ok.
> 
> So all such testcases already arrange to have p8 or p9 some other way?

Some of them already do, but some of them don't; those without any p8/p9
are adjusted according to the test points, as explained below.

> 
>>   - Some cases having -mpower{8,9}-vector are updated with
>> -mvsx, some of them already have -mdejagnu-cpu.  For
>> those that don't have -mdejagnu-cpu, if -mdejagnu-cpu
>> is needed for the test point, then it's appended;
>> otherwise, add additional-options -mdejagnu-cpu=power{8,9}
>> if has_arch_pwr{8,9} isn't satisfied.
> 
> Yeah it's a judgement call every time.
> 
>>   - Some test cases are updated with explicit -mvsx.
>>   - Some test cases with those two option mixed are adjusted
>> to keep the test points, like -mpower8-vector
>> -mno-power9-vector are updated with -mdejagnu-cpu=power8
>> -mvsx etc.
> 
> -mcpu=power8 implies -mvsx already.

Yes, but users can specify -mno-vsx in RUNTESTFLAGS, and the dejagnu
framework can behave differently (in options ordering) across versions;
this explicit -mvsx is mainly for consistency between the checking and
the actual testing.  But according to the discussion in an internal
thread, the current powerpc_vsx_ok doesn't work as we expect, so there
will be some changes later.
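
For example, a typical adjusted test header looks like (illustrative only):

  /* { dg-require-effective-target powerpc_vsx_ok } */
  /* { dg-options "-mvsx -O2" } */

The explicit -mvsx here is intended to keep the code generation consistent
with what powerpc_vsx_ok checked, even if RUNTESTFLAGS carries -mno-vsx.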

> 
>>   - Some test cases with -mno-power{8,9}-vector are updated
>> by replacing -mno-power{8,9}-vector with -mno-vsx, or
>> just removing it.
> 
> Okay.
> 
>>   - For some cases, we don't always specify -mdejagnu-cpu to
>> avoid to restrict the testing coverage, it would check
>> has_arch_pwr{8,9} and appended that as need.
> 
> That is in general how all tests should be.  Very sometimes we want to
> test for a specific CPU, for a regression test that exhibited just on a
> certain CPU for example.  But we should never have a -mcpu= (or a
> -mpowerN-vector nastiness thing) to test things on a new CPU!  Just do a
> testsuite ruyn with *that* CPU.  Not many years from now, *all* CPUs
> will have those new instructions anyway, so let's not put noise in the
> testcases that will be irrelevant soon.
> 
>>   - For vect test cases run, it doesn't specify -mcpu=power9
>> for power10 and up.
>>
>> Bootstrapped and regtested on:
>>   - powerpc64-linux-gnu P7/P8/P9 {-m32,-m64}
>>   - powerpc64le-linux-gnu P8/P9/P10
> 
> In general it is nice to test 970 as the lowest vector thing we have,
> abnd/or p4 as a target without anything vector, as well.  But I expect
> thoise will just work for this patch :-)

Thanks for the tips, I'll give them a shot before pushing it.

> 
>> Although it's stage4 now, as the discussion in PR113115 we
>> are still eager to neuter these two options.
> 
> It is mostly a testsuite patch, and testcase patches are fine (and much
> wanted!) in stage 4.  The actual compiler options remain, and behaviour
> does not change for anyone who used the option as intended,

Yes, except for one unexpected use: users whose cpu type doesn't support the
power8/power9 capability but who specify -mpower{8,9}-vector to gain it
anyway (as currently these options can enable the corresponding flags).  But
I don't think that's an intended use case.

> 
> Okay for trunk.  Thanks!  Comments below:
> 
>>  * config/rs6000/rs6000.opt: Make option power{8,9}-vector as
>>  WarnRemoved.
> 
> Do we want this, or do we want it silent?  Should we remove the options
> later, if we now warn for it?

Good question, it mainly follows the practice of option direct-move here.
IMHO at least for power8-vector we want WarnRemoved for now as it's
documented before, and we can probably make it (or th

Re: [PATCH] rs6000: Update instruction counts due to combine changes [PR112103]

2024-02-20 Thread Kewen.Lin
Hi Peter,

on 2024/2/20 06:35, Peter Bergner wrote:
> rs6000: Update instruction counts due to combine changes [PR112103]
> 
> The PR91865 combine fix changed instruction counts slightly for rlwinm-0.c.
> Adjust expected instruction counts accordingly.
> 
> This passed on both powerpc64le-linux and powerpc64-linux running the
> testsuite in both 32-bit and 64-bit modes.  Ok for trunk?

OK for trunk, thanks for fixing!

> 
> FYI, I will open a new bug to track the removing of the superfluous
> insns detected in PR112103.

Hopefully this test case will no longer be fragile once the newly filed
issue gets fixed. :)

BR,
Kewen

> 
> 
> Peter
> 
> 
> gcc/testsuite/
>   PR target/112103
>   * gcc.target/powerpc/rlwinm-0.c: Adjust expected instruction counts.
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/rlwinm-0.c 
> b/gcc/testsuite/gcc.target/powerpc/rlwinm-0.c
> index 4f4fca2d8ef..a10d9174306 100644
> --- a/gcc/testsuite/gcc.target/powerpc/rlwinm-0.c
> +++ b/gcc/testsuite/gcc.target/powerpc/rlwinm-0.c
> @@ -4,10 +4,10 @@
>  /* { dg-final { scan-assembler-times {(?n)^\s+[a-z]} 6739 { target ilp32 } } 
> } */
>  /* { dg-final { scan-assembler-times {(?n)^\s+[a-z]} 9716 { target lp64 } } 
> } */
>  /* { dg-final { scan-assembler-times {(?n)^\s+blr} 3375 } } */
> -/* { dg-final { scan-assembler-times {(?n)^\s+rldicl} 3081 { target lp64 } } 
> } */
> +/* { dg-final { scan-assembler-times {(?n)^\s+rldicl} 3090 { target lp64 } } 
> } */
>  
>  /* { dg-final { scan-assembler-times {(?n)^\s+rlwinm} 3197 { target ilp32 } 
> } } */
> -/* { dg-final { scan-assembler-times {(?n)^\s+rlwinm} 3093 { target lp64 } } 
> } */
> +/* { dg-final { scan-assembler-times {(?n)^\s+rlwinm} 3084 { target lp64 } } 
> } */
>  /* { dg-final { scan-assembler-times {(?n)^\s+rotlwi} 154 } } */
>  /* { dg-final { scan-assembler-times {(?n)^\s+srwi} 13 { target ilp32 } } } 
> */
>  /* { dg-final { scan-assembler-times {(?n)^\s+srdi} 13 { target lp64 } } } */



Re: Repost [PATCH 1/6] Add -mcpu=future

2024-02-20 Thread Kewen.Lin
Hi Mike,

Sorry for the late reply (just back from vacation).

on 2024/2/8 03:58, Michael Meissner wrote:
> On Wed, Feb 07, 2024 at 05:21:10PM +0800, Kewen.Lin wrote:
>> on 2024/2/6 14:01, Michael Meissner wrote:
>> Sorry for the possible confusion here, the "tune_proc" that I referred to is
>> the variable in the above else branch:
>>
>>enum processor_type tune_proc = (TARGET_POWERPC64 ? PROCESSOR_DEFAULT64 : 
>> PROCESSOR_DEFAULT);
>>
>> It's either PROCESSOR_DEFAULT64 or PROCESSOR_DEFAULT, so it doesn't have a
>> chance to be PROCESSOR_FUTURE, so the checking "tune_proc == 
>> PROCESSOR_FUTURE"
>> is useless.
> 
> PROCESSOR_DEFAULT can be PROCESSOR_FUTURE if somebody configures GCC with
> --with-cpu=future.  While in general it shouldn't occur, it is helpful to
> consider all of the corner cases.

But that doesn't sound right; I think you meant TARGET_CPU_DEFAULT instead?

On one local ppc64le machine I tried to configure with --with-cpu=power10,
I got {,OPTION_}TARGET_CPU_DEFAULT "power10" but PROCESSOR_DEFAULT is still
PROCESSOR_POWER7 (PROCESSOR_DEFAULT64 is PROCESSOR_POWER8).  I think these
PROCESSOR_DEFAULT{,64} are defined by various headers:

$ grep -r "define PROCESSOR_DEFAULT" gcc/config/rs6000/
gcc/config/rs6000/aix71.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
gcc/config/rs6000/aix71.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER7
gcc/config/rs6000/aix72.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
gcc/config/rs6000/aix72.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER7
gcc/config/rs6000/aix73.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER8
gcc/config/rs6000/aix73.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
gcc/config/rs6000/darwin.h:#define PROCESSOR_DEFAULT  PROCESSOR_PPC7400
gcc/config/rs6000/darwin.h:#define PROCESSOR_DEFAULT64  PROCESSOR_POWER4
gcc/config/rs6000/freebsd64.h:#define PROCESSOR_DEFAULT PROCESSOR_PPC7450
gcc/config/rs6000/freebsd64.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
gcc/config/rs6000/linux64.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
gcc/config/rs6000/linux64.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
gcc/config/rs6000/rs6000.h:#define PROCESSOR_DEFAULT   PROCESSOR_PPC603
gcc/config/rs6000/rs6000.h:#define PROCESSOR_DEFAULT64 PROCESSOR_RS64A
gcc/config/rs6000/vxworks.h:#define PROCESSOR_DEFAULT PROCESSOR_PPC604

, and they are unlikely to be updated later, no?

btw, the given --with-cpu=future will make cpu_index never negative, so

  ...
  else if (cpu_index >= 0)
rs6000_tune_index = tune_index = cpu_index;
  else
... 

there is no chance to enter the "else" arm; that is, that arm only takes
effect when no cpu/tune is given (neither -m{cpu,tune} nor --with-cpu=).

BR,
Kewen



Re: [PATCH] rs6000: Neuter option -mpower{8,9}-vector [PR109987]

2024-02-20 Thread Kewen.Lin
on 2024/2/20 19:19, Segher Boessenkool wrote:
> On Tue, Feb 20, 2024 at 05:27:07PM +0800, Kewen.Lin wrote:
>> Good question, it mainly follows the practice of option direct-move here.
>> IMHO at least for power8-vector we want WarnRemoved for now as it's
>> documented before, and we can probably make it (or them) removed later on
>> trunk once all active branch releases don't support it any more.
>>
>> What's your opinion on this?
> 
> Originally I did
>   Warn(%qs is deprecated)
> which already was a mistake.  It then changed to
>   Deprecated
> and then to
>   WarnRemoved
> which make it clearer that it is a bad plan.
> 
> If it is okay to remove an option, we should not talk about it at all
> anymore.  Well maybe warn about it for another release or so, but not
> longer.

OK, thanks for the suggestion.

> 
>>>>  (define_register_constraint "we" 
>>>> "rs6000_constraints[RS6000_CONSTRAINT_we]"
>>>> -  "@internal Like @code{wa}, if @option{-mpower9-vector} and 
>>>> @option{-m64} are
>>>> -   used; otherwise, @code{NO_REGS}.")
>>>> +  "@internal Like @code{wa}, if the cpu type is power9 or up, meanwhile
>>>> +   @option{-mvsx} and @option{-m64} are used; otherwise, @code{NO_REGS}.")
>>>
>>> "if this is a POWER9 or later and @option{-mvsx} and @option{-m64} are
>>> used".  How clumsy.  Maybe we should make the patterns that use "we"
>>> work without mtvsrdd as well?  Hrm, they will still require 64-bit GPRs
>>> of course, unless we can do something tricky.
>>>
>>> We do not need the special constraint at all of course (we can add these
>>> conditions to all patterns that use it: all *two* patterns).  So maybe
>>> that's what we should do :-)
>>
>> Not sure the original intention introducing it (Mike might know it best), but
>> removing it sounds doable.
> 
> It is for mtvsrdd.

Yes, I meant to say I'm not sure whether there was some obstacle which made us
introduce a new constraint, or whether it was just because it's simple.

> 
>>  btw, it seems more than two patterns using it?
>> like (if I didn't miss something):
>>   - vsx_concat_<mode>
>>   - vsx_splat_<mode>_reg
>>   - vsx_splat_v4si_di
>>   - vsx_mov<mode>_64bit
> 
> Yes, it isn't clear we should use this constraint in those last two.  It
> looks like those do not even need the restriction to 64 bit systems.
> Well the last one obviously has that already, but then it could just use
> "wa", no?

For vsx_splat_v4si_di, it's for mtvsrws; the ISA notes GPR[RA].bit[32:63],
which implies the context has 64-bit GPRs?  The last one still seems to
distinguish whether there is power9 support or not; just using "wa", which
only implies power7, doesn't fit with it?

btw, the actual guard for "we" is TARGET_POWERPC64 rather than TARGET_64BIT,
the documentation isn't accurate enough.  Just filed internal issue #1345
for further tracking on this.

> 
>>> -mcpu=power8 implies -mvsx (power7 already).  You can disable VSX, or
>>> VMX as well, but by default it is enabled.
>>
>> Yes, it's meant to consider an explicit -mno-vsx, which suffers from the
>> option order issue.  But considering we raised an error for -mno-vsx
>> -mpower{8,9}-vector before, omitting -mvsx is closer to the previous behavior.
>>
>> I'll adjust it and the below similar ones, thanks!
> 
> It is never supported to do unsupported things :-)
> 
> We need to be able to rely on defaults.  Otherwise, we will have to
> implement all of GCC recursively, in itself, in the testsuite, and in
> individual tests.  Let's not :-)

OK, fair enough.  Thanks!

BR,
Kewen



Re: [PATCH] rs6000: Neuter option -mpower{8,9}-vector [PR109987]

2024-02-20 Thread Kewen.Lin
on 2024/2/21 09:37, Peter Bergner wrote:
> On 2/20/24 3:27 AM, Kewen.Lin wrote:
>> on 2024/2/20 02:45, Segher Boessenkool wrote:
>>> On Tue, Jan 16, 2024 at 10:50:01AM +0800, Kewen.Lin wrote:
>>>> it consists of some aspects:
>>>>   - effective target powerpc_p{8,9}vector_ok are removed
>>>> and replaced with powerpc_vsx_ok.
>>>
>>> So all such testcases already arrange to have p8 or p9 some other way?
> 
> Shouldn't that be replaced with powerpc_vsx instead of powerpc_vsx_ok?
> That way we know VSX code gen is enabled for the options being used,
> even those in RUNTESTFLAGS.
> 
> I thought we agreed that powerpc_vsx_ok was almost always useless and
> we always want to use powerpc_vsx.  ...or did I miss that we removed
> the old powerpc_vsx_ok and renamed powerpc_vsx to powerpc_vsx_ok?

Yes, I think we all agreed that powerpc_vsx matches what we expect better,
but I'm hesitant to make such a change at this stage because:

  1. if testing on an env without vsx support, the test results on these
     affected test cases may change a lot, as many test cases would
     become unsupported (they passed before with an explicit -mvsx).

  2. teaching the current powerpc_vsx to make use of current_compiler_flags,
     just like some existing practices on has_arch_*, may help mitigate
     it, as quite a few test cases already have an explicit -mvsx.  But AIUI
     current_compiler_flags requires the dg-options line to come before
     the effective target line for the options in dg-options to take
     effect, which means we need some work to adjust the line order in the
     affected test cases (see the directive sketch below).  On the other hand,
     some enhancement is needed for current_compiler_flags, as powerpc_vsx
     (old powerpc_vsx_ok) isn't only used in test cases but can also be used
     in some exp checks where no expected flags exist.

  3. there may be some other similar effective target checks which we
     want to update as well, which means we need to revisit the existing
     effective target checks (rs6000 specific).

  4. powerpc_vsx_ok has been there for a long, long time, and -mno-vsx
     is rarely used in RUNTESTFLAGS; this only affects testing, so it
     is not that urgent.

so I'm inclined to work on this in next stage 1.  What do you think?
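For concreteness, once powerpc_vsx is taught to honor current_compiler_flags
as item 2 describes (a hypothetical sketch until that rework lands; the
powerpc_vsx used here is the proposed, renamed effective target), a test
would need to order its directives with dg-options first, e.g.:

/* { dg-do compile } */
/* { dg-options "-mdejagnu-cpu=power8 -mvsx" } */
/* { dg-require-effective-target powerpc_vsx } */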

> 
>>>>   - Some test cases are updated with explicit -mvsx.
>>>>   - Some test cases with those two option mixed are adjusted
>>>> to keep the test points, like -mpower8-vector
>>>> -mno-power9-vector are updated with -mdejagnu-cpu=power8
>>>> -mvsx etc.
>>>
>>> -mcpu=power8 implies -mvsx already.
> 
> Then we can omit the explicit -msx option, correct?  Ie, if the
> user forces -mno-vsx in RUNTESTFLAGS, then we'll just skip the
> test case as UNSUPPORTED rather than trying to compile some
> vsx test case with vsx disabled via the options.

Yes, we can strip any -mvsx then, but if we want the test case
to be tested whenever possible, we can still append an extra
-mvsx.  Even if -mno-vsx is specified, if the option order
makes it like "-mno-vsx ... -mvsx", powerpc_vsx is supported
so the test case can still be tested with -mvsx
enabled, while if the order is like "-mvsx ... -mno-vsx",
powerpc_vsx fails and it becomes unsupported.

BR,
Kewen



Re: Repost [PATCH 1/6] Add -mcpu=future

2024-02-26 Thread Kewen.Lin
on 2024/2/21 15:19, Michael Meissner wrote:
> On Tue, Feb 20, 2024 at 06:35:34PM +0800, Kewen.Lin wrote:
>> Hi Mike,
>>
>> Sorry for late reply (just back from vacation).
>>
>> on 2024/2/8 03:58, Michael Meissner wrote:
>>> On Wed, Feb 07, 2024 at 05:21:10PM +0800, Kewen.Lin wrote:
>>>> on 2024/2/6 14:01, Michael Meissner wrote:
>>>> Sorry for the possible confusion here, the "tune_proc" that I referred to 
>>>> is
>>>> the variable in the above else branch:
>>>>
>>>>enum processor_type tune_proc = (TARGET_POWERPC64 ? PROCESSOR_DEFAULT64 
>>>> : PROCESSOR_DEFAULT);
>>>>
>>>> It's either PROCESSOR_DEFAULT64 or PROCESSOR_DEFAULT, so it doesn't have a
>>>> chance to be PROCESSOR_FUTURE, so the checking "tune_proc == 
>>>> PROCESSOR_FUTURE"
>>>> is useless.
>>>
>>> PROCESSOR_DEFAULT can be PROCESSOR_FUTURE if somebody configures GCC with
>>> --with-cpu=future.  While in general it shouldn't occur, it is helpful to
>>> consider all of the corner cases.
>>
>> But that doesn't sound right; I think you meant TARGET_CPU_DEFAULT instead?
>>
>> On one local ppc64le machine I tried to configure with --with-cpu=power10,
>> I got {,OPTION_}TARGET_CPU_DEFAULT "power10" but PROCESSOR_DEFAULT is still
>> PROCESSOR_POWER7 (PROCESSOR_DEFAULT64 is PROCESSOR_POWER8).  I think these
>> PROCESSOR_DEFAULT{,64} are defined by various headers:
> 
> Yes, I was mistaken.  You are correct TARGET_CPU_DEFAULT is set.  I will 
> change
> the comments.

Thanks!

> 
>> gcc/config/rs6000/aix71.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
>> gcc/config/rs6000/aix71.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER7
>> gcc/config/rs6000/aix72.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
>> gcc/config/rs6000/aix72.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER7
>> gcc/config/rs6000/aix73.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER8
>> gcc/config/rs6000/aix73.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
>> gcc/config/rs6000/darwin.h:#define PROCESSOR_DEFAULT  PROCESSOR_PPC7400
>> gcc/config/rs6000/darwin.h:#define PROCESSOR_DEFAULT64  PROCESSOR_POWER4
>> gcc/config/rs6000/freebsd64.h:#define PROCESSOR_DEFAULT PROCESSOR_PPC7450
>> gcc/config/rs6000/freebsd64.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
>> gcc/config/rs6000/linux64.h:#define PROCESSOR_DEFAULT PROCESSOR_POWER7
>> gcc/config/rs6000/linux64.h:#define PROCESSOR_DEFAULT64 PROCESSOR_POWER8
>> gcc/config/rs6000/rs6000.h:#define PROCESSOR_DEFAULT   PROCESSOR_PPC603
>> gcc/config/rs6000/rs6000.h:#define PROCESSOR_DEFAULT64 PROCESSOR_RS64A
>> gcc/config/rs6000/vxworks.h:#define PROCESSOR_DEFAULT PROCESSOR_PPC604
>>
>> , and they are unlikely to be updated later, no?
>>
>> btw, the given --with-cpu=future will make cpu_index never negative so
>>
>>   ...
>>   else if (cpu_index >= 0)
>> rs6000_tune_index = tune_index = cpu_index;
>>   else
>> ... 
>>
>> so there is no chance to enter "else" arm, that is, that arm only takes
>> effect when no cpu/tune is given (neither -m{cpu,tune} nor --with-cpu=).
> 
> Note, this is existing code.  I didn't modify it.  If we want to change it, we
> should do it as another patch.

Yes, I agree.  Just to clarify, I didn't suggest changing it but instead
suggested almost keeping them, since we don't need any changes in the "else"
arm; so instead of updating the "if" and "else if" arms for the future cpu
type, it seems a bit clearer to just check it after this, i.e.:



bool explicit_tune = false;
if (rs6000_tune_index >= 0)
  {
tune_index = rs6000_tune_index;
explicit_tune = true;
  }
else if (cpu_index >= 0)
  // as before
  rs6000_tune_index = tune_index = cpu_index;
else
  {
    // as before
   ...
  }

// Check tune_index here instead.

if (processor_target_table[tune_index].processor == PROCESSOR_FUTURE)
  {
tune_index = rs6000_cpu_index_lookup (PROCESSOR_POWER10);
if (explicit_tune)
  warn ...
  }

// as before
rs6000_tune = processor_target_table[tune_index].processor;



, copied from previous comment: 
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/643681.html

BR,
Kewen



Re: [PATCH] rs6000: Don't allow immediate value in the vsx_splat pattern [PR113950]

2024-02-26 Thread Kewen.Lin
Hi,

on 2024/2/26 14:18, jeevitha wrote:
> Hi All,
> 
> The following patch has been bootstrapped and regtested on powerpc64le-linux.
> 
> There is no immediate value splatting instruction in powerpc. Currently that
> needs to be stored in a register or memory. For addressing this I have updated
> the predicate for the second operand in vsx_splat to splat_input_operand,
> which will handle the operands appropriately.

The test case fails with an error message with GCC 11, but with an ICE from GCC
12; it's kind of a regression, so I think we can make such a fix at this stage.

Out of curiosity, did you check why it triggers error messages on GCC 11?  I
guess the difference from GCC 12 is that Bill introduced the new built-in
framework in GCC 12, which adds support for the bif, but I'm curious what
prevented this from being supported before that.

> 
> 2024-02-26  Jeevitha Palanisamy  
> 
> gcc/
>   PR target/113950
>   * config/rs6000/vsx.md (vsx_splat_<mode>): Update the predicate
>   for the second operand.
> 
> gcc/testsuite/
>   PR target/113950
>   * gcc.target/powerpc/pr113950.c: New testcase.
> 
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index 6111cc90eb7..e5688ff972a 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -4660,7 +4660,7 @@
>  (define_expand "vsx_splat_"
>[(set (match_operand:VSX_D 0 "vsx_register_operand")
>   (vec_duplicate:VSX_D
> -  (match_operand: 1 "input_operand")))]
> +  (match_operand: 1 "splat_input_operand")))]
>"VECTOR_MEM_VSX_P (mode)"
>  {
>rtx op1 = operands[1];

This hunk actually does force_reg already:

...
  else if (!REG_P (op1))
    op1 = force_reg (<MODE>mode, op1);

but it's assigning to op1 unexpectedly (an omission IMHO), so just
simply fix it with:

  else if (!REG_P (op1))
-    op1 = force_reg (<MODE>mode, op1);
+    operands[1] = force_reg (<MODE>mode, op1);

instead, can you verify?

> diff --git a/gcc/testsuite/gcc.target/powerpc/pr113950.c 
> b/gcc/testsuite/gcc.target/powerpc/pr113950.c
> new file mode 100644
> index 000..29ded29f683
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr113950.c
> @@ -0,0 +1,24 @@
> +/* PR target/113950 */
> +/* { dg-do compile } */

We need an effective target to ensure vsx support, for now it's powerpc_vsx_ok.
ie: /* { dg-require-effective-target powerpc_vsx_ok } */

(most/all of its uses would be replaced with an enhanced powerpc_vsx in next 
stage 1).

BR,
Kewen


> +/* { dg-options "-O1" } */
> +
> +/* Verify we do not ICE on the following.  */
> +
> +void abort (void);
> +
> +int main ()
> +{
> +  int i;
> +  vector signed long long vsll_result, vsll_expected_result;
> +  signed long long sll_arg1;
> +
> +  sll_arg1 = 300;
> +  vsll_expected_result = (vector signed long long) {300, 300};
> +  vsll_result = __builtin_vsx_splat_2di (sll_arg1);  
> +
> +  for (i = 0; i < 2; i++)
> +if (vsll_result[i] != vsll_expected_result[i])
> +  abort();
> +
> +  return 0;
> +}
> 
> 



Re: [PATCH] rs6000: Don't allow immediate value in the vsx_splat pattern [PR113950]

2024-02-26 Thread Kewen.Lin
on 2024/2/26 23:07, Peter Bergner wrote:
> On 2/26/24 4:49 AM, Kewen.Lin wrote:
>> on 2024/2/26 14:18, jeevitha wrote:
>>> Hi All,
>>> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
>>> index 6111cc90eb7..e5688ff972a 100644
>>> --- a/gcc/config/rs6000/vsx.md
>>> +++ b/gcc/config/rs6000/vsx.md
>>> @@ -4660,7 +4660,7 @@
>>>  (define_expand "vsx_splat_<mode>"
>>>[(set (match_operand:VSX_D 0 "vsx_register_operand")
>>> (vec_duplicate:VSX_D
>>> -(match_operand:<VEC_base> 1 "input_operand")))]
>>> +(match_operand:<VEC_base> 1 "splat_input_operand")))]
>>>"VECTOR_MEM_VSX_P (<MODE>mode)"
>>>  {
>>>rtx op1 = operands[1];
>>
>> This hunk actually does force_reg already:
>>
>> ...
>>   else if (!REG_P (op1))
>>     op1 = force_reg (<MODE>mode, op1);
>>
>> but it's assigning to op1 unexpectedly (an omission IMHO), so just
>> simply fix it with:
>>
>>   else if (!REG_P (op1))
>> -    op1 = force_reg (<MODE>mode, op1);
>> +    operands[1] = force_reg (<MODE>mode, op1);
> 
> I agree op1 was an oversight and it should be operands[1].
> That said, I think using more precise predicates is a good thing,

Agreed.

> so I think we should use both Jeevitha's predicate change and
> your operands[1] change.

Since either the original predicate change or the operands[1] change
can fix this issue, I think it's implied that either of them alone
is enough, so we can remove the "else if (!REG_P (op1))" arm (or even
replace it with an else arm asserting REG_P (op1))?

> 
> I'll note that Jeevitha originally had the operands[1] change, but I
> didn't look closely enough at the issue or the pattern and mentioned
> that these kinds of bugs can be caused by too loose constraints and
> predicates, which is when she found the updated predicate to use.
> I believe she already even bootstrapped and regtested the operands[1]
> only change.  Jeevitha???
> 

Good to know that. :)

> 
> 
> 
>>> +/* PR target/113950 */
>>> +/* { dg-do compile } */
>>
>> We need an effective target to ensure vsx support, for now it's 
>> powerpc_vsx_ok.
>> ie: /* { dg-require-effective-target powerpc_vsx_ok } */
> 
> Agreed.
> 
> 
>>> +/* { dg-options "-O1" } */
> 
> I think we should also use a -mcpu=XXX option to ensure VSX is enabled
> when compiling these VSX built-in functions.  I'm fine using any CPU
> (power7 or later) where the ICE exists with an unpatched compiler.
> Otherwise, testing will be limited to our server systems that have
> VSX enabled by default.

Good point, or maybe just an explicit -mvsx like some existing ones, which
avoids testing against only some fixed cpu type.

BR,
Kewen


Re: [PATCH] rs6000: Don't allow immediate value in the vsx_splat pattern [PR113950]

2024-02-26 Thread Kewen.Lin
on 2024/2/27 10:13, Peter Bergner wrote:
> On 2/26/24 7:55 PM, Kewen.Lin wrote:
>> on 2024/2/26 23:07, Peter Bergner wrote:
>>> so I think we should use both Jeevitha's predicate change and
>>> your operands[1] change.
>>
>> Since either the original predicate change or the operands[1] change
>> can fix this issue, I think it's implied that either of them alone
>> is enough, so we can remove the "else if (!REG_P (op1))" arm (or even
>> replace it with an else arm asserting REG_P (op1))?
> 
> splat_input_operand allows, mem, reg and subreg, so I don't think
> we can just assert on REG_P (op1), since op1 could be a subreg.

ah, you are right! I missed the "subreg".

> I do agree we can remove the "if (!REG_P (op1))" test on the else
> branch, since force_reg() has an early exit for regs, so a simple:
> 
>   ...
>   else
>     operands[1] = force_reg (<MODE>mode, op1);
> 
> ..should work.

Yes!
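
For reference, with both the predicate change and the simplified else arm
applied, the expander would read roughly as follows (just a sketch; it
assumes the MEM_P branch stays as it is in vsx.md and that operand 1 uses
the <VEC_base> mode attribute):

(define_expand "vsx_splat_<mode>"
  [(set (match_operand:VSX_D 0 "vsx_register_operand")
        (vec_duplicate:VSX_D
         (match_operand:<VEC_base> 1 "splat_input_operand")))]
  "VECTOR_MEM_VSX_P (<MODE>mode)"
{
  rtx op1 = operands[1];
  if (MEM_P (op1))
    operands[1] = rs6000_force_indexed_or_indirect_mem (op1);
  else
    operands[1] = force_reg (<MODE>mode, op1);
})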

> 
> 
> 
> 
>> Good point, or maybe just an explicit -mvsx like some existing ones, which
>> avoids testing against only some fixed cpu type.
> 
> If a simple "-O1 -mvsx" is enough to expose the ICE on an unpatched
> compiler and a PASS on a patched compiler, then I'm all for it.
> Jeevitha, can you try confirming that?

Jeevitha, can you also check why we have the different behavior on GCC 11 when
you get time?  GCC 12 has the new built-in framework, so this ICE gets exposed,
but IMHO it would still be good to double check whether the previous behavior
is due to some missing support or some other latent bug.  Thanks in advance!

BR,
Kewen



Re: [PATCH 01/11] rs6000, Fix __builtin_vsx_cmple* args and documentation, builtins

2024-02-28 Thread Kewen.Lin
Hi,

on 2024/2/21 01:55, Carl Love wrote:
> 
> GCC maintainers:
> 
> This patch fixes the arguments and return type for the various 
> __builtin_vsx_cmple* built-ins.  They were defined as signed but should have 
> been defined as unsigned.
> 
> The patch has been tested on Power 10 with no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> -
> 
> rs6000, Fix __builtin_vsx_cmple* args and documentation, builtins
> 
> The built-ins __builtin_vsx_cmple_u16qi, __builtin_vsx_cmple_u2di,
> __builtin_vsx_cmple_u4si and __builtin_vsx_cmple_u8hi should take
> unsigned arguments and return an unsigned result.  This patch changes
> the arguments and return type from signed to unsigned.

Apparently the types mismatch the corresponding bif names, but I wonder
if these __builtin_vsx_cmple* actually provide any value?

Users can just use vec_cmple as PVIPR defines it; as altivec.h shows,
vec_cmple gets redefined via vec_cmpge, so these built-ins are not needed
for the underlying implementation.  I also checked the documentation of
openXL (the xl compiler); it doesn't support these either (so they are not
kept for compatibility).

So can we just remove these bifs?
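
For example, instead of __builtin_vsx_cmple_u16qi users would write something
like the sketch below (the function name is just illustrative); as noted
above, altivec.h implements vec_cmple in terms of vec_cmpge:

#include <altivec.h>

vector bool char
le_uc (vector unsigned char a, vector unsigned char b)
{
  /* PVIPR interface; altivec.h expands this via vec_cmpge (b, a).  */
  return vec_cmple (a, b);
}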

> 
> The documentation for the signed and unsigned versions of
> __builtin_vsx_cmple is missing from extend.texi.  This patch adds the
> missing documentation.
> 
> Test cases are added for each of the signed and unsigned built-ins.
> 
> gcc/ChangeLog:
>   * config/rs6000/rs6000-builtins.def (__builtin_vsx_cmple_u16qi,
>   __builtin_vsx_cmple_u2di, __builtin_vsx_cmple_u4si): Change
>   arguments and return from signed to unsigned.
>   * doc/extend.texi (__builtin_vsx_cmple_16qi,
>   __builtin_vsx_cmple_8hi, __builtin_vsx_cmple_4si,
>   __builtin_vsx_cmple_u16qi, __builtin_vsx_cmple_u8hi,
>   __builtin_vsx_cmple_u4si): Add documentation.
> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/powerpc/vsx-cmple.c: New test file.
> ---
>  gcc/config/rs6000/rs6000-builtins.def|  10 +-
>  gcc/doc/extend.texi  |  23 
>  gcc/testsuite/gcc.target/powerpc/vsx-cmple.c | 127 +++
>  3 files changed, 155 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vsx-cmple.c
> 
> diff --git a/gcc/config/rs6000/rs6000-builtins.def 
> b/gcc/config/rs6000/rs6000-builtins.def
> index 3bc7fed6956..d66a53a0fab 100644
> --- a/gcc/config/rs6000/rs6000-builtins.def
> +++ b/gcc/config/rs6000/rs6000-builtins.def
> @@ -1349,16 +1349,16 @@
>const vss __builtin_vsx_cmple_8hi (vss, vss);
>  CMPLE_8HI vector_ngtv8hi {}
>  
> -  const vsc __builtin_vsx_cmple_u16qi (vsc, vsc);
> +  const vuc __builtin_vsx_cmple_u16qi (vuc, vuc);
>  CMPLE_U16QI vector_ngtuv16qi {}
>  
> -  const vsll __builtin_vsx_cmple_u2di (vsll, vsll);
> +  const vull __builtin_vsx_cmple_u2di (vull, vull);
>  CMPLE_U2DI vector_ngtuv2di {}
>  
> -  const vsi __builtin_vsx_cmple_u4si (vsi, vsi);
> +  const vui __builtin_vsx_cmple_u4si (vui, vui);
>  CMPLE_U4SI vector_ngtuv4si {}
>  
> -  const vss __builtin_vsx_cmple_u8hi (vss, vss);
> +  const vus __builtin_vsx_cmple_u8hi (vus, vus);
>  CMPLE_U8HI vector_ngtuv8hi {}
>  
>const vd __builtin_vsx_concat_2df (double, double);
> @@ -1769,7 +1769,7 @@
>const vf __builtin_vsx_xvcvuxdsp (vull);
>  XVCVUXDSP vsx_xvcvuxdsp {}
>  
> -  const vd __builtin_vsx_xvcvuxwdp (vsi);
> +  const vd __builtin_vsx_xvcvuxwdp (vui);
>  XVCVUXWDP vsx_xvcvuxwdp {}

This change is unexpected, it should not be in this sub-patch. :)

>  
>const vf __builtin_vsx_xvcvuxwsp (vsi);
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index 2b8ba1949bf..4d8610f6aa8 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -22522,6 +22522,29 @@ if the VSX instruction set is available.  The 
> @samp{vec_vsx_ld} and
>  @samp{vec_vsx_st} built-in functions always generate the VSX @samp{LXVD2X},
>  @samp{LXVW4X}, @samp{STXVD2X}, and @samp{STXVW4X} instructions.
>  
> +
> +@smallexample
> +vector signed char __builtin_vsx_cmple_16qi (vector signed char,
> + vector signed char);
> +vector signed short __builtin_vsx_cmple_8hi (vector signed short,
> + vector signed short);
> +vector signed int __builtin_vsx_cmple_4si (vector signed int,
> + vector signed int);
> +vector unsigned char __builtin_vsx_cmple_u16qi (vector unsigned char,
> +vector unsigned char);
> +vector unsigned short __builtin_vsx_cmple_u8hi (vector unsigned short,
> +vector unsigned short);
> +vector unsigned int __builtin_vsx_cmple_u4si (vector unsigned int,
> +  vector unsigned int);
> +@end smallexample

We don't document any vsx_c

Re: [PATCH 02/11] rs6000, fix arguments, add documentation for vector, element conversions

2024-02-28 Thread Kewen.Lin
Hi,

on 2024/2/21 01:56, Carl Love wrote:
> 
> GCC maintainers:
> 
> This patch fixes the  return type for the __builtin_vsx_xvcvdpuxws and 
> __builtin_vsx_xvcvspuxds built-ins.  They were defined as signed but should 
> have been defined as unsigned.
> 
> The patch has been tested on Power 10 with no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> -
> rs6000, fix arguments, add documentation for vector element conversions
> 
> The return type for the __builtin_vsx_xvcvdpuxws, __builtin_vsx_xvcvspuxds,
> __builtin_vsx_xvcvspuxws built-ins should be unsigned.  This patch changes
> the return values from signed to unsigned.
> 
> The documentation for the vector element conversion built-ins:
> 
> __builtin_vsx_xvcvspsxws
> __builtin_vsx_xvcvspsxds
> __builtin_vsx_xvcvspuxds
> __builtin_vsx_xvcvdpsxws
> __builtin_vsx_xvcvdpuxws
> __builtin_vsx_xvcvdpuxds_uns
> __builtin_vsx_xvcvspdp
> __builtin_vsx_xvcvdpsp
> __builtin_vsx_xvcvspuxws
> __builtin_vsx_xvcvsxwdp
> __builtin_vsx_xvcvuxddp_uns
> __builtin_vsx_xvcvuxwdp
> 
> is missing from extend.texi.  This patch adds the missing documentation.

I think we should recommend users adopt the built-ins recommended in
PVIPR.  By checking the corresponding mnemonics in PVIPR, I got:

__builtin_vsx_xvcvspsxws -> vec_signed
__builtin_vsx_xvcvspsxds -> N/A
__builtin_vsx_xvcvspuxds -> N/A
__builtin_vsx_xvcvdpsxws -> vec_signed{e,o}
__builtin_vsx_xvcvdpuxws -> vec_unsigned{e,o}
__builtin_vsx_xvcvdpuxds_uns -> vec_unsigned
__builtin_vsx_xvcvspdp   -> vec_double{e,o}
__builtin_vsx_xvcvdpsp   -> vec_float{e,o}
__builtin_vsx_xvcvspuxws -> vec_unsigned
__builtin_vsx_xvcvsxwdp  -> vec_double{e,o}
__builtin_vsx_xvcvuxddp_uns -> vec_double

For __builtin_vsx_xvcvspsxds and __builtin_vsx_xvcvspuxds which don't have
the according PVIPR built-ins, we can extend the current vec_{un,}signed{e,o}
to cover them and document them following the section mentioning PVIPR.
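
For instance, the first and fourth rows above map to something like this
sketch using the existing overloads (function names are illustrative):

#include <altivec.h>

vector signed int
to_sw (vector float f)
{
  return vec_signed (f);	/* covers xvcvspsxws */
}

vector signed int
to_sw_e (vector double d)
{
  return vec_signede (d);	/* covers xvcvdpsxws, even elements */
}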

BR,
Kewen

> 
> This patch also adds runnable test cases for each of the built-ins.
> 
> gcc/ChangeLog:
>   * config/rs6000/rs6000-builtins.def (__builtin_vsx_xvcvdpuxws,
>   __builtin_vsx_xvcvspuxds, __builtin_vsx_xvcvspuxws): Change
>   return type from signed to unsigned.
>   * doc/extend.texi (__builtin_vsx_xvcvspsxws,
>   __builtin_vsx_xvcvspsxds, __builtin_vsx_xvcvspuxds,
>   __builtin_vsx_xvcvdpsxws, __builtin_vsx_xvcvdpuxws,
>   __builtin_vsx_xvcvdpuxds_uns, __builtin_vsx_xvcvspdp,
>   __builtin_vsx_xvcvdpsp, __builtin_vsx_xvcvspuxws,
>   __builtin_vsx_xvcvsxwdp, __builtin_vsx_xvcvuxddp_uns,
>   __builtin_vsx_xvcvuxwdp): Add documentation for builtins.
> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/powerpc/vsx-builtin-runnable-1.c: New test file.
> ---
>  gcc/config/rs6000/rs6000-builtins.def |   6 +-
>  gcc/doc/extend.texi   | 135 ++
>  .../powerpc/vsx-builtin-runnable-1.c  | 233 ++
>  3 files changed, 371 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-1.c
> 
> diff --git a/gcc/config/rs6000/rs6000-builtins.def 
> b/gcc/config/rs6000/rs6000-builtins.def
> index d66a53a0fab..fd316f629e5 100644
> --- a/gcc/config/rs6000/rs6000-builtins.def
> +++ b/gcc/config/rs6000/rs6000-builtins.def
> @@ -1724,7 +1724,7 @@
>const vull __builtin_vsx_xvcvdpuxds_uns (vd);
>  XVCVDPUXDS_UNS vsx_fixuns_truncv2dfv2di2 {}
>  
> -  const vsi __builtin_vsx_xvcvdpuxws (vd);
> +  const vui __builtin_vsx_xvcvdpuxws (vd);
>  XVCVDPUXWS vsx_xvcvdpuxws {}
>  
>const vd __builtin_vsx_xvcvspdp (vf);
> @@ -1736,10 +1736,10 @@
>const vsi __builtin_vsx_xvcvspsxws (vf);
>  XVCVSPSXWS vsx_fix_truncv4sfv4si2 {}
>  
> -  const vsll __builtin_vsx_xvcvspuxds (vf);
> +  const vull __builtin_vsx_xvcvspuxds (vf);
>  XVCVSPUXDS vsx_xvcvspuxds {}
>  
> -  const vsi __builtin_vsx_xvcvspuxws (vf);
> +  const vui __builtin_vsx_xvcvspuxws (vf);
>  XVCVSPUXWS vsx_fixuns_truncv4sfv4si2 {}
>  
>const vd __builtin_vsx_xvcvsxddp (vsll);
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index 4d8610f6aa8..583b1d890bf 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -21360,6 +21360,141 @@ __float128 __builtin_sqrtf128 (__float128);
>  __float128 __builtin_fmaf128 (__float128, __float128, __float128);
>  @end smallexample
>  
> +@smallexample
> +vector int __builtin_vsx_xvcvspsxws (vector float);
> +@end smallexample
> +
> +The @code{__builtin_vsx_xvcvspsxws} converts the single precision floating
> +point vector element i to a signed single-precision integer value using
> +round to zero storing the result in element i.  If the source element is NaN
> +the result is set to 0x8000 and VXCI is set to 1.  If the source
> +element is SNaN then VXSNAN is also set to 1.  If the rounded value is 
> greater
> +than 2^31 - 1 

Re: [PATCH 04/11] rs6000, Update comment for the __builtin_vsx_vper*, built-ins.

2024-02-28 Thread Kewen.Lin
Hi,

on 2024/2/21 01:56, Carl Love wrote:
> GCC maintainers:
> 
> The patch expands an existing comment to document that the duplicates are 
> covered by an overloaded built-in.  I am wondering if we should just go ahead 
> and remove the duplicates?

As the comments Bill placed before (quoted below) suggest, I think we should
remove them, since users should use the standard interface vec_perm, which is
defined by PVIPR.

They are not documented at all; in case some users are still using such
builtins, they should switch to vec_perm instead, so even though it's stage 4
now, it still looks fine to drop them IMHO.
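
That is, users should write something like the following sketch (names are
illustrative) instead of calling the __builtin_vsx_vperm_* instances directly:

#include <altivec.h>

vector signed char
perm_sc (vector signed char a, vector signed char b, vector unsigned char pcv)
{
  return vec_perm (a, b, pcv);	/* replaces __builtin_vsx_vperm_16qi */
}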

Segher & Peter, what do you think of this?

BR,
Kewen

> 
> The patch has been tested on Power 10 with no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> -
> rs6000, Update comment for the __builtin_vsx_vper* built-ins.
> 
> There is a comment about the __builtin_vsx_vper* built-ins being
> duplicates of the __builtin_altivec_* built-ins.  The note says we
> should consider deprecation/removal of the __builtin_vsx_vper*.  Add a
> note that the _builtin_vsx_vper* built-ins are covered by the overloaded
> vec_perm built-ins which use the __builtin_altivec_* built-in definitions.
> 
> gcc/ChangeLog:
>   * config/rs6000/rs6000-builtins.def ( __builtin_vsx_vperm_*):
>   Add comment to existing comment about the built-ins.
> ---
>  gcc/config/rs6000/rs6000-builtins.def | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/gcc/config/rs6000/rs6000-builtins.def 
> b/gcc/config/rs6000/rs6000-builtins.def
> index 96d095da2cb..4c95429f137 100644
> --- a/gcc/config/rs6000/rs6000-builtins.def
> +++ b/gcc/config/rs6000/rs6000-builtins.def
> @@ -1556,6 +1556,14 @@
>  ; These are duplicates of __builtin_altivec_* counterparts, and are being
>  ; kept for backwards compatibility.  The reason for their existence is
>  ; unclear.  TODO: Consider deprecation/removal at some point.
> +; Note, __builtin_vsx_vperm_16qi, __builtin_vsx_vperm_16qi_uns,
> +; __builtin_vsx_vperm_1ti, __builtin_vsx_vperm_v1ti_uns,
> +; __builtin_vsx_vperm_2df, __builtin_vsx_vperm_2di, __builtin_vsx_vperm_2di,
> +; __builtin_vsx_vperm_2di_uns, __builtin_vsx_vperm_4sf,
> +; __builtin_vsx_vperm_4si, __builtin_vsx_vperm_4si_uns,
> +; __builtin_vsx_vperm_8hi, __builtin_altivec_vperm_8hi_uns
> +; are all covered by the overloaded vec_perm built-in which uses the
> +; __builtin_altivec_* built-in definitions.
>const vsc __builtin_vsx_vperm_16qi (vsc, vsc, vuc);
>  VPERM_16QI_X altivec_vperm_v16qi {}
>  


Re: [PATCH 03/11] rs6000, remove duplicated built-ins

2024-02-28 Thread Kewen.Lin
on 2024/2/21 01:56, Carl Love wrote:
> GCC maintainers:
> 
> There are a number of undocumented built-ins that are duplicates of other 
> documented built-ins.  This patch removes the duplicates so users will only 
> use the documented built-in.
> 
> The patch has been tested on Power 10 with no regressions.

Can you also test this on at least one BE machine?  The behaviors of some
built-ins may also depend on endianness.

> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> -
> 
> rs6000, remove duplicated built-ins
> 
> The following undocumented built-ins are same as existing documented
> overloaded builtins.
> 
>   const vf __builtin_vsx_xxmrghw (vf, vf);
> same as  vf __builtin_vec_mergeh (vf, vf);  (overloaded vec_mergeh)
> 
>   const vsi __builtin_vsx_xxmrghw_4si (vsi, vsi);
> same as vsi __builtin_vec_mergeh (vsi, vsi);   (overloaded vec_mergeh)
> 
>   const vf __builtin_vsx_xxmrglw (vf, vf);
> same as vf __builtin_vec_mergel (vf, vf);  (overloaded vec_mergel)
> 
>   const vsi __builtin_vsx_xxmrglw_4si (vsi, vsi);
> same as vsi __builtin_vec_mergel (vsi, vsi);   (overloaded vec_mergel)
> 

With these builtin definitions removed, the corresponding expanders
vsx_xxmrg{h,l}w_v4s{f,i} look useless; please have a check.  If so,
they should be removed together, and this part of the changes put into a
separate patch (mainly vec merge) ...
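
(For reference, the PVIPR replacements for the removed built-ins look like
this sketch, with illustrative function names:)

#include <altivec.h>

vector float
mergeh_f (vector float a, vector float b)
{
  return vec_mergeh (a, b);	/* instead of __builtin_vsx_xxmrghw */
}

vector float
mergel_f (vector float a, vector float b)
{
  return vec_mergel (a, b);	/* instead of __builtin_vsx_xxmrglw */
}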


>   const vsc __builtin_vsx_xxsel_16qi (vsc, vsc, vsc);
> same as vsc __builtin_vec_sel (vsc, vsc, vuc);  (overloaded vec_sel)
> 
>   const vuc __builtin_vsx_xxsel_16qi_uns (vuc, vuc, vuc);
> same as vuc __builtin_vec_sel (vuc, vuc, vuc);  (overloaded vec_sel)
> 
>   const vd __builtin_vsx_xxsel_2df (vd, vd, vd);
> same as  vd __builtin_vec_sel (vd, vd, vull);   (overloaded vec_sel)
> 
>   const vsll __builtin_vsx_xxsel_2di (vsll, vsll, vsll);
> same as vsll __builtin_vec_sel (vsll, vsll, vsll);  (overloaded vec_sel)
> 
>   const vull __builtin_vsx_xxsel_2di_uns (vull, vull, vull);
> same as vull __builtin_vec_sel (vull, vull, vsll);  (overloaded vec_sel)
> 
>   const vf __builtin_vsx_xxsel_4sf (vf, vf, vf);
> same as vf __builtin_vec_sel (vf, vf, vsi)  (overloaded vec_sel)
> 
>   const vsi __builtin_vsx_xxsel_4si (vsi, vsi, vsi);
> same as vsi __builtin_vec_sel (vsi, vsi, vbi);  (overloaded vec_sel)
> 
>   const vui __builtin_vsx_xxsel_4si_uns (vui, vui, vui);
> same as vui __builtin_vec_sel (vui, vui, vui);  (overloaded vec_sel)
> 
>   const vss __builtin_vsx_xxsel_8hi (vss, vss, vss);
> same as vss __builtin_vec_sel (vss, vss, vbs);  (overloaded vec_sel)
> 
>   const vus __builtin_vsx_xxsel_8hi_uns (vus, vus, vus);
> same as vus __builtin_vec_sel (vus, vus, vus);  (overloaded vec_sel)

... and adopt another one for this part (vec_sel).

> 
> This patch removed the duplicate built-in definitions so only the
> documented built-ins will be available for use.  The case statements in
> rs6000_gimple_fold_builtin that are no longer needed are also removed.
> 
> gcc/ChangeLog:
>   * config/rs6000/rs6000-builtins.def (__builtin_vsx_xxmrghw,
>   __builtin_vsx_xxmrghw_4si, __builtin_vsx_xxmrglw,
>   __builtin_vsx_xxmrglw_4si, __builtin_vsx_xxsel_16qi,
>   __builtin_vsx_xxsel_16qi_uns, __builtin_vsx_xxsel_2df,
>   __builtin_vsx_xxsel_2di, __builtin_vsx_xxsel_2di_uns,
>   __builtin_vsx_xxsel_4sf, __builtin_vsx_xxsel_4si,
>   __builtin_vsx_xxsel_4si_uns, __builtin_vsx_xxsel_8hi,
>   __builtin_vsx_xxsel_8hi_uns): Removed built-in definition.

Nit: s/Removed/Remove/

>   * config/rs6000/rs6000-builtin.cc (rs6000_gimple_fold_builtin):
>   remove case entries RS6000_BIF_XXMRGLW_4SI,
>   RS6000_BIF_XXMRGLW_4SF, RS6000_BIF_XXMRGHW_4SI,
>   RS6000_BIF_XXMRGHW_4SF.

Nit: s/remove/Remove/

> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/powerpc/vsx-builtin-3.c (__builtin_vsx_xxsel_4si,
>   __builtin_vsx_xxsel_8hi, __builtin_vsx_xxsel_16qi,
>   __builtin_vsx_xxsel_4sf, __builtin_vsx_xxsel_2df): Remove test
>   cases for removed built-ins.
> ---
>  gcc/config/rs6000/rs6000-builtin.cc   |  4 --
>  gcc/config/rs6000/rs6000-builtins.def | 42 ---
>  .../gcc.target/powerpc/vsx-builtin-3.c|  6 ---
>  3 files changed, 52 deletions(-)
> 
> diff --git a/gcc/config/rs6000/rs6000-builtin.cc 
> b/gcc/config/rs6000/rs6000-builtin.cc
> index 6698274031b..e436cbe4935 100644
> --- a/gcc/config/rs6000/rs6000-builtin.cc
> +++ b/gcc/config/rs6000/rs6000-builtin.cc
> @@ -2110,20 +2110,16 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
>  /* vec_mergel (integrals).  */
>  case RS6000_BIF_VMRGLH:
>  case RS6000_BIF_VMRGLW:
> -case RS6000_BIF_XXMRGLW_4SI:
>  case RS6000_BIF_VMRGLB:
>  case RS6000_BIF_VEC_MERGEL_V2DI:
> -case RS6000_BIF_XXMRGLW_4SF:
>  case RS6000_BIF_VEC_MERGEL_V2DF:
>fold_mergehl_helper (gsi, stmt, 1);
>   

Re: [PATCH 05/11] rs6000, __builtin_vsx_xvneg[sp,dp] add documentation, and test cases

2024-02-28 Thread Kewen.Lin
Hi,

on 2024/2/21 01:56, Carl Love wrote:
> GCC maintainers:
> 
> The patch adds documentation and test cases for the __builtin_vsx_xvnegsp, 
> __builtin_vsx_xvnegdp built-ins.
> 
> The patch has been tested on Power 10 with no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> rs6000, __builtin_vsx_xvneg[sp,dp] add documentation and test cases
> 
> Add documentation to the extend.texi file for the two built-ins
> __builtin_vsx_xvnegsp, __builtin_vsx_xvnegdp.

I think these two are useless: the functionality is already covered by vec_neg
in PVIPR, so instead we should get rid of these definitions (bif def table,
test cases if there are any).
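
The PVIPR replacement is simply vec_neg, as in this sketch (names are
illustrative):

#include <altivec.h>

vector float
neg_f (vector float v)
{
  return vec_neg (v);	/* instead of __builtin_vsx_xvnegsp */
}

vector double
neg_d (vector double v)
{
  return vec_neg (v);	/* instead of __builtin_vsx_xvnegdp */
}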

BR,
Kewen

> 
> Add test cases for the two built-ins.
> 
> gcc/ChangeLog:
>   * doc/extend.texi (__builtin_vsx_xvnegsp, __builtin_vsx_xvnegdp):
>   Add documentation.
> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/powerpc/vsx-builtin-runnable-2.c: New test case.
> ---
>  gcc/doc/extend.texi   | 13 +
>  .../powerpc/vsx-builtin-runnable-2.c  | 51 +++
>  2 files changed, 64 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-2.c
> 
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index 583b1d890bf..83eed9e334b 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -21495,6 +21495,19 @@ The @code{__builtin_vsx_xvcvuxwdp} converts single 
> precision unsigned integer
>  value to a double precision floating point value.  Input element at index 2*i
>  is stored in the destination element i.
>  
> +@smallexample
> +vector float __builtin_vsx_xvnegsp (vector float);
> +vector double __builtin_vsx_xvnegdp (vector double);
> +@end smallexample
> +
> +The  @code{__builtin_vsx_xvnegsp} and @code{__builtin_vsx_xvnegdp} negate 
> each
> +vector element.
> +
> +@smallexample
> +vector __int128  __builtin_vsx_xxpermdi_1ti (vector __int128, vector 
> __int128,
> +const int);
> +
> +@end smallexample
>  @node Basic PowerPC Built-in Functions Available on ISA 2.07
>  @subsubsection Basic PowerPC Built-in Functions Available on ISA 2.07
>  
> diff --git a/gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-2.c 
> b/gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-2.c
> new file mode 100644
> index 000..7906a8e01d7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-2.c
> @@ -0,0 +1,51 @@
> +/* { dg-do run { target { lp64 } } } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mdejagnu-cpu=power7" } */
> +
> +#define DEBUG 0
> +
> +#if DEBUG
> +#include 
> +#include 
> +#endif
> +
> +void abort (void);
> +
> +int main ()
> +{
> +  int i;
> +  vector double vd_arg1, vd_result, vd_expected_result;
> +  vector float vf_arg1, vf_result, vf_expected_result;
> +
> +  /* VSX Vector Negate Single-Precision.  */
> +
> +  vf_arg1 = (vector float) {-1.0, 12345.98, -2.1234, 238.9};
> +  vf_result = __builtin_vsx_xvnegsp (vf_arg1);
> +  vf_expected_result = (vector float) {1.0, -12345.98, 2.1234, -238.9};
> +
> +  for (i = 0; i < 4; i++)
> +if (vf_result[i] != vf_expected_result[i])
> +#if DEBUG
> +  printf("ERROR, __builtin_vsx_xvnegsp: vf_result[%d] = %f, 
> vf_expected_result[%d] = %f\n",
> +  i, vf_result[i], i, vf_expected_result[i]);
> +#else
> +  abort();
> +#endif
> +
> +  /* VSX Vector Negate Double-Precision.  */
> +
> +  vd_arg1 = (vector double) {12345.98, -2.1234};
> +  vd_result = __builtin_vsx_xvnegdp (vd_arg1);
> +  vd_expected_result = (vector double) {-12345.98, 2.1234};
> +
> +  for (i = 0; i < 2; i++)
> +if (vd_result[i] != vd_expected_result[i])
> +#if DEBUG
> +  printf("ERROR, __builtin_vsx_xvnegdp: vd_result[%d] = %f, 
> vd_expected_result[%d] = %f\n",
> +  i, vd_result[i], i, vd_expected_result[i]);
> +#else
> +  abort();
> +#endif
> +
> +  return 0;
> +}



Re: [PATCH 06/11] rs6000, __builtin_vsx_xxpermdi_1ti add documentation, and test case

2024-02-28 Thread Kewen.Lin
Hi Carl,

on 2024/2/21 01:57, Carl Love wrote:
> GCC maintainers:
> 
> The patch adds documentation and test case for the __builtin_vsx_xxpermdi_1ti 
> built-in.
> 
> The patch has been tested on Power 10 with no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> 
> rs6000, __builtin_vsx_xxpermdi_1ti add documentation and test case
> 
> Add documentation to the extend.texi file for the
> __builtin_vsx_xxpermdi_1ti built-in.

I think this one should be part of vec_xxpermdi (overload.def): we can
extend vec_xxpermdi with one more instance of type vsq, and also update the
documentation on vec_xxpermdi for this newly introduced instance.
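
Users would then go through the overload just like the existing vd instance
below (a sketch with an illustrative name; the vsq instance itself is
hypothetical until it's added):

#include <altivec.h>

vector double
permdi_d (vector double a, vector double b)
{
  /* The third argument must be a literal selector.  */
  return vec_xxpermdi (a, b, 2);
}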

BR,
Kewen

> 
> Add test cases for the __builtin_vsx_xxpermdi_1ti built-in.
> 
> gcc/ChangeLog:
>   * doc/extend.texi (__builtin_vsx_xxpermdi_1ti): Add documentation.
> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/powerpc/vsx-builtin-runnable-3.c: New test case.
> ---
>  gcc/doc/extend.texi   |  7 +++
>  .../powerpc/vsx-builtin-runnable-3.c  | 48 +++
>  2 files changed, 55 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-3.c
> 
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index 83eed9e334b..22f67ebab31 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -21508,6 +21508,13 @@ vector __int128  __builtin_vsx_xxpermdi_1ti (vector 
> __int128, vector __int128,
>  const int);
>  
>  @end smallexample
> +
> +For @code{__builtin_vsx_xxpermdi_1ti}, let srcA[127:0] be the 128-bit first
> +argument and srcB[127:0] be the 128-bit second argument.  Let sel[1:0] be the
> +least significant bits of the const int argument (third input argument).  The
> +result bits [127:64] are srcB[127:64] if sel[1] = 0, srcB[63:0] otherwise.  The
> +result bits [63:0] are srcA[127:64] if sel[0] = 0, srcA[63:0] otherwise.
> +
>  @node Basic PowerPC Built-in Functions Available on ISA 2.07
>  @subsubsection Basic PowerPC Built-in Functions Available on ISA 2.07
>  
> diff --git a/gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-3.c 
> b/gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-3.c
> new file mode 100644
> index 000..ba287597cec
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-3.c
> @@ -0,0 +1,48 @@
> +/* { dg-do run { target { lp64 } } } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mdejagnu-cpu=power7" } */
> +
> +#include 
> +
> +#define DEBUG 0
> +
> +#if DEBUG
> +#include 
> +#include 
> +#endif
> +
> +void abort (void);
> +
> +int main ()
> +{
> +  int i;
> +
> +  vector signed __int128 vsq_arg1, vsq_arg2, vsq_result, vsq_expected_result;
> +
> +  vsq_arg1[0] = (__int128) 0x;
> +  vsq_arg1[0] = vsq_arg1[0] << 64 | (__int128) 0x;
> +  vsq_arg2[0] = (__int128) 0x1100110011001100;
> +  vsq_arg2[0] = (vsq_arg2[0]  << 64) | (__int128) 0x;
> +
> +  vsq_expected_result[0] = (__int128) 0x;
> +  vsq_expected_result[0] = (vsq_expected_result[0] << 64)
> +| (__int128) 0x;
> +
> +  vsq_result = __builtin_vsx_xxpermdi_1ti (vsq_arg1, vsq_arg2, 2);
> +
> +  if (vsq_result[0] != vsq_expected_result[0])
> +{
> +#if DEBUG
> +   printf("ERROR, __builtin_vsx_xxpermdi_1ti: vsq_result = 0x%016llx 
> %016llx\n",
> +   (unsigned long long) (vsq_result[0] >> 64),
> +   (unsigned long long) vsq_result[0]);
> +   printf(" vsq_expected_resultd = 0x%016llx 
> %016llx\n",
> +   (unsigned long long)(vsq_expected_result[0] >> 64),
> +   (unsigned long long) vsq_expected_result[0]);
> +#else
> +  abort();
> +#endif
> + }
> +
> +  return 0;
> +}


Re: [PATCH 07/11] rs6000, __builtin_vsx_xvcmpeq[sp, dp, sp_p] add, documentation and test case

2024-02-28 Thread Kewen.Lin
Hi Carl,

on 2024/2/21 01:57, Carl Love wrote:
> 
>  GCC maintainers:
> 
> The patch adds documentation and test case for the  __builtin_vsx_xvcmpeq[sp, 
> dp, sp_p] built-ins.
> 
> The patch has been tested on Power 10 with no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> 
> rs6000, __builtin_vsx_xvcmpeq[sp, dp, sp_p] add documentation and test case
> 
> Add a test case for the __builtin_vsx_xvcmpeqsp_p built-in.
> 
> Add documentation for the __builtin_vsx_xvcmpeqsp_p,
> __builtin_vsx_xvcmpeqdp, and __builtin_vsx_xvcmpeqsp builtins.

1) for __builtin_vsx_xvcmpeqsp_p, its functionality is already covered
by __builtin_altivec_vcmpeqfp_p, which is an instance of __builtin_vec_vcmpeq_p,
so it's useless and removable.

2) for __builtin_vsx_xvcmpeqdp, it's an instance of the overloaded PVIPR
function vec_cmpeq; it's not expected to be used directly, so we don't need to
document it.

3) for __builtin_vsx_xvcmpeqsp, it's a duplicate of the existing vec_cmpeq
instance __builtin_altivec_vcmpeqfp, so it's useless and removable.
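
For reference, the PVIPR forms covering all three look like this sketch
(names are illustrative):

#include <altivec.h>

vector bool int
eq_f (vector float a, vector float b)
{
  return vec_cmpeq (a, b);	/* covers __builtin_vsx_xvcmpeqsp */
}

int
all_eq_f (vector float a, vector float b)
{
  return vec_all_eq (a, b);	/* predicate form, covers the *_p variant */
}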

BR,
Kewen

> 
> gcc/ChangeLog:
>   * doc/extend.texi (__builtin_vsx_xvcmpeqsp_p,
>   __builtin_vsx_xvcmpeqdp, __builtin_vsx_xvcmpeqsp): Add
>   documentation.
> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/powerpc/vsx-builtin-runnable-4.c: New test case.
> ---
>  gcc/doc/extend.texi   |  23 +++
>  .../powerpc/vsx-builtin-runnable-4.c  | 135 ++
>  2 files changed, 158 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-4.c
> 
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index 22f67ebab31..87fd30bfa9e 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -22700,6 +22700,18 @@ vectors of their defined type.  The corresponding 
> result element is set to
>  all ones if the two argument elements are less than or equal and all zeros
>  otherwise.
>  
> +@smallexample
> +const vf __builtin_vsx_xvcmpeqsp (vf, vf);
> +const vd __builtin_vsx_xvcmpeqdp (vd, vd);
> +@end smallexample
> +
> +The built-ins @code{__builtin_vsx_xvcmpeqsp} and
> +@code{__builtin_vsx_xvcmpeqdp} compare two floating point vectors and return
> +a vector.  If the corresponding elements are equal then the corresponding
> +vector element of the result is set to all ones; otherwise it is set to
> +all zeros.
> +
> +
>  @node PowerPC AltiVec Built-in Functions Available on ISA 2.07
>  @subsubsection PowerPC AltiVec Built-in Functions Available on ISA 2.07
>  
> @@ -23989,6 +24001,17 @@ is larger than 128 bits, the result is undefined.
>  The result is the modulo result of dividing the first input  by the second
>  input.
>  
> +@smallexample
> +const signed int __builtin_vsx_xvcmpeqdp_p (signed int, vd, vd);
> +@end smallexample
> +
> +The first argument of the built-in @code{__builtin_vsx_xvcmpeqdp_p} is an
> +integer in the range of 0 to 1.  The second and third arguments are floating
> +point vectors to be compared.  The result is 1 if the first argument is a 1
> +and one or more of the corresponding vector elements are equal.  The result 
> is
> +1 if the first argument is 0 and all of the corresponding vector elements are
> +not equal.  The result is zero otherwise.
> +
>  The following builtins perform 128-bit vector comparisons.  The
>  @code{vec_all_xx}, @code{vec_any_xx}, and @code{vec_cmpxx}, where @code{xx} 
> is
>  one of the operations @code{eq, ne, gt, lt, ge, le} perform pairwise
> diff --git a/gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-4.c 
> b/gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-4.c
> new file mode 100644
> index 000..8ac07c7c807
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/vsx-builtin-runnable-4.c
> @@ -0,0 +1,135 @@
> +/* { dg-do run { target { power10_hw } } } */
> +/* { dg-do link { target { ! power10_hw } } } */
> +/* { dg-options "-mdejagnu-cpu=power10 -O2 -save-temps" } */
> +/* { dg-require-effective-target power10_ok } */
> +
> +#define DEBUG 0
> +
> +#if DEBUG
> +#include 
> +#include 
> +#endif
> +
> +void abort (void);
> +
> +int main ()
> +{
> +  int i;
> +  int result;
> +  vector float vf_arg1, vf_arg2;
> +  vector double d_arg1, d_arg2;
> +
> +  /* Compare vectors with one equal element, check
> + for all elements unequal, i.e. first arg is 1.  */
> +  vf_arg1 = (vector float) {1.0, 2.0, 3.0, 4.0};
> +  vf_arg2 = (vector float) {1.0, 3.0, 2.0, 8.0};
> +  result = __builtin_vsx_xvcmpeqsp_p (1, vf_arg1, vf_arg2);
> +
> +#if DEBUG
> +  printf("result = 0x%x\n", (unsigned int) result);
> +#endif
> +
> +  if (result != 1)
> +for (i = 0; i < 4; i++)
> +#if DEBUG
> +  printf("ERROR, __builtin_vsx_xvcmpeqsp_p 1: arg 1 = 1, varg3[%d] = %f, 
> varg3[%d] = %f\n",
> +  i, vf_arg1[i], i, vf_arg2[i]);
> +#else
> +  abort();
> +#endif
> +  /* Compare vectors with one equal element, c

Re: [PATCH 09/11] rs6000, add test cases for the vec_cmpne built-ins

2024-02-28 Thread Kewen.Lin
Hi,

on 2024/2/21 01:57, Carl Love wrote:
> GCC maintainers:
> 
> The patch adds test cases for the vec_cmpne of built-ins.
> 
> The patch has been tested on Power 10 with no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> rs6000, add test cases for the vec_cmpne built-ins

The subject and this subject line are saying "vec_cmpne" ...

> 
> Add test cases for the signed int, unsigned it, signed short, unsigned
> short, signed char and unsigned char built-ins.
> 
> Note, the built-ins are documented in the Power Vector Instrinsic
> Programing reference manual.
> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/powerpc/vec-cmple.c: New test case.
>   * gcc.target/powerpc/vec-cmple.h: New test case include file.


... But I think you meant "vec_cmple".

> ---
>  gcc/testsuite/gcc.target/powerpc/vec-cmple.c | 35 
>  gcc/testsuite/gcc.target/powerpc/vec-cmple.h | 84 
>  2 files changed, 119 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vec-cmple.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vec-cmple.h
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/vec-cmple.c 
> b/gcc/testsuite/gcc.target/powerpc/vec-cmple.c
> new file mode 100644
> index 000..766a1c770e2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/vec-cmple.c
> @@ -0,0 +1,35 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target powerpc_altivec_ok } */

Should be "vmx_hw" for run test.

> +/* { dg-options "-maltivec -O2" } */
> +
> +/* Test that the vec_cmpne builtin generates the expected Altivec
> +   instructions.  */

It seems this file was copied from vec-cmpne.c?  As we have 
vec-cmpne-runnable.c, maybe we can rename it to vec-cmple-runnable.c.

And previously, since we had both vec-cmpne-runnable.c and vec-cmpne.c
using vec-cmpne.h, a header was introduced.  If you just want to
add one runnable test case, maybe just inline vec-cmple.h since
it's not used by others at all.

BR,
Kewen

> +
> +#include "vec-cmple.h"
> +
> +int main ()
> +{
> +  /* Note macro expansions for "signed long long int" and
> + "unsigned long long int" do not work for the vec_vsx_ld builtin.  */
> +  define_test_functions (int, signed int, signed int, si);
> +  define_test_functions (int, unsigned int, unsigned int, ui);
> +  define_test_functions (short, signed short, signed short, ss);
> +  define_test_functions (short, unsigned short, unsigned short, us);
> +  define_test_functions (char, signed char, signed char, sc);
> +  define_test_functions (char, unsigned char, unsigned char, uc);
> +
> +  define_init_verify_functions (int, signed int, signed int, si);
> +  define_init_verify_functions (int, unsigned int, unsigned int, ui);
> +  define_init_verify_functions (short, signed short, signed short, ss);
> +  define_init_verify_functions (short, unsigned short, unsigned short, us);
> +  define_init_verify_functions (char, signed char, signed char, sc);
> +  define_init_verify_functions (char, unsigned char, unsigned char, uc);
> +
> +  execute_test_functions (int, signed int, signed int, si);
> +  execute_test_functions (int, unsigned int, unsigned int, ui);
> +  execute_test_functions (short, signed short, signed short, ss);
> +  execute_test_functions (short, unsigned short, unsigned short, us);
> +  execute_test_functions (char, signed char, signed char, sc);
> +  execute_test_functions (char, unsigned char, unsigned char, uc);
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/vec-cmple.h 
> b/gcc/testsuite/gcc.target/powerpc/vec-cmple.h
> new file mode 100644
> index 000..4126706b99a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/vec-cmple.h
> @@ -0,0 +1,84 @@
> +#include "altivec.h"
> +
> +#define N 4096
> +
> +#include 
> +void abort ();
> +
> +#define PRAGMA(X) _Pragma (#X)
> +#define UNROLL0 PRAGMA (GCC unroll 0)
> +
> +#define define_test_functions(VBTYPE, RTYPE, STYPE, NAME)\
> +\
> +RTYPE result_le_##NAME[N] __attribute__((aligned(16))); \
> +STYPE operand1_##NAME[N] __attribute__((aligned(16))); \
> +STYPE operand2_##NAME[N] __attribute__((aligned(16))); \
> +RTYPE expected_##NAME[N] __attribute__((aligned(16))); \
> +\
> +__attribute__((noinline)) void vector_tests_##NAME () \
> +{ \
> +  vector STYPE v1_##NAME, v2_##NAME; \
> +  vector bool VBTYPE tmp_##NAME; \
> +  int i; \
> +  UNROLL0 \
> +  for (i = 0; i < N; i+=16/sizeof (STYPE))   \
> +{ \
> +  /* result_le = operand1!=operand2.  */ \
> +  v1_##NAME = vec_vsx_ld (0, (const vector STYPE*)&operand1_##NAME[i]); \
> +  v2_##NAME = vec_vsx_ld (0, (const vector STYPE*)&operand2_##NAME[i]); \
> +\
> +  tmp_##NAME = vec_cmple (v1_##NAME, v2_##NAME); \
> +  vec_vsx_st (tmp_##NAME, 0, &result_le_##NAME[i]); \
> +} \
> +}
> +
> +#define define_init_verify_functions(VBTYPE, RTYPE, STYPE, NAME) \
> +__attribute__((no

Re: PATCH 11/11] rs6000, make test vec-cmpne.c a runnable test

2024-02-28 Thread Kewen.Lin
Hi,

on 2024/2/21 01:58, Carl Love wrote:
>  GCC maintainers:
> 
> The patch changes vec-cmpne.c from a compile-only test to a runnable
> test.  The macros to create the functions needed to test the built-ins and
> verify the results are all there in the include file.  The .c file just
> needed to have the macro definitions inserted and the header changed from
> compile to run.  The test can now do functional verification of the results
> in addition to verifying the expected instructions are generated.
> 
> The patch has been tested on Power 10 with no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> rs6000, make test vec-cmpne.c a runnable test
> 
> The macros in vec-cmpne.h define test functions.  They also set up
> test value functions, verification functions and execute test functions.
> The test is set up as a compile-only test so none of the verification and
> execute functions are being used.

But there is a test gcc/testsuite/gcc.target/powerpc/vec-cmpne-runnable.c
which aims to do the runtime verification.

BR,
Kewen

> 
> The patch adds the macro definitions to create the initialization,
> verify and execute functions to a main program so the test can not only
> verify the correct instructions are generated but also run the
> tests and verify the results.  The test is then changed from a compile
> to a run test.
> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/powerpc/vec-cmple.c (main): Add main function with
>   macro calls to define the test functions, create the verify
>   functions and execute functions.
>   Update scan-assembler-times (vcmpequ): Updated count to include
>   instructions used to generate expected test results.
>   * gcc.target/powerpc/vec-cmple.h (vector_tests_##NAME): Remove
>   line continuation after closing bracket.  Remove extra blank line.
> ---
>  gcc/testsuite/gcc.target/powerpc/vec-cmpne.c | 41 +++-
>  gcc/testsuite/gcc.target/powerpc/vec-cmpne.h |  3 +-
>  2 files changed, 32 insertions(+), 12 deletions(-)
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/vec-cmpne.c 
> b/gcc/testsuite/gcc.target/powerpc/vec-cmpne.c
> index b57e0ac8638..2c369976a44 100644
> --- a/gcc/testsuite/gcc.target/powerpc/vec-cmpne.c
> +++ b/gcc/testsuite/gcc.target/powerpc/vec-cmpne.c
> @@ -1,20 +1,41 @@
> -/* { dg-do compile } */
> +/* { dg-do run } */
>  /* { dg-require-effective-target powerpc_altivec_ok } */
> -/* { dg-options "-maltivec -O2" } */
> +/* { dg-options "-maltivec -O2 -save-temps" } */
>  
>  /* Test that the vec_cmpne builtin generates the expected Altivec
> instructions.  */
>  
>  #include "vec-cmpne.h"
>  
> -define_test_functions (int, signed int, signed int, si);
> -define_test_functions (int, unsigned int, unsigned int, ui);
> -define_test_functions (short, signed short, signed short, ss);
> -define_test_functions (short, unsigned short, unsigned short, us);
> -define_test_functions (char, signed char, signed char, sc);
> -define_test_functions (char, unsigned char, unsigned char, uc);
> -define_test_functions (int, signed int, float, ff);
> +int main ()
> +{
> +  define_test_functions (int, signed int, signed int, si);
> +  define_test_functions (int, unsigned int, unsigned int, ui);
> +  define_test_functions (short, signed short, signed short, ss);
> +  define_test_functions (short, unsigned short, unsigned short, us);
> +  define_test_functions (char, signed char, signed char, sc);
> +  define_test_functions (char, unsigned char, unsigned char, uc);
> +  define_test_functions (int, signed int, float, ff);
> +
> +  define_init_verify_functions (int, signed int, signed int, si);
> +  define_init_verify_functions (int, unsigned int, unsigned int, ui);
> +  define_init_verify_functions (short, signed short, signed short, ss);
> +  define_init_verify_functions (short, unsigned short, unsigned short, us);
> +  define_init_verify_functions (char, signed char, signed char, sc);
> +  define_init_verify_functions (char, unsigned char, unsigned char, uc);
> +  define_init_verify_functions (int, signed int, float, ff);
> +
> +  execute_test_functions (int, signed int, signed int, si);
> +  execute_test_functions (int, unsigned int, unsigned int, ui);
> +  execute_test_functions (short, signed short, signed short, ss);
> +  execute_test_functions (short, unsigned short, unsigned short, us);
> +  execute_test_functions (char, signed char, signed char, sc);
> +  execute_test_functions (char, unsigned char, unsigned char, uc);
> +  execute_test_functions (int, signed int, float, ff);
> +
> +  return 0;
> +}
>  
>  /* { dg-final { scan-assembler-times {\mvcmpequb\M}  2 } } */
>  /* { dg-final { scan-assembler-times {\mvcmpequh\M}  2 } } */
> -/* { dg-final { scan-assembler-times {\mvcmpequw\M}  2 } } */
> +/* { dg-final { scan-assembler-times {\mvcmpequw\M}  32 } } */
> diff --git a/gcc/test

Re: [PATCH 08/11] rs6000, add tests and documentation for various, built-ins

2024-02-28 Thread Kewen.Lin
Hi,

on 2024/2/21 01:57, Carl Love wrote:
>  
>  GCC maintainers:
> 
> The patch adds documentation a number of built-ins.
> 
> The patch has been tested on Power 10 with no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
>  rs6000, add tests and documentation for various built-ins
> 
> This patch adds a test case and documentation in extend.texi for the
> following built-ins:
> 
> __builtin_altivec_fix_sfsi
> __builtin_altivec_fixuns_sfsi
> __builtin_altivec_float_sisf
> __builtin_altivec_uns_float_sisf

I think these are covered by vec_{unsigned,signed,float}, could you
have a check?

> __builtin_altivec_vrsqrtfp

Similar to how __builtin_altivec_vrsqrtefp is covered by vec_rsqrte,
this is already covered by vec_rsqrt, which has the vf instance
__builtin_vsx_xvrsqrtsp, so this one is redundant and removable.
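
For reference, a minimal sketch of the existing coverage (untested,
assumes <altivec.h> and -mvsx):

  #include <altivec.h>

  vector float
  rsqrt (vector float x)
  {
    return vec_rsqrt (x);  /* Reciprocal square root, vf instance.  */
  }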


> __builtin_altivec_mask_for_load

This one is for internal use; I don't think we want to document it in
the user manual.

> __builtin_altivec_vsel_1ti
> __builtin_altivec_vsel_1ti_uns

I think we can extend the existing vec_sel to cover vsq and vuq, and
update the documentation accordingly; a sketch of the usage I have in
mind follows.
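
This is hypothetical and untested, since the vsq/vuq vec_sel instances
do not exist yet:

  #include <altivec.h>

  vector signed __int128
  sel_1ti (vector signed __int128 a, vector signed __int128 b,
           vector unsigned __int128 c)
  {
    return vec_sel (a, b, c);  /* Bitwise select over all 128 bits.  */
  }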

> __builtin_vec_init_v16qi
> __builtin_vec_init_v4sf
> __builtin_vec_init_v4si
> __builtin_vec_init_v8hi

There are more vec_init variants, __builtin_vec_init_{v2df,v2di,v1ti};
any reason not to include them here? ...

> __builtin_vec_set_v16qi
> __builtin_vec_set_v4sf
> __builtin_vec_set_v4si
> __builtin_vec_set_v8hi

... and some similar variants for this one?

It seems that users can just use something like:

  vector ... = {x, y} ...

for the vector initialization and something like:

  vector ... z;
  z[0] = ...;
  z[i] = ...;

for the vector set; a compilable sketch follows.  Can you check
whether there are any differences between the above style and the
built-ins (on both BE and LE), and what the historical reasons for
adding them were?
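
As a concrete (untested) sketch of the native syntax, using the
generic vector extensions:

  #include <altivec.h>

  vector signed int
  init_v4si (int a, int b, int c, int d)
  {
    return (vector signed int) {a, b, c, d};  /* vec_init equivalent.  */
  }

  vector signed int
  set_v4si (vector signed int v, int val, int i)
  {
    v[i] = val;                               /* vec_set equivalent.  */
    return v;
  }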

If we really need them, I'd like to see us just have the corresponding
overloaded functions vec_init and vec_set instead of exposing the
instances with different suffixes.

BR,
Kewen

> 
> gcc/ChangeLog:
>   * doc/extend.texi (__builtin_altivec_fix_sfsi,
>   __builtin_altivec_fixuns_sfsi, __builtin_altivec_float_sisf,
>   __builtin_altivec_uns_float_sisf, __builtin_altivec_vrsqrtfp,
>   __builtin_altivec_mask_for_load, __builtin_altivec_vsel_1ti,
>   __builtin_altivec_vsel_1ti_uns, __builtin_vec_init_v16qi,
>   __builtin_vec_init_v4sf, __builtin_vec_init_v4si,
>   __builtin_vec_init_v8hi, __builtin_vec_set_v16qi,
>   __builtin_vec_set_v4sf, __builtin_vec_set_v4si,
>   __builtin_vec_set_v8hi): Add documentation.
> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/powerpc/altivec-38.c: New test case.
> ---
>  gcc/doc/extend.texi   |  98 
>  gcc/testsuite/gcc.target/powerpc/altivec-38.c | 503 ++
>  2 files changed, 601 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/altivec-38.c
> 
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index 87fd30bfa9e..89d0a1f77b0 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -22678,6 +22678,104 @@ if the VSX instruction set is available.  The @samp{vec_vsx_ld} and
>  @samp{LXVW4X}, @samp{STXVD2X}, and @samp{STXVW4X} instructions.
>  
>  
> +@smallexample
> +vector signed int __builtin_altivec_fix_sfsi (vector float);
> +vector signed int __builtin_altivec_fixuns_sfsi (vector float);
> +vector float __builtin_altivec_float_sisf (vector int);
> +vector float __builtin_altivec_uns_float_sisf (vector int);
> +vector float __builtin_altivec_vrsqrtfp (vector float);
> +@end smallexample
> +
> +The @code{__builtin_altivec_fix_sfsi} converts a vector of single precision
> +floating point values to a vector of signed integers with round to zero.
> +
> +The @code{__builtin_altivec_fixuns_sfsi} converts a vector of single precision
> +floating point values to a vector of unsigned integers with round to zero.  If
> +the rounded floating point value is less than 0, the result is 0 and VXCVI
> +is set to 1.
> +
> +The @code{__builtin_altivec_float_sisf} converts a vector of single precision
> +signed integers to a vector of floating point values using the rounding mode
> +specified by RN.
> +
> +The @code{__builtin_altivec_uns_float_sisf} converts a vector of single
> +precision unsigned integers to a vector of floating point values using the
> +rounding mode specified by RN.
> +
> +The @code{__builtin_altivec_vrsqrtfp} returns a vector of floating point
> +estimates of the reciprocal square root of each floating point source vector
> +element.
> +
> +@smallexample
> +vector signed char __builtin_altivec_mask_for_load (const void *);
> +@end smallexample
> +
> +The @code{__builtin_altivec_mask_for_load} returns a vector mask based on the
> +bottom four bits of the argument.  Let X be the 32-byte value:
> +0x00 || 0x01 || 0x02 || ... || 0x1D || 0x1E || 0x1F.
> +Bytes sh 
