from:"Hongtao Liu via Gcc\-patches"

Re: [PATCH V2] [vect]Enhance NARROW FLOAT_EXPR vectorization by truncating integer to lower precision.

2023-05-28 Thread Hongtao Liu via Gcc-patches

ping.

On Mon, May 8, 2023 at 9:59 AM liuhongt  wrote:
>
> > > @@ -4799,7 +4800,8 @@ vect_create_vectorized_demotion_stmts (vec_info 
> > > *vinfo, vec *vec_oprnds,
> > >stmt_vec_info stmt_info,
> > >vec &vec_dsts,
> > >gimple_stmt_iterator *gsi,
> > > -  slp_tree slp_node, enum tree_code 
> > > code)
> > > +  slp_tree slp_node, enum tree_code 
> > > code,
> > > +  bool last_stmt_p)
> >
> > Can you please document this new parameter?
> >
> Changed.
>
> >
> > I understand what you are doing, but somehow it looks a bit awkward?
> > Maybe we should split the NARROW case into NARROW_SRC and NARROW_DST?
> > The case of narrowing the source because we know its range isn't a
> > good fit for the
> > flow.
> Changed.
>
> Here's updated patch.
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> Similar like WIDEN FLOAT_EXPR, when direct_optab is not existed, try
> intermediate integer type whenever gimple ranger can tell it's safe.
>
> .i.e.
> When there's no direct optab for vector long long -> vector float, but
> the value range of integer can be represented as int, try vector int
> -> vector float if availble.
>
> gcc/ChangeLog:
>
> PR tree-optimization/108804
> * tree-vect-patterns.cc (vect_get_range_info): Remove static.
> * tree-vect-stmts.cc (vect_create_vectorized_demotion_stmts):
> Add new parameter narrow_src_p.
> (vectorizable_conversion): Enhance NARROW FLOAT_EXPR
> vectorization by truncating to lower precision.
> * tree-vectorizer.h (vect_get_range_info): New declare.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr108804.c: New test.
> ---
>  gcc/testsuite/gcc.target/i386/pr108804.c |  15 +++
>  gcc/tree-vect-patterns.cc|   2 +-
>  gcc/tree-vect-stmts.cc   | 135 +--
>  gcc/tree-vectorizer.h|   1 +
>  4 files changed, 121 insertions(+), 32 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr108804.c
>
> diff --git a/gcc/testsuite/gcc.target/i386/pr108804.c 
> b/gcc/testsuite/gcc.target/i386/pr108804.c
> new file mode 100644
> index 000..2a43c1e1848
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr108804.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx2 -Ofast -fdump-tree-vect-details" } */
> +/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 1 "vect" } } 
> */
> +
> +typedef unsigned long long uint64_t;
> +uint64_t d[512];
> +float f[1024];
> +
> +void foo() {
> +for (int i=0; i<512; ++i) {
> +uint64_t k = d[i];
> +f[i]=(k & 0x3F30);
> +}
> +}
> +
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index a49b0953977..dd546b488a4 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -61,7 +61,7 @@ along with GCC; see the file COPYING3.  If not see
>  /* Return true if we have a useful VR_RANGE range for VAR, storing it
> in *MIN_VALUE and *MAX_VALUE if so.  Note the range in the dump files.  */
>
> -static bool
> +bool
>  vect_get_range_info (tree var, wide_int *min_value, wide_int *max_value)
>  {
>value_range vr;
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 6b7dbfd4a23..3da89a8402d 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -51,6 +51,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "internal-fn.h"
>  #include "tree-vector-builder.h"
>  #include "vec-perm-indices.h"
> +#include "gimple-range.h"
>  #include "tree-ssa-loop-niter.h"
>  #include "gimple-fold.h"
>  #include "regs.h"
> @@ -4791,7 +4792,9 @@ vect_gen_widened_results_half (vec_info *vinfo, enum 
> tree_code code,
>
>  /* Create vectorized demotion statements for vector operands from VEC_OPRNDS.
> For multi-step conversions store the resulting vectors and call the 
> function
> -   recursively.  */
> +   recursively. When NARROW_SRC_P is true, there's still a conversion after
> +   narrowing, don't store the vectors in the SLP_NODE or in vector info of
> +   the scalar statement(or in STMT_VINFO_RELATED_STMT chain).  */
>
>  static void
>  vect_create_vectorized_demotion_stmts (vec_info *vinfo, vec 
> *vec_oprnds,
> @@ -4799,7 +4802,8 @@ vect_create_vectorized_demotion_stmts (vec_info *vinfo, 
> vec *vec_oprnds,
>stmt_vec_info stmt_info,
>vec &vec_dsts,
>gimple_stmt_iterator *gsi,
> -  slp_tree slp_node, enum tree_code code)
> +  slp_tree slp_node, enum tree_code code,
> +  bool narrow_src_p)
>  {
>uns

Re: [PATCH] Fold _mm{, 256, 512}_abs_{epi8, epi16, epi32, epi64} into gimple ABSU_EXPR + VCE.

2023-06-06 Thread Hongtao Liu via Gcc-patches

On Tue, Jun 6, 2023 at 12:49 PM Andrew Pinski  wrote:
>
> On Mon, Jun 5, 2023 at 9:34 PM liuhongt via Gcc-patches
>  wrote:
> >
> > r14-1145 fold the intrinsics into gimple ABS_EXPR which has UB for
> > TYPE_MIN, but PABSB will store unsigned result into dst. The patch
> > uses ABSU_EXPR + VCE instead of ABS_EXPR.
> >
> > Also don't fold _mm_abs_{pi8,pi16,pi32} w/o TARGET_64BIT since 64-bit
> > vector absm2 is guarded with TARGET_MMX_WITH_SSE.
> >
> > Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
> > Ok for trunk?
> >
> >
> > gcc/ChangeLog:
> >
> > PR target/110108
> > * config/i386/i386.cc (ix86_gimple_fold_builtin): Fold
> > _mm{,256,512}_abs_{epi8,epi16,epi32,epi64} into gimple
> > ABSU_EXPR + VCE, don't fold _mm_abs_{pi8,pi16,pi32} w/o
> > TARGET_64BIT.
> > * config/i386/i386-builtin.def: Replace CODE_FOR_nothing with
> > real codename for __builtin_ia32_pabs{b,w,d}.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/pr110108.c: New test.
> > ---
> >  gcc/config/i386/i386-builtin.def |  6 ++--
> >  gcc/config/i386/i386.cc  | 44 
> >  gcc/testsuite/gcc.target/i386/pr110108.c | 16 +
> >  3 files changed, 56 insertions(+), 10 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr110108.c
> >
> > diff --git a/gcc/config/i386/i386-builtin.def 
> > b/gcc/config/i386/i386-builtin.def
> > index 383b68a9bb8..7ba5b6a9d11 100644
> > --- a/gcc/config/i386/i386-builtin.def
> > +++ b/gcc/config/i386/i386-builtin.def
> > @@ -900,11 +900,11 @@ BDESC (OPTION_MASK_ISA_SSE3, 0, 
> > CODE_FOR_sse3_hsubv2df3, "__builtin_ia32_hsubpd"
> >
> >  /* SSSE3 */
> >  BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_nothing, 
> > "__builtin_ia32_pabsb128", IX86_BUILTIN_PABSB128, UNKNOWN, (int) 
> > V16QI_FTYPE_V16QI)
> > -BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_nothing, 
> > "__builtin_ia32_pabsb", IX86_BUILTIN_PABSB, UNKNOWN, (int) V8QI_FTYPE_V8QI)
> > +BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, 
> > CODE_FOR_ssse3_absv8qi2, "__builtin_ia32_pabsb", IX86_BUILTIN_PABSB, 
> > UNKNOWN, (int) V8QI_FTYPE_V8QI)
> >  BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_nothing, 
> > "__builtin_ia32_pabsw128", IX86_BUILTIN_PABSW128, UNKNOWN, (int) 
> > V8HI_FTYPE_V8HI)
> > -BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_nothing, 
> > "__builtin_ia32_pabsw", IX86_BUILTIN_PABSW, UNKNOWN, (int) V4HI_FTYPE_V4HI)
> > +BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, 
> > CODE_FOR_ssse3_absv4hi2, "__builtin_ia32_pabsw", IX86_BUILTIN_PABSW, 
> > UNKNOWN, (int) V4HI_FTYPE_V4HI)
> >  BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_nothing, 
> > "__builtin_ia32_pabsd128", IX86_BUILTIN_PABSD128, UNKNOWN, (int) 
> > V4SI_FTYPE_V4SI)
> > -BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_nothing, 
> > "__builtin_ia32_pabsd", IX86_BUILTIN_PABSD, UNKNOWN, (int) V2SI_FTYPE_V2SI)
> > +BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, 
> > CODE_FOR_ssse3_absv2si2, "__builtin_ia32_pabsd", IX86_BUILTIN_PABSD, 
> > UNKNOWN, (int) V2SI_FTYPE_V2SI)
> >
> >  BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_phaddwv8hi3, 
> > "__builtin_ia32_phaddw128", IX86_BUILTIN_PHADDW128, UNKNOWN, (int) 
> > V8HI_FTYPE_V8HI_V8HI)
> >  BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, 
> > CODE_FOR_ssse3_phaddwv4hi3, "__builtin_ia32_phaddw", IX86_BUILTIN_PHADDW, 
> > UNKNOWN, (int) V4HI_FTYPE_V4HI_V4HI)
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index d4ff56ee8dd..b09b3c79e99 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -18433,6 +18433,7 @@ bool
> >  ix86_gimple_fold_builtin (gimple_stmt_iterator *gsi)
> >  {
> >gimple *stmt = gsi_stmt (*gsi), *g;
> > +  gimple_seq stmts = NULL;
> >tree fndecl = gimple_call_fndecl (stmt);
> >gcc_checking_assert (fndecl && fndecl_built_in_p (fndecl, BUILT_IN_MD));
> >int n_args = gimple_call_num_args (stmt);
> > @@ -18555,7 +18556,6 @@ ix86_gimple_fold_builtin (gimple_stmt_iterator *gsi)
> > {
> >   loc = gimple_location (stmt);
> >   tree type = TREE_TYPE (arg2);
> > - gimple_seq stmts = NULL;
> >   if (VECTOR_FLOAT_TYPE_P (type))
> > {
> >   tree itype = GET_MODE_INNER (TYPE_MODE (type)) == E_SFmode
> > @@ -18610,7 +18610,6 @@ ix86_gimple_fold_builtin (gimple_stmt_iterator *gsi)
> >   tree zero_vec = build_zero_cst (type);
> >   tree minus_one_vec = build_minus_one_cst (type);
> >   tree cmp_type = truth_type_for (type);
> > - gimple_seq stmts = NULL;
> >   tree cmp = gimple_build (&stmts, tcode, cmp_type, arg0, arg1);
> >   gsi_insert_seq_before (gsi, stmts, GSI_SAME_STMT);
> >   g = gimple_build_assign (gimple_call_lhs (stmt),
> > @@ -18904,14 +18903,18 @@ ix86_gimple_fold_builtin (gimple_stmt_iterator 
> > *gsi)
> >break;
> >
> >

Re: [PATCH] Fold _mm{, 256, 512}_abs_{epi8, epi16, epi32, epi64} into gimple ABSU_EXPR + VCE.

2023-06-06 Thread Hongtao Liu via Gcc-patches

On Tue, Jun 6, 2023 at 5:11 PM Uros Bizjak  wrote:
>
> On Tue, Jun 6, 2023 at 6:33 AM liuhongt via Gcc-patches
>  wrote:
> >
> > r14-1145 fold the intrinsics into gimple ABS_EXPR which has UB for
> > TYPE_MIN, but PABSB will store unsigned result into dst. The patch
> > uses ABSU_EXPR + VCE instead of ABS_EXPR.
> >
> > Also don't fold _mm_abs_{pi8,pi16,pi32} w/o TARGET_64BIT since 64-bit
> > vector absm2 is guarded with TARGET_MMX_WITH_SSE.
>
>This should be !TARGET_MMX_WITH_SSE. TARGET_64BIT is not enough, see
>the definition of T_M_W_S in i386.h. OTOH, these builtins are
>available for TARGET_MMX, so I'm not sure if the above check is needed
>at all.
BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0,
CODE_FOR_ssse3_absv8qi2, "__builtin_ia32_pabsb", IX86_BUILTIN_PABSB,
UNKNOWN, (int) V8QI_FTYPE_V8QI)

ISA requirement(OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX) will be
checked by ix86_check_builtin_isa_match which is at the beginning of
ix86_gimple_fold_builtin.
Here, we're folding those builtin into gimple ABSU_EXPR, and
ABSU_EXPR will be lowered by vec_lower pass when backend
doesn't support corressponding absm2_optab, that's why i only check
TARGET_64BIT here.

> Please note that we are using builtins here, so we should not fold to
> absm2, but to ssse3_absm2, which is also available with TARGET_MMX.
Yes, that exactly why I checked TARGET_64BIT here, w/ TARGET_64BIT,
backend suppport absm2_optab which exactly matches ssse3_absm2.
w/o TARGET_64BIT, the builtin shouldn't folding into gimple ABSU_EXPR,
but let backend expanded to ssse3_absm2.

>
> Uros.
>
> >
> > Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
> > Ok for trunk?
> >
> >
> > gcc/ChangeLog:
> >
> > PR target/110108
> > * config/i386/i386.cc (ix86_gimple_fold_builtin): Fold
> > _mm{,256,512}_abs_{epi8,epi16,epi32,epi64} into gimple
> > ABSU_EXPR + VCE, don't fold _mm_abs_{pi8,pi16,pi32} w/o
> > TARGET_64BIT.
> > * config/i386/i386-builtin.def: Replace CODE_FOR_nothing with
> > real codename for __builtin_ia32_pabs{b,w,d}.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/pr110108.c: New test.
> > ---
> >  gcc/config/i386/i386-builtin.def |  6 ++--
> >  gcc/config/i386/i386.cc  | 44 
> >  gcc/testsuite/gcc.target/i386/pr110108.c | 16 +
> >  3 files changed, 56 insertions(+), 10 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr110108.c
> >
> > diff --git a/gcc/config/i386/i386-builtin.def 
> > b/gcc/config/i386/i386-builtin.def
> > index 383b68a9bb8..7ba5b6a9d11 100644
> > --- a/gcc/config/i386/i386-builtin.def
> > +++ b/gcc/config/i386/i386-builtin.def
> > @@ -900,11 +900,11 @@ BDESC (OPTION_MASK_ISA_SSE3, 0, 
> > CODE_FOR_sse3_hsubv2df3, "__builtin_ia32_hsubpd"
> >
> >  /* SSSE3 */
> >  BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_nothing, 
> > "__builtin_ia32_pabsb128", IX86_BUILTIN_PABSB128, UNKNOWN, (int) 
> > V16QI_FTYPE_V16QI)
> > -BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_nothing, 
> > "__builtin_ia32_pabsb", IX86_BUILTIN_PABSB, UNKNOWN, (int) V8QI_FTYPE_V8QI)
> > +BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, 
> > CODE_FOR_ssse3_absv8qi2, "__builtin_ia32_pabsb", IX86_BUILTIN_PABSB, 
> > UNKNOWN, (int) V8QI_FTYPE_V8QI)
> >  BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_nothing, 
> > "__builtin_ia32_pabsw128", IX86_BUILTIN_PABSW128, UNKNOWN, (int) 
> > V8HI_FTYPE_V8HI)
> > -BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_nothing, 
> > "__builtin_ia32_pabsw", IX86_BUILTIN_PABSW, UNKNOWN, (int) V4HI_FTYPE_V4HI)
> > +BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, 
> > CODE_FOR_ssse3_absv4hi2, "__builtin_ia32_pabsw", IX86_BUILTIN_PABSW, 
> > UNKNOWN, (int) V4HI_FTYPE_V4HI)
> >  BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_nothing, 
> > "__builtin_ia32_pabsd128", IX86_BUILTIN_PABSD128, UNKNOWN, (int) 
> > V4SI_FTYPE_V4SI)
> > -BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_nothing, 
> > "__builtin_ia32_pabsd", IX86_BUILTIN_PABSD, UNKNOWN, (int) V2SI_FTYPE_V2SI)
> > +BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, 
> > CODE_FOR_ssse3_absv2si2, "__builtin_ia32_pabsd", IX86_BUILTIN_PABSD, 
> > UNKNOWN, (int) V2SI_FTYPE_V2SI)
> >
> >  BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_phaddwv8hi3, 
> > "__builtin_ia32_phaddw128", IX86_BUILTIN_PHADDW128, UNKNOWN, (int) 
> > V8HI_FTYPE_V8HI_V8HI)
> >  BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, 
> > CODE_FOR_ssse3_phaddwv4hi3, "__builtin_ia32_phaddw", IX86_BUILTIN_PHADDW, 
> > UNKNOWN, (int) V4HI_FTYPE_V4HI_V4HI)
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index d4ff56ee8dd..b09b3c79e99 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -18433,6 +18433,7 @@ bool
> >  ix86_gimple_fold_builtin (gimple_stmt_iterator *gsi)
> >  {
> >gimple *stmt = gsi_stmt (*gsi), *g;
> > +  gimple_seq stmts = NULL;
> >tr

Re: [PATCH] Fold _mm{, 256, 512}_abs_{epi8, epi16, epi32, epi64} into gimple ABSU_EXPR + VCE.

2023-06-06 Thread Hongtao Liu via Gcc-patches

On Tue, Jun 6, 2023 at 10:36 PM Uros Bizjak  wrote:
>
> On Tue, Jun 6, 2023 at 1:42 PM Hongtao Liu  wrote:
> >
> > On Tue, Jun 6, 2023 at 5:11 PM Uros Bizjak  wrote:
> > >
> > > On Tue, Jun 6, 2023 at 6:33 AM liuhongt via Gcc-patches
> > >  wrote:
> > > >
> > > > r14-1145 fold the intrinsics into gimple ABS_EXPR which has UB for
> > > > TYPE_MIN, but PABSB will store unsigned result into dst. The patch
> > > > uses ABSU_EXPR + VCE instead of ABS_EXPR.
> > > >
> > > > Also don't fold _mm_abs_{pi8,pi16,pi32} w/o TARGET_64BIT since 64-bit
> > > > vector absm2 is guarded with TARGET_MMX_WITH_SSE.
> > >
> > >This should be !TARGET_MMX_WITH_SSE. TARGET_64BIT is not enough, see
> > >the definition of T_M_W_S in i386.h. OTOH, these builtins are
> > >available for TARGET_MMX, so I'm not sure if the above check is needed
> > >at all.
> > BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0,
> > CODE_FOR_ssse3_absv8qi2, "__builtin_ia32_pabsb", IX86_BUILTIN_PABSB,
> > UNKNOWN, (int) V8QI_FTYPE_V8QI)
> >
> > ISA requirement(OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX) will be
> > checked by ix86_check_builtin_isa_match which is at the beginning of
> > ix86_gimple_fold_builtin.
> > Here, we're folding those builtin into gimple ABSU_EXPR, and
> > ABSU_EXPR will be lowered by vec_lower pass when backend
> > doesn't support corressponding absm2_optab, that's why i only check
> > TARGET_64BIT here.
> >
> > > Please note that we are using builtins here, so we should not fold to
> > > absm2, but to ssse3_absm2, which is also available with TARGET_MMX.
> > Yes, that exactly why I checked TARGET_64BIT here, w/ TARGET_64BIT,
> > backend suppport absm2_optab which exactly matches ssse3_absm2.
> > w/o TARGET_64BIT, the builtin shouldn't folding into gimple ABSU_EXPR,
> > but let backend expanded to ssse3_absm2.
>
> Thanks for the explanation, but for consistency, I'd recommend
> checking TARGET_MMX_WITH_SSE (= TARGET_64BIT && TARGET_SSE2) here. The
> macro is self-explanatory, while the usage of TARGET_64BIT is not that
> descriptive.
Sure.
>
> Uros.



-- 
BR,
Hongtao

Re: [PATCH] Fold _mm{, 256, 512}_abs_{epi8, epi16, epi32, epi64} into gimple ABSU_EXPR + VCE.

2023-06-08 Thread Hongtao Liu via Gcc-patches

On Wed, Jun 7, 2023 at 8:31 AM Hongtao Liu  wrote:
>
> On Tue, Jun 6, 2023 at 10:36 PM Uros Bizjak  wrote:
> >
> > On Tue, Jun 6, 2023 at 1:42 PM Hongtao Liu  wrote:
> > >
> > > On Tue, Jun 6, 2023 at 5:11 PM Uros Bizjak  wrote:
> > > >
> > > > On Tue, Jun 6, 2023 at 6:33 AM liuhongt via Gcc-patches
> > > >  wrote:
> > > > >
> > > > > r14-1145 fold the intrinsics into gimple ABS_EXPR which has UB for
> > > > > TYPE_MIN, but PABSB will store unsigned result into dst. The patch
> > > > > uses ABSU_EXPR + VCE instead of ABS_EXPR.
> > > > >
> > > > > Also don't fold _mm_abs_{pi8,pi16,pi32} w/o TARGET_64BIT since 64-bit
> > > > > vector absm2 is guarded with TARGET_MMX_WITH_SSE.
> > > >
> > > >This should be !TARGET_MMX_WITH_SSE. TARGET_64BIT is not enough, see
> > > >the definition of T_M_W_S in i386.h. OTOH, these builtins are
> > > >available for TARGET_MMX, so I'm not sure if the above check is needed
> > > >at all.
> > > BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0,
> > > CODE_FOR_ssse3_absv8qi2, "__builtin_ia32_pabsb", IX86_BUILTIN_PABSB,
> > > UNKNOWN, (int) V8QI_FTYPE_V8QI)
> > >
> > > ISA requirement(OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX) will be
> > > checked by ix86_check_builtin_isa_match which is at the beginning of
> > > ix86_gimple_fold_builtin.
> > > Here, we're folding those builtin into gimple ABSU_EXPR, and
> > > ABSU_EXPR will be lowered by vec_lower pass when backend
> > > doesn't support corressponding absm2_optab, that's why i only check
> > > TARGET_64BIT here.
> > >
> > > > Please note that we are using builtins here, so we should not fold to
> > > > absm2, but to ssse3_absm2, which is also available with TARGET_MMX.
> > > Yes, that exactly why I checked TARGET_64BIT here, w/ TARGET_64BIT,
> > > backend suppport absm2_optab which exactly matches ssse3_absm2.
> > > w/o TARGET_64BIT, the builtin shouldn't folding into gimple ABSU_EXPR,
> > > but let backend expanded to ssse3_absm2.
> >
> > Thanks for the explanation, but for consistency, I'd recommend
> > checking TARGET_MMX_WITH_SSE (= TARGET_64BIT && TARGET_SSE2) here. The
> > macro is self-explanatory, while the usage of TARGET_64BIT is not that
> > descriptive.
> Sure.
Pushed to trunk.
> >
> > Uros.
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

Re: [PATCH v2] Explicitly view_convert_expr mask to signed type when folding pblendvb builtins.

2023-06-08 Thread Hongtao Liu via Gcc-patches

On Tue, Jun 6, 2023 at 4:23 PM liuhongt  wrote:
>
> > I think this is a better patch and will always be correct and still
> > get folded at the gimple level (correctly):
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index d4ff56ee8dd..02bf5ba93a5 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -18561,8 +18561,10 @@ ix86_gimple_fold_builtin (gimple_stmt_iterator 
> > *gsi)
> >   tree itype = GET_MODE_INNER (TYPE_MODE (type)) == E_SFmode
> > ? intSI_type_node : intDI_type_node;
> >   type = get_same_sized_vectype (itype, type);
> > - arg2 = gimple_build (&stmts, VIEW_CONVERT_EXPR, type, arg2);
> > }
> > + else
> > +   type = signed_type_for (type);
> > + arg2 = gimple_build (&stmts, VIEW_CONVERT_EXPR, type, arg2);
> >   tree zero_vec = build_zero_cst (type);
> >   tree cmp_type = truth_type_for (type);
> >   tree cmp = gimple_build (&stmts, LT_EXPR, cmp_type, arg2, 
> > zero_vec);
> >
> >
>
> Yes, thanks.
>
> Here's the updated patch:
>
> Since mask < 0 will be always false for vector char when
> -funsigned-char, but vpblendvb needs to check the most significant
> bit. The patch explicitly VCE to vector signed char.
>
Pushed to trunk and backport to GCC-13/GCC-12 release branch.(No need
for GCC-11 and earlier since the bug is introduced in GCC12).
>
> gcc/ChangeLog:
>
> PR target/110108
> * config/i386/i386.cc (ix86_gimple_fold_builtin): Explicitly
> view_convert_expr mask to signed type when folding pblendvb
> builtins.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr110108-2.c: New test.
> ---
>  gcc/config/i386/i386.cc|  4 +++-
>  gcc/testsuite/gcc.target/i386/pr110108-2.c | 14 ++
>  2 files changed, 17 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr110108-2.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index da20c2c49de..4e594a9c88e 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -18561,8 +18561,10 @@ ix86_gimple_fold_builtin (gimple_stmt_iterator *gsi)
>   tree itype = GET_MODE_INNER (TYPE_MODE (type)) == E_SFmode
> ? intSI_type_node : intDI_type_node;
>   type = get_same_sized_vectype (itype, type);
> - arg2 = gimple_build (&stmts, VIEW_CONVERT_EXPR, type, arg2);
> }
> + else
> +   type = signed_type_for (type);
> + arg2 = gimple_build (&stmts, VIEW_CONVERT_EXPR, type, arg2);
>   tree zero_vec = build_zero_cst (type);
>   tree cmp_type = truth_type_for (type);
>   tree cmp = gimple_build (&stmts, LT_EXPR, cmp_type, arg2, zero_vec);
> diff --git a/gcc/testsuite/gcc.target/i386/pr110108-2.c 
> b/gcc/testsuite/gcc.target/i386/pr110108-2.c
> new file mode 100644
> index 000..2d1d2fd4991
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr110108-2.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx2 -O2 -funsigned-char" } */
> +/* { dg-final { scan-assembler-times "vpblendvb" 2 } } */
> +
> +#include 
> +__m128i do_stuff_128(__m128i X0, __m128i X1, __m128i X2) {
> +  __m128i Result = _mm_blendv_epi8(X0, X1, X2);
> +  return Result;
> +}
> +
> +__m256i do_stuff_256(__m256i X0, __m256i X1, __m256i X2) {
> +  __m256i Result = _mm256_blendv_epi8(X0, X1, X2);
> +  return Result;
> +}
> --
> 2.39.1.388.g2fc9e9ca3c
>


-- 
BR,
Hongtao

Re: [PATCH] [x86] Add missing vec_pack/unpacks patterns for _Float16 <-> int/float conversion.

2023-06-12 Thread Hongtao Liu via Gcc-patches

On Mon, Jun 5, 2023 at 9:26 AM liuhongt  wrote:
>
> This patch only support vec_pack/unpacks optabs for vector modes whose lenth 
> >= 128.
> For 32/64-bit vector, they're more hanlded by BB vectorizer with
> truncmn2/extendmn2/fix{,uns}_truncmn2.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ready to push to trunk.
Committed.
>
> gcc/ChangeLog:
>
> * config/i386/sse.md (vec_pack_float_): New 
> expander.
> (vec_unpack_fix_trunc_lo_): Ditto.
> (vec_unpack_fix_trunc_hi_): Ditto.
> (vec_unpacks_lo_: Ditto.
> (vec_unpacks_hi_: Ditto.
> (sse_movlhps_): New define_insn.
> (ssse3_palignr_perm): Extend to V_128H.
> (V_128H): New mode iterator.
> (ssepackPHmode): New mode attribute.
> (vunpck_extract_mode>: Ditto.
> (vpckfloat_concat_mode): Extend to VxSI/VxSF for _Float16.
> (vpckfloat_temp_mode): Ditto.
> (vpckfloat_op_mode): Ditto.
> (vunpckfixt_mode): Extend to VxHF.
> (vunpckfixt_model): Ditto.
> (vunpckfixt_extract_mode): Ditto.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/vec_pack_fp16-1.c: New test.
> * gcc.target/i386/vec_pack_fp16-2.c: New test.
> * gcc.target/i386/vec_pack_fp16-3.c: New test.
> ---
>  gcc/config/i386/sse.md| 216 +-
>  .../gcc.target/i386/vec_pack_fp16-1.c |  34 +++
>  .../gcc.target/i386/vec_pack_fp16-2.c |   9 +
>  .../gcc.target/i386/vec_pack_fp16-3.c |   8 +
>  4 files changed, 258 insertions(+), 9 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/vec_pack_fp16-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/vec_pack_fp16-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/vec_pack_fp16-3.c
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index a92f50e96b5..1eb2dd077ff 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -291,6 +291,9 @@ (define_mode_iterator V
>  (define_mode_iterator V_128
>[V16QI V8HI V4SI V2DI V4SF (V2DF "TARGET_SSE2")])
>
> +(define_mode_iterator V_128H
> +  [V16QI V8HI V8HF V8BF V4SI V2DI V4SF (V2DF "TARGET_SSE2")])
> +
>  ;; All 256bit vector modes
>  (define_mode_iterator V_256
>[V32QI V16HI V8SI V4DI V8SF V4DF])
> @@ -1076,6 +1079,12 @@ (define_mode_attr ssePHmodelower
> (V8DI "v8hf") (V4DI "v4hf") (V2DI "v2hf")
> (V8DF "v8hf") (V16SF "v16hf") (V8SF "v8hf")])
>
> +
> +;; Mapping of vector modes to packed vector hf modes of same sized.
> +(define_mode_attr ssepackPHmode
> +  [(V16SI "V32HF") (V8SI "V16HF") (V4SI "V8HF")
> +   (V16SF "V32HF") (V8SF "V16HF") (V4SF "V8HF")])
> +
>  ;; Mapping of vector modes to packed single mode of the same size
>  (define_mode_attr ssePSmode
>[(V16SI "V16SF") (V8DF "V16SF")
> @@ -6918,6 +6927,61 @@ (define_mode_attr qq2phsuff
> (V16SF "") (V8SF "{y}") (V4SF "{x}")
> (V8DF "{z}") (V4DF "{y}") (V2DF "{x}")])
>
> +(define_mode_attr vunpck_extract_mode
> +  [(V32HF "v32hf") (V16HF "v16hf") (V8HF "v16hf")])
> +
> +(define_expand "vec_unpacks_lo_"
> +  [(match_operand: 0 "register_operand")
> +   (match_operand:VF_AVX512FP16VL 1 "register_operand")]
> +  "TARGET_AVX512FP16"
> +{
> +  rtx tem = operands[1];
> +  rtx (*gen) (rtx, rtx);
> +  if (mode != V8HFmode)
> +{
> +  tem = gen_reg_rtx (mode);
> +  emit_insn (gen_vec_extract_lo_ (tem,
> +  operands[1]));
> +  gen = gen_extend2;
> +}
> +  else
> +gen = gen_avx512fp16_float_extend_phv4sf2;
> +
> +  emit_insn (gen (operands[0], tem));
> +  DONE;
> +})
> +
> +(define_expand "vec_unpacks_hi_"
> +  [(match_operand: 0 "register_operand")
> +   (match_operand:VF_AVX512FP16VL 1 "register_operand")]
> +  "TARGET_AVX512FP16"
> +{
> +  rtx tem = operands[1];
> +  rtx (*gen) (rtx, rtx);
> +  if (mode != V8HFmode)
> +{
> +  tem = gen_reg_rtx (mode);
> +  emit_insn (gen_vec_extract_hi_ (tem,
> +  operands[1]));
> +  gen = gen_extend2;
> +}
> +  else
> +{
> +  tem = gen_reg_rtx (V8HFmode);
> +  rtvec tmp = rtvec_alloc (8);
> +  for (int i = 0; i != 8; i++)
> +   RTVEC_ELT (tmp, i) = GEN_INT((i+4)%8);
> +
> +  rtx selector = gen_rtx_PARALLEL (VOIDmode, tmp);
> +  emit_move_insn (tem,
> +gen_rtx_VEC_SELECT (V8HFmode, operands[1], selector));
> +  gen = gen_avx512fp16_float_extend_phv4sf2;
> +}
> +
> +  emit_insn (gen (operands[0], tem));
> +  DONE;
> +})
> +
>  (define_insn 
> "avx512fp16_vcvtph2_"
>[(set (match_operand:VI248_AVX512VL 0 "register_operand" "=v")
>  (unspec:VI248_AVX512VL
> @@ -8314,11 +8378,17 @@ (define_expand "floatv2div2sf2"
>  })
>
>  (define_mode_attr vpckfloat_concat_mode
> -  [(V8DI "v16sf") (V4DI "v8sf") (V2DI "v8sf")])
> +  [(V8DI "v16sf") (V4DI "v8sf") (V2DI "v8sf")
> +   (V16SI "v32hf") (V8SI "v16hf") (V4SI "v16hf")
> +   (V

Re: [PATCH] x86/AVX512: use VMOVDDUP for broadcast to V2DF

2023-06-13 Thread Hongtao Liu via Gcc-patches

On Wed, Jun 14, 2023 at 1:55 PM Jan Beulich via Gcc-patches
 wrote:
>
> Like is already the case for the AVX/AVX2 form, VMOVDDUP - acting on
> double precision floating values - is more appropriate to use here, and
> it can also result in shorter insn encodings when source is memory or
> %xmm0...%xmm7, and no masking is applied (in allowing a 2-byte VEX
> prefix then instead of a 3-byte one).
>
> gcc/
>
> * config/i386/sse.md (_vec_dup): Use
> vmovddup.
Ok for trunk.
>
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -25724,9 +25724,9 @@
>"TARGET_AVX512F"
>  {
>/*  There is no DF broadcast (in AVX-512*) to 128b register.
> -  Mimic it with integer variant.  */
> +  Mimic it with vmovddup, just like vec_dupv2df does.  */
>if (mode == V2DFmode)
> -return "vpbroadcastq\t{%1, %0|%0, %q1}";
> +return "vmovddup\t{%1, %0|%0, %q1}";
>
>return "vbroadcast\t{%1, 
> %0|%0, %1}";
>  }



-- 
BR,
Hongtao

Re: [PATCH] x86: add Bk and Br to comment list B's sub-chars

2023-06-13 Thread Hongtao Liu via Gcc-patches

On Wed, Jun 14, 2023 at 1:56 PM Jan Beulich via Gcc-patches
 wrote:
>
> gcc/
>
> * config/i386/constraints.md: Mention k and r for B.
Ok.
>
> --- a/gcc/config/i386/constraints.md
> +++ b/gcc/config/i386/constraints.md
> @@ -162,7 +162,9 @@
>  ;;  g  GOT memory operand.
>  ;;  m  Vector memory operand
>  ;;  c  Constant memory operand
> +;;  k  TLS address that allows insn using non-integer registers
>  ;;  n  Memory operand without REX prefix
> +;;  r  Broadcast memory operand
>  ;;  s  Sibcall memory operand, not valid for TARGET_X32
>  ;;  w  Call memory operand, not valid for TARGET_X32
>  ;;  z  Constant call address operand.



-- 
BR,
Hongtao

Re: [PATCH] x86: make better use of VBROADCASTSS / VPBROADCASTD

2023-06-14 Thread Hongtao Liu via Gcc-patches

On Wed, Jun 14, 2023 at 1:58 PM Jan Beulich via Gcc-patches
 wrote:
>
> ... in vec_dupv4sf / *vec_dupv4si. The respective broadcast insns are
> never longer (yet sometimes shorter) than the corresponding VSHUFPS /
> VPSHUFD, due to the immediate operand of the shuffle insns balancing the
> need for VEX3 in the broadcast ones. When EVEX encoding is required the
> broadcast insns are always shorter.
>
> Add two new alternatives each, one covering the AVX2 case and one
> covering AVX512.
I think you can just change assemble output for this first alternative
when TARGET_AVX2, use vbroadcastss, else use vshufps since
vbroadcastss only accept register operand when TARGET_AVX2. And no
need to support 2 extra alternatives which doesn't make sense just
make RA more confused about the same meaning of different
alternatives.
>
> gcc/
>
> * config/i386/sse.md (vec_dupv4sf): New AVX2 and AVX512F
> alternatives using vbroadcastss.
> (*vec_dupv4si): New AVX2 and AVX512F alternatives using
> vpbroadcastd.
> ---
> I'm working from the assumption that the isa attributes to the original
> 1st and 2nd alternatives don't need further restricting (to sse2_noavx2
> or avx_noavx2 as applicable), as the new earlier alternatives cover all
> operand forms already when at least AVX2 is enabled.
>
> Isn't prefix_extra use bogus here? What extra prefix does vbroadcastss
> use? (Same further down in *vec_dupv4si and avx2_vbroadcasti128_
> and elsewhere.)
Not sure about this part. I grep prefix_extra, seems only used by
znver.md/znver4.md for schedule, and only for comi instructions(?the
reservation name seems so).
>
> Is use of Yv for the source operand really necessary in *vec_dupv4si?
> I.e. would scalar integer values be put in XMM{16...31} when AVX512VL
Yes, You can look at ix86_hard_regno_mode_ok, EXT_REX_SSE_REGNO is
allowed for scalar mode, but not for 128/256-bit vector modes.

20204  if (TARGET_AVX512F
20205  && (VALID_AVX512F_REG_OR_XI_MODE (mode)
20206  || VALID_AVX512F_SCALAR_MODE (mode)))
20207return true;


> isn't enabled? If so (*movsi_internal / *movdi_internal suggest they
> might), wouldn't *vec_dupv2di need to use Yv as well in its 3rd
> alternative (or just m, as Yv is already covered by the 2nd one)?
I guess xm is more suitable since we still want to allocate
operands[1] to register when sse3_noavx.
It didn't hit any error since for avx and above, alternative 1(2rd
one) is always matched than alternative 2.
>
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -25798,38 +25798,42 @@
> (const_int 1)))])
>
>  (define_insn "vec_dupv4sf"
> -  [(set (match_operand:V4SF 0 "register_operand" "=v,v,x")
> +  [(set (match_operand:V4SF 0 "register_operand" "=Yv,v,v,v,x")
> (vec_duplicate:V4SF
> - (match_operand:SF 1 "nonimmediate_operand" "Yv,m,0")))]
> + (match_operand:SF 1 "nonimmediate_operand" "v,vm,Yv,m,0")))]
>"TARGET_SSE"
>"@
> +   vbroadcastss\t{%1, %0|%0, %1}
> +   vbroadcastss\t{%1, %g0|%g0, %1}
> vshufps\t{$0, %1, %1, %0|%0, %1, %1, 0}
> vbroadcastss\t{%1, %0|%0, %1}
> shufps\t{$0, %0, %0|%0, %0, 0}"
> -  [(set_attr "isa" "avx,avx,noavx")
> -   (set_attr "type" "sseshuf1,ssemov,sseshuf1")
> -   (set_attr "length_immediate" "1,0,1")
> -   (set_attr "prefix_extra" "0,1,*")
> -   (set_attr "prefix" "maybe_evex,maybe_evex,orig")
> -   (set_attr "mode" "V4SF")])
> +  [(set_attr "isa" "avx2,avx512f,avx,avx,noavx")
> +   (set_attr "type" "ssemov,ssemov,sseshuf1,ssemov,sseshuf1")
> +   (set_attr "length_immediate" "0,0,1,0,1")
> +   (set_attr "prefix_extra" "*,*,0,1,*")
> +   (set_attr "prefix" "maybe_evex,evex,maybe_evex,maybe_evex,orig")
> +   (set_attr "mode" "V4SF,V16SF,V4SF,V4SF,V4SF")])
>
>  (define_insn "*vec_dupv4si"
> -  [(set (match_operand:V4SI 0 "register_operand" "=v,v,x")
> +  [(set (match_operand:V4SI 0 "register_operand" "=Yv,v,v,v,x")
> (vec_duplicate:V4SI
> - (match_operand:SI 1 "nonimmediate_operand" "Yv,m,0")))]
> + (match_operand:SI 1 "nonimmediate_operand" "vm,vm,Yv,m,0")))]
>"TARGET_SSE"
>"@
> +   vpbroadcastd\t{%1, %0|%0, %1}
> +   vpbroadcastd\t{%1, %g0|%g0, %1}
> %vpshufd\t{$0, %1, %0|%0, %1, 0}
> vbroadcastss\t{%1, %0|%0, %1}
> shufps\t{$0, %0, %0|%0, %0, 0}"
> -  [(set_attr "isa" "sse2,avx,noavx")
> -   (set_attr "type" "sselog1,ssemov,sselog1")
> -   (set_attr "length_immediate" "1,0,1")
> -   (set_attr "prefix_extra" "0,1,*")
> -   (set_attr "prefix" "maybe_vex,maybe_evex,orig")
> -   (set_attr "mode" "TI,V4SF,V4SF")
> +  [(set_attr "isa" "avx2,avx512f,sse2,avx,noavx")
> +   (set_attr "type" "ssemov,ssemov,sselog1,ssemov,sselog1")
> +   (set_attr "length_immediate" "0,0,1,0,1")
> +   (set_attr "prefix_extra" "*,*,0,1,*")
> +   (set_attr "prefix" "maybe_evex,evex,maybe_vex,maybe_evex,orig")
> +   (set_attr "mode" "TI,XI,TI,V4SF,V4SF")
> (set (attr "preferred_for_speed")
> - (cond [(eq_attr "alternative" "1"

Re: [PATCH] x86: make VPTERNLOG* usable on less than 512-bit operands with just AVX512F

2023-06-14 Thread Hongtao Liu via Gcc-patches

On Wed, Jun 14, 2023 at 1:59 PM Jan Beulich via Gcc-patches
 wrote:
>
> There's no reason to constrain this to AVX512VL, as the wider operation
> is not usable for more narrow operands only when the possible memory
But this may require more resources (on AMD znver4 processor a zmm
instruction will also be split into 2 uops, right?) And on some intel
processors(SKX/CLX) there will be frequency reduction.
If it needs to be done, it is better guarded with
!TARGET_PREFER_AVX256, at least when micro-architecture AVX256_OPTIMAL
or users explicitly uses -mprefer-vector-width=256, we don't want to
produce any zmm instruction for surprise.(Although
-mprefer-vector-width=256 is supposed for auto-vectorizer, but backend
codegen also use it under such cases, i.e. in *movsf_internal
alternative 5 use zmm only TARGET_AVX512F && !TARGET_PREFER_AVX256.)
> source is a non-broadcast one. This way even the scalar copysign3
> can benefit from the operation being a single-insn one (leaving aside
> moves which the compiler decides to insert for unclear reasons, and
> leaving aside the fact that bcst_mem_operand() is too restrictive for
> broadcast to be embedded right into VPTERNLOG*).
>
> Along with this also request value duplication in
> ix86_expand_copysign()'s call to ix86_build_signbit_mask(), eliminating
> excess space allocation in .rodata.*, filled with zeros which are never
> read.
>
> gcc/
>
> * config/i386/i386-expand.cc (ix86_expand_copysign): Request
> value duplication by ix86_build_signbit_mask() when AVX512F and
> not HFmode.
> * config/i386/sse.md (*_vternlog_all): Convert to
> 2-alternative form. Adjust "mode" attribute. Add "enabled"
> attribute.
> (*_vpternlog_1): Relax to just TARGET_AVX512F.
> (*_vpternlog_2): Likewise.
> (*_vpternlog_3): Likewise.
> ---
> I guess the underlying pattern, going along the lines of what
> one_cmpl2 uses, can be applied elsewhere
> as well.
>
> HFmode could use embedded broadcast too for copysign and alike, but that
> would need to be V2HF -> V8HF (for which I don't think there are any
> existing patterns).
>
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -2266,7 +2266,7 @@ ix86_expand_copysign (rtx operands[])
>else
>  dest = NULL_RTX;
>op1 = lowpart_subreg (vmode, force_reg (mode, operands[2]), mode);
> -  mask = ix86_build_signbit_mask (vmode, 0, 0);
> +  mask = ix86_build_signbit_mask (vmode, TARGET_AVX512F && mode != HFmode, 
> 0);
>
>if (CONST_DOUBLE_P (operands[1]))
>  {
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -12399,11 +12399,11 @@
> (set_attr "mode" "")])
>
>  (define_insn "*_vternlog_all"
> -  [(set (match_operand:V 0 "register_operand" "=v")
> +  [(set (match_operand:V 0 "register_operand" "=v,v")
> (unspec:V
> - [(match_operand:V 1 "register_operand" "0")
> -  (match_operand:V 2 "register_operand" "v")
> -  (match_operand:V 3 "bcst_vector_operand" "vmBr")
> + [(match_operand:V 1 "register_operand" "0,0")
> +  (match_operand:V 2 "register_operand" "v,v")
> +  (match_operand:V 3 "bcst_vector_operand" "vBr,m")
>(match_operand:SI 4 "const_0_to_255_operand")]
>   UNSPEC_VTERNLOG))]
>"TARGET_AVX512F
> @@ -12411,10 +12411,22 @@
> it's not real AVX512FP16 instruction.  */
>&& (GET_MODE_SIZE (GET_MODE_INNER (mode)) >= 4
>   || GET_CODE (operands[3]) != VEC_DUPLICATE)"
> -  "vpternlog\t{%4, %3, %2, %0|%0, %2, %3, %4}"
> +{
> +  if (TARGET_AVX512VL)
> +return "vpternlog\t{%4, %3, %2, %0|%0, %2, %3, %4}";
> +  else
> +return "vpternlog\t{%4, %g3, %g2, %g0|%g0, %g2, %g3, %4}";
> +}
>[(set_attr "type" "sselog")
> (set_attr "prefix" "evex")
> -   (set_attr "mode" "")])
> +   (set (attr "mode")
> +(if_then_else (match_test "TARGET_AVX512VL")
> + (const_string "")
> + (const_string "XI")))
> +   (set (attr "enabled")
> +   (if_then_else (eq_attr "alternative" "1")
> + (symbol_ref " == 64 || TARGET_AVX512VL")
> + (const_string "*")))])
>
>  ;; There must be lots of other combinations like
>  ;;
> @@ -12443,7 +12455,7 @@
>   (any_logic2:V
> (match_operand:V 3 "regmem_or_bitnot_regmem_operand")
> (match_operand:V 4 "regmem_or_bitnot_regmem_operand"]
> -  "( == 64 || TARGET_AVX512VL)
> +  "TARGET_AVX512F
> && ix86_pre_reload_split ()
> && (rtx_equal_p (STRIP_UNARY (operands[1]),
> STRIP_UNARY (operands[4]))
> @@ -12527,7 +12539,7 @@
>   (match_operand:V 2 "regmem_or_bitnot_regmem_operand"))
> (match_operand:V 3 "regmem_or_bitnot_regmem_operand"))
>   (match_operand:V 4 "regmem_or_bitnot_regmem_operand")))]
> -  "( == 64 || TARGET_AVX512VL)
> +  "TARGET_AVX512F
> && ix86_pre_reload_split ()
> && (rtx_equal_p (STRIP_UNARY (operands[

Re: [PATCH 8/9] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_PERMUTE

2023-06-14 Thread Hongtao Liu via Gcc-patches

On Tue, Jun 13, 2023 at 10:07 AM Kewen Lin via Gcc-patches
 wrote:
>
> This patch adjusts the cost handling on
> VMAT_CONTIGUOUS_PERMUTE in function vectorizable_load.  We
> don't call function vect_model_load_cost for it any more.
>
> As the affected test case gcc.target/i386/pr70021.c shows,
> the previous costing can under-cost the total generated
> vector loads as for VMAT_CONTIGUOUS_PERMUTE function
> vect_model_load_cost doesn't consider the group size which
> is considered as vec_num during the transformation.
The original PR is for the correctness issue, and I'm not sure how
much of a performance impact the patch would be, but the change looks
reasonable, so the test change looks ok to me.
I'll track performance impact on SPEC2017 to see if there's any
regression caused by the patch(Guess probably not).
>
> This patch makes the count of vector load in costing become
> consistent with what we generates during the transformation.
> To be more specific, for the given test case, for memory
> access b[i_20], it costed for 2 vector loads before,
> with this patch it costs 8 instead, it matches the final
> count of generated vector loads basing from b.  This costing
> change makes cost model analysis feel it's not profitable
> to vectorize the first loop, so this patch adjusts the test
> case without vect cost model any more.
>
> But note that this test case also exposes something we can
> improve further is that although the number of vector
> permutation what we costed and generated are consistent,
> but DCE can further optimize some unused permutation out,
> it would be good if we can predict that and generate only
> those necessary permutations.
>
> gcc/ChangeLog:
>
> * tree-vect-stmts.cc (vect_model_load_cost): Assert this function only
> handle memory_access_type VMAT_CONTIGUOUS, remove some
> VMAT_CONTIGUOUS_PERMUTE related handlings.
> (vectorizable_load): Adjust the cost handling on 
> VMAT_CONTIGUOUS_PERMUTE
> without calling vect_model_load_cost.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr70021.c: Adjust with -fno-vect-cost-model.
> ---
>  gcc/testsuite/gcc.target/i386/pr70021.c |  2 +-
>  gcc/tree-vect-stmts.cc  | 88 ++---
>  2 files changed, 51 insertions(+), 39 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/i386/pr70021.c 
> b/gcc/testsuite/gcc.target/i386/pr70021.c
> index 6562c0f2bd0..d509583601e 100644
> --- a/gcc/testsuite/gcc.target/i386/pr70021.c
> +++ b/gcc/testsuite/gcc.target/i386/pr70021.c
> @@ -1,7 +1,7 @@
>  /* PR target/70021 */
>  /* { dg-do run } */
>  /* { dg-require-effective-target avx2 } */
> -/* { dg-options "-O2 -ftree-vectorize -mavx2 -fdump-tree-vect-details 
> -mtune=skylake" } */
> +/* { dg-options "-O2 -ftree-vectorize -mavx2 -fdump-tree-vect-details 
> -mtune=skylake -fno-vect-cost-model" } */
>
>  #include "avx2-check.h"
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 7f8d9db5363..e7a97dbe05d 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -1134,8 +1134,7 @@ vect_model_load_cost (vec_info *vinfo,
>   slp_tree slp_node,
>   stmt_vector_for_cost *cost_vec)
>  {
> -  gcc_assert (memory_access_type == VMAT_CONTIGUOUS
> - || memory_access_type == VMAT_CONTIGUOUS_PERMUTE);
> +  gcc_assert (memory_access_type == VMAT_CONTIGUOUS);
>
>unsigned int inside_cost = 0, prologue_cost = 0;
>bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
> @@ -1174,26 +1173,6 @@ vect_model_load_cost (vec_info *vinfo,
>   once per group anyhow.  */
>bool first_stmt_p = (first_stmt_info == stmt_info);
>
> -  /* We assume that the cost of a single load-lanes instruction is
> - equivalent to the cost of DR_GROUP_SIZE separate loads.  If a grouped
> - access is instead being provided by a load-and-permute operation,
> - include the cost of the permutes.  */
> -  if (first_stmt_p
> -  && memory_access_type == VMAT_CONTIGUOUS_PERMUTE)
> -{
> -  /* Uses an even and odd extract operations or shuffle operations
> -for each needed permute.  */
> -  int group_size = DR_GROUP_SIZE (first_stmt_info);
> -  int nstmts = ncopies * ceil_log2 (group_size) * group_size;
> -  inside_cost += record_stmt_cost (cost_vec, nstmts, vec_perm,
> -  stmt_info, 0, vect_body);
> -
> -  if (dump_enabled_p ())
> -dump_printf_loc (MSG_NOTE, vect_location,
> - "vect_model_load_cost: strided group_size = %d .\n",
> - group_size);
> -}
> -
>vect_get_load_cost (vinfo, stmt_info, ncopies, alignment_support_scheme,
>   misalignment, first_stmt_p, &inside_cost, 
> &prologue_cost,
>   cost_vec, cost_vec, true);
> @@ -10652,11 +10631,22 @@ vectorizable_load (vec_info *vinfo,
>  alignment support schemes.  */

Re: [PATCH] x86: make better use of VBROADCASTSS / VPBROADCASTD

2023-06-14 Thread Hongtao Liu via Gcc-patches

On Wed, Jun 14, 2023 at 5:03 PM Jan Beulich  wrote:
>
> On 14.06.2023 09:41, Hongtao Liu wrote:
> > On Wed, Jun 14, 2023 at 1:58 PM Jan Beulich via Gcc-patches
> >  wrote:
> >>
> >> ... in vec_dupv4sf / *vec_dupv4si. The respective broadcast insns are
> >> never longer (yet sometimes shorter) than the corresponding VSHUFPS /
> >> VPSHUFD, due to the immediate operand of the shuffle insns balancing the
> >> need for VEX3 in the broadcast ones. When EVEX encoding is required the
> >> broadcast insns are always shorter.
> >>
> >> Add two new alternatives each, one covering the AVX2 case and one
> >> covering AVX512.
> > I think you can just change assemble output for this first alternative
> > when TARGET_AVX2, use vbroadcastss, else use vshufps since
> > vbroadcastss only accept register operand when TARGET_AVX2. And no
> > need to support 2 extra alternatives which doesn't make sense just
> > make RA more confused about the same meaning of different
> > alternatives.
>
> You mean by switching from "@ ..." to C code using "switch
> (which_alternative)"? I can do that, sure. Yet that'll make for a
> more complicated "length_immediate" attribute then. Would be nice
Yes, you can also do something like
   (set (attr "length_immediate")
 (cond [(eq_attr "alternative" "0")
   (if_then_else (match_test "TARGET_AVX2)
(const_string "")
   (const_string "1"))
...]

> if you could confirm that this is what you want, as I may well
> have misunderstood you.
>
> But that'll be for vec_dupv4sf only, as vec_dupv4si is subtly
> different.
Yes, but can we use vpbroadcastd for vec_dupv4si similarly?
>
> >> ---
> >> I'm working from the assumption that the isa attributes to the original
> >> 1st and 2nd alternatives don't need further restricting (to sse2_noavx2
> >> or avx_noavx2 as applicable), as the new earlier alternatives cover all
> >> operand forms already when at least AVX2 is enabled.
> >>
> >> Isn't prefix_extra use bogus here? What extra prefix does vbroadcastss
> >> use? (Same further down in *vec_dupv4si and avx2_vbroadcasti128_
> >> and elsewhere.)
> > Not sure about this part. I grep prefix_extra, seems only used by
> > znver.md/znver4.md for schedule, and only for comi instructions(?the
> > reservation name seems so).
>
> define_attr "length_vex" and define_attr "length" use it, too.
> Otherwise I would have asked whether the attribute couldn't be
> purged from most insns.
>
> My present understanding is that the attribute is wrong on
> vec_dupv4sf (and hence wants dropping from there altogether), and it
> should be "prefix_data16" instead on *vec_dupv4si, evaluating to 1
> only for the non-AVX pshufd case. I suspect at least the latter
> would be going to far for doing it "while here" right in this patch.
> Plus I think I have seen various other questionable uses of that
> attribute.
>
> >> Is use of Yv for the source operand really necessary in *vec_dupv4si?
> >> I.e. would scalar integer values be put in XMM{16...31} when AVX512VL
> > Yes, You can look at ix86_hard_regno_mode_ok, EXT_REX_SSE_REGNO is
> > allowed for scalar mode, but not for 128/256-bit vector modes.
> >
> > 20204  if (TARGET_AVX512F
> > 20205  && (VALID_AVX512F_REG_OR_XI_MODE (mode)
> > 20206  || VALID_AVX512F_SCALAR_MODE (mode)))
> > 20207return true;
>
> Okay, so I need to switch input constraints for relevant new
> alternatives to Yv (I actually wonder why I did use v in
> vec_dupv4sf, as it was clear to me that SFmode can be in the high
> 16 xmm registers with just AVX512F).
>
> >> isn't enabled? If so (*movsi_internal / *movdi_internal suggest they
> >> might), wouldn't *vec_dupv2di need to use Yv as well in its 3rd
> >> alternative (or just m, as Yv is already covered by the 2nd one)?
> > I guess xm is more suitable since we still want to allocate
> > operands[1] to register when sse3_noavx.
> > It didn't hit any error since for avx and above, alternative 1(2rd
> > one) is always matched than alternative 2.
>
> I'm afraid I don't follow: With just -mavx512f the source operand
> can be in, say, %xmm16 (as per your clarification above). This
> would not match Yv, but it would match vm. And hence wrongly
> create an AVX512VL form of vmovddup. I didn't try it out earlier,
> because unlike for SFmode / DFmode I thought it's not really clear
> how to get the compiler to reliably put a DImode variable in an xmm
> reg, but it just occurred to me that this can be done the same way
> there. And voila,
>
> typedef long long __attribute__((vector_size(16))) v2di;
>
> v2di bcst(long long ll) {
> register long long x asm("xmm16") = ll;
>
> asm("nop %%esp" : "+v" (x));
> return (v2di){x, x};
> }
>
> compiled with just -mavx512f (and -O2) produces an AVX512VL insn.
Ah, I see, indeed it's a potential bug for -mavx512f -mavx512vl
I meant with -mavx512vl,
_vec_dup_gpr will be matched
instead of vec_dupv2di since it's put before

Re: [PATCH] x86: make VPTERNLOG* usable on less than 512-bit operands with just AVX512F

2023-06-14 Thread Hongtao Liu via Gcc-patches

On Wed, Jun 14, 2023 at 5:32 PM Jan Beulich  wrote:
>
> On 14.06.2023 10:10, Hongtao Liu wrote:
> > On Wed, Jun 14, 2023 at 1:59 PM Jan Beulich via Gcc-patches
> >  wrote:
> >>
> >> There's no reason to constrain this to AVX512VL, as the wider operation
> >> is not usable for more narrow operands only when the possible memory
> > But this may require more resources (on AMD znver4 processor a zmm
> > instruction will also be split into 2 uops, right?) And on some intel
> > processors(SKX/CLX) there will be frequency reduction.
>
> I'm afraid I don't follow: Largely the same AVX512 code would be
> generated when passing -mavx512vl, so how can power/performance
> considerations matter here? All I'm doing here (and in a few more
Yes , for -march=*** is ok since AVX512VL is included.
what your patch improve is -mavx512f -mno-avx512vl, but for specific
option combinations like -mavx512f -mprefer-vector-width=256
-mno-avx512vl, your patch will produce zmm instruction which is not
expected.
> patches I'm still in the process of testing) is relax when AVX512
> insns can actually be used (reducing the copying between registers
> and/or the number of insns needed). My understanding on the Intel
> side is that it only matters whether AVX512 insns are used, not
No, vector length matters, ymm/xmm evex insns are ok to use, but zmm
insns will cause frequency reduction.
> what vector length they are. You may be right about znver4, though.
>
> Nevertheless I agree ...
>
> > If it needs to be done, it is better guarded with
> > !TARGET_PREFER_AVX256, at least when micro-architecture AVX256_OPTIMAL
> > or users explicitly uses -mprefer-vector-width=256, we don't want to
> > produce any zmm instruction for surprise.(Although
> > -mprefer-vector-width=256 is supposed for auto-vectorizer, but backend
> > codegen also use it under such cases, i.e. in *movsf_internal
> > alternative 5 use zmm only TARGET_AVX512F && !TARGET_PREFER_AVX256.)
>
> ... that respecting such overrides is probably desirable, so I'll
> adjust.
>
> Jan
>
> >> source is a non-broadcast one. This way even the scalar copysign3
> >> can benefit from the operation being a single-insn one (leaving aside
> >> moves which the compiler decides to insert for unclear reasons, and
> >> leaving aside the fact that bcst_mem_operand() is too restrictive for
> >> broadcast to be embedded right into VPTERNLOG*).
> >>
> >> Along with this also request value duplication in
> >> ix86_expand_copysign()'s call to ix86_build_signbit_mask(), eliminating
> >> excess space allocation in .rodata.*, filled with zeros which are never
> >> read.
> >>
> >> gcc/
> >>
> >> * config/i386/i386-expand.cc (ix86_expand_copysign): Request
> >> value duplication by ix86_build_signbit_mask() when AVX512F and
> >> not HFmode.
> >> * config/i386/sse.md (*_vternlog_all): Convert to
> >> 2-alternative form. Adjust "mode" attribute. Add "enabled"
> >> attribute.
> >> (*_vpternlog_1): Relax to just TARGET_AVX512F.
> >> (*_vpternlog_2): Likewise.
> >> (*_vpternlog_3): Likewise.
>


-- 
BR,
Hongtao

Re: [PATCH] x86: make better use of VBROADCASTSS / VPBROADCASTD

2023-06-14 Thread Hongtao Liu via Gcc-patches

On Thu, Jun 15, 2023 at 1:23 PM Hongtao Liu  wrote:
>
> On Wed, Jun 14, 2023 at 5:03 PM Jan Beulich  wrote:
> >
> > On 14.06.2023 09:41, Hongtao Liu wrote:
> > > On Wed, Jun 14, 2023 at 1:58 PM Jan Beulich via Gcc-patches
> > >  wrote:
> > >>
> > >> ... in vec_dupv4sf / *vec_dupv4si. The respective broadcast insns are
> > >> never longer (yet sometimes shorter) than the corresponding VSHUFPS /
> > >> VPSHUFD, due to the immediate operand of the shuffle insns balancing the
> > >> need for VEX3 in the broadcast ones. When EVEX encoding is required the
> > >> broadcast insns are always shorter.
> > >>
> > >> Add two new alternatives each, one covering the AVX2 case and one
> > >> covering AVX512.
> > > I think you can just change assemble output for this first alternative
> > > when TARGET_AVX2, use vbroadcastss, else use vshufps since
> > > vbroadcastss only accept register operand when TARGET_AVX2. And no
> > > need to support 2 extra alternatives which doesn't make sense just
> > > make RA more confused about the same meaning of different
> > > alternatives.
> >
> > You mean by switching from "@ ..." to C code using "switch
> > (which_alternative)"? I can do that, sure. Yet that'll make for a
> > more complicated "length_immediate" attribute then. Would be nice
> Yes, you can also do something like
>(set (attr "length_immediate")
>  (cond [(eq_attr "alternative" "0")
>(if_then_else (match_test "TARGET_AVX2)
> (const_string "")
>(const_string "1"))
> ...]
>
> > if you could confirm that this is what you want, as I may well
> > have misunderstood you.
> >
> > But that'll be for vec_dupv4sf only, as vec_dupv4si is subtly
> > different.
> Yes, but can we use vpbroadcastd for vec_dupv4si similarly?
> >
> > >> ---
> > >> I'm working from the assumption that the isa attributes to the original
> > >> 1st and 2nd alternatives don't need further restricting (to sse2_noavx2
> > >> or avx_noavx2 as applicable), as the new earlier alternatives cover all
> > >> operand forms already when at least AVX2 is enabled.
> > >>
> > >> Isn't prefix_extra use bogus here? What extra prefix does vbroadcastss
> > >> use? (Same further down in *vec_dupv4si and avx2_vbroadcasti128_
> > >> and elsewhere.)
> > > Not sure about this part. I grep prefix_extra, seems only used by
> > > znver.md/znver4.md for schedule, and only for comi instructions(?the
> > > reservation name seems so).
> >
> > define_attr "length_vex" and define_attr "length" use it, too.
> > Otherwise I would have asked whether the attribute couldn't be
> > purged from most insns.
> >
> > My present understanding is that the attribute is wrong on
> > vec_dupv4sf (and hence wants dropping from there altogether), and it
> > should be "prefix_data16" instead on *vec_dupv4si, evaluating to 1
> > only for the non-AVX pshufd case. I suspect at least the latter
> > would be going to far for doing it "while here" right in this patch.
> > Plus I think I have seen various other questionable uses of that
> > attribute.
> >
> > >> Is use of Yv for the source operand really necessary in *vec_dupv4si?
> > >> I.e. would scalar integer values be put in XMM{16...31} when AVX512VL
> > > Yes, You can look at ix86_hard_regno_mode_ok, EXT_REX_SSE_REGNO is
> > > allowed for scalar mode, but not for 128/256-bit vector modes.
> > >
> > > 20204  if (TARGET_AVX512F
> > > 20205  && (VALID_AVX512F_REG_OR_XI_MODE (mode)
> > > 20206  || VALID_AVX512F_SCALAR_MODE (mode)))
> > > 20207return true;
> >
> > Okay, so I need to switch input constraints for relevant new
> > alternatives to Yv (I actually wonder why I did use v in
> > vec_dupv4sf, as it was clear to me that SFmode can be in the high
> > 16 xmm registers with just AVX512F).
> >
> > >> isn't enabled? If so (*movsi_internal / *movdi_internal suggest they
> > >> might), wouldn't *vec_dupv2di need to use Yv as well in its 3rd
> > >> alternative (or just m, as Yv is already covered by the 2nd one)?
> > > I guess xm is more suitable since we still want to allocate
> > > operands[1] to register when sse3_noavx.
> > > It didn't hit any error since for avx and above, alternative 1(2rd
> > > one) is always matched than alternative 2.
> >
> > I'm afraid I don't follow: With just -mavx512f the source operand
> > can be in, say, %xmm16 (as per your clarification above). This
> > would not match Yv, but it would match vm. And hence wrongly
> > create an AVX512VL form of vmovddup. I didn't try it out earlier,
> > because unlike for SFmode / DFmode I thought it's not really clear
> > how to get the compiler to reliably put a DImode variable in an xmm
> > reg, but it just occurred to me that this can be done the same way
> > there. And voila,
> >
> > typedef long long __attribute__((vector_size(16))) v2di;
> >
> > v2di bcst(long long ll) {
> > register long long x asm("xmm16") = ll;
> >
> > asm("nop %%esp" : "+v" (x));
> >

Re: [PATCH] x86: make better use of VBROADCASTSS / VPBROADCASTD

2023-06-15 Thread Hongtao Liu via Gcc-patches

On Thu, Jun 15, 2023 at 2:41 PM Jan Beulich  wrote:
>
> On 15.06.2023 07:23, Hongtao Liu wrote:
> > On Wed, Jun 14, 2023 at 5:03 PM Jan Beulich  wrote:
> >>
> >> On 14.06.2023 09:41, Hongtao Liu wrote:
> >>> On Wed, Jun 14, 2023 at 1:58 PM Jan Beulich via Gcc-patches
> >>>  wrote:
> 
>  ... in vec_dupv4sf / *vec_dupv4si. The respective broadcast insns are
>  never longer (yet sometimes shorter) than the corresponding VSHUFPS /
>  VPSHUFD, due to the immediate operand of the shuffle insns balancing the
>  need for VEX3 in the broadcast ones. When EVEX encoding is required the
>  broadcast insns are always shorter.
> 
>  Add two new alternatives each, one covering the AVX2 case and one
>  covering AVX512.
> >>> I think you can just change assemble output for this first alternative
> >>> when TARGET_AVX2, use vbroadcastss, else use vshufps since
> >>> vbroadcastss only accept register operand when TARGET_AVX2. And no
> >>> need to support 2 extra alternatives which doesn't make sense just
> >>> make RA more confused about the same meaning of different
> >>> alternatives.
> >>
> >> You mean by switching from "@ ..." to C code using "switch
> >> (which_alternative)"? I can do that, sure. Yet that'll make for a
> >> more complicated "length_immediate" attribute then. Would be nice
> > Yes, you can also do something like
> >(set (attr "length_immediate")
> >  (cond [(eq_attr "alternative" "0")
> >(if_then_else (match_test "TARGET_AVX2)
> > (const_string "")
> >(const_string "1"))
> > ...]
>
> Yes, that's along the lines of what I was thinking of. I'm uncertain
> about one aspect of what you spelled out above, though: What is the
> meaning of the empty string in (const_string "")? Shouldn't this be
> "0" or "*"?
Yes, sorry for the typo, should be 0 or *.
>
> >> But that'll be for vec_dupv4sf only, as vec_dupv4si is subtly
> >> different.
> > Yes, but can we use vpbroadcastd for vec_dupv4si similarly?
>
> Well, the use there is similar, but the folding with the shuffle
> alternative won't be possible, because of the new first alternative
> also allowing m for the source, when the shuffle one allows for only
> Yv. The extra m is pointless to have in vec_dupv4sf (because a later
> alternative with a wider ISA [avx] has it already), while in
> vec_dupv4si the similar later alternative resolves to vbroadcastss,
> not vpbroadcastd. I should be able to fold the two vpbroadcastd
> alternatives, along the lines of what I've done in the vec_dupv2di
> patch just sent. (As I just realized the m in what are alternatives
> 1 each in patch v1 is pointless, since already taken care of by
> other alternatives.)
>
> Jan



-- 
BR,
Hongtao

Re: [PATCH] x86: correct and improve "*vec_dupv2di"

2023-06-15 Thread Hongtao Liu via Gcc-patches

On Thu, Jun 15, 2023 at 3:07 PM Uros Bizjak via Gcc-patches
 wrote:
>
> On Thu, Jun 15, 2023 at 8:03 AM Jan Beulich via Gcc-patches
>  wrote:
> >
> > The input constraint for the %vmovddup alternative was wrong, as the
> > upper 16 XMM registers require AVX512VL to be used with this insn. To
> > compensate, introduce a new alternative permitting all 32 registers, by
> > broadcasting to the full 512 bits in that case if AVX512VL is not
> > available.
> >
> > gcc/
> >
> > * config/i386/sse.md (vec_dupv2di): Correct %vmovddup input
> > constraint. Add new AVX512F alternative.
> > ---
> > Strictly speaking the new alternative could be enabled from AVX2
> > onwards, but vmovddup can frequently be a shorter encoding (VEX2
> > vs VEX3).
> >
> > --- a/gcc/config/i386/sse.md
> > +++ b/gcc/config/i386/sse.md
> > @@ -25851,19 +25851,39 @@
> >(symbol_ref "true")))])
> >
> >  (define_insn "*vec_dupv2di"
> > -  [(set (match_operand:V2DI 0 "register_operand" "=x,v,v,x")
> > +  [(set (match_operand:V2DI 0 "register_operand" "=x,v,v,v,x")
> > (vec_duplicate:V2DI
> > - (match_operand:DI 1 "nonimmediate_operand" " 0,Yv,vm,0")))]
> > + (match_operand:DI 1 "nonimmediate_operand" " 0,Yv,vm,Yvm,0")))]
> >"TARGET_SSE"
> > -  "@
> > -   punpcklqdq\t%0, %0
> > -   vpunpcklqdq\t{%d1, %0|%0, %d1}
> > -   %vmovddup\t{%1, %0|%0, %1}
> > -   movlhps\t%0, %0"
> > -  [(set_attr "isa" "sse2_noavx,avx,sse3,noavx")
> > -   (set_attr "type" "sselog1,sselog1,sselog1,ssemov")
> > -   (set_attr "prefix" "orig,maybe_evex,maybe_vex,orig")
> > -   (set_attr "mode" "TI,TI,DF,V4SF")])
> > +{
> > +  switch (which_alternative)
> > +{
> > +case 0:
> > +  return "punpcklqdq\t%0, %0";
> > +case 1:
> > +  return "vpunpcklqdq\t{%d1, %0|%0, %d1}";
> > +case 2:
> > +  if (TARGET_AVX512VL)
> > +   return "vpbroadcastq\t{%1, %0|%0, %1}";
> > +  return "vpbroadcastq\t{%1, %g0|%g0, %1}";
>
> You can use
>
> * return TARGET_AVX512VL ? \"vpbroadcastq\t{%1, %0|%0, %1}\" :
> \"vpbroadcastq\t{%1, %g0|%g0, %1}\";
>
> directly in a multi-output insn template to avoid the above C code.
> See e.g. sse2_cvtpd2pi for an example.
>
> Uros.
>
> > +case 3:
> > +  return "%vmovddup\t{%1, %0|%0, %1}";
> > +case 4:
> > +  return "movlhps\t%0, %0";
> > +default:
> > +  gcc_unreachable ();
> > +}
> > +}
> > +  [(set_attr "isa" "sse2_noavx,avx,avx512f,sse3,noavx")
> > +   (set_attr "type" "sselog1,sselog1,ssemov,sselog1,ssemov")
> > +   (set_attr "prefix" "orig,maybe_evex,evex,maybe_vex,orig")
> > +   (set_attr "mode" "TI,TI,TI,DF,V4SF")
alternative 2 should be XImode when !TARGET_AVX512VL.
> > +   (set (attr "enabled")
> > +   (if_then_else
> > + (eq_attr "alternative" "2")
> > + (symbol_ref "TARGET_AVX512VL
> > +  || (TARGET_AVX512F && !TARGET_PREFER_AVX256)")
> > + (const_string "*")))])
> >
> >  (define_insn "avx2_vbroadcasti128_"
> >[(set (match_operand:VI_256 0 "register_operand" "=x,v,v")



-- 
BR,
Hongtao

Re: [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality.

2023-07-11 Thread Hongtao Liu via Gcc-patches

On Wed, Jul 12, 2023 at 4:57 AM Roger Sayle  wrote:
>
>
> > From: Hongtao Liu 
> > Sent: 28 June 2023 04:23
> > > From: Roger Sayle 
> > > Sent: 27 June 2023 20:28
> > >
> > > I've also come up with an alternate/complementary/supplementary
> > > fix of generating the PTEST during RTL expansion, rather than rely on
> > > this being caught/optimized later during STV.
> > >
> > > You may notice in this patch, the tests for TARGET_SSE4_1 and TImode
> > > appear last.  When I was writing this, I initially also added support
> > > for AVX VPTEST and OImode, before realizing that x86 doesn't (yet)
> > > support 256-bit OImode (which also explains why we don't have an
> > > OImode to V1OImode scalar-to-vector pass).  Retaining this clause
> > > ordering should minimize the lines changed if things change in future.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > > and make -k check, both with and without --target_board=unix{-m32}
> > > with no new failures.  Ok for mainline?
> > >
> > >
> > > 2023-06-27  Roger Sayle  
> > >
> > > gcc/ChangeLog
> > > * config/i386/i386-expand.cc (ix86_expand_int_compare): If
> > > testing a TImode SUBREG of a 128-bit vector register against
> > > zero, use a PTEST instruction instead of first moving it to
> > > to scalar registers.
> > >
> >
> > +  /* Attempt to use PTEST, if available, when testing vector modes for
> > + equality/inequality against zero.  */  if (op1 == const0_rtx
> > +  && SUBREG_P (op0)
> > +  && cmpmode == CCZmode
> > +  && SUBREG_BYTE (op0) == 0
> > +  && REG_P (SUBREG_REG (op0))
> > Just register_operand (op0, TImode),
>
> I completely agree that in most circumstances, the early RTL optimizers
> should use standard predicates, such as register_operand, that don't
> distinguish between REG and SUBREG, allowing the choice (assignment)
> to be left to register allocation (reload).
>
> However in this case, unusually, the presence of the SUBREG, and treating
> it differently from a REG is critical (in fact the reason for the patch).  
> x86_64
> can very efficiently test whether a 128-bit value is zero, setting ZF, either
> in TImode, using orq %rax,%rdx in a single cycle/single instruction, or in
> V1TImode, using ptest %xmm0,%xmm0, in a single cycle/single instruction.
> There's no reason to prefer one form over the other.  A SUREG, however, that
> moves the value from the scalar registers to a vector register, or from a 
> vector
> registers to scalar registers, requires two or three instructions, often 
> reading
> and writing values via memory, at a huge performance penalty.   Hence the
> goal is to eliminate the (VIEW_CONVERT) SUBREG, and choose the appropriate
> single-cycle test instruction for where the data is located.  Hence we want
> to leave REG_P alone, but optimize (only) the SUBREG_P cases.
> register_operand doesn't help with this.
>
> Note this is counter to the usual advice.  Normally, a SUBREG between scalar
> registers is cheap (in fact free) on x86, hence it safe for predicates to 
> ignore
> them prior to register allocation.  But another use of SUBREG, to represent
> a VIEW_CONVERT_EXPR/transfer between processing units is closer to a
> conversion, and a very expensive one (going via memory with different size
> reads vs writes) at that.
>
>
> > +  && VECTOR_MODE_P (GET_MODE (SUBREG_REG (op0)))
> > +  && TARGET_SSE4_1
> > +  && GET_MODE (op0) == TImode
> > +  && GET_MODE_SIZE (GET_MODE (SUBREG_REG (op0))) == 16)
> > +{
> > +  tmp = SUBREG_REG (op0);
> > and tmp = lowpart_subreg (V1TImode, force_reg (TImode, op0));?
> > I think RA can handle SUBREG correctly, no need for extra predicates.
>
> Likewise, your "tmp = lowpart_subreg (V1TImode, force_reg (TImode, ...))"
> is forcing there to always be an inter-unit transfer/pipeline stall, when 
> this is
> idiom that we're trying to eliminate.
>
> I should have repeated the motivating example from my original post at
> https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622706.html
>
> typedef long long __m128i __attribute__ ((__vector_size__ (16)));
> int foo (__m128i x, __m128i y) {
>   return (__int128)x == (__int128)y;
> }
>
> is currently generated as:
> foo:movaps  %xmm0, -40(%rsp)
> movq-32(%rsp), %rdx
> movq%xmm0, %rax
> movq%xmm1, %rsi
> movaps  %xmm1, -24(%rsp)
> movq-16(%rsp), %rcx
> xorq%rsi, %rax
> xorq%rcx, %rdx
> orq %rdx, %rax
> sete%al
> movzbl  %al, %eax
> ret
>
> with this patch (to eliminate the interunit SUBREG) this becomes:
>
> foo:pxor%xmm1, %xmm0
> xorl%eax, %eax
> ptest   %xmm0, %xmm0
> sete%al
> ret
>
> Hopefully, this clarifies things a little.
Thanks for the explanation, the patch LGTM.
One curious question, is there any case SUBREG_BYTE != 0 when inner
and outer mode(TImode) have the s

Re: [PATCH V2] Provide -fcf-protection=branch,return.

2023-07-12 Thread Hongtao Liu via Gcc-patches

ping.

On Mon, May 22, 2023 at 4:08 PM Hongtao Liu  wrote:
>
> ping.
>
> On Sat, May 13, 2023 at 5:20 PM liuhongt  wrote:
> >
> > > I think this could be simplified if you use either EnumSet or
> > > EnumBitSet instead in common.opt for `-fcf-protection=`.
> >
> > Use EnumSet instead of EnumBitSet since CF_FULL is not power of 2.
> > It is a bit tricky for sets classification, cf_branch and cf_return
> > should be in different sets, but they both "conflicts" cf_full,
> > cf_none. And current EnumSet don't handle this well.
> >
> > So in the current implementation, only cf_full,cf_none are exclusive
> > to each other, but they can be combined with any cf_branch, cf_return,
> > cf_check. It's not perfect, but still an improvement than original
> > one.
> >
> > gcc/ChangeLog:
> >
> > * common.opt: (fcf-protection=): Add EnumSet attribute to
> > support combination of params.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * c-c++-common/fcf-protection-10.c: New test.
> > * c-c++-common/fcf-protection-11.c: New test.
> > * c-c++-common/fcf-protection-12.c: New test.
> > * c-c++-common/fcf-protection-8.c: New test.
> > * c-c++-common/fcf-protection-9.c: New test.
> > * gcc.target/i386/pr89701-1.c: New test.
> > * gcc.target/i386/pr89701-2.c: New test.
> > * gcc.target/i386/pr89701-3.c: New test.
> > ---
> >  gcc/common.opt | 12 ++--
> >  gcc/testsuite/c-c++-common/fcf-protection-10.c |  2 ++
> >  gcc/testsuite/c-c++-common/fcf-protection-11.c |  2 ++
> >  gcc/testsuite/c-c++-common/fcf-protection-12.c |  2 ++
> >  gcc/testsuite/c-c++-common/fcf-protection-8.c  |  2 ++
> >  gcc/testsuite/c-c++-common/fcf-protection-9.c  |  2 ++
> >  gcc/testsuite/gcc.target/i386/pr89701-1.c  |  4 
> >  gcc/testsuite/gcc.target/i386/pr89701-2.c  |  4 
> >  gcc/testsuite/gcc.target/i386/pr89701-3.c  |  4 
> >  9 files changed, 28 insertions(+), 6 deletions(-)
> >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-10.c
> >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-11.c
> >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-12.c
> >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-8.c
> >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-9.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr89701-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr89701-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr89701-3.c
> >
> > diff --git a/gcc/common.opt b/gcc/common.opt
> > index a28ca13385a..02f2472959a 100644
> > --- a/gcc/common.opt
> > +++ b/gcc/common.opt
> > @@ -1886,7 +1886,7 @@ fcf-protection
> >  Common RejectNegative Alias(fcf-protection=,full)
> >
> >  fcf-protection=
> > -Common Joined RejectNegative Enum(cf_protection_level) 
> > Var(flag_cf_protection) Init(CF_NONE)
> > +Common Joined RejectNegative Enum(cf_protection_level) EnumSet 
> > Var(flag_cf_protection) Init(CF_NONE)
> >  -fcf-protection=[full|branch|return|none|check]Instrument 
> > functions with checks to verify jump/call/return control-flow transfer
> >  instructions have valid targets.
> >
> > @@ -1894,19 +1894,19 @@ Enum
> >  Name(cf_protection_level) Type(enum cf_protection_level) 
> > UnknownError(unknown Control-Flow Protection Level %qs)
> >
> >  EnumValue
> > -Enum(cf_protection_level) String(full) Value(CF_FULL)
> > +Enum(cf_protection_level) String(full) Value(CF_FULL) Set(1)
> >
> >  EnumValue
> > -Enum(cf_protection_level) String(branch) Value(CF_BRANCH)
> > +Enum(cf_protection_level) String(branch) Value(CF_BRANCH) Set(2)
> >
> >  EnumValue
> > -Enum(cf_protection_level) String(return) Value(CF_RETURN)
> > +Enum(cf_protection_level) String(return) Value(CF_RETURN) Set(3)
> >
> >  EnumValue
> > -Enum(cf_protection_level) String(check) Value(CF_CHECK)
> > +Enum(cf_protection_level) String(check) Value(CF_CHECK) Set(4)
> >
> >  EnumValue
> > -Enum(cf_protection_level) String(none) Value(CF_NONE)
> > +Enum(cf_protection_level) String(none) Value(CF_NONE) Set(1)
> >
> >  finstrument-functions
> >  Common Var(flag_instrument_function_entry_exit,1)
> > diff --git a/gcc/testsuite/c-c++-common/fcf-protection-10.c 
> > b/gcc/testsuite/c-c++-common/fcf-protection-10.c
> > new file mode 100644
> > index 000..b271d134e52
> > --- /dev/null
> > +++ b/gcc/testsuite/c-c++-common/fcf-protection-10.c
> > @@ -0,0 +1,2 @@
> > +/* { dg-do compile { target { "i?86-*-* x86_64-*-*" } } } */
> > +/* { dg-options "-fcf-protection=branch,check" } */
> > diff --git a/gcc/testsuite/c-c++-common/fcf-protection-11.c 
> > b/gcc/testsuite/c-c++-common/fcf-protection-11.c
> > new file mode 100644
> > index 000..2e566350ccd
> > --- /dev/null
> > +++ b/gcc/testsuite/c-c++-common/fcf-protection-11.c
> > @@ -0,0 +1,2 @@
> > +/* { dg-do compile { target { "i?86-*-* x86_64-*-*" } } } */
> > +/* { dg-options "-fcf-protection=branch,return" } */

Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification

2023-07-12 Thread Hongtao Liu via Gcc-patches

On Wed, Jul 12, 2023 at 9:37 PM Richard Biener via Gcc-patches
 wrote:
>
> The PRs ask for optimizing of
>
>   _1 = BIT_FIELD_REF ;
>   result_4 = BIT_INSERT_EXPR ;
>
> to a vector permutation.  The following implements this as
> match.pd pattern, improving code generation on x86_64.
>
> On the RTL level we face the issue that backend patterns inconsistently
> use vec_merge and vec_select of vec_concat to represent permutes.
>
> I think using a (supported) permute is almost always better
> than an extract plus insert, maybe excluding the case we extract
> element zero and that's aliased to a register that can be used
> directly for insertion (not sure how to query that).
>
> But this regresses for example gcc.target/i386/pr54855-8.c because PRE
> now realizes that
>
>   _1 = BIT_FIELD_REF ;
>   if (_1 > a_4(D))
> goto ; [50.00%]
>   else
> goto ; [50.00%]
>
>[local count: 536870913]:
>
>[local count: 1073741824]:
>   # iftmp.0_2 = PHI <_1(3), a_4(D)(2)>
>   x_5 = BIT_INSERT_EXPR ;
>
> is equal to
>
>[local count: 1073741824]:
>   _1 = BIT_FIELD_REF ;
>   if (_1 > a_4(D))
> goto ; [50.00%]
>   else
> goto ; [50.00%]
>
>[local count: 536870912]:
>   _7 = BIT_INSERT_EXPR ;
>
>[local count: 1073741824]:
>   # prephitmp_8 = PHI 
>
> and that no longer produces the desired maxsd operation at the RTL
The comparison is scalar mode, but operations in then_bb is
vector_mode, if_convert can't eliminate the condition any more(and
won't go into backend ix86_expand_sse_fp_minmax).
I think for ordered comparisons like _1 > a_4, it doesn't match
fmin/fmax, but match SSE MINSS/MAXSS since it alway returns the second
operand(not the other operand) when there's NONE.
> level (we fail to match .FMAX at the GIMPLE level earlier).
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu with regressions:
>
> FAIL: gcc.target/i386/pr54855-13.c scan-assembler-times vmaxsh[ t] 1
> FAIL: gcc.target/i386/pr54855-13.c scan-assembler-not vcomish[ t]
> FAIL: gcc.target/i386/pr54855-8.c scan-assembler-times maxsd 1
> FAIL: gcc.target/i386/pr54855-8.c scan-assembler-not movsd
> FAIL: gcc.target/i386/pr54855-9.c scan-assembler-times minss 1
> FAIL: gcc.target/i386/pr54855-9.c scan-assembler-not movss
>
> I think this is also PR88540 (the lack of min/max detection, not
> sure if the SSE min/max are suitable here)
>
> PR tree-optimization/94864
> PR tree-optimization/94865
> * match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
> for vector insertion from vector extraction.
>
> * gcc.target/i386/pr94864.c: New testcase.
> * gcc.target/i386/pr94865.c: Likewise.
> ---
>  gcc/match.pd| 25 +
>  gcc/testsuite/gcc.target/i386/pr94864.c | 13 +
>  gcc/testsuite/gcc.target/i386/pr94865.c | 13 +
>  3 files changed, 51 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 8543f777a28..8cc106049c4 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -7770,6 +7770,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>   wi::to_wide (@ipos) + isize))
>  (BIT_FIELD_REF @0 @rsize @rpos)
>
> +/* Simplify vector inserts of other vector extracts to a permute.  */
> +(simplify
> + (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
> + (if (VECTOR_TYPE_P (type)
> +  && types_match (@0, @1)
> +  && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
> +  && TYPE_VECTOR_SUBPARTS (type).is_constant ())
> +  (with
> +   {
> + unsigned HOST_WIDE_INT elsz
> +   = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (@1;
> + poly_uint64 relt = exact_div (tree_to_poly_uint64 (@rpos), elsz);
> + poly_uint64 ielt = exact_div (tree_to_poly_uint64 (@ipos), elsz);
> + unsigned nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> + vec_perm_builder builder;
> + builder.new_vector (nunits, nunits, 1);
> + for (unsigned i = 0; i < nunits; ++i)
> +   builder.quick_push (known_eq (ielt, i) ? nunits + relt : i);
> + vec_perm_indices sel (builder, 2, nunits);
> +   }
> +   (if (!VECTOR_MODE_P (TYPE_MODE (type))
> +   || can_vec_perm_const_p (TYPE_MODE (type), TYPE_MODE (type), sel, 
> false))
> +(vec_perm @0 @1 { vec_perm_indices_to_tree
> +(build_vector_type (ssizetype, nunits), sel); })
> +
>  (if (canonicalize_math_after_vectorization_p ())
>   (for fmas (FMA)
>(simplify
> diff --git a/gcc/testsuite/gcc.target/i386/pr94864.c 
> b/gcc/testsuite/gcc.target/i386/pr94864.c
> new file mode 100644
> index 000..69cb481fcfe
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr94864.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse2 -mno-avx" } */
> +
> +typedef double v2df __attribute__((vector_size(16)));
> +
> +v2df mo

Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification

2023-07-12 Thread Hongtao Liu via Gcc-patches

On Thu, Jul 13, 2023 at 10:47 AM Hongtao Liu  wrote:
>
> On Wed, Jul 12, 2023 at 9:37 PM Richard Biener via Gcc-patches
>  wrote:
> >
> > The PRs ask for optimizing of
> >
> >   _1 = BIT_FIELD_REF ;
> >   result_4 = BIT_INSERT_EXPR ;
> >
> > to a vector permutation.  The following implements this as
> > match.pd pattern, improving code generation on x86_64.
> >
> > On the RTL level we face the issue that backend patterns inconsistently
> > use vec_merge and vec_select of vec_concat to represent permutes.
> >
> > I think using a (supported) permute is almost always better
> > than an extract plus insert, maybe excluding the case we extract
> > element zero and that's aliased to a register that can be used
> > directly for insertion (not sure how to query that).
> >
> > But this regresses for example gcc.target/i386/pr54855-8.c because PRE
> > now realizes that
> >
> >   _1 = BIT_FIELD_REF ;
> >   if (_1 > a_4(D))
> > goto ; [50.00%]
> >   else
> > goto ; [50.00%]
> >
> >[local count: 536870913]:
> >
> >[local count: 1073741824]:
> >   # iftmp.0_2 = PHI <_1(3), a_4(D)(2)>
> >   x_5 = BIT_INSERT_EXPR ;
> >
> > is equal to
> >
> >[local count: 1073741824]:
> >   _1 = BIT_FIELD_REF ;
> >   if (_1 > a_4(D))
> > goto ; [50.00%]
> >   else
> > goto ; [50.00%]
> >
> >[local count: 536870912]:
> >   _7 = BIT_INSERT_EXPR ;
> >
> >[local count: 1073741824]:
> >   # prephitmp_8 = PHI 
> >
> > and that no longer produces the desired maxsd operation at the RTL
> The comparison is scalar mode, but operations in then_bb is
> vector_mode, if_convert can't eliminate the condition any more(and
> won't go into backend ix86_expand_sse_fp_minmax).
> I think for ordered comparisons like _1 > a_4, it doesn't match
> fmin/fmax, but match SSE MINSS/MAXSS since it alway returns the second
> operand(not the other operand) when there's NONE.
I mean NANs.
> > level (we fail to match .FMAX at the GIMPLE level earlier).
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu with regressions:
> >
> > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-times vmaxsh[ t] 1
> > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-not vcomish[ t]
> > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-times maxsd 1
> > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-not movsd
> > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-times minss 1
> > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-not movss
> >
> > I think this is also PR88540 (the lack of min/max detection, not
> > sure if the SSE min/max are suitable here)
> >
> > PR tree-optimization/94864
> > PR tree-optimization/94865
> > * match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
> > for vector insertion from vector extraction.
> >
> > * gcc.target/i386/pr94864.c: New testcase.
> > * gcc.target/i386/pr94865.c: Likewise.
> > ---
> >  gcc/match.pd| 25 +
> >  gcc/testsuite/gcc.target/i386/pr94864.c | 13 +
> >  gcc/testsuite/gcc.target/i386/pr94865.c | 13 +
> >  3 files changed, 51 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c
> >
> > diff --git a/gcc/match.pd b/gcc/match.pd
> > index 8543f777a28..8cc106049c4 100644
> > --- a/gcc/match.pd
> > +++ b/gcc/match.pd
> > @@ -7770,6 +7770,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> >   wi::to_wide (@ipos) + isize))
> >  (BIT_FIELD_REF @0 @rsize @rpos)
> >
> > +/* Simplify vector inserts of other vector extracts to a permute.  */
> > +(simplify
> > + (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
> > + (if (VECTOR_TYPE_P (type)
> > +  && types_match (@0, @1)
> > +  && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
> > +  && TYPE_VECTOR_SUBPARTS (type).is_constant ())
> > +  (with
> > +   {
> > + unsigned HOST_WIDE_INT elsz
> > +   = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (@1;
> > + poly_uint64 relt = exact_div (tree_to_poly_uint64 (@rpos), elsz);
> > + poly_uint64 ielt = exact_div (tree_to_poly_uint64 (@ipos), elsz);
> > + unsigned nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > + vec_perm_builder builder;
> > + builder.new_vector (nunits, nunits, 1);
> > + for (unsigned i = 0; i < nunits; ++i)
> > +   builder.quick_push (known_eq (ielt, i) ? nunits + relt : i);
> > + vec_perm_indices sel (builder, 2, nunits);
> > +   }
> > +   (if (!VECTOR_MODE_P (TYPE_MODE (type))
> > +   || can_vec_perm_const_p (TYPE_MODE (type), TYPE_MODE (type), sel, 
> > false))
> > +(vec_perm @0 @1 { vec_perm_indices_to_tree
> > +(build_vector_type (ssizetype, nunits), sel); 
> > })
> > +
> >  (if (canonicalize_math_after_vectorization_p ())
> >   (for fmas (FMA)
> >(simplify
> > diff --git a/gcc/testsuite/gcc.target/i386/pr94864.c 
> > b/gcc/te

Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification

2023-07-13 Thread Hongtao Liu via Gcc-patches

On Thu, Jul 13, 2023 at 2:32 PM Richard Biener  wrote:
>
> On Thu, 13 Jul 2023, Hongtao Liu wrote:
>
> > On Thu, Jul 13, 2023 at 10:47?AM Hongtao Liu  wrote:
> > >
> > > On Wed, Jul 12, 2023 at 9:37?PM Richard Biener via Gcc-patches
> > >  wrote:
> > > >
> > > > The PRs ask for optimizing of
> > > >
> > > >   _1 = BIT_FIELD_REF ;
> > > >   result_4 = BIT_INSERT_EXPR ;
> > > >
> > > > to a vector permutation.  The following implements this as
> > > > match.pd pattern, improving code generation on x86_64.
> > > >
> > > > On the RTL level we face the issue that backend patterns inconsistently
> > > > use vec_merge and vec_select of vec_concat to represent permutes.
> > > >
> > > > I think using a (supported) permute is almost always better
> > > > than an extract plus insert, maybe excluding the case we extract
> > > > element zero and that's aliased to a register that can be used
> > > > directly for insertion (not sure how to query that).
> > > >
> > > > But this regresses for example gcc.target/i386/pr54855-8.c because PRE
> > > > now realizes that
> > > >
> > > >   _1 = BIT_FIELD_REF ;
> > > >   if (_1 > a_4(D))
> > > > goto ; [50.00%]
> > > >   else
> > > > goto ; [50.00%]
> > > >
> > > >[local count: 536870913]:
> > > >
> > > >[local count: 1073741824]:
> > > >   # iftmp.0_2 = PHI <_1(3), a_4(D)(2)>
> > > >   x_5 = BIT_INSERT_EXPR ;
> > > >
> > > > is equal to
> > > >
> > > >[local count: 1073741824]:
> > > >   _1 = BIT_FIELD_REF ;
> > > >   if (_1 > a_4(D))
> > > > goto ; [50.00%]
> > > >   else
> > > > goto ; [50.00%]
> > > >
> > > >[local count: 536870912]:
> > > >   _7 = BIT_INSERT_EXPR ;
> > > >
> > > >[local count: 1073741824]:
> > > >   # prephitmp_8 = PHI 
> > > >
> > > > and that no longer produces the desired maxsd operation at the RTL
> > > The comparison is scalar mode, but operations in then_bb is
> > > vector_mode, if_convert can't eliminate the condition any more(and
> > > won't go into backend ix86_expand_sse_fp_minmax).
> > > I think for ordered comparisons like _1 > a_4, it doesn't match
> > > fmin/fmax, but match SSE MINSS/MAXSS since it alway returns the second
> > > operand(not the other operand) when there's NONE.
> > I mean NANs.
>
> Btw, I once tried to recognize MAX here at the GIMPLE level but
> while the x86 (vector) max insns are fine for x > y ? x : y we
> have no tree code or optab for exactly that, we have MAX_EXPR
> which behaves differently for NaN and .FMAX which is exactly IEEE
> which the x86 ISA isn't.
>
> I wonder if we thus should if-convert this on the GIMPLE level
> but to x > y ? x : y, thus a COND_EXPR?
COND_EXPR maps to movcc, for x86 it's expanded by
ix86_expand_fp_movcc which will try fp minmax detect.
It's probably ok.
>
> Richard.
>
> > > > level (we fail to match .FMAX at the GIMPLE level earlier).
> > > >
> > > > Bootstrapped and tested on x86_64-unknown-linux-gnu with regressions:
> > > >
> > > > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-times vmaxsh[ t] 1
> > > > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-not vcomish[ t]
> > > > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-times maxsd 1
> > > > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-not movsd
> > > > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-times minss 1
> > > > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-not movss
> > > >
> > > > I think this is also PR88540 (the lack of min/max detection, not
> > > > sure if the SSE min/max are suitable here)
> > > >
> > > > PR tree-optimization/94864
> > > > PR tree-optimization/94865
> > > > * match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
> > > > for vector insertion from vector extraction.
> > > >
> > > > * gcc.target/i386/pr94864.c: New testcase.
> > > > * gcc.target/i386/pr94865.c: Likewise.
> > > > ---
> > > >  gcc/match.pd| 25 +
> > > >  gcc/testsuite/gcc.target/i386/pr94864.c | 13 +
> > > >  gcc/testsuite/gcc.target/i386/pr94865.c | 13 +
> > > >  3 files changed, 51 insertions(+)
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c
> > > >
> > > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > > index 8543f777a28..8cc106049c4 100644
> > > > --- a/gcc/match.pd
> > > > +++ b/gcc/match.pd
> > > > @@ -7770,6 +7770,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> > > >   wi::to_wide (@ipos) + isize))
> > > >  (BIT_FIELD_REF @0 @rsize @rpos)
> > > >
> > > > +/* Simplify vector inserts of other vector extracts to a permute.  */
> > > > +(simplify
> > > > + (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
> > > > + (if (VECTOR_TYPE_P (type)
> > > > +  && types_match (@0, @1)
> > > > +  && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
> > > > +  && TYPE_VECTOR_SUBPARTS (type).is_constant ())
> > > > +  (with
>

Re: [PATCH 1/4] Support Intel AVX-VNNI-INT16

2023-07-16 Thread Hongtao Liu via Gcc-patches

On Thu, Jul 13, 2023 at 2:06 PM Haochen Jiang via Gcc-patches
 wrote:
>
> From: Kong Lingling 
>
> gcc/ChangeLog
>
> * common/config/i386/cpuinfo.h (get_available_features): Detect
> avxvnniint16.
> * common/config/i386/i386-common.cc
> (OPTION_MASK_ISA2_AVXVNNIINT16_SET): New.
> (OPTION_MASK_ISA2_AVXVNNIINT16_UNSET): Ditto.
> (ix86_handle_option): Handle -mavxvnniint16.
> * common/config/i386/i386-cpuinfo.h (enum processor_features):
> Add FEATURE_AVXVNNIINT16.
> * common/config/i386/i386-isas.h: Add ISA_NAME_TABLE_ENTRY for
> avxvnniint16.
> * config.gcc: Add avxvnniint16.h.
> * config/i386/avxvnniint16intrin.h: New file.
> * config/i386/cpuid.h (bit_AVXVNNIINT16): New.
> * config/i386/i386-builtin.def: Add new builtins.
> * config/i386/i386-c.cc (ix86_target_macros_internal): Define
> __AVXVNNIINT16__.
> * config/i386/i386-options.cc (isa2_opts): Add -mavxvnniint16.
> (ix86_valid_target_attribute_inner_p): Handle avxvnniint16intrin.h.
> * config/i386/i386-isa.def: Add DEF_PTA(AVXVNNIINT16).
> * config/i386/i386.opt: Add option -mavxvnniint16.
> * config/i386/immintrin.h: Include avxvnniint16.h.
> * config/i386/sse.md
> (vpdp_): New define_insn.
> * doc/extend.texi: Document avxvnniint16.
> * doc/invoke.texi: Document -mavxvnniint16.
> * doc/sourcebuild.texi: Document target avxvnniint16.
Ok.
>
> gcc/testsuite/ChangeLog
>
> * g++.dg/other/i386-2.C: Add -mavxvnniint16.
> * g++.dg/other/i386-3.C: Ditto.
> * gcc.target/i386/avx-check.h: Add avxvnniint16 check.
> * gcc.target/i386/sse-12.c: Add -mavxvnniint16.
> * gcc.target/i386/sse-13.c: Ditto.
> * gcc.target/i386/sse-14.c: Ditto.
> * gcc.target/i386/sse-22.c: Ditto.
> * gcc.target/i386/sse-23.c: Ditto.
> * gcc.target/i386/funcspec-56.inc: Add new target attribute.
> * lib/target-supports.exp
> (check_effective_target_avxvnniint16): New.
> * gcc.target/i386/avxvnniint16-1.c: Ditto.
> * gcc.target/i386/avxvnniint16-vpdpwusd-2.c: Ditto.
> * gcc.target/i386/avxvnniint16-vpdpwusds-2.c: Ditto.
> * gcc.target/i386/avxvnniint16-vpdpwsud-2.c: Ditto.
> * gcc.target/i386/avxvnniint16-vpdpwsuds-2.c: Ditto.
> * gcc.target/i386/avxvnniint16-vpdpwuud-2.c: Ditto.
> * gcc.target/i386/avxvnniint16-vpdpwuuds-2.c: Ditto.
>
> Co-authored-by: Haochen Jiang 
> ---
>  gcc/common/config/i386/cpuinfo.h  |   2 +
>  gcc/common/config/i386/i386-common.cc |  22 ++-
>  gcc/common/config/i386/i386-cpuinfo.h |   1 +
>  gcc/common/config/i386/i386-isas.h|   2 +
>  gcc/config.gcc|   2 +-
>  gcc/config/i386/avxvnniint16intrin.h  | 138 ++
>  gcc/config/i386/cpuid.h   |   1 +
>  gcc/config/i386/i386-builtin.def  |  14 ++
>  gcc/config/i386/i386-c.cc |   2 +
>  gcc/config/i386/i386-isa.def  |   1 +
>  gcc/config/i386/i386-options.cc   |   4 +-
>  gcc/config/i386/i386.opt  |   5 +
>  gcc/config/i386/immintrin.h   |   2 +
>  gcc/config/i386/sse.md|  32 
>  gcc/doc/extend.texi   |   5 +
>  gcc/doc/invoke.texi   |  10 +-
>  gcc/doc/sourcebuild.texi  |   3 +
>  gcc/testsuite/g++.dg/other/i386-2.C   |   2 +-
>  gcc/testsuite/g++.dg/other/i386-3.C   |   2 +-
>  gcc/testsuite/gcc.target/i386/avx-check.h |   3 +
>  .../gcc.target/i386/avxvnniint16-1.c  |  43 ++
>  .../gcc.target/i386/avxvnniint16-vpdpwsud-2.c |  71 +
>  .../i386/avxvnniint16-vpdpwsuds-2.c   |  72 +
>  .../gcc.target/i386/avxvnniint16-vpdpwusd-2.c |  71 +
>  .../i386/avxvnniint16-vpdpwusds-2.c   |  72 +
>  .../gcc.target/i386/avxvnniint16-vpdpwuud-2.c |  71 +
>  .../i386/avxvnniint16-vpdpwuuds-2.c   |  71 +
>  gcc/testsuite/gcc.target/i386/funcspec-56.inc |   2 +
>  gcc/testsuite/gcc.target/i386/sse-12.c|   2 +-
>  gcc/testsuite/gcc.target/i386/sse-13.c|   2 +-
>  gcc/testsuite/gcc.target/i386/sse-14.c|   2 +-
>  gcc/testsuite/gcc.target/i386/sse-22.c|   4 +-
>  gcc/testsuite/gcc.target/i386/sse-23.c|   2 +-
>  gcc/testsuite/lib/target-supports.exp |  12 ++
>  34 files changed, 735 insertions(+), 15 deletions(-)
>  create mode 100644 gcc/config/i386/avxvnniint16intrin.h
>  create mode 100644 gcc/testsuite/gcc.target/i386/avxvnniint16-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avxvnniint16-vpdpwsud-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avxvnniint16-vpdpwsuds-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i3

Re: [PATCH 3/4] Support Intel SHA512

2023-07-16 Thread Hongtao Liu via Gcc-patches

On Thu, Jul 13, 2023 at 2:06 PM Haochen Jiang via Gcc-patches
 wrote:
>
> gcc/ChangeLog:
>
> * common/config/i386/cpuinfo.h (get_available_features):
> Detect SHA512.
> * common/config/i386/i386-common.cc (OPTION_MASK_ISA2_SHA512_SET,
> OPTION_MASK_ISA2_SHA512_UNSET): New.
> (OPTION_MASK_ISA2_AVX_UNSET): Add SHA512.
> (ix86_handle_option): Handle -msha512.
> * common/config/i386/i386-cpuinfo.h (enum processor_features):
> Add FEATURE_SHA512.
> * common/config/i386/i386-isas.h: Add ISA_NAME_TABLE_ENTRY for
> sha512.
> * config.gcc: Add sha512intrin.h.
> * config/i386/cpuid.h (bit_SHA512): New.
> * config/i386/i386-builtin-types.def:
> Add DEF_FUNCTION_TYPE (V4DI, V4DI, V4DI, V2DI).
> * config/i386/i386-builtin.def (BDESC): Add new builtins.
> * config/i386/i386-c.cc (ix86_target_macros_internal): Define
> __SHA512__.
> * config/i386/i386-expand.cc (ix86_expand_args_builtin): Handle
> V4DI_FTYPE_V4DI_V4DI_V2DI and V4DI_FTYPE_V4DI_V2DI.
> * config/i386/i386-isa.def (SHA512): Add DEF_PTA(SHA512).
> * config/i386/i386-options.cc (isa2_opts): Add -msha512.
> (ix86_valid_target_attribute_inner_p): Handle sha512.
> * config/i386/i386.opt: Add option -msha512.
> * config/i386/immintrin.h: Include sha512intrin.h.
> * config/i386/sse.md (vsha512msg1): New define insn.
> (vsha512msg2): Ditto.
> (vsha512rnds2): Ditto.
> * doc/extend.texi: Document sha512.
> * doc/invoke.texi: Document -msha512.
> * doc/sourcebuild.texi: Document target sha512.
> * config/i386/sha512intrin.h: New file.
>
> gcc/testsuite/ChangeLog:
>
> * g++.dg/others/i386-2.C: Add -msha512.
> * g++.dg/others/i386-3.C: Ditto.
> * gcc.target/i386/funcspec-56.inc: Add new target attribute.
> * gcc.target/i386/sse-12.c: Add -msha512.
> * gcc.target/i386/sse-13.c: Ditto.
> * gcc.target/i386/sse-14.c: Ditto.
> * gcc.target/i386/sse-22.c: Add sha512.
> * gcc.target/i386/sse-23.c: Ditto.
> * lib/target-supports.exp (check_effective_target_sha512): New.
> * gcc.target/i386/sha512-1.c: New test.
> * gcc.target/i386/sha512-check.h: Ditto.
> * gcc.target/i386/sha512msg1-2.c: Ditto.
> * gcc.target/i386/sha512msg2-2.c: Ditto.
> * gcc.target/i386/sha512rnds2-2.c: Ditto.
Ok.
> ---
>  gcc/common/config/i386/cpuinfo.h  |  2 +
>  gcc/common/config/i386/i386-common.cc | 19 -
>  gcc/common/config/i386/i386-cpuinfo.h |  1 +
>  gcc/common/config/i386/i386-isas.h|  1 +
>  gcc/config.gcc|  2 +-
>  gcc/config/i386/cpuid.h   |  1 +
>  gcc/config/i386/i386-builtin-types.def|  3 +
>  gcc/config/i386/i386-builtin.def  |  5 ++
>  gcc/config/i386/i386-c.cc |  2 +
>  gcc/config/i386/i386-expand.cc|  2 +
>  gcc/config/i386/i386-isa.def  |  1 +
>  gcc/config/i386/i386-options.cc   |  4 +-
>  gcc/config/i386/i386.opt  | 10 +++
>  gcc/config/i386/immintrin.h   |  2 +
>  gcc/config/i386/sha512intrin.h| 64 ++
>  gcc/config/i386/sse.md| 40 +
>  gcc/doc/extend.texi   |  5 ++
>  gcc/doc/invoke.texi   | 10 ++-
>  gcc/doc/sourcebuild.texi  |  3 +
>  gcc/testsuite/g++.dg/other/i386-2.C   |  2 +-
>  gcc/testsuite/g++.dg/other/i386-3.C   |  2 +-
>  gcc/testsuite/gcc.target/i386/funcspec-56.inc |  2 +
>  gcc/testsuite/gcc.target/i386/sha512-1.c  | 18 
>  gcc/testsuite/gcc.target/i386/sha512-check.h  | 43 ++
>  gcc/testsuite/gcc.target/i386/sha512msg1-2.c  | 48 +++
>  gcc/testsuite/gcc.target/i386/sha512msg2-2.c  | 47 ++
>  gcc/testsuite/gcc.target/i386/sha512rnds2-2.c | 85 +++
>  gcc/testsuite/gcc.target/i386/sse-12.c|  2 +-
>  gcc/testsuite/gcc.target/i386/sse-13.c|  2 +-
>  gcc/testsuite/gcc.target/i386/sse-14.c|  2 +-
>  gcc/testsuite/gcc.target/i386/sse-22.c|  4 +-
>  gcc/testsuite/gcc.target/i386/sse-23.c|  2 +-
>  gcc/testsuite/lib/target-supports.exp | 14 +++
>  33 files changed, 436 insertions(+), 14 deletions(-)
>  create mode 100644 gcc/config/i386/sha512intrin.h
>  create mode 100644 gcc/testsuite/gcc.target/i386/sha512-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sha512-check.h
>  create mode 100644 gcc/testsuite/gcc.target/i386/sha512msg1-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sha512msg2-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sha512rnds2-2.c
>
> diff --git a/gcc/common/config/i386/cpuinfo.h 
> b/gcc/common/config/i386/cpuinfo.h
> index e5cdffe

Re: [PATCH 2/4] Support Intel SM3

2023-07-16 Thread Hongtao Liu via Gcc-patches

On Thu, Jul 13, 2023 at 2:04 PM Haochen Jiang via Gcc-patches
 wrote:
>
> gcc/ChangeLog:
>
> * common/config/i386/cpuinfo.h (get_available_features):
> Detect SM3.
> * common/config/i386/i386-common.cc (OPTION_MASK_ISA2_SM3_SET,
> OPTION_MASK_ISA2_SM3_UNSET): New.
> (OPTION_MASK_ISA2_AVX_UNSET): Add SM3.
> (ix86_handle_option): Handle -msm3.
> * common/config/i386/i386-cpuinfo.h (enum processor_features):
> Add FEATURE_SM3.
> * common/config/i386/i386-isas.h: Add ISA_NAME_TABLE_ENTRY for
> SM3.
> * config.gcc: Add sm3intrin.h
> * config/i386/cpuid.h (bit_SM3): New.
> * config/i386/i386-builtin-types.def:
> Add DEF_FUNCTION_TYPE (V4SI, V4SI, V4SI, V4SI, INT).
> * config/i386/i386-builtin.def (BDESC): Add new builtins.
> * config/i386/i386-c.cc (ix86_target_macros_internal): Define
> __SM3__.
> * config/i386/i386-expand.cc (ix86_expand_args_builtin): Handle
> V4SI_FTYPE_V4SI_V4SI_V4SI_INT.
> * config/i386/i386-isa.def (SM3): Add DEF_PTA(SM3).
> * config/i386/i386-options.cc (isa2_opts): Add -msm3.
> (ix86_valid_target_attribute_inner_p): Handle sm3.
> * config/i386/i386.opt: Add option -msm3.
> * config/i386/immintrin.h: Include sm3intrin.h.
> * config/i386/sse.md (vsm3msg1): New define insn.
> (vsm3msg2): Ditto.
> (vsm3rnds2): Ditto.
> * doc/extend.texi: Document sm3.
> * doc/invoke.texi: Document -msm3.
> * doc/sourcebuild.texi: Document target sm3.
> * config/i386/sm3intrin.h: New file.
>
> gcc/testsuite/ChangeLog:
>
> * g++.dg/other/i386-2.C: Add -msm3.
> * g++.dg/other/i386-3.C: Ditto.
> * gcc.target/i386/avx-1.c: Add new define for immediate.
> * gcc.target/i386/funcspec-56.inc: Add new target attribute.
> * gcc.target/i386/sse-12.c: Add -msm3.
> * gcc.target/i386/sse-13.c: Ditto.
> * gcc.target/i386/sse-14.c: Ditto.
> * gcc.target/i386/sse-22.c: Add sm3.
> * gcc.target/i386/sse-23.c: Ditto.
> * lib/target-supports.exp (check_effective_target_sm3): New.
> * gcc.target/i386/sm3-1.c: New test.
> * gcc.target/i386/sm3-check.h: Ditto.
> * gcc.target/i386/sm3msg1-2.c: Ditto.
> * gcc.target/i386/sm3msg2-2.c: Ditto.
> * gcc.target/i386/sm3rnds2-2.c: Ditto.
Ok.
> ---
>  gcc/common/config/i386/cpuinfo.h  |   2 +
>  gcc/common/config/i386/i386-common.cc |  20 +++-
>  gcc/common/config/i386/i386-cpuinfo.h |   1 +
>  gcc/common/config/i386/i386-isas.h|   1 +
>  gcc/config.gcc|   3 +-
>  gcc/config/i386/cpuid.h   |   1 +
>  gcc/config/i386/i386-builtin-types.def|   3 +
>  gcc/config/i386/i386-builtin.def  |   5 +
>  gcc/config/i386/i386-c.cc |   2 +
>  gcc/config/i386/i386-expand.cc|   1 +
>  gcc/config/i386/i386-isa.def  |   1 +
>  gcc/config/i386/i386-options.cc   |   2 +
>  gcc/config/i386/i386.opt  |   5 +
>  gcc/config/i386/immintrin.h   |   2 +
>  gcc/config/i386/sm3intrin.h   |  72 
>  gcc/config/i386/sse.md|  43 
>  gcc/doc/extend.texi   |   5 +
>  gcc/doc/invoke.texi   |   7 +-
>  gcc/doc/sourcebuild.texi  |   3 +
>  gcc/testsuite/g++.dg/other/i386-2.C   |   2 +-
>  gcc/testsuite/g++.dg/other/i386-3.C   |   2 +-
>  gcc/testsuite/gcc.target/i386/avx-1.c |   3 +
>  gcc/testsuite/gcc.target/i386/funcspec-56.inc |   2 +
>  gcc/testsuite/gcc.target/i386/sm3-1.c |  17 +++
>  gcc/testsuite/gcc.target/i386/sm3-check.h |  37 +++
>  gcc/testsuite/gcc.target/i386/sm3msg1-2.c |  54 +
>  gcc/testsuite/gcc.target/i386/sm3msg2-2.c |  57 ++
>  gcc/testsuite/gcc.target/i386/sm3rnds2-2.c| 104 ++
>  gcc/testsuite/gcc.target/i386/sse-12.c|   2 +-
>  gcc/testsuite/gcc.target/i386/sse-13.c|   5 +-
>  gcc/testsuite/gcc.target/i386/sse-14.c|   5 +-
>  gcc/testsuite/gcc.target/i386/sse-22.c|   7 +-
>  gcc/testsuite/gcc.target/i386/sse-23.c|   5 +-
>  gcc/testsuite/lib/target-supports.exp |  15 +++
>  34 files changed, 484 insertions(+), 12 deletions(-)
>  create mode 100644 gcc/config/i386/sm3intrin.h
>  create mode 100644 gcc/testsuite/gcc.target/i386/sm3-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sm3-check.h
>  create mode 100644 gcc/testsuite/gcc.target/i386/sm3msg1-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sm3msg2-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sm3rnds2-2.c
>
> diff --git a/gcc/common/config/i386/cpuinfo.h 
> b/gcc/common/config/i386/cpuinfo.h
> index

Re: [PATCH 4/4] Support Intel SM4

2023-07-16 Thread Hongtao Liu via Gcc-patches

On Thu, Jul 13, 2023 at 2:04 PM Haochen Jiang via Gcc-patches
 wrote:
>
> gcc/ChangeLog:
>
> * common/config/i386/cpuinfo.h (get_available_features):
> Detech SM4.
> * common/config/i386/i386-common.cc (OPTION_MASK_ISA2_SM4_SET,
> OPTION_MASK_ISA2_SM4_UNSET): New.
> (OPTION_MASK_ISA2_AVX_UNSET): Add SM4.
> (ix86_handle_option): Handle -msm4.
> * common/config/i386/i386-cpuinfo.h (enum processor_features):
> Add FEATURE_SM4.
> * common/config/i386/i386-isas.h: Add ISA_NAME_TABLE_ENTRY for
> sm4.
> * config.gcc: Add sm4intrin.h.
> * config/i386/cpuid.h (bit_SM4): New.
> * config/i386/i386-builtin.def (BDESC): Add new builtins.
> * config/i386/i386-c.cc (ix86_target_macros_internal): Define
> __SM4__.
> * config/i386/i386-isa.def (SM4): Add DEF_PTA(SM4).
> * config/i386/i386-options.cc (isa2_opts): Add -msm4.
> (ix86_valid_target_attribute_inner_p): Handle sm4.
> * config/i386/i386.opt: Add option -msm4.
> * config/i386/immintrin.h: Include sm4intrin.h
> * config/i386/sse.md (vsm4key4_): New define insn.
> (vsm4rnds4_): Ditto.
> * doc/extend.texi: Document sm4.
> * doc/invoke.texi: Document -msm4.
> * doc/sourcebuild.texi: Document target sm4.
> * config/i386/sm4intrin.h: New file.
>
> gcc/testsuite/ChangeLog:
>
> * g++.dg/other/i386-2.C: Add -msm4.
> * g++.dg/other/i386-3.C: Ditto.
> * gcc.target/i386/funcspec-56.inc: Add new target attribute.
> * gcc.target/i386/sse-12.c: Add -msm4.
> * gcc.target/i386/sse-13.c: Ditto.
> * gcc.target/i386/sse-14.c: Ditto.
> * gcc.target/i386/sse-22.c: Add sm4.
> * gcc.target/i386/sse-23.c: Ditto.
> * lib/target-supports.exp (check_effective_target_sm4): New.
> * gcc.target/i386/sm4-1.c: New test.
> * gcc.target/i386/sm4-check.h: Ditto.
> * gcc.target/i386/sm4key4-2.c: Ditto.
> * gcc.target/i386/sm4rnds4-2.c: Ditto.
Ok.
> ---
>  gcc/common/config/i386/cpuinfo.h  |   2 +
>  gcc/common/config/i386/i386-common.cc |  20 +-
>  gcc/common/config/i386/i386-cpuinfo.h |   1 +
>  gcc/common/config/i386/i386-isas.h|   1 +
>  gcc/config.gcc|   2 +-
>  gcc/config/i386/cpuid.h   |   1 +
>  gcc/config/i386/i386-builtin.def  |   6 +
>  gcc/config/i386/i386-c.cc |   2 +
>  gcc/config/i386/i386-isa.def  |   1 +
>  gcc/config/i386/i386-options.cc   |   4 +-
>  gcc/config/i386/i386.opt  |   5 +
>  gcc/config/i386/immintrin.h   |   2 +
>  gcc/config/i386/sm4intrin.h   |  70 +++
>  gcc/config/i386/sse.md|  26 +++
>  gcc/doc/extend.texi   |   5 +
>  gcc/doc/invoke.texi   |   9 +-
>  gcc/doc/sourcebuild.texi  |   3 +
>  gcc/testsuite/g++.dg/other/i386-2.C   |   2 +-
>  gcc/testsuite/g++.dg/other/i386-3.C   |   2 +-
>  gcc/testsuite/gcc.target/i386/funcspec-56.inc |   2 +
>  gcc/testsuite/gcc.target/i386/sm4-1.c |  20 ++
>  gcc/testsuite/gcc.target/i386/sm4-check.h | 183 ++
>  gcc/testsuite/gcc.target/i386/sm4key4-2.c |  14 ++
>  gcc/testsuite/gcc.target/i386/sm4rnds4-2.c|  14 ++
>  gcc/testsuite/gcc.target/i386/sse-12.c|   2 +-
>  gcc/testsuite/gcc.target/i386/sse-13.c|   2 +-
>  gcc/testsuite/gcc.target/i386/sse-14.c|   2 +-
>  gcc/testsuite/gcc.target/i386/sse-22.c|   4 +-
>  gcc/testsuite/gcc.target/i386/sse-23.c|   2 +-
>  gcc/testsuite/lib/target-supports.exp |  14 ++
>  30 files changed, 409 insertions(+), 14 deletions(-)
>  create mode 100644 gcc/config/i386/sm4intrin.h
>  create mode 100644 gcc/testsuite/gcc.target/i386/sm4-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sm4-check.h
>  create mode 100644 gcc/testsuite/gcc.target/i386/sm4key4-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sm4rnds4-2.c
>
> diff --git a/gcc/common/config/i386/cpuinfo.h 
> b/gcc/common/config/i386/cpuinfo.h
> index 0cfde3ebccd..f9434f038ea 100644
> --- a/gcc/common/config/i386/cpuinfo.h
> +++ b/gcc/common/config/i386/cpuinfo.h
> @@ -881,6 +881,8 @@ get_available_features (struct __processor_model 
> *cpu_model,
> set_feature (FEATURE_SM3);
>   if (eax & bit_SHA512)
> set_feature (FEATURE_SHA512);
> + if (eax & bit_SM4)
> +   set_feature (FEATURE_SM4);
> }
>if (avx512_usable)
> {
> diff --git a/gcc/common/config/i386/i386-common.cc 
> b/gcc/common/config/i386/i386-common.cc
> index 97c3cdfe5e1..610cabe52c1 100644
> --- a/gcc/common/config/i386/i386-common.cc
> +++ b/gcc/common/config/i386/i386-common.cc
> @@ -122,6 +122,7 @@ a

Re: [PATCH] Initial Lunar Lake, Arrow Lake and Arrow Lake S Support

2023-07-16 Thread Hongtao Liu via Gcc-patches

On Fri, Jul 14, 2023 at 10:55 AM Mo, Zewei via Gcc-patches
 wrote:
>
> Hi all,
>
> This patch is to add initial support for Lunar Lake, Arrow Lake and Arrow Lake
> S for GCC.
>
> This link of related information is listed below:
> https://www.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
>
> This has been tested on x86_64-pc-linux-gnu. Is this ok for trunk? Thank you.
Ok.
>
> gcc/ChangeLog:
>
> * common/config/i386/cpuinfo.h (get_intel_cpu): Handle Lunar Lake,
> Arrow Lake and Arrow Lake S.
> * common/config/i386/i386-common.cc:
> (processor_name): Add arrowlake.
> (processor_alias_table): Add arrow lake, arrow lake s and lunar
> lake.
> * common/config/i386/i386-cpuinfo.h (enum processor_subtypes):
> Add INTEL_COREI7_ARROWLAKE and INTEL_COREI7_ARROWLAKE_S.
> * config.gcc: Add -march=arrowlake and -march=arrowlake-s.
> * config/i386/driver-i386.cc (host_detect_local_cpu): Handle
> arrowlake-s.
> * config/i386/i386-options.cc (m_ARROWLAKE): New.
> (processor_cost_table): Add arrowlake.
> * config/i386/i386.h (enum processor_type):
> Add PROCESSOR_ARROWLAKE.
> * doc/extend.texi: Add arrowlake and arrowlake-s.
> * doc/invoke.texi: Ditto.
>
> gcc/testsuite/ChangeLog:
>
> * g++.target/i386/mv16.C: Add arrowlake and arrowlake-s.
> * gcc.target/i386/funcspec-56.inc: Handle new march.
> ---
>  gcc/common/config/i386/cpuinfo.h  | 18 
>  gcc/common/config/i386/i386-common.cc |  7 ++
>  gcc/common/config/i386/i386-cpuinfo.h |  2 +
>  gcc/config.gcc|  2 +-
>  gcc/config/i386/driver-i386.cc|  5 +-
>  gcc/config/i386/i386-c.cc |  7 ++
>  gcc/config/i386/i386-options.cc   |  2 +
>  gcc/config/i386/i386.h|  4 +
>  gcc/config/i386/x86-tune.def  | 92 +++
>  gcc/doc/extend.texi   |  6 ++
>  gcc/doc/invoke.texi   | 17 
>  gcc/testsuite/g++.target/i386/mv16.C  | 12 +++
>  gcc/testsuite/gcc.target/i386/funcspec-56.inc |  2 +
>  13 files changed, 135 insertions(+), 41 deletions(-)
>
> diff --git a/gcc/common/config/i386/cpuinfo.h 
> b/gcc/common/config/i386/cpuinfo.h
> index 159e5f03f0b..e6f1a0ac0a1 100644
> --- a/gcc/common/config/i386/cpuinfo.h
> +++ b/gcc/common/config/i386/cpuinfo.h
> @@ -579,6 +579,24 @@ get_intel_cpu (struct __processor_model *cpu_model,
>CHECK___builtin_cpu_is ("grandridge");
>cpu_model->__cpu_type = INTEL_GRANDRIDGE;
>break;
> +case 0xc5:
> +  /* Arrow Lake.  */
> +  cpu = "arrowlake";
> +  CHECK___builtin_cpu_is ("corei7");
> +  CHECK___builtin_cpu_is ("arrowlake");
> +  cpu_model->__cpu_type = INTEL_COREI7;
> +  cpu_model->__cpu_subtype = INTEL_COREI7_ARROWLAKE;
> +  break;
> +case 0xc6:
> +  /* Arrow Lake S.  */
> +case 0xbd:
> +  /* Lunar Lake.  */
> +  cpu = "arrowlake-s";
> +  CHECK___builtin_cpu_is ("corei7");
> +  CHECK___builtin_cpu_is ("arrowlake-s");
> +  cpu_model->__cpu_type = INTEL_COREI7;
> +  cpu_model->__cpu_subtype = INTEL_COREI7_ARROWLAKE_S;
> +  break;
>  case 0x17:
>  case 0x1d:
>/* Penryn.  */
> diff --git a/gcc/common/config/i386/i386-common.cc 
> b/gcc/common/config/i386/i386-common.cc
> index 9b45ad61239..541f1441db8 100644
> --- a/gcc/common/config/i386/i386-common.cc
> +++ b/gcc/common/config/i386/i386-common.cc
> @@ -2044,6 +2044,7 @@ const char *const processor_names[] =
>"alderlake",
>"rocketlake",
>"graniterapids",
> +  "arrowlake",
>"intel",
>"lujiazui",
>"geode",
> @@ -2167,6 +2168,12 @@ const pta processor_alias_table[] =
>  M_CPU_SUBTYPE (INTEL_COREI7_ALDERLAKE), P_PROC_AVX2},
>{"graniterapids", PROCESSOR_GRANITERAPIDS, CPU_HASWELL, PTA_GRANITERAPIDS,
>  M_CPU_SUBTYPE (INTEL_COREI7_GRANITERAPIDS), P_PROC_AVX512F},
> +  {"arrowlake", PROCESSOR_ARROWLAKE, CPU_HASWELL, PTA_ARROWLAKE,
> +M_CPU_SUBTYPE (INTEL_COREI7_ARROWLAKE), P_PROC_AVX2},
> +  {"arrowlake-s", PROCESSOR_ARROWLAKE, CPU_HASWELL, PTA_ARROWLAKE_S,
> +M_CPU_SUBTYPE (INTEL_COREI7_ARROWLAKE_S), P_PROC_AVX2},
> +  {"lunarlake", PROCESSOR_ARROWLAKE, CPU_HASWELL, PTA_ARROWLAKE_S,
> +M_CPU_SUBTYPE (INTEL_COREI7_ARROWLAKE_S), P_PROC_AVX2},
>{"bonnell", PROCESSOR_BONNELL, CPU_ATOM, PTA_BONNELL,
>  M_CPU_TYPE (INTEL_BONNELL), P_PROC_SSSE3},
>{"atom", PROCESSOR_BONNELL, CPU_ATOM, PTA_BONNELL,
> diff --git a/gcc/common/config/i386/i386-cpuinfo.h 
> b/gcc/common/config/i386/i386-cpuinfo.h
> index e6385dd56a3..b371fb792ec 100644
> --- a/gcc/common/config/i386/i386-cpuinfo.h
> +++ b/gcc/common/config/i386/i386-cpuinfo.h
> @@ -98,6 +98,8 @@ enum processor_subtypes
>ZHAOXIN_FAM7H_LUJIAZUI,
>AMDFAM19H_ZNVER4

Re: [PATCH] x86: slightly enhance "vec_dupv2df"

2023-07-16 Thread Hongtao Liu via Gcc-patches

On Fri, Jul 14, 2023 at 5:40 PM Jan Beulich via Gcc-patches
 wrote:
>
> Introduce a new alternative permitting all 32 registers to be used as
> source without AVX512VL, by broadcasting to the full 512 bits in that
> case. (The insn would also permit all registers to be used as
> destination, but V2DFmode doesn't.)
The patch looks technically ok, but considering we don't have a real
CPU with only AVX512F but no AVX512VL, these optimisations for AVX512F
only don't make much sense, but rather increase the burden for
maintenance.
(For now, AVX512VL+AVX512CD+AVX512BW+AVX512DQ is a base set after
skylake-avx512, users are more likely to use -march=$PROCESSOR for
avx512.)
For this(and those previous AVX512F only) optimised patch, I think
it's helpful to help understand the pattern, so I'll approve of this
patch. But I hope we don't spend too much time on such optimisations
(unless there is an AVX512F only processor).
>
> gcc/
>
> * config/i386/sse.md (vec_dupv2df): Add new AVX512F
> alternative. Move AVX512VL part of condition to new "enabled"
> attribute.
> ---
> Because of the V2DF restriction, in principle the new source constraint
> could also omit 'm'.
>
> Can't the latter two of the original alternatives be folded, by using
> Yvm instead of xm/vm?
I think yes.
>
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -13761,18 +13761,27 @@
> (set_attr "mode" "DF,DF,V1DF,V1DF,V1DF,V2DF,V1DF,V1DF,V1DF")])
>
>  (define_insn "vec_dupv2df"
> -  [(set (match_operand:V2DF 0 "register_operand" "=x,x,v")
> +  [(set (match_operand:V2DF 0 "register_operand" "=x,x,v,v")
> (vec_duplicate:V2DF
> - (match_operand:DF 1 "nonimmediate_operand" " 0,xm,vm")))]
> -  "TARGET_SSE2 && "
> + (match_operand:DF 1 "nonimmediate_operand" "0,xm,vm,vm")))]
> +  "TARGET_SSE2"
>"@
> unpcklpd\t%0, %0
> %vmovddup\t{%1, %0|%0, %1}
> -   vmovddup\t{%1, %0|%0, %1}"
> -  [(set_attr "isa" "noavx,sse3,avx512vl")
> -   (set_attr "type" "sselog1")
> -   (set_attr "prefix" "orig,maybe_vex,evex")
> -   (set_attr "mode" "V2DF,DF,DF")])
> +   vmovddup\t{%1, %0|%0, %1}
> +   vbroadcastsd\t{%1, }%g0{|, %1}"
> +  [(set_attr "isa" "noavx,sse3,avx512vl,*")
> +   (set_attr "type" "sselog1,ssemov,ssemov,ssemov")
> +   (set_attr "prefix" "orig,maybe_vex,evex,evex")
> +   (set_attr "mode" "V2DF,DF,DF,V8DF")
> +   (set (attr "enabled")
> +   (cond [(eq_attr "alternative" "3")
> +(symbol_ref "TARGET_AVX512F && !TARGET_AVX512VL
> + && !TARGET_PREFER_AVX256")
> +  (match_test "")
> +(const_string "*")
> + ]
> + (symbol_ref "false")))])
>
>  (define_insn "vec_concatv2df"
>[(set (match_operand:V2DF 0 "register_operand" "=x,x,v,x,x, v,x,x")



-- 
BR,
Hongtao

Re: [PATCH] x86: avoid maybe_gen_...()

2023-07-16 Thread Hongtao Liu via Gcc-patches

On Fri, Jul 14, 2023 at 5:42 PM Jan Beulich via Gcc-patches
 wrote:
>
> In the (however unlikely) event that no insn can be found for the
> requested mode, using maybe_gen_...() without (really) checking its
> result for being a null rtx would lead to silent bad code generation.
Ok.
>
> gcc/
>
> * config/i386/i386-expand.cc (ix86_expand_vector_init_duplicate):
> Use gen_vec_set_0.
> (ix86_expand_vector_extract): Use gen_vec_extract_lo /
> gen_vec_extract_hi.
> (expand_vec_perm_broadcast_1): Use gen_vec_interleave_high /
> gen_vec_interleave_low. Rename local variable.
>
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -15456,8 +15456,7 @@ ix86_expand_vector_init_duplicate (bool
> {
>   tmp1 = force_reg (GET_MODE_INNER (mode), val);
>   tmp2 = gen_reg_rtx (mode);
> - emit_insn (maybe_gen_vec_set_0 (mode, tmp2,
> - CONST0_RTX (mode), tmp1));
> + emit_insn (gen_vec_set_0 (mode, tmp2, CONST0_RTX (mode), tmp1));
>   tmp1 = gen_lowpart (mode, tmp2);
> }
>   else
> @@ -17419,9 +17418,9 @@ ix86_expand_vector_extract (bool mmx_ok,
>  ? gen_reg_rtx (V16HFmode)
>  : gen_reg_rtx (V16BFmode));
>   if (elt < 16)
> -   emit_insn (maybe_gen_vec_extract_lo (mode, tmp, vec));
> +   emit_insn (gen_vec_extract_lo (mode, tmp, vec));
>   else
> -   emit_insn (maybe_gen_vec_extract_hi (mode, tmp, vec));
> +   emit_insn (gen_vec_extract_hi (mode, tmp, vec));
>   ix86_expand_vector_extract (false, target, tmp, elt & 15);
>   return;
> }
> @@ -17435,9 +17434,9 @@ ix86_expand_vector_extract (bool mmx_ok,
>  ? gen_reg_rtx (V8HFmode)
>  : gen_reg_rtx (V8BFmode));
>   if (elt < 8)
> -   emit_insn (maybe_gen_vec_extract_lo (mode, tmp, vec));
> +   emit_insn (gen_vec_extract_lo (mode, tmp, vec));
>   else
> -   emit_insn (maybe_gen_vec_extract_hi (mode, tmp, vec));
> +   emit_insn (gen_vec_extract_hi (mode, tmp, vec));
>   ix86_expand_vector_extract (false, target, tmp, elt & 7);
>   return;
> }
> @@ -22501,18 +22500,18 @@ expand_vec_perm_broadcast_1 (struct expa
>if (d->testing_p)
> return true;
>
> -  rtx (*maybe_gen) (machine_mode, int, rtx, rtx, rtx);
> +  rtx (*gen_interleave) (machine_mode, int, rtx, rtx, rtx);
>if (elt >= nelt2)
> {
> - maybe_gen = maybe_gen_vec_interleave_high;
> + gen_interleave = gen_vec_interleave_high;
>   elt -= nelt2;
> }
>else
> -   maybe_gen = maybe_gen_vec_interleave_low;
> +   gen_interleave = gen_vec_interleave_low;
>nelt2 /= 2;
>
>dest = gen_reg_rtx (vmode);
> -  emit_insn (maybe_gen (vmode, 1, dest, op0, op0));
> +  emit_insn (gen_interleave (vmode, 1, dest, op0, op0));
>
>vmode = V4SImode;
>op0 = gen_lowpart (vmode, dest);



-- 
BR,
Hongtao

Re: [PATCH] x86: slightly enhance "vec_dupv2df"

2023-07-16 Thread Hongtao Liu via Gcc-patches

On Mon, Jul 17, 2023 at 2:20 PM Jan Beulich  wrote:
>
> On 17.07.2023 08:09, Hongtao Liu wrote:
> > On Fri, Jul 14, 2023 at 5:40 PM Jan Beulich via Gcc-patches
> >  wrote:
> >>
> >> Introduce a new alternative permitting all 32 registers to be used as
> >> source without AVX512VL, by broadcasting to the full 512 bits in that
> >> case. (The insn would also permit all registers to be used as
> >> destination, but V2DFmode doesn't.)
> > The patch looks technically ok, but considering we don't have a real
> > CPU with only AVX512F but no AVX512VL, these optimisations for AVX512F
> > only don't make much sense, but rather increase the burden for
> > maintenance.
>
> Well, I can of course ignore this aspect going forward. It seemed
> relevant to me for two reasons: For one, I expect I'm not the only
> one to simply pass -mavx512f when caring about basic AVX512. And
You're not, AFAIK, some users used target("avx512f") for FMV.  But I'd
rather persuade them to use target ("arch=x86-64-v4") rather than
optimizing for AVX512F only.
> then isn't the Knights line of processors (Xeon Phi) lacking VL?
> (I'm getting the impression though that this line is discontinued
> now.)
KNL is deprecated, and yes it doesn't support AVX512VL.
>
> >> Can't the latter two of the original alternatives be folded, by using
> >> Yvm instead of xm/vm?
> > I think yes.
>
> I guess I'll make a follow-on patch for that then.
>
> Jan



-- 
BR,
Hongtao

Re: [PATCH] Add peephole to eliminate redundant comparison after cmpccxadd.

2023-07-16 Thread Hongtao Liu via Gcc-patches

Ping.

On Tue, Jul 11, 2023 at 5:16 PM liuhongt via Gcc-patches
 wrote:
>
> Similar like we did for CMPXCHG, but extended to all
> ix86_comparison_int_operator since CMPCCXADD set EFLAGS exactly same
> as CMP.
>
> When operand order in CMP insn is same as that in CMPCCXADD,
> CMP insn can be eliminated directly.
>
> When operand order is swapped in CMP insn, only optimize
> cmpccxadd + cmpl + jcc/setcc to cmpccxadd + jcc/setcc when FLAGS_REG is dead
> after jcc/setcc plus adjusting code for jcc/setcc.
>
> gcc/ChangeLog:
>
> PR target/110591
> * config/i386/sync.md (cmpccxadd_): Adjust the pattern
> to explicitly set FLAGS_REG like *cmp_1, also add extra
> 3 define_peephole2 after the pattern.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr110591.c: New test.
> * gcc.target/i386/pr110591-2.c: New test.
> ---
>  gcc/config/i386/sync.md| 160 -
>  gcc/testsuite/gcc.target/i386/pr110591-2.c |  90 
>  gcc/testsuite/gcc.target/i386/pr110591.c   |  66 +
>  3 files changed, 315 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr110591-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr110591.c
>
> diff --git a/gcc/config/i386/sync.md b/gcc/config/i386/sync.md
> index e1fa1504deb..e84226cf895 100644
> --- a/gcc/config/i386/sync.md
> +++ b/gcc/config/i386/sync.md
> @@ -1093,7 +1093,9 @@ (define_insn "cmpccxadd_"
>   UNSPECV_CMPCCXADD))
> (set (match_dup 1)
> (unspec_volatile:SWI48x [(const_int 0)] UNSPECV_CMPCCXADD))
> -   (clobber (reg:CC FLAGS_REG))]
> +   (set (reg:CC FLAGS_REG)
> +   (compare:CC (match_dup 1)
> +   (match_dup 2)))]
>"TARGET_CMPCCXADD && TARGET_64BIT"
>  {
>char buf[128];
> @@ -1105,3 +1107,159 @@ (define_insn "cmpccxadd_"
>output_asm_insn (buf, operands);
>return "";
>  })
> +
> +(define_peephole2
> +  [(set (match_operand:SWI48x 0 "register_operand")
> +   (match_operand:SWI48x 1 "x86_64_general_operand"))
> +   (parallel [(set (match_dup 0)
> +  (unspec_volatile:SWI48x
> +[(match_operand:SWI48x 2 "memory_operand")
> + (match_dup 0)
> + (match_operand:SWI48x 3 "register_operand")
> + (match_operand:SI 4 "const_int_operand")]
> +UNSPECV_CMPCCXADD))
> + (set (match_dup 2)
> +  (unspec_volatile:SWI48x [(const_int 0)] UNSPECV_CMPCCXADD))
> + (set (reg:CC FLAGS_REG)
> +  (compare:CC (match_dup 2)
> +  (match_dup 0)))])
> +   (set (reg FLAGS_REG)
> +   (compare (match_operand:SWI48x 5 "register_operand")
> +(match_operand:SWI48x 6 "x86_64_general_operand")))]
> +  "TARGET_CMPCCXADD && TARGET_64BIT
> +   && rtx_equal_p (operands[0], operands[5])
> +   && rtx_equal_p (operands[1], operands[6])"
> +  [(set (match_dup 0)
> +   (match_dup 1))
> +   (parallel [(set (match_dup 0)
> +  (unspec_volatile:SWI48x
> +[(match_dup 2)
> + (match_dup 0)
> + (match_dup 3)
> + (match_dup 4)]
> +UNSPECV_CMPCCXADD))
> + (set (match_dup 2)
> +  (unspec_volatile:SWI48x [(const_int 0)] UNSPECV_CMPCCXADD))
> + (set (reg:CC FLAGS_REG)
> +  (compare:CC (match_dup 2)
> +  (match_dup 0)))])
> +   (set (match_dup 7)
> +   (match_op_dup 8
> + [(match_dup 9) (const_int 0)]))])
> +
> +(define_peephole2
> +  [(set (match_operand:SWI48x 0 "register_operand")
> +   (match_operand:SWI48x 1 "x86_64_general_operand"))
> +   (parallel [(set (match_dup 0)
> +  (unspec_volatile:SWI48x
> +[(match_operand:SWI48x 2 "memory_operand")
> + (match_dup 0)
> + (match_operand:SWI48x 3 "register_operand")
> + (match_operand:SI 4 "const_int_operand")]
> +UNSPECV_CMPCCXADD))
> + (set (match_dup 2)
> +  (unspec_volatile:SWI48x [(const_int 0)] UNSPECV_CMPCCXADD))
> + (set (reg:CC FLAGS_REG)
> +  (compare:CC (match_dup 2)
> +  (match_dup 0)))])
> +   (set (reg FLAGS_REG)
> +   (compare (match_operand:SWI48x 5 "register_operand")
> +(match_operand:SWI48x 6 "x86_64_general_operand")))
> +   (set (match_operand:QI 7 "nonimmediate_operand")
> +   (match_operator:QI 8 "ix86_comparison_int_operator"
> + [(reg FLAGS_REG) (const_int 0)]))]
> +  "TARGET_CMPCCXADD && TARGET_64BIT
> +   && rtx_equal_p (operands[0], operands[6])
> +   && rtx_equal_p (operands[1], operands[5])
> +   && peep2_regno_dead_p (4, FLAGS_REG)"
> +  [(set (match_dup 0)
> +   (match_dup 1))
> +   (parallel [(set (match_dup 0

Re: [PATCH 1/2] [i386] Support type _Float16/__bf16 independent of SSE2.

2023-07-17 Thread Hongtao Liu via Gcc-patches

I'd like to ping for this patch (only patch 1/2, for patch 2/2, I
think that may not be necessary).

On Mon, May 15, 2023 at 9:20 AM Hongtao Liu  wrote:
>
> ping.
>
> On Fri, Apr 21, 2023 at 9:55 PM liuhongt  wrote:
> >
> > > > +  if (!TARGET_SSE2)
> > > > +{
> > > > +  if (c_dialect_cxx ()
> > > > +   && cxx_dialect > cxx20)
> > >
> > > Formatting, both conditions are short, so just put them on one line.
> > Changed.
> >
> > > But for the C++23 macros, more importantly I think we really should
> > > also in ix86_target_macros_internal add
> > >   if (c_dialect_cxx ()
> > >   && cxx_dialect > cxx20
> > >   && (isa_flag & OPTION_MASK_ISA_SSE2))
> > > {
> > >   def_or_undef (parse_in, "__STDCPP_FLOAT16_T__");
> > >   def_or_undef (parse_in, "__STDCPP_BFLOAT16_T__");
> > > }
> > > plus associated libstdc++ changes.  It can be done incrementally though.
> > Added in PATCH 2/2
> >
> > > > +  if (flag_building_libgcc)
> > > > + {
> > > > +   /* libbid uses __LIBGCC_HAS_HF_MODE__ and __LIBGCC_HAS_BF_MODE__
> > > > +  to check backend support of _Float16 and __bf16 type.  */
> > >
> > > That is actually the case only for HFmode, but not for BFmode right now.
> > > So, we need further work.  One is to add the BFmode support in there,
> > > and another one is make sure the _Float16 <-> _Decimal* and __bf16 <->
> > > _Decimal* conversions are compiled in also if not -msse2 by default.
> > > One way to do that is wrap the HF and BF mode related functions on x86
> > > #ifndef __SSE2__ into the pragmas like intrin headers use (but then
> > > perhaps we don't need to undef this stuff here), another is not provide
> > > the hf/bf support in that case from the TUs where they are provided now,
> > > but from a different one which would be compiled with -msse2.
> > Add CFLAGS-_hf_to_sd.c += -msse2, similar for other files in libbid, just 
> > like
> > we did before for HFtype softfp. Then no need to undef libgcc macros.
> >
> > > >/* We allowed the user to turn off SSE for kernel mode.  Don't crash 
> > > > if
> > > >   some less clueful developer tries to use floating-point anyway.  
> > > > */
> > > > -  if (needed_sseregs && !TARGET_SSE)
> > > > +  if (needed_sseregs
> > > > +  && (!TARGET_SSE
> > > > +   || (VALID_SSE2_TYPE_MODE (mode)
> > > > +   && !TARGET_SSE2)))
> > >
> > > Formatting, no need to split this up that much.
> > >   if (needed_sseregs
> > >   && (!TARGET_SSE
> > >   || (VALID_SSE2_TYPE_MODE (mode) && !TARGET_SSE2)))
> > > or even better
> > >   if (needed_sseregs
> > >   && (!TARGET_SSE || (VALID_SSE2_TYPE_MODE (mode) && !TARGET_SSE2)))
> > > will do it.
> > Changed.
> >
> > > Instead of this, just use
> > >   if (!float16_type_node)
> > > {
> > >   float16_type_node = ix86_float16_type_node;
> > >   callback (float16_type_node);
> > >   float16_type_node = NULL_TREE;
> > > }
> > >   if (!bfloat16_type_node)
> > > {
> > >   bfloat16_type_node = ix86_bf16_type_node;
> > >   callback (bfloat16_type_node);
> > >   bfloat16_type_node = NULL_TREE;
> > > }
> > Changed.
> >
> >
> > > > +static const char *
> > > > +ix86_invalid_conversion (const_tree fromtype, const_tree totype)
> > > > +{
> > > > +  if (element_mode (fromtype) != element_mode (totype))
> > > > +{
> > > > +  /* Do no allow conversions to/from BFmode/HFmode scalar types
> > > > +  when TARGET_SSE2 is not available.  */
> > > > +  if ((TYPE_MODE (fromtype) == BFmode
> > > > +|| TYPE_MODE (fromtype) == HFmode)
> > > > +   && !TARGET_SSE2)
> > >
> > > First of all, not really sure if this should be purely about scalar
> > > modes, not also complex and vector modes involving those inner modes.
> > > Because complex or vector modes with BF/HF elements will be without
> > > TARGET_SSE2 for sure lowered into scalar code and that can't be handled
> > > either.
> > > So if (!TARGET_SSE2 && GET_MODE_INNER (TYPE_MODE (fromtype)) == BFmode)
> > > or even better
> > > if (!TARGET_SSE2 && element_mode (fromtype) == BFmode)
> > > ?
> > > Or even better remember the 2 modes above into machine_mode temporaries
> > > and just use those in the != comparison and for the checks?
> > >
> > > Also, I think it is weird to tell user %<__bf16%> or %<_Float16%> when
> > > we know which one it is.  Just return separate messages?
> > Changed.
> >
> > > > +  /* Reject all single-operand operations on BFmode/HFmode except for &
> > > > + when TARGET_SSE2 is not available.  */
> > > > +  if ((element_mode (type) == BFmode || element_mode (type) == HFmode)
> > > > +  && !TARGET_SSE2 && op != ADDR_EXPR)
> > > > +return N_("operation not permitted on type %<__bf16%> "
> > > > +   "or %<_Float16%> without option %<-msse2%>");
> > >
> > > Similarly.  Also, check !TARGET_SSE2 first as inexpensive one.
> > Changed.
> >
> >
> > Bootstrapped and regtested on

Re: [PATCH 1/2] [i386] Support type _Float16/__bf16 independent of SSE2.

2023-07-18 Thread Hongtao Liu via Gcc-patches

On Mon, Jul 17, 2023 at 7:38 PM Uros Bizjak  wrote:
>
> On Mon, Jul 17, 2023 at 10:28 AM Hongtao Liu  wrote:
> >
> > I'd like to ping for this patch (only patch 1/2, for patch 2/2, I
> > think that may not be necessary).
> >
> > On Mon, May 15, 2023 at 9:20 AM Hongtao Liu  wrote:
> > >
> > > ping.
> > >
> > > On Fri, Apr 21, 2023 at 9:55 PM liuhongt  wrote:
> > > >
> > > > > > +  if (!TARGET_SSE2)
> > > > > > +{
> > > > > > +  if (c_dialect_cxx ()
> > > > > > +   && cxx_dialect > cxx20)
> > > > >
> > > > > Formatting, both conditions are short, so just put them on one line.
> > > > Changed.
> > > >
> > > > > But for the C++23 macros, more importantly I think we really should
> > > > > also in ix86_target_macros_internal add
> > > > >   if (c_dialect_cxx ()
> > > > >   && cxx_dialect > cxx20
> > > > >   && (isa_flag & OPTION_MASK_ISA_SSE2))
> > > > > {
> > > > >   def_or_undef (parse_in, "__STDCPP_FLOAT16_T__");
> > > > >   def_or_undef (parse_in, "__STDCPP_BFLOAT16_T__");
> > > > > }
> > > > > plus associated libstdc++ changes.  It can be done incrementally 
> > > > > though.
> > > > Added in PATCH 2/2
> > > >
> > > > > > +  if (flag_building_libgcc)
> > > > > > + {
> > > > > > +   /* libbid uses __LIBGCC_HAS_HF_MODE__ and 
> > > > > > __LIBGCC_HAS_BF_MODE__
> > > > > > +  to check backend support of _Float16 and __bf16 type.  */
> > > > >
> > > > > That is actually the case only for HFmode, but not for BFmode right 
> > > > > now.
> > > > > So, we need further work.  One is to add the BFmode support in there,
> > > > > and another one is make sure the _Float16 <-> _Decimal* and __bf16 <->
> > > > > _Decimal* conversions are compiled in also if not -msse2 by default.
> > > > > One way to do that is wrap the HF and BF mode related functions on x86
> > > > > #ifndef __SSE2__ into the pragmas like intrin headers use (but then
> > > > > perhaps we don't need to undef this stuff here), another is not 
> > > > > provide
> > > > > the hf/bf support in that case from the TUs where they are provided 
> > > > > now,
> > > > > but from a different one which would be compiled with -msse2.
> > > > Add CFLAGS-_hf_to_sd.c += -msse2, similar for other files in libbid, 
> > > > just like
> > > > we did before for HFtype softfp. Then no need to undef libgcc macros.
> > > >
> > > > > >/* We allowed the user to turn off SSE for kernel mode.  Don't 
> > > > > > crash if
> > > > > >   some less clueful developer tries to use floating-point 
> > > > > > anyway.  */
> > > > > > -  if (needed_sseregs && !TARGET_SSE)
> > > > > > +  if (needed_sseregs
> > > > > > +  && (!TARGET_SSE
> > > > > > +   || (VALID_SSE2_TYPE_MODE (mode)
> > > > > > +   && !TARGET_SSE2)))
> > > > >
> > > > > Formatting, no need to split this up that much.
> > > > >   if (needed_sseregs
> > > > >   && (!TARGET_SSE
> > > > >   || (VALID_SSE2_TYPE_MODE (mode) && !TARGET_SSE2)))
> > > > > or even better
> > > > >   if (needed_sseregs
> > > > >   && (!TARGET_SSE || (VALID_SSE2_TYPE_MODE (mode) && 
> > > > > !TARGET_SSE2)))
> > > > > will do it.
> > > > Changed.
> > > >
> > > > > Instead of this, just use
> > > > >   if (!float16_type_node)
> > > > > {
> > > > >   float16_type_node = ix86_float16_type_node;
> > > > >   callback (float16_type_node);
> > > > >   float16_type_node = NULL_TREE;
> > > > > }
> > > > >   if (!bfloat16_type_node)
> > > > > {
> > > > >   bfloat16_type_node = ix86_bf16_type_node;
> > > > >   callback (bfloat16_type_node);
> > > > >   bfloat16_type_node = NULL_TREE;
> > > > > }
> > > > Changed.
> > > >
> > > >
> > > > > > +static const char *
> > > > > > +ix86_invalid_conversion (const_tree fromtype, const_tree totype)
> > > > > > +{
> > > > > > +  if (element_mode (fromtype) != element_mode (totype))
> > > > > > +{
> > > > > > +  /* Do no allow conversions to/from BFmode/HFmode scalar types
> > > > > > +  when TARGET_SSE2 is not available.  */
> > > > > > +  if ((TYPE_MODE (fromtype) == BFmode
> > > > > > +|| TYPE_MODE (fromtype) == HFmode)
> > > > > > +   && !TARGET_SSE2)
> > > > >
> > > > > First of all, not really sure if this should be purely about scalar
> > > > > modes, not also complex and vector modes involving those inner modes.
> > > > > Because complex or vector modes with BF/HF elements will be without
> > > > > TARGET_SSE2 for sure lowered into scalar code and that can't be 
> > > > > handled
> > > > > either.
> > > > > So if (!TARGET_SSE2 && GET_MODE_INNER (TYPE_MODE (fromtype)) == 
> > > > > BFmode)
> > > > > or even better
> > > > > if (!TARGET_SSE2 && element_mode (fromtype) == BFmode)
> > > > > ?
> > > > > Or even better remember the 2 modes above into machine_mode 
> > > > > temporaries
> > > > > and just use those in the != comparison and for the checks?
> > > > >
> > > > > Also, I think it is weird to tell user

Re: [PATCH V2] Provide -fcf-protection=branch,return.

2023-07-19 Thread Hongtao Liu via Gcc-patches

On Wed, Jul 12, 2023 at 3:27 PM Hongtao Liu  wrote:
>
> ping.
>
> On Mon, May 22, 2023 at 4:08 PM Hongtao Liu  wrote:
> >
> > ping.
> >
> > On Sat, May 13, 2023 at 5:20 PM liuhongt  wrote:
> > >
> > > > I think this could be simplified if you use either EnumSet or
> > > > EnumBitSet instead in common.opt for `-fcf-protection=`.
> > >
> > > Use EnumSet instead of EnumBitSet since CF_FULL is not power of 2.
> > > It is a bit tricky for sets classification, cf_branch and cf_return
> > > should be in different sets, but they both "conflicts" cf_full,
> > > cf_none. And current EnumSet don't handle this well.
> > >
> > > So in the current implementation, only cf_full,cf_none are exclusive
> > > to each other, but they can be combined with any cf_branch, cf_return,
> > > cf_check. It's not perfect, but still an improvement than original
> > > one.
> > >
I'm going to commit this patch if there's no objection, it's just a
refactor of option -fcf-protection=.
If there's any regression observed, I will fix(or revert the patch).
> > > gcc/ChangeLog:
> > >
> > > * common.opt: (fcf-protection=): Add EnumSet attribute to
> > > support combination of params.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > * c-c++-common/fcf-protection-10.c: New test.
> > > * c-c++-common/fcf-protection-11.c: New test.
> > > * c-c++-common/fcf-protection-12.c: New test.
> > > * c-c++-common/fcf-protection-8.c: New test.
> > > * c-c++-common/fcf-protection-9.c: New test.
> > > * gcc.target/i386/pr89701-1.c: New test.
> > > * gcc.target/i386/pr89701-2.c: New test.
> > > * gcc.target/i386/pr89701-3.c: New test.
> > > ---
> > >  gcc/common.opt | 12 ++--
> > >  gcc/testsuite/c-c++-common/fcf-protection-10.c |  2 ++
> > >  gcc/testsuite/c-c++-common/fcf-protection-11.c |  2 ++
> > >  gcc/testsuite/c-c++-common/fcf-protection-12.c |  2 ++
> > >  gcc/testsuite/c-c++-common/fcf-protection-8.c  |  2 ++
> > >  gcc/testsuite/c-c++-common/fcf-protection-9.c  |  2 ++
> > >  gcc/testsuite/gcc.target/i386/pr89701-1.c  |  4 
> > >  gcc/testsuite/gcc.target/i386/pr89701-2.c  |  4 
> > >  gcc/testsuite/gcc.target/i386/pr89701-3.c  |  4 
> > >  9 files changed, 28 insertions(+), 6 deletions(-)
> > >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-10.c
> > >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-11.c
> > >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-12.c
> > >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-8.c
> > >  create mode 100644 gcc/testsuite/c-c++-common/fcf-protection-9.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr89701-1.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr89701-2.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr89701-3.c
> > >
> > > diff --git a/gcc/common.opt b/gcc/common.opt
> > > index a28ca13385a..02f2472959a 100644
> > > --- a/gcc/common.opt
> > > +++ b/gcc/common.opt
> > > @@ -1886,7 +1886,7 @@ fcf-protection
> > >  Common RejectNegative Alias(fcf-protection=,full)
> > >
> > >  fcf-protection=
> > > -Common Joined RejectNegative Enum(cf_protection_level) 
> > > Var(flag_cf_protection) Init(CF_NONE)
> > > +Common Joined RejectNegative Enum(cf_protection_level) EnumSet 
> > > Var(flag_cf_protection) Init(CF_NONE)
> > >  -fcf-protection=[full|branch|return|none|check]Instrument 
> > > functions with checks to verify jump/call/return control-flow transfer
> > >  instructions have valid targets.
> > >
> > > @@ -1894,19 +1894,19 @@ Enum
> > >  Name(cf_protection_level) Type(enum cf_protection_level) 
> > > UnknownError(unknown Control-Flow Protection Level %qs)
> > >
> > >  EnumValue
> > > -Enum(cf_protection_level) String(full) Value(CF_FULL)
> > > +Enum(cf_protection_level) String(full) Value(CF_FULL) Set(1)
> > >
> > >  EnumValue
> > > -Enum(cf_protection_level) String(branch) Value(CF_BRANCH)
> > > +Enum(cf_protection_level) String(branch) Value(CF_BRANCH) Set(2)
> > >
> > >  EnumValue
> > > -Enum(cf_protection_level) String(return) Value(CF_RETURN)
> > > +Enum(cf_protection_level) String(return) Value(CF_RETURN) Set(3)
> > >
> > >  EnumValue
> > > -Enum(cf_protection_level) String(check) Value(CF_CHECK)
> > > +Enum(cf_protection_level) String(check) Value(CF_CHECK) Set(4)
> > >
> > >  EnumValue
> > > -Enum(cf_protection_level) String(none) Value(CF_NONE)
> > > +Enum(cf_protection_level) String(none) Value(CF_NONE) Set(1)
> > >
> > >  finstrument-functions
> > >  Common Var(flag_instrument_function_entry_exit,1)
> > > diff --git a/gcc/testsuite/c-c++-common/fcf-protection-10.c 
> > > b/gcc/testsuite/c-c++-common/fcf-protection-10.c
> > > new file mode 100644
> > > index 000..b271d134e52
> > > --- /dev/null
> > > +++ b/gcc/testsuite/c-c++-common/fcf-protection-10.c
> > > @@ -0,0 +1,2 @@
> > > +/* { dg-do compile { target { "i?86-*-* x86_64-*-*" } } } */
> > > +/* { dg-

Re: [PATCH] Optimize vlddqu to vmovdqu for TARGET_AVX

2023-07-20 Thread Hongtao Liu via Gcc-patches

On Thu, Jul 20, 2023 at 4:11 PM Uros Bizjak via Gcc-patches
 wrote:
>
> On Thu, Jul 20, 2023 at 9:35 AM liuhongt  wrote:
> >
> > For Intel processors, after TARGET_AVX, vmovdqu is optimized as fast
> > as vlddqu, UNSPEC_LDDQU can be removed to enable more optimizations.
> > Can someone confirm this with AMD folks?
> > If AMD doesn't like such optimization, I'll put my optimization under
> > micro-architecture tuning.
>
> The instruction is reachable only as __builtin_ia32_lddqu* (aka
> _mm_lddqu_si*), so it was chosen by the programmer for a reason. I
> think that in this case, the compiler should not be too smart and
> change the instruction behind the programmer's back. The caveats are
> also explained at length in the ISA manual.
fine.
>
> Uros.
>
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > If AMD also like such optimization, Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> > * config/i386/sse.md (_lddqu): Change to
> > define_expand, expand as simple move when TARGET_AVX
> > && ( == 16 || !TARGET_AVX256_SPLIT_UNALIGNED_LOAD).
> > The original define_insn is renamed to
> > ..
> > (_lddqu): .. this.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/vlddqu_vinserti128.c: New test.
> > ---
> >  gcc/config/i386/sse.md| 15 ++-
> >  .../gcc.target/i386/vlddqu_vinserti128.c  | 11 +++
> >  2 files changed, 25 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
> >
> > diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> > index 2d81347c7b6..d571a78f4c4 100644
> > --- a/gcc/config/i386/sse.md
> > +++ b/gcc/config/i386/sse.md
> > @@ -1835,7 +1835,20 @@ (define_peephole2
> >[(set (match_dup 4) (match_dup 1))]
> >"operands[4] = adjust_address (operands[0], V2DFmode, 0);")
> >
> > -(define_insn "_lddqu"
> > +(define_expand "_lddqu"
> > +  [(set (match_operand:VI1 0 "register_operand")
> > +   (unspec:VI1 [(match_operand:VI1 1 "memory_operand")]
> > +   UNSPEC_LDDQU))]
> > +  "TARGET_SSE3"
> > +{
> > +  if (TARGET_AVX && ( == 16 || 
> > !TARGET_AVX256_SPLIT_UNALIGNED_LOAD))
> > +{
> > +  emit_move_insn (operands[0], operands[1]);
> > +  DONE;
> > +}
> > +})
> > +
> > +(define_insn "*_lddqu"
> >[(set (match_operand:VI1 0 "register_operand" "=x")
> > (unspec:VI1 [(match_operand:VI1 1 "memory_operand" "m")]
> > UNSPEC_LDDQU))]
> > diff --git a/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c 
> > b/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
> > new file mode 100644
> > index 000..29699a5fa7f
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/vlddqu_vinserti128.c
> > @@ -0,0 +1,11 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-mavx2 -O2" } */
> > +/* { dg-final { scan-assembler-times "vbroadcasti128" 1 } } */
> > +/* { dg-final { scan-assembler-not {(?n)vlddqu.*xmm} } } */
> > +
> > +#include 
> > +__m256i foo(void *data) {
> > +__m128i X1 = _mm_lddqu_si128((__m128i*)data);
> > +__m256i V1 = _mm256_broadcastsi128_si256 (X1);
> > +return V1;
> > +}
> > --
> > 2.39.1.388.g2fc9e9ca3c
> >



-- 
BR,
Hongtao

Re: [r14-2834 Regression] FAIL: gcc.target/i386/pr87007-5.c scan-assembler-times vxorps[^\n\r]*xmm[0-9] 1 on Linux/x86_64

2023-07-31 Thread Hongtao Liu via Gcc-patches

On Sat, Jul 29, 2023 at 11:55 AM haochen.jiang via Gcc-regression
 wrote:
>
> On Linux/x86_64,
>
> b9d7140c80bd3c7355b8291bb46f0895dcd8c3cb is the first bad commit
> commit b9d7140c80bd3c7355b8291bb46f0895dcd8c3cb
> Author: Jan Hubicka 
> Date:   Fri Jul 28 09:16:09 2023 +0200
>
> loop-split improvements, part 1
>
> caused
>
> FAIL: gcc.target/i386/pr87007-4.c scan-assembler-times vxorps[^\n\r]*xmm[0-9] 
> 1
> FAIL: gcc.target/i386/pr87007-5.c scan-assembler-times vxorps[^\n\r]*xmm[0-9] 
> 1
>
> with GCC configured with
I'll adjust testcase for this one.
Now we have
vpbroadcastd %ecx, %xmm0
vpaddd .LC3(%rip), %xmm0, %xmm0
vpextrd $3, %xmm0, %eax
vmovddup %xmm3, %xmm0
vrndscalepd $9, %xmm0, %xmm0
vunpckhpd %xmm0, %xmm0, %xmm3

for vrndscalepd, no need to insert pxor since it reuses input operand
xmm0 which loads from memory.
>
> ../../gcc/configure 
> --prefix=/export/users/haochenj/src/gcc-bisect/master/master/r14-2834/usr 
> --enable-clocale=gnu --with-system-zlib --with-demangler-in-ld 
> --with-fpmath=sse --enable-languages=c,c++,fortran --enable-cet --without-isl 
> --enable-libmpx x86_64-linux --disable-bootstrap
>
> To reproduce:
>
> $ cd {build_dir}/gcc && make check 
> RUNTESTFLAGS="i386.exp=gcc.target/i386/pr87007-4.c 
> --target_board='unix{-m32}'"
> $ cd {build_dir}/gcc && make check 
> RUNTESTFLAGS="i386.exp=gcc.target/i386/pr87007-4.c --target_board='unix{-m32\ 
> -march=cascadelake}'"
> $ cd {build_dir}/gcc && make check 
> RUNTESTFLAGS="i386.exp=gcc.target/i386/pr87007-4.c 
> --target_board='unix{-m64}'"
> $ cd {build_dir}/gcc && make check 
> RUNTESTFLAGS="i386.exp=gcc.target/i386/pr87007-4.c --target_board='unix{-m64\ 
> -march=cascadelake}'"
> $ cd {build_dir}/gcc && make check 
> RUNTESTFLAGS="i386.exp=gcc.target/i386/pr87007-5.c 
> --target_board='unix{-m32}'"
> $ cd {build_dir}/gcc && make check 
> RUNTESTFLAGS="i386.exp=gcc.target/i386/pr87007-5.c --target_board='unix{-m32\ 
> -march=cascadelake}'"
> $ cd {build_dir}/gcc && make check 
> RUNTESTFLAGS="i386.exp=gcc.target/i386/pr87007-5.c 
> --target_board='unix{-m64}'"
> $ cd {build_dir}/gcc && make check 
> RUNTESTFLAGS="i386.exp=gcc.target/i386/pr87007-5.c --target_board='unix{-m64\ 
> -march=cascadelake}'"
>
> (Please do not reply to this email, for question about this report, contact 
> me at haochen dot jiang at intel.com.)
> (If you met problems with cascadelake related, disabling AVX512F in command 
> line might save that.)
> (However, please make sure that there is no potential problems with AVX512.)



-- 
BR,
Hongtao

Re: [x86 PATCH] UNSPEC_PALIGNR optimizations and clean-ups.

2022-06-30 Thread Hongtao Liu via Gcc-patches

On Fri, Jul 1, 2022 at 2:42 AM Roger Sayle  wrote:
>
>
> This patch is a follow-up to Hongtao's fix for PR target/105854.  That
> fix is perfectly correct, but the thing that caught my eye was why is
> the compiler generating a shift by zero at all.  Digging deeper it
> turns out that we can easily optimize __builtin_ia32_palignr for
> alignments of 0 and 64 respectively, which may be simplified to moves
> from the highpart or lowpart.
>
> After adding optimizations to simplify the 64-bit DImode palignr,
> I started to add the corresponding optimizations for vpalignr (i.e.
> 128-bit).  The first oddity is that sse.md uses TImode and a special
> SSESCALARMODE iterator, rather than V1TImode, and indeed the comment
> above SSESCALARMODE hints that this should be "dropped in favor of
> VIMAX_AVX2_AVX512BW".  Hence this patch includes the migration of
> _palignr to use VIMAX_AVX2_AVX512BW, basically
> using V1TImode instead of TImode for 128-bit palignr.
>
> But it was only after I'd implemented this clean-up that I stumbled
> across the strange semantics of 128-bit [v]palignr.  According to
> https://www.felixcloutier.com/x86/palignr, the semantics are subtly
> different based upon how the instruction is encoded.  PALIGNR leaves
> the highpart unmodified, whilst VEX.128 encoded VPALIGNR clears the
> highpart, and (unless I'm mistaken) it looks like GCC currently uses
> the exact same RTL/templates for both, treating one as an alternative
> for the other.
I think as long as patterns or intrinsics only care about the low
part, they should be ok.
But if we want to use default behavior for upper bits, we need to
restrict them under specific isa(.i.e. vmovq in vec_set_0).
Generally, 128-bit sse legacy instructions have different behaviors
for upper bits from AVX ones, and that's why vzeroupper is introduced
for sse <-> avx instructions transition.
>
> Hence I thought I'd post what I have so far (part optimization and
> part clean-up), to then ask the x86 experts for their opinions.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-,32},
> with no new failures.  Ok for mainline?
>
>
> 2022-06-30  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
> CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
> * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use V1TImode
> and gen_ssse3_palignv1ti instead of TImode.
> * config/i386/sse.md (SSESCALARMODE): Delete.
> (define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
> (_palignr): Use VIMAX_AVX2_AVX512BW as a mode
> iterator instead of SSESCALARMODE.
>
> (ssse3_palignrdi): Optimize cases when operands[3] is 0 or 64,
> using a single move instruction (if required).
> (define_split): Likewise split UNSPEC_PALIGNR $0 into a move.
> (define_split): Likewise split UNSPEC_PALIGNR $64 into a move.
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/ssse3-palignr-2.c: New test case.
>
>
> Thanks in advance,
> Roger
> --
>

+(define_split
+  [(set (match_operand:DI 0 "register_operand")
+ (unspec:DI [(match_operand:DI 1 "register_operand")
+(match_operand:DI 2 "register_mmxmem_operand")
+(const_int 0)]
+   UNSPEC_PALIGNR))]
+  ""
+  [(set (match_dup 0) (match_dup 2))])
+
+(define_split
+  [(set (match_operand:DI 0 "register_operand")
+ (unspec:DI [(match_operand:DI 1 "register_operand")
+(match_operand:DI 2 "register_mmxmem_operand")
+(const_int 64)]
+   UNSPEC_PALIGNR))]
+  ""
+  [(set (match_dup 0) (match_dup 1))])
+
define_split is assumed to be splitted to 2(or more) insns, hence
pass_combine will only try define_split if the number of merged insns
is greater than 2.
For palignr, i think most time there would be only 2 merged
insns(constant propagation), so better to change them as pre_reload
splitter.
(.i.e. (define_insn_and_split "*avx512bw_permvar_truncv16siv16hi_1").


--
BR,
Hongtao

Re: [PATCH] Add myself for write after approval

2022-06-30 Thread Hongtao Liu via Gcc-patches

I think this can be taken as an obvious fix without prior approval.
"Obvious fixes can be committed without prior approval. Just check in
the fix and copy it to gcc-patches."
Quoted from https://gcc.gnu.org/gitwrite.html

On Fri, Jul 1, 2022 at 10:02 AM Haochen Jiang via Gcc-patches
 wrote:
>
> Hi all,
>
> I want to add myself in MAINTAINERS for write after approval.
>
> Ok for trunk?
>
> BRs,
> Haochen
>
> ChangeLog:
>
> * MAINTAINERS (Write After Approval): Add myself.
> ---
>  MAINTAINERS | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 151770f59f4..3c448ba9eb6 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -464,6 +464,7 @@ Harsha Jagasia  
> 
>  Fariborz Jahanian  
>  Surya Kumari Jangala   
>  Qian Jianhua   
> +Haochen Jiang  
>  Janis Johnson  
> 
>  Teresa Johnson 
>  Kean Johnston  
> --
> 2.18.1
>


-- 
BR,
Hongtao

Re: [x86 PATCH] UNSPEC_PALIGNR optimizations and clean-ups.

2022-06-30 Thread Hongtao Liu via Gcc-patches

On Fri, Jul 1, 2022 at 10:12 AM Hongtao Liu  wrote:
>
> On Fri, Jul 1, 2022 at 2:42 AM Roger Sayle  wrote:
> >
> >
> > This patch is a follow-up to Hongtao's fix for PR target/105854.  That
> > fix is perfectly correct, but the thing that caught my eye was why is
> > the compiler generating a shift by zero at all.  Digging deeper it
> > turns out that we can easily optimize __builtin_ia32_palignr for
> > alignments of 0 and 64 respectively, which may be simplified to moves
> > from the highpart or lowpart.
> >
> > After adding optimizations to simplify the 64-bit DImode palignr,
> > I started to add the corresponding optimizations for vpalignr (i.e.
> > 128-bit).  The first oddity is that sse.md uses TImode and a special
> > SSESCALARMODE iterator, rather than V1TImode, and indeed the comment
> > above SSESCALARMODE hints that this should be "dropped in favor of
> > VIMAX_AVX2_AVX512BW".  Hence this patch includes the migration of
> > _palignr to use VIMAX_AVX2_AVX512BW, basically
> > using V1TImode instead of TImode for 128-bit palignr.
> >
> > But it was only after I'd implemented this clean-up that I stumbled
> > across the strange semantics of 128-bit [v]palignr.  According to
> > https://www.felixcloutier.com/x86/palignr, the semantics are subtly
> > different based upon how the instruction is encoded.  PALIGNR leaves
> > the highpart unmodified, whilst VEX.128 encoded VPALIGNR clears the
> > highpart, and (unless I'm mistaken) it looks like GCC currently uses
> > the exact same RTL/templates for both, treating one as an alternative
> > for the other.
> I think as long as patterns or intrinsics only care about the low
> part, they should be ok.
> But if we want to use default behavior for upper bits, we need to
> restrict them under specific isa(.i.e. vmovq in vec_set_0).
> Generally, 128-bit sse legacy instructions have different behaviors
> for upper bits from AVX ones, and that's why vzeroupper is introduced
> for sse <-> avx instructions transition.
> >
> > Hence I thought I'd post what I have so far (part optimization and
> > part clean-up), to then ask the x86 experts for their opinions.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-,32},
> > with no new failures.  Ok for mainline?
> >
> >
> > 2022-06-30  Roger Sayle  
> >
> > gcc/ChangeLog
> > * config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
> > CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
> > * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use V1TImode
> > and gen_ssse3_palignv1ti instead of TImode.
> > * config/i386/sse.md (SSESCALARMODE): Delete.
> > (define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
> > (_palignr): Use VIMAX_AVX2_AVX512BW as a mode
> > iterator instead of SSESCALARMODE.
> >
> > (ssse3_palignrdi): Optimize cases when operands[3] is 0 or 64,
> > using a single move instruction (if required).
> > (define_split): Likewise split UNSPEC_PALIGNR $0 into a move.
> > (define_split): Likewise split UNSPEC_PALIGNR $64 into a move.
> >
> > gcc/testsuite/ChangeLog
> > * gcc.target/i386/ssse3-palignr-2.c: New test case.
> >
> >
> > Thanks in advance,
> > Roger
> > --
> >
>
> +(define_split
> +  [(set (match_operand:DI 0 "register_operand")
> + (unspec:DI [(match_operand:DI 1 "register_operand")
> +(match_operand:DI 2 "register_mmxmem_operand")
> +(const_int 0)]
> +   UNSPEC_PALIGNR))]
> +  ""
> +  [(set (match_dup 0) (match_dup 2))])
> +
> +(define_split
> +  [(set (match_operand:DI 0 "register_operand")
> + (unspec:DI [(match_operand:DI 1 "register_operand")
> +(match_operand:DI 2 "register_mmxmem_operand")
> +(const_int 64)]
> +   UNSPEC_PALIGNR))]
> +  ""
> +  [(set (match_dup 0) (match_dup 1))])
> +
> define_split is assumed to be splitted to 2(or more) insns, hence
> pass_combine will only try define_split if the number of merged insns
> is greater than 2.
> For palignr, i think most time there would be only 2 merged
> insns(constant propagation), so better to change them as pre_reload
> splitter.
> (.i.e. (define_insn_and_split "*avx512bw_permvar_truncv16siv16hi_1").
I think you can just merge 2 define_split into define_insn_and_split
"ssse3_palignrdi" by relaxing split condition as

-  "TARGET_SSSE3 && reload_completed
-   && SSE_REGNO_P (REGNO (operands[0]))"
+  "(TARGET_SSSE3 && reload_completed
+   && SSE_REGNO_P (REGNO (operands[0])))
+   || INVAL(operands[3]) == 0
+   || INVAL(operands[3]) == 64"

and you have already handled them by

+  if (operands[3] == const0_rtx)
+{
+  if (!rtx_equal_p (operands[0], operands[2]))
+ emit_move_insn (operands[0], operands[2]);
+  else
+ emit_note (NOTE_INSN_DELETED);
+  DONE;
+}
+  else if (INTVAL (operands[3]) == 64)
+{
+  if (!rtx_equal_p (operands[0], operands[1]))
+ emit_move_insn (operands[0],

Re: [x86 PATCH] UNSPEC_PALIGNR optimizations and clean-ups.

2022-07-04 Thread Hongtao Liu via Gcc-patches

On Tue, Jul 5, 2022 at 1:48 AM Roger Sayle  wrote:
>
>
> Hi Hongtao,
> Many thanks for your review.  This revised patch implements your
> suggestions of removing the combine splitters, and instead reusing
> the functionality of the ssse3_palignrdi define_insn_and split.
>
> This revised patch has been tested on x86_64-pc-linux-gnu with make
> bootstrap and make -k check, both with and with --target_board=unix{-32},
> with no new failures.  Is this revised version Ok for mainline?
Ok.
>
>
> 2022-07-04  Roger Sayle  
> Hongtao Liu  
>
> gcc/ChangeLog
> * config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
> CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
> * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use V1TImode
> and gen_ssse3_palignv1ti instead of TImode.
> * config/i386/sse.md (SSESCALARMODE): Delete.
> (define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
> (_palignr): Use VIMAX_AVX2_AVX512BW as a mode
> iterator instead of SSESCALARMODE.
>
> (ssse3_palignrdi): Optimize cases where operands[3] is 0 or 64,
> using a single move instruction (if required).
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/ssse3-palignr-2.c: New test case.
>
>
> Thanks in advance,
> Roger
> --
>
> > -Original Message-
> > From: Hongtao Liu 
> > Sent: 01 July 2022 03:40
> > To: Roger Sayle 
> > Cc: GCC Patches 
> > Subject: Re: [x86 PATCH] UNSPEC_PALIGNR optimizations and clean-ups.
> >
> > On Fri, Jul 1, 2022 at 10:12 AM Hongtao Liu  wrote:
> > >
> > > On Fri, Jul 1, 2022 at 2:42 AM Roger Sayle 
> > wrote:
> > > >
> > > >
> > > > This patch is a follow-up to Hongtao's fix for PR target/105854.
> > > > That fix is perfectly correct, but the thing that caught my eye was
> > > > why is the compiler generating a shift by zero at all.  Digging
> > > > deeper it turns out that we can easily optimize
> > > > __builtin_ia32_palignr for alignments of 0 and 64 respectively,
> > > > which may be simplified to moves from the highpart or lowpart.
> > > >
> > > > After adding optimizations to simplify the 64-bit DImode palignr, I
> > > > started to add the corresponding optimizations for vpalignr (i.e.
> > > > 128-bit).  The first oddity is that sse.md uses TImode and a special
> > > > SSESCALARMODE iterator, rather than V1TImode, and indeed the comment
> > > > above SSESCALARMODE hints that this should be "dropped in favor of
> > > > VIMAX_AVX2_AVX512BW".  Hence this patch includes the migration of
> > > > _palignr to use VIMAX_AVX2_AVX512BW, basically
> > > > using V1TImode instead of TImode for 128-bit palignr.
> > > >
> > > > But it was only after I'd implemented this clean-up that I stumbled
> > > > across the strange semantics of 128-bit [v]palignr.  According to
> > > > https://www.felixcloutier.com/x86/palignr, the semantics are subtly
> > > > different based upon how the instruction is encoded.  PALIGNR leaves
> > > > the highpart unmodified, whilst VEX.128 encoded VPALIGNR clears the
> > > > highpart, and (unless I'm mistaken) it looks like GCC currently uses
> > > > the exact same RTL/templates for both, treating one as an
> > > > alternative for the other.
> > > I think as long as patterns or intrinsics only care about the low
> > > part, they should be ok.
> > > But if we want to use default behavior for upper bits, we need to
> > > restrict them under specific isa(.i.e. vmovq in vec_set_0).
> > > Generally, 128-bit sse legacy instructions have different behaviors
> > > for upper bits from AVX ones, and that's why vzeroupper is introduced
> > > for sse <-> avx instructions transition.
> > > >
> > > > Hence I thought I'd post what I have so far (part optimization and
> > > > part clean-up), to then ask the x86 experts for their opinions.
> > > >
> > > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > > bootstrap and make -k check, both with and without
> > > > --target_board=unix{-,32}, with no new failures.  Ok for mainline?
> > > >
> > > >
> > > > 2022-06-30  Roger Sayle  
> > > >
> > > > gcc/ChangeLog
> > > > * config/i386/i386-builtin.def (__builtin_ia32_palignr128): 
> > > > Change
> > > > CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
> > > > * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use
> > V1TImode
> > > > and gen_ssse3_palignv1ti instead of TImode.
> > > > * config/i386/sse.md (SSESCALARMODE): Delete.
> > > > (define_mode_attr ssse3_avx2): Handle V1TImode instead of 
> > > > TImode.
> > > > (_palignr): Use VIMAX_AVX2_AVX512BW as a
> > mode
> > > > iterator instead of SSESCALARMODE.
> > > >
> > > > (ssse3_palignrdi): Optimize cases when operands[3] is 0 or 64,
> > > > using a single move instruction (if required).
> > > > (define_split): Likewise split UNSPEC_PALIGNR $0 into a move.
> > > > (define_split): Likewise split UNSPEC_PALIGNR

Re: [PATCH] [RFC]Support vectorization for Complex type.

2022-07-11 Thread Hongtao Liu via Gcc-patches

On Mon, Jul 11, 2022 at 7:47 PM Richard Biener via Gcc-patches
 wrote:
>
> On Mon, Jul 11, 2022 at 5:44 AM liuhongt  wrote:
> >
> > The patch only handles load/store(including ctor/permutation, except
> > gather/scatter) for complex type, other operations don't needs to be
> > handled since they will be lowered by pass cplxlower.(MASK_LOAD is not
> > supported for complex type, so no need to handle either).
>
> (*)
>
> > Instead of support vector(2) _Complex double, this patch takes vector(4)
> > double as vector type of _Complex double. Since vectorizer originally
> > takes TYPE_VECTOR_SUBPARTS as nunits which is not true for complex
> > type, the patch handles nunits/ncopies/vf specially for complex type.
>
> For the limited set above(*) can you explain what's "special" about
> vector(2) _Complex
> vs. vector(4) double, thus why we need to have STMT_VINFO_COMPLEX_P at all?
Supporting a vector(2) complex  is a straightforward idea, just like
supporting other scalar type in vectorizer, but it requires more
efforts(in the backend and frontend), considering that most of
operations of complex type will be lowered into realpart and imagpart
operations, supporting a vector(2) complex does not look that
necessary. Then it comes up with supporting vector(4) double(with
adjustment of vf/ctor/permutation), the vectorizer only needs to
handle the vectorization of the move operation of the complex type(no
need to worry about wrongly mapping vector(4) double multiplication to
complex type multiplication since it's already lowered before
vectorizer).
stmt_info does not record the scalar type, in order to avoid duplicate
operation like getting a lhs type from stmt to determine whether it is
a complex type, STMT_VINFO_COMPLEX_P bit is added, this bit is mainly
initialized in vect_analyze_data_refs and vect_get_vector_types_for_
stmt.
>
> I wonder to what extent your handling can be extended to support 
> re-vectorizing
> (with a higher VF for example) already vectorized code?  The vectorizer giving
> up on vector(2) double looks quite obviously similar to it giving up
> on _Complex double ...
Yes, it can be extended to vector(2) double/float/int/ with a bit
adjustment(exacting element by using bit_field instead of
imagpart_expr/realpart_expr).
> It would be a shame to not use the same underlying mechanism for dealing with
> both, where for the vector case obviously vector(4) would be supported as 
> well.
>
> In principle _Complex double operations should be two SLP lanes but it seems 
> you
> are handling them with classical interleaving as well?
I'm only handling move operations, for other operations it will be
lowered to realpart and imagpart and thus two SLP lanes.
>
> Thanks,
> Richard.
>
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Also test the patch for SPEC2017 and find there's complex type vectorization
> > in 510/549(but no performance impact).
> >
> > Any comments?
> >
> > gcc/ChangeLog:
> >
> > PR tree-optimization/106010
> > * tree-vect-data-refs.cc (vect_get_data_access_cost):
> > Pass complex_p to vect_get_num_copies to avoid ICE.
> > (vect_analyze_data_refs): Support vectorization for Complex
> > type with vector scalar types.
> > * tree-vect-loop.cc (vect_determine_vf_for_stmt_1): VF should
> > be half of TYPE_VECTOR_SUBPARTS when complex_p.
> > * tree-vect-slp.cc (vect_record_max_nunits): nunits should be
> > half of TYPE_VECTOR_SUBPARTS when complex_p.
> > (vect_optimize_slp): Support permutation for complex type.
> > (vect_slp_analyze_node_operations_1): Double nunits in
> > vect_get_num_vectors to get right SLP_TREE_NUMBER_OF_VEC_STMTS
> > when complex_p.
> > (vect_slp_analyze_node_operations): Ditto.
> > (vect_create_constant_vectors): Support CTOR for complex type.
> > (vect_transform_slp_perm_load): Support permutation for
> > complex type.
> > * tree-vect-stmts.cc (vect_init_vector): Support complex type.
> > (vect_get_vec_defs_for_operand): Get vector type for
> > complex type.
> > (vectorizable_store): Get right ncopies/nunits for complex
> > type, also return false when complex_p and
> > !TYPE_VECTOR_SUBPARTS.is_constant ().
> > (vectorizable_load): Ditto.
> > (vect_get_vector_types_for_stmt): Get vector type for complex type.
> > * tree-vectorizer.h (STMT_VINFO_COMPLEX_P): New macro.
> > (vect_get_num_copies): New overload.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/pr106010-1a.c: New test.
> > * gcc.target/i386/pr106010-1b.c: New test.
> > * gcc.target/i386/pr106010-1c.c: New test.
> > * gcc.target/i386/pr106010-2a.c: New test.
> > * gcc.target/i386/pr106010-2b.c: New test.
> > * gcc.target/i386/pr106010-2c.c: New test.
> > * gcc.target/i386/pr106010-3a.c: New test.
> > * gcc

Re: [PATCH] Allocate general register(memory/immediate) for 16/32/64-bit vector bit_op patterns.

2022-07-11 Thread Hongtao Liu via Gcc-patches

On Mon, Jul 11, 2022 at 4:03 PM Uros Bizjak via Gcc-patches
 wrote:
>
> On Mon, Jul 11, 2022 at 3:15 AM liuhongt  wrote:
> >
> > And split it to GPR-version instruction after reload.
> >
> > This will enable below optimization for 16/32/64-bit vector bit_op
> >
> > -   movd(%rdi), %xmm0
> > -   movd(%rsi), %xmm1
> > -   pand%xmm1, %xmm0
> > -   movd%xmm0, (%rdi)
> > +   movl(%rsi), %eax
> > +   andl%eax, (%rdi)
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ok for trunk?
>
> The patch will create many interunit moves (xmm <-> gpr) for anything
> but the most simple logic sequences, because operations with
> memory/immediate will be forced into GPR registers, while reg/reg
> operations will remain in XMM registers.
Agree not to deal with mem/immediate at first.
>
> I tried to introduce GPR registers to MMX logic insns in the past and
> observed the above behavior, but perhaps RA evolved in the mean time
> to handle different register sets better (especially under register
> pressure). However, I would advise to be careful with this
> functionality.
>
> Perhaps this problem should be attacked in stages. First, please
> introduce GPR registers to MMX logic instructions (similar to how
> VI_16_32 mode instructions are handled). After RA effects will be
There's "?r" in VI_16_32 logic instructions which prevent RA allocate
gpr for testcase in the patch.
Is it ok to remove "?" for them(Also add alternative "r" instead of
"?r" in mmx logic insns)?
If there's other instructions that prefer "v to "r", then RA will
allocate "v", but for logic instructions, "r" and “v" should be
treated equally, just as in the 16/32/64-bit vector
mov_internal.
> analysed, only then memory/immediate handling should be added. Also,
> please don't forget to handle ANDNOT insn - TARGET_BMI slightly
> complicates this part, but this is also solved with VI_16_32 mode
> instructions.
>
> Uros.
>
> >
> > gcc/ChangeLog:
> >
> > PR target/106038
> > * config/i386/mmx.md (3): Expand
> > with (clobber (reg:CC flags_reg)) under TARGET_64BIT
> > (mmx_code>3): Ditto.
> > (*mmx_3_1): New define_insn, add post_reload
> > splitter after it.
> > (*3): New define_insn, also add post_reload
> > splitter after it.
> > (mmxinsnmode): New mode attribute.
> > (VI_16_32_64): New mode iterator.
> > (*mov_imm): Refactor with mmxinsnmode.
> > * config/i386/predicates.md
> > (nonimmediate_or_x86_64_vector_cst): New predicate.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/pr106038-1.c: New test.
> > * gcc.target/i386/pr106038-2.c: New test.
> > * gcc.target/i386/pr106038-3.c: New test.
> > ---
> >  gcc/config/i386/mmx.md | 131 +++--
> >  gcc/config/i386/predicates.md  |   4 +
> >  gcc/testsuite/gcc.target/i386/pr106038-1.c |  61 ++
> >  gcc/testsuite/gcc.target/i386/pr106038-2.c |  35 ++
> >  gcc/testsuite/gcc.target/i386/pr106038-3.c |  17 +++
> >  5 files changed, 213 insertions(+), 35 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr106038-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr106038-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr106038-3.c
> >
> > diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
> > index 3294c1e6274..85b06abea27 100644
> > --- a/gcc/config/i386/mmx.md
> > +++ b/gcc/config/i386/mmx.md
> > @@ -75,6 +75,11 @@ (define_mode_iterator V_16_32_64
> >  (V8QI "TARGET_64BIT") (V4HI "TARGET_64BIT") (V4HF "TARGET_64BIT")
> >  (V2SI "TARGET_64BIT") (V2SF "TARGET_64BIT")])
> >
> > +(define_mode_iterator VI_16_32_64
> > +   [V2QI V4QI V2HI
> > +(V8QI "TARGET_64BIT") (V4HI "TARGET_64BIT")
> > +(V2SI "TARGET_64BIT")])
> > +
> >  ;; V2S* modes
> >  (define_mode_iterator V2FI [V2SF V2SI])
> >
> > @@ -86,6 +91,14 @@ (define_mode_attr mmxvecsize
> >[(V8QI "b") (V4QI "b") (V2QI "b")
> > (V4HI "w") (V2HI "w") (V2SI "d") (V1DI "q")])
> >
> > +;; Mapping to same size integral mode.
> > +(define_mode_attr mmxinsnmode
> > +  [(V8QI "DI") (V4QI "SI") (V2QI "HI")
> > +   (V4HI "DI") (V2HI "SI")
> > +   (V2SI "DI")
> > +   (V4HF "DI") (V2HF "SI")
> > +   (V2SF "DI")])
> > +
> >  (define_mode_attr mmxdoublemode
> >[(V8QI "V8HI") (V4HI "V4SI")])
> >
> > @@ -350,22 +363,7 @@ (define_insn_and_split "*mov_imm"
> >HOST_WIDE_INT val = ix86_convert_const_vector_to_integer (operands[1],
> > mode);
> >operands[1] = GEN_INT (val);
> > -  machine_mode mode;
> > -  switch (GET_MODE_SIZE (mode))
> > -{
> > -case 2:
> > -  mode = HImode;
> > -  break;
> > -case 4:
> > -  mode = SImode;
> > -  break;
> > -case 8:
> > -  mode = DImode;
> > -  break;
> > -default:
> > -  gcc_unreachable ();
> > -}
> > -  operands[

Re: [PATCH] [RFC]Support vectorization for Complex type.

2022-07-12 Thread Hongtao Liu via Gcc-patches

On Tue, Jul 12, 2022 at 10:12 PM Richard Biener
 wrote:
>
> On Tue, Jul 12, 2022 at 6:11 AM Hongtao Liu  wrote:
> >
> > On Mon, Jul 11, 2022 at 7:47 PM Richard Biener via Gcc-patches
> >  wrote:
> > >
> > > On Mon, Jul 11, 2022 at 5:44 AM liuhongt  wrote:
> > > >
> > > > The patch only handles load/store(including ctor/permutation, except
> > > > gather/scatter) for complex type, other operations don't needs to be
> > > > handled since they will be lowered by pass cplxlower.(MASK_LOAD is not
> > > > supported for complex type, so no need to handle either).
> > >
> > > (*)
> > >
> > > > Instead of support vector(2) _Complex double, this patch takes vector(4)
> > > > double as vector type of _Complex double. Since vectorizer originally
> > > > takes TYPE_VECTOR_SUBPARTS as nunits which is not true for complex
> > > > type, the patch handles nunits/ncopies/vf specially for complex type.
> > >
> > > For the limited set above(*) can you explain what's "special" about
> > > vector(2) _Complex
> > > vs. vector(4) double, thus why we need to have STMT_VINFO_COMPLEX_P at 
> > > all?
> > Supporting a vector(2) complex  is a straightforward idea, just like
> > supporting other scalar type in vectorizer, but it requires more
> > efforts(in the backend and frontend), considering that most of
> > operations of complex type will be lowered into realpart and imagpart
> > operations, supporting a vector(2) complex does not look that
> > necessary. Then it comes up with supporting vector(4) double(with
> > adjustment of vf/ctor/permutation), the vectorizer only needs to
> > handle the vectorization of the move operation of the complex type(no
> > need to worry about wrongly mapping vector(4) double multiplication to
> > complex type multiplication since it's already lowered before
> > vectorizer).
> > stmt_info does not record the scalar type, in order to avoid duplicate
> > operation like getting a lhs type from stmt to determine whether it is
> > a complex type, STMT_VINFO_COMPLEX_P bit is added, this bit is mainly
> > initialized in vect_analyze_data_refs and vect_get_vector_types_for_
> > stmt.
> > >
> > > I wonder to what extent your handling can be extended to support 
> > > re-vectorizing
> > > (with a higher VF for example) already vectorized code?  The vectorizer 
> > > giving
> > > up on vector(2) double looks quite obviously similar to it giving up
> > > on _Complex double ...
> > Yes, it can be extended to vector(2) double/float/int/ with a bit
> > adjustment(exacting element by using bit_field instead of
> > imagpart_expr/realpart_expr).
> > > It would be a shame to not use the same underlying mechanism for dealing 
> > > with
> > > both, where for the vector case obviously vector(4) would be supported as 
> > > well.
> > >
> > > In principle _Complex double operations should be two SLP lanes but it 
> > > seems you
> > > are handling them with classical interleaving as well?
> > I'm only handling move operations, for other operations it will be
> > lowered to realpart and imagpart and thus two SLP lanes.
>
> Yes, I understood that.
>
> Doing it more general (and IMHO better) would involve enhancing
> how we represent dataref groups, maintaining the number of scalars
> covered by each of the vinfos.  On the SLP representation side it
> probably requires to rely on the representative for access and not
> on the scalar stmts (since those do not map properly to the lanes).
>
> Ideally we'd be able to handle
>
> struct { _Complex double c; double a; double b; } a[], b[];
>
> void foo ()
> {
>for (int i = 0; i < 100; ++i)
> {
>   a[i].c = b[i].c;
>   a[i].a = b[i].a;
>   a[i].b = b[i].b;
> }
> }
>
> which I guess your patch doesn't handle with plain AVX vector
> copies but instead uses interleaving for the _Complex and non-_Complex
> parts?
Indeed, it produces wrong code.
>
> Let me spend some time fleshing out what is necessary to make
> this work "properly".  We can consider your special-casing of _Complex
> memory ops if I can't manage to assess the complexity of the task.
>
> Thanks,
> Richard.
>
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > > Also test the patch for SPEC2017 and find there's complex type 
> > > > vectorization
> > > > in 510/549(but no performance impact).
> > > >
> > > > Any comments?
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > PR tree-optimization/106010
> > > > * tree-vect-data-refs.cc (vect_get_data_access_cost):
> > > > Pass complex_p to vect_get_num_copies to avoid ICE.
> > > > (vect_analyze_data_refs): Support vectorization for Complex
> > > > type with vector scalar types.
> > > > * tree-vect-loop.cc (vect_determine_vf_for_stmt_1): VF should
> > > > be half of TYPE_VECTOR_SUBPARTS when complex_p.
> > > > * tree-vect-slp.cc (vect_record_max_nunits): nunits should be
> > > > half of TYPE_VECTOR_SUBPARTS when complex_p.
>

Re: [PATCH] [RFC]Support vectorization for Complex type.

2022-07-14 Thread Hongtao Liu via Gcc-patches

On Thu, Jul 14, 2022 at 4:20 PM Richard Biener
 wrote:
>
> On Wed, Jul 13, 2022 at 9:34 AM Richard Biener
>  wrote:
> >
> > On Wed, Jul 13, 2022 at 6:47 AM Hongtao Liu  wrote:
> > >
> > > On Tue, Jul 12, 2022 at 10:12 PM Richard Biener
> > >  wrote:
> > > >
> > > > On Tue, Jul 12, 2022 at 6:11 AM Hongtao Liu  wrote:
> > > > >
> > > > > On Mon, Jul 11, 2022 at 7:47 PM Richard Biener via Gcc-patches
> > > > >  wrote:
> > > > > >
> > > > > > On Mon, Jul 11, 2022 at 5:44 AM liuhongt  
> > > > > > wrote:
> > > > > > >
> > > > > > > The patch only handles load/store(including ctor/permutation, 
> > > > > > > except
> > > > > > > gather/scatter) for complex type, other operations don't needs to 
> > > > > > > be
> > > > > > > handled since they will be lowered by pass cplxlower.(MASK_LOAD 
> > > > > > > is not
> > > > > > > supported for complex type, so no need to handle either).
> > > > > >
> > > > > > (*)
> > > > > >
> > > > > > > Instead of support vector(2) _Complex double, this patch takes 
> > > > > > > vector(4)
> > > > > > > double as vector type of _Complex double. Since vectorizer 
> > > > > > > originally
> > > > > > > takes TYPE_VECTOR_SUBPARTS as nunits which is not true for complex
> > > > > > > type, the patch handles nunits/ncopies/vf specially for complex 
> > > > > > > type.
> > > > > >
> > > > > > For the limited set above(*) can you explain what's "special" about
> > > > > > vector(2) _Complex
> > > > > > vs. vector(4) double, thus why we need to have STMT_VINFO_COMPLEX_P 
> > > > > > at all?
> > > > > Supporting a vector(2) complex  is a straightforward idea, just like
> > > > > supporting other scalar type in vectorizer, but it requires more
> > > > > efforts(in the backend and frontend), considering that most of
> > > > > operations of complex type will be lowered into realpart and imagpart
> > > > > operations, supporting a vector(2) complex does not look that
> > > > > necessary. Then it comes up with supporting vector(4) double(with
> > > > > adjustment of vf/ctor/permutation), the vectorizer only needs to
> > > > > handle the vectorization of the move operation of the complex type(no
> > > > > need to worry about wrongly mapping vector(4) double multiplication to
> > > > > complex type multiplication since it's already lowered before
> > > > > vectorizer).
> > > > > stmt_info does not record the scalar type, in order to avoid duplicate
> > > > > operation like getting a lhs type from stmt to determine whether it is
> > > > > a complex type, STMT_VINFO_COMPLEX_P bit is added, this bit is mainly
> > > > > initialized in vect_analyze_data_refs and vect_get_vector_types_for_
> > > > > stmt.
> > > > > >
> > > > > > I wonder to what extent your handling can be extended to support 
> > > > > > re-vectorizing
> > > > > > (with a higher VF for example) already vectorized code?  The 
> > > > > > vectorizer giving
> > > > > > up on vector(2) double looks quite obviously similar to it giving up
> > > > > > on _Complex double ...
> > > > > Yes, it can be extended to vector(2) double/float/int/ with a bit
> > > > > adjustment(exacting element by using bit_field instead of
> > > > > imagpart_expr/realpart_expr).
> > > > > > It would be a shame to not use the same underlying mechanism for 
> > > > > > dealing with
> > > > > > both, where for the vector case obviously vector(4) would be 
> > > > > > supported as well.
> > > > > >
> > > > > > In principle _Complex double operations should be two SLP lanes but 
> > > > > > it seems you
> > > > > > are handling them with classical interleaving as well?
> > > > > I'm only handling move operations, for other operations it will be
> > > > > lowered to realpart and imagpart and thus two SLP lanes.
> > > >
> > > > Yes, I understood that.
> > > >
> > > > Doing it more general (and IMHO better) would involve enhancing
> > > > how we represent dataref groups, maintaining the number of scalars
> > > > covered by each of the vinfos.  On the SLP representation side it
> > > > probably requires to rely on the representative for access and not
> > > > on the scalar stmts (since those do not map properly to the lanes).
> > > >
> > > > Ideally we'd be able to handle
> > > >
> > > > struct { _Complex double c; double a; double b; } a[], b[];
> > > >
> > > > void foo ()
> > > > {
> > > >for (int i = 0; i < 100; ++i)
> > > > {
> > > >   a[i].c = b[i].c;
> > > >   a[i].a = b[i].a;
> > > >   a[i].b = b[i].b;
> > > > }
> > > > }
> > > >
> > > > which I guess your patch doesn't handle with plain AVX vector
> > > > copies but instead uses interleaving for the _Complex and non-_Complex
> > > > parts?
> > > Indeed, it produces wrong code.
> >
> > For _Complex, in case we don't get to the "true and only" solution it
> > might be easier to split the loads and stores when it's just memory
> > copies and we have vectorization enabled and a supported vector
> > mode that would surely re-assemble them (store-merging doesn't seem
> > to do that).
> >
> > Btw,

Re: [PATCH] [RFC]Support vectorization for Complex type.

2022-07-14 Thread Hongtao Liu via Gcc-patches

On Thu, Jul 14, 2022 at 4:53 PM Hongtao Liu  wrote:
>
> On Thu, Jul 14, 2022 at 4:20 PM Richard Biener
>  wrote:
> >
> > On Wed, Jul 13, 2022 at 9:34 AM Richard Biener
> >  wrote:
> > >
> > > On Wed, Jul 13, 2022 at 6:47 AM Hongtao Liu  wrote:
> > > >
> > > > On Tue, Jul 12, 2022 at 10:12 PM Richard Biener
> > > >  wrote:
> > > > >
> > > > > On Tue, Jul 12, 2022 at 6:11 AM Hongtao Liu  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Jul 11, 2022 at 7:47 PM Richard Biener via Gcc-patches
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Mon, Jul 11, 2022 at 5:44 AM liuhongt  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > The patch only handles load/store(including ctor/permutation, 
> > > > > > > > except
> > > > > > > > gather/scatter) for complex type, other operations don't needs 
> > > > > > > > to be
> > > > > > > > handled since they will be lowered by pass cplxlower.(MASK_LOAD 
> > > > > > > > is not
> > > > > > > > supported for complex type, so no need to handle either).
> > > > > > >
> > > > > > > (*)
> > > > > > >
> > > > > > > > Instead of support vector(2) _Complex double, this patch takes 
> > > > > > > > vector(4)
> > > > > > > > double as vector type of _Complex double. Since vectorizer 
> > > > > > > > originally
> > > > > > > > takes TYPE_VECTOR_SUBPARTS as nunits which is not true for 
> > > > > > > > complex
> > > > > > > > type, the patch handles nunits/ncopies/vf specially for complex 
> > > > > > > > type.
> > > > > > >
> > > > > > > For the limited set above(*) can you explain what's "special" 
> > > > > > > about
> > > > > > > vector(2) _Complex
> > > > > > > vs. vector(4) double, thus why we need to have 
> > > > > > > STMT_VINFO_COMPLEX_P at all?
> > > > > > Supporting a vector(2) complex  is a straightforward idea, just like
> > > > > > supporting other scalar type in vectorizer, but it requires more
> > > > > > efforts(in the backend and frontend), considering that most of
> > > > > > operations of complex type will be lowered into realpart and 
> > > > > > imagpart
> > > > > > operations, supporting a vector(2) complex does not look that
> > > > > > necessary. Then it comes up with supporting vector(4) double(with
> > > > > > adjustment of vf/ctor/permutation), the vectorizer only needs to
> > > > > > handle the vectorization of the move operation of the complex 
> > > > > > type(no
> > > > > > need to worry about wrongly mapping vector(4) double multiplication 
> > > > > > to
> > > > > > complex type multiplication since it's already lowered before
> > > > > > vectorizer).
> > > > > > stmt_info does not record the scalar type, in order to avoid 
> > > > > > duplicate
> > > > > > operation like getting a lhs type from stmt to determine whether it 
> > > > > > is
> > > > > > a complex type, STMT_VINFO_COMPLEX_P bit is added, this bit is 
> > > > > > mainly
> > > > > > initialized in vect_analyze_data_refs and vect_get_vector_types_for_
> > > > > > stmt.
> > > > > > >
> > > > > > > I wonder to what extent your handling can be extended to support 
> > > > > > > re-vectorizing
> > > > > > > (with a higher VF for example) already vectorized code?  The 
> > > > > > > vectorizer giving
> > > > > > > up on vector(2) double looks quite obviously similar to it giving 
> > > > > > > up
> > > > > > > on _Complex double ...
> > > > > > Yes, it can be extended to vector(2) double/float/int/ with a 
> > > > > > bit
> > > > > > adjustment(exacting element by using bit_field instead of
> > > > > > imagpart_expr/realpart_expr).
> > > > > > > It would be a shame to not use the same underlying mechanism for 
> > > > > > > dealing with
> > > > > > > both, where for the vector case obviously vector(4) would be 
> > > > > > > supported as well.
> > > > > > >
> > > > > > > In principle _Complex double operations should be two SLP lanes 
> > > > > > > but it seems you
> > > > > > > are handling them with classical interleaving as well?
> > > > > > I'm only handling move operations, for other operations it will be
> > > > > > lowered to realpart and imagpart and thus two SLP lanes.
> > > > >
> > > > > Yes, I understood that.
> > > > >
> > > > > Doing it more general (and IMHO better) would involve enhancing
> > > > > how we represent dataref groups, maintaining the number of scalars
> > > > > covered by each of the vinfos.  On the SLP representation side it
> > > > > probably requires to rely on the representative for access and not
> > > > > on the scalar stmts (since those do not map properly to the lanes).
> > > > >
> > > > > Ideally we'd be able to handle
> > > > >
> > > > > struct { _Complex double c; double a; double b; } a[], b[];
> > > > >
> > > > > void foo ()
> > > > > {
> > > > >for (int i = 0; i < 100; ++i)
> > > > > {
> > > > >   a[i].c = b[i].c;
> > > > >   a[i].a = b[i].a;
> > > > >   a[i].b = b[i].b;
> > > > > }
> > > > > }
> > > > >
> > > > > which I guess your patch doesn't handle with plain AVX vector
> > > > > copies but instead uses interleaving

Re: [PATCH] Extend 64-bit vector bit_op patterns with ?r alternative

2022-07-14 Thread Hongtao Liu via Gcc-patches

On Thu, Jul 14, 2022 at 3:22 PM Uros Bizjak via Gcc-patches
 wrote:
>
> On Thu, Jul 14, 2022 at 7:33 AM liuhongt  wrote:
> >
> > And split it to GPR-version instruction after reload.
> >
> > > ?r was introduced under the assumption that we want vector values
> > > mostly in vector registers. Currently there are no instructions with
> > > memory or immediate operand, so that made sense at the time. Let's
> > > keep ?r until logic instructions with mem/imm operands are introduced.
> > > So, for the patch that adds 64-bit vector logic in GPR, I would advise
> > > to first introduce only register operands. mem/imm operands should be
> > Update patch to add ?r to 64-bit bit_op patterns.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > No big imact on SPEC2017(Most same binary).
>
> The problem with your approach is with the combine pass, where combine
> first tries to recognize the combined instruction without clobber,
> before re-recognizing instruction with added clobber. So, if a forward
> propagation happens, the combine will *always* choose the insn variant
> without GPR.
Thank you for the explanation, I really did not know this point.
>
> So, the solution with VI_16_32 is to always expand with a clobbered
> version that is split to either SImode or V16QImode. With 64-bit
> instructions, we have two additional complications. First, we have a
> native MMX instruction, and we have to split to it after reload, and
> second, we have a builtin that expects vector insn.
>
> To solve the first issue, we should change the mode of
> "*mmx" to V1DImode and split your new _gpr version with
> clobber to it for !GENERAL_REG_P operands.
>
> The second issue could be solved by emitting V1DImode instructions
> directly from the expander. Please note there are several expanders
> that expect non-clobbered logic insn in certain mode to be available,
> so the situation can become quite annoying...
Yes. It looks like it would add a lot of code complexity, I'll hold
the patch for now.
>
> Uros.



-- 
BR,
Hongtao

Re: [PATCH] i386: Fix _mm_[u]comixx_{ss,sd} codegen and add PF result. [PR106113]

2022-07-14 Thread Hongtao Liu via Gcc-patches

On Thu, Jul 14, 2022 at 2:11 PM Kong, Lingling via Gcc-patches
 wrote:
>
> Hi,
>
> The patch is to fix _mm_[u]comixx_{ss,sd} codegen and add PF result.  These 
> intrinsics have changed over time, like `_mm_comieq_ss ` old operation is 
> `RETURN ( a[31:0] == b[31:0] ) ? 1 : 0`, and new operation update is `RETURN 
> ( a[31:0] != NaN AND b[31:0] != NaN AND a[31:0] == b[31:0] ) ? 1 : 0`.
>
> OK for master?
All _mm_comiXX_ss uses order_compare except for mm_comine_ss which
uses unordered_compare, now it's aligned with intrinsic guide.
Ok for trunk.
>
> gcc/ChangeLog:
>
> PR target/106113
> * config/i386/i386-builtin.def (BDESC): Fix [u]comi{ss,sd}
> comparison due to intrinsics changed over time.
> * config/i386/i386-expand.cc (ix86_ssecom_setcc):
> Add unordered check and mode for sse comi codegen.
> (ix86_expand_sse_comi): Add unordered check and check a different
> CCmode.
> (ix86_expand_sse_comi_round):Extract unordered check and mode part
> in ix86_ssecom_setcc.
>
> gcc/testsuite/ChangeLog:
>
> PR target/106113
> * gcc.target/i386/avx-vcomisd-pr106113-2.c: New test.
> * gcc.target/i386/avx-vcomiss-pr106113-2.c: Ditto.
> * gcc.target/i386/avx-vucomisd-pr106113-2.c: Ditto.
> * gcc.target/i386/avx-vucomiss-pr106113-2.c: Ditto.
> * gcc.target/i386/sse-comiss-pr106113-1.c: Ditto.
> * gcc.target/i386/sse-comiss-pr106113-2.c: Ditto.
> * gcc.target/i386/sse-ucomiss-pr106113-1.c: Ditto.
> * gcc.target/i386/sse-ucomiss-pr106113-2.c: Ditto.
> * gcc.target/i386/sse2-comisd-pr106113-1.c: Ditto.
> * gcc.target/i386/sse2-comisd-pr106113-2.c: Ditto.
> * gcc.target/i386/sse2-ucomisd-pr106113-1.c: Ditto.
> * gcc.target/i386/sse2-ucomisd-pr106113-2.c: Ditto.
> ---
>  gcc/config/i386/i386-builtin.def  |  32 ++--
>  gcc/config/i386/i386-expand.cc| 140 +++---
>  .../gcc.target/i386/avx-vcomisd-pr106113-2.c  |   8 +
>  .../gcc.target/i386/avx-vcomiss-pr106113-2.c  |   8 +
>  .../gcc.target/i386/avx-vucomisd-pr106113-2.c |   8 +
>  .../gcc.target/i386/avx-vucomiss-pr106113-2.c |   8 +
>  .../gcc.target/i386/sse-comiss-pr106113-1.c   |  19 +++
>  .../gcc.target/i386/sse-comiss-pr106113-2.c   |  59 
>  .../gcc.target/i386/sse-ucomiss-pr106113-1.c  |  19 +++
>  .../gcc.target/i386/sse-ucomiss-pr106113-2.c  |  59 
>  .../gcc.target/i386/sse2-comisd-pr106113-1.c  |  19 +++
>  .../gcc.target/i386/sse2-comisd-pr106113-2.c  |  59 
>  .../gcc.target/i386/sse2-ucomisd-pr106113-1.c |  19 +++
>  .../gcc.target/i386/sse2-ucomisd-pr106113-2.c |  59 
>  14 files changed, 450 insertions(+), 66 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx-vcomisd-pr106113-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx-vcomiss-pr106113-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx-vucomisd-pr106113-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx-vucomiss-pr106113-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sse-comiss-pr106113-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sse-comiss-pr106113-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sse-ucomiss-pr106113-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sse-ucomiss-pr106113-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sse2-comisd-pr106113-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sse2-comisd-pr106113-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sse2-ucomisd-pr106113-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/sse2-ucomisd-pr106113-2.c
>
> diff --git a/gcc/config/i386/i386-builtin.def 
> b/gcc/config/i386/i386-builtin.def
> index fd160935e67..acb7e8ca64b 100644
> --- a/gcc/config/i386/i386-builtin.def
> +++ b/gcc/config/i386/i386-builtin.def
> @@ -35,30 +35,30 @@
>  IX86_BUILTIN__BDESC_##NEXT_KIND##_FIRST - 1.  */
>
>  BDESC_FIRST (comi, COMI,
> -   OPTION_MASK_ISA_SSE, 0, CODE_FOR_sse_comi, "__builtin_ia32_comieq", 
> IX86_BUILTIN_COMIEQSS, UNEQ, 0)
> -BDESC (OPTION_MASK_ISA_SSE, 0, CODE_FOR_sse_comi, "__builtin_ia32_comilt", 
> IX86_BUILTIN_COMILTSS, UNLT, 0)
> -BDESC (OPTION_MASK_ISA_SSE, 0, CODE_FOR_sse_comi, "__builtin_ia32_comile", 
> IX86_BUILTIN_COMILESS, UNLE, 0)
> +   OPTION_MASK_ISA_SSE, 0, CODE_FOR_sse_comi, "__builtin_ia32_comieq", 
> IX86_BUILTIN_COMIEQSS, EQ, 0)
> +BDESC (OPTION_MASK_ISA_SSE, 0, CODE_FOR_sse_comi, "__builtin_ia32_comilt", 
> IX86_BUILTIN_COMILTSS, LT, 0)
> +BDESC (OPTION_MASK_ISA_SSE, 0, CODE_FOR_sse_comi, "__builtin_ia32_comile", 
> IX86_BUILTIN_COMILESS, LE, 0)
>  BDESC (OPTION_MASK_ISA_SSE, 0, CODE_FOR_sse_comi, "__builtin_ia32_comigt", 
> IX86_BUILTIN_COMIGTSS, GT, 0)
>  BDESC (OPTION_MASK_ISA_SSE, 0, CODE_FOR_sse_comi, "__builtin_ia32_comige", 
> IX86_BUILTIN_COMIGESS, GE, 0)
> -BDESC (OPTION_MASK_ISA_SSE, 0, CODE_FOR_sse_comi, "__builtin_ia32_comineq", 
> IX86_BUILTIN

Re: [PATCH] x86: Disable sibcall if indirect_return attribute doesn't match

2022-07-14 Thread Hongtao Liu via Gcc-patches

On Fri, Jul 15, 2022 at 1:44 AM H.J. Lu via Gcc-patches
 wrote:
>
> When shadow stack is enabled, function with indirect_return attribute
> may return via indirect jump.  In this case, we need to disable sibcall
> if caller doesn't have indirect_return attribute and indirect branch
> tracking is enabled since compiler won't generate ENDBR when calling the
> caller.
>
LGTM.
> gcc/
>
> PR target/85620
> * config/i386/i386.cc (ix86_function_ok_for_sibcall): Return
> false if callee has indirect_return attribute and caller
> doesn't.
>
> gcc/testsuite/
>
> PR target/85620
> * gcc.target/i386/pr85620-2.c: Updated.
> * gcc.target/i386/pr85620-5.c: New test.
> * gcc.target/i386/pr85620-6.c: Likewise.
> * gcc.target/i386/pr85620-7.c: Likewise.
> ---
>  gcc/config/i386/i386.cc   | 10 ++
>  gcc/testsuite/gcc.target/i386/pr85620-2.c |  3 ++-
>  gcc/testsuite/gcc.target/i386/pr85620-5.c | 13 +
>  gcc/testsuite/gcc.target/i386/pr85620-6.c | 14 ++
>  gcc/testsuite/gcc.target/i386/pr85620-7.c | 14 ++
>  5 files changed, 53 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr85620-5.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr85620-6.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr85620-7.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 3a3c7299eb4..e03f86d4a23 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -1024,6 +1024,16 @@ ix86_function_ok_for_sibcall (tree decl, tree exp)
>  return false;
>  }
>
> +  /* Disable sibcall if callee has indirect_return attribute and
> + caller doesn't since callee will return to the caller's caller
> + via an indirect jump.  */
> +  if (((flag_cf_protection & (CF_RETURN | CF_BRANCH))
> +   == (CF_RETURN | CF_BRANCH))
> +  && lookup_attribute ("indirect_return", TYPE_ATTRIBUTES (type))
> +  && !lookup_attribute ("indirect_return",
> +   TYPE_ATTRIBUTES (TREE_TYPE (cfun->decl
> +return false;
> +
>/* Otherwise okay.  That also includes certain types of indirect calls.  */
>return true;
>  }
> diff --git a/gcc/testsuite/gcc.target/i386/pr85620-2.c 
> b/gcc/testsuite/gcc.target/i386/pr85620-2.c
> index b2e680fa1fe..14ce0ffd1e1 100644
> --- a/gcc/testsuite/gcc.target/i386/pr85620-2.c
> +++ b/gcc/testsuite/gcc.target/i386/pr85620-2.c
> @@ -1,6 +1,7 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -fcf-protection" } */
> -/* { dg-final { scan-assembler-times {\mendbr} 1 } } */
> +/* { dg-final { scan-assembler-times {\mendbr} 2 } } */
> +/* { dg-final { scan-assembler-not "jmp" } } */
>
>  struct ucontext;
>
> diff --git a/gcc/testsuite/gcc.target/i386/pr85620-5.c 
> b/gcc/testsuite/gcc.target/i386/pr85620-5.c
> new file mode 100644
> index 000..04537702d09
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr85620-5.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fcf-protection" } */
> +/* { dg-final { scan-assembler-not "jmp" } } */
> +
> +struct ucontext;
> +
> +extern int (*bar) (struct ucontext *) __attribute__((__indirect_return__));
> +
> +int
> +foo (struct ucontext *oucp)
> +{
> +  return bar (oucp);
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr85620-6.c 
> b/gcc/testsuite/gcc.target/i386/pr85620-6.c
> new file mode 100644
> index 000..0b6a64e8454
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr85620-6.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fcf-protection" } */
> +/* { dg-final { scan-assembler "jmp" } } */
> +
> +struct ucontext;
> +
> +extern int bar (struct ucontext *) __attribute__((__indirect_return__));
> +
> +__attribute__((__indirect_return__))
> +int
> +foo (struct ucontext *oucp)
> +{
> +  return bar (oucp);
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr85620-7.c 
> b/gcc/testsuite/gcc.target/i386/pr85620-7.c
> new file mode 100644
> index 000..fa62d56decf
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr85620-7.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fcf-protection" } */
> +/* { dg-final { scan-assembler "jmp" } } */
> +
> +struct ucontext;
> +
> +extern int (*bar) (struct ucontext *) __attribute__((__indirect_return__));
> +extern int foo (struct ucontext *) __attribute__((__indirect_return__));
> +
> +int
> +foo (struct ucontext *oucp)
> +{
> +  return bar (oucp);
> +}
> --
> 2.36.1
>


-- 
BR,
Hongtao

Re: [AVX512 PATCH] Add UNSPEC_MASKOP to kupck instructions in sse.md.

2022-07-17 Thread Hongtao Liu via Gcc-patches

On Sat, Jul 16, 2022 at 10:08 PM Roger Sayle  wrote:
>
>
> This AVX512 specific patch to sse.md is split out from an earlier patch:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596199.html
>
> The new splitters proposed in that patch interfere with AVX512's
> kunpckdq instruction which is defined as identical RTL,
> DW:DI = (HI:SI<<32)|zero_extend(LO:SI).  To distinguish these,
> and avoid AVX512 mask registers accidentally being (ab)used by reload
> to perform SImode scalar shifts, this patch adds the explicit
> (unspec UNSPEC_MASKOP) to the unpack mask operations, which matches
> what sse.md does for the other mask specific (logic) operations.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
Ok, thanks for handling this.
>
> 2022-07-16  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/sse.md (kunpckhi): Add UNSPEC_MASKOP unspec.
> (kunpcksi): Likewise, add UNSPEC_MASKOP unspec.
> (kunpckdi): Likewise, add UNSPEC_MASKOP unspec.
> (vec_pack_trunc_qi): Update to specify required UNSPEC_MASKOP
> unspec.
> (vec_pack_trunc_): Likewise.
>
>
> Thanks in advance,
> Roger
> --
>


-- 
BR,
Hongtao

Re: [PATCH V2] Extend 16/32-bit vector bit_op patterns with (m, 0, i) alternative.

2022-07-18 Thread Hongtao Liu via Gcc-patches

On Tue, Jul 19, 2022 at 2:35 PM Uros Bizjak via Gcc-patches
 wrote:
>
> On Tue, Jul 19, 2022 at 8:07 AM liuhongt  wrote:
> >
> > And split it after reload.
> >
> > > You will need ix86_binary_operator_ok insn constraint here with
> > > corresponding expander using ix86_fixup_binary_operands_no_copy to
> > > prepare insn operands.
> > Split define_expand with just register_operand, and allow
> > memory/immediate in define_insn, assume combine/forwprop will do 
> > optimization.
>
> But you will *ease* the job of the above passes if you use
> ix86_fixup_binary_operands_no_copy in the expander.
for -m32, it will hit ICE in
Breakpoint 1, ix86_fixup_binary_operands_no_copy (code=XOR,
mode=E_V4QImode, operands=0x7fffa970) a
/gcc/config/i386/i386-expand.cc:1184
1184  rtx dst = ix86_fixup_binary_operands (code, mode, operands);
(gdb) n
1185  gcc_assert (dst == operands[0]); -- here
(gdb)

the original operands[0], operands[1], operands[2] are below
(gdb) p debug_rtx (operands[0])
(mem/c:V4QI (plus:SI (reg/f:SI 77 virtual-stack-vars)
(const_int -8220 [0xdfe4])) [0 MEM  [(unsigned char *)&tmp2 + 4B]+0 S4 A32])
$1 = void
(gdb) p debug_rtx (operands[1])
(subreg:V4QI (reg:SI 129) 0)
$2 = void
(gdb) p debug_rtx (operands[2])
(subreg:V4QI (reg:SI 98 [ _46 ]) 0)
$3 = void
(gdb)

since operands[0] is mem and not equal to operands[1],
ix86_fixup_binary_operands will create a pseudo register for dst. and
then hit ICE.
Is this a bug or assumed?

>
> Uros.
>
> >
> > > Please use if (!register_operand (operands[2], mode)) instead.
> > Changed.
> >
> > Update patch.
> >
> > gcc/ChangeLog:
> >
> > PR target/106038
> > * config/i386/mmx.md (3): New define_expand, it's
> > original "3".
> > (*3): New define_insn, it's original
> > "3" be extended to handle memory and immediate
> > operand with ix86_binary_operator_ok. Also adjust define_split
> > after it.
> > (mmxinsnmode): New mode attribute.
> > (*mov_imm): Refactor with mmxinsnmode.
> > * config/i386/predicates.md
> > (register_or_x86_64_const_vector_operand): New predicate.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/pr106038-1.c: New test.
> > ---
> >  gcc/config/i386/mmx.md | 71 --
> >  gcc/config/i386/predicates.md  |  4 ++
> >  gcc/testsuite/gcc.target/i386/pr106038-1.c | 27 
> >  3 files changed, 71 insertions(+), 31 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr106038-1.c
> >
> > diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
> > index 3294c1e6274..316b83dd3ac 100644
> > --- a/gcc/config/i386/mmx.md
> > +++ b/gcc/config/i386/mmx.md
> > @@ -86,6 +86,14 @@ (define_mode_attr mmxvecsize
> >[(V8QI "b") (V4QI "b") (V2QI "b")
> > (V4HI "w") (V2HI "w") (V2SI "d") (V1DI "q")])
> >
> > +;; Mapping to same size integral mode.
> > +(define_mode_attr mmxinsnmode
> > +  [(V8QI "DI") (V4QI "SI") (V2QI "HI")
> > +   (V4HI "DI") (V2HI "SI")
> > +   (V2SI "DI")
> > +   (V4HF "DI") (V2HF "SI")
> > +   (V2SF "DI")])
> > +
> >  (define_mode_attr mmxdoublemode
> >[(V8QI "V8HI") (V4HI "V4SI")])
> >
> > @@ -350,22 +358,7 @@ (define_insn_and_split "*mov_imm"
> >HOST_WIDE_INT val = ix86_convert_const_vector_to_integer (operands[1],
> > mode);
> >operands[1] = GEN_INT (val);
> > -  machine_mode mode;
> > -  switch (GET_MODE_SIZE (mode))
> > -{
> > -case 2:
> > -  mode = HImode;
> > -  break;
> > -case 4:
> > -  mode = SImode;
> > -  break;
> > -case 8:
> > -  mode = DImode;
> > -  break;
> > -default:
> > -  gcc_unreachable ();
> > -}
> > -  operands[0] = lowpart_subreg (mode, operands[0], mode);
> > +  operands[0] = lowpart_subreg (mode, operands[0], 
> > mode);
> >  })
> >
> >  ;; For TARGET_64BIT we always round up to 8 bytes.
> > @@ -2974,33 +2967,49 @@ (define_insn "*mmx_3"
> > (set_attr "type" "mmxadd,sselog,sselog,sselog")
> > (set_attr "mode" "DI,TI,TI,TI")])
> >
> > -(define_insn "3"
> > -  [(set (match_operand:VI_16_32 0 "register_operand" "=?r,x,x,v")
> > +(define_expand "3"
> > +  [(parallel
> > +[(set (match_operand:VI_16_32 0 "register_operand")
> > +(any_logic:VI_16_32
> > + (match_operand:VI_16_32 1 "register_operand")
> > + (match_operand:VI_16_32 2 "register_operand")))
> > +   (clobber (reg:CC FLAGS_REG))])]
> > +  "")
> > +
> > +(define_insn "*3"
> > +  [(set (match_operand:VI_16_32 0 "nonimmediate_operand" "=?r,m,x,x,v")
> >  (any_logic:VI_16_32
> > - (match_operand:VI_16_32 1 "register_operand" "%0,0,x,v")
> > - (match_operand:VI_16_32 2 "register_operand" "r,x,x,v")))
> > + (match_operand:VI_16_32 1 "nonimmediate_operand" "%0,0,0,x,v")
> > + (match_operand:VI_16_32 2 
> > "register_or_x86_64_const_vector_operand" "r,i,x,x,v")))
> >

Re: [PATCH V2] Extend 16/32-bit vector bit_op patterns with (m, 0, i) alternative.

2022-07-19 Thread Hongtao Liu via Gcc-patches

On Tue, Jul 19, 2022 at 5:37 PM Uros Bizjak  wrote:
>
> On Tue, Jul 19, 2022 at 8:56 AM Hongtao Liu  wrote:
> >
> > On Tue, Jul 19, 2022 at 2:35 PM Uros Bizjak via Gcc-patches
> >  wrote:
> > >
> > > On Tue, Jul 19, 2022 at 8:07 AM liuhongt  wrote:
> > > >
> > > > And split it after reload.
> > > >
> > > > > You will need ix86_binary_operator_ok insn constraint here with
> > > > > corresponding expander using ix86_fixup_binary_operands_no_copy to
> > > > > prepare insn operands.
> > > > Split define_expand with just register_operand, and allow
> > > > memory/immediate in define_insn, assume combine/forwprop will do 
> > > > optimization.
> > >
> > > But you will *ease* the job of the above passes if you use
> > > ix86_fixup_binary_operands_no_copy in the expander.
> > for -m32, it will hit ICE in
> > Breakpoint 1, ix86_fixup_binary_operands_no_copy (code=XOR,
> > mode=E_V4QImode, operands=0x7fffa970) a
> > /gcc/config/i386/i386-expand.cc:1184
> > 1184  rtx dst = ix86_fixup_binary_operands (code, mode, operands);
> > (gdb) n
> > 1185  gcc_assert (dst == operands[0]); -- here
> > (gdb)
> >
> > the original operands[0], operands[1], operands[2] are below
> > (gdb) p debug_rtx (operands[0])
> > (mem/c:V4QI (plus:SI (reg/f:SI 77 virtual-stack-vars)
> > (const_int -8220 [0xdfe4])) [0 MEM  > unsigned char> [(unsigned char *)&tmp2 + 4B]+0 S4 A32])
> > $1 = void
> > (gdb) p debug_rtx (operands[1])
> > (subreg:V4QI (reg:SI 129) 0)
> > $2 = void
> > (gdb) p debug_rtx (operands[2])
> > (subreg:V4QI (reg:SI 98 [ _46 ]) 0)
> > $3 = void
> > (gdb)
> >
> > since operands[0] is mem and not equal to operands[1],
> > ix86_fixup_binary_operands will create a pseudo register for dst. and
> > then hit ICE.
> > Is this a bug or assumed?
>
> You will need ix86_expand_binary_operator here.
It will swap memory operand from op1 to op2 and hit ICE for unrecognized insn.

What about this?

-(define_insn "3"
-  [(set (match_operand:VI_16_32 0 "register_operand" "=?r,x,x,v")
+(define_expand "3"
+  [(set (match_operand:VI_16_32 0 "nonimmediate_operand")
 (any_logic:VI_16_32
- (match_operand:VI_16_32 1 "register_operand" "%0,0,x,v")
- (match_operand:VI_16_32 2 "register_operand" "r,x,x,v")))
-   (clobber (reg:CC FLAGS_REG))]
+ (match_operand:VI_16_32 1 "nonimmediate_operand")
+ (match_operand:VI_16_32 2
"register_or_x86_64_const_vector_operand")))]
   ""
+{
+  rtx dst = ix86_fixup_binary_operands (, mode, operands);
+  if (MEM_P (operands[2]))
+operands[2] = force_reg (mode, operands[2]);
+  rtx op = gen_rtx_SET (dst, gen_rtx_fmt_ee (, mode,
+operands[1], operands[2]));
+  rtx clob = gen_rtx_CLOBBER (VOIDmode, gen_rtx_REG (CCmode, FLAGS_REG));
+  emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, op, clob)));
+  if (dst != operands[0])
+emit_move_insn (operands[0], dst);
+   DONE;
+})
+

>
> Uros.



-- 
BR,
Hongtao

Re: [PATCH V2] Extend 16/32-bit vector bit_op patterns with (m, 0, i) alternative.

2022-07-19 Thread Hongtao Liu via Gcc-patches

On Wed, Jul 20, 2022 at 2:18 PM Uros Bizjak  wrote:
>
> On Wed, Jul 20, 2022 at 8:14 AM Uros Bizjak  wrote:
> >
> > On Wed, Jul 20, 2022 at 4:37 AM Hongtao Liu  wrote:
> > >
> > > On Tue, Jul 19, 2022 at 5:37 PM Uros Bizjak  wrote:
> > > >
> > > > On Tue, Jul 19, 2022 at 8:56 AM Hongtao Liu  wrote:
> > > > >
> > > > > On Tue, Jul 19, 2022 at 2:35 PM Uros Bizjak via Gcc-patches
> > > > >  wrote:
> > > > > >
> > > > > > On Tue, Jul 19, 2022 at 8:07 AM liuhongt  
> > > > > > wrote:
> > > > > > >
> > > > > > > And split it after reload.
> > > > > > >
> > > > > > > > You will need ix86_binary_operator_ok insn constraint here with
> > > > > > > > corresponding expander using ix86_fixup_binary_operands_no_copy 
> > > > > > > > to
> > > > > > > > prepare insn operands.
> > > > > > > Split define_expand with just register_operand, and allow
> > > > > > > memory/immediate in define_insn, assume combine/forwprop will do 
> > > > > > > optimization.
> > > > > >
> > > > > > But you will *ease* the job of the above passes if you use
> > > > > > ix86_fixup_binary_operands_no_copy in the expander.
> > > > > for -m32, it will hit ICE in
> > > > > Breakpoint 1, ix86_fixup_binary_operands_no_copy (code=XOR,
> > > > > mode=E_V4QImode, operands=0x7fffa970) a
> > > > > /gcc/config/i386/i386-expand.cc:1184
> > > > > 1184  rtx dst = ix86_fixup_binary_operands (code, mode, operands);
> > > > > (gdb) n
> > > > > 1185  gcc_assert (dst == operands[0]); -- here
> > > > > (gdb)
> > > > >
> > > > > the original operands[0], operands[1], operands[2] are below
> > > > > (gdb) p debug_rtx (operands[0])
> > > > > (mem/c:V4QI (plus:SI (reg/f:SI 77 virtual-stack-vars)
> > > > > (const_int -8220 [0xdfe4])) [0 MEM  > > > > unsigned char> [(unsigned char *)&tmp2 + 4B]+0 S4 A32])
> > > > > $1 = void
> > > > > (gdb) p debug_rtx (operands[1])
> > > > > (subreg:V4QI (reg:SI 129) 0)
> > > > > $2 = void
> > > > > (gdb) p debug_rtx (operands[2])
> > > > > (subreg:V4QI (reg:SI 98 [ _46 ]) 0)
> > > > > $3 = void
> > > > > (gdb)
> > > > >
> > > > > since operands[0] is mem and not equal to operands[1],
> > > > > ix86_fixup_binary_operands will create a pseudo register for dst. and
> > > > > then hit ICE.
> > > > > Is this a bug or assumed?
> > > >
> > > > You will need ix86_expand_binary_operator here.
> > > It will swap memory operand from op1 to op2 and hit ICE for unrecognized 
> > > insn.
> > >
> > > What about this?
> >
> > Still no good... You are using commutative operands, so the predicate
> > of operand 2 should also allow memory. So, the predicate should be
> > nonimmediate_or_x86_64_const_vector_operand. The intermediate insn
> > pattern should look something like *_1, but with
> > added XMM and MMX reg alternatives instead of mask regs.
>
> Alternatively, you can use UNKNOWN operator to prevent
> canonicalization, but then you should not use commutative constraint
> in the intermediate insn. I think this is the best solution.
Like this?

-(define_insn "3"
-  [(set (match_operand:VI_16_32 0 "register_operand" "=?r,x,x,v")
+(define_expand "3"
+  [(set (match_operand:VI_16_32 0 "nonimmediate_operand")
 (any_logic:VI_16_32
- (match_operand:VI_16_32 1 "register_operand" "%0,0,x,v")
- (match_operand:VI_16_32 2 "register_operand" "r,x,x,v")))
-   (clobber (reg:CC FLAGS_REG))]
+ (match_operand:VI_16_32 1 "nonimmediate_operand")
+ (match_operand:VI_16_32 2
"register_or_x86_64_const_vector_operand")))]
   ""
+{
+  rtx dst = ix86_fixup_binary_operands (, mode, operands);
+  if (MEM_P (operands[2]))
+operands[2] = force_reg (mode, operands[2]);
+  rtx op = gen_rtx_SET (dst, gen_rtx_fmt_ee (, mode,
+operands[1], operands[2]));
+  rtx clob = gen_rtx_CLOBBER (VOIDmode, gen_rtx_REG (CCmode, FLAGS_REG));
+  emit_insn (gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, op, clob)));
+  if (dst != operands[0])
+emit_move_insn (operands[0], dst);
+   DONE;
+})
+
+(define_insn "*3"
+  [(set (match_operand:VI_16_32 0 "nonimmediate_operand" "=?r,m,x,x,v")
+(any_logic:VI_16_32
+ (match_operand:VI_16_32 1 "nonimmediate_operand" "0,0,0,x,v")
+ (match_operand:VI_16_32 2
"register_or_x86_64_const_vector_operand" "r,i,x,x,v")))
+   (clobber (reg:CC FLAGS_REG))]
+  "ix86_binary_operator_ok (UNKNOWN, mode, operands)"
   "#"
-  [(set_attr "isa" "*,sse2_noavx,avx,avx512vl")
-   (set_attr "type" "alu,sselog,sselog,sselog")
-   (set_attr "mode" "SI,TI,TI,TI")])
+  [(set_attr "isa" "*,*,sse2_noavx,avx,avx512vl")
+   (set_attr "type" "alu,alu,sselog,sselog,sselog")
+   (set_attr "mode" "SI,SI,TI,TI,TI")])

>
> Uros.
>
> > >
> > > -(define_insn "3"
> > > -  [(set (match_operand:VI_16_32 0 "register_operand" "=?r,x,x,v")
> > > +(define_expand "3"
> > > +  [(set (match_operand:VI_16_32 0 "nonimmediate_operand")
> > >  (any_logic:VI_16_32
> > > - (match_operand:VI_16_32 1 "register_operand" "%0,0,x,v")
> > > - (match_operand

Re: [PATCH V2] Extend 16/32-bit vector bit_op patterns with (m, 0, i) alternative.

2022-07-20 Thread Hongtao Liu via Gcc-patches

On Wed, Jul 20, 2022 at 3:18 PM Uros Bizjak  wrote:
>
> On Wed, Jul 20, 2022 at 8:54 AM Hongtao Liu  wrote:
> >
> > On Wed, Jul 20, 2022 at 2:18 PM Uros Bizjak  wrote:
> > >
> > > On Wed, Jul 20, 2022 at 8:14 AM Uros Bizjak  wrote:
> > > >
> > > > On Wed, Jul 20, 2022 at 4:37 AM Hongtao Liu  wrote:
> > > > >
> > > > > On Tue, Jul 19, 2022 at 5:37 PM Uros Bizjak  wrote:
> > > > > >
> > > > > > On Tue, Jul 19, 2022 at 8:56 AM Hongtao Liu  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Tue, Jul 19, 2022 at 2:35 PM Uros Bizjak via Gcc-patches
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Tue, Jul 19, 2022 at 8:07 AM liuhongt 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > And split it after reload.
> > > > > > > > >
> > > > > > > > > > You will need ix86_binary_operator_ok insn constraint here 
> > > > > > > > > > with
> > > > > > > > > > corresponding expander using 
> > > > > > > > > > ix86_fixup_binary_operands_no_copy to
> > > > > > > > > > prepare insn operands.
> > > > > > > > > Split define_expand with just register_operand, and allow
> > > > > > > > > memory/immediate in define_insn, assume combine/forwprop will 
> > > > > > > > > do optimization.
> > > > > > > >
> > > > > > > > But you will *ease* the job of the above passes if you use
> > > > > > > > ix86_fixup_binary_operands_no_copy in the expander.
> > > > > > > for -m32, it will hit ICE in
> > > > > > > Breakpoint 1, ix86_fixup_binary_operands_no_copy (code=XOR,
> > > > > > > mode=E_V4QImode, operands=0x7fffa970) a
> > > > > > > /gcc/config/i386/i386-expand.cc:1184
> > > > > > > 1184  rtx dst = ix86_fixup_binary_operands (code, mode, 
> > > > > > > operands);
> > > > > > > (gdb) n
> > > > > > > 1185  gcc_assert (dst == operands[0]); -- here
> > > > > > > (gdb)
> > > > > > >
> > > > > > > the original operands[0], operands[1], operands[2] are below
> > > > > > > (gdb) p debug_rtx (operands[0])
> > > > > > > (mem/c:V4QI (plus:SI (reg/f:SI 77 virtual-stack-vars)
> > > > > > > (const_int -8220 [0xdfe4])) [0 MEM  > > > > > > unsigned char> [(unsigned char *)&tmp2 + 4B]+0 S4 A32])
> > > > > > > $1 = void
> > > > > > > (gdb) p debug_rtx (operands[1])
> > > > > > > (subreg:V4QI (reg:SI 129) 0)
> > > > > > > $2 = void
> > > > > > > (gdb) p debug_rtx (operands[2])
> > > > > > > (subreg:V4QI (reg:SI 98 [ _46 ]) 0)
> > > > > > > $3 = void
> > > > > > > (gdb)
> > > > > > >
> > > > > > > since operands[0] is mem and not equal to operands[1],
> > > > > > > ix86_fixup_binary_operands will create a pseudo register for dst. 
> > > > > > > and
> > > > > > > then hit ICE.
> > > > > > > Is this a bug or assumed?
> > > > > >
> > > > > > You will need ix86_expand_binary_operator here.
> > > > > It will swap memory operand from op1 to op2 and hit ICE for 
> > > > > unrecognized insn.
> > > > >
> > > > > What about this?
> > > >
> > > > Still no good... You are using commutative operands, so the predicate
> > > > of operand 2 should also allow memory. So, the predicate should be
> > > > nonimmediate_or_x86_64_const_vector_operand. The intermediate insn
> > > > pattern should look something like *_1, but with
> > > > added XMM and MMX reg alternatives instead of mask regs.
> > >
> > > Alternatively, you can use UNKNOWN operator to prevent
> > > canonicalization, but then you should not use commutative constraint
> > > in the intermediate insn. I think this is the best solution.
> > Like this?
>
> Please check the attached (lightly tested) patch that keeps
> commutative operands.
Yes, it looks best, I'll fully test the patch.
>
> Uros.



-- 
BR,
Hongtao

gcc-patches@gcc.gnu.org

2022-07-20 Thread Hongtao Liu via Gcc-patches

On Wed, Jul 20, 2022 at 4:00 PM Richard Biener via Gcc-patches
 wrote:
>
> On Wed, Jul 20, 2022 at 4:46 AM liuhongt  wrote:
> >
> > > My original comments still stand (it feels like this should be more 
> > > generic).
> > > Can we go the way lowering complex loads/stores first?  A large part
> > > of the testcases
> > > added by the patch should pass after that.
> >
> > This is the patch as suggested, one additional change is handling 
> > COMPLEX_CST
> > for rhs. And it will enable vectorization for pr106010-8a.c.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ok for trunk?
>
> OK.
>
> Are there cases left your vectorizer patch handles over this one?
No.
>
> Thanks,
> Richard.
>
> > 2022-07-20  Richard Biener  
> > Hongtao Liu  
> >
> > gcc/ChangeLog:
> >
> > PR tree-optimization/106010
> > * tree-complex.cc (init_dont_simulate_again): Lower complex
> > type move.
> > (expand_complex_move): Also expand COMPLEX_CST for rhs.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/pr106010-1a.c: New test.
> > * gcc.target/i386/pr106010-1b.c: New test.
> > * gcc.target/i386/pr106010-1c.c: New test.
> > * gcc.target/i386/pr106010-2a.c: New test.
> > * gcc.target/i386/pr106010-2b.c: New test.
> > * gcc.target/i386/pr106010-2c.c: New test.
> > * gcc.target/i386/pr106010-3a.c: New test.
> > * gcc.target/i386/pr106010-3b.c: New test.
> > * gcc.target/i386/pr106010-3c.c: New test.
> > * gcc.target/i386/pr106010-4a.c: New test.
> > * gcc.target/i386/pr106010-4b.c: New test.
> > * gcc.target/i386/pr106010-4c.c: New test.
> > * gcc.target/i386/pr106010-5a.c: New test.
> > * gcc.target/i386/pr106010-5b.c: New test.
> > * gcc.target/i386/pr106010-5c.c: New test.
> > * gcc.target/i386/pr106010-6a.c: New test.
> > * gcc.target/i386/pr106010-6b.c: New test.
> > * gcc.target/i386/pr106010-6c.c: New test.
> > * gcc.target/i386/pr106010-7a.c: New test.
> > * gcc.target/i386/pr106010-7b.c: New test.
> > * gcc.target/i386/pr106010-7c.c: New test.
> > * gcc.target/i386/pr106010-8a.c: New test.
> > * gcc.target/i386/pr106010-8b.c: New test.
> > * gcc.target/i386/pr106010-8c.c: New test.
> > * gcc.target/i386/pr106010-9a.c: New test.
> > * gcc.target/i386/pr106010-9b.c: New test.
> > * gcc.target/i386/pr106010-9c.c: New test.
> > * gcc.target/i386/pr106010-9d.c: New test.
> > ---
> >  gcc/testsuite/gcc.target/i386/pr106010-1a.c |  58 
> >  gcc/testsuite/gcc.target/i386/pr106010-1b.c |  63 
> >  gcc/testsuite/gcc.target/i386/pr106010-1c.c |  41 +
> >  gcc/testsuite/gcc.target/i386/pr106010-2a.c |  82 ++
> >  gcc/testsuite/gcc.target/i386/pr106010-2b.c |  62 
> >  gcc/testsuite/gcc.target/i386/pr106010-2c.c |  47 ++
> >  gcc/testsuite/gcc.target/i386/pr106010-3a.c |  80 ++
> >  gcc/testsuite/gcc.target/i386/pr106010-3b.c | 126 
> >  gcc/testsuite/gcc.target/i386/pr106010-3c.c |  69 +
> >  gcc/testsuite/gcc.target/i386/pr106010-4a.c | 101 +
> >  gcc/testsuite/gcc.target/i386/pr106010-4b.c |  67 +
> >  gcc/testsuite/gcc.target/i386/pr106010-4c.c |  54 +++
> >  gcc/testsuite/gcc.target/i386/pr106010-5a.c | 117 +++
> >  gcc/testsuite/gcc.target/i386/pr106010-5b.c |  80 ++
> >  gcc/testsuite/gcc.target/i386/pr106010-5c.c |  62 
> >  gcc/testsuite/gcc.target/i386/pr106010-6a.c | 115 ++
> >  gcc/testsuite/gcc.target/i386/pr106010-6b.c | 157 
> >  gcc/testsuite/gcc.target/i386/pr106010-6c.c |  80 ++
> >  gcc/testsuite/gcc.target/i386/pr106010-7a.c |  58 
> >  gcc/testsuite/gcc.target/i386/pr106010-7b.c |  63 
> >  gcc/testsuite/gcc.target/i386/pr106010-7c.c |  41 +
> >  gcc/testsuite/gcc.target/i386/pr106010-8a.c |  58 
> >  gcc/testsuite/gcc.target/i386/pr106010-8b.c |  53 +++
> >  gcc/testsuite/gcc.target/i386/pr106010-8c.c |  38 +
> >  gcc/testsuite/gcc.target/i386/pr106010-9a.c |  89 +++
> >  gcc/testsuite/gcc.target/i386/pr106010-9b.c |  90 +++
> >  gcc/testsuite/gcc.target/i386/pr106010-9c.c |  90 +++
> >  gcc/testsuite/gcc.target/i386/pr106010-9d.c |  92 
> >  gcc/tree-complex.cc |   9 +-
> >  29 files changed, 2141 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-1a.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-1b.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-1c.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-2a.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-2b.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-2c.c
> >  create mode 100644 gcc/testsuite/gcc.tar

Re: [PATCH] Move pass_cse_sincos after vectorizer.

2022-07-20 Thread Hongtao Liu via Gcc-patches

On Wed, Jul 20, 2022 at 3:59 PM Richard Biener via Gcc-patches
 wrote:
>
> On Wed, Jul 20, 2022 at 4:20 AM liuhongt  wrote:
> >
> > __builtin_cexpi can't be vectorized since there's gap between it and
> > vectorized sincos version(In libmvec, it passes a double and two
> > double pointer and returns nothing.) And it will lose some
> > vectorization opportunity if sin & cos are optimized to cexpi before
> > vectorizer.
> >
> > I'm trying to add vect_recog_cexpi_pattern to split cexpi to sin and
> > cos, but it failed vectorizable_simd_clone_call since NULL is returned
> > by cgraph_node::get (fndecl).  So alternatively, the patch try to move
> > pass_cse_sincos after vectorizer, just before pas_cse_reciprocals.
> >
> > Also original pass_cse_sincos additionaly expands pow&cabs, this patch
> > split that part into a separate pass named pass_expand_powcabs which
> > remains the old pass position.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Observe more libmvec sin/cos vectorization in specfp, but no big 
> > performance.
> >
> > Ok for trunk?
>
> OK.
>
> I wonder if we can merge the workers of the three passes we have into
> a single function, handing it an argument what to handle to be a bit more
> flexible in the future.  That would also avoid doing
>
> > +  NEXT_PASS (pass_cse_sincos);
> >NEXT_PASS (pass_cse_reciprocals);
>
> thus two function walks after each other.  But I guess that can be done
> as followup (or not if we decide so).
Let me try this as followup.
>
> Thanks,
> Richard.
>
> >
> > gcc/ChangeLog:
> >
> > * passes.def: (Split pass_cse_sincos to pass_expand_powcabs
> > and pass_cse_sincos, and move pass_cse_sincos after vectorizer).
> > * timevar.def (TV_TREE_POWCABS): New timevar.
> > * tree-pass.h (make_pass_expand_powcabs): Split from 
> > pass_cse_sincos.
> > * tree-ssa-math-opts.cc (gimple_expand_builtin_cabs): Ditto.
> > (class pass_expand_powcabs): Ditto.
> > (pass_expand_powcabs::execute): Ditto.
> > (make_pass_expand_powcabs): Ditto.
> > (pass_cse_sincos::execute): Remove pow/cabs expand part.
> > (make_pass_cse_sincos): Ditto.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.dg/pow-sqrt-synth-1.c: Adjust testcase.
> > ---
> >  gcc/passes.def  |   3 +-
> >  gcc/testsuite/gcc.dg/pow-sqrt-synth-1.c |   4 +-
> >  gcc/timevar.def |   1 +
> >  gcc/tree-pass.h |   1 +
> >  gcc/tree-ssa-math-opts.cc   | 112 +++-
> >  5 files changed, 97 insertions(+), 24 deletions(-)
> >
> > diff --git a/gcc/passes.def b/gcc/passes.def
> > index 375d3d62d51..6bb92efacd4 100644
> > --- a/gcc/passes.def
> > +++ b/gcc/passes.def
> > @@ -253,7 +253,7 @@ along with GCC; see the file COPYING3.  If not see
> >NEXT_PASS (pass_ccp, true /* nonzero_p */);
> >/* After CCP we rewrite no longer addressed locals into SSA
> >  form if possible.  */
> > -  NEXT_PASS (pass_cse_sincos);
> > +  NEXT_PASS (pass_expand_powcabs);
> >NEXT_PASS (pass_optimize_bswap);
> >NEXT_PASS (pass_laddress);
> >NEXT_PASS (pass_lim);
> > @@ -328,6 +328,7 @@ along with GCC; see the file COPYING3.  If not see
> >NEXT_PASS (pass_simduid_cleanup);
> >NEXT_PASS (pass_lower_vector_ssa);
> >NEXT_PASS (pass_lower_switch);
> > +  NEXT_PASS (pass_cse_sincos);
> >NEXT_PASS (pass_cse_reciprocals);
> >NEXT_PASS (pass_reassoc, false /* early_p */);
> >NEXT_PASS (pass_strength_reduction);
> > diff --git a/gcc/testsuite/gcc.dg/pow-sqrt-synth-1.c 
> > b/gcc/testsuite/gcc.dg/pow-sqrt-synth-1.c
> > index 4a94325cdb3..484b29a8fc8 100644
> > --- a/gcc/testsuite/gcc.dg/pow-sqrt-synth-1.c
> > +++ b/gcc/testsuite/gcc.dg/pow-sqrt-synth-1.c
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile { target sqrt_insn } } */
> > -/* { dg-options "-fdump-tree-sincos -Ofast --param max-pow-sqrt-depth=8" } 
> > */
> > +/* { dg-options "-fdump-tree-powcabs -Ofast --param max-pow-sqrt-depth=8" 
> > } */
> >  /* { dg-additional-options "-mfloat-abi=softfp -mfpu=neon-vfpv4" { target 
> > arm*-*-* } } */
> >
> >  double
> > @@ -34,4 +34,4 @@ vecfoo (double *a)
> >  a[i] = __builtin_pow (a[i], 1.25);
> >  }
> >
> > -/* { dg-final { scan-tree-dump-times "synthesizing" 7 "sincos" } } */
> > +/* { dg-final { scan-tree-dump-times "synthesizing" 7 "powcabs" } } */
> > diff --git a/gcc/timevar.def b/gcc/timevar.def
> > index 2dae5e1c760..651af19876f 100644
> > --- a/gcc/timevar.def
> > +++ b/gcc/timevar.def
> > @@ -220,6 +220,7 @@ DEFTIMEVAR (TV_TREE_SWITCH_CONVERSION, "tree switch 
> > conversion")
> >  DEFTIMEVAR (TV_TREE_SWITCH_LOWERING,   "tree switch lowering")
> >  DEFTIMEVAR (TV_TREE_RECIP, "gimple CSE reciprocals")
> >  DEFTIMEVAR (TV_TREE_SINCOS   , "gimple CSE sin/cos")
> > +DEFTIMEVAR (TV_TREE_POWCABS   , "gimple expand pow/cabs

Re: [PATCH] x86: Enable __bf16 type for TARGET_SSE2 and above

2022-08-03 Thread Hongtao Liu via Gcc-patches

On Wed, Aug 3, 2022 at 4:41 PM Kong, Lingling via Gcc-patches
 wrote:
>
> Hi,
>
> Old patch has some mistake in `*movbf_internal` , now disable BFmode constant 
> double move in `*movbf_internal`.
LGTM.
>
> Thanks,
> Lingling
>
> > -Original Message-
> > From: Kong, Lingling 
> > Sent: Tuesday, July 26, 2022 9:31 AM
> > To: Liu, Hongtao ; gcc-patches@gcc.gnu.org
> > Cc: Kong, Lingling 
> > Subject: [PATCH] x86: Enable __bf16 type for TARGET_SSE2 and above
> >
> > Hi,
> >
> > The patch is enable __bf16 scalar type for target sse2 and above according 
> > to
> > psABI(https://gitlab.com/x86-psABIs/x86-64-ABI/-/merge_requests/35/diffs).
> > The __bf16 type is a storage type like arm.
> >
> > OK for master?
> >
> > gcc/ChangeLog:
> >
> >   * config/i386/i386-builtin-types.def (BFLOAT16): New primitive type.
> >   * config/i386/i386-builtins.cc : Support __bf16 type for i386 backend.
> >   (ix86_register_bf16_builtin_type): New function.
> >   (ix86_bf16_type_node): New.
> >   (ix86_bf16_ptr_type_node): Ditto.
> >   (ix86_init_builtin_types): Add ix86_register_bf16_builtin_type 
> > function
> > call.
> >   * config/i386/i386-modes.def (FLOAT_MODE): Add BFmode.
> >   (ADJUST_FLOAT_FORMAT): Ditto.
> >   * config/i386/i386.cc (merge_classes): Handle BFmode.
> >   (classify_argument): Ditto.
> >   (examine_argument): Ditto.
> >   (construct_container): Ditto.
> >   (function_value_32): Return __bf16 by %xmm0.
> >   (function_value_64): Return __bf16 by SSE register.
> >   (ix86_print_operand): Handle CONST_DOUBLE BFmode.
> >   (ix86_secondary_reload): Require gpr as intermediate register
> >   to store __bf16 from sse register when sse4 is not available.
> >   (ix86_scalar_mode_supported_p): Enable __bf16 under sse2.
> >   (ix86_mangle_type): Add manlging for __bf16 type.
> >   (ix86_invalid_conversion): New function for target hook.
> >   (ix86_invalid_unary_op): Ditto.
> >   (ix86_invalid_binary_op): Ditto.
> >   (TARGET_INVALID_CONVERSION): New define for target hook.
> >   (TARGET_INVALID_UNARY_OP): Ditto.
> >   (TARGET_INVALID_BINARY_OP): Ditto.
> >   * config/i386/i386.h (host_detect_local_cpu): Add BFmode.
> >   * config/i386/i386.md (*pushhf_rex64): Change for BFmode.
> >   (*push_rex64): Ditto.
> >   (*pushhf): Ditto.
> >   (*push): Ditto.
> >   (*movhf_internal): Ditto.
> >   (*mov_internal): Ditto.
> >
> > gcc/testsuite/ChangeLog:
> >
> >   * g++.target/i386/bfloat_cpp_typecheck.C: New test.
> >   * gcc.target/i386/bfloat16-1.c: Ditto.
> >   * gcc.target/i386/sse2-bfloat16-1.c: Ditto.
> >   * gcc.target/i386/sse2-bfloat16-2.c: Ditto.
> >   * gcc.target/i386/sse2-bfloat16-scalar-typecheck.c: Ditto.
> > ---
> >  gcc/config/i386/i386-builtin-types.def|   1 +
> >  gcc/config/i386/i386-builtins.cc  |  21 ++
> >  gcc/config/i386/i386-modes.def|   2 +
> >  gcc/config/i386/i386.cc   |  75 +-
> >  gcc/config/i386/i386.h|   4 +-
> >  gcc/config/i386/i386.md   |  32 +--
> >  .../g++.target/i386/bfloat_cpp_typecheck.C|  10 +
> >  gcc/testsuite/gcc.target/i386/bfloat16-1.c|  12 +
> >  .../gcc.target/i386/sse2-bfloat16-1.c |   8 +
> >  .../gcc.target/i386/sse2-bfloat16-2.c |  17 ++
> >  .../i386/sse2-bfloat16-scalar-typecheck.c | 215 ++
> >  11 files changed, 375 insertions(+), 22 deletions(-)  create mode 100644
> > gcc/testsuite/g++.target/i386/bfloat_cpp_typecheck.C
> >  create mode 100644 gcc/testsuite/gcc.target/i386/bfloat16-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/sse2-bfloat16-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/sse2-bfloat16-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/sse2-bfloat16-scalar-
> > typecheck.c
> >
> > diff --git a/gcc/config/i386/i386-builtin-types.def b/gcc/config/i386/i386-
> > builtin-types.def
> > index 7a2da1db0b0..63a360b0f8b 100644
> > --- a/gcc/config/i386/i386-builtin-types.def
> > +++ b/gcc/config/i386/i386-builtin-types.def
> > @@ -69,6 +69,7 @@ DEF_PRIMITIVE_TYPE (UINT16,
> > short_unsigned_type_node)  DEF_PRIMITIVE_TYPE (INT64,
> > long_long_integer_type_node)  DEF_PRIMITIVE_TYPE (UINT64,
> > long_long_unsigned_type_node)  DEF_PRIMITIVE_TYPE (FLOAT16,
> > ix86_float16_type_node)
> > +DEF_PRIMITIVE_TYPE (BFLOAT16, ix86_bf16_type_node)
> >  DEF_PRIMITIVE_TYPE (FLOAT, float_type_node)  DEF_PRIMITIVE_TYPE
> > (DOUBLE, double_type_node)  DEF_PRIMITIVE_TYPE (FLOAT80,
> > float80_type_node) diff --git a/gcc/config/i386/i386-builtins.cc
> > b/gcc/config/i386/i386-builtins.cc
> > index fe7243c3837..6a04fb57e65 100644
> > --- a/gcc/config/i386/i386-builtins.cc
> > +++ b/gcc/config/i386/i386-builtins.cc
> > @@ -126,6 +126,9 @@ BDESC_VERIFYS (IX86_BUILTIN_MAX,  static GTY(()) tree
> > ix86_builtin_type_tab[(int) IX86_BT_LAST_CPTR + 1];
> >
> >

Re: [RFC: PATCH] Extend vectorizer to handle nonlinear induction for neg, mul/lshift/rshift with a constant.

2022-08-04 Thread Hongtao Liu via Gcc-patches

On Thu, Aug 4, 2022 at 4:19 PM Richard Biener via Gcc-patches
 wrote:
>
> On Thu, Aug 4, 2022 at 6:29 AM liuhongt via Gcc-patches
>  wrote:
> >
> > For neg, the patch create a vec_init as [ a, -a, a, -a, ...  ] and no
> > vec_step is needed to update vectorized iv since vf is always multiple
> > of 2(negative * negative is positive).
> >
> > For shift, the patch create a vec_init as [ a, a >> c, a >> 2*c, ..]
> > as vec_step as [ c * nunits, c * nunits, c * nunits, ... ], vectorized iv is
> > updated as vec_def = vec_init >>/<< vec_step.
> >
> > For mul, the patch create a vec_init as [ a, a * c, a * pow(c, 2), ..]
> > as vec_step as [ pow(c,nunits), pow(c,nunits),...] iv is updated as vec_def 
> > =
> > vec_init * vec_step.
> >
> > The patch handles nonlinear iv for
> > 1. Integer type only, floating point is not handled.
> > 2. No slp_node.
> > 3. iv_loop should be same as vector loop, not nested loop.
> > 4. No UD is created, for mul, no UD overlow for pow (c, vf), for
> >shift, shift count should be less than type precision.
> >
> > Bootstrapped and regression tested on x86_64-pc-linux-gnu{-m32,}.
> > There's some cases observed in SPEC2017, but no big performance impact.
> >
> > Any comments?
>
> Looks good overall - a few comments inline.  Also can you please add
> SLP support?
> I've tried hard to fill in gaps where SLP support is missing since my
> goal is still to get
> rid of non-SLP.
>
> > gcc/ChangeLog:
> >
> > PR tree-optimization/103144
> > * tree-vect-loop.cc (vect_is_nonlinear_iv_evolution): New function.
> > (vect_analyze_scalar_cycles_1): Detect nonlinear iv by upper 
> > function.
> > (vect_create_nonlinear_iv_init): New function.
> > (vect_create_nonlinear_iv_step): Ditto
> > (vect_create_nonlinear_iv_vec_step): Ditto
> > (vect_update_nonlinear_iv): Ditto
> > (vectorizable_nonlinear_induction): Ditto.
> > (vectorizable_induction): Call
> > vectorizable_nonlinear_induction when induction_type is not
> > vect_step_op_add.
> > * tree-vectorizer.h (enum vect_induction_op_type): New enum.
> > (STMT_VINFO_LOOP_PHI_EVOLUTION_TYPE): New Macro.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/pr103144-mul-1.c: New test.
> > * gcc.target/i386/pr103144-mul-2.c: New test.
> > * gcc.target/i386/pr103144-neg-1.c: New test.
> > * gcc.target/i386/pr103144-neg-2.c: New test.
> > * gcc.target/i386/pr103144-shift-1.c: New test.
> > * gcc.target/i386/pr103144-shift-2.c: New test.
> > ---
> >  .../gcc.target/i386/pr103144-mul-1.c  |  25 +
> >  .../gcc.target/i386/pr103144-mul-2.c  |  43 ++
> >  .../gcc.target/i386/pr103144-neg-1.c  |  25 +
> >  .../gcc.target/i386/pr103144-neg-2.c  |  36 ++
> >  .../gcc.target/i386/pr103144-shift-1.c|  34 +
> >  .../gcc.target/i386/pr103144-shift-2.c|  61 ++
> >  gcc/tree-vect-loop.cc | 604 +-
> >  gcc/tree-vectorizer.h |  11 +
> >  8 files changed, 834 insertions(+), 5 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103144-mul-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103144-mul-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103144-neg-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103144-neg-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103144-shift-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr103144-shift-2.c
> >
> > diff --git a/gcc/testsuite/gcc.target/i386/pr103144-mul-1.c 
> > b/gcc/testsuite/gcc.target/i386/pr103144-mul-1.c
> > new file mode 100644
> > index 000..2357541d95d
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr103144-mul-1.c
> > @@ -0,0 +1,25 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -mavx2 -ftree-vectorize -fvect-cost-model=unlimited 
> > -fdump-tree-vect-details -mprefer-vector-width=256" } */
> > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" } } */
> > +
> > +#define N 1
> > +void
> > +foo_mul (int* a, int b)
> > +{
> > +  for (int i = 0; i != N; i++)
> > +{
> > +  a[i] = b;
> > +  b *= 3;
> > +}
> > +}
> > +
> > +void
> > +foo_mul_const (int* a)
> > +{
> > +  int b = 1;
> > +  for (int i = 0; i != N; i++)
> > +{
> > +  a[i] = b;
> > +  b *= 3;
> > +}
> > +}
> > diff --git a/gcc/testsuite/gcc.target/i386/pr103144-mul-2.c 
> > b/gcc/testsuite/gcc.target/i386/pr103144-mul-2.c
> > new file mode 100644
> > index 000..4ea53e44658
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr103144-mul-2.c
> > @@ -0,0 +1,43 @@
> > +/* { dg-do run } */
> > +/* { dg-options "-O2 -mavx2 -ftree-vectorize -fvect-cost-model=unlimited 
> > -mprefer-vector-width=256" } */
> > +/* { dg-require-effective-target avx2 } */
> > +
> > +#include "avx2-check.h"
> > +#include 
> > +#include "pr103144-mul-1.c"
>

Re: [PATCH] Replace invariant ternlog operands

2023-08-03 Thread Hongtao Liu via Gcc-patches

On Fri, Aug 4, 2023 at 1:30 AM Alexander Monakov  wrote:
>
>
> On Thu, 27 Jul 2023, Liu, Hongtao via Gcc-patches wrote:
>
> > > +;; If the first and the second operands of ternlog are invariant and ;;
> > > +the third operand is memory ;; then we should add load third operand
> > > +from memory to register and ;; replace first and second operands with
> > > +this register (define_split
> > > +  [(set (match_operand:V 0 "register_operand")
> > > +   (unspec:V
> > > + [(match_operand:V 1 "register_operand")
> > > +  (match_operand:V 2 "register_operand")
> > > +  (match_operand:V 3 "memory_operand")
> > > +  (match_operand:SI 4 "const_0_to_255_operand")]
> > > + UNSPEC_VTERNLOG))]
> > > +  "ternlog_invariant_operand_mask (operands) == 3 && !reload_completed"
> > Maybe better with "!reload_completed  && ternlog_invariant_operand_mask 
> > (operands) == 3"
>
> I made this change (in both places), plus some style TLC. Ok to apply?
Ok.
>
> From d24304a9efd049e8db6df5ac78de8ca2d941a3c7 Mon Sep 17 00:00:00 2001
> From: Yan Simonaytes 
> Date: Tue, 25 Jul 2023 20:43:19 +0300
> Subject: [PATCH] Eliminate irrelevant operands of VPTERNLOG
>
> As mentioned in PR 110202, GCC may be presented with input where control
> word of the VPTERNLOG intrinsic implies that some of its operands do not
> affect the result.  In that case, we can eliminate irrelevant operands
> of the instruction by substituting any other operand in their place.
> This removes false dependencies.
>
> For instance, instead of (252 = 0xfc = _MM_TERNLOG_A | _MM_TERNLOG_B)
>
> vpternlogq  $252, %zmm2, %zmm1, %zmm0
>
> emit
>
> vpternlogq  $252, %zmm0, %zmm1, %zmm0
>
> When VPTERNLOG is invariant w.r.t first and second operands, and the
> third operand is memory, load memory into the output operand first, i.e.
> instead of (85 = 0x55 = ~_MM_TERNLOG_C)
>
> vpternlogq  $85, (%rdi), %zmm1, %zmm0
>
> emit
>
> vmovdqa64   (%rdi), %zmm0
> vpternlogq  $85, %zmm0, %zmm0, %zmm0
>
> gcc/ChangeLog:
>
> * config/i386/i386-protos.h (vpternlog_irrelevant_operand_mask):
> Declare.
> (substitute_vpternlog_operands): Declare.
> * config/i386/i386.cc (vpternlog_irrelevant_operand_mask): New
> helper.
> (substitute_vpternlog_operands): New function.  Use them...
> * config/i386/sse.md: ... here in new VPTERNLOG define_splits.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/invariant-ternlog-1.c: New test.
> * gcc.target/i386/invariant-ternlog-2.c: New test.
> ---
>  gcc/config/i386/i386-protos.h |  3 ++
>  gcc/config/i386/i386.cc   | 43 +++
>  gcc/config/i386/sse.md| 42 ++
>  .../gcc.target/i386/invariant-ternlog-1.c | 21 +
>  .../gcc.target/i386/invariant-ternlog-2.c | 12 ++
>  5 files changed, 121 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/invariant-ternlog-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/invariant-ternlog-2.c
>
> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> index 27fe73ca65..12e6ff0ebc 100644
> --- a/gcc/config/i386/i386-protos.h
> +++ b/gcc/config/i386/i386-protos.h
> @@ -70,6 +70,9 @@ extern machine_mode ix86_cc_mode (enum rtx_code, rtx, rtx);
>  extern int avx_vpermilp_parallel (rtx par, machine_mode mode);
>  extern int avx_vperm2f128_parallel (rtx par, machine_mode mode);
>
> +extern int vpternlog_irrelevant_operand_mask (rtx[]);
> +extern void substitute_vpternlog_operands (rtx[]);
> +
>  extern bool ix86_expand_strlen (rtx, rtx, rtx, rtx);
>  extern bool ix86_expand_set_or_cpymem (rtx, rtx, rtx, rtx, rtx, rtx,
>rtx, rtx, rtx, rtx, bool);
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 32851a514a..9a7c1135a0 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -19420,6 +19420,49 @@ avx_vperm2f128_parallel (rtx par, machine_mode mode)
>return mask + 1;
>  }
>
> +/* Return a mask of VPTERNLOG operands that do not affect output.  */
> +
> +int
> +vpternlog_irrelevant_operand_mask (rtx *operands)
> +{
> +  int mask = 0;
> +  int imm8 = XINT (operands[4], 0);
> +
> +  if (((imm8 >> 4) & 0x0F) == (imm8 & 0x0F))
> +mask |= 1;
> +  if (((imm8 >> 2) & 0x33) == (imm8 & 0x33))
> +mask |= 2;
> +  if (((imm8 >> 1) & 0x55) == (imm8 & 0x55))
> +mask |= 4;
> +
> +  return mask;
> +}
> +
> +/* Eliminate false dependencies on operands that do not affect output
> +   by substituting other operands of a VPTERNLOG.  */
> +
> +void
> +substitute_vpternlog_operands (rtx *operands)
> +{
> +  int mask = vpternlog_irrelevant_operand_mask (operands);
> +
> +  if (mask & 1) /* The first operand is irrelevant.  */
> +operands[1] = operands[2];
> +
> +  if (mask & 2) /* The second operand is irrelevant.  */
> +operands[2] = operands[1];
> +
> +

Re: [PATCH 01/10] x86: "prefix_extra" tidying

2023-08-03 Thread Hongtao Liu via Gcc-patches

On Thu, Aug 3, 2023 at 4:10 PM Jan Beulich via Gcc-patches
 wrote:
>
> Drop SSE5 leftovers from both its comment and its default calculation.
> A value of 2 simply cannot occur anymore. Instead extend the comment to
> mention the use of the attribute in "length_vex", clarifying why
> "prefix_extra" can actually be meaningful on VEX-encoded insns despite
> those not having any real prefixes except possibly segment overrides.
>
Ok.
> gcc/
>
> * config/i386/i386.md (prefix_extra): Correct comment. Fold
> cases yielding 2 into ones yielding 1.
> ---
> I question the 3DNow! aspect here: There's no extra prefix there. It's
> an immediate instead which "sub-divides" major opcode 0f0f.
>
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -620,13 +620,11 @@
> (const_int 0)))
>
>  ;; There are also additional prefixes in 3DNOW, SSSE3.
> -;; ssemuladd,sse4arg default to 0f24/0f25 and DREX byte,
> -;; sseiadd1,ssecvt1 to 0f7a with no DREX byte.
>  ;; 3DNOW has 0f0f prefix, SSSE3 and SSE4_{1,2} 0f38/0f3a.
> +;; While generally inapplicable to VEX/XOP/EVEX encodings, "length_vex" uses
> +;; the attribute evaluating to zero to know that VEX2 encoding may be usable.
>  (define_attr "prefix_extra" ""
> -  (cond [(eq_attr "type" "ssemuladd,sse4arg")
> -  (const_int 2)
> -(eq_attr "type" "sseiadd1,ssecvt1")
> +  (cond [(eq_attr "type" "ssemuladd,sse4arg,sseiadd1,ssecvt1")
>(const_int 1)
> ]
> (const_int 0)))
>


-- 
BR,
Hongtao

Re: [PATCH 02/10] x86: "sse4arg" adjustments

2023-08-03 Thread Hongtao Liu via Gcc-patches

On Thu, Aug 3, 2023 at 4:10 PM Jan Beulich via Gcc-patches
 wrote:
>
> Record common properties in other attributes' default calculations:
> There's always a 1-byte immediate, and they're always encoded in a VEX3-
> like manner (note that "prefix_extra" already evaluates to 1 in this
> case). The drop now (or already previously) redundant explicit
> attributes, adding "mode" ones where they were missing.
>
> Furthermore use "sse4arg" consistently for all VPCOM* insns; so far
> signed comparisons did use it, while unsigned ones used "ssecmp". Note
> that while they have (not counting the explicit or implicit immediate
> operand) they really only have 3 operands, the operator is also counted
> in those patterns. That's relevant for establishing the "memory"
> attribute's value, and at the same time benign when there are only
> register operands.
>
> Note that despite also having 4 operands, multiply-add insns aren't
> affected by this change, as they use "ssemuladd" for "type".
Ok. (I'm not quite familiar for those xop instructions encoding, you
must have better understanding than me, so just rubber-stamp the
patch.
>
> gcc/
>
> * config/i386/i386.md (length_immediate): Handle "sse4arg".
> (prefix): Likewise.
> (*xop_pcmov_): Add "mode" attribute.
> * config/i386/mmx.md (*xop_maskcmp3): Drop "prefix_data16",
> "prefix_rep", "prefix_extra", and "length_immediate" attributes.
> (*xop_maskcmp_uns3): Likewise. Switch "type" to "sse4arg".
> (*xop_pcmov_): Add "mode" attribute.
> * config/i386/sse.md (xop_pcmov_): Add "mode"
> attribute.
> (xop_maskcmp3): Drop "prefix_data16", "prefix_rep",
> "prefix_extra", and "length_immediate" attributes.
> (xop_maskcmp_uns3): Likewise. Switch "type" to "sse4arg".
> (xop_maskcmp_uns23): Drop "prefix_data16", "prefix_extra",
> and "length_immediate" attributes. Switch "type" to "sse4arg".
> (xop_pcom_tf3): Likewise.
> (xop_vpermil23): Drop "length_immediate" attribute.
>
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -536,6 +536,8 @@
>(cond [(eq_attr "type" "incdec,setcc,icmov,str,lea,other,multi,idiv,leave,
>   bitmanip,imulx,msklog,mskmov")
>(const_int 0)
> +(eq_attr "type" "sse4arg")
> +  (const_int 1)
>  (eq_attr "unit" "i387,sse,mmx")
>(const_int 0)
>  (eq_attr "type" "alu,alu1,negnot,imovx,ishift,ishiftx,ishift1,
> @@ -635,6 +637,8 @@
> (const_string "vex")
>   (eq_attr "mode" "XI,V16SF,V8DF")
> (const_string "evex")
> +(eq_attr "type" "sse4arg")
> +  (const_string "vex")
>  ]
>  (const_string "orig")))
>
> @@ -23286,7 +23290,8 @@
>   (match_operand:MODEF 3 "register_operand" "x")))]
>"TARGET_XOP"
>"vpcmov\t{%1, %3, %2, %0|%0, %2, %3, %1}"
> -  [(set_attr "type" "sse4arg")])
> +  [(set_attr "type" "sse4arg")
> +   (set_attr "mode" "TI")])
>
>  ;; These versions of the min/max patterns are intentionally ignorant of
>  ;; their behavior wrt -0.0 and NaN (via the commutative operand mark).
> --- a/gcc/config/i386/mmx.md
> +++ b/gcc/config/i386/mmx.md
> @@ -2909,10 +2909,6 @@
>"TARGET_XOP"
>"vpcom%Y1\t{%3, %2, %0|%0, %2, %3}"
>[(set_attr "type" "sse4arg")
> -   (set_attr "prefix_data16" "0")
> -   (set_attr "prefix_rep" "0")
> -   (set_attr "prefix_extra" "2")
> -   (set_attr "length_immediate" "1")
> (set_attr "mode" "TI")])
>
>  (define_insn "*xop_maskcmp3"
> @@ -2923,10 +2919,6 @@
>"TARGET_XOP"
>"vpcom%Y1\t{%3, %2, %0|%0, %2, %3}"
>[(set_attr "type" "sse4arg")
> -   (set_attr "prefix_data16" "0")
> -   (set_attr "prefix_rep" "0")
> -   (set_attr "prefix_extra" "2")
> -   (set_attr "length_immediate" "1")
> (set_attr "mode" "TI")])
>
>  (define_insn "*xop_maskcmp_uns3"
> @@ -2936,11 +2928,7 @@
>   (match_operand:MMXMODEI 3 "register_operand" "x")]))]
>"TARGET_XOP"
>"vpcom%Y1u\t{%3, %2, %0|%0, %2, %3}"
> -  [(set_attr "type" "ssecmp")
> -   (set_attr "prefix_data16" "0")
> -   (set_attr "prefix_rep" "0")
> -   (set_attr "prefix_extra" "2")
> -   (set_attr "length_immediate" "1")
> +  [(set_attr "type" "sse4arg")
> (set_attr "mode" "TI")])
>
>  (define_insn "*xop_maskcmp_uns3"
> @@ -2950,11 +2938,7 @@
>   (match_operand:VI_16_32 3 "register_operand" "x")]))]
>"TARGET_XOP"
>"vpcom%Y1u\t{%3, %2, %0|%0, %2, %3}"
> -  [(set_attr "type" "ssecmp")
> -   (set_attr "prefix_data16" "0")
> -   (set_attr "prefix_rep" "0")
> -   (set_attr "prefix_extra" "2")
> -   (set_attr "length_immediate" "1")
> +  [(set_attr "type" "sse4arg")
> (set_attr "mode" "TI")])
>
>  (define_expand "vec_cmp"
> @@ -3144,7 +3128,8 @@
>(match_operand:MMXMODE124 2 "register_operand" "x")))]
>"TARGET_XOP && TARGET_MMX_WITH_SSE"
>"vpcmov\t{%3, %2, %1, %0|%0, %1, %2, %3}"
> -  [(set_attr "type" "sse4arg")])
> +  [(set

Re: [PATCH 03/10] x86: "ssemuladd" adjustments

2023-08-03 Thread Hongtao Liu via Gcc-patches

On Thu, Aug 3, 2023 at 4:11 PM Jan Beulich via Gcc-patches
 wrote:
>
> They're all VEX3- (also covering XOP) or EVEX-encoded. Express that in
> the default calculation of "prefix". FMA4 insns also all have a 1-byte
> immediate operand.
>
> Where the default calculation is not sufficient / applicable, add
> explicit "prefix" attributes. While there also add a "mode" attribute to
> fma___pair.
Ok.
>
> gcc/
>
> * config/i386/i386.md (isa): Move up.
> (length_immediate): Handle "fma4".
> (prefix): Handle "ssemuladd".
> * config/i386/sse.md (*fma_fmadd_): Add "prefix" attribute.
> (fma_fmadd_):
> Likewise.
> (_fmadd__mask): Likewise.
> (_fmadd__mask3): Likewise.
> (fma_fmsub_):
> Likewise.
> (_fmsub__mask): Likewise.
> (_fmsub__mask3): Likewise.
> (*fma_fnmadd_): Likewise.
> (fma_fnmadd_):
> Likewise.
> (_fnmadd__mask): Likewise.
> (_fnmadd__mask3): Likewise.
> (fma_fnmsub_):
> Likewise.
> (_fnmsub__mask): Likewise.
> (_fnmsub__mask3): Likewise.
> (fma_fmaddsub_):
> Likewise.
> (_fmaddsub__mask): Likewise.
> (_fmaddsub__mask3): Likewise.
> (fma_fmsubadd_):
> Likewise.
> (_fmsubadd__mask): Likewise.
> (_fmsubadd__mask3): Likewise.
> (*fmai_fmadd_): Likewise.
> (*fmai_fmsub_): Likewise.
> (*fmai_fnmadd_): Likewise.
> (*fmai_fnmsub_): Likewise.
> (avx512f_vmfmadd__mask): Likewise.
> (avx512f_vmfmadd__mask3): Likewise.
> (avx512f_vmfmadd__maskz_1): Likewise.
> (*avx512f_vmfmsub__mask): Likewise.
> (avx512f_vmfmsub__mask3): Likewise.
> (*avx512f_vmfmsub__maskz_1): Likewise.
> (avx512f_vmfnmadd__mask): Likewise.
> (avx512f_vmfnmadd__mask3): Likewise.
> (avx512f_vmfnmadd__maskz_1): Likewise.
> (*avx512f_vmfnmsub__mask): Likewise.
> (*avx512f_vmfnmsub__mask3): Likewise.
> (*avx512f_vmfnmsub__maskz_1): Likewise.
> (*fma4i_vmfmadd_): Likewise.
> (*fma4i_vmfmsub_): Likewise.
> (*fma4i_vmfnmadd_): Likewise.
> (*fma4i_vmfnmsub_): Likewise.
> (fma__): Likewise.
> (___mask): Likewise.
> 
> (avx512fp16_fma_sh_v8hf):
> Likewise.
> (avx512fp16_sh_v8hf_mask): Likewise.
> (xop_p): Likewise.
> (xop_pdql): Likewise.
> (xop_pdqh): Likewise.
> (xop_pwd): Likewise.
> (xop_pwd): Likewise.
> (fma___pair): Likewise. Add "mode" attribute.
>
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -531,12 +531,23 @@
>(const_string "unknown")]
>  (const_string "integer")))
>
> +;; Used to control the "enabled" attribute on a per-instruction basis.
> +(define_attr "isa" "base,x64,nox64,x64_sse2,x64_sse4,x64_sse4_noavx,
> +   x64_avx,x64_avx512bw,x64_avx512dq,aes,
> +   sse_noavx,sse2,sse2_noavx,sse3,sse3_noavx,sse4,sse4_noavx,
> +   avx,noavx,avx2,noavx2,bmi,bmi2,fma4,fma,avx512f,noavx512f,
> +   avx512bw,noavx512bw,avx512dq,noavx512dq,fma_or_avx512vl,
> +   
> avx512vl,noavx512vl,avxvnni,avx512vnnivl,avx512fp16,avxifma,
> +   avx512ifmavl,avxneconvert,avx512bf16vl,vpclmulqdqvl"
> +  (const_string "base"))
> +
>  ;; The (bounding maximum) length of an instruction immediate.
>  (define_attr "length_immediate" ""
>(cond [(eq_attr "type" "incdec,setcc,icmov,str,lea,other,multi,idiv,leave,
>   bitmanip,imulx,msklog,mskmov")
>(const_int 0)
> -(eq_attr "type" "sse4arg")
> +(ior (eq_attr "type" "sse4arg")
> + (eq_attr "isa" "fma4"))
>(const_int 1)
>  (eq_attr "unit" "i387,sse,mmx")
>(const_int 0)
> @@ -637,6 +648,10 @@
> (const_string "vex")
>   (eq_attr "mode" "XI,V16SF,V8DF")
> (const_string "evex")
> +(eq_attr "type" "ssemuladd")
> +  (if_then_else (eq_attr "isa" "fma4")
> +(const_string "vex")
> +(const_string "maybe_evex"))
>  (eq_attr "type" "sse4arg")
>(const_string "vex")
>  ]
> @@ -842,16 +857,6 @@
>  ;; Define attribute to indicate unaligned ssemov insns
>  (define_attr "movu" "0,1" (const_string "0"))
>
> -;; Used to control the "enabled" attribute on a per-instruction basis.
> -(define_attr "isa" "base,x64,nox64,x64_sse2,x64_sse4,x64_sse4_noavx,
> -   x64_avx,x64_avx512bw,x64_avx512dq,aes,
> -   sse_noavx,sse2,sse2_noavx,sse3,sse3_noavx,sse4,sse4_noavx,
> -   avx,noavx,avx2,noavx2,bmi,bmi2,fma4,fma,avx512f,noavx512f,
> -   avx512bw,noavx512bw,avx512dq,noavx512dq,fma_or_avx512vl,
> -   
> avx512vl,noavx512vl,avxvnni,avx512vnnivl,avx512fp16,avxifma,
> -

Re: [PATCH 05/10] x86: replace/correct bogus "prefix_extra"

2023-08-03 Thread Hongtao Liu via Gcc-patches

On Thu, Aug 3, 2023 at 4:14 PM Jan Beulich via Gcc-patches
 wrote:
>
> In the rdrand and rdseed cases "prefix_0f" is meant instead. For
> mmx_floatv2siv2sf2 1 is correct only for the first alternative. For
> the integer min/max cases 1 uniformly applies to legacy and VEX
> encodings (the UB and SW variants are dealt with separately anyway).
> Same for {,V}MOVNTDQA.
>
> Unlike {,V}PEXTRW, which has two encoding forms, {,V}PINSRW only has
> a single form in 0f space. (In *vec_extract note that the
> dropped part if the condition also referenced non-existing alternative
> 2.)
>
> Of the integer compare insns, only the 64-bit element forms are encoded
> in 0f38 space.
Ok.
>
> gcc/
>
> * config/i386/i386.md (@rdrand): Add "prefix_0f". Drop
> "prefix_extra".
> (@rdseed): Likewise.
> * config/i386/mmx.md (3 [smaxmin and umaxmin cases]):
> Adjust "prefix_extra".
> * config/i386/sse.md (@vec_set_0): Likewise.
> (*sse4_1_3): Likewise.
> (*avx2_eq3): Likewise.
> (avx2_gt3): Likewise.
> (_pinsr): Likewise.
> (*vec_extract): Likewise.
> (_movntdqa): Likewise.
>
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -25943,7 +25943,7 @@
>"TARGET_RDRND"
>"rdrand\t%0"
>[(set_attr "type" "other")
> -   (set_attr "prefix_extra" "1")])
> +   (set_attr "prefix_0f" "1")])
>
>  (define_insn "@rdseed"
>[(set (match_operand:SWI248 0 "register_operand" "=r")
> @@ -25953,7 +25953,7 @@
>"TARGET_RDSEED"
>"rdseed\t%0"
>[(set_attr "type" "other")
> -   (set_attr "prefix_extra" "1")])
> +   (set_attr "prefix_0f" "1")])
>
>  (define_expand "pause"
>[(set (match_dup 0)
> --- a/gcc/config/i386/mmx.md
> +++ b/gcc/config/i386/mmx.md
> @@ -2483,7 +2483,7 @@
> vp\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "isa" "noavx,noavx,avx")
> (set_attr "type" "sseiadd")
> -   (set_attr "prefix_extra" "1,1,*")
> +   (set_attr "prefix_extra" "1")
> (set_attr "prefix" "orig,orig,vex")
> (set_attr "mode" "TI")])
>
> @@ -2532,7 +2532,7 @@
> vpb\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "isa" "noavx,noavx,avx")
> (set_attr "type" "sseiadd")
> -   (set_attr "prefix_extra" "1,1,*")
> +   (set_attr "prefix_extra" "1")
> (set_attr "prefix" "orig,orig,vex")
> (set_attr "mode" "TI")])
>
> @@ -2561,7 +2561,7 @@
> vp\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "isa" "noavx,noavx,avx")
> (set_attr "type" "sseiadd")
> -   (set_attr "prefix_extra" "1,1,*")
> +   (set_attr "prefix_extra" "1")
> (set_attr "prefix" "orig,orig,vex")
> (set_attr "mode" "TI")])
>
> @@ -2623,7 +2623,7 @@
> vpw\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "isa" "noavx,noavx,avx")
> (set_attr "type" "sseiadd")
> -   (set_attr "prefix_extra" "1,1,*")
> +   (set_attr "prefix_extra" "1")
> (set_attr "prefix" "orig,orig,vex")
> (set_attr "mode" "TI")])
>
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -11064,7 +11064,7 @@
>(const_string "1")
>(const_string "*")))
> (set (attr "prefix_extra")
> - (if_then_else (eq_attr "alternative" "5,6,7,8,9")
> + (if_then_else (eq_attr "alternative" "5,6,9")
>(const_string "1")
>(const_string "*")))
> (set (attr "length_immediate")
> @@ -16779,7 +16779,7 @@
> vp\t{%2, %1, 
> %0|%0, %1, %2}"
>[(set_attr "isa" "noavx,noavx,avx")
> (set_attr "type" "sseiadd")
> -   (set_attr "prefix_extra" "1,1,*")
> +   (set_attr "prefix_extra" "1")
> (set_attr "prefix" "orig,orig,vex")
> (set_attr "mode" "TI")])
>
> @@ -16813,7 +16813,10 @@
>"TARGET_AVX2 && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
>"vpcmpeq\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "type" "ssecmp")
> -   (set_attr "prefix_extra" "1")
> +   (set (attr "prefix_extra")
> + (if_then_else (eq (const_string "mode") (const_string "V4DImode"))
> +  (const_string "1")
> +  (const_string "*")))
> (set_attr "prefix" "vex")
> (set_attr "mode" "OI")])
>
> @@ -17048,7 +17051,10 @@
>"TARGET_AVX2"
>"vpcmpgt\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "type" "ssecmp")
> -   (set_attr "prefix_extra" "1")
> +   (set (attr "prefix_extra")
> + (if_then_else (eq (const_string "mode") (const_string "V4DImode"))
> +  (const_string "1")
> +  (const_string "*")))
> (set_attr "prefix" "vex")
> (set_attr "mode" "OI")])
>
> @@ -18843,7 +18849,7 @@
> (const_string "*")))
> (set (attr "prefix_extra")
>   (if_then_else
> -   (and (not (match_test "TARGET_AVX"))
> +   (ior (eq_attr "prefix" "evex")
> (match_test "GET_MODE_NUNITS (mode) == 8"))
> (const_string "*")
> (const_string "1")))
> @@ -20004,8 +20010,7 @@
> (set_attr "prefix_data16" "1")
> (set (attr "prefix_extra")
>   (if_then_else
> -   (and (eq_attr "alternative" "0,2")
> -   (e

Re: [PATCH 06/10] x86: drop stray "prefix_extra"

2023-08-03 Thread Hongtao Liu via Gcc-patches

On Thu, Aug 3, 2023 at 4:16 PM Jan Beulich via Gcc-patches
 wrote:
>
> While the attribute is relevant for legacy- and VEX-encoded insns, it is
> of no relevance for EVEX-encoded ones.
>
> While there in avx512dq_broadcast_1 add
> the missing "length_immediate".
Ok.
>
> gcc/
>
> * config/i386/sse.md
> (*_eq3_1): Drop
> "prefix_extra".
> (avx512dq_vextract64x2_1_mask): Likewise.
> (*avx512dq_vextract64x2_1): Likewise.
> (avx512f_vextract32x4_1_mask): Likewise.
> (*avx512f_vextract32x4_1): Likewise.
> (vec_extract_lo__mask [AVX512 forms]): Likewise.
> (vec_extract_lo_ [AVX512 forms]): Likewise.
> (vec_extract_hi__mask [AVX512 forms]): Likewise.
> (vec_extract_hi_ [AVX512 forms]): Likewise.
> (@vec_extract_lo_ [AVX512 forms]): Likewise.
> (@vec_extract_hi_ [AVX512 forms]): Likewise.
> (vec_extract_lo_v64qi): Likewise.
> (vec_extract_hi_v64qi): Likewise.
> (*vec_widen_umult_even_v16si): Likewise.
> (*vec_widen_smult_even_v16si): Likewise.
> (*avx512f_3): Likewise.
> (*vec_extractv4ti): Likewise.
> (avx512bw_v32qiv32hi2): Likewise.
> (avx512dq_broadcast_1): Likewise.
> Add "length_immediate".
>
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -4030,7 +4030,6 @@
> vpcmpeq\t{%2, %1, 
> %0|%0, %1, %2}
> vptestnm\t{%1, %1, 
> %0|%0, %1, %1}"
>[(set_attr "type" "ssecmp")
> -   (set_attr "prefix_extra" "1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
>
> @@ -4128,7 +4127,6 @@
> vpcmpeq\t{%2, %1, 
> %0|%0, %1, %2}
> vptestnm\t{%1, %1, 
> %0|%0, %1, %1}"
>[(set_attr "type" "ssecmp")
> -   (set_attr "prefix_extra" "1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
>
> @@ -11487,7 +11485,6 @@
>return "vextract64x2\t{%2, %1, %0%{%5%}%N4|%0%{%5%}%N4, %1, 
> %2}";
>  }
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
> @@ -11506,7 +11503,6 @@
>return "vextract64x2\t{%2, %1, %0|%0, %1, %2}";
>  }
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
> @@ -11554,7 +11550,6 @@
>return "vextract32x4\t{%2, %1, %0%{%7%}%N6|%0%{%7%}%N6, %1, 
> %2}";
>  }
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
> @@ -11577,7 +11572,6 @@
>return "vextract32x4\t{%2, %1, %0|%0, %1, %2}";
>  }
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
> @@ -11671,7 +11665,6 @@
> && (!MEM_P (operands[0]) || rtx_equal_p (operands[0], operands[2]))"
>"vextract64x4\t{$0x0, %1, %0%{%3%}%N2|%0%{%3%}%N2, %1, 0x0}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "memory" "none,store")
> (set_attr "prefix" "evex")
> @@ -11691,7 +11684,6 @@
>  return "#";
>  }
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "memory" "none,store,load")
> (set_attr "prefix" "evex")
> @@ -11710,7 +11702,6 @@
> && (!MEM_P (operands[0]) || rtx_equal_p (operands[0], operands[2]))"
>"vextract64x4\t{$0x1, %1, %0%{%3%}%N2|%0%{%3%}%N2, %1, 0x1}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
> @@ -11724,7 +11715,6 @@
>"TARGET_AVX512F"
>"vextract64x4\t{$0x1, %1, %0|%0, %1, 0x1}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
> @@ -11744,7 +11734,6 @@
> && (!MEM_P (operands[0]) || rtx_equal_p (operands[0], operands[2]))"
>"vextract32x8\t{$0x1, %1, %0%{%3%}%N2|%0%{%3%}%N2, %1, 0x1}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "evex")
> (set_attr "mode" "")])
> @@ -11762,7 +11751,6 @@
> vextract32x8\t{$0x1, %1, %0|%0, %1, 0x1}
> vextracti64x4\t{$0x1, %1, %0|%0, %1, 0x1}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "isa" "avx512dq,noavx512dq")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "evex")
> @@ -11850,7 +11838,6 @@
> && (!MEM_P (operands[0]) || rtx_equal_p (operands[0], operands[2]))"
>"vextract32x8\t{$0x0, %1, %0%{%3%}%N2|%0%{%3%}%N2, %1, 0x0}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_extra" "1")
> (set_attr "length

Re: [PATCH 04/10] x86: "prefix_extra" can't really be "2"

2023-08-03 Thread Hongtao Liu via Gcc-patches

On Thu, Aug 3, 2023 at 4:11 PM Jan Beulich via Gcc-patches
 wrote:
>
> In the three remaining instances separate "prefix_0f" and "prefix_rep"
> are what is wanted instead.
Ok.
>
> gcc/
>
> * config/i386/i386.md (rdbase): Add "prefix_0f" and
> "prefix_rep". Drop "prefix_extra".
> (wrbase): Likewise.
> (ptwrite): Likewise.
>
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -25914,7 +25914,8 @@
>"TARGET_64BIT && TARGET_FSGSBASE"
>"rdbase\t%0"
>[(set_attr "type" "other")
> -   (set_attr "prefix_extra" "2")])
> +   (set_attr "prefix_0f" "1")
> +   (set_attr "prefix_rep" "1")])
>
>  (define_insn "wrbase"
>[(unspec_volatile [(match_operand:SWI48 0 "register_operand" "r")]
> @@ -25922,7 +25923,8 @@
>"TARGET_64BIT && TARGET_FSGSBASE"
>"wrbase\t%0"
>[(set_attr "type" "other")
> -   (set_attr "prefix_extra" "2")])
> +   (set_attr "prefix_0f" "1")
> +   (set_attr "prefix_rep" "1")])
>
>  (define_insn "ptwrite"
>[(unspec_volatile [(match_operand:SWI48 0 "nonimmediate_operand" "rm")]
> @@ -25930,7 +25932,8 @@
>"TARGET_PTWRITE"
>"ptwrite\t%0"
>[(set_attr "type" "other")
> -   (set_attr "prefix_extra" "2")])
> +   (set_attr "prefix_0f" "1")
> +   (set_attr "prefix_rep" "1")])
>
>  (define_insn "@rdrand"
>[(set (match_operand:SWI248 0 "register_operand" "=r")
>


-- 
BR,
Hongtao

Re: [PATCH 09/10] x86: correct "length_immediate" in a few cases

2023-08-03 Thread Hongtao Liu via Gcc-patches

On Thu, Aug 3, 2023 at 4:14 PM Jan Beulich via Gcc-patches
 wrote:
>
> When first added explicitly in 3ddffba914b2 ("i386.md
> (sse4_1_round2): Add avx512f alternative"), "*" should not have
> been used for the pre-existing alternative. The attribute was plain
> missing. Subsequent changes adding more alternatives then generously
> extended the bogus pattern.
>
> Apparently something similar happened to the two mmx_pblendvb_* insns.
Ok.
>
> gcc/
>
> * config/i386/i386.md (sse4_1_round2): Make
> "length_immediate" uniformly 1.
> * config/i386/mmx.md (mmx_pblendvb_v8qi): Likewise.
> (mmx_pblendvb_): Likewise.
>
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -21594,7 +21594,7 @@
> vrndscale\t{%2, %1, %d0|%d0, %1, %2}"
>[(set_attr "type" "ssecvt")
> (set_attr "prefix_extra" "1,1,1,*,*")
> -   (set_attr "length_immediate" "*,*,*,1,1")
> +   (set_attr "length_immediate" "1")
> (set_attr "prefix" "maybe_vex,maybe_vex,maybe_vex,evex,evex")
> (set_attr "isa" "noavx512f,noavx512f,noavx512f,avx512f,avx512f")
> (set_attr "avx_partial_xmm_update" "false,false,true,false,true")
> --- a/gcc/config/i386/mmx.md
> +++ b/gcc/config/i386/mmx.md
> @@ -3094,7 +3094,7 @@
>[(set_attr "isa" "noavx,noavx,avx")
> (set_attr "type" "ssemov")
> (set_attr "prefix_extra" "1")
> -   (set_attr "length_immediate" "*,*,1")
> +   (set_attr "length_immediate" "1")
> (set_attr "prefix" "orig,orig,vex")
> (set_attr "btver2_decode" "vector")
> (set_attr "mode" "TI")])
> @@ -3114,7 +3114,7 @@
>[(set_attr "isa" "noavx,noavx,avx")
> (set_attr "type" "ssemov")
> (set_attr "prefix_extra" "1")
> -   (set_attr "length_immediate" "*,*,1")
> +   (set_attr "length_immediate" "1")
> (set_attr "prefix" "orig,orig,vex")
> (set_attr "btver2_decode" "vector")
> (set_attr "mode" "TI")])
>


-- 
BR,
Hongtao

Re: [PATCH 07/10] x86: add (adjust) XOP insn attributes

2023-08-03 Thread Hongtao Liu via Gcc-patches

On Thu, Aug 3, 2023 at 4:14 PM Jan Beulich via Gcc-patches
 wrote:
>
> Many were lacking "prefix" and "prefix_extra", some had a bogus value of
> 2 for "prefix_extra" (presumably inherited from their SSE5 counterparts,
> which are long gone) and a meaningless "prefix_data16" one. Where
> missing, "mode" attributes are also added. (Note that "sse4arg" and
> "ssemuladd" ones don't need further adjustment in this regard.)
Ok.
>
> gcc/
>
> * config/i386/sse.md (xop_phaddbw): Add "prefix",
> "prefix_extra", and "mode" attributes.
> (xop_phaddbd): Likewise.
> (xop_phaddbq): Likewise.
> (xop_phaddwd): Likewise.
> (xop_phaddwq): Likewise.
> (xop_phadddq): Likewise.
> (xop_phsubbw): Likewise.
> (xop_phsubwd): Likewise.
> (xop_phsubdq): Likewise.
> (xop_rotl3): Add "prefix" and "prefix_extra" attributes.
> (xop_rotr3): Likewise.
> (xop_frcz2): Likewise.
> (*xop_vmfrcz2): Likewise.
> (xop_vrotl3): Add "prefix" attribute. Change
> "prefix_extra" to 1.
> (xop_sha3): Likewise.
> (xop_shl3): Likewise.
>
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -24897,7 +24897,10 @@
>   (const_int 13) (const_int 15)])]
>"TARGET_XOP"
>"vphaddbw\t{%1, %0|%0, %1}"
> -  [(set_attr "type" "sseiadd1")])
> +  [(set_attr "type" "sseiadd1")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "mode" "TI")])
>
>  (define_insn "xop_phaddbd"
>[(set (match_operand:V4SI 0 "register_operand" "=x")
> @@ -24926,7 +24929,10 @@
>(const_int 11) (const_int 15)]))]
>"TARGET_XOP"
>"vphaddbd\t{%1, %0|%0, %1}"
> -  [(set_attr "type" "sseiadd1")])
> +  [(set_attr "type" "sseiadd1")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "mode" "TI")])
>
>  (define_insn "xop_phaddbq"
>[(set (match_operand:V2DI 0 "register_operand" "=x")
> @@ -24971,7 +24977,10 @@
>  (parallel [(const_int 7) (const_int 15)])))]
>"TARGET_XOP"
>"vphaddbq\t{%1, %0|%0, %1}"
> -  [(set_attr "type" "sseiadd1")])
> +  [(set_attr "type" "sseiadd1")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "mode" "TI")])
>
>  (define_insn "xop_phaddwd"
>[(set (match_operand:V4SI 0 "register_operand" "=x")
> @@ -24988,7 +24997,10 @@
>   (const_int 5) (const_int 7)])]
>"TARGET_XOP"
>"vphaddwd\t{%1, %0|%0, %1}"
> -  [(set_attr "type" "sseiadd1")])
> +  [(set_attr "type" "sseiadd1")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "mode" "TI")])
>
>  (define_insn "xop_phaddwq"
>[(set (match_operand:V2DI 0 "register_operand" "=x")
> @@ -25013,7 +25025,10 @@
> (parallel [(const_int 3) (const_int 7)]))]
>"TARGET_XOP"
>"vphaddwq\t{%1, %0|%0, %1}"
> -  [(set_attr "type" "sseiadd1")])
> +  [(set_attr "type" "sseiadd1")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "mode" "TI")])
>
>  (define_insn "xop_phadddq"
>[(set (match_operand:V2DI 0 "register_operand" "=x")
> @@ -25028,7 +25043,10 @@
>(parallel [(const_int 1) (const_int 3)])]
>"TARGET_XOP"
>"vphadddq\t{%1, %0|%0, %1}"
> -  [(set_attr "type" "sseiadd1")])
> +  [(set_attr "type" "sseiadd1")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "mode" "TI")])
>
>  (define_insn "xop_phsubbw"
>[(set (match_operand:V8HI 0 "register_operand" "=x")
> @@ -25049,7 +25067,10 @@
>   (const_int 13) (const_int 15)])]
>"TARGET_XOP"
>"vphsubbw\t{%1, %0|%0, %1}"
> -  [(set_attr "type" "sseiadd1")])
> +  [(set_attr "type" "sseiadd1")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "mode" "TI")])
>
>  (define_insn "xop_phsubwd"
>[(set (match_operand:V4SI 0 "register_operand" "=x")
> @@ -25066,7 +25087,10 @@
>   (const_int 5) (const_int 7)])]
>"TARGET_XOP"
>"vphsubwd\t{%1, %0|%0, %1}"
> -  [(set_attr "type" "sseiadd1")])
> +  [(set_attr "type" "sseiadd1")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "mode" "TI")])
>
>  (define_insn "xop_phsubdq"
>[(set (match_operand:V2DI 0 "register_operand" "=x")
> @@ -25081,7 +25105,10 @@
>(parallel [(const_int 1) (const_int 3)])]
>"TARGET_XOP"
>"vphsubdq\t{%1, %0|%0, %1}"
> -  [(set_attr "type" "sseiadd1")])
> +  [(set_attr "type" "sseiadd1")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> +   (set_attr "mode" "TI")])
>
>  ;; XOP permute instructions
>  (define_insn "xop_pperm"
> @@ -25209,6 +25236,8 @@
>"TARGET_XOP"
>"vprot\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "type" "sseishft")
> +   (set_attr "prefix" "vex")
> +   (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "

Re: [PATCH 10/10] x86: drop redundant "prefix_data16" attributes

2023-08-03 Thread Hongtao Liu via Gcc-patches

On Thu, Aug 3, 2023 at 4:17 PM Jan Beulich via Gcc-patches
 wrote:
>
> The attribute defaults to 1 for TI-mode insns of type sselog, sselog1,
> sseiadd, sseimul, and sseishft.
>
> In *v8hi3 [smaxmin] and *v16qi3 [umaxmin] also drop the
> similarly stray "prefix_extra" at this occasion. These two max/min
> flavors are encoded in 0f space.
Ok.
>
> gcc/
>
> * config/i386/mmx.md (*mmx_pinsrd): Drop "prefix_data16".
> (*mmx_pinsrb): Likewise.
> (*mmx_pextrb): Likewise.
> (*mmx_pextrb_zext): Likewise.
> (mmx_pshufbv8qi3): Likewise.
> (mmx_pshufbv4qi3): Likewise.
> (mmx_pswapdv2si2): Likewise.
> (*pinsrb): Likewise.
> (*pextrb): Likewise.
> (*pextrb_zext): Likewise.
> * config/i386/sse.md (*sse4_1_mulv2siv2di3): Likewise.
> (*sse2_eq3): Likewise.
> (*sse2_gt3): Likewise.
> (_pinsr): Likewise.
> (*vec_extract): Likewise.
> (*vec_extract_zext): Likewise.
> (*vec_extractv16qi_zext): Likewise.
> (ssse3_phwv8hi3): Likewise.
> (ssse3_pmaddubsw128): Likewise.
> (*_pmulhrsw3): Likewise.
> (_pshufb3): Likewise.
> (_psign3): Likewise.
> (_palignr): Likewise.
> (*abs2): Likewise.
> (sse4_2_pcmpestr): Likewise.
> (sse4_2_pcmpestri): Likewise.
> (sse4_2_pcmpestrm): Likewise.
> (sse4_2_pcmpestr_cconly): Likewise.
> (sse4_2_pcmpistr): Likewise.
> (sse4_2_pcmpistri): Likewise.
> (sse4_2_pcmpistrm): Likewise.
> (sse4_2_pcmpistr_cconly): Likewise.
> (vgf2p8affineinvqb_): Likewise.
> (vgf2p8affineqb_): Likewise.
> (vgf2p8mulb_): Likewise.
> (*v8hi3 [smaxmin]): Drop "prefix_data16" and
> "prefix_extra".
> (*v16qi3 [umaxmin]): Likewise.
>
> --- a/gcc/config/i386/mmx.md
> +++ b/gcc/config/i386/mmx.md
> @@ -3863,7 +3863,6 @@
>  }
>  }
>[(set_attr "isa" "noavx,avx")
> -   (set_attr "prefix_data16" "1")
> (set_attr "prefix_extra" "1")
> (set_attr "type" "sselog")
> (set_attr "length_immediate" "1")
> @@ -3950,7 +3949,6 @@
>  }
>[(set_attr "isa" "noavx,avx")
> (set_attr "type" "sselog")
> -   (set_attr "prefix_data16" "1")
> (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "orig,vex")
> @@ -4002,7 +4000,6 @@
> %vpextrb\t{%2, %1, %k0|%k0, %1, %2}
> %vpextrb\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_data16" "1")
> (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "maybe_vex")
> @@ -4017,7 +4014,6 @@
>"TARGET_SSE4_1 && TARGET_MMX_WITH_SSE"
>"%vpextrb\t{%2, %1, %k0|%k0, %1, %2}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_data16" "1")
> (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "maybe_vex")
> @@ -4035,7 +4031,6 @@
> vpshufb\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "isa" "noavx,avx")
> (set_attr "type" "sselog1")
> -   (set_attr "prefix_data16" "1,*")
> (set_attr "prefix_extra" "1")
> (set_attr "prefix" "orig,maybe_evex")
> (set_attr "btver2_decode" "vector")
> @@ -4053,7 +4048,6 @@
> vpshufb\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "isa" "noavx,avx")
> (set_attr "type" "sselog1")
> -   (set_attr "prefix_data16" "1,*")
> (set_attr "prefix_extra" "1")
> (set_attr "prefix" "orig,maybe_evex")
> (set_attr "btver2_decode" "vector")
> @@ -4191,7 +4185,6 @@
> (set_attr "mmx_isa" "native,*")
> (set_attr "type" "mmxcvt,sselog1")
> (set_attr "prefix_extra" "1,*")
> -   (set_attr "prefix_data16" "*,1")
> (set_attr "length_immediate" "*,1")
> (set_attr "mode" "DI,TI")])
>
> @@ -4531,7 +4524,6 @@
>  }
>[(set_attr "isa" "noavx,avx")
> (set_attr "type" "sselog")
> -   (set_attr "prefix_data16" "1")
> (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "orig,vex")
> @@ -4575,7 +4567,6 @@
> %vpextrb\t{%2, %1, %k0|%k0, %1, %2}
> %vpextrb\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_data16" "1")
> (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "maybe_vex")
> @@ -4590,7 +4581,6 @@
>"TARGET_SSE4_1"
>"%vpextrb\t{%2, %1, %k0|%k0, %1, %2}"
>[(set_attr "type" "sselog1")
> -   (set_attr "prefix_data16" "1")
> (set_attr "prefix_extra" "1")
> (set_attr "length_immediate" "1")
> (set_attr "prefix" "maybe_vex")
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -15614,7 +15614,6 @@
> vpmuldq\t{%2, %1, %0|%0, %1, %2}"
>[(set_attr "isa" "noavx,noavx,avx")
> (set_attr "type" "sseimul")
> -   (set_attr "prefix_data16" "1,1,*")
> (set_attr "prefix_extra" "1")
> (set_attr "prefix" "orig,orig,vex")
> (set_attr "mode" "TI")])
> @@ -16688,8 +16687,6 @@
>

Re: [PATCH 08/10] x86: add missing "prefix" attribute to VF{,C}MULC

2023-08-03 Thread Hongtao Liu via Gcc-patches

On Thu, Aug 3, 2023 at 4:16 PM Jan Beulich via Gcc-patches
 wrote:
>
> gcc/
>
> * config/i386/sse.md
> (__): Add
> "prefix" attribute.
> 
> (avx512fp16_sh_v8hf):
> Likewise.
Ok.
> ---
> Talking of "prefix": Shouldn't at least V32HF and V32BF have it also
> default to "evex"? (It won't matter right here, but it may matter
> elsewhere.)
>
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -6790,6 +6790,7 @@
>return "v\t{%2, %1, 
> %0|%0, %1, %2}";
>  }
>[(set_attr "type" "ssemul")
> +   (set_attr "prefix" "evex")
> (set_attr "mode" "")])
>
>  (define_expand "avx512fp16_fmaddcsh_v8hf_maskz"
> @@ -6993,6 +6994,7 @@
>return "vsh\t{%2, %1, 
> %0|%0, %1, 
> %2}";
>  }
>[(set_attr "type" "ssemul")
> +   (set_attr "prefix" "evex")
> (set_attr "mode" "V8HF")])
>
>  ;
>


-- 
BR,
Hongtao

Re: [PATCH 00/10] x86: (mainly) "prefix_extra" adjustments

2023-08-03 Thread Hongtao Liu via Gcc-patches

On Thu, Aug 3, 2023 at 4:09 PM Jan Beulich via Gcc-patches
 wrote:
>
> Having noticed various bogus uses, I thought I'd go through and audit
> them all. This is the result, with some other attributes also adjusted
> as noticed in the process. (I think this tidying also is a good thing
> to have ahead of APX further complicating insn length calculations.)
Thanks for doing this.
I'm just checking the way to modify the attribute , doesn't go detail
for those instructions encoding(I think you must know better than me).
>
> 01: "prefix_extra" tidying
> 02: "sse4arg" adjustments
> 03: "ssemuladd" adjustments
> 04: "prefix_extra" can't really be "2"
> 05: replace/correct bogus "prefix_extra"
> 06: drop stray "prefix_extra"
> 07: add (adjust) XOP insn attributes
> 08: add missing "prefix" attribute to VF{,C}MULC
> 09: correct "length_immediate" in a few cases
> 10: drop redundant "prefix_data16" attributes
>
> Jan



-- 
BR,
Hongtao

Re: [PATCH] i386: Clear upper bits of XMM register for V4HFmode/V2HFmode operations [PR110762]

2023-08-07 Thread Hongtao Liu via Gcc-patches

On Mon, Aug 7, 2023 at 5:19 PM Uros Bizjak via Gcc-patches
 wrote:
>
> On Mon, Aug 7, 2023 at 10:57 AM liuhongt  wrote:
> >
> > Similar like r14-2786-gade30fad6669e5, the patch is for V4HF/V2HFmode.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> > PR target/110762
> > * config/i386/mmx.md (3): Changed from define_insn
> > to define_expand and break into ..
> > (v4hf3): .. this.
> > (divv4hf3): .. this.
> > (v2hf3): .. this.
> > (divv2hf3): .. this.
> > (movd_v2hf_to_sse): New define_expand.
> > (movq__to_sse): Extend to V4HFmode.
> > (mmxdoublevecmode): Ditto.
> > (V2FI_V4HF): New mode iterator.
> > * config/i386/sse.md (*vec_concatv4sf): Extend to hanlde V8HF
> > by using mode iterator V4SF_V8HF, renamed to ..
> > (*vec_concat): .. this.
> > (*vec_concatv4sf_0): Extend to handle V8HF by using mode
> > iterator V4SF_V8HF, renamed to ..
> > (*vec_concat_0): .. this.
> > (*vec_concatv8hf_movss): New define_insn.
> > (V4SF_V8HF): New mode iterator.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/pr110762-v4hf.c: New test.
>
> LGTM.
>
> Please also note the RFC patch [1] that relaxes clears for V2SFmode
> with -fno-trapping-math. The patched compiler will then emit the same
> code as clang does for -O2. Which raises another question - should gcc
> default to -fno-trapping-math?
>
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625795.html
>
I can create another patch to handle my parts for -fno-trapping-math
optimization.
> Thanks,
> Uros.
>
> > ---
> >  gcc/config/i386/mmx.md| 109 +++---
> >  gcc/config/i386/sse.md|  40 +--
> >  gcc/testsuite/gcc.target/i386/pr110762-v4hf.c |  57 +
> >  3 files changed, 177 insertions(+), 29 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr110762-v4hf.c
> >
> > diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
> > index 896af76a33f..88bdf084f54 100644
> > --- a/gcc/config/i386/mmx.md
> > +++ b/gcc/config/i386/mmx.md
> > @@ -79,9 +79,7 @@ (define_mode_iterator V_16_32_64
> >  ;; V2S* modes
> >  (define_mode_iterator V2FI [V2SF V2SI])
> >
> > -;; 4-byte and 8-byte float16 vector modes
> > -(define_mode_iterator VHF_32_64 [V4HF V2HF])
> > -
> > +(define_mode_iterator V2FI_V4HF [V2SF V2SI V4HF])
> >  ;; Mapping from integer vector mode to mnemonic suffix
> >  (define_mode_attr mmxvecsize
> >[(V8QI "b") (V4QI "b") (V2QI "b")
> > @@ -108,7 +106,7 @@ (define_mode_attr mmxintvecmodelower
> >
> >  ;; Mapping of vector modes to a vector mode of double size
> >  (define_mode_attr mmxdoublevecmode
> > -  [(V2SF "V4SF") (V2SI "V4SI")])
> > +  [(V2SF "V4SF") (V2SI "V4SI") (V4HF "V8HF")])
> >
> >  ;; Mapping of vector modes back to the scalar modes
> >  (define_mode_attr mmxscalarmode
> > @@ -594,7 +592,7 @@ (define_insn "sse_movntq"
> >  (define_expand "movq__to_sse"
> >[(set (match_operand: 0 "register_operand")
> > (vec_concat:
> > - (match_operand:V2FI 1 "nonimmediate_operand")
> > + (match_operand:V2FI_V4HF 1 "nonimmediate_operand")
> >   (match_dup 2)))]
> >"TARGET_SSE2"
> >"operands[2] = CONST0_RTX (mode);")
> > @@ -1927,21 +1925,94 @@ (define_expand "lroundv2sfv2si2"
> >  ;;
> >  ;
> >
> > -(define_insn "3"
> > -  [(set (match_operand:VHF_32_64 0 "register_operand" "=v")
> > -   (plusminusmultdiv:VHF_32_64
> > - (match_operand:VHF_32_64 1 "register_operand" "v")
> > - (match_operand:VHF_32_64 2 "register_operand" "v")))]
> > +(define_expand "v4hf3"
> > +  [(set (match_operand:V4HF 0 "register_operand")
> > +   (plusminusmult:V4HF
> > + (match_operand:V4HF 1 "nonimmediate_operand")
> > + (match_operand:V4HF 2 "nonimmediate_operand")))]
> >"TARGET_AVX512FP16 && TARGET_AVX512VL"
> > -  "vph\t{%2, %1, %0|%0, %1, %2}"
> > -  [(set (attr "type")
> > -  (cond [(match_test " == MULT")
> > -   (const_string "ssemul")
> > -(match_test " == DIV")
> > -   (const_string "ssediv")]
> > -(const_string "sseadd")))
> > -   (set_attr "prefix" "evex")
> > -   (set_attr "mode" "V8HF")])
> > +{
> > +  rtx op2 = gen_reg_rtx (V8HFmode);
> > +  rtx op1 = gen_reg_rtx (V8HFmode);
> > +  rtx op0 = gen_reg_rtx (V8HFmode);
> > +
> > +  emit_insn (gen_movq_v4hf_to_sse (op2, operands[2]));
> > +  emit_insn (gen_movq_v4hf_to_sse (op1, operands[1]));
> > +
> > +  emit_insn (gen_v8hf3 (op0, op1, op2));
> > +
> > +  emit_move_insn (operands[0], lowpart_subreg (V4HFmode, op0, V8HFmode));
> > +  DONE;
> > +})
> > +
> > +(define_expand "divv4hf3"
> > +  [(set (match_operand:V4HF 0 "register_operand")
> > +   (div:V4HF
> > + (match_operand:V4HF 1 "nonimmedia

Re: [PATCH] Fix ICE in rtl check when bootstrap.

2023-08-07 Thread Hongtao Liu via Gcc-patches

On Mon, Aug 7, 2023 at 4:54 PM liuhongt  wrote:
>
> /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/libgfortran/generated/matmul_i1.c:
>  In function ‘matmul_i1_avx512f’:
> /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/libgfortran/generated/matmul_i1.c:1781:1:
>  internal compiler error: RTL check: expected elt 0 type 'i' or 'n', have 'w' 
> (rtx const_int) in vpternlog_redundant_operand_mask, at 
> config/i386/i386.cc:19460
>  1781 | }
>   | ^
> 0x5559de26dc2d rtl_check_failed_type2(rtx_def const*, int, int, int, char 
> const*, int, char const*)
> 
> /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/gcc/rtl.cc:761
> 0x5559de340bfe vpternlog_redundant_operand_mask(rtx_def**)
> 
> /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/gcc/config/i386/i386.cc:19460
> 0x5559dfec67a6 split_44
> 
> /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/gcc/config/i386/sse.md:12730
> 0x5559dfec67a6 split_63
> 
> /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/gcc/config/i386/sse.md:28428
> 0x5559deb8a682 try_split(rtx_def*, rtx_insn*, int)
> 
> /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/gcc/emit-rtl.cc:3800
> 0x5559deb8adf2 try_split(rtx_def*, rtx_insn*, int)
> 
> /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/gcc/emit-rtl.cc:3972
> 0x5559def69194 split_insn
> 
> /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/gcc/recog.cc:3385
> 0x5559def70c57 split_all_insns()
> 
> /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/gcc/recog.cc:3489
> 0x5559def70d0c execute
> 
> /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/gcc/recog.cc:4413
>
> Use INTVAL (imm_op) instead of XINT (imm_op, 0).
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
Pushed to trunk as an obvious fix.
>
> gcc/ChangeLog:
>
> * config/i386/i386-protos.h (vpternlog_redundant_operand_mask):
>   Adjust parameter type.
> * config/i386/i386.cc (vpternlog_redundant_operand_mask): Use
>   INTVAL instead of XINT, also adjust parameter type from rtx* to
>   rtx since the function only needs operands[4] in vpternlog
>   pattern.
> (substitute_vpternlog_operands): Pass operands[4] instead of
>   operands to vpternlog_redundant_operand_mask
> * config/i386/sse.md: Ditto.
> ---
>  gcc/config/i386/i386-protos.h | 2 +-
>  gcc/config/i386/i386.cc   | 6 +++---
>  gcc/config/i386/sse.md| 4 ++--
>  3 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> index e547ee64587..fc2f1f13b78 100644
> --- a/gcc/config/i386/i386-protos.h
> +++ b/gcc/config/i386/i386-protos.h
> @@ -70,7 +70,7 @@ extern machine_mode ix86_cc_mode (enum rtx_code, rtx, rtx);
>  extern int avx_vpermilp_parallel (rtx par, machine_mode mode);
>  extern int avx_vperm2f128_parallel (rtx par, machine_mode mode);
>
> -extern int vpternlog_redundant_operand_mask (rtx[]);
> +extern int vpternlog_redundant_operand_mask (rtx);
>  extern void substitute_vpternlog_operands (rtx[]);
>
>  extern bool ix86_expand_strlen (rtx, rtx, rtx, rtx);
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 8cd26eb54fa..50860050049 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -19454,10 +19454,10 @@ avx_vperm2f128_parallel (rtx par, machine_mode mode)
>  /* Return a mask of VPTERNLOG operands that do not affect output.  */
>
>  int
> -vpternlog_redundant_operand_mask (rtx *operands)
> +vpternlog_redundant_operand_mask (rtx pternlog_imm)
>  {
>int mask = 0;
> -  int imm8 = XINT (operands[4], 0);
> +  int imm8 = INTVAL (pternlog_imm);
>
>if (((imm8 >> 4) & 0x0F) == (imm8 & 0x0F))
>  mask |= 1;
> @@ -19475,7 +19475,7 @@ vpternlog_redundant_operand_mask (rtx *operands)
>  void
>  substitute_vpternlog_operands (rtx *operands)
>  {
> -  int mask = vpternlog_redundant_operand_mask (operands);
> +  int mask = vpternlog_redundant_operand_mask (operands[4]);
>
>if (mask & 1) /* The first operand is redundant.  */
>  operands[1] = operands[2];
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 7e2aa3f995c..c53450fd965 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -12706,7 +12706,7 @@ (define_split
>(match_operand:V 3 "memory_operand")
>(match_operand:SI 4 "const_0_to_255_operand")]
>   UNSPEC_VTERNLOG))]
> -  "!reload_completed && vpternlog_redundant_operand_mask (operands) == 3"
> +  "!reload_completed && vpternlog_redundant_operand_mask (operands[4]) == 3"
>[(set (match_dup 0)
> (match_dup 3))
> (set (match_dup 0)
> @@ -12727,7 +12727,7 @@ (define_split
>(match_operand:

Re: Intel AVX10.1 Compiler Design and Support

2023-08-08 Thread Hongtao Liu via Gcc-patches

On Wed, Aug 9, 2023 at 3:55 AM Joseph Myers  wrote:
>
> Do you have any comments on the interaction of AVX10 with the
> micro-architecture levels defined in the ABI (and supported with
> glibc-hwcaps directories in glibc)?  Given that the levels are cumulative,
> should we take it that any future levels will be ones supporting 512-bit
> vector width for AVX10 (because x86-64-v4 requires the current AVX512F,
> AVX512BW, AVX512CD, AVX512DQ and AVX512VL) - and so any future processors
> that only support 256-bit vector width will be considered to match the
> x86-64-v3 micro-architecture level but not any higher level?
This is actually something we really want to discuss in the community,
our proposal for x86-64-v5: AVX10.2-256(Implying AVX10.1-256) + APX.
One big reason is Intel E-core will only support AVX10 256-bit, if we
want to use x86-64-v5 accross  server and client, it's better to
256-bit default.
>
> --
> Joseph S. Myers
> jos...@codesourcery.com



-- 
BR,
Hongtao

Re: Intel AVX10.1 Compiler Design and Support

2023-08-08 Thread Hongtao Liu via Gcc-patches

On Tue, Aug 8, 2023 at 8:45 PM Richard Biener via Gcc-patches
 wrote:
>
> On Tue, Aug 8, 2023 at 10:15 AM Jiang, Haochen via Gcc-patches
>  wrote:
> >
> > Hi Jakub,
> >
> > > So, what does this imply for the current ISAs?
> >
> > AVX10 will imply AVX2 on the ISA level. And we suppose AVX10 is an
> > independent ISA feature set. Although sharing the same instructions and
> > encodings, AVX10 and AVX512 are conceptual independent features, which
> > means they are orthogonal.
> >
> > > The expectations in lots of config/i386/* is that -mavx512f / 
> > > TARGET_AVX512F
> > > means 512 bit vector support is available and most of the various 
> > > -mavx512XXX
> > > options imply -mavx512f (and -mno-avx512f turns those off).  And if
> > > -mavx512vl / TARGET_AVX512VL isn't available, tons of places just use
> > > 512-bit EVEX instructions for 256-bit or 128-bit stuff (mostly to be able 
> > > to
> > > access [xy]mm16+).
> >
> > For AVX10, the 128/256/scalar version of the instructions are always there, 
> > and
> > also for [xy]mm16+. 512 version is "optional", which needs user to indicate 
> > them
> > in options. When 512 version is enabled, 128/256/scalar version is also 
> > enabled,
> > which is kind of reverse relation between the current AVX512F/AVX512VL.
> >
> > Since we take AVX10 and AVX512 are orthogonal, we will add OR logic for the 
> > current
> > pattern, which is shown in our AVX512DQ+VL sample patches.
>
> Hmm, so it sounds like AVX10 is currently, at the 10.1 level, a way to specify
> AVX512F and AVX512VL "differently", so wouldn't it make sense to make it
In the future there're plantfomrs only support AVX10.x-256, but not
AVX512 stuffs, it doesn't make much sense on that platfrom to disable
part of AVX512.
We really want to make AVX10.x a indivisible features, just like other
individual CPUID.
> complement those only so one can use, say, -mavx10 -mno-avx512bf16 to disable
> parts of the former AVX512 ISA one doesn't like to get code generated for?
> -mavx10 would then enable all the existing sub-AVX512 ISAs?
Another alternative solution is
>
> > > Sure, I expect all AVX10.N CPUs will have AVX512VL CPUID, will they have
> > > AVX512F CPUID even when the 512-bit vectors aren't present? What happens 
> > > if
> > > one mixes the -mavx10* options together with -mno-avx512vl or similar
> > > options?  Will -mno-avx512f still imply -mno-avx512vl etc.?
> >
> > For the CPUID part, AVX10 and AVX512 have different emulation. Only Xeon 
> > Server
> > will have AVX512 related CPUIDs for backward compatibility. For GNR, it 
> > will be
> > AVX512F, AVX512VL, AVX512CD, AVX512BW, AVX512DQ, AVX512_IFMA, AVX512_VBMI,
> > AVX512_VNNI, AVX512_BF16, AVX512_BITALG, AVX512_VPOPCNTDQ, AV512_VBMI2,
> > AVX512_FP16. Also, it will have AVX10 CPUIDs with 512 bit support set. Atom 
> > Server and
> > client will only have AVX10 CPUIDs with 256 bit support set.
> >
> > -mno-avx512f will still imply -mno-avx512vl.
> >
> > As we mentioned below, we don't recommend users to combine the AVX10 and 
> > legacy
> > AVX512 options. We understand that there will be different opinions on what 
> > should
> > compiler behave on some controversial option combinations.
> >
> > If there is someone mixes the options, the golden rule is that we are using 
> > OR logic.
> > Therefore, enabling either feature will turn on the shared instructions, no 
> > matter the other
> > feature is not mentioned or closed. That is why we are emitting warning for 
> > some scenarios,
> > which is also mentioned in the letter.
>
> I'm refraining from commenting on the senslesness of AVX10 as you're
> likely on the same
> receiving side as us.
>
> Thanks,
> Richard.
>
> > Thx,
> > Haochen
> >
> > >
> > >   Jakub
> >



-- 
BR,
Hongtao

Re: Intel AVX10.1 Compiler Design and Support

2023-08-08 Thread Hongtao Liu via Gcc-patches

On Wed, Aug 9, 2023 at 10:06 AM Hongtao Liu  wrote:
>
> On Tue, Aug 8, 2023 at 8:45 PM Richard Biener via Gcc-patches
>  wrote:
> >
> > On Tue, Aug 8, 2023 at 10:15 AM Jiang, Haochen via Gcc-patches
> >  wrote:
> > >
> > > Hi Jakub,
> > >
> > > > So, what does this imply for the current ISAs?
> > >
> > > AVX10 will imply AVX2 on the ISA level. And we suppose AVX10 is an
> > > independent ISA feature set. Although sharing the same instructions and
> > > encodings, AVX10 and AVX512 are conceptual independent features, which
> > > means they are orthogonal.
> > >
> > > > The expectations in lots of config/i386/* is that -mavx512f / 
> > > > TARGET_AVX512F
> > > > means 512 bit vector support is available and most of the various 
> > > > -mavx512XXX
> > > > options imply -mavx512f (and -mno-avx512f turns those off).  And if
> > > > -mavx512vl / TARGET_AVX512VL isn't available, tons of places just use
> > > > 512-bit EVEX instructions for 256-bit or 128-bit stuff (mostly to be 
> > > > able to
> > > > access [xy]mm16+).
> > >
> > > For AVX10, the 128/256/scalar version of the instructions are always 
> > > there, and
> > > also for [xy]mm16+. 512 version is "optional", which needs user to 
> > > indicate them
> > > in options. When 512 version is enabled, 128/256/scalar version is also 
> > > enabled,
> > > which is kind of reverse relation between the current AVX512F/AVX512VL.
> > >
> > > Since we take AVX10 and AVX512 are orthogonal, we will add OR logic for 
> > > the current
> > > pattern, which is shown in our AVX512DQ+VL sample patches.
> >
> > Hmm, so it sounds like AVX10 is currently, at the 10.1 level, a way to 
> > specify
> > AVX512F and AVX512VL "differently", so wouldn't it make sense to make it
> In the future there're plantfomrs only support AVX10.x-256, but not
> AVX512 stuffs, it doesn't make much sense on that platfrom to disable
> part of AVX512.
> We really want to make AVX10.x a indivisible features, just like other
> individual CPUID.
> > complement those only so one can use, say, -mavx10 -mno-avx512bf16 to 
> > disable
> > parts of the former AVX512 ISA one doesn't like to get code generated for?
> > -mavx10 would then enable all the existing sub-AVX512 ISAs?
> Another alternative solution is
is split AVX512 into AVX512-256 and AVX512-512, like AVX512F-256,
AVX512FP16-256, AVX512FP16-512, AVX512FP16-512, and make AVX10.1-256
implies those AVX512-256, AVX10.1-512 implies AVX512-512.
> >
> > > > Sure, I expect all AVX10.N CPUs will have AVX512VL CPUID, will they have
> > > > AVX512F CPUID even when the 512-bit vectors aren't present? What 
> > > > happens if
> > > > one mixes the -mavx10* options together with -mno-avx512vl or similar
> > > > options?  Will -mno-avx512f still imply -mno-avx512vl etc.?
> > >
> > > For the CPUID part, AVX10 and AVX512 have different emulation. Only Xeon 
> > > Server
> > > will have AVX512 related CPUIDs for backward compatibility. For GNR, it 
> > > will be
> > > AVX512F, AVX512VL, AVX512CD, AVX512BW, AVX512DQ, AVX512_IFMA, AVX512_VBMI,
> > > AVX512_VNNI, AVX512_BF16, AVX512_BITALG, AVX512_VPOPCNTDQ, AV512_VBMI2,
> > > AVX512_FP16. Also, it will have AVX10 CPUIDs with 512 bit support set. 
> > > Atom Server and
> > > client will only have AVX10 CPUIDs with 256 bit support set.
> > >
> > > -mno-avx512f will still imply -mno-avx512vl.
> > >
> > > As we mentioned below, we don't recommend users to combine the AVX10 and 
> > > legacy
> > > AVX512 options. We understand that there will be different opinions on 
> > > what should
> > > compiler behave on some controversial option combinations.
> > >
> > > If there is someone mixes the options, the golden rule is that we are 
> > > using OR logic.
> > > Therefore, enabling either feature will turn on the shared instructions, 
> > > no matter the other
> > > feature is not mentioned or closed. That is why we are emitting warning 
> > > for some scenarios,
> > > which is also mentioned in the letter.
> >
> > I'm refraining from commenting on the senslesness of AVX10 as you're
> > likely on the same
> > receiving side as us.
> >
> > Thanks,
> > Richard.
> >
> > > Thx,
> > > Haochen
> > >
> > > >
> > > >   Jakub
> > >
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

Re: Intel AVX10.1 Compiler Design and Support

2023-08-08 Thread Hongtao Liu via Gcc-patches

On Wed, Aug 9, 2023 at 9:21 AM Hongtao Liu  wrote:
>
> On Wed, Aug 9, 2023 at 3:55 AM Joseph Myers  wrote:
> >
> > Do you have any comments on the interaction of AVX10 with the
> > micro-architecture levels defined in the ABI (and supported with
> > glibc-hwcaps directories in glibc)?  Given that the levels are cumulative,
> > should we take it that any future levels will be ones supporting 512-bit
> > vector width for AVX10 (because x86-64-v4 requires the current AVX512F,
> > AVX512BW, AVX512CD, AVX512DQ and AVX512VL) - and so any future processors
> > that only support 256-bit vector width will be considered to match the
> > x86-64-v3 micro-architecture level but not any higher level?
> This is actually something we really want to discuss in the community,
> our proposal for x86-64-v5: AVX10.2-256(Implying AVX10.1-256) + APX.
> One big reason is Intel E-core will only support AVX10 256-bit, if we
> want to use x86-64-v5 accross  server and client, it's better to
> 256-bit default.
+ ABI and LLVM folked for this topic.
> >
> > --
> > Joseph S. Myers
> > jos...@codesourcery.com
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

Re: Intel AVX10.1 Compiler Design and Support

2023-08-08 Thread Hongtao Liu via Gcc-patches

On Wed, Aug 9, 2023 at 10:14 AM Hongtao Liu  wrote:
>
> On Wed, Aug 9, 2023 at 9:21 AM Hongtao Liu  wrote:
> >
> > On Wed, Aug 9, 2023 at 3:55 AM Joseph Myers  wrote:
> > >
> > > Do you have any comments on the interaction of AVX10 with the
> > > micro-architecture levels defined in the ABI (and supported with
> > > glibc-hwcaps directories in glibc)?  Given that the levels are cumulative,
> > > should we take it that any future levels will be ones supporting 512-bit
> > > vector width for AVX10 (because x86-64-v4 requires the current AVX512F,
> > > AVX512BW, AVX512CD, AVX512DQ and AVX512VL) - and so any future processors
> > > that only support 256-bit vector width will be considered to match the
> > > x86-64-v3 micro-architecture level but not any higher level?
> > This is actually something we really want to discuss in the community,
> > our proposal for x86-64-v5: AVX10.2-256(Implying AVX10.1-256) + APX.
> > One big reason is Intel E-core will only support AVX10 256-bit, if we
> > want to use x86-64-v5 accross  server and client, it's better to
> > 256-bit default.
> + ABI and LLVM folked for this topic.
s/folked/folks/

> > >
> > > --
> > > Joseph S. Myers
> > > jos...@codesourcery.com
> >
> >
> >
> > --
> > BR,
> > Hongtao
>
>
>
> --
> BR,
> Hongtao



--
BR,
Hongtao

Re: Intel AVX10.1 Compiler Design and Support

2023-08-09 Thread Hongtao Liu via Gcc-patches

On Wed, Aug 9, 2023 at 3:17 PM Jan Beulich  wrote:
>
> On 09.08.2023 04:14, Hongtao Liu wrote:
> > On Wed, Aug 9, 2023 at 9:21 AM Hongtao Liu  wrote:
> >>
> >> On Wed, Aug 9, 2023 at 3:55 AM Joseph Myers  
> >> wrote:
> >>>
> >>> Do you have any comments on the interaction of AVX10 with the
> >>> micro-architecture levels defined in the ABI (and supported with
> >>> glibc-hwcaps directories in glibc)?  Given that the levels are cumulative,
> >>> should we take it that any future levels will be ones supporting 512-bit
> >>> vector width for AVX10 (because x86-64-v4 requires the current AVX512F,
> >>> AVX512BW, AVX512CD, AVX512DQ and AVX512VL) - and so any future processors
> >>> that only support 256-bit vector width will be considered to match the
> >>> x86-64-v3 micro-architecture level but not any higher level?
> >> This is actually something we really want to discuss in the community,
> >> our proposal for x86-64-v5: AVX10.2-256(Implying AVX10.1-256) + APX.
> >> One big reason is Intel E-core will only support AVX10 256-bit, if we
> >> want to use x86-64-v5 accross  server and client, it's better to
> >> 256-bit default.
>
> Aiui these ABI levels were intended to be incremental, i.e. higher versions
> would include everything earlier ones cover. Without such a guarantee, how
> would you propose compatibility checks to be implemented in a way
Are there many software implemenation based on this assumption?
At least in GCC, it's not a big problem, we can adjust code for the
new micro-architecture level.
> applicable both forwards and backwards? If a new level is wanted here, then
> I guess it could only be something like v3.5.
But if we use avx10.1 as v3.5, it's still not subset of
x86-64-v4(avx10.1 contains avx512fp16,avx512bf16 .etc which are not in
x86-64-v4), there will be still a diverge.
Then 256-bit of x86-64-v4 as v3.5? that's too weired to me.

Our main proposal is to make AVX10.x as new micro-architecture level
with 256-bit default, either v3.5 or v5 would be acceptable if it's
just the name.
>
> Jan



-- 
BR,
Hongtao

Re: Intel AVX10.1 Compiler Design and Support

2023-08-09 Thread Hongtao Liu via Gcc-patches

On Wed, Aug 9, 2023 at 4:14 PM Florian Weimer  wrote:
>
> * Richard Biener via Gcc-patches:
>
> > I don’t think we can realistically change the ABI.  If we could
> > passing them in two 256bit registers would be possible as well.
> >
> > Note I fully expect intel to turn around and implement 512 bits on a
> > 256 but data path on the E cores in 5 years.  And it will take at
> > least that time for AVX10 to take off (look at AVX512 for this and how
> > they cautionously chose to include bf16 to cut off Zen4).  So IMHO we
> > shouldn’t worry at all and just wait and see for AVX42 to arrive.
>
> Yes, the direction is a bit unclear.  In retrospect, we could have
> defined x86-64-v4 to use 256 bit vector width, so it could eventually be
> compatible with AVX10; it's also what current Intel CPUs prefer (and
NOTE, avx10.x-256 also inhibit the usage of 64-bit kmask which is
supposed to be only used  by zmm instructions.
But in theory, those 64-bit kmask intrinsics can be used standalone
.i.e. kshift/kand/kor.
> past, with the exception of the Xeon Phi line).  But in the meantime,
> AMD has started to ship CPUs that seem to prefer 512 bit vectors,
> despite having a double pumped implementation.  (Disclaimer: All CPU
> preferences inferred from current compiler tuning defaults, not actual
> experiments. 8-/)
>
> To me, this looks like we may have defined x86-64-v4 prematurely, and
> this suggests we should wait a bit to see where things are heading.
>
> Thanks,
> Florian
>


-- 
BR,
Hongtao

Re: Intel AVX10.1 Compiler Design and Support

2023-08-09 Thread Hongtao Liu via Gcc-patches

On Wed, Aug 9, 2023 at 5:15 PM Florian Weimer  wrote:
>
> * Hongtao Liu:
>
> > On Wed, Aug 9, 2023 at 3:17 PM Jan Beulich  wrote:
> >> Aiui these ABI levels were intended to be incremental, i.e. higher versions
> >> would include everything earlier ones cover. Without such a guarantee, how
> >> would you propose compatibility checks to be implemented in a way
>
> Correct, this was the intent.  But it's mostly to foster adoption and
> make it easier for developers to pick the variants that they want to
> target custom builds.  If it's an ascending chain, the trade-offs are
> simpler.
>
> > Are there many software implemenation based on this assumption?
> > At least in GCC, it's not a big problem, we can adjust code for the
> > new micro-architecture level.
>
> The glibc framework can deal with alternate choices in principle,
> although I'd prefer not to go there for the reasons indicated.
>
> >> applicable both forwards and backwards? If a new level is wanted here, then
> >> I guess it could only be something like v3.5.
>
> > But if we use avx10.1 as v3.5, it's still not subset of
> > x86-64-v4(avx10.1 contains avx512fp16,avx512bf16 .etc which are not in
> > x86-64-v4), there will be still a diverge.
> > Then 256-bit of x86-64-v4 as v3.5? that's too weired to me.
>
> The question is whether you want to mandate the 16-bit floating point
> extensions.  You might get better adoption if you stay compatible with
> shipping CPUs.  Furthermore, the 256-bit tuning apparently benefits
> current Intel CPUs, even though they can do 512-bit vectors.
Not only 16-bit floating point, here's a whole picture of  AVX512->AVX10 in
Figure 1-1. Intel® AVX-512 Feature Flags Across Intel® Xeon® Processor
Generations vs. Intel® AVX10
and Figure 1-2. Intel® ISA Families and Features
at https://cdrdv2.intel.com/v1/dl/getContent/784343 (this link is a
direct download of pdf).



>
> (The thread subject is a bit misleading for this sub-topic, by the way.)
>
> Thanks,
> Florian
>


-- 
BR,
Hongtao

Re: [PATCH] i386: Do not sanitize upper part of V2HFmode and V4HFmode reg with -fno-trapping-math [PR110832]

2023-08-09 Thread Hongtao Liu via Gcc-patches

On Thu, Aug 10, 2023 at 2:01 PM Uros Bizjak via Gcc-patches
 wrote:
>
> On Thu, Aug 10, 2023 at 2:49 AM liuhongt  wrote:
> >
> > Also add ix86_partial_vec_fp_math to to condition of V2HF/V4HF named
> > patterns in order to avoid generation of partial vector V8HFmode
> > trapping instructions.
> >
> > Bootstrapped and regtseted on x86_64-pc-linux-gnu{-m32,}
> > Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> > PR target/110832
> > * config/i386/mmx.md: (movq__to_sse): Also do not
> > sanitize upper part of V4HFmode register with
> > -fno-trapping-math.
> > (v4hf3): Enable for ix86_partial_vec_fp_math.
> > ( > (v2hf3): Ditto.
> > (divv2hf3): Ditto.
> > (movd_v2hf_to_sse): Do not sanitize upper part of V2HFmode
> > register with -fno-trapping-math.
>
> OK.
>
> BTW: I would just like to mention that plenty of instructions can be
> enabled for V4HF/V2HFmode besides arithmetic insns. At least
> conversions, comparisons, FMA and min/max (to name some of them) can
> be enabled by introducing expanders that expand to V8HFmode
> instruction.
Yes, try to support that in GCC14.
>
> Uros.
> >
> > ---
> >  gcc/config/i386/mmx.md | 20 ++--
> >  1 file changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
> > index d51b3b9dc71..170432a7128 100644
> > --- a/gcc/config/i386/mmx.md
> > +++ b/gcc/config/i386/mmx.md
> > @@ -596,7 +596,7 @@ (define_expand "movq__to_sse"
> >   (match_dup 2)))]
> >"TARGET_SSE2"
> >  {
> > -  if (mode == V2SFmode
> > +  if (mode != V2SImode
> >&& !flag_trapping_math)
> >  {
> >rtx op1 = force_reg (mode, operands[1]);
> > @@ -1941,7 +1941,7 @@ (define_expand "v4hf3"
> > (plusminusmult:V4HF
> >   (match_operand:V4HF 1 "nonimmediate_operand")
> >   (match_operand:V4HF 2 "nonimmediate_operand")))]
> > -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> > +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
> >  {
> >rtx op2 = gen_reg_rtx (V8HFmode);
> >rtx op1 = gen_reg_rtx (V8HFmode);
> > @@ -1961,7 +1961,7 @@ (define_expand "divv4hf3"
> > (div:V4HF
> >   (match_operand:V4HF 1 "nonimmediate_operand")
> >   (match_operand:V4HF 2 "nonimmediate_operand")))]
> > -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> > +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
> >  {
> >rtx op2 = gen_reg_rtx (V8HFmode);
> >rtx op1 = gen_reg_rtx (V8HFmode);
> > @@ -1983,14 +1983,22 @@ (define_expand "movd_v2hf_to_sse"
> > (match_operand:V2HF 1 "nonimmediate_operand"))
> >   (match_operand:V8HF 2 "reg_or_0_operand")
> >   (const_int 3)))]
> > -  "TARGET_SSE")
> > +  "TARGET_SSE"
> > +{
> > +  if (!flag_trapping_math && operands[2] == CONST0_RTX (V8HFmode))
> > +  {
> > +rtx op1 = force_reg (V2HFmode, operands[1]);
> > +emit_move_insn (operands[0], lowpart_subreg (V8HFmode, op1, V2HFmode));
> > +DONE;
> > +  }
> > +})
> >
> >  (define_expand "v2hf3"
> >[(set (match_operand:V2HF 0 "register_operand")
> > (plusminusmult:V2HF
> >   (match_operand:V2HF 1 "nonimmediate_operand")
> >   (match_operand:V2HF 2 "nonimmediate_operand")))]
> > -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> > +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
> >  {
> >rtx op2 = gen_reg_rtx (V8HFmode);
> >rtx op1 = gen_reg_rtx (V8HFmode);
> > @@ -2009,7 +2017,7 @@ (define_expand "divv2hf3"
> > (div:V2HF
> >   (match_operand:V2HF 1 "nonimmediate_operand")
> >   (match_operand:V2HF 2 "nonimmediate_operand")))]
> > -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> > +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
> >  {
> >rtx op2 = gen_reg_rtx (V8HFmode);
> >rtx op1 = gen_reg_rtx (V8HFmode);
> > --
> > 2.31.1
> >



-- 
BR,
Hongtao

Re: [PATCH] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions.

2023-08-09 Thread Hongtao Liu via Gcc-patches

On Thu, Aug 10, 2023 at 2:04 PM Uros Bizjak via Gcc-patches
 wrote:
>
> On Thu, Aug 10, 2023 at 3:13 AM liuhongt  wrote:
> >
> > Currently we have 3 different independent tunes for gather
> > "use_gather,use_gather_2parts,use_gather_4parts",
> > similar for scatter, there're
> > "use_scatter,use_scatter_2parts,use_scatter_4parts"
> >
> > The patch support 2 standardizing options to enable/disable
> > vectorization for all gather/scatter instructions. The options is
> > interpreted by driver to 3 tunes.
> >
> > bootstrapped and regtested on x86_64-pc-linux-gnu.
> > Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> > * config/i386/i386.h (DRIVER_SELF_SPECS): Add
> > GATHER_SCATTER_DRIVER_SELF_SPECS.
> > (GATHER_SCATTER_DRIVER_SELF_SPECS): New macro.
> > * config/i386/i386.opt (mgather): New option.
> > (mscatter): Ditto.
> > ---
> >  gcc/config/i386/i386.h   | 12 +++-
> >  gcc/config/i386/i386.opt |  8 
> >  2 files changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > index ef342fcee9b..d9ac2c29bde 100644
> > --- a/gcc/config/i386/i386.h
> > +++ b/gcc/config/i386/i386.h
> > @@ -565,7 +565,17 @@ extern GTY(()) tree x86_mfence;
> >  # define SUBTARGET_DRIVER_SELF_SPECS ""
> >  #endif
> >
> > -#define DRIVER_SELF_SPECS SUBTARGET_DRIVER_SELF_SPECS
> > +#ifndef GATHER_SCATTER_DRIVER_SELF_SPECS
> > +# define GATHER_SCATTER_DRIVER_SELF_SPECS \
> > +  
> > "%{mno-gather:-mtune-ctrl=^use_gather_2parts,^use_gather_4parts,^use_gather}
> >  \
> > +   %{mgather:-mtune-ctrl=use_gather_2parts,use_gather_4parts,use_gather} \
> > +   
> > %{mno-scatter:-mtune-ctrl=^use_scatter_2parts,^use_scatter_4parts,^use_scatter}
> >  \
> > +   
> > %{mscatter:-mtune-ctrl=use_scatter_2parts,use_scatter_4parts,use_scatter}"
> > +#endif
> > +
> > +#define DRIVER_SELF_SPECS \
> > +  SUBTARGET_DRIVER_SELF_SPECS " " \
> > +  GATHER_SCATTER_DRIVER_SELF_SPECS
> >
> >  /* -march=native handling only makes sense with compiler running on
> > an x86 or x86_64 chip.  If changing this condition, also change
> > diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> > index ddb7f110aa2..99948644a8d 100644
> > --- a/gcc/config/i386/i386.opt
> > +++ b/gcc/config/i386/i386.opt
> > @@ -424,6 +424,14 @@ mdaz-ftz
> >  Target
> >  Set the FTZ and DAZ Flags.
> >
> > +mgather
> > +Target
> > +Enable vectorization for gather instruction.
> > +
> > +mscatter
> > +Target
> > +Enable vectorization for scatter instruction.
>
> Are gather and scatter instructions affected in a separate way, or
> should we use one -mgather-scatter option to cover all gather/scatter
> tunings?
A separate way.
Gather Data Sampling is only for gather.
https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/gather-data-sampling.html
>
> Uros.
>
> > +
> >  mpreferred-stack-boundary=
> >  Target RejectNegative Joined UInteger 
> > Var(ix86_preferred_stack_boundary_arg)
> >  Attempt to keep stack aligned to this power of 2.
> > --
> > 2.31.1
> >



-- 
BR,
Hongtao

Re: [PATCH] i386: Do not sanitize upper part of V2HFmode and V4HFmode reg with -fno-trapping-math [PR110832]

2023-08-09 Thread Hongtao Liu via Gcc-patches

On Thu, Aug 10, 2023 at 2:06 PM Hongtao Liu  wrote:
>
> On Thu, Aug 10, 2023 at 2:01 PM Uros Bizjak via Gcc-patches
>  wrote:
> >
> > On Thu, Aug 10, 2023 at 2:49 AM liuhongt  wrote:
> > >
> > > Also add ix86_partial_vec_fp_math to to condition of V2HF/V4HF named
> > > patterns in order to avoid generation of partial vector V8HFmode
> > > trapping instructions.
> > >
> > > Bootstrapped and regtseted on x86_64-pc-linux-gnu{-m32,}
> > > Ok for trunk?
> > >
> > > gcc/ChangeLog:
> > >
> > > PR target/110832
> > > * config/i386/mmx.md: (movq__to_sse): Also do not
> > > sanitize upper part of V4HFmode register with
> > > -fno-trapping-math.
> > > (v4hf3): Enable for ix86_partial_vec_fp_math.
> > > ( > > (v2hf3): Ditto.
> > > (divv2hf3): Ditto.
> > > (movd_v2hf_to_sse): Do not sanitize upper part of V2HFmode
> > > register with -fno-trapping-math.
> >
> > OK.
> >
> > BTW: I would just like to mention that plenty of instructions can be
> > enabled for V4HF/V2HFmode besides arithmetic insns. At least
> > conversions, comparisons, FMA and min/max (to name some of them) can
> > be enabled by introducing expanders that expand to V8HFmode
> > instruction.
> Yes, try to support that in GCC14.
I would wait for avx10's patch to go in first, so as to avoid extra
rebases and conflicts.
> >
> > Uros.
> > >
> > > ---
> > >  gcc/config/i386/mmx.md | 20 ++--
> > >  1 file changed, 14 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
> > > index d51b3b9dc71..170432a7128 100644
> > > --- a/gcc/config/i386/mmx.md
> > > +++ b/gcc/config/i386/mmx.md
> > > @@ -596,7 +596,7 @@ (define_expand "movq__to_sse"
> > >   (match_dup 2)))]
> > >"TARGET_SSE2"
> > >  {
> > > -  if (mode == V2SFmode
> > > +  if (mode != V2SImode
> > >&& !flag_trapping_math)
> > >  {
> > >rtx op1 = force_reg (mode, operands[1]);
> > > @@ -1941,7 +1941,7 @@ (define_expand "v4hf3"
> > > (plusminusmult:V4HF
> > >   (match_operand:V4HF 1 "nonimmediate_operand")
> > >   (match_operand:V4HF 2 "nonimmediate_operand")))]
> > > -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> > > +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
> > >  {
> > >rtx op2 = gen_reg_rtx (V8HFmode);
> > >rtx op1 = gen_reg_rtx (V8HFmode);
> > > @@ -1961,7 +1961,7 @@ (define_expand "divv4hf3"
> > > (div:V4HF
> > >   (match_operand:V4HF 1 "nonimmediate_operand")
> > >   (match_operand:V4HF 2 "nonimmediate_operand")))]
> > > -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> > > +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
> > >  {
> > >rtx op2 = gen_reg_rtx (V8HFmode);
> > >rtx op1 = gen_reg_rtx (V8HFmode);
> > > @@ -1983,14 +1983,22 @@ (define_expand "movd_v2hf_to_sse"
> > > (match_operand:V2HF 1 "nonimmediate_operand"))
> > >   (match_operand:V8HF 2 "reg_or_0_operand")
> > >   (const_int 3)))]
> > > -  "TARGET_SSE")
> > > +  "TARGET_SSE"
> > > +{
> > > +  if (!flag_trapping_math && operands[2] == CONST0_RTX (V8HFmode))
> > > +  {
> > > +rtx op1 = force_reg (V2HFmode, operands[1]);
> > > +emit_move_insn (operands[0], lowpart_subreg (V8HFmode, op1, 
> > > V2HFmode));
> > > +DONE;
> > > +  }
> > > +})
> > >
> > >  (define_expand "v2hf3"
> > >[(set (match_operand:V2HF 0 "register_operand")
> > > (plusminusmult:V2HF
> > >   (match_operand:V2HF 1 "nonimmediate_operand")
> > >   (match_operand:V2HF 2 "nonimmediate_operand")))]
> > > -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> > > +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
> > >  {
> > >rtx op2 = gen_reg_rtx (V8HFmode);
> > >rtx op1 = gen_reg_rtx (V8HFmode);
> > > @@ -2009,7 +2017,7 @@ (define_expand "divv2hf3"
> > > (div:V2HF
> > >   (match_operand:V2HF 1 "nonimmediate_operand")
> > >   (match_operand:V2HF 2 "nonimmediate_operand")))]
> > > -  "TARGET_AVX512FP16 && TARGET_AVX512VL"
> > > +  "TARGET_AVX512FP16 && TARGET_AVX512VL && ix86_partial_vec_fp_math"
> > >  {
> > >rtx op2 = gen_reg_rtx (V8HFmode);
> > >rtx op1 = gen_reg_rtx (V8HFmode);
> > > --
> > > 2.31.1
> > >
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

Re: [PATCH] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions.

2023-08-10 Thread Hongtao Liu via Gcc-patches

On Thu, Aug 10, 2023 at 3:49 PM Richard Biener via Gcc-patches
 wrote:
>
> On Thu, Aug 10, 2023 at 9:42 AM Uros Bizjak  wrote:
> >
> > On Thu, Aug 10, 2023 at 9:40 AM Richard Biener
> >  wrote:
> > >
> > > On Thu, Aug 10, 2023 at 3:13 AM liuhongt  wrote:
> > > >
> > > > Currently we have 3 different independent tunes for gather
> > > > "use_gather,use_gather_2parts,use_gather_4parts",
> > > > similar for scatter, there're
> > > > "use_scatter,use_scatter_2parts,use_scatter_4parts"
> > > >
> > > > The patch support 2 standardizing options to enable/disable
> > > > vectorization for all gather/scatter instructions. The options is
> > > > interpreted by driver to 3 tunes.
> > > >
> > > > bootstrapped and regtested on x86_64-pc-linux-gnu.
> > > > Ok for trunk?
> > >
> > > I think -mgather/-mscatter are too close to -mfma suggesting they
> > > enable part of an ISA but they won't disable the use of intrinsics
> > > or enable gather/scatter on CPUs where the ISA doesn't have them.
> > >
> > > May I suggest to invent a more generic "short-cut" to
> > > -mtune-ctrl=^X, maybe -mdisable=X?  And for gather/scatter
> > > tunables add ^use_gather_any to cover all cases?  (or
> > > change what use_gather controls - it seems we changed its
> > > meaning before, and instead add use_gather_8parts and
> > > use_gather_16parts)
> > >
> > > That is, what's the point of this?
> >
> > https://www.phoronix.com/review/downfall
> >
> > that caused:
> >
> > https://www.phoronix.com/review/intel-downfall-benchmarks
>
> Yes, I know.  But there's -mtune-ctl= doing the trick.
> GCC 11 had only 'use_gather', covering all number of lanes.  I suggest
> to resurrect that behavior and add use_gather_8+parts (or two, IIRC
> gather works only on SI/SFmode or larger).
>
> Then -mtune-ctl=^use_gather works which I think is nice enough?
So basically, -mtune-ctrl=^use_gather is used to turn off all gather
vectorization, but -mtune-ctrl=use_gather doesn't turn on all of them?
We don't have an extrat explicit flag for target tune, just single bit
- ix86_tune_features[X86_TUNE_USE_GATHER]
>
> Richard.
>
> > Uros.



-- 
BR,
Hongtao

Re: [PATCH] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions.

2023-08-10 Thread Hongtao Liu via Gcc-patches

On Thu, Aug 10, 2023 at 3:55 PM Hongtao Liu  wrote:
>
> On Thu, Aug 10, 2023 at 3:49 PM Richard Biener via Gcc-patches
>  wrote:
> >
> > On Thu, Aug 10, 2023 at 9:42 AM Uros Bizjak  wrote:
> > >
> > > On Thu, Aug 10, 2023 at 9:40 AM Richard Biener
> > >  wrote:
> > > >
> > > > On Thu, Aug 10, 2023 at 3:13 AM liuhongt  wrote:
> > > > >
> > > > > Currently we have 3 different independent tunes for gather
> > > > > "use_gather,use_gather_2parts,use_gather_4parts",
> > > > > similar for scatter, there're
> > > > > "use_scatter,use_scatter_2parts,use_scatter_4parts"
> > > > >
> > > > > The patch support 2 standardizing options to enable/disable
> > > > > vectorization for all gather/scatter instructions. The options is
> > > > > interpreted by driver to 3 tunes.
> > > > >
> > > > > bootstrapped and regtested on x86_64-pc-linux-gnu.
> > > > > Ok for trunk?
> > > >
> > > > I think -mgather/-mscatter are too close to -mfma suggesting they
> > > > enable part of an ISA but they won't disable the use of intrinsics
> > > > or enable gather/scatter on CPUs where the ISA doesn't have them.
> > > >
> > > > May I suggest to invent a more generic "short-cut" to
> > > > -mtune-ctrl=^X, maybe -mdisable=X?  And for gather/scatter
> > > > tunables add ^use_gather_any to cover all cases?  (or
> > > > change what use_gather controls - it seems we changed its
> > > > meaning before, and instead add use_gather_8parts and
> > > > use_gather_16parts)
> > > >
> > > > That is, what's the point of this?
> > >
> > > https://www.phoronix.com/review/downfall
> > >
> > > that caused:
> > >
> > > https://www.phoronix.com/review/intel-downfall-benchmarks
> >
> > Yes, I know.  But there's -mtune-ctl= doing the trick.
> > GCC 11 had only 'use_gather', covering all number of lanes.  I suggest
> > to resurrect that behavior and add use_gather_8+parts (or two, IIRC
> > gather works only on SI/SFmode or larger).
> >
> > Then -mtune-ctl=^use_gather works which I think is nice enough?
> So basically, -mtune-ctrl=^use_gather is used to turn off all gather
> vectorization, but -mtune-ctrl=use_gather doesn't turn on all of them?
> We don't have an extrat explicit flag for target tune, just single bit
> - ix86_tune_features[X86_TUNE_USE_GATHER]
Looks like I can handle it specially in parse_mtune_ctrl_str, let me try.
> >
> > Richard.
> >
> > > Uros.
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

Re: [PATCH] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions.

2023-08-10 Thread Hongtao Liu via Gcc-patches

On Thu, Aug 10, 2023 at 4:07 PM Hongtao Liu  wrote:
>
> On Thu, Aug 10, 2023 at 3:55 PM Hongtao Liu  wrote:
> >
> > On Thu, Aug 10, 2023 at 3:49 PM Richard Biener via Gcc-patches
> >  wrote:
> > >
> > > On Thu, Aug 10, 2023 at 9:42 AM Uros Bizjak  wrote:
> > > >
> > > > On Thu, Aug 10, 2023 at 9:40 AM Richard Biener
> > > >  wrote:
> > > > >
> > > > > On Thu, Aug 10, 2023 at 3:13 AM liuhongt  
> > > > > wrote:
> > > > > >
> > > > > > Currently we have 3 different independent tunes for gather
> > > > > > "use_gather,use_gather_2parts,use_gather_4parts",
> > > > > > similar for scatter, there're
> > > > > > "use_scatter,use_scatter_2parts,use_scatter_4parts"
> > > > > >
> > > > > > The patch support 2 standardizing options to enable/disable
> > > > > > vectorization for all gather/scatter instructions. The options is
> > > > > > interpreted by driver to 3 tunes.
> > > > > >
> > > > > > bootstrapped and regtested on x86_64-pc-linux-gnu.
> > > > > > Ok for trunk?
> > > > >
> > > > > I think -mgather/-mscatter are too close to -mfma suggesting they
> > > > > enable part of an ISA but they won't disable the use of intrinsics
> > > > > or enable gather/scatter on CPUs where the ISA doesn't have them.
> > > > >
> > > > > May I suggest to invent a more generic "short-cut" to
> > > > > -mtune-ctrl=^X, maybe -mdisable=X?  And for gather/scatter
> > > > > tunables add ^use_gather_any to cover all cases?  (or
> > > > > change what use_gather controls - it seems we changed its
> > > > > meaning before, and instead add use_gather_8parts and
> > > > > use_gather_16parts)
> > > > >
> > > > > That is, what's the point of this?
The point of this is to keep consistent between GCC, LLVM, and
ICX(Intel® oneAPI DPC++/C++ Compiler) .
LLVM,ICX will support that option.
> > > >
> > > > https://www.phoronix.com/review/downfall
> > > >
> > > > that caused:
> > > >
> > > > https://www.phoronix.com/review/intel-downfall-benchmarks
> > >
> > > Yes, I know.  But there's -mtune-ctl= doing the trick.
> > > GCC 11 had only 'use_gather', covering all number of lanes.  I suggest
> > > to resurrect that behavior and add use_gather_8+parts (or two, IIRC
> > > gather works only on SI/SFmode or larger).
> > >
> > > Then -mtune-ctl=^use_gather works which I think is nice enough?
> > So basically, -mtune-ctrl=^use_gather is used to turn off all gather
> > vectorization, but -mtune-ctrl=use_gather doesn't turn on all of them?
> > We don't have an extrat explicit flag for target tune, just single bit
> > - ix86_tune_features[X86_TUNE_USE_GATHER]
> Looks like I can handle it specially in parse_mtune_ctrl_str, let me try.
> > >
> > > Richard.
> > >
> > > > Uros.
> >
> >
> >
> > --
> > BR,
> > Hongtao
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao

Re: [PATCH] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions.

2023-08-10 Thread Hongtao Liu via Gcc-patches

On Thu, Aug 10, 2023 at 7:13 PM Richard Biener
 wrote:
>
> On Thu, Aug 10, 2023 at 11:16 AM Hongtao Liu  wrote:
> >
> > On Thu, Aug 10, 2023 at 4:07 PM Hongtao Liu  wrote:
> > >
> > > On Thu, Aug 10, 2023 at 3:55 PM Hongtao Liu  wrote:
> > > >
> > > > On Thu, Aug 10, 2023 at 3:49 PM Richard Biener via Gcc-patches
> > > >  wrote:
> > > > >
> > > > > On Thu, Aug 10, 2023 at 9:42 AM Uros Bizjak  wrote:
> > > > > >
> > > > > > On Thu, Aug 10, 2023 at 9:40 AM Richard Biener
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Thu, Aug 10, 2023 at 3:13 AM liuhongt  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Currently we have 3 different independent tunes for gather
> > > > > > > > "use_gather,use_gather_2parts,use_gather_4parts",
> > > > > > > > similar for scatter, there're
> > > > > > > > "use_scatter,use_scatter_2parts,use_scatter_4parts"
> > > > > > > >
> > > > > > > > The patch support 2 standardizing options to enable/disable
> > > > > > > > vectorization for all gather/scatter instructions. The options 
> > > > > > > > is
> > > > > > > > interpreted by driver to 3 tunes.
> > > > > > > >
> > > > > > > > bootstrapped and regtested on x86_64-pc-linux-gnu.
> > > > > > > > Ok for trunk?
> > > > > > >
> > > > > > > I think -mgather/-mscatter are too close to -mfma suggesting they
> > > > > > > enable part of an ISA but they won't disable the use of intrinsics
> > > > > > > or enable gather/scatter on CPUs where the ISA doesn't have them.
> > > > > > >
> > > > > > > May I suggest to invent a more generic "short-cut" to
> > > > > > > -mtune-ctrl=^X, maybe -mdisable=X?  And for gather/scatter
> > > > > > > tunables add ^use_gather_any to cover all cases?  (or
> > > > > > > change what use_gather controls - it seems we changed its
> > > > > > > meaning before, and instead add use_gather_8parts and
> > > > > > > use_gather_16parts)
> > > > > > >
> > > > > > > That is, what's the point of this?
> > The point of this is to keep consistent between GCC, LLVM, and
> > ICX(Intel® oneAPI DPC++/C++ Compiler) .
> > LLVM,ICX will support that option.
>
> GCC has very many options that are not the same as LLVM or ICX,
> I don't see a good reason to special case this one.  As said, it's
> a very bad name IMHO.
In general terms, yes.
But this is a new option, shouldn't it be better to be consistent?
And the problem with mfma is mainly that the cpuid is just called fma,
but we don't have a cpuid called gather/scatter, with clear document
that the option is only for auto-vectorization,
-m{no-,}{gather,scattter} looks fine to me.
As Honza mentioned, users need to option to turn on/off gather/scatter
auto vectorization, I don't think they will expect the option is also
valid for intrinsic.
If -mtune-crtl= is not suitable for direct exposure to usersusers,
then the original proposal should be ok?
Developers will manintain the relation between mgather/scatter and
-mtune-crtl=XXX to make it consistent between GCC versions.
>
> Richard.
>
> > > > > >
> > > > > > https://www.phoronix.com/review/downfall
> > > > > >
> > > > > > that caused:
> > > > > >
> > > > > > https://www.phoronix.com/review/intel-downfall-benchmarks
> > > > >
> > > > > Yes, I know.  But there's -mtune-ctl= doing the trick.
> > > > > GCC 11 had only 'use_gather', covering all number of lanes.  I suggest
> > > > > to resurrect that behavior and add use_gather_8+parts (or two, IIRC
> > > > > gather works only on SI/SFmode or larger).
> > > > >
> > > > > Then -mtune-ctl=^use_gather works which I think is nice enough?
> > > > So basically, -mtune-ctrl=^use_gather is used to turn off all gather
> > > > vectorization, but -mtune-ctrl=use_gather doesn't turn on all of them?
> > > > We don't have an extrat explicit flag for target tune, just single bit
> > > > - ix86_tune_features[X86_TUNE_USE_GATHER]
> > > Looks like I can handle it specially in parse_mtune_ctrl_str, let me try.
> > > > >
> > > > > Richard.
> > > > >
> > > > > > Uros.
> > > >
> > > >
> > > >
> > > > --
> > > > BR,
> > > > Hongtao
> > >
> > >
> > >
> > > --
> > > BR,
> > > Hongtao
> >
> >
> >
> > --
> > BR,
> > Hongtao



-- 
BR,
Hongtao

Re: [PATCH V2] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions

2023-08-13 Thread Hongtao Liu via Gcc-patches

On Fri, Aug 11, 2023 at 2:02 PM liuhongt via Gcc-patches
 wrote:
>
> Rename original use_gather to use_gather_8parts, Support
> -mtune-ctrl={,^}use_gather to set/clear tune features
> use_gather_{2parts, 4parts, 8parts}. Support the new option -mgather
> as alias of -mtune-ctrl=, use_gather, ^use_gather.
>
> Similar for use_scatter.
>
> How about this version?
I'll commit the patch if there's no objections in the next 24 hours.
>
> gcc/ChangeLog:
>
> * config/i386/i386-builtins.cc
> (ix86_vectorize_builtin_gather): Adjust for use_gather_8parts.
> * config/i386/i386-options.cc (parse_mtune_ctrl_str):
> Set/Clear tune features use_{gather,scatter}_{2parts, 4parts,
> 8parts} for -mtune-crtl={,^}{use_gather,use_scatter}.
> * config/i386/i386.cc (ix86_vectorize_builtin_scatter): Adjust
> for use_scatter_8parts
> * config/i386/i386.h (TARGET_USE_GATHER): Rename to ..
> (TARGET_USE_GATHER_8PARTS): .. this.
> (TARGET_USE_SCATTER): Rename to ..
> (TARGET_USE_SCATTER_8PARTS): .. this.
> * config/i386/x86-tune.def (X86_TUNE_USE_GATHER): Rename to
> (X86_TUNE_USE_GATHER_8PARTS): .. this.
> (X86_TUNE_USE_SCATTER): Rename to
> (X86_TUNE_USE_SCATTER_8PARTS): .. this.
> * config/i386/i386.opt: Add new options mgather, mscatter.
> ---
>  gcc/config/i386/i386-builtins.cc |  2 +-
>  gcc/config/i386/i386-options.cc  | 54 +++-
>  gcc/config/i386/i386.cc  |  2 +-
>  gcc/config/i386/i386.h   |  8 ++---
>  gcc/config/i386/i386.opt |  8 +
>  gcc/config/i386/x86-tune.def |  4 +--
>  6 files changed, 56 insertions(+), 22 deletions(-)
>
> diff --git a/gcc/config/i386/i386-builtins.cc 
> b/gcc/config/i386/i386-builtins.cc
> index 356b6dfd5fb..8a0b8dfe073 100644
> --- a/gcc/config/i386/i386-builtins.cc
> +++ b/gcc/config/i386/i386-builtins.cc
> @@ -1657,7 +1657,7 @@ ix86_vectorize_builtin_gather (const_tree mem_vectype,
>   ? !TARGET_USE_GATHER_2PARTS
>   : (known_eq (TYPE_VECTOR_SUBPARTS (mem_vectype), 4u)
>  ? !TARGET_USE_GATHER_4PARTS
> -: !TARGET_USE_GATHER)))
> +: !TARGET_USE_GATHER_8PARTS)))
>  return NULL_TREE;
>
>if ((TREE_CODE (index_type) != INTEGER_TYPE
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index 127ee24203c..b8d038af69d 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -1731,20 +1731,46 @@ parse_mtune_ctrl_str (struct gcc_options *opts, bool 
> dump)
>curr_feature_string++;
>clear = true;
>  }
> -  for (i = 0; i < X86_TUNE_LAST; i++)
> -{
> -  if (!strcmp (curr_feature_string, ix86_tune_feature_names[i]))
> -{
> -  ix86_tune_features[i] = !clear;
> -  if (dump)
> -fprintf (stderr, "Explicitly %s feature %s\n",
> - clear ? "clear" : "set", 
> ix86_tune_feature_names[i]);
> -  break;
> -}
> -}
> -  if (i == X86_TUNE_LAST)
> -   error ("unknown parameter to option %<-mtune-ctrl%>: %s",
> -  clear ? curr_feature_string - 1 : curr_feature_string);
> +
> +  if (!strcmp (curr_feature_string, "use_gather"))
> +   {
> + ix86_tune_features[X86_TUNE_USE_GATHER_2PARTS] = !clear;
> + ix86_tune_features[X86_TUNE_USE_GATHER_4PARTS] = !clear;
> + ix86_tune_features[X86_TUNE_USE_GATHER_8PARTS] = !clear;
> + if (dump)
> +   fprintf (stderr, "Explicitly %s features use_gather_2parts,"
> +" use_gather_4parts, use_gather_8parts\n",
> +clear ? "clear" : "set");
> +
> +   }
> +  else if (!strcmp (curr_feature_string, "use_scatter"))
> +   {
> + ix86_tune_features[X86_TUNE_USE_SCATTER_2PARTS] = !clear;
> + ix86_tune_features[X86_TUNE_USE_SCATTER_4PARTS] = !clear;
> + ix86_tune_features[X86_TUNE_USE_SCATTER_8PARTS] = !clear;
> + if (dump)
> +   fprintf (stderr, "Explicitly %s features use_scatter_2parts,"
> +" use_scatter_4parts, use_scatter_8parts\n",
> +clear ? "clear" : "set");
> +   }
> +  else
> +   {
> + for (i = 0; i < X86_TUNE_LAST; i++)
> +   {
> + if (!strcmp (curr_feature_string, ix86_tune_feature_names[i]))
> +   {
> + ix86_tune_features[i] = !clear;
> + if (dump)
> +   fprintf (stderr, "Explicitly %s feature %s\n",
> +clear ? "clear" : "set", 
> ix86_tune_feature_names[i]);
> + break;
> +   }
> +   }
> +
> + if (i == X86_TUNE_LAST)
> +   error ("unknown parameter to option %<-mtune-ctrl%>: %s",
> +  clear ? curr_feature_string - 1 : curr_feature_string);
> +

Re: [PATCH] Generate vmovapd instead of vmovsd for moving DFmode between SSE_REGS.

2023-08-13 Thread Hongtao Liu via Gcc-patches

cc

On Mon, Aug 14, 2023 at 10:46 AM liuhongt  wrote:
>
> vmovapd can enable register renaming and have same code size as
> vmovsd. Similar for vmovsh vs vmovaps, vmovaps is 1 byte less than
> vmovsh.
>
> When TARGET_AVX512VL is not available, still generate
> vmovsd/vmovss/vmovsh to avoid vmovapd/vmovaps zmm16-31.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> * config/i386/i386.md (movdf_internal): Generate vmovapd instead of
> vmovsd when moving DFmode between SSE_REGS.
> (movhi_internal): Generate vmovdqa instead of vmovsh when
> moving HImode between SSE_REGS.
> (mov_internal): Use vmovaps instead of vmovsh when
> moving HF/BFmode between SSE_REGS.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr89229-4a.c: Adjust testcase.
> ---
>  gcc/config/i386/i386.md| 20 +---
>  gcc/testsuite/gcc.target/i386/pr89229-4a.c |  4 +---
>  2 files changed, 18 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index c906d75b13e..77182e34fe1 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -2961,8 +2961,12 @@ (define_insn "*movhi_internal"
> ]
> (const_string "TI"))
> (eq_attr "alternative" "12")
> - (cond [(match_test "TARGET_AVX512FP16")
> + (cond [(match_test "TARGET_AVX512VL")
> +  (const_string "TI")
> +(match_test "TARGET_AVX512FP16")
>(const_string "HF")
> +(match_test "TARGET_AVX512F")
> +  (const_string "SF")
>  (match_test "TARGET_AVX")
>(const_string "TI")
>  (ior (not (match_test "TARGET_SSE2"))
> @@ -4099,8 +4103,12 @@ (define_insn "*movdf_internal"
>
>/* movaps is one byte shorter for non-AVX targets.  */
>(eq_attr "alternative" "13,17")
> -(cond [(match_test "TARGET_AVX")
> +(cond [(match_test "TARGET_AVX512VL")
> + (const_string "V2DF")
> +   (match_test "TARGET_AVX512F")
>   (const_string "DF")
> +   (match_test "TARGET_AVX")
> + (const_string "V2DF")
> (ior (not (match_test "TARGET_SSE2"))
>  (match_test "optimize_function_for_size_p 
> (cfun)"))
>   (const_string "V4SF")
> @@ -4380,8 +4388,14 @@ (define_insn "*mov_internal"
>(const_string "HI")
>(const_string "TI"))
>(eq_attr "alternative" "5")
> -(cond [(match_test "TARGET_AVX512FP16")
> +(cond [(match_test "TARGET_AVX512VL")
> +   (const_string "V4SF")
> +   (match_test "TARGET_AVX512FP16")
>   (const_string "HF")
> +   (match_test "TARGET_AVX512F")
> + (const_string "SF")
> +   (match_test "TARGET_AVX")
> + (const_string "V4SF")
> (ior (match_test "TARGET_SSE_PARTIAL_REG_DEPENDENCY")
>  (match_test "TARGET_SSE_SPLIT_REGS"))
>   (const_string "V4SF")
> diff --git a/gcc/testsuite/gcc.target/i386/pr89229-4a.c 
> b/gcc/testsuite/gcc.target/i386/pr89229-4a.c
> index 5bc10d25619..8869650b0ad 100644
> --- a/gcc/testsuite/gcc.target/i386/pr89229-4a.c
> +++ b/gcc/testsuite/gcc.target/i386/pr89229-4a.c
> @@ -1,4 +1,4 @@
> -/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-do assemble { target { ! ia32 } } } */
>  /* { dg-options "-O2 -march=skylake-avx512" } */
>
>  extern double d;
> @@ -12,5 +12,3 @@ foo1 (double x)
>asm volatile ("" : "+v" (xmm17));
>d = xmm17;
>  }
> -
> -/* { dg-final { scan-assembler-not "vmovapd" } } */
> --
> 2.31.1
>


-- 
BR,
Hongtao

Re: [PATCH 1/3] Initial support for AVX10.1

2023-08-15 Thread Hongtao Liu via Gcc-patches

On Tue, Aug 8, 2023 at 3:16 PM Haochen Jiang via Gcc-patches
 wrote:
>
> gcc/ChangeLog:
>
> * common/config/i386/cpuinfo.h (get_available_features):
> Add avx10_set and version and detect avx10.1.
> (cpu_indicator_init): Handle avx10.1-512.
> * common/config/i386/i386-common.cc
> (OPTION_MASK_ISA2_AVX10_512BIT_SET): New.
> (OPTION_MASK_ISA2_AVX10_1_SET): Ditto.
> (OPTION_MASK_ISA2_AVX10_512BIT_UNSET): Ditto.
> (OPTION_MASK_ISA2_AVX10_1_UNSET): Ditto.
> (OPTION_MASK_ISA2_AVX2_UNSET): Modify for AVX10_1.
> (ix86_handle_option): Handle -mavx10.1, -mavx10.1-256 and
> -mavx10.1-512.
> * common/config/i386/i386-cpuinfo.h (enum processor_features):
> Add FEATURE_AVX10_512BIT, FEATURE_AVX10_1 and
> FEATURE_AVX10_512BIT.
> * common/config/i386/i386-isas.h: Add ISA_NAME_TABLE_ENTRY for
> AVX10_512BIT, AVX10_1 and AVX10_1_512.
> * config/i386/constraints.md (Yk): Add AVX10_1.
> (Yv): Ditto.
> (k): Ditto.
> * config/i386/cpuid.h (bit_AVX10): New.
> (bit_AVX10_256): Ditto.
> (bit_AVX10_512): Ditto.
> * config/i386/i386-c.cc (ix86_target_macros_internal):
> Define AVX10_512BIT and AVX10_1.
> * config/i386/i386-isa.def
> (AVX10_512BIT): Add DEF_PTA(AVX10_512BIT).
> (AVX10_1): Add DEF_PTA(AVX10_1).
> * config/i386/i386-options.cc (isa2_opts): Add -mavx10.1.
> (ix86_valid_target_attribute_inner_p): Handle avx10-512bit, avx10.1
> and avx10.1-512.
> (ix86_option_override_internal): Enable AVX512{F,VL,BW,DQ,CD,BF16,
> FP16,VBMI,VBMI2,VNNI,IFMA,BITALG,VPOPCNTDQ} features for avx10.1-512.
> (ix86_valid_target_attribute_inner_p): Handle AVX10_1.
> * config/i386/i386.cc (ix86_get_ssemov): Add AVX10_1.
> (ix86_conditional_register_usage): Ditto.
> (ix86_hard_regno_mode_ok): Ditto.
> (ix86_rtx_costs): Ditto.
> * config/i386/i386.h (VALID_MASK_AVX10_MODE): New macro.
> * config/i386/i386.opt: Add option -mavx10.1, -mavx10.1-256 and
> -mavx10.1-512.
> * doc/extend.texi: Document avx10.1, avx10.1-256 and avx10.1-512.
> * doc/invoke.texi: Document -mavx10.1, -mavx10.1-256 and 
> -mavx10.1-512.
> * doc/sourcebuild.texi: Document target avx10.1, avx10.1-256
> and avx10.1-512.
>
> gcc/testsuite/ChangeLog:
>
> * g++.target/i386/mv33.C: New test.
> * gcc.target/i386/avx10_1-1.c: Ditto.
> * gcc.target/i386/avx10_1-2.c: Ditto.
> * gcc.target/i386/avx10_1-3.c: Ditto.
> * gcc.target/i386/avx10_1-4.c: Ditto.
> * gcc.target/i386/avx10_1-5.c: Ditto.
> * gcc.target/i386/avx10_1-6.c: Ditto.
> * gcc.target/i386/avx10_1-7.c: Ditto.
> * gcc.target/i386/avx10_1-8.c: Ditto.
> * gcc.target/i386/avx10_1-9.c: Ditto.
> * gcc.target/i386/avx10_1-10.c: Ditto.
Ok(please wait for extra 24 hours to commit, if there's no objection)
> ---
>  gcc/common/config/i386/cpuinfo.h   | 36 +++
>  gcc/common/config/i386/i386-common.cc  | 53 +-
>  gcc/common/config/i386/i386-cpuinfo.h  |  3 ++
>  gcc/common/config/i386/i386-isas.h |  5 ++
>  gcc/config/i386/constraints.md |  6 +--
>  gcc/config/i386/cpuid.h|  6 +++
>  gcc/config/i386/i386-c.cc  |  4 ++
>  gcc/config/i386/i386-isa.def   |  2 +
>  gcc/config/i386/i386-options.cc| 26 ++-
>  gcc/config/i386/i386.cc| 18 ++--
>  gcc/config/i386/i386.h |  3 ++
>  gcc/config/i386/i386.opt   | 19 
>  gcc/doc/extend.texi| 13 ++
>  gcc/doc/invoke.texi| 16 +--
>  gcc/doc/sourcebuild.texi   |  9 
>  gcc/testsuite/g++.target/i386/mv33.C   | 30 
>  gcc/testsuite/gcc.target/i386/avx10_1-1.c  | 22 +
>  gcc/testsuite/gcc.target/i386/avx10_1-10.c | 13 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-2.c  | 13 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-3.c  | 13 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-4.c  | 13 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-5.c  | 13 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-6.c  | 13 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-7.c  | 13 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-8.c  |  4 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-9.c  | 13 ++
>  26 files changed, 366 insertions(+), 13 deletions(-)
>  create mode 100644 gcc/testsuite/g++.target/i386/mv33.C
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-10.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-3.c
>  create mode 100644 gcc/testsuite/gcc.t

Re: [PATCH 2/3] Emit a warning when disabling AVX512 with AVX10 enabled or disabling AVX10 with AVX512 enabled

2023-08-15 Thread Hongtao Liu via Gcc-patches

On Tue, Aug 8, 2023 at 3:15 PM Haochen Jiang via Gcc-patches
 wrote:
>
> gcc/ChangeLog:
>
> * config/i386/driver-i386.cc (host_detect_local_cpu):
> Do not append -mno-avx10.1 for -march=native.
> * config/i386/i386-options.cc
> (ix86_check_avx10): New function to check isa_flags and
> isa_flags_explicit to emit warning when AVX10 is enabled
> by "-m" option.
> (ix86_check_avx512):  New function to check isa_flags and
> isa_flags_explicit to emit warning when AVX512 is enabled
> by "-m" option.
> (ix86_handle_option): Do not change the flags when warning
> is emitted.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx10_1-11.c: New test.
> * gcc.target/i386/avx10_1-12.c: Ditto.
> * gcc.target/i386/avx10_1-13.c: Ditto.
> * gcc.target/i386/avx10_1-14.c: Ditto.
Ok(please wait for extra 24 hours to commit, if there's no objection)
> ---
>  gcc/common/config/i386/i386-common.cc  | 68 +-
>  gcc/config/i386/driver-i386.cc |  2 +-
>  gcc/testsuite/gcc.target/i386/avx10_1-11.c |  5 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-12.c | 13 +
>  gcc/testsuite/gcc.target/i386/avx10_1-13.c |  5 ++
>  gcc/testsuite/gcc.target/i386/avx10_1-14.c | 13 +
>  6 files changed, 91 insertions(+), 15 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-11.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-12.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-13.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-14.c
>
> diff --git a/gcc/common/config/i386/i386-common.cc 
> b/gcc/common/config/i386/i386-common.cc
> index 6c3bebb1846..ec94251dd4c 100644
> --- a/gcc/common/config/i386/i386-common.cc
> +++ b/gcc/common/config/i386/i386-common.cc
> @@ -388,6 +388,46 @@ set_malign_value (const char **flag, unsigned value)
>*flag = r;
>  }
>
> +/* Emit a warning when using -mno-avx512{f,vl,bw,dq,cd,bf16,fp16,vbmi,vbmi2,
> +   vnni,ifma,bitalg,vpopcntdq} with -mavx10.1 and above.  */
> +static bool
> +ix86_check_avx10 (struct gcc_options *opts)
> +{
> +  if (opts->x_ix86_isa_flags2 & opts->x_ix86_isa_flags2_explicit
> +  & OPTION_MASK_ISA2_AVX10_1)
> +{
> +  warning (0, 
> "%<-mno-avx512{f,vl,bw,dq,cd,bf16,fp16,vbmi,vbmi2,vnni,ifma,"
> +  "bitalg,vpopcntdq}%> are ignored with %<-mavx10.1%> and 
> above");
> +  return false;
> +}
> +
> +  return true;
> +}
> +
> +/* Emit a warning when using -mno-avx10.1 with -mavx512{f,vl,bw,dq,cd,bf16,
> +   fp16,vbmi,vbmi2,vnni,ifma,bitalg,vpopcntdq}.  */
> +static bool
> +ix86_check_avx512 (struct gcc_options *opts)
> +{
> +  if ((opts->x_ix86_isa_flags & opts->x_ix86_isa_flags_explicit
> +   & (OPTION_MASK_ISA_AVX512F | OPTION_MASK_ISA_AVX512CD
> + | OPTION_MASK_ISA_AVX512DQ | OPTION_MASK_ISA_AVX512BW
> + | OPTION_MASK_ISA_AVX512VL | OPTION_MASK_ISA_AVX512IFMA
> + | OPTION_MASK_ISA_AVX512VBMI | OPTION_MASK_ISA_AVX512VBMI2
> + | OPTION_MASK_ISA_AVX512VNNI | OPTION_MASK_ISA_AVX512VPOPCNTDQ
> + | OPTION_MASK_ISA_AVX512BITALG))
> +  || (opts->x_ix86_isa_flags2 & opts->x_ix86_isa_flags2_explicit
> + & (OPTION_MASK_ISA2_AVX512FP16 | OPTION_MASK_ISA2_AVX512BF16)))
> +{
> +  warning (0, "%<-mno-avx10.1%> is ignored when using with "
> +  "%<-mavx512{f,vl,bw,dq,cd,bf16,fp16,vbmi,vbmi2,vnni,"
> +  "ifma,bitalg,vpopcntdq}%>");
> +  return false;
> +}
> +
> +  return true;
> +}
> +
>  /* Implement TARGET_HANDLE_OPTION.  */
>
>  bool
> @@ -609,7 +649,7 @@ ix86_handle_option (struct gcc_options *opts,
>   opts->x_ix86_isa_flags |= OPTION_MASK_ISA_AVX512F_SET;
>   opts->x_ix86_isa_flags_explicit |= OPTION_MASK_ISA_AVX512F_SET;
> }
> -  else
> +  else if (ix86_check_avx10 (opts))
> {
>   opts->x_ix86_isa_flags &= ~OPTION_MASK_ISA_AVX512F_UNSET;
>   opts->x_ix86_isa_flags_explicit |= OPTION_MASK_ISA_AVX512F_UNSET;
> @@ -624,7 +664,7 @@ ix86_handle_option (struct gcc_options *opts,
>   opts->x_ix86_isa_flags |= OPTION_MASK_ISA_AVX512CD_SET;
>   opts->x_ix86_isa_flags_explicit |= OPTION_MASK_ISA_AVX512CD_SET;
> }
> -  else
> +  else if (ix86_check_avx10 (opts))
> {
>   opts->x_ix86_isa_flags &= ~OPTION_MASK_ISA_AVX512CD_UNSET;
>   opts->x_ix86_isa_flags_explicit |= OPTION_MASK_ISA_AVX512CD_UNSET;
> @@ -898,7 +938,7 @@ ix86_handle_option (struct gcc_options *opts,
>   opts->x_ix86_isa_flags |= OPTION_MASK_ISA_AVX512VBMI2_SET;
>   opts->x_ix86_isa_flags_explicit |= OPTION_MASK_ISA_AVX512VBMI2_SET;
> }
> -  else
> +  else if (ix86_check_avx10 (opts))
> {
>   opts->x_ix86_isa_flags &= ~OPTION_MASK_ISA_AVX512VBMI2_UNSET;
>   opts->x_ix86_isa_flags_explicit |= 
> OPTION_MASK_ISA_AVX512VBMI2_UNSET;
> @@ -913,7 +953,7 @@

Re: [PATCH 3/3] Emit a warning when AVX10 options conflict in vector width

2023-08-15 Thread Hongtao Liu via Gcc-patches

On Tue, Aug 8, 2023 at 3:13 PM Haochen Jiang via Gcc-patches
 wrote:
>
> gcc/ChangeLog:
>
> * config/i386/driver-i386.cc (host_detect_local_cpu):
> Do not append -mno-avx10-max-512bit for -march=native.
> * common/config/i386/i386-common.cc
> (ix86_check_avx10_vector_width): New function to check isa_flags
> to emit a warning when there is a conflict in AVX10 options for
> vector width.
> (ix86_handle_option): Add check for avx10.1-256 and avx10.1-512.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx10_1-15.c: New test.
> * gcc.target/i386/avx10_1-16.c: Ditto.
> * gcc.target/i386/avx10_1-17.c: Ditto.
> * gcc.target/i386/avx10_1-18.c: Ditto.
> ---
Ok(please wait for extra 24 hours to commit, if there's no objection)
>  gcc/common/config/i386/i386-common.cc  | 20 
>  gcc/config/i386/driver-i386.cc |  3 ++-
>  gcc/config/i386/i386-options.cc|  2 +-
>  gcc/testsuite/gcc.target/i386/avx10_1-15.c |  5 +
>  gcc/testsuite/gcc.target/i386/avx10_1-16.c |  5 +
>  gcc/testsuite/gcc.target/i386/avx10_1-17.c | 13 +
>  gcc/testsuite/gcc.target/i386/avx10_1-18.c | 13 +
>  7 files changed, 59 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-15.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-16.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-17.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-18.c
>
> diff --git a/gcc/common/config/i386/i386-common.cc 
> b/gcc/common/config/i386/i386-common.cc
> index ec94251dd4c..db88befc9b8 100644
> --- a/gcc/common/config/i386/i386-common.cc
> +++ b/gcc/common/config/i386/i386-common.cc
> @@ -428,6 +428,24 @@ ix86_check_avx512 (struct gcc_options *opts)
>return true;
>  }
>
> +/* Emit a warning when there is a conflict vector width in AVX10 options.  */
> +static void
> +ix86_check_avx10_vector_width (struct gcc_options *opts, bool avx10_max_512)
> +{
> +  if (avx10_max_512)
> +{
> +  if (((opts->x_ix86_isa_flags2 | ~OPTION_MASK_ISA2_AVX10_512BIT)
> +  == ~OPTION_MASK_ISA2_AVX10_512BIT)
> + && (opts->x_ix86_isa_flags2_explicit & 
> OPTION_MASK_ISA2_AVX10_512BIT))
> +   warning (0, "The options used for AVX10 have conflict vector width, "
> +"using the latter 512 as vector width");
> +}
> +  else if (opts->x_ix86_isa_flags2 & opts->x_ix86_isa_flags2_explicit
> +  & OPTION_MASK_ISA2_AVX10_512BIT)
> +warning (0, "The options used for AVX10 have conflict vector width, "
> +"using the latter 256 as vector width");
> +}
> +
>  /* Implement TARGET_HANDLE_OPTION.  */
>
>  bool
> @@ -1415,6 +1433,7 @@ ix86_handle_option (struct gcc_options *opts,
>return true;
>
>  case OPT_mavx10_1_256:
> +  ix86_check_avx10_vector_width (opts, false);
>opts->x_ix86_isa_flags2 |= OPTION_MASK_ISA2_AVX10_1_SET;
>opts->x_ix86_isa_flags2_explicit |= OPTION_MASK_ISA2_AVX10_1_SET;
>opts->x_ix86_isa_flags2 &= ~OPTION_MASK_ISA2_AVX10_512BIT_SET;
> @@ -1424,6 +1443,7 @@ ix86_handle_option (struct gcc_options *opts,
>return true;
>
>  case OPT_mavx10_1_512:
> +  ix86_check_avx10_vector_width (opts, true);
>opts->x_ix86_isa_flags2 |= OPTION_MASK_ISA2_AVX10_1_SET;
>opts->x_ix86_isa_flags2_explicit |= OPTION_MASK_ISA2_AVX10_1_SET;
>opts->x_ix86_isa_flags2 |= OPTION_MASK_ISA2_AVX10_512BIT_SET;
> diff --git a/gcc/config/i386/driver-i386.cc b/gcc/config/i386/driver-i386.cc
> index 227ace6ff83..f4551a74e3a 100644
> --- a/gcc/config/i386/driver-i386.cc
> +++ b/gcc/config/i386/driver-i386.cc
> @@ -854,7 +854,8 @@ const char *host_detect_local_cpu (int argc, const char 
> **argv)
>   options = concat (options, " ",
> isa_names_table[i].option, NULL);
>   }
> -   else if (isa_names_table[i].feature != FEATURE_AVX10_1)
> +   else if ((isa_names_table[i].feature != FEATURE_AVX10_1)
> +&& (isa_names_table[i].feature != FEATURE_AVX10_512BIT))
>   options = concat (options, neg_option,
> isa_names_table[i].option + 2, NULL);
>   }
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index b2281fbd4b5..8f9b825b527 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -985,7 +985,7 @@ ix86_valid_target_attribute_inner_p (tree fndecl, tree 
> args, char *p_strings[],
>  ix86_opt_ix86_no,
>  ix86_opt_str,
>  ix86_opt_enum,
> -ix86_opt_isa,
> +ix86_opt_isa
>};
>
>static const struct
> diff --git a/gcc/testsuite/gcc.target/i386/avx10_1-15.c 
> b/gcc/testsuite/gcc.target/i386/avx10_1-15.c
> new file mode 100644
> index 000..fd873c9694c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/

Re: [PATCH 6/6] Support AVX10.1 for AVX512DQ+AVX512VL intrins

2023-08-15 Thread Hongtao Liu via Gcc-patches

On Tue, Aug 8, 2023 at 3:23 PM Haochen Jiang via Gcc-patches
 wrote:
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx10_1-vextractf64x2-1.c: New test.
> * gcc.target/i386/avx10_1-vextracti64x2-1.c: Ditto.
> * gcc.target/i386/avx10_1-vfpclasspd-1.c: Ditto.
> * gcc.target/i386/avx10_1-vfpclassps-1.c: Ditto.
> * gcc.target/i386/avx10_1-vinsertf64x2-1.c: Ditto.
> * gcc.target/i386/avx10_1-vinserti64x2-1.c: Ditto.
> * gcc.target/i386/avx10_1-vrangepd-1.c: Ditto.
> * gcc.target/i386/avx10_1-vrangeps-1.c: Ditto.
> * gcc.target/i386/avx10_1-vreducepd-1.c: Ditto.
> * gcc.target/i386/avx10_1-vreduceps-1.c: Ditto.
Ok for all 6 patches(please wait for extra 24 hours to commit, if
there's no objection).
> ---
>  .../gcc.target/i386/avx10_1-vextractf64x2-1.c | 18 
>  .../gcc.target/i386/avx10_1-vextracti64x2-1.c | 19 
>  .../gcc.target/i386/avx10_1-vfpclasspd-1.c| 21 ++
>  .../gcc.target/i386/avx10_1-vfpclassps-1.c| 21 ++
>  .../gcc.target/i386/avx10_1-vinsertf64x2-1.c  | 18 
>  .../gcc.target/i386/avx10_1-vinserti64x2-1.c  | 18 
>  .../gcc.target/i386/avx10_1-vrangepd-1.c  | 27 +
>  .../gcc.target/i386/avx10_1-vrangeps-1.c  | 27 +
>  .../gcc.target/i386/avx10_1-vreducepd-1.c | 29 +++
>  .../gcc.target/i386/avx10_1-vreduceps-1.c | 29 +++
>  10 files changed, 227 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vextractf64x2-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vextracti64x2-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vfpclasspd-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vfpclassps-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vinsertf64x2-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vinserti64x2-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vrangepd-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vrangeps-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vreducepd-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx10_1-vreduceps-1.c
>
> diff --git a/gcc/testsuite/gcc.target/i386/avx10_1-vextractf64x2-1.c 
> b/gcc/testsuite/gcc.target/i386/avx10_1-vextractf64x2-1.c
> new file mode 100644
> index 000..4c7e54dc198
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx10_1-vextractf64x2-1.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx10.1 -O2" } */
> +/* { dg-final { scan-assembler-times "vextractf64x2\[ 
> \\t\]+\[^\{\n\]*%ymm\[0-9\]+.{7}(?:\n|\[ \\t\]+#)"  1 } } */
> +/* { dg-final { scan-assembler-times "vextractf64x2\[ 
> \\t\]+\[^\{\n\]*%ymm\[0-9\]+.{7}\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)"  1 } } */
> +/* { dg-final { scan-assembler-times "vextractf64x2\[ 
> \\t\]+\[^\{\n\]*%ymm\[0-9\]+.{7}\{%k\[1-7\]\}(?:\n|\[ \\t\]+#)"  1 } } */
> +
> +#include 
> +
> +volatile __m256d x;
> +volatile __m128d y;
> +
> +void extern
> +avx10_1_test (void)
> +{
> +  y = _mm256_extractf64x2_pd (x, 1);
> +  y = _mm256_mask_extractf64x2_pd (y, 2, x, 1);
> +  y = _mm256_maskz_extractf64x2_pd (2, x, 1);
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/avx10_1-vextracti64x2-1.c 
> b/gcc/testsuite/gcc.target/i386/avx10_1-vextracti64x2-1.c
> new file mode 100644
> index 000..c0bd7700d52
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx10_1-vextracti64x2-1.c
> @@ -0,0 +1,19 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx10.1 -O2" } */
> +/* { dg-final { scan-assembler-times "vextracti64x2\[ 
> \\t\]+\[^\{\n\]*%ymm\[0-9\]+.{7}(?:\n|\[ \\t\]+#)"  1 } } */
> +/* { dg-final { scan-assembler-times "vextracti64x2\[ 
> \\t\]+\[^\{\n\]*%ymm\[0-9\]+.{7}\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)"  1 } } */
> +/* { dg-final { scan-assembler-times "vextracti64x2\[ 
> \\t\]+\[^\{\n\]*%ymm\[0-9\]+.{7}\{%k\[1-7\]\}(?:\n|\[ \\t\]+#)"  1 } } */
> +
> +#include 
> +
> +volatile __m256i x;
> +volatile __m128i y;
> +
> +void extern
> +avx10_1_test (void)
> +{
> +  y = _mm256_extracti64x2_epi64 (x, 1);
> +  y = _mm256_mask_extracti64x2_epi64 (y, 2, x, 1);
> +  y = _mm256_maskz_extracti64x2_epi64 (2, x, 1);
> +}
> +
> diff --git a/gcc/testsuite/gcc.target/i386/avx10_1-vfpclasspd-1.c 
> b/gcc/testsuite/gcc.target/i386/avx10_1-vfpclasspd-1.c
> new file mode 100644
> index 000..806ba800023
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx10_1-vfpclasspd-1.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx10.1 -O2" } */
> +/* { dg-final { scan-assembler-times "vfpclasspdy\[ 
> \\t\]+\[^\{\n\]*%ymm\[0-9\]+\[^\n^k\]*%k\[0-7\](?:\n|\[ \\t\]+#)" 1 } } */
> +/* { dg-final { scan-assembler-times "vfpclasspdx\[ 
> \\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n^k\]*%k\[0-7\](?:\n|\[ \\t\]+#)" 1 } } */
> +/* { dg-final { scan-assembler-times "vfpclasspdy\[ 
> \\t\]+\[^\{\n\]*%y

Re: [PATCH] Software mitigation: Disable gather generation in vectorization for GDS affected Intel Processors.

2023-08-16 Thread Hongtao Liu via Gcc-patches

On Fri, Aug 11, 2023 at 8:38 AM liuhongt  wrote:
>
> For more details of GDS (Gather Data Sampling), refer to
> https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/gather-data-sampling.html
>
> After microcode update, there's performance regression. To avoid that,
> the patch disables gather generation in autovectorization but uses
> gather scalar emulation instead.
>
> Ready push to trunk and backport.
> any comments?
Pushed to trunk and backport to releases/gcc-{11,12,13}.
>
> gcc/ChangeLog:
>
> * config/i386/i386-options.cc (m_GDS): New macro.
> * config/i386/x86-tune.def (X86_TUNE_USE_GATHER_2PARTS): Don't
> enable for m_GDS.
> (X86_TUNE_USE_GATHER_4PARTS): Ditto.
> (X86_TUNE_USE_GATHER): Ditto.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx2-gather-2.c: Adjust options to keep
> gather vectorization.
> * gcc.target/i386/avx2-gather-6.c: Ditto.
> * gcc.target/i386/avx512f-pr88464-1.c: Ditto.
> * gcc.target/i386/avx512f-pr88464-5.c: Ditto.
> * gcc.target/i386/avx512vl-pr88464-1.c: Ditto.
> * gcc.target/i386/avx512vl-pr88464-11.c: Ditto.
> * gcc.target/i386/avx512vl-pr88464-3.c: Ditto.
> * gcc.target/i386/avx512vl-pr88464-9.c: Ditto.
> * gcc.target/i386/pr88531-1b.c: Ditto.
> * gcc.target/i386/pr88531-1c.c: Ditto.
> ---
>  gcc/config/i386/i386-options.cc | 5 +
>  gcc/config/i386/x86-tune.def| 6 +++---
>  gcc/testsuite/gcc.target/i386/avx2-gather-2.c   | 2 +-
>  gcc/testsuite/gcc.target/i386/avx2-gather-6.c   | 2 +-
>  gcc/testsuite/gcc.target/i386/avx512f-pr88464-1.c   | 2 +-
>  gcc/testsuite/gcc.target/i386/avx512f-pr88464-5.c   | 2 +-
>  gcc/testsuite/gcc.target/i386/avx512vl-pr88464-1.c  | 2 +-
>  gcc/testsuite/gcc.target/i386/avx512vl-pr88464-11.c | 2 +-
>  gcc/testsuite/gcc.target/i386/avx512vl-pr88464-3.c  | 2 +-
>  gcc/testsuite/gcc.target/i386/avx512vl-pr88464-9.c  | 2 +-
>  gcc/testsuite/gcc.target/i386/pr88531-1b.c  | 2 +-
>  gcc/testsuite/gcc.target/i386/pr88531-1c.c  | 2 +-
>  12 files changed, 18 insertions(+), 13 deletions(-)
>
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index 127ee24203c..e6ba33c370d 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -141,6 +141,11 @@ along with GCC; see the file COPYING3.  If not see
>  #define m_ARROWLAKE (HOST_WIDE_INT_1U<  #define m_CORE_ATOM (m_SIERRAFOREST | m_GRANDRIDGE)
>  #define m_INTEL (HOST_WIDE_INT_1U< +/* Gather Data Sampling / CVE-2022-40982 / INTEL-SA-00828.
> +   Software mitigation.  */
> +#define m_GDS (m_SKYLAKE | m_SKYLAKE_AVX512 | m_CANNONLAKE \
> +  | m_ICELAKE_CLIENT | m_ICELAKE_SERVER | m_CASCADELAKE \
> +  | m_TIGERLAKE | m_COOPERLAKE | m_ROCKETLAKE)
>
>  #define m_LUJIAZUI (HOST_WIDE_INT_1U<
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 40e04ecddbf..22d26bb0030 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -491,7 +491,7 @@ DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, 
> "avoid_4byte_prefixes",
> elements.  */
>  DEF_TUNE (X86_TUNE_USE_GATHER_2PARTS, "use_gather_2parts",
>   ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4 | m_ALDERLAKE
> -   | m_ARROWLAKE | m_CORE_ATOM | m_GENERIC))
> +   | m_ARROWLAKE | m_CORE_ATOM | m_GENERIC | m_GDS))
>
>  /* X86_TUNE_USE_SCATTER_2PARTS: Use scater instructions for vectors with 2
> elements.  */
> @@ -502,7 +502,7 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_2PARTS, 
> "use_scatter_2parts",
> elements.  */
>  DEF_TUNE (X86_TUNE_USE_GATHER_4PARTS, "use_gather_4parts",
>   ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4 | m_ALDERLAKE
> -   | m_ARROWLAKE | m_CORE_ATOM | m_GENERIC))
> +   | m_ARROWLAKE | m_CORE_ATOM | m_GENERIC | m_GDS))
>
>  /* X86_TUNE_USE_SCATTER_4PARTS: Use scater instructions for vectors with 4
> elements.  */
> @@ -513,7 +513,7 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_4PARTS, 
> "use_scatter_4parts",
> elements.  */
>  DEF_TUNE (X86_TUNE_USE_GATHER, "use_gather",
>   ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER4 | m_ALDERLAKE | m_ARROWLAKE
> -   | m_CORE_ATOM | m_GENERIC))
> +   | m_CORE_ATOM | m_GENERIC | m_GDS))
>
>  /* X86_TUNE_USE_SCATTER: Use scater instructions for vectors with 8 or more
> elements.  */
> diff --git a/gcc/testsuite/gcc.target/i386/avx2-gather-2.c 
> b/gcc/testsuite/gcc.target/i386/avx2-gather-2.c
> index ad5ef73107c..978924b0f57 100644
> --- a/gcc/testsuite/gcc.target/i386/avx2-gather-2.c
> +++ b/gcc/testsuite/gcc.target/i386/avx2-gather-2.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O3 -fdump-tree-vect-details -march=skylake" } */
> +/* { dg-options "-O3 -fdump-tree-vect-details -march=skylake -mtune=haswell" 
> } */
>
>  #include "avx2

Re: [PATCH V2] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions

2023-08-16 Thread Hongtao Liu via Gcc-patches

On Mon, Aug 14, 2023 at 10:40 AM Hongtao Liu  wrote:
>
> On Fri, Aug 11, 2023 at 2:02 PM liuhongt via Gcc-patches
>  wrote:
> >
> > Rename original use_gather to use_gather_8parts, Support
> > -mtune-ctrl={,^}use_gather to set/clear tune features
> > use_gather_{2parts, 4parts, 8parts}. Support the new option -mgather
> > as alias of -mtune-ctrl=, use_gather, ^use_gather.
> >
> > Similar for use_scatter.
> >
> > How about this version?
> I'll commit the patch if there's no objections in the next 24 hours.
Pushed to trunk and backport to release/gcc-{13,12,11}.
Note for GCC11, The backport patch only supports -m{no,}gather since
the branch doesn't have scatter tunings.
For GCC12/GCC13. both -m{no,}gather/scatter are supported.
> >
> > gcc/ChangeLog:
> >
> > * config/i386/i386-builtins.cc
> > (ix86_vectorize_builtin_gather): Adjust for use_gather_8parts.
> > * config/i386/i386-options.cc (parse_mtune_ctrl_str):
> > Set/Clear tune features use_{gather,scatter}_{2parts, 4parts,
> > 8parts} for -mtune-crtl={,^}{use_gather,use_scatter}.
> > * config/i386/i386.cc (ix86_vectorize_builtin_scatter): Adjust
> > for use_scatter_8parts
> > * config/i386/i386.h (TARGET_USE_GATHER): Rename to ..
> > (TARGET_USE_GATHER_8PARTS): .. this.
> > (TARGET_USE_SCATTER): Rename to ..
> > (TARGET_USE_SCATTER_8PARTS): .. this.
> > * config/i386/x86-tune.def (X86_TUNE_USE_GATHER): Rename to
> > (X86_TUNE_USE_GATHER_8PARTS): .. this.
> > (X86_TUNE_USE_SCATTER): Rename to
> > (X86_TUNE_USE_SCATTER_8PARTS): .. this.
> > * config/i386/i386.opt: Add new options mgather, mscatter.
> > ---
> >  gcc/config/i386/i386-builtins.cc |  2 +-
> >  gcc/config/i386/i386-options.cc  | 54 +++-
> >  gcc/config/i386/i386.cc  |  2 +-
> >  gcc/config/i386/i386.h   |  8 ++---
> >  gcc/config/i386/i386.opt |  8 +
> >  gcc/config/i386/x86-tune.def |  4 +--
> >  6 files changed, 56 insertions(+), 22 deletions(-)
> >
> > diff --git a/gcc/config/i386/i386-builtins.cc 
> > b/gcc/config/i386/i386-builtins.cc
> > index 356b6dfd5fb..8a0b8dfe073 100644
> > --- a/gcc/config/i386/i386-builtins.cc
> > +++ b/gcc/config/i386/i386-builtins.cc
> > @@ -1657,7 +1657,7 @@ ix86_vectorize_builtin_gather (const_tree mem_vectype,
> >   ? !TARGET_USE_GATHER_2PARTS
> >   : (known_eq (TYPE_VECTOR_SUBPARTS (mem_vectype), 4u)
> >  ? !TARGET_USE_GATHER_4PARTS
> > -: !TARGET_USE_GATHER)))
> > +: !TARGET_USE_GATHER_8PARTS)))
> >  return NULL_TREE;
> >
> >if ((TREE_CODE (index_type) != INTEGER_TYPE
> > diff --git a/gcc/config/i386/i386-options.cc 
> > b/gcc/config/i386/i386-options.cc
> > index 127ee24203c..b8d038af69d 100644
> > --- a/gcc/config/i386/i386-options.cc
> > +++ b/gcc/config/i386/i386-options.cc
> > @@ -1731,20 +1731,46 @@ parse_mtune_ctrl_str (struct gcc_options *opts, 
> > bool dump)
> >curr_feature_string++;
> >clear = true;
> >  }
> > -  for (i = 0; i < X86_TUNE_LAST; i++)
> > -{
> > -  if (!strcmp (curr_feature_string, ix86_tune_feature_names[i]))
> > -{
> > -  ix86_tune_features[i] = !clear;
> > -  if (dump)
> > -fprintf (stderr, "Explicitly %s feature %s\n",
> > - clear ? "clear" : "set", 
> > ix86_tune_feature_names[i]);
> > -  break;
> > -}
> > -}
> > -  if (i == X86_TUNE_LAST)
> > -   error ("unknown parameter to option %<-mtune-ctrl%>: %s",
> > -  clear ? curr_feature_string - 1 : curr_feature_string);
> > +
> > +  if (!strcmp (curr_feature_string, "use_gather"))
> > +   {
> > + ix86_tune_features[X86_TUNE_USE_GATHER_2PARTS] = !clear;
> > + ix86_tune_features[X86_TUNE_USE_GATHER_4PARTS] = !clear;
> > + ix86_tune_features[X86_TUNE_USE_GATHER_8PARTS] = !clear;
> > + if (dump)
> > +   fprintf (stderr, "Explicitly %s features use_gather_2parts,"
> > +" use_gather_4parts, use_gather_8parts\n",
> > +clear ? "clear" : "set");
> > +
> > +   }
> > +  else if (!strcmp (curr_feature_string, "use_scatter"))
> > +   {
> > + ix86_tune_features[X86_TUNE_USE_SCATTER_2PARTS] = !clear;
> > + ix86_tune_features[X86_TUNE_USE_SCATTER_4PARTS] = !clear;
> > + ix86_tune_features[X86_TUNE_USE_SCATTER_8PARTS] = !clear;
> > + if (dump)
> > +   fprintf (stderr, "Explicitly %s features use_scatter_2parts,"
> > +" use_scatter_4parts, use_scatter_8parts\n",
> > +clear ? "clear" : "set");
> > +   }
> > +  else
> > +   {
> > + for (i = 0; i < X86_TUNE_LAST; i++)
> > +   {
> > + if (!strcmp (curr_feature_string, ix86_tune_feature_names[i]))
> > +   {
>

Re: [PATCH] i386: Add AVX2 pragma wrapper for AVX512DQVL intrins

2023-08-17 Thread Hongtao Liu via Gcc-patches

On Fri, Aug 18, 2023 at 2:01 PM Haochen Jiang via Gcc-patches
 wrote:
>
> Hi all,
>
> This patch aims to fix PR111051, which actually make sure that AVX2
> intrins are visible to AVX512/AVX10 intrins under any circumstances.
>
> I will also apply the same fix on AVX512DQ scalar intrins.
>
> Regtested on on x86_64-pc-linux-gnu. Ok for trunk?
Ok.
>
> Thx,
> Haochen
>
> PR target/111051
>
> gcc/ChangeLog:
>
> * config/i386/avx512vldqintrin.h: Push AVX2 when AVX2 is
> disabled.
>
> gcc/testsuite/ChangeLog:
>
> PR target/111051
> * gcc.target/i386/pr111051-1.c: New test.
> ---
>  gcc/config/i386/avx512vldqintrin.h | 11 +++
>  gcc/testsuite/gcc.target/i386/pr111051-1.c | 11 +++
>  2 files changed, 22 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr111051-1.c
>
> diff --git a/gcc/config/i386/avx512vldqintrin.h 
> b/gcc/config/i386/avx512vldqintrin.h
> index 1fbf93a0b52..db900ebf467 100644
> --- a/gcc/config/i386/avx512vldqintrin.h
> +++ b/gcc/config/i386/avx512vldqintrin.h
> @@ -28,6 +28,12 @@
>  #ifndef _AVX512VLDQINTRIN_H_INCLUDED
>  #define _AVX512VLDQINTRIN_H_INCLUDED
>
> +#if !defined(__AVX2__)
> +#pragma GCC push_options
> +#pragma GCC target("avx2")
> +#define __DISABLE_AVX2__
> +#endif /* __AVX2__ */
> +
>  extern __inline __m256i
>  __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
>  _mm256_cvttpd_epi64 (__m256d __A)
> @@ -2002,4 +2008,9 @@ _mm256_maskz_insertf64x2 (__mmask8 __U, __m256d __A, 
> __m128d __B,
>
>  #endif
>
> +#ifdef __DISABLE_AVX2__
> +#undef __DISABLE_AVX2__
> +#pragma GCC pop_options
> +#endif /* __DISABLE_AVX2__ */
> +
>  #endif /* _AVX512VLDQINTRIN_H_INCLUDED */
> diff --git a/gcc/testsuite/gcc.target/i386/pr111051-1.c 
> b/gcc/testsuite/gcc.target/i386/pr111051-1.c
> new file mode 100644
> index 000..973007043cb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr111051-1.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +
> +#include 
> +
> +#pragma GCC target("avx512vl,avx512dq")
> +
> +void foo (__m256i i)
> +{
> +  volatile __m256d v1 = _mm256_cvtepi64_pd (i);
> +}
> +
> --
> 2.31.1
>


-- 
BR,
Hongtao

Re: Intel AVX10.1 Compiler Design and Support

2023-08-20 Thread Hongtao Liu via Gcc-patches

On Sun, Aug 20, 2023 at 6:44 AM ZiNgA BuRgA via Gcc-patches
 wrote:
>
> Hi,
>
> With the proposed design of these switches, how would I restrict AVX10.1
> to particular AVX-512 subsets?
We can't, avx10.1 is taken as an indivisible ISA which contains all
AVX512 related instructions.

> We’ve been taking these cases as bugs (but yes, intrinsics are still allowed, 
> so in some cases it might prove difficult to guarantee this).
intel sde support avx10.1-256 target which can be used to validate the
binary(if there's invalid 512-bit vector register or 64-bit kmask
register is used).
> I don’t see any other way of doing what you want within the constraints of 
> this design.
It looks like the requirement is that we want a
-mavx10-vector-width=256(or maybe reuse -mprefer-vector-width=256)
option that acts on the original -mavx512XXX option to produce
avx10.1-256 compatible binary. we can't use -mavx10.1-256 since it may
include avx512fp16 directives and thus not be backward compatible
SKX/CLX/ICX.
>
> For example, usage of the |_mm256_rol_epi32| intrinsic should be
> compatible on any AVX10/256 implementation, /as well as /any AVX-512VL
> without AVX10 implementation (e.g. Skylake-X).  But how do I signal that
> I want compatibility with both these targets?
>
>   * |-mavx512vl| lets the compiler use 512-bit registers -> incompatible
> with 256-bit AVX10.
>   * |-mavx512vl -mprefer-vector-width=256| might steer the compiler away
> from 512-bit registers, but I don't think it guarantees it.
>   * |-mavx10.1-256| lets the compiler use all Sapphire Rapids AVX-512
> features at 256-bit wide (so in theory, it could choose to compile
> it with |vpshldd|) -> incompatible with Skylake-X.
>   * |-mavx10.1-256 -mno-avx512fp16 -mno-avx512...| will emit a warning
> and ignore the attempts at disabling AVX-512 subsets.
>   * |-mavx10.1-256 -mavx512vl| takes the /union/ of the features, not
> the /intersection./
>
> Is there something like |-mavx512vl -mmax-vector-width=256|, or am I
> misunderstanding the situation?
>
> Thanks!



-- 
BR,
Hongtao

Re: Intel AVX10.1 Compiler Design and Support

2023-08-21 Thread Hongtao Liu via Gcc-patches

On Mon, Aug 21, 2023 at 4:09 PM Jakub Jelinek  wrote:
>
> On Mon, Aug 21, 2023 at 09:36:16AM +0200, Richard Biener via Gcc-patches 
> wrote:
> > > On Sun, Aug 20, 2023 at 6:44 AM ZiNgA BuRgA via Gcc-patches
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > With the proposed design of these switches, how would I restrict AVX10.1
> > > > to particular AVX-512 subsets?
> > > We can't, avx10.1 is taken as an indivisible ISA which contains all
> > > AVX512 related instructions.
> > >
> > > > We’ve been taking these cases as bugs (but yes, intrinsics are still 
> > > > allowed, so in some cases it might prove difficult to guarantee this).
> > > intel sde support avx10.1-256 target which can be used to validate the
> > > binary(if there's invalid 512-bit vector register or 64-bit kmask
> > > register is used).
> > > > I don’t see any other way of doing what you want within the constraints 
> > > > of this design.
> > > It looks like the requirement is that we want a
> > > -mavx10-vector-width=256(or maybe reuse -mprefer-vector-width=256)
> > > option that acts on the original -mavx512XXX option to produce
> > > avx10.1-256 compatible binary. we can't use -mavx10.1-256 since it may
> > > include avx512fp16 directives and thus not be backward compatible
> > > SKX/CLX/ICX.
> >
> > Yes.  Note we cannot really re-purpose -mprefer-vector-width=256 since that
> > would also make uses of 512bit intrinsics ill-formed.  So we'd need a new
> > flag that would restrict AVX512VL to 256bit, possibly using a common 
> > internal
> > flag for this and the -mavx10.1-256 vector size effect.
> >
> > Maybe -mdisable-vector-width-512 or -mavx512vl-for-avx10.1-256 or
> > -mavx512vl-256?  Writing these the last looks most sensible to me?
> > Note it should combine with -mavx512vl to -mavx512vl-256 to make
> > -march=native -mavx512vl-256 work (I think we should also allow the
> > flag together with -mavx10.1*?)
> >
> > mavx512vl-256
> > Target ...
> > Disable the 512bit vector ISA subset of AVX512 or AVX10, enable
> > the 256bit vector ISA subset of AVX512.
>
> Wouldn't it be better to have it similarly to other ISA options as something
> positive, say -mevex512 (the ISA docs talk about EVEX.512, EVEX.256 and
> EVEX.128)?
> Have -mavx512f (and anything that implies it right now) imply also -mevex512
> but allow -mno-evex512 which wouldn't unset everything dependent on
> -mavx512f.  There is one gotcha, if -mavx512vl isn't enabled in the end,
> then -mavx512f -mno-evex512 should disable whole TARGET_AVX512F because
> nothing is left.
> TARGET_EVEX512 then would guard all TARGET_AVX512* intrinsics which operate
> on 512-bit vector registers or 64-bit mask registers (in addition to the
> other TARGET_AVX512* options, perhaps except TARGET_AVX512F), whether the
> 512-bit modes can be used etc.
We have an undocumented option mavx10-max-512bit.

1314;; Only for implementation use
1315mavx10-max-512bit
1316Target Mask(ISA2_AVX10_512BIT) Var(ix86_isa_flags2) Undocumented Save
1317Indicates 512 bit vector width support for AVX10.

Currently it's only used for AVX10 only, maybe we can extend it to
existing AVX512*** FLAGS.
so users can use -mavx512XXX -mno-avx10-max-512bit to get avx10.1-256
compatible binaries.

>From the implementation perspective, we need to restrict all 512-bit
vector patterns/builtins/intrinsics under both AVX512XXX and
TARGET_AVX10_512BIT.
similar for register allocation, parameter passing, return value,
vector_mode_supported_p, gather/scatter hook, and all other hooks.
After that, the -mavx10-max-512bit will divide existing AVX512 into 2
parts, AVX512XXX-256, AVX512XXX-512.


>
> Jakub
>


-- 
BR,
Hongtao

Re: Intel AVX10.1 Compiler Design and Support

2023-08-21 Thread Hongtao Liu via Gcc-patches

On Mon, Aug 21, 2023 at 4:38 PM Jakub Jelinek  wrote:
>
> On Mon, Aug 21, 2023 at 04:28:20PM +0800, Hongtao Liu wrote:
> > We have an undocumented option mavx10-max-512bit.
>
> How it is called internally is one thing, but it is weird to use
> avx10 in an option name which would be meant for finding common subset
> of -mavx512xxx and -mavx10.1-256.
We can have an alias for the name, but internally use the same bit
since they're doing the same thing.
And the option is somewhat orthogonal to  AVX512XXX/AVX10, it only
care about vector/kmask size.
>
> Jakub
>


-- 
BR,
Hongtao

Re: Intel AVX10.1 Compiler Design and Support

2023-08-21 Thread Hongtao Liu via Gcc-patches

On Mon, Aug 21, 2023 at 5:35 PM Richard Biener
 wrote:
>
> On Mon, Aug 21, 2023 at 10:28 AM Hongtao Liu  wrote:
> >
> > On Mon, Aug 21, 2023 at 4:09 PM Jakub Jelinek  wrote:
> > >
> > > On Mon, Aug 21, 2023 at 09:36:16AM +0200, Richard Biener via Gcc-patches 
> > > wrote:
> > > > > On Sun, Aug 20, 2023 at 6:44 AM ZiNgA BuRgA via Gcc-patches
> > > > >  wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > With the proposed design of these switches, how would I restrict 
> > > > > > AVX10.1
> > > > > > to particular AVX-512 subsets?
> > > > > We can't, avx10.1 is taken as an indivisible ISA which contains all
> > > > > AVX512 related instructions.
> > > > >
> > > > > > We’ve been taking these cases as bugs (but yes, intrinsics are 
> > > > > > still allowed, so in some cases it might prove difficult to 
> > > > > > guarantee this).
> > > > > intel sde support avx10.1-256 target which can be used to validate the
> > > > > binary(if there's invalid 512-bit vector register or 64-bit kmask
> > > > > register is used).
> > > > > > I don’t see any other way of doing what you want within the 
> > > > > > constraints of this design.
> > > > > It looks like the requirement is that we want a
> > > > > -mavx10-vector-width=256(or maybe reuse -mprefer-vector-width=256)
> > > > > option that acts on the original -mavx512XXX option to produce
> > > > > avx10.1-256 compatible binary. we can't use -mavx10.1-256 since it may
> > > > > include avx512fp16 directives and thus not be backward compatible
> > > > > SKX/CLX/ICX.
> > > >
> > > > Yes.  Note we cannot really re-purpose -mprefer-vector-width=256 since 
> > > > that
> > > > would also make uses of 512bit intrinsics ill-formed.  So we'd need a 
> > > > new
> > > > flag that would restrict AVX512VL to 256bit, possibly using a common 
> > > > internal
> > > > flag for this and the -mavx10.1-256 vector size effect.
> > > >
> > > > Maybe -mdisable-vector-width-512 or -mavx512vl-for-avx10.1-256 or
> > > > -mavx512vl-256?  Writing these the last looks most sensible to me?
> > > > Note it should combine with -mavx512vl to -mavx512vl-256 to make
> > > > -march=native -mavx512vl-256 work (I think we should also allow the
> > > > flag together with -mavx10.1*?)
> > > >
> > > > mavx512vl-256
> > > > Target ...
> > > > Disable the 512bit vector ISA subset of AVX512 or AVX10, enable
> > > > the 256bit vector ISA subset of AVX512.
> > >
> > > Wouldn't it be better to have it similarly to other ISA options as 
> > > something
> > > positive, say -mevex512 (the ISA docs talk about EVEX.512, EVEX.256 and
> > > EVEX.128)?
> > > Have -mavx512f (and anything that implies it right now) imply also 
> > > -mevex512
> > > but allow -mno-evex512 which wouldn't unset everything dependent on
> > > -mavx512f.  There is one gotcha, if -mavx512vl isn't enabled in the end,
> > > then -mavx512f -mno-evex512 should disable whole TARGET_AVX512F because
> > > nothing is left.
> > > TARGET_EVEX512 then would guard all TARGET_AVX512* intrinsics which 
> > > operate
> > > on 512-bit vector registers or 64-bit mask registers (in addition to the
> > > other TARGET_AVX512* options, perhaps except TARGET_AVX512F), whether the
> > > 512-bit modes can be used etc.
> > We have an undocumented option mavx10-max-512bit.
> >
> > 1314;; Only for implementation use
> > 1315mavx10-max-512bit
> > 1316Target Mask(ISA2_AVX10_512BIT) Var(ix86_isa_flags2) Undocumented Save
> > 1317Indicates 512 bit vector width support for AVX10.
>
> Ah, missed that, but ...
>
> > Currently it's only used for AVX10 only, maybe we can extend it to
> > existing AVX512*** FLAGS.
> > so users can use -mavx512XXX -mno-avx10-max-512bit to get avx10.1-256
> > compatible binaries.
>
> ... -mno-avx10-max-512bit sounds awkward, no-..-max implies the max doesn't
> apply, so what is it then?
>
> If you think -mavx512vl-256 isn't good then maybe -mavx-width-512
> and -mno-avx-width-512 would be better (applying to both avx512 and avx10).
> I chose -mavx512vl-256 because of the existing -mavx10.1-256.  Btw,
> will we then have -mavx10.2-256 as well?  Do we allow -mavx10.1-512
> -mavx10.2-256 then, thus just enable 256bit for 10.2 extensions to 10.1?!
We're only allowing a single vector width.
-mavx10.1-512 mavx10.2-256 will only enable -mavx10.2-256 + -mavx10.1-256.
> I think we opened up too many holes here and the options should be fixed
> to decouple the size from the base ISA.
I see, we can try to use -mavx-max-512bit(maybe another name) to
decouple the size from the base ISA.
And make
 -mavx10.1-256 just implies all -mavx512XXX + -mno-avx-max-512bit,
 -mavx10.1-512 implies -mavx512XXX + mavx-max-512bit.
then -mavx512vl-256 is just equal to -mavx512vl + mno-avx-max-512bit.

Lots of work to do, but still not too late for GCC14.1
>
> What variable we map this to internally doesn't really matter but yes,
> we'd need to guard 512bit patterns with (AVX512VL || AVX10) && 
> 512-enabled-flag
>
> Richard.
>
> > From the implementation perspective, we need

Re: [PATCH] Fix FAIL: gcc.target/i386/pr87007-5.c

2023-08-21 Thread Hongtao Liu via Gcc-patches

On Mon, Aug 21, 2023 at 8:25 PM Richard Biener via Gcc-patches
 wrote:
>
> The following fixes the gcc.target/i386/pr87007-5.c testcase which
> changed code generation again after the recent sinking improvements.
> We now have
>
> vxorps  %xmm0, %xmm0, %xmm0
> vsqrtsd d2(%rip), %xmm0, %xmm0
>
> and an unnecessary xor again in one case, the other vsqrtsd has
> a register source and a properly zeroing load:
>
> vmovsd  d3(%rip), %xmm0
> testl   %esi, %esi
> jg  .L11
> .L3:
> vsqrtsd %xmm0, %xmm0, %xmm0
>
> the following patch XFAILs the scan.  I'm not sure what's at
> fault here, there are no loops in the CFG, but somehow
> r84:DF=sqrt(['d2']) gets a pxor but r84:DF=sqrt(r83:DF)
> doesn't.  I guess I don't really understand what
> remove_partial_avx_dependency is supposed to do so can't
> really assess whether the pxor is necessary or not.
There's a false dependency on xmm0 when the source operand in the
pattern is memory, the pattern only takes xmm0 as dest, but the output
instruction takes xmm0 also as input(the second source operand),
that's why we need an pxor here.
When the source operand in the pattern is register_operand, we can
reuse the register_operand for the second source operand. The
instructions here are not very obvious, the more representative one
should be vsqrtsd %xmm1, %xmm1(rused one), %xmm0.
>
> OK?
Can we add -fno-XXX to disable the optimization to make the assembly
more stable?
Or current codegen should be optimal(for the sinking), then Ok for the patch.

>
> * gcc.target/i386/pr87007-5.c: Update comment, XFAIL
> subtest.
> ---
>  gcc/testsuite/gcc.target/i386/pr87007-5.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/i386/pr87007-5.c 
> b/gcc/testsuite/gcc.target/i386/pr87007-5.c
> index a6cdf11522e..5902616d1f1 100644
> --- a/gcc/testsuite/gcc.target/i386/pr87007-5.c
> +++ b/gcc/testsuite/gcc.target/i386/pr87007-5.c
> @@ -1,6 +1,8 @@
>  /* { dg-do compile } */
>  /* { dg-options "-Ofast -march=skylake-avx512 -mfpmath=sse 
> -fno-tree-vectorize -fdump-tree-cddce3-details -fdump-tree-lsplit-optimized" 
> } */
> -/* Load of d2/d3 is hoisted out, vrndscalesd will reuse loades register to 
> avoid partial dependence.  */
> +/* Load of d2/d3 is hoisted out, the loop is split, store of d1 and sqrt
> +   are sunk out of the loop and the loop is elided.  One vsqrtsd with
> +   memory operand will need a xor to avoid partial dependence.  */
>
>  #include
>
> @@ -17,4 +19,4 @@ foo (int n, int k)
>
>  /* { dg-final { scan-tree-dump "optimized: loop split" "lsplit" } } */
>  /* { dg-final { scan-tree-dump-times "removing loop" 2 "cddce3" } } */
> -/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "vxorps\[^\n\r\]*xmm\[0-9\]" 1 } } */
> --
> 2.35.3



-- 
BR,
Hongtao

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 971 matches

Mail list logo