Re: [PATCH 00/18] Support -mevex512 for AVX512

2023-10-06 Thread Hongtao Liu
On Thu, Sep 28, 2023 at 11:23 AM ZiNgA BuRgA  wrote:
>
> That sounds about right.  The code I had in mind would perhaps look like:
>
>
> #if defined(__AVX512BW__) && defined(__AVX512VL__)
>  #if defined(__EVEX256__) && !defined(__EVEX512__)
>  // compiled code is AVX10.1/256 and AVX512 compatible
>  #else
>  // compiled code is only AVX512 compatible
>  #endif
>
>  // some code which only uses 256b instructions
>  __m256i...
> #endif
>
>
> The '__EVEX256__' define would avoid needing to check compiler versions.
Sounds reasonable. Regarding how to set __EVEX256__, I think it should
be set/unset along with __AVX512VL__, and __EVEX512__ should not unset
__EVEX256__.
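
To make the macro layering discussed in this thread concrete, here is a
minimal sketch of how user code might pick a maximum EVEX width. Note
that __EVEX256__ is only a proposal at this point in the thread, and the
helper name max_evex_width_bits is made up for the example.

```c
/* Hypothetical helper: widest EVEX vector width the translation unit
   may use, in bits, or 0 when AVX512BW+VL is not enabled.
   __EVEX256__ is the macro proposed in this thread, not a confirmed
   compiler feature.  */
static int
max_evex_width_bits (void)
{
#if defined(__AVX512BW__) && defined(__AVX512VL__)
# if defined(__EVEX512__)
  return 512;	/* 512-bit EVEX explicitly available.  */
# elif defined(__EVEX256__)
  return 256;	/* Compiler knows the __EVEX*__ macros, 512-bit is off.  */
# else
  return 512;	/* Older compiler: AVX512 implies 512-bit support.  */
# endif
#else
  return 0;
#endif
}
```

The last #else branch is the case Lin raised below: without
__EVEX256__, an undefined __EVEX512__ cannot be told apart from an
older compiler that never defines either macro.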

> Hopefully you can align it with whatever Clang does:
> https://discourse.llvm.org/t/rfc-design-for-avx10-feature-support/72661/18

>
> Thanks!
>
> On 28/09/2023 12:26 pm, Hu, Lin1 wrote:
> > Hi,
> >
> > Thanks for your reply.
> >
> > I'd like to verify that our understanding of your requirements is correct, 
> > and that __EVEX256__ can be considered a default macro to determine whether 
> > the compiler supports the __EVEX***__ series of switches.
> >
> > For example:
> >
> > I have a segment of code like:
> > #if defined(__EVEX512__)
> > __mm512.*__;
> > #else
> > __mm256.*__;
> > #endif
> >
> > But __EVEX512__ being undefined doesn't mean I only need 256-bit; maybe I 
> > use gcc-13, in which case I can still use 512-bit.
> >
> > So the code should be:
> > #if defined(__EVEX512__)
> > __mm512.*__;
> > #elif defined(__EVEX256__)
> > __mm256.*__;
> > #else
> > __mm512.*__;
> > #endif
> >
> > If we understand correctly, we'll consider the request. But since we're 
> > about to have a vacation, follow-up replies may be a bit slower.
> >
> > BRs,
> > Lin
> >
> > -----Original Message-----
> > From: ZiNgA BuRgA 
> > Sent: Thursday, September 28, 2023 8:32 AM
> > To: Hu, Lin1 ; gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH 00/18] Support -mevex512 for AVX512
> >
> > Thanks for the new patch!
> >
> > I see that there's a new __EVEX512__ define.  Will there be some 
> > __EVEX256__ (or maybe some max EVEX width) define, so that code can detect 
> > whether the compiler supports AVX10.1/256 without resorting to version 
> > checks?
> >
> >
>


-- 
BR,
Hongtao


Re: [PATCH 03/13] [APX_EGPR] Initial support for APX_F

2023-10-06 Thread Hongtao Liu
On Fri, Sep 22, 2023 at 6:58 PM Hongyu Wang  wrote:
>
> From: Kong Lingling 
>
> Add -mapx-features= enumeration to separate subfeatures of APX_F.
> -mapxf is treated same as previous ISA flag, while it sets
> -mapx-features=apx_all that enables all subfeatures.
Ok for this and the rest of the patches (04-13).
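
For readers following the cpuinfo.h part of this patch, the runtime
detection reduces to two checks: the OS must have enabled the APX state
component in XCR0, and CPUID must report the feature. A minimal sketch,
assuming XSTATE_APX_F is XCR0 bit 19 (0x80000); the function names are
made up, and the CPUID bit is passed in as a parameter rather than
hard-coded:

```c
#include <stdint.h>

/* Assumption: APX extended state is XCR0 bit 19.  */
#define XSTATE_APX_F		0x80000u
#define XCR_APX_F_ENABLED_MASK	XSTATE_APX_F

/* Nonzero when the OS has enabled APX state saving in XCR0.  */
static int
apx_state_enabled (uint32_t xcr0_low)
{
  return (xcr0_low & XCR_APX_F_ENABLED_MASK) == XCR_APX_F_ENABLED_MASK;
}

/* Nonzero when APX_F can be reported: the state must be enabled by the
   OS and the CPUID feature bit (APX_F_BIT, supplied by the caller)
   must be set in CPUID_EDX.  */
static int
apx_f_available (uint32_t xcr0_low, uint32_t cpuid_edx, uint32_t apx_f_bit)
{
  return apx_state_enabled (xcr0_low) && (cpuid_edx & apx_f_bit) != 0;
}
```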
>
> gcc/ChangeLog:
>
> * common/config/i386/cpuinfo.h (XSTATE_APX_F): New macro.
> (XCR_APX_F_ENABLED_MASK): Likewise.
> (get_available_features): Detect APX_F under
> * common/config/i386/i386-common.cc (OPTION_MASK_ISA2_APX_F_SET): New.
> (OPTION_MASK_ISA2_APX_F_UNSET): Likewise.
> (ix86_handle_option): Handle -mapxf.
> * common/config/i386/i386-cpuinfo.h (FEATURE_APX_F): New.
> * common/config/i386/i386-isas.h: Add entry for APX_F.
> * config/i386/cpuid.h (bit_APX_F): New.
> * config/i386/i386.h (bit_APX_F): (TARGET_APX_EGPR,
> TARGET_APX_PUSH2POP2, TARGET_APX_NDD): New define.
> * config/i386/i386-opts.h (enum apx_features): New enum.
> * config/i386/i386-isa.def (APX_F): New DEF_PTA.
> * config/i386/i386-options.cc (ix86_function_specific_save):
> Save ix86_apx_features.
> (ix86_function_specific_restore): Restore it.
> (ix86_valid_target_attribute_inner_p): Add mapxf.
> (ix86_option_override_internal): Set ix86_apx_features for PTA
> and TARGET_APX_F. Also reports error when APX_F is set but not
> having TARGET_64BIT.
> * config/i386/i386.opt: (-mapxf): New ISA flag option.
> (-mapx=): New enumeration option.
> (apx_features): New enum type.
> (apx_none): New enum value.
> (apx_egpr): Likewise.
> (apx_push2pop2): Likewise.
> (apx_ndd): Likewise.
> (apx_all): Likewise.
> * doc/invoke.texi: Document mapxf.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/apx-1.c: New test.
>
> Co-authored-by: Hongyu Wang 
> Co-authored-by: Hongtao Liu 
> ---
>  gcc/common/config/i386/cpuinfo.h  | 12 +++-
>  gcc/common/config/i386/i386-common.cc | 17 +
>  gcc/common/config/i386/i386-cpuinfo.h |  1 +
>  gcc/common/config/i386/i386-isas.h|  1 +
>  gcc/config/i386/cpuid.h   |  1 +
>  gcc/config/i386/i386-isa.def  |  1 +
>  gcc/config/i386/i386-options.cc   | 18 ++
>  gcc/config/i386/i386-opts.h   |  8 
>  gcc/config/i386/i386.h|  4 
>  gcc/config/i386/i386.opt  | 25 +
>  gcc/doc/invoke.texi   | 11 +++
>  gcc/testsuite/gcc.target/i386/apx-1.c |  8 
>  12 files changed, 102 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-1.c
>
> diff --git a/gcc/common/config/i386/cpuinfo.h 
> b/gcc/common/config/i386/cpuinfo.h
> index 24ae0dbf0ac..141d3743316 100644
> --- a/gcc/common/config/i386/cpuinfo.h
> +++ b/gcc/common/config/i386/cpuinfo.h
> @@ -678,6 +678,7 @@ get_available_features (struct __processor_model 
> *cpu_model,
>  #define XSTATE_HI_ZMM  0x80
>  #define XSTATE_TILECFG 0x20000
>  #define XSTATE_TILEDATA 0x40000
> +#define XSTATE_APX_F   0x80000
>
>  #define XCR_AVX_ENABLED_MASK \
>(XSTATE_SSE | XSTATE_YMM)
> @@ -685,11 +686,13 @@ get_available_features (struct __processor_model 
> *cpu_model,
>(XSTATE_SSE | XSTATE_YMM | XSTATE_OPMASK | XSTATE_ZMM | XSTATE_HI_ZMM)
>  #define XCR_AMX_ENABLED_MASK \
>(XSTATE_TILECFG | XSTATE_TILEDATA)
> +#define XCR_APX_F_ENABLED_MASK XSTATE_APX_F
>
> -  /* Check if AVX and AVX512 are usable.  */
> +  /* Check if AVX, AVX512 and APX are usable.  */
>int avx_usable = 0;
>int avx512_usable = 0;
>int amx_usable = 0;
> +  int apx_usable = 0;
>/* Check if KL is usable.  */
>int has_kl = 0;
>if ((ecx & bit_OSXSAVE))
> @@ -709,6 +712,8 @@ get_available_features (struct __processor_model 
> *cpu_model,
> }
>amx_usable = ((xcrlow & XCR_AMX_ENABLED_MASK)
> == XCR_AMX_ENABLED_MASK);
> +  apx_usable = ((xcrlow & XCR_APX_F_ENABLED_MASK)
> +   == XCR_APX_F_ENABLED_MASK);
>  }
>
>  #define set_feature(f) \
> @@ -922,6 +927,11 @@ get_available_features (struct __processor_model 
> *cpu_model,
>   if (edx & bit_AMX_COMPLEX)
> set_feature (FEATURE_AMX_COMPLEX);
> }
> + if (apx_usable)
> +   {
> + if (edx & bit_APX_F)
> +   set_feature (FEATURE_APX_F);
> +   }
>  

Re: [PATCH] [i386] APX EGPR: fix missing pattern that prohibits egpr

2023-10-08 Thread Hongtao Liu
On Mon, Oct 9, 2023 at 10:05 AM Hongyu Wang  wrote:
>
> For vec_concatv2di, m constraint in alternative 0 and 1 could result in
> egpr allocated on operand 2 under -mapxf. Should use jm instead.
>
> Bootstrapped/regtested on x86-64-linux-gnu.
>
> Ok for trunk?
Ok.
>
> gcc/ChangeLog:
>
> * config/i386/sse.md (vec_concatv2di): Replace constraint "m"
> with "jm" for alternative 0 and 1 of operand 2.
> ---
>  gcc/config/i386/sse.md | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 6bffd749c6d..58672f46365 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -20638,7 +20638,7 @@ (define_insn "vec_concatv2di"
>   (match_operand:DI 1 "register_operand"
>   "  0, 0,x ,Yv,0,Yv,0,0,v")
>   (match_operand:DI 2 "nonimmediate_operand"
> - " jrm,jrm,rm,rm,x,Yv,x,m,m")))]
> + " jrjm,jrjm,rm,rm,x,Yv,x,m,m")))]
>"TARGET_SSE"
>"@
> pinsrq\t{$1, %2, %0|%0, %2, 1}
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH] [APX] Support Intel APX PUSH2POP2

2023-10-11 Thread Hongtao Liu
On Tue, Oct 10, 2023 at 2:51 PM Hongyu Wang  wrote:
>
> From: "Mo, Zewei" 
>
> Hi,
>
> Intel APX PUSH2POP2 feature has been released in [1].
>
> This feature requires stack to be aligned at 16byte, therefore in
> prologue/epilogue, a standalone push/pop will be emitted before any
> push2/pop2 if the stack was not aligned to 16byte.
> Also for current implementation we only support push2/pop2 usage in
> function prologue/epilogue for those callee-saved registers.
>
> Bootstrapped/regtested on x86-64-pc-linux-gnu{-m32,} and sde.
>
> OK for master?
Ok. What remains to be optimized is the save and restore of the
caller-saved registers for ipa-ra; let's leave that to GCC 15.
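
As a rough illustration of the alignment rule described in the patch
(one standalone push first when the stack is not 16-byte aligned, then
pairs), here is a toy model of the save sequence. It is not the code in
i386.cc: the struct and function names are made up, it assumes 8-byte
pushes, and it keeps only the (nregs + aligned) >= 3 test from the
quoted helper, ignoring the target, save_regs_using_mov and function
type conditions.

```c
#include <stdbool.h>

/* Hypothetical summary of a prologue save sequence.  */
struct save_plan
{
  int single_pushes;	/* Number of plain push instructions.  */
  int push2_pairs;	/* Number of push2 instructions.  */
};

/* Simplified form of the profitability test in the patch.  */
static bool
worth_using_push2 (int nregs, bool sp_aligned16)
{
  return nregs + (sp_aligned16 ? 1 : 0) >= 3;
}

/* Plan how NREGS callee-saved registers are pushed.  */
static struct save_plan
plan_saves (int nregs, bool sp_aligned16)
{
  struct save_plan plan = { 0, 0 };

  if (!worth_using_push2 (nregs, sp_aligned16))
    {
      plan.single_pushes = nregs;
      return plan;
    }

  if (!sp_aligned16)
    {
      /* One 8-byte push brings the stack to a 16-byte boundary.  */
      plan.single_pushes++;
      nregs--;
    }

  plan.push2_pairs = nregs / 2;
  plan.single_pushes += nregs % 2;	/* Odd register left over.  */
  return plan;
}
```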
>
> [1].https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html.
>
> gcc/ChangeLog:
>
> * config/i386/i386.cc (gen_push2): New function to emit push2
> and adjust cfa offset.
> (ix86_use_push2_pop2): New function to determine whether
> push2/pop2 can be used.
> (ix86_compute_frame_layout): Adjust preferred stack boundary
> and stack alignment needed for push2/pop2.
> (ix86_emit_save_regs): Emit push2 when available.
> (ix86_emit_restore_reg_using_pop2): New function to emit pop2
> and adjust cfa info.
> (ix86_emit_restore_regs_using_pop2): New function to loop
> through the saved regs and call above.
> (ix86_expand_epilogue): Call ix86_emit_restore_regs_using_pop2
> when push2pop2 available.
> * config/i386/i386.md (push2_di): New pattern for push2.
> (pop2_di): Likewise for pop2.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/apx-push2pop2-1.c: New test.
> * gcc.target/i386/apx-push2pop2_force_drap-1.c: Likewise.
> * gcc.target/i386/apx-push2pop2_interrupt-1.c: Likewise.
>
> Co-authored-by: Hu Lin1 
> Co-authored-by: Hongyu Wang 
> ---
>  gcc/config/i386/i386.cc   | 252 --
>  gcc/config/i386/i386.md   |  26 ++
>  .../gcc.target/i386/apx-push2pop2-1.c |  45 
>  .../i386/apx-push2pop2_force_drap-1.c |  29 ++
>  .../i386/apx-push2pop2_interrupt-1.c  |  28 ++
>  5 files changed, 365 insertions(+), 15 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-push2pop2-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-push2pop2_force_drap-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-push2pop2_interrupt-1.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 6244f64a619..8251b67e2d6 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -6473,6 +6473,26 @@ gen_pop (rtx arg)
>  stack_pointer_rtx)));
>  }
>
> +/* Generate a "push2" pattern for input ARG.  */
> +rtx
> +gen_push2 (rtx mem, rtx reg1, rtx reg2)
> +{
> +  struct machine_function *m = cfun->machine;
> +  const int offset = UNITS_PER_WORD * 2;
> +
> +  if (m->fs.cfa_reg == stack_pointer_rtx)
> +m->fs.cfa_offset += offset;
> +  m->fs.sp_offset += offset;
> +
> +  if (REG_P (reg1) && GET_MODE (reg1) != word_mode)
> +reg1 = gen_rtx_REG (word_mode, REGNO (reg1));
> +
> +  if (REG_P (reg2) && GET_MODE (reg2) != word_mode)
> +reg2 = gen_rtx_REG (word_mode, REGNO (reg2));
> +
> +  return gen_push2_di (mem, reg1, reg2);
> +}
> +
>  /* Return >= 0 if there is an unused call-clobbered register available
> for the entire function.  */
>
> @@ -6714,6 +6734,18 @@ get_probe_interval (void)
>
>  #define SPLIT_STACK_AVAILABLE 256
>
> +/* Helper function to determine whether push2/pop2 can be used in prologue or
> +   epilogue for register save/restore.  */
> +static bool
> +ix86_pro_and_epilogue_can_use_push2pop2 (int nregs)
> +{
> +  int aligned = cfun->machine->fs.sp_offset % 16 == 0;
> +  return TARGET_APX_PUSH2POP2
> +&& !cfun->machine->frame.save_regs_using_mov
> +&& cfun->machine->func_type == TYPE_NORMAL
> +&& (nregs + aligned) >= 3;
> +}
> +
>  /* Fill structure ix86_frame about frame of currently computed function.  */
>
>  static void
> @@ -6771,16 +6803,20 @@ ix86_compute_frame_layout (void)
>
>   Darwin's ABI specifies 128b alignment for both 32 and  64 bit variants
>   at call sites, including profile function calls.
> - */
> -  if (((TARGET_64BIT_MS_ABI || TARGET_MACHO)
> -&& crtl->preferred_stack_boundary < 128)
> -  && (!crtl->is_leaf || cfun->calls_alloca != 0
> - || ix86_current_function_calls_tls_descriptor
> - || (TARGET_MACHO && crtl->profile)
> - || ix86_incoming_stack_boundary < 128))
> +
> + For APX push2/pop2, the stack also requires 128b alignment.  */
> +  if ((ix86_pro_and_epilogue_can_use_push2pop2 (frame->nregs)
> +   && crtl->preferred_stack_boundary < 128)
> +  || (((TARGET_64BIT_MS_ABI || TARGET_MACHO)
> +  && crtl->preferred_stack_boundary < 128)
> +

Re: [PATCH] Disparage slightly for the alternative which move DFmode between SSE_REGS and GENERAL_REGS.

2023-10-12 Thread Hongtao Liu
On Thu, Jul 6, 2023 at 1:53 PM Uros Bizjak via Gcc-patches
 wrote:
>
> On Thu, Jul 6, 2023 at 3:14 AM liuhongt  wrote:
> >
> > For testcase
> >
> > void __cond_swap(double* __x, double* __y) {
> >   bool __r = (*__x < *__y);
> >   auto __tmp = __r ? *__x : *__y;
> >   *__y = __r ? *__y : *__x;
> >   *__x = __tmp;
> > }
> >
> > GCC-14 with -O2 and -march=x86-64 options generates the following code:
> >
> > __cond_swap(double*, double*):
> > movsd   xmm1, QWORD PTR [rdi]
> > movsd   xmm0, QWORD PTR [rsi]
> > comisd  xmm0, xmm1
> > jbe .L2
> > movqrax, xmm1
> > movapd  xmm1, xmm0
> > movqxmm0, rax
> > .L2:
> > movsd   QWORD PTR [rsi], xmm1
> > movsd   QWORD PTR [rdi], xmm0
> > ret
> >
> > rax is used to save and restore DFmode value. In RA both GENERAL_REGS
> > and SSE_REGS cost zero since we didn't disparage the
> > alternative in movdf_internal pattern, according to register
> > allocation order, GENERAL_REGS is allocated. The patch add ? for
> > alternative (r,v) and (v,r) just like we did for movsf/hf/bf_internal
> > pattern, after that we get optimal RA.
> >
> > __cond_swap:
> > .LFB0:
> > .cfi_startproc
> > movsd   (%rdi), %xmm1
> > movsd   (%rsi), %xmm0
> > comisd  %xmm1, %xmm0
> > jbe .L2
> > movapd  %xmm1, %xmm2
> > movapd  %xmm0, %xmm1
> > movapd  %xmm2, %xmm0
> > .L2:
> > movsd   %xmm1, (%rsi)
> > movsd   %xmm0, (%rdi)
> > ret
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}
> > Ok for trunk?
> >
> >
> > gcc/ChangeLog:
> >
> > PR target/110170
> > * config/i386/i386.md (movdf_internal): Disparage slightly for
> > 2 alternatives (r,v) and (v,r) by adding constraint modifier
> > '?'.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/pr110170-3.c: New test.
>
> OK.
Some users report the same issue in unixbench; it looks like a common
issue when swapping 2 double variables.
So I'd like to backport this patch to GCC13/GCC12/GCC11; the fix
should be generally good and low risk.
Any comments?
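
For anyone who wants to check a backport locally, the test case boils
down to the function below (renamed to cond_swap here to avoid reserved
identifiers). Compiling it with -O2 -S and looking for a movq between
an xmm register and a general register is essentially what the dg-final
scan in the patch does. The function itself only orders the two values:

```c
#include <stdbool.h>

/* Same body as the test case in the patch: afterwards *x holds the
   smaller value and *y the larger one (for non-NaN inputs).  */
static void
cond_swap (double *x, double *y)
{
  bool r = (*x < *y);
  double tmp = r ? *x : *y;
  *y = r ? *y : *x;
  *x = tmp;
}
```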

>
> Thanks,
> Uros.
>
> > ---
> >  gcc/config/i386/i386.md|  4 ++--
> >  gcc/testsuite/gcc.target/i386/pr110170-3.c | 11 +++
> >  2 files changed, 13 insertions(+), 2 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr110170-3.c
> >
> > diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> > index a82cc353cfd..e47ced1bb70 100644
> > --- a/gcc/config/i386/i386.md
> > +++ b/gcc/config/i386/i386.md
> > @@ -3915,9 +3915,9 @@ (define_split
> >  ;; Possible store forwarding (partial memory) stall in alternatives 4, 6 
> > and 7.
> >  (define_insn "*movdf_internal"
> >[(set (match_operand:DF 0 "nonimmediate_operand"
> > -"=Yf*f,m   ,Yf*f,?r ,!o,?*r ,!o,!o,?r,?m,?r,?r,v,v,v,m,*x,*x,*x,m ,r 
> > ,v,r  ,o ,r  ,m")
> > +"=Yf*f,m   ,Yf*f,?r ,!o,?*r ,!o,!o,?r,?m,?r,?r,v,v,v,m,*x,*x,*x,m 
> > ,?r,?v,r  ,o ,r  ,m")
> > (match_operand:DF 1 "general_operand"
> > -"Yf*fm,Yf*f,G   ,roF,r ,*roF,*r,F ,rm,rC,C ,F ,C,v,m,v,C ,*x,m ,*x,v,r 
> > ,roF,rF,rmF,rC"))]
> > +"Yf*fm,Yf*f,G   ,roF,r ,*roF,*r,F ,rm,rC,C ,F ,C,v,m,v,C ,*x,m ,*x, v, 
> > r,roF,rF,rmF,rC"))]
> >"!(MEM_P (operands[0]) && MEM_P (operands[1]))
> > && (lra_in_progress || reload_completed
> > || !CONST_DOUBLE_P (operands[1])
> > diff --git a/gcc/testsuite/gcc.target/i386/pr110170-3.c 
> > b/gcc/testsuite/gcc.target/i386/pr110170-3.c
> > new file mode 100644
> > index 00000000000..70daa89e9aa
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr110170-3.c
> > @@ -0,0 +1,11 @@
> > +/* { dg-do compile { target { ! ia32 } } } */
> > +/* { dg-options "-O2 -fno-if-conversion -fno-if-conversion2" } */
> > +/* { dg-final { scan-assembler-not {(?n)movq.*r} } } */
> > +
> > +void __cond_swap(double* __x, double* __y) {
> > +  _Bool __r = (*__x < *__y);
> > +  double __tmp = __r ? *__x : *__y;
> > +  *__y = __r ? *__y : *__x;
> > +  *__x = __tmp;
> > +}
> > +
> > --
> > 2.39.1.388.g2fc9e9ca3c
> >



--
BR,
Hongtao


Re: [PATCH 0/3] Add Intel new cpu archs

2023-10-17 Thread Hongtao Liu
On Mon, Oct 16, 2023 at 2:25 PM Haochen Jiang  wrote:
>
> Hi all,
>
> The patches aim to add new cpu archs Clear Water Forest and
> Panther Lake. Here comes the documentation:
>
> https://cdrdv2.intel.com/v1/dl/getContent/671368
>
> Also in the patches, I refactored how we detect cpu according to features
> and added m_CORE_ATOM.
>
> Regtested on x86_64-pc-linux-gnu. Ok for trunk?
Ok, please also update https://gcc.gnu.org/gcc-14/changes.html with
your patches and USER_MSR.
>
> Thx,
> Haochen
>
>



--
BR,
Hongtao


Re: [PATCH] Avoid compile time hog on vect_peel_nonlinear_iv_init for nonlinear induction vec_step_op_mul when iteration count is too big. There's loop in vect_peel_nonlinear_iv_init to

2023-10-18 Thread Hongtao Liu
On Wed, Oct 18, 2023 at 4:33 PM liuhongt  wrote:
>
Cut from subject...
There's a loop in vect_peel_nonlinear_iv_init to compute init_expr * pow
(step_expr, skip_niters). When skip_niters is too big, it becomes a
compile-time hog. To avoid that, optimize init_expr * pow (step_expr,
skip_niters) to init_expr << (exact_log2 (step_expr) * skip_niters) when
step_expr is a power of 2; otherwise give up vectorization when
skip_niters >= TYPE_PRECISION (TREE_TYPE (init_expr)).
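
The arithmetic behind this can be sketched outside the compiler. The
code below is only an illustration on 32-bit unsigned values, not the
tree-level code in tree-vect-loop.cc; the function names are made up.
It shows the O(skip) reference loop next to the shift form that applies
when the step is a power of 2.

```c
#include <stdint.h>

/* Return log2 (X) if X is a power of two, otherwise -1.  */
static int
exact_log2_u32 (uint32_t x)
{
  if (x == 0 || (x & (x - 1)) != 0)
    return -1;
  int n = 0;
  while ((x >>= 1) != 0)
    n++;
  return n;
}

/* Reference: INIT * STEP^SKIP modulo 2^32 by repeated multiplication.
   This is the loop whose run time grows with SKIP.  */
static uint32_t
peel_init_by_loop (uint32_t init, uint32_t step, uint64_t skip)
{
  for (uint64_t i = 0; i < skip; i++)
    init *= step;
  return init;
}

/* Shift form.  Returns 1 and stores the result in *OUT when STEP is a
   power of two, returns 0 (caller must fall back or give up) otherwise.  */
static int
peel_init_by_shift (uint32_t init, uint32_t step, uint64_t skip,
		    uint32_t *out)
{
  int log2 = exact_log2_u32 (step);
  if (log2 < 0)
    return 0;

  if (log2 == 0)
    *out = init;		/* STEP == 1: value never changes.  */
  else if (skip >= 32)
    *out = 0;			/* Total shift >= 32 clears all bits.  */
  else
    {
      /* SKIP < 32 and LOG2 <= 31, so the product cannot overflow.
	 Shifting by >= 32 is undefined in C, so handle it by hand.  */
      uint64_t shift = (uint64_t) log2 * skip;
      *out = shift >= 32 ? 0 : init << shift;
    }
  return 1;
}
```

For example, with init = 3, step = 4 and skip = 5 both forms give
3 << 10 = 3072, but the shift form does not iterate over skip.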

> Also give up vectorization when niters_skip is negative which will be
> used for fully masked loop.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> PR tree-optimization/111820
> PR tree-optimization/111833
> * tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p): Give
> up vectorization for nonlinear iv vect_step_op_mul when
> step_expr is not exact_log2 and niters is greater than
> TYPE_PRECISION (TREE_TYPE (step_expr)). Also don't vectorize
> for negative niters_skip which will be used by fully masked
> loop.
> (vect_can_advance_ivs_p): Pass whole phi_info to
> vect_can_peel_nonlinear_iv_p.
> * tree-vect-loop.cc (vect_peel_nonlinear_iv_init): Optimize
> init_expr * pow (step_expr, skipn) to init_expr
> << (log2 (step_expr) * skipn) when step_expr is exact_log2.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr111820-1.c: New test.
> * gcc.target/i386/pr111820-2.c: New test.
> * gcc.target/i386/pr103144-mul-1.c: Adjust testcase.
> ---
>  .../gcc.target/i386/pr103144-mul-1.c  |  6 ++--
>  gcc/testsuite/gcc.target/i386/pr111820-1.c| 16 ++
>  gcc/testsuite/gcc.target/i386/pr111820-2.c| 17 ++
>  gcc/tree-vect-loop-manip.cc   | 28 ++--
>  gcc/tree-vect-loop.cc | 32 ---
>  5 files changed, 88 insertions(+), 11 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr111820-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr111820-2.c
>
> diff --git a/gcc/testsuite/gcc.target/i386/pr103144-mul-1.c 
> b/gcc/testsuite/gcc.target/i386/pr103144-mul-1.c
> index 640c34fd959..f80d1094097 100644
> --- a/gcc/testsuite/gcc.target/i386/pr103144-mul-1.c
> +++ b/gcc/testsuite/gcc.target/i386/pr103144-mul-1.c
> @@ -23,7 +23,7 @@ foo_mul_const (int* a)
>for (int i = 0; i != N; i++)
>  {
>a[i] = b;
> -  b *= 3;
> +  b *= 4;
>  }
>  }
>
> @@ -34,7 +34,7 @@ foo_mul_peel (int* a, int b)
>for (int i = 0; i != 39; i++)
>  {
>a[i] = b;
> -  b *= 3;
> +  b *= 4;
>  }
>  }
>
> @@ -46,6 +46,6 @@ foo_mul_peel_const (int* a)
>for (int i = 0; i != 39; i++)
>  {
>a[i] = b;
> -  b *= 3;
> +  b *= 4;
>  }
>  }
> diff --git a/gcc/testsuite/gcc.target/i386/pr111820-1.c 
> b/gcc/testsuite/gcc.target/i386/pr111820-1.c
> new file mode 100644
> index 00000000000..50e960c39d4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr111820-1.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mavx2 -fno-tree-vrp -Wno-aggressive-loop-optimizations 
> -fdump-tree-vect-details" } */
> +/* { dg-final { scan-tree-dump "Avoid compile time hog on 
> vect_peel_nonlinear_iv_init for nonlinear induction vec_step_op_mul when 
> iteration count is too big" "vect" } } */
> +
> +int r;
> +int r_0;
> +
> +void f1 (void)
> +{
> +  int n = 0;
> +  while (-- n)
> +{
> +  r_0 += r;
> +  r  *= 3;
> +}
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr111820-2.c 
> b/gcc/testsuite/gcc.target/i386/pr111820-2.c
> new file mode 100644
> index 00000000000..bbdb40798c6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr111820-2.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mavx2 -fno-tree-vrp -fdump-tree-vect-details 
> -Wno-aggressive-loop-optimizations" } */
> +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> +
> +int r;
> +int r_0;
> +
> +void f (void)
> +{
> +  int n = 0;
> +  while (-- n)
> +{
> +  r_0 += r ;
> +  r  *= 2;
> +}
> +}
> +
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index 2608c286e5d..a530088b61d 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -1783,8 +1783,10 @@ iv_phi_p (stmt_vec_info stmt_info)
>  /* Return true if vectorizer can peel for nonlinear iv.  */
>  static bool
>  vect_can_peel_nonlinear_iv_p (loop_vec_info loop_vinfo,
> - enum vect_induction_op_type induction_type)
> + stmt_vec_info stmt_info)
>  {
> +  enum vect_induction_op_type induction_type
> += STMT_VINFO_LOOP_PHI_EVOLUTION_TYPE (stmt_info);
>tree niters_skip;
>/* Init_expr will be update by vect_update_ivs_after_vectorizer,
>   if niters or vf is unkown:
> @@ -1805,11 +1807,31 @@ vect_can_peel_nonlinear_iv_p (loop_vec_inf

Re: [PATCH] x86: Correct ISA enabled for clients since Arrow Lake

2023-10-19 Thread Hongtao Liu
On Wed, Oct 18, 2023 at 4:10 PM Haochen Jiang  wrote:
>
> Hi all,
>
> I just found that since ISAs enabled on Sierra Forest changed, clients since
> Arrow Lake will wrongly enable ENQCMD according to the current code.
>
> To avoid messing up again in the future, I changed the dependency on how ISAs
> are enabled currently by making clients depending on clients and Atom servers
> depending on Atom servers, which makes no functionality difference on
> Clearwater Forest.
>
> Also, revise the current out of date documentation in texi file.
>
> Regtested on x86_64-pc-linux-gnu. Ok for trunk?
Ok.
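
The inheritance problem described in the patch can be shown with plain
bitmasks. This is only a sketch: the bit values are arbitrary, the enum
names are made up, and the Sierra Forest set is an assumption inferred
from the description (Alder Lake plus the new ISAs plus ENQCMD and
UINTR), not copied from i386.h.

```c
/* Arbitrary bit assignments for illustration only.  */
enum isa_bits
{
  ISA_ALDERLAKE    = 1 << 0,	/* Stands in for the whole Alder Lake set.  */
  ISA_AVXIFMA      = 1 << 1,
  ISA_AVXVNNIINT8  = 1 << 2,
  ISA_AVXNECONVERT = 1 << 3,
  ISA_CMPCCXADD    = 1 << 4,
  ISA_UINTR        = 1 << 5,
  ISA_ENQCMD       = 1 << 6,

  /* Assumed Atom server set after ENQCMD was added to it.  */
  ISA_SIERRAFOREST = ISA_ALDERLAKE | ISA_AVXIFMA | ISA_AVXVNNIINT8
		     | ISA_AVXNECONVERT | ISA_CMPCCXADD | ISA_UINTR
		     | ISA_ENQCMD,

  /* Before the patch: the client inherits everything from the server.  */
  ISA_ARROWLAKE_OLD = ISA_SIERRAFOREST,

  /* After the patch: the client is built from the previous client.  */
  ISA_ARROWLAKE_NEW = ISA_ALDERLAKE | ISA_AVXIFMA | ISA_AVXVNNIINT8
		      | ISA_AVXNECONVERT | ISA_CMPCCXADD | ISA_UINTR
};

/* Nonzero when every bit of ISA is present in SET.  */
static int
has_isa (unsigned set, unsigned isa)
{
  return (set & isa) == isa;
}
```

With the old definition has_isa (ISA_ARROWLAKE_OLD, ISA_ENQCMD) is
true, which is the unintended ENQCMD enablement the patch removes.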
>
> Thx,
> Haochen
>
> gcc/ChangeLog:
>
> * config/i386/i386.h: Correct the ISA enabled for Arrow Lake.
> Also make Clearwater Forest depends on Sierra Forest.
> * doc/invoke.texi: Correct documentation.
> ---
>  gcc/config/i386/i386.h |  7 ---
>  gcc/doc/invoke.texi| 15 ---
>  2 files changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index abfe1672c41..92a7982c87f 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -2401,11 +2401,12 @@ constexpr wide_int_bitmask PTA_GRANITERAPIDS = 
> PTA_SAPPHIRERAPIDS | PTA_AMX_FP16
>  constexpr wide_int_bitmask PTA_GRANITERAPIDS_D = PTA_GRANITERAPIDS
>| PTA_AMX_COMPLEX;
>  constexpr wide_int_bitmask PTA_GRANDRIDGE = PTA_SIERRAFOREST | PTA_RAOINT;
> -constexpr wide_int_bitmask PTA_ARROWLAKE = PTA_SIERRAFOREST;
> +constexpr wide_int_bitmask PTA_ARROWLAKE = PTA_ALDERLAKE | PTA_AVXIFMA
> +  | PTA_AVXVNNIINT8 | PTA_AVXNECONVERT | PTA_CMPCCXADD | PTA_UINTR;
>  constexpr wide_int_bitmask PTA_ARROWLAKE_S = PTA_ARROWLAKE | PTA_AVXVNNIINT16
>| PTA_SHA512 | PTA_SM3 | PTA_SM4;
> -constexpr wide_int_bitmask PTA_CLEARWATERFOREST = PTA_ARROWLAKE_S | 
> PTA_PREFETCHI
> -  | PTA_USER_MSR;
> +constexpr wide_int_bitmask PTA_CLEARWATERFOREST = PTA_SIERRAFOREST | 
> PTA_AVXVNNIINT16
> +  | PTA_SHA512 | PTA_SM3 | PTA_SM4 | PTA_USER_MSR | PTA_PREFETCHI;
>  constexpr wide_int_bitmask PTA_PANTHERLAKE = PTA_ARROWLAKE_S | PTA_PREFETCHI;
>  constexpr wide_int_bitmask PTA_KNM = PTA_KNL | PTA_AVX5124VNNIW
>| PTA_AVX5124FMAPS | PTA_AVX512VPOPCNTDQ;
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index a0da7f9d5ac..69809db9f1b 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -32845,7 +32845,8 @@ SSSE3, SSE4.1, SSE4.2, POPCNT, AES, PREFETCHW, 
> PCLMUL, RDRND, XSAVE, XSAVEC,
>  XSAVES, XSAVEOPT, FSGSBASE, PTWRITE, RDPID, SGX, GFNI-SSE, CLWB, MOVDIRI,
>  MOVDIR64B, CLDEMOTE, WAITPKG, ADCX, AVX, AVX2, BMI, BMI2, F16C, FMA, LZCNT,
>  PCONFIG, PKU, VAES, VPCLMULQDQ, SERIALIZE, HRESET, KL, WIDEKL, AVX-VNNI,
> -AVXIFMA, AVXVNNIINT8, AVXNECONVERT and CMPCCXADD instruction set support.
> +UINTR, AVXIFMA, AVXVNNIINT8, AVXNECONVERT and CMPCCXADD instruction set
> +support.
>
>  @item arrowlake-s
>  Intel Arrow Lake S CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3,
> @@ -32853,8 +32854,8 @@ SSSE3, SSE4.1, SSE4.2, POPCNT, AES, PREFETCHW, 
> PCLMUL, RDRND, XSAVE, XSAVEC,
>  XSAVES, XSAVEOPT, FSGSBASE, PTWRITE, RDPID, SGX, GFNI-SSE, CLWB, MOVDIRI,
>  MOVDIR64B, CLDEMOTE, WAITPKG, ADCX, AVX, AVX2, BMI, BMI2, F16C, FMA, LZCNT,
>  PCONFIG, PKU, VAES, VPCLMULQDQ, SERIALIZE, HRESET, KL, WIDEKL, AVX-VNNI,
> -AVXIFMA, AVXVNNIINT8, AVXNECONVERT, CMPCCXADD, AVXVNNIINT16, SHA512, SM3
> -and SM4 instruction set support.
> +UINTR, AVXIFMA, AVXVNNIINT8, AVXNECONVERT, CMPCCXADD, AVXVNNIINT16, SHA512,
> +SM3 and SM4 instruction set support.
>
>  @item clearwaterforest
>  Intel Clearwater Forest CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2,
> @@ -32862,8 +32863,8 @@ SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT, AES, PREFETCHW, 
> PCLMUL, RDRND, XSAVE,
>  XSAVEC, XSAVES, XSAVEOPT, FSGSBASE, PTWRITE, RDPID, SGX, GFNI-SSE, CLWB,
>  MOVDIRI, MOVDIR64B, CLDEMOTE, WAITPKG, ADCX, AVX, AVX2, BMI, BMI2, F16C, FMA,
>  LZCNT, PCONFIG, PKU, VAES, VPCLMULQDQ, SERIALIZE, HRESET, KL, WIDEKL, 
> AVX-VNNI,
> -AVXIFMA, AVXVNNIINT8, AVXNECONVERT, CMPCCXADD, AVXVNNIINT16, SHA512, SM3, 
> SM4,
> -USER_MSR and PREFETCHI instruction set support.
> +ENQCMD, UINTR, AVXIFMA, AVXVNNIINT8, AVXNECONVERT, CMPCCXADD, AVXVNNIINT16,
> +SHA512, SM3, SM4, USER_MSR and PREFETCHI instruction set support.
>
>  @item pantherlake
>  Intel Panther Lake CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3,
> @@ -32871,8 +32872,8 @@ SSSE3, SSE4.1, SSE4.2, POPCNT, AES, PREFETCHW, 
> PCLMUL, RDRND, XSAVE, XSAVEC,
>  XSAVES, XSAVEOPT, FSGSBASE, PTWRITE, RDPID, SGX, GFNI-SSE, CLWB, MOVDIRI,
>  MOVDIR64B, CLDEMOTE, WAITPKG, ADCX, AVX, AVX2, BMI, BMI2, F16C, FMA, LZCNT,
>  PCONFIG, PKU, VAES, VPCLMULQDQ, SERIALIZE, HRESET, KL, WIDEKL, AVX-VNNI,
> -AVXIFMA, AVXVNNIINT8, AVXNECONVERT, CMPCCXADD, AVXVNNIINT16, SHA512, SM3, SM4
> -and PREFETCHI instruction set support.
> +UINTR, AVXIFMA, AVXVNNIINT8, AVXNECONVERT, CMPCCXADD, AVXVNNIINT16, SHA512,
> +SM3, SM4 and PREFETCHI instruction set support.
>
>  @item knl
>  Intel Kni

Re: [PATCH] Support vec_cmpmn/vcondmn for v2hf/v4hf.

2023-10-23 Thread Hongtao Liu
On Mon, Oct 23, 2023 at 8:35 PM Richard Biener
 wrote:
>
> On Mon, Oct 23, 2023 at 10:48 AM liuhongt  wrote:
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ready to push to trunk.
>
> vcond and vcondeq shouldn't be necessary if there's
> vcond_mask and vcmp support which is the "modern"
> way of handling vcond.  Unless the ISA really can do
> compare and select with a single instruction.
For testcase

typedef _Float16 __attribute__((__vector_size__ (4))) __v2hf;
typedef _Float16 __attribute__((__vector_size__ (8))) __v4hf;


__v4hf cf, df;

__v4hf cfu (__v4hf c, __v4hf d) { return (c > d) ? cf : df; }

The data_mode passed to ix86_get_mask_mode is v4hi, not v4hf, since

  /* Always construct signed integer vector type.  */
  intt = c_common_type_for_size
(GET_MODE_BITSIZE (SCALAR_TYPE_MODE (TREE_TYPE (type0))), 0);
  if (!intt)
{
  if (complain & tf_error)
error_at (location, "could not find an integer type "
  "of the same size as %qT", TREE_TYPE (type0));
  return error_mark_node;
}
  result_type = build_opaque_vector_type (intt,
  TYPE_VECTOR_SUBPARTS (type0));
  return build_vec_cmp (resultcode, result_type, op0, op1);

The backend can't distinguish whether it's a vector fp16 comparison or
a vector hi comparison.
The former requires -mavx512fp16, the latter requires -mavx512bw.
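
As a small illustration of why the mask type carries no floating-point
information: with the generic vector extension, comparing two
floating-point vectors yields a signed integer vector of the same shape,
with -1 in true lanes and 0 in false lanes. The sketch uses float
instead of _Float16 so that it builds without FP16 support; the typedef
and function names are made up for the example.

```c
typedef float v4sf __attribute__ ((vector_size (16)));
typedef int   v4si __attribute__ ((vector_size (16)));

/* The result of c > d is an integer vector, so nothing in its type
   says the operands were floating point.  */
static v4si
greater_mask (v4sf c, v4sf d)
{
  return c > d;
}
```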
>
> Richard.
>
> > gcc/ChangeLog:
> >
> > PR target/103861
> > * config/i386/i386-expand.cc (ix86_expand_sse_movcc): Handle
> > V2HF/V2BF/V4HF/V4BFmode.
> > * config/i386/mmx.md (vec_cmpv4hfqi): New expander.
> > (vcondv4hf): Ditto.
> > (vcondv4hi): Ditto.
> > (vconduv4hi): Ditto.
> > (vcond_mask_v4hi): Ditto.
> > (vcond_mask_qi): Ditto.
> > (vec_cmpv2hfqi): Ditto.
> > (vcondv2hf): Ditto.
> > (vcondv2hi): Ditto.
> > (vconduv2hi): Ditto.
> > (vcond_mask_v2hi): Ditto.
> > * config/i386/sse.md (vcond): Merge this with ..
> > (vcond): .. this into ..
> > (vcond): .. this,
> > and extend to V8BF/V16BF/V32BFmode.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * g++.target/i386/part-vect-vcondhf.C: New test.
> > * gcc.target/i386/part-vect-vec_cmphf.c: New test.
> > ---
> >  gcc/config/i386/i386-expand.cc|   4 +
> >  gcc/config/i386/mmx.md| 237 +-
> >  gcc/config/i386/sse.md|  25 +-
> >  .../g++.target/i386/part-vect-vcondhf.C   |  34 +++
> >  .../gcc.target/i386/part-vect-vec_cmphf.c |  26 ++
> >  5 files changed, 304 insertions(+), 22 deletions(-)
> >  create mode 100644 gcc/testsuite/g++.target/i386/part-vect-vcondhf.C
> >  create mode 100644 gcc/testsuite/gcc.target/i386/part-vect-vec_cmphf.c
> >
> > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> > index 1eae9d7c78c..9658f9c5a2d 100644
> > --- a/gcc/config/i386/i386-expand.cc
> > +++ b/gcc/config/i386/i386-expand.cc
> > @@ -4198,6 +4198,8 @@ ix86_expand_sse_movcc (rtx dest, rtx cmp, rtx 
> > op_true, rtx op_false)
> >break;
> >  case E_V8QImode:
> >  case E_V4HImode:
> > +case E_V4HFmode:
> > +case E_V4BFmode:
> >  case E_V2SImode:
> >if (TARGET_SSE4_1)
> > {
> > @@ -4207,6 +4209,8 @@ ix86_expand_sse_movcc (rtx dest, rtx cmp, rtx 
> > op_true, rtx op_false)
> >break;
> >  case E_V4QImode:
> >  case E_V2HImode:
> > +case E_V2HFmode:
> > +case E_V2BFmode:
> >if (TARGET_SSE4_1)
> > {
> >   gen = gen_mmx_pblendvb_v4qi;
> > diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
> > index 491a0a51272..b9617e9d8c6 100644
> > --- a/gcc/config/i386/mmx.md
> > +++ b/gcc/config/i386/mmx.md
> > @@ -61,6 +61,9 @@ (define_mode_iterator MMXMODE248 [V4HI V2SI V1DI])
> >  (define_mode_iterator V_32 [V4QI V2HI V1SI V2HF V2BF])
> >
> >  (define_mode_iterator V2FI_32 [V2HF V2BF V2HI])
> > +(define_mode_iterator V4FI_64 [V4HF V4BF V4HI])
> > +(define_mode_iterator V4F_64 [V4HF V4BF])
> > +(define_mode_iterator V2F_32 [V2HF V2BF])
> >  ;; 4-byte integer vector modes
> >  (define_mode_iterator VI_32 [V4QI V2HI])
> >
> > @@ -1972,10 +1975,12 @@ (define_mode_attr mov_to_sse_suffix
> >[(V2HF "d") (V4HF "q") (V2HI "d") (V4HI "q")])
> >
> >  (define_mode_attr mmxxmmmode
> > -  [(V2HF "V8HF") (V2HI "V8HI") (V2BF "V8BF")])
> > +  [(V2HF "V8HF") (V2HI "V8HI") (V2BF "V8BF")
> > +   (V4HF "V8HF") (V4HI "V8HI") (V4BF "V8BF")])
> >
> >  (define_mode_attr mmxxmmmodelower
> > -  [(V2HF "v8hf") (V2HI "v8hi") (V2BF "v8bf")])
> > +  [(V2HF "v8hf") (V2HI "v8hi") (V2BF "v8bf")
> > +   (V4HF "v8hf") (V4HI "v8hi") (V4BF "v8bf")])
> >
> >  (define_expand "movd__to_sse"
> >[(set (match_operand: 0 "register_operand")
> > @@ -2114,6 +2119,234 @@ (define_insn_and_split "*mmx_nabs2"
> >[(set (match_dup 0)
> > (ior: (match_dup 1) (match_dup 2)))])
> >
> > +;;;

Re: [PATCH] Support vec_cmpmn/vcondmn for v2hf/v4hf.

2023-10-23 Thread Hongtao Liu
On Tue, Oct 24, 2023 at 10:53 AM Hongtao Liu  wrote:
>
> On Mon, Oct 23, 2023 at 8:35 PM Richard Biener
>  wrote:
> >
> > On Mon, Oct 23, 2023 at 10:48 AM liuhongt  wrote:
> > >
> > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > Ready to push to trunk.
> >
> > vcond and vcondeq shouldn't be necessary if there's
> > vcond_mask and vcmp support which is the "modern"
> > way of handling vcond.  Unless the ISA really can do
> > compare and select with a single instruction.
> For testcase
>
> typedef _Float16 __attribute__((__vector_size__ (4))) __v2hf;
> typedef _Float16 __attribute__((__vector_size__ (8))) __v4hf;
>
>
> __v4hf cf, df;
>
> __v4hf cfu (__v4hf c, __v4hf d) { return (c > d) ? cf : df; }
>
> The data_mode passes to ix86_get_mask_mode is v4hi, not v4hf since
>
>   /* Always construct signed integer vector type.  */
>   intt = c_common_type_for_size
> (GET_MODE_BITSIZE (SCALAR_TYPE_MODE (TREE_TYPE (type0))), 0);
>   if (!intt)
> {
>   if (complain & tf_error)
> error_at (location, "could not find an integer type "
>   "of the same size as %qT", TREE_TYPE (type0));
>   return error_mark_node;
> }
>   result_type = build_opaque_vector_type (intt,
>   TYPE_VECTOR_SUBPARTS (type0));
>   return build_vec_cmp (resultcode, result_type, op0, op1);
>
> The backend can't distinguish whether it's a vector fp16 comparison or
> a vector hi comparison.
> the former requires -mavx512fp16, the latter requires -mavx512bw
Should we pass type0 instead of result_type here?
> >
> > Richard.
> >
> > > gcc/ChangeLog:
> > >
> > > PR target/103861
> > > * config/i386/i386-expand.cc (ix86_expand_sse_movcc): Handle
> > > V2HF/V2BF/V4HF/V4BFmode.
> > > * config/i386/mmx.md (vec_cmpv4hfqi): New expander.
> > > (vcondv4hf): Ditto.
> > > (vcondv4hi): Ditto.
> > > (vconduv4hi): Ditto.
> > > (vcond_mask_v4hi): Ditto.
> > > (vcond_mask_qi): Ditto.
> > > (vec_cmpv2hfqi): Ditto.
> > > (vcondv2hf): Ditto.
> > > (vcondv2hi): Ditto.
> > > (vconduv2hi): Ditto.
> > > (vcond_mask_v2hi): Ditto.
> > > * config/i386/sse.md (vcond): Merge this with ..
> > > (vcond): .. this into ..
> > > (vcond): .. this,
> > > and extend to V8BF/V16BF/V32BFmode.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > * g++.target/i386/part-vect-vcondhf.C: New test.
> > > * gcc.target/i386/part-vect-vec_cmphf.c: New test.
> > > ---
> > >  gcc/config/i386/i386-expand.cc|   4 +
> > >  gcc/config/i386/mmx.md| 237 +-
> > >  gcc/config/i386/sse.md|  25 +-
> > >  .../g++.target/i386/part-vect-vcondhf.C   |  34 +++
> > >  .../gcc.target/i386/part-vect-vec_cmphf.c |  26 ++
> > >  5 files changed, 304 insertions(+), 22 deletions(-)
> > >  create mode 100644 gcc/testsuite/g++.target/i386/part-vect-vcondhf.C
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/part-vect-vec_cmphf.c
> > >
> > > diff --git a/gcc/config/i386/i386-expand.cc 
> > > b/gcc/config/i386/i386-expand.cc
> > > index 1eae9d7c78c..9658f9c5a2d 100644
> > > --- a/gcc/config/i386/i386-expand.cc
> > > +++ b/gcc/config/i386/i386-expand.cc
> > > @@ -4198,6 +4198,8 @@ ix86_expand_sse_movcc (rtx dest, rtx cmp, rtx 
> > > op_true, rtx op_false)
> > >break;
> > >  case E_V8QImode:
> > >  case E_V4HImode:
> > > +case E_V4HFmode:
> > > +case E_V4BFmode:
> > >  case E_V2SImode:
> > >if (TARGET_SSE4_1)
> > > {
> > > @@ -4207,6 +4209,8 @@ ix86_expand_sse_movcc (rtx dest, rtx cmp, rtx 
> > > op_true, rtx op_false)
> > >break;
> > >  case E_V4QImode:
> > >  case E_V2HImode:
> > > +case E_V2HFmode:
> > > +case E_V2BFmode:
> > >if (TARGET_SSE4_1)
> > > {
> > >   gen = gen_mmx_pblendvb_v4qi;
> > > diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
> > > index 491a0a51272..b9617e9d8c6 100644
> > > --- a/gcc/config/i386/mmx.md
> > > +++ b/gcc/config/i386/mmx.md
> > > @@ -61,6 +61,9 @@ (define_mode_iterator MMXMODE248 [V4HI V2SI V1DI])
>

Re: [PATCH] Support vec_cmpmn/vcondmn for v2hf/v4hf.

2023-10-23 Thread Hongtao Liu
On Tue, Oct 24, 2023 at 1:23 PM Hongtao Liu  wrote:
>
> On Tue, Oct 24, 2023 at 10:53 AM Hongtao Liu  wrote:
> >
> > On Mon, Oct 23, 2023 at 8:35 PM Richard Biener
> >  wrote:
> > >
> > > On Mon, Oct 23, 2023 at 10:48 AM liuhongt  wrote:
> > > >
> > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > > Ready push to trunk.
> > >
> > > vcond and vcondeq shouldn't be necessary if there's
> > > vcond_mask and vcmp support which is the "modern"
> > > way of handling vcond.  Unless the ISA really can do
> > > compare and select with a single instruction.
> > For testcase
> >
> > typedef _Float16 __attribute__((__vector_size__ (4))) __v2hf;
> > typedef _Float16 __attribute__((__vector_size__ (8))) __v4hf;
> >
> >
> > __v4hf cf, df;
> >
> > __v4hf cfu (__v4hf c, __v4hf d) { return (c > d) ? cf : df; }
> >
> > The data_mode passes to ix86_get_mask_mode is v4hi, not v4hf since
> >
> >   /* Always construct signed integer vector type.  */
> >   intt = c_common_type_for_size
> > (GET_MODE_BITSIZE (SCALAR_TYPE_MODE (TREE_TYPE (type0))), 0);
> >   if (!intt)
> > {
> >   if (complain & tf_error)
> > error_at (location, "could not find an integer type "
> >   "of the same size as %qT", TREE_TYPE (type0));
> >   return error_mark_node;
> > }
> >   result_type = build_opaque_vector_type (intt,
> >   TYPE_VECTOR_SUBPARTS (type0));
> >   return build_vec_cmp (resultcode, result_type, op0, op1);
> >
> > The backend can't distinguish whether it's a vector fp16 comparison or
> > a vector hi comparison.
> > the former requires -mavx512fp16, the latter requires -mavx512bw
> Should we pass type0 instead of result_type here?
@deftypefn {Target Hook} opt_machine_mode TARGET_VECTORIZE_GET_MASK_MODE (machine_mode @var{mode})
Return the mode to use for a vector mask that holds one boolean
result for each element of vector mode @var{mode}.  The returned mask mode
can be a vector of integers (class @code{MODE_VECTOR_INT}), a vector of
booleans (class @code{MODE_VECTOR_BOOL}) or a scalar integer (class
@code{MODE_INT}).  Return an empty @code{opt_machine_mode} if no such
mask mode exists.

Looks like it's on purpose, v2hi is exactly what we needed here.

So we have to use either kmask or v4hi for both the v4hf and the v4hi
comparison; we can't use v4hi for the v4hi comparison but kmask for the
v4hf comparison.
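
To illustrate the point about the front end always building a signed
integer vector type for the comparison result, here is a minimal sketch
using the GNU C vector extensions. It uses float/int lanes rather than
_Float16 so that it builds without any fp16 support; the names cmp_gt and
blend_by_mask are made up for this example and are not from the patch.

```c
/* 16-byte vectors with four 32-bit lanes each.  */
typedef float v4sf __attribute__ ((vector_size (16)));
typedef int   v4si __attribute__ ((vector_size (16)));

/* Element-wise c > d.  The result is a signed integer vector with the
   same lane width and lane count as the operands: -1 (all ones) in each
   lane where the comparison holds, 0 elsewhere.  */
static v4si
cmp_gt (v4sf c, v4sf d)
{
  return c > d;
}

/* Blend by an integer vector mask: take T where the mask lane is all
   ones and F where it is zero.  The casts reinterpret the bits.  */
static v4sf
blend_by_mask (v4si mask, v4sf t, v4sf f)
{
  return (v4sf) (((v4si) t & mask) | ((v4si) f & ~mask));
}
```

Because the mask type is derived from the lane size only, a float
comparison and an integer comparison of the same shape end up with the
same mask type, which is the ambiguity described above for v4hf vs v4hi.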
> > >
> > > Richard.
> > >
> > > > gcc/ChangeLog:
> > > >
> > > > PR target/103861
> > > > * config/i386/i386-expand.cc (ix86_expand_sse_movcc): Handle
> > > > V2HF/V2BF/V4HF/V4BFmode.
> > > > * config/i386/mmx.md (vec_cmpv4hfqi): New expander.
> > > > (vcondv4hf): Ditto.
> > > > (vcondv4hi): Ditto.
> > > > (vconduv4hi): Ditto.
> > > > (vcond_mask_v4hi): Ditto.
> > > > (vcond_mask_qi): Ditto.
> > > > (vec_cmpv2hfqi): Ditto.
> > > > (vcondv2hf): Ditto.
> > > > (vcondv2hi): Ditto.
> > > > (vconduv2hi): Ditto.
> > > > (vcond_mask_v2hi): Ditto.
> > > > * config/i386/sse.md (vcond): Merge this with ..
> > > > (vcond): .. this into ..
> > > > (vcond): .. this,
> > > > and extend to V8BF/V16BF/V32BFmode.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > > * g++.target/i386/part-vect-vcondhf.C: New test.
> > > > * gcc.target/i386/part-vect-vec_cmphf.c: New test.
> > > > ---
> > > >  gcc/config/i386/i386-expand.cc|   4 +
> > > >  gcc/config/i386/mmx.md| 237 +-
> > > >  gcc/config/i386/sse.md|  25 +-
> > > >  .../g++.target/i386/part-vect-vcondhf.C   |  34 +++
> > > >  .../gcc.target/i386/part-vect-vec_cmphf.c |  26 ++
> > > >  5 files changed, 304 insertions(+), 22 deletions(-)
> > > >  create mode 100644 gcc/testsuite/g++.target/i386/part-vect-vcondhf.C
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/part-vect-vec_cmphf.c
> > > >
> > > > diff --git a/gcc/config/i386/i386-expand.cc 
> > > > b/gcc/config/i386/i386-expand.cc
> > > > index 1eae9d7c78c..9658f9c5a2d 100644
> > > > --- a/gc

Re: [PATCH] i386: Fix undefined masks in vpopcnt tests

2023-10-24 Thread Hongtao Liu
On Tue, Oct 24, 2023 at 6:10 PM Richard Sandiford
 wrote:
>
> The files changed in this patch had tests for masked and unmasked
> popcnt.  However, the mask inputs to the masked forms were undefined,
> and would be set to zero by init_regs.  Any combine-like pass that
> ran after init_regs could then fold the masked forms into the
> unmasked ones.  I saw this while testing the late-combine pass
> on x86.
>
> Tested on x86_64-linux-gnu.  OK to install?  (I didn't think this
> counted as obvious because there are other ways of initialising
> the mask.)
Maybe just move the definition of the mask outside of the functions as
extern __mmask16 msk;
But of course your approach is also ok, so either way is ok with me.
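
For reference, the two options can be sketched in plain GNU C as below.
This is only a scalar stand-in under my own assumptions: masked_merge,
msk_global, use_global_mask and use_asm_mask are hypothetical names, a
general register ("=r") is used instead of a mask register ("=k"), and
the global is given a definition here so that the snippet links, whereas
the compile-only tests would just declare it extern.

```c
#include <stdint.h>

/* Hypothetical scalar stand-in for a merge-masked operation: take VAL
   where the mask bit is set, keep SRC elsewhere.  */
static uint16_t
masked_merge (uint16_t src, uint16_t mask, uint16_t val)
{
  return (uint16_t) ((src & ~mask) | (val & mask));
}

/* Option 1: a mask with external linkage.  Without whole-program
   optimization the compiler has to load it at run time rather than
   assume a value.  */
uint16_t msk_global = 0x00ff;

static uint16_t
use_global_mask (uint16_t src, uint16_t val)
{
  return masked_merge (src, msk_global, val);
}

/* Option 2: an empty asm that claims to write the mask.  The value is
   arbitrary, but it counts as defined for the optimizers, so the masked
   form cannot be folded into the unmasked one.  */
static uint16_t
use_asm_mask (uint16_t src, uint16_t val, uint16_t *mask_out)
{
  unsigned int msk;
  __asm__ volatile ("" : "=r" (msk));
  *mask_out = (uint16_t) msk;
  return masked_merge (src, (uint16_t) msk, val);
}
```

Either way the mask is no longer an uninitialized local, so init_regs has
nothing to zero.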
>
> Richard
>
>
> gcc/testsuite/
> * gcc.target/i386/avx512bitalg-vpopcntb.c: Use an asm to define
> the mask.
> * gcc.target/i386/avx512bitalg-vpopcntbvl.c: Likewise.
> * gcc.target/i386/avx512bitalg-vpopcntw.c: Likewise.
> * gcc.target/i386/avx512bitalg-vpopcntwvl.c: Likewise.
> * gcc.target/i386/avx512vpopcntdq-vpopcntd.c: Likewise.
> * gcc.target/i386/avx512vpopcntdq-vpopcntq.c: Likewise.
> ---
>  gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntb.c| 1 +
>  gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntbvl.c  | 1 +
>  gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntw.c| 1 +
>  gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntwvl.c  | 1 +
>  gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntd.c | 1 +
>  gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntq.c | 1 +
>  6 files changed, 6 insertions(+)
>
> diff --git a/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntb.c 
> b/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntb.c
> index 44b82c0519d..c52088161a0 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntb.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntb.c
> @@ -11,6 +11,7 @@ extern __m512i z, z1;
>  int foo ()
>  {
>__mmask16 msk;
> +  asm volatile ("" : "=k" (msk));
>__m512i c = _mm512_popcnt_epi8 (z);
>asm volatile ("" : "+v" (c));
>c = _mm512_mask_popcnt_epi8 (z1, msk, z);
> diff --git a/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntbvl.c 
> b/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntbvl.c
> index 8c2dfaba9c6..7d11c6c4623 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntbvl.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntbvl.c
> @@ -16,6 +16,7 @@ int foo ()
>  {
>__mmask32 msk32;
>__mmask16 msk16;
> +  asm volatile ("" : "=k" (msk16), "=k" (msk32));
>__m256i c256 = _mm256_popcnt_epi8 (y);
>asm volatile ("" : "+v" (c256));
>c256 = _mm256_mask_popcnt_epi8 (y_1, msk32, y);
> diff --git a/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntw.c 
> b/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntw.c
> index 2ef8589f6c1..bc470415e9b 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntw.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntw.c
> @@ -11,6 +11,7 @@ extern __m512i z, z1;
>  int foo ()
>  {
>__mmask16 msk;
> +  asm volatile ("" : "=k" (msk));
>__m512i c = _mm512_popcnt_epi16 (z);
>asm volatile ("" : "+v" (c));
>c = _mm512_mask_popcnt_epi16 (z1, msk, z);
> diff --git a/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntwvl.c 
> b/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntwvl.c
> index c976461b12e..3a6af3ed8a1 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntwvl.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512bitalg-vpopcntwvl.c
> @@ -16,6 +16,7 @@ int foo ()
>  {
>__mmask16 msk16;
>__mmask8 msk8;
> +  asm volatile ("" : "=k" (msk16), "=k" (msk8));
>__m256i c256 = _mm256_popcnt_epi16 (y);
>asm volatile ("" : "+v" (c256));
>c256 = _mm256_mask_popcnt_epi16 (y_1, msk16, y);
> diff --git a/gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntd.c 
> b/gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntd.c
> index b4d82f97032..0a54ae83055 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntd.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntd.c
> @@ -20,6 +20,7 @@ int foo ()
>  {
>__mmask16 msk;
>__mmask8 msk8;
> +  asm volatile ("" : "=k" (msk), "=k" (msk8));
>__m128i a = _mm_popcnt_epi32 (x);
>asm volatile ("" : "+v" (a));
>a = _mm_mask_popcnt_epi32 (x_1, msk8, x);
> diff --git a/gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntq.c 
> b/gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntq.c
> index e87d6c999b6..c11e6e00998 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntq.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512vpopcntdq-vpopcntq.c
> @@ -19,6 +19,7 @@ extern __m512i z, z_1;
>  int foo ()
>  {
>__mmask8 msk;
> +  asm volatile ("" : "=k" (msk));
>__m128i a = _mm_popcnt_epi64 (x);
>asm volatile ("" : "+v" (a));
>a = _mm_mask_popcnt_epi64 (x_1, msk, x);
> --
> 2.25.1
>


-- 
BR,
Hongtao


Re: [PATCH] Improve memcmpeq for 512-bit vector with vpcmpeq + kortest.

2023-10-27 Thread Hongtao Liu
On Fri, Oct 27, 2023 at 2:49 PM Richard Biener
 wrote:
>
>
>
> > Am 27.10.2023 um 07:50 schrieb liuhongt :
> >
> > When 2 vectors are equal, kmask is allones and kortest will set CF,
> > else CF will be cleared.
> >
> > So CF bit can be used to check for the result of the comparison.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ok for trunk?
>
> Is that also profitable for 256bit aka AVX10?
Yes, it's also available for both 128-bit and 256-bit vectors with AVX10,
and it's better from a performance perspective.
AVX10:
  vpcmp + kortest
 vs
AVX2:
  vpxor + vptest

vptest is more expensive than vpcmp + kortest.

> Is there a jump on carry in case the result feeds control flow rather than a 
> value and is using ktest better then (does combine figure this out?)
There are JC and JNC. There are many patterns matching ptest which
can't be automatically adjusted to kortest by combine; the backend needs
to transform them manually.
That's why my patch only handles 64-byte (512-bit) vectors (to avoid
regressing those pattern matches).
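
To make the CF reasoning concrete, here is a small scalar model of the
vpcmpeqd + kortestw + setc sequence. It is only a sketch of the flag
semantics, not the generated code, and the model_* names are invented for
this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of vpcmpeqd zmm, zmm, k: bit i of the result is set when
   dword lane i of A equals dword lane i of B.  */
static uint16_t
model_vpcmpeqd (const uint32_t a[16], const uint32_t b[16])
{
  uint16_t k = 0;
  for (int i = 0; i < 16; i++)
    if (a[i] == b[i])
      k |= (uint16_t) (1u << i);
  return k;
}

/* Scalar model of the carry flag after kortestw k1, k2: CF is set when
   the OR of the two masks is all ones.  */
static int
model_kortestw_cf (uint16_t k1, uint16_t k2)
{
  return (uint16_t) (k1 | k2) == 0xffff;
}

/* 64-byte equality test mirroring vpcmpeqd + kortestw + setc: returns 1
   when all 64 bytes match, 0 otherwise.  */
static int
model_memcmpeq64 (const void *p, const void *q)
{
  uint32_t a[16], b[16];
  assert (p != NULL && q != NULL);
  memcpy (a, p, sizeof a);
  memcpy (b, q, sizeof b);
  uint16_t k = model_vpcmpeqd (a, b);
  return model_kortestw_cf (k, k);
}
```

A single mismatching lane clears one mask bit, so the OR is no longer all
ones and CF stays clear, which is what setc (or JC/JNC) then observes.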

>
> > Before:
> >vmovdqu (%rsi), %ymm0
> >vpxorq  (%rdi), %ymm0, %ymm0
> >vptest  %ymm0, %ymm0
> >jne .L2
> >vmovdqu 32(%rsi), %ymm0
> >vpxorq  32(%rdi), %ymm0, %ymm0
> >vptest  %ymm0, %ymm0
> >je  .L5
> > .L2:
> >movl$1, %eax
> >xorl$1, %eax
> >vzeroupper
> >ret
> >
> > After:
> >vmovdqu64   (%rsi), %zmm0
> >xorl%eax, %eax
> >vpcmpeqd(%rdi), %zmm0, %k0
> >kortestw%k0, %k0
> >setc%al
> >vzeroupper
> >ret
> >
> > gcc/ChangeLog:
> >
> >PR target/104610
> >* config/i386/i386-expand.cc (ix86_expand_branch): Handle
> >512-bit vector with vpcmpeq + kortest.
> >* config/i386/i386.md (cbranchxi4): New expander.
> >* config/i386/sse.md: (cbranch4): Extend to V16SImode
> >and V8DImode.
> >
> > gcc/testsuite/ChangeLog:
> >
> >* gcc.target/i386/pr104610-2.c: New test.
> > ---
> > gcc/config/i386/i386-expand.cc | 55 +++---
> > gcc/config/i386/i386.md| 16 +++
> > gcc/config/i386/sse.md | 36 +++---
> > gcc/testsuite/gcc.target/i386/pr104610-2.c | 14 ++
> > 4 files changed, 99 insertions(+), 22 deletions(-)
> > create mode 100644 gcc/testsuite/gcc.target/i386/pr104610-2.c
> >
> > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> > index 1eae9d7c78c..c664cb61e80 100644
> > --- a/gcc/config/i386/i386-expand.cc
> > +++ b/gcc/config/i386/i386-expand.cc
> > @@ -2411,30 +2411,53 @@ ix86_expand_branch (enum rtx_code code, rtx op0, 
> > rtx op1, rtx label)
> >   rtx tmp;
> >
> >   /* Handle special case - vector comparsion with boolean result, transform
> > - it using ptest instruction.  */
> > + it using ptest instruction or vpcmpeq + kortest.  */
> >   if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT
> >   || (mode == TImode && !TARGET_64BIT)
> > -  || mode == OImode)
> > +  || mode == OImode
> > +  || GET_MODE_SIZE (mode) == 64)
> > {
> > -  rtx flag = gen_rtx_REG (CCZmode, FLAGS_REG);
> > -  machine_mode p_mode = GET_MODE_SIZE (mode) == 32 ? V4DImode : 
> > V2DImode;
> > +  unsigned msize = GET_MODE_SIZE (mode);
> > +  machine_mode p_mode
> > += msize == 64 ? V16SImode : msize == 32 ? V4DImode : V2DImode;
> > +  /* kortest set CF when result is 0x (op0 == op1).  */
> > +  rtx flag = gen_rtx_REG (msize == 64 ? CCCmode : CCZmode, FLAGS_REG);
> >
> >   gcc_assert (code == EQ || code == NE);
> >
> > -  if (GET_MODE_CLASS (mode) != MODE_VECTOR_INT)
> > +  /* Using vpcmpeq zmm zmm k + kortest for 512-bit vectors.  */
> > +  if (msize == 64)
> >{
> > -  op0 = lowpart_subreg (p_mode, force_reg (mode, op0), mode);
> > -  op1 = lowpart_subreg (p_mode, force_reg (mode, op1), mode);
> > -  mode = p_mode;
> > +  if (mode != V16SImode)
> > +{
> > +  op0 = lowpart_subreg (p_mode, force_reg (mode, op0), mode);
> > +  op1 = lowpart_subreg (p_mode, force_reg (mode, op1), mode);
> > +}
> > +
> > +  tmp = gen_reg_rtx (HImode);
> > +  emit_insn (gen_avx512f_cmpv16si3 (tmp, op0, op1, GEN_INT (0)));
> > +  emit_insn (gen_kortesthi_ccc (tmp, tmp));
> > +}
> > +  /* Using ptest for 128/256-bit vectors.  */
> > +  else
> > +{
> > +  if (GET_MODE_CLASS (mode) != MODE_VECTOR_INT)
> > +{
> > +  op0 = lowpart_subreg (p_mode, force_reg (mode, op0), mode);
> > +  op1 = lowpart_subreg (p_mode, force_reg (mode, op1), mode);
> > +  mode = p_mode;
> > +}
> > +
> > +  /* Generate XOR since we can't check that one operand is zero
> > + vector.  */
> > +  tmp = gen_reg_rtx (mode);
> > +  emit_insn (gen_rtx_SET (tmp, gen_rtx_XOR (mode, op0, op1)));
> > +  tmp = gen_lowpart (p_mode, tmp);
> > +   

Re: [PATCH] Improve memcmpeq for 512-bit vector with vpcmpeq + kortest.

2023-10-27 Thread Hongtao Liu
On Fri, Oct 27, 2023 at 3:21 PM Hongtao Liu  wrote:
>
> On Fri, Oct 27, 2023 at 2:49 PM Richard Biener
>  wrote:
> >
> >
> >
> > > Am 27.10.2023 um 07:50 schrieb liuhongt :
> > >
> > > When 2 vectors are equal, kmask is allones and kortest will set CF,
> > > else CF will be cleared.
> > >
> > > So CF bit can be used to check for the result of the comparison.
> > >
> > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > Ok for trunk?
> >
> > Is that also profitable for 256bit aka AVX10?
> Yes, it's also available for both 128-bit and 256-bit with AVX10, from
> performance perspective it's better.
> AVX10:
>   vpcmp + kortest
>  vs
> AVX2:
>  vpxor + vptest
>
>  vptest is more expensive than vpcmp + kortest
>
> > Is there a jump on carry in case the result feeds control flow rather than 
> > a value and is using ktest better then (does combine figure this out?)
> There are JC and JNC, there're many pattern matches for ptest which
> can't be automatically adjusted to kortest by combining, backend needs
> to manually transform them.
> That's why my patch only handles 64-bit vectors(to avoid regressing
I mean 64 bytes.
> those pattern match stuff).
>
> >
> > > Before:
> > >vmovdqu (%rsi), %ymm0
> > >vpxorq  (%rdi), %ymm0, %ymm0
> > >vptest  %ymm0, %ymm0
> > >jne .L2
> > >vmovdqu 32(%rsi), %ymm0
> > >vpxorq  32(%rdi), %ymm0, %ymm0
> > >vptest  %ymm0, %ymm0
> > >je  .L5
> > > .L2:
> > >movl$1, %eax
> > >xorl$1, %eax
> > >vzeroupper
> > >ret
> > >
> > > After:
> > >vmovdqu64   (%rsi), %zmm0
> > >xorl%eax, %eax
> > >vpcmpeqd(%rdi), %zmm0, %k0
> > >kortestw%k0, %k0
> > >setc%al
> > >vzeroupper
> > >ret
> > >
> > > gcc/ChangeLog:
> > >
> > >PR target/104610
> > >* config/i386/i386-expand.cc (ix86_expand_branch): Handle
> > >512-bit vector with vpcmpeq + kortest.
> > >* config/i386/i386.md (cbranchxi4): New expander.
> > >* config/i386/sse.md: (cbranch4): Extend to V16SImode
> > >and V8DImode.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >* gcc.target/i386/pr104610-2.c: New test.
> > > ---
> > > gcc/config/i386/i386-expand.cc | 55 +++---
> > > gcc/config/i386/i386.md| 16 +++
> > > gcc/config/i386/sse.md | 36 +++---
> > > gcc/testsuite/gcc.target/i386/pr104610-2.c | 14 ++
> > > 4 files changed, 99 insertions(+), 22 deletions(-)
> > > create mode 100644 gcc/testsuite/gcc.target/i386/pr104610-2.c
> > >
> > > diff --git a/gcc/config/i386/i386-expand.cc 
> > > b/gcc/config/i386/i386-expand.cc
> > > index 1eae9d7c78c..c664cb61e80 100644
> > > --- a/gcc/config/i386/i386-expand.cc
> > > +++ b/gcc/config/i386/i386-expand.cc
> > > @@ -2411,30 +2411,53 @@ ix86_expand_branch (enum rtx_code code, rtx op0, 
> > > rtx op1, rtx label)
> > >   rtx tmp;
> > >
> > >   /* Handle special case - vector comparsion with boolean result, 
> > > transform
> > > - it using ptest instruction.  */
> > > + it using ptest instruction or vpcmpeq + kortest.  */
> > >   if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT
> > >   || (mode == TImode && !TARGET_64BIT)
> > > -  || mode == OImode)
> > > +  || mode == OImode
> > > +  || GET_MODE_SIZE (mode) == 64)
> > > {
> > > -  rtx flag = gen_rtx_REG (CCZmode, FLAGS_REG);
> > > -  machine_mode p_mode = GET_MODE_SIZE (mode) == 32 ? V4DImode : 
> > > V2DImode;
> > > +  unsigned msize = GET_MODE_SIZE (mode);
> > > +  machine_mode p_mode
> > > += msize == 64 ? V16SImode : msize == 32 ? V4DImode : V2DImode;
> > > +  /* kortest set CF when result is 0x (op0 == op1).  */
> > > +  rtx flag = gen_rtx_REG (msize == 64 ? CCCmode : CCZmode, 
> > > FLAGS_REG);
> > >
> > >   gcc_assert (code == EQ || code == NE);
> > >
> > > -  if (GET_MODE_CLASS (mode) != MODE_VECTOR_INT)
> > > +  /* Using vpcmpeq zmm zmm k + kortest for 512

Re: [PATCH] Fix incorrect option mask and avx512cd target push

2023-10-30 Thread Hongtao Liu
On Mon, Oct 30, 2023 at 3:47 PM Haochen Jiang  wrote:
>
> Hi all,
>
> This patch fixes two obvious bugs in the current evex512 implementation.
>
> Also, I moved the AVX512CD+AVX512VL part out of AVX512VL to avoid
> accidentally missing avx512cd handling in the future.
>
> Ok for trunk?
Ok.
>
> BRs,
> Haochen
>
> gcc/ChangeLog:
>
> * config/i386/avx512cdintrin.h (target): Push evex512 for
> avx512cd.
> * config/i386/avx512vlintrin.h (target): Split avx512cdvl part
> out from avx512vl.
> * config/i386/i386-builtin.def (BDESC): Do not check evex512
> for builtins not needed.
> ---
>  gcc/config/i386/avx512cdintrin.h |2 +-
>  gcc/config/i386/avx512vlintrin.h | 1792 +++---
>  gcc/config/i386/i386-builtin.def |4 +-
>  3 files changed, 899 insertions(+), 899 deletions(-)
>
> diff --git a/gcc/config/i386/avx512cdintrin.h 
> b/gcc/config/i386/avx512cdintrin.h
> index a5f5eabb68d..56a786aa9a3 100644
> --- a/gcc/config/i386/avx512cdintrin.h
> +++ b/gcc/config/i386/avx512cdintrin.h
> @@ -30,7 +30,7 @@
>
>  #ifndef __AVX512CD__
>  #pragma GCC push_options
> -#pragma GCC target("avx512cd")
> +#pragma GCC target("avx512cd,evex512")
>  #define __DISABLE_AVX512CD__
>  #endif /* __AVX512CD__ */
>
> diff --git a/gcc/config/i386/avx512vlintrin.h 
> b/gcc/config/i386/avx512vlintrin.h
> index 08e49e8d8ab..a40aa91b948 100644
> --- a/gcc/config/i386/avx512vlintrin.h
> +++ b/gcc/config/i386/avx512vlintrin.h
> @@ -8396,1281 +8396,1003 @@ _mm_mask_min_epu32 (__m128i __W, __mmask8 __M, 
> __m128i __A,
>   (__v4si) __W, __M);
>  }
>
> -#ifndef __AVX512CD__
> -#pragma GCC push_options
> -#pragma GCC target("avx512vl,avx512cd")
> -#define __DISABLE_AVX512VLCD__
> -#endif
> -
> -extern __inline __m128i
> +extern __inline __m256d
>  __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> -_mm_broadcastmb_epi64 (__mmask8 __A)
> +_mm256_mask_unpacklo_pd (__m256d __W, __mmask8 __U, __m256d __A,
> +__m256d __B)
>  {
> -  return (__m128i) __builtin_ia32_broadcastmb128 (__A);
> +  return (__m256d) __builtin_ia32_unpcklpd256_mask ((__v4df) __A,
> +   (__v4df) __B,
> +   (__v4df) __W,
> +   (__mmask8) __U);
>  }
>
> -extern __inline __m256i
> +extern __inline __m256d
>  __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> -_mm256_broadcastmb_epi64 (__mmask8 __A)
> +_mm256_maskz_unpacklo_pd (__mmask8 __U, __m256d __A, __m256d __B)
>  {
> -  return (__m256i) __builtin_ia32_broadcastmb256 (__A);
> +  return (__m256d) __builtin_ia32_unpcklpd256_mask ((__v4df) __A,
> +   (__v4df) __B,
> +   (__v4df)
> +   _mm256_setzero_pd (),
> +   (__mmask8) __U);
>  }
>
> -extern __inline __m128i
> +extern __inline __m128d
>  __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> -_mm_broadcastmw_epi32 (__mmask16 __A)
> +_mm_mask_unpacklo_pd (__m128d __W, __mmask8 __U, __m128d __A,
> + __m128d __B)
>  {
> -  return (__m128i) __builtin_ia32_broadcastmw128 (__A);
> +  return (__m128d) __builtin_ia32_unpcklpd128_mask ((__v2df) __A,
> +   (__v2df) __B,
> +   (__v2df) __W,
> +   (__mmask8) __U);
>  }
>
> -extern __inline __m256i
> +extern __inline __m128d
>  __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> -_mm256_broadcastmw_epi32 (__mmask16 __A)
> +_mm_maskz_unpacklo_pd (__mmask8 __U, __m128d __A, __m128d __B)
>  {
> -  return (__m256i) __builtin_ia32_broadcastmw256 (__A);
> +  return (__m128d) __builtin_ia32_unpcklpd128_mask ((__v2df) __A,
> +   (__v2df) __B,
> +   (__v2df)
> +   _mm_setzero_pd (),
> +   (__mmask8) __U);
>  }
>
> -extern __inline __m256i
> +extern __inline __m256
>  __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> -_mm256_lzcnt_epi32 (__m256i __A)
> +_mm256_mask_unpacklo_ps (__m256 __W, __mmask8 __U, __m256 __A,
> +__m256 __B)
>  {
> -  return (__m256i) __builtin_ia32_vplzcntd_256_mask ((__v8si) __A,
> -(__v8si)
> -_mm256_setzero_si256 (),
> -(__mmask8) -1);
> +  return (__m256) __builtin_ia32_unpcklps256_mask ((__v8sf) __A,
> + 

Re: [PATCH 0/4] Fix no-evex512 function attribute

2023-10-31 Thread Hongtao Liu
On Tue, Oct 31, 2023 at 2:39 PM Haochen Jiang  wrote:
>
> Hi all,
>
> These four patches fix the no-evex512 function attribute. The details
> of the issue are described here:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111889
>
> My proposal for this problem is to also push "no-evex512" when defining
> 128/256 intrins in AVX512.
>
> Besides, I added some new intrins to support the current AVX512 intrins.
> The newly added  _mm{,256}_avx512* intrins are duplicated from their
> _mm{,256}_* forms from AVX2 or before. We need to add them to prevent target
> option mismatch when calling AVX512 intrins implemented with these intrins
> under no-evex512 function attribute. All AVX512 intrins calling those AVX2
> intrins or before will change their calls to these newly added AVX512 version.
>
> This will solve the problem when we are using no-evex512 attribute with
> AVX512 related intrins. But it will not solve target option mismatch when we
> are calling AVX2 intrins or before with no-evex512 function attribute since as
> mentioned in PR111889, it actually comes from a legacy issue. Therefore, we
> are not expecting that usage.
>
> Regtested on x86_64-pc-linux-gnu. Ok for trunk?
Ok, but please wait for 2 more days in case other folks have any comments.
>
> Thx,
> Haochen
>
>


-- 
BR,
Hongtao


Re: [RFC, RFA PATCH] i386: Handle multiple address register classes

2023-11-03 Thread Hongtao Liu
On Fri, Nov 3, 2023 at 6:34 PM Uros Bizjak  wrote:
>
> The patch generalizes address register class handling to allow multiple
> address register classes.  For APX EGPR targets, some instructions can't be
> encoded with REX2 prefix, so it is necessary to limit address register
> class to avoid REX2 registers.  The same situation happens for instructions
> with high registers, where the REX register can not be used in the address,
> so the existing infrastructure can be adapted to also handle this case.
>
> The patch is mostly a mechanical rename of "gpr32" attribute to "addr" and
> introduces no functional changes, although it fixes a couple of inconsistent
> attribute values in passing.

@@ -22569,9 +22578,8 @@ (define_insn "_mpsadbw"
mpsadbw\t{%3, %2, %0|%0, %2, %3}
vmpsadbw\t{%3, %2, %1, %0|%0, %1, %2, %3}"
   [(set_attr "isa" "noavx,noavx,avx")
-   (set_attr "gpr32" "0,0,1")
+   (set_attr "addr" "rex")
(set_attr "type" "sselog1")
-   (set_attr "gpr32" "0")
(set_attr "length_immediate" "1")
(set_attr "prefix_extra" "1")
(set_attr "prefix" "orig,orig,vex")

I believe your fix is correct.

>
> A follow-up patch will use the above infrastructure to limit address register
> class to legacy registers for instructions with high registers.

The patch looks good to me, but please leave some time for Hongyu in
case he has any comments.

>
> gcc/ChangeLog:
>
> * config/i386/i386.cc (ix86_memory_address_use_extended_reg_class_p):
> Rename to ...
> (ix86_memory_address_reg_class): ... this.  Generalize address
> register class handling to allow multiple address register classes.
> Return maximal class for unrecognized instructions.  Improve comments.
> (ix86_insn_base_reg_class): Rewrite to handle
> multiple address register classes.
> (ix86_regno_ok_for_insn_base_p): Ditto.
> (ix86_insn_index_reg_class): Ditto.
> * config/i386/i386.md: Rename "gpr32" attribute to "addr"
> and substitute its values with "0" -> "rex", "1" -> "*".
> (addr): New attribute to limit allowed address register set.
> (gpr32): Remove.
> * config/i386/mmx.md: Rename "gpr32" attribute to "addr"
> and substitute its values with "0" -> "rex", "1" -> "*".
> * config/i386/sse.md: Ditto.
>
> Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
>
> Comments welcome.
>
> Uros.



-- 
BR,
Hongtao


Re: [PATCH 5/5] x86: yet more PR target/100711-like splitting

2023-11-06 Thread Hongtao Liu
On Mon, Nov 6, 2023 at 7:10 PM Jan Beulich  wrote:
>
> On 25.06.2023 08:41, Hongtao Liu wrote:
> > On Sun, Jun 25, 2023 at 2:35 PM Hongtao Liu  wrote:
> >>
> >> On Sun, Jun 25, 2023 at 2:25 PM Jan Beulich  wrote:
> >>>
> >>> On 25.06.2023 07:12, Hongtao Liu wrote:
> >>>> On Wed, Jun 21, 2023 at 2:29 PM Jan Beulich via Gcc-patches
> >>>>  wrote:
> >>>>>
> >>>>> ---
> >>>>> For the purpose here (and elsewhere) bcst_vector_operand() (really:
> >>>>> bcst_mem_operand()) isn't permissive enough: We'd want it to allow
> >>>>> 128-bit and 256-bit types as well irrespective of AVX512VL being
> >>>>> enabled. This would likely require a new predicate
> >>>>> (bcst_intvec_operand()?) and a new constraint (BR? Bi?). (Yet for name
> >>>>> selection it will want considering that this is applicable to certain
> >>>>> non-calculational FP operations as well.)
> >>>> I think so.
> >>>
> >>> Any preference towards predicate and constraint naming?
> >> something like bcst_mem_operand_$suffix, $suffix indicates the
> >> pattern may use zmm instruction for 128/256-bit operand.
> >> maybe just bcst_mem_operand_zmm?
> > For constraint, maybe we can reuse Br, relax Br to match 
> > bcst_mem_operand_zmm.
> > For those original patterns with bcst_mem_operand, it should be ok
> > since it's already guarded by the predicate, the constraint must be
> > valid.
>
> Hmm, I wanted to get back to this, but then I started wondering about this
> reply of yours vs your request to not go farther with the use of "oversized"
> insns (i.e. acting in 512-bit registers in lieu of AVX512VL being enabled,
> when no FP exceptions can be raised on the otherwise unused elements). Since
> iirc the latter came later, am I right in assuming we then also shouldn't go
> the route outlined above?
Right, we shouldn't go that route.
That earlier reply was just an answer on how to do it technically, but we
don't really want to do it (considering that all AVX512 processors after
SKX support AVX512VL).
>
> Jan



-- 
BR,
Hongtao


[PATCH target/89071] Fix false dependence of scalar operations vrcp/vsqrt/vrsqrt/vrndscale

2019-10-22 Thread Hongtao Liu
Hi uros:
  This patch fixes false dependence of scalar operations
vrcp/vsqrt/vrsqrt/vrndscale.
  Bootstrap ok, regression test on i386/x86 ok.

  It does something like this:
-
For scalar instructions with both xmm operands:

op %xmmN,%xmmQ,%xmmQ ---> op %xmmN, %xmmN, %xmmQ

for scalar instructions with one mem or gpr operand:

op mem/gpr, %xmmQ, %xmmQ

--->  using pass rpad --->

xorps %xmmN, %xmmN, %xmmN
op mem/gpr, %xmmN, %xmmQ

Performance influence of SPEC2017 fprate which is tested on SKX

503.bwaves_r -0.03%
507.cactuBSSN_r -0.22%
508.namd_r -0.02%
510.parest_r 0.37%
511.povray_r 0.74%
519.lbm_r 0.24%
521.wrf_r 2.35%
526.blender_r 0.71%
527.cam4_r 0.65%
538.imagick_r 0.95%
544.nab_r -0.37%
549.fotonik3d_r 0.24%
554.roms_r 0.90%
fprate geomean 0.50%
-

Changelog
gcc/
* config/i386/i386.md (*rcpsf2_sse): Add
avx_partial_xmm_update, prefer m constraint for TARGET_AVX.
(*rsqrtsf2_sse): Ditto.
(*sqrt2_sse): Ditto.
(sse4_1_round2): Separate constraint vm, add
avx_partial_xmm_update, prefer m constraint for TARGET_AVX.
* config/i386/sse.md (*sse_vmrcpv4sf2): New define_insn used
by pass rpad.
(*_vmsqrt2*):
Ditto.
(*sse_vmrsqrtv4sf2): Ditto.
(*avx512f_rndscale): Ditto.
(*sse4_1_round): Ditto.

gcc/testsuite
* gcc.target/i386/pr87007-4.c: New test.
* gcc.target/i386/pr87007-5.c: Ditto.


-- 
BR,
Hongtao
From 2db08bff3fb9e2720c6c57a52e6f51c990d1a57f Mon Sep 17 00:00:00 2001
From: liuhongt 
Date: Wed, 9 Oct 2019 11:21:25 +0800
Subject: [PATCH] Fix false dependence of scalar operation
 vrcp/vsqrt/vrsqrt/vrndscale

For instructions with xmm operand:

op %xmmN,%xmmQ,%xmmQ ---> op %xmmN, %xmmN, %xmmQ

for instruction with mem operand or gpr operand:

op mem/gpr, %xmmQ, %xmmQ

--->  using pass rpad --->

xorps %xmmN, %xmmN, %xmmN
op mem/gpr, %xmmN, %xmmQ

Performance influence of SPEC2017 fprate which is tested on SKX

503.bwaves_r	-0.03%
507.cactuBSSN_r -0.22%
508.namd_r	-0.02%
510.parest_r	0.37%
511.povray_r	0.74%
519.lbm_r	0.24%
521.wrf_r	2.35%
526.blender_r	0.71%
527.cam4_r	0.65%
538.imagick_r	0.95%
544.nab_r	-0.37%
549.fotonik3d_r 0.24%
554.roms_r	0.90%
fprate geomean	0.50%
-

Changelog
gcc/
	* config/i386/i386.md (*rcpsf2_sse): Add
	avx_partial_xmm_update, prefer m constraint for TARGET_AVX.
	(*rsqrtsf2_sse): Ditto.
	(*sqrt2_sse): Ditto.
	(sse4_1_round2): Separate constraint vm, add
	avx_partial_xmm_update, prefer m constraint for TARGET_AVX.
	* config/i386/sse.md (*sse_vmrcpv4sf2): New define_insn used
	by pass rpad.
	(*_vmsqrt2*):
	Ditto.
	(*sse_vmrsqrtv4sf2): Ditto.
	(*avx512f_rndscale): Ditto.
	(*sse4_1_round): Ditto.

gcc/testsuite
	* gcc.target/i386/pr87007-4.c: New test.
	* gcc.target/i386/pr87007-5.c: Ditto.
---
 gcc/config/i386/i386.md   | 27 ---
 gcc/config/i386/sse.md| 95 +++
 gcc/testsuite/gcc.target/i386/pr87007-4.c | 18 +
 gcc/testsuite/gcc.target/i386/pr87007-5.c | 18 +
 4 files changed, 147 insertions(+), 11 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr87007-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr87007-5.c

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 5e0795953d8..ab785d3d6d7 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -14843,11 +14843,12 @@
(set_attr "btver2_sse_attr" "rcp")
(set_attr "prefix" "maybe_vex")
(set_attr "mode" "SF")
+   (set_attr "avx_partial_xmm_update" "false,false,true")
(set (attr "preferred_for_speed")
  (cond [(eq_attr "alternative" "1")
 	  (symbol_ref "TARGET_AVX || !TARGET_SSE_PARTIAL_REG_DEPENDENCY")
 	   (eq_attr "alternative" "2")
-	  (symbol_ref "!TARGET_SSE_PARTIAL_REG_DEPENDENCY")
+	  (symbol_ref "TARGET_AVX || !TARGET_SSE_PARTIAL_REG_DEPENDENCY")
 	   ]
 	   (symbol_ref "true")))])
 
@@ -15089,11 +15090,12 @@
(set_attr "btver2_sse_attr" "rcp")
(set_attr "prefix" "maybe_vex")
(set_attr "mode" "SF")
+   (set_attr "avx_partial_xmm_update" "false,false,true")
(set (attr "preferred_for_speed")
  (cond [(eq_attr "alternative" "1")
 	  (symbol_ref "TARGET_AVX || !TARGET_SSE_PARTIAL_REG_DEPENDENCY")
 	   (eq_attr "alternative" "2")
-	  (symbol_ref "!TARGET_SSE_PARTIAL_REG_DEPENDENCY")
+	  (symbol_ref "TARGET_AVX || !TARGET_SSE_PARTIAL_REG_DEPENDENCY")
 	   ]
 	   (symbol_ref "true")))])
 
@@ -15120,12 +15122,13 @@
(set_attr "atom_sse_attr" "sqrt")
(set_attr "btver2_sse_attr" "sqrt")
(set_attr "prefix" "maybe_vex")
+   (set_attr "avx_partial_xmm_update" "false,false,true")
(set_attr "mode" "")
(set (attr "preferred_for_speed")
  (cond [(eq_attr "alternative" "1")
 	  (symbol_ref "TARGET_AVX || !TARGET_SSE_PARTIAL_REG_DEPENDENCY")
 	   (eq_attr "alternative" "2")
-	  (symbol_ref "!TARGET_SSE_PARTIAL_REG_DEPENDENCY")
+	  (symbol_ref "TARGET_AVX || !TARGET_SSE_PARTIAL_REG_DEPENDENCY")
 	  

Re: [PATCH target/89071] Fix false dependence of scalar operations vrcp/vsqrt/vrsqrt/vrndscale

2019-10-22 Thread Hongtao Liu
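Update patch:
Add m constraint to define_insn (sse_1_round, *sse_1_round)
when under sse4 but not avx512f.

Changelog:
gcc/
* config/i386/sse.md (sse4_1_round): Change constraint x to xm
since vround supports a memory operand.
(*sse4_1_round): Ditto.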

Bootstrap and regression test ok.

On Wed, Oct 23, 2019 at 9:56 AM Hongtao Liu  wrote:
>
> Hi uros:
>   This patch fixes false dependence of scalar operations
> vrcp/vsqrt/vrsqrt/vrndscale.
>   Bootstrap ok, regression test on i386/x86 ok.
>
>   It does something like this:
> -
> For scalar instructions with both xmm operands:
>
> op %xmmN,%xmmQ,%xmmQ ---> op %xmmN, %xmmN, %xmmQ
>
> for scalar instructions with one mem or gpr operand:
>
> op mem/gpr, %xmmQ, %xmmQ
>
> --->  using pass rpad --->
>
> xorps %xmmN, %xmmN, %xmmN
> op mem/gpr, %xmmN, %xmmQ
>
> Performance influence of SPEC2017 fprate which is tested on SKX
>
> 503.bwaves_r -0.03%
> 507.cactuBSSN_r -0.22%
> 508.namd_r -0.02%
> 510.parest_r 0.37%
> 511.povray_r 0.74%
> 519.lbm_r 0.24%
> 521.wrf_r 2.35%
> 526.blender_r 0.71%
> 527.cam4_r 0.65%
> 538.imagick_r 0.95%
> 544.nab_r -0.37
> 549.fotonik3d_r 0.24%
> 554.roms_r 0.90%
> fprate geomean 0.50%
> -
>
> Changelog
> gcc/
> * config/i386/i386.md (*rcpsf2_sse): Add
> avx_partial_xmm_update, prefer m constraint for TARGET_AVX.
> (*rsqrtsf2_sse): Ditto.
> (*sqrt2_sse): Ditto.
> (sse4_1_round2): separate constraint vm, add
> avx_partail_xmm_update, prefer m constraint for TARGET_AVX.
> * config/i386/sse.md (*sse_vmrcpv4sf2"): New define_insn used
> by pass rpad.
> (*_vmsqrt2*):
> Ditto.
> (*sse_vmrsqrtv4sf2): Ditto.
> (*avx512f_rndscale): Ditto.
> (*sse4_1_round): Ditto.
>
> gcc/testsuite
> * gcc.target/i386/pr87007-4.c: New test.
> * gcc.target/i386/pr87007-5.c: Ditto.
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao
From 23c11846a2dd3eeed32174a7334eb888eb576353 Mon Sep 17 00:00:00 2001
From: liuhongt 
Date: Wed, 9 Oct 2019 11:21:25 +0800
Subject: [PATCH] Fix false dependence of scalar operation
 vrcp/vsqrt/vrsqrt/vrndscale

For instructions with xmm operand:

op %xmmN,%xmmQ,%xmmQ ---> op %xmmN, %xmmN, %xmmQ

for instruction with mem operand or gpr operand:

op mem/gpr, %xmmQ, %xmmQ

--->  using pass rpad --->

xorps %xmmN, %xmmN, %xmmN
op mem/gpr, %xmmN, %xmmQ

Performance influence of SPEC2017 fprate which is tested on SKX

503.bwaves_r	-0.03%
507.cactuBSSN_r -0.22%
508.namd_r	-0.02%
510.parest_r	0.37%
511.povray_r	0.74%
519.lbm_r	0.24%
521.wrf_r	2.35%
526.blender_r	0.71%
527.cam4_r	0.65%
538.imagick_r	0.95%
544.nab_r	-0.37%
549.fotonik3d_r 0.24%
554.roms_r	0.90%
fprate geomean	0.50%
-

Changelog
gcc/
	* config/i386/i386.md (*rcpsf2_sse): Add
	avx_partial_xmm_update, prefer m constraint for TARGET_AVX.
	(*rsqrtsf2_sse): Ditto.
	(*sqrt2_sse): Ditto.
	(sse4_1_round2): Separate constraint vm, add
	avx_partial_xmm_update, prefer m constraint for TARGET_AVX.
	* config/i386/sse.md (*sse_vmrcpv4sf2): New define_insn used
	by pass rpad.
	(*_vmsqrt2*):
	Ditto.
	(*sse_vmrsqrtv4sf2): Ditto.
	(*avx512f_rndscale): Ditto.
	(*sse4_1_round): Ditto.
	(sse4_1_round): Change constraint x to xm
	since vround supports a memory operand.

gcc/testsuite
	* gcc.target/i386/pr87007-4.c: New test.
	* gcc.target/i386/pr87007-5.c: Ditto.
---
 gcc/config/i386/i386.md   | 27 ---
 gcc/config/i386/sse.md| 97 ++-
 gcc/testsuite/gcc.target/i386/pr87007-4.c | 18 +
 gcc/testsuite/gcc.target/i386/pr87007-5.c | 18 +
 4 files changed, 148 insertions(+), 12 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr87007-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr87007-5.c

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 5e0795953d8..ab785d3d6d7 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -14843,11 +14843,12 @@
(set_attr "btver2_sse_attr" "rcp")
(set_attr "prefix" "maybe_vex")
(set_attr "mode" "SF")
+   (set_attr "avx_partial_xmm_update" "false,false,true")
(set (attr "preferred_for_speed")
  (cond [(eq_attr "alternative" "1")
 	  (symbol_ref "TARGET_AVX || !TARGET_SSE_PARTIAL_REG_DEPENDENCY")
 	   (eq_attr "alternative" "2")
-	  (symbol_ref "!TARGET_SSE_PARTIAL_REG_DEPENDENCY")
+	  (symbol_ref "TARGET_AVX || !TARGET_SSE_PARTIAL_REG_DEPENDENCY")
 	   ]
 	   (symbol_ref "true")))])
 
@@ -15089,11 +15090,12 @@
(set_attr "btver2_sse_attr" "rcp")
(set_attr "prefix" "maybe_vex")
(set_attr "mode" "SF")
+   (set_attr "avx_p

Re: [PATCH target/89071] Fix false dependence of scalar operations vrcp/vsqrt/vrsqrt/vrndscale

2019-10-24 Thread Hongtao Liu
On Fri, Oct 25, 2019 at 2:39 AM Uros Bizjak  wrote:
>
> On Wed, Oct 23, 2019 at 7:48 AM Hongtao Liu  wrote:
> >
> > Update patch:
> > Add m constraint to define_insn (sse_1_round, *sse_1_round)
> > when under sse4 but not avx512f.
>
> It looks to me that the original insn is incompletely defined. It
> should use nonimmediate_operand, "m" constraint and the iptr pointer
> size modifier. Something like:
>
> (define_insn "sse4_1_round"
>   [(set (match_operand:VF_128 0 "register_operand" "=Yr,*x,x,v")
> (vec_merge:VF_128
>   (unspec:VF_128
> [(match_operand:VF_128 2 "nonimmediate_operand" "Yrm,*xm,xm,vm")
>  (match_operand:SI 3 "const_0_to_15_operand" "n,n,n,n")]
> UNSPEC_ROUND)
>   (match_operand:VF_128 1 "register_operand" "0,0,x,v")
>   (const_int 1)))]
>   "TARGET_SSE4_1"
>   "@
>round\t{%3, %2, %0|%0, %2, %3}
>round\t{%3, %2, %0|%0, %2, %3}
>vround\t{%3, %2, %1, %0|%0, %1, %2, %3}
>vrndscale\t{%3, %2, %1, %0|%0, %1, %2, %3}"
>
> >
> > Changelog:
> > gcc/
> > * config/i386/sse.md:  (sse4_1_round):
> > Change constraint x to xm
> > since vround support memory operand.
> > * (*sse4_1_round): Ditto.
> >
> > Bootstrap and regression test ok.
> >
> > On Wed, Oct 23, 2019 at 9:56 AM Hongtao Liu  wrote:
> > >
> > > Hi uros:
> > >   This patch fixes false dependence of scalar operations
> > > vrcp/vsqrt/vrsqrt/vrndscale.
> > >   Bootstrap ok, regression test on i386/x86 ok.
> > >
> > >   It does something like this:
> > > -
> > > For scalar instructions with both xmm operands:
> > >
> > > op %xmmN,%xmmQ,%xmmQ ---> op %xmmN, %xmmN, %xmmQ
> > >
> > > for scalar instructions with one mem or gpr operand:
> > >
> > > op mem/gpr, %xmmQ, %xmmQ
> > >
> > > --->  using pass rpad --->
> > >
> > > xorps %xmmN, %xmmN, %xmmN
> > > op mem/gpr, %xmmN, %xmmQ
> > >
> > > Performance influence of SPEC2017 fprate which is tested on SKX
> > >
> > > 503.bwaves_r -0.03%
> > > 507.cactuBSSN_r -0.22%
> > > 508.namd_r -0.02%
> > > 510.parest_r 0.37%
> > > 511.povray_r 0.74%
> > > 519.lbm_r 0.24%
> > > 521.wrf_r 2.35%
> > > 526.blender_r 0.71%
> > > 527.cam4_r 0.65%
> > > 538.imagick_r 0.95%
> > > 544.nab_r -0.37
> > > 549.fotonik3d_r 0.24%
> > > 554.roms_r 0.90%
> > > fprate geomean 0.50%
> > > -
> > >
> > > Changelog
> > > gcc/
> > > * config/i386/i386.md (*rcpsf2_sse): Add
> > > avx_partial_xmm_update, prefer m constraint for TARGET_AVX.
> > > (*rsqrtsf2_sse): Ditto.
> > > (*sqrt2_sse): Ditto.
> > > (sse4_1_round2): separate constraint vm, add
> > > avx_partail_xmm_update, prefer m constraint for TARGET_AVX.
> > > * config/i386/sse.md (*sse_vmrcpv4sf2"): New define_insn used
> > > by pass rpad.
> > > (*_vmsqrt2*):
> > > Ditto.
> > > (*sse_vmrsqrtv4sf2): Ditto.
> > > (*avx512f_rndscale): Ditto.
> > > (*sse4_1_round): Ditto.
> > >
> > > gcc/testsuite
> > > * gcc.target/i386/pr87007-4.c: New test.
> > > * gcc.target/i386/pr87007-5.c: Ditto.
> > >
> > >
> > > --
> > > BR,
> > > Hongtao
>
> (set (attr "preferred_for_speed")
>   (cond [(eq_attr "alternative" "1")
>(symbol_ref "TARGET_AVX || !TARGET_SSE_PARTIAL_REG_DEPENDENCY")
> (eq_attr "alternative" "2")
> -  (symbol_ref "!TARGET_SSE_PARTIAL_REG_DEPENDENCY")
> +  (symbol_ref "TARGET_AVX || !TARGET_SSE_PARTIAL_REG_DEPENDENCY")
> ]
> (symbol_ref "true")))])
>
> This can be written as:
>
> (set (attr "preferred_for_speed")
>   (cond [(match_test "TARGET_AVX")
>(symbol_ref "true")
> (eq_attr "alternative" "1,2")
>   (symbol_ref "!TARGET_SSE_PARTIAL_REG_DEPENDENCY")
> ]
> (symbol_ref "true")))])
>
> Uros.

Yes. After these are fixed, I'll upstream it to trunk, OK?
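
For the record, the two `preferred_for_speed` forms quoted above can be checked against each other exhaustively. The following is a small C sketch of that check; the function names are made up for illustration, and the booleans stand in for TARGET_AVX and TARGET_SSE_PARTIAL_REG_DEPENDENCY.

```c
#include <stdbool.h>

/* Form from the patch: alternatives 1 and 2 each test
   TARGET_AVX || !TARGET_SSE_PARTIAL_REG_DEPENDENCY; any other
   alternative is always preferred.  */
bool
preferred_patch (int alternative, bool avx, bool partial_reg_dep)
{
  if (alternative == 1)
    return avx || !partial_reg_dep;
  if (alternative == 2)
    return avx || !partial_reg_dep;
  return true;
}

/* Suggested form: test TARGET_AVX first, then handle alternatives
   1 and 2 with a single condition.  */
bool
preferred_suggested (int alternative, bool avx, bool partial_reg_dep)
{
  if (avx)
    return true;
  if (alternative == 1 || alternative == 2)
    return !partial_reg_dep;
  return true;
}

/* Return true if both forms agree for every alternative (0..2) and
   every combination of the two flags.  */
bool
forms_agree (void)
{
  for (int alt = 0; alt <= 2; alt++)
    for (int avx = 0; avx <= 1; avx++)
      for (int dep = 0; dep <= 1; dep++)
        if (preferred_patch (alt, avx, dep)
            != preferred_suggested (alt, avx, dep))
          return false;
  return true;
}
```

Both forms give the same answer in all twelve cases, so the rewrite is purely a simplification.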
-- 
BR,
Hongtao


Re: [PATCH target/89071] Fix false dependence of scalar operations vrcp/vsqrt/vrsqrt/vrndscale

2019-10-24 Thread Hongtao Liu
On Fri, Oct 25, 2019 at 1:23 PM Hongtao Liu  wrote:
>
> On Fri, Oct 25, 2019 at 2:39 AM Uros Bizjak  wrote:
> >
> > On Wed, Oct 23, 2019 at 7:48 AM Hongtao Liu  wrote:
> > >
> > > Update patch:
> > > Add m constraint to define_insn (sse_1_round, *sse_1_round)
> > > when under sse4 but not avx512f.
> >
> > It looks to me that the original insn is incompletely defined. It
> > should use nonimmediate_operand, "m" constraint and  pointer
> > size modifier. Something like:
> >
> > (define_insn "sse4_1_round"
> >   [(set (match_operand:VF_128 0 "register_operand" "=Yr,*x,x,v")
> > (vec_merge:VF_128
> >   (unspec:VF_128
> > [(match_operand:VF_128 2 "nonimmediate_operand" "Yrm,*xm,xm,vm")
> >  (match_operand:SI 3 "const_0_to_15_operand" "n,n,n,n")]
> > UNSPEC_ROUND)
> >   (match_operand:VF_128 1 "register_operand" "0,0,x,v")
> >   (const_int 1)))]
> >   "TARGET_SSE4_1"
> >   "@
> >round\t{%3, %2, %0|%0, %2, %3}
> >round\t{%3, %2, %0|%0, %2, %3}
> >vround\t{%3, %2, %1, %0|%0, %1, %2, %3}
> >vrndscale\t{%3, %2, %1, %0|%0, %1, %2, %3}"
> >
> > >
> > > Changelog:
> > > gcc/
> > > * config/i386/sse.md:  (sse4_1_round):
> > > Change constraint x to xm
> > > since vround support memory operand.
> > > * (*sse4_1_round): Ditto.
> > >
> > > Bootstrap and regression test ok.
> > >
> > > On Wed, Oct 23, 2019 at 9:56 AM Hongtao Liu  wrote:
> > > >
> > > > Hi uros:
> > > >   This patch fixes false dependence of scalar operations
> > > > vrcp/vsqrt/vrsqrt/vrndscale.
> > > >   Bootstrap ok, regression test on i386/x86 ok.
> > > >
> > > >   It does something like this:
> > > > -
> > > > For scalar instructions with both xmm operands:
> > > >
> > > > op %xmmN,%xmmQ,%xmmQ ---> op %xmmN, %xmmN, %xmmQ
> > > >
> > > > for scalar instructions with one mem or gpr operand:
> > > >
> > > > op mem/gpr, %xmmQ, %xmmQ
> > > >
> > > > --->  using pass rpad --->
> > > >
> > > > xorps %xmmN, %xmmN, %xmmN
> > > > op mem/gpr, %xmmN, %xmmQ
> > > >
> > > > Performance influence of SPEC2017 fprate which is tested on SKX
> > > >
> > > > 503.bwaves_r -0.03%
> > > > 507.cactuBSSN_r -0.22%
> > > > 508.namd_r -0.02%
> > > > 510.parest_r 0.37%
> > > > 511.povray_r 0.74%
> > > > 519.lbm_r 0.24%
> > > > 521.wrf_r 2.35%
> > > > 526.blender_r 0.71%
> > > > 527.cam4_r 0.65%
> > > > 538.imagick_r 0.95%
> > > > 544.nab_r -0.37
> > > > 549.fotonik3d_r 0.24%
> > > > 554.roms_r 0.90%
> > > > fprate geomean 0.50%
> > > > -
> > > >
> > > > Changelog
> > > > gcc/
> > > > * config/i386/i386.md (*rcpsf2_sse): Add
> > > > avx_partial_xmm_update, prefer m constraint for TARGET_AVX.
> > > > (*rsqrtsf2_sse): Ditto.
> > > > (*sqrt2_sse): Ditto.
> > > > (sse4_1_round2): separate constraint vm, add
> > > > avx_partail_xmm_update, prefer m constraint for TARGET_AVX.
> > > > * config/i386/sse.md (*sse_vmrcpv4sf2"): New define_insn used
> > > > by pass rpad.
> > > > (*_vmsqrt2*):
> > > > Ditto.
> > > > (*sse_vmrsqrtv4sf2): Ditto.
> > > > (*avx512f_rndscale): Ditto.
> > > > (*sse4_1_round): Ditto.
> > > >
> > > > gcc/testsuite
> > > > * gcc.target/i386/pr87007-4.c: New test.
> > > > * gcc.target/i386/pr87007-5.c: Ditto.
> > > >
> > > >
> > > > --
> > > > BR,
> > > > Hongtao
> >
> > (set (attr "preferred_for_speed")
> >   (cond [(eq_attr "alternative" "1")
> >(symbol_ref "TARGET_AVX || !TARGET_SSE_PARTIAL_REG_DEPENDENCY")
> > (eq_attr "alternative" "2")
> > -  (symbol_ref "!TARGET_SSE_PARTIAL_REG_DEPENDENCY")
> > +  (symbol_ref "TARGET_AVX || !TARGET_SSE_PARTIAL_

Re: [PATCH target/89071] Fix false dependence of scalar operations vrcp/vsqrt/vrsqrt/vrndscale

2019-10-25 Thread Hongtao Liu
Update patch.

On Fri, Oct 25, 2019 at 4:01 PM Uros Bizjak  wrote:
>
> On Fri, Oct 25, 2019 at 7:55 AM Hongtao Liu  wrote:
> >
> > On Fri, Oct 25, 2019 at 1:23 PM Hongtao Liu  wrote:
> > >
> > > On Fri, Oct 25, 2019 at 2:39 AM Uros Bizjak  wrote:
> > > >
> > > > On Wed, Oct 23, 2019 at 7:48 AM Hongtao Liu  wrote:
> > > > >
> > > > > Update patch:
> > > > > Add m constraint to define_insn (sse_1_round, *sse_1_round)
> > > > > when under sse4 but not avx512f.
> > > >
> > > > It looks to me that the original insn is incompletely defined. It
> > > > should use nonimmediate_operand, "m" constraint and  pointer
> > > > size modifier. Something like:
> > > >
> > > > (define_insn "sse4_1_round"
> > > >   [(set (match_operand:VF_128 0 "register_operand" "=Yr,*x,x,v")
> > > > (vec_merge:VF_128
> > > >   (unspec:VF_128
> > > > [(match_operand:VF_128 2 "nonimmediate_operand" "Yrm,*xm,xm,vm")
> > > >  (match_operand:SI 3 "const_0_to_15_operand" "n,n,n,n")]
> > > > UNSPEC_ROUND)
> > > >   (match_operand:VF_128 1 "register_operand" "0,0,x,v")
> > > >   (const_int 1)))]
> > > >   "TARGET_SSE4_1"
> > > >   "@
> > > >round\t{%3, %2, %0|%0, %2, %3}
> > > >round\t{%3, %2, %0|%0, %2, %3}
> > > >vround\t{%3, %2, %1, %0|%0, %1, %2, %3}
> > > >vrndscale\t{%3, %2, %1, %0|%0, %1, %2, %3}"
> > > >
> > > > >
> > > > > Changelog:
> > > > > gcc/
> > > > > * config/i386/sse.md:  (sse4_1_round):
> > > > > Change constraint x to xm
> > > > > since vround support memory operand.
> > > > > * (*sse4_1_round): Ditto.
> > > > >
> > > > > Bootstrap and regression test ok.
> > > > >
> > > > > On Wed, Oct 23, 2019 at 9:56 AM Hongtao Liu  wrote:
> > > > > >
> > > > > > Hi uros:
> > > > > >   This patch fixes false dependence of scalar operations
> > > > > > vrcp/vsqrt/vrsqrt/vrndscale.
> > > > > >   Bootstrap ok, regression test on i386/x86 ok.
> > > > > >
> > > > > >   It does something like this:
> > > > > > -
> > > > > > For scalar instructions with both xmm operands:
> > > > > >
> > > > > > op %xmmN,%xmmQ,%xmmQ ---> op %xmmN, %xmmN, %xmmQ
> > > > > >
> > > > > > for scalar instructions with one mem or gpr operand:
> > > > > >
> > > > > > op mem/gpr, %xmmQ, %xmmQ
> > > > > >
> > > > > > --->  using pass rpad --->
> > > > > >
> > > > > > xorps %xmmN, %xmmN, %xmmN
> > > > > > op mem/gpr, %xmmN, %xmmQ
> > > > > >
> > > > > > Performance influence of SPEC2017 fprate which is tested on SKX
> > > > > >
> > > > > > 503.bwaves_r -0.03%
> > > > > > 507.cactuBSSN_r -0.22%
> > > > > > 508.namd_r -0.02%
> > > > > > 510.parest_r 0.37%
> > > > > > 511.povray_r 0.74%
> > > > > > 519.lbm_r 0.24%
> > > > > > 521.wrf_r 2.35%
> > > > > > 526.blender_r 0.71%
> > > > > > 527.cam4_r 0.65%
> > > > > > 538.imagick_r 0.95%
> > > > > > 544.nab_r -0.37
> > > > > > 549.fotonik3d_r 0.24%
> > > > > > 554.roms_r 0.90%
> > > > > > fprate geomean 0.50%
> > > > > > -
> > > > > >
> > > > > > Changelog
> > > > > > gcc/
> > > > > > * config/i386/i386.md (*rcpsf2_sse): Add
> > > > > > avx_partial_xmm_update, prefer m constraint for TARGET_AVX.
> > > > > > (*rsqrtsf2_sse): Ditto.
> > > > > > (*sqrt2_sse): Ditto.
> > > > > > (sse4_1_round2): separate constraint vm, add
> > > > > > avx_partail_xmm_update, prefer m constraint for TARGET_AVX.
> > > > > > * config/i386/s

[PATCH] Adjust predicates and constraints of scalar insns

2019-10-25 Thread Hongtao Liu
> Looking into sse.md, there are a lot of inconsistencies in existing *vm
> patterns w.r.t. operand constraints. Unfortunately, these were copied
> into the proposed patterns. One example is the existing
>
> (define_insn "_vmsqrt2"
>   [(set (match_operand:VF_128 0 "register_operand" "=x,v")
> (vec_merge:VF_128
>   (sqrt:VF_128
> (match_operand:VF_128 1 "vector_operand"
> "xBm,"))
>   (match_operand:VF_128 2 "register_operand" "0,v")
>   (const_int 1)))]
>   "TARGET_SSE"
>   "@
>sqrt\t{%1, %0|%0, %1}
>
> Due to combine benefits, *vm operands to be merged is described in
> vector mode. Since the insn operates in scalar mode, there is no need
> for "vector_operand" and Bm constraint that impose more strict
> alignment requirements. However, iptr modifier is needed here to
> override VF_128 vector mode (e.g. V4SFmode) to generate scalar
> (SFmode, DWORD PTR) memory access prefix.
>
> Someone should fix these existing inconsistencies in a follow-up patch.

https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01867.html
This patch fixes these inconsistencies.

Bootstrap and regression test on i386/x86-64 is ok.

Ok for trunk?

Changelog

gcc/
* config/i386/sse.md
(_vm3,
_vm3,
_vmsqrt2,
_vm3,
_vmmaskcmp3):
Change predicates from vector_operand to nonimmediate_operand,
constraints xBm to xm, since scalar operations don't need
memory address alignment.
(avx512f_vmcmp3,
avx512f_vmcmp3_mask): Replace
round_saeonly_nimm_predicate with
round_saeonly_nimm_scalar_predicate.
(fmai_vmfmadd_, fmai_vmfmsub_,
fmai_vmfnmadd_,fmai_vmfnmsub_,
*fmai_fmadd_, *fmai_fmsub_,
*fmai_fnmadd_, *fmai_fnmsub_,
avx512f_vmfmadd__mask3,
avx512f_vmfmadd__maskz_1,
*avx512f_vmfmsub__mask,
avx512f_vmfmsub__mask3,
*avx512f_vmfmsub__maskz_1,
*avx512f_vmfnmadd__mask,
*avx512f_vmfnmadd__mask3,
*avx512f_vmfnmadd__maskz_1,
*avx512f_vmfnmsub__mask,
*avx512f_vmfnmsub__mask3,
*avx512f_vmfnmsub__maskz_1,
cvtusi232,
cvtusi264, ): Replace
round_nimm_predicate with round_nimm_scalar_predicate.
(avx512f_sfixupimm,
avx512f_sfixupimm_mask,
avx512er_vmrcp28,
avx512er_vmrsqrt28,
): Replace round_saeonly_nimm_predicate with
round_saeonly_nimm_scalar_predicate.
(avx512dq_vmfpclass, ): Replace
vector_operand with nonimmediate_operand.
* config/i386/subst.md (round_scalar_nimm_predicate,
round_saeonly_scalar_nimm_predicate): Replace
vector_operand with nonimmediate_operand.

-- 
BR,
Hongtao


0001-Adjust-predicates-and-constraints-of-scalar-insns.patch
Description: Binary data


[PATCH] Remove redudant iptr when operand already has a scalar mode.

2019-10-26 Thread Hongtao Liu
> BTW: Please also note that there is no need to use iptr or operand
> mode override in scalar insn templates for intel asm dialect when
> operand already has a scalar mode.
https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01868.html

This patch removes the redundant iptr when the operand already has a scalar mode.

bootstrap and regression test for i386/x86-64 is ok.

Changelog
gcc/
* config/i386/sse.md (*_vm3,
_vm3): Remove iptr since
operand is already in scalar mode.
(iptr): Remove SF/DF.
-- 
BR,
Hongtao


0001-Remove-redudant-iptr-when-operand-is-already-scalar-.patch
Description: Binary data


[PATCH target/92295] Fix inefficient vector constructor

2019-10-31 Thread Hongtao Liu
Hi uros:
  This patch fixes an inefficient vector constructor.
  Currently, ix86_expand_vector_init_concat initializes vectors two
elements at a time, which can miss optimization opportunities such as
PR92295.
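
As a rough illustration of the direction, the `half_mode` and `half[2]` names in the diff suggest that the vector is instead built from two halves that are each constructed on their own. The sketch below is only a toy model of that shape on plain arrays, not the GCC code; the function name and the step counter are assumptions for illustration, and n is assumed to be a power of two.

```c
#include <stddef.h>

/* Toy model: fill DST[0..N) from SRC[0..N), N a power of two, by
   building each half recursively and then "concatenating" the halves.
   Returns the number of concatenation steps, each one standing in for
   a VEC_CONCAT of two half-width values.  */
size_t
init_concat (int *dst, const int *src, size_t n)
{
  if (n == 1)
    {
      /* A single element needs no concatenation.  */
      dst[0] = src[0];
      return 0;
    }

  size_t half = n / 2;
  size_t steps = init_concat (dst, src, half);
  steps += init_concat (dst + half, src + half, half);

  /* Join the low and high halves.  */
  return steps + 1;
}
```

In this model each half is a self-contained sub-problem, so a smarter strategy for a half (for example, recognizing a repeated pattern) could be applied before the final concatenation.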

  Bootstrap and i386 regression test is ok.
  Ok for trunk?

Changelog
gcc/
PR target/92295
* config/i386/i386-expand.c (ix86_expand_vector_init_concat)
Enhance ix86_expand_vector_init_concat.

gcc/testsuite
* gcc.target/i386/pr92295.c: New test.

-- 
BR,
Hongtao
From 408fb093993f9df4da42d8daf2e6996f087c4618 Mon Sep 17 00:00:00 2001
From: liuhongt 
Date: Thu, 31 Oct 2019 15:14:00 +
Subject: [PATCH] Enhance ix86_expand_vector_init_concat.

Changelog
gcc/
	PR target/92295
	* config/i386/i386-expand.c (ix86_expand_vector_init_concat)
	Enhance ix86_expand_vector_init_concat.

gcc/testsuite
	* gcc.target/i386/pr92295.c: New test.
---
 gcc/config/i386/i386-expand.c   | 130 ++--
 gcc/testsuite/gcc.target/i386/pr92295.c |  13 +++
 2 files changed, 65 insertions(+), 78 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr92295.c

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index 6d3d14c37dd..be040a1bc3e 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -13654,8 +13654,8 @@ static void
 ix86_expand_vector_init_concat (machine_mode mode,
 rtx target, rtx *ops, int n)
 {
-  machine_mode cmode, hmode = VOIDmode, gmode = VOIDmode;
-  rtx first[16], second[8], third[4];
+  machine_mode half_mode = VOIDmode;
+  rtx half[2];
   rtvec v;
   int i, j;
 
@@ -13665,55 +13665,55 @@ ix86_expand_vector_init_concat (machine_mode mode,
   switch (mode)
 	{
 	case E_V16SImode:
-	  cmode = V8SImode;
+	  half_mode = V8SImode;
 	  break;
 	case E_V16SFmode:
-	  cmode = V8SFmode;
+	  half_mode = V8SFmode;
 	  break;
 	case E_V8DImode:
-	  cmode = V4DImode;
+	  half_mode = V4DImode;
 	  break;
 	case E_V8DFmode:
-	  cmode = V4DFmode;
+	  half_mode = V4DFmode;
 	  break;
 	case E_V8SImode:
-	  cmode = V4SImode;
+	  half_mode = V4SImode;
 	  break;
 	case E_V8SFmode:
-	  cmode = V4SFmode;
+	  half_mode = V4SFmode;
 	  break;
 	case E_V4DImode:
-	  cmode = V2DImode;
+	  half_mode = V2DImode;
 	  break;
 	case E_V4DFmode:
-	  cmode = V2DFmode;
+	  half_mode = V2DFmode;
 	  break;
 	case E_V4SImode:
-	  cmode = V2SImode;
+	  half_mode = V2SImode;
 	  break;
 	case E_V4SFmode:
-	  cmode = V2SFmode;
+	  half_mode = V2SFmode;
 	  break;
 	case E_V2DImode:
-	  cmode = DImode;
+	  half_mode = DImode;
 	  break;
 	case E_V2SImode:
-	  cmode = SImode;
+	  half_mode = SImode;
 	  break;
 	case E_V2DFmode:
-	  cmode = DFmode;
+	  half_mode = DFmode;
 	  break;
 	case E_V2SFmode:
-	  cmode = SFmode;
+	  half_mode = SFmode;
 	  break;
 	default:
 	  gcc_unreachable ();
 	}
 
-  if (!register_operand (ops[1], cmode))
-	ops[1] = force_reg (cmode, ops[1]);
-  if (!register_operand (ops[0], cmode))
-	ops[0] = force_reg (cmode, ops[0]);
+  if (!register_operand (ops[1], half_mode))
+	ops[1] = force_reg (half_mode, ops[1]);
+  if (!register_operand (ops[0], half_mode))
+	ops[0] = force_reg (half_mode, ops[0]);
   emit_insn (gen_rtx_SET (target, gen_rtx_VEC_CONCAT (mode, ops[0],
 			  ops[1])));
   break;
@@ -13722,16 +13722,16 @@ ix86_expand_vector_init_concat (machine_mode mode,
   switch (mode)
 	{
 	case E_V4DImode:
-	  cmode = V2DImode;
+	  half_mode = V2DImode;
 	  break;
 	case E_V4DFmode:
-	  cmode = V2DFmode;
+	  half_mode = V2DFmode;
 	  break;
 	case E_V4SImode:
-	  cmode = V2SImode;
+	  half_mode = V2SImode;
 	  break;
 	case E_V4SFmode:
-	  cmode = V2SFmode;
+	  half_mode = V2SFmode;
 	  break;
 	default:
 	  gcc_unreachable ();
@@ -13742,20 +13742,16 @@ ix86_expand_vector_init_concat (machine_mode mode,
   switch (mode)
 	{
 	case E_V8DImode:
-	  cmode = V2DImode;
-	  hmode = V4DImode;
+	  half_mode = V4DImode;
 	  break;
 	case E_V8DFmode:
-	  cmode = V2DFmode;
-	  hmode = V4DFmode;
+	  half_mode = V4DFmode;
 	  break;
 	case E_V8SImode:
-	  cmode = V2SImode;
-	  hmode = V4SImode;
+	  half_mode = V4SImode;
 	  break;
 	case E_V8SFmode:
-	  cmode = V2SFmode;
-	  hmode = V4SFmode;
+	  half_mode = V4SFmode;
 	  break;
 	default:
 	  gcc_unreachable ();
@@ -13766,14 +13762,10 @@ ix86_expand_vector_init_concat (machine_mode mode,
   switch (mode)
 	{
 	case E_V16SImode:
-	  cmode = V2SImode;
-	  hmode = V4SImode;
-	  gmode = V8SImode;
+	  half_mode = V8SImode;
 	  break;
 	case E_V16SFmode:
-	  cmode = V2SFmode;
-	  hmode = V4SFmode;
-	  gmode = V8SFmode;
+	  half_mode = V8SFmode;
 	  break;
 	default:
 	  gcc_unreachable ();
@@ -13783,50 +13775,32 @@ ix86_expand_vector_init_concat (machine_mode mode,
 half:
   /* FIXME: We process inputs backward to help RA.  PR 36222.  */
   i = n - 1;
-  j = (n >> 1) - 1;
-  for (; i > 0; i -= 2, j--)
-	{
-	  first[j] = gen_reg_rtx (cmode);
-	  v = gen_rtvec (2, ops[i - 1], ops[i]);
-	  ix86_expand_vecto

Re: [PATCH target/92295] Fix inefficient vector constructor

2019-11-02 Thread Hongtao Liu
Hi Jakub:
  Could you help review this patch?

PS: I'm asking you because this patch is related to vectors (avx512f), and
Uros mentioned before that he has no intention of maintaining avx512f.

On Fri, Nov 1, 2019 at 9:12 AM Hongtao Liu  wrote:
>
> Hi uros:
>   This patch is about to fix inefficient vector constructor.
>   Currently in ix86_expand_vector_init_concat, vector are initialized
> per 2 elements which can miss some optimization opportunity like
> pr92295.
>
>   Bootstrap and i386 regression test is ok.
>   Ok for trunk?
>
> Changelog
> gcc/
> PR target/92295
> * config/i386/i386-expand.c (ix86_expand_vector_init_concat)
> Enhance ix86_expand_vector_init_concat.
>
> gcc/testsuite
> * gcc.target/i386/pr92295.c: New test.
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


Re: [PATCH target/92295] Fix inefficient vector constructor

2019-11-06 Thread Hongtao Liu
Ping!

On Sat, Nov 2, 2019 at 9:38 PM Hongtao Liu  wrote:
>
> Hi Jakub:
>   Could you help reviewing this patch.
>
> PS: Since this patch is related to vectors(avx512f), and Uros
> mentioned before that he has no intension to maintain avx512f.
>
> On Fri, Nov 1, 2019 at 9:12 AM Hongtao Liu  wrote:
> >
> > Hi uros:
> >   This patch is about to fix inefficient vector constructor.
> >   Currently in ix86_expand_vector_init_concat, vector are initialized
> > per 2 elements which can miss some optimization opportunity like
> > pr92295.
> >
> >   Bootstrap and i386 regression test is ok.
> >   Ok for trunk?
> >
> > Changelog
> > gcc/
> > PR target/92295
> > * config/i386/i386-expand.c (ix86_expand_vector_init_concat)
> > Enhance ix86_expand_vector_init_concat.
> >
> > gcc/testsuite
> > * gcc.target/i386/pr92295.c: New test.
> >
> > --
> > BR,
> > Hongtao
>
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


[PATCH] Set AVX128_OPTIMAL for all avx targets.

2019-11-11 Thread Hongtao Liu
Hi:
  This patch is about to set X86_TUNE_AVX128_OPTIMAL as default for
all AVX target because we found there's still performance gap between
128-bit auto-vectorization and 256-bit auto-vectorization even with
epilog vectorized.
  The performance influence of setting avx128_optimal as default on
SPEC2017 with option `-march=native -funroll-loops -Ofast -flto" on
CLX is as bellow:

INT rate
500.perlbench_r      -0.32%
502.gcc_r            -1.32%
505.mcf_r            -0.12%
520.omnetpp_r        -0.34%
523.xalancbmk_r      -0.65%
525.x264_r           2.23%
531.deepsjeng_r      0.81%
541.leela_r          -0.02%
548.exchange2_r      10.89%  --> big improvement
557.xz_r             0.38%
geomean for intrate  1.10%

FP rate
503.bwaves_r         1.41%
507.cactuBSSN_r      -0.14%
508.namd_r           1.54%
510.parest_r         -0.87%
511.povray_r         0.28%
519.lbm_r            0.32%
521.wrf_r            -0.54%
526.blender_r        0.59%
527.cam4_r           -2.70%
538.imagick_r        3.92%
544.nab_r            0.59%
549.fotonik3d_r      -5.44%  --> regression
554.roms_r           -2.34%
geomean for fprate   -0.28%

The 10% improvement of 548.exchange2_r is because there is a 9-layer
nested loop, and the loop count of the innermost layer is small (enough
for 128-bit vectorization, but not for 256-bit vectorization).
Since the loop count cannot be determined statically, the vectorizer
chooses 256-bit vectorization, whose vector loop is then never
executed. Vectorizing the epilog introduces some extra instructions;
normally that wins back some performance, but in a 9-layer nested loop
the cost of the extra instructions outweighs the gain.
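
To illustrate the trip-count effect, here is a small C model of how a loop of n iterations is divided between the main vector loop, a vectorized epilog and scalar leftovers. It is only a sketch: the struct, the function name, and the example of 32-bit elements (8 lanes at 256 bits, 4 lanes at 128 bits) are assumptions for illustration, not data from the benchmark.

```c
/* How many iterations each part of a vectorized loop executes.  */
struct loop_split
{
  unsigned main_iters;    /* iterations of the main vector loop */
  unsigned epilog_iters;  /* iterations of the vectorized epilog */
  unsigned scalar_iters;  /* leftover scalar iterations */
};

/* Split N scalar iterations for a main loop with vectorization factor
   VF and an epilog with factor EPILOG_VF (0 means no vector epilog).  */
struct loop_split
split_loop (unsigned n, unsigned vf, unsigned epilog_vf)
{
  struct loop_split s;
  unsigned rem = n % vf;

  s.main_iters = n / vf;
  s.epilog_iters = epilog_vf ? rem / epilog_vf : 0;
  s.scalar_iters = epilog_vf ? rem % epilog_vf : rem;
  return s;
}
```

With a hypothetical trip count of 5, `split_loop (5, 8, 4)` leaves the 256-bit main loop with zero iterations and does all vector work in the epilog, while `split_loop (5, 4, 0)` runs the 128-bit main loop once, without the extra setup and checks for a vector loop that never runs.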

The 5.44% regression of 549.fotonik3d_r is because 256-bit
vectorization is better than 128-bit vectorization there. Generally,
enabling 256-bit or 512-bit vectorization reduces instruction
clockticks but also reduces frequency. When the frequency reduction
is smaller than the clocktick reduction, wider vectors are better
than narrower ones; otherwise the opposite holds. The regression of
549.fotonik3d_r is due to this, and the same applies to 554.roms_r
and 527.cam4_r; for those 3 benchmarks, 512-bit vectorization is best.
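
The trade-off can be written down as run time = clockticks / frequency. The C sketch below compares a narrow and a wide configuration under that simple model; the function names are hypothetical, and any numbers fed to it are made-up inputs, not measurements.

```c
#include <stdbool.h>

/* Run time in seconds for a given clocktick count at a given core
   frequency in Hz.  */
double
run_time (double clockticks, double freq_hz)
{
  return clockticks / freq_hz;
}

/* Return true if the wide-vector configuration finishes sooner than
   the narrow one, i.e. if its clocktick reduction is larger than its
   frequency reduction.  */
bool
wider_is_faster (double ticks_narrow, double freq_narrow,
                 double ticks_wide, double freq_wide)
{
  return run_time (ticks_wide, freq_wide)
         < run_time (ticks_narrow, freq_narrow);
}
```

For example, under this model a 20% clocktick reduction with a 10% frequency drop is a win for the wider vectors, whereas a 5% clocktick reduction with the same 10% frequency drop is a loss.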

Bootstrap and regression test on i386 is ok.
Ok for trunk?

Changelog
gcc/
* config/i386/i386-options.c (m_CORE_AVX): New macro.
* config/i386/x86-tune.def: Enable 128_optimal for avx and
replace m_SANDYBRIDGE | m_CORE_AVX2 with m_CORE_AVX.
* testsuite/gcc.target/i386/pr84413-1.c: Adjust testcase.
* testsuite/gcc.target/i386/pr84413-2.c: Ditto.
* testsuite/gcc.target/i386/pr84413-3.c: Ditto.
* testsuite/gcc.target/i386/pr70021.c: Ditto.
* testsuite/gcc.target/i386/pr90579.c: New test.


-- 
BR,
Hongtao
From a02d5c896600c4c80765f375d531c5412a778145 Mon Sep 17 00:00:00 2001
From: liuhongt 
Date: Wed, 6 Nov 2019 09:36:57 +0800
Subject: [PATCH] Enable 128-bit auto-vectorization for avx

Performance impact test on CLX8280 with best perf option
-Ofast -march=native -funroll-loops -flto -mfpmath=sse.

INT rate
500.perlbench_r		-0.32%
502.gcc_r			-1.32%
505.mcf_r			-0.12%
520.omnetpp_r			-0.34%
523.xalancbmk_r		-0.65%
525.x264_r			2.23%
531.deepsjeng_r		0.81%
541.leela_r			-0.02%
548.exchange2_r		10.89%
557.xz_r			0.38%
geomean for intrate		1.10%

FP rate
503.bwaves_r			1.41%
507.cactuBSSN_r		-0.14%
508.namd_r			1.54%
510.parest_r			-0.87%
511.povray_r			0.28%
519.lbm_r			0.32%
521.wrf_r			-0.54%
526.blender_r			0.59%
527.cam4_r			-2.70%
538.imagick_r			3.92%
544.nab_r			0.59%
549.fotonik3d_r		-5.44%
554.roms_r			-2.34%
geomean for fprate		-0.28%

Changelog
gcc/
	* config/i386/i386-options.c (m_CORE_AVX): New macro.
	* config/i386/x86-tune.def: Enable 128_optimal for avx and
	replace m_SANDYBRIDGE | m_CORE_AVX2 with m_CORE_AVX.
	* testsuite/gcc.target/i386/pr84413-1.c: Adjust testcase.
	* testsuite/gcc.target/i386/pr84413-2.c: Ditto.
	* testsuite/gcc.target/i386/pr84413-3.c: Ditto.
	* testsuite/gcc.target/i386/pr70021.c: Ditto.
	* testsuite/gcc.target/i386/pr90579.c: New test.
---
 gcc/config/i386/i386-options.c|  1 +
 gcc/config/i386/x86-tune.def  | 24 +++
 gcc/testsuite/gcc.target/i386/pr70021.c   |  2 +-
 gcc/testsuite/gcc.target/i386/pr84413-1.c |  4 ++--
 gcc/testsuite/gcc.target/i386/pr84413-2.c |  4 ++--
 gcc/testsuite/gcc.target/i386/pr84413-3.c |  4 ++--
 gcc/testsuite/gcc.target/i386/pr90579.c   | 20 +++
 7 files changed, 40 insertions(+), 19 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr90579.c

diff --git a/gcc/

Re: [PATCH] Set AVX128_OPTIMAL for all avx targets.

2019-11-12 Thread Hongtao Liu
On Tue, Nov 12, 2019 at 4:19 PM Richard Biener
 wrote:
>
> On Tue, Nov 12, 2019 at 8:36 AM Hongtao Liu  wrote:
> >
> > Hi:
> >   This patch sets X86_TUNE_AVX128_OPTIMAL by default for all AVX
> > targets, because we found there is still a performance gap between
> > 128-bit auto-vectorization and 256-bit auto-vectorization even with
> > the epilogue vectorized.
> >   The performance impact of setting avx128_optimal by default on
> > SPEC2017 with options `-march=native -funroll-loops -Ofast -flto` on
> > CLX is as below:
> >
> > INT rate
> > 500.perlbench_r         -0.32%
> > 502.gcc_r               -1.32%
> > 505.mcf_r               -0.12%
> > 520.omnetpp_r           -0.34%
> > 523.xalancbmk_r         -0.65%
> > 525.x264_r               2.23%
> > 531.deepsjeng_r          0.81%
> > 541.leela_r             -0.02%
> > 548.exchange2_r         10.89%  --> big improvement
> > 557.xz_r                 0.38%
> > geomean for intrate      1.10%
> >
> > FP rate
> > 503.bwaves_r             1.41%
> > 507.cactuBSSN_r         -0.14%
> > 508.namd_r               1.54%
> > 510.parest_r            -0.87%
> > 511.povray_r             0.28%
> > 519.lbm_r                0.32%
> > 521.wrf_r               -0.54%
> > 526.blender_r            0.59%
> > 527.cam4_r              -2.70%
> > 538.imagick_r            3.92%
> > 544.nab_r                0.59%
> > 549.fotonik3d_r         -5.44%  -> regression
> > 554.roms_r              -2.34%
> > geomean for fprate      -0.28%
> >
> > The 10% improvement of 548.exchange2_r comes from a 9-level nested
> > loop whose innermost trip count is small (enough for 128-bit
> > vectorization, but not for 256-bit vectorization).
> > Since the trip count cannot be determined statically, the vectorizer
> > chooses 256-bit vectorization, whose main loop is then never executed.
> > Vectorizing the epilogue introduces some extra instructions; normally
> > that recovers some performance, but in a 9-level nested loop the cost
> > of the extra instructions outweighs the gain.
> >
> > The 5.44% regression of 549.fotonik3d_r is because 256-bit
> > vectorization is better than 128-bit vectorization there. Generally,
> > enabling 256-bit or 512-bit vectorization reduces instruction
> > clockticks but also reduces frequency. When the frequency reduction
> > is smaller than the clocktick reduction, the longer vector width is
> > better than the shorter one; otherwise the opposite holds. The
> > regression of 549.fotonik3d_r is due to this, and similarly for
> > 554.roms_r and 527.cam4_r; for those 3 benchmarks, 512-bit
> > vectorization is best.
> >
> > Bootstrap and regression tests on i386 are ok.
> > Ok for trunk?
>
> I don't think 128_optimal does what you think it does.  If you want to
> prefer 128bit AVX adjust the preference, but 128_optimal describes
> a microarchitectural detail (AVX256 ops are split into two AVX128 ops)
But it will set target_prefer_avx128 by default.

  /* Enable 128-bit AVX instruction generation
     for the auto-vectorizer.  */
  if (TARGET_AVX128_OPTIMAL
      && (opts_set->x_prefer_vector_width_type == PVW_NONE))
    opts->x_prefer_vector_width_type = PVW_AVX128;

And it may be too confusing to add another tuning flag.
> and is _not_ intended for "tuning".
>
> Richard.
>
> > Changelog
> > gcc/
> > * config/i386/i386-options.c (m_CORE_AVX): New macro.
> > * config/i386/x86-tune.def: Enable 128_optimal for avx and
> > replace m_SANDYBRIDGE | m_CORE_AVX2 with m_CORE_AVX.
> > * testsuite/gcc.target/i386/pr84413-1.c: Adjust testcase.
> > * testsuite/gcc.target/i386/pr84413-2.c: Ditto.
> > * testsuite/gcc.target/i386/pr84413-3.c: Ditto.
> > * testsuite/gcc.target/i386/pr70021.c: Ditto.
> > * testsuite/gcc.target/i386/pr90579.c: New test.
> >
> >
> > --
> > BR,
> > Hongtao



-- 
BR,
Hongtao


Re: [PATCH] Set AVX128_OPTIMAL for all avx targets.

2019-11-12 Thread Hongtao Liu
On Tue, Nov 12, 2019 at 4:29 PM Richard Biener
 wrote:
>
> On Tue, Nov 12, 2019 at 9:19 AM Richard Biener
>  wrote:
> >
> > On Tue, Nov 12, 2019 at 8:36 AM Hongtao Liu  wrote:
> > >
> > > Hi:
> > >   This patch sets X86_TUNE_AVX128_OPTIMAL by default for all AVX
> > > targets, because we found there is still a performance gap between
> > > 128-bit auto-vectorization and 256-bit auto-vectorization even with
> > > the epilogue vectorized.
> > >   The performance impact of setting avx128_optimal by default on
> > > SPEC2017 with options `-march=native -funroll-loops -Ofast -flto` on
> > > CLX is as below:
> > >
> > > INT rate
> > > 500.perlbench_r         -0.32%
> > > 502.gcc_r               -1.32%
> > > 505.mcf_r               -0.12%
> > > 520.omnetpp_r           -0.34%
> > > 523.xalancbmk_r         -0.65%
> > > 525.x264_r               2.23%
> > > 531.deepsjeng_r          0.81%
> > > 541.leela_r             -0.02%
> > > 548.exchange2_r         10.89%  --> big improvement
> > > 557.xz_r                 0.38%
> > > geomean for intrate      1.10%
> > >
> > > FP rate
> > > 503.bwaves_r             1.41%
> > > 507.cactuBSSN_r         -0.14%
> > > 508.namd_r               1.54%
> > > 510.parest_r            -0.87%
> > > 511.povray_r             0.28%
> > > 519.lbm_r                0.32%
> > > 521.wrf_r               -0.54%
> > > 526.blender_r            0.59%
> > > 527.cam4_r              -2.70%
> > > 538.imagick_r            3.92%
> > > 544.nab_r                0.59%
> > > 549.fotonik3d_r         -5.44%  -> regression
> > > 554.roms_r              -2.34%
> > > geomean for fprate      -0.28%
> > >
> > > The 10% improvement of 548.exchange2_r comes from a 9-level nested
> > > loop whose innermost trip count is small (enough for 128-bit
> > > vectorization, but not for 256-bit vectorization).
> > > Since the trip count cannot be determined statically, the vectorizer
> > > chooses 256-bit vectorization, whose main loop is then never executed.
> > > Vectorizing the epilogue introduces some extra instructions; normally
> > > that recovers some performance, but in a 9-level nested loop the cost
> > > of the extra instructions outweighs the gain.
> > >
> > > The 5.44% regression of 549.fotonik3d_r is because 256-bit
> > > vectorization is better than 128-bit vectorization there. Generally,
> > > enabling 256-bit or 512-bit vectorization reduces instruction
> > > clockticks but also reduces frequency. When the frequency reduction
> > > is smaller than the clocktick reduction, the longer vector width is
> > > better than the shorter one; otherwise the opposite holds. The
> > > regression of 549.fotonik3d_r is due to this, and similarly for
> > > 554.roms_r and 527.cam4_r; for those 3 benchmarks, 512-bit
> > > vectorization is best.
> > >
> > > Bootstrap and regression tests on i386 are ok.
> > > Ok for trunk?
> >
> > I don't think 128_optimal does what you think it does.  If you want to
> > prefer 128bit AVX adjust the preference, but 128_optimal describes
> > a microarchitectural detail (AVX256 ops are split into two AVX128 ops)
> > and is _not_ intended for "tuning".
>
> So yes, it's poorly named.  A preparatory patch to clean this up
> (and maybe split it into TARGET_AVX256_SPLIT_REGS and TARGET_AVX128_OPTIMAL)
> would be nice.
>
> And I'm not convinced that a single SPEC benchmark is good enough to
> penalize this for all users.  GCC isn't a benchmark compiler and GCC
> does exactly what you expect it to do - try FDO if you want to tell it more.
Yes, you're right, it's just a benchmark result.
>
> Richard.
>
> > Richard.
> >
> > > Changelog
> > > gcc/
> > > * config/i386/i386-options.c (m_CORE_AVX): New macro.
> > > * config/i386/x86-tune.def: Enable 128_optimal for avx and
> > > replace m_SANDYBRIDGE | m_CORE_AVX2 with m_CORE_AVX.
> > > * testsuite/gcc.target/i386/pr84413-1.c: Adjust testcase.
> > > * testsuite/gcc.target/i386/pr84413-2.c: Ditto.
> > > * testsuite/gcc.target/i386/pr84413-3.c: Ditto.
> > > * testsuite/gcc.target/i386/pr70021.c: Ditto.
> > > * testsuite/gcc.target/i386/pr90579.c: New test.
> > >
> > >
> > > --
> > > BR,
> > > Hongtao



-- 
BR,
Hongtao


[PATCH] Split X86_TUNE_AVX128_OPTIMAL into X86_TUNE_AVX256_SPLIT_REGS and X86_TUNE_AVX128_OPTIMAL

2019-11-12 Thread Hongtao Liu
Hi:
  As mentioned in https://gcc.gnu.org/ml/gcc-patches/2019-11/msg00832.html
> So yes, it's poorly named.  A preparatory patch to clean this up
> (and maybe split it into TARGET_AVX256_SPLIT_REGS and TARGET_AVX128_OPTIMAL)
> would be nice.

  Bootstrap and regression tests for the i386 backend are ok.
  Ok for trunk?

Changelog
gcc/
PR target/92448
* config/i386/i386-expand.c (ix86_expand_set_or_cpymem):
Replace TARGET_AVX128_OPTIMAL with TARGET_AVX256_SPLIT_REGS.
* config/i386/i386.c (ix86_vec_cost): Ditto.
(ix86_reassociation_width): Ditto.
* config/i386/i386-options.c (ix86_option_override_internal):
Replace TARGET_AVX128_OPTIMAL with
ix86_tune_features[X86_TUNE_AVX128_OPTIMAL].
* config/i386/i386.h (TARGET_AVX256_SPLIT_REGS): New macro.
(TARGET_AVX128_OPTIMAL): Deleted.
* config/i386/x86-tune.def (X86_TUNE_AVX256_SPLIT_REGS): New
DEF_TUNE.
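
To make the intent of the split concrete: one flag describes the
microarchitecture (256-bit operations are split into 128-bit halves, so
they cost more), while the other only selects the default vector width
for the auto-vectorizer. A minimal sketch of that separation, assuming
simplified hypothetical types and helpers (tune_flags, vec_cost,
default_width) rather than the real GCC data structures, and leaving out
the TARGET_SSE_SPLIT_REGS case:

```cpp
#include <cassert>

// Hypothetical stand-in for the two tuning bits after the split.
struct tune_flags
{
  bool avx256_split_regs; // microarchitecture: 256-bit ops are split in two
  bool avx128_optimal;    // preference: default to 128-bit auto-vectorization
};

// Local model of the vector width preference values.
enum prefer_width { PVW_NONE, PVW_AVX128, PVW_AVX256, PVW_AVX512 };

// Same shape as the ix86_vec_cost hunk: scale the cost by the number of
// 128-bit pieces, but only when the hardware really splits wide vectors.
int vec_cost (const tune_flags &t, int mode_bits, int cost)
{
  if (mode_bits > 128 && t.avx256_split_regs)
    return cost * mode_bits / 128;
  return cost;
}

// Same shape as the ix86_option_override_internal hunk: the 128-bit
// default applies only when the user expressed no preference.
prefer_width default_width (const tune_flags &t, prefer_width user_set)
{
  if (t.avx128_optimal && user_set == PVW_NONE)
    return PVW_AVX128;
  return user_set;
}
```

With the two bits separate, a target can prefer 128-bit vectors without
having its 256-bit costs doubled, and the other way around.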

-- 
BR,
Hongtao
From 93f49b7739d87106988869ee9a5ebe441e0b56ab Mon Sep 17 00:00:00 2001
From: liuhongt 
Date: Tue, 12 Nov 2019 16:49:41 +0800
Subject: [PATCH] Split X86_TUNE_AVX128_OPTIMAL into X86_TUNE_AVX256_SPLIT_REGS
 and X86_TUNE_AVX128_OPTIMAL.

Changelog
gcc/
	PR target/92448
	* config/i386/i386-expand.c (ix86_expand_set_or_cpymem):
	Replace TARGET_AVX128_OPTIMAL with TARGET_AVX256_SPLIT_REGS.
	* config/i386/i386.c (ix86_vec_cost): Ditto.
	(ix86_reassociation_width): Ditto.
	* config/i386/i386-options.c (ix86_option_override_internal):
	Replace TARGET_AVX128_OPTIMAL with
	ix86_tune_features[X86_TUNE_AVX128_OPTIMAL].
	* config/i386/i386.h (TARGET_AVX256_SPLIT_REGS): New macro.
	(TARGET_AVX128_OPTIMAL): Deleted.
	* config/i386/x86-tune.def (X86_TUNE_AVX256_SPLIT_REGS): New
	DEF_TUNE.
---
 gcc/config/i386/i386-expand.c  | 2 +-
 gcc/config/i386/i386-options.c | 2 +-
 gcc/config/i386/i386.c | 4 ++--
 gcc/config/i386/i386.h | 4 ++--
 gcc/config/i386/x86-tune.def   | 4 
 5 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index be040a1bc3e..392e0f95460 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -7348,7 +7348,7 @@ ix86_expand_set_or_cpymem (rtx dst, rtx src, rtx count_exp, rtx val_exp,
 	 && optab_handler (mov_optab, wider_mode) != CODE_FOR_nothing)
 	move_mode = wider_mode;
 
-  if (TARGET_AVX128_OPTIMAL && GET_MODE_BITSIZE (move_mode) > 128)
+  if (TARGET_AVX256_SPLIT_REGS && GET_MODE_BITSIZE (move_mode) > 128)
 	move_mode = TImode;
 
   /* Find the corresponding vector mode with the same size as MOVE_MODE.
diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index dfc8ae23ba0..3d87dec8b15 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -2692,7 +2692,7 @@ ix86_option_override_internal (bool main_args_p,
 
   /* Enable 128-bit AVX instruction generation
  for the auto-vectorizer.  */
-  if (TARGET_AVX128_OPTIMAL
+  if (ix86_tune_features[X86_TUNE_AVX128_OPTIMAL]
   && (opts_set->x_prefer_vector_width_type == PVW_NONE))
 opts->x_prefer_vector_width_type = PVW_AVX128;
 
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 03a7082d2fc..4a4cf79555e 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -18960,7 +18960,7 @@ ix86_vec_cost (machine_mode mode, int cost)
   && TARGET_SSE_SPLIT_REGS)
 return cost * 2;
   if (GET_MODE_BITSIZE (mode) > 128
-  && TARGET_AVX128_OPTIMAL)
+  && TARGET_AVX256_SPLIT_REGS)
 return cost * GET_MODE_BITSIZE (mode) / 128;
   return cost;
 }
@@ -21298,7 +21298,7 @@ ix86_reassociation_width (unsigned int op, machine_mode mode)
 	return 1;
 
   /* Account for targets that splits wide vectors into multiple parts.  */
-  if (TARGET_AVX128_OPTIMAL && GET_MODE_BITSIZE (mode) > 128)
+  if (TARGET_AVX256_SPLIT_REGS && GET_MODE_BITSIZE (mode) > 128)
 	div = GET_MODE_BITSIZE (mode) / 128;
   else if (TARGET_SSE_SPLIT_REGS && GET_MODE_BITSIZE (mode) > 64)
 	div = GET_MODE_BITSIZE (mode) / 64;
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index afa0aa83ddf..3954c12f4e7 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -578,8 +578,8 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 	ix86_tune_features[X86_TUNE_AVOID_LEA_FOR_ADDR]
 #define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \
 	ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
-#define TARGET_AVX128_OPTIMAL \
-	ix86_tune_features[X86_TUNE_AVX128_OPTIMAL]
+#define TARGET_AVX256_SPLIT_REGS \
+	ix86_tune_features[X86_TUNE_AVX256_SPLIT_REGS]
 #define TARGET_GENERAL_REGS_SSE_SPILL \
 	ix86_tune_features[X86_TUNE_GENERAL_REGS_SSE_SPILL]
 #define TARGET_AVOID_MEM_OPND_FOR_CMOVE \
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index e289efdf2e0..328535d38d7 100644
--- a/gcc/config/i386/x86-tune.def

Re: [PATCH] Set AVX128_OPTIMAL for all avx targets.

2019-11-12 Thread Hongtao Liu
On Tue, Nov 12, 2019 at 4:41 PM Richard Biener
 wrote:
>
> On Tue, Nov 12, 2019 at 9:29 AM Hongtao Liu  wrote:
> >
> > On Tue, Nov 12, 2019 at 4:19 PM Richard Biener
> >  wrote:
> > >
> > > On Tue, Nov 12, 2019 at 8:36 AM Hongtao Liu  wrote:
> > > >
> > > > Hi:
> > > >   This patch sets X86_TUNE_AVX128_OPTIMAL by default for all AVX
> > > > targets, because we found there is still a performance gap between
> > > > 128-bit auto-vectorization and 256-bit auto-vectorization even with
> > > > the epilogue vectorized.
> > > >   The performance impact of setting avx128_optimal by default on
> > > > SPEC2017 with options `-march=native -funroll-loops -Ofast -flto` on
> > > > CLX is as below:
> > > >
> > > > INT rate
> > > > 500.perlbench_r         -0.32%
> > > > 502.gcc_r               -1.32%
> > > > 505.mcf_r               -0.12%
> > > > 520.omnetpp_r           -0.34%
> > > > 523.xalancbmk_r         -0.65%
> > > > 525.x264_r               2.23%
> > > > 531.deepsjeng_r          0.81%
> > > > 541.leela_r             -0.02%
> > > > 548.exchange2_r         10.89%  --> big improvement
> > > > 557.xz_r                 0.38%
> > > > geomean for intrate      1.10%
> > > >
> > > > FP rate
> > > > 503.bwaves_r             1.41%
> > > > 507.cactuBSSN_r         -0.14%
> > > > 508.namd_r               1.54%
> > > > 510.parest_r            -0.87%
> > > > 511.povray_r             0.28%
> > > > 519.lbm_r                0.32%
> > > > 521.wrf_r               -0.54%
> > > > 526.blender_r            0.59%
> > > > 527.cam4_r              -2.70%
> > > > 538.imagick_r            3.92%
> > > > 544.nab_r                0.59%
> > > > 549.fotonik3d_r         -5.44%  -> regression
> > > > 554.roms_r              -2.34%
> > > > geomean for fprate      -0.28%
> > > >
> > > > The 10% improvement of 548.exchange2_r comes from a 9-level nested
> > > > loop whose innermost trip count is small (enough for 128-bit
> > > > vectorization, but not for 256-bit vectorization).
> > > > Since the trip count cannot be determined statically, the vectorizer
> > > > chooses 256-bit vectorization, whose main loop is then never executed.
> > > > Vectorizing the epilogue introduces some extra instructions; normally
> > > > that recovers some performance, but in a 9-level nested loop the cost
> > > > of the extra instructions outweighs the gain.
> > > >
> > > > The 5.44% regression of 549.fotonik3d_r is because 256-bit
> > > > vectorization is better than 128-bit vectorization there. Generally,
> > > > enabling 256-bit or 512-bit vectorization reduces instruction
> > > > clockticks but also reduces frequency. When the frequency reduction
> > > > is smaller than the clocktick reduction, the longer vector width is
> > > > better than the shorter one; otherwise the opposite holds. The
> > > > regression of 549.fotonik3d_r is due to this, and similarly for
> > > > 554.roms_r and 527.cam4_r; for those 3 benchmarks, 512-bit
> > > > vectorization is best.
> > > >
> > > > Bootstrap and regression tests on i386 are ok.
> > > > Ok for trunk?
> > >
> > > I don't think 128_optimal does what you think it does.  If you want to
> > > prefer 128bit AVX adjust the preference, but 128_optimal describes
> > > a microarchitectural detail (AVX256 ops are split into two AVX128 ops)
> > But it will set target_prefer_avx128 by default.
> > 
> >   /* Enable 128-bit AVX instruction generation
> >      for the auto-vectorizer.  */
> >   if (TARGET_AVX128_OPTIMAL
> >       && (opts_set->x_prefer_vector_width_type == PVW_NONE))
> >     opts->x_prefer_vector_width_type = PVW_AVX128;
> >
> > And it may be too confusing to add another tuning flag.
>
> Well, it's confusing to mix two things - defaulting the vector width 
> prefer

[PATCH]Several intrinsic macros lack a closing parenthesis[PR93274]

2020-02-12 Thread Hongtao Liu
Hi
  As mentioned in PR93724, several intrinsic macros lack a closing
parenthesis. These macros are only used at -O0, and the current unit
tests use -O2, so the problem was not covered.
  Bootstrap is ok, and regression tests on i386/x86_64 are ok.
  Ok for trunk?
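
For context on why -O2 tests miss this: the intrinsic headers provide
intrinsics like these as inline functions when __OPTIMIZE__ is defined
and as macros otherwise, so a broken macro body is only ever seen by the
compiler at -O0. A minimal sketch of that pattern, where
demo_builtin_shld and demo_shldi are hypothetical names standing in for
the real builtins and intrinsics:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a builtin: concatenate a:b, shift left by
// imm and keep the upper 32 bits.  Only valid for imm in 1..31.
static inline std::uint32_t
demo_builtin_shld (std::uint32_t a, std::uint32_t b, unsigned imm)
{
  return (a << imm) | (b >> (32 - imm));
}

#ifdef __OPTIMIZE__
// Optimizing builds: an inline function, checked by every test at -O2.
static inline std::uint32_t
demo_shldi (std::uint32_t a, std::uint32_t b, unsigned imm)
{
  return demo_builtin_shld (a, b, imm);
}
#else
// -O0 builds: a macro.  A missing ')' here only breaks when the macro
// is expanded, so tests that always pass -O2 never notice it.
#define demo_shldi(A, B, C) \
  ((std::uint32_t) demo_builtin_shld ((std::uint32_t) (A), \
                                      (std::uint32_t) (B), (unsigned) (C)))
#endif
```

Both forms give the same result, so only a test that is compiled
without optimization exercises the macro branch.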

Changelog
gcc/
* config/i386/avx512vbmi2intrin.h
(_mm512_[,mask_,maskz_]shrdi_epi16,
_mm512_[,mask_,maskz_]shrdi_epi32,
_mm512_[,mask_,maskz_]shrdi_epi64,
_mm512_[,mask_,maskz_]shldi_epi16,
_mm512_[,mask_,maskz_]shldi_epi32,
_mm512_[,mask_,maskz_]shldi_epi64): Add the missing closing
parenthesis.
* config/i386/avx512vbmi2vlintrin.h
(_mm256_[,mask_,maskz_]shrdi_epi16,
_mm256_[,mask_,maskz_]shrdi_epi32,
_mm256_[,mask_,maskz_]shrdi_epi64,
_mm_[,mask_,maskz_]shrdi_epi16,
_mm_[,mask_,maskz_]shrdi_epi32,
_mm_[,mask_,maskz_]shrdi_epi64,
_mm256_[,mask_,maskz_]shldi_epi16,
_mm256_[,mask_,maskz_]shldi_epi32,
_mm256_[,mask_,maskz_]shldi_epi64,
_mm_[,mask_,maskz_]shldi_epi16,
_mm_[,mask_,maskz_]shldi_epi32,
_mm_[,mask_,maskz_]shldi_epi64): Ditto.

gcc/testsuite/
* gcc.target/i386/avx512vbmi2-vpshld-1.c: New test.
* gcc.target/i386/avx512vbmi2-vpshld-O0-1.c: Ditto.
* gcc.target/i386/avx512vbmi2-vpshrd-1.c: Ditto.
* gcc.target/i386/avx512vbmi2-vpshrd-O0-1.c: Ditto.
* gcc.target/i386/avx512vl-vpshld-O0-1.c: Ditto.
* gcc.target/i386/avx512vl-vpshrd-O0-1.c: Ditto.

-- 
BR,
Hongtao


0001-Intrinsic-macro-of-vpshr-and-vpshl-lack-a-closing-pa.patch
Description: Binary data


Re: [PATCH]Several intrinsic macros lack a closing parenthesis[PR93274]

2020-02-13 Thread Hongtao Liu
On Thu, Feb 13, 2020 at 5:12 PM Uros Bizjak  wrote:
>
> On Thu, Feb 13, 2020 at 9:53 AM Jakub Jelinek  wrote:
> >
> > On Thu, Feb 13, 2020 at 09:39:05AM +0100, Uros Bizjak wrote:
> > > > Changelog
> > > > gcc/
> > > >* config/i386/avx512vbmi2intrin.h
> > > >(_mm512_[,mask_,maskz_]shrdi_epi16,
> > > >_mm512_[,mask_,maskz_]shrdi_epi32,
> > > >_m512_[,mask_,maskz_]shrdi_epi64,
> > > >_mm512_[,mask_,maskz_]shldi_epi16,
> > > >_mm512_[,mask_,maskz_]shldi_epi32,
> > > >_m512_[,mask_,maskz_]shldi_epi64): Fix typo of lacking a
> > > >closing parenthesis.
> > > >* config/i386/avx512vbmi2vlintrin.h
> > > >(_mm256_[,mask_,maskz_]shrdi_epi16,
> > > >_mm256_[,mask_,maskz_]shrdi_epi32,
> > > >_m256_[,mask_,maskz_]shrdi_epi64,
> > > >_mm_[,mask_,maskz_]shrdi_epi16,
> > > >_mm_[,mask_,maskz_]shrdi_epi32,
> > > >_mm_[,mask_,maskz_]shrdi_epi64,
> > > >_mm256_[,mask_,maskz_]shldi_epi16,
> > > >_mm256_[,mask_,maskz_]shldi_epi32,
> > > >_m256_[,mask_,maskz_]shldi_epi64,
> > > >_mm_[,mask_,maskz_]shldi_epi16,
> > > >_mm_[,mask_,maskz_]shldi_epi32,
> > > >_mm_[,mask_,maskz_]shldi_epi64): Ditto.
> > > >
> > > > gcc/testsuite/
> > > >* gcc.target/i386/avx512vbmi2-vpshld-1.c: New test.
> > > >* gcc.target/i386/avx512vbmi2-vpshld-O0-1.c: Ditto.
> > > >* gcc.target/i386/avx512vbmi2-vpshrd-1.c: Ditto.
> > > >* gcc.target/i386/avx512vbmi2-vpshrd-O0-1.c: Ditto.
> > > >* gcc.target/i386/avx512vl-vpshld-O0-1.c: Ditto.
> > > >* gcc.target/i386/avx512vl-vpshrd-O0-1.c: Ditto.
> > >
> > > This is obvious patch, so OK for mainline and backports.
> >
> > The header changes sure, but for the testsuite, the standard way
> > would be to have it covered in the standard tests we have for this.
> > I think that is gcc.target/i386/sse-{13,14,22a,23}.c, so it would be worth
> > trying to figure out why it hasn't caught that.
>
> Indeed. It looks that these macros are not listed in sse-14.c, which
> would catch the problem. So, there is no need for new -O0 tests,
> please add missing functions to sse-14.c and sse-22.c testcases. I was
> also surprised that no testsuite coverage for vbmi2 functions was
> added at submission.
>
Yes, I saw that, thanks.
> Uros.
>
> > And, I don't think we allow any wildcards etc. (and [,whatever,whateverelse]
> > isn't even one, neither regexp nor shell wildcard) in the names of functions
> > changed, they can appear in the description text, but for the names of
> > macros one needs to list them all expanded, people do grep for those.
> >
> > Jakub
> >



-- 
BR,
Hongtao


Re: [PATCH]Several intrinsic macros lack a closing parenthesis[PR93274]

2020-02-13 Thread Hongtao Liu
On Thu, Feb 13, 2020 at 5:31 PM Hongtao Liu  wrote:
>
> On Thu, Feb 13, 2020 at 5:12 PM Uros Bizjak  wrote:
> >
> > On Thu, Feb 13, 2020 at 9:53 AM Jakub Jelinek  wrote:
> > >
> > > On Thu, Feb 13, 2020 at 09:39:05AM +0100, Uros Bizjak wrote:
> > > > > Changelog
> > > > > gcc/
> > > > >* config/i386/avx512vbmi2intrin.h
> > > > >(_mm512_[,mask_,maskz_]shrdi_epi16,
> > > > >_mm512_[,mask_,maskz_]shrdi_epi32,
> > > > >_m512_[,mask_,maskz_]shrdi_epi64,
> > > > >_mm512_[,mask_,maskz_]shldi_epi16,
> > > > >_mm512_[,mask_,maskz_]shldi_epi32,
> > > > >_m512_[,mask_,maskz_]shldi_epi64): Fix typo of lacking a
> > > > >closing parenthesis.
> > > > >* config/i386/avx512vbmi2vlintrin.h
> > > > >(_mm256_[,mask_,maskz_]shrdi_epi16,
> > > > >_mm256_[,mask_,maskz_]shrdi_epi32,
> > > > >_m256_[,mask_,maskz_]shrdi_epi64,
> > > > >_mm_[,mask_,maskz_]shrdi_epi16,
> > > > >_mm_[,mask_,maskz_]shrdi_epi32,
> > > > >_mm_[,mask_,maskz_]shrdi_epi64,
> > > > >_mm256_[,mask_,maskz_]shldi_epi16,
> > > > >_mm256_[,mask_,maskz_]shldi_epi32,
> > > > >_m256_[,mask_,maskz_]shldi_epi64,
> > > > >_mm_[,mask_,maskz_]shldi_epi16,
> > > > >_mm_[,mask_,maskz_]shldi_epi32,
> > > > >_mm_[,mask_,maskz_]shldi_epi64): Ditto.
> > > > >
> > > > > gcc/testsuite/
> > > > >* gcc.target/i386/avx512vbmi2-vpshld-1.c: New test.
> > > > >* gcc.target/i386/avx512vbmi2-vpshld-O0-1.c: Ditto.
> > > > >* gcc.target/i386/avx512vbmi2-vpshrd-1.c: Ditto.
> > > > >* gcc.target/i386/avx512vbmi2-vpshrd-O0-1.c: Ditto.
> > > > >* gcc.target/i386/avx512vl-vpshld-O0-1.c: Ditto.
> > > > >* gcc.target/i386/avx512vl-vpshrd-O0-1.c: Ditto.
> > > >
> > > > This is obvious patch, so OK for mainline and backports.
> > >
> > > The header changes sure, but for the testsuite, the standard way
> > > would be to have it covered in the standard tests we have for this.
> > > I think that is gcc.target/i386/sse-{13,14,22a,23}.c, so it would be worth
> > > trying to figure out why it hasn't caught that.
> >
> > Indeed. It looks that these macros are not listed in sse-14.c, which
> > would catch the problem. So, there is no need for new -O0 tests,
> > please add missing functions to sse-14.c and sse-22.c testcases. I was
> > also surprised that no testsuite coverage for vbmi2 functions was
> > added at submission.
> >
> Yes, i saw that, thanks.
> > Uros.
> >
> > > And, I don't think we allow any wildcards etc. (and [,whatever,whateverelse]
> > > isn't even one, neither regexp nor shell wildcard) in the names of functions
> > > changed, they can appear in the description text, but for the names of
> > > macros one needs to list them all expanded, people do grep for those.
> > >
> > > Jakub
> > >
>
>
>
> --
> BR,
> Hongtao

Updated patch:
Updated the ChangeLog, deleted the -O0 testcases, and added tests to
sse-14.c and sse-22.c.

-- 
BR,
Hongtao


0001-Intrinsic-macro-of-vpshr-and-vpshl-lack-a-closing-pa.patch
Description: Binary data


Re: [PATCH]Several intrinsic macros lack a closing parenthesis[PR93274]

2020-02-14 Thread Hongtao Liu
Done.

On Fri, Feb 14, 2020 at 7:16 PM Uros Bizjak  wrote:
>
> On Fri, Feb 14, 2020 at 8:06 AM Uros Bizjak  wrote:
> >
> > On Fri, Feb 14, 2020 at 7:03 AM Hongtao Liu  wrote:
> > >
> > > On Thu, Feb 13, 2020 at 5:31 PM Hongtao Liu  wrote:
> > > >
> > > > On Thu, Feb 13, 2020 at 5:12 PM Uros Bizjak  wrote:
> > > > >
> > > > > On Thu, Feb 13, 2020 at 9:53 AM Jakub Jelinek  wrote:
> > > > > >
> > > > > > On Thu, Feb 13, 2020 at 09:39:05AM +0100, Uros Bizjak wrote:
> > > > > > > > Changelog
> > > > > > > > gcc/
> > > > > > > >* config/i386/avx512vbmi2intrin.h
> > > > > > > >(_mm512_[,mask_,maskz_]shrdi_epi16,
> > > > > > > >_mm512_[,mask_,maskz_]shrdi_epi32,
> > > > > > > >_m512_[,mask_,maskz_]shrdi_epi64,
> > > > > > > >_mm512_[,mask_,maskz_]shldi_epi16,
> > > > > > > >_mm512_[,mask_,maskz_]shldi_epi32,
> > > > > > > >_m512_[,mask_,maskz_]shldi_epi64): Fix typo of lacking a
> > > > > > > >closing parenthesis.
> > > > > > > >* config/i386/avx512vbmi2vlintrin.h
> > > > > > > >(_mm256_[,mask_,maskz_]shrdi_epi16,
> > > > > > > >_mm256_[,mask_,maskz_]shrdi_epi32,
> > > > > > > >_m256_[,mask_,maskz_]shrdi_epi64,
> > > > > > > >_mm_[,mask_,maskz_]shrdi_epi16,
> > > > > > > >_mm_[,mask_,maskz_]shrdi_epi32,
> > > > > > > >_mm_[,mask_,maskz_]shrdi_epi64,
> > > > > > > >_mm256_[,mask_,maskz_]shldi_epi16,
> > > > > > > >_mm256_[,mask_,maskz_]shldi_epi32,
> > > > > > > >_m256_[,mask_,maskz_]shldi_epi64,
> > > > > > > >_mm_[,mask_,maskz_]shldi_epi16,
> > > > > > > >_mm_[,mask_,maskz_]shldi_epi32,
> > > > > > > >_mm_[,mask_,maskz_]shldi_epi64): Ditto.
> > > > > > > >
> > > > > > > > gcc/testsuite/
> > > > > > > >* gcc.target/i386/avx512vbmi2-vpshld-1.c: New test.
> > > > > > > >* gcc.target/i386/avx512vbmi2-vpshld-O0-1.c: Ditto.
> > > > > > > >* gcc.target/i386/avx512vbmi2-vpshrd-1.c: Ditto.
> > > > > > > >* gcc.target/i386/avx512vbmi2-vpshrd-O0-1.c: Ditto.
> > > > > > > >* gcc.target/i386/avx512vl-vpshld-O0-1.c: Ditto.
> > > > > > > >* gcc.target/i386/avx512vl-vpshrd-O0-1.c: Ditto.
> > > > > > >
> > > > > > > This is obvious patch, so OK for mainline and backports.
> > > > > >
> > > > > > The header changes sure, but for the testsuite, the standard way
> > > > > > would be to have it covered in the standard tests we have for this.
> > > > > > I think that is gcc.target/i386/sse-{13,14,22a,23}.c, so it would be worth
> > > > > > trying to figure out why it hasn't caught that.
> > > > >
> > > > > Indeed. It looks that these macros are not listed in sse-14.c, which
> > > > > would catch the problem. So, there is no need for new -O0 tests,
> > > > > please add missing functions to sse-14.c and sse-22.c testcases. I was
> > > > > also surprised that no testsuite coverage for vbmi2 functions was
> > > > > added at submission.
> > > > >
> > > > Yes, i saw that, thanks.
> > > > > Uros.
> > > > >
> > > > > > And, I don't think we allow any wildcards etc. (and [,whatever,whateverelse]
> > > > > > isn't even one, neither regexp nor shell wildcard) in the names of functions
> > > > > > changed, they can appear in the description text, but for the names of
> > > > > > macros one needs to list them all expanded, people do grep for those.
> > > > > >
> > > > > > Jakub
> > > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > BR,
> > > > Hongtao
> > >
> > > Update patch:
> > > Update Changelog, delete O0 testcase, and add testcase in sse-14.c, sse-22.c
> >
> > OK.
>
> Please also commit ChangeLog entries to relevant ChangeLog files.
>
> Uros.



-- 
BR,
Hongtao


Re: [PATCH] Enable mask operation for 128/256-bit vector VCOND_EXPR under avx512f (PR92686)

2019-12-04 Thread Hongtao Liu
On Wed, Dec 4, 2019 at 4:22 PM Jakub Jelinek  wrote:
>
> On Wed, Dec 04, 2019 at 10:07:05AM +0800, Hongtao Liu wrote:
> > Changelog
> > gcc/
> >   PR target/92686
> >   * config/i386/sse.md
> >   (*_cmp3,
> >   *_cmp3,
> >   *_ucmp3,
> >   *_ucmp3): New.
> >   * config/i386/i386.c (ix86_print_operand): New operand substitution.
> >   * config/i386/i386-expand.c (ix86_valid_mask_cmp_mode):
> >   New function.
> >   (ix86_expand_sse_cmp): Relax condition for integer mask from
> >   512-bit vector to all 128/256/512-bit vector. Delete code gen
> >   for avx512f compare patterns since we have generic pattern now.
> >   (ix86_expand_sse_movcc): Adjust condition and codegen for
> >   maskcmp.
> >   (ix86_expand_int_sse_cmp): Don't canonicalize the comparison
> >   when corresponding vector compare is available.
> >
> > gcc/testsuite/
> >   * gcc.target/i386/pr92686.inc: New file.
> >   * gcc.target/i386/avx512bw-pr92686-vpcmp-1.c: New test.
> >   * gcc.target/i386/avx512bw-pr92686-vpcmp-2.c: Ditto.
> >   * gcc.target/i386/avx512vl-pr92686-vpcmp-1.c: Ditto.
> >   * gcc.target/i386/avx512vl-pr92686-vpcmp-2.c: Ditto.
> >   * gcc.target/i386/avx512bw-pr92686-movcc-1.c: Ditto.
> >   * gcc.target/i386/avx512bw-pr92686-movcc-2.c: Ditto.
> >   * gcc.target/i386/avx512vl-pr92686-movcc-1.c: Ditto.
> >   * gcc.target/i386/avx512vl-pr92686-movcc-2.c: Ditto.
> >   * gcc.target/i386/avx512vl-pr88547-1.c: Adjust testcase.
> >   * gcc.target/i386/pr88547-1.c: Ditto.
>
> See comments below.
>
> > +  /* AVX512BW is needed for vector QI/HImode,
> > + AVX512VL is needed for 128/256-bit vector.  */
> > +  machine_mode inner_mode = GET_MODE_INNER (mode);
> > +  int vector_size = GET_MODE_SIZE (mode);
> > +  if ((inner_mode == QImode || inner_mode == HImode)
> > +  && !TARGET_AVX512BW)
>
> There is no reason not to keep && !TARGET_AVX512BW) on the previous line.
>
> > +  if (ix86_valid_mask_cmp_mode (cmp_ops_mode))
> >  {
> >unsigned int nbits = GET_MODE_NUNITS (cmp_ops_mode);
> > -  cmp_mode = int_mode_for_size (nbits, 0).require ();
> >maskcmp = true;
> > +  cmp_mode = nbits > 8 ?
> > + int_mode_for_size (nbits, 0).require ()
> > + : E_QImode;
>
> Formatting.  ? never goes at the end of line, similarly :.
> So, you want either
>   cmp_mode
>     = nbits > 8 ? int_mode_for_size (nbits, 0).require () : E_QImode;
> or
>   cmp_mode = (nbits > 8 ? int_mode_for_size (nbits, 0).require ()
>               : E_QImode);
> or
>   cmp_mode = (nbits > 8
>               ? int_mode_for_size (nbits, 0).require () : E_QImode);
> or
>   cmp_mode = (nbits > 8
>               ? int_mode_for_size (nbits, 0).require ()
>               : E_QImode);
> - the parens around to help emacs.
>
> > @@ -3515,7 +3510,7 @@ ix86_expand_sse_movcc (rtx dest, rtx cmp, rtx op_true, rtx op_false)
> >machine_mode cmpmode = GET_MODE (cmp);
> >
> >/* In AVX512F the result of comparison is an integer mask.  */
> > -  bool maskcmp = (mode != cmpmode && TARGET_AVX512F);
> > +  bool maskcmp = ((mode != cmpmode) && ix86_valid_mask_cmp_mode (mode));
>
> No reason for either pair of ()s here, of course except the function call
> argument list.
>
> > +  /* Using vector move with mask register.  */
> > +  cmp = force_reg (cmpmode, cmp);
> > +  /* Optimize for mask zero.  */
> > +  op_true = op_true != CONST0_RTX (mode)
> > + ? force_reg (mode, op_true)
> > + : op_true;
>
> Same thing as above, just in this case ? wasn't incorrectly at the end.
>
> > +  op_false = op_false != CONST0_RTX (mode)
> > + ? force_reg (mode, op_false)
> > + : op_false;
>
> And again.
>
> > --- a/gcc/config/i386/i386.c
> > +++ b/gcc/config/i386/i386.c
> > @@ -12468,6 +12468,38 @@ ix86_print_operand (FILE *file, rtx x, int code)
> >   }
> > return;
> >
> > + case 'I':
> > +   switch (GET_CODE (x))
> > + {
> > + case EQ:
> > +   fputs ("$0", file);
> > +   break;
> > + case NE:
> > +   fputs ("$4", file);
> > +   break;
> > + case GE:
> > + case GEU:
> > +   fputs ("$5", file);
> > +   break;
>

Re: [PATCH] Enable mask operation for 128/256-bit vector VCOND_EXPR under avx512f (PR92686)

2019-12-08 Thread Hongtao Liu
On Thu, Dec 5, 2019 at 4:03 PM Jakub Jelinek  wrote:
>
> On Thu, Dec 05, 2019 at 09:56:46AM +0800, Hongtao Liu wrote:
> > --- a/gcc/config/i386/i386-expand.c
> > +++ b/gcc/config/i386/i386-expand.c
> > +  /* Using vector move with mask register.  */
> > +  cmp = force_reg (cmpmode, cmp);
> > +  /* Optimize for mask zero.  */
> > +  op_true =
> > + op_true != CONST0_RTX (mode) ? force_reg (mode, op_true) : op_true;
> > +  op_false =
> > + op_false != CONST0_RTX (mode) ? force_reg (mode, op_false) : op_false;
>
> The above two still aren't correct, = doesn't belong at the end of line
> either.
>
>   op_true
> = op_true != CONST0_RTX (mode) ? force_reg (mode, op_true) : op_true;
>
> would be ok,
>
>   op_false
> = op_false != CONST0_RTX (mode) ? force_reg (mode, op_false) : 
> op_false;
>
> is too long, so e.g.
>
>   op_false = (op_false != CONST0_RTX (mode)
>   ? force_reg (mode, op_false) : op_false);
>
> > +   /* Reverse op_true op_false.  */
> > +   n = op_true;
> > +   op_true = op_false;
> > +   op_false = n;
>
> Please use
>   std::swap (op_true, op_false);
> instead of the above 3 lines.
>
> Also, can you please add at least one testcase for this with -masm=intel,
> effective target masm_intel and dg-do assemble to make sure it assembles?
> Perhaps just one -mavx512vl -mavx512bw avx512vl/avx512bw effective target that
> tests all the patterns?
>
Yes, avx512vl-pr92686-vpcmp-intelasm-1.c,
avx512bw-pr92686-vpcmp-intelasm-1.c are added.
> Ok with those changes.
>
> Jakub
>
Committed, thanks.

-- 
BR,
Hongtao


[PATCH] Use OPTION_MASK_ISA2_$target_[SET, UNSET, ] to indicate those for x_ix86_isa_flags2

2019-12-09 Thread Hongtao Liu
Hi uros:
  This patch renames OPTION_MASK_ISA_$target_[SET,UNSET, ]
to OPTION_MASK_ISA2_$target_[SET,UNSET, ] for those targets that set
x_ix86_isa_flags2.
  The target list is below:
-
static struct ix86_target_opts isa2_opts[] =
{
  { "-mcx16",               OPTION_MASK_ISA2_CX16 },
  { "-mvaes",               OPTION_MASK_ISA2_VAES },
  { "-mrdpid",              OPTION_MASK_ISA2_RDPID },
  { "-mpconfig",            OPTION_MASK_ISA2_PCONFIG },
  { "-mwbnoinvd",           OPTION_MASK_ISA2_WBNOINVD },
  { "-mavx512vp2intersect", OPTION_MASK_ISA2_AVX512VP2INTERSECT },
  { "-msgx",                OPTION_MASK_ISA2_SGX },
  { "-mavx5124vnniw",       OPTION_MASK_ISA2_AVX5124VNNIW },
  { "-mavx5124fmaps",       OPTION_MASK_ISA2_AVX5124FMAPS },
  { "-mhle",                OPTION_MASK_ISA2_HLE },
  { "-mmovbe",              OPTION_MASK_ISA2_MOVBE },
  { "-mclzero",             OPTION_MASK_ISA2_CLZERO },
  { "-mmwaitx",             OPTION_MASK_ISA2_MWAITX },
  { "-mmovdir64b",          OPTION_MASK_ISA2_MOVDIR64B },
  { "-mwaitpkg",            OPTION_MASK_ISA2_WAITPKG },
  { "-mcldemote",           OPTION_MASK_ISA2_CLDEMOTE },
  { "-mptwrite",            OPTION_MASK_ISA2_PTWRITE },
  { "-mavx512bf16",         OPTION_MASK_ISA2_AVX512BF16 },
  { "-menqcmd",             OPTION_MASK_ISA2_ENQCMD }
};
--
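
To illustrate why the ISA2 prefix is useful, here is a minimal sketch (not GCC code) of two separate flag words. The struct, the bit positions and all helper names are invented for the example; the only thing taken from the description above is that the ISA2 masks belong to x_ix86_isa_flags2 rather than to the first flags word.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two option flag words.
struct IsaFlags {
  std::uint64_t isa_flags = 0;   // models x_ix86_isa_flags
  std::uint64_t isa_flags2 = 0;  // models x_ix86_isa_flags2
};

// Bit positions are invented. The point is that a mask for one word can
// have the same numeric value as an unrelated mask for the other word.
constexpr std::uint64_t MASK_ISA_EXAMPLE_A = std::uint64_t{1} << 3;   // first word
constexpr std::uint64_t MASK_ISA2_EXAMPLE_B = std::uint64_t{1} << 3;  // second word

// Each helper tests its mask against the word the mask was defined for.
inline bool has_example_a(const IsaFlags &f) {
  return (f.isa_flags & MASK_ISA_EXAMPLE_A) != 0;
}
inline bool has_example_b(const IsaFlags &f) {
  return (f.isa_flags2 & MASK_ISA2_EXAMPLE_B) != 0;
}

// Enabling feature B must touch only the second word.
inline IsaFlags enable_example_b(IsaFlags f) {
  f.isa_flags2 |= MASK_ISA2_EXAMPLE_B;
  return f;
}

// Self-check: enabling B leaves A untouched even though the two masks
// are numerically equal.
inline void self_check() {
  IsaFlags f = enable_example_b(IsaFlags{});
  assert(has_example_b(f));
  assert(!has_example_a(f));
}
```

With both masks equal to the same bit, testing MASK_ISA2_EXAMPLE_B against isa_flags would compile and silently answer the wrong question, which is the kind of mix-up a distinct prefix makes easier to spot.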

  Bootstrap and regression test on i386/x86-64 backend is ok.
  Ok for trunk?

Changelog
* gcc/common/config/i386/i386-common.c
(OPTION_MASK_ISA_AVX5124FMAPS_SET): Rename to
OPTION_MASK_ISA2_AVX5124FMAPS_SET.
(OPTION_MASK_ISA_AVX5124VNNIW_SET, OPTION_MASK_ISA_AVX512BF16_SET,
OPTION_MASK_ISA_AVX512VP2INTERSECT_SET,
OPTION_MASK_ISA_PCONFIG_SET, OPTION_MASK_ISA_WBNOINVD_SET,
OPTION_MASK_ISA_SGX_SET, OPTION_MASK_ISA_CX16_SET,
OPTION_MASK_ISA_MOVBE_SET, OPTION_MASK_ISA_PTWRITE_SET,
OPTION_MASK_ISA_MWAITX_SET, OPTION_MASK_ISA_CLZERO_SET,
OPTION_MASK_ISA_RDPID_SET, OPTION_MASK_ISA_VAES_SET,
OPTION_MASK_ISA_MOVDIR64B_SET, OPTION_MASK_ISA_WAITPKG_SET,
OPTION_MASK_ISA_CLDEMOTE_SET, OPTION_MASK_ISA_ENQCMD_SET,
OPTION_MASK_ISA_AVX5124FMAPS_UNSET,
OPTION_MASK_ISA_AVX5124VNNIW_UNSET,
OPTION_MASK_ISA_AVX512BF16_UNSET,
OPTION_MASK_ISA_AVX512VP2INTERSECT_UNSET,
OPTION_MASK_ISA_PCONFIG_UNSET, OPTION_MASK_ISA_WBNOINVD_UNSET,
OPTION_MASK_ISA_SGX_UNSET, OPTION_MASK_ISA_CX16_UNSET,
OPTION_MASK_ISA_MOVBE_UNSET, OPTION_MASK_ISA_PTWRITE_UNSET,
OPTION_MASK_ISA_MWAITX_UNSET, OPTION_MASK_ISA_CLZERO_UNSET,
OPTION_MASK_ISA_RDPID_UNSET, OPTION_MASK_ISA_VAES_UNSET,
OPTION_MASK_ISA_MOVDIR64B_UNSET, OPTION_MASK_ISA_WAITPKG_UNSET,
OPTION_MASK_ISA_CLDEMOTE_UNSET, OPTION_MASK_ISA_ENQCMD_UNSET,
OPTION_MASK_ISA_AVX5124FMAPS, OPTION_MASK_ISA_AVX5124VNNIW,
OPTION_MASK_ISA_AVX512BF16, OPTION_MASK_ISA_AVX512VP2INTERSECT,
OPTION_MASK_ISA_PCONFIG, OPTION_MASK_ISA_WBNOINVD,
OPTION_MASK_ISA_SGX, OPTION_MASK_ISA_CX16, OPTION_MASK_ISA_MOVBE,
OPTION_MASK_ISA_PTWRITE, OPTION_MASK_ISA_MWAITX,
OPTION_MASK_ISA_CLZERO, OPTION_MASK_ISA_RDPID,
OPTION_MASK_ISA_VAES, OPTION_MASK_ISA_MOVDIR64B,
OPTION_MASK_ISA_WAITPKG, OPTION_MASK_ISA_CLDEMOTE,
OPTION_MASK_ISA_ENQCMD): Ditto.

* gcc/config/i386/i386-builtin.def
(OPTION_MASK_ISA_AVX5124FMAPS, OPTION_MASK_ISA_AVX5124VNNIW,
OPTION_MASK_ISA_AVX512BF16, OPTION_MASK_ISA_AVX512VP2INTERSECT,
OPTION_MASK_ISA_WBNOINVD, OPTION_MASK_ISA_PTWRITE,
OPTION_MASK_ISA_RDPID, OPTION_MASK_ISA_VAES,
OPTION_MASK_ISA_MOVDIR64B, OPTION_MASK_ISA_ENQCMD): Ditto.
* gcc/config/i386/i386-builtins.c (OPTION_MASK_ISA_MWAITX,
OPTION_MASK_ISA_CLZERO, OPTION_MASK_ISA_WAITPKG,
OPTION_MASK_ISA_CLDEMOTE, OPTION_MASK_ISA_WBNOINVD): Ditto.
* gcc/config/i386/i386-c.c
(OPTION_MASK_ISA_AVX5124FMAPS, OPTION_MASK_ISA_AVX5124VNNIW,
OPTION_MASK_ISA_AVX512BF16, OPTION_MASK_ISA_AVX512VP2INTERSECT,
OPTION_MASK_ISA_PCONFIG, OPTION_MASK_ISA_WBNOINVD,
OPTION_MASK_ISA_SGX, OPTION_MASK_ISA_CX16, OPTION_MASK_ISA_MOVBE,
OPTION_MASK_ISA_PTWRITE, OPTION_MASK_ISA_MWAITX,
OPTION_MASK_ISA_CLZERO, OPTION_MASK_ISA_RDPID,
OPTION_MASK_ISA_VAES, OPTION_MASK_ISA_MOVDIR64B,
OPTION_MASK_ISA_WAITPKG, OPTION_MASK_ISA_CLDEMOTE,
OPTION_MASK_ISA_ENQCMD): Ditto.
* gcc/config/i386/i386-option.c: Ditto.
* gcc/config/i386/i386.opt: Ditto.
* gcc/config/i386/i386.h (TARGET_ISA_AVX5124FMAPS,
TARGET_ISA_AVX5124VNNIW,  TARGET_ISA_AVX512BF16,
TARGET_ISA_AVX512VP2INTERSECT, TARGET_ISA_PCONFIG,
TARGET_ISA_WBNOINVD, TARGET_ISA_SGX, TARGET_ISA_CX16,
TARGET_ISA_MOVBE, TARGET_ISA_PTWRITE, TARGET_ISA_MWAITX,
TARGET_ISA_CLZERO, TARGET_ISA_RDPID, TARGET_ISA_VAES,
TARGET_ISA_MOVDIR64B, TARGET_ISA_WAITPKG, TARGET_ISA_CLDEMOTE,
TARGET_ISA_ENQCMD): Ditto.

-- 
BR,
Hongtao
From 42d004c271228a4d6a1075cf4b77ae3282388e69 Mon Sep 17 00:00:00 2001
From: liuhongt 
Date: Mon, 9 Dec 2019 13:37:46 +0800
Subject: [PATCH] Use OPTION_MASK_ISA2_*_SET, OPTION_MASK_ISA2_*

[PATCH] Fix unrecognizable insn of pr92865

2019-12-09 Thread Hongtao Liu
Hi jakub:
  This patch enables integer mask cmp/cmov under AVX512F even
with TARGET_XOP.
  Bootstrap and regression test on i386/x86_64 backend is ok.

Changelog:
PR target/92865
* gcc/config/i386/i386-expand.c (ix86_valid_mask_cmp_mode): Enable
integer mask cmov when available even with TARGET_XOP.
* gcc/testsuite/gcc.target/i386/pr92865-1.c: New test.

-- 
BR,
Hongtao
From 2c53eb1ddf876a616c7ee914256e3a27f30cd158 Mon Sep 17 00:00:00 2001
From: liuhongt 
Date: Tue, 10 Dec 2019 09:44:18 +0800
Subject: [PATCH] Fix unrecognizable insn of pr92865.

PR target/92865
* gcc/config/i386/i386-expand.c (ix86_valid_mask_cmp_mode): Enable
integer mask cmov when available even with TARGET_XOP.
* gcc/testsuite/gcc.target/i386/pr92865-1.c: New test.
---
 gcc/config/i386/i386-expand.c |  2 +-
 gcc/testsuite/gcc.target/i386/pr92865-1.c | 67 +++
 2 files changed, 68 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr92865-1.c

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index ff3c24cc5b7..cbf4eb7b487 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -3428,7 +3428,7 @@ static bool
 ix86_valid_mask_cmp_mode (machine_mode mode)
 {
   /* XOP has its own vector conditional movement.  */
-  if (TARGET_XOP)
+  if (TARGET_XOP && !TARGET_AVX512F)
 return false;
 
   /* AVX512F is needed for mask operation.  */
diff --git a/gcc/testsuite/gcc.target/i386/pr92865-1.c b/gcc/testsuite/gcc.target/i386/pr92865-1.c
new file mode 100644
index 000..49b5778a067
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr92865-1.c
@@ -0,0 +1,67 @@
+/* PR target/92865 */
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mavx512f -mavx512bw -mxop" } */
+/* { dg-final { scan-assembler-times "vpcmp\[bwdq\]\[\t ]" 4 } } */
+/* { dg-final { scan-assembler-times "vpcmpu\[bwdq\]\[\t ]" 4 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 4 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 4 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 4 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 4 } } */
+
+extern char arraysb[64];
+extern short arraysw[32];
+extern int arraysd[16];
+extern long long arraysq[8];
+
+extern unsigned char arrayub[64];
+extern unsigned short arrayuw[32];
+extern unsigned int arrayud[16];
+extern unsigned long long arrayuq[8];
+
+int f1(char a)
+{
+  for (int i = 0; i < 64; i++)
+arraysb[i] = arraysb[i] >= a;
+}
+
+int f2(short a)
+{
+  for (int i = 0; i < 32; i++)
+arraysw[i] = arraysw[i] >= a;
+}
+
+int f3(int a)
+{
+  for (int i = 0; i < 16; i++)
+arraysd[i] = arraysd[i] >= a;
+}
+
+int f4(long long a)
+{
+  for (int i = 0; i < 8; i++)
+arraysq[i] = arraysq[i] >= a;
+}
+
+int f5(unsigned char a)
+{
+  for (int i = 0; i < 64; i++)
+arrayub[i] = arrayub[i] >= a;
+}
+
+int f6(unsigned short a)
+{
+  for (int i = 0; i < 32; i++)
+arrayuw[i] = arrayuw[i] >= a;
+}
+
+int f7(unsigned int a)
+{
+  for (int i = 0; i < 16; i++)
+arrayud[i] = arrayud[i] >= a;
+}
+
+int f8(unsigned long long a)
+{
+  for (int i = 0; i < 8; i++)
+arrayuq[i] = arrayuq[i] >= a;
+}
-- 
2.18.1



Re: [PATCH] Fix unrecognizable insn of pr92865

2019-12-10 Thread Hongtao Liu
On Tue, Dec 10, 2019 at 4:11 PM Jakub Jelinek  wrote:
>
> On Tue, Dec 10, 2019 at 01:47:50PM +0800, Hongtao Liu wrote:
> >   This patch is to enable integer mask cmp/cmov under AVX512F even
> > with TARGET_XOP .
> >   Bootstrap and regression test on i386/x86_64 backend is ok.
> >
> > Changelog:
> > PR target/92865
> > * gcc/config/i386/i386-expand.c (ix86_valid_mask_cmp_mode): Enable
> > integer mask cmov when available even with TARGET_XOP.
> > * gcc/testsuite/gcc.target/i386/pr92865-1.c: New test.
>
> No gcc/ or gcc/testsuite/ prefixes in ChangeLog.
>
> > --- a/gcc/config/i386/i386-expand.c
> > +++ b/gcc/config/i386/i386-expand.c
> > @@ -3428,7 +3428,7 @@ static bool
> >  ix86_valid_mask_cmp_mode (machine_mode mode)
> >  {
> >/* XOP has its own vector conditional movement.  */
> > -  if (TARGET_XOP)
> > +  if (TARGET_XOP && !TARGET_AVX512F)
> >  return false;
> >
> >/* AVX512F is needed for mask operation.  */
>
> We don't know what will AMD CPUs with AVX512* do or what will be optimal for
> them, there aren't any yet.  I guess this is fine for now, so would be your
> previous && GET_MODE_SIZE (mode) == 16.
Yes, I'll make it tunable for different processors. Should that go in a
separate patch or in this one?
>
> Jakub
>
Updated patch with Changelog


Changelog
gcc/
PR target/92865
* config/i386/i386-expand.c (ix86_valid_mask_cmp_mode): Enable
integer mask cmov when available even with TARGET_XOP.

gcc/testsuite
* gcc/testsuite/gcc.target/i386/pr92865-1.c: New test.


-- 
BR,
Hongtao
From 16b7c5caa930684fce604b352575d27a92b313fb Mon Sep 17 00:00:00 2001
From: liuhongt 
Date: Tue, 10 Dec 2019 09:44:18 +0800
Subject: [PATCH] Fix unrecognizable insn of pr92865.

gcc/
PR target/92865
* config/i386/i386-expand.c (ix86_valid_mask_cmp_mode): Enable
integer mask cmov when available even with TARGET_XOP.

gcc/testsuite
* gcc/testsuite/gcc.target/i386/pr92865-1.c: New test.
---
 gcc/config/i386/i386-expand.c |  2 +-
 gcc/testsuite/gcc.target/i386/pr92865-1.c | 67 +++
 2 files changed, 68 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr92865-1.c

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index ff3c24cc5b7..cbf4eb7b487 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -3428,7 +3428,7 @@ static bool
 ix86_valid_mask_cmp_mode (machine_mode mode)
 {
   /* XOP has its own vector conditional movement.  */
-  if (TARGET_XOP)
+  if (TARGET_XOP && !TARGET_AVX512F)
 return false;
 
   /* AVX512F is needed for mask operation.  */
diff --git a/gcc/testsuite/gcc.target/i386/pr92865-1.c b/gcc/testsuite/gcc.target/i386/pr92865-1.c
new file mode 100644
index 000..49b5778a067
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr92865-1.c
@@ -0,0 +1,67 @@
+/* PR target/92865 */
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mavx512f -mavx512bw -mxop" } */
+/* { dg-final { scan-assembler-times "vpcmp\[bwdq\]\[\t ]" 4 } } */
+/* { dg-final { scan-assembler-times "vpcmpu\[bwdq\]\[\t ]" 4 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]8\[\t ]" 4 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]16\[\t ]" 4 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]32\[\t ]" 4 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]64\[\t ]" 4 } } */
+
+extern char arraysb[64];
+extern short arraysw[32];
+extern int arraysd[16];
+extern long long arraysq[8];
+
+extern unsigned char arrayub[64];
+extern unsigned short arrayuw[32];
+extern unsigned int arrayud[16];
+extern unsigned long long arrayuq[8];
+
+int f1(char a)
+{
+  for (int i = 0; i < 64; i++)
+arraysb[i] = arraysb[i] >= a;
+}
+
+int f2(short a)
+{
+  for (int i = 0; i < 32; i++)
+arraysw[i] = arraysw[i] >= a;
+}
+
+int f3(int a)
+{
+  for (int i = 0; i < 16; i++)
+arraysd[i] = arraysd[i] >= a;
+}
+
+int f4(long long a)
+{
+  for (int i = 0; i < 8; i++)
+arraysq[i] = arraysq[i] >= a;
+}
+
+int f5(unsigned char a)
+{
+  for (int i = 0; i < 64; i++)
+arrayub[i] = arrayub[i] >= a;
+}
+
+int f6(unsigned short a)
+{
+  for (int i = 0; i < 32; i++)
+arrayuw[i] = arrayuw[i] >= a;
+}
+
+int f7(unsigned int a)
+{
+  for (int i = 0; i < 16; i++)
+arrayud[i] = arrayud[i] >= a;
+}
+
+int f8(unsigned long long a)
+{
+  for (int i = 0; i < 8; i++)
+arrayuq[i] = arrayuq[i] >= a;
+}
-- 
2.18.1



Re: [PATCH] Fix unrecognizable insn of pr92865

2019-12-10 Thread Hongtao Liu
On Wed, Dec 11, 2019 at 3:54 PM Jakub Jelinek  wrote:
>
> On Wed, Dec 11, 2019 at 09:55:24AM +0800, Hongtao Liu wrote:
> > Changelog
> > gcc/
> > PR target/92865
> > * config/i386/i386-expand.c (ix86_valid_mask_cmp_mode): Enable
> > integer mask cmov when available even with TARGET_XOP.
> >
> > gcc/testsuite
> > * gcc/testsuite/gcc.target/i386/pr92865-1.c: New test.
>
> Please remove gcc/testsuite/ here too.
> Ok with that change.
Yes, thanks.
>
> Jakub
>


-- 
BR,
Hongtao


[PATCH]Add tune option for integer mask cmov, enable this tune for m_CORE_AVX512

2019-12-11 Thread Hongtao Liu
Hi:
  This patch adds a tune option for integer mask cmov. Some targets
have both integer mask registers and SSE mask registers; this tune
tells the backend to use the integer ones. Currently it is on by
default for m_CORE_AVX512.

  Bootstrap is ok, regression test on i386/x86_64 backends is ok.
  ok for trunk?
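
As a rough illustration, the early exits at the top of ix86_valid_mask_cmp_mode (the XOP check from the PR92865 fix plus the new tune check) can be modelled as below. This is a sketch only: the Target struct, its field names and passes_early_checks are hypothetical, and the rest of the real function is not modelled.

```cpp
#include <cassert>

// Hypothetical, simplified description of a target; the field names are
// invented for this sketch.
struct Target {
  bool xop;                       // models TARGET_XOP
  bool avx512f;                   // models TARGET_AVX512F
  bool prefer_integer_mask_cmov;  // models TARGET_PREFER_INTEGER_MASK_CMOV
};

// Models only the early exits quoted in this thread.  A false result means
// one of them rejected the mode; a true result only means "not rejected by
// these three checks".
inline bool passes_early_checks(const Target &t, unsigned mode_size_bytes,
                                bool is_vector_mode) {
  // XOP has its own vector conditional move, unless AVX512F is also enabled.
  if (t.xop && !t.avx512f)
    return false;
  // Without the tune, only 512-bit (64-byte) vectors use integer masks.
  if (!t.prefer_integer_mask_cmov && mode_size_bytes != 64)
    return false;
  // AVX512F and a vector mode are needed for mask operations.
  if (!(t.avx512f && is_vector_mode))
    return false;
  return true;
}

// Self-check: a 256-bit vector is rejected only when the tune is off.
inline void self_check() {
  const Target tuned = {false, true, true};
  const Target untuned = {false, true, false};
  assert(passes_early_checks(tuned, 32, true));
  assert(!passes_early_checks(untuned, 32, true));
}
```

In this model a target without the tune still passes for 64-byte modes, which matches the comment in the patch that 512-bit vectors can only use the integer mask form.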

Changelog
gcc/
* config/i386/i386-expand.c (ix86_valid_mask_cmp_mode): Return
false if target not prefer using integer mask cmov for
128/256-bit vector under avx512f.
* config/i386/i386.h (TARGET_PREFER_INTEGER_MASK_CMOV): New
macro.
* config/i386/x86-tune.def
(X86_TUNE_PREFER_INTEGER_MASK_CMOV): New tune.

gcc/testsuite
* gcc.target/i386/avx512bw-pr92686-movcc-1.c: Adjust test case.
* gcc.target/i386/avx512bw-pr92686-movcc-2.c: Ditto.
* gcc.target/i386/avx512bw-pr92686-vpcmp-1.c: Ditto.
* gcc.target/i386/avx512bw-pr92686-vpcmp-2.c: Ditto.
* gcc.target/i386/avx512vl-pr92686-movcc-1.c: Ditto.
* gcc.target/i386/avx512vl-pr92686-movcc-2.c: Ditto.
* gcc.target/i386/avx512vl-pr92686-vpcmp-1.c: Ditto.
* gcc.target/i386/avx512vl-pr92686-vpcmp-2.c: Ditto.
* gcc.target/i386/avx512vl-pr88547-1.c: Ditto.


-- 
BR,
Hongtao
From 716bdede7f23ef035d93fb1d4f6917e19cef5f3e Mon Sep 17 00:00:00 2001
From: liuhongt 
Date: Wed, 11 Dec 2019 16:38:04 +0800
Subject: [PATCH] Add tune option for integer mask cmov, enable this tune for m_CORE_AVX512

Changelog
gcc/
	* config/i386/i386-expand.c (ix86_valid_mask_cmp_mode): Return
	false if target not prefer using integer mask cmov for
	128/256-bit vector under avx512f.
	* config/i386/i386.h (TARGET_PREFER_INTEGER_MASK_CMOV): New
	macro.
	* config/i386/x86-tune.def
	(X86_TUNE_PREFER_INTEGER_MASK_CMOV): New tune.

gcc/testsuite
	* gcc.target/i386/avx512bw-pr92686-movcc-1.c: Adjust test case.
	* gcc.target/i386/avx512bw-pr92686-movcc-2.c: Ditto.
	* gcc.target/i386/avx512bw-pr92686-vpcmp-1.c: Ditto.
	* gcc.target/i386/avx512bw-pr92686-vpcmp-2.c: Ditto.
	* gcc.target/i386/avx512vl-pr92686-movcc-1.c: Ditto.
	* gcc.target/i386/avx512vl-pr92686-movcc-2.c: Ditto.
	* gcc.target/i386/avx512vl-pr92686-vpcmp-1.c: Ditto.
	* gcc.target/i386/avx512vl-pr92686-vpcmp-2.c: Ditto.
	* gcc.target/i386/avx512vl-pr88547-1.c: Ditto.
---
 gcc/config/i386/i386-expand.c  |4 
 gcc/config/i386/i386.h |2 ++
 gcc/config/i386/x86-tune.def   |   10 ++
 .../gcc.target/i386/avx512bw-pr92686-movcc-1.c |2 +-
 .../gcc.target/i386/avx512bw-pr92686-movcc-2.c |2 +-
 .../gcc.target/i386/avx512bw-pr92686-vpcmp-1.c |2 +-
 .../gcc.target/i386/avx512bw-pr92686-vpcmp-2.c |2 +-
 gcc/testsuite/gcc.target/i386/avx512vl-pr88547-1.c |6 +++---
 .../gcc.target/i386/avx512vl-pr92686-movcc-1.c |2 +-
 .../gcc.target/i386/avx512vl-pr92686-movcc-2.c |2 +-
 .../gcc.target/i386/avx512vl-pr92686-vpcmp-1.c |2 +-
 .../gcc.target/i386/avx512vl-pr92686-vpcmp-2.c |2 +-
 12 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index cbf4eb7..a627642 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -3431,6 +3431,10 @@ ix86_valid_mask_cmp_mode (machine_mode mode)
   if (TARGET_XOP && !TARGET_AVX512F)
 return false;
 
+  /* For 512-bit vector, only integer mask vcmp/vcmov is valid.  */
+  if (!TARGET_PREFER_INTEGER_MASK_CMOV && GET_MODE_SIZE (mode) != 64)
+return false;
+
   /* AVX512F is needed for mask operation.  */
   if (!(TARGET_AVX512F && VECTOR_MODE_P (mode)))
 return false;
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 2542cb3..23d796e 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -596,6 +596,8 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 	ix86_tune_features[X86_TUNE_USE_XCHG_FOR_ATOMIC_STORE]
 #define TARGET_EMIT_VZEROUPPER \
 	ix86_tune_features[X86_TUNE_EMIT_VZEROUPPER]
+#define TARGET_PREFER_INTEGER_MASK_CMOV \
+	ix86_tune_features[X86_TUNE_PREFER_INTEGER_MASK_CMOV]
 
 /* Feature tests against the various architecture variations.  */
 enum ix86_arch_indices {
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index 328535d..e944f39 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -467,6 +467,16 @@ DEF_TUNE (X86_TUNE_AVX128_OPTIMAL, "avx128_optimal", m_BDVER | m_BTVER2
 DEF_TUNE (X86_TUNE_AVX256_OPTIMAL, "avx256_optimal", m_CORE_AVX512)
 
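 /*****************************************************************************/
+/* AVX512 instruction selection tuning. */
+/*****************************************************************************/
+
+/* X86_TUNE_PREFER_INTEGER_MASK_CMOV: Use integer mask vcmov/vcmp for
+   128/256-bit vector under avx512f, there are also instructions
+   using sse regs as mask under avx2 or xop.  */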
+DEF_

[PATCH] Fix redundant load missed by fre [tree-optimization 92980]

2019-12-17 Thread Hongtao Liu
Hi:
  This patch is to simplify A * C + (-D) -> (A - D/C) * C when C is a
power of 2 and D mod C == 0.
  bootstrap and make check is ok.
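
A small standalone check of the arithmetic behind this rewrite, assuming 64-bit unsigned (wrap-around) arithmetic. The function names lhs, rhs, is_pow2 and rewrite_applies are made up for this sketch and are not part of match.pd.

```cpp
#include <cassert>
#include <cstdint>

// Left-hand side of the identity: A * C + (-D), evaluated modulo 2^64.
inline std::uint64_t lhs(std::uint64_t a, std::uint64_t c, std::uint64_t d) {
  return a * c + (std::uint64_t{0} - d);
}

// Right-hand side: (A - D/C) * C.  Only equal to lhs when D % C == 0.
inline std::uint64_t rhs(std::uint64_t a, std::uint64_t c, std::uint64_t d) {
  return (a - d / c) * c;
}

// True when c is a nonzero power of two.
inline bool is_pow2(std::uint64_t c) { return c != 0 && (c & (c - 1)) == 0; }

// The condition stated in the patch description: C is a power of 2 and
// D is an exact multiple of C.
inline bool rewrite_applies(std::uint64_t c, std::uint64_t d) {
  return is_pow2(c) && d % c == 0;
}

// Self-check, including a case where A - D/C wraps around.
inline void self_check() {
  assert(rewrite_applies(4, 8) && lhs(5, 4, 8) == rhs(5, 4, 8));
  assert(rewrite_applies(2, 6) && lhs(0, 2, 6) == rhs(0, 2, 6));
}
```

For example, with A = 5, C = 4, D = 8 both sides evaluate to 12, since (D/C) * C reproduces D exactly when D mod C == 0.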

changelog
gcc/
* gcc/match.pd (A * C + (-D) = (A - D/C) * C. when C is a
power of 2 and D mod C == 0): Add new simplification.

gcc/testsuite
* gcc.dg/pr92980.c: New test.

-- 
BR,
Hongtao
From 41f76f29f0070082e29082460efdb0bb9b9869f7 Mon Sep 17 00:00:00 2001
From: liuhongt 
Date: Fri, 13 Dec 2019 15:52:02 +0800
Subject: [PATCH] Simplify A * C + (-D) = (A - D/C) * C. when C is a power of 2
 and D mod C == 0.

gcc/
	* gcc/match.pd (A * C + (-D) = (A - D/C) * C. when C is a
	power of 2 and D mod C == 0): Add new simplification.

gcc/testsuite
	* gcc.dg/pr92980.c: New test.
---
 gcc/match.pd   | 20 
 gcc/testsuite/gcc.dg/pr92980.c | 43 ++
 2 files changed, 63 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/pr92980.c

diff --git a/gcc/match.pd b/gcc/match.pd
index dda86964b4c..a128733e2c3 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -4297,6 +4297,26 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (if (tree_single_nonzero_warnv_p (@0, NULL))
{ constant_boolean_node (cmp == NE_EXPR, type); })))
 
+/* Simplify A * C + (-D) = (A - D/C) * C. when C is a power of 2
+   and D mod C == 0.  */
+(simplify
+ (plus (mult @0 integer_pow2p@1) INTEGER_CST@2)
+ (if (TREE_CODE (TREE_TYPE (@0)) == INTEGER_TYPE
+ && TYPE_UNSIGNED (TREE_TYPE (@0))
+ && tree_fits_uhwi_p (@1)
+ && tree_fits_uhwi_p (@2))
+  (with
+   {
+ unsigned HOST_WIDE_INT c = tree_to_uhwi (@1);
+ unsigned HOST_WIDE_INT d = tree_to_uhwi (@2);
+ HOST_WIDE_INT neg_p = wi::sign_mask (d);
+ unsigned HOST_WIDE_INT negd = HOST_WIDE_INT_0U - d;
+ unsigned HOST_WIDE_INT modd = negd % c;
+ unsigned HOST_WIDE_INT divd = negd / c;
+}
+   (if (neg_p && modd == HOST_WIDE_INT_0U)
+(mult (minus @0 { build_int_cst (TREE_TYPE (@2), divd);}) @1)))))
+
 /* If we have (A & C) == C where C is a power of 2, convert this into
(A & C) != 0.  Similarly for NE_EXPR.  */
 (for cmp (eq ne)
diff --git a/gcc/testsuite/gcc.dg/pr92980.c b/gcc/testsuite/gcc.dg/pr92980.c
new file mode 100644
index 000..d7abf20788e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr92980.c
@@ -0,0 +1,43 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-fre" }  */
+
+int f1(short *src1, int i, int k, int n)
+{
+  int j = k + n;
+  short sum = src1[j];
+  sum += src1[j-1];
+  if (i <= k)
+{
+  j+=2;
+  sum += src1[j-3];
+}
+  return sum + j;
+}
+
+int f2(int *src1, int i, int k, int n)
+{
+  int j = k + n;
+  int sum = src1[j];
+  sum += src1[j-1];
+  if (i <= k)
+{
+  j+=2;
+  sum += src1[j-3];
+}
+  return sum + j;
+}
+
+int f3(long long *src1, int i, int k, int n)
+{
+  int j = k + n;
+  long long sum = src1[j];
+  sum += src1[j-1];
+  if (i <= k)
+{
+  j+=2;
+  sum += src1[j-3];
+}
+  return sum + j;
+}
+
+/* { dg-final { scan-tree-dump-times "= \\*" 6 "fre1" } }  */
-- 
2.18.1



Re: [PATCH] Fix redundant load missed by fre [tree-optimization 92980]

2019-12-17 Thread Hongtao Liu
On Wed, Dec 18, 2019 at 10:50 AM Andrew Pinski  wrote:
>
> On Tue, Dec 17, 2019 at 6:33 PM Hongtao Liu  wrote:
> >
> > Hi:
> >   This patch is to simplify A * C + (-D) -> (A - D/C) * C when C is a
> > power of 2 and D mod C == 0.
> >   bootstrap and make check is ok.
>
> I don't see why D has to be negative here.
>
>
> >TREE_CODE (TREE_TYPE (@0)) == INTEGER_TYPE
> + && TYPE_UNSIGNED (TREE_TYPE (@0))
>
> This is the wrong check here.
> Use INTEGRAL_TYPE_P .
>
> >+ (plus (mult @0 integer_pow2p@1) INTEGER_CST@2)
>
>  You might want a :s here for the mult and/or plus.
>
> unsigned HOST_WIDE_INT d = tree_to_uhwi (@2);
> ...
> Maybe use wide_int math instead of HOST_WIDE_INT here, then you don't
> need the tree_fits_uhwi_p check.
>
> Add a testcase that tests the pattern directly rather than indirectly.
>
> Also we are in stage 3 which means bug fixes only so this might/should
> wait until stage 1.

Yes, thanks.

>
> Thanks,
> Andrew Pinski
>
> >
> > changelog
> > gcc/
> > * gcc/match.pd (A * C + (-D) = (A - D/C) * C. when C is a
> > power of 2 and D mod C == 0): Add new simplification.
> >
> > gcc/testsuite
> > * gcc.dg/pr92980.c: New test.
> >
> > --
> > BR,
> > Hongtao



-- 
BR,
Hongtao


Re: [PATCH] Fix redundant load missed by fre [tree-optimization 92980]

2019-12-18 Thread Hongtao Liu
On Wed, Dec 18, 2019 at 4:26 PM Segher Boessenkool
 wrote:
>
> On Wed, Dec 18, 2019 at 10:37:11AM +0800, Hongtao Liu wrote:
> > Hi:
> >   This patch is to simplify A * C + (-D) -> (A - D/C) * C when C is a
> > power of 2 and D mod C == 0.
> >   bootstrap and make check is ok.
>
> Why would this be a good idea?  It is not reducing the number of
> operators or similar?
>
It helps VN, so that fre will delete redundant load.
>
> Segher



-- 
BR,
Hongtao


Re: [PATCH] i386: Guard noreturn no-callee-saved-registers optimization with -mnoreturn-no-callee-saved-registers [PR38534]

2024-03-04 Thread Hongtao Liu
On Thu, Feb 29, 2024 at 2:20 PM Hongtao Liu  wrote:
>
> On Wed, Feb 28, 2024 at 4:54 PM Jakub Jelinek  wrote:
> >
> > Hi!
> >
> > Adding Hongtao and Honza into the loop as the ones who acked the original
> > patch.
> >
> > The no_callee_saved_registers by default for noreturn functions change can
> > break in-process backtrace(3) or backtraces from debugger or other process
> > (quite often, any time the noreturn function decides to use the bp register
> > and any of the parent frames uses a frame pointer; the unwinder just crashes
> > in the libgcc unwinder case, gdb prints stack corrupted message), so I'd
> > like to save bp register in that case:
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2024-February/646591.html
> I think this patch makes sense and LGTM, we save and restore frame
> pointer for noreturn.
> >
> > and additionally the no_callee_saved_registers by default for noreturn
> > functions change can make debugging harder, again not localized to the
> > noreturn function, but any of its callers.  So, if say glibc abort function
> > implementation needs a lot of normally callee-saved registers, no matter how
> > users recompile their apps, they will see garbage or optimized out
> > vars/parameters in their code unless they rebuild their glibc with -O0.
> > So, I think we should guard that by a non-default option:
From what has been discussed so far, I am inclined toward this proposal.
If there are no additional objections (or concerns) in a few days, ok
for the trunk.
> >
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


Re: [PATCH] sanitizer: [PR110027] Align asan_vec[0] to MAX (alignb, ASAN_RED_ZONE_SIZE)

2024-03-12 Thread Hongtao Liu
On Tue, Mar 12, 2024 at 8:00 PM liuhongt  wrote:
>
> If alignb > ASAN_RED_ZONE_SIZE and offset[0] is not a multiple of
> alignb, (base_align_bias - base_offset) may not be aligned to alignb,
> which caused a segmentation fault.
>
> Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
> Ok for trunk and backport to GCC13?
CC Jakub. I see the code was added by
https://gcc.gnu.org/pipermail/gcc-patches/2018-December/512313.html
The issue in the PR is similar, but __m512 requires bigger
alignment (64 > ASAN_RED_ZONE_SIZE (32)); in that case we need to use
MAX (alignb, ASAN_RED_ZONE_SIZE) instead of ASAN_RED_ZONE_SIZE.
This assumes that when alignb > ASAN_RED_ZONE_SIZE, it is a multiple of
ASAN_RED_ZONE_SIZE.
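
The effect can be seen with plain integer arithmetic. The sketch below is a simplified model, not the cfgexpand.cc code: align_up, start_before_fix and start_after_fix are hypothetical helpers, offsets are treated as growing upwards, and only the red zone size of 32 is taken from the message above.

```cpp
#include <cassert>
#include <cstdint>

// Value taken from the message above: ASAN_RED_ZONE_SIZE is 32.
constexpr std::uint64_t kRedZoneSize = 32;

// Round offset up to the next multiple of align (align must be a power
// of 2).  Simplified stand-in for aligning the frame offset; it ignores
// the direction in which the frame grows.
inline std::uint64_t align_up(std::uint64_t offset, std::uint64_t align) {
  return (offset + align - 1) & ~(align - 1);
}

// Before the fix: always align to the red zone size, whatever alignb is.
inline std::uint64_t start_before_fix(std::uint64_t offset,
                                      std::uint64_t /*alignb*/) {
  return align_up(offset, kRedZoneSize);
}

// After the fix: align to the larger of alignb and the red zone size.
inline std::uint64_t start_after_fix(std::uint64_t offset,
                                     std::uint64_t alignb) {
  return align_up(offset, alignb > kRedZoneSize ? alignb : kRedZoneSize);
}

// Self-check for a 64-byte aligned variable starting from offset 72.
inline void self_check() {
  assert(start_before_fix(72, 64) % 64 != 0);  // 96: misaligned for __m512
  assert(start_after_fix(72, 64) % 64 == 0);   // 128: correctly aligned
}
```

Starting from offset 72 with alignb = 64, the old rule yields 96, which is a multiple of 32 but not of 64, while the new rule yields 128.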
>
> gcc/ChangeLog:
>
> PR sanitizer/110027
> * cfgexpand.cc (expand_stack_vars): Align frame offset to
> MAX (alignb, ASAN_RED_ZONE_SIZE).
>
> gcc/testsuite/ChangeLog:
>
> * g++.dg/asan/pr110027.C: New test.
> ---
>  gcc/cfgexpand.cc |  2 +-
>  gcc/testsuite/g++.dg/asan/pr110027.C | 20 
>  2 files changed, 21 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/g++.dg/asan/pr110027.C
>
> diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
> index 0de299c62e3..92062378d8e 100644
> --- a/gcc/cfgexpand.cc
> +++ b/gcc/cfgexpand.cc
> @@ -1214,7 +1214,7 @@ expand_stack_vars (bool (*pred) (size_t), class 
> stack_vars_data *data)
> {
>   if (data->asan_vec.is_empty ())
> {
> - align_frame_offset (ASAN_RED_ZONE_SIZE);
> + align_frame_offset (MAX (alignb, ASAN_RED_ZONE_SIZE));
>   prev_offset = frame_offset.to_constant ();
> }
>   prev_offset = align_base (prev_offset,
> diff --git a/gcc/testsuite/g++.dg/asan/pr110027.C 
> b/gcc/testsuite/g++.dg/asan/pr110027.C
> new file mode 100644
> index 000..0067781bc89
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/asan/pr110027.C
> @@ -0,0 +1,20 @@
> +/* PR sanitizer/110027 */
> +/* { dg-do run } */
> +/* { dg-require-effective-target avx512f_runtime } */
> +/* { dg-options "-std=gnu++23 -mavx512f -fsanitize=address -O0 -g 
> -fstack-protector-strong" } */
> +
> +#include 
> +#include 
> +
> +template 
> +using Vec [[gnu::vector_size(W * sizeof(T))]] = T;
> +
> +auto foo() {
> +  Vec<8, int64_t> ret{};
> +  return ret;
> +}
> +
> +int main() {
> +  foo();
> +  return 0;
> +}
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH] i386[stv]: Handle REG_EH_REGION note

2024-03-14 Thread Hongtao Liu
On Thu, Mar 14, 2024 at 3:22 PM Uros Bizjak  wrote:
>
> On Thu, Mar 14, 2024 at 2:33 AM liuhongt  wrote:
> >
> > When we split
> > (insn 37 36 38 10 (set (reg:DI 104 [ _18 ])
> > (mem:DI (reg/f:SI 98 [ CallNative_nclosure.0_1 ]) [6 MEM[(struct 
> > SQRefCounted *)CallNative_nclosure.0_1]._uiRef+0 S8 A32])) "test.C":22:42 
> > 84 {*movdi_internal}
> >  (expr_list:REG_EH_REGION (const_int -11 [0xfff5])
> >
> > into
> >
> > (insn 104 36 37 10 (set (subreg:V2DI (reg:DI 124) 0)
> > (vec_concat:V2DI (mem:DI (reg/f:SI 98 [ CallNative_nclosure.0_1 ]) 
> > [6 MEM[(struct SQRefCounted *)CallNative_nclosure.0_1]._uiRef+0 S8 A32])
> > (const_int 0 [0]))) "test.C":22:42 -1
> > (nil)))
> > (insn 37 104 105 10 (set (subreg:V2DI (reg:DI 104 [ _18 ]) 0)
> > (subreg:V2DI (reg:DI 124) 0)) "test.C":22:42 2024 {movv2di_internal}
> >  (expr_list:REG_EH_REGION (const_int -11 [0xfff5])
> > (nil)))
> >
> > we must copy the REG_EH_REGION note to the first insn and split the block
> > after the newly added insn.  The REG_EH_REGION on the second insn will be
> > removed later since it no longer traps.
> >
> > Currently we only handle memory_operand. Are there any other insns
> > that need to be handled?
>
> I think memory access is the only thing that can trap.
>
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} for trunk and 
> > gcc-13/gcc-12 release branch.
> > Ok for trunk and backport?
> >
> > gcc/ChangeLog:
> >
> > * config/i386/i386-features.cc
> > (general_scalar_chain::convert_op): Handle REG_EH_REGION note.
> > (convert_scalars_to_vector): Ditto.
> > * config/i386/i386-features.h (class scalar_chain): New
> > member control_flow_insns.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * g++.target/i386/pr111822.C: New test.
> > ---
> >  gcc/config/i386/i386-features.cc | 48 ++--
> >  gcc/config/i386/i386-features.h  |  1 +
> >  gcc/testsuite/g++.target/i386/pr111822.C | 45 ++
> >  3 files changed, 90 insertions(+), 4 deletions(-)
> >  create mode 100644 gcc/testsuite/g++.target/i386/pr111822.C
> >
> > diff --git a/gcc/config/i386/i386-features.cc 
> > b/gcc/config/i386/i386-features.cc
> > index 1de2a07ed75..2ed27a9ebdd 100644
> > --- a/gcc/config/i386/i386-features.cc
> > +++ b/gcc/config/i386/i386-features.cc
> > @@ -998,20 +998,36 @@ general_scalar_chain::convert_op (rtx *op, rtx_insn 
> > *insn)
> >  }
> >else if (MEM_P (*op))
> >  {
> > +  rtx_insn* eh_insn, *movabs = NULL;
> >rtx tmp = gen_reg_rtx (GET_MODE (*op));
> >
> >/* Handle movabs.  */
> >if (!memory_operand (*op, GET_MODE (*op)))
> > {
> >   rtx tmp2 = gen_reg_rtx (GET_MODE (*op));
> > + movabs = emit_insn_before (gen_rtx_SET (tmp2, *op), insn);
> >
> > - emit_insn_before (gen_rtx_SET (tmp2, *op), insn);
> >   *op = tmp2;
> > }
>
> I may be missing something, but isn't the above dead code? We have
> if (MEM_P (*op)) and then if (!memory_operand (*op, ...)).
It's PR91814 #c1: memory_operand also checks that the memory address is
valid, so a MEM can still fail it.
>
> Uros.
>
> >
> > -  emit_insn_before (gen_rtx_SET (gen_rtx_SUBREG (vmode, tmp, 0),
> > -gen_gpr_to_xmm_move_src (vmode, *op)),
> > -   insn);
> > +  eh_insn
> > +   = emit_insn_before (gen_rtx_SET (gen_rtx_SUBREG (vmode, tmp, 0),
> > +gen_gpr_to_xmm_move_src (vmode, 
> > *op)),
> > +   insn);
> > +
> > +  if (cfun->can_throw_non_call_exceptions)
> > +   {
> > + /* Handle REG_EH_REGION note.  */
> > + rtx note = find_reg_note (insn, REG_EH_REGION, NULL_RTX);
> > + if (note)
> > +   {
> > + if (movabs)
> > +   eh_insn = movabs;
> > + control_flow_insns.safe_push (eh_insn);
> > + add_reg_note (eh_insn, REG_EH_REGION, XEXP (note, 0));
> > +   }
> > +   }
> > +
> >*op = gen_rtx_SUBREG (vmode, tmp, 0);
> >
> >if (dump_file)
> > @@ -2494,6 +2510,7 @@ convert_scalars_to_vector (bool timode_p)
> >  {
> >basic_block bb;
> >int converted_insns = 0;
> > > +  auto_vec<rtx_insn *> control_flow_insns;
> >
> >bitmap_obstack_initialize (NULL);
> >const machine_mode cand_mode[3] = { SImode, DImode, TImode };
> > @@ -2575,6 +2592,11 @@ convert_scalars_to_vector (bool timode_p)
> >  chain->chain_id);
> > }
> >
> > + rtx_insn* iter_insn;
> > + unsigned int ii;
> > + FOR_EACH_VEC_ELT (chain->control_flow_insns, ii, iter_insn)
> > +   control_flow_insns.safe_push (iter_insn);
> > +
> >   delete chain;
> > }
> >  }
> > @@ -2643,6 +2665,24 @@ convert_scalars_to_vector (bool timode_p)
> >   DECL_INCOMING_RTL (parm) = gen_rtx_SUBREG (TIm

Re: [PATCH] i386[stv]: Handle REG_EH_REGION note

2024-03-14 Thread Hongtao Liu
On Thu, Mar 14, 2024 at 10:46 PM Uros Bizjak  wrote:
>
> On Thu, Mar 14, 2024 at 8:42 AM Uros Bizjak  wrote:
> >
> > On Thu, Mar 14, 2024 at 8:32 AM Hongtao Liu  wrote:
> > >
> > > On Thu, Mar 14, 2024 at 3:22 PM Uros Bizjak  wrote:
> > > >
> > > > On Thu, Mar 14, 2024 at 2:33 AM liuhongt  wrote:
> > > > >
> > > > > When we split
> > > > > (insn 37 36 38 10 (set (reg:DI 104 [ _18 ])
> > > > > (mem:DI (reg/f:SI 98 [ CallNative_nclosure.0_1 ]) [6 
> > > > > MEM[(struct SQRefCounted *)CallNative_nclosure.0_1]._uiRef+0 S8 
> > > > > A32])) "test.C":22:42 84 {*movdi_internal}
> > > > > >  (expr_list:REG_EH_REGION (const_int -11 [0xfffffffffffffff5])
> > > > >
> > > > > into
> > > > >
> > > > > (insn 104 36 37 10 (set (subreg:V2DI (reg:DI 124) 0)
> > > > > (vec_concat:V2DI (mem:DI (reg/f:SI 98 [ 
> > > > > CallNative_nclosure.0_1 ]) [6 MEM[(struct SQRefCounted 
> > > > > *)CallNative_nclosure.0_1]._uiRef+0 S8 A32])
> > > > > (const_int 0 [0]))) "test.C":22:42 -1
> > > > > (nil)))
> > > > > (insn 37 104 105 10 (set (subreg:V2DI (reg:DI 104 [ _18 ]) 0)
> > > > > (subreg:V2DI (reg:DI 124) 0)) "test.C":22:42 2024 
> > > > > {movv2di_internal}
> > > > > >  (expr_list:REG_EH_REGION (const_int -11 [0xfffffffffffffff5])
> > > > > (nil)))
> > > > >
> > > > > we must copy the REG_EH_REGION note to the first insn and split the 
> > > > > block
> > > > > after the newly added insn.  The REG_EH_REGION on the second insn 
> > > > > will be
> > > > > removed later since it no longer traps.
> > > > >
> > > > > > Currently we only handle memory_operand; are there any other insns
> > > > > > that need to be handled?
> > > >
> > > > I think memory access is the only thing that can trap.
> > > >
> > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} for trunk 
> > > > > and gcc-13/gcc-12 release branch.
> > > > > Ok for trunk and backport?
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > > * config/i386/i386-features.cc
> > > > > (general_scalar_chain::convert_op): Handle REG_EH_REGION note.
> > > > > (convert_scalars_to_vector): Ditto.
> > > > > * config/i386/i386-features.h (class scalar_chain): New
> > > > > > member control_flow_insns.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > > * g++.target/i386/pr111822.C: New test.
> > > > > ---
> > > > >  gcc/config/i386/i386-features.cc | 48 
> > > > > ++--
> > > > >  gcc/config/i386/i386-features.h  |  1 +
> > > > >  gcc/testsuite/g++.target/i386/pr111822.C | 45 ++
> > > > >  3 files changed, 90 insertions(+), 4 deletions(-)
> > > > >  create mode 100644 gcc/testsuite/g++.target/i386/pr111822.C
> > > > >
> > > > > diff --git a/gcc/config/i386/i386-features.cc 
> > > > > b/gcc/config/i386/i386-features.cc
> > > > > index 1de2a07ed75..2ed27a9ebdd 100644
> > > > > --- a/gcc/config/i386/i386-features.cc
> > > > > +++ b/gcc/config/i386/i386-features.cc
> > > > > @@ -998,20 +998,36 @@ general_scalar_chain::convert_op (rtx *op, 
> > > > > rtx_insn *insn)
> > > > >  }
> > > > >else if (MEM_P (*op))
> > > > >  {
> > > > > +  rtx_insn* eh_insn, *movabs = NULL;
> > > > >rtx tmp = gen_reg_rtx (GET_MODE (*op));
> > > > >
> > > > >/* Handle movabs.  */
> > > > >if (!memory_operand (*op, GET_MODE (*op)))
> > > > > {
> > > > >   rtx tmp2 = gen_reg_rtx (GET_MODE (*op));
> > > > > + movabs = emit_insn_before (gen_rtx_SET (tmp2, *op), insn);
> > > > >
> > > > > - emit_insn_before (gen_rtx_SET (tmp2, *op), insn);
> > > > >   *op = tmp2;
> > > > > }
> > > >
> > > > I may be missing something, but isn't the above dead code? We have
> > > > if (MEM_P (*op)) and then if (!memory_operand (*op, ...)).
> > > It's PR91814 #c1: memory_operand also rejects MEMs with invalid addresses.
> >
> > Oh, it is even my comment ;)
> >
> > Perhaps the comment should be improved to something like:
> >
> > "Emit MOVABS to load from a 64-bit absolute address to a GPR."
> >
> > LGTM then.
>
> BTW: Do we need to also fix timode_scalar_chain::convert_op ? There we
> also preload operand, so a similar fix should be applied there.
Yes, I'll make another patch. Didn't realize there are 2 of them.
>
> Uros.



-- 
BR,
Hongtao


Re: [PATCH] vect: Use xor to invert oversized vector masks

2024-03-14 Thread Hongtao Liu
On Thu, Mar 14, 2024 at 11:42 PM Andrew Stubbs  wrote:
>
> Don't enable excess lanes when inverting vector bit-masks smaller than the
> integer mode.  This is yet another case of wrong-code due to mishandling
> of oversized bitmasks.
>
> This issue shows up in vect/tsvc/vect-tsvc-s278.c and
> vect/tsvc/vect-tsvc-s279.c if I set the preferred vector size to V32
> (down from V64) on amdgcn.
>
> OK for mainline?
>
> Andrew
>
> gcc/ChangeLog:
>
> * expr.cc (expand_expr_real_2): Use xor to invert vector masks.
> ---
>  gcc/expr.cc | 11 +++
>  1 file changed, 11 insertions(+)
>
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index 403eeaa108e4..3540327d879e 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -10497,6 +10497,17 @@ expand_expr_real_2 (sepops ops, rtx target, 
> machine_mode tmode,
>immed_wide_int_const (mask, int_mode),
>target, 1, OPTAB_LIB_WIDEN);
> }
> +  /* If it's a vector mask don't enable excess bits.  */
> +  else if (VECTOR_BOOLEAN_TYPE_P (type)
> +  && SCALAR_INT_MODE_P (mode)
> +  && maybe_ne (GET_MODE_PRECISION (mode),
> +   TYPE_VECTOR_SUBPARTS (type).to_constant ()))
> +   {
> + auto nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> + temp = expand_binop (mode, xor_optab, op0,
> +  GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1),
> +  target, true, OPTAB_WIDEN);
> +   }
Not a review, just curious: shouldn't this issue already be fixed by the
commit for PR113576? I also wonder whether excess lane bits matter anywhere
besides cbranch.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576#c35
>else
> temp = expand_unop (mode, one_cmpl_optab, op0, target, 1);
>gcc_assert (temp);
> --
> 2.41.0
>
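To make the point about excess lanes concrete, here is a minimal sketch in
plain C, not GCC internals. It assumes the mask lives in a 64-bit integer and
that nunits is smaller than 64; the function names are made up for
illustration. A one's complement flips the bits above the lanes as well,
while the xor with (1 << nunits) - 1 used in the patch leaves them alone.

```c
#include <assert.h>
#include <stdint.h>

/* Invert an nunits-lane mask held in a 64-bit integer, touching only the
   low nunits bits.  Requires 0 < nunits < 64 so the shift is defined.  */
static uint64_t invert_mask_xor(uint64_t mask, unsigned nunits)
{
    assert(nunits > 0 && nunits < 64);
    return mask ^ ((UINT64_C(1) << nunits) - 1);
}

/* Plain one's complement: also sets every bit above the lanes.  */
static uint64_t invert_mask_not(uint64_t mask)
{
    return ~mask;
}
```

For a 4-lane mask 0x5, the xor form gives 0xA, while the one's complement
gives 0xFFFFFFFFFFFFFFFA, so any consumer that tests the whole integer sees
lanes that do not exist.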


-- 
BR,
Hongtao


Re: [PATCH] i386 [stv]: Handle REG_EH_REGION note [pr111822].

2024-03-18 Thread Hongtao Liu
On Mon, Mar 18, 2024 at 6:59 PM Uros Bizjak  wrote:
>
> On Mon, Mar 18, 2024 at 11:52 AM liuhongt  wrote:
> >
> > Commit r14-9459-g618e34d56cc38e only handles
> > general_scalar_chain::convert_op. This patch also handles
> > timode_scalar_chain::convert_op to avoid a potential similar bug.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ok for trunk and backport to releases/gcc-13 branch?
>
> I have the following patch in testing that merges
> {general,timode}_scalar_chain::convert_op, so in addition to less code
> duplication, it will fix the issue for both chains. WDYT?
It would be better for maintenance, I prefer your patch.
>
> Uros.
>
> >
> > gcc/ChangeLog:
> >
> > PR target/111822
> > * config/i386/i386-features.cc
> > (timode_scalar_chain::convert_op): Handle REG_EH_REGION note.
> > ---
> >  gcc/config/i386/i386-features.cc | 20 +---
> >  1 file changed, 17 insertions(+), 3 deletions(-)
> >
> > diff --git a/gcc/config/i386/i386-features.cc 
> > b/gcc/config/i386/i386-features.cc
> > index c7d7a965901..38f57d96df5 100644
> > --- a/gcc/config/i386/i386-features.cc
> > +++ b/gcc/config/i386/i386-features.cc
> > @@ -1794,12 +1794,26 @@ timode_scalar_chain::convert_op (rtx *op, rtx_insn 
> > *insn)
> >  *op = gen_rtx_SUBREG (V1TImode, *op, 0);
> >else if (MEM_P (*op))
> >  {
> > +  rtx_insn* eh_insn;
> >rtx tmp = gen_reg_rtx (V1TImode);
> > -  emit_insn_before (gen_rtx_SET (tmp,
> > -gen_gpr_to_xmm_move_src (V1TImode, 
> > *op)),
> > -   insn);
> > +  eh_insn
> > +   = emit_insn_before (gen_rtx_SET (tmp,
> > +gen_gpr_to_xmm_move_src (V1TImode,
> > + *op)),
> > +   insn);
> >*op = tmp;
> >
> > +  if (cfun->can_throw_non_call_exceptions)
> > +   {
> > + /* Handle REG_EH_REGION note.  */
> > + rtx note = find_reg_note (insn, REG_EH_REGION, NULL_RTX);
> > + if (note)
> > +   {
> > + control_flow_insns.safe_push (eh_insn);
> > + add_reg_note (eh_insn, REG_EH_REGION, XEXP (note, 0));
> > +   }
> > +   }
> > +
> >if (dump_file)
> > fprintf (dump_file, "  Preloading operand for insn %d into r%d\n",
> >  INSN_UID (insn), REGNO (tmp));
> > --
> > 2.31.1
> >



-- 
BR,
Hongtao


Re: [PATCH] Document -fexcess-precision=16.

2024-03-18 Thread Hongtao Liu
On Tue, Mar 19, 2024 at 12:16 AM Joseph Myers  wrote:
>
> On Mon, 18 Mar 2024, liuhongt wrote:
>
> > +If @option{-fexcess-precision=16} is specified, casts and assignments of
> > +@code{_Float16} and @code{bfloat16_t} cause value to be rounded to their
> > +semantic types if they're supported by the target.
>
> Isn't that option about rounding results of all operations, whether or not
> a cast or assignment is involved?  That's certainly what the brief mention
> of this option in extend.texi says, and fits the intent that
> -fexcess-precision=16 corresponds to FLT_EVAL_METHOD == 16.
Yes, how about this?


+If @option{-fexcess-precision=16} is specified, each operation of
+@code{_Float16} and @code{bfloat16_t} causes value to be rounded to their
+semantic types if they're supported by the target.
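
To illustrate the difference between rounding after every operation and
rounding only at casts and assignments, here is a toy model in portable C. It
does not use _Float16 at all: it assumes small non-negative integer values and
models only the 11-bit significand of _Float16 with round-to-nearest,
ties-to-even. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round a non-negative integer to 11 significant bits (the _Float16
   significand width), round-to-nearest, ties-to-even.  Toy model only:
   exponent range and subnormals are not modelled.  */
static uint32_t round_to_11_bits(uint32_t v)
{
    assert(v <= 65504u);            /* stay within the finite _Float16 range */
    int shift = 0;
    while ((v >> shift) >= (1u << 11))
        shift++;
    if (shift == 0)
        return v;                   /* already fits in 11 bits */
    uint32_t kept = v >> shift;
    uint32_t rem = v & ((1u << shift) - 1);
    uint32_t half = 1u << (shift - 1);
    if (rem > half || (rem == half && (kept & 1)))
        kept++;
    return kept << shift;
}

/* Models per-operation rounding: every intermediate result is rounded.  */
static uint32_t sum_round_each_op(uint32_t a, uint32_t b, uint32_t c)
{
    return round_to_11_bits(round_to_11_bits(a + b) + c);
}

/* Models wider intermediates: only the final result is rounded.  */
static uint32_t sum_round_once(uint32_t a, uint32_t b, uint32_t c)
{
    return round_to_11_bits(a + b + c);
}
```

With a = 2048, b = 1, c = 1, rounding each operation yields 2048 (2049 is not
representable and ties to even), while rounding once yields 2050.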

>
> --
> Joseph S. Myers
> josmy...@redhat.com
>


-- 
BR,
Hongtao


Re: [PATCH] sanitizer: [PR110027] Align asan_vec[0] to MAX (alignb, ASAN_RED_ZONE_SIZE)

2024-03-25 Thread Hongtao Liu
On Mon, Mar 25, 2024 at 8:51 PM Jakub Jelinek  wrote:
>
> On Tue, Mar 12, 2024 at 07:57:59PM +0800, liuhongt wrote:
> > If alignb > ASAN_RED_ZONE_SIZE and offset[0] is not a multiple of
> > alignb, (base_align_bias - base_offset) may not be aligned to alignb,
> > which causes a segmentation fault.
> >
> > Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
> > Ok for trunk and backport to GCC13?
> >
> > gcc/ChangeLog:
> >
> >   PR sanitizer/110027
> >   * cfgexpand.cc (expand_stack_vars): Align frame offset to
> >   MAX (alignb, ASAN_RED_ZONE_SIZE).
> >
> > gcc/testsuite/ChangeLog:
> >
> >   * g++.dg/asan/pr110027.C: New test.
> > ---
> >  gcc/cfgexpand.cc |  2 +-
> >  gcc/testsuite/g++.dg/asan/pr110027.C | 20 
> >  2 files changed, 21 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/testsuite/g++.dg/asan/pr110027.C
> >
> > diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
> > index 0de299c62e3..92062378d8e 100644
> > --- a/gcc/cfgexpand.cc
> > +++ b/gcc/cfgexpand.cc
> > @@ -1214,7 +1214,7 @@ expand_stack_vars (bool (*pred) (size_t), class 
> > stack_vars_data *data)
> >   {
> > if (data->asan_vec.is_empty ())
> >   {
> > -   align_frame_offset (ASAN_RED_ZONE_SIZE);
> > +   align_frame_offset (MAX (alignb, ASAN_RED_ZONE_SIZE));
> > prev_offset = frame_offset.to_constant ();
> >   }
> > prev_offset = align_base (prev_offset,
>
> This doesn't look correct to me.
> The above is done just once for the first var partition.  And
> var partitions are sorted by stack_var_cmp, which puts > 
> MAX_SUPPORTED_STACK_ALIGNMENT
> alignment vars first (that should be none on x86, the above is quite huge
> alignment), then on size decreasing and only after that on alignment
> decreasing.
>
> So, try to add some other variable with larger size and smaller alignment
> to the frame (and make sure it isn't optimized away).
>
> alignb above is the alignment of the first partition's var, if
> align_frame_offset really needs to depend on the var alignment, it probably
> should be the maximum alignment of all the vars with alignment
> alignb * BITS_PER_UNIT <= MAX_SUPPORTED_STACK_ALIGNMENT

In asan_emit_stack_protection, when it allocates the fake stack, it assumes
the bottom of the stack is also aligned to alignb. The place that violates
this is the first var partition, which is only aligned to a 32-byte offset;
it should be aligned to MAX_SUPPORTED_STACK_ALIGNMENT / BITS_PER_UNIT.
So I think we need to use MAX (MAX_SUPPORTED_STACK_ALIGNMENT /
BITS_PER_UNIT, ASAN_RED_ZONE_SIZE) for the first var partition.
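
A small arithmetic sketch of why aligning only to the red zone size is not
enough. It is plain C, not cfgexpand.cc: it assumes positive, upward-growing
offsets (the real frame offsets may grow downward), takes 32 as a stand-in
for ASAN_RED_ZONE_SIZE, and uses made-up helper names. Which alignment value
should be passed in is exactly what is still being discussed here.

```c
#include <assert.h>
#include <stdint.h>

#define RED_ZONE_SIZE 32u   /* stand-in for ASAN_RED_ZONE_SIZE (assumed) */

/* Round offset up to the next multiple of align (a power of two).  */
static uint64_t align_up(uint64_t offset, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/* Start of the first partition when only the red zone is considered.  */
static uint64_t first_offset_redzone_only(uint64_t offset)
{
    return align_up(offset, RED_ZONE_SIZE);
}

/* Start of the first partition when a larger required alignment wins.  */
static uint64_t first_offset_with_align(uint64_t offset, uint64_t required)
{
    uint64_t align = required > RED_ZONE_SIZE ? required : RED_ZONE_SIZE;
    return align_up(offset, align);
}
```

Starting from offset 72, the red-zone-only variant lands on 96, which is not
a multiple of 64, so a 64-byte-aligned vector placed relative to it can end
up misaligned. With a required alignment of 64 the start becomes 128.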

>
> > diff --git a/gcc/testsuite/g++.dg/asan/pr110027.C 
> > b/gcc/testsuite/g++.dg/asan/pr110027.C
> > new file mode 100644
> > index 000..0067781bc89
> > --- /dev/null
> > +++ b/gcc/testsuite/g++.dg/asan/pr110027.C
> > @@ -0,0 +1,20 @@
> > +/* PR sanitizer/110027 */
> > +/* { dg-do run } */
> > +/* { dg-require-effective-target avx512f_runtime } */
> > +/* { dg-options "-std=gnu++23 -mavx512f -fsanitize=address -O0 -g 
> > -fstack-protector-strong" } */
> > +
> > +#include 
> > +#include 
> > +
> > +template 
> > +using Vec [[gnu::vector_size(W * sizeof(T))]] = T;
> > +
> > +auto foo() {
> > +  Vec<8, int64_t> ret{};
> > +  return ret;
> > +}
> > +
> > +int main() {
> > +  foo();
> > +  return 0;
> > +}
> > --
> > 2.31.1
>
> Jakub
>


-- 
BR,
Hongtao


Re: [PATCH] sanitizer: [PR110027] Align asan_vec[0] to MAX (alignb, ASAN_RED_ZONE_SIZE)

2024-03-25 Thread Hongtao Liu
On Tue, Mar 26, 2024 at 11:26 AM Hongtao Liu  wrote:
>
> On Mon, Mar 25, 2024 at 8:51 PM Jakub Jelinek  wrote:
> >
> > On Tue, Mar 12, 2024 at 07:57:59PM +0800, liuhongt wrote:
> > > > If alignb > ASAN_RED_ZONE_SIZE and offset[0] is not a multiple of
> > > > alignb, (base_align_bias - base_offset) may not be aligned to alignb,
> > > > which causes a segmentation fault.
> > >
> > > Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
> > > Ok for trunk and backport to GCC13?
> > >
> > > gcc/ChangeLog:
> > >
> > >   PR sanitizer/110027
> > >   * cfgexpand.cc (expand_stack_vars): Align frame offset to
> > >   MAX (alignb, ASAN_RED_ZONE_SIZE).
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >   * g++.dg/asan/pr110027.C: New test.
> > > ---
> > >  gcc/cfgexpand.cc |  2 +-
> > >  gcc/testsuite/g++.dg/asan/pr110027.C | 20 
> > >  2 files changed, 21 insertions(+), 1 deletion(-)
> > >  create mode 100644 gcc/testsuite/g++.dg/asan/pr110027.C
> > >
> > > diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
> > > index 0de299c62e3..92062378d8e 100644
> > > --- a/gcc/cfgexpand.cc
> > > +++ b/gcc/cfgexpand.cc
> > > @@ -1214,7 +1214,7 @@ expand_stack_vars (bool (*pred) (size_t), class 
> > > stack_vars_data *data)
> > >   {
> > > if (data->asan_vec.is_empty ())
> > >   {
> > > -   align_frame_offset (ASAN_RED_ZONE_SIZE);
> > > +   align_frame_offset (MAX (alignb, ASAN_RED_ZONE_SIZE));
> > > prev_offset = frame_offset.to_constant ();
> > >   }
> > > prev_offset = align_base (prev_offset,
> >
> > This doesn't look correct to me.
> > The above is done just once for the first var partition.  And
> > var partitions are sorted by stack_var_cmp, which puts > 
> > MAX_SUPPORTED_STACK_ALIGNMENT
> > alignment vars first (that should be none on x86, the above is quite huge
> > alignment), then on size decreasing and only after that on alignment
> > decreasing.
> >
> > So, try to add some other variable with larger size and smaller alignment
> > to the frame (and make sure it isn't optimized away).
> >
> > alignb above is the alignment of the first partition's var, if
> > align_frame_offset really needs to depend on the var alignment, it probably
> > should be the maximum alignment of all the vars with alignment
> > alignb * BITS_PER_UNIT <= MAX_SUPPORTED_STACK_ALIGNMENT
>
> In asan_emit_stack_protection, when it allocates the fake stack, it assumes
> the bottom of the stack is also aligned to alignb. The place that violates
> this is the first var partition, which is only aligned to a 32-byte offset;
> it should be aligned to MAX_SUPPORTED_STACK_ALIGNMENT / BITS_PER_UNIT.
> So I think we need to use MAX (MAX_SUPPORTED_STACK_ALIGNMENT /
> BITS_PER_UNIT, ASAN_RED_ZONE_SIZE) for the first var partition.
It should be MAX (BIGGEST_ALIGNMENT / BITS_PER_UNIT, ASAN_RED_ZONE_SIZE).
MAX_SUPPORTED_STACK_ALIGNMENT is huge.
>
> >
> > > diff --git a/gcc/testsuite/g++.dg/asan/pr110027.C 
> > > b/gcc/testsuite/g++.dg/asan/pr110027.C
> > > new file mode 100644
> > > index 000..0067781bc89
> > > --- /dev/null
> > > +++ b/gcc/testsuite/g++.dg/asan/pr110027.C
> > > @@ -0,0 +1,20 @@
> > > +/* PR sanitizer/110027 */
> > > +/* { dg-do run } */
> > > +/* { dg-require-effective-target avx512f_runtime } */
> > > +/* { dg-options "-std=gnu++23 -mavx512f -fsanitize=address -O0 -g 
> > > -fstack-protector-strong" } */
> > > +
> > > +#include 
> > > +#include 
> > > +
> > > +template 
> > > +using Vec [[gnu::vector_size(W * sizeof(T))]] = T;
> > > +
> > > +auto foo() {
> > > +  Vec<8, int64_t> ret{};
> > > +  return ret;
> > > +}
> > > +
> > > +int main() {
> > > +  foo();
> > > +  return 0;
> > > +}
> > > --
> > > 2.31.1
> >
> > Jakub
> >
>
>
> --
> BR,
> Hongtao



-- 
BR,
Hongtao


Re: [PATCH] x86: Define macros for APX options

2024-04-08 Thread Hongtao Liu
On Mon, Apr 8, 2024 at 11:44 PM H.J. Lu  wrote:
>
> Define following macros for APX options:
>
> 1. __APX_EGPR__: -mapx-features=egpr.
> 2. __APX_PUSH2POP2__: -mapx-features=push2pop2.
> 3. __APX_NDD__: -mapx-features=ndd.
> 4. __APX_PPX__: -mapx-features=ppx.
We haven't decided to expose -mapx-features= to users yet; we want users
to just use -mapxf, so I think __APX_F__ should be enough?
> 5. __APX_INLINE_ASM_USE_GPR32__: -mapx-inline-asm-use-gpr32.
I'm OK with this one.
>
> They can be used to make assembly codes compatible with APX options.
> Some use cases are:
>
> 1. When __APX_PUSH2POP2__ is defined, assembly codes should always align
> the outgoing stack to 16 bytes.
> 2. When __APX_INLINE_ASM_USE_GPR32__ is defined, inline asm statements
> should contain only instructions compatible with r16-r31.
>
> gcc/
>
> PR target/114587
> * config/i386/i386-c.cc (ix86_target_macros_internal): Define
> __APX_XXX__ for APX options.
>
> gcc/testsuite/
>
> PR target/114587
> * gcc.target/i386/apx-3a.c: New test.
> * gcc.target/i386/apx-3b.c: Likewise.
> * gcc.target/i386/apx-3c.c: Likewise.
> * gcc.target/i386/apx-3d.c: Likewise.
> * gcc.target/i386/apx-3e.c: Likewise.
> * gcc.target/i386/apx-4.c: Likewise.
> ---
>  gcc/config/i386/i386-c.cc  | 10 ++
>  gcc/testsuite/gcc.target/i386/apx-3a.c |  6 ++
>  gcc/testsuite/gcc.target/i386/apx-3b.c |  6 ++
>  gcc/testsuite/gcc.target/i386/apx-3c.c |  6 ++
>  gcc/testsuite/gcc.target/i386/apx-3d.c |  6 ++
>  gcc/testsuite/gcc.target/i386/apx-3e.c | 18 ++
>  gcc/testsuite/gcc.target/i386/apx-4.c  |  6 ++
>  7 files changed, 58 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-3a.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-3b.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-3c.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-3d.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-3e.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-4.c
>
> diff --git a/gcc/config/i386/i386-c.cc b/gcc/config/i386/i386-c.cc
> index 226d277676c..b8cfba90fdc 100644
> --- a/gcc/config/i386/i386-c.cc
> +++ b/gcc/config/i386/i386-c.cc
> @@ -751,6 +751,16 @@ ix86_target_macros_internal (HOST_WIDE_INT isa_flag,
>  def_or_undef (parse_in, "__AVX10_1_512__");
>if (isa_flag2 & OPTION_MASK_ISA2_APX_F)
>  def_or_undef (parse_in, "__APX_F__");
> +  if (TARGET_APX_EGPR)
> +def_or_undef (parse_in, "__APX_EGPR__");
> +  if (TARGET_APX_PUSH2POP2)
> +def_or_undef (parse_in, "__APX_PUSH2POP2__");
> +  if (TARGET_APX_NDD)
> +def_or_undef (parse_in, "__APX_NDD__");
> +  if (TARGET_APX_PPX)
> +def_or_undef (parse_in, "__APX_PPX__");
> +  if (ix86_apx_inline_asm_use_gpr32)
> +def_or_undef (parse_in, "__APX_INLINE_ASM_USE_GPR32__");
>if (TARGET_IAMCU)
>  {
>def_or_undef (parse_in, "__iamcu");
> diff --git a/gcc/testsuite/gcc.target/i386/apx-3a.c 
> b/gcc/testsuite/gcc.target/i386/apx-3a.c
> new file mode 100644
> index 000..86d3ef2061d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/apx-3a.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-mapx-features=egpr" } */
> +
> +#ifndef __APX_EGPR__
> +# error __APX_EGPR__ not defined
> +#endif
> diff --git a/gcc/testsuite/gcc.target/i386/apx-3b.c 
> b/gcc/testsuite/gcc.target/i386/apx-3b.c
> new file mode 100644
> index 000..611727a389a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/apx-3b.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-mapx-features=push2pop2" } */
> +
> +#ifndef __APX_PUSH2POP2__
> +# error __APX_PUSH2POP2__ not defined
> +#endif
> diff --git a/gcc/testsuite/gcc.target/i386/apx-3c.c 
> b/gcc/testsuite/gcc.target/i386/apx-3c.c
> new file mode 100644
> index 000..52655b6cfa5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/apx-3c.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-mapx-features=ndd" } */
> +
> +#ifndef __APX_NDD__
> +# error __APX_NDD__ not defined
> +#endif
> diff --git a/gcc/testsuite/gcc.target/i386/apx-3d.c 
> b/gcc/testsuite/gcc.target/i386/apx-3d.c
> new file mode 100644
> index 000..9b91af1d377
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/apx-3d.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-mapx-features=ppx" } */
> +
> +#ifndef __APX_PPX__
> +# error __APX_PPX__ not defined
> +#endif
> diff --git a/gcc/testsuite/gcc.target/i386/apx-3e.c 
> b/gcc/testsuite/gcc.target/i386/apx-3e.c
> new file mode 100644
> index 000..7278428e5c4
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/apx-3e.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-mapx-features=egpr,push2pop2,ndd,ppx" } */
> +
> +#ifndef __APX_EGPR__

Re: [PATCH v2] x86: Define __APX_INLINE_ASM_USE_GPR32__

2024-04-08 Thread Hongtao Liu
On Tue, Apr 9, 2024 at 9:58 AM H.J. Lu  wrote:
>
> Define __APX_INLINE_ASM_USE_GPR32__ for -mapx-inline-asm-use-gpr32.
> When __APX_INLINE_ASM_USE_GPR32__ is defined, inline asm statements
> should contain only instructions compatible with r16-r31.
Ok.
>
> gcc/
>
> PR target/114587
> * config/i386/i386-c.cc (ix86_target_macros_internal): Define
> __APX_INLINE_ASM_USE_GPR32__ for -mapx-inline-asm-use-gpr32.
>
> gcc/testsuite/
>
> PR target/114587
> * gcc.target/i386/apx-3.c: New test.
> ---
>  gcc/config/i386/i386-c.cc | 2 ++
>  gcc/testsuite/gcc.target/i386/apx-3.c | 6 ++
>  2 files changed, 8 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-3.c
>
> diff --git a/gcc/config/i386/i386-c.cc b/gcc/config/i386/i386-c.cc
> index 226d277676c..07f4936ba91 100644
> --- a/gcc/config/i386/i386-c.cc
> +++ b/gcc/config/i386/i386-c.cc
> @@ -751,6 +751,8 @@ ix86_target_macros_internal (HOST_WIDE_INT isa_flag,
>  def_or_undef (parse_in, "__AVX10_1_512__");
>if (isa_flag2 & OPTION_MASK_ISA2_APX_F)
>  def_or_undef (parse_in, "__APX_F__");
> +  if (ix86_apx_inline_asm_use_gpr32)
> +def_or_undef (parse_in, "__APX_INLINE_ASM_USE_GPR32__");
>if (TARGET_IAMCU)
>  {
>def_or_undef (parse_in, "__iamcu");
> diff --git a/gcc/testsuite/gcc.target/i386/apx-3.c 
> b/gcc/testsuite/gcc.target/i386/apx-3.c
> new file mode 100644
> index 000..1ba4ac036fc
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/apx-3.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-mapx-inline-asm-use-gpr32" } */
> +
> +#ifndef __APX_INLINE_ASM_USE_GPR32__
> +# error __APX_INLINE_ASM_USE_GPR32__ not defined
> +#endif
> --
> 2.44.0
>
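For reference, a minimal sketch of how user code might consume the macro. It
only relies on __APX_F__ and __APX_INLINE_ASM_USE_GPR32__ from the patch; the
helper names are hypothetical, and the 15/31 split is my assumption that
r16-r31 can only reach inline asm on x86-64 when both APX_F and
-mapx-inline-asm-use-gpr32 are in effect.

```c
#include <assert.h>

/* 1 if the translation unit was compiled with APX_F enabled.  */
static int apx_f_enabled(void)
{
#ifdef __APX_F__
    return 1;
#else
    return 0;
#endif
}

/* 1 if -mapx-inline-asm-use-gpr32 was given.  */
static int inline_asm_gpr32_requested(void)
{
#ifdef __APX_INLINE_ASM_USE_GPR32__
    return 1;
#else
    return 0;
#endif
}

/* Highest GPR number an inline asm operand may be assigned on x86-64
   (assumption: r16-r31 need both APX_F and the inline-asm option).  */
static int highest_gpr_in_inline_asm(void)
{
    int n = (apx_f_enabled() && inline_asm_gpr32_requested()) ? 31 : 15;
    assert(n == 15 || n == 31);
    return n;
}
```

Code whose asm templates contain instructions that cannot encode r16-r31 can
check the macro and fail the build, or pick a different template.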


-- 
BR,
Hongtao


Re: [PATCH] i386: Fix aes/vaes patterns [PR114576]

2024-04-08 Thread Hongtao Liu
On Thu, Apr 4, 2024 at 4:42 PM Jakub Jelinek  wrote:
>
> On Wed, Apr 19, 2023 at 02:40:59AM +, Jiang, Haochen via Gcc-patches 
> wrote:
> > > >  (define_insn "aesenc"
> > > > -  [(set (match_operand:V2DI 0 "register_operand" "=x,x")
> > > > -   (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "0,x")
> > > > -  (match_operand:V2DI 2 "vector_operand" "xBm,xm")]
> > > > +  [(set (match_operand:V2DI 0 "register_operand" "=x,x,v")
> > > > +   (unspec:V2DI [(match_operand:V2DI 1 "register_operand" "0,x,v")
> > > > +  (match_operand:V2DI 2 "vector_operand"
> > > > + "xBm,xm,vm")]
> > > >   UNSPEC_AESENC))]
> > > > -  "TARGET_AES"
> > > > +  "TARGET_AES || (TARGET_VAES && TARGET_AVX512VL)"
> > > >"@
> > > > aesenc\t{%2, %0|%0, %2}
> > > > +   vaesenc\t{%2, %1, %0|%0, %1, %2}
> > > > vaesenc\t{%2, %1, %0|%0, %1, %2}"
> > > > -  [(set_attr "isa" "noavx,avx")
> > > > +  [(set_attr "isa" "noavx,aes,avx512vl")
> > > Shouldn't it be vaes_avx512vl and then remove " || (TARGET_VAES &&
> > > TARGET_AVX512VL)" from condition.
> >
> > Since VAES should not imply AES, we need that "|| (TARGET_VAES &&
> > TARGET_AVX512VL)"
> >
> > And there is no need to add vaes_avx512vl since the last alternative will 
> > only
> > be hit when there is no aes. When there is no aes, the pattern will need 
> > vaes
> > and avx512vl both or we could not use this pattern. avx512vl here is just 
> > like
> > a placeholder.
>
> As the following testcase shows, the above change was incorrect.
>
> Using aes isa for the second alternative is obviously wrong, aes is enabled
> whenever -maes is, regardless of -mavx or -mno-avx, so the above change
> means that for -maes -mno-avx RA can choose, either it matches the first
> alternative with the dup operand, or it matches the second one (but that
> is of course wrong because vaesenc VEX encoded insn needs AES & AVX CPUID).
>
> The big question is if "Since VAES should not imply AES" is the case or not.
> Looking around at what LLVM does on godbolt, seems since clang 6 which added
> -mvaes support -mvaes there implies -maes, but GCC treats those two
> independent.
>
> Now, if we'd take the LLVM path of making -mvaes imply -maes and -mno-aes
> imply -mno-vaes, then we should probably just revert the above patch and
> tweak common/config/i386/ to do the implications (+ add the testcase from
> this patch).
>
> If we keep the current behavior, where AES and VAES are completely
> independent extensions, then we need to do more changes as the following
> patch attempts to do.
> We should use the aesenc etc. insns for noavx as before, we know at that
> point that TARGET_AES must be true because (TARGET_VAES && TARGET_AVX512VL)
> won't be true when !TARGET_AVX - TARGET_AVX512VL implies TARGET_AVX.
> For the second alternative, i.e. the AVX AES VEX encoded case, the patch
> uses aes_avx isa which requires both.  Now, for the third one we can't
> use avx512vl isa attribute, because one could compile with
> -maes -mavx512vl -mno-vaes and in that case we want VEX encoded vaesenc
> which can't use %xmm16+ (nor EGPRs), so we need vaes_avx512vl isa to
> ensure it is enabled only for -mvaes -mavx512vl.  And there is another
> problem, with -mno-aes -mvaes -mavx512vl we could emit VEX encoded vaesenc
> which requires AES and AVX ISAs rather than the VAES and AVX512VL which
> are enabled.  So the patch uses the {evex} prefix for those cases.
> And similarly for the vaes*_ instructions, if they aren't 128-bit
> or use %xmm16+ registers, the current case is fine, but if they are 128-bit
> and use only %xmm0-15 registers, assembler would again emit VEX encoded insn
> which needs AES & AVX CPUID, rather than the EVEX encoded ones which need
> VAES & AVX512VL CPUIDs.
> Still, I wonder if -mvaes shouldn't imply at least -mavx512f and
> -mno-avx512f shouldn't imply -mno-vaes, because otherwise can't see how
> it could use 512-bit registers (this part not done in the patch).
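
To keep the cases above straight, here is a toy decision table in C for the
128-bit aesenc forms. It is only a model of the CPUID requirements described
above (legacy needs AES, VEX needs AES and AVX, EVEX needs VAES and AVX512VL)
and not the actual sse.md logic; the struct, enum and function names are made
up, and preferring VEX over EVEX when both are legal is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

enum aes_encoding { ENC_NONE, ENC_LEGACY_SSE, ENC_VEX, ENC_EVEX };

struct isa_flags { bool aes, avx, vaes, avx512vl; };

/* Pick an encoding for a 128-bit aesenc that only uses %xmm0-15.  */
static enum aes_encoding pick_xmm_aesenc_encoding(struct isa_flags f)
{
    assert(!f.avx512vl || f.avx);   /* -mavx512vl implies -mavx */
    if (f.aes && f.avx)
        return ENC_VEX;             /* VEX vaesenc: AES and AVX CPUID */
    if (f.vaes && f.avx512vl)
        return ENC_EVEX;            /* {evex} vaesenc: VAES and AVX512VL */
    if (f.aes)
        return ENC_LEGACY_SSE;      /* aesenc: AES only, no AVX */
    return ENC_NONE;
}
```

The interesting rows are -maes -mno-avx, which must stay on the legacy form,
and -mno-aes -mvaes -mavx512vl, which must not fall back to the VEX form.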
>
> The following patch has been successfully bootstrapped/regtested on
> x86_64-linux and i686-linux.
>
> 2024-04-04  Jakub Jelinek  
>
> PR target/114576
> * config/i386/i386.md (isa): Remove aes, add aes_avx, vaes_avx512vl.
> (enabled): Remove aes isa check, add aes_avx and vaes_avx512vl.
> * config/i386/sse.md (aesenc, aesenclast, aesdec, aesdeclast): Add
> 4th alternative, emit {evex} prefix for the third one, use
> noavx,aes_avx,vaes_avx512vl,vaes_avx512vl isa attribute, use jm
> rather than m constraint on the 2nd and 3rd alternative input.
> (vaesdec_, vaesdeclast_, vaesenc_,
> vaesenclast_): Add second alternative with x instead of v
> and jm instead of m.
>
> * gcc.target/i386/aes-pr114576.c: New test.
>
> --- gcc/config/i386/i386.md.jj  2024-03-18 22:15:43.165839479 +0100
> +++ gcc/config/i386/i386.md 2024-04-04 00:48:46.575511556 +0200
> @@ -568,13 +568,14 @@ (define_at

Re: [PATCH] i386, v2: Fix aes/vaes patterns [PR114576]

2024-04-09 Thread Hongtao Liu
On Tue, Apr 9, 2024 at 5:18 PM Jakub Jelinek  wrote:
>
> On Tue, Apr 09, 2024 at 11:23:40AM +0800, Hongtao Liu wrote:
> > I think we can merge alternative 2 with 3 to
> > *  return TARGET_AES ? \"vaesenc\t{%2, %1, %0|%0, %1, %2}\" :
> > \"%{evex%} vaesenc\t{%2, %1, %0|%0, %1, %2}\";
> > Then it can handle vaes_avx512vl + -mno-aes case.
>
> Ok, done in the patch below.
>
> > > @@ -30246,44 +30250,60 @@ (define_insn "vpdpwssds__maskz_1"
> > > [(set_attr ("prefix") ("evex"))])
> > >
> > >  (define_insn "vaesdec_"
> > > -  [(set (match_operand:VI1_AVX512VL_F 0 "register_operand" "=v")
> > > +  [(set (match_operand:VI1_AVX512VL_F 0 "register_operand" "=x,v")
> > > (unspec:VI1_AVX512VL_F
> > > - [(match_operand:VI1_AVX512VL_F 1 "register_operand" "v")
> > > -  (match_operand:VI1_AVX512VL_F 2 "vector_operand" "vm")]
> > > + [(match_operand:VI1_AVX512VL_F 1 "register_operand" "x,v")
> > > +  (match_operand:VI1_AVX512VL_F 2 "vector_operand" "xjm,vm")]
> > >   UNSPEC_VAESDEC))]
> > >"TARGET_VAES"
> > > -  "vaesdec\t{%2, %1, %0|%0, %1, %2}"
> > > -)
> > > +{
> > > +  if (which_alternative == 0 && mode == V16QImode)
> > > +return "%{evex%} vaesdec\t{%2, %1, %0|%0, %1, %2}";
> > Similar, but something like
> > *  return TARGET_AES || mode != V16QImode ? \"vaesenc\t{%2, %1,
> > %0|%0, %1, %2}\" : \"%{evex%} vaesenc\t{%2, %1, %0|%0, %1, %2}\";
>
> For a single alternative, it would need to be
> {
>   return x86_evex_reg_mentioned_p (operands, 3)
>  ? \"vaesenc\t{%2, %1, %0|%0, %1, %2}\"
>  : \"%{evex%} vaesenc\t{%2, %1, %0|%0, %1, %2}\";
> }
> (* return would just mean uselessly too long line).
> Is that what you want instead?  I thought the 2 separate alternatives
> where only the latter covers those cases is more readable...
>
> The following patch just changes the aes* patterns, not the vaes* ones.
Patch LGTM.
>
> 2024-04-09  Jakub Jelinek  
>
> PR target/114576
> * config/i386/i386.md (isa): Remove aes, add vaes_avx512vl.
> (enabled): Remove aes isa check, add vaes_avx512vl.
> * config/i386/sse.md (aesenc, aesenclast, aesdec, aesdeclast): Use
> jm instead of m for second alternative and emit {evex} prefix
> for it if !TARGET_AES.  Use noavx,avx,vaes_avx512vl isa attribute.
> (vaesdec_, vaesdeclast_, vaesenc_,
> vaesenclast_): Add second alternative with x instead of v
> and jm instead of m.
>
> * gcc.target/i386/aes-pr114576.c: New test.
>
> --- gcc/config/i386/i386.md.jj  2024-04-09 08:12:29.259451422 +0200
> +++ gcc/config/i386/i386.md 2024-04-09 10:53:24.965516804 +0200
> @@ -568,13 +568,14 @@ (define_attr "unit" "integer,i387,sse,mm
>
>  ;; Used to control the "enabled" attribute on a per-instruction basis.
>  (define_attr "isa" "base,x64,nox64,x64_sse2,x64_sse4,x64_sse4_noavx,
> -   x64_avx,x64_avx512bw,x64_avx512dq,aes,apx_ndd,
> +   x64_avx,x64_avx512bw,x64_avx512dq,apx_ndd,
> sse_noavx,sse2,sse2_noavx,sse3,sse3_noavx,sse4,sse4_noavx,
> 
> avx,noavx,avx2,noavx2,bmi,bmi2,fma4,fma,avx512f,avx512f_512,
> noavx512f,avx512bw,avx512bw_512,noavx512bw,avx512dq,
> noavx512dq,fma_or_avx512vl,avx512vl,noavx512vl,avxvnni,
> avx512vnnivl,avx512fp16,avxifma,avx512ifmavl,avxneconvert,
> -   avx512bf16vl,vpclmulqdqvl,avx_noavx512f,avx_noavx512vl"
> +   avx512bf16vl,vpclmulqdqvl,avx_noavx512f,avx_noavx512vl,
> +   vaes_avx512vl"
>(const_string "base"))
>
>  ;; The (bounding maximum) length of an instruction immediate.
> @@ -915,7 +916,6 @@ (define_attr "enabled" ""
>(symbol_ref "TARGET_64BIT && TARGET_AVX512BW")
>  (eq_attr "isa" "x64_avx512dq")
>(symbol_ref "TARGET_64BIT && TARGET_AVX512DQ")
> -(eq_attr "isa" "aes") (symbol_ref "TARGET_AES")
>  (eq_attr "isa" "sse_noavx")
>(symbol_ref "TARGET_SSE && !TARGET_AVX")
>  (eq_attr "isa" "sse2") (symbol_ref "T

Re: [PATCH] Prohibit SHA/KEYLOCKER usage of EGPR when APX enabled

2024-04-09 Thread Hongtao Liu
On Tue, Apr 9, 2024 at 3:05 PM Hongyu Wang  wrote:
>
> The latest APX spec announced the removal of SHA/KEYLOCKER EVEX promotion
> [1], which means the SHA/KEYLOCKER insns do not support EGPR when APX is
> enabled. Update the corresponding constraints to their EGPR-disabled
> counterparts.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu.
>
> Ok for trunk?
Ok.
>
> [1].https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html
>
> gcc/ChangeLog:
>
> * config/i386/sse.md (sha1msg1): Use "ja" instead of "Bm" for
> memory constraint.
> (sha1msg2): Likewise.
> (sha1nexte): Likewise.
> (sha1rnds4): Likewise.
> (sha256msg1): Likewise.
> (sha256msg2): Likewise.
> (sha256rnds2): Likewise.
> (aesu8): Use "jm" instead of "m" for memory
> constraint.
> (*aesu8): Likewise.
> (*encodekey128u32): Use "jr" instead of "r" for register
> constraints.
> (*encodekey256u32): Likewise.
> ---
>  gcc/config/i386/sse.md | 26 +-
>  1 file changed, 13 insertions(+), 13 deletions(-)
>
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 3286d3a4fac..4b8d5342707 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -29104,7 +29104,7 @@ (define_insn "sha1msg1"
>[(set (match_operand:V4SI 0 "register_operand" "=x")
> (unspec:V4SI
>   [(match_operand:V4SI 1 "register_operand" "0")
> -  (match_operand:V4SI 2 "vector_operand" "xBm")]
> +  (match_operand:V4SI 2 "vector_operand" "xja")]
>   UNSPEC_SHA1MSG1))]
>"TARGET_SHA"
>"sha1msg1\t{%2, %0|%0, %2}"
> @@ -29115,7 +29115,7 @@ (define_insn "sha1msg2"
>[(set (match_operand:V4SI 0 "register_operand" "=x")
> (unspec:V4SI
>   [(match_operand:V4SI 1 "register_operand" "0")
> -  (match_operand:V4SI 2 "vector_operand" "xBm")]
> +  (match_operand:V4SI 2 "vector_operand" "xja")]
>   UNSPEC_SHA1MSG2))]
>"TARGET_SHA"
>"sha1msg2\t{%2, %0|%0, %2}"
> @@ -29126,7 +29126,7 @@ (define_insn "sha1nexte"
>[(set (match_operand:V4SI 0 "register_operand" "=x")
> (unspec:V4SI
>   [(match_operand:V4SI 1 "register_operand" "0")
> -  (match_operand:V4SI 2 "vector_operand" "xBm")]
> +  (match_operand:V4SI 2 "vector_operand" "xja")]
>   UNSPEC_SHA1NEXTE))]
>"TARGET_SHA"
>"sha1nexte\t{%2, %0|%0, %2}"
> @@ -29137,7 +29137,7 @@ (define_insn "sha1rnds4"
>[(set (match_operand:V4SI 0 "register_operand" "=x")
> (unspec:V4SI
>   [(match_operand:V4SI 1 "register_operand" "0")
> -  (match_operand:V4SI 2 "vector_operand" "xBm")
> +  (match_operand:V4SI 2 "vector_operand" "xja")
>(match_operand:SI 3 "const_0_to_3_operand")]
>   UNSPEC_SHA1RNDS4))]
>"TARGET_SHA"
> @@ -29150,7 +29150,7 @@ (define_insn "sha256msg1"
>[(set (match_operand:V4SI 0 "register_operand" "=x")
> (unspec:V4SI
>   [(match_operand:V4SI 1 "register_operand" "0")
> -  (match_operand:V4SI 2 "vector_operand" "xBm")]
> +  (match_operand:V4SI 2 "vector_operand" "xja")]
>   UNSPEC_SHA256MSG1))]
>"TARGET_SHA"
>"sha256msg1\t{%2, %0|%0, %2}"
> @@ -29161,7 +29161,7 @@ (define_insn "sha256msg2"
>[(set (match_operand:V4SI 0 "register_operand" "=x")
> (unspec:V4SI
>   [(match_operand:V4SI 1 "register_operand" "0")
> -  (match_operand:V4SI 2 "vector_operand" "xBm")]
> +  (match_operand:V4SI 2 "vector_operand" "xja")]
>   UNSPEC_SHA256MSG2))]
>"TARGET_SHA"
>"sha256msg2\t{%2, %0|%0, %2}"
> @@ -29172,7 +29172,7 @@ (define_insn "sha256rnds2"
>[(set (match_operand:V4SI 0 "register_operand" "=x")
> (unspec:V4SI
>   [(match_operand:V4SI 1 "register_operand" "0")
> -  (match_operand:V4SI 2 "vector_operand" "xBm")
> +  (match_operand:V4SI 2 "vector_operand" "xja")
>(match_operand:V4SI 3 "register_operand" "Yz")]
>   UNSPEC_SHA256RNDS2))]
>"TARGET_SHA"
> @@ -30575,9 +30575,9 @@ (define_expand "encodekey128u32"
>
>  (define_insn "*encodekey128u32"
>[(match_parallel 2 "encodekey128_operation"
> -[(set (match_operand:SI 0 "register_operand" "=r")
> +[(set (match_operand:SI 0 "register_operand" "=jr")
>   (unspec_volatile:SI
> -   [(match_operand:SI   1 "register_operand" "r")
> +   [(match_operand:SI   1 "register_operand" "jr")
>  (reg:V2DI XMM0_REG)]
> UNSPECV_ENCODEKEY128U32))])]
>"TARGET_KL"
> @@ -30632,9 +30632,9 @@ (define_expand "encodekey256u32"
>
>  (define_insn "*encodekey256u32"
>[(match_parallel 2 "encodekey256_operation"
> -[(set (match_operand:SI 0 "register_operand" "=r")
> +[(set (match_operand:SI 0 "register_operand" "=jr")
>   (unspec_volatile:SI
> -   [(match_operand:SI   1 "register_operan

Re: [PATCH] x86: Update constraints for APX NDD instructions

2024-02-07 Thread Hongtao Liu
On Tue, Feb 6, 2024 at 11:49 AM H.J. Lu  wrote:
>
> 1. The only supported TLS code sequence with ADD is
>
> addq foo@gottpoff(%rip),%reg
>
> Change je constraint to a memory operand in APX NDD ADD pattern with
> register source operand.
>
> 2. The instruction length of APX NDD instructions with immediate operand:
>
> op imm, mem, reg
>
> may exceed the size limit of 15 bytes when non-default address space,
> segment register or address size prefix are used.
>
> Add jM constraint which is a memory operand valid for APX NDD instructions
> with immediate operand and add jO constraint which is an offsetable memory
> operand valid for APX NDD instructions with immediate operand.  Update
> APX NDD patterns with jM and jO constraints.
Ok.
>
> gcc/
>
> PR target/113711
> PR target/113733
> * config/i386/constraints.md: List all constraints with j prefix.
> (j>): Change auto-dec to auto-inc in documentation.
> (je): Changed to a memory constraint with APX NDD TLS operand
> check.
> (jM): New memory constraint for APX NDD instructions.
> (jO): Likewise.
> * config/i386/i386-protos.h (x86_poff_operand_p): Removed.
> * config/i386/i386.cc (x86_poff_operand_p): Likewise.
> * config/i386/i386.md (*add3_doubleword): Use rjO.
> (*add_1[SWI48]): Use je and jM.
> (addsi_1_zext): Use jM.
> (*addv4_doubleword_1[DWI]): Likewise.
> (*sub_1[SWI]): Use jM.
> (@add3_cc_overflow_1[SWI]): Likewise.
> (*add3_doubleword_cc_overflow_1): Use rjO.
> (*and3_doubleword): Likewise.
> (*anddi_1): Use jM.
> (*andsi_1_zext): Likewise.
> (*and_1[SWI24]): Likewise.
> (*3_doubleword[any_or]: Use rjO
> (*code_1[any_or SWI248]): Use jM.
> (*si_1_zext[zero_extend + any_or]): Likewise.
> * config/i386/predicates.md (apx_ndd_memory_operand): New.
> (apx_ndd_add_memory_operand): Likewise.
>
> gcc/testsuite/
>
> PR target/113711
> PR target/113733
> * gcc.target/i386/apx-ndd-2.c: New test.
> * gcc.target/i386/apx-ndd-base-index-1.c: Likewise.
> * gcc.target/i386/apx-ndd-no-seg-global-1.c: Likewise.
> * gcc.target/i386/apx-ndd-seg-1.c: Likewise.
> * gcc.target/i386/apx-ndd-seg-2.c: Likewise.
> * gcc.target/i386/apx-ndd-seg-3.c: Likewise.
> * gcc.target/i386/apx-ndd-seg-4.c: Likewise.
> * gcc.target/i386/apx-ndd-seg-5.c: Likewise.
> * gcc.target/i386/apx-ndd-tls-1a.c: Likewise.
> * gcc.target/i386/apx-ndd-tls-2.c: Likewise.
> * gcc.target/i386/apx-ndd-tls-3.c: Likewise.
> * gcc.target/i386/apx-ndd-tls-4.c: Likewise.
> * gcc.target/i386/apx-ndd-x32-1.c: Likewise.
> ---
>  gcc/config/i386/constraints.md|  36 -
>  gcc/config/i386/i386-protos.h |   1 -
>  gcc/config/i386/i386.cc   |  25 
>  gcc/config/i386/i386.md   | 129 +-
>  gcc/config/i386/predicates.md |  65 +
>  gcc/testsuite/gcc.target/i386/apx-ndd-2.c |  17 +++
>  .../gcc.target/i386/apx-ndd-base-index-1.c|  50 +++
>  .../gcc.target/i386/apx-ndd-no-seg-global-1.c |  74 ++
>  gcc/testsuite/gcc.target/i386/apx-ndd-seg-1.c |  98 +
>  gcc/testsuite/gcc.target/i386/apx-ndd-seg-2.c |  98 +
>  gcc/testsuite/gcc.target/i386/apx-ndd-seg-3.c |  14 ++
>  gcc/testsuite/gcc.target/i386/apx-ndd-seg-4.c |   9 ++
>  gcc/testsuite/gcc.target/i386/apx-ndd-seg-5.c |  13 ++
>  .../gcc.target/i386/apx-ndd-tls-1a.c  |  41 ++
>  gcc/testsuite/gcc.target/i386/apx-ndd-tls-2.c |  38 ++
>  gcc/testsuite/gcc.target/i386/apx-ndd-tls-3.c |  16 +++
>  gcc/testsuite/gcc.target/i386/apx-ndd-tls-4.c |  31 +
>  gcc/testsuite/gcc.target/i386/apx-ndd-x32-1.c |  49 +++
>  18 files changed, 712 insertions(+), 92 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-base-index-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-no-seg-global-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-seg-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-seg-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-seg-3.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-seg-4.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-seg-5.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-tls-1a.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-tls-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-tls-3.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-tls-4.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-x32-1.c
>
> diff --git a/gcc/config/i386/constraints.md b/gcc/config/i386/constraints.md
> index 280e4c8e36c..64702d9c0a8 100644
> 

Re: [PATCH] x86-64: Generate push2/pop2 only if the incoming stack is 16-byte aligned

2024-02-17 Thread Hongtao Liu
On Wed, Feb 14, 2024 at 5:33 AM H.J. Lu  wrote:
>
> Since push2/pop2 requires 16-byte stack alignment, don't generate them
> if the incoming stack isn't 16-byte aligned.
Ok.
>
> gcc/
>
> PR target/113912
> * config/i386/i386.cc (ix86_can_use_push2pop2): New.
> (ix86_pro_and_epilogue_can_use_push2pop2): Use it.
> (ix86_emit_save_regs): Don't generate push2 if
> ix86_can_use_push2pop2 returns false.
> (ix86_expand_epilogue): Don't generate pop2 if
> ix86_can_use_push2pop2 returns false.
>
> gcc/testsuite/
>
> PR target/113912
> * gcc.target/i386/apx-push2pop2-2.c: New test.
> ---
>  gcc/config/i386/i386.cc   | 24 ++-
>  .../gcc.target/i386/apx-push2pop2-2.c | 24 +++
>  2 files changed, 42 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-push2pop2-2.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index a4e12602f70..46f238651a6 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -6802,16 +6802,24 @@ get_probe_interval (void)
>
>  #define SPLIT_STACK_AVAILABLE 256
>
> -/* Helper function to determine whether push2/pop2 can be used in prologue or
> -   epilogue for register save/restore.  */
> +/* Return true if push2/pop2 can be generated.  */
> +
>  static bool
> -ix86_pro_and_epilogue_can_use_push2pop2 (int nregs)
> +ix86_can_use_push2pop2 (void)
>  {
>/* Use push2/pop2 only if the incoming stack is 16-byte aligned.  */
>unsigned int incoming_stack_boundary
>  = (crtl->parm_stack_boundary > ix86_incoming_stack_boundary
> ? crtl->parm_stack_boundary : ix86_incoming_stack_boundary);
> -  if (incoming_stack_boundary % 128 != 0)
> +  return incoming_stack_boundary % 128 == 0;
> +}
> +
> +/* Helper function to determine whether push2/pop2 can be used in prologue or
> +   epilogue for register save/restore.  */
> +static bool
> +ix86_pro_and_epilogue_can_use_push2pop2 (int nregs)
> +{
> +  if (!ix86_can_use_push2pop2 ())
>  return false;
>int aligned = cfun->machine->fs.sp_offset % 16 == 0;
>return TARGET_APX_PUSH2POP2
> @@ -7401,7 +7409,9 @@ ix86_emit_save_regs (void)
>int regno;
>rtx_insn *insn;
>
> -  if (!TARGET_APX_PUSH2POP2 || cfun->machine->func_type != TYPE_NORMAL)
> +  if (!TARGET_APX_PUSH2POP2
> +  || !ix86_can_use_push2pop2 ()
> +  || cfun->machine->func_type != TYPE_NORMAL)
>  {
>for (regno = FIRST_PSEUDO_REGISTER - 1; regno >= 0; regno--)
> if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, true, true))
> @@ -10039,7 +10049,9 @@ ix86_expand_epilogue (int style)
>  m->fs.cfa_reg == stack_pointer_rtx);
> }
>
> -  if (TARGET_APX_PUSH2POP2 && m->func_type == TYPE_NORMAL)
> +  if (TARGET_APX_PUSH2POP2
> + && ix86_can_use_push2pop2 ()
> + && m->func_type == TYPE_NORMAL)
> ix86_emit_restore_regs_using_pop2 ();
>else
> ix86_emit_restore_regs_using_pop (TARGET_APX_PPX);
> diff --git a/gcc/testsuite/gcc.target/i386/apx-push2pop2-2.c 
> b/gcc/testsuite/gcc.target/i386/apx-push2pop2-2.c
> new file mode 100644
> index 000..975a6212b30
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/apx-push2pop2-2.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> +/* { dg-options "-O2 -mpreferred-stack-boundary=3 -mapx-features=push2pop2 
> -fomit-frame-pointer" } */
> +
> +extern int bar (int);
> +
> +void foo ()
> +{
> +  int a,b,c,d,e,f,i;
> +  a = bar (5);
> +  b = bar (a);
> +  c = bar (b);
> +  d = bar (c);
> +  e = bar (d);
> +  f = bar (e);
> +  for (i = 1; i < 10; i++)
> +  {
> +a += bar (a + i) + bar (b + i) +
> + bar (c + i) + bar (d + i) +
> + bar (e + i) + bar (f + i);
> +  }
> +}
> +
> +/* { dg-final { scan-assembler-not "push2(|p)\[\\t \]*%r" } } */
> +/* { dg-final { scan-assembler-not "pop2(|p)\[\\t \]*%r" } } */
> --
> 2.43.0
>


-- 
BR,
Hongtao
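
For readers following the thread, the alignment check this patch factors
out can be modelled outside GCC.  The sketch below is only an
illustration under stated assumptions: model_can_use_push2pop2 and its
parameter names are made up here, and the two arguments stand in for
crtl->parm_stack_boundary and ix86_incoming_stack_boundary, both
measured in bits as in the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-alone model of the check in the patch; not GCC code.
   Both boundaries are in bits, so 128 means 16 bytes.  */
bool
model_can_use_push2pop2 (unsigned int parm_stack_boundary,
                         unsigned int incoming_stack_boundary)
{
  /* Take the larger of the two boundaries, as the patch does.  */
  unsigned int boundary
    = (parm_stack_boundary > incoming_stack_boundary
       ? parm_stack_boundary : incoming_stack_boundary);

  /* push2/pop2 are allowed only when that boundary is a multiple of
     16 bytes (128 bits).  */
  return boundary % 128 == 0;
}
```

Under this model a 64-bit (8-byte) boundary, which the new test's
-mpreferred-stack-boundary=3 appears intended to produce, yields false,
so plain push/pop would be used instead.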


Re: PING: [PATCH] x86-64: Check R_X86_64_CODE_6_GOTTPOFF support

2024-02-22 Thread Hongtao Liu
On Thu, Feb 22, 2024 at 10:33 PM H.J. Lu  wrote:
>
> On Sun, Feb 18, 2024 at 8:02 AM H.J. Lu  wrote:
> >
> > > If the assembler and linker support
> >
> > add %reg1, name@gottpoff(%rip), %reg2
> >
> > with R_X86_64_CODE_6_GOTTPOFF, we can generate it instead of
> >
> > mov name@gottpoff(%rip), %reg2
> > add %reg1, %reg2
x86 part LGTM, but I'm not familiar with the changes in config related files.
> >
> > gcc/
> >
> > * configure.ac (HAVE_AS_R_X86_64_CODE_6_GOTTPOFF): Defined as 1
> > if R_X86_64_CODE_6_GOTTPOFF is supported.
> > * config.in: Regenerated.
> > * configure: Likewise.
> > * config/i386/predicates.md (apx_ndd_add_memory_operand): Allow
> > UNSPEC_GOTNTPOFF if R_X86_64_CODE_6_GOTTPOFF is supported.
> >
> > gcc/testsuite/
> >
> > * gcc.target/i386/apx-ndd-tls-1b.c: New test.
> > * lib/target-supports.exp
> > (check_effective_target_code_6_gottpoff_reloc): New.
> > ---
> >  gcc/config.in |  7 +++
> >  gcc/config/i386/predicates.md |  6 +-
> >  gcc/configure | 62 +++
> >  gcc/configure.ac  | 37 +++
> >  .../gcc.target/i386/apx-ndd-tls-1b.c  |  9 +++
> >  gcc/testsuite/lib/target-supports.exp | 48 ++
> >  6 files changed, 168 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-tls-1b.c
> >
> > diff --git a/gcc/config.in b/gcc/config.in
> > index ce1d073833f..f3de4ba6776 100644
> > --- a/gcc/config.in
> > +++ b/gcc/config.in
> > @@ -737,6 +737,13 @@
> >  #endif
> >
> >
> > +/* Define 0/1 if your assembler and linker support 
> > R_X86_64_CODE_6_GOTTPOFF.
> > +   */
> > +#ifndef USED_FOR_TARGET
> > +#undef HAVE_AS_R_X86_64_CODE_6_GOTTPOFF
> > +#endif
> > +
> > +
> >  /* Define if your assembler supports relocs needed by -fpic. */
> >  #ifndef USED_FOR_TARGET
> >  #undef HAVE_AS_SMALL_PIC_RELOCS
> > diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
> > index 4c1aedd7e70..391f108c360 100644
> > --- a/gcc/config/i386/predicates.md
> > +++ b/gcc/config/i386/predicates.md
> > @@ -2299,10 +2299,14 @@ (define_predicate "apx_ndd_memory_operand"
> >
> >  ;; Return true if OP is a memory operand which can be used in APX NDD
> >  ;; ADD with register source operand.  UNSPEC_GOTNTPOFF memory operand
> > -;; isn't allowed with APX NDD ADD.
> > +;; is allowed with APX NDD ADD only if R_X86_64_CODE_6_GOTTPOFF works.
> >  (define_predicate "apx_ndd_add_memory_operand"
> >(match_operand 0 "memory_operand")
> >  {
> > +  /* OK if "add %reg1, name@gottpoff(%rip), %reg2" is supported.  */
> > +  if (HAVE_AS_R_X86_64_CODE_6_GOTTPOFF)
> > +return true;
> > +
> >op = XEXP (op, 0);
> >
> >/* Disallow APX NDD ADD with UNSPEC_GOTNTPOFF.  */
> > diff --git a/gcc/configure b/gcc/configure
> > index 41b978b0380..c59c971862c 100755
> > --- a/gcc/configure
> > +++ b/gcc/configure
> > @@ -29834,6 +29834,68 @@ cat >>confdefs.h <<_ACEOF
> >  _ACEOF
> >
> >
> > +if echo "$ld_ver" | grep GNU > /dev/null; then
> > +  if $gcc_cv_ld -V 2>/dev/null | grep elf_x86_64_sol2 > /dev/null; then
> > +ld_ix86_gld_64_opt="-melf_x86_64_sol2"
> > +  else
> > +ld_ix86_gld_64_opt="-melf_x86_64"
> > +  fi
> > +fi
> > +conftest_s='
> > +   .text
> > +   .globl  _start
> > +   .type _start, @function
> > +_start:
> > +   addq%r23,foo@GOTTPOFF(%rip), %r15
> > +   .section .tdata,"awT",@progbits
> > +   .type foo, @object
> > +foo:
> > +   .quad 0'
> > +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking assembler for 
> > R_X86_64_CODE_6_GOTTPOFF reloc" >&5
> > +$as_echo_n "checking assembler for R_X86_64_CODE_6_GOTTPOFF reloc... " 
> > >&6; }
> > +if ${gcc_cv_as_x86_64_code_6_gottpoff+:} false; then :
> > +  $as_echo_n "(cached) " >&6
> > +else
> > +  gcc_cv_as_x86_64_code_6_gottpoff=no
> > +  if test x$gcc_cv_as != x; then
> > +$as_echo "$conftest_s" > conftest.s
> > +if { ac_try='$gcc_cv_as $gcc_cv_as_flags  -o conftest.o conftest.s >&5'
> > +  { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
> > +  (eval $ac_try) 2>&5
> > +  ac_status=$?
> > +  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
> > +  test $ac_status = 0; }; }
> > +then
> > +   if test x$gcc_cv_ld != x && test x$gcc_cv_objdump != x \
> > +   && test x$gcc_cv_readelf != x \
> > +   && $gcc_cv_readelf --relocs --wide conftest.o 2>&1 \
> > +  | grep R_X86_64_CODE_6_GOTTPOFF > /dev/null 2>&1 \
> > +   && $gcc_cv_ld $ld_ix86_gld_64_opt -o conftest conftest.o > 
> > /dev/null 2>&1; then
> > +  if $gcc_cv_objdump -dw conftest 2>&1 \
> > + | grep "add \+\$0xf\+8,%r23,%r15" > /dev/null 2>&1; then
> > +gcc_cv_as_x86_64_code_6_gottpoff=yes
> > +  else
> > +gcc_cv_as_x86_64_code_6_gottpoff=no

Re: [PATCH] x86: Properly implement AMX-TILE load/store intrinsics

2024-02-25 Thread Hongtao Liu
On Mon, Feb 26, 2024 at 5:11 AM H.J. Lu  wrote:
>
> ldtilecfg and sttilecfg take a 512-byte memory block.  With
> _tile_loadconfig implemented as
>
> extern __inline void
> __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> _tile_loadconfig (const void *__config)
> {
>   __asm__ volatile ("ldtilecfg\t%X0" :: "m" (*((const void **)__config)));
> }
>
> GCC sees:
>
> (parallel [
>   (asm_operands/v ("ldtilecfg   %X0") ("") 0
>[(mem/f/c:DI (plus:DI (reg/f:DI 77 virtual-stack-vars)
>  (const_int -64 [0xffc0])) [1 MEM[(const 
> void * *)&tile_data]+0 S8 A128])]
>[(asm_input:DI ("m"))]
>(clobber (reg:CC 17 flags))])
>
> and the memory operand size is 1 byte.  As the result, the rest of 511
> bytes is ignored by GCC.  Implement ldtilecfg and sttilecfg intrinsics
> with a pointer to BLKmode to honor the 512-byte memory block.
>
> gcc/ChangeLog:
>
> PR target/114098
> * config/i386/amxtileintrin.h (_tile_loadconfig): Use
> __builtin_ia32_ldtilecfg.
> (_tile_storeconfig): Use __builtin_ia32_sttilecfg.
> * config/i386/i386-builtin.def (BDESC): Add
> __builtin_ia32_ldtilecfg and __builtin_ia32_sttilecfg.
> * config/i386/i386-expand.cc (ix86_expand_builtin): Handle
> IX86_BUILTIN_LDTILECFG and IX86_BUILTIN_STTILECFG.
> * config/i386/i386.md (ldtilecfg): New pattern.
> (sttilecfg): Likewise.
>
> gcc/testsuite/ChangeLog:
>
> PR target/114098
> * gcc.target/i386/amxtile-4.c: New test.
> ---
>  gcc/config/i386/amxtileintrin.h   |  4 +-
>  gcc/config/i386/i386-builtin.def  |  4 ++
>  gcc/config/i386/i386-expand.cc| 19 
>  gcc/config/i386/i386.md   | 24 ++
>  gcc/testsuite/gcc.target/i386/amxtile-4.c | 55 +++
>  5 files changed, 104 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/amxtile-4.c
>
> diff --git a/gcc/config/i386/amxtileintrin.h b/gcc/config/i386/amxtileintrin.h
> index d1a26e0fea5..5081b326498 100644
> --- a/gcc/config/i386/amxtileintrin.h
> +++ b/gcc/config/i386/amxtileintrin.h
> @@ -39,14 +39,14 @@ extern __inline void
>  __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>  _tile_loadconfig (const void *__config)
>  {
> -  __asm__ volatile ("ldtilecfg\t%X0" :: "m" (*((const void **)__config)));
> +  __builtin_ia32_ldtilecfg (__config);
>  }
>
>  extern __inline void
>  __attribute__((__gnu_inline__, __always_inline__, __artificial__))
>  _tile_storeconfig (void *__config)
>  {
> -  __asm__ volatile ("sttilecfg\t%X0" : "=m" (*((void **)__config)));
> +  __builtin_ia32_sttilecfg (__config);
>  }
>
>  extern __inline void
> diff --git a/gcc/config/i386/i386-builtin.def 
> b/gcc/config/i386/i386-builtin.def
> index 729355230b8..88dd7f8857f 100644
> --- a/gcc/config/i386/i386-builtin.def
> +++ b/gcc/config/i386/i386-builtin.def
> @@ -126,6 +126,10 @@ BDESC (OPTION_MASK_ISA_XSAVES | OPTION_MASK_ISA_64BIT, 
> 0, CODE_FOR_nothing, "__b
>  BDESC (OPTION_MASK_ISA_XSAVES | OPTION_MASK_ISA_64BIT, 0, CODE_FOR_nothing, 
> "__builtin_ia32_xrstors64", IX86_BUILTIN_XRSTORS64, UNKNOWN, (int) 
> VOID_FTYPE_PVOID_INT64)
>  BDESC (OPTION_MASK_ISA_XSAVEC | OPTION_MASK_ISA_64BIT, 0, CODE_FOR_nothing, 
> "__builtin_ia32_xsavec64", IX86_BUILTIN_XSAVEC64, UNKNOWN, (int) 
> VOID_FTYPE_PVOID_INT64)
>
> +/* LDTILECFG and STTILECFG.  */
> +BDESC (OPTION_MASK_ISA_64BIT, OPTION_MASK_ISA2_AMX_TILE, CODE_FOR_ldtilecfg, 
> "__builtin_ia32_ldtilecfg", IX86_BUILTIN_LDTILECFG, UNKNOWN, (int) 
> VOID_FTYPE_PCVOID)
> +BDESC (OPTION_MASK_ISA_64BIT, OPTION_MASK_ISA2_AMX_TILE, CODE_FOR_ldtilecfg, 
> "__builtin_ia32_sttilecfg", IX86_BUILTIN_STTILECFG, UNKNOWN, (int) 
> VOID_FTYPE_PVOID)
This should be CODE_FOR_sttilecfg.
> +
>  /* SSE */
>  BDESC (OPTION_MASK_ISA_SSE, 0, CODE_FOR_movv4sf_internal, 
> "__builtin_ia32_storeups", IX86_BUILTIN_STOREUPS, UNKNOWN, (int) 
> VOID_FTYPE_PFLOAT_V4SF)
>  BDESC (OPTION_MASK_ISA_SSE, 0, CODE_FOR_sse_movntv4sf, 
> "__builtin_ia32_movntps", IX86_BUILTIN_MOVNTPS, UNKNOWN, (int) 
> VOID_FTYPE_PFLOAT_V4SF)
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index a4d3369f01b..17993eb837f 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -14152,6 +14152,25 @@ ix86_expand_builtin (tree exp, rtx target, rtx 
> subtarget,
> emit_insn (pat);
>return 0;
>
> +case IX86_BUILTIN_LDTILECFG:
> +case IX86_BUILTIN_STTILECFG:
> +  arg0 = CALL_EXPR_ARG (exp, 0);
> +  op0 = expand_normal (arg0);
> +
> +  if (!address_operand (op0, VOIDmode))
> +   {
> + op0 = convert_memory_address (Pmode, op0);
> + op0 = copy_addr_to_reg (op0);
> +   }
> +  op0 = gen_rtx_MEM (BLKmode, op0);
Maybe we can just use XImode and adjust the patterns to use XI.
> +  if (fcode == IX86_BUILTIN_LDTILECFG)
> +   icode = CODE_FOR_ldtilecfg;
> +  else
> 
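
The size problem described in the commit message can be illustrated
without any AMX hardware.  The sketch below is not the patch's code: the
function names and struct config_block are hypothetical, and the 64-byte
block size is an assumption derived from the XImode (512-bit) suggestion
above.  It only shows that the pointee type used in the "m" operand
expression determines how many bytes the compiler believes are accessed.

```c
#include <stddef.h>

/* Assumption: one configuration block is 512 bits, i.e. 64 bytes,
   matching the XImode suggestion in the review.  */
#define CONFIG_BLOCK_BYTES 64

/* Hypothetical type that covers the whole block.  */
struct config_block { unsigned char bytes[CONFIG_BLOCK_BYTES]; };

/* Size of the object named by the old operand expression
   *((const void **) config): just one pointer.  */
size_t
operand_bytes_via_void_pp (const void *config)
{
  return sizeof (*(const void **) config);
}

/* Size of the object when the operand is typed as the whole block.  */
size_t
operand_bytes_via_block (const void *config)
{
  return sizeof (*(const struct config_block *) config);
}
```

The first function returns sizeof (void *), which agrees with the S8
annotation in the RTL dump quoted above; the second returns the full
block size.  The patch takes a different route (a builtin with a
dedicated pattern), so this is only a way to see the mismatch.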

Re: [PATCH] x86: Properly implement AMX-TILE load/store intrinsics

2024-02-25 Thread Hongtao Liu
On Mon, Feb 26, 2024 at 10:37 AM H.J. Lu  wrote:
>
> On Sun, Feb 25, 2024 at 6:03 PM Hongtao Liu  wrote:
> >
> > On Mon, Feb 26, 2024 at 5:11 AM H.J. Lu  wrote:
> > >
> > > ldtilecfg and sttilecfg take a 512-byte memory block.  With
> > > _tile_loadconfig implemented as
> > >
> > > extern __inline void
> > > __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> > > _tile_loadconfig (const void *__config)
> > > {
> > >   __asm__ volatile ("ldtilecfg\t%X0" :: "m" (*((const void **)__config)));
> > > }
> > >
> > > GCC sees:
> > >
> > > (parallel [
> > >   (asm_operands/v ("ldtilecfg   %X0") ("") 0
> > >[(mem/f/c:DI (plus:DI (reg/f:DI 77 virtual-stack-vars)
> > >  (const_int -64 [0xffc0])) [1 
> > > MEM[(const void * *)&tile_data]+0 S8 A128])]
> > >[(asm_input:DI ("m"))]
> > >(clobber (reg:CC 17 flags))])
> > >
> > > and the memory operand size is 1 byte.  As the result, the rest of 511
> > > bytes is ignored by GCC.  Implement ldtilecfg and sttilecfg intrinsics
> > > with a pointer to BLKmode to honor the 512-byte memory block.
> > >
> > > gcc/ChangeLog:
> > >
> > > PR target/114098
> > > * config/i386/amxtileintrin.h (_tile_loadconfig): Use
> > > __builtin_ia32_ldtilecfg.
> > > (_tile_storeconfig): Use __builtin_ia32_sttilecfg.
> > > * config/i386/i386-builtin.def (BDESC): Add
> > > __builtin_ia32_ldtilecfg and __builtin_ia32_sttilecfg.
> > > * config/i386/i386-expand.cc (ix86_expand_builtin): Handle
> > > IX86_BUILTIN_LDTILECFG and IX86_BUILTIN_STTILECFG.
> > > * config/i386/i386.md (ldtilecfg): New pattern.
> > > (sttilecfg): Likewise.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > PR target/114098
> > > * gcc.target/i386/amxtile-4.c: New test.
> > > ---
> > >  gcc/config/i386/amxtileintrin.h   |  4 +-
> > >  gcc/config/i386/i386-builtin.def  |  4 ++
> > >  gcc/config/i386/i386-expand.cc| 19 
> > >  gcc/config/i386/i386.md   | 24 ++
> > >  gcc/testsuite/gcc.target/i386/amxtile-4.c | 55 +++
> > >  5 files changed, 104 insertions(+), 2 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/amxtile-4.c
> > >
> > > diff --git a/gcc/config/i386/amxtileintrin.h 
> > > b/gcc/config/i386/amxtileintrin.h
> > > index d1a26e0fea5..5081b326498 100644
> > > --- a/gcc/config/i386/amxtileintrin.h
> > > +++ b/gcc/config/i386/amxtileintrin.h
> > > @@ -39,14 +39,14 @@ extern __inline void
> > >  __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> > >  _tile_loadconfig (const void *__config)
> > >  {
> > > -  __asm__ volatile ("ldtilecfg\t%X0" :: "m" (*((const void 
> > > **)__config)));
> > > +  __builtin_ia32_ldtilecfg (__config);
> > >  }
> > >
> > >  extern __inline void
> > >  __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> > >  _tile_storeconfig (void *__config)
> > >  {
> > > -  __asm__ volatile ("sttilecfg\t%X0" : "=m" (*((void **)__config)));
> > > +  __builtin_ia32_sttilecfg (__config);
> > >  }
> > >
> > >  extern __inline void
> > > diff --git a/gcc/config/i386/i386-builtin.def 
> > > b/gcc/config/i386/i386-builtin.def
> > > index 729355230b8..88dd7f8857f 100644
> > > --- a/gcc/config/i386/i386-builtin.def
> > > +++ b/gcc/config/i386/i386-builtin.def
> > > @@ -126,6 +126,10 @@ BDESC (OPTION_MASK_ISA_XSAVES | 
> > > OPTION_MASK_ISA_64BIT, 0, CODE_FOR_nothing, "__b
> > >  BDESC (OPTION_MASK_ISA_XSAVES | OPTION_MASK_ISA_64BIT, 0, 
> > > CODE_FOR_nothing, "__builtin_ia32_xrstors64", IX86_BUILTIN_XRSTORS64, 
> > > UNKNOWN, (int) VOID_FTYPE_PVOID_INT64)
> > >  BDESC (OPTION_MASK_ISA_XSAVEC | OPTION_MASK_ISA_64BIT, 0, 
> > > CODE_FOR_nothing, "__builtin_ia32_xsavec64", IX86_BUILTIN_XSAVEC64, 
> > > UNKNOWN, (int) VOID_FTYPE_PVOID_INT64)
> > >
> > > > +/* LDTILECFG and STTILECFG.  */
> > > +BDESC (OPTION_MASK_ISA_64BIT, OPTION_MASK_ISA2_AMX_TILE, 
> > > > CODE_FOR_ldtilecfg, "

Re: [PATCH v1] RTL: Bugfix ICE after allow vector type in DSE

2024-02-25 Thread Hongtao Liu
On Mon, Feb 26, 2024 at 11:26 AM  wrote:
>
> From: Pan Li 
>
> We previously allowed vector types in get_stored_val when the read is
> no larger than the store.  Unfortunately, we did not adjust the
> validate_subreg part accordingly.  For vector types, we don't need to
> require that the mode size be at least the vector register size.
>
> Thus, for example, gen_lowpart from E_V2SFmode to E_V4QImode returns
> NULL_RTX (and an ICE follows) because the mode size is less than the
> vector register size.  That also explains why gen_lowpart from
> E_V8SFmode to E_V16QImode is valid here.
>
> This patch removes the restriction for vector modes, to get rid of the
> ICE caused by validate_subreg failing in gen_lowpart.
Be careful, it may regress some other backends.
>
> The tests below pass with this patch:
>
> * The X86 bootstrap test.
> * The full riscv regression tests.
>
> gcc/ChangeLog:
>
> * emit-rtl.cc (validate_subreg): Bypass register size check
> if the mode is vector.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/tree-ssa/ssa-fre-44.c: Add ftree-vectorize to trigger
> the ICE.
> * gcc.target/riscv/rvv/base/bug-6.c: New test.
>
> Signed-off-by: Pan Li 
> ---
>  gcc/emit-rtl.cc   |  3 ++-
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-44.c|  2 +-
>  .../gcc.target/riscv/rvv/base/bug-6.c | 22 +++
>  3 files changed, 25 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/bug-6.c
>
> diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc
> index 1856fa4884f..45c6301b487 100644
> --- a/gcc/emit-rtl.cc
> +++ b/gcc/emit-rtl.cc
> @@ -934,7 +934,8 @@ validate_subreg (machine_mode omode, machine_mode imode,
>  ;
>/* ??? Similarly, e.g. with (subreg:DF (reg:TI)).  Though store_bit_field
>   is the culprit here, and not the backends.  */
> -  else if (known_ge (osize, regsize) && known_ge (isize, osize))
> +  else if (known_ge (isize, osize) && (known_ge (osize, regsize)
> +|| (VECTOR_MODE_P (imode) || VECTOR_MODE_P (omode))))
>  ;
>/* Allow component subregs of complex and vector.  Though given the below
>   extraction rules, it's not always clear what that means.  */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-44.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-44.c
> index f79b4c142ae..624a00a4f32 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-44.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-44.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O -fdump-tree-fre1" } */
> +/* { dg-options "-O -fdump-tree-fre1 -O3 -ftree-vectorize" } */
>
>  struct A { float x, y; };
>  struct B { struct A u; };
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/bug-6.c 
> b/gcc/testsuite/gcc.target/riscv/rvv/base/bug-6.c
> new file mode 100644
> index 000..5bb00b8f587
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/base/bug-6.c
> @@ -0,0 +1,22 @@
> +/* Test that we do not have ice when compile */
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -ftree-vectorize" } */
> +
> +struct A { float x, y; };
> +struct B { struct A u; };
> +
> +extern void bar (struct A *);
> +
> +float
> +f3 (struct B *x, int y)
> +{
> +  struct A p = {1.0f, 2.0f};
> +  struct A *q = &x[y].u;
> +
> +  __builtin_memcpy (&q->x, &p.x, sizeof (float));
> +  __builtin_memcpy (&q->y, &p.y, sizeof (float));
> +
> +  bar (&p);
> +
> +  return x[y].u.x + x[y].u.y;
> +}
> --
> 2.34.1
>


-- 
BR,
Hongtao
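
To make the effect of the proposed validate_subreg change easier to
see, here is a small stand-alone model of just the one branch the patch
touches.  It is a sketch, not GCC code: the function names are
hypothetical, sizes are plain byte counts rather than poly_uint64, and
the two bool flags stand in for VECTOR_MODE_P (imode) and
VECTOR_MODE_P (omode).  The other branches of validate_subreg are not
modelled.

```c
#include <stdbool.h>

/* Branch condition before the patch:
   known_ge (osize, regsize) && known_ge (isize, osize).  */
bool
old_branch_accepts (unsigned int osize, unsigned int isize,
                    unsigned int regsize)
{
  return osize >= regsize && isize >= osize;
}

/* Branch condition after the patch: the register-size check is
   bypassed when either mode is a vector mode.  */
bool
new_branch_accepts (unsigned int osize, unsigned int isize,
                    unsigned int regsize,
                    bool imode_is_vector, bool omode_is_vector)
{
  return isize >= osize
         && (osize >= regsize || imode_is_vector || omode_is_vector);
}
```

Taking the example from the commit message, with an 8-byte V2SF inner
mode, a 4-byte V4QI outer mode and an assumed 16-byte natural register
size, the old condition is false and the new one is true.  For two
scalar modes of the same sizes both conditions remain false, which is
the part of the behaviour the patch leaves unchanged.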


Re: [PATCH v1] RTL: Bugfix ICE after allow vector type in DSE

2024-02-25 Thread Hongtao Liu
On Mon, Feb 26, 2024 at 11:42 AM Li, Pan2  wrote:
>
> > Be careful, it may regress some other backends.
>
> Thanks Hongtao, how about taking INNER_MODE here for regsize?  Currently it
> is the whole vector register in the comparison.
>
> poly_uint64 regsize = REGMODE_NATURAL_SIZE (imode);
>
> Pan
>
> -Original Message-
> From: Hongtao Liu 
> Sent: Monday, February 26, 2024 11:41 AM
> To: Li, Pan2 
> Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com; 
> richard.guent...@gmail.com; Wang, Yanzhang ; 
> rdapp@gmail.com
> Subject: Re: [PATCH v1] RTL: Bugfix ICE after allow vector type in DSE
>
> On Mon, Feb 26, 2024 at 11:26 AM  wrote:
> >
> > From: Pan Li 
> >
> > We previously allowed vector types in get_stored_val when the read is
> > no larger than the store.  Unfortunately, we did not adjust the
> > validate_subreg part accordingly.  For vector types, we don't need to
> > require that the mode size be at least the vector register size.
> >
> > Thus, for example, gen_lowpart from E_V2SFmode to E_V4QImode returns
> > NULL_RTX (and an ICE follows) because the mode size is less than the
> > vector register size.  That also explains why gen_lowpart from
> > E_V8SFmode to E_V16QImode is valid here.
> >
> > This patch removes the restriction for vector modes, to get rid of the
> > ICE caused by validate_subreg failing in gen_lowpart.
> > Be careful, it may regress some other backends.
The related thread:
https://gcc.gnu.org/pipermail/gcc-patches/2021-August/578466.html
> >
> > The tests below pass with this patch:
> >
> > * The X86 bootstrap test.
> > * The full riscv regression tests.
> >
> > gcc/ChangeLog:
> >
> > * emit-rtl.cc (validate_subreg): Bypass register size check
> > if the mode is vector.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.dg/tree-ssa/ssa-fre-44.c: Add ftree-vectorize to trigger
> > the ICE.
> > * gcc.target/riscv/rvv/base/bug-6.c: New test.
> >
> > Signed-off-by: Pan Li 
> > ---
> >  gcc/emit-rtl.cc   |  3 ++-
> >  gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-44.c|  2 +-
> >  .../gcc.target/riscv/rvv/base/bug-6.c | 22 +++
> >  3 files changed, 25 insertions(+), 2 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/bug-6.c
> >
> > diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc
> > index 1856fa4884f..45c6301b487 100644
> > --- a/gcc/emit-rtl.cc
> > +++ b/gcc/emit-rtl.cc
> > @@ -934,7 +934,8 @@ validate_subreg (machine_mode omode, machine_mode imode,
> >  ;
> >/* ??? Similarly, e.g. with (subreg:DF (reg:TI)).  Though store_bit_field
> >   is the culprit here, and not the backends.  */
> > -  else if (known_ge (osize, regsize) && known_ge (isize, osize))
> > +  else if (known_ge (isize, osize) && (known_ge (osize, regsize)
> > +|| (VECTOR_MODE_P (imode) || VECTOR_MODE_P (omode))))
> >  ;
> >/* Allow component subregs of complex and vector.  Though given the below
> >   extraction rules, it's not always clear what that means.  */
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-44.c 
> > b/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-44.c
> > index f79b4c142ae..624a00a4f32 100644
> > --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-44.c
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-44.c
> > @@ -1,5 +1,5 @@
> >  /* { dg-do compile } */
> > -/* { dg-options "-O -fdump-tree-fre1" } */
> > +/* { dg-options "-O -fdump-tree-fre1 -O3 -ftree-vectorize" } */
> >
> >  struct A { float x, y; };
> >  struct B { struct A u; };
> > diff --git a/gcc/testsuite/gcc.target/riscv/rvv/base/bug-6.c 
> > b/gcc/testsuite/gcc.target/riscv/rvv/base/bug-6.c
> > new file mode 100644
> > index 000..5bb00b8f587
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/riscv/rvv/base/bug-6.c
> > @@ -0,0 +1,22 @@
> > +/* Test that we do not have ice when compile */
> > +/* { dg-do compile } */
> > +/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -ftree-vectorize" } */
> > +
> > +struct A { float x, y; };
> > +struct B { struct A u; };
> > +
> > +extern void bar (struct A *);
> > +
> > +float
> > +f3 (struct B *x, int y)
> > +{
> > +  struct A p = {1.0f, 2.0f};
> > +  struct A *q = &x[y].u;
> > +
> > +  __builtin_memcpy (&q->x, &p.x, sizeof (float));
> > +  __builtin_memcpy (&q->y, &p.y, sizeof (float));
> > +
> > +  bar (&p);
> > +
> > +  return x[y].u.x + x[y].u.y;
> > +}
> > --
> > 2.34.1
> >
>
>
> --
> BR,
> Hongtao



--
BR,
Hongtao
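
For readers following the size check discussed above, here is a minimal sketch of the old and new conditions with plain integers standing in for the poly_int sizes (so known_ge becomes >=).  The function names are hypothetical, and the register size of 16 bytes used in the usage note is an assumption for illustration, not a value taken from the patch.  It models only this one branch of validate_subreg, not the whole function.

```c
#include <stdbool.h>

/* Old condition: the branch is taken only when the outer mode is at
   least as large as the register and no larger than the inner mode.  */
static bool
old_size_check (unsigned isize, unsigned osize, unsigned regsize)
{
  return osize >= regsize && isize >= osize;
}

/* New condition from the patch: the register size requirement is
   bypassed when either mode is a vector mode.  */
static bool
new_size_check (unsigned isize, unsigned osize, unsigned regsize,
                bool imode_is_vector, bool omode_is_vector)
{
  return isize >= osize
         && (osize >= regsize || imode_is_vector || omode_is_vector);
}
```

With an assumed 16-byte register, V2SF (8 bytes) to V4QI (4 bytes) fails the old check and passes the new one, while V8SF (32 bytes) to V16QI (16 bytes) passes both, which matches the behaviour described in the patch description.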


Re: [PATCH] x86: Properly implement AMX-TILE load/store intrinsics

2024-02-26 Thread Hongtao Liu
On Mon, Feb 26, 2024 at 6:30 PM H.J. Lu  wrote:
>
> On Sun, Feb 25, 2024 at 8:25 PM H.J. Lu  wrote:
> >
> > On Sun, Feb 25, 2024 at 7:03 PM Hongtao Liu  wrote:
> > >
> > > On Mon, Feb 26, 2024 at 10:37 AM H.J. Lu  wrote:
> > > >
> > > > On Sun, Feb 25, 2024 at 6:03 PM Hongtao Liu  wrote:
> > > > >
> > > > > On Mon, Feb 26, 2024 at 5:11 AM H.J. Lu  wrote:
> > > > > >
> > > > > > ldtilecfg and sttilecfg take a 512-byte memory block.  With
> > > > > > _tile_loadconfig implemented as
> > > > > >
> > > > > > extern __inline void
> > > > > > __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> > > > > > _tile_loadconfig (const void *__config)
> > > > > > {
> > > > > >   __asm__ volatile ("ldtilecfg\t%X0" :: "m" (*((const void 
> > > > > > **)__config)));
> > > > > > }
> > > > > >
> > > > > > GCC sees:
> > > > > >
> > > > > > (parallel [
> > > > > >   (asm_operands/v ("ldtilecfg   %X0") ("") 0
> > > > > >[(mem/f/c:DI (plus:DI (reg/f:DI 77 virtual-stack-vars)
> > > > > >  (const_int -64 [0xffc0])) [1 
> > > > > > MEM[(const void * *)&tile_data]+0 S8 A128])]
> > > > > >[(asm_input:DI ("m"))]
> > > > > >(clobber (reg:CC 17 flags))])
> > > > > >
> > > > > > > and the memory operand size is 1 byte.  As a result, the remaining
> > > > > > > 511 bytes are ignored by GCC.  Implement ldtilecfg and sttilecfg
> > > > > > > intrinsics with a pointer to BLKmode to honor the 512-byte memory
> > > > > > > block.
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > > PR target/114098
> > > > > > * config/i386/amxtileintrin.h (_tile_loadconfig): Use
> > > > > > __builtin_ia32_ldtilecfg.
> > > > > > (_tile_storeconfig): Use __builtin_ia32_sttilecfg.
> > > > > > * config/i386/i386-builtin.def (BDESC): Add
> > > > > > __builtin_ia32_ldtilecfg and __builtin_ia32_sttilecfg.
> > > > > > * config/i386/i386-expand.cc (ix86_expand_builtin): Handle
> > > > > > IX86_BUILTIN_LDTILECFG and IX86_BUILTIN_STTILECFG.
> > > > > > * config/i386/i386.md (ldtilecfg): New pattern.
> > > > > > (sttilecfg): Likewise.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >
> > > > > > PR target/114098
> > > > > > * gcc.target/i386/amxtile-4.c: New test.
> > > > > > ---
> > > > > >  gcc/config/i386/amxtileintrin.h   |  4 +-
> > > > > >  gcc/config/i386/i386-builtin.def  |  4 ++
> > > > > >  gcc/config/i386/i386-expand.cc| 19 
> > > > > >  gcc/config/i386/i386.md   | 24 ++
> > > > > >  gcc/testsuite/gcc.target/i386/amxtile-4.c | 55 
> > > > > > +++
> > > > > >  5 files changed, 104 insertions(+), 2 deletions(-)
> > > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/amxtile-4.c
> > > > > >
> > > > > > diff --git a/gcc/config/i386/amxtileintrin.h 
> > > > > > b/gcc/config/i386/amxtileintrin.h
> > > > > > index d1a26e0fea5..5081b326498 100644
> > > > > > --- a/gcc/config/i386/amxtileintrin.h
> > > > > > +++ b/gcc/config/i386/amxtileintrin.h
> > > > > > @@ -39,14 +39,14 @@ extern __inline void
> > > > > >  __attribute__((__gnu_inline__, __always_inline__, __artificial__))
> > > > > >  _tile_loadconfig (const void *__config)
> > > > > >  {
> > > > > > -  __asm__ volatile ("ldtilecfg\t%X0" :: "m" (*((const void 
> > > > > > **)__config)));
> > > > > > +  __builtin_ia32_ldtilecfg (__config);
> > > > > >  }
> > > > > >
> > > > > >  extern __inline void
> > > > > >  __attribute__((__g
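
The problem described in this thread is that an asm "m" operand tells GCC only about as many bytes as the operand's type covers.  A minimal sketch of the usual inline-asm workaround follows, assuming a hypothetical block type cfg_block, an arbitrary block size of 64 bytes, and an empty asm template in place of the real instruction.  None of these come from the patch, which switches to a builtin instead.

```c
#include <string.h>

/* Hypothetical block size, chosen only for illustration.  */
#define CFG_BLOCK_BYTES 64

/* A type whose size covers the whole block, so that an operand of
   this type describes every byte to the compiler.  */
typedef struct { unsigned char bytes[CFG_BLOCK_BYTES]; } cfg_block;

/* Stand-in for an instruction that reads the whole block.  The asm
   template is empty; only the operand description matters.  Because
   the "m" operand has type cfg_block, the compiler treats all
   CFG_BLOCK_BYTES bytes as read by the asm.  */
static inline void
read_whole_block (const cfg_block *config)
{
  __asm__ volatile ("" : : "m" (*config));
}

/* Fill a block, hand it to the asm, and return its last byte.  */
static unsigned char
fill_and_read (unsigned char value)
{
  cfg_block block;
  memset (&block, value, sizeof block);
  read_whole_block (&block);
  return block.bytes[CFG_BLOCK_BYTES - 1];
}
```

Had the operand been written through a pointer to a smaller type, the stores to the rest of the block could be considered dead by the optimizer, which is the failure mode the patch description reports.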

Re: [r14-9173 Regression] FAIL: gcc.dg/tree-ssa/andnot-2.c scan-tree-dump-not forwprop3 "_expr" on Linux/x86_64

2024-02-26 Thread Hongtao Liu
On Tue, Feb 27, 2024 at 3:44 PM Richard Biener  wrote:
>
> On Tue, 27 Feb 2024, haochen.jiang wrote:
>
> > On Linux/x86_64,
> >
> > af66ad89e8169f44db723813662917cf4cbb78fc is the first bad commit
> > commit af66ad89e8169f44db723813662917cf4cbb78fc
> > Author: Richard Biener 
> > Date:   Fri Feb 23 16:06:05 2024 +0100
> >
> > middle-end/114070 - folding breaking VEC_COND expansion
> >
> > caused
> >
> > FAIL: gcc.dg/tree-ssa/andnot-2.c scan-tree-dump-not forwprop3 "_expr"
>
> This shows that the x86 backend is missing vcond_mask_qiqi and friends
Interesting, so both operand and mask are vector boolean.
> (for AVX512 mask modes).  Either that or both expand_vec_cond_expr_p
> and all the machinery behind it (ISEL pass, lowering) should handle
> pure integer mode VEC_COND_EXPR via bit operations.  I think quite some
> targets now implement patterns for these variants, whatever their
> boolean vector modes are.
>
> One complication with the change, which was
>
>   (simplify
>(op @3 (vec_cond:s @0 @1 @2))
> -  (vec_cond @0 (op! @3 @1) (op! @3 @2
> +  (if (TREE_CODE_CLASS (op) != tcc_comparison
> +   || types_match (type, TREE_TYPE (@1))
> +   || expand_vec_cond_expr_p (type, TREE_TYPE (@0), ERROR_MARK))
> +   (vec_cond @0 (op! @3 @1) (op! @3 @2)
>
> is that expand_vec_cond_expr_p can also handle comparison defined
> masks, but whether or not we have this isn't visible here so we
> can only check whether vcond_mask expansion would work.
>
> We have optimize_vectors_before_lowering_p but we shouldn't even there
> turn supported into not supported ops and as said, what's supported or
> not cannot be finally decided (if it's only vcond and not vcond_mask
> that is supported).  Also optimize_vectors_before_lowering_p is set
> for a short time between vectorization and vector lowering and we
> definitely do not want to turn supported vectorizer emitted stmts
> into ones that we need to lower.  For GCC 15 we should see to move
> vector lowering before vectorization (before loop optimization I'd
> say) to close this particular hole (and also reliably ICE when the
> vectorizer creates unsupported IL).  We also definitely want to
> retire vcond expanders (no target I know of supports single-instruction
> compare-and-select).
>
> So short term we either live with this regression (the testcase
> verifies we perform constant folding to { 0, 0 }), implement
> the four missing patterns (qi, hi, si and di missing value mode
> vcond_mask patterns) or see to implement generic code for this.
>
> Given precedent I'd tend towards adding the x86 patterns.
>
> Hongtao, can you handle that?
Sure, I'll take a look.
>
> Thanks,
> Richard.



-- 
BR,
Hongtao
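
As an illustration of the "pure integer mode VEC_COND_EXPR via bit operations" option mentioned above, here is a minimal sketch for an 8-bit mask value, where each bit is one boolean lane.  The function name is hypothetical, and this is only the scalar identity, not the GCC pattern itself.

```c
#include <stdint.h>

/* Select bits from IF_TRUE where MASK is set and from IF_FALSE where
   it is clear: (mask & a) | (~mask & b).  */
static uint8_t
mask_select_qi (uint8_t mask, uint8_t if_true, uint8_t if_false)
{
  return (uint8_t) ((mask & if_true) | (~mask & if_false));
}
```

For example, with mask 0xF0 the high four lanes come from the first value and the low four from the second.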


Re: [PATCH] i386: Guard noreturn no-callee-saved-registers optimization with -mnoreturn-no-callee-saved-registers [PR38534]

2024-02-28 Thread Hongtao Liu
On Wed, Feb 28, 2024 at 4:54 PM Jakub Jelinek  wrote:
>
> Hi!
>
> Adding Hongtao and Honza into the loop as the ones who acked the original
> patch.
>
> The no_callee_saved_registers by default for noreturn functions change can
> break in-process backtrace(3) or backtraces from debugger or other process
> (quite often, any time the noreturn function decides to use the bp register
> and any of the parent frames uses a frame pointer; the unwinder just crashes
> in the libgcc unwinder case, gdb prints stack corrupted message), so I'd
> like to save bp register in that case:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2024-February/646591.html
I think this patch makes sense and LGTM; we save and restore the frame
pointer for noreturn functions.
>
> and additionally the no_callee_saved_registers by default for noreturn
> functions change can make debugging harder, again not localized to the
> noreturn function, but any of its callers.  So, if say glibc abort function
> implementation needs a lot of normally callee-saved registers, no matter how
> users recompile their apps, they will see garbage or optimized out
> vars/parameters in their code unless they rebuild their glibc with -O0.
> So, I think we should guard that by a non-default option:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2024-February/646649.html
So it turns off the optimization for noreturn functions by default;
I'm not sure about this.
Any comments, H.J.?
>
> Plus we need to somehow make sure to emit DW_CFA_undefined for the modified
> but not saved normally callee-saved registers, so that we at least don't get
> garbage in debug info.  H.J. posted some patches for that, so far I wasn't
> happy about the implementation but the actual change is desirable.
>
> Your thoughts on this?
>
> Jakub
>


-- 
BR,
Hongtao


Re: [PATCH] i386: [APX] Document inline asm behavior and new switch for APX

2024-01-10 Thread Hongtao Liu
On Tue, Jan 9, 2024 at 3:09 PM Hongyu Wang  wrote:
>
> Hi,
>
> For APX, the inline asm behavior was not mentioned in any document
> before. Add description for it.
>
> Ok for trunk?
>
> gcc/ChangeLog:
>
> * config/i386/i386.opt: Adjust document.
> * doc/invoke.texi: Add description for
> -mapx-inline-asm-use-gpr32.
> ---
>  gcc/config/i386/i386.opt | 3 +--
>  gcc/doc/invoke.texi  | 7 +++
>  2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> index a38e92baf92..5b4f1bff25f 100644
> --- a/gcc/config/i386/i386.opt
> +++ b/gcc/config/i386/i386.opt
> @@ -1357,8 +1357,7 @@ Enum(apx_features) String(all) Value(apx_all) Set(1)
>
>  mapx-inline-asm-use-gpr32
>  Target Var(ix86_apx_inline_asm_use_gpr32) Init(0)
> -Enable GPR32 in inline asm when APX_EGPR enabled, do not
> -hook reg or mem constraint in inline asm to GPR16.
> +Enable GPR32 in inline asm when APX_F enabled.
>
>  mevex512
>  Target Mask(ISA2_EVEX512) Var(ix86_isa_flags2) Save
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 68d1f364ac0..47fd96648d8 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -35272,6 +35272,13 @@ r8-r15 registers so that the call and jmp 
> instruction length is 6 bytes
>  to allow them to be replaced with @samp{lfence; call *%r8-r15} or
>  @samp{lfence; jmp *%r8-r15} at run-time.
>
> +@opindex mapx-inline-asm-use-gpr32
> +@item -mapx-inline-asm-use-gpr32
> +When APX_F enabled, EGPR usage was by default disabled to prevent
> +unexpected EGPR generation in instructions that does not support it.
> +To invoke EGPR usage in inline asm, use this switch to allow EGPR in
> +inline asm, while user should ensure the asm actually supports EGPR.
Please align with
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642228.html.
Ok after changing that.
> +
>  @end table
>
>  These @samp{-m} switches are supported in addition to the above
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH] i386: [APX] Document inline asm behavior and new switch for APX

2024-01-10 Thread Hongtao Liu
On Thu, Jan 11, 2024 at 7:06 AM Andi Kleen  wrote:
>
> Hongtao Liu  writes:
> >>
> >> +@opindex mapx-inline-asm-use-gpr32
> >> +@item -mapx-inline-asm-use-gpr32
> >> +When APX_F enabled, EGPR usage was by default disabled to prevent
> >> +unexpected EGPR generation in instructions that does not support it.
> >> +To invoke EGPR usage in inline asm, use this switch to allow EGPR in
> >> +inline asm, while user should ensure the asm actually supports EGPR.
> > Please align with
> > https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642228.html.
> > Ok after changing that.
>
> BTW I think we would need a way to specify this individually per inline
> asm statement too.
>
> Otherwise a library which wants to use APX inline asm in the header
> never can do so until all its users set the option, which will be
> awkward to deploy.
>
> Perhaps it could be a magic clobber string.
We do have new constraint strings for gpr32 or gpr16 for registers,
but not for memory, due to a restriction of the GCC RA infrastructure,
which assumes a universal BASE_REG_CLASS/INDEX_REG_CLASS for all inline
asm.
>
> -andi



-- 
BR,
Hongtao


Re: [PATCH] i386: Add AVX10.1 related macros

2024-01-11 Thread Hongtao Liu
On Fri, Jan 12, 2024 at 10:55 AM Jiang, Haochen  wrote:
>
> > -Original Message-
> > From: Richard Biener 
> > Sent: Thursday, January 11, 2024 4:19 PM
> > To: Liu, Hongtao 
> > Cc: Jiang, Haochen ; gcc-patches@gcc.gnu.org;
> > ubiz...@gmail.com; bur...@net-b.de; san...@codesourcery.com
> > Subject: Re: [PATCH] i386: Add AVX10.1 related macros
> >
> > On Thu, Jan 11, 2024 at 2:16 AM Liu, Hongtao 
> > wrote:
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: Richard Biener 
> > > > Sent: Wednesday, January 10, 2024 5:44 PM
> > > > To: Liu, Hongtao 
> > > > Cc: Jiang, Haochen ;
> > > > gcc-patches@gcc.gnu.org; ubiz...@gmail.com; bur...@net-b.de;
> > > > san...@codesourcery.com
> > > > Subject: Re: [PATCH] i386: Add AVX10.1 related macros
> > > >
> > > > On Wed, Jan 10, 2024 at 9:01 AM Liu, Hongtao 
> > > > wrote:
> > > > >
> > > > >
> > > > >
> > > > > > -Original Message-
> > > > > > From: Jiang, Haochen 
> > > > > > Sent: Wednesday, January 10, 2024 3:35 PM
> > > > > > To: gcc-patches@gcc.gnu.org
> > > > > > Cc: Liu, Hongtao ; ubiz...@gmail.com;
> > > > > > burnus@net- b.de; san...@codesourcery.com
> > > > > > Subject: [PATCH] i386: Add AVX10.1 related macros
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > This patch aims to add AVX10.1 related macros for libgomp's request.
> > > > > > The request comes following:
> > > > > >
> > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642025.html
> > > > > >
> > > > > > Ok for trunk?
> > > > > >
> > > > > > Thx,
> > > > > > Haochen
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > >   PR target/113288
> > > > > >   * config/i386/i386-c.cc (ix86_target_macros_internal):
> > > > > >   Add __AVX10_1__, __AVX10_1_256__ and __AVX10_1_512__.
> > > > > > ---
> > > > > >  gcc/config/i386/i386-c.cc | 7 +++
> > > > > >  1 file changed, 7 insertions(+)
> > > > > >
> > > > > > diff --git a/gcc/config/i386/i386-c.cc
> > > > > > b/gcc/config/i386/i386-c.cc index c3ae984670b..366b560158a
> > > > > > 100644
> > > > > > --- a/gcc/config/i386/i386-c.cc
> > > > > > +++ b/gcc/config/i386/i386-c.cc
> > > > > > @@ -735,6 +735,13 @@ ix86_target_macros_internal
> > (HOST_WIDE_INT
> > > > > > isa_flag,
> > > > > >  def_or_undef (parse_in, "__EVEX512__");
> > > > > >if (isa_flag2 & OPTION_MASK_ISA2_USER_MSR)
> > > > > >  def_or_undef (parse_in, "__USER_MSR__");
> > > > > > +  if (isa_flag2 & OPTION_MASK_ISA2_AVX10_1_256)
> > > > > > +{
> > > > > > +  def_or_undef (parse_in, "__AVX10_1_256__");
> > > > > > +  def_or_undef (parse_in, "__AVX10_1__");
> > > > > I think this is not needed, others LGTM.
> > > >
> > > > So __AVX10_1_256__ and __AVX10_1_512__ are redundant with
> > > > __AVX10_1__ and __EVEX512__, right?
> > > > No, I mean __AVX10_1__ is redundant with __AVX10_1_256__ since
> > > > -mavx10.1 is just an alias of -mavx10.1-256.
> > > > We want explicit __AVX10_1_256__ and __AVX10_1_512__ and don't want
> > > > to mix __EVEX512__ with AVX10 (they are related in their internal
> > > > implementation, but we don't want the user to control the vector
> > > > length of AVX10 with -mno-evex512; -mno-evex512 is meant for the
> > > > existing AVX512).
>
> Let's keep both of them if we prefer __AVX10_1_256__ since I just found
> that LLVM got macro __AVX10_1__.
>
> https://github.com/llvm/llvm-project/pull/67278/files#diff-7435d50346a810555df89deb1f879b767ee985ace43fb3990de17fb23a47f004
>
> in file clang/lib/Basic/Targets/X86.cpp L774-777.
Ok.
>
> Thx,
> Haochen
>
> >
> > Ah, that makes sense.
> >
> > > > > > +}
> > > > > > +  if (isa_flag2 & OPTION_MASK_ISA2_AVX10_1_512)
> > > > > > +def_or_undef (parse_in, "__AVX10_1_512__");
> > > > > >if (TARGET_IAMCU)
> > > > > >  {
> > > > > >def_or_undef (parse_in, "__iamcu");
> > > > > > --
> > > > > > 2.31.1
> > > > >



-- 
BR,
Hongtao


Re: [PATCH] Update documents for fcf-protection=

2024-01-11 Thread Hongtao Liu
On Thu, Jan 11, 2024 at 12:06 AM H.J. Lu  wrote:
>
> On Tue, Jan 9, 2024 at 6:02 PM liuhongt  wrote:
> >
> > After r14-2692-g1c6231c05bdcca, the option is defined as EnumSet and
> > -fcf-protection=branch won't unset any other bits since they're in
> > different groups. So to override -fcf-protection, an explicit
> > -fcf-protection=none needs to be added, followed by
> > -fcf-protection=XXX.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ok for trunk?
> >
> > gcc/ChangeLog:
>
> We should mention:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113039
Changed, and committed.
>
> > * doc/invoke.texi (fcf-protection=): Update documents.
> > ---
> >  gcc/doc/invoke.texi | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 68d1f364ac0..d1e6fafb98c 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -17734,6 +17734,9 @@ function.  The value @code{full} is an alias for 
> > specifying both
> >  @code{branch} and @code{return}. The value @code{none} turns off
> >  instrumentation.
> >
> > +To override @option{-fcf-protection}, @option{-fcf-protection=none}
> > +needs to be explicitly added and then with @option{-fcf-protection=xxx}.
> > +
> >  The value @code{check} is used for the final link with link-time
> >  optimization (LTO).  An error is issued if LTO object files are
> >  compiled with different @option{-fcf-protection} values.  The
> > --
> > 2.31.1
> >
>
>
> --
> H.J.



-- 
BR,
Hongtao
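
One way to observe which protection kinds ended up enabled after a sequence of -fcf-protection options is to inspect the __CET__ macro.  The sketch below assumes the encoding GCC uses on x86 (bit 0 for branch, bit 1 for return); the function name is hypothetical.

```c
/* Control-flow protection kinds enabled for this translation unit, as
   a bit set: bit 0 for branch protection, bit 1 for return protection.
   Assumes the compiler encodes this in __CET__, as GCC does on x86.
   Returns 0 when __CET__ is not defined.  */
static unsigned
cf_protection_kinds (void)
{
#ifdef __CET__
  return (unsigned) __CET__ & 3u;
#else
  return 0u;
#endif
}
```

Per the documentation change above, compiling with -fcf-protection=none followed by -fcf-protection=branch should leave only bit 0 set, whereas -fcf-protection=branch alone does not clear a return bit set by an earlier option.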


Re: [x86 PATCH] PR target/106060: Improved SSE vector constant materialization.

2024-01-16 Thread Hongtao Liu
On Wed, Jan 17, 2024 at 5:59 AM Roger Sayle  wrote:
>
>
> I thought I'd just missed the bug fixing season of stage3, but there
> appears to be a little latitude in early stage4 (for vector patches), so
> I'll post this now.
>
> This patch resolves PR target/106060 by providing efficient methods for
> materializing/synthesizing special "vector" constants on x86.  Currently
> there are three methods of materializing a vector constant; the most
> general is to load a vector from the constant pool, secondly "duplicated"
> constants can be synthesized by moving an integer between units and
> broadcasting (or shuffling it), and finally the special cases of the
> all-zeros vector and all-ones vectors can be loaded via a single SSE
> instruction.   This patch handles additional cases that can be synthesized
> in two instructions, loading an all-ones vector followed by another SSE
> instruction.  Following my recent patch for PR target/112992, there's
> conveniently a single place in i386-expand.cc where these special cases
> can be handled.
>
> Two examples are given in the original bugzilla PR for 106060.
>
> __m256i
> should_be_cmpeq_abs ()
> {
>   return _mm256_set1_epi8 (1);
> }
>
> is now generated (with -O3 -march=x86-64-v3) as:
>
> vpcmpeqd%ymm0, %ymm0, %ymm0
> vpabsb  %ymm0, %ymm0
> ret
>
> and
>
> __m256i
> should_be_cmpeq_add ()
> {
>   return _mm256_set1_epi8 (-2);
> }
>
> is now generated as:
>
> vpcmpeqd%ymm0, %ymm0, %ymm0
> vpaddb  %ymm0, %ymm0, %ymm0
> ret
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2024-01-16  Roger Sayle  
>
> gcc/ChangeLog
> PR target/106060
> * config/i386/i386-expand.cc (enum ix86_vec_bcast_alg): New.
> (struct ix86_vec_bcast_map_simode_t): New type for table below.
> (ix86_vec_bcast_map_simode): Table of SImode constants that may
> be efficiently synthesized by a ix86_vec_bcast_alg method.
> (ix86_vec_bcast_map_simode_cmp): New comparator for bsearch.
> (ix86_vector_duplicate_simode_const): Efficiently synthesize
> V4SImode and V8SImode constants that duplicate special constants.
> (ix86_vector_duplicate_value): Attempt to synthesize "special"
> vector constants using ix86_vector_duplicate_simode_const.
> * config/i386/i386.cc (ix86_rtx_costs) : ABS of a
> vector integer mode costs with a single SSE instruction.
>

+  switch (entry->alg)
+{
+case VEC_BCAST_PXOR:
+  if (mode == V8SImode && !TARGET_AVX2)
+ return false;
+  emit_move_insn (target, CONST0_RTX (mode));
+  return true;
+case VEC_BCAST_PCMPEQ:
+  if ((mode == V4SImode && !TARGET_SSE2)
+  || (mode == V8SImode && !TARGET_AVX2))
+ return false;
+  emit_move_insn (target, CONSTM1_RTX (mode));
+  return true;

I think we need to prevent those standard_sse_constant_p constants from
reaching ix86_expand_vector_init_duplicate, using the code below.

  /* If all values are identical, broadcast the value.  */
  if (all_same
  && (nvars != 0 || !standard_sse_constant_p (gen_rtx_CONST_VECTOR
(mode, XVEC (vals, 0)), mode))
  && ix86_expand_vector_init_duplicate (mmx_ok, mode, target,
XVECEXP (vals, 0, 0)))
return;

+case VEC_BCAST_PABSB:
+  if (mode == V4SImode)
+ {
+  tmp1 = gen_reg_rtx (V16QImode);
+  emit_move_insn (tmp1, CONSTM1_RTX (V16QImode));
+  tmp2 = gen_reg_rtx (V16QImode);
+  emit_insn (gen_absv16qi2 (tmp2, tmp1));
Shouldn't it rely on TARGET_SSE2?

+case VEC_BCAST_PADDB:
+  if (mode == V4SImode)
+ {
+  tmp1 = gen_reg_rtx (V16QImode);
+  emit_move_insn (tmp1, CONSTM1_RTX (V16QImode));
+  tmp2 = gen_reg_rtx (V16QImode);
+  emit_insn (gen_addv16qi3 (tmp2, tmp1, tmp1));
Ditto here and for all logical shift cases.
+ }

+
+  if ((mode == V4SImode || mode == V8SImode)
+  && CONST_INT_P (val)
+  && ix86_vector_duplicate_simode_const (mode, target, INTVAL (val)))
+return true;
+
An alternative is to add a pre-reload define_insn_and_split that matches
specific const_vectors and splits them into new instructions.
In theory, the constant information can then be retained up to combine,
which will enable more simplification.

The patch can also be extended to V16SImode, but that can be a separate patch.

> gcc/testsuite/ChangeLog
> PR target/106060
> * gcc.target/i386/auto-init-8.c: Update test case.
> * gcc.target/i386/avx512fp16-3.c: Likewise.
> * gcc.target/i386/pr100865-9a.c: Likewise.
> * gcc.target/i386/pr106060-1.c: New test case.
> * gcc.target/i386/pr106060-2.c: Likewise.
> * gcc.target/i386/pr106060-3.c: Likewise.
> * gcc.target/i386/pr70314-3.c: Update test case.
> * gcc.target/i386/vect-shiftv4qi.c: Likewise.
> * gcc.target/i386/vect-shiftv8qi.c: Likewise.
>
>
> Thanks in advance,
> Roger
> -
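
The two-instruction syntheses shown in this thread rely on simple per-byte identities: abs(-1) is 1 and (-1) + (-1) is -2.  The following is a minimal sketch using GCC's generic vector extension rather than SSE intrinsics, so it is not tied to x86.  The helper names are hypothetical and the lane-wise loop merely stands in for pabsb.

```c
#include <stdint.h>

typedef int8_t v16qi __attribute__ ((vector_size (16)));

/* All-ones vector: what a single pcmpeq materializes.  */
static v16qi
vec_all_ones (void)
{
  v16qi zero = { 0 };
  return ~zero;
}

/* Lane-wise absolute value, standing in for pabsb.  */
static v16qi
vec_abs (v16qi v)
{
  for (int i = 0; i < 16; i++)
    if (v[i] < 0)
      v[i] = (int8_t) -v[i];
  return v;
}

/* Every byte equal to 1: all-ones followed by an absolute value.  */
static v16qi
synth_bytes_1 (void)
{
  return vec_abs (vec_all_ones ());
}

/* Every byte equal to -2: all-ones added to itself.  */
static v16qi
synth_bytes_minus_2 (void)
{
  v16qi m1 = vec_all_ones ();
  return m1 + m1;
}

/* Return 1 if every lane of V equals EXPECTED, else 0.  */
static int
vec_all_lanes_equal (v16qi v, int8_t expected)
{
  for (int i = 0; i < 16; i++)
    if (v[i] != expected)
      return 0;
  return 1;
}
```

The same reasoning extends to the shift-based cases mentioned in the review, since a logical shift of an all-ones element yields another constant that needs no constant-pool load.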

Re: [PATCH] hwasan: Check if Intel LAM_U57 is enabled

2024-01-17 Thread Hongtao Liu
On Wed, Jan 10, 2024 at 12:47 AM H.J. Lu  wrote:
>
> When -fsanitize=hwaddress is used, libhwasan will try to enable LAM_U57
> in the startup code.  Update the target check to enable hwaddress tests
> if LAM_U57 is enabled.  Also compile hwaddress tests with -mlam=u57 on
> x86-64 since hwasan requires LAM_U57 on x86-64.
I've tested it on a LAM-enabled SRF machine, and it passed all hwasan
testcases except the ones below:

FAIL: c-c++-common/hwasan/alloca-outside-caught.c   -O0  output pattern test
FAIL: c-c++-common/hwasan/hwasan-poison-optimisation.c   -O1
scan-assembler-times bl
s*__hwasan_tag_mismatch4 1
FAIL: c-c++-common/hwasan/hwasan-poison-optimisation.c   -O2
scan-assembler-times bl
s*__hwasan_tag_mismatch4 1
FAIL: c-c++-common/hwasan/hwasan-poison-optimisation.c   -O3 -g
scan-assembler-times bl
s*__hwasan_tag_mismatch4 1
FAIL: c-c++-common/hwasan/hwasan-poison-optimisation.c   -Os
scan-assembler-times bl
s*__hwasan_tag_mismatch4 1
FAIL: c-c++-common/hwasan/hwasan-poison-optimisation.c   -O2 -flto
-fno-use-linker-plugin -flto-partition=none   scan-assembler-times bl
s*__hwasan_tag_mismatch4 1
FAIL: c-c++-common/hwasan/hwasan-poison-optimisation.c   -O2 -flto
-fuse-linker-plugin -fno-fat-lto-objects   scan-assembler-times bl
s*__hwasan_tag_mismatch4 1
FAIL: c-c++-common/hwasan/vararray-outside-caught.c   -O0  output pattern test

Basically they're testcase issues; the testcases need to be adjusted
for x86. I'll commit a separate patch for those after this commit is
upstream.
I've also tested the patch on platforms without LAM support; all
hwasan testcases show as unsupported.
So the patch LGTM.

>
> * lib/hwasan-dg.exp (check_effective_target_hwaddress_exec):
> Return 1 if Intel LAM_U57 is enabled.
> (hwasan_init): Add -mlam=u57 on x86-64.
> ---
>  gcc/testsuite/lib/hwasan-dg.exp | 25 ++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/testsuite/lib/hwasan-dg.exp b/gcc/testsuite/lib/hwasan-dg.exp
> index e9c5ef6524d..76057502ee6 100644
> --- a/gcc/testsuite/lib/hwasan-dg.exp
> +++ b/gcc/testsuite/lib/hwasan-dg.exp
> @@ -44,11 +44,25 @@ proc check_effective_target_hwaddress_exec {} {
> #ifdef __cplusplus
> extern "C" {
> #endif
> +   extern int arch_prctl (int, unsigned long int *);
> extern int prctl(int, unsigned long, unsigned long, unsigned long, 
> unsigned long);
> #ifdef __cplusplus
> }
> #endif
> int main (void) {
> +   #ifdef __x86_64__
> +   # ifdef __LP64__
> +   #  define ARCH_GET_UNTAG_MASK 0x4001
> +   #  define LAM_U57_MASK (0x3fULL << 57)
> + unsigned long mask = 0;
> + if (arch_prctl(ARCH_GET_UNTAG_MASK, &mask) != 0)
> +   return 1;
> + if (mask != ~LAM_U57_MASK)
> +   return 1;
> + return 0;
> +   # endif
> + return 1;
> +   #else
> #define PR_SET_TAGGED_ADDR_CTRL 55
> #define PR_GET_TAGGED_ADDR_CTRL 56
> #define PR_TAGGED_ADDR_ENABLE (1UL << 0)
> @@ -58,6 +72,7 @@ proc check_effective_target_hwaddress_exec {} {
>   || !prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0))
> return 1;
>   return 0;
> +   #endif
> }
>  }] {
> return 0;
> @@ -102,6 +117,10 @@ proc hwasan_init { args } {
>
>  setenv HWASAN_OPTIONS "random_tags=0"
>
> +if [istarget x86_64-*-*] {
> +  set target_hwasan_flags "-mlam=u57"
> +}
> +
>  set link_flags ""
>  if ![is_remote host] {
> if [info exists TOOL_OPTIONS] {
> @@ -119,12 +138,12 @@ proc hwasan_init { args } {
>  if [info exists ALWAYS_CXXFLAGS] {
> set hwasan_saved_ALWAYS_CXXFLAGS $ALWAYS_CXXFLAGS
> set ALWAYS_CXXFLAGS [concat "{ldflags=$link_flags}" $ALWAYS_CXXFLAGS]
> -   set ALWAYS_CXXFLAGS [concat "{additional_flags=-fsanitize=hwaddress 
> --param hwasan-random-frame-tag=0 -g $include_flags}" $ALWAYS_CXXFLAGS]
> +   set ALWAYS_CXXFLAGS [concat "{additional_flags=-fsanitize=hwaddress 
> $target_hwasan_flags --param hwasan-random-frame-tag=0 -g $include_flags}" 
> $ALWAYS_CXXFLAGS]
>  } else {
> if [info exists TEST_ALWAYS_FLAGS] {
> -   set TEST_ALWAYS_FLAGS "$link_flags -fsanitize=hwaddress --param 
> hwasan-random-frame-tag=0 -g $include_flags $TEST_ALWAYS_FLAGS"
> +   set TEST_ALWAYS_FLAGS "$link_flags -fsanitize=hwaddress 
> $target_hwasan_flags --param hwasan-random-frame-tag=0 -g $include_flags 
> $TEST_ALWAYS_FLAGS"
> } else {
> -   set TEST_ALWAYS_FLAGS "$link_flags -fsanitize=hwaddress --param 
> hwasan-random-frame-tag=0 -g $include_flags"
> +   set TEST_ALWAYS_FLAGS "$link_flags -fsanitize=hwaddress 
> $target_hwasan_flags --param hwasan-random-frame-tag=0 -g $include_flags"
> }
>  }
>  }
> --
> 2.43.0
>


-- 
BR,
Hongtao
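
The target check in the patch compares the untag mask against ~LAM_U57_MASK, where LAM_U57_MASK is 0x3f shifted left by 57, that is, bits 57 to 62 of the address.  A minimal sketch of what that mask means for a tagged address follows; the helper names are hypothetical and the code is plain arithmetic, not a call into the kernel interface.

```c
#include <stdint.h>

/* Same definition as in the patch: six tag bits at positions 57..62.  */
#define LAM_U57_MASK (0x3fULL << 57)

/* Address with the tag bits cleared.  */
static uint64_t
lam_u57_untag (uint64_t addr)
{
  return addr & ~LAM_U57_MASK;
}

/* The 6-bit tag stored in the address.  */
static unsigned
lam_u57_tag (uint64_t addr)
{
  return (unsigned) ((addr & LAM_U57_MASK) >> 57);
}

/* Address with TAG (truncated to 6 bits) placed in the tag bits.  */
static uint64_t
lam_u57_with_tag (uint64_t addr, unsigned tag)
{
  return lam_u57_untag (addr) | ((uint64_t) (tag & 0x3fu) << 57);
}
```

The untag mask the test expects from arch_prctl is therefore 0x81ffffffffffffff, which keeps bit 63 and bits 0 to 56.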


Re: [PATCH 1/2] x86: Add no_callee_saved_registers function attribute

2024-01-21 Thread Hongtao Liu
On Sat, Jan 20, 2024 at 10:30 PM H.J. Lu  wrote:
>
> When an interrupt handler is implemented by an assembly stub which does:
>
> 1. Save all registers.
> 2. Call a C function.
> 3. Restore all registers.
> 4. Return from interrupt.
>
> it is completely unnecessary to save and restore any registers in the C
> function called by the assembly stub, even if they would normally be
> callee-saved.
>
> Add no_callee_saved_registers function attribute, which is complementary
> to no_caller_saved_registers function attribute, to mark a function which
> doesn't have any callee-saved registers.  Such a function won't save and
> restore any registers.  Classify function call-saved register handling
> type with:
>
> 1. Default call-saved registers.
> 2. No caller-saved registers with no_caller_saved_registers attribute.
> 3. No callee-saved registers with no_callee_saved_registers attribute.
>
> Disallow sibcall if callee is a no_callee_saved_registers function
> and caller isn't a no_callee_saved_registers function.  Otherwise,
> callee-saved registers won't be preserved.
>
> After a no_callee_saved_registers function is called, all registers may
> be clobbered.  If the calling function isn't a no_callee_saved_registers
> function, we need to preserve all registers which aren't used by function
> calls.
>
> gcc/
>
> PR target/103503
> PR target/113312
> * config/i386/i386-expand.cc (ix86_expand_call): Set
> call_no_callee_saved_registers to true when calling function
> with no_callee_saved_registers attribute.  Replace
> no_caller_saved_registers check with call_saved_registers check.
> * config/i386/i386-options.cc (ix86_set_func_type): Set
> call_saved_registers to TYPE_NO_CALLEE_SAVED_REGISTERS for
> noreturn function.  Disallow no_callee_saved_registers with
> interrupt or no_caller_saved_registers attributes together.
> (ix86_set_current_function): Replace no_caller_saved_registers
> check with call_saved_registers check.
> (ix86_handle_no_caller_saved_registers_attribute): Renamed to ...
> (ix86_handle_call_saved_registers_attribute): This.
> (ix86_gnu_attributes): Add
> ix86_handle_call_saved_registers_attribute.
> * config/i386/i386.cc (ix86_conditional_register_usage): Replace
> no_caller_saved_registers check with call_saved_registers check.
> (ix86_function_ok_for_sibcall): Don't allow callee with
> no_callee_saved_registers attribute when the calling function
> has callee-saved registers.
> (ix86_comp_type_attributes): Also check
> no_callee_saved_registers.
> (ix86_epilogue_uses): Replace no_caller_saved_registers check
> with call_saved_registers check.
> (ix86_hard_regno_scratch_ok): Likewise.
> (ix86_save_reg): Replace no_caller_saved_registers check with
> call_saved_registers check.  Don't save any registers for
> TYPE_NO_CALLEE_SAVED_REGISTERS.  Save all registers with
> TYPE_DEFAULT_CALL_SAVED_REGISTERS if function with
> no_callee_saved_registers attribute is called.
> (find_drap_reg): Replace no_caller_saved_registers check with
> call_saved_registers check.
> * config/i386/i386.h (call_saved_registers_type): New enum.
> (machine_function): Replace no_caller_saved_registers with
> call_saved_registers.  Add call_no_callee_saved_registers.
> * doc/extend.texi: Document no_callee_saved_registers attribute.
>
> gcc/testsuite/
>
> PR target/103503
> PR target/113312
> * gcc.dg/torture/no-callee-saved-run-1a.c: New file.
> * gcc.dg/torture/no-callee-saved-run-1b.c: Likewise.
> * gcc.target/i386/no-callee-saved-1.c: Likewise.
> * gcc.target/i386/no-callee-saved-2.c: Likewise.
> * gcc.target/i386/no-callee-saved-3.c: Likewise.
> * gcc.target/i386/no-callee-saved-4.c: Likewise.
> * gcc.target/i386/no-callee-saved-5.c: Likewise.
> * gcc.target/i386/no-callee-saved-6.c: Likewise.
> * gcc.target/i386/no-callee-saved-7.c: Likewise.
> * gcc.target/i386/no-callee-saved-8.c: Likewise.
> * gcc.target/i386/no-callee-saved-9.c: Likewise.
> * gcc.target/i386/no-callee-saved-10.c: Likewise.
> * gcc.target/i386/no-callee-saved-11.c: Likewise.
> * gcc.target/i386/no-callee-saved-12.c: Likewise.
> * gcc.target/i386/no-callee-saved-13.c: Likewise.
> * gcc.target/i386/no-callee-saved-14.c: Likewise.
> * gcc.target/i386/no-callee-saved-15.c: Likewise.
> * gcc.target/i386/no-callee-saved-16.c: Likewise.
> * gcc.target/i386/no-callee-saved-17.c: Likewise.
> * gcc.target/i386/no-callee-saved-18.c: Likewise.
> ---
>  gcc/config/i386/i386-expand.cc| 72 ---
>  gcc/config/i386/i386-options.cc   | 49 +
>  gcc/c

Re: [PATCH] i386: Modify testcases failed under -DDEBUG

2024-01-24 Thread Hongtao Liu
On Mon, Jan 22, 2024 at 10:31 AM Haochen Jiang  wrote:
>
> Hi all,
>
> Recently, I happened to run i386.exp under -DDEBUG and found some failures.
>
> This patch aims to fix that. Ok for trunk?
OK.
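
The header changes in this patch follow a simple pattern, sketched below under stated assumptions: the helper name check_equal and its message text are made up for illustration and are not taken from the testsuite. The only point is that stdio.h is pulled in when DEBUG is defined, so a printf behind the same guard has a declaration in scope.

```c
/* Sketch only: check_equal is a hypothetical helper, not a testsuite
   function.  */
#ifdef DEBUG
#include <stdio.h>   /* needed only for the diagnostic printf below */
#endif

/* Return 1 when the values match and 0 otherwise; print the mismatch
   only when DEBUG is defined.  */
static int
check_equal (int got, int want)
{
  if (got != want)
    {
#ifdef DEBUG
      printf ("mismatch: got %d, want %d\n", got, want);
#endif
      return 0;
    }
  return 1;
}
```

Built without -DDEBUG the helper stays silent and needs no stdio.h; built with -DDEBUG the include and the printf appear together.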
>
> Thx,
> Haochen
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/adx-check.h: Include stdio.h when DEBUG
> is defined.
> * gcc.target/i386/avx512fp16-vscalefph-1b.c: Do not define
> DEBUG.
> * gcc.target/i386/avx512fp16vl-vaddph-1b.c: Ditto.
> * gcc.target/i386/avx512fp16vl-vcmpph-1b.c: Ditto.
> * gcc.target/i386/avx512fp16vl-vdivph-1b.c: Ditto.
> * gcc.target/i386/avx512fp16vl-vfpclassph-1b.c: Ditto.
> * gcc.target/i386/avx512fp16vl-vgetexpph-1b.c: Ditto.
> * gcc.target/i386/avx512fp16vl-vgetmantph-1b.c: Ditto.
> * gcc.target/i386/avx512fp16vl-vmaxph-1b.c: Ditto.
> * gcc.target/i386/avx512fp16vl-vminph-1b.c: Ditto.
> * gcc.target/i386/avx512fp16vl-vmulph-1b.c: Ditto.
> * gcc.target/i386/avx512fp16vl-vrcpph-1b.c: Ditto.
> * gcc.target/i386/avx512fp16vl-vreduceph-1b.c: Ditto.
> * gcc.target/i386/avx512fp16vl-vrndscaleph-1b.c: Ditto.
> * gcc.target/i386/avx512fp16vl-vrsqrtph-1b.c: Ditto.
> * gcc.target/i386/avx512fp16vl-vscalefph-1b.c: Ditto.
> * gcc.target/i386/avx512fp16vl-vsqrtph-1b.c: Ditto.
> * gcc.target/i386/avx512fp16vl-vsubph-1b.c: Ditto.
> * gcc.target/i386/readeflags-1.c: Include stdio.h when DEBUG
> is defined.
> * gcc.target/i386/rtm-check.h: Ditto.
> * gcc.target/i386/sha-check.h: Ditto.
> * gcc.target/i386/writeeflags-1.c: Ditto.
> ---
>  gcc/testsuite/gcc.target/i386/adx-check.h   | 3 +++
>  gcc/testsuite/gcc.target/i386/avx512fp16-vscalefph-1b.c | 3 ---
>  gcc/testsuite/gcc.target/i386/avx512fp16vl-vaddph-1b.c  | 1 -
>  gcc/testsuite/gcc.target/i386/avx512fp16vl-vcmpph-1b.c  | 1 -
>  gcc/testsuite/gcc.target/i386/avx512fp16vl-vdivph-1b.c  | 1 -
>  gcc/testsuite/gcc.target/i386/avx512fp16vl-vfpclassph-1b.c  | 1 -
>  gcc/testsuite/gcc.target/i386/avx512fp16vl-vgetexpph-1b.c   | 1 -
>  gcc/testsuite/gcc.target/i386/avx512fp16vl-vgetmantph-1b.c  | 1 -
>  gcc/testsuite/gcc.target/i386/avx512fp16vl-vmaxph-1b.c  | 1 -
>  gcc/testsuite/gcc.target/i386/avx512fp16vl-vminph-1b.c  | 1 -
>  gcc/testsuite/gcc.target/i386/avx512fp16vl-vmulph-1b.c  | 1 -
>  gcc/testsuite/gcc.target/i386/avx512fp16vl-vrcpph-1b.c  | 1 -
>  gcc/testsuite/gcc.target/i386/avx512fp16vl-vreduceph-1b.c   | 1 -
>  gcc/testsuite/gcc.target/i386/avx512fp16vl-vrndscaleph-1b.c | 1 -
>  gcc/testsuite/gcc.target/i386/avx512fp16vl-vrsqrtph-1b.c    | 1 -
>  gcc/testsuite/gcc.target/i386/avx512fp16vl-vscalefph-1b.c   | 1 -
>  gcc/testsuite/gcc.target/i386/avx512fp16vl-vsqrtph-1b.c | 1 -
>  gcc/testsuite/gcc.target/i386/avx512fp16vl-vsubph-1b.c  | 1 -
>  gcc/testsuite/gcc.target/i386/readeflags-1.c| 3 +++
>  gcc/testsuite/gcc.target/i386/rtm-check.h   | 3 +++
>  gcc/testsuite/gcc.target/i386/sha-check.h   | 3 +++
>  gcc/testsuite/gcc.target/i386/writeeflags-1.c   | 3 +++
>  22 files changed, 15 insertions(+), 19 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.target/i386/adx-check.h 
> b/gcc/testsuite/gcc.target/i386/adx-check.h
> index cfed1a38483..45435b91d0e 100644
> --- a/gcc/testsuite/gcc.target/i386/adx-check.h
> +++ b/gcc/testsuite/gcc.target/i386/adx-check.h
> @@ -1,5 +1,8 @@
>  #include 
>  #include "cpuid.h"
> +#ifdef DEBUG
> +#include <stdio.h>
> +#endif
>
>  static void adx_test (void);
>
> diff --git a/gcc/testsuite/gcc.target/i386/avx512fp16-vscalefph-1b.c 
> b/gcc/testsuite/gcc.target/i386/avx512fp16-vscalefph-1b.c
> index 7c7288d6eb3..0ba9ec57f37 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512fp16-vscalefph-1b.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512fp16-vscalefph-1b.c
> @@ -1,9 +1,6 @@
>  /* { dg-do run { target avx512fp16 } } */
>  /* { dg-options "-O2 -mavx512fp16 -mavx512dq" } */
>
> -
> -#define DEBUG
> -
>  #define AVX512FP16
>  #include "avx512fp16-helper.h"
>
> diff --git a/gcc/testsuite/gcc.target/i386/avx512fp16vl-vaddph-1b.c 
> b/gcc/testsuite/gcc.target/i386/avx512fp16vl-vaddph-1b.c
> index fcf6a9058f5..1db7c565262 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512fp16vl-vaddph-1b.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512fp16vl-vaddph-1b.c
> @@ -1,7 +1,6 @@
>  /* { dg-do run { target avx512fp16 } } */
>  /* { dg-options "-O2 -mavx512fp16 -mavx512vl -mavx512dq" } */
>
> -#define DEBUG
>  #define AVX512VL
>  #define AVX512F_LEN 256
>  #define AVX512F_LEN_HALF 128
> diff --git a/gcc/testsuite/gcc.target/i386/avx512fp16vl-vcmpph-1b.c 
> b/gcc/testsuite/gcc.target/i386/avx512fp16vl-vcmpph-1b.c
> index c201a9258bf..bbd366a5d29 100644
> --- a/gcc/testsuite/gcc.target/i386/avx512fp16vl-vcmpph-1b.c
> +++ b/gcc/testsuite/gcc.target/i386/avx512fp16vl-vcmpph-1b.c
> @@ -1,7 +1,6 @@
>  /* { dg-do run { target avx512f

Re: [PATCH v3 0/2] x86: Don't save callee-saved registers if not needed

2024-01-24 Thread Hongtao Liu
On Tue, Jan 23, 2024 at 11:00 PM H.J. Lu  wrote:
>
> Changes in v3:
>
> 1. Rebase against commit 02e68389494
> 2. Don't add call_no_callee_saved_registers to machine_function since
> all callee-saved registers are properly clobbered by callee with
> no_callee_saved_registers attribute.
>
The patch LGTM. It should be low risk since there's already a
no_caller_saved_registers attribute; the patch just extends that to
no_callee_saved_registers with the same approach.
So if there are no objections (or concerns) in the next couple of days,
I'm OK with the patch going into GCC 14 and being backported.

> Changes in v2:
>
> 1. Rebase against commit f9df00340e3
> 2. Don't add redundant clobbered_registers check in ix86_expand_call.
>
> In some cases, there is no need to save callee-saved registers:
>
> 1. If a noreturn function doesn't throw nor support exceptions, it can
> skip saving callee-saved registers.
>
> 2. When an interrupt handler is implemented by an assembly stub which does:
>
>   1. Save all registers.
>   2. Call a C function.
>   3. Restore all registers.
>   4. Return from interrupt.
>
> it is completely unnecessary to save and restore any registers in the C
> function called by the assembly stub, even if they would normally be
> callee-saved.
>
> This patch set adds no_callee_saved_registers function attribute, which
> is complementary to no_caller_saved_registers function attribute, to
> classify x86 backend call-saved register handling type with
>
>   1. Default call-saved registers.
>   2. No caller-saved registers with no_caller_saved_registers attribute.
>   3. No callee-saved registers with no_callee_saved_registers attribute.
>
> Functions of the no callee-saved registers type won't save callee-saved
> registers.  If a noreturn function neither throws nor supports exceptions,
> it is classified as the no callee-saved registers type.
>
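
A minimal sketch of case 2 above, assuming a compiler that accepts the attribute described in this patch set. The names NO_CALLEE_SAVED, tick_count, timer_irq_body and timer_ticks are hypothetical and exist only for illustration; the macro expands to nothing on compilers that do not know the attribute, so the sketch still builds there.

```c
#include <stdint.h>

/* Use the attribute only where the compiler knows it; expand to nothing
   elsewhere.  NO_CALLEE_SAVED is a hypothetical convenience macro.  */
#if defined(__has_attribute)
# if __has_attribute(no_callee_saved_registers)
#  define NO_CALLEE_SAVED __attribute__((no_callee_saved_registers))
# endif
#endif
#ifndef NO_CALLEE_SAVED
# define NO_CALLEE_SAVED
#endif

/* Hypothetical state touched by the handler.  */
static volatile uint32_t tick_count;

/* C body of an interrupt handler.  In the scenario from the message an
   assembly stub has already saved every register, so this function has
   no reason to save and restore callee-saved registers itself.  */
NO_CALLEE_SAVED void
timer_irq_body (void)
{
  tick_count = tick_count + 1;
}

/* Ordinary function used to observe the handler's effect.  */
uint32_t
timer_ticks (void)
{
  return tick_count;
}
```

The stub itself (save all registers, call timer_irq_body, restore, return from interrupt) would be written in assembly and is not shown here.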
> With these changes, __libc_start_main in glibc 2.39, which is a noreturn
> function, is changed from
>
> __libc_start_main:
>         endbr64
>         push   %r15
>         push   %r14
>         mov    %rcx,%r14
>         push   %r13
>         push   %r12
>         push   %rbp
>         mov    %esi,%ebp
>         push   %rbx
>         mov    %rdx,%rbx
>         sub    $0x28,%rsp
>         mov    %rdi,(%rsp)
>         mov    %fs:0x28,%rax
>         mov    %rax,0x18(%rsp)
>         xor    %eax,%eax
>         test   %r9,%r9
>
> to
>
> __libc_start_main:
>         endbr64
>         sub    $0x28,%rsp
>         mov    %esi,%ebp
>         mov    %rdx,%rbx
>         mov    %rcx,%r14
>         mov    %rdi,(%rsp)
>         mov    %fs:0x28,%rax
>         mov    %rax,0x18(%rsp)
>         xor    %eax,%eax
>         test   %r9,%r9
>
> In Linux kernel 6.7.0 on x86-64, do_exit is changed from
>
> do_exit:
>         endbr64
>         call
>         push   %r15
>         push   %r14
>         push   %r13
>         push   %r12
>         mov    %rdi,%r12
>         push   %rbp
>         push   %rbx
>         mov    %gs:0x0,%rbx
>         sub    $0x28,%rsp
>         mov    %gs:0x28,%rax
>         mov    %rax,0x20(%rsp)
>         xor    %eax,%eax
>         call   *0x0(%rip)        #
>         test   $0x2,%ah
>         je
>
> to
>
> do_exit:
>         endbr64
>         call
>         sub    $0x28,%rsp
>         mov    %rdi,%r12
>         mov    %gs:0x28,%rax
>         mov    %rax,0x20(%rsp)
>         xor    %eax,%eax
>         mov    %gs:0x0,%rbx
>         call   *0x0(%rip)        #
>         test   $0x2,%ah
>         je
>
> I compared GCC master branch bootstrap and test times on a slow machine
> with 6.6 Linux kernels compiled with the original GCC 13 and the GCC 13
> with the backported patch.  The performance data isn't precise since the
> measurements were done on different days with different GCC sources under
> different 6.6 kernel versions.
>
> GCC master branch build time in seconds:
>
> before          after           improvement
> 30043.75user    30013.16user    0%
> 1274.85system   1243.72system   2.4%
>
> GCC master branch test time in seconds (new tests added):
>
> before          after           improvement
> 216035.90user   216547.51user   0
> 27365.51system  26658.54system  2.6%
>
> Backported to GCC 13 to rebuild system glibc and kernel on Fedora 39.
> Systems perform normally.
>
>
> H.J. Lu (2):
>   x86: Add no_callee_saved_registers function attribute
>   x86: Don't save callee-saved registers in noreturn functions
>
>  gcc/config/i386/i386-expand.cc| 52 +---
>  gcc/config/i386/i386-options.cc   | 61 +++
>  gcc/config/i386/i386.cc   | 57 +
>  gcc/config/i386/i386.h| 16 -
>  gcc/doc/extend.texi   |  8 +++
>  .../gcc.dg/torture/no-callee-saved-run-1a.c   | 23 +++
>  .../gcc.dg/torture/no-callee-saved-run-1b.c   | 59 ++
>  .../gcc.target/i386

Re: [x86 PATCH] PR target/106060: Improved SSE vector constant materialization.

2024-01-25 Thread Hongtao Liu
On Fri, Jan 26, 2024 at 3:03 AM Roger Sayle  wrote:
>
>
> Hi Hongtao,
> Many thanks for the review.  Here's a revised version of my patch
> that addresses (most of) the issues you've raised.  Firstly the
> handling of zero and all_ones in this function is mostly for
> completeness/documentation, these standard_sse_constant_p
> values are (currently/normally) handled elsewhere.  But I have
> added an "n_var == 0" optimization to ix86_expand_vector_init.
>
> As you've suggested I've added explicit TARGET_SSE2 tests where
> required, and for consistency I've also added support for AVX512's
> V16SImode.
>
> As you've predicted, the eventual goal is to move this after combine
> (or reload) using define_insn_and_split, but that requires a significant
> restructuring that should be done in steps.  This also interacts with
> a similar planned reorganization of TImode constant handling.  If
> all 128-bit (vector) constants are acceptable before combine, then
> STV has the freedom to choose V1TImode (and this broadcast
> functionality) to implement TImode operations on immediate
> constants.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline (in stage 1)?
Ok, thanks for handling this.
>
>
> 2024-01-25  Roger Sayle  
> Hongtao Liu  
>
> gcc/ChangeLog
> PR target/106060
> * config/i386/i386-expand.cc (enum ix86_vec_bcast_alg): New.
> (struct ix86_vec_bcast_map_simode_t): New type for table below.
> (ix86_vec_bcast_map_simode): Table of SImode constants that may
> be efficiently synthesized by a ix86_vec_bcast_alg method.
> (ix86_vec_bcast_map_simode_cmp): New comparator for bsearch.
> (ix86_vector_duplicate_simode_const): Efficiently synthesize
> V4SImode and V8SImode constants that duplicate special constants.
> (ix86_vector_duplicate_value): Attempt to synthesize "special"
> vector constants using ix86_vector_duplicate_simode_const.
> * config/i386/i386.cc (ix86_rtx_costs) : ABS of a
> vector integer mode costs with a single SSE instruction.
>
> gcc/testsuite/ChangeLog
> PR target/106060
> * gcc.target/i386/auto-init-8.c: Update test case.
> * gcc.target/i386/avx512fp16-3.c: Likewise.
> * gcc.target/i386/pr100865-9a.c: Likewise.
> * gcc.target/i386/pr101796-1.c: Likewise.
> * gcc.target/i386/pr106060-1.c: New test case.
> * gcc.target/i386/pr106060-2.c: Likewise.
> * gcc.target/i386/pr106060-3.c: Likewise.
> * gcc.target/i386/pr70314.c: Update test case.
> * gcc.target/i386/vect-shiftv4qi.c: Likewise.
> * gcc.target/i386/vect-shiftv8qi.c: Likewise.
>
>
> Roger
> --
>
> > -Original Message-
> > From: Hongtao Liu 
> > Sent: 17 January 2024 03:13
> > To: Roger Sayle 
> > Cc: gcc-patches@gcc.gnu.org; Uros Bizjak 
> > Subject: Re: [x86 PATCH] PR target/106060: Improved SSE vector constant
> > materialization.
> >
> > On Wed, Jan 17, 2024 at 5:59 AM Roger Sayle 
> > wrote:
> > >
> > >
> > > I thought I'd just missed the bug fixing season of stage3, but there
> > > appears to be a little latitude in early stage4 (for vector patches), so
> > > I'll post this now.
> > >
> > > This patch resolves PR target/106060 by providing efficient methods
> > > for materializing/synthesizing special "vector" constants on x86.
> > > Currently there are three methods of materializing a vector constant;
> > > the most general is to load a vector from the constant pool, secondly
> > > "duplicated" constants can be synthesized by moving an integer between
> > > units and broadcasting (or shuffling it), and finally the special cases of the
> > > all-zeros vector and all-ones vectors can be loaded via a single SSE
> > > instruction.   This patch handles additional cases that can be synthesized
> > > in two instructions, loading an all-ones vector followed by another
> > > SSE instruction.  Following my recent patch for PR target/112992,
> > > there's conveniently a single place in i386-expand.cc where these
> > > special cases can be handled.
> > >
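
To make the "all-ones vector followed by another SSE instruction" idea concrete, here is a scalar model of a single 32-bit lane. It is a sketch under assumptions: the function names are hypothetical, the two follow-up operations (a logical right shift and a per-byte absolute value) are chosen for illustration, and the patch's own table of algorithms is not reproduced here.

```c
#include <stdint.h>

/* One 32-bit lane of the all-ones vector, i.e. what comparing a register
   with itself for equality leaves in every lane.  */
static uint32_t
lane_all_ones (void)
{
  return UINT32_MAX;
}

/* Model of a per-lane logical right shift; COUNT must be below 32.  */
static uint32_t
lane_shift_right (uint32_t lane, unsigned count)
{
  return lane >> count;
}

/* Model of a per-byte absolute value applied to the four bytes of a
   lane, each byte being treated as a signed 8-bit integer.  */
static uint32_t
lane_abs_bytes (uint32_t lane)
{
  uint32_t result = 0;
  for (int i = 0; i < 4; i++)
    {
      int byte = (int) ((lane >> (8 * i)) & 0xff);
      if (byte >= 0x80)
        byte -= 0x100;          /* reinterpret as signed 8-bit */
      int magnitude = byte < 0 ? -byte : byte;
      result |= (uint32_t) (magnitude & 0xff) << (8 * i);
    }
  return result;
}
```

In this model, shifting the all-ones lane right by 31 leaves the 32-bit constant 1, and taking the per-byte absolute value leaves 0x01 in every byte, which is the bit pattern of the _mm256_set1_epi8 (1) example that follows.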
> > > Two examples are given in the original bugzilla PR for 106060.
> > >
> > > __m256i
> > > should_be_cmpeq_abs ()
> > > {
> > >   return _mm256_set1_epi8 (1);
> > > }
> > >
> > > 

Re: [PATCH] [ICE] Support vpcmov for V4HF/V4BF/V2HF/V2BF under TARGET_XOP.

2023-12-13 Thread Hongtao Liu
On Wed, Dec 13, 2023 at 7:59 PM Jakub Jelinek  wrote:
>
> On Fri, Dec 08, 2023 at 03:12:00PM +0800, liuhongt wrote:
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ready to push to trunk.
> >
> > gcc/ChangeLog:
> >
> >   PR target/112904
> >   * config/i386/mmx.md (*xop_pcmov_): New define_insn.
> >
> > gcc/testsuite/ChangeLog:
> >
> >   * g++.target/i386/pr112904.C: New test.
>
> The new test FAILs on i686-linux and even on x86_64-linux I think
> it doesn't actually test what was reported, unless one performs testing
> with -march= for some XOP enabled CPU or -mxop.
>
> The following patch fixes that, tested on x86_64-linux with
> make check-g++ RUNTESTFLAGS='--target_board=unix\{-m32,-m32/-mno-sse/-mno-mmx,-m64\} i386.exp=pr112904.C'
> Ok for trunk?

Ok.
Sorry for the inconvenience, I must have missed something in my tester.

>
> 2023-12-13  Jakub Jelinek  
>
> * g++.target/i386/pr112904.C: Add dg-do compile, dg-options -mxop
> and for ia32 also dg-additional-options -mmmx.
>
> --- gcc/testsuite/g++.target/i386/pr112904.C.jj 2023-12-11 08:31:59.001938798 +0100
> +++ gcc/testsuite/g++.target/i386/pr112904.C    2023-12-13 12:54:50.318521637 +0100
> @@ -1,3 +1,8 @@
> +// PR target/112904
> +// { dg-do compile }
> +// { dg-options "-mxop" }
> +// { dg-additional-options "-mmmx" { target ia32 } }
> +
>  typedef _Float16 v4hf __attribute__((vector_size(8)));
>  typedef short v4hi __attribute__((vector_size(8)));
>  typedef _Float16 v2hf __attribute__((vector_size(4)));
>
>
> Jakub
>


-- 
BR,
Hongtao


Re: [PATCH] i386: Remove RAO-INT from Grand Ridge

2023-12-14 Thread Hongtao Liu
On Thu, Dec 14, 2023 at 10:55 AM Haochen Jiang  wrote:
>
> Hi all,
>
> According to ISE050 published at the end of September, RAO-INT will not
> be in Grand Ridge anymore. This patch aims to remove it.
>
> The documentation is available at:
>
> https://cdrdv2.intel.com/v1/dl/getContent/671368
>
> Regtested on x86_64-pc-linux-gnu. Ok for trunk and backport to GCC13?
Ok.
>
> Thx,
> Haochen
>
> gcc/ChangeLog:
>
> * config/i386/driver-i386.cc (host_detect_local_cpu): Do not
> set Grand Ridge depending on RAO-INT.
> * config/i386/i386.h: Remove PTA_RAOINT from PTA_GRANDRIDGE.
> * doc/invoke.texi: Adjust documentation.
> ---
>  gcc/config/i386/driver-i386.cc | 3 ---
>  gcc/config/i386/i386.h | 2 +-
>  gcc/doc/invoke.texi| 4 ++--
>  3 files changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/gcc/config/i386/driver-i386.cc b/gcc/config/i386/driver-i386.cc
> index 0cfb2884d65..3342e550f2a 100644
> --- a/gcc/config/i386/driver-i386.cc
> +++ b/gcc/config/i386/driver-i386.cc
> @@ -665,9 +665,6 @@ const char *host_detect_local_cpu (int argc, const char **argv)
>   /* Assume Arrow Lake S.  */
>   else if (has_feature (FEATURE_SM3))
> cpu = "arrowlake-s";
> - /* Assume Grand Ridge.  */
> - else if (has_feature (FEATURE_RAOINT))
> -   cpu = "grandridge";
>   /* Assume Sierra Forest.  */
>   else if (has_feature (FEATURE_AVXVNNIINT8))
> cpu = "sierraforest";
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index 47340c6a4ad..303baf8c921 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -2416,7 +2416,7 @@ constexpr wide_int_bitmask PTA_GRANITERAPIDS = PTA_SAPPHIRERAPIDS | PTA_AMX_FP16
>| PTA_PREFETCHI;
>  constexpr wide_int_bitmask PTA_GRANITERAPIDS_D = PTA_GRANITERAPIDS
>| PTA_AMX_COMPLEX;
> -constexpr wide_int_bitmask PTA_GRANDRIDGE = PTA_SIERRAFOREST | PTA_RAOINT;
> +constexpr wide_int_bitmask PTA_GRANDRIDGE = PTA_SIERRAFOREST;
>  constexpr wide_int_bitmask PTA_ARROWLAKE = PTA_ALDERLAKE | PTA_AVXIFMA
>| PTA_AVXVNNIINT8 | PTA_AVXNECONVERT | PTA_CMPCCXADD | PTA_UINTR;
>  constexpr wide_int_bitmask PTA_ARROWLAKE_S = PTA_ARROWLAKE | PTA_AVXVNNIINT16
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 1f26f80d26c..82dd9cdf907 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -33451,8 +33451,8 @@ SSSE3, SSE4.1, SSE4.2, POPCNT, AES, PREFETCHW, PCLMUL, RDRND, XSAVE, XSAVEC,
>  XSAVES, XSAVEOPT, FSGSBASE, PTWRITE, RDPID, SGX, GFNI-SSE, CLWB, MOVDIRI,
>  MOVDIR64B, CLDEMOTE, WAITPKG, ADCX, AVX, AVX2, BMI, BMI2, F16C, FMA, LZCNT,
>  PCONFIG, PKU, VAES, VPCLMULQDQ, SERIALIZE, HRESET, KL, WIDEKL, AVX-VNNI,
> -AVXIFMA, AVXVNNIINT8, AVXNECONVERT, CMPCCXADD, ENQCMD, UINTR and RAOINT
> -instruction set support.
> +AVXIFMA, AVXVNNIINT8, AVXNECONVERT, CMPCCXADD, ENQCMD and UINTR instruction set
> +support.
>
>  @item clearwaterforest
>  Intel Clearwater Forest CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2,
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH] i386: Sync move_max/store_max with prefer-vector-width [PR112824]

2023-12-14 Thread Hongtao Liu
On Thu, Dec 14, 2023 at 3:54 PM Hongyu Wang  wrote:
>
> Hi,
>
> Currently move_max follows the tuning feature first, but ideally it
> should sync with prefer-vector-width when that option is set explicitly,
> so that vector moves and vector operations use the same vector size.
>
> Bootstrapped/regtested on x86_64-pc-linux-gnu{-m32,}
>
> OK for trunk?
>
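
The selection order described above can be modelled in a few lines. This is a simplified sketch, not the GCC code: the enum, the function name choose_move_max and its parameters are stand-ins invented for illustration, and the final 128-bit fallback is an assumption, since the remaining branches of the real function are not shown in this message.

```c
/* Sketch only: simplified stand-ins, not GCC's own declarations.  */
enum vec_width { WIDTH_NONE, WIDTH_128, WIDTH_256, WIDTH_512 };

/* Pick the maximum move width.  An explicitly requested preferred vector
   width wins; otherwise the tuning flags decide.  */
static enum vec_width
choose_move_max (enum vec_width explicit_prefer, int tune_512, int tune_256)
{
  if (explicit_prefer != WIDTH_NONE)
    return explicit_prefer;     /* new: follow the explicit option first */
  if (tune_512)
    return WIDTH_512;
  if (tune_256)
    return WIDTH_256;
  return WIDTH_128;             /* assumed default for this sketch */
}
```

With this ordering, an explicit 256-bit preference yields 256-bit moves even on a target whose tuning would otherwise pick 512-bit moves; the same reasoning applies to store_max.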
> gcc/ChangeLog:
>
> PR target/112824
> * config/i386/i386-options.cc (ix86_option_override_internal):
> Sync ix86_move_max/ix86_store_max with prefer_vector_width when
> it is explicitly set.
>
> gcc/testsuite/ChangeLog:
>
> PR target/112824
> * gcc.target/i386/pieces-memset-45.c: Remove
> -mprefer-vector-width=256.
> * g++.target/i386/pr112824-1.C: New test.
> ---
>  gcc/config/i386/i386-options.cc   |   8 +-
>  gcc/testsuite/g++.target/i386/pr112824-1.C| 113 ++
>  .../gcc.target/i386/pieces-memset-45.c|   2 +-
>  3 files changed, 120 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/g++.target/i386/pr112824-1.C
>
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index 588a0878c0d..440ef59 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -3012,7 +3012,9 @@ ix86_option_override_internal (bool main_args_p,
>  {
>/* Set the maximum number of bits can be moved from memory to
>  memory efficiently.  */
> -  if (ix86_tune_features[X86_TUNE_AVX512_MOVE_BY_PIECES])
> +  if (opts_set->x_prefer_vector_width_type != PVW_NONE)
> +   opts->x_ix86_move_max = opts->x_prefer_vector_width_type;
> +  else if (ix86_tune_features[X86_TUNE_AVX512_MOVE_BY_PIECES])
> opts->x_ix86_move_max = PVW_AVX512;
>else if (ix86_tune_features[X86_TUNE_AVX256_MOVE_BY_PIECES])
> opts->x_ix86_move_max = PVW_AVX256;
> @@ -3034,7 +3036,9 @@ ix86_option_override_internal (bool main_args_p,
>  {
>/* Set the maximum number of bits can be stored to memory
>  efficiently.  */
> -  if (ix86_tune_features[X86_TUNE_AVX512_STORE_BY_PIECES])
> +  if (opts_set->x_prefer_vector_width_type != PVW_NONE)
> +   opts->x_ix86_store_max = opts->x_prefer_vector_width_type;
> +  else if (ix86_tune_features[X86_TUNE_AVX512_STORE_BY_PIECES])
> opts->x_ix86_store_max = PVW_AVX512;
>else if (ix86_tune_features[X86_TUNE_AVX256_STORE_BY_PIECES])
> opts->x_ix86_store_max = PVW_AVX256;
> diff --git a/gcc/testsuite/g++.target/i386/pr112824-1.C 
> b/gcc/testsuite/g++.target/i386/pr112824-1.C
> new file mode 100644
> index 000..fccaf23c530
> --- /dev/null
> +++ b/gcc/testsuite/g++.target/i386/pr112824-1.C
> @@ -0,0 +1,113 @@
> +/* PR target/112824 */
> +/* { dg-do compile } */
> +/* { dg-options "-std=c++23 -O3 -march=skylake-avx512 -mprefer-vector-width=512" } */
> +/* { dg-final { scan-assembler-not "vmov(?:dqu|apd)\[ \\t\]+\[^\n\]*%ymm" } } */
> +
> +
remove empty line.
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +template 
> +using Vec [[gnu::vector_size(W * sizeof(T))]] = T;
> +
> +// Omitted: 16 without AVX, 32 without AVX512F,
> +// or for forward compatibility some AVX10 may also mean 32-only
> +static constexpr ptrdiff_t VectorBytes = 64;
> +template
> +static constexpr ptrdiff_t VecWidth = 64 <= sizeof(T) ? 1 : 64/sizeof(T);
> +
> +template  struct Vector{
> +static constexpr ptrdiff_t L = N;
> +T data[L];
> +static constexpr auto size()->ptrdiff_t{return N;}
> +};
> +template  struct Vector{
> +static constexpr ptrdiff_t W = N >= VecWidth ? VecWidth : 
> ptrdiff_t(std::bit_ceil(size_t(N)));
> +static constexpr ptrdiff_t L = (N/W) + ((N%W)!=0);
> +using V = Vec;
> +V data[L];
> +static constexpr auto size()->ptrdiff_t{return N;}
> +};
> +/// should be trivially copyable
> +/// codegen is worse when passing by value, even though it seems like it 
> should make
> +/// aliasing simpler to analyze?
> +template
> +[[gnu::always_inline]] constexpr auto operator+(Vector x, Vector 
> y) -> Vector {
> +Vector z;
> +for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x.data[n] + 
> y.data[n];
> +return z;
> +}
> +template
> +[[gnu::always_inline]] constexpr auto operator*(Vector x, Vector 
> y) -> Vector {
> +Vector z;
> +for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x.data[n] * 
> y.data[n];
> +return z;
> +}
> +template
> +[[gnu::always_inline]] constexpr auto operator+(T x, Vector y) -> 
> Vector {
> +Vector z;
> +for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x + y.data[n];
> +return z;
> +}
> +template
> +[[gnu::always_inline]] constexpr auto operator*(T x, Vector y) -> 
> Vector {
> +Vector z;
> +for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x * y.data[n];
> +return z;
> +}
> +
> +
> +
Ditto.
> +template  struct Dual {
> +  T value;
> +  Vector partials;
> +};
> +// Here we have a specialization fo

Re: [PATCH] i386: Allow 64 bit mask register for -mno-evex512

2023-12-19 Thread Hongtao Liu
On Fri, Dec 15, 2023 at 10:34 AM Haochen Jiang  wrote:
>
> Hi all,
>
> A recent change in the AVX10 documentation allows 64 bit mask register
> instructions in AVX10-256.  The documentation is as follows:
>
> Intel Advanced Vector Extensions 10 (Intel AVX10) Architecture Specification
> https://cdrdv2.intel.com/v1/dl/getContent/784267
> The Converged Vector ISA: Intel Advanced Vector Extensions 10 Technical Paper
> https://cdrdv2.intel.com/v1/dl/getContent/784343
>
> As a result, we need to allow 64 bit mask registers for -mno-evex512. This
> patch aims to do that.
>
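
As a reminder of what a 64 bit mask operation computes, here is a plain C model that needs no AVX512 support. It is a sketch under assumptions: mask64, mask64_andn and mask64_first_n are hypothetical names and are not the intrinsics from avx512bwintrin.h. The and-not helper mirrors the usual "NOT of the first operand, AND the second" meaning of a mask and-not.

```c
/* Stand-in for a 64 bit mask value; one bit per byte lane.  */
typedef unsigned long long mask64;

/* And-not on masks: clear in B every bit that is set in A.  */
static mask64
mask64_andn (mask64 a, mask64 b)
{
  return ~a & b;
}

/* Build a mask selecting the first N lanes, for N between 0 and 64.  */
static mask64
mask64_first_n (unsigned n)
{
  return n >= 64 ? ~0ULL : (1ULL << n) - 1;
}
```

A 256-bit vector of bytes has only 32 lanes, so a mask produced by a vector instruction fits in the low half of such a value; the change discussed here concerns the mask register operations themselves.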
> Regtested on x86_64-pc-linux-gnu. Ok for trunk?
Ok.
>
> Thx,
> Haochen
>
> gcc/ChangeLog:
>
> * config/i386/avx512bwintrin.h: Allow 64 bit mask intrin usage
> for -mno-evex512.
> * config/i386/i386-builtin.def: Remove OPTION_MASK_ISA2_EVEX512
> for 64 bit mask builtins.
> * config/i386/i386.cc (ix86_hard_regno_mode_ok): Allow 64 bit
> mask register for -mno-evex512.
> * config/i386/i386.md (SWI1248_AVX512BWDQ_64): Remove
> TARGET_EVEX512.
> (*zero_extendsidi2): Change isa attribute to avx512bw.
> (kmov_isa): Ditto.
> (*anddi_1): Ditto.
> (*andn_1): Remove TARGET_EVEX512.
> (*one_cmplsi2_1_zext): Change isa attribute to avx512bw.
> (*ashl3_1): Ditto.
> (*lshr3_1): Ditto.
> * config/i386/sse.md (SWI1248_AVX512BWDQ): Remove TARGET_EVEX512.
> (SWI1248_AVX512BW): Ditto.
> (SWI1248_AVX512BWDQ2): Ditto.
> (*knotsi_1_zext): Ditto.
> (kunpckdi): Ditto.
> (SWI24_MASK): Removed.
> (vec_pack_trunc_): Change iterator from SWI24_MASK to SWI24.
> (vec_unpacks_lo_di): Remove TARGET_EVEX512.
> (SWI48x_MASK): Removed.
> (vec_unpacks_hi_): Change iterator from SWI48x_MASK to SWI48x.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx10_1-6.c: Remove check for errors.
> * gcc.target/i386/noevex512-2.c: Ditto.
> ---
>  gcc/config/i386/avx512bwintrin.h| 42 ++---
>  gcc/config/i386/i386-builtin.def| 28 +++---
>  gcc/config/i386/i386.cc |  3 +-
>  gcc/config/i386/i386.md | 20 +-
>  gcc/config/i386/sse.md  | 30 ++-
>  gcc/testsuite/gcc.target/i386/avx10_1-6.c   |  2 +-
>  gcc/testsuite/gcc.target/i386/noevex512-2.c |  2 +-
>  7 files changed, 59 insertions(+), 68 deletions(-)
>
> diff --git a/gcc/config/i386/avx512bwintrin.h 
> b/gcc/config/i386/avx512bwintrin.h
> index d5ce79fd073..37fd7c68976 100644
> --- a/gcc/config/i386/avx512bwintrin.h
> +++ b/gcc/config/i386/avx512bwintrin.h
> @@ -34,6 +34,8 @@
>  #define __DISABLE_AVX512BW__
>  #endif /* __AVX512BW__ */
>
> +typedef unsigned long long __mmask64;
> +
>  extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, 
> __artificial__))
>  _mm_avx512_set_epi32 (int __q3, int __q2, int __q1, int __q0)
>  {
> @@ -223,27 +225,6 @@ _kshiftri_mask32 (__mmask32 __A, unsigned int __B)
>
>  #endif
>
> -#ifdef __DISABLE_AVX512BW__
> -#undef __DISABLE_AVX512BW__
> -#pragma GCC pop_options
> -#endif /* __DISABLE_AVX512BW__ */
> -
> -#if !defined (__AVX512BW__) || !defined (__EVEX512__)
> -#pragma GCC push_options
> -#pragma GCC target("avx512bw,evex512")
> -#define __DISABLE_AVX512BW_512__
> -#endif /* __AVX512BW_512__ */
> -
> -/* Internal data types for implementing the intrinsics.  */
> -typedef short __v32hi __attribute__ ((__vector_size__ (64)));
> -typedef short __v32hi_u __attribute__ ((__vector_size__ (64),  \
> -   __may_alias__, __aligned__ (1)));
> -typedef char __v64qi __attribute__ ((__vector_size__ (64)));
> -typedef char __v64qi_u __attribute__ ((__vector_size__ (64),   \
> -  __may_alias__, __aligned__ (1)));
> -
> -typedef unsigned long long __mmask64;
> -
>  extern __inline unsigned char
>  __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
>  _ktest_mask64_u8  (__mmask64 __A,  __mmask64 __B, unsigned char *__CF)
> @@ -365,6 +346,25 @@ _kandn_mask64 (__mmask64 __A, __mmask64 __B)
>return (__mmask64) __builtin_ia32_kandndi ((__mmask64) __A, (__mmask64) 
> __B);
>  }
>
> +#ifdef __DISABLE_AVX512BW__
> +#undef __DISABLE_AVX512BW__
> +#pragma GCC pop_options
> +#endif /* __DISABLE_AVX512BW__ */
> +
> +#if !defined (__AVX512BW__) || !defined (__EVEX512__)
> +#pragma GCC push_options
> +#pragma GCC target("avx512bw,evex512")
> +#define __DISABLE_AVX512BW_512__
> +#endif /* __AVX512BW_512__ */
> +
> +/* Internal data types for implementing the intrinsics.  */
> +typedef short __v32hi __attribute__ ((__vector_size__ (64)));
> +typedef short __v32hi_u __attribute__ ((__vector_size__ (64),  \
> +   __may_alias__, __aligned__ (1)));
> +typedef char __v64qi __attribute__ ((__vector_size__ (64)));
> +typedef char __v64qi_u 

Re: [x86_64 PATCH] PR target/112992: Optimize mode for broadcast of constants.

2024-01-01 Thread Hongtao Liu
On Fri, Dec 22, 2023 at 6:25 PM Roger Sayle  wrote:
>
>
> This patch resolves the second part of PR target/112992, building upon
> Hongtao Liu's solution to the first part.
>
> The issue addressed by this patch is that when initializing vectors by
> broadcasting integer constants, the compiler has the flexibility to
> select the most appropriate vector mode to perform the broadcast, as
> long as the resulting vector has an identical bit pattern.  For
> example, the following constants are all equivalent:
> V4SImode {0x01010101, 0x01010101, 0x01010101, 0x01010101 }
> V8HImode {0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101, 0x0101 }
> V16QImode {0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, ... 0x01 }
> So instruction sequences that construct any of these can be used to
> construct the others (with a suitable cast/SUBREG).
>
> On x86_64, it turns out that broadcasts of SImode constants are preferred,
> as DImode constants often require a longer movabs instruction, and
> HImode and QImode broadcasts require multiple uops on some architectures.
> Hence, SImode is always the shortest/fastest implementation, or at least ties for it.
>
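
The equivalence claimed above can be checked with a small byte-level model. This is a sketch with hypothetical helper names (widen_to_si, fill_vector, same_bits); it lays lanes out in little-endian order as on x86 and says nothing about which instructions the compiler emits.

```c
#include <stdint.h>
#include <string.h>

/* Replicate the low BITS bits of VALUE across a 32-bit word.
   BITS must be 8, 16 or 32.  */
static uint32_t
widen_to_si (uint32_t value, unsigned bits)
{
  if (bits == 32)
    return value;
  uint32_t result = value & ((1u << bits) - 1);
  for (unsigned filled = bits; filled < 32; filled *= 2)
    result |= result << filled;
  return result;
}

/* Fill a 16-byte vector image by repeating an element that is
   ELEM_BYTES bytes wide (1, 2 or 4), low byte first.  */
static void
fill_vector (unsigned char out[16], uint32_t elem, unsigned elem_bytes)
{
  for (unsigned i = 0; i < 16; i += elem_bytes)
    for (unsigned j = 0; j < elem_bytes; j++)
      out[i + j] = (unsigned char) ((elem >> (8 * j)) & 0xff);
}

/* Return 1 when broadcasting the two elements gives identical bits.  */
static int
same_bits (uint32_t elem_a, unsigned bytes_a, uint32_t elem_b, unsigned bytes_b)
{
  unsigned char a[16], b[16];
  fill_vector (a, elem_a, bytes_a);
  fill_vector (b, elem_b, bytes_b);
  return memcmp (a, b, sizeof a) == 0;
}
```

Widening the byte 0x0c gives 0x0c0c0c0c and widening the halfword 0x000c gives 0x000c000c, which are the immediates that appear in the "After" listings below.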
> Examples of this improvement can be seen in the testsuite.
>
> gcc.target/i386/pr102021.c
> Before:
>    0:   48 b8 0c 00 0c 00 0c    movabs $0xc000c000c000c,%rax
>    7:   00 0c 00
>    a:   62 f2 fd 28 7c c0       vpbroadcastq %rax,%ymm0
>   10:   c3                      retq
>
> After:
>    0:   b8 0c 00 0c 00          mov    $0xc000c,%eax
>    5:   62 f2 7d 28 7c c0       vpbroadcastd %eax,%ymm0
>    b:   c3                      retq
>
> and
> gcc.target/i386/pr90773-17.c:
> Before:
>    0:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 7
>    7:   b8 0c 00 00 00          mov    $0xc,%eax
>    c:   62 f2 7d 08 7a c0       vpbroadcastb %eax,%xmm0
>   12:   62 f1 7f 08 7f 02       vmovdqu8 %xmm0,(%rdx)
>   18:   c7 42 0f 0c 0c 0c 0c    movl   $0xc0c0c0c,0xf(%rdx)
>   1f:   c3                      retq
>
> After:
>    0:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 7
>    7:   b8 0c 0c 0c 0c          mov    $0xc0c0c0c,%eax
>    c:   62 f2 7d 08 7c c0       vpbroadcastd %eax,%xmm0
>   12:   62 f1 7f 08 7f 02       vmovdqu8 %xmm0,(%rdx)
>   18:   c7 42 0f 0c 0c 0c 0c    movl   $0xc0c0c0c,0xf(%rdx)
>   1f:   c3                      retq
>
> where according to Agner Fog's instruction tables broadcastd is slightly
> faster on some microarchitectures, for example Knights Landing.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
>
>
> 2023-12-21  Roger Sayle  
>
> gcc/ChangeLog
> PR target/112992
> * config/i386/i386-expand.cc
> (ix86_convert_const_wide_int_to_broadcast): Allow call to
> ix86_expand_vector_init_duplicate to fail, and return NULL_RTX.
> (ix86_broadcast_from_constant): Revert recent change; Return a
> suitable MEMREF independently of mode/target combinations.
> (ix86_expand_vector_move): Allow ix86_expand_vector_init_duplicate
> to decide whether expansion is possible/preferable.  Only try
> forcing DImode constants to memory (and trying again) if calling
> ix86_expand_vector_init_duplicate fails with a DImode immediate
> constant.
> (ix86_expand_vector_init_duplicate) : Try using
> V4SImode for suitable immediate constants.
> : Try using V8SImode for suitable constants.
> : Use constant pool for AVX without AVX2.
> : Fail for CONST_INT_P, i.e. use constant pool.
> : Likewise.
> : For CONST_INT_P try using V4SImode via widen.
> : For CONST_INT_P try using V8HImode via widen.
> : Handle CONST_INTs via simplify_binary_operation.
> Allow recursive calls to ix86_expand_vector_init_duplicate to fail.
> : For CONST_INT_P try V8SImode via widen.
> : For CONST_INT_P try V16HImode via widen.
> (ix86_expand_vector_init): Move try using a broadcast for all_same
> with ix86_expand_vector_init_duplicate before using constant pool.
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Update test case.
> * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise.
> * gcc.target/i386/avx512fp16-13.c: Likewise.
> * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise.
> * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise.
> * gcc.target/i386/pr100865-10a.c: Likewise.
> * gcc.target/i386/pr100865-10b.c: Likewise.
> * gcc.target/i386/pr100865-11c.c: Likewise.
> * gcc.target/i386/pr100865-12c.c: Likewise.
> * gcc.target/i386/pr100865-2.c: Likewise.
> * gcc.target/i386/pr100865-3.c: Likewise.
> * gcc.target/i386/pr100865-4a.c: Likewise.
> * gcc.target/i386/pr100865-4b.c: Likewise.
> * gcc.targ

Re: [x86_64 PATCH] PR target/112992: Optimize mode for broadcast of constants.

2024-01-07 Thread Hongtao Liu
On Sun, Jan 7, 2024 at 6:53 AM Roger Sayle  wrote:
>
> Hi Hongtao,
>
> Many thanks for the review.  This revised patch implements several
> of your suggestions, specifically to use pshufd for V4SImode and
> punpcklqdq for V2DImode.  These changes are demonstrated by the
> examples below:
>
> typedef unsigned int v4si __attribute((vector_size(16)));
> typedef unsigned long long v2di __attribute((vector_size(16)));
>
> v4si foo() { return (v4si){1,1,1,1}; }
> v2di bar() { return (v2di){1,1}; }
>
> The previous version of my patch generated:
>
> foo:    movdqa  .LC0(%rip), %xmm0
>         ret
> bar:    movdqa  .LC1(%rip), %xmm0
>         ret
>
> with this revised version, -O2 generates:
>
> foo:    movl    $1, %eax
>         movd    %eax, %xmm0
>         pshufd  $0, %xmm0, %xmm0
>         ret
> bar:    movl    $1, %eax
>         movq    %rax, %xmm0
>         punpcklqdq  %xmm0, %xmm0
>         ret
>
> However, if it's OK with you, I'd prefer to allow this function to
> return false, safely falling back to emitting a vector load from
> the constant pool rather than ICEing from a gcc_assert.  For one
Sure, that makes sense.
> thing this isn't an unrecoverable correctness issue, but at worst
> a missed optimization.  The deeper reason is that this usefully
> provides a handle for tuning on different microarchitectures.
> On some (AMD?) machines, where !TARGET_INTER_UNIT_MOVES_TO_VEC,
> the first form above may be preferable to the second.  Currently
> the start of ix86_convert_const_wide_int_to_broadcast disables
> broadcasts for !TARGET_INTER_UNIT_MOVES_TO_VEC even when an
> implementation doesn't require an inter unit move, such as a
> broadcast from memory.  I plan follow-up patches that benefit
> from this flexibility.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?
Ok.
>
> gcc/ChangeLog
> PR target/112992
> * config/i386/i386-expand.cc
> (ix86_convert_const_wide_int_to_broadcast): Allow call to
> ix86_expand_vector_init_duplicate to fail, and return NULL_RTX.
> (ix86_broadcast_from_constant): Revert recent change; Return a
> suitable MEMREF independently of mode/target combinations.
> (ix86_expand_vector_move): Allow ix86_expand_vector_init_duplicate
> to decide whether expansion is possible/preferable.  Only try
> forcing DImode constants to memory (and trying again) if calling
> ix86_expand_vector_init_duplicate fails with a DImode immediate
> constant.
> (ix86_expand_vector_init_duplicate) : Try using
> V4SImode for suitable immediate constants.
> : Try using V8SImode for suitable constants.
> : Fail for CONST_INT_P, i.e. use constant pool.
> : Likewise.
> : For CONST_INT_P try using V4SImode via widen.
> : For CONST_INT_P try using V8HImode via widen.
> : Handle CONST_INTs via simplify_binary_operation.
> Allow recursive calls to ix86_expand_vector_init_duplicate to fail.
> : For CONST_INT_P try V8SImode via widen.
> : For CONST_INT_P try V16HImode via widen.
> (ix86_expand_vector_init): Move try using a broadcast for all_same
> with ix86_expand_vector_init_duplicate before using constant pool.
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/auto-init-8.c: Update test case.
> * gcc.target/i386/avx512f-broadcast-pr87767-1.c: Likewise.
> * gcc.target/i386/avx512f-broadcast-pr87767-5.c: Likewise.
> * gcc.target/i386/avx512fp16-13.c: Likewise.
> * gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Likewise.
> * gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Likewise.
> * gcc.target/i386/pr100865-1.c: Likewise.
> * gcc.target/i386/pr100865-10a.c: Likewise.
> * gcc.target/i386/pr100865-10b.c: Likewise.
> * gcc.target/i386/pr100865-2.c: Likewise.
> * gcc.target/i386/pr100865-3.c: Likewise.
> * gcc.target/i386/pr100865-4a.c: Likewise.
> * gcc.target/i386/pr100865-4b.c: Likewise.
> * gcc.target/i386/pr100865-5a.c: Likewise.
> * gcc.target/i386/pr100865-5b.c: Likewise.
> * gcc.target/i386/pr100865-9a.c: Likewise.
> * gcc.target/i386/pr100865-9b.c: Likewise.
> * gcc.target/i386/pr102021.c: Likewise.
> * gcc.target/i386/pr90773-17.c: Likewise.
>
> Thanks in advance.
> Roger
> --
>
> > -Original Message-
> > From: Hongtao Liu 
> > Sent: 02 January 2024 05:40
> > To: Roger Sayle 
> > Cc: gcc
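
For anyone who wants to reproduce the code generation discussed in this thread, here is a minimal sketch built around the two functions from Roger's message. The helper all_lanes_equal_u32 is a hypothetical name introduced only for illustration, and the typedefs rely on the GNU vector extension, so GCC or Clang is assumed.

```c
#include <assert.h>
#include <string.h>

typedef unsigned int v4si __attribute__((vector_size(16)));
typedef unsigned long long v2di __attribute__((vector_size(16)));

/* The two broadcast constants from the message above.  */
v4si foo(void) { return (v4si){1, 1, 1, 1}; }
v2di bar(void) { return (v2di){1, 1}; }

/* Hypothetical helper: returns 1 if all four 32-bit lanes of V equal
   VALUE, 0 otherwise.  The lanes are copied out with memcpy so the
   check does not depend on how the vector is passed or stored.  */
int all_lanes_equal_u32(v4si v, unsigned int value)
{
  unsigned int lanes[4];
  memcpy(lanes, &v, sizeof lanes);
  for (int i = 0; i < 4; i++)
    if (lanes[i] != value)
      return 0;
  return 1;
}
```

Compiling this file with -O2 -S and inspecting foo and bar should show whether a given compiler loads the constant from the constant pool or materializes it with a scalar move plus shuffle; which form appears depends on the compiler version and tuning.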

Re: Disable FMADD in chains for Zen4 and generic

2024-01-07 Thread Hongtao Liu
On Thu, Dec 14, 2023 at 12:03 AM Jan Hubicka  wrote:
>
> > > The difference is that Cores understand the fact that fmadd does not need
> > > all three parameters to start computation, while Zen cores don't.
> > >
> > > Since this seems a noticeable win on Zen and not a loss on Core, it seems
> > > like a good default for generic.
> > >
> > > I plan to commit the patch next week if there are no complaints.
> > The generic part LGTM.(It's exactly what we proposed in [1])
> >
> > [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637721.html
>
> Thanks.  I wonder if we can think of other generic changes that would make
> sense to do?
> Concerning zen4 and FMA, it is not really a win with AVX512 enabled
> (which is what I was benchmarking for znver4 tuning), but indeed it is
> a win with AVX256 where the extra latency is not hidden by the parallelism
> exposed by doing everything twice.
>
> I re-benchmarked zen4 and it behaves similarly to zen3 with avx256, so
> for x86-64-v3 this makes sense.
>
> Honza
> > >
> > > Honza
> > >
> > > #include <stdio.h>
> > > #include <time.h>
> > >
> > > #define SIZE 1000
> > >
> > > float a[SIZE][SIZE];
> > > float b[SIZE][SIZE];
> > > float c[SIZE][SIZE];
> > >
> > > void init(void)
> > > {
> > >    int i, j, k;
> > >    for(i=0; i<SIZE; i++)
> > >    {
> > >       for(j=0; j<SIZE; j++)
> > >       {
> > >          a[i][j] = (float)i + j;
> > >          b[i][j] = (float)i - j;
> > >          c[i][j] = 0.0f;
> > >       }
> > >    }
> > > }
> > >
> > > void mult(void)
> > > {
> > >    int i, j, k;
> > >
> > >    for(i=0; i<SIZE; i++)
> > >    {
> > >       for(j=0; j<SIZE; j++)
> > >       {
> > >          for(k=0; k<SIZE; k++)
> > >          {
> > >             c[i][j] += a[i][k] * b[k][j];
> > >          }
> > >       }
> > >    }
> > > }
> > >
> > > int main(void)
> > > {
> > >clock_t s, e;
> > >
> > >init();
> > >s=clock();
> > >mult();
> > >e=clock();
> > >printf("mult took %10d clocks\n", (int)(e-s));
> > >
> > >return 0;
> > >
> > > }
> > >
> > > > * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS, 
> > > X86_TUNE_AVOID_256FMA_CHAINS)
> > > Enable for znver4 and Core.
> > >
> > > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > > index 43fa9e8fd6d..74b03cbcc60 100644
> > > --- a/gcc/config/i386/x86-tune.def
> > > +++ b/gcc/config/i386/x86-tune.def
> > > @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, 
> > > "use_scatter_8parts",
> > >
> > >  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit 
> > > or
> > > smaller FMA chain.  */
> > > -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | 
> > > m_ZNVER2 | m_ZNVER3
> > > -  | m_YONGFENG)
> > > +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | 
> > > m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > > +  | m_YONGFENG | m_GENERIC)
> > >
> > >  /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit 
> > > or
> > > smaller FMA chain.  */
> > > -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 
> > > | m_ZNVER3
> > > - | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> > > +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 
> > > | m_ZNVER3 | m_ZNVER4
> > > + | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
Can we backport the patch (at least the generic part) to the
GCC11/GCC12/GCC13 release branches?
> > >
> > >  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit 
> > > or
> > > smaller FMA chain.  */
> >
> >
> >
> > --
> > BR,
> > Hongtao



-- 
BR,
Hongtao
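
To make the "FMA chain" in the benchmark above concrete, here is a minimal sketch (not part of the patch). dot_chain mirrors the inner k loop of mult: every iteration feeds the accumulator into the next multiply-add, so when the compiler fuses the operation, the fused multiply-add sits on the loop-carried path. dot_split is a hypothetical source-level alternative with four independent accumulators. Both function names are introduced here for illustration only; the tuning flags in the diff address the issue inside the compiler rather than by rewriting source code.

```c
#include <assert.h>
#include <stddef.h>

/* Single accumulator: each multiply-add depends on the result of the
   previous one, forming one long dependency chain.  */
float dot_chain(const float *a, const float *b, size_t n)
{
  float acc = 0.0f;
  for (size_t i = 0; i < n; i++)
    acc += a[i] * b[i];
  return acc;
}

/* Four independent accumulators: the four chains can overlap, so the
   latency of one multiply-add no longer limits every iteration.
   For simplicity N must be a multiple of 4.  */
float dot_split(const float *a, const float *b, size_t n)
{
  assert(n % 4 == 0);
  float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
  for (size_t i = 0; i < n; i += 4)
    {
      acc0 += a[i]     * b[i];
      acc1 += a[i + 1] * b[i + 1];
      acc2 += a[i + 2] * b[i + 2];
      acc3 += a[i + 3] * b[i + 3];
    }
  return (acc0 + acc1) + (acc2 + acc3);
}
```

Note that the two functions add the products in a different order, so for general floating-point inputs their results may differ in the last bits; they agree exactly only when every intermediate value is exactly representable.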


Re: [PATCH] i386: [APX] Add missing document for APX

2024-01-07 Thread Hongtao Liu
On Mon, Jan 8, 2024 at 11:09 AM Hongyu Wang  wrote:
>
> Hi,
>
> The supported sub-features for APX were missing from the option documentation
> and the target attribute section. Add the missing ones.
>
> Ok for trunk?
Ok.
>
> gcc/ChangeLog:
>
> * config/i386/i386.opt: Add supported sub-features.
> * doc/extend.texi: Add description for target attribute.
> ---
>  gcc/config/i386/i386.opt | 3 ++-
>  gcc/doc/extend.texi  | 6 ++
>  2 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
> index 1bfff1e0d82..a38e92baf92 100644
> --- a/gcc/config/i386/i386.opt
> +++ b/gcc/config/i386/i386.opt
> @@ -1328,7 +1328,8 @@ Enable vectorization for scatter instruction.
>
>  mapxf
>  Target Mask(ISA2_APX_F) Var(ix86_isa_flags2) Save
> -Support APX code generation.
> +Support code generation for APX features, including EGPR, PUSH2POP2,
> +NDD and PPX.
>
>  mapx-features=
>  Target Undocumented Joined Enum(apx_features) EnumSet Var(ix86_apx_features) 
> Init(apx_none) Save
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index 9e61ba9507d..84eef411e2d 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -7344,6 +7344,12 @@ Enable/disable the generation of the SM4 instructions.
>  @itemx no-usermsr
>  Enable/disable the generation of the USER_MSR instructions.
>
> +@cindex @code{target("apxf")} function attribute, x86
> +@item apxf
> +@itemx no-apxf
> +Enable/disable the generation of the APX features, including
> +EGPR, PUSH2POP2, NDD and PPX.
> +
>  @cindex @code{target("avx10.1")} function attribute, x86
>  @item avx10.1
>  @itemx no-avx10.1
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH] Take register pressure into account for vec_construct/scalar_to_vec when the components are not loaded from memory.

2023-12-03 Thread Hongtao Liu
On Fri, Dec 1, 2023 at 10:26 PM Richard Biener
 wrote:
>
> On Fri, Dec 1, 2023 at 3:39 AM liuhongt  wrote:
> >
> > > Hmm, I would suggest you put reg_needed into the class and accumulate
> > > over all vec_construct, with your patch you pessimize a single v32qi
> > > over two separate v16qi for example.  Also currently the whole block is
> > > gated with INTEGRAL_TYPE_P but register pressure would be also
> > > a concern for floating point vectors.  finish_cost would then apply an
> > > adjustment.
> >
> > Changed.
> >
> > > 'target_avail_regs' is for GENERAL_REGS, does that include APX regs?
> > > I don't see anything similar for FP regs, but I guess the target should 
> > > know
> > > or maybe there's a #regs in regclass query already.
> > > Haven't seen any, so I use the setting below.
> >
> > unsigned target_avail_sse = TARGET_64BIT ? (TARGET_AVX512F ? 32 : 16) : 8;
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > No big impact on SPEC2017.
> > Observe 1 big improvement from other benchmark by avoiding vectorization 
> > with
> > vec_construct v32qi which caused lots of spills.
> >
> > Ok for trunk?
>
> LGTM, let's see what x86 maintainers think.
+Honza and Uros.
Any comments?
>
> Richard.
>
> > For vec_construct, the components must be live at the same time if
> > they're not loaded from memory; when the number of those components
> > exceeds the available registers, spills happen. Try to account for that
> > with a rough estimation.
> > ??? Ideally, we should have an overall estimation of register pressure
> > if we know the live range of all variables.
> >
> > gcc/ChangeLog:
> >
> > * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
> > Count sse_reg/gpr_regs for components not loaded from memory.
> > (ix86_vector_costs::ix86_vector_costs): New constructor.
> > (ix86_vector_costs::m_num_gpr_needed[3]): New private member.
> > (ix86_vector_costs::m_num_sse_needed[3]): Ditto.
> > (ix86_vector_costs::finish_cost): Estimate overall register
> > pressure cost.
> > (ix86_vector_costs::ix86_vect_estimate_reg_pressure): New
> > function.
> > ---
> >  gcc/config/i386/i386.cc | 54 ++---
> >  1 file changed, 50 insertions(+), 4 deletions(-)
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index 9390f525b99..dcaea6c2096 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -24562,15 +24562,34 @@ ix86_noce_conversion_profitable_p (rtx_insn *seq, 
> > struct noce_if_info *if_info)
> >  /* x86-specific vector costs.  */
> >  class ix86_vector_costs : public vector_costs
> >  {
> > -  using vector_costs::vector_costs;
> > +public:
> > +  ix86_vector_costs (vec_info *, bool);
> >
> >unsigned int add_stmt_cost (int count, vect_cost_for_stmt kind,
> >   stmt_vec_info stmt_info, slp_tree node,
> >   tree vectype, int misalign,
> >   vect_cost_model_location where) override;
> >void finish_cost (const vector_costs *) override;
> > +
> > +private:
> > +
> > +  /* Estimate register pressure of the vectorized code.  */
> > +  void ix86_vect_estimate_reg_pressure ();
> > +  /* Number of GENERAL_REGS/SSE_REGS used in the vectorizer, it's used for
> > + estimation of register pressure.
> > + ??? Currently it's only used by vec_construct/scalar_to_vec
> > + where we know it's not loaded from memory.  */
> > +  unsigned m_num_gpr_needed[3];
> > +  unsigned m_num_sse_needed[3];
> >  };
> >
> > +ix86_vector_costs::ix86_vector_costs (vec_info* vinfo, bool 
> > costing_for_scalar)
> > +  : vector_costs (vinfo, costing_for_scalar),
> > +m_num_gpr_needed (),
> > +m_num_sse_needed ()
> > +{
> > +}
> > +
> >  /* Implement targetm.vectorize.create_costs.  */
> >
> >  static vector_costs *
> > @@ -24748,8 +24767,7 @@ ix86_vector_costs::add_stmt_cost (int count, 
> > vect_cost_for_stmt kind,
> >  }
> >else if ((kind == vec_construct || kind == scalar_to_vec)
> >&& node
> > -  && SLP_TREE_DEF_TYPE (node) == vect_external_def
> > -  && INTEGRAL_TYPE_P (TREE_TYPE (vectype)))
> > +  && SLP_TREE_DEF_TYPE (node) == vect_external_def)
> >  {
> >stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, 
> > misalign);
> >unsigned i;
> > @@ -24785,7 +24803,15 @@ ix86_vector_costs::add_stmt_cost (int count, 
> > vect_cost_for_stmt kind,
> >   && (gimple_assign_rhs_code (def) != BIT_FIELD_REF
> >   || !VECTOR_TYPE_P (TREE_TYPE
> > (TREE_OPERAND (gimple_assign_rhs1 (def), 
> > 0))
> > -   stmt_cost += ix86_cost->sse_to_integer;
> > +   {
> > + if (fp)
> > +   m_num_sse_needed[where]++;
> > + else
> > +   {
> > + m_num_gpr_needed[where]++;
> > + 

Re: [PATCH v2 00/17] Support Intel APX NDD

2023-12-04 Thread Hongtao Liu
On Tue, Dec 5, 2023 at 10:32 AM Hongyu Wang  wrote:
>
> Hi,
>
> APX NDD patches have been posted at
> https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636604.html
>
> Thanks to Hongtao's review, the V2 patch adds support for zext semantics with
> memory input, as NDD by default clears the upper bits of dest for any operand
> size.
>
> We also support TImode shift with new split helper functions, which allow NDD
> form splitting but still restrict memory src usage, since in the post-reload
> splitter the register number is restricted and no new register can be used
> for shld/shrd.
>
> Also fixed several typos, formatting issues and redundant code.
Patches LGTM. Please wait a few more days before committing in case
other folks have comments.
>
> Bootstrapped/regtested on x86_64-pc-linux-gnu{-m32,} and sde.
>
> OK for trunk?
>
> Hongyu Wang (8):
>   [APX NDD] Restrict TImode register usage when NDD enabled
>   [APX NDD] Disable seg_prefixed memory usage for NDD add
>   [APX NDD] Support APX NDD for left shift insns
>   [APX NDD] Support APX NDD for right shift insns
>   [APX NDD] Support APX NDD for rotate insns
>   [APX NDD] Support APX NDD for shld/shrd insns
>   [APX NDD] Support APX NDD for cmove insns
>   [APX NDD] Support TImode shift for NDD
>
> Kong Lingling (9):
>   [APX NDD] Support Intel APX NDD for legacy add insn
>   [APX NDD] Support APX NDD for optimization patterns of add
>   [APX NDD] Support APX NDD for adc insns
>   [APX NDD] Support APX NDD for sub insns
>   [APX NDD] Support APX NDD for sbb insn
>   [APX NDD] Support APX NDD for neg insn
>   [APX NDD] Support APX NDD for not insn
>   [APX NDD] Support APX NDD for and insn
>   [APX NDD] Support APX NDD for or/xor insn
>
>  gcc/config/i386/constraints.md|5 +
>  gcc/config/i386/i386-expand.cc|  164 +-
>  gcc/config/i386/i386-options.cc   |2 +
>  gcc/config/i386/i386-protos.h |   16 +-
>  gcc/config/i386/i386.cc   |   40 +-
>  gcc/config/i386/i386.md   | 2323 +++--
>  gcc/testsuite/gcc.target/i386/apx-ndd-adc.c   |   15 +
>  gcc/testsuite/gcc.target/i386/apx-ndd-cmov.c  |   16 +
>  gcc/testsuite/gcc.target/i386/apx-ndd-sbb.c   |6 +
>  .../gcc.target/i386/apx-ndd-shld-shrd.c   |   24 +
>  .../gcc.target/i386/apx-ndd-ti-shift.c|   91 +
>  gcc/testsuite/gcc.target/i386/apx-ndd.c   |  202 ++
>  12 files changed, 2149 insertions(+), 755 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-adc.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-cmov.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-sbb.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-shld-shrd.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-ti-shift.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd.c
>
> --
> 2.31.1
>


-- 
BR,
Hongtao


Re: [PATCH] Take register pressure into account for vec_construct/scalar_to_vec when the components are not loaded from memory.

2023-12-04 Thread Hongtao Liu
On Mon, Dec 4, 2023 at 3:51 PM Uros Bizjak  wrote:
>
> On Mon, Dec 4, 2023 at 8:11 AM Hongtao Liu  wrote:
> >
> > On Fri, Dec 1, 2023 at 10:26 PM Richard Biener
> >  wrote:
> > >
> > > On Fri, Dec 1, 2023 at 3:39 AM liuhongt  wrote:
> > > >
> > > > > Hmm, I would suggest you put reg_needed into the class and accumulate
> > > > > over all vec_construct, with your patch you pessimize a single v32qi
> > > > > over two separate v16qi for example.  Also currently the whole block 
> > > > > is
> > > > > gated with INTEGRAL_TYPE_P but register pressure would be also
> > > > > a concern for floating point vectors.  finish_cost would then apply an
> > > > > adjustment.
> > > >
> > > > Changed.
> > > >
> > > > > 'target_avail_regs' is for GENERAL_REGS, does that include APX regs?
> > > > > I don't see anything similar for FP regs, but I guess the target 
> > > > > should know
> > > > > or maybe there's a #regs in regclass query already.
> > > > > Haven't seen any, so I use the setting below.
> > > >
> > > > unsigned target_avail_sse = TARGET_64BIT ? (TARGET_AVX512F ? 32 : 16) : 
> > > > 8;
> > > >
> > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > > No big impact on SPEC2017.
> > > > Observe 1 big improvement from other benchmark by avoiding 
> > > > vectorization with
> > > > vec_construct v32qi which caused lots of spills.
> > > >
> > > > Ok for trunk?
> > >
> > > LGTM, let's see what x86 maintainers think.
> > +Honza and Uros.
> > Any comments?
>
> I have no comment on vector stuff, I think you are the most
> experienced developer in this area.
Thanks, committed.
>
> Uros.
>
> > >
> > > Richard.
> > >
> > > > For vec_construct, the components must be live at the same time if
> > > > they're not loaded from memory; when the number of those components
> > > > exceeds the available registers, spills happen. Try to account for that
> > > > with a rough estimation.
> > > > ??? Ideally, we should have an overall estimation of register pressure
> > > > if we know the live range of all variables.
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
> > > > Count sse_reg/gpr_regs for components not loaded from memory.
> > > > (ix86_vector_costs::ix86_vector_costs): New constructor.
> > > > (ix86_vector_costs::m_num_gpr_needed[3]): New private member.
> > > > (ix86_vector_costs::m_num_sse_needed[3]): Ditto.
> > > > (ix86_vector_costs::finish_cost): Estimate overall register
> > > > pressure cost.
> > > > (ix86_vector_costs::ix86_vect_estimate_reg_pressure): New
> > > > function.
> > > > ---
> > > >  gcc/config/i386/i386.cc | 54 ++---
> > > >  1 file changed, 50 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > index 9390f525b99..dcaea6c2096 100644
> > > > --- a/gcc/config/i386/i386.cc
> > > > +++ b/gcc/config/i386/i386.cc
> > > > @@ -24562,15 +24562,34 @@ ix86_noce_conversion_profitable_p (rtx_insn 
> > > > *seq, struct noce_if_info *if_info)
> > > >  /* x86-specific vector costs.  */
> > > >  class ix86_vector_costs : public vector_costs
> > > >  {
> > > > -  using vector_costs::vector_costs;
> > > > +public:
> > > > +  ix86_vector_costs (vec_info *, bool);
> > > >
> > > >unsigned int add_stmt_cost (int count, vect_cost_for_stmt kind,
> > > >   stmt_vec_info stmt_info, slp_tree node,
> > > >   tree vectype, int misalign,
> > > >   vect_cost_model_location where) override;
> > > >void finish_cost (const vector_costs *) override;
> > > > +
> > > > +private:
> > > > +
> > > > +  /* Estimate register pressure of the vectorized code.  */
> > > > +  void ix86_vect_estimate_reg_pressure ();
> > > > +  /* Number of GENERAL_REGS/SSE_REGS used in the vectorizer, it's use

Re: [PATCH] i386: Move vzeroupper pass from after reload pass to after postreload_cse [PR112760]

2023-12-05 Thread Hongtao Liu
On Wed, Dec 6, 2023 at 6:23 AM Jakub Jelinek  wrote:
>
> Hi!
>
> Regardless of the outcome of the REG_UNUSED discussions, I think
> it is a good idea to move the vzeroupper pass one pass later.
> As can be seen in the multiple PRs and as postreload.cc documents,
> reload/LRA is known to create dead statements quite often, which
> is the reason why we have postreload_cse pass at all.
> Doing vzeroupper pass before such cleanup means the pass including
> df_analyze for it needs to process more instructions than needed
> and because mode switching adds note problem, also higher chance of
> having stale REG_UNUSED notes.
> And, I really don't see why vzeroupper can't wait until those cleanups
> are done.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
LGTM.
>
> 2023-12-05  Jakub Jelinek  
>
> PR rtl-optimization/112760
> * config/i386/i386-passes.def (pass_insert_vzeroupper): Insert
> after pass_postreload_cse rather than pass_reload.
> * config/i386/i386-features.cc (rest_of_handle_insert_vzeroupper):
> Adjust comment for it.
>
> * gcc.dg/pr112760.c: New test.
>
> --- gcc/config/i386/i386-passes.def.jj  2023-01-16 11:52:15.960735877 +0100
> +++ gcc/config/i386/i386-passes.def 2023-12-05 19:15:01.748279329 +0100
> @@ -24,7 +24,7 @@ along with GCC; see the file COPYING3.
> REPLACE_PASS (PASS, INSTANCE, TGT_PASS)
>   */
>
> -  INSERT_PASS_AFTER (pass_reload, 1, pass_insert_vzeroupper);
> +  INSERT_PASS_AFTER (pass_postreload_cse, 1, pass_insert_vzeroupper);
>INSERT_PASS_AFTER (pass_combine, 1, pass_stv, false /* timode_p */);
>/* Run the 64-bit STV pass before the CSE pass so that CONST0_RTX and
>   CONSTM1_RTX generated by the STV pass can be CSEed.  */
> --- gcc/config/i386/i386-features.cc.jj 2023-11-02 07:49:15.029894060 +0100
> +++ gcc/config/i386/i386-features.cc2023-12-05 19:15:48.658620698 +0100
> @@ -2627,10 +2627,11 @@ convert_scalars_to_vector (bool timode_p
>  static unsigned int
>  rest_of_handle_insert_vzeroupper (void)
>  {
> -  /* vzeroupper instructions are inserted immediately after reload to
> - account for possible spills from 256bit or 512bit registers.  The pass
> - reuses mode switching infrastructure by re-running mode insertion
> - pass, so disable entities that have already been processed.  */
> +  /* vzeroupper instructions are inserted immediately after reload and
> + postreload_cse to clean up after it a little bit to account for possible
> + spills from 256bit or 512bit registers.  The pass reuses mode switching
> + infrastructure by re-running mode insertion pass, so disable entities
> + that have already been processed.  */
>for (int i = 0; i < MAX_386_ENTITIES; i++)
>  ix86_optimize_mode_switching[i] = 0;
>
> --- gcc/testsuite/gcc.dg/pr112760.c.jj  2023-12-01 13:46:57.444746529 +0100
> +++ gcc/testsuite/gcc.dg/pr112760.c 2023-12-01 13:46:36.729036971 +0100
> @@ -0,0 +1,22 @@
> +/* PR rtl-optimization/112760 */
> +/* { dg-do run } */
> +/* { dg-options "-O2 -fno-dce -fno-guess-branch-probability 
> --param=max-cse-insns=0" } */
> +/* { dg-additional-options "-m8bit-idiv -mavx" { target i?86-*-* x86_64-*-* 
> } } */
> +
> +unsigned g;
> +
> +__attribute__((__noipa__)) unsigned short
> +foo (unsigned short a, unsigned short b)
> +{
> +  unsigned short x = __builtin_add_overflow_p (a, g, (unsigned short) 0);
> +  g -= g / b;
> +  return x;
> +}
> +
> +int
> +main ()
> +{
> +  unsigned short x = foo (40, 6);
> +  if (x != 0)
> +__builtin_abort ();
> +}
>
> Jakub
>


-- 
BR,
Hongtao


Re: [PATCH] Don't vectorize when vector stmts are only vec_contruct and stores

2023-12-05 Thread Hongtao Liu
On Mon, Dec 4, 2023 at 10:10 PM Richard Biener
 wrote:
>
> On Mon, Dec 4, 2023 at 6:32 AM liuhongt  wrote:
> >
> > I.e. for the cases below:
> >a[0] = b1;
> >a[1] = b2;
> >..
> >a[n] = bn;
> >
> > There are extra dependences when constructing the vector, but not for
> > the scalar stores. According to experiments, it's generally worse.
> >
> > The patch adds a cut-off heuristic when the vec_stmt is just
> > vec_construct and vector store. It improves SPEC2017 a little bit.
> >
> > BenchMarks          Ratio
> > 500.perlbench_r     2.60%
> > 502.gcc_r           0.30%
> > 505.mcf_r           0.40%
> > 520.omnetpp_r       -1.00%
> > 523.xalancbmk_r     0.90%
> > 525.x264_r          0.00%
> > 531.deepsjeng_r     0.30%
> > 541.leela_r         0.90%
> > 548.exchange2_r     3.20%
> > 557.xz_r            1.40%
> > 503.bwaves_r        0.00%
> > 507.cactuBSSN_r     0.00%
> > 508.namd_r          0.30%
> > 510.parest_r        0.00%
> > 511.povray_r        0.20%
> > 519.lbm_r           SAME BIN
> > 521.wrf_r           -0.30%
> > 526.blender_r       -1.20%
> > 527.cam4_r          -0.20%
> > 538.imagick_r       4.00%
> > 544.nab_r           0.40%
> > 549.fotonik3d_r     0.00%
> > 554.roms_r          0.00%
> > Geomean-int         0.90%
> > Geomean-fp          0.30%
> > Geomean-all         0.50%
> >
> > And
> > Regressed testcases:
> >
> > gcc.target/i386/part-vect-absneghf.c
> > gcc.target/i386/part-vect-copysignhf.c
> > gcc.target/i386/part-vect-xorsignhf.c
> >
> > Regressed under -m32 since it generates 2 vector
> > .ABS/NEG/XORSIGN/COPYSIGN vs original 1 64-bit vec_construct. The
> > original testcases are used to test vectorization capability for
> > .ABS/NEG/XORSIGN/COPYSIGN, so just restrict the testcases to TARGET_64BIT.
> >
> > gcc.target/i386/pr111023-2.c
> > gcc.target/i386/pr111023.c
> > Regressed under -m32
> >
> > testcase as below
> >
> > void
> > v8hi_v8qi (v8hi *dst, v16qi src)
> > {
> >   short tem[8];
> >   tem[0] = src[0];
> >   tem[1] = src[1];
> >   tem[2] = src[2];
> >   tem[3] = src[3];
> >   tem[4] = src[4];
> >   tem[5] = src[5];
> >   tem[6] = src[6];
> >   tem[7] = src[7];
> >   dst[0] = *(v8hi *) tem;
> > }
> >
> > Under a 64-bit target, the vectorizer realizes it's just a permutation of
> > the original src vector, but under -m32 the vectorizer relies on
> > vec_construct for vectorization. I think optimization for this case
> > under a 32-bit target may not matter much, so just add
> > -fno-vect-cost-model.
> >
> > gcc.target/i386/pr91446.c: This testcase guards the cost model of
> > vector store, not vectorization capability, so just adjust the testcase.
> >
> > gcc.target/i386/pr108938-3.c: This testcase relies on vec_construct to
> > optimize for bswap, like other optimization vectorizer can't realize
> > optimization after it. So the current solution is to add
> > -fno-vect-cost-model to the testcase.
> >
> > costmodel-pr104582-1.c
> > costmodel-pr104582-2.c
> > costmodel-pr104582-4.c
> >
> > Failed since it's now not vectorized, looked at the PR, it's exactly
> > what's wanted, so adjust testcase to scan-tree-dump-not.
> >
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ok for trunk?
>
> So the original motivation to not more aggressively prune
> store-from-CTOR vectorization in the vectorizer itself is that
> the vector store is possibly better for STLF (larger stores are
> good, larger loads eventually problematic).

That's exactly what I worried about, and I didn't observe any STLF
stall in SPEC2017, I'll try with more benchmarks.
But on the other hand, the cost model is not suitable for solving this
problem, at best it only circumvents part of this.

>
> I'd also expect the costs to play out to not make those profitable.
>
> OTOH, if you have a series of 'double' stores you can convert to
> a series of V2DF stores you _may_ be faster if this reduces
> pressure on the store unit.  Esp. V2DF is cheap to construct
> with one movhpd.

>
> So I don't think we want to try to pattern match it this way?
>
> In fact the SLP vectorization cases could all arrive with an
> SLP node specified (vectorizable_store would have to be
> changed here), which means you could check for an
> vect_external_def child instead?
>
> But as said, I would hope that we can arrive at a better way
> assessing the CONSTRUCTOR cost.  IMHO one big issue
> is that load and store cost are comparatively high compared
> to simple stmt ops so it's very hard to offset saving many
> stores with "ops".  That's because we generally think of
> 'cost' to model latency but as you say stores don't really
> have latency - we only have store bandwidth of the store

Yes.

> unit and of course issue width (but that's true for other ops
> as well).  I wonder what happens if we set both scalar and
> vector store cost to zero?  Or maybe one (to count one
> issue slot)?

I tried to reduce the cost of the scalar store, but it regressed in oth
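
For reference, the two code shapes being compared in this thread can be sketched as below. This is an illustration only, written with the GNU vector extension (GCC or Clang assumed); store_scalar and store_vector are hypothetical names, and whether either form is faster on a given machine is exactly the open question discussed above.

```c
#include <assert.h>
#include <string.h>

typedef int v4si __attribute__((vector_size(16)));

/* Scalar form: four independent stores, each of which can issue as
   soon as its own value is ready.  */
void store_scalar(int *a, int b0, int b1, int b2, int b3)
{
  a[0] = b0;
  a[1] = b1;
  a[2] = b2;
  a[3] = b3;
}

/* Vector form: a vec_construct followed by one 16-byte store.  The
   construct needs all four values before the store can happen, which
   is the extra dependence mentioned in the patch description.  */
void store_vector(int *a, int b0, int b1, int b2, int b3)
{
  v4si v = {b0, b1, b2, b3};
  memcpy(a, &v, sizeof v);
}
```

Both functions leave the same bytes in memory; they differ only in how many stores are issued and in what must be available before the first store.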

Re: [PATCH v3 00/16] Support Intel APX NDD

2023-12-06 Thread Hongtao Liu
On Wed, Dec 6, 2023 at 8:11 PM Uros Bizjak  wrote:
>
> On Wed, Dec 6, 2023 at 9:08 AM Hongyu Wang  wrote:
> >
> > Hi,
> >
> > Following up the discussion of V2 patches in
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-December/639368.html,
> > > this patch series adds early clobber for all TImode NDD alternatives
> > to avoid any potential overlapping between dest register and src
> > register/memory. Also use get_attr_isa (insn) == ISA_APX_NDD instead of
> > checking alternative at asm output stage.
> >
> > Bootstrapped & regtested on x86_64-pc-linux-gnu{-m32,} and sde.
> >
> > Ok for master?
>
> LGTM, but Hongtao should have the final approval here.
Ok, thanks.
>
> Thanks,
> Uros.
>
> >
> > Hongyu Wang (7):
> >   [APX NDD] Disable seg_prefixed memory usage for NDD add
> >   [APX NDD] Support APX NDD for left shift insns
> >   [APX NDD] Support APX NDD for right shift insns
> >   [APX NDD] Support APX NDD for rotate insns
> >   [APX NDD] Support APX NDD for shld/shrd insns
> >   [APX NDD] Support APX NDD for cmove insns
> >   [APX NDD] Support TImode shift for NDD
> >
> > Kong Lingling (9):
> >   [APX NDD] Support Intel APX NDD for legacy add insn
> >   [APX NDD] Support APX NDD for optimization patterns of add
> >   [APX NDD] Support APX NDD for adc insns
> >   [APX NDD] Support APX NDD for sub insns
> >   [APX NDD] Support APX NDD for sbb insn
> >   [APX NDD] Support APX NDD for neg insn
> >   [APX NDD] Support APX NDD for not insn
> >   [APX NDD] Support APX NDD for and insn
> >   [APX NDD] Support APX NDD for or/xor insn
> >
> >  gcc/config/i386/constraints.md|5 +
> >  gcc/config/i386/i386-expand.cc|  164 +-
> >  gcc/config/i386/i386-options.cc   |2 +
> >  gcc/config/i386/i386-protos.h |   16 +-
> >  gcc/config/i386/i386.cc   |   30 +-
> >  gcc/config/i386/i386.md   | 2325 +++--
> >  gcc/testsuite/gcc.target/i386/apx-ndd-adc.c   |   15 +
> >  gcc/testsuite/gcc.target/i386/apx-ndd-cmov.c  |   16 +
> >  gcc/testsuite/gcc.target/i386/apx-ndd-sbb.c   |6 +
> >  .../gcc.target/i386/apx-ndd-shld-shrd.c   |   24 +
> >  .../gcc.target/i386/apx-ndd-ti-shift.c|   91 +
> >  gcc/testsuite/gcc.target/i386/apx-ndd.c   |  202 ++
> >  12 files changed, 2141 insertions(+), 755 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-adc.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-cmov.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-sbb.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-shld-shrd.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd-ti-shift.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/apx-ndd.c
> >
> > --
> > 2.31.1
> >



-- 
BR,
Hongtao


Re: [V2 PATCH] Simplify vector ((VCE (a cmp b ? -1 : 0)) < 0) ? c : d to just (VCE ((a cmp b) ? (VCE c) : (VCE d))).

2023-12-07 Thread Hongtao Liu
ping.

On Thu, Nov 16, 2023 at 6:49 PM liuhongt  wrote:
>
> Update in V2:
> 1) Add some comments before the pattern.
> 2) Remove ? from view_convert.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?
>
> While working on PR112443, I noticed some missed optimizations:
> after we fold _mm{,256}_blendv_epi8/pd/ps into gimple, the backend
> fails to combine it back to v{,p}blendv{v,ps,pd} since the pattern is
> too complicated, so I think we should handle it at the gimple
> level.
>
> The dump is like
>
>   _1 = c_3(D) >= { 0, 0, 0, 0 };
>   _2 = VEC_COND_EXPR <_1, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }>;
>   _7 = VIEW_CONVERT_EXPR(_2);
>   _8 = VIEW_CONVERT_EXPR(b_6(D));
>   _9 = VIEW_CONVERT_EXPR(a_5(D));
>   _10 = _7 < { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
>   _11 = VEC_COND_EXPR <_10, _8, _9>;
>
> It can be optimized to
>
>   _1 = c_2(D) >= { 0, 0, 0, 0 };
>   _6 = VEC_COND_EXPR <_1, b_5(D), a_4(D)>;
>
> Since _7 is either -1 or 0, the selection _7 < 0 ? _8 : _9 should
> be equal to _1 ? b : a as long as TYPE_PRECISION of the component type
> of the second VEC_COND_EXPR is less than or equal to that of the first one.
> The patch adds a gimple pattern to handle that.
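
To illustrate the equivalence described above, here is a minimal scalar
sketch in plain C. It is not the match.pd pattern and does not use GCC
internals; it only models one 32-bit lane of the comparison result
reinterpreted as four signed bytes. All function names are made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Models ((VCE (cond ? -1 : 0)) < 0) ? c : d for one 32-bit lane that is
   viewed as four signed 8-bit lanes.  Hypothetical helper, not GCC code.  */
static uint32_t select_via_byte_mask(int cond, uint32_t c, uint32_t d)
{
    int32_t mask = cond ? -1 : 0;   /* inner vec_cond: all ones or zero */
    int8_t mask_bytes[4];
    uint8_t cb[4], db[4], out[4];
    uint32_t result;

    /* The memcpy calls play the role of VIEW_CONVERT_EXPR.  */
    memcpy(mask_bytes, &mask, sizeof mask);
    memcpy(cb, &c, sizeof c);
    memcpy(db, &d, sizeof d);

    /* Outer vec_cond: select per byte on the sign of the mask byte.  */
    for (int i = 0; i < 4; i++)
        out[i] = mask_bytes[i] < 0 ? cb[i] : db[i];

    memcpy(&result, out, sizeof result);
    return result;
}

/* Models the simplified form: select directly on the comparison.  */
static uint32_t select_direct(int cond, uint32_t c, uint32_t d)
{
    return cond ? c : d;
}

/* Checks that both forms agree for both values of the condition.  */
static void check_equivalence(void)
{
    for (int cond = 0; cond <= 1; cond++)
        assert(select_via_byte_mask(cond, 0x11223344u, 0xAABBCCDDu)
               == select_direct(cond, 0x11223344u, 0xAABBCCDDu));
}
```

Every byte of an all-ones mask is negative and every byte of a zero mask
is not, so the narrower lanes all make the same choice as the original
comparison. If the outer lanes were wider than the mask lanes, one outer
lane could span mask lanes with different values and its sign would only
reflect the topmost one, which is presumably why the pattern restricts the
component sizes.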
>
> gcc/ChangeLog:
>
> * match.pd (VCE (a cmp b ? -1 : 0) < 0) ? c : d ---> (VCE ((a
> cmp b) ? (VCE:c) : (VCE:d))): New gimple simplification.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/avx512vl-blendv-3.c: New test.
> * gcc.target/i386/blendv-3.c: New test.
> ---
>  gcc/match.pd  | 22 +
>  .../gcc.target/i386/avx512vl-blendv-3.c   |  6 +++
>  gcc/testsuite/gcc.target/i386/blendv-3.c  | 46 +++
>  3 files changed, 74 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-blendv-3.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/blendv-3.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index dbc811b2b38..2a69622a300 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -5170,6 +5170,28 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>   (if (optimize_vectors_before_lowering_p () && types_match (@0, @3))
>(vec_cond (bit_and @0 (bit_not @3)) @2 @1)))
>
> +/*  ((VCE (a cmp b ? -1 : 0)) < 0) ? c : d is just
> +(VCE ((a cmp b) ? (VCE c) : (VCE d))) when TYPE_PRECISION of the
> +component type of the outer vec_cond is greater equal the inner one.  */
> +(for cmp (simple_comparison)
> + (simplify
> +  (vec_cond
> +(lt (view_convert@5 (vec_cond@6 (cmp@4 @0 @1)
> +   integer_all_onesp
> +   integer_zerop))
> + integer_zerop) @2 @3)
> +  (if (VECTOR_INTEGER_TYPE_P (TREE_TYPE (@0))
> +   && VECTOR_INTEGER_TYPE_P (TREE_TYPE (@5))
> +   && !TYPE_UNSIGNED (TREE_TYPE (@5))
> +   && VECTOR_TYPE_P (TREE_TYPE (@6))
> +   && VECTOR_TYPE_P (type)
> +   && (TYPE_PRECISION (TREE_TYPE (type))
> + <= TYPE_PRECISION (TREE_TYPE (TREE_TYPE (@6))))
> +   && TYPE_SIZE (type) == TYPE_SIZE (TREE_TYPE (@6)))
> +   (with { tree vtype = TREE_TYPE (@6);}
> + (view_convert:type
> +   (vec_cond @4 (view_convert:vtype @2) (view_convert:vtype @3)))))))
> +
>  /* c1 ? c2 ? a : b : b  -->  (c1 & c2) ? a : b  */
>  (simplify
>   (vec_cond @0 (vec_cond:s @1 @2 @3) @3)
> diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-blendv-3.c 
> b/gcc/testsuite/gcc.target/i386/avx512vl-blendv-3.c
> new file mode 100644
> index 0000000..2777e72ab5f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx512vl-blendv-3.c
> @@ -0,0 +1,6 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx512vl -mavx512bw -O2" } */
> +/* { dg-final { scan-assembler-times {vp?blendv(?:b|p[sd])[ \t]*} 6 } } */
> +/* { dg-final { scan-assembler-not {vpcmp} } } */
> +
> +#include "blendv-3.c"
> diff --git a/gcc/testsuite/gcc.target/i386/blendv-3.c 
> b/gcc/testsuite/gcc.target/i386/blendv-3.c
> new file mode 100644
> index 0000000..fa0fb067a73
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/blendv-3.c
> @@ -0,0 +1,46 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx2 -O2" } */
> +/* { dg-final { scan-assembler-times {vp?blendv(?:b|p[sd])[ \t]*} 6 } } */
> +/* { dg-final { scan-assembler-not {vpcmp} } } */
> +
> +#include 
> +
> +__m256i
> +foo (__m256i a, __m256i b, __m256i c)
> +{
> +  return _mm256_blendv_epi8 (a, b, ~c < 0);
> +}
> +
> +__m256d
> +foo1 (__m256d a, __m256d b, __m256i c)
> +{
> +  __m256i d = ~c < 0;
> +  return _mm256_blendv_pd (a, b, (__m256d)d);
> +}
> +
> +__m256
> +foo2 (__m256 a, __m256 b, __m256i c)
> +{
> +  __m256i d = ~c < 0;
> +  return _mm256_blendv_ps (a, b, (__m256)d);
> +}
> +
> +__m128i
> +foo4 (__m128i a, __m128i b, __m128i c)
> +{
> +  return _mm_blendv_epi8 (a, b, ~c < 0);
> +}
> +
> +__m128d
> +foo5 (__m128d a, __m128d b, __m128i c)
> +{
> +  __m128i d = ~c < 0;
> +  return _mm_blendv_pd (a, b, (__m128d)d);
> +}
> +
> +__m128
> +foo6 

Re: [PATCH] i386: Mark Xeon Phi ISAs as deprecated

2023-12-07 Thread Hongtao Liu
On Wed, Dec 6, 2023 at 3:52 PM Richard Biener
 wrote:
>
> On Wed, Dec 6, 2023 at 3:33 AM Jiang, Haochen  wrote:
> >
> > > -----Original Message-----
> > > From: Jiang, Haochen
> > > Sent: Friday, December 1, 2023 4:51 PM
> > > To: Richard Biener 
> > > Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao ;
> > > ubiz...@gmail.com
> > > Subject: RE: [PATCH] i386: Mark Xeon Phi ISAs as deprecated
> > >
> > > > -----Original Message-----
> > > > From: Richard Biener 
> > > > Sent: Friday, December 1, 2023 4:37 PM
> > > > To: Jiang, Haochen 
> > > > Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao ;
> > > > ubiz...@gmail.com
> > > > Subject: Re: [PATCH] i386: Mark Xeon Phi ISAs as deprecated
> > > >
> > > > On Fri, Dec 1, 2023 at 8:34 AM Jiang, Haochen 
> > > > wrote:
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Richard Biener 
> > > > > > Sent: Friday, December 1, 2023 3:04 PM
> > > > > > To: Jiang, Haochen 
> > > > > > Cc: gcc-patches@gcc.gnu.org; Liu, Hongtao ;
> > > > > > ubiz...@gmail.com
> > > > > > Subject: Re: [PATCH] i386: Mark Xeon Phi ISAs as deprecated
> > > > > >
> > > > > > > On Fri, Dec 1, 2023 at 3:22 AM Haochen Jiang wrote:
> > > > > > >
> > > > > > > > Since Knights Landing and Knights Mill microarchitectures are EOL, we
> > > > > > > > would like to remove support for them in GCC 15. In GCC 14, we will first
> > > > > > > > emit a warning for the usage.
> > > > > >
> > > > > > > I think it's better to keep supporting -mtune/arch=knl without diagnostics
> > > > >
> > > > > > I see, that could be an option and might be better. But if we take it,
> > > > > > how we should define -mtune=knl remains a question.
> > > >
> > > > I'd say mapping it to a "close" micro-architecture makes most sense, but
> > > > we could also simply keep the tuning entry for knl?
> > >
> > > > Actually I have written a removal test patch. One of the issues is that
> > > > there is something specific to knl in the tuning for VZEROUPPER, which is
> > > > also reflected in PR82990.
> > >
> > > > /* X86_TUNE_EMIT_VZEROUPPER: This enables vzeroupper instruction insertion
> > > >    before a transfer of control flow out of the function.  */
> > > > DEF_TUNE (X86_TUNE_EMIT_VZEROUPPER, "emit_vzeroupper", ~m_KNL)
> > >
> > > If we chose to keep them, this behavior will be changed.
> >
> > Hi Richard,
> >
> > After thinking it over again, I suppose we should still remove the arch/tune
> > options here to avoid misleading behavior, since something will always be
> > changed.
> >
> > What is your concern about removing? Do you have anything that relies on the
> > tune and arch?
>
> We usually promise backwards compatibility with respect to accepted options
> which is why we have things like
>
> ftree-vect-loop-version
> Common Ignore
> Does nothing. Preserved for backward compatibility.
>
> the backend errors on unknown march/tune and that would be a regression
> for build systems using that (even if that's indeed very unlikely).  That's why
> I suggested to make it still do something (doing "nothing", aka keeping generic
> is probably worse than dropping).  I guess having -march=knl behave differently
> is also bad so I guess there's not a good solution for that.
To avoid confusion,  I prefer to remove all of them.
>
> So - just to have made the above point, I'm fine with what x86 maintainers
> decide here.
>
> Richard.
>
> > Thx,
> > Haochen
> >
> > >
> > > >
> > > > > > > but simply not enable the ISAs we don't support.  The better question is
> > > > > > > what to do about KNL specific intrinsics headers / intrinsics?  Will we
> > > > > > > simply remove those?
> > > > >
> > > > > > If there is no objection, the intrinsics are planned to be removed in GCC 15.
> > > > > > As far as we know, almost nobody is using them with the latest GCC, and
> > > > > > there was no complaint when they were removed from ICC/ICX.
> > > >
> > > > > I see.  Replacing the header contents with #error "XYZ is no longer supported"
> > > > > might be nicer.  OTOH x86intrin.h should simply no longer include them.
> > >
> > > That is nicer. I will take that in GCC 15 patch.
> > >
> > > Thx,
> > > Haochen
> > >
> > > >
> > > > Richard.
> > > >
> > > > > Thx,
> > > > > Haochen
> > > > >
> > > > > >
> > > > > > Richard.
> > > > > >
> > > > > > > gcc/ChangeLog:
> > > > > > >
> > > > > > > * config/i386/driver-i386.cc (host_detect_local_cpu):
> > > > > > > Do not append "-mno-" for Xeon Phi ISAs.
> > > > > > > > * config/i386/i386-options.cc (ix86_option_override_internal):
> > > > > > > Emit a warning for KNL/KNM targets.
> > > > > > > * config/i386/i386.opt: Emit a warning for Xeon Phi ISAs.
> > > > > > >
> > > > > > > gcc/testsuite/ChangeLog:
> > > > > > >
> > > > > > > * g++.dg/other/i386-2.C: Adjust testcases.
> > > > > > > * g++.dg/other/i386-3.C: Ditto.
> > > > > > >   

Re: [v3 PATCH] Simplify vector ((VCE (a cmp b ? -1 : 0)) < 0) ? c : d to just (VCE ((a cmp b) ? (VCE c) : (VCE d))).

2023-12-11 Thread Hongtao Liu
On Mon, Dec 11, 2023 at 4:14 PM Richard Biener
 wrote:
>
> On Mon, Dec 11, 2023 at 7:51 AM liuhongt  wrote:
> >
> > > since you are looking at TYPE_PRECISION below you want
> > > VECTOR_INTEGER_TYPE_P here as well?  The alternative
> > > would be to compare TYPE_SIZE.
> > >
> > > Some of the checks feel redundant but are probably good for
> > > documentation purposes.
> > >
> > > OK with using VECTOR_INTEGER_TYPE_P
> > Actually, the data type doesn't need to be integer, i.e. x86 supports vblendvps,
> > so I'm using TYPE_SIZE here, the code is adjusted to
> >
> > && tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (type)))
> > && (tree_to_uhwi (TYPE_SIZE (TREE_TYPE (type)))
> >    <= TYPE_PRECISION (TREE_TYPE (TREE_TYPE (@6))))
> >
> > Here's the updated patch.
> > Ok for trunk?
> >
> > While working on PR112443, I noticed some missed optimizations:
> > after we fold _mm{,256}_blendv_epi8/pd/ps into gimple, the backend
> > fails to combine it back to v{,p}blendv{v,ps,pd} since the pattern is
> > too complicated, so I think we should handle it at the gimple
> > level.
> >
> > The dump is like
> >
> >   _1 = c_3(D) >= { 0, 0, 0, 0 };
> >   _2 = VEC_COND_EXPR <_1, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }>;
> >   _7 = VIEW_CONVERT_EXPR(_2);
> >   _8 = VIEW_CONVERT_EXPR(b_6(D));
> >   _9 = VIEW_CONVERT_EXPR(a_5(D));
> >   _10 = _7 < { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
> >   _11 = VEC_COND_EXPR <_10, _8, _9>;
> >
> > It can be optimized to
> >
> >   _1 = c_2(D) >= { 0, 0, 0, 0 };
> >   _6 = VEC_COND_EXPR <_1, b_5(D), a_4(D)>;
> >
> > Since _7 is either -1 or 0, the selection _7 < 0 ? _8 : _9 should
> > be equal to _1 ? b : a as long as TYPE_PRECISION of the component type
> > of the second VEC_COND_EXPR is less than or equal to that of the first one.
> > The patch adds a gimple pattern to handle that.
> >
> > gcc/ChangeLog:
> >
> > * match.pd (VCE (a cmp b ? -1 : 0) < 0) ? c : d ---> (VCE ((a
> > cmp b) ? (VCE:c) : (VCE:d))): New gimple simplification.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/avx512vl-blendv-3.c: New test.
> > * gcc.target/i386/blendv-3.c: New test.
> > ---
> >  gcc/match.pd  | 23 ++
> >  .../gcc.target/i386/avx512vl-blendv-3.c   |  6 +++
> >  gcc/testsuite/gcc.target/i386/blendv-3.c  | 46 +++
> >  3 files changed, 75 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/avx512vl-blendv-3.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/blendv-3.c
> >
> > diff --git a/gcc/match.pd b/gcc/match.pd
> > index 4d554ba4721..359c7b07dc3 100644
> > --- a/gcc/match.pd
> > +++ b/gcc/match.pd
> > @@ -5190,6 +5190,29 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> >   (if (optimize_vectors_before_lowering_p () && types_match (@0, @3))
> >(vec_cond (bit_and @0 (bit_not @3)) @2 @1)))
> >
> > +/*  ((VCE (a cmp b ? -1 : 0)) < 0) ? c : d is just
> > +(VCE ((a cmp b) ? (VCE c) : (VCE d))) when TYPE_PRECISION of the
> > +component type of the outer vec_cond is greater equal the inner one.  */
> > +(for cmp (simple_comparison)
> > + (simplify
> > +  (vec_cond
> > +(lt (view_convert@5 (vec_cond@6 (cmp@4 @0 @1)
> > +   integer_all_onesp
> > +   integer_zerop))
> > + integer_zerop) @2 @3)
> > +  (if (VECTOR_INTEGER_TYPE_P (TREE_TYPE (@0))
> > +   && VECTOR_INTEGER_TYPE_P (TREE_TYPE (@5))
> > +   && !TYPE_UNSIGNED (TREE_TYPE (@5))
> > +   && VECTOR_TYPE_P (TREE_TYPE (@6))
> > +   && VECTOR_TYPE_P (type)
> > +   && tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (type)))
> > +   && (tree_to_uhwi (TYPE_SIZE (TREE_TYPE (type)))
> > + <= TYPE_PRECISION (TREE_TYPE (TREE_TYPE (@6))))
>
> sorry for nitpicking, but can you please use
>
> && tree_int_cst_le (TYPE_SIZE (TREE_TYPE (type)),
>  TYPE_SIZE (TREE_TYPE (TREE_TYPE (@6))))
>
> thus not use precision on one and size on the other type?
>
> OK with that change.
Thanks, committed.
>
> Richard.
>
> > +   && TYPE_SIZE (type) == TYPE_SIZE (TREE_TYPE (@6)))
> > +   (with { tree vtype = TREE_TYPE (@6);}
> > + (view_convert:type
> > +   (vec_cond @4 (view_convert:vtype @2) (view_convert:vtype @3)))))))
> > +
> >  /* c1 ? c2 ? a : b : b  -->  (c1 & c2) ? a : b  */
> >  (simplify
> >   (vec_cond @0 (vec_cond:s @1 @2 @3) @3)
> > diff --git a/gcc/testsuite/gcc.target/i386/avx512vl-blendv-3.c 
> > b/gcc/testsuite/gcc.target/i386/avx512vl-blendv-3.c
> > new file mode 100644
> > index 0000000..2777e72ab5f
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/avx512vl-blendv-3.c
> > @@ -0,0 +1,6 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-mavx512vl -mavx512bw -O2" } */
> > +/* { dg-final { scan-assembler-times {vp?blendv(?:b|p[sd])[ \t]*} 6 } } */
> > +/* { dg-final { scan-assembler-not {vpcmp} } } */
> > +
> > +#include "blendv-3.c"

Re: [PATCH] i386: Fix missed APX_NDD check for shift/rotate expanders [PR 112943]

2023-12-11 Thread Hongtao Liu
On Mon, Dec 11, 2023 at 8:39 PM Hongyu Wang  wrote:
>
> > > +__int128 u128_2 = (9223372036854775808 << 4) * foo0_u8_0; /* { dg-warning "integer constant is so large that it is unsigned" "so large" } */
> >
> > You can just use (9223372036854775807LL + (__int128) 1) instead of
> > 9223372036854775808 to avoid the warning.
> > The testcase will ICE without the patch even with that.
>
> Thanks for the hint! Will adjust when pushing the patch.
Ok.
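
For reference, a minimal sketch of the suggestion quoted above: it builds
2^63 without spelling out the literal 9223372036854775808, which is what
triggers the "so large that it is unsigned" warning. This assumes a
compiler and target that provide the __int128 extension (e.g. GCC on
x86_64); the helper names are made up for illustration and are not taken
from the testcase.

```c
#include <assert.h>

/* 9223372036854775807LL is LLONG_MAX on x86_64; adding an __int128 one
   promotes the sum to 128 bits, so 2^63 is formed without overflow and
   without an out-of-range decimal literal.  Hypothetical helper.  */
static __int128 two_pow_63(void)
{
    return 9223372036854775807LL + (__int128) 1;
}

/* Checks the value and the "<< 4" step used in the quoted testcase line.  */
static void check_two_pow_63(void)
{
    assert(two_pow_63() == ((__int128) 1 << 63));
    assert((two_pow_63() << 4) == ((__int128) 1 << 67));
}
```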



-- 
BR,
Hongtao


Re: [PATCH] Don't assume it's AVX_U128_CLEAN after call_insn whose abi.mode_clobber(V4DImode) doesn't contain all SSE_REGS.

2023-12-11 Thread Hongtao Liu
On Fri, Dec 8, 2023 at 10:17 AM liuhongt  wrote:
>
> If the function doesn't clobber any SSE registers, or only clobbers
> the 128-bit part, then vzeroupper isn't issued before the function exit,
> so the status after the function is not CLEAN but ANY.
>
> Also, for sibling_call, it's safe to issue a vzeroupper. Otherwise there
> could be a missing vzeroupper, since there's no mode_exit for
> sibling_call_p.
>
> Compared to the patch in the PR, this patch adds the sibling_call part.
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk and backport?
Part of this has been approved in the PR, and for the sibling_call
part, I think it should be reasonable.
So I'm going to commit the patch.
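
As a rough illustration of the mode-after-call decision in the quoted
patch, here is a simplified, self-contained model in C. It is only a
sketch: the struct and its fields are hypothetical stand-ins for what
note_stores, SIBLING_CALL_P and the callee ABI's mode_clobbers report in
the real i386.cc code, and only the three state names mirror the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* State names mirror the ones used in the quoted patch.  */
enum avx_u128_state { AVX_U128_CLEAN, AVX_U128_DIRTY, AVX_U128_ANY };

/* Hypothetical summary of a call insn; not a GCC data structure.  */
struct call_summary {
    bool stores_avx_upper;      /* the call writes a 256-bit register */
    bool is_sibling_call;       /* stand-in for SIBLING_CALL_P (insn) */
    bool clobbers_all_sse_256;  /* callee ABI clobbers all SSE regs in
                                   V4DImode */
};

/* Simplified model of the state after a call, following the patch:
   DIRTY if an upper half is written, ANY if the callee may have left the
   upper halves untouched, CLEAN otherwise.  */
static enum avx_u128_state mode_after_call(const struct call_summary *c)
{
    if (c->stores_avx_upper)
        return AVX_U128_DIRTY;
    if (!(c->is_sibling_call || c->clobbers_all_sse_256))
        return AVX_U128_ANY;
    return AVX_U128_CLEAN;
}

/* Exercises the three outcomes of the model.  */
static void check_mode_after_call(void)
{
    struct call_summary dirty = { true, false, true };
    struct call_summary partial = { false, false, false };
    struct call_summary full = { false, false, true };

    assert(mode_after_call(&dirty) == AVX_U128_DIRTY);
    assert(mode_after_call(&partial) == AVX_U128_ANY);
    assert(mode_after_call(&full) == AVX_U128_CLEAN);
}
```

The middle case is the one the patch changes: before it, a callee that
does not clobber the full 256-bit SSE registers was still treated as
leaving the state CLEAN.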
>
> gcc/ChangeLog:
>
> PR target/112891
> * config/i386/i386.cc (ix86_avx_u128_mode_after): Return
> AVX_U128_ANY if callee_abi doesn't clobber all_sse_regs to
> align with ix86_avx_u128_mode_needed.
> (ix86_avx_u128_mode_needed): Return AVX_U128_CLEAN for
> sibling_call.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr112891.c: New test.
> * gcc.target/i386/pr112891-2.c: New test.
> ---
>  gcc/config/i386/i386.cc| 22 +---
>  gcc/testsuite/gcc.target/i386/pr112891-2.c | 30 ++
>  gcc/testsuite/gcc.target/i386/pr112891.c   | 29 +
>  3 files changed, 78 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr112891-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr112891.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 7c5cab4e2c6..fe259cdb789 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -15038,8 +15038,12 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
>  vzeroupper if all SSE registers are clobbered.  */
>const function_abi &abi = insn_callee_abi (insn);
>if (vzeroupper_pattern (PATTERN (insn), VOIDmode)
> - || !hard_reg_set_subset_p (reg_class_contents[SSE_REGS],
> -abi.mode_clobbers (V4DImode)))
> + /* Should be safe to issue an vzeroupper before sibling_call_p.
> +Also there not mode_exit for sibling_call, so there could be
> +missing vzeroupper for that.  */
> + || !(SIBLING_CALL_P (insn)
> +  || hard_reg_set_subset_p (reg_class_contents[SSE_REGS],
> +abi.mode_clobbers (V4DImode))))
> return AVX_U128_ANY;
>
>return AVX_U128_CLEAN;
> @@ -15177,7 +15181,19 @@ ix86_avx_u128_mode_after (int mode, rtx_insn *insn)
>bool avx_upper_reg_found = false;
>note_stores (insn, ix86_check_avx_upper_stores, &avx_upper_reg_found);
>
> -  return avx_upper_reg_found ? AVX_U128_DIRTY : AVX_U128_CLEAN;
> +  if (avx_upper_reg_found)
> +   return AVX_U128_DIRTY;
> +
> +  /* If the function desn't clobber any sse registers or only clobber
> +128-bit part, Then vzeroupper isn't issued before the function exit.
> +the status not CLEAN but ANY after the function.  */
> +  const function_abi &abi = insn_callee_abi (insn);
> +  if (!(SIBLING_CALL_P (insn)
> +   || hard_reg_set_subset_p (reg_class_contents[SSE_REGS],
> + abi.mode_clobbers (V4DImode))))
> +   return AVX_U128_ANY;
> +
> +  return  AVX_U128_CLEAN;
>  }
>
>/* Otherwise, return current mode.  Remember that if insn
> diff --git a/gcc/testsuite/gcc.target/i386/pr112891-2.c 
> b/gcc/testsuite/gcc.target/i386/pr112891-2.c
> new file mode 100644
> index 000..164c3985d50
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr112891-2.c
> @@ -0,0 +1,30 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx2 -O3" } */
> +/* { dg-final { scan-assembler-times "vzeroupper" 1 } } */
> +
> +void
> +__attribute__((noinline))
> +bar (double* a)
> +{
> +  a[0] = 1.0;
> +  a[1] = 2.0;
> +}
> +
> +double
> +__attribute__((noinline))
> +foo (double* __restrict a, double* b)
> +{
> +  a[0] += b[0];
> +  a[1] += b[1];
> +  a[2] += b[2];
> +  a[3] += b[3];
> +  bar (b);
> +  return a[5] + b[5];
> +}
> +
> +double
> +foo1 (double* __restrict a, double* b)
> +{
> +  double c = foo (a, b);
> +  return __builtin_exp (c);
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/pr112891.c 
> b/gcc/testsuite/gcc.target/i386/pr112891.c
> new file mode 100644
> index 0000000..dbf6c67948a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr112891.c
> @@ -0,0 +1,29 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mavx2 -O3" } */
> +/* { dg-final { scan-assembler-times "vzeroupper" 1 } } */
> +
> +void
> +__attribute__((noinline))
> +bar (double* a)
> +{
> +  a[0] = 1.0;
> +  a[1] = 2.0;
> +}
> +
> +void
> +__attribute__((noinline))
> +foo (double* __restrict a, double* b)
> +{
> +  a[0] += b[0];
> +  a[1] += b[1];
> +  a[2] += b[2];
> +  a[3] += b[3];
> +  bar (b);
> +}
> +
> +double
> +foo1 (dou
