Re: [PATCH] Add COMPLEX_VECTOR_INT modes

2023-05-29 Thread Richard Biener via Gcc-patches
On Fri, May 26, 2023 at 4:35 PM Andrew Stubbs  wrote:
>
> Hi all,
>
> I want to implement a vector DIVMOD libfunc for amdgcn, but I can't just
> do it because the GCC middle-end models DIVMOD's return value as
> "complex int" type, and there are no vector equivalents of that type.
>
> Therefore, this patch adds minimal support for "complex vector int"
> modes.  I have not attempted to provide any means to use these modes
> from C, so they're really only useful for DIVMOD.  The actual libfunc
> implementation will pack the data into wider vector modes manually.
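> 
> For illustration (a hypothetical 4-lane example in GNU C), the packing
> could look like:
> 
>   typedef int v4si __attribute__ ((vector_size (16)));
>   typedef int v8si __attribute__ ((vector_size (32)));
> 
>   v8si
>   vec_divmod (v4si a, v4si b)
>   {
>     v8si res;
>     for (int i = 0; i < 4; i++)
>       {
>         res[i] = a[i] / b[i];      /* quotient lanes */
>         res[i + 4] = a[i] % b[i];  /* remainder lanes */
>       }
>     return res;
>   }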
>
> A knock-on effect of this is that I needed to increase the range of
> "mode_unit_size" (several of the vector modes supported by amdgcn exceed
> the previous 255-byte limit).
>
> Since this change would add a large number of new, unused modes to many
> architectures, I have elected to *not* enable them, by default, in
> machmode.def (where the other complex modes are created).  The new modes
> are therefore inactive on all architectures but amdgcn, for now.
>
> OK for mainline?  (I've not done a full test yet, but I will.)

I think it makes more sense to map vector CSImode to vector SImode with
double the number of lanes.  In fact, since divmod is a libgcc function,
I wonder where your vector variant would reside and how GCC decides to
emit calls to it?  That is, there's no way to OMP simd declare this function?

Richard.

> Thanks
>
> Andrew


Re: [PATCH] Replace a HWI_COMPUTABLE_MODE_P with wide-int in simplify-rtx.cc.

2023-05-29 Thread Richard Biener via Gcc-patches
On Fri, May 26, 2023 at 8:44 PM Roger Sayle  wrote:
>
>
> This patch enhances one of the optimizations in simplify_binary_operation_1
> to allow it to simplify RTL expressions in modes wider than HOST_WIDE_INT by
> replacing a use of HWI_COMPUTABLE_MODE_P and UINTVAL with wide_int.
>
> The motivating example is a pending x86_64 backend patch that produces
> the following RTL in combine:
>
> (and:TI (zero_extend:TI (reg:DI 89))
> (const_wide_int 0x0))
>
> where the AND is redundant, as the mask, ~0LL, is DImode's MODE_MASK.
> There's already an optimization that catches this for narrower modes,
> transforming (and:HI (zero_extend:HI (reg:QI x)) (const_int 0xff))
> into (zero_extend:HI (reg:QI x)), but this currently only handles
> CONST_INT not CONST_WIDE_INT.  Fixed by upgrading this transformation
> to use wide_int, specifically rtx_mode_t and wi::mask.
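> 
> A sketch of the upgraded check (an illustration only, reusing
> simplify-rtx.cc's op0/trueop1/inner_mode naming, not the exact hunk):
> 
>   if (CONST_SCALAR_INT_P (trueop1)
>       && rtx_mode_t (trueop1, mode)
>          == wi::mask (GET_MODE_PRECISION (inner_mode), false,
>                       GET_MODE_PRECISION (mode)))
>     /* The AND mask covers all of inner_mode, so the AND is redundant
>        and the zero_extend alone suffices.  */
>     return op0;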
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?

OK.

Thanks,
Richard.

>
> 2023-05-23  Roger Sayle  
>
> gcc/ChangeLog
> * simplify-rtx.cc (simplify_binary_operation_1) <AND>: Use wide-int
> instead of HWI_COMPUTABLE_MODE_P and UINTVAL in transformation of
> (and (extend X) C) as (zero_extend (and X C)), to also optimize
> modes wider than HOST_WIDE_INT.
>
>
> Thanks in advance,
> Roger
> --
>


Re: [RFC] light expander sra for parameters and returns

2023-05-29 Thread Richard Biener via Gcc-patches
On Mon, 29 May 2023, Jiufu Guo wrote:

> Hi,
> 
> Previously, I was investigating some PRs related to struct parameters
> and returns: PRs 69143/65421/108073.
> 
> Investigating the issues case by case and drafting patches for each of
> them one by one would help us to enhance the code incrementally.  But
> this way the patches would interact with each other and implement
> different code for similar issues (because of the different paths in
> gimple/rtl).  A common fix for those issues may be preferable.
> 
> We know a few other related PRs (such as meta-bug PR101926) exist.  For
> those PRs on different targets with different symptoms (and also
> different root causes), I would expect one method to help some of them,
> but it may be hard to handle all of them in one fix.
> 
> From the investigation and discussion of the issues, I remember a
> suggestion from Richard: it would be nice to perform some SRA-like
> analysis for the accesses on the structs (parameters/returns).
> https://gcc.gnu.org/pipermail/gcc-patches/2022-November/605117.html
> This may be a 'fairly common method' for those issues.  With this idea,
> I drafted a patch as below in this mail.
> 
> I also thought about directly using tree-sra.cc, e.g. enhancing it and
> rerunning it at the end of the GIMPLE passes.  But since some issues are
> introduced inside the expander, the patch below also cooperates with
> other parts of the expander.  And since we already run tree-sra as a
> GIMPLE pass, this patch only needs to take extra care of parameters and
> returns; other decls are handled well by tree-sra.
> 
> The steps of this patch are:
> 1. Collect struct-type parameters and returns, and then scan the function
> to get the accesses on them, figuring out which accesses would be
> profitable to scalarize (using the registers of the parameter/return).
> Currently, reads of parameters and writes to returns are checked.
> 2. When/after the scalar registers are determined/expanded for the return
> or parameters, compute the corresponding scalar register(s) for each
> access of the return/parameter, and prepare the scalar RTLs for those
> accesses.
> 3. When using/expanding the access expressions, leverage the prepared
> scalars directly.
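> 
> For illustration (a hypothetical example), this targets cases like a
> small struct argument that is returned directly, which should use the
> incoming parameter registers instead of going through a stack slot:
> 
>   struct S { double a, b; };
>   struct S ret (struct S s) { return s; }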
> 
> This patch is tested on ppc64 both LE and BE.
> To continue, I would ask for comments and suggestions first. And then I would
> update/enhance accordingly.  Thanks in advance!

Thanks for working on this - the description above sounds exactly like
what should be done.

Now - I'd like the code to re-use the access tree data structure from
SRA plus at least the worker creating the accesses from a stmt.

The RTL expansion code already does a sweep over stmts in
discover_nonconstant_array_refs which makes sure RTL expansion doesn't
scalarize (aka assign a non-stack location to) variables that have
accesses which would later FAIL to expand when operating on registers.
That's very much related to the task at hand so we should try to
at least merge the CFG walks of both (it produces a forced_stack_vars
bitmap).

Can you work together with Martin to split out the access tree
data structure and share it?

I didn't look in detail as of how you make use of the information
yet.

Thanks,
Richard.

> 
> BR,
> Jeff (Jiufu)
> 
> 
> ---
>  gcc/cfgexpand.cc | 567 ++-
>  gcc/expr.cc  |  15 +-
>  gcc/function.cc  |  26 +-
>  gcc/opts.cc  |   8 +-
>  gcc/testsuite/g++.target/powerpc/pr102024.C  |   2 +-
>  gcc/testsuite/gcc.target/powerpc/pr108073.c  |  29 +
>  gcc/testsuite/gcc.target/powerpc/pr65421-1.c |   6 +
>  gcc/testsuite/gcc.target/powerpc/pr65421-2.c |  32 ++
>  8 files changed, 675 insertions(+), 10 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108073.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr65421-1.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr65421-2.c
> 
> diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
> index 85a93a547c0..95c29b6b6fe 100644
> --- a/gcc/cfgexpand.cc
> +++ b/gcc/cfgexpand.cc
> @@ -97,6 +97,564 @@ static bool defer_stack_allocation (tree, bool);
>  
>  static void record_alignment_for_reg_var (unsigned int);
>  
> +/* For light SRA in the expander for parameters and returns.  */
> +namespace {
> +
> +struct access
> +{
> +  /* Each access on the aggregate is about OFFSET/SIZE and BASE.  */
> +  HOST_WIDE_INT offset;
> +  HOST_WIDE_INT size;
> +  tree base;
> +  bool writing;
> +
> +  /* The context expression of this access.  */
> +  tree expr;
> +
> +  /* The rtx for the access: link to incoming/returning register(s).  */
> +  rtx rtx_val;
> +};
> +
> +typedef struct access *access_p;
> +
> +/* Expr (tree) -> Access (access_p) map.  */
> +static hash_map<tree, access_p> *expr_access_vec;
> +
> +/* Base (tree) -> Vector (vec<access_p>) map.  */
> +static hash_map<tree, auto_vec<access_p> > *base_access_vec;
> +
> +/* Return a vector of p

Re: [PATCH 1/2] ipa-cp: Avoid long linear searches through DECL_ARGUMENTS

2023-05-30 Thread Richard Biener via Gcc-patches
On Mon, May 29, 2023 at 6:20 PM Martin Jambor  wrote:
>
> Hi,
>
> there have been concerns that linear searches through DECL_ARGUMENTS
> that are often necessary to compute the index of a particular
> PARM_DECL which is the key to results of IPA-CP can happen often
> enough to be a compile time issue, especially if we plug the results
> into value numbering, as I intend to do with a follow-up patch.
>
> This patch creates a hash map to do the look-up for all functions
> which have some information discovered by IPA-CP and which have 32
> parameters or more.  32 is a hard-wired magical constant here to
> capture the trade-off between the memory allocation overhead and
> length of the linear search.  I do not think it is worth making it a
> --param but if people think it appropriate, I can turn it into one.

Since ipcp_transformation is short-lived (is it?) is it worth the trouble?
Comments below ...

> Bootstrapped, tested and LTO bootstrapped on x86_64-linux, both as-is
> and with the magical constant dropped to 4 so that the hash lookup path
> is also well exercised.  OK for master?
>
> Thanks,
>
> Martin
>
>
> gcc/ChangeLog:
>
> 2023-05-26  Martin Jambor  
>
> * ipa-prop.h (struct ipcp_transformation): Rearrange members
> according to C++ class coding convention, add m_tree_to_idx,
> get_param_index and maybe_create_parm_idx_map.
> * ipa-cp.cc (ipcp_transformation::get_param_index): New function.
> (ipcp_transformation::maybe_create_parm_idx_map): Likewise.
> * ipa-prop.cc (ipcp_get_parm_bits): Use get_param_index.
> (ipcp_update_bits): Accept TS as a parameter, assume it is not NULL.
> (ipcp_update_vr): Likewise.
> (ipcp_transform_function): Call maybe_create_parm_idx_map of TS, bail
> out quickly if empty, pass it to ipcp_update_bits and ipcp_update_vr.
> ---
>  gcc/ipa-cp.cc   | 45 +
>  gcc/ipa-prop.cc | 44 +++-
>  gcc/ipa-prop.h  | 33 +
>  3 files changed, 89 insertions(+), 33 deletions(-)
>
> diff --git a/gcc/ipa-cp.cc b/gcc/ipa-cp.cc
> index 0f37bb5e336..9f8b07b2398 100644
> --- a/gcc/ipa-cp.cc
> +++ b/gcc/ipa-cp.cc
> @@ -6761,3 +6761,48 @@ ipa_cp_cc_finalize (void)
>orig_overall_size = 0;
>ipcp_free_transformation_sum ();
>  }
> +
> +/* Given PARAM which must be a parameter of function FNDECL described by
> +   THIS, return its index in the DECL_ARGUMENTS chain, using a pre-computed
> +   hash map if available (which is pre-computed only if there are many
> +   parameters).  Can return -1 if PARAM is the static chain, which is not
> +   represented among DECL_ARGUMENTS.  */
> +
> +int
> +ipcp_transformation::get_param_index (const_tree fndecl, const_tree param) const
> +{
> +  gcc_assert (TREE_CODE (param) == PARM_DECL);
> +  if (m_tree_to_idx)
> +{
> +  unsigned *pr = m_tree_to_idx->get(param);
> +  if (!pr)
> +   {
> + gcc_assert (DECL_STATIC_CHAIN (fndecl));
> + return -1;
> +   }
> +  return (int) *pr;
> +}
> +
> +  unsigned index = 0;
> +  for (tree p = DECL_ARGUMENTS (fndecl); p; p = DECL_CHAIN (p), index++)
> +if (p == param)
> +  return (int) index;
> +
> +  gcc_assert (DECL_STATIC_CHAIN (fndecl));
> +  return -1;
> +}
> +
> +/* Assuming THIS describes FNDECL and it has sufficiently many parameters
> +   to justify the overhead, create a hash map from parameter trees to their
> +   indices.  */
> +void
> +ipcp_transformation::maybe_create_parm_idx_map (tree fndecl)
> +{
> +  int c = count_formal_params (fndecl);
> +  if (c < 32)
> +return;
> +
> +  m_tree_to_idx = hash_map<tree, unsigned>::create_ggc (c);
> +  unsigned index = 0;
> +  for (tree p = DECL_ARGUMENTS (fndecl); p; p = DECL_CHAIN (p), index++)
> +m_tree_to_idx->put (p, index);

I think allocating the hash-map with 'c' for some numbers (depending on
the "prime" chosen) will necessarily cause re-allocation of the hash since
we keep a load factor of at most 3/4 upon insertion.

But - I wonder if a UID-sorted array isn't a very much better data
structure for this?  That is, a vec<std::pair<unsigned, unsigned>>?
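
A sketch of that alternative (hypothetical names; a binary search on
DECL_UID instead of hashing):

  /* Pairs of (DECL_UID, parameter index), kept sorted by UID.  */
  auto_vec<std::pair<unsigned, unsigned> > m_uid_to_idx;

  int
  lookup_param_index (const_tree param) const
  {
    unsigned uid = DECL_UID (param);
    unsigned lo = 0, hi = m_uid_to_idx.length ();
    while (lo < hi)
      {
        unsigned mid = (lo + hi) / 2;
        if (m_uid_to_idx[mid].first < uid)
          lo = mid + 1;
        else
          hi = mid;
      }
    if (lo < m_uid_to_idx.length () && m_uid_to_idx[lo].first == uid)
      return (int) m_uid_to_idx[lo].second;
    return -1;
  }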

> +}
> diff --git a/gcc/ipa-prop.cc b/gcc/ipa-prop.cc
> index ab6de9f10da..f0976e363f7 100644
> --- a/gcc/ipa-prop.cc
> +++ b/gcc/ipa-prop.cc
> @@ -5776,16 +5776,9 @@ ipcp_get_parm_bits (tree parm, tree *value, widest_int 
> *mask)
>if (!ts || vec_safe_length (ts->bits) == 0)
>  return false;
>
> -  int i = 0;
> -  for (tree p = DECL_ARGUMENTS (current_function_decl);
> -   p != parm; p = DECL_CHAIN (p))
> -{
> -  i++;
> -  /* Ignore static chain.  */
> -  if (!p)
> -   return false;
> -}
> -
> +  int i = ts->get_param_index (current_function_decl, parm);
> +  if (i < 0)
> +return false;
>clone_info *cinfo = clone_info::get (cnode);
>if (cinfo && cinfo->param_adjustments)
>  {
> @@ -5802,16 +5795,12 @@ ipcp_get_parm_bits (tree parm, tree *value, 
> widest_int *mask)
>

Re: [PATCH v1] tree-ssa-sink: Improve code sinking pass.

2023-05-30 Thread Richard Biener via Gcc-patches
On Tue, May 30, 2023 at 7:06 AM Ajit Agarwal  wrote:
>
> Hello Richard:
>
> On 22/05/23 6:26 pm, Richard Biener wrote:
> > On Thu, May 18, 2023 at 9:14 AM Ajit Agarwal  wrote:
> >>
> >> Hello All:
> >>
> >> This patch improves code sinking pass to sink statements before call to 
> >> reduce
> >> register pressure.
> >> Review comments are incorporated.
> >>
> >> Bootstrapped and regtested on powerpc64-linux-gnu.
> >>
> >> Thanks & Regards
> >> Ajit
> >>
> >>
> >> tree-ssa-sink: Improve code sinking pass.
> >>
> >> Code sinking sinks statements into blocks after a call.  This
> >> increases register pressure for callee-saved registers.  Improve code
> >> sinking by sinking before the call in the use blocks or the immediate
> >> dominator of the use blocks.
> >>
> >> 2023-05-18  Ajit Kumar Agarwal  
> >>
> >> gcc/ChangeLog:
> >>
> >> * tree-ssa-sink.cc (statement_sink_location): Modified to
> >> move statements before calls.
> >> (block_call_p): New function.
> >> (def_use_same_block): New function.
> >> (select_best_block): Add heuristics to select the best
> >> blocks in the immediate post dominator.
> >>
> >> gcc/testsuite/ChangeLog:
> >>
> >> * gcc.dg/tree-ssa/ssa-sink-20.c: New testcase.
> >> * gcc.dg/tree-ssa/ssa-sink-21.c: New testcase.
> >> ---
> >>  gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-20.c |  16 ++
> >>  gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c |  20 +++
> >>  gcc/tree-ssa-sink.cc| 159 ++--
> >>  3 files changed, 185 insertions(+), 10 deletions(-)
> >>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-20.c
> >>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
> >>
> >> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-20.c 
> >> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-20.c
> >> new file mode 100644
> >> index 000..716bc1f9257
> >> --- /dev/null
> >> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-20.c
> >> @@ -0,0 +1,16 @@
> >> +/* { dg-do compile } */
> >> +/* { dg-options "-O2 -fdump-tree-sink -fdump-tree-optimized 
> >> -fdump-tree-sink-stats" } */
> >> +
> >> +void bar();
> >> +int j;
> >> +void foo(int a, int b, int c, int d, int e, int f)
> >> +{
> >> +  int l;
> >> +  l = a + b + c + d +e + f;
> >> +  if (a != 5)
> >> +{
> >> +  bar();
> >> +  j = l;
> >> +}
> >> +}
> >> +/* { dg-final { scan-tree-dump-times "Sunk statements: 5" 1 "sink" } } */
> >
> > this doesn't verify the place we sink to?
> >
>
> I am not sure how to verify the place we sink to with dg-final.

I think dejagnu supports matching multi-line regexps so I suggest
to scan for the sunk expr RHS to be followed by the call?
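
Something along these lines, hypothetically (the directive, regexp and
SSA names here are assumptions, not tested):

  /* { dg-final { scan-tree-dump "l_\[0-9\]+ = .*\n.*bar \\(\\);" "sink" } } */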

> >> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c 
> >> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
> >> new file mode 100644
> >> index 000..ff41e2ea8ae
> >> --- /dev/null
> >> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
> >> @@ -0,0 +1,20 @@
> >> +/* { dg-do compile } */
> >> +/* { dg-options "-O2 -fdump-tree-sink-stats -fdump-tree-sink-stats" } */
> >> +
> >> +void bar();
> >> +int j, x;
> >> +void foo(int a, int b, int c, int d, int e, int f)
> >> +{
> >> +  int l;
> >> +  l = a + b + c + d +e + f;
> >> +  if (a != 5)
> >> +{
> >> +  bar();
> >> +  if (b != 3)
> >> +x = 3;
> >> +  else
> >> +x = 5;
> >> +  j = l;
> >> +}
> >> +}
> >> +/* { dg-final { scan-tree-dump-times "Sunk statements: 5" 1 "sink" } } */
> >
> > likewise.  So both tests already pass before the patch?
> >
> >> diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
> >> index 87b1d40c174..76556e7795b 100644
> >> --- a/gcc/tree-ssa-sink.cc
> >> +++ b/gcc/tree-ssa-sink.cc
> >> @@ -171,6 +171,72 @@ nearest_common_dominator_of_uses (def_operand_p 
> >> def_p, bool *debug_stmts)
> >>return commondom;
> >>  }
> >>
> >> +/* Return TRUE if immediate uses of the defs in
> >> +   USE occur in the same block as USE, FALSE otherwise.  */
> >> +
> >> +bool
> >> +def_use_same_block (gimple *stmt)
> >> +{
> >> +  use_operand_p use_p;
> >> +  def_operand_p def_p;
> >> +  imm_use_iterator imm_iter;
> >> +  ssa_op_iter iter;
> >> +
> >> +  FOR_EACH_SSA_DEF_OPERAND (def_p, stmt, iter, SSA_OP_DEF)
> >> +{
> >> +  FOR_EACH_IMM_USE_FAST (use_p, imm_iter, DEF_FROM_PTR (def_p))
> >> +   {
> >> + if (is_gimple_debug (USE_STMT (use_p)))
> >> +   continue;
> >> +
> >> + if (use_p
> >
> > use_p is never null
> >
> >> + && (gimple_bb (USE_STMT (use_p)) == gimple_bb (stmt)))
> >> +   return true;
> >
> > the function behavior is obviously odd ...
> >
> >> +   }
> >> + }
> >> +  return false;
> >> +}
> >> +
> >> +/* Return TRUE if the block has only calls, FALSE otherwise. */
> >> +
> >> +bool
> >> +block_call_p (basic_block bb)
> >> +{
> >> +  int i = 0;
> >> +  bool is_call = false;
> >> +  gimple_stmt_iterator gsi = gsi_last_bb (bb);
> >> +  gimple *last_stmt = gsi_stmt (gsi);
> >> +
> >> +  if (last_stmt && gimp

Re: [PATCH] RISC-V: Basic VLS code gen for RISC-V

2023-05-30 Thread Richard Biener via Gcc-patches
On Tue, May 30, 2023 at 8:07 AM Kito Cheng via Gcc-patches
 wrote:
>
> The GNU vector extensions are widely used around the world, and this
> patch enables them with the RISC-V vector extensions.  This can help
> people leverage existing code bases with RVV, and also lets them write
> vector programs in a familiar way.
>
> The idea of the VLS code gen support is to emulate VLS operations by VLA
> operations with a specific length.

In the patch you added fixed 16-byte vector modes, correct?  I've never
looked at how ARM deals with the GNU vector extensions but I suppose they
get mapped to NEON and not SVE so basically behave the same way here.

But I do wonder about the efficiency for RVV where there doesn't exist a
complementary fixed-length ISA.  Shouldn't vector lowering
(tree-vect-generic.cc) be enhanced to support lowering fixed-length
vectors to variable-length ones with (variable) fixed length instead?
From your patch I second-guess the RVV specification requires 16-byte
vectors to be available (or will your patch split the insns?) but ideally
the user would be able to specify -mrvv-size=32 for an implementation
with 32-byte vectors and then vector lowering would make use of vectors
up to 32 bytes?

Also, vector lowering will split smaller vectors not equal to the fixed
size to scalars unless you add all fixed-length modes smaller than 16
bytes as well.
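
For reference, the kind of GNU vector extension code in question (an
illustrative example with an assumed lane count):

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  add (v4si a, v4si b)
  {
    return a + b;  /* emulated by a length-controlled VLA add  */
  }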

> The key design point is that we defer the mode conversion (from VLS to
> VLA modes) until after register allocation; this comes with several
> advantages:
> - The VLS patterns are much friendlier to most optimization passes, like
>   combine.
> - The register allocator can spill/restore the exact size of the VLS
>   type instead of the whole register.
>
> This is compatible with VLA vectorization.
>
> Only move and binary operation patterns are supported so far.
>
> gcc/ChangeLog:
>
> * config/riscv/riscv-modes.def: Introduce VLS modes.
> * config/riscv/riscv-protos.h (riscv_vector::minimal_vls_mode): New.
> (riscv_vector::vls_insn_expander): New.
> (riscv_vector::vls_mode_p): New.
> * config/riscv/riscv-v.cc (riscv_vector::minimal_vls_mode): New.
> (riscv_vector::vls_mode_p): New.
> (riscv_vector::vls_insn_expander): New.
> (riscv_vector::update_vls_mode): New.
> * config/riscv/riscv.cc (riscv_v_ext_mode_p): New.
> (riscv_v_adjust_nunits): Handle VLS type.
> (riscv_hard_regno_nregs): Ditto.
> (riscv_hard_regno_mode_ok): Ditto.
> (riscv_regmode_natural_size): Ditto.
> * config/riscv/vector-iterators.md (VLS): New.
> (VM): Handle VLS type.
> (vel): Ditto.
> * config/riscv/vector.md: Include vector-vls.md.
> * config/riscv/vector-vls.md: New file.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/riscv/rvv/rvv.exp: Add vls folder.
> * gcc.target/riscv/rvv/vls/binop-template.h: New test.
> * gcc.target/riscv/rvv/vls/binop-v.c: New test.
> * gcc.target/riscv/rvv/vls/binop-zve32x.c: New test.
> * gcc.target/riscv/rvv/vls/binop-zve64x.c: New test.
> * gcc.target/riscv/rvv/vls/move-template.h: New test.
> * gcc.target/riscv/rvv/vls/move-v.c: New test.
> * gcc.target/riscv/rvv/vls/move-zve32x.c: New test.
> * gcc.target/riscv/rvv/vls/move-zve64x.c: New test.
> * gcc.target/riscv/rvv/vls/load-store-template.h: New test.
> * gcc.target/riscv/rvv/vls/load-store-v.c: New test.
> * gcc.target/riscv/rvv/vls/load-store-zve32x.c: New test.
> * gcc.target/riscv/rvv/vls/load-store-zve64x.c: New test.
> * gcc.target/riscv/rvv/vls/vls-types.h: New test.
> ---
>  gcc/config/riscv/riscv-modes.def  |  3 +
>  gcc/config/riscv/riscv-protos.h   |  4 ++
>  gcc/config/riscv/riscv-v.cc   | 67 +++
>  gcc/config/riscv/riscv.cc | 27 +++-
>  gcc/config/riscv/vector-iterators.md  |  6 ++
>  gcc/config/riscv/vector-vls.md| 64 ++
>  gcc/config/riscv/vector.md|  2 +
>  gcc/testsuite/gcc.target/riscv/rvv/rvv.exp|  4 ++
>  .../gcc.target/riscv/rvv/vls/binop-template.h | 18 +
>  .../gcc.target/riscv/rvv/vls/binop-v.c| 18 +
>  .../gcc.target/riscv/rvv/vls/binop-zve32x.c   | 18 +
>  .../gcc.target/riscv/rvv/vls/binop-zve64x.c   | 18 +
>  .../riscv/rvv/vls/load-store-template.h   |  8 +++
>  .../gcc.target/riscv/rvv/vls/load-store-v.c   | 17 +
>  .../riscv/rvv/vls/load-store-zve32x.c | 17 +
>  .../riscv/rvv/vls/load-store-zve64x.c | 17 +
>  .../gcc.target/riscv/rvv/vls/move-template.h  | 13 
>  .../gcc.target/riscv/rvv/vls/move-v.c | 10 +++
>  .../gcc.target/riscv/rvv/vls/move-zve32x.c| 10 +++
>  .../gcc.target/riscv/rvv/vls/move-zve64x.c| 10 +++
>  .../gcc.target/riscv/rvv/vls/vls-types.h  | 42 
>  21 files changed, 391 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/config/riscv/v

Re: Re: decrement IV patch creates fails on PowerPC

2023-05-30 Thread Richard Biener via Gcc-patches
On Fri, 26 May 2023, juzhe.zh...@rivai.ai wrote:

> Hi, Richi. Thanks for your analysis and help.
> 
> >> We could simply retain the original
> >> incrementing IV for loop control and add the decrementing
> >> IV for computing LEN in addition to that and leave IVOPTs
> >> sorting out to eventually merge them (or not).
> 
> I am not sure how to do that.  Could you give me more information?
> 
> I somehow understand your concern is that a variable IV step will make
> IVOPTs fail.
> 
> I have seen a similar situation in LLVM (when applying a variable IV,
> they failed to interleave the vectorized code).  I am not sure whether
> the reason is the same.
>
> For RVV, we not only want the decrement IV style in vectorization but
> also want to apply SELECT_VL in the single-rgroup case, which is the most
> common one (LLVM also only applies get_vector_length with a single vector
> length).
>
> >>You can do some testing with a cross compiler, alternatively
> >>there are powerpc machines in the GCC compile farm.
> 
> It seems that Power is OK with the decrement IV since most cases are improved.

Well, Power will never have SELECT_VL, so at least for !SELECT_VL
targets you should avoid having an IV with a variable decrement.  As
I said, it should be easy to rewrite the decrement IV to use a constant
increment (when not using SELECT_VL) and test the pre-decrement
value in the exit test.

Richard.
 
> I think Richard may help to explain decrement IV more clearly.
> 
> Thanks
> 
> 
> juzhe.zh...@rivai.ai
>  
> From: Richard Biener
> Date: 2023-05-26 14:46
> To: ???
> CC: gcc-patches; richard.sandiford; linkw
> Subject: Re: decrement IV patch creates fails on PowerPC
> On Fri, 26 May 2023, ??? wrote:
>  
> > Yesterday's patch has been approved (decrement IV support):
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619663.html 
> > 
> > However, it creates fails on PowerPC:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109971 
> > 
> > I am really sorry for causing inconvenience.
> > 
> > I wonder as we discussed:
> > +  /* If we're vectorizing a loop that uses length "controls" and
> > + can iterate more than once, we apply decrementing IV approach
> > + in loop control.  */
> > +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> > +  && !LOOP_VINFO_LENS (loop_vinfo).is_empty ()
> > +  && LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) == 0
> > +  && !(LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > +&& known_le (LOOP_VINFO_INT_NITERS (loop_vinfo),
> > + LOOP_VINFO_VECT_FACTOR (loop_vinfo
> > +LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) = true;
> > 
> > This conditions can not disable decrement IV on PowerPC.
> > Should I add a target hook for it?
>  
> No.  I've put some analysis in the PR.  To me the question is
> why (without that SELECT_VL case) we need a decrementing IV
> _for the loop control_?  We could simply retain the original
> incrementing IV for loop control and add the decrementing
> IV for computing LEN in addition to that and leave IVOPTs
> sorting out to eventually merge them (or not).
>  
> Alternatively avoid the variable decrement as I wrote in the
> PR and do the exit test based on the previous IV value.
>  
> But as said all this won't work for the SELECT_VL case, but
> then its availability is something to key off rather than a
> new target hook?
>  
> > For the patch I could only do bootstrap and regression on X86.
> > I didn't have an environment to test PowerPC. I am really sorry.
>  
> You can do some testing with a cross compiler, alternatively
> there are powerpc machines in the GCC compile farm.
>  
> Richard.
>  
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)


Re: Re: [PATCH] RISC-V: Basic VLS code gen for RISC-V

2023-05-30 Thread Richard Biener via Gcc-patches
On Tue, May 30, 2023 at 11:17 AM juzhe.zh...@rivai.ai
 wrote:
>
> In the future, we will definitely mix VLA and VLS-vlmin together in
> codegen and it will not cause any issues.
> For VLS-vlmin, I prefer it to be used in length-style auto-vectorization
> (I am not sure, since my SELECT_VL patch is not finished; I will check
> whether it can work while working on the SELECT_VL patch).

For the future it would then be good to have the vectorizer re-vectorize
loops with VLS vector uses to VLA style?  I think there's a PR with a
draft patch from a few years ago attached (from me) somewhere.  Currently
the vectorizer will give up when seeing vector operations in a loop but
ideally those should simply be SLPed.

> >> In general I don't have a good overview of which optimizations we gain by
> >> such an approach or rather which ones are prevented by VLA altogether?
> The VLS modes in these patches can help with SLP auto-vectorization.
>
> 
> juzhe.zh...@rivai.ai
>
>
> From: Robin Dapp
> Date: 2023-05-30 17:05
> To: juzhe.zh...@rivai.ai; Richard Biener; Kito.cheng
> CC: rdapp.gcc; gcc-patches; palmer; kito.cheng; jeffreyalaw; pan2.li
> Subject: Re: [PATCH] RISC-V: Basic VLS code gen for RISC-V
> >>> but ideally the user would be able to specify -mrvv-size=32 for an
> >>> implementation with 32 byte vectors and then vector lowering would make 
> >>> use
> >>> of vectors up to 32 bytes?
> >
> > Actually, we don't want to specify -mrvv-size = 32 to enable vectorization 
> > on GNU vectors.
> > You can take a look this example:
> > https://godbolt.org/z/3jYqoM84h 
> >
> > GCC needs the mrvv size to be specified to enable GNU vectors, and the
> > codegen can only run on a CPU with vector-length = 128 bits.
> > However, LLVM doesn't need the vector length to be specified, and the
> > codegen can run on any CPU with RVV vector-length >= 128 bits.
> >
> > This is what this patch want to do.
> >
> > Thanks.
> I think Richard's question was rather whether it wasn't better to do it
> more generically and lower vectors to what either the current cpu
> supports or what the user specified, rather than just 16-byte vectors
> (i.e. indeed a fixed vlmin and not a fixed vlmin == fixed vlmax).
>
> This patch assumes everything is fixed for optimization purposes and then
> switches over to variable-length when nothing can be changed anymore.  That
> is, we would work on "vlmin"-sized chunks in a VLA fashion at runtime?
> We would need to make sure that no pass after reload makes use of VLA
> properties at all.
>
> In general I don't have a good overview of which optimizations we gain by
> such an approach or rather which ones are prevented by VLA altogether?
> What's the idea for the future?  Still use LEN_LOAD et al. (and masking)
> with "fixed vlmin"?  Wouldn't we select different IVs with this patch than
> what we would have for pure VLA?
>
> Regards
> Robin
>


Re: [PATCH] MATCH: Move `a <= CST1 ? MAX : a` optimization to match

2023-05-30 Thread Richard Biener via Gcc-patches
On Mon, May 8, 2023 at 12:21 AM Andrew Pinski via Gcc-patches
 wrote:
>
> This moves the `a <= CST1 ? MAX : a` optimization
> from phiopt to match. It just adds a new pattern to match.pd.
>
> There is one more change needed before being able to remove
> minmax_replacement from phiopt.
>
> A few notes on the testsuite changes:
> * phi-opt-5.c is now able to optimize at phiopt1 so remove
> the xfail.
> * pr66726-4.c can be optimized during fold before phiopt1,
> so the scanning needs to change.
> * pr66726-5.c currently needs two phiopt passes to optimize
> to the right thing; it needed two phiopt passes before as well, the cast
> from int to unsigned char being the reason.
> * pr66726-6.c is what the original pr66726-4.c was testing
> before the fold was able to optimize it.
>
> OK? Bootstrapped and tested on x86_64-linux-gnu.

OK.

> gcc/ChangeLog:
>
> * match.pd (`(a CMP CST1) ? max : a`): New
> pattern.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/tree-ssa/phi-opt-5.c: Remove last xfail.
> * gcc.dg/tree-ssa/pr66726-4.c: Change how scanning
> works.
> * gcc.dg/tree-ssa/pr66726-5.c: New test.
> * gcc.dg/tree-ssa/pr66726-6.c: New test.
> ---
>  gcc/match.pd  | 18 +++
>  gcc/testsuite/gcc.dg/tree-ssa/phi-opt-5.c |  2 +-
>  gcc/testsuite/gcc.dg/tree-ssa/pr66726-4.c |  5 +++-
>  gcc/testsuite/gcc.dg/tree-ssa/pr66726-5.c | 28 +++
>  gcc/testsuite/gcc.dg/tree-ssa/pr66726-6.c | 17 ++
>  5 files changed, 68 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr66726-5.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr66726-6.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index ceae1c34abc..a55ede838cd 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -4954,6 +4954,24 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>   (if (code == MAX_EXPR)
>(minmax (max @1 @2) @4)))
>
> +/* Optimize (a CMP CST1) ? max : a */
> +(for cmp(gt  ge  lt  le)
> + minmax (min min max max)
> + (simplify
> +  (cond (cmp @0 @1) (minmax:c@2 @0 @3) @4)
> +   (with
> +{
> +  tree_code code = minmax_from_comparison (cmp, @0, @1, @0, @4);
> +}
> +(if ((cmp == LT_EXPR || cmp == LE_EXPR)
> +&& code == MIN_EXPR
> + && integer_nonzerop (fold_build2 (LE_EXPR, boolean_type_node, @3, 
> @1)))
> + (min @2 @4)
> + (if ((cmp == GT_EXPR || cmp == GE_EXPR)
> + && code == MAX_EXPR
> +  && integer_nonzerop (fold_build2 (GE_EXPR, boolean_type_node, @3, 
> @1)))
> +  (max @2 @4))
> +
>  /* X != C1 ? -X : C2 simplifies to -X when -C1 == C2.  */
>  (simplify
>   (cond (ne @0 INTEGER_CST@1) (negate@3 @0) INTEGER_CST@2)
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-5.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-5.c
> index 5f78a1ba6dc..e78d9d8b83d 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-5.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-5.c
> @@ -39,7 +39,7 @@ float repl2 (float vary)
>
>  /* phiopt1 confused by predictors.  */
>  /* { dg-final { scan-tree-dump "vary.*MAX_EXPR.*0\\.0" "phiopt1" } } */
> -/* { dg-final { scan-tree-dump "vary.*MIN_EXPR.*1\\.0" "phiopt1" { xfail 
> *-*-* } } } */
> +/* { dg-final { scan-tree-dump "vary.*MIN_EXPR.*1\\.0" "phiopt1" } } */
>  /* { dg-final { scan-tree-dump "vary.*MAX_EXPR.*0\\.0" "phiopt2"} } */
>  /* { dg-final { scan-tree-dump "vary.*MIN_EXPR.*1\\.0" "phiopt2"} } */
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr66726-4.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/pr66726-4.c
> index 4e43522f3a3..930ad5fb79f 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/pr66726-4.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr66726-4.c
> @@ -9,4 +9,7 @@ foo (unsigned char *p, int i)
>*p = SAT (i);
>  }
>
> -/* { dg-final { scan-tree-dump-times "COND_EXPR .*and PHI .*converted to 
> straightline code" 1 "phiopt1" } } */
> +/* fold could optimize SAT before phiopt1 so only match on the
> +   MIN/MAX here.  */
> +/* { dg-final { scan-tree-dump-times "= MIN_EXPR" 1 "phiopt1" } } */
> +/* { dg-final { scan-tree-dump-times "= MAX_EXPR" 1 "phiopt1" } } */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr66726-5.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/pr66726-5.c
> new file mode 100644
> index 000..4b5066cdb6b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr66726-5.c
> @@ -0,0 +1,28 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-phiopt1-details -fdump-tree-phiopt2-details 
> -fdump-tree-optimized" } */
> +
> +#define SAT(x) (x < 0 ? 0 : (x > 255 ? 255 : x))
> +
> +unsigned char
> +foo (unsigned char *p, int i)
> +{
> +  if (i < 0)
> +return 0;
> +  {
> +int t;
> +if (i > 255)
> +  t = 255;
> +else
> +  t = i;
> +return t;
> +  }
> +}
> +
> +/* Because of the way PHIOPT works, it only does the merging of BBs after it
> +   is done, so we get the case where we can't
> +   optimize the above until phiopt2 right now.  */
> +/* { dg-final { scan-tree-dump-ti

Re: [PATCH] Add a != MIN/MAX_VALUE_CST ? CST-+1 : a to minmax_from_comparison

2023-05-30 Thread Richard Biener via Gcc-patches
On Mon, May 8, 2023 at 7:27 AM Andrew Pinski via Gcc-patches
 wrote:
>
> This patch adds to match the support that was implemented for PR 87913
> in phiopt.
> It implements it by adding support to minmax_from_comparison for the check.
> It uses the range information if available, which allows producing a
> MIN/MAX expression when comparing against the lower/upper bound of the
> range instead of the lower/upper bound of the type.
>
> minmax-20.c is the new testcase which tests the ranges part.
>
> OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions.

OK.

> gcc/ChangeLog:
>
> * fold-const.cc (minmax_from_comparison): Add support for NE_EXPR.
> * match.pd ((cond (cmp (convert1? x) c1) (convert2? x) c2) pattern):
> Add ne as a possible cmp.
> ((a CMP b) ? minmax : minmax pattern): Likewise.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/tree-ssa/minmax-20.c: New test.
> ---
>  gcc/fold-const.cc | 26 +++
>  gcc/match.pd  |  4 ++--
>  gcc/testsuite/gcc.dg/tree-ssa/minmax-20.c | 12 +++
>  3 files changed, 40 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/minmax-20.c
>
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index db54bfc5662..d90671b9975 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -173,6 +173,19 @@ minmax_from_comparison (tree_code cmp, tree exp0, tree 
> exp1, tree exp2, tree exp
>   /* X > Y - 1 equals to X >= Y.  */
>   if (cmp == GT_EXPR)
> code = GE_EXPR;
> +	  /* a != MIN_RANGE ? a : MIN_RANGE+1 -> MAX_EXPR <MIN_RANGE+1, a> */
> + if (cmp == NE_EXPR && TREE_CODE (exp0) == SSA_NAME)
> +   {
> + value_range r;
> + get_range_query (cfun)->range_of_expr (r, exp0);
> + if (r.undefined_p ())
> +   r.set_varying (TREE_TYPE (exp0));
> +
> + widest_int min = widest_int::from (r.lower_bound (),
> +TYPE_SIGN (TREE_TYPE 
> (exp0)));
> + if (min == wi::to_widest (exp1))
> +   code = MAX_EXPR;
> +   }
> }
>if (wi::to_widest (exp1) == (wi::to_widest (exp3) + 1))
> {
> @@ -182,6 +195,19 @@ minmax_from_comparison (tree_code cmp, tree exp0, tree 
> exp1, tree exp2, tree exp
>   /* X >= Y + 1 equals to X > Y.  */
>   if (cmp == GE_EXPR)
>   code = GT_EXPR;
> +	  /* a != MAX_RANGE ? a : MAX_RANGE-1 -> MIN_EXPR <MAX_RANGE-1, a> */
> + if (cmp == NE_EXPR && TREE_CODE (exp0) == SSA_NAME)
> +   {
> + value_range r;
> + get_range_query (cfun)->range_of_expr (r, exp0);
> + if (r.undefined_p ())
> +   r.set_varying (TREE_TYPE (exp0));
> +
> + widest_int max = widest_int::from (r.upper_bound (),
> +TYPE_SIGN (TREE_TYPE 
> (exp0)));
> + if (max == wi::to_widest (exp1))
> +   code = MIN_EXPR;
> +   }
> }
>  }
>if (code != ERROR_MARK
> diff --git a/gcc/match.pd b/gcc/match.pd
> index a55ede838cd..95f7e9a6abc 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -4751,7 +4751,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>
> Case 2)
> (cond (eq (convert1? x) c1) (convert2? x) c2) -> (cond (eq x c1) c1 c2).  
> */
> -(for cmp (lt le gt ge eq)
> +(for cmp (lt le gt ge eq ne)
>   (simplify
>(cond (cmp (convert1? @1) INTEGER_CST@3) (convert2? @1) INTEGER_CST@2)
>(with
> @@ -4942,7 +4942,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  /* Optimize (a CMP b) ? minmax<a, c> : minmax<b, c>
> to minmax<min/max (a, b), c> */
>  (for minmax (min max)
> - (for cmp (lt le gt ge)
> + (for cmp (lt le gt ge ne)
>(simplify
> (cond (cmp @1 @3) (minmax:c @1 @4) (minmax:c @2 @4))
> (with
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/minmax-20.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/minmax-20.c
> new file mode 100644
> index 000..481c375f5f9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/minmax-20.c
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-phiopt2" } */
> +
> +int f(int num)
> +{
> +  if (num < 3) __builtin_unreachable();
> +  return num != 3 ?  num : 4;
> +}
> +
> +/* In phiopt2 with the range information, this should be turned into
> +   a MAX_EXPR.  */
> +/* { dg-final { scan-tree-dump-times "MAX_EXPR" 1 "phiopt2" } } */
> --
> 2.31.1
>


Re: [PATCH] Detect bswap + rotate for byte permutation in pass_bswap.

2023-05-30 Thread Richard Biener via Gcc-patches
On Tue, May 9, 2023 at 9:06 AM liuhongt via Gcc-patches
 wrote:
>
> The patch doesn't handle:
>   1. cast64_to_32,
>   2. memory source with rsize < range.
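> 
> For example (illustrative), a byte permutation that swaps the bytes
> within each halfword is a bswap plus a 16-bit rotate, the kind of
> pattern the patch targets:
> 
>   unsigned int
>   perm (unsigned int x)
>   {
>     /* Bytes a b c d -> b a d c, i.e. rotate (bswap (x), 16).  */
>     return ((x & 0x00ff00ffu) << 8) | ((x >> 8) & 0x00ff00ffu);
>   }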
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?

OK and sorry for the delay.

Richard.

> gcc/ChangeLog:
>
> PR middle-end/108938
> * gimple-ssa-store-merging.cc (is_bswap_or_nop_p): New
> function, cut from original find_bswap_or_nop function.
> (find_bswap_or_nop): Add a new parameter, detect bswap +
> rotate and save rotate result in the new parameter.
> (bswap_replace): Add a new parameter to indicate rotate and
> generate rotate stmt if needed.
> (maybe_optimize_vector_constructor): Adjust for new rotate
> parameter in the upper 2 functions.
> (pass_optimize_bswap::execute): Ditto.
> (imm_store_chain_info::output_merged_store): Ditto.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr108938-1.c: New test.
> * gcc.target/i386/pr108938-2.c: New test.
> * gcc.target/i386/pr108938-3.c: New test.
> * gcc.target/i386/pr108938-load-1.c: New test.
> * gcc.target/i386/pr108938-load-2.c: New test.
> ---
>  gcc/gimple-ssa-store-merging.cc   | 130 ++
>  gcc/testsuite/gcc.target/i386/pr108938-1.c|  79 +++
>  gcc/testsuite/gcc.target/i386/pr108938-2.c|  35 +
>  gcc/testsuite/gcc.target/i386/pr108938-3.c|  26 
>  .../gcc.target/i386/pr108938-load-1.c |  69 ++
>  .../gcc.target/i386/pr108938-load-2.c |  30 
>  6 files changed, 342 insertions(+), 27 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr108938-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr108938-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr108938-3.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr108938-load-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr108938-load-2.c
>
> diff --git a/gcc/gimple-ssa-store-merging.cc b/gcc/gimple-ssa-store-merging.cc
> index df7afd2fd78..9cb574fa315 100644
> --- a/gcc/gimple-ssa-store-merging.cc
> +++ b/gcc/gimple-ssa-store-merging.cc
> @@ -893,6 +893,37 @@ find_bswap_or_nop_finalize (struct symbolic_number *n, 
> uint64_t *cmpxchg,
>n->range *= BITS_PER_UNIT;
>  }
>
> +/* Helper function for find_bswap_or_nop.  Return true if N is a
> +   bswap or nop with MASK.  */
> +static bool
> +is_bswap_or_nop_p (uint64_t n, uint64_t cmpxchg,
> +  uint64_t cmpnop, uint64_t* mask,
> +  bool* bswap)
> +{
> +  *mask = ~(uint64_t) 0;
> +  if (n == cmpnop)
> +*bswap = false;
> +  else if (n == cmpxchg)
> +*bswap = true;
> +  else
> +{
> +  int set = 0;
> +  for (uint64_t msk = MARKER_MASK; msk; msk <<= BITS_PER_MARKER)
> +   if ((n & msk) == 0)
> + *mask &= ~msk;
> +   else if ((n & msk) == (cmpxchg & msk))
> + set++;
> +   else
> + return false;
> +
> +  if (set < 2)
> +   return false;
> +  *bswap = true;
> +}
> +  return true;
> +}
> +
> +
>  /* Check if STMT completes a bswap implementation or a read in a given
> endianness consisting of ORs, SHIFTs and ANDs and sets *BSWAP
> accordingly.  It also sets N to represent the kind of operations
> @@ -903,7 +934,7 @@ find_bswap_or_nop_finalize (struct symbolic_number *n, 
> uint64_t *cmpxchg,
>
>  gimple *
>  find_bswap_or_nop (gimple *stmt, struct symbolic_number *n, bool *bswap,
> -  bool *cast64_to_32, uint64_t *mask)
> +  bool *cast64_to_32, uint64_t *mask, uint64_t* l_rotate)
>  {
>tree type_size = TYPE_SIZE_UNIT (TREE_TYPE (gimple_get_lhs (stmt)));
>if (!tree_fits_uhwi_p (type_size))
> @@ -984,29 +1015,57 @@ find_bswap_or_nop (gimple *stmt, struct 
> symbolic_number *n, bool *bswap,
>  }
>
>uint64_t cmpxchg, cmpnop;
> +  uint64_t orig_range = n->range * BITS_PER_UNIT;
>find_bswap_or_nop_finalize (n, &cmpxchg, &cmpnop, cast64_to_32);
>
>/* A complete byte swap should make the symbolic number to start with
>   the largest digit in the highest order byte. Unchanged symbolic
>   number indicates a read with same endianness as target architecture.  */
> -  *mask = ~(uint64_t) 0;
> -  if (n->n == cmpnop)
> -*bswap = false;
> -  else if (n->n == cmpxchg)
> -*bswap = true;
> -  else
> +  *l_rotate = 0;
> +  uint64_t tmp_n = n->n;
> +  if (!is_bswap_or_nop_p (tmp_n, cmpxchg, cmpnop, mask, bswap))
>  {
> -  int set = 0;
> -  for (uint64_t msk = MARKER_MASK; msk; msk <<= BITS_PER_MARKER)
> -   if ((n->n & msk) == 0)
> - *mask &= ~msk;
> -   else if ((n->n & msk) == (cmpxchg & msk))
> - set++;
> -   else
> - return NULL;
> -  if (set < 2)
> +  /* Try bswap + lrotate.  */
> +  /* TODO: handle cast64_to_32 and big/little_endian memory
> +	 source when rsize < range.  */
> +  if (n->range == ori

Re: [PATCH] Optimized "(X - N * M) / N + M" to "X / N" if valid

2023-05-30 Thread Richard Biener via Gcc-patches
On Wed, 17 May 2023, Jiufu Guo wrote:

> Hi,
> 
> This patch tries to optimize "(X - N * M) / N + M" to "X / N".

But if that's valid why not make the transform simpler and transform
(X - N * M) / N  to X / N - M instead?

You use the same optimize_x_minus_NM_div_N_plus_M validator for
the division and shift variants but the overflow rules are different,
so I'm not sure that's warranted.  I'd also prefer to not split out
the validator to a different file - iff then the appropriate file
is fold-const.cc, not gimple-match-head.cc (I see we're a bit
inconsistent here, for pure gimple matches gimple-fold.cc would
be another place).

Since you use range information why is the transform restricted
to constant M?

Richard.

> As per the discussions in PR108757, we know this transformation is valid
> only under some conditions.
> For C code, "/" rounds towards zero (trunc_div), and "X - N * M" may
> wrap/overflow/underflow.  So, it is only valid if "X - N * M" does
> not cross zero and does not wrap/overflow/underflow.
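> For instance (illustrative): with X = 5, N = 10, M = 1, "(X - N * M) / N
> + M" is (5 - 10) / 10 + 1 = 0 + 1 = 1 under trunc_div, while "X / N" is
> 5 / 10 = 0; the transform is invalid because X - N * M crosses zero.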
> 
> This patch also handles the case when "N" is a power of 2, where
> "(X - N * M) / N" is "(X - N * M) >> log2(N)".
> 
> Bootstrap & regtest pass on ppc64{,le} and x86_64.
> Is this ok for trunk?
> 
> BR,
> Jeff (Jiufu)
> 
>   PR tree-optimization/108757
> 
> gcc/ChangeLog:
> 
>   * gimple-match-head.cc (optimize_x_minus_NM_div_N_plus_M): New function.
>   * match.pd ((X - N * M) / N + M): New pattern.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/pr108757-1.c: New test.
>   * gcc.dg/pr108757-2.c: New test.
>   * gcc.dg/pr108757.h: New test.
> 
> ---
>  gcc/gimple-match-head.cc  |  54 ++
>  gcc/match.pd  |  22 
>  gcc/testsuite/gcc.dg/pr108757-1.c |  17 
>  gcc/testsuite/gcc.dg/pr108757-2.c |  18 
>  gcc/testsuite/gcc.dg/pr108757.h   | 160 ++
>  5 files changed, 271 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.dg/pr108757-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/pr108757-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/pr108757.h
> 
> diff --git a/gcc/gimple-match-head.cc b/gcc/gimple-match-head.cc
> index b08cd891a13..680a4cb2fc6 100644
> --- a/gcc/gimple-match-head.cc
> +++ b/gcc/gimple-match-head.cc
> @@ -224,3 +224,57 @@ optimize_successive_divisions_p (tree divisor, tree 
> inner_div)
>  }
>return true;
>  }
> +
> +/* Return true if "(X - N * M) / N + M" can be optimized into "X / N".
> +   Otherwise return false.
> +
> +   For unsigned,
> +   If sign bit of M is 0 (clz is 0), valid range is [N*M, MAX].
> +   If sign bit of M is 1, valid range is [0, MAX - N*(-M)].
> +
> +   For signed,
> +   If N*M > 0, valid range: [MIN+N*M, 0] + [N*M, MAX]
> +   If N*M < 0, valid range: [MIN, -(-N*M)] + [0, MAX - (-N*M)].  */
> +
> +static bool
> +optimize_x_minus_NM_div_N_plus_M (tree x, wide_int n, wide_int m, tree type)
> +{
> +  wide_int max = wi::max_value (type);
> +  signop sgn = TYPE_SIGN (type);
> +  wide_int nm;
> +  wi::overflow_type ovf;
> +  if (TYPE_UNSIGNED (type) && wi::clz (m) == 0)
> +nm = wi::mul (n, -m, sgn, &ovf);
> +  else
> +nm = wi::mul (n, m, sgn, &ovf);
> +
> +  if (ovf != wi::OVF_NONE)
> +return false;
> +
> +  value_range vr0;
> +  if (!get_range_query (cfun)->range_of_expr (vr0, x) || vr0.varying_p ()
> +  || vr0.undefined_p ())
> +return false;
> +
> +  wide_int wmin0 = vr0.lower_bound ();
> +  wide_int wmax0 = vr0.upper_bound ();
> +  wide_int min = wi::min_value (type);
> +
> +  /* unsigned */
> +  if ((TYPE_UNSIGNED (type)))
> +/* M > 0 (clz != 0): [N*M, MAX],  M < 0 : [0, MAX-N*(-M)]  */
> +return wi::clz (m) != 0 ? wi::ge_p (wmin0, nm, sgn)
> + : wi::le_p (wmax0, max - nm, sgn);
> +
> +  /* signed, N*M > 0 */
> +  else if (wi::gt_p (nm, 0, sgn))
> +/* [N*M, MAX] or [MIN+N*M, 0] */
> +return wi::ge_p (wmin0, nm, sgn)
> +|| (wi::ge_p (wmin0, min + nm, sgn) && wi::le_p (wmax0, 0, sgn));
> +
> +  /* signed, N*M < 0 */
> +  /* [MIN, N*M] or [0, MAX + N*M]*/
> +  else
> +return wi::le_p (wmax0, nm, sgn)
> +|| (wi::ge_p (wmin0, 0, sgn) && wi::le_p (wmax0, max - (-nm), sgn));
> +}
> diff --git a/gcc/match.pd b/gcc/match.pd
> index ceae1c34abc..1aaa5530577 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -881,6 +881,28 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  #endif
> 
>  
> +#if GIMPLE
> +/* Simplify ((t + -N*M) / N + M) -> t / N.  */
> +(for div (trunc_div exact_div)
> + (simplify
> +  (plus (div (plus @0 INTEGER_CST@1) INTEGER_CST@2) INTEGER_CST@3)
> +  (with {wide_int n = wi::to_wide (@2); wide_int m = wi::to_wide (@3);}
> +(if (INTEGRAL_TYPE_P (type)
> +  && n * m == -wi::to_wide (@1)
> +  && optimize_x_minus_NM_div_N_plus_M (@0, n, m, type))
> +(div @0 @2)
> +
> +/* Simplify ((t + -(M << N)) >> N + M) -> t >> N.  */
> +(simplify
> + (plus (rshift (plus @0 INTEGER_CST@1) INTEGER_CST@2) INTEGER_CST@3)
> + (with {wide_int n = wi::to_wide (@2); wide_int m = wi:

Re: Re: decrement IV patch creates fails on PowerPC

2023-05-30 Thread Richard Biener via Gcc-patches
On Tue, 30 May 2023, juzhe.zh...@rivai.ai wrote:

> Ok.
> 
> It seems that for this conditions:
> 
> +  /* If we're vectorizing a loop that uses length "controls" and
> + can iterate more than once, we apply decrementing IV approach
> + in loop control.  */
> +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> +  && !LOOP_VINFO_LENS (loop_vinfo).is_empty ()
> +  && LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) == 0
> +  && !(LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> +&& known_le (LOOP_VINFO_INT_NITERS (loop_vinfo),
> + LOOP_VINFO_VECT_FACTOR (loop_vinfo
> +LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) = true;
> 
> I should add direct_supported_p (SELECT_VL...) to this, is that right?

No, since powerpc is fine with decrementing VL it should also use it.
Instead you should make sure to produce SCEV analyzable IVs when
possible (when SELECT_VL is not or cannot be used).

Richard.

> I have sent the SELECT_VL patch. I will add this in the next SELECT_VL patch.
> 
> Let's wait for more comments from Richard.
> 
> Thanks.
> 
> 
> juzhe.zh...@rivai.ai
>  
> From: Richard Biener
> Date: 2023-05-30 17:22
> To: juzhe.zh...@rivai.ai
> CC: gcc-patches; richard.sandiford; linkw
> Subject: Re: Re: decrement IV patch creates fails on PowerPC
> On Fri, 26 May 2023, juzhe.zh...@rivai.ai wrote:
>  
> > Hi, Richi. Thanks for your analysis and helps.
> > 
> > >> We could simply retain the original
> > >> incrementing IV for loop control and add the decrementing
> > >> IV for computing LEN in addition to that and leave IVOPTs
> > >> sorting out to eventually merge them (or not).
> > 
> > I am not sure how to do that.  Could you give me more information?
> > 
> > I somehow understand your concern is that a variable IV step will make
> > IVOPTs fail.
> > 
> > I have seen a similar situation in LLVM (when applying a variable IV,
> > they failed to interleave the vectorized code).  I am not sure whether
> > the reason is the same.
> > 
> > For RVV, we not only want the decrement IV style in vectorization but
> > also want to apply SELECT_VL in the single-rgroup case, which is the
> > most common one (LLVM also only applies get_vector_length with a single
> > vector length).
> >
> > >>You can do some testing with a cross compiler, alternatively
> > >>there are powerpc machines in the GCC compile farm.
> > 
> > It seems that Power is OK with the decrement IV since most cases are improved.
>  
> Well, Power will never have SELECT_VL, so at least for !SELECT_VL
> targets you should avoid having an IV with a variable decrement.  As
> I said, it should be easy to rewrite the decrement IV to use a constant
> increment (when not using SELECT_VL) and test the pre-decrement
> value in the exit test.
>  
> Richard.
> > I think Richard may help to explain decrement IV more clearly.
> > 
> > Thanks
> > 
> > 
> > juzhe.zh...@rivai.ai
> >  
> > From: Richard Biener
> > Date: 2023-05-26 14:46
> > To: ???
> > CC: gcc-patches; richard.sandiford; linkw
> > Subject: Re: decrement IV patch creates fails on PowerPC
> > On Fri, 26 May 2023, ??? wrote:
> >  
> > > Yesterday's patch has been approved (decrement IV support):
> > > https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619663.html 
> > > 
> > > However, it creates fails on PowerPC:
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109971 
> > > 
> > > I am really sorry for causing inconvenience.
> > > 
> > > I wonder as we discussed:
> > > +  /* If we're vectorizing a loop that uses length "controls" and
> > > + can iterate more than once, we apply decrementing IV approach
> > > + in loop control.  */
> > > +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> > > +  && !LOOP_VINFO_LENS (loop_vinfo).is_empty ()
> > > +  && LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) == 0
> > > +  && !(LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > +&& known_le (LOOP_VINFO_INT_NITERS (loop_vinfo),
> > > + LOOP_VINFO_VECT_FACTOR (loop_vinfo
> > > +LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) = true;
> > > 
> > > This conditions can not disable decrement IV on PowerPC.
> > > Should I add a target hook for it?
> >  
> > No.  I've put some analysis in the PR.  To me the question is
> > why (without that SELECT_VL case) we need a decrementing IV
> > _for the loop control_?  We could simply retain the original
> > incrementing IV for loop control and add the decrementing
> > IV for computing LEN in addition to that and leave IVOPTs
> > sorting out to eventually merge them (or not).
> >  
> > Alternatively avoid the variable decrement as I wrote in the
> > PR and do the exit test based on the previous IV value.
> >  
> > But as said all this won't work for the SELECT_VL case, but
> > then its availability is something to key off rather than a
> > new target hook?
> >  
> > > For the patch I could only do bootstrap and regression on X86.
> > > I didn't have an environment to test PowerPC. I am really sorry.
> >  
> > You can do some testing with a cross compiler, alternatively
> > there are powerpc machines in the GCC compile farm.

Re: [PATCH] libatomic: Provide gthr.h default implementation

2023-05-30 Thread Richard Biener via Gcc-patches
On Tue, May 23, 2023 at 11:28 AM Sebastian Huber
 wrote:
>
> On 10.01.23 16:38, Sebastian Huber wrote:
> > On 19/12/2022 17:02, Sebastian Huber wrote:
> >> Build libatomic for all targets.  Use gthr.h to provide a default
> >> implementation.  If the thread model is "single", then this
> >> implementation will
> >> not work if for example atomic operations are used for thread/interrupt
> >> synchronization.
> >
> > Is this and the related -fprofile-update=atomic patch something for GCC 14?
>
> Now that the GCC 14 development is in progress, what about this patch?

Sorry, there doesn't seem to be a main maintainer for libatomic and your patch
touches targets which didn't have it before.

Can you explain how this affects the ABI of targets not having (needing?!)
libatomic?  It might help if you can say this is still opt-in: targets not
building libatomic right now would still not build it with your patch, and
targets already building libatomic see no changes with your patch.

That said - what kind of ABI implications has providing libatomic support
for a target that didn't do so before?

Richard.

> --
> embedded brains GmbH
> Herr Sebastian HUBER
> Dornierstr. 4
> 82178 Puchheim
> Germany
> email: sebastian.hu...@embedded-brains.de
> phone: +49-89-18 94 741 - 16
> fax:   +49-89-18 94 741 - 08
>
> Registergericht: Amtsgericht München
> Registernummer: HRB 157899
> Vertretungsberechtigte Geschäftsführer: Peter Rasmussen, Thomas Dörfler
> Unsere Datenschutzerklärung finden Sie hier:
> https://embedded-brains.de/datenschutzerklaerung/


Re: decrement IV patch creates fails on PowerPC

2023-05-30 Thread Richard Biener via Gcc-patches
On Tue, 30 May 2023, Kewen.Lin wrote:

> on 2023/5/30 17:26, juzhe.zh...@rivai.ai wrote:
> > Ok.
> > 
> > It seems that for this conditions:
> > 
> > +  /* If we're vectorizing a loop that uses length "controls" and
> > + can iterate more than once, we apply decrementing IV approach
> > + in loop control.  */
> > +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> > +  && !LOOP_VINFO_LENS (loop_vinfo).is_empty ()
> > +  && LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) == 0
> > +  && !(LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > +  && known_le (LOOP_VINFO_INT_NITERS (loop_vinfo),
> > +   LOOP_VINFO_VECT_FACTOR (loop_vinfo
> > +LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) = true;
> > 
> > 
> > I should add direct_supported_p (SELECT_VL...) to this, is that right?
> 
> I guess no; with this condition any targets without SELECT_VL are unable
> to leverage the new decrement scheme for lengths, and as your reply in
> PR109971 says, you didn't mean to disable it.  IIUC, what Richi suggested
> is to introduce one new IV just like the previous one which has a
> non-variable step; then it's SCEV-ed and some analysis based on it can
> do a good job.

No, I said the current scheme does sth along

 do {
   remain -= MIN (vf, remain);
 } while (remain != 0);

and I suggest to instead do

 do {
   old_remain = remain;
   len = MIN (vf, remain);
   remain -= vf;
 } while (old_remain > vf);

basically since only the last iteration will have len < vf we can
ignore that remain -= vf will underflow there if we appropriately
rewrite the exit test to use the pre-decrement value.
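
For instance (illustrative), with remain = 10 and vf = 4 this computes
len = 4, 4, 2 across three iterations and exits once the pre-decrement
value old_remain = 2 no longer exceeds vf.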

> Since this is mainly for targets without SELECT_VL capability, I can
> follow this up if you don't mind.
> 
> BR,
> Kewen
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)


Re: decrement IV patch creates fails on PowerPC

2023-05-30 Thread Richard Biener via Gcc-patches
On Tue, 30 May 2023, Richard Sandiford wrote:

> My understanding was that we went into this knowing that the IVs
> would defeat SCEV analysis.  Apparently that wasn't a problem for RVV,
> but it's not surprising that it is a problem in general.
> 
> This isn't just about SELECT_VL though.  We use the same type of IV
> for cases what aren't going to use SELECT_VL.
> 
> Richard Biener  writes:
> > On Tue, 30 May 2023, Kewen.Lin wrote:
> >
> >> on 2023/5/30 17:26, juzhe.zh...@rivai.ai wrote:
> >> > Ok.
> >> > 
> >> > It seems that for this conditions:
> >> > 
> >> > +  /* If we're vectorizing a loop that uses length "controls" and
> >> > + can iterate more than once, we apply decrementing IV approach
> >> > + in loop control.  */
> >> > +  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
> >> > +  && !LOOP_VINFO_LENS (loop_vinfo).is_empty ()
> >> > +  && LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) == 0
> >> > +  && !(LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> >> > +   && known_le (LOOP_VINFO_INT_NITERS (loop_vinfo),
> >> > +LOOP_VINFO_VECT_FACTOR (loop_vinfo
> >> > +LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) = true;
> >> > 
> >> > 
> >> > I should add direct_supportted_p (SELECT_VL...) to this, is that right?
> >> 
> >> I guess no, with this condition any targets without SELECT_VL are unable
> >> to leverage the new decrement scheme for lengths, as your reply in PR109971
> >> you didn't mean to disable it.  IIUC, what Richi suggested is to introduce
> >> one new IV just like the previous one which has non-variable step, then 
> >> it's
> >> SCEV-ed and some analysis based on it can do a good job.
> >
> > No, I said the current scheme does sth along
> >
> >  do {
> >remain -= MIN (vf, remain);
> >  } while (remain != 0);
> >
> > and I suggest to instead do
> >
> >  do {
> >old_remain = remain;
> >len = MIN (vf, remain);
> >remain -= vf;
> >  } while (old_remain >= vf);
> >
> > basically since only the last iteration will have len < vf we can
> > ignore that remain -= vf will underflow there if we appropriately
> > rewrite the exit test to use the pre-decrement value.
> 
> Yeah, agree that should work.

Btw, it's still on my TODO list (unless somebody beats me...) to
rewrite the vectorizer code gen to do all loop control and conditions
on a decrementing "remaining scalar iters" IV.

> But how easy would it be to extend SCEV analysis, via a pattern match?
> The evolution of the IV phi wrt the inner loop is still a normal SCEV.

No, the IV isn't a normal SCEV, the final value is different.
I think pattern matching this in niter analysis could work though.

Richard.


Re: [PATCH] libatomic: Provide gthr.h default implementation

2023-05-30 Thread Richard Biener via Gcc-patches
On Tue, May 30, 2023 at 12:17 PM Sebastian Huber
 wrote:
>
> On 30.05.23 11:53, Richard Biener wrote:
> > On Tue, May 23, 2023 at 11:28 AM Sebastian Huber
> >   wrote:
> >> On 10.01.23 16:38, Sebastian Huber wrote:
> >>> On 19/12/2022 17:02, Sebastian Huber wrote:
>  Build libatomic for all targets.  Use gthr.h to provide a default
>  implementation.  If the thread model is "single", then this
>  implementation will
>  not work if for example atomic operations are used for thread/interrupt
>  synchronization.
> >>> Is this and the related -fprofile-update=atomic patch something for GCC 
> >>> 14?
> >> Now that the GCC 14 development is in progress, what about this patch?
> > Sorry, there doesn't seem to be a main maintainer for libatomic and your 
> > patch
> > touches targets which didn't have it before.
> >
> > Can you explain how this affects the ABI of targets not having (needing?!)
> > libatomic?  It might help if you can say this is still opt-in and targets 
> > not
> > building libatomic right now would not with your patch and targets already
> > building libatomic have no changes with your patch.
> >
> > That said - what kind of ABI implications has providing libatomic support
> > for a target that didn't do so before?
>
> Sorry for the missing context. The root problem I want to solve is
> getting gcov support for multi-threaded applications. For this we need
> atomic 64-bit operations, see also:

I was aware of the context but still worry about the ABI implications.
A target that doesn't build libatomic but would need one currently
has "unsupported" (aka fail to link) atomic operations that require
libatomic support.  After your patch such targets suddenly have
a new ABI (and supported atomic ops) - an ABI they would then need to
maintain for compatibility reasons, I think, and one that would (likely)
not be documented anywhere.

I think that's undesirable, esp. without buy-in from the affected
target maintainers.

> https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608620.html
>
> The libatomic patch lets it build for every target. Targets with no
> explicit support will use the gthr.h API to provide a default
> implementation.
>
> An alternative would be to use the RTEMS approach which uses the
> following API (provided by Newlib  for RTEMS):
>
> #include 
> #include 
>
> __BEGIN_DECLS
>
> __uint32_t _Libatomic_Protect_start(void *);
>
> void _Libatomic_Protect_end(void *, __uint32_t);
>
> void _Libatomic_Lock_n(void *, __size_t);
>
> void _Libatomic_Unlock_n(void *, __size_t);
>
> __END_DECLS
>
> We could also leave libatomic as is, but then you may get unresolved
> references if you use -fprofile-update=atomic with the patch mentioned
> above.

The alternative would be to provide the required subset of atomic
library functions from libgcov.a and emit calls to that directly?
The locked data isn't part of any ABI so no compatibility guarantee
needs to be maintained?
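
To make that concrete, a rough sketch of such a libgcov-private helper
(hypothetical name and layout, not an existing API; assumes the gthr.h
mutex interface is usable in this context):

  /* Invented libgcov-internal fallback for a 64-bit atomic add,
     serialized by a single global gthr mutex.  */
  #include "gthr.h"

  #ifdef __GTHREAD_MUTEX_INIT
  static __gthread_mutex_t __gcov_counter_lock = __GTHREAD_MUTEX_INIT;
  #else
  static __gthread_mutex_t __gcov_counter_lock;
  /* A real implementation would also need to handle
     __GTHREAD_MUTEX_INIT_FUNCTION here.  */
  #endif

  void
  __gcov_atomic_add_8 (unsigned long long *counter, unsigned long long val)
  {
    __gthread_mutex_lock (&__gcov_counter_lock);
    *counter += val;
    __gthread_mutex_unlock (&__gcov_counter_lock);
  }

The compiler would then emit calls to such a helper for
-fprofile-update=atomic instead of relying on libatomic.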

Richard.



Re: [PATCH V2] [vect] Enhance NARROW FLOAT_EXPR vectorization by truncating integer to lower precision.

2023-05-30 Thread Richard Biener via Gcc-patches
On Mon, May 29, 2023 at 5:21 AM Hongtao Liu via Gcc-patches
 wrote:
>
> ping.
>
> On Mon, May 8, 2023 at 9:59 AM liuhongt  wrote:
> >
> > > > @@ -4799,7 +4800,8 @@ vect_create_vectorized_demotion_stmts (vec_info 
> > > > *vinfo, vec<tree> *vec_oprnds,
> > > >stmt_vec_info stmt_info,
> > > >vec<tree> &vec_dsts,
> > > >gimple_stmt_iterator *gsi,
> > > > -  slp_tree slp_node, enum 
> > > > tree_code code)
> > > > +  slp_tree slp_node, enum 
> > > > tree_code code,
> > > > +  bool last_stmt_p)
> > >
> > > Can you please document this new parameter?
> > >
> > Changed.
> >
> > >
> > > I understand what you are doing, but somehow it looks a bit awkward?
> > > Maybe we should split the NARROW case into NARROW_SRC and NARROW_DST?
> > > The case of narrowing the source because we know its range isn't a
> > > good fit for the
> > > flow.
> > Changed.
> >
> > Here's updated patch.
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ok for trunk?

OK, sorry for the delay.

Thanks,
Richard.

> > Similar to WIDEN FLOAT_EXPR, when a direct optab does not exist, try an
> > intermediate integer type whenever the gimple ranger can tell it's safe.
> >
> > I.e.
> > when there's no direct optab for vector long long -> vector float, but
> > the value range of the integer can be represented as int, try vector int
> > -> vector float if available.
> >
> > gcc/ChangeLog:
> >
> > PR tree-optimization/108804
> > * tree-vect-patterns.cc (vect_get_range_info): Remove static.
> > * tree-vect-stmts.cc (vect_create_vectorized_demotion_stmts):
> > Add new parameter narrow_src_p.
> > (vectorizable_conversion): Enhance NARROW FLOAT_EXPR
> > vectorization by truncating to lower precision.
> > * tree-vectorizer.h (vect_get_range_info): New declare.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/i386/pr108804.c: New test.
> > ---
> >  gcc/testsuite/gcc.target/i386/pr108804.c |  15 +++
> >  gcc/tree-vect-patterns.cc|   2 +-
> >  gcc/tree-vect-stmts.cc   | 135 +--
> >  gcc/tree-vectorizer.h|   1 +
> >  4 files changed, 121 insertions(+), 32 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr108804.c
> >
> > diff --git a/gcc/testsuite/gcc.target/i386/pr108804.c 
> > b/gcc/testsuite/gcc.target/i386/pr108804.c
> > new file mode 100644
> > index 000..2a43c1e1848
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr108804.c
> > @@ -0,0 +1,15 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-mavx2 -Ofast -fdump-tree-vect-details" } */
> > +/* { dg-final { scan-tree-dump-times "vectorized \[1-3] loops" 1 "vect" } 
> > } */
> > +
> > +typedef unsigned long long uint64_t;
> > +uint64_t d[512];
> > +float f[1024];
> > +
> > +void foo() {
> > +for (int i=0; i<512; ++i) {
> > +uint64_t k = d[i];
> > +f[i]=(k & 0x3F30);
> > +}
> > +}
> > +
> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> > index a49b0953977..dd546b488a4 100644
> > --- a/gcc/tree-vect-patterns.cc
> > +++ b/gcc/tree-vect-patterns.cc
> > @@ -61,7 +61,7 @@ along with GCC; see the file COPYING3.  If not see
> >  /* Return true if we have a useful VR_RANGE range for VAR, storing it
> > in *MIN_VALUE and *MAX_VALUE if so.  Note the range in the dump files.  
> > */
> >
> > -static bool
> > +bool
> >  vect_get_range_info (tree var, wide_int *min_value, wide_int *max_value)
> >  {
> >value_range vr;
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index 6b7dbfd4a23..3da89a8402d 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -51,6 +51,7 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "internal-fn.h"
> >  #include "tree-vector-builder.h"
> >  #include "vec-perm-indices.h"
> > +#include "gimple-range.h"
> >  #include "tree-ssa-loop-niter.h"
> >  #include "gimple-fold.h"
> >  #include "regs.h"
> > @@ -4791,7 +4792,9 @@ vect_gen_widened_results_half (vec_info *vinfo, enum 
> > tree_code code,
> >
> >  /* Create vectorized demotion statements for vector operands from 
> > VEC_OPRNDS.
> > For multi-step conversions store the resulting vectors and call the 
> > function
> > -   recursively.  */
> > +   recursively. When NARROW_SRC_P is true, there's still a conversion after
> > +   narrowing, don't store the vectors in the SLP_NODE or in vector info of
> > +   the scalar statement(or in STMT_VINFO_RELATED_STMT chain).  */
> >
> >  static void
> >  vect_create_vectorized_demotion_stmts (vec_info *vinfo, vec<tree> 
> > *vec_oprnds,
> > @@ -4799,7 +4802,8 @@ vect_create_vectorized_demotion_stmts (vec_info 
> > *vinfo, vec<tree> *vec_oprnds,
> >stmt_vec_info 

Re: [PATCH v1] tree-ssa-sink: Improve code sinking pass.

2023-05-30 Thread Richard Biener via Gcc-patches
On Tue, May 30, 2023 at 9:35 AM Ajit Agarwal  wrote:
>
> Hello Richard:
>
> On 30/05/23 12:34 pm, Richard Biener wrote:
> > On Tue, May 30, 2023 at 7:06 AM Ajit Agarwal  wrote:
> >>
> >> Hello Richard:
> >>
> >> On 22/05/23 6:26 pm, Richard Biener wrote:
> >>> On Thu, May 18, 2023 at 9:14 AM Ajit Agarwal  
> >>> wrote:
> 
>  Hello All:
> 
>  This patch improves the code sinking pass to sink statements before calls
>  to reduce
>  register pressure.
>  Review comments are incorporated.
> 
>  Bootstrapped and regtested on powerpc64-linux-gnu.
> 
>  Thanks & Regards
>  Ajit
> 
> 
>  tree-ssa-sink: Improve code sinking pass.
> 
>  Code sinking sinks statements into the blocks after a call.  This increases
>  register pressure for callee-saved registers.  Improve code sinking to
>  sink statements before the call in the use blocks or the immediate
>  dominator of the use blocks.
> 
>  2023-05-18  Ajit Kumar Agarwal  
> 
>  gcc/ChangeLog:
> 
>  * tree-ssa-sink.cc (statement_sink_location): Modifed to
>  move statements before calls.
>  (block_call_p): New function.
>  (def_use_same_block): New function.
>  (select_best_block): Add heuristics to select the best
>  blocks in the immediate post dominator.
> 
>  gcc/testsuite/ChangeLog:
> 
>  * gcc.dg/tree-ssa/ssa-sink-20.c: New testcase.
>  * gcc.dg/tree-ssa/ssa-sink-21.c: New testcase.
>  ---
>   gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-20.c |  16 ++
>   gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c |  20 +++
>   gcc/tree-ssa-sink.cc| 159 ++--
>   3 files changed, 185 insertions(+), 10 deletions(-)
>   create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-20.c
>   create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
> 
>  diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-20.c 
>  b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-20.c
>  new file mode 100644
>  index 000..716bc1f9257
>  --- /dev/null
>  +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-20.c
>  @@ -0,0 +1,16 @@
>  +/* { dg-do compile } */
>  +/* { dg-options "-O2 -fdump-tree-sink -fdump-tree-optimized 
>  -fdump-tree-sink-stats" } */
>  +
>  +void bar();
>  +int j;
>  +void foo(int a, int b, int c, int d, int e, int f)
>  +{
>  +  int l;
>  +  l = a + b + c + d +e + f;
>  +  if (a != 5)
>  +{
>  +  bar();
>  +  j = l;
>  +}
>  +}
>  +/* { dg-final { scan-tree-dump-times "Sunk statements: 5" 1 "sink" } } 
>  */
> >>>
> >>> this doesn't verify the place we sink to?
> >>>
> >>
> >> I am not sure how to verify the place we sink to with dg-final.
> >
> > I think dejagnu supports matching multi-line regexps so I suggest
> > to scan for the sunk expr RHS to be followed by the call?
> >
>
> You meant to use dg-begin-multiline-output and dg-end-multiline-output.

I was referring to uses like that in gcc.dg/debug/dwarf2/pr41445-6.c

> Thanks & Regards
> Ajit
>  diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c 
>  b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
>  new file mode 100644
>  index 000..ff41e2ea8ae
>  --- /dev/null
>  +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
>  @@ -0,0 +1,20 @@
>  +/* { dg-do compile } */
>  +/* { dg-options "-O2 -fdump-tree-sink-stats -fdump-tree-sink-stats" } */
>  +
>  +void bar();
>  +int j, x;
>  +void foo(int a, int b, int c, int d, int e, int f)
>  +{
>  +  int l;
>  +  l = a + b + c + d +e + f;
>  +  if (a != 5)
>  +{
>  +  bar();
>  +  if (b != 3)
>  +x = 3;
>  +  else
>  +x = 5;
>  +  j = l;
>  +}
>  +}
>  +/* { dg-final { scan-tree-dump-times "Sunk statements: 5" 1 "sink" } } 
>  */
> >>>
> >>> likewise.  So both tests already pass before the patch?
> >>>
>  diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
>  index 87b1d40c174..76556e7795b 100644
>  --- a/gcc/tree-ssa-sink.cc
>  +++ b/gcc/tree-ssa-sink.cc
>  @@ -171,6 +171,72 @@ nearest_common_dominator_of_uses (def_operand_p 
>  def_p, bool *debug_stmts)
> return commondom;
>   }
> 
>  +/* Return TRUE if immediate uses of the defs in
>  +   USE occur in the same block as USE, FALSE otherwise.  */
>  +
>  +bool
>  +def_use_same_block (gimple *stmt)
>  +{
>  +  use_operand_p use_p;
>  +  def_operand_p def_p;
>  +  imm_use_iterator imm_iter;
>  +  ssa_op_iter iter;
>  +
>  +  FOR_EACH_SSA_DEF_OPERAND (def_p, stmt, iter, SSA_OP_DEF)
>  +{
>  +  FOR_EACH_IMM_USE_FAST (use_p, imm_iter, DEF_FROM_PTR (def_p))
>  +   {
>  + if (is_gimple_debug (USE_STMT (use_p)))
>  +   continue;

Re: decrement IV patch create fails on PowerPC

2023-05-30 Thread Richard Biener via Gcc-patches
On Tue, 30 May 2023, Richard Sandiford wrote:

> Richard Biener  writes:
> >> But how easy would it be to extend SCEV analysis, via a pattern match?
> >> The evolution of the IV phi wrt the inner loop is still a normal SCEV.
> >
> > No, the IV isn't a normal SCEV, the final value is different.
> 
> Which part of the IV though?

The relevant IV (for niter analysis) is the one used in the loop
exit test and that currently isn't a SCEV.  The IV used in the
*_len operations isn't either (and that's not going to change,
obviously).

>  Won't all executions of the latch edge
> decrement the IV phi (and specifically the phi) by VF (and only VF)?

But currently there's no decrement by invariant VF but only
by MIN (VF, remain), that's what I suggested to address to
make the loop exit condition analyzable (as said, in theory
we can try pattern matching the analysis of the exit test in
niter analysis).

> So if we analyse the IV phi wrt the inner loop, the IV phi is simply
> { initial, -, VF }.
> 
> I agree "IV phi - step" isn't a SCEV, but that doesn't seem fatal.

Right.  Fatal is the non-SCEV in the exit test which makes most
followup loop optimizations fail to consider the loop because the
number of iterations cannot be determined.

Richard.


Re: [PATCH] VECT: Change flow of decrement IV

2023-05-30 Thread Richard Biener via Gcc-patches
On Tue, 30 May 2023, juzhe.zhong wrote:

> This patch will generate one 'mov' instruction per rgroup inside the
> loop.  This is unacceptable.  For example, if the number of rgroups is 3,
> there will be 3 more instructions in the loop.  If this patch is
> necessary, I think I should find a way to fix it.

That's odd, you only need to adjust the IV which is used in the exit test,
not all the others.

> -------- Replied Message --------
> From: Richard Sandiford
> Date: 05/30/2023 19:41
> To: juzhe.zh...@rivai.ai
> Cc: gcc-patches, rguenther, linkw
> Subject: Re: [PATCH] VECT: Change flow of decrement IV
> "juzhe.zh...@rivai.ai"  writes:
> > Before this patch:
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> >   sub   a2,a2,a5
> > bne a2,zero,.L3
> > .L5:
> > ret
> >
> > After this patch:
> >
> > foo:
> > ble a2,zero,.L5
> > csrr a3,vlenb
> > srli a4,a3,2
> > neg a7,a4   -->>>additional instruction
> > .L3:
> > minu a5,a2,a4
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a0)
> > vsetvli t1,zero,e32,m1,ta,ma
> > mv a6,a2  -->>>additional instruction
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a0)
> > add a1,a1,a3
> > add a0,a0,a3
> > add a2,a2,a7
> > bgtu a6,a4,.L3
> > .L5:
> > ret
> >
> > There is 1 more instruction in preheader and 1 more instruction in loop.
> > But I think it's OK for RVV since we will definitely be using SELECT_VL so
> > this issue will be gone.
> 
> But what about cases where you won't be using SELECT_VL, such as SLP?
> 
> Richard
> 
> 



Re: Re: [PATCH] VECT: Change flow of decrement IV

2023-05-30 Thread Richard Biener via Gcc-patches
On Wed, 31 May 2023, juzhe.zh...@rivai.ai wrote:

> Hi, all.  I have posted several investigations:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620101.html 
> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620105.html 
> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620108.html 
> 
> It turns out that when "niters is a constant value and vf is a constant value",
> this patch can allow SCEV/IVOPTS to optimize a lot for RVV too (take the
> testcase from IBM's testsuite for example), and I think this patch can fix
> IBM's cunroll issue.
> Even though it will produce a 'mv' instruction in some other cases for RVV, I
> think Gain > Pain overall.
> 
> Actually, for current flow:
> 
> step = MIN ()
> ...
> remain = remain - step.
> 
> I don't know how difficult it is to extend SCEV/IVOPTS to fix this issue.
> So, could you make a decision for this patch?
> 
> I wonder whether we should apply the approach of this patch (the code can be
> refined once it is well reviewed) or
> we should extend SCEV/IVOPTS?

I don't think we can do anything in SCEV for this which means we'd
need to special-case this in niter analysis, in IVOPTs and any other
passes that might be affected (and not fixed by handling it in niter
analysis).  While improving niter analysis would be good (the user
could write this pattern as well) I do not have time to try
implementing that (I have no idea how ugly or robust it is going to be).

So I think we should patch this up in the vectorizer itself like with
your patch.  I'm going to wait for Richards input though since he
seems to disagree.

Note with SELECT_VL all bets will be off since as I understand the
value it gives can vary from iteration to iteration (but we know
a lower and maybe an upper bound?)

Thanks,
Richard.

> Thanks. 
> 
> 
> juzhe.zh...@rivai.ai
>  
> From: juzhe.zh...@rivai.ai
> Date: 2023-05-30 23:05
> To: rguenther
> CC: richard.sandiford; gcc-patches; linkw
> Subject: Re: Re: [PATCH] VECT: Change flow of decrement IV
> More information of power's testcase:
> 
> Before this patch:
> test_npeel_int16_t:
> lui a4,%hi(.LANCHOR0+130)
> lui a3,%hi(.LANCHOR1)
> addi a3,a3,%lo(.LANCHOR1)
> addi a4,a4,%lo(.LANCHOR0+130)
> li a5,58
> li a2,16
> vsetivli zero,16,e16,m1,ta,ma
> vl1re16.v v3,0(a3)
> vid.v v1
> .L5:
> minu a3,a5,a2
> vsetvli zero,a3,e16,m1,ta,ma
> sub a5,a5,a3
> vse16.v v1,0(a4)
> vsetivli zero,16,e16,m1,ta,ma
> addi a4,a4,32
> vadd.vv v1,v1,v3
> bne a5,zero,.L5
> ret
> 
> After this patch:
> test_npeel_int16_t:
> lui a5,%hi(.LANCHOR0)
> addi a5,a5,%lo(.LANCHOR0)
> li a1,16
> vsetivli zero,16,e16,m1,ta,ma
> addi a2,a5,130
> vid.v v1
> addi a3,a5,162
> vadd.vx v4,v1,a1
> addi a4,a5,194
> li a1,32
> vadd.vx v3,v1,a1
> vse16.v v1,0(a2)
> vse16.v v4,0(a3)
> vse16.v v3,0(a4)
> addi a5,a5,226
> li a1,48
> vadd.vx v2,v1,a1
> vsetivli zero,10,e16,m1,ta,ma
> vse16.v v2,0(a5)
> ret
> 
> It's obvious: previously, power's testcase could not be unrolled on the RVV
> side, but after this patch it can be unrolled now.
> 
> 
> juzhe.zh...@rivai.ai
>  
> From: Richard Biener
> Date: 2023-05-30 20:33
> To: juzhe.zhong
> CC: Richard Sandiford; gcc-patches; linkw
> Subject: Re: [PATCH] VECT: Change flow of decrement IV
> On Tue, 30 May 2023, juzhe.zhong wrote:
>  
> > This patch will generate one 'mov' instruction per rgroup inside the
> > loop.  This is unacceptable.  For example, if the number of rgroups is 3,
> > there will be 3 more instructions in the loop.  If this patch is
> > necessary, I think I should find a way to fix it.
>  
> That's odd, you only need to adjust the IV which is used in the exit test,
> not all the others.
>  
> > -------- Replied Message --------
> > From: Richard Sandiford
> > Date: 05/30/2023 19:41
> > To: juzhe.zh...@rivai.ai
> > Cc: gcc-patches, rguenther, linkw
> > Subject: Re: [PATCH] VECT: Change flow of decrement IV
> > "juzhe.zh...@rivai.ai"  writes:
> > > Before this patch:
> > > foo:
> > > ble a2,zero,.L5
> > > csrr a3,vlenb
> > > srli a4,a3,2
> > > .L3:
> > > minu a5,a2,a4
> > > vsetvli zero,a5,e32,m1,ta,ma
> > > vle32.v v2,0(a1)
> > > vle32.v v1,0(a0)
> > > vsetvli t1,zero,e32,m1,ta,ma
> > > vadd.vv v1,v1,v2
> > > vsetvli zero,a5,e32,m1,ta,ma
> > > vse32.v v1,0(a0)
> > > add a1,a1,a3
> > > add a0,a0,a3
> > >   sub   a2,a2,a5
> > > bne a2,zero,.L3
> > > .L5:
> > > ret
> > >
> > > After this patch:
> > >
> > > foo:
> > > ble a2,zero,.L5
> > > csrr a3,vlenb
> > > srli a4,a3,2
> > > neg a7,a4   -->>>additional instruction
> > > .L3:
> > > minu a5,a2,a4
> > > vsetvli zero,a5,e32,m1,ta,ma
> > > vle32.v v2,0(a1)
> > > vle32.v v1,0(a0)
> > > vsetvli t1,zero,e32,m1,ta,ma
> > > mv a6,a2  -->>>additional instruction
> > > vadd.vv v1,v1,v2
> > > vsetvli zero,a5,e32,m1,ta,ma
> > > vse32.v v1,0(a0)
> > > add a1,a1,a3
> > > add a0,a0,a3
> > > add a2,a2,a7
> > > bgtu a6,a4,.L3
> > > .L5:
> > > ret
> > >
> > > There is 1 more instruction in preheader and 1 more instruction in loop.
> > > But I think it's OK for RVV since we will definitely be using SELECT_VL so
> > > this issue will be gone.

Re: [PATCH 1/2] ipa-cp: Avoid long linear searches through DECL_ARGUMENTS

2023-05-31 Thread Richard Biener via Gcc-patches
On Tue, May 30, 2023 at 4:21 PM Jan Hubicka  wrote:
>
> > On Mon, May 29, 2023 at 6:20 PM Martin Jambor  wrote:
> > >
> > > Hi,
> > >
> > > there have been concerns that linear searches through DECL_ARGUMENTS
> > > that are often necessary to compute the index of a particular
> > > PARM_DECL which is the key to results of IPA-CP can happen often
> > > enough to be a compile time issue, especially if we plug the results
> > > into value numbering, as I intend to do with a follow-up patch.
> > >
> > > This patch creates a hash map to do the look-up for all functions
> > > which have some information discovered by IPA-CP and which have 32
> > > parameters or more.  32 is a hard-wired magical constant here to
> > > capture the trade-off between the memory allocation overhead and
> > > length of the linear search.  I do not think it is worth making it a
> > > --param but if people think it appropriate, I can turn it into one.
> >
> > Since ipcp_transformation is short-lived (is it?) is it worth the trouble?
> > Comments below ...
>
> It lives from ipa-cp time to WPA stream-out or IPA transform stage,
> so memory consumption is a concern with -flto.
> > > +  m_tree_to_idx = hash_map<tree, unsigned>::create_ggc (c);
> > > +  unsigned index = 0;
> > > +  for (tree p = DECL_ARGUMENTS (fndecl); p; p = DECL_CHAIN (p), index++)
> > > +    m_tree_to_idx->put (p, index);
> >
> > I think allocating the hash-map with 'c' for some numbers (depending on the
> > "prime" chosen) will necessarily cause re-allocation of the hash since we
> > keep a load factor of at most 3/4 upon insertion.
> >
> > But - I wonder if a UID sorted array isn't a very much better data
> > structure for this?
> > That is, a vec<std::pair<unsigned, unsigned>>?
>
> Yeah, I was thinking along this lines too.
> Having field directly in PARM_DECL node would be probably prettiest.
> In general this is probably not that important as wast amount of time we
> have few parameters and linear lookup is just fine.

There are 6 bits of DECL_OFFSET_ALIGN that could be re-purposed, but
64 parameters is a bit low.  _Maybe_ PARM_DECL doesn't need any of
the tree_base bits so could use the full word for sth else as well ...

I also thought it might be interesting to only record PARM_DECLs that
we have interesting info for and skip VARYING ones.  So with an
indirection DECL_OFFSET_ALIGN -> index to non-varying param or
-1 the encoding space could shrink.

But still using a vec<> looks like a straight-forward improvement here.
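
A sketch of that vec<> variant (hypothetical code, not the actual patch;
it uses std::pair/std::lower_bound for brevity where the real thing would
use GCC's own vec API, and assumes the usual tree accessors):

  /* Map PARM_DECLs to their index via a DECL_UID-sorted array built
     once, then looked up with binary search.  */
  #include <algorithm>
  #include <vector>

  struct param_index { unsigned uid; unsigned index; };

  static std::vector<param_index>
  build_param_map (tree fndecl)
  {
    std::vector<param_index> map;
    unsigned index = 0;
    for (tree p = DECL_ARGUMENTS (fndecl); p; p = DECL_CHAIN (p), index++)
      map.push_back ({ DECL_UID (p), index });
    std::sort (map.begin (), map.end (),
               [] (const param_index &a, const param_index &b)
               { return a.uid < b.uid; });
    return map;
  }

  static int
  lookup_param_index (const std::vector<param_index> &map, tree parm)
  {
    unsigned uid = DECL_UID (parm);
    auto it = std::lower_bound (map.begin (), map.end (), uid,
                                [] (const param_index &e, unsigned u)
                                { return e.uid < u; });
    return (it != map.end () && it->uid == uid) ? (int) it->index : -1;
  }

Memory is one pair per parameter with no load-factor slack, and lookup is
O(log n) without hashing.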

Richard.

> Honza
> >
> > > +}
> > > diff --git a/gcc/ipa-prop.cc b/gcc/ipa-prop.cc
> > > index ab6de9f10da..f0976e363f7 100644
> > > --- a/gcc/ipa-prop.cc
> > > +++ b/gcc/ipa-prop.cc
> > > @@ -5776,16 +5776,9 @@ ipcp_get_parm_bits (tree parm, tree *value, 
> > > widest_int *mask)
> > >if (!ts || vec_safe_length (ts->bits) == 0)
> > >  return false;
> > >
> > > -  int i = 0;
> > > -  for (tree p = DECL_ARGUMENTS (current_function_decl);
> > > -   p != parm; p = DECL_CHAIN (p))
> > > -{
> > > -  i++;
> > > -  /* Ignore static chain.  */
> > > -  if (!p)
> > > -   return false;
> > > -}
> > > -
> > > +  int i = ts->get_param_index (current_function_decl, parm);
> > > +  if (i < 0)
> > > +return false;
> > >clone_info *cinfo = clone_info::get (cnode);
> > >if (cinfo && cinfo->param_adjustments)
> > >  {
> > > @@ -5802,16 +5795,12 @@ ipcp_get_parm_bits (tree parm, tree *value, 
> > > widest_int *mask)
> > >return true;
> > >  }
> > >
> > > -
> > > -/* Update bits info of formal parameters as described in
> > > -   ipcp_transformation.  */
> > > +/* Update bits info of formal parameters of NODE as described in TS.  */
> > >
> > >  static void
> > > -ipcp_update_bits (struct cgraph_node *node)
> > > +ipcp_update_bits (struct cgraph_node *node, ipcp_transformation *ts)
> > >  {
> > > -  ipcp_transformation *ts = ipcp_get_transformation_summary (node);
> > > -
> > > -  if (!ts || vec_safe_length (ts->bits) == 0)
> > > +  if (vec_safe_is_empty (ts->bits))
> > >  return;
> > >vec &bits = *ts->bits;
> > >unsigned count = bits.length ();
> > > @@ -5913,14 +5902,12 @@ ipcp_update_bits (struct cgraph_node *node)
> > >  }
> > >  }
> > >
> > > -/* Update value range of formal parameters as described in
> > > -   ipcp_transformation.  */
> > > +/* Update value range of formal parameters of NODE as described in TS.  
> > > */
> > >
> > >  static void
> > > -ipcp_update_vr (struct cgraph_node *node)
> > > +ipcp_update_vr (struct cgraph_node *node, ipcp_transformation *ts)
> > >  {
> > > -  ipcp_transformation *ts = ipcp_get_transformation_summary (node);
> > > -  if (!ts || vec_safe_length (ts->m_vr) == 0)
> > > +  if (vec_safe_is_empty (ts->m_vr))
> > >  return;
> > >const vec &vr = *ts->m_vr;
> > >unsigned count = vr.length ();
> > > @@ -5996,10 +5983,17 @@ ipcp_transform_function (struct cgraph_node *node)
> > >  fprintf (dump_file, "Modification phase of node %s\n",
> > >  node->dump_name ());
> > >

Re: [PATCH] doc: clarify semantics of vector bitwise shifts

2023-05-31 Thread Richard Biener via Gcc-patches
On Tue, May 30, 2023 at 4:49 PM Alexander Monakov  wrote:
>
>
> On Thu, 25 May 2023, Richard Biener wrote:
>
> > On Wed, May 24, 2023 at 8:36 PM Alexander Monakov  
> > wrote:
> > >
> > >
> > > On Wed, 24 May 2023, Richard Biener via Gcc-patches wrote:
> > >
> > > > I’d have to check the ISAs what they actually do here - it of course 
> > > > depends
> > > > on RTL semantics as well but as you say those are not strictly defined 
> > > > here
> > > > either.
> > >
> > > Plus, we can add the following executable test to the testsuite:
> >
> > Yeah, that's probably a good idea.  I think your documentation change
> > with the added sentence about the truncation is OK.
>
> I am no longer confident in my patch, sorry.
>
> My claim about vector shift semantics in OpenCL was wrong. In fact it 
> specifies
> that the RHS of a vector shift is masked to the exact bitwidth of the element 
> type.
>
> So, to collect various angles:
>
> 1. OpenCL semantics would need an 'AND' before a shift (except VSX/Altivec).
>
> 2. From user side we had a request to follow C integer promotion semantics
>in https://gcc.gnu.org/PR91838 but I now doubt we can do that.
>
> 3. LLVM makes oversized vector shifts UB both for 'vector_size' and
>'ext_vector_type'.

I had the impression GCC desired to do 3. as well, matching what we do
for scalar shifts.

> 4. Vector lowering does not emit promotions, and starting from gcc-12
>ranger treats oversized shifts according to the documentation you
>cite below, and optimizes (e.g. with '-O2 -mno-sse')
>
> typedef short v8hi __attribute__((vector_size(16)));
>
> void f(v8hi *p)
> {
> *p >>= 16;
> }
>
>to zeroing '*p'. If this looks unintended, I can file a bug.
>
> I still think we need to clarify semantics of vector shifts, but probably
> not in the way I proposed initially. What do you think?

I think the intent at some point was to adhere to the OpenCL spec
for the GCC vector extension (because that's a written spec while
GCC's vector extension docs are lacking).  Originally the powerpc
altivec 'vector' keyword spurred most of the development IIRC
so it might be useful to see how they specify shifts.

So yes, we probably should clarify the semantics to match the
implementation (since we have two targets doing things differently
since forever we can only document it as UB) and also note the
difference from OpenCL (in case OpenCL is still relevant these
days we might want to offer a -fopencl-vectors to emit the required
AND).
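
For reference, the masking such a -fopencl-vectors option would have to
emit looks like this at the source level (a sketch; the option itself is
only hypothetical at this point):

  typedef short v8hi __attribute__ ((vector_size (16)));

  /* OpenCL masks the shift count to the element's bit width, so a
     well-defined variant of "x << n" for 16-bit elements is: */
  v8hi
  shl_opencl (v8hi x, v8hi n)
  {
    return x << (n & 15);   /* 15 == element bits - 1 */
  }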

It would be also good to amend the RTL documentation.

It would be very nice to start an internals documentation section
around collecting what the middle-end considers undefined
or implementation defined (aka target defined) behavior in the
GENERIC, GIMPLE and RTL ILs and what predicates eventually
control that (like TYPE_OVERFLOW_UNDEFINED).  Maybe spread it over
{gimple,generic,rtl}.texi, though gimple.texi is only about the representation
and all semantics are shared and documented in generic.texi.

Thanks,
Richard.

> Thanks.
> Alexander
>
> > Note we have
> >
> > /* Shift operations for shift and rotate.
> >Shift means logical shift if done on an
> >unsigned type, arithmetic shift if done on a signed type.
> >The second operand is the number of bits to
> >shift by; it need not be the same type as the first operand and result.
> >Note that the result is undefined if the second operand is larger
> >than or equal to the first operand's type size.
> >
> >The first operand of a shift can have either an integer or a
> >(non-integer) fixed-point type.  We follow the ISO/IEC TR 18037:2004
> >semantics for the latter.
> >
> >Rotates are defined for integer types only.  */
> > DEFTREECODE (LSHIFT_EXPR, "lshift_expr", tcc_binary, 2)
> >
> > in tree.def which implies short << 24 is undefined behavior (similar
> > wording in generic.texi).  The rtl docs say nothing about behavior
> > but I think the semantics should carry over.  That works for x86
> > even for scalar instructions working on GPRs (masking is applied
> > but fixed to 5 or 6 bits even for QImode or HImode shifts).
> >
> > Note that when we make these shifts well-defined there's
> > also arithmetic on signed types smaller than int (which again
> > doesn't exist in C) where overflow invokes undefined behavior
> > in the middle-end.  Unless we want to change that as well
> &

Re: [PATCH] jump: Change return type of predicate functions from int to bool

2023-05-31 Thread Richard Biener via Gcc-patches
On Tue, May 30, 2023 at 9:01 PM Jeff Law via Gcc-patches
 wrote:
>
>
>
> On 5/30/23 08:36, Uros Bizjak via Gcc-patches wrote:
> > gcc/ChangeLog:
> >
> >  * rtl.h (comparison_dominates_p): Change return type from int to bool.
> >  (condjump_p): Ditto.
> >  (any_condjump_p): Ditto.
> >  (any_uncondjump_p): Ditto.
> >  (simplejump_p): Ditto.
> >  (returnjump_p): Ditto.
> >  (eh_returnjump_p): Ditto.
> >  (onlyjump_p): Ditto.
> >  (invert_jump_1): Ditto.
> >  (invert_jump): Ditto.
> >  (rtx_renumbered_equal_p): Ditto.
> >  (redirect_jump_1): Ditto.
> >  (redirect_jump): Ditto.
> >  (condjump_in_parallel_p): Ditto.
> >  * jump.cc (invert_exp_1): Adjust forward declaration.
> >  (comparison_dominates_p): Change return type from int to bool
> >  and adjust function body accordingly.
> >  (simplejump_p): Ditto.
> >  (condjump_p): Ditto.
> >  (condjump_in_parallel_p): Ditto.
> >  (any_uncondjump_p): Ditto.
> >  (any_condjump_p): Ditto.
> >  (returnjump_p): Ditto.
> >  (eh_returnjump_p): Ditto.
> >  (onlyjump_p): Ditto.
> >  (redirect_jump_1): Ditto.
> >  (redirect_jump): Ditto.
> >  (invert_exp_1): Ditto.
> >  (invert_jump_1): Ditto.
> >  (invert_jump): Ditto.
> >  (rtx_renumbered_equal_p): Ditto.
> >
> > Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
> >
> > OK for master?
> OK.

Do we have a diagnostic that would point out places we
assign the bool result to an integer variable?  Do we want
to change those places as well (did you intend to or restrict
the changes to functions only used in conditional context?)

Richard.

> jeff


Re: [PATCH] Fix ICE in rewrite_expr_tree_parallel

2023-05-31 Thread Richard Biener via Gcc-patches
On Wed, May 31, 2023 at 3:35 AM Cui, Lili  wrote:
>
> Hi,
>
> This patch is to fix ICE in rewrite_expr_tree_parallel.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110038
>
> Bootstrapped and regtested. Ok for trunk?

OK.

> Regards
> Lili.
>
> 1. Limit the value of tree-reassoc-width to IntegerRange(0, 256).
> 2. Add width limit in rewrite_expr_tree_parallel.
>
> gcc/ChangeLog:
>
> PR tree-optimization/110038
> * params.opt: Add a limit on tree-reassoc-width.
> * tree-ssa-reassoc.cc
> (rewrite_expr_tree_parallel): Add width limit.
>
> gcc/testsuite/ChangeLog:
>
> PR tree-optimization/110038
> * gcc.dg/pr110038.c: New test.
> ---
>  gcc/params.opt  |  2 +-
>  gcc/testsuite/gcc.dg/pr110038.c | 10 ++
>  gcc/tree-ssa-reassoc.cc |  3 +++
>  3 files changed, 14 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/pr110038.c
>
> diff --git a/gcc/params.opt b/gcc/params.opt
> index 66f1c99036a..70cfb495e3a 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -1091,7 +1091,7 @@ Common Joined UInteger 
> Var(param_tracer_min_branch_ratio) Init(10) IntegerRange(
>  Stop reverse growth if the reverse probability of best edge is less than 
> this threshold (in percent).
>
>  -param=tree-reassoc-width=
> -Common Joined UInteger Var(param_tree_reassoc_width) Param Optimization
> +Common Joined UInteger Var(param_tree_reassoc_width) IntegerRange(0, 256) 
> Param Optimization
>  Set the maximum number of instructions executed in parallel in reassociated 
> tree.  If 0, use the target dependent heuristic.
>
>  -param=tsan-distinguish-volatile=
> diff --git a/gcc/testsuite/gcc.dg/pr110038.c b/gcc/testsuite/gcc.dg/pr110038.c
> new file mode 100644
> index 000..0f578b182ca
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr110038.c
> @@ -0,0 +1,10 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O --param=tree-reassoc-width=256" } */
> +
> +unsigned a, b;
> +
> +void
> +foo (unsigned c)
> +{
> +  a += b + c + 1;
> +}
> diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
> index ad2f528ff07..f8055d59d57 100644
> --- a/gcc/tree-ssa-reassoc.cc
> +++ b/gcc/tree-ssa-reassoc.cc
> @@ -5510,6 +5510,9 @@ rewrite_expr_tree_parallel (gassign *stmt, int width, 
> bool has_fma,
>for (i = stmt_num - 2; i >= 0; i--)
>  stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
>
> +   /* Width should not be larger than op_num/2.  */
> +   width = width <= op_num / 2 ? width : op_num / 2;
> +
>/* Build parallel dependency chain according to width.  */
>for (i = 0; i < width; i++)
>  {
> --
> 2.25.1
>


Re: [PATCH] Fix PR 110042: ifcvt regression due to paradoxical subregs

2023-05-31 Thread Richard Biener via Gcc-patches
On Wed, May 31, 2023 at 6:34 AM Andrew Pinski via Gcc-patches
 wrote:
>
> After r14-1014-gc5df248509b489364c573e8, GCC started to emit
> a zero_extract directly for `(t1&0x8)!=0`.  This introduced
> a small regression where ifcvt would not do the if-conversion,
> as there is now a paradoxical subreg in the dest which
> was being rejected.  Since a paradoxical subreg sets the whole
> register, we can treat it the same as a reg in the two places.
>
> OK? Bootstrapped and tested on x86_64-linux-gnu and aarch64-linux-gnu.

OK I guess.   I vaguely remember SUBREG_PROMOTED_UNSIGNED_P
applies to non-paradoxical subregs but I might be swapping things - maybe
you remember better and whether that would cause any issues here?

Thanks,
Richard.

> gcc/ChangeLog:
>
> PR rtl-optimization/110042
> * ifcvt.cc (bbs_ok_for_cmove_arith): Allow paradoxical subregs.
> (bb_valid_for_noce_process_p): Strip the subreg for the SET_DEST.
>
> gcc/testsuite/ChangeLog:
>
> PR rtl-optimization/110042
> * gcc.target/aarch64/csel_bfx_2.c: New test.
> ---
>  gcc/ifcvt.cc  | 14 ++
>  gcc/testsuite/gcc.target/aarch64/csel_bfx_2.c | 27 +++
>  2 files changed, 36 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/csel_bfx_2.c
>
> diff --git a/gcc/ifcvt.cc b/gcc/ifcvt.cc
> index 868eda93251..0b180b4568f 100644
> --- a/gcc/ifcvt.cc
> +++ b/gcc/ifcvt.cc
> @@ -2022,7 +2022,7 @@ bbs_ok_for_cmove_arith (basic_block bb_a, basic_block 
> bb_b, rtx to_rename)
> }
>
>/* Make sure this is a REG and not some instance
> -of ZERO_EXTRACT or SUBREG or other dangerous stuff.
> +of ZERO_EXTRACT or non-paradoxical SUBREG or other dangerous stuff.
>  If we have a memory destination then we have a pair of simple
>  basic blocks performing an operation of the form [addr] = c ? a : b.
>  bb_valid_for_noce_process_p will have ensured that these are
> @@ -2030,7 +2030,8 @@ bbs_ok_for_cmove_arith (basic_block bb_a, basic_block 
> bb_b, rtx to_rename)
>  to be renamed.  Assert that the callers set this up properly.  */
>if (MEM_P (SET_DEST (sset_b)))
> gcc_assert (rtx_equal_p (SET_DEST (sset_b), to_rename));
> -  else if (!REG_P (SET_DEST (sset_b)))
> +  else if (!REG_P (SET_DEST (sset_b))
> +  && !paradoxical_subreg_p (SET_DEST (sset_b)))
> {
>   BITMAP_FREE (bba_sets);
>   return false;
> @@ -3136,14 +3137,17 @@ bb_valid_for_noce_process_p (basic_block test_bb, rtx 
> cond,
>
>   rtx sset = single_set (insn);
>   gcc_assert (sset);
> + rtx dest = SET_DEST (sset);
> + if (SUBREG_P (dest))
> +   dest = SUBREG_REG (dest);
>
>   if (contains_mem_rtx_p (SET_SRC (sset))
> - || !REG_P (SET_DEST (sset))
> - || reg_overlap_mentioned_p (SET_DEST (sset), cond))
> + || !REG_P (dest)
> + || reg_overlap_mentioned_p (dest, cond))
> goto free_bitmap_and_fail;
>
>   potential_cost += pattern_cost (sset, speed_p);
> - bitmap_set_bit (test_bb_temps, REGNO (SET_DEST (sset)));
> + bitmap_set_bit (test_bb_temps, REGNO (dest));
> }
>  }
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/csel_bfx_2.c 
> b/gcc/testsuite/gcc.target/aarch64/csel_bfx_2.c
> new file mode 100644
> index 000..c3b8a6f45cc
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/csel_bfx_2.c
> @@ -0,0 +1,27 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +unsigned
> +f1(int t, int t1)
> +{
> +  int tt = 0;
> +  if(t)
> +tt = (t1&0x8)!=0;
> +  return tt;
> +}
> +struct f
> +{
> +  unsigned t:3;
> +  unsigned t1:4;
> +};
> +unsigned
> +f2(int t, struct f y)
> +{
> +  int tt = 0;
> +  if(t)
> +tt = y.t1;
> +  return tt;
> +}
> +/* Both f1 and f2 should produce a csel and not a cbz on the argument. */
> +/*  { dg-final { scan-assembler-times "csel\t" 2 } } */
> +/*  { dg-final { scan-assembler-times "ubfx\t" 2 } } */
> +/*  { dg-final { scan-assembler-not "cbz\t" } } */
> --
> 2.31.1
>


Re: [PATCH] libatomic: Provide gthr.h default implementation

2023-05-31 Thread Richard Biener via Gcc-patches
On Wed, May 31, 2023 at 7:31 AM Sebastian Huber
 wrote:
>
> On 30.05.23 13:17, Richard Biener wrote:
> > The alternative would be to provide the required subset of atomic
> > library functions from libgcov.a and emit calls to that directly?
> > The locked data isn't part of any ABI so no compatibility guarantee
> > needs to be maintained?
>
> So, if atomic operations are not available in hardware, then I should
> emit calls to libgcov.a which would use gthr.h to implement them? I
> guess that I can do this, but it needs a bit of time.

Before doing that it would be nice to get buy-in from others - maybe
my ABI concern for libatomic isn't shared by others.

> Should I add the libgcov functions to builtin_decl_explicit()?

No, they shouldn't be any different from other libgcov functions.

Richard.



Re: [PATCH] VECT: Change flow of decrement IV

2023-05-31 Thread Richard Biener via Gcc-patches
On Wed, 31 May 2023, Richard Sandiford wrote:

> Richard Biener  writes:
> > On Wed, 31 May 2023, juzhe.zh...@rivai.ai wrote:
> >
> >> Hi, all.  I have posted several investigations:
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620101.html 
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620105.html 
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620108.html 
> >> 
> >> It turns out that when "niters is a constant value and vf is a constant value",
> >> this patch can allow SCEV/IVOPTS to optimize a lot for RVV too (take the
> >> testcase from IBM's testsuite for example), and I think this patch can fix
> >> IBM's cunroll issue.
> >> Even though it will produce a 'mv' instruction in some other cases for
> >> RVV, I think Gain > Pain overall.
> >> 
> >> Actually, for current flow:
> >> 
> >> step = MIN ()
> >> ...
> >> remain = remain - step.
> >> 
> >> I don't know how difficult it is to extend SCEV/IVOPTS to fix this issue.
> >> So, could you make a decision for this patch?
> >> 
> >> I wonder whether we should apply the approach of this patch (the code can
> >> be refined once it is well reviewed) or
> >> we should extend SCEV/IVOPTS?
> >
> > I don't think we can do anything in SCEV for this which means we'd
> > need to special-case this in niter analysis, in IVOPTs and any other
> > passes that might be affected (and not fixed by handling it in niter
> > analysis).  While improving niter analysis would be good (the user
> > could write this pattern as well) I do not have time to try
> > implementing that (I have no idea how ugly or robust it is going to be).
> >
> > So I think we should patch this up in the vectorizer itself like with
> > your patch.  I'm going to wait for Richards input though since he
> > seems to disagree.
> 
> I think my main disagreement is that the IV phi can be analysed
> as a SCEV with sufficient work (realising that the MIN result is
> always VF when the latch is executed).  That SCEV might be useful
> 'as is' for things like IVOPTS, without specific work in those passes.
> (Although perhaps not too useful, since most other IVs will be upcounting.)

I think we'd need another API for SCEV there then,
analyze_scalar_evolution_for_latch (), so we can disregard the
value on the exit edges.  That means we'd still need to touch
all users and decide whether it's safe to use that or not.

> I don't object though.  It just feels like we're giving up easily.
> And that's a bit frustrating, since this potential problem was flagged
> ahead of time.

Well, I expect that massaging SCEV and niter analysis will take
up quite some developer time while avoiding the situation in
the vectorizer is possible (and would fix the observed regressions).
We can always improve later here and I'd suggest to file an
enhancement bugreport with a simple C testcase using this kind of
iteration.

I'm just saying that to go forward the vectorizer change looks
more promising (also considering the pace RISC-V people are working at ...)

Richard.

> > Note with SELECT_VL all bets will be off since as I understand the
> > value it gives can vary from iteration to iteration (but we know
> > a lower and maybe an upper bound?)
> 
> Right.  All IVs will have a variable step for SELECT_VL.
> 
> Thanks,
> Richard
> 


Re: Re: [PATCH] VECT: Change flow of decrement IV

2023-05-31 Thread Richard Biener via Gcc-patches
On Wed, 31 May 2023, juzhe.zh...@rivai.ai wrote:

> Hi, Richard.
> 
> >> I don't object though.  It just feels like we're giving up easily.
> >> And that's a bit frustrating, since this potential problem was flagged
> >> ahead of time.
> 
> I can take a look at it. Would you mind giving me some hints?
> In which pass should I do this?  The "ivopts" pass?
> Is it right that we can enhance the analysis when we see a statement as
> follows:
> remain = remain - step, where step comes from a MIN_EXPR (remain, vf)?
> Then what do we need to do?

The key is that we have

 # iv = PHI <init, iv'>
 step = MIN (iv, invariant);
 iv' = iv - step;
 if (iv' != 0)
   continue;

in that case for the purpose of niter analysis we can ignore
the MIN expression and use 'invariant' as step (whether constant
or not).  Of course it only works for unsigned 'step'.

niter analysis uses simple_iv () but that necessarily fails here
so one idea would be to enhance simple_iv when passed a special
flag.  Another idea is to add to the set of patterns niter analysis
already has to compute niters resolving to popcount and friends
and pattern match the above (I'd start with that to see which other
passes / analyses besides niter analysis need improvement).

number_of_iterations_exit_assumptions does

  if (!simple_iv_with_niters (loop, loop_containing_stmt (stmt),
  op0, &iv0, safe ? &iv0_niters : NULL, 
false))
return number_of_iterations_bitcount (loop, exit, code, niter);

so I'd add that to the set of matchers in number_of_iterations_bitcount
(and maybe rename that to number_of_iterations_pattern).
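
For the record, a scalar testcase exhibiting the pattern could look like
this (hypothetical example for such an enhancement request):

  /* The loop decrements n by MIN (n, vf); niter analysis should
     ideally derive that it runs CEIL (n0 / vf) times for n0 > 0 and
     vf > 0, which it currently cannot.  */
  unsigned
  count_iters (unsigned n, unsigned vf)
  {
    unsigned iters = 0;
    while (n != 0)
      {
        unsigned step = n < vf ? n : vf;   /* MIN_EXPR */
        n -= step;
        iters++;
      }
    return iters;
  }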

Richard.

> Thanks.
> 
> 
> juzhe.zh...@rivai.ai
>  
> From: Richard Sandiford
> Date: 2023-05-31 15:28
> To: Richard Biener
> CC: juzhe.zhong@rivai.ai; gcc-patches; linkw
> Subject: Re: [PATCH] VECT: Change flow of decrement IV
> Richard Biener  writes:
> > On Wed, 31 May 2023, juzhe.zh...@rivai.ai wrote:
> >
> >> Hi, all.  I have posted several investigations:
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620101.html 
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620105.html 
> >> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620108.html 
> >> 
> >> It turns out that when "niters is a constant value and vf is a constant value",
> >> this patch can allow SCEV/IVOPTS to optimize a lot for RVV too (take the
> >> testcase from IBM's testsuite for example), and I think this patch can fix
> >> IBM's cunroll issue.
> >> Even though it will produce a 'mv' instruction in some other cases for
> >> RVV, I think Gain > Pain overall.
> >> 
> >> Actually, for current flow:
> >> 
> >> step = MIN ()
> >> ...
> >> remain = remain - step.
> >> 
> >> I don't know how difficult it is to extend SCEV/IVOPTS to fix this issue.
> >> So, could you make a decision for this patch?
> >> 
> >> I wonder whether we should apply the approach of this patch (the code can
> >> be refined once it is well reviewed) or
> >> we should extend SCEV/IVOPTS?
> >
> > I don't think we can do anything in SCEV for this which means we'd
> > need to special-case this in niter analysis, in IVOPTs and any other
> > passes that might be affected (and not fixed by handling it in niter
> > analysis).  While improving niter analysis would be good (the user
> > could write this pattern as well) I do not have time to try
> > implementing that (I have no idea how ugly or robust it is going to be).
> >
> > So I think we should patch this up in the vectorizer itself like with
> > your patch.  I'm going to wait for Richards input though since he
> > seems to disagree.
>  
> I think my main disagreement is that the IV phi can be analysed
> as a SCEV with sufficient work (realising that the MIN result is
> always VF when the latch is executed).  That SCEV might be useful
> 'as is' for things like IVOPTS, without specific work in those passes.
> (Although perhaps not too useful, since most other IVs will be upcounting.)
>  
> I don't object though.  It just feels like we're giving up easily.
> And that's a bit frustrating, since this potential problem was flagged
> ahead of time.
>  
> > Note with SELECT_VL all bets will be off since as I understand the
> > value it gives can vary from iteration to iteration (but we know
> > a lower and maybe an upper bound?)
>  
> Right.  All IVs will have a variable step for SELECT_VL.
>  
> Thanks,
> Richard
>  
> 



Re: Re: [PATCH] VECT: Change flow of decrement IV

2023-05-31 Thread Richard Biener via Gcc-patches
On Wed, 31 May 2023, juzhe.zh...@rivai.ai wrote:

> Thanks Richard.
> Seems that this patch's approach is OK for trunk?
> Maybe the only thing we should do is to wait for Kewen's testing feedback,
> am I right?

Can you repost the patch with Kewen's fix and state how you tested it?

Thanks,
Richard.


[PATCH] IPA PTA stats enhancement and non-details dump slimming

2023-05-31 Thread Richard Biener via Gcc-patches
The following keeps track of the number of edges we avoid creating
because they redundantly feed ESCAPED.  It also avoids printing the
per-node "Generating constraints" header when not dumping with -details.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

* tree-ssa-structalias.cc (constraint_stats::num_avoided_edges):
New.
(add_graph_edge): Count redundant edges we avoid to create.
(dump_sa_stats): Dump them.
(ipa_pta_execute): Do not dump generating constraints when
we are not dumping them.
---
 gcc/tree-ssa-structalias.cc | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/gcc/tree-ssa-structalias.cc b/gcc/tree-ssa-structalias.cc
index 546dab5035e..9ded34c1dd1 100644
--- a/gcc/tree-ssa-structalias.cc
+++ b/gcc/tree-ssa-structalias.cc
@@ -237,6 +237,7 @@ static struct constraint_stats
   unsigned int iterations;
   unsigned int num_edges;
   unsigned int num_implicit_edges;
+  unsigned int num_avoided_edges;
   unsigned int points_to_sets_created;
 } stats;
 
@@ -1213,7 +1214,10 @@ add_graph_edge (constraint_graph_t graph, unsigned int 
to,
   if (to < FIRST_REF_NODE
  && bitmap_bit_p (graph->succs[from], find (escaped_id))
  && bitmap_bit_p (get_varinfo (find (to))->solution, escaped_id))
-   return false;
+   {
+ stats.num_avoided_edges++;
+ return false;
+   }
 
   if (bitmap_set_bit (graph->succs[from], to))
{
@@ -7164,6 +7168,8 @@ dump_sa_stats (FILE *outfile)
   fprintf (outfile, "Number of edges:  %d\n", stats.num_edges);
   fprintf (outfile, "Number of implicit edges: %d\n",
   stats.num_implicit_edges);
+  fprintf (outfile, "Number of avoided edges: %d\n",
+  stats.num_avoided_edges);
 }
 
 /* Dump points-to information to OUTFILE.  */
@@ -8427,7 +8433,7 @@ ipa_pta_execute (void)
  || node->clone_of)
continue;
 
-  if (dump_file)
+  if (dump_file && (dump_flags & TDF_DETAILS))
{
  fprintf (dump_file,
   "Generating constraints for %s", node->dump_name ());
-- 
2.35.3


[PATCH] ipa/109983 - (IPA) PTA speedup

2023-05-31 Thread Richard Biener via Gcc-patches
This improves the edge avoidance heuristic by re-ordering the
topological sort of the graph to make sure the component with
the ESCAPED node is processed first.  This reduces the number
of created edges (which directly correlates with the number
of bitmap_ior_into calls) from 141447426 to 239596 and the
compile-time from 1083s to 3s.  It also improves the compile-time
for the related PR109143 from 81s to 27s.

I've modernized the topological sorting API on the way as well.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

PR ipa/109983
PR tree-optimization/109143
* tree-ssa-structalias.cc (struct topo_info): Remove.
(init_topo_info): Likewise.
(free_topo_info): Likewise.
(compute_topo_order): Simplify API, put the component
with ESCAPED last so it's processed first.
(topo_visit): Adjust.
(solve_graph): Likewise.
---
 gcc/tree-ssa-structalias.cc | 118 ++--
 1 file changed, 46 insertions(+), 72 deletions(-)

diff --git a/gcc/tree-ssa-structalias.cc b/gcc/tree-ssa-structalias.cc
index 9ded34c1dd1..8db99a42565 100644
--- a/gcc/tree-ssa-structalias.cc
+++ b/gcc/tree-ssa-structalias.cc
@@ -1585,65 +1585,6 @@ unify_nodes (constraint_graph_t graph, unsigned int to, 
unsigned int from,
 bitmap_clear_bit (graph->succs[to], to);
 }
 
-/* Information needed to compute the topological ordering of a graph.  */
-
-struct topo_info
-{
-  /* sbitmap of visited nodes.  */
-  sbitmap visited;
-  /* Array that stores the topological order of the graph, *in
- reverse*.  */
-  vec<unsigned> topo_order;
-};
-
-
-/* Initialize and return a topological info structure.  */
-
-static struct topo_info *
-init_topo_info (void)
-{
-  size_t size = graph->size;
-  struct topo_info *ti = XNEW (struct topo_info);
-  ti->visited = sbitmap_alloc (size);
-  bitmap_clear (ti->visited);
-  ti->topo_order.create (1);
-  return ti;
-}
-
-
-/* Free the topological sort info pointed to by TI.  */
-
-static void
-free_topo_info (struct topo_info *ti)
-{
-  sbitmap_free (ti->visited);
-  ti->topo_order.release ();
-  free (ti);
-}
-
-/* Visit the graph in topological order, and store the order in the
-   topo_info structure.  */
-
-static void
-topo_visit (constraint_graph_t graph, struct topo_info *ti,
-   unsigned int n)
-{
-  bitmap_iterator bi;
-  unsigned int j;
-
-  bitmap_set_bit (ti->visited, n);
-
-  if (graph->succs[n])
-EXECUTE_IF_SET_IN_BITMAP (graph->succs[n], 0, j, bi)
-  {
-   unsigned k = find (j);
-   if (!bitmap_bit_p (ti->visited, k))
- topo_visit (graph, ti, k);
-  }
-
-  ti->topo_order.safe_push (n);
-}
-
 /* Add a copy edge FROM -> TO, optimizing special cases.  Returns TRUE
if the solution of TO changed.  */
 
@@ -1925,19 +1866,56 @@ find_indirect_cycles (constraint_graph_t graph)
   scc_visit (graph, &si, i);
 }
 
-/* Compute a topological ordering for GRAPH, and store the result in the
-   topo_info structure TI.  */
+/* Visit the graph in topological order starting at node N, and store the
+   order in TOPO_ORDER using VISITED to indicate visited nodes.  */
 
 static void
-compute_topo_order (constraint_graph_t graph,
-   struct topo_info *ti)
+topo_visit (constraint_graph_t graph, vec<unsigned> &topo_order,
+   sbitmap visited, unsigned int n)
+{
+  bitmap_iterator bi;
+  unsigned int j;
+
+  bitmap_set_bit (visited, n);
+
+  if (graph->succs[n])
+EXECUTE_IF_SET_IN_BITMAP (graph->succs[n], 0, j, bi)
+  {
+   unsigned k = find (j);
+   if (!bitmap_bit_p (visited, k))
+ topo_visit (graph, topo_order, visited, k);
+  }
+
+  topo_order.quick_push (n);
+}
+
+/* Compute a topological ordering for GRAPH, and return the result.  */
+
+static auto_vec<unsigned>
+compute_topo_order (constraint_graph_t graph)
 {
   unsigned int i;
   unsigned int size = graph->size;
 
+  auto_sbitmap visited (size);
+  bitmap_clear (visited);
+
+  /* For the heuristic in add_graph_edge to work optimally make sure to
+ first visit the connected component of the graph containing
+ ESCAPED.  Do this by extracting the connected component
+ with ESCAPED and append that to all other components as solve_graph
+ pops from the order.  */
+  auto_vec<unsigned> tail (size);
+  topo_visit (graph, tail, visited, find (escaped_id));
+
+  auto_vec<unsigned> topo_order (size);
+
   for (i = 0; i != size; ++i)
-if (!bitmap_bit_p (ti->visited, i) && find (i) == i)
-  topo_visit (graph, ti, i);
+if (!bitmap_bit_p (visited, i) && find (i) == i)
+  topo_visit (graph, topo_order, visited, i);
+
+  topo_order.splice (tail);
+  return topo_order;
 }
 
 /* Structure used to for hash value numbering of pointer equivalence
@@ -2765,17 +2743,14 @@ solve_graph (constraint_graph_t graph)
   while (!bitmap_empty_p (changed))
 {
   unsigned int i;
-  struct topo_info *ti = init_topo_info ();
   stats.iterations++;
 
   bitmap_obstack_initialize (&iteration_obstack
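
For the record, a minimal stand-alone sketch (using std::vector in place of
GCC's vec<>, with made-up node ids) of why splicing ESCAPED's component onto
the end of the reverse-topological vector makes it the first thing processed:

#include <cstdio>
#include <vector>

int main ()
{
  /* Reverse topological order of all other components.  */
  std::vector<int> topo_order = {3, 2, 1};
  /* Reverse topological order of ESCAPED's component (made-up ids).  */
  std::vector<int> tail = {9, 8};

  /* The equivalent of topo_order.splice (tail).  */
  topo_order.insert (topo_order.end (), tail.begin (), tail.end ());

  /* solve_graph pops from the back, so ESCAPED's component (8, then 9)
     is processed before everything else.  */
  while (!topo_order.empty ())
    {
      std::printf ("%d\n", topo_order.back ());
      topo_order.pop_back ();
    }
  return 0;
}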

Re: [PATCH 1/2] ipa-cp: Avoid long linear searches through DECL_ARGUMENTS

2023-06-02 Thread Richard Biener via Gcc-patches
On Wed, May 31, 2023 at 6:08 PM Martin Jambor  wrote:
>
> Hello,
>
> On Wed, May 31 2023, Richard Biener wrote:
> > On Tue, May 30, 2023 at 4:21 PM Jan Hubicka  wrote:
> >>
> >> > On Mon, May 29, 2023 at 6:20 PM Martin Jambor  wrote:
> >> > >
> >> > > Hi,
> >> > >
> >> > > there have been concerns that linear searches through DECL_ARGUMENTS
> >> > > that are often necessary to compute the index of a particular
> >> > > PARM_DECL which is the key to results of IPA-CP can happen often
> >> > > enough to be a compile time issue, especially if we plug the results
> >> > > into value numbering, as I intend to do with a follow-up patch.
> >> > >
> >> > > This patch creates a hash map to do the look-up for all functions
> >> > > which have some information discovered by IPA-CP and which have 32
> >> > > parameters or more.  32 is a hard-wired magical constant here to
> >> > > capture the trade-off between the memory allocation overhead and
> >> > > length of the linear search.  I do not think it is worth making it a
> >> > > --param but if people think it appropriate, I can turn it into one.
> >> >
> >> > Since ipcp_transformation is short-lived (is it?) is it worth the 
> >> > trouble?
> >> > Comments below ...
> >>
> >> It lives from ipa-cp time to WPA stream-out or IPA transform stage,
> >> so memory consumption is a concern with -flto.
>
> It lives longer, until the function is finished, it holds the
> information we want to use during PRE, after all (and Honza also already
> added queries to it to tree-ssa-ccp.cc though those probably could be
> avoided).
>
> The proposed mapping for long chains would only be created in the
> transformation IPA-CP hook, so would only live in LTRANS and only
> throughout the compilation of a single function.  (But I am adding a
> pointer to the transformation summary of all.)
>
> >> > > +  m_tree_to_idx = hash_map<tree, unsigned>::create_ggc (c);
> >> > > +  unsigned index = 0;
> >> > > +  for (tree p = DECL_ARGUMENTS (fndecl); p; p = DECL_CHAIN (p), 
> >> > > index++)
> >> > > +m_tree_to_idx->put (p, index);
> >> >
> >> > I think allocating the hash-map with 'c' for some numbers (depending
> >> > on the "prime"
> >> > chosen) will necessarily cause re-allocation of the hash since we keep a 
> >> > load
> >> > factor of at most 3/4 upon insertion.
>
> Oh, right.
>
> >> >
> >> > But - I wonder if a UID sorted array isn't a very much better data
> >> > structure for this?
> >> > That is, a vec<std::pair<unsigned, unsigned>>?
> >>
> >> Yeah, I was thinking along this lines too.
> >> Having field directly in PARM_DECL node would be probably prettiest.
> >> In general this is probably not that important as wast amount of time we
> >> have few parameters and linear lookup is just fine.
> >
> > There are 6 bits of DECL_OFFSET_ALIGN that could be re-purposed, but
> > 64 parameters is a bit low.  _Maybe_ PARM_DECL doesn't need any of
> > the tree_base bits so could use the full word for sth else as well ...
> >
> > I also though it might be interesting to only record PARM_DECLs that
> > we have interesting info for and skip VARYING ones.  So with an
> > indirection DECL_OFFSET_ALIGN -> index to non-varying param or
> > -1 the encoding space could shrink.
> >
> > But still using a vec<> looks like a straight-forward improvement here.
>
> Yeah, 64 parameters seems too tight.  I guess a testcase in which we
> would record information for that many parameters would be quite
> artificial, but I can imagine something like that in machine generated
> code.
>
> Below is the patch based on DECL_UIDs in a vector.  The problem with
> std::pair is that it is not GC-friendly and the transformation summary
> unfortunately needs to live in GC.  So I added a simple GTY marked
> structure.
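
As a side note, the UID-sorted lookup boils down to a binary search; a
minimal sketch with a hypothetical element type (the patch's real type is
ipa_uid_to_idx_map_elt, which additionally has to live in GC memory):

#include <algorithm>
#include <vector>

struct uid_to_idx  /* hypothetical stand-in for ipa_uid_to_idx_map_elt */
{
  unsigned uid;    /* DECL_UID of the PARM_DECL */
  unsigned index;  /* its position in DECL_ARGUMENTS */
};

/* Return the parameter index for UID, or -1 if not recorded.
   MAP must be sorted by uid.  */
static int
lookup_param_index (const std::vector<uid_to_idx> &map, unsigned uid)
{
  auto it = std::lower_bound (map.begin (), map.end (), uid,
			      [] (const uid_to_idx &e, unsigned u)
			      { return e.uid < u; });
  if (it != map.end () && it->uid == uid)
    return (int) it->index;
  return -1;
}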
>
> Bootstrapped, tested and (together with the subsequent patch) LTO
> bootstrapped on an x86_64-linux, as is and with lower threshold to
> create the mapping.  OK for master now?

LGTM now.

Thanks,
Richard.

> Thanks,
>
> Martin
>
>
> Subject: [PATCH 1/2] ipa-cp: Avoid long linear searches through DECL_ARGUMENTS
>
> There have been concerns that linear searches through DECL_ARGUMENTS
> that are often necessary to compute the index of a particular
> PARM_DECL which is the key to results of IPA-CP can happen often
> enough to be a compile time issue, especially if we plug the results
> into value numbering, as I intend to do with a follow-up patch.
>
> This patch creates a vector sorted according to PARM_DECLs to do the look-up
> for all functions which have some information discovered by IPA-CP and which
> have 32 parameters or more.  32 is a hard-wired magical constant here to
> capture the trade-off between the memory allocation overhead and length of the
> linear search.  I do not think it is worth making it a --param but if people
> think it appropriate, I can turn it into one.
>
> gcc/ChangeLog:
>
> 2023-05-31  Martin Jambor  
>
> * ipa-prop.h (ipa_uid_to_idx_map_elt): New type.
> (struct ipcp_transformation): Rearrange members accord

Re: [PATCH] Don't try bswap + rotate when TYPE_PRECISION(n->type) > n->range.

2023-06-02 Thread Richard Biener via Gcc-patches
On Thu, Jun 1, 2023 at 9:51 AM liuhongt via Gcc-patches
 wrote:
>
> For the testcase in the PR, we have
>
>   br64 = br;
>   br64 = ((br64 << 16) & 0x00ffull) | (br64 & 0xff00ull);
>
>   n->n: 0x300200.
>   n->range: 32.
>   n->type: uint64.
>
> The original code assumes n->range is the same as TYPE_PRECISION (n->type),
> and tries to rotate the mask from 0x30200 -> 0x20300 which is
> incorrect. The patch fixed this bug by not trying bswap + rotate when
> TYPE_PRECISION(n->type) is not equal to n->range.
>
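To make the failure mode concrete, a small self-contained illustration
(mine, not from the patch): when only the low 32 bits of a uint64_t carry
the value, a rotate of the 64-bit register is not a rotate of the 32-bit
payload, so rewriting the symbolic number as a rotated mask is unsound:

#include <cassert>
#include <cstdint>

int main ()
{
  uint64_t v = 0x11223344;                 /* 32-bit payload, upper half zero */
  uint64_t rot64 = (v << 16) | (v >> 48);  /* 64-bit rotate left by 16 */
  uint32_t rot32 = ((uint32_t) v << 16) | ((uint32_t) v >> 16); /* 32-bit rotate */
  assert (rot64 == 0x0000112233440000ULL); /* bytes leak into the upper half */
  assert (rot32 == 0x33441122);            /* the genuinely rotated payload */
  assert ((uint32_t) rot64 != rot32);      /* the two do not correspond */
  return 0;
}
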
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?

OK.

> gcc/ChangeLog:
>
> PR tree-optimization/110067
> * gimple-ssa-store-merging.cc (find_bswap_or_nop): Don't try
> bswap + rotate when TYPE_PRECISION(n->type) > n->range.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr110067.c: New test.
> ---
>  gcc/gimple-ssa-store-merging.cc  |  3 +
>  gcc/testsuite/gcc.target/i386/pr110067.c | 77 
>  2 files changed, 80 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr110067.c
>
> diff --git a/gcc/gimple-ssa-store-merging.cc b/gcc/gimple-ssa-store-merging.cc
> index 9cb574fa315..401496a9231 100644
> --- a/gcc/gimple-ssa-store-merging.cc
> +++ b/gcc/gimple-ssa-store-merging.cc
> @@ -1029,6 +1029,9 @@ find_bswap_or_nop (gimple *stmt, struct symbolic_number 
> *n, bool *bswap,
>/* TODO, handle cast64_to_32 and big/litte_endian memory
>  source when rsize < range.  */
>if (n->range == orig_range
> + /* There're cases like 0x30200 for uint32->uint64 cast,
> +    don't handle this.  */
> + && n->range == TYPE_PRECISION (n->type)
>   && ((orig_range == 32
>&& optab_handler (rotl_optab, SImode) != CODE_FOR_nothing)
>   || (orig_range == 64
> diff --git a/gcc/testsuite/gcc.target/i386/pr110067.c 
> b/gcc/testsuite/gcc.target/i386/pr110067.c
> new file mode 100644
> index 000..c4208811628
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr110067.c
> @@ -0,0 +1,77 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -fno-strict-aliasing" } */
> +
> +#include <stdint.h>
> +#define force_inline __inline__ __attribute__ ((__always_inline__))
> +
> +__attribute__((noipa))
> +static void
> +fetch_pixel_no_alpha_32_bug (void *out)
> +{
> +  uint32_t *ret = out;
> +  *ret = 0xff499baf;
> +}
> +
> +static force_inline uint32_t
> +bilinear_interpolation_local (uint32_t tl, uint32_t tr,
> + uint32_t bl, uint32_t br,
> + int distx, int disty)
> +{
> +  uint64_t distxy, distxiy, distixy, distixiy;
> +  uint64_t tl64, tr64, bl64, br64;
> +  uint64_t f, r;
> +
> +  distx <<= 1;
> +  disty <<= 1;
> +
> +  distxy = distx * disty;
> +  distxiy = distx * (256 - disty);
> +  distixy = (256 - distx) * disty;
> +  distixiy = (256 - distx) * (256 - disty);
> +
> +  /* Alpha and Blue */
> +  tl64 = tl & 0xffff;
> +  tr64 = tr & 0xffff;
> +  bl64 = bl & 0xffff;
> +  br64 = br & 0xffff;
> +
> +  f = tl64 * distixiy + tr64 * distxiy + bl64 * distixy + br64 * distxy;
> +  r = f & 0xffffull;
> +
> +  /* Red and Green */
> +  tl64 = tl;
> +  tl64 = ((tl64 << 16) & 0x00ffull) | (tl64 & 0xff00ull);
> +
> +  tr64 = tr;
> +  tr64 = ((tr64 << 16) & 0x00ffull) | (tr64 & 0xff00ull);
> +
> +  bl64 = bl;
> +  bl64 = ((bl64 << 16) & 0x00ffull) | (bl64 & 0xff00ull);
> +
> +  br64 = br;
> +  br64 = ((br64 << 16) & 0x00ffull) | (br64 & 0xff00ull);
> +
> +  f = tl64 * distixiy + tr64 * distxiy + bl64 * distixy + br64 * distxy;
> +  r |= ((f >> 16) & 0x00ffull) | (f & 0xff00ull);
> +
> +  return (uint32_t)(r >> 16);
> +}
> +
> +__attribute__((noipa))
> +static void
> +bits_image_fetch_pixel_bilinear_32_bug (void *out)
> +{
> +  uint32_t br;
> +  uint32_t *ret = out;
> +
> +  fetch_pixel_no_alpha_32_bug (&br);
> +  *ret = bilinear_interpolation_local (0, 0, 0, br, 0x41, 0x42);
> +}
> +
> +int main() {
> +  uint32_t r;
> +  bits_image_fetch_pixel_bilinear_32_bug (&r);
> +  if (r != 0x4213282d)
> +__builtin_abort ();
> +  return 0;
> +}
> --
> 2.39.1.388.g2fc9e9ca3c
>


Re: [PATCH V3] VECT: Change flow of decrement IV

2023-06-02 Thread Richard Biener via Gcc-patches
On Thu, 1 Jun 2023, juzhe.zh...@rivai.ai wrote:

> This patch is no different from V2.
> It just adds PR tree-optimization/109971, as Kewen suggested.
> 
> Already bootstrapped and regression-tested on X86 with no differences.
> 
> Ok for trunk ?

OK.

Richard.

> 
> juzhe.zh...@rivai.ai
>  
> From: juzhe.zhong
> Date: 2023-06-01 12:36
> To: gcc-patches
> CC: richard.sandiford; rguenther; linkw; Ju-Zhe Zhong
> Subject: [PATCH V3] VECT: Change flow of decrement IV
> From: Ju-Zhe Zhong 
>  
> Following Richi's suggestion, I change the current decrement IV flow from:
>  
> do {
>remain -= MIN (vf, remain);
> } while (remain != 0);
>  
> into:
>  
> do {
>old_remain = remain;
>len = MIN (vf, remain);
>remain -= vf;
> } while (old_remain >= vf);
>  
> to enhance SCEV.
>  
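To spell out the SCEV angle (my reading of the change, not part of the
original message): in the old flow the IV step was MIN (vf, remain), which
is not loop-invariant, so SCEV could not describe the IV as a simple affine
chrec; in the new flow the IV steps by the constant vf, i.e. it has the
evolution {n, -, vf}, and only the separately computed len saturates.
A scalar sketch:

#include <cstddef>

void
process (const int *a, std::size_t n, std::size_t vf)
{
  std::size_t remain = n, old_remain;
  do
    {
      old_remain = remain;
      std::size_t len = remain < vf ? remain : vf; /* len = MIN (vf, remain) */
      for (std::size_t i = 0; i < len; ++i)
	(void) a[i];				   /* stand-in for the vector body */
      a += len;
      remain -= vf;	/* the IV now steps by the invariant vf */
    }
  while (old_remain >= vf);
}
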
> Include fixes from kewen.
>  
>  
> This patch will need to wait for Kewen's test feedback.
>  
> Testing on X86 is on-going
>  
> Co-Authored by: Kewen Lin  
>  
>   PR tree-optimization/109971
>  
> gcc/ChangeLog:
>  
> * tree-vect-loop-manip.cc (vect_set_loop_controls_directly): Change 
> decrement IV flow.
> (vect_set_loop_condition_partial_vectors): Ditto.
>  
> ---
> gcc/tree-vect-loop-manip.cc | 36 +---
> 1 file changed, 25 insertions(+), 11 deletions(-)
>  
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index acf3642ceb2..3f735945e67 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -483,7 +483,7 @@ vect_set_loop_controls_directly (class loop *loop, 
> loop_vec_info loop_vinfo,
> gimple_stmt_iterator loop_cond_gsi,
> rgroup_controls *rgc, tree niters,
> tree niters_skip, bool might_wrap_p,
> - tree *iv_step)
> + tree *iv_step, tree *compare_step)
> {
>tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
>tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
> @@ -538,9 +538,9 @@ vect_set_loop_controls_directly (class loop *loop, 
> loop_vec_info loop_vinfo,
>...
>vect__4.8_28 = .LEN_LOAD (_17, 32B, _36, 0);
>...
> -ivtmp_35 = ivtmp_9 - _36;
> +ivtmp_35 = ivtmp_9 - POLY_INT_CST [4, 4];
>...
> -if (ivtmp_35 != 0)
> +if (ivtmp_9 > POLY_INT_CST [4, 4])
>  goto ; [83.33%]
>else
>  goto ; [16.67%]
> @@ -549,13 +549,15 @@ vect_set_loop_controls_directly (class loop *loop, 
> loop_vec_info loop_vinfo,
>tree step = rgc->controls.length () == 1 ? rgc->controls[0]
>: make_ssa_name (iv_type);
>/* Create decrement IV.  */
> -  create_iv (nitems_total, MINUS_EXPR, step, NULL_TREE, loop, &incr_gsi,
> - insert_after, &index_before_incr, &index_after_incr);
> +  create_iv (nitems_total, MINUS_EXPR, nitems_step, NULL_TREE, loop,
> + &incr_gsi, insert_after, &index_before_incr,
> + &index_after_incr);
>gimple_seq_add_stmt (header_seq, gimple_build_assign (step, MIN_EXPR,
> index_before_incr,
> nitems_step));
>*iv_step = step;
> -  return index_after_incr;
> +  *compare_step = nitems_step;
> +  return index_before_incr;
>  }
>/* Create increment IV.  */
> @@ -825,6 +827,7 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
>   arbitrarily pick the last.  */
>tree test_ctrl = NULL_TREE;
>tree iv_step = NULL_TREE;
> +  tree compare_step = NULL_TREE;
>rgroup_controls *rgc;
>rgroup_controls *iv_rgc = nullptr;
>unsigned int i;
> @@ -861,7 +864,7 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
> &preheader_seq, &header_seq,
> loop_cond_gsi, rgc, niters,
> niters_skip, might_wrap_p,
> - &iv_step);
> + &iv_step, &compare_step);
> iv_rgc = rgc;
>   }
> @@ -884,10 +887,21 @@ vect_set_loop_condition_partial_vectors (class loop 
> *loop,
>/* Get a boolean result that tells us whether to iterate.  */
>edge exit_edge = single_exit (loop);
> -  tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? EQ_EXPR : NE_EXPR;
> -  tree zero_ctrl = build_zero_cst (TREE_TYPE (test_ctrl));
> -  gcond *cond_stmt = gimple_build_cond (code, test_ctrl, zero_ctrl,
> - NULL_TREE, NULL_TREE);
> +  gcond *cond_stmt;
> +  if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
> +{
> +  gcc_assert (compare_step);
> +  tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? LE_EXPR : 
> GT_EXPR;
> +  cond_stmt = gimple_build_cond (code, test_ctrl, compare_step, 
> NULL_TREE,
> +  NULL_TREE);
> +}
> +  else
> +{
> +  tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? EQ_EXPR : 
> NE_EXPR;
> +  tree zero_ctrl = build_zero_cst (TREE_TYPE (test_ctrl));
> +  cond_stmt
> + = gimple_build_cond (code, test_ctrl, zero_ctrl, NULL_TREE, NULL_TREE);
> +}
>gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
>/* The loop iterates (NITERS - 1) / VF + 1 times.
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG 

Re: [PATCH] Optimized "(X - N * M) / N + M" to "X / N" if valid

2023-06-02 Thread Richard Biener via Gcc-patches
On Thu, 1 Jun 2023, Jiufu Guo wrote:

> Hi,
> 
> Jiufu Guo via Gcc-patches  writes:
> 
> > Hi,
> >
> > Richard Biener  writes:
> >
> >> On Wed, 17 May 2023, Jiufu Guo wrote:
> >>
> >>> Hi,
> >>> 
> >>> This patch tries to optimize "(X - N * M) / N + M" to "X / N".
> >>
> >> But if that's valid why not make the transform simpler and transform
> >> (X - N * M) / N  to X / N - M instead?
> >
> > Great catch!
> > If "N * M" is not constant, "X / N - M" would be better than
> > "(X - N * M) / N".  If "N, M" are constants, "(X - N * M) / N" and
> > "X / N - M" may be similar; while for this case, "X / N - M" should
> > also be fine!  I would try to update accordingly. 
> >
> >>
> >> You use the same optimize_x_minus_NM_div_N_plus_M validator for
> >> the division and shift variants but the overflow rules are different,
> >> so I'm not sure that's warranted.  I'd also prefer to not split out
> >> the validator to a different file - iff then the appropriate file
> >> is fold-const.cc, not gimple-match-head.cc (I see we're a bit
> >> inconsistent here, for pure gimple matches gimple-fold.cc would
> >> be another place).
> >
> > Thanks for pointing out this!
> > For shift,  I guess you may concern that: 1. if the right operand is
> > negative or is greater than or equal to the type width.  2. if it is
> > a signed negative value.  They may UB or 'sign bit shift'?  This patch
> > assumes it is ok to do the transform.  I may have more check to see
> > if this is really ok, and hope some one can point out if this is
> > invalid. "(X - N * M) >> log2(N)" ==> " X >> log2(N) - M".
> >
> > I split out the validator just because: it is shared for division and
> > shift :).  And it seems gimple-match-head.cc and generic-match-head.cc,
> > may be introduced for match.pd.  So, I put it into gimple-match-head.cc.
> >
> >>
> >> Since you use range information why is the transform restricted
> >> to constant M?
> >
> > If M is a variable, the range for "X" is varying_p. I did not find
> > the method to get the bounds for "X" (or for "X - N * M") to check no
> > wraps.  Any suggestions?
> 
> Oh, I may misunderstand here.
> You may say: M could have a range too, then we can check if
> "X - N * M" has a valid range or possible wrap/overflow. 

Yes.

Richard.

> BR,
> Jeff (Jiufu Guo)
> 
> >
> >
> > Again, thanks for your great help!
> >
> > BR,
> > Jeff (Jiufu Guo)
> >
> >>
> >> Richard.
> >>
> >>> As per the discussions in PR108757, we know this transformation is valid
> >>> only under some conditions.
> >>> For C code, "/" truncates towards zero (trunc_div), and "X - N * M"
> >>> may wrap/overflow/underflow. So, it is valid only if "X - N * M" does
> >>> not cross zero and does not wrap/overflow/underflow.
> >>> 
> >>> This patch also handles the case when "N" is a power of 2, where
> >>> "(X - N * M) / N" is "(X - N * M) >> log2(N)".
> >>> 
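To make the zero-crossing condition concrete, a small counterexample (mine,
not from the patch): with X = 1, N = 2, M = 1, truncating division gives
(1 - 2) / 2 + 1 == 1, while X / N == 0, so the transform would be wrong there:

#include <cassert>

int main ()
{
  int X = 1, N = 2, M = 1;
  assert ((X - N * M) / N + M == 1);  /* (-1)/2 truncates to 0, then +1 */
  assert (X / N == 0);                /* the "simplified" form differs */
  return 0;
}
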
> >>> Bootstrap & regtest pass on ppc64{,le} and x86_64.
> >>> Is this ok for trunk?
> >>> 
> >>> BR,
> >>> Jeff (Jiufu)
> >>> 
> >>>   PR tree-optimization/108757
> >>> 
> >>> gcc/ChangeLog:
> >>> 
> >>>   * gimple-match-head.cc (optimize_x_minus_NM_div_N_plus_M): New function.
> >>>   * match.pd ((X - N * M) / N + M): New pattern.
> >>> 
> >>> gcc/testsuite/ChangeLog:
> >>> 
> >>>   * gcc.dg/pr108757-1.c: New test.
> >>>   * gcc.dg/pr108757-2.c: New test.
> >>>   * gcc.dg/pr108757.h: New test.
> >>> 
> >>> ---
> >>>  gcc/gimple-match-head.cc  |  54 ++
> >>>  gcc/match.pd  |  22 
> >>>  gcc/testsuite/gcc.dg/pr108757-1.c |  17 
> >>>  gcc/testsuite/gcc.dg/pr108757-2.c |  18 
> >>>  gcc/testsuite/gcc.dg/pr108757.h   | 160 ++
> >>>  5 files changed, 271 insertions(+)
> >>>  create mode 100644 gcc/testsuite/gcc.dg/pr108757-1.c
> >>>  create mode 100644 gcc/testsuite/gcc.dg/pr108757-2.c
> >>>  create mode 100644 gcc/testsuite/gcc.dg/pr108757.h
> >>> 
> >>> diff --git a/gcc/gimple-match-head.cc b/gcc/gimple-match-head.cc
> >>> index b08cd891a13..680a4cb2fc6 100644
> >>> --- a/gcc/gimple-match-head.cc
> >>> +++ b/gcc/gimple-match-head.cc
> >>> @@ -224,3 +224,57 @@ optimize_successive_divisions_p (tree divisor, tree 
> >>> inner_div)
> >>>  }
> >>>return true;
> >>>  }
> >>> +
> >>> +/* Return true if "(X - N * M) / N + M" can be optimized into "X / N".
> >>> +   Otherwise return false.
> >>> +
> >>> +   For unsigned,
> >>> +   If sign bit of M is 0 (clz is 0), valid range is [N*M, MAX].
> >>> +   If sign bit of M is 1, valid range is [0, MAX - N*(-M)].
> >>> +
> >>> +   For signed,
> >>> +   If N*M > 0, valid range: [MIN+N*M, 0] + [N*M, MAX]
> >>> +   If N*M < 0, valid range: [MIN, -(-N*M)] + [0, MAX - (-N*M)].  */
> >>> +
> >>> +static bool
> >>> +optimize_x_minus_NM_div_N_plus_M (tree x, wide_int n, wide_int m, tree 
> >>> type)
> >>> +{
> >>> +  wide_int max = wi::max_value (type);
> >>> +  signop sgn = TYPE_SIGN (type);
> >>> +  wide_int nm;
> >>> +  wi::overflow_type ovf;
> >>> +  if (TYPE_UNSIGNED (type) && wi::clz (

Re: [PATCH 2/2] ipa-cp: Feed results of IPA-CP into value numbering

2023-06-02 Thread Richard Biener via Gcc-patches
On Mon, 29 May 2023, Martin Jambor wrote:

> Hi,
> 
> PRs 68930 and 92497 show that when IPA-CP figures out constants in
> aggregate parameters or when passed by reference but the loads happen
> in an inlined function the information is lost.  This happens even
> when the inlined function itself was known to have - or even cloned to
> have - such constants in incoming parameters because the transform
> phase of IPA passes is not run on them.  See discussion in the bugs
> for reasons why.
> 
> Honza suggested that we can plug the results of IPA-CP analysis into
> value numbering, so that FRE can figure out that some loads fetch
> known constants.  This is what this patch does.
> 
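A minimal sketch of the scenario (my own reduction; the real tests are
pr92497-1.c and pr92497-2.c below, and the exact inlining and cloning
decisions are of course up to the pass pipeline):

struct S { int a; };

static int
load (struct S *p)	/* gets inlined into user () */
{
  return p->a;
}

static int __attribute__ ((noinline))
user (struct S *p)	/* IPA-CP knows p->a == 7 for all callers, but the
			   load it could fold was inlined from load ()  */
{
  return load (p) + 1;
}

int
main (void)
{
  struct S s = { 7 };
  return user (&s) == 8 ? 0 : 1;
}

With the IPA-CP results fed into value numbering, FRE in user () can now
replace the load with the constant 7.
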
> This version of the patch uses the new way we represent aggregate
> constants discovered by IPA-CP and so avoids a linear scan to find them.
> Similarly, it depends on the previous patch which avoids potentially
> slow linear look ups of indices of PARM_DECLs when there are many of
> them.
> 
> Bootstrapped, LTO-bootstrapped and LTO-profiled-bootstrapped and tested
> on x86_64-linux.  OK for trunk?
> 
> Thanks,
> 
> Martin
> 
> 
> gcc/ChangeLog:
> 
> 2023-05-26  Martin Jambor  
> 
>   PR ipa/68930
>   PR ipa/92497
>   * ipa-prop.h (ipcp_get_aggregate_const): Declare.
>   * ipa-prop.cc (ipcp_get_aggregate_const): New function.
>   (ipcp_transform_function): Do not deallocate transformation info.
>   * tree-ssa-sccvn.cc: Include alloc-pool.h, symbol-summary.h and
>   ipa-prop.h.
>   (vn_reference_lookup_2): When hitting default-def vuse, query
>   IPA-CP transformation info for any known constants.
> 
> gcc/testsuite/ChangeLog:
> 
> 2022-09-05  Martin Jambor  
> 
>   PR ipa/68930
>   PR ipa/92497
>   * gcc.dg/ipa/pr92497-1.c: New test.
>   * gcc.dg/ipa/pr92497-2.c: Likewise.
> ---
>  gcc/ipa-prop.cc  | 33 +
>  gcc/ipa-prop.h   |  3 +++
>  gcc/testsuite/gcc.dg/ipa/pr92497-1.c | 26 
>  gcc/testsuite/gcc.dg/ipa/pr92497-2.c | 26 
>  gcc/tree-ssa-sccvn.cc| 36 +++-
>  5 files changed, 118 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/ipa/pr92497-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/ipa/pr92497-2.c
> 
> diff --git a/gcc/ipa-prop.cc b/gcc/ipa-prop.cc
> index f0976e363f7..fb2c0c0466b 100644
> --- a/gcc/ipa-prop.cc
> +++ b/gcc/ipa-prop.cc
> @@ -5765,6 +5765,34 @@ ipcp_modif_dom_walker::before_dom_children 
> (basic_block bb)
>return NULL;
>  }
>  
> +/* If IPA-CP discovered a constant in parameter PARM at OFFSET of a given 
> SIZE
> +   - whether passed by reference or not is given by BY_REF - return that
> +   constant.  Otherwise return NULL_TREE.  */
> +
> +tree
> +ipcp_get_aggregate_const (struct function *func, tree parm, bool by_ref,
> +   HOST_WIDE_INT bit_offset, HOST_WIDE_INT bit_size)
> +{
> +  cgraph_node *node = cgraph_node::get (func->decl);
> +  ipcp_transformation *ts = ipcp_get_transformation_summary (node);
> +
> +  if (!ts || !ts->m_agg_values)
> +return NULL_TREE;
> +
> +  int index = ts->get_param_index (func->decl, parm);
> +  if (index < 0)
> +return NULL_TREE;
> +
> +  ipa_argagg_value_list avl (ts);
> +  unsigned unit_offset = bit_offset / BITS_PER_UNIT;
> +  tree v = avl.get_value (index, unit_offset, by_ref);
> +  if (!v
> +  || maybe_ne (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (v))), bit_size))
> +return NULL_TREE;
> +
> +  return v;
> +}
> +
>  /* Return true if we have recorded VALUE and MASK about PARM.
> Set VALUE and MASk accordingly.  */
>  
> @@ -6037,11 +6065,6 @@ ipcp_transform_function (struct cgraph_node *node)
>  free_ipa_bb_info (bi);
>fbi.bb_infos.release ();
>  
> -  ipcp_transformation *s = ipcp_transformation_sum->get (node);
> -  s->m_agg_values = NULL;
> -  s->bits = NULL;
> -  s->m_vr = NULL;
> -
>vec_free (descriptors);
>if (cfg_changed)
>  delete_unreachable_blocks_update_callgraph (node, false);
> diff --git a/gcc/ipa-prop.h b/gcc/ipa-prop.h
> index 211b12ff6b3..f68fa4a12dd 100644
> --- a/gcc/ipa-prop.h
> +++ b/gcc/ipa-prop.h
> @@ -1221,6 +1221,9 @@ void ipa_dump_param (FILE *, class ipa_node_params 
> *info, int i);
>  void ipa_release_body_info (struct ipa_func_body_info *);
>  tree ipa_get_callee_param_type (struct cgraph_edge *e, int i);
>  bool ipcp_get_parm_bits (tree, tree *, widest_int *);
> +tree ipcp_get_aggregate_const (struct function *func, tree parm, bool by_ref,
> +HOST_WIDE_INT bit_offset,
> +HOST_WIDE_INT bit_size);
>  bool unadjusted_ptr_and_unit_offset (tree op, tree *ret,
>poly_int64 *offset_ret);
>  
> diff --git a/gcc/testsuite/gcc.dg/ipa/pr92497-1.c 
> b/gcc/testsuite/gcc.dg/ipa/pr92497-1.c
> new file mode 100644
> index 000..eb8f2e75fd0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/ipa

Re: [PATCH] inline: improve internal function costs

2023-06-02 Thread Richard Biener via Gcc-patches
On Thu, 1 Jun 2023, Andre Vieira (lists) wrote:

> Hi,
> 
> This is a follow-up of the internal function patch to add widening and
> narrowing patterns.  This patch improves the inliner cost estimation for
> internal functions.

I have no idea why calls are special in IPA analyze_function_body
and so I cannot say whether treating all internal fn calls as
non-calls is correct there.  Honza?

The tree-inline.cc change is OK though (you can push that separately).

Thanks,
Richard.

> Bootstrapped and regression tested on aarch64-unknown-linux-gnu.
> 
> gcc/ChangeLog:
> 
> * ipa-fnsummary.cc (analyze_function_body): Correctly handle
> non-zero costed internal functions.
> * tree-inline.cc (estimate_num_insns): Improve costing for internal
> functions.
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)


Re: [PATCH RFA] c++: make initializer_list array static again [PR110070]

2023-06-02 Thread Richard Biener via Gcc-patches
On Fri, Jun 2, 2023 at 3:32 AM Jason Merrill via Gcc-patches
 wrote:
>
> I ended up deciding not to apply the DECL_NOT_OBSERVABLE patch that you
> approved in stage 3 because I didn't feel like it was fully baked; I'm happy
> with this version now, which seems like a more broadly useful flag.
>
> Tested x86_64-pc-linux-gnu.  OK for trunk?

OK.

Richard.

> -- 8< --
>
> After the maybe_init_list_as_* patches, I noticed that we were putting the
> array of strings into .rodata, but then memcpying it into an automatic
> array, which is pointless; we should be able to use it directly.
>
> This doesn't happen automatically because TREE_ADDRESSABLE is set (since
> r12-657 for PR100464), and so gimplify_init_constructor won't promote the
> variable to static.  Theoretically we could do escape analysis to recognize
> that the address, though taken, never leaves the function; that would allow
> promotion when we're only using the address for indexing within the
> function, as in initlist-opt2.C.  But this would be a new pass.
>
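A sketch of the kind of code that benefits (my example; initlist-opt1.C and
initlist-opt2.C are the real tests):

#include <initializer_list>

// Indexing within the function: the backing array of string literals can
// stay in .rodata instead of being memcpy'd to the stack on each call,
// provided the address is known not to escape (i is assumed in range).
const char *
pick (int i)
{
  std::initializer_list<const char *> il = { "alpha", "beta", "gamma" };
  return il.begin ()[i];
}
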
> And in initlist-opt1.C, we're passing the array address to another function,
> so it definitely escapes; it's only safe in this case because it's calling a
> standard library function that we know only uses it for indexing.  So, a
> flag seems needed.  I first thought to put the flag on the TARGET_EXPR, but
> the VAR_DECL seems more appropriate.
>
> In a previous revision of the patch I called this flag DECL_NOT_OBSERVABLE,
> but I think DECL_MERGEABLE is a better name, especially if we're going to
> apply it to the backing array of initializer_list, which is observable.  I
> then also check it in places that check for -fmerge-all-constants, so that
> multiple equivalent initializer-lists can also be combined.  And then it
> seemed to make sense for [[no_unique_address]] to have this meaning for
> user-written variables.
>
> I think the note in [dcl.init.list]/6 intended to allow this kind of merging
> for initializer_lists, but it didn't actually work; for an explicit array
> with the same initializer, if the address escapes the program could tell
> whether the same variable in two frames have the same address.  P2752 is
> trying to correct this defect, so I'm going to assume that this is the
> intent.
>
> PR c++/110070
> PR c++/105838
>
> gcc/ChangeLog:
>
> * tree.h (DECL_MERGEABLE): New.
> * tree-core.h (struct tree_decl_common): Mention it.
> * gimplify.cc (gimplify_init_constructor): Check it.
> * cgraph.cc (symtab_node::address_can_be_compared_p): Likewise.
> * varasm.cc (categorize_decl_for_section): Likewise.
>
> gcc/cp/ChangeLog:
>
> * call.cc (maybe_init_list_as_array): Set DECL_MERGEABLE.
> (convert_like_internal) [ck_list]: Set it.
> (set_up_extended_ref_temp): Copy it.
> * tree.cc (handle_no_unique_addr_attribute): Set it.
>
> gcc/testsuite/ChangeLog:
>
> * g++.dg/tree-ssa/initlist-opt1.C: Check for static array.
> * g++.dg/tree-ssa/initlist-opt2.C: Likewise.
> * g++.dg/tree-ssa/initlist-opt4.C: New test.
> * g++.dg/opt/icf1.C: New test.
> * g++.dg/opt/icf2.C: New test.
> ---
>  gcc/tree-core.h   |  3 ++-
>  gcc/tree.h|  6 ++
>  gcc/cgraph.cc |  2 +-
>  gcc/cp/call.cc| 15 ---
>  gcc/cp/tree.cc|  9 -
>  gcc/gimplify.cc   |  3 ++-
>  gcc/testsuite/g++.dg/opt/icf1.C   | 16 
>  gcc/testsuite/g++.dg/opt/icf2.C   | 17 +
>  gcc/testsuite/g++.dg/tree-ssa/initlist-opt1.C |  1 +
>  gcc/testsuite/g++.dg/tree-ssa/initlist-opt2.C |  2 ++
>  gcc/testsuite/g++.dg/tree-ssa/initlist-opt4.C | 13 +
>  gcc/varasm.cc |  2 +-
>  12 files changed, 81 insertions(+), 8 deletions(-)
>  create mode 100644 gcc/testsuite/g++.dg/opt/icf1.C
>  create mode 100644 gcc/testsuite/g++.dg/opt/icf2.C
>  create mode 100644 gcc/testsuite/g++.dg/tree-ssa/initlist-opt4.C
>
> diff --git a/gcc/tree-core.h b/gcc/tree-core.h
> index 9d44c04bf03..6dd7b680b57 100644
> --- a/gcc/tree-core.h
> +++ b/gcc/tree-core.h
> @@ -1803,7 +1803,8 @@ struct GTY(()) tree_decl_common {
>   In VAR_DECL, PARM_DECL and RESULT_DECL, this is
>   DECL_HAS_VALUE_EXPR_P.  */
>unsigned decl_flag_2 : 1;
> -  /* In FIELD_DECL, this is DECL_PADDING_P.  */
> +  /* In FIELD_DECL, this is DECL_PADDING_P.
> + In VAR_DECL, this is DECL_MERGEABLE.  */
>unsigned decl_flag_3 : 1;
>/* Logically, these two would go in a theoretical base shared by var and
>   parm decl. */
> diff --git a/gcc/tree.h b/gcc/tree.h
> index 0b72663e6a1..8a4beba1230 100644
> --- a/gcc/tree.h
> +++ b/gcc/tree.h
> @@ -3233,6 +3233,12 @@ extern void decl_fini_priority_insert (tree, 
> priority_type);
>  #define DECL_NONALIASED(NO

Re: [PATCH RFA] varasm: check float size

2023-06-02 Thread Richard Biener via Gcc-patches
On Fri, Jun 2, 2023 at 4:44 AM Jason Merrill via Gcc-patches
 wrote:
>
> Tested x86_64-pc-linux-gnu, OK for trunk?

OK.

> -- 8< --
>
> In PR95226, the testcase was failing because we tried to output_constant a
> NOP_EXPR to float from a double REAL_CST, and so we output a double where
> the caller wanted a float.  That doesn't happen anymore, but with the
> output_constant hunk we will ICE in that situation rather than emit the
> wrong number of bytes.
>
> Part of the problem was that initializer_constant_valid_p_1 returned true
> for that NOP_EXPR, because it compared the sizes of integer types but not
> floating-point types.  So the C++ front end assumed it didn't need to fold
> the initializer.
>
> PR c++/95226
>
> gcc/ChangeLog:
>
> * varasm.cc (output_constant) [REAL_TYPE]: Check that sizes match.
> (initializer_constant_valid_p_1): Compare float precision.
> ---
>  gcc/varasm.cc | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/gcc/varasm.cc b/gcc/varasm.cc
> index 34400ec39ef..dd84754a283 100644
> --- a/gcc/varasm.cc
> +++ b/gcc/varasm.cc
> @@ -4876,16 +4876,16 @@ initializer_constant_valid_p_1 (tree value, tree 
> endtype, tree *cache)
> tree src_type = TREE_TYPE (src);
> tree dest_type = TREE_TYPE (value);
>
> -   /* Allow conversions between pointer types, floating-point
> -  types, and offset types.  */
> +   /* Allow conversions between pointer types and offset types.  */
> if ((POINTER_TYPE_P (dest_type) && POINTER_TYPE_P (src_type))
> -   || (FLOAT_TYPE_P (dest_type) && FLOAT_TYPE_P (src_type))
> || (TREE_CODE (dest_type) == OFFSET_TYPE
> && TREE_CODE (src_type) == OFFSET_TYPE))
>   return initializer_constant_valid_p_1 (src, endtype, cache);
>
> -   /* Allow length-preserving conversions between integer types.  */
> -   if (INTEGRAL_TYPE_P (dest_type) && INTEGRAL_TYPE_P (src_type)
> +   /* Allow length-preserving conversions between integer types and
> +  floating-point types.  */
> +   if (((INTEGRAL_TYPE_P (dest_type) && INTEGRAL_TYPE_P (src_type))
> +|| (FLOAT_TYPE_P (dest_type) && FLOAT_TYPE_P (src_type)))
> && (TYPE_PRECISION (dest_type) == TYPE_PRECISION (src_type)))
>   return initializer_constant_valid_p_1 (src, endtype, cache);
>
> @@ -5255,6 +5255,7 @@ output_constant (tree exp, unsigned HOST_WIDE_INT size, 
> unsigned int align,
>break;
>
>  case REAL_TYPE:
> +  gcc_assert (size == thissize);
>if (TREE_CODE (exp) != REAL_CST)
> error ("initializer for floating value is not a floating constant");
>else
>
> base-commit: 5fccebdbd9666e0adf6dd8357c21d4ef3ac3f83f
> --
> 2.31.1
>


Re: [PATCH] rtl-optimization: [PR102733] DSE removing address which only differ by address space.

2023-06-02 Thread Richard Biener via Gcc-patches
On Fri, Jun 2, 2023 at 9:36 AM Andrew Pinski via Gcc-patches
 wrote:
>
> The problem here is DSE was not taking into account the address space
> which meant if you had two addresses say `fs:0` and `gs:0` (on x86_64),
> DSE would think they were the same and remove the first store.
> This fixes that issue by adding a check for the address space too.
>
> OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions.

OK.

> PR rtl-optimization/102733
>
> gcc/ChangeLog:
>
> * dse.cc (store_info): Add addrspace field.
> (record_store): Record the address space
> and check to make sure they are the same.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/addr-space-6.c: New test.
> ---
>  gcc/dse.cc   |  9 -
>  gcc/testsuite/gcc.target/i386/addr-space-6.c | 21 
>  2 files changed, 29 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/addr-space-6.c
>
> diff --git a/gcc/dse.cc b/gcc/dse.cc
> index 802b949cfb2..8b07be17674 100644
> --- a/gcc/dse.cc
> +++ b/gcc/dse.cc
> @@ -251,6 +251,9 @@ public:
>   and known (rather than -1).  */
>poly_int64 width;
>
> +  /* The address space that the memory reference uses.  */
> +  unsigned char addrspace;
> +
>union
>  {
>/* A bitmask as wide as the number of bytes in the word that
> @@ -1524,6 +1527,7 @@ record_store (rtx body, bb_info_t bb_info)
>ptr = active_local_stores;
>last = NULL;
>redundant_reason = NULL;
> +  unsigned char addrspace = MEM_ADDR_SPACE (mem);
>mem = canon_rtx (mem);
>
>if (group_id < 0)
> @@ -1548,7 +1552,9 @@ record_store (rtx body, bb_info_t bb_info)
>while (!s_info->is_set)
> s_info = s_info->next;
>
> -  if (s_info->group_id == group_id && s_info->cse_base == base)
> +  if (s_info->group_id == group_id
> + && s_info->cse_base == base
> + && s_info->addrspace == addrspace)
> {
>   HOST_WIDE_INT i;
>   if (dump_file && (dump_flags & TDF_DETAILS))
> @@ -1688,6 +1694,7 @@ record_store (rtx body, bb_info_t bb_info)
>store_info->rhs = rhs;
>store_info->const_rhs = const_rhs;
>store_info->redundant_reason = redundant_reason;
> +  store_info->addrspace = addrspace;
>
>/* If this is a clobber, we return 0.  We will only be able to
>   delete this insn if there is only one store USED store, but we
> diff --git a/gcc/testsuite/gcc.target/i386/addr-space-6.c 
> b/gcc/testsuite/gcc.target/i386/addr-space-6.c
> new file mode 100644
> index 000..82eca4d7e0c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/addr-space-6.c
> @@ -0,0 +1,21 @@
> +/* PR rtl-optimization/102733 */
> +/* { dg-do compile } */
> +/* { dg-options "-O1" } */
> +
> +/* DSE was removing a store to fs:0 (correctly)
> +   and gs:0 (incorrectly) as DSE didn't take into
> +   account the address space was different.  */
> +
> +void test_null_store (void)
> +{
> +  int __seg_fs *fs = (int __seg_fs *)0;
> +  *fs = 1;
> +
> +  int __seg_gs *gs = (int __seg_gs *)0;
> +  *gs = 2;
> +  *fs = 3;
> +}
> +
> +/* { dg-final { scan-assembler-times "movl\t" 2 } } */
> +/* { dg-final { scan-assembler "gs:" } } */
> +/* { dg-final { scan-assembler "fs:" } } */
> --
> 2.31.1
>


Re: [PATCH] doc: clarify semantics of vector bitwise shifts

2023-06-02 Thread Richard Biener via Gcc-patches
On Fri, Jun 2, 2023 at 11:24 AM Alexander Monakov  wrote:
>
>
> On Fri, 2 Jun 2023, Matthias Kretz wrote:
>
> > > Okay, I see opinions will vary here. I was thinking about our immintrin.h
> > > which is partially implemented in terms of generic vectors. Imagine we
> > > extend UBSan to trap on signed overflow for vector types. I expect that
> > > will blow up on existing code that uses Intel intrinsics.
> >
> > _mm_add_epi32 is already implemented via __v4su addition (i.e. unsigned). So
> > the intrinsic would continue to wrap on signed overflow.
>
> Ah, if our intrinsics take care of it, that alleviates my concern.

Just to add: when generic vectors are lowered to scalar operations,
signed vector ops become signed scalar ops, which means followup
optimizations will assume undefined behavior on overflow.

> > > I'm not sure what you consider a breaking change here. Is that the implied
> > > threat to use undefinedness for range deduction and other optimizations?
> >
> > Consider the stdx::simd implementation. It currently follows semantics of 
> > the
> > builtin types. So simd can be shifted by 30 without UB. The
> > implementation of the shift operator depends on the current behavior, even 
> > if
> > it is target-dependent. For PPC the simd implementation adds extra code to
> > avoid the "UB". With nailing down shifts > sizeof(T) as UB this extra code 
> > now
> > needs to be added for all targets.
>
> What does stdx::simd do on LLVM, where that has always been UB even on x86?
>
> Alexander


Re: [PATCH] doc: clarify semantics of vector bitwise shifts

2023-06-02 Thread Richard Biener via Gcc-patches
On Thu, Jun 1, 2023 at 8:25 PM Alexander Monakov  wrote:
>
>
> On Wed, 31 May 2023, Richard Biener wrote:
>
> > On Tue, May 30, 2023 at 4:49 PM Alexander Monakov  
> > wrote:
> > >
> > >
> > > On Thu, 25 May 2023, Richard Biener wrote:
> > >
> > > > On Wed, May 24, 2023 at 8:36 PM Alexander Monakov  
> > > > wrote:
> > > > >
> > > > >
> > > > > On Wed, 24 May 2023, Richard Biener via Gcc-patches wrote:
> > > > >
> > > > > > I’d have to check the ISAs what they actually do here - it of 
> > > > > > course depends
> > > > > > on RTL semantics as well but as you say those are not strictly 
> > > > > > defined here
> > > > > > either.
> > > > >
> > > > > Plus, we can add the following executable test to the testsuite:
> > > >
> > > > Yeah, that's probably a good idea.  I think your documentation change
> > > > with the added sentence about the truncation is OK.
> > >
> > > I am no longer confident in my patch, sorry.
> > >
> > > My claim about vector shift semantics in OpenCL was wrong. In fact it 
> > > specifies
> > > that RHS of a vector shift is masked to the exact bitwidth of the element 
> > > type.
> > >
> > > So, to collect various angles:
> > >
> > > 1. OpenCL semantics would need an 'AND' before a shift (except 
> > > VSX/Altivec).
> > >
> > > 2. From user side we had a request to follow C integer promotion semantics
> > >in https://gcc.gnu.org/PR91838 but I now doubt we can do that.
> > >
> > > 3. LLVM makes oversized vector shifts UB both for 'vector_size' and
> > >'ext_vector_type'.
> >
> > I had the impression GCC desired to do 3. as well, matching what we do
> > for scalar shifts.
> >
> > > 4. Vector lowering does not emit promotions, and starting from gcc-12
> > >ranger treats oversized shifts according to the documentation you
> > >cite below, and optimizes (e.g. with '-O2 -mno-sse')
> > >
> > > typedef short v8hi __attribute__((vector_size(16)));
> > >
> > > void f(v8hi *p)
> > > {
> > > *p >>= 16;
> > > }
> > >
> > >to zeroing '*p'. If this looks unintended, I can file a bug.
> > >
> > > I still think we need to clarify semantics of vector shifts, but probably
> > > not in the way I proposed initially. What do you think?
> >
> > I think the intent at some point was to adhere to the OpenCL spec
> > for the GCC vector extension (because that's a written spec while
> > GCCs vector extension docs are lacking).  Originally the powerpc
> > altivec 'vector' keyword spurred most of the development IIRC
> > so it might be useful to see how they specify shifts.
>
> It doesn't look like they document the semantics of '<<' and '>>'
> operators for vector types.
>
> > So yes, we probably should clarify the semantics to match the
> > implementation (since we have two targets doing things differently
> > since forever we can only document it as UB) and also note the
> > difference from OpenCL (in case OpenCL is still relevant these
> > days we might want to offer a -fopencl-vectors to emit the required
> > AND).
>
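For concreteness, the per-lane effect of such a flag would be something like
this scalar sketch (mine; OpenCL masks the shift count to the element width):

#include <cstdint>

/* OpenCL semantics: a << 35 on a 32-bit lane behaves like a << 3.  */
static inline uint32_t
opencl_shl (uint32_t a, uint32_t b)
{
  return a << (b & 31);	/* the extra AND GCC would need to emit */
}
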
> It doesn't have to be UB, in principle we could say that shift amount
> is taken modulo some power of two depending on the target without UB.
> But since LLVM already treats that as UB, we might as well follow.
>
> I think for addition/multiplication of signed vectors everybody
> expects them to have wrapping semantics without UB on overflow though?

Actually GCC already treats them as UB on overflow by means of
vector lowering eventually turning them into scalar operations and
quite some patterns in match.pd applying to ANY_INTEGRAL_TYPE_P.

> Revised patch below.

The revised patch is OK.

Thanks,
Richard.

> > It would be also good to amend the RTL documentation.
> >
> > It would be very nice to start an internals documentation section
> > around collecting what the middle-end considers undefined
> > or implementation defined (aka target defined) behavior in the
> > GENERIC, GIMPLE and RTL ILs and what predicates eventually
> > control that (like TYPE_OVERFLOW_UNDEFINED).  Maybe spread it over
> > {gimple,generic,rtl}.texi, though gimple.texi is onl

Re: [pushed] Darwin, PPC: Fix struct layout with pragma pack [PR110044].

2023-06-03 Thread Richard Biener via Gcc-patches



> Am 02.06.2023 um 21:12 schrieb Iain Sandoe via Gcc-patches 
> :
> 
> @David: I am not sure what sets the ABI on AIX (for Darwin, it is effectively
> "whatever the system compiler [Apple gcc-4] does") but from an inspection of
> the code, it seems that (if the platform should honour #pragma pack) a similar
> effect could be present there too.
> 
> Tested on powerpc-apple-darwin9, powerpc64-linux-gnu and on i686 and x86_64
> Darwin.  Checked that the testcases also pass for Apple gcc-4.2.1.
> pushed to trunk, thanks
> Iain
> 
> --- 8< ---
> 
> This bug was essentially that darwin_rs6000_special_round_type_align()
> was ignoring externally-imposed capping of field alignment.
> 
> Signed-off-by: Iain Sandoe 
> 
>PR target/110044
> 
> gcc/ChangeLog:
> 
>* config/rs6000/rs6000.cc (darwin_rs6000_special_round_type_align):
>Make sure that we do not have a cap on field alignment before altering
>the struct layout based on the type alignment of the first entry.
> 
> gcc/testsuite/ChangeLog:
> 
>* gcc.target/powerpc/darwin-abi-13-0.c: New test.
>* gcc.target/powerpc/darwin-abi-13-1.c: New test.
>* gcc.target/powerpc/darwin-abi-13-2.c: New test.
>* gcc.target/powerpc/darwin-structs-0.h: New test.
> ---
> gcc/config/rs6000/rs6000.cc   |  3 +-
> .../gcc.target/powerpc/darwin-abi-13-0.c  | 23 +++
> .../gcc.target/powerpc/darwin-abi-13-1.c  | 27 +
> .../gcc.target/powerpc/darwin-abi-13-2.c  | 27 +
> .../gcc.target/powerpc/darwin-structs-0.h | 29 +++
> 5 files changed, 108 insertions(+), 1 deletion(-)
> create mode 100644 gcc/testsuite/gcc.target/powerpc/darwin-abi-13-0.c
> create mode 100644 gcc/testsuite/gcc.target/powerpc/darwin-abi-13-1.c
> create mode 100644 gcc/testsuite/gcc.target/powerpc/darwin-abi-13-2.c
> create mode 100644 gcc/testsuite/gcc.target/powerpc/darwin-structs-0.h
> 
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 5b3b8b52e7e..42f49e4a56b 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -8209,7 +8209,8 @@ darwin_rs6000_special_round_type_align (tree type, 
> unsigned int computed,
>   type = TREE_TYPE (type);
>   } while (AGGREGATE_TYPE_P (type));
> 
> -  if (! AGGREGATE_TYPE_P (type) && type != error_mark_node)
> +  if (type != error_mark_node && ! AGGREGATE_TYPE_P (type)
> +  && ! TYPE_PACKED (type) && maximum_field_alignment == 0)

Just noticed while browsing mail.  'maximum_field_alignment' sounds like
something that should be factored in when
computing align, but as written there's no adjustment done instead?  Is there a
way to get that to more than BITS_PER_UNIT?

> align = MAX (align, TYPE_ALIGN (type));
> 
>   return align;
> diff --git a/gcc/testsuite/gcc.target/powerpc/darwin-abi-13-0.c 
> b/gcc/testsuite/gcc.target/powerpc/darwin-abi-13-0.c
> new file mode 100644
> index 000..d8d3c63a083
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/darwin-abi-13-0.c
> @@ -0,0 +1,23 @@
> +/* { dg-do compile { target powerpc*-*-darwin* } } */
> +/* { dg-require-effective-target ilp32 } */
> +/* { dg-options "-Wno-long-long" } */
> +
> +#include "darwin-structs-0.h"
> +
> +int tcd[sizeof(cd) != 12 ? -1 : 1];
> +int acd[__alignof__(cd) != 4 ? -1 : 1];
> +
> +int sdc[sizeof(dc) != 16 ? -1 : 1];
> +int adc[__alignof__(dc) != 8 ? -1 : 1];
> +
> +int scL[sizeof(cL) != 12 ? -1 : 1];
> +int acL[__alignof__(cL) != 4 ? -1 : 1];
> +
> +int sLc[sizeof(Lc) != 16 ? -1 : 1];
> +int aLc[__alignof__(Lc) != 8 ? -1 : 1];
> +
> +int scD[sizeof(cD) != 32 ? -1 : 1];
> +int acD[__alignof__(cD) != 16 ? -1 : 1];
> +
> +int sDc[sizeof(Dc) != 32 ? -1 : 1];
> +int aDc[__alignof__(Dc) != 16 ? -1 : 1];
> diff --git a/gcc/testsuite/gcc.target/powerpc/darwin-abi-13-1.c 
> b/gcc/testsuite/gcc.target/powerpc/darwin-abi-13-1.c
> new file mode 100644
> index 000..4d888d383fa
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/darwin-abi-13-1.c
> @@ -0,0 +1,27 @@
> +/* { dg-do compile { target powerpc*-*-darwin* } } */
> +/* { dg-require-effective-target ilp32 } */
> +/* { dg-options "-Wno-long-long" } */
> +
> +#pragma pack(push, 1)
> +
> +#include "darwin-structs-0.h"
> +
> +int tcd[sizeof(cd) != 9 ? -1 : 1];
> +int acd[__alignof__(cd) != 1 ? -1 : 1];
> +
> +int sdc[sizeof(dc) != 9 ? -1 : 1];
> +int adc[__alignof__(dc) != 1 ? -1 : 1];
> +
> +int scL[sizeof(cL) != 9 ? -1 : 1];
> +int acL[__alignof__(cL) != 1 ? -1 : 1];
> +
> +int sLc[sizeof(Lc) != 9 ? -1 : 1];
> +int aLc[__alignof__(Lc) != 1 ? -1 : 1];
> +
> +int scD[sizeof(cD) != 17 ? -1 : 1];
> +int acD[__alignof__(cD) != 1 ? -1 : 1];
> +
> +int sDc[sizeof(Dc) != 17 ? -1 : 1];
> +int aDc[__alignof__(Dc) != 1 ? -1 : 1];
> +
> +#pragma pack(pop)
> diff --git a/gcc/testsuite/gcc.target/powerpc/darwin-abi-13-2.c 
> b/gcc/testsuite/gcc.target/powerpc/darwin-abi-13-2.c
> new file mode 100644
> index 000..3bd52c0a8f8
> --- /dev/null
> +++ b/gcc/testsuit

Re: [PATCH] Fix PR 110085: `make clean` in GCC directory on sh target causes a failure

2023-06-04 Thread Richard Biener via Gcc-patches



> Am 05.06.2023 um 06:42 schrieb Andrew Pinski via Gcc-patches 
> :
> 
> On sh target, there is a MULTILIB_DIRNAMES (or is it MULTILIB_OPTIONS) named 
> m2,
> this conflicts with the language m2. So when you do a `make clean`, it will 
> remove
> the m2 directory and then a build will fail. Now since 
> r0-78222-gfa9585134f6f58,
> the multilib directories are no longer created in the gcc directory as libgcc
> was moved to the toplevel. So we can remove the part of clean that removes 
> those
> directories.
> 
> Tested on x86_64-linux-gnu and a cross to sh-elf that `make clean` followed by
> `make` works again.
> 
> OK?

Ok

> gcc/ChangeLog:
> 
>PR bootstrap/110085
>* Makefile.in (clean): Remove the removing of
>MULTILIB_DIR/MULTILIB_OPTIONS directories.
> ---
> gcc/Makefile.in | 7 ---
> 1 file changed, 7 deletions(-)
> 
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index 1d39e6dd3f8..0c02f312985 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -3622,13 +3622,6 @@ clean: mostlyclean lang.clean
>-rm -f doc/*.pdf
> # Delete the include directories.
>-rm -rf include include-fixed
> -# Delete files used by the "multilib" facility (including libgcc subdirs).
> --rm -f multilib.h tmpmultilib*
> --if [ "x$(MULTILIB_DIRNAMES)" != x ] ; then \
> -  rm -rf $(MULTILIB_DIRNAMES); \
> -else if [ "x$(MULTILIB_OPTIONS)" != x ] ; then \
> -  rm -rf `echo $(MULTILIB_OPTIONS) | sed -e 's/\// /g'`; \
> -fi ; fi
> 
> # Delete all files that users would normally create
> # while building and installing GCC.
> -- 
> 2.31.1
> 


Re: [PATCH, PR110086] avr: Fix ICE on optimize attribute

2023-06-04 Thread Richard Biener via Gcc-patches
On Fri, Jun 2, 2023 at 11:54 AM SenthilKumar.Selvaraj--- via
Gcc-patches  wrote:
>
> Hi,
>
> This patch fixes an ICE when an optimize attribute changes the prevailing
> optimization level.
>
> I found https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105069 describing the
> same ICE for the sh target, where the fix was to enable save/restore of
> target specific options modified via TARGET_OPTIMIZATION_TABLE hook.
>
> For the AVR target, -mgas-isr-prologues and -mmain-is-OS_task are those
> target specific options. As they enable generation of more optimal code,
> this patch adds the Optimization option property to those option records,
> and that fixes the ICE.
>
> Regression run shows no regressions, and >100 new PASSes.
> Ok to commit to master?

LGTM

Richard.

> Regards
> Senthil
>
>
> PR 110086
>
> gcc/ChangeLog:
>
> * config/avr/avr.opt (mgas-isr-prologues, mmain-is-OS_task):
> Add Optimization option property.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/avr/pr110086.c: New test.
>
> diff --git gcc/config/avr/avr.opt gcc/config/avr/avr.opt
> index f62d746..5a0b465 100644
> --- gcc/config/avr/avr.opt
> +++ gcc/config/avr/avr.opt
> @@ -27,7 +27,7 @@ Target RejectNegative Joined Var(avr_mmcu) 
> MissingArgError(missing device or arc
>  -mmcu=MCU  Select the target MCU.
>
>  mgas-isr-prologues
> -Target Var(avr_gasisr_prologues) UInteger Init(0)
> +Target Var(avr_gasisr_prologues) UInteger Init(0) Optimization
>  Allow usage of __gcc_isr pseudo instructions in ISR prologues and epilogues.
>
>  mn-flash=
> @@ -65,7 +65,7 @@ Target Joined RejectNegative UInteger Var(avr_branch_cost) 
> Init(0)
>  Set the branch costs for conditional branch instructions.  Reasonable values 
> are small, non-negative integers.  The default
> branch cost is 0.
>
>  mmain-is-OS_task
> -Target Mask(MAIN_IS_OS_TASK)
> +Target Mask(MAIN_IS_OS_TASK) Optimization
>  Treat main as if it had attribute OS_task.
>
>  morder1
> diff --git gcc/testsuite/gcc.target/avr/pr110086.c 
> gcc/testsuite/gcc.target/avr/pr110086.c
> new file mode 100644
> index 000..6b97620
> --- /dev/null
> +++ gcc/testsuite/gcc.target/avr/pr110086.c
> @@ -0,0 +1,5 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Os" } */
> +
> +void __attribute__((optimize("O0"))) foo(void) {
> +}


Re: PING Re: [PATCH RFA (tree-eh)] c++: use __cxa_call_terminate for MUST_NOT_THROW [PR97720]

2023-06-04 Thread Richard Biener via Gcc-patches
On Fri, Jun 2, 2023 at 6:57 PM Jason Merrill via Gcc-patches
 wrote:
>
> Since Jonathan approved the library change, I'm looking for middle-end
> approval for the tree-eh change, even without advice on the potential
> follow-up.
>
> On 5/24/23 14:55, Jason Merrill wrote:
> > Middle-end folks: any thoughts about how best to make the change described 
> > in
> > the last paragraph below?
> >
> > Library folks: any thoughts on the changes to __cxa_call_terminate?
> >
> > -- 8< --
> >
> > [except.handle]/7 says that when we enter std::terminate due to a throw,
> > that is considered an active handler.  We already implemented that properly
> > for the case of not finding a handler (__cxa_throw calls __cxa_begin_catch
> > before std::terminate) and the case of finding a callsite with no landing
> > pad (the personality function calls __cxa_call_terminate which calls
> > __cxa_begin_catch), but for the case of a throw in a try/catch in a noexcept
> > function, we were emitting a cleanup that calls std::terminate directly
> > without ever calling __cxa_begin_catch to handle the exception.
> >
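A minimal example of the affected pattern (my sketch; terminate2.C below is
the real test):

void f () noexcept
{
  try
    {
      throw 1;		/* int: no matching handler below */
    }
  catch (float)
    {
      /* never entered */
    }
  /* The exception escapes the try and hits the noexcept barrier here.
     [except.handle]/7 requires the resulting std::terminate call to count
     as an active handler, i.e. __cxa_begin_catch must run first.  */
}
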
> > A straightforward way to fix this seems to be calling __cxa_call_terminate
> > instead.  However, that requires exporting it from libstdc++, which we have
> > not previously done.  Despite the name, it isn't actually part of the ABI
> > standard.  Nor is __cxa_call_unexpected, as far as I can tell, but that one
> > is also used by clang.  For this case they use __clang_call_terminate; it
> > seems reasonable to me for us to stick with __cxa_call_terminate.
> >
> > I also change __cxa_call_terminate to take void* for simplicity in the front
> > end (and consistency with __cxa_call_unexpected) but that isn't necessary if
> > it's undesirable for some reason.
> >
> > This patch does not fix the issue that representing the noexcept as a
> > cleanup is wrong, and confuses the handler search; since it looks like a
> > cleanup in the EH tables, the unwinder keeps looking until it finds the
> > catch in main(), which it should never have gotten to.  Without the
> > try/catch in main, the unwinder would reach the end of the stack and say no
> > handler was found.  The noexcept is a handler, and should be treated as one,
> > as it is when the landing pad is omitted.
> >
> > The best fix for that issue seems to me to be to represent an
> > ERT_MUST_NOT_THROW after an ERT_TRY in an action list as though it were an
> > ERT_ALLOWED_EXCEPTIONS (since indeed it is an exception-specification).  The
> > actual code generation shouldn't need to change (apart from the change made
> > by this patch), only the action table entry.
> >
> >   PR c++/97720
> >
> > gcc/cp/ChangeLog:
> >
> >   * cp-tree.h (enum cp_tree_index): Add CPTI_CALL_TERMINATE_FN.
> >   (call_terminate_fn): New macro.
> >   * cp-gimplify.cc (gimplify_must_not_throw_expr): Use it.
> >   * except.cc (init_exception_processing): Set it.
> >   (cp_protect_cleanup_actions): Return it.
> >
> > gcc/ChangeLog:
> >
> >   * tree-eh.cc (lower_resx): Pass the exception pointer to the
> >   failure_decl.
> >   * except.h: Tweak comment.
> >
> > libstdc++-v3/ChangeLog:
> >
> >   * libsupc++/eh_call.cc (__cxa_call_terminate): Take void*.
> >   * config/abi/pre/gnu.ver: Add it.
> >
> > gcc/testsuite/ChangeLog:
> >
> >   * g++.dg/eh/terminate2.C: New test.
> > ---
> >   gcc/cp/cp-tree.h |  2 ++
> >   gcc/except.h |  2 +-
> >   gcc/cp/cp-gimplify.cc|  2 +-
> >   gcc/cp/except.cc |  5 -
> >   gcc/testsuite/g++.dg/eh/terminate2.C | 30 
> >   gcc/tree-eh.cc   | 16 ++-
> >   libstdc++-v3/libsupc++/eh_call.cc|  4 +++-
> >   libstdc++-v3/config/abi/pre/gnu.ver  |  7 +++
> >   8 files changed, 63 insertions(+), 5 deletions(-)
> >   create mode 100644 gcc/testsuite/g++.dg/eh/terminate2.C
> >
> > diff --git a/gcc/cp/cp-tree.h b/gcc/cp/cp-tree.h
> > index a1b882f11fe..a8465a988b5 100644
> > --- a/gcc/cp/cp-tree.h
> > +++ b/gcc/cp/cp-tree.h
> > @@ -217,6 +217,7 @@ enum cp_tree_index
> >  definitions.  */
> >   CPTI_ALIGN_TYPE,
> >   CPTI_TERMINATE_FN,
> > +CPTI_CALL_TERMINATE_FN,
> >   CPTI_CALL_UNEXPECTED_FN,
> >
> >   /* These are lazily inited.  */
> > @@ -358,6 +359,7 @@ extern GTY(()) tree cp_global_trees[CPTI_MAX];
> >   /* Exception handling function declarations.  */
> >   #define terminate_fn
> > cp_global_trees[CPTI_TERMINATE_FN]
> >   #define call_unexpected_fn  
> > cp_global_trees[CPTI_CALL_UNEXPECTED_FN]
> > +#define call_terminate_fn
> > cp_global_trees[CPTI_CALL_TERMINATE_FN]
> >   #define get_exception_ptr_fn
> > cp_global_trees[CPTI_GET_EXCEPTION_PTR_FN]
> >   #define begin_catch_fn  
> > cp_global_trees[CPTI_BEGIN_CATCH_FN]
> >   #define end_catch_fn
> > cp_gl

Re: [RFA] Improve strcmp expansion when one input is a constant string.

2023-06-04 Thread Richard Biener via Gcc-patches
On Sun, Jun 4, 2023 at 11:41 PM Jeff Law via Gcc-patches
 wrote:
>
> While investigating a RISC-V backend patch from Jivan I noticed a
> regression in terms of dynamic instruction counts for the omnetpp
> benchmark in spec2017.
>
> https://gcc.gnu.org/pipermail/gcc-patches/2023-June/620577.html
>
> The code we get with Jivan's patch at expansion time looks like this for
> each character in the input string:
>
>
>
> (insn 6 5 7 (set (reg:SI 137)
>  (zero_extend:SI (mem:QI (reg/v/f:DI 135 [ x ]) [0 MEM
>  [(void *)x_2(D)]+0 S1 A8]))) "j.c":5:11 -1
>   (nil))
>
> (insn 7 6 8 (set (reg:DI 138)
>  (sign_extend:DI (plus:SI (reg:SI 137)
>  (const_int -108 [0xff94] "j.c":5:11 -1
>   (nil))
>
> (insn 8 7 9 (set (reg:SI 136)
>  (subreg/s/u:SI (reg:DI 138) 0)) "j.c":5:11 -1
>   (expr_list:REG_EQUAL (plus:SI (reg:SI 137)
>  (const_int -108 [0xff94]))
>  (nil)))
>
> (insn 9 8 10 (set (reg:DI 139)
>  (sign_extend:DI (reg:SI 136))) "j.c":5:11 -1
>   (nil))
>
> (jump_insn 10 9 11 (set (pc)
>  (if_then_else (ne (reg:DI 139)
>  (const_int 0 [0]))
>  (label_ref 64)
>  (pc))) "j.c":5:11 -1
>   (nil))
>
>
> Ignore insn 9.  fwprop will turn it into a trivial copy from r138->r139
> which will ultimately propagate away.
>
>
> All the paths eventually transfer control to the label in question,
> either by jumping or falling thru on the last character.  After a bit of
> cleanup by fwprop & friends we have:
>
>
>
> > (insn 6 3 7 2 (set (reg:SI 137 [ MEM  [(void *)x_2(D)] ])
> > (zero_extend:SI (mem:QI (reg/v/f:DI 135 [ x ]) [0 MEM  
> > [(void *)x_2(D)]+0 S1 A8]))) "j.c":5:11 114 {zero_extendqisi2}
> >  (nil))
> > (insn 7 6 8 2 (set (reg:DI 138)
> > (sign_extend:DI (plus:SI (reg:SI 137 [ MEM  [(void 
> > *)x_2(D)] ])
> > (const_int -108 [0xff94] "j.c":5:11 6 
> > {addsi3_extended}
> >  (expr_list:REG_DEAD (reg:SI 137 [ MEM  [(void *)x_2(D)] ])
> > (nil)))
> > (insn 8 7 10 2 (set (reg:SI 136 [ MEM  [(void *)x_2(D)]+11 ])
> > (subreg/s/u:SI (reg:DI 138) 0)) "j.c":5:11 180 {*movsi_internal}
> >  (nil))
> > (jump_insn 10 8 73 2 (set (pc)
> > (if_then_else (ne (reg:DI 138)
> > (const_int 0 [0]))
> > (label_ref 64)
> > (pc))) "j.c":5:11 243 {*branchdi}
> >  (expr_list:REG_DEAD (reg:DI 138)
> > (int_list:REG_BR_PROB 536870916 (nil)))
> >  -> 64)
>
>
> insn 8 is the result of wanting the ultimate result of the strcmp to be
> an "int" type (SImode).Note that (reg 136) is the result of the
> strcmp.  It gets set in each fragment of code that compares one element
> in the string.  It's also live after the strcmp sequence.   As a result
> combine isn't going to be able to clean this up.
>
> Note how (reg 136) is born while (reg 138) is live and even though (reg
> 136) is a copy of (reg 138), IRA doesn't have the necessary code to
> determine that the regs do not conflict.  As a result (reg 136) and (reg
> 138) must be allocated different hard registers and we get code like this:
>
> > lbu a5,0(a0)    # 6 [c=28 l=4]  zero_extendqisi2/1
> > addiw   a5,a5,-108  # 7 [c=8 l=4]  addsi3_extended/1
> > mv  a4,a5   # 8 [c=4 l=4]  *movsi_internal/0
> > bne a5,zero,.L2 # 10 [c=4 l=4]  *branchdi
>
> Note the annoying "mv".
>
>
> Rather than do a conversion for each character, we could do each step in
> word_mode and do the conversion once at the end of the whole sequence.
>
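To make that concrete, here is the same strategy in plain C -- a minimal
sketch, not the expander code, assuming a 64-bit word and a hypothetical
constant string "abc":

#include <stddef.h>

static int
strcmp_lit_sketch (const char *x)
{
  static const char lit[] = "abc";      /* stand-in for the constant input */
  long diff = 0;                        /* word_mode accumulator */
  for (size_t i = 0; i < sizeof lit; i++)
    {
      diff = (long) (unsigned char) x[i]
             - (long) (unsigned char) lit[i];   /* zero-extend, subtract */
      if (diff != 0)
        break;                          /* mismatch: branch to the end */
    }
  return (int) diff;                    /* the single narrowing conversion */
}
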
> So for each character we expand to:
>
> > (insn 6 5 7 (set (reg:DI 138)
> > (zero_extend:DI (mem:QI (reg/v/f:DI 135 [ x ]) [0 MEM  
> > [(void *)x_2(D)]+0 S1 A8]))) "j.c":5:11 -1
> >  (nil))
> >
> > (insn 7 6 8 (set (reg:DI 137)
> > (plus:DI (reg:DI 138)
> > (const_int -108 [0xff94]))) "j.c":5:11 -1
> >  (nil))
> >
> > (jump_insn 8 7 9 (set (pc)
> > (if_then_else (ne (reg:DI 137)
> > (const_int 0 [0]))
> > (label_ref 41)
> > (pc))) "j.c":5:11 -1
> >  (nil))
>
> Good.  Then at the end of the sequence we have:
> > (code_label 41 40 42 2 (nil) [0 uses])
> >
> > (insn 42 41 43 (set (reg:SI 136)
> > (subreg:SI (reg:DI 137) 0)) "j.c":5:11 -1
> >  (nil))
>
> Which seems like exactly what we want.  At the assembly level we get:
>  lbu a5,0(a0)    # 6 [c=28 l=4]  zero_extendqidi2/1
>  addi a0,a5,-108  # 7 [c=4 l=4]  adddi3/1
>  bne a0,zero,.L2 # 8 [c=4 l=4]  *branchdi
> [ ... ]
>
> At the end of the sequence we realize the narrowing subreg followed by
> an extension isn't necessary and just remove them.
>
> The ultimate result is omnetpp goes from a small regression to a small
> overall improvement with Jivan's patch.
>
> Bootstrapped and regression tested 

Re: [PATCH] Fix PR 110085: `make clean` in GCC directory on sh target causes a failure

2023-06-04 Thread Richard Biener via Gcc-patches
On Mon, Jun 5, 2023 at 7:43 AM Andrew Pinski  wrote:
>
> On Sun, Jun 4, 2023 at 10:24 PM Richard Biener via Gcc-patches
>  wrote:
> >
> >
> >
> > > Am 05.06.2023 um 06:42 schrieb Andrew Pinski via Gcc-patches 
> > > :
> > >
> > > On sh target, there is a MULTILIB_DIRNAMES (or is it MULTILIB_OPTIONS) 
> > > named m2,
> > > this conflicts with the language m2. So when you do a `make clean`, it 
> > > will remove
> > > the m2 directory and then a build will fail. Now since 
> > > r0-78222-gfa9585134f6f58,
> > > the multilib directories are no longer created in the gcc directory as 
> > > libgcc
> > > was moved to the toplevel. So we can remove the part of clean that 
> > > removes those
> > > directories.
> > >
> > > Tested on x86_64-linux-gnu and a cross to sh-elf that `make clean` 
> > > followed by
> > > `make` works again.
> > >
> > > OK?
> >
> > Ok
>
> Is a similar patch ok for GCC 13 branch as we would get a similar
> failure there too?

Yes, though I wonder if we should worry.

Richard.

> Thanks,
> Andrew
>
> >
> > > gcc/ChangeLog:
> > >
> > >PR bootstrap/110085
> > >* Makefile.in (clean): Remove the removing of
> > >MULTILIB_DIR/MULTILIB_OPTIONS directories.
> > > ---
> > > gcc/Makefile.in | 7 ---
> > > 1 file changed, 7 deletions(-)
> > >
> > > diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> > > index 1d39e6dd3f8..0c02f312985 100644
> > > --- a/gcc/Makefile.in
> > > +++ b/gcc/Makefile.in
> > > @@ -3622,13 +3622,6 @@ clean: mostlyclean lang.clean
> > >-rm -f doc/*.pdf
> > > # Delete the include directories.
> > >-rm -rf include include-fixed
> > > -# Delete files used by the "multilib" facility (including libgcc 
> > > subdirs).
> > > --rm -f multilib.h tmpmultilib*
> > > --if [ "x$(MULTILIB_DIRNAMES)" != x ] ; then \
> > > -  rm -rf $(MULTILIB_DIRNAMES); \
> > > -else if [ "x$(MULTILIB_OPTIONS)" != x ] ; then \
> > > -  rm -rf `echo $(MULTILIB_OPTIONS) | sed -e 's/\// /g'`; \
> > > -fi ; fi
> > >
> > > # Delete all files that users would normally create
> > > # while building and installing GCC.
> > > --
> > > 2.31.1
> > >


Re: [PATCH] Add COMPLEX_VECTOR_INT modes

2023-06-05 Thread Richard Biener via Gcc-patches
On Mon, Jun 5, 2023 at 3:49 PM Andrew Stubbs  wrote:
>
> On 30/05/2023 07:26, Richard Biener wrote:
> > On Fri, May 26, 2023 at 4:35 PM Andrew Stubbs  wrote:
> >>
> >> Hi all,
> >>
> >> I want to implement a vector DIVMOD libfunc for amdgcn, but I can't just
> >> do it because the GCC middle-end models DIVMOD's return value as
> >> "complex int" type, and there are no vector equivalents of that type.
> >>
> >> Therefore, this patch adds minimal support for "complex vector int"
> >> modes.  I have not attempted to provide any means to use these modes
> >> from C, so they're really only useful for DIVMOD.  The actual libfunc
> >> implementation will pack the data into wider vector modes manually.
> >>
> >> A knock-on effect of this is that I needed to increase the range of
> >> "mode_unit_size" (several of the vector modes supported by amdgcn exceed
> >> the previous 255-byte limit).
> >>
> >> Since this change would add a large number of new, unused modes to many
> >> architectures, I have elected to *not* enable them, by default, in
> >> machmode.def (where the other complex modes are created).  The new modes
> >> are therefore inactive on all architectures but amdgcn, for now.
> >>
> >> OK for mainline?  (I've not done a full test yet, but I will.)
> >
> > I think it makes more sense to map vector CSImode to vector SImode with
> > the double number of lanes.  In fact since divmod is a libgcc function
> > I wonder where your vector variant would reside and how GCC decides to
> > emit calls to it?  That is, there's no way to OMP simd declare this 
> > function?
>
> The divmod implementation lives in libgcc. It's not too difficult to
> write using vector extensions and some asm tricks. I did try an OMP simd
> declare implementation, but it didn't vectorize well, and that's a yak 
> I don't wish to shave right now.
>
> In any case, the OMP simd declare will not help us here, directly,
> because the DIVMOD transformation happens too late in the pass pipeline,
> long after ifcvt and vect. My implementation (not yet posted) uses a
> libfunc and the TARGET_EXPAND_DIVMOD_LIBFUNC hook in the standard way.
> It just needs the complex vector modes to exist.
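For reference, a minimal sketch of what such a libfunc boils down to, using
GNU vector extensions with 4 lanes for brevity -- not the actual amdgcn
code, and the names are illustrative only:

/* The quotient/remainder pair is a struct of two vectors, which the
   ABI can place in consecutive vector registers.  */
typedef int v4si __attribute__ ((vector_size (16)));

struct v4si_pair { v4si quot; v4si rem; };

struct v4si_pair
__divmodv4si4_sketch (v4si a, v4si b)
{
  struct v4si_pair r;
  r.quot = a / b;   /* lane-wise division */
  r.rem  = a % b;   /* lane-wise remainder */
  return r;
}
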
>
> Using vectors twice the length is problematic also. If I create a new
> V128SImode that spans across two 64-lane vector registers then that will
> probably have the desired effect ("real" quotient in v8, "imaginary"
> remainder in v9), but if I use V64SImode to represent two V32SImode
> vectors then that's a one-register mode, and I'll have to use a
> permutation (a memory operation) to extract lanes 32-63 into lanes 0-31,
> and if we ever want to implement instructions that operate on these
> modes (as opposed to the odd/even add/sub complex patterns we have now)
> then the masking will be all broken and we'd need to constantly
> disassemble the double length vectors to operate on them.

I'm a bit confused as I don't see the difference between V64SCImode and
V128SImode since both contain 128 SImode values.  And I would expect
the imag/real parts to be _always_ interleaved, irrespective of whether
the result fits one or two vector registers.

> The implementation I proposed is essentially a struct containing two
> vectors placed in consecutive registers. This is the natural
> representation for the architecture.

I don't think you did that?  Or at least I don't see how vectors of
complex modes would match that.  It would be a complex of a vector
mode instead, no?

I do see that internal functions with more than one output would be
desirable and I think I proposed ASMs with a "coded text" aka
something like a pattern ID or an optab identifier would be the best
fit on GIMPLE but TARGET_EXPAND_DIVMOD_LIBFUNC for this
particular case should be a good fit as well, no?

Can you share what you needed to change to get your complex vector int
code actually working?  What does the divmod pattern matching create
for the return type?  The pass has

  /* Disable the transform if either is a constant, since division-by-constant
 may have specialized expansion.  */
  if (CONSTANT_CLASS_P (op1))
return false;

  if (CONSTANT_CLASS_P (op2))
{
  if (integer_pow2p (op2))
return false;

  if (TYPE_PRECISION (type) <= HOST_BITS_PER_WIDE_INT
  && TYPE_PRECISION (type) <= BITS_PER_WORD)
return false;

at least the TYPE_PRECISION query is bogus when type is a vector type
and the IFN building does

 /* Part 3: Create libcall to internal fn DIVMOD:
 divmod_tmp = DIVMOD (op1, op2).  */

  gcall *call_stmt = gimple_build_call_internal (IFN_DIVMOD, 2, op1, op2);
  tree res = make_temp_ssa_name (build_complex_type (TREE_TYPE (op1)),
 call_stmt, "divmod_tmp");

so that builds a complex type with a vector component, not a vector
with complex components.

Richard.


> Anyway, you don't like this patch and I see that AArch64 is picking
> apart BLKmode to see if there's complex inside, so maybe I can make
> somethi

Re: [RFA] Improve strcmp expansion when one input is a constant string.

2023-06-05 Thread Richard Biener via Gcc-patches
On Mon, Jun 5, 2023 at 8:41 PM Jeff Law  wrote:
>
>
>
> On 6/5/23 00:29, Richard Biener wrote:
>
> >
> > But then for example x86 has smaller encoding for byte ops and while
> > widening is easily done later, truncation is not.
> Sadly, the x86 costing looks totally bogus here.  We actually emit the
> exact same code for a QImode load vs a zero-extending load from QI to
> SI.  But the costing is different and would tend to prefer QImode.  That
> in turn is going to force an extension at the end of the sequence which
> would be a regression relative to the current code.  Additionally we may
> get partial register stalls for the byte ops to implement the comparison
> steps.
>
> The net result is that querying the backend's costs would do the exact
> opposite of what I think we want on x86.  One could argue the x86
> maintainers should improve this situation...
>
> >
> > Note I would have expected to use the mode of the load so we truly
> > elide some extensions, using word_mode looks like just another
> > mode here?  The key to note is probably
> >
> >op0 = convert_modes (mode, unit_mode, op0, 1);
> >op1 = convert_modes (mode, unit_mode, op1, 1);
> >rtx diff = expand_simple_binop (mode, MINUS, op0, op1,
> >result, 1, OPTAB_WIDEN);
> >
> > which uses OPTAB_WIDEN - wouldn't it be better to pass in the
> > unconverted modes and leave the decision which mode to use
> > to OPTAB_WIDEN?  Should we somehow query the target for
> > the smallest mode from unit_mode it can do both the MINUS
> > and the compare?
> And avoiding OPTAB_WIDEN isn't going to help rv64 at all.  The core
> issue being that we do define 32bit ops.  With Jivan's patch those 32bit
> ops expose the sign extending nature.  So a 32bit add would look
> something like
>
> (set (temp:DI) (sign_extend:DI (plus:SI (op:SI) (op:SI))))
> (set (res:SI) (subreg:SI (temp:DI) 0))
>
> Where we mark the subreg with SUBREG_PROMOTED_VAR_P.
>
>
> I'm not sure the best way to proceed now.  I could just put this on the
> back-burner as it's RISC-V specific and the gains elsewhere dwarf this
> issue.

I wonder if there's some more generic target macro we can key the
behavior off - SLOW_BYTE_ACCESS isn't a good fit, WORD_REGISTER_OPERATIONS
is maybe closer but its exact implications are unknown to me.  Maybe
there's something else as well ...

The point about OPTAB_WIDEN above was that I wonder why we
extend 'op0' and 'op1' before emitting the binop when we allow WIDEN
anyway.  Yes, we want the result in 'mode' (but why?  As you say we
can extend at the end) and there's likely no way to tell expand_simple_binop
to "expand as needed and not narrow the result".  So I wonder if we should
emulate that somehow (also taking into consideration the compare).

Richard.

>
> jeff


Re: [PATCH] RISC-V: Support RVV VLA SLP auto-vectorization

2023-06-05 Thread Richard Biener via Gcc-patches
On Tue, Jun 6, 2023 at 6:17 AM  wrote:
>
> From: Juzhe-Zhong 
>
> This patch enables basic VLA SLP auto-vectorization.
> Consider this following case:
> void
> f (uint8_t *restrict a, uint8_t *restrict b)
> {
>   for (int i = 0; i < 100; ++i)
> {
>   a[i * 8 + 0] = b[i * 8 + 7] + 1;
>   a[i * 8 + 1] = b[i * 8 + 7] + 2;
>   a[i * 8 + 2] = b[i * 8 + 7] + 8;
>   a[i * 8 + 3] = b[i * 8 + 7] + 4;
>   a[i * 8 + 4] = b[i * 8 + 7] + 5;
>   a[i * 8 + 5] = b[i * 8 + 7] + 6;
>   a[i * 8 + 6] = b[i * 8 + 7] + 7;
>   a[i * 8 + 7] = b[i * 8 + 7] + 3;
> }
> }
>
> To enable VLA SLP auto-vectorization, we should be able to handle the
> following const vectors:
>
> 1. NPATTERNS = 8, NELTS_PER_PATTERN = 3.
> { 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8, 16, 16, 16, 16, 16, 16, 16, 
> 16, ... }
>
> 2. NPATTERNS = 8, NELTS_PER_PATTERN = 1.
> { 1, 2, 8, 4, 5, 6, 7, 3, ... }
>
> And these vectors can be generated in the prologue.
>
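In scalar form, the connection between those two constants and the source
loop is the following (an illustrative sketch over a flat index, not part
of the patch):

#include <stdint.h>
#include <stddef.h>

static void
f_scalar (uint8_t *a, const uint8_t *b)
{
  /* The first constant is the gather base (i / 8) * 8; adding 7 selects
     lane 7 of each 8-element group.  The second constant is the addend
     pattern repeated for every group.  */
  static const uint8_t addend[8] = { 1, 2, 8, 4, 5, 6, 7, 3 };
  for (size_t i = 0; i < 800; i++)
    a[i] = b[(i / 8) * 8 + 7] + addend[i % 8];
}
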
> After this patch, we end up with the following codegen:
>
> Prologue:
> ...
> vsetvli a7,zero,e16,m2,ta,ma
> vid.v   v4
> vsrl.vi v4,v4,3
> li  a3,8
> vmul.vx v4,v4,a3  ===> v4 = { 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 
> 8, 8, 8, 16, 16, 16, 16, 16, 16, 16, 16, ... }
> ...
> li  t1,67633152
> addit1,t1,513
> li  a3,50790400
> addia3,a3,1541
> sllia3,a3,32
> add a3,a3,t1
> vsetvli t1,zero,e64,m1,ta,ma
> vmv.v.x v3,a3   ===> v3 = { 1, 2, 8, 4, 5, 6, 7, 3, ... }
> ...
> LoopBody:
> ...
> min a3,...
> vsetvli zero,a3,e8,m1,ta,ma
> vle8.v  v2,0(a6)
> vsetvli a7,zero,e8,m1,ta,ma
> vrgatherei16.vv v1,v2,v4
> vadd.vv v1,v1,v3
> vsetvli zero,a3,e8,m1,ta,ma
> vse8.v  v1,0(a2)
> add a6,a6,a4
> add a2,a2,a4
> mv  a3,a5
> add a5,a5,t1
> bgtu    a3,a4,.L3
> ...
>
> Note: we need to use "vrgatherei16.vv" instead of "vrgather.vv" for SEW = 8 
> since "vrgatherei16.vv" can cover larger
>   range than "vrgather.vv" (which only can maximum element index = 255).
> Epilogue:
> lbu a5,799(a1)
> addiw   a4,a5,1
> sb  a4,792(a0)
> addiw   a4,a5,2
> sb  a4,793(a0)
> addiw   a4,a5,8
> sb  a4,794(a0)
> addiw   a4,a5,4
> sb  a4,795(a0)
> addiw   a4,a5,5
> sb  a4,796(a0)
> addiw   a4,a5,6
> sb  a4,797(a0)
> addiw   a4,a5,7
> sb  a4,798(a0)
> addiw   a5,a5,3
> sb  a5,799(a0)
> ret
>
> The one last thing we still need to do is "Epilogue
> auto-vectorization", which needs VLS modes support.
> I will support VLS modes for "Epilogue auto-vectorization" in the future.

What's the epilogue generated for?  With a VLA main loop body you
shouldn't have one apart from
when that body isn't entered because of cost or alias reasons?

>
> gcc/ChangeLog:
>
> * config/riscv/riscv-protos.h (expand_vec_perm_const): New function.
> * config/riscv/riscv-v.cc 
> (rvv_builder::can_duplicate_repeating_sequence_p): Support POLY handling.
> (rvv_builder::single_step_npatterns_p): New function.
> (rvv_builder::npatterns_all_equal_p): Ditto.
> (const_vec_all_in_range_p): Support POLY handling.
> (gen_const_vector_dup): Ditto.
> (emit_vlmax_gather_insn): Add vrgatherei16.
> (emit_vlmax_masked_gather_mu_insn): Ditto.
> (expand_const_vector): Add VLA SLP const vector support.
> (expand_vec_perm): Support POLY.
> (struct expand_vec_perm_d): New struct.
> (shuffle_generic_patterns): New function.
> (expand_vec_perm_const_1): Ditto.
> (expand_vec_perm_const): Ditto.
> * config/riscv/riscv.cc (riscv_vectorize_vec_perm_const): Ditto.
> (TARGET_VECTORIZE_VEC_PERM_CONST): New targethook.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/riscv/rvv/autovec/scalable-1.c: Adapt testcase for VLA 
> vectorizer.
> * gcc.target/riscv/rvv/autovec/v-1.c: Ditto.
> * gcc.target/riscv/rvv/autovec/zve32f_zvl128b-1.c: Ditto.
> * gcc.target/riscv/rvv/autovec/zve32x_zvl128b-1.c: Ditto.
> * gcc.target/riscv/rvv/autovec/zve64d-1.c: Ditto.
> * gcc.target/riscv/rvv/autovec/zve64d_zvl128b-1.c: Ditto.
> * gcc.target/riscv/rvv/autovec/zve64f-1.c: Ditto.
> * gcc.target/riscv/rvv/autovec/zve64f_zvl128b-1.c: Ditto.
> * gcc.target/riscv/rvv/autovec/zve64x_zvl128b-1.c: Ditto.
> * gcc.target/riscv/rvv/autovec/partial/slp-1.c: New test.
> * gcc.target/riscv/rvv/autovec/partial/slp-2.c: New test.
> * gcc.target/riscv/rvv/autovec/partial/slp-3.c: New test.
> * gcc.target/riscv/rvv/autovec/partial/slp-4.c: New test.
> * gcc.target/riscv/rvv/autovec/partial/slp-5.c: New test.
> 

[PATCH] middle-end/110055 - avoid CLOBBERing static variables

2023-06-06 Thread Richard Biener via Gcc-patches
The gimplifier can elide initialized constant automatic variables
to static storage in which case TARGET_EXPR gimplification needs
to avoid emitting a CLOBBER for them since their lifetime is no
longer limited.  Failing to do so causes spurious dangling-pointer
diagnostics on the added testcase for some targets.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

PR middle-end/110055
* gimplify.cc (gimplify_target_expr): Do not emit
CLOBBERs for variables which have static storage duration
after gimplifying their initializers.

* g++.dg/warn/Wdangling-pointer-pr110055.C: New testcase.
---
 gcc/gimplify.cc  |  4 +++-
 .../g++.dg/warn/Wdangling-pointer-pr110055.C | 16 
 2 files changed, 19 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/g++.dg/warn/Wdangling-pointer-pr110055.C

diff --git a/gcc/gimplify.cc b/gcc/gimplify.cc
index d0d16a24820..d7cfa6321a0 100644
--- a/gcc/gimplify.cc
+++ b/gcc/gimplify.cc
@@ -7173,8 +7173,10 @@ gimplify_target_expr (tree *expr_p, gimple_seq *pre_p, 
gimple_seq *post_p)
gimplify_and_add (init, &init_pre_p);
 
   /* Add a clobber for the temporary going out of scope, like
-gimplify_bind_expr.  */
+gimplify_bind_expr.  But only if we did not promote the
+temporary to static storage.  */
   if (gimplify_ctxp->in_cleanup_point_expr
+ && !TREE_STATIC (temp)
  && needs_to_live_in_memory (temp))
{
  if (flag_stack_reuse == SR_ALL)
diff --git a/gcc/testsuite/g++.dg/warn/Wdangling-pointer-pr110055.C 
b/gcc/testsuite/g++.dg/warn/Wdangling-pointer-pr110055.C
new file mode 100644
index 000..77dbbf380b6
--- /dev/null
+++ b/gcc/testsuite/g++.dg/warn/Wdangling-pointer-pr110055.C
@@ -0,0 +1,16 @@
+// { dg-do compile }
+// { dg-require-effective-target c++11 }
+// { dg-options "-O3 -fno-exceptions -Wdangling-pointer" }
+
+#include <memory>
+#include <vector>
+
+struct Data {
+  std::vector<int> v = {1, 1};
+};
+
+int main()
+{
+  Data a;
+  Data b;
+}
-- 
2.35.3


[PATCH] tree-optimization/109143 - improve PTA compile time

2023-06-06 Thread Richard Biener via Gcc-patches
The following improves solution_set_expand to require one less
iteration over the bitmap and avoid changing the bitmap we iterate
over.  Plus we handle adjacent subvars in the ID space (the common case)
and use bitmap_set_range.  This cuts a bit less than 10% off the PTA
time from the testcase in the PR.

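The run-coalescing idea, as a stand-alone sketch with plain C++ containers
rather than the real bitmap API:

#include <vector>
#include <utility>

/* Turn sorted IDs into (start, length) runs so each run can be set
   with one bitmap_set_range-style call instead of per-bit calls.  */
std::vector<std::pair<unsigned, unsigned> >
coalesce_runs (const std::vector<unsigned> &sorted_ids)
{
  std::vector<std::pair<unsigned, unsigned> > runs;
  for (unsigned id : sorted_ids)
    if (!runs.empty () && runs.back ().first + runs.back ().second == id)
      runs.back ().second++;            /* adjacent: extend the current run */
    else
      runs.push_back (std::make_pair (id, 1u));   /* gap: start a new run */
  return runs;
}
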
Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

PR tree-optimization/109143
* tree-ssa-structalias.cc (solution_set_expand): Avoid
one bitmap iteration and optimize bit range setting.
---
 gcc/tree-ssa-structalias.cc | 38 -
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/gcc/tree-ssa-structalias.cc b/gcc/tree-ssa-structalias.cc
index 8db99a42565..ee9313c59ca 100644
--- a/gcc/tree-ssa-structalias.cc
+++ b/gcc/tree-ssa-structalias.cc
@@ -966,28 +966,40 @@ solution_set_expand (bitmap set, bitmap *expanded)
 
   *expanded = BITMAP_ALLOC (&iteration_obstack);
 
-  /* In a first pass expand to the head of the variables we need to
- add all sub-fields off.  This avoids quadratic behavior.  */
+  /* In a first pass expand variables, once for each head to avoid
+ quadratic behavior, to include all sub-fields.  */
+  unsigned prev_head = 0;
   EXECUTE_IF_SET_IN_BITMAP (set, 0, j, bi)
 {
   varinfo_t v = get_varinfo (j);
   if (v->is_artificial_var
  || v->is_full_var)
continue;
-  bitmap_set_bit (*expanded, v->head);
-}
+  if (v->head != prev_head)
+   {
+ varinfo_t head = get_varinfo (v->head);
+ unsigned num = 1;
+ for (varinfo_t n = vi_next (head); n != NULL; n = vi_next (n))
+   {
+ if (n->id != head->id + num)
+   {
+ /* Usually sub variables are adjacent but since we
+create pointed-to restrict representatives there
+can be gaps as well.  */
+ bitmap_set_range (*expanded, head->id, num);
+ head = n;
+ num = 1;
+   }
+ else
+   num++;
+   }
 
-  /* In the second pass now expand all head variables with subfields.  */
-  EXECUTE_IF_SET_IN_BITMAP (*expanded, 0, j, bi)
-{
-  varinfo_t v = get_varinfo (j);
-  if (v->head != j)
-   continue;
-  for (v = vi_next (v); v != NULL; v = vi_next (v))
-   bitmap_set_bit (*expanded, v->id);
+ bitmap_set_range (*expanded, head->id, num);
+ prev_head = v->head;
+   }
 }
 
-  /* And finally set the rest of the bits from SET.  */
+  /* And finally set the rest of the bits from SET in an efficient way.  */
   bitmap_ior_into (*expanded, set);
 
   return *expanded;
-- 
2.35.3


Re: [PATCH V3] VECT: Add SELECT_VL support

2023-06-07 Thread Richard Biener via Gcc-patches
On Mon, 5 Jun 2023, juzhe.zh...@rivai.ai wrote:

> From: Ju-Zhe Zhong 
> 
> Co-authored-by: Richard Sandiford
> 
> This patch addresses comments from Richard and rebases to trunk.
> 
> This patch adds SELECT_VL middle-end support to
> allow targets to apply target-dependent optimizations to the
> length calculation.
> 
> This patch is inspired by RVV ISA and LLVM:
> https://reviews.llvm.org/D99750
> 
> SELECT_VL has the same behavior as LLVM's "get_vector_length", with
> the following properties:
> 
> 1. Only apply on single-rgroup.
> 2. non SLP.
> 3. adjust loop control IV.
> 4. adjust data reference IV.
> 5. allow non-vf elements processing in non-final iteration
> 
> Code:
># void vvaddint32(size_t n, const int*x, const int*y, int*z)
> # { for (size_t i=0; i<n; i++) { z[i]=x[i]+y[i]; } }
> Take RVV codegen for example:
> 
> Before this patch:
> vvaddint32:
> ble a0,zero,.L6
> csrr    a4,vlenb
> srli    a6,a4,2
> .L4:
> mv  a5,a0
> bleu    a0,a6,.L3
> mv  a5,a6
> .L3:
> vsetvli zero,a5,e32,m1,ta,ma
> vle32.v v2,0(a1)
> vle32.v v1,0(a2)
> vsetvli a7,zero,e32,m1,ta,ma
> sub a0,a0,a5
> vadd.vv v1,v1,v2
> vsetvli zero,a5,e32,m1,ta,ma
> vse32.v v1,0(a3)
> add a2,a2,a4
> add a3,a3,a4
> add a1,a1,a4
> bne a0,zero,.L4
> .L6:
> ret
> 
> After this patch:
> 
> vvaddint32:
> vsetvli t0, a0, e32, ta, ma  # Set vector length based on 32-bit vectors
> vle32.v v0, (a1) # Get first vector
>   sub a0, a0, t0 # Decrement number done
>   slli t0, t0, 2 # Multiply number done by 4 bytes
>   add a1, a1, t0 # Bump pointer
> vle32.v v1, (a2) # Get second vector
>   add a2, a2, t0 # Bump pointer
> vadd.vv v2, v0, v1   # Sum vectors
> vse32.v v2, (a3) # Store result
>   add a3, a3, t0 # Bump pointer
>   bnez a0, vvaddint32# Loop back
>   ret# Finished
> 
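As a scalar model of the contract (a sketch only, not vectorizer code;
select_vl_sketch stands in for whatever the target returns, constrained
only by 0 < result <= MIN (remaining, vf)):

#include <stddef.h>

static size_t
select_vl_sketch (size_t remaining, size_t vf)
{
  return remaining < vf ? remaining : vf;   /* the maximum legal answer */
}

void
vvaddint32_sketch (size_t n, const int *x, const int *y, int *z)
{
  while (n > 0)
    {
      size_t vl = select_vl_sketch (n, 4);  /* vf = 4 assumed */
      for (size_t i = 0; i < vl; i++)       /* one "vector" iteration */
        z[i] = x[i] + y[i];
      x += vl; y += vl; z += vl;            /* data-reference IVs step by vl */
      n -= vl;                              /* loop-control IV steps by vl */
    }
}
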
> gcc/ChangeLog:
> 
> * doc/md.texi: Add SELECT_VL support.
> * internal-fn.def (SELECT_VL): Ditto.
> * optabs.def (OPTAB_D): Ditto.
> * tree-vect-loop-manip.cc (vect_set_loop_controls_directly): Ditto.
> * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Ditto.
> * tree-vect-stmts.cc (get_select_vl_data_ref_ptr): Ditto.
> (vectorizable_store): Ditto.
> (vectorizable_load): Ditto.
> * tree-vectorizer.h (LOOP_VINFO_USING_SELECT_VL_P): Ditto.
> 
> Co-authored-by: Richard Sandiford
> 
> ---
>  gcc/doc/md.texi | 22 
>  gcc/internal-fn.def |  1 +
>  gcc/optabs.def  |  1 +
>  gcc/tree-vect-loop-manip.cc | 32 -
>  gcc/tree-vect-loop.cc   | 72 +
>  gcc/tree-vect-stmts.cc  | 66 ++
>  gcc/tree-vectorizer.h   |  6 
>  7 files changed, 191 insertions(+), 9 deletions(-)
> 
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 6a435eb4461..95f7fe1f802 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -4974,6 +4974,28 @@ for (i = 1; i < operand3; i++)
>operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
>  @end smallexample
>  
> +@cindex @code{select_vl@var{m}} instruction pattern
> +@item @code{select_vl@var{m}}
> +Set operand 0 to the number of scalar iterations that should be handled
> +by one iteration of a vector loop.  Operand 1 is the total number of
> +scalar iterations that the loop needs to process and operand 2 is a
> +maximum bound on the result (also known as the maximum ``vectorization
> +factor'').
> +
> +The maximum value of operand 0 is given by:
> +@smallexample
> +operand0 = MIN (operand1, operand2)
> +@end smallexample
> +However, targets might choose a lower value than this, based on
> +target-specific criteria.  Each iteration of the vector loop might
> +therefore process a different number of scalar iterations, which in turn
> +means that induction variables will have a variable step.  Because of
> +this, it is generally not useful to define this instruction if it will
> +always calculate the maximum value.
> +
> +This optab is only useful on targets that implement @samp{len_load_@var{m}}
> +and/or @samp{len_store_@var{m}}.
> +
>  @cindex @code{check_raw_ptrs@var{m}} instruction pattern
>  @item @samp{check_raw_ptrs@var{m}}
>  Check whether, given two pointers @var{a} and @var{b} and a length @var{len},
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 7fe742c2ae7..6f6fa7d37f9 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -153,6 +153,7 @@ DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
>  DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
>  
>  DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
> +DEF_INTERNAL_OPTAB_FN (SELECT_VL, ECF_CONST | ECF_NOTHROW, select_

Re: Re: [PATCH V3] VECT: Add SELECT_VL support

2023-06-07 Thread Richard Biener via Gcc-patches
On Wed, 7 Jun 2023, juzhe.zh...@rivai.ai wrote:

> Hi, Richi. Since SELECT_VL only applies to single-rgroup (ncopies == 1 && 
> vec_num == 1),
> should I move the SELECT_VL stuff outside the loop?
> 
> for (i = 0; i < vec_num; i++)
>   for (j = 0; j < ncopies; j++)
> 

No, but please put assertions into the iteration so it's obvious
the SELECT_VL doesn't reach there.

> Thanks.
> 
> 
> juzhe.zh...@rivai.ai
>  
> From: Richard Biener
> Date: 2023-06-07 15:41
> To: Ju-Zhe Zhong
> CC: gcc-patches; richard.sandiford
> Subject: Re: [PATCH V3] VECT: Add SELECT_VL support
> On Mon, 5 Jun 2023, juzhe.zh...@rivai.ai wrote:
>  
> > From: Ju-Zhe Zhong 
> > 
> > Co-authored-by: Richard Sandiford
> > 
> > This patch addresses comments from Richard and rebases to trunk.
> > 
> > This patch adds SELECT_VL middle-end support to
> > allow targets to apply target-dependent optimizations to the
> > length calculation.
> > 
> > This patch is inspired by RVV ISA and LLVM:
> > https://reviews.llvm.org/D99750
> > 
> > SELECT_VL has the same behavior as LLVM's "get_vector_length", with
> > the following properties:
> > 
> > 1. Only apply on single-rgroup.
> > 2. non SLP.
> > 3. adjust loop control IV.
> > 4. adjust data reference IV.
> > 5. allow non-vf elements processing in non-final iteration
> > 
> > Code:
> ># void vvaddint32(size_t n, const int*x, const int*y, int*z)
> > # { for (size_t i=0; i<n; i++) { z[i]=x[i]+y[i]; } }
> > Take RVV codegen for example:
> > 
> > Before this patch:
> > vvaddint32:
> > ble a0,zero,.L6
> > csrr    a4,vlenb
> > srli    a6,a4,2
> > .L4:
> > mv  a5,a0
> > bleu    a0,a6,.L3
> > mv  a5,a6
> > .L3:
> > vsetvli zero,a5,e32,m1,ta,ma
> > vle32.v v2,0(a1)
> > vle32.v v1,0(a2)
> > vsetvli a7,zero,e32,m1,ta,ma
> > sub a0,a0,a5
> > vadd.vv v1,v1,v2
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a3)
> > add a2,a2,a4
> > add a3,a3,a4
> > add a1,a1,a4
> > bne a0,zero,.L4
> > .L6:
> > ret
> > 
> > After this patch:
> > 
> > vvaddint32:
> > vsetvli t0, a0, e32, ta, ma  # Set vector length based on 32-bit vectors
> > vle32.v v0, (a1) # Get first vector
> >   sub a0, a0, t0 # Decrement number done
> >   slli t0, t0, 2 # Multiply number done by 4 bytes
> >   add a1, a1, t0 # Bump pointer
> > vle32.v v1, (a2) # Get second vector
> >   add a2, a2, t0 # Bump pointer
> > vadd.vv v2, v0, v1   # Sum vectors
> > vse32.v v2, (a3) # Store result
> >   add a3, a3, t0 # Bump pointer
> >   bnez a0, vvaddint32# Loop back
> >   ret# Finished
> > 
> > gcc/ChangeLog:
> > 
> > * doc/md.texi: Add SELECT_VL support.
> > * internal-fn.def (SELECT_VL): Ditto.
> > * optabs.def (OPTAB_D): Ditto.
> > * tree-vect-loop-manip.cc (vect_set_loop_controls_directly): Ditto.
> > * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Ditto.
> > * tree-vect-stmts.cc (get_select_vl_data_ref_ptr): Ditto.
> > (vectorizable_store): Ditto.
> > (vectorizable_load): Ditto.
> > * tree-vectorizer.h (LOOP_VINFO_USING_SELECT_VL_P): Ditto.
> > 
> > Co-authored-by: Richard Sandiford
> > 
> > ---
> >  gcc/doc/md.texi | 22 
> >  gcc/internal-fn.def |  1 +
> >  gcc/optabs.def  |  1 +
> >  gcc/tree-vect-loop-manip.cc | 32 -
> >  gcc/tree-vect-loop.cc   | 72 +
> >  gcc/tree-vect-stmts.cc  | 66 ++
> >  gcc/tree-vectorizer.h   |  6 
> >  7 files changed, 191 insertions(+), 9 deletions(-)
> > 
> > diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> > index 6a435eb4461..95f7fe1f802 100644
> > --- a/gcc/doc/md.texi
> > +++ b/gcc/doc/md.texi
> > @@ -4974,6 +4974,28 @@ for (i = 1; i < operand3; i++)
> >operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
> >  @end smallexample
> >  
> > +@cindex @code{select_vl@var{m}} instruction pattern
> > +@item @code{select_vl@var{m}}
> > +Set operand 0 to the number of scalar iterations that should be handled
> > +by one iteration of a vector loop.  Operand 1 is the total number of
> > +scalar iterations that the loop needs to process and operand 2 is a
> > +maximum bound on the result (also known as the maximum ``vectorization
> > +factor'').
> > +
> > +The maximum value of operand 0 is given by:
> > +@smallexample
> > +operand0 = MIN (operand1, operand2)
> > +@end smallexample
> > +However, targets might choose a lower value than this, based on
> > +target-specific criteria.  Each iteration of the vector loop might
> > +therefore process a different number of scalar iterations, which in turn
> > +means that induction variables will have a variable step.  Because of
> > +this, it is generally not useful to define t

Re: [PATCH] libgcc: Fix eh_frame fast path in find_fde_tail

2023-06-07 Thread Richard Biener via Gcc-patches
On Tue, Jun 6, 2023 at 11:53 AM Florian Weimer via Gcc-patches
 wrote:
>
> The eh_frame value is only used by linear_search_fdes, not the binary
> search directly in find_fde_tail, so the bug is not immediately
> apparent with most programs.
>
> Fixes commit e724b0480bfa5ec04f39be8c7290330b495c59de ("libgcc:
> Special-case BFD ld unwind table encodings in find_fde_tail").

OK.

> [I'd appreciate suggestions how I could add a test for this.  BFD ld
> does not seem to allow omitting the binary search table.]
>
> libgcc/
>
> PR libgcc/109712
> * unwind-dw2-fde-dip.c (find_fde_tail): Correct fast path for
> parsing eh_frame.
>
> ---
>  libgcc/unwind-dw2-fde-dip.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/libgcc/unwind-dw2-fde-dip.c b/libgcc/unwind-dw2-fde-dip.c
> index 6223f5f18a2..4e0b880513f 100644
> --- a/libgcc/unwind-dw2-fde-dip.c
> +++ b/libgcc/unwind-dw2-fde-dip.c
> @@ -403,8 +403,8 @@ find_fde_tail (_Unwind_Ptr pc,
>  BFD ld generates.  */
>signed value __attribute__ ((mode (SI)));
>memcpy (&value, p, sizeof (value));
> +  eh_frame = p + value;
>p += sizeof (value);
> -  dbase = value;   /* No adjustment because pcrel has base 0.  */
>  }
>else
>  p = read_encoded_value_with_base (hdr->eh_frame_ptr_enc,
>
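The decoding the fixed line performs, spelled out as a sketch (assuming a
32-bit int; pcrel means the base is the address of the field itself):

#include <string.h>

static const unsigned char *
read_pcrel_sdata4_sketch (const unsigned char *p)
{
  int disp;                        /* signed 32-bit displacement */
  memcpy (&disp, p, sizeof disp);  /* the field may be unaligned */
  return p + disp;                 /* i.e. eh_frame = p + value */
}
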
> base-commit: b327cbe8f4eefc91ee2bea49a1da7128adf30281
>


Re: [PATCH] i386: Fix endless recursion in ix86_expand_vector_init_general with MMX [PR110152]

2023-06-07 Thread Richard Biener via Gcc-patches



> Am 07.06.2023 um 18:52 schrieb Jakub Jelinek via Gcc-patches 
> :
> 
> Hi!
> 
> I'm getting
> +FAIL: gcc.target/i386/3dnow-1.c (internal compiler error: Segmentation fault 
> signal terminated program cc1)
> +FAIL: gcc.target/i386/3dnow-1.c (test for excess errors)
> +FAIL: gcc.target/i386/3dnow-2.c (internal compiler error: Segmentation fault 
> signal terminated program cc1)
> +FAIL: gcc.target/i386/3dnow-2.c (test for excess errors)
> +FAIL: gcc.target/i386/mmx-1.c (internal compiler error: Segmentation fault 
> signal terminated program cc1)
> +FAIL: gcc.target/i386/mmx-1.c (test for excess errors)
> +FAIL: gcc.target/i386/mmx-2.c (internal compiler error: Segmentation fault 
> signal terminated program cc1)
> +FAIL: gcc.target/i386/mmx-2.c (test for excess errors)
> regressions on i686-linux since r14-1166.  The problem is when
> ix86_expand_vector_init_general is called with mmx_ok = true and
> mode = V4HImode, it newly recurses with mmx_ok = false and mode = V2SImode,
> but as mmx_ok is false and !TARGET_SSE, we recurse again with the same
> arguments (ok, fresh new tmp and vals) infinitely.
> The following patch fixes that by passing mmx_ok to that recursive call.
> For n_words == 4 it isn't needed, because we only care about mmx_ok for
> V2SImode or V2SFmode and no other modes.
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

Ok.

Richard 

> 2023-06-07  Jakub Jelinek  
> 
>PR target/110152
>* config/i386/i386-expand.cc (ix86_expand_vector_init_general): For
>n_words == 2 recurse with mmx_ok as first argument rather than false.
> 
> --- gcc/config/i386/i386-expand.cc.jj2023-06-03 15:32:04.489410367 +0200
> +++ gcc/config/i386/i386-expand.cc2023-06-07 10:31:34.715981752 +0200
> @@ -16371,7 +16371,7 @@ quarter:
>  machine_mode concat_mode = tmp_mode == DImode ? V2DImode : V2SImode;
>  rtx tmp = gen_reg_rtx (concat_mode);
>  vals = gen_rtx_PARALLEL (concat_mode, gen_rtvec_v (2, words));
> -  ix86_expand_vector_init_general (false, concat_mode, tmp, vals);
> +  ix86_expand_vector_init_general (mmx_ok, concat_mode, tmp, vals);
>  emit_move_insn (target, gen_lowpart (mode, tmp));
>}
>   else if (n_words == 4)
> 
>Jakub
> 


Re: [PATCH] optabs: Implement double-word ctz and ffs expansion

2023-06-07 Thread Richard Biener via Gcc-patches



> Am 07.06.2023 um 18:59 schrieb Jakub Jelinek via Gcc-patches 
> :
> 
> Hi!
> 
> We have had expand_doubleword_clz for a couple of years, where we emit
> double-word CLZ as if (high_word == 0) return CLZ (low_word) + word_size;
> else return CLZ (high_word);
> We can do something similar for CTZ and FFS IMHO, just with the 2
> words swapped.  So if (low_word == 0) return CTZ (high_word) + word_size;
> else return CTZ (low_word); for CTZ and
> if (low_word == 0) return high_word ? FFS (high_word) + word_size : 0;
> else return FFS (low_word);
> 
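A minimal sketch of that decomposition in plain C, assuming 32-bit words
and a 64-bit double-word (illustrative, not the optabs code):

#include <stdint.h>

static int
ctz64_sketch (uint64_t x)          /* x != 0 assumed, as for CTZ */
{
  uint32_t lo = (uint32_t) x, hi = (uint32_t) (x >> 32);
  if (lo == 0)
    return __builtin_ctz (hi) + 32;      /* low word zero: count in high */
  return __builtin_ctz (lo);
}

static int
ffs64_sketch (uint64_t x)
{
  uint32_t lo = (uint32_t) x, hi = (uint32_t) (x >> 32);
  if (lo == 0)
    return hi ? __builtin_ffs ((int) hi) + 32 : 0;  /* ffs (0) must be 0 */
  return __builtin_ffs ((int) lo);
}
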
> The following patch implements that.
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

Ok

Richard 
> Note, on some targets which implement both word_mode ctz and ffs patterns,
> it might be better to incrementally implement those double-word ffs expansion
> patterns in md files, because we aren't able to optimize it correctly;
> nothing can detect we have just made sure that argument is not 0 and so
> don't need to bother with handling that case.  So, on ia32 just using
> CTZ patterns would be better there, but I think we can even do better and
> instead of doing the comparisons of the operands against 0 do the CTZ
> expansion followed by testing of flags.
> 
> 2023-06-07  Jakub Jelinek  
> 
>* optabs.cc (expand_ffs): Add forward declaration.
>(expand_doubleword_clz): Rename to ...
>(expand_doubleword_clz_ctz_ffs): ... this.  Add UNOPTAB argument,
>handle also doubleword CTZ and FFS in addition to CLZ.
>(expand_unop): Adjust caller.  Also call it for doubleword
>ctz_optab and ffs_optab.
> 
>* gcc.target/i386/ctzll-1.c: New test.
>* gcc.target/i386/ffsll-1.c: New test.
> 
> --- gcc/optabs.cc.jj2023-06-07 09:42:14.701130305 +0200
> +++ gcc/optabs.cc2023-06-07 14:35:04.909879272 +0200
> @@ -2697,10 +2697,14 @@ expand_clrsb_using_clz (scalar_int_mode
>   return temp;
> }
> 
> -/* Try calculating clz of a double-word quantity as two clz's of word-sized
> -   quantities, choosing which based on whether the high word is nonzero.  */
> +static rtx expand_ffs (scalar_int_mode, rtx, rtx);
> +
> +/* Try calculating clz, ctz or ffs of a double-word quantity as two clz, ctz 
> or
> +   ffs operations on word-sized quantities, choosing which based on whether 
> the
> +   high (for clz) or low (for ctz and ffs) word is nonzero.  */
> static rtx
> -expand_doubleword_clz (scalar_int_mode mode, rtx op0, rtx target)
> +expand_doubleword_clz_ctz_ffs (scalar_int_mode mode, rtx op0, rtx target,
> +   optab unoptab)
> {
>   rtx xop0 = force_reg (mode, op0);
>   rtx subhi = gen_highpart (word_mode, xop0);
> @@ -2709,6 +2713,7 @@ expand_doubleword_clz (scalar_int_mode m
>   rtx_code_label *after_label = gen_label_rtx ();
>   rtx_insn *seq;
>   rtx temp, result;
> +  int addend = 0;
> 
>   /* If we were not given a target, use a word_mode register, not a
>  'mode' register.  The result will fit, and nobody is expecting
> @@ -2721,6 +2726,9 @@ expand_doubleword_clz (scalar_int_mode m
>  'target' to tag a REG_EQUAL note on.  */
>   result = gen_reg_rtx (word_mode);
> 
> +  if (unoptab != clz_optab)
> +std::swap (subhi, sublo);
> +
>   start_sequence ();
> 
>   /* If the high word is not equal to zero,
> @@ -2728,7 +2736,13 @@ expand_doubleword_clz (scalar_int_mode m
>   emit_cmp_and_jump_insns (subhi, CONST0_RTX (word_mode), EQ, 0,
>   word_mode, true, hi0_label);
> 
> -  temp = expand_unop_direct (word_mode, clz_optab, subhi, result, true);
> +  if (optab_handler (unoptab, word_mode) != CODE_FOR_nothing)
> +temp = expand_unop_direct (word_mode, unoptab, subhi, result, true);
> +  else
> +{
> +  gcc_assert (unoptab == ffs_optab);
> +  temp = expand_ffs (word_mode, subhi, result);
> +}
>   if (!temp)
> goto fail;
> 
> @@ -2739,14 +2753,32 @@ expand_doubleword_clz (scalar_int_mode m
>   emit_barrier ();
> 
>   /* Else clz of the full value is clz of the low word plus the number
> - of bits in the high word.  */
> + of bits in the high word.  Similarly for ctz/ffs of the high word,
> + except that ffs should be 0 when both words are zero.  */
>   emit_label (hi0_label);
> 
> -  temp = expand_unop_direct (word_mode, clz_optab, sublo, 0, true);
> +  if (unoptab == ffs_optab)
> +{
> +  convert_move (result, const0_rtx, true);
> +  emit_cmp_and_jump_insns (sublo, CONST0_RTX (word_mode), EQ, 0,
> +   word_mode, true, after_label);
> +}
> +
> +  if (optab_handler (unoptab, word_mode) != CODE_FOR_nothing)
> +temp = expand_unop_direct (word_mode, unoptab, sublo, NULL_RTX, true);
> +  else
> +{
> +  gcc_assert (unoptab == ffs_optab);
> +  temp = expand_unop_direct (word_mode, ctz_optab, sublo, NULL_RTX, 
> true);
> +  addend = 1;
> +}
> +
>   if (!temp)
> goto fail;
> +
>   temp = expand_binop (word_mode, add_optab, temp,
> -   gen_int_mode (GET_MODE_BITSIZE (word_mode), word_mode),
> +

[PATCH] middle-end/110182 - TYPE_PRECISION on VECTOR_TYPE causes wrong-code

2023-06-09 Thread Richard Biener via Gcc-patches
When folding two conversions in a row we use TYPE_PRECISION but
that's invalid for VECTOR_TYPE.  The following fixes this by
using element_precision instead.

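The difference, modelled with toy types rather than GCC's tree API (a
sketch of the idea only):

struct toy_type
{
  unsigned precision;           /* meaningful for scalar types only */
  const toy_type *element;      /* non-null for vector types */
};

static unsigned
element_precision_sketch (const toy_type &t)
{
  /* A vector type defers to the precision of one lane; a scalar type
     reports its own precision.  */
  return t.element ? t.element->precision : t.precision;
}
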
Bootstrap and regtest running on x86_64-unknown-linux-gnu.

* match.pd (two conversions in a row): Use element_precision
to DTRT for VECTOR_TYPE.
---
 gcc/match.pd | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/match.pd b/gcc/match.pd
index 4ad037d641a..4072afb474a 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -4147,19 +4147,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   int inside_ptr = POINTER_TYPE_P (inside_type);
   int inside_float = FLOAT_TYPE_P (inside_type);
   int inside_vec = VECTOR_TYPE_P (inside_type);
-  unsigned int inside_prec = TYPE_PRECISION (inside_type);
+  unsigned int inside_prec = element_precision (inside_type);
   int inside_unsignedp = TYPE_UNSIGNED (inside_type);
   int inter_int = INTEGRAL_TYPE_P (inter_type);
   int inter_ptr = POINTER_TYPE_P (inter_type);
   int inter_float = FLOAT_TYPE_P (inter_type);
   int inter_vec = VECTOR_TYPE_P (inter_type);
-  unsigned int inter_prec = TYPE_PRECISION (inter_type);
+  unsigned int inter_prec = element_precision (inter_type);
   int inter_unsignedp = TYPE_UNSIGNED (inter_type);
   int final_int = INTEGRAL_TYPE_P (type);
   int final_ptr = POINTER_TYPE_P (type);
   int final_float = FLOAT_TYPE_P (type);
   int final_vec = VECTOR_TYPE_P (type);
-  unsigned int final_prec = TYPE_PRECISION (type);
+  unsigned int final_prec = element_precision (type);
   int final_unsignedp = TYPE_UNSIGNED (type);
 }
(switch
-- 
2.35.3


[PATCH] Prevent TYPE_PRECISION on VECTOR_TYPEs

2023-06-09 Thread Richard Biener via Gcc-patches
The following makes sure that using TYPE_PRECISION on VECTOR_TYPE
ICEs when tree checking is enabled.  This should avoid wrong-code
in cases like PR110182 and instead ICE.

Bootstrap and regtest pending on x86_64-unknown-linux-gnu, I guess
there will be some fallout of such change ...

* tree.h (TYPE_PRECISION): Check for non-VECTOR_TYPE.
---
 gcc/tree.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/tree.h b/gcc/tree.h
index 1854fe4a7d4..9c525d14474 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -2191,7 +2191,8 @@ class auto_suppress_location_wrappers
 #define TYPE_SIZE_UNIT(NODE) (TYPE_CHECK (NODE)->type_common.size_unit)
 #define TYPE_POINTER_TO(NODE) (TYPE_CHECK (NODE)->type_common.pointer_to)
 #define TYPE_REFERENCE_TO(NODE) (TYPE_CHECK (NODE)->type_common.reference_to)
-#define TYPE_PRECISION(NODE) (TYPE_CHECK (NODE)->type_common.precision)
+#define TYPE_PRECISION(NODE) \
+  (TREE_NOT_CHECK (TYPE_CHECK (NODE), VECTOR_TYPE)->type_common.precision)
 #define TYPE_NAME(NODE) (TYPE_CHECK (NODE)->type_common.name)
 #define TYPE_NEXT_VARIANT(NODE) (TYPE_CHECK (NODE)->type_common.next_variant)
 #define TYPE_MAIN_VARIANT(NODE) (TYPE_CHECK (NODE)->type_common.main_variant)
-- 
2.35.3


Re: [PATCH] Make sure SCALAR_INT_MODE_P before invoke try_const_anchors

2023-06-09 Thread Richard Biener via Gcc-patches
On Fri, 9 Jun 2023, Jiufu Guo wrote:

> Hi,
> 
> Checking the code, there is a "gcc_assert (SCALAR_INT_MODE_P (mode))"
> in "try_const_anchors".
> This assert seems correct because the function try_const_anchors cares
> about integer values currently, and modes other than SCALAR_INT_MODE_P
> do not need to be supported.
> 
> This patch makes sure SCALAR_INT_MODE_P holds when calling try_const_anchors.
> 
> This patch is raised when drafting below one.
> https://gcc.gnu.org/pipermail/gcc-patches/2022-October/603530.html.
> With that patch, "{[%1:DI]=0;} stack_tie" with BLKmode runs into
> try_const_anchors, and hits the assert/ice.
> 
> Boostrap and regtest pass on ppc64{,le} and x86_64.
> Is this ok for trunk?

Iff the correct fix at all (how can a CONST_INT have BLKmode?) then
I suggest to instead fix try_const_anchors to change

  /* CONST_INT is used for CC modes, but we should leave those alone.  */
  if (GET_MODE_CLASS (mode) == MODE_CC)
return NULL_RTX;

  gcc_assert (SCALAR_INT_MODE_P (mode));

to

  /* CONST_INT is used for CC modes, leave any non-scalar-int mode alone.  */
  if (!SCALAR_INT_MODE_P (mode))
return NULL_RTX;

but as said I wonder how we arrive at a BLKmode CONST_INT and whether
we should have fended this off earlier.  Can you share more complete
RTL of that stack_tie?

> 
> BR,
> Jeff (Jiufu Guo)
> 
> gcc/ChangeLog:
> 
>   * cse.cc (cse_insn): Add SCALAR_INT_MODE_P condition.
> 
> ---
>  gcc/cse.cc | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/gcc/cse.cc b/gcc/cse.cc
> index 2bb63ac4105..f213fa0faf7 100644
> *** a/gcc/cse.cc
> --- b/gcc/cse.cc
> ***
> *** 5003,5009 
> if (targetm.const_anchor
> && !src_related
> && src_const
> !   && GET_CODE (src_const) == CONST_INT)
>   {
> src_related = try_const_anchors (src_const, mode);
> src_related_is_const_anchor = src_related != NULL_RTX;
> - - 
> --- 5003,5010 
> if (targetm.const_anchor
> && !src_related
> && src_const
> !   && GET_CODE (src_const) == CONST_INT
> !   && SCALAR_INT_MODE_P (mode))
>   {
> src_related = try_const_anchors (src_const, mode);
> src_related_is_const_anchor = src_related != NULL_RTX;
> 2.39.3
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)


Re: [PATCH V5] VECT: Add SELECT_VL support

2023-06-09 Thread Richard Biener via Gcc-patches
On Thu, 8 Jun 2023, juzhe.zh...@rivai.ai wrote:

> From: Ju-Zhe Zhong 
> 
> Co-authored-by: Richard Sandiford
> Co-authored-by: Richard Biener 
> 
> This patch addresses comments from Richard && Richi and rebases to trunk.
> 
> This patch adds SELECT_VL middle-end support to
> allow targets to apply target-dependent optimizations to the
> length calculation.
> 
> This patch is inspired by RVV ISA and LLVM:
> https://reviews.llvm.org/D99750
> 
> SELECT_VL has the same behavior as LLVM's "get_vector_length", with
> the following properties:
> 
> 1. Only apply on single-rgroup.
> 2. non SLP.
> 3. adjust loop control IV.
> 4. adjust data reference IV.
> 5. allow non-vf elements processing in non-final iteration
> 
> Code:
># void vvaddint32(size_t n, const int*x, const int*y, int*z)
> # { for (size_t i=0; i<n; i++) { z[i]=x[i]+y[i]; } }
> Take RVV codegen for example:
> 
> Before this patch:
> vvaddint32:
> ble a0,zero,.L6
> csrr    a4,vlenb
> srli    a6,a4,2
> .L4:
> mv  a5,a0
> bleu    a0,a6,.L3
> mv  a5,a6
> .L3:
> vsetvli zero,a5,e32,m1,ta,ma
> vle32.v v2,0(a1)
> vle32.v v1,0(a2)
> vsetvli a7,zero,e32,m1,ta,ma
> sub a0,a0,a5
> vadd.vv v1,v1,v2
> vsetvli zero,a5,e32,m1,ta,ma
> vse32.v v1,0(a3)
> add a2,a2,a4
> add a3,a3,a4
> add a1,a1,a4
> bne a0,zero,.L4
> .L6:
> ret
> 
> After this patch:
> 
> vvaddint32:
> vsetvli t0, a0, e32, ta, ma  # Set vector length based on 32-bit vectors
> vle32.v v0, (a1) # Get first vector
>   sub a0, a0, t0 # Decrement number done
>   slli t0, t0, 2 # Multiply number done by 4 bytes
>   add a1, a1, t0 # Bump pointer
> vle32.v v1, (a2) # Get second vector
>   add a2, a2, t0 # Bump pointer
> vadd.vv v2, v0, v1   # Sum vectors
> vse32.v v2, (a3) # Store result
>   add a3, a3, t0 # Bump pointer
>   bnez a0, vvaddint32# Loop back
>   ret# Finished
> 
> gcc/ChangeLog:
> 
> * doc/md.texi: Add SELECT_VL support.
> * internal-fn.def (SELECT_VL): Ditto.
> * optabs.def (OPTAB_D): Ditto.
> * tree-vect-loop-manip.cc (vect_set_loop_controls_directly): Ditto.
> * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Ditto.
> * tree-vect-stmts.cc (get_select_vl_data_ref_ptr): Ditto.
> (vectorizable_store): Ditto.
> (vectorizable_load): Ditto.
> * tree-vectorizer.h (LOOP_VINFO_USING_SELECT_VL_P): Ditto.
>
> Co-authored-by: Richard Sandiford 
> Co-authored-by: Richard Biener 
> 
> ---
>  gcc/doc/md.texi | 22 ++
>  gcc/internal-fn.def |  1 +
>  gcc/optabs.def  |  1 +
>  gcc/tree-vect-loop-manip.cc | 32 ++
>  gcc/tree-vect-loop.cc   | 72 +++
>  gcc/tree-vect-stmts.cc  | 86 -
>  gcc/tree-vectorizer.h   |  6 +++
>  7 files changed, 201 insertions(+), 19 deletions(-)
> 
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 6a435eb4461..95f7fe1f802 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -4974,6 +4974,28 @@ for (i = 1; i < operand3; i++)
>operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
>  @end smallexample
>  
> +@cindex @code{select_vl@var{m}} instruction pattern
> +@item @code{select_vl@var{m}}
> +Set operand 0 to the number of scalar iterations that should be handled
> +by one iteration of a vector loop.  Operand 1 is the total number of
> +scalar iterations that the loop needs to process and operand 2 is a
> +maximum bound on the result (also known as the maximum ``vectorization
> +factor'').
> +
> +The maximum value of operand 0 is given by:
> +@smallexample
> +operand0 = MIN (operand1, operand2)
> +@end smallexample
> +However, targets might choose a lower value than this, based on
> +target-specific criteria.  Each iteration of the vector loop might
> +therefore process a different number of scalar iterations, which in turn
> +means that induction variables will have a variable step.  Because of
> +this, it is generally not useful to define this instruction if it will
> +always calculate the maximum value.
> +
> +This optab is only useful on targets that implement @samp{len_load_@var{m}}
> +and/or @samp{len_store_@var{m}}.
> +
>  @cindex @code{check_raw_ptrs@var{m}} instruction pattern
>  @item @samp{check_raw_ptrs@var{m}}
>  Check whether, given two pointers @var{a} and @var{b} and a length @var{len},
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 3ac9d82aace..5d638de6d06 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -177,6 +177,7 @@ DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
>  DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
>  
>  DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)

Re: [PATCH V2] Optimize '(X - N * M) / N' to 'X / N - M' if valid

2023-06-09 Thread Richard Biener via Gcc-patches
On Wed, 7 Jun 2023, Jiufu Guo wrote:

> Hi,
> 
> This patch tries to optimize "(X - N * M) / N" to "X / N - M".
> For C code, "/" towards zero (trunc_div), and "X - N * M" maybe
> wrap/overflow/underflow. So, it is valid that "X - N * M" does
> not cross zero and does not wrap/overflow/underflow.
> 
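A concrete example of why the "does not cross zero" condition matters for
truncating division (illustrative only):

#include <assert.h>

int
main (void)
{
  int x = 1, n = 4, m = 1;
  assert ((x - n * m) / n == 0);   /* -3 / 4 truncates towards zero */
  assert (x / n - m == -1);        /* so the folded form would differ */
  return 0;
}
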
> Compare with previous version:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618796.html
> 
> This patch 1. adds the patterns for variable N or M,
> 2. uses simpler form "(X - N * M) / N" for patterns,
> 3. adds functions to gimple-fold.h/cc (not gimple-match-head.cc)
> 4. updates testcases
> 
> Bootstrap & regtest pass on ppc64{,le} and x86_64.
> Is this patch ok for trunk?

Comments below.

> 
> BR,
> Jeff (Jiufu Guo)
> 
>   PR tree-optimization/108757
> 
> gcc/ChangeLog:
> 
>   * gimple-fold.cc (maybe_mult_overflow): New function.
>   (maybe_plus_overflow): New function.
>   (maybe_minus_overflow): New function.
>   (plus_mult_no_ovf_and_keep_sign): New function.
>   (plus_no_ovf_and_keep_sign): New function.
>   * gimple-fold.h (maybe_mult_overflow): New declare.
>   (plus_mult_no_ovf_and_keep_sign): New declare.
>   (plus_no_ovf_and_keep_sign): New declare.
>   * match.pd ((X - N * M) / N): New pattern.
>   ((X + N * M) / N): New pattern.
>   ((X + C) / N): New pattern.
>   ((X + C) >> N): New pattern.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.dg/pr108757-1.c: New test.
>   * gcc.dg/pr108757-2.c: New test.
>   * gcc.dg/pr108757.h: New test.
> 
> ---
>  gcc/gimple-fold.cc| 161 
>  gcc/gimple-fold.h |   3 +
>  gcc/match.pd  |  58 +++
>  gcc/testsuite/gcc.dg/pr108757-1.c |  18 +++
>  gcc/testsuite/gcc.dg/pr108757-2.c |  19 +++
>  gcc/testsuite/gcc.dg/pr108757.h   | 244 ++
>  6 files changed, 503 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.dg/pr108757-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/pr108757-2.c
>  create mode 100644 gcc/testsuite/gcc.dg/pr108757.h
> 
> diff --git a/gcc/gimple-fold.cc b/gcc/gimple-fold.cc
> index 581575b65ec..bb833ae17b3 100644
> --- a/gcc/gimple-fold.cc
> +++ b/gcc/gimple-fold.cc
> @@ -9349,3 +9349,164 @@ gimple_stmt_integer_valued_real_p (gimple *stmt, int 
> depth)
>return false;
>  }
>  }
> +
> +/* Return true if "X * Y" may be overflow.  */
> +
> +bool
> +maybe_mult_overflow (value_range &x, value_range &y, signop sgn)

These functions look like some "basic" functionality that should
be (or maybe already is?  Andrew?) provided by the value-range
framework.  That means it should not reside in gimple-fold.{cc,h}
but elsehwere and possibly with an API close to the existing
value-range stuff.

Andrew?

> +{
> +  wide_int wmin0 = x.lower_bound ();
> +  wide_int wmax0 = x.upper_bound ();
> +  wide_int wmin1 = y.lower_bound ();
> +  wide_int wmax1 = y.upper_bound ();
> +
> +  wi::overflow_type min_ovf, max_ovf;
> +  wi::mul (wmin0, wmin1, sgn, &min_ovf);
> +  wi::mul (wmax0, wmax1, sgn, &max_ovf);
> +  if (min_ovf == wi::OVF_NONE && max_ovf == wi::OVF_NONE)
> +{
> +  wi::mul (wmin0, wmax1, sgn, &min_ovf);
> +  wi::mul (wmax0, wmin1, sgn, &max_ovf);
> +  if (min_ovf == wi::OVF_NONE && max_ovf == wi::OVF_NONE)
> + return false;
> +}
> +  return true;
> +}
> +
> +/* Return true if "X + Y" may be overflow.  */
> +
> +static bool
> +maybe_plus_overflow (value_range &x, value_range &y, signop sgn)
> +{
> +  wide_int wmin0 = x.lower_bound ();
> +  wide_int wmax0 = x.upper_bound ();
> +  wide_int wmin1 = y.lower_bound ();
> +  wide_int wmax1 = y.upper_bound ();
> +
> +  wi::overflow_type min_ovf, max_ovf;
> +  wi::add (wmax0, wmax1, sgn, &min_ovf);
> +  wi::add (wmin0, wmin1, sgn, &max_ovf);
> +  if (min_ovf == wi::OVF_NONE && max_ovf == wi::OVF_NONE)
> +return false;
> +
> +  return true;
> +}
> +
> +/* Return true if "X - Y" may be overflow.  */
> +
> +static bool
> +maybe_minus_overflow (value_range &x, value_range &y, signop sgn)
> +{
> +  wide_int wmin0 = x.lower_bound ();
> +  wide_int wmax0 = x.upper_bound ();
> +  wide_int wmin1 = y.lower_bound ();
> +  wide_int wmax1 = y.upper_bound ();
> +
> +  wi::overflow_type min_ovf, max_ovf;
> +  wi::sub (wmin0, wmax1, sgn, &min_ovf);
> +  wi::sub (wmax0, wmin1, sgn, &max_ovf);
> +  if (min_ovf == wi::OVF_NONE && max_ovf == wi::OVF_NONE)
> +return false;
> +
> +  return true;
> +}
> +
> +/* Return true if there is no overflow in the expression.
> +   And no sign change on the plus/minus for X.

What does the second sentence mean?  sign(X) == sign (X + N*M)?
I suppose zero has positive sign?

> +   CODE is PLUS_EXPR, if the expression is "X + N * M".
> +   CODE is MINUS_EXPR, if the expression is "X - N * M".
> +   TYPE is the integer type of the expressions.  */
> +
> +bool
> +plus_mult_no_ovf_and_keep_sign (tree x, tree m, tree n, tree_code code,
> + tree type)
> +{
> +  value_range vr

Re: [PATCH] Make sure SCALAR_INT_MODE_P before invoke try_const_anchors

2023-06-09 Thread Richard Biener via Gcc-patches
On Fri, 9 Jun 2023, Richard Sandiford wrote:

> guojiufu  writes:
> > Hi,
> >
> > On 2023-06-09 16:00, Richard Biener wrote:
> >> On Fri, 9 Jun 2023, Jiufu Guo wrote:
> >> 
> >>> Hi,
> >>> 
> >>> Checking the code, there is a "gcc_assert (SCALAR_INT_MODE_P 
> >>> (mode))"
> >>> in "try_const_anchors".
> >>> This assert seems correct because the function try_const_anchors cares
> >>> about integer values currently, and modes other than SCALAR_INT_MODE_P
> >>> do not need to be supported.
> >>> 
> >>> This patch makes sure SCALAR_INT_MODE_P holds when calling 
> >>> try_const_anchors.
> >>> 
> >>> This patch is raised when drafting below one.
> >>> https://gcc.gnu.org/pipermail/gcc-patches/2022-October/603530.html.
> >>> With that patch, "{[%1:DI]=0;} stack_tie" with BLKmode runs into
> >>> try_const_anchors, and hits the assert/ice.
> >>> 
> >>> Boostrap and regtest pass on ppc64{,le} and x86_64.
> >>> Is this ok for trunk?
> >> 
> >> Iff the correct fix at all (how can a CONST_INT have BLKmode?) then
> >> I suggest to instead fix try_const_anchors to change
> >> 
> >>   /* CONST_INT is used for CC modes, but we should leave those alone.  
> >> */
> >>   if (GET_MODE_CLASS (mode) == MODE_CC)
> >> return NULL_RTX;
> >> 
> >>   gcc_assert (SCALAR_INT_MODE_P (mode));
> >> 
> >> to
> >> 
> >>   /* CONST_INT is used for CC modes, leave any non-scalar-int mode 
> >> alone.  */
> >>   if (!SCALAR_INT_MODE_P (mode))
> >> return NULL_RTX;
> >> 
> >
> > This is also able to fix this issue.  there is a "Punt on CC modes" 
> > patch
> > to return NULL_RTX in try_const_anchors.
> >
> >> but as said I wonder how we arrive at a BLKmode CONST_INT and whether
> >> we should have fended this off earlier.  Can you share more complete
> >> RTL of that stack_tie?
> >
> >
> > (insn 15 14 16 3 (parallel [
> >  (set (mem/c:BLK (reg/f:DI 1 1) [1  A8])
> >  (const_int 0 [0]))
> >  ]) "/home/guojiufu/temp/gdb.c":13:3 922 {stack_tie}
> >   (nil))
> >
> > It is "set (mem/c:BLK (reg/f:DI 1 1) (const_int 0 [0])".
> 
> I'm not convinced this is correct RTL.  (unspec:BLK [(const_int 0)] ...)
> would be though.  It's arguably more accurate too, since the effect
> on the stack locations is unspecified rather than predictable.

powerpc seems to be the only port with a stack_tie that's not
using an UNSPEC RHS.

> Thanks,
> Richard


Re: [PATCH] testsuite: fix the condition bug in tsvc s176

2023-06-09 Thread Richard Biener via Gcc-patches
On Thu, Jun 8, 2023 at 1:24 PM Lehua Ding  wrote:
>
> Hi,
>
> This patch fixes the problem that the loop in the tsvc s176 function is
> optimized and removed because `iterations/LEN_1D` is 0 (where iterations
> is set to 10000, LEN_1D is set to 32000 in tsvc.h).
>
> This testcase passed on x86 and AArch64 system.

OK.

It's odd that the checksum doesn't depend on the number of iterations done ...

> Best,
> Lehua
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/vect/tsvc/vect-tsvc-s176.c: adjust iterations
>
> ---
>  gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-s176.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-s176.c 
> b/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-s176.c
> index 79faf7fdb9e4..365e5205982b 100644
> --- a/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-s176.c
> +++ b/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-s176.c
> @@ -14,7 +14,7 @@ real_t s176(struct args_t * func_args)
>  initialise_arrays(__func__);
>
>  int m = LEN_1D/2;
> -for (int nl = 0; nl < 4*(iterations/LEN_1D); nl++) {
> +for (int nl = 0; nl < 4*(10*iterations/LEN_1D); nl++) {
>  for (int j = 0; j < (LEN_1D/2); j++) {
>  for (int i = 0; i < m; i++) {
>  a[i] += b[i+m-j-1] * c[j];
> @@ -39,4 +39,4 @@ int main (int argc, char **argv)
>return 0;
>  }
>
> -/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" { xfail *-*-* } } 
> } */
> +/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" } } */
> --
> 2.36.1
>


Re: [committed] libstdc++: Fix code size regressions in std::vector [PR110060]

2023-06-09 Thread Richard Biener via Gcc-patches
On Thu, Jun 8, 2023 at 11:15 AM Jakub Jelinek via Gcc-patches
 wrote:
>
> On Thu, Jun 08, 2023 at 10:05:43AM +0100, Jonathan Wakely via Gcc-patches 
> wrote:
> > > Looking at assembly, one of the differences I see is that the "after"
> > > version has calls to realloc_insert(), while "before" version seems to 
> > > have
> > > them inlined [2].
> > >
> > > [1]
> > > https://git.linaro.org/toolchain/ci/interesting-commits.git/tree/gcc/sha1/b7b255e77a271974479c34d1db3daafc04b920bc/tcwg_bmk-code_size-cpu2017fast/status.txt
> > >
> > >
> > I find it annoying that adding `if (n < sz) __builtin_unreachable()` seems
> > to affect the size estimates for the function, and so perturbs inlining
> > decisions. That code shouldn't add any actual instructions, so shouldn't
> > affect size estimates.
> >
> > I mentioned this in a meeting last week and Jason suggested checking
> > whether using __builtin_assume has the same undesirable consequences, so I
>
> We don't support __builtin_assume (intentionally).  If you mean
> [[assume(n>=sz)]], then because n >= sz doesn't have side-effects,
> it will be lowered to
> exactly that if (n < sz) __builtin_unreachable(); - you can look at
> -fdump-tree-all to confirm that.
>
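>
[As a reader's sketch of the equivalence described above (n and sz are
illustrative names; [[assume]] needs C++23):

  void f (unsigned long n, unsigned long sz)
  {
    [[assume (n >= sz)]];                   // C++23 spelling
    if (n < sz) __builtin_unreachable ();   // what it lowers to
  }

Neither form emits instructions of its own, yet the compare feeding
__builtin_unreachable () still counts toward inliner size estimates.]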
> I agree that the inliner should ignore if (comparison) 
> __builtin_unreachable();
> from cost estimation.  And the inliner should ignore what we emit for 
> [[assume()]]
> if there are side-effects.  CCing Honza.

Agreed, that would be nice.  Note that we have inliner limits in place to avoid
compile-time and memory-usage explosion as well, so these kinds of
"tricks" may be a way to defeat them.

Richard.

>
> Jakub
>


Re: [PATCH RFC] c++: use __cxa_call_terminate for MUST_NOT_THROW [PR97720]

2023-06-09 Thread Richard Biener via Gcc-patches
On Thu, Jun 8, 2023 at 3:14 PM Jonathan Wakely via Gcc-patches wrote:
>
> On Fri, 26 May 2023 at 10:58, Jonathan Wakely wrote:
>
> >
> >
> > On Wed, 24 May 2023 at 19:56, Jason Merrill via Libstdc++ <
> > libstd...@gcc.gnu.org> wrote:
> >
> >> Middle-end folks: any thoughts about how best to make the change
> >> described in
> >> the last paragraph below?
> >>
> >> Library folks: any thoughts on the changes to __cxa_call_terminate?
> >>
> >
> > I see no harm in exporting it (with the adjusted signature). The "looks
> > standard but isn't" name is a little unfortunate, but not a big deal.
> >
>
> Jason, do you have any objection to exporting __cxa_call_terminate for GCC
> 13.2 as well, even though the FE won't use it?
>
> Currently both gcc-13 and trunk are at the same library version,
> libstdc++.so.6.0.32
>
> But with this addition to trunk we need to bump that .32 to .33, meaning
> that gcc-13 and trunk diverge. If we want to backport any new symbols from
> trunk to gcc-13 that gets trickier once they've diverged.

But if you backport any new used symbol you have to bump the version
anyway.  So why not bump now (on trunk)?

> If we added __cxa_call_terminate to gcc-13, making it another new addition
> to libstdc++.so.6.0.32, then it would simplify a few things.
>
> In theory it could be a problem for distros already shipping gcc-13.1.1
> with that new libstdc++.so.6.0.32 version, but since the
> __cxa_call_terminate symbol won't actually be used by the gcc-13.1.1
> compilers, I don't think it will be a problem.


Re: [PATCH] fix frange_nextafter odr violation

2023-06-09 Thread Richard Biener via Gcc-patches
On Thu, Jun 8, 2023 at 4:38 PM Alexandre Oliva via Gcc-patches wrote:
>
>
> C++ requires inline functions to be declared inline and defined in
> every translation unit that uses them.  frange_nextafter is used in
> gimple-range-op.cc but it's only defined as inline in
> range-op-float.cc.  Drop the extraneous inline specifier.
>
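>
[A minimal sketch of the rule at issue; file and function names are
illustrative:

  // one.cc
  inline int f () { return 1; }   // inline definition visible here only

  // two.cc
  int f ();                       // uses f without a visible inline
  int g () { return f (); }       // definition: an ODR violation

Every translation unit that uses an inline function must declare it
inline and contain its definition; otherwise the program is
ill-formed, no diagnostic required.]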
> Other non-static inline functions in range-op-float.cc are not
> referenced elsewhere, so I'm making them static.
>
> Bootstrapping on x86_64-linux-gnu, along with other changes that exposed
> the problem; it's already into stage3, and it wouldn't get past stage2
> before.  Ok to install?

OK

>
> for  gcc/ChangeLog
>
> * range-op-float.cc (frange_nextafter): Drop inline.
> (frelop_early_resolve): Add static.
> (frange_float): Likewise.
> ---
>  gcc/range-op-float.cc |6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/range-op-float.cc b/gcc/range-op-float.cc
> index a99a6b01ed835..d6da2aa701ee3 100644
> --- a/gcc/range-op-float.cc
> +++ b/gcc/range-op-float.cc
> @@ -255,7 +255,7 @@ maybe_isnan (const frange &op1, const frange &op2)
>  // Floating version of relop_early_resolve that takes into account NAN
>  // and -ffinite-math-only.
>
> -inline bool
> +static inline bool
>  frelop_early_resolve (irange &r, tree type,
>   const frange &op1, const frange &op2,
>   relation_trio rel, relation_kind my_rel)
> @@ -272,7 +272,7 @@ frelop_early_resolve (irange &r, tree type,
>
>  // Set VALUE to its next real value, or INF if the operation overflows.
>
> -inline void
> +void
>  frange_nextafter (enum machine_mode mode,
>   REAL_VALUE_TYPE &value,
>   const REAL_VALUE_TYPE &inf)
> @@ -2878,7 +2878,7 @@ namespace selftest
>
>  // Build an frange from string endpoints.
>
> -inline frange
> +static inline frange
>  frange_float (const char *lb, const char *ub, tree type = float_type_node)
>  {
>REAL_VALUE_TYPE min, max;
>
>
> --
> Alexandre Oliva, happy hackerhttps://FSFLA.org/blogs/lxo/
>Free Software Activist   GNU Toolchain Engineer
> Disinformation flourishes because many people care deeply about injustice
> but very few check the facts.  Ask me about 


Re: [PATCH] doc: Clarification for -Wmissing-field-initializers

2023-06-09 Thread Richard Biener via Gcc-patches
On Thu, Jun 8, 2023 at 7:57 PM Marek Polacek via Gcc-patches wrote:
>
> The manual is incorrect in saying that the option does not warn
> about designated initializers, which it does in C++.  Whether the
> divergence in behavior is desirable is another thing, but let's
> at least make the manual match the reality.

OK.

> PR c/39589
> PR c++/96868
>
> gcc/ChangeLog:
>
> * doc/invoke.texi: Clarify that -Wmissing-field-initializers doesn't
> warn about designated initializers in C only.
> ---
>  gcc/doc/invoke.texi | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 6d08229ce40..0870f7aff93 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -9591,8 +9591,9 @@ struct s @{ int f, g, h; @};
>  struct s x = @{ 3, 4 @};
>  @end smallexample
>
> -This option does not warn about designated initializers, so the following
> -modification does not trigger a warning:
> +@c It's unclear if this behavior is desirable.  See PR39589 and PR96868.
> +In C this option does not warn about designated initializers, so the
> +following modification does not trigger a warning:
>
>  @smallexample
>  struct s @{ int f, g, h; @};
>
> base-commit: 1379ae33e05c28d705f3c69a3f6c774bf6e83136
> --
> 2.40.1
>
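>

[For reference, the designated-initializer form that the hunk's
trailing context leads into looks like this (reconstructed, not
quoted from invoke.texi):

  struct s { int f, g, h; };
  struct s x = { .f = 3, .g = 4 };  /* .h omitted: quiet in C, but
                                       C++ warns */
]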


Re: [PATCH] MATCH: Fix zero_one_valued_p not to match signed 1 bit integers

2023-06-09 Thread Richard Biener via Gcc-patches
On Fri, Jun 9, 2023 at 3:48 AM Andrew Pinski via Gcc-patches wrote:
>
> So for the attached testcase, we assumed that zero_one_valued_p would
> match only the value range [0,1], but currently zero_one_valued_p also
> matches signed 1-bit integers.
> This changes it not to match those, which fixes the 2 new testcases at
> all optimization levels.
>
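>
[A short sketch of why signed 1-bit types are excluded: their value
set is {-1, 0}, not {0, 1}.

  #include <stdio.h>
  struct s { int t : 1; };    /* signed 1-bit bit-field */
  int main (void)
  {
    struct s v;
    v.t = 1;                  /* 1 is not representable; GCC stores -1 */
    printf ("%d\n", v.t);     /* prints -1 */
    return 0;
  }

So a fold that assumes v.t is zero-or-one, e.g. rewriting x * v.t as
x & v.t, would be wrong for such types.]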
> OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions.

OK.

> Note the GCC 13 patch will be slightly different due to the changes
> made to zero_one_valued_p.
>
> PR tree-optimization/110165
> PR tree-optimization/110166
>
> gcc/ChangeLog:
>
> * match.pd (zero_one_valued_p): Don't accept
> signed 1-bit integers.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.c-torture/execute/pr110165-1.c: New test.
> * gcc.c-torture/execute/pr110166-1.c: New test.
> ---
>  gcc/match.pd  | 13 ++--
>  .../gcc.c-torture/execute/pr110165-1.c| 28 
>  .../gcc.c-torture/execute/pr110166-1.c| 33 +++
>  3 files changed, 71 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.c-torture/execute/pr110165-1.c
>  create mode 100644 gcc/testsuite/gcc.c-torture/execute/pr110166-1.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 4ad037d641a..9a6bc2e9348 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -1984,12 +1984,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>@0)
>
>  /* zero_one_valued_p will match when a value is known to be either
> -   0 or 1 including constants 0 or 1. */
> +   0 or 1 including constants 0 or 1.
> +   Signed 1-bits includes -1 so they cannot match here. */
>  (match zero_one_valued_p
>   @0
> - (if (INTEGRAL_TYPE_P (type) && wi::leu_p (tree_nonzero_bits (@0), 1
> + (if (INTEGRAL_TYPE_P (type)
> +  && (TYPE_UNSIGNED (type)
> + || TYPE_PRECISION (type) > 1)
> +  && wi::leu_p (tree_nonzero_bits (@0), 1
>  (match zero_one_valued_p
> - truth_valued_p@0)
> + truth_valued_p@0
> + (if (INTEGRAL_TYPE_P (type)
> +  && (TYPE_UNSIGNED (type)
> + || TYPE_PRECISION (type) > 1
>
>  /* Transform { 0 or 1 } * { 0 or 1 } into { 0 or 1 } & { 0 or 1 }.  */
>  (simplify
> diff --git a/gcc/testsuite/gcc.c-torture/execute/pr110165-1.c 
> b/gcc/testsuite/gcc.c-torture/execute/pr110165-1.c
> new file mode 100644
> index 000..9521a19428e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.c-torture/execute/pr110165-1.c
> @@ -0,0 +1,28 @@
> +struct s
> +{
> +  int t : 1;
> +};
> +
> +int f(struct s t, int a, int b) __attribute__((noinline));
> +int f(struct s t, int a, int b)
> +{
> +int bd = t.t;
> +if (bd) a|=b;
> +return a;
> +}
> +
> +int main(void)
> +{
> +struct s t;
> +for(int i = -1;i <= 1; i++)
> +{
> +int a = 0x10;
> +int b = 0x0f;
> +int c = a | b;
> +   struct s t = {i};
> +int r = f(t, a, b);
> +int exp = (i != 0) ? a | b : a;
> +if (exp != r)
> + __builtin_abort();
> +}
> +}
> diff --git a/gcc/testsuite/gcc.c-torture/execute/pr110166-1.c 
> b/gcc/testsuite/gcc.c-torture/execute/pr110166-1.c
> new file mode 100644
> index 000..f999d47fe69
> --- /dev/null
> +++ b/gcc/testsuite/gcc.c-torture/execute/pr110166-1.c
> @@ -0,0 +1,33 @@
> +struct s
> +{
> +  int t : 1;
> +  int t1 : 1;
> +};
> +
> +int f(struct s t) __attribute__((noinline));
> +int f(struct s t)
> +{
> +   int c = t.t;
> +   int d = t.t1;
> +   if (c > d)
> + t.t = d;
> +   else
> + t.t = c;
> +  return t.t;
> +}
> +
> +int main(void)
> +{
> +struct s t;
> +for(int i = -1;i <= 0; i++)
> +{
> +  for(int j = -1;j <= 0; j++)
> +  {
> +   struct s t = {i, j};
> +int r = f(t);
> +int exp = i < j ? i : j;
> +if (exp != r)
> + __builtin_abort();
> +  }
> +}
> +}
> --
> 2.31.1
>


Re: [PATCH] Make sure SCALAR_INT_MODE_P before invoke try_const_anchors

2023-06-09 Thread Richard Biener via Gcc-patches
On Fri, 9 Jun 2023, Jiufu Guo wrote:

> 
> Hi,
> 
> Richard Biener  writes:
> 
> > On Fri, 9 Jun 2023, Richard Sandiford wrote:
> >
> >> guojiufu  writes:
> >> > Hi,
> >> >
> >> > On 2023-06-09 16:00, Richard Biener wrote:
> >> >> On Fri, 9 Jun 2023, Jiufu Guo wrote:
> >> >> 
> >> >>> Hi,
> >> >>> 
> >> >>> While checking the code, there is a "gcc_assert (SCALAR_INT_MODE_P 
> >> >>> (mode))"
> >> >>> in "try_const_anchors".
> >> >>> This assert seems correct because the function try_const_anchors cares
> >> >>> about integer values currently, and modes other than SCALAR_INT_MODE_P
> >> >>> do not need to be supported.
> >> >>> 
> >> >>> This patch makes sure SCALAR_INT_MODE_P holds when calling 
> >> >>> try_const_anchors.
> >> >>> 
> >> >>> This patch was raised while drafting the one below.
> >> >>> https://gcc.gnu.org/pipermail/gcc-patches/2022-October/603530.html.
> >> >>> With that patch, "{[%1:DI]=0;} stack_tie" with BLKmode runs into
> >> >>> try_const_anchors, and hits the assert/ice.
> >> >>> 
> >> >>> Bootstrap and regtest pass on ppc64{,le} and x86_64.
> >> >>> Is this ok for trunk?
> >> >> 
> >> >> Iff the correct fix at all (how can a CONST_INT have BLKmode?) then
> >> >> I suggest to instead fix try_const_anchors to change
> >> >> 
> >> >>   /* CONST_INT is used for CC modes, but we should leave those alone.  
> >> >> */
> >> >>   if (GET_MODE_CLASS (mode) == MODE_CC)
> >> >> return NULL_RTX;
> >> >> 
> >> >>   gcc_assert (SCALAR_INT_MODE_P (mode));
> >> >> 
> >> >> to
> >> >> 
> >> >>   /* CONST_INT is used for CC modes, leave any non-scalar-int mode 
> >> >> alone.  */
> >> >>   if (!SCALAR_INT_MODE_P (mode))
> >> >> return NULL_RTX;
> >> >> 
> >> >
> >> > This is also able to fix this issue.  There is a "Punt on CC modes" 
> >> > patch
> >> > to return NULL_RTX in try_const_anchors.
> >> >
> >> >> but as said I wonder how we arrive at a BLKmode CONST_INT and whether
> >> >> we should have fended this off earlier.  Can you share more complete
> >> >> RTL of that stack_tie?
> >> >
> >> >
> >> > (insn 15 14 16 3 (parallel [
> >> >  (set (mem/c:BLK (reg/f:DI 1 1) [1  A8])
> >> >  (const_int 0 [0]))
> >> >  ]) "/home/guojiufu/temp/gdb.c":13:3 922 {stack_tie}
> >> >   (nil))
> >> >
> >> > It is "set (mem/c:BLK (reg/f:DI 1 1) (const_int 0 [0])".
> >> 
> >> I'm not convinced this is correct RTL.  (unspec:BLK [(const_int 0)] ...)
> >> would be though.  It's arguably more accurate too, since the effect
> >> on the stack locations is unspecified rather than predictable.
> >
> > powerpc seems to be the only port with a stack_tie that's not
> > using an UNSPEC RHS.
> In rs6000.md, it is
> 
> ; This is to explain that changes to the stack pointer should
> ; not be moved over loads from or stores to stack memory.
> (define_insn "stack_tie"
>   [(match_parallel 0 "tie_operand"
>  [(set (mem:BLK (reg 1)) (const_int 0))])]
>   ""
>   ""
>   [(set_attr "length" "0")])
> 
> This would be just a placeholder insn, acting as the comment describes.
> An UNSPEC would work like other targets'.  Still, I'm wondering about
> the concerns on "set (mem:BLK (reg 1)) (const_int 0)".
> Is it the mode mismatch between SET_DEST and SET_SRC?

I don't think the issue is the mode but the issue is that
the pattern as-is says some memory is zeroed while that's not
actually true (not specifying a size means we can't really do
anything with this MEM, but still).  Using an UNSPEC avoids
implying anything for the stored value.
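
[For illustration, a minimal UNSPEC-style tie along the lines other
ports use; UNSPEC_TIE is an assumed constant name, not taken from any
port:

  (define_insn "stack_tie"
    [(set (mem:BLK (reg 1))
          (unspec:BLK [(const_int 0)] UNSPEC_TIE))]
    ""
    ""
    [(set_attr "length" "0")])

The UNSPEC only says the stack memory changes in some unspecified
way, so nothing can be inferred about a stored value.]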

Of course I think a MEM SET_DEST without a specified size is bogus
as well, but there's larger precedent for this...

Richard.

> Thanks for comments!
> 
> BR,
> Jeff (Jiufu Guo)
> >
> >> Thanks,
> >> Richard
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)


Re: [PATCH V6] VECT: Add SELECT_VL support

2023-06-09 Thread Richard Biener via Gcc-patches
On Fri, 9 Jun 2023, juzhe.zh...@rivai.ai wrote:

> From: Ju-Zhe Zhong 
> 
> Co-authored-by: Richard Sandiford
> Co-authored-by: Richard Biener 
> 
> This patch addresses comments from Richard && Richi and rebases to trunk.
> 
> This patch adds SELECT_VL middle-end support to
> allow targets to have target-dependent optimization of the
> length calculation.
> 
> This patch is inspired by RVV ISA and LLVM:
> https://reviews.llvm.org/D99750
> 
> The SELECT_VL is same behavior as LLVM "get_vector_length" with
> these following properties:
> 
> 1. Only apply on single-rgroup.
> 2. non SLP.
> 3. adjust loop control IV.
> 4. adjust data reference IV.
> 5. allow non-vf elements processing in non-final iteration
> 
> Code:
># void vvaddint32(size_t n, const int*x, const int*y, int*z)
> # { for (size_t i=0; i<n; i++) { z[i] = x[i] + y[i]; } }
> 
> Take RVV codegen for example:
> 
> Before this patch:
> vvaddint32:
> ble a0,zero,.L6
> csrra4,vlenb
> srlia6,a4,2
> .L4:
> mv  a5,a0
> bleua0,a6,.L3
> mv  a5,a6
> .L3:
> vsetvli zero,a5,e32,m1,ta,ma
> vle32.v v2,0(a1)
> vle32.v v1,0(a2)
> vsetvli a7,zero,e32,m1,ta,ma
> sub a0,a0,a5
> vadd.vv v1,v1,v2
> vsetvli zero,a5,e32,m1,ta,ma
> vse32.v v1,0(a3)
> add a2,a2,a4
> add a3,a3,a4
> add a1,a1,a4
> bne a0,zero,.L4
> .L6:
> ret
> 
> After this patch:
> 
> vvaddint32:
> vsetvli t0, a0, e32, ta, ma  # Set vector length based on 32-bit vectors
> vle32.v v0, (a1) # Get first vector
>   sub a0, a0, t0 # Decrement number done
>   slli t0, t0, 2 # Multiply number done by 4 bytes
>   add a1, a1, t0 # Bump pointer
> vle32.v v1, (a2) # Get second vector
>   add a2, a2, t0 # Bump pointer
> vadd.vv v2, v0, v1   # Sum vectors
> vse32.v v2, (a3) # Store result
>   add a3, a3, t0 # Bump pointer
>   bnez a0, vvaddint32# Loop back
>   ret# Finished
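
[A reader's sketch in plain C of the control flow the optab enables;
select_vl and VF below are illustrative stand-ins for the optab and
the vectorization factor, not real interfaces:

  #define VF 8

  static size_t
  select_vl (size_t remain, size_t vf)  /* a target may return less */
  {
    return remain < vf ? remain : vf;
  }

  void
  vvadd (size_t n, const int *x, const int *y, int *z)
  {
    for (size_t i = 0; i < n;)
      {
        size_t vl = select_vl (n - i, VF);  /* 0 < vl <= MIN (n - i, VF) */
        for (size_t j = 0; j < vl; j++)     /* stands in for len_load and
                                               len_store of vl elements */
          z[i + j] = x[i + j] + y[i + j];
        i += vl;                            /* variable-step IV */
      }
  }
]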

OK.

Thanks,
Richard.

> gcc/ChangeLog:
> 
> * doc/md.texi: Add SELECT_VL support.
> * internal-fn.def (SELECT_VL): Ditto.
> * optabs.def (OPTAB_D): Ditto.
> * tree-vect-loop-manip.cc (vect_set_loop_controls_directly): Ditto.
> * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Ditto.
> * tree-vect-stmts.cc (get_select_vl_data_ref_ptr): Ditto.
> (vectorizable_store): Ditto.
> (vectorizable_load): Ditto.
> * tree-vectorizer.h (LOOP_VINFO_USING_SELECT_VL_P): Ditto.
> 
> ---
>  gcc/doc/md.texi | 22 
>  gcc/internal-fn.def |  1 +
>  gcc/optabs.def  |  1 +
>  gcc/tree-vect-loop-manip.cc | 32 -
>  gcc/tree-vect-loop.cc   | 72 +
>  gcc/tree-vect-stmts.cc  | 69 +++
>  gcc/tree-vectorizer.h   |  6 
>  7 files changed, 187 insertions(+), 16 deletions(-)
> 
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 6a435eb4461..95f7fe1f802 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -4974,6 +4974,28 @@ for (i = 1; i < operand3; i++)
>operand0[i] = operand0[i - 1] && (operand1 + i < operand2);
>  @end smallexample
>  
> +@cindex @code{select_vl@var{m}} instruction pattern
> +@item @code{select_vl@var{m}}
> +Set operand 0 to the number of scalar iterations that should be handled
> +by one iteration of a vector loop.  Operand 1 is the total number of
> +scalar iterations that the loop needs to process and operand 2 is a
> +maximum bound on the result (also known as the maximum ``vectorization
> +factor'').
> +
> +The maximum value of operand 0 is given by:
> +@smallexample
> +operand0 = MIN (operand1, operand2)
> +@end smallexample
> +However, targets might choose a lower value than this, based on
> +target-specific criteria.  Each iteration of the vector loop might
> +therefore process a different number of scalar iterations, which in turn
> +means that induction variables will have a variable step.  Because of
> +this, it is generally not useful to define this instruction if it will
> +always calculate the maximum value.
> +
> +This optab is only useful on targets that implement @samp{len_load_@var{m}}
> +and/or @samp{len_store_@var{m}}.
> +
>  @cindex @code{check_raw_ptrs@var{m}} instruction pattern
>  @item @samp{check_raw_ptrs@var{m}}
>  Check whether, given two pointers @var{a} and @var{b} and a length @var{len},
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 3ac9d82aace..5d638de6d06 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -177,6 +177,7 @@ DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
>  DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
>  
>  DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
> +DEF_INTERNAL_OPTAB_FN (SELECT_VL, ECF_C

Re: [PATCH] Prevent TYPE_PRECISION on VECTOR_TYPEs

2023-06-09 Thread Richard Biener via Gcc-patches
On Fri, 9 Jun 2023, Richard Biener wrote:

> The following makes sure that using TYPE_PRECISION on VECTOR_TYPE
> ICEs when tree checking is enabled.  This should avoid wrong-code
> in cases like PR110182 and instead ICE.
> 
> Bootstrap and regtest pending on x86_64-unknown-linux-gnu, I guess
> there will be some fallout of such change ...

The following is what I need to get it to bootstrap on 
x86_64-unknown-linux-gnu (with all languages enabled).

I think some cases warrant a TYPE_PRECISION_RAW but most
are fixing existing errors.  For some cases I didn't dig
deep enough if the code also needs to compare TYPE_VECTOR_SUBPARTS.

The testsuite is running and shows more issues ...

I put this on hold for the moment but hope to get back to it at
some point.  I'll followup with the testresults though.

Richard.
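
[A reader's sketch of the distinction the patch enforces, using the
real helpers build_vector_type and element_precision:

  static unsigned
  element_precision_demo (void)
  {
    tree v4si = build_vector_type (integer_type_node, 4);
    /* TYPE_PRECISION (v4si) would now ICE under tree checking.  */
    return element_precision (v4si);  /* 32, the element's precision */
  }
]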


diff --git a/gcc/c-family/c-common.cc b/gcc/c-family/c-common.cc
index 9c8eed5442a..34566a342bd 100644
--- a/gcc/c-family/c-common.cc
+++ b/gcc/c-family/c-common.cc
@@ -1338,6 +1338,10 @@ shorten_binary_op (tree result_type, tree op0, tree op1, 
bool bitwise)
   int uns;
   tree type;
 
+  /* Do not shorten vector operations.  */
+  if (VECTOR_TYPE_P (result_type))
+return result_type;
+
   /* Cast OP0 and OP1 to RESULT_TYPE.  Doing so prevents
  excessive narrowing when we call get_narrower below.  For
  example, suppose that OP0 is of unsigned int extended
diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 3f3c6685bb3..a8c033ba008 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -12574,10 +12574,10 @@ fold_binary_loc (location_t loc, enum tree_code code, 
tree type,
tree targ1 = strip_float_extensions (arg1);
tree newtype = TREE_TYPE (targ0);
 
-   if (TYPE_PRECISION (TREE_TYPE (targ1)) > TYPE_PRECISION (newtype))
+   if (element_precision (TREE_TYPE (targ1)) > element_precision (newtype))
  newtype = TREE_TYPE (targ1);
 
-   if (TYPE_PRECISION (newtype) < TYPE_PRECISION (TREE_TYPE (arg0)))
+   if (element_precision (newtype) < element_precision (TREE_TYPE (arg0)))
  return fold_build2_loc (loc, code, type,
  fold_convert_loc (loc, newtype, targ0),
  fold_convert_loc (loc, newtype, targ1));
@@ -14540,7 +14540,8 @@ tree_expr_maybe_real_minus_zero_p (const_tree x)
 static bool
 tree_simple_nonnegative_warnv_p (enum tree_code code, tree type)
 {
-  if ((TYPE_PRECISION (type) != 1 || TYPE_UNSIGNED (type))
+  if (!VECTOR_TYPE_P (type)
+  && (TYPE_PRECISION (type) != 1 || TYPE_UNSIGNED (type))
   && truth_value_p (code))
 /* Truth values evaluate to 0 or 1, which is nonnegative unless we
have a signed:1 type (where the value is -1 and 0).  */
diff --git a/gcc/tree-ssa-scopedtables.cc b/gcc/tree-ssa-scopedtables.cc
index 528ddf2a2ab..e698ef97343 100644
--- a/gcc/tree-ssa-scopedtables.cc
+++ b/gcc/tree-ssa-scopedtables.cc
@@ -574,7 +574,7 @@ hashable_expr_equal_p (const struct hashable_expr *expr0,
   && (TREE_CODE (type0) == ERROR_MARK
  || TREE_CODE (type1) == ERROR_MARK
  || TYPE_UNSIGNED (type0) != TYPE_UNSIGNED (type1)
- || TYPE_PRECISION (type0) != TYPE_PRECISION (type1)
+ || element_precision (type0) != element_precision (type1)
  || TYPE_MODE (type0) != TYPE_MODE (type1)))
 return false;
 
diff --git a/gcc/tree.cc b/gcc/tree.cc
index 8e144bc090e..4b43e209c6e 100644
--- a/gcc/tree.cc
+++ b/gcc/tree.cc
@@ -13423,7 +13423,10 @@ verify_type_variant (const_tree t, tree tv)
}
   verify_variant_match (TYPE_NEEDS_CONSTRUCTING);
 }
-  verify_variant_match (TYPE_PRECISION);
+  /* ???  Need a TYPE_PRECISION_RAW here?  TYPE_VECTOR_SUBPARTS
+ is a poly-int.  */
+  if (!VECTOR_TYPE_P (t))
+verify_variant_match (TYPE_PRECISION);
   if (RECORD_OR_UNION_TYPE_P (t))
 verify_variant_match (TYPE_TRANSPARENT_AGGR);
   else if (TREE_CODE (t) == ARRAY_TYPE)
@@ -13701,8 +13704,12 @@ gimple_canonical_types_compatible_p (const_tree t1, 
const_tree t2,
   || TREE_CODE (t1) == OFFSET_TYPE
   || POINTER_TYPE_P (t1))
 {
-  /* Can't be the same type if they have different recision.  */
-  if (TYPE_PRECISION (t1) != TYPE_PRECISION (t2))
+  /* Can't be the same type if they have different precision.  */
+  /* ??? TYPE_PRECISION_RAW for speed.  */
+  if ((VECTOR_TYPE_P (t1)
+  && maybe_ne (TYPE_VECTOR_SUBPARTS (t1), TYPE_VECTOR_SUBPARTS (t2)))
+ || (!VECTOR_TYPE_P (t1)
+ && TYPE_PRECISION (t1) != TYPE_PRECISION (t2)))
return false;
 
   /* In some cases the signed and unsigned types are required to be


Re: [PATCH] Add COMPLEX_VECTOR_INT modes

2023-06-09 Thread Richard Biener via Gcc-patches
On Fri, Jun 9, 2023 at 11:45 AM Andrew Stubbs  wrote:
>
> On 09/06/2023 10:02, Richard Sandiford wrote:
> > Andrew Stubbs  writes:
> >> On 07/06/2023 20:42, Richard Sandiford wrote:
> >>> I don't know if this helps (probably not), but we have a similar
> >>> situation on AArch64: a 64-bit mode like V8QI can be doubled to a
> >>> 128-bit vector or to a pair of 64-bit vectors.  We used V16QI for
> >>> the former and "V2x8QI" for the latter.  V2x8QI is forced to come
> >>> after V16QI in the mode list, and so it is only ever used through
> >>> explicit choice.  But both modes are functionally vectors of 16 QIs.
> >>
> >> OK, that's interesting, but how do you map "complex int" vectors to that
> >> mode? I tried to figure it out, but there's no DIVMOD support so I
> >> couldn't just do a straight comparison.
> >
> > Yeah, we don't do that currently.  Instead we make TARGET_ARRAY_MODE
> > return V2x8QI for an array of 2 V8QIs (which is OK, since V2x8QI has
> > 64-bit rather than 128-bit alignment).  So we should use it for a
> > complex-y type like:
> >
> >struct { res_type res[2]; };
> >
> > In principle we should be able to do the same for:
> >
> >struct { res_type a, b; };
> >
> > but that isn't supported yet.  I think it would need a new target hook
> > along the lines of TARGET_ARRAY_MODE, but for structs rather than arrays.

And the same should work for complex types, no?  In fact we could document
that TARGET_ARRAY_MODE also is used for _Complex?  Note the hook
is used for type layout and thus innocent array types (in aggregates) can end up
with a vector mode now.  Hopefully that's without bad effects (on the ABI).

That said, the hook _could_ be used just for divmod expansion without
actually creating a complex (or array) type of vectors.
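
[A hedged sketch of such a hook; the gcn_ prefix and the tuple-mode
lookup helper are assumptions, not real port code:

  static opt_machine_mode
  gcn_array_mode (machine_mode mode, unsigned HOST_WIDE_INT nelems)
  {
    machine_mode tuple_mode;
    /* Map an array of 2..4 vectors to a dedicated tuple mode, as
       aarch64 does for V2x8QI; the lookup helper is assumed.  */
    if (VECTOR_MODE_P (mode)
        && nelems >= 2 && nelems <= 4
        && gcn_tuple_mode_for (mode, nelems).exists (&tuple_mode))
      return tuple_mode;
    return opt_machine_mode ();
  }
  #define TARGET_ARRAY_MODE gcn_array_mode
]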

> > The advantage of this from AArch64's PoV is that it extends to 3x and 4x
> > tuples as well, whereas complex is obviously for pairs only.
> >
> > I don't know if it would be acceptable to use that kind of struct wrapper
> > for the divmod code though (for the vector case only).
>
> Looking again, I don't think this will help because GCN does not have an
> instruction that loads vectors that are back-to-back, hence there's
> little benefit in adding the tuple mode.
>
> However, GCN does have instructions that effectively load 2, 3, or 4
> vectors that are *interleaved*, which would be the likely case for
> complex numbers (or pixel colour data!)

that's load_lanes and I think not related here but it probably also
needs the xN modes.

> I need to figure out how to move forward with this patch, please; if the
> new complex modes are not acceptable then I think I need to reimplement
> DIVMOD (maybe the scalars can remain as-is), but it's not clear to me
> what that would look like.
>
> Andrew


Re: [PATCH] Prevent TYPE_PRECISION on VECTOR_TYPEs

2023-06-09 Thread Richard Biener via Gcc-patches
On Fri, 9 Jun 2023, Richard Biener wrote:

> On Fri, 9 Jun 2023, Richard Biener wrote:
> 
> > The following makes sure that using TYPE_PRECISION on VECTOR_TYPE
> > ICEs when tree checking is enabled.  This should avoid wrong-code
> > in cases like PR110182 and instead ICE.
> > 
> > Bootstrap and regtest pending on x86_64-unknown-linux-gnu, I guess
> > there will be some fallout of such change ...
> 
> The following is what I need to get it to bootstrap on 
> x86_64-unknown-linux-gnu (with all languages enabled).
> 
> I think some cases warrant a TYPE_PRECISION_RAW but most
> are fixing existing errors.  For some cases I didn't dig
> deep enough if the code also needs to compare TYPE_VECTOR_SUBPARTS.
> 
> The testsuite is running and shows more issues ...
> 
> I put this on hold for the moment but hope to get back to it at
> some point.  I'll followup with the testresults though.

Attached - it's not too much it seems, but things repeat of course.

Richard.

testresults.xz
Description: application/xz


Re: [PATCH] testsuite: fix the condition bug in tsvc s176

2023-06-09 Thread Richard Biener via Gcc-patches
On Fri, Jun 9, 2023 at 11:58 AM Lehua Ding  wrote:
>
> > It's odd that the checksum doesn't depend on the number of iterations done 
> > ...
>
> This is because the difference between the calculated result (32063.902344) 
> and
> the expected result (32000.00) is small. The current check is that the 
> result
> is considered correct as long as the `value/expected` ratio is between 0.99f 
> and
> 1.01f.

Oh, I see ...

> I'm not sure if this check is enough, but I should also update the expected
> result to 32063.902344 (the same as without vectorization).

OK.

> Best,
> Lehua
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/vect/tsvc/tsvc.h: Update the expected result for s176.
> * gcc.dg/vect/tsvc/vect-tsvc-s176.c: Adjust the iteration count
> and remove the xfail.
>
> ---
>  gcc/testsuite/gcc.dg/vect/tsvc/tsvc.h   | 2 +-
>  gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-s176.c | 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.dg/vect/tsvc/tsvc.h 
> b/gcc/testsuite/gcc.dg/vect/tsvc/tsvc.h
> index cd39c041903d..d910c384fc83 100644
> --- a/gcc/testsuite/gcc.dg/vect/tsvc/tsvc.h
> +++ b/gcc/testsuite/gcc.dg/vect/tsvc/tsvc.h
> @@ -1164,7 +1164,7 @@ real_t get_expected_result(const char * name)
>  } else if (!strcmp(name, "s175")) {
> return 32009.023438f;
>  } else if (!strcmp(name, "s176")) {
> -   return 32000.f;
> +   return 32063.902344f;
>  } else if (!strcmp(name, "s211")) {
> return 63983.308594f;
>  } else if (!strcmp(name, "s212")) {
> diff --git a/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-s176.c 
> b/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-s176.c
> index 79faf7fdb9e4..365e5205982b 100644
> --- a/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-s176.c
> +++ b/gcc/testsuite/gcc.dg/vect/tsvc/vect-tsvc-s176.c
> @@ -14,7 +14,7 @@ real_t s176(struct args_t * func_args)
>  initialise_arrays(__func__);
>
>  int m = LEN_1D/2;
> -for (int nl = 0; nl < 4*(iterations/LEN_1D); nl++) {
> +for (int nl = 0; nl < 4*(10*iterations/LEN_1D); nl++) {
>  for (int j = 0; j < (LEN_1D/2); j++) {
>  for (int i = 0; i < m; i++) {
>  a[i] += b[i+m-j-1] * c[j];
> @@ -39,4 +39,4 @@ int main (int argc, char **argv)
>return 0;
>  }
>
> -/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" { xfail *-*-* } } 
> } */
> +/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" } } */
> --
> 2.36.1
>


Re: [PATCH] fix frange_nextafter odr violation

2023-06-09 Thread Richard Biener via Gcc-patches
On Fri, Jun 9, 2023 at 2:26 PM Alexandre Oliva  wrote:
>
> On Jun  9, 2023, Richard Biener  wrote:
>
> > On Thu, Jun 8, 2023 at 4:38 PM Alexandre Oliva via Gcc-patches wrote:
>
> >> C++ requires inline functions to be declared inline and defined in
> >> every translation unit that uses them.  frange_nextafter is used in
> >> gimple-range-op.cc but it's only defined as inline in
> >> range-op-float.cc.  Drop the extraneous inline specifier.
>
> > OK
>
> >> for  gcc/ChangeLog
> >>
> >> * range-op-float.cc (frange_nextafter): Drop inline.
> >> (frelop_early_resolve): Add static.
> >> (frange_float): Likewise.
>
> The problem is also present in gcc-13.  Ok there as well?  Regstrapped
> on x86_64-linux-gnu.

Yes.

Richard.

> --
> Alexandre Oliva, happy hackerhttps://FSFLA.org/blogs/lxo/
>Free Software Activist   GNU Toolchain Engineer
> Disinformation flourishes because many people care deeply about injustice
> but very few check the facts.  Ask me about 


Re: [PATCH] Make sure SCALAR_INT_MODE_P before invoke try_const_anchors

2023-06-09 Thread Richard Biener via Gcc-patches
On Fri, 9 Jun 2023, Jiufu Guo wrote:

> 
> Hi,
> 
> Richard Biener  writes:
> 
> > On Fri, 9 Jun 2023, Jiufu Guo wrote:
> >
> >> 
> >> Hi,
> >> 
> >> Richard Biener  writes:
> >> 
> >> > On Fri, 9 Jun 2023, Richard Sandiford wrote:
> >> >
> >> >> guojiufu  writes:
> >> >> > Hi,
> >> >> >
> >> >> > On 2023-06-09 16:00, Richard Biener wrote:
> >> >> >> On Fri, 9 Jun 2023, Jiufu Guo wrote:
> >> >> >> 
> >> >> >>> Hi,
> >> >> >>> 
> ...
> >> >> >>> 
> >> >> >>> This patch was raised while drafting the one below.
> >> >> >>> https://gcc.gnu.org/pipermail/gcc-patches/2022-October/603530.html.
> >> >> >>> With that patch, "{[%1:DI]=0;} stack_tie" with BLKmode runs into
> >> >> >>> try_const_anchors, and hits the assert/ice.
> >> >> >>> 
> >> >> >>> Bootstrap and regtest pass on ppc64{,le} and x86_64.
> >> >> >>> Is this ok for trunk?
> >> >> >> 
> >> >> >> Iff the correct fix at all (how can a CONST_INT have BLKmode?) then
> >> >> >> I suggest to instead fix try_const_anchors to change
> >> >> >> 
> >> >> >>   /* CONST_INT is used for CC modes, but we should leave those 
> >> >> >> alone.  
> >> >> >> */
> >> >> >>   if (GET_MODE_CLASS (mode) == MODE_CC)
> >> >> >> return NULL_RTX;
> >> >> >> 
> >> >> >>   gcc_assert (SCALAR_INT_MODE_P (mode));
> >> >> >> 
> >> >> >> to
> >> >> >> 
> >> >> >>   /* CONST_INT is used for CC modes, leave any non-scalar-int mode 
> >> >> >> alone.  */
> >> >> >>   if (!SCALAR_INT_MODE_P (mode))
> >> >> >> return NULL_RTX;
> >> >> >> 
> >> >> >
> >> >> > This is also able to fix this issue.  There is a "Punt on CC modes" 
> >> >> > patch
> >> >> > to return NULL_RTX in try_const_anchors.
> >> >> >
> >> >> >> but as said I wonder how we arrive at a BLKmode CONST_INT and whether
> >> >> >> we should have fended this off earlier.  Can you share more complete
> >> >> >> RTL of that stack_tie?
> >> >> >
> >> >> >
> >> >> > (insn 15 14 16 3 (parallel [
> >> >> >  (set (mem/c:BLK (reg/f:DI 1 1) [1  A8])
> >> >> >  (const_int 0 [0]))
> >> >> >  ]) "/home/guojiufu/temp/gdb.c":13:3 922 {stack_tie}
> >> >> >   (nil))
> >> >> >
> >> >> > It is "set (mem/c:BLK (reg/f:DI 1 1) (const_int 0 [0])".
> >> >> 
> >> >> I'm not convinced this is correct RTL.  (unspec:BLK [(const_int 0)] ...)
> >> >> would be though.  It's arguably more accurate too, since the effect
> >> >> on the stack locations is unspecified rather than predictable.
> >> >
> >> > powerpc seems to be the only port with a stack_tie that's not
> >> > using an UNSPEC RHS.
> >> In rs6000.md, it is
> >> 
> >> ; This is to explain that changes to the stack pointer should
> >> ; not be moved over loads from or stores to stack memory.
> >> (define_insn "stack_tie"
> >>   [(match_parallel 0 "tie_operand"
> >>   [(set (mem:BLK (reg 1)) (const_int 0))])]
> >>   ""
> >>   ""
> >>   [(set_attr "length" "0")])
> >> 
> >> This would be just a placeholder insn, acting as the comment describes.
> >> An UNSPEC would work like other targets'.  Still, I'm wondering about
> >> the concerns on "set (mem:BLK (reg 1)) (const_int 0)".
> >> Is it the mode mismatch between SET_DEST and SET_SRC?
> >
> > I don't think the issue is the mode but the issue is that
> > the pattern as-is says some memory is zeroed while that's not
> > actually true (not specifying a size means we can't really do
> > anything with this MEM, but still).  Using an UNSPEC avoids
> > implying anything for the stored value.
> >
> > Of course I think a MEM SET_DEST without a specified size is bogus
> > as well, but there's larger precedent for this...
> 
> Thanks for your kind comments!
> Using "(set (mem:BLK (reg 1)) (const_int 0))" here may be because this
> insn does not generate a real thing (not a real store and no asm code),
> much like a barrier.
> 
> While I agree that using an UNSPEC may be clearer and avoid misreading.

Btw, another way to avoid the issue in CSE is to make it not process
(aka record anything for optimization) for SET from MEMs with
!MEM_SIZE_KNOWN_P

Richard.


Re: [PATCH 2/2] ipa-cp: Feed results of IPA-CP into value numbering

2023-06-12 Thread Richard Biener via Gcc-patches
On Fri, 9 Jun 2023, Martin Jambor wrote:

> Hi,
> 
> thanks for looking at this.
> 
> On Fri, Jun 02 2023, Richard Biener wrote:
> > On Mon, 29 May 2023, Martin Jambor wrote:
> >
> 
> [...]
> 
> >> diff --git a/gcc/tree-ssa-sccvn.cc b/gcc/tree-ssa-sccvn.cc
> >> index 27c84e78fcf..33215b5fc82 100644
> >> --- a/gcc/tree-ssa-sccvn.cc
> >> +++ b/gcc/tree-ssa-sccvn.cc
> >> @@ -74,6 +74,9 @@ along with GCC; see the file COPYING3.  If not see
> >>  #include "ipa-modref-tree.h"
> >>  #include "ipa-modref.h"
> >>  #include "tree-ssa-sccvn.h"
> >> +#include "alloc-pool.h"
> >> +#include "symbol-summary.h"
> >> +#include "ipa-prop.h"
> >>  
> >>  /* This algorithm is based on the SCC algorithm presented by Keith
> >> Cooper and L. Taylor Simpson in "SCC-Based Value numbering"
> >> @@ -2327,7 +2330,7 @@ vn_walk_cb_data::push_partial_def (pd_data pd,
> >> with the current VUSE and performs the expression lookup.  */
> >>  
> >>  static void *
> >> -vn_reference_lookup_2 (ao_ref *op ATTRIBUTE_UNUSED, tree vuse, void 
> >> *data_)
> >> +vn_reference_lookup_2 (ao_ref *op, tree vuse, void *data_)
> >>  {
> >>vn_walk_cb_data *data = (vn_walk_cb_data *)data_;
> >>vn_reference_t vr = data->vr;
> >> @@ -2361,6 +2364,37 @@ vn_reference_lookup_2 (ao_ref *op ATTRIBUTE_UNUSED, 
> >> tree vuse, void *data_)
> >>return *slot;
> >>  }
> >>  
> >> +  if (SSA_NAME_IS_DEFAULT_DEF (vuse))
> >> +{
> >> +  HOST_WIDE_INT offset, size;
> >> +  tree v = NULL_TREE;
> >> +  if (op->base && TREE_CODE (op->base) == PARM_DECL
> >> +&& op->offset.is_constant (&offset)
> >> +&& op->size.is_constant (&size)
> >> +&& op->max_size_known_p ()
> >> +&& known_eq (op->size, op->max_size))
> >> +  v = ipcp_get_aggregate_const (cfun, op->base, false, offset, size);
> >
> > We've talked about partial definition support, this does not
> > have this implemented AFAICS.  But that means you cannot simply
> > do ->finish () without verifying data->partial_defs.is_empty ().
> >
> 
> You are right, partial definitions are not implemented.  I have added
> the is_empty check to the patch.  I'll continue looking into adding the
> support as a follow-up.
> 
> >> +  else if (op->ref)
> >> +  {
> >
> > does this ever happen to improve things?
> 
> Yes, this branch is necessary for propagation of all known constants
> passed in memory pointed to by a POINTER_TYPE_P parameter.  It handles
> the second testcase added by the patch.
> 
> > There's the remote
> > possibility op->base isn't initialized yet, for this reason
> > above you should use ao_ref_base (op) instead of accessing
> > op->base directly.
> 
> OK
> 
> >
> >> +HOST_WIDE_INT offset, size;
> >> +bool reverse;
> >> +tree base = get_ref_base_and_extent_hwi (op->ref, &offset,
> >> + &size, &reverse);
> >> +if (base
> >> +&& TREE_CODE (base) == MEM_REF
> >> +&& integer_zerop (TREE_OPERAND (base, 1))
> >> +&& TREE_CODE (TREE_OPERAND (base, 0)) == SSA_NAME
> >
> > And this then should be done within the above branch as well,
> > just keyed off base == MEM_REF.
> 
> I am sorry but I don't understand this comment; can you please try to
> re-phrase it?  The previous branch handles direct accesses to
> PARM_DECLs; MEM_REFs don't need to be there at all.

See below

> Updated (bootstrap and testing passing) patch is below for reference,
> but I obviously expect to incorporate the above comment as well before
> proposing to push it.
> 
> Thanks,
> 
> Martin
> 
> 
> Subject: [PATCH 2/2] ipa-cp: Feed results of IPA-CP into value numbering
> 
> PRs 68930 and 92497 show that when IPA-CP figures out constants in
> aggregate parameters, or in memory passed by reference, but the loads
> happen in an inlined function, the information is lost.  This happens even
> when the inlined function itself was known to have - or even cloned to
> have - such constants in incoming parameters because the transform
> phase of IPA passes is not run on them.  See discussion in the bugs
> for reasons why.
> 
> Honza suggested that we can plug the results of IPA-CP analysis into
> value numbering, so that FRE can figure out that some loads fetch
> known constants.  This is what this patch attempts to do.
> 
> This version of the patch uses the new way we represent aggregate
> constants discovered by IPA-CP and so avoids a linear scan to find them.
> Similarly, it depends on the previous patch which avoids potentially
> slow linear look ups of indices of PARM_DECLs when there are many of
> them.
> 
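> 
[A much-simplified sketch of the by-reference case being recovered;
the real testcases span translation units and differ from this:

  struct S { int x; };
  static int load (struct S *p) { return p->x; }  /* gets inlined */
  static int __attribute__ ((noinline))
  foo (struct S *p) { return load (p) + 1; }      /* IPA-CP: p->x == 42 */
  int main (void) { struct S s = { 42 }; return foo (&s); }

IPA-CP can prove p->x is 42 in every call to foo, the load only
appears once load () is inlined, and with this patch value numbering
can still fold it to the constant.]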
> gcc/ChangeLog:
> 
> 2023-06-07  Martin Jambor  
> 
>   PR ipa/68930
>   PR ipa/92497
>   * ipa-prop.h (ipcp_get_aggregate_const): Declare.
>   * ipa-prop.cc (ipcp_get_aggregate_const): New function.
>   (ipcp_transform_function): Do not deallocate transformation info.
>   * tree-ssa-sccvn.cc: Include alloc-pool.h, symbol-summary.h and
>   ipa-prop.h.
>   (vn_reference_lookup_2): When hitting 

Re: [PATCH] Make sure SCALAR_INT_MODE_P before invoke try_const_anchors

2023-06-12 Thread Richard Biener via Gcc-patches
On Mon, 12 Jun 2023, Jiufu Guo wrote:

> Richard Biener  writes:
> 
> > On Fri, 9 Jun 2023, Jiufu Guo wrote:
> >
> >> 
> >> Hi,
> >> 
> >> Richard Biener  writes:
> >> 
> >> > On Fri, 9 Jun 2023, Jiufu Guo wrote:
> >> >
> >> >> 
> >> >> Hi,
> >> >> 
> >> >> Richard Biener  writes:
> >> >> 
> >> >> > On Fri, 9 Jun 2023, Richard Sandiford wrote:
> >> >> >
> >> >> >> guojiufu  writes:
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > On 2023-06-09 16:00, Richard Biener wrote:
> >> >> >> >> On Fri, 9 Jun 2023, Jiufu Guo wrote:
> >> >> >> >> 
> >> >> >> >>> Hi,
> >> >> >> >>> 
> >> ...
> >> >> >> >>> 
> >> >> >> >>> This patch was raised while drafting the one below.
> >> >> >> >>> https://gcc.gnu.org/pipermail/gcc-patches/2022-October/603530.html.
> >> >> >> >>> With that patch, "{[%1:DI]=0;} stack_tie" with BLKmode runs into
> >> >> >> >>> try_const_anchors, and hits the assert/ice.
> >> >> >> >>> 
> >> >> >> >>> Bootstrap and regtest pass on ppc64{,le} and x86_64.
> >> >> >> >>> Is this ok for trunk?
> >> >> >> >> 
> >> >> >> >> Iff the correct fix at all (how can a CONST_INT have BLKmode?) 
> >> >> >> >> then
> >> >> >> >> I suggest to instead fix try_const_anchors to change
> >> >> >> >> 
> >> >> >> >>   /* CONST_INT is used for CC modes, but we should leave those 
> >> >> >> >> alone.  
> >> >> >> >> */
> >> >> >> >>   if (GET_MODE_CLASS (mode) == MODE_CC)
> >> >> >> >> return NULL_RTX;
> >> >> >> >> 
> >> >> >> >>   gcc_assert (SCALAR_INT_MODE_P (mode));
> >> >> >> >> 
> >> >> >> >> to
> >> >> >> >> 
> >> >> >> >>   /* CONST_INT is used for CC modes, leave any non-scalar-int 
> >> >> >> >> mode 
> >> >> >> >> alone.  */
> >> >> >> >>   if (!SCALAR_INT_MODE_P (mode))
> >> >> >> >> return NULL_RTX;
> >> >> >> >> 
> >> >> >> >
> >> >> >> > This is also able to fix this issue.  There is a "Punt on CC 
> >> >> >> > modes" 
> >> >> >> > patch
> >> >> >> > to return NULL_RTX in try_const_anchors.
> >> >> >> >
> >> >> >> >> but as said I wonder how we arrive at a BLKmode CONST_INT and 
> >> >> >> >> whether
> >> >> >> >> we should have fended this off earlier.  Can you share more 
> >> >> >> >> complete
> >> >> >> >> RTL of that stack_tie?
> >> >> >> >
> >> >> >> >
> >> >> >> > (insn 15 14 16 3 (parallel [
> >> >> >> >  (set (mem/c:BLK (reg/f:DI 1 1) [1  A8])
> >> >> >> >  (const_int 0 [0]))
> >> >> >> >  ]) "/home/guojiufu/temp/gdb.c":13:3 922 {stack_tie}
> >> >> >> >   (nil))
> >> >> >> >
> >> >> >> > It is "set (mem/c:BLK (reg/f:DI 1 1) (const_int 0 [0])".
> >> >> >> 
> >> >> >> I'm not convinced this is correct RTL.  (unspec:BLK [(const_int 0)] 
> >> >> >> ...)
> >> >> >> would be though.  It's arguably more accurate too, since the effect
> >> >> >> on the stack locations is unspecified rather than predictable.
> >> >> >
> >> >> > powerpc seems to be the only port with a stack_tie that's not
> >> >> > using an UNSPEC RHS.
> >> >> In rs6000.md, it is
> >> >> 
> >> >> ; This is to explain that changes to the stack pointer should
> >> >> ; not be moved over loads from or stores to stack memory.
> >> >> (define_insn "stack_tie"
> >> >>   [(match_parallel 0 "tie_operand"
> >> >>[(set (mem:BLK (reg 1)) (const_int 0))])]
> >> >>   ""
> >> >>   ""
> >> >>   [(set_attr "length" "0")])
> >> >> 
> >> >> This would be just a placeholder insn, acting as the comment describes.
> >> >> An UNSPEC would work like other targets'.  Still, I'm wondering about
> >> >> the concerns on "set (mem:BLK (reg 1)) (const_int 0)".
> >> >> Is it the mode mismatch between SET_DEST and SET_SRC?
> >> >
> >> > I don't think the issue is the mode but the issue is that
> >> > the pattern as-is says some memory is zeroed while that's not
> >> > actually true (not specifying a size means we can't really do
> >> > anything with this MEM, but still).  Using an UNSPEC avoids
> >> > implying anything for the stored value.
> >> >
> >> > Of course I think a MEM SET_DEST without a specified size is bogus
> >> > as well, but there's larger precedent for this...
> >> 
> >> Thanks for your kind comments!
> >> Using "(set (mem:BLK (reg 1)) (const_int 0))" here may be because this
> >> insn does not generate a real thing (not a real store and no asm code),
> >> much like a barrier.
> >> 
> >> While I agree that using an UNSPEC may be clearer and avoid misreading.
> >
> > Btw, another way to avoid the issue in CSE is to make it not process
> > (aka record anything for optimization) for SET from MEMs with
> > !MEM_SIZE_KNOWN_P
> 
> Thanks! Yes, this would make sense.
> Then, there are two ideas (patches) to handle this issue:
> which one would be preferable?  This one (from the compile-time aspect)?
> 
> And maybe the changes to rs6000 stack_tie to use an unspec
> could be a standalone enhancement besides the cse patch.
> 
> Thanks for comments!
> 
> BR,
> Jeff (Jiufu Guo)
> 
>  patch 1
> diff --git a/gcc/cse.cc b/gcc/cse.cc
> index 2bb63ac4105..06ecdadecbc 100644
> --- a/gcc/cse.cc
> +++ b/gcc/cse.cc
> @@ -4271,6 +4271,8 @@ fi

[PATCH] middle-end/110200 - genmatch force-leaf and convert interaction

2023-06-12 Thread Richard Biener via Gcc-patches
The following fixes GENERIC code generation for (convert! ...)
which currently generates

  if (TREE_TYPE (_o1[0]) != type)
_r1 = fold_build1_loc (loc, NOP_EXPR, type, _o1[0]);
if (EXPR_P (_r1))
  goto next_after_fail867;
  else
_r1 = _o1[0];

where obviously braces are missing.
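
[With the braces added, the generated code takes the intended shape:

  if (TREE_TYPE (_o1[0]) != type)
    {
      _r1 = fold_build1_loc (loc, NOP_EXPR, type, _o1[0]);
      if (EXPR_P (_r1))
        goto next_after_fail867;
    }
  else
    _r1 = _o1[0];
]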

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed to trunk,
will push down to branches as well.

PR middle-end/110200
* genmatch.cc (expr::gen_transform): Put braces around
the if arm for the (convert ...) short-cut.
---
 gcc/genmatch.cc | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/gcc/genmatch.cc b/gcc/genmatch.cc
index bd6ce3a28f8..5fceeec9780 100644
--- a/gcc/genmatch.cc
+++ b/gcc/genmatch.cc
@@ -2625,7 +2625,8 @@ expr::gen_transform (FILE *f, int indent, const char 
*dest, bool gimple,
{
  fprintf_indent (f, indent, "if (TREE_TYPE (_o%d[0]) != %s)\n",
  depth, type);
- indent += 2;
+ fprintf_indent (f, indent + 2, "{\n");
+ indent += 4;
}
   if (opr->kind == id_base::CODE)
fprintf_indent (f, indent, "_r%d = fold_build%d_loc (loc, %s, %s",
@@ -2648,7 +2649,8 @@ expr::gen_transform (FILE *f, int indent, const char 
*dest, bool gimple,
}
   if (*opr == CONVERT_EXPR)
{
- indent -= 2;
+ fprintf_indent (f, indent - 2, "}\n");
+ indent -= 4;
  fprintf_indent (f, indent, "else\n");
  fprintf_indent (f, indent, "  _r%d = _o%d[0];\n", depth, depth);
}
-- 
2.35.3


Re: [PATCH] inline: improve internal function costs

2023-06-12 Thread Richard Biener via Gcc-patches
On Mon, 12 Jun 2023, Andre Vieira (lists) wrote:

> 
> 
> On 05/06/2023 04:04, Jan Hubicka wrote:
> >> On Thu, 1 Jun 2023, Andre Vieira (lists) wrote:
> >>
> >>> Hi,
> >>>
> >>> This is a follow-up of the internal function patch to add widening and
> >>> narrowing patterns.  This patch improves the inliner cost estimation for
> >>> internal functions.
> >>
> >> I have no idea why calls are special in IPA analyze_function_body
> >> and so I cannot say whether treating all internal fn calls as
> >> non-calls is correct there.  Honza?
> > 
> > The reason is that normal statements are accounted as part of the
> > function body, while calls have their costs attached to call edges
> > (so it can be adjusted when call is inlined to otherwise optimized).
> > 
> > However since internal functions have no cgraph edges, this looks like
> > a bug that we do not test for.  (The code was written before internal
> > calls were introduced.)
> >
> 
> This sounds to me like you agree with my approach to treat internal calls
> different to regular calls.
> 
> > I wonder if we don't want to have is_noninternal_gimple_call that could
> > be used by IPA code to test whether cgraph edge should exist for
> > the statement.
> 
> I'm happy to add such a helper function @richi,rsandifo: you ok with that?

It's a bit of an ugly name; if we want something that keys on calls
that have an edge, it should be obvious that it does this.  I wouldn't
add is_noninternal_gimple_call.  With LTO and libgcc and internal
optab fns it's also less obvious in cases we want to have say
.DIVMODDI3 (...) which in the end maps to a LTOed libcall from libgcc.a 
...
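
[A sketch of a more descriptive spelling of the suggested predicate;
the name and its placement are assumptions:

  /* True if STMT is a call that has (or should have) a cgraph edge,
     i.e. any call that is not an internal-function call.  */
  static inline bool
  gimple_call_with_cgraph_edge_p (const gimple *stmt)
  {
    return is_gimple_call (stmt) && !gimple_call_internal_p (stmt);
  }
]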

Richard.


Re: [PATCH] Remove DEFAULT_MATCHPD_PARTITIONS macro

2023-06-12 Thread Richard Biener via Gcc-patches
On Mon, 12 Jun 2023, Tamar Christina wrote:

> Hi All,
> 
> As Jakub pointed out, DEFAULT_MATCHPD_PARTITIONS
> is now unused and can be removed.
> 
> Bootstrapped aarch64-none-linux-gnu and no issues.
> 
> Ok for master?

OK.

> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>   * config.in: Regenerate.
>   * configure: Regenerate.
>   * configure.ac: Remove DEFAULT_MATCHPD_PARTITIONS.
> 
> --- inline copy of patch -- 
> diff --git a/gcc/config.in b/gcc/config.in
> index 
> cf2f284378447c8f8e2f838a786dba23d6086fe3..0e62b9fbfc93da8fb511bf581ef9457e55c8bc6c
>  100644
> --- a/gcc/config.in
> +++ b/gcc/config.in
> @@ -67,12 +67,6 @@
>  #endif
>  
>  
> -/* Define to larger than one set the number of match.pd partitions to make. 
> */
> -#ifndef USED_FOR_TARGET
> -#undef DEFAULT_MATCHPD_PARTITIONS
> -#endif
> -
> -
>  /* Define to larger than zero set the default stack clash protector size. */
>  #ifndef USED_FOR_TARGET
>  #undef DEFAULT_STK_CLASH_GUARD_SIZE
> diff --git a/gcc/configure b/gcc/configure
> index 
> 5f67808b77441ba730183eef90367b70a51b08a0..3aa2534f4d4aa4136e9aaf5de51b8e6b67c48d5a
>  100755
> --- a/gcc/configure
> +++ b/gcc/configure
> @@ -7908,11 +7908,6 @@ if (test $DEFAULT_MATCHPD_PARTITIONS -lt 1); then
>  fi
>  
>  
> -cat >>confdefs.h <<_ACEOF
> -#define DEFAULT_MATCHPD_PARTITIONS $DEFAULT_MATCHPD_PARTITIONS
> -_ACEOF
> -
> -
>  
>  # Enable __cxa_atexit for C++.
>  # Check whether --enable-__cxa_atexit was given.
> @@ -19850,7 +19845,7 @@ else
>lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
>lt_status=$lt_dlunknown
>cat > conftest.$ac_ext <<_LT_EOF
> -#line 19853 "configure"
> +#line 19848 "configure"
>  #include "confdefs.h"
>  
>  #if HAVE_DLFCN_H
> @@ -19956,7 +19951,7 @@ else
>lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
>lt_status=$lt_dlunknown
>cat > conftest.$ac_ext <<_LT_EOF
> -#line 19959 "configure"
> +#line 19954 "configure"
>  #include "confdefs.h"
>  
>  #if HAVE_DLFCN_H
> diff --git a/gcc/configure.ac b/gcc/configure.ac
> index 
> cc8dd9e20bf4e3994af99a74ec2a0fe61b0fb1ae..524ef76ec7deb6357d616b6dc6e016d2a9804816
>  100644
> --- a/gcc/configure.ac
> +++ b/gcc/configure.ac
> @@ -932,8 +932,6 @@ if (test $DEFAULT_MATCHPD_PARTITIONS -lt 1); then
>   Cannot be negative.]))
>  fi
>  
> -AC_DEFINE_UNQUOTED(DEFAULT_MATCHPD_PARTITIONS, $DEFAULT_MATCHPD_PARTITIONS,
> - [Define to larger than one set the number of match.pd partitions to 
> make.])
>  AC_SUBST(DEFAULT_MATCHPD_PARTITIONS)
>  
>  # Enable __cxa_atexit for C++.
> 
> 
> 
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)


[PATCH] Fix disambiguation against .MASK_STORE

2023-06-12 Thread Richard Biener via Gcc-patches
Alias analysis was treating .MASK_STORE as storing a full vector
which means we disambiguate against decls smaller than the vector size.
That's of course wrong and a similar issue was fixed for DSE already.
The following makes sure we set the size of the access to unknown
and only constrain max_size.

This fixes runtime execution FAILs of gfortran.dg/matmul_2.f90,
gfortran.dg/matmul_6.f90 and gfortran.dg/pr91577.f90 when using
AVX512 with full masked loop vectorization on Zen4.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed to
trunk sofar.

* tree-ssa-alias.cc (call_may_clobber_ref_p_1): For
.MASK_STORE and friends set the size of the access to
unknown.
---
 gcc/tree-ssa-alias.cc | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gcc/tree-ssa-alias.cc b/gcc/tree-ssa-alias.cc
index 79ed956e300..b5476e8b41e 100644
--- a/gcc/tree-ssa-alias.cc
+++ b/gcc/tree-ssa-alias.cc
@@ -3072,6 +3072,9 @@ call_may_clobber_ref_p_1 (gcall *call, ao_ref *ref, bool 
tbaa_p)
  ao_ref lhs_ref;
  ao_ref_init_from_ptr_and_size (&lhs_ref, gimple_call_arg (call, 0),
 TYPE_SIZE_UNIT (TREE_TYPE (rhs)));
+ /* We cannot make this a known-size access since otherwise
+we disambiguate against stores to decls that are smaller.  */
+ lhs_ref.size = -1;
  lhs_ref.ref_alias_set = lhs_ref.base_alias_set
= tbaa_p ? get_deref_alias_set
   (TREE_TYPE (gimple_call_arg (call, 1))) : 0;
-- 
2.35.3


Re: [PATCH] middle-end, i386: Pattern recognize add/subtract with carry [PR79173]

2023-06-13 Thread Richard Biener via Gcc-patches
On Tue, 6 Jun 2023, Jakub Jelinek wrote:

> Hi!
> 
> The following patch introduces {add,sub}c5_optab and pattern recognizes
> various forms of add with carry and subtract with carry/borrow, see
> pr79173-{1,2,3,4,5,6}.c tests on what is matched.
> Primarily forms with 2 __builtin_add_overflow or __builtin_sub_overflow
> calls per limb (with just one for the least significant one), for
> add with carry even when it is hand written in C (for subtraction
> reassoc seems to change it too much so that the pattern recognition
> doesn't work).  __builtin_{add,sub}_overflow are standardized in C23
> under ckd_{add,sub} names, so it isn't any longer a GNU only extension.
> 
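> 
[A reconstructed sketch, not copied from the pr79173 tests, of the
two-builtin-per-limb shape being recognized:

  static unsigned long
  uaddc (unsigned long x, unsigned long y, unsigned long carry_in,
         unsigned long *carry_out)
  {
    unsigned long r;
    unsigned long c1 = __builtin_add_overflow (x, y, &r);
    unsigned long c2 = __builtin_add_overflow (r, carry_in, &r);
    *carry_out = c1 + c2;   /* at most one of c1/c2 is set */
    return r;
  }

With the patch, the two overflow builtins and the carry combination
collapse into a single IFN_ADDC call when the target provides the new
optab.]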
> Note, for these clang has (IMHO badly designed)
> __builtin_{add,sub}c{b,s,,l,ll} builtins which don't add/subtract just
> a single bit of carry, but basically add 3 unsigned values or
> subtract 2 unsigned values from one, and result in carry out of 0, 1, or 2
> because of that.  If we wanted to introduce those for clang compatibility,
> we could and lower them early to just two __builtin_{add,sub}_overflow
> calls and let the pattern matching in this patch recognize it later.
> 
> I've added expanders for this on ix86 and in addition to that
> added various peephole2s to make sure we get nice (and small) code
> for the common cases.  I think there are other PRs which request that
> e.g. for the _{addcarry,subborrow}_u{32,64} intrinsics, which the patch
> also improves.
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> 
> Would be nice if support for these optabs was added to many other targets,
> arm/aarch64 and powerpc* certainly have such instructions, I'd expect
> in fact that most targets do.
> 
> The _BitInt support I'm working on will also need this to emit reasonable
> code.
> 
> 2023-06-06  Jakub Jelinek  
> 
>   PR middle-end/79173
>   * internal-fn.def (ADDC, SUBC): New internal functions.
>   * internal-fn.cc (expand_ADDC, expand_SUBC): New functions.
>   (commutative_ternary_fn_p): Return true also for IFN_ADDC.
>   * optabs.def (addc5_optab, subc5_optab): New optabs.
>   * tree-ssa-math-opts.cc (match_addc_subc): New function.
>   (math_opts_dom_walker::after_dom_children): Call match_addc_subc
>   for PLUS_EXPR, MINUS_EXPR, BIT_IOR_EXPR and BIT_XOR_EXPR unless
>   other optimizations have been successful for those.
>   * gimple-fold.cc (gimple_fold_call): Handle IFN_ADDC and IFN_SUBC.
>   * gimple-range-fold.cc (adjust_imagpart_expr): Likewise.
>   * tree-ssa-dce.cc (eliminate_unnecessary_stmts): Likewise.
>   * doc/md.texi (addc<mode>5, subc<mode>5): Document new named
>   patterns.
>   * config/i386/i386.md (subborrow<mode>): Add alternative with
>   memory destination.
>   (addc<mode>5, subc<mode>5): New define_expand patterns.
>   (*sub<mode>_3, @add<mode>3_carry, addcarry<mode>, @sub<mode>3_carry,
>   subborrow<mode>, *add<mode>3_cc_overflow_1): Add define_peephole2
>   TARGET_READ_MODIFY_WRITE/-Os patterns to prefer using memory
>   destination in these patterns.
> 
>   * gcc.target/i386/pr79173-1.c: New test.
>   * gcc.target/i386/pr79173-2.c: New test.
>   * gcc.target/i386/pr79173-3.c: New test.
>   * gcc.target/i386/pr79173-4.c: New test.
>   * gcc.target/i386/pr79173-5.c: New test.
>   * gcc.target/i386/pr79173-6.c: New test.
>   * gcc.target/i386/pr79173-7.c: New test.
>   * gcc.target/i386/pr79173-8.c: New test.
>   * gcc.target/i386/pr79173-9.c: New test.
>   * gcc.target/i386/pr79173-10.c: New test.
> 
> --- gcc/internal-fn.def.jj2023-06-05 10:38:06.670333685 +0200
> +++ gcc/internal-fn.def   2023-06-05 11:40:50.672212265 +0200
> @@ -381,6 +381,8 @@ DEF_INTERNAL_FN (ASAN_POISON_USE, ECF_LE
>  DEF_INTERNAL_FN (ADD_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
>  DEF_INTERNAL_FN (SUB_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
>  DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> +DEF_INTERNAL_FN (ADDC, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> +DEF_INTERNAL_FN (SUBC, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
>  DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
>  DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
>  DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> --- gcc/internal-fn.cc.jj 2023-05-15 19:12:24.080780016 +0200
> +++ gcc/internal-fn.cc2023-06-06 09:38:46.333871169 +0200
> @@ -2722,6 +2722,44 @@ expand_MUL_OVERFLOW (internal_fn, gcall
>expand_arith_overflow (MULT_EXPR, stmt);
>  }
>  
> +/* Expand ADDC STMT.  */
> +
> +static void
> +expand_ADDC (internal_fn ifn, gcall *stmt)
> +{
> +  tree lhs = gimple_call_lhs (stmt);
> +  tree arg1 = gimple_call_arg (stmt, 0);
> +  tree arg2 = gimple_call_arg (stmt, 1);
> +  tree arg3 = gimple_call_arg (stmt, 2);
> +  tree type = TREE_TYPE (arg1);
> +  machine_mode mode = TYPE_MODE (type);
> +  insn_code icode = optab_handler (ifn == IFN_ADDC
> +   

[PATCH] Fix disambiguation against .MASK_LOAD

2023-06-13 Thread Richard Biener via Gcc-patches
Alias analysis was treating .MASK_LOAD as storing a full vector
which means we disambiguate against decls smaller than the vector size.
This complements the previous patch handling .MASK_STORE and fixes
runtime execution FAILs of gfortran.dg/matmul_3.f90 and
gfortran.dg/inline_sum_2.f90 when using AVX512 with full masked loop
vectorization on Zen4.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

* tree-ssa-alias.cc (ref_maybe_used_by_call_p_1): For
.MASK_LOAD and friends set the size of the access to unknown.
---
 gcc/tree-ssa-alias.cc | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-ssa-alias.cc b/gcc/tree-ssa-alias.cc
index b5476e8b41e..e1bc04b82ba 100644
--- a/gcc/tree-ssa-alias.cc
+++ b/gcc/tree-ssa-alias.cc
@@ -2829,6 +2829,9 @@ ref_maybe_used_by_call_p_1 (gcall *call, ao_ref *ref, bool tbaa_p)
  ao_ref_init_from_ptr_and_size (&rhs_ref,
 gimple_call_arg (call, 0),
 TYPE_SIZE_UNIT (TREE_TYPE (lhs)));
+ /* We cannot make this a known-size access since otherwise
+we disambiguate against refs to decls that are smaller.  */
+ rhs_ref.size = -1;
  rhs_ref.ref_alias_set = rhs_ref.base_alias_set
= tbaa_p ? get_deref_alias_set (TREE_TYPE
(gimple_call_arg (call, 1))) : 0;
@@ -3073,7 +3076,7 @@ call_may_clobber_ref_p_1 (gcall *call, ao_ref *ref, bool tbaa_p)
  ao_ref_init_from_ptr_and_size (&lhs_ref, gimple_call_arg (call, 0),
 TYPE_SIZE_UNIT (TREE_TYPE (rhs)));
  /* We cannot make this a known-size access since otherwise
-we disambiguate against stores to decls that are smaller.  */
+we disambiguate against refs to decls that are smaller.  */
  lhs_ref.size = -1;
  lhs_ref.ref_alias_set = lhs_ref.base_alias_set
= tbaa_p ? get_deref_alias_set
-- 
2.35.3


[PATCH] middle-end/110232 - fix native interpret of vector

2023-06-13 Thread Richard Biener via Gcc-patches
The following fixes native interpretation of a buffer as boolean
vector with bit-precision elements such as AVX512 vectors.  The
check whether the buffer covers the whole vector was broken for
bit-precision elements and the following instead implements it
based on the vector type size.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

PR middle-end/110232
* fold-const.cc (native_interpret_vector): Use TYPE_SIZE_UNIT
to check whether the buffer covers the whole vector.

* gcc.target/i386/pr110232.c: New testcase.
---
 gcc/fold-const.cc| 11 ---
 gcc/testsuite/gcc.target/i386/pr110232.c | 12 
 2 files changed, 16 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr110232.c

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 84b0d06b819..9ea055d4523 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -8796,16 +8796,13 @@ native_interpret_vector_part (tree type, const unsigned char *bytes,
 static tree
 native_interpret_vector (tree type, const unsigned char *ptr, unsigned int len)
 {
-  tree etype;
-  unsigned int size;
-  unsigned HOST_WIDE_INT count;
+  unsigned HOST_WIDE_INT size;
 
-  etype = TREE_TYPE (type);
-  size = GET_MODE_SIZE (SCALAR_TYPE_MODE (etype));
-  if (!TYPE_VECTOR_SUBPARTS (type).is_constant (&count)
-  || size * count > len)
+  if (!tree_to_poly_uint64 (TYPE_SIZE_UNIT (type)).is_constant (&size)
+  || size > len)
 return NULL_TREE;
 
+  unsigned HOST_WIDE_INT count = TYPE_VECTOR_SUBPARTS (type).to_constant ();
   return native_interpret_vector_part (type, ptr, len, count, 1);
 }
 
diff --git a/gcc/testsuite/gcc.target/i386/pr110232.c b/gcc/testsuite/gcc.target/i386/pr110232.c
new file mode 100644
index 000..43b74b15e00
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr110232.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -march=znver4 --param vect-partial-vector-usage=2 -fno-vect-cost-model -fdump-tree-vect" } */
+
+int a[4096];
+
+void foo ()
+{
+  for (int i = 1; i < 4095; ++i)
+a[i] = 42;
+}
+
+/* { dg-final { scan-tree-dump-not "VIEW_CONVERT_EXPR" "vect" } } */
-- 
2.35.3


Re: [PATCH] New finish_compare_by_pieces target hook (for x86).

2023-06-13 Thread Richard Biener via Gcc-patches
On Mon, Jun 12, 2023 at 4:04 PM Roger Sayle  wrote:
>
>
> The following simple test case, from PR 104610, shows that memcmp () == 0
> can result in some bizarre code sequences on x86.
>
> int foo(char *a)
> {
> static const char t[] = "0123456789012345678901234567890";
> return __builtin_memcmp(a, &t[0], sizeof(t)) == 0;
> }
>
> with -O2 currently contains both:
> xorl%eax, %eax
> xorl$1, %eax
> and also
> movl$1, %eax
> xorl$1, %eax
>
> Changing the return type of foo to _Bool results in the equally
> bizarre:
> xorl%eax, %eax
> testl   %eax, %eax
> sete%al
> and also
> movl$1, %eax
> testl   %eax, %eax
> sete%al
>
> All these sequences set the result to a constant, but this optimization
> opportunity only occurs very late during compilation, by basic block
> duplication in the 322r.bbro pass, too late for CSE or peephole2 to
> do anything about it.  The problem is that the idiom expanded by
> compare_by_pieces for __builtin_memcmp_eq contains basic blocks that
> can't easily be optimized by if-conversion due to the multiple
> incoming edges on the fail block.
>
> In summary, compare_by_pieces generates code that looks like:
>
> if (x[0] != y[0]) goto fail_label;
> if (x[1] != y[1]) goto fail_label;
> ...
> if (x[n] != y[n]) goto fail_label;
> result = 1;
> goto end_label;
> fail_label:
> result = 0;
> end_label:
>
> In theory, the RTL if-conversion pass could be enhanced to tackle
> arbitrarily complex if-then-else graphs, but the solution proposed
> here is to allow suitable targets to perform if-conversion during
> compare_by_pieces.  The x86, for example, can take advantage that
> all of the above comparisons set and test the zero flag (ZF), which
> can then be used in combination with sete.  Hence compare_by_pieces
> could instead generate:
>
> if (x[0] != y[0]) goto fail_label;
> if (x[1] != y[1]) goto fail_label;
> ...
> if (x[n] != y[n]) goto fail_label;
> fail_label:
> sete result
>
> which requires one less basic block, and the redundant conditional
> branch to a label immediately after is cleaned up by GCC's existing
> RTL optimizations.
>
> For the test case above, where -O2 -msse4 previously generated:
>
> foo:movdqu  (%rdi), %xmm0
> pxor.LC0(%rip), %xmm0
> ptest   %xmm0, %xmm0
> je  .L5
> .L2:movl$1, %eax
> xorl$1, %eax
> ret
> .L5:movdqu  16(%rdi), %xmm0
> pxor.LC1(%rip), %xmm0
> ptest   %xmm0, %xmm0
> jne .L2
> xorl%eax, %eax
> xorl$1, %eax
> ret
>
> we now generate:
>
> foo:movdqu  (%rdi), %xmm0
> pxor.LC0(%rip), %xmm0
> ptest   %xmm0, %xmm0
> jne .L2
> movdqu  16(%rdi), %xmm0
> pxor.LC1(%rip), %xmm0
> ptest   %xmm0, %xmm0
> .L2:sete%al
> movzbl  %al, %eax
> ret
>
> Using a target hook allows the large amount of intelligence already in
> compare_by_pieces to be re-used by the i386 backend, but this can also
> help other backends with condition flags where the equality result can
> be materialized.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  Ok for mainline?

What's the guarantee that the zero flag is appropriately set on all
edges incoming now and forever?  Does this require target specific
knowledge on how do_compare_rtx_and_jump is emitting RTL?

Do you see matching this in ifcvt to be unreasonable?  I'm thinking
of "reducing" the incoming edges pairwise without actually looking
at the ifcvt code.

Thanks,
Richard.

>
> 2023-06-12  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386.cc (ix86_finish_compare_by_pieces): New
> function to provide a backend specific implementation.
> (TARGET_FINISH_COMPARE_BY_PIECES): Use the above function.
>
> * doc/tm.texi.in (TARGET_FINISH_COMPARE_BY_PIECES): New @hook.
> * doc/tm.texi: Regenerate.
>
> * expr.cc (compare_by_pieces): Call finish_compare_by_pieces in
> targetm to finalize the RTL expansion.  Move the current
> implementation to a default target hook.
> * target.def (finish_compare_by_pieces): New target hook to allow
> compare_by_pieces to be customized by the target.
> * targhooks.cc (default_finish_compare_by_pieces): Default
> implementation moved here from expr.cc's compare_by_pieces.
> * targhooks.h (default_finish_compare_by_pieces): Prototype.
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/pieces-memcmp-1.c: New test case.
>
>
> Thanks in advance,
> Roger
> --
>


[PATCH] Fix memory leak in loop header copying

2023-06-13 Thread Richard Biener via Gcc-patches


Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

* tree-ssa-loop-ch.cc (ch_base::copy_headers): Free loop BBs.
---
 gcc/tree-ssa-loop-ch.cc | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
index 7fdef3bb11a..22252bee135 100644
--- a/gcc/tree-ssa-loop-ch.cc
+++ b/gcc/tree-ssa-loop-ch.cc
@@ -642,6 +642,7 @@ ch_base::copy_headers (function *fun)
   if (stmt_can_terminate_bb_p (gsi_stmt (bsi)))
 precise = false;
   }
+ free (bbs);
}
   if (precise
  && get_max_loop_iterations_int (loop) == 1)
-- 
2.35.3


Re: [PATCH] rs6000: replace '(const_int 0)' to 'unspec:BLK [(const_int 0)]' for stack_tie

2023-06-14 Thread Richard Biener via Gcc-patches
On Wed, 14 Jun 2023, Jiufu Guo wrote:

> 
> Hi,
> 
> Segher Boessenkool  writes:
> 
> > Hi!
> >
> > As I said in a reply to the original patch: not okay.  Sorry.
> 
> Thanks a lot for your comments!
> I'm also thinking about other solutions:
> 1. "set (mem/c:BLK (reg/f:DI 1 1) (const_int 0 [0])"
>   This is the existing pattern.  It may be read as an action
>   to clean an unknown-size memory block.
> 
> 2. "set (mem/c:BLK (reg/f:DI 1 1) unspec:blk (const_int 0 [0])
> UNSPEC_TIE".
>   Current patch is using this one.
> 
> 3. "set (mem/c:DI (reg/f:DI 1 1) unspec:DI (const_int 0 [0])
> UNSPEC_TIE".
>This avoids using BLK on unspec, but using DI.

That gives the MEM a size which means we can interpret the (set ..)
as killing a specific area of memory, enabling DSE of earlier
stores.

AFAIU this special instruction is only supposed to prevent
code motion (of stack memory accesses?) across this instruction?
I'd say a

  (may_clobber (mem:BLK (reg:DI 1 1)))

might be more to the point?  I've used "may_clobber" which doesn't
exist since I'm not sure whether a clobber is considered a kill.
The docs say "Represents the storing or possible storing of an 
unpredictable..." - what is it?  Storing or possible storing?
I suppose stack_tie should be less strict than the documented
(clobber (mem:BLK (const_int 0))) (clobber all memory).

?

> 4. "set (mem/c:BLK (reg/f:DI 1 1) unspec (const_int 0 [0])
> UNSPEC_TIE"
>There is still a mode for the unspec.
> 
> 
> >
> > But some comments on this patch:
> >
> > On Tue, Jun 13, 2023 at 08:23:35PM +0800, Jiufu Guo wrote:
> >> +&& XINT (SET_SRC (set), 1) == UNSPEC_TIE
> >> +&& XVECEXP (SET_SRC (set), 0, 0) == const0_rtx);
> >
> > This makes it required that the operand of an UNSPEC_TIE unspec is a
> > const_int 0.  This should be documented somewhere.  Ideally you would
> > want no operand at all here, but every unspec has an operand.
> 
> Right!  Since we checked UNSPEC_TIE already, we may not need to check
> the inner operand. Like " && XINT (SET_SRC (set), 1) == UNSPEC_TIE);".
> 
> >
> >> +  RTVEC_ELT (p, i)
> >> +  = gen_rtx_SET (mem, gen_rtx_UNSPEC (BLKmode, gen_rtvec (1, const0_rtx),
> >> +  UNSPEC_TIE));
> >
> > If it is hard to indent your code, your code is trying to do to much.
> > Just have an extra temporary?
> >
> >   rtx un = gen_rtx_UNSPEC (BLKmode, gen_rtvec (1, const0_rtx), 
> > UNSPEC_TIE);
> >   RTVEC_ELT (p, i) = gen_rtx_SET (mem, un);
> >
> > That is shorter even, and certainly more readable :-)
> 
> Yeap, thanks!
> 
> >
> >> @@ -10828,7 +10829,9 @@ (define_expand "restore_stack_block"
> >>operands[4] = gen_frame_mem (Pmode, operands[1]);
> >>p = rtvec_alloc (1);
> >>RTVEC_ELT (p, 0) = gen_rtx_SET (gen_frame_mem (BLKmode, operands[0]),
> >> -const0_rtx);
> >> +gen_rtx_UNSPEC (BLKmode,
> >> +gen_rtvec (1, const0_rtx),
> >> +UNSPEC_TIE));
> >>operands[5] = gen_rtx_PARALLEL (VOIDmode, p);
> >
> > I have a hard time to see how this could ever be seen as clearer or more
> > obvious or anything like that :-(
> 
> I was thinking about just invoking gen_stack_tie here.
> 
> BR,
> Jeff (Jiufu Guo)
> 
> >
> >
> > Segher
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)


Re: [PATCH] rs6000: replace '(const_int 0)' to 'unspec:BLK [(const_int 0)]' for stack_tie

2023-06-14 Thread Richard Biener via Gcc-patches
On Wed, 14 Jun 2023, Richard Sandiford wrote:

> Richard Biener  writes:
> > AFAIU this special instruction is only supposed to prevent
> > code motion (of stack memory accesses?) across this instruction?
> > I'd say a
> >
> >   (may_clobber (mem:BLK (reg:DI 1 1)))
> >
> > might be more to the point?  I've used "may_clobber" which doesn't
> > exist since I'm not sure whether a clobber is considered a kill.
> > The docs say "Represents the storing or possible storing of an 
> > unpredictable..." - what is it? Storing or possible storing?
> 
> I'd also understood it to be either.  As in, it is a may-clobber
> that can be used for must-clobber.  Alternatively: the value stored
> is unpredictable, and can therefore be the same as the current value.
> 
> I think the main difference between:
> 
>   (clobber (mem:BLK ?))
> 
> and
> 
>   (set (mem:BLK ?) (unspec:BLK ?))
> 
> is that the latter must happen for correctness (unless something
> that understands the unspec proves otherwise) whereas a clobber
> can validly be dropped.  So for something like stack_tie, a set
> seems more correct than a clobber.

How can a clobber be validly dropped?  For the case of stack
memory if there's no stack use after it it could be elided
and I suppose the clobber itself can be moved.  But then
the function return is a stack use as well.

Btw, by the same reasoning the (set (mem:...)) could be removed, no?
Or does the (unspec:) SET_SRC have implicit side-effects that
prevent the removal (so rs6000 could have its stack_tie removed)?

That said, I fail to see how a clobber is special here.

Richard.

> Thanks,
> Richard
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)


[PATCH] [RFC] main loop masked vectorization with --param vect-partial-vector-usage=1

2023-06-14 Thread Richard Biener via Gcc-patches


Currently vect_determine_partial_vectors_and_peeling will decide
to apply fully masking to the main loop despite
--param vect-partial-vector-usage=1 when the currently analyzed
vector mode results in a vectorization factor that's bigger
than the number of scalar iterations.  That's undesirable for
targets where a vector mode can handle both partial vector and
non-partial vector vectorization.  I understand that for AARCH64
we have SVE and NEON but SVE can only do partial vector and
NEON only non-partial vector vectorization, plus the target
chooses to let cost comparison decide the vector mode to use.

For x86 and the upcoming AVX512 partial vector support the
story is different, the target chooses the first (and largest)
vector mode that can successfully be used for vectorization.  But
that means with --param vect-partial-vector-usage=1 we will
always choose AVX512 with partial vectors for the main loop
even if, for example, V4SI would be a perfect fit with full
vectors and no required epilog!

The following tries to find the appropriate condition for
this - I suppose simply refusing to set LOOP_VINFO_USING_PARTIAL_VECTORS_P
on the main loop when --param vect-partial-vector-usage=1 will
hurt AARCH64?  Incidentally, looking up the docs for
vect-partial-vector-usage suggests that it's not supposed to
control epilog vectorization but instead
"1 allows partial vector loads and stores if vectorization removes the
need for the code to iterate".  That's probably OK in the end
but if there's a fixed size vector mode that allows the same thing
without using masking that would be better.
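
As a hypothetical illustration, a loop where a fixed size vector
mode is the perfect fit:

/* With a known trip count of 4, V4SI full vectors cover the loop
   exactly, with no masking and no epilogue required - yet trying
   the largest mode first would pick a masked 64-byte AVX512 loop
   here under --param vect-partial-vector-usage=1.  */
void
f (int *a)
{
  for (int i = 0; i < 4; ++i)
    a[i] = 42;
}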

I wonder if we should special-case known niter (bounds) somehow
when analyzing the vector modes and override the target's sorting?

Maybe we want a new --param in addition to vect-epilogues-nomask
and vect-partial-vector-usage to say we want masked epilogues?

* tree-vect-loop.cc (vect_determine_partial_vectors_and_peeling):
For non-VLA vectorization interpret param_vect_partial_vector_usage == 1
as only applying to epilogues.
---
 gcc/tree-vect-loop.cc | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 9be66b8fbc5..9323aa572d4 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2478,7 +2478,15 @@ vect_determine_partial_vectors_and_peeling (loop_vec_info loop_vinfo,
  && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
  && !vect_known_niters_smaller_than_vf (loop_vinfo))
LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
-  else
+  /* Avoid using a large fixed size vectorization mode with masking
+for the main loop when we were asked to only use masking for
+the epilog.
+???  Ideally we'd start analysis with a better sized mode,
+the param_vect_partial_vector_usage == 2 case suffers from
+this as well.  But there's a catch-22.  */
+  else if (!(!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+&& param_vect_partial_vector_usage == 1
+&& LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()))
LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = true;
 }
 
-- 
2.35.3


Re: [PATCH] rs6000: replace '(const_int 0)' to 'unspec:BLK [(const_int 0)]' for stack_tie

2023-06-14 Thread Richard Biener via Gcc-patches
On Wed, 14 Jun 2023, Richard Sandiford wrote:

> Richard Biener  writes:
> > On Wed, 14 Jun 2023, Richard Sandiford wrote:
> >
> >> Richard Biener  writes:
> >> > AFAIU this special instruction is only supposed to prevent
> >> > code motion (of stack memory accesses?) across this instruction?
> >> > I'd say a
> >> >
> >> >   (may_clobber (mem:BLK (reg:DI 1 1)))
> >> >
> >> > might be more to the point?  I've used "may_clobber" which doesn't
> >> > exist since I'm not sure whether a clobber is considered a kill.
> >> > The docs say "Represents the storing or possible storing of an 
> >> > unpredictable..." - what is it? Storing or possible storing?
> >> 
> >> I'd also understood it to be either.  As in, it is a may-clobber
> >> that can be used for must-clobber.  Alternatively: the value stored
> >> is unpredictable, and can therefore be the same as the current value.
> >> 
> >> I think the main difference between:
> >> 
> >>   (clobber (mem:BLK ?))
> >> 
> >> and
> >> 
> >>   (set (mem:BLK ?) (unspec:BLK ?))
> >> 
> >> is that the latter must happen for correctness (unless something
> >> that understands the unspec proves otherwise) whereas a clobber
> >> can validly be dropped.  So for something like stack_tie, a set
> >> seems more correct than a clobber.
> >
> > How can a clobber be validly dropped?  For the case of stack
> > memory if there's no stack use after it it could be elided
> > and I suppose the clobber itself can be moved.  But then
> > the function return is a stack use as well.
> >
> > Btw, by the same reasoning the (set (mem:...)) could be removed, no?
> > Or does the (unspec:) SET_SRC have implicit side-effects that
> > prevent the removal (so rs6000 could have its stack_tie removed)?
> >
> > That said, I fail to see how a clobber is special here.
> 
> Clobbers are for side-effects.  They don't start a def-use chain.
> E.g. any use after a full clobber is an uninitialised read rather
> than a read of the clobber "result".

I see.  So

(parallel
 (unspec stack_tie)
 (clobber (mem:BLK ...)))

then?  I suppose it needs to be an unspec_volatile?  It feels like
the stack_ties are a delicate hack preventing enough but not too
much optimization ...

> In contrast, a set of memory with an unspec source is in dataflow terms
> the same as a set of memory with a specified source.  (some unspecs
> actually have well-defined values, it's just that only the target code
> knows what those well-defined value are.)
> 
> So a set of memory could only be removed if DSE proves that there are no
> reads of the set bytes before the next set(s) to the same bytes of memory.
> And memory is always live.
> 
> Thanks,
> Richard
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)


[PATCH 1/3] Inline vect_get_max_nscalars_per_iter

2023-06-14 Thread Richard Biener via Gcc-patches
The function is only meaningful for LOOP_VINFO_MASKS processing so
inline it into the single use.

Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?

* tree-vect-loop.cc (vect_get_max_nscalars_per_iter): Inline
into ...
(vect_verify_full_masking): ... this.
---
 gcc/tree-vect-loop.cc | 22 ++
 1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index ace9e759f5b..a9695e5b25d 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -1117,20 +1117,6 @@ can_produce_all_loop_masks_p (loop_vec_info loop_vinfo, tree cmp_type)
   return true;
 }
 
-/* Calculate the maximum number of scalars per iteration for every
-   rgroup in LOOP_VINFO.  */
-
-static unsigned int
-vect_get_max_nscalars_per_iter (loop_vec_info loop_vinfo)
-{
-  unsigned int res = 1;
-  unsigned int i;
-  rgroup_controls *rgm;
-  FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm)
-res = MAX (res, rgm->max_nscalars_per_iter);
-  return res;
-}
-
 /* Calculate the minimum precision necessary to represent:
 
   MAX_NITERS * FACTOR
@@ -1210,8 +1196,6 @@ static bool
 vect_verify_full_masking (loop_vec_info loop_vinfo)
 {
   unsigned int min_ni_width;
-  unsigned int max_nscalars_per_iter
-= vect_get_max_nscalars_per_iter (loop_vinfo);
 
   /* Use a normal loop if there are no statements that need masking.
  This only happens in rare degenerate cases: it means that the loop
@@ -1219,6 +1203,12 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
   if (LOOP_VINFO_MASKS (loop_vinfo).is_empty ())
 return false;
 
+  /* Calculate the maximum number of scalars per iteration for every rgroup.  */
+  unsigned int max_nscalars_per_iter = 1;
+  for (auto rgm : LOOP_VINFO_MASKS (loop_vinfo))
+max_nscalars_per_iter
+  = MAX (max_nscalars_per_iter, rgm.max_nscalars_per_iter);
+
   /* Work out how many bits we need to represent the limit.  */
   min_ni_width
 = vect_min_prec_for_max_niters (loop_vinfo, max_nscalars_per_iter);
-- 
2.35.3



[PATCH 2/3] Add loop_vinfo argument to vect_get_loop_mask

2023-06-14 Thread Richard Biener via Gcc-patches
This adds a loop_vinfo argument for future use, making the next
patch smaller.

* tree-vectorizer.h (vect_get_loop_mask): Add loop_vec_info
argument.
* tree-vect-loop.cc (vect_get_loop_mask): Likewise.
(vectorize_fold_left_reduction): Adjust.
(vect_transform_reduction): Likewise.
(vectorizable_live_operation): Likewise.
* tree-vect-stmts.cc (vectorizable_call): Likewise.
(vectorizable_operation): Likewise.
(vectorizable_store): Likewise.
(vectorizable_load): Likewise.
(vectorizable_condition): Likewise.
---
 gcc/tree-vect-loop.cc  | 16 +---
 gcc/tree-vect-stmts.cc | 36 +++-
 gcc/tree-vectorizer.h  |  3 ++-
 3 files changed, 30 insertions(+), 25 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index a9695e5b25d..1897e720389 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -6637,7 +6637,7 @@ vectorize_fold_left_reduction (loop_vec_info loop_vinfo,
   gimple *new_stmt;
   tree mask = NULL_TREE;
   if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
-   mask = vect_get_loop_mask (gsi, masks, vec_num, vectype_in, i);
+   mask = vect_get_loop_mask (loop_vinfo, gsi, masks, vec_num, vectype_in, i);
 
   /* Handle MINUS by adding the negative.  */
   if (reduc_fn != IFN_LAST && code == MINUS_EXPR)
@@ -7950,8 +7950,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
  gcc_assert (commutative_binary_op_p (code, op.type));
  std::swap (vop[0], vop[1]);
}
- tree mask = vect_get_loop_mask (gsi, masks, vec_num * ncopies,
- vectype_in, i);
+ tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
+ vec_num * ncopies, vectype_in, i);
  gcall *call = gimple_build_call_internal (cond_fn, 4, mask,
vop[0], vop[1], vop[0]);
  new_temp = make_ssa_name (vec_dest, call);
@@ -7967,8 +7967,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 
  if (masked_loop_p && mask_by_cond_expr)
{
- tree mask = vect_get_loop_mask (gsi, masks, vec_num * ncopies,
- vectype_in, i);
+ tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
+ vec_num * ncopies, vectype_in, i);
  build_vect_cond_expr (code, vop, mask, gsi);
}
 
@@ -10075,7 +10075,8 @@ vectorizable_live_operation (vec_info *vinfo,
 the loop mask for the final iteration.  */
  gcc_assert (ncopies == 1 && !slp_node);
  tree scalar_type = TREE_TYPE (STMT_VINFO_VECTYPE (stmt_info));
- tree mask = vect_get_loop_mask (gsi, &LOOP_VINFO_MASKS (loop_vinfo),
+ tree mask = vect_get_loop_mask (loop_vinfo, gsi,
+ &LOOP_VINFO_MASKS (loop_vinfo),
  1, vectype, 0);
  tree scalar_res = gimple_build (&stmts, CFN_EXTRACT_LAST, scalar_type,
  mask, vec_lhs_phi);
@@ -10359,7 +10360,8 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks,
arrangement.  */
 
 tree
-vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks,
+vect_get_loop_mask (loop_vec_info,
+   gimple_stmt_iterator *gsi, vec_loop_masks *masks,
unsigned int nvectors, tree vectype, unsigned int index)
 {
   rgroup_controls *rgm = &(*masks)[nvectors - 1];
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index a7acc032d47..47baf35227f 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -3692,7 +3692,8 @@ vectorizable_call (vec_info *vinfo,
  unsigned int vec_num = vec_oprnds0.length ();
  /* Always true for SLP.  */
  gcc_assert (ncopies == 1);
- vargs[varg++] = vect_get_loop_mask (gsi, masks, vec_num,
+ vargs[varg++] = vect_get_loop_mask (loop_vinfo,
+ gsi, masks, vec_num,
  vectype_out, i);
}
  size_t k;
@@ -3733,7 +3734,8 @@ vectorizable_call (vec_info *vinfo,
  unsigned int vec_num = vec_oprnds0.length ();
  /* Always true for SLP.  */
  gcc_assert (ncopies == 1);
- tree mask = vect_get_loop_mask (gsi, masks, vec_num,
+ tree mask = vect_get_loop_mask (loop_vinfo,
+ gsi, masks, vec_num,
  vectype_out, i);
  vargs[mask_opno] = prep

[PATCH 3/3] AVX512 fully masked vectorization

2023-06-14 Thread Richard Biener via Gcc-patches
This implements fully masked vectorization or a masked epilog for
AVX512 style masks which single themselves out by representing
each lane with a single bit and by using integer modes for the mask
(both is much like GCN).

AVX512 is also special in that it doesn't have any instruction
to compute the mask from a scalar IV like SVE has with while_ult.
Instead the masks are produced by vector compares and the loop
control retains the scalar IV (mainly to avoid dependences on
mask generation, a suitable mask test instruction is available).
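
In intrinsics terms the mask computation is roughly the following
hypothetical sketch (for sixteen 32-bit lanes; not code from the
patch):

#include <immintrin.h>

/* Compute the loop mask for scalar IV value I and trip count N:
   lanes whose iteration index is still below N remain active.  */
static __mmask16
loop_mask (int i, int n)
{
  __m512i lane = _mm512_setr_epi32 (0, 1, 2, 3, 4, 5, 6, 7,
                                    8, 9, 10, 11, 12, 13, 14, 15);
  __m512i iv = _mm512_add_epi32 (_mm512_set1_epi32 (i), lane);
  return _mm512_cmplt_epi32_mask (iv, _mm512_set1_epi32 (n));
}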

Like RVV code generation prefers a decrementing IV though IVOPTs
messes things up in some cases removing that IV to eliminate
it with an incrementing one used for address generation.

One of the motivating testcases is from PR108410 which in turn
is extracted from x264 where large size vectorization shows
issues with small trip loops.  Execution time there improves
compared to classic AVX512 with AVX2 epilogues for the cases
of less than 32 iterations.

size  scalar    128    256    512   512e   512f
   1    9.42  11.32   9.35  11.17  15.13  16.89
   2    5.72   6.53   6.66   6.66   7.62   8.56
   3    4.49   5.10   5.10   5.74   5.08   5.73
   4    4.10   4.33   4.29   5.21   3.79   4.25
   6    3.78   3.85   3.86   4.76   2.54   2.85
   8    3.64   1.89   3.76   4.50   1.92   2.16
  12    3.56   2.21   3.75   4.26   1.26   1.42
  16    3.36   0.83   1.06   4.16   0.95   1.07
  20    3.39   1.42   1.33   4.07   0.75   0.85
  24    3.23   0.66   1.72   4.22   0.62   0.70
  28    3.18   1.09   2.04   4.20   0.54   0.61
  32    3.16   0.47   0.41   0.41   0.47   0.53
  34    3.16   0.67   0.61   0.56   0.44   0.50
  38    3.19   0.95   0.95   0.82   0.40   0.45
  42    3.09   0.58   1.21   1.13   0.36   0.40

'size' specifies the number of actual iterations, 512e is for
a masked epilog and 512f for the fully masked loop.  From
4 scalar iterations on the AVX512 masked epilog code is clearly
the winner; the fully masked variant is clearly worse, and
its size benefit is also tiny.

This patch does not enable using fully masked loops or
masked epilogues by default.  More work on cost modeling
and vectorization kind selection on x86_64 is necessary
for this.

Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE
which could be exploited further to unify some of the flags
we have right now but there didn't seem to be many easy things
to merge, so I'm leaving this for followups.

Mask requirements as registered by vect_record_loop_mask are kept in their
original form and recorded in a hash_set now instead of being
processed to a vector of rgroup_controls.  Instead that's now
left to the final analysis phase which tries forming the rgroup_controls
vector using while_ult and if that fails now tries AVX512 style
which needs a different organization and instead fills a hash_map
with the relevant info.  vect_get_loop_mask now has two implementations,
one for the two mask styles we then have.

I have decided against interweaving vect_set_loop_condition_partial_vectors
with conditions to do AVX512 style masking and instead opted to
"duplicate" this to vect_set_loop_condition_partial_vectors_avx512.
Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512.

I was split between making 'vec_loop_masks' a class with methods,
possibly merging in the _len stuff into a single registry.  It
seemed to be too many changes for the purpose of getting AVX512
working.  I'm going to play wait and see what happens with RISC-V
here since they are going to get both masks and lengths registered
I think.

The vect_prepare_for_masked_peels hunk might run into issues with
SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE
looked odd.

Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've run
the testsuite with --param vect-partial-vector-usage=2 with and
without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
and one latent wrong-code (PR110237).

There's followup work to be done to try enabling masked epilogues
for x86-64 by default (when AVX512 is enabled, possibly only when
-mprefer-vector-width=512).  Getting cost modeling and decision
right is going to be challenging.

Any comments?

OK?

Btw, testing on GCN would be welcome - the _avx512 paths could
work for it so in case the while_ult path fails (not sure if
it ever does) it could get _avx512 style masking.  Likewise
testing on ARM just to see I didn't break anything here.
I don't have SVE hardware so testing is probably meaningless.

Thanks,
Richard.

* tree-vectorizer.h (enum vect_partial_vector_style): New.
(_loop_vec_info::partial_vector_style): Likewise.
(LOOP_VINFO_PARTIAL_VECTORS_STYLE): Likewise.
(rgroup_controls::compare_type): Add.
(vec_loop_masks): Change from a typedef to auto_vec<>
to a structure.
* tree-vect-loop-manip.cc (vect_set

Re: [PATCH] middle-end, i386: Pattern recognize add/subtract with carry [PR79173]

2023-06-14 Thread Richard Biener via Gcc-patches
On Wed, 14 Jun 2023, Jakub Jelinek wrote:

> On Tue, Jun 13, 2023 at 01:29:04PM +0200, Jakub Jelinek via Gcc-patches wrote:
> > > > + else if (addc_subc)
> > > > +   {
> > > > + if (!integer_zerop (arg2))
> > > > +   ;
> > > > + /* x = y + 0 + 0; x = y - 0 - 0; */
> > > > + else if (integer_zerop (arg1))
> > > > +   result = arg0;
> > > > + /* x = 0 + y + 0; */
> > > > + else if (subcode != MINUS_EXPR && integer_zerop (arg0))
> > > > +   result = arg1;
> > > > + /* x = y - y - 0; */
> > > > + else if (subcode == MINUS_EXPR
> > > > +  && operand_equal_p (arg0, arg1, 0))
> > > > +   result = integer_zero_node;
> > > > +   }
> > > 
> > > So this all performs simplifications but also constant folding.  In
> > > particular the match.pd re-simplification will invoke fold_const_call
> > > on all-constant argument function calls but does not do extra folding
> > > on partially constant arg cases but instead relies on patterns here.
> > > 
> > > Can you add all-constant arg handling to fold_const_call and
> > > consider moving cases like y + 0 + 0 to match.pd?
> > 
> > The reason I've done this here is that this is the spot where all other
> > similar internal functions are handled, be it the ubsan ones
> > - IFN_UBSAN_CHECK_{ADD,SUB,MUL}, or __builtin_*_overflow ones
> > - IFN_{ADD,SUB,MUL}_OVERFLOW, or these 2 new ones.  The code handles
> > there 2 constant arguments as well as various patterns that can be
> > simplified and has code to clean it up later, build a COMPLEX_CST,
> > or COMPLEX_EXPR etc. as needed.  So, I think we want to handle those
> > elsewhere, we should do it for all of those functions, but then
> > probably incrementally.
> 
> The patch I've posted yesterday is now fully tested on x86_64-linux and
> i686-linux.
> 
> Here is an untested incremental patch to handle constant folding of these
> in fold-const-call.cc rather than gimple-fold.cc.
> Not really sure if that is the way to go because it is replacing 28
> lines of former code with 65 of new code, for the overall benefit that say
> int
> foo (long long *p)
> {
>   int one = 1;
>   long long max = __LONG_LONG_MAX__;
>   return __builtin_add_overflow (one, max, p);
> }
> can be now fully folded already in ccp1 pass while before it was only
> cleaned up in forwprop1 pass right after it.

I think that's still very much desirable so this followup looks OK.
Maybe you can re-base it as a prerequisite though?

> As for doing some stuff in match.pd, I'm afraid it would result in even more
> significant growth; the advantage of gimple-fold.cc doing all of these in
> one place is that the needed infrastructure can be shared.

Yes, I saw that.

Richard.

> 
> --- gcc/gimple-fold.cc.jj 2023-06-14 12:21:38.657657759 +0200
> +++ gcc/gimple-fold.cc2023-06-14 12:52:04.335054958 +0200
> @@ -5731,34 +5731,6 @@ gimple_fold_call (gimple_stmt_iterator *
>   result = arg0;
> else if (subcode == MULT_EXPR && integer_onep (arg0))
>   result = arg1;
> -   if (type
> -   && result == NULL_TREE
> -   && TREE_CODE (arg0) == INTEGER_CST
> -   && TREE_CODE (arg1) == INTEGER_CST
> -   && (!uaddc_usubc || TREE_CODE (arg2) == INTEGER_CST))
> - {
> -   if (cplx_result)
> - result = int_const_binop (subcode, fold_convert (type, arg0),
> -   fold_convert (type, arg1));
> -   else
> - result = int_const_binop (subcode, arg0, arg1);
> -   if (result && arith_overflowed_p (subcode, type, arg0, arg1))
> - {
> -   if (cplx_result)
> - overflow = build_one_cst (type);
> -   else
> - result = NULL_TREE;
> - }
> -   if (uaddc_usubc && result)
> - {
> -   tree r = int_const_binop (subcode, result,
> - fold_convert (type, arg2));
> -   if (r == NULL_TREE)
> - result = NULL_TREE;
> -   else if (arith_overflowed_p (subcode, type, result, arg2))
> - overflow = build_one_cst (type);
> - }
> - }
> if (result)
>   {
> if (result == integer_zero_node)
> --- gcc/fold-const-call.cc.jj 2023-06-02 10:36:43.096967505 +0200
> +++ gcc/fold-const-call.cc2023-06-14 12:56:08.195631214 +0200
> @@ -1669,6 +1669,7 @@ fold_const_call (combined_fn fn, tree ty
>  {
>const char *p0, *p1;
>char c;
> +  tree_code subcode;
>switch (fn)
>  {
>  case CFN_BUILT_IN_STRSPN:
> @@ -1738,6 +1739,46 @@ fold_const_call (combined_fn fn, tree ty
>  case CFN_FOLD_LEFT_PLUS:
>return fold_const_fold_left (type, arg0, arg1, PLUS_EXPR);
>  
> +case CFN_UBSAN_CHECK_ADD:
> +case CFN_ADD_OVERFLOW:
> +  s

Re: [PATCH] middle-end, i386: Pattern recognize add/subtract with carry [PR79173]

2023-06-14 Thread Richard Biener via Gcc-patches
On Tue, 13 Jun 2023, Jakub Jelinek wrote:

> On Tue, Jun 13, 2023 at 08:40:36AM +, Richard Biener wrote:
> > I suspect re-association can wreck things even more here.  I have
> > to say the matching code is very hard to follow, not sure if
> > splitting out a function matching
> > 
> >_22 = .{ADD,SUB}_OVERFLOW (_6, _5);
> >_23 = REALPART_EXPR <_22>;
> >_24 = IMAGPART_EXPR <_22>;
> > 
> > from _23 and _24 would help?
> 
> I've outlined 3 most often used sequences of statements or checks
> into 3 helper functions, hope that helps.
> 
> > > +  while (TREE_CODE (rhs[0]) == SSA_NAME && !rhs[3])
> > > + {
> > > +   gimple *g = SSA_NAME_DEF_STMT (rhs[0]);
> > > +   if (has_single_use (rhs[0])
> > > +   && is_gimple_assign (g)
> > > +   && (gimple_assign_rhs_code (g) == code
> > > +   || (code == MINUS_EXPR
> > > +   && gimple_assign_rhs_code (g) == PLUS_EXPR
> > > +   && TREE_CODE (gimple_assign_rhs2 (g)) == INTEGER_CST)))
> > > + {
> > > +   rhs[0] = gimple_assign_rhs1 (g);
> > > +   tree &r = rhs[2] ? rhs[3] : rhs[2];
> > > +   r = gimple_assign_rhs2 (g);
> > > +   if (gimple_assign_rhs_code (g) != code)
> > > + r = fold_build1 (NEGATE_EXPR, TREE_TYPE (r), r);
> > 
> > Can you use const_unop here?  In fact both will not reliably
> > negate all constants (ick), so maybe we want a force_const_negate ()?
> 
> It is unsigned type NEGATE_EXPR of INTEGER_CST, so I think it should
> work.  That said, changed it to const_unop and am just giving up on it
> as if it wasn't a PLUS_EXPR with INTEGER_CST addend if const_unop doesn't
> simplify.
> 
> > > +   else if (addc_subc)
> > > + {
> > > +   if (!integer_zerop (arg2))
> > > + ;
> > > +   /* x = y + 0 + 0; x = y - 0 - 0; */
> > > +   else if (integer_zerop (arg1))
> > > + result = arg0;
> > > +   /* x = 0 + y + 0; */
> > > +   else if (subcode != MINUS_EXPR && integer_zerop (arg0))
> > > + result = arg1;
> > > +   /* x = y - y - 0; */
> > > +   else if (subcode == MINUS_EXPR
> > > +&& operand_equal_p (arg0, arg1, 0))
> > > + result = integer_zero_node;
> > > + }
> > 
> > So this all performs simplifications but also constant folding.  In
> > particular the match.pd re-simplification will invoke fold_const_call
> > on all-constant argument function calls but does not do extra folding
> > on partially constant arg cases but instead relies on patterns here.
> > 
> > Can you add all-constant arg handling to fold_const_call and
> > consider moving cases like y + 0 + 0 to match.pd?
> 
> The reason I've done this here is that this is the spot where all other
> similar internal functions are handled, be it the ubsan ones
> - IFN_UBSAN_CHECK_{ADD,SUB,MUL}, or __builtin_*_overflow ones
> - IFN_{ADD,SUB,MUL}_OVERFLOW, or these 2 new ones.  The code handles
> there 2 constant arguments as well as various patterns that can be
> simplified and has code to clean it up later, build a COMPLEX_CST,
> or COMPLEX_EXPR etc. as needed.  So, I think we want to handle those
> elsewhere, we should do it for all of those functions, but then
> probably incrementally.
> 
> > > +@cindex @code{addc@var{m}5} instruction pattern
> > > +@item @samp{addc@var{m}5}
> > > +Adds operands 2, 3 and 4 (where the last operand is guaranteed to have
> > > +only values 0 or 1) together, sets operand 0 to the result of the
> > > +addition of the 3 operands and sets operand 1 to 1 iff there was no
> > > +overflow on the unsigned additions, and to 0 otherwise.  So, it is
> > > +an addition with carry in (operand 4) and carry out (operand 1).
> > > +All operands have the same mode.
> > 
> > operand 1 set to 1 for no overflow sounds weird when specifying it
> > as carry out - can you double check?
> 
> Fixed.
> 
> > > +@cindex @code{subc@var{m}5} instruction pattern
> > > +@item @samp{subc@var{m}5}
> > > +Similarly to @samp{addc@var{m}5}, except subtracts operands 3 and 4
> > > +from operand 2 instead of adding them.  So, it is
> > > +a subtraction with carry/borrow in (operand 4) and carry/borrow out
> > > +(operand 1).  All operands have the same mode.
> > > +
> > 
> > I wonder if we want to name them uaddc and usubc?  Or is this supposed
> > to be simply the twos-complement "carry"?  I think the docs should
> > say so then (note we do have uaddv and addv).
> 
> Makes sense, I've actually renamed even the internal functions etc.
> 
> Here is only lightly tested patch with everything but gimple-fold.cc
> changed.
> 
> 2023-06-13  Jakub Jelinek  
> 
>   PR middle-end/79173
>   * internal-fn.def (UADDC, USUBC): New internal functions.
>   * internal-fn.cc (expand_UADDC, expand_USUBC): New functions.
>   (commutative_ternary_fn_p): Return true also for IFN_UADDC.
>   * optabs.def (uaddc5_optab, usubc5_optab): New optabs.
>   * tree-ssa-math-opts.cc (uaddc_cast, uaddc_ne0, uaddc_is_cplxpart,
>   match_uaddc_usubc): New functi

  1   2   3   4   5   6   7   8   9   10   >