Re: [PATCH] aarch64: Add support for -mcpu=grace

2024-06-26 Thread Andrew Pinski
On Wed, Jun 26, 2024 at 12:40 AM Kyrylo Tkachov  wrote:
>
> Hi all,
>
> This adds support for the NVIDIA Grace CPU to aarch64.
> We reuse the tuning decisions for the Neoverse V2 core, but include a
> number of architecture features that are not enabled by default in
> -mcpu=neoverse-v2.
>
> This allows Grace users to more simply target the CPU with -mcpu=grace
> rather than remembering what extensions to tag on top of
> -mcpu=neoverse-v2.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
> I’m pushing this to trunk.

> RNG

I noticed this is missing from grace but is included in neoverse-v2.
Is that expected?

Thanks,
Andrew Pinski


> I have patches tested for the 14, 13, 12, 11 branches as well that I’d like 
> to push there to make it simpler for our users to target Grace.
> They are logically the same as this one, but they account for slight
> syntactic differences and flag-definition changes that have happened since
> those branches.
> Thanks,
> Kyrill
>
> * config/aarch64/aarch64-cores.def (grace): New entry.
> * config/aarch64/aarch64-tune.md: Regenerate.
> * doc/invoke.texi (AArch64 Options): Document the above.
>
> Signed-off-by: Kyrylo Tkachov 
>


Re: [PATCH v2] c: Error message for incorrect use of static in array declarations

2024-06-26 Thread Marek Polacek
On Wed, Jun 26, 2024 at 08:01:57PM +0200, Martin Uecker wrote:
> 
> Thanks Marek, here is the second version which should
> implement all your suggestions.  

Thanks!
 
> (BTW: Without the newline at the end, the test case has
> undefined behavior..., not that we need to care.)
> 
> 
> Bootstrapped and regression tested on x86_64.
> 
> 
> [PATCH] c: Error message for incorrect use of static in array 
> declarations.
> 
> Add an explicit error message when C99's static is
> used without a size expression in an array declarator.
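
As a minimal sketch of the constructs involved (illustrative declarations,
not from the patch):

  /* C99 'static' in an array declarator of a function parameter promises
     the callee at least that many elements; the new error covers the two
     forms where it appears without a usable size.  */
  void ok   (int a[static 5]);  /* valid: caller must pass >= 5 elements */
  void bad1 (int a[static]);    /* error: 'static' without an array size */
  void bad2 (int a[static *]);  /* error: 'static' with unspecified VLA size */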
> 
> gcc/c:
> * c-parser.cc (c_parser_direct_declarator_inner): Add
> error message.
> 
> gcc/testsuite:
> * gcc.dg/c99-arraydecl-4.c: New test.
> 
> diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc
> index 6a3f96d5b61..db60507252b 100644
> --- a/gcc/c/c-parser.cc
> +++ b/gcc/c/c-parser.cc
> @@ -4715,8 +4715,6 @@ c_parser_direct_declarator_inner (c_parser *parser, 
> bool id_present,
>location_t brace_loc = c_parser_peek_token (parser)->location;
>struct c_declarator *declarator;
>struct c_declspecs *quals_attrs = build_null_declspecs ();
> -  bool static_seen;
> -  bool star_seen;
>struct c_expr dimen;
>dimen.value = NULL_TREE;
>dimen.original_code = ERROR_MARK;
> @@ -4724,7 +4722,8 @@ c_parser_direct_declarator_inner (c_parser *parser, 
> bool id_present,
>c_parser_consume_token (parser);
>c_parser_declspecs (parser, quals_attrs, false, false, true,
> false, false, false, false, cla_prefer_id);
> -  static_seen = c_parser_next_token_is_keyword (parser, RID_STATIC);
> +  const bool static_seen = c_parser_next_token_is_keyword (parser,
> +RID_STATIC);
>if (static_seen)
>   c_parser_consume_token (parser);
>if (static_seen && !quals_attrs->declspecs_seen_p)
> @@ -4735,38 +4734,34 @@ c_parser_direct_declarator_inner (c_parser *parser, 
> bool id_present,
>/* If "static" is present, there must be an array dimension.
>Otherwise, there may be a dimension, "*", or no
>dimension.  */
> -  if (static_seen)
> +  bool star_seen = false;
> +  if (c_parser_next_token_is (parser, CPP_MULT)
> +   && c_parser_peek_2nd_token (parser)->type == CPP_CLOSE_SQUARE)
>   {
> -   star_seen = false;
> -   dimen = c_parser_expr_no_commas (parser, NULL);
> +   star_seen = true;
> +   c_parser_consume_token (parser);
>   }
> -  else
> +  else if (!c_parser_next_token_is (parser, CPP_CLOSE_SQUARE))
> + dimen = c_parser_expr_no_commas (parser, NULL);
> +
> +  if (static_seen)
>   {
> -   if (c_parser_next_token_is (parser, CPP_CLOSE_SQUARE))
> +   if (star_seen)
>   {
> -   dimen.value = NULL_TREE;
> +   error_at (c_parser_peek_token (parser)->location,

Now I realize that the location is not ideal here, it points to the
closing ], but we can easily get the location of "static".  So perhaps:

  location_t static_loc = UNKNOWN_LOCATION;
  if (c_parser_next_token_is_keyword (parser, RID_STATIC))
{
  static_loc = c_parser_peek_token (parser)->location;
  c_parser_consume_token (parser);
}

and then use "seen_loc != UNKNOWN_LOCATION" instead of the bool, or
do
  const bool static_seen = (static_loc != UNKNOWN_LOCATION);
if you prefer.

> + "% may not be used with an unspecified "
> + "variable length array size");
> +   /* Prevent further errors.  */
> star_seen = false;
> +   dimen.value = error_mark_node;
>   }
> -   else if (c_parser_next_token_is (parser, CPP_MULT))
> - {
> -   if (c_parser_peek_2nd_token (parser)->type == CPP_CLOSE_SQUARE)
> - {
> -   dimen.value = NULL_TREE;
> -   star_seen = true;
> -   c_parser_consume_token (parser);
> - }
> -   else
> - {
> -   star_seen = false;
> -   dimen = c_parser_expr_no_commas (parser, NULL);
> - }
> - }
> -   else
> +   else if (!dimen.value)
>   {
> -   star_seen = false;
> -   dimen = c_parser_expr_no_commas (parser, NULL);
> +   error_at (c_parser_peek_token (parser)->location,
> + "% may not be used without an array size");
>   }

No need to have { } around a single statement.

Marek



Re: [PATCH] _Hashtable fancy pointer support

2024-06-26 Thread Jonathan Wakely
On Wed, 26 Jun 2024 at 21:39, François Dumont  wrote:
>
> Hi
>
> Here is my proposal to add support for fancy allocator pointers.
>
> The only place where we still have C pointers is at the
> iterator::pointer level but it's consistent with std::list
> implementation and also logical considering that we do not get
> value_type pointers from the allocator.
>
> I also wondered if it was ok to use nullptr in different places or if I
> should rather do __node_ptr{}. But recent modifications are using
> nullptr so I think it's fine.

I haven't reviewed the patch yet, but this answers the nullptr question:
https://en.cppreference.com/w/cpp/named_req/NullablePointer
(aka Cpp17NullablePointer in the C++ standard).
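
For reference, a minimal sketch of what Cpp17NullablePointer guarantees for
a fancy pointer type (names illustrative, not from the patch):

  // Any Cpp17NullablePointer type P supports value initialization,
  // construction/assignment from nullptr, and equality comparison with
  // nullptr -- which is why using nullptr in _Hashtable is fine for
  // fancy pointers too.
  template<class P>
  bool null_checks ()
  {
    P p{};          // value-initialized => null value
    P q = nullptr;  // constructible from nullptr
    q = nullptr;    // assignable from nullptr
    return p == nullptr && q == nullptr && p == q;
  }

Instantiating it with a raw pointer, e.g. null_checks<int*>(), returns true;
a conforming fancy pointer must behave the same way.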



Re: Frontend access to target features (was Re: [PATCH] libgccjit: Add ability to get CPU features)

2024-06-26 Thread David Malcolm
On Sun, 2024-03-10 at 12:05 +0100, Iain Buclaw wrote:
> Excerpts from David Malcolm's message of March 5, 2024 4:09 pm:
> > On Thu, 2023-11-09 at 19:33 -0500, Antoni Boucher wrote:
> > > Hi.
> > > See answers below.
> > > 
> > > On Thu, 2023-11-09 at 18:04 -0500, David Malcolm wrote:
> > > > On Thu, 2023-11-09 at 17:27 -0500, Antoni Boucher wrote:
> > > > > Hi.
> > > > > This patch adds support for getting the CPU features in
> > > > > libgccjit
> > > > > (bug
> > > > > 112466)
> > > > > 
> > > > > There's a TODO in the test:
> > > > > I'm not sure how to test that gcc_jit_target_info_arch
> > > > > returns
> > > > > the
> > > > > correct value since it is dependant on the CPU.
> > > > > Any idea on how to improve this?
> > > > > 
> > > > > Also, I created a CStringHash to be able to have a
> > > > > std::unordered_set. Is there any built-in way
> > > > > of
> > > > > doing
> > > > > this?
> > > > 
> > > > Thanks for the patch.
> > > > 
> > > > Some high-level questions:
> > > > 
> > > > Is this specifically about detecting capabilities of the host
> > > > that
> > > > libgccjit is currently running on? or how the target was
> > > > configured
> > > > when libgccjit was built?
> > > 
> > > I'm less sure about this part. I'll need to do more tests.
> > > 
> > > > 
> > > > One of the benefits of libgccjit is that, in theory, we support
> > > > all
> > > > of
> > > > the targets that GCC already supports.  Does this patch change
> > > > that,
> > > > or
> > > > is this more about giving client code the ability to determine
> > > > capabilities of the specific host being compiled for?
> > > 
> > > This should not change that. If it does, this is a bug.
> > > 
> > > > 
> > > > I'm nervous about having per-target jit code.  Presumably
> > > > there's a
> > > > reason that we can't reuse existing target logic here - can you
> > > > please
> > > > describe what the problem is.  I see that the ChangeLog has:
> > > > 
> > > > > * config/i386/i386-jit.cc: New file.
> > > > 
> > > > where i386-jit.cc has almost 200 lines of nontrivial code. 
> > > > Where
> > > > did
> > > > this come from?  Did you base it on existing code in our source
> > > > tree,
> > > > making modifications to fit the new internal API, or did you
> > > > write
> > > > it
> > > > from scratch?  In either case, how onerous would this be for
> > > > other
> > > > targets?
> > > 
> > > This was mostly copied from the same code done for the Rust and D
> > > frontends.
> > > See this commit and the following:
> > > https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=b1c06fd9723453dd2b2ec306684cb806dc2b4fbb
> > > The equivalent to i386-jit.cc is there:
> > > https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=22e3557e2d52f129f2bbfdc98688b945dba28dc9
> > 
> > [CCing Iain and Arthur re those patches; for reference, the patch
> > being
> > discussed is attached to :
> > https://gcc.gnu.org/pipermail/jit/2024q1/001792.html ]
> > 
> > One of my concerns about this patch is that we seem to be gaining
> > code
> > that's per-(frontend x config) which seems to be copied and pasted
> > with
> > a search and replace, which could lead to an M*N explosion.
> > 
> 
> That's certainly the case with the configure/make rules. Itself I
> think
> is copied originally from the {cpu_type}-protos.h machinery.
> 
> It might be worth pointing out that the c-family of front-ends don't
> have separate headers because their per-target macros are defined in
> {cpu_type}.h directly - for better or worse.
> 
> > Is there any real difference between the per-config code for the
> > different frontends, or should there be a general "enumerate all
> > features of the target" hook that's independent of the frontend?
> > (but
> > perhaps calls into it).
> > 
> 
> As far as I understand, the configure parts should all be identical
> between tm_p, tm_d, tm_rust, ..., so would benefit from being
> templated
> to aid any other front-ends adding in their own per target hooks.
> 
> > Am I right in thinking that (rustc with default LLVM backend) has
> > some
> > set of feature strings that both (rustc with rustc_codegen_gcc) and
> > gccrs are trying to emulate?  If so, is it presumably a goal that
> > libgccjit gives identical results to gccrs?  If so, would it be
> > crazy
> > for libgccjit to consume e.g. config/i386/i386-rust.cc ?
> 
> I don't know whether libgccjit can just pull in directly the
> implementation of the rust target hooks here.

Sorry for the delay in responding.

I don't want to be in the business of maintaining a copy of the per-
target code for "jit", and I think it makes sense for libgccjit to
return identical information compared to gccrs.

So I think it would be ideal for jit to share code with rust for this,
rather than do a one-time copy-and-paste followed by an ongoing "keep
things updated" treadmill.

Presumably there would be Makefile.in issues given that e.g. Makefile
has i386-rust.o listed in:

# Target specific, Rust specific object file
RUST_TARGET_OBJS= i386-rust.o 

Re: [PATCH] libgccjit: Fix get_size of size_t

2024-06-26 Thread David Malcolm
On Wed, 2024-02-21 at 14:16 -0500, Antoni Boucher wrote:
> On Thu, 2023-12-07 at 19:57 -0500, David Malcolm wrote:
> > On Thu, 2023-12-07 at 17:26 -0500, Antoni Boucher wrote:
> > > Hi.
> > > This patch fixes getting the size of size_t (bug 112910).
> > > 
> > > There's one issue with this patch: like every other feature that
> > > checks
> > > for target-specific stuff, it requires a compilation before
> > > actually
> > > fetching the size of the type.
> > > Which means that getting the size before a compilation might be
> > > wrong
> > > (and I actually believe is wrong on x86-64).
> > > 
> > > I was wondering if we should always implicitly do the first
> > > compilation to gather the correct info: this would fix this issue
> > > and
> > > all the others that we have due to that.
> > > I'm not sure what would be the performance implication.
> > 
> > Maybe introduce a new class target_info which contains all the
> > information we might want to find via a compilation, and have the
> > top-
> > level recording::context have a pointer to it, which starts as
> > nullptr,
> > but can be populated on-demand the first time something needs it?
> 
> That would mean that we'll need to populate it for every top-level
> context, right? Would the idea be that we should then use child
> contexts to have the proper information filled?
> If so, how is this different than just compiling two contexts like
> what
> I currently do?
> This would also mean that we'll do an implicit compilation whenever
> we
> use an API that needs this info, right? Wouldn't that be unexpected?

I was thinking a compilation with an empty playback::context to lazily
capture the target data.
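
A minimal sketch of that idea (names hypothetical, not the actual libgccjit
classes):

  // Hypothetical lazily-populated target_info: the first caller triggers
  // one empty compilation to capture target data; later callers reuse the
  // cached result.
  struct target_info
  {
    int size_t_bits;  // e.g. 64 on x86-64
  };

  struct context
  {
    const target_info &get_target_info ()
    {
      if (!m_info_valid)
        {
          // Stand-in for "run an empty playback compilation and record
          // what the target reports".
          m_info = target_info {64};
          m_info_valid = true;
        }
      return m_info;
    }

  private:
    target_info m_info {};
    bool m_info_valid = false;
  };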

My hope was that this would make things easier for users.  But you're
the one using this API, so if you're more comfortable with the explicit
initial compilation approach, let's go with that.

If so, this is OK for trunk - but we might want to add a note to the
documentation about the double-compilation workaround.

Dave


> 
> Thanks for the idea.
> 
> > 
> > > 
> > > Another solution that I have been thinking about for a while now
> > > would
> > > be to have another frontend libgccaot (I don't like that name),
> > > which
> > > is like libgccjit but removes the JIT part so that we get access
> > > to
> > > the
> > > target stuff directly and would remove the need for having a
> > > separation
> > > between recording and playback as far as I understand.
> > > That's a long-term solution, but I wanted to share the idea now
> > > and
> > > gather your thoughts on that.
> > 
> > FWIW the initial version of libgccjit didn't have a split between
> > recording and playback; instead the client code had to pass in a
> > callback to call into the various API functions (creating tree
> > nodes).
> > See:
> > https://gcc.gnu.org/legacy-ml/gcc-patches/2013-10/msg00228.html
> > 
> > Dave
> > 
> 



[PATCH] c++: ICE with computed gotos [PR115469]

2024-06-26 Thread Marek Polacek
Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?

-- >8 --
This is a low-prio crash on invalid code where we ICE on a VAR_DECL
with erroneous type.  I thought I'd try to avoid putting such decls
into ->names and ->names_in_scope but that sounds riskier than the
following cleanup.

PR c++/115469

gcc/cp/ChangeLog:

* decl.cc (decl_with_nontrivial_dtor_p): New.
(poplevel_named_label_1): Use it.
(check_goto_1): Likewise.

gcc/testsuite/ChangeLog:

* g++.dg/ext/label17.C: New test.
---
 gcc/cp/decl.cc | 19 +++
 gcc/testsuite/g++.dg/ext/label17.C | 18 ++
 2 files changed, 33 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/ext/label17.C

diff --git a/gcc/cp/decl.cc b/gcc/cp/decl.cc
index 03deb1493a4..e5696079c28 100644
--- a/gcc/cp/decl.cc
+++ b/gcc/cp/decl.cc
@@ -514,6 +514,19 @@ level_for_consteval_if (cp_binding_level *b)
  && IF_STMT_CONSTEVAL_P (b->this_entity));
 }
 
+/* True if T is a non-static VAR_DECL that has a non-trivial destructor.  */
+
+static bool
+decl_with_nontrivial_dtor_p (const_tree t)
+{
+  if (error_operand_p (t))
+return false;
+
+  return (VAR_P (t)
+ && !TREE_STATIC (t)
+ && TYPE_HAS_NONTRIVIAL_DESTRUCTOR (TREE_TYPE (t)));
+}
+
 /* Update data for defined and undefined labels when leaving a scope.  */
 
 int
@@ -575,8 +588,7 @@ poplevel_named_label_1 (named_label_entry **slot, 
cp_binding_level *bl)
if (bl->kind == sk_catch)
  vec_safe_push (cg, get_identifier ("catch"));
for (tree d = use->names_in_scope; d; d = DECL_CHAIN (d))
- if (TREE_CODE (d) == VAR_DECL && !TREE_STATIC (d)
- && TYPE_HAS_NONTRIVIAL_DESTRUCTOR (TREE_TYPE (d)))
+ if (decl_with_nontrivial_dtor_p (d))
vec_safe_push (cg, d);
  }
 
@@ -4003,8 +4015,7 @@ check_goto_1 (named_label_entry *ent, bool computed)
  tree end = b == level ? names : NULL_TREE;
  for (tree d = b->names; d != end; d = DECL_CHAIN (d))
{
- if (TREE_CODE (d) == VAR_DECL && !TREE_STATIC (d)
- && TYPE_HAS_NONTRIVIAL_DESTRUCTOR (TREE_TYPE (d)))
+ if (decl_with_nontrivial_dtor_p (d))
{
  if (!identified)
{
diff --git a/gcc/testsuite/g++.dg/ext/label17.C 
b/gcc/testsuite/g++.dg/ext/label17.C
new file mode 100644
index 000..076ef1f798e
--- /dev/null
+++ b/gcc/testsuite/g++.dg/ext/label17.C
@@ -0,0 +1,18 @@
+// PR c++/115469
+// { dg-do compile { target indirect_jumps } }
+// { dg-options "" }
+
+void
+fn1 ()
+{
+  b = &&c;// { dg-error "not declared|not defined" }
+  goto *0;
+}
+
+void
+fn2 ()
+{
+c:
+  b = &&c;  // { dg-error "not declared" }
+  goto *0;
+}

base-commit: 0731985920cdeeeb028f03ddb8a7f035565c1594
-- 
2.45.2



Re: [committed] Remove compromised sh test

2024-06-26 Thread Oleg Endo



On Wed, 2024-06-26 at 07:22 -0600, Jeff Law wrote:
> Surya's recent patch to IRA improves the code for sh/pr54602-1.c 
> slightly.  Specifically it's able to eliminate a save/restore in the 
> prologue/epilogue and a bit of register shuffling.
> 
> As a result there literally aren't any insns that can be used to fill 
> the delay slot of the return, so a nop gets emitted and the test fails.
> 
> Given there literally aren't any insns to move into the delay slot, the 
> best course of action is to just drop the test.
> 
> Pushed to the trunk.
> 
> Jeff

I can't reproduce what you are saying.
Which triplet and flags is your test setup using?

For this test case, GCC 13 with -m4 -ml -O1 -fno-pic:

_test01:
mov.l   r8,@-r15
sts.l   pr,@-r15
mov.l   .L3,r0
jsr @r0
mov r6,r8
add r8,r0
lds.l   @r15+,pr
rts 
mov.l   @r15+,r8
.L3:
.long   _test00


current GCC master branch with -m4 -ml -O1 -fno-pic:

_test00:
mov.l   r8,@-r15
sts.l   pr,@-r15
mov.l   .L3,r0
jsr @r0
mov r6,r8
add r8,r0
lds.l   @r15+,pr
rts
mov.l   @r15+,r8
.L4:
.align 2
.L3:
.long   _test01


Best regards,
Oleg Endo


Re: [committed] Remove compromised sh test

2024-06-26 Thread Jeff Law




On 6/26/24 4:12 PM, Oleg Endo wrote:



On Wed, 2024-06-26 at 07:22 -0600, Jeff Law wrote:

Surya's recent patch to IRA improves the code for sh/pr54602-1.c
slightly.  Specifically it's able to eliminate a save/restore in the
prologue/epilogue and a bit of register shuffling.

As a result there literally aren't any insns that can be used to fill
the delay slot of the return, so a nop gets emitted and the test fails.

Given there literally aren't any insns to move into the delay slot, the
best course of action is to just drop the test.

Pushed to the trunk.

Jeff


I can't reproduce what you are saying.
Which triplet and flags is your test setup using?

For this test case, GCC 13 with -m4 -ml -O1 -fno-pic:

No -m flags at all.   As plain of a testrun as you can do.

jeff




Re: [committed] Remove compromised sh test

2024-06-26 Thread Oleg Endo



On Wed, 2024-06-26 at 16:39 -0600, Jeff Law wrote:
> 
> On 6/26/24 4:12 PM, Oleg Endo wrote:
> > 
> > 
> > On Wed, 2024-06-26 at 07:22 -0600, Jeff Law wrote:
> > > Surya's recent patch to IRA improves the code for sh/pr54602-1.c
> > > slightly.  Specifically it's able to eliminate a save/restore in the
> > > prologue/epilogue and a bit of register shuffling.
> > > 
> > > As a result there literally aren't any insns that can be used to fill
> > > the delay slot of the return, so a nop gets emitted and the test fails.
> > > 
> > > Given there literally aren't any insns to move into the delay slot, the
> > > best course of action is to just drop the test.
> > > 
> > > Pushed to the trunk.
> > > 
> > > Jeff
> > 
> > I can't reproduce what you are saying.
> > Which triplet and flags is your test setup using?
> > 
> > For this test case, GCC 13 with -m4 -ml -O1 -fno-pic:
> No -m flags at all.   As plain of a testrun as you can do.
> 

OK, then what's the default config of your test setup / triplet?
Can you please show the generated code that you get?  Because - like I said
- I can't reproduce it.

Best regards,
Oleg Endo


Re: [PATCH] libgccjit: Add support for machine-dependent builtins

2024-06-26 Thread David Malcolm
On Thu, 2023-11-23 at 17:17 -0500, Antoni Boucher wrote:
> Hi.
> I did split the patch and sent one for the bfloat16 support and
> another
> one for the vector support.
> 
> Here's the updated patch for the machine-dependent builtins.
> 

Thanks for the patch; sorry about the long delay in reviewing it.

CCing Jan and Uros re the i386 part of that patch; for reference the
patch being discussed is here:
  https://gcc.gnu.org/pipermail/gcc-patches/2023-November/638027.html

> From e025f95f4790ae861e709caf23cbc0723c1a3804 Mon Sep 17 00:00:00 2001
> From: Antoni Boucher 
> Date: Mon, 23 Jan 2023 17:21:15 -0500
> Subject: [PATCH] libgccjit: Add support for machine-dependent builtins

[...snip...]

> diff --git a/gcc/config/i386/i386-builtins.cc 
> b/gcc/config/i386/i386-builtins.cc
> index 42fc3751676..5cc1d6f4d2e 100644
> --- a/gcc/config/i386/i386-builtins.cc
> +++ b/gcc/config/i386/i386-builtins.cc
> @@ -225,6 +225,22 @@ static GTY(()) tree ix86_builtins[(int) 
> IX86_BUILTIN_MAX];
>  
>  struct builtin_isa ix86_builtins_isa[(int) IX86_BUILTIN_MAX];
>  
> +static void
> +clear_builtin_types (void)
> +{
> +  for (int i = 0 ; i < IX86_BT_LAST_CPTR + 1 ; i++)
> +ix86_builtin_type_tab[i] = NULL;
> +
> +  for (int i = 0 ; i < IX86_BUILTIN_MAX ; i++)
> +  {
> +ix86_builtins[i] = NULL;
> +ix86_builtins_isa[i].set_and_not_built_p = true;
> +  }
> +
> +  for (int i = 0 ; i < IX86_BT_LAST_ALIAS + 1 ; i++)
> +ix86_builtin_func_type_tab[i] = NULL;
> +}
> +
>  tree get_ix86_builtin (enum ix86_builtins c)
>  {
>return ix86_builtins[c];
> @@ -1483,6 +1499,8 @@ ix86_init_builtins (void)
>  {
>tree ftype, decl;
>  
> +  clear_builtin_types ();
> +
>ix86_init_builtin_types ();
>  
>/* Builtins to get CPU type and features. */

Please can one of the i386 maintainers check this?
(CCing Jan and Uros: this is for the case where the compiler code runs
multiple times in-process due to being linked into libgccjit.so.  We
want to restore state within i386-builtins.cc to an initial state, and
ensure that no GC-managed objects persist from previous in-memory
compiles).

> diff --git a/gcc/jit/docs/topics/compatibility.rst
b/gcc/jit/docs/topics/compatibility.rst
> index ebede440ee4..764de23341e 100644
> --- a/gcc/jit/docs/topics/compatibility.rst
> +++ b/gcc/jit/docs/topics/compatibility.rst
> @@ -378,3 +378,12 @@ alignment of a variable:
>  
>  ``LIBGCCJIT_ABI_25`` covers the addition of
>  :func:`gcc_jit_type_get_restrict`
> +
> +.. _LIBGCCJIT_ABI_26:
> +
> +``LIBGCCJIT_ABI_26``
> +
> +
> +``LIBGCCJIT_ABI_26`` covers the addition of a function to get target 
> builtins:
> +
> +  * :func:`gcc_jit_context_get_target_builtin_function`
> diff --git a/gcc/jit/docs/topics/functions.rst 
> b/gcc/jit/docs/topics/functions.rst
> index cf5cb716daf..e9b77fdb892 100644
> --- a/gcc/jit/docs/topics/functions.rst
> +++ b/gcc/jit/docs/topics/functions.rst
> @@ -140,6 +140,25 @@ Functions
>uses such a parameter will lead to an error being emitted within
>the context.
>  
> +.. function::  gcc_jit_function *\
> +   gcc_jit_context_get_target_builtin_function (gcc_jit_context 
> *ctxt,\
> +const char *name)
> +
> +   Get the :type:`gcc_jit_function` for the built-in function with the
> +   given name.  For example:

Might be nice to add the "(sometimes called intrinsic functions)" text
you have in the header here.
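
For reference, a hypothetical usage sketch: the entry point comes from this
patch, while the builtin name and surrounding calls are assumptions for
illustration.

  #include <libgccjit.h>

  int
  main (void)
  {
    gcc_jit_context *ctxt = gcc_jit_context_acquire ();
    /* Builtin name is an assumption; any target builtin known to the
       current target should work.  */
    gcc_jit_function *fn
      = gcc_jit_context_get_target_builtin_function (ctxt,
                                                     "__builtin_ia32_pause");
    (void) fn;
    gcc_jit_context_release (ctxt);
    return 0;
  }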

[...snip]

> diff --git a/gcc/jit/dummy-frontend.cc b/gcc/jit/dummy-frontend.cc
> index a729086bafb..3ca9702d429 100644
> --- a/gcc/jit/dummy-frontend.cc
> +++ b/gcc/jit/dummy-frontend.cc

[...]

> @@ -29,8 +30,14 @@ along with GCC; see the file COPYING3.  If not see
>  #include "options.h"
>  #include "stringpool.h"
>  #include "attribs.h"
> +#include "jit-recording.h"
> +#include "print-tree.h"
>  
>  #include 
> +#include 
> +#include 
> +
> +using namespace gcc::jit;
>  
>  /* Attribute handling.  */
>  
> @@ -86,6 +93,11 @@ static const struct attribute_spec::exclusions 
> attr_const_pure_exclusions[] =
>ATTR_EXCL (NULL, false, false, false)
>  };
>  
> +hash_map target_builtins{};

I was wondering if this needs a GTY marker, but I don't think it does:
presumably it's only used within jit_langhook_parse_file where no GC
can happen - unless jit_langhook_write_globals makes use of it?

> +std::unordered_map target_function_types {};
> +recording::context target_builtins_ctxt{NULL};

Please add a comment to target_builtins_ctxt saying what it's for.  As
far as I can tell, it's for getting at recording::types from
tree_type_to_jit_type; we then use a new "copy" mechanism to copy
objects from target_builtins_ctxt for use with the real
recording::context.

This feels ugly, but maybe it's the only way to make it work.

Could tree_type_to_jit_type take a recording::context as a param?  The
only non-recursive uses of tree_type_to_jit_type seem to be in
jit_langhook_builtin_funct

Re: [committed] Remove compromised sh test

2024-06-26 Thread Jeff Law




On 6/26/24 4:44 PM, Oleg Endo wrote:



On Wed, 2024-06-26 at 16:39 -0600, Jeff Law wrote:


On 6/26/24 4:12 PM, Oleg Endo wrote:



On Wed, 2024-06-26 at 07:22 -0600, Jeff Law wrote:

Surya's recent patch to IRA improves the code for sh/pr54602-1.c
slightly.  Specifically it's able to eliminate a save/restore in the
prologue/epilogue and a bit of register shuffling.

As a result there literally aren't any insns that can be used to fill
the delay slot of the return, so a nop gets emitted and the test fails.

Given there literally aren't any insns to move into the delay slot, the
best course of action is to just drop the test.

Pushed to the trunk.

Jeff


I can't reproduce what you are saying.
Which triplet and flags is your test setup using?

For this test case, GCC 13 with -m4 -ml -O1 -fno-pic:

No -m flags at all.   As plain of a testrun as you can do.



OK, then what's the default config of your test setup / triplet?
Can you please show the generated code that you get?  Because - like I said
- I can't reproduce it.

test01:
sts.l   pr,@-r15! 31[c=4 l=2]  movsi_i/10
add #-4,r15 ! 32[c=4 l=2]  *addsi3/0
mov.l   .L3,r0  ! 26[c=10 l=2]  movsi_i/0
jsr @r0 ! 12[c=5 l=2]  call_valuei
mov.l   r6,@r15 ! 4 [c=4 l=2]  movsi_i/8
mov.l   @r15,r1 ! 29[c=1 l=2]  movsi_i/5
add r1,r0   ! 30[c=4 l=2]  *addsi3/0
add #4,r15  ! 36[c=4 l=2]  *addsi3/0
lds.l   @r15+,pr! 38[c=1 l=2]  movsi_i/14
rts
nop ! 40[c=0 l=4]  *return_i


Note that there's a scheduling barrier in the RTL between insns 30 and 
36.  So instructions prior to insn 36 can't be used to fill the delay slot.


jeff


Re: [PATCH] [libstdc++] [testsuite] defer to check_vect_support* [PR115454]

2024-06-26 Thread Matthias Kretz
Ah, thank you. I didn't realize that there's a default for dg-do. I probably 
knew it back when I added check_vect_support_and_set_flags...

OK for all branches from my side.

-Matthias

On Wednesday, 26 June 2024 04:45:28 CDT Alexandre Oliva wrote:
> The newly-added testcase overrides the default dg-do action set by
> check_vect_support_and_set_flags (in libstdc++-dg/conformance.exp), so
> it attempts to run the test even if runtime vector support is not
> available.
> 
> Remove the explicit dg-do directive, so that the default is honored,
> and the test is run if vector support is found, and only compiled
> otherwise.
> 
> Tested so far with gcc-13 on ppc64-vx7r2, targeting vector-less
> hardware, where it cured the observed regression.  Regstrapping on
> x86_64- and ppc64el-linux-gnu just to be sure.  Ok to install?
> 
> 
> for  libstdc++-v3/ChangeLog
> 
>   PR libstdc++/115454
>   * testsuite/experimental/simd/pr115454_find_last_set.cc: Defer
>   to check_vect_support_and_set_flags's default dg-do action.
> ---
>  .../experimental/simd/pr115454_find_last_set.cc|1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git
> a/libstdc++-v3/testsuite/experimental/simd/pr115454_find_last_set.cc
> b/libstdc++-v3/testsuite/experimental/simd/pr115454_find_last_set.cc index
> 25a713b4e948c..4ade8601f272f 100644
> --- a/libstdc++-v3/testsuite/experimental/simd/pr115454_find_last_set.cc
> +++ b/libstdc++-v3/testsuite/experimental/simd/pr115454_find_last_set.cc
> @@ -1,5 +1,4 @@
>  // { dg-options "-std=gnu++17" }
> -// { dg-do run { target *-*-* } }
>  // { dg-require-effective-target c++17 }
>  // { dg-additional-options "-march=x86-64-v4" { target avx512f_runtime } }
>  // { dg-require-cmath "" }


-- 
──┬
 Dr. Matthias Kretz   │ SDE — Software Development for Experiments
 Senior Software Engineer,│ 📞 +49 6159 713084
 SIMD Expert, │ 📧 m.kr...@gsi.de floss.social/@mkretz
 ISO C++ Numerics Chair   │ 🔗 mattkretz.github.io
──┴

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Katharina Stummeyer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
Ministerialdirigent Dr. Volkmar Dietz




Re: Re: [PATCH 2/3] RISC-V: Add Zvfbfmin and Zvfbfwma intrinsic

2024-06-26 Thread wangf...@eswincomputing.com
On 2024-06-22 00:16  Patrick O'Neill  wrote:
>
>Hi Feng,
>
>Pre-commit has flagged a build-failure for patch 2/3:
>https://github.com/ewlu/gcc-precommit-ci/issues/1786#issuecomment-2181962244
>
>When applied to 9a76db24e04 i386: Allow all register_operand SUBREGs in
>x86_ternlog_idx.
>
>Re-confirmed locally with 5320bcbd342 xstormy16: Fix
>xs_hi_nonmemory_operand.
>
>Additionally there is an apply failure for patch 3/3.
> 
Sorry for the late reply. This is the reason the build failed: patch 2/3
depends on patch 3/3. Do you know the reason for
"Failed to merge in the changes."? Do I need to rebase patch 3/3? Thanks.
>Results can be seen here:
>Series:
>https://patchwork.sourceware.org/project/gcc/list/?series=35407
>Patch 2/3:
>https://patchwork.sourceware.org/project/gcc/patch/20240621015459.13525-2-wangf...@eswincomputing.com/
>https://github.com/ewlu/gcc-precommit-ci/issues/1786#issuecomment-2181863112
>Patch 3/3:
>https://patchwork.sourceware.org/project/gcc/patch/20240621015459.13525-3-wangf...@eswincomputing.com/
>https://github.com/ewlu/gcc-precommit-ci/issues/1784#issuecomment-2181861381
>
>Thanks,
>Patrick
>
>On 6/20/24 18:54, Feng Wang wrote:
>> According to the intrinsic doc, the 'Zvfbfmin' and 'Zvfbfwma' intrinsic
>> functions are added by this patch.
>>
>> gcc/ChangeLog:
>>
>> * config/riscv/riscv-vector-builtins-bases.cc (class vfncvtbf16_f):
>>     Add 'Zvfbfmin' intrinsic in bases.
>> (class vfwcvtbf16_f): Ditto.
>> (class vfwmaccbf16): Add 'Zvfbfwma' intrinsic in bases.
>> (BASE): Add BASE macro for 'Zvfbfmin' and 'Zvfbfwma'.
>> * config/riscv/riscv-vector-builtins-bases.h: Add declaration for 'Zvfbfmin' 
>> and 'Zvfbfwma'.
>> * config/riscv/riscv-vector-builtins-functions.def (REQUIRED_EXTENSIONS):
>>     Add builtins def for 'Zvfbfmin' and 'Zvfbfwma'.
>> (vfncvtbf16_f): Ditto.
>> (vfncvtbf16_f_frm): Ditto.
>> (vfwcvtbf16_f): Ditto.
>> (vfwmaccbf16): Ditto.
>> (vfwmaccbf16_frm): Ditto.
>> * config/riscv/riscv-vector-builtins-shapes.cc (supports_vectype_p):
>>     Add vector intrinsic build judgment for BFloat16.
>> (build_all): Ditto.
>> (BASE_NAME_MAX_LEN): Adjust max length.
>> * config/riscv/riscv-vector-builtins-types.def (DEF_RVV_F32_OPS):
>>     Add new operand type for BFloat16.
>> (vfloat32mf2_t): Ditto.
>> (vfloat32m1_t): Ditto.
>> (vfloat32m2_t): Ditto.
>> (vfloat32m4_t): Ditto.
>> (vfloat32m8_t): Ditto.
>> * config/riscv/riscv-vector-builtins.cc (DEF_RVV_F32_OPS): Ditto.
>> (validate_instance_type_required_extensions):
>>     Add required_ext checking for 'Zvfbfmin' and 'Zvfbfwma'.
>> * config/riscv/riscv-vector-builtins.h (enum required_ext):
>>     Add required_ext declaration for 'Zvfbfmin' and 'Zvfbfwma'.
>> (reqired_ext_to_isa_name): Ditto.
>> (required_extensions_specified): Ditto.
>> (struct function_group_info): Add match case for 'Zvfbfmin' and 'Zvfbfwma'.
>> * config/riscv/riscv.cc (riscv_validate_vector_type):
>>     Add required_ext checking for 'Zvfbfmin' and 'Zvfbfwma'.
>>
>> ---

Re: [PATCH] [libstdc++] [testsuite] defer to check_vect_support* [PR115454]

2024-06-26 Thread Jonathan Wakely
On Thu, 27 Jun 2024, 01:53 Matthias Kretz,  wrote:

> Ah, thank you. I didn't realize that there's a default for dg-do. I
> probably
> knew it back when I added check_vect_support_and_set_flags...
>
> OK for all branches from my side.
>

Yup, ok to push then, thanks.



> -Matthias
>
> On Wednesday, 26 June 2024 04:45:28 CDT Alexandre Oliva wrote:
> > The newly-added testcase overrides the default dg-do action set by
> > check_vect_support_and_set_flags (in libstdc++-dg/conformance.exp), so
> > it attempts to run the test even if runtime vector support is not
> > available.
> >
> > Remove the explicit dg-do directive, so that the default is honored,
> > and the test is run if vector support is found, and only compiled
> > otherwise.
> >
> > Tested so far with gcc-13 on ppc64-vx7r2, targeting vector-less
> > hardware, where it cured the observed regression.  Regstrapping on
> > x86_64- and ppc64el-linux-gnu just to be sure.  Ok to install?
> >
> >
> > for  libstdc++-v3/ChangeLog
> >
> >   PR libstdc++/115454
> >   * testsuite/experimental/simd/pr115454_find_last_set.cc: Defer
> >   to check_vect_support_and_set_flags's default dg-do action.
> > ---
> >  .../experimental/simd/pr115454_find_last_set.cc|1 -
> >  1 file changed, 1 deletion(-)
> >
> > diff --git
> > a/libstdc++-v3/testsuite/experimental/simd/pr115454_find_last_set.cc
> > b/libstdc++-v3/testsuite/experimental/simd/pr115454_find_last_set.cc
> index
> > 25a713b4e948c..4ade8601f272f 100644
> > --- a/libstdc++-v3/testsuite/experimental/simd/pr115454_find_last_set.cc
> > +++ b/libstdc++-v3/testsuite/experimental/simd/pr115454_find_last_set.cc
> > @@ -1,5 +1,4 @@
> >  // { dg-options "-std=gnu++17" }
> > -// { dg-do run { target *-*-* } }
> >  // { dg-require-effective-target c++17 }
> >  // { dg-additional-options "-march=x86-64-v4" { target avx512f_runtime
> } }
> >  // { dg-require-cmath "" }
>
>


[PATCH] i386: Refactor vcvttps2qq/vcvtqq2ps patterns.

2024-06-26 Thread Hu, Lin1
Hi, all

This patch aims to refactor the vcvttps2qq/vcvtqq2ps patterns to remove the redundant
round_*_modev8sf_condition.

Bootstrapped and regtested on x86-64-linux-gnu, OK for trunk?

BRs,
Lin

gcc/ChangeLog:

* config/i386/sse.md

(float2):
Refactor the pattern.

(unspec_fix_trunc2):
Ditto.

(fix_trunc2):
Ditto.
* config/i386/subst.md (round_modev8sf_condition): Remove.
(round_saeonly_modev8sf_condition): Ditto.
---
 gcc/config/i386/sse.md   | 51 +---
 gcc/config/i386/subst.md |  2 --
 2 files changed, 22 insertions(+), 31 deletions(-)

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 0be2dcd8891..cf8de6347cf 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -1157,6 +1157,9 @@ (define_mode_attr ssePSmodelower
 (define_mode_attr ssePSmode2
   [(V8DI "V8SF") (V4DI "V4SF")])
 
+(define_mode_attr ssePSmode2lower
+  [(V8DI "v8sf") (V4DI "v4sf")])
+
 ;; Mapping of vector modes back to the scalar modes
 (define_mode_attr ssescalarmode
   [(V64QI "QI") (V32QI "QI") (V16QI "QI")
@@ -8861,27 +8864,17 @@ (define_insn 
"float2 insn patterns
 (define_mode_attr qq2pssuff
-  [(V8SF "") (V4SF "{y}")])
-
-(define_mode_attr sselongvecmode
-  [(V8SF "V8DI") (V4SF  "V4DI")])
-
-(define_mode_attr sselongvecmodelower
-  [(V8SF "v8di") (V4SF  "v4di")])
-
-(define_mode_attr sseintvecmode3
-  [(V8SF "XI") (V4SF "OI")
-   (V8DF "OI") (V4DF "TI")])
+  [(V8DI "") (V4DI "{y}")])
 
-(define_insn 
"float2"
-  [(set (match_operand:VF1_128_256VL 0 "register_operand" "=v")
-(any_float:VF1_128_256VL
-  (match_operand: 1 "nonimmediate_operand" 
"")))]
-  "TARGET_AVX512DQ && "
+(define_insn 
"float2"
+  [(set (match_operand: 0 "register_operand" "=v")
+(any_float:
+  (match_operand:VI8_256_512 1 "nonimmediate_operand" 
"")))]
+  "TARGET_AVX512DQ && "
   "vcvtqq2ps\t{%1, 
%0|%0, %1}"
   [(set_attr "type" "ssecvt")
(set_attr "prefix" "evex")
-   (set_attr "mode" "")])
+   (set_attr "mode" "")])
 
 (define_expand "avx512dq_floatv2div2sf2"
   [(set (match_operand:V4SF 0 "register_operand" "=v")
@@ -9416,26 +9409,26 @@ (define_insn 
"fixuns_notrunc2"
(set_attr "prefix" "evex")
(set_attr "mode" "")])
 
-(define_insn 
"unspec_fix_trunc2"
-  [(set (match_operand: 0 "register_operand" "=v")
-   (unspec:
- [(match_operand:VF1_128_256VL 1 "" 
"")]
+(define_insn 
"unspec_fix_trunc2"
+  [(set (match_operand:VI8_256_512 0 "register_operand" "=v")
+   (unspec:VI8_256_512
+ [(match_operand: 1 "" 
"")]
  UNSPEC_VCVTT_U))]
-  "TARGET_AVX512DQ && "
+  "TARGET_AVX512DQ && "
   "vcvttps2qq\t{%1, 
%0|%0, %1}"
   [(set_attr "type" "ssecvt")
(set_attr "prefix" "evex")
-   (set_attr "mode" "")])
+   (set_attr "mode" "")])
 
-(define_insn 
"fix_trunc2"
-  [(set (match_operand: 0 "register_operand" "=v")
-   (any_fix:
- (match_operand:VF1_128_256VL 1 "" 
"")))]
-  "TARGET_AVX512DQ && "
+(define_insn 
"fix_trunc2"
+  [(set (match_operand:VI8_256_512 0 "register_operand" "=v")
+   (any_fix:VI8_256_512
+ (match_operand: 1 "" 
"")))]
+  "TARGET_AVX512DQ && "
   "vcvttps2qq\t{%1, 
%0|%0, %1}"
   [(set_attr "type" "ssecvt")
(set_attr "prefix" "evex")
-   (set_attr "mode" "")])
+   (set_attr "mode" "")])
 
 (define_insn "unspec_avx512dq_fix_truncv2sfv2di2"
   [(set (match_operand:V2DI 0 "register_operand" "=v")
diff --git a/gcc/config/i386/subst.md b/gcc/config/i386/subst.md
index 7a9b697e0f6..40fb92094d2 100644
--- a/gcc/config/i386/subst.md
+++ b/gcc/config/i386/subst.md
@@ -211,7 +211,6 @@ (define_subst_attr "round_mode512bit_condition" "round" "1" 
"(mode == V16S
  || mode == 
V16SImode
  || mode == 
V32HFmode)")
 
-(define_subst_attr "round_modev8sf_condition" "round" "1" "(mode == 
V8SFmode)")
 (define_subst_attr "round_modev4sf_condition" "round" "1" "(mode == 
V4SFmode)")
 (define_subst_attr "round_codefor" "round" "*" "")
 (define_subst_attr "round_opnum" "round" "5" "6")
@@ -257,7 +256,6 @@ (define_subst_attr "round_saeonly_mode512bit_condition" 
"round_saeonly" "1" "(mode == V16SImode
  
|| mode == V32HFmode)")
 
-(define_subst_attr "round_saeonly_modev8sf_condition" "round_saeonly" "1" 
"(mode == V8SFmode)")
 
 (define_subst "round_saeonly"
   [(set (match_operand:SUBST_A 0)
-- 
2.31.1



Re: [committed] Remove compromised sh test

2024-06-26 Thread Oleg Endo
On Wed, 2024-06-26 at 18:30 -0600, Jeff Law wrote:
> > > 
> > 
> > OK, then what's the default config of your test setup / triplet?
> > Can you please show the generated code that you get?  Because - like I said
> > - I can't reproduce it.
> test01:
>  sts.l   pr,@-r15! 31[c=4 l=2]  movsi_i/10
>  add #-4,r15 ! 32[c=4 l=2]  *addsi3/0
>  mov.l   .L3,r0  ! 26[c=10 l=2]  movsi_i/0
>  jsr @r0 ! 12[c=5 l=2]  call_valuei
>  mov.l   r6,@r15 ! 4 [c=4 l=2]  movsi_i/8
>  mov.l   @r15,r1 ! 29[c=1 l=2]  movsi_i/5
>  add r1,r0   ! 30[c=4 l=2]  *addsi3/0
>  add #4,r15  ! 36[c=4 l=2]  *addsi3/0
>  lds.l   @r15+,pr! 38[c=1 l=2]  movsi_i/14
>  rts
>  nop ! 40[c=0 l=4]  *return_i
> 
> 
> Note that there's a scheduling barrier in the RTL between insns 30 and 
> 36.  So instructions prior to insn 36 can't be used to fill the delay slot.
> 

Thanks.  Now I'm also seeing the same result.  Needed to specify -O2 to get
that.  -O1 was not enough it seems.

I don't know why you said that the code for this case improved -- it has
not?!

I think the test is still valid.  The reason for the failure might be
different from the original one (the scheduling barrier for whatever
reason), but the end result is the same -- the last delay slot is not
stuffed, although the 'add r1,r0' could go in there.

I'd like to revert the removal of this test case, as it catches a valid
issue.

Best regards,
Oleg Endo







[PATCH v3] Vect: Support truncate after .SAT_SUB pattern in zip

2024-06-26 Thread pan2 . li
From: Pan Li 

The zip benchmark of coremark-pro has one SAT_SUB-like pattern, but
truncated, as below:

void test (uint16_t *x, unsigned b, unsigned n)
{
  unsigned a = 0;
  register uint16_t *p = x;

  do {
a = *--p;
*p = (uint16_t)(a >= b ? a - b : 0); // Truncate after .SAT_SUB
  } while (--n);
}

It has the gimple below before the vect pass; it cannot hit any
SAT_SUB pattern and thus cannot be vectorized to SAT_SUB.

_2 = a_11 - b_12(D);
iftmp.0_13 = (short unsigned int) _2;
_18 = a_11 >= b_12(D);
iftmp.0_5 = _18 ? iftmp.0_13 : 0;

This patch improves the pattern match to recognize the above as a
truncate-after-.SAT_SUB pattern.  Then we will have a pattern similar
to the one below, and the first 3 dead stmts are eliminated.

_2 = a_11 - b_12(D);
iftmp.0_13 = (short unsigned int) _2;
_18 = a_11 >= b_12(D);
iftmp.0_5 = (short unsigned int).SAT_SUB (a_11, b_12(D));
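
As a scalar sanity check of the equivalence the pattern relies on (an
editorial sketch, not part of the patch):

  #include <assert.h>
  #include <stdint.h>

  /* Truncating after an unsigned saturating subtract matches the
     original branchy source expression.  */
  static unsigned
  sat_sub_u32 (unsigned a, unsigned b)
  {
    return a >= b ? a - b : 0;  /* .SAT_SUB on unsigned int */
  }

  int
  main (void)
  {
    unsigned tests[] = { 0u, 1u, 65535u, 65536u, 4294967295u };
    for (int i = 0; i < 5; i++)
      for (int j = 0; j < 5; j++)
        {
          unsigned a = tests[i], b = tests[j];
          uint16_t ref = (uint16_t) (a >= b ? a - b : 0);
          assert ((uint16_t) sat_sub_u32 (a, b) == ref);
        }
    return 0;
  }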

The below tests are passed for this patch.
1. The rv64gcv fully regression tests.
2. The rv64gcv build with glibc.
3. The x86 bootstrap tests.
4. The x86 fully regression tests.

gcc/ChangeLog:

* match.pd: Add convert description for minus and capture.
* tree-vect-patterns.cc (vect_recog_build_binary_gimple_call): Add
new logic to handle the case where in_type is incompatible with
out_type, as well as rename from.
(vect_recog_build_binary_gimple_stmt): Rename to.
(vect_recog_sat_add_pattern): Leverage above renamed func.
(vect_recog_sat_sub_pattern): Ditto.

Signed-off-by: Pan Li 
---
 gcc/match.pd  |  4 +--
 gcc/tree-vect-patterns.cc | 51 ---
 2 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/gcc/match.pd b/gcc/match.pd
index cf8a399a744..820591a36b3 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3164,9 +3164,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 /* Unsigned saturation sub, case 2 (branch with ge):
SAT_U_SUB = X >= Y ? X - Y : 0.  */
 (match (unsigned_integer_sat_sub @0 @1)
- (cond^ (ge @0 @1) (minus @0 @1) integer_zerop)
+ (cond^ (ge @0 @1) (convert? (minus (convert1? @0) (convert1? @1))) 
integer_zerop)
  (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
-  && types_match (type, @0, @1
+  && TYPE_UNSIGNED (TREE_TYPE (@0)) && types_match (@0, @1
 
 /* Unsigned saturation sub, case 3 (branchless with gt):
SAT_U_SUB = (X - Y) * (X > Y).  */
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index cef901808eb..519d15f2a43 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -4490,26 +4490,37 @@ vect_recog_mult_pattern (vec_info *vinfo,
 extern bool gimple_unsigned_integer_sat_add (tree, tree*, tree (*)(tree));
 extern bool gimple_unsigned_integer_sat_sub (tree, tree*, tree (*)(tree));
 
-static gcall *
-vect_recog_build_binary_gimple_call (vec_info *vinfo, gimple *stmt,
+static gimple *
+vect_recog_build_binary_gimple_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
 internal_fn fn, tree *type_out,
-tree op_0, tree op_1)
+tree lhs, tree op_0, tree op_1)
 {
   tree itype = TREE_TYPE (op_0);
-  tree vtype = get_vectype_for_scalar_type (vinfo, itype);
+  tree otype = TREE_TYPE (lhs);
+  tree v_itype = get_vectype_for_scalar_type (vinfo, itype);
+  tree v_otype = get_vectype_for_scalar_type (vinfo, otype);
 
-  if (vtype != NULL_TREE
-&& direct_internal_fn_supported_p (fn, vtype, OPTIMIZE_FOR_BOTH))
+  if (v_itype != NULL_TREE && v_otype != NULL_TREE
+&& direct_internal_fn_supported_p (fn, v_itype, OPTIMIZE_FOR_BOTH))
 {
   gcall *call = gimple_build_call_internal (fn, 2, op_0, op_1);
+  tree in_ssa = vect_recog_temp_ssa_var (itype, NULL);
 
-  gimple_call_set_lhs (call, vect_recog_temp_ssa_var (itype, NULL));
+  gimple_call_set_lhs (call, in_ssa);
   gimple_call_set_nothrow (call, /* nothrow_p */ false);
-  gimple_set_location (call, gimple_location (stmt));
+  gimple_set_location (call, gimple_location (STMT_VINFO_STMT 
(stmt_info)));
+
+  *type_out = v_otype;
 
-  *type_out = vtype;
+  if (types_compatible_p (itype, otype))
+   return call;
+  else
+   {
+ append_pattern_def_seq (vinfo, stmt_info, call, v_itype);
+ tree out_ssa = vect_recog_temp_ssa_var (otype, NULL);
 
-  return call;
+ return gimple_build_assign (out_ssa, NOP_EXPR, in_ssa);
+   }
 }
 
   return NULL;
@@ -4541,13 +4552,13 @@ vect_recog_sat_add_pattern (vec_info *vinfo, 
stmt_vec_info stmt_vinfo,
 
   if (gimple_unsigned_integer_sat_add (lhs, ops, NULL))
 {
-  gcall *call = vect_recog_build_binary_gimple_call (vinfo, last_stmt,
-IFN_SAT_ADD, type_out,
-ops[0], ops[1]);
-  if (call)
+  gimple *stmt = vect_recog_build_binary_gimple_stmt (vinfo, stmt_vinfo,
+ 

[PATCH-1v4, rs6000] Implement optab_isinf for SFDF and IEEE128

2024-06-26 Thread HAO CHEN GUI
Hi,
  This patch implements optab_isinf for SFDF and IEEE128 using the test
data class instructions.

  Compared with the previous version, the main change is to define
and use constant masks for the test data class insns.
https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652593.html
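
As a quick reference, the test data class masks combine by bitwise OR; a
small sketch with the values from this patch (the combinations below are
the ones used by patches 2 and 3 of this series):

  /* Values from the patch's define_constants.  */
  #define ISNAN      0x40
  #define ISINF      0x30  /* +inf and -inf */
  #define ISZERO     0x0C  /* +0 and -0 */
  #define ISDENORMAL 0x03  /* +denormal and -denormal */

  int isinf_mask    = ISINF;                               /* 0x30 */
  int isfinite_mask = ISINF | ISNAN;                       /* 0x70, inverted */
  int isnormal_mask = ISINF | ISNAN | ISZERO | ISDENORMAL; /* 0x7F, inverted */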

  Bootstrapped and tested on powerpc64-linux BE and LE with no
regressions. Is it OK for trunk?

Thanks
Gui Haochen

ChangeLog
rs6000: Implement optab_isinf for SFDF and IEEE128

gcc/
PR target/97786
* config/rs6000/rs6000.md (ISNAN, ISINF, ISZERO, ISDENORMAL): Define.
* config/rs6000/vsx.md (isinf2 for SFDF): New expand.
(isinf2 for IEEE128): New expand.

gcc/testsuite/
PR target/97786
* gcc.target/powerpc/pr97786-1.c: New test.
* gcc.target/powerpc/pr97786-2.c: New test.

patch.diff
diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index ac5651d7420..e84e6b08f03 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -53,6 +53,17 @@ (define_constants
(FRAME_POINTER_REGNUM   110)
   ])

+;;
+;; Test data class mask
+;;
+
+(define_constants
+  [(ISNAN  0x40)
+   (ISINF  0x30)
+   (ISZERO 0xC)
+   (ISDENORMAL 0x3)
+  ])
+
 ;;
 ;; UNSPEC usage
 ;;
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index f135fa079bd..67615bae8c0 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -5313,6 +5313,24 @@ (define_expand "xststdcp"
   operands[4] = CONST0_RTX (SImode);
 })

+(define_expand "isinf2"
+  [(use (match_operand:SI 0 "gpc_reg_operand"))
+   (use (match_operand:SFDF 1 "vsx_register_operand"))]
+  "TARGET_HARD_FLOAT && TARGET_P9_VECTOR"
+{
+  emit_insn (gen_xststdcp (operands[0], operands[1], GEN_INT (ISINF)));
+  DONE;
+})
+
+(define_expand "isinf2"
+  [(use (match_operand:SI 0 "gpc_reg_operand"))
+   (use (match_operand:IEEE128 1 "vsx_register_operand"))]
+  "TARGET_HARD_FLOAT && TARGET_P9_VECTOR"
+{
+  emit_insn (gen_xststdcqp_ (operands[0], operands[1], GEN_INT (ISINF)));
+  DONE;
+})
+
 ;; The VSX Scalar Test Negative Quad-Precision
 (define_expand "xststdcnegqp_"
   [(set (match_dup 2)
diff --git a/gcc/testsuite/gcc.target/powerpc/pr97786-1.c 
b/gcc/testsuite/gcc.target/powerpc/pr97786-1.c
new file mode 100644
index 000..c1c4f64ee8b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr97786-1.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_vsx } */
+/* { dg-options "-O2 -mdejagnu-cpu=power9" } */
+
+int test1 (double x)
+{
+  return __builtin_isinf (x);
+}
+
+int test2 (float x)
+{
+  return __builtin_isinf (x);
+}
+
+int test3 (float x)
+{
+  return __builtin_isinff (x);
+}
+
+/* { dg-final { scan-assembler-not {\mfcmp} } } */
+/* { dg-final { scan-assembler-times {\mxststdcsp\M} 2 } } */
+/* { dg-final { scan-assembler-times {\mxststdcdp\M} 1 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/pr97786-2.c 
b/gcc/testsuite/gcc.target/powerpc/pr97786-2.c
new file mode 100644
index 000..ed305e8572e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr97786-2.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target ppc_float128_hw } */
+/* { dg-require-effective-target powerpc_vsx } */
+/* { dg-options "-O2 -mdejagnu-cpu=power9 -mabi=ieeelongdouble -Wno-psabi" } */
+
+int test1 (long double x)
+{
+  return __builtin_isinf (x);
+}
+
+int test2 (long double x)
+{
+  return __builtin_isinfl (x);
+}
+
+/* { dg-final { scan-assembler-not {\mxscmpuqp\M} } } */
+/* { dg-final { scan-assembler-times {\mxststdcqp\M} 2 } } */


[PATCH-3v4, rs6000] Implement optab_isnormal for SFDF and IEEE128

2024-06-26 Thread HAO CHEN GUI
Hi,
  This patch implements optab_isnormal for SFDF and IEEE128 using the
test data class instructions.

  Compared with the previous version, the main change is to use the
constant mask for the test data class insns.
https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652595.html

  Bootstrapped and tested on powerpc64-linux BE and LE with no
regressions. Is it OK for trunk?

Thanks
Gui Haochen

ChangeLog
rs6000: Implement optab_isnormal for SFDF and IEEE128

gcc/
PR target/97786
* config/rs6000/vsx.md (isnormal2 for SFDF): New expand.
(isnormal2 for IEEE128): New expand.

gcc/testsuite/
PR target/97786
* gcc.target/powerpc/pr97786-7.c: New test.
* gcc.target/powerpc/pr97786-8.c: New test.

patch.diff
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 11d02e60170..b48986ac9eb 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -5355,6 +5355,30 @@ (define_expand "isfinite2"
   DONE;
 })

+(define_expand "isnormal2"
+  [(use (match_operand:SI 0 "gpc_reg_operand"))
+   (use (match_operand:SFDF 1 "vsx_register_operand"))]
+  "TARGET_HARD_FLOAT && TARGET_P9_VECTOR"
+{
+  rtx tmp = gen_reg_rtx (SImode);
+  int mask = ISINF | ISNAN | ISZERO | ISDENORMAL;
+  emit_insn (gen_xststdcp (tmp, operands[1], GEN_INT (mask)));
+  emit_insn (gen_xorsi3 (operands[0], tmp, const1_rtx));
+  DONE;
+})
+
+(define_expand "isnormal2"
+  [(use (match_operand:SI 0 "gpc_reg_operand"))
+   (use (match_operand:IEEE128 1 "vsx_register_operand"))]
+  "TARGET_HARD_FLOAT && TARGET_P9_VECTOR"
+{
+  rtx tmp = gen_reg_rtx (SImode);
+  int mask = ISINF | ISNAN | ISZERO | ISDENORMAL;
+  emit_insn (gen_xststdcqp_ (tmp, operands[1], GEN_INT (mask)));
+  emit_insn (gen_xorsi3 (operands[0], tmp, const1_rtx));
+  DONE;
+})
+
 ;; The VSX Scalar Test Negative Quad-Precision
 (define_expand "xststdcnegqp_"
   [(set (match_dup 2)
diff --git a/gcc/testsuite/gcc.target/powerpc/pr97786-7.c 
b/gcc/testsuite/gcc.target/powerpc/pr97786-7.c
new file mode 100644
index 000..2df472e35d4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr97786-7.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_vsx } */
+/* { dg-options "-O2 -mdejagnu-cpu=power9" } */
+
+int test1 (double x)
+{
+  return __builtin_isnormal (x);
+}
+
+int test2 (float x)
+{
+  return __builtin_isnormal (x);
+}
+
+/* { dg-final { scan-assembler-not {\mfcmp} } } */
+/* { dg-final { scan-assembler-times {\mxststdcsp\M} 1 } } */
+/* { dg-final { scan-assembler-times {\mxststdcdp\M} 1 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/pr97786-8.c 
b/gcc/testsuite/gcc.target/powerpc/pr97786-8.c
new file mode 100644
index 000..00478dbf3ef
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr97786-8.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target ppc_float128_hw } */
+/* { dg-require-effective-target powerpc_vsx } */
+/* { dg-options "-O2 -mdejagnu-cpu=power9 -mabi=ieeelongdouble -Wno-psabi" } */
+
+int test1 (long double x)
+{
+  return __builtin_isnormal (x);
+}
+
+/* { dg-final { scan-assembler-not {\mxscmpuqp\M} } } */
+/* { dg-final { scan-assembler {\mxststdcqp\M} } } */


[PATCH-2v4, rs6000] Implement optab_isfinite for SFDF and IEEE128

2024-06-26 Thread HAO CHEN GUI
Hi,
  This patch implements optab_isfinite for SFDF and IEEE128 using the
test data class instructions.

  Compared with the previous version, the main change is to use the
constant mask for the test data class insns.
https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652594.html

  Bootstrapped and tested on powerpc64-linux BE and LE with no
regressions. Is it OK for trunk?

Thanks
Gui Haochen

ChangeLog
rs6000: Implement optab_isfinite for SFDF and IEEE128

gcc/
PR target/97786
* config/rs6000/vsx.md (isfinite2 for SFDF): New expand.
(isfinite2 for IEEE128): New expand.

gcc/testsuite/
PR target/97786
* gcc.target/powerpc/pr97786-4.c: New test.
* gcc.target/powerpc/pr97786-5.c: New test.

patch.diff
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 67615bae8c0..11d02e60170 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -5331,6 +5331,30 @@ (define_expand "isinf2"
   DONE;
 })

+(define_expand "isfinite2"
+  [(use (match_operand:SI 0 "gpc_reg_operand"))
+   (use (match_operand:SFDF 1 "vsx_register_operand"))]
+  "TARGET_HARD_FLOAT && TARGET_P9_VECTOR"
+{
+  rtx tmp = gen_reg_rtx (SImode);
+  int mask = ISINF | ISNAN;
+  emit_insn (gen_xststdcp (tmp, operands[1], GEN_INT (mask)));
+  emit_insn (gen_xorsi3 (operands[0], tmp, const1_rtx));
+  DONE;
+})
+
+(define_expand "isfinite2"
+  [(use (match_operand:SI 0 "gpc_reg_operand"))
+   (use (match_operand:IEEE128 1 "vsx_register_operand"))]
+  "TARGET_HARD_FLOAT && TARGET_P9_VECTOR"
+{
+  rtx tmp = gen_reg_rtx (SImode);
+  int mask = ISINF | ISNAN;
+  emit_insn (gen_xststdcqp_ (tmp, operands[1], GEN_INT (mask)));
+  emit_insn (gen_xorsi3 (operands[0], tmp, const1_rtx));
+  DONE;
+})
+
 ;; The VSX Scalar Test Negative Quad-Precision
 (define_expand "xststdcnegqp_"
   [(set (match_dup 2)
diff --git a/gcc/testsuite/gcc.target/powerpc/pr97786-4.c 
b/gcc/testsuite/gcc.target/powerpc/pr97786-4.c
new file mode 100644
index 000..01faa962bd5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr97786-4.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_vsx } */
+/* { dg-options "-O2 -mdejagnu-cpu=power9" } */
+
+int test1 (double x)
+{
+  return __builtin_isfinite (x);
+}
+
+int test2 (float x)
+{
+  return __builtin_isfinite (x);
+}
+
+/* { dg-final { scan-assembler-not {\mfcmp} } } */
+/* { dg-final { scan-assembler-times {\mxststdcsp\M} 1 } } */
+/* { dg-final { scan-assembler-times {\mxststdcdp\M} 1 } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/pr97786-5.c 
b/gcc/testsuite/gcc.target/powerpc/pr97786-5.c
new file mode 100644
index 000..0e106b9f23a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr97786-5.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target ppc_float128_hw } */
+/* { dg-require-effective-target powerpc_vsx } */
+/* { dg-options "-O2 -mdejagnu-cpu=power9 -mabi=ieeelongdouble -Wno-psabi" } */
+
+int test1 (long double x)
+{
+  return __builtin_isfinite (x);
+}
+
+/* { dg-final { scan-assembler-not {\mxscmpuqp\M} } } */
+/* { dg-final { scan-assembler {\mxststdcqp\M} } } */


Re: [committed] Remove compromised sh test

2024-06-26 Thread Jeff Law




On 6/26/24 7:27 PM, Oleg Endo wrote:

On Wed, 2024-06-26 at 18:30 -0600, Jeff Law wrote:




OK, then what's the default config of your test setup / triplet?
Can you please show the generated code that you get?  Because - like I said
- I can't reproduce it.

test01:
  sts.l   pr,@-r15! 31[c=4 l=2]  movsi_i/10
  add #-4,r15 ! 32[c=4 l=2]  *addsi3/0
  mov.l   .L3,r0  ! 26[c=10 l=2]  movsi_i/0
  jsr @r0 ! 12[c=5 l=2]  call_valuei
  mov.l   r6,@r15 ! 4 [c=4 l=2]  movsi_i/8
  mov.l   @r15,r1 ! 29[c=1 l=2]  movsi_i/5
  add r1,r0   ! 30[c=4 l=2]  *addsi3/0
  add #4,r15  ! 36[c=4 l=2]  *addsi3/0
  lds.l   @r15+,pr! 38[c=1 l=2]  movsi_i/14
  rts
  nop ! 40[c=0 l=4]  *return_i


Note that there's a scheduling barrier in the RTL between insns 30 and
36.  So instructions prior to insn 36 can't be used to fill the delay slot.



Thanks.  Now I'm also seeing the same result.  Needed to specify -O2 to get
that.  -O1 was not enough it seems.

I don't know why you said that the code for this case improved -- it has
not?!

I think the test is still valid.  The reason for the failure might be
different from the original one (the scheduling barrier for whatever
reason), but the end result is the same -- the last delay slot is not
stuffed, although the 'add r1,r0' could go in there.

I'd like to revert the removal of this test case, as it catches a valid
issue.


Before the IRA patch there is an additional prologue/epilogue 
save/restore for a callee saved register.  That's what filled the delay 
slot before.


The add r1,r0 cannot move down to fill the delay slot.  There's a
scheduling barrier in the RTL.


Feel free to restore it, but you're just adding a bogus, failing, test 
to the testsuite.


jeff


Re: Re: [PATCH 2/3] RISC-V: Add Zvfbfmin and Zvfbfwma intrinsic

2024-06-26 Thread wangf...@eswincomputing.com
On 2024-06-27 08:52  wangfeng  wrote: 
I rebased patch 3/3; there is a conflict. I will submit again after
internal code review.
Due to the many changes the patch was split, which created a dependency
between patch 2/3 and patch 3/3. Could you pull both patches down when
running the regression test after I submit the v2 version?
Thanks.
>
>On 2024-06-22 00:16  Patrick O'Neill  wrote:
>>
>>Hi Feng,
>>
>>Pre-commit has flagged a build-failure for patch 2/3:
>>https://github.com/ewlu/gcc-precommit-ci/issues/1786#issuecomment-2181962244
>>
>>When applied to 9a76db24e04 i386: Allow all register_operand SUBREGs in
>>x86_ternlog_idx.
>>
>>Re-confirmed locally with 5320bcbd342 xstormy16: Fix
>>xs_hi_nonmemory_operand.
>>
>>Additionally there is an apply failure for patch 3/3.
>>
>Sorry for the late reply. This is the reason the build failed: patch 2/3
>depends on patch 3/3. Do you know the reason for
>"Failed to merge in the changes."? Do I need to rebase patch 3/3? Thanks.
>>Results can be seen here:
>>Series:
>>https://patchwork.sourceware.org/project/gcc/list/?series=35407
>>Patch 2/3:
>>https://patchwork.sourceware.org/project/gcc/patch/20240621015459.13525-2-wangf...@eswincomputing.com/
>>https://github.com/ewlu/gcc-precommit-ci/issues/1786#issuecomment-2181863112
>>Patch 3/3:
>>https://patchwork.sourceware.org/project/gcc/patch/20240621015459.13525-3-wangf...@eswincomputing.com/
>>https://github.com/ewlu/gcc-precommit-ci/issues/1784#issuecomment-2181861381
>>
>>Thanks,
>>Patrick
>>
>>On 6/20/24 18:54, Feng Wang wrote:
>>> According to the intrinsic doc, the 'Zvfbfmin' and 'Zvfbfwma' intrinsic
>>> functions are added by this patch.
>>>
>>> gcc/ChangeLog:
>>>
>>> * config/riscv/riscv-vector-builtins-bases.cc (class vfncvtbf16_f):
>>>     Add 'Zvfbfmin' intrinsic in bases.
>>> (class vfwcvtbf16_f): Ditto.
>>> (class vfwmaccbf16): Add 'Zvfbfwma' intrinsic in bases.
>>> (BASE): Add BASE macro for 'Zvfbfmin' and 'Zvfbfwma'.
>>> * config/riscv/riscv-vector-builtins-bases.h: Add declaration for 
>>> 'Zvfbfmin' and 'Zvfbfwma'.
>>> * config/riscv/riscv-vector-builtins-functions.def (REQUIRED_EXTENSIONS):
>>>     Add builtins def for 'Zvfbfmin' and 'Zvfbfwma'.
>>> (vfncvtbf16_f): Ditto.
>>> (vfncvtbf16_f_frm): Ditto.
>>> (vfwcvtbf16_f): Ditto.
>>> (vfwmaccbf16): Ditto.
>>> (vfwmaccbf16_frm): Ditto.
>>> * config/riscv/riscv-vector-builtins-shapes.cc (supports_vectype_p):
>>>     Add vector intrinsic build judgment for BFloat16.
>>> (build_all): Ditto.
>>> (BASE_NAME_MAX_LEN): Adjust max length.
>>> * config/riscv/riscv-vector-builtins-types.def (DEF_RVV_F32_OPS):
>>>     Add new operand type for BFloat16.
>>> (vfloat32mf2_t): Ditto.
>>> (vfloat32m1_t): Ditto.
>>> (vfloat32m2_t): Ditto.
>>> (vfloat32m4_t): Ditto.
>>> (vfloat32m8_t): Ditto.
>>> * config/riscv/riscv-vector-builtins.cc (DEF_RVV_F32_OPS): Ditto.
>>> (validate_instance_type_required_extensions):
>>>     Add required_ext checking for 'Zvfbfmin' and 'Zvfbfwma'.
>>> * config/riscv/riscv-vector-builtins.h (enum required_ext):
>>>     Add required_ext declaration for 'Zvfbfmin' and 'Zvfbfwma'.
>>> (reqired_ext_to_isa_name): Ditto.
>>> (required_extensions_specified): Ditto.
>>> (struct function_group_info): Add match case for 'Zvfbfmin' and 'Zvfbfwma'.
>>> * config/riscv/riscv.cc (riscv_validate_vector_type):
>>>     Add required_ext checking for 'Zvfbfmin' and 'Zvfbfwma'.
>>>
>>> ---

Re: Ping: [PATCH v2] LoongArch: Tweak IOR rtx_cost for bstrins

2024-06-26 Thread Lulu Cheng

LGTM!

Thanks very much!


On 2024/6/26 3:53 PM, Xi Ruoyao wrote:

Ping.

On Sun, 2024-06-16 at 01:50 +0800, Xi Ruoyao wrote:

Consider

     c &= 0xfff;
     a &= ~0xfff;
     b &= ~0xfff;
     a |= c;
     b |= c;

This can be done with 2 bstrins instructions.  But we need to
recognize it in loongarch_rtx_costs or the compiler will not
propagate "c & 0xfff" forward.
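
For readers unfamiliar with the instruction: bstrins.d rd, rj, msb, lsb
replaces bits [msb:lsb] of rd with the low-order bits of rj.  A rough
scalar C model of a single bstrins (illustration only; operand semantics
per the LoongArch manual, assuming msb - lsb < 63):

  unsigned long
  bstrins_model (unsigned long rd, unsigned long rj, int msb, int lsb)
  {
    /* msb - lsb + 1 low-order ones, shifted into position.  */
    unsigned long field = (2UL << (msb - lsb)) - 1;
    return (rd & ~(field << lsb)) | ((rj & field) << lsb);
  }

With msb = 11 and lsb = 0 this is exactly a = (a & ~0xfff) | (c & 0xfff),
so each of the two IORs above collapses to one bstrins.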

gcc/ChangeLog:

* config/loongarch/loongarch.cc:
(loongarch_use_bstrins_for_ior_with_mask): Split the main logic
into ...
(loongarch_use_bstrins_for_ior_with_mask_1): ... here.
(loongarch_rtx_costs): Special-case IORs that can be
implemented with bstrins.

gcc/testsuite/ChangeLog:

* gcc.target/loongarch/bstrins-3.c: New test.
---

Bootstrapped and regtested on loongarch64-linux-gnu.  Ok for trunk?

  gcc/config/loongarch/loongarch.cc | 73 ++-
  .../gcc.target/loongarch/bstrins-3.c  | 16 
  2 files changed, 72 insertions(+), 17 deletions(-)
  create mode 100644 gcc/testsuite/gcc.target/loongarch/bstrins-3.c

diff --git a/gcc/config/loongarch/loongarch.cc b/gcc/config/loongarch/loongarch.cc
index 6ec3ee62502..256b76d044b 100644
--- a/gcc/config/loongarch/loongarch.cc
+++ b/gcc/config/loongarch/loongarch.cc
@@ -3681,6 +3681,27 @@ loongarch_set_reg_reg_piece_cost (machine_mode mode, unsigned int units)
    return COSTS_N_INSNS ((GET_MODE_SIZE (mode) + units - 1) / units);
  }
  
+static int
+loongarch_use_bstrins_for_ior_with_mask_1 (machine_mode mode,
+					   unsigned HOST_WIDE_INT mask1,
+					   unsigned HOST_WIDE_INT mask2)
+{
+  if (mask1 != ~mask2 || !mask1 || !mask2)
+    return 0;
+
+  /* Try to avoid a right-shift.  */
+  if (low_bitmask_len (mode, mask1) != -1)
+    return -1;
+
+  if (low_bitmask_len (mode, mask2 >> (ffs_hwi (mask2) - 1)) != -1)
+    return 1;
+
+  if (low_bitmask_len (mode, mask1 >> (ffs_hwi (mask1) - 1)) != -1)
+    return -1;
+
+  return 0;
+}
+
  /* Return the cost of moving between two registers of mode MODE.  */
  
  static int

@@ -3812,6 +3833,38 @@ loongarch_rtx_costs (rtx x, machine_mode mode, int outer_code,
    /* Fall through.  */
  
  case IOR:

+  {
+   rtx op[2] = {XEXP (x, 0), XEXP (x, 1)};
+   if (GET_CODE (op[0]) == AND && GET_CODE (op[1]) == AND
+       && (mode == SImode || (TARGET_64BIT && mode == DImode)))
+     {
+       rtx rtx_mask0 = XEXP (op[0], 1), rtx_mask1 = XEXP (op[1], 1);
+       if (CONST_INT_P (rtx_mask0) && CONST_INT_P (rtx_mask1))
+     {
+   unsigned HOST_WIDE_INT mask0 = UINTVAL (rtx_mask0);
+   unsigned HOST_WIDE_INT mask1 = UINTVAL (rtx_mask1);
+   if (loongarch_use_bstrins_for_ior_with_mask_1 (mode, mask0,
+						  mask1))
+     {
+       /* A bstrins instruction */
+       *total = COSTS_N_INSNS (1);
+
+       /* A srai instruction */
+       if (low_bitmask_len (mode, mask0) == -1
+   && low_bitmask_len (mode, mask1) == -1)
+     *total += COSTS_N_INSNS (1);
+
+       for (int i = 0; i < 2; i++)
+     *total += set_src_cost (XEXP (op[i], 0), mode, speed);
+
+       return true;
+     }
+     }
+     }
+  }
+
+  /* Fall through.  */
  case XOR:
    /* Double-word operations use two single-word operations.  */
    *total = loongarch_binary_cost (x, COSTS_N_INSNS (1),
COSTS_N_INSNS (2),
@@ -5796,23 +5849,9 @@ bool loongarch_pre_reload_split (void)
  int
  loongarch_use_bstrins_for_ior_with_mask (machine_mode mode, rtx *op)
  {
-  unsigned HOST_WIDE_INT mask1 = UINTVAL (op[2]);
-  unsigned HOST_WIDE_INT mask2 = UINTVAL (op[4]);
-
-  if (mask1 != ~mask2 || !mask1 || !mask2)
-    return 0;
-
-  /* Try to avoid a right-shift.  */
-  if (low_bitmask_len (mode, mask1) != -1)
-    return -1;
-
-  if (low_bitmask_len (mode, mask2 >> (ffs_hwi (mask2) - 1)) != -1)
-    return 1;
-
-  if (low_bitmask_len (mode, mask1 >> (ffs_hwi (mask1) - 1)) != -1)
-    return -1;
-
-  return 0;
+  return loongarch_use_bstrins_for_ior_with_mask_1 (mode,
+						     UINTVAL (op[2]),
+						     UINTVAL (op[4]));
  }
  
  /* Rewrite a MEM for simple load/store under -mexplicit-relocs=auto

diff --git a/gcc/testsuite/gcc.target/loongarch/bstrins-3.c b/gcc/testsuite/gcc.target/loongarch/bstrins-3.c
new file mode 100644
index 000..13762bdef42
--- /dev/null
+++ b/gcc/testsuite/gcc.target/loongarch/bstrins-3.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-rtl-final" } */
+/* { dg-final { scan-rtl-dump-times "insv\[sd\]i" 2 "final" } } */
+
+struct X {
+  long a, b;
+};
+

[PATCH V2] Fix wrong cost of MEM when addr is a lea.

2024-06-26 Thread liuhongt
> But rtx_cost invokes targetm.rtx_cost which allows to avoid that
> recursive processing at any level.  You're dealing with MEM [addr]
> here, so why's rtx_cost (addr, Pmode, MEM, 0, speed) not always
> the best way to deal with this?  Since this is the MEM [addr] case
> we know it's not LEA, no?
The patch restricts the MEM rtx_cost reduction to register_operand + disp.


Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?


416.gamess regressed 4-6% on x86_64 since my r15-882-g1d6199e5f8c1c0.
The commit adjusted the rtx_cost of MEM to reduce the cost of (add op0 disp).
But the cost of ADDR can be cheaper than that of XEXP (addr, 0) when ADDR is
a LEA.  That is the case in the PR; the patch adjusts rtx_cost to handle only
reg + disp.  The other forms are basically all LEA, which doesn't carry the
additional cost of an ADD.
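
To illustrate the distinction (a sketch, not taken from the patch):

  (mem (plus (reg) (const_int 8)))
      ;; reg + disp: the add folds into the addressing mode, so the
      ;; patch keeps this shape cheap.

  (mem (plus (plus (reg) (mult (reg) (const_int 4))) (const_int 8)))
      ;; base + index*4 + disp: if this address is materialized in a
      ;; register it is a single LEA, so there is no separate ADD whose
      ;; cost needs modeling.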

gcc/ChangeLog:

PR target/115462
* config/i386/i386.cc (ix86_rtx_costs): Make cost of MEM (reg +
disp) just a little bit more than MEM (reg).

gcc/testsuite/ChangeLog:
* gcc.target/i386/pr115462.c: New test.
---
 gcc/config/i386/i386.cc  |  5 -
 gcc/testsuite/gcc.target/i386/pr115462.c | 22 ++
 2 files changed, 26 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr115462.c

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index d4ccc24be6e..ef2a1e4f4f2 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -22339,7 +22339,10 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno,
 address_cost should be used, but it reduce cost too much.
 So current solution is make constant disp as cheap as possible.  */
  if (GET_CODE (addr) == PLUS
- && x86_64_immediate_operand (XEXP (addr, 1), Pmode))
+ && x86_64_immediate_operand (XEXP (addr, 1), Pmode)
+ /* Only handle (reg + disp) since other forms of addr are mostly LEA,
+    there's no additional cost for the plus of disp.  */
+ && register_operand (XEXP (addr, 0), Pmode))
{
  *total += 1;
  *total += rtx_cost (XEXP (addr, 0), Pmode, PLUS, 0, speed);
diff --git a/gcc/testsuite/gcc.target/i386/pr115462.c b/gcc/testsuite/gcc.target/i386/pr115462.c
new file mode 100644
index 000..ad50a6382bc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr115462.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx2 -fno-tree-vectorize -fno-pic" } */
/* { dg-final { scan-assembler-times {(?n)movl[ \t]+.*, p1\.0\+[0-9]*\(,} 3 } } */
+
+int
foo (long indx, long indx2, long indx3, long indx4, long indx5, long indx6, long n, int* q)
+{
+  static int p1[1];
+  int* p2 = p1 + 1000;
+  int* p3 = p1 + 4000;
+  int* p4 = p1 + 8000;
+
+  for (long i = 0; i != n; i++)
+{
+  /* scan for movl %edi, p1.0+3996(,%rax,4),
+     p1.0+3996 should be propagated into the loop.  */
+  p2[indx++] = q[indx++];
+  p3[indx2++] = q[indx2++];
+  p4[indx3++] = q[indx3++];
+}
+  return p1[indx6] + p1[indx5];
+}
-- 
2.31.1



Re: [PATCH] libstdc++: Add script to update docs for a new release branch

2024-06-26 Thread Eric Gallager
On Wed, Jun 26, 2024 at 4:28 PM Jonathan Wakely  wrote:
>
> Pushed to trunk. We have nearly a year to make improvements to it
> before it's needed for the gcc-15 branch ... I just hope I remember it
> exists when we branch ;-)

Maybe you could leave a note about it in the docs somewhere?

>
> On Wed, 26 Jun 2024 at 00:13, Jonathan Wakely  wrote:
> >
> > This script automates some updates that should be made when branching
> > from trunk. Putting them in a script makes it much easier and means I
> > won't forget what should be done.
> >
> > Any suggestions for doing this differently?
> >
> > Anything I've forgotten that should be added here?
> >
> > We could add an entry to the lists of versions at
> > https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html#abi.versioning.goals
> > but that should really be done when bumping the libtool version, not
> > when branching from trunk.
> >
> > -- >8 --
> >
> > This should be run on a release branch after branching from trunk.
> > Various links and references to trunk in the docs will be updated to
> > refer to the new release branch.
> >
> > libstdc++-v3/ChangeLog:
> >
> > * scripts/update_release_branch.sh: New file.
> > ---
> >  libstdc++-v3/scripts/update_release_branch.sh | 14 ++
> >  1 file changed, 14 insertions(+)
> >  create mode 100755 libstdc++-v3/scripts/update_release_branch.sh
> >
> > diff --git a/libstdc++-v3/scripts/update_release_branch.sh 
> > b/libstdc++-v3/scripts/update_release_branch.sh
> > new file mode 100755
> > index 000..f8109ed0ba3
> > --- /dev/null
> > +++ b/libstdc++-v3/scripts/update_release_branch.sh
> > @@ -0,0 +1,14 @@
> > +#!/bin/bash
> > +
> > +# This should be run on a release branch after branching from trunk.
> > +# Various links and references to trunk in the docs will be updated to
> > +# refer to the new release branch.
> > +
> > +# The major version of the new release branch.
> > +major=$1
> > +(($major)) || { echo "$0: Integer argument expected" >& 2 ; exit 1; }
> > +
> > +# This assumes GNU sed
> > +sed -i "s@^mainline GCC, not in any particular major.\$@the GCC ${major} 
> > series.@" doc/xml/manual/status_cxx*.xml
> > +sed -i 's@https://gcc.gnu.org/cgit/gcc/tree/libstdc++-v3/testsuite/[^"]\+@&?h=releases%2Fgcc-'${major}@ doc/xml/manual/allocator.xml doc/xml/manual/mt_allocator.xml
> > +sed -i "s@https://gcc.gnu.org/onlinedocs/gcc/Invoking-GCC.html@https://gcc.gnu.org/onlinedocs/gcc-${major}.1.0/gcc/Invoking-GCC.html@" doc/xml/manual/using.xml
> > --
> > 2.45.2
> >
>


[PATCH v2] Internal-fn: Support new IFN SAT_TRUNC for unsigned scalar int

2024-06-26 Thread pan2 . li
From: Pan Li 

This patch adds the middle-end representation for saturation
truncation, i.e. it sets the result of the truncated value to the
maximum value of the narrower type on overflow.  It matches patterns
similar to the one below.

Form 1:
  #define DEF_SAT_U_TRUC_FMT_1(WT, NT) \
  NT __attribute__((noinline)) \
  sat_u_truc_##T##_fmt_1 (WT x)\
  {\
bool overflow = x > (WT)(NT)(-1);  \
return ((NT)x) | (NT)-overflow;\
  }

For example, truncated uint16_t to uint8_t, we have

* SAT_TRUNC (254)   => 254
* SAT_TRUNC (255)   => 255
* SAT_TRUNC (256)   => 255
* SAT_TRUNC (65536) => 255

Given below SAT_TRUNC from uint64_t to uint32_t.

DEF_SAT_U_TRUC_FMT_1 (uint64_t, uint32_t)

Before this patch:
__attribute__((noinline))
uint32_t sat_u_truc_T_fmt_1 (uint64_t x)
{
  _Bool overflow;
  unsigned int _1;
  unsigned int _2;
  unsigned int _3;
  uint32_t _6;

;;   basic block 2, loop depth 0
;;pred:   ENTRY
  overflow_5 = x_4(D) > 4294967295;
  _1 = (unsigned int) x_4(D);
  _2 = (unsigned int) overflow_5;
  _3 = -_2;
  _6 = _1 | _3;
  return _6;
;;succ:   EXIT

}

After this patch:
__attribute__((noinline))
uint32_t sat_u_truc_T_fmt_1 (uint64_t x)
{
  uint32_t _6;

;;   basic block 2, loop depth 0
;;pred:   ENTRY
  _6 = .SAT_TRUNC (x_4(D)); [tail call]
  return _6;
;;succ:   EXIT

}
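
A self-contained C version of the matched form, handy for sanity-checking
the semantics (it mirrors the macro above for uint64_t -> uint32_t):

  #include <stdint.h>
  #include <stdio.h>

  static uint32_t
  sat_u_trunc_u64_u32 (uint64_t x)
  {
    int overflow = x > (uint64_t)(uint32_t)-1;  /* x > 4294967295 */
    return (uint32_t)x | (uint32_t)-overflow;   /* all-ones on overflow */
  }

  int
  main (void)
  {
    printf ("%u\n", sat_u_trunc_u64_u32 (254));          /* 254 */
    printf ("%u\n", sat_u_trunc_u64_u32 (4294967295u));  /* 4294967295 */
    printf ("%u\n", sat_u_trunc_u64_u32 (1ull << 36));   /* 4294967295 */
    return 0;
  }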

The below tests are passed for this patch:
*. The rv64gcv fully regression tests.
*. The rv64gcv build with glibc.
*. The x86 bootstrap tests.
*. The x86 fully regression tests.

gcc/ChangeLog:

* internal-fn.def (SAT_TRUNC): Add new signed IFN sat_trunc as
unary_convert.
* match.pd: Add new matching pattern for unsigned int sat_trunc.
* optabs.def (OPTAB_CL): Add unsigned and signed optab.
* tree-ssa-math-opts.cc (gimple_unsigned_integer_sat_trunc): Add
new decl for the matching pattern generated func.
(match_unsigned_saturation_trunc): Add new func impl to match
the .SAT_TRUNC.
(math_opts_dom_walker::after_dom_children): Add .SAT_TRUNC match
function under BIT_IOR_EXPR case.

Signed-off-by: Pan Li 
---
 gcc/internal-fn.def   |  2 ++
 gcc/match.pd  | 16 
 gcc/optabs.def|  3 +++
 gcc/tree-ssa-math-opts.cc | 32 
 4 files changed, 53 insertions(+)

diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index a8c83437ada..915d329c05a 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -278,6 +278,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | 
ECF_NOTHROW, first,
 DEF_INTERNAL_SIGNED_OPTAB_FN (SAT_ADD, ECF_CONST, first, ssadd, usadd, binary)
 DEF_INTERNAL_SIGNED_OPTAB_FN (SAT_SUB, ECF_CONST, first, sssub, ussub, binary)
 
+DEF_INTERNAL_SIGNED_OPTAB_FN (SAT_TRUNC, ECF_CONST, first, sstrunc, ustrunc, unary_convert)
+
 DEF_INTERNAL_COND_FN (ADD, ECF_CONST, add, binary)
 DEF_INTERNAL_COND_FN (SUB, ECF_CONST, sub, binary)
 DEF_INTERNAL_COND_FN (MUL, ECF_CONST, smul, binary)
diff --git a/gcc/match.pd b/gcc/match.pd
index 3d0689c9312..06120a1c62c 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3210,6 +3210,22 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
   && types_match (type, @0, @1
 
+/* Unsigned saturation truncate, case 1 (), sizeof (WT) > sizeof (NT).
+   SAT_U_TRUNC = (NT)x | (NT)(-(X > (WT)(NT)(-1))).  */
+(match (unsigned_integer_sat_trunc @0)
+ (bit_ior:c (negate (convert (gt @0 INTEGER_CST@1)))
+   (convert @0))
+ (with {
+   unsigned itype_precision = TYPE_PRECISION (TREE_TYPE (@0));
+   unsigned otype_precision = TYPE_PRECISION (type);
+   wide_int trunc_max = wi::mask (itype_precision / 2, false, itype_precision);
+   wide_int int_cst = wi::to_wide (@1, itype_precision);
+  }
+  (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
+   && TYPE_UNSIGNED (TREE_TYPE (@0))
+   && otype_precision < itype_precision
+   && wi::eq_p (trunc_max, int_cst)
+
 /* x >  y  &&  x != XXX_MIN  -->  x > y
x >  y  &&  x == XXX_MIN  -->  false . */
 (for eqne (eq ne)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index bc2611abdc2..c16580ce956 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -63,6 +63,9 @@ OPTAB_CX(fractuns_optab, "fractuns$Q$b$I$a2")
 OPTAB_CL(satfract_optab, "satfract$b$Q$a2", SAT_FRACT, "satfract", 
gen_satfract_conv_libfunc)
 OPTAB_CL(satfractuns_optab, "satfractuns$I$b$Q$a2", UNSIGNED_SAT_FRACT, 
"satfractuns", gen_satfractuns_conv_libfunc)
 
+OPTAB_CL(ustrunc_optab, "ustrunc$b$a2", US_TRUNCATE, "ustrunc", NULL)
+OPTAB_CL(sstrunc_optab, "sstrunc$b$a2", SS_TRUNCATE, "sstrunc", NULL)
+
 OPTAB_CD(sfixtrunc_optab, "fix_trunc$F$b$I$a2")
 OPTAB_CD(ufixtrunc_optab, "fixuns_trunc$F$b$I$a2")
 
diff --git a/gcc/tree-ssa-math-opts.cc b/gcc/tree-ssa-math-opts.cc
index 57085488722..3783a874699 100644
--- a/gcc/tree-ssa-math-opts.cc
+++ b/gcc/tree-ssa-math-opts.cc
@@ -4088,6 +4088,7 @@ arith_overflow_check_p (gimple 

Re: [PATCH v3] Vect: Support truncate after .SAT_SUB pattern in zip

2024-06-26 Thread Richard Biener
On Thu, Jun 27, 2024 at 3:31 AM  wrote:
>
> From: Pan Li 

OK

> The zip benchmark of coremark-pro has one SAT_SUB-like pattern, but
> truncated as below:
>
> void test (uint16_t *x, unsigned b, unsigned n)
> {
>   unsigned a = 0;
>   register uint16_t *p = x;
>
>   do {
> a = *--p;
> *p = (uint16_t)(a >= b ? a - b : 0); // Truncate after .SAT_SUB
>   } while (--n);
> }
>
> It has the gimple below before the vect pass; it cannot hit any pattern of
> SAT_SUB and therefore cannot vectorize to SAT_SUB.
>
> _2 = a_11 - b_12(D);
> iftmp.0_13 = (short unsigned int) _2;
> _18 = a_11 >= b_12(D);
> iftmp.0_5 = _18 ? iftmp.0_13 : 0;
>
> This patch improves the pattern match to recognize the above as a
> truncate after .SAT_SUB pattern.  Then we will have the pattern
> similar to the one below, as well as eliminate the first 3 dead stmts.
>
> _2 = a_11 - b_12(D);
> iftmp.0_13 = (short unsigned int) _2;
> _18 = a_11 >= b_12(D);
> iftmp.0_5 = (short unsigned int).SAT_SUB (a_11, b_12(D));
>
> The below tests are passed for this patch.
> 1. The rv64gcv fully regression tests.
> 2. The rv64gcv build with glibc.
> 3. The x86 bootstrap tests.
> 4. The x86 fully regression tests.
>
> gcc/ChangeLog:
>
> * match.pd: Add convert description for minus and capture.
> * tree-vect-patterns.cc (vect_recog_build_binary_gimple_call): Add
> new logic to handle when in_type is incompatible with out_type, as
> well as rename from.
> (vect_recog_build_binary_gimple_stmt): Rename to.
> (vect_recog_sat_add_pattern): Leverage above renamed func.
> (vect_recog_sat_sub_pattern): Ditto.
>
> Signed-off-by: Pan Li 
> ---
>  gcc/match.pd  |  4 +--
>  gcc/tree-vect-patterns.cc | 51 ---
>  2 files changed, 33 insertions(+), 22 deletions(-)
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index cf8a399a744..820591a36b3 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -3164,9 +3164,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  /* Unsigned saturation sub, case 2 (branch with ge):
> SAT_U_SUB = X >= Y ? X - Y : 0.  */
>  (match (unsigned_integer_sat_sub @0 @1)
> - (cond^ (ge @0 @1) (minus @0 @1) integer_zerop)
> + (cond^ (ge @0 @1) (convert? (minus (convert1? @0) (convert1? @1))) 
> integer_zerop)
>   (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> -  && types_match (type, @0, @1
> +  && TYPE_UNSIGNED (TREE_TYPE (@0)) && types_match (@0, @1
>
>  /* Unsigned saturation sub, case 3 (branchless with gt):
> SAT_U_SUB = (X - Y) * (X > Y).  */
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index cef901808eb..519d15f2a43 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -4490,26 +4490,37 @@ vect_recog_mult_pattern (vec_info *vinfo,
>  extern bool gimple_unsigned_integer_sat_add (tree, tree*, tree (*)(tree));
>  extern bool gimple_unsigned_integer_sat_sub (tree, tree*, tree (*)(tree));
>
> -static gcall *
> -vect_recog_build_binary_gimple_call (vec_info *vinfo, gimple *stmt,
> +static gimple *
> +vect_recog_build_binary_gimple_stmt (vec_info *vinfo, stmt_vec_info 
> stmt_info,
>  internal_fn fn, tree *type_out,
> -tree op_0, tree op_1)
> +tree lhs, tree op_0, tree op_1)
>  {
>tree itype = TREE_TYPE (op_0);
> -  tree vtype = get_vectype_for_scalar_type (vinfo, itype);
> +  tree otype = TREE_TYPE (lhs);
> +  tree v_itype = get_vectype_for_scalar_type (vinfo, itype);
> +  tree v_otype = get_vectype_for_scalar_type (vinfo, otype);
>
> -  if (vtype != NULL_TREE
> -&& direct_internal_fn_supported_p (fn, vtype, OPTIMIZE_FOR_BOTH))
> +  if (v_itype != NULL_TREE && v_otype != NULL_TREE
> +&& direct_internal_fn_supported_p (fn, v_itype, OPTIMIZE_FOR_BOTH))
>  {
>gcall *call = gimple_build_call_internal (fn, 2, op_0, op_1);
> +  tree in_ssa = vect_recog_temp_ssa_var (itype, NULL);
>
> -  gimple_call_set_lhs (call, vect_recog_temp_ssa_var (itype, NULL));
> +  gimple_call_set_lhs (call, in_ssa);
>gimple_call_set_nothrow (call, /* nothrow_p */ false);
> -  gimple_set_location (call, gimple_location (stmt));
> +  gimple_set_location (call, gimple_location (STMT_VINFO_STMT 
> (stmt_info)));
> +
> +  *type_out = v_otype;
>
> -  *type_out = vtype;
> +  if (types_compatible_p (itype, otype))
> +   return call;
> +  else
> +   {
> + append_pattern_def_seq (vinfo, stmt_info, call, v_itype);
> + tree out_ssa = vect_recog_temp_ssa_var (otype, NULL);
>
> -  return call;
> + return gimple_build_assign (out_ssa, NOP_EXPR, in_ssa);
> +   }
>  }
>
>return NULL;
> @@ -4541,13 +4552,13 @@ vect_recog_sat_add_pattern (vec_info *vinfo, 
> stmt_vec_info stmt_vinfo,
>
>if (gimple_unsigned_integer_sat_add (lhs, ops, NULL))
>  {
> -  gcall *call = vect_recog_build_binary_gimple_call (vinfo, 

Re: [PATCH V2] Fix wrong cost of MEM when addr is a lea.

2024-06-26 Thread Richard Biener
On Thu, Jun 27, 2024 at 5:57 AM liuhongt  wrote:
>
> > But rtx_cost invokes targetm.rtx_cost which allows to avoid that
> > recursive processing at any level.  You're dealing with MEM [addr]
> > here, so why's rtx_cost (addr, Pmode, MEM, 0, speed) not always
> > the best way to deal with this?  Since this is the MEM [addr] case
> > we know it's not LEA, no?
> The patch restricts the MEM rtx_cost reduction to register_operand + disp.
>
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?

Looks good from my side, I'll leave approval to x86 maintainers though.

Thanks,
Richard.

>
> 416.gamess regressed 4-6% on x86_64 since my r15-882-g1d6199e5f8c1c0.
> The commit adjusted the rtx_cost of MEM to reduce the cost of (add op0 disp).
> But the cost of ADDR can be cheaper than that of XEXP (addr, 0) when ADDR is
> a LEA.  That is the case in the PR; the patch adjusts rtx_cost to handle only
> reg + disp.  The other forms are basically all LEA, which doesn't carry the
> additional cost of an ADD.
>
> gcc/ChangeLog:
>
> PR target/115462
> * config/i386/i386.cc (ix86_rtx_costs): Make cost of MEM (reg +
> disp) just a little bit more than MEM (reg).
>
> gcc/testsuite/ChangeLog:
> * gcc.target/i386/pr115462.c: New test.
> ---
>  gcc/config/i386/i386.cc  |  5 -
>  gcc/testsuite/gcc.target/i386/pr115462.c | 22 ++
>  2 files changed, 26 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr115462.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index d4ccc24be6e..ef2a1e4f4f2 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -22339,7 +22339,10 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno,
>  address_cost should be used, but it reduce cost too much.
>  So current solution is make constant disp as cheap as possible.  */
>   if (GET_CODE (addr) == PLUS
> - && x86_64_immediate_operand (XEXP (addr, 1), Pmode))
> + && x86_64_immediate_operand (XEXP (addr, 1), Pmode)
> +	 /* Only handle (reg + disp) since other forms of addr are mostly LEA,
> +	    there's no additional cost for the plus of disp.  */
> + && register_operand (XEXP (addr, 0), Pmode))
> {
>   *total += 1;
>   *total += rtx_cost (XEXP (addr, 0), Pmode, PLUS, 0, speed);
> diff --git a/gcc/testsuite/gcc.target/i386/pr115462.c 
> b/gcc/testsuite/gcc.target/i386/pr115462.c
> new file mode 100644
> index 000..ad50a6382bc
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr115462.c
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx2 -fno-tree-vectorize -fno-pic" } */
> +/* { dg-final { scan-assembler-times {(?n)movl[ \t]+.*, p1\.0\+[0-9]*\(,} 3 } } */
> +
> +int
> +foo (long indx, long indx2, long indx3, long indx4, long indx5, long indx6, long n, int* q)
> +{
> +  static int p1[1];
> +  int* p2 = p1 + 1000;
> +  int* p3 = p1 + 4000;
> +  int* p4 = p1 + 8000;
> +
> +  for (long i = 0; i != n; i++)
> +{
> +  /* scan for movl %edi, p1.0+3996(,%rax,4),
> +     p1.0+3996 should be propagated into the loop.  */
> +  p2[indx++] = q[indx++];
> +  p3[indx2++] = q[indx2++];
> +  p4[indx3++] = q[indx3++];
> +}
> +  return p1[indx6] + p1[indx5];
> +}
> --
> 2.31.1
>


Re: [PATCH v2] Internal-fn: Support new IFN SAT_TRUNC for unsigned scalar int

2024-06-26 Thread Richard Biener
On Thu, Jun 27, 2024 at 7:12 AM  wrote:
>
> From: Pan Li 
>
> This patch adds the middle-end representation for saturation
> truncation, i.e. it sets the result of the truncated value to the
> maximum value of the narrower type on overflow.  It matches patterns
> similar to the one below.
>
> Form 1:
>   #define DEF_SAT_U_TRUC_FMT_1(WT, NT) \
>   NT __attribute__((noinline)) \
>   sat_u_truc_##T##_fmt_1 (WT x)\
>   {\
> bool overflow = x > (WT)(NT)(-1);  \
> return ((NT)x) | (NT)-overflow;\
>   }
>
> For example, truncated uint16_t to uint8_t, we have
>
> * SAT_TRUNC (254)   => 254
> * SAT_TRUNC (255)   => 255
> * SAT_TRUNC (256)   => 255
> * SAT_TRUNC (65536) => 255
>
> Given below SAT_TRUNC from uint64_t to uint32_t.
>
> DEF_SAT_U_TRUC_FMT_1 (uint64_t, uint32_t)
>
> Before this patch:
> __attribute__((noinline))
> uint32_t sat_u_truc_T_fmt_1 (uint64_t x)
> {
>   _Bool overflow;
>   unsigned int _1;
>   unsigned int _2;
>   unsigned int _3;
>   uint32_t _6;
>
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   overflow_5 = x_4(D) > 4294967295;
>   _1 = (unsigned int) x_4(D);
>   _2 = (unsigned int) overflow_5;
>   _3 = -_2;
>   _6 = _1 | _3;
>   return _6;
> ;;succ:   EXIT
>
> }
>
> After this patch:
> __attribute__((noinline))
> uint32_t sat_u_truc_T_fmt_1 (uint64_t x)
> {
>   uint32_t _6;
>
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   _6 = .SAT_TRUNC (x_4(D)); [tail call]
>   return _6;
> ;;succ:   EXIT
>
> }

OK.

Thanks,
Richard.

> The below tests are passed for this patch:
> *. The rv64gcv fully regression tests.
> *. The rv64gcv build with glibc.
> *. The x86 bootstrap tests.
> *. The x86 fully regression tests.
>
> gcc/ChangeLog:
>
> * internal-fn.def (SAT_TRUNC): Add new signed IFN sat_trunc as
> unary_convert.
> * match.pd: Add new matching pattern for unsigned int sat_trunc.
> * optabs.def (OPTAB_CL): Add unsigned and signed optab.
> * tree-ssa-math-opts.cc (gimple_unsigned_integer_sat_trunc): Add
> new decl for the matching pattern generated func.
> (match_unsigned_saturation_trunc): Add new func impl to match
> the .SAT_TRUNC.
> (math_opts_dom_walker::after_dom_children): Add .SAT_TRUNC match
> function under BIT_IOR_EXPR case.
>
> Signed-off-by: Pan Li 
> ---
>  gcc/internal-fn.def   |  2 ++
>  gcc/match.pd  | 16 
>  gcc/optabs.def|  3 +++
>  gcc/tree-ssa-math-opts.cc | 32 
>  4 files changed, 53 insertions(+)
>
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index a8c83437ada..915d329c05a 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -278,6 +278,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | 
> ECF_NOTHROW, first,
>  DEF_INTERNAL_SIGNED_OPTAB_FN (SAT_ADD, ECF_CONST, first, ssadd, usadd, 
> binary)
>  DEF_INTERNAL_SIGNED_OPTAB_FN (SAT_SUB, ECF_CONST, first, sssub, ussub, 
> binary)
>
> +DEF_INTERNAL_SIGNED_OPTAB_FN (SAT_TRUNC, ECF_CONST, first, sstrunc, ustrunc, 
> unary_convert)
> +
>  DEF_INTERNAL_COND_FN (ADD, ECF_CONST, add, binary)
>  DEF_INTERNAL_COND_FN (SUB, ECF_CONST, sub, binary)
>  DEF_INTERNAL_COND_FN (MUL, ECF_CONST, smul, binary)
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 3d0689c9312..06120a1c62c 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -3210,6 +3210,22 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>   (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
>&& types_match (type, @0, @1
>
> +/* Unsigned saturation truncate, case 1 (), sizeof (WT) > sizeof (NT).
> +   SAT_U_TRUNC = (NT)x | (NT)(-(X > (WT)(NT)(-1))).  */
> +(match (unsigned_integer_sat_trunc @0)
> + (bit_ior:c (negate (convert (gt @0 INTEGER_CST@1)))
> +   (convert @0))
> + (with {
> +   unsigned itype_precision = TYPE_PRECISION (TREE_TYPE (@0));
> +   unsigned otype_precision = TYPE_PRECISION (type);
> +   wide_int trunc_max = wi::mask (itype_precision / 2, false, 
> itype_precision);
> +   wide_int int_cst = wi::to_wide (@1, itype_precision);
> +  }
> +  (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> +   && TYPE_UNSIGNED (TREE_TYPE (@0))
> +   && otype_precision < itype_precision
> +   && wi::eq_p (trunc_max, int_cst)
> +
>  /* x >  y  &&  x != XXX_MIN  -->  x > y
> x >  y  &&  x == XXX_MIN  -->  false . */
>  (for eqne (eq ne)
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index bc2611abdc2..c16580ce956 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -63,6 +63,9 @@ OPTAB_CX(fractuns_optab, "fractuns$Q$b$I$a2")
>  OPTAB_CL(satfract_optab, "satfract$b$Q$a2", SAT_FRACT, "satfract", 
> gen_satfract_conv_libfunc)
>  OPTAB_CL(satfractuns_optab, "satfractuns$I$b$Q$a2", UNSIGNED_SAT_FRACT, 
> "satfractuns", gen_satfractuns_conv_libfunc)
>
> +OPTAB_CL(ustrunc_optab, "ustrunc$b$a2", US_TRUNCATE, "ustrunc", NULL)
> +OPTAB_CL(sstrunc_optab, "sstrunc$b$a2", SS_TRUNCATE, "sstrunc", NULL)

Re: [PATCH] [i386] restore recompute to override opts after change [PR113719]

2024-06-26 Thread Hongtao Liu
On Thu, Jun 13, 2024 at 3:32 PM Alexandre Oliva  wrote:
>
>
> The first patch for PR113719 regressed gcc.dg/ipa/iinline-attr.c on
> toolchains configured to --enable-frame-pointer, because the
> optimization node created within handle_optimize_attribute had
> flag_omit_frame_pointer incorrectly set, whereas
> default_optimization_node didn't.  With this difference,
> can_inline_edge_by_limits_p flagged an optimization mismatch and we
> refused to inline the function that had a redundant optimization flag
> into one that didn't, which is exactly what is tested for there.
>
> This patch restores the calls to ix86_default_align and
> ix86_recompute_optlev_based_flags that used to be, and ought to be,
> issued during TARGET_OVERRIDE_OPTIONS_AFTER_CHANGE, but preserves the
> intent of the original change, of having those functions called at
> different spots within ix86_option_override_internal.  To that end,
> the remaining bits were refactored into a separate function, that was
> in turn adjusted to operate on explicitly-passed opts and opts_set,
> rather than going for their global counterparts.
>
> Regstrapped on x86_64-linux-gnu.  Also tested with
> --enable-frame-pointer, and with gcc-13 x-x86-vx7r2, where the problem
> was detected.  Ok to install?
LGTM, thanks.
>
>
> for  gcc/ChangeLog
>
> PR target/113719
> * config/i386/i386-options.cc
> (ix86_override_options_after_change_1): Add opts and opts_set
> parms, operate on them, after factoring out of...
> (ix86_override_options_after_change): ... this.  Restore calls
> of ix86_default_align and ix86_recompute_optlev_based_flags.
> (ix86_option_override_internal): Call the factored-out bits.
> ---
>  gcc/config/i386/i386-options.cc |   59 ++-
>  1 file changed, 40 insertions(+), 19 deletions(-)
>
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index f2cecc0e2545b..7fa7f6774e9cf 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -1911,37 +1911,58 @@ ix86_recompute_optlev_based_flags (struct gcc_options *opts,
>  }
>  }
>
> -/* Implement TARGET_OVERRIDE_OPTIONS_AFTER_CHANGE hook.  */
> +/* Implement part of TARGET_OVERRIDE_OPTIONS_AFTER_CHANGE hook.  */
>
> -void
> -ix86_override_options_after_change (void)
> +static void
> +ix86_override_options_after_change_1 (struct gcc_options *opts,
> + struct gcc_options *opts_set)
>  {
> +#define OPTS_SET_P(OPTION) opts_set->x_ ## OPTION
> +#define OPTS(OPTION) opts->x_ ## OPTION
> +
>/* Disable unrolling small loops when there's explicit
>   -f{,no}unroll-loop.  */
> -  if ((OPTION_SET_P (flag_unroll_loops))
> - || (OPTION_SET_P (flag_unroll_all_loops)
> -&& flag_unroll_all_loops))
> +  if ((OPTS_SET_P (flag_unroll_loops))
> + || (OPTS_SET_P (flag_unroll_all_loops)
> +&& OPTS (flag_unroll_all_loops)))
>  {
> -  if (!OPTION_SET_P (ix86_unroll_only_small_loops))
> -   ix86_unroll_only_small_loops = 0;
> +  if (!OPTS_SET_P (ix86_unroll_only_small_loops))
> +   OPTS (ix86_unroll_only_small_loops) = 0;
>/* Re-enable -frename-registers and -fweb if funroll-loops
>  enabled.  */
> -  if (!OPTION_SET_P (flag_web))
> -   flag_web = flag_unroll_loops;
> -  if (!OPTION_SET_P (flag_rename_registers))
> -   flag_rename_registers = flag_unroll_loops;
> +  if (!OPTS_SET_P (flag_web))
> +   OPTS (flag_web) = OPTS (flag_unroll_loops);
> +  if (!OPTS_SET_P (flag_rename_registers))
> +   OPTS (flag_rename_registers) = OPTS (flag_unroll_loops);
>/* -fcunroll-grow-size default follws -f[no]-unroll-loops.  */
> -  if (!OPTION_SET_P (flag_cunroll_grow_size))
> -   flag_cunroll_grow_size = flag_unroll_loops
> -|| flag_peel_loops
> -|| optimize >= 3;
> +  if (!OPTS_SET_P (flag_cunroll_grow_size))
> +   OPTS (flag_cunroll_grow_size)
> + = (OPTS (flag_unroll_loops)
> +|| OPTS (flag_peel_loops)
> +|| OPTS (optimize) >= 3);
>  }
>else
>  {
> -  if (!OPTION_SET_P (flag_cunroll_grow_size))
> -   flag_cunroll_grow_size = flag_peel_loops || optimize >= 3;
> +  if (!OPTS_SET_P (flag_cunroll_grow_size))
> +   OPTS (flag_cunroll_grow_size)
> + = (OPTS (flag_peel_loops)
> +|| OPTS (optimize) >= 3);
>  }
>
> +#undef OPTS
> +#undef OPTS_SET_P
> +}
> +
> +/* Implement TARGET_OVERRIDE_OPTIONS_AFTER_CHANGE hook.  */
> +
> +void
> +ix86_override_options_after_change (void)
> +{
> +  ix86_default_align (&global_options);
> +
> +  ix86_recompute_optlev_based_flags (&global_options, &global_options_set);
> +
> +  ix86_override_options_after_change_1 (&global_options, &global_options_set);
>  }
>
>  /* Clear stack slot assignments remembered from previous functions.
> @@ -2488,7 +250

Re: [PATCH V2] Fix wrong cost of MEM when addr is a lea.

2024-06-26 Thread Uros Bizjak
On Thu, Jun 27, 2024 at 5:57 AM liuhongt  wrote:
>
> > But rtx_cost invokes targetm.rtx_cost which allows to avoid that
> > recursive processing at any level.  You're dealing with MEM [addr]
> > here, so why's rtx_cost (addr, Pmode, MEM, 0, speed) not always
> > the best way to deal with this?  Since this is the MEM [addr] case
> > we know it's not LEA, no?
> The patch restricts the MEM rtx_cost reduction to register_operand + disp.
>
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Ok for trunk?

LGTM.

Thanks,
Uros.

>
>
> 416.gamess regressed 4-6% on x86_64 since my r15-882-g1d6199e5f8c1c0.
> The commit adjusted the rtx_cost of MEM to reduce the cost of (add op0 disp).
> But the cost of ADDR can be cheaper than that of XEXP (addr, 0) when ADDR is
> a LEA.  That is the case in the PR; the patch adjusts rtx_cost to handle only
> reg + disp.  The other forms are basically all LEA, which doesn't carry the
> additional cost of an ADD.
>
> gcc/ChangeLog:
>
> PR target/115462
> * config/i386/i386.cc (ix86_rtx_costs): Make cost of MEM (reg +
> disp) just a little bit more than MEM (reg).
>
> gcc/testsuite/ChangeLog:
> * gcc.target/i386/pr115462.c: New test.
> ---
>  gcc/config/i386/i386.cc  |  5 -
>  gcc/testsuite/gcc.target/i386/pr115462.c | 22 ++
>  2 files changed, 26 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr115462.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index d4ccc24be6e..ef2a1e4f4f2 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -22339,7 +22339,10 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno,
>  address_cost should be used, but it reduce cost too much.
>  So current solution is make constant disp as cheap as possible.  */
>   if (GET_CODE (addr) == PLUS
> - && x86_64_immediate_operand (XEXP (addr, 1), Pmode))
> + && x86_64_immediate_operand (XEXP (addr, 1), Pmode)
> +	 /* Only handle (reg + disp) since other forms of addr are mostly LEA,
> +	    there's no additional cost for the plus of disp.  */
> + && register_operand (XEXP (addr, 0), Pmode))
> {
>   *total += 1;
>   *total += rtx_cost (XEXP (addr, 0), Pmode, PLUS, 0, speed);
> diff --git a/gcc/testsuite/gcc.target/i386/pr115462.c 
> b/gcc/testsuite/gcc.target/i386/pr115462.c
> new file mode 100644
> index 000..ad50a6382bc
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr115462.c
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx2 -fno-tree-vectorize -fno-pic" } */
> +/* { dg-final { scan-assembler-times {(?n)movl[ \t]+.*, p1\.0\+[0-9]*\(,} 3 } } */
> +
> +int
> +foo (long indx, long indx2, long indx3, long indx4, long indx5, long indx6, long n, int* q)
> +{
> +  static int p1[1];
> +  int* p2 = p1 + 1000;
> +  int* p3 = p1 + 4000;
> +  int* p4 = p1 + 8000;
> +
> +  for (long i = 0; i != n; i++)
> +{
> +  /* scan for movl %edi, p1.0+3996(,%rax,4),
> +     p1.0+3996 should be propagated into the loop.  */
> +  p2[indx++] = q[indx++];
> +  p3[indx2++] = q[indx2++];
> +  p4[indx3++] = q[indx3++];
> +}
> +  return p1[indx6] + p1[indx5];
> +}
> --
> 2.31.1
>


RE: [PATCH v3] Vect: Support truncate after .SAT_SUB pattern in zip

2024-06-26 Thread Li, Pan2
> OK

Committed, thanks Richard.

Pan

-Original Message-
From: Richard Biener  
Sent: Thursday, June 27, 2024 2:04 PM
To: Li, Pan2 
Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com; 
jeffreya...@gmail.com; rdapp@gmail.com
Subject: Re: [PATCH v3] Vect: Support truncate after .SAT_SUB pattern in zip

On Thu, Jun 27, 2024 at 3:31 AM  wrote:
>
> From: Pan Li 

OK

> The zip benchmark of coremark-pro has one SAT_SUB-like pattern, but
> truncated as below:
>
> void test (uint16_t *x, unsigned b, unsigned n)
> {
>   unsigned a = 0;
>   register uint16_t *p = x;
>
>   do {
> a = *--p;
> *p = (uint16_t)(a >= b ? a - b : 0); // Truncate after .SAT_SUB
>   } while (--n);
> }
>
> It has the gimple below before the vect pass; it cannot hit any pattern of
> SAT_SUB and therefore cannot vectorize to SAT_SUB.
>
> _2 = a_11 - b_12(D);
> iftmp.0_13 = (short unsigned int) _2;
> _18 = a_11 >= b_12(D);
> iftmp.0_5 = _18 ? iftmp.0_13 : 0;
>
> This patch improves the pattern match to recognize the above as a
> truncate after .SAT_SUB pattern.  Then we will have the pattern
> similar to the one below, as well as eliminate the first 3 dead stmts.
>
> _2 = a_11 - b_12(D);
> iftmp.0_13 = (short unsigned int) _2;
> _18 = a_11 >= b_12(D);
> iftmp.0_5 = (short unsigned int).SAT_SUB (a_11, b_12(D));
>
> The below tests are passed for this patch.
> 1. The rv64gcv fully regression tests.
> 2. The rv64gcv build with glibc.
> 3. The x86 bootstrap tests.
> 4. The x86 fully regression tests.
>
> gcc/ChangeLog:
>
> * match.pd: Add convert description for minus and capture.
> * tree-vect-patterns.cc (vect_recog_build_binary_gimple_call): Add
> new logic to handle when in_type is incompatible with out_type, as
> well as rename from.
> (vect_recog_build_binary_gimple_stmt): Rename to.
> (vect_recog_sat_add_pattern): Leverage above renamed func.
> (vect_recog_sat_sub_pattern): Ditto.
>
> Signed-off-by: Pan Li 
> ---
>  gcc/match.pd  |  4 +--
>  gcc/tree-vect-patterns.cc | 51 ---
>  2 files changed, 33 insertions(+), 22 deletions(-)
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index cf8a399a744..820591a36b3 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -3164,9 +3164,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  /* Unsigned saturation sub, case 2 (branch with ge):
> SAT_U_SUB = X >= Y ? X - Y : 0.  */
>  (match (unsigned_integer_sat_sub @0 @1)
> - (cond^ (ge @0 @1) (minus @0 @1) integer_zerop)
> + (cond^ (ge @0 @1) (convert? (minus (convert1? @0) (convert1? @1))) 
> integer_zerop)
>   (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> -  && types_match (type, @0, @1
> +  && TYPE_UNSIGNED (TREE_TYPE (@0)) && types_match (@0, @1
>
>  /* Unsigned saturation sub, case 3 (branchless with gt):
> SAT_U_SUB = (X - Y) * (X > Y).  */
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index cef901808eb..519d15f2a43 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -4490,26 +4490,37 @@ vect_recog_mult_pattern (vec_info *vinfo,
>  extern bool gimple_unsigned_integer_sat_add (tree, tree*, tree (*)(tree));
>  extern bool gimple_unsigned_integer_sat_sub (tree, tree*, tree (*)(tree));
>
> -static gcall *
> -vect_recog_build_binary_gimple_call (vec_info *vinfo, gimple *stmt,
> +static gimple *
> +vect_recog_build_binary_gimple_stmt (vec_info *vinfo, stmt_vec_info 
> stmt_info,
>  internal_fn fn, tree *type_out,
> -tree op_0, tree op_1)
> +tree lhs, tree op_0, tree op_1)
>  {
>tree itype = TREE_TYPE (op_0);
> -  tree vtype = get_vectype_for_scalar_type (vinfo, itype);
> +  tree otype = TREE_TYPE (lhs);
> +  tree v_itype = get_vectype_for_scalar_type (vinfo, itype);
> +  tree v_otype = get_vectype_for_scalar_type (vinfo, otype);
>
> -  if (vtype != NULL_TREE
> -&& direct_internal_fn_supported_p (fn, vtype, OPTIMIZE_FOR_BOTH))
> +  if (v_itype != NULL_TREE && v_otype != NULL_TREE
> +&& direct_internal_fn_supported_p (fn, v_itype, OPTIMIZE_FOR_BOTH))
>  {
>gcall *call = gimple_build_call_internal (fn, 2, op_0, op_1);
> +  tree in_ssa = vect_recog_temp_ssa_var (itype, NULL);
>
> -  gimple_call_set_lhs (call, vect_recog_temp_ssa_var (itype, NULL));
> +  gimple_call_set_lhs (call, in_ssa);
>gimple_call_set_nothrow (call, /* nothrow_p */ false);
> -  gimple_set_location (call, gimple_location (stmt));
> +  gimple_set_location (call, gimple_location (STMT_VINFO_STMT 
> (stmt_info)));
> +
> +  *type_out = v_otype;
>
> -  *type_out = vtype;
> +  if (types_compatible_p (itype, otype))
> +   return call;
> +  else
> +   {
> + append_pattern_def_seq (vinfo, stmt_info, call, v_itype);
> + tree out_ssa = vect_recog_temp_ssa_var (otype, NULL);
>
> -  return call;
> +   

RE: [PATCH v2] Internal-fn: Support new IFN SAT_TRUNC for unsigned scalar int

2024-06-26 Thread Li, Pan2
> OK.

Committed, thanks Richard.

Pan

-Original Message-
From: Richard Biener  
Sent: Thursday, June 27, 2024 2:08 PM
To: Li, Pan2 
Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com; 
jeffreya...@gmail.com; rdapp@gmail.com
Subject: Re: [PATCH v2] Internal-fn: Support new IFN SAT_TRUNC for unsigned 
scalar int

On Thu, Jun 27, 2024 at 7:12 AM  wrote:
>
> From: Pan Li 
>
> This patch adds the middle-end representation for saturation
> truncation, i.e. it sets the result of the truncated value to the
> maximum value of the narrower type on overflow.  It matches patterns
> similar to the one below.
>
> Form 1:
>   #define DEF_SAT_U_TRUC_FMT_1(WT, NT) \
>   NT __attribute__((noinline)) \
>   sat_u_truc_##T##_fmt_1 (WT x)\
>   {\
> bool overflow = x > (WT)(NT)(-1);  \
> return ((NT)x) | (NT)-overflow;\
>   }
>
> For example, truncated uint16_t to uint8_t, we have
>
> * SAT_TRUNC (254)   => 254
> * SAT_TRUNC (255)   => 255
> * SAT_TRUNC (256)   => 255
> * SAT_TRUNC (65536) => 255
>
> Given below SAT_TRUNC from uint64_t to uint32_t.
>
> DEF_SAT_U_TRUC_FMT_1 (uint64_t, uint32_t)
>
> Before this patch:
> __attribute__((noinline))
> uint32_t sat_u_truc_T_fmt_1 (uint64_t x)
> {
>   _Bool overflow;
>   unsigned int _1;
>   unsigned int _2;
>   unsigned int _3;
>   uint32_t _6;
>
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   overflow_5 = x_4(D) > 4294967295;
>   _1 = (unsigned int) x_4(D);
>   _2 = (unsigned int) overflow_5;
>   _3 = -_2;
>   _6 = _1 | _3;
>   return _6;
> ;;succ:   EXIT
>
> }
>
> After this patch:
> __attribute__((noinline))
> uint32_t sat_u_truc_T_fmt_1 (uint64_t x)
> {
>   uint32_t _6;
>
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   _6 = .SAT_TRUNC (x_4(D)); [tail call]
>   return _6;
> ;;succ:   EXIT
>
> }

OK.

Thanks,
Richard.

> The below tests are passed for this patch:
> *. The rv64gcv fully regression tests.
> *. The rv64gcv build with glibc.
> *. The x86 bootstrap tests.
> *. The x86 fully regression tests.
>
> gcc/ChangeLog:
>
> * internal-fn.def (SAT_TRUNC): Add new signed IFN sat_trunc as
> unary_convert.
> * match.pd: Add new matching pattern for unsigned int sat_trunc.
> * optabs.def (OPTAB_CL): Add unsigned and signed optab.
> * tree-ssa-math-opts.cc (gimple_unsigned_integer_sat_trunc): Add
> new decl for the matching pattern generated func.
> (match_unsigned_saturation_trunc): Add new func impl to match
> the .SAT_TRUNC.
> (math_opts_dom_walker::after_dom_children): Add .SAT_TRUNC match
> function under BIT_IOR_EXPR case.
>
> Signed-off-by: Pan Li 
> ---
>  gcc/internal-fn.def   |  2 ++
>  gcc/match.pd  | 16 
>  gcc/optabs.def|  3 +++
>  gcc/tree-ssa-math-opts.cc | 32 
>  4 files changed, 53 insertions(+)
>
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index a8c83437ada..915d329c05a 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -278,6 +278,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | 
> ECF_NOTHROW, first,
>  DEF_INTERNAL_SIGNED_OPTAB_FN (SAT_ADD, ECF_CONST, first, ssadd, usadd, 
> binary)
>  DEF_INTERNAL_SIGNED_OPTAB_FN (SAT_SUB, ECF_CONST, first, sssub, ussub, 
> binary)
>
> +DEF_INTERNAL_SIGNED_OPTAB_FN (SAT_TRUNC, ECF_CONST, first, sstrunc, ustrunc, 
> unary_convert)
> +
>  DEF_INTERNAL_COND_FN (ADD, ECF_CONST, add, binary)
>  DEF_INTERNAL_COND_FN (SUB, ECF_CONST, sub, binary)
>  DEF_INTERNAL_COND_FN (MUL, ECF_CONST, smul, binary)
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 3d0689c9312..06120a1c62c 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -3210,6 +3210,22 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>   (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
>&& types_match (type, @0, @1
>
> +/* Unsigned saturation truncate, case 1 (), sizeof (WT) > sizeof (NT).
> +   SAT_U_TRUNC = (NT)x | (NT)(-(X > (WT)(NT)(-1))).  */
> +(match (unsigned_integer_sat_trunc @0)
> + (bit_ior:c (negate (convert (gt @0 INTEGER_CST@1)))
> +   (convert @0))
> + (with {
> +   unsigned itype_precision = TYPE_PRECISION (TREE_TYPE (@0));
> +   unsigned otype_precision = TYPE_PRECISION (type);
> +   wide_int trunc_max = wi::mask (itype_precision / 2, false, 
> itype_precision);
> +   wide_int int_cst = wi::to_wide (@1, itype_precision);
> +  }
> +  (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> +   && TYPE_UNSIGNED (TREE_TYPE (@0))
> +   && otype_precision < itype_precision
> +   && wi::eq_p (trunc_max, int_cst)
> +
>  /* x >  y  &&  x != XXX_MIN  -->  x > y
> x >  y  &&  x == XXX_MIN  -->  false . */
>  (for eqne (eq ne)
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index bc2611abdc2..c16580ce956 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -63,6 +63,9 @@ OPTAB_CX(fractuns_optab, "fractuns$Q$b$I$a2")
>  OPTAB_CL(satfract_optab, "satfract$b$Q$a2", SAT_FRACT, "satfract", gen_satfract_conv_libfunc)

Re: [PATCH] Hard register asm constraint

2024-06-26 Thread Stefan Schulze Frielinghaus
On Wed, Jun 26, 2024 at 11:10:38AM -0400, Paul Koning wrote:
> 
> 
> > On Jun 26, 2024, at 8:54 AM, Stefan Schulze Frielinghaus 
> >  wrote:
> > 
> > On Tue, Jun 25, 2024 at 01:02:39PM -0400, Paul Koning wrote:
> >> 
> >> 
> >>> On Jun 25, 2024, at 12:04 PM, Stefan Schulze Frielinghaus 
> >>>  wrote:
> >>> 
> >>> On Tue, Jun 25, 2024 at 10:03:34AM -0400, Paul Koning wrote:
>  
> >>> ...
> >>> could be rewritten into
> >>> 
> >>> int test (int x, int y)
> >>> {
> >>> asm ("foo %0,%1,%2" : "+{r4}" (x) : "{r5}" (y), "d" (y));
> >>> return x;
> >>> }
>  
>  I like this idea but I'm wondering: regular constraints specify what 
>  sort of value is needed, for example an int vs. a short int vs. a float. 
>   The notation you've shown doesn't seem to have that aspect.
> >>> 
> >>> As Maciej already pointed out the type of the expression should suffice.
> >>> My assumption was that an asm can deal with a value as is or its
> >>> promoted value.  At least for integer values this should be fine and
> >>> AFAICS is also the case for simple constraints like "r" which do not
> >>> define any mode.  I've probably overlooked something, but which constraint
> >>> differentiates between int vs short?  However, you have a good point
> >>> with this and I should test this more.
> >> 
> >> I thought there was but I may be confused.  On the other hand, there 
> >> definitely are (machine dependent) constraints that distinguish, say, 
> >> float from integer registers; pdp11 is an example.  If you were to use an 
> >> "a" constraint, that means a floating point register and the compiler will 
> >> detect attempts to pass non-float operands ("Inconsistent operand 
> >> constraints...").
> >> 
> >> I see that the existing "register int ..." syntax appears to check that 
> >> the register is the right type for the data type given for it, so for 
> >> example on pdp11, 
> >> 
> >>register int ac1 asm ("ac1") = i;
> >> 
> >> fails ("register ... isn't suitable for data type").  I assume your new 
> >> syntax would perform the same check and produce roughly the same error 
> >> message.  You might verify that.  On pdp11, trying to use, for example, 
> >> "r0" for a float, or "ac0" for an int, would produce that error.
> > 
> > Right, so far I don't error out here which I will change.  It basically
> > results in bit casting floats to ints currently.
> 
> That would be bad.  For one thing, a PDP11 float doesn't fit in an integer 
> register.
> 
> That also brings up another point (which applies to more mainstream targets 
> as well): for data types that require multiple registers, say a register pair 
> for a double length value, how is that handled?  One possible answer is to 
> reject that.  Another would be to load a register pair.
> 
> This case applies to a "long int" on pdp11, or 32 bit MIPS, and probably a 
> bunch of others.

Absolutely, also on mainstream targets you could think of 128-bit integers
or long doubles which typically don't fit in (single) GPRs.  I should
definitely add error handling for this.  Similarly, I don't error out for
non-primitive data types.

I will give register pairs a try.
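
A concrete shape of the open question, using the proposed syntax
(hypothetical register names and a made-up "add64" mnemonic, purely for
illustration):

  /* On a 32-bit target a 64-bit operand pinned to "r4" raises exactly
     the question above: reject it, or allocate the r4/r5 pair?  */
  long long
  add64 (long long x, long long y)
  {
    asm ("add64 %0,%1" : "+{r4}" (x) : "{r6}" (y));
    return x;
  }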

Thanks for all your comments so far :)

Cheers,
Stefan


Re: [PATCH 1/3] Release structures on function return

2024-06-26 Thread Jørgen Kvalsvik

I think we need to revert this.

I got this email from linaro/gcc-regressions:

[Linaro-TCWG-CI] gcc-15-1649-g19f630e6ae8d: FAIL: 2 regressions on aarch64

regressions.sum:
=== gcc tests ===

Running gcc:gcc.misc-tests/gcov.exp ...
FAIL: gcc.misc-tests/gcov-23.c (internal compiler error: in operator[], at vec.h:910)

FAIL: gcc.misc-tests/gcov-23.c (test for excess errors)

This did not reproduce on my machine, but I took a quick look at the 
hash-map implementation. hash_map.put calls 
hash_table.find_slot_with_hash, which calls hash_table.expand, which 
does move+destroy. auto_vec is not really move-aware which leads to a 
double-free.
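
A minimal standalone C++ illustration of that failure mode (not GCC
code; naive_vec stands in for a vector type whose "move" during table
expansion is a raw bitwise copy):

  #include <cstring>

  struct naive_vec
  {
    int *data;
    naive_vec () : data (new int[4]) {}
    ~naive_vec () { delete[] data; }  // frees whatever it points at
  };

  int
  main ()
  {
    naive_vec a, b;
    // An expand that "moves" by raw copy leaves two owners of one buffer:
    std::memcpy (&b, &a, sizeof a);
    return 0;
    // scope exit: both a and b delete the same buffer -> double-free
  }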


The fix is either to make auto_vec move-aware (and more like C++'s 
std::vector) or revert this patch and apply the original version with an 
explicit release.


OK?

Thanks,
Jørgen

On 6/25/24 12:23, Jan Hubicka wrote:

The value vec objects are destroyed on exit, but release still needs to
be called explicitly.

gcc/ChangeLog:

* tree-profile.cc (find_conditions): Release vectors before
  return.

I wonder if you turn
 hash_map, vec> exprs;
to
 hash_map, auto_vec> exprs;
Won't hash_map destructor take care of this by itself?

Honza

---
  gcc/tree-profile.cc | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/gcc/tree-profile.cc b/gcc/tree-profile.cc
index e4bb689cef5..18f48e8d04e 100644
--- a/gcc/tree-profile.cc
+++ b/gcc/tree-profile.cc
@@ -919,6 +919,9 @@ find_conditions (struct function *fn)
  if (!have_post_dom)
free_dominance_info (fn, CDI_POST_DOMINATORS);
  
+for (auto expr : exprs)
+  expr.second.release ();
+
  cov->m_masks.safe_grow_cleared (2 * cov->m_index.last ());
  const size_t length = cov_length (cov);
  for (size_t i = 0; i != length; i++)
--
2.39.2





Re: [Patch, Fortran] 2/3 Refactor locations where _vptr is (re)set.

2024-06-26 Thread Paul Richard Thomas
Hi Andre,

Thanks for resending the patches. I fear that daytime work and visitors
have taken my attention the last few days - hence the delay in reviewing,
for which I apologise.

The patches do what they are advertised to do, without regressions on my
side. I like gfc_class_set_vptr. Please remove the commented-out assert,
unless you intend to deploy it.

OK for mainline.

Thanks for the patches.

Regards

Paul


On Fri, 21 Jun 2024 at 07:39, Andre Vehreschild  wrote:

> Hi Paul,
>
> I am sorry for the delay. I am fighting with PR96992, where Harald finds
> more
> and more issues. I think I am approaching that one wrongly. We will see.
>
> Anyway, please find attached updated version of the 2/3 and 3/3 patches,
> which
> apply cleanly onto master at 1f974c3a24b76e25a2b7f31a6c7f4aee93a9eaab .
>
> Hope that helps and thanks in advance for looking at the patches.
>
> Regards,
> Andre
>
> PS. I have attached them in plain text and as archive to prevent mailers
> from
> corrupting them.
>
> On Thu, 20 Jun 2024 07:42:31 +0100
> Paul Richard Thomas  wrote:
>
> > Hi Andre,
> >
> > Both this patch and 3/3 are corrupt according to git apply:
> > [pault@pc30 gcc]$ git apply --ignore-space-change --ignore-whitespace <
> > ~/prs/andre/u*.patch
> > error: corrupt patch at line 45
> > [pault@pc30 gcc]$ git apply --ignore-space-change --ignore-whitespace <
> > ~/prs/andre/i*.patch
> > error: corrupt patch at line 36
> >
> > I tried messing with the offending lines, to no avail. I can apply them
> by
> > hand or, perhaps, you could supply me with clean patches?
> >
> > The patches look OK but I want to check the code that they generate.
> >
> > Cheers
> >
> > Paul
> >
> >
> > On Tue, 11 Jun 2024 at 13:57, Andre Vehreschild  wrote:
> >
> > > Hi all,
> > >
> > > this patch refactors most of the locations where the _vptr of a class
> data
> > > type
> > > is reset. The code was inconsistent in most of the locations. The goal
> of
> > > using
> > > only one routine for setting the _vptr is to be able to later modify it
> > > more
> > > easily.
> > >
> > > The ultimate goal being that every time one assigns to a class data
> type a
> > > consistent way is used to prevent forgetting the corner cases. So this
> is
> > > just a
> > > small step in this direction. I think it is worth to simplify the code
> to
> > > something consistent to reduce maintenance efforts anyhow.
> > >
> > > Regtested ok on x86_64 Fedora 39. Ok for mainline?
> > >
> > > Regards,
> > > Andre
> > > --
> > > Andre Vehreschild * Email: vehre ad gmx dot de
> > >
>
>
> --
> Andre Vehreschild * Kreuzherrenstr. 8 * 52062 Aachen
> Tel.: +49 241 9291018 * Email: ve...@gmx.de
>


Re: [PATCH v1] RISC-V: Add testcases for vector truncate after .SAT_SUB

2024-06-26 Thread juzhe.zh...@rivai.ai
Since the middle-end patch is approved, LGTM for this patch.
Thanks for improving RVV vectorization.



juzhe.zh...@rivai.ai
 
From: pan2.li
Date: 2024-06-25 20:40
To: gcc-patches
CC: juzhe.zhong; kito.cheng; richard.guenther; jeffreyalaw; rdapp.gcc; Pan Li
Subject: [PATCH v1] RISC-V: Add testcases for vector truncate after .SAT_SUB
From: Pan Li 
 
This patch adds the test cases of the vector truncate after
.SAT_SUB.  Aka:
 
  #define DEF_VEC_SAT_U_SUB_TRUNC_FMT_1(OUT_T, IN_T)   \
  void __attribute__((noinline))   \
  vec_sat_u_sub_trunc_##OUT_T##_fmt_1 (OUT_T *out, IN_T *op_1, IN_T y, \
   unsigned limit) \
  {\
unsigned i;\
for (i = 0; i < limit; i++)\
  {\
IN_T x = op_1[i];  \
out[i] = (OUT_T)(x >= y ? x - y : 0);  \
  }\
  }
 
The below 3 cases are included.
 
DEF_VEC_SAT_U_SUB_TRUNC_FMT_1(uint8_t, uint16_t)
DEF_VEC_SAT_U_SUB_TRUNC_FMT_1(uint16_t, uint32_t)
DEF_VEC_SAT_U_SUB_TRUNC_FMT_1(uint32_t, uint64_t)
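 
For reference, a scalar driver for the first instantiation would look like
the below (a hypothetical sketch for illustration only, not one of the new
test files):
 
#include <stdint.h>
 
DEF_VEC_SAT_U_SUB_TRUNC_FMT_1 (uint8_t, uint16_t)
 
int
main (void)
{
  uint16_t in[4] = { 5, 10, 100, 65535 };
  uint8_t out[4];
 
  vec_sat_u_sub_trunc_uint8_t_fmt_1 (out, in, 10, 4);
  /* out is { 0, 0, 90, 245 }: x >= y ? x - y : 0 is computed in uint16_t
     and then truncated to uint8_t, so 65525 wraps to 245.  */
  return 0;
}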
 
gcc/testsuite/ChangeLog:
 
* gcc.target/riscv/rvv/autovec/binop/vec_sat_arith.h: Add helper
test macros.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_binary_scalar.h: New test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_sub_trunc-1.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_sub_trunc-2.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_sub_trunc-3.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_sub_trunc-run-1.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_sub_trunc-run-2.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_sub_trunc-run-3.c: New test.
 
Signed-off-by: Pan Li 
---
.../riscv/rvv/autovec/binop/vec_sat_arith.h   | 19 +
.../rvv/autovec/binop/vec_sat_binary_scalar.h | 27 +++
.../rvv/autovec/binop/vec_sat_u_sub_trunc-1.c | 21 ++
.../rvv/autovec/binop/vec_sat_u_sub_trunc-2.c | 21 ++
.../rvv/autovec/binop/vec_sat_u_sub_trunc-3.c | 21 ++
.../autovec/binop/vec_sat_u_sub_trunc-run-1.c | 74 +++
.../autovec/binop/vec_sat_u_sub_trunc-run-2.c | 74 +++
.../autovec/binop/vec_sat_u_sub_trunc-run-3.c | 74 +++
8 files changed, 331 insertions(+)
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_binary_scalar.h
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_sub_trunc-1.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_sub_trunc-2.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_sub_trunc-3.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_sub_trunc-run-1.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_sub_trunc-run-2.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_sub_trunc-run-3.c
 
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_arith.h 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_arith.h
index d5c81fbe5a9..a3116033fb3 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_arith.h
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_arith.h
@@ -310,4 +310,23 @@ vec_sat_u_sub_##T##_fmt_10 (T *out, T *op_1, T *op_2, 
unsigned limit) \
#define RUN_VEC_SAT_U_SUB_FMT_10(T, out, op_1, op_2, N) \
   vec_sat_u_sub_##T##_fmt_10(out, op_1, op_2, N)
+/**/
+/* Saturation Sub Truncated (Unsigned and Signed) */
+/**/
+#define DEF_VEC_SAT_U_SUB_TRUNC_FMT_1(OUT_T, IN_T)   \
+void __attribute__((noinline))   \
+vec_sat_u_sub_trunc_##OUT_T##_fmt_1 (OUT_T *out, IN_T *op_1, IN_T y, \
+  unsigned limit) \
+{\
+  unsigned i;\
+  for (i = 0; i < limit; i++)\
+{\
+  IN_T x = op_1[i];  \
+  out[i] = (OUT_T)(x >= y ? x - y : 0);  \
+}\
+}
+
+#define RUN_VEC_SAT_U_SUB_TRUNC_FMT_1(OUT_T, IN_T, out, op_1, y, N) \
+  vec_sat_u_sub_trunc_##OUT_T##_fmt_1(out, op_1, y, N)
+
#endif
diff --git 
a/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_bi

Re: [PATCH v2] Vect: Support truncate after .SAT_SUB pattern in zip

2024-06-26 Thread Uros Bizjak
On Mon, Jun 24, 2024 at 3:55 PM  wrote:
>
> From: Pan Li 
>
> The zip benchmark of coremark-pro has one SAT_SUB-like pattern but
> truncated as below:
>
> void test (uint16_t *x, unsigned b, unsigned n)
> {
>   unsigned a = 0;
>   register uint16_t *p = x;
>
>   do {
> a = *--p;
> *p = (uint16_t)(a >= b ? a - b : 0); // Truncate after .SAT_SUB
>   } while (--n);
> }
>
> It will have the gimple below before the vect pass; it cannot hit any
> pattern of SAT_SUB and then cannot be vectorized to SAT_SUB.
>
> _2 = a_11 - b_12(D);
> iftmp.0_13 = (short unsigned int) _2;
> _18 = a_11 >= b_12(D);
> iftmp.0_5 = _18 ? iftmp.0_13 : 0;
>
> This patch improves the pattern match to recognize the above as a
> truncate after .SAT_SUB pattern.  Then we will have the pattern
> similar to below, as well as eliminate the first 3 dead stmts.
>
> _2 = a_11 - b_12(D);
> iftmp.0_13 = (short unsigned int) _2;
> _18 = a_11 >= b_12(D);
> iftmp.0_5 = (short unsigned int).SAT_SUB (a_11, b_12(D));
>
> The below tests are passed for this patch.
> 1. The rv64gcv fully regression tests.
> 2. The rv64gcv build with glibc.
> 3. The x86 bootstrap tests.
> 4. The x86 fully regression tests.

I have tried this patch with x86_64 on the testcase from PR51492, but
the compiler does not recognize the .SAT_SUB pattern here.

Is there anything else missing for successful detection?

Uros.

>
> gcc/ChangeLog:
>
> * match.pd: Add convert description for minus and capture.
> * tree-vect-patterns.cc (vect_recog_build_binary_gimple_call): Add
> new logic to handle when in_type is incompatible with out_type, as
> well as rename from.
> (vect_recog_build_binary_gimple_stmt): Rename to.
> (vect_recog_sat_add_pattern): Leverage above renamed func.
> (vect_recog_sat_sub_pattern): Ditto.
>
> Signed-off-by: Pan Li 
> ---
>  gcc/match.pd  |  4 +--
>  gcc/tree-vect-patterns.cc | 51 ---
>  2 files changed, 33 insertions(+), 22 deletions(-)
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 3d0689c9312..4a4b0b2e72f 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -3164,9 +3164,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  /* Unsigned saturation sub, case 2 (branch with ge):
> SAT_U_SUB = X >= Y ? X - Y : 0.  */
>  (match (unsigned_integer_sat_sub @0 @1)
> - (cond^ (ge @0 @1) (minus @0 @1) integer_zerop)
> + (cond^ (ge @0 @1) (convert? (minus (convert1? @0) (convert1? @1))) integer_zerop)
>   (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> -  && types_match (type, @0, @1))))
> +  && TYPE_UNSIGNED (TREE_TYPE (@0)) && types_match (@0, @1))))
>
>  /* Unsigned saturation sub, case 3 (branchless with gt):
> SAT_U_SUB = (X - Y) * (X > Y).  */
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index cef901808eb..3d887d36050 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -4490,26 +4490,37 @@ vect_recog_mult_pattern (vec_info *vinfo,
>  extern bool gimple_unsigned_integer_sat_add (tree, tree*, tree (*)(tree));
>  extern bool gimple_unsigned_integer_sat_sub (tree, tree*, tree (*)(tree));
>
> -static gcall *
> -vect_recog_build_binary_gimple_call (vec_info *vinfo, gimple *stmt,
> +static gimple *
> +vect_recog_build_binary_gimple_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>  internal_fn fn, tree *type_out,
> -tree op_0, tree op_1)
> +tree lhs, tree op_0, tree op_1)
>  {
>tree itype = TREE_TYPE (op_0);
> -  tree vtype = get_vectype_for_scalar_type (vinfo, itype);
> +  tree otype = TREE_TYPE (lhs);
> +  tree v_itype = get_vectype_for_scalar_type (vinfo, itype);
> +  tree v_otype = get_vectype_for_scalar_type (vinfo, otype);
>
> -  if (vtype != NULL_TREE
> -&& direct_internal_fn_supported_p (fn, vtype, OPTIMIZE_FOR_BOTH))
> +  if (v_itype != NULL_TREE && v_otype != NULL_TREE
> +&& direct_internal_fn_supported_p (fn, v_itype, OPTIMIZE_FOR_BOTH))
>  {
>gcall *call = gimple_build_call_internal (fn, 2, op_0, op_1);
> +  tree in_ssa = vect_recog_temp_ssa_var (itype, NULL);
>
> -  gimple_call_set_lhs (call, vect_recog_temp_ssa_var (itype, NULL));
> +  gimple_call_set_lhs (call, in_ssa);
>gimple_call_set_nothrow (call, /* nothrow_p */ false);
> -  gimple_set_location (call, gimple_location (stmt));
> +  gimple_set_location (call, gimple_location (STMT_VINFO_STMT (stmt_info)));
> +
> +  *type_out = v_otype;
>
> -  *type_out = vtype;
> +  if (types_compatible_p (itype, otype))
> +   return call;
> +  else
> +   {
> + append_pattern_def_seq (vinfo, stmt_info, call, v_itype);
> + tree out_ssa = vect_recog_temp_ssa_var (otype, NULL);
>
> -  return call;
> + return gimple_build_assign (out_ssa, CONVERT_EXPR, in_ssa);
> +   }
>  }
>
>return NULL;
> @@ -4541,13 +4552,13 @@ vect_recog_sat

Re: [PATCH] libstdc++: Add script to update docs for a new release branch

2024-06-26 Thread Jonathan Wakely
On Thu, 27 Jun 2024, 05:05 Eric Gallager,  wrote:

> On Wed, Jun 26, 2024 at 4:28 PM Jonathan Wakely 
> wrote:
> >
> > Pushed to trunk. We have nearly a year to make improvements to it
> > before it's needed for the gcc-15 branch ... I just hope I remember it
> > exists when we branch ;-)
>
> Maybe you could leave a note about it in the docs somewhere?
>

The release managers already have a note to poke the libstdc++ maintainer
when branching.



> >
> > On Wed, 26 Jun 2024 at 00:13, Jonathan Wakely 
> wrote:
> > >
> > > This script automates some updates that should be made when branching
> > > from trunk. Putting them in a script makes it much easier and means I
> > > won't forget what should be done.
> > >
> > > Any suggestions for doing this differently?
> > >
> > > Anything I've forgotten that should be added here?
> > >
> > > We could add an entry to the lists of versions at
> > >
> https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html#abi.versioning.goals
> > > but that should really be done when bumping the libtool version, not
> > > when branching from trunk.
> > >
> > > -- >8 --
> > >
> > > This should be run on a release branch after branching from trunk.
> > > Various links and references to trunk in the docs will be updated to
> > > refer to the new release branch.
> > >
> > > libstdc++-v3/ChangeLog:
> > >
> > > * scripts/update_release_branch.sh: New file.
> > > ---
> > >  libstdc++-v3/scripts/update_release_branch.sh | 14 ++
> > >  1 file changed, 14 insertions(+)
> > >  create mode 100755 libstdc++-v3/scripts/update_release_branch.sh
> > >
> > > diff --git a/libstdc++-v3/scripts/update_release_branch.sh
> b/libstdc++-v3/scripts/update_release_branch.sh
> > > new file mode 100755
> > > index 000..f8109ed0ba3
> > > --- /dev/null
> > > +++ b/libstdc++-v3/scripts/update_release_branch.sh
> > > @@ -0,0 +1,14 @@
> > > +#!/bin/bash
> > > +
> > > +# This should be run on a release branch after branching from trunk.
> > > +# Various links and references to trunk in the docs will be updated to
> > > +# refer to the new release branch.
> > > +
> > > +# The major version of the new release branch.
> > > +major=$1
> > > +(($major)) || { echo "$0: Integer argument expected" >& 2 ; exit 1; }
> > > +
> > > +# This assumes GNU sed
> > > +sed -i "s@^mainline GCC, not in any particular major.\$@the GCC
> ${major} series.@" doc/xml/manual/status_cxx*.xml
> > > +sed -i 's@https://gcc.gnu.org/cgit/gcc/tree/libstdc++-v3/testsuite/[
> 
> ^"]\+@&?h=releases%2Fgcc-'${major}@ doc/xml/manual/allocator.xml
> doc/xml/manual/mt_allocator.xml
> > > +sed -i "s@
> https://gcc.gnu.org/onlinedocs/gcc/Invoking-GCC.html@https://gcc.gnu.org/onlinedocs/gcc-${major}.1.0/gcc/Invoking-GCC.html@";
> doc/xml/manual/using.xml
> > > --
> > > 2.45.2
> > >
> >
>


Re: [PATCH] aarch64: Add support for -mcpu=grace

2024-06-26 Thread Kyrylo Tkachov
Hi Andrew,

> On 26 Jun 2024, at 23:02, Andrew Pinski  wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
> On Wed, Jun 26, 2024 at 12:40 AM Kyrylo Tkachov  wrote:
>> 
>> Hi all,
>> 
>> This adds support for the NVIDIA Grace CPU to aarch64.
>> We reuse the tuning decisions for the Neoverse V2 core, but include a
>> number of architecture features that are not enabled by default in
>> -mcpu=neoverse-v2.
>> 
>> This allows Grace users to more simply target the CPU with -mcpu=grace
>> rather than remembering what extensions to tag on top of
>> -mcpu=neoverse-v2.
>> 
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>> I’m pushing this to trunk.
> 
>> RNG
> 
> I noticed this is missing from grace but is included in neoverse-v2.
> Is that expected?

Yes, RNG is an optional configuration feature of Neoverse V2 (according to
the TRM) and Grace doesn’t implement it. In fact, I don’t think the base
-mcpu=neoverse-v2 should include it either (I’m testing a patch to remove it).
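
For anyone who does rely on the RNG instructions, they can still be enabled
on top of the new entry with the usual feature-modifier syntax, e.g.
-mcpu=grace+rng (a hypothetical usage example, not something this patch
changes).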

Thanks,
Kyrill


> 
> Thanks,
> Andrew Pinski
> 
> 
>> I have patches tested for the 14, 13, 12, 11 branches as well that I’d like 
>> to push there to make it simpler for our users to target Grace.
>> They are the same as this one logically, but they just account for slight 
>> syntactic differences and flag definitions that have happened since those 
>> branches.
>> Thanks,
>> Kyrill
>> 
>>* config/aarch64/aarch64-cores.def (grace): New entry.
>>* config/aarch64/aarch64-tune.md: Regenerate
>>* doc/invoke.texi (AArch64 Options): Document the above.
>> 
>> Signed-off-by: Kyrylo Tkachov 




Re: [PATCH] Fix wrong cost of MEM when addr is a lea.

2024-06-26 Thread Hongtao Liu
On Wed, Jun 26, 2024 at 2:52 PM Richard Biener
 wrote:
>
> On Wed, Jun 26, 2024 at 8:09 AM liuhongt  wrote:
> >
> > 416.gamess regressed 4-6% on x86_64 since my r15-882-g1d6199e5f8c1c0.
> > The commit adjusts rtx_cost of mem to reduce the cost of (add op0 disp).
> > But the cost of ADDR could be cheaper than XEXP (addr, 0) when it's a lea.
> > It is the case in the PR; the patch uses the lower cost to enable more
> > simplification and fix the regression.
> >
> > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> > PR target/115462
> > * config/i386/i386.cc (ix86_rtx_costs): Use cost of addr when
> > it's lower than rtx_cost (XEXP (addr, 0)) + 1.
> >
> > gcc/testsuite/ChangeLog:
> > * gcc.target/i386/pr115462.c: New test.
> > ---
> >  gcc/config/i386/i386.cc  |  9 +++--
> >  gcc/testsuite/gcc.target/i386/pr115462.c | 22 ++
> >  2 files changed, 29 insertions(+), 2 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr115462.c
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index d4ccc24be6e..83dab8220dd 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -22341,8 +22341,13 @@ ix86_rtx_costs (rtx x, machine_mode mode, int 
> > outer_code_i, int opno,
> >   if (GET_CODE (addr) == PLUS
> >   && x86_64_immediate_operand (XEXP (addr, 1), Pmode))
> > {
> > - *total += 1;
> > - *total += rtx_cost (XEXP (addr, 0), Pmode, PLUS, 0, speed);
> > + /* PR115462: Cost of ADDR could be cheaper than XEXP (addr, 0)
> > +when it's a lea, use lower cost to enable more
> > +simplification.  */
> > + unsigned cost1 = rtx_cost (addr, Pmode, MEM, 0, speed);
> > + unsigned cost2 = rtx_cost (XEXP (addr, 0), Pmode,
> > +PLUS, 0, speed) + 1;
>
> Just as comment - this is a bit ugly, why would we not always use the
> address cost?  (and why are you using 'MEM'?)  Should this be better
> handled on the insn_cost level when it's clear the PLUS is separate address
> calculation (LEA) rather than address calculation in a MEM context?
 For MEM, rtx_cost doesn't use address_cost but iterates over each subrtx
and adds up the costs.
 So for MEM (reg) and MEM (reg + 4), the former costs 5 and the latter
costs 9, which is not accurate for x86.
 Ideally address_cost should be used, but it reduces the cost too
much (to a range of 1-3).
(I've tried that; it regressed many testcases, because too many
registers are propagated into addr, which increases register pressure.)
 So the current solution is to make the constant disp as cheap as possible
so more constants can be propagated into the address (but not
registers).

>
> > + *total += MIN (cost1, cost2);
> >   return true;
> > }
> > }
> > diff --git a/gcc/testsuite/gcc.target/i386/pr115462.c 
> > b/gcc/testsuite/gcc.target/i386/pr115462.c
> > new file mode 100644
> > index 000..ad50a6382bc
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr115462.c
> > @@ -0,0 +1,22 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -mavx2 -fno-tree-vectorize -fno-pic" } */
> > +/* { dg-final { scan-assembler-times {(?n)movl[ \t]+.*, p1\.0\+[0-9]*\(,} 
> > 3 } } */
> > +
> > +int
> > +foo (long indx, long indx2, long indx3, long indx4, long indx5, long 
> > indx6, long n, int* q)
> > +{
> > +  static int p1[1];
> > +  int* p2 = p1 + 1000;
> > +  int* p3 = p1 + 4000;
> > +  int* p4 = p1 + 8000;
> > +
> > +  for (long i = 0; i != n; i++)
> > +{
> > > +  /* scan for  movl %edi, p1.0+3996(,%rax,4),
> > > +p1.0+3996 should be propagated into the loop.  */
> > +  p2[indx++] = q[indx++];
> > +  p3[indx2++] = q[indx2++];
> > +  p4[indx3++] = q[indx3++];
> > +}
> > +  return p1[indx6] + p1[indx5];
> > +}
> > --
> > 2.31.1
> >



-- 
BR,
Hongtao


[PATCH] aarch64: Add support for -mcpu=grace

2024-06-26 Thread Kyrylo Tkachov
Hi all,

This adds support for the NVIDIA Grace CPU to aarch64.
We reuse the tuning decisions for the Neoverse V2 core, but include a
number of architecture features that are not enabled by default in
-mcpu=neoverse-v2.

This allows Grace users to more simply target the CPU with -mcpu=grace
rather than remembering what extensions to tag on top of
-mcpu=neoverse-v2.

Bootstrapped and tested on aarch64-none-linux-gnu.
I’m pushing this to trunk.
I have patches tested for the 14, 13, 12, 11 branches as well that I’d like to 
push there to make it simpler for our users to target Grace.
They are the same as this one logically, but they just account for slight 
syntactic differences and flag definitions that have happened since those 
branches.
Thanks,
Kyrill

* config/aarch64/aarch64-cores.def (grace): New entry.  
* config/aarch64/aarch64-tune.md: Regenerate
* doc/invoke.texi (AArch64 Options): Document the above.

Signed-off-by: Kyrylo Tkachov 





Re: [PATCH] i386: Remove declaration of unused functions

2024-06-26 Thread Christophe Lyon
On Wed, 26 Jun 2024 at 01:27, Iain Sandoe  wrote:
>
>
>
> > On 25 Jun 2024, at 22:59, Evgeny Karpov  wrote:
> >
> > The patch fixes the issue introduced in
> > https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=63512c72df09b43d56ac7680cdfd57a66d40c636
> > and reported at
> > https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655599.html .
>
> Trivial patches like this that fix bootstrap on multiple targets can be 
> applied without extra approval,
> this fixes bootstrap for x86 Darwin, so OK
> Iain
>
I've just pushed the patch on Evgeny's behalf.

Thanks,

Christophe

> >
> > Regards,
> > Evgeny
> >
> >
> > The patch fixes the issue with compilation on x86_64-gnu-linux
> > when warnings for unused functions are treated as errors.
> >
> > gcc/ChangeLog:
> >
> >   * config/i386/i386.cc (legitimize_dllimport_symbol): Remove unused
> >   functions.
> >   (legitimize_pe_coff_extern_decl): Likewise.
> > ---
> > gcc/config/i386/i386.cc | 2 --
> > 1 file changed, 2 deletions(-)
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index aee88b08ae9..6d6a478f6f5 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -104,8 +104,6 @@ along with GCC; see the file COPYING3.  If not see
> > /* This file should be included last.  */
> > #include "target-def.h"
> >
> > -static rtx legitimize_dllimport_symbol (rtx, bool);
> > -static rtx legitimize_pe_coff_extern_decl (rtx, bool);
> > static void ix86_print_operand_address_as (FILE *, rtx, addr_space_t, bool);
> > static void ix86_emit_restore_reg_using_pop (rtx, bool = false);
> >
> > --
> > 2.25.1
> >
>


Ping: [PATCH v2] LoongArch: Tweak IOR rtx_cost for bstrins

2024-06-26 Thread Xi Ruoyao
Ping.

On Sun, 2024-06-16 at 01:50 +0800, Xi Ruoyao wrote:
> Consider
> 
>     c &= 0xfff;
>     a &= ~0xfff;
>     b &= ~0xfff;
>     a |= c;
>     b |= c;
> 
> This can be done with 2 bstrins instructions.  But we need to recognize
> it in loongarch_rtx_costs or the compiler will not propagate "c & 0xfff"
> forward.
> 
> gcc/ChangeLog:
> 
> 	* config/loongarch/loongarch.cc
> 	(loongarch_use_bstrins_for_ior_with_mask): Split the main logic
> 	into ...
> 	(loongarch_use_bstrins_for_ior_with_mask_1): ... here.
> 	(loongarch_rtx_costs): Special case IORs that can be
> 	implemented with bstrins.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/loongarch/bstrins-3.c: New test.
> ---
> 
> Bootstrapped and regtested on loongarch64-linux-gnu.  Ok for trunk?
> 
>  gcc/config/loongarch/loongarch.cc     | 73 ++-
>  .../gcc.target/loongarch/bstrins-3.c  | 16 
>  2 files changed, 72 insertions(+), 17 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/loongarch/bstrins-3.c
> 
> diff --git a/gcc/config/loongarch/loongarch.cc b/gcc/config/loongarch/loongarch.cc
> index 6ec3ee62502..256b76d044b 100644
> --- a/gcc/config/loongarch/loongarch.cc
> +++ b/gcc/config/loongarch/loongarch.cc
> @@ -3681,6 +3681,27 @@ loongarch_set_reg_reg_piece_cost (machine_mode mode, unsigned int units)
>    return COSTS_N_INSNS ((GET_MODE_SIZE (mode) + units - 1) / units);
>  }
>  
> +static int
> +loongarch_use_bstrins_for_ior_with_mask_1 (machine_mode mode,
> +					   unsigned HOST_WIDE_INT mask1,
> +					   unsigned HOST_WIDE_INT mask2)
> +{
> +  if (mask1 != ~mask2 || !mask1 || !mask2)
> +    return 0;
> +
> +  /* Try to avoid a right-shift.  */
> +  if (low_bitmask_len (mode, mask1) != -1)
> +    return -1;
> +
> +  if (low_bitmask_len (mode, mask2 >> (ffs_hwi (mask2) - 1)) != -1)
> +    return 1;
> +
> +  if (low_bitmask_len (mode, mask1 >> (ffs_hwi (mask1) - 1)) != -1)
> +    return -1;
> +
> +  return 0;
> +}
> +
>  /* Return the cost of moving between two registers of mode MODE.  */
>  
>  static int
> @@ -3812,6 +3833,38 @@ loongarch_rtx_costs (rtx x, machine_mode mode, int outer_code,
>        /* Fall through.  */
>  
>      case IOR:
> +      {
> +	rtx op[2] = {XEXP (x, 0), XEXP (x, 1)};
> +	if (GET_CODE (op[0]) == AND && GET_CODE (op[1]) == AND
> +	    && (mode == SImode || (TARGET_64BIT && mode == DImode)))
> +	  {
> +	    rtx rtx_mask0 = XEXP (op[0], 1), rtx_mask1 = XEXP (op[1], 1);
> +	    if (CONST_INT_P (rtx_mask0) && CONST_INT_P (rtx_mask1))
> +	      {
> +		unsigned HOST_WIDE_INT mask0 = UINTVAL (rtx_mask0);
> +		unsigned HOST_WIDE_INT mask1 = UINTVAL (rtx_mask1);
> +		if (loongarch_use_bstrins_for_ior_with_mask_1 (mode,
> +							       mask0,
> +							       mask1))
> +		  {
> +		    /* A bstrins instruction */
> +		    *total = COSTS_N_INSNS (1);
> +
> +		    /* A srai instruction */
> +		    if (low_bitmask_len (mode, mask0) == -1
> +			&& low_bitmask_len (mode, mask1) == -1)
> +		      *total += COSTS_N_INSNS (1);
> +
> +		    for (int i = 0; i < 2; i++)
> +		      *total += set_src_cost (XEXP (op[i], 0), mode, speed);
> +
> +		    return true;
> +		  }
> +	      }
> +	  }
> +      }
> +
> +      /* Fall through.  */
>      case XOR:
>        /* Double-word operations use two single-word operations.  */
>        *total = loongarch_binary_cost (x, COSTS_N_INSNS (1), COSTS_N_INSNS (2),
> @@ -5796,23 +5849,9 @@ bool loongarch_pre_reload_split (void)
>  int
>  loongarch_use_bstrins_for_ior_with_mask (machine_mode mode, rtx *op)
>  {
> -  unsigned HOST_WIDE_INT mask1 = UINTVAL (op[2]);
> -  unsigned HOST_WIDE_INT mask2 = UINTVAL (op[4]);
> -
> -  if (mask1 != ~mask2 || !mask1 || !mask2)
> -    return 0;
> -
> -  /* Try to avoid a right-shift.  */
> -  if (low_bitmask_len (mode, mask1) != -1)
> -    return -1;
> -
> -  if (low_bitmask_len (mode, mask2 >> (ffs_hwi (mask2) - 1)) != -1)
> -    return 1;
> -
> -  if (low_bitmask_len (mode, mask1 >> (ffs_hwi (mask1) - 1)) != -1)
> -    return -1;
> -
> -  return 0;
> +  return loongarch_use_bstrins_for_ior_with_mask_1 (mode,
> +						    UINTVAL (op[2]),
> +						    UINTVAL (op[4]));
>  }
>  
>  /* Rewrite a MEM for simple load/store under -mexplicit-relocs=auto
> diff --git a/gcc/testsuite/gcc.target/loongarch/bstrins-3.c b/gcc/testsuite/gcc.target/loongarch/bstrins-3.c
> new file mode 100644
> index 000..13762bdef42
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/loongarch/bstrins-3.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +

Ping: [PATCH] LoongArch: Only transform move/move/bstrins to srai/bstrins when -Os

2024-06-26 Thread Xi Ruoyao
Ping.

On Sat, 2024-06-15 at 21:47 +0800, Xi Ruoyao wrote:
> The first form has a lower latency (due to the special handling of
> "move" in LA464 and LA664) despite it's longer.
> 
> gcc/ChangeLog:
> 
>   * config/loongarch/loongarch.md (define_peephole2): Require
>   optimize_insn_for_size_p () for move/move/bstrins =>
>   srai/bstrins transform.
> ---
> 
> Bootstrapped and regtested on loongarch64-linux-gnu.  Ok for trunk?
> 
>  gcc/config/loongarch/loongarch.md | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/gcc/config/loongarch/loongarch.md b/gcc/config/loongarch/loongarch.md
> index 25c1d323ba0..e4434c3bd4e 100644
> --- a/gcc/config/loongarch/loongarch.md
> +++ b/gcc/config/loongarch/loongarch.md
> @@ -1617,20 +1617,23 @@ (define_insn_and_split "*bstrins_<mode>_for_ior_mask"
>    })
>  
>  ;; We always avoid the shift operation in bstrins_<mode>_for_ior_mask
> -;; if possible, but the result may be sub-optimal when one of the masks
> +;; if possible, but the result may be larger when one of the masks
>  ;; is (1 << N) - 1 and one of the src register is the dest register.
>  ;; For example:
>  ;; move       t0, a0
>  ;; move       a0, a1
>  ;; bstrins.d  a0, t0, 42, 0
>  ;; ret
> -;; using a shift operation would be better:
> +;; using a shift operation would be smaller:
>  ;; srai.d     t0, a1, 43
>  ;; bstrins.d  a0, t0, 63, 43
>  ;; ret
>  ;; unfortunately we cannot figure it out in split1: before reload we cannot
>  ;; know if the dest register is one of the src register.  Fix it up in
>  ;; peephole2.
> +;;
> +;; Note that the first form has a lower latency so this should only be
> +;; done when optimizing for size.
>  (define_peephole2
>    [(set (match_operand:GPR 0 "register_operand")
> 	(match_operand:GPR 1 "register_operand"))
> @@ -1639,7 +1642,7 @@ (define_peephole2
> 	   (match_operand:SI 3 "const_int_operand")
> 	   (const_int 0))
> 	(match_dup 0))]
> -  "peep2_reg_dead_p (3, operands[0])"
> +  "peep2_reg_dead_p (3, operands[0]) && optimize_insn_for_size_p ()"
>    [(const_int 0)]
>    {
>      int len = GET_MODE_BITSIZE (<MODE>mode) - INTVAL (operands[3]);

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University


Re: [PATCH] Fix wrong cost of MEM when addr is a lea.

2024-06-26 Thread Richard Biener
On Wed, Jun 26, 2024 at 9:14 AM Hongtao Liu  wrote:
>
> On Wed, Jun 26, 2024 at 2:52 PM Richard Biener
>  wrote:
> >
> > On Wed, Jun 26, 2024 at 8:09 AM liuhongt  wrote:
> > >
> > > 416.gamess regressed 4-6% on x86_64 since my r15-882-g1d6199e5f8c1c0.
> > > The commit adjusts rtx_cost of mem to reduce the cost of (add op0 disp).
> > > But the cost of ADDR could be cheaper than XEXP (addr, 0) when it's a lea.
> > > It is the case in the PR; the patch uses the lower cost to enable more
> > > simplification and fix the regression.
> > >
> > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > Ok for trunk?
> > >
> > > gcc/ChangeLog:
> > >
> > > PR target/115462
> > > * config/i386/i386.cc (ix86_rtx_costs): Use cost of addr when
> > > it's lower than rtx_cost (XEXP (addr, 0)) + 1.
> > >
> > > gcc/testsuite/ChangeLog:
> > > * gcc.target/i386/pr115462.c: New test.
> > > ---
> > >  gcc/config/i386/i386.cc  |  9 +++--
> > >  gcc/testsuite/gcc.target/i386/pr115462.c | 22 ++
> > >  2 files changed, 29 insertions(+), 2 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr115462.c
> > >
> > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > index d4ccc24be6e..83dab8220dd 100644
> > > --- a/gcc/config/i386/i386.cc
> > > +++ b/gcc/config/i386/i386.cc
> > > @@ -22341,8 +22341,13 @@ ix86_rtx_costs (rtx x, machine_mode mode, int 
> > > outer_code_i, int opno,
> > >   if (GET_CODE (addr) == PLUS
> > >   && x86_64_immediate_operand (XEXP (addr, 1), Pmode))
> > > {
> > > - *total += 1;
> > > - *total += rtx_cost (XEXP (addr, 0), Pmode, PLUS, 0, speed);
> > > + /* PR115462: Cost of ADDR could be cheaper than XEXP (addr, 
> > > 0)
> > > +when it's a lea, use lower cost to enable more
> > > +simplification.  */
> > > + unsigned cost1 = rtx_cost (addr, Pmode, MEM, 0, speed);
> > > + unsigned cost2 = rtx_cost (XEXP (addr, 0), Pmode,
> > > +PLUS, 0, speed) + 1;
> >
> > Just as comment - this is a bit ugly, why would we not always use the
> > address cost?  (and why are you using 'MEM'?)  Should this be better
> > handled on the insn_cost level when it's clear the PLUS is separate address
> > calculation (LEA) rather than address calculation in a MEM context?
>  For MEM, rtx_cost doesn't use address_cost but iterates over each subrtx
> and adds up the costs.
>  So for MEM (reg) and MEM (reg + 4), the former costs 5 and the latter
> costs 9, which is not accurate for x86.

But rtx_cost invokes targetm.rtx_cost which allows to avoid that
recursive processing at any level.  You're dealing with MEM [addr]
here, so why's rtx_cost (addr, Pmode, MEM, 0, speed) not always
the best way to deal with this?  Since this is the MEM [addr] case
we know it's not LEA, no?

>  Ideally address_cost should be used, but it reduces the cost too
> much (to a range of 1-3).
> (I've tried that; it regressed many testcases, because too many
> registers are propagated into addr, which increases register pressure.)
>  So the current solution is to make the constant disp as cheap as possible
> so more constants can be propagated into the address (but not
> registers).
>
> >
> > > + *total += MIN (cost1, cost2);
> > >   return true;
> > > }
> > > }
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr115462.c 
> > > b/gcc/testsuite/gcc.target/i386/pr115462.c
> > > new file mode 100644
> > > index 000..ad50a6382bc
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr115462.c
> > > @@ -0,0 +1,22 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -mavx2 -fno-tree-vectorize -fno-pic" } */
> > > +/* { dg-final { scan-assembler-times {(?n)movl[ \t]+.*, 
> > > p1\.0\+[0-9]*\(,} 3 } } */
> > > +
> > > +int
> > > +foo (long indx, long indx2, long indx3, long indx4, long indx5, long 
> > > indx6, long n, int* q)
> > > +{
> > > +  static int p1[1];
> > > +  int* p2 = p1 + 1000;
> > > +  int* p3 = p1 + 4000;
> > > +  int* p4 = p1 + 8000;
> > > +
> > > +  for (long i = 0; i != n; i++)
> > > +{
> > > +  /* scan for  movl %edi, p1.0+3996(,%rax,4),
> > > +p1.0+3996 should be propagated into the loop.  */
> > > +  p2[indx++] = q[indx++];
> > > +  p3[indx2++] = q[indx2++];
> > > +  p4[indx3++] = q[indx3++];
> > > +}
> > > +  return p1[indx6] + p1[indx5];
> > > +}
> > > --
> > > 2.31.1
> > >
>
>
>
> --
> BR,
> Hongtao


Re: [PATCH] Fix wrong cost of MEM when addr is a lea.

2024-06-26 Thread Hongtao Liu
On Wed, Jun 26, 2024 at 4:02 PM Richard Biener
 wrote:
>
> On Wed, Jun 26, 2024 at 9:14 AM Hongtao Liu  wrote:
> >
> > On Wed, Jun 26, 2024 at 2:52 PM Richard Biener
> >  wrote:
> > >
> > > On Wed, Jun 26, 2024 at 8:09 AM liuhongt  wrote:
> > > >
> > > > 416.gamess regressed 4-6% on x86_64 since my r15-882-g1d6199e5f8c1c0.
> > > > The commit adjusts rtx_cost of mem to reduce the cost of (add op0 disp).
> > > > But the cost of ADDR could be cheaper than XEXP (addr, 0) when it's a lea.
> > > > It is the case in the PR; the patch uses the lower cost to enable more
> > > > simplification and fix the regression.
> > > >
> > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > > Ok for trunk?
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > PR target/115462
> > > > * config/i386/i386.cc (ix86_rtx_costs): Use cost of addr when
> > > > it's lower than rtx_cost (XEXP (addr, 0)) + 1.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > > * gcc.target/i386/pr115462.c: New test.
> > > > ---
> > > >  gcc/config/i386/i386.cc  |  9 +++--
> > > >  gcc/testsuite/gcc.target/i386/pr115462.c | 22 ++
> > > >  2 files changed, 29 insertions(+), 2 deletions(-)
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr115462.c
> > > >
> > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > > index d4ccc24be6e..83dab8220dd 100644
> > > > --- a/gcc/config/i386/i386.cc
> > > > +++ b/gcc/config/i386/i386.cc
> > > > @@ -22341,8 +22341,13 @@ ix86_rtx_costs (rtx x, machine_mode mode, int 
> > > > outer_code_i, int opno,
> > > >   if (GET_CODE (addr) == PLUS
> > > >   && x86_64_immediate_operand (XEXP (addr, 1), Pmode))
> > > > {
> > > > - *total += 1;
> > > > - *total += rtx_cost (XEXP (addr, 0), Pmode, PLUS, 0, 
> > > > speed);
> > > > + /* PR115462: Cost of ADDR could be cheaper than XEXP 
> > > > (addr, 0)
> > > > +when it's a lea, use lower cost to enable more
> > > > +simplification.  */
> > > > + unsigned cost1 = rtx_cost (addr, Pmode, MEM, 0, speed);
> > > > + unsigned cost2 = rtx_cost (XEXP (addr, 0), Pmode,
> > > > +PLUS, 0, speed) + 1;
> > >
> > > Just as comment - this is a bit ugly, why would we not always use the
> > > address cost?  (and why are you using 'MEM'?)  Should this be better
> > > handled on the insn_cost level when it's clear the PLUS is separate 
> > > address
> > > calculation (LEA) rather than address calculation in a MEM context?
> >  For MEM, rtx_cost doesn't use address_cost but iterates over each subrtx
> > and adds up the costs.
> >  So for MEM (reg) and MEM (reg + 4), the former costs 5 and the latter
> > costs 9, which is not accurate for x86.
>
> But rtx_cost invokes targetm.rtx_cost which allows to avoid that
> recursive processing at any level.  You're dealing with MEM [addr]
> here, so why's rtx_cost (addr, Pmode, MEM, 0, speed) not always
Because when addr is just (plus reg const_int), rtx_cost (addr, Pmode,
MEM, 0, speed) returns 4.
But I think it should be equal to addr (reg), which has 0 cost.
> the best way to deal with this?  Since this is the MEM [addr] case
> we know it's not LEA, no?
Maybe the only case I need to handle is reg + disp; otherwise, they're
all lea forms.
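To spell out the address shapes in question (a rough sketch in the usual
RTL notation; the costs shown are the intended ones, not what rtx_cost
returns today):

  (mem (reg))                       ;; plain register address, ~0
  (mem (plus (reg) (const_int 4)))  ;; reg + disp, same latency on x86, want ~0
  (mem (plus (mult (reg) (const_int 4)) (reg)))  ;; scaled index, lea-style

so only the middle form needs the special-casing discussed here.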
>
> >  Ideally address_cost should be used, but it reduces the cost too
> > much (to a range of 1-3).
> > (I've tried that; it regressed many testcases, because too many
> > registers are propagated into addr, which increases register pressure.)
> >  So the current solution is to make the constant disp as cheap as possible
> > so more constants can be propagated into the address (but not
> > registers).
> >
> > >
> > > > + *total += MIN (cost1, cost2);
> > > >   return true;
> > > > }
> > > > }
> > > > diff --git a/gcc/testsuite/gcc.target/i386/pr115462.c 
> > > > b/gcc/testsuite/gcc.target/i386/pr115462.c
> > > > new file mode 100644
> > > > index 000..ad50a6382bc
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/pr115462.c
> > > > @@ -0,0 +1,22 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O2 -mavx2 -fno-tree-vectorize -fno-pic" } */
> > > > +/* { dg-final { scan-assembler-times {(?n)movl[ \t]+.*, 
> > > > p1\.0\+[0-9]*\(,} 3 } } */
> > > > +
> > > > +int
> > > > +foo (long indx, long indx2, long indx3, long indx4, long indx5, long 
> > > > indx6, long n, int* q)
> > > > +{
> > > > +  static int p1[1];
> > > > +  int* p2 = p1 + 1000;
> > > > +  int* p3 = p1 + 4000;
> > > > +  int* p4 = p1 + 8000;
> > > > +
> > > > +  for (long i = 0; i != n; i++)
> > > > +{
> > > > +  /* scan for  movl%edi, p1.0+3996(,%rax,4),
> > > > +p1.0+3996 should be propagted into the loop.  */
> > > > +  p2[indx++] = q[indx++];
> > > > +  p3[indx2++] = q[indx2++];
> > > > +  p4[indx3++] = q[indx3++];
> > > > + 

Re: [PATCH] middle-end/114604 - ranger allocates bitmap without initialized obstack

2024-06-26 Thread Aldy Hernandez



On 6/20/24 4:36 PM, Richard Biener wrote:




Am 20.06.2024 um 16:05 schrieb Andrew MacLeod :



On 6/20/24 05:31, Richard Biener wrote:

On Thu, 20 Jun 2024, Aldy Hernandez wrote:

Hi.

I came around to this, and whipped up the proposed patch.  However, it
does seem a bit verbose, and I'm wondering if it's cleaner to just
leave things as they are.

The attached patch passes tests and there's no difference in
performance.  I am wondering, whether it's better to get rid of
all/most of the local obstacks we use in ranger, and just use the
global (NULL) one?

Thoughts?

It really depends on how much garbage ranger is expected to create
on the obstack - the global obstack is released after each pass.
But ranger instances are also not expected to be created multiple
times each pass, right?



Typically correct.  Although the path ranger also creates a normal ranger.
Different components also have their own obstacks, mostly because they can be
used independently of ranger. I didn't want to add artificial dependencies just
for obstack sharing.

I was unaware of how the global one worked at that point. Do they get stacked
if another global obstack is initialized?  And is there any danger that in that
case we could accidentally have a sequence like the following (sketched in code
below):

   obstack1 created by ranger
   GORI allocates bitmap from obstack1
   obstack2 created by the pass that decided to use ranger
   GORI allocates bitmap2 .. comes from obstack2
   obstack2 destroyed by the pass
   GORI tries to use bitmap2 .. it's now unallocated
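
Concretely, in terms of GCC's bitmap API, the bad sequence would be
something like this (a minimal sketch; the pass structure is hypothetical):

  bitmap_obstack obstack2;
  bitmap_obstack_initialize (&obstack2);      /* pass-local obstack */
  bitmap bitmap2 = BITMAP_ALLOC (&obstack2);  /* GORI's bitmap2 lands here */
  bitmap_obstack_release (&obstack2);         /* the pass tears it down */
  bitmap_set_bit (bitmap2, 1);                /* use-after-free: bitmap2 is gone */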

If so, this reeks of the obstack problems we had back in the late 90's when
obstacks were generally stacked.  Tracking down objects still in use from freed
obstacks was a nightmare.  That is one of the reasons general stacked obstacks
fell out of favour for a long time, and why I only ever use local named ones.

  It seems to me that components managing their own obstacks ensures this does 
not happen.

If, however, that is not longer a problem for some reason, then I have no 
strong feelings either way either.


The global obstack is special: its init keeps a reference count.  So yes, a
local obstack is cleaner.
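
(As a sketch of that refcounting, slightly simplified from bitmap.cc:

  bitmap_obstack_initialize (NULL);  /* depth 1: default obstack created */
  bitmap_obstack_initialize (NULL);  /* depth 2: just bumps the counter */
  bitmap_obstack_release (NULL);     /* back to depth 1: still usable */
  bitmap_obstack_release (NULL);     /* depth 0: actually torn down */

so a release only frees the space once the count drops back to zero.)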


Ok, since a local obstack is cleaner and a global one has the potential 
to introduce subtle bugs, I have rebased the patch against current 
mainline and will commit the attached if it passes tests.


Thanks for everyone's feedback.
Aldy

From cd7b03ba43a74ae808a3005ff0e66cd8fabdaea3 Mon Sep 17 00:00:00 2001
From: Aldy Hernandez 
Date: Wed, 19 Jun 2024 11:42:16 +0200
Subject: [PATCH] Avoid global bitmap space in ranger.

gcc/ChangeLog:

	* gimple-range-cache.cc (update_list): Add m_bitmaps.
	(update_list::update_list): Initialize m_bitmaps.
	(update_list::~update_list): Release m_bitmaps.
	* gimple-range-cache.h (ssa_lazy_cache): Add m_bitmaps.
	* gimple-range.cc (enable_ranger): Remove global bitmap
	initialization.
	(disable_ranger): Remove global bitmap release.
---
 gcc/gimple-range-cache.cc | 6 --
 gcc/gimple-range-cache.h  | 9 +++--
 gcc/gimple-range.cc   | 4 
 3 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/gcc/gimple-range-cache.cc b/gcc/gimple-range-cache.cc
index d84fd1ca0e8..6979a14cbaa 100644
--- a/gcc/gimple-range-cache.cc
+++ b/gcc/gimple-range-cache.cc
@@ -906,6 +906,7 @@ private:
   vec m_update_list;
   int m_update_head;
   bitmap m_propfail;
+  bitmap_obstack m_bitmaps;
 };
 
 // Create an update list.
@@ -915,7 +916,8 @@ update_list::update_list ()
   m_update_list.create (0);
   m_update_list.safe_grow_cleared (last_basic_block_for_fn (cfun) + 64);
   m_update_head = -1;
-  m_propfail = BITMAP_ALLOC (NULL);
+  bitmap_obstack_initialize (&m_bitmaps);
+  m_propfail = BITMAP_ALLOC (&m_bitmaps);
 }
 
 // Destroy an update list.
@@ -923,7 +925,7 @@ update_list::update_list ()
 update_list::~update_list ()
 {
   m_update_list.release ();
-  BITMAP_FREE (m_propfail);
+  bitmap_obstack_release (&m_bitmaps);
 }
 
 // Add BB to the list of blocks to update, unless it's already in the list.
diff --git a/gcc/gimple-range-cache.h b/gcc/gimple-range-cache.h
index 63410d5437e..0ea34d3f686 100644
--- a/gcc/gimple-range-cache.h
+++ b/gcc/gimple-range-cache.h
@@ -78,8 +78,12 @@ protected:
 class ssa_lazy_cache : public ssa_cache
 {
 public:
-  inline ssa_lazy_cache () { active_p = BITMAP_ALLOC (NULL); }
-  inline ~ssa_lazy_cache () { BITMAP_FREE (active_p); }
+  inline ssa_lazy_cache ()
+  {
+bitmap_obstack_initialize (&m_bitmaps);
+active_p = BITMAP_ALLOC (&m_bitmaps);
+  }
+  inline ~ssa_lazy_cache () { bitmap_obstack_release (&m_bitmaps); }
   inline bool empty_p () const { return bitmap_empty_p (active_p); }
   virtual bool has_range (tree name) const;
   virtual bool set_range (tree name, const vrange &r);
@@ -89,6 +93,7 @@ public:
   virtual void clear ();
   void merge (const ssa_lazy_cache &);
 protected:
+  bitmap_obstack m_bitmaps;
   bitmap active_p;
 };
 
diff --git a/gcc/gimple-range.cc b/gcc/gimple-range.cc
index 50448ef81a2..5df649e268c 100644
--- a/gcc/gimple-r

PR target/115618: can we back port the fix to GCC 13?

2024-06-26 Thread Kyrylo Tkachov
Hi Andrew,

I’ve tested the fix for PR 115618 from your commit r14-6612-g8d30107455f230 on 
the GCC 13 branch.
I’d like to back port it to that branch.
Is there any problem with that I should be aware of?
It applies cleanly and tests fine.

Thanks,
Kyrill

Re: Ping: [PATCH] LoongArch: Only transform move/move/bstrins to srai/bstrins when -Os

2024-06-26 Thread Lulu Cheng





  ;; We always avoid the shift operation in bstrins_<mode>_for_ior_mask
-;; if possible, but the result may be sub-optimal when one of the masks
+;; if possible, but the result may be larger when one of the masks
  ;; is (1 << N) - 1 and one of the src register is the dest register.
  ;; For example:
  ;; move       t0, a0
  ;; move       a0, a1
  ;; bstrins.d  a0, t0, 42, 0
  ;; ret
-;; using a shift operation would be better:
+;; using a shift operation would be smaller:
  ;; srai.d     t0, a1, 43
  ;; bstrins.d  a0, t0, 63, 43
  ;; ret
  ;; unfortunately we cannot figure it out in split1: before reload we cannot
  ;; know if the dest register is one of the src register.  Fix it up in
  ;; peephole2.
+;;
+;; Note that the first form has a lower latency so this should only be
+;; done when optimizing for size.


The result of my test is that the latency of these two forms is the
same; is there a problem with my test?

  (define_peephole2
    [(set (match_operand:GPR 0 "register_operand")
    (match_operand:GPR 1 "register_operand"))
@@ -1639,7 +1642,7 @@ (define_peephole2
      (match_operand:SI 3 "const_int_operand")
      (const_int 0))
    (match_dup 0))]
-  "peep2_reg_dead_p (3, operands[0])"
+  "peep2_reg_dead_p (3, operands[0]) && optimize_insn_for_size_p ()"
    [(const_int 0)]
    {
  int len = GET_MODE_BITSIZE (<MODE>mode) - INTVAL (operands[3]);




Re: PING: Re: [PATCH] selftest: invoke "diff" when ASSERT_STREQ fails

2024-06-26 Thread Eric Gallager
On Wed, May 29, 2024 at 5:06 PM David Malcolm  wrote:
>
> On Wed, 2024-05-29 at 16:35 -0400, Eric Gallager wrote:
> > On Tue, May 28, 2024 at 1:21 PM David Malcolm 
> > wrote:
> > >
> > > Ping.
> > >
> > > This patch has actually been *very* helpful to me when debugging
> > > selftest failures involving ASSERT_STREQ.
> > >
> > > Thanks
> > > Dave
> > >
> >
> > Currently `diff` is only listed under the "Tools/packages necessary
> > for modifying GCC" section of install/prerequisites.html:
> > https://gcc.gnu.org/install/prerequisites.html
> > If it's going to become a dependency for actually running GCC, too,
> > it
> > should get moved to be documented elsewhere, IMO.
>
> All this is selftest code, and is turned off in a release configuration
> of GCC.  The code path that invokes "diff" is when a selftest is
> failing, which is immediately before a hard failure of the *build* of
> GCC.  So arguably this is just a build-time thing for people
> packaging/hacking on GCC, and thus not a new dependency for end-usage.
>

Well to be clear, I'm just asking for a documentation update here, so
if you want to use wording that reflects all of that, I think that'd
be fine. It seems like a useful idea overall, so don't let my
documentation request hold you up from proceeding with it.

> BTW I'm a bit hazy on the details of how "pex" is meant to work, so
> hopefully someone more knowledgable than me can comment on that aspect
> of the patch.  It seems to work though.
>

I'm not too clear on it either; maybe one of the libiberty maintainers
can chime in? Or wait, looks like there's just the one currently
(Ian); cc-ing...

> Dave
>
> >
> > > On Fri, 2024-05-17 at 15:51 -0400, David Malcolm wrote:
> > > > Currently when ASSERT_STREQ or ASSERT_STREQ_AT fail we print
> > > > both strings to stderr.  However it can be hard to figure out
> > > > the problem (e.g. for 1-character differences in long strings).
> > > >
> > > > Extend the output by writing out the strings to tempfiles and
> > > > invoking "diff -up" on them when we have such a selftest failure,
> > > > to (I hope) simplify debugging.
> > > >
> > > > Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > >
> > > > OK for trunk?
> > > >
> > > > gcc/ChangeLog:
> > > > * selftest.cc (selftest::print_diff): New function.
> > > > (selftest::assert_streq): Call it when we have non-equal
> > > > non-null strings.
> > > >
> > > > Signed-off-by: David Malcolm 
> > > > ---
> > > >  gcc/selftest.cc | 28 ++--
> > > >  1 file changed, 26 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/gcc/selftest.cc b/gcc/selftest.cc
> > > > index 6438d86a6aa0..f58c0631908e 100644
> > > > --- a/gcc/selftest.cc
> > > > +++ b/gcc/selftest.cc
> > > > @@ -63,6 +63,26 @@ fail_formatted (const location &loc, const
> > > > char
> > > > *fmt, ...)
> > > >abort ();
> > > >  }
> > > >
> > > > +/* Invoke "diff" to print the difference between VAL1 and VAL2
> > > > +   on stdout.  */
> > > > +
> > > > +static void
> > > > +print_diff (const location &loc, const char *val1, const char
> > > > *val2)
> > > > +{
> > > > +  temp_source_file tmpfile1 (loc, ".txt", val1);
> > > > +  temp_source_file tmpfile2 (loc, ".txt", val2);
> > > > +  const char *args[] = {"diff",
> > > > +   "-up",
> > > > +   tmpfile1.get_filename (),
> > > > +   tmpfile2.get_filename (),
> > > > +   NULL};
> > > > +  int exit_status = 0;
> > > > +  int err = 0;
> > > > +  pex_one (PEX_SEARCH | PEX_LAST,
> > > > +  args[0], CONST_CAST (char **, args),
> > > > +  NULL, NULL, NULL, &exit_status, &err);
> > > > +}
> > > > +
> > > >  /* Implementation detail of ASSERT_STREQ.
> > > > Compare val1 and val2 with strcmp.  They ought
> > > > to be non-NULL; fail gracefully if either or both are NULL.
> > > > */
> > > > @@ -89,8 +109,12 @@ assert_streq (const location &loc,
> > > > if (strcmp (val1, val2) == 0)
> > > >   pass (loc, "ASSERT_STREQ");
> > > > else
> > > > - fail_formatted (loc, "ASSERT_STREQ (%s, %s)\n
> > > > val1=\"%s\"\n
> > > > val2=\"%s\"\n",
> > > > - desc_val1, desc_val2, val1, val2);
> > > > + {
> > > > +   print_diff (loc, val1, val2);
> > > > +   fail_formatted
> > > > + (loc, "ASSERT_STREQ (%s, %s)\n val1=\"%s\"\n
> > > > val2=\"%s\"\n",
> > > > +  desc_val1, desc_val2, val1, val2);
> > > > + }
> > > >}
> > > >  }
> > > >
> > >
> >
>


[PATCH] [libstdc++] [testsuite] defer to check_vect_support* [PR115454]

2024-06-26 Thread Alexandre Oliva


The newly-added testcase overrides the default dg-do action set by
check_vect_support_and_set_flags (in libstdc++-dg/conformance.exp), so
it attempts to run the test even if runtime vector support is not
available.

Remove the explicit dg-do directive, so that the default is honored,
and the test is run if vector support is found, and only compiled
otherwise.
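
To make the mechanics explicit (an illustration, not part of the change):
with

  // { dg-do run { target *-*-* } }

the test is forced to execute everywhere, whereas with no dg-do line the
harness falls back to dg-do-what-default, which
check_vect_support_and_set_flags sets to "run" or "compile" depending on
whether the target can actually execute vector code.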

Tested so far with gcc-13 on ppc64-vx7r2, targeting vector-less
hardware, where it cured the observed regression.  Regstrapping on
x86_64- and ppc64el-linux-gnu just to be sure.  Ok to install?


for  libstdc++-v3/ChangeLog

PR libstdc++/115454
* testsuite/experimental/simd/pr115454_find_last_set.cc: Defer
to check_vect_support_and_set_flags's default dg-do action.
---
 .../experimental/simd/pr115454_find_last_set.cc|1 -
 1 file changed, 1 deletion(-)

diff --git a/libstdc++-v3/testsuite/experimental/simd/pr115454_find_last_set.cc 
b/libstdc++-v3/testsuite/experimental/simd/pr115454_find_last_set.cc
index 25a713b4e948c..4ade8601f272f 100644
--- a/libstdc++-v3/testsuite/experimental/simd/pr115454_find_last_set.cc
+++ b/libstdc++-v3/testsuite/experimental/simd/pr115454_find_last_set.cc
@@ -1,5 +1,4 @@
 // { dg-options "-std=gnu++17" }
-// { dg-do run { target *-*-* } }
 // { dg-require-effective-target c++17 }
 // { dg-additional-options "-march=x86-64-v4" { target avx512f_runtime } }
 // { dg-require-cmath "" }

-- 
Alexandre Oliva, happy hackerhttps://FSFLA.org/blogs/lxo/
   Free Software Activist   GNU Toolchain Engineer
More tolerance and less prejudice are key for inclusion and diversity
Excluding neuro-others for not behaving ""normal"" is *not* inclusive


Re: [PATCH 1/2] Record edge true/false value for gcov

2024-06-26 Thread Jørgen Kvalsvik

On 6/25/24 23:37, Jeff Law wrote:



On 6/25/24 2:04 AM, Jørgen Kvalsvik wrote:

Make gcov aware which edges are the true/false to more accurately
reconstruct the CFG.  There are plenty of bits left in arc_info and it
opens up for richer reporting.

gcc/ChangeLog:

* gcov-io.h (GCOV_ARC_TRUE): New.
(GCOV_ARC_FALSE): New.
* gcov.cc (struct arc_info): Add true_value, false_value.
(read_graph_file): Read true_value, false_value.
* profile.cc (branch_prob): Write GCOV_ARC_TRUE, GCOV_ARC_FALSE.

I thought I'd already acked this patch.

So OK, again :-)

jeff



Thanks! Pushed.


Re: [PATCH 1/3] Release structures on function return

2024-06-26 Thread Jørgen Kvalsvik

On 6/25/24 12:23, Jan Hubicka wrote:

The value vec objects are destroyed on exit, but release still needs to
be called explicitly.

gcc/ChangeLog:

* tree-profile.cc (find_conditions): Release vectors before
  return.

I wonder if you turn
 hash_map, vec> exprs;
to
 hash_map, auto_vec> exprs;
Won't hash_map destructor take care of this by itself?

Honza


I updated this to use auto_vec and pushed it.

Thanks,
Jørgen


---
  gcc/tree-profile.cc | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/gcc/tree-profile.cc b/gcc/tree-profile.cc
index e4bb689cef5..18f48e8d04e 100644
--- a/gcc/tree-profile.cc
+++ b/gcc/tree-profile.cc
@@ -919,6 +919,9 @@ find_conditions (struct function *fn)
  if (!have_post_dom)
free_dominance_info (fn, CDI_POST_DOMINATORS);
  
+for (auto expr : exprs)
+  expr.second.release ();
+
  cov->m_masks.safe_grow_cleared (2 * cov->m_index.last ());
  const size_t length = cov_length (cov);
  for (size_t i = 0; i != length; i++)
--
2.39.2





Re: [PATCH 2/3] Add section on MC/DC in gcov manual

2024-06-26 Thread Jørgen Kvalsvik

On 6/25/24 12:23, Jan Hubicka wrote:

gcc/ChangeLog:

* doc/gcov.texi: Add MC/DC section.

OK,
thanks!


Pushed.

Thanks,
Jørgen


Honza

---
  gcc/doc/gcov.texi | 72 +++
  1 file changed, 72 insertions(+)

diff --git a/gcc/doc/gcov.texi b/gcc/doc/gcov.texi
index dc79bccb8cf..a9221738cce 100644
--- a/gcc/doc/gcov.texi
+++ b/gcc/doc/gcov.texi
@@ -917,6 +917,78 @@ of times the call was executed will be printed.  This will 
usually be
  100%, but may be less for functions that call @code{exit} or @code{longjmp},
  and thus may not return every time they are called.
  
+When you use the @option{-g} option, your output looks like this:

+
+@smallexample
+$ gcov -t -m -g tmp
+-:0:Source:tmp.cpp
+-:0:Graph:tmp.gcno
+-:0:Data:tmp.gcda
+-:0:Runs:1
+-:1:#include <stdio.h>
+-:2:
+-:3:int
+1:4:main (void)
+-:5:@{
+-:6:  int i, total;
+1:7:  total = 0;
+-:8:
+   11:9:  for (i = 0; i < 10; i++)
+condition outcomes covered 2/2
+   10:   10:total += i;
+-:   11:
+   1*:   12:  int v = total > 100 ? 1 : 2;
+condition outcomes covered 1/2
+condition  0 not covered (true)
+-:   13:
+   1*:   14:  if (total != 45 && v == 1)
+condition outcomes covered 1/4
+condition  0 not covered (true)
+condition  1 not covered (true false)
+#:   15:printf ("Failure\n");
+-:   16:  else
+1:   17:printf ("Success\n");
+1:   18:  return 0;
+-:   19:@}
+@end smallexample
+
+For every condition the number of taken and total outcomes are
+printed, and if there are uncovered outcomes a line will be printed
+for each condition showing the uncovered outcome in parentheses.
+Conditions are identified by their index -- index 0 is the left-most
+condition.  In @code{a || (b && c)}, @var{a} is condition 0, @var{b}
+condition 1, and @var{c} condition 2.
+
+An outcome is considered covered if it has an independent effect on
+the decision, also known as masking MC/DC (Modified Condition/Decision
+Coverage).  In this example the decision evaluates to true and @var{a}
+is evaluated, but not covered.  This is because @var{a} cannot affect
+the decision independently -- both @var{a} and @var{b} must change
+value for the decision to change.
+
+@smallexample
+$ gcov -t -m -g tmp
+-:0:Source:tmp.c
+-:0:Graph:tmp.gcno
+-:0:Data:tmp.gcda
+-:0:Runs:1
+-:1:#include <stdio.h>
+-:2:
+1:3:int main()
+-:4:@{
+1:5:  int a = 1;
+1:6:  int b = 0;
+-:7:
+1:8:  if (a && b)
+condition outcomes covered 1/4
+condition  0 not covered (true false)
+condition  1 not covered (true)
+#:9:printf ("Success!\n");
+-:   10:  else
+1:   11:printf ("Failure!\n");
+-:   12:@}
+@end smallexample
+
  The execution counts are cumulative.  If the example program were
  executed again without removing the @file{.gcda} file, the count for the
  number of times each line in the source was executed would be added to
--
2.39.2





Re: [PATCH 3/3] Use the term MC/DC in help for gcov --conditions

2024-06-26 Thread Jørgen Kvalsvik

On 6/25/24 12:25, Jan Hubicka wrote:

Without key terms like "masking" and "MC/DC" it is not at all obvious
what --conditions actually reports on, and there is no easy path for the
user to figure out. By at least including the two key terms MC/DC and
masking users have something to search for.

gcc/ChangeLog:

 * gcov.cc (print_usage): Reference masking MC/DC.


So the main purpose is to point users to the masking MC/DC description in
the manual?  Asking google does not seem to do the trick so far, but
I don't know of better options.

OK,
Thanks


Pushed.

Thanks,
Jørgen


---
  gcc/gcov.cc | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/gcov.cc b/gcc/gcov.cc
index f6787f0be8f..1e2e193d79d 100644
--- a/gcc/gcov.cc
+++ b/gcc/gcov.cc
@@ -1015,7 +1015,7 @@ print_usage (int error_p)
fnotice (file, "  -c, --branch-counts Output counts of branches 
taken\n\
  rather than percentages\n");
fnotice (file, "  -g, --conditionsInclude modified 
condition/decision\n\
-coverage in output\n");
+coverage (masking MC/DC) in output\n");
fnotice (file, "  -d, --display-progress  Display progress 
information\n");
fnotice (file, "  -D, --debug  Display debugging 
dumps\n");
fnotice (file, "  -f, --function-summariesOutput summaries for each 
function\n");
--
2.39.2





[PATCH 1/3 v5] vect: generate suitable convert insn for int -> int, float -> float and int <-> float.

2024-06-26 Thread Hu, Lin1
Hi,

This is the latest version; I modified some comments and retested the patch on
x86-64-linux-gnu. I'll wait another day to see what else Tamar has to say
about the API; if not, I will upstream this patch tomorrow.

BRs,
Lin

gcc/ChangeLog:

PR target/107432
* tree-vect-generic.cc
(expand_vector_conversion): Support convert for int -> int,
float -> float and int <-> float.
* tree-vect-stmts.cc (vectorizable_conversion): Wrap the
indirect convert part.
(supportable_indirect_convert_operation): New function.
* tree-vectorizer.h (supportable_indirect_convert_operation):
Define the new function.

gcc/testsuite/ChangeLog:

PR target/107432
* gcc.target/i386/pr107432-1.c: New test.
* gcc.target/i386/pr107432-2.c: Ditto.
* gcc.target/i386/pr107432-3.c: Ditto.
* gcc.target/i386/pr107432-4.c: Ditto.
* gcc.target/i386/pr107432-5.c: Ditto.
* gcc.target/i386/pr107432-6.c: Ditto.
* gcc.target/i386/pr107432-7.c: Ditto.
---
 gcc/testsuite/gcc.target/i386/pr107432-1.c | 234 
 gcc/testsuite/gcc.target/i386/pr107432-2.c | 105 +
 gcc/testsuite/gcc.target/i386/pr107432-3.c |  55 +
 gcc/testsuite/gcc.target/i386/pr107432-4.c |  56 +
 gcc/testsuite/gcc.target/i386/pr107432-5.c |  72 ++
 gcc/testsuite/gcc.target/i386/pr107432-6.c | 139 
 gcc/testsuite/gcc.target/i386/pr107432-7.c | 150 +
 gcc/tree-vect-generic.cc   |  29 ++-
 gcc/tree-vect-stmts.cc | 241 +
 gcc/tree-vectorizer.h  |   4 +
 10 files changed, 990 insertions(+), 95 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr107432-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr107432-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr107432-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr107432-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr107432-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr107432-6.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr107432-7.c

diff --git a/gcc/testsuite/gcc.target/i386/pr107432-1.c 
b/gcc/testsuite/gcc.target/i386/pr107432-1.c
new file mode 100644
index 000..a4f37447eb4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr107432-1.c
@@ -0,0 +1,234 @@
+/* { dg-do compile } */
+/* { dg-options "-march=x86-64 -mavx512bw -mavx512vl -O3" } */
+/* { dg-final { scan-assembler-times "vpmovqd" 6 } } */
+/* { dg-final { scan-assembler-times "vpmovqw" 6 } } */
+/* { dg-final { scan-assembler-times "vpmovqb" 6 } } */
+/* { dg-final { scan-assembler-times "vpmovdw" 6 { target { ia32 } } } } */
+/* { dg-final { scan-assembler-times "vpmovdw" 8 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "vpmovdb" 6 { target { ia32 } } } } */
+/* { dg-final { scan-assembler-times "vpmovdb" 8 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "vpmovwb" 8 } } */
+
+#include 
+
+typedef short __v2hi __attribute__ ((__vector_size__ (4)));
+typedef char __v2qi __attribute__ ((__vector_size__ (2)));
+typedef char __v4qi __attribute__ ((__vector_size__ (4)));
+typedef char __v8qi __attribute__ ((__vector_size__ (8)));
+
+typedef unsigned short __v2hu __attribute__ ((__vector_size__ (4)));
+typedef unsigned short __v4hu __attribute__ ((__vector_size__ (8)));
+typedef unsigned char __v2qu __attribute__ ((__vector_size__ (2)));
+typedef unsigned char __v4qu __attribute__ ((__vector_size__ (4)));
+typedef unsigned char __v8qu __attribute__ ((__vector_size__ (8)));
+typedef unsigned int __v2su __attribute__ ((__vector_size__ (8)));
+
+__v2si mm_cvtepi64_epi32_builtin_convertvector(__m128i a)
+{
+  return __builtin_convertvector((__v2di)a, __v2si);
+}
+
+__m128i mm256_cvtepi64_epi32_builtin_convertvector(__m256i a)
+{
+  return (__m128i)__builtin_convertvector((__v4di)a, __v4si);
+}
+
+__m256i mm512_cvtepi64_epi32_builtin_convertvector(__m512i a)
+{
+  return (__m256i)__builtin_convertvector((__v8di)a, __v8si);
+}
+
+__v2hi mm_cvtepi64_epi16_builtin_convertvector(__m128i a)
+{
+  return __builtin_convertvector((__v2di)a, __v2hi);
+}
+
+__v4hi mm256_cvtepi64_epi16_builtin_convertvector(__m256i a)
+{
+  return __builtin_convertvector((__v4di)a, __v4hi);
+}
+
+__m128i mm512_cvtepi64_epi16_builtin_convertvector(__m512i a)
+{
+  return (__m128i)__builtin_convertvector((__v8di)a, __v8hi);
+}
+
+__v2qi mm_cvtepi64_epi8_builtin_convertvector(__m128i a)
+{
+  return __builtin_convertvector((__v2di)a, __v2qi);
+}
+
+__v4qi mm256_cvtepi64_epi8_builtin_convertvector(__m256i a)
+{
+  return __builtin_convertvector((__v4di)a, __v4qi);
+}
+
+__v8qi mm512_cvtepi64_epi8_builtin_convertvector(__m512i a)
+{
+  return __builtin_convertvector((__v8di)a, __v8qi);
+}
+
+__v2hi mm64_cvtepi32_epi16_builtin_convertvector(__v2si a)
+{
+  return __builtin_convertvector((__v2si)a, __v2hi);
+}
+
+__v4hi mm_cvtepi32_epi

[PATCH] tree-optimization/115652 - adjust insertion gsi for SLP

2024-06-26 Thread Richard Biener
The following adjusts how SLP computes the insertion location.  In
particular it advances the insert iterator past the found last_stmt.
The vectorizer will later insert stmts _before_ it.  But we also
have the constraint that possibly masked ops may not be scheduled
outside of the loop, and as we do not model the loop mask in the
SLP graph we have to adjust for that.  The following moves this
adjustment to after the advance, since the advance isn't compatible
with it, as the current GIMPLE_COND exception shows.  The PR is about
in-order reduction vectorization, which also isn't happy when that's
the very first stmt.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

PR tree-optimization/115652
* tree-vect-slp.cc (vect_schedule_slp_node): Advance the
iterator based on last_stmt only for vector defs.
---
 gcc/tree-vect-slp.cc | 29 +
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index bb70a3fa5c2..0b12c821cbe 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -9932,16 +9932,6 @@ vect_schedule_slp_node (vec_info *vinfo,
   /* Emit other stmts after the children vectorized defs which is
 earliest possible.  */
   gimple *last_stmt = NULL;
-  if (auto loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
-   if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
-   || LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
- {
-   /* But avoid scheduling internal defs outside of the loop when
-  we might have only implicitly tracked loop mask/len defs.  */
-   gimple_stmt_iterator si
- = gsi_after_labels (LOOP_VINFO_LOOP (loop_vinfo)->header);
-   last_stmt = *si;
- }
   bool seen_vector_def = false;
   FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
if (SLP_TREE_DEF_TYPE (child) == vect_internal_def)
@@ -10050,12 +10040,19 @@ vect_schedule_slp_node (vec_info *vinfo,
   else
{
  si = gsi_for_stmt (last_stmt);
- /* When we're getting gsi_after_labels from the starting
-condition of a fully masked/len loop avoid insertion
-after a GIMPLE_COND that can appear as the only header
-stmt with early break vectorization.  */
- if (gimple_code (last_stmt) != GIMPLE_COND)
-   gsi_next (&si);
+ gsi_next (&si);
+
+ /* Avoid scheduling internal defs outside of the loop when
+we might have only implicitly tracked loop mask/len defs.  */
+ if (auto loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
+   if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+   || LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+ {
+   gimple_stmt_iterator si2
+ = gsi_after_labels (LOOP_VINFO_LOOP (loop_vinfo)->header);
+   if (vect_stmt_dominates_stmt_p (last_stmt, *si2))
+ si = si2;
+ }
}
 }
 
-- 
2.35.3


[PATCH v2] Rearrange SLP nodes with duplicate statements. [PR98138]

2024-06-26 Thread Manolis Tsamis
This change checks when a two_operators SLP node has multiple occurrences of
the same statement (e.g. {A, B, A, B, ...}) and tries to rearrange the operands
so that there are no duplicates. Two vec_perm expressions are then introduced
to recreate the original ordering. These duplicates can appear due to how
two_operators nodes are handled, and they prevent vectorization in some cases.
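
Schematically (letters are placeholder scalar stmts; the lane layout
matches the dumps later in this thread), duplicated operands such as

  operand 0: {A, A, C, C}
  operand 1: {B, B, D, D}

become the single operand {A, B, C, D} plus the two lane permutations
{0, 0, 2, 2} and {1, 1, 3, 3} that recreate the original operands.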

This targets the vectorization of the SPEC2017 x264 pixel_satd functions.
On some processors a more than 10% improvement on x264 has been observed.

See also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138

gcc/ChangeLog:

* tree-vect-slp.cc: Avoid duplicates in two_operators nodes.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/vect-slp-two-operator.c: New test.

Signed-off-by: Manolis Tsamis 
---

Changes in v2:
- Do not use predefined patterns; support rearrangement of arbitrary
node orderings.
- Only apply for two_operators nodes.
- Recurse with single SLP operand instead of two duplicated ones.
- Refactoring of code.

 .../aarch64/vect-slp-two-operator.c   |  36 ++
 gcc/tree-vect-slp.cc  | 114 ++
 2 files changed, 150 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c

diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c 
b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c
new file mode 100644
index 000..b6b093ffc34
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c
@@ -0,0 +1,36 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect 
-fdump-tree-vect-details" } */
+
+typedef unsigned char uint8_t;
+typedef unsigned int uint32_t;
+
+#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
+int t0 = s0 + s1;\
+int t1 = s0 - s1;\
+int t2 = s2 + s3;\
+int t3 = s2 - s3;\
+d0 = t0 + t2;\
+d1 = t1 + t3;\
+d2 = t0 - t2;\
+d3 = t1 - t3;\
+}
+
+void sink(uint32_t tmp[4][4]);
+
+int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
+{
+uint32_t tmp[4][4];
+int sum = 0;
+for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
+{
+uint32_t a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
+uint32_t a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
+uint32_t a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
+uint32_t a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
+HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 );
+}
+sink(tmp);
+}
+
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index b47b7e8c979..60d0d388dff 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -2420,6 +2420,95 @@ out:
   }
   swap = NULL;
 
+  bool has_two_operators_perm = false;
+  auto_vec<unsigned> two_op_perm_indices[2];
+  vec<stmt_vec_info> two_op_scalar_stmts[2] = {vNULL, vNULL};
+
+  if (two_operators && oprnds_info.length () == 2 && group_size > 2)
+{
+  unsigned idx = 0;
+  hash_map<gimple *, unsigned> seen;
+  vec<slp_oprnd_info> new_oprnds_info
+   = vect_create_oprnd_info (1, group_size);
+  bool success = true;
+
+  enum tree_code code = ERROR_MARK;
+  if (oprnds_info[0]->def_stmts[0]
+ && is_a <gassign *> (oprnds_info[0]->def_stmts[0]->stmt))
+   code = gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt);
+
+  for (unsigned j = 0; j < group_size; ++j)
+   {
+ FOR_EACH_VEC_ELT (oprnds_info, i, oprnd_info)
+   {
+ stmt_vec_info stmt_info = oprnd_info->def_stmts[j];
+ if (!stmt_info || !stmt_info->stmt
+ || !is_a <gassign *> (stmt_info->stmt)
+ || gimple_assign_rhs_code (stmt_info->stmt) != code
+ || skip_args[i])
+   {
+ success = false;
+ break;
+   }
+
+ bool exists;
+ unsigned &stmt_idx
+   = seen.get_or_insert (stmt_info->stmt, &exists);
+
+ if (!exists)
+   {
+ new_oprnds_info[0]->def_stmts.safe_push (stmt_info);
+ new_oprnds_info[0]->ops.safe_push (oprnd_info->ops[j]);
+ stmt_idx = idx;
+ idx++;
+   }
+
+ two_op_perm_indices[i].safe_push (stmt_idx);
+   }
+
+ if (!success)
+   break;
+   }
+
+  if (success && idx == group_size)
+   {
+ if (dump_enabled_p ())
+   {
+ dump_printf_loc (MSG_NOTE, vect_location,
+  "Replace two_operators operands:\n");
+
+ FOR_EACH_VEC_ELT (oprnds_info, i, oprnd_info)
+   {
+ dump_printf_loc (MSG_NOTE, vect_location,
+

[PATCH] s390: Check for ADDR_REGS in s390_decompose_addrstyle_without_index

2024-06-26 Thread Stefan Schulze Frielinghaus
An explicit check for address registers was not required so far since
during register allocation the processing of address constraints was
sufficient.  However, address constraints themselves do not check for
REGNO_OK_FOR_{BASE,INDEX}_P (on s390, r0 cannot serve as a base or
index register).  Thus, with the newly introduced
late-combine pass in r15-1579-g792f97b44ffc5e we generate new insns with
invalid address registers which aren't fixed up afterwards.

Fixed by explicitly checking for address registers in
s390_decompose_addrstyle_without_index such that those new insns are
rejected.

gcc/ChangeLog:

PR target/115634
* config/s390/s390.cc (s390_decompose_addrstyle_without_index):
Check for ADDR_REGS in s390_decompose_addrstyle_without_index.
---
 This restores bootstrap on s390.  I ran the testsuite against mainline
 and of course there is some fallout which is most likely coming from
 the new pass or other changes.  I have another job running comparing
 pre r15-1579-g792f97b44ffc5e with and without this patch.  Assuming
 this goes well, ok for mainline?

 gcc/config/s390/s390.cc | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/gcc/config/s390/s390.cc b/gcc/config/s390/s390.cc
index c65421de831..05a0fde7fb0 100644
--- a/gcc/config/s390/s390.cc
+++ b/gcc/config/s390/s390.cc
@@ -3347,7 +3347,9 @@ s390_decompose_addrstyle_without_index (rtx op, rtx *base,
   while (op && GET_CODE (op) == SUBREG)
 op = SUBREG_REG (op);
 
-  if (op && GET_CODE (op) != REG)
+  if (op && (!REG_P (op)
+|| (reload_completed
+&& !REGNO_OK_FOR_BASE_P (REGNO (op)
 return false;
 
   if (offset)
-- 
2.45.1



mve: Fix vsetq_lane for 64-bit elements with lane 1 [PR 115611]

2024-06-26 Thread Andre Vieira (lists)

This patch fixes the backend pattern that was printing the wrong input
scalar register pair when inserting into lane 1: the lane selects the
destination D register (%e0 vs. %f0), but the input scalar pair is
always %Q1/%R1.

Added a new test forcing float-abi=hard so we can use scan-assembler to
check correct codegen.

Regression tested arm-none-eabi with 
-march=armv8.1-m.main+mve/-mfloat-abi=hard/-mfpu=auto


gcc/ChangeLog:

PR target/115611
* config/arm/mve.md (mve_vec_setv2di_internal): Fix printing of input
scalar register pair when lane = 1.

gcc/testsuite/ChangeLog:

* gcc.target/arm/mve/intrinsics/vsetq_lane_su64.c: New test.

diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index 4b4d6298ffb1899dc089eb52b03500e6e6236c31..706a45c7d6652677f3ec993a77646e3845eb8f8d 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -6505,7 +6505,7 @@ (define_insn "mve_vec_setv2di_internal"
   if (elt == 0)
return "vmov\t%e0, %Q1, %R1";
   else
-   return "vmov\t%f0, %J1, %K1";
+   return "vmov\t%f0, %Q1, %R1";
 }
  [(set_attr "type" "mve_move")])
 
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/vsetq_lane_su64.c b/gcc/testsuite/gcc.target/arm/mve/intrinsics/vsetq_lane_su64.c
new file mode 100644
index 0000000000000000000000000000000000000000..5aa3bc9a76a06d7151ff6a844807afe666bbeacb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/vsetq_lane_su64.c
@@ -0,0 +1,63 @@
+/* { dg-require-effective-target arm_v8_1m_mve_ok } */
+/* { dg-add-options arm_v8_1m_mve } */
+/* { dg-require-effective-target arm_hard_ok } */
+/* { dg-additional-options "-mfloat-abi=hard -O2" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include "arm_mve.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/*
+**fn1:
**	vmov	d0, r0, r1
+** bx  lr
+*/
+uint64x2_t
+fn1 (uint64_t a, uint64x2_t b)
+{
+  return vsetq_lane_u64 (a, b, 0);
+}
+
+/*
+**fn2:
**	vmov	d1, r0, r1
+** bx  lr
+*/
+uint64x2_t
+fn2 (uint64_t a, uint64x2_t b)
+{
+  return vsetq_lane_u64 (a, b, 1);
+}
+
+/*
+**fn3:
**	vmov	d0, r0, r1
+** bx  lr
+*/
+int64x2_t
+fn3 (int64_t a, int64x2_t b)
+{
+  return vsetq_lane_s64 (a, b, 0);
+}
+
+/*
+**fn4:
**	vmov	d1, r0, r1
+** bx  lr
+*/
+int64x2_t
+fn4 (int64_t a, int64x2_t b)
+{
+  return vsetq_lane_s64 (a, b, 1);
+}
+
+
+#ifdef __cplusplus
+}
+#endif
+
+/* { dg-final { scan-assembler-not "__ARM_undef" } } */
+


[PING] Re: [PATCH 1/2] ivopts: Revert computation of address cost complexity

2024-06-26 Thread Aleksandar Rakic
Hi!

I'd like to ping the following patch:

https://gcc.gnu.org/pipermail/gcc-patches/2024-March/647966.html
  a patch for the computation of the complexity for the unsupported addressing 
modes in ivopts

  This patch should be a fix for the bug which is described on the following 
link:
  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109429
  It modifies the order of the complexity calculation.  By fixing the
  complexities, the candidate selection is also fixed, which leads to
  smaller code size.


Thanks

Aleksandar Rakić


Re: [PATCH v2] Rearrange SLP nodes with duplicate statements. [PR98138]

2024-06-26 Thread Manolis Tsamis
This is a reworked implementation that only deduplicates two_operators
nodes and supports arbitrary orderings.

Based on the discussion on the original submission, I have made some
SPEC runs to see if this transformation applies to any other
benchmarks.
Aside from x264 this now triggers on SPEC2017 521.wrf and SPECv8
712.av1aom. In these additional two cases the transformation applies
in a way that could help vectorization.
For example in the relevant av1 code, we see things like:

note: Operand 0:
note: stmt 0 _718 = MEM[(int32_t *)output_1462(D) + 16B];
note: stmt 1 _718 = MEM[(int32_t *)output_1462(D) + 16B];
note: stmt 2 _722 = MEM[(int32_t *)output_1462(D) + 28B];
note: stmt 3 _722 = MEM[(int32_t *)output_1462(D) + 28B];
note: Operand 1:
note: stmt 0 _719 = MEM[(int32_t *)output_1462(D) + 20B];
note: stmt 1 _719 = MEM[(int32_t *)output_1462(D) + 20B];
note: stmt 2 _723 = MEM[(int32_t *)output_1462(D) + 24B];
note: stmt 3 _723 = MEM[(int32_t *)output_1462(D) + 24B];
note: With a single operand:
note: stmt 0 _718 = MEM[(int32_t *)output_1462(D) + 16B];
note: stmt 1 _719 = MEM[(int32_t *)output_1462(D) + 20B];
note: stmt 2 _722 = MEM[(int32_t *)output_1462(D) + 28B];
note: stmt 3 _723 = MEM[(int32_t *)output_1462(D) + 24B];

Whereas gcc master will create load nodes with permutations that have
repeated elements. The issue here that prevents even better
vectorization, and that affects all other cases including x264, is
that there is no way to properly order the deduplicated elements each
time. For example consider x264, in which we have two different
patterns that we're applying the transformation to:

int t0 = s0 + s1;
int t1 = s0 - s1;
int t2 = s2 + s3;
int t3 = s2 - s3;

and

d0 = t0 + t2;
d1 = t1 + t3;
d2 = t0 - t2;
d3 = t1 - t3;

The preferred order for the deduplicated operands in the first case is
[s0, s1, s2, s3] and in the second case [t0, t1, t2, t3]. But because
the operands appear in different order one of the two will end up in a
different ordering. With this implementation we get a good first node

x264.c:29:23: note:   Replace two_operators operands:
x264.c:29:23: note:   Operand 0:
x264.c:29:23: note:   stmt 0 a0_110 = (uint32_t) _12;
x264.c:29:23: note:   stmt 1 a2_112 = (uint32_t) _36;
x264.c:29:23: note:   stmt 2 a0_110 = (uint32_t) _12;
x264.c:29:23: note:   stmt 3 a2_112 = (uint32_t) _36;
x264.c:29:23: note:   Operand 1:
x264.c:29:23: note:   stmt 0 a1_111 = (uint32_t) _24;
x264.c:29:23: note:   stmt 1 a3_113 = (uint32_t) _48;
x264.c:29:23: note:   stmt 2 a1_111 = (uint32_t) _24;
x264.c:29:23: note:   stmt 3 a3_113 = (uint32_t) _48;
x264.c:29:23: note:   With a single operand:
x264.c:29:23: note:   stmt 0 a0_110 = (uint32_t) _12;
x264.c:29:23: note:   stmt 1 a1_111 = (uint32_t) _24;
x264.c:29:23: note:   stmt 2 a2_112 = (uint32_t) _36;
x264.c:29:23: note:   stmt 3 a3_113 = (uint32_t) _48;

but mess the order in the second one:

x264.c:29:23: note:   Replace two_operators operands:
x264.c:29:23: note:   Operand 0:
x264.c:29:23: note:   stmt 0 t0_114 = (int) _49;
x264.c:29:23: note:   stmt 1 t1_115 = (int) _50;
x264.c:29:23: note:   stmt 2 t0_114 = (int) _49;
x264.c:29:23: note:   stmt 3 t1_115 = (int) _50;
x264.c:29:23: note:   Operand 1:
x264.c:29:23: note:   stmt 0 t2_116 = (int) _51;
x264.c:29:23: note:   stmt 1 t3_117 = (int) _52;
x264.c:29:23: note:   stmt 2 t2_116 = (int) _51;
x264.c:29:23: note:   stmt 3 t3_117 = (int) _52;
x264.c:29:23: note:   With a single operand:
x264.c:29:23: note:   stmt 0 t0_114 = (int) _49;
x264.c:29:23: note:   stmt 1 t2_116 = (int) _51;
x264.c:29:23: note:   stmt 2 t1_115 = (int) _50;
x264.c:29:23: note:   stmt 3 t3_117 = (int) _52;

and get [_49, _51, _50, _52] instead of the preferred [_49, _50, _51,
_52]. As a result we get an extra layer of 4 permute instructions when
we generate code. In other cases this gets even worse.
Is there any reasonable way to improve the ordering of these nodes? I
thought of sorting based on SSA name version but that's a workaround at
best and doesn't work in all cases.

Manolis


On Wed, Jun 26, 2024 at 3:06 PM Manolis Tsamis  wrote:
>
> This change checks when a two_operators SLP node has multiple occurrences of
> the same statement (e.g. {A, B, A, B, ...}) and tries to rearrange the 
> operands
> so that there are no duplicates. Two vec_perm expressions are then introduced
> to recreate the original ordering. These duplicates can appear due to how
> two_operators nodes are handled, and they prevent vectorization in some cases.
>
> This targets the vectorization of the SPEC2017 x264 pixel_satd functions.
> In some processors a larger than 10% improvement on x264 has been observed.
>
> See also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138
>
> gcc/ChangeLog:
>
> * tree-vect-slp.cc: Avoid duplicates in two_operators nodes.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/aarch6

[PATCH] tree-optimization/115640 - outer loop vect with inner SLP permute

2024-06-26 Thread Richard Biener
The following fixes wrong-code when using outer loop vectorization
and an inner loop SLP access with permutation.  A wrong adjustment
to the IV increment is then applied on GCN.

Bootstrap and regtest running on x86_64-unknown-linux-gnu.

PR tree-optimization/115640
* tree-vect-stmts.cc (vectorizable_load): With an inner
loop SLP access to not apply a gap adjustment.
---
 gcc/tree-vect-stmts.cc | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 1fa92a0dc13..9697b8ca39c 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -10597,9 +10597,14 @@ vectorizable_load (vec_info *vinfo,
 whole group, not only the number of vector stmts the
 permutation result fits in.  */
  unsigned scalar_lanes = SLP_TREE_LANES (slp_node);
- if (slp_perm
- && (group_size != scalar_lanes 
- || !multiple_p (nunits, group_size)))
+ if (nested_in_vect_loop)
+   /* We do not support grouped accesses in a nested loop,
+  instead the access is contiguous but it might be
+  permuted.  No gap adjustment is needed though.  */
+   vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
+ else if (slp_perm
+  && (group_size != scalar_lanes
+  || !multiple_p (nunits, group_size)))
{
  /* We don't yet generate such SLP_TREE_LOAD_PERMUTATIONs for
 variable VF; see vect_transform_slp_perm_load.  */
-- 
2.35.3


Re: [PATCH] Rearrange SLP nodes with duplicate statements. [PR98138]

2024-06-26 Thread Manolis Tsamis
On Wed, Jun 5, 2024 at 11:07 AM Richard Biener  wrote:
>
> On Tue, 4 Jun 2024, Manolis Tsamis wrote:
>
> > This change adds a function that checks for SLP nodes with multiple 
> > occurrences
> > of the same statement (e.g. {A, B, A, B, ...}) and tries to rearrange the 
> > node
> > so that there are no duplicates. A vec_perm is then introduced to recreate 
> > the
> > original ordering. These duplicates can appear due to how two_operators 
> > nodes
> > are handled, and they prevent vectorization in some cases.
>
> So the trick is that when we have two operands we elide duplicate lanes
> so we can do discovery for a single combined operand instead which we
> then decompose into the required two again.  That's a nice one.
>
> But as implemented this will fail SLP discovery if the combined operand
> fails discovery possibly because of divergence in downstream defs.  That
> is, it doesn't fall back to separate discovery.  I suspect the situation
> of duplicate lanes isn't common but then I would also suspect that
> divergence _is_ common.
>
> The discovery code is already quite complex with the way it possibly
> swaps operands of lanes, fitting in this as another variant to try (first)
> is likely going to be a bit awkward.  A way out might be to split the
> function or to make the re-try in the caller which could indicate whether
> to apply this pattern trick or not.  That said - can you try to get
> data on how often the trick applies and discovery succeeds and how
> often discovery fails but discovery would suceed without applying the
> pattern (say, on SPEC)?

Hi Richard,

I have found two other SPEC benchmarks in which the new version of
this optimization applies.
In these cases discovery "fails" anyway when not doing deduplication.
It's not an immediate failure though but rather not producing good SLP
trees and then aborting due to cost of other checks (similar to x264).

>
> I also suppose instead of hardcoding three patterns for a fixed
> size it should be possible to see that there are
> only (at most) half as many unique lanes in both operands (and one less in one
> operand if the number of lanes is odd) and compute the un-swizzling lane
> permutes during this discovery, removing the need of the explicit enum
> and open-coding each case?
>
> Another general note is that trying (and then undoing on fail) such tricks
> eats at the discovery limit we have in place to avoid exponential run-off
> in exactly these degenerate cases.
>

I have sent a new version that doesn't have hardcoded patterns and
only works with two_operators nodes among others.
Please note that I still haven't addressed all your other feedback as
I'm still iterating the implementation.

Thanks,
Manolis


> Thanks,
> Richard.
>
> > This targets the vectorization of the SPEC2017 x264 pixel_satd functions.
> > In some processors a larger than 10% improvement on x264 has been observed.
> >
> > See also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138
> >
> > gcc/ChangeLog:
> >
> >   * tree-vect-slp.cc (enum slp_oprnd_pattern): new enum for 
> > rearrangement
> >   patterns.
> >   (try_rearrange_oprnd_info): Detect if a node corresponds to one of the
> >   patterns.
> >
> > gcc/testsuite/ChangeLog:
> >
> >   * gcc.target/aarch64/vect-slp-two-operator.c: New test.
> >
> > Signed-off-by: Manolis Tsamis 
> > ---
> >
> >  .../aarch64/vect-slp-two-operator.c   |  42 
> >  gcc/tree-vect-slp.cc  | 234 ++
> >  2 files changed, 276 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c
> >
> > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c 
> > b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c
> > new file mode 100644
> > index 000..2db066a0b6e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c
> > @@ -0,0 +1,42 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect 
> > -fdump-tree-vect-details" } */
> > +
> > +typedef unsigned char uint8_t;
> > +typedef unsigned int uint32_t;
> > +
> > +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
> > +int t0 = s0 + s1;\
> > +int t1 = s0 - s1;\
> > +int t2 = s2 + s3;\
> > +int t3 = s2 - s3;\
> > +d0 = t0 + t2;\
> > +d1 = t1 + t3;\
> > +d2 = t0 - t2;\
> > +d3 = t1 - t3;\
> > +}
> > +
> > +static uint32_t abs2( uint32_t a )
> > +{
> > +uint32_t s = ((a>>15)&0x10001)*0xffff;
> > +return (a+s)^s;
> > +}
> > +
> > +void sink(uint32_t tmp[4][4]);
> > +
> > +int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int 
> > i_pix2 )
> > +{
> > +uint32_t tmp[4][4];
> > +int sum = 0;
> > +for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
> > +{
> > +uint32_t a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
> > +uint32_t a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
> > +uint32_t a2 = (pix1[2] - 

Re: [PING] Re: [PATCH 1/2] ivopts: Revert computation of address cost complexity

2024-06-26 Thread Richard Biener
On Wed, Jun 26, 2024 at 2:28 PM Aleksandar Rakic
 wrote:
>
> Hi!
>
> I'd like to ping the following patch:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2024-March/647966.html
>   a patch for the computation of the complexity for the unsupported 
> addressing modes in ivopts

The thread starting at
https://sourceware.org/pipermail/gcc-patches/2022-October/604128.html
contains much information.  The mail you point to contains
inappropriate testsuite additions,
refers to a commit that doesn't look relevant and in fact does not
"revert" anything.  I also
can't remember seeing it, it might have been classified as spam.

Instead of citing the patch by reference, I would suggest re-posting it.

Richard.

>   This patch should be a fix for the bug which is described on the following 
> link:
>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109429
>   It modifies the order of the complexity calculation. By fixing the 
> complexities, the
>   candidate selection is also fixed, which leads to smaller code size.
>
>
> Thanks
>
> Aleksandar Rakić


Re: [PATCH] Hard register asm constraint

2024-06-26 Thread Stefan Schulze Frielinghaus
On Tue, Jun 25, 2024 at 01:02:39PM -0400, Paul Koning wrote:
> 
> 
> > On Jun 25, 2024, at 12:04 PM, Stefan Schulze Frielinghaus 
> >  wrote:
> > 
> > On Tue, Jun 25, 2024 at 10:03:34AM -0400, Paul Koning wrote:
> >> 
> > ...
> > could be rewritten into
> > 
> > int test (int x, int y)
> > {
> > asm ("foo %0,%1,%2" : "+{r4}" (x) : "{r5}" (y), "d" (y));
> > return x;
> > }
> >> 
> >> I like this idea but I'm wondering: regular constraints specify what sort 
> >> of value is needed, for example an int vs. a short int vs. a float.  The 
> >> notation you've shown doesn't seem to have that aspect.
> > 
> > As Maciej already pointed out the type of the expression should suffice.
> > My assumption was that an asm can deal with a value as is or its
> > promoted value.  At least for integer values this should be fine and
> > AFAICS is also the case for simple constraints like "r" which do not
> > define any mode.  I've probably overlooked something, but which constraint
> > differentiates between int vs short?  However, you have a good point
> > with this and I should test this more.
> 
> I thought there was but I may be confused.  On the other hand, there 
> definitely are (machine dependent) constraints that distinguish, say, float 
> from integer registers; pdp11 is an example.  If you were to use an "a" 
> constraint, that means a floating point register and the compiler will detect 
> attempts to pass non-float operands ("Inconsistent operand constraints...").
> 
> I see that the existing "register int ..." syntax appears to check that the 
> register is the right type for the data type given for it, so for example on 
> pdp11, 
> 
>   register int ac1 asm ("ac1") = i;
> 
> fails ("register ... isn't suitable for data type").  I assume your new 
> syntax would perform the same check and produce roughly the same error 
> message.  You might verify that.  On pdp11, trying to use, for example, "r0" 
> for a float, or "ac0" for an int, would produce that error.

Right, so far I don't error out here, which I will change.  It basically
results in bit-casting floats to ints currently.
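
For comparison, the local-register-variable form that the proposed
constraint syntax replaces looks like this (same registers and
constraints as in the example above):

int test (int x, int y)
{
  register int x_ asm ("r4") = x;
  register int y_ asm ("r5") = y;
  asm ("foo %0,%1,%2" : "+r" (x_) : "r" (y_), "d" (y));
  return x_;
}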

Just one thing to note: this is not a novel feature but pretty similar
to Rust's explicit register operands:
https://doc.rust-lang.org/rust-by-example/unsafe/asm.html#explicit-register-operands

Cheers,
Stefan


[committed][RISC-V] Fix expected output for thead store pair test

2024-06-26 Thread Jeff Law
Surya's patch to IRA has improved the code we generate for one of the 
thead store pair tests for both rv32 and rv64.  This patch adjusts the 
expectations of that test.


I've verified that the test now passes on rv32 and rv64 in my tester. 
Pushing to the trunk.


Jeff

commit 03a3dffa43145f80548d32b266b9b87be07b52ee
Author: Jeff Law 
Date:   Wed Jun 26 06:59:26 2024 -0600

[committed][RISC-V] Fix expected output for thead store pair test

Surya's patch to IRA has improved the code we generate for one of the thead
store pair tests for both rv32 and rv64.  This patch adjusts the 
expectations
of that test.

I've verified that the test now passes on rv32 and rv64 in my tester.  
Pushing
to the trunk.

gcc/testsuite
* gcc.target/riscv/xtheadmempair-3.c: Update expected output.

diff --git a/gcc/testsuite/gcc.target/riscv/xtheadmempair-3.c 
b/gcc/testsuite/gcc.target/riscv/xtheadmempair-3.c
index 5dec702819a..99a6ae7f4d7 100644
--- a/gcc/testsuite/gcc.target/riscv/xtheadmempair-3.c
+++ b/gcc/testsuite/gcc.target/riscv/xtheadmempair-3.c
@@ -17,13 +17,11 @@ void bar (xlen_t, xlen_t, xlen_t, xlen_t, xlen_t, xlen_t, xlen_t, xlen_t);
 void baz (xlen_t a, xlen_t b, xlen_t c, xlen_t d, xlen_t e, xlen_t f, xlen_t g, xlen_t h)
 {
   foo (a, b, c, d, e, f, g, h);
-  /* RV64: We don't use 0(sp), therefore we can only get 3 mempairs.  */
-  /* RV32: We don't use 0(sp)-8(sp), therefore we can only get 2 mempairs.  */
   bar (a, b, c, d, e, f, g, h);
 }
 
-/* { dg-final { scan-assembler-times "th.ldd\t" 3 { target { rv64 } } } } */
-/* { dg-final { scan-assembler-times "th.sdd\t" 3 { target { rv64 } } } } */
+/* { dg-final { scan-assembler-times "th.ldd\t" 4 { target { rv64 } } } } */
+/* { dg-final { scan-assembler-times "th.sdd\t" 4 { target { rv64 } } } } */
 
-/* { dg-final { scan-assembler-times "th.lwd\t" 2 { target { rv32 } } } } */
-/* { dg-final { scan-assembler-times "th.swd\t" 2 { target { rv32 } } } } */
+/* { dg-final { scan-assembler-times "th.lwd\t" 4 { target { rv32 } } } } */
+/* { dg-final { scan-assembler-times "th.swd\t" 4 { target { rv32 } } } } */


Re: [PATCH] s390: Check for ADDR_REGS in s390_decompose_addrstyle_without_index

2024-06-26 Thread Richard Sandiford
Stefan Schulze Frielinghaus  writes:
> An explicit check for address registers was not required so far since
> during register allocation the processing of address constraints was
> sufficient.  However, address constraints themselves do not check for
> REGNO_OK_FOR_{BASE,INDEX}_P.  Thus, with the newly introduced
> late-combine pass in r15-1579-g792f97b44ffc5e we generate new insns with
> invalid address registers which aren't fixed up afterwards.
>
> Fixed by explicitly checking for address registers in
> s390_decompose_addrstyle_without_index such that those new insns are
> rejected.

Thanks for fixing this.  LGTM FWIW.

Richard

> gcc/ChangeLog:
>
>   PR target/115634
>   * config/s390/s390.cc (s390_decompose_addrstyle_without_index):
>   Check for ADDR_REGS in s390_decompose_addrstyle_without_index.
> ---
>  This restores bootstrap on s390.  I ran the testsuite against mainline
>  and of course there is some fallout which is most likely coming from
>  the new pass or other changes.  I have another job running comparing
>  pre r15-1579-g792f97b44ffc5e with and without this patch.  Assuming
>  this goes well, ok for mainline?
>
>  gcc/config/s390/s390.cc | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/config/s390/s390.cc b/gcc/config/s390/s390.cc
> index c65421de831..05a0fde7fb0 100644
> --- a/gcc/config/s390/s390.cc
> +++ b/gcc/config/s390/s390.cc
> @@ -3347,7 +3347,9 @@ s390_decompose_addrstyle_without_index (rtx op, rtx 
> *base,
>while (op && GET_CODE (op) == SUBREG)
>  op = SUBREG_REG (op);
>  
> -  if (op && GET_CODE (op) != REG)
> +  if (op && (!REG_P (op)
> +  || (reload_completed
> +  && !REGNO_OK_FOR_BASE_P (REGNO (op)
>  return false;
>  
>if (offset)


[committed] Remove compromised sh test

2024-06-26 Thread Jeff Law
Surya's recent patch to IRA improves the code for sh/pr54602-1.c 
slightly.  Specifically it's able to eliminate a save/restore in the 
prologue/epilogue and a bit of register shuffling.


As a result there literally aren't any insns that can be used to fill 
the delay slot of the return, so a nop gets emitted and the test fails.


Given there literally aren't any insns to move into the delay slot, the 
best course of action is to just drop the test.


Pushed to the trunk.

Jeff

commit 47b68cda2c4afe32e84c5f18da0196c39e5e0edf
Author: Jeff Law 
Date:   Wed Jun 26 07:20:29 2024 -0600

[committed] Remove compromised sh test

Surya's recent patch to IRA improves the code for sh/pr54602-1.c slightly.
Specifically it's able to eliminate a save/restore in the prologue/epilogue 
and
a bit of register shuffling.

As a result there literally aren't any insns that can be used to fill the 
delay
slot of the return, so a nop gets emitted and the test fails.

Given there literally aren't any insns to move into the delay slot, the best
course of action is to just drop the test.

gcc/testsuite
* gcc.target/sh/pr54602-1.c: Delete test.

diff --git a/gcc/testsuite/gcc.target/sh/pr54602-1.c 
b/gcc/testsuite/gcc.target/sh/pr54602-1.c
deleted file mode 100644
index e7fb2a9a642..000
--- a/gcc/testsuite/gcc.target/sh/pr54602-1.c
+++ /dev/null
@@ -1,14 +0,0 @@
-/* Verify that the delay slot is stuffed with register pop insns for normal
-   (i.e. not interrupt handler) function returns.  If everything goes as
-   expected we won't see any nop insns.  */
-/* { dg-do compile }  */
-/* { dg-options "-O1" } */
-/* { dg-final { scan-assembler-not "nop" } } */
-
-int test00 (int a, int b);
-
-int
-test01 (int a, int b, int c, int d)
-{
-  return test00 (a, b) + c;
-}


Re: [PATCH]middle-end: Implement conditional store vectorizer pattern [PR115531]

2024-06-26 Thread Richard Biener
On Tue, 25 Jun 2024, Tamar Christina wrote:

> Hi All,
> 
> This adds a conditional store optimization for the vectorizer as a pattern.
> The vectorizer already supports modifying memory accesses because of the 
> pattern
> based gather/scatter recognition.
> 
> Doing it in the vectorizer allows us to still keep the ability to vectorize 
> such
> loops for architectures that don't have MASK_STORE support, whereas doing this
> in ifcvt makes us commit to MASK_STORE.
> 
> Concretely for this loop:
> 
> void foo1 (char *restrict a, int *restrict b, int *restrict c, int n, int 
> stride)
> {
>   if (stride <= 1)
> return;
> 
>   for (int i = 0; i < n; i++)
> {
>   int res = c[i];
>   int t = b[i+stride];
>   if (a[i] != 0)
> res = t;
>   c[i] = res;
> }
> }
> 
> today we generate:
> 
> .L3:
>         ld1b    z29.s, p7/z, [x0, x5]
>         ld1w    z31.s, p7/z, [x2, x5, lsl 2]
>         ld1w    z30.s, p7/z, [x1, x5, lsl 2]
>         cmpne   p15.b, p6/z, z29.b, #0
>         sel     z30.s, p15, z30.s, z31.s
>         st1w    z30.s, p7, [x2, x5, lsl 2]
> add x5, x5, x4
> whilelo p7.s, w5, w3
> b.any   .L3
> 
> which in gimple is:
> 
>   vect_res_18.9_68 = .MASK_LOAD (vectp_c.7_65, 32B, loop_mask_67);
>   vect_t_20.12_74 = .MASK_LOAD (vectp.10_72, 32B, loop_mask_67);
>   vect__9.15_77 = .MASK_LOAD (vectp_a.13_75, 8B, loop_mask_67);
>   mask__34.16_79 = vect__9.15_77 != { 0, ... };
>   vect_res_11.17_80 = VEC_COND_EXPR <mask__34.16_79, vect_t_20.12_74, vect_res_18.9_68>;
>   .MASK_STORE (vectp_c.18_81, 32B, loop_mask_67, vect_res_11.17_80);
> 
> A MASK_STORE is already conditional, so there's no need to perform the load of
> the old values and the VEC_COND_EXPR.  This patch makes it so we generate:
> 
>   vect_res_18.9_68 = .MASK_LOAD (vectp_c.7_65, 32B, loop_mask_67);
>   vect__9.15_77 = .MASK_LOAD (vectp_a.13_75, 8B, loop_mask_67);
>   mask__34.16_79 = vect__9.15_77 != { 0, ... };
>   .MASK_STORE (vectp_c.18_81, 32B, mask__34.16_79, vect_res_18.9_68);
> 
> which generates:
> 
> .L3:
>         ld1b    z30.s, p7/z, [x0, x5]
>         ld1w    z31.s, p7/z, [x1, x5, lsl 2]
>         cmpne   p7.b, p7/z, z30.b, #0
>         st1w    z31.s, p7, [x2, x5, lsl 2]
> add x5, x5, x4
> whilelo p7.s, w5, w3
> b.any   .L3
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

The idea looks good but I wonder if it's not slower in practice.
The issue with masked stores, in particular those where any elements
are actually masked out, is that such stores do not forward on any
uarch I know.  They also usually have a penalty for the merging
(the load has to be carried out anyway).

So - can you do an actual benchmark on real hardware where the
loop has (way) more than one vector iteration and where there's
at least one masked element during each vector iteration?
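
(Something along these lines, as a sketch -- a driver only, with foo1
as in the patch description and one masked-out lane in every group of
four elements:

#define N (1 << 16)

void foo1 (char *restrict a, int *restrict b, int *restrict c,
           int n, int stride);

static char a[N];
static int b[N + 8], c[N];

int main (void)
{
  for (int i = 0; i < N; i++)
    a[i] = (i & 3) != 0;
  for (int r = 0; r < 100000; r++)
    foo1 (a, b, c, N, 2);
  return c[0];
}
)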

> Ok for master?

Few comments below.

> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>   PR tree-optimization/115531
>   * tree-vect-patterns.cc (vect_cond_store_pattern_same_ref): New.
>   (vect_recog_cond_store_pattern): New.
>   (vect_vect_recog_func_ptrs): Use it.
> 
> gcc/testsuite/ChangeLog:
> 
>   PR tree-optimization/115531
>   * gcc.dg/vect/vect-conditional_store_1.c: New test.
>   * gcc.dg/vect/vect-conditional_store_2.c: New test.
>   * gcc.dg/vect/vect-conditional_store_3.c: New test.
>   * gcc.dg/vect/vect-conditional_store_4.c: New test.
> 
> ---
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-conditional_store_1.c 
> b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_1.c
> new file mode 100644
> index 
> ..3884a3c3d0a2dc2258097348c75bb7c0b3b37c72
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_1.c
> @@ -0,0 +1,24 @@
> +/* { dg-do assemble } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target vect_masked_store } */
> +
> +/* { dg-additional-options "-mavx2" { target avx2 } } */
> +/* { dg-additional-options "-march=armv9-a" { target aarch64-*-* } } */
> +
> +void foo1 (char *restrict a, int *restrict b, int *restrict c, int n, int 
> stride)
> +{
> +  if (stride <= 1)
> +return;
> +
> +  for (int i = 0; i < n; i++)
> +{
> +  int res = c[i];
> +  int t = b[i+stride];
> +  if (a[i] != 0)
> +res = t;
> +  c[i] = res;
> +}
> +}
> +
> +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> +/* { dg-final { scan-tree-dump-not "VEC_COND_EXPR " "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-conditional_store_2.c 
> b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_2.c
> new file mode 100644
> index 
> ..bc965a244f147c199b1726e5f6b44229539cd225
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_2.c
> @@ -0,0 +1,24 @@
> +/* { dg-do assemble } */
> +/* { dg-require-effective-target vect_int } */
> +/* { dg-require-effective-target vect_masked_store } */

RE: [PATCH]middle-end: Implement conditional store vectorizer pattern [PR115531]

2024-06-26 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Wednesday, June 26, 2024 2:23 PM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com
> Subject: Re: [PATCH]middle-end: Implement conditional store vectorizer pattern
> [PR115531]
> 
> On Tue, 25 Jun 2024, Tamar Christina wrote:
> 
> > Hi All,
> >
> > This adds a conditional store optimization for the vectorizer as a pattern.
> > The vectorizer already supports modifying memory accesses because of the
> pattern
> > based gather/scatter recognition.
> >
> > Doing it in the vectorizer allows us to still keep the ability to vectorize 
> > such
> > loops for architectures that don't have MASK_STORE support, whereas doing 
> > this
> > in ifcvt makes us commit to MASK_STORE.
> >
> > Concretely for this loop:
> >
> > void foo1 (char *restrict a, int *restrict b, int *restrict c, int n, int 
> > stride)
> > {
> >   if (stride <= 1)
> > return;
> >
> >   for (int i = 0; i < n; i++)
> > {
> >   int res = c[i];
> >   int t = b[i+stride];
> >   if (a[i] != 0)
> > res = t;
> >   c[i] = res;
> > }
> > }
> >
> > today we generate:
> >
> > .L3:
> >         ld1b    z29.s, p7/z, [x0, x5]
> >         ld1w    z31.s, p7/z, [x2, x5, lsl 2]
> >         ld1w    z30.s, p7/z, [x1, x5, lsl 2]
> >         cmpne   p15.b, p6/z, z29.b, #0
> >         sel     z30.s, p15, z30.s, z31.s
> >         st1w    z30.s, p7, [x2, x5, lsl 2]
> > add x5, x5, x4
> > whilelo p7.s, w5, w3
> > b.any   .L3
> >
> > which in gimple is:
> >
> >   vect_res_18.9_68 = .MASK_LOAD (vectp_c.7_65, 32B, loop_mask_67);
> >   vect_t_20.12_74 = .MASK_LOAD (vectp.10_72, 32B, loop_mask_67);
> >   vect__9.15_77 = .MASK_LOAD (vectp_a.13_75, 8B, loop_mask_67);
> >   mask__34.16_79 = vect__9.15_77 != { 0, ... };
> >   vect_res_11.17_80 = VEC_COND_EXPR <mask__34.16_79, vect_t_20.12_74, vect_res_18.9_68>;
> >   .MASK_STORE (vectp_c.18_81, 32B, loop_mask_67, vect_res_11.17_80);
> >
> > A MASK_STORE is already conditional, so there's no need to perform the load 
> > of
> > the old values and the VEC_COND_EXPR.  This patch makes it so we generate:
> >
> >   vect_res_18.9_68 = .MASK_LOAD (vectp_c.7_65, 32B, loop_mask_67);
> >   vect__9.15_77 = .MASK_LOAD (vectp_a.13_75, 8B, loop_mask_67);
> >   mask__34.16_79 = vect__9.15_77 != { 0, ... };
> >   .MASK_STORE (vectp_c.18_81, 32B, mask__34.16_79, vect_res_18.9_68);
> >
> > which generates:
> >
> > .L3:
> >         ld1b    z30.s, p7/z, [x0, x5]
> >         ld1w    z31.s, p7/z, [x1, x5, lsl 2]
> >         cmpne   p7.b, p7/z, z30.b, #0
> >         st1w    z31.s, p7, [x2, x5, lsl 2]
> > add x5, x5, x4
> > whilelo p7.s, w5, w3
> > b.any   .L3
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> The idea looks good but I wonder if it's not slower in practice.
> The issue with masked stores, in particular those where any elements
> are actually masked out, is that such stores do not forward on any
> uarch I know.  They also usually have a penalty for the merging
> (the load has to be carried out anyway).
> 

Yes, but when the predicate has all bits set it usually does.
But forwarding aside, this gets rid of the select and the additional load,
so purely from an instruction latency perspective it's a win.

> So - can you do an actual benchmark on real hardware where the
> loop has (way) more than one vector iteration and where there's
> at least one masked element during each vector iteration?
> 

Sure, this optimization comes from exchange2 where vectorizing with SVE
ends up being slower than not vectorizing.  This change makes the vectorization
profitable and recovers about a 3% difference overall between vectorizing and 
not.

I did run microbenchmarks over all current and future Arm cores and it was a 
universal
win.

I can run more benchmarks with various masks, but as mentioned above, even
without forwarding, you still have two instructions less, so it's almost
always going to win.

> > Ok for master?
> 
> Few comments below.
> 
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > PR tree-optimization/115531
> > * tree-vect-patterns.cc (vect_cond_store_pattern_same_ref): New.
> > (vect_recog_cond_store_pattern): New.
> > (vect_vect_recog_func_ptrs): Use it.
> >
> > gcc/testsuite/ChangeLog:
> >
> > PR tree-optimization/115531
> > * gcc.dg/vect/vect-conditional_store_1.c: New test.
> > * gcc.dg/vect/vect-conditional_store_2.c: New test.
> > * gcc.dg/vect/vect-conditional_store_3.c: New test.
> > * gcc.dg/vect/vect-conditional_store_4.c: New test.
> >
> > ---
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-conditional_store_1.c
> b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_1.c
> > new file mode 100644
> > index
> ..3884a3c3d0a2dc2258097
> 348c75bb7c0b3b37c72
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-conditional_store_1.c
> > @@ -0,0 +1,24 @@
> > +/* { dg-do assemble } */
> > +

Re: [PATCH v1] Internal-fn: Support new IFN SAT_TRUNC for unsigned scalar int

2024-06-26 Thread Richard Biener
On Wed, Jun 26, 2024 at 3:46 AM  wrote:
>
> From: Pan Li 
>
> This patch would like to add the middle-end representation for
> saturation truncation, i.e. set the result of the truncated value to
> the max value on overflow.  It will take a pattern similar
> to the one below.
>
> Form 1:
>   #define DEF_SAT_U_TRUC_FMT_1(WT, NT) \
>   NT __attribute__((noinline)) \
>   sat_u_truc_##T##_fmt_1 (WT x)\
>   {\
> bool overflow = x > (WT)(NT)(-1);  \
> return ((NT)x) | (NT)-overflow;\
>   }
>
> For example, truncated uint16_t to uint8_t, we have
>
> * SAT_TRUNC (254)   => 254
> * SAT_TRUNC (255)   => 255
> * SAT_TRUNC (256)   => 255
> * SAT_TRUNC (65536) => 255
>
> Given below SAT_TRUNC from uint64_t to uint32_t.
>
> DEF_SAT_U_TRUC_FMT_1 (uint64_t, uint32_t)
>
> Before this patch:
> __attribute__((noinline))
> uint32_t sat_u_truc_T_fmt_1 (uint64_t x)
> {
>   _Bool overflow;
>   unsigned int _1;
>   unsigned int _2;
>   unsigned int _3;
>   uint32_t _6;
>
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   overflow_5 = x_4(D) > 4294967295;
>   _1 = (unsigned int) x_4(D);
>   _2 = (unsigned int) overflow_5;
>   _3 = -_2;
>   _6 = _1 | _3;
>   return _6;
> ;;succ:   EXIT
>
> }
>
> After this patch:
> __attribute__((noinline))
> uint32_t sat_u_truc_T_fmt_1 (uint64_t x)
> {
>   uint32_t _6;
>
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   _6 = .SAT_TRUNC (x_4(D)); [tail call]
>   return _6;
> ;;succ:   EXIT
>
> }
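
(For reference, the macro above expands for this case to roughly the
following -- a sketch with the comparison constant written out:

uint32_t __attribute__((noinline))
sat_u_truc_T_fmt_1 (uint64_t x)
{
  _Bool overflow = x > (uint64_t)(uint32_t)-1;  /* x > 4294967295 */
  return ((uint32_t)x) | (uint32_t)-overflow;   /* all-ones on overflow */
}
)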
>
> The below tests are passed for this patch:
> *. The rv64gcv fully regression tests.
> *. The rv64gcv build with glibc.
> *. The x86 bootstrap tests.
> *. The x86 fully regression tests.
>
> gcc/ChangeLog:
>
> * internal-fn.def (SAT_TRUNC): Add new signed IFN sat_trunc as
> unary_convert.
> * match.pd: Add new matching pattern for unsigned int sat_trunc.
> * optabs.def (OPTAB_CL): Add unsigned and signed optab.
> * tree-ssa-math-opts.cc (gimple_unsigend_integer_sat_trunc): Add
> new decl for the matching pattern generated func.
> (match_unsigned_saturation_trunc): Add new func impl to match
> the .SAT_TRUNC.
> (math_opts_dom_walker::after_dom_children): Add .SAT_TRUNC match
> function under BIT_IOR_EXPR case.
> * tree.cc (integer_half_truncated_all_ones_p): Add new func impl
> to filter the truncated threshold.
> * tree.h (integer_half_truncated_all_ones_p): Add new func decl.
>
> Signed-off-by: Pan Li 
> ---
>  gcc/internal-fn.def   |  2 ++
>  gcc/match.pd  | 12 +++-
>  gcc/optabs.def|  3 +++
>  gcc/tree-ssa-math-opts.cc | 32 
>  gcc/tree.cc   | 22 ++
>  gcc/tree.h|  6 ++
>  6 files changed, 76 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index a8c83437ada..915d329c05a 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -278,6 +278,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | ECF_NOTHROW, first,
>  DEF_INTERNAL_SIGNED_OPTAB_FN (SAT_ADD, ECF_CONST, first, ssadd, usadd, binary)
>  DEF_INTERNAL_SIGNED_OPTAB_FN (SAT_SUB, ECF_CONST, first, sssub, ussub, binary)
>
> +DEF_INTERNAL_SIGNED_OPTAB_FN (SAT_TRUNC, ECF_CONST, first, sstrunc, ustrunc, unary_convert)
> +
>  DEF_INTERNAL_COND_FN (ADD, ECF_CONST, add, binary)
>  DEF_INTERNAL_COND_FN (SUB, ECF_CONST, sub, binary)
>  DEF_INTERNAL_COND_FN (MUL, ECF_CONST, smul, binary)
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 3d0689c9312..d4062434cc7 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -39,7 +39,8 @@ along with GCC; see the file COPYING3.  If not see
> HONOR_NANS
> uniform_vector_p
> expand_vec_cmp_expr_p
> -   bitmask_inv_cst_vector_p)
> +   bitmask_inv_cst_vector_p
> +   integer_half_truncated_all_ones_p)
>
>  /* Operator lists.  */
>  (define_operator_list tcc_comparison
> @@ -3210,6 +3211,15 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>   (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
>&& types_match (type, @0, @1
>
> +/* Unsigned saturation truncate, case 1 (), sizeof (WT) > sizeof (NT).
> +   SAT_U_TRUNC = (NT)x | (NT)(-(X > (WT)(NT)(-1))).  */
> +(match (unsigend_integer_sat_trunc @0)

unsigned

> + (bit_ior:c (negate (convert (gt @0 integer_half_truncated_all_ones_p)))
> +   (convert @0))
> + (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> +  && TYPE_UNSIGNED (TREE_TYPE (@0))
> +  && tree_int_cst_lt (TYPE_SIZE (type), TYPE_SIZE (TREE_TYPE (@0))

This type size relation doesn't match
integer_half_truncated_all_ones_p, which works
based on TYPE_PRECISION.  Don't you maybe want to scrap
integer_half_truncated_all_ones_p
as too restrictive and instead verify that TYPE_PRECISION (type) is
less than the
precision of @0 and that the INTEGER_CST compared against matches
'type's precision mask?
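
Something like the following, as an untested sketch (capturing the
constant as @1 instead of using the predicate):

 (match (unsigned_integer_sat_trunc @0)
  (bit_ior:c (negate (convert (gt @0 INTEGER_CST@1)))
   (convert @0))
  (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
       && TYPE_UNSIGNED (TREE_TYPE (@0))
       && TYPE_PRECISION (type) < TYPE_PRECISION (TREE_TYPE (@0))
       && wi::eq_p (wi::to_wide (@1),
                    wi::mask (TYPE_PRECISION (type), false,
                              TYPE_PRECISION (TREE_TYPE (@0)))))))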

> +
>  /

Re: [PATCH v2] Vect: Support truncate after .SAT_SUB pattern in zip

2024-06-26 Thread Richard Biener
On Mon, Jun 24, 2024 at 3:55 PM  wrote:
>
> From: Pan Li 
>
> The zip benchmark of coremark-pro has one SAT_SUB-like pattern, but
> truncated, as below:
>
> void test (uint16_t *x, unsigned b, unsigned n)
> {
>   unsigned a = 0;
>   register uint16_t *p = x;
>
>   do {
> a = *--p;
> *p = (uint16_t)(a >= b ? a - b : 0); // Truncate after .SAT_SUB
>   } while (--n);
> }
>
> Its gimple before the vect pass is as follows; it cannot hit any pattern of
> SAT_SUB and thus cannot vectorize to SAT_SUB.
>
> _2 = a_11 - b_12(D);
> iftmp.0_13 = (short unsigned int) _2;
> _18 = a_11 >= b_12(D);
> iftmp.0_5 = _18 ? iftmp.0_13 : 0;
>
> This patch would like to improve the pattern match to recognize the above
> as a truncate after .SAT_SUB pattern.  Then we will have the pattern
> similar to below, as well as eliminating the first 3 dead stmts.
>
> _2 = a_11 - b_12(D);
> iftmp.0_13 = (short unsigned int) _2;
> _18 = a_11 >= b_12(D);
> iftmp.0_5 = (short unsigned int).SAT_SUB (a_11, b_12(D));
>
> The below tests are passed for this patch.
> 1. The rv64gcv fully regression tests.
> 2. The rv64gcv build with glibc.
> 3. The x86 bootstrap tests.
> 4. The x86 fully regression tests.
>
> gcc/ChangeLog:
>
> * match.pd: Add convert description for minus and capture.
> * tree-vect-patterns.cc (vect_recog_build_binary_gimple_call): Add
> new logic to handle in_type is incompatibile with out_type,  as
> well as rename from.
> (vect_recog_build_binary_gimple_stmt): Rename to.
> (vect_recog_sat_add_pattern): Leverage above renamed func.
> (vect_recog_sat_sub_pattern): Ditto.
>
> Signed-off-by: Pan Li 
> ---
>  gcc/match.pd  |  4 +--
>  gcc/tree-vect-patterns.cc | 51 ---
>  2 files changed, 33 insertions(+), 22 deletions(-)
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 3d0689c9312..4a4b0b2e72f 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -3164,9 +3164,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  /* Unsigned saturation sub, case 2 (branch with ge):
> SAT_U_SUB = X >= Y ? X - Y : 0.  */
>  (match (unsigned_integer_sat_sub @0 @1)
> - (cond^ (ge @0 @1) (minus @0 @1) integer_zerop)
> + (cond^ (ge @0 @1) (convert? (minus (convert1? @0) (convert1? @1))) 
> integer_zerop)
>   (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> -  && types_match (type, @0, @1
> +  && TYPE_UNSIGNED (TREE_TYPE (@0)) && types_match (@0, @1

I suppose the other patterns can see similar enhancements for the cases
where their forms
show up truncated or extended?

>  /* Unsigned saturation sub, case 3 (branchless with gt):
> SAT_U_SUB = (X - Y) * (X > Y).  */
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index cef901808eb..3d887d36050 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -4490,26 +4490,37 @@ vect_recog_mult_pattern (vec_info *vinfo,
>  extern bool gimple_unsigned_integer_sat_add (tree, tree*, tree (*)(tree));
>  extern bool gimple_unsigned_integer_sat_sub (tree, tree*, tree (*)(tree));
>
> -static gcall *
> -vect_recog_build_binary_gimple_call (vec_info *vinfo, gimple *stmt,
> +static gimple *
> +vect_recog_build_binary_gimple_stmt (vec_info *vinfo, stmt_vec_info 
> stmt_info,
>  internal_fn fn, tree *type_out,
> -tree op_0, tree op_1)
> +tree lhs, tree op_0, tree op_1)
>  {
>tree itype = TREE_TYPE (op_0);
> -  tree vtype = get_vectype_for_scalar_type (vinfo, itype);
> +  tree otype = TREE_TYPE (lhs);
> +  tree v_itype = get_vectype_for_scalar_type (vinfo, itype);
> +  tree v_otype = get_vectype_for_scalar_type (vinfo, otype);
>
> -  if (vtype != NULL_TREE
> -&& direct_internal_fn_supported_p (fn, vtype, OPTIMIZE_FOR_BOTH))
> +  if (v_itype != NULL_TREE && v_otype != NULL_TREE
> +&& direct_internal_fn_supported_p (fn, v_itype, OPTIMIZE_FOR_BOTH))
>  {
>gcall *call = gimple_build_call_internal (fn, 2, op_0, op_1);
> +  tree in_ssa = vect_recog_temp_ssa_var (itype, NULL);
>
> -  gimple_call_set_lhs (call, vect_recog_temp_ssa_var (itype, NULL));
> +  gimple_call_set_lhs (call, in_ssa);
>gimple_call_set_nothrow (call, /* nothrow_p */ false);
> -  gimple_set_location (call, gimple_location (stmt));
> +  gimple_set_location (call, gimple_location (STMT_VINFO_STMT 
> (stmt_info)));
> +
> +  *type_out = v_otype;
>
> -  *type_out = vtype;
> +  if (types_compatible_p (itype, otype))
> +   return call;
> +  else
> +   {
> + append_pattern_def_seq (vinfo, stmt_info, call, v_itype);
> + tree out_ssa = vect_recog_temp_ssa_var (otype, NULL);
>
> -  return call;
> + return gimple_build_assign (out_ssa, CONVERT_EXPR, in_ssa);

Please use NOP_EXPR here.
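
i.e. (the requested change):

          return gimple_build_assign (out_ssa, NOP_EXPR, in_ssa);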

> +   }
>  }
>
>return NULL;
> @@ -4541,13 +4552,13 @@ vect_recog_sat_add_pattern (vec_info *vinfo, 
> stmt_vec_info stmt_vi

[r15-1619 Regression] FAIL: gcc.target/i386/stack-check-17.c scan-assembler-not pop on Linux/x86_64

2024-06-26 Thread haochen.jiang
On Linux/x86_64,

3b9b8d6cfdf59337f4b7ce10ce92a98044b2657b is the first bad commit
commit 3b9b8d6cfdf59337f4b7ce10ce92a98044b2657b
Author: Surya Kumari Jangala 
Date:   Tue Jun 25 08:37:49 2024 -0500

ira: Scale save/restore costs of callee save registers with block frequency

caused

FAIL: gcc.dg/pr10474.c scan-rtl-dump pro_and_epilogue "Performing shrink-wrapping"
FAIL: gcc.target/i386/force-indirect-call-2.c scan-assembler-times (?:call|jmp)[ \\t]+\\*% 3
FAIL: gcc.target/i386/pr63527.c scan-assembler-not movl[ \t]%[^,]+, %ebx
FAIL: gcc.target/i386/pr91384.c scan-assembler-not testl
FAIL: gcc.target/i386/stack-check-17.c scan-assembler-not pop

with GCC configured with

../../gcc/configure 
--prefix=/export/users/haochenj/src/gcc-bisect/master/master/r15-1619/usr 
--enable-clocale=gnu --with-system-zlib --with-demangler-in-ld 
--with-fpmath=sse --enable-languages=c,c++,fortran --enable-cet --without-isl 
--enable-libmpx x86_64-linux --disable-bootstrap

To reproduce:

$ cd {build_dir}/gcc && make check RUNTESTFLAGS="dg.exp=gcc.dg/pr10474.c 
--target_board='unix{-m64}'"
$ cd {build_dir}/gcc && make check RUNTESTFLAGS="dg.exp=gcc.dg/pr10474.c 
--target_board='unix{-m64\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="i386.exp=gcc.target/i386/force-indirect-call-2.c 
--target_board='unix{-m64}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="i386.exp=gcc.target/i386/force-indirect-call-2.c 
--target_board='unix{-m64\ -march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="i386.exp=gcc.target/i386/pr63527.c --target_board='unix{-m32}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="i386.exp=gcc.target/i386/pr63527.c --target_board='unix{-m32\ 
-march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="i386.exp=gcc.target/i386/pr91384.c --target_board='unix{-m32}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="i386.exp=gcc.target/i386/pr91384.c --target_board='unix{-m32\ 
-march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="i386.exp=gcc.target/i386/pr91384.c --target_board='unix{-m64}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="i386.exp=gcc.target/i386/pr91384.c --target_board='unix{-m64\ 
-march=cascadelake}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="i386.exp=gcc.target/i386/stack-check-17.c 
--target_board='unix{-m64}'"
$ cd {build_dir}/gcc && make check 
RUNTESTFLAGS="i386.exp=gcc.target/i386/stack-check-17.c 
--target_board='unix{-m64\ -march=cascadelake}'"

(Please do not reply to this email; for questions about this report, contact me
at haochen dot jiang at intel.com.)
(If you encounter problems related to cascadelake, disabling AVX512F on the
command line might help.)
(However, please make sure that there are no potential problems with AVX512.)


RE: [PATCH] middle-end: Implement conditional store vectorizer pattern [PR115531]

2024-06-26 Thread Richard Biener
On Wed, 26 Jun 2024, Tamar Christina wrote:

> > -Original Message-
> > From: Richard Biener 
> > Sent: Wednesday, June 26, 2024 2:23 PM
> > To: Tamar Christina 
> > Cc: gcc-patches@gcc.gnu.org; nd ; j...@ventanamicro.com
> > Subject: Re: [PATCH] middle-end: Implement conditional store vectorizer
> > pattern [PR115531]
> > 
> > On Tue, 25 Jun 2024, Tamar Christina wrote:
> > 
> > > Hi All,
> > >
> > > This adds a conditional store optimization for the vectorizer as a 
> > > pattern.
> > > The vectorizer already supports modifying memory accesses because of the
> > pattern
> > > based gather/scatter recognition.
> > >
> > > Doing it in the vectorizer allows us to still keep the ability to 
> > > vectorize such
> > > loops for architectures that don't have MASK_STORE support, whereas doing 
> > > this
> > > in ifcvt makes us commit to MASK_STORE.
> > >
> > > Concretely for this loop:
> > >
> > > void foo1 (char *restrict a, int *restrict b, int *restrict c, int n, int 
> > > stride)
> > > {
> > >   if (stride <= 1)
> > > return;
> > >
> > >   for (int i = 0; i < n; i++)
> > > {
> > >   int res = c[i];
> > >   int t = b[i+stride];
> > >   if (a[i] != 0)
> > > res = t;
> > >   c[i] = res;
> > > }
> > > }
> > >
> > > today we generate:
> > >
> > > .L3:
> > > ld1bz29.s, p7/z, [x0, x5]
> > > ld1wz31.s, p7/z, [x2, x5, lsl 2]
> > > ld1wz30.s, p7/z, [x1, x5, lsl 2]
> > > cmpne   p15.b, p6/z, z29.b, #0
> > > sel z30.s, p15, z30.s, z31.s
> > > st1wz30.s, p7, [x2, x5, lsl 2]
> > > add x5, x5, x4
> > > whilelo p7.s, w5, w3
> > > b.any   .L3
> > >
> > > which in gimple is:
> > >
> > >   vect_res_18.9_68 = .MASK_LOAD (vectp_c.7_65, 32B, loop_mask_67);
> > >   vect_t_20.12_74 = .MASK_LOAD (vectp.10_72, 32B, loop_mask_67);
> > >   vect__9.15_77 = .MASK_LOAD (vectp_a.13_75, 8B, loop_mask_67);
> > >   mask__34.16_79 = vect__9.15_77 != { 0, ... };
> > >   vect_res_11.17_80 = VEC_COND_EXPR  > vect_res_18.9_68>;
> > >   .MASK_STORE (vectp_c.18_81, 32B, loop_mask_67, vect_res_11.17_80);
> > >
> > > A MASK_STORE is already conditional, so there's no need to perform the 
> > > load of
> > > the old values and the VEC_COND_EXPR.  This patch makes it so we generate:
> > >
> > >   vect_res_18.9_68 = .MASK_LOAD (vectp_c.7_65, 32B, loop_mask_67);
> > >   vect__9.15_77 = .MASK_LOAD (vectp_a.13_75, 8B, loop_mask_67);
> > >   mask__34.16_79 = vect__9.15_77 != { 0, ... };
> > >   .MASK_STORE (vectp_c.18_81, 32B, mask__34.16_79, vect_res_18.9_68);
> > >
> > > which generates:
> > >
> > > .L3:
> > > ld1bz30.s, p7/z, [x0, x5]
> > > ld1wz31.s, p7/z, [x1, x5, lsl 2]
> > > cmpne   p7.b, p7/z, z30.b, #0
> > > st1wz31.s, p7, [x2, x5, lsl 2]
> > > add x5, x5, x4
> > > whilelo p7.s, w5, w3
> > > b.any   .L3
> > >
> > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > 
> > The idea looks good but I wonder if it's not slower in practice.
> > The issue with masked stores, in particular those where any elements
> > are actually masked out, is that such stores do not forward on any
> > uarch I know.  They also usually have a penalty for the merging
> > (the load has to be carried out anyway).
> > 
> 
> Yes, but when the predicate has all bits set it usually does.
> But forwarding aside, this gets rid of the select and the additional load,
> so purely from an instruction latency perspective it's a win.
> 
> > So - can you do an actual benchmark on real hardware where the
> > loop has (way) more than one vector iteration and where there's
> > at least one masked element during each vector iteration?
> > 
> 
> Sure, this optimization comes from exchange2 where vectoring with SVE
> ends up being slower than not vectorizing.  This change makes the 
> vectorization
> profitable and recovers about a 3% difference overall between vectorizing and 
> not.
> 
> I did run microbenchmarks over all current and future Arm cores and it was a 
> universal
> win.
> 
> I can run more benchmarks with various masks, but as mentioned above, even
> without forwarding, you still have two fewer instructions, so it's almost
> always going to win.
> 
> > > Ok for master?
> > 
> > Few comments below.
> > 
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > >   PR tree-optimization/115531
> > >   * tree-vect-patterns.cc (vect_cond_store_pattern_same_ref): New.
> > >   (vect_recog_cond_store_pattern): New.
> > >   (vect_vect_recog_func_ptrs): Use it.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >   PR tree-optimization/115531
> > >   * gcc.dg/vect/vect-conditional_store_1.c: New test.
> > >   * gcc.dg/vect/vect-conditional_store_2.c: New test.
> > >   * gcc.dg/vect/vect-conditional_store_3.c: New test.
> > >   * gcc.dg/vect/vect-conditional_store_4.c: New test.
> > >
> > > ---
> > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-conditional_store

Re: [PING] Re: [PATCH 1/2] ivopts: Revert computation of address cost complexity

2024-06-26 Thread Aleksandar Rakic
The mail I pointed to
(https://gcc.gnu.org/pipermail/gcc-patches/2024-March/647966.html) is an
answer to the topic started by Dimitrije Milošević. I replied to the topic in
the same way as you answered here:
https://sourceware.org/pipermail/gcc-patches/2022-October/604298.html
That mail elaborates on the meaning of all the tests, and it contains a link
to the patch (https://github.com/rakicaleksandar1999/gcc/tree/bug_109429),
but the link got joined with a trailing full stop.

I will not accept my work being presented as irrelevant and inappropriate.


From: Richard Biener 
Sent: Wednesday, June 26, 2024 2:50 PM
To: Aleksandar Rakic
Cc: gcc-patches@gcc.gnu.org; jeffreya...@gmail.com; rguent...@suse.de; 
ja...@redhat.com; Djordje Todorovic; Jovan Dmitrovic
Subject: Re: [PING] Re: [PATCH 1/2] ivopts: Revert computation of address cost 
complexity

On Wed, Jun 26, 2024 at 2:28 PM Aleksandar Rakic
 wrote:
>
> Hi!
>
> I'd like to ping the following patch:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2024-March/647966.html
>   a patch for the computation of the complexity for the unsupported 
> addressing modes in ivopts

The thread starting at
https://sourceware.org/pipermail/gcc-patches/2022-October/604128.html
contains much information.  The mail you point to contains
inappropriate testsuite additions,
refers to a commit that doesn't look relevant and in fact does not
"revert" anything.  I also
can't remember seeing it, it might have been classified as spam.

Instead of citing the patch by reference, I would consider re-posting it.

Richard.

>   This patch should be a fix for the bug which is described on the following 
> link:
>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109429
>   It modifies the order of the complexity calculation. By fixing the
> complexities, the candidate selection is also fixed, which leads to
> smaller code size.
>
>
> Thanks
>
> Aleksandar Rakić


RE: [PATCH v1] Internal-fn: Support new IFN SAT_TRUNC for unsigned scalar int

2024-06-26 Thread Li, Pan2
Thanks Richard, will address the comments in v2.

Pan

-Original Message-
From: Richard Biener  
Sent: Wednesday, June 26, 2024 9:52 PM
To: Li, Pan2 
Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com; 
jeffreya...@gmail.com; rdapp@gmail.com
Subject: Re: [PATCH v1] Internal-fn: Support new IFN SAT_TRUNC for unsigned 
scalar int

On Wed, Jun 26, 2024 at 3:46 AM  wrote:
>
> From: Pan Li 
>
> This patch adds the middle-end representation for saturation
> truncation, i.e. it sets the truncated result to the max value on
> overflow.  It will take a pattern similar to the one below.
>
> Form 1:
>   #define DEF_SAT_U_TRUC_FMT_1(WT, NT) \
>   NT __attribute__((noinline)) \
>   sat_u_truc_##T##_fmt_1 (WT x)\
>   {\
> bool overflow = x > (WT)(NT)(-1);  \
> return ((NT)x) | (NT)-overflow;\
>   }
>
> For example, truncating uint16_t to uint8_t, we have
>
> * SAT_TRUNC (254)   => 254
> * SAT_TRUNC (255)   => 255
> * SAT_TRUNC (256)   => 255
> * SAT_TRUNC (65536) => 255
>
> Given below SAT_TRUNC from uint64_t to uint32_t.
>
> DEF_SAT_U_TRUC_FMT_1 (uint64_t, uint32_t)
>
> Before this patch:
> __attribute__((noinline))
> uint32_t sat_u_truc_T_fmt_1 (uint64_t x)
> {
>   _Bool overflow;
>   unsigned int _1;
>   unsigned int _2;
>   unsigned int _3;
>   uint32_t _6;
>
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   overflow_5 = x_4(D) > 4294967295;
>   _1 = (unsigned int) x_4(D);
>   _2 = (unsigned int) overflow_5;
>   _3 = -_2;
>   _6 = _1 | _3;
>   return _6;
> ;;succ:   EXIT
>
> }
>
> After this patch:
> __attribute__((noinline))
> uint32_t sat_u_truc_T_fmt_1 (uint64_t x)
> {
>   uint32_t _6;
>
> ;;   basic block 2, loop depth 0
> ;;pred:   ENTRY
>   _6 = .SAT_TRUNC (x_4(D)); [tail call]
>   return _6;
> ;;succ:   EXIT
>
> }
>
> The below tests are passed for this patch:
> *. The rv64gcv fully regression tests.
> *. The rv64gcv build with glibc.
> *. The x86 bootstrap tests.
> *. The x86 fully regression tests.
>
> gcc/ChangeLog:
>
> * internal-fn.def (SAT_TRUNC): Add new signed IFN sat_trunc as
> unary_convert.
> * match.pd: Add new matching pattern for unsigned int sat_trunc.
> * optabs.def (OPTAB_CL): Add unsigned and signed optab.
> * tree-ssa-math-opts.cc (gimple_unsigend_integer_sat_trunc): Add
> new decl for the matching pattern generated func.
> (match_unsigned_saturation_trunc): Add new func impl to match
> the .SAT_TRUNC.
> (math_opts_dom_walker::after_dom_children): Add .SAT_TRUNC match
> function under BIT_IOR_EXPR case.
> * tree.cc (integer_half_truncated_all_ones_p): Add new func impl
> to filter the truncated threshold.
> * tree.h (integer_half_truncated_all_ones_p): Add new func decl.
>
> Signed-off-by: Pan Li 
> ---
>  gcc/internal-fn.def   |  2 ++
>  gcc/match.pd  | 12 +++-
>  gcc/optabs.def|  3 +++
>  gcc/tree-ssa-math-opts.cc | 32 
>  gcc/tree.cc   | 22 ++
>  gcc/tree.h|  6 ++
>  6 files changed, 76 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index a8c83437ada..915d329c05a 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -278,6 +278,8 @@ DEF_INTERNAL_SIGNED_OPTAB_FN (MULHRS, ECF_CONST | 
> ECF_NOTHROW, first,
>  DEF_INTERNAL_SIGNED_OPTAB_FN (SAT_ADD, ECF_CONST, first, ssadd, usadd, 
> binary)
>  DEF_INTERNAL_SIGNED_OPTAB_FN (SAT_SUB, ECF_CONST, first, sssub, ussub, 
> binary)
>
> +DEF_INTERNAL_SIGNED_OPTAB_FN (SAT_TRUNC, ECF_CONST, first, sstrunc, ustrunc, 
> unary_convert)
> +
>  DEF_INTERNAL_COND_FN (ADD, ECF_CONST, add, binary)
>  DEF_INTERNAL_COND_FN (SUB, ECF_CONST, sub, binary)
>  DEF_INTERNAL_COND_FN (MUL, ECF_CONST, smul, binary)
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 3d0689c9312..d4062434cc7 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -39,7 +39,8 @@ along with GCC; see the file COPYING3.  If not see
> HONOR_NANS
> uniform_vector_p
> expand_vec_cmp_expr_p
> -   bitmask_inv_cst_vector_p)
> +   bitmask_inv_cst_vector_p
> +   integer_half_truncated_all_ones_p)
>
>  /* Operator lists.  */
>  (define_operator_list tcc_comparison
> @@ -3210,6 +3211,15 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>   (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
>&& types_match (type, @0, @1
>
> +/* Unsigned saturation truncate, case 1 (), sizeof (WT) > sizeof (NT).
> +   SAT_U_TRUNC = (NT)x | (NT)(-(X > (WT)(NT)(-1))).  */
> +(match (unsigend_integer_sat_trunc @0)

unsigned

> + (bit_ior:c (negate (convert (gt @0 integer_half_truncated_all_ones_p)))
> +   (convert @0))
> + (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> +  && TYPE_UNSIGNED (TREE_TYPE (@0))
> +  && tree_int_cst_lt (TYPE_SIZE (type), TYPE_SIZE (TREE_TYPE (@

RE: [PATCH v2] Vect: Support truncate after .SAT_SUB pattern in zip

2024-06-26 Thread Li, Pan2
> I suppose the other patterns could see similar enhancements for the cases
> where their forms show up truncated or extended?

Yes, just want to highlight that this form comes from the zip benchmark.
Of course, the remaining forms are planned in follow-up patch(es).

> Please use NOP_EXPR here.

Sure, and I will send the v2 if there are no surprises from testing.

Pan

-Original Message-
From: Richard Biener  
Sent: Wednesday, June 26, 2024 9:56 PM
To: Li, Pan2 
Cc: gcc-patches@gcc.gnu.org; juzhe.zh...@rivai.ai; kito.ch...@gmail.com; 
jeffreya...@gmail.com; pins...@gmail.com
Subject: Re: [PATCH v2] Vect: Support truncate after .SAT_SUB pattern in zip

On Mon, Jun 24, 2024 at 3:55 PM  wrote:
>
> From: Pan Li 
>
> The zip benchmark of coremark-pro has one SAT_SUB-like pattern, but
> truncated, as below:
>
> void test (uint16_t *x, unsigned b, unsigned n)
> {
>   unsigned a = 0;
>   register uint16_t *p = x;
>
>   do {
> a = *--p;
> *p = (uint16_t)(a >= b ? a - b : 0); // Truncate after .SAT_SUB
>   } while (--n);
> }
>
> It has the below gimple before the vect pass; it cannot hit any pattern of
> SAT_SUB and thus cannot be vectorized to SAT_SUB.
>
> _2 = a_11 - b_12(D);
> iftmp.0_13 = (short unsigned int) _2;
> _18 = a_11 >= b_12(D);
> iftmp.0_5 = _18 ? iftmp.0_13 : 0;
>
> This patch improves the pattern match to recognize the above as a
> truncate after .SAT_SUB pattern.  Then we will have the pattern similar
> to below, as well as eliminate the first 3 dead stmts.
>
> _2 = a_11 - b_12(D);
> iftmp.0_13 = (short unsigned int) _2;
> _18 = a_11 >= b_12(D);
> iftmp.0_5 = (short unsigned int).SAT_SUB (a_11, b_12(D));
>
> The below tests are passed for this patch.
> 1. The rv64gcv fully regression tests.
> 2. The rv64gcv build with glibc.
> 3. The x86 bootstrap tests.
> 4. The x86 fully regression tests.
>
> gcc/ChangeLog:
>
> * match.pd: Add convert description for minus and capture.
> * tree-vect-patterns.cc (vect_recog_build_binary_gimple_call): Add
> new logic to handle the case where in_type is incompatible with
> out_type, as well as rename from.
> (vect_recog_build_binary_gimple_stmt): Rename to.
> (vect_recog_sat_add_pattern): Leverage above renamed func.
> (vect_recog_sat_sub_pattern): Ditto.
>
> Signed-off-by: Pan Li 
> ---
>  gcc/match.pd  |  4 +--
>  gcc/tree-vect-patterns.cc | 51 ---
>  2 files changed, 33 insertions(+), 22 deletions(-)
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 3d0689c9312..4a4b0b2e72f 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -3164,9 +3164,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  /* Unsigned saturation sub, case 2 (branch with ge):
> SAT_U_SUB = X >= Y ? X - Y : 0.  */
>  (match (unsigned_integer_sat_sub @0 @1)
> - (cond^ (ge @0 @1) (minus @0 @1) integer_zerop)
> + (cond^ (ge @0 @1) (convert? (minus (convert1? @0) (convert1? @1))) 
> integer_zerop)
>   (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
> -  && types_match (type, @0, @1
> +  && TYPE_UNSIGNED (TREE_TYPE (@0)) && types_match (@0, @1

I suppose the other patterns could see similar enhancements for the cases
where their forms show up truncated or extended?

>  /* Unsigned saturation sub, case 3 (branchless with gt):
> SAT_U_SUB = (X - Y) * (X > Y).  */
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index cef901808eb..3d887d36050 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -4490,26 +4490,37 @@ vect_recog_mult_pattern (vec_info *vinfo,
>  extern bool gimple_unsigned_integer_sat_add (tree, tree*, tree (*)(tree));
>  extern bool gimple_unsigned_integer_sat_sub (tree, tree*, tree (*)(tree));
>
> -static gcall *
> -vect_recog_build_binary_gimple_call (vec_info *vinfo, gimple *stmt,
> +static gimple *
> +vect_recog_build_binary_gimple_stmt (vec_info *vinfo, stmt_vec_info 
> stmt_info,
>  internal_fn fn, tree *type_out,
> -tree op_0, tree op_1)
> +tree lhs, tree op_0, tree op_1)
>  {
>tree itype = TREE_TYPE (op_0);
> -  tree vtype = get_vectype_for_scalar_type (vinfo, itype);
> +  tree otype = TREE_TYPE (lhs);
> +  tree v_itype = get_vectype_for_scalar_type (vinfo, itype);
> +  tree v_otype = get_vectype_for_scalar_type (vinfo, otype);
>
> -  if (vtype != NULL_TREE
> -&& direct_internal_fn_supported_p (fn, vtype, OPTIMIZE_FOR_BOTH))
> +  if (v_itype != NULL_TREE && v_otype != NULL_TREE
> +&& direct_internal_fn_supported_p (fn, v_itype, OPTIMIZE_FOR_BOTH))
>  {
>gcall *call = gimple_build_call_internal (fn, 2, op_0, op_1);
> +  tree in_ssa = vect_recog_temp_ssa_var (itype, NULL);
>
> -  gimple_call_set_lhs (call, vect_recog_temp_ssa_var (itype, NULL));
> +  gimple_call_set_lhs (call, in_ssa);
>gimple_call_set_nothrow (call, /* nothrow_p */ false);
> -  gimple_set_location (call, gimple_location (stmt));
> + 

Re: [PATCH 4/8] vect: Determine input vectype for multiple lane-reducing

2024-06-26 Thread Feng Xue OS
Updated the patches based on comments.

The input vectype of the reduction PHI statement must be determined before
vect cost computation for the reduction. Since a lane-reducing operation has
a different input vectype from a normal one, we need to traverse all reduction
statements to find out the input vectype with the least lanes, and set that on
the PHI statement.
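
For example (a minimal sketch, not taken from the patch), a reduction chain
mixing a dot-product with a normal add reads char vectors in one statement
and int vectors in the other, so the input vectypes differ and the PHI
should record the int vectype, the one with the least lanes:

int
f (signed char *a, signed char *b, int *c, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    {
      sum += a[i] * b[i];   /* dot-prod: char input vectype, more lanes.  */
      sum += c[i];          /* normal add: int input vectype, least lanes.  */
    }
  return sum;
}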

---
 gcc/tree-vect-loop.cc | 79 ++-
 1 file changed, 56 insertions(+), 23 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 347dac97e49..419f4b08d2b 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7643,7 +7643,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 {
   stmt_vec_info def = loop_vinfo->lookup_def (reduc_def);
   stmt_vec_info vdef = vect_stmt_to_vectorize (def);
-  if (STMT_VINFO_REDUC_IDX (vdef) == -1)
+  int reduc_idx = STMT_VINFO_REDUC_IDX (vdef);
+
+  if (reduc_idx == -1)
{
  if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -7686,10 +7688,57 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
  return false;
}
}
-  else if (!stmt_info)
-   /* First non-conversion stmt.  */
-   stmt_info = vdef;
-  reduc_def = op.ops[STMT_VINFO_REDUC_IDX (vdef)];
+  else
+   {
+ /* First non-conversion stmt.  */
+ if (!stmt_info)
+   stmt_info = vdef;
+
+ if (lane_reducing_op_p (op.code))
+   {
+ enum vect_def_type dt;
+ tree vectype_op;
+
+ /* The last operand of lane-reducing operation is for
+reduction.  */
+ gcc_assert (reduc_idx > 0 && reduc_idx == (int) op.num_ops - 1);
+
+ if (!vect_is_simple_use (op.ops[0], loop_vinfo, &dt, &vectype_op))
+   return false;
+
+ tree type_op = TREE_TYPE (op.ops[0]);
+
+ if (!vectype_op)
+   {
+ vectype_op = get_vectype_for_scalar_type (loop_vinfo,
+   type_op);
+ if (!vectype_op)
+   return false;
+   }
+
+ /* For lane-reducing operation vectorizable analysis needs the
+reduction PHI information */
+ STMT_VINFO_REDUC_DEF (def) = phi_info;
+
+ /* Each lane-reducing operation has its own input vectype, while
+reduction PHI will record the input vectype with the least
+lanes.  */
+ STMT_VINFO_REDUC_VECTYPE_IN (vdef) = vectype_op;
+
+ /* To accommodate lane-reducing operations of mixed input
+vectypes, choose input vectype with the least lanes for the
+reduction PHI statement, which would result in the most
+ncopies for vectorized reduction results.  */
+ if (!vectype_in
+ || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in)))
+  < GET_MODE_SIZE (SCALAR_TYPE_MODE (type_op
+   vectype_in = vectype_op;
+   }
+ else
+   vectype_in = STMT_VINFO_VECTYPE (phi_info);
+   }
+
+  reduc_def = op.ops[reduc_idx];
   reduc_chain_length++;
   if (!stmt_info && slp_node)
slp_for_stmt_info = SLP_TREE_CHILDREN (slp_for_stmt_info)[0];
@@ -7747,6 +7796,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo,

   tree vectype_out = STMT_VINFO_VECTYPE (stmt_info);
   STMT_VINFO_REDUC_VECTYPE (reduc_info) = vectype_out;
+  STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in;
+
   gimple_match_op op;
   if (!gimple_extract_op (stmt_info->stmt, &op))
 gcc_unreachable ();
@@ -7831,16 +7882,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
  = get_vectype_for_scalar_type (loop_vinfo,
 TREE_TYPE (op.ops[i]), slp_op[i]);

-  /* To properly compute ncopies we are interested in the widest
-non-reduction input type in case we're looking at a widening
-accumulation that we later handle in vect_transform_reduction.  */
-  if (lane_reducing
- && vectype_op[i]
- && (!vectype_in
- || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in)))
- < GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE 
(vectype_op[i]))
-   vectype_in = vectype_op[i];
-
   /* Record how the non-reduction-def value of COND_EXPR is defined.
 ???  For a chain of multiple CONDs we'd have to match them up all.  */
   if (op.code == COND_EXPR && reduc_chain_length == 1)
@@ -7859,14 +7900,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
}
}
 }
-  if (!vectype_in)
-vectype_in = STMT_VINFO_VECTYPE (phi_info);
-  STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in;
-
-  /* Each lane-reducing operation has its own input vectype, while red

Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

2024-06-26 Thread Feng Xue OS
Updated the patch.

For a lane-reducing operation (dot-prod/widen-sum/sad) in a loop reduction,
the current vectorizer can only handle the pattern if the reduction chain
does not contain any other operation, no matter whether it is normal or
lane-reducing.

Actually, to allow multiple arbitrary lane-reducing operations, we need to
support vectorization of a loop reduction chain with mixed input vectypes.
Since the lanes of a vectype may vary with the operation, the effective
ncopies of the vectorized statements may also differ from one operation to
another, which causes a mismatch in the vectorized def-use cycles. A simple
way is to align all operations with the one that has the most ncopies; the
gap can be filled by generating extra trivial pass-through copies. For example:

   int sum = 0;
   for (i)
 {
   sum += d0[i] * d1[i];  // dot-prod 
   sum += w[i];   // widen-sum 
   sum += abs(s0[i] - s1[i]); // sad 
   sum += n[i];   // normal 
 }

The vector size is 128-bit and the vectorization factor is 16. The reduction
statements would be transformed as:

   vector<4> int sum_v0 = { 0, 0, 0, 0 };
   vector<4> int sum_v1 = { 0, 0, 0, 0 };
   vector<4> int sum_v2 = { 0, 0, 0, 0 };
   vector<4> int sum_v3 = { 0, 0, 0, 0 };

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
   sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 += n_v0[i: 0  ~ 3 ];
   sum_v1 += n_v1[i: 4  ~ 7 ];
   sum_v2 += n_v2[i: 8  ~ 11];
   sum_v3 += n_v3[i: 12 ~ 15];
 }

2024-03-22 Feng Xue 

gcc/
PR tree-optimization/114440
* tree-vectorizer.h (vectorizable_lane_reducing): New function
declaration.
* tree-vect-stmts.cc (vect_analyze_stmt): Call new function
vectorizable_lane_reducing to analyze lane-reducing operation.
* tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation
code related to emulated_mixed_dot_prod.
(vect_reduction_update_partial_vector_usage): Compute ncopies as the
original means for single-lane slp node.
(vectorizable_lane_reducing): New function.
(vectorizable_reduction): Allow multiple lane-reducing operations in
loop reduction. Move some original lane-reducing related code to
vectorizable_lane_reducing.
(vect_transform_reduction): Extend transformation to support reduction
statements with mixed input vectypes.

gcc/testsuite/
PR tree-optimization/114440
* gcc.dg/vect/vect-reduc-chain-1.c
* gcc.dg/vect/vect-reduc-chain-2.c
* gcc.dg/vect/vect-reduc-chain-3.c
* gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
* gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
* gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
* gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
* gcc.dg/vect/vect-reduc-dot-slp-1.c
---
 .../gcc.dg/vect/vect-reduc-chain-1.c  |  62 
 .../gcc.dg/vect/vect-reduc-chain-2.c  |  77 
 .../gcc.dg/vect/vect-reduc-chain-3.c  |  66 
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  95 +
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  67 
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c  |  79 +
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c  |  63 
 .../gcc.dg/vect/vect-reduc-dot-slp-1.c|  60 
 gcc/tree-vect-loop.cc | 333 ++
 gcc/tree-vect-stmts.cc|   2 +
 gcc/tree-vectorizer.h |   2 +
 11 files changed, 836 insertions(+), 70 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c 
b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
new file mode 100644
index 000..04bfc419dbd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
@@ -0,0 +1,62 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_

Re: [PATCH 8/8] vect: Optimize order of lane-reducing statements in loop def-use cycles

2024-06-26 Thread Feng Xue OS
This patch is also adjusted with changes to two its dependent patches.

When transforming multiple lane-reducing operations in a loop reduction chain,
the corresponding vectorized statements were originally generated into def-use
cycles starting from 0. The def-use cycles with smaller indices would contain
more statements, which means more instruction dependency. For example:

   int sum = 0;
   for (i)
 {
   sum += d0[i] * d1[i];  // dot-prod 
   sum += w[i];   // widen-sum 
   sum += abs(s0[i] - s1[i]); // sad 
 }

Original transformation result:

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
   sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy
 }

For higher instruction parallelism in the final vectorized loop, an optimal
means is to distribute the effective vectorized lane-reducing statements
evenly among all def-use cycles. Transformed as below, DOT_PROD, WIDEN_SUM
and the SADs are generated into separate cycles, and the instruction
dependency can be eliminated.

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = sum_v0;  // copy
   sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = sum_v0;  // copy
   sum_v1 = sum_v1;  // copy
   sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
   sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);
 }

2024-03-22 Feng Xue 

gcc/
PR tree-optimization/114440
* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
reduc_result_pos.
* tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
statements in an optimized order.
---
 gcc/tree-vect-loop.cc | 43 +++
 gcc/tree-vectorizer.h |  6 ++
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 6bfb0e72905..783c4f2b153 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8841,9 +8841,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

-  sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-  sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-  sum_v2 = sum_v2;  // copy
+  sum_v0 = sum_v0;  // copy
+  sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+  sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
   sum_v3 = sum_v3;  // copy

   sum_v0 += n_v0[i: 0  ~ 3 ];
@@ -8851,7 +8851,12 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   sum_v2 += n_v2[i: 8  ~ 11];
   sum_v3 += n_v3[i: 12 ~ 15];
 }
-   */
+
+Moreover, for a higher instruction parallelism in final vectorized
+loop, it is considered to make those effective vectorized lane-
+reducing statements be distributed evenly among all def-use cycles.
+In the above example, SADs are generated into other cycles rather
+than that of DOT_PROD.  */
   unsigned using_ncopies = vec_oprnds[0].length ();
   unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();

@@ -8864,6 +8869,36 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
  gcc_assert (vec_oprnds[i].length () == using_ncopies);
  vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
}
+
+ /* Find suitable def-use cycles to generate vectorized statements
+into, and reorder operands based on the selection.  */
+ unsigned curr_pos = reduc_info->reduc_result_pos;
+ unsigned next_pos = (curr_pos + using_ncopies) % reduc_ncopies;
+
+ gcc_assert (curr_pos < reduc_ncopies);
+  reduc_info->reduc_result_pos = next_pos;
+
+ if (curr_pos)
+   {
+ unsigned count = reduc_ncopies - using_ncopies;
+ unsigned start = curr_pos - count;
+
+ if ((int) start < 0)
+   {
+ count = curr_pos;
+ start = 0;
+   }
+
+ for (unsigned i = 0; i < op.num_ops - 1; i++)
+   {
+ for (unsigned j = using_ncopies; j > start; j--)
+   {
+ unsigned k = j - 1;
+

[PATCH] vect: Fix shift-by-induction for single-lane slp

2024-06-26 Thread Feng Xue OS
Allow shift-by-induction for an slp node when it is single-lane, which is
aligned with the original loop-based handling.
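
The kind of loop this affects shifts by an induction variable; a minimal
sketch, not taken from the patch or the testsuite:

void
f (int *restrict out, int *restrict in, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = in[i] << (i & 31);   /* Shift amount is the induction var.  */
}

With the change, a single-lane SLP node for such a shift is handled like the
original loop-based case instead of requiring a scalar shift argument.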

Thanks,
Feng

---
 gcc/tree-vect-stmts.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index ca6052662a3..840e162c7f0 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -6247,7 +6247,7 @@ vectorizable_shift (vec_info *vinfo,
   if ((dt[1] == vect_internal_def
|| dt[1] == vect_induction_def
|| dt[1] == vect_nested_cycle)
-  && !slp_node)
+  && (!slp_node || SLP_TREE_LANES (slp_node) == 1))
 scalar_shift_arg = false;
   else if (dt[1] == vect_constant_def
   || dt[1] == vect_external_def
-- 
2.17.1


[PATCH] c: ICE with invalid sizeof [PR115642]

2024-06-26 Thread Marek Polacek
Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?

-- >8 --
Here we ICE in c_expr_sizeof_expr on an erroneous expr.value.  The
code checks for expr.value == error_mark_node but here the e_m_n is
wrapped in a C_MAYBE_CONST_EXPR.  I don't think we should have created
such a tree, so let's return earlier in c_cast_expr.

PR c/115642

gcc/c/ChangeLog:

* c-typeck.cc (c_cast_expr): Return error_mark_node if build_c_cast
failed.

gcc/testsuite/ChangeLog:

* gcc.dg/noncompile/sizeof-1.c: New test.
---
 gcc/c/c-typeck.cc  | 3 +++
 gcc/testsuite/gcc.dg/noncompile/sizeof-1.c | 7 +++
 2 files changed, 10 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/noncompile/sizeof-1.c

diff --git a/gcc/c/c-typeck.cc b/gcc/c/c-typeck.cc
index ffcab7df4d3..8c03a7731c4 100644
--- a/gcc/c/c-typeck.cc
+++ b/gcc/c/c-typeck.cc
@@ -6695,6 +6695,9 @@ c_cast_expr (location_t loc, struct c_type_name 
*type_name, tree expr)
 return error_mark_node;
 
   ret = build_c_cast (loc, type, expr);
+  if (ret == error_mark_node)
+return error_mark_node;
+
   if (type_expr)
 {
   bool inner_expr_const = true;
diff --git a/gcc/testsuite/gcc.dg/noncompile/sizeof-1.c 
b/gcc/testsuite/gcc.dg/noncompile/sizeof-1.c
new file mode 100644
index 000..db7e2044b11
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/noncompile/sizeof-1.c
@@ -0,0 +1,7 @@
+/* PR c/115642 */
+/* { dg-do compile } */
+
+void f (int N) {
+  int a[2][N];
+  sizeof ((int [2][N])a); /* { dg-error "cast specifies array type" } */
+}

base-commit: 47b68cda2c4afe32e84c5f18da0196c39e5e0edf
-- 
2.45.2



Re: [PATCH] Hard register asm constraint

2024-06-26 Thread Paul Koning



> On Jun 26, 2024, at 8:54 AM, Stefan Schulze Frielinghaus 
>  wrote:
> 
> On Tue, Jun 25, 2024 at 01:02:39PM -0400, Paul Koning wrote:
>> 
>> 
>>> On Jun 25, 2024, at 12:04 PM, Stefan Schulze Frielinghaus 
>>>  wrote:
>>> 
>>> On Tue, Jun 25, 2024 at 10:03:34AM -0400, Paul Koning wrote:
 
>>> ...
>>> could be rewritten into
>>> 
>>> int test (int x, int y)
>>> {
>>> asm ("foo %0,%1,%2" : "+{r4}" (x) : "{r5}" (y), "d" (y));
>>> return x;
>>> }
 
 I like this idea but I'm wondering: regular constraints specify what sort 
 of value is needed, for example an int vs. a short int vs. a float.  The 
 notation you've shown doesn't seem to have that aspect.
>>> 
>>> As Maciej already pointed out the type of the expression should suffice.
>>> My assumption was that an asm can deal with a value as is or its
>>> promoted value.  At least for integer values this should be fine and
>>> AFAICS is also the case for simple constraints like "r" which do not
>>> define any mode.  I've probably overlooked something, but which constraint
>>> differentiates between int vs short?  However, you have a good point
>>> with this and I should test this more.
>> 
>> I thought there was but I may be confused.  On the other hand, there 
>> definitely are (machine dependent) constraints that distinguish, say, float 
>> from integer registers; pdp11 is an example.  If you were to use an "a" 
>> constraint, that means a floating point register and the compiler will 
>> detect attempts to pass non-float operands ("Inconsistent operand 
>> constraints...").
>> 
>> I see that the existing "register int ..." syntax appears to check that the 
>> register is the right type for the data type given for it, so for example on 
>> pdp11, 
>> 
>>  register int ac1 asm ("ac1") = i;
>> 
>> fails ("register ... isn't suitable for data type").  I assume your new 
>> syntax would perform the same check and produce roughly the same error 
>> message.  You might verify that.  On pdp11, trying to use, for example, "r0" 
>> for a float, or "ac0" for an int, would produce that error.
> 
> Right, so far I don't error out here, which I will change.  It basically
> results in bit-casting floats to ints currently.

That would be bad.  For one thing, a PDP11 float doesn't fit in an integer 
register.

That also brings up another point (which applies to more mainstream targets as 
well): for data types that require multiple registers, say a register pair for 
a double length value, how is that handled?  One possible answer is to reject 
that.  Another would be to load a register pair.

This case applies to a "long int" on pdp11, or 32 bit MIPS, and probably a 
bunch of others.

paul
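
To make the register-pair case concrete, with the existing local register
variable syntax it looks like the following; a minimal sketch, assuming a
32-bit target where "r4" names the low half of an even/odd pair:

long long
f (long long x)
{
  /* May bind to the r4/r5 pair on a 32-bit target; whether the proposed
     {r4} constraint should accept (or reject) this is the open question.  */
  register long long v asm ("r4") = x;
  asm ("foo %0" : "+r" (v));
  return v;
}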



Re: [PATCH] Add rvalue::get_name method (and its C equivalent)

2024-06-26 Thread David Malcolm
On Mon, 2024-04-22 at 19:56 +0200, Guillaume Gomez wrote:
> `param` also inherits from `lvalue`. I don't think adding this check is
> a good idea because it will not evolve nicely if more changes are made in
> libgccjit.

Sorry for not responding earlier.

I think I agree with Guillaume here.

Looking at the checklist at:
https://gcc.gnu.org/onlinedocs/jit/internals/index.html#submitting-patches
the patch is missing:

- a feature macro in libgccjit.h such as
#define LIBGCCJIT_HAVE_gcc_jit_lvalue_get_name

- documentation for the new C entrypoint
- documentation for the new ABI tag (see topics/compatibility.rst).

Other than that, the patch looks reasonable to me.

Dave

> 
> On Mon, Apr 22, 2024 at 17:19, Antoni Boucher  wrote:
> > 
> > For your new API endpoint, please add a check like:
> > 
> >    RETURN_IF_FAIL (lvalue->is_global () || lvalue->is_local (),
> >   NULL,
> >   NULL,
> >   "lvalue should be a variable");
> > 
> > 
> > > On 2024-04-22 at 09:16, Guillaume Gomez wrote:
> > > Good point!
> > > 
> > > New patch attached.
> > > 
> > > > On Mon, Apr 22, 2024 at 15:13, Antoni Boucher  wrote:
> > > > 
> > > > Please move the function to be on lvalue since there are no
> > > > rvalue types
> > > > that are not lvalues that have a name.
> > > > 
> > > > > On 2024-04-22 at 09:04, Guillaume Gomez wrote:
> > > > > Hey Arthur :)
> > > > > 
> > > > > > Is there any reason for that getter to return a mutable
> > > > > > pointer to the
> > > > > > name? Would something like this work instead if you're just
> > > > > > looking at
> > > > > > getting the name?
> > > > > > 
> > > > > > +  virtual string * get_name () const { return NULL; }
> > > > > > 
> > > > > > With of course adequate modifications to the inheriting
> > > > > > classes.
> > > > > 
> > > > > Good catch, thanks!
> > > > > 
> > > > > Updated the patch and attached the new version to this email.
> > > > > 
> > > > > Cordially.
> > > > > 
> > > > > On Mon, Apr 22, 2024 at 11:51, Arthur Cohen  wrote:
> > > > > > 
> > > > > > Hey Guillaume :)
> > > > > > 
> > > > > > On 4/20/24 01:05, Guillaume Gomez wrote:
> > > > > > > Hi,
> > > > > > > 
> > > > > > > I just encountered the need to retrieve the name of an
> > > > > > > `rvalue` (if
> > > > > > > there is one) while working on the Rust GCC backend.
> > > > > > > 
> > > > > > > This patch adds a getter to retrieve the information.
> > > > > > > 
> > > > > > > Cordially.
> > > > > > 
> > > > > > >  virtual bool get_wide_int (wide_int *) const {
> > > > > > > return false; }
> > > > > > > 
> > > > > > > +  virtual string * get_name () { return NULL; }
> > > > > > > +
> > > > > > >    private:
> > > > > > >  virtual enum precedence get_precedence () const = 0;
> > > > > > 
> > > > > > Is there any reason for that getter to return a mutable
> > > > > > pointer to the
> > > > > > name? Would something like this work instead if you're just
> > > > > > looking at
> > > > > > getting the name?
> > > > > > 
> > > > > > +  virtual string * get_name () const { return NULL; }
> > > > > > 
> > > > > > With of course adequate modifications to the inheriting
> > > > > > classes.
> > > > > > 
> > > > > > Best,
> > > > > > 
> > > > > > Arthur
> 



Re: [PATCH] libgccjit: Add ability to get the alignment of a type

2024-06-26 Thread David Malcolm
On Thu, 2024-04-04 at 18:59 -0400, Antoni Boucher wrote:
> Hi.
> This patch adds a new API to produce an rvalue representing the 
> alignment of a type.
> Thanks for the review.

Patch looks good to me (but may need the usual ABI version updates when
merging).

Thanks; sorry for the delay in reviewing.
Dave 



Re: [PATCH] libgccjit: Make new_array_type take unsigned long

2024-06-26 Thread David Malcolm
On Fri, 2024-02-23 at 09:55 -0500, Antoni Boucher wrote:
> I had forgotten to add the doc since there is now a new API.
> Here it is.

Sorry for the delay; the updated patch looks good to me (but may need
usual ABI tag changes when pushing).

Thanks
Dave


> 
> On Wed, 2024-02-21 at 19:45 -0500, Antoni Boucher wrote:
> > Thanks for the review.
> > 
> > Here's the updated patch.
> > 
> > On Thu, 2023-12-07 at 20:04 -0500, David Malcolm wrote:
> > > On Thu, 2023-12-07 at 17:29 -0500, Antoni Boucher wrote:
> > > > Hi.
> > > > This patches update gcc_jit_context_new_array_type to take the
> > > > size
> > > > as
> > > > an unsigned long instead of a int, to allow creating bigger
> > > > array
> > > > types.
> > > > 
> > > > I haven't written the ChangeLog yet because I wasn't sure it's
> > > > allowed
> > > > to change the type of a function like that.
> > > > If it isn't, what would you suggest?
> > > 
> > > We've kept ABI compatibility all the way back to the version in
> > > GCC
> > > 5,
> > > so it seems a shame to break ABI.
> > > 
> > > How about a new API entrypoint:
> > >   gcc_jit_context_new_array_type_unsigned_long
> > > whilst keeping the old one.
> > > 
> > > Then everything internally can use "unsigned long"; we just keep
> > > the
> > > old entrypoint accepting int (which internally promotes the arg
> > > to
> > > unsigned long, if positive, sharing all the implementation).
> > > 
> > > Alternatively, I think there may be a way to do this with symbol
> > > versioning:
> > >   https://gcc.gnu.org/wiki/SymbolVersioning
> > > see e.g. Section 3.7 of Ulrich Drepper's "How To Write Shared
> > > Libraries", but I'm a bit wary of cross-platform compatibility
> > > with
> > > that.
> > > 
> > > Dave
> > > 
> > > 
> > 
> 



Re: [PATCH] libgccjit: Allow comparing array types

2024-06-26 Thread David Malcolm
On Thu, 2024-01-25 at 07:52 -0500, Antoni Boucher wrote:
> Thanks.
> Can we please agree on some wording to use so I know when the patch
> can
> be pushed. Especially since we're now in stage 4, it would help me if
> you say something like "you can push to master".

Sorry about the ambiguity.

Yes, you can push this one.

Thanks
Dave


> Regards.
> 
> On Wed, 2024-01-24 at 12:14 -0500, David Malcolm wrote:
> > On Fri, 2024-01-19 at 16:55 -0500, Antoni Boucher wrote:
> > > Hi.
> > > This patch allows comparing different instances of array types as
> > > equal.
> > > Thanks for the review.
> > 
> > Thanks; the patch looks good to me.
> > 
> > Dave
> > 
> 



[PATCH] c++: unresolved overload with comma op [PR115430]

2024-06-26 Thread Marek Polacek
Bootstrapped/regtested on x86_64-pc-linux-gnu, ok for trunk?

-- >8 --
This works:

  template
  int Func(T);
  typedef int (*funcptrtype)(int);
  funcptrtype fp0 = &Func;

but this doesn't:

  funcptrtype fp2 = (0, &Func);

because we only call resolve_nondeduced_context on the LHS (via
convert_to_void) but not on the RHS, so cp_build_compound_expr's
type_unknown_p check issues an error.

PR c++/115430

gcc/cp/ChangeLog:

* typeck.cc (cp_build_compound_expr): Call resolve_nondeduced_context
on RHS.

gcc/testsuite/ChangeLog:

* g++.dg/cpp0x/noexcept41.C: Remove dg-error.
* g++.dg/overload/addr3.C: New test.
---
 gcc/cp/typeck.cc|  4 +++-
 gcc/testsuite/g++.dg/cpp0x/noexcept41.C |  2 +-
 gcc/testsuite/g++.dg/overload/addr3.C   | 24 
 3 files changed, 28 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/overload/addr3.C

diff --git a/gcc/cp/typeck.cc b/gcc/cp/typeck.cc
index 50f48768a95..55ee867d329 100644
--- a/gcc/cp/typeck.cc
+++ b/gcc/cp/typeck.cc
@@ -8157,6 +8157,8 @@ cp_build_compound_expr (tree lhs, tree rhs, 
tsubst_flags_t complain)
   return rhs;
 }
 
+  rhs = resolve_nondeduced_context (rhs, complain);
+
   if (type_unknown_p (rhs))
 {
   if (complain & tf_error)
@@ -8164,7 +8166,7 @@ cp_build_compound_expr (tree lhs, tree rhs, 
tsubst_flags_t complain)
  "no context to resolve type of %qE", rhs);
   return error_mark_node;
 }
-  
+
   tree ret = build2 (COMPOUND_EXPR, TREE_TYPE (rhs), lhs, rhs);
   if (eptype)
 ret = build1 (EXCESS_PRECISION_EXPR, eptype, ret);
diff --git a/gcc/testsuite/g++.dg/cpp0x/noexcept41.C 
b/gcc/testsuite/g++.dg/cpp0x/noexcept41.C
index 4cd3d8d7854..7c65cebb618 100644
--- a/gcc/testsuite/g++.dg/cpp0x/noexcept41.C
+++ b/gcc/testsuite/g++.dg/cpp0x/noexcept41.C
@@ -9,4 +9,4 @@ template  struct a {
 };
 template  auto f(d &&, c &&) -> decltype(declval);
 struct e {};
-static_assert((e{}, declval>),""); // { dg-error "no context to resolve 
type" }
+static_assert((e{}, declval>),"");
diff --git a/gcc/testsuite/g++.dg/overload/addr3.C 
b/gcc/testsuite/g++.dg/overload/addr3.C
new file mode 100644
index 000..b203326de32
--- /dev/null
+++ b/gcc/testsuite/g++.dg/overload/addr3.C
@@ -0,0 +1,24 @@
+// PR c++/115430
+// { dg-do compile }
+
+template
+int Func(T);
+typedef int (*funcptrtype)(int);
+funcptrtype fp0 = &Func;
+funcptrtype fp1 = +&Func;
+funcptrtype fp2 = (0, &Func);
+funcptrtype fp3 = (0, +&Func);
+funcptrtype fp4 = (0, 1, &Func);
+
+template
+void
+g ()
+{
+  funcptrtype fp5 = (0, &Func);
+}
+
+void
+f ()
+{
+  g();
+}

base-commit: 47b68cda2c4afe32e84c5f18da0196c39e5e0edf
-- 
2.45.2



Re: [PATCH] libgccjit: Add support for the type bfloat16

2024-06-26 Thread David Malcolm
On Wed, 2024-02-21 at 10:56 -0500, Antoni Boucher wrote:
> Thanks for the review.
> Here's the updated patch.

Thanks for the update patch; sorry for the delay in reviewing.

The updated patch looks good for trunk.

Dave

> 
> On Fri, 2023-12-01 at 12:45 -0500, David Malcolm wrote:
> > On Thu, 2023-11-16 at 17:20 -0500, Antoni Boucher wrote:
> > > I forgot to attach the patch.
> > > 
> > > On Thu, 2023-11-16 at 17:19 -0500, Antoni Boucher wrote:
> > > > Hi.
> > > > This patch adds the support for the type bfloat16 (bug 112574).
> > > > 
> > > > This was asked to be splitted from a another patch sent here:
> > > > https://gcc.gnu.org/pipermail/jit/2023q1/001607.html
> > > > 
> > > > Thanks for the review.
> > > 
> > 
> > Thanks for the patch.
> > 
> > > diff --git a/gcc/jit/jit-playback.cc b/gcc/jit/jit-playback.cc
> > > index 18cc4da25b8..7e1c97a4638 100644
> > > --- a/gcc/jit/jit-playback.cc
> > > +++ b/gcc/jit/jit-playback.cc
> > > @@ -280,6 +280,8 @@ get_tree_node_for_type (enum gcc_jit_types
> > > type_)
> > >  
> > >  case GCC_JIT_TYPE_FLOAT:
> > >    return float_type_node;
> > > +    case GCC_JIT_TYPE_BFLOAT16:
> > > +  return bfloat16_type_node;
> > 
> > The code to create bfloat16_type_node (in build_common_tree_nodes)
> > is
> > guarded by #ifdef HAVE_BFmode, so we should probably have a test
> > for
> > this in case GCC_JIT_TYPE_BFLOAT16 to at least add an error message
> > when it's NULL_TREE, rather than silently returning NULL_TREE and
> > crashing.
> > 
> > [...]
> > 
> > > diff --git a/gcc/testsuite/jit.dg/test-bfloat16.c
> > > b/gcc/testsuite/jit.dg/test-bfloat16.c
> > > new file mode 100644
> > > index 000..6aed3920351
> > > --- /dev/null
> > > +++ b/gcc/testsuite/jit.dg/test-bfloat16.c
> > > @@ -0,0 +1,37 @@
> > > +/* { dg-do compile { target x86_64-*-* } } */
> > > +
> > > +#include 
> > > +#include 
> > > +
> > > +#include "libgccjit.h"
> > > +
> > > +/* We don't want set_options() in harness.h to set -O3 so our
> > > little local
> > > +   is optimized away. */
> > > +#define TEST_ESCHEWS_SET_OPTIONS
> > > +static void set_options (gcc_jit_context *ctxt, const char
> > > *argv0)
> > > +{
> > > +}
> > 
> > 
> > Please add a comment to all-non-failing-tests.h noting the
> > exclusion
> > of
> > this test case from the array.
> > 
> > [...]
> > 
> > > diff --git a/gcc/testsuite/jit.dg/test-types.c
> > > b/gcc/testsuite/jit.dg/test-types.c
> > > index a01944e35fa..9e7c4f3e046 100644
> > > --- a/gcc/testsuite/jit.dg/test-types.c
> > > +++ b/gcc/testsuite/jit.dg/test-types.c
> > > @@ -1,3 +1,4 @@
> > > +#include 
> > >  #include 
> > >  #include 
> > >  #include 
> > > @@ -492,4 +493,5 @@ verify_code (gcc_jit_context *ctxt,
> > > gcc_jit_result *result)
> > >  
> > >    CHECK_VALUE (gcc_jit_type_get_size (gcc_jit_context_get_type
> > > (ctxt, GCC_JIT_TYPE_FLOAT)), sizeof (float));
> > >    CHECK_VALUE (gcc_jit_type_get_size (gcc_jit_context_get_type
> > > (ctxt, GCC_JIT_TYPE_DOUBLE)), sizeof (double));
> > > +  CHECK_VALUE (gcc_jit_type_get_size (gcc_jit_context_get_type
> > > (ctxt, GCC_JIT_TYPE_BFLOAT16)), sizeof (__bfloat16));
> > 
> > 
> > This is only going to work on targets which #ifdef HAVE_BFmode, so
> > this
> > CHECK_VALUE needs to be conditionalized somehow, to avoid having
> > this,
> > test-combination, and test-threads from bailing out early on
> > targets
> > without BFmode.
> > 
> > Dave
> > 
> 



Re: [PATCH] RISC-V: Add support for Zabha extension

2024-06-26 Thread Andrea Parri
> Tested using amo.exp with rv64gc_zalrsc, rv64id_zaamo, rv64id_zalrsc,
> rv64id_zabha (using tip-of-tree qemu w/ zabha patches [2] applied for
> execution tests).

My interpretation of the Zabha specification, in particular of "The Zabha
extension depends upon the Zaamo standard extension", is that rv64id_zabha
should result in a dependency violation (some compiler warning).

The changes at stake seem instead to make the Zabha extension "select" the
Zaamo extension; IOW, these changes seem to make rv64id_zabha an alias of
rv64id_zaamo_zabha: I am wondering whether this was intentional?

  Andrea


Re: [PATCH v2 0/6] Add DLL import/export implementation to AArch64

2024-06-26 Thread Andrew Pinski
On Fri, Jun 7, 2024 at 2:45 AM Evgeny Karpov
 wrote:
>
> Hello,
>
> Thank you for reviewing v1!
> v2 addresses all comments on v1.
>
> Changes in v2:
> - Move winnt.h and winnt-dll.h to config.gcc.
> - Resolve the issue with GCC GC in winnt-dll.cc.
> - Add definitions for GOT_ALIAS_SET, 
> PE_COFF_EXTERN_DECL_SHOULD_BE_LEGITIMIZED, and HAVE_64BIT_POINTERS to 
> cygming.h.
> - Replace intermediate functions for PECOFF with ifdef checks in ix86.
> - Update the copyright date in winnt-dll.cc.
> - Correct the style.
> - Rebase from 7th June 2024

I think this caused profilebootstrap failure on x86_64-linux-gnu.
I notice the definition of GOT_ALIAS_SET for all non-mingw targets is
now just -1. That seems very wrong to me.
It was originally:
```
alias_set_type
x86_GOT_alias_set (void)
{
  static alias_set_type set = -1;
  if (set == -1)
set = new_alias_set ();
  return set;
}
```
And GOT_ALIAS_SET is used in more than COFF areas. Can you please fix
this definition?

Thanks,
Andrew Pinski


>
> Regards,
> Evgeny
>
> Evgeny Karpov (6):
>   Move mingw_* declarations to the mingw folder
>   Extract ix86 dllimport implementation to mingw
>   Rename functions for reuse in AArch64
>   aarch64: Add selectany attribute handling
>   Adjust DLL import/export implementation for AArch64
>   aarch64: Add DLL import/export to AArch64 target
>
>  gcc/config.gcc  |  20 ++-
>  gcc/config/aarch64/aarch64-protos.h |   5 -
>  gcc/config/aarch64/aarch64.cc   |  42 -
>  gcc/config/aarch64/cygming.h|  33 +++-
>  gcc/config/i386/cygming.h   |  16 +-
>  gcc/config/i386/i386-expand.cc  |   4 +-
>  gcc/config/i386/i386-expand.h   |   2 -
>  gcc/config/i386/i386-protos.h   |  10 --
>  gcc/config/i386/i386.cc | 205 ++--
>  gcc/config/i386/i386.h  |   2 +
>  gcc/config/mingw/mingw32.h  |   2 +-
>  gcc/config/mingw/t-cygming  |   6 +
>  gcc/config/mingw/winnt-dll.cc   | 231 
>  gcc/config/mingw/winnt-dll.h|  30 
>  gcc/config/mingw/winnt.cc   |  10 +-
>  gcc/config/mingw/winnt.h|  38 +
>  16 files changed, 423 insertions(+), 233 deletions(-)
>  create mode 100644 gcc/config/mingw/winnt-dll.cc
>  create mode 100644 gcc/config/mingw/winnt-dll.h
>  create mode 100644 gcc/config/mingw/winnt.h
>
> --
> 2.25.1
>


Re: [PATCH v2] MIPS: Output $0 for conditional trap if !ISA_HAS_COND_TRAPI

2024-06-26 Thread Maciej W. Rozycki
On Thu, 20 Jun 2024, YunQiang Su wrote:

> MIPSr6 removes the conditional trap instructions with an immediate, so an
> instruction like `teq $2,imm` will be converted to
>   li $at, imm
>   teq $2, $at
> 
> The current version of GAS cannot detect when imm is zero and output
>   teq $2, $0
> Let's do it in GCC.

 This description should state that the change is a fix for an actual bug
in GCC where the output pattern does not match the constraints supplied,
and what consequences this has, which the fix addresses.  There is no `imm'
in the general sense here, just the special case of zero.

 The missed optimisation in GAS, which used not to trigger pre-R6, is 
irrelevant from this change's point of view and just adds noise.  I'm 
surprised that it worked even in the first place, as I reckon GCC is 
supposed to emit regular MIPS code in the `.set nomacro' mode nowadays, 
which is the only way to guarantee that instruction lengths known to GCC 
do not accidentally disagree with what the assembler has produced, such 
as in the case of the bug your change has addressed.

 Overall ISTM there is no need for distinct insns for ISA_HAS_COND_TRAPI
and !ISA_HAS_COND_TRAPI cases each and this would better be sorted with 
predicates and constraints, especially as the output pattern is the same 
in both cases anyway.  This would prevent special-casing from being needed 
in `mips_expand_conditional_trap' as well.

  Maciej


Re: [PATCH] RISC-V: Add support for Zabha extension

2024-06-26 Thread Patrick O'Neill


On 6/26/24 08:50, Andrea Parri wrote:

Tested using amo.exp with rv64gc_zalrsc, rv64id_zaamo, rv64id_zalrsc,
rv64id_zabha (using tip-of-tree qemu w/ zabha patches [2] applied for
execution tests).

My interpretation of the Zabha specification, in particular of "The Zabha
extension depends upon the Zaamo standard extension", is that rv64id_zabha
should result in a dependency violation (some compiler warning).

The changes at stake seem instead to make the Zabha extension "select" the
Zaamo extension; IOW, these changes seem to make rv64id_zabha an alias of
rv64id_zaamo_zabha: I am wondering whether this was intentional?


Hi Andrea,

Thanks for highlighting this.

This is intentional - my understanding is that GCC adds extensions if 
the specified extensions depend upon them.


For example in the Zvfh spec: "The Zvfh extension depends on the Zve32f 
and Zfhmin extensions."


In riscv_implied_info zve32f and zfhmin are implied by zvfh: {"zvfh", 
"zve32f"}, {"zvfh", "zfhmin"}


This can be seen here: https://godbolt.org/z/63518Wrcj in the .attribute 
arch string: ...zfhmin1p0_zve32f1p0_...
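
Along the same lines, the Zabha patch would add an analogous entry to
riscv_implied_info; a sketch, assuming the same table layout as the zvfh
entries above:

static const riscv_implied_info_t riscv_implied_info[] =
{
  /* ...  */
  {"zabha", "zaamo"},  /* Zabha depends upon (and so implies) Zaamo.  */
  /* ...  */
};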


Side tangent: oddly enough it looks like zvfh does not require/imply the 
v extension?


Patrick


   Andrea

Re: [PATCH] RISC-V: Add support for Zabha extension

2024-06-26 Thread Andrea Parri
> > My interpretation of the Zabha specification, in particular of "The Zabha
> > extension depends upon the Zaamo standard extension", is that rv64id_zabha
> > should result in a dependency violation (some compiler warning).
> > 
> > The changes at stake seem instead to make the Zabha extension "select" the
> > Zaamo extension; IOW, these changes seem to make rv64id_zabha an alias of
> > rv64id_zaamo_zabha: I am wondering whether this was intentional?
> 
> Hi Andrea,
> 
> Thanks for highlighting this.
> 
> This is intentional - my understanding is that GCC adds extensions if the
> specified extensions depend upon them.

Thanks for the clarification.

For the patch at stake,

Tested-by: Andrea Parri 

  Andrea


[Committed] RISC-V: AMO testsuite cleanup

2024-06-26 Thread Patrick O'Neill



On 6/25/24 14:34, Jeff Law wrote:



On 6/25/24 3:14 PM, Patrick O'Neill wrote:
This is another round of AMO testcase cleanup. Consolidates a lot of
testcases and unifies the testcase names.

Patrick O'Neill (3):
   RISC-V: Rename amo testcases
   RISC-V: Consolidate amo testcase variants
   RISC-V: Update testcase comments to point to PSABI rather than Table
 A.6

[ ... ]
This series is OK for the trunk.

Thanks,
Jeff


Committed.

Patrick



Re: [PATCH] RISC-V: Add support for Zabha extension

2024-06-26 Thread Palmer Dabbelt

On Wed, 26 Jun 2024 08:50:57 PDT (-0700), Andrea Parri wrote:

Tested using amo.exp with rv64gc_zalrsc, rv64id_zaamo, rv64id_zalrsc,
rv64id_zabha (using tip-of-tree qemu w/ zabha patches [2] applied for
execution tests).


My interpretation of the Zabha specification, in particular of "The Zabha
extension depends upon the Zaamo standard extension", is that rv64id_zabha
should result in a dependency violation (some compiler warning).

The changes at stake seem instead to make the Zabha extension "select" the
Zaamo extension; IOW, these changes seem to make rv64id_zabha an alias of
rv64id_zaamo_zabha: I am wondering whether this was intentional?


I think your interpretation of "depends on" is reasonable, but it's not 
the way we've handled it for other extension dependencies.  For the 
others we're treating "depends on" the way this code does, i.e. enabling 
the depended-on extensions implicitly.  IIRC that's how the RISC-V specs 
want it to be.


That said, we do call it "implied" in the sources because that's really 
the right word for it.  So we should probably add something to the docs 
that describes how/why things are this way, as I don't think it's the 
first time someone's been confused.


Maybe just something like

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 23d90db2925..429275d56df 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -31037,6 +31037,10 @@ If both @option{-march} and @option{-mcpu=} are not 
specified, the default for
this argument is system dependent, users who want a specific architecture
extensions should specify one explicitly.

+When the RISC-V specifications define an extension as depending on other
+extensions, GCC will implicitly add the depended-on extensions to the enabled
+extension set if they weren't added explicitly.
+
@opindex mcpu
@item -mcpu=@var{processor-string}
Use architecture of and optimize the output for the given processor, specified

would do it?



  Andrea

