Re: [PATCH 1/3] Add TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Richard Biener
On Mon, 29 Jul 2024, Richard Sandiford wrote:

> Richard Biener  writes:
> > On Mon, 29 Jul 2024, Jakub Jelinek wrote:
> >> And, for the GET_MODE_INNER, I also meant it for Aarch64/RISC-V VL vectors,
> >> I think those should be considered as true by the hook, not false
> >> because maybe_ne.
> >
> > I don't think relevant modes will have size/precision mismatches
> > and maybe_ne should work here.  Richard?
> 
> Yeah, I think that's true for AArch64 at least (not sure about RVV).
> 
> One wrinkle is that VNx16BI (every bit of a predicate) is technically
> suitable for memcpy, even though it would be a bad choice performance-wise.
> But VNx8BI (every even bit of a predicate) wouldn't, since the odd bits
> are undefined on read.
> 
> Arguably, this means that VNx8BI has the wrong precision, but like you
> say, we don't (AFAIK) support bitsize != precision for vector modes.
> Instead, the information that there is only one meaningful bit per
> boolean is represented by having an inner mode of BI.  Both VNx16BI
> and VNx8BI have an inner mode of BI, which means that VNx8BI's
> precision is not equal to its nunits * its unit precision.
> 
> So I suppose:
> 
>   maybe_ne (GET_MODE_BITSIZE (mode),
> GET_MODE_UNIT_PRECISION (mode) * GET_MODE_NUNITS (mode))
> 
> would capture this.

OK, I'll adjust like this.
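For illustration, the condition above could be wrapped up roughly like
this (a sketch only; the helper name is made up and this is not the
committed change):

/* True iff every storage bit of MODE is meaningful payload, i.e. its
   bitsize equals nunits times the unit precision.  VNx8BI-style
   predicate modes with undefined padding bits fail this test.  */
static bool
sketch_all_bits_meaningful_p (machine_mode mode)
{
  return known_eq (GET_MODE_BITSIZE (mode),
                   GET_MODE_UNIT_PRECISION (mode) * GET_MODE_NUNITS (mode));
}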

> Targets that want a vector bool mode with 2 meaningful bits per boolean
> are expected to define a 2-bit scalar boolean mode and use that as the
> inner mode.  So I think the condition above would (correctly) continue
> to allow those.

Hmm, but I think SVE mask registers could be used to transfer bits?
I tried the following

typedef svint64_t v4dfm __attribute__((vector_mask));

void __GIMPLE(ssa) foo(void *p)
{
  v4dfm _2;

__BB(2):
  _2 = __MEM <v4dfm> ((v4dfm *)p);
  __MEM <v4dfm> ((v4dfm *)p + 128) = _2;
  return;
}

and it produces

ldr p15, [x0]
add x0, x0, 128
str p15, [x0]

exactly the same code as if using svint8_t, which gets
signed-boolean:1 vs signed-boolean:8.  So the fact that mask-producing
instructions give you undefined bits doesn't mean that
reg<->mem moves do the same, since the predicate registers
don't know what mode they operate in?

It might of course be prohibitive to copy memory like this
and there might not be GPR <-> predicate reg moves.

But technically ... for SVE predicates there aren't even any
types less than 8 bits in size (as there are for GCN and AVX512).

Richard.


RE: [PATCH v2] i386: Add non-optimize prefetchi intrins

2024-07-30 Thread Jiang, Haochen
> -Original Message-
> From: Jakub Jelinek 
> Sent: Tuesday, July 30, 2024 2:57 PM
> To: Hongtao Liu 
> Cc: Jiang, Haochen ; gcc-patches@gcc.gnu.org;
> Liu, Hongtao ; ubiz...@gmail.com
> Subject: Re: [PATCH v2] i386: Add non-optimize prefetchi intrins
> 
> On Tue, Jul 30, 2024 at 09:28:46AM +0800, Hongtao Liu wrote:
> > On Tue, Jul 30, 2024 at 9:27 AM Hongtao Liu  wrote:
> > >
> > > On Fri, Jul 26, 2024 at 4:55 PM Haochen Jiang 
> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I added related O0 testcase in this patch.
> > > >
> > > > Ok for trunk and backport to GCC 14 and GCC 13?
> > > Ok.
> > I mean for trunk, and it needs Jakub's approval to backport to GCC 14.2.
> 
> IMHO this needs to wait for GCC 14.3 (aka can be committed to 14 branch
> after the 14.2 release).

Ok, for GCC 14, I will wait until the release happens.

Thx,
Haochen

> 
>   Jakub



Re: [PATCH v2] gimple ssa: Teach switch conversion to optimize powers of 2 switches

2024-07-30 Thread Richard Biener
On Mon, 29 Jul 2024, Filip Kastl wrote:

> Hi Richard,
> 
> > > Sorry, I'm not sure if I understand.  Are you suggesting something like 
> > > this?
> > > 
> > > if (idom(default bb) == cond bb)
> > > {
> > >   if (exists a path from default bb to final bb)
> > >   {
> > > idom(final bb) = cond bb;
> > >   }
> > >   else
> > >   {
> > > idom(final bb) = switch bb;
> > >   }
> > > }
> > > else
> > > {
> > >   // idom(final bb) doesn't change
> > > }
> 
> Sidenote: I've just noticed that this code wouldn't work since the original
> idom of final_bb may be some block outside of the switch.  In that case, idom
> of final_bb should remain the same after the transformation regardless of if
> idom(default bb) == cond bb.  So the code would look like this
> 
> if (original idom(final bb) == switch bb and idom(default bb) == cond bb)
> {
>   if (exists a path from default bb to final bb)
>   {
> idom(final bb) = cond bb;
>   }
>   else
>   {
> idom(final bb) = switch bb;
>   }
> }
> else
> {
>   // idom(final bb) doesn't change
> }
> 
> > > 
> > > If so, how do I implement testing existence of a path from default bb to 
> > > final
> > > bb?  I guess I could do BFS but that seems like a pretty complicated 
> > > solution.
> > > > 
> > > > That said, even above if there's a merge of the default BB and final BB
> > > > downstream in the CFG, inserting cond BB requires adjustment of the
> > > > immediate dominator of that merge block and you are missing that?
> > > 
> > > I think this isn't a problem because I do
> > > 
> > > redirect_immediate_dominators (CDI_DOMINATORS, swtch_bb, cond_bb);
> > 
> > Hmm, I'm probably just confused.  So the problem is that
> > redirect_immediate_dominators makes the dominator of final_bb incorrect
> > (but also all case_bb immediate dominator?!)?
> 
> Yes, the problem is what the idom of final_bb should be after the
> transformation.  However, redirect_immediate_dominators doesn't *make* the 
> idom
> of final_bb incorrect.  It may have been already incorrect before the call 
> (the
> call may also possibly make the idom correct btw).
> 
> This has probably already been clear to you.  I'm just making sure we're on 
> the
> same page.
> 
> > 
> > Ah, I see you fix those up.  Then 2.) is left - the final block.  Iff
> > the final block needs adjustment you know there was a path from
> > the default case to it which means one of its predecessors is dominated
> > by the default case?  In that case, adjust the dominator to cond_bb,
> > otherwise leave it at switch_bb?
> 
> Yes, what I'm saying is that if I want to know idom of final_bb after the
> transformation, I have to know if there is a path between default_bb and
> final_bb.  It is because of these two cases:
> 
> 1.
> 
> cond BB --------+
>    |            |
> switch BB --+   |
>  /  |  \     \  |
> case BBs   default BB
>  \  |  /      /
> final BB <---+   <- this may be an edge or a path
>    |
> 
> 2.
> 
> cond BB --------+
>    |            |
> switch BB --+   |
>  /  |  \     \  |
> case BBs   default BB
>  \  |  /        /
> final BB       /   <- this may be an edge or a path
>    |          /
> 
> In the first case, there is a path between default_bb and final_bb and in the
> second there isn't.  Otherwise the cases are the same.  In the first case idom
> of final_bb should be cond_bb.  In the second case idom of final_bb should be
> switch_bb.  The algorithm deciding what should be the idom of final_bb therefore has to
> know if there is a path between default_bb and final_bb.
> 
> You said that if there is a path between default_bb and final_bb, one of the
> predecessors of final_bb is dominated by default_bb.  That would indeed give a
> nice way to check existence of a path between default_bb and final_bb.  But
> does it hold?  Consider this situation:
> 
>    |                     |
> cond BB --------+        |
>    |            |        |
> switch BB --+   |        |
>  /  |  \     \  |        |
> case BBs   default BB    |
>  \  |  /        |        |
> final BB <-- pred BB <---+
>    |
> 
> Here no predecessors of final_bb are dominated by default_bb but at the same
> time there does exist a path from default_bb to final_bb.  Or is this CFG
> impossible for some reason?

I think in this case the dominator simply need not change - the only case
you need to adjust it is when the immediate dominator of final BB was
switch BB before the transform, and then we know we have to update it
to cond BB, no?
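In code, the simple rule being suggested here would look roughly like
this (an illustrative sketch of the idea only, not the actual patch,
and whether it covers every case is exactly what is being discussed):

  /* Only final_bb's immediate dominator may need fixing up: if it was
     the switch block before the transform, the inserted condition
     block now dominates it.  */
  if (get_immediate_dominator (CDI_DOMINATORS, final_bb) == switch_bb)
    set_immediate_dominator (CDI_DOMINATORS, final_bb, cond_bb);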

> Btw to further check that we're on the same page:  Right now we're only trying
> to figure out if there is a way to update idom of final_bb after the
> transformation without using iterate_fix_dominators, right?  The rest of my
> dominator fixing code makes sense / is ok?

Yes.  These "fix me up" utilities are used too often lazily - I have
the gut feeling that the situation here is very simple.

Richard.


[patch, avr, applied] Propose to use attribute signal(n) via AVR-LibC's ISR_N.

2024-07-30 Thread Georg-Johann Lay

Applied the following patchlet to the documentation.

Johann

--

AVR: Propose to use attribute signal(n) via AVR-LibC's ISR_N.

gcc/
* doc/extend.texi (AVR Function Attributes): Propose to use
attribute signal(n) via AVR-LibC's ISR_N from avr/interrupt.h
diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index 927aa24ab63..48b27ff9f39 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -5147,22 +5147,38 @@ the attribute, rather than providing the ISR name itself as the function name:
 
 @example
 __attribute__((signal(1)))
-void my_handler (void)
+static void my_handler (void)
 @{
// Code for __vector_1
 @}
+@end example
 
-#include 
+Notice that the handler function needs not to be externally visible.
+The recommended way to use these attributes is by means of the
+@code{ISR_N} macro provided by @code{avr/interrupt.h} from
+@w{@uref{https://www.nongnu.org/avr-libc/user-manual/group__avr__interrupts.html,,AVR-LibC}}:
+
+@example
+#include 
 
-__attribute__((__signal__(PCINT0_vect_num, PCINT1_vect_num)))
-static void my_pcint0_1_handler (void)
+ISR_N (PCINT0_vect_num)
+static void my_pcint0_handler (void)
 @{
-   // Code for PCINT0 and PCINT1 (__vector_3 and __vector_4
-   // on ATmega328).
+   // Code
+@}
+
+ISR_N (ADC_vect_num, ISR_NOBLOCK)
+static void my_adc_handler (void)
+@{
+// Code
 @}
 @end example
 
-Notice that the handler function needs not to be externally visible.
+@code{ISR_N} can be specified more than once, in which case several
+interrupt vectors are pointing to the same handler function.  This
+is similar to the @code{ISR_ALIASOF} macro provided by AVR-LibC, but
+without the overhead introduced by @code{ISR_ALIASOF}.
+
 
 @cindex @code{noblock} function attribute, AVR
 @item noblock


[PATCH] libstdc++: implement concatenation of strings and string_views

2024-07-30 Thread Giuseppe D'Angelo

Hello!

The attached patch adds support for P2591R5 in libstdc++ 
(concatenation of strings and string_views, approved in Tokyo for C++26).
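For context, P2591R5 makes mixed concatenation like the following
well-formed (a usage sketch, not part of the patch):

#include <string>
#include <string_view>

int main()
{
  std::string s = "foo";
  std::string_view sv = "bar";
  std::string r = s + sv;   // OK in C++26 with P2591R5
  return r.size() == 6 ? 0 : 1;
}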


Thank you,
--
Giuseppe D'Angelo
From 0a4d44196bced41d97d8086343786b52a6f75faf Mon Sep 17 00:00:00 2001
From: Giuseppe D'Angelo 
Date: Tue, 30 Jul 2024 08:57:13 +0200
Subject: [PATCH] libstdc++: implement concatenation of strings and
 string_views

This adds support for P2591R5, merged for C++26.

libstdc++-v3/ChangeLog:

	* include/bits/basic_string.h: Implement the four operator+
	overloads between basic_string and (types convertible to)
	basic_string_view.
	* include/bits/version.def: Bump the feature-testing macro.
	* include/bits/version.h: Regenerate.
	* testsuite/21_strings/basic_string/operators/char/op_plus_fspath_cpp17_fail.cc: New test.
	* testsuite/21_strings/basic_string/operators/char/op_plus_fspath_cpp2c_fail.cc: New test.
	* testsuite/21_strings/basic_string/operators/char/op_plus_fspath_impl.h: New test.
	* testsuite/21_strings/basic_string/operators/char/op_plus_string_view.cc: New test.
	* testsuite/21_strings/basic_string/operators/char/op_plus_string_view_compat_fail.cc:
	New test.
	* testsuite/21_strings/basic_string/operators/char/op_plus_string_view_compat_impl.h:
	New test.
	* testsuite/21_strings/basic_string/operators/char/op_plus_string_view_compat_ok.cc:
	New test.
---
 libstdc++-v3/include/bits/basic_string.h  |  48 +
 libstdc++-v3/include/bits/version.def |   5 +
 libstdc++-v3/include/bits/version.h   |   7 +-
 .../char/op_plus_fspath_cpp17_fail.cc |  21 ++
 .../char/op_plus_fspath_cpp2c_fail.cc |  22 +++
 .../operators/char/op_plus_fspath_impl.h  |  26 +++
 .../operators/char/op_plus_string_view.cc | 187 ++
 .../char/op_plus_string_view_compat_fail.cc   |  22 +++
 .../char/op_plus_string_view_compat_impl.h|  75 +++
 .../char/op_plus_string_view_compat_ok.cc |  20 ++
 10 files changed, 432 insertions(+), 1 deletion(-)
 create mode 100644 libstdc++-v3/testsuite/21_strings/basic_string/operators/char/op_plus_fspath_cpp17_fail.cc
 create mode 100644 libstdc++-v3/testsuite/21_strings/basic_string/operators/char/op_plus_fspath_cpp2c_fail.cc
 create mode 100644 libstdc++-v3/testsuite/21_strings/basic_string/operators/char/op_plus_fspath_impl.h
 create mode 100644 libstdc++-v3/testsuite/21_strings/basic_string/operators/char/op_plus_string_view.cc
 create mode 100644 libstdc++-v3/testsuite/21_strings/basic_string/operators/char/op_plus_string_view_compat_fail.cc
 create mode 100644 libstdc++-v3/testsuite/21_strings/basic_string/operators/char/op_plus_string_view_compat_impl.h
 create mode 100644 libstdc++-v3/testsuite/21_strings/basic_string/operators/char/op_plus_string_view_compat_ok.cc

diff --git a/libstdc++-v3/include/bits/basic_string.h b/libstdc++-v3/include/bits/basic_string.h
index 8a695a494ef..bf9ad2be00a 100644
--- a/libstdc++-v3/include/bits/basic_string.h
+++ b/libstdc++-v3/include/bits/basic_string.h
@@ -3742,6 +3742,54 @@ _GLIBCXX_END_NAMESPACE_CXX11
 { return std::move(__lhs.append(1, __rhs)); }
 #endif
 
+#if __cplusplus > 202302L
+  // const string & + string_view
+  template<typename _CharT, typename _Traits, typename _Alloc>
+_GLIBCXX_NODISCARD _GLIBCXX20_CONSTEXPR
+inline basic_string<_CharT, _Traits, _Alloc>
+operator+(const basic_string<_CharT, _Traits, _Alloc>& __lhs,
+           __type_identity_t<basic_string_view<_CharT, _Traits>> __rhs)
+{
+  typedef basic_string<_CharT, _Traits, _Alloc> _Str;
+  return std::__str_concat<_Str>(__lhs.data(), __lhs.size(),
+  __rhs.data(), __rhs.size(),
+  __lhs.get_allocator());
+}
+
+  // string && + string_view
+  template<typename _CharT, typename _Traits, typename _Alloc>
+_GLIBCXX_NODISCARD _GLIBCXX20_CONSTEXPR
+inline basic_string<_CharT, _Traits, _Alloc>
+operator+(basic_string<_CharT, _Traits, _Alloc>&& __lhs,
+           __type_identity_t<basic_string_view<_CharT, _Traits>> __rhs)
+{
+  return std::move(__lhs.append(__rhs));
+}
+
+  // string_view + const string &
+  template<typename _CharT, typename _Traits, typename _Alloc>
+_GLIBCXX_NODISCARD _GLIBCXX20_CONSTEXPR
+inline basic_string<_CharT, _Traits, _Alloc>
+operator+(__type_identity_t<basic_string_view<_CharT, _Traits>> __lhs,
+	   const basic_string<_CharT, _Traits, _Alloc>& __rhs)
+{
+  typedef basic_string<_CharT, _Traits, _Alloc> _Str;
+  return std::__str_concat<_Str>(__lhs.data(), __lhs.size(),
+  __rhs.data(), __rhs.size(),
+  __rhs.get_allocator());
+}
+
+  // string_view + string &&
+   template<typename _CharT, typename _Traits, typename _Alloc>
+_GLIBCXX_NODISCARD _GLIBCXX20_CONSTEXPR
+inline basic_string<_CharT, _Traits, _Alloc>
+operator+(__type_identity_t<basic_string_view<_CharT, _Traits>> __lhs,
+	   basic_string<_CharT, _Traits, _Alloc>&& __rhs)
+{
+  return std::move(__rhs.insert(0, __lhs));
+}
+#endif
+
   // operator ==
   /**
*  @brief  Test equivalence of two strings.
diff --git a/libstdc++-v3/include/bits/version.def b/libstdc++-v3/include/bits/version.def
index ad4715048ab..e5cb527728b 100644
--- a/libstdc++-v3/include/bits/version.def
+++ b/libstdc++-v3/include/bits/version.def
@@ -694,6 +694,11 @@ ftms = {

RE: Support streaming of poly_int for offloading when its degree <= accel's NUM_POLY_INT_COEFFS

2024-07-30 Thread Richard Biener
On Tue, 30 Jul 2024, Prathamesh Kulkarni wrote:

> 
> 
> > -Original Message-
> > From: Richard Sandiford 
> > Sent: Monday, July 29, 2024 9:43 PM
> > To: Richard Biener 
> > Cc: Prathamesh Kulkarni ; gcc-
> > patc...@gcc.gnu.org
> > Subject: Re: Support streaming of poly_int for offloading when it's
> > degree <= accel's NUM_POLY_INT_COEFFS
> > 
> > External email: Use caution opening links or attachments
> > 
> > 
> > Richard Biener  writes:
> > > On Mon, 29 Jul 2024, Prathamesh Kulkarni wrote:
> > >
> > >> Hi Richard,
> > >> Thanks for your suggestions on RFC email, the attached patch adds
> > support for streaming of poly_int when it's degree <= accel's
> > NUM_POLY_INT_COEFFS.
> > >> The patch changes streaming of poly_int as follows:
> > >>
> > >> Streaming out poly_int:
> > >>
> > >> degree = poly_int.degree();
> > >> stream out degree;
> > >> for (i = 0; i < degree; i++)
> > >>   stream out poly_int.coeffs[i];
> > >>
> > >> Streaming in poly_int:
> > >>
> > >> stream in degree;
> > >> if (degree > NUM_POLY_INT_COEFFS)
> > >>   fatal_error();
> > >> stream in coeffs;
> > >> // Set remaining coeffs to zero in case degree < accel's
> > >> NUM_POLY_INT_COEFFS for (i = degree; i < NUM_POLY_INT_COEFFS; i++)
> > >>   poly_int.coeffs[i] = 0;
> > >>
> > >> Patch passes bootstrap+test and LTO bootstrap+test on aarch64-
> > linux-gnu.
> > >> LTO bootstrap+test on x86_64-linux-gnu in progress.
> > >>
> > >> I am not quite sure how to test it for offloading since currently
> > it's (entirely) broken for aarch64->nvptx.
> > >> I can give a try with x86_64->nvptx offloading if required (altho I
> > >> guess LTO bootstrap should test streaming changes ?)
> > >
> > > +  unsigned degree
> > > += bp_unpack_value (bp, BITS_PER_UNIT * sizeof (unsigned
> > > HOST_WIDE_INT));
> > >
> > > The NUM_POLY_INT_COEFFS target define doesn't seem to be constrained
> > > to any type it needs to fit into, using HOST_WIDE_INT is arbitrary.
> > > I'd say we should constrain it to a reasonable upper bound, like 2?
> > > Maybe even have MAX_NUM_POLY_INT_COEFFS or NUM_POLY_INT_COEFFS_BITS
> > in
> > > poly-int.h and constrain NUM_POLY_INT_COEFFS.
> > >
> > > The patch looks reasonable over all, but Richard S. should have a
> > say
> > > about the abstraction you chose and the poly-int adjustment.
> > 
> > Sorry if this has been discussed already, but could we instead stream
> > NUM_POLY_INT_COEFFS once per file, rather than once per poly_int?
> > It's a target invariant, and poly_int has wormed its way into lots of
> > things by now :)
> Hi Richard,
> The patch doesn't stream out NUM_POLY_INT_COEFFS, but the degree of poly_int 
> (and streams-out coeffs only up to degree, ignoring the higher zero coeffs).
> During streaming-in, it reads back the degree (and streamed coeffs upto 
> degree) and issues an error if degree > accel's NUM_POLY_INT_COEFFS, since we 
> can't
> (as-is) represent a degree-N poly_int on accel with NUM_POLY_INT_COEFFS < N. 
> If degree < accel's NUM_POLY_INT_COEFFS, the remaining coeffs are set to 0
> (similar to zero-extension). I posted more details in RFC: 
> https://gcc.gnu.org/pipermail/gcc/2024-July/244466.html
> 
> The attached patch defines MAX_NUM_POLY_INT_COEFFS_BITS in poly-int.h to 
> represent number of bits needed for max value of NUM_POLY_INT_COEFFS defined 
> by any target,
> and uses that for packing/unpacking degree of poly_int to/from bitstream, 
> which should make it independent of the type used for representing 
> NUM_POLY_INT_COEFFS by
> the target.

Just as additional comment - maybe we can avoid the POLY_INT_CST tree
side if we'd consistently "canonicalize" a POLY_INT_CST with zero
second coeff as INTEGER_CST instead?  This of course doesn't
generalize to NUM_POLY_INT_COEFFS == 3 vs NUM_POLY_INT_COEFFS == 2.

We still need the poly_int<> streaming support of course where I
would guess that 99% of the cases have a zero second coeff.
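To make the scheme concrete, the streaming described above could look
roughly like the following sketch (illustrative only, assuming
HOST_WIDE_INT coefficients and the MAX_NUM_POLY_INT_COEFFS_BITS
constant proposed in the patch; the function names are made up):

static void
sketch_pack_poly_int64 (struct bitpack_d *bp, poly_int64 val)
{
  /* Degree = number of leading coefficients up to the last non-zero one.  */
  unsigned degree = 0;
  for (unsigned i = 0; i < NUM_POLY_INT_COEFFS; i++)
    if (val.coeffs[i] != 0)
      degree = i + 1;
  bp_pack_value (bp, degree, MAX_NUM_POLY_INT_COEFFS_BITS);
  for (unsigned i = 0; i < degree; i++)
    bp_pack_value (bp, val.coeffs[i], HOST_BITS_PER_WIDE_INT);
}

static poly_int64
sketch_unpack_poly_int64 (struct bitpack_d *bp)
{
  poly_int64 result = 0;
  unsigned degree = bp_unpack_value (bp, MAX_NUM_POLY_INT_COEFFS_BITS);
  if (degree > NUM_POLY_INT_COEFFS)
    fatal_error (input_location,
                 "degree of poly_int exceeds NUM_POLY_INT_COEFFS "
                 "of the accelerator");
  for (unsigned i = 0; i < degree; i++)
    result.coeffs[i] = bp_unpack_value (bp, HOST_BITS_PER_WIDE_INT);
  /* Remaining coefficients stay zero, like zero-extension.  */
  return result;
}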

Richard.

> Bootstrap+test and LTO bootstrap+test in progress on aarch64-linux-gnu.
> Does the patch look OK ?
> 
> Signed-off-by: Prathamesh Kulkarni 
> 
> Thanks,
> Prathamesh
> > 
> > Thanks,
> > Richard
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


[PATCH] AArch64: Set instruction attribute of TST to logics_imm

2024-07-30 Thread Jennifer Schmitz
As suggested in
https://gcc.gnu.org/pipermail/gcc-patches/2024-July/658249.html,
this patch changes the instruction attribute of "*and_compare0" (TST) from
alus_imm to logics_imm.

The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?

Signed-off-by: Jennifer Schmitz 

gcc/

* config/aarch64/aarch64.md (*and_compare0): Change attribute.


0001-AArch64-Set-instruction-attribute-of-TST-to-logics_i.patch
Description: Binary data




Re: [PATCH 1/2] SVE intrinsics: Add strength reduction for division by constant.

2024-07-30 Thread Jennifer Schmitz
Dear Richard,
Thanks for the feedback. Great to see this patch approved! I made the changes 
as suggested.
Best,
Jennifer


0001-SVE-intrinsics-Add-strength-reduction-for-division-b.patch
Description: Binary data


> On 29 Jul 2024, at 22:55, Richard Sandiford  wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
> Thanks for doing this.
> 
> Jennifer Schmitz  writes:
>> [...]
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_s32.c 
>> b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_s32.c
>> index c49ca1aa524..6500b64c41b 100644
>> --- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_s32.c
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_s32.c
>> @@ -1,6 +1,9 @@
>> /* { dg-final { check-function-bodies "**" "" "-DCHECK_ASM" } } */
>> 
>> #include "test_sve_acle.h"
>> +#include <stdint.h>
>> +
> 
> I think it'd be better to drop the explicit include of stdint.h.  arm_sve.h
> is defined to include stdint.h itself, and we rely on that elsewhere.
> 
> Same for div_s64.c.
Done.
> 
>> +#define MAXPOW 1<<30
>> 
>> /*
>> ** div_s32_m_tied1:
>> @@ -53,10 +56,27 @@ TEST_UNIFORM_ZX (div_w0_s32_m_untied, svint32_t, int32_t,
>>   z0 = svdiv_n_s32_m (p0, z1, x0),
>>   z0 = svdiv_m (p0, z1, x0))
>> 
>> +/*
>> +** div_1_s32_m_tied1:
>> +**   sel z0\.s, p0, z0\.s, z0\.s
>> +**   ret
>> +*/
>> +TEST_UNIFORM_Z (div_1_s32_m_tied1, svint32_t,
>> + z0 = svdiv_n_s32_m (p0, z0, 1),
>> + z0 = svdiv_m (p0, z0, 1))
>> +
>> +/*
>> +** div_1_s32_m_untied:
>> +**   sel z0\.s, p0, z1\.s, z1\.s
>> +**   ret
>> +*/
>> +TEST_UNIFORM_Z (div_1_s32_m_untied, svint32_t,
>> + z0 = svdiv_n_s32_m (p0, z1, 1),
>> + z0 = svdiv_m (p0, z1, 1))
>> +
> 
> [ Thanks for adding the tests (which look good to me).  If the output
>  ever improves in future, we can "defend" the improvement by changing
>  the test.  But in the meantime, the above defends something that is
>  known to work. ]
> 
>> [...]
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/div_const_run.c 
>> b/gcc/testsuite/gcc.target/aarch64/sve/div_const_run.c
>> new file mode 100644
>> index 000..1a3c25b817d
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/div_const_run.c
>> @@ -0,0 +1,91 @@
>> +/* { dg-do run { target aarch64_sve128_hw } } */
>> +/* { dg-options "-O2 -msve-vector-bits=128" } */
>> +
>> +#include <arm_sve.h>
>> +#include <stdint.h>
>> +
>> +typedef svbool_t pred __attribute__((arm_sve_vector_bits(128)));
>> +typedef svfloat16_t svfloat16_ __attribute__((arm_sve_vector_bits(128)));
>> +typedef svfloat32_t svfloat32_ __attribute__((arm_sve_vector_bits(128)));
>> +typedef svfloat64_t svfloat64_ __attribute__((arm_sve_vector_bits(128)));
>> +typedef svint32_t svint32_ __attribute__((arm_sve_vector_bits(128)));
>> +typedef svint64_t svint64_ __attribute__((arm_sve_vector_bits(128)));
>> +typedef svuint32_t svuint32_ __attribute__((arm_sve_vector_bits(128)));
>> +typedef svuint64_t svuint64_ __attribute__((arm_sve_vector_bits(128)));
>> +
>> +#define F(T, TS, P, OP1, OP2)                                        \
>> +{                                                                    \
>> +  T##_t op1 = (T##_t) OP1;                                           \
>> +  T##_t op2 = (T##_t) OP2;                                           \
>> +  sv##T##_ res = svdiv_##P (pg, svdup_##TS (op1), svdup_##TS (op2)); \
>> +  sv##T##_ exp = svdup_##TS (op1 / op2);                             \
>> +  if (svptest_any (pg, svcmpne (pg, exp, res)))                      \
>> +    __builtin_abort ();                                              \
>> +                                                                     \
>> +  sv##T##_ res_n = svdiv_##P (pg, svdup_##TS (op1), op2);            \
>> +  if (svptest_any (pg, svcmpne (pg, exp, res_n)))                    \
>> +    __builtin_abort ();                                              \
>> +}
>> +
>> +#define TEST_TYPES_1(T, TS)                                          \
>> +  F (T, TS, m, 79, 16)                                               \
>> +  F (T, TS, z, 79, 16)                                               \
>> +  F (T, TS, x, 79, 16)
>> +
>> +#define TEST_TYPES                                                   \
>> +  TEST_TYPES_1 (float16, f16)                                        \
>> +  TEST_TYPES_1 (float32, f32)                                        \
>> +  TEST_TYPES_1 (float64, f64)                                        \
>> +  TEST_TYPES_1 (int32, s32)                                          \
>> +  TEST_TYPES_1 (int64, s64)                                          \
>> +  TEST_TYPES_1 (uint32, u32)                                         \
>> +  TEST_TYPES_1 (uint64, u64)
>> +
>> +#define TEST_VALUES_S_1(B, OP1, OP2)                                 \
>> +  F (int##B, s##B, x, OP1, OP2)
>> +
>> 

Re: Performance improvement for std::to_chars(char* first, char* last, /* integer-type */ value, int base = 10 );

2024-07-30 Thread Jonathan Wakely
On Tue, 30 Jul 2024, 06:21 Ehrnsperger, Markus, 
wrote:

> On 2024-07-29 12:16, Jonathan Wakely wrote:
>
> > On Mon, 29 Jul 2024 at 10:45, Jonathan Wakely 
> wrote:
> >> On Mon, 29 Jul 2024 at 09:42, Ehrnsperger, Markus
> >>  wrote:
> >>> Hi,
> >>>
> >>>
> >>> I'm attaching two files:
> >>>
> >>> 1.:   to_chars10.h:
> >>>
> >>> This is intended to be included in libstdc++ / gcc to achieve
> performance improvements. It is an implementation of
> >>>
> >>> to_chars10(char* first, char* last,  /* integer-type */ value);
> >>>
> >>> Parameters are identical to std::to_chars(char* first, char* last,  /*
> integer-type */ value, int base = 10 ); . It only works for base == 10.
> >>>
> >>> If it is included in libstdc++, to_chars10(...) could be renamed to
> std::to_chars(char* first, char* last,  /* integer-type */ value) to
> provide an overload for the default base = 10
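For reference, this is how the existing interface is called today with
its defaulted base of 10, which is the case the proposed to_chars10
specializes (a usage sketch only):

#include <charconv>
#include <cstdio>

int main()
{
  char buf[16];
  // std::to_chars writes "123456" into buf and returns {ptr, ec}.
  auto res = std::to_chars(buf, buf + sizeof buf, 123456);
  if (res.ec == std::errc{})
    std::printf("%.*s\n", static_cast<int>(res.ptr - buf), buf);
  return 0;
}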
> >> Thanks for the email. This isn't in the form of a patch that we can
> >> accept as-is, although I see that the license is compatible with
> >> libstdc++, so if you are looking to contribute it then that could be
> >> done either by assigning copyright to the FSF or under the DCO terms.
> >> See https://gcc.gnu.org/contribute.html#legal for more details.
> >>
> >> I haven't looked at the code in detail, but is it a similar approach
> >> to https://jk-jeon.github.io/posts/2022/02/jeaiii-algorithm/ ?
> >> How does it compare to the performance of that algorithm?
> >>
> >> I have an incomplete implementation of that algorithm for libstdc++
> >> somewhere, but I haven't looked at it for a while.
> > I took a closer look and the reinterpret_casts worried me, so I tried
> > your test code with UBsan. There are a number of errors that would
> > need to be fixed before we would consider using this code.
>
> Attached are new versions of to_chars10.cpp, to_chars10.h and the new
> file itoa_better_y.h
>

Thanks! I'll take another look.



> Changes:
>
> - I removed all reinterpret_casts, and tested with -fsanitize=undefined
>
> - I added itoa_better_y.h from
> https://jk-jeon.github.io/posts/2022/02/jeaiii-algorithm/ to the
> performance test.
>
> Note: There is only one line in the benchmark test for itoa_better_y due
> to limited features of itoa_better_y:
>
> Benchmarking random unsigned 32 bit  itoa_better_y   ...
>
>
> to_chars10.h: Signed-off-by: Markus Ehrnsperger
> 
>
> The other files are only for performance tests.
>
> >
> >
> >>
> >>> 2.:  to_chars10.cpp:
> >>>
> >>> This is a test program for to_chars10 verifying the correctness of the
> results, and measuring the performance. The actual performance improvement
> is system dependent, so please test on your own system.
> >>>
> >>> On my system the performance improvement is about factor two, my
> results are:
> >>>
> >>>
> >>> Test   int8_t verifying to_chars10 = std::to_chars ... OK
> >>> Test  uint8_t verifying to_chars10 = std::to_chars ... OK
> >>> Test  int16_t verifying to_chars10 = std::to_chars ... OK
> >>> Test uint16_t verifying to_chars10 = std::to_chars ... OK
> >>> Test  int32_t verifying to_chars10 = std::to_chars ... OK
> >>> Test uint32_t verifying to_chars10 = std::to_chars ... OK
> >>> Test  int64_t verifying to_chars10 = std::to_chars ... OK
> >>> Test uint64_t verifying to_chars10 = std::to_chars ... OK
> >>>
> >>> Benchmarking test case   tested method  ...  time (lower
> is better)
> >>> Benchmarking random unsigned 64 bit  to_chars10 ...  0.00957
> >>> Benchmarking random unsigned 64 bit  std::to_chars  ...  0.01854
> >>> Benchmarking random   signed 64 bit  to_chars10 ...  0.01018
> >>> Benchmarking random   signed 64 bit  std::to_chars  ...  0.02297
> >>> Benchmarking random unsigned 32 bit  to_chars10 ...  0.00620
> >>> Benchmarking random unsigned 32 bit  std::to_chars  ...  0.01275
> >>> Benchmarking random   signed 32 bit  to_chars10 ...  0.00783
> >>> Benchmarking random   signed 32 bit  std::to_chars  ...  0.01606
> >>> Benchmarking random unsigned 16 bit  to_chars10 ...  0.00536
> >>> Benchmarking random unsigned 16 bit  std::to_chars  ...  0.00871
> >>> Benchmarking random   signed 16 bit  to_chars10 ...  0.00664
> >>> Benchmarking random   signed 16 bit  std::to_chars  ...  0.01154
> >>> Benchmarking random unsigned 08 bit  to_chars10 ...  0.00393
> >>> Benchmarking random unsigned 08 bit  std::to_chars  ...  0.00626
> >>> Benchmarking random   signed 08 bit  to_chars10 ...  0.00465
> >>> Benchmarking random   signed 08 bit  std::to_chars  ...  0.01089
> >>>
> >>>
> >>> Thanks, Markus
> >>>
> >>>
> >>>


[committed] gfortran.dg/compiler-directive_2.f: Update dg-error (was: [Patch, v2] OpenMP/Fortran: Fix handling of 'declare target' with 'link' clause [PR115559])

2024-07-30 Thread Tobias Burnus

Follow-up fix:

As the !GCC$ attributes are now added in reverse order,
'stdcall' and 'fastcall' swapped order in the error message:
  "Error: stdcall and fastcall attributes are not compatible"

This didn't show up here with -m64 ("Warning: 'stdcall' attribute
ignored") and I didn't run it with -m32, but it was reported by
Haochen's script and manually confirmed by him.
(Thanks for the report and checking – and sorry for the FAIL.)

Committed as r15-2401-g15158a8853a69f.

Tobias



Re: [PATCH 1/3] Add TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Richard Sandiford
Richard Biener  writes:
> On Mon, 29 Jul 2024, Richard Sandiford wrote:
>
>> Richard Biener  writes:
>> > On Mon, 29 Jul 2024, Jakub Jelinek wrote:
>> >> And, for the GET_MODE_INNER, I also meant it for Aarch64/RISC-V VL 
>> >> vectors,
>> >> I think those should be considered as true by the hook, not false
>> >> because maybe_ne.
>> >
>> > I don't think relevant modes will have size/precision mismatches
>> > and maybe_ne should work here.  Richard?
>> 
>> Yeah, I think that's true for AArch64 at least (not sure about RVV).
>> 
>> One wrinkle is that VNx16BI (every bit of a predicate) is technically
>> suitable for memcpy, even though it would be a bad choice performance-wise.
>> But VNx8BI (every even bit of a predicate) wouldn't, since the odd bits
>> are undefined on read.
>> 
>> Arguably, this means that VNx8BI has the wrong precision, but like you
>> say, we don't (AFAIK) support bitsize != precision for vector modes.
>> Instead, the information that there is only one meaningful bit per
>> boolean is represented by having an inner mode of BI.  Both VNx16BI
>> and VNx8BI have an inner mode of BI, which means that VNx8BI's
>> precision is not equal to its nunits * its unit precision.
>> 
>> So I suppose:
>> 
>>   maybe_ne (GET_MODE_BITSIZE (mode),
>> GET_MODE_UNIT_PRECISION (mode) * GET_MODE_NUNITS (mode))
>> 
>> would capture this.
>
> OK, I'll adjust like this.
>
>> Targets that want a vector bool mode with 2 meaningful bits per boolean
>> are expected to define a 2-bit scalar boolean mode and use that as the
>> inner mode.  So I think the condition above would (correctly) continue
>> to allow those.
>
> Hmm, but I think SVE mask registers could be used to transfer bits?
> I tried the following
>
> typedef svint64_t v4dfm __attribute__((vector_mask));
>
> void __GIMPLE(ssa) foo(void *p)
> {
>   v4dfm _2;
>
> __BB(2):
>   _2 = __MEM <v4dfm> ((v4dfm *)p);
>   __MEM <v4dfm> ((v4dfm *)p + 128) = _2;
>   return;
> }
>
> and it produces
>
> ldr p15, [x0]
> add x0, x0, 128
> str p15, [x0]
>
> exactly the same code as if using svint8_t, which gets
> signed-boolean:1 vs signed-boolean:8.  So the fact that mask-producing
> instructions give you undefined bits doesn't mean that
> reg<->mem moves do the same, since the predicate registers
> don't know what mode they operate in?

Yes, in practice, VNx2BI is likely to produce the same load/store code
as VNx16BI.  But when comparing VNx2BI for equality, say, only every
eighth bit matters.  So if the optimisers were ultimately able to
determine which store feeds the VNx2BI load, there's a theoretical
possibility that they could do something that changes the other bits
of the value.

That's not very likely to happen.  But it'd be a valid thing to do.

> It might of course be prohibitive to copy memory like this
> and there might not be GPR <-> predicate reg moves.
>
> But technically ... for SVE predicates there aren't even any
> types less than 8 bits in size (as there are for GCN and AVX512).

I guess it's architected bits vs payload.  The underlying registers
have 2N bytes for a 128N-bit VL, and so 2N bytes will be loaded by
LDR and stored by STR.  But when GCC uses the registers as VNx2BI,
only 2N bits are payload, and so the optimisers only guarantee to
preserve those 2N bits.

Thanks,
Richard


Re: [Patch] gimplify.cc: Handle VALUE_EXPR of MEM_REF's ADDR_EXPR argument [PR115637]

2024-07-30 Thread Richard Biener
On Mon, Jul 29, 2024 at 9:26 PM Tobias Burnus  wrote:
>
> The problem is code like:
>
>MEM  [(c_char * {ref-all})&arr2]
>
> where arr2 is the value expr '*arr2$13$linkptr'
> (i.e. indirect ref + decl name).
>
> Inside pass_omp_target_link::execute, there is a call to
> gimple_regimplify_operands but the value expression is not
> expanded.  There are two problems: ADDR_EXPR has no handling for this
> and while MEM_REF has some code for it, it doesn't handle this
> either.  The attached code fixes this.  Tested on x86_64-gnu-linux
> with nvidia offloading.
>
> Comments, remarks, OK?  Better suggestions?
>
> * * *
>
> In gimplify_expr for MEM_REF, there is a call to
> is_gimple_mem_ref_addr which checks for ADDR_EXPR
> but not for value expressions.  The attached patch handles
> the case explicitly, but, alternatively, we might want to
> move it to is_gimple_mem_ref_addr (not checked whether it
> makes sense or not).
>
> Where is_gimple_mem_ref_addr is defined as:
>
> /* Return true if T is a valid address operand of a MEM_REF.  */
>
> bool
> is_gimple_mem_ref_addr (tree t)
> {
>   return (is_gimple_reg (t)
>           || TREE_CODE (t) == INTEGER_CST
>           || (TREE_CODE (t) == ADDR_EXPR
>               && (CONSTANT_CLASS_P (TREE_OPERAND (t, 0))
>                   || decl_address_invariant_p (TREE_OPERAND (t, 0)))));
> }

I think iff then decl_address_invariant_p should be amended.

Why is the gimplify_addr_expr hunk needed?  It should get
to gimplifying the VAR_DECL/PARM_DECL by recursion?

> Tobias
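For readers unfamiliar with value expressions, the kind of substitution
being discussed is roughly the following (an illustrative sketch only,
with a hypothetical 'addr' operand; it is not the actual patch):

  /* If the ADDR_EXPR wraps a decl that has a DECL_VALUE_EXPR (such as
     '*arr2$13$linkptr' above), substitute the value expression before
     regimplifying the MEM_REF address.  */
  tree op = TREE_OPERAND (addr, 0);
  if (DECL_P (op) && DECL_HAS_VALUE_EXPR_P (op))
    TREE_OPERAND (addr, 0) = DECL_VALUE_EXPR (op);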


Re: Support streaming of poly_int for offloading when its degree <= accel's NUM_POLY_INT_COEFFS

2024-07-30 Thread Richard Sandiford
Richard Biener  writes:
> On Tue, 30 Jul 2024, Prathamesh Kulkarni wrote:
>
>> 
>> 
>> > -Original Message-
>> > From: Richard Sandiford 
>> > Sent: Monday, July 29, 2024 9:43 PM
>> > To: Richard Biener 
>> > Cc: Prathamesh Kulkarni ; gcc-
>> > patc...@gcc.gnu.org
>> > Subject: Re: Support streaming of poly_int for offloading when it's
>> > degree <= accel's NUM_POLY_INT_COEFFS
>> > 
>> > External email: Use caution opening links or attachments
>> > 
>> > 
>> > Richard Biener  writes:
>> > > On Mon, 29 Jul 2024, Prathamesh Kulkarni wrote:
>> > >
>> > >> Hi Richard,
>> > >> Thanks for your suggestions on RFC email, the attached patch adds
>> > support for streaming of poly_int when it's degree <= accel's
>> > NUM_POLY_INT_COEFFS.
>> > >> The patch changes streaming of poly_int as follows:
>> > >>
>> > >> Streaming out poly_int:
>> > >>
>> > >> degree = poly_int.degree();
>> > >> stream out degree;
>> > >> for (i = 0; i < degree; i++)
>> > >>   stream out poly_int.coeffs[i];
>> > >>
>> > >> Streaming in poly_int:
>> > >>
>> > >> stream in degree;
>> > >> if (degree > NUM_POLY_INT_COEFFS)
>> > >>   fatal_error();
>> > >> stream in coeffs;
>> > >> // Set remaining coeffs to zero in case degree < accel's
>> > >> NUM_POLY_INT_COEFFS for (i = degree; i < NUM_POLY_INT_COEFFS; i++)
>> > >>   poly_int.coeffs[i] = 0;
>> > >>
>> > >> Patch passes bootstrap+test and LTO bootstrap+test on aarch64-
>> > linux-gnu.
>> > >> LTO bootstrap+test on x86_64-linux-gnu in progress.
>> > >>
>> > >> I am not quite sure how to test it for offloading since currently
>> > it's (entirely) broken for aarch64->nvptx.
>> > >> I can give a try with x86_64->nvptx offloading if required (altho I
>> > >> guess LTO bootstrap should test streaming changes ?)
>> > >
>> > > +  unsigned degree
>> > > += bp_unpack_value (bp, BITS_PER_UNIT * sizeof (unsigned
>> > > HOST_WIDE_INT));
>> > >
>> > > The NUM_POLY_INT_COEFFS target define doesn't seem to be constrained
>> > > to any type it needs to fit into, using HOST_WIDE_INT is arbitrary.
>> > > I'd say we should constrain it to a reasonable upper bound, like 2?
>> > > Maybe even have MAX_NUM_POLY_INT_COEFFS or NUM_POLY_INT_COEFFS_BITS
>> > in
>> > > poly-int.h and constrain NUM_POLY_INT_COEFFS.
>> > >
>> > > The patch looks reasonable over all, but Richard S. should have a
>> > say
>> > > about the abstraction you chose and the poly-int adjustment.
>> > 
>> > Sorry if this has been discussed already, but could we instead stream
>> > NUM_POLY_INT_COEFFS once per file, rather than once per poly_int?
>> > It's a target invariant, and poly_int has wormed its way into lots of
>> > things by now :)
>> Hi Richard,
>> The patch doesn't stream out NUM_POLY_INT_COEFFS, but the degree of poly_int 
>> (and streams-out coeffs only up to degree, ignoring the higher zero coeffs).
>> During streaming-in, it reads back the degree (and streamed coeffs upto 
>> degree) and issues an error if degree > accel's NUM_POLY_INT_COEFFS, since 
>> we can't
>> (as-is) represent a degree-N poly_int on accel with NUM_POLY_INT_COEFFS < N. 
>> If degree < accel's NUM_POLY_INT_COEFFS, the remaining coeffs are set to 0
>> (similar to zero-extension). I posted more details in RFC: 
>> https://gcc.gnu.org/pipermail/gcc/2024-July/244466.html

It's not clear to me what the plan is for VLA host + VLS offloading.
Is the streamed data guaranteed to be "clean" of any host-only
VLA stuff?  E.g. if code does:

  #include <arm_sve.h>

  svint32_t *ptr;
  void foo(svint32_t);

  #pragma GCC target "+nosve"

  ...offloading...

is there a guarantee that the offload target won't see the definition
of ptr and foo?

>> 
>> The attached patch defines MAX_NUM_POLY_INT_COEFFS_BITS in poly-int.h to 
>> represent number of bits needed for max value of NUM_POLY_INT_COEFFS defined 
>> by any target,
>> and uses that for packing/unpacking degree of poly_int to/from bitstream, 
>> which should make it independent of the type used for representing 
>> NUM_POLY_INT_COEFFS by
>> the target.
>
> Just as additional comment - maybe we can avoid the POLY_INT_CST tree
> side if we'd consistently "canonicalize" a POLY_INT_CST with zero
> second coeff as INTEGER_CST instead?  This of course doesn't
> generalize to NUM_POLY_INT_COEFFS == 3 vs NUM_POLY_INT_COEFFS == 2.

That should already happen, via:

tree
wide_int_to_tree (tree type, const poly_wide_int_ref &value)
{
  if (value.is_constant ())
return wide_int_to_tree_1 (type, value.coeffs[0]);
  return build_poly_int_cst (type, value);
}

etc.  So if we see POLY_INT_CSTs that could be INTEGER_CSTs, I think
that'd be a bug.

Thanks,
Richard


Re: [PATCH v2] Internal-fn: Handle vector bool type for type strict match mode [PR116103]

2024-07-30 Thread Richard Biener
On Tue, Jul 30, 2024 at 5:08 AM  wrote:
>
> From: Pan Li 
>
> For some targets like target=amdgcn-amdhsa, we need to take care of
> vector bool types prior to general vector mode types.  Or we may have
> the asm check failures as below.
>
> gcc.target/gcn/cond_smax_1.c scan-assembler-times \\tv_cmp_gt_i32\\tvcc, 
> s[0-9]+, v[0-9]+ 80
> gcc.target/gcn/cond_smin_1.c scan-assembler-times \\tv_cmp_gt_i32\\tvcc, 
> s[0-9]+, v[0-9]+ 80
> gcc.target/gcn/cond_umax_1.c scan-assembler-times \\tv_cmp_gt_i32\\tvcc, 
> s[0-9]+, v[0-9]+ 56
> gcc.target/gcn/cond_umin_1.c scan-assembler-times \\tv_cmp_gt_i32\\tvcc, 
> s[0-9]+, v[0-9]+ 56
> gcc.dg/tree-ssa/loop-bound-2.c scan-tree-dump-not ivopts "zero if "
>
> The below test suites are passed for this patch.
> 1. The rv64gcv fully regression tests.
> 2. The x86 bootstrap tests.
> 3. The x86 fully regression tests.
> 4. The amdgcn test case as above.

Still OK.

Richard.

> gcc/ChangeLog:
>
> * internal-fn.cc (type_strictly_matches_mode_p): Add handling
> for vector bool type.
>
> Signed-off-by: Pan Li 
> ---
>  gcc/internal-fn.cc | 10 ++
>  1 file changed, 10 insertions(+)
>
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 8a2e07f2f96..966594a52ed 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -4171,6 +4171,16 @@ direct_internal_fn_optab (internal_fn fn)
>  static bool
>  type_strictly_matches_mode_p (const_tree type)
>  {
> +  /* The masked vector operations have both vector data operands and vector
> + boolean operands.  The vector data operands are expected to have a 
> vector
> + mode,  but the vector boolean operands can be an integer mode rather 
> than
> + a vector mode,  depending on how TARGET_VECTORIZE_GET_MASK_MODE is
> + defined.  PR116103.  */
> +  if (VECTOR_BOOLEAN_TYPE_P (type)
> +  && SCALAR_INT_MODE_P (TYPE_MODE (type))
> +  && TYPE_PRECISION (TREE_TYPE (type)) == 1)
> +return true;
> +
>if (VECTOR_TYPE_P (type))
>  return VECTOR_MODE_P (TYPE_MODE (type));
>
> --
> 2.34.1
>


Re: Support streaming of poly_int for offloading when its degree <= accel's NUM_POLY_INT_COEFFS

2024-07-30 Thread Richard Biener
On Tue, 30 Jul 2024, Richard Sandiford wrote:

> Richard Biener  writes:
> > On Tue, 30 Jul 2024, Prathamesh Kulkarni wrote:
> >
> >> 
> >> 
> >> > -Original Message-
> >> > From: Richard Sandiford 
> >> > Sent: Monday, July 29, 2024 9:43 PM
> >> > To: Richard Biener 
> >> > Cc: Prathamesh Kulkarni ; gcc-
> >> > patc...@gcc.gnu.org
> >> > Subject: Re: Support streaming of poly_int for offloading when it's
> >> > degree <= accel's NUM_POLY_INT_COEFFS
> >> > 
> >> > External email: Use caution opening links or attachments
> >> > 
> >> > 
> >> > Richard Biener  writes:
> >> > > On Mon, 29 Jul 2024, Prathamesh Kulkarni wrote:
> >> > >
> >> > >> Hi Richard,
> >> > >> Thanks for your suggestions on RFC email, the attached patch adds
> >> > support for streaming of poly_int when it's degree <= accel's
> >> > NUM_POLY_INT_COEFFS.
> >> > >> The patch changes streaming of poly_int as follows:
> >> > >>
> >> > >> Streaming out poly_int:
> >> > >>
> >> > >> degree = poly_int.degree();
> >> > >> stream out degree;
> >> > >> for (i = 0; i < degree; i++)
> >> > >>   stream out poly_int.coeffs[i];
> >> > >>
> >> > >> Streaming in poly_int:
> >> > >>
> >> > >> stream in degree;
> >> > >> if (degree > NUM_POLY_INT_COEFFS)
> >> > >>   fatal_error();
> >> > >> stream in coeffs;
> >> > >> // Set remaining coeffs to zero in case degree < accel's
> >> > >> NUM_POLY_INT_COEFFS for (i = degree; i < NUM_POLY_INT_COEFFS; i++)
> >> > >>   poly_int.coeffs[i] = 0;
> >> > >>
> >> > >> Patch passes bootstrap+test and LTO bootstrap+test on aarch64-
> >> > linux-gnu.
> >> > >> LTO bootstrap+test on x86_64-linux-gnu in progress.
> >> > >>
> >> > >> I am not quite sure how to test it for offloading since currently
> >> > it's (entirely) broken for aarch64->nvptx.
> >> > >> I can give a try with x86_64->nvptx offloading if required (altho I
> >> > >> guess LTO bootstrap should test streaming changes ?)
> >> > >
> >> > > +  unsigned degree
> >> > > += bp_unpack_value (bp, BITS_PER_UNIT * sizeof (unsigned
> >> > > HOST_WIDE_INT));
> >> > >
> >> > > The NUM_POLY_INT_COEFFS target define doesn't seem to be constrained
> >> > > to any type it needs to fit into, using HOST_WIDE_INT is arbitrary.
> >> > > I'd say we should constrain it to a reasonable upper bound, like 2?
> >> > > Maybe even have MAX_NUM_POLY_INT_COEFFS or NUM_POLY_INT_COEFFS_BITS
> >> > in
> >> > > poly-int.h and constrain NUM_POLY_INT_COEFFS.
> >> > >
> >> > > The patch looks reasonable over all, but Richard S. should have a
> >> > say
> >> > > about the abstraction you chose and the poly-int adjustment.
> >> > 
> >> > Sorry if this has been discussed already, but could we instead stream
> >> > NUM_POLY_INT_COEFFS once per file, rather than once per poly_int?
> >> > It's a target invariant, and poly_int has wormed its way into lots of
> >> > things by now :)
> >> Hi Richard,
> >> The patch doesn't stream out NUM_POLY_INT_COEFFS, but the degree of 
> >> poly_int (and streams-out coeffs only up to degree, ignoring the higher 
> >> zero coeffs).
> >> During streaming-in, it reads back the degree (and streamed coeffs upto 
> >> degree) and issues an error if degree > accel's NUM_POLY_INT_COEFFS, since 
> >> we can't
> >> (as-is) represent a degree-N poly_int on accel with NUM_POLY_INT_COEFFS < 
> >> N. If degree < accel's NUM_POLY_INT_COEFFS, the remaining coeffs are set 
> >> to 0
> >> (similar to zero-extension). I posted more details in RFC: 
> >> https://gcc.gnu.org/pipermail/gcc/2024-July/244466.html
> 
> It's not clear to me what the plan is for VLA host + VLS offloading.
> Is the streamed data guaranteed to be "clean" of any host-only
> VLA stuff?  E.g. if code does:
> 
>   #include <arm_sve.h>
> 
>   svint32_t *ptr;
>   void foo(svint32_t);
> 
>   #pragma GCC target "+nosve"
> 
>   ...offloading...
> 
> is there a guarantee that the offload target won't see the definition
> of ptr and foo?

No.  If it sees any unsupported poly-* the offload compilation will fail.

I think all current issues are because of poly-* leaking in for cases
where a non-poly would have worked fine, but I have not had a look
myself.

> >> 
> >> The attached patch defines MAX_NUM_POLY_INT_COEFFS_BITS in poly-int.h to 
> >> represent number of bits needed for max value of NUM_POLY_INT_COEFFS 
> >> defined by any target,
> >> and uses that for packing/unpacking degree of poly_int to/from bitstream, 
> >> which should make it independent of the type used for representing 
> >> NUM_POLY_INT_COEFFS by
> >> the target.
> >
> > Just as additional comment - maybe we can avoid the POLY_INT_CST tree
> > side if we'd consistently "canonicalize" a POLY_INT_CST with zero
> > second coeff as INTEGER_CST instead?  This of course doesn't
> > generalize to NUM_POLY_INT_COEFFS == 3 vs NUM_POLY_INT_COEFFS == 2.
> 
> That should already happen, via:
> 
> tree
> wide_int_to_tree (tree type, const poly_wide_int_ref &value)
> {
>   if (value.is_constant ())
> return wide_int_to_tree_1 (type, value.coeffs[0]);
>   return b

[PATCH v3] arm: [MVE intrinsics] Improve vdupq_n implementation

2024-07-30 Thread Christophe Lyon
Hi,

v3 of patch 2/2 uses your suggested fix about using extra_cost as an
adjustment.

I did not introduce the ARM_INSN_COST macro you suggested because it seems 
there's only a handful (maybe two) of cases where it could be used, and I 
thought it wouldn't make the code really easier to understand.

Since you already approved patch 1/2, I'm not reposting it.

Thanks,

Christophe


This patch makes the non-predicated vdupq_n MVE intrinsics use
vec_duplicate rather than an unspec.  This enables the compiler to
generate better code sequences (for instance using vmov when
possible).

The patch renames the existing mve_vdup pattern into
@mve_vdupq_n, and removes the now useless
@mve_q_n_f and @mve_q_n_ ones.

As a side-effect, it needs to update the mve_unpredicated_insn
predicates in @mve_q_m_n_ and
@mve_q_m_n_f.

Using vec_duplicates means the compiler is now able to use vmov in the
tests with an immediate argument in vdupq_n_[su]{8,16,32}.c:
vmov.i8 q0,#0x1

However, this is only possible when the immediate has a suitable value
(MVE encoding constraints, see imm_for_neon_mov_operand predicate).

Provided we adjust the cost computations in arm_rtx_costs_internal(),
when the immediate does not meet the vmov constraints, we now generate:
mov r0, #imm
vdup.xx q0,r0

or
ldr r0, .L4
vdup.32 q0,r0
in the f32 case (with 1.1 as immediate).

Without the cost adjustment, we would generate:
vldr.64 d0, .L4
vldr.64 d1, .L4+8
and an associated literal pool entry.
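As a usage sketch (the constant below is purely illustrative),
0x12345678 does not fit MVE's vmov immediate encoding, so with the
cost adjustment we would expect the mov + vdup.32 sequence above
rather than a literal-pool load:

#include <arm_mve.h>

int32x4_t
foo (void)
{
  /* Not representable as a vmov immediate, unlike e.g. 0x1.  */
  return vdupq_n_s32 (0x12345678);
}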

Regarding the testsuite updates:

* The signed versions of vdupq_* tests lack a version with an
immediate argument.  This patch adds them, similar to what we already
have for vdupq_n_u*.c tests.

* Code generation for different immediate values is checked with the
new tests this patch introduces.  Note there's no need for s8/u8 tests
because 8-bit immediates always comply with imm_for_neon_mov_operand.

* We can remove xfail from vcmp*f tests since we now generate:
movw r3, #15462
vcmp.f16 eq, q0, r3
instead of the previous:
vldr.64 d6, .L5
vldr.64 d7, .L5+8
vcmp.f16 eq, q0, q3

Tested on arm-linux-gnueabihf and arm-none-eabi with no regression.

2024-07-02  Jolen Li  
Christophe Lyon  

gcc/
* config/arm/arm-mve-builtins-base.cc (vdupq_impl): New class.
(vdupq): Use new implementation.
* config/arm/arm.cc (arm_rtx_costs_internal): Handle HFmode
for COST_DOUBLE. Update costing for CONST_VECTOR.
* config/arm/arm_mve_builtins.def: Merge vdupq_n_f, vdupq_n_s
and vdupq_n_u into vdupq_n.
* config/arm/mve.md (mve_vdup): Rename into ...
(@mve_vdup_n): ... this.
(@mve_q_n_f): Delete.
(@mve_q_n_): Delete..
(@mve_q_m_n_): Update mve_unpredicated_insn
attribute.
(@mve_q_m_n_f): Likewise.

gcc/testsuite/
* gcc.target/arm/mve/intrinsics/vdupq_n_u8.c (foo1): Update
expected code.
* gcc.target/arm/mve/intrinsics/vdupq_n_u16.c (foo1): Likewise.
* gcc.target/arm/mve/intrinsics/vdupq_n_u32.c (foo1): Likewise.
* gcc.target/arm/mve/intrinsics/vdupq_n_s8.c: Add test with
immediate argument.
* gcc.target/arm/mve/intrinsics/vdupq_n_s16.c: Likewise.
* gcc.target/arm/mve/intrinsics/vdupq_n_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vdupq_n_f16.c (foo1): Update
expected code.
* gcc.target/arm/mve/intrinsics/vdupq_n_f32.c (foo1): Likewise.
* gcc.target/arm/mve/intrinsics/vdupq_m_n_s16.c: Add test with
immediate argument.
* gcc.target/arm/mve/intrinsics/vdupq_m_n_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vdupq_m_n_s8.c: Likewise.
* gcc.target/arm/mve/intrinsics/vdupq_x_n_s16.c: Likewise.
* gcc.target/arm/mve/intrinsics/vdupq_x_n_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vdupq_x_n_s8.c: Likewise.
* gcc.target/arm/mve/intrinsics/vdupq_n_f32-2.c: New test.
* gcc.target/arm/mve/intrinsics/vdupq_n_s16-2.c: New test.
* gcc.target/arm/mve/intrinsics/vdupq_n_s32-2.c: New test.
* gcc.target/arm/mve/intrinsics/vdupq_n_u16-2.c: New test.
* gcc.target/arm/mve/intrinsics/vdupq_n_u32-2.c: New test.
* gcc.target/arm/mve/intrinsics/vcmpeqq_n_f16.c: Remove xfail.
* gcc.target/arm/mve/intrinsics/vcmpeqq_n_f32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vcmpgeq_n_f16.c: Likewise.
* gcc.target/arm/mve/intrinsics/vcmpgeq_n_f32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vcmpgtq_n_f16.c: Likewise.
* gcc.target/arm/mve/intrinsics/vcmpgtq_n_f32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vcmpleq_n_f16.c: Likewise.
* gcc.target/arm/mve/intrinsics/vcmpleq_n_f32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vcmpltq_n_f16.c: Likewise.
* gcc.target/arm/mve/intrinsics/vcmpltq_n_f32.c

RE: [RFC][middle-end] SLP Early break and control flow support in GCC

2024-07-30 Thread Tamar Christina
> -Original Message-
> From: Richard Biener 
> Sent: Thursday, July 18, 2024 10:00 AM
> To: Tamar Christina 
> Cc: GCC Patches ; Richard Sandiford
> 
> Subject: RE: [RFC][middle-end] SLP Early break and control flow support in GCC
> 
> On Wed, 17 Jul 2024, Tamar Christina wrote:
> 
> > > -Original Message-
> > > From: Richard Biener 
> > > Sent: Tuesday, July 16, 2024 4:08 PM
> > > To: Tamar Christina 
> > > Cc: GCC Patches ; Richard Sandiford
> > > 
> > > Subject: Re: [RFC][middle-end] SLP Early break and control flow support 
> > > in GCC
> > >
> > > On Mon, 15 Jul 2024, Tamar Christina wrote:
> > >
> > > > Hi All,
> > > >
> > > > This RFC document covers at a high level how to extend early break 
> > > > support in
> > > > GCC to support SLP and how this will be extended in the future to 
> > > > support
> > > > full control flow in GCC.
> > > >
> > > > The basic idea in this is based on the paper "All You Need Is 
> > > > Superword-Level
> > > > Parallelism: Systematic Control-Flow Vectorization with SLP"[1] but it 
> > > > is
> > > > adapted to fit within the GCC vectorizer.
> > >
> > > An interesting read - I think the approach is viable for loop
> > > vectorization where we schedule the whole vectorized loop but difficult
> > > for basic-block vectorization as they seem to re-build the whole function
> > > from scratch.  They also do not address how to code-generate predicated
> > > not vectorized code or how they decide to handle mixed vector/non-vector
> > > code at all.  For example I don't think any current CPU architecture
> > > supports a full set of predicated _scalar_ ops and not every scalar
> > > op would have a vector equivalent in case one would use single-lane
> > > vectors.
> >
> > Hmm I'm guessing you mean here, they don't address for BB vectorization
> > how to deal with the fact that you may not always be able to vectorize the
> > entire function up from the seed? I thought today we dealt with that by
> > splitting during discovery?  Can we not do the same? i.e. treat them as
> > externals?
> 
> Currently not all scalar stmts are even reached by the seeds we use.
> They seem to simply rewrite the whole function into predicated form
> and I don't see how that is a transform that gets you back 1:1 the old
> code (or code of at least the same quality) if not all statements end
> up vectorized?
> 
> Sure we could build up single-lane SLP instances for "everything",
> but then how do you code-generate these predicated single-lane SLP
> instances?
> 
> > >
> > > For GCCs basic-block vectorization the main outstanding issue is one
> > > of scheduling and dependences with scalar code (live lane extracts,
> > > vector builds from scalars) as well.
> > >
> > > > What will not be covered is the support for First-Faulting Loads nor
> > > > alignment peeling as these are done by a different engineer.
> > > >
> > > > == Concept and Theory ==
> > > >
> > > > Supporting Early Break in SLP requires the same theory as general 
> > > > control
> flow,
> > > > the only difference is in that Early break can be supported for 
> > > > non-masked
> > > > architectures while full control flow support requires a fully masked
> > > > architecture.
> > > >
> > > > In GCC 14 Early break was added for non-SLP by teaching the vectorizer 
> > > > to
> > > branch
> > > > to a scalar epilogue whenever an early exit is taken.  This means the 
> > > > vectorizer
> > > > itself doesn't need to know how to perform side-effects or final 
> > > > reductions.
> > >
> > > With current GCC we avoid the need of predicated stmts by using the scalar
> > > epilog and a branch-on-all-true/false stmt.  To make that semantically
> > > valid stmts are moved around.
> > >
> > > > The additional benefit of this is that it works for any target 
> > > > providing a
> > > > vector cbranch optab implementation since we don't require masking
> support.
> > > In
> > > > order for this to work we need to have code motion that moves side 
> > > > effects
> > > > (primarily stores) into the latch basic block.  i.e. we delay any 
> > > > side-effects
> > > > up to a point where we know the full vector iteration would have been
> > > performed.
> > > > For this to work however we had to disable support for epilogue 
> > > > vectorization
> as
> > > > when the scalar statements are moved we break the link to the original 
> > > > BB
> they
> > > > were in.  This means that the stmt_vinfo for the stores that need to be 
> > > > moved
> > > > will no longer be valid for epilogue vectorization and the only way to 
> > > > recover
> > > > this would be to perform a full dataflow analysis again.  We decided 
> > > > against
> > > > this as the plan of record was to switch to SLP.
> > > >
> > > > -- Predicated SSA --
> > > >
> > > > The core of the proposal is to support a form of Predicated SSA (PSSA) 
> > > > in the
> > > > vectorizer [2]. The idea is to assign a control predicate to every SSA
> > > > statement.  This control predicat

Re: [PATCH] Fix overwriting files with fs::copy_file on windows

2024-07-30 Thread Jonathan Wakely
On Sun, 24 Mar 2024 at 21:34, Björn Schäpers  wrote:
>
> From: Björn Schäpers 
>
> This fixes i.e. https://github.com/msys2/MSYS2-packages/issues/1937
> I don't know if I picked the right way to do it.
>
> When acceptable I think the declaration should be moved into
> ops-common.h, since then we could use stat_type and also use that in the
> commonly used function.
>
> Manually tested on i686-w64-mingw32.
>
> -- >8 --
> libstdc++: Fix overwriting files on windows
>
> Inodes have no meaning on Windows, thus all files have an inode of 0,
> and as a result std::filesystem::copy_file did not honor
> copy_options::overwrite_existing.  Use a different approach to
> identify equivalent files.  Factored the method out of
> std::filesystem::equivalent.
>
> libstdc++-v3/ChangeLog:
>
> * include/bits/fs_ops.h: Add declaration of
>   __detail::equivalent_win32.
> * src/c++17/fs_ops.cc (__detail::equivalent_win32): Implement it
> (fs::equivalent): Use __detail::equivalent_win32, factored the
> old test out.
> * src/filesystem/ops-common.h (_GLIBCXX_FILESYSTEM_IS_WINDOWS):
>   Use the function.
>
> Signed-off-by: Björn Schäpers 
> ---
>  libstdc++-v3/include/bits/fs_ops.h   |  8 +++
>  libstdc++-v3/src/c++17/fs_ops.cc | 79 +---
>  libstdc++-v3/src/filesystem/ops-common.h | 10 ++-
>  3 files changed, 60 insertions(+), 37 deletions(-)
>
> diff --git a/libstdc++-v3/include/bits/fs_ops.h 
> b/libstdc++-v3/include/bits/fs_ops.h
> index 90650c47b46..d10b78a4bdd 100644
> --- a/libstdc++-v3/include/bits/fs_ops.h
> +++ b/libstdc++-v3/include/bits/fs_ops.h
> @@ -40,6 +40,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>
>  namespace filesystem
>  {
> +#ifdef _GLIBCXX_FILESYSTEM_IS_WINDOWS
> +namespace __detail
> +{
> +  bool
> +  equivalent_win32(const wchar_t* p1, const wchar_t* p2, error_code& ec);
> +} // namespace __detail
> +#endif //_GLIBCXX_FILESYSTEM_IS_WINDOWS
> +
>/** @addtogroup filesystem
> *  @{
> */
> diff --git a/libstdc++-v3/src/c++17/fs_ops.cc 
> b/libstdc++-v3/src/c++17/fs_ops.cc
> index 61df19753ef..3cc87d45237 100644
> --- a/libstdc++-v3/src/c++17/fs_ops.cc
> +++ b/libstdc++-v3/src/c++17/fs_ops.cc
> @@ -67,6 +67,49 @@
>  namespace fs = std::filesystem;
>  namespace posix = std::filesystem::__gnu_posix;
>
> +#ifdef _GLIBCXX_FILESYSTEM_IS_WINDOWS
> +bool
> +fs::__detail::equivalent_win32(const wchar_t* p1, const wchar_t* p2,
> +  error_code& ec)
> +{
> +  struct auto_handle {
> +explicit auto_handle(const path& p_)
> +: handle(CreateFileW(p_.c_str(), 0,
> +   FILE_SHARE_DELETE | FILE_SHARE_READ | FILE_SHARE_WRITE,
> +   0, OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, 0))
> +{ }
> +
> +~auto_handle()
> +{ if (*this) CloseHandle(handle); }
> +
> +explicit operator bool() const
> +{ return handle != INVALID_HANDLE_VALUE; }
> +
> +bool get_info()
> +{ return GetFileInformationByHandle(handle, &info); }
> +
> +HANDLE handle;
> +BY_HANDLE_FILE_INFORMATION info;
> +  };
> +  auto_handle h1(p1);
> +  auto_handle h2(p2);
> +  if (!h1 || !h2)
> +{
> +  if (!h1 && !h2)
> +   ec = __last_system_error();
> +  return false;
> +}
> +  if (!h1.get_info() || !h2.get_info())
> +{
> +  ec = __last_system_error();
> +  return false;
> +}
> +  return h1.info.dwVolumeSerialNumber == h2.info.dwVolumeSerialNumber
> +&& h1.info.nFileIndexHigh == h2.info.nFileIndexHigh
> +&& h1.info.nFileIndexLow == h2.info.nFileIndexLow;
> +}
> +#endif //_GLIBCXX_FILESYSTEM_IS_WINDOWS
> +
>  fs::path
>  fs::absolute(const path& p)
>  {
> @@ -858,41 +901,7 @@ fs::equivalent(const path& p1, const path& p2, 
> error_code& ec) noexcept
>if (st1.st_mode != st2.st_mode || st1.st_dev != st2.st_dev)
> return false;
>
> -  struct auto_handle {
> -   explicit auto_handle(const path& p_)
> -   : handle(CreateFileW(p_.c_str(), 0,
> - FILE_SHARE_DELETE | FILE_SHARE_READ | FILE_SHARE_WRITE,
> - 0, OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, 0))
> -   { }
> -
> -   ~auto_handle()
> -   { if (*this) CloseHandle(handle); }
> -
> -   explicit operator bool() const
> -   { return handle != INVALID_HANDLE_VALUE; }
> -
> -   bool get_info()
> -   { return GetFileInformationByHandle(handle, &info); }
> -
> -   HANDLE handle;
> -   BY_HANDLE_FILE_INFORMATION info;
> -  };
> -  auto_handle h1(p1);
> -  auto_handle h2(p2);
> -  if (!h1 || !h2)
> -   {
> - if (!h1 && !h2)
> -   ec = __last_system_error();
> - return false;
> -   }
> -  if (!h1.get_info() || !h2.get_info())
> -   {
> - ec = __last_system_error();
> - return false;
> -   }
> -  return h1.info.dwVolumeSerialNumber == h2.info.dwVolumeSerialNumber
> -   && h1.info.nFileIndexHigh == h2.info.nFileIndexHigh
> -   && h1.info.nFileIndexLow == 

Re: Support streaming of poly_int for offloading when it's degree <= accel's NUM_POLY_INT_COEFFS

2024-07-30 Thread Richard Sandiford
Richard Biener  writes:
> On Tue, 30 Jul 2024, Richard Sandiford wrote:
>
>> Richard Biener  writes:
>> > On Tue, 30 Jul 2024, Prathamesh Kulkarni wrote:
>> >
>> >> 
>> >> 
>> >> > -Original Message-
>> >> > From: Richard Sandiford 
>> >> > Sent: Monday, July 29, 2024 9:43 PM
>> >> > To: Richard Biener 
>> >> > Cc: Prathamesh Kulkarni ; gcc-
>> >> > patc...@gcc.gnu.org
>> >> > Subject: Re: Support streaming of poly_int for offloading when it's
>> >> > degree <= accel's NUM_POLY_INT_COEFFS
>> >> > 
>> >> > 
>> >> > 
>> >> > Richard Biener  writes:
>> >> > > On Mon, 29 Jul 2024, Prathamesh Kulkarni wrote:
>> >> > >
>> >> > >> Hi Richard,
>> >> > >> Thanks for your suggestions on RFC email, the attached patch adds
>> >> > support for streaming of poly_int when it's degree <= accel's
>> >> > NUM_POLY_INT_COEFFS.
>> >> > >> The patch changes streaming of poly_int as follows:
>> >> > >>
>> >> > >> Streaming out poly_int:
>> >> > >>
>> >> > >> degree = poly_int.degree();
>> >> > >> stream out degree;
>> >> > >> for (i = 0; i < degree; i++)
>> >> > >>   stream out poly_int.coeffs[i];
>> >> > >>
>> >> > >> Streaming in poly_int:
>> >> > >>
>> >> > >> stream in degree;
>> >> > >> if (degree > NUM_POLY_INT_COEFFS)
>> >> > >>   fatal_error();
>> >> > >> stream in coeffs;
>> >> > >> // Set remaining coeffs to zero in case degree < accel's
>> >> > >> NUM_POLY_INT_COEFFS for (i = degree; i < NUM_POLY_INT_COEFFS; i++)
>> >> > >>   poly_int.coeffs[i] = 0;
>> >> > >>
>> >> > >> Patch passes bootstrap+test and LTO bootstrap+test on aarch64-
>> >> > linux-gnu.
>> >> > >> LTO bootstrap+test on x86_64-linux-gnu in progress.
>> >> > >>
>> >> > >> I am not quite sure how to test it for offloading since currently
>> >> > it's (entirely) broken for aarch64->nvptx.
>> >> > >> I can give a try with x86_64->nvptx offloading if required (altho I
>> >> > >> guess LTO bootstrap should test streaming changes ?)
>> >> > >
>> >> > > +  unsigned degree
>> >> > > += bp_unpack_value (bp, BITS_PER_UNIT * sizeof (unsigned
>> >> > > HOST_WIDE_INT));
>> >> > >
>> >> > > The NUM_POLY_INT_COEFFS target define doesn't seem to be constrained
>> >> > > to any type it needs to fit into, using HOST_WIDE_INT is arbitrary.
>> >> > > I'd say we should constrain it to a reasonable upper bound, like 2?
>> >> > > Maybe even have MAX_NUM_POLY_INT_COEFFS or NUM_POLY_INT_COEFFS_BITS
>> >> > in
>> >> > > poly-int.h and constrain NUM_POLY_INT_COEFFS.
>> >> > >
>> >> > > The patch looks reasonable over all, but Richard S. should have a
>> >> > say
>> >> > > about the abstraction you chose and the poly-int adjustment.
>> >> > 
>> >> > Sorry if this has been discussed already, but could we instead stream
>> >> > NUM_POLY_INT_COEFFS once per file, rather than once per poly_int?
>> >> > It's a target invariant, and poly_int has wormed its way into lots of
>> >> > things by now :)
>> >> Hi Richard,
>> >> The patch doesn't stream out NUM_POLY_INT_COEFFS, but the degree of 
>> >> poly_int (and streams-out coeffs only up to degree, ignoring the higher 
>> >> zero coeffs).
>> >> During streaming-in, it reads back the degree (and streamed coeffs upto 
>> >> degree) and issues an error if degree > accel's NUM_POLY_INT_COEFFS, 
>> >> since we can't
>> >> (as-is) represent a degree-N poly_int on accel with NUM_POLY_INT_COEFFS < 
>> >> N. If degree < accel's NUM_POLY_INT_COEFFS, the remaining coeffs are set 
>> >> to 0
>> >> (similar to zero-extension). I posted more details in RFC: 
>> >> https://gcc.gnu.org/pipermail/gcc/2024-July/244466.html
>> 
>> It's not clear to me what the plan is for VLA host + VLS offloading.
>> Is the streamed data guaranteed to be "clean" of any host-only
>> VLA stuff?  E.g. if code does:
>> 
>>   #include 
>> 
>>   svint32_t *ptr:
>>   void foo(svint32_t);
>> 
>>   #pragma GCC target "+nosve"
>> 
>>   ...offloading...
>> 
>> is there a guarantee that the offload target won't see the definition
>> of ptr and foo?
>
> No.  If it sees any unsupported poly-* the offload compilation will fail.

Could it fail even if the offloading code doesn't refer to ptr and foo
directly?  Or is only "relevant" stuff streamed?

> I think all current issues are because of poly-* leaking in for cases
> where a non-poly would have worked fine, but I have not had a look
> myself.

One of the cases that Prathamesh mentions is streaming the mode sizes.
Are those modes "offload target modes" or "host modes"?  It seems like
it shouldn't be an error for the host to have VLA modes per se.  It's
just that those modes can't be used in the host/offload interface.

Thanks,
Richard



Re: Support streaming of poly_int for offloading when it's degree <= accel's NUM_POLY_INT_COEFFS

2024-07-30 Thread Richard Biener
On Tue, 30 Jul 2024, Richard Sandiford wrote:

> Richard Biener  writes:
> > On Tue, 30 Jul 2024, Richard Sandiford wrote:
> >
> >> Richard Biener  writes:
> >> > On Tue, 30 Jul 2024, Prathamesh Kulkarni wrote:
> >> >
> >> >> 
> >> >> 
> >> >> > -Original Message-
> >> >> > From: Richard Sandiford 
> >> >> > Sent: Monday, July 29, 2024 9:43 PM
> >> >> > To: Richard Biener 
> >> >> > Cc: Prathamesh Kulkarni ; gcc-
> >> >> > patc...@gcc.gnu.org
> >> >> > Subject: Re: Support streaming of poly_int for offloading when it's
> >> >> > degree <= accel's NUM_POLY_INT_COEFFS
> >> >> > 
> >> >> > 
> >> >> > 
> >> >> > Richard Biener  writes:
> >> >> > > On Mon, 29 Jul 2024, Prathamesh Kulkarni wrote:
> >> >> > >
> >> >> > >> Hi Richard,
> >> >> > >> Thanks for your suggestions on RFC email, the attached patch adds
> >> >> > support for streaming of poly_int when it's degree <= accel's
> >> >> > NUM_POLY_INT_COEFFS.
> >> >> > >> The patch changes streaming of poly_int as follows:
> >> >> > >>
> >> >> > >> Streaming out poly_int:
> >> >> > >>
> >> >> > >> degree = poly_int.degree();
> >> >> > >> stream out degree;
> >> >> > >> for (i = 0; i < degree; i++)
> >> >> > >>   stream out poly_int.coeffs[i];
> >> >> > >>
> >> >> > >> Streaming in poly_int:
> >> >> > >>
> >> >> > >> stream in degree;
> >> >> > >> if (degree > NUM_POLY_INT_COEFFS)
> >> >> > >>   fatal_error();
> >> >> > >> stream in coeffs;
> >> >> > >> // Set remaining coeffs to zero in case degree < accel's
> >> >> > >> NUM_POLY_INT_COEFFS for (i = degree; i < NUM_POLY_INT_COEFFS; i++)
> >> >> > >>   poly_int.coeffs[i] = 0;
> >> >> > >>
> >> >> > >> Patch passes bootstrap+test and LTO bootstrap+test on aarch64-
> >> >> > linux-gnu.
> >> >> > >> LTO bootstrap+test on x86_64-linux-gnu in progress.
> >> >> > >>
> >> >> > >> I am not quite sure how to test it for offloading since currently
> >> >> > it's (entirely) broken for aarch64->nvptx.
> >> >> > >> I can give a try with x86_64->nvptx offloading if required (altho I
> >> >> > >> guess LTO bootstrap should test streaming changes ?)
> >> >> > >
> >> >> > > +  unsigned degree
> >> >> > > += bp_unpack_value (bp, BITS_PER_UNIT * sizeof (unsigned
> >> >> > > HOST_WIDE_INT));
> >> >> > >
> >> >> > > The NUM_POLY_INT_COEFFS target define doesn't seem to be constrained
> >> >> > > to any type it needs to fit into, using HOST_WIDE_INT is arbitrary.
> >> >> > > I'd say we should constrain it to a reasonable upper bound, like 2?
> >> >> > > Maybe even have MAX_NUM_POLY_INT_COEFFS or NUM_POLY_INT_COEFFS_BITS
> >> >> > in
> >> >> > > poly-int.h and constrain NUM_POLY_INT_COEFFS.
> >> >> > >
> >> >> > > The patch looks reasonable over all, but Richard S. should have a
> >> >> > say
> >> >> > > about the abstraction you chose and the poly-int adjustment.
> >> >> > 
> >> >> > Sorry if this has been discussed already, but could we instead stream
> >> >> > NUM_POLY_INT_COEFFS once per file, rather than once per poly_int?
> >> >> > It's a target invariant, and poly_int has wormed its way into lots of
> >> >> > things by now :)
> >> >> Hi Richard,
> >> >> The patch doesn't stream out NUM_POLY_INT_COEFFS, but the degree of 
> >> >> poly_int (and streams-out coeffs only up to degree, ignoring the higher 
> >> >> zero coeffs).
> >> >> During streaming-in, it reads back the degree (and streamed coeffs upto 
> >> >> degree) and issues an error if degree > accel's NUM_POLY_INT_COEFFS, 
> >> >> since we can't
> >> >> (as-is) represent a degree-N poly_int on accel with NUM_POLY_INT_COEFFS 
> >> >> < N. If degree < accel's NUM_POLY_INT_COEFFS, the remaining coeffs are 
> >> >> set to 0
> >> >> (similar to zero-extension). I posted more details in RFC: 
> >> >> https://gcc.gnu.org/pipermail/gcc/2024-July/244466.html
> >> 
> >> It's not clear to me what the plan is for VLA host + VLS offloading.
> >> Is the streamed data guaranteed to be "clean" of any host-only
> >> VLA stuff?  E.g. if code does:
> >> 
> >>   #include 
> >> 
> >>   svint32_t *ptr:
> >>   void foo(svint32_t);
> >> 
> >>   #pragma GCC target "+nosve"
> >> 
> >>   ...offloading...
> >> 
> >> is there a guarantee that the offload target won't see the definition
> >> of ptr and foo?
> >
> > No.  If it sees any unsupported poly-* the offload compilation will fail.
> 
> Could it fail even if the offloading code doesn't refer to ptr and foo
> directly?  Or is only "relevant" stuff streamed?

Only "relevant" stuff should be streamed - the offload code and all
trees referred to.

> > I think all current issues are because of poly-* leaking in for cases
> > where a non-poly would have worked fine, but I have not had a look
> > myself.
> 
> One of the cases that Prathamesh mentions is streaming the mode sizes.
> Are those modes "offload target modes" or "host modes"?  It seems like
> it shouldn't be an error for the host to have VLA modes per se.  It's
> just that those modes can't be used in the host/offlo

Re: Support streaming of poly_int for offloading when it's degree <= accel's NUM_POLY_INT_COEFFS

2024-07-30 Thread Jakub Jelinek
On Tue, Jul 30, 2024 at 11:25:42AM +0200, Richard Biener wrote:
> Only "relevant" stuff should be streamed - the offload code and all
> trees referred to.

Yeah.

> > > I think all current issues are because of poly-* leaking in for cases
> > > where a non-poly would have worked fine, but I have not had a look
> > > myself.
> > 
> > One of the cases that Prathamesh mentions is streaming the mode sizes.
> > Are those modes "offload target modes" or "host modes"?  It seems like
> > it shouldn't be an error for the host to have VLA modes per se.  It's
> > just that those modes can't be used in the host/offload interface.
> 
> There's a requirement that a mode mapping exists from the host to
> target enum machine_mode.  I don't remember exactly how we compute
> that mapping and whether streaming of some data (and thus poly-int)
> are part of this.

During streaming out, the code records what machine modes are being streamed
(in streamer_mode_table).
For those modes (and their inner modes) then lto_write_mode_table
should stream a table with mode details like class, bits, size, inner mode,
nunits, real mode format if any, etc.
That table is then streamed in in the offloading compiler and it attempts to
find corresponding modes (and emits fatal_error if there is no such mode;
consider say x86_64 long double with XFmode being used in offloading code
which doesn't have XFmode support).
Now, because Richard S. changed GET_MODE_SIZE etc. to give poly_int rather
than int, this has been changed to use bp_pack_poly_value; but that relies
on the same number of coefficients for poly_int, which is not the case when
e.g. offloading aarch64 to gcn or nvptx.

From what I can see, this mode table handling is the only use of
bp_pack_poly_value.  So the options are either to stream at the start of the
mode table the NUM_POLY_INT_COEFFS value and in bp_unpack_poly_value pass to
it what we've read and fill in any remaining coeffs with zeros, or in each
bp_pack_poly_value stream the number of coefficients and then stream that
back in and fill in remaining ones (and diagnose if it would try to read
non-zero coefficient which isn't stored).
I think streaming NUM_POLY_INT_COEFFS once would be more compact (at least
for non-aarch64/riscv targets).
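
The first variant would look roughly like this (just a sketch; host_coeffs,
the per-coefficient bit count and MAX_NUM_POLY_INT_COEFFS_BITS are
placeholders, not existing interfaces):

  /* Writer, once at the start of the mode table.  */
  bp_pack_value (&bp, NUM_POLY_INT_COEFFS, MAX_NUM_POLY_INT_COEFFS_BITS);

  /* Reader; host_coeffs is the value unpacked once at the start.  */
  poly_int64 size;
  for (unsigned i = 0; i < host_coeffs; i++)
    {
      HOST_WIDE_INT c = bp_unpack_value (&bp, 16);
      if (i < NUM_POLY_INT_COEFFS)
        size.coeffs[i] = c;
      else if (c != 0)
        fatal_error (input_location,
                     "unsupported mode size in offload mode table");
    }
  /* Fill any remaining coefficients with zeros.  */
  for (unsigned i = host_coeffs; i < NUM_POLY_INT_COEFFS; i++)
    size.coeffs[i] = 0;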

Jakub



RE: [RFC][middle-end] SLP Early break and control flow support in GCC

2024-07-30 Thread Richard Biener
On Tue, 30 Jul 2024, Tamar Christina wrote:

> > -Original Message-
> > From: Richard Biener 
> > Sent: Thursday, July 18, 2024 10:00 AM
> > To: Tamar Christina 
> > Cc: GCC Patches ; Richard Sandiford
> > 
> > Subject: RE: [RFC][middle-end] SLP Early break and control flow support in 
> > GCC
> > 
> > On Wed, 17 Jul 2024, Tamar Christina wrote:
> > 
> > > > -Original Message-
> > > > From: Richard Biener 
> > > > Sent: Tuesday, July 16, 2024 4:08 PM
> > > > To: Tamar Christina 
> > > > Cc: GCC Patches ; Richard Sandiford
> > > > 
> > > > Subject: Re: [RFC][middle-end] SLP Early break and control flow support 
> > > > in GCC
> > > >
> > > > On Mon, 15 Jul 2024, Tamar Christina wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > This RFC document covers at a high level how to extend early break 
> > > > > support in
> > > > > GCC to support SLP and how this will be extended in the future to 
> > > > > support
> > > > > full control flow in GCC.
> > > > >
> > > > > The basic idea in this is based on the paper "All You Need Is 
> > > > > Superword-Level
> > > > > Parallelism: Systematic Control-Flow Vectorization with SLP"[1] but 
> > > > > it is
> > > > > adapted to fit within the GCC vectorizer.
> > > >
> > > > An interesting read - I think the approach is viable for loop
> > > > vectorization where we schedule the whole vectorized loop but difficult
> > > > for basic-block vectorization as they seem to re-build the whole 
> > > > function
> > > > from scratch.  They also do not address how to code-generate predicated
> > > > not vectorized code or how they decide to handle mixed vector/non-vector
> > > > code at all.  For example I don't think any current CPU architecture
> > > > supports a full set of predicated _scalar_ ops and not every scalar
> > > > op would have a vector equivalent in case one would use single-lane
> > > > vectors.
> > >
> > > Hmm I'm guessing you mean here, they don't address for BB vectorization
> > > how to deal with the fact that you may not always be able to vectorize the
> > > entire function up from the seed? I thought today we dealt with that by
> > > splitting during discovery?  Can we not do the same? i.e. treat them as
> > > externals?
> > 
> > Currently not all scalar stmts are even reached by the seeds we use.
> > They seem to simply rewrite the whole function into predicated form
> > and I don't see how that is a transform that gets you back 1:1 the old
> > code (or code of at least the same quality) if not all statements end
> > up vectorized?
> > 
> > Sure we could build up single-lane SLP instances for "everything",
> > but then how do you code-generate these predicated single-lane SLP
> > instances?
> > 
> > > >
> > > > For GCCs basic-block vectorization the main outstanding issue is one
> > > > of scheduling and dependences with scalar code (live lane extracts,
> > > > vector builds from scalars) as well.
> > > >
> > > > > What will not be covered is the support for First-Faulting Loads nor
> > > > > alignment peeling as these are done by a different engineer.
> > > > >
> > > > > == Concept and Theory ==
> > > > >
> > > > > Supporting Early Break in SLP requires the same theory as general 
> > > > > control
> > flow,
> > > > > the only difference is in that Early break can be supported for 
> > > > > non-masked
> > > > > architectures while full control flow support requires a fully masked
> > > > > architecture.
> > > > >
> > > > > In GCC 14 Early break was added for non-SLP by teaching the 
> > > > > vectorizer to
> > > > branch
> > > > > to a scalar epilogue whenever an early exit is taken.  This means the 
> > > > > vectorizer
> > > > > itself doesn't need to know how to perform side-effects or final 
> > > > > reductions.
> > > >
> > > > With current GCC we avoid the need of predicated stmts by using the 
> > > > scalar
> > > > epilog and a branch-on-all-true/false stmt.  To make that semantically
> > > > valid stmts are moved around.
> > > >
> > > > > The additional benefit of this is that it works for any target 
> > > > > providing a
> > > > > vector cbranch optab implementation since we don't require masking
> > support.
> > > > In
> > > > > order for this to work we need to have code motion that moves side 
> > > > > effects
> > > > > (primarily stores) into the latch basic block.  i.e. we delay any 
> > > > > side-effects
> > > > > up to a point where we know the full vector iteration would have been
> > > > performed.
> > > > > For this to work however we had to disable support for epilogue 
> > > > > vectorization
> > as
> > > > > when the scalar statements are moved we break the link to the 
> > > > > original BB
> > they
> > > > > were in.  This means that the stmt_vinfo for the stores that need to 
> > > > > be moved
> > > > > will no longer be valid for epilogue vectorization and the only way 
> > > > > to recover
> > > > > this would be to perform a full dataflow analysis again.  We decided 
> > > > > against
> > > > > this as the

[PATCH 1/3][v2] Add TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Richard Biener
The following adds a target hook to specify whether regs of MODE can be
used to transfer bits.  The hook is supposed to be used for value-numbering
to decide whether a value loaded in such mode can be punned to another
mode instead of re-loading the value in the other mode and for SRA to
decide whether MODE is suitable as container holding a value to be
used in different modes.
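
For illustration, the value-numbering use (see patch 3/3) boils down to a
check along the lines of

  /* Accesses in different modes can only be considered equal when the
     mode of the first one can round-trip all of its bits.  */
  if (TYPE_MODE (vr1->type) != TYPE_MODE (vr2->type)
      && !mode_can_transfer_bits (TYPE_MODE (vr1->type)))
    return false;

and SRA would ask the same question when choosing the mode of a
representative container.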

Bootstrapped and tested on x86_64-unknown-linux-gnu, OK this way?

Thanks,
Richard.

* target.def (mode_can_transfer_bits): New target hook.
* target.h (mode_can_transfer_bits): New function wrapping the
hook and providing default behavior.
* doc/tm.texi.in: Update.
* doc/tm.texi: Re-generate.
---
 gcc/doc/tm.texi|  6 ++
 gcc/doc/tm.texi.in |  2 ++
 gcc/target.def |  8 
 gcc/target.h   | 16 
 4 files changed, 32 insertions(+)

diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index c7535d07f4d..fa53c23f1de 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -4545,6 +4545,12 @@ is either a declaration of type int or accessed by 
dereferencing
 a pointer to int.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_MODE_CAN_TRANSFER_BITS (machine_mode 
@var{mode})
+Define this to return false if the mode @var{mode} cannot be used
+for memory copying.  The default is to assume modes with the same
+precision as size are fine to be used.
+@end deftypefn
+
 @deftypefn {Target Hook} machine_mode TARGET_TRANSLATE_MODE_ATTRIBUTE 
(machine_mode @var{mode})
 Define this hook if during mode attribute processing, the port should
 translate machine_mode @var{mode} to another mode.  For example, rs6000's
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 64cea3b1eda..8af3f414505 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -3455,6 +3455,8 @@ stack.
 
 @hook TARGET_REF_MAY_ALIAS_ERRNO
 
+@hook TARGET_MODE_CAN_TRANSFER_BITS
+
 @hook TARGET_TRANSLATE_MODE_ATTRIBUTE
 
 @hook TARGET_SCALAR_MODE_SUPPORTED_P
diff --git a/gcc/target.def b/gcc/target.def
index 3de1aad4c84..4356ef2f974 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -3363,6 +3363,14 @@ a pointer to int.",
  bool, (ao_ref *ref),
  default_ref_may_alias_errno)
 
+DEFHOOK
+(mode_can_transfer_bits,
+ "Define this to return false if the mode @var{mode} cannot be used\n\
+for memory copying.  The default is to assume modes with the same\n\
+precision as size are fine to be used.",
+ bool, (machine_mode mode),
+ NULL)
+
 /* Support for named address spaces.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_ADDR_SPACE_"
diff --git a/gcc/target.h b/gcc/target.h
index c1f99b97b86..837651d273a 100644
--- a/gcc/target.h
+++ b/gcc/target.h
@@ -312,6 +312,22 @@ estimated_poly_value (poly_int64 x,
 return targetm.estimated_poly_value (x, kind);
 }
 
+/* Return true when MODE can be used to copy GET_MODE_BITSIZE bits
+   unchanged.  */
+
+inline bool
+mode_can_transfer_bits (machine_mode mode)
+{
+  if (mode == BLKmode)
+return true;
+  if (maybe_ne (GET_MODE_BITSIZE (mode),
+   GET_MODE_UNIT_PRECISION (mode) * GET_MODE_NUNITS (mode)))
+return false;
+  if (targetm.mode_can_transfer_bits)
+return targetm.mode_can_transfer_bits (mode);
+  return true;
+}
+
 #ifdef GCC_TM_H
 
 #ifndef CUMULATIVE_ARGS_MAGIC
-- 
2.43.0



[PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Richard Biener
The following implements the hook, excluding x87 modes for scalar
and complex float modes.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

OK?

Thanks,
Richard.

* i386.cc (TARGET_MODE_CAN_TRANSFER_BITS): Define.
(ix86_mode_can_transfer_bits): New function.
---
 gcc/config/i386/i386.cc | 21 +
 1 file changed, 21 insertions(+)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 12d15feb5e9..5184366916b 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -26113,6 +26113,24 @@ ix86_have_ccmp ()
   return (bool) TARGET_APX_CCMP;
 }
 
+/* Implement TARGET_MODE_CAN_TRANSFER_BITS.  */
+static bool
+ix86_mode_can_transfer_bits (machine_mode mode)
+{
+  if (GET_MODE_CLASS (mode) == MODE_FLOAT
+  || GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT)
+switch (GET_MODE_INNER (mode))
+  {
+  case SFmode:
+  case DFmode:
+   return TARGET_SSE_MATH && !TARGET_MIX_SSE_I387;
+  default:
+   return false;
+  }
+
+  return true;
+}
+
 /* Target-specific selftests.  */
 
 #if CHECKING_P
@@ -26959,6 +26977,9 @@ ix86_libgcc_floating_mode_supported_p
 #undef TARGET_HAVE_CCMP
 #define TARGET_HAVE_CCMP ix86_have_ccmp
 
+#undef TARGET_MODE_CAN_TRANSFER_BITS
+#define TARGET_MODE_CAN_TRANSFER_BITS ix86_mode_can_transfer_bits
+
 static bool
 ix86_libc_has_fast_function (int fcode ATTRIBUTE_UNUSED)
 {
-- 
2.43.0



[PATCH 3/3][v2] tree-optimization/114659 - VN and FP to int punning

2024-07-30 Thread Richard Biener
The following addresses another case where x87 FP loads mangle the
bit representation and thus are not suitable for a representative
in other types.  VN was value-numbering a later integer load of 'x'
as the same as a former float load of 'x'.

We can use the new TARGET_MODE_CAN_TRANSFER_BITS hook to identify
problematic modes and enforce strict compatibility for those in
the reference comparison, improving the handling of modes with
padding in visit_reference_op_load.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

PR tree-optimization/114659
* tree-ssa-sccvn.cc (visit_reference_op_load): Do not
prevent punning from modes with padding here, but ...
(vn_reference_eq): ... ensure this here, also honoring
types with modes that cannot act as bit container.

* gcc.target/i386/pr114659.c: New testcase.
---
 gcc/testsuite/gcc.target/i386/pr114659.c | 62 
 gcc/tree-ssa-sccvn.cc| 11 ++---
 2 files changed, 66 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr114659.c

diff --git a/gcc/testsuite/gcc.target/i386/pr114659.c 
b/gcc/testsuite/gcc.target/i386/pr114659.c
new file mode 100644
index 000..e1e24d55687
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr114659.c
@@ -0,0 +1,62 @@
+/* { dg-do run } */
+/* { dg-options "-O2" } */
+
+int
+my_totalorderf (float const *x, float const *y)
+{
+  int xs = __builtin_signbit (*x);
+  int ys = __builtin_signbit (*y);
+  if (!xs != !ys)
+return xs;
+
+  int xn = __builtin_isnan (*x);
+  int yn = __builtin_isnan (*y);
+  if (!xn != !yn)
+return !xn == !xs;
+  if (!xn)
+return *x <= *y;
+
+  unsigned int extended_sign = -!!xs;
+  union { unsigned int i; float f; } xu = {0}, yu = {0};
+  __builtin_memcpy (&xu.f, x, sizeof (float));
+  __builtin_memcpy (&yu.f, y, sizeof (float));
+  return (xu.i ^ extended_sign) <= (yu.i ^ extended_sign);
+}
+
+static float
+positive_NaNf ()
+{
+  float volatile nan = 0.0f / 0.0f;
+  return (__builtin_signbit (nan) ? - nan : nan);
+}
+
+typedef union { float value; unsigned int word[1]; } memory_float;
+
+static memory_float
+construct_memory_SNaNf (float quiet_value)
+{
+  memory_float m;
+  m.value = quiet_value;
+  m.word[0] ^= (unsigned int) 1 << 22;
+  m.word[0] |= (unsigned int) 1;
+  return m;
+}
+
+memory_float x[7] =
+  {
+{ 0 },
+{ 1e-5 },
+{ 1 },
+{ 1e37 },
+{ 1.0f / 0.0f },
+  };
+
+int
+main ()
+{
+  x[5] = construct_memory_SNaNf (positive_NaNf ());
+  x[6] = (memory_float) { positive_NaNf () };
+  if (! my_totalorderf (&x[5].value, &x[6].value))
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-ssa-sccvn.cc b/gcc/tree-ssa-sccvn.cc
index dc377fa16ce..0639ba426ff 100644
--- a/gcc/tree-ssa-sccvn.cc
+++ b/gcc/tree-ssa-sccvn.cc
@@ -837,6 +837,9 @@ vn_reference_eq (const_vn_reference_t const vr1, 
const_vn_reference_t const vr2)
TYPE_VECTOR_SUBPARTS (vr2->type)))
return false;
 }
+  else if (TYPE_MODE (vr1->type) != TYPE_MODE (vr2->type)
+  && !mode_can_transfer_bits (TYPE_MODE (vr1->type)))
+return false;
 
   i = 0;
   j = 0;
@@ -5814,13 +5817,7 @@ visit_reference_op_load (tree lhs, tree op, gimple *stmt)
   if (result
   && !useless_type_conversion_p (TREE_TYPE (result), TREE_TYPE (op)))
 {
-  /* Avoid the type punning in case the result mode has padding where
-the op we lookup has not.  */
-  if (TYPE_MODE (TREE_TYPE (result)) != BLKmode
- && maybe_lt (GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (result))),
-  GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (op)
-   result = NULL_TREE;
-  else if (CONSTANT_CLASS_P (result))
+  if (CONSTANT_CLASS_P (result))
result = const_unop (VIEW_CONVERT_EXPR, TREE_TYPE (op), result);
   else
{
-- 
2.43.0


Re: [PATCH v1] gcc/: Rename array_type_nelts() => array_type_nelts_minus_one()

2024-07-30 Thread Richard Biener
On Mon, Jul 29, 2024 at 5:15 PM Alejandro Colomar  wrote:
>
> The old name was misleading.
>
> While at it, also rename some temporary variables that are used with
> this function, for consistency.
>
> Link: 
> https://inbox.sourceware.org/gcc-patches/9fffd80-dca-2c7e-14b-6c9b509a7...@redhat.com/T/#m2f661c67c8f7b2c405c8c7fc3152dd85dc729120
> Cc: Gabriel Ravier 
> Cc: Martin Uecker 
> Cc: Joseph Myers 
> Cc: Xavier Del Campo Romero 
>
> gcc/ChangeLog:
>
> * tree.cc (array_type_nelts): Rename function ...
> (array_type_nelts_minus_one): ... to this name.  The old name
> was misleading.
> * tree.h: Likewise.
> * c/c-decl.cc: Likewise.
> * c/c-fold.cc: Likewise.

This and the cp/ and fortran/ and rust/ entries below have different ChangeLog
files and thus need not be prefixed but need

gcc/cp/ChangeLog:

etc.

The changes look good to me; please give the frontend maintainers time to
chime in.  Also, Jakub had reservations about the renaming because of
branch maintenance.  I think if that proves an issue we could backport the
renaming as well, or make sure that array_type_nelts is not re-introduced
with the same name but different semantics.

Richard.

> * config/aarch64/aarch64.cc: Likewise.
> * config/i386/i386.cc: Likewise.
> * cp/decl.cc: Likewise.
> * cp/init.cc: Likewise.
> * cp/lambda.cc: Likewise.
> * cp/tree.cc: Likewise.
> * expr.cc: Likewise.
> * fortran/trans-array.cc: Likewise.
> * fortran/trans-openmp.cc: Likewise.
> * rust/backend/rust-tree.cc: Likewise.
>
> Suggested-by: Richard Biener 
> Signed-off-by: Alejandro Colomar 
> ---
> Range-diff against v0:
> -:  --- > 1:  82efbc3c540 gcc/: Rename array_type_nelts() => 
> array_type_nelts_minus_one()
>
>  gcc/c/c-decl.cc   | 10 +-
>  gcc/c/c-fold.cc   |  7 ---
>  gcc/config/aarch64/aarch64.cc |  2 +-
>  gcc/config/i386/i386.cc   |  2 +-
>  gcc/cp/decl.cc|  2 +-
>  gcc/cp/init.cc|  8 
>  gcc/cp/lambda.cc  |  3 ++-
>  gcc/cp/tree.cc|  2 +-
>  gcc/expr.cc   |  8 
>  gcc/fortran/trans-array.cc|  2 +-
>  gcc/fortran/trans-openmp.cc   |  4 ++--
>  gcc/rust/backend/rust-tree.cc |  2 +-
>  gcc/tree.cc   |  4 ++--
>  gcc/tree.h|  2 +-
>  14 files changed, 30 insertions(+), 28 deletions(-)
>
> diff --git a/gcc/c/c-decl.cc b/gcc/c/c-decl.cc
> index 97f1d346835..4dced430d1f 100644
> --- a/gcc/c/c-decl.cc
> +++ b/gcc/c/c-decl.cc
> @@ -5309,7 +5309,7 @@ one_element_array_type_p (const_tree type)
>  {
>if (TREE_CODE (type) != ARRAY_TYPE)
>  return false;
> -  return integer_zerop (array_type_nelts (type));
> +  return integer_zerop (array_type_nelts_minus_one (type));
>  }
>
>  /* Determine whether TYPE is a zero-length array type "[0]".  */
> @@ -6257,15 +6257,15 @@ get_parm_array_spec (const struct c_parm *parm, tree 
> attrs)
>   for (tree type = parm->specs->type; TREE_CODE (type) == ARRAY_TYPE;
>type = TREE_TYPE (type))
> {
> - tree nelts = array_type_nelts (type);
> - if (error_operand_p (nelts))
> + tree nelts_minus_one = array_type_nelts_minus_one (type);
> + if (error_operand_p (nelts_minus_one))
> return attrs;
> - if (TREE_CODE (nelts) != INTEGER_CST)
> + if (TREE_CODE (nelts_minus_one) != INTEGER_CST)
> {
>   /* Each variable VLA bound is represented by the dollar
>  sign.  */
>   spec += "$";
> - tpbnds = tree_cons (NULL_TREE, nelts, tpbnds);
> + tpbnds = tree_cons (NULL_TREE, nelts_minus_one, tpbnds);
> }
> }
>   tpbnds = nreverse (tpbnds);
> diff --git a/gcc/c/c-fold.cc b/gcc/c/c-fold.cc
> index 57b67c74bd8..9ea174f79c4 100644
> --- a/gcc/c/c-fold.cc
> +++ b/gcc/c/c-fold.cc
> @@ -73,11 +73,12 @@ c_fold_array_ref (tree type, tree ary, tree index)
>unsigned elem_nchars = (TYPE_PRECISION (elem_type)
>   / TYPE_PRECISION (char_type_node));
>unsigned len = (unsigned) TREE_STRING_LENGTH (ary) / elem_nchars;
> -  tree nelts = array_type_nelts (TREE_TYPE (ary));
> +  tree nelts_minus_one = array_type_nelts_minus_one (TREE_TYPE (ary));
>bool dummy1 = true, dummy2 = true;
> -  nelts = c_fully_fold_internal (nelts, true, &dummy1, &dummy2, false, 
> false);
> +  nelts_minus_one = c_fully_fold_internal (nelts_minus_one, true, &dummy1,
> +  &dummy2, false, false);
>unsigned HOST_WIDE_INT i = tree_to_uhwi (index);
> -  if (!tree_int_cst_le (index, nelts)
> +  if (!tree_int_cst_le (index, nelts_minus_one)
>|| i >= len
>|| i + elem_nchars > len)
>  return NULL_TREE;
> diff --git a/gcc/config/aarch64/aarc

Re: [PATCH v1] gcc/: Rename array_type_nelts() => array_type_nelts_minus_one()

2024-07-30 Thread Jakub Jelinek
On Tue, Jul 30, 2024 at 12:22:01PM +0200, Richard Biener wrote:
> On Mon, Jul 29, 2024 at 5:15 PM Alejandro Colomar  wrote:
> >
> > The old name was misleading.
> >
> > While at it, also rename some temporary variables that are used with
> > this function, for consistency.
> >
> > Link: 
> > https://inbox.sourceware.org/gcc-patches/9fffd80-dca-2c7e-14b-6c9b509a7...@redhat.com/T/#m2f661c67c8f7b2c405c8c7fc3152dd85dc729120
> > Cc: Gabriel Ravier 
> > Cc: Martin Uecker 
> > Cc: Joseph Myers 
> > Cc: Xavier Del Campo Romero 
> >
> > gcc/ChangeLog:
> >
> > * tree.cc (array_type_nelts): Rename function ...
> > (array_type_nelts_minus_one): ... to this name.  The old name
> > was misleading.
> > * tree.h: Likewise.
> > * c/c-decl.cc: Likewise.
> > * c/c-fold.cc: Likewise.
> 
> This and the cp/ and fortran/ and rust/ entries below have different ChangeLog
> files and thus need not be prefixed but need
> 
> gcc/cp/ChangeLog:
> 
> etc.

And not just that: the first entry in each of those ChangeLogs also shouldn't
just say "Likewise.", because the context isn't there, so you need to repeat
what changed.
Also, generally, we specify what exact functions/macros/etc. have been
changed, not just what files (unless it would be too large, hundreds or
thousands of changes, which is not the case here).
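
So, for example, something along these lines (the function names below are
only illustrative, not checked against the patch):

gcc/ChangeLog:

	* tree.cc (array_type_nelts): Rename to ...
	(array_type_nelts_minus_one): ... this.
	* tree.h (array_type_nelts): Rename to array_type_nelts_minus_one.

gcc/cp/ChangeLog:

	* decl.cc (grokdeclarator): Adjust for the array_type_nelts to
	array_type_nelts_minus_one renaming.
	* init.cc (build_new_1): Likewise.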

Jakub



Re: [PATCH] middle-end: Add and use few helper methods for current_properties

2024-07-30 Thread Richard Biener
On Sat, Jul 27, 2024 at 4:29 AM Andrew Pinski  wrote:
>
> While working on isel, I found that the current way of handling
> current_properties in struct function makes it easy to make mistakes,
> and having to write things like `(a & b) == 0`, `a |= b;` and `a &= ~b;`
> obscures what is going on.
> So let's add a few helper methods to function:
> * set_property
> * unset_property
> * prop_set_p
> * gimple_prop_p

We have, for cfun->cfg.x_current_loops:

loops_state_satisfies_p, loops_state_set, loops_state_clear

and for example

gimple_set_visited (gimple *stmt, bool visited_p)

The API doesn't look consistent with any existing one?  Is gimple_prop_p
common enough to warrant special-casing?

I think the change is OK, but I wanted to raise the lack of an established
coding style for this kind of API.  I'd have used

 property_set
 property_clear
 property_set_p

not *_property vs. prop_*; prefixing the domain is better than postfixing.  And
I'd have avoided gimple_prop_p (bad name anyway - gimple_property_set_p
or property_gimple_set_p?)

It's also the first member function in struct function ... I'd have passed it
as first argument.  If it's a member function the property field should become
private, otherwise it's a bit pointless IMO.
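
As a rough sketch of the shape I have in mind (using the existing
curr_properties field; illustrative only):

  /* In struct function:  */
  bool property_set_p (unsigned int prop) const
  { return (curr_properties & prop) != 0; }
  void property_set (unsigned int props)
  { curr_properties |= props; }
  void property_clear (unsigned int props)
  { curr_properties &= ~props; }

with curr_properties made private, so passes would write
cfun->property_set (PROP_ssa) or cfun->property_set_p (PROP_cfg) instead
of open-coding the bit operations.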

Thanks,
Richard.

> and use them in the source; I didn't change of the backends which has a few 
> places
> which could change.
>
> Also moves the PROP_* defines from tree-pass.h to function.h where
> they should be.
>
> Bootstrapped and tested on x86_64-linux-gnu.
>
> gcc/ChangeLog:
>
> * tree-pass.h (PROP_gimple_any): Delete.
> (PROP_gimple_lcf): Delete.
> (PROP_gimple_leh): Delete.
> (PROP_cfg): Delete.
> (PROP_objsz): Delete.
> (PROP_ssa): Delete.
> (PROP_no_crit_edges): Delete.
> (PROP_rtl): Delete.
> (PROP_gimple_lomp): Delete.
> (PROP_cfglayout): Delete.
> (PROP_gimple_lcx): Delete.
> (PROP_loops): Delete.
> (PROP_gimple_lvec): Delete.
> (PROP_gimple_eomp): Delete.
> (PROP_gimple_lva): Delete.
> (PROP_gimple_opt_math): Delete.
> (PROP_gimple_lomp_dev): Delete.
> (PROP_rtl_split_insns): Delete.
> (PROP_loop_opts_done): Delete.
> (PROP_assumptions_done): Delete.
> (PROP_gimple_lbitint): Delete.
> (PROP_gimple): Delete.
> * function.h (PROP_gimple_any): Move from tree-pass.h.
> (PROP_gimple_lcf): Move from tree-pass.h.
> (PROP_gimple_leh): Move from tree-pass.h.
> (PROP_cfg): Move from tree-pass.h.
> (PROP_objsz): Move from tree-pass.h.
> (PROP_ssa): Move from tree-pass.h.
> (PROP_no_crit_edges): Move from tree-pass.h.
> (PROP_rtl): Move from tree-pass.h.
> (PROP_gimple_lomp): Move from tree-pass.h.
> (PROP_cfglayout): Move from tree-pass.h.
> (PROP_gimple_lcx): Move from tree-pass.h. Move from tree-pass.h.
> (PROP_loops): Move from tree-pass.h.
> (PROP_gimple_lvec): Move from tree-pass.h.
> (PROP_gimple_eomp): Move from tree-pass.h.
> (PROP_gimple_lva): Move from tree-pass.h.
> (PROP_gimple_opt_math): Move from tree-pass.h.
> (PROP_gimple_lomp_dev): Move from tree-pass.h.
> (PROP_rtl_split_insns): Move from tree-pass.h.
> (PROP_loop_opts_done): Move from tree-pass.h.
> (PROP_assumptions_done): Move from tree-pass.h.
> (PROP_gimple_lbitint): Move from tree-pass.h.
> (PROP_gimple): Move from tree-pass.h.
> (struct function): Add helper methods, set_property,
> unset_property, prop_set_p, gimple_prop_p
> * cfgexpand.cc (pass_expand::execute): Use unset_property.
> * cfgrtl.cc (print_rtl_with_bb): Use prop_set_p.
> * cgraph.cc (release_function_body): Use unset_property.
> (cgraph_node::verify_node): Use prop_set_p.
> * cgraphunit.cc (symtab_node::native_rtl_p): Use prop_set_p.
> (init_lowered_empty_function): Use set_property.
> (symbol_table::compile): Use prop_set_p.
> * function.cc (free_after_compilation): Use unset_property.
> * generic-match-head.cc (canonicalize_math_p): Use prop_set_p.
> (optimize_vectors_before_lowering_p): Use prop_set_p.
> * gimple-expr.cc (gimple_has_body_p): Use prop_set_p.
> * gimple-lower-bitint.cc (pass_lower_bitint_O0::gate): Use prop_set_p.
> * gimple-match-exports.cc (build_call_internal): Use prop_set_p.
> * gimple-match-head.cc (canonicalize_math_p): Use prop_set_p.
> (canonicalize_math_after_vectorization_p): Use prop_set_p.
> (optimize_vectors_before_lowering_p): Use prop_set_p.
> * gimplify.cc: Remove tree-pass.h include and add timevar.h include.
> (gimplify_call_expr): Use prop_set_p.
> (gimplify_function_tree): Use set_property.
> (gimplify_va_arg_expr): Use unset_property.
> * loop-init.cc (loop_optimizer_init): Use pr

[COMMITTED PATCH] testsuite: fix dg-do run whitespace

2024-07-30 Thread Sam James
This caused the tests not to be run.  I may do further passes for non-run
directives next.

Tested on x86_64-pc-linux-gnu and checked test logs before/after.

PR c/53548
PR target/101529
PR tree-optimization/102359
* c-c++-common/fam-in-union-alone-in-struct-1.c: Fix whitespace in dg 
directive.
* c-c++-common/fam-in-union-alone-in-struct-2.c: Likewise.
* c-c++-common/torture/builtin-shufflevector-2.c: Likewise.
* g++.dg/pr102359_2.C: Likewise.
* g++.target/i386/mvc1.C: Likewise.
---
Committed as obvious.

 gcc/testsuite/c-c++-common/fam-in-union-alone-in-struct-1.c  | 2 +-
 gcc/testsuite/c-c++-common/fam-in-union-alone-in-struct-2.c  | 2 +-
 gcc/testsuite/c-c++-common/torture/builtin-shufflevector-2.c | 2 +-
 gcc/testsuite/g++.dg/pr102359_2.C| 2 +-
 gcc/testsuite/g++.target/i386/mvc1.C | 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/gcc/testsuite/c-c++-common/fam-in-union-alone-in-struct-1.c 
b/gcc/testsuite/c-c++-common/fam-in-union-alone-in-struct-1.c
index 7d4721aa95ac..39ebf17850bf 100644
--- a/gcc/testsuite/c-c++-common/fam-in-union-alone-in-struct-1.c
+++ b/gcc/testsuite/c-c++-common/fam-in-union-alone-in-struct-1.c
@@ -1,6 +1,6 @@
 /* testing the correct usage of flexible array members in unions 
and alone in structures.  */
-/* { dg-do run} */
+/* { dg-do run } */
 /* { dg-options "-Wpedantic" } */
 
 union with_fam_1 {
diff --git a/gcc/testsuite/c-c++-common/fam-in-union-alone-in-struct-2.c 
b/gcc/testsuite/c-c++-common/fam-in-union-alone-in-struct-2.c
index 3743f9e7dac5..93f9d5128f6e 100644
--- a/gcc/testsuite/c-c++-common/fam-in-union-alone-in-struct-2.c
+++ b/gcc/testsuite/c-c++-common/fam-in-union-alone-in-struct-2.c
@@ -1,6 +1,6 @@
 /* testing the correct usage of flexible array members in unions 
and alone in structures: initialization  */
-/* { dg-do run} */
+/* { dg-do run } */
 /* { dg-options "-O2" } */
 
 union with_fam_1 {
diff --git a/gcc/testsuite/c-c++-common/torture/builtin-shufflevector-2.c 
b/gcc/testsuite/c-c++-common/torture/builtin-shufflevector-2.c
index b1ffc95e39ae..a84e0a626211 100644
--- a/gcc/testsuite/c-c++-common/torture/builtin-shufflevector-2.c
+++ b/gcc/testsuite/c-c++-common/torture/builtin-shufflevector-2.c
@@ -1,4 +1,4 @@
-/* { dg-do run}  */
+/* { dg-do run }  */
 /* PR target/101529 */
 typedef unsigned char C;
 typedef unsigned char __attribute__((__vector_size__ (8))) V;
diff --git a/gcc/testsuite/g++.dg/pr102359_2.C 
b/gcc/testsuite/g++.dg/pr102359_2.C
index d026d727dd5c..1b3f6147dec1 100644
--- a/gcc/testsuite/g++.dg/pr102359_2.C
+++ b/gcc/testsuite/g++.dg/pr102359_2.C
@@ -1,6 +1,6 @@
 /* PR middle-end/102359 ICE gimplification failed since
r12-3433-ga25e0b5e6ac8a77a.  */
-/* { dg-do run} */
+/* { dg-do run } */
 /* { dg-options "-ftrivial-auto-var-init=zero" } */
 /* { dg-require-effective-target c++17 } */
 
diff --git a/gcc/testsuite/g++.target/i386/mvc1.C 
b/gcc/testsuite/g++.target/i386/mvc1.C
index b307d01ace63..348bd0ec7202 100644
--- a/gcc/testsuite/g++.target/i386/mvc1.C
+++ b/gcc/testsuite/g++.target/i386/mvc1.C
@@ -1,4 +1,4 @@
-/* { dg-do run} */
+/* { dg-do run } */
 /* { dg-require-ifunc "" } */
 
 __attribute__((target_clones("avx","arch=slm","arch=core-avx2","default")))

base-commit: 15158a8853a69f27a3c14644f35a93029dea6a84
prerequisite-patch-id: 5a9ae2c3593d43aa6ea53170a549830494180a1a
-- 
2.45.2



Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Uros Bizjak
On Tue, Jul 30, 2024 at 12:18 PM Richard Biener  wrote:
>
> The following implements the hook, excluding x87 modes for scalar
> and complex float modes.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> OK?
>
> Thanks,
> Richard.
>
> * i386.cc (TARGET_MODE_CAN_TRANSFER_BITS): Define.
> (ix86_mode_can_transfer_bits): New function.
> ---
>  gcc/config/i386/i386.cc | 21 +
>  1 file changed, 21 insertions(+)
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 12d15feb5e9..5184366916b 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -26113,6 +26113,24 @@ ix86_have_ccmp ()
>return (bool) TARGET_APX_CCMP;
>  }
>
> +/* Implement TARGET_MODE_CAN_TRANSFER_BITS.  */
> +static bool
> +ix86_mode_can_transfer_bits (machine_mode mode)
> +{
> +  if (GET_MODE_CLASS (mode) == MODE_FLOAT
> +  || GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT)
> +switch (GET_MODE_INNER (mode))
> +  {
> +  case SFmode:
> +  case DFmode:
> +   return TARGET_SSE_MATH && !TARGET_MIX_SSE_I387;

This can be simplified to:

return !(ix86_fpmath & FPMATH_387);

(Which implies that we should introduce TARGET_I387_MATH to parallel
TARGET_SSE_MATH some day...)

> +  default:
> +   return false;

We don't want to enable HFmode for transfers?

Uros.

> +  }
> +
> +  return true;
> +}
> +
>  /* Target-specific selftests.  */
>
>  #if CHECKING_P
> @@ -26959,6 +26977,9 @@ ix86_libgcc_floating_mode_supported_p
>  #undef TARGET_HAVE_CCMP
>  #define TARGET_HAVE_CCMP ix86_have_ccmp
>
> +#undef TARGET_MODE_CAN_TRANSFER_BITS
> +#define TARGET_MODE_CAN_TRANSFER_BITS ix86_mode_can_transfer_bits
> +
>  static bool
>  ix86_libc_has_fast_function (int fcode ATTRIBUTE_UNUSED)
>  {
> --
> 2.43.0
>


Re: Support streaming of poly_int for offloading when it's degree <= accel's NUM_POLY_INT_COEFFS

2024-07-30 Thread Richard Sandiford
Jakub Jelinek  writes:
> On Tue, Jul 30, 2024 at 11:25:42AM +0200, Richard Biener wrote:
>> Only "relevant" stuff should be streamed - the offload code and all
>> trees referred to.
>
> Yeah.
>
>> > > I think all current issues are because of poly-* leaking in for cases
>> > > where a non-poly would have worked fine, but I have not had a look
>> > > myself.
>> > 
>> > One of the cases that Prathamesh mentions is streaming the mode sizes.
>> > Are those modes "offload target modes" or "host modes"?  It seems like
>> > it shouldn't be an error for the host to have VLA modes per se.  It's
>> > just that those modes can't be used in the host/offload interface.
>> 
>> There's a requirement that a mode mapping exists from the host to
>> target enum machine_mode.  I don't remember exactly how we compute
>> that mapping and whether streaming of some data (and thus poly-int)
>> are part of this.
>
> During streaming out, the code records what machine modes are being streamed
> (in streamer_mode_table).
> For those modes (and their inner modes) then lto_write_mode_table
> should stream a table with mode details like class, bits, size, inner mode,
> nunits, real mode format if any, etc.
> That table is then streamed in in the offloading compiler and it attempts to
> find corresponding modes (and emits fatal_error if there is no such mode;
> consider say x86_64 long double with XFmode being used in offloading code
> which doesn't have XFmode support).
> Now, because Richard S. changed GET_MODE_SIZE etc. to give poly_int rather
> than int, this has been changed to use bp_pack_poly_value; but that relies
> on the same number of coefficients for poly_int, which is not the case when
> e.g. offloading aarch64 to gcn or nvptx.
>
> From what I can see, this mode table handling is the only use of
> bp_pack_poly_value.  So the options are either to stream at the start of the
> mode table the NUM_POLY_INT_COEFFS value and in bp_unpack_poly_value pass to
> it what we've read and fill in any remaining coeffs with zeros, or in each
> bp_pack_poly_value stream the number of coefficients and then stream that
> back in and fill in remaining ones (and diagnose if it would try to read
> non-zero coefficient which isn't stored).
> I think streaming NUM_POLY_INT_COEFFS once would be more compact (at least
> for non-aarch64/riscv targets).

Ah, ok, thanks for the explanation.  In that case, I agree that either
of those two would work (no personal preference for which).

Richard


Re: [PATCH 1/3][v2] Add TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Richard Sandiford
Richard Biener  writes:
> The following adds a target hook to specify whether regs of MODE can be
> used to transfer bits.  The hook is supposed to be used for value-numbering
> to decide whether a value loaded in such mode can be punned to another
> mode instead of re-loading the value in the other mode and for SRA to
> decide whether MODE is suitable as container holding a value to be
> used in different modes.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, OK this way?

LGTM FWIW.

Richard

>
> Thanks,
> Richard.
>
>   * target.def (mode_can_transfer_bits): New target hook.
>   * target.h (mode_can_transfer_bits): New function wrapping the
>   hook and providing default behavior.
>   * doc/tm.texi.in: Update.
>   * doc/tm.texi: Re-generate.
> ---
>  gcc/doc/tm.texi|  6 ++
>  gcc/doc/tm.texi.in |  2 ++
>  gcc/target.def |  8 
>  gcc/target.h   | 16 
>  4 files changed, 32 insertions(+)
>
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index c7535d07f4d..fa53c23f1de 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -4545,6 +4545,12 @@ is either a declaration of type int or accessed by 
> dereferencing
>  a pointer to int.
>  @end deftypefn
>  
> +@deftypefn {Target Hook} bool TARGET_MODE_CAN_TRANSFER_BITS (machine_mode 
> @var{mode})
> +Define this to return false if the mode @var{mode} cannot be used
> +for memory copying.  The default is to assume modes with the same
> +precision as size are fine to be used.
> +@end deftypefn
> +
>  @deftypefn {Target Hook} machine_mode TARGET_TRANSLATE_MODE_ATTRIBUTE 
> (machine_mode @var{mode})
>  Define this hook if during mode attribute processing, the port should
>  translate machine_mode @var{mode} to another mode.  For example, rs6000's
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index 64cea3b1eda..8af3f414505 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -3455,6 +3455,8 @@ stack.
>  
>  @hook TARGET_REF_MAY_ALIAS_ERRNO
>  
> +@hook TARGET_MODE_CAN_TRANSFER_BITS
> +
>  @hook TARGET_TRANSLATE_MODE_ATTRIBUTE
>  
>  @hook TARGET_SCALAR_MODE_SUPPORTED_P
> diff --git a/gcc/target.def b/gcc/target.def
> index 3de1aad4c84..4356ef2f974 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -3363,6 +3363,14 @@ a pointer to int.",
>   bool, (ao_ref *ref),
>   default_ref_may_alias_errno)
>  
> +DEFHOOK
> +(mode_can_transfer_bits,
> + "Define this to return false if the mode @var{mode} cannot be used\n\
> +for memory copying.  The default is to assume modes with the same\n\
> +precision as size are fine to be used.",
> + bool, (machine_mode mode),
> + NULL)
> +
>  /* Support for named address spaces.  */
>  #undef HOOK_PREFIX
>  #define HOOK_PREFIX "TARGET_ADDR_SPACE_"
> diff --git a/gcc/target.h b/gcc/target.h
> index c1f99b97b86..837651d273a 100644
> --- a/gcc/target.h
> +++ b/gcc/target.h
> @@ -312,6 +312,22 @@ estimated_poly_value (poly_int64 x,
>  return targetm.estimated_poly_value (x, kind);
>  }
>  
> +/* Return true when MODE can be used to copy GET_MODE_BITSIZE bits
> +   unchanged.  */
> +
> +inline bool
> +mode_can_transfer_bits (machine_mode mode)
> +{
> +  if (mode == BLKmode)
> +return true;
> +  if (maybe_ne (GET_MODE_BITSIZE (mode),
> + GET_MODE_UNIT_PRECISION (mode) * GET_MODE_NUNITS (mode)))
> +return false;
> +  if (targetm.mode_can_transfer_bits)
> +return targetm.mode_can_transfer_bits (mode);
> +  return true;
> +}
> +
>  #ifdef GCC_TM_H
>  
>  #ifndef CUMULATIVE_ARGS_MAGIC


Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Uros Bizjak
On Tue, Jul 30, 2024 at 1:07 PM Uros Bizjak  wrote:
>
> On Tue, Jul 30, 2024 at 12:18 PM Richard Biener  wrote:
> >
> > The following implements the hook, excluding x87 modes for scalar
> > and complex float modes.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >
> > OK?
> >
> > Thanks,
> > Richard.
> >
> > * i386.cc (TARGET_MODE_CAN_TRANSFER_BITS): Define.
> > (ix86_mode_can_transfer_bits): New function.
> > ---
> >  gcc/config/i386/i386.cc | 21 +
> >  1 file changed, 21 insertions(+)
> >
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index 12d15feb5e9..5184366916b 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -26113,6 +26113,24 @@ ix86_have_ccmp ()
> >return (bool) TARGET_APX_CCMP;
> >  }
> >
> > +/* Implement TARGET_MODE_CAN_TRANSFER_BITS.  */
> > +static bool
> > +ix86_mode_can_transfer_bits (machine_mode mode)
> > +{
> > +  if (GET_MODE_CLASS (mode) == MODE_FLOAT
> > +  || GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT)
> > +switch (GET_MODE_INNER (mode))
> > +  {
> > +  case SFmode:
> > +  case DFmode:
> > +   return TARGET_SSE_MATH && !TARGET_MIX_SSE_I387;
>
> This can be simplified to:
>
> return !(ix86_fpmath & FPMATH_387);
>
> (Which implies that we should introduce TARGET_I387_MATH to parallel
> TARGET_SSE_MATH some day...)
>
> > +  default:
> > +   return false;
>
> We don't want to enable HFmode for transfers?

Oh, and please add a small comment why we don't use XFmode here.
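
Folded together, the FP part of the hook would then read roughly like this
(a sketch only; whether to also allow HFmode is the question above):

  if (GET_MODE_CLASS (mode) == MODE_FLOAT
      || GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT)
    switch (GET_MODE_INNER (mode))
      {
      case SFmode:
      case DFmode:
        /* With 387 math these values may be loaded into x87 registers,
           which does not preserve the bit representation (e.g. SNaNs).  */
        return !(ix86_fpmath & FPMATH_387);
      /* case HFmode: return true;  -- if half-float transfers are wanted.  */
      default:
        /* XFmode always lives in the x87 register file and has padding
           bits, so never allow it.  */
        return false;
      }

  return true;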

Uros.


RE: Support streaming of poly_int for offloading when it's degree <= accel's NUM_POLY_INT_COEFFS

2024-07-30 Thread Prathamesh Kulkarni



> -Original Message-
> From: Jakub Jelinek 
> Sent: Tuesday, July 30, 2024 3:16 PM
> To: Richard Biener 
> Cc: Richard Sandiford ; Prathamesh Kulkarni
> ; gcc-patches@gcc.gnu.org
> Subject: Re: Support streaming of poly_int for offloading when it's
> degree <= accel's NUM_POLY_INT_COEFFS
> 
> 
> 
> On Tue, Jul 30, 2024 at 11:25:42AM +0200, Richard Biener wrote:
> > Only "relevant" stuff should be streamed - the offload code and all
> > trees referred to.
> 
> Yeah.
> 
> > > > I think all current issues are because of poly-* leaking in for
> > > > cases where a non-poly would have worked fine, but I have not
> had
> > > > a look myself.
> > >
> > > One of the cases that Prathamesh mentions is streaming the mode
> sizes.
> > > Are those modes "offload target modes" or "host modes"?  It seems
> > > like it shouldn't be an error for the host to have VLA modes per
> se.
> > > It's just that those modes can't be used in the host/offload
> interface.
> >
> > There's a requirement that a mode mapping exists from the host to
> > target enum machine_mode.  I don't remember exactly how we compute
> > that mapping and whether streaming of some data (and thus poly-int)
> > are part of this.
> 
> During streaming out, the code records what machine modes are being
> streamed (in streamer_mode_table).
> For those modes (and their inner modes) then lto_write_mode_table
> should stream a table with mode details like class, bits, size, inner
> mode, nunits, real mode format if any, etc.
> That table is then streamed in in the offloading compiler and it
> attempts to find corresponding modes (and emits fatal_error if there
> is no such mode; consider say x86_64 long double with XFmode being
> used in offloading code which doesn't have XFmode support).
> Now, because Richard S. changed GET_MODE_SIZE etc. to give poly_int
> rather than int, this has been changed to use bp_pack_poly_value; but
> that relies on the same number of coefficients for poly_int, which is
> not the case when e.g. offloading aarch64 to gcn or nvptx.
Indeed, for the minimal test:
int main()
{
  int x;
  #pragma omp target map (to: x)
  {
x = 0;
  }
  return x;
}

Streaming out mode_table from AArch64 shows:
mode = SI, mclass = 2, size = 4, prec = 32
mode = DI, mclass = 2, size = 8, prec = 64

While streaming-in for nvptx shows:
mclass = 2, size = 4, prec = 0

The discrepancy happens because of the differing values of NUM_POLY_INT_COEFFS
between AArch64 and nvptx.
From AArch64 it streams out size and prec as <4, 0> and <32, 0> respectively,
where 0 comes from coeffs[1].
While streaming in on nvptx, since NUM_POLY_INT_COEFFS is 1, it incorrectly
reads size as 4 and prec as 0.
> 
> From what I can see, this mode table handling is the only use of
> bp_pack_poly_value.  So the options are either to stream at the start
> of the mode table the NUM_POLY_INT_COEFFS value and in
> bp_unpack_poly_value pass to it what we've read and fill in any
> remaining coeffs with zeros, or in each bp_pack_poly_value stream the
> number of coefficients and then stream that back in and fill in
> remaining ones (and diagnose if it would try to read non-zero
> coefficient which isn't stored).
This is the approach taken in the proposed patch (stream out the degree of
the poly_int, followed by its coeffs).

> I think streaming NUM_POLY_INT_COEFFS once would be more compact (at
> least for non-aarch64/riscv targets).
I will try implementing this, thanks.

Thanks,
Prathamesh
> 
> Jakub



Re: [PATCH 1/2] SVE intrinsics: Add strength reduction for division by constant.

2024-07-30 Thread Kyrylo Tkachov
Hi Jennifer,

> On 30 Jul 2024, at 09:47, Jennifer Schmitz  wrote:
> 
> Dear Richard,
> Thanks for the feedback. Great to see this patch approved! I made the changes 
> as suggested.
> Best,
> Jennifer
> <0001-SVE-intrinsics-Add-strength-reduction-for-division-b.patch>

Thanks, I’m okay with the patch as well and have pushed it to trunk with 
7cde140863e.
To commit future patches yourself you should apply for Write After Approval 
commit access by filling in the form at 
https://sourceware.org/cgi-bin/pdw/ps_form.cgi . You can use my email address 
as approver.

Kyrill


> 
>> On 29 Jul 2024, at 22:55, Richard Sandiford  
>> wrote:
>> 
>> External email: Use caution opening links or attachments
>> 
>> 
>> Thanks for doing this.
>> 
>> Jennifer Schmitz  writes:
>>> [...]
>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_s32.c 
>>> b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_s32.c
>>> index c49ca1aa524..6500b64c41b 100644
>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_s32.c
>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/div_s32.c
>>> @@ -1,6 +1,9 @@
>>> /* { dg-final { check-function-bodies "**" "" "-DCHECK_ASM" } } */
>>> 
>>> #include "test_sve_acle.h"
>>> +#include 
>>> +
>> 
>> I think it'd be better to drop the explicit include of stdint.h.  arm_sve.h
>> is defined to include stdint.h itself, and we rely on that elsewhere.
>> 
>> Same for div_s64.c.
> Done.
>> 
>>> +#define MAXPOW 1<<30
>>> 
>>> /*
>>> ** div_s32_m_tied1:
>>> @@ -53,10 +56,27 @@ TEST_UNIFORM_ZX (div_w0_s32_m_untied, svint32_t, 
>>> int32_t,
>>>  z0 = svdiv_n_s32_m (p0, z1, x0),
>>>  z0 = svdiv_m (p0, z1, x0))
>>> 
>>> +/*
>>> +** div_1_s32_m_tied1:
>>> +**   sel z0\.s, p0, z0\.s, z0\.s
>>> +**   ret
>>> +*/
>>> +TEST_UNIFORM_Z (div_1_s32_m_tied1, svint32_t,
>>> + z0 = svdiv_n_s32_m (p0, z0, 1),
>>> + z0 = svdiv_m (p0, z0, 1))
>>> +
>>> +/*
>>> +** div_1_s32_m_untied:
>>> +**   sel z0\.s, p0, z1\.s, z1\.s
>>> +**   ret
>>> +*/
>>> +TEST_UNIFORM_Z (div_1_s32_m_untied, svint32_t,
>>> + z0 = svdiv_n_s32_m (p0, z1, 1),
>>> + z0 = svdiv_m (p0, z1, 1))
>>> +
>> 
>> [ Thanks for adding the tests (which look good to me).  If the output
>> ever improves in future, we can "defend" the improvement by changing
>> the test.  But in the meantime, the above defends something that is
>> known to work. ]
>> 
>>> [...]
>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/div_const_run.c 
>>> b/gcc/testsuite/gcc.target/aarch64/sve/div_const_run.c
>>> new file mode 100644
>>> index 000..1a3c25b817d
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/div_const_run.c
>>> @@ -0,0 +1,91 @@
>>> +/* { dg-do run { target aarch64_sve128_hw } } */
>>> +/* { dg-options "-O2 -msve-vector-bits=128" } */
>>> +
>>> +#include 
>>> +#include 
>>> +
>>> +typedef svbool_t pred __attribute__((arm_sve_vector_bits(128)));
>>> +typedef svfloat16_t svfloat16_ __attribute__((arm_sve_vector_bits(128)));
>>> +typedef svfloat32_t svfloat32_ __attribute__((arm_sve_vector_bits(128)));
>>> +typedef svfloat64_t svfloat64_ __attribute__((arm_sve_vector_bits(128)));
>>> +typedef svint32_t svint32_ __attribute__((arm_sve_vector_bits(128)));
>>> +typedef svint64_t svint64_ __attribute__((arm_sve_vector_bits(128)));
>>> +typedef svuint32_t svuint32_ __attribute__((arm_sve_vector_bits(128)));
>>> +typedef svuint64_t svuint64_ __attribute__((arm_sve_vector_bits(128)));
>>> +
>>> +#define F(T, TS, P, OP1, OP2)  
>>>   \
>>> +{\
>>> +  T##_t op1 = (T##_t) OP1;   \
>>> +  T##_t op2 = (T##_t) OP2;   \
>>> +  sv##T##_ res = svdiv_##P (pg, svdup_##TS (op1), svdup_##TS (op2)); \
>>> +  sv##T##_ exp = svdup_##TS (op1 / op2); \
>>> +  if (svptest_any (pg, svcmpne (pg, exp, res)))
>>>   \
>>> +__builtin_abort ();
>>>   \
>>> + \
>>> +  sv##T##_ res_n = svdiv_##P (pg, svdup_##TS (op1), op2);\
>>> +  if (svptest_any (pg, svcmpne (pg, exp, res_n)))\
>>> +__builtin_abort ();
>>>   \
>>> +}
>>> +
>>> +#define TEST_TYPES_1(T, TS)  \
>>> +  F (T, TS, m, 79, 16) 
>>>   \
>>> +  F (T, TS, z, 79, 16) 
>>>   \
>>> +  F (T, TS, x, 79, 16)
>>> +
>>> +#define TEST_TYPES   \
>>> +  TEST_TYPES_1 (float16, f16)  
>>>   \
>>> +  TEST_TYPES_1 (float32, f32) 

Re: [PATCH] AArch64: Set instruction attribute of TST to logics_imm

2024-07-30 Thread Richard Sandiford
Jennifer Schmitz  writes:
> As suggested in
> https://gcc.gnu.org/pipermail/gcc-patches/2024-July/658249.html,
> this patch changes the instruction attribute of "*and_compare0" (TST) 
> from
> alus_imm to logics_imm.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz 
>
> gcc/
>
>   * config/aarch64/aarch64.md (*and_compare0): Change attribute.

OK, thanks.

Richard

>
> From e643211edd212276ddeef87136932da4aa14837c Mon Sep 17 00:00:00 2001
> From: Jennifer Schmitz 
> Date: Mon, 29 Jul 2024 07:59:33 -0700
> Subject: [PATCH] AArch64: Set instruction attribute of TST to logics_imm
>
> As suggested in
> https://gcc.gnu.org/pipermail/gcc-patches/2024-July/658249.html,
> this patch changes the instruction attribute of "*and_compare0" (TST) 
> from
> alus_imm to logics_imm.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz 
>
> gcc/
>
>   * config/aarch64/aarch64.md (*and_compare0): Change attribute.
> ---
>  gcc/config/aarch64/aarch64.md | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index ed29127dafb..734a21268dc 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -5398,7 +5398,7 @@
>(const_int 0)))]
>""
>"tst\\t%0, "
> -  [(set_attr "type" "alus_imm")]
> +  [(set_attr "type" "logics_imm")]
>  )
>  
>  (define_insn "*ands_compare0"


Re: [PATCH v2] gimple ssa: Teach switch conversion to optimize powers of 2 switches

2024-07-30 Thread Filip Kastl
> > > Ah, I see you fix those up.  Then 2.) is left - the final block.  Iff
> > > the final block needs adjustment you know there was a path from
> > > the default case to it which means one of its predecessors is dominated
> > > by the default case?  In that case, adjust the dominator to cond_bb,
> > > otherwise leave it at switch_bb?
> > 
> > Yes, what I'm saying is that if I want to know idom of final_bb after the
> > transformation, I have to know if there is a path between default_bb and
> > final_bb.  It is because of these two cases:
> > 
> > 1.
> > 
> > cond BB -+
> >| |
> > switch BB ---+   |
> > /  |  \   \  |
> > case BBs    default BB
> > \  |  /   /
> > final BB <---+  <- this may be an edge or a path
> >|
> > 
> > 2.
> > 
> > cond BB -+
> >| |
> > switch BB ---+   |
> > /  |  \   \  |
> > case BBs    default BB
> > \  |  /   /
> > final BB / <- this may be an edge or a path
> >|/
> > 
> > In the first case, there is a path between default_bb and final_bb and in 
> > the
> > second there isn't.  Otherwise the cases are the same.  In the first case 
> > idom
> > of final_bb should be cond_bb.  In the second case idom of final_bb should 
> > be
> > switch_bb. Algorithm deciding what should be idom of final_bb therefore has 
> > to
> > know if there is a path between default_bb and final_bb.
> > 
> > You said that if there is a path between default_bb and final_bb, one of the
> > predecessors of final_bb is dominated by default_bb.  That would indeed 
> > give a
> > nice way to check existence of a path between default_bb and final_bb.  But
> > does it hold?  Consider this situation:
> > 
> >| |
> > cond BB --+
> >| ||
> > switch BB +   |
> > /  |  \  | \  |
> > case BBs |   default BB
> > \  |  /  |/
> > final BB <- pred BB -+
> >|
> > 
> > Here no predecessors of final_bb are dominated by default_bb but at the same
> > time there does exist a path from default_bb to final_bb.  Or is this CFG
> > impossible for some reason?
> 
> I think in this case the dominator simply need not change - the only case
> you need to adjust it is when the immediate dominator of final BB was
> switch BB before the transform, and then we know we have to update it
> to cond BB, no?

Ah, my bad.  Yes, my counterexample actually isn't a problem.  I was glad when
I realized that and started thinking that this...

if (original idom(final bb) == switch bb)
{
  if (exists a pred of final bb dominated by default bb)
  {
idom(final bb) = cond bb;
  }
  else
  {
idom(final bb) = switch bb;
  }
}
else
{
  // idom(final bb) doesn't change
}

...might be the final solution.  But after thinking about it for a while I
(sadly) came up with another counterexample.

   |  
cond BB --+
   |  |
switch BB +   |
/  |  \\  |
case BBs  default BB
\  |  /   /
final BB <- pred BB -+
   |   ^
   |   |
   +---+  <- this may be a path or an edge I guess

Here there *is* a path between default_bb and final_bb but since no predecessor
of final_bb is dominated by default_bb we incorrectly decide that there is no
such path.  Therefore we incorrectly assign idom(final_bb) = switch_bb instead
of idom(final_bb) = cond_bb.

So unless I'm missing something, "final has a pred dominated by default" isn't
equivalent to "there is a path between default and final" even when we assume
that the original idom of final_bb was switch_bb.  Therefore I think we're back
to searching for a nice way to test "there is a path between default and
final".

Maybe you can spot a flaw in my logic or maybe you see a solution I don't.
Meanwhile I'll look into source code of the rest of the switch conversion pass.
Switch conversion pass inserts conditions similar to what I'm doing so someone
before me may have already solved how to properly fix dominators in this
situation.

Cheers,
Filip Kastl


Re: [PATCH v2] gimple ssa: Teach switch conversion to optimize powers of 2 switches

2024-07-30 Thread Filip Kastl
> Meanwhile I'll look into source code of the rest of the switch conversion 
> pass.
> Switch conversion pass inserts conditions similar to what I'm doing so someone
> before me may have already solved how to properly fix dominators in this
> situation.

Oh nevermind.  Switch conversion (gen_inbound_check ()) actually uses
iterate_fix_dominators.
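
So something along these lines should work instead of reasoning about the
default->final path by hand (just a sketch of the idea, not tested; the
block variables are the ones from the discussion above):

  /* Let the dominance code recompute the immediate dominators of the
     blocks whose idom may have changed after inserting cond_bb.  */
  auto_vec<basic_block> dom_bbs;
  dom_bbs.safe_push (final_bb);
  dom_bbs.safe_push (default_bb);
  iterate_fix_dominators (CDI_DOMINATORS, dom_bbs, true);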


Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Richard Biener
On Tue, 30 Jul 2024, Uros Bizjak wrote:

> On Tue, Jul 30, 2024 at 1:07 PM Uros Bizjak  wrote:
> >
> > On Tue, Jul 30, 2024 at 12:18 PM Richard Biener  wrote:
> > >
> > > The following implements the hook, excluding x87 modes for scalar
> > > and complex float modes.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > >
> > > OK?
> > >
> > > Thanks,
> > > Richard.
> > >
> > > * i386.cc (TARGET_MODE_CAN_TRANSFER_BITS): Define.
> > > (ix86_mode_can_transfer_bits): New function.
> > > ---
> > >  gcc/config/i386/i386.cc | 21 +
> > >  1 file changed, 21 insertions(+)
> > >
> > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > index 12d15feb5e9..5184366916b 100644
> > > --- a/gcc/config/i386/i386.cc
> > > +++ b/gcc/config/i386/i386.cc
> > > @@ -26113,6 +26113,24 @@ ix86_have_ccmp ()
> > >return (bool) TARGET_APX_CCMP;
> > >  }
> > >
> > > +/* Implement TARGET_MODE_CAN_TRANSFER_BITS.  */
> > > +static bool
> > > +ix86_mode_can_transfer_bits (machine_mode mode)
> > > +{
> > > +  if (GET_MODE_CLASS (mode) == MODE_FLOAT
> > > +  || GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT)
> > > +switch (GET_MODE_INNER (mode))
> > > +  {
> > > +  case SFmode:
> > > +  case DFmode:
> > > +   return TARGET_SSE_MATH && !TARGET_MIX_SSE_I387;
> >
> > This can be simplified to:
> >
> > return !(ix86_fpmath & FPMATH_387);

Done.

> > (Which implies that we should introduce TARGET_I387_MATH to parallel
> > TARGET_SSE_MATH some day...)
> >
> > > +  default:
> > > +   return false;
> >
> > We don't want to enable HFmode for transfers?

Jakub indicated that wouldn't be safe - is it?

> Oh, and please add a small comment why we don't use XFmode here.

Will do.

/* Do not enable XFmode, there is padding in it and it suffers
   from normalization upon load like SFmode and DFmode when
   not using SSE.  */

Thanks,
Richard.

> Uros.
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Re: [PATCH v2] gimple ssa: Teach switch conversion to optimize powers of 2 switches

2024-07-30 Thread Richard Biener
On Tue, 30 Jul 2024, Filip Kastl wrote:

> > > > Ah, I see you fix those up.  Then 2.) is left - the final block.  Iff
> > > > the final block needs adjustment you know there was a path from
> > > > the default case to it which means one of its predecessors is dominated
> > > > by the default case?  In that case, adjust the dominator to cond_bb,
> > > > otherwise leave it at switch_bb?
> > > 
> > > Yes, what I'm saying is that if I want to know idom of final_bb after the
> > > transformation, I have to know if there is a path between default_bb and
> > > final_bb.  It is because of these two cases:
> > > 
> > > 1.
> > > 
> > > cond BB -+
> > >| |
> > > switch BB ---+   |
> > > /  |  \   \  |
> > > case BBs    default BB
> > > \  |  /   /
> > > final BB <---+  <- this may be an edge or a path
> > >|
> > > 
> > > 2.
> > > 
> > > cond BB -+
> > >| |
> > > switch BB ---+   |
> > > /  |  \   \  |
> > > case BBs    default BB
> > > \  |  /   /
> > > final BB / <- this may be an edge or a path
> > >|/
> > > 
> > > In the first case, there is a path between default_bb and final_bb and in 
> > > the
> > > second there isn't.  Otherwise the cases are the same.  In the first case 
> > > idom
> > > of final_bb should be cond_bb.  In the second case idom of final_bb 
> > > should be
> > > switch_bb. Algorithm deciding what should be idom of final_bb therefore 
> > > has to
> > > know if there is a path between default_bb and final_bb.
> > > 
> > > You said that if there is a path between default_bb and final_bb, one of 
> > > the
> > > predecessors of final_bb is dominated by default_bb.  That would indeed 
> > > give a
> > > nice way to check existence of a path between default_bb and final_bb.  
> > > But
> > > does it hold?  Consider this situation:
> > > 
> > >| |
> > > cond BB --+
> > >| ||
> > > switch BB +   |
> > > /  |  \  | \  |
> > > case BBs |   default BB
> > > \  |  /  |/
> > > final BB <- pred BB -+
> > >|
> > > 
> > > Here no predecessors of final_bb are dominated by default_bb but at the 
> > > same
> > > time there does exist a path from default_bb to final_bb.  Or is this CFG
> > > impossible for some reason?
> > 
> > I think in this case the dominator simply need not change - the only case
> > you need to adjust it is when the immediate dominator of final BB was
> > switch BB before the transform, and then we know we have to update it
> > too cond BB, no?
> 
> Ah, my bad.  Yes, my counterexample actually isn't a problem.  I was glad when
> I realized that and started thinking that this...
> 
> if (original idom(final bb) == switch bb)
> {
>   if (exists a pred of final bb dominated by default bb)
>   {
> idom(final bb) = cond bb;
>   }
>   else
>   {
> idom(final bb) = switch bb;
>   }
> }
> else
> {
>   // idom(final bb) doesn't change
> }
> 
> ...might be the final solution.  But after thinking about it for a while I
> (saddly) came up with another counterexample.
> 
>|  
> cond BB --+
>|  |
> switch BB +   |
> /  |  \\  |
> case BBs  default BB
> \  |  /   /
> final BB <- pred BB -+
>|   ^
>|   |
>+---+  <- this may be a path or an edge I guess
> 
> Here there *is* a path between default_bb and final_bb but since no 
> predecessor
> of final_bb is dominated by default_bb we incorrectly decide that there is no
> such path.  Therefore we incorrectly assign idom(final_bb) = switch_bb instead
> of idom(final_bb) = cond_bb.
>
> So unless I'm missing something, "final has a pred dominated by default" isn't
> equivalent with "there is a path between default and final" even when we 
> assume
> that the original idom of final_bb was switch_bb.  Therefore I think we're 
> back
> to searching for a nice way to test "there is a path between default and
> final".

Hmm.

> Maybe you can spot a flaw in my logic or maybe you see a solution I don't.
> Meanwhile I'll look into source code of the rest of the switch conversion 
> pass.
> Switch conversion pass inserts conditions similar to what I'm doing so someone
> before me may have already solved how to properly fix dominators in this
> situation.

OK, as I see in your next followup that uses iterate_fix_dominators as 
well.

So your patch is OK as-is.

It might be nice to factor out a common helper from gen_inbound_check
and your "copy" of it though.  As followup, if you like.

Thanks and sorry for the confusion,
Richard.


Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Jakub Jelinek
On Tue, Jul 30, 2024 at 02:26:05PM +0200, Richard Biener wrote:
> > > (Which implies that we should introduce TARGET_I387_MATH to parallel
> > > TARGET_SSE_MATH some day...)
> > >
> > > > +  default:
> > > > +   return false;
> > >
> > > We don't want to enable HFmode for transfers?
> 
> Jakub indicated that wouldn't be safe - is it?

I was worried about that, but in everything I've tried it actually looked ok
(both HFmode and BFmode).
*mov{hf,bf}_internal uses GPRs to move stuff (or SSE registers), in both
cases transferable.

And TFmode should be ok as well (i.e. IEEE quad, that shouldn't go through x87
either, it is software emulated).

Jakub



Re: Support streaming of poly_int for offloading when it's degree <= accel's NUM_POLY_INT_COEFFS

2024-07-30 Thread Tobias Burnus

Prathamesh Kulkarni wrote:

Thanks for your suggestions on the RFC email; the attached patch adds support
for streaming of poly_int when its degree <= accel's NUM_POLY_INT_COEFFS.


First, thanks a lot for your patch!

Secondly, it seems as if this patch is indented to fully or partially 
fix the following PRs.

If so, can you add the PR to the commit log such that both "git log"
will help finding the problem report and the commit will show up
in the issue?


https://gcc.gnu.org/PR111937
  PR ipa/111937
  offloading from x86_64-linux-gnu to riscv*-linux-gnu will have issues

https://gcc.gnu.org/PR96265
  PR ipa/96265
  offloading to nvptx-none from aarch64-linux-gnu (and 
riscv*-linux-gnu) does not work


And - marked as duplicate of the latter:

https://gcc.gnu.org/PR114174
  PR lto/114174
  [aarch64] Offloading to nvptx-none

Thanks,

Tobias


Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Richard Biener
On Tue, 30 Jul 2024, Jakub Jelinek wrote:

> On Tue, Jul 30, 2024 at 02:26:05PM +0200, Richard Biener wrote:
> > > > (Which implies that we should introduce TARGET_I387_MATH to parallel
> > > > TARGET_SSE_MATH some day...)
> > > >
> > > > > +  default:
> > > > > +   return false;
> > > >
> > > > We don't want to enable HFmode for transfers?
> > 
> > Jakub indicated that wouldn't be safe - is it?
> 
> I was worried about that, but in everything I've tried it actually looked ok
> (both HFmode and BFmode).
> *mov{hf,bf}_internal uses GPRs to move stuff (or SSE registers), in both
> cases transferable.
> 
> And TFmode should be ok as well (i.e. IEEE quad, that shouldn't go through x87
> either, it is software emulated).

So something like the following then?

/* Implement TARGET_MODE_CAN_TRANSFER_BITS.  */
static bool
ix86_mode_can_transfer_bits (machine_mode mode)
{
  if (GET_MODE_CLASS (mode) == MODE_FLOAT
  || GET_MODE_CLASS (mode) == MODE_COMPLEX_FLOAT)
switch (GET_MODE_INNER (mode))
  {
  case SFmode:
  case DFmode:
return !(ix86_fpmath & FPMATH_387);
  case XFmode:
/* Do not enable XFmode, there is padding in it and it suffers
   from normalization upon load like SFmode and DFmode when
   not using SSE.  */
return false;
  case HFmode:
  case BFmode:
  case TFmode:
/* IEEE quad and half and brain floats never touch x87 regs.  */
return true;
  default:
gcc_unreachable ();
  }

  return true;
}



Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Alexander Monakov


On Tue, 30 Jul 2024, Richard Biener wrote:

> > Oh, and please add a small comment why we don't use XFmode here.
> 
> Will do.
> 
> /* Do not enable XFmode, there is padding in it and it suffers
>from normalization upon load like SFmode and DFmode when
>not using SSE.  */

Is it really true? I have no evidence of FLDT performing normalization
(as mentioned in PR 114659, if it did, there would be no way to spill/reload
x87 registers).

(the padding is not part of the 80-bit mode precision of XFmode, right?)

Alexander


Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Jakub Jelinek
On Tue, Jul 30, 2024 at 03:43:25PM +0300, Alexander Monakov wrote:
> 
> On Tue, 30 Jul 2024, Richard Biener wrote:
> 
> > > Oh, and please add a small comment why we don't use XFmode here.
> > 
> > Will do.
> > 
> > /* Do not enable XFmode, there is padding in it and it suffers
> >from normalization upon load like SFmode and DFmode when
> >not using SSE.  */
> 
> Is it really true? I have no evidence of FLDT performing normalization
> (as mentioned in PR 114659, if it did, there would be no way to spill/reload
> x87 registers).
> 
> (the padding is not part of the 80-bit mode precision of XFmode, right?)

It is part of the mode (which has 12 or 16 byte size).
Though, the condition in the caller of the target hook should already
return false before calling this hook for the XFmode/XCmode modes exactly
because the precision doesn't match the size.

Jakub



Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Alexander Monakov


On Tue, 30 Jul 2024, Jakub Jelinek wrote:

> On Tue, Jul 30, 2024 at 03:43:25PM +0300, Alexander Monakov wrote:
> > 
> > On Tue, 30 Jul 2024, Richard Biener wrote:
> > 
> > > > Oh, and please add a small comment why we don't use XFmode here.
> > > 
> > > Will do.
> > > 
> > > /* Do not enable XFmode, there is padding in it and it suffers
> > >from normalization upon load like SFmode and DFmode when
> > >not using SSE.  */
> > 
> > Is it really true? I have no evidence of FLDT performing normalization
> > (as mentioned in PR 114659, if it did, there would be no way to spill/reload
> > x87 registers).
> > 
> > (the padding is not part of the 80-bit mode precision of XFmode, right?)
> 
> It is part of the mode (which has 12 or 16 byte size).
> Though, the condition in the caller of the target hook should already
> return false before calling this hook for the XFmode/XCmode modes exactly
> becayse the precision doesn't match size.

Yes, thanks, that's why I raised the question in parenthesis. The claim
about normalization still doesn't make sense to me.

Alexander


Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Richard Biener
On Tue, 30 Jul 2024, Alexander Monakov wrote:

> 
> On Tue, 30 Jul 2024, Richard Biener wrote:
> 
> > > Oh, and please add a small comment why we don't use XFmode here.
> > 
> > Will do.
> > 
> > /* Do not enable XFmode, there is padding in it and it suffers
> >from normalization upon load like SFmode and DFmode when
> >not using SSE.  */
> 
> Is it really true? I have no evidence of FLDT performing normalization
> (as mentioned in PR 114659, if it did, there would be no way to spill/reload
> x87 registers).

What mangling fld performs depends on the contents of the FP control
word which is awkward.  IIRC there's at least a bugreport that it
turns sNaN into a qNaN, it seems I was wrong about denormals
(when DM is not masked).  And yes, IIRC x87 instability is also
related to spills (IIRC we spill in the actual mode of the reg, not in 
XFmode), but -fexcess-precision=standard should hopefully avoid that.
It's also not clear whether all implementations conformed to the
specs wrt extended-precision format loads.

> (the padding is not part of the 80-bit mode precision of XFmode, right?)

As Jakub said the padding is already dealt with in the caller
though I only added that there for convenience since padding is
problematic in general.

If you think XFmode is safe to transfer 10 bytes we could enable it,
I guess I'll amend the docs to be clear:

"Define this to return false if the mode @var{mode} cannot be used
for memory copying.  The default is to assume modes with the same
precision as size are fine to be used."

this might suggest transferring GET_MODE_PRECISION bits is intended
but it might want to say GET_MODE_SIZE units explicitly so the
default makes sense.

Richard.


Re: [PATCH] LoongArch: Expand some SImode operations through "si3_extend" instructions if TARGET_64BIT

2024-07-30 Thread Lulu Cheng



On 2024/7/26 8:43 PM, Xi Ruoyao wrote:

We already had "si3_extend" insns and we hoped the fwprop or combine
passes can use them to remove unnecessary sign extensions.  But this
does not always work: for cases like x << 1 | y, the compiler
tends to do

 (sign_extend:DI
   (ior:SI (ashift:SI (reg:SI $r4)
  (const_int 1))
   (reg:SI $r5)))

instead of

 (ior:DI (sign_extend:DI (ashift:SI (reg:SI $r4) (const_int 1)))
 (sign_extend:DI (reg:SI $r5)))

So we cannot match the ashlsi3_extend instruction here and we get:

 slli.w $r4,$r4,1
 or $r4,$r5,$r4
 slli.w $r4,$r4,0    # <= redundant
 jr     $r1
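
(For illustration -- a reconstructed example, not part of the original
submission -- a source function along these lines produces the code above:)

  int
  f (int x, int y)
  {
    return (x << 1) | y;
  }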

To eliminate this redundant extension we need to turn SImode shift etc.
to DImode "si3_extend" operations earlier, when we expand the SImode
operation.  We are already doing this for addition, now do it for
shifts, rotates, subtraction, multiplication, division, and modulo as
well.

The bytepick.w definition for TARGET_64BIT needs to be adjusted so it
won't be undone by the shift expanding.


LGTM!

I don't know if there will still be redundant sign-extension instructions
after this change. :-(


Thanks!



gcc/ChangeLog:

* config/loongarch/loongarch.md (optab): Add (rotatert "rotr").
(3, 3,
sub3, rotr3, mul3): Add a "*" to the insn name
so we can redefine the names with define_expand.
(*si3_extend): Remove "*" so we can use them
in expanders.
(*subsi3_extended, *mulsi3_extended): Likewise, also remove the
trailing "ed" for consistency.
(*si3_extended): Add mode for sign_extend to
prevent an ICE using it in expanders.
(shift_w, arith_w): New define_code_iterator.
(3): New define_expand.  Expand with
si3_extend for SImode if TARGET_64BIT.
(3): Likewise.
(mul3): Expand to mulsi3_extended for SImode if
TARGET_64BIT and ISA_HAS_DIV32.
(3): Expand to si3_extended
for SImode if TARGET_64BIT.
(rotl3): Expand to rotrsi3_extend for SImode if
TARGET_64BIT.
(bytepick_w_): Add mode for lshiftrt and ashift.
(bitsize, bytepick_imm, bytepick_w_ashift_amount): New
define_mode_attr.
(bytepick_w__extend): Adjust for the RTL change
caused by 32-bit shift expanding.  Now bytepick_imm only covers
2 and 3, separate one remaining case to ...
(bytepick_w_1_extend): ... here, new define_insn.

gcc/testsuite/ChangeLog:

* gcc.target/loongarch/bitwise_extend.c: New test.
---

Bootstrapped and regtested on loongarch64-linux-gnu.  Ok for trunk?

  gcc/config/loongarch/loongarch.md | 131 +++---
  .../gcc.target/loongarch/bitwise_extend.c |  45 ++
  2 files changed, 154 insertions(+), 22 deletions(-)
  create mode 100644 gcc/testsuite/gcc.target/loongarch/bitwise_extend.c

diff --git a/gcc/config/loongarch/loongarch.md 
b/gcc/config/loongarch/loongarch.md
index bc09712bce7..e1629c5a339 100644
--- a/gcc/config/loongarch/loongarch.md
+++ b/gcc/config/loongarch/loongarch.md
@@ -546,6 +546,7 @@ (define_code_attr u_bool [(sign_extend "false") (zero_extend 
"true")])
  (define_code_attr optab [(ashift "ashl")
 (ashiftrt "ashr")
 (lshiftrt "lshr")
+(rotatert "rotr")
 (ior "ior")
 (xor "xor")
 (and "and")
@@ -624,6 +625,49 @@ (define_int_attr bytepick_imm [(8 "1")
 (48 "6")
 (56 "7")])
  
+;; Expand some 32-bit operations to si3_extend operations if TARGET_64BIT

+;; so the redundant sign extension can be removed if the output is used as
+;; an input of a bitwise operation.  Note plus, rotl, and div are handled
+;; separately.
+(define_code_iterator shift_w [any_shift rotatert])
+(define_code_iterator arith_w [minus mult])
+
+(define_expand "3"
+  [(set (match_operand:GPR 0 "register_operand" "=r")
+   (shift_w:GPR (match_operand:GPR 1 "register_operand" "r")
+(match_operand:SI 2 "arith_operand" "rI")))]
+  ""
+{
+  if (TARGET_64BIT && mode == SImode)
+{
+  rtx t = gen_reg_rtx (DImode);
+  emit_insn (gen_si3_extend (t, operands[1], operands[2]));
+  t = gen_lowpart (SImode, t);
+  SUBREG_PROMOTED_VAR_P (t) = 1;
+  SUBREG_PROMOTED_SET (t, SRP_SIGNED);
+  emit_move_insn (operands[0], t);
+  DONE;
+}
+})
+
+(define_expand "3"
+  [(set (match_operand:GPR 0 "register_operand" "=r")
+   (arith_w:GPR (match_operand:GPR 1 "register_operand" "r")
+(match_operand:GPR 2 "register_operand" "r")))]
+  ""
+{
+  if (TARGET_64BIT && mode == SImode)
+{
+  rtx t = gen_reg_rtx (DImode);
+  emit_insn (gen_si3_extend (t, operands[1], operands[2]));
+  t = gen_lowpart (SImode, t);
+  SUBREG_PROMOTED_VAR_P (t) = 1;
+  SUBREG_PROMOTED_SET (t, SRP_S

Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Jakub Jelinek
On Tue, Jul 30, 2024 at 03:00:49PM +0200, Richard Biener wrote:
> As Jakub said the padding is already dealt with in the caller
> though I only added that there for convenience since padding is
> problematic in general.
> 
> If you think XFmode is safe to transfer 10 bytes we could enable it,
> I guess I'll amend the docs to be clear:
> 
> "Define this to return false if the mode @var{mode} cannot be used
> for memory copying.  The default is to assume modes with the same
> precision as size are fine to be used."
> 
> this might suggest transfering GET_MODE_PRECISION bits is intended
> but it might want to say GET_MODE_SIZE units explicitly so the
> default makes sense.

You could just gcc_unreachable (); for XFmode in the hook with a short
comment that the mode has padding.
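
I.e. something like (only a sketch):

      case XFmode:
	/* The mode has padding; the caller already rejects it.  */
	gcc_unreachable ();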

Jakub



Re: [PATCH 2/2] libstdc++: add std::is_virtual_base_of

2024-07-30 Thread Jonathan Wakely
On Mon, 29 Jul 2024 at 21:58, Giuseppe D'Angelo wrote:
>
> Hi,
>
> And this is the corresponding change libstdc++.

Thanks for the patch. Do you have a copyright assignment for GCC in
place, or are you covered by a corporate assignment for KDAB?

If not, please complete that process, or contribute under the DCO
terms, see https://gcc.gnu.org/contribute.html#legal

This is a C++26 change, so it can use bool_constant instead of
__bool_constant. That's only needed for C++11 and C++14 code.
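
i.e. the trait can be defined along these lines (only a sketch, and the
built-in name here is an assumption, matching the compiler patch under
review):

  template<typename _Base, typename _Derived>
    struct is_virtual_base_of
    : public bool_constant<__builtin_is_virtual_base_of(_Base, _Derived)>
    { };

  template<typename _Base, typename _Derived>
    inline constexpr bool is_virtual_base_of_v
      = __builtin_is_virtual_base_of(_Base, _Derived);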

The new test says:

+// Copyright (C) 2013-2024 Free Software Foundation, Inc.

Unless you reused the code from an existing test that was written in
2013, then the date is wrong. And if you don't have a copyright
assignment, then claiming it's copyright FSF is also wrong.

Please don't put copyright and license headers on new tests at all,
unless they really are copyright FSF, and really do contain something
original and copyrightable (which I don't think applies here).
This is documented at
https://gcc.gnu.org/onlinedocs/libstdc++/manual/test.html#test.new_tests

With those things fixed, this should definitely go in once the new
built-in is supported in the compiler, thanks again!


Re: [PATCH] libstdc++: implement concatenation of strings and string_views

2024-07-30 Thread Jonathan Wakely
On Tue, 30 Jul 2024 at 08:31, Giuseppe D'Angelo
 wrote:
>
> Hello!
>
> The attached patch adds support for P2591R5 in libstdc++
> (concatenation of strings and string_views, approved in Tokyo for C++26).

Thanks for this patch as well. This was on my TODO list so I'll be
happy to not have to do it myself now!

I won't repeat my questions for your is_virtual_base_of patch,
regarding the legal prerequisites, but they apply here too. My
comments about copyright and licence headers in the tests apply here
too.

+#if __cplusplus > 202302L
+  // const string & + string_view

Please test the feature test macro, not the C++ standard version, so e.g.

#if __glibcxx_string_view >= 202403L

This relates the #if block directly to the feature test macro, so that
if we decide to support the feature in C++23 as an extension, we only
need to change version.def and not basic_string.h
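
For reference, the sort of code P2591R5 makes well-formed (just a sketch):

  std::string s = "Hello, ";
  std::string_view sv = "world";
  std::string r1 = s + sv;   // new in C++26
  std::string r2 = sv + s;   // new in C++26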


[PATCH] c/106800 - support vector condition operation in C

2024-07-30 Thread Richard Biener
The following adds support for vector conditionals in C.  The support
was nearly there already, but c_objc_common_truthvalue_conversion
rejected vector types.  Instead of letting them pass there unchanged
I chose to skip it when parsing conditionals, as a variant with less
possible fallout.  The part missing was promotion of scalar operands
to vector; I copied the logic from binary operator handling for this
and for the case of two scalars mimicked what the C++ frontend does.

I've moved the testcases I could easily spot over to c-c++-common.

Changed from the first version, this should now be more compatible
with what the C++ frontend accepts and rejects; in particular,
two scalar operands are now promoted to vectors before applying
standard conversions.  I've resorted to c_common_type here; the
C++ frontend performs a scalar ?: build, but that doesn't seem to
involve integral promotions in C++, so the same approach doesn't
work for C.  This enabled c-c++-common/vector23.c to be "moved"
from g++.dg/ext and enabled one extra case in
c-c++-common/vector19.c, but with an explicit conversion of the
character literal.
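
For reference, code along the following lines (an illustrative sketch, not
one of the moved testcases) is now accepted by the C frontend:

  typedef int v4si __attribute__((vector_size (16)));

  v4si
  f (v4si a, v4si b, v4si c)
  {
    return a > b ? b : c;  /* element-wise selection */
  }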

Bootstrapped and tested on x86_64-unknown-linux-gnu, OK?

Thanks,
Richard.

PR c/106800
gcc/
* doc/extend.texi (Vector Extension): Document ternary ?:
to be generally available.

gcc/c/
* c-parser.cc (c_parser_conditional_expression): Skip
truthvalue conversion for vector typed conditions.
* c-typeck.cc (build_conditional_expr): Build a VEC_COND_EXPR
for vector typed ifexpr.  Do basic diagnostics.

* g++.dg/ext/pr56790-1.C: Move ...
* c-c++-common/pr56790-1.c: ... here.
* g++.dg/opt/vectcond-1.C: Move ...
* c-c++-common/vectcond-1.c: ... here.
* g++.dg/ext/vector21.C: Move ...
* c-c++-common/vector21.c: ... here.
* g++.dg/ext/vector22.C: Move ...
* c-c++-common/vector22.c: ... here.
* g++.dg/ext/vector35.C: Move ...
* c-c++-common/vector35.c: ... here.
* g++.dg/ext/vector19.C: Move common parts ...
* c-c++-common/vector19.c: ... here.
* gcc.dg/vector-19.c: Add c23 auto case.
* c-c++-common/vector23.c: Duplicate here from
g++.dg/ext/vector23.C, removing use of auto and adding
explicit cast.

amend
---
 gcc/c/c-parser.cc |   7 +-
 gcc/c/c-typeck.cc | 121 +-
 gcc/doc/extend.texi   |   2 +-
 .../pr56790-1.C => c-c++-common/pr56790-1.c}  |   0
 .../vectcond-1.c} |   0
 gcc/testsuite/c-c++-common/vector19.c |  34 +
 .../vector21.C => c-c++-common/vector21.c}|   0
 .../vector22.C => c-c++-common/vector22.c}|   0
 gcc/testsuite/c-c++-common/vector23.c |  28 
 .../vector35.C => c-c++-common/vector35.c}|   0
 gcc/testsuite/g++.dg/ext/vector19.C   |  25 
 gcc/testsuite/gcc.dg/vector-19.c  |  10 ++
 12 files changed, 191 insertions(+), 36 deletions(-)
 rename gcc/testsuite/{g++.dg/ext/pr56790-1.C => c-c++-common/pr56790-1.c} 
(100%)
 rename gcc/testsuite/{g++.dg/opt/vectcond-1.C => c-c++-common/vectcond-1.c} 
(100%)
 create mode 100644 gcc/testsuite/c-c++-common/vector19.c
 rename gcc/testsuite/{g++.dg/ext/vector21.C => c-c++-common/vector21.c} (100%)
 rename gcc/testsuite/{g++.dg/ext/vector22.C => c-c++-common/vector22.c} (100%)
 create mode 100644 gcc/testsuite/c-c++-common/vector23.c
 rename gcc/testsuite/{g++.dg/ext/vector35.C => c-c++-common/vector35.c} (100%)
 create mode 100644 gcc/testsuite/gcc.dg/vector-19.c

diff --git a/gcc/c/c-parser.cc b/gcc/c/c-parser.cc
index 9b9284b1ba4..c5eacfd7d97 100644
--- a/gcc/c/c-parser.cc
+++ b/gcc/c/c-parser.cc
@@ -9281,9 +9281,10 @@ c_parser_conditional_expression (c_parser *parser, 
struct c_expr *after,
 }
   else
 {
-  cond.value
-   = c_objc_common_truthvalue_conversion
-   (cond_loc, default_conversion (cond.value));
+  /* Vector conditions see no default or truthvalue conversion.  */
+  if (!VECTOR_INTEGER_TYPE_P (TREE_TYPE (cond.value)))
+   cond.value = c_objc_common_truthvalue_conversion
+  (cond_loc, default_conversion (cond.value));
   c_inhibit_evaluation_warnings += cond.value == truthvalue_false_node;
   exp1 = c_parser_expression_conv (parser);
   mark_exp_read (exp1.value);
diff --git a/gcc/c/c-typeck.cc b/gcc/c/c-typeck.cc
index 094e41fa202..b0cad0278df 100644
--- a/gcc/c/c-typeck.cc
+++ b/gcc/c/c-typeck.cc
@@ -5677,6 +5677,57 @@ build_conditional_expr (location_t colon_loc, tree 
ifexp, bool ifexp_bcp,
   if (ifexp_int_operands)
 ifexp = remove_c_maybe_const_expr (ifexp);
 
+  /* When this is a vector conditional but both alternatives are not vector
+ types promote them before applying default conversions.  */
+  if (VECTOR_TYPE_P (TREE_TYPE (ifexp))
+  && !VECTOR_TYPE_P (TREE_TYPE (op1))
+  && !VECTOR_TYPE_P 

Re: [PATCH] LoongArch: Expand some SImode operations through "si3_extend" instructions if TARGET_64BIT

2024-07-30 Thread Jeff Law




On 7/30/24 7:01 AM, Lulu Cheng wrote:


On 2024/7/26 8:43 PM, Xi Ruoyao wrote:

We already had "si3_extend" insns and we hoped the fwprop or combine
passes can use them to remove unnecessary sign extensions.  But this
does not always work: for cases like x << 1 | y, the compiler
tends to do

 (sign_extend:DI
   (ior:SI (ashift:SI (reg:SI $r4)
  (const_int 1))
   (reg:SI $r5)))

instead of

 (ior:DI (sign_extend:DI (ashift:SI (reg:SI $r4) (const_int 1)))
 (sign_extend:DI (reg:SI $r5)))

So we cannot match the ashlsi3_extend instruction here and we get:

 slli.w $r4,$r4,1
 or $r4,$r5,$r4
 slli.w $r4,$r4,0    # <= redundant
 jr   $r1

To eliminate this redundant extension we need to turn SImode shift etc.
to DImode "si3_extend" operations earlier, when we expand the SImode
operation.  We are already doing this for addition, now do it for
shifts, rotates, substract, multiplication, division, and modulo as
well.

The bytepick.w definition for TARGET_64BIT needs to be adjusted so it
won't be undone by the shift expanding.


LGTM!

I don't know if there will be redundant symbol extension directives 
after this change.:-(
We've had very good success with this approach on RISC-V.  It's not 
perfect and may highlight the need for a few additional patterns, but 
overall it was very much a step in the right direction.


Jeff



Re: [PATCH 1/3][v2] Add TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Paul Koning



> On Jul 30, 2024, at 6:17 AM, Richard Biener  wrote:
> 
> The following adds a target hook to specify whether regs of MODE can be
> used to transfer bits.  The hook is supposed to be used for value-numbering
> to decide whether a value loaded in such mode can be punned to another
> mode instead of re-loading the value in the other mode and for SRA to
> decide whether MODE is suitable as container holding a value to be
> used in different modes.
> 
> ...
> 
> +@deftypefn {Target Hook} bool TARGET_MODE_CAN_TRANSFER_BITS (machine_mode 
> @var{mode})
> +Define this to return false if the mode @var{mode} cannot be used
> +for memory copying.  The default is to assume modes with the same
> +precision as size are fine to be used.
> +@end deftypefn
> +

I'm a bit confused about the meaning of this hook; the summary at the top 
speaks of type punning while the documentation talks about memory copying.  
Those seem rather different.

I'm also wondering about this being tied to a mode rather than a register 
class.  To give an example: on the PDP11 there are two main register classes, 
"general" and "float".  General registers handle any bit pattern and support 
arithmetic operations on integer modes; float registers do not transparently 
transfer every bit pattern and support float modes.  So only general registers 
are suitable for memory copies (though on a PDP-11 you don't need registers to 
do memory copy).  And for type punning, you could load an SF mode value into 
general registers (a pair) and type-pun them to SImode without reloading.

So what does that mean for this hook on that target?

paul




Re: [PATCH v3 2/3] aarch64: Add support for moving fpm system register

2024-07-30 Thread Claudio Bantaloukas


On 29/07/2024 13:13, Richard Sandiford wrote:
> Claudio Bantaloukas  writes:
>> Unlike most system registers, fpmr can be heavily written to in code that
>> exercises the fp8 functionality. That is because every fp8 instrinsic call
>> can potentially change the value of fpmr.
>> Rather than just use a an unspec, we treat the fpmr system register like
> 
> Typo: s/a an/an/
Thanks for the catch, will repost along with the requested changes below

Cheers,
Claudio

> 
>> all other registers and use a move operation to read and write to it.
>>
>> We introduce a new class of moveable system registers that, currently,
>> only accepts fpmr and a new constraint, Umv, that allows us to
>> selectively use mrs and msr instructions when expanding rtl for them.
>> Given that there is code that depends on "real" registers coming before
>> "fake" ones, we introduce a new constant FPM_REGNUM that uses an
>> existing value and renumber registers below that.
>> This requires us to update the bitmaps that describe which registers
>> belong to each register class.
>>
>> gcc/ChangeLog:
>>
>>  * config/aarch64/aarch64.cc (aarch64_hard_regno_nregs): Add
>>  support for MOVEABLE_SYSREGS class.
>>  (aarch64_hard_regno_mode_ok): Allow reads and writes to fpmr.
>>  (aarch64_regno_regclass): Support MOVEABLE_SYSREGS class.
>>  (aarch64_class_max_nregs): Likewise.
>>  * config/aarch64/aarch64.h (FIXED_REGISTERS): add fpmr.
>>  (CALL_REALLY_USED_REGISTERS): Likewise.
>>  (REGISTER_NAMES): Likewise.
>>  (enum reg_class): Add MOVEABLE_SYSREGS class.
>>  (REG_CLASS_NAMES): Likewise.
>>  (REG_CLASS_CONTENTS): Update class bitmaps to deal with fpmr,
>>  the new MOVEABLE_REGS class and renumbering of registers.
>>  * config/aarch64/aarch64.md: (FPM_REGNUM): added new register
>>  number, reusing old value.
>>  (FFR_REGNUM): Renumber.
>>  (FFRT_REGNUM): Likewise.
>>  (LOWERING_REGNUM): Likewise.
>>  (TPIDR2_BLOCK_REGNUM): Likewise.
>>  (SME_STATE_REGNUM): Likewise.
>>  (TPIDR2_SETUP_REGNUM): Likewise.
>>  (ZA_FREE_REGNUM): Likewise.
>>  (ZA_SAVED_REGNUM): Likewise.
>>  (ZA_REGNUM): Likewise.
>>  (ZT0_REGNUM): Likewise.
>>  (*mov_aarch64): Add support for moveable sysregs.
>>  (*movsi_aarch64): Likewise.
>>  (*movdi_aarch64): Likewise.
>>  * config/aarch64/constraints.md (MOVEABLE_SYSREGS): New constraint.
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * gcc.target/aarch64/acle/fp8.c: New tests.
>> [...]
>> @@ -1405,6 +1409,8 @@ (define_insn "*mov_aarch64"
>>[w, r Z  ; neon_from_gp, nosimd ] fmov\t%s0, %w1
>>[w, w; neon_dup   , simd   ] dup\t%0, %1.[0]
>>[w, w; neon_dup   , nosimd ] fmov\t%s0, %s1
>> + [Umv, r  ; mrs, *  ] msr\t%0, %x1
>> + [r, Umv  ; mrs, *  ] mrs\t%x0, %1
>> }
>>   )
>>   
>> @@ -1467,6 +1473,8 @@ (define_insn_and_split "*movsi_aarch64"
>>[r  , w  ; f_mrc, fp  , 4] fmov\t%w0, %s1
>>[w  , w  ; fmov , fp  , 4] fmov\t%s0, %s1
>>[w  , Ds ; neon_move, simd, 4] << 
>> aarch64_output_scalar_simd_mov_immediate (operands[1], SImode);
>> + [Umv, r  ; mrs  , *   , 8] msr\t%0, %x1
>> + [r, Umv  ; mrs  , *   , 8] mrs\t%x0, %1
> 
> The lengths should be 4 rather than 8.
> 
>> }
>> "CONST_INT_P (operands[1]) && !aarch64_move_imm (INTVAL (operands[1]), 
>> SImode)
>>   && REG_P (operands[0]) && GP_REGNUM_P (REGNO (operands[0]))"
>> @@ -1505,6 +1513,8 @@ (define_insn_and_split "*movdi_aarch64"
>>[w, w  ; fmov , fp  , 4] fmov\t%d0, %d1
>>[w, Dd ; neon_move, simd, 4] << 
>> aarch64_output_scalar_simd_mov_immediate (operands[1], DImode);
>>[w, Dx ; neon_move, simd, 8] #
>> + [Umv, r; mrs  , *   , 8] msr\t%0, %1
>> + [r, Umv; mrs  , *   , 8] mrs\t%0, %1
> 
> Similarly here.
> 
>> }
>> "CONST_INT_P (operands[1])
>>  && REG_P (operands[0])
>> [...]
>> diff --git a/gcc/testsuite/gcc.target/aarch64/acle/fp8.c 
>> b/gcc/testsuite/gcc.target/aarch64/acle/fp8.c
>> index 459442be155..1a5c3d7e8fd 100644
>> --- a/gcc/testsuite/gcc.target/aarch64/acle/fp8.c
>> +++ b/gcc/testsuite/gcc.target/aarch64/acle/fp8.c
>> @@ -1,6 +1,7 @@
>>   /* Test the fp8 ACLE intrinsics family.  */
>>   /* { dg-do compile } */
>>   /* { dg-options "-O1 -march=armv8-a" } */
>> +/* { dg-final { check-function-bodies "**" "" "" } } */
>>   
>>   #include 
>>   
>> @@ -17,4 +18,107 @@
>>   #error "__ARM_FEATURE_FP8 feature macro defined."
>>   #endif
>>   
>> +/*
>> +**test_write_fpmr_sysreg_asm_64:
>> +**  msr fpmr, x0
>> +**  ret
>> +*/
>> +void
>> +test_write_fpmr_sysreg_asm_64 (uint64_t val)
>> +{
>> +  register uint64_t fpmr asm ("fpmr") = val;
>> +  asm volatile ("" ::"Umv"(fpmr));
>> +}
>> +
>> +/*
>> +**test_write_fpmr_sysreg_asm_32:
>> +**  uxtw x0, w0
>> +**  msr fpmr, x0
>> +**  ret
>> +*/
>> +void
>> +test_write_fpmr_sysreg_asm_32 (uint32

Re: [PATCH] libstdc++: implement concatenation of strings and string_views

2024-07-30 Thread Jonathan Wakely
On Tue, 30 Jul 2024 at 14:08, Jonathan Wakely  wrote:
>
> On Tue, 30 Jul 2024 at 08:31, Giuseppe D'Angelo
>  wrote:
> >
> > Hello!
> >
> > The attached patch implements adds support for P2591R5 in libstdc++
> > (concatenation of strings and string_views, approved in Tokyo for C++26).
>
> Thanks for this patch as well. This was on my TODO list so I'll be
> happy to not have to do it myself now!
>
> I won't repeat my questions for your is_virtual_base_of patch,
> regarding the legal prerequisites, but they apply here too. My
> comments about copyright and licence headers in the tests apply here
> too.
>
> +#if __cplusplus > 202302L
> +  // const string & + string_view
>
> Please test the feature test macro, not the C++ standard version, so e.g.
>
> #if __glibcxx_string_view >= 202403L
>
> This relates the #if block directly to the feature test macro, so that
> if we decide to support the feature in C++23 as an extension, we only
> need to change version.def and not basic_string.h

Also the use of dg-options and { target c++NN } selectors in the tests
is wrong. Please read
https://gcc.gnu.org/onlinedocs/libstdc++/manual/test.html#test.new_tests

+// { dg-do compile { target c++17 } }
+// { dg-error "no match for" "P2591" { target c++17 } 25 }

+// { dg-options "-std=gnu++2c" }
+// { dg-do compile { target c++17 } }
+// { dg-error "no match for" "P2591" { target c++26 } 25 }

Why are there two tests for this?
The test will fail for any mode C++17 or later, so you only need one
test, and so no need for a separate op_plus_fspath_impl.h header. Just
write one test with:

+// { dg-do compile { target c++17 } }
+// { dg-error "no match for" "P2591" { target c++17 } 25 }

The "P2591" comment isn't very clear, the paper only mentions fs::path once:
"Added the requested tests (involving std::filesystem::path and a test
producing ambiguities) to the working prototype."

This doesn't really explain what the test is for, could you add a
comment to the new test saying that the new operators added by P2591
are expected to not support concatenation with paths. And move the
content of op_plus_fspath_impl.h into op_plus_fspath_cpp17_fail.cc and
get rid of op_plus_fspath_cpp2c_fail.cc. With that change, you'll also
be able to put the dg-error on the right line where the error actually
happens.

+#if !defined(__cpp_lib_string_view) || __cpp_lib_string_view < 202403L
+#error
+#endif

The #error should say what's wrong, otherwise the logs would just show:

FAIL: op_plus_string_view.cc -std=gnu++17
#error

which is not very helpful.

+// { dg-options "-std=gnu++26" }
+// { dg-do run { target c++20 } }

This is entirely bogus, it says the test should run for C++20 and
later, but then forces it to run for only 26.
This should be just:
// { dg-do run { target c++26 } }
The testsuite will add the right -std option, so don't force it with dg-options.


+// { dg-options "-std=gnu++2c" }
+// { dg-do compile { target c++17 } }
+// { dg-error "ambiguous" "P2591" { target c++26 } 73 }

Don't force -std with dg-options, just use the right target selector.
Please add a comment explaining what is being tested here, "P2591"
doesn't tell me.

Please combine op_plus_string_view_compat_ok.cc and
op_plus_string_view_compat_fail.cc into one file, and get rid of the
impl header.
I think it should be a single file with something like:

// { dg-do run { target { c++17 && c++23_down } } }
// { dg-do compile { target c++26 } }
// This should be ambiguous in C++26 due to new operator+ overloads.
// { dg-error "ambiguous" "P2591" { target c++26 } 73 }

This says to make it a "run" test for C++17-23 and an xfail compile
test for C++23 and later.

Although I don't see any reason to make it a { dg-do run } test for
C++17, all that's testing is that your my_string_view concatenation
operators work correctly, which is irrelevant to libstdc++ code.

So it could be just

// { dg-do compile { target c++17 } }
// This should be ambiguous in C++26 due to new operator+ overloads.
// { dg-error "ambiguous" "P2591" { target c++26 } 73 }

i.e. compile for all modes C++17 and later, but expect an error for
C++26 and later.

Thanks for providing both runtime and constexpr tests, that's great.


Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Alexander Monakov


On Tue, Jul 30, 2024 at 03:00:49PM +0200, Richard Biener wrote:

> > What mangling fld performs depends on the contents of the FP control
> > word which is awkward.

For float/double loads (FLDS and FLDL) we know format conversion changes
SNaNs to QNaNs, but it's a widening conversion, so e.g. rounding mode bits
have no effect (and precision control doesn't affect loads). For FLDT
there should be no mangling as far as I can tell.

On Tue, 30 Jul 2024, Jakub Jelinek wrote:

> > As Jakub said the padding is already dealt with in the caller
> > though I only added that there for convenience since padding is
> > problematic in general.
> > 
> > If you think XFmode is safe to transfer 10 bytes we could enable it,
> > I guess I'll amend the docs to be clear:

(I'm not proposing that, I think it's fine that the caller rejects that)

> > "Define this to return false if the mode @var{mode} cannot be used
> > for memory copying.  The default is to assume modes with the same
> > precision as size are fine to be used."
> > 
> > this might suggest transfering GET_MODE_PRECISION bits is intended
> > but it might want to say GET_MODE_SIZE units explicitly so the
> > default makes sense.
> 
> You could just gcc_unreachable (); for XFmode in the link with a short
> comment that the mode has padding.

I think this would be the right solution.

Alexander


[PATCH] libstdc++: Implement LWG 3886 for std::optional and std::expected

2024-07-30 Thread Jonathan Wakely
This LWG issue is about to become Tentatively Ready.

Tested x86_64-linux.

-- >8 --

This uses remove_cv_t for the default template argument used for
deducing a type for a braced-init-list used with std::optional and
std::expected.
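
For example (an illustrative sketch; the new testcases exercise the same
pattern): with the change, the braced-init-list below deduces MoveOnly
rather than const MoveOnly, so the in-place construction compiles:

  struct MoveOnly
  {
    MoveOnly(int, int) { }
    MoveOnly(MoveOnly&&) { }
  };

  std::optional<const MoveOnly> o({0, 0});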

libstdc++-v3/ChangeLog:

* include/std/expected (expected(U&&), operator=(U&&))
(value_or): Use remove_cv_t on default template argument, as per
LWG 3886.
* include/std/optional (optional(U&&), operator=(U&&))
(value_or): Likewise.
* testsuite/20_util/expected/lwg3886.cc: New test.
* testsuite/20_util/optional/cons/lwg3886.cc: New test.
---
 libstdc++-v3/include/std/expected |  8 +--
 libstdc++-v3/include/std/optional | 12 ++--
 .../testsuite/20_util/expected/lwg3886.cc | 58 +++
 .../20_util/optional/cons/lwg3886.cc  | 58 +++
 4 files changed, 126 insertions(+), 10 deletions(-)
 create mode 100644 libstdc++-v3/testsuite/20_util/expected/lwg3886.cc
 create mode 100644 libstdc++-v3/testsuite/20_util/optional/cons/lwg3886.cc

diff --git a/libstdc++-v3/include/std/expected 
b/libstdc++-v3/include/std/expected
index 515a1e6ab8f..b8217e577fa 100644
--- a/libstdc++-v3/include/std/expected
+++ b/libstdc++-v3/include/std/expected
@@ -468,7 +468,7 @@ namespace __expected
  std::move(__x)._M_unex);
}
 
-  template
+  template>
requires (!is_same_v, expected>)
  && (!is_same_v, in_place_t>)
  && is_constructible_v<_Tp, _Up>
@@ -582,7 +582,7 @@ namespace __expected
return *this;
   }
 
-  template
+  template>
requires (!is_same_v>)
  && (!__expected::__is_unexpected>)
  && is_constructible_v<_Tp, _Up> && is_assignable_v<_Tp&, _Up>
@@ -818,7 +818,7 @@ namespace __expected
return std::move(_M_unex);
   }
 
-  template
+  template>
constexpr _Tp
value_or(_Up&& __v) const &
noexcept(__and_v,
@@ -832,7 +832,7 @@ namespace __expected
  return static_cast<_Tp>(std::forward<_Up>(__v));
}
 
-  template
+  template>
constexpr _Tp
value_or(_Up&& __v) &&
noexcept(__and_v,
diff --git a/libstdc++-v3/include/std/optional 
b/libstdc++-v3/include/std/optional
index 4694d594f98..2c4cc260f90 100644
--- a/libstdc++-v3/include/std/optional
+++ b/libstdc++-v3/include/std/optional
@@ -868,7 +868,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 
   // Converting constructors for engaged optionals.
 #ifdef _GLIBCXX_USE_CONSTRAINTS_FOR_OPTIONAL
-  template
+  template>
requires (!is_same_v>)
  && (!is_same_v>)
  && is_constructible_v<_Tp, _Up>
@@ -919,7 +919,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
: _Base(std::in_place, __il, std::forward<_Args>(__args)...)
{ }
 #else
-  template,
   _Requires<__not_self<_Up>, __not_tag<_Up>,
 is_constructible<_Tp, _Up>,
 is_convertible<_Up, _Tp>,
@@ -929,7 +929,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
noexcept(is_nothrow_constructible_v<_Tp, _Up>)
: _Base(std::in_place, std::forward<_Up>(__t)) { }
 
-  template,
   _Requires<__not_self<_Up>, __not_tag<_Up>,
 is_constructible<_Tp, _Up>,
 __not_>,
@@ -1017,7 +1017,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
return *this;
   }
 
-  template
+  template>
 #ifdef _GLIBCXX_USE_CONSTRAINTS_FOR_OPTIONAL
requires (!is_same_v>)
  && (!(is_scalar_v<_Tp> && is_same_v<_Tp, decay_t<_Up>>))
@@ -1242,7 +1242,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
__throw_bad_optional_access();
   }
 
-  template
+  template>
constexpr _Tp
value_or(_Up&& __u) const&
{
@@ -1255,7 +1255,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
return static_cast<_Tp>(std::forward<_Up>(__u));
}
 
-  template
+  template>
constexpr _Tp
value_or(_Up&& __u) &&
{
diff --git a/libstdc++-v3/testsuite/20_util/expected/lwg3886.cc 
b/libstdc++-v3/testsuite/20_util/expected/lwg3886.cc
new file mode 100644
index 000..cf1a2ce4421
--- /dev/null
+++ b/libstdc++-v3/testsuite/20_util/expected/lwg3886.cc
@@ -0,0 +1,58 @@
+// { dg-do compile { target c++23 } }
+
+// LWG 3886. Monad mo' problems
+
+#include 
+
+void
+test_constructor()
+{
+  struct MoveOnly {
+MoveOnly(int, int) { }
+MoveOnly(MoveOnly&&) { }
+  };
+
+  // The {0,0} should be deduced as MoveOnly not const MoveOnly
+  [[maybe_unused]] std::expected e({0,0});
+}
+
+struct Tracker {
+  bool moved = false;
+  constexpr Tracker(int, int) { }
+  constexpr Tracker(const Tracker&) { }
+  constexpr Tracker(Tracker&&) : moved(true) { }
+
+  // The follow means that is_assignable is true:
+  template constexpr void operator=(T&&) const { }
+  // This stops a copy assignment from being declared implicitly

Re: [PATCH v3 1/3] aarch64: Add march flags for +fp8 arch extensions

2024-07-30 Thread Claudio Bantaloukas


On 29/07/2024 08:30, Kyrylo Tkachov wrote:
> Hi Claudio,
> 
>> On 26 Jul 2024, at 18:32, Claudio Bantaloukas  
>> wrote:
>>
>> External email: Use caution opening links or attachments
>>
>>
>> This introduces the relevant flags to enable access to the fpmr register and 
>> fp8 intrinsics, which will be added subsequently.
>>
>> gcc/ChangeLog:
>>
>> * config/aarch64/aarch64-option-extensions.def (fp8): New.
>> * config/aarch64/aarch64.h (TARGET_FP8): Likewise.
>> * doc/invoke.texi (AArch64 Options): Document new -march flags
>> and extensions.
>>
>> gcc/testsuite/ChangeLog:
>>
>> * gcc.target/aarch64/acle/fp8.c: New test.
> 
> Thanks, this looks ok to me now.
> One question about the command-line flag.
> FP8 defines instructions for Advanced SIMD, SVE and SME.
> Is the “+fp8” option in this patch intended to combine with the +sve and +sme 
> options to indicate the presence of these ISA-specific subsets? That is, 
> you’re not planning to introduce something like +sve-fp8, +sme-fp8?
> Kyrill

Hi Kyrill, thanks!
The plan is to have more specific feature flags like +fp8fma 
+ssve-fp8fma and +sme-lutv. +fp8 will only be used for conversion and 
scaling operations and my understanding is that it will not combine as 
you propose.

See also the relevant binutils features in 
https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gas/config/tc-aarch64.c;h=e94a0cff406aaaf1800979a27991ccbb7e92e917;hb=HEAD#l10731

Cheers,
Claudio

> 
>> ---
>> .../aarch64/aarch64-option-extensions.def |  2 ++
>> gcc/config/aarch64/aarch64.h  |  3 +++
>> gcc/doc/invoke.texi   |  2 ++
>> gcc/testsuite/gcc.target/aarch64/acle/fp8.c   | 20 +++
>> 4 files changed, 27 insertions(+)
>> create mode 100644 gcc/testsuite/gcc.target/aarch64/acle/fp8.c
>>
>> diff --git a/gcc/config/aarch64/aarch64-option-extensions.def 
>> b/gcc/config/aarch64/aarch64-option-extensions.def
>> index 42ec0eec31e..6998627f377 100644
>> --- a/gcc/config/aarch64/aarch64-option-extensions.def
>> +++ b/gcc/config/aarch64/aarch64-option-extensions.def
>> @@ -232,6 +232,8 @@ AARCH64_OPT_EXTENSION("the", THE, (), (), (), "the")
>>
>> AARCH64_OPT_EXTENSION("gcs", GCS, (), (), (), "gcs")
>>
>> +AARCH64_OPT_EXTENSION("fp8", FP8, (SIMD), (), (), "fp8")
>> +
>> #undef AARCH64_OPT_FMV_EXTENSION
>> #undef AARCH64_OPT_EXTENSION
>> #undef AARCH64_FMV_FEATURE
>> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
>> index b7e330438d9..2e75c6b81e2 100644
>> --- a/gcc/config/aarch64/aarch64.h
>> +++ b/gcc/config/aarch64/aarch64.h
>> @@ -463,6 +463,9 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE 
>> ATTRIBUTE_UNUSED
>> && (aarch64_tune_params.extra_tuning_flags \
>>  & AARCH64_EXTRA_TUNE_AVOID_PRED_RMW))
>>
>> +/* fp8 instructions are enabled through +fp8.  */
>> +#define TARGET_FP8 AARCH64_HAVE_ISA (FP8)
>> +
>> /* Standard register usage.  */
>>
>> /* 31 64-bit general purpose registers R0-R30:
>> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
>> index 9fb0925ed29..7cbcd8ad1b4 100644
>> --- a/gcc/doc/invoke.texi
>> +++ b/gcc/doc/invoke.texi
>> @@ -21848,6 +21848,8 @@ Enable support for Armv9.4-a Guarded Control Stack 
>> extension.
>> Enable support for Armv8.9-a/9.4-a translation hardening extension.
>> @item rcpc3
>> Enable the RCpc3 (Release Consistency) extension.
>> +@item fp8
>> +Enable the fp8 (8-bit floating point) extension.
>>
>> @end table
>>
>> diff --git a/gcc/testsuite/gcc.target/aarch64/acle/fp8.c 
>> b/gcc/testsuite/gcc.target/aarch64/acle/fp8.c
>> new file mode 100644
>> index 000..459442be155
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/acle/fp8.c
>> @@ -0,0 +1,20 @@
>> +/* Test the fp8 ACLE intrinsics family.  */
>> +/* { dg-do compile } */
>> +/* { dg-options "-O1 -march=armv8-a" } */
>> +
>> +#include 
>> +
>> +#ifdef __ARM_FEATURE_FP8
>> +#error "__ARM_FEATURE_FP8 feature macro defined."
>> +#endif
>> +
>> +#pragma GCC push_options
>> +#pragma GCC target("arch=armv9.4-a+fp8")
>> +
>> +/* We do not define __ARM_FEATURE_FP8 until all
>> +   relevant features have been added. */
>> +#ifdef __ARM_FEATURE_FP8
>> +#error "__ARM_FEATURE_FP8 feature macro defined."
>> +#endif
>> +
>> +#pragma GCC pop_options
> 

[PATCH v2 0/1] gcc/: Rename array_type_nelts() => array_type_nelts_minus_one()

2024-07-30 Thread Alejandro Colomar
Hi Richard, Jakub,

I've adjusted the ChangeLog; hopefully it'll be good now.

On Tue, Jul 30, 2024 at 12:22:01PM +0200, Richard Biener wrote:
> The changes look good to me, please leave the frontend maintainers
> time to chime in.

Sure; as much as they need.  My latest patch
(-Wunterminated-string-initialization) took 2 years to get in; I expect
__lengthof__, which is a larger feature, to take no less than that.  :)

Below is a range-diff comparing v1 and v2.

BTW, I put the Cc and Link lines before the changelog, because Martin
had issues with them while merging and pushing my latest patch.  Please
let me know if that will work (work around the scripts).

> Also Jakub had reservations with the renaming because of branch
> maintainance.  I think if that proves an issue we could backport the
> renaming as well, or make sure that array_type_nelts is not
> re-introduced with the same name but different semantics.

I don't expect __lengthof__ to be merged before 2025, and then, I
thought we could keep the name array_type_nelts_top() for some time, to
prevent issues like that.

Anyway, I think backporting this change could be good.  So, how does
this timeline sound to you?

-  Rename array_type_nelts() => array_type_nelts_minus_one() now.
-  Backport that change to stable branches.

[... a year or so passes ...]

-  Make array_type_nelts_top() a global API.
-  Add __lengthof__.

[... a year or so passes ...]

-  Rename array_type_nelts_top() => array_type_nelts()

This would reduce chances of mistakes.

Have a lovely day!
Alex


Alejandro Colomar (1):
  gcc/: Rename array_type_nelts() => array_type_nelts_minus_one()

 gcc/c/c-decl.cc   | 10 +-
 gcc/c/c-fold.cc   |  7 ---
 gcc/config/aarch64/aarch64.cc |  2 +-
 gcc/config/i386/i386.cc   |  2 +-
 gcc/cp/decl.cc|  2 +-
 gcc/cp/init.cc|  8 
 gcc/cp/lambda.cc  |  3 ++-
 gcc/cp/tree.cc|  2 +-
 gcc/expr.cc   |  8 
 gcc/fortran/trans-array.cc|  2 +-
 gcc/fortran/trans-openmp.cc   |  4 ++--
 gcc/rust/backend/rust-tree.cc |  2 +-
 gcc/tree.cc   |  4 ++--
 gcc/tree.h|  2 +-
 14 files changed, 30 insertions(+), 28 deletions(-)

Range-diff against v1:
1:  82efbc3c540 ! 1:  73010cb4af6 gcc/: Rename array_type_nelts() => 
array_type_nelts_minus_one()
@@ Commit message
 Cc: Martin Uecker 
 Cc: Joseph Myers 
 Cc: Xavier Del Campo Romero 
+Cc: Jakub Jelinek 
 
 gcc/ChangeLog:
 
 * tree.cc (array_type_nelts): Rename function ...
 (array_type_nelts_minus_one): ... to this name.  The old name
 was misleading.
-* tree.h: Likewise.
-* c/c-decl.cc: Likewise.
-* c/c-fold.cc: Likewise.
-* config/aarch64/aarch64.cc: Likewise.
-* config/i386/i386.cc: Likewise.
-* cp/decl.cc: Likewise.
-* cp/init.cc: Likewise.
-* cp/lambda.cc: Likewise.
-* cp/tree.cc: Likewise.
-* expr.cc: Likewise.
-* fortran/trans-array.cc: Likewise.
-* fortran/trans-openmp.cc: Likewise.
-* rust/backend/rust-tree.cc: Likewise.
+* tree.h (array_type_nelts): Rename function ...
+(array_type_nelts_minus_one): ... to this name.  The old name
+was misleading.
+* expr.cc (count_type_elements):
+Rename array_type_nelts() => array_type_nelts_minus_one()
+* config/aarch64/aarch64.cc
+(pure_scalable_type_info::analyze_array): Likewise.
+* config/i386/i386.cc (ix86_canonical_va_list_type): Likewise.
+
+gcc/c/ChangeLog:
+
+* c-decl.cc (one_element_array_type_p, get_parm_array_spec):
+Rename array_type_nelts() => array_type_nelts_minus_one()
+* c-fold.cc (c_fold_array_ref): Likewise.
+
+gcc/cp/ChangeLog:
+
+* decl.cc (reshape_init_array):
+Rename array_type_nelts() => array_type_nelts_minus_one()
+* init.cc (build_zero_init_1): Likewise.
+(build_value_init_noctor): Likewise.
+(build_vec_init): Likewise.
+(build_delete): Likewise.
+* lambda.cc (add_capture): Likewise.
+* tree.cc (array_type_nelts_top): Likewise.
+
+gcc/fortran/ChangeLog:
+
+* trans-array.cc (structure_alloc_comps):
+Rename array_type_nelts() => array_type_nelts_minus_one()
+* trans-openmp.cc (gfc_walk_alloc_comps): Likewise.
+(gfc_omp_clause_linear_ctor): Likewise.
+
+gcc/rust/ChangeLog:
+
+* backend/r

[PATCH v2 1/1] gcc/: Rename array_type_nelts() => array_type_nelts_minus_one()

2024-07-30 Thread Alejandro Colomar
The old name was misleading.

While at it, also rename some temporary variables that are used with
this function, for consistency.

Link: 
https://inbox.sourceware.org/gcc-patches/9fffd80-dca-2c7e-14b-6c9b509a7...@redhat.com/T/#m2f661c67c8f7b2c405c8c7fc3152dd85dc729120
Cc: Gabriel Ravier 
Cc: Martin Uecker 
Cc: Joseph Myers 
Cc: Xavier Del Campo Romero 
Cc: Jakub Jelinek 

gcc/ChangeLog:

* tree.cc (array_type_nelts): Rename function ...
(array_type_nelts_minus_one): ... to this name.  The old name
was misleading.
* tree.h (array_type_nelts): Rename function ...
(array_type_nelts_minus_one): ... to this name.  The old name
was misleading.
* expr.cc (count_type_elements):
Rename array_type_nelts() => array_type_nelts_minus_one()
* config/aarch64/aarch64.cc
(pure_scalable_type_info::analyze_array): Likewise.
* config/i386/i386.cc (ix86_canonical_va_list_type): Likewise.

gcc/c/ChangeLog:

* c-decl.cc (one_element_array_type_p, get_parm_array_spec):
Rename array_type_nelts() => array_type_nelts_minus_one()
* c-fold.cc (c_fold_array_ref): Likewise.

gcc/cp/ChangeLog:

* decl.cc (reshape_init_array):
Rename array_type_nelts() => array_type_nelts_minus_one()
* init.cc (build_zero_init_1): Likewise.
(build_value_init_noctor): Likewise.
(build_vec_init): Likewise.
(build_delete): Likewise.
* lambda.cc (add_capture): Likewise.
* tree.cc (array_type_nelts_top): Likewise.

gcc/fortran/ChangeLog:

* trans-array.cc (structure_alloc_comps):
Rename array_type_nelts() => array_type_nelts_minus_one()
* trans-openmp.cc (gfc_walk_alloc_comps): Likewise.
(gfc_omp_clause_linear_ctor): Likewise.

gcc/rust/ChangeLog:

* backend/rust-tree.cc (array_type_nelts_top):
Rename array_type_nelts() => array_type_nelts_minus_one()

Suggested-by: Richard Biener 
Signed-off-by: Alejandro Colomar 
---
 gcc/c/c-decl.cc   | 10 +-
 gcc/c/c-fold.cc   |  7 ---
 gcc/config/aarch64/aarch64.cc |  2 +-
 gcc/config/i386/i386.cc   |  2 +-
 gcc/cp/decl.cc|  2 +-
 gcc/cp/init.cc|  8 
 gcc/cp/lambda.cc  |  3 ++-
 gcc/cp/tree.cc|  2 +-
 gcc/expr.cc   |  8 
 gcc/fortran/trans-array.cc|  2 +-
 gcc/fortran/trans-openmp.cc   |  4 ++--
 gcc/rust/backend/rust-tree.cc |  2 +-
 gcc/tree.cc   |  4 ++--
 gcc/tree.h|  2 +-
 14 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/gcc/c/c-decl.cc b/gcc/c/c-decl.cc
index 97f1d346835..4dced430d1f 100644
--- a/gcc/c/c-decl.cc
+++ b/gcc/c/c-decl.cc
@@ -5309,7 +5309,7 @@ one_element_array_type_p (const_tree type)
 {
   if (TREE_CODE (type) != ARRAY_TYPE)
 return false;
-  return integer_zerop (array_type_nelts (type));
+  return integer_zerop (array_type_nelts_minus_one (type));
 }
 
 /* Determine whether TYPE is a zero-length array type "[0]".  */
@@ -6257,15 +6257,15 @@ get_parm_array_spec (const struct c_parm *parm, tree 
attrs)
  for (tree type = parm->specs->type; TREE_CODE (type) == ARRAY_TYPE;
   type = TREE_TYPE (type))
{
- tree nelts = array_type_nelts (type);
- if (error_operand_p (nelts))
+ tree nelts_minus_one = array_type_nelts_minus_one (type);
+ if (error_operand_p (nelts_minus_one))
return attrs;
- if (TREE_CODE (nelts) != INTEGER_CST)
+ if (TREE_CODE (nelts_minus_one) != INTEGER_CST)
{
  /* Each variable VLA bound is represented by the dollar
 sign.  */
  spec += "$";
- tpbnds = tree_cons (NULL_TREE, nelts, tpbnds);
+ tpbnds = tree_cons (NULL_TREE, nelts_minus_one, tpbnds);
}
}
  tpbnds = nreverse (tpbnds);
diff --git a/gcc/c/c-fold.cc b/gcc/c/c-fold.cc
index 57b67c74bd8..9ea174f79c4 100644
--- a/gcc/c/c-fold.cc
+++ b/gcc/c/c-fold.cc
@@ -73,11 +73,12 @@ c_fold_array_ref (tree type, tree ary, tree index)
   unsigned elem_nchars = (TYPE_PRECISION (elem_type)
  / TYPE_PRECISION (char_type_node));
   unsigned len = (unsigned) TREE_STRING_LENGTH (ary) / elem_nchars;
-  tree nelts = array_type_nelts (TREE_TYPE (ary));
+  tree nelts_minus_one = array_type_nelts_minus_one (TREE_TYPE (ary));
   bool dummy1 = true, dummy2 = true;
-  nelts = c_fully_fold_internal (nelts, true, &dummy1, &dummy2, false, false);
+  nelts_minus_one = c_fully_fold_internal (nelts_minus_one, true, &dummy1,
+  &dummy2, false, false);
   unsigned HOST_WIDE_INT i = tree_to_uhwi (index);
-  if (!tree_int_cst_le (index, nelts)
+  if (!tree_int_cst_le (index, nelts_minus_one)
   || i >= len
   || i + elem_nchars > len)
   

Re: [PATCH] RISC-V: Expand subreg move via slide if necessary [PR116086].

2024-07-30 Thread Robin Dapp
> > IMO, what ought to happen here is that the RA should spill
> > the inner register to memory and load the V4SI back from there.
> > (Or vice versa, for an lvalue.)  Obviously that's not very efficient,
> > and so a patch like the above might be useful as an optimisation.[*]
> > But it shouldn't be needed for correctness.  The target-independent
> > code should already have the information it needs to realise that
> > it can't predict the register index at compile time (at least for SVE).
>
> Or actually, for that case:
>
>   /* For pseudo registers, we want most of the same checks.  Namely:
>
>  Assume that the pseudo register will be allocated to hard registers
>  that can hold REGSIZE bytes each.  If OSIZE is not a multiple of REGSIZE,
>  the remainder must correspond to the lowpart of the containing hard
>  register.  If BYTES_BIG_ENDIAN, the lowpart is at the highest offset,
>  otherwise it is at the lowest offset.
>
>  Given that we've already checked the mode and offset alignment,
>  we only have to check subblock subregs here.  */
>   if (maybe_lt (osize, regsize)
>   && ! (lra_in_progress && (FLOAT_MODE_P (imode) || FLOAT_MODE_P 
> (omode
> {
>   /* It is invalid for the target to pick a register size for a mode
>that isn't ordered wrt to the size of that mode.  */
>   poly_uint64 block_size = ordered_min (isize, regsize);
>   unsigned int start_reg;
>   poly_uint64 offset_within_reg;
>   if (!can_div_trunc_p (offset, block_size, &start_reg, 
> &offset_within_reg)
>   ...
>

Like aarch64 we set REGMODE_NATURAL_SIZE for fixed-size modes to
UNITS_PER_WORD.  Isn't that part of the problem?

In extract_bit_field_as_subreg we check lowpart_bit_field_p (= true because
128 is a multiple of UNITS_PER_WORD).  This leads to the subreg expression.

If I have REGMODE_NATURAL_SIZE return a VLA number this fails and we extract
via memory - but that of course breaks almost everything else :)

When you say the target-independent code should already have all the information
it needs, what are you referring to?  Something other than REGMODE_NATURAL_SIZE?

Thanks.

-- 
Regards
 Robin



[committed] libstdc++: Fix overwriting files with fs::copy_file on Windows

2024-07-30 Thread Jonathan Wakely
I've pushed this for https://github.com/msys2/MSYS2-packages/issues/1937
but I'm taking a slightly different approach to Björn's original patch.

Instead of adding __detail::equivalent_win32 I'm adding fs::equiv_files
to do the check for both POSIX and Windows. The logic in do_copy_file
should be correct here, and fs::copy also needed an update to fix the
case of overwriting existing directories. I've also added tests.

Tested x86_64-linux and x86_64-w64-mingw32. Pushed to trunk. I'll
backport this after some time on trunk.

-- >8 --

There are no inode numbers on Windows filesystems, so stat_type::st_ino
is always zero and the check for equivalent files in do_copy_file was
incorrectly identifying distinct files as equivalent. This caused
copy_file to incorrectly report errors when trying to overwrite existing
files.
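
For example (illustrative only, not one of the new tests), this kind of code
failed on Windows before the fix because two distinct files were wrongly
treated as equivalent (both have st_ino == 0):

  #include <filesystem>
  #include <fstream>
  namespace fs = std::filesystem;

  int main()
  {
    std::ofstream("a.txt") << "first";
    std::ofstream("b.txt") << "second";
    // Previously failed with an error instead of overwriting b.txt.
    fs::copy_file("a.txt", "b.txt", fs::copy_options::overwrite_existing);
  }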

The fs::equivalent function already does the right thing on Windows, so
factor that logic out into a new function that can be reused by
fs::copy_file.

The tests for fs::copy_file were quite inadequate, so this also adds
checks for that function's error conditions.

libstdc++-v3/ChangeLog:

* src/c++17/fs_ops.cc (auto_win_file_handle): Change constructor
parameter from const path& to const wchar_t*.
(fs::equiv_files): New function.
(fs::equivalent): Use equiv_files.
* src/filesystem/ops-common.h (fs::equiv_files): Declare.
(do_copy_file): Use equiv_files.
* src/filesystem/ops.cc (fs::equiv_files): Define.
(fs::copy, fs::equivalent): Use equiv_files.
* testsuite/27_io/filesystem/operations/copy.cc: Test
overwriting directory contents recursively.
* testsuite/27_io/filesystem/operations/copy_file.cc: Test
overwriting existing files.
---
 libstdc++-v3/src/c++17/fs_ops.cc  |  71 ++
 libstdc++-v3/src/filesystem/ops-common.h  |  12 +-
 libstdc++-v3/src/filesystem/ops.cc|  18 ++-
 .../27_io/filesystem/operations/copy.cc   |   9 ++
 .../27_io/filesystem/operations/copy_file.cc  | 122 ++
 5 files changed, 199 insertions(+), 33 deletions(-)

diff --git a/libstdc++-v3/src/c++17/fs_ops.cc b/libstdc++-v3/src/c++17/fs_ops.cc
index 81227c49dfd..7ffdce67782 100644
--- a/libstdc++-v3/src/c++17/fs_ops.cc
+++ b/libstdc++-v3/src/c++17/fs_ops.cc
@@ -350,7 +350,7 @@ fs::copy(const path& from, const path& to, copy_options 
options,
   f = make_file_status(from_st);
 
   if (exists(t) && !is_other(t) && !is_other(f)
-  && to_st.st_dev == from_st.st_dev && to_st.st_ino == from_st.st_ino)
+  && fs::equiv_files(from.c_str(), from_st, to.c_str(), to_st, ec))
 {
   ec = std::make_error_code(std::errc::file_exists);
   return;
@@ -829,8 +829,8 @@ namespace
   struct auto_win_file_handle
   {
 explicit
-auto_win_file_handle(const fs::path& p_)
-: handle(CreateFileW(p_.c_str(), 0,
+auto_win_file_handle(const wchar_t* p)
+: handle(CreateFileW(p, 0,
 FILE_SHARE_DELETE | FILE_SHARE_READ | FILE_SHARE_WRITE,
 0, OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, 0))
 { }
@@ -850,6 +850,44 @@ namespace
 }
 #endif
 
+#ifdef _GLIBCXX_HAVE_SYS_STAT_H
+#ifdef NEED_DO_COPY_FILE // Only define this once, not in cow-ops.o too
+bool
+fs::equiv_files([[maybe_unused]] const char_type* p1, const stat_type& st1,
+   [[maybe_unused]] const char_type* p2, const stat_type& st2,
+   [[maybe_unused]] error_code& ec)
+{
+#if ! _GLIBCXX_FILESYSTEM_IS_WINDOWS
+  // For POSIX the device ID and inode number uniquely identify a file.
+  return st1.st_dev == st2.st_dev && st1.st_ino == st2.st_ino;
+#else
+  // For Windows st_ino is not set, so can't be used to distinguish files.
+  // We can compare modes and device IDs as a cheap initial check:
+  if (st1.st_mode != st2.st_mode || st1.st_dev != st2.st_dev)
+return false;
+
+  // Need to use GetFileInformationByHandle to get more info about the files.
+  auto_win_file_handle h1(p1);
+  auto_win_file_handle h2(p2);
+  if (!h1 || !h2)
+{
+  if (!h1 && !h2)
+   ec = __last_system_error();
+  return false;
+}
+  if (!h1.get_info() || !h2.get_info())
+{
+  ec = __last_system_error();
+  return false;
+}
+  return h1.info.dwVolumeSerialNumber == h2.info.dwVolumeSerialNumber
+  && h1.info.nFileIndexHigh == h2.info.nFileIndexHigh
+  && h1.info.nFileIndexLow == h2.info.nFileIndexLow;
+#endif // _GLIBCXX_FILESYSTEM_IS_WINDOWS
+}
+#endif // NEED_DO_COPY_FILE
+#endif // _GLIBCXX_HAVE_SYS_STAT_H
+
 bool
 fs::equivalent(const path& p1, const path& p2, error_code& ec) noexcept
 {
@@ -881,30 +919,7 @@ fs::equivalent(const path& p1, const path& p2, error_code& 
ec) noexcept
   ec.clear();
   if (is_other(s1) || is_other(s2))
return false;
-#if _GLIBCXX_FILESYSTEM_IS_WINDOWS
-  // st_ino is not set, so can't be used to distinguish files
-  if (st1.st_mode != st2.st_mode || st1.st_dev !=

[Committed] RISC-V: Remove configure check for zabha

2024-07-30 Thread Patrick O'Neill

Committed with spaces -> tabs ChangeLog fix.

Patrick

On 7/29/24 20:27, Kito Cheng wrote:

LGTM, thanks :)

On Tue, Jul 30, 2024 at 10:53 AM Patrick O'Neill  wrote:

This patch removes the zabha configure check since it's not a breaking change
and updates the existing zaamo/zalrsc comment.

gcc/ChangeLog:

 * common/config/riscv/riscv-common.cc
   (riscv_subset_list::to_string): Remove zabha configure check
   handling and clarify zaamo/zalrsc comment.
 * config.in: Regenerate.
 * configure: Regenerate.
 * configure.ac: Remove zabha configure check.

Signed-off-by: Patrick O'Neill 
---
The user has to specify zabha in order for binutils to throw an error.
This is in contrast to zaamo/zalrsc which are expanded from 'a' without being
specified.

Relying on precommit to do testing.
---
  gcc/common/config/riscv/riscv-common.cc | 12 +++---
  gcc/config.in   |  6 -
  gcc/configure   | 31 -
  gcc/configure.ac|  5 
  4 files changed, 3 insertions(+), 51 deletions(-)

diff --git a/gcc/common/config/riscv/riscv-common.cc 
b/gcc/common/config/riscv/riscv-common.cc
index 682826c0e34..d2912877784 100644
--- a/gcc/common/config/riscv/riscv-common.cc
+++ b/gcc/common/config/riscv/riscv-common.cc
@@ -855,7 +855,6 @@ riscv_subset_list::to_string (bool version_p) const

bool skip_zifencei = false;
bool skip_zaamo_zalrsc = false;
-  bool skip_zabha = false;
bool skip_zicsr = false;
bool i2p0 = false;

@@ -884,13 +883,11 @@ riscv_subset_list::to_string (bool version_p) const
skip_zifencei = true;
  #endif
  #ifndef HAVE_AS_MARCH_ZAAMO_ZALRSC
-  /* Skip since binutils 2.42 and earlier don't recognize zaamo/zalrsc.  */
+  /* Skip since binutils 2.42 and earlier don't recognize zaamo/zalrsc.
+ Expanding 'a' to zaamo/zalrsc would otherwise break compilations
+ for users with an older version of binutils.  */
skip_zaamo_zalrsc = true;
  #endif
-#ifndef HAVE_AS_MARCH_ZABHA
-  /* Skip since binutils 2.42 and earlier don't recognize zabha.  */
-  skip_zabha = true;
-#endif

for (subset = m_head; subset != NULL; subset = subset->next)
  {
@@ -908,9 +905,6 @@ riscv_subset_list::to_string (bool version_p) const
if (skip_zaamo_zalrsc && subset->name == "zalrsc")
 continue;

-  if (skip_zabha && subset->name == "zabha")
-   continue;
-
/* For !version_p, we only separate extension with underline for
  multi-letter extension.  */
if (!first &&
diff --git a/gcc/config.in b/gcc/config.in
index bc819005bd6..3af153eaec5 100644
--- a/gcc/config.in
+++ b/gcc/config.in
@@ -635,12 +635,6 @@
  #endif


-/* Define if the assembler understands -march=rv*_zabha. */
-#ifndef USED_FOR_TARGET
-#undef HAVE_AS_MARCH_ZABHA
-#endif
-
-
  /* Define if the assembler understands -march=rv*_zifencei. */
  #ifndef USED_FOR_TARGET
  #undef HAVE_AS_MARCH_ZIFENCEI
diff --git a/gcc/configure b/gcc/configure
index 01acca7fb5c..7541bdeb724 100755
--- a/gcc/configure
+++ b/gcc/configure
@@ -30882,37 +30882,6 @@ if test $gcc_cv_as_riscv_march_zaamo_zalrsc = yes; then

  $as_echo "#define HAVE_AS_MARCH_ZAAMO_ZALRSC 1" >>confdefs.h

-fi
-
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking assembler for -march=rv32i_zabha 
support" >&5
-$as_echo_n "checking assembler for -march=rv32i_zabha support... " >&6; }
-if ${gcc_cv_as_riscv_march_zabha+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  gcc_cv_as_riscv_march_zabha=no
-  if test x$gcc_cv_as != x; then
-$as_echo '' > conftest.s
-if { ac_try='$gcc_cv_as $gcc_cv_as_flags -march=rv32i_zabha -o conftest.o 
conftest.s >&5'
-  { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
-  (eval $ac_try) 2>&5
-  ac_status=$?
-  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
-  test $ac_status = 0; }; }
-then
-   gcc_cv_as_riscv_march_zabha=yes
-else
-  echo "configure: failed program was" >&5
-  cat conftest.s >&5
-fi
-rm -f conftest.o conftest.s
-  fi
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $gcc_cv_as_riscv_march_zabha" 
>&5
-$as_echo "$gcc_cv_as_riscv_march_zabha" >&6; }
-if test $gcc_cv_as_riscv_march_zabha = yes; then
-
-$as_echo "#define HAVE_AS_MARCH_ZABHA 1" >>confdefs.h
-
  fi

  ;;
diff --git a/gcc/configure.ac b/gcc/configure.ac
index 3f20c107b6a..52c1780379d 100644
--- a/gcc/configure.ac
+++ b/gcc/configure.ac
@@ -5461,11 +5461,6 @@ configured with --enable-newlib-nano-formatted-io.])
[-march=rv32i_zaamo_zalrsc],,,
[AC_DEFINE(HAVE_AS_MARCH_ZAAMO_ZALRSC, 1,
  [Define if the assembler understands 
-march=rv*_zaamo_zalrsc.])])
-gcc_GAS_CHECK_FEATURE([-march=rv32i_zabha support],
-  gcc_cv_as_riscv_march_zabha,
-  [-march=rv32i_zabha],,,
-  [AC_DEFINE(HAVE_AS_MARCH_ZABHA, 1,
-[Define if the assembler understands -march=rv*_zabha.]

Re: [PATCH ver 2] rs6000, Add new overloaded vector shift builtin int128, varients

2024-07-30 Thread Carl Love



Peter, Kewen:

Per Peter's request, I did the following testing on ltcd97-lp7 which is 
a Power 10 running in BE mode.


On 7/29/24 8:47 AM, Peter Bergner wrote:

Maybe the following will work?

+/* { dg-do run  { target power10_hw } } */
+/* { dg-do link { target { ! power10_hw } } } */
+/* { dg-require-effective-target int128 } */
...

Carl, can you try testing the above change on ltcd97-lp7 and run the test
in both 32-bit and 64-bit modes?

I tested with the above specification and -m64 and I get

    # of expected passes    8

I tested the above specification with -m32


/home/carll/GCC/gcc-steve/gcc/testsuite/gcc.target/powerpc/vec-shift-double-runnable-int128.c:390:346: warning: overflow in conversion from 'long long int' to 'int' changes value from '8526495043095935640' to '-19088744' [-Woverflow]

/home/carll/GCC/gcc-steve/gcc/testsuite/gcc.target/powerpc/vec-shift-double-runnable-int128.c:394:60: error: '__int128' is not supported on this target

FAIL: gcc.target/powerpc/vec-shift-double-runnable-int128.c (test for excess errors)
gcc.target/powerpc/vec-shift-double-runnable-int128.c: \\mvsrdbi\\M found 0 times
FAIL: gcc.target/powerpc/vec-shift-double-runnable-int128.c scan-assembler-times \\mvsrdbi\\M 2
gcc.target/powerpc/vec-shift-double-runnable-int128.c: \\mvsldbi\\M found 0 times
FAIL: gcc.target/powerpc/vec-shift-double-runnable-int128.c scan-assembler-times \\mvsldbi\\M 2
gcc.target/powerpc/vec-shift-double-runnable-int128.c: \\mvsl\\M found 0 times
FAIL: gcc.target/powerpc/vec-shift-double-runnable-int128.c scan-assembler-times \\mvsl\\M 2
gcc.target/powerpc/vec-shift-double-runnable-int128.c: \\mvsr\\M found 0 times
FAIL: gcc.target/powerpc/vec-shift-double-runnable-int128.c scan-assembler-times \\mvsr\\M 2
gcc.target/powerpc/vec-shift-double-runnable-int128.c: \\mvslo\\M found 0 times
FAIL: gcc.target/powerpc/vec-shift-double-runnable-int128.c scan-assembler-times \\mvslo\\M 4
gcc.target/powerpc/vec-shift-double-runnable-int128.c: \\mvsro\\M found 0 times
FAIL: gcc.target/powerpc/vec-shift-double-runnable-int128.c scan-assembler-times \\mvsro\\M 4


# of unexpected failures    7

Basically, the header is not detecting the int128.

But if I put the int128 in the dg-do run line, like vsc-buildin-20d.c

/* { dg-do run  { target { power10_hw }  && { int128 } } } */
/* { dg-do link { target { ! power10_hw } } } */
/* { dg-require-effective-target vsx_hw  } */

I get the following with -m32:

# of unsupported tests  1


Per the comments from Kewen:

On 7/29/24 7:27 PM, Kewen.Lin wrote:

Maybe the following will work?

+/* { dg-do run  { target power10_hw } } */
+/* { dg-do link { target { ! power10_hw } } } */

Maybe we can replace link by compile here, as we care about compilation and
execution result more here.  (IMHO if it's still "link", power10_ok is useful
to stop this being tested on an environment with an assembler not supporting
power10).

BR,
Kewen



I tried, I hope I got it right, with -m32:

/* { dg-do run  { target power10_hw } } */
/* { dg-do compile  { target { ! power10_hw } } } */
/* { dg-require-effective-target int128 } */

This gives:

# of unsupported tests  1

The same header with -m64 I get:

# of expected passes    8

This header seems to give us what we want on Power10 BE with -m32 and 
-m64 (tested on ltcd97-lp7).


 Carl




[PATCH 1/2] match: Fix types matching for `(?:) !=/== (?:)` [PR116134]

2024-07-30 Thread Andrew Pinski
The problem here is that in GENERIC the types of comparisons don't need
to be boolean types (or vector boolean types).  This patch fixes that by
making sure the types of the conditions match before doing the optimization.

Bootstrapped and tested on x86_64-linux-gnu with no regressions.

PR middle-end/116134

gcc/ChangeLog:

* match.pd (`(a ? x : y) eq/ne (b ? x : y)`): Check that
a and b types match.
(`(a ? x : y) eq/ne (b ? y : x)`): Likewise.

gcc/testsuite/ChangeLog:

* gcc.dg/torture/pr116134-1.c: New test.

Signed-off-by: Andrew Pinski 
---
 gcc/match.pd  | 10 ++
 gcc/testsuite/gcc.dg/torture/pr116134-1.c |  9 +
 2 files changed, 15 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/torture/pr116134-1.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 1c8601229e3..881a827860f 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -5640,12 +5640,14 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (for eqne (eq ne)
   (simplify
(eqne:c (cnd @0 @1 @2) (cnd @3 @1 @2))
-(cnd (bit_xor @0 @3) { constant_boolean_node (eqne == NE_EXPR, type); }
- { constant_boolean_node (eqne != NE_EXPR, type); }))
+(if (types_match (TREE_TYPE (@0), TREE_TYPE (@3)))
+ (cnd (bit_xor @0 @3) { constant_boolean_node (eqne == NE_EXPR, type); }
+  { constant_boolean_node (eqne != NE_EXPR, type); })))
   (simplify
(eqne:c (cnd @0 @1 @2) (cnd @3 @2 @1))
-(cnd (bit_xor @0 @3) { constant_boolean_node (eqne != NE_EXPR, type); }
- { constant_boolean_node (eqne == NE_EXPR, type); }
+(if (types_match (TREE_TYPE (@0), TREE_TYPE (@3)))
+ (cnd (bit_xor @0 @3) { constant_boolean_node (eqne != NE_EXPR, type); }
+  { constant_boolean_node (eqne == NE_EXPR, type); })
 
 /* Canonicalize mask ? { 0, ... } : { -1, ...} to ~mask if the mask
types are compatible.  */
diff --git a/gcc/testsuite/gcc.dg/torture/pr116134-1.c 
b/gcc/testsuite/gcc.dg/torture/pr116134-1.c
new file mode 100644
index 000..ab595f99680
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr116134-1.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+
+/* This used to ICE as comparisons on generic can be different types. */
+/* PR middle-end/116134  */
+
+int a;
+int b;
+int d;
+void c() { 1UL <= (d < b) != (1UL & (0 < a | 0L)); }
-- 
2.43.0



[PATCH 2/2] match: Fix wrong code due to `(a ? e : f) !=/== (b ? e : f)` patterns [PR116120]

2024-07-30 Thread Andrew Pinski
When this pattern was converted from dealing only with 0/-1, we missed that
if `e == f` is true then the optimization is wrong and needs an extra check
for that.
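
For illustration only (not part of the patch), a minimal case where the old
fold goes wrong, assuming plain scalar ints:

  int f (int a, int b, int c, int d, int e, int f)
  {
    int x = (a == b) ? e : f;
    int y = (c == d) ? e : f;
    return x != y;  /* If e == f, this must be 0 whatever a,b,c,d are.  */
  }

The old pattern folded the return value to ((a == b) ^ (c == d)), which is
wrong whenever e == f; the new patterns below keep the (x != y) factor.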

This changes the patterns to be:
/* (a ? x : y) != (b ? x : y) --> (a^b & (x != y)) ? TRUE  : FALSE */
/* (a ? x : y) == (b ? x : y) --> (a^b & (x != y)) ? FALSE : TRUE  */
/* (a ? x : y) != (b ? y : x) --> (a^b | (x == y)) ? FALSE : TRUE  */
/* (a ? x : y) == (b ? y : x) --> (a^b | (x == y)) ? TRUE  : FALSE */

This still produces better code than the original case and in many cases
(x != y) will still reduce to either false or true.

With this change we also need to make sure `a`, `b` and the resulting types
are all the same for the same reason as the previous patch.

I updated (well, added to) the testcases to make sure the right number of
comparisons are left.

Bootstrapped and tested on x86_64-linux-gnu with no regressions.

PR tree-optimization/116120

gcc/ChangeLog:

* match.pd (`(a ? x : y) eq/ne (b ? x : y)`): Add test for `x != y`
in result.
(`(a ? x : y) eq/ne (b ? y : x)`): Add test for `x == y` in result.

gcc/testsuite/ChangeLog:

* g++.dg/tree-ssa/pr50.C: Add extra checks on the test.
* gcc.dg/tree-ssa/pr50-1.c: Likewise.
* gcc.dg/tree-ssa/pr50.c: Likewise.
* g++.dg/torture/pr116120-1.c: New test.
* g++.dg/torture/pr116120-2.c: New test.

Signed-off-by: Andrew Pinski 
---
 gcc/match.pd   | 20 -
 gcc/testsuite/g++.dg/torture/pr116120-1.c  | 32 
 gcc/testsuite/g++.dg/torture/pr116120-2.c  | 35 ++
 gcc/testsuite/g++.dg/tree-ssa/pr50.C   | 10 +++
 gcc/testsuite/gcc.dg/tree-ssa/pr50-1.c |  9 ++
 gcc/testsuite/gcc.dg/tree-ssa/pr50.c   |  1 +
 6 files changed, 99 insertions(+), 8 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/torture/pr116120-1.c
 create mode 100644 gcc/testsuite/g++.dg/torture/pr116120-2.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 881a827860f..4d3ee578371 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -5632,21 +5632,25 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (vec_cond (bit_and (bit_not @0) @1) @2 @3)))
 #endif
 
-/* (a ? x : y) != (b ? x : y) --> (a^b) ? TRUE  : FALSE */
-/* (a ? x : y) == (b ? x : y) --> (a^b) ? FALSE : TRUE  */
-/* (a ? x : y) != (b ? y : x) --> (a^b) ? FALSE : TRUE  */
-/* (a ? x : y) == (b ? y : x) --> (a^b) ? TRUE  : FALSE */
+/* (a ? x : y) != (b ? x : y) --> (a^b & (x != y)) ? TRUE  : FALSE */
+/* (a ? x : y) == (b ? x : y) --> (a^b & (x != y)) ? FALSE : TRUE  */
+/* (a ? x : y) != (b ? y : x) --> (a^b | (x == y)) ? FALSE : TRUE  */
+/* (a ? x : y) == (b ? y : x) --> (a^b | (x == y)) ? TRUE  : FALSE */
 (for cnd (cond vec_cond)
  (for eqne (eq ne)
   (simplify
(eqne:c (cnd @0 @1 @2) (cnd @3 @1 @2))
-(if (types_match (TREE_TYPE (@0), TREE_TYPE (@3)))
- (cnd (bit_xor @0 @3) { constant_boolean_node (eqne == NE_EXPR, type); }
+(if (types_match (TREE_TYPE (@0), TREE_TYPE (@3))
+ && types_match (type, TREE_TYPE (@0)))
+ (cnd (bit_and (bit_xor @0 @3) (ne:type @1 @2))
+  { constant_boolean_node (eqne == NE_EXPR, type); }
   { constant_boolean_node (eqne != NE_EXPR, type); })))
   (simplify
(eqne:c (cnd @0 @1 @2) (cnd @3 @2 @1))
-(if (types_match (TREE_TYPE (@0), TREE_TYPE (@3)))
- (cnd (bit_xor @0 @3) { constant_boolean_node (eqne != NE_EXPR, type); }
+(if (types_match (TREE_TYPE (@0), TREE_TYPE (@3))
+ && types_match (type, TREE_TYPE (@0)))
+ (cnd (bit_ior (bit_xor @0 @3) (eq:type @1 @2))
+  { constant_boolean_node (eqne != NE_EXPR, type); }
   { constant_boolean_node (eqne == NE_EXPR, type); })
 
 /* Canonicalize mask ? { 0, ... } : { -1, ...} to ~mask if the mask
diff --git a/gcc/testsuite/g++.dg/torture/pr116120-1.c 
b/gcc/testsuite/g++.dg/torture/pr116120-1.c
new file mode 100644
index 000..cffb7fbdc5b
--- /dev/null
+++ b/gcc/testsuite/g++.dg/torture/pr116120-1.c
@@ -0,0 +1,32 @@
+// { dg-run }
+// PR tree-optimization/116120
+
+// The optimization for `(a ? x : y) != (b ? x : y)`
+// missed that x and y could be the same value.
+
+typedef int v4si __attribute((__vector_size__(1 * sizeof(int;
+v4si f1(v4si a, v4si b, v4si c, v4si d, v4si e, v4si f) {
+  v4si X = a == b ? e : f;
+  v4si Y = c == d ? e : f;
+  return (X != Y); // ~(X == Y ? -1 : 0) (x ^ Y)
+}
+
+int f2(int a, int b, int c, int d, int e, int f) {
+  int X = a == b ? e : f;
+  int Y = c == d ? e : f;
+  return (X != Y) ? -1 : 0; // ~(X == Y ? -1 : 0) (x ^ Y)
+}
+
+int main()
+{
+  v4si a = {0};
+  v4si b = {0}; // a == b, true
+  v4si c = {2};
+  v4si d = {3}; // c == b, false
+  v4si e = {0};
+  v4si f = e;
+  v4si r = f1(a,b,c,d,e, f);
+  int r1 = f2(a[0], b[0], c[0], d[0], e[0], f[0]);
+  if (r[0] != r1)
+__builtin_abort();
+}
diff --git a/gcc/testsuite/g++.dg/torture/pr116120-2.c 
b/gcc/testsuite/g++.dg/torture/pr116120-2.c
new fil

[Committed] RISC-V: Add basic support for the Zacas extension

2024-07-30 Thread Patrick O'Neill
From: Gianluca Guida 

This patch adds support for amocas.{b|h|w|d}. Support for amocas.q
(64/128 bit cas for rv32/64) will be added in a future patch.

Extension: https://github.com/riscv/riscv-zacas
Ratification: https://jira.riscv.org/browse/RVS-680
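
For reference (not part of the patch), the kind of source that should now be
able to use a single amocas.w instead of an lr.w/sc.w retry loop, e.g. when
compiling with something like -march=rv64gc_zacas (flag spelling assumed here):

  #include <stdbool.h>

  bool
  cas_int (int *p, int expected, int desired)
  {
    return __atomic_compare_exchange_n (p, &expected, desired, 0,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
  }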

gcc/ChangeLog:

* common/config/riscv/riscv-common.cc: Add zacas extension.
* config/riscv/arch-canonicalize: Make zacas imply zaamo.
* config/riscv/riscv.opt: Add zacas.
* config/riscv/sync.md (zacas_atomic_cas_value): New pattern.
(atomic_compare_and_swap): Use new pattern for compare-and-swap 
ops.
(zalrsc_atomic_cas_value_strong): Rename atomic_cas_value_strong.
* doc/sourcebuild.texi: Add Zacas documentation.

gcc/testsuite/ChangeLog:

* lib/target-supports.exp: Add zacas testsuite infra support.
* 
gcc.target/riscv/amo/zalrsc-rvwmo-compare-exchange-int-acquire-release.c:
Remove zacas to continue to test the lr/sc pairs.
* gcc.target/riscv/amo/zalrsc-rvwmo-compare-exchange-int-acquire.c: 
Ditto.
* gcc.target/riscv/amo/zalrsc-rvwmo-compare-exchange-int-consume.c: 
Ditto.
* gcc.target/riscv/amo/zalrsc-rvwmo-compare-exchange-int-relaxed.c: 
Ditto.
* gcc.target/riscv/amo/zalrsc-rvwmo-compare-exchange-int-release.c: 
Ditto.
* 
gcc.target/riscv/amo/zalrsc-rvwmo-compare-exchange-int-seq-cst-relaxed.c: Ditto.
* gcc.target/riscv/amo/zalrsc-rvwmo-compare-exchange-int-seq-cst.c: 
Ditto.
* 
gcc.target/riscv/amo/zalrsc-ztso-compare-exchange-int-acquire-release.c: Ditto.
* gcc.target/riscv/amo/zalrsc-ztso-compare-exchange-int-acquire.c: 
Ditto.
* gcc.target/riscv/amo/zalrsc-ztso-compare-exchange-int-consume.c: 
Ditto.
* gcc.target/riscv/amo/zalrsc-ztso-compare-exchange-int-relaxed.c: 
Ditto.
* gcc.target/riscv/amo/zalrsc-ztso-compare-exchange-int-release.c: 
Ditto.
* 
gcc.target/riscv/amo/zalrsc-ztso-compare-exchange-int-seq-cst-relaxed.c: Ditto.
* gcc.target/riscv/amo/zalrsc-ztso-compare-exchange-int-seq-cst.c: 
Ditto.
* gcc.target/riscv/amo/zabha-zacas-preferred-over-zalrsc.c: New test.
* gcc.target/riscv/amo/zacas-char-requires-zabha.c: New test.
* gcc.target/riscv/amo/zacas-char-requires-zacas.c: New test.
* gcc.target/riscv/amo/zacas-preferred-over-zalrsc.c: New test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-char-acq-rel.c: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-char-acquire.c: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-char-relaxed.c: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-char-release.c: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-char-seq-cst.c: New 
test.
* 
gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-compatability-mapping-no-fence.c:
New test.
* 
gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-compatability-mapping.cc: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-int-acq-rel.c: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-int-acquire.c: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-int-relaxed.c: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-int-release.c: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-int-seq-cst.c: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-short-acq-rel.c: 
New test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-short-acquire.c: 
New test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-short-relaxed.c: 
New test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-short-release.c: 
New test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-short-seq-cst.c: 
New test.
* gcc.target/riscv/amo/zacas-ztso-compare-exchange-char-seq-cst.c: New 
test.
* gcc.target/riscv/amo/zacas-ztso-compare-exchange-char.c: New test.
* 
gcc.target/riscv/amo/zacas-ztso-compare-exchange-compatability-mapping-no-fence.c:
New test.
* 
gcc.target/riscv/amo/zacas-ztso-compare-exchange-compatability-mapping.cc: New 
test.
* gcc.target/riscv/amo/zacas-ztso-compare-exchange-int-seq-cst.c: New 
test.
* gcc.target/riscv/amo/zacas-ztso-compare-exchange-int.c: New test.
* gcc.target/riscv/amo/zacas-ztso-compare-exchange-short-seq-cst.c: New 
test.
* gcc.target/riscv/amo/zacas-ztso-compare-exchange-short.c: New test.

Co-authored-by: Patrick O'Neill 
Tested-by: Andrea Parri 
Signed-Off-By: Gianluca Guida 
---
Added missing riscv-common.cc changelog entry and Gianluca Guida's sign-off
that he gave.
---
 gcc/common/config/riscv/riscv-common.cc   |   3 +
 gcc/config/riscv/arch-canonicalize|   1 +
 gcc/config/riscv/riscv.opt|   2 +
 gcc/config/riscv/sync.md  | 111 

[Committed] RISC-V: Add basic support for the Zacas extension

2024-07-30 Thread Patrick O'Neill
Committed w/changelog fixup/sign-off and sent final version to the lists 
here:

https://inbox.sourceware.org/gcc-patches/20240730152448.4089002-1-patr...@rivosinc.com/T/#u

Approved during risc-v patchworks meeting by Jeff Law.

Patrick

On 7/29/24 15:13, Patrick O'Neill wrote:

From: Gianluca Guida 

This patch adds support for amocas.{b|h|w|d}. Support for amocas.q
(64/128 bit cas for rv32/64) will be added in a future patch.

Extension: https://github.com/riscv/riscv-zacas
Ratification: https://jira.riscv.org/browse/RVS-680

gcc/ChangeLog:

* config/riscv/arch-canonicalize: Make zacas imply zaamo.
* config/riscv/riscv.opt: Add zacas.
* config/riscv/sync.md (zacas_atomic_cas_value): New pattern.
(atomic_compare_and_swap): Use new pattern for compare-and-swap 
ops.
(zalrsc_atomic_cas_value_strong): Rename atomic_cas_value_strong.
* doc/sourcebuild.texi: Add Zacas documentation.

gcc/testsuite/ChangeLog:

* lib/target-supports.exp: Add zacas testsuite infra support.
* 
gcc.target/riscv/amo/zalrsc-rvwmo-compare-exchange-int-acquire-release.c:
Remove zacas to continue to test the lr/sc pairs.
* gcc.target/riscv/amo/zalrsc-rvwmo-compare-exchange-int-acquire.c: 
Ditto.
* gcc.target/riscv/amo/zalrsc-rvwmo-compare-exchange-int-consume.c: 
Ditto.
* gcc.target/riscv/amo/zalrsc-rvwmo-compare-exchange-int-relaxed.c: 
Ditto.
* gcc.target/riscv/amo/zalrsc-rvwmo-compare-exchange-int-release.c: 
Ditto.
* 
gcc.target/riscv/amo/zalrsc-rvwmo-compare-exchange-int-seq-cst-relaxed.c: Ditto.
* gcc.target/riscv/amo/zalrsc-rvwmo-compare-exchange-int-seq-cst.c: 
Ditto.
* 
gcc.target/riscv/amo/zalrsc-ztso-compare-exchange-int-acquire-release.c: Ditto.
* gcc.target/riscv/amo/zalrsc-ztso-compare-exchange-int-acquire.c: 
Ditto.
* gcc.target/riscv/amo/zalrsc-ztso-compare-exchange-int-consume.c: 
Ditto.
* gcc.target/riscv/amo/zalrsc-ztso-compare-exchange-int-relaxed.c: 
Ditto.
* gcc.target/riscv/amo/zalrsc-ztso-compare-exchange-int-release.c: 
Ditto.
* 
gcc.target/riscv/amo/zalrsc-ztso-compare-exchange-int-seq-cst-relaxed.c: Ditto.
* gcc.target/riscv/amo/zalrsc-ztso-compare-exchange-int-seq-cst.c: 
Ditto.
* gcc.target/riscv/amo/zabha-zacas-preferred-over-zalrsc.c: New test.
* gcc.target/riscv/amo/zacas-char-requires-zabha.c: New test.
* gcc.target/riscv/amo/zacas-char-requires-zacas.c: New test.
* gcc.target/riscv/amo/zacas-preferred-over-zalrsc.c: New test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-char-acq-rel.c: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-char-acquire.c: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-char-relaxed.c: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-char-release.c: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-char-seq-cst.c: New 
test.
* 
gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-compatability-mapping-no-fence.c:
New test.
* 
gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-compatability-mapping.cc: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-int-acq-rel.c: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-int-acquire.c: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-int-relaxed.c: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-int-release.c: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-int-seq-cst.c: New 
test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-short-acq-rel.c: 
New test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-short-acquire.c: 
New test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-short-relaxed.c: 
New test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-short-release.c: 
New test.
* gcc.target/riscv/amo/zacas-rvwmo-compare-exchange-short-seq-cst.c: 
New test.
* gcc.target/riscv/amo/zacas-ztso-compare-exchange-char-seq-cst.c: New 
test.
* gcc.target/riscv/amo/zacas-ztso-compare-exchange-char.c: New test.
* 
gcc.target/riscv/amo/zacas-ztso-compare-exchange-compatability-mapping-no-fence.c:
New test.
* 
gcc.target/riscv/amo/zacas-ztso-compare-exchange-compatability-mapping.cc: New 
test.
* gcc.target/riscv/amo/zacas-ztso-compare-exchange-int-seq-cst.c: New 
test.
* gcc.target/riscv/amo/zacas-ztso-compare-exchange-int.c: New test.
* gcc.target/riscv/amo/zacas-ztso-compare-exchange-short-seq-cst.c: New 
test.
* gcc.target/riscv/amo/zacas-ztso-compare-exchange-short.c: New test.

Co-authored-by: Patrick O'Neill 
Tested-by: Andrea Parri 
---
V3 Changelog:
* Make insn lengths dynamic to account for leading fence.
* Remove config-check for old binutils versions.

Tested locally with 

[PATCH 2/2] Add AVX2 code path to lexer

2024-07-30 Thread Andi Kleen
From: Andi Kleen 

AVX2 is widely available on x86 and makes it possible to do the scanner line
check 32 bytes at a time. The code is similar to the SSE2 code path, just
using AVX2 and 32 bytes at a time instead of SSE2's 16 bytes.

Also adjust the code to allow inlining when the compiler
is built for an AVX2 host, following what other architectures
do.

I see about a ~0.6% compile time improvement for compiling i386
insn-recog.i with -O0.
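
For readers without the full diff handy, a rough standalone sketch of the
same 32-bytes-at-a-time scan using AVX2 intrinsics (not the patch's actual
code, which follows the existing GCC-builtin style and runtime dispatch):

  #include <stdint.h>
  #include <immintrin.h>

  /* Find the first '\n', '\r', '\\' or '?' at or after S.  Assumes the
     buffer is padded so aligned 32-byte loads past the end cannot fault,
     as the existing libcpp scanners already rely on.  Compile with -mavx2
     or an equivalent target attribute.  */
  static const unsigned char *
  scan_line_avx2 (const unsigned char *s)
  {
    const __m256i nl = _mm256_set1_epi8 ('\n');
    const __m256i cr = _mm256_set1_epi8 ('\r');
    const __m256i bs = _mm256_set1_epi8 ('\\');
    const __m256i qm = _mm256_set1_epi8 ('?');

    unsigned misalign = (uintptr_t) s & 31;
    const __m256i *p = (const __m256i *) ((uintptr_t) s & -32);
    unsigned mask = -1u << misalign;  /* Ignore bytes before S in block 1.  */

    for (;;)
      {
        __m256i data = _mm256_load_si256 (p);
        __m256i t = _mm256_or_si256 (_mm256_cmpeq_epi8 (data, nl),
                                     _mm256_cmpeq_epi8 (data, cr));
        t = _mm256_or_si256 (t, _mm256_cmpeq_epi8 (data, bs));
        t = _mm256_or_si256 (t, _mm256_cmpeq_epi8 (data, qm));
        unsigned found = _mm256_movemask_epi8 (t) & mask;
        if (found)
          return (const unsigned char *) p + __builtin_ctz (found);
        ++p;
        mask = -1u;
      }
  }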

libcpp/ChangeLog:

* config.in (HAVE_AVX2): Add.
* configure: Regenerate.
* configure.ac: Add HAVE_AVX2 check.
* lex.cc (repl_chars): Extend to 32 bytes.
(search_line_avx2): New function to scan line using AVX2.
(init_vectorized_lexer): Check for AVX2 in CPUID.
---
 libcpp/config.in|  3 ++
 libcpp/configure| 17 +
 libcpp/configure.ac |  3 ++
 libcpp/lex.cc   | 91 +++--
 4 files changed, 110 insertions(+), 4 deletions(-)

diff --git a/libcpp/config.in b/libcpp/config.in
index 253ef03a3dea..8fad6bd4b4f5 100644
--- a/libcpp/config.in
+++ b/libcpp/config.in
@@ -213,6 +213,9 @@
 /* Define to 1 if you can assemble SSE4 insns. */
 #undef HAVE_SSE4
 
+/* Define to 1 if you can assemble AVX2 insns. */
+#undef HAVE_AVX2
+
 /* Define to 1 if you have the  header file. */
 #undef HAVE_STDDEF_H
 
diff --git a/libcpp/configure b/libcpp/configure
index 32d6aaa30699..6d9286ac9601 100755
--- a/libcpp/configure
+++ b/libcpp/configure
@@ -9149,6 +9149,23 @@ if ac_fn_c_try_compile "$LINENO"; then :
 
 $as_echo "#define HAVE_SSE4 1" >>confdefs.h
 
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+asm ("vpcmpeqb %%ymm0, %%ymm4, %%ymm5" : : "i"(0))
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+
+$as_echo "#define HAVE_AVX2 1" >>confdefs.h
+
 fi
 rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
 esac
diff --git a/libcpp/configure.ac b/libcpp/configure.ac
index b883fec776fe..c06609827924 100644
--- a/libcpp/configure.ac
+++ b/libcpp/configure.ac
@@ -200,6 +200,9 @@ case $target in
 AC_TRY_COMPILE([], [asm ("pcmpestri %0, %%xmm0, %%xmm1" : : "i"(0))],
   [AC_DEFINE([HAVE_SSE4], [1],
 [Define to 1 if you can assemble SSE4 insns.])])
+AC_TRY_COMPILE([], [asm ("vpcmpeqb %%ymm0, %%ymm4, %%ymm5" : : "i"(0))],
+  [AC_DEFINE([HAVE_AVX2], [1],
+[Define to 1 if you can assemble AVX2 insns.])])
 esac
 
 # Enable --enable-host-shared.
diff --git a/libcpp/lex.cc b/libcpp/lex.cc
index 1591dcdf151a..72f3402aac99 100644
--- a/libcpp/lex.cc
+++ b/libcpp/lex.cc
@@ -278,19 +278,31 @@ search_line_acc_char (const uchar *s, const uchar *end 
ATTRIBUTE_UNUSED)
 /* Replicated character data to be shared between implementations.
Recall that outside of a context with vector support we can't
define compatible vector types, therefore these are all defined
-   in terms of raw characters.  */
-static const char repl_chars[4][16] __attribute__((aligned(16))) = {
+   in terms of raw characters.
+   gcc constant propagates this and usually turns it into a
+   vector broadcast, so it actually disappears.  */
+
+static const char repl_chars[4][32] __attribute__((aligned(32))) = {
   { '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n',
+'\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n',
+'\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n',
 '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n' },
   { '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r',
+'\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r',
+'\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r',
 '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r' },
   { '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\',
+'\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\',
+'\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\',
 '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\' },
   { '?', '?', '?', '?', '?', '?', '?', '?',
+'?', '?', '?', '?', '?', '?', '?', '?',
+'?', '?', '?', '?', '?', '?', '?', '?',
 '?', '?', '?', '?', '?', '?', '?', '?' },
 };
 
 
+#ifndef __AVX2__
 /* A version of the fast scanner using SSE2 vectorized byte compare insns.  */
 
 static const uchar *
@@ -343,8 +355,9 @@ search_line_sse2 (const uchar *s, const uchar *end 
ATTRIBUTE_UNUSED)
   found = __builtin_ctz(found);
   return (const uchar *)p + found;
 }
+#endif
 
-#ifdef HAVE_SSE4
+#if defined(HAVE_SSE4) && !defined(__AVX2__)
 /* A version of the fast scanner using SSE 4.2 vectorized string insns.  */
 
 static const uchar *
@@ -425,6 +438,71 @@ search_line_sse42 (const uchar *s, const uchar *end)
 #define search_line_sse42 search_line_sse2
 #endif
 
+#ifdef HAVE_AVX2
+
+/* A version of the fast scanner using AVX2 vectorized byte compare insns.  */
+
+static const uchar *
+#ifndef __AVX2__
+__attribute__((__target__("avx2")))
+#endif
+search_line_avx2 (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
+{
+  

[PATCH 1/2] Remove MMX code path in lexer

2024-07-30 Thread Andi Kleen
From: Andi Kleen 

Host systems with only MMX and no SSE2 should be really rare now.
Let's remove the MMX code path to keep the number of custom
implementations the same.

The SSE2 code path is also somewhat dubious now (nearly everything
should have SSE 4.2, which is >15 years old now), but the SSE2
code path is used as fallback for others and also apparently
Solaris uses it due to tool chain deficiencies.

libcpp/ChangeLog:

* lex.cc (search_line_mmx): Remove function.
(init_vectorized_lexer): Remove search_line_mmx.
---
 libcpp/lex.cc | 75 ---
 1 file changed, 75 deletions(-)

diff --git a/libcpp/lex.cc b/libcpp/lex.cc
index 16f2c23af1e1..1591dcdf151a 100644
--- a/libcpp/lex.cc
+++ b/libcpp/lex.cc
@@ -290,71 +290,6 @@ static const char repl_chars[4][16] 
__attribute__((aligned(16))) = {
 '?', '?', '?', '?', '?', '?', '?', '?' },
 };
 
-/* A version of the fast scanner using MMX vectorized byte compare insns.
-
-   This uses the PMOVMSKB instruction which was introduced with "MMX2",
-   which was packaged into SSE1; it is also present in the AMD MMX
-   extension.  Mark the function as using "sse" so that we emit a real
-   "emms" instruction, rather than the 3dNOW "femms" instruction.  */
-
-static const uchar *
-#ifndef __SSE__
-__attribute__((__target__("sse")))
-#endif
-search_line_mmx (const uchar *s, const uchar *end ATTRIBUTE_UNUSED)
-{
-  typedef char v8qi __attribute__ ((__vector_size__ (8)));
-  typedef int __m64 __attribute__ ((__vector_size__ (8), __may_alias__));
-
-  const v8qi repl_nl = *(const v8qi *)repl_chars[0];
-  const v8qi repl_cr = *(const v8qi *)repl_chars[1];
-  const v8qi repl_bs = *(const v8qi *)repl_chars[2];
-  const v8qi repl_qm = *(const v8qi *)repl_chars[3];
-
-  unsigned int misalign, found, mask;
-  const v8qi *p;
-  v8qi data, t, c;
-
-  /* Align the source pointer.  While MMX doesn't generate unaligned data
- faults, this allows us to safely scan to the end of the buffer without
- reading beyond the end of the last page.  */
-  misalign = (uintptr_t)s & 7;
-  p = (const v8qi *)((uintptr_t)s & -8);
-  data = *p;
-
-  /* Create a mask for the bytes that are valid within the first
- 16-byte block.  The Idea here is that the AND with the mask
- within the loop is "free", since we need some AND or TEST
- insn in order to set the flags for the branch anyway.  */
-  mask = -1u << misalign;
-
-  /* Main loop processing 8 bytes at a time.  */
-  goto start;
-  do
-{
-  data = *++p;
-  mask = -1;
-
-start:
-  t = __builtin_ia32_pcmpeqb(data, repl_nl);
-  c = __builtin_ia32_pcmpeqb(data, repl_cr);
-  t = (v8qi) __builtin_ia32_por ((__m64)t, (__m64)c);
-  c = __builtin_ia32_pcmpeqb(data, repl_bs);
-  t = (v8qi) __builtin_ia32_por ((__m64)t, (__m64)c);
-  c = __builtin_ia32_pcmpeqb(data, repl_qm);
-  t = (v8qi) __builtin_ia32_por ((__m64)t, (__m64)c);
-  found = __builtin_ia32_pmovmskb (t);
-  found &= mask;
-}
-  while (!found);
-
-  __builtin_ia32_emms ();
-
-  /* FOUND contains 1 in bits for which we matched a relevant
- character.  Conversion to the byte index is trivial.  */
-  found = __builtin_ctz(found);
-  return (const uchar *)p + found;
-}
 
 /* A version of the fast scanner using SSE2 vectorized byte compare insns.  */
 
@@ -509,8 +444,6 @@ init_vectorized_lexer (void)
   minimum = 3;
 #elif defined(__SSE2__)
   minimum = 2;
-#elif defined(__SSE__)
-  minimum = 1;
 #endif
 
   if (minimum == 3)
@@ -521,14 +454,6 @@ init_vectorized_lexer (void)
 impl = search_line_sse42;
   else if (minimum == 2 || (edx & bit_SSE2))
impl = search_line_sse2;
-  else if (minimum == 1 || (edx & bit_SSE))
-   impl = search_line_mmx;
-}
-  else if (__get_cpuid (0x8001, &dummy, &dummy, &dummy, &edx))
-{
-  if (minimum == 1
- || (edx & (bit_MMXEXT | bit_CMOV)) == (bit_MMXEXT | bit_CMOV))
-   impl = search_line_mmx;
 }
 
   search_line_fast = impl;
-- 
2.45.2



Re: [PATCH 2/2] Add AVX2 code path to lexer

2024-07-30 Thread Andrew Pinski
On Tue, Jul 30, 2024 at 8:43 AM Andi Kleen  wrote:
>
> From: Andi Kleen 
>
> AVX2 is widely available on x86 and it allows to do the scanner line
> check with 32 bytes at a time. The code is similar to the SSE2 code
> path, just using AVX and 32 bytes at a time instead of SSE2 16 bytes.
>
> Also adjust the code to allow inlining when the compiler
> is built for an AVX2 host, following what other architectures
> do.
>
> I see about a ~0.6% compile time improvement for compiling i386
> insn-recog.i with -O0.
>
> libcpp/ChangeLog:
>
> * config.in (HAVE_AVX2): Add.
> * configure: Regenerate.
> * configure.ac: Add HAVE_AVX2 check.
> * lex.cc (repl_chars): Extend to 32 bytes.
> (search_line_avx2): New function to scan line using AVX2.
> (init_vectorized_lexer): Check for AVX2 in CPUID.
> ---
>  libcpp/config.in|  3 ++
>  libcpp/configure| 17 +
>  libcpp/configure.ac |  3 ++
>  libcpp/lex.cc   | 91 +++--
>  4 files changed, 110 insertions(+), 4 deletions(-)
>
> diff --git a/libcpp/config.in b/libcpp/config.in
> index 253ef03a3dea..8fad6bd4b4f5 100644
> --- a/libcpp/config.in
> +++ b/libcpp/config.in
> @@ -213,6 +213,9 @@
>  /* Define to 1 if you can assemble SSE4 insns. */
>  #undef HAVE_SSE4
>
> +/* Define to 1 if you can assemble AVX2 insns. */
> +#undef HAVE_AVX2
> +
>  /* Define to 1 if you have the  header file. */
>  #undef HAVE_STDDEF_H
>
> diff --git a/libcpp/configure b/libcpp/configure
> index 32d6aaa30699..6d9286ac9601 100755
> --- a/libcpp/configure
> +++ b/libcpp/configure
> @@ -9149,6 +9149,23 @@ if ac_fn_c_try_compile "$LINENO"; then :
>
>  $as_echo "#define HAVE_SSE4 1" >>confdefs.h
>
> +fi
> +rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
> +cat confdefs.h - <<_ACEOF >conftest.$ac_ext
> +/* end confdefs.h.  */
> +
> +int
> +main ()
> +{
> +asm ("vpcmpeqb %%ymm0, %%ymm4, %%ymm5" : : "i"(0))
> +  ;
> +  return 0;
> +}
> +_ACEOF
> +if ac_fn_c_try_compile "$LINENO"; then :
> +
> +$as_echo "#define HAVE_AVX2 1" >>confdefs.h
> +
>  fi
>  rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
>  esac
> diff --git a/libcpp/configure.ac b/libcpp/configure.ac
> index b883fec776fe..c06609827924 100644
> --- a/libcpp/configure.ac
> +++ b/libcpp/configure.ac
> @@ -200,6 +200,9 @@ case $target in
>  AC_TRY_COMPILE([], [asm ("pcmpestri %0, %%xmm0, %%xmm1" : : "i"(0))],
>[AC_DEFINE([HAVE_SSE4], [1],
>  [Define to 1 if you can assemble SSE4 insns.])])
> +AC_TRY_COMPILE([], [asm ("vpcmpeqb %%ymm0, %%ymm4, %%ymm5" : : "i"(0))],
> +  [AC_DEFINE([HAVE_AVX2], [1],
> +[Define to 1 if you can assemble AVX2 insns.])])
>  esac
>
>  # Enable --enable-host-shared.
> diff --git a/libcpp/lex.cc b/libcpp/lex.cc
> index 1591dcdf151a..72f3402aac99 100644
> --- a/libcpp/lex.cc
> +++ b/libcpp/lex.cc
> @@ -278,19 +278,31 @@ search_line_acc_char (const uchar *s, const uchar *end 
> ATTRIBUTE_UNUSED)
>  /* Replicated character data to be shared between implementations.
> Recall that outside of a context with vector support we can't
> define compatible vector types, therefore these are all defined
> -   in terms of raw characters.  */
> -static const char repl_chars[4][16] __attribute__((aligned(16))) = {
> +   in terms of raw characters.
> +   gcc constant propagates this and usually turns it into a
> +   vector broadcast, so it actually disappears.  */
> +
> +static const char repl_chars[4][32] __attribute__((aligned(32))) = {
>{ '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n',
> +'\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n',
> +'\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n',
>  '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n' },
>{ '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r',
> +'\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r',
> +'\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r',
>  '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r' },
>{ '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\',
> +'\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\',
> +'\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\',
>  '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\' },
>{ '?', '?', '?', '?', '?', '?', '?', '?',
> +'?', '?', '?', '?', '?', '?', '?', '?',
> +'?', '?', '?', '?', '?', '?', '?', '?',
>  '?', '?', '?', '?', '?', '?', '?', '?' },
>  };
>
>
> +#ifndef __AVX2__
>  /* A version of the fast scanner using SSE2 vectorized byte compare insns.  
> */
>
>  static const uchar *
> @@ -343,8 +355,9 @@ search_line_sse2 (const uchar *s, const uchar *end 
> ATTRIBUTE_UNUSED)
>found = __builtin_ctz(found);
>return (const uchar *)p + found;
>  }
> +#endif
>
> -#ifdef HAVE_SSE4
> +#if defined(HAVE_SSE4) && !defined(__AVX2__)
>  /* A version of the fast scanner using SSE 4.2 vectorized string insns.  */
>
>  static const uchar *
> @@ -425,6 +438,71 @@ search_line_sse42 (const uchar *s, const uchar *end)
>  #define sear

Re: [PATCH 2/2] Add AVX2 code path to lexer

2024-07-30 Thread Andi Kleen
Andrew Pinski  writes:
>
> Using the builtin here seems wrong. Why not use the intrinsic
> _mm256_movemask_epi8 ?

I followed the rest of the vectorized code paths. The original reason was that
there was some incompatibility of the intrinsic header with the source
build. I don't know if it's still true, but I guess it doesn't hurt.

> Also it might make sense to remove the MMX version.

See the previous patch.

-Andi



[COMMITTED PATCH 2/3] testsuite: fix whitespace in dg-do preprocess directive

2024-07-30 Thread Sam James
PR preprocessor/90581
* c-c++-common/cpp/fmax-include-depth.c: Fix whitespace in dg directive.
---
Committed as obvious.

 gcc/testsuite/c-c++-common/cpp/fmax-include-depth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/c-c++-common/cpp/fmax-include-depth.c 
b/gcc/testsuite/c-c++-common/cpp/fmax-include-depth.c
index bd8cc3adcdd7..134c29805c89 100644
--- a/gcc/testsuite/c-c++-common/cpp/fmax-include-depth.c
+++ b/gcc/testsuite/c-c++-common/cpp/fmax-include-depth.c
@@ -1,4 +1,4 @@
-/* { dg-do preprocess} */
+/* { dg-do preprocess } */
 /* { dg-options "-fmax-include-depth=1" } */
 
 #include "fmax-include-depth-1b.h" /* { dg-error ".include nested depth 1 
exceeds maximum of 1 .use -fmax-include-depth=DEPTH to increase the maximum." } 
*/
-- 
2.45.2



[COMMITTED PATCH 3/3] testsuite: fix whitespace in dg-do assemble directive

2024-07-30 Thread Sam James
* gcc.target/aarch64/simd/vmmla.c: Fix whitespace in dg directive.
---
Committed as obvious.

 gcc/testsuite/gcc.target/aarch64/simd/vmmla.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.target/aarch64/simd/vmmla.c 
b/gcc/testsuite/gcc.target/aarch64/simd/vmmla.c
index 5eec2b5cfb96..777decc56a20 100644
--- a/gcc/testsuite/gcc.target/aarch64/simd/vmmla.c
+++ b/gcc/testsuite/gcc.target/aarch64/simd/vmmla.c
@@ -1,4 +1,4 @@
-/* { dg-do assemble} */
+/* { dg-do assemble } */
 /* { dg-require-effective-target arm_v8_2a_i8mm_ok } */
 /* { dg-additional-options "-march=armv8.2-a+i8mm" } */
 
-- 
2.45.2



[COMMITTED PATCH 1/3] testsuite: fix whitespace in dg-do compile directives

2024-07-30 Thread Sam James
Nothing seems to change here in reality at least on x86_64-pc-linux-gnu,
but important to fix nonetheless in case people copy it.

PR rtl-optimization/48633
PR tree-optimization/83072
PR tree-optimization/83073
PR tree-optimization/96542
PR tree-optimization/96707
PR tree-optimization/97567
PR target/69225
PR target/89929
PR target/96562
* g++.dg/pr48633.C: Fix whitespace in dg directive.
* g++.dg/pr96707.C: Likewise.
* g++.target/i386/mv28.C: Likewise.
* gcc.dg/Warray-bounds-flex-arrays-1.c: Likewise.
* gcc.dg/pr83072-2.c: Likewise.
* gcc.dg/pr83073.c: Likewise.
* gcc.dg/pr96542.c: Likewise.
* gcc.dg/pr97567-2.c: Likewise.
* gcc.target/i386/avx512fp16-11a.c: Likewise.
* gcc.target/i386/avx512fp16-13.c: Likewise.
* gcc.target/i386/avx512fp16-14.c: Likewise.
* gcc.target/i386/avx512fp16-conjugation-1.c: Likewise.
* gcc.target/i386/avx512fp16-neg-1a.c: Likewise.
* gcc.target/i386/avx512fp16-set1-pch-1a.c: Likewise.
* gcc.target/i386/avx512fp16vl-conjugation-1.c: Likewise.
* gcc.target/i386/avx512fp16vl-neg-1a.c: Likewise.
* gcc.target/i386/avx512fp16vl-set1-pch-1a.c: Likewise.
* gcc.target/i386/avx512vlfp16-11a.c: Likewise.
* gcc.target/i386/pr69225-1.c: Likewise.
* gcc.target/i386/pr69225-2.c: Likewise.
* gcc.target/i386/pr69225-3.c: Likewise.
* gcc.target/i386/pr69225-4.c: Likewise.
* gcc.target/i386/pr69225-5.c: Likewise.
* gcc.target/i386/pr69225-6.c: Likewise.
* gcc.target/i386/pr69225-7.c: Likewise.
* gcc.target/i386/pr96562-1.c: Likewise.
* gcc.target/riscv/rv32e_stack.c: Likewise.
* gfortran.dg/c-interop/removed-restrictions-3.f90: Likewise.
* gnat.dg/renaming1.adb: Likewise.
---
Committed as obvious.

 gcc/testsuite/g++.dg/pr48633.C | 2 +-
 gcc/testsuite/g++.dg/pr96707.C | 2 +-
 gcc/testsuite/g++.target/i386/mv28.C   | 2 +-
 gcc/testsuite/gcc.dg/Warray-bounds-flex-arrays-1.c | 2 +-
 gcc/testsuite/gcc.dg/pr83072-2.c   | 2 +-
 gcc/testsuite/gcc.dg/pr83073.c | 2 +-
 gcc/testsuite/gcc.dg/pr96542.c | 2 +-
 gcc/testsuite/gcc.dg/pr97567-2.c   | 2 +-
 gcc/testsuite/gcc.target/i386/avx512fp16-11a.c | 2 +-
 gcc/testsuite/gcc.target/i386/avx512fp16-13.c  | 2 +-
 gcc/testsuite/gcc.target/i386/avx512fp16-14.c  | 2 +-
 gcc/testsuite/gcc.target/i386/avx512fp16-conjugation-1.c   | 2 +-
 gcc/testsuite/gcc.target/i386/avx512fp16-neg-1a.c  | 2 +-
 gcc/testsuite/gcc.target/i386/avx512fp16-set1-pch-1a.c | 2 +-
 gcc/testsuite/gcc.target/i386/avx512fp16vl-conjugation-1.c | 2 +-
 gcc/testsuite/gcc.target/i386/avx512fp16vl-neg-1a.c| 2 +-
 gcc/testsuite/gcc.target/i386/avx512fp16vl-set1-pch-1a.c   | 2 +-
 gcc/testsuite/gcc.target/i386/avx512vlfp16-11a.c   | 2 +-
 gcc/testsuite/gcc.target/i386/pr69225-1.c  | 2 +-
 gcc/testsuite/gcc.target/i386/pr69225-2.c  | 2 +-
 gcc/testsuite/gcc.target/i386/pr69225-3.c  | 2 +-
 gcc/testsuite/gcc.target/i386/pr69225-4.c  | 2 +-
 gcc/testsuite/gcc.target/i386/pr69225-5.c  | 2 +-
 gcc/testsuite/gcc.target/i386/pr69225-6.c  | 2 +-
 gcc/testsuite/gcc.target/i386/pr69225-7.c  | 2 +-
 gcc/testsuite/gcc.target/i386/pr96562-1.c  | 2 +-
 gcc/testsuite/gcc.target/riscv/rv32e_stack.c   | 2 +-
 gcc/testsuite/gfortran.dg/c-interop/removed-restrictions-3.f90 | 2 +-
 gcc/testsuite/gnat.dg/renaming1.adb| 2 +-
 29 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/gcc/testsuite/g++.dg/pr48633.C b/gcc/testsuite/g++.dg/pr48633.C
index 90f053a74c88..efcdab02acbd 100644
--- a/gcc/testsuite/g++.dg/pr48633.C
+++ b/gcc/testsuite/g++.dg/pr48633.C
@@ -1,4 +1,4 @@
-/* { dg-do compile} */
+/* { dg-do compile } */
 /* { dg-options "-O2 -fira-region=all -fnon-call-exceptions" } */
 extern long double getme (void);
 extern void useme (long double);
diff --git a/gcc/testsuite/g++.dg/pr96707.C b/gcc/testsuite/g++.dg/pr96707.C
index 2653fe3d0431..868ee416e269 100644
--- a/gcc/testsuite/g++.dg/pr96707.C
+++ b/gcc/testsuite/g++.dg/pr96707.C
@@ -1,4 +1,4 @@
-/* { dg-do compile} */
+/* { dg-do compile } */
 /* { dg-options "-O2 -fdump-tree-evrp" } */
 
 bool f(unsigned x, unsigned y)
diff --git a/gcc/testsuite/g++.target/i386/mv28.C 
b/gcc/testsuite/g++.target/i386/mv28.C
index 9a0419c058d3..c377d23a4241 100644
--- a/gcc/testsuite/g++.target/i386/mv28.C
+++ b/gcc/testsuite/g++.target/i386/mv28.C
@@ -

[PATCH] RISC-V: xtheadmemidx: Fix RV32 ICE because of unexpected subreg

2024-07-30 Thread Christoph Müllner
As documented in PR116131, we might end up with the following
INSN for rv32i_xtheadmemidx after th_memidx_I_c is applied:

(insn 18 14 0 2 (set (mem:SI (plus:SI (reg/f:SI 141)
(ashift:SI (subreg:SI (reg:DI 134 [ a.0_1 ]) 0)
(const_int 2 [0x2]))) [0  S4 A32])
(reg:SI 143 [ b ])) "":4:17 -1
 (nil))

This INSN is rejected by th_memidx_classify_address_index(),
because the first ASHIFT operand needs to satisfy REG_P().

For most other cases of subreg expressions, an extension/truncation
will be created before the ASHIFT INSN.  However, this case is different
as we have a reg:DI and RV32, where no truncation is possible.
Therefore, this patch accepts this corner-case and allows this INSN.

PR target/116131

gcc/ChangeLog:

* config/riscv/thead.cc (th_memidx_classify_address_index):
Allow (ashift (subreg:SI (reg:DI))) for RV32.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/pr116131.c: New test.

Reported-by: Patrick O'Neill 
Signed-off-by: Christoph Müllner 
---
 gcc/config/riscv/thead.cc | 13 +
 gcc/testsuite/gcc.target/riscv/pr116131.c | 15 +++
 2 files changed, 28 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/pr116131.c

diff --git a/gcc/config/riscv/thead.cc b/gcc/config/riscv/thead.cc
index 6f5edeb7e0a..6cbbece3ec0 100644
--- a/gcc/config/riscv/thead.cc
+++ b/gcc/config/riscv/thead.cc
@@ -646,6 +646,19 @@ th_memidx_classify_address_index (struct 
riscv_address_info *info, rtx x,
   shift = INTVAL (XEXP (offset, 1));
   offset = XEXP (offset, 0);
 }
+  /* (ashift:SI (subreg:SI (reg:DI)) (const_int shift)) */
+  else if (GET_CODE (offset) == ASHIFT
+  && GET_MODE (offset) == SImode
+  && SUBREG_P (XEXP (offset, 0))
+  && GET_MODE (XEXP (offset, 0)) == SImode
+  && GET_MODE (XEXP (XEXP (offset, 0), 0)) == DImode
+  && CONST_INT_P (XEXP (offset, 1))
+  && IN_RANGE (INTVAL (XEXP (offset, 1)), 0, 3))
+{
+  type = ADDRESS_REG_REG;
+  shift = INTVAL (XEXP (offset, 1));
+  offset = XEXP (XEXP (offset, 0), 0);
+}
   /* (ashift:DI (zero_extend:DI (reg:SI)) (const_int shift)) */
   else if (GET_CODE (offset) == ASHIFT
   && GET_MODE (offset) == DImode
diff --git a/gcc/testsuite/gcc.target/riscv/pr116131.c 
b/gcc/testsuite/gcc.target/riscv/pr116131.c
new file mode 100644
index 000..4d644c37cde
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/pr116131.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-skip-if "" { *-*-* } { "-flto" "-O0" "-Og" "-Os" "-Oz" } } */
+/* { dg-options "-march=rv64gc_xtheadmemidx" { target { rv64 } } } */
+/* { dg-options "-march=rv32gc_xtheadmemidx" { target { rv32 } } } */
+
+volatile long long a;
+int b;
+int c[1];
+
+void d()
+{
+  c[a] = b;
+}
+
+/* { dg-final { scan-assembler "th.srw\t" } } */
-- 
2.45.2



Re: [PATCH 6/4] libbacktrace: Add loaded dlls after initialize

2024-07-30 Thread Ian Lance Taylor
On Mon, Jul 29, 2024 at 12:41 PM Björn Schäpers  wrote:
>
> > Instead of deleting those, move them inside the parentheses:
> >
> > typedef VOID (CALLBACK *LDR_DLL_NOTIFICATION)(ULONG,
> > struct dll_notification_data*,
> > PVOID);
> > typedef NTSTATUS (NTAPI *LDR_REGISTER_FUNCTION)(ULONG,
> >   LDR_DLL_NOTIFICATION, PVOID,
> >   PVOID*);
> >
> > and also I think you need to include , for the definition
> > of the NTSTATUS type.
> >
> > Caveat: I don't have MSVC, so I couldn't verify that these measures
> > fix the problem, sorry.
>
> Moving into the parentheses does fix the issue: 
> https://godbolt.org/z/Pe558ofYz
>
> NTSTATUS is typedefed directly before, so that no additional include is 
> needed.

Thanks.  I committed this patch.

Ian

* pecoff.c (LDR_DLL_NOTIFICATION): Put function modifier
inside parentheses.
(LDR_REGISTER_FUNCTION): Likewise.
338a93ce71ccfd435c0f392af483cc946b2c26fc
diff --git a/libbacktrace/pecoff.c b/libbacktrace/pecoff.c
index 636e1b11296..ccd5ccbce2c 100644
--- a/libbacktrace/pecoff.c
+++ b/libbacktrace/pecoff.c
@@ -83,10 +83,10 @@ struct dll_notification_data
 #define LDR_DLL_NOTIFICATION_REASON_LOADED 1
 
 typedef LONG NTSTATUS;
-typedef VOID CALLBACK (*LDR_DLL_NOTIFICATION)(ULONG,
+typedef VOID (CALLBACK *LDR_DLL_NOTIFICATION)(ULONG,
  struct dll_notification_data*,
  PVOID);
-typedef NTSTATUS NTAPI (*LDR_REGISTER_FUNCTION)(ULONG,
+typedef NTSTATUS (NTAPI *LDR_REGISTER_FUNCTION)(ULONG,
LDR_DLL_NOTIFICATION, PVOID,
PVOID*);
 #endif


Re: [PATCH 2/2] Add AVX2 code path to lexer

2024-07-30 Thread Alexander Monakov
Hi,

On Tue, 30 Jul 2024, Andi Kleen wrote:

> AVX2 is widely available on x86 and it allows to do the scanner line
> check with 32 bytes at a time. The code is similar to the SSE2 code
> path, just using AVX and 32 bytes at a time instead of SSE2 16 bytes.
> 
> Also adjust the code to allow inlining when the compiler
> is built for an AVX2 host, following what other architectures
> do.
> 
> I see about a ~0.6% compile time improvement for compiling i386
> insn-recog.i with -O0.

Is that from some kind of rigorous measurement under perf? As you
surely know, 0.6% wall-clock time can be from boost clock variation
or just run-to-run noise on x86.

I have looked at this code before. When AVX2 is available, so is SSSE3,
and then a much more efficient approach is available: instead of comparing
against \r \n \\ ? one-by-one, build a vector

  0  1  2  3  4  5  6  7  8  9   a    b   c     d    e   f
{ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?' }

where each character C we're seeking is at position (C % 16). Then
you can match against them all at once using PSHUFB:

  t = _mm_shuffle_epi8 (lut, data);
  t = t == data;

As you might recognize this handily beats the fancy SSE4.1 loop as well.
I did not pursue this because I did not measure a substantial improvement
(we're way into the land of diminishing returns here) and it seemed like
maintainers might not like to be distracted with that, but if we are
touching this code, might as well use the more efficient algorithm.
I'll be happy to propose a patch if people think it's worthwhile.
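
To make that concrete, a minimal sketch of such a scanner (SSSE3 only,
assuming the same aligned, padded buffers the existing scanners rely on;
the function name is made up and this is not a proposed patch):

  #include <emmintrin.h>
  #include <tmmintrin.h>

  static const unsigned char *
  search_line_ssse3 (const unsigned char *s)
  {
    /* Each character we look for sits at index (C % 16); index 0 holds a
       nonzero byte so that NUL and other 0x?0 bytes never match.  */
    static const char lut[16] __attribute__((aligned(16))) =
      { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?' };
    const __m128i repl = _mm_load_si128 ((const __m128i *) lut);

    for (;; s += 16)
      {
        __m128i data = _mm_load_si128 ((const __m128i *) s);
        /* Gather the candidate byte per low nibble and compare with the
           data; bytes >= 0x80 shuffle to 0 and therefore never match.  */
        __m128i t = _mm_cmpeq_epi8 (_mm_shuffle_epi8 (repl, data), data);
        unsigned int found = _mm_movemask_epi8 (t);
        if (found)
          return s + __builtin_ctz (found);
      }
  }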

I see one issue with your patch, please see below:

> @@ -448,6 +526,10 @@ init_vectorized_lexer (void)
>  
>if (minimum == 3)
>  impl = search_line_sse42;
> +  else if (__get_cpuid_max (0, &dummy) >= 7
> +&& __get_cpuid_count (7, 0, &dummy, &ebx, &dummy, &dummy)
> +&& (ebx & bit_AVX2))
> +impl = search_line_avx2;
>else if (__get_cpuid (1, &dummy, &dummy, &ecx, &edx) || minimum == 2)
>  {
>if (minimum == 3 || (ecx & bit_SSE4_2))

Surely this is not enough? You're not checking OS support via xgetbv.
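
(For reference, roughly the kind of check that is missing, as a sketch with
a made-up helper name; lex.cc is built as C++, so plain bool is fine:)

  #include <cpuid.h>

  static bool
  os_supports_avx (void)
  {
    unsigned int eax, ebx, ecx, edx;
    /* The OS must have enabled XSAVE (OSXSAVE) ...  */
    if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx) || !(ecx & bit_OSXSAVE))
      return false;
    /* ... and XCR0 must show XMM and YMM state being saved/restored.  */
    unsigned int xcr0_lo, xcr0_hi;
    __asm__ ("xgetbv" : "=a" (xcr0_lo), "=d" (xcr0_hi) : "c" (0));
    return (xcr0_lo & 0x6) == 0x6;
  }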

Alexander


[PATCH] Add a bootstrap-native build config

2024-07-30 Thread Andi Kleen
From: Andi Kleen 

... that uses -march=native -mtune=native to build a compiler optimized
for the host.

config/ChangeLog:

* bootstrap-native.mk: New file.

gcc/ChangeLog:

* doc/install.texi: Document bootstrap-native.
---
 config/bootstrap-native.mk | 1 +
 gcc/doc/install.texi   | 6 ++
 2 files changed, 7 insertions(+)
 create mode 100644 config/bootstrap-native.mk

diff --git a/config/bootstrap-native.mk b/config/bootstrap-native.mk
new file mode 100644
index ..a4a3d8594089
--- /dev/null
+++ b/config/bootstrap-native.mk
@@ -0,0 +1 @@
+BOOT_CFLAGS := -march=native -mtune=native $(BOOT_CFLAGS)
diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi
index 4973f195daf9..29827c5106f8 100644
--- a/gcc/doc/install.texi
+++ b/gcc/doc/install.texi
@@ -3052,6 +3052,12 @@ Removes any @option{-O}-started option from 
@code{BOOT_CFLAGS}, and adds
 @itemx @samp{bootstrap-Og}
 Analogous to @code{bootstrap-O1}.
 
+@item @samp{bootstrap-native}
+@itemx @samp{bootstrap-native}
+Optimize the compiler code for the build host, if supported by the
+architecture. Note this only affects the compiler, not the targeted
+code. If you want the later use @samp{--with-cpu}.
+
 @item @samp{bootstrap-lto}
 Enables Link-Time Optimization for host tools during bootstrapping.
 @samp{BUILD_CONFIG=bootstrap-lto} is equivalent to adding
-- 
2.45.2



Re: [PATCH 2/2] Add AVX2 code path to lexer

2024-07-30 Thread Jakub Jelinek
On Tue, Jul 30, 2024 at 08:41:59AM -0700, Andi Kleen wrote:
> From: Andi Kleen 
> 
> AVX2 is widely available on x86 and it allows to do the scanner line
> check with 32 bytes at a time. The code is similar to the SSE2 code
> path, just using AVX and 32 bytes at a time instead of SSE2 16 bytes.
> 
> Also adjust the code to allow inlining when the compiler
> is built for an AVX2 host, following what other architectures
> do.
> 
> I see about a ~0.6% compile time improvement for compiling i386
> insn-recog.i with -O0.
> 
> libcpp/ChangeLog:
> 
>   * config.in (HAVE_AVX2): Add.
>   * configure: Regenerate.
>   * configure.ac: Add HAVE_AVX2 check.
>   * lex.cc (repl_chars): Extend to 32 bytes.
>   (search_line_avx2): New function to scan line using AVX2.
>   (init_vectorized_lexer): Check for AVX2 in CPUID.

I'd like to just mention that there in libcpp/files.cc (read_file_guts)
we have
  /* The + 16 here is space for the final '\n' and 15 bytes of padding,
 used to quiet warnings from valgrind or Address Sanitizer, when the
 optimized lexer accesses aligned 16-byte memory chunks, including
 the bytes after the malloced, area, and stops lexing on '\n'.  */
  buf = XNEWVEC (uchar, size + 16);
So, if for AVX2 we handle 32 bytes at a time rather than 16 this would
need to change (at least conditionally for arches where the AVX2 code could
be used).
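
Something along these lines, as a rough sketch (guarding on the same
HAVE_AVX2 macro the patch introduces; the exact condition is of course up
for discussion):

  /* Space for the final '\n' plus enough padding that the widest
     vectorized chunk read past the end stays within the allocation.  */
  #ifdef HAVE_AVX2
    buf = XNEWVEC (uchar, size + 32);
  #else
    buf = XNEWVEC (uchar, size + 16);
  #endif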

Jakub



Re: [PATCH v2] gimple ssa: Teach switch conversion to optimize powers of 2 switches

2024-07-30 Thread Filip Kastl
On Tue 2024-07-30 14:34:54, Richard Biener wrote:
> On Tue, 30 Jul 2024, Filip Kastl wrote:
> 
> > > > > Ah, I see you fix those up.  Then 2.) is left - the final block.  Iff
> > > > > the final block needs adjustment you know there was a path from
> > > > > the default case to it which means one of its predecessors is 
> > > > > dominated
> > > > > by the default case?  In that case, adjust the dominator to cond_bb,
> > > > > otherwise leave it at switch_bb?
> > > > 
> > > > Yes, what I'm saying is that if I want to know idom of final_bb after 
> > > > the
> > > > transformation, I have to know if there is a path between default_bb and
> > > > final_bb.  It is because of these two cases:
> > > > 
> > > > 1.
> > > > 
> > > > cond BB -+
> > > >| |
> > > > switch BB ---+   |
> > > > /  |  \   \  |
> > > > case BBsdefault BB
> > > > \  |  /   /
> > > > final BB <---+  <- this may be an edge or a path
> > > >|
> > > > 
> > > > 2.
> > > > 
> > > > cond BB -+
> > > >| |
> > > > switch BB ---+   |
> > > > /  |  \   \  |
> > > > case BBs    default BB
> > > > \  |  /   /
> > > > final BB / <- this may be an edge or a path
> > > >|/
> > > > 
> > > > In the first case, there is a path between default_bb and final_bb and 
> > > > in the
> > > > second there isn't.  Otherwise the cases are the same.  In the first 
> > > > case idom
> > > > of final_bb should be cond_bb.  In the second case idom of final_bb 
> > > > should be
> > > > switch_bb. Algorithm deciding what should be idom of final_bb therefore 
> > > > has to
> > > > know if there is a path between default_bb and final_bb.
> > > > 
> > > > You said that if there is a path between default_bb and final_bb, one 
> > > > of the
> > > > predecessors of final_bb is dominated by default_bb.  That would indeed 
> > > > give a
> > > > nice way to check existence of a path between default_bb and final_bb.  
> > > > But
> > > > does it hold?  Consider this situation:
> > > > 
> > > >| |
> > > > cond BB --+
> > > >| ||
> > > > switch BB +   |
> > > > /  |  \  | \  |
> > > > case BBs |default BB
> > > > \  |  /  |/
> > > > final BB <- pred BB -+
> > > >|
> > > > 
> > > > Here no predecessors of final_bb are dominated by default_bb but at the 
> > > > same
> > > > time there does exist a path from default_bb to final_bb.  Or is this 
> > > > CFG
> > > > impossible for some reason?
> > > 
> > > I think in this case the dominator simply need not change - the only case
> > > you need to adjust it is when the immediate dominator of final BB was
> > > switch BB before the transform, and then we know we have to update it
> > > too cond BB, no?
> > 
> > Ah, my bad.  Yes, my counterexample actually isn't a problem.  I was glad 
> > when
> > I realized that and started thinking that this...
> > 
> > if (original idom(final bb) == switch bb)
> > {
> >   if (exists a pred of final bb dominated by default bb)
> >   {
> > idom(final bb) = cond bb;
> >   }
> >   else
> >   {
> > idom(final bb) = switch bb;
> >   }
> > }
> > else
> > {
> >   // idom(final bb) doesn't change
> > }
> > 
> > ...might be the final solution.  But after thinking about it for a while I
> > (saddly) came up with another counterexample.
> > 
> >|  
> > cond BB --+
> >|  |
> > switch BB +   |
> > /  |  \\  |
> > case BBs  default BB
> > \  |  /   /
> > final BB <- pred BB -+
> >|   ^
> >|   |
> >+---+  <- this may be a path or an edge I guess
> > 
> > Here there *is* a path between default_bb and final_bb but since no 
> > predecessor
> > of final_bb is dominated by default_bb we incorrectly decide that there is 
> > no
> > such path.  Therefore we incorrectly assign idom(final_bb) = switch_bb 
> > instead
> > of idom(final_bb) = cond_bb.
> >
> > So unless I'm missing something, "final has a pred dominated by default" 
> > isn't
> > equivalent with "there is a path between default and final" even when we 
> > assume
> > that the original idom of final_bb was switch_bb.  Therefore I think we're 
> > back
> > to searching for a nice way to test "there is a path between default and
> > final".
> 
> Hmm.
> 
> > Maybe you can spot a flaw in my logic or maybe you see a solution I don't.
> > Meanwhile I'll look into source code of the rest of the switch conversion 
> > pass.
> > Switch conversion pass inserts conditions similar to what I'm doing so 
> > someone
> > before me may have already solved how to properly fix dominators in this
> > situation.
> 
> OK, as I see in your next followup that uses iterate_fix_dominators as 
> well.
> 
> So your patch is OK as-is.
> 
> It might be nice to factor out a common helper from gen_inbound_check
> and your "copy" of it though.  As followup, if you like.
> 
> Thanks and sorry for th

Re: [PATCH 2/2] Add AVX2 code path to lexer

2024-07-30 Thread Andi Kleen
> Is that from some kind of rigorous measurement under perf? As you
> surely know, 0.6% wall-clock time can be from boost clock variation
> or just run-to-run noise on x86.

I compared it using hyperfine, which does rigorous measurements, yes.
It was well above the run-to-run variability.

I had some other patches that didn't meet that bar, e.g.
I've been experimenting with more modern hashes for inchash
and multiple ggc free lists, but so far no above-noise
results.

> 
> I have looked at this code before. When AVX2 is available, so is SSSE3,
> and then a much more efficient approach is available: instead of comparing
> against \r \n \\ ? one-by-one, build a vector
> 
>   0  1  2  3  4  5  6  7  8  9   a    b   c     d    e   f
> { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?' }
> 
> where each character C we're seeking is at position (C % 16). Then
> you can match against them all at once using PSHUFB:
> 
>   t = _mm_shuffle_epi8 (lut, data);
>   t = t == data;

I thought the PSHUFB trick only worked for some bit patterns?

At least according to this paper: https://arxiv.org/pdf/1902.08318

But yes if it applies here it's a good idea.


> 
> As you might recognize this handily beats the fancy SSE4.1 loop as well.
> I did not pursue this because I did not measure a substantial improvement
> (we're way into the land of diminishing returns here) and it seemed like
> maintainers might not like to be distracted with that, but if we are
> touching this code, might as well use the more efficient algorithm.
> I'll be happy to propose a patch if people think it's worthwhile.

Yes makes sense.

(of course it would be even better to teach the vectorizer about it,
although this will require fixing some other issues first, see PR116126)

-Andi


[PATCH] c: Add support for unsequenced and reproducible attributes

2024-07-30 Thread Jakub Jelinek
Hi!

C23 added in N2956 ( https://open-std.org/JTC1/SC22/WG14/www/docs/n2956.htm )
two new attributes, which are described as similar to GCC const and pure
attributes, but they aren't really the same and it seems that even the paper
is missing some of the differences.
The paper says unsequenced is the same as const on functions without pointer
arguments and reproducible is the same as pure on such functions (except
that they are function type attributes rather than function
declaration ones), but it seems the paper doesn't consider the finiteness GCC
relies on (aka non-DECL_LOOPING_CONST_OR_PURE_P) - the paper only talks
about using the attributes for CSE etc., not for DCE.

The following patch introduces (for now limited) support for those
attributes, both as standard C23 attributes and as GNU extensions (the
difference is that the patch is then less strict on where it allows them:
like other function type attributes they can be specified on function
declarations as well and apply to the type, while the C23 standard ones must
go on the function declarators (i.e. after the closing paren after the
function parameters) or in type specifiers of function type).
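
For illustration, the accepted spellings then look like this (function
names made up; the standard forms need -std=c23 or a similar mode):

  int square (int) [[unsequenced]];              /* C23 standard attribute */
  int observe (void) [[reproducible]];
  __attribute__((unsequenced)) int cube (int);   /* GNU spelling, also OK on declarations */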

If a function doesn't have any pointer/reference arguments (I wasn't sure
whether it must be really just pure pointer arguments or whether say
struct S { int s; int *p; } passed by value, or unions, or perhaps just
transparent unions count, and whether variadic functions which can take
pointer va_arg count too, so the check punts on all of those), the patch
adds an additional internal attribute with a " noptr" suffix which is then
used by flags_from_decl_or_type to handle those easy cases as
ECF_CONST|ECF_LOOPING_CONST_OR_PURE or
ECF_PURE|ECF_LOOPING_CONST_OR_PURE.
The harder cases aren't handled right now; I'd hope they can be handled
incrementally.
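
Roughly, the easy-case mapping amounts to something like this (just a
sketch, not the actual calls.cc hunk):

  /* In flags_from_decl_or_type, for functions without pointer args.  */
  if (lookup_attribute ("unsequenced noptr", TYPE_ATTRIBUTES (type)))
    flags |= ECF_CONST | ECF_LOOPING_CONST_OR_PURE;
  else if (lookup_attribute ("reproducible noptr", TYPE_ATTRIBUTES (type)))
    flags |= ECF_PURE | ECF_LOOPING_CONST_OR_PURE;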

I wonder whether we shouldn't emit a warning for the
gcc.dg/c23-attr-{reproducible,unsequenced}-5.c cases: while the standard
clearly specifies that composite types should union the attributes, and that
is what GCC has implemented for decades, for ?: that feels dangerous for the
new attributes; it would be much better to be conservative on say
(cond ? unsequenced_function : normal_function) (args)

There are no diagnostics for incorrect [[unsequenced]] or [[reproducible]]
function definitions; while I think diagnosing non-const static/TLS
declarations in the former could be easy, the rest feels hard.  E.g. the
const/pure discovery can just punt on everything it doesn't understand,
but complete diagnostics would need to understand it.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2024-07-30  Jakub Jelinek  

PR c/116130
gcc/
* doc/extend.texi (unsequenced, reproducible): Document new function
type attributes.
* calls.cc (flags_from_decl_or_type): Handle "unsequenced noptr" and
"reproducible noptr" attributes.
gcc/c-family/
* c-attribs.cc (c_common_gnu_attributes): Add entries for
"unsequenced", "reproducible", "unsequenced noptr" and
"reproducible noptr" attributes.
(c_maybe_contains_pointers_p): New function.
(handle_unsequenced_attribute): Likewise.
(handle_reproducible_attribute): Likewise.
* c-common.h (handle_unsequenced_attribute): Declare.
(handle_reproducible_attribute): Likewise.
* c-lex.cc (c_common_has_attribute): Return 202311 for standard
unsequenced and reproducible attributes.
gcc/c/
* c-decl.cc (handle_std_unsequenced_attribute): New function.
(handle_std_reproducible_attribute): Likewise.
(std_attributes): Add entries for "unsequenced" and "reproducible"
attributes.
(c_warn_type_attributes): Add TYPE argument.  Allow unsequenced
or reproducible attributes if it is FUNCTION_TYPE.
(groktypename): Adjust c_warn_type_attributes caller.
(grokdeclarator): Likewise.
(finish_declspecs): Likewise.
* c-parser.cc (c_parser_declaration_or_fndef): Likewise.
* c-tree.h (c_warn_type_attributes): Add TYPE argument.
gcc/testsuite/
* c-c++-common/attr-reproducible-1.c: New test.
* c-c++-common/attr-reproducible-2.c: New test.
* c-c++-common/attr-unsequenced-1.c: New test.
* c-c++-common/attr-unsequenced-2.c: New test.
* gcc.dg/c23-attr-reproducible-1.c: New test.
* gcc.dg/c23-attr-reproducible-2.c: New test.
* gcc.dg/c23-attr-reproducible-3.c: New test.
* gcc.dg/c23-attr-reproducible-4.c: New test.
* gcc.dg/c23-attr-reproducible-5.c: New test.
* gcc.dg/c23-attr-reproducible-6.c: New test.
* gcc.dg/c23-attr-unsequenced-1.c: New test.
* gcc.dg/c23-attr-unsequenced-2.c: New test.
* gcc.dg/c23-attr-unsequenced-3.c: New test.
* gcc.dg/c23-attr-unsequenced-4.c: New test.
* gcc.dg/c23-attr-unsequenced-5.c: New test.
* gcc.dg/c23-attr-unsequenced-6.c: New test.
* gcc.dg/c23-has-c-attribu

Re: [PATCH 2/3][x86][v2] implement TARGET_MODE_CAN_TRANSFER_BITS

2024-07-30 Thread Uros Bizjak
On Tue, Jul 30, 2024 at 3:00 PM Richard Biener  wrote:
>
> On Tue, 30 Jul 2024, Alexander Monakov wrote:
>
> >
> > On Tue, 30 Jul 2024, Richard Biener wrote:
> >
> > > > Oh, and please add a small comment why we don't use XFmode here.
> > >
> > > Will do.
> > >
> > > /* Do not enable XFmode, there is padding in it and it suffers
> > >from normalization upon load like SFmode and DFmode when
> > >not using SSE.  */
> >
> > Is it really true? I have no evidence of FLDT performing normalization
> > (as mentioned in PR 114659, if it did, there would be no way to spill/reload
> > x87 registers).
>
> What mangling fld performs depends on the contents of the FP control
> word which is awkward.  IIRC there's at least a bugreport that it
> turns sNaN into a qNaN, it seems I was wrong about denormals
> (when DM is not masked).  And yes, IIRC x87 instability is also
> related to spills (IIRC we spill in the actual mode of the reg, not in
> XFmode), but -fexcess-precision=standard should hopefully avoid that.
> It's also not clear whether all implementations conformed to the
> specs wrt extended-precision format loads.

FYI, FLDT does not mangle long-double values and does not generate
exceptions. Please see [1], but ignore shadowed text and instead read
the "Floating-Point Exceptions" section. So, as far as hardware is
concerned, it *can* be used to transfer 10-byte values, but I don't
want to judge from the compiler PoV if this is the way to go. We can
enable it, perhaps temporarily to experiment a bit - it is easy to
disable if it causes problems.

Let's CC Intel folks for their opinion, if it is worth using an aging
x87 to transfer 80-bit data.

[1] https://www.felixcloutier.com/x86/fld

Uros.


[PATCH] testsuite: fix whitespace in dg-require-effective-target directives

2024-07-30 Thread Sam James
PR middle-end/54400
PR target/98161
* gcc.dg/vect/bb-slp-layout-18.c: Fix whitespace in dg directive.
* gcc.dg/vect/bb-slp-pr54400.c: Likewise.
* gcc.target/i386/pr98161.c: Likewise.
---
Committed as obvious.

 gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c | 2 +-
 gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c   | 2 +-
 gcc/testsuite/gcc.target/i386/pr98161.c  | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c 
b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c
index ff4627225074..ebbf9d2da7ca 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-require-effective-target vect_float} */
+/* { dg-require-effective-target vect_float } */
 /* { dg-additional-options "-w -Wno-psabi -ffast-math" } */
 
 typedef float v4sf __attribute__((vector_size(sizeof(float)*4)));
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c 
b/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c
index 6ecd51103ed8..745e3ced70ea 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c
@@ -1,4 +1,4 @@
-/* { dg-require-effective-target vect_float} */
+/* { dg-require-effective-target vect_float } */
 /* { dg-additional-options "-w -Wno-psabi -ffast-math" } */
 
 #include "tree-vect.h"
diff --git a/gcc/testsuite/gcc.target/i386/pr98161.c 
b/gcc/testsuite/gcc.target/i386/pr98161.c
index 5825b9bd1dbb..8ea93325214f 100644
--- a/gcc/testsuite/gcc.target/i386/pr98161.c
+++ b/gcc/testsuite/gcc.target/i386/pr98161.c
@@ -1,6 +1,6 @@
 /* { dg-do run } */
 /* { dg-options "-O2 -msse4" } */
-/* { dg-require-effective-target sse4} */
+/* { dg-require-effective-target sse4 } */
 
 typedef unsigned short u16;
 typedef unsigned int   u32;

-- 
2.45.2



Re: [PATCH 2/2] Add AVX2 code path to lexer

2024-07-30 Thread Alexander Monakov


On Tue, 30 Jul 2024, Andi Kleen wrote:
> > I have looked at this code before. When AVX2 is available, so is SSSE3,
> > and then a much more efficient approach is available: instead of comparing
> > against \r \n \\ ? one-by-one, build a vector
> > 
> >   0  1  2  3  4  5  6  7  8  9   a    b   c     d    e   f
> > { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?' }
> > 
> > where each character C we're seeking is at position (C % 16). Then
> > you can match against them all at once using PSHUFB:
> > 
> >   t = _mm_shuffle_epi8 (lut, data);
> >   t = t == data;
> 
> I thought the PSHUFB trick only worked for some bit patterns?
> 
> At least according to this paper: https://arxiv.org/pdf/1902.08318
> 
> But yes if it applies here it's a good idea.

I wouldn't mention it if it did not apply.

> > As you might recognize this handily beats the fancy SSE4.1 loop as well.
> > I did not pursue this because I did not measure a substantial improvement
> > (we're way into the land of diminishing returns here) and it seemed like
> > maintainers might not like to be distracted with that, but if we are
> > touching this code, might as well use the more efficient algorithm.
> > I'll be happy to propose a patch if people think it's worthwhile.
> 
> Yes makes sense.

Okay, so what are the next steps here? Can someone who could eventually
supply a review indicate their buy-in for switching our SSE4.1 routine
for the SSSE3 PSHUFB-based one? And then for the 256-bit variant, assuming
it still brings an improvement over the faster PSHUFB scanner?

> (of course it would be even better to teach the vectorizer about it,
> although this will require fixing some other issues first, see PR116126)

(I disagree, FWIW)

(and you trimmed the part about XGETBV)

Alexander


Re: [Patch] gimplify.cc: Handle VALUE_EXPR of MEM_REF's ADDR_EXPR argument [PR115637]

2024-07-30 Thread Tobias Burnus

Richard Biener wrote:

On Mon, Jul 29, 2024 at 9:26 PM Tobias Burnus  wrote:

Inside pass_omp_target_link::execute, there is a call to
gimple_regimplify_operands but the value expression is not
expanded.[...]

Where is_gimple_mem_ref_addr is defined as:

/* Return true if T is a valid address operand of a MEM_REF.  */

bool
is_gimple_mem_ref_addr (tree t)
{
  return (is_gimple_reg (t)
          || TREE_CODE (t) == INTEGER_CST
          || (TREE_CODE (t) == ADDR_EXPR
              && (CONSTANT_CLASS_P (TREE_OPERAND (t, 0))
                  || decl_address_invariant_p (TREE_OPERAND (t, 0)))));
}

I think iff then decl_address_invariant_p should be amended.


This does not work - at least not for my use case of OpenMP
link variables - due to ordering issues.

For the device compilers, the VALUE_EXPR is added in lto_main
or in do_whole_program_analysis (same file: lto/lto.cc) by
calling offload_handle_link_vars. The value expression is then later expanded
via pass_omp_target_link::execute, but in between the following happens:

lto_main calls symbol_table::compile, which then calls
cgraph_node::expand and that executes

  res |= verify_types_in_gimple_reference (lhs, true);

for lhs being: MEM [(c_char * {ref-all})&arr2]

But when adding the has-value-expr check either directly to
is_gimple_mem_ref_addr or to the decl_address_invariant_p it calls, the
following condition becomes true in the called function in tree-cfg.cc:


3302  if (!is_gimple_mem_ref_addr (TREE_OPERAND (expr, 0))
3303  || (TREE_CODE (TREE_OPERAND (expr, 0)) == ADDR_EXPR
3304  && verify_address (TREE_OPERAND (expr, 0), false)))
3305{
3306  error ("invalid address operand in %qs", code_name);

* * * Thus, I am now back to the previous change, except for:


Why is the gimplify_addr_expr hunk needed?  It should get
to gimplifying the VAR_DECL/PARM_DECL by recursion?


Indeed. I wonder why I had (thought to) need it before; possibly
because it was needed or thought to be needed when trying to trace
this down.

Previous patch - except for that bit removed - attached.

Thoughts, better ideas?

Tobias
gimplify.cc: Handle VALUE_EXPR of MEM_REF's ADDR_EXPR argument [PR115637]

As the PR and included testcase show, replacing 'arr2' by its value expression
'*arr2$13$linkptr' failed for
  MEM [(c_char * {ref-all})&arr2]
which left 'arr2' in the code as an unknown symbol.

	PR middle-end/115637

gcc/ChangeLog:

	* gimplify.cc (gimplify_expr): For MEM_REF and an ADDR_EXPR, also
	check for value-expr arguments.
	(gimplify_body): Fix macro name in the comment.

libgomp/ChangeLog:

	* testsuite/libgomp.fortran/declare-target-link.f90: Uncomment
	now working code.

 gcc/gimplify.cc   |  9 +++--
 libgomp/testsuite/libgomp.fortran/declare-target-link.f90 | 15 ++-
 2 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/gcc/gimplify.cc b/gcc/gimplify.cc
index ab323d764e8..4fa88c9b21c 100644
--- a/gcc/gimplify.cc
+++ b/gcc/gimplify.cc
@@ -18251,8 +18251,13 @@ gimplify_expr (tree *expr_p, gimple_seq *pre_p, gimple_seq *post_p,
 	 in suitable form.  Re-gimplifying would mark the address
 	 operand addressable.  Always gimplify when not in SSA form
 	 as we still may have to gimplify decls with value-exprs.  */
+	  tmp = TREE_OPERAND (*expr_p, 0);
 	  if (!gimplify_ctxp || !gimple_in_ssa_p (cfun)
-	  || !is_gimple_mem_ref_addr (TREE_OPERAND (*expr_p, 0)))
+	  || (!is_gimple_mem_ref_addr (tmp)
+		  || (TREE_CODE (tmp) == ADDR_EXPR
+		  && (VAR_P (TREE_OPERAND (tmp, 0))
+			  || TREE_CODE (TREE_OPERAND (tmp, 0)) == PARM_DECL)
+		  && DECL_HAS_VALUE_EXPR_P (TREE_OPERAND (tmp, 0)
 	{
 	  ret = gimplify_expr (&TREE_OPERAND (*expr_p, 0), pre_p, post_p,
    is_gimple_mem_ref_addr, fb_rvalue);
@@ -19422,7 +19427,7 @@ gimplify_body (tree fndecl, bool do_parms)
   DECL_SAVED_TREE (fndecl) = NULL_TREE;
 
   /* If we had callee-copies statements, insert them at the beginning
- of the function and clear DECL_VALUE_EXPR_P on the parameters.  */
+ of the function and clear DECL_HAS_VALUE_EXPR_P on the parameters.  */
   if (!gimple_seq_empty_p (parm_stmts))
 {
   tree parm;
diff --git a/libgomp/testsuite/libgomp.fortran/declare-target-link.f90 b/libgomp/testsuite/libgomp.fortran/declare-target-link.f90
index 2ce212d114f..44c67f925bd 100644
--- a/libgomp/testsuite/libgomp.fortran/declare-target-link.f90
+++ b/libgomp/testsuite/libgomp.fortran/declare-target-link.f90
@@ -1,5 +1,7 @@
 ! { dg-additional-options "-Wall" }
+
 ! PR fortran/115559
+! PR middle-end/115637
 
 module m
integer :: A
@@ -73,24 +75,19 @@ contains
 !$omp target map(from:res)
   res = run_device1()
 !$omp end target
-print *, res
-! FIXME: arr2 not link mapped -> PR115637
-! if (res /= -11436) stop 5
-if (res /= -11546) stop 5 ! FIXME
+! print *, res
+if (res /= -11436) stop 5
   end

Re: [PATCH 2/2] Add AVX2 code path to lexer

2024-07-30 Thread Richard Biener



> On 30.07.2024 at 19:22, Alexander Monakov  wrote:
> 
> 
> On Tue, 30 Jul 2024, Andi Kleen wrote:
>>> I have looked at this code before. When AVX2 is available, so is SSSE3,
>>> and then a much more efficient approach is available: instead of comparing
>>> against \r \n \\ ? one-by-one, build a vector
>>> 
>>>   0  1  2  3  4  5  6  7  8  9   a    b   c     d    e   f
>>> { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?' }
>>> 
>>> where each character C we're seeking is at position (C % 16). Then
>>> you can match against them all at once using PSHUFB:
>>> 
>>>  t = _mm_shuffle_epi8 (lut, data);
>>>  t = t == data;
>> 
>> I thought the PSHUFB trick only worked for some bit patterns?
>> 
>> At least according to this paper: https://arxiv.org/pdf/1902.08318
>> 
>> But yes if it applies here it's a good idea.
> 
> I wouldn't mention it if it did not apply.
> 
>>> As you might recognize this handily beats the fancy SSE4.1 loop as well.
>>> I did not pursue this because I did not measure a substantial improvement
>>> (we're way into the land of diminishing returns here) and it seemed like
>>> maintainers might not like to be distracted with that, but if we are
>>> touching this code, might as well use the more efficient algorithm.
>>> I'll be happy to propose a patch if people think it's worthwhile.
>> 
>> Yes makes sense.
> 
> Okay, so what are the next steps here? Can someone who could eventually
> supply a review indicate their buy-in for switching our SSE4.1 routine
> for the SSSE3 PSHUFB-based one? And then for the 256-bit variant, assuming
> it still brings an improvement over the faster PSHUFB scanner?

I’ll happily approve such change.

>> (of course it would be even better to teach the vectorizer about it,
>> although this will require fixing some other issues first, see PR116126)
> 
> (I disagree, FWIW)

I also think writing optimized code with intrinsics is fine.

Richard 

> (and you trimmed the part about XGETBV)
> 
> Alexander


Re: [PATCH] Add a bootstrap-native build config

2024-07-30 Thread Sam James
Andi Kleen  writes:

> From: Andi Kleen 
>
> ... that uses -march=native -mtune=native to build a compiler optimized
> for the host.
>

I like the idea and I'll probably use this. (I can't approve it though.)

> config/ChangeLog:
>
>   * bootstrap-native.mk: New file.
>
> gcc/ChangeLog:
>
>   * doc/install.texi: Document bootstrap-native.
> ---
>  config/bootstrap-native.mk | 1 +
>  gcc/doc/install.texi   | 6 ++
>  2 files changed, 7 insertions(+)
>  create mode 100644 config/bootstrap-native.mk
>
> diff --git a/config/bootstrap-native.mk b/config/bootstrap-native.mk
> new file mode 100644
> index ..a4a3d8594089
> --- /dev/null
> +++ b/config/bootstrap-native.mk
> @@ -0,0 +1 @@
> +BOOT_CFLAGS := -march=native -mtune=native $(BOOT_CFLAGS)

I was under the impression that -mtune=native is useless with
-march=native. Is that wrong?

> diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi
> index 4973f195daf9..29827c5106f8 100644
> --- a/gcc/doc/install.texi
> +++ b/gcc/doc/install.texi
> @@ -3052,6 +3052,12 @@ Removes any @option{-O}-started option from 
> @code{BOOT_CFLAGS}, and adds
>  @itemx @samp{bootstrap-Og}
>  Analogous to @code{bootstrap-O1}.
>  
> +@item @samp{bootstrap-native}
> +@itemx @samp{bootstrap-native}
> +Optimize the compiler code for the build host, if supported by the
> +architecture. Note this only affects the compiler, not the targeted
> +code. If you want the later use @samp{--with-cpu}.

later -> latter

> +
>  @item @samp{bootstrap-lto}
>  Enables Link-Time Optimization for host tools during bootstrapping.
>  @samp{BUILD_CONFIG=bootstrap-lto} is equivalent to adding


Re: [Committed] RISC-V: Add configure check for B extension support

2024-07-30 Thread Edwin Lu

Thanks! Committed

Edwin

On 7/29/2024 6:37 AM, Kito Cheng wrote:

LGTM, although I said no binutils check for zacas and zabha, but B is
a different situation since GCC will add that if zba, zbb and zbs are
all present.



On Thu, Jul 25, 2024 at 7:51 AM Edwin Lu  wrote:

Binutils 2.42 and before don't recognize the B extension in the march
strings even though it supports zba_zbb_zbs. Add a configure check to
ignore the B in the march string if found.

gcc/ChangeLog:

 * common/config/riscv/riscv-common.cc (riscv_subset_list::to_string):
 Skip b in march string
 * config.in: Regenerate.
 * configure: Regenerate.
 * configure.ac: Add B assembler check

Signed-off-by: Edwin Lu 
---
  gcc/common/config/riscv/riscv-common.cc |  8 +++
  gcc/config.in   |  6 +
  gcc/configure   | 31 +
  gcc/configure.ac|  5 
  4 files changed, 50 insertions(+)

diff --git a/gcc/common/config/riscv/riscv-common.cc 
b/gcc/common/config/riscv/riscv-common.cc
index 682826c0e34..200a57e1bc8 100644
--- a/gcc/common/config/riscv/riscv-common.cc
+++ b/gcc/common/config/riscv/riscv-common.cc
@@ -857,6 +857,7 @@ riscv_subset_list::to_string (bool version_p) const
bool skip_zaamo_zalrsc = false;
bool skip_zabha = false;
bool skip_zicsr = false;
+  bool skip_b = false;
bool i2p0 = false;

/* For RISC-V ISA version 2.2 or earlier version, zicsr and zifencei is
@@ -891,6 +892,10 @@ riscv_subset_list::to_string (bool version_p) const
/* Skip since binutils 2.42 and earlier don't recognize zabha.  */
skip_zabha = true;
  #endif
+#ifndef HAVE_AS_MARCH_B
+  /* Skip since binutils 2.42 and earlier don't recognize b.  */
+  skip_b = true;
+#endif

for (subset = m_head; subset != NULL; subset = subset->next)
  {
@@ -911,6 +916,9 @@ riscv_subset_list::to_string (bool version_p) const
if (skip_zabha && subset->name == "zabha")
 continue;

+  if (skip_b && subset->name == "b")
+   continue;
+
/* For !version_p, we only separate extension with underline for
  multi-letter extension.  */
if (!first &&
diff --git a/gcc/config.in b/gcc/config.in
index bc819005bd6..96e829b9c93 100644
--- a/gcc/config.in
+++ b/gcc/config.in
@@ -629,6 +629,12 @@
  #endif


+/* Define if the assembler understands -march=rv*_b. */
+#ifndef USED_FOR_TARGET
+#undef HAVE_AS_MARCH_B
+#endif
+
+
  /* Define if the assembler understands -march=rv*_zaamo_zalrsc. */
  #ifndef USED_FOR_TARGET
  #undef HAVE_AS_MARCH_ZAAMO_ZALRSC
diff --git a/gcc/configure b/gcc/configure
index 01acca7fb5c..c5725c4cd44 100755
--- a/gcc/configure
+++ b/gcc/configure
@@ -30913,6 +30913,37 @@ if test $gcc_cv_as_riscv_march_zabha = yes; then

  $as_echo "#define HAVE_AS_MARCH_ZABHA 1" >>confdefs.h

+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking assembler for -march=rv32i_b 
support" >&5
+$as_echo_n "checking assembler for -march=rv32i_b support... " >&6; }
+if ${gcc_cv_as_riscv_march_b+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  gcc_cv_as_riscv_march_b=no
+  if test x$gcc_cv_as != x; then
+$as_echo '' > conftest.s
+if { ac_try='$gcc_cv_as $gcc_cv_as_flags -march=rv32i_b -o conftest.o conftest.s 
>&5'
+  { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
+  (eval $ac_try) 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; }; }
+then
+   gcc_cv_as_riscv_march_b=yes
+else
+  echo "configure: failed program was" >&5
+  cat conftest.s >&5
+fi
+rm -f conftest.o conftest.s
+  fi
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $gcc_cv_as_riscv_march_b" >&5
+$as_echo "$gcc_cv_as_riscv_march_b" >&6; }
+if test $gcc_cv_as_riscv_march_b = yes; then
+
+$as_echo "#define HAVE_AS_MARCH_B 1" >>confdefs.h
+
  fi

  ;;
diff --git a/gcc/configure.ac b/gcc/configure.ac
index 3f20c107b6a..93d9236ff36 100644
--- a/gcc/configure.ac
+++ b/gcc/configure.ac
@@ -5466,6 +5466,11 @@ configured with --enable-newlib-nano-formatted-io.])
[-march=rv32i_zabha],,,
[AC_DEFINE(HAVE_AS_MARCH_ZABHA, 1,
  [Define if the assembler understands -march=rv*_zabha.])])
+gcc_GAS_CHECK_FEATURE([-march=rv32i_b support],
+  gcc_cv_as_riscv_march_b,
+  [-march=rv32i_b],,,
+  [AC_DEFINE(HAVE_AS_MARCH_B, 1,
+[Define if the assembler understands -march=rv*_b.])])
  ;;
  loongarch*-*-*)
  gcc_GAS_CHECK_FEATURE([.dtprelword support],
--
2.34.1



[committed] i386/testsuite: Add testcase for fixed PR [PR51492]

2024-07-30 Thread Uros Bizjak
PR target/51492

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr51492.c: New test.

Tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/testsuite/gcc.target/i386/pr51492.c 
b/gcc/testsuite/gcc.target/i386/pr51492.c
new file mode 100644
index 000..0892e0c79a7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr51492.c
@@ -0,0 +1,19 @@
+/* PR target/51492 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -msse2" } */
+
+#define SIZE 65536
+#define WSIZE 64
+unsigned short head[SIZE] __attribute__((aligned(64)));
+
+void
+f(void)
+{
+  for (unsigned n = 0; n < SIZE; ++n) {
+unsigned short m = head[n];
+head[n] = (unsigned short)(m >= WSIZE ? m-WSIZE : 0);
+  }
+}
+
+/* { dg-final { scan-assembler "psubusw" } } */
+/* { dg-final { scan-assembler-not "paddw" } } */


Re: [PATCH] Add a bootstrap-native build config

2024-07-30 Thread Andi Kleen
> > +BOOT_CFLAGS := -march=native -mtune=native $(BOOT_CFLAGS)
> 
> I was under the impression that -mtune=native is useless with
> -march=native. Is that wrong?

On x86 it's right, but not sure about other architectures. I suppose
it doesn't hurt.

-Andi


Re: [PATCH 2/2] Add AVX2 code path to lexer

2024-07-30 Thread Kyrylo Tkachov



> On 30 Jul 2024, at 19:01, Andi Kleen  wrote:
> 
> 
>> Is that from some kind of rigorous measurement under perf? As you
>> surely know, 0.6% wall-clock time can be from boost clock variation
>> or just run-to-run noise on x86.
> 
> I compared it using hyperfine which does rigorous measurements yes.
> It was well above the run-to-run variability.

FWIW when I was experimenting with these paths I found that an -fsyntax-only 
compilation helps make the changes more pronounced.

Thanks,
Kyrill

> 
> I had some other patches that didn't meet that bar, e.g.
> i've been experimenting with more modern hashes for inchash
> and multiple ggc free lists, but so far no above noise
> results.
> 
>> 
>> I have looked at this code before. When AVX2 is available, so is SSSE3,
>> and then a much more efficient approach is available: instead of comparing
>> against \r \n \\ ? one-by-one, build a vector
>> 
>>   0  1  2  3  4  5  6  7  8  9   a    b   c     d    e   f
>> { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?' }
>> 
>> where each character C we're seeking is at position (C % 16). Then
>> you can match against them all at once using PSHUFB:
>> 
>>  t = _mm_shuffle_epi8 (lut, data);
>>  t = t == data;
> 
> I thought the PSHUFB trick only worked for some bit patterns?
> 
> At least according to this paper: https://arxiv.org/pdf/1902.08318
> 
> But yes if it applies here it's a good idea.
> 
> 
>> 
>> As you might recognize this handily beats the fancy SSE4.1 loop as well.
>> I did not pursue this because I did not measure a substantial improvement
>> (we're way into the land of diminishing returns here) and it seemed like
>> maintainers might not like to be distracted with that, but if we are
>> touching this code, might as well use the more efficient algorithm.
>> I'll be happy to propose a patch if people think it's worthwhile.
> 
> Yes makes sense.
> 
> (of course it would be even better to teach the vectorizer about it,
> although this will require fixing some other issues first, see PR116126)
> 
> -Andi



Re: [PATCH] RISC-V: NFC: Do not use zicond for pr105314 testcases

2024-07-30 Thread Jeff Law




On 7/28/24 7:58 PM, Xiao Zeng wrote:

gcc/testsuite/ChangeLog:

 * gcc.target/riscv/pr105314-rtl.c: Skip zicond.
 * gcc.target/riscv/pr105314-rtl32.c: Dotto.
 * gcc.target/riscv/pr105314.c: Dotto.

Why do you want to skip zicond for this test?

Jeff



[committed] libstdc++: Fix name of source file in comment

2024-07-30 Thread Jonathan Wakely
Pushed to trunk.

-- >8 --

libstdc++-v3/ChangeLog:

* src/c++17/fs_ops.cc: Fix file name in comment.
---
 libstdc++-v3/src/c++17/fs_ops.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libstdc++-v3/src/c++17/fs_ops.cc b/libstdc++-v3/src/c++17/fs_ops.cc
index 7ffdce67782..9606afa9f1f 100644
--- a/libstdc++-v3/src/c++17/fs_ops.cc
+++ b/libstdc++-v3/src/c++17/fs_ops.cc
@@ -851,7 +851,7 @@ namespace
 #endif
 
 #ifdef _GLIBCXX_HAVE_SYS_STAT_H
-#ifdef NEED_DO_COPY_FILE // Only define this once, not in cow-ops.o too
+#ifdef NEED_DO_COPY_FILE // Only define this once, not in cow-fs_ops.o too
 bool
 fs::equiv_files([[maybe_unused]] const char_type* p1, const stat_type& st1,
[[maybe_unused]] const char_type* p2, const stat_type& st2,
-- 
2.45.2



Re: [PATCH] RISC-V: xtheadmemidx: Fix RV32 ICE because of unexpected subreg

2024-07-30 Thread Jeff Law




On 7/30/24 10:17 AM, Christoph Müllner wrote:

As documented in PR116131, we might end up with the following
INSN for rv32i_xtheadmemidx after th_memidx_I_c is applied:

(insn 18 14 0 2 (set (mem:SI (plus:SI (reg/f:SI 141)
 (ashift:SI (subreg:SI (reg:DI 134 [ a.0_1 ]) 0)
 (const_int 2 [0x2]))) [0  S4 A32])
 (reg:SI 143 [ b ])) "":4:17 -1
  (nil))

This INSN is rejected by th_memidx_classify_address_index(),
because the first ASHIFT operand needs to satisfy REG_P().

For most other cases of subreg expressions, an extension/truncation
will be created before the ASHIFT INSN.  However, this case is different
as we have a reg:DI and RV32, where no truncation is possible.
Therefore, this patch accepts this corner-case and allows this INSN.

PR target/116131

gcc/ChangeLog:

* config/riscv/thead.cc (th_memidx_classify_address_index):
Allow (ashift (subreg:SI (reg:DI))) for RV32.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/pr116131.c: New test.
But why do we have the ashift form here at all?   Canonicalization rules 
say this is invalid RTL.   So while this patch fixes the ICE it does not 
address the canonicalization problem in this RTL.


Specifically in a memory address we should be using mult rather than 
ashift.   And for associative ops, we chain left.  So the proper form of 
the address (inside a MEM) is:


(plus (mult (op1) (const_int 4)) (op2))

That needs to be fixed.

jeff





Re: [PATCH 4/5] RISC-V: Add support to vector stack-clash protection

2024-07-30 Thread Jeff Law




On 7/29/24 8:52 AM, Raphael Zinsly wrote:

On Mon, Jul 29, 2024 at 11:20 AM Jeff Law  wrote:




On 7/29/24 6:18 AM, Raphael Zinsly wrote:

On Fri, Jul 26, 2024 at 6:48 PM Jeff Law  wrote:




On 7/24/24 12:00 PM, Raphael Moreira Zinsly wrote:

Adds basic support to vector stack-clash protection using a loop to do
the probing and stack adjustments.

gcc/ChangeLog:
* config/riscv/riscv.cc
(riscv_allocate_and_probe_stack_loop): New function.
(riscv_v_adjust_scalable_frame): Add stack-clash protection
support.
(riscv_allocate_and_probe_stack_space): Move the probe loop
implementation to riscv_allocate_and_probe_stack_loop.
* config/riscv/riscv.h: Define RISCV_STACK_CLASH_VECTOR_CFA_REGNUM.

gcc/testsuite/ChangeLog:
* gcc.target/riscv/stack-check-cfa-3.c: New test.
* gcc.target/riscv/stack-check-prologue-16.c: New test.
* gcc.target/riscv/struct_vect_24.c: New test.

So my only worry here is using another scratch register in the prologue
code instead of using one of the preexisting prologue scratch registers.
Is there a reasonable way to use  PROLOGUE_TEMP or PROLOGUE_TEMP2 here?


These are the preexisting prologue scratch registers: PROLOGUE_TEMP is
t0 and PROLOGUE_TEMP2 is t1.


Otherwise this looks good as well.  So let's get closure on that
question and we can move forward after that.

Right.  And so my question is can we use PROLOGUE_TEMP or PROLOGUE_TEMP2
rather than defining another temporary for the prologue?


We are only using these two and we do not need to use another temporary.
Do you mean stop using riscv_force_temporary?
If so, yes, we can change it to riscv_emit_move.


You define:
+#define RISCV_STACK_CLASH_VECTOR_CFA_REGNUM (GP_TEMP_FIRST + 4)

Where:
#define GP_REG_FIRST 0
#define GP_TEMP_FIRST (GP_REG_FIRST + 5)

So RISCV_STACK_CLASH_VECTOR_CFA_REGNUM defined as "9" which I think is 
"s1".  That can't be what we want :-)


What I don't understand is why we don't use RISCV_PROLOGUE_TEMP_REGNUM 
or RISCV_PROLOGUE_TEMP2_REGNUM which are defined as t0 and t1 respectively.


We'd have to audit the prologue/epilogue code to ensure we can safely 
use one of those two as a scratch in the context we care about.


jeff


[PATCH] testsuite: fix 'dg-compile' typos

2024-07-30 Thread Sam James
'dg-compile' is not a thing, replace it with 'dg-do compile'.

PR target/68015
PR c++/83979
* c-c++-common/goacc/loop-shape.c: Fix 'dg-compile' typo.
* g++.dg/pr83979.C: Likewise.
* g++.target/aarch64/sve/acle/general-c++/attributes_2.C: Likewise.
* gcc.dg/tree-ssa/builtin-sprintf-7.c: Likewise.
* gcc.dg/tree-ssa/builtin-sprintf-8.c: Likewise.
* gcc.target/riscv/amo/zabha-rvwmo-all-amo-ops-char.c: Likewise.
* gcc.target/riscv/amo/zabha-rvwmo-all-amo-ops-short.c: Likewise.
* gcc.target/s390/20181024-1.c: Likewise.
* gcc.target/s390/addr-constraints-1.c: Likewise.
* gcc.target/s390/arch12/aghsghmgh-1.c: Likewise.
* gcc.target/s390/arch12/mul-1.c: Likewise.
* gcc.target/s390/arch13/bitops-1.c: Likewise.
* gcc.target/s390/arch13/bitops-2.c: Likewise.
* gcc.target/s390/arch13/fp-signedint-convert-1.c: Likewise.
* gcc.target/s390/arch13/fp-unsignedint-convert-1.c: Likewise.
* gcc.target/s390/arch13/popcount-1.c: Likewise.
* gcc.target/s390/pr68015.c: Likewise.
* gcc.target/s390/vector/fp-signedint-convert-1.c: Likewise.
* gcc.target/s390/vector/fp-unsignedint-convert-1.c: Likewise.
* gcc.target/s390/vector/reverse-elements-1.c: Likewise.
* gcc.target/s390/vector/reverse-elements-2.c: Likewise.
* gcc.target/s390/vector/reverse-elements-3.c: Likewise.
* gcc.target/s390/vector/reverse-elements-4.c: Likewise.
* gcc.target/s390/vector/reverse-elements-5.c: Likewise.
* gcc.target/s390/vector/reverse-elements-6.c: Likewise.
* gcc.target/s390/vector/reverse-elements-7.c: Likewise.
* gnat.dg/alignment15.adb: Likewise.
* gnat.dg/debug4.adb: Likewise.
* gnat.dg/inline21.adb: Likewise.
* gnat.dg/inline22.adb: Likewise.
* gnat.dg/opt37.adb: Likewise.
* gnat.dg/warn13.adb: Likewise.
---
Committed as obvious. No changes in logs.

 gcc/testsuite/c-c++-common/goacc/loop-shape.c   | 2 +-
 gcc/testsuite/g++.dg/pr83979.C  | 2 +-
 .../g++.target/aarch64/sve/acle/general-c++/attributes_2.C  | 2 +-
 gcc/testsuite/gcc.dg/tree-ssa/builtin-sprintf-7.c   | 2 +-
 gcc/testsuite/gcc.dg/tree-ssa/builtin-sprintf-8.c   | 2 +-
 .../gcc.target/riscv/amo/zabha-rvwmo-all-amo-ops-char.c | 2 +-
 .../gcc.target/riscv/amo/zabha-rvwmo-all-amo-ops-short.c| 2 +-
 gcc/testsuite/gcc.target/s390/20181024-1.c  | 2 +-
 gcc/testsuite/gcc.target/s390/addr-constraints-1.c  | 2 +-
 gcc/testsuite/gcc.target/s390/arch12/aghsghmgh-1.c  | 2 +-
 gcc/testsuite/gcc.target/s390/arch12/mul-1.c| 2 +-
 gcc/testsuite/gcc.target/s390/arch13/bitops-1.c | 2 +-
 gcc/testsuite/gcc.target/s390/arch13/bitops-2.c | 2 +-
 gcc/testsuite/gcc.target/s390/arch13/fp-signedint-convert-1.c   | 2 +-
 gcc/testsuite/gcc.target/s390/arch13/fp-unsignedint-convert-1.c | 2 +-
 gcc/testsuite/gcc.target/s390/arch13/popcount-1.c   | 2 +-
 gcc/testsuite/gcc.target/s390/pr68015.c | 2 +-
 gcc/testsuite/gcc.target/s390/vector/fp-signedint-convert-1.c   | 2 +-
 gcc/testsuite/gcc.target/s390/vector/fp-unsignedint-convert-1.c | 2 +-
 gcc/testsuite/gcc.target/s390/vector/reverse-elements-1.c   | 2 +-
 gcc/testsuite/gcc.target/s390/vector/reverse-elements-2.c   | 2 +-
 gcc/testsuite/gcc.target/s390/vector/reverse-elements-3.c   | 2 +-
 gcc/testsuite/gcc.target/s390/vector/reverse-elements-4.c   | 2 +-
 gcc/testsuite/gcc.target/s390/vector/reverse-elements-5.c   | 2 +-
 gcc/testsuite/gcc.target/s390/vector/reverse-elements-6.c   | 2 +-
 gcc/testsuite/gcc.target/s390/vector/reverse-elements-7.c   | 2 +-
 gcc/testsuite/gnat.dg/alignment15.adb   | 2 +-
 gcc/testsuite/gnat.dg/debug4.adb| 2 +-
 gcc/testsuite/gnat.dg/inline21.adb  | 2 +-
 gcc/testsuite/gnat.dg/inline22.adb  | 2 +-
 gcc/testsuite/gnat.dg/opt37.adb | 2 +-
 gcc/testsuite/gnat.dg/warn13.adb| 2 +-
 32 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/gcc/testsuite/c-c++-common/goacc/loop-shape.c 
b/gcc/testsuite/c-c++-common/goacc/loop-shape.c
index 9708f7bf5eb3..b3199b4044d0 100644
--- a/gcc/testsuite/c-c++-common/goacc/loop-shape.c
+++ b/gcc/testsuite/c-c++-common/goacc/loop-shape.c
@@ -1,7 +1,7 @@
 /* Exercise *_parser_oacc_shape_clause by checking various combinations
of gang, worker and vector clause arguments.  */
 
-/* { dg-compile } */
+/* { dg-do compile } */
 
 int main ()
 {
diff --git a/gcc/testsuite/g++.dg/pr83979.C b/gcc/testsuite/g++.dg/pr83979.C
index a39b1ea6ab9b..0ef754d1e48d 100644
--- a/gcc/testsuite/g++.dg/pr83979.C
+++ b/gcc/tests

Re: [PATCHv2 2/2] libiberty/buildargv: handle input consisting of only white space

2024-07-30 Thread Jeff Law




On 7/29/24 6:51 AM, Andrew Burgess wrote:

Thomas Schwinge  writes:


Hi!

On 2024-02-10T17:26:01+, Andrew Burgess  wrote:

--- a/libiberty/argv.c
+++ b/libiberty/argv.c



@@ -439,17 +442,8 @@ expandargv (int *argcp, char ***argvp)
}
/* Add a NUL terminator.  */
buffer[len] = '\0';
-  /* If the file is empty or contains only whitespace, buildargv would
-return a single empty argument.  In this context we want no arguments,
-instead.  */
-  if (only_whitespace (buffer))
-   {
- file_argv = (char **) xmalloc (sizeof (char *));
- file_argv[0] = NULL;
-   }
-  else
-   /* Parse the string.  */
-   file_argv = buildargv (buffer);
+  /* Parse the string.  */
+  file_argv = buildargv (buffer);
/* If *ARGVP is not already dynamically allocated, copy it.  */
if (*argvp == original_argv)
*argvp = dupargv (*argvp);


With that (single) use of 'only_whitespace' now gone:

  [...]/source-gcc/libiberty/argv.c:128:1: warning: ‘only_whitespace’ defined but not used [-Wunused-function]
    128 | only_whitespace (const char* input)
        | ^~~



Sorry about that.

The patch below is the obvious fix.  OK to apply?

Of course.
jeff


