Re: [RFC/RFA][PATCH v4 06/12] aarch64: Implement new expander for efficient CRC computation

2024-08-25 Thread Richard Biener
On Sat, Aug 24, 2024 at 9:22 AM Mariam Arutunian
 wrote:
>
>
>
> On Fri, Aug 23, 2024, 15:03 Richard Biener  wrote:
>>
>> On Fri, Aug 23, 2024 at 9:55 AM Mariam Arutunian
>>  wrote:
>> >
>> >
>> > On Wed, Aug 21, 2024 at 5:56 PM Richard Sandiford 
>> >  wrote:
>> >>
>> >> Mariam Arutunian  writes:
>> >> > This patch introduces two new expanders for the aarch64 backend,
>> >> > dedicated to generating optimized code for CRC computations.
>> >> > The new expanders are designed to leverage specific hardware
>> >> > capabilities to achieve faster CRC calculations, particularly
>> >> > using the crc32, crc32c and pmull instructions when supported by
>> >> > the target architecture.
>> >> >
>> >> > Expander 1: Bit-Forward CRC (crc4)
>> >> > For targets that support the pmull instruction (TARGET_AES),
>> >> > the expander will generate code that uses the pmull (crypto_pmulldi)
>> >> > instruction for CRC computation.
>> >> >
>> >> > Expander 2: Bit-Reversed CRC (crc_rev4)
>> >> > The expander first checks whether the target supports the CRC32*
>> >> > instruction set (TARGET_CRC32) and whether the polynomial in use is
>> >> > 0x1EDC6F41 (iSCSI) or 0x04C11DB7 (HDLC).  If the conditions are
>> >> > met, it emits calls to the corresponding crc32* instruction
>> >> > (depending on the data size and the polynomial).
>> >> > If the target does not support crc32* but supports pmull, it then
>> >> > uses the pmull (crypto_pmulldi) instruction for bit-reversed CRC
>> >> > computation.
>> >> > Otherwise, a table-based CRC is generated.
>> >> >
>> >> >   gcc/config/aarch64/
>> >> >
>> >> > * aarch64-protos.h (aarch64_expand_crc_using_pmull): New extern
>> >> > function declaration.
>> >> > (aarch64_expand_reversed_crc_using_pmull):  Likewise.
>> >> > * aarch64.cc (aarch64_expand_crc_using_pmull): New function.
>> >> > (aarch64_expand_reversed_crc_using_pmull):  Likewise.
>> >> > * aarch64.md (crc_rev4): New expander for
>> >> > reversed CRC.
>> >> > (crc4): New expander for bit-forward CRC.
>> >> > * iterators.md (crc_data_type): New mode attribute.
>> >> >
>> >> >   gcc/testsuite/gcc.target/aarch64/
>> >> >
>> >> > * crc-1-pmul.c: New test.
>> >> > * crc-10-pmul.c: Likewise.
>> >> > * crc-12-pmul.c: Likewise.
>> >> > * crc-13-pmul.c: Likewise.
>> >> > * crc-14-pmul.c: Likewise.
>> >> > * crc-17-pmul.c: Likewise.
>> >> > * crc-18-pmul.c: Likewise.
>> >> > * crc-21-pmul.c: Likewise.
>> >> > * crc-22-pmul.c: Likewise.
>> >> > * crc-23-pmul.c: Likewise.
>> >> > * crc-4-pmul.c: Likewise.
>> >> > * crc-5-pmul.c: Likewise.
>> >> > * crc-6-pmul.c: Likewise.
>> >> > * crc-7-pmul.c: Likewise.
>> >> > * crc-8-pmul.c: Likewise.
>> >> > * crc-9-pmul.c: Likewise.
>> >> > * crc-CCIT-data16-pmul.c: Likewise.
>> >> > * crc-CCIT-data8-pmul.c: Likewise.
>> >> > * crc-coremark-16bitdata-pmul.c: Likewise.
>> >> > * crc-crc32-data16.c: Likewise.
>> >> > * crc-crc32-data32.c: Likewise.
>> >> > * crc-crc32-data8.c: Likewise.
>> >> > * crc-crc32c-data16.c: Likewise.
>> >> > * crc-crc32c-data32.c: Likewise.
>> >> > * crc-crc32c-data8.c: Likewise.
>> >>
>> >> OK for trunk once the prerequisites are approved.  Thanks for all your
>> >> work on this.
>> >>
>> >> Which other parts of the series still need review?  I can try to help
>> >> out with the target-independent bits.  (That said, I'm not sure I'm the
>> >> best person to review the tree recognition pass, but I can have a go.)
>> >>
>> >
>> > Thank you very much for everything.
>> > Right now, I'm not sure which parts would be best for you to review,
>> > since Richard Biener is currently reviewing them.
>> > Maybe I can ask for your help later?
>>
>> I'm done with the parts I preserved for reviewing.  Btw, it seems
>> the vN series are not complete, that is, you didn't re-post the
>> entire series but only the changed parts?  I was somewhat confused
>> by that.
>
>
> Yes, I didn't re-post the entire series; I only resent the parts that were 
> modified. I didn't know that I needed to send the entire series each time. 
> I'll make sure to do that in the next versions.

It's fine to only post the changed parts; I just missed a note that you
had done this, so I was searching for a revised version of an older patch
I had in the queue for reviewing.  But re-posting the entire series is
fine as well, and probably the least confusing for everyone (including
the pre-commit CI).

Richard.

> Thanks,
> Mariam
>
>
>>
>> Richard.
>>
>> > Thanks,
>> > Mariam
>> >
>> >> Richard
>> >>
>> >> >
>> >> > Signed-off-by: Mariam Arutunian 
>> >> > Co-authored-by: Richard Sandiford 
>> >> > diff --git a/gcc/config/aarch64/aarch64-protos.h 
>> >> > b/gcc/config/aarch64/aarch64-protos.h
>> >> > index 42639e9efcf..469111e3b17 100644
>> >> > --- a/gcc/config/aarch64/aarch64-protos.h
>> >> > +++ b/gcc/config/aarch64/aarch64-protos.h
>> >> > @@ -1112,5 +1112,8 @@ extern void aarch64_adjust_reg_alloc_ord

Re: [PATCH v1] Vect: Promote unsigned .SAT_ADD constant operand for vectorizable_call

2024-08-25 Thread Richard Biener
On Sat, Aug 24, 2024 at 1:31 PM Li, Pan2  wrote:
>
> Thanks Jakub and Richard for explanation and help, I will double check 
> saturate matching for the const_int strict check.
>
> Back to the case below: do we still need some ad-hoc step to unblock the
> type check in vectorizable_call?
> For example, the const_int 9u may have int type for .SAT_ADD(uint8_t, 9u).
> Or do we have somewhere else to make vectorizable_call happy?

I don't see how vectorizable_call itself can handle this since it doesn't
have any idea about the type requirements.  Instead, pattern recognition
of .SAT_ADD should promote/demote the invariants - of course there might
be correctness issues involved with matching .ADD_OVERFLOW in the first
place.  What I read is that .ADD_OVERFLOW produces a value that is equal
to the two's-complement add of its arguments promoted/demoted to the
result type, correct?

Richard.

> #define DEF_VEC_SAT_U_ADD_IMM_FMT_3(T, IMM)  \
> T __attribute__((noinline))  \
> vec_sat_u_add_imm##IMM##_##T##_fmt_3 (T *out, T *in, unsigned limit) \
> {\
>   unsigned i;\
>   T ret; \
>   for (i = 0; i < limit; i++)\
> {\
>   out[i] = __builtin_add_overflow (in[i], IMM, &ret) ? -1 : ret; \
> }\
> }
>
> DEF_VEC_SAT_U_ADD_IMM_FMT_3(uint8_t, 9u)
>
> Pan
>
> -----Original Message-----
> From: Richard Biener 
> Sent: Friday, August 23, 2024 6:53 PM
> To: Jakub Jelinek 
> Cc: Li, Pan2 ; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH v1] Vect: Promote unsigned .SAT_ADD constant operand for 
> vectorizable_call
>
> On Thu, Aug 22, 2024 at 8:36 PM Jakub Jelinek  wrote:
> >
> > On Tue, Aug 20, 2024 at 01:52:35PM +0200, Richard Biener wrote:
> > > On Sat, Aug 17, 2024 at 11:18 PM Jakub Jelinek  wrote:
> > > >
> > > > On Sat, Aug 17, 2024 at 05:03:14AM +, Li, Pan2 wrote:
> > > > > Please feel free to let me know if there is anything I can do to fix 
> > > > > this issue. Thanks a lot.
> > > >
> > > > There is no bug.  The operands of .{ADD,SUB,MUL}_OVERFLOW don't have to 
> > > > have the same type, as described in the 
> > > > __builtin_{add,sub,mul}_overflow{,_p} documentation, each argument can 
> > > > have different type and result yet another one, the behavior is then 
> > > > (as if) to perform the operation in infinite precision and if that 
> > > > result fits into the result type, there is no overflow, otherwise there 
> > > > is.
> > > > So, there is no need to promote anything.
> > >
> > > Hmm, it's a bit awkward to have this state in the IL.
> >
> > Why?  These aren't the only internal functions which have different types
> > of arguments, from the various widening ifns, conditional ifns,
> > scatter/gather, ...  Even the WIDEN_*EXPR trees do have type differences
> > among arguments.
> > And it matches what the user builtin does.
> >
> > Furthermore, at least without _BitInt (but even with _BitInt at the maximum
> > precision too) this might not be even possible.
> > E.g. if there is __builtin_add_overflow with unsigned __int128 and __int128
> > arguments and there are no wider types there is simply no type to use for 
> > both
> > arguments, it would need to be a signed type with at least 129 bits...
> >
> > > I see that
> > > expand_arith_overflow eventually applies
> > > promotion, namely to the type of the LHS.
> >
> > The LHS doesn't have to be wider than the operand types, so it can't promote
> > always.  Yes, in some cases it applies promotion if it is desirable for
> > codegen purposes.  But without the promotions explicitly in the IL it
> > doesn't need to rely on VRP to figure out how to expand it exactly.
> >
> > > Exposing this earlier could
> > > enable optimization even
> >
> > Which optimizations?
>
> I was thinking of merging conversions with that implied promotion.
>
> >  We already try to fold the .{ADD,SUB,MUL}_OVERFLOW
> > builtins to constants or non-overflowing arithmetics etc. as soon as we
> > can e.g. using ranges prove the operation will never overflow or will always
> > overflow.  Doing unnecessary promotion (see above that it might not be
> > always possible at all) would just make the IL larger and risk we during
> > expansion actually perform the promotions even when we don't have to.
> > We on the other side already have match.pd rules to undo such promotions
> > in the operands.  See
> > /* Demote operands of IFN_{ADD,SUB,MUL}_OVERFLOW.  */
> > And the result (well, TREE_TYPE of the lhs type) can be yet another type,
> > not related to either of those in any way.
>
> OK, fair enough.  I think this also shows again the lack of documentation
> of internal functions

Re: [x86_64 PATCH] Update STV's gains for TImode arithmetic right shifts on AVX2.

2024-08-25 Thread Uros Bizjak
On Sat, Aug 24, 2024 at 17:11, Roger Sayle wrote:

>
> This patch tweaks timode_scalar_chain::compute_convert_gain to better
> reflect the expansion of V1TImode arithmetic right shifts by the i386
> backend.  The comment "see ix86_expand_v1ti_ashiftrt" appears after
> "case ASHIFTRT" in compute_convert_gain, and the changes below attempt
> to better match the logic used there.
>
> The original motivating example is:
>
> __int128 m1;
> void foo()
> {
>   m1 = (m1 << 8) >> 8;
> }
>
> which with -O2 -mavx2 we fail to convert to vector form due to the
> inappropriate cost of the arithmetic right shift.
>
>   Instruction gain -16 for 7: {r103:TI=r101:TI>>0x8;clobber flags:CC;}
>   Total gain: -3
>   Chain #1 conversion is not profitable
>
> This is reporting that the ASHIFTRT is four instructions worse using
> vectors than in scalar form, which is incorrect as the AVX2 expansion
> of this shift only requires three instructions (and the scalar form
> requires two).
>
> With more accurate costs in timode_scalar_chain::compute_convert_gain
> we now see (with -O2 -mavx2):
>
>   Instruction gain -4 for 7: {r103:TI=r101:TI>>0x8;clobber flags:CC;}
>   Total gain: 9
>   Converting chain #1...
>
> which results in:
>
> foo:
>         vmovdqa m1(%rip), %xmm0
>         vpslldq $1, %xmm0, %xmm0
>         vpsrad  $8, %xmm0, %xmm1
>         vpsrldq $1, %xmm0, %xmm0
>         vpblendd $7, %xmm0, %xmm1, %xmm0
>         vmovdqa %xmm0, m1(%rip)
>         ret
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  No new testcase (yet) as the code for both the
> vector and scalar forms of the above function are still suboptimal
> so code generation is in flux, but this improvement should be a step
> in the right direction.  Ok for mainline?
>
>
> 2024-08-24  Roger Sayle  
>
> gcc/ChangeLog
> * config/i386/i386-features.cc (compute_convert_gain)
> : Update to match ix86_expand_v1ti_ashiftrt.
>
> TARGET_AVX2 always implies TARGET_SSE4_1, so there is no need to OR them
> together.
>

OK with above change.

Thanks,
Uros.

>
>


[committed] Turn off late-combine for a few risc-v specific tests

2024-08-25 Thread Jeff Law
Just minor testsuite adjustments -- several of the shorten-memref tests 
are slightly twiddled by the late-combine pass:




Running /home/jlaw/test/gcc/gcc/testsuite/gcc.target/riscv/riscv.exp ...
FAIL: gcc.target/riscv/shorten-memrefs-2.c   -Os   scan-assembler 
store1a:\n(\t?\\.[^\n]*\n)*\taddi
XPASS: gcc.target/riscv/shorten-memrefs-3.c   -Os   scan-assembler-not 
load2a:\n.*addi[ \t]*[at][0-9],[at][0-9],[0-9]*
FAIL: gcc.target/riscv/shorten-memrefs-5.c   -Os   scan-assembler 
store1a:\n(\t?\\.[^\n]*\n)*\taddi
FAIL: gcc.target/riscv/shorten-memrefs-8.c   -Os   scan-assembler 
store:\n(\t?\\.[^\n]*\n)*\taddi\ta[0-7],a[0-7],1


This patch just turns off the late-combine pass for those tests.
Locally I'd adjusted all the shorten-memrefs tests, but a quick re-test
shows that only 4 tests seem affected right now.


Anyway, pushing to the trunk to slightly clean up our test results.

jeff
commit ab9c4bb54e817948f1a55edfb0f1f0481e4046df
Author: Jeff Law 
Date:   Sun Aug 25 07:06:45 2024 -0600

Turn off late-combine for a few risc-v specific tests

Just minor testsuite adjustments -- several of the shorten-memref tests are
slightly twiddled by the late-combine pass:

> Running /home/jlaw/test/gcc/gcc/testsuite/gcc.target/riscv/riscv.exp ...
> FAIL: gcc.target/riscv/shorten-memrefs-2.c   -Os   scan-assembler 
store1a:\n(\t?\\.[^\n]*\n)*\taddi
> XPASS: gcc.target/riscv/shorten-memrefs-3.c   -Os   scan-assembler-not 
load2a:\n.*addi[ \t]*[at][0-9],[at][0-9],[0-9]*
> FAIL: gcc.target/riscv/shorten-memrefs-5.c   -Os   scan-assembler 
store1a:\n(\t?\\.[^\n]*\n)*\taddi
> FAIL: gcc.target/riscv/shorten-memrefs-8.c   -Os   scan-assembler 
store:\n(\t?\\.[^\n]*\n)*\taddi\ta[0-7],a[0-7],1
This patch just turns off the late-combine pass for those tests.  Locally I'd
adjusted all the shorten-memrefs tests, but a quick re-test shows that only 4
tests seem affected right now.

Anyway, pushing to the trunk to slightly clean up our test results.

gcc/testsuite
* gcc.target/riscv/shorten-memrefs-2.c: Turn off late-combine.
* gcc.target/riscv/shorten-memrefs-3.c: Likewise.
* gcc.target/riscv/shorten-memrefs-5.c: Likewise.
* gcc.target/riscv/shorten-memrefs-8.c: Likewise.

diff --git a/gcc/testsuite/gcc.target/riscv/shorten-memrefs-2.c 
b/gcc/testsuite/gcc.target/riscv/shorten-memrefs-2.c
index a9ddb797d06..29ece481c26 100644
--- a/gcc/testsuite/gcc.target/riscv/shorten-memrefs-2.c
+++ b/gcc/testsuite/gcc.target/riscv/shorten-memrefs-2.c
@@ -1,4 +1,4 @@
-/* { dg-options "-march=rv32imc -mabi=ilp32" } */
+/* { dg-options "-march=rv32imc -mabi=ilp32 -fno-late-combine-instructions" } */
 /* { dg-skip-if "" { *-*-* } { "*" } { "-Os" } } */
 
 /* shorten_memrefs should rewrite these load/stores into a compressible
diff --git a/gcc/testsuite/gcc.target/riscv/shorten-memrefs-3.c 
b/gcc/testsuite/gcc.target/riscv/shorten-memrefs-3.c
index 3d561124b81..273a68c373a 100644
--- a/gcc/testsuite/gcc.target/riscv/shorten-memrefs-3.c
+++ b/gcc/testsuite/gcc.target/riscv/shorten-memrefs-3.c
@@ -1,4 +1,4 @@
-/* { dg-options "-march=rv32imc -mabi=ilp32" } */
+/* { dg-options "-march=rv32imc -mabi=ilp32 -fno-late-combine-instructions" } */
 /* { dg-skip-if "" { *-*-* } { "*" } { "-Os" } } */
 
 /* These loads cannot be compressed because only one compressed reg is
diff --git a/gcc/testsuite/gcc.target/riscv/shorten-memrefs-5.c 
b/gcc/testsuite/gcc.target/riscv/shorten-memrefs-5.c
index 11e858ed6da..f554105f91f 100644
--- a/gcc/testsuite/gcc.target/riscv/shorten-memrefs-5.c
+++ b/gcc/testsuite/gcc.target/riscv/shorten-memrefs-5.c
@@ -1,4 +1,4 @@
-/* { dg-options "-march=rv64imc -mabi=lp64" } */
+/* { dg-options "-march=rv64imc -mabi=lp64 -fno-late-combine-instructions" } */
 /* { dg-skip-if "" { *-*-* } { "*" } { "-Os" } } */
 
 /* shorten_memrefs should rewrite these load/stores into a compressible
diff --git a/gcc/testsuite/gcc.target/riscv/shorten-memrefs-8.c 
b/gcc/testsuite/gcc.target/riscv/shorten-memrefs-8.c
index 3ff6956b33e..d533355409c 100644
--- a/gcc/testsuite/gcc.target/riscv/shorten-memrefs-8.c
+++ b/gcc/testsuite/gcc.target/riscv/shorten-memrefs-8.c
@@ -1,4 +1,4 @@
-/* { dg-options "-march=rv32imc -mabi=ilp32" } */
+/* { dg-options "-march=rv32imc -mabi=ilp32 -fno-late-combine-instructions" } */
 /* { dg-skip-if "" { *-*-* } { "*" } { "-Os" } } */
 
 /* shorten_memrefs should use a correct base address*/


[committed] Fix assembly scan for RISC-V VLS tests

2024-08-25 Thread Jeff Law
Surya's IRA patch from June slightly improves the code we generate for 
the vls/calling-conventions tests on RISC-V.  Specifically it removes an 
unnecessary move from the instruction stream.  This (of course) broke 
those tests:




Running /home/jlaw/test/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp ...
FAIL: gcc.target/riscv/rvv/autovec/vls/calling-convention-1.c -O3 
-ftree-vectorize -mrvv-vector-bits=scalable  scan-assembler-times 
mv\\s+s0,a0\\s+call\\s+memset\\s+mv\\s+a0,s0 3
FAIL: gcc.target/riscv/rvv/autovec/vls/calling-convention-2.c -O3 
-ftree-vectorize -mrvv-vector-bits=scalable  scan-assembler-times 
mv\\s+s0,a0\\s+call\\s+memset\\s+mv\\s+a0,s0 3
FAIL: gcc.target/riscv/rvv/autovec/vls/calling-convention-3.c -O3 
-ftree-vectorize -mrvv-vector-bits=scalable  scan-assembler-times 
mv\\s+s0,a0\\s+call\\s+memset\\s+mv\\s+a0,s0 3
FAIL: gcc.target/riscv/rvv/autovec/vls/calling-convention-4.c -O3 
-ftree-vectorize -mrvv-vector-bits=scalable  scan-assembler-times 
mv\\s+s0,a0\\s+call\\s+memset\\s+mv\\s+a0,s0 3
FAIL: gcc.target/riscv/rvv/autovec/vls/calling-convention-5.c -O3 
-ftree-vectorize -mrvv-vector-bits=scalable  scan-assembler-times 
mv\\s+s0,a0\\s+call\\s+memset\\s+mv\\s+a0,s0 3
FAIL: gcc.target/riscv/rvv/autovec/vls/calling-convention-6.c -O3 
-ftree-vectorize -mrvv-vector-bits=scalable  scan-assembler-times 
mv\\s+s0,a0\\s+call\\s+memset\\s+mv\\s+a0,s0 3
FAIL: gcc.target/riscv/rvv/autovec/vls/calling-convention-7.c -O3 
-ftree-vectorize -mrvv-vector-bits=scalable  scan-assembler-times 
mv\\s+s0,a0\\s+call\\s+memset\\s+mv\\s+a0,s0 3



This patch does the natural adjustment of those tests by dropping the 
moves from the scan.


Pushing to the trunk.

Jeff



commit 4c3485897d3e28ecfbe911f21f83fa047ee8b54b
Author: Jeff Law 
Date:   Sun Aug 25 07:16:50 2024 -0600

[committed] Fix assembly scan for RISC-V VLS tests

Surya's IRA patch from June slightly improves the code we generate for the
vls/calling-conventions tests on RISC-V.  Specifically it removes an
unnecessary move from the instruction stream.  This (of course) broke those
tests:

> Running /home/jlaw/test/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp ...
> FAIL: gcc.target/riscv/rvv/autovec/vls/calling-convention-1.c -O3 
-ftree-vectorize -mrvv-vector-bits=scalable  scan-assembler-times 
mv\\s+s0,a0\\s+call\\s+memset\\s+mv\\s+a0,s0 3
> FAIL: gcc.target/riscv/rvv/autovec/vls/calling-convention-2.c -O3 
-ftree-vectorize -mrvv-vector-bits=scalable  scan-assembler-times 
mv\\s+s0,a0\\s+call\\s+memset\\s+mv\\s+a0,s0 3
> FAIL: gcc.target/riscv/rvv/autovec/vls/calling-convention-3.c -O3 
-ftree-vectorize -mrvv-vector-bits=scalable  scan-assembler-times 
mv\\s+s0,a0\\s+call\\s+memset\\s+mv\\s+a0,s0 3
> FAIL: gcc.target/riscv/rvv/autovec/vls/calling-convention-4.c -O3 
-ftree-vectorize -mrvv-vector-bits=scalable  scan-assembler-times 
mv\\s+s0,a0\\s+call\\s+memset\\s+mv\\s+a0,s0 3
> FAIL: gcc.target/riscv/rvv/autovec/vls/calling-convention-5.c -O3 
-ftree-vectorize -mrvv-vector-bits=scalable  scan-assembler-times 
mv\\s+s0,a0\\s+call\\s+memset\\s+mv\\s+a0,s0 3
> FAIL: gcc.target/riscv/rvv/autovec/vls/calling-convention-6.c -O3 
-ftree-vectorize -mrvv-vector-bits=scalable  scan-assembler-times 
mv\\s+s0,a0\\s+call\\s+memset\\s+mv\\s+a0,s0 3
> FAIL: gcc.target/riscv/rvv/autovec/vls/calling-convention-7.c -O3 
-ftree-vectorize -mrvv-vector-bits=scalable  scan-assembler-times 
mv\\s+s0,a0\\s+call\\s+memset\\s+mv\\s+a0,s0 3

This patch does the natural adjustment of those tests by dropping the moves
from the scan.

gcc/testsuite
* gcc.target/riscv/rvv/autovec/vls/calling-convention-1.c: Update
expected output.
* gcc.target/riscv/rvv/autovec/vls/calling-convention-2.c: Likewise.
* gcc.target/riscv/rvv/autovec/vls/calling-convention-3.c: Likewise.
* gcc.target/riscv/rvv/autovec/vls/calling-convention-4.c: Likewise.
* gcc.target/riscv/rvv/autovec/vls/calling-convention-5.c: Likewise.
* gcc.target/riscv/rvv/autovec/vls/calling-convention-6.c: Likewise.
* gcc.target/riscv/rvv/autovec/vls/calling-convention-7.c: Likewise.

diff --git 
a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/calling-convention-1.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/calling-convention-1.c
index 60c838eb21d..82039f5ac4e 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/calling-convention-1.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/calling-convention-1.c
@@ -145,7 +145,7 @@ DEF_RET1_ARG9 (v4096qi)
 
 // RET1_ARG0 tests
 /* { dg-final { scan-assembler-times {li\s+a[0-1],\s*0} 9 } } */
-/* { dg-final { scan-assembler-times {mv\s+s0,a0\s+call\s+memset\s+mv\s+a0,s0} 3 } } */
+/* { dg-final { scan-assembler-times {call\s+memset} 3 } } */
 
 // v1qi tests: return value (lbu) and function prologue (sb)
 // 1 lbu per test, argnum sb's when args > 1
diff --git 
a/gcc/testsuite/

[committed] Disable late-combine in another RISC-V test

2024-08-25 Thread Jeff Law
Another test where the output was slightly twiddled by late-combine, and
for which simply disabling late-combine seems to be the best option.



Running /home/jlaw/test/gcc/gcc/testsuite/gcc.target/riscv/riscv.exp ...
FAIL: gcc.target/riscv/cm_mv_rv32.c   -Os   check-function-bodies sum



Pushing to the trunk.

Jeff

commit 70edccf88738ec204036e498a4a50c46e5e4f0c0
Author: Jeff Law 
Date:   Sun Aug 25 07:24:56 2024 -0600

Disable late-combine in another RISC-V test

Another test where the output was slightly twiddled by late-combine, and for
which simply disabling late-combine seems to be the best option.

> Running /home/jlaw/test/gcc/gcc/testsuite/gcc.target/riscv/riscv.exp ...
> FAIL: gcc.target/riscv/cm_mv_rv32.c   -Os   check-function-bodies sum

Pushing to the trunk.

gcc/testsuite
* gcc.target/riscv/cm_mv_rv32.c: Disable late-combine.

diff --git a/gcc/testsuite/gcc.target/riscv/cm_mv_rv32.c 
b/gcc/testsuite/gcc.target/riscv/cm_mv_rv32.c
index 2c1b3f9cabf..e2369fc4d2d 100644
--- a/gcc/testsuite/gcc.target/riscv/cm_mv_rv32.c
+++ b/gcc/testsuite/gcc.target/riscv/cm_mv_rv32.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options " -Os -march=rv32i_zca_zcmp -mabi=ilp32 " } */
+/* { dg-options " -Os -march=rv32i_zca_zcmp -mabi=ilp32 -fno-late-combine-instructions " } */
 /* { dg-skip-if "" { *-*-* } {"-O0" "-O1" "-O2" "-Og" "-O3" "-Oz" "-flto"} } */
 /* { dg-final { check-function-bodies "**" "" } } */
 


Re: [PATCH v1 1/2] RISC-V: Add testcases for unsigned scalar .SAT_TRUNC form 4

2024-08-25 Thread Jeff Law




On 8/25/24 12:18 AM, pan2...@intel.com wrote:

From: Pan Li 

This patch would like to add test cases for the unsigned scalar quad and
oct .SAT_TRUNC form 4.  Aka:

Form 4:
  #define DEF_SAT_U_TRUNC_FMT_4(NT, WT)            \
  NT __attribute__((noinline))                     \
  sat_u_trunc_##WT##_to_##NT##_fmt_4 (WT x)        \
  {                                                \
    bool not_overflow = x <= (WT)(NT)(-1);         \
    return ((NT)x) | (NT)((NT)not_overflow - 1);   \
  }

The below test is passed for this patch.
* The rv64gcv regression test.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/sat_arith.h: Add test helper macros.
* gcc.target/riscv/sat_u_trunc-19.c: New test.
* gcc.target/riscv/sat_u_trunc-20.c: New test.
* gcc.target/riscv/sat_u_trunc-21.c: New test.
* gcc.target/riscv/sat_u_trunc-22.c: New test.
* gcc.target/riscv/sat_u_trunc-23.c: New test.
* gcc.target/riscv/sat_u_trunc-24.c: New test.
* gcc.target/riscv/sat_u_trunc-run-19.c: New test.
* gcc.target/riscv/sat_u_trunc-run-20.c: New test.
* gcc.target/riscv/sat_u_trunc-run-21.c: New test.
* gcc.target/riscv/sat_u_trunc-run-22.c: New test.
* gcc.target/riscv/sat_u_trunc-run-23.c: New test.
* gcc.target/riscv/sat_u_trunc-run-24.c: New test.

Both patches in this series are fine.

Thanks,
jeff



Re: [PATCH] testsuite: Run array54.C only for sync_int_long targets

2024-08-25 Thread Jeff Law




On 8/24/24 11:51 AM, Dimitar Dimitrov wrote:

On Tue, Aug 06, 2024 at 10:16:36PM +0300, Dimitar Dimitrov wrote:

The test case uses "atomic", which fails to link on
pru-unknown-elf target due to missing __atomic_load_4 symbol.

Fix by filtering for sync_int_long effective target.  Ensured that the
test still passes for x86_64-pc-linux-gnu.

Ok for master?


Ping.



gcc/testsuite/ChangeLog:

* g++.dg/init/array54.C: Require sync_int_long effective target.

OK
jeff



Re: [PATCH 4/9] RISC-V: Reorder insn cost match order to match corresponding expander match order

2024-08-25 Thread Jeff Law




On 8/22/24 1:46 PM, Patrick O'Neill wrote:

The corresponding expander (riscv-v.cc:expand_const_vector) matches
const_vec_duplicate_p before const_vec_series_p. Reorder to match this
behavior when calculating costs.

gcc/ChangeLog:

* config/riscv/riscv.cc (riscv_const_insns): Relocate.

OK
jeff



Re: [patch,avr] Overhaul avr-ifelse RTL optimization pass

2024-08-25 Thread Jeff Law




On 8/23/24 6:20 AM, Richard Biener wrote:

On Fri, Aug 23, 2024 at 2:16 PM Georg-Johann Lay  wrote:


This patch overhauls the avr-ifelse mini-pass that optimizes
two cbranch insns to one comparison and two branches.

More optimization opportunities are realized, and the code
has been refactored.

No new regressions.  Ok for trunk?

There is currently no avr maintainer, so some global reviewer
might please have a look at this.


I see Denis still listed?  Possibly Jeff can have a look though.
I think Denis is inactive at this point.  I don't really have any 
significant interest in avr, nor do I actually know the architecture. 
So I'm mostly just looking for high level issues rather than diving into 
really thinking about the codegen impact.


IIRC I've asked Georg-Johann if he'd like to take maintainership of the 
avr port, but he declined.  So we're a bit stuck.



Jeff


Re: [RFC] RISC-V: Add cost model asserts

2024-08-25 Thread Jeff Law




On 8/22/24 1:50 PM, Patrick O'Neill wrote:

Applies after the recent 9 patch series:
"RISC-V: Improve const vector costing and expansion"
https://inbox.sourceware.org/gcc-patches/20240822194705.2789364-1-patr...@rivosinc.com/T/#t

This isn't functional due to RTX hash collisions. It was incredibly useful and
helped me catch a few tricky bugs like:
"RISC-V: Handle 0.0 floating point pattern costing"

Current flow is susceptible to hash collisions.

Ideal flow would be:
Costing: Insert into hashmap<hash, vec<pair<rtx, enum>>>
Expand: Check for membership in hashmap
  -> Not in hashmap: ignore, this wasn't costed
  -> In hashmap: Iterate over vec
 -> if RTX not in hashmap: Ignore, this wasn't costed (hash collision)
 -> if RTX in hashmap: Assert enum is expected

Example of hash collision:
hash -663464470:
(const_vector:RVVM4DI repeat [
 (const_int 8589934593 [0x20001])
 ])
hash -663464470:
(const_vector:RVVM4SF repeat [
 (const_double:SF 1.0e+0 [0x0.8p+1])
 (const_double:SF 2.0e+0 [0x0.8p+2])
 ])

Segher pointed out that collisions are inevitable (~80k 32-bit hashes
have a >50% chance of containing a collision).

If this is worth adding then these are the next questions:
* How heavy-weight is it to store a copy of every costed
   const RTX vector (and ideally other costed expressions later).
* Does this belong in release or gated behind rtx checking?
Given that it's currently non-functional and potentially quite 
expensive, RTL checking seems more appropriate than release checking. 
But I think even to get there you'd need to address some of the hash 
collision issues.


I'm not sure if/how to go forward with this.

jeff



Re: LRA: Don't use 0 as initialization for sp_offset

2024-08-25 Thread Jeff Law




On 8/22/24 9:44 AM, Michael Matz wrote:

this is part of making m68k work with LRA.  See PR116374.
m68k has the property that sometimes the elimination offset
between %sp and %argptr is zero.  While setting up the elimination
infrastructure, it is the changes between sp_offset and previous_offset
that feed into insns_with_changed_offsets, which ultimately determines
which marked instructions are looked at.

But the initial values for sp_offset and previous_offset are
also zero.  So if the targets INITIAL_ELIMINATION_OFFSET (called
in update_reg_eliminate) is zero then nothing changes, the
instructions in question don't get into the list to consider and
the sp_offset tracking goes wrong.

Solve this by initializing those members with -1 instead of zero.
An initial offset of that value seems very unlikely, as it's
in word-sized increments.  This then also reveals a problem in
eliminate_regs_in_insn where it always uses sp_offset-previous_offset
as offset adjustment, even in the first_p pass.  That was harmless
when previous_offset was uninitialized as zero.  But all the other
code uses a different idiom of checking for first_p (or rather
update_p which is !replace_p&&!first_p), and using sp_offset directly.
So use that as well in eliminate_regs_in_insn.

PR target/116374
* lra-eliminations.cc (init_elim_table): Use -1 as initializer.
(update_reg_eliminate): Accept -1 as not-yet-used marker.
(eliminate_regs_in_insn): Use previous_sp_offset only when
not first_p.

OK
jeff



Re: LRA: Fix setup_sp_offset

2024-08-25 Thread Jeff Law




On 8/22/24 9:45 AM, Michael Matz wrote:

This is part of making m68k work with LRA.  See PR116429.
In short: setup_sp_offset is internally inconsistent.  It wants to
set up the sp_offset for newly generated instructions.  sp_offset for
an instruction is always the state of the sp-offset right before that
instruction.  For that it starts at the (assumed correct) sp_offset
of the instruction right after the given (new) sequence, and then
iterates that sequence forward simulating its effects on sp_offset.

That can't ever be right: either it needs to start at the front
and simulate forward, or start at the end and simulate backward.
The former seems to be the more natural way.  Funnily the local
variable holding that instruction is also called 'before'.

This changes it to the first variant: start before the sequence,
do one simulation step to get the sp-offset state in front of the
sequence and then continue simulating.

More details: in the problematic testcase we start with this
situation (sp_off before 550 is 0):

   550: [--sp] = 0            sp_off = 0   {pushexthisi_const}
   551: [--sp] = 37           sp_off = -4  {pushexthisi_const}
   552: [--sp] = r37          sp_off = -8  {movsi_m68k2}
   554: [--sp] = r116 - r37   sp_off = -12 {subsi3}
   556: call                  sp_off = -16

insn 554 doesn't match its constraints and needs some reloads:

   Creating newreg=262, assigning class DATA_REGS to r262
   554: r262:SI=r262:SI-r37:SI
   REG_ARGS_SIZE 0x10
 Inserting insn reload before:
   996: r262:SI=r116:SI
 Inserting insn reload after:
   997: [--%sp:SI]=r262:SI

  Considering alt=0 of insn 997:   (0) =g  (1) damSKT
 1 Non pseudo reload: reject++
   overall=1,losers=0,rld_nregs=0
   Choosing alt 0 in insn 997:  (0) =g  (1) damSKT {*movsi_m68k2} 
(sp_off=-16)

Note how insn 997 (the after-reload) now has sp_off=-16 already.  It all
goes downhill from there.  We end up with these insns:

   552: [--sp] = r37        sp_off = -8  {movsi_m68k2}
   996: r262 = r116         sp_off = -12
   554: r262 = r262 - r37   sp_off = -12
   997: [--sp] = r262       sp_off = -16  (!!! should be -12)
   556: call                sp_off = -16

The call insn sp_off remains at the correct -16, but internally it's already
inconsistent here.  If the sp_off before an insn is -16, and that insn
pre_decs sp, then the after-insn sp_off should be -20.

PR target/116429
* lra.cc (setup_sp_offset): Start with sp_offset from
before the new sequence, not from after.
I think you're right that the current code isn't correct, but the
natural question is how in the world this has worked to date.  Though I
guess targets which push arguments are a dying breed (though I would
have expected i386 to have tripped over this at some point).


OK. Though I fear there may be fallout on this one...

jeff



Re: [PATCH] optabs-query: Guard smallest_int_mode_for_size [PR115495].

2024-08-25 Thread Jeff Law




On 8/22/24 7:57 AM, Robin Dapp wrote:

Indeed though that might be a larger change.


I have tested the attached now, aarch64 is still running but
x86 and power10 are bootstrapped and regtested,  riscv regtested.

Hope I didn't miss any target-specific code that I haven't tested.

As the issue is only latent I verified by calling
get_best_mem_extraction_insn directly with a size of 256.

Presumably the change is a bit too clunky to be backported to GCC 14.
Note that I checked this against an internal tree where we saw the same 
failure signature and can confirm that it fixes the problem there as well.


jeff



Re: [PATCH v2] combine.cc (make_more_copies): Copy attributes from the original pseudo, PR115883

2024-08-25 Thread Jeff Law




On 8/21/24 8:48 AM, Hans-Peter Nilsson wrote:

The only thing that's changed with the patch in v2 since the
first version (pinged once) is the commit message.  CC to
the nexts-of-kin as a heads-up.

Regtested cross to cris-elf and native x86_64-linux-gnu at
r15-3043-g64028d626a50.  The gcc.dg/guality/pr54200.c
magically being fixed was also noticed at an earlier
test-run, at r15-1880-gce34fcc572a0.  I see on
gcc-testresults that this test fails for several targets.

Ok to commit?

-- >8 --
The first of the late-combine passes, propagates some of the copies
made during the (in-time-)combine pass in make_more_copies into the
users of the "original" pseudo registers and removes the "old"
pseudos.  That effectively removes attributes such as REG_POINTER,
which matter to LRA.  The quoted PR is for an ICE-manifesting bug that
was exposed by the late-combine pass and went back to hiding with this
patch until commit r15-2937-g3673b7054ec2, the fix for PR116236, when
it was actually fixed.  To wit, this patch is only incidentally
related to that bug.

In other words, the REG_POINTER attribute should not be required for
LRA to work correctly.  This patch merely restores, for those propagated
register-uses, the state as it was before late-combine.

For reasons not investigated, this fixes a failing test
"FAIL: gcc.dg/guality/pr54200.c -Og -DPREVENT_OPTIMIZATION line 20 z == 3"
for x86_64-linux-gnu.

PR middle-end/115883
* combine.cc (make_more_copies): Copy attributes from the original
pseudo to the new copy.
OK, though please use old fashioned C comment style since that's the 
style used everywhere else is combine.cc.  No need to wait for review on 
the comment style change ;-)


jeff



Re: LRA: Fix setup_sp_offset

2024-08-25 Thread H.J. Lu
On Sun, Aug 25, 2024 at 7:30 AM Jeff Law  wrote:
>
>
>
> On 8/22/24 9:45 AM, Michael Matz wrote:
> > This is part of making m68k work with LRA.  See PR116429.
> > In short: setup_sp_offset is internally inconsistent.  It wants to
> > setup the sp_offset for newly generated instructions.  sp_offset for
> > an instruction is always the state of the sp-offset right before that
> > instruction.  For that it starts at the (assumed correct) sp_offset
> > of the instruction right after the given (new) sequence, and then
> > iterates that sequence forward simulating its effects on sp_offset.
> >
> > That can't ever be right: either it needs to start at the front
> > and simulate forward, or start at the end and simulate backward.
> > The former seems to be the more natural way.  Funnily the local
> > variable holding that instruction is also called 'before'.
> >
> > This changes it to the first variant: start before the sequence,
> > do one simulation step to get the sp-offset state in front of the
> > sequence and then continue simulating.
> >
> > More details: in the problematic testcase we start with this
> > situation (sp_off before 550 is 0):
> >
> >550: [--sp] = 0 sp_off = 0  {pushexthisi_const}
> >551: [--sp] = 37sp_off = -4 {pushexthisi_const}
> >552: [--sp] = r37   sp_off = -8 {movsi_m68k2}
> >554: [--sp] = r116 - r37sp_off = -12 {subsi3}
> >556: call   sp_off = -16
> >
> > insn 554 doesn't match its constraints and needs some reloads:
> >
> >Creating newreg=262, assigning class DATA_REGS to r262
> >554: r262:SI=r262:SI-r37:SI
> >REG_ARGS_SIZE 0x10
> >  Inserting insn reload before:
> >996: r262:SI=r116:SI
> >  Inserting insn reload after:
> >997: [--%sp:SI]=r262:SI
> >
> >   Considering alt=0 of insn 997:   (0) =g  (1) damSKT
> >  1 Non pseudo reload: reject++
> >overall=1,losers=0,rld_nregs=0
> >Choosing alt 0 in insn 997:  (0) =g  (1) damSKT {*movsi_m68k2} 
> > (sp_off=-16)
> >
> > Note how insn 997 (the after-reload) now has sp_off=-16 already.  It all
> > goes downhill from there.  We end up with these insns:
> >
> >552: [--sp] = r37   sp_off = -8 {movsi_m68k2}
> >996: r262 = r116sp_off = -12
> >554: r262 = r262 - r37  sp_off = -12
> >997: [--sp] = r262  sp_off = -16  (!!! should be -12)
> >556: call   sp_off = -16
> >
> > The call insn sp_off remains at the correct -16, but internally it's already
> > inconsistent here.  If the sp_off before an insn is -16, and that insn
> > pre_decs sp, then the after-insn sp_off should be -20.
> >
> >   PR target/116429
> >   * lra.cc (setup_sp_offset): Start with sp_offset from
> >   before the new sequence, not from after.
> I think you're right in that the current code isn't correct, but the
> natural question is how in the world has this worked to-date.   Though I
> guess targets which push arguments are a dying breed (though I would
> have expected i386 to have tripped over this at some point).

Is it because i386 pushes the return address on stack?

> OK. Though I fear there may be fallout on this one...
>
> jeff
>


-- 
H.J.


Re: [patch,avr] Overhaul avr-ifelse RTL optimization pass

2024-08-25 Thread Jeff Law




On 8/23/24 6:16 AM, Georg-Johann Lay wrote:

This patch overhauls the avr-ifelse mini-pass that optimizes
two cbranch insns to one comparison and two branches.

More optimization opportunities are realized, and the code
has been refactored.

No new regressions.  Ok for trunk?

There is currently no avr maintainer, so some global reviewer
might please have a look at this.

And one question I have:  avr_optimize_2ifelse() is rewiring
basic blocks using redirect_edge_and_branch().  Does this
require extra pass flags or actions?  Currently the RTL_PASS
data reads:

static const pass_data avr_pass_data_ifelse =
{
   RTL_PASS,  // type
   "",    // name (will be patched)
   OPTGROUP_NONE, // optinfo_flags
   TV_DF_SCAN,    // tv_id
   0, // properties_required
   0, // properties_provided
   0, // properties_destroyed
   0, // todo_flags_start
   TODO_df_finish | TODO_df_verify // todo_flags_finish
};


Johann

p.s. The additional notes on compare-elim / PR115830 can be found
here (pending review):

https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660743.html

--

AVR: Overhaul the avr-ifelse RTL optimization pass.

Mini-pass avr-ifelse realizes optimizations that replace two cbranch
insns with one comparison and two branches.  This patch adds the
following improvements:

- The right operand of the comparisons may also be REGs.
   Formerly only CONST_INT was handled.

- The code of the first comparison is no longer restricted
   to (effectively) EQ.

- When the second cbranch is located in the fallthrough path
   of the first cbranch, then difficult (expensive) comparisons
   can always be avoided.  This may require swapping the branch
   targets.  (When the second cbranch is located after the target
   label of the first one, getting rid of difficult branches
   would require reordering blocks.)

- The code has been cleaned up:  avr_rest_of_handle_ifelse() now
   just scans the insn stream for optimization candidates.  The code
   that actually performs the transformation has been outsourced to
   the new function avr_optimize_2ifelse().

- The code to find a better representation for reg-const_int comparisons
   has been split into two parts:  First try to find codes such that the
   right-hand sides of the comparisons are the same (avr_2comparisons_rhs).
   When this succeeds then one comparison can serve two branches, and
   avr_redundant_compare() tries to get rid of difficult branches that
   may have been introduced by avr_2comparisons_rhs().  This is always
   possible when the second cbranch is located in the fallthrough path
   of the first one, or when the first code is EQ.

Some final notes on why we don't use compare-elim:  1) The two cbranch
insns may come with different scratch operands depending on the chosen
constraint alternatives.  There are cases where the outgoing comparison
requires a scratch but only one incoming cbranch has one.  2) Avoiding
difficult branches can be achieved by rewiring basic blocks.
compare-elim doesn't do that; it doesn't even know the costs of the
branch codes.  3)  avr_2comparisons_rhs() may de-canonicalize a
comparison to achieve its goal.  compare-elim doesn't know how to do
that.  4) There are more reasons, see for example the commit
message and discussion for PR115830.

gcc/
 * config/avr/avr.cc (cfganal.h): Include it.
 (avr_2comparisons_rhs, avr_redundant_compare_regs)
 (avr_strict_signed_p, avr_strict_unsigned_p): New static functions.
 (avr_redundant_compare): Overhaul: Allow more cases.
 (avr_optimize_2ifelse): New static function, outsourced from...
 (avr_rest_of_handle_ifelse): ...this method.
gcc/testsuite/
 * gcc.target/avr/torture/ifelse-c.h: New file.
 * gcc.target/avr/torture/ifelse-d.h: New file.
 * gcc.target/avr/torture/ifelse-q.h: New file.
 * gcc.target/avr/torture/ifelse-r.h: New file.
 * gcc.target/avr/torture/ifelse-c-i8.c: New test.
 * gcc.target/avr/torture/ifelse-d-i8.c: New test.
 * gcc.target/avr/torture/ifelse-q-i8.c: New test.
 * gcc.target/avr/torture/ifelse-r-i8.c: New test.
 * gcc.target/avr/torture/ifelse-c-i16.c: New test.
 * gcc.target/avr/torture/ifelse-d-i16.c: New test.
 * gcc.target/avr/torture/ifelse-q-i16.c: New test.
 * gcc.target/avr/torture/ifelse-r-i16.c: New test.
 * gcc.target/avr/torture/ifelse-c-u16.c: New test.
 * gcc.target/avr/torture/ifelse-d-u16.c: New test.
 * gcc.target/avr/torture/ifelse-q-u16.c: New test.
 * gcc.target/avr/torture/ifelse-r-u16.c: New test.

ifelse-tweak.diff

diff --git a/gcc/config/avr/avr.cc b/gcc/config/avr/avr.cc
index c520b98a178..90606b73114 100644
--- a/gcc/config/avr/avr.cc
+++ b/gcc/config/avr/avr.cc




+
+static rtx
+avr_2comparisons_rhs (rtx_code &cond1, rtx xval1,
+ rtx_code &cond2, rtx xval2, machine_mode mode)
+{
+  HOST_WIDE_INT val1 = INTVAL (xval1);
+  HOST_WIDE_INT val2 = INTVAL (xval2);
+
+  if (

Re: [PATCH v3] RISC-V: Support IMM for operand 0 of ussub pattern

2024-08-25 Thread Jeff Law




On 8/18/24 11:23 PM, pan2...@intel.com wrote:

From: Pan Li 

This patch would like to allow an IMM for operand 0 of the ussub pattern,
i.e. .SAT_SUB(1023, y) as in the example below.

Form 1:
   #define DEF_SAT_U_SUB_IMM_FMT_1(T, IMM) \
   T __attribute__((noinline)) \
   sat_u_sub_imm##IMM##_##T##_fmt_1 (T y)  \
   {   \
 return (T)IMM >= y ? (T)IMM - y : 0;  \
   }

DEF_SAT_U_SUB_IMM_FMT_1(uint64_t, 1023)

Before this patch:
   10   │ sat_u_sub_imm82_uint64_t_fmt_1:
   11   │ li  a5,82
   12   │ bgtua0,a5,.L3
   13   │ sub a0,a5,a0
   14   │ ret
   15   │ .L3:
   16   │ li  a0,0
   17   │ ret

After this patch:
   10   │ sat_u_sub_imm82_uint64_t_fmt_1:
   11   │ li  a5,82
   12   │ sltua4,a5,a0
   13   │ addia4,a4,-1
   14   │ sub a0,a5,a0
   15   │ and a0,a4,a0
   16   │ ret

The below test suites are passed for this patch:
1. The rv64gcv fully regression test.

gcc/ChangeLog:

* config/riscv/riscv.cc (riscv_gen_unsigned_xmode_reg): Add new
func impl to gen xmode rtx reg from operand rtx.
(riscv_expand_ussub): Gen xmode reg for operand 1.
* config/riscv/riscv.md: Allow const_int for operand 1.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/sat_arith.h: Add test helper macro.
* gcc.target/riscv/sat_u_sub_imm-1.c: New test.
* gcc.target/riscv/sat_u_sub_imm-1_1.c: New test.
* gcc.target/riscv/sat_u_sub_imm-1_2.c: New test.
* gcc.target/riscv/sat_u_sub_imm-2.c: New test.
* gcc.target/riscv/sat_u_sub_imm-2_1.c: New test.
* gcc.target/riscv/sat_u_sub_imm-2_2.c: New test.
* gcc.target/riscv/sat_u_sub_imm-3.c: New test.
* gcc.target/riscv/sat_u_sub_imm-3_1.c: New test.
* gcc.target/riscv/sat_u_sub_imm-3_2.c: New test.
* gcc.target/riscv/sat_u_sub_imm-4.c: New test.
* gcc.target/riscv/sat_u_sub_imm-run-1.c: New test.
* gcc.target/riscv/sat_u_sub_imm-run-2.c: New test.
* gcc.target/riscv/sat_u_sub_imm-run-3.c: New test.
* gcc.target/riscv/sat_u_sub_imm-run-4.c: New test.
OK.  I'm assuming we don't have to worry about the case where X is wider 
than Xmode?  ie, a DImode on rv32?



Jeff



Re: [PATCH 1/4] Write CodeView information about enregistered optimized variables

2024-08-25 Thread Jeff Law




On 8/18/24 7:15 PM, Mark Harmstone wrote:

Enable variable tracking when outputting CodeView debug information, and make
it so that we issue debug symbols for optimized variables in registers. This
consists of S_LOCAL symbols, which give the name and the type of local
variables, followed by S_DEFRANGE_REGISTER symbols for the register and the
code for which this applies.

gcc/
* dwarf2codeview.cc (enum cv_sym_type): Add S_LOCAL and
S_DEFRANGE_REGISTER.
(write_s_local): New function.
(write_defrange_register): New function.
(write_optimized_local_variable_loc): New function.
(write_optimized_local_variable): New function.
(write_optimized_function_vars): New function.
(write_function): Call write_optimized_function_vars if variable
tracking enabled.
* dwarf2out.cc (typedef var_loc_view): Move to dwarf2out.h.
(struct dw_loc_list_struct): Likewise.
* dwarf2out.h (typedef var_loc_view): Move from dwarf2out.cc.
(struct dw_loc_list_struct): Likewise.
* opts.cc (finish_options): Enable variable tracking for CodeView.

All four patches in this series are fine.

Thanks,
Jeff



Re: LRA: Fix setup_sp_offset

2024-08-25 Thread Andreas Schwab
On Aug 25 2024, H.J. Lu wrote:

> Is it because i386 pushes the return address on stack?

Like m68k.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."


Re: [PATCH] RISC-V: Bugfix for Duplicate entries for -mtune in --target-help[Bug 116347]

2024-08-25 Thread Jeff Law




On 8/19/24 2:14 AM, shiyul...@iscas.ac.cn wrote:

From: yulong 

This patch tries to fix bug 116347. It changes the name of the micro-arch,
because I think the micro-arch and the core having the same name is what
caused the error.

gcc/ChangeLog:

 * config/riscv/riscv-cores.def (RISCV_TUNE): Rename.
 (RISCV_CORE): Ditto.
Conceptually tuning means things like costs and scheduler model while 
core defines what instructions can be used.   So why are core entries 
showing up under known arguments for the -mtune option?


Jeff


Re: ping: [PATCH] libcpp: Support extended characters for #pragma {push,pop}_macro [PR109704]

2024-08-25 Thread Lewis Hyatt
Hello-

https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642926.html

Monthly ping for this one please :). Thanks...

-Lewis

On Sat, Jul 27, 2024 at 3:09 PM Lewis Hyatt  wrote:
>
> Hello-
>
> https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642926.html
>
> Ping please? Jakub + Jason, hope you don't mind that I CCed you, I saw
> you had your attention on extended character identifiers a bit now :).
> Thanks!
>
> -Lewis
>
> On Fri, Jul 5, 2024 at 4:23 PM Lewis Hyatt  wrote:
> >
> > Hello-
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642926.html
> >
> > May I please ping this one again? It's the largest remaining gap in
> > UTF-8 support for libcpp that I know of. Thanks!
> >
> > -Lewis
> >
> > On Tue, May 28, 2024 at 7:46 PM Lewis Hyatt  wrote:
> > >
> > > Hello-
> > >
> > > May I please ping this one (now for GCC 15)? Thanks!
> > > https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642926.html
> > >
> > > -Lewis
> > >
> > > On Sat, Feb 10, 2024 at 9:02 AM Lewis Hyatt  wrote:
> > > >
> > > > Hello-
> > > >
> > > > https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642926.html
> > > >
> > > > May I please ping this one? Thanks!
> > > >
> > > > On Sat, Jan 13, 2024 at 5:12 PM Lewis Hyatt  wrote:
> > > > >
> > > > > Hello-
> > > > >
> > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109704
> > > > >
> > > > > The below patch fixes the issue noted in the PR that extended 
> > > > > characters
> > > > > cannot appear in the identifier passed to a #pragma push_macro or 
> > > > > #pragma
> > > > > pop_macro. Bootstrap + regtest all languages on x86-64 Linux. Is it 
> > > > > OK for
> > > > > GCC 13 please?


Re: [PATCH 2/2] RISC-V: Constant synthesis by shifting the lower half

2024-08-25 Thread Jeff Law




On 8/8/24 11:10 AM, Raphael Moreira Zinsly wrote:

Improve handling of constants where the high half can be constructed
by shifting the low half.

gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_build_integer): Detect constants
where the higher half is a shift of the lower half.

gcc/testsuite/ChangeLog:
* gcc.target/riscv/synthesis-12.c: New test.
Don't you need to check somewhere that the upper/lower halves are the 
same after an appropriate shift?   It looks like you just assume they are.



Jeff


Re: [patch,avr] Overhaul avr-ifelse RTL optimization pass

2024-08-25 Thread Denis Chertykov
вс, 25 авг. 2024 г. в 17:55, Jeff Law :

>
>
> On 8/23/24 6:20 AM, Richard Biener wrote:
> > On Fri, Aug 23, 2024 at 2:16 PM Georg-Johann Lay  wrote:
> >>
> >> This patch overhauls the avr-ifelse mini-pass that optimizes
> >> two cbranch insns to one comparison and two branches.
> >>
> >> More optimization opportunities are realized, and the code
> >> has been refactored.
> >>
> >> No new regressions.  Ok for trunk?
> >>
> >> There is currently no avr maintainer, so some global reviewer
> >> might please have a look at this.
> >
> > I see Denis still listed?  Possibly Jeff can have a look though.
> I think Denis is inactive at this point.  I don't really have any
> significant interest in avr, nor do I actually know the architecture.
> So I'm mostly just looking for high level issues rather than diving into
> really thinking about the codegen impact.
>
> IIRC I've asked Georg-Johann if he'd like to take maintainership of the
> avr port, but he declined.  So we're a bit stuck.
>

Yes, I was inactive but I'm here.
I'm interested in converting the port to LRA.
Starting to review the patch...

Denis

>
>
> Jeff
>


Re: [PATCH 2/2] RISC-V: Constant synthesis by shifting the lower half

2024-08-25 Thread Jeff Law




On 8/8/24 11:10 AM, Raphael Moreira Zinsly wrote:

Improve handling of constants where the high half can be constructed
by shifting the low half.

gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_build_integer): Detect constants
where the higher half is a shift of the lower half.

gcc/testsuite/ChangeLog:
* gcc.target/riscv/synthesis-12.c: New test.

Oh, nevermind.  The test is a bit later than I expected to find it.

I'd move the test for equality after shifting to a point before you call 
riscv_build_integer_1.  That routine is more expensive than I'd like 
with all the recursive calls and such, so let's do the relatively cheap 
test first and only call riscv_build_integer_1 when there's a reasonable 
chance we can optimize.


This code should also test ALLOW_NEW_PSEUDOS since we need cost 
stability before/after reload.


Repost after those changes.


With this framework I think you could also handle the case where the 
upper/lower vary by just one bit fairly trivially.


ie, when popcount (upper ^ lower) == 1 use binv to flip the bit in the high 
word.  Obviously this only applies when ZBS is enabled.  If you want to 
do this, I'd structure it largely like the shifted case.


And if high is +-2k from low, then there may be a synthesis for that 
case as well.


And if the high word is 3x 5x or 9x the low word, then shadd applies.

Those three additional cases aren't required for this patch to move 
forward.  Just additional enhancements if you want to tackle them.


Jeff


Re: [PATCH 6/9] RISC-V: Emit costs for bool and stepped const vectors

2024-08-25 Thread Jeff Law




On 8/22/24 1:46 PM, Patrick O'Neill wrote:

These cases are handled in the expander
(riscv-v.cc:expand_const_vector). We need the vector builder to detect
these cases so extract that out into a new riscv-v.h header file.

gcc/ChangeLog:

* config/riscv/riscv-v.cc (class rvv_builder): Move to riscv-v.h.
* config/riscv/riscv.cc (riscv_const_insns): Emit placeholder costs for
bool/stepped const vectors.
* config/riscv/riscv-v.h: New file.
OK.  As you noted the pre-commit tester caught a dependency issue with 
the missing return.  I don't mind if you push with that fixed or if you 
wait for the full series to get ACK'd.  Either is fine with me.


jeff



Re: [PATCH 8/9] RISC-V: Move helper functions above expand_const_vector

2024-08-25 Thread Jeff Law




On 8/22/24 1:46 PM, Patrick O'Neill wrote:

These subroutines will be used in expand_const_vector in a future patch.
Relocate so expand_const_vector can use them.

gcc/ChangeLog:

* config/riscv/riscv-v.cc (expand_vector_init_insert_elems): Relocate.
(expand_vector_init_trailing_same_elem): Ditto.

OK.  Commit when convenient.

jeff



Re: [PATCH] c++: Fix overeager Woverloaded-virtual with conversion operators [PR109918]

2024-08-25 Thread Simon Martin
Hi Jason,

On 24 Aug 2024, at 23:59, Simon Martin wrote:

> Hi Jason,
>
> On 24 Aug 2024, at 15:13, Jason Merrill wrote:
>
>> On 8/23/24 12:44 PM, Simon Martin wrote:
>>> We currently emit an incorrect -Woverloaded-virtual warning upon the
>>> following test case
>>>
>>> === cut here ===
>>> struct A {
>>>virtual operator int() { return 42; }
>>>virtual operator char() = 0;
>>> };
>>> struct B : public A {
>>>operator char() { return 'A'; }
>>> };
>>> === cut here ===
>>>
>>> The problem is that warn_hidden relies on get_basefndecls to find the
>>> methods in A possibly hidden by B's operator char(), and gets both the
>>> conversion operators to int and to char.  It eventually wrongly
>>> concludes that the conversion to int is hidden.
>>>
>>> This patch fixes this by filtering out conversion operators to
>>> different types
>>> from the list returned by get_basefndecls.
>>
>> Hmm, same_signature_p already tries to handle comparing conversion
>> operators, why isn't that working?
>>
> It does indeed.
>
> However, `ovl_range (fns)` does not only contain `char B::operator()` -
> for which `any_override` gets true - but also `conv_op_marker` - for
> which `any_override` gets false, causing `seen_non_override` to end up
> true.  Because of that, we run the last loop, which will emit a warning
> for all `base_fndecls` (except `char B::operator()`, which has been
> removed).
>
> We could test `fndecl` and `base_fndecls[k]` against `conv_op_marker` in
> the loop, but we’d still need to inspect the “converting to” type
> in the last loop (for when `warn_overloaded_virtual` is 2).  This would
> make the code much more complex than the current patch.
>
> It would however probably be better if `get_basefndecls` only returned
> the right conversion operator, not all of them.  I’ll draft another
> version of the patch that does that and submit it in this thread.
>
I have explored my suggestion further and it actually ends up more 
complicated than the initial patch.

Please find attached a new revision to fix the reported issue, as well 
as new ones I discovered while testing with -Woverloaded-virtual=2.

It’s pretty close to the initial patch, but (1) adds a missing
“continue;”, (2) fixes a location problem when
-Woverloaded-virtual=2, and (3) adds more test cases.  The commit log is
also more comprehensive, and should describe well the various problems
and why the patch is correct.

Successfully tested on x86_64-pc-linux-gnu; OK for trunk?

Thanks!
   Simon

>
>>> Successfully tested on x86_64-pc-linux-gnu.
>>>
>>> PR c++/109918
>>>
>>> gcc/cp/ChangeLog:
>>>
>>> * class.cc (warn_hidden): Filter out conversion operators to
>>> different
>>> types.
>>>
>>> gcc/testsuite/ChangeLog:
>>>
>>> * g++.dg/warn/Woverloaded-virt5.C: New test.
>>>
>>> ---
>>>   gcc/cp/class.cc   | 33
>>> ---
>>>   gcc/testsuite/g++.dg/warn/Woverloaded-virt5.C | 12 +++
>>>   2 files changed, 34 insertions(+), 11 deletions(-)
>>>   create mode 100644 gcc/testsuite/g++.dg/warn/Woverloaded-virt5.C
>>>
>>> diff --git a/gcc/cp/class.cc b/gcc/cp/class.cc
>>> index fb6c3370950..a8178a31fe8 100644
>>> --- a/gcc/cp/class.cc
>>> +++ b/gcc/cp/class.cc
>>> @@ -3267,18 +3267,29 @@ warn_hidden (tree t)
>>> if (TREE_CODE (fndecl) == FUNCTION_DECL
>>> && DECL_VINDEX (fndecl))
>>>   {
>>> -   /* If the method from the base class has the same
>>> -  signature as the method from the derived class, it
>>> -  has been overridden.  Note that we can't move on
>>> -  after finding one match: fndecl might override
>>> -  multiple base fns.  */
>>> for (size_t k = 0; k < base_fndecls.length (); k++)
>>> - if (base_fndecls[k]
>>> - && same_signature_p (fndecl, base_fndecls[k]))
>>> -   {
>>> - base_fndecls[k] = NULL_TREE;
>>> - any_override = true;
>>> -   }
>>> + {
>>> +   if (!base_fndecls[k])
>>> + continue;
>>> +   /* If FNS is a conversion operator, base_fndecls contains
>>> +  all conversion operators from base classes; we need to
>>> +  remove those converting to a different type.  */
>>> +   if (IDENTIFIER_CONV_OP_P (name)
>>> +   && !same_type_p (DECL_CONV_FN_TYPE (fndecl),
>>> +DECL_CONV_FN_TYPE (base_fndecls[k])))
>>> + {
>>> +   base_fndecls[k] = NULL_TREE;
>>> + }
>>> +   /* If the method from the base class has the same signature
>>> +  as the method from the derived class, it has been
>>> +  overridden.  Note that we can't move on after finding
>>> +  one match: fndecl might override multiple base fns.  */
>>> +   else if

Re: [PATCH v2] RISC-V: Add --with-cmodel configure option

2024-08-25 Thread Jeff Law




On 8/4/24 8:24 PM, Hau Hsu wrote:

Oh, Palmer's patch is here:
[PATCH] RISC-V: Add --with-cmodel configure-time argument
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/641172.html

It doesn't matter to me whose patch gets merged :)
Me neither.  We do need to understand if this is causing other problems 
though.  As I noted earlier, pre-commit testing showed a failure on aarch64:



https://patchwork.sourceware.org/project/gcc/patch/20240802051151.3658614-1-hau@sifive.com/




I'm not familiar enough with that tester to identify what went wrong, 
but it needs to be understood/fixed before we can move forward.


jeff




Re: [PING^2] [PATCH] PR116080: Fix test suite checks for musttail

2024-08-25 Thread Andi Kleen
Andi Kleen  writes:

PING^2 for https://gcc.gnu.org/pipermail/gcc-patches/2024-July/658602.html

This fixes some musttail related test suite failures that cause noise on
various targets.

> Andi Kleen  writes:
>
> I wanted to ping this patch. It fixes test suite noise on various
> targets.
>
> https://gcc.gnu.org/pipermail/gcc-patches/2024-July/658602.html
>
>
>> From: Andi Kleen 
>>
>> This is a new attempt to fix PR116080. The previous try was reverted
>> because it just broke a bunch of tests, hiding the problem.
>>
>> - musttail behaves differently than tailcall at -O0. Some of the test
>> run at -O0, so add separate effective target tests for musttail.
>> - New effective target tests need to use unique file names
>> to make dejagnu caching work
>> - Change the tests to use new targets
>> - Add a external_musttail test to check for target's ability
>> to do tail calls between translation units. This covers some powerpc
>> ABIs.
>>
>> gcc/testsuite/ChangeLog:
>>
>>  PR testsuite/116080
>>  * c-c++-common/musttail1.c: Use musttail target.
>>  * c-c++-common/musttail12.c: Use struct_musttail target.
>>  * c-c++-common/musttail2.c: Use musttail target.
>>  * c-c++-common/musttail3.c: Likewise.
>>  * c-c++-common/musttail4.c: Likewise.
>>  * c-c++-common/musttail7.c: Likewise.
>>  * c-c++-common/musttail8.c: Likewise.
>>  * g++.dg/musttail10.C: Likewise. Replace powerpc checks with
>>  external_musttail.
>>  * g++.dg/musttail11.C: Use musttail target.
>>  * g++.dg/musttail6.C: Use musttail target. Replace powerpc
>>  checks with external_musttail.
>>  * g++.dg/musttail9.C: Use musttail target.
>>  * lib/target-supports.exp: Add musttail, struct_musttail,
>>  external_musttail targets. Remove optimization for musttail.
>>  Use unique file names for musttail.
>> ---
>>  gcc/testsuite/c-c++-common/musttail1.c  |  2 +-
>>  gcc/testsuite/c-c++-common/musttail12.c |  2 +-
>>  gcc/testsuite/c-c++-common/musttail2.c  |  2 +-
>>  gcc/testsuite/c-c++-common/musttail3.c  |  2 +-
>>  gcc/testsuite/c-c++-common/musttail4.c  |  2 +-
>>  gcc/testsuite/c-c++-common/musttail7.c  |  2 +-
>>  gcc/testsuite/c-c++-common/musttail8.c  |  2 +-
>>  gcc/testsuite/g++.dg/musttail10.C   |  4 ++--
>>  gcc/testsuite/g++.dg/musttail11.C   |  2 +-
>>  gcc/testsuite/g++.dg/musttail6.C|  4 ++--
>>  gcc/testsuite/g++.dg/musttail9.C|  2 +-
>>  gcc/testsuite/lib/target-supports.exp   | 30 -
>>  12 files changed, 37 insertions(+), 19 deletions(-)
>>
>> diff --git a/gcc/testsuite/c-c++-common/musttail1.c 
>> b/gcc/testsuite/c-c++-common/musttail1.c
>> index 74efcc2a0bc6..51549672e02a 100644
>> --- a/gcc/testsuite/c-c++-common/musttail1.c
>> +++ b/gcc/testsuite/c-c++-common/musttail1.c
>> @@ -1,4 +1,4 @@
>> -/* { dg-do compile { target { tail_call && { c || c++11 } } } } */
>> +/* { dg-do compile { target { musttail && { c || c++11 } } } } */
>>  /* { dg-additional-options "-fdelayed-branch" { target sparc*-*-* } } */
>>  
>>  int __attribute__((noinline,noclone,noipa))
>> diff --git a/gcc/testsuite/c-c++-common/musttail12.c 
>> b/gcc/testsuite/c-c++-common/musttail12.c
>> index 4140bcd00950..475afc5af3f3 100644
>> --- a/gcc/testsuite/c-c++-common/musttail12.c
>> +++ b/gcc/testsuite/c-c++-common/musttail12.c
>> @@ -1,4 +1,4 @@
>> -/* { dg-do compile { target { struct_tail_call && { c || c++11 } } } } */
>> +/* { dg-do compile { target { struct_musttail && { c || c++11 } } } } */
>>  /* { dg-additional-options "-fdelayed-branch" { target sparc*-*-* } } */
>>  
>>  struct str
>> diff --git a/gcc/testsuite/c-c++-common/musttail2.c 
>> b/gcc/testsuite/c-c++-common/musttail2.c
>> index 86f2c3d77404..1970c4edd670 100644
>> --- a/gcc/testsuite/c-c++-common/musttail2.c
>> +++ b/gcc/testsuite/c-c++-common/musttail2.c
>> @@ -1,4 +1,4 @@
>> -/* { dg-do compile { target { tail_call && { c || c++11 } } } } */
>> +/* { dg-do compile { target { musttail && { c || c++11 } } } } */
>>  
>>  struct box { char field[256]; int i; };
>>  
>> diff --git a/gcc/testsuite/c-c++-common/musttail3.c 
>> b/gcc/testsuite/c-c++-common/musttail3.c
>> index ea9589c59ef2..7499fd6460b4 100644
>> --- a/gcc/testsuite/c-c++-common/musttail3.c
>> +++ b/gcc/testsuite/c-c++-common/musttail3.c
>> @@ -1,4 +1,4 @@
>> -/* { dg-do compile { target { tail_call && { c || c++11 } } } } */
>> +/* { dg-do compile { target { struct_musttail && { c || c++11 } } } } */
>>  
>>  extern int foo2 (int x, ...);
>>  
>> diff --git a/gcc/testsuite/c-c++-common/musttail4.c 
>> b/gcc/testsuite/c-c++-common/musttail4.c
>> index 23f4b5e1cd68..bd6effa4b931 100644
>> --- a/gcc/testsuite/c-c++-common/musttail4.c
>> +++ b/gcc/testsuite/c-c++-common/musttail4.c
>> @@ -1,4 +1,4 @@
>> -/* { dg-do compile { target { tail_call && { c || c++11 } } } } */
>> +/* { dg-do compile { target { musttail && { c || c++11 } } } } */
>>  
>>  struct box { char field[64]; int i; };
>>  
>> diff --git a/gcc/testsuite

[PING^2] [PATCH] Add a bootstrap-native build config

2024-08-25 Thread Andi Kleen
Andi Kleen  writes:

PING^2 for the patch.

(not sure if there is any maintainer to cc here, this is generic build 
infrastructure)

> Andi Kleen  writes:
>
> I wanted to ping this patch:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2024-July/658729.html
>
>
>> From: Andi Kleen 
>>
>> ... that uses -march=native -mtune=native to build a compiler optimized
>> for the host.
>>
>> config/ChangeLog:
>>
>>  * bootstrap-native.mk: New file.
>>
>> gcc/ChangeLog:
>>
>>  * doc/install.texi: Document bootstrap-native.
>> ---
>>  config/bootstrap-native.mk | 1 +
>>  gcc/doc/install.texi   | 6 ++
>>  2 files changed, 7 insertions(+)
>>  create mode 100644 config/bootstrap-native.mk
>>
>> diff --git a/config/bootstrap-native.mk b/config/bootstrap-native.mk
>> new file mode 100644
>> index ..a4a3d8594089
>> --- /dev/null
>> +++ b/config/bootstrap-native.mk
>> @@ -0,0 +1 @@
>> +BOOT_CFLAGS := -march=native -mtune=native $(BOOT_CFLAGS)
>> diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi
>> index 4973f195daf9..29827c5106f8 100644
>> --- a/gcc/doc/install.texi
>> +++ b/gcc/doc/install.texi
>> @@ -3052,6 +3052,12 @@ Removes any @option{-O}-started option from 
>> @code{BOOT_CFLAGS}, and adds
>>  @itemx @samp{bootstrap-Og}
>>  Analogous to @code{bootstrap-O1}.
>>  
>> +@item @samp{bootstrap-native}
>> +@itemx @samp{bootstrap-native}
>> +Optimize the compiler code for the build host, if supported by the
>> +architecture. Note this only affects the compiler, not the targeted
>> +code. If you want the latter, use @samp{--with-cpu}.
>> +
>>  @item @samp{bootstrap-lto}
>>  Enables Link-Time Optimization for host tools during bootstrapping.
>>  @samp{BUILD_CONFIG=bootstrap-lto} is equivalent to adding


Re: [PATCH] Re-add calling emit_clobber in lower-subreg.cc's resolve_simple_move.

2024-08-25 Thread Jeff Law




On 8/14/24 10:20 AM, Xianmiao Qu wrote:



As I described in the commit message, the absence of a clobber could
potentially lead to the register's lifetime occupying the entire function,
according to the algorithm of the 'df_lr_bb_local_compute' function.
And avoiding unnecessary liveness has always been the point of these 
clobbers.


In the (mostly) forgotten past, those clobbers were often paired with 
REG_NO_CONFLICT notes to help the register allocator know that the full 
register value was set, even though it was set in pieces and that 
certain conflicts could be ignored during register allocation.  In both 
cases the notes were dealing with lifetime related problems.


We've been slowly moving to a space where these are much less of a 
concern and in fact we've totally removed REG_NO_CONFLICT blocks from 
the compiler.



[ ... ]



It will still be considered within the LR IN. And its lifetime spans the entire 
function.
Right.  It's not ideal because if its life spans the entire function, 
then it's going to have many more conflicts than are strictly necessary 
which in turn will inhibit good register allocation.



I use the use case in my patch as a use case for RISC-V:
   double foo (double a)
   {
 if (a < 0.0)
   return a + 1.0;
 else if (a > 16.0)
   return a - 3.0;
 else if (a < 300.0)
   return a - 30.0;
 else
   return a;
   }

As RISC-V doesn't expand multi-register moves during the expand phase,
it will generate the following sequence of instructions.
   (insn 7 6 8 (set (reg:DF 136)
   (const_double:DF 0.0 [0x0.0p+0])) "r.c":3:6 -1
(nil))
   (insn 8 7 9 (set (reg:DF 12 a2)
   (reg:DF 136)) "r.c":3:6 -1
(nil))
   (insn 9 8 10 (set (reg:DF 10 a0)
   (reg/v:DF 135 [ a ])) "r.c":3:6 -1
(nil))

These are normal move instructions, and the 'init-regs' pass will not
generate initialization instructions for them. So the CLOBBER instructions
are important for them.
?!?  I don't think that's correct.  Unless perhaps you're dealing with 
rv32, in which case the DFmode moves are multi-register moves.  Is that 
the case?


If that is the case, then one could argue that the problem is really the 
risc-v backend exposing a DFmode move that has to be lowered later.  The 
alternate approach is to not expose DFmode moves for rv32 and let the 
generic expansion code deal with it.


There's a natural tension between when to expose early vs exposing late 
and some things will work better with the former and others with the 
latter.  It's just the nature of the problem.


The general guidance would be to not expose a pattern unless the target 
can natively handle that case -- until such point as there's clear 
evidence that pretending to handle something it can't is better.


I'm relatively new to the risc-v port, so I don't know the history 
behind exposing DFmode patterns for rv32.  So I can't really suggest 
disabling them as that could well be reverting a conscious decision that 
generally gives us better code.  And since I'm mostly focused on rv64, 
I'm not in a position to do the deep analysis necessary to make that a 
viable option.


Hopefully that gives a bit more background here.  That natural tension 
between early exposure and late exposure of the target's actual 
capabilities is always a tough nut -- and decisions are somewhat fluid 
as can be seen by the ongoing discussion around mvconst_internal.


jeff


Re: [PATCH] Re-add calling emit_clobber in lower-subreg.cc's resolve_simple_move.

2024-08-25 Thread Jeff Law




On 8/12/24 10:12 AM, Xianmiao Qu wrote:

The previous patch:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=d8a6945c6ea22efa4d5e42fe1922d2b27953c8cd
aimed to eliminate redundant MOV instructions by removing calling
emit_clobber in lower-subreg.cc's resolve_simple_move.

First, I found that another patch addresses this issue:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=bdf2737cda53a83332db1a1a021653447b05a7e7
and even without removing calling emit_clobber,
the instruction generation is still as expected.

Second, removing the CLOBBER expression will have side effects.
When there is no CLOBBER expression and only SUBREG assignments exist,
according to the logic of the 'df_lr_bb_local_compute' function,
the register will be added to the basic block LR IN set.
This will cause the register's lifetime to span the entire function,
resulting in increased register pressure. Taking the newly added test case
'gcc/testsuite/gcc.target/riscv/pr43644.c' as an example,
removing the CLOBBER expression will lead to spills of some registers.

gcc/:
* lower-subreg.cc (resolve_simple_move): Re-add calling emit_clobber
immediately before moving a multi-word register by parts.

gcc/testsuite/:
* gcc.target/riscv/pr43644.c: New test case.

I've pushed this to the trunk.  Thanks.

jeff



Re: [RFC/RFA][PATCH v2 12/12] Add tests for CRC detection and generation.

2024-08-25 Thread Jeff Law




On 7/26/24 12:07 PM, Mariam Arutunian wrote:

    gcc/testsuite/gcc.dg/torture/

        * crc-(1-29).c: New tests.
        * crc-CCIT-data16-xorOutside_InsideFor.c: Likewise.
        * crc-CCIT-data16.c: Likewise.
        * crc-CCIT-data8.c: Likewise.
        * crc-coremark16-data16.c: Likewise.
        * crc-coremark32-data16.c: Likewise.
        * crc-coremark32-data32.c: Likewise.
        * crc-coremark32-data8.c: Likewise.
        * crc-coremark64-data64.c: Likewise.
        * crc-coremark8-data8.c: Likewise.
        * crc-crc32-data16.c: Likewise.
        * crc-crc32-data24.c: Likewise.
        * crc-crc32-data8.c: Likewise.
        * crc-crc32.c: Likewise.
        * crc-crc64-data32.c: Likewise.
        * crc-crc64-data64.c: Likewise.
        * crc-crc8-data8-loop-xorInFor.c: Likewise.
        * crc-crc8-data8-loop-xorOutsideFor.c: Likewise.
        * crc-crc8-data8-xorOustideFor.c: Likewise.
        * crc-crc8.c: Likewise.
        * crc-linux-(1-5).c: Likewise.
        * crc-not-crc-(1-26).c: Likewise.
        * crc-side-instr-(1-17).c: Likewise.

    Signed-off-by: Mariam Arutunian
I can't recall the state of all this stuff, so I'll go ahead and ACK 
this (perhaps for the Nth time ;-).  Obviously it can't be committed until 
the optimization bits are all in place.


Jeff


Re: [RFC/RFA][PATCH v2 03/12] RISC-V: Add CRC expander to generate faster CRC.

2024-08-25 Thread Jeff Law




On 7/26/24 12:06 PM, Mariam Arutunian wrote:
   If the target is ZBC or ZBKC, it uses the clmul instruction for the CRC 
calculation.
Otherwise, if the target is ZBKB, it generates a table-based CRC, but uses 
the bswap and brev8 instructions for reversing the inputs and the output.

   Add new tests to check CRC generation for ZBC, ZBKC and ZBKB targets.
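The bswap+brev8 combination can be modeled in plain C to see why it yields a full bit reversal: bswap reverses the byte order and brev8 reverses the bits within each byte, so composing the two reverses all bits. The helpers below are an illustrative sketch of that idea, not the expander's actual output:

```c
#include <stdint.h>

/* Model of the bswap instruction: reverse the byte order of a
   64-bit value.  */
static uint64_t
bswap64 (uint64_t x)
{
  uint64_t r = 0;
  for (int i = 0; i < 8; i++)
    r |= ((x >> (8 * i)) & 0xFF) << (8 * (7 - i));
  return r;
}

/* Model of the brev8 instruction: reverse the bits within each
   byte, keeping the bytes in place.  */
static uint64_t
brev8_64 (uint64_t x)
{
  uint64_t r = 0;
  for (int byte = 0; byte < 8; byte++)
    {
      uint64_t b = (x >> (8 * byte)) & 0xFF;
      uint64_t rev = 0;
      for (int bit = 0; bit < 8; bit++)
        if (b & (1u << bit))
          rev |= 1u << (7 - bit);
      r |= rev << (8 * byte);
    }
  return r;
}

/* Full 64-bit bit reversal = bswap composed with brev8
   (the two steps commute).  */
static uint64_t
bitreverse64 (uint64_t x)
{
  return brev8_64 (bswap64 (x));
}
```
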

      gcc/

         * expr.cc (reflect): New function.
         (gf2n_poly_long_div_quotient): Likewise.
         * expr.h (reflect): New function declaration.
         (gf2n_poly_long_div_quotient): Likewise.

      gcc/config/riscv/

         * bitmanip.md (crc_rev4): New expander 
for reversed CRC.

         (crc4): New expander for bit-forward CRC.
         (SUBX1, ANYI1): New iterators.
         * riscv-protos.h (generate_reflecting_code_using_brev): New 
function declaration.

         (expand_crc_using_clmul): Likewise.
         (expand_reversed_crc_using_clmul): Likewise.
         * riscv.cc (generate_reflecting_code_using_brev): New function.
         (expand_crc_using_clmul): Likewise.
         (expand_reversed_crc_using_clmul): Likewise.
         * riscv.md (UNSPEC_CRC, UNSPEC_CRC_REV):  New unspecs.

      gcc/testsuite/gcc.target/riscv/

            * crc-1-zbc.c: New test.
            * crc-1-zbc.c: Likewise.
            * crc-10-zbc.c: Likewise.
            * crc-10-zbkc.c: Likewise.
            * crc-12-zbc.c: Likewise.
            * crc-12-zbkc.c: Likewise.
            * crc-13-zbc.c: Likewise.
            * crc-13-zbkc.c: Likewise.
            * crc-14-zbc.c: Likewise.
            * crc-14-zbkc.c: Likewise.
            * crc-17-zbc.c: Likewise.
            * crc-17-zbkc.c: Likewise.
            * crc-18-zbc.c: Likewise.
            * crc-18-zbkc.c: Likewise.
            * crc-21-zbc.c: Likewise.
            * crc-21-zbkc.c: Likewise.
            * crc-22-rv64-zbc.c: Likewise.
            * crc-22-rv64-zbkb.c: Likewise.
            * crc-22-rv64-zbkc.c: Likewise.
            * crc-23-zbc.c: Likewise.
            * crc-23-zbkc.c: Likewise.
            * crc-4-zbc.c: Likewise.
            * crc-4-zbkb.c: Likewise.
            * crc-4-zbkc.c: Likewise.
            * crc-5-zbc.c: Likewise.
            * crc-5-zbkb.c: Likewise.
            * crc-5-zbkc.c: Likewise.
            * crc-6-zbc.c: Likewise.
            * crc-6-zbkc.c: Likewise.
            * crc-7-zbc.c: Likewise.
            * crc-7-zbkc.c: Likewise.
            * crc-8-zbc.c: Likewise.
            * crc-8-zbkb.c: Likewise.
            * crc-8-zbkc.c: Likewise.
            * crc-9-zbc.c: Likewise.
            * crc-9-zbkc.c: Likewise.
            * crc-CCIT-data16-zbc.c: Likewise.
            * crc-CCIT-data16-zbkc.c: Likewise.
            * crc-CCIT-data8-zbc.c: Likewise.
            * crc-CCIT-data8-zbkc.c: Likewise.
            * crc-coremark-16bitdata-zbc.c: Likewise.
            * crc-coremark-16bitdata-zbkc.c: Likewise.

    Signed-off-by: Mariam Arutunian



0003-RISC-V-Add-CRC-expander-to-generate-faster-CRC.patch

diff --git a/gcc/config/riscv/bitmanip.md b/gcc/config/riscv/bitmanip.md
index 8769a6b818b..9683ac48ef6 100644
--- a/gcc/config/riscv/bitmanip.md
+++ b/gcc/config/riscv/bitmanip.md
@@ -973,3 +973,67 @@
"TARGET_ZBC"
"clmulr\t%0,%1,%2"
[(set_attr "type" "clmul")])
+
+
+;; Iterator for hardware integer modes narrower than XLEN, same as SUBX
+(define_mode_iterator SUBX1 [QI HI (SI "TARGET_64BIT")])
+
+;; Iterator for hardware integer modes narrower than XLEN, same as ANYI
+(define_mode_iterator ANYI1 [QI HI SI (DI "TARGET_64BIT")])

Might as well go ahead and put these into iterators.md.



+
+;; Reversed CRC 8, 16, 32 for TARGET_64
+(define_expand "crc_rev4"
+   ;; return value (calculated CRC)
+  [(set (match_operand:ANYI 0 "register_operand" "=r")
+ ;; initial CRC
+   (unspec:ANYI [(match_operand:ANYI 1 "register_operand" "r")
+ ;; data
+ (match_operand:ANYI1 2 "register_operand" "r")
+ ;; polynomial without leading 1
+ (match_operand:ANYI 3)]
+ UNSPEC_CRC_REV))]
+  /* We don't support the case when data's size is bigger than CRC's size.  */
+  "(((TARGET_ZBKC || TARGET_ZBC) && mode < word_mode)
+|| TARGET_ZBKB) && mode >= mode"
This condition should get reformatted.   Ideally the condition should be 
fairly obvious, but it's fairly obfuscated here.


I would expect that the TARGET_ZBKB likely belongs inside the 
conditional with the other TARGET tests.  Perhaps something like this:



"(TARGET_ZBKB || TARGET_ZBKC || TARGET_ZBC)
 && mode < word_mode
 && mode >= mode"


Or did you mean to allow ZBKB even if the quotient needs 65 bits?  If so 
the condition still needs adjustment, just a different adjustment.



Otherwise this looks fine to me.  Were there any adjustments you were 
considering after working through the aarch64 expansion with Richard S?


jeff




Re: [RFC/RFA][PATCH v2 02/12] Add built-ins and tests for bit-forward and bit-reversed CRCs

2024-08-25 Thread Jeff Law




On 7/26/24 12:05 PM, Mariam Arutunian wrote:
    This patch introduces new built-in functions to GCC for computing 
bit-forward and bit-reversed CRCs.

    These builtins aim to provide efficient CRC calculation capabilities.
    When the target architecture supports CRC operations (as indicated 
by the presence of a CRC optab),

    the builtins will utilize the expander to generate CRC code.
    In the absence of hardware support, the builtins default to 
generating code for a table-based CRC calculation.


    The built-ins are defined as follows:
    __builtin_rev_crc16_data8,
    __builtin_rev_crc32_data8, __builtin_rev_crc32_data16, 
__builtin_rev_crc32_data32
    __builtin_rev_crc64_data8, __builtin_rev_crc64_data16, 
  __builtin_rev_crc64_data32, __builtin_rev_crc64_data64,

    __builtin_crc8_data8,
    __builtin_crc16_data16, __builtin_crc16_data8,
    __builtin_crc32_data8, __builtin_crc32_data16, __builtin_crc32_data32,
    __builtin_crc64_data8, __builtin_crc64_data16, 
  __builtin_crc64_data32, __builtin_crc64_data64


    Each built-in takes three parameters:
    crc: The initial CRC value.
    data: The data to be processed.
    polynomial: The CRC polynomial without the leading 1.

    To validate the correctness of these built-ins, this patch also 
includes additions to the GCC testsuite.
    This enhancement allows GCC to offer developers high-performance CRC 
computation options

    that automatically adapt to the capabilities of the target hardware.
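For reference, the table-based fallback described above can be sketched in plain C. This is only an illustrative model of the stated semantics (initial CRC, data, polynomial without the leading 1, a 256-entry table for the per-byte step); it is not code taken from the patch, and the function names are invented for the sketch:

```c
#include <stdint.h>

/* Bit-forward CRC-8 step over one data byte, bit by bit.  POLY is the
   polynomial without the leading 1, matching the builtin description.  */
static uint8_t
crc8_data8_bitwise (uint8_t crc, uint8_t data, uint8_t poly)
{
  crc ^= data;
  for (int i = 0; i < 8; i++)
    crc = (crc & 0x80) ? (uint8_t) ((crc << 1) ^ poly)
                       : (uint8_t) (crc << 1);
  return crc;
}

/* The table-based variant the builtins fall back to when there is no
   CRC optab: precompute the per-byte step for all 256 byte values.  */
static uint8_t crc8_table[256];

static void
crc8_init_table (uint8_t poly)
{
  for (int b = 0; b < 256; b++)
    crc8_table[b] = crc8_data8_bitwise (0, (uint8_t) b, poly);
}

static uint8_t
crc8_data8_table (uint8_t crc, uint8_t data)
{
  return crc8_table[crc ^ data];
}
```

Because the bitwise step XORs the data in first, one table lookup per byte reproduces the bitwise result exactly.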

    Co-authored-by: Joern Rennecke


    Not complete. May continue the work if these built-ins are needed.

            gcc/

              * builtin-types.def (BT_FN_UINT8_UINT8_UINT8_CONST_SIZE): 
Define.

              (BT_FN_UINT16_UINT16_UINT8_CONST_SIZE): Likewise.
              (BT_FN_UINT16_UINT16_UINT16_CONST_SIZE): Likewise.
              (BT_FN_UINT32_UINT32_UINT8_CONST_SIZE): Likewise.
              (BT_FN_UINT32_UINT32_UINT16_CONST_SIZE): Likewise.
              (BT_FN_UINT32_UINT32_UINT32_CONST_SIZE): Likewise.
              (BT_FN_UINT64_UINT64_UINT8_CONST_SIZE): Likewise.
              (BT_FN_UINT64_UINT64_UINT16_CONST_SIZE): Likewise.
              (BT_FN_UINT64_UINT64_UINT32_CONST_SIZE): Likewise.
              (BT_FN_UINT64_UINT64_UINT64_CONST_SIZE): Likewise.
              * builtins.cc (associated_internal_fn): Handle 
BUILT_IN_CRC8_DATA8,

              BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16,
              BUILT_IN_CRC32_DATA8, BUILT_IN_CRC32_DATA16, 
BUILT_IN_CRC32_DATA32,
              BUILT_IN_CRC64_DATA8, BUILT_IN_CRC64_DATA16, 
BUILT_IN_CRC64_DATA32,

              BUILT_IN_CRC64_DATA64,
              BUILT_IN_REV_CRC8_DATA8,
              BUILT_IN_REV_CRC16_DATA8, BUILT_IN_REV_CRC16_DATA16,
              BUILT_IN_REV_CRC32_DATA8, BUILT_IN_REV_CRC32_DATA16, 
BUILT_IN_REV_CRC32_DATA32.

              (expand_builtin_crc_table_based): New function.
              (expand_builtin): Handle BUILT_IN_CRC8_DATA8,
              BUILT_IN_CRC16_DATA8, BUILT_IN_CRC16_DATA16,
              BUILT_IN_CRC32_DATA8, BUILT_IN_CRC32_DATA16, 
BUILT_IN_CRC32_DATA32,
              BUILT_IN_CRC64_DATA8, BUILT_IN_CRC64_DATA16, 
BUILT_IN_CRC64_DATA32,

              BUILT_IN_CRC64_DATA64,
              BUILT_IN_REV_CRC8_DATA8,
              BUILT_IN_REV_CRC16_DATA8, BUILT_IN_REV_CRC16_DATA16,
              BUILT_IN_REV_CRC32_DATA8, BUILT_IN_REV_CRC32_DATA16, 
BUILT_IN_REV_CRC32_DATA32,
              BUILT_IN_REV_CRC64_DATA8, BUILT_IN_REV_CRC64_DATA16, 
BUILT_IN_REV_CRC64_DATA32,

              BUILT_IN_REV_CRC64_DATA64.
              * builtins.def (BUILT_IN_CRC8_DATA8): New builtin.
              (BUILT_IN_CRC16_DATA8): Likewise.
              (BUILT_IN_CRC16_DATA16): Likewise.
              (BUILT_IN_CRC32_DATA8): Likewise.
              (BUILT_IN_CRC32_DATA16): Likewise.
              (BUILT_IN_CRC32_DATA32): Likewise.
              (BUILT_IN_CRC64_DATA8): Likewise.
              (BUILT_IN_CRC64_DATA16): Likewise.
              (BUILT_IN_CRC64_DATA32): Likewise.
              (BUILT_IN_CRC64_DATA64): Likewise.
              (BUILT_IN_REV_CRC8_DATA8): New builtin.
              (BUILT_IN_REV_CRC16_DATA8): Likewise.
              (BUILT_IN_REV_CRC16_DATA16): Likewise.
              (BUILT_IN_REV_CRC32_DATA8): Likewise.
              (BUILT_IN_REV_CRC32_DATA16): Likewise.
              (BUILT_IN_REV_CRC32_DATA32): Likewise.
              (BUILT_IN_REV_CRC64_DATA8): Likewise.
              (BUILT_IN_REV_CRC64_DATA16): Likewise.
              (BUILT_IN_REV_CRC64_DATA32): Likewise.
              (BUILT_IN_REV_CRC64_DATA64): Likewise.
              * builtins.h (expand_builtin_crc_table_based): New 
function declaration.

              * doc/extend.texi (__builtin_rev_crc8_data8,
              __builtin_rev_crc16_data16, __builtin_rev_crc16_data8,
              __builtin_rev_crc32_data32, __builtin_rev_crc32_data8,
              __builtin_rev_crc32_data16, __builtin_rev_crc64_data64,
         

Re: [RFC/RFA][PATCH v2 01/12] Implement internal functions for efficient CRC computation

2024-08-25 Thread Jeff Law




On 7/26/24 12:05 PM, Mariam Arutunian wrote:
    Add two new internal functions (IFN_CRC, IFN_CRC_REV), to provide 
faster CRC generation.

    One performs bit-forward and the other bit-reversed CRC computation.
    If CRC optabs are supported, they are used for the CRC computation.
    Otherwise, table-based CRC is generated.
    The supported data and CRC sizes are 8, 16, 32, and 64 bits.
    The polynomial is without the leading 1.
    A table with 256 elements is used to store precomputed CRCs.
    For the reflection of inputs and the output, a simple algorithm 
involving

    SHIFT, AND, and OR operations is used.
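That reflection step can be sketched as the classic shift/AND/OR bit reversal, swapping ever-larger groups of bits. This is a model of the idea behind the reflect_32_bit_value helper named in the ChangeLog, under the assumption that it performs a full bit reversal; it is not the exact RTL the patch emits:

```c
#include <stdint.h>

/* Reverse the bit order of a 32-bit value using only shifts,
   ANDs, and ORs, doubling the swapped group size each round.  */
static uint32_t
reflect32 (uint32_t x)
{
  x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u); /* swap bits */
  x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u); /* swap pairs */
  x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu); /* swap nibbles */
  x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu); /* swap bytes */
  x = (x << 16) | (x >> 16);                               /* swap halves */
  return x;
}
```
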

    Co-authored-by: Joern Rennecke


    gcc/

        * doc/md.texi (crc@var{m}@var{n}4,
        crc_rev@var{m}@var{n}4): Document.
        * expr.cc (calculate_crc): New function.
        (assemble_crc_table): Likewise.
        (generate_crc_table): Likewise.
        (calculate_table_based_CRC): Likewise.
        (emit_crc): Likewise.
        (expand_crc_table_based): Likewise.
        (gen_common_operation_to_reflect): Likewise.
        (reflect_64_bit_value): Likewise.
        (reflect_32_bit_value): Likewise.
        (reflect_16_bit_value): Likewise.
        (reflect_8_bit_value): Likewise.
        (generate_reflecting_code_standard): Likewise.
        (expand_reversed_crc_table_based): Likewise.
        * expr.h (generate_reflecting_code_standard): New function 
declaration.

        (expand_crc_table_based): Likewise.
        (expand_reversed_crc_table_based): Likewise.
        * internal-fn.cc: (crc_direct): Define.
        (direct_crc_optab_supported_p): Likewise.
        (expand_crc_optab_fn): New function.
        * internal-fn.def (CRC, CRC_REV): New internal functions.
        * optabs.def (crc_optab, crc_rev_optab): New optabs.

    Signed-off-by: Mariam Arutunian


0001-Implement-internal-functions-for-efficient-CRC-compu.patch

diff --git a/gcc/expr.cc b/gcc/expr.cc
index 1baa39b98eb..c9a049aeecc 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc





+
+/* Converts and moves a CRC value to a target register.
+
+  CRC_MODE is the mode (data type) of the CRC value.
+  CRC is the initial CRC value.
+  OP0 is the target register.  */
+
+void
+emit_crc (machine_mode crc_mode, rtx* crc, rtx* op0)
+{
+  if (GET_MODE_BITSIZE (crc_mode).to_constant () == 32
+  && GET_MODE_BITSIZE (word_mode) == 64)
+{
+  rtx a_low = gen_lowpart_SUBREG (crc_mode, *crc);
I may have asked this before, but is there a reason we're not just using 
gen_lowpart?



Otherwise this looks pretty reasonable.

Just a note.  IIRC, you were having some problems with formatting 
issues.  Rather than forcing you to fight with those, I'm comfortable if 
you send me the final patchkit and I'll do a once-over for formatting 
and commit the final result.


Jeff


[PATCH] libstdc++: Fix @headername for bits/cpp_typetraits.h

2024-08-25 Thread Kim Gräsman
There is no file ext/type_traits; point it to ext/type_traits.h instead.

libstdc++-v3/ChangeLog:

* include/bits/cpp_type_traits.h: Improve doxygen file docs.
---
 libstdc++-v3/include/bits/cpp_type_traits.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libstdc++-v3/include/bits/cpp_type_traits.h
b/libstdc++-v3/include/bits/cpp_type_traits.h
index 4bfb4521e06..ff74c557245 100644
--- a/libstdc++-v3/include/bits/cpp_type_traits.h
+++ b/libstdc++-v3/include/bits/cpp_type_traits.h
@@ -24,7 +24,7 @@

 /** @file bits/cpp_type_traits.h
  *  This is an internal header file, included by other library headers.
- *  Do not attempt to use it directly. @headername{ext/type_traits}
+ *  Do not attempt to use it directly. @headername{ext/type_traits.h}
  */

 // Written by Gabriel Dos Reis 
-- 
2.46.0


[PATCH] libstdc++: Fix @file for target-specific opt_random.h

2024-08-25 Thread Kim Gräsman
A few of these files self-identified as ext/random.tcc; update them to use
the actual basename.

libstdc++-v3/ChangeLog:

* config/cpu/aarch64/opt/ext/opt_random.h: Improve doxygen file
docs.
* config/cpu/i486/opt/ext/opt_random.h: Likewise.
---
 libstdc++-v3/config/cpu/aarch64/opt/ext/opt_random.h | 2 +-
 libstdc++-v3/config/cpu/i486/opt/ext/opt_random.h| 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/libstdc++-v3/config/cpu/aarch64/opt/ext/opt_random.h
b/libstdc++-v3/config/cpu/aarch64/opt/ext/opt_random.h
index ae78aced27e..7f756d1572f 100644
--- a/libstdc++-v3/config/cpu/aarch64/opt/ext/opt_random.h
+++ b/libstdc++-v3/config/cpu/aarch64/opt/ext/opt_random.h
@@ -22,7 +22,7 @@
 // see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
 // .

-/** @file ext/random.tcc
+/** @file ext/opt_random.h
  *  This is an internal header file, included by other library headers.
  *  Do not attempt to use it directly. @headername{ext/random}
  */
diff --git a/libstdc++-v3/config/cpu/i486/opt/ext/opt_random.h
b/libstdc++-v3/config/cpu/i486/opt/ext/opt_random.h
index 0947197af7b..3a3e892e7c3 100644
--- a/libstdc++-v3/config/cpu/i486/opt/ext/opt_random.h
+++ b/libstdc++-v3/config/cpu/i486/opt/ext/opt_random.h
@@ -22,7 +22,7 @@
 // see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
 // .

-/** @file ext/random.tcc
+/** @file ext/opt_random.h
  *  This is an internal header file, included by other library headers.
  *  Do not attempt to use it directly. @headername{ext/random}
  */
-- 
2.46.0


[PATCH] vect: Fix STMT_VINFO_DEF_TYPE check for odd/even widen mult [PR116348]

2024-08-25 Thread Xi Ruoyao
After fixing PR116142 some code started to trigger an ICE with -O3
-march=znver4.  Per Richard Biener who actually made this fix:

"supportable_widening_operation fails at transform time - that's likely
because vectorizable_reduction "puns" defs to internal_def"

so the check should use STMT_VINFO_REDUC_DEF instead of checking if
STMT_VINFO_DEF_TYPE is vect_reduction_def.

gcc/ChangeLog:

PR tree-optimization/116348
* tree-vect-stmts.cc (supportable_widening_operation): Use
STMT_VINFO_REDUC_DEF (x) instead of
STMT_VINFO_DEF_TYPE (x) == vect_reduction_def.

gcc/testsuite/ChangeLog:

PR tree-optimization/116348
* gcc.c-torture/compile/pr116438.c: New test.

Co-authored-by: Richard Biener 
---

Bootstrapped and regtested on x86_64-linux-gnu.  Ok for trunk?

 gcc/testsuite/gcc.c-torture/compile/pr116438.c | 14 ++
 gcc/tree-vect-stmts.cc |  3 +--
 2 files changed, 15 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.c-torture/compile/pr116438.c

diff --git a/gcc/testsuite/gcc.c-torture/compile/pr116438.c 
b/gcc/testsuite/gcc.c-torture/compile/pr116438.c
new file mode 100644
index 000..97ab0181ab8
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/compile/pr116438.c
@@ -0,0 +1,14 @@
+/* { dg-additional-options "-march=znver4" { target x86_64-*-* i?86-*-* } } */
+
+int *a;
+int b;
+long long c, d;
+void
+e (int f)
+{
+  for (; f; f++)
+{
+  d += (long long)a[f] * b;
+  c += (long long)a[f] * 3;
+}
+}
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 385e63163c2..9eb73a59933 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -14193,8 +14193,7 @@ supportable_widening_operation (vec_info *vinfo,
  by STMT is only directly used in the reduction statement.  */
  tree lhs = gimple_assign_lhs (vect_orig_stmt (stmt_info)->stmt);
  stmt_vec_info use_stmt_info = loop_info->lookup_single_use (lhs);
- if (use_stmt_info
- && STMT_VINFO_DEF_TYPE (use_stmt_info) == vect_reduction_def)
+ if (use_stmt_info && STMT_VINFO_REDUC_DEF (use_stmt_info))
return true;
 }
   c1 = VEC_WIDEN_MULT_LO_EXPR;
-- 
2.46.0



[PATCH] expand: Use the correct mode for store flags for popcount [PR116480]

2024-08-25 Thread Andrew Pinski
When expanding a popcount compared for equality with 1 (or rather
__builtin_stdc_has_single_bit), the wrong mode was being used for the mode of
the store flags.  We were using the mode of the argument to popcount, but
since popcount's return value is always int, the mode of the expansion here
should have been the mode of the return type rather than that of the argument.
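As a sanity model of what is being expanded: popcount (x) == 1 is a single-bit test whose store-flag result is an int (0 or 1) regardless of the argument's width, which is why the result mode must come from the return type. A branch-free C equivalent of that test, for illustration only:

```c
#include <stdbool.h>
#include <stdint.h>

/* True iff exactly one bit of X is set, i.e. popcount (x) == 1.
   x != 0 rules out zero; clearing the lowest set bit with
   x & (x - 1) must then yield zero.  */
static bool
has_single_bit64 (uint64_t x)
{
  return x != 0 && (x & (x - 1)) == 0;
}
```
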

Built and tested on aarch64-linux-gnu with no regressions.
Also bootstrapped and tested on x86_64-linux-gnu.

PR middle-end/116480

gcc/ChangeLog:

* internal-fn.cc (expand_POPCOUNT): Use the correct mode
for store flags.

gcc/testsuite/ChangeLog:

* gcc.dg/torture/pr116480-1.c: New test.
* gcc.dg/torture/pr116480-2.c: New test.

Signed-off-by: Andrew Pinski 
---
 gcc/internal-fn.cc| 3 ++-
 gcc/testsuite/gcc.dg/torture/pr116480-1.c | 8 
 gcc/testsuite/gcc.dg/torture/pr116480-2.c | 8 
 3 files changed, 18 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/torture/pr116480-1.c
 create mode 100644 gcc/testsuite/gcc.dg/torture/pr116480-2.c

diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index a96e61e527c..89da13b38ce 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -5311,6 +5311,7 @@ expand_POPCOUNT (internal_fn fn, gcall *stmt)
   bool nonzero_arg = integer_zerop (gimple_call_arg (stmt, 1));
   tree type = TREE_TYPE (arg);
   machine_mode mode = TYPE_MODE (type);
+  machine_mode lhsmode = TYPE_MODE (TREE_TYPE (lhs));
   do_pending_stack_adjust ();
   start_sequence ();
   expand_unary_optab_fn (fn, stmt, popcount_optab);
@@ -5318,7 +5319,7 @@ expand_POPCOUNT (internal_fn fn, gcall *stmt)
   end_sequence ();
   start_sequence ();
   rtx plhs = expand_normal (lhs);
-  rtx pcmp = emit_store_flag (NULL_RTX, EQ, plhs, const1_rtx, mode, 0, 0);
+  rtx pcmp = emit_store_flag (NULL_RTX, EQ, plhs, const1_rtx, lhsmode, 0, 0);
   if (pcmp == NULL_RTX)
 {
 fail:
diff --git a/gcc/testsuite/gcc.dg/torture/pr116480-1.c 
b/gcc/testsuite/gcc.dg/torture/pr116480-1.c
new file mode 100644
index 000..15a5727941c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr116480-1.c
@@ -0,0 +1,8 @@
+/* { dg-do compile { target int128 } } */
+
+int
+foo(unsigned __int128 b)
+{
+  return __builtin_popcountg(b) == 1;
+}
+
diff --git a/gcc/testsuite/gcc.dg/torture/pr116480-2.c 
b/gcc/testsuite/gcc.dg/torture/pr116480-2.c
new file mode 100644
index 000..7bf690283b4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr116480-2.c
@@ -0,0 +1,8 @@
+/* { dg-do compile { target bitint } } */
+
+int
+foo(unsigned _BitInt(127) b)
+{
+  return __builtin_popcountg(b) == 1;
+}
+
-- 
2.43.0



Re: [PATCH] RISC-V: Fix double mode under RV32 not utilize vf

2024-08-25 Thread Jeff Law
On Fri, Jul 19, 2024 at 12:07 PM Jeff Law  wrote:

>
>
> On 7/19/24 2:55 AM, demin.han wrote:
> > Currently, some binops of a vector vs. a double scalar under RV32 can't be
> > translated to vf, only to vfmv+vxx.vv.
> >
> > The cause is that vec_duplicate is also expanded to broadcast for double
> > mode under RV32.  late-combine can't process the expanded broadcast.
> >
> > gcc/ChangeLog:
> >
> >   * config/riscv/vector.md: Add !FLOAT_MODE_P constrain
> >
> > gcc/testsuite/ChangeLog:
> >
> >   * gcc.target/riscv/rvv/autovec/binop/vadd-rv32gcv-nofm.c: Fix test
> >   * gcc.target/riscv/rvv/autovec/binop/vdiv-rv32gcv-nofm.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/binop/vmul-rv32gcv-nofm.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/binop/vsub-rv32gcv-nofm.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_copysign-rv32gcv.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fadd-1.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fadd-2.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fadd-3.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fadd-4.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fma_fnma-1.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fma_fnma-3.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fma_fnma-4.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fma_fnma-5.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fma_fnma-6.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fmax-1.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fmax-2.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fmax-3.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fmax-4.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fmin-1.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fmin-2.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fmin-3.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fmin-4.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fms_fnms-1.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fms_fnms-3.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fms_fnms-4.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fms_fnms-5.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fms_fnms-6.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fmul-1.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fmul-2.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fmul-3.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fmul-4.c: Ditto
> >   * gcc.target/riscv/rvv/autovec/cond/cond_fmul-5.c: Ditto
> It looks like vadd-rv32gcv-nofm still isn't quite right according to the
> pre-commit testing:
>
>   >
>
> https://github.com/ewlu/gcc-precommit-ci/issues/1931#issuecomment-2238752679
>
>
> OK once that's fixed.  No need to wait for another review cycle.
>
There's a reasonable chance late-combine was catching more cases that could
be turned into .vf forms.  That was pretty common when I first looked at
the late-combine changes.

Regardless,  I adjusted the vadd/vsub tests and pushed this to the trunk.

Thanks,
jeff


Re: [PATCH] sched: Don't skip empty block by removing no_real_insns_p [PR108273]

2024-08-25 Thread Jeff Law



So is this patch still relevant, Kewen?

On 12/20/23 2:25 AM, Kewen.Lin wrote:

Hi,

This patch follows Richi's suggestion "scheduling shouldn't
special case empty blocks as they usually do not appear" in
[1], it removes function no_real_insns_p and its uses
completely.

There is a case where one block previously has only one
INSN_P, but while scheduling some other blocks this only
INSN_P gets moved there and the block becomes empty, so
that only a NOTE_P insn remains and gets counted.  But since
this block isn't initially empty, and any NOTE_P gets skipped
in a normal block, the to-be-scheduled count doesn't include
it, which can cause the assertion below to fail:

   /* Sanity check: verify that all region insns were scheduled.  */
   gcc_assert (sched_rgn_n_insns == rgn_n_insns);
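As a toy illustration of the mismatch (a hypothetical Python model, not the scheduler's actual data structures):

```python
# Up-front count: NOTEs are skipped because every block still has a
# real insn at this point.
def count_region_insns(blocks):
    return sum(1 for insns in blocks.values()
               for i in insns if not i.startswith("note"))

# Scheduling count: a block that became empty has only its NOTE left,
# and that NOTE ends up being counted as scheduled.
def count_scheduled_insns(blocks):
    total = 0
    for insns in blocks.values():
        only_notes = all(i.startswith("note") for i in insns)
        for i in insns:
            if i.startswith("note") and not only_notes:
                continue          # NOTEs skipped in normal blocks
            total += 1
    return total

blocks = {"bb1": ["insn_a", "note_1"], "bb2": ["insn_b"]}
rgn_n_insns = count_region_insns(blocks)            # 2

# Cross-block scheduling moves bb1's only real insn into bb2 ...
blocks["bb2"].append(blocks["bb1"].pop(0))          # bb1: ["note_1"]

# ... so the post-scheduling count sees 3 and the assert would trip.
sched_rgn_n_insns = count_scheduled_insns(blocks)   # 3
```

The proposed rgn_init_empty_bb bitmap records which blocks were empty before scheduling, which is exactly the information this toy model needs to reconcile the two counts.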

A bitmap rgn_init_empty_bb is proposed to detect such cases
by recording, before actual scheduling starts, whether each
block is empty initially.  The other changes mainly handle
NOTEs, which weren't expected before but now have to be
dealt with.

Bootstrapped and regress-tested on:
   - powerpc64{,le}-linux-gnu
   - x86_64-redhat-linux
   - aarch64-linux-gnu

Also tested this with superblock scheduling (sched2) turned
on by default, bootstrapped and regress-tested again on the
above triples.  I also tried testing with selective
scheduling 1/2 enabled by default: it bootstrapped and
regress-tested on x86_64-redhat-linux, but failed to build
on powerpc64{,le}-linux-gnu and aarch64-linux-gnu even
without this patch (so those failures are unrelated; I've
filed two PRs for the ones observed on Power).

[1] https://inbox.sourceware.org/gcc-patches/CAFiYyc2hMvbU_+
a47ytnbxf0yrcybwrhru2jdcw5a0px3+n...@mail.gmail.com/

Is it ok for trunk or next stage 1?

BR,
Kewen
-
PR rtl-optimization/108273

gcc/ChangeLog:

* config/aarch64/aarch64.cc (aarch64_sched_adjust_priority): Early
return for NOTE_P.
* haifa-sched.cc (recompute_todo_spec): Likewise.
(setup_insn_reg_pressure_info): Likewise.
(schedule_insn): Handle NOTE_P specially as we don't skip empty block
any more and adopt NONDEBUG_INSN_P somewhere appropriate.
(commit_schedule): Likewise.
(prune_ready_list): Likewise.
(schedule_block): Likewise.
(set_priorities): Likewise.
(fix_tick_ready): Likewise.
(no_real_insns_p): Remove.
* rtl.h (SCHED_GROUP_P): Add NOTE consideration.
* sched-ebb.cc (schedule_ebb): Skip leading labels like notes to ensure
that we never end up with a block containing only a label; remove the
call to no_real_insns_p.
* sched-int.h (no_real_insns_p): Remove declaration.
* sched-rgn.cc (free_block_dependencies): Remove the call to
no_real_insns_p.
(compute_priorities): Likewise.
(schedule_region): Remove the call to no_real_insns_p, check
rgn_init_empty_bb and update rgn_n_insns if needed.
(sched_rgn_local_init): Init rgn_init_empty_bb.
(sched_rgn_local_free): Free rgn_init_empty_bb.
(rgn_init_empty_bb): New static bitmap.
* sel-sched.cc (sel_region_target_finish): Remove the call to
no_real_insns_p.

This largely looks sensible.  One change caught my eye though.

SCHED_GROUP_P IIRC only applies to INSNs.  That bit means something 
different for NOTEs.  I think the change to rtl.h should be backed out, 
which may mean you need further changes into the scheduler infrastructure.


Definitely will need a rebase and retest given the age of the patch.

jeff


Re: [PATCH] ifcvt: Clarify if_info.original_cost.

2024-08-25 Thread Jeff Law



I think Manolis's patches are all in; time to revisit this one?



On 6/12/24 1:54 AM, Robin Dapp wrote:

Hmm, ok.  The bit that confused me most was:

   if (last_needs_comparison != -1)
 {
   end_sequence ();
   start_sequence ();
   ...
 }

which implied that the second attempt was made conditionally.
It seems like it's always used and is an inherent part of the
algorithm.

If the problem is tracking liveness, wouldn't it be better to
iterate over the "then" block in reverse order?  We would start
with the liveness set for the join block and update as we move
backwards through the "then" block.  This liveness set would
tell us whether the current instruction needs to preserve a
particular register.  That should make it possible to do the
transformation in one step, and so avoid the risk that the
second attempt does something that is unexpectedly different
from the first attempt.
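The backward walk described above can be sketched like so (hypothetical Python; `(dest, uses)` pairs stand in for insns and nothing here reflects ifcvt's real data structures):

```python
def backward_liveness(block, live_out):
    """Walk a block in reverse, tracking which registers are live.
    Each insn is a (dest, uses) pair; returns, per insn, whether its
    dest is live afterwards, i.e. must be preserved for later uses."""
    live = set(live_out)
    result = []
    for dest, uses in reversed(block):
        result.append((dest, dest in live))  # must this value survive?
        live.discard(dest)                   # this definition kills dest
        live.update(uses)                    # operands become live
    result.reverse()
    return result

# "then" block: r1 = r2 + r3; r4 = r3 * 2; r2 = r1 + 1
# The join block reads only r2, so r4 turns out to be dead.
block = [("r1", {"r2", "r3"}), ("r4", {"r3"}), ("r2", {"r1"})]
info = backward_liveness(block, live_out={"r2"})
```

Starting from the join block's live set and updating per insn gives, in one pass, exactly the "does this register need preserving" answer the transformation wants.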


I agree that the current approach is rather cumbersome.  Indeed
the second attempt was conditional at first and I changed it to
be unconditional after some patch iterations.
Your reverse-order idea sounds like it should work.  To further
clean up the algorithm we could also make it more explicit
that a "cmov" depends on either the condition or the CC and
basically track two separate paths through the block, one CC
path and one "condition" path.

I can surely do that as a follow up.  It might conflict with
Manolis's changes, though, so his work should probably be in
first.


FWIW, the reason for asking was that it seemed safer to pass
use_cond_earliest back from noce_convert_multiple_sets_1
to noce_convert_multiple_sets, as another parameter,
and then do the adjustment around noce_convert_multiple_sets's
call to targetm.noce_conversion_profitable_p.  That would avoid
the need for a new if_info field, which in turn would make it
less likely that stale information is carried over from one attempt
to the next (e.g. if other ifcvt techniques end up using the same
field in future).


Would something like the attached v4 be OK that uses a parameter
instead (I mean without having refactored the full algorithm)?
At least I changed the comment before the second attempt to
hopefully cause a tiny bit less confusion :)
I haven't fully bootstrapped it yet.

Regards
  Robin

Before noce_find_if_block processes a block it sets up an if_info
structure that holds the original costs.  At that point the costs of
the then/else blocks have not been added so we only care about the
"if" cost.

The code originally used BRANCH_COST for that but was then changed
to COSTS_N_INSNS (2), i.e. a compare and a jump.

This patch computes the jump costs via
   insn_cost (if_info.jump, ...)
under the assumption that the target takes BRANCH_COST into account
when costing a jump instruction.

In noce_convert_multiple_sets, we keep track of the need for the initial
CC comparison.  If we needed it for the generated sequence we add its
cost before default_noce_conversion_profitable_p.

gcc/ChangeLog:

* ifcvt.cc (noce_convert_multiple_sets):  Define
use_cond_earliest and adjust original cost if needed.
(noce_convert_multiple_sets_1): Add param use_cond_earliest.
(noce_process_if_block): Do not subtract CC cost anymore.
(noce_find_if_block): Use insn_cost for costing jump insn.
---
  gcc/ifcvt.cc | 79 +---
  1 file changed, 44 insertions(+), 35 deletions(-)

diff --git a/gcc/ifcvt.cc b/gcc/ifcvt.cc
index 58ed42673e5..2854eea7702 100644
--- a/gcc/ifcvt.cc
+++ b/gcc/ifcvt.cc
@@ -105,7 +105,8 @@ static bool noce_convert_multiple_sets_1 (struct 
noce_if_info *,
  hash_map *,
  auto_vec *,
  auto_vec *,
- auto_vec *, int *);
+ auto_vec *,
+ int *, bool *);
  
  /* Count the number of non-jump active insns in BB.  */
  
@@ -3502,30 +3503,28 @@ noce_convert_multiple_sets (struct noce_if_info *if_info)
  
int last_needs_comparison = -1;
  
+  bool use_cond_earliest = false;
+
bool ok = noce_convert_multiple_sets_1
  (if_info, &need_no_cmov, &rewired_src, &targets, &temporaries,
- &unmodified_insns, &last_needs_comparison);
+ &unmodified_insns, &last_needs_comparison, &use_cond_earliest);
if (!ok)
return false;
  
-  /* If there are insns that overwrite part of the initial
- comparison, we can still omit creating temporaries for
- the last of them.
- As the second try will always create a less expensive,
- valid sequence, we do not need to compare and can discard
- the first one.  */
-  if (last_needs_comparison != -1)
-{
-  end_sequence ();
-  start_sequence ();
-  ok = noce_convert_multiple_sets_1
-   (if_info, &need_no_cmov, &rewired_src, &targets, &temporaries,

Re: [PATCH v5 1/1] RISC-V: Add support for XCVbitmanip extension in CV32E40P

2024-08-25 Thread Jeff Law




On 8/4/24 12:35 PM, Mary Bennett wrote:

Spec: 
github.com/openhwgroup/core-v-sw/blob/master/specifications/corev-builtin-spec.md

Contributors:
   Mary Bennett 
   Nandni Jamnadas 
   Pietra Ferreira 
   Charlie Keaney
   Jessica Mills
   Craig Blackmore 
   Simon Cook 
   Jeremy Bennett 
   Helene Chelin 

gcc/ChangeLog:
* common/config/riscv/riscv-common.cc: Add XCVbitmanip.
* config/riscv/constraints.md: Likewise.
* config/riscv/corev.def: Likewise.
* config/riscv/corev.md: Likewise.
* config/riscv/predicates.md: Likewise.
* config/riscv/riscv-builtins.cc (AVAIL): Likewise.
* config/riscv/riscv-ftypes.def: Likewise.
* config/riscv/riscv.opt: Likewise.
* doc/extend.texi: Add XCVbitmanip builtin documentation.
* doc/sourcebuild.texi: Likewise.

gcc/testsuite/ChangeLog:
* gcc.target/riscv/cv-bitmanip-compile-bclr.c: New test.
* gcc.target/riscv/cv-bitmanip-compile-bclrr.c: New test.
* gcc.target/riscv/cv-bitmanip-compile-bitrev.c: New test.
* gcc.target/riscv/cv-bitmanip-compile-bset.c: New test.
* gcc.target/riscv/cv-bitmanip-compile-bsetr.c: New test.
* gcc.target/riscv/cv-bitmanip-compile-clb.c: New test.
* gcc.target/riscv/cv-bitmanip-compile-cnt.c: New test.
* gcc.target/riscv/cv-bitmanip-compile-extract.c: New test.
* gcc.target/riscv/cv-bitmanip-compile-extractr.c: New test.
* gcc.target/riscv/cv-bitmanip-compile-extractu.c: New test.
* gcc.target/riscv/cv-bitmanip-compile-extractur.c: New test.
* gcc.target/riscv/cv-bitmanip-compile-ff1.c: New test.
* gcc.target/riscv/cv-bitmanip-compile-fl1.c: New test.
* gcc.target/riscv/cv-bitmanip-compile-insert.c: New test.
* gcc.target/riscv/cv-bitmanip-compile-insertr.c: New test.
* gcc.target/riscv/cv-bitmanip-compile-ror.c: New test.
* gcc.target/riscv/cv-bitmanip-fail-compile-bclr.c: New test.
* gcc.target/riscv/cv-bitmanip-fail-compile-bitrev.c: New test.
* gcc.target/riscv/cv-bitmanip-fail-compile-bset.c: New test.
* gcc.target/riscv/cv-bitmanip-fail-compile-extract.c: New test.
* gcc.target/riscv/cv-bitmanip-fail-compile-extractu.c: New test.
* gcc.target/riscv/cv-bitmanip-fail-compile-insert.c: New test.
* lib/target-supports.exp: Add proc for the XCVbitmanip extension.
---



@@ -281,3 +286,14 @@
"Shifting immediate for SIMD shufflei3."
(and (match_code "const_int")
 (match_test "IN_RANGE (ival, -64, -1)")))
+
+(define_constraint "CV_bit_si10"
+  "A 10-bit unsigned immediate for CORE-V bitmanip."
+  (and (match_code "const_int")
+   (match_test "IN_RANGE (ival, 0, 1023)")))
+
+(define_constraint "CV_bit_in10"
+  "A 10-bit unsigned immediate for CORE-V bitmanip insert."
+  (and (match_code "const_int")
+   (and (match_test "IN_RANGE (ival, 0, 1023)")
+   (match_test "(ival & 31) + ((ival >> 5) & 31) <= 32")))
I probably asked this before, but this seems really odd.  Why would we 
need 10 bits for a bitmanip operation?





@@ -2651,3 +2653,189 @@
  }
[(set_attr "type" "branch")
 (set_attr "mode" "none")])
+
+;; XCVBITMANIP builtins
+
+(define_insn "riscv_cv_bitmanip_extract"
+  [(set (match_operand:SI 0 "register_operand" "=r,r")
+(sign_extract:SI
+  (match_operand:SI 1 "register_operand" "r,r")
+  (ashiftrt:SI
+(match_operand:HI 2 "bit_extract_operand" "CV_bit_si10,r")
+(const_int 5))
+  (plus:SI
+(and:SI
+  (match_dup 2)
+  (const_int 31))
(const_int 1))))]
That (ashiftrt (operand) (const_int 5) seems really weird to me and it's 
almost certainly related to the question about needing 10 bits in 
predicates.



+
+(define_insn "riscv_cv_bitmanip_ff1"
+  [(set (match_operand:SI 0 "register_operand" "=r")
+(ctz:SI (match_operand:SI 1 "register_operand" "r")))]
+
+  "TARGET_XCVBITMANIP && !TARGET_64BIT"
+  "cv.ff1\t%0,%1"
+  [(set_attr "type" "bitmanip")
+  (set_attr "mode" "SI")])
+
+(define_insn "riscv_cv_bitmanip_fl1"
+  [(set (match_operand:SI 0 "register_operand" "=r")
+   (minus:SI
+ (const_int 31)
+ (clz:SI (match_operand:SI 1 "register_operand" "r"))))]
+
+  "TARGET_XCVBITMANIP && !TARGET_64BIT"
+  "cv.fl1\t%0,%1"
+  [(set_attr "type" "bitmanip")
+  (set_attr "mode" "SI")])
So presumably only available in 32-bit mode?  Which means they're not 
great candidates for merging with the standard patterns, since those use 
the GPR iterator.  Similarly for popcount.



I'm leaning more and more against trying to merge any of this with 
bitmanip.md.  I think it'll largely come down to whether the single-bit 
manipulation patterns can be merged, and with the weird 10-bit stuff 
above, that doesn't look likely either.



Jeff


Re: [RFC/RFA] [PATCH v2 09/12] Add symbolic execution support.

2024-08-25 Thread Jeff Law




On 8/20/24 5:41 AM, Richard Biener wrote:



So the store-merging variant IIRC tracks a single overall source
only (unless it was extended and I missed that) and operates at
a byte granularity.  I did want to extend it to support vector shuffles
at one point (with two sources then), but didn't get to it.  The
current implementation manages to be quite efficient - mainly due
to the restriction to a single source I guess.

How does that compare to the symbolic execution engine?

What can the symbolic execution engine handle?  The store-merging
machinery can only handle plain copies, can the symbolic
execution engine tell that for example bits 3-7 are bits 1-5 from A
plus constant 9 with appropriately truncated result?
Conceptually this is the kind of thing it's supposed to handle, but 
there may be implementation details that are missing for the case you want.


More importantly, the execution engine has a limited set of expressions 
it knows how to evaluate, so there's a reasonable chance if you feed it 
something more general than what is typically seen in a CRC loop that 
it's going to give up because it doesn't know how to handle more than 
just a few operators.
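To give a flavor of that limitation, here is a hypothetical sketch (far simpler than GCC's actual engine): each bit of a value is either a constant or a symbolic bit of an input, a handful of operators are evaluated, and anything it cannot represent makes it give up.

```python
def sym_input(name, width):
    # Symbolic bits: bit i of input `name`.
    return [(name, i) for i in range(width)]

def sym_const(value, width):
    return [("const", (value >> i) & 1) for i in range(width)]

def sym_shl(bits, n):
    # Shift left: low bits become constant zero, high bits fall off.
    return [("const", 0)] * n + bits[:len(bits) - n]

def sym_and(bits, mask):
    # AND with a constant mask kills the cleared bit positions.
    return [b if (mask >> i) & 1 else ("const", 0)
            for i, b in enumerate(bits)]

def sym_xor(a, b):
    out = []
    for x, y in zip(a, b):
        if x == ("const", 0):
            out.append(y)
        elif y == ("const", 0):
            out.append(x)
        elif x == y:
            out.append(("const", 0))
        else:
            return None          # expression too complex: give up
    return out

a = sym_input("A", 8)
# (A << 2) & 0xFC: bits 2..7 are A's bits 0..5; bits 0..1 are constant 0.
r = sym_and(sym_shl(a, 2), 0xFC)
```

Note that a query like "bits 3-7 are bits 1-5 of A *plus a constant*" needs carry propagation, which this bit-copy representation cannot express; that is exactly the kind of input where such an engine bails out.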






Note we should always have an eye on compile-time complexity,
GCC does get used on multi-megabyte machine-generated sources
that tend to look very uniform - variants with loops and bit operations
supported by symbolic execution would not surprise me.
Which is why it's a two-phase recognition.  It uses simple tests to 
filter out the vast majority of loops, leaving just a few that have a 
good chance of being a CRC for the more expensive verification step 
using the symbolic execution engine.
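Schematically, the two-phase shape looks like this (hypothetical Python with invented names; the real filters and verifier are far more involved):

```python
def maybe_crc_loop(loop):
    # Phase 1: cheap structural filters reject almost every loop
    # before any expensive analysis runs.
    if loop["n_insns"] > 16:                  # CRC loops are tiny
        return False
    if not {"xor", "shift"} <= loop["ops"]:   # must mix xor and shifts
        return False
    # Phase 2: only the few survivors pay for symbolic execution.
    return verify_crc_symbolically(loop)

def verify_crc_symbolically(loop):
    # Stand-in for the expensive symbolic-execution check.
    return loop.get("is_crc", False)

huge = {"n_insns": 400, "ops": {"xor", "shift"}}
tiny = {"n_insns": 8, "ops": {"xor", "shift"}, "is_crc": True}
```

The compile-time cost of the expensive phase is thus bounded by how few loops survive the cheap filters, which is the point being made above.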


jeff



Re: [PATCH v2] RISC-V: More support of vx and vf for autovec comparison

2024-08-25 Thread Jeff Law




On 7/19/24 2:54 AM, demin.han wrote:

There are still some cases which can't utilize vx or vf after the
late-combine pass:

1. integer comparison when the imm isn't in the range [-16, 15]
2. float imm is 0.0
3. DI or DF mode under RV32

This patch fixes the above-mentioned issues.

Tested on RV32 and RV64.

Signed-off-by: demin.han 
gcc/ChangeLog:

* config/riscv/autovec.md: Change register_operand to
nonmemory_operand.
* config/riscv/riscv-v.cc (get_cmp_insn_code): Select code
according to scalar_p.
(expand_vec_cmp): Generate scalar_p and transform op1.
* config/riscv/riscv.cc (riscv_const_insns): Add !FLOAT_MODE_P
constraint.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/cmp/vcond-1.c: Fix and add test

Signed-off-by: demin.han 
---
V2 changes:
   1. remove unnecessary add_integer_operand and related code
   2. fix one format issue
   3. split patch and make it only related to vec cmp

  gcc/config/riscv/autovec.md   |  2 +-
  gcc/config/riscv/riscv-v.cc   | 57 +++
  gcc/config/riscv/riscv.cc |  2 +-
  .../riscv/rvv/autovec/cmp/vcond-1.c   | 48 +++-
  4 files changed, 82 insertions(+), 27 deletions(-)

diff --git a/gcc/config/riscv/autovec.md b/gcc/config/riscv/autovec.md
index d5793acc999..a772153 100644
--- a/gcc/config/riscv/autovec.md
+++ b/gcc/config/riscv/autovec.md
@@ -690,7 +690,7 @@ (define_expand "vec_cmp"
[(set (match_operand: 0 "register_operand")
(match_operator: 1 "comparison_operator"
  [(match_operand:V_VLSF 2 "register_operand")
-  (match_operand:V_VLSF 3 "register_operand")]))]
+  (match_operand:V_VLSF 3 "nonmemory_operand")]))]
"TARGET_VECTOR"
I'm still concerned about this.  We generally want the expander's 
predicate to match what the insn can do.  A "nonmemory_operand" is 
likely going to accept all kinds of things that we don't want.  A 
tighter predicate seems more appropriate.


Also note that a define_expand is primarily used during initial RTL 
generation.  What's more important from a code optimization standpoint 
is the define_insn patterns since that's what passes will test against 
for recognition.


Jeff


Re: [committed] libstdc++: Make std::vector::reference constructor private [PR115098]

2024-08-25 Thread Andrew Pinski
On Fri, Aug 23, 2024 at 5:20 AM Jonathan Wakely  wrote:
>
> Tested x86_64-linux. Pushed to trunk.
>
> -- >8 --
>
> The standard says this constructor should be private.  LWG 4141 proposes
> to remove it entirely. We still need it, but it doesn't need to be
> public.
>
> For std::bitset the default constructor is already private (and never
> even defined) but there's a non-standard constructor that's public, but
> doesn't need to be.

This looks like it broke the pretty-printers testcase:
```
/home/apinski/src/upstream-gcc-isel/gcc/libstdc++-v3/testsuite/libstdc++-prettyprinters/simple.cc:
In function 'int main()':
/home/apinski/src/upstream-gcc-isel/gcc/libstdc++-v3/testsuite/libstdc++-prettyprinters/simple.cc:156:
error: 'std::_Bit_reference::_Bit_reference()' is private within this
context
In file included from
/home/apinski/src/upstream-gcc-isel/gcc/objdir/x86_64-pc-linux-gnu/libstdc++-v3/include/vector:67,
 from
/home/apinski/src/upstream-gcc-isel/gcc/libstdc++-v3/testsuite/libstdc++-prettyprinters/simple.cc:31:
/home/apinski/src/upstream-gcc-isel/gcc/objdir/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/stl_bvector.h:90:
note: declared private here
compiler exited with status 1

...
spawn -ignore SIGHUP
/home/apinski/src/upstream-gcc-isel/gcc/objdir/./gcc/xg++
-shared-libgcc -B/home/apinski/src/upstream-gcc-isel/gcc/objdir/./gcc
-nostdinc++ 
-L/home/apinski/src/upstream-gcc-isel/gcc/objdir/x86_64-pc-linux-gnu/libstdc++-v3/src
-L/home/apinski/src/upstream-gcc-isel/gcc/objdir/x86_64-pc-linux-gnu/libstdc++-v3/src/.libs
-L/home/apinski/src/upstream-gcc-isel/gcc/objdir/x86_64-pc-linux-gnu/libstdc++-v3/libsupc++/.libs
-B/home/apinski/upstream-gcc-isel/x86_64-pc-linux-gnu/bin/
-B/home/apinski/upstream-gcc-isel/x86_64-pc-linux-gnu/lib/ -isystem
/home/apinski/upstream-gcc-isel/x86_64-pc-linux-gnu/include -isystem
/home/apinski/upstream-gcc-isel/x86_64-pc-linux-gnu/sys-include
-fchecking=1 
-B/home/apinski/src/upstream-gcc-isel/gcc/objdir/x86_64-pc-linux-gnu/./libstdc++-v3/src/.libs
-fmessage-length=0 -fno-show-column -ffunction-sections
-fdata-sections -fcf-protection -mshstk -g -O2 -D_GNU_SOURCE
-DLOCALEDIR="." -nostdinc++
-I/home/apinski/src/upstream-gcc-isel/gcc/objdir/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu
-I/home/apinski/src/upstream-gcc-isel/gcc/objdir/x86_64-pc-linux-gnu/libstdc++-v3/include
-I/home/apinski/src/upstream-gcc-isel/gcc/libstdc++-v3/libsupc++
-I/home/apinski/src/upstream-gcc-isel/gcc/libstdc++-v3/include/backward
-I/home/apinski/src/upstream-gcc-isel/gcc/libstdc++-v3/testsuite/util
/home/apinski/src/upstream-gcc-isel/gcc/libstdc++-v3/testsuite/libstdc++-prettyprinters/simple11.cc
-g -O0 -fdiagnostics-plain-output ./libtestc++.a -Wl,--gc-sections
-L/home/apinski/src/upstream-gcc-isel/gcc/objdir/x86_64-pc-linux-gnu/libstdc++-v3/src/filesystem/.libs
-L/home/apinski/src/upstream-gcc-isel/gcc/objdir/x86_64-pc-linux-gnu/libstdc++-v3/src/experimental/.libs
-lm -o ./simple11.exe
/home/apinski/src/upstream-gcc-isel/gcc/libstdc++-v3/testsuite/libstdc++-prettyprinters/simple11.cc:
In function 'int main()':
/home/apinski/src/upstream-gcc-isel/gcc/libstdc++-v3/testsuite/libstdc++-prettyprinters/simple11.cc:149:
error: 'std::_Bit_reference::_Bit_reference()' is private within this
context
In file included from
/home/apinski/src/upstream-gcc-isel/gcc/objdir/x86_64-pc-linux-gnu/libstdc++-v3/include/vector:67,
 from
/home/apinski/src/upstream-gcc-isel/gcc/libstdc++-v3/testsuite/libstdc++-prettyprinters/simple11.cc:31:
/home/apinski/src/upstream-gcc-isel/gcc/objdir/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/stl_bvector.h:90:
note: declared private here
compiler exited with status 1
```

Noticed because of the new UNRESOLVED.

Thanks,
Andrew Pinski




>
> libstdc++-v3/ChangeLog:
>
> PR libstdc++/115098
> * include/bits/stl_bvector.h (_Bit_reference): Make default
> constructor private. Declare vector and bit iterators as
> friends.
> * include/std/bitset (bitset::reference): Make constructor and
> data members private.
> * testsuite/20_util/bitset/115098.cc: New test.
> * testsuite/23_containers/vector/bool/115098.cc: New test.
> ---
>  libstdc++-v3/include/bits/stl_bvector.h  | 12 +---
>  libstdc++-v3/include/std/bitset  |  5 +
>  libstdc++-v3/testsuite/20_util/bitset/115098.cc  | 11 +++
>  .../testsuite/23_containers/vector/bool/115098.cc|  8 
>  4 files changed, 29 insertions(+), 7 deletions(-)
>  create mode 100644 libstdc++-v3/testsuite/20_util/bitset/115098.cc
>  create mode 100644 libstdc++-v3/testsuite/23_containers/vector/bool/115098.cc
>
> diff --git a/libstdc++-v3/include/bits/stl_bvector.h 
> b/libstdc++-v3/include/bits/stl_bvector.h
> index c45b7ff3320..42261ac5915 100644
> --- a/libstdc++-v3/include/bits/stl_bvector.h
> +++ b/libstdc++-v3/include/bits/stl_bvector.h
> @@ -81,6 +81,14 @@ _GLIBCXX_BEGIN_NAMESPACE_C

RE: [PATCH v3] RISC-V: Support IMM for operand 0 of ussub pattern

2024-08-25 Thread Li, Pan2
Thanks Jeff.

> OK.  I'm assuming we don't have to worry about the case where X is wider 
> than Xmode?  ie, a DImode on rv32?

Yes, DImode is excluded by the ANYI iterator for the ussub pattern.

Pan

-Original Message-
From: Jeff Law  
Sent: Sunday, August 25, 2024 11:22 PM
To: Li, Pan2 ; gcc-patches@gcc.gnu.org
Cc: juzhe.zh...@rivai.ai; kito.ch...@gmail.com; rdapp@gmail.com
Subject: Re: [PATCH v3] RISC-V: Support IMM for operand 0 of ussub pattern



On 8/18/24 11:23 PM, pan2...@intel.com wrote:
> From: Pan Li 
> 
> This patch would like to allow an IMM for operand 0 of the ussub
> pattern, aka .SAT_SUB(1023, y) as in the example below.
> 
> Form 1:
>#define DEF_SAT_U_SUB_IMM_FMT_1(T, IMM) \
>T __attribute__((noinline)) \
>sat_u_sub_imm##IMM##_##T##_fmt_1 (T y)  \
>{   \
>  return (T)IMM >= y ? (T)IMM - y : 0;  \
>}
> 
> DEF_SAT_U_SUB_IMM_FMT_1(uint64_t, 1023)
> 
> Before this patch:
>10   │ sat_u_sub_imm82_uint64_t_fmt_1:
>11   │ li  a5,82
>12   │ bgtua0,a5,.L3
>13   │ sub a0,a5,a0
>14   │ ret
>15   │ .L3:
>16   │ li  a0,0
>17   │ ret
> 
> After this patch:
>10   │ sat_u_sub_imm82_uint64_t_fmt_1:
>11   │ li  a5,82
>12   │ sltua4,a5,a0
>13   │ addia4,a4,-1
>14   │ sub a0,a5,a0
>15   │ and a0,a4,a0
>16   │ ret
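The branchless sequence replaces the branch with a mask: `sltu` produces the borrow, `addi -1` turns it into all-ones (no borrow) or zero (borrow), and the `and` selects between the wrapped difference and 0. A quick sanity check of the transform (hypothetical Python over 64-bit values):

```python
MASK64 = (1 << 64) - 1

def sat_sub_branchy(imm, y):
    # (T)IMM >= y ? (T)IMM - y : 0
    return (imm - y) & MASK64 if imm >= y else 0

def sat_sub_branchless(imm, y):
    # sltu a4,a5,a0 ; addi a4,a4,-1 ; sub a0,a5,a0 ; and a0,a4,a0
    borrow = 1 if imm < y else 0        # sltu
    mask = (borrow - 1) & MASK64        # addi -1: 0 -> all-ones, 1 -> 0
    diff = (imm - y) & MASK64           # sub (wraps modulo 2**64)
    return diff & mask                  # and

# Exhaustive-ish check around the interesting boundaries.
for imm in (82, 1023):
    for y in (0, 1, 81, 82, 83, 1023, 1024, MASK64):
        assert sat_sub_branchy(imm, y) == sat_sub_branchless(imm, y)
```

Both forms agree for all inputs, which is why the straight-line version is a safe replacement for the branchy one.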
> 
> The below test suites are passed for this patch:
> 1. The rv64gcv fully regression test.
> 
> gcc/ChangeLog:
> 
>   * config/riscv/riscv.cc (riscv_gen_unsigned_xmode_reg): Add new
>   func impl to gen xmode rtx reg from operand rtx.
>   (riscv_expand_ussub): Gen xmode reg for operand 1.
>   * config/riscv/riscv.md: Allow const_int for operand 1.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.target/riscv/sat_arith.h: Add test helper macro.
>   * gcc.target/riscv/sat_u_sub_imm-1.c: New test.
>   * gcc.target/riscv/sat_u_sub_imm-1_1.c: New test.
>   * gcc.target/riscv/sat_u_sub_imm-1_2.c: New test.
>   * gcc.target/riscv/sat_u_sub_imm-2.c: New test.
>   * gcc.target/riscv/sat_u_sub_imm-2_1.c: New test.
>   * gcc.target/riscv/sat_u_sub_imm-2_2.c: New test.
>   * gcc.target/riscv/sat_u_sub_imm-3.c: New test.
>   * gcc.target/riscv/sat_u_sub_imm-3_1.c: New test.
>   * gcc.target/riscv/sat_u_sub_imm-3_2.c: New test.
>   * gcc.target/riscv/sat_u_sub_imm-4.c: New test.
>   * gcc.target/riscv/sat_u_sub_imm-run-1.c: New test.
>   * gcc.target/riscv/sat_u_sub_imm-run-2.c: New test.
>   * gcc.target/riscv/sat_u_sub_imm-run-3.c: New test.
>   * gcc.target/riscv/sat_u_sub_imm-run-4.c: New test.
OK.  I'm assuming we don't have to worry about the case where X is wider 
than Xmode?  ie, a DImode on rv32?


Jeff



RE: [PATCH v1] Vect: Promote unsigned .SAT_ADD constant operand for vectorizable_call

2024-08-25 Thread Li, Pan2
Thanks Richard for comments and confirmation.

> Instead pattern recognition of .SAT_ADD should promote/demote the invariants -

Got it, will try to reconcile the types in .SAT_ADD for const_int.

> What I read is that .ADD_OVERFLOW produces a value that is equal to the
> twos-complement add of its arguments promoted/demoted to the result type,
> correct?

Yes, that makes sense to me.

Pan

-Original Message-
From: Richard Biener  
Sent: Sunday, August 25, 2024 3:42 PM
To: Li, Pan2 
Cc: Jakub Jelinek ; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH v1] Vect: Promote unsigned .SAT_ADD constant operand for 
vectorizable_call

On Sat, Aug 24, 2024 at 1:31 PM Li, Pan2  wrote:
>
> Thanks Jakub and Richard for the explanation and help; I will double-check
> the saturate matching for the strict const_int check.
>
> Back to the case below: do we still need some ad-hoc step to unblock the
> type check in vectorizable_call?
> For example, the const_int 9u may have int type for .SAT_ADD(uint8_t, 9u).
> Or we have somewhere else to make the vectorizable_call happy.

I don't see how vectorizable_call itself can handle this since it
doesn't have any idea
about the type requirements.  Instead pattern recognition of .SAT_ADD should
promote/demote the invariants - of course there might be correctness
issues involved
with matching .ADD_OVERFLOW in the first place.  What I read is that
.ADD_OVERFLOW
produces a value that is equal to the twos-complement add of its arguments
promoted/demoted to the result type, correct?
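Those semantics can be modeled concretely (a hypothetical Python sketch, using uint8_t as the result type, matching the .SAT_ADD(uint8_t, 9u) case being discussed):

```python
def add_overflow_u8(a, b):
    """Model of __builtin_add_overflow (a, b, &ret) with uint8_t ret:
    the sum is computed in infinite precision (Python ints), then
    checked against the result type's range; ret always holds the
    truncated two's-complement result."""
    exact = a + b                  # arbitrary precision
    ret = exact & 0xFF             # two's-complement truncation to uint8_t
    overflow = not (0 <= exact <= 0xFF)
    return overflow, ret

# The saturating idiom: overflow ? -1 : ret, for uint8_t.
def sat_add_u8(x, imm=9):
    overflow, ret = add_overflow_u8(x, imm)
    return 0xFF if overflow else ret
```

Note how the argument types never matter to the model: only the exact sum and the result type's range do, which is why the IL can carry mixed-type operands.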

Richard.

> #define DEF_VEC_SAT_U_ADD_IMM_FMT_3(T, IMM)  \
> T __attribute__((noinline))  \
> vec_sat_u_add_imm##IMM##_##T##_fmt_3 (T *out, T *in, unsigned limit) \
> {\
>   unsigned i;\
>   T ret; \
>   for (i = 0; i < limit; i++)\
> {\
>   out[i] = __builtin_add_overflow (in[i], IMM, &ret) ? -1 : ret; \
> }\
> }
>
> DEF_VEC_SAT_U_ADD_IMM_FMT_3(uint8_t, 9u)
>
> Pan
>
> -Original Message-
> From: Richard Biener 
> Sent: Friday, August 23, 2024 6:53 PM
> To: Jakub Jelinek 
> Cc: Li, Pan2 ; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH v1] Vect: Promote unsigned .SAT_ADD constant operand for 
> vectorizable_call
>
> On Thu, Aug 22, 2024 at 8:36 PM Jakub Jelinek  wrote:
> >
> > On Tue, Aug 20, 2024 at 01:52:35PM +0200, Richard Biener wrote:
> > > On Sat, Aug 17, 2024 at 11:18 PM Jakub Jelinek  wrote:
> > > >
> > > > On Sat, Aug 17, 2024 at 05:03:14AM +, Li, Pan2 wrote:
> > > > > Please feel free to let me know if there is anything I can do to fix 
> > > > > this issue. Thanks a lot.
> > > >
> > > > There is no bug.  The operands of .{ADD,SUB,MUL}_OVERFLOW don't have to 
> > > > have the same type, as described in the 
> > > > __builtin_{add,sub,mul}_overflow{,_p} documentation, each argument can 
> > > > have different type and result yet another one, the behavior is then 
> > > > (as if) to perform the operation in infinite precision and if that 
> > > > result fits into the result type, there is no overflow, otherwise there 
> > > > is.
> > > > So, there is no need to promote anything.
> > >
> > > Hmm, it's a bit awkward to have this state in the IL.
> >
> > Why?  These aren't the only internal functions which have different types
> > of arguments, from the various widening ifns, conditional ifns,
> > scatter/gather, ...  Even the WIDEN_*EXPR trees do have type differences
> > among arguments.
> > And it matches what the user builtin does.
> >
> > Furthermore, at least without _BitInt (but even with _BitInt at the maximum
> > precision too) this might not be even possible.
> > E.g. if there is __builtin_add_overflow with unsigned __int128 and __int128
> > arguments and there are no wider types there is simply no type to use for 
> > both
> > arguments, it would need to be a signed type with at least 129 bits...
> >
> > > I see that
> > > expand_arith_overflow eventually applies
> > > promotion, namely to the type of the LHS.
> >
> > The LHS doesn't have to be wider than the operand types, so it can't promote
> > always.  Yes, in some cases it applies promotion if it is desirable for
> > codegen purposes.  But without the promotions explicitly in the IL it
> > doesn't need to rely on VRP to figure out how to expand it exactly.
> >
> > > Exposing this earlier could
> > > enable optimization even
> >
> > Which optimizations?
>
> I was thinking of merging conversions with that implied promotion.
>
> >  We already try to fold the .{ADD,SUB,MUL}_OVERFLOW
> > builtins to constants or non-overflowing arithmetics etc. as soon as we
> > can e.g. using ranges prove the operation w

Re: [PATCH 00/12] AVX10.2: Support new instructions

2024-08-25 Thread Hongtao Liu
On Mon, Aug 19, 2024 at 4:57 PM Haochen Jiang  wrote:
>
> Hi all,
>
> The AVX10.2 ymm rounding patches has been merged to trunk around
> 6 hours ago. As mentioned before, next step will be AVX10.2 new
> instruction support.
>
> This patch series could be divided into three part.
>
> The first patch will refactor m512-check.h under testsuite to reuse
> AVX-512 helper functions and unions and avoid ABI warnings when using
> AVX10.
>
> The following ten patches will support all AVX10.2 new instructions,
> including:
>
>   - AI Datatypes, Conversions, and post-Convolution Instructions.
>   - Media Acceleration.
>   - IEEE-754-2019 Minimum and Maximum Support.
>   - Saturating Conversions.
>   - Zero-extending Partial Vector Copies.
>   - FP Scalar Comparison.
>
> For FP Scalar Comparison part (a.k.a comx instructions), we will only
> provide pattern support but not intrin support since it is redundant
> with comi ones for common usage. We will also add some optimizations
> afterwards for common usage with comx instructions. If there are some
> strong requests, we will add intrin support in the future.
>
> The final patch will add bf8 -> fp16 intrin for convenience. Since the
> conversion from bf8 to fp16 is only casting for fraction part due to
> same bits for exponent part, we will use a sequence of instructions
> instead of new instructions. It is just like the scenario for bf16 ->
> fp32 conversion.
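As an aside on why that conversion is just a cast: assuming "bf8" here is the FP8 E5M2 format (5 exponent bits, same as fp16, plus 2 mantissa bits), the widening is exact and amounts to appending zero mantissa bits, just as bf16 -> fp32 is a left shift by 16. A hypothetical sketch:

```python
import struct

def e5m2_to_fp16_bits(b):
    # Same sign/exponent field widths, so widening appends 8 zero
    # mantissa bits (cf. bf16 -> fp32 being "<< 16").
    return b << 8

def fp16_bits_to_float(h):
    # Decode IEEE binary16 via struct's half-precision "e" format.
    return struct.unpack("<e", h.to_bytes(2, "little"))[0]

def e5m2_to_float(b):
    # Reference decoder for FP8 E5M2 (bias 15).
    sign = -1.0 if b & 0x80 else 1.0
    exp = (b >> 2) & 0x1F
    man = b & 0x3
    if exp == 0:                     # subnormal: man * 2**-2 * 2**-14
        return sign * man * 2.0 ** -16
    if exp == 0x1F:                  # inf / NaN
        return sign * float("inf") if man == 0 else float("nan")
    return sign * (1 + man / 4.0) * 2.0 ** (exp - 15)

# Every non-NaN E5M2 value survives the widening exactly.
for b in range(256):
    if (b >> 2) & 0x1F == 0x1F and b & 0x3:      # skip NaNs
        continue
    assert fp16_bits_to_float(e5m2_to_fp16_bits(b)) == e5m2_to_float(b)
```

Because every bit pattern round-trips exactly, a shift-based instruction sequence is enough and no dedicated conversion instruction is needed.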
>
> After all these patches are merged, the next step would be optimizations
> based on AVX10.2 new instructions, including vnni vectorization, bf16
> vectorization, comx optimization, etc.
>
> Bootstrapped on x86-64-pc-linux-gnu. Ok for trunk?
Ok for all 12 patches.
>
> Thx,
> Haochen
>


-- 
BR,
Hongtao


Re: [PATCHv4, expand] Add const0 move checking for CLEAR_BY_PIECES optabs

2024-08-25 Thread Hongtao Liu
On Fri, Aug 23, 2024 at 5:46 PM HAO CHEN GUI  wrote:
>
> Hi Hongtao,
>
> > On 2024/8/23 11:47, Hongtao Liu wrote:
> > On Fri, Aug 23, 2024 at 11:03 AM HAO CHEN GUI  wrote:
> >>
> >> Hi Hongtao,
> >>
> >> 在 2024/8/23 9:47, Hongtao Liu 写道:
> >>> On Thu, Aug 22, 2024 at 4:06 PM HAO CHEN GUI  
> >>> wrote:
> 
>  Hi Hongtao,
> 
>  在 2024/8/21 11:21, Hongtao Liu 写道:
> > r15-3058-gbb42c551905024 support const0 operand for movv16qi, please
> > rebase your patch and see if there's still the regressions.
> 
 There are still regressions. The patch enables V16QI const0 stores, but
 it also enables V8QI const0 stores. The vector mode is preferred over
 the scalar mode, so V8QI is used for an 8-byte memory clear instead of
 DI. That's sub-optimal.
> >>> Could we check if mode_size is greater than HOST_BITS_PER_WIDE_INT?
> >> Not sure if all targets prefer it. Richard & Jeff, what's your opinion?
> >>
> >> IMHO, could we disable it from predicate or convert it to DI mode store
> >> if V8QI const0 store is sub-optimal on i386?
> >>
> >>
> 
>  Another issue is it takes lots of subreg to generate an all-zero
>  V16QI register sometime. As PR92080 has been fixed, it can't reuse
>  existing all-zero V16QI register.
> > Backend rtx_cost needs to be adjusted to prevent const0 propagation.
> > The current rtx_cost for const0 for i386 is 0, which will enable
> > propagation of const0.
> >
> >/* If MODE2 is appropriate for an MMX register, then tie
> > @@ -21588,10 +21590,12 @@ ix86_rtx_costs (rtx x, machine_mode mode,
> > int outer_code_i, int opno,
> > case 0:
> >   break;
> > case 1:  /* 0: xor eliminates false dependency */
> > - *total = 0;
> > + /* Add extra cost 1 to prevent propagation of CONST_VECTOR
> > +for SET, which will enable more CSE optimization.  */
> > + *total = 0 + (outer_code == SET);
> >   return true;
> > default: /* -1: cmp contains false dependency */
> > - *total = 1;
> > + *total = 1 + (outer_code == SET);
> >   return true;
> > }
> >
> > the upper hunk should help for that.
> Sorry, I didn't get your point. Which problem will it fix? I tested the
> above code; nothing changed. Which kind of const0 propagation do you want
> to prevent?
The patch itself doesn't enable CSE for const0_rtx, but it's needed
after cse_insn recognizes CONST0_RTX with a different mode and
replaces it with a subreg.
I thought you had changed the cse_insn part.
On the other hand, pxor is cheap; what matters more is the CSE of
broadcasting the same value to different modes, i.e.:

__m512i sinkz;
__m256i sinky;
void foo(char c) {
sinkz = _mm512_set1_epi8(c);
sinky = _mm256_set1_epi8(c);
}

>
> Thanks
> Gui Haochen
>
> 
>  (insn 16 15 17 (set (reg:V4SI 118)
>  (const_vector:V4SI [
>  (const_int 0 [0]) repeated x4
>  ])) "auto-init-7.c":25:12 -1
>   (nil))
> 
>  (insn 17 16 18 (set (reg:V8HI 117)
>  (subreg:V8HI (reg:V4SI 118) 0)) "auto-init-7.c":25:12 -1
>   (nil))
> 
>  (insn 18 17 19 (set (reg:V16QI 116)
>  (subreg:V16QI (reg:V8HI 117) 0)) "auto-init-7.c":25:12 -1
>   (nil))
> 
>  (insn 19 18 0 (set (mem/c:V16QI (plus:DI (reg:DI 114)
>  (const_int 12 [0xc])) [0 MEM  [(void 
>  *)&temp3]+12 S16 A32])
>  (reg:V16QI 116)) "auto-init-7.c":25:12 -1
>   (nil))
> >>> I think those subregs can be simplified by later rtl passes?
> >>
> >> Here is the final dump. There are two all-zero 16-byte vector
> >> registers. It can't figure out that V4SI could be a subreg of V16QI.
> >>
> >> (insn 14 56 15 2 (set (reg:V16QI 20 xmm0 [115])
> >> (const_vector:V16QI [
> >> (const_int 0 [0]) repeated x16
> >> ])) "auto-init-7.c":25:12 2154 {movv16qi_internal}
> >>  (nil))
> >> (insn 15 14 16 2 (set (mem/c:V16QI (reg:DI 0 ax [114]) [0 MEM  
> >> [(void *)&temp3]+0 S16 A128])
> >> (reg:V16QI 20 xmm0 [115])) "auto-init-7.c":25:12 2154 
> >> {movv16qi_internal}
> >>  (nil))
> >> (insn 16 15 19 2 (set (reg:V4SI 20 xmm0 [118])
> >> (const_vector:V4SI [
> >> (const_int 0 [0]) repeated x4
> >> ])) "auto-init-7.c":25:12 2160 {movv4si_internal}
> >>  (nil))
> >> (insn 19 16 57 2 (set (mem/c:V16QI (plus:DI (reg:DI 0 ax [114])
> >> (const_int 12 [0xc])) [0 MEM  [(void 
> >> *)&temp3]+12 S16 A32])
> >> (reg:V16QI 20 xmm0 [116])) "auto-init-7.c":25:12 2154 
> >> {movv16qi_internal}
> >>
> >> Thanks
> >> Gui Haochen
> >>
> 
>  Thanks
>  Gui Haochen
> >>>
> >>>
> >>>
> >
> >
> >



-- 
BR,
Hongtao


Re: [PATCH] RISC-V: Bugfix for Duplicate entries for -mtune in --target-help[Bug 116347]

2024-08-25 Thread Jiawei



在 2024/8/25 23:38, Jeff Law 写道:



On 8/19/24 2:14 AM, shiyul...@iscas.ac.cn wrote:

From: yulong 

This patch tries to fix bug 116347. I change the name of the micro-arch,
because I think the micro-arch and the core having the same name is what
caused the error.


gcc/ChangeLog:

 * config/riscv/riscv-cores.def (RISCV_TUNE): Rename.
 (RISCV_CORE): Ditto.
Conceptually, tuning means things like costs and the scheduler model, while 
a core defines what instructions can be used.  So why are core entries 
showing up under known arguments for the -mtune option?


Jeff


In the current definition, different cores need to be configured with a 
corresponding tuning in `riscv-cores.def`, so we can reuse the core name 
in the '-mtune' option.



BR,

Jiawei



[PATCH v3] Match: Support form 1 for scalar signed integer .SAT_ADD

2024-08-25 Thread pan2 . li
From: Pan Li 

This patch adds support for form 1 of the scalar signed
integer .SAT_ADD, i.e. the example below:

Form 1:
  #define DEF_SAT_S_ADD_FMT_1(T, UT, MIN, MAX) \
  T __attribute__((noinline))  \
  sat_s_add_##T##_fmt_1 (T x, T y) \
  {\
T sum = (UT)x + (UT)y; \
return (x ^ y) < 0 \
  ? sum\
  : (sum ^ x) >= 0 \
? sum  \
: x < 0 ? MIN : MAX;   \
  }

DEF_SAT_S_ADD_FMT_1(int64_t, uint64_t, INT64_MIN, INT64_MAX)
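
As a quick sanity check of this source form, here it is instantiated for
int8_t so it can be verified exhaustively against a plain widen-and-clamp
reference (the reference function below is for illustration only and is not
part of the patch):

```c
#include <stdint.h>
#include <assert.h>

/* The same source form as above, instantiated for int8_t.  Note the
   narrowing conversion back to T is implementation-defined in ISO C;
   GCC defines it as modular wrap.  */
#define DEF_SAT_S_ADD_FMT_1(T, UT, MIN, MAX) \
T sat_s_add_##T##_fmt_1 (T x, T y)           \
{                                            \
  T sum = (UT)x + (UT)y;                     \
  return (x ^ y) < 0                         \
    ? sum        /* different signs: overflow impossible */     \
    : (sum ^ x) >= 0                         \
      ? sum      /* sign preserved: no overflow */              \
      : x < 0 ? MIN : MAX; /* overflow: saturate by x's sign */ \
}

DEF_SAT_S_ADD_FMT_1 (int8_t, uint8_t, INT8_MIN, INT8_MAX)

/* Reference implementation: widen, add, clamp.  */
static int8_t
sat_s_add_ref (int8_t x, int8_t y)
{
  int s = (int) x + (int) y;
  return s < INT8_MIN ? INT8_MIN : s > INT8_MAX ? INT8_MAX : (int8_t) s;
}
```

Looping over all 65536 int8_t pairs confirms the two agree, which is the
saturation property the match.pd pattern relies on.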

We can tell the difference before and after this patch if the backend
implements the ssadd3 pattern, as shown below.

Before this patch:
   4   │ __attribute__((noinline))
   5   │ int64_t sat_s_add_int64_t_fmt_1 (int64_t x, int64_t y)
   6   │ {
   7   │   int64_t sum;
   8   │   long unsigned int x.0_1;
   9   │   long unsigned int y.1_2;
  10   │   long unsigned int _3;
  11   │   long int _4;
  12   │   long int _5;
  13   │   int64_t _6;
  14   │   _Bool _11;
  15   │   long int _12;
  16   │   long int _13;
  17   │   long int _14;
  18   │   long int _16;
  19   │   long int _17;
  20   │
  21   │ ;;   basic block 2, loop depth 0
  22   │ ;;pred:   ENTRY
  23   │   x.0_1 = (long unsigned int) x_7(D);
  24   │   y.1_2 = (long unsigned int) y_8(D);
  25   │   _3 = x.0_1 + y.1_2;
  26   │   sum_9 = (int64_t) _3;
  27   │   _4 = x_7(D) ^ y_8(D);
  28   │   _5 = x_7(D) ^ sum_9;
  29   │   _17 = ~_4;
  30   │   _16 = _5 & _17;
  31   │   if (_16 < 0)
  32   │ goto ; [41.00%]
  33   │   else
  34   │ goto ; [59.00%]
  35   │ ;;succ:   3
  36   │ ;;4
  37   │
  38   │ ;;   basic block 3, loop depth 0
  39   │ ;;pred:   2
  40   │   _11 = x_7(D) < 0;
  41   │   _12 = (long int) _11;
  42   │   _13 = -_12;
  43   │   _14 = _13 ^ 9223372036854775807;
  44   │ ;;succ:   4
  45   │
  46   │ ;;   basic block 4, loop depth 0
  47   │ ;;pred:   2
  48   │ ;;3
  49   │   # _6 = PHI 
  50   │   return _6;
  51   │ ;;succ:   EXIT
  52   │
  53   │ }

After this patch:
   4   │ __attribute__((noinline))
   5   │ int64_t sat_s_add_int64_t_fmt_1 (int64_t x, int64_t y)
   6   │ {
   7   │   int64_t _4;
   8   │
   9   │ ;;   basic block 2, loop depth 0
  10   │ ;;pred:   ENTRY
  11   │   _4 = .SAT_ADD (x_5(D), y_6(D)); [tail call]
  12   │   return _4;
  13   │ ;;succ:   EXIT
  14   │
  15   │ }
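
For reference, the fused overflow test visible in the "before" dump
(_4 = x ^ y; _5 = x ^ sum; _17 = ~_4; _16 = _5 & _17; then _16 < 0) is
equivalent to the two-branch condition in the source form. A small sketch,
using int8_t so the equivalence can be checked exhaustively (function names
are illustrative):

```c
#include <stdint.h>
#include <assert.h>

/* Overflow of signed addition: x and y have the same sign and the
   wrapped sum's sign differs from x's sign.  */
static int
overflow_branchy (int8_t x, int8_t y, int8_t sum)
{
  return (x ^ y) >= 0 && (sum ^ x) < 0;
}

/* The fused form from the dump: the sign bit of (sum ^ x) & ~(x ^ y).
   Since the int8_t operands are sign-extended to int, testing < 0
   checks exactly that bit.  */
static int
overflow_fused (int8_t x, int8_t y, int8_t sum)
{
  return ((sum ^ x) & ~(x ^ y)) < 0;
}
```

Both predicates agree on all inputs, which is why the combined form in the
dump selects the saturated value exactly when the branchy source form does.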

The test suites below pass for this patch.
* The rv64gcv fully regression test.
* The x86 bootstrap test.
* The x86 fully regression test.

gcc/ChangeLog:

* match.pd: Add the matching for signed .SAT_ADD.
* tree-ssa-math-opts.cc (gimple_signed_integer_sat_add): Add new
matching func decl.
(match_unsigned_saturation_add): Try signed .SAT_ADD and rename
to ...
(match_saturation_add): ... here.
(math_opts_dom_walker::after_dom_children): Update the above renamed
func from caller.

Signed-off-by: Pan Li 
---
 gcc/match.pd  | 18 ++
 gcc/tree-ssa-math-opts.cc | 35 ++-
 2 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/gcc/match.pd b/gcc/match.pd
index 78f1957e8c7..b059e313415 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3192,6 +3192,24 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (if (INTEGRAL_TYPE_P (type) && TYPE_UNSIGNED (type)
   && types_match (type, @0
 
+/* Signed saturation add, case 1:
+   T sum = (UT)X + (UT)Y;
+   SAT_S_ADD = (X ^ Y) < 0
+ ? sum
+ : (sum ^ x) >= 0
+   ? sum
+   : x < 0 ? MIN : MAX;
+   T and UT are type pair like T=int8_t, UT=uint8_t.  */
+(match (signed_integer_sat_add @0 @1)
+ (cond^ (lt (bit_and:c (bit_xor:c @0 (convert@2 (plus:c (convert @0)
+   (convert @1
+  (bit_not (bit_xor:c @0 @1)))
+   integer_zerop)
+   (bit_xor:c (negate (convert (lt @0 integer_zerop))) max_value)
+   @2)
+ (if (INTEGRAL_TYPE_P (type) && !TYPE_UNSIGNED (type)
+  && types_match (type, @0, @1
+
 /* Unsigned saturation sub, case 1 (branch with gt):
SAT_U_SUB = X > Y ? X - Y : 0  */
 (match (unsigned_integer_sat_sub @0 @1)
diff --git a/gcc/tree-ssa-math-opts.cc b/gcc/tree-ssa-math-opts.cc
index 8d96a4c964b..3c93fca5b53 100644
--- a/gcc/tree-ssa-math-opts.cc
+++ b/gcc/tree-ssa-math-opts.cc
@@ -4023,6 +4023,8 @@ extern bool gimple_unsigned_integer_sat_add (tree, tree*, 
tree (*)(tree));
 extern bool gimple_unsigned_integer_sat_sub (tree, tree*, tree (*)(tree));
 extern bool gimple_unsigned_integer_sat_trunc (tree, tree*, tree (*)(tree));
 
+extern bool gimple_signed_integer_sat_add (tree, tree*, tree (*)(tree));
+
 static void
 build_saturation_binary_arith_call (gimple_stmt_iterator *gsi, inte

Re: [PATCH v3] RISC-V: Support IMM for operand 0 of ussub pattern

2024-08-25 Thread Jeff Law




On 8/25/24 7:35 PM, Li, Pan2 wrote:

Thanks Jeff.


OK.  I'm assuming we don't have to worry about the case where X is wider
than Xmode?  ie, a DImode on rv32?


Yes, the DImode is disabled by ANYI iterator for ussub pattern.
Thanks.  Just wanted to make sure.  And for the avoidance of doubt, this 
patch is fine for the trunk.


jeff



RE: [PATCH v3] RISC-V: Support IMM for operand 0 of ussub pattern

2024-08-25 Thread Li, Pan2
Got it, thanks Jeff.

Pan

-Original Message-
From: Jeff Law  
Sent: Monday, August 26, 2024 10:21 AM
To: Li, Pan2 ; gcc-patches@gcc.gnu.org
Cc: juzhe.zh...@rivai.ai; kito.ch...@gmail.com; rdapp@gmail.com
Subject: Re: [PATCH v3] RISC-V: Support IMM for operand 0 of ussub pattern



On 8/25/24 7:35 PM, Li, Pan2 wrote:
> Thanks Jeff.
> 
>> OK.  I'm assuming we don't have to worry about the case where X is wider
>> than Xmode?  ie, a DImode on rv32?
> 
> Yes, the DImode is disabled by ANYI iterator for ussub pattern.
Thanks.  Just wanted to make sure.  And for the avoidance of doubt, this 
patch is fine for the trunk.

jeff



[PATCH] Fix bootstrap errors due to enabling -gvariable-location-views

2024-08-25 Thread Bernd Edlinger
This recent change triggered various bootstrap errors, mostly on
x86 targets, because line info advance-address entries were output
into the wrong line table section.
The switch to the wrong line table happened in dwarf2out_set_ignored_loc.
It must use the same section as the earlier called
dwarf2out_switch_text_section.

But ft32-elf was also affected, because the assembler choked on
something as simple as ".2byte .LM2-.LM1".  Fortunately it is
able to use native location views; the configure test was just
not executed because the ft32 "nop" instruction was missing.

gcc/ChangeLog:

PR debug/116470
* configure.ac: Add the "nop" instruction for cpu type ft32.
* configure: Regenerate.
* dwarf2out.cc (dwarf2out_set_ignored_loc): Use the correct
line info section.
---
 gcc/configure| 2 +-
 gcc/configure.ac | 2 +-
 gcc/dwarf2out.cc | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/configure b/gcc/configure
index 557ea5fa3ac..3d301b6ecd3 100755
--- a/gcc/configure
+++ b/gcc/configure
@@ -31398,7 +31398,7 @@ esac
 case "$cpu_type" in
   aarch64 | alpha | arc | arm | avr | bfin | cris | csky | i386 | loongarch | 
m32c \
   | m68k | microblaze | mips | nds32 | nios2 | pa | riscv | rs6000 | score | 
sparc \
-  | visium | xstormy16 | xtensa)
+  | visium | xstormy16 | xtensa | ft32)
 insn="nop"
 ;;
   ia64 | s390)
diff --git a/gcc/configure.ac b/gcc/configure.ac
index eaa01d0d7e5..8a2d2b0438e 100644
--- a/gcc/configure.ac
+++ b/gcc/configure.ac
@@ -5610,7 +5610,7 @@ esac
 case "$cpu_type" in
   aarch64 | alpha | arc | arm | avr | bfin | cris | csky | i386 | loongarch | 
m32c \
   | m68k | microblaze | mips | nds32 | nios2 | pa | riscv | rs6000 | score | 
sparc \
-  | visium | xstormy16 | xtensa)
+  | visium | xstormy16 | xtensa | ft32)
 insn="nop"
 ;;
   ia64 | s390)
diff --git a/gcc/dwarf2out.cc b/gcc/dwarf2out.cc
index a26a07e3424..1187d32352b 100644
--- a/gcc/dwarf2out.cc
+++ b/gcc/dwarf2out.cc
@@ -28976,7 +28976,7 @@ dwarf2out_set_ignored_loc (unsigned int line, unsigned 
int column,
   dw_fde_ref fde = cfun->fde;
 
   fde->ignored_debug = false;
-  set_cur_line_info_table (function_section (fde->decl));
+  set_cur_line_info_table (current_function_section ());
 
   dwarf2out_source_line (line, column, filename, 0, true);
 }
-- 
2.39.2



Re: [PATCH] sched: Don't skip empty block by removing no_real_insns_p [PR108273]

2024-08-25 Thread Kewen.Lin
Hi Jeff,

on 2024/8/26 06:13, Jeff Law wrote:
> 
> So is this patch still relevant Kewen?

Yes, sorry that I forgot to follow up on this after stage 1 opened.

> 
> On 12/20/23 2:25 AM, Kewen.Lin wrote:
>> Hi,
>>
>> This patch follows Richi's suggestion "scheduling shouldn't
>> special case empty blocks as they usually do not appear" in
>> [1], it removes function no_real_insns_p and its uses
>> completely.
>>
>> There is a case in which a block previously had only one
>> INSN_P, but while scheduling some other blocks that only
>> INSN_P gets moved into one of them and this block becomes
>> empty, so that its only NOTE_P insn gets counted then.  But
>> since this block wasn't empty initially and any NOTE_P gets
>> skipped in a normal block, the to-be-scheduled count doesn't
>> include it, which can cause the assertion below to fail:
>>
>>    /* Sanity check: verify that all region insns were scheduled.  */
>>    gcc_assert (sched_rgn_n_insns == rgn_n_insns);
>>
>> A bitmap rgn_init_empty_bb is proposed to detect such a case
>> by recording whether or not a block is empty initially, before
>> actual scheduling.  The other changes are mainly to handle
>> NOTEs, which weren't expected before but which we now have
>> to deal with.
>>
>> Bootstrapped and regress-tested on:
>>    - powerpc64{,le}-linux-gnu
>>    - x86_64-redhat-linux
>>    - aarch64-linux-gnu
>>
>> Also tested this with superblock scheduling (sched2) turned
>> on by default, bootstrapped and regress-tested again on the
>> above triples.  I tried to test with seletive-scheduling
>> 1/2 enabled by default, it's bootstrapped & regress-tested
>> on x86_64-redhat-linux, but both failed to build on
>> powerpc64{,le}-linux-gnu and aarch64-linux-gnu even without
>> this patch (so it's unrelated, I've filed two PRs for
>> observed failures on Power separately).
>>
>> [1] https://inbox.sourceware.org/gcc-patches/CAFiYyc2hMvbU_+
>> a47ytnbxf0yrcybwrhru2jdcw5a0px3+n...@mail.gmail.com/
>>
>> Is it ok for trunk or next stage 1?
>>
>> BR,
>> Kewen
>> -
>> PR rtl-optimization/108273
>>
>> gcc/ChangeLog:
>>
>> * config/aarch64/aarch64.cc (aarch64_sched_adjust_priority): Early
>> return for NOTE_P.
>> * haifa-sched.cc (recompute_todo_spec): Likewise.
>> (setup_insn_reg_pressure_info): Likewise.
>> (schedule_insn): Handle NOTE_P specially as we don't skip empty block
>> any more and adopt NONDEBUG_INSN_P somewhere appropriate.
>> (commit_schedule): Likewise.
>> (prune_ready_list): Likewise.
>> (schedule_block): Likewise.
>> (set_priorities): Likewise.
>> (fix_tick_ready): Likewise.
>> (no_real_insns_p): Remove.
>> * rtl.h (SCHED_GROUP_P): Add NOTE consideration.
>> * sched-ebb.cc (schedule_ebb): Skip leading labels like note to ensure
>> that we don't have the chance to have single label block, remove the
>> call to no_real_insns_p.
>> * sched-int.h (no_real_insns_p): Remove declaration.
>> * sched-rgn.cc (free_block_dependencies): Remove the call to
>> no_real_insns_p.
>> (compute_priorities): Likewise.
>> (schedule_region): Remove the call to no_real_insns_p, check
>> rgn_init_empty_bb and update rgn_n_insns if need.
>> (sched_rgn_local_init): Init rgn_init_empty_bb.
>> (sched_rgn_local_free): Free rgn_init_empty_bb.
>> (rgn_init_empty_bb): New static bitmap.
>> * sel-sched.cc (sel_region_target_finish): Remove the call to
>> no_real_insns_p.
> This largely looks sensible.  One change caught my eye though.
> 
> SCHED_GROUP_P IIRC only applies to INSNs.  That bit means something different 
> for NOTEs.  I think the change to rtl.h should be backed out, which may mean 
> you need further changes into the scheduler infrastructure.
> 
> Definitely will need a rebase and retest given the age of the patch.

Thanks for your review comments, I'll check SCHED_GROUP_P and get back to you 
soon.

BR,
Kewen



Re: [PATCH ver 3] rs6000,extend and document built-ins vec_test_lsbb_all_ones and vec_test_lsbb_all_zeros

2024-08-25 Thread Kewen.Lin
Hi Carl,

on 2024/8/22 23:24, Carl Love wrote:
> Gcc maintainers:
> 
> Version 3, fixed a few typos per Kewen's review.  Fixed the expected number 
> of scan-assembler-times for xvtlsbb and setbc.  Retested on Power 10 LE.
> 
> Version 2: based on discussion, additional overloaded instances of the 
> vec_test_lsbb_all_ones and vec_test_lsbb_all_zeros built-ins have been added. 
>  The additional instances are for arguments of vector signed char and vector 
> bool char.  The patch has been tested on Power 10 LE and BE with no 
> regressions.
> 
> Per a report from a user, the existing vec_test_lsbb_all_ones and 
> vec_test_lsbb_all_zeros built-ins are not documented in the GCC documentation 
> file.
> 
> The following patch adds the missing documentation for the vec_test_lsbb_all_ones 
> and vec_test_lsbb_all_zeros built-ins.
> 
> Please let me know if the patch is acceptable for mainline.  Thanks.

This patch is OK for trunk, thanks!

BR,
Kewen

> 
>   Carl
> 
> 
> 
> rs6000,extend and document built-ins vec_test_lsbb_all_ones  and 
> vec_test_lsbb_all_zeros
> 
> The built-ins currently support vector unsigned char arguments. Extend the
> built-ins to also support vector signed char and vector bool char
> arguments.
> 
> Add documentation for the Power 10 built-ins vec_test_lsbb_all_ones
> and vec_test_lsbb_all_zeros.  The vec_test_lsbb_all_ones built-in
> returns 1 if the least significant bit in each byte is a 1, and returns
> 0 otherwise.  Similarly, vec_test_lsbb_all_zeros returns 1 if
> the least significant bit in each byte is a zero, and 0 otherwise.
> 
> Add additional test cases for the built-ins in files:
>   gcc/testsuite/gcc.target/powerpc/lsbb.c
>   gcc/testsuite/gcc.target/powerpc/lsbb-runnable.c
> 
> gcc/ChangeLog:
>     * config/rs6000/rs6000-overloaded.def (vec_test_lsbb_all_ones,
>     vec_test_lsbb_all_zeros): Add built-in instances for vector signed
>     char and vector bool char.
>     * doc/extend.texi (vec_test_lsbb_all_ones,
>     vec_test_lsbb_all_zeros): Add documentation for the
>     existing built-ins.
> 
> gcc/testsuite/ChangeLog:
>     * gcc.target/powerpc/lsbb-runnable.c: Add test cases for the vector
>     signed char and vector bool char instances of
>     vec_test_lsbb_all_zeros and vec_test_lsbb_all_ones built-ins.
>     * gcc.target/powerpc/lsbb.c: Add compile test cases for the vector
>     signed char and vector bool char instances of
>     vec_test_lsbb_all_zeros and vec_test_lsbb_all_ones built-ins.
> ---
>  gcc/config/rs6000/rs6000-overload.def |  12 +-
>  gcc/doc/extend.texi   |  19 +++
>  .../gcc.target/powerpc/lsbb-runnable.c    | 131 ++
>  gcc/testsuite/gcc.target/powerpc/lsbb.c   |  28 +++-
>  4 files changed, 158 insertions(+), 32 deletions(-)
> 
> diff --git a/gcc/config/rs6000/rs6000-overload.def 
> b/gcc/config/rs6000/rs6000-overload.def
> index 87495aded49..7d9e31c3f9e 100644
> --- a/gcc/config/rs6000/rs6000-overload.def
> +++ b/gcc/config/rs6000/rs6000-overload.def
> @@ -4403,12 +4403,20 @@
>  XXEVAL  XXEVAL_VUQ
> 
>  [VEC_TEST_LSBB_ALL_ONES, vec_test_lsbb_all_ones, 
> __builtin_vec_xvtlsbb_all_ones]
> +  signed int __builtin_vec_xvtlsbb_all_ones (vsc);
> +    XVTLSBB_ONES LSBB_ALL_ONES_VSC
>    signed int __builtin_vec_xvtlsbb_all_ones (vuc);
> -    XVTLSBB_ONES
> +    XVTLSBB_ONES LSBB_ALL_ONES_VUC
> +  signed int __builtin_vec_xvtlsbb_all_ones (vbc);
> +    XVTLSBB_ONES LSBB_ALL_ONES_VBC
> 
>  [VEC_TEST_LSBB_ALL_ZEROS, vec_test_lsbb_all_zeros, 
> __builtin_vec_xvtlsbb_all_zeros]
> +  signed int __builtin_vec_xvtlsbb_all_zeros (vsc);
> +    XVTLSBB_ZEROS LSBB_ALL_ZEROS_VSC
>    signed int __builtin_vec_xvtlsbb_all_zeros (vuc);
> -    XVTLSBB_ZEROS
> +    XVTLSBB_ZEROS LSBB_ALL_ZEROS_VUC
> +  signed int __builtin_vec_xvtlsbb_all_zeros (vbc);
> +    XVTLSBB_ZEROS LSBB_ALL_ZEROS_VBC
> 
>  [VEC_TRUNC, vec_trunc, __builtin_vec_trunc]
>    vf __builtin_vec_trunc (vf);
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index 89fe5db7aed..8971d9fbf3c 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -23332,6 +23332,25 @@ signed long long will sign extend the rightmost byte 
> of each doubleword.
>  The following additional built-in functions are also available for the
>  PowerPC family of processors, starting with ISA 3.1 (@option{-mcpu=power10}):
> 
> +@smallexample
> +@exdent int vec_test_lsbb_all_ones (vector signed char);
> +@exdent int vec_test_lsbb_all_ones (vector unsigned char);
> +@exdent int vec_test_lsbb_all_ones (vector bool char);
> +@end smallexample
> +@findex vec_test_lsbb_all_ones
> +
> +The builtin @code{vec_test_lsbb_all_ones} returns 1 if the least significant
> +bit in each byte is equal to 1.  It returns 0 otherwise.
> +
> +@smallexample
> +@exdent int vec_test_lsbb_all_zeros (vector signed char);
> +@exdent int vec_test_lsbb_all_zeros (v

Re: [PATCH] rs6000: allow split vsx_stxvd2x4_le_const after RA[pr116030]

2024-08-25 Thread Kewen.Lin
Hi,

on 2024/8/21 15:23, Jiufu Guo wrote:
> Hi,
> 
> Previously, vsx_stxvd2x4_le_const_ was introduced for the 'split1' pass,
> so it is guarded by "can_create_pseudo_p ()".
> However, it would be possible to match the pattern of this insn during/after
> RA, so this insn could be updated to make it work for the split pass after RA.
> 
> Bootstrap & regtest pass on ppc64{,le}.
> Is this ok for trunk?
> 
> BR,
> Jeff (Jiufu) Guo
> 
> 
>   PR target/116030
> 
> gcc/ChangeLog:
> 
>   * config/rs6000/vsx.md (vsx_stxvd2x4_le_const_): Allow insn
>   after RA.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.target/powerpc/pr116030.c: New test.
> 
> ---
>  gcc/config/rs6000/vsx.md|  9 +
>  gcc/testsuite/gcc.target/powerpc/pr116030.c | 17 +
>  2 files changed, 22 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr116030.c
> 
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index 27069d070e1..2dd87b7a9db 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -3454,12 +3454,12 @@ (define_insn "*vsx_stxvd2x4_le_"
>  
>  (define_insn_and_split "vsx_stxvd2x4_le_const_"
>[(set (match_operand:VSX_W 0 "memory_operand" "=Z")
> - (match_operand:VSX_W 1 "immediate_operand" "W"))]
> + (match_operand:VSX_W 1 "immediate_operand" "W"))
> +   (clobber (match_scratch:VSX_W 2 "=&wa"))]

I think "&" isn't needed here as the input is immediate_operand.

>"!BYTES_BIG_ENDIAN
> && VECTOR_MEM_VSX_P (mode)
> && !TARGET_P9_VECTOR
> -   && const_vec_duplicate_p (operands[1])
> -   && can_create_pseudo_p ()"
> +   && const_vec_duplicate_p (operands[1])"
>"#"
>"&& 1"
>[(set (match_dup 2)
> @@ -3472,7 +3472,8 @@ (define_insn_and_split "vsx_stxvd2x4_le_const_"
>  {
>/* Here all the constants must be loaded without memory.  */
>gcc_assert (easy_altivec_constant (operands[1], mode));
> -  operands[2] = gen_reg_rtx (mode);
> +  if (GET_CODE(operands[2]) == SCRATCH)
> +operands[2] = gen_reg_rtx (mode);
>  }
>[(set_attr "type" "vecstore")
> (set_attr "length" "8")])
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr116030.c 
> b/gcc/testsuite/gcc.target/powerpc/pr116030.c
> new file mode 100644
> index 000..ada0a4fd2b1
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr116030.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mdejagnu-cpu=power8 -Os -fno-forward-propagate 
> -ftrivial-auto-var-init=zero -save-temps" } */

As Sam pointed out, "-save-temps" isn't needed.

Since this test case adopts dfp type, please add one more check:

/* { dg-require-effective-target dfp } */

OK with all these fixed, thanks!

BR,
Kewen

> +
> +/* Verify we do not ICE on the tests below.  */
> +union U128
> +{
> +  _Decimal128 d;
> +  unsigned long long int u[2];
> +};
> +
> +union U128
> +foo ()
> +{
> +  volatile union U128 u128;
> +  u128.d = 0.99e+39DL;
> +  return u128;
> +}


Re: [PATCH V2] rs6000: add clober and guard for vsx_stxvd2x4_le_const[pr116030]

2024-08-25 Thread Kewen.Lin
Hi Jeff,

I just noticed this v2 ...

on 2024/8/22 14:22, Jiufu Guo wrote:
> Hi,
> 
> Previously, vsx_stxvd2x4_le_const_ was introduced for the 'split1' pass,
> so it is guarded by "can_create_pseudo_p ()".
> However, it would be possible to match the pattern of this insn during/after
> RA, so this insn could be updated to make it work for the split pass after RA.
> 
> And this insn would not be the best choise, if the address has alignment like

Nit: s/choise/choice/

> "&(-16)", so "!altivec_indexed_or_indirect_operand" is added to guard this 
> insn.

Yeah, good catch!

> 
> Compare with previous version:
> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660983.html
> "!altivec_indexed_or_indirect_operand" is guarded.
> 
> Bootstrap & regtest pass on ppc64{,le}.
> Is this ok for trunk?
> 
> BR,
> Jeff (Jiufu) Guo
> 
>   PR target/116030
> 
> gcc/ChangeLog:
> 
>   * config/rs6000/vsx.md (vsx_stxvd2x4_le_const_): Add clobber
>   and guard with !altivec_indexed_or_indirect_operand.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.target/powerpc/pr116030.c: New test.
> ---
>  gcc/config/rs6000/vsx.md| 10 ++
>  gcc/testsuite/gcc.target/powerpc/pr116030.c | 21 +
>  2 files changed, 27 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr116030.c
> 
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index 27069d070e1..6b58fa712c8 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -3454,12 +3454,13 @@ (define_insn "*vsx_stxvd2x4_le_"
>  
>  (define_insn_and_split "vsx_stxvd2x4_le_const_"
>[(set (match_operand:VSX_W 0 "memory_operand" "=Z")
> - (match_operand:VSX_W 1 "immediate_operand" "W"))]
> + (match_operand:VSX_W 1 "immediate_operand" "W"))
> +   (clobber (match_scratch:VSX_W 2 "=&wa"))]

I think "&" isn't needed here as the input is immediate_operand.

>"!BYTES_BIG_ENDIAN
> && VECTOR_MEM_VSX_P (mode)
> && !TARGET_P9_VECTOR
> -   && const_vec_duplicate_p (operands[1])
> -   && can_create_pseudo_p ()"
> +   && !altivec_indexed_or_indirect_operand (operands[0], mode)
> +   && const_vec_duplicate_p (operands[1])"
>"#"
>"&& 1"
>[(set (match_dup 2)
> @@ -3472,7 +3473,8 @@ (define_insn_and_split "vsx_stxvd2x4_le_const_"
>  {
>/* Here all the constants must be loaded without memory.  */
>gcc_assert (easy_altivec_constant (operands[1], mode));
> -  operands[2] = gen_reg_rtx (mode);
> +  if (GET_CODE(operands[2]) == SCRATCH)
> +operands[2] = gen_reg_rtx (mode);
>  }
>[(set_attr "type" "vecstore")
> (set_attr "length" "8")])
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr116030.c 
> b/gcc/testsuite/gcc.target/powerpc/pr116030.c
> new file mode 100644
> index 000..304d5519ac6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr116030.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mdejagnu-cpu=power8 -Os -fno-forward-propagate 
> -ftrivial-auto-var-init=zero -save-temps" } */
> +

As Sam pointed out previously, "-save-temps" isn't needed.

Since this test case adopts dfp type, please add one more check:

/* { dg-require-effective-target dfp } */

OK with all these fixed, thanks!

BR,
Kewen

> +/* Verify we do not ICE on the tests below.  */
> +
> +/* { dg-final { scan-assembler-not "rldicr" { target { le } } } } */
> +/* { dg-final { scan-assembler-not "stxvd2x" { target { le } } } } */
> +
> +union U128
> +{
> +  _Decimal128 d;
> +  unsigned long long int u[2];
> +};
> +
> +union U128
> +foo ()
> +{
> +  volatile union U128 u128;
> +  u128.d = 0.99e+39DL;
> +  return u128;
> +}


Re: [PATCH] tree-optimization/116166 - forward jump-threading going wild

2024-08-25 Thread Aldy Hernandez
[I'm slowly coming up to speed here after my absence, so please bear with me...]

I suspect there's a few things going on here, both in the forward and
the backwards threader.  For the forward threader, you mention some
very good points in the PR.  First, there's unnecessary recursion in
simplify_control_stmt_condition_1 that ranger should be able to handle
on its own.  Secondly, since we're doing a DOM walk, we should be able
to re-use most of the path_ranger's cache instead of having to reset
all of it on every path, especially when we're just adding empty
blocks.  I can investigate both of these things.

The end game here is to get rid of the forward threader, so we should
really find out why the backwards threader is choking so badly.  I
suspect that whatever the cause is, it will affect both threaders.  I
thought you had added some limits on the search space last cycle?  Are
they not being triggered?

For the record, the reason we can't get rid of the forward threader
yet (apart from having to fix whatever is going on in PR114855 at -O2
:)), is that we still rely on the pointer equivalency tracking with
the DOM equiv lookup tables.  Prange does not yet handle pointer
equivs.  Also, we'd need to audit to make sure frange handles whatever
floating point operations were being simplified in the DOM equiv
lookup as well.  I suspect not much, but we still need to make sure.

Minor nit: wouldn't it be cleaner for "limit" to be a class member
variable instead of passing it around as a function parameter?

Thanks for all your work here.
Aldy

On Tue, Aug 6, 2024 at 3:12 PM Richard Biener  wrote:
>
> Currently the forward threader isn't limited as to the search space
> it explores, and with it now using path-ranger for simplifying
> conditions it became pretty slow for degenerate cases
> like compiling insn-emit.cc for RISC-V, esp. when compiling for
> a host with LOGICAL_OP_NON_SHORT_CIRCUIT disabled.
>
> The following makes the forward threader honor the search space
> limit I introduced for the backward threader.  This reduces
> compile-time from minutes to seconds for the testcase in PR116166.
>
> Note this wasn't necessary before we had ranger, but with ranger
> the work we do is quadratic in the length of the threading path
> we build up (the same is true for the backwards threader).
>
> Bootstrap and regtest running on x86_64-unknown-linux-gnu.
>
> OK if that succeeds?
>
> Thanks,
> Richard.
>
> PR tree-optimization/116166
> * tree-ssa-threadedge.h (jump_threader::thread_around_empty_blocks):
> Add limit parameter.
> (jump_threader::thread_through_normal_block): Likewise.
> * tree-ssa-threadedge.cc (jump_threader::thread_around_empty_blocks):
> Honor and decrement limit parameter.
> (jump_threader::thread_through_normal_block): Likewise.
> (jump_threader::thread_across_edge): Initialize limit from
> param_max_jump_thread_paths and pass it down to workers.
> ---
>  gcc/tree-ssa-threadedge.cc | 30 ++
>  gcc/tree-ssa-threadedge.h  |  4 ++--
>  2 files changed, 24 insertions(+), 10 deletions(-)
>
> diff --git a/gcc/tree-ssa-threadedge.cc b/gcc/tree-ssa-threadedge.cc
> index 7f82639b8ec..0aa2aa85143 100644
> --- a/gcc/tree-ssa-threadedge.cc
> +++ b/gcc/tree-ssa-threadedge.cc
> @@ -786,13 +786,17 @@ propagate_threaded_block_debug_into (basic_block dest, 
> basic_block src)
>  bool
>  jump_threader::thread_around_empty_blocks (vec *path,
>edge taken_edge,
> -  bitmap visited)
> +  bitmap visited, unsigned &limit)
>  {
>basic_block bb = taken_edge->dest;
>gimple_stmt_iterator gsi;
>gimple *stmt;
>tree cond;
>
> +  if (limit == 0)
> +return false;
> +  --limit;
> +
>/* The key property of these blocks is that they need not be duplicated
>   when threading.  Thus they cannot have visible side effects such
>   as PHI nodes.  */
> @@ -830,7 +834,8 @@ jump_threader::thread_around_empty_blocks 
> (vec *path,
>   m_registry->push_edge (path, taken_edge, 
> EDGE_NO_COPY_SRC_BLOCK);
>   m_state->append_path (taken_edge->dest);
>   bitmap_set_bit (visited, taken_edge->dest->index);
> - return thread_around_empty_blocks (path, taken_edge, visited);
> + return thread_around_empty_blocks (path, taken_edge, visited,
> +limit);
> }
> }
>
> @@ -872,7 +877,7 @@ jump_threader::thread_around_empty_blocks 
> (vec *path,
>m_registry->push_edge (path, taken_edge, EDGE_NO_COPY_SRC_BLOCK);
>m_state->append_path (taken_edge->dest);
>
> -  thread_around_empty_blocks (path, taken_edge, visited);
> +  thread_around_empty_blocks (path, taken_edge, visited, limit);
>return true;
>  }
>
> @@ -899,8 +904,13 @@ jump_threader::thread_ar

[PATCH 1/8] i386: Auto vectorize sdot_prod, usdot_prod, udot_prod with AVX10.2 instructions

2024-08-25 Thread Haochen Jiang
gcc/ChangeLog:

* config/i386/sse.md (VI1_AVX512VNNIBW): New.
(VI2_AVX10_2): Ditto.
(sdot_prod): Add AVX10.2
to auto vectorize and combine 512 bit part.
(udot_prod): Ditto.
(sdot_prodv64qi): Removed.
(udot_prodv64qi): Ditto.
(usdot_prod): Add AVX10.2 to auto vectorize.
(udot_prod): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/vnniint16-auto-vectorize-2.c: Only define
TEST when not defined.
* gcc.target/i386/vnniint8-auto-vectorize-2.c: Ditto.
* gcc.target/i386/vnniint16-auto-vectorize-3.c: New test.
* gcc.target/i386/vnniint16-auto-vectorize-4.c: Ditto.
* gcc.target/i386/vnniint8-auto-vectorize-3.c: Ditto.
* gcc.target/i386/vnniint8-auto-vectorize-4.c: Ditto.
---
 gcc/config/i386/sse.md| 93 +--
 .../i386/vnniint16-auto-vectorize-2.c | 11 ++-
 .../i386/vnniint16-auto-vectorize-3.c |  6 ++
 .../i386/vnniint16-auto-vectorize-4.c | 15 +++
 .../i386/vnniint8-auto-vectorize-2.c  | 12 ++-
 .../i386/vnniint8-auto-vectorize-3.c  |  6 ++
 .../i386/vnniint8-auto-vectorize-4.c  | 15 +++
 7 files changed, 80 insertions(+), 78 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/vnniint16-auto-vectorize-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vnniint16-auto-vectorize-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vnniint8-auto-vectorize-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vnniint8-auto-vectorize-4.c
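For reference, the sdot_prodv64qi expander removed below emulated a signed 8-bit dot product on top of the 16-bit vpdpwssd multiply-add: sign-extend each 8-bit lane to 16 bits, multiply-add adjacent pairs into 32-bit accumulators, and sum. A scalar C sketch of that emulation (function names are illustrative, not GCC APIs; n assumed even):

```c
#include <assert.h>
#include <stdint.h>

/* Reference semantics: signed 8-bit dot product accumulated into 32 bits.  */
static int32_t
dot_i8_ref (const int8_t *a, const int8_t *b, int n, int32_t acc)
{
  for (int i = 0; i < n; i++)
    acc += (int32_t) a[i] * (int32_t) b[i];
  return acc;
}

/* Emulation in the style of the removed expander: widen the 8-bit lanes
   to 16 bits (vec_unpacks_lo/hi), then let a 16-bit dot product step
   (what vpdpwssd does) multiply-add adjacent pairs.  n must be even.  */
static int32_t
dot_i8_via_i16 (const int8_t *a, const int8_t *b, int n, int32_t acc)
{
  for (int i = 0; i < n; i += 2)
    {
      int16_t a0 = a[i], a1 = a[i + 1];  /* sign-extended lanes */
      int16_t b0 = b[i], b1 = b[i + 1];
      /* One vpdpwssd "pair" step: two 16x16->32 products, summed.  */
      acc += (int32_t) a0 * b0 + (int32_t) a1 * b1;
    }
  return acc;
}
```

Since 16x16-bit products of values that fit in 8 bits cannot overflow the 32-bit accumulator, the two routines agree for any even n.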

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index da91d39cf8e..442ac93afa2 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -610,6 +610,10 @@
 (define_mode_iterator VI1_AVX512VNNI
   [(V64QI "TARGET_AVX512VNNI && TARGET_EVEX512") (V32QI "TARGET_AVX2") V16QI])
 
+(define_mode_iterator VI1_AVX512VNNIBW
+  [(V64QI "(TARGET_AVX512BW || TARGET_AVX512VNNI) && TARGET_EVEX512")
+   (V32QI "TARGET_AVX2") V16QI])
+
 (define_mode_iterator VI12_256_512_AVX512VL
   [(V64QI "TARGET_EVEX512") (V32QI "TARGET_AVX512VL")
(V32HI "TARGET_EVEX512") (V16HI "TARGET_AVX512VL")])
@@ -627,6 +631,9 @@
   [(V32HI "(TARGET_AVX512BW || TARGET_AVX512VNNI) && TARGET_EVEX512")
(V16HI "TARGET_AVX2") V8HI])
 
+(define_mode_iterator VI2_AVX10_2
+  [(V32HI "TARGET_AVX10_2_512") V16HI V8HI])
+
 (define_mode_iterator VI4_AVX
   [(V8SI "TARGET_AVX") V4SI])
 
@@ -31232,12 +31239,13 @@
 
 (define_expand "sdot_prod<mode>"
   [(match_operand: 0 "register_operand")
-   (match_operand:VI1_AVX2 1 "register_operand")
-   (match_operand:VI1_AVX2 2 "register_operand")
+   (match_operand:VI1_AVX512VNNIBW 1 "register_operand")
+   (match_operand:VI1_AVX512VNNIBW 2 "register_operand")
(match_operand: 3 "register_operand")]
   "TARGET_SSE2"
 {
-  if (TARGET_AVXVNNIINT8)
+  if ((<MODE_SIZE> == 64 && TARGET_AVX10_2_512)
+      || (<MODE_SIZE> < 64 && (TARGET_AVXVNNIINT8 || TARGET_AVX10_2_256)))
 {
   operands[1] = lowpart_subreg (<ssedvecmode>mode,
force_reg (<MODE>mode, operands[1]),
@@ -31276,44 +31284,15 @@
   DONE;
 })
 
-(define_expand "sdot_prodv64qi"
-  [(match_operand:V16SI 0 "register_operand")
-   (match_operand:V64QI 1 "register_operand")
-   (match_operand:V64QI 2 "register_operand")
-   (match_operand:V16SI 3 "register_operand")]
-  "(TARGET_AVX512VNNI || TARGET_AVX512BW) && TARGET_EVEX512"
-{
-  /* Emulate with vpdpwssd.  */
-  rtx op1_lo = gen_reg_rtx (V32HImode);
-  rtx op1_hi = gen_reg_rtx (V32HImode);
-  rtx op2_lo = gen_reg_rtx (V32HImode);
-  rtx op2_hi = gen_reg_rtx (V32HImode);
-
-  emit_insn (gen_vec_unpacks_lo_v64qi (op1_lo, operands[1]));
-  emit_insn (gen_vec_unpacks_lo_v64qi (op2_lo, operands[2]));
-  emit_insn (gen_vec_unpacks_hi_v64qi (op1_hi, operands[1]));
-  emit_insn (gen_vec_unpacks_hi_v64qi (op2_hi, operands[2]));
-
-  rtx res1 = gen_reg_rtx (V16SImode);
-  rtx res2 = gen_reg_rtx (V16SImode);
-  rtx sum = gen_reg_rtx (V16SImode);
-
-  emit_move_insn (sum, CONST0_RTX (V16SImode));
-  emit_insn (gen_sdot_prodv32hi (res1, op1_lo, op2_lo, sum));
-  emit_insn (gen_sdot_prodv32hi (res2, op1_hi, op2_hi, operands[3]));
-
-  emit_insn (gen_addv16si3 (operands[0], res1, res2));
-  DONE;
-})
-
 (define_expand "udot_prod"
   [(match_operand: 0 "register_operand")
-   (match_operand:VI1_AVX2 1 "register_operand")
-   (match_operand:VI1_AVX2 2 "register_operand")
+   (match_operand:VI1_AVX512VNNIBW 1 "register_operand")
+   (match_operand:VI1_AVX512VNNIBW 2 "register_operand")
(match_operand: 3 "register_operand")]
   "TARGET_SSE2"
 {
-  if (TARGET_AVXVNNIINT8)
+  if ((<MODE_SIZE> == 64 && TARGET_AVX10_2_512)
+      || (<MODE_SIZE> < 64 && (TARGET_AVXVNNIINT8 || TARGET_AVX10_2_256)))
 {
   operands[1] = lowpart_subreg (<ssedvecmode>mode,
force_reg (<MODE>mode, operands[1]),
@@ -31352,36 +31331,6 @@
   DONE;
 })
 
-(define_expand "udot_prodv64qi"
-  [(match_operand:V16SI 0 "register_operand")
-   (match_operand:V64QI 1 "register_operand")
- 

[PATCH 2/8] i386: Optimize ordered and nonequal

2024-08-25 Thread Haochen Jiang
From: "Hu, Lin1" 

Currently, for !__builtin_isunordered (a, b) && (a != b), GCC
emits
  ucomiss %xmm1, %xmm0
  movl $1, %ecx
  setp %dl
  setnp %al
  cmovne %ecx, %edx
  andl %edx, %eax
  movzbl %al, %eax

In fact,
  xorl %eax, %eax
  ucomiss %xmm1, %xmm0
  setne %al
is better.
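The transform relies on the identity (ordered && a != b) == !(unordered || a == b), i.e. the inverse of UNEQ, which needs only one comparison plus a setne. A quick C check of the equivalence, using the standard isunordered macro:

```c
#include <assert.h>
#include <math.h>

/* Original form: ordered AND not-equal (two conditions to test).  */
static int
ordered_and_ne (float a, float b)
{
  return !isunordered (a, b) && a != b;
}

/* match.pd form: NOT (unordered OR equal), the inverse of UNEQ,
   which maps to a single comparison followed by setne.  */
static int
not_uneq (float a, float b)
{
  return !(isunordered (a, b) || a == b);
}
```

The two agree on ordered unequal, ordered equal, and NaN operands, which is exactly why the simplification is valid without any fast-math assumption.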

gcc/ChangeLog:

* match.pd: Optimize (and ordered non-equal) to
(not (or unordered equal)).

gcc/testsuite/ChangeLog:

* gcc.target/i386/optimize_one.c: New test.
---
 gcc/match.pd | 3 +++
 gcc/testsuite/gcc.target/i386/optimize_one.c | 9 +
 2 files changed, 12 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/optimize_one.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 78f1957e8c7..aaadd2e977c 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -6636,6 +6636,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (ltgt @0 @0)
  (if (!flag_trapping_math || !tree_expr_maybe_nan_p (@0))
   { constant_boolean_node (false, type); }))
+(simplify
+ (bit_and (ordered @0 @1) (ne @0 @1))
+ (bit_not (uneq @0 @1)))
 
 /* x == ~x -> false */
 /* x != ~x -> true */
diff --git a/gcc/testsuite/gcc.target/i386/optimize_one.c 
b/gcc/testsuite/gcc.target/i386/optimize_one.c
new file mode 100644
index 000..62728d3c5ba
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/optimize_one.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mfpmath=sse" } */
+/* { dg-final { scan-assembler-times "comi" 1 } } */
+/* { dg-final { scan-assembler-times "set" 1 } } */
+
+int is_ordered_or_nonequal_sh (float a, float b)
+{
+  return !__builtin_isunordered (a, b) && (a != b);
+}
-- 
2.31.1



[PATCH 0/8] i386: Optimize code with AVX10.2 new instructions

2024-08-25 Thread Haochen Jiang
Hi all,

I have just committed the AVX10.2 new instruction patches to trunk a few
hours ago. The next and final part of the AVX10.2 upstreaming is to
optimize code with the AVX10.2 new instructions.

This patch series contains the following optimizations:

  - Auto-vectorization with the VNNI instructions (PATCH 1).
  - Codegen optimization with the new scalar comparison instructions to
    eliminate redundant code (PATCH 2-3).
  - Auto-vectorization with the BF16 instructions (PATCH 4-8).

This will finish the upstream for AVX10.2 series.

Afterwards, we may add V2BF/V4BF in another thread, just as we did for
V2HF/V4HF when AVX512FP16 was upstreamed.

Bootstrapped on x86-64-pc-linux-gnu. Ok for trunk?

Thx,
Haochen




[PATCH 5/8] i386: Support vectorized BF16 FMA with AVX10.2 instructions

2024-08-25 Thread Haochen Jiang
From: Levy Hsu 

gcc/ChangeLog:

* config/i386/sse.md: Add V8BF/V16BF/V32BF to mode iterator FMAMODEM.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx10_2-512-bf-vector-fma-1.c: New test.
* gcc.target/i386/avx10_2-bf-vector-fma-1.c: New test.
---
 gcc/config/i386/sse.md|  5 +-
 .../i386/avx10_2-512-bf-vector-fma-1.c| 34 ++
 .../gcc.target/i386/avx10_2-bf-vector-fma-1.c | 63 +++
 3 files changed, 101 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-fma-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-fma-1.c

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index ebca462bae8..85fbef331ea 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -5677,7 +5677,10 @@
(HF "TARGET_AVX512FP16")
(V8HF "TARGET_AVX512FP16 && TARGET_AVX512VL")
(V16HF "TARGET_AVX512FP16 && TARGET_AVX512VL")
-   (V32HF "TARGET_AVX512FP16 && TARGET_EVEX512")])
+   (V32HF "TARGET_AVX512FP16 && TARGET_EVEX512")
+   (V8BF "TARGET_AVX10_2_256")
+   (V16BF "TARGET_AVX10_2_256")
+   (V32BF "TARGET_AVX10_2_512")])
 
 (define_expand "fma<mode>4"
   [(set (match_operand:FMAMODEM 0 "register_operand")
diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-fma-1.c 
b/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-fma-1.c
new file mode 100644
index 000..a857f9b90db
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-fma-1.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx10.2-512 -O2" } */
+/* { dg-final { scan-assembler-times "vfmadd132nepbf16\[ 
\\t\]+\[^\{\n\]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfmsub132nepbf16\[ 
\\t\]+\[^\{\n\]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfnmadd132nepbf16\[ 
\\t\]+\[^\{\n\]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfnmsub132nepbf16\[ 
\\t\]+\[^\{\n\]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+
+#include 
+
+typedef __bf16 v32bf __attribute__ ((__vector_size__ (64)));
+
+v32bf
+foo_madd (v32bf a, v32bf b, v32bf c)
+{
+  return a * b + c;
+}
+
+v32bf
+foo_msub (v32bf a, v32bf b, v32bf c)
+{
+  return a * b - c;
+}
+
+v32bf
+foo_nmadd (v32bf a, v32bf b, v32bf c)
+{
+  return -a * b + c;
+}
+
+v32bf
+foo_nmsub (v32bf a, v32bf b, v32bf c)
+{
+  return -a * b - c;
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-fma-1.c 
b/gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-fma-1.c
new file mode 100644
index 000..0fd78efe049
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-fma-1.c
@@ -0,0 +1,63 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx10.2 -O2" } */
+/* { dg-final { scan-assembler-times "vfmadd132nepbf16\[ 
\\t\]+\[^\{\n\]*%ymm\[0-9\]+\[^\n\r]*%ymm\[0-9\]+\[^\n\r]*%ymm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfmsub132nepbf16\[ 
\\t\]+\[^\{\n\]*%ymm\[0-9\]+\[^\n\r]*%ymm\[0-9\]+\[^\n\r]*%ymm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfnmadd132nepbf16\[ 
\\t\]+\[^\{\n\]*%ymm\[0-9\]+\[^\n\r]*%ymm\[0-9\]+\[^\n\r]*%ymm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfnmsub132nepbf16\[ 
\\t\]+\[^\{\n\]*%ymm\[0-9\]+\[^\n\r]*%ymm\[0-9\]+\[^\n\r]*%ymm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfmadd132nepbf16\[ 
\\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfmsub132nepbf16\[ 
\\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfnmadd132nepbf16\[ 
\\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfnmsub132nepbf16\[ 
\\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+
+#include 
+
+typedef __bf16 v16bf __attribute__ ((__vector_size__ (32)));
+typedef __bf16 v8bf __attribute__ ((__vector_size__ (16)));
+
+v16bf
+foo_madd_256 (v16bf a, v16bf b, v16bf c)
+{
+  return a * b + c;
+}
+
+v16bf
+foo_msub_256 (v16bf a, v16bf b, v16bf c)
+{
+  return a * b - c;
+}
+
+v16bf
+foo_nmadd_256 (v16bf a, v16bf b, v16bf c)
+{
+  return -a * b + c;
+}
+
+v16bf
+foo_nmsub_256 (v16bf a, v16bf b, v16bf c)
+{
+  return -a * b - c;
+}
+
+v8bf
+foo_madd_128 (v8bf a, v8bf b, v8bf c)
+{
+  return a * b + c;
+}
+
+v8bf
+foo_msub_128 (v8bf a, v8bf b, v8bf c)
+{
+  return a * b - c;
+}
+
+v8bf
+foo_nmadd_128 (v8bf a, v8bf b, v8bf c)
+{
+  return -a * b + c;
+}
+
+v8bf
+foo_nmsub_128 (v8bf a, v8bf b, v8bf c)
+{
+  return -a * b - c;
+}
-- 
2.31.1



[PATCH 7/8] i386: Support vectorized BF16 sqrt with AVX10.2 instruction

2024-08-25 Thread Haochen Jiang
From: Levy Hsu 

gcc/ChangeLog:

* config/i386/sse.md: Expand VF2H to VF2HB with VBF modes.
---
 gcc/config/i386/sse.md | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index b374783429c..2de592a9c8f 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -447,9 +447,12 @@
 (define_mode_iterator VF2_AVX10_2
   [(V8DF "TARGET_AVX10_2_512") V4DF V2DF])
 
-;; All DFmode & HFmode vector float modes
-(define_mode_iterator VF2H
-  [(V32HF "TARGET_AVX512FP16 && TARGET_EVEX512")
+;; All DFmode & HFmode & BFmode vector float modes
+(define_mode_iterator VF2HB
+  [(V32BF "TARGET_AVX10_2_512")
+   (V16BF "TARGET_AVX10_2_256")
+   (V8BF "TARGET_AVX10_2_256")
+   (V32HF "TARGET_AVX512FP16 && TARGET_EVEX512")
(V16HF "TARGET_AVX512FP16 && TARGET_AVX512VL")
(V8HF "TARGET_AVX512FP16 && TARGET_AVX512VL")
(V8DF "TARGET_AVX512F && TARGET_EVEX512") (V4DF "TARGET_AVX") V2DF])
@@ -2933,8 +2936,8 @@
(set_attr "mode" "")])
 
 (define_expand "sqrt<mode>2"
-  [(set (match_operand:VF2H 0 "register_operand")
-   (sqrt:VF2H (match_operand:VF2H 1 "vector_operand")))]
+  [(set (match_operand:VF2HB 0 "register_operand")
+   (sqrt:VF2HB (match_operand:VF2HB 1 "vector_operand")))]
   "TARGET_SSE2")
 
 (define_expand "sqrt<mode>2"
-- 
2.31.1



[PATCH 3/8] i386: Optimize generate insn for avx10.2 compare

2024-08-25 Thread Haochen Jiang
From: "Hu, Lin1" 

gcc/ChangeLog:

* config/i386/i386-expand.cc (ix86_expand_fp_compare): Add UNSPEC to
support the optimization.
* config/i386/i386.cc (ix86_fp_compare_code_to_integer): Add NE/EQ.
* config/i386/i386.md (*cmpx): New define_insn.
(*cmpxhf): Ditto.
* config/i386/predicates.md (ix86_trivial_fp_comparison_operator):
Add ne/eq.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx10_2-compare-1b.c: New test.
---
 gcc/config/i386/i386-expand.cc|  5 +
 gcc/config/i386/i386.cc   |  5 +
 gcc/config/i386/i386.md   | 31 +-
 gcc/config/i386/predicates.md | 12 +++
 .../gcc.target/i386/avx10_2-compare-1b.c  | 96 +++
 5 files changed, 147 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx10_2-compare-1b.c

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index d692008ffe7..53327544620 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -2916,6 +2916,11 @@ ix86_expand_fp_compare (enum rtx_code code, rtx op0, rtx 
op1)
   switch (ix86_fp_comparison_strategy (code))
 {
 case IX86_FPCMP_COMI:
+  tmp = gen_rtx_COMPARE (CCFPmode, op0, op1);
+  if (TARGET_AVX10_2_256 && (code == EQ || code == NE))
+   tmp = gen_rtx_UNSPEC (CCFPmode, gen_rtvec (1, tmp), UNSPEC_OPTCOMX);
+  if (unordered_compare)
+   tmp = gen_rtx_UNSPEC (CCFPmode, gen_rtvec (1, tmp), UNSPEC_NOTRAP);
   cmp_mode = CCFPmode;
   emit_insn (gen_rtx_SET (gen_rtx_REG (CCFPmode, FLAGS_REG), tmp));
   break;
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 224a78cc832..a4454d393d5 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -16623,6 +16623,11 @@ ix86_fp_compare_code_to_integer (enum rtx_code code)
   return LEU;
 case LTGT:
   return NE;
+case EQ:
+case NE:
+  if (TARGET_AVX10_2_256)
+   return code;
+  /* FALLTHRU.  */
 default:
   return UNKNOWN;
 }
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index b56a51be09f..0fae3c1eb87 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -117,6 +117,7 @@
   UNSPEC_STC
   UNSPEC_PUSHFL
   UNSPEC_POPFL
+  UNSPEC_OPTCOMX
 
   ;; For SSE/MMX support:
   UNSPEC_FIX_NOTRUNC
@@ -1736,7 +1737,7 @@
(compare:CC (match_operand:XF 1 "nonmemory_operand")
(match_operand:XF 2 "nonmemory_operand")))
(set (pc) (if_then_else
-  (match_operator 0 "ix86_fp_comparison_operator"
+  (match_operator 0 "ix86_fp_comparison_operator_xf"
[(reg:CC FLAGS_REG)
 (const_int 0)])
   (label_ref (match_operand 3))
@@ -1753,7 +1754,7 @@
(compare:CC (match_operand:XF 2 "nonmemory_operand")
(match_operand:XF 3 "nonmemory_operand")))
(set (match_operand:QI 0 "register_operand")
-  (match_operator 1 "ix86_fp_comparison_operator"
+  (match_operator 1 "ix86_fp_comparison_operator_xf"
[(reg:CC FLAGS_REG)
 (const_int 0)]))]
   "TARGET_80387"
@@ -2017,6 +2018,32 @@
(set_attr "bdver1_decode" "double")
(set_attr "znver1_decode" "double")])
 
+(define_insn "*cmpx"
+  [(set (reg:CCFP FLAGS_REG)
+   (unspec:CCFP [
+ (compare:CCFP
+   (match_operand:MODEF 0 "register_operand" "v")
+   (match_operand:MODEF 1 "nonimmediate_operand" "vm"))]
+ UNSPEC_OPTCOMX))]
+  "TARGET_AVX10_2_256"
+  "%vcomx\t{%1, %0|%0, %1}"
+  [(set_attr "type" "ssecomi")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "")])
+
+(define_insn "*cmpxhf"
+  [(set (reg:CCFP FLAGS_REG)
+   (unspec:CCFP [
+ (compare:CCFP
+   (match_operand:HF 0 "register_operand" "v")
+   (match_operand:HF 1 "nonimmediate_operand" "vm"))]
+ UNSPEC_OPTCOMX))]
+  "TARGET_AVX10_2_256"
+  "vcomxsh\t{%1, %0|%0, %1}"
+  [(set_attr "type" "ssecomi")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "HF")])
+
 (define_insn "*cmpi"
   [(set (reg:CCFP FLAGS_REG)
(compare:CCFP
diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
index ab6a2e14d35..053312bbe27 100644
--- a/gcc/config/i386/predicates.md
+++ b/gcc/config/i386/predicates.md
@@ -1633,7 +1633,13 @@
 })
 
 ;; Return true if this comparison only requires testing one flag bit.
+;; VCOMX/VUCOMX set ZF, SF and OF differently from COMI/UCOMI.
 (define_predicate "ix86_trivial_fp_comparison_operator"
+  (if_then_else (match_test "TARGET_AVX10_2_256")
+   (match_code "gt,ge,unlt,unle,eq,uneq,ne,ltgt,ordered,unordered")
+   (match_code "gt,ge,unlt,unle,uneq,ltgt,ordered,unordered")))
+
+(define_predicate "ix86_trivial_fp_comparison_operator_xf"
   (match_code "gt,ge,unlt,unle,uneq,ltgt,ordered,unordered"))
 
 ;; Return true if we know how to do this comp

[PATCH 6/8] i386: Support vectorized BF16 smaxmin with AVX10.2 instructions

2024-08-25 Thread Haochen Jiang
From: Levy Hsu 

gcc/ChangeLog:

* config/i386/sse.md
(<code><mode>3): New define expand pattern for BF smaxmin.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx10_2-512-bf-vector-smaxmin-1.c: New test.
* gcc.target/i386/avx10_2-bf-vector-smaxmin-1.c: New test.
---
 gcc/config/i386/sse.md|  7 
 .../i386/avx10_2-512-bf-vector-smaxmin-1.c| 20 +++
 .../i386/avx10_2-bf-vector-smaxmin-1.c| 36 +++
 3 files changed, 63 insertions(+)
 create mode 100644 
gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-smaxmin-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-smaxmin-1.c

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 85fbef331ea..b374783429c 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -31901,6 +31901,13 @@
"vscalefpbf16\t{%2, %1, %0|%0, %1, %2}"
[(set_attr "prefix" "evex")])
 
+(define_expand "<code><mode>3"
+  [(set (match_operand:VBF_AVX10_2 0 "register_operand")
+ (smaxmin:VBF_AVX10_2
+   (match_operand:VBF_AVX10_2 1 "register_operand")
+   (match_operand:VBF_AVX10_2 2 "nonimmediate_operand")))]
+  "TARGET_AVX10_2_256")
+
 (define_insn "avx10_2_<code>pbf16_<mode>"
[(set (match_operand:VBF_AVX10_2 0 "register_operand" "=v")
   (smaxmin:VBF_AVX10_2
diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-smaxmin-1.c 
b/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-smaxmin-1.c
new file mode 100644
index 000..e33c325e2da
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-smaxmin-1.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx10.2-512 -mprefer-vector-width=512 -Ofast" } */
+/* { dg-final { scan-assembler-times "vmaxpbf16" 1 } } */
+/* { dg-final { scan-assembler-times "vminpbf16" 1 } } */
+
+void
+maxpbf16_512 (__bf16* dest, __bf16* src1, __bf16* src2)
+{
+  int i;
+  for (i = 0; i < 32; i++)
+dest[i] = src1[i] > src2[i] ? src1[i] : src2[i];
+}
+
+void
+minpbf16_512 (__bf16* dest, __bf16* src1, __bf16* src2)
+{
+  int i;
+  for (i = 0; i < 32; i++)
+dest[i] = src1[i] < src2[i] ? src1[i] : src2[i];
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-smaxmin-1.c 
b/gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-smaxmin-1.c
new file mode 100644
index 000..9bae073c95a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-smaxmin-1.c
@@ -0,0 +1,36 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx10.2 -Ofast" } */
+/* { dg-final { scan-assembler-times "vmaxpbf16" 2 } } */
+/* { dg-final { scan-assembler-times "vminpbf16" 2 } } */
+
+void
+maxpbf16_256 (__bf16* dest, __bf16* src1, __bf16* src2)
+{
+  int i;
+  for (i = 0; i < 16; i++)
+dest[i] = src1[i] > src2[i] ? src1[i] : src2[i];
+}
+
+void
+minpbf16_256 (__bf16* dest, __bf16* src1, __bf16* src2)
+{
+  int i;
+  for (i = 0; i < 16; i++)
+dest[i] = src1[i] < src2[i] ? src1[i] : src2[i];
+}
+
+void
+maxpbf16_128 (__bf16* dest, __bf16* src1, __bf16* src2)
+{
+  int i;
+  for (i = 0; i < 16; i++)
+dest[i] = src1[i] > src2[i] ? src1[i] : src2[i];
+}
+
+void
+minpbf16_128 (__bf16* dest, __bf16* src1, __bf16* src2)
+{
+  int i;
+  for (i = 0; i < 16; i++)
+dest[i] = src1[i] < src2[i] ? src1[i] : src2[i];
+}
-- 
2.31.1



[PATCH 8/8] i386: Support vec_cmp for V8BF/V16BF/V32BF in AVX10.2

2024-08-25 Thread Haochen Jiang
From: Levy Hsu 

gcc/ChangeLog:

* config/i386/i386-expand.cc (ix86_use_mask_cmp_p): Add BFmode
  for int mask cmp.
* config/i386/sse.md (vec_cmp<mode><avx512fmaskmodelower>): New
  vec_cmp expand for VBF modes.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx10_2-512-bf-vector-cmpp-1.c: New test.
* gcc.target/i386/avx10_2-bf-vector-cmpp-1.c: New test.
---
 gcc/config/i386/i386-expand.cc|  2 ++
 gcc/config/i386/sse.md| 13 +
 .../i386/avx10_2-512-bf-vector-cmpp-1.c   | 19 
 .../i386/avx10_2-bf-vector-cmpp-1.c   | 29 +++
 4 files changed, 63 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-cmpp-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-cmpp-1.c

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 53327544620..124cb976ec8 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -4036,6 +4036,8 @@ ix86_use_mask_cmp_p (machine_mode mode, machine_mode 
cmp_mode,
 return true;
   else if (GET_MODE_INNER (cmp_mode) == HFmode)
 return true;
+  else if (GET_MODE_INNER (cmp_mode) == BFmode)
+return true;
 
   /* When op_true is NULL, op_false must be NULL, or vice versa.  */
   gcc_assert (!op_true == !op_false);
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 2de592a9c8f..3bf95f0b0e5 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -4797,6 +4797,19 @@
   DONE;
 })
 
+(define_expand "vec_cmp<mode><avx512fmaskmodelower>"
+  [(set (match_operand:<avx512fmaskmode> 0 "register_operand")
+	(match_operator:<avx512fmaskmode> 1 ""
+ [(match_operand:VBF_AVX10_2 2 "register_operand")
+  (match_operand:VBF_AVX10_2 3 "nonimmediate_operand")]))]
+  "TARGET_AVX10_2_256"
+{
+  bool ok = ix86_expand_mask_vec_cmp (operands[0], GET_CODE (operands[1]),
+ operands[2], operands[3]);
+  gcc_assert (ok);
+  DONE;
+})
+
 (define_expand "vec_cmp"
   [(set (match_operand: 0 "register_operand")
(match_operator: 1 ""
diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-cmpp-1.c 
b/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-cmpp-1.c
new file mode 100644
index 000..416fcaa3628
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-cmpp-1.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx10.2-512 -O2 -mprefer-vector-width=512" } */
+/* { dg-final { scan-assembler-times "vcmppbf16" 5 } } */
+
+typedef __bf16 v32bf __attribute__ ((__vector_size__ (64)));
+
+#define VCMPMN(type, op, name) \
+type  \
+__attribute__ ((noinline, noclone)) \
+vec_cmp_##type##type##name (type a, type b) \
+{ \
+  return a op b;  \
+}
+
+VCMPMN (v32bf, <, lt)
+VCMPMN (v32bf, <=, le)
+VCMPMN (v32bf, >, gt)
+VCMPMN (v32bf, >=, ge)
+VCMPMN (v32bf, ==, eq)
diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-cmpp-1.c 
b/gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-cmpp-1.c
new file mode 100644
index 000..6234116039f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-cmpp-1.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx10.2 -O2" } */
+/* { dg-final { scan-assembler-times "vcmppbf16" 10 } } */
+
+typedef __bf16 v16bf __attribute__ ((__vector_size__ (32)));
+typedef __bf16 v8bf __attribute__ ((__vector_size__ (16)));
+
+#define VCMPMN(type, op, name) \
+type  \
+__attribute__ ((noinline, noclone)) \
+vec_cmp_##type##type##name (type a, type b) \
+{ \
+  return a op b;  \
+}
+
+VCMPMN (v16bf, <, lt)
+VCMPMN (v8bf, <, lt)
+
+VCMPMN (v16bf, <=, le)
+VCMPMN (v8bf, <=, le)
+
+VCMPMN (v16bf, >, gt)
+VCMPMN (v8bf, >, gt)
+
+VCMPMN (v16bf, >=, ge)
+VCMPMN (v8bf, >=, ge)
+
+VCMPMN (v16bf, ==, eq)
+VCMPMN (v8bf, ==, eq)
-- 
2.31.1



[PATCH 4/8] i386: Support vectorized BF16 add/sub/mul/div with AVX10.2 instructions

2024-08-25 Thread Haochen Jiang
From: Levy Hsu 

AVX10.2 introduces several non-exception instructions for BF16 vectors.
Enable vectorized BF16 add/sub/mul/div operations by supporting the
standard optabs for them.
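Besides the plain optabs, the division expander in this patch has a fast-math path that replaces a true divide with a multiply by a hardware reciprocal approximation (vrcppbf16). A scalar C sketch of that rewrite, with the reciprocal modeled exactly (the real instruction is approximate, which is why the path is gated on unsafe-math flags):

```c
#include <assert.h>
#include <math.h>

/* Exact division, as emitted without the fast-math gates.  */
static float
vdiv_exact (float a, float b)
{
  return a / b;
}

/* Stand-in for the hardware reciprocal (vrcppbf16).  The real
   instruction returns an approximation; modeling it as an exact
   reciprocal isolates the a / b -> a * rcp(b) rewrite itself.  */
static float
rcp (float b)
{
  return 1.0f / b;
}

/* Fast-math path: one reciprocal plus one multiply instead of a
   (slower) divide; valid only under unsafe-math optimizations.  */
static float
vdiv_recip (float a, float b)
{
  return a * rcp (b);
}
```

With an exact reciprocal the rewrite only adds one extra rounding step, so results stay within an ulp or two of true division; the hardware approximation trades a little more accuracy for latency.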

gcc/ChangeLog:

* config/i386/sse.md (div<mode>3): New expander for BFmode div.
(VF_BHSD): New mode iterator with vector BFmodes.
(<plusminus_insn><mode>3): Change mode to VF_BHSD.
(mul<mode>3): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx10_2-512-bf-vector-operations-1.c: New test.
* gcc.target/i386/avx10_2-bf-vector-operations-1.c: Ditto.
---
 gcc/config/i386/sse.md| 49 ++--
 .../i386/avx10_2-512-bf-vector-operations-1.c | 42 ++
 .../i386/avx10_2-bf-vector-operations-1.c | 79 +++
 3 files changed, 162 insertions(+), 8 deletions(-)
 create mode 100644 
gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-operations-1.c
 create mode 100644 
gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-operations-1.c

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 442ac93afa2..ebca462bae8 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -391,6 +391,19 @@
(V8DF "TARGET_AVX512F && TARGET_EVEX512") (V4DF "TARGET_AVX")
(V2DF "TARGET_SSE2")])
 
+(define_mode_iterator VF_BHSD
+  [(V32HF "TARGET_AVX512FP16 && TARGET_EVEX512")
+   (V16HF "TARGET_AVX512FP16 && TARGET_AVX512VL")
+   (V8HF "TARGET_AVX512FP16 && TARGET_AVX512VL")
+   (V16SF "TARGET_AVX512F && TARGET_EVEX512")
+   (V8SF "TARGET_AVX") V4SF
+   (V8DF "TARGET_AVX512F && TARGET_EVEX512")
+   (V4DF "TARGET_AVX") (V2DF "TARGET_SSE2")
+   (V32BF "TARGET_AVX10_2_512")
+   (V16BF "TARGET_AVX10_2_256")
+   (V8BF "TARGET_AVX10_2_256")
+  ])
+
 ;; 128-, 256- and 512-bit float vector modes for bitwise operations
 (define_mode_iterator VFB
   [(V32BF "TARGET_AVX512F && TARGET_EVEX512")
@@ -2527,10 +2540,10 @@
 })
 
 (define_expand "<plusminus_insn><mode>3"
-  [(set (match_operand:VFH 0 "register_operand")
-	(plusminus:VFH
-	  (match_operand:VFH 1 "<round_nimm_predicate>")
-	  (match_operand:VFH 2 "<round_nimm_predicate>")))]
+  [(set (match_operand:VF_BHSD 0 "register_operand")
+	(plusminus:VF_BHSD
+	  (match_operand:VF_BHSD 1 "<round_nimm_predicate>")
+	  (match_operand:VF_BHSD 2 "<round_nimm_predicate>")))]
   "TARGET_SSE && <mask_mode512bit_condition> && <round_mode512bit_condition>"
   "ix86_fixup_binary_operands_no_copy (<CODE>, <MODE>mode, operands);")
 
@@ -2616,10 +2629,10 @@
 })
 
 (define_expand "mul<mode>3"
-  [(set (match_operand:VFH 0 "register_operand")
-	(mult:VFH
-	  (match_operand:VFH 1 "<round_nimm_predicate>")
-	  (match_operand:VFH 2 "<round_nimm_predicate>")))]
+  [(set (match_operand:VF_BHSD 0 "register_operand")
+	(mult:VF_BHSD
+	  (match_operand:VF_BHSD 1 "<round_nimm_predicate>")
+	  (match_operand:VF_BHSD 2 "<round_nimm_predicate>")))]
   "TARGET_SSE && <mask_mode512bit_condition> && <round_mode512bit_condition>"
   "ix86_fixup_binary_operands_no_copy (MULT, <MODE>mode, operands);")
 
@@ -2734,6 +2747,26 @@
 }
 })
 
+(define_expand "div<mode>3"
+  [(set (match_operand:VBF_AVX10_2 0 "register_operand")
+   (div:VBF_AVX10_2
+ (match_operand:VBF_AVX10_2 1 "register_operand")
+ (match_operand:VBF_AVX10_2 2 "vector_operand")))]
+  "TARGET_AVX10_2_256"
+{
+  if (TARGET_RECIP_VEC_DIV
+  && optimize_insn_for_speed_p ()
+  && flag_finite_math_only
+  && flag_unsafe_math_optimizations)
+{
+      rtx op = gen_reg_rtx (<MODE>mode);
+      operands[2] = force_reg (<MODE>mode, operands[2]);
+      emit_insn (gen_avx10_2_rcppbf16_<mode> (op, operands[2]));
+      emit_insn (gen_avx10_2_mulnepbf16_<mode> (operands[0], operands[1], op));
+  DONE;
+}
+})
+
 (define_expand "cond_div"
   [(set (match_operand:VFH 0 "register_operand")
(vec_merge:VFH
diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-operations-1.c 
b/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-operations-1.c
new file mode 100644
index 000..d6b0750c233
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-operations-1.c
@@ -0,0 +1,42 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx10.2-512 -O2" } */
+/* { dg-final { scan-assembler-times "vmulnepbf16\[ 
\\t\]+\[^\{\n\]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 2 } } */
+/* { dg-final { scan-assembler-times "vaddnepbf16\[ 
\\t\]+\[^\{\n\]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vdivnepbf16\[ 
\\t\]+\[^\{\n\]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vsubnepbf16\[ 
\\t\]+\[^\{\n\]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vrcppbf16\[ 
\\t\]+\[^\{\n\]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 } } */
+
+#include 
+
+typedef __bf16 v32bf __attribute__ ((__vector_size__ (64)));
+
+v32bf
+foo_mul (v32bf a, v32bf b)
+{
+  return a * b;
+}
+
+v32bf
+foo_add (v32bf a, v32bf b)
+{
+  return a + b;
+}
+
+v32bf
+foo_div (v32bf a, v32bf b)
+{
+  return a / b;
+}
+
+v32bf
+foo_sub (v32bf a, v32bf b)
+{
+  return a - b;
+}
+
+__attribute__((optimize("fast-math")))
+v32bf
+