date:20240901

[gcc r15-3358] i386: Support vectorized BF16 sqrt with AVX10.2 instruction

2024-09-01 Thread Haochen Jiang via Gcc-cvs

https://gcc.gnu.org/g:e19f65b0be1e91ff86689feb7695080dad4c9197

commit r15-3358-ge19f65b0be1e91ff86689feb7695080dad4c9197
Author: Levy Hsu 
Date:   Mon Sep 2 10:24:48 2024 +0800

i386: Support vectorized BF16 sqrt with AVX10.2 instruction

gcc/ChangeLog:

* config/i386/sse.md: Expand VF2H to VF2HB with VBF modes.

Diff:
---
 gcc/config/i386/sse.md | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index b374783429cb..2de592a9c8fa 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -447,9 +447,12 @@
 (define_mode_iterator VF2_AVX10_2
   [(V8DF "TARGET_AVX10_2_512") V4DF V2DF])
 
-;; All DFmode & HFmode vector float modes
-(define_mode_iterator VF2H
-  [(V32HF "TARGET_AVX512FP16 && TARGET_EVEX512")
+;; All DFmode & HFmode & BFmode vector float modes
+(define_mode_iterator VF2HB
+  [(V32BF "TARGET_AVX10_2_512")
+   (V16BF "TARGET_AVX10_2_256")
+   (V8BF "TARGET_AVX10_2_256")
+   (V32HF "TARGET_AVX512FP16 && TARGET_EVEX512")
(V16HF "TARGET_AVX512FP16 && TARGET_AVX512VL")
(V8HF "TARGET_AVX512FP16 && TARGET_AVX512VL")
(V8DF "TARGET_AVX512F && TARGET_EVEX512") (V4DF "TARGET_AVX") V2DF])
@@ -2933,8 +2936,8 @@
(set_attr "mode" "")])
 
 (define_expand "sqrt2"
-  [(set (match_operand:VF2H 0 "register_operand")
-   (sqrt:VF2H (match_operand:VF2H 1 "vector_operand")))]
+  [(set (match_operand:VF2HB 0 "register_operand")
+   (sqrt:VF2HB (match_operand:VF2HB 1 "vector_operand")))]
   "TARGET_SSE2")
 
 (define_expand "sqrt2"

[gcc r15-3348] RISC-V: Add testcases for form 3 of unsigned vector .SAT_ADD IMM

2024-09-01 Thread Pan Li via Gcc-cvs

https://gcc.gnu.org/g:72f3e9021e55f14e90773cf2966805a318f44842

commit r15-3348-g72f3e9021e55f14e90773cf2966805a318f44842
Author: Pan Li 
Date:   Fri Aug 30 08:36:45 2024 +0800

RISC-V: Add testcases for form 3 of unsigned vector .SAT_ADD IMM

This patch would like to add test cases for the unsigned vector .SAT_ADD
when one of the operand is IMM.

Form 3:
  #define DEF_VEC_SAT_U_ADD_IMM_FMT_3(T, IMM)  \
  T __attribute__((noinline))  \
  vec_sat_u_add_imm##IMM##_##T##_fmt_3 (T *out, T *in, unsigned limit) \
  {\
unsigned i;\
T ret; \
for (i = 0; i < limit; i++)\
  {\
out[i] = __builtin_add_overflow (in[i], IMM, &ret) ? -1 : ret; \
  }\
  }

DEF_VEC_SAT_U_ADD_IMM_FMT_3(uint64_t, 123)

The below test are passed for this patch.
* The rv64gcv fully regression test.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-10.c: New 
test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-11.c: New 
test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-12.c: New 
test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-9.c: New 
test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-run-10.c: 
New test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-run-11.c: 
New test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-run-12.c: 
New test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-run-9.c: New 
test.

Signed-off-by: Pan Li 

Diff:
---
 .../riscv/rvv/autovec/binop/vec_sat_u_add_imm-10.c | 14 +++
 .../riscv/rvv/autovec/binop/vec_sat_u_add_imm-11.c | 14 +++
 .../riscv/rvv/autovec/binop/vec_sat_u_add_imm-12.c | 14 +++
 .../riscv/rvv/autovec/binop/vec_sat_u_add_imm-9.c  | 14 +++
 .../rvv/autovec/binop/vec_sat_u_add_imm-run-10.c   | 28 ++
 .../rvv/autovec/binop/vec_sat_u_add_imm-run-11.c   | 28 ++
 .../rvv/autovec/binop/vec_sat_u_add_imm-run-12.c   | 28 ++
 .../rvv/autovec/binop/vec_sat_u_add_imm-run-9.c| 28 ++
 8 files changed, 168 insertions(+)

diff --git 
a/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-10.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-10.c
new file mode 100644
index ..b6b605ac6158
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-10.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -ftree-vectorize 
-fdump-rtl-expand-details -fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-skip-if "" { *-*-* } { "-flto" } } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include "../vec_sat_arith.h"
+
+/*
+** vec_sat_u_add_imm15_uint16_t_fmt_3:
+** ...
+** vsaddu\.vi\s+v[0-9]+,\s*v[0-9]+,\s*15
+** ...
+*/
+DEF_VEC_SAT_U_ADD_IMM_FMT_3(uint16_t, 15)
diff --git 
a/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-11.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-11.c
new file mode 100644
index ..6da86a1abe17
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-11.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -ftree-vectorize 
-fdump-rtl-expand-details -fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-skip-if "" { *-*-* } { "-flto" } } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include "../vec_sat_arith.h"
+
+/*
+** vec_sat_u_add_imm33u_uint32_t_fmt_3:
+** ...
+** vsaddu\.vv\s+v[0-9]+,\s*v[0-9]+,\s*v[0-9]+
+** ...
+*/
+DEF_VEC_SAT_U_ADD_IMM_FMT_3(uint32_t, 33u)
diff --git 
a/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-12.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-12.c
new file mode 100644
index ..b6ff5a6d5d68
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-12.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -ftree-vectorize 
-fdump-rtl-expand-details -fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-skip-if "" { *-*-* } { "-flto" } } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include "../vec_sat_arith.h"
+
+/*
+** vec_sat_u_add_imm129ull_uint64_t_fmt_3:
+** ...
+** vsaddu\.vv\s+v[0-9]+,\s*v[0-9]+,\s*v[0-9]+
+** ...
+*/

[gcc r15-3360] [committed][PR rtl-optimization/116544] Fix test for promoted subregs

2024-09-01 Thread Jeff Law via Gcc-cvs

https://gcc.gnu.org/g:0562976d62e095f3a00c799288dee4e5b20114e2

commit r15-3360-g0562976d62e095f3a00c799288dee4e5b20114e2
Author: Jeff Law 
Date:   Sun Sep 1 22:16:04 2024 -0600

[committed][PR rtl-optimization/116544] Fix test for promoted subregs

This is a small bug in the ext-dce code's handling of promoted subregs.

Essentially when we see a promoted subreg we need to make additional bit 
groups
live as various parts of the RTL path know that an extension of a suitably
promoted subreg can be trivially eliminated.

When I added support for dealing with this quirk I failed to account for the
larger modes properly and it ignored the case when the size of the inner 
object
was > 32 bits.  Opps.

This does _not_ fix the outstanding x86 issue.  That's caused by something
completely different and more concerning ;(

Bootstrapped and regression tested on x86.  Obviously fixes the testcase on
riscv as well.

Pushing to the trunk.

PR rtl-optimization/116544
gcc/
* ext-dce.cc (ext_dce_process_uses): Fix thinko in promoted subreg
handling.

gcc/testsuite/
* gcc.dg/torture/pr116544.c: New test.

Diff:
---
 gcc/ext-dce.cc  |  2 +-
 gcc/testsuite/gcc.dg/torture/pr116544.c | 22 ++
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/gcc/ext-dce.cc b/gcc/ext-dce.cc
index 4a2503f18313..2f3514ae7976 100644
--- a/gcc/ext-dce.cc
+++ b/gcc/ext-dce.cc
@@ -846,7 +846,7 @@ ext_dce_process_uses (rtx_insn *insn, rtx obj,
bitmap_set_bit (livenow, rn + 1);
  if (size > 16)
bitmap_set_bit (livenow, rn + 2);
- if (size == 32)
+ if (size >= 32)
bitmap_set_bit (livenow, rn + 3);
  iter.skip_subrtxes ();
}
diff --git a/gcc/testsuite/gcc.dg/torture/pr116544.c 
b/gcc/testsuite/gcc.dg/torture/pr116544.c
new file mode 100644
index ..15f52fecb3b0
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr116544.c
@@ -0,0 +1,22 @@
+/* { dg-options "-fno-strict-aliasing -fwrapv" }
+/* { dg-do run { target longlong64 } } */
+
+extern void abort (void);
+long long a;
+signed char b[60];
+signed char c;
+long long d[60];
+int e[30];
+long long *f = d;
+static void g(long long *j, long k) { *j = k; }
+int main() {
+  d[5] = 0x1;
+  for (int h = 2; h < 7; h += 3)
+for (int i = 0; i < (c || b[h]) + 10; i += 11)
+  e[2] = f[h];
+  g(&a, e[2]);
+  if (a != 0)
+abort ();
+  return 0;
+}
+

[gcc r15-3349] RISC-V: Add testcases for form 4 of unsigned vector .SAT_ADD IMM

2024-09-01 Thread Pan Li via Gcc-cvs

https://gcc.gnu.org/g:56ed1dfa79c436b769f3266258d34d160b4330d9

commit r15-3349-g56ed1dfa79c436b769f3266258d34d160b4330d9
Author: Pan Li 
Date:   Fri Aug 30 11:01:37 2024 +0800

RISC-V: Add testcases for form 4 of unsigned vector .SAT_ADD IMM

This patch would like to add test cases for the unsigned vector .SAT_ADD
when one of the operand is IMM.

Form 4:
  #define DEF_VEC_SAT_U_ADD_IMM_FMT_4(T, IMM)   
\
  T __attribute__((noinline))   
\
  vec_sat_u_add_imm##IMM##_##T##_fmt_4 (T *out, T *in, unsigned limit)  
\
  { 
\
unsigned i; 
\
T ret;  
\
for (i = 0; i < limit; i++) 
\
  { 
\
out[i] = __builtin_add_overflow (in[i], IMM, &ret) == 0 ? ret : -1; 
\
  } 
\
  }

DEF_VEC_SAT_U_ADD_IMM_FMT_4(uint64_t, 123)

The below test are passed for this patch.
* The rv64gcv fully regression test.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vec_sat_arith.h: Add test helper 
macros.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-13.c: New 
test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-14.c: New 
test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-15.c: New 
test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-16.c: New 
test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-run-13.c: 
New test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-run-14.c: 
New test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-run-15.c: 
New test.
* gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-run-16.c: 
New test.

Signed-off-by: Pan Li 

Diff:
---
 .../riscv/rvv/autovec/binop/vec_sat_u_add_imm-13.c | 14 +++
 .../riscv/rvv/autovec/binop/vec_sat_u_add_imm-14.c | 14 +++
 .../riscv/rvv/autovec/binop/vec_sat_u_add_imm-15.c | 14 +++
 .../riscv/rvv/autovec/binop/vec_sat_u_add_imm-16.c | 14 +++
 .../rvv/autovec/binop/vec_sat_u_add_imm-run-13.c   | 28 ++
 .../rvv/autovec/binop/vec_sat_u_add_imm-run-14.c   | 28 ++
 .../rvv/autovec/binop/vec_sat_u_add_imm-run-15.c   | 28 ++
 .../rvv/autovec/binop/vec_sat_u_add_imm-run-16.c   | 28 ++
 .../gcc.target/riscv/rvv/autovec/vec_sat_arith.h   | 20 
 9 files changed, 188 insertions(+)

diff --git 
a/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-13.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-13.c
new file mode 100644
index ..a9439dff39f7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-13.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -ftree-vectorize 
-fdump-rtl-expand-details -fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-skip-if "" { *-*-* } { "-flto" } } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include "../vec_sat_arith.h"
+
+/*
+** vec_sat_u_add_imm9u_uint8_t_fmt_4:
+** ...
+** vsaddu\.vi\s+v[0-9]+,\s*v[0-9]+,\s*9
+** ...
+*/
+DEF_VEC_SAT_U_ADD_IMM_FMT_4(uint8_t, 9u)
diff --git 
a/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-14.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-14.c
new file mode 100644
index ..dbe474975991
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-14.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -ftree-vectorize 
-fdump-rtl-expand-details -fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-skip-if "" { *-*-* } { "-flto" } } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include "../vec_sat_arith.h"
+
+/*
+** vec_sat_u_add_imm15_uint16_t_fmt_4:
+** ...
+** vsaddu\.vi\s+v[0-9]+,\s*v[0-9]+,\s*15
+** ...
+*/
+DEF_VEC_SAT_U_ADD_IMM_FMT_4(uint16_t, 15)
diff --git 
a/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-15.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-15.c
new file mode 100644
index ..0ac2e1b2942f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vec_sat_u_add_imm-15.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -ftree-vectorize 
-fdump-rtl-expand-details -fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-skip-if "" { *-*-* } { "-flto"

[gcc r15-3347] RISC-V: Refactor gen zero_extend rtx for SAT_* when expand SImode in RV64

2024-09-01 Thread Pan Li via Gcc-cvs

https://gcc.gnu.org/g:e96d4bf6a6e8b8a5ea1b81a79f4efa07dee77af1

commit r15-3347-ge96d4bf6a6e8b8a5ea1b81a79f4efa07dee77af1
Author: Pan Li 
Date:   Fri Aug 30 14:07:12 2024 +0800

RISC-V: Refactor gen zero_extend rtx for SAT_* when expand SImode in RV64

In previous, we have some specially handling for both the .SAT_ADD and
.SAT_SUB for unsigned int.  There are similar to take care of SImode
in RV64 for zero extend.  Thus refactor these two helper function
into one for possible code duplication.

The below test suite are passed for this patch.
* The rv64gcv fully regression test.

gcc/ChangeLog:

* config/riscv/riscv.cc (riscv_gen_zero_extend_rtx): Merge
the zero_extend handing from func riscv_gen_unsigned_xmode_reg.
(riscv_gen_unsigned_xmode_reg): Remove.
(riscv_expand_ussub): Leverage riscv_gen_zero_extend_rtx
instead of riscv_gen_unsigned_xmode_reg.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/sat_u_sub-11.c: Adjust asm check.
* gcc.target/riscv/sat_u_sub-15.c: Ditto.
* gcc.target/riscv/sat_u_sub-19.c: Ditto.
* gcc.target/riscv/sat_u_sub-23.c: Ditto.
* gcc.target/riscv/sat_u_sub-27.c: Ditto.
* gcc.target/riscv/sat_u_sub-3.c: Ditto.
* gcc.target/riscv/sat_u_sub-31.c: Ditto.
* gcc.target/riscv/sat_u_sub-35.c: Ditto.
* gcc.target/riscv/sat_u_sub-39.c: Ditto.
* gcc.target/riscv/sat_u_sub-43.c: Ditto.
* gcc.target/riscv/sat_u_sub-47.c: Ditto.
* gcc.target/riscv/sat_u_sub-7.c: Ditto.
* gcc.target/riscv/sat_u_sub_imm-11.c: Ditto.
* gcc.target/riscv/sat_u_sub_imm-11_1.c: Ditto.
* gcc.target/riscv/sat_u_sub_imm-11_2.c: Ditto.
* gcc.target/riscv/sat_u_sub_imm-15.c: Ditto.
* gcc.target/riscv/sat_u_sub_imm-15_1.c: Ditto.
* gcc.target/riscv/sat_u_sub_imm-15_2.c: Ditto.
* gcc.target/riscv/sat_u_sub_imm-3.c: Ditto.
* gcc.target/riscv/sat_u_sub_imm-3_1.c: Ditto.
* gcc.target/riscv/sat_u_sub_imm-3_2.c: Ditto.
* gcc.target/riscv/sat_u_sub_imm-7.c: Ditto.
* gcc.target/riscv/sat_u_sub_imm-7_1.c: Ditto.
* gcc.target/riscv/sat_u_sub_imm-7_2.c: Ditto.

Signed-off-by: Pan Li 

Diff:
---
 gcc/config/riscv/riscv.cc  | 99 ++
 gcc/testsuite/gcc.target/riscv/sat_u_sub-11.c  |  4 +
 gcc/testsuite/gcc.target/riscv/sat_u_sub-15.c  |  4 +
 gcc/testsuite/gcc.target/riscv/sat_u_sub-19.c  |  4 +
 gcc/testsuite/gcc.target/riscv/sat_u_sub-23.c  |  4 +
 gcc/testsuite/gcc.target/riscv/sat_u_sub-27.c  |  4 +
 gcc/testsuite/gcc.target/riscv/sat_u_sub-3.c   |  4 +
 gcc/testsuite/gcc.target/riscv/sat_u_sub-31.c  |  4 +
 gcc/testsuite/gcc.target/riscv/sat_u_sub-35.c  |  4 +
 gcc/testsuite/gcc.target/riscv/sat_u_sub-39.c  |  4 +
 gcc/testsuite/gcc.target/riscv/sat_u_sub-43.c  |  4 +
 gcc/testsuite/gcc.target/riscv/sat_u_sub-47.c  |  4 +
 gcc/testsuite/gcc.target/riscv/sat_u_sub-7.c   |  4 +
 gcc/testsuite/gcc.target/riscv/sat_u_sub_imm-11.c  |  2 +
 .../gcc.target/riscv/sat_u_sub_imm-11_1.c  |  2 +
 .../gcc.target/riscv/sat_u_sub_imm-11_2.c  |  2 +
 gcc/testsuite/gcc.target/riscv/sat_u_sub_imm-15.c  |  2 +
 .../gcc.target/riscv/sat_u_sub_imm-15_1.c  |  2 +
 .../gcc.target/riscv/sat_u_sub_imm-15_2.c  |  2 +
 gcc/testsuite/gcc.target/riscv/sat_u_sub_imm-3.c   |  2 +
 gcc/testsuite/gcc.target/riscv/sat_u_sub_imm-3_1.c |  2 +
 gcc/testsuite/gcc.target/riscv/sat_u_sub_imm-3_2.c |  2 +
 gcc/testsuite/gcc.target/riscv/sat_u_sub_imm-7.c   |  2 +
 gcc/testsuite/gcc.target/riscv/sat_u_sub_imm-7_1.c |  2 +
 gcc/testsuite/gcc.target/riscv/sat_u_sub_imm-7_2.c |  2 +
 25 files changed, 118 insertions(+), 53 deletions(-)

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 496dd177fe7f..75b37b532443 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -11894,19 +11894,56 @@ riscv_get_raw_result_mode (int regno)
   return default_get_reg_raw_mode (regno);
 }
 
-/* Generate a new rtx of Xmode based on the rtx and mode in define pattern.
-   The rtx x will be zero extended to Xmode if the mode is HI/QImode,  and
-   the new zero extended Xmode rtx will be returned.
-   Or the gen_lowpart rtx of Xmode will be returned.  */
+/* Generate a REG rtx of Xmode from the given rtx and mode.
+   The rtx x can be REG (QI/HI/SI/DI) or const_int.
+   The machine_mode mode is the original mode from define pattern.
+
+   If rtx is REG and Xmode, the RTX x will be returned directly.
+
+   If rtx is REG and non-Xmode, the zero extended to new REG of Xmode will be
+   returned.
+
+   If rtx is const_int, a new REG rtx will be created to hold the value of
+   const_int and then returned.
+
+   According to the gcci

[gcc r15-3359] i386: Support vec_cmp for V8BF/V16BF/V32BF in AVX10.2

2024-09-01 Thread Haochen Jiang via Gcc-cvs

https://gcc.gnu.org/g:f77435aa3911c437cba71991509eee57b333b3ce

commit r15-3359-gf77435aa3911c437cba71991509eee57b333b3ce
Author: Levy Hsu 
Date:   Mon Sep 2 10:24:49 2024 +0800

i386: Support vec_cmp for V8BF/V16BF/V32BF in AVX10.2

gcc/ChangeLog:

* config/i386/i386-expand.cc (ix86_use_mask_cmp_p): Add BFmode
for int mask cmp.
* config/i386/sse.md (vec_cmp): New
vec_cmp expand for VBF modes.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx10_2-512-bf-vector-cmpp-1.c: New test.
* gcc.target/i386/avx10_2-bf-vector-cmpp-1.c: Ditto.

Diff:
---
 gcc/config/i386/i386-expand.cc |  2 ++
 gcc/config/i386/sse.md | 13 ++
 .../gcc.target/i386/avx10_2-512-bf-vector-cmpp-1.c | 19 ++
 .../gcc.target/i386/avx10_2-bf-vector-cmpp-1.c | 29 ++
 4 files changed, 63 insertions(+)

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 53327544620f..124cb976ec87 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -4036,6 +4036,8 @@ ix86_use_mask_cmp_p (machine_mode mode, machine_mode 
cmp_mode,
 return true;
   else if (GET_MODE_INNER (cmp_mode) == HFmode)
 return true;
+  else if (GET_MODE_INNER (cmp_mode) == BFmode)
+return true;
 
   /* When op_true is NULL, op_false must be NULL, or vice versa.  */
   gcc_assert (!op_true == !op_false);
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 2de592a9c8fa..3bf95f0b0e53 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -4797,6 +4797,19 @@
   DONE;
 })
 
+(define_expand "vec_cmp"
+  [(set (match_operand: 0 "register_operand")
+   (match_operator: 1 ""
+ [(match_operand:VBF_AVX10_2 2 "register_operand")
+  (match_operand:VBF_AVX10_2 3 "nonimmediate_operand")]))]
+  "TARGET_AVX10_2_256"
+{
+  bool ok = ix86_expand_mask_vec_cmp (operands[0], GET_CODE (operands[1]),
+ operands[2], operands[3]);
+  gcc_assert (ok);
+  DONE;
+})
+
 (define_expand "vec_cmp"
   [(set (match_operand: 0 "register_operand")
(match_operator: 1 ""
diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-cmpp-1.c 
b/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-cmpp-1.c
new file mode 100644
index ..416fcaa36289
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-cmpp-1.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx10.2-512 -O2 -mprefer-vector-width=512" } */
+/* { dg-final { scan-assembler-times "vcmppbf16" 5 } } */
+
+typedef __bf16 v32bf __attribute__ ((__vector_size__ (64)));
+
+#define VCMPMN(type, op, name) \
+type  \
+__attribute__ ((noinline, noclone)) \
+vec_cmp_##type##type##name (type a, type b) \
+{ \
+  return a op b;  \
+}
+
+VCMPMN (v32bf, <, lt)
+VCMPMN (v32bf, <=, le)
+VCMPMN (v32bf, >, gt)
+VCMPMN (v32bf, >=, ge)
+VCMPMN (v32bf, ==, eq)
diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-cmpp-1.c 
b/gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-cmpp-1.c
new file mode 100644
index ..6234116039f0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-cmpp-1.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx10.2 -O2" } */
+/* { dg-final { scan-assembler-times "vcmppbf16" 10 } } */
+
+typedef __bf16 v16bf __attribute__ ((__vector_size__ (32)));
+typedef __bf16 v8bf __attribute__ ((__vector_size__ (16)));
+
+#define VCMPMN(type, op, name) \
+type  \
+__attribute__ ((noinline, noclone)) \
+vec_cmp_##type##type##name (type a, type b) \
+{ \
+  return a op b;  \
+}
+
+VCMPMN (v16bf, <, lt)
+VCMPMN (v8bf, <, lt)
+
+VCMPMN (v16bf, <=, le)
+VCMPMN (v8bf, <=, le)
+
+VCMPMN (v16bf, >, gt)
+VCMPMN (v8bf, >, gt)
+
+VCMPMN (v16bf, >=, ge)
+VCMPMN (v8bf, >=, ge)
+
+VCMPMN (v16bf, ==, eq)
+VCMPMN (v8bf, ==, eq)

[gcc r15-3356] i386: Support vectorized BF16 FMA with AVX10.2 instructions

2024-09-01 Thread Haochen Jiang via Gcc-cvs

https://gcc.gnu.org/g:6d294fb8ac9baf2624446deaa4c995b7a7719823

commit r15-3356-g6d294fb8ac9baf2624446deaa4c995b7a7719823
Author: Levy Hsu 
Date:   Mon Sep 2 10:24:46 2024 +0800

i386: Support vectorized BF16 FMA with AVX10.2 instructions

gcc/ChangeLog:

* config/i386/sse.md: Add V8BF/V16BF/V32BF to mode iterator 
FMAMODEM.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx10_2-512-bf-vector-fma-1.c: New test.
* gcc.target/i386/avx10_2-bf-vector-fma-1.c: New test.

Diff:
---
 gcc/config/i386/sse.md |  5 +-
 .../gcc.target/i386/avx10_2-512-bf-vector-fma-1.c  | 34 
 .../gcc.target/i386/avx10_2-bf-vector-fma-1.c  | 63 ++
 3 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index ebca462bae8b..85fbef331ea4 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -5677,7 +5677,10 @@
(HF "TARGET_AVX512FP16")
(V8HF "TARGET_AVX512FP16 && TARGET_AVX512VL")
(V16HF "TARGET_AVX512FP16 && TARGET_AVX512VL")
-   (V32HF "TARGET_AVX512FP16 && TARGET_EVEX512")])
+   (V32HF "TARGET_AVX512FP16 && TARGET_EVEX512")
+   (V8BF "TARGET_AVX10_2_256")
+   (V16BF "TARGET_AVX10_2_256")
+   (V32BF "TARGET_AVX10_2_512")])
 
 (define_expand "fma4"
   [(set (match_operand:FMAMODEM 0 "register_operand")
diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-fma-1.c 
b/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-fma-1.c
new file mode 100644
index ..a857f9b90db4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-fma-1.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx10.2-512 -O2" } */
+/* { dg-final { scan-assembler-times "vfmadd132nepbf16\[ 
\\t\]+\[^\{\n\]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfmsub132nepbf16\[ 
\\t\]+\[^\{\n\]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfnmadd132nepbf16\[ 
\\t\]+\[^\{\n\]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfnmsub132nepbf16\[ 
\\t\]+\[^\{\n\]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+
+#include 
+
+typedef __bf16 v32bf __attribute__ ((__vector_size__ (64)));
+
+v32bf
+foo_madd (v32bf a, v32bf b, v32bf c)
+{
+  return a * b + c;
+}
+
+v32bf
+foo_msub (v32bf a, v32bf b, v32bf c)
+{
+  return a * b - c;
+}
+
+v32bf
+foo_nmadd (v32bf a, v32bf b, v32bf c)
+{
+  return -a * b + c;
+}
+
+v32bf
+foo_nmsub (v32bf a, v32bf b, v32bf c)
+{
+  return -a * b - c;
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-fma-1.c 
b/gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-fma-1.c
new file mode 100644
index ..0fd78efe0493
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-fma-1.c
@@ -0,0 +1,63 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx10.2 -O2" } */
+/* { dg-final { scan-assembler-times "vfmadd132nepbf16\[ 
\\t\]+\[^\{\n\]*%ymm\[0-9\]+\[^\n\r]*%ymm\[0-9\]+\[^\n\r]*%ymm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfmsub132nepbf16\[ 
\\t\]+\[^\{\n\]*%ymm\[0-9\]+\[^\n\r]*%ymm\[0-9\]+\[^\n\r]*%ymm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfnmadd132nepbf16\[ 
\\t\]+\[^\{\n\]*%ymm\[0-9\]+\[^\n\r]*%ymm\[0-9\]+\[^\n\r]*%ymm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfnmsub132nepbf16\[ 
\\t\]+\[^\{\n\]*%ymm\[0-9\]+\[^\n\r]*%ymm\[0-9\]+\[^\n\r]*%ymm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfmadd132nepbf16\[ 
\\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfmsub132nepbf16\[ 
\\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfnmadd132nepbf16\[ 
\\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vfnmsub132nepbf16\[ 
\\t\]+\[^\{\n\]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+\[^\n\r]*%xmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+
+#include 
+
+typedef __bf16 v16bf __attribute__ ((__vector_size__ (32)));
+typedef __bf16 v8bf __attribute__ ((__vector_size__ (16)));
+
+v16bf
+foo_madd_256 (v16bf a, v16bf b, v16bf c)
+{
+  return a * b + c;
+}
+
+v16bf
+foo_msub_256 (v16bf a, v16bf b, v16bf c)
+{
+  return a * b - c;
+}
+
+v16bf
+foo_nmadd_256 (v16bf a, v16bf b, v16bf c)
+{
+  return -a * b + c;
+}
+
+v16bf
+foo_nmsub_256 (v16bf a, v16bf b, v16bf c)
+{
+  return -a * b - c;
+}
+
+v8bf
+foo_madd_128 (v8bf a, v8bf b, v8bf c)
+{
+  return a * b + c;
+}
+
+v8bf
+foo_msub_128 (v8bf a, v8bf b, v8bf c)
+{
+  return a * b - c;
+}
+
+v8bf
+foo_nmadd_128 (v8b

[gcc r14-10625] Check avx upper register for parallel.

2024-09-01 Thread hongtao Liu via Gcc-cvs

https://gcc.gnu.org/g:ba9a3f105ea552a22d08f2d54dfdbef16af7c99e

commit r14-10625-gba9a3f105ea552a22d08f2d54dfdbef16af7c99e
Author: liuhongt 
Date:   Thu Aug 29 11:39:20 2024 +0800

Check avx upper register for parallel.

For function arguments/return, when it's BLK mode, it's put in a
parallel with an expr_list, and the expr_list contains the real mode
and registers.
Current ix86_check_avx_upper_register only checked for SSE_REG_P, and
failed to handle that. The patch extend the handle to each subrtx.

gcc/ChangeLog:

PR target/116512
* config/i386/i386.cc (ix86_check_avx_upper_register): Iterate
subrtx to scan for avx upper register.
(ix86_check_avx_upper_stores): Inline old
ix86_check_avx_upper_register.
(ix86_avx_u128_mode_needed): Ditto, and replace
FOR_EACH_SUBRTX with call to new
ix86_check_avx_upper_register.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr116512.c: New test.

(cherry picked from commit ab214ef734bfc3dcffcf79ff9e1dd651c2b40566)

Diff:
---
 gcc/config/i386/i386.cc  | 36 
 gcc/testsuite/gcc.target/i386/pr116512.c | 26 +++
 2 files changed, 49 insertions(+), 13 deletions(-)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 288c69467d62..feefbe322dec 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -15027,9 +15027,19 @@ ix86_dirflag_mode_needed (rtx_insn *insn)
 static bool
 ix86_check_avx_upper_register (const_rtx exp)
 {
-  return (SSE_REG_P (exp)
- && !EXT_REX_SSE_REG_P (exp)
- && GET_MODE_BITSIZE (GET_MODE (exp)) > 128);
+  /* construct_container may return a parallel with expr_list
+ which contains the real reg and mode  */
+  subrtx_iterator::array_type array;
+  FOR_EACH_SUBRTX (iter, array, exp, NONCONST)
+{
+  const_rtx x = *iter;
+  if (SSE_REG_P (x)
+ && !EXT_REX_SSE_REG_P (x)
+ && GET_MODE_BITSIZE (GET_MODE (x)) > 128)
+   return true;
+}
+
+  return false;
 }
 
 /* Check if a 256bit or 512bit AVX register is referenced in stores.   */
@@ -15037,7 +15047,9 @@ ix86_check_avx_upper_register (const_rtx exp)
 static void
 ix86_check_avx_upper_stores (rtx dest, const_rtx, void *data)
 {
-  if (ix86_check_avx_upper_register (dest))
+  if (SSE_REG_P (dest)
+  && !EXT_REX_SSE_REG_P (dest)
+  && GET_MODE_BITSIZE (GET_MODE (dest)) > 128)
 {
   bool *used = (bool *) data;
   *used = true;
@@ -15096,14 +15108,14 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
   return AVX_U128_CLEAN;
 }
 
-  subrtx_iterator::array_type array;
-
   rtx set = single_set (insn);
   if (set)
 {
   rtx dest = SET_DEST (set);
   rtx src = SET_SRC (set);
-  if (ix86_check_avx_upper_register (dest))
+  if (SSE_REG_P (dest)
+ && !EXT_REX_SSE_REG_P (dest)
+ && GET_MODE_BITSIZE (GET_MODE (dest)) > 128)
{
  /* This is an YMM/ZMM load.  Return AVX_U128_DIRTY if the
 source isn't zero.  */
@@ -15114,9 +15126,8 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
}
   else
{
- FOR_EACH_SUBRTX (iter, array, src, NONCONST)
-   if (ix86_check_avx_upper_register (*iter))
- return AVX_U128_DIRTY;
+ if (ix86_check_avx_upper_register (src))
+   return AVX_U128_DIRTY;
}
 
   /* This isn't YMM/ZMM load/store.  */
@@ -15127,9 +15138,8 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
  Hardware changes state only when a 256bit register is written to,
  but we need to prevent the compiler from moving optimal insertion
  point above eventual read from 256bit or 512 bit register.  */
-  FOR_EACH_SUBRTX (iter, array, PATTERN (insn), NONCONST)
-if (ix86_check_avx_upper_register (*iter))
-  return AVX_U128_DIRTY;
+  if (ix86_check_avx_upper_register (PATTERN (insn)))
+return AVX_U128_DIRTY;
 
   return AVX_U128_ANY;
 }
diff --git a/gcc/testsuite/gcc.target/i386/pr116512.c 
b/gcc/testsuite/gcc.target/i386/pr116512.c
new file mode 100644
index ..c2bc6c91b648
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr116512.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-march=x86-64-v4 -O2" } */
+/* { dg-final { scan-assembler-not "vzeroupper" { target { ! ia32 } } } } */
+
+#include 
+
+struct B {
+  union {
+__m512 f;
+__m512i s;
+  };
+};
+
+struct B foo(int n) {
+  struct B res;
+  res.s = _mm512_set1_epi32(n);
+
+  return res;
+}
+
+__m512i bar(int n) {
+  struct B res;
+  res.s = _mm512_set1_epi32(n);
+
+  return res.s;
+}

[gcc r15-3353] i386: Optimize ordered and nonequal

2024-09-01 Thread Haochen Jiang via Gcc-cvs

https://gcc.gnu.org/g:86f5031c804220274a9bbebd26b8ebf47a2207ac

commit r15-3353-g86f5031c804220274a9bbebd26b8ebf47a2207ac
Author: Hu, Lin1 
Date:   Mon Sep 2 10:24:31 2024 +0800

i386: Optimize ordered and nonequal

Currently, when we input !__builtin_isunordered (a, b) && (a != b), gcc
will emit
  ucomiss %xmm1, %xmm0
  movl $1, %ecx
  setp %dl
  setnp %al
  cmovne %ecx, %edx
  andl %edx, %eax
  movzbl %al, %eax

In fact,
  xorl %eax, %eax
  ucomiss %xmm1, %xmm0
  setne %al
is better.

gcc/ChangeLog:

* match.pd: Optimize (and ordered non-equal) to
(not (or unordered  equal))

gcc/testsuite/ChangeLog:

* gcc.target/i386/optimize_one.c: New test.

Diff:
---
 gcc/match.pd | 3 +++
 gcc/testsuite/gcc.target/i386/optimize_one.c | 9 +
 2 files changed, 12 insertions(+)

diff --git a/gcc/match.pd b/gcc/match.pd
index be211535a49f..4298e89dad6d 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -6651,6 +6651,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (ltgt @0 @0)
  (if (!flag_trapping_math || !tree_expr_maybe_nan_p (@0))
   { constant_boolean_node (false, type); }))
+(simplify
+ (bit_and (ordered @0 @1) (ne @0 @1))
+ (bit_not (uneq @0 @1)))
 
 /* x == ~x -> false */
 /* x != ~x -> true */
diff --git a/gcc/testsuite/gcc.target/i386/optimize_one.c 
b/gcc/testsuite/gcc.target/i386/optimize_one.c
new file mode 100644
index ..62728d3c5ba4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/optimize_one.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mfpmath=sse" } */
+/* { dg-final { scan-assembler-times "comi" 1 } } */
+/* { dg-final { scan-assembler-times "set" 1 } } */
+
+int is_ordered_or_nonequal_sh (float a, float b)
+{
+  return !__builtin_isunordered (a, b) && (a != b);
+}

[gcc r13-8999] Check avx upper register for parallel.

2024-09-01 Thread hongtao Liu via Gcc-cvs

https://gcc.gnu.org/g:5e049ada87842947adaca5c607516396889f64d6

commit r13-8999-g5e049ada87842947adaca5c607516396889f64d6
Author: liuhongt 
Date:   Thu Aug 29 11:39:20 2024 +0800

Check avx upper register for parallel.

For function arguments/return, when it's BLK mode, it's put in a
parallel with an expr_list, and the expr_list contains the real mode
and registers.
Current ix86_check_avx_upper_register only checked for SSE_REG_P, and
failed to handle that. The patch extend the handle to each subrtx.

gcc/ChangeLog:

PR target/116512
* config/i386/i386.cc (ix86_check_avx_upper_register): Iterate
subrtx to scan for avx upper register.
(ix86_check_avx_upper_stores): Inline old
ix86_check_avx_upper_register.
(ix86_avx_u128_mode_needed): Ditto, and replace
FOR_EACH_SUBRTX with call to new
ix86_check_avx_upper_register.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr116512.c: New test.

(cherry picked from commit ab214ef734bfc3dcffcf79ff9e1dd651c2b40566)

Diff:
---
 gcc/config/i386/i386.cc  | 36 
 gcc/testsuite/gcc.target/i386/pr116512.c | 26 +++
 2 files changed, 49 insertions(+), 13 deletions(-)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 499184166ff2..a90351ca9c2c 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -14432,9 +14432,19 @@ ix86_dirflag_mode_needed (rtx_insn *insn)
 static bool
 ix86_check_avx_upper_register (const_rtx exp)
 {
-  return (SSE_REG_P (exp)
- && !EXT_REX_SSE_REG_P (exp)
- && GET_MODE_BITSIZE (GET_MODE (exp)) > 128);
+  /* construct_container may return a parallel with expr_list
+ which contains the real reg and mode  */
+  subrtx_iterator::array_type array;
+  FOR_EACH_SUBRTX (iter, array, exp, NONCONST)
+{
+  const_rtx x = *iter;
+  if (SSE_REG_P (x)
+ && !EXT_REX_SSE_REG_P (x)
+ && GET_MODE_BITSIZE (GET_MODE (x)) > 128)
+   return true;
+}
+
+  return false;
 }
 
 /* Check if a 256bit or 512bit AVX register is referenced in stores.   */
@@ -14442,7 +14452,9 @@ ix86_check_avx_upper_register (const_rtx exp)
 static void
 ix86_check_avx_upper_stores (rtx dest, const_rtx, void *data)
 {
-  if (ix86_check_avx_upper_register (dest))
+  if (SSE_REG_P (dest)
+  && !EXT_REX_SSE_REG_P (dest)
+  && GET_MODE_BITSIZE (GET_MODE (dest)) > 128)
 {
   bool *used = (bool *) data;
   *used = true;
@@ -14500,14 +14512,14 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
   return AVX_U128_CLEAN;
 }
 
-  subrtx_iterator::array_type array;
-
   rtx set = single_set (insn);
   if (set)
 {
   rtx dest = SET_DEST (set);
   rtx src = SET_SRC (set);
-  if (ix86_check_avx_upper_register (dest))
+  if (SSE_REG_P (dest)
+ && !EXT_REX_SSE_REG_P (dest)
+ && GET_MODE_BITSIZE (GET_MODE (dest)) > 128)
{
  /* This is an YMM/ZMM load.  Return AVX_U128_DIRTY if the
 source isn't zero.  */
@@ -14518,9 +14530,8 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
}
   else
{
- FOR_EACH_SUBRTX (iter, array, src, NONCONST)
-   if (ix86_check_avx_upper_register (*iter))
- return AVX_U128_DIRTY;
+ if (ix86_check_avx_upper_register (src))
+   return AVX_U128_DIRTY;
}
 
   /* This isn't YMM/ZMM load/store.  */
@@ -14531,9 +14542,8 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
  Hardware changes state only when a 256bit register is written to,
  but we need to prevent the compiler from moving optimal insertion
  point above eventual read from 256bit or 512 bit register.  */
-  FOR_EACH_SUBRTX (iter, array, PATTERN (insn), NONCONST)
-if (ix86_check_avx_upper_register (*iter))
-  return AVX_U128_DIRTY;
+  if (ix86_check_avx_upper_register (PATTERN (insn)))
+return AVX_U128_DIRTY;
 
   return AVX_U128_ANY;
 }
diff --git a/gcc/testsuite/gcc.target/i386/pr116512.c 
b/gcc/testsuite/gcc.target/i386/pr116512.c
new file mode 100644
index ..c2bc6c91b648
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr116512.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-march=x86-64-v4 -O2" } */
+/* { dg-final { scan-assembler-not "vzeroupper" { target { ! ia32 } } } } */
+
+#include 
+
+struct B {
+  union {
+__m512 f;
+__m512i s;
+  };
+};
+
+struct B foo(int n) {
+  struct B res;
+  res.s = _mm512_set1_epi32(n);
+
+  return res;
+}
+
+__m512i bar(int n) {
+  struct B res;
+  res.s = _mm512_set1_epi32(n);
+
+  return res.s;
+}

[gcc r15-3357] i386: Support vectorized BF16 smaxmin with AVX10.2 instructions

2024-09-01 Thread Haochen Jiang via Gcc-cvs

https://gcc.gnu.org/g:29ef601973d7b79338694e59581d4c24bcd07f69

commit r15-3357-g29ef601973d7b79338694e59581d4c24bcd07f69
Author: Levy Hsu 
Date:   Mon Sep 2 10:24:47 2024 +0800

i386: Support vectorized BF16 smaxmin with AVX10.2 instructions

gcc/ChangeLog:

* config/i386/sse.md
(3): New define expand pattern for BF smaxmin.

gcc/testsuite/ChangeLog:
* gcc.target/i386/avx10_2-512-bf-vector-smaxmin-1.c: New test.
* gcc.target/i386/avx10_2-bf-vector-smaxmin-1.c: New test.

Diff:
---
 gcc/config/i386/sse.md |  7 +
 .../i386/avx10_2-512-bf-vector-smaxmin-1.c | 20 
 .../gcc.target/i386/avx10_2-bf-vector-smaxmin-1.c  | 36 ++
 3 files changed, 63 insertions(+)

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 85fbef331ea4..b374783429cb 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -31901,6 +31901,13 @@
"vscalefpbf16\t{%2, %1, %0|%0, %1, %2}"
[(set_attr "prefix" "evex")])
 
+(define_expand "3"
+  [(set (match_operand:VBF_AVX10_2 0 "register_operand")
+ (smaxmin:VBF_AVX10_2
+   (match_operand:VBF_AVX10_2 1 "register_operand")
+   (match_operand:VBF_AVX10_2 2 "nonimmediate_operand")))]
+  "TARGET_AVX10_2_256")
+
 (define_insn "avx10_2_pbf16_"
[(set (match_operand:VBF_AVX10_2 0 "register_operand" "=v")
   (smaxmin:VBF_AVX10_2
diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-smaxmin-1.c 
b/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-smaxmin-1.c
new file mode 100644
index ..e33c325e2da9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-smaxmin-1.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx10.2-512 -mprefer-vector-width=512 -Ofast" } */
+/* /* { dg-final { scan-assembler-times "vmaxpbf16" 1 } } */
+/* /* { dg-final { scan-assembler-times "vminpbf16" 1 } } */
+
+void
+maxpbf16_512 (__bf16* dest, __bf16* src1, __bf16* src2)
+{
+  int i;
+  for (i = 0; i < 32; i++)
+dest[i] = src1[i] > src2[i] ? src1[i] : src2[i];
+}
+
+void
+minpbf16_512 (__bf16* dest, __bf16* src1, __bf16* src2)
+{
+  int i;
+  for (i = 0; i < 32; i++)
+dest[i] = src1[i] < src2[i] ? src1[i] : src2[i];
+}
diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-smaxmin-1.c 
b/gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-smaxmin-1.c
new file mode 100644
index ..9bae073c95aa
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx10_2-bf-vector-smaxmin-1.c
@@ -0,0 +1,36 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx10.2 -Ofast" } */
+/* /* { dg-final { scan-assembler-times "vmaxpbf16" 2 } } */
+/* /* { dg-final { scan-assembler-times "vminpbf16" 2 } } */
+
+void
+maxpbf16_256 (__bf16* dest, __bf16* src1, __bf16* src2)
+{
+  int i;
+  for (i = 0; i < 16; i++)
+dest[i] = src1[i] > src2[i] ? src1[i] : src2[i];
+}
+
+void
+minpbf16_256 (__bf16* dest, __bf16* src1, __bf16* src2)
+{
+  int i;
+  for (i = 0; i < 16; i++)
+dest[i] = src1[i] < src2[i] ? src1[i] : src2[i];
+}
+
+void
+maxpbf16_128 (__bf16* dest, __bf16* src1, __bf16* src2)
+{
+  int i;
+  for (i = 0; i < 16; i++)
+dest[i] = src1[i] > src2[i] ? src1[i] : src2[i];
+}
+
+void
+minpbf16_128 (__bf16* dest, __bf16* src1, __bf16* src2)
+{
+  int i;
+  for (i = 0; i < 16; i++)
+dest[i] = src1[i] < src2[i] ? src1[i] : src2[i];
+}

[gcc r15-3351] RISC-V: Add testcases for unsigned scalar quad and oct .SAT_TRUNC form 3

2024-09-01 Thread Pan Li via Gcc-cvs

https://gcc.gnu.org/g:5239902210a16b22d59d2cf8b535d615922a5c00

commit r15-3351-g5239902210a16b22d59d2cf8b535d615922a5c00
Author: Pan Li 
Date:   Sun Aug 18 14:08:21 2024 +0800

RISC-V: Add testcases for unsigned scalar quad and oct .SAT_TRUNC form 3

This patch would like to add test cases for the unsigned scalar quad and
oct .SAT_TRUNC form 3.  Aka:

Form 3:
  #define DEF_SAT_U_TRUC_FMT_3(NT, WT) \
  NT __attribute__((noinline)) \
  sat_u_truc_##WT##_to_##NT##_fmt_3 (WT x) \
  {\
WT max = (WT)(NT)-1;   \
return x <= max ? (NT)x : (NT) max;\
  }

QUAD:
DEF_SAT_U_TRUC_FMT_3 (uint16_t, uint64_t)
DEF_SAT_U_TRUC_FMT_3 (uint8_t, uint32_t)

OCT:
DEF_SAT_U_TRUC_FMT_3 (uint8_t, uint64_t)

The below test is passed for this patch.
* The rv64gcv regression test.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/sat_u_trunc-16.c: New test.
* gcc.target/riscv/sat_u_trunc-17.c: New test.
* gcc.target/riscv/sat_u_trunc-18.c: New test.
* gcc.target/riscv/sat_u_trunc-run-16.c: New test.
* gcc.target/riscv/sat_u_trunc-run-17.c: New test.
* gcc.target/riscv/sat_u_trunc-run-18.c: New test.

Signed-off-by: Pan Li 

Diff:
---
 gcc/testsuite/gcc.target/riscv/sat_u_trunc-16.c | 17 +
 gcc/testsuite/gcc.target/riscv/sat_u_trunc-17.c | 17 +
 gcc/testsuite/gcc.target/riscv/sat_u_trunc-18.c | 20 
 gcc/testsuite/gcc.target/riscv/sat_u_trunc-run-16.c | 16 
 gcc/testsuite/gcc.target/riscv/sat_u_trunc-run-17.c | 16 
 gcc/testsuite/gcc.target/riscv/sat_u_trunc-run-18.c | 16 
 6 files changed, 102 insertions(+)

diff --git a/gcc/testsuite/gcc.target/riscv/sat_u_trunc-16.c 
b/gcc/testsuite/gcc.target/riscv/sat_u_trunc-16.c
new file mode 100644
index ..f91da58c0bae
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/sat_u_trunc-16.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gc -mabi=lp64d -O3 -fdump-rtl-expand-details 
-fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include "sat_arith.h"
+
+/*
+** sat_u_trunc_uint32_t_to_uint8_t_fmt_3:
+** sltiu\s+[atx][0-9]+,\s*a0,\s*255
+** addi\s+[atx][0-9]+,\s*[atx][0-9]+,\s*-1
+** or\s+[atx][0-9]+,\s*[atx][0-9]+,\s*[atx][0-9]+
+** andi\s+[atx][0-9]+,\s*[atx][0-9]+,\s*0xff
+** ret
+*/
+DEF_SAT_U_TRUNC_FMT_3(uint8_t, uint32_t)
+
+/* { dg-final { scan-rtl-dump-times ".SAT_TRUNC " 2 "expand" } } */
diff --git a/gcc/testsuite/gcc.target/riscv/sat_u_trunc-17.c 
b/gcc/testsuite/gcc.target/riscv/sat_u_trunc-17.c
new file mode 100644
index ..9813e1f79b05
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/sat_u_trunc-17.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gc -mabi=lp64d -O3 -fdump-rtl-expand-details 
-fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include "sat_arith.h"
+
+/*
+** sat_u_trunc_uint64_t_to_uint8_t_fmt_3:
+** sltiu\s+[atx][0-9]+,\s*a0,\s*255
+** addi\s+[atx][0-9]+,\s*[atx][0-9]+,\s*-1
+** or\s+[atx][0-9]+,\s*[atx][0-9]+,\s*[atx][0-9]+
+** andi\s+[atx][0-9]+,\s*[atx][0-9]+,\s*0xff
+** ret
+*/
+DEF_SAT_U_TRUNC_FMT_3(uint8_t, uint64_t)
+
+/* { dg-final { scan-rtl-dump-times ".SAT_TRUNC " 2 "expand" } } */
diff --git a/gcc/testsuite/gcc.target/riscv/sat_u_trunc-18.c 
b/gcc/testsuite/gcc.target/riscv/sat_u_trunc-18.c
new file mode 100644
index ..eb799849f73a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/sat_u_trunc-18.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gc -mabi=lp64d -O3 -fdump-rtl-expand-details 
-fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include "sat_arith.h"
+
+/*
+** sat_u_trunc_uint64_t_to_uint16_t_fmt_3:
+** li\s+[atx][0-9]+,\s*65536
+** addi\s+[atx][0-9]+,\s*[atx][0-9]+,\s*-1
+** sltu\s+[atx][0-9]+,\s*a0,\s*[atx][0-9]+
+** addi\s+[atx][0-9]+,\s*[atx][0-9]+,\s*-1
+** or\s+[atx][0-9]+,\s*[atx][0-9]+,\s*[atx][0-9]+
+** slli\s+a0,\s*a0,\s*48
+** srli\s+a0,\s*a0,\s*48
+** ret
+*/
+DEF_SAT_U_TRUNC_FMT_3(uint16_t, uint64_t)
+
+/* { dg-final { scan-rtl-dump-times ".SAT_TRUNC " 2 "expand" } } */
diff --git a/gcc/testsuite/gcc.target/riscv/sat_u_trunc-run-16.c 
b/gcc/testsuite/gcc.target/riscv/sat_u_trunc-run-16.c
new file mode 100644
index ..20ceda6852e9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/sat_u_trunc-run-16.c
@@ -0,0 +1,16 @@
+/* { dg-do run { target { riscv_v } } } */
+/* { dg-additional-options "-std=c99" } */
+
+#include "sat_arith.h"
+#include "sat_arith_data.h"
+
+#define T1 uint8_t
+#define T2 uint32_t
+
+DEF_SAT_U_TRUNC_FMT_3_WRAP(T1, T2)
+
+#define DATA   TEST_UNARY_DATA_WRAP(T1, T2

[gcc r15-3361] [PATCH] RISC-V: Optimize the cost of the DFmode register move for RV32.

2024-09-01 Thread Jeff Law via Gcc-cvs

https://gcc.gnu.org/g:eca320bfe340beec9267bdb6021c7b387111

commit r15-3361-geca320bfe340beec9267bdb6021c7b387111
Author: Xianmiao Qu 
Date:   Sun Sep 1 22:28:13 2024 -0600

[PATCH] RISC-V: Optimize the cost of the DFmode register move for RV32.

Currently, in RV32, even with the D extension enabled, the cost of DFmode
register moves is still set to 'COSTS_N_INSNS (2)'. This results in the
'lower-subreg' pass splitting DFmode register moves into two SImode SUBREG
register moves, leading to the generation of many redundant instructions.

As an example, consider the following test case:
  double foo (int t, double a, double b)
  {
if (t > 0)
  return a;
else
  return b;
  }

When compiling with -march=rv32imafdc -mabi=ilp32d, the following code is 
generated:
  .cfi_startproc
  addisp,sp,-32
  .cfi_def_cfa_offset 32
  fsd fa0,8(sp)
  fsd fa1,16(sp)
  lw  a4,8(sp)
  lw  a5,12(sp)
  lw  a2,16(sp)
  lw  a3,20(sp)
  bgt a0,zero,.L1
  mv  a4,a2
  mv  a5,a3
  .L1:
  sw  a4,24(sp)
  sw  a5,28(sp)
  fld fa0,24(sp)
  addisp,sp,32
  .cfi_def_cfa_offset 0
  jr  ra
  .cfi_endproc

After adjust the DFmode register move's cost to 'COSTS_N_INSNS (1)', the
generated code is as follows, with a significant reduction in the number
of instructions.
  .cfi_startproc
  ble a0,zero,.L5
  ret
  .L5:
  fmv.d   fa0,fa1
  ret
  .cfi_endproc

gcc/
* config/riscv/riscv.cc (riscv_rtx_costs): Optimize the cost of the
DFmode register move for RV32.

gcc/testsuite/
* gcc.target/riscv/rv32-movdf-cost.c: New test.

Diff:
---
 gcc/config/riscv/riscv.cc|  5 +
 gcc/testsuite/gcc.target/riscv/rv32-movdf-cost.c | 13 +
 2 files changed, 18 insertions(+)

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 75b37b532443..d03e51f3a687 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -3601,6 +3601,11 @@ riscv_rtx_costs (rtx x, machine_mode mode, int 
outer_code, int opno ATTRIBUTE_UN
   if (outer_code == INSN
  && register_operand (SET_DEST (x), GET_MODE (SET_DEST (x
{
+ if (REG_P (SET_SRC (x)) && TARGET_DOUBLE_FLOAT && mode == DFmode)
+   {
+ *total = COSTS_N_INSNS (1);
+ return true;
+   }
  riscv_rtx_costs (SET_SRC (x), mode, outer_code, opno, total, speed);
  return true;
}
diff --git a/gcc/testsuite/gcc.target/riscv/rv32-movdf-cost.c 
b/gcc/testsuite/gcc.target/riscv/rv32-movdf-cost.c
new file mode 100644
index ..cb679e7b95fb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rv32-movdf-cost.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv32imafdc -mabi=ilp32d" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" } } */
+
+double foo (int t, double a, double b)
+{
+  if (t > 0)
+return a;
+  else
+return b;
+}
+
+/* { dg-final { scan-assembler-not "fsd\t" } } */

[gcc r15-3354] i386: Optimize generate insn for AVX10.2 compare

2024-09-01 Thread Haochen Jiang via Gcc-cvs

https://gcc.gnu.org/g:3b1decef83003db9cf8667977c293435c0f3d024

commit r15-3354-g3b1decef83003db9cf8667977c293435c0f3d024
Author: Hu, Lin1 
Date:   Mon Sep 2 10:24:36 2024 +0800

i386: Optimize generate insn for AVX10.2 compare

gcc/ChangeLog:

* config/i386/i386-expand.cc (ix86_expand_fp_compare): Add UNSPEC to
support the optimization.
* config/i386/i386.cc (ix86_fp_compare_code_to_integer): Add NE/EQ.
* config/i386/i386.md (*cmpx): New define_insn.
(*cmpxhf): Ditto.
* config/i386/predicates.md (ix86_trivial_fp_comparison_operator):
Add ne/eq.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx10_2-compare-1b.c: New test.

Diff:
---
 gcc/config/i386/i386-expand.cc |  5 ++
 gcc/config/i386/i386.cc|  5 ++
 gcc/config/i386/i386.md| 31 ++-
 gcc/config/i386/predicates.md  | 12 +++
 gcc/testsuite/gcc.target/i386/avx10_2-compare-1b.c | 96 ++
 5 files changed, 147 insertions(+), 2 deletions(-)

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index d692008ffe7e..53327544620f 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -2916,6 +2916,11 @@ ix86_expand_fp_compare (enum rtx_code code, rtx op0, rtx 
op1)
   switch (ix86_fp_comparison_strategy (code))
 {
 case IX86_FPCMP_COMI:
+  tmp = gen_rtx_COMPARE (CCFPmode, op0, op1);
+  if (TARGET_AVX10_2_256 && (code == EQ || code == NE))
+   tmp = gen_rtx_UNSPEC (CCFPmode, gen_rtvec (1, tmp), UNSPEC_OPTCOMX);
+  if (unordered_compare)
+   tmp = gen_rtx_UNSPEC (CCFPmode, gen_rtvec (1, tmp), UNSPEC_NOTRAP);
   cmp_mode = CCFPmode;
   emit_insn (gen_rtx_SET (gen_rtx_REG (CCFPmode, FLAGS_REG), tmp));
   break;
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 546c964d2a47..7af9ceca429f 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -16634,6 +16634,11 @@ ix86_fp_compare_code_to_integer (enum rtx_code code)
   return LEU;
 case LTGT:
   return NE;
+case EQ:
+case NE:
+  if (TARGET_AVX10_2_256)
+   return code;
+  /* FALLTHRU.  */
 default:
   return UNKNOWN;
 }
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index b56a51be09fb..0fae3c1eb878 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -117,6 +117,7 @@
   UNSPEC_STC
   UNSPEC_PUSHFL
   UNSPEC_POPFL
+  UNSPEC_OPTCOMX
 
   ;; For SSE/MMX support:
   UNSPEC_FIX_NOTRUNC
@@ -1736,7 +1737,7 @@
(compare:CC (match_operand:XF 1 "nonmemory_operand")
(match_operand:XF 2 "nonmemory_operand")))
(set (pc) (if_then_else
-  (match_operator 0 "ix86_fp_comparison_operator"
+  (match_operator 0 "ix86_fp_comparison_operator_xf"
[(reg:CC FLAGS_REG)
 (const_int 0)])
   (label_ref (match_operand 3))
@@ -1753,7 +1754,7 @@
(compare:CC (match_operand:XF 2 "nonmemory_operand")
(match_operand:XF 3 "nonmemory_operand")))
(set (match_operand:QI 0 "register_operand")
-  (match_operator 1 "ix86_fp_comparison_operator"
+  (match_operator 1 "ix86_fp_comparison_operator_xf"
[(reg:CC FLAGS_REG)
 (const_int 0)]))]
   "TARGET_80387"
@@ -2017,6 +2018,32 @@
(set_attr "bdver1_decode" "double")
(set_attr "znver1_decode" "double")])
 
+(define_insn "*cmpx"
+  [(set (reg:CCFP FLAGS_REG)
+   (unspec:CCFP [
+ (compare:CCFP
+   (match_operand:MODEF 0 "register_operand" "v")
+   (match_operand:MODEF 1 "nonimmediate_operand" "vm"))]
+ UNSPEC_OPTCOMX))]
+  "TARGET_AVX10_2_256"
+  "%vcomx\t{%1, %0|%0, %1}"
+  [(set_attr "type" "ssecomi")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "")])
+
+(define_insn "*cmpxhf"
+  [(set (reg:CCFP FLAGS_REG)
+   (unspec:CCFP [
+ (compare:CCFP
+   (match_operand:HF 0 "register_operand" "v")
+   (match_operand:HF 1 "nonimmediate_operand" "vm"))]
+ UNSPEC_OPTCOMX))]
+  "TARGET_AVX10_2_256"
+  "vcomxsh\t{%1, %0|%0, %1}"
+  [(set_attr "type" "ssecomi")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "HF")])
+
 (define_insn "*cmpi"
   [(set (reg:CCFP FLAGS_REG)
(compare:CCFP
diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
index ab6a2e14d355..053312bbe27c 100644
--- a/gcc/config/i386/predicates.md
+++ b/gcc/config/i386/predicates.md
@@ -1633,7 +1633,13 @@
 })
 
 ;; Return true if this comparison only requires testing one flag bit.
+;; VCOMX/VUCOMX set ZF, SF, OF, differently from COMI/UCOMI.
 (define_predicate "ix86_trivial_fp_comparison_operator"
+  (if_then_else (match_test "TARGET_AVX10_2_256")
+   (match_code "gt,ge,unlt,unle,eq,uneq,ne,ltgt,ordered,unordered")
+

[gcc r15-3362] lower SLP load permutation to interleaving

2024-09-01 Thread Richard Biener via Gcc-cvs

https://gcc.gnu.org/g:464067a242150628ceb0d47daf2297f29a31743c

commit r15-3362-g464067a242150628ceb0d47daf2297f29a31743c
Author: Richard Biener 
Date:   Mon May 13 14:57:01 2024 +0200

lower SLP load permutation to interleaving

The following emulates classical interleaving for SLP load permutes
that we are unlikely handling natively.  This is to handle cases
where interleaving (or load/store-lanes) is the optimal choice for
vectorizing even when we are doing that within SLP.  An example
would be

void foo (int * __restrict a, int * b)
{
  for (int i = 0; i < 16; ++i)
{
  a[4*i + 0] = b[4*i + 0] * 3;
  a[4*i + 1] = b[4*i + 1] + 3;
  a[4*i + 2] = (b[4*i + 2] * 3 + 3);
  a[4*i + 3] = b[4*i + 3] * 3;
}
}

where currently the SLP store is merging four single-lane SLP
sub-graphs but none of the loads in it can be code-generated
with V4SImode vectors and a VF of four as the permutes would need
three vectors.

The patch introduces a lowering phase after SLP discovery but
before SLP pattern recognition or permute optimization that
analyzes all loads from the same dataref group and creates an
interleaving scheme starting from an unpermuted load.

What can be handled is power-of-two group size and a group size of
three.  The possibility for doing the interleaving with a load-lanes
like instruction is done as followup.

For a group-size of three this is done by using
the non-interleaving fallback code which then creates at VF == 4 from
{ { a0, b0, c0 }, { a1, b1, c1 }, { a2, b2, c2 }, { a3, b3, c3 } }
the intermediate vectors { c0, c0, c1, c1 } and { c2, c2, c3, c3 }
to produce { c0, c1, c2, c3 }.  This turns out to be more effective
than the scheme implemented for non-SLP for SSE and only slightly
worse for AVX512 and a bit more worse for AVX2.  It seems to me that
this would extend to other non-power-of-two group-sizes though (but
the patch does not).  Optimal schemes are likely difficult to lay out
in VF agnostic form.

I'll note that while the lowering assumes even/odd extract is
generally available for all vector element sizes (which is probably
a good assumption), it doesn't in any way constrain the other
permutes it generates based on target availability.  Again difficult
to do in a VF agnostic way (but at least currently the vector type
is fixed).

I'll also note that the SLP store side merges lanes in a way
producing three-vector permutes for store group-size of three, so
the testcase uses a store group-size of four.

The patch has a fallback for when there are multi-lane groups
and the resulting permutes to not fit interleaving.  Code
generation is not optimal when this triggers and might be
worse than doing single-lane group interleaving.

The patch handles gaps by representing them with NULL
entries in SLP_TREE_SCALAR_STMTS for the unpermuted load node.
The SLP discovery changes could be elided if we manually build the
load node instead.

SLP load nodes covering enough lanes to not need intermediate
permutes are retained as having a load-permutation and do not
use the single SLP load node for each dataref group.  That's
something we might want to change, making load-permutation
something purely local to SLP discovery (but then SLP discovery
could do part of the lowering).

The patch misses CSEing intermediate generated permutes and
registering them with the bst_map which is possibly required
for SLP pattern detection in some cases - this re-spin of the
patch moves the lowering after SLP pattern detection.

* tree-vect-slp.cc (vect_build_slp_tree_1): Handle NULL stmt.
(vect_build_slp_tree_2): Likewise.  Release load permutation
when there's a NULL in SLP_TREE_SCALAR_STMTS and assert there's
no actual permutation in that case.
(vllp_cmp): New function.
(vect_lower_load_permutations): Likewise.
(vect_analyze_slp): Call it.

* gcc.dg/vect/slp-11a.c: Expect SLP.
* gcc.dg/vect/slp-12a.c: Likewise.
* gcc.dg/vect/slp-51.c: New testcase.
* gcc.dg/vect/slp-52.c: New testcase.

Diff:
---
 gcc/testsuite/gcc.dg/vect/slp-11a.c |   2 +-
 gcc/testsuite/gcc.dg/vect/slp-12a.c |   2 +-
 gcc/testsuite/gcc.dg/vect/slp-51.c  |  17 ++
 gcc/testsuite/gcc.dg/vect/slp-52.c  |  14 ++
 gcc/tree-vect-slp.cc| 347 +++-
 5 files changed, 378 insertions(+), 4 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/slp-11a.c 
b/gcc/testsuite/gcc.dg/vect/slp-11a.c
index fcb7cf6c7a2c..2efa1796757e 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-11a.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-11a.c
@@ -72,4 +72,4 @@ int main (void)
 
 /* { dg-final

[gcc r15-3363] load and store-lanes with SLP

2024-09-01 Thread Richard Biener via Gcc-cvs

https://gcc.gnu.org/g:9aaedfc4146c5e4b8412913a6ca4092a2731c35c

commit r15-3363-g9aaedfc4146c5e4b8412913a6ca4092a2731c35c
Author: Richard Biener 
Date:   Fri Jul 5 10:35:08 2024 +0200

load and store-lanes with SLP

The following is a prototype for how to represent load/store-lanes
within SLP.  I've for now settled with having a single load node
with multiple permute nodes acting as selection, one for each loaded lane
and a single store node fed from all stored lanes.  For

  for (int i = 0; i < 1024; ++i)
{
  a[2*i] = b[2*i] + 7;
  a[2*i+1] = b[2*i+1] * 3;
}

you have the following SLP graph where I explain how things are set
up and code-generated:

t.c:23:21: note:   SLP graph after lowering permutations:
t.c:23:21: note:   node 0x50dc8b0 (max_nunits=1, refcnt=1) vector(4) int
t.c:23:21: note:   op template: *_6 = _7;
t.c:23:21: note:stmt 0 *_6 = _7;
t.c:23:21: note:stmt 1 *_12 = _13;
t.c:23:21: note:children 0x50dc488 0x50dc6e8

This is the store node, it's marked with ldst_lanes = true during
SLP discovery.  This node code-generates

  vect_array.65[0] = vect__7.61_29;
  vect_array.65[1] = vect__13.62_28;
  MEM  [(int *)vectp_a.63_27] = .STORE_LANES (vect_array.65);

...
t.c:23:21: note:   node 0x50dc520 (max_nunits=4, refcnt=2) vector(4) int
t.c:23:21: note:   op: VEC_PERM_EXPR
t.c:23:21: note:stmt 0 _5 = *_4;
t.c:23:21: note:lane permutation { 0[0] }
t.c:23:21: note:children 0x50dc948
t.c:23:21: note:   node 0x50dc780 (max_nunits=4, refcnt=2) vector(4) int
t.c:23:21: note:   op: VEC_PERM_EXPR
t.c:23:21: note:stmt 0 _11 = *_10;
t.c:23:21: note:lane permutation { 0[1] }
t.c:23:21: note:children 0x50dc948

These are the selection nodes, marked with ldst_lanes = true.
They code generate nothing.

t.c:23:21: note:   node 0x50dc948 (max_nunits=4, refcnt=3) vector(4) int
t.c:23:21: note:   op template: _5 = *_4;
t.c:23:21: note:stmt 0 _5 = *_4;
t.c:23:21: note:stmt 1 _11 = *_10;
t.c:23:21: note:load permutation { 0 1 }

This is the load node, marked with ldst_lanes = true (the load
permutation is only accurate when taking into account the lane permute
in the selection nodes).  It code generates

  vect_array.58 = .LOAD_LANES (MEM  [(int *)vectp_b.56_33]);
  vect__5.59_31 = vect_array.58[0];
  vect__5.60_30 = vect_array.58[1];

This scheme allows to leave code generation in vectorizable_load/store
mostly as-is.

While this should support both load-lanes and (masked) store-lanes
the decision to do either is done during SLP discovery time and
cannot be reversed without altering the SLP tree - as-is the SLP
tree is not usable for non-store-lanes on the store side, the
load side is OK representation-wise but will very likely fail
permute handling as the lowering to deal with the two input vector
restriction isn't done - but of course since the permute node is
marked as to be ignored that doesn't work out.  So I've put
restrictions in place that fail vectorization if a load/store-lane
SLP tree is later classified differently by get_load_store_type.

I'll note that for example gcc.target/aarch64/sve/mask_struct_store_3.c
will not get SLP store-lanes used because the full store SLPs just
fine though we then fail to handle the "splat" load-permutation

t2.c:5:21: note:   node 0x4db2630 (max_nunits=4, refcnt=2) vector([4,4]) int
t2.c:5:21: note:   op template: _6 = *_5;
t2.c:5:21: note:stmt 0 _6 = *_5;
t2.c:5:21: note:stmt 1 _6 = *_5;
t2.c:5:21: note:stmt 2 _6 = *_5;
t2.c:5:21: note:stmt 3 _6 = *_5;
t2.c:5:21: note:load permutation { 0 0 0 0 }

the load permute lowering code currently doesn't consider it worth
lowering single loads from a group (or in this case not grouped loads).
The expectation is the target can handle this by two interleaves with
itself.

So what we see here is that while the explicit SLP representation is
helpful in some cases, in cases like this it would require changing
it when we make decisions how to vectorize.  My idea is that this
all will change a lot when we re-do SLP discovery (for loops) and
when we get rid of non-SLP as I think vectorizable_* should be
allowed to alter the SLP graph during analysis.

The patch also removes the code cancelling SLP if we can use
load/store-lanes from the main loop vector analysis code and
re-implements it as re-discovering the SLP instance with
forced single-lane splits so SLP load/store-lanes scheme can be
used.

This is now done after SLP discovery and SLP pattern recog are
complete to not d

[gcc r12-10694] Check avx upper register for parallel.

2024-09-01 Thread hongtao Liu via Gcc-cvs

https://gcc.gnu.org/g:6585b06303d8fd9da907f443fc0da9faed303712

commit r12-10694-g6585b06303d8fd9da907f443fc0da9faed303712
Author: liuhongt 
Date:   Thu Aug 29 11:39:20 2024 +0800

Check avx upper register for parallel.

For function arguments/return, when it's BLK mode, it's put in a
parallel with an expr_list, and the expr_list contains the real mode
and registers.
Current ix86_check_avx_upper_register only checked for SSE_REG_P, and
failed to handle that. The patch extend the handle to each subrtx.

gcc/ChangeLog:

PR target/116512
* config/i386/i386.cc (ix86_check_avx_upper_register): Iterate
subrtx to scan for avx upper register.
(ix86_check_avx_upper_stores): Inline old
ix86_check_avx_upper_register.
(ix86_avx_u128_mode_needed): Ditto, and replace
FOR_EACH_SUBRTX with call to new
ix86_check_avx_upper_register.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr116512.c: New test.

(cherry picked from commit ab214ef734bfc3dcffcf79ff9e1dd651c2b40566)

Diff:
---
 gcc/config/i386/i386.cc  | 36 
 gcc/testsuite/gcc.target/i386/pr116512.c | 26 +++
 2 files changed, 49 insertions(+), 13 deletions(-)

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index af42e4b9739e..2d272bdaf1a4 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -14360,9 +14360,19 @@ ix86_dirflag_mode_needed (rtx_insn *insn)
 static bool
 ix86_check_avx_upper_register (const_rtx exp)
 {
-  return (SSE_REG_P (exp)
- && !EXT_REX_SSE_REG_P (exp)
- && GET_MODE_BITSIZE (GET_MODE (exp)) > 128);
+  /* construct_container may return a parallel with expr_list
+ which contains the real reg and mode  */
+  subrtx_iterator::array_type array;
+  FOR_EACH_SUBRTX (iter, array, exp, NONCONST)
+{
+  const_rtx x = *iter;
+  if (SSE_REG_P (x)
+ && !EXT_REX_SSE_REG_P (x)
+ && GET_MODE_BITSIZE (GET_MODE (x)) > 128)
+   return true;
+}
+
+  return false;
 }
 
 /* Check if a 256bit or 512bit AVX register is referenced in stores.   */
@@ -14370,7 +14380,9 @@ ix86_check_avx_upper_register (const_rtx exp)
 static void
 ix86_check_avx_upper_stores (rtx dest, const_rtx, void *data)
 {
-  if (ix86_check_avx_upper_register (dest))
+  if (SSE_REG_P (dest)
+  && !EXT_REX_SSE_REG_P (dest)
+  && GET_MODE_BITSIZE (GET_MODE (dest)) > 128)
 {
   bool *used = (bool *) data;
   *used = true;
@@ -14428,14 +14440,14 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
   return AVX_U128_CLEAN;
 }
 
-  subrtx_iterator::array_type array;
-
   rtx set = single_set (insn);
   if (set)
 {
   rtx dest = SET_DEST (set);
   rtx src = SET_SRC (set);
-  if (ix86_check_avx_upper_register (dest))
+  if (SSE_REG_P (dest)
+ && !EXT_REX_SSE_REG_P (dest)
+ && GET_MODE_BITSIZE (GET_MODE (dest)) > 128)
{
  /* This is an YMM/ZMM load.  Return AVX_U128_DIRTY if the
 source isn't zero.  */
@@ -14446,9 +14458,8 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
}
   else
{
- FOR_EACH_SUBRTX (iter, array, src, NONCONST)
-   if (ix86_check_avx_upper_register (*iter))
- return AVX_U128_DIRTY;
+ if (ix86_check_avx_upper_register (src))
+   return AVX_U128_DIRTY;
}
 
   /* This isn't YMM/ZMM load/store.  */
@@ -14459,9 +14470,8 @@ ix86_avx_u128_mode_needed (rtx_insn *insn)
  Hardware changes state only when a 256bit register is written to,
  but we need to prevent the compiler from moving optimal insertion
  point above eventual read from 256bit or 512 bit register.  */
-  FOR_EACH_SUBRTX (iter, array, PATTERN (insn), NONCONST)
-if (ix86_check_avx_upper_register (*iter))
-  return AVX_U128_DIRTY;
+  if (ix86_check_avx_upper_register (PATTERN (insn)))
+return AVX_U128_DIRTY;
 
   return AVX_U128_ANY;
 }
diff --git a/gcc/testsuite/gcc.target/i386/pr116512.c 
b/gcc/testsuite/gcc.target/i386/pr116512.c
new file mode 100644
index ..c2bc6c91b648
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr116512.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-march=x86-64-v4 -O2" } */
+/* { dg-final { scan-assembler-not "vzeroupper" { target { ! ia32 } } } } */
+
+#include 
+
+struct B {
+  union {
+__m512 f;
+__m512i s;
+  };
+};
+
+struct B foo(int n) {
+  struct B res;
+  res.s = _mm512_set1_epi32(n);
+
+  return res;
+}
+
+__m512i bar(int n) {
+  struct B res;
+  res.s = _mm512_set1_epi32(n);
+
+  return res.s;
+}

[gcc r15-3350] RISC-V: Add testcases for unsigned scalar quad and oct .SAT_TRUNC form 2

2024-09-01 Thread Pan Li via Gcc-cvs

https://gcc.gnu.org/g:ea81e21d5398bdacf883533fd738fc45ea8d6dd9

commit r15-3350-gea81e21d5398bdacf883533fd738fc45ea8d6dd9
Author: Pan Li 
Date:   Sun Aug 18 12:49:47 2024 +0800

RISC-V: Add testcases for unsigned scalar quad and oct .SAT_TRUNC form 2

This patch would like to add test cases for the unsigned scalar quad and
oct .SAT_TRUNC form 2.  Aka:

Form 2:
  #define DEF_SAT_U_TRUC_FMT_2(NT, WT) \
  NT __attribute__((noinline)) \
  sat_u_truc_##WT##_to_##NT##_fmt_2 (WT x) \
  {\
WT max = (WT)(NT)-1;   \
return x > max ? (NT) max : (NT)x; \
  }

QUAD:
DEF_SAT_U_TRUC_FMT_2 (uint16_t, uint64_t)
DEF_SAT_U_TRUC_FMT_2 (uint8_t, uint32_t)

OCT:
DEF_SAT_U_TRUC_FMT_2 (uint8_t, uint64_t)

The below test is passed for this patch.
* The rv64gcv regression test.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/sat_u_trunc-10.c: New test.
* gcc.target/riscv/sat_u_trunc-11.c: New test.
* gcc.target/riscv/sat_u_trunc-12.c: New test.
* gcc.target/riscv/sat_u_trunc-run-10.c: New test.
* gcc.target/riscv/sat_u_trunc-run-11.c: New test.
* gcc.target/riscv/sat_u_trunc-run-12.c: New test.

Signed-off-by: Pan Li 

Diff:
---
 gcc/testsuite/gcc.target/riscv/sat_u_trunc-10.c | 17 +
 gcc/testsuite/gcc.target/riscv/sat_u_trunc-11.c | 17 +
 gcc/testsuite/gcc.target/riscv/sat_u_trunc-12.c | 20 
 gcc/testsuite/gcc.target/riscv/sat_u_trunc-run-10.c | 16 
 gcc/testsuite/gcc.target/riscv/sat_u_trunc-run-11.c | 16 
 gcc/testsuite/gcc.target/riscv/sat_u_trunc-run-12.c | 16 
 6 files changed, 102 insertions(+)

diff --git a/gcc/testsuite/gcc.target/riscv/sat_u_trunc-10.c 
b/gcc/testsuite/gcc.target/riscv/sat_u_trunc-10.c
new file mode 100644
index ..5ea8e613901c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/sat_u_trunc-10.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gc -mabi=lp64d -O3 -fdump-rtl-expand-details 
-fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include "sat_arith.h"
+
+/*
+** sat_u_trunc_uint32_t_to_uint8_t_fmt_2:
+** sltiu\s+[atx][0-9]+,\s*a0,\s*255
+** addi\s+[atx][0-9]+,\s*[atx][0-9]+,\s*-1
+** or\s+[atx][0-9]+,\s*[atx][0-9]+,\s*[atx][0-9]+
+** andi\s+[atx][0-9]+,\s*[atx][0-9]+,\s*0xff
+** ret
+*/
+DEF_SAT_U_TRUNC_FMT_2(uint8_t, uint32_t)
+
+/* { dg-final { scan-rtl-dump-times ".SAT_TRUNC " 2 "expand" } } */
diff --git a/gcc/testsuite/gcc.target/riscv/sat_u_trunc-11.c 
b/gcc/testsuite/gcc.target/riscv/sat_u_trunc-11.c
new file mode 100644
index ..3b45e2af9ce3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/sat_u_trunc-11.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gc -mabi=lp64d -O3 -fdump-rtl-expand-details 
-fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include "sat_arith.h"
+
+/*
+** sat_u_trunc_uint64_t_to_uint8_t_fmt_2:
+** sltiu\s+[atx][0-9]+,\s*a0,\s*255
+** addi\s+[atx][0-9]+,\s*[atx][0-9]+,\s*-1
+** or\s+[atx][0-9]+,\s*[atx][0-9]+,\s*[atx][0-9]+
+** andi\s+[atx][0-9]+,\s*[atx][0-9]+,\s*0xff
+** ret
+*/
+DEF_SAT_U_TRUNC_FMT_2(uint8_t, uint64_t)
+
+/* { dg-final { scan-rtl-dump-times ".SAT_TRUNC " 2 "expand" } } */
diff --git a/gcc/testsuite/gcc.target/riscv/sat_u_trunc-12.c 
b/gcc/testsuite/gcc.target/riscv/sat_u_trunc-12.c
new file mode 100644
index ..7ea2c93a301f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/sat_u_trunc-12.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gc -mabi=lp64d -O3 -fdump-rtl-expand-details 
-fno-schedule-insns -fno-schedule-insns2" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include "sat_arith.h"
+
+/*
+** sat_u_trunc_uint64_t_to_uint16_t_fmt_2:
+** li\s+[atx][0-9]+,\s*65536
+** addi\s+[atx][0-9]+,\s*[atx][0-9]+,\s*-1
+** sltu\s+[atx][0-9]+,\s*a0,\s*[atx][0-9]+
+** addi\s+[atx][0-9]+,\s*[atx][0-9]+,\s*-1
+** or\s+[atx][0-9]+,\s*[atx][0-9]+,\s*[atx][0-9]+
+** slli\s+a0,\s*a0,\s*48
+** srli\s+a0,\s*a0,\s*48
+** ret
+*/
+DEF_SAT_U_TRUNC_FMT_2(uint16_t, uint64_t)
+
+/* { dg-final { scan-rtl-dump-times ".SAT_TRUNC " 2 "expand" } } */
diff --git a/gcc/testsuite/gcc.target/riscv/sat_u_trunc-run-10.c 
b/gcc/testsuite/gcc.target/riscv/sat_u_trunc-run-10.c
new file mode 100644
index ..2281610f3353
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/sat_u_trunc-run-10.c
@@ -0,0 +1,16 @@
+/* { dg-do run { target { riscv_v } } } */
+/* { dg-additional-options "-std=c99" } */
+
+#include "sat_arith.h"
+#include "sat_arith_data.h"
+
+#define T1 uint8_t
+#define T2 uint32_t
+
+DEF_SAT_U_TRUNC_FMT_2_WRAP(T1, T2)
+
+#define DATA   TEST_UNARY_DATA_WRAP(T1, T2

[gcc r15-3355] i386: Support vectorized BF16 add/sub/mul/div with AVX10.2 instructions

2024-09-01 Thread Haochen Jiang via Gcc-cvs

https://gcc.gnu.org/g:f82fa0da4d9e1fdaf5e4edd70364d5781534ce11

commit r15-3355-gf82fa0da4d9e1fdaf5e4edd70364d5781534ce11
Author: Levy Hsu 
Date:   Mon Sep 2 10:24:45 2024 +0800

i386: Support vectorized BF16 add/sub/mul/div with AVX10.2 instructions

AVX10.2 introduces several non-exception instructions for BF16 vector.
Enable vectorized BF add/sub/mul/div operation by supporting standard
optab for them.

gcc/ChangeLog:

* config/i386/sse.md (div3): New expander for BFmode div.
(VF_BHSD): New mode iterator with vector BFmodes.
(3): Change mode to VF_BHSD.
(mul3): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx10_2-512-bf-vector-operations-1.c: New test.
* gcc.target/i386/avx10_2-bf-vector-operations-1.c: Ditto.

Diff:
---
 gcc/config/i386/sse.md | 49 +++---
 .../i386/avx10_2-512-bf-vector-operations-1.c  | 42 
 .../i386/avx10_2-bf-vector-operations-1.c  | 79 ++
 3 files changed, 162 insertions(+), 8 deletions(-)

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 442ac93afa2b..ebca462bae8b 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -391,6 +391,19 @@
(V8DF "TARGET_AVX512F && TARGET_EVEX512") (V4DF "TARGET_AVX")
(V2DF "TARGET_SSE2")])
 
+(define_mode_iterator VF_BHSD
+  [(V32HF "TARGET_AVX512FP16 && TARGET_EVEX512")
+   (V16HF "TARGET_AVX512FP16 && TARGET_AVX512VL")
+   (V8HF "TARGET_AVX512FP16 && TARGET_AVX512VL")
+   (V16SF "TARGET_AVX512F && TARGET_EVEX512")
+   (V8SF "TARGET_AVX") V4SF
+   (V8DF "TARGET_AVX512F && TARGET_EVEX512")
+   (V4DF "TARGET_AVX") (V2DF "TARGET_SSE2")
+   (V32BF "TARGET_AVX10_2_512")
+   (V16BF "TARGET_AVX10_2_256")
+   (V8BF "TARGET_AVX10_2_256")
+  ])
+
 ;; 128-, 256- and 512-bit float vector modes for bitwise operations
 (define_mode_iterator VFB
   [(V32BF "TARGET_AVX512F && TARGET_EVEX512")
@@ -2527,10 +2540,10 @@
 })
 
 (define_expand "3"
-  [(set (match_operand:VFH 0 "register_operand")
-   (plusminus:VFH
- (match_operand:VFH 1 "")
- (match_operand:VFH 2 "")))]
+  [(set (match_operand:VF_BHSD 0 "register_operand")
+   (plusminus:VF_BHSD
+ (match_operand:VF_BHSD 1 "")
+ (match_operand:VF_BHSD 2 "")))]
   "TARGET_SSE &&  && "
   "ix86_fixup_binary_operands_no_copy (, mode, operands);")
 
@@ -2616,10 +2629,10 @@
 })
 
 (define_expand "mul3"
-  [(set (match_operand:VFH 0 "register_operand")
-   (mult:VFH
- (match_operand:VFH 1 "")
- (match_operand:VFH 2 "")))]
+  [(set (match_operand:VF_BHSD 0 "register_operand")
+   (mult:VF_BHSD
+ (match_operand:VF_BHSD 1 "")
+ (match_operand:VF_BHSD 2 "")))]
   "TARGET_SSE &&  && "
   "ix86_fixup_binary_operands_no_copy (MULT, mode, operands);")
 
@@ -2734,6 +2747,26 @@
 }
 })
 
+(define_expand "div3"
+  [(set (match_operand:VBF_AVX10_2 0 "register_operand")
+   (div:VBF_AVX10_2
+ (match_operand:VBF_AVX10_2 1 "register_operand")
+ (match_operand:VBF_AVX10_2 2 "vector_operand")))]
+  "TARGET_AVX10_2_256"
+{
+  if (TARGET_RECIP_VEC_DIV
+  && optimize_insn_for_speed_p ()
+  && flag_finite_math_only
+  && flag_unsafe_math_optimizations)
+{
+  rtx op = gen_reg_rtx (mode);
+  operands[2] = force_reg (mode, operands[2]);
+  emit_insn (gen_avx10_2_rcppbf16_ (op, operands[2]));
+  emit_insn (gen_avx10_2_mulnepbf16_ (operands[0], operands[1], op));
+  DONE;
+}
+})
+
 (define_expand "cond_div"
   [(set (match_operand:VFH 0 "register_operand")
(vec_merge:VFH
diff --git a/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-operations-1.c 
b/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-operations-1.c
new file mode 100644
index ..d6b0750c2334
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/avx10_2-512-bf-vector-operations-1.c
@@ -0,0 +1,42 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx10.2-512 -O2" } */
+/* { dg-final { scan-assembler-times "vmulnepbf16\[ 
\\t\]+\[^\{\n\]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 2 } } */
+/* { dg-final { scan-assembler-times "vaddnepbf16\[ 
\\t\]+\[^\{\n\]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vdivnepbf16\[ 
\\t\]+\[^\{\n\]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vsubnepbf16\[ 
\\t\]+\[^\{\n\]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+(?:\n|\[ 
\\t\]+#)" 1 } } */
+/* { dg-final { scan-assembler-times "vrcppbf16\[ 
\\t\]+\[^\{\n\]*%zmm\[0-9\]+\[^\n\r]*%zmm\[0-9\]+(?:\n|\[ \\t\]+#)" 1 } } */
+
+#include 
+
+typedef __bf16 v32bf __attribute__ ((__vector_size__ (64)));
+
+v32bf
+foo_mul (v32bf a, v32bf b)
+{
+  return a * b;
+}
+
+v32bf
+foo_add (v32bf a, v32bf b)
+{
+  return a + b;
+}
+
+v32

[gcc r15-3352] i386: Auto vectorize sdot_prod, usdot_prod, udot_prod with AVX10.2 instructions

2024-09-01 Thread Haochen Jiang via Gcc-cvs

https://gcc.gnu.org/g:b1f9fbb6da1a3ced57c3668cecc9f9449e1b237e

commit r15-3352-gb1f9fbb6da1a3ced57c3668cecc9f9449e1b237e
Author: Haochen Jiang 
Date:   Mon Sep 2 10:24:29 2024 +0800

i386: Auto vectorize sdot_prod, usdot_prod, udot_prod with AVX10.2 
instructions

gcc/ChangeLog:

* config/i386/sse.md (VI1_AVX512VNNIBW): New.
(VI2_AVX10_2): Ditto.
(sdot_prod): Add AVX10.2
to auto vectorize and combine 512 bit part.
(udot_prod): Ditto.
(sdot_prodv64qi): Removed.
(udot_prodv64qi): Ditto.
(usdot_prod): Add AVX10.2 to auto vectorize.
(udot_prod): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/vnniint16-auto-vectorize-2.c: Only define
TEST when not defined.
* gcc.target/i386/vnniint8-auto-vectorize-2.c: Ditto.
* gcc.target/i386/vnniint16-auto-vectorize-3.c: New test.
* gcc.target/i386/vnniint16-auto-vectorize-4.c: Ditto.
* gcc.target/i386/vnniint8-auto-vectorize-3.c: Ditto.
* gcc.target/i386/vnniint8-auto-vectorize-4.c: Ditto.

Diff:
---
 gcc/config/i386/sse.md | 93 +-
 .../gcc.target/i386/vnniint16-auto-vectorize-2.c   | 11 ++-
 .../gcc.target/i386/vnniint16-auto-vectorize-3.c   |  6 ++
 .../gcc.target/i386/vnniint16-auto-vectorize-4.c   | 18 +
 .../gcc.target/i386/vnniint8-auto-vectorize-2.c| 12 ++-
 .../gcc.target/i386/vnniint8-auto-vectorize-3.c|  6 ++
 .../gcc.target/i386/vnniint8-auto-vectorize-4.c| 18 +
 7 files changed, 86 insertions(+), 78 deletions(-)

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index da91d39cf8eb..442ac93afa2b 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -610,6 +610,10 @@
 (define_mode_iterator VI1_AVX512VNNI
   [(V64QI "TARGET_AVX512VNNI && TARGET_EVEX512") (V32QI "TARGET_AVX2") V16QI])
 
+(define_mode_iterator VI1_AVX512VNNIBW
+  [(V64QI "(TARGET_AVX512BW || TARGET_AVX512VNNI) && TARGET_EVEX512")
+   (V32QI "TARGET_AVX2") V16QI])
+
 (define_mode_iterator VI12_256_512_AVX512VL
   [(V64QI "TARGET_EVEX512") (V32QI "TARGET_AVX512VL")
(V32HI "TARGET_EVEX512") (V16HI "TARGET_AVX512VL")])
@@ -627,6 +631,9 @@
   [(V32HI "(TARGET_AVX512BW || TARGET_AVX512VNNI) && TARGET_EVEX512")
(V16HI "TARGET_AVX2") V8HI])
 
+(define_mode_iterator VI2_AVX10_2
+  [(V32HI "TARGET_AVX10_2_512") V16HI V8HI])
+
 (define_mode_iterator VI4_AVX
   [(V8SI "TARGET_AVX") V4SI])
 
@@ -31232,12 +31239,13 @@
 
 (define_expand "sdot_prod"
   [(match_operand: 0 "register_operand")
-   (match_operand:VI1_AVX2 1 "register_operand")
-   (match_operand:VI1_AVX2 2 "register_operand")
+   (match_operand:VI1_AVX512VNNIBW 1 "register_operand")
+   (match_operand:VI1_AVX512VNNIBW 2 "register_operand")
(match_operand: 3 "register_operand")]
   "TARGET_SSE2"
 {
-  if (TARGET_AVXVNNIINT8)
+  if (( == 64 && TARGET_AVX10_2_512)
+  || ( < 64 && (TARGET_AVXVNNIINT8 || TARGET_AVX10_2_256)))
 {
   operands[1] = lowpart_subreg (mode,
force_reg (mode, operands[1]),
@@ -31276,44 +31284,15 @@
   DONE;
 })
 
-(define_expand "sdot_prodv64qi"
-  [(match_operand:V16SI 0 "register_operand")
-   (match_operand:V64QI 1 "register_operand")
-   (match_operand:V64QI 2 "register_operand")
-   (match_operand:V16SI 3 "register_operand")]
-  "(TARGET_AVX512VNNI || TARGET_AVX512BW) && TARGET_EVEX512"
-{
-  /* Emulate with vpdpwssd.  */
-  rtx op1_lo = gen_reg_rtx (V32HImode);
-  rtx op1_hi = gen_reg_rtx (V32HImode);
-  rtx op2_lo = gen_reg_rtx (V32HImode);
-  rtx op2_hi = gen_reg_rtx (V32HImode);
-
-  emit_insn (gen_vec_unpacks_lo_v64qi (op1_lo, operands[1]));
-  emit_insn (gen_vec_unpacks_lo_v64qi (op2_lo, operands[2]));
-  emit_insn (gen_vec_unpacks_hi_v64qi (op1_hi, operands[1]));
-  emit_insn (gen_vec_unpacks_hi_v64qi (op2_hi, operands[2]));
-
-  rtx res1 = gen_reg_rtx (V16SImode);
-  rtx res2 = gen_reg_rtx (V16SImode);
-  rtx sum = gen_reg_rtx (V16SImode);
-
-  emit_move_insn (sum, CONST0_RTX (V16SImode));
-  emit_insn (gen_sdot_prodv32hi (res1, op1_lo, op2_lo, sum));
-  emit_insn (gen_sdot_prodv32hi (res2, op1_hi, op2_hi, operands[3]));
-
-  emit_insn (gen_addv16si3 (operands[0], res1, res2));
-  DONE;
-})
-
 (define_expand "udot_prod"
   [(match_operand: 0 "register_operand")
-   (match_operand:VI1_AVX2 1 "register_operand")
-   (match_operand:VI1_AVX2 2 "register_operand")
+   (match_operand:VI1_AVX512VNNIBW 1 "register_operand")
+   (match_operand:VI1_AVX512VNNIBW 2 "register_operand")
(match_operand: 3 "register_operand")]
   "TARGET_SSE2"
 {
-  if (TARGET_AVXVNNIINT8)
+  if (( == 64 && TARGET_AVX10_2_512)
+  || ( < 64 && (TARGET_AVXVNNIINT8 || TARGET_AVX10_2_256)))
 {
   operands[1] = lowpart_subreg (mode,
force_reg (mode, operands[1]),
@@ -31352,36 +31331,6 @@
   DONE;
 })
 
-(define_expand "udot_prodv64qi"
-

[gcc r15-3358] i386: Support vectorized BF16 sqrt with AVX10.2 instruction

[gcc r15-3348] RISC-V: Add testcases for form 3 of unsigned vector .SAT_ADD IMM

[gcc r15-3360] [committed][PR rtl-optimization/116544] Fix test for promoted subregs

[gcc r15-3349] RISC-V: Add testcases for form 4 of unsigned vector .SAT_ADD IMM

[gcc r15-3347] RISC-V: Refactor gen zero_extend rtx for SAT_* when expand SImode in RV64

[gcc r15-3359] i386: Support vec_cmp for V8BF/V16BF/V32BF in AVX10.2

[gcc r15-3356] i386: Support vectorized BF16 FMA with AVX10.2 instructions

[gcc r14-10625] Check avx upper register for parallel.

[gcc r15-3353] i386: Optimize ordered and nonequal

[gcc r13-8999] Check avx upper register for parallel.

[gcc r15-3357] i386: Support vectorized BF16 smaxmin with AVX10.2 instructions

[gcc r15-3351] RISC-V: Add testcases for unsigned scalar quad and oct .SAT_TRUNC form 3

[gcc r15-3361] [PATCH] RISC-V: Optimize the cost of the DFmode register move for RV32.

[gcc r15-3354] i386: Optimize generate insn for AVX10.2 compare

[gcc r15-3362] lower SLP load permutation to interleaving

[gcc r15-3363] load and store-lanes with SLP

[gcc r12-10694] Check avx upper register for parallel.

[gcc r15-3350] RISC-V: Add testcases for unsigned scalar quad and oct .SAT_TRUNC form 2

[gcc r15-3355] i386: Support vectorized BF16 add/sub/mul/div with AVX10.2 instructions

[gcc r15-3352] i386: Auto vectorize sdot_prod, usdot_prod, udot_prod with AVX10.2 instructions

20 matches

Site Navigation

Mail list logo

Footer information