Hi Hongtao,
Many thanks for your review. This revised patch implements your
suggestion of removing the combine splitters and instead reusing
the functionality of the ssse3_palignrdi define_insn_and_split.
This revised patch has been tested on x86_64-pc-linux-gnu with make
bootstrap and make -k check, both with and without --target_board=unix{-m32},
with no new failures. Is this revised version Ok for mainline?
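As a quick illustration of the operands[3] == 0/64 simplification below (the
same pattern the new test case exercises), here is a minimal sketch, using
the typedef from the test case and hypothetical function names:

  typedef long long __attribute__ ((__vector_size__ (8))) T;

  T pick_lo (T x, T y) { return __builtin_ia32_palignr (x, y, 0); }  /* just y */
  T pick_hi (T x, T y) { return __builtin_ia32_palignr (x, y, 64); } /* just x */

With -O2 -mssse3, each of these should now require at most a single move
instruction, rather than the punpcklqdq/pshufd/psrldq sequences that the
test's scan-assembler-not directives guard against.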
2022-07-04 Roger Sayle <[email protected]>
Hongtao Liu <[email protected]>
gcc/ChangeLog
* config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
* config/i386/i386-expand.cc (expand_vec_perm_palignr): Use V1TImode
and gen_ssse3_palignrv1ti instead of TImode.
* config/i386/sse.md (SSESCALARMODE): Delete.
(define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
(<ssse3_avx2>_palignr<mode>): Use VIMAX_AVX2_AVX512BW as a mode
iterator instead of SSESCALARMODE.
(ssse3_palignrdi): Optimize cases where operands[3] is 0 or 64,
using a single move instruction (if required).
gcc/testsuite/ChangeLog
* gcc.target/i386/ssse3-palignr-2.c: New test case.
Thanks in advance,
Roger
--
> -----Original Message-----
> From: Hongtao Liu <[email protected]>
> Sent: 01 July 2022 03:40
> To: Roger Sayle <[email protected]>
> Cc: GCC Patches <[email protected]>
> Subject: Re: [x86 PATCH] UNSPEC_PALIGNR optimizations and clean-ups.
>
> On Fri, Jul 1, 2022 at 10:12 AM Hongtao Liu <[email protected]> wrote:
> >
> > On Fri, Jul 1, 2022 at 2:42 AM Roger Sayle <[email protected]>
> wrote:
> > >
> > >
> > > This patch is a follow-up to Hongtao's fix for PR target/105854.
> > > That fix is perfectly correct, but the thing that caught my eye was
> > > why the compiler is generating a shift by zero at all. Digging
> > > deeper it turns out that we can easily optimize
> > > __builtin_ia32_palignr for alignments of 0 and 64 respectively,
> > > which may be simplified to moves from the highpart or lowpart.
> > >
> > > After adding optimizations to simplify the 64-bit DImode palignr, I
> > > started to add the corresponding optimizations for vpalignr (i.e.
> > > 128-bit). The first oddity is that sse.md uses TImode and a special
> > > SSESCALARMODE iterator, rather than V1TImode, and indeed the comment
> > > above SSESCALARMODE hints that this should be "dropped in favor of
> > > VIMAX_AVX2_AVX512BW". Hence this patch includes the migration of
> > > <ssse3_avx2>_palignr<mode> to use VIMAX_AVX2_AVX512BW, basically
> > > using V1TImode instead of TImode for 128-bit palignr.
> > >
> > > But it was only after I'd implemented this clean-up that I stumbled
> > > across the strange semantics of 128-bit [v]palignr. According to
> > > https://www.felixcloutier.com/x86/palignr, the semantics are subtly
> > > different based upon how the instruction is encoded. PALIGNR leaves
> > > the highpart unmodified, whilst VEX.128 encoded VPALIGNR clears the
> > > highpart, and (unless I'm mistaken) it looks like GCC currently uses
> > > the exact same RTL/templates for both, treating one as an
> > > alternative for the other.
> > I think as long as patterns or intrinsics only care about the low
> > part, they should be ok.
> > But if we want to rely on the default behavior for the upper bits, we need
> > to restrict them under a specific isa (i.e. vmovq in vec_set<mode>_0).
> > Generally, 128-bit sse legacy instructions have different behaviors
> > for upper bits from AVX ones, and that's why vzeroupper is introduced
> > for sse <-> avx instructions transition.
> > >
> > > Hence I thought I'd post what I have so far (part optimization and
> > > part clean-up), to then ask the x86 experts for their opinions.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > bootstrap and make -k check, both with and without
> > > --target_board=unix{-,32}, with no new failures. Ok for mainline?
> > >
> > >
> > > 2022-06-30 Roger Sayle <[email protected]>
> > >
> > > gcc/ChangeLog
> > > * config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
> > > CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
> > > * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use
> V1TImode
> > > and gen_ssse3_palignv1ti instead of TImode.
> > > * config/i386/sse.md (SSESCALARMODE): Delete.
> > > (define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
> > > (<ssse3_avx2>_palignr<mode>): Use VIMAX_AVX2_AVX512BW as a
> mode
> > > iterator instead of SSESCALARMODE.
> > >
> > > (ssse3_palignrdi): Optimize cases when operands[3] is 0 or 64,
> > > using a single move instruction (if required).
> > > (define_split): Likewise split UNSPEC_PALIGNR $0 into a move.
> > > (define_split): Likewise split UNSPEC_PALIGNR $64 into a move.
> > >
> > > gcc/testsuite/ChangeLog
> > > * gcc.target/i386/ssse3-palignr-2.c: New test case.
> > >
> > >
> > > Thanks in advance,
> > > Roger
> > > --
> > >
> >
> > +(define_split
> > +  [(set (match_operand:DI 0 "register_operand")
> > +        (unspec:DI [(match_operand:DI 1 "register_operand")
> > +                    (match_operand:DI 2 "register_mmxmem_operand")
> > +                    (const_int 0)]
> > +                   UNSPEC_PALIGNR))]
> > +  ""
> > +  [(set (match_dup 0) (match_dup 2))])
> > +
> > +(define_split
> > +  [(set (match_operand:DI 0 "register_operand")
> > +        (unspec:DI [(match_operand:DI 1 "register_operand")
> > +                    (match_operand:DI 2 "register_mmxmem_operand")
> > +                    (const_int 64)]
> > +                   UNSPEC_PALIGNR))]
> > +  ""
> > +  [(set (match_dup 0) (match_dup 1))])
> > +
> > define_split is assumed to be split into 2 (or more) insns, hence
> > pass_combine will only try define_split if the number of merged insns
> > is greater than 2.
> > For palignr, I think most of the time there would be only 2 merged
> > insns (constant propagation), so it is better to change them into a
> > pre_reload splitter
> > (i.e. define_insn_and_split "*avx512bw_permvar_truncv16siv16hi_1").
> I think you can just merge 2 define_split into define_insn_and_split
> "ssse3_palignrdi" by relaxing split condition as
>
> - "TARGET_SSSE3 && reload_completed
> - && SSE_REGNO_P (REGNO (operands[0]))"
> + "(TARGET_SSSE3 && reload_completed
> + && SSE_REGNO_P (REGNO (operands[0])))
> +    || INTVAL (operands[3]) == 0
> +    || INTVAL (operands[3]) == 64"
>
> and you have already handled them by
>
> +  if (operands[3] == const0_rtx)
> +    {
> +      if (!rtx_equal_p (operands[0], operands[2]))
> +        emit_move_insn (operands[0], operands[2]);
> +      else
> +        emit_note (NOTE_INSN_DELETED);
> +      DONE;
> +    }
> +  else if (INTVAL (operands[3]) == 64)
> +    {
> +      if (!rtx_equal_p (operands[0], operands[1]))
> +        emit_move_insn (operands[0], operands[1]);
> +      else
> +        emit_note (NOTE_INSN_DELETED);
> +      DONE;
> +    }
> +
>
> >
>
> >
> > --
> > BR,
> > Hongtao
>
>
>
> --
> BR,
> Hongtao
diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def
index e6daad4..fd16093 100644
--- a/gcc/config/i386/i386-builtin.def
+++ b/gcc/config/i386/i386-builtin.def
@@ -900,7 +900,7 @@ BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_psignv4si3, "__builtin_ia32_psig
 BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_ssse3_psignv2si3, "__builtin_ia32_psignd", IX86_BUILTIN_PSIGND, UNKNOWN, (int) V2SI_FTYPE_V2SI_V2SI)
 
 /* SSSE3. */
-BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_palignrti, "__builtin_ia32_palignr128", IX86_BUILTIN_PALIGNR128, UNKNOWN, (int) V2DI_FTYPE_V2DI_V2DI_INT_CONVERT)
+BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_palignrv1ti, "__builtin_ia32_palignr128", IX86_BUILTIN_PALIGNR128, UNKNOWN, (int) V2DI_FTYPE_V2DI_V2DI_INT_CONVERT)
 BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_ssse3_palignrdi, "__builtin_ia32_palignr", IX86_BUILTIN_PALIGNR, UNKNOWN, (int) V1DI_FTYPE_V1DI_V1DI_INT_CONVERT)
 
/* SSE4.1 */
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 8bc5430..6a3fcde 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -19548,9 +19548,11 @@ expand_vec_perm_palignr (struct expand_vec_perm_d *d, bool single_insn_only_p)
shift = GEN_INT (min * GET_MODE_UNIT_BITSIZE (d->vmode));
if (GET_MODE_SIZE (d->vmode) == 16)
{
- target = gen_reg_rtx (TImode);
- emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, dcopy.op1),
- gen_lowpart (TImode, dcopy.op0), shift));
+ target = gen_reg_rtx (V1TImode);
+ emit_insn (gen_ssse3_palignrv1ti (target,
+ gen_lowpart (V1TImode, dcopy.op1),
+ gen_lowpart (V1TImode, dcopy.op0),
+ shift));
}
else
{
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index f2f72e8..adf05bf 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -575,10 +575,6 @@
(define_mode_iterator VIMAX_AVX2
[(V2TI "TARGET_AVX2") V1TI])
-;; ??? This should probably be dropped in favor of VIMAX_AVX2_AVX512BW.
-(define_mode_iterator SSESCALARMODE
- [(V4TI "TARGET_AVX512BW") (V2TI "TARGET_AVX2") TI])
-
(define_mode_iterator VI12_AVX2
[(V32QI "TARGET_AVX2") V16QI
(V16HI "TARGET_AVX2") V8HI])
@@ -712,7 +708,7 @@
(V4HI "ssse3") (V8HI "ssse3") (V16HI "avx2") (V32HI "avx512bw")
(V4SI "ssse3") (V8SI "avx2")
(V2DI "ssse3") (V4DI "avx2")
- (TI "ssse3") (V2TI "avx2") (V4TI "avx512bw")])
+ (V1TI "ssse3") (V2TI "avx2") (V4TI "avx512bw")])
(define_mode_attr sse4_1_avx2
[(V16QI "sse4_1") (V32QI "avx2") (V64QI "avx512bw")
@@ -21092,10 +21088,10 @@
(set_attr "mode" "<sseinsnmode>")])
(define_insn "<ssse3_avx2>_palignr<mode>"
- [(set (match_operand:SSESCALARMODE 0 "register_operand" "=x,<v_Yw>")
- (unspec:SSESCALARMODE
- [(match_operand:SSESCALARMODE 1 "register_operand" "0,<v_Yw>")
- (match_operand:SSESCALARMODE 2 "vector_operand" "xBm,<v_Yw>m")
+ [(set (match_operand:VIMAX_AVX2_AVX512BW 0 "register_operand" "=x,<v_Yw>")
+ (unspec:VIMAX_AVX2_AVX512BW
+ [(match_operand:VIMAX_AVX2_AVX512BW 1 "register_operand" "0,<v_Yw>")
+ (match_operand:VIMAX_AVX2_AVX512BW 2 "vector_operand" "xBm,<v_Yw>m")
(match_operand:SI 3 "const_0_to_255_mul_8_operand")]
UNSPEC_PALIGNR))]
"TARGET_SSSE3"
@@ -21141,11 +21137,30 @@
gcc_unreachable ();
}
}
- "TARGET_SSSE3 && reload_completed
- && SSE_REGNO_P (REGNO (operands[0]))"
+ "(TARGET_SSSE3 && reload_completed
+ && SSE_REGNO_P (REGNO (operands[0])))
+ || operands[3] == const0_rtx
+ || INTVAL (operands[3]) == 64"
[(set (match_dup 0)
(lshiftrt:V1TI (match_dup 0) (match_dup 3)))]
{
+ if (operands[3] == const0_rtx)
+ {
+ if (!rtx_equal_p (operands[0], operands[2]))
+ emit_move_insn (operands[0], operands[2]);
+ else
+ emit_note (NOTE_INSN_DELETED);
+ DONE;
+ }
+ else if (INTVAL (operands[3]) == 64)
+ {
+ if (!rtx_equal_p (operands[0], operands[1]))
+ emit_move_insn (operands[0], operands[1]);
+ else
+ emit_note (NOTE_INSN_DELETED);
+ DONE;
+ }
+
/* Emulate MMX palignrdi with SSE psrldq. */
rtx op0 = lowpart_subreg (V2DImode, operands[0],
GET_MODE (operands[0]));
diff --git a/gcc/testsuite/gcc.target/i386/ssse3-palignr-2.c b/gcc/testsuite/gcc.target/i386/ssse3-palignr-2.c
new file mode 100644
index 0000000..791222d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/ssse3-palignr-2.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mssse3" } */
+
+typedef long long __attribute__ ((__vector_size__ (8))) T;
+
+T x;
+T y;
+T z;
+
+void foo()
+{
+ z = __builtin_ia32_palignr (x, y, 0);
+}
+
+void bar()
+{
+ z = __builtin_ia32_palignr (x, y, 64);
+}
+/* { dg-final { scan-assembler-not "punpcklqdq" } } */
+/* { dg-final { scan-assembler-not "pshufd" } } */
+/* { dg-final { scan-assembler-not "psrldq" } } */