RE: [PATCH] x86_64: Integer min/max improvements.

Roger Sayle Thu, 06 Aug 2020 01:41:04 -0700

Hi Uros,

Many thanks for the review and feedback.  Here's the final version as committed,
with both the test cases requested by Richard Biener and your suggestion/request
to use ix86_expand_clear.  Tested again on x86_64-pc-linux-gnu.


Thank you again for the fantastic ix86_expand_clear pointer, which cleared up 
one
of two technical questions I had, and allowed this peephole2 to now also apply 
to
QImode and HImode MOV0s, where my original version was limited to SImode and
DImode.

My two questions were (i) why a QImode set of 0 with a flags clobber isn't a 
recognized
instruction?  I'd assume that on some architectures "xorb dl,dl" might be an 
appropriate
sequence to use.  This is mostly answered by the use of ix86_expand_clear, 
which 
intelligently selects the correct form, but the lack of a *movqi_xor was 
previously odd.
(ii) My other question, was that despite my best efforts I couldn't seem to 
convince GCC
to generate/use a *movsi_or to load the constant -1.  It was just a curiosity, 
but this
would affect/benefit the smaxm1 and sminm1 examples in the new i386/minmax-10.c
test program.

Many thanks again for your help.
Roger
--

-----Original Message-----
From: Uros Bizjak <ubiz...@gmail.com> 
Sent: 03 August 2020 11:29
To: Roger Sayle <ro...@nextmovesoftware.com>
Cc: GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] x86_64: Integer min/max improvements.

On Thu, Jul 30, 2020 at 1:23 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
>
> This patch tweaks the way that min and max are expanded, so that the 
> semantics of these operations are visible to the early RTL 
> optimization passes, until split into explicit comparison and 
> conditional move instructions. The good news is that i386.md already 
> contains all of the required logic (many thanks to Richard Biener and 
> Uros Bizjak), but this is currently only to enable scalar-to-vector 
> (STV) synthesis of min/max instructions.  This change enables this 
> functionality for all TARGET_CMOVE architectures for SImode, HImode and 
> DImode.
>
> My first attempt to support "min(reg,reg)" as an instruction revealed 
> why this functionality isn't already enabled: In PR rtl-optimization 
> 91154 we end up generating a cmp instead of a test in 
> gcc.target/i386/minmax-3.c which has poor performance on AMD Opteron.  
> My solution to this is to actually support "min(reg,general_operand)" 
> allowing us to inspect any immediate constant at the point this 
> operation is split.  This allows us to use "test" instructions for min/max 
> against 0, 1 and -1.
> As an added benefit it allows us to CSE large immediate constants, 
> reducing code size.
>
> Previously, "int foo(int x) { return max(x,12345); }" would generate:
>
> foo:    cmpl    $12345, %edi
>         movl    $12345, %eax
>         cmovge  %edi, %eax
>         ret
>
> with this patch we instead generate:
>
> foo:    movl    $12345, %eax
>         cmpl    %eax, %edi
>         cmovge  %edi, %eax
>         ret
>
>
> I've also included a peephole2 to avoid the "movl $0,%eax" 
> instructions being produced by the register allocator.  Materializing 
> constants as late as possible reduces register pressure, but for 
> const0_rtx on x86, it's preferable to use "xor" by moving this set 
> from between a flags setting operation and its use.  This should also 
> help instruction macro fusion on some microarchitectures, where 
> test/cmp and the following instruction can sometimes be combined.
>
> Previously, "int foo(int x) { return max(x,0); }" would generate:
>
> foo:    testl   %edi, %edi
>         movl    $0, %eax
>         cmovns  %edi, %eax
>         ret
>
> with this patch we instead generate:
> foo:    xorl    %eax, %eax
>         testl   %edi, %edi
>         cmovns  %edi, %eax
>         ret
>
> The benefits of having min/max explicit at the RTL level can be seen 
> from compiling the following example with "-O2 -fno-tree-reassoc".
>
>
> #define max(a,b) (((a) > (b))? (a) : (b))
>
> int foo(int x)
> {
>   int y = max(x,5);
>   return max(y,10);
> }
>
> where GCC currently produces:
>
> foo:    cmpl    $5, %edi
>         movl    $5, %eax
>         movl    $10, %edx
>         cmovge  %edi, %eax
>         cmpl    $10, %eax
>         cmovl   %edx, %eax
>         ret
>
> and with this patch it instead now produces:
>
> foo:    movl    $10, %eax
>         cmpl    %eax, %edi
>         cmovge  %edi, %eax
>         ret
>
>
> The original motivation was from looking at a performance critical 
> function in a quantum mechanics code, that performed MIN_EXPR and 
> MAX_EXPR of the same arguments (effectively a two-element sort), where 
> GCC was performing the comparison twice.  I'd hoped that it might be 
> possible to fuse these together, perhaps in combine, but this patch is 
> an intermediate step towards that goal.
>
> This patch has been tested on x86_64-pc-linux-gnu with a make 
> bootstrap followed by make -k check with no new regressions.
> Ok for mainline?
>
>
> 2020-07-30  Roger Sayle  <ro...@nextmovesoftware.com>
>
>         * config/i386/i386.md (MAXMIN_IMODE): No longer needed.
>         (<maxmin><mode>3):  Support SWI248 and general_operand for
>         second operand, when TARGET_CMOVE.
>         (<maxmin><mode>3_1 splitter): Optimize comparisons against
>         0, 1 and -1 to use "test" instead of "cmp".
>         (*<maxmin>di3_doubleword): Likewise, allow general_operand
>         and enable on TARGET_CMOVE.
>         (peephole2): Convert clearing a register after a flag setting
>         instruction into an xor followed by the original flag setter.
>
>
> Many thanks in advance.  Almost all of the credit goes to Uros and 
> Richard for already implementing this feature, I've just copied the 
> transformations from optab expansion, that allow it to be enabled 
> without penalty (this late in the compilation).

LGTM, modulo the below part:

+;; Avoid clearing a register between a flags setting comparison and its 
+use, ;; i.e. prefer "xorl %eax,%eax; test/cmp" over "test/cmp; movl $0, %eax".
+(define_peephole2
+  [(set (reg FLAGS_REG) (match_operand 0))
+   (set (match_operand:SWI48 1 "register_operand") (const_int 0))]
+  "peep2_regno_dead_p (0, FLAGS_REG)
+   && !reg_overlap_mentioned_p (operands[1], operands[0])"
+  [(parallel [(set (match_dup 1) (const_int 0))
+      (clobber (reg:CC FLAGS_REG))])
+   (set (match_dup 2) (match_dup 0))]
+{
+  operands[2] = gen_rtx_REG (GET_MODE (operands[0]), FLAGS_REG);
+})

Please use ix86_expand_clear (operands[1]); in the preparation statement 
instead of emitting RTX pattern. This will take
TARGET_USE_MOV0 and size optimizations into account.

Uros.

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index b24a455..4e916bf 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -18809,45 +18809,68 @@
 
 ;; min/max patterns
 
-(define_mode_iterator MAXMIN_IMODE
-  [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")])
 (define_code_attr maxmin_rel
   [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")])
 
 (define_expand "<code><mode>3"
   [(parallel
-    [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
-         (maxmin:MAXMIN_IMODE
-           (match_operand:MAXMIN_IMODE 1 "register_operand")
-           (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
+    [(set (match_operand:SWI248 0 "register_operand")
+         (maxmin:SWI248
+           (match_operand:SWI248 1 "register_operand")
+           (match_operand:SWI248 2 "general_operand")))
      (clobber (reg:CC FLAGS_REG))])]
-  "TARGET_STV")
+  "TARGET_CMOVE")
 
 (define_insn_and_split "*<code><mode>3_1"
-  [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
-       (maxmin:MAXMIN_IMODE
-         (match_operand:MAXMIN_IMODE 1 "register_operand")
-         (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
+  [(set (match_operand:SWI248 0 "register_operand")
+       (maxmin:SWI248
+         (match_operand:SWI248 1 "register_operand")
+         (match_operand:SWI248 2 "general_operand")))
    (clobber (reg:CC FLAGS_REG))]
-  "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV
+  "TARGET_CMOVE
    && ix86_pre_reload_split ()"
   "#"
   "&& 1"
   [(set (match_dup 0)
-       (if_then_else:MAXMIN_IMODE (match_dup 3)
+       (if_then_else:SWI248 (match_dup 3)
          (match_dup 1)
          (match_dup 2)))]
 {
   machine_mode mode = <MODE>mode;
+  rtx cmp_op = operands[2];
 
-  if (!register_operand (operands[2], mode))
-    operands[2] = force_reg (mode, operands[2]);
+  if (!register_operand (cmp_op, mode))
+    operands[2] = force_reg (mode, cmp_op);
 
   enum rtx_code code = <maxmin_rel>;
-  machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]);
+
+  if (cmp_op == const1_rtx)
+    {
+      /* Convert smax (x, 1) into (x > 0 ? x : 1).
+        Convert umax (x, 1) into (x != 0 ? x : 1).
+        Convert ?min (x, 1) into (x <= 0 ? x : 1).  */
+      cmp_op = const0_rtx;
+      if (code == GE)
+       code = GT;
+      else if (code == GEU)
+       code = NE;
+    }
+  /* Convert smin (x, -1) into (x < 0 ? x : -1).  */
+  else if (cmp_op == constm1_rtx && code == LE)
+    {
+      cmp_op = const0_rtx;
+      code = LT;
+    }
+  /* Convert smax (x, -1) into (x >= 0 ? x : -1).  */
+  else if (cmp_op == constm1_rtx && code == GE)
+    cmp_op = const0_rtx;
+  else if (cmp_op != const0_rtx)
+    cmp_op = operands[2];
+
+  machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], cmp_op);
   rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG);
 
-  rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]);
+  rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], cmp_op);
   emit_insn (gen_rtx_SET (flags, tmp));
 
   operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
@@ -18856,9 +18879,9 @@
 (define_insn_and_split "*<code>di3_doubleword"
   [(set (match_operand:DI 0 "register_operand")
        (maxmin:DI (match_operand:DI 1 "register_operand")
-                  (match_operand:DI 2 "nonimmediate_operand")))
+                  (match_operand:DI 2 "general_operand")))
    (clobber (reg:CC FLAGS_REG))]
-  "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL
+  "!TARGET_64BIT && TARGET_CMOVE
    && ix86_pre_reload_split ()"
   "#"
   "&& 1"
@@ -18910,6 +18933,19 @@
       gcc_unreachable ();
     }
 })
+
+;; Avoid clearing a register between a flags setting comparison and its use,
+;; i.e. prefer "xorl %eax,%eax; test/cmp" over "test/cmp; movl $0, %eax".
+(define_peephole2
+  [(set (reg FLAGS_REG) (match_operand 0))
+   (set (match_operand:SWI 1 "register_operand") (const_int 0))]
+  "peep2_regno_dead_p (0, FLAGS_REG)
+   && !reg_overlap_mentioned_p (operands[1], operands[0])"
+   [(set (match_dup 2) (match_dup 0))]
+{
+  operands[2] = gen_rtx_REG (GET_MODE (operands[0]), FLAGS_REG);
+  ix86_expand_clear (operands[1]);
+})
 
 ;; Misc patterns (?)
 
diff --git a/gcc/testsuite/gcc.target/i386/minmax-8.c 
b/gcc/testsuite/gcc.target/i386/minmax-8.c
new file mode 100644
index 0000000..1f7e466
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/minmax-8.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-Os" } */
+
+#define max(a,b) (((a) > (b))? (a) : (b))
+#define min(a,b) (((a) < (b))? (a) : (b))
+
+int foo(int x)
+{
+  return max(x,12345);
+}
+
+int bar(int x)
+{
+  return min(x,87654);
+}
+
+/* { dg-final { scan-assembler-times "12345" 1 } } */
+/* { dg-final { scan-assembler-times "87654" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/minmax-9.c 
b/gcc/testsuite/gcc.target/i386/minmax-9.c
new file mode 100644
index 0000000..3b94023
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/minmax-9.c
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-Os" } */
+
+#define max(a,b) (((a) > (b))? (a) : (b))
+#define min(a,b) (((a) < (b))? (a) : (b))
+
+int foo(int x)
+{
+  return max(x,0);
+}
+
+int bar(int x)
+{
+  return min(x,0);
+}
+
+unsigned int baz(unsigned int x)
+{
+  return min(x,1);
+}
+
+/* { dg-final { scan-assembler-times "xor" 3 } } */
+/* { dg-final { scan-assembler-times "test" 3 } } */
diff --git a/gcc/testsuite/gcc.target/i386/minmax-10.c 
b/gcc/testsuite/gcc.target/i386/minmax-10.c
new file mode 100644
index 0000000..b044462
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/minmax-10.c
@@ -0,0 +1,38 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+#define max(a,b) (((a) > (b))? (a) : (b))
+#define min(a,b) (((a) < (b))? (a) : (b))
+
+int smax1(int x)
+{
+  return max(x,1);
+}
+
+int smin1(int x)
+{
+  return min(x,1);
+}
+
+int smaxm1(int x)
+{
+  return max(x,-1);
+}
+
+int sminm1(int x)
+{
+  return min(x,-1);
+}
+
+unsigned int umax1(unsigned int x)
+{
+  return max(x,1);
+}
+
+unsigned int umin1(unsigned int x)
+{
+  return min(x,1);
+}
+
+/* { dg-final { scan-assembler-times "test" 6 } } */
+/* { dg-final { scan-assembler-not "cmp" } } */
diff --git a/gcc/testsuite/gcc.target/i386/minmax-11.c 
b/gcc/testsuite/gcc.target/i386/minmax-11.c
new file mode 100644
index 0000000..a8c2df5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/minmax-11.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fno-tree-reassoc" } */
+
+#define max(a,b) (((a) > (b))? (a) : (b))
+
+int foo(int x)
+{
+  int y = max(x,12345);
+  return max(y,87654);
+}
+
+/* { dg-final { scan-assembler-not "12345" } } */

RE: [PATCH] x86_64: Integer min/max improvements.

Reply via email to