On Sat, Jul 27, 2019 at 12:07 PM Uros Bizjak <ubiz...@gmail.com> wrote:

> > How would one write smaxsi3 as a splitter to be split after
> > reload in the case LRA assigned the GPR alternative?  Is it
> > even worth doing?  Even the SSE reg alternative can be split
> > to remove the not needed CC clobber.
> >
> > Finally I'm unsure about the add where I needed to place
> > the SSE alternative before the 2nd op memory one since it
> > otherwise gets the same cost and wins.
> >
> > So - how to go forward with this?
>
> Sorry to come a bit late to the discussion.
>
> We are aware of CMOV issue for quite some time, but the issue is not
> understood yet in detail (I was hoping for Intel people to look at
> this). However, you demonstrated that using PMAX and PMIN  instead of
> scalar CMOV can bring us big gains, and this thread now deals on how
> to best implement PMAX/PMIN for scalar code.
>
> I think that the way to go forward is with STV infrastructure.
> Currently, the implementation only deals with DImode on SSE2 32bit
> targets, but I see no issues on using STV pass also for SImode (on
> 32bit and 64bit targets). There are actually two STV passes, the first
> one (currently run on 64bit targets) is run before cse2, and the
> second (which currently runs on 32bit SSE2 only) is run after combine
> and before split1 pass. The second pass is interesting to us.
>
> The base idea of the second STV pass (for 32bit targets!) is that we
> introduce a DImode _doubleword instructons that otherwise do not exist
> with integer registers. Now, the passes up to and including combine
> pass can use these instructions to simplify and optimize the insn
> flow. Later, based on cost analysis, STV pass either converts the
> _doubleword instructions to a real vector ones (e.g. V2DImode
> patterns) or leaves them intact, and a follow-up split pass splits
> them into scalar SImode instruction pairs. STV pass also takes care to
> move and preload values from their scalar form to a vector
> representation (using SUBREGs). Please note that all this happens on
> pseudos, and register allocator will later simply use scalar (integer)
> registers in scalar patterns and vector registers with vector insn
> patterns.
>
> Your approach to amend existing scalar SImode patterns with vector
> registers will introduce no end of problems. Register allocator will
> do funny things during register pressure, where values will take a
> trip to a vector register before being stored to memory (and vice
> versa, you already found some of them). Current RA simply can't
> distinguish clearly between two register sets.
>
> So, my advice would be to use STV pass also for SImode values, on
> 64bit and 32bit targets. On both targets, we will be able to use
> instructions that operate on vector register set, and for 32bit
> targets (and to some extent on 64bit targets), we would perhaps be
> able to relax register pressure in a kind of controlled way.
>
> So, to demonstrate the benefits of existing STV pass, it should be
> relatively easy to introduce 64bit max/min pattern on 32bit target to
> handle 64bit values. For 32bit values, the pass should be re-run to
> convert SImode scalar operations to vector operations in a controlled
> way, based on various cost functions.

Please find attached patch to see STV in action. The compilation will
crash due to non-existing V2DImode SMAX insn, but in the _.268r.stv2
dump, you will be able to see chain building, cost calculation and
conversion insertion.

The testcase:

--cut here--
long long test (long long a, long long b)
{
  return (a > b) ? a : b;
}
--cut here--

gcc -O2 -m32 -msse2 (-mstv):

_.268r.stv2 dump:

Searching for mode conversion candidates...
  insn 2 is marked as a candidate
  insn 3 is marked as a candidate
  insn 7 is marked as a candidate
Created a new instruction chain #1
Building chain #1...
  Adding insn 2 to chain #1
  Adding insn 7 into chain's #1 queue
  Adding insn 7 to chain #1
  r85 use in insn 12 isn't convertible
  Mark r85 def in insn 7 as requiring both modes in chain #1
  Adding insn 3 into chain's #1 queue
  Adding insn 3 to chain #1
Collected chain #1...
  insns: 2, 3, 7
  defs to convert: r85
Computing gain for chain #1...
  Instruction conversion gain: 24
  Registers conversion cost: 6
  Total gain: 18
Converting chain #1...

...

(insn 2 5 3 2 (set (reg/v:DI 83 [ a ])
        (mem/c:DI (reg/f:SI 16 argp) [1 a+0 S8 A32])) "max.c":2:1 66
{*movdi_internal}
     (nil))
(insn 3 2 4 2 (set (reg/v:DI 84 [ b ])
        (mem/c:DI (plus:SI (reg/f:SI 16 argp)
                (const_int 8 [0x8])) [1 b+0 S8 A32])) "max.c":2:1 66
{*movdi_internal}
     (nil))
(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
(insn 7 4 15 2 (set (subreg:V2DI (reg:DI 85) 0)
        (smax:V2DI (subreg:V2DI (reg/v:DI 84 [ b ]) 0)
            (subreg:V2DI (reg/v:DI 83 [ a ]) 0))) "max.c":3:22 -1
     (expr_list:REG_DEAD (reg/v:DI 84 [ b ])
        (expr_list:REG_DEAD (reg/v:DI 83 [ a ])
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil)))))
(insn 15 7 16 2 (set (reg:V2DI 87)
        (subreg:V2DI (reg:DI 85) 0)) "max.c":3:22 -1
     (nil))
(insn 16 15 17 2 (set (subreg:SI (reg:DI 86) 0)
        (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1
     (nil))
(insn 17 16 18 2 (set (reg:V2DI 87)
        (lshiftrt:V2DI (reg:V2DI 87)
            (const_int 32 [0x20]))) "max.c":3:22 -1
     (nil))
(insn 18 17 12 2 (set (subreg:SI (reg:DI 86) 4)
        (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1
     (nil))
(insn 12 18 13 2 (set (reg/i:DI 0 ax)
        (reg:DI 86)) "max.c":4:1 66 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 86)
        (nil)))
(insn 13 12 0 2 (use (reg/i:DI 0 ax)) "max.c":4:1 -1
     (nil))

Uros.
Index: i386-features.c
===================================================================
--- i386-features.c     (revision 273844)
+++ i386-features.c     (working copy)
@@ -531,6 +531,9 @@
          if (CONST_INT_P (XEXP (src, 1)))
            gain -= vector_const_cost (XEXP (src, 1));
        }
+      else if (GET_CODE (src) == SMAX
+              || (GET_CODE (src) == SMIN))
+       gain += COSTS_N_INSNS (3);
       else if (GET_CODE (src) == NEG
               || GET_CODE (src) == NOT)
        gain += ix86_cost->add - COSTS_N_INSNS (1);
@@ -907,6 +910,8 @@
     case IOR:
     case XOR:
     case AND:
+    case SMAX:
+    case SMIN:
       convert_op (&XEXP (src, 0), insn);
       convert_op (&XEXP (src, 1), insn);
       PUT_MODE (src, V2DImode);
@@ -1285,6 +1290,8 @@
     case IOR:
     case XOR:
     case AND:
+    case SMAX:
+    case SMIN:
       if (!REG_P (XEXP (src, 1))
          && !MEM_P (XEXP (src, 1))
          && !CONST_INT_P (XEXP (src, 1)))
Index: i386.md
===================================================================
--- i386.md     (revision 273844)
+++ i386.md     (working copy)
@@ -17489,6 +17489,14 @@
     gcc_unreachable ();
 })
 
+(define_insn "smaxdi3"
+  [(set (match_operand:DI 0 "register_operand")
+        (smax:DI (match_operand:DI 1 "register_operand")
+                 (match_operand:DI 2 "register_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "!TARGET_64BIT && TARGET_STV && TARGET_SSE2"
+  "#")
+
 (define_expand "mov<mode>cc"
   [(set (match_operand:X87MODEF 0 "register_operand")
        (if_then_else:X87MODEF

Reply via email to