Re: [PATCH, i386]: Do not limit the cost of moves to/from XMM register to minimum 8.

Uros Bizjak Sun, 01 Sep 2019 12:49:56 -0700

On Sat, Aug 31, 2019 at 3:57 PM Richard Biener
<[email protected]> wrote:
>
> On August 31, 2019 2:51:51 AM GMT+02:00, Alan Modra <[email protected]> wrote:
> >On Fri, Aug 30, 2019 at 09:42:06AM +0200, Uros Bizjak wrote:
> >> On Fri, Aug 30, 2019 at 9:22 AM Richard Biener
> >> <[email protected]> wrote:
> >> >
> >> > On Thu, Aug 29, 2019 at 9:54 AM Uros Bizjak <[email protected]>
> >wrote:
> >> > >
> >> > > On Wed, Aug 28, 2019 at 5:12 PM Uros Bizjak <[email protected]>
> >wrote:
> >> > > >
> >> > > > Attached patch improves costing for STV shifts and corrects
> >reject
> >> > > > condition for out of range shift count operands.
> >> > > >
> >> > > > 2019-08-28  Uroš Bizjak  <[email protected]>
> >> > > >
> >> > > >     * config/i386/i386-features.c
> >> > > >     (general_scalar_chain::compute_convert_gain):
> >> > > >     Correct cost for double-word shifts.
> >> > > >     (general_scalar_to_vector_candidate_p): Reject count
> >operands
> >> > > >     greater or equal to mode bitsize.
> >> > > >
> >> > > > Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
> >> > > >
> >> > > > Committed to mainline SVN.
> >> > >
> >> > > Ouch... I mixed up patches and actually committed the patch that
> >> > > removes maximum from cost of sse<->int moves.
> >> > >
> >> > > I can leave the patch for a day, so we can see the effects of the
> >cost
> >> > > change, and if the patch creates problems, I'll revert it.
> >> >
> >> > Regresses gromacs and namd quite a bit on Haswell, also perl in
> >SPEC 2000.
> >> > I guess we should try understand why rather than reverting
> >immediately
> >> > (I'd leave it in at least another few days to get more testers pick
> >up the
> >> > rev.).
> >>
> >> The correct patch is actually [1] (I have updated the subject):
> >>
> >> 2019-08-28  Uroš Bizjak  <[email protected]>
> >>
> >>     * config/i386/i386.c (ix86_register_move_cost): Do not
> >>     limit the cost of moves to/from XMM register to minimum 8.
> >>
> >> There is no technical reason to limit the costs to minimum 8 anymore,
> >> and several targets (e.g skylake) also claim that moves between SSE
> >> and general regs are as cheap as moves between general regs.
>
> That's likely the issue since moves between regs in the same class can be 
> usually dealt with in register renaming while between classes that is not 
> generally possible. So I'd make those bigger cost. Otoh moves between same 
> class regs are likely overcounted.
>
>  However,
> >> these values were never benchmarked properly due to the mentioned
> >> limitation and apparently cause some unwanted secondary effects
> >> (lowering the clock).
> >>
> >> I agree with your proposal to leave the change in the tree some more
> >> time. At the end, we could raise the costs for all targets to 8 to
> >> effectively revert to the previous state.
> >>
> >> [1] https://gcc.gnu.org/ml/gcc-patches/2019-08/msg02009.html
> >
> >Those SPEC regressions sound similar to what I saw when trying to give
> >proper costing to moves between general regs and vsx regs on Power9.
> >rs6000.c:rs6000_ira_change_pseudo_allocno_class is the hack I used to
> >work around bad register allocation decisions due to poor register
> >pressure calculation.


Alan, thanks for the pointer, I take a look at the extensive comments
in and above rs6000_ira_change_pseudo_allocno_class and also inside
rs6000_register_move_cost. According to the comment to
rs6000_i_c_p_a_c:

--quote--
   The register allocator chooses GEN_OR_VSX_REGS for the allocno
   class if GENERAL_REGS and VSX_REGS cost is lower than the memory
   cost.  This happens a lot when TARGET_DIRECT_MOVE makes the register
   move cost between GENERAL_REGS and VSX_REGS low.

   It might seem reasonable to use a union class.  After all, if usage
   of vsr is low and gpr high, it might make sense to spill gpr to vsr
   rather than memory.  However, in cases where register pressure of
   both is high, like the cactus_adm spec test, allowing
   GEN_OR_VSX_REGS as the allocno class results in bad decisions in
   the first scheduling pass.  This is partly due to an allocno of
   GEN_OR_VSX_REGS wrongly contributing to the GENERAL_REGS pressure
   class, which gives too high a pressure for GENERAL_REGS and too low
   for VSX_REGS.  So, force a choice of the subclass here.
--/quote--

we can take a parallel with x86's SSE and integer registers, where
both classes can hold SImode and DImode values. The attached patch is
the first try to implement the idea of forcing a subclass (I must
admit that the patch is a bit of a shot in the dark...).

Also, the comment in the rs6000_register_move_cost says:

          /* Keep the cost for direct moves above that for within
         a register class even if the actual processor cost is
         comparable.  We do this because a direct move insn
         can't be a nop, whereas with ideal register
         allocation a move within the same class might turn
         out to be a nop.  */

which is not the case with core_cost (and similar with skylake_cost):

  2, 2, 4,                /* cost of moving XMM,YMM,ZMM register */
  {6, 6, 6, 6, 12},            /* cost of loading SSE registers
                       in 32,64,128,256 and 512-bit */
  {6, 6, 6, 6, 12},            /* cost of storing SSE registers
                       in 32,64,128,256 and 512-bit */
  2, 2,                    /* SSE->integer and integer->SSE moves */

We have the same cost of moving between integer registers (by default
set to 2), between SSE registers and between integer and SSE register
sets. I think that at least the cost of moves between regsets should
be substantially higher, rs6000 uses 3x cost of intra-regset moves;
that would translate to the value of 6. The value should be low enough
to keep the cost below the value that forces move through the memory.
Changing core register allocation cost of SSE <-> integer to:

--cut here--
Index: config/i386/x86-tune-costs.h
===================================================================
--- config/i386/x86-tune-costs.h        (revision 275281)
+++ config/i386/x86-tune-costs.h        (working copy)
@@ -2555,7 +2555,7 @@ struct processor_costs core_cost = {
                                           in 32,64,128,256 and 512-bit */
   {6, 6, 6, 6, 12},                    /* cost of storing SSE registers
                                           in 32,64,128,256 and 512-bit */
-  2, 2,                                        /* SSE->integer and
integer->SSE moves */
+  6, 6,                                        /* SSE->integer and
integer->SSE moves */
   /* End of register allocator costs.  */
   },

--cut here--

still produces direct move in gcc.target/i386/minmax-6.c

I think that in addition to attached patch, values between 2 and 6
should be considered in benchmarking. Unfortunately, without access to
regressed SPEC tests, I can't analyse these changes by myself.

Uros.

Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c  (revision 275198)
+++ config/i386/i386.c  (working copy)
@@ -18632,6 +18632,25 @@
   return 2;
 }
 
+/* Implement TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS.  */
+
+static reg_class_t
+x86_ira_change_pseudo_allocno_class (int regno ATTRIBUTE_UNUSED,
+                                    reg_class_t allocno_class,
+                                    reg_class_t best_class)
+{
+  switch (allocno_class)
+    {
+    case INT_SSE_REGS:
+      return best_class;
+
+    default:
+      break;
+    }
+
+  return allocno_class;
+}
+
 /* Implement TARGET_HARD_REGNO_NREGS.  This is ordinarily the length in
    words of a value of mode MODE but can be less for certain modes in
    special long registers.
@@ -22746,6 +22765,9 @@
 #define TARGET_REGISTER_MOVE_COST ix86_register_move_cost
 #undef TARGET_MEMORY_MOVE_COST
 #define TARGET_MEMORY_MOVE_COST ix86_memory_move_cost
+#undef TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS
+#define TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS \
+  x86_ira_change_pseudo_allocno_class
 #undef TARGET_RTX_COSTS
 #define TARGET_RTX_COSTS ix86_rtx_costs
 #undef TARGET_ADDRESS_COST

Re: [PATCH, i386]: Do not limit the cost of moves to/from XMM register to minimum 8.

Reply via email to