Hi,
this patch fixes some of the problems with costing in the scalar-to-vector (STV) pass.  In particular:

1) The pass uses optimize_insn_for_size_p, which is intended to be used by expanders and splitters and requires the optimization pass to use set_rtl_profile (bb) for the currently processed bb.  This is not done, so we get random stale info about the hotness of the insn.

2) The register allocator move costs are all relative to an integer reg-reg move, which has cost 2, so (except for the size tables and i386) an entry is the latency of the instruction multiplied by 2.  These costs have been duplicated and are now used in combination with rtx costs, which are all based on COSTS_N_INSNS, i.e. latency multiplied by 4.  Some of the vectorizer costing contains COSTS_N_INSNS (move_cost) / 2 to compensate, but some of the new code does not.  This patch adds the compensation.
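To make the scale mismatch concrete, here is a minimal standalone sketch (not GCC code; the int_load/sse_load values are invented for illustration) of why COSTS_N_INSNS (move_cost) / 2 puts a move-cost table entry onto the rtx-cost scale:

  /* Not GCC code: a tiny standalone illustration.  Move-cost tables are
     normalized so that an integer reg-reg move costs 2 (roughly latency * 2),
     while rtx costs use COSTS_N_INSNS (latency * 4).  */
  #include <stdio.h>

  #define COSTS_N_INSNS(n) ((n) * 4)	/* rtx-cost scale, as in rtl.h.  */

  int
  main (void)
  {
    int int_load = 6;	/* invented move-cost entry, i.e. latency 3 * 2 */
    int sse_load = 10;	/* invented move-cost entry, i.e. latency 5 * 2 */

    /* Mixing the scales directly understates the difference by half.  */
    int gain_mixed = int_load - sse_load;			/* -4 */

    /* The compensation: rescale the move-cost difference to the
       COSTS_N_INSNS scale.  */
    int gain_scaled = COSTS_N_INSNS (int_load - sse_load) / 2;	/* -8 */

    /* The rescaled value matches the latency difference expressed
       directly in rtx-cost units.  */
    printf ("%d %d %d\n", gain_mixed, gain_scaled, COSTS_N_INSNS (3 - 5));
    return 0;
  }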
Perhaps we should update the cost tables to use COSTS_N_INSNS everywhere, but I think we want to fix the inconsistencies first.  Also the tables would get optically much longer, since we have many move costs and COSTS_N_INSNS is a lot of characters.

3) The variable m, which decides how much to multiply the integer variant (to account for the fact that with -m32 all 64-bit computations need 2 instructions), is declared unsigned, which makes the signed computation of the instruction gain be done in an unsigned type and breaks e.g. for division.

4) I added integer_to_sse costs, which are currently all a duplication of sse_to_integer.  AMD chips are asymmetric and moving in one direction is faster than in the other.  I will change the costs incrementally once the vectorizer part is fixed up, too.

There are two failures, gcc.target/i386/minmax-6.c and gcc.target/i386/minmax-7.c.  Both test STV on Haswell, which no longer happens since SSE->INT and INT->SSE moves are now more expensive.  There is only one instruction to convert:

  Computing gain for chain #1...
    Instruction gain 8 for 11: {r110:SI=smax(r116:SI,0);clobber flags:CC;}
    Instruction conversion gain: 8
    Registers conversion cost: 8   <- this is integer_to_sse and sse_to_integer
    Total gain: 0

The total gain used to be 4, since the patch doubles the conversion costs.  According to Agner Fog's tables the costs should be 1 cycle, which is correct here.  The final code generated is:

	vmovd	%esi, %xmm0		* latency 1
	cmpl	%edx, %esi
	je	.L2
	vpxor	%xmm1, %xmm1, %xmm1	* latency 1
	vpmaxsd	%xmm1, %xmm0, %xmm0	* latency 1
	vmovd	%xmm0, %eax		* latency 1
	imull	%edx, %eax
	cltq
	movzwl	(%rdi,%rax,2), %eax
	ret

compared to the unconverted code:

	cmpl	%edx, %esi
	je	.L2
	xorl	%eax, %eax		* latency 1
	testl	%esi, %esi		* latency 1
	cmovs	%eax, %esi		* latency 2
	imull	%edx, %esi
	movslq	%esi, %rsi
	movzwl	(%rdi,%rsi,2), %eax
	ret

The instructions with latency info are those that really differ.  So the unconverted code has a sum of latencies of 4 and a real latency of 3, and the converted code also has a sum of latencies of 4 and a real latency of 3 (vmovd+vpmaxsd+vmovd).  So I do not quite see why it should be a win.

There is also a bug in costing MIN/MAX:

    case ABS:
    case SMAX:
    case SMIN:
    case UMAX:
    case UMIN:
      /* We do not have any conditional move cost, estimate it as a
	 reg-reg move.  Comparisons are costed as adds.  */
      igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
      /* Integer SSE ops are all costed the same.  */
      igain -= ix86_cost->sse_op;
      break;

Now COSTS_N_INSNS (2) is not quite right, since a reg-reg move should be 1 or perhaps 0.  On Haswell cmov really is 2 cycles, but I guess we want to have that in the cost vectors like all other instructions.

I am not sure this is really a win in this case (the other minmax testcases seem to make sense).  I have xfailed it for now and will check whether it affects SPEC on the LNT testers.

Bootstrapped/regtested x86_64-linux, committed.  I will proceed with similar fixes on the vectorizer cost side.  Sadly those introduce quite a few differences in the testsuite (partly triggered by other costing problems, such as the one for scatter/gather).

gcc/ChangeLog:

	* config/i386/i386-features.cc
	(general_scalar_chain::vector_const_cost): Add BB parameter; handle
	size costs; use COSTS_N_INSNS to compute move costs.
	(general_scalar_chain::compute_convert_gain): Use
	optimize_bb_for_size_p instead of optimize_insn_for_size_p; use
	COSTS_N_INSNS to compute move costs; update calls of
	general_scalar_chain::vector_const_cost; use
	ix86_cost->integer_to_sse.
	(timode_immed_const_gain): Add bb parameter; use
	optimize_bb_for_size_p.
	(timode_scalar_chain::compute_convert_gain): Use
	optimize_bb_for_size_p.
* config/i386/i386-features.h (class general_scalar_chain): Update prototype of vector_const_cost. * config/i386/i386.h (struct processor_costs): Add integer_to_sse. * config/i386/x86-tune-costs.h (struct processor_costs): Copy sse_to_integer to integer_to_sse everywhere. gcc/testsuite/ChangeLog: * gcc.target/i386/minmax-6.c: xfail test that pmax is used. * gcc.target/i386/minmax-7.c: xfall test that pmin is used. diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc index 1ba5ac4faa4..54b3f6d33b2 100644 --- a/gcc/config/i386/i386-features.cc +++ b/gcc/config/i386/i386-features.cc @@ -518,15 +518,17 @@ scalar_chain::build (bitmap candidates, unsigned insn_uid, bitmap disallowed) instead of using a scalar one. */ int -general_scalar_chain::vector_const_cost (rtx exp) +general_scalar_chain::vector_const_cost (rtx exp, basic_block bb) { gcc_assert (CONST_INT_P (exp)); if (standard_sse_constant_p (exp, vmode)) return ix86_cost->sse_op; + if (optimize_bb_for_size_p (bb)) + return COSTS_N_BYTES (8); /* We have separate costs for SImode and DImode, use SImode costs for smaller modes. */ - return ix86_cost->sse_load[smode == DImode ? 1 : 0]; + return COSTS_N_INSNS (ix86_cost->sse_load[smode == DImode ? 1 : 0]) / 2; } /* Compute a gain for chain conversion. */ @@ -547,7 +549,7 @@ general_scalar_chain::compute_convert_gain () smaller modes than SImode the int load/store costs need to be adjusted as well. */ unsigned sse_cost_idx = smode == DImode ? 1 : 0; - unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1; + int m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1; EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi) { @@ -555,26 +557,55 @@ general_scalar_chain::compute_convert_gain () rtx def_set = single_set (insn); rtx src = SET_SRC (def_set); rtx dst = SET_DEST (def_set); + basic_block bb = BLOCK_FOR_INSN (insn); int igain = 0; if (REG_P (src) && REG_P (dst)) - igain += 2 * m - ix86_cost->xmm_move; + { + if (optimize_bb_for_size_p (bb)) + /* reg-reg move is 2 bytes, while SSE 3. */ + igain += COSTS_N_BYTES (2 * m - 3); + else + /* Move costs are normalized to reg-reg move having cost 2. */ + igain += COSTS_N_INSNS (2 * m - ix86_cost->xmm_move) / 2; + } else if (REG_P (src) && MEM_P (dst)) - igain - += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx]; + { + if (optimize_bb_for_size_p (bb)) + /* Integer load/store is 3+ bytes and SSE 4+. */ + igain += COSTS_N_BYTES (3 * m - 4); + else + igain + += COSTS_N_INSNS (m * ix86_cost->int_store[2] + - ix86_cost->sse_store[sse_cost_idx]) / 2; + } else if (MEM_P (src) && REG_P (dst)) - igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx]; + { + if (optimize_bb_for_size_p (bb)) + igain += COSTS_N_BYTES (3 * m - 4); + else + igain += COSTS_N_INSNS (m * ix86_cost->int_load[2] + - ix86_cost->sse_load[sse_cost_idx]) / 2; + } else { /* For operations on memory operands, include the overhead of explicit load and store instructions. */ if (MEM_P (dst)) - igain += optimize_insn_for_size_p () - ? -COSTS_N_BYTES (8) - : (m * (ix86_cost->int_load[2] - + ix86_cost->int_store[2]) - - (ix86_cost->sse_load[sse_cost_idx] + - ix86_cost->sse_store[sse_cost_idx])); + { + if (optimize_bb_for_size_p (bb)) + /* ??? This probably should account size difference + of SSE and integer load rather than full SSE load. 
*/ + igain -= COSTS_N_BYTES (8); + else + { + int cost = (m * (ix86_cost->int_load[2] + + ix86_cost->int_store[2]) + - (ix86_cost->sse_load[sse_cost_idx] + + ix86_cost->sse_store[sse_cost_idx])); + igain += COSTS_N_INSNS (cost) / 2; + } + } switch (GET_CODE (src)) { @@ -595,7 +626,7 @@ general_scalar_chain::compute_convert_gain () igain += ix86_cost->shift_const - ix86_cost->sse_op; if (CONST_INT_P (XEXP (src, 0))) - igain -= vector_const_cost (XEXP (src, 0)); + igain -= vector_const_cost (XEXP (src, 0), bb); break; case ROTATE: @@ -631,16 +662,17 @@ general_scalar_chain::compute_convert_gain () igain += m * ix86_cost->add; if (CONST_INT_P (XEXP (src, 0))) - igain -= vector_const_cost (XEXP (src, 0)); + igain -= vector_const_cost (XEXP (src, 0), bb); if (CONST_INT_P (XEXP (src, 1))) - igain -= vector_const_cost (XEXP (src, 1)); + igain -= vector_const_cost (XEXP (src, 1), bb); if (MEM_P (XEXP (src, 1))) { - if (optimize_insn_for_size_p ()) + if (optimize_bb_for_size_p (bb)) igain -= COSTS_N_BYTES (m == 2 ? 3 : 5); else - igain += m * ix86_cost->int_load[2] - - ix86_cost->sse_load[sse_cost_idx]; + igain += COSTS_N_INSNS + (m * ix86_cost->int_load[2] + - ix86_cost->sse_load[sse_cost_idx]) / 2; } break; @@ -698,7 +730,7 @@ general_scalar_chain::compute_convert_gain () case CONST_INT: if (REG_P (dst)) { - if (optimize_insn_for_size_p ()) + if (optimize_bb_for_size_p (bb)) { /* xor (2 bytes) vs. xorps (3 bytes). */ if (src == const0_rtx) @@ -722,14 +754,14 @@ general_scalar_chain::compute_convert_gain () /* DImode can be immediate for TARGET_64BIT and SImode always. */ igain += m * COSTS_N_INSNS (1); - igain -= vector_const_cost (src); + igain -= vector_const_cost (src, bb); } } else if (MEM_P (dst)) { igain += (m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx]); - igain -= vector_const_cost (src); + igain -= vector_const_cost (src, bb); } break; @@ -737,13 +769,14 @@ general_scalar_chain::compute_convert_gain () if (XVECEXP (XEXP (src, 1), 0, 0) == const0_rtx) { // movd (4 bytes) replaced with movdqa (4 bytes). - if (!optimize_insn_for_size_p ()) - igain += ix86_cost->sse_to_integer - ix86_cost->xmm_move; + if (!optimize_bb_for_size_p (bb)) + igain += COSTS_N_INSNS (ix86_cost->sse_to_integer + - ix86_cost->xmm_move) / 2; } else { // pshufd; movd replaced with pshufd. - if (optimize_insn_for_size_p ()) + if (optimize_bb_for_size_p (bb)) igain += COSTS_N_BYTES (4); else igain += ix86_cost->sse_to_integer; @@ -769,11 +802,11 @@ general_scalar_chain::compute_convert_gain () /* Cost the integer to sse and sse to integer moves. */ if (!optimize_function_for_size_p (cfun)) { - cost += n_sse_to_integer * ix86_cost->sse_to_integer; + cost += n_sse_to_integer * COSTS_N_INSNS (ix86_cost->sse_to_integer) / 2; /* ??? integer_to_sse but we only have that in the RA cost table. Assume sse_to_integer/integer_to_sse are the same which they are at the moment. */ - cost += n_integer_to_sse * ix86_cost->sse_to_integer; + cost += n_integer_to_sse * COSTS_N_INSNS (ix86_cost->integer_to_sse) / 2; } else if (TARGET_64BIT || smode == SImode) { @@ -1508,13 +1541,13 @@ general_scalar_chain::convert_insn (rtx_insn *insn) with numerous special cases. */ static int -timode_immed_const_gain (rtx cst) +timode_immed_const_gain (rtx cst, basic_block bb) { /* movabsq vs. movabsq+vmovq+vunpacklqdq. */ if (CONST_WIDE_INT_P (cst) && CONST_WIDE_INT_NUNITS (cst) == 2 && CONST_WIDE_INT_ELT (cst, 0) == CONST_WIDE_INT_ELT (cst, 1)) - return optimize_insn_for_size_p () ? 
-COSTS_N_BYTES (9) + return optimize_bb_for_size_p (bb) ? -COSTS_N_BYTES (9) : -COSTS_N_INSNS (2); /* 2x movabsq ~ vmovdqa. */ return 0; @@ -1546,33 +1579,34 @@ timode_scalar_chain::compute_convert_gain () rtx src = SET_SRC (def_set); rtx dst = SET_DEST (def_set); HOST_WIDE_INT op1val; + basic_block bb = BLOCK_FOR_INSN (insn); int scost, vcost; int igain = 0; switch (GET_CODE (src)) { case REG: - if (optimize_insn_for_size_p ()) + if (optimize_bb_for_size_p (bb)) igain = MEM_P (dst) ? COSTS_N_BYTES (6) : COSTS_N_BYTES (3); else igain = COSTS_N_INSNS (1); break; case MEM: - igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (7) + igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (7) : COSTS_N_INSNS (1); break; case CONST_INT: if (MEM_P (dst) && standard_sse_constant_p (src, V1TImode)) - igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (11) : 1; + igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (11) : 1; break; case CONST_WIDE_INT: /* 2 x mov vs. vmovdqa. */ if (MEM_P (dst)) - igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (3) + igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (3) : COSTS_N_INSNS (1); break; @@ -1587,14 +1621,14 @@ timode_scalar_chain::compute_convert_gain () if (!MEM_P (dst)) igain = COSTS_N_INSNS (1); if (CONST_SCALAR_INT_P (XEXP (src, 1))) - igain += timode_immed_const_gain (XEXP (src, 1)); + igain += timode_immed_const_gain (XEXP (src, 1), bb); break; case ASHIFT: case LSHIFTRT: /* See ix86_expand_v1ti_shift. */ op1val = INTVAL (XEXP (src, 1)); - if (optimize_insn_for_size_p ()) + if (optimize_bb_for_size_p (bb)) { if (op1val == 64 || op1val == 65) scost = COSTS_N_BYTES (5); @@ -1628,7 +1662,7 @@ timode_scalar_chain::compute_convert_gain () case ASHIFTRT: /* See ix86_expand_v1ti_ashiftrt. */ op1val = INTVAL (XEXP (src, 1)); - if (optimize_insn_for_size_p ()) + if (optimize_bb_for_size_p (bb)) { if (op1val == 64 || op1val == 127) scost = COSTS_N_BYTES (7); @@ -1706,7 +1740,7 @@ timode_scalar_chain::compute_convert_gain () case ROTATERT: /* See ix86_expand_v1ti_rotate. */ op1val = INTVAL (XEXP (src, 1)); - if (optimize_insn_for_size_p ()) + if (optimize_bb_for_size_p (bb)) { scost = COSTS_N_BYTES (13); if ((op1val & 31) == 0) @@ -1738,16 +1772,16 @@ timode_scalar_chain::compute_convert_gain () { if (GET_CODE (XEXP (src, 0)) == AND) /* and;and;or (9 bytes) vs. ptest (5 bytes). */ - igain = optimize_insn_for_size_p() ? COSTS_N_BYTES (4) - : COSTS_N_INSNS (2); + igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (4) + : COSTS_N_INSNS (2); /* or (3 bytes) vs. ptest (5 bytes). */ - else if (optimize_insn_for_size_p ()) + else if (optimize_bb_for_size_p (bb)) igain = -COSTS_N_BYTES (2); } else if (XEXP (src, 1) == const1_rtx) /* and;cmp -1 (7 bytes) vs. pcmpeqd;pxor;ptest (13 bytes). */ - igain = optimize_insn_for_size_p() ? -COSTS_N_BYTES (6) - : -COSTS_N_INSNS (1); + igain = optimize_bb_for_size_p (bb) ? 
-COSTS_N_BYTES (6) + : -COSTS_N_INSNS (1); break; default: diff --git a/gcc/config/i386/i386-features.h b/gcc/config/i386/i386-features.h index 24b0c4ed0cd..7f7c0f78c96 100644 --- a/gcc/config/i386/i386-features.h +++ b/gcc/config/i386/i386-features.h @@ -188,7 +188,7 @@ class general_scalar_chain : public scalar_chain private: void convert_insn (rtx_insn *insn) final override; - int vector_const_cost (rtx exp); + int vector_const_cost (rtx exp, basic_block bb); rtx convert_rotate (enum rtx_code, rtx op0, rtx op1, rtx_insn *insn); }; diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h index 6a38de30de4..18fa97a9eb0 100644 --- a/gcc/config/i386/i386.h +++ b/gcc/config/i386/i386.h @@ -179,6 +179,7 @@ struct processor_costs { const int xmm_move, ymm_move, /* cost of moving XMM and YMM register. */ zmm_move; const int sse_to_integer; /* cost of moving SSE register to integer. */ + const int integer_to_sse; /* cost of moving integer register to SSE. */ const int gather_static, gather_per_elt; /* Cost of gather load is computed as static + per_item * nelts. */ const int scatter_static, scatter_per_elt; /* Cost of gather store is diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h index 6cce70a6c40..e5091293509 100644 --- a/gcc/config/i386/x86-tune-costs.h +++ b/gcc/config/i386/x86-tune-costs.h @@ -107,6 +107,7 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */ in 128bit, 256bit and 512bit */ 4, 4, 6, /* cost of moving XMM,YMM,ZMM register */ 4, /* cost of moving SSE register to integer. */ + 4, /* cost of moving integer register to SSE. */ COSTS_N_BYTES (5), 0, /* Gather load static, per_elt. */ COSTS_N_BYTES (5), 0, /* Gather store static, per_elt. */ 0, /* size of l1 cache */ @@ -227,6 +228,7 @@ struct processor_costs i386_cost = { /* 386 specific costs */ {4, 8, 16, 32, 64}, /* cost of unaligned stores. */ 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ 3, /* cost of moving SSE register to integer. */ + 3, /* cost of moving integer register to SSE. */ 4, 4, /* Gather load static, per_elt. */ 4, 4, /* Gather store static, per_elt. */ 0, /* size of l1 cache */ @@ -345,6 +347,7 @@ struct processor_costs i486_cost = { /* 486 specific costs */ {4, 8, 16, 32, 64}, /* cost of unaligned stores. */ 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ 3, /* cost of moving SSE register to integer. */ + 3, /* cost of moving integer register to SSE. */ 4, 4, /* Gather load static, per_elt. */ 4, 4, /* Gather store static, per_elt. */ 4, /* size of l1 cache. 486 has 8kB cache @@ -465,6 +468,7 @@ struct processor_costs pentium_cost = { {4, 8, 16, 32, 64}, /* cost of unaligned stores. */ 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ 3, /* cost of moving SSE register to integer. */ + 3, /* cost of moving integer register to SSE. */ 4, 4, /* Gather load static, per_elt. */ 4, 4, /* Gather store static, per_elt. */ 8, /* size of l1 cache. */ @@ -576,6 +580,7 @@ struct processor_costs lakemont_cost = { {4, 8, 16, 32, 64}, /* cost of unaligned stores. */ 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ 3, /* cost of moving SSE register to integer. */ + 3, /* cost of moving integer register to SSE. */ 4, 4, /* Gather load static, per_elt. */ 4, 4, /* Gather store static, per_elt. */ 8, /* size of l1 cache. */ @@ -702,6 +707,7 @@ struct processor_costs pentiumpro_cost = { {4, 8, 16, 32, 64}, /* cost of unaligned stores. */ 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ 3, /* cost of moving SSE register to integer. 
*/ + 3, /* cost of moving integer register to SSE. */ 4, 4, /* Gather load static, per_elt. */ 4, 4, /* Gather store static, per_elt. */ 8, /* size of l1 cache. */ @@ -819,6 +825,7 @@ struct processor_costs geode_cost = { {2, 2, 8, 16, 32}, /* cost of unaligned stores. */ 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ 6, /* cost of moving SSE register to integer. */ + 6, /* cost of moving integer register to SSE. */ 2, 2, /* Gather load static, per_elt. */ 2, 2, /* Gather store static, per_elt. */ 64, /* size of l1 cache. */ @@ -936,6 +943,7 @@ struct processor_costs k6_cost = { {2, 2, 8, 16, 32}, /* cost of unaligned stores. */ 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ 6, /* cost of moving SSE register to integer. */ + 6, /* cost of moving integer register to SSE. */ 2, 2, /* Gather load static, per_elt. */ 2, 2, /* Gather store static, per_elt. */ 32, /* size of l1 cache. */ @@ -1059,6 +1067,7 @@ struct processor_costs athlon_cost = { {4, 4, 10, 10, 20}, /* cost of unaligned stores. */ 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ 5, /* cost of moving SSE register to integer. */ + 5, /* cost of moving integer register to SSE. */ 4, 4, /* Gather load static, per_elt. */ 4, 4, /* Gather store static, per_elt. */ 64, /* size of l1 cache. */ @@ -1184,6 +1193,7 @@ struct processor_costs k8_cost = { {4, 4, 10, 10, 20}, /* cost of unaligned stores. */ 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ 5, /* cost of moving SSE register to integer. */ + 5, /* cost of moving integer register to SSE. */ 4, 4, /* Gather load static, per_elt. */ 4, 4, /* Gather store static, per_elt. */ 64, /* size of l1 cache. */ @@ -1322,6 +1332,7 @@ struct processor_costs amdfam10_cost = { {4, 4, 5, 10, 20}, /* cost of unaligned stores. */ 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ 3, /* cost of moving SSE register to integer. */ + 3, /* cost of moving integer register to SSE. */ 4, 4, /* Gather load static, per_elt. */ 4, 4, /* Gather store static, per_elt. */ 64, /* size of l1 cache. */ @@ -1452,6 +1463,7 @@ const struct processor_costs bdver_cost = { {10, 10, 10, 40, 60}, /* cost of unaligned stores. */ 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ 16, /* cost of moving SSE register to integer. */ + 16, /* cost of moving integer register to SSE. */ 12, 12, /* Gather load static, per_elt. */ 10, 10, /* Gather store static, per_elt. */ 16, /* size of l1 cache. */ @@ -1603,6 +1615,7 @@ struct processor_costs znver1_cost = { {8, 8, 8, 16, 32}, /* cost of unaligned stores. */ 2, 3, 6, /* cost of moving XMM,YMM,ZMM register. */ 6, /* cost of moving SSE register to integer. */ + 6, /* cost of moving integer register to SSE. */ /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops, throughput 12. Approx 9 uops do not depend on vector size and every load is 7 uops. */ @@ -1770,6 +1783,7 @@ struct processor_costs znver2_cost = { 2, 2, 3, /* cost of moving XMM,YMM,ZMM register. */ 6, /* cost of moving SSE register to integer. */ + 6, /* cost of moving integer register to SSE. */ /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops, throughput 12. Approx 9 uops do not depend on vector size and every load is 7 uops. */ @@ -1912,6 +1926,7 @@ struct processor_costs znver3_cost = { 2, 2, 3, /* cost of moving XMM,YMM,ZMM register. */ 6, /* cost of moving SSE register to integer. */ + 6, /* cost of moving integer register to SSE. */ /* VGATHERDPD is 15 uops and throughput is 4, VGATHERDPS is 23 uops, throughput 9. 
Approx 7 uops do not depend on vector size and every load is 4 uops. */ @@ -2056,6 +2071,7 @@ struct processor_costs znver4_cost = { 2, 2, 2, /* cost of moving XMM,YMM,ZMM register. */ 6, /* cost of moving SSE register to integer. */ + 6, /* cost of moving integer register to SSE. */ /* VGATHERDPD is 17 uops and throughput is 4, VGATHERDPS is 24 uops, throughput 5. Approx 7 uops do not depend on vector size and every load is 5 uops. */ @@ -2204,6 +2220,7 @@ struct processor_costs znver5_cost = { 2, 2, 2, /* cost of moving XMM,YMM,ZMM register. */ 6, /* cost of moving SSE register to integer. */ + 6, /* cost of moving integer register to SSE. */ /* TODO: gather and scatter instructions are currently disabled in x86-tune.def. In some cases they are however a win, see PR116582 @@ -2372,6 +2389,7 @@ struct processor_costs skylake_cost = { {8, 8, 8, 8, 16}, /* cost of unaligned stores. */ 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */ 6, /* cost of moving SSE register to integer. */ + 6, /* cost of moving integer register to SSE. */ 20, 8, /* Gather load static, per_elt. */ 22, 10, /* Gather store static, per_elt. */ 64, /* size of l1 cache. */ @@ -2508,6 +2526,7 @@ struct processor_costs icelake_cost = { {8, 8, 8, 8, 16}, /* cost of unaligned stores. */ 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */ 6, /* cost of moving SSE register to integer. */ + 6, /* cost of moving integer register to SSE. */ 20, 8, /* Gather load static, per_elt. */ 22, 10, /* Gather store static, per_elt. */ 64, /* size of l1 cache. */ @@ -2638,6 +2657,7 @@ struct processor_costs alderlake_cost = { {8, 8, 8, 10, 15}, /* cost of unaligned storess. */ 2, 3, 4, /* cost of moving XMM,YMM,ZMM register */ 6, /* cost of moving SSE register to integer. */ + 6, /* cost of moving integer register to SSE. */ 18, 6, /* Gather load static, per_elt. */ 18, 6, /* Gather store static, per_elt. */ 32, /* size of l1 cache. */ @@ -2761,6 +2781,7 @@ const struct processor_costs btver1_cost = { {10, 10, 12, 48, 96}, /* cost of unaligned stores. */ 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ 14, /* cost of moving SSE register to integer. */ + 14, /* cost of moving integer register to SSE. */ 10, 10, /* Gather load static, per_elt. */ 10, 10, /* Gather store static, per_elt. */ 32, /* size of l1 cache. */ @@ -2881,6 +2902,7 @@ const struct processor_costs btver2_cost = { {10, 10, 12, 48, 96}, /* cost of unaligned stores. */ 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ 14, /* cost of moving SSE register to integer. */ + 14, /* cost of moving integer register to SSE. */ 10, 10, /* Gather load static, per_elt. */ 10, 10, /* Gather store static, per_elt. */ 32, /* size of l1 cache. */ @@ -3000,6 +3022,7 @@ struct processor_costs pentium4_cost = { {32, 32, 32, 64, 128}, /* cost of unaligned stores. */ 12, 24, 48, /* cost of moving XMM,YMM,ZMM register */ 20, /* cost of moving SSE register to integer. */ + 20, /* cost of moving integer register to SSE. */ 16, 16, /* Gather load static, per_elt. */ 16, 16, /* Gather store static, per_elt. */ 8, /* size of l1 cache. */ @@ -3122,6 +3145,7 @@ struct processor_costs nocona_cost = { {24, 24, 24, 48, 96}, /* cost of unaligned stores. */ 6, 12, 24, /* cost of moving XMM,YMM,ZMM register */ 20, /* cost of moving SSE register to integer. */ + 20, /* cost of moving integer register to SSE. */ 12, 12, /* Gather load static, per_elt. */ 12, 12, /* Gather store static, per_elt. */ 8, /* size of l1 cache. 
*/ @@ -3242,6 +3266,7 @@ struct processor_costs atom_cost = { {16, 16, 16, 32, 64}, /* cost of unaligned stores. */ 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ 8, /* cost of moving SSE register to integer. */ + 8, /* cost of moving integer register to SSE. */ 8, 8, /* Gather load static, per_elt. */ 8, 8, /* Gather store static, per_elt. */ 32, /* size of l1 cache. */ @@ -3362,6 +3387,7 @@ struct processor_costs slm_cost = { {16, 16, 16, 32, 64}, /* cost of unaligned stores. */ 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */ 8, /* cost of moving SSE register to integer. */ + 8, /* cost of moving integer register to SSE. */ 8, 8, /* Gather load static, per_elt. */ 8, 8, /* Gather store static, per_elt. */ 32, /* size of l1 cache. */ @@ -3494,6 +3520,7 @@ struct processor_costs tremont_cost = { {6, 6, 6, 10, 15}, /* cost of unaligned storess. */ 2, 3, 4, /* cost of moving XMM,YMM,ZMM register */ 6, /* cost of moving SSE register to integer. */ + 6, /* cost of moving integer register to SSE. */ 18, 6, /* Gather load static, per_elt. */ 18, 6, /* Gather store static, per_elt. */ 32, /* size of l1 cache. */ @@ -3616,6 +3643,7 @@ struct processor_costs intel_cost = { {10, 10, 10, 10, 10}, /* cost of unaligned loads. */ 2, 2, 2, /* cost of moving XMM,YMM,ZMM register */ 4, /* cost of moving SSE register to integer. */ + 4, /* cost of moving integer register to SSE. */ 6, 6, /* Gather load static, per_elt. */ 6, 6, /* Gather store static, per_elt. */ 32, /* size of l1 cache. */ @@ -3731,15 +3759,16 @@ struct processor_costs lujiazui_cost = { {6, 6, 6}, /* cost of loading integer registers in QImode, HImode and SImode. Relative to reg-reg move (2). */ - {6, 6, 6}, /* cost of storing integer registers. */ + {6, 6, 6}, /* cost of storing integer registers. */ {6, 6, 6, 10, 15}, /* cost of loading SSE register - in 32bit, 64bit, 128bit, 256bit and 512bit. */ + in 32bit, 64bit, 128bit, 256bit and 512bit. */ {6, 6, 6, 10, 15}, /* cost of storing SSE register - in 32bit, 64bit, 128bit, 256bit and 512bit. */ + in 32bit, 64bit, 128bit, 256bit and 512bit. */ {6, 6, 6, 10, 15}, /* cost of unaligned loads. */ {6, 6, 6, 10, 15}, /* cost of unaligned storess. */ - 2, 3, 4, /* cost of moving XMM,YMM,ZMM register. */ - 6, /* cost of moving SSE register to integer. */ + 2, 3, 4, /* cost of moving XMM,YMM,ZMM register. */ + 6, /* cost of moving SSE register to integer. */ + 6, /* cost of moving integer register to SSE. */ 18, 6, /* Gather load static, per_elt. */ 18, 6, /* Gather store static, per_elt. */ 32, /* size of l1 cache. */ @@ -3864,6 +3893,7 @@ struct processor_costs yongfeng_cost = { {8, 8, 8, 12, 15}, /* cost of unaligned storess. */ 2, 3, 4, /* cost of moving XMM,YMM,ZMM register. */ 8, /* cost of moving SSE register to integer. */ + 8, /* cost of moving integer register to SSE. */ 18, 6, /* Gather load static, per_elt. */ 18, 6, /* Gather store static, per_elt. */ 32, /* size of l1 cache. */ @@ -3987,6 +4017,7 @@ struct processor_costs shijidadao_cost = { {8, 8, 8, 12, 15}, /* cost of unaligned storess. */ 2, 3, 4, /* cost of moving XMM,YMM,ZMM register. */ 8, /* cost of moving SSE register to integer. */ + 8, /* cost of moving integer register to SSE. */ 18, 6, /* Gather load static, per_elt. */ 18, 6, /* Gather store static, per_elt. */ 32, /* size of l1 cache. */ @@ -4116,6 +4147,7 @@ struct processor_costs generic_cost = { {6, 6, 6, 10, 15}, /* cost of unaligned storess. */ 2, 3, 4, /* cost of moving XMM,YMM,ZMM register */ 6, /* cost of moving SSE register to integer. 
*/ + 6, /* cost of moving integer register to SSE. */ 18, 6, /* Gather load static, per_elt. */ 18, 6, /* Gather store static, per_elt. */ 32, /* size of l1 cache. */ @@ -4249,6 +4281,7 @@ struct processor_costs core_cost = { {6, 6, 6, 6, 12}, /* cost of unaligned stores. */ 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */ 2, /* cost of moving SSE register to integer. */ + 2, /* cost of moving integer register to SSE. */ /* VGATHERDPD is 7 uops, rec throughput 5, while VGATHERDPD is 9 uops, rec. throughput 6. So 5 uops statically and one uops per load. */ diff --git a/gcc/testsuite/gcc.target/i386/minmax-6.c b/gcc/testsuite/gcc.target/i386/minmax-6.c index 615f919ba0a..23f61c52d80 100644 --- a/gcc/testsuite/gcc.target/i386/minmax-6.c +++ b/gcc/testsuite/gcc.target/i386/minmax-6.c @@ -15,4 +15,4 @@ UMVLine16Y_11 (short unsigned int * Pic, int y, int width) /* We do not want the RA to spill %esi for it's dual-use but using pmaxsd is OK. */ /* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */ -/* { dg-final { scan-assembler "pmaxsd" } } */ +/* { dg-final { scan-assembler "pmaxsd" { xfail *-*-* } } } */ diff --git a/gcc/testsuite/gcc.target/i386/minmax-7.c b/gcc/testsuite/gcc.target/i386/minmax-7.c index 619a93946c7..b2cb1c24d7e 100644 --- a/gcc/testsuite/gcc.target/i386/minmax-7.c +++ b/gcc/testsuite/gcc.target/i386/minmax-7.c @@ -17,4 +17,4 @@ void bar (int aleft, int axcenter) /* We do not want the RA to spill %esi for it's dual-use but using pminsd is OK. */ /* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */ -/* { dg-final { scan-assembler "pminsd" } } */ +/* { dg-final { scan-assembler "pminsd" { xfail *-*-* } } } */