On Sun, May 11, 2025 at 4:28 AM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> Hi,
> this patch fixes some of the problems with costing in the scalar to vector
> pass.  In particular
This caused: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120215

> 1) the pass uses optimize_insn_for_size_p, which is intended to be used by
>    expanders and splitters and requires the optimization pass to use
>    set_rtl_profile (bb) for the currently processed bb.
>    This is not done, so we get random stale info about the hotness of insns.
> 2) register allocator move costs are all relative to the integer reg-reg
>    move, which has a cost of 2, so they are (except for the size tables and
>    i386) the instruction latency multiplied by 2.
>    These costs have been duplicated and are now used in combination with
>    rtx costs, which are all based on COSTS_N_INSNS, i.e. latency multiplied
>    by 4.
>    Some of the vectorizer costing contains COSTS_N_INSNS (move_cost) / 2
>    to compensate, but some new code does not.  This patch adds the missing
>    compensation.
>
>    Perhaps we should update the cost tables to use COSTS_N_INSNS everywhere,
>    but I think we want to fix the inconsistencies first.  Also the tables
>    would get visually much longer, since we have many move costs and
>    COSTS_N_INSNS is a lot of characters.
> 3) the variable m, which decides how much to multiply the integer variant
>    (to account for the fact that with -m32 all 64-bit computations need 2
>    instructions), is declared unsigned, which makes the signed computation
>    of the instruction gain happen in an unsigned type and breaks e.g. for
>    division.
> 4) I added integer_to_sse costs, which are currently all duplicates of
>    sse_to_integer.  AMD chips are asymmetric and moving in one direction is
>    faster than in the other.  I will change the costs incrementally once
>    the vectorizer part is fixed up, too.
>
> There are two failures, gcc.target/i386/minmax-6.c and
> gcc.target/i386/minmax-7.c.  Both test STV on Haswell; the conversion no
> longer happens since SSE->INT and INT->SSE moves are now more expensive.
>
> There is only one instruction to convert:
>
>   Computing gain for chain #1...
>     Instruction gain 8 for    11: {r110:SI=smax(r116:SI,0);clobber flags:CC;}
>     Instruction conversion gain: 8
>     Registers conversion cost: 8   <- this is integer_to_sse and sse_to_integer
>     Total gain: 0
>
> The total gain used to be 4, since the patch doubles the conversion costs.
> According to Agner Fog's tables the cost should be 1 cycle, which is correct
> here.
>
> The final code generated is:
>
>         vmovd   %esi, %xmm0            * latency 1
>         cmpl    %edx, %esi
>         je      .L2
>         vpxor   %xmm1, %xmm1, %xmm1    * latency 1
>         vpmaxsd %xmm1, %xmm0, %xmm0    * latency 1
>         vmovd   %xmm0, %eax            * latency 1
>         imull   %edx, %eax
>         cltq
>         movzwl  (%rdi,%rax,2), %eax
>         ret
>
>         cmpl    %edx, %esi
>         je      .L2
>         xorl    %eax, %eax             * latency 1
>         testl   %esi, %esi             * latency 1
>         cmovs   %eax, %esi             * latency 2
>         imull   %edx, %esi
>         movslq  %esi, %rsi
>         movzwl  (%rdi,%rsi,2), %eax
>         ret
>
> The instructions annotated with latency info are the ones that actually
> differ.  The unconverted code has a sum of latencies of 4 and a critical
> path latency of 3; the converted code likewise has a sum of latencies of 4
> and a critical path latency of 3 (vmovd+vpmaxsd+vmovd).  So I do not quite
> see why it should be a win.
>
> There is also a bug in costing MIN/MAX:
>
>       case ABS:
>       case SMAX:
>       case SMIN:
>       case UMAX:
>       case UMIN:
>         /* We do not have any conditional move cost, estimate it as a
>            reg-reg move.  Comparisons are costed as adds.  */
>         igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
>         /* Integer SSE ops are all costed the same.  */
>         igain -= ix86_cost->sse_op;
>         break;
>
> Now COSTS_N_INSNS (2) is not quite right, since a reg-reg move should cost
> 1 or perhaps 0.  For Haswell, cmov really is 2 cycles, but I guess we want
> to have that in the cost vectors like all other instructions.
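Two notes to make the costing points above concrete.  On point 2: the RA
move tables store twice the latency (the integer reg-reg move is 2), while
COSTS_N_INSNS scales latency by 4, so COSTS_N_INSNS (move_cost) / 2 converts
a move-table entry to the rtx-cost scale; e.g. a latency-1 move has table
cost 2 and COSTS_N_INSNS (2) / 2 == COSTS_N_INSNS (1).  On point 3, the
following minimal sketch (made-up cost values, not the pass's actual code)
shows how an unsigned m breaks the signed gain computation once a division
is involved:

  /* Illustration only: an unsigned multiplier forces the whole gain
     expression into unsigned arithmetic.  */
  #include <stdio.h>

  int
  main (void)
  {
    unsigned m_old = 2;           /* declared unsigned before the patch  */
    int m_new = 2;                /* declared int after the patch        */
    int scalar = 4, vector = 20;  /* hypothetical per-insn costs         */

    /* 2*4 - 20 wraps to 4294967284u; the division then yields
       2147483642 instead of a negative gain.  */
    int gain_old = (m_old * scalar - vector) / 2;
    /* Signed arithmetic gives the intended -6.  */
    int gain_new = (m_new * scalar - vector) / 2;

    printf ("unsigned m: %d\nsigned m:   %d\n", gain_old, gain_new);
    return 0;
  }

This prints 2147483642 for the unsigned variant and -6 for the signed one,
which is why the patch changes "unsigned m" to "int m".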
>
> I am not sure this is really a win in this case (the other minmax testcases
> seem to make sense).  I have xfailed it for now and will check whether it
> affects SPEC on the LNT testers.
>
> Bootstrapped/regtested x86_64-linux, committed.
>
> I will proceed with similar fixes on the vectorizer cost side.  Sadly those
> introduce quite a few differences in the testsuite (partly triggered by
> other costing problems, such as the one with scatter/gather).
>
> gcc/ChangeLog:
>
>         * config/i386/i386-features.cc
>         (general_scalar_chain::vector_const_cost): Add BB parameter; handle
>         size costs; use COSTS_N_INSNS to compute move costs.
>         (general_scalar_chain::compute_convert_gain): Use
>         optimize_bb_for_size_p instead of optimize_insn_for_size_p; use
>         COSTS_N_INSNS to compute move costs; update calls of
>         general_scalar_chain::vector_const_cost; use
>         ix86_cost->integer_to_sse.
>         (timode_immed_const_gain): Add bb parameter; use
>         optimize_bb_for_size_p.
>         (timode_scalar_chain::compute_convert_gain): Use
>         optimize_bb_for_size_p.
>         * config/i386/i386-features.h (class general_scalar_chain): Update
>         prototype of vector_const_cost.
>         * config/i386/i386.h (struct processor_costs): Add integer_to_sse.
>         * config/i386/x86-tune-costs.h (struct processor_costs): Copy
>         sse_to_integer to integer_to_sse everywhere.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/i386/minmax-6.c: xfail test that pmax is used.
>         * gcc.target/i386/minmax-7.c: xfail test that pmin is used.
>
> diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
> index 1ba5ac4faa4..54b3f6d33b2 100644
> --- a/gcc/config/i386/i386-features.cc
> +++ b/gcc/config/i386/i386-features.cc
> @@ -518,15 +518,17 @@ scalar_chain::build (bitmap candidates, unsigned insn_uid, bitmap disallowed)
>     instead of using a scalar one.  */
>
>  int
> -general_scalar_chain::vector_const_cost (rtx exp)
> +general_scalar_chain::vector_const_cost (rtx exp, basic_block bb)
>  {
>    gcc_assert (CONST_INT_P (exp));
>
>    if (standard_sse_constant_p (exp, vmode))
>      return ix86_cost->sse_op;
> +  if (optimize_bb_for_size_p (bb))
> +    return COSTS_N_BYTES (8);
>    /* We have separate costs for SImode and DImode, use SImode costs
>       for smaller modes.  */
> -  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
> +  return COSTS_N_INSNS (ix86_cost->sse_load[smode == DImode ? 1 : 0]) / 2;
>  }
>
>  /* Compute a gain for chain conversion.  */
> @@ -547,7 +549,7 @@ general_scalar_chain::compute_convert_gain ()
>       smaller modes than SImode the int load/store costs need to be
>       adjusted as well.  */
>    unsigned sse_cost_idx = smode == DImode ? 1 : 0;
> -  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
> +  int m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
>
>    EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
>      {
> @@ -555,26 +557,55 @@ general_scalar_chain::compute_convert_gain ()
>        rtx def_set = single_set (insn);
>        rtx src = SET_SRC (def_set);
>        rtx dst = SET_DEST (def_set);
> +      basic_block bb = BLOCK_FOR_INSN (insn);
>        int igain = 0;
>
>        if (REG_P (src) && REG_P (dst))
> -        igain += 2 * m - ix86_cost->xmm_move;
> +        {
> +          if (optimize_bb_for_size_p (bb))
> +            /* reg-reg move is 2 bytes, while SSE 3.  */
> +            igain += COSTS_N_BYTES (2 * m - 3);
> +          else
> +            /* Move costs are normalized to reg-reg move having cost 2.  */
> +            igain += COSTS_N_INSNS (2 * m - ix86_cost->xmm_move) / 2;
> +        }
>        else if (REG_P (src) && MEM_P (dst))
> -        igain
> -          += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
> +        {
> +          if (optimize_bb_for_size_p (bb))
> +            /* Integer load/store is 3+ bytes and SSE 4+.  */
> +            igain += COSTS_N_BYTES (3 * m - 4);
> +          else
> +            igain
> +              += COSTS_N_INSNS (m * ix86_cost->int_store[2]
> +                                - ix86_cost->sse_store[sse_cost_idx]) / 2;
> +        }
>        else if (MEM_P (src) && REG_P (dst))
> -        igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
> +        {
> +          if (optimize_bb_for_size_p (bb))
> +            igain += COSTS_N_BYTES (3 * m - 4);
> +          else
> +            igain += COSTS_N_INSNS (m * ix86_cost->int_load[2]
> +                                    - ix86_cost->sse_load[sse_cost_idx]) / 2;
> +        }
>        else
>          {
>            /* For operations on memory operands, include the overhead
>               of explicit load and store instructions.  */
>            if (MEM_P (dst))
> -            igain += optimize_insn_for_size_p ()
> -                     ? -COSTS_N_BYTES (8)
> -                     : (m * (ix86_cost->int_load[2]
> -                             + ix86_cost->int_store[2])
> -                        - (ix86_cost->sse_load[sse_cost_idx] +
> -                           ix86_cost->sse_store[sse_cost_idx]));
> +            {
> +              if (optimize_bb_for_size_p (bb))
> +                /* ??? This probably should account size difference
> +                   of SSE and integer load rather than full SSE load.  */
> +                igain -= COSTS_N_BYTES (8);
> +              else
> +                {
> +                  int cost = (m * (ix86_cost->int_load[2]
> +                                   + ix86_cost->int_store[2])
> +                              - (ix86_cost->sse_load[sse_cost_idx] +
> +                                 ix86_cost->sse_store[sse_cost_idx]));
> +                  igain += COSTS_N_INSNS (cost) / 2;
> +                }
> +            }
>
>            switch (GET_CODE (src))
>              {
> @@ -595,7 +626,7 @@ general_scalar_chain::compute_convert_gain ()
>                  igain += ix86_cost->shift_const - ix86_cost->sse_op;
>
>                if (CONST_INT_P (XEXP (src, 0)))
> -                igain -= vector_const_cost (XEXP (src, 0));
> +                igain -= vector_const_cost (XEXP (src, 0), bb);
>                break;
>
>              case ROTATE:
> @@ -631,16 +662,17 @@ general_scalar_chain::compute_convert_gain ()
>                  igain += m * ix86_cost->add;
>
>                if (CONST_INT_P (XEXP (src, 0)))
> -                igain -= vector_const_cost (XEXP (src, 0));
> +                igain -= vector_const_cost (XEXP (src, 0), bb);
>                if (CONST_INT_P (XEXP (src, 1)))
> -                igain -= vector_const_cost (XEXP (src, 1));
> +                igain -= vector_const_cost (XEXP (src, 1), bb);
>                if (MEM_P (XEXP (src, 1)))
>                  {
> -                  if (optimize_insn_for_size_p ())
> +                  if (optimize_bb_for_size_p (bb))
>                      igain -= COSTS_N_BYTES (m == 2 ? 3 : 5);
>                    else
> -                    igain += m * ix86_cost->int_load[2]
> -                             - ix86_cost->sse_load[sse_cost_idx];
> +                    igain += COSTS_N_INSNS
> +                               (m * ix86_cost->int_load[2]
> +                                - ix86_cost->sse_load[sse_cost_idx]) / 2;
>                  }
>                break;
>
> @@ -698,7 +730,7 @@
>              case CONST_INT:
>                if (REG_P (dst))
>                  {
> -                  if (optimize_insn_for_size_p ())
> +                  if (optimize_bb_for_size_p (bb))
>                      {
>                        /* xor (2 bytes) vs. xorps (3 bytes).  */
>                        if (src == const0_rtx)
> @@ -722,14 +754,14 @@
>                        /* DImode can be immediate for TARGET_64BIT
>                           and SImode always.  */
>                        igain += m * COSTS_N_INSNS (1);
> -                      igain -= vector_const_cost (src);
> +                      igain -= vector_const_cost (src, bb);
>                      }
>                  }
>                else if (MEM_P (dst))
>                  {
>                    igain += (m * ix86_cost->int_store[2]
>                              - ix86_cost->sse_store[sse_cost_idx]);
> -                  igain -= vector_const_cost (src);
> +                  igain -= vector_const_cost (src, bb);
>                  }
>                break;
>
> @@ -737,13 +769,14 @@
>                if (XVECEXP (XEXP (src, 1), 0, 0) == const0_rtx)
>                  {
>                    // movd (4 bytes) replaced with movdqa (4 bytes).
> -                  if (!optimize_insn_for_size_p ())
> -                    igain += ix86_cost->sse_to_integer - ix86_cost->xmm_move;
> +                  if (!optimize_bb_for_size_p (bb))
> +                    igain += COSTS_N_INSNS (ix86_cost->sse_to_integer
> +                                            - ix86_cost->xmm_move) / 2;
>                  }
>                else
>                  {
>                    // pshufd; movd replaced with pshufd.
> -                  if (optimize_insn_for_size_p ())
> +                  if (optimize_bb_for_size_p (bb))
>                      igain += COSTS_N_BYTES (4);
>                    else
>                      igain += ix86_cost->sse_to_integer;
> @@ -769,11 +802,11 @@
>    /* Cost the integer to sse and sse to integer moves.  */
>    if (!optimize_function_for_size_p (cfun))
>      {
> -      cost += n_sse_to_integer * ix86_cost->sse_to_integer;
> +      cost += n_sse_to_integer * COSTS_N_INSNS (ix86_cost->sse_to_integer) / 2;
>        /* ??? integer_to_sse but we only have that in the RA cost table.
>           Assume sse_to_integer/integer_to_sse are the same which they
>           are at the moment.  */
> -      cost += n_integer_to_sse * ix86_cost->sse_to_integer;
> +      cost += n_integer_to_sse * COSTS_N_INSNS (ix86_cost->integer_to_sse) / 2;
>      }
>    else if (TARGET_64BIT || smode == SImode)
>      {
> @@ -1508,13 +1541,13 @@ general_scalar_chain::convert_insn (rtx_insn *insn)
>     with numerous special cases.  */
>
>  static int
> -timode_immed_const_gain (rtx cst)
> +timode_immed_const_gain (rtx cst, basic_block bb)
>  {
>    /* movabsq vs. movabsq+vmovq+vunpacklqdq.  */
>    if (CONST_WIDE_INT_P (cst)
>        && CONST_WIDE_INT_NUNITS (cst) == 2
>        && CONST_WIDE_INT_ELT (cst, 0) == CONST_WIDE_INT_ELT (cst, 1))
> -    return optimize_insn_for_size_p () ? -COSTS_N_BYTES (9)
> +    return optimize_bb_for_size_p (bb) ? -COSTS_N_BYTES (9)
>                                         : -COSTS_N_INSNS (2);
>    /* 2x movabsq ~ vmovdqa.  */
>    return 0;
> @@ -1546,33 +1579,34 @@ timode_scalar_chain::compute_convert_gain ()
>        rtx src = SET_SRC (def_set);
>        rtx dst = SET_DEST (def_set);
>        HOST_WIDE_INT op1val;
> +      basic_block bb = BLOCK_FOR_INSN (insn);
>        int scost, vcost;
>        int igain = 0;
>
>        switch (GET_CODE (src))
>          {
>          case REG:
> -          if (optimize_insn_for_size_p ())
> +          if (optimize_bb_for_size_p (bb))
>              igain = MEM_P (dst) ? COSTS_N_BYTES (6) : COSTS_N_BYTES (3);
>            else
>              igain = COSTS_N_INSNS (1);
>            break;
>
>          case MEM:
> -          igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (7)
> +          igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (7)
>                                                : COSTS_N_INSNS (1);
>            break;
>
>          case CONST_INT:
>            if (MEM_P (dst)
>                && standard_sse_constant_p (src, V1TImode))
> -            igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (11) : 1;
> +            igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (11) : 1;
>            break;
>
>          case CONST_WIDE_INT:
>            /* 2 x mov vs. vmovdqa.  */
>            if (MEM_P (dst))
> -            igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (3)
> +            igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (3)
>                                                  : COSTS_N_INSNS (1);
>            break;
>
> @@ -1587,14 +1621,14 @@
>            if (!MEM_P (dst))
>              igain = COSTS_N_INSNS (1);
>            if (CONST_SCALAR_INT_P (XEXP (src, 1)))
> -            igain += timode_immed_const_gain (XEXP (src, 1));
> +            igain += timode_immed_const_gain (XEXP (src, 1), bb);
>            break;
>
>          case ASHIFT:
>          case LSHIFTRT:
>            /* See ix86_expand_v1ti_shift.  */
>            op1val = INTVAL (XEXP (src, 1));
> -          if (optimize_insn_for_size_p ())
> +          if (optimize_bb_for_size_p (bb))
>              {
>                if (op1val == 64 || op1val == 65)
>                  scost = COSTS_N_BYTES (5);
> @@ -1628,7 +1662,7 @@
>          case ASHIFTRT:
>            /* See ix86_expand_v1ti_ashiftrt.  */
>            op1val = INTVAL (XEXP (src, 1));
> -          if (optimize_insn_for_size_p ())
> +          if (optimize_bb_for_size_p (bb))
>              {
>                if (op1val == 64 || op1val == 127)
>                  scost = COSTS_N_BYTES (7);
> @@ -1706,7 +1740,7 @@
>          case ROTATERT:
>            /* See ix86_expand_v1ti_rotate.  */
>            op1val = INTVAL (XEXP (src, 1));
> -          if (optimize_insn_for_size_p ())
> +          if (optimize_bb_for_size_p (bb))
>              {
>                scost = COSTS_N_BYTES (13);
>                if ((op1val & 31) == 0)
> @@ -1738,16 +1772,16 @@
>              {
>                if (GET_CODE (XEXP (src, 0)) == AND)
>                  /* and;and;or (9 bytes) vs. ptest (5 bytes).  */
> -                igain = optimize_insn_for_size_p() ? COSTS_N_BYTES (4)
> -                                                   : COSTS_N_INSNS (2);
> +                igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (4)
> +                                                    : COSTS_N_INSNS (2);
>                /* or (3 bytes) vs. ptest (5 bytes).  */
> -              else if (optimize_insn_for_size_p ())
> +              else if (optimize_bb_for_size_p (bb))
>                  igain = -COSTS_N_BYTES (2);
>              }
>            else if (XEXP (src, 1) == const1_rtx)
>              /* and;cmp -1 (7 bytes) vs. pcmpeqd;pxor;ptest (13 bytes).  */
> -            igain = optimize_insn_for_size_p() ? -COSTS_N_BYTES (6)
> -                                               : -COSTS_N_INSNS (1);
> +            igain = optimize_bb_for_size_p (bb) ? -COSTS_N_BYTES (6)
> +                                                : -COSTS_N_INSNS (1);
>            break;
>
>          default:
> diff --git a/gcc/config/i386/i386-features.h b/gcc/config/i386/i386-features.h
> index 24b0c4ed0cd..7f7c0f78c96 100644
> --- a/gcc/config/i386/i386-features.h
> +++ b/gcc/config/i386/i386-features.h
> @@ -188,7 +188,7 @@ class general_scalar_chain : public scalar_chain
>
>   private:
>    void convert_insn (rtx_insn *insn) final override;
> -  int vector_const_cost (rtx exp);
> +  int vector_const_cost (rtx exp, basic_block bb);
>    rtx convert_rotate (enum rtx_code, rtx op0, rtx op1, rtx_insn *insn);
>  };
>
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index 6a38de30de4..18fa97a9eb0 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -179,6 +179,7 @@ struct processor_costs {
>    const int xmm_move, ymm_move,      /* cost of moving XMM and YMM register.  */
>              zmm_move;
>    const int sse_to_integer;          /* cost of moving SSE register to integer.  */
> +  const int integer_to_sse;          /* cost of moving integer register to SSE.  */
>    const int gather_static, gather_per_elt; /* Cost of gather load is computed
>                                    as static + per_item * nelts.  */
>    const int scatter_static, scatter_per_elt; /* Cost of gather store is
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index 6cce70a6c40..e5091293509 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -107,6 +107,7 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
>                                        in 128bit, 256bit and 512bit */
>    4, 4, 6,                          /* cost of moving XMM,YMM,ZMM register */
>    4,                                /* cost of moving SSE register to integer.  */
> +  4,                                /* cost of moving integer register to SSE.  */
>    COSTS_N_BYTES (5), 0,             /* Gather load static, per_elt.  */
>    COSTS_N_BYTES (5), 0,             /* Gather store static, per_elt.  */
>    0,                                /* size of l1 cache  */
> @@ -227,6 +228,7 @@ struct processor_costs i386_cost = {  /* 386 specific costs */
>    {4, 8, 16, 32, 64},               /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    3,                                /* cost of moving SSE register to integer.  */
> +  3,                                /* cost of moving integer register to SSE.  */
>    4, 4,                             /* Gather load static, per_elt.  */
>    4, 4,                             /* Gather store static, per_elt.  */
>    0,                                /* size of l1 cache  */
> @@ -345,6 +347,7 @@ struct processor_costs i486_cost = {  /* 486 specific costs */
>    {4, 8, 16, 32, 64},               /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    3,                                /* cost of moving SSE register to integer.  */
> +  3,                                /* cost of moving integer register to SSE.  */
>    4, 4,                             /* Gather load static, per_elt.  */
>    4, 4,                             /* Gather store static, per_elt.  */
>    4,                                /* size of l1 cache.  486 has 8kB cache
> @@ -465,6 +468,7 @@ struct processor_costs pentium_cost = {
>    {4, 8, 16, 32, 64},               /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    3,                                /* cost of moving SSE register to integer.  */
> +  3,                                /* cost of moving integer register to SSE.  */
>    4, 4,                             /* Gather load static, per_elt.  */
>    4, 4,                             /* Gather store static, per_elt.  */
>    8,                                /* size of l1 cache.  */
> @@ -576,6 +580,7 @@ struct processor_costs lakemont_cost = {
>    {4, 8, 16, 32, 64},               /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    3,                                /* cost of moving SSE register to integer.  */
> +  3,                                /* cost of moving integer register to SSE.  */
>    4, 4,                             /* Gather load static, per_elt.  */
>    4, 4,                             /* Gather store static, per_elt.  */
>    8,                                /* size of l1 cache.  */
> @@ -702,6 +707,7 @@ struct processor_costs pentiumpro_cost = {
>    {4, 8, 16, 32, 64},               /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    3,                                /* cost of moving SSE register to integer.  */
> +  3,                                /* cost of moving integer register to SSE.  */
>    4, 4,                             /* Gather load static, per_elt.  */
>    4, 4,                             /* Gather store static, per_elt.  */
>    8,                                /* size of l1 cache.  */
> @@ -819,6 +825,7 @@ struct processor_costs geode_cost = {
>    {2, 2, 8, 16, 32},                /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    2, 2,                             /* Gather load static, per_elt.  */
>    2, 2,                             /* Gather store static, per_elt.  */
>    64,                               /* size of l1 cache.  */
> @@ -936,6 +943,7 @@ struct processor_costs k6_cost = {
>    {2, 2, 8, 16, 32},                /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    2, 2,                             /* Gather load static, per_elt.  */
>    2, 2,                             /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -1059,6 +1067,7 @@ struct processor_costs athlon_cost = {
>    {4, 4, 10, 10, 20},               /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    5,                                /* cost of moving SSE register to integer.  */
> +  5,                                /* cost of moving integer register to SSE.  */
>    4, 4,                             /* Gather load static, per_elt.  */
>    4, 4,                             /* Gather store static, per_elt.  */
>    64,                               /* size of l1 cache.  */
> @@ -1184,6 +1193,7 @@ struct processor_costs k8_cost = {
>    {4, 4, 10, 10, 20},               /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    5,                                /* cost of moving SSE register to integer.  */
> +  5,                                /* cost of moving integer register to SSE.  */
>    4, 4,                             /* Gather load static, per_elt.  */
>    4, 4,                             /* Gather store static, per_elt.  */
>    64,                               /* size of l1 cache.  */
> @@ -1322,6 +1332,7 @@ struct processor_costs amdfam10_cost = {
>    {4, 4, 5, 10, 20},                /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    3,                                /* cost of moving SSE register to integer.  */
> +  3,                                /* cost of moving integer register to SSE.  */
>    4, 4,                             /* Gather load static, per_elt.  */
>    4, 4,                             /* Gather store static, per_elt.  */
>    64,                               /* size of l1 cache.  */
> @@ -1452,6 +1463,7 @@ const struct processor_costs bdver_cost = {
>    {10, 10, 10, 40, 60},             /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    16,                               /* cost of moving SSE register to integer.  */
> +  16,                               /* cost of moving integer register to SSE.  */
>    12, 12,                           /* Gather load static, per_elt.  */
>    10, 10,                           /* Gather store static, per_elt.  */
>    16,                               /* size of l1 cache.  */
> @@ -1603,6 +1615,7 @@ struct processor_costs znver1_cost = {
>    {8, 8, 8, 16, 32},                /* cost of unaligned stores.  */
>    2, 3, 6,                          /* cost of moving XMM,YMM,ZMM register.  */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
>       throughput 12.  Approx 9 uops do not depend on vector size and every load
>       is 7 uops.  */
> @@ -1770,6 +1783,7 @@ struct processor_costs znver2_cost = {
>    2, 2, 3,                          /* cost of moving XMM,YMM,ZMM register.  */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
>       throughput 12.  Approx 9 uops do not depend on vector size and every load
>       is 7 uops.  */
> @@ -1912,6 +1926,7 @@ struct processor_costs znver3_cost = {
>    2, 2, 3,                          /* cost of moving XMM,YMM,ZMM register.  */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    /* VGATHERDPD is 15 uops and throughput is 4, VGATHERDPS is 23 uops,
>       throughput 9.  Approx 7 uops do not depend on vector size and every load
>       is 4 uops.  */
> @@ -2056,6 +2071,7 @@ struct processor_costs znver4_cost = {
>    2, 2, 2,                          /* cost of moving XMM,YMM,ZMM register.  */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    /* VGATHERDPD is 17 uops and throughput is 4, VGATHERDPS is 24 uops,
>       throughput 5.  Approx 7 uops do not depend on vector size and every load
>       is 5 uops.  */
> @@ -2204,6 +2220,7 @@ struct processor_costs znver5_cost = {
>    2, 2, 2,                          /* cost of moving XMM,YMM,ZMM register.  */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>
>    /* TODO: gather and scatter instructions are currently disabled in
>       x86-tune.def.  In some cases they are however a win, see PR116582
> @@ -2372,6 +2389,7 @@ struct processor_costs skylake_cost = {
>    {8, 8, 8, 8, 16},                 /* cost of unaligned stores.  */
>    2, 2, 4,                          /* cost of moving XMM,YMM,ZMM register */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    20, 8,                            /* Gather load static, per_elt.  */
>    22, 10,                           /* Gather store static, per_elt.  */
>    64,                               /* size of l1 cache.  */
> @@ -2508,6 +2526,7 @@ struct processor_costs icelake_cost = {
>    {8, 8, 8, 8, 16},                 /* cost of unaligned stores.  */
>    2, 2, 4,                          /* cost of moving XMM,YMM,ZMM register */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    20, 8,                            /* Gather load static, per_elt.  */
>    22, 10,                           /* Gather store static, per_elt.  */
>    64,                               /* size of l1 cache.  */
> @@ -2638,6 +2657,7 @@ struct processor_costs alderlake_cost = {
>    {8, 8, 8, 10, 15},                /* cost of unaligned storess.  */
>    2, 3, 4,                          /* cost of moving XMM,YMM,ZMM register */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    18, 6,                            /* Gather load static, per_elt.  */
>    18, 6,                            /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -2761,6 +2781,7 @@ const struct processor_costs btver1_cost = {
>    {10, 10, 12, 48, 96},             /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    14,                               /* cost of moving SSE register to integer.  */
> +  14,                               /* cost of moving integer register to SSE.  */
>    10, 10,                           /* Gather load static, per_elt.  */
>    10, 10,                           /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -2881,6 +2902,7 @@ const struct processor_costs btver2_cost = {
>    {10, 10, 12, 48, 96},             /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    14,                               /* cost of moving SSE register to integer.  */
> +  14,                               /* cost of moving integer register to SSE.  */
>    10, 10,                           /* Gather load static, per_elt.  */
>    10, 10,                           /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -3000,6 +3022,7 @@ struct processor_costs pentium4_cost = {
>    {32, 32, 32, 64, 128},            /* cost of unaligned stores.  */
>    12, 24, 48,                       /* cost of moving XMM,YMM,ZMM register */
>    20,                               /* cost of moving SSE register to integer.  */
> +  20,                               /* cost of moving integer register to SSE.  */
>    16, 16,                           /* Gather load static, per_elt.  */
>    16, 16,                           /* Gather store static, per_elt.  */
>    8,                                /* size of l1 cache.  */
> @@ -3122,6 +3145,7 @@ struct processor_costs nocona_cost = {
>    {24, 24, 24, 48, 96},             /* cost of unaligned stores.  */
>    6, 12, 24,                        /* cost of moving XMM,YMM,ZMM register */
>    20,                               /* cost of moving SSE register to integer.  */
> +  20,                               /* cost of moving integer register to SSE.  */
>    12, 12,                           /* Gather load static, per_elt.  */
>    12, 12,                           /* Gather store static, per_elt.  */
>    8,                                /* size of l1 cache.  */
> @@ -3242,6 +3266,7 @@ struct processor_costs atom_cost = {
>    {16, 16, 16, 32, 64},             /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    8,                                /* cost of moving SSE register to integer.  */
> +  8,                                /* cost of moving integer register to SSE.  */
>    8, 8,                             /* Gather load static, per_elt.  */
>    8, 8,                             /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -3362,6 +3387,7 @@ struct processor_costs slm_cost = {
>    {16, 16, 16, 32, 64},             /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    8,                                /* cost of moving SSE register to integer.  */
> +  8,                                /* cost of moving integer register to SSE.  */
>    8, 8,                             /* Gather load static, per_elt.  */
>    8, 8,                             /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -3494,6 +3520,7 @@ struct processor_costs tremont_cost = {
>    {6, 6, 6, 10, 15},                /* cost of unaligned storess.  */
>    2, 3, 4,                          /* cost of moving XMM,YMM,ZMM register */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    18, 6,                            /* Gather load static, per_elt.  */
>    18, 6,                            /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -3616,6 +3643,7 @@ struct processor_costs intel_cost = {
>    {10, 10, 10, 10, 10},             /* cost of unaligned loads.  */
>    2, 2, 2,                          /* cost of moving XMM,YMM,ZMM register */
>    4,                                /* cost of moving SSE register to integer.  */
> +  4,                                /* cost of moving integer register to SSE.  */
>    6, 6,                             /* Gather load static, per_elt.  */
>    6, 6,                             /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -3731,15 +3759,16 @@ struct processor_costs lujiazui_cost = {
>    {6, 6, 6},                        /* cost of loading integer registers
>                                         in QImode, HImode and SImode.
>                                         Relative to reg-reg move (2).  */
> -  {6, 6, 6},                    /* cost of storing integer registers.  */
> +  {6, 6, 6},                        /* cost of storing integer registers.  */
>    {6, 6, 6, 10, 15},                /* cost of loading SSE register
> -                                       in 32bit, 64bit, 128bit, 256bit and 512bit.  */
> +                                       in 32bit, 64bit, 128bit, 256bit
> +                                       and 512bit.  */
>    {6, 6, 6, 10, 15},                /* cost of storing SSE register
> -                                       in 32bit, 64bit, 128bit, 256bit and 512bit.  */
> +                                       in 32bit, 64bit, 128bit, 256bit
> +                                       and 512bit.  */
>    {6, 6, 6, 10, 15},                /* cost of unaligned loads.  */
>    {6, 6, 6, 10, 15},                /* cost of unaligned storess.  */
> -  2, 3, 4,                      /* cost of moving XMM,YMM,ZMM register.  */
> -  6,                            /* cost of moving SSE register to integer.  */
> +  2, 3, 4,                          /* cost of moving XMM,YMM,ZMM register.  */
> +  6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    18, 6,                            /* Gather load static, per_elt.  */
>    18, 6,                            /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -3864,6 +3893,7 @@ struct processor_costs yongfeng_cost = {
>    {8, 8, 8, 12, 15},                /* cost of unaligned storess.  */
>    2, 3, 4,                          /* cost of moving XMM,YMM,ZMM register.  */
>    8,                                /* cost of moving SSE register to integer.  */
> +  8,                                /* cost of moving integer register to SSE.  */
>    18, 6,                            /* Gather load static, per_elt.  */
>    18, 6,                            /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -3987,6 +4017,7 @@ struct processor_costs shijidadao_cost = {
>    {8, 8, 8, 12, 15},                /* cost of unaligned storess.  */
>    2, 3, 4,                          /* cost of moving XMM,YMM,ZMM register.  */
>    8,                                /* cost of moving SSE register to integer.  */
> +  8,                                /* cost of moving integer register to SSE.  */
>    18, 6,                            /* Gather load static, per_elt.  */
>    18, 6,                            /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -4116,6 +4147,7 @@ struct processor_costs generic_cost = {
>    {6, 6, 6, 10, 15},                /* cost of unaligned storess.  */
>    2, 3, 4,                          /* cost of moving XMM,YMM,ZMM register */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    18, 6,                            /* Gather load static, per_elt.  */
>    18, 6,                            /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -4249,6 +4281,7 @@ struct processor_costs core_cost = {
>    {6, 6, 6, 6, 12},                 /* cost of unaligned stores.  */
>    2, 2, 4,                          /* cost of moving XMM,YMM,ZMM register */
>    2,                                /* cost of moving SSE register to integer.  */
> +  2,                                /* cost of moving integer register to SSE.  */
>    /* VGATHERDPD is 7 uops, rec throughput 5, while VGATHERDPD is 9 uops,
>       rec. throughput 6.
>       So 5 uops statically and one uops per load.  */
> diff --git a/gcc/testsuite/gcc.target/i386/minmax-6.c b/gcc/testsuite/gcc.target/i386/minmax-6.c
> index 615f919ba0a..23f61c52d80 100644
> --- a/gcc/testsuite/gcc.target/i386/minmax-6.c
> +++ b/gcc/testsuite/gcc.target/i386/minmax-6.c
> @@ -15,4 +15,4 @@ UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
>  /* We do not want the RA to spill %esi for it's dual-use but using
>     pmaxsd is OK.  */
>  /* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */
> -/* { dg-final { scan-assembler "pmaxsd" } } */
> +/* { dg-final { scan-assembler "pmaxsd" { xfail *-*-* } } } */
> diff --git a/gcc/testsuite/gcc.target/i386/minmax-7.c b/gcc/testsuite/gcc.target/i386/minmax-7.c
> index 619a93946c7..b2cb1c24d7e 100644
> --- a/gcc/testsuite/gcc.target/i386/minmax-7.c
> +++ b/gcc/testsuite/gcc.target/i386/minmax-7.c
> @@ -17,4 +17,4 @@ void bar (int aleft, int axcenter)
>  /* We do not want the RA to spill %esi for it's dual-use but using
>     pminsd is OK.  */
>  /* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */
> -/* { dg-final { scan-assembler "pminsd" } } */
> +/* { dg-final { scan-assembler "pminsd" { xfail *-*-* } } } */

-- 
H.J.