> On 10.05.2025 at 22:28, Jan Hubicka <hubi...@ucw.cz> wrote:
>
> Hi,
> this patch fixes some of the problems with costing in the scalar to vector
> pass.  In particular:
> 1) the pass uses optimize_insn_for_size, which is intended to be used by
> expanders and splitters and requires the optimization pass to call
> set_rtl_profile (bb) for the currently processed bb.
> This is not done, so we get random stale info about the hotness of the insn.
> 2) register allocator move costs are all relative to the integer reg-reg move,
> which has a cost of 2, so they are (except for the size tables and i386)
> the latency of the instruction multiplied by 2.
> These costs have been duplicated and are now used in combination with
> rtx costs, which are all based on COSTS_N_INSNS, i.e. latency multiplied
> by 4.
> Some of the vectorizer costing contains COSTS_N_INSNS (move_cost) / 2
> to compensate, but some new code does not.  This patch adds the compensation
> (see the sketch after this list).
>
> Perhaps we should update the cost tables to use COSTS_N_INSNS everywhere,
> but I think we want to first fix the inconsistencies.  Also the tables would
> get visually much longer, since we have many move costs and COSTS_N_INSNS
> is a lot of characters.
> 3) the variable m, which decides how much to multiply the integer variant
> (to account for the fact that with -m32 all 64bit computations need 2
> instructions), is declared unsigned, which makes the signed computation of
> the instruction gain be done in an unsigned type and breaks e.g. for division.
> 4) I added integer_to_sse costs, which are currently all duplicates of
> sse_to_integer.  AMD chips are asymmetric and moving in one direction is
> faster than the other.  I will change the costs incrementally once the
> vectorizer part is fixed up, too.
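>
> To make the unit mismatch in 2) and the signedness pitfall in 3) concrete,
> here is a small stand-alone sketch (not GCC code; COSTS_N_INSNS is reduced
> to its latency-times-4 definition and the concrete numbers are made up):
>
>   #define COSTS_N_INSNS(n) ((n) * 4)
>
>   /* RA move-cost tables store roughly latency * 2 (integer reg-reg move
>      == 2), while rtx costs use COSTS_N_INSNS, i.e. latency * 4.  Halving
>      the COSTS_N_INSNS of a table entry brings it onto the rtx-cost scale.  */
>   static int
>   move_cost_in_rtx_units (int table_cost)   /* table_cost == latency * 2 */
>   {
>     return COSTS_N_INSNS (table_cost) / 2;  /* == latency * 4 */
>   }
>
>   int
>   main (void)
>   {
>     /* The unsigned "m" pitfall: the subtraction is evaluated in unsigned
>        arithmetic, so the final division no longer yields the intended
>        negative gain.  */
>     unsigned m = 1;
>     int bad  = COSTS_N_INSNS (2 * m - 3) / 2;        /* huge positive, not -2 */
>     int good = COSTS_N_INSNS (2 * (int) m - 3) / 2;  /* -2, as intended */
>     __builtin_printf ("%d %d %d\n", move_cost_in_rtx_units (6), bad, good);
>     return 0;
>   }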
>
> There are two failures, gcc.target/i386/minmax-6.c and
> gcc.target/i386/minmax-7.c.
> Both test STV on Haswell, which no longer triggers since SSE->INT and
> INT->SSE moves are now more expensive.
>
> There is only one instruction to convert:
>
> Computing gain for chain #1...
> Instruction gain 8 for 11: {r110:SI=smax(r116:SI,0);clobber flags:CC;}
> Instruction conversion gain: 8
> Registers conversion cost: 8 <- this is integer_to_sse and sse_to_integer
> Total gain: 0
>
> The total gain used to be 4, since the patch doubles the conversion costs.
> According to Agner Fog's tables the cost should be 1 cycle, which is correct
> here.
>
> The final code generated is:
>
> vmovd %esi, %xmm0 * latency 1
> cmpl %edx, %esi
> je .L2
> vpxor %xmm1, %xmm1, %xmm1 * latency 1
> vpmaxsd %xmm1, %xmm0, %xmm0 * latency 1
> vmovd %xmm0, %eax * latency 1
> imull %edx, %eax
> cltq
> movzwl (%rdi,%rax,2), %eax
> ret
>
> cmpl %edx, %esi
> je .L2
> xorl %eax, %eax * latency 1
> testl %esi, %esi * latency 1
> cmovs %eax, %esi * latency 2
> imull %edx, %esi
> movslq %esi, %rsi
> movzwl (%rdi,%rsi,2), %eax
> ret
>
> The instructions annotated with latency info are the ones that really differ.
> So the unconverted code has a sum of latencies of 4 and a real latency of 3.
> The converted code has a sum of latencies of 4 and a real latency of 3
> (vmovd+vpmaxsd+vmovd).
> So I do not quite see why it should be a win.
Note this was historically done because cmov performance behaves erratically at
least on some uarchs compared to SSE min/max, especially if there are
back-to-back cmovs (the latter, i.e. throughput, is not modeled at all in the
cost tables or in the pass).  IIRC it was hmmer from SPEC 2006 that exhibited
such a back-to-back case.
Richard
> There is also a bug in the costing of MIN/MAX:
>
> case ABS:
> case SMAX:
> case SMIN:
> case UMAX:
> case UMIN:
> /* We do not have any conditional move cost, estimate it as a
> reg-reg move. Comparisons are costed as adds. */
> igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
> /* Integer SSE ops are all costed the same. */
> igain -= ix86_cost->sse_op;
> break;
>
> Now COSTS_N_INSNS (2) is not quite right, since a reg-reg move should be 1 or
> perhaps 0.
> For Haswell cmov really is 2 cycles, but I guess we want to have that in the
> cost vectors like all other instructions.
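>
> Purely as a sketch of that direction (the cmov field is hypothetical, it does
> not exist in processor_costs today), the costing could then read:
>
>     case ABS:
>     case SMAX:
>     case SMIN:
>     case UMAX:
>     case UMIN:
>       /* Hypothetical per-uarch cmov cost instead of the hard-coded
>          COSTS_N_INSNS (2); comparisons still costed as adds.  */
>       igain += m * (ix86_cost->cmov + ix86_cost->add);
>       igain -= ix86_cost->sse_op;
>       break;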
>
> I am not sure if this is really a win in this case (the other minmax testcases
> seem to make sense).  I have xfailed it for now and will check if that affects
> SPEC on the LNT testers.
>
> Bootstrapped/regtested x86_64-linux, committed.
>
> I will proceed with similar fixes on the vectorizer cost side.  Sadly those
> introduce quite a few differences in the testsuite (partly triggered by other
> costing problems, such as the one with scatter/gather).
>
> gcc/ChangeLog:
>
> * config/i386/i386-features.cc
> (general_scalar_chain::vector_const_cost): Add BB parameter; handle
> size costs; use COSTS_N_INSNS to compute move costs.
> (general_scalar_chain::compute_convert_gain): Use optimize_bb_for_size_p
> instead of optimize_insn_for_size_p; use COSTS_N_INSNS to compute move
> costs; update calls of general_scalar_chain::vector_const_cost; use
> ix86_cost->integer_to_sse.
> (timode_immed_const_gain): Add bb parameter; use
> optimize_bb_for_size_p.
> (timode_scalar_chain::compute_convert_gain): Use optimize_bb_for_size_p.
> * config/i386/i386-features.h (class general_scalar_chain): Update
> prototype of vector_const_cost.
> * config/i386/i386.h (struct processor_costs): Add integer_to_sse.
> * config/i386/x86-tune-costs.h (struct processor_costs): Copy
> sse_to_integer to integer_to_sse everywhere.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/minmax-6.c: xfail test that pmax is used.
> * gcc.target/i386/minmax-7.c: xfail test that pmin is used.
>
> diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
> index 1ba5ac4faa4..54b3f6d33b2 100644
> --- a/gcc/config/i386/i386-features.cc
> +++ b/gcc/config/i386/i386-features.cc
> @@ -518,15 +518,17 @@ scalar_chain::build (bitmap candidates, unsigned insn_uid, bitmap disallowed)
> instead of using a scalar one. */
>
> int
> -general_scalar_chain::vector_const_cost (rtx exp)
> +general_scalar_chain::vector_const_cost (rtx exp, basic_block bb)
> {
> gcc_assert (CONST_INT_P (exp));
>
> if (standard_sse_constant_p (exp, vmode))
> return ix86_cost->sse_op;
> + if (optimize_bb_for_size_p (bb))
> + return COSTS_N_BYTES (8);
> /* We have separate costs for SImode and DImode, use SImode costs
> for smaller modes. */
> - return ix86_cost->sse_load[smode == DImode ? 1 : 0];
> + return COSTS_N_INSNS (ix86_cost->sse_load[smode == DImode ? 1 : 0]) / 2;
> }
>
> /* Compute a gain for chain conversion. */
> @@ -547,7 +549,7 @@ general_scalar_chain::compute_convert_gain ()
> smaller modes than SImode the int load/store costs need to be
> adjusted as well. */
> unsigned sse_cost_idx = smode == DImode ? 1 : 0;
> - unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
> + int m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
>
> EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
> {
> @@ -555,26 +557,55 @@ general_scalar_chain::compute_convert_gain ()
> rtx def_set = single_set (insn);
> rtx src = SET_SRC (def_set);
> rtx dst = SET_DEST (def_set);
> + basic_block bb = BLOCK_FOR_INSN (insn);
> int igain = 0;
>
> if (REG_P (src) && REG_P (dst))
> - igain += 2 * m - ix86_cost->xmm_move;
> + {
> + if (optimize_bb_for_size_p (bb))
> + /* reg-reg move is 2 bytes, while SSE 3. */
> + igain += COSTS_N_BYTES (2 * m - 3);
> + else
> + /* Move costs are normalized to reg-reg move having cost 2. */
> + igain += COSTS_N_INSNS (2 * m - ix86_cost->xmm_move) / 2;
> + }
> else if (REG_P (src) && MEM_P (dst))
> - igain
> - += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
> + {
> + if (optimize_bb_for_size_p (bb))
> + /* Integer load/store is 3+ bytes and SSE 4+. */
> + igain += COSTS_N_BYTES (3 * m - 4);
> + else
> + igain
> + += COSTS_N_INSNS (m * ix86_cost->int_store[2]
> + - ix86_cost->sse_store[sse_cost_idx]) / 2;
> + }
> else if (MEM_P (src) && REG_P (dst))
> - igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
> + {
> + if (optimize_bb_for_size_p (bb))
> + igain += COSTS_N_BYTES (3 * m - 4);
> + else
> + igain += COSTS_N_INSNS (m * ix86_cost->int_load[2]
> + - ix86_cost->sse_load[sse_cost_idx]) / 2;
> + }
> else
> {
> /* For operations on memory operands, include the overhead
> of explicit load and store instructions. */
> if (MEM_P (dst))
> - igain += optimize_insn_for_size_p ()
> - ? -COSTS_N_BYTES (8)
> - : (m * (ix86_cost->int_load[2]
> - + ix86_cost->int_store[2])
> - - (ix86_cost->sse_load[sse_cost_idx] +
> - ix86_cost->sse_store[sse_cost_idx]));
> + {
> + if (optimize_bb_for_size_p (bb))
> + /* ??? This probably should account size difference
> + of SSE and integer load rather than full SSE load. */
> + igain -= COSTS_N_BYTES (8);
> + else
> + {
> + int cost = (m * (ix86_cost->int_load[2]
> + + ix86_cost->int_store[2])
> + - (ix86_cost->sse_load[sse_cost_idx] +
> + ix86_cost->sse_store[sse_cost_idx]));
> + igain += COSTS_N_INSNS (cost) / 2;
> + }
> + }
>
> switch (GET_CODE (src))
> {
> @@ -595,7 +626,7 @@ general_scalar_chain::compute_convert_gain ()
> igain += ix86_cost->shift_const - ix86_cost->sse_op;
>
> if (CONST_INT_P (XEXP (src, 0)))
> - igain -= vector_const_cost (XEXP (src, 0));
> + igain -= vector_const_cost (XEXP (src, 0), bb);
> break;
>
> case ROTATE:
> @@ -631,16 +662,17 @@ general_scalar_chain::compute_convert_gain ()
> igain += m * ix86_cost->add;
>
> if (CONST_INT_P (XEXP (src, 0)))
> - igain -= vector_const_cost (XEXP (src, 0));
> + igain -= vector_const_cost (XEXP (src, 0), bb);
> if (CONST_INT_P (XEXP (src, 1)))
> - igain -= vector_const_cost (XEXP (src, 1));
> + igain -= vector_const_cost (XEXP (src, 1), bb);
> if (MEM_P (XEXP (src, 1)))
> {
> - if (optimize_insn_for_size_p ())
> + if (optimize_bb_for_size_p (bb))
> igain -= COSTS_N_BYTES (m == 2 ? 3 : 5);
> else
> - igain += m * ix86_cost->int_load[2]
> - - ix86_cost->sse_load[sse_cost_idx];
> + igain += COSTS_N_INSNS
> + (m * ix86_cost->int_load[2]
> + - ix86_cost->sse_load[sse_cost_idx]) / 2;
> }
> break;
>
> @@ -698,7 +730,7 @@ general_scalar_chain::compute_convert_gain ()
> case CONST_INT:
> if (REG_P (dst))
> {
> - if (optimize_insn_for_size_p ())
> + if (optimize_bb_for_size_p (bb))
> {
> /* xor (2 bytes) vs. xorps (3 bytes). */
> if (src == const0_rtx)
> @@ -722,14 +754,14 @@ general_scalar_chain::compute_convert_gain ()
> /* DImode can be immediate for TARGET_64BIT
> and SImode always. */
> igain += m * COSTS_N_INSNS (1);
> - igain -= vector_const_cost (src);
> + igain -= vector_const_cost (src, bb);
> }
> }
> else if (MEM_P (dst))
> {
> igain += (m * ix86_cost->int_store[2]
> - ix86_cost->sse_store[sse_cost_idx]);
> - igain -= vector_const_cost (src);
> + igain -= vector_const_cost (src, bb);
> }
> break;
>
> @@ -737,13 +769,14 @@ general_scalar_chain::compute_convert_gain ()
> if (XVECEXP (XEXP (src, 1), 0, 0) == const0_rtx)
> {
> // movd (4 bytes) replaced with movdqa (4 bytes).
> - if (!optimize_insn_for_size_p ())
> - igain += ix86_cost->sse_to_integer - ix86_cost->xmm_move;
> + if (!optimize_bb_for_size_p (bb))
> + igain += COSTS_N_INSNS (ix86_cost->sse_to_integer
> + - ix86_cost->xmm_move) / 2;
> }
> else
> {
> // pshufd; movd replaced with pshufd.
> - if (optimize_insn_for_size_p ())
> + if (optimize_bb_for_size_p (bb))
> igain += COSTS_N_BYTES (4);
> else
> igain += ix86_cost->sse_to_integer;
> @@ -769,11 +802,11 @@ general_scalar_chain::compute_convert_gain ()
> /* Cost the integer to sse and sse to integer moves. */
> if (!optimize_function_for_size_p (cfun))
> {
> - cost += n_sse_to_integer * ix86_cost->sse_to_integer;
> + cost += n_sse_to_integer * COSTS_N_INSNS (ix86_cost->sse_to_integer) / 2;
> /* ??? integer_to_sse but we only have that in the RA cost table.
> Assume sse_to_integer/integer_to_sse are the same which they
> are at the moment. */
> - cost += n_integer_to_sse * ix86_cost->sse_to_integer;
> + cost += n_integer_to_sse * COSTS_N_INSNS (ix86_cost->integer_to_sse) / 2;
> }
> else if (TARGET_64BIT || smode == SImode)
> {
> @@ -1508,13 +1541,13 @@ general_scalar_chain::convert_insn (rtx_insn *insn)
> with numerous special cases. */
>
> static int
> -timode_immed_const_gain (rtx cst)
> +timode_immed_const_gain (rtx cst, basic_block bb)
> {
> /* movabsq vs. movabsq+vmovq+vunpacklqdq. */
> if (CONST_WIDE_INT_P (cst)
> && CONST_WIDE_INT_NUNITS (cst) == 2
> && CONST_WIDE_INT_ELT (cst, 0) == CONST_WIDE_INT_ELT (cst, 1))
> - return optimize_insn_for_size_p () ? -COSTS_N_BYTES (9)
> + return optimize_bb_for_size_p (bb) ? -COSTS_N_BYTES (9)
> : -COSTS_N_INSNS (2);
> /* 2x movabsq ~ vmovdqa. */
> return 0;
> @@ -1546,33 +1579,34 @@ timode_scalar_chain::compute_convert_gain ()
> rtx src = SET_SRC (def_set);
> rtx dst = SET_DEST (def_set);
> HOST_WIDE_INT op1val;
> + basic_block bb = BLOCK_FOR_INSN (insn);
> int scost, vcost;
> int igain = 0;
>
> switch (GET_CODE (src))
> {
> case REG:
> - if (optimize_insn_for_size_p ())
> + if (optimize_bb_for_size_p (bb))
> igain = MEM_P (dst) ? COSTS_N_BYTES (6) : COSTS_N_BYTES (3);
> else
> igain = COSTS_N_INSNS (1);
> break;
>
> case MEM:
> - igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (7)
> + igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (7)
> : COSTS_N_INSNS (1);
> break;
>
> case CONST_INT:
> if (MEM_P (dst)
> && standard_sse_constant_p (src, V1TImode))
> - igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (11) : 1;
> + igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (11) : 1;
> break;
>
> case CONST_WIDE_INT:
> /* 2 x mov vs. vmovdqa. */
> if (MEM_P (dst))
> - igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (3)
> + igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (3)
> : COSTS_N_INSNS (1);
> break;
>
> @@ -1587,14 +1621,14 @@ timode_scalar_chain::compute_convert_gain ()
> if (!MEM_P (dst))
> igain = COSTS_N_INSNS (1);
> if (CONST_SCALAR_INT_P (XEXP (src, 1)))
> - igain += timode_immed_const_gain (XEXP (src, 1));
> + igain += timode_immed_const_gain (XEXP (src, 1), bb);
> break;
>
> case ASHIFT:
> case LSHIFTRT:
> /* See ix86_expand_v1ti_shift. */
> op1val = INTVAL (XEXP (src, 1));
> - if (optimize_insn_for_size_p ())
> + if (optimize_bb_for_size_p (bb))
> {
> if (op1val == 64 || op1val == 65)
> scost = COSTS_N_BYTES (5);
> @@ -1628,7 +1662,7 @@ timode_scalar_chain::compute_convert_gain ()
> case ASHIFTRT:
> /* See ix86_expand_v1ti_ashiftrt. */
> op1val = INTVAL (XEXP (src, 1));
> - if (optimize_insn_for_size_p ())
> + if (optimize_bb_for_size_p (bb))
> {
> if (op1val == 64 || op1val == 127)
> scost = COSTS_N_BYTES (7);
> @@ -1706,7 +1740,7 @@ timode_scalar_chain::compute_convert_gain ()
> case ROTATERT:
> /* See ix86_expand_v1ti_rotate. */
> op1val = INTVAL (XEXP (src, 1));
> - if (optimize_insn_for_size_p ())
> + if (optimize_bb_for_size_p (bb))
> {
> scost = COSTS_N_BYTES (13);
> if ((op1val & 31) == 0)
> @@ -1738,16 +1772,16 @@ timode_scalar_chain::compute_convert_gain ()
> {
> if (GET_CODE (XEXP (src, 0)) == AND)
> /* and;and;or (9 bytes) vs. ptest (5 bytes). */
> - igain = optimize_insn_for_size_p() ? COSTS_N_BYTES (4)
> - : COSTS_N_INSNS (2);
> + igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (4)
> + : COSTS_N_INSNS (2);
> /* or (3 bytes) vs. ptest (5 bytes). */
> - else if (optimize_insn_for_size_p ())
> + else if (optimize_bb_for_size_p (bb))
> igain = -COSTS_N_BYTES (2);
> }
> else if (XEXP (src, 1) == const1_rtx)
> /* and;cmp -1 (7 bytes) vs. pcmpeqd;pxor;ptest (13 bytes). */
> - igain = optimize_insn_for_size_p() ? -COSTS_N_BYTES (6)
> - : -COSTS_N_INSNS (1);
> + igain = optimize_bb_for_size_p (bb) ? -COSTS_N_BYTES (6)
> + : -COSTS_N_INSNS (1);
> break;
>
> default:
> diff --git a/gcc/config/i386/i386-features.h b/gcc/config/i386/i386-features.h
> index 24b0c4ed0cd..7f7c0f78c96 100644
> --- a/gcc/config/i386/i386-features.h
> +++ b/gcc/config/i386/i386-features.h
> @@ -188,7 +188,7 @@ class general_scalar_chain : public scalar_chain
>
> private:
> void convert_insn (rtx_insn *insn) final override;
> - int vector_const_cost (rtx exp);
> + int vector_const_cost (rtx exp, basic_block bb);
> rtx convert_rotate (enum rtx_code, rtx op0, rtx op1, rtx_insn *insn);
> };
>
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index 6a38de30de4..18fa97a9eb0 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -179,6 +179,7 @@ struct processor_costs {
> const int xmm_move, ymm_move, /* cost of moving XMM and YMM register. */
> zmm_move;
> const int sse_to_integer; /* cost of moving SSE register to integer. */
> + const int integer_to_sse; /* cost of moving integer register to SSE. */
> const int gather_static, gather_per_elt; /* Cost of gather load is computed
> as static + per_item * nelts. */
> const int scatter_static, scatter_per_elt; /* Cost of gather store is
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index 6cce70a6c40..e5091293509 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -107,6 +107,7 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
> in 128bit, 256bit and 512bit */
> 4, 4, 6, /* cost of moving XMM,YMM,ZMM register */
> 4, /* cost of moving SSE register to integer. */
> + 4, /* cost of moving integer register to SSE. */
> COSTS_N_BYTES (5), 0, /* Gather load static, per_elt. */
> COSTS_N_BYTES (5), 0, /* Gather store static, per_elt. */
> 0, /* size of l1 cache */
> @@ -227,6 +228,7 @@ struct processor_costs i386_cost = { /* 386 specific costs */
> {4, 8, 16, 32, 64}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 3, /* cost of moving SSE register to integer. */
> + 3, /* cost of moving integer register to SSE. */
> 4, 4, /* Gather load static, per_elt. */
> 4, 4, /* Gather store static, per_elt. */
> 0, /* size of l1 cache */
> @@ -345,6 +347,7 @@ struct processor_costs i486_cost = { /* 486 specific costs */
> {4, 8, 16, 32, 64}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 3, /* cost of moving SSE register to integer. */
> + 3, /* cost of moving integer register to SSE. */
> 4, 4, /* Gather load static, per_elt. */
> 4, 4, /* Gather store static, per_elt. */
> 4, /* size of l1 cache. 486 has 8kB cache
> @@ -465,6 +468,7 @@ struct processor_costs pentium_cost = {
> {4, 8, 16, 32, 64}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 3, /* cost of moving SSE register to integer. */
> + 3, /* cost of moving integer register to SSE. */
> 4, 4, /* Gather load static, per_elt. */
> 4, 4, /* Gather store static, per_elt. */
> 8, /* size of l1 cache. */
> @@ -576,6 +580,7 @@ struct processor_costs lakemont_cost = {
> {4, 8, 16, 32, 64}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 3, /* cost of moving SSE register to integer. */
> + 3, /* cost of moving integer register to SSE. */
> 4, 4, /* Gather load static, per_elt. */
> 4, 4, /* Gather store static, per_elt. */
> 8, /* size of l1 cache. */
> @@ -702,6 +707,7 @@ struct processor_costs pentiumpro_cost = {
> {4, 8, 16, 32, 64}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 3, /* cost of moving SSE register to integer. */
> + 3, /* cost of moving integer register to SSE. */
> 4, 4, /* Gather load static, per_elt. */
> 4, 4, /* Gather store static, per_elt. */
> 8, /* size of l1 cache. */
> @@ -819,6 +825,7 @@ struct processor_costs geode_cost = {
> {2, 2, 8, 16, 32}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> 2, 2, /* Gather load static, per_elt. */
> 2, 2, /* Gather store static, per_elt. */
> 64, /* size of l1 cache. */
> @@ -936,6 +943,7 @@ struct processor_costs k6_cost = {
> {2, 2, 8, 16, 32}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> 2, 2, /* Gather load static, per_elt. */
> 2, 2, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -1059,6 +1067,7 @@ struct processor_costs athlon_cost = {
> {4, 4, 10, 10, 20}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 5, /* cost of moving SSE register to integer. */
> + 5, /* cost of moving integer register to SSE. */
> 4, 4, /* Gather load static, per_elt. */
> 4, 4, /* Gather store static, per_elt. */
> 64, /* size of l1 cache. */
> @@ -1184,6 +1193,7 @@ struct processor_costs k8_cost = {
> {4, 4, 10, 10, 20}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 5, /* cost of moving SSE register to integer. */
> + 5, /* cost of moving integer register to SSE. */
> 4, 4, /* Gather load static, per_elt. */
> 4, 4, /* Gather store static, per_elt. */
> 64, /* size of l1 cache. */
> @@ -1322,6 +1332,7 @@ struct processor_costs amdfam10_cost = {
> {4, 4, 5, 10, 20}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 3, /* cost of moving SSE register to integer. */
> + 3, /* cost of moving integer register to SSE. */
> 4, 4, /* Gather load static, per_elt. */
> 4, 4, /* Gather store static, per_elt. */
> 64, /* size of l1 cache. */
> @@ -1452,6 +1463,7 @@ const struct processor_costs bdver_cost = {
> {10, 10, 10, 40, 60}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 16, /* cost of moving SSE register to integer. */
> + 16, /* cost of moving integer register to SSE. */
> 12, 12, /* Gather load static, per_elt. */
> 10, 10, /* Gather store static, per_elt. */
> 16, /* size of l1 cache. */
> @@ -1603,6 +1615,7 @@ struct processor_costs znver1_cost = {
> {8, 8, 8, 16, 32}, /* cost of unaligned stores. */
> 2, 3, 6, /* cost of moving XMM,YMM,ZMM register. */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
> throughput 12. Approx 9 uops do not depend on vector size and every load
> is 7 uops. */
> @@ -1770,6 +1783,7 @@ struct processor_costs znver2_cost = {
> 2, 2, 3, /* cost of moving XMM,YMM,ZMM
> register. */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
> throughput 12. Approx 9 uops do not depend on vector size and every load
> is 7 uops. */
> @@ -1912,6 +1926,7 @@ struct processor_costs znver3_cost = {
> 2, 2, 3, /* cost of moving XMM,YMM,ZMM
> register. */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> /* VGATHERDPD is 15 uops and throughput is 4, VGATHERDPS is 23 uops,
> throughput 9. Approx 7 uops do not depend on vector size and every load
> is 4 uops. */
> @@ -2056,6 +2071,7 @@ struct processor_costs znver4_cost = {
> 2, 2, 2, /* cost of moving XMM,YMM,ZMM
> register. */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> /* VGATHERDPD is 17 uops and throughput is 4, VGATHERDPS is 24 uops,
> throughput 5. Approx 7 uops do not depend on vector size and every load
> is 5 uops. */
> @@ -2204,6 +2220,7 @@ struct processor_costs znver5_cost = {
> 2, 2, 2, /* cost of moving XMM,YMM,ZMM
> register. */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
>
> /* TODO: gather and scatter instructions are currently disabled in
> x86-tune.def. In some cases they are however a win, see PR116582
> @@ -2372,6 +2389,7 @@ struct processor_costs skylake_cost = {
> {8, 8, 8, 8, 16}, /* cost of unaligned stores. */
> 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> 20, 8, /* Gather load static, per_elt. */
> 22, 10, /* Gather store static, per_elt. */
> 64, /* size of l1 cache. */
> @@ -2508,6 +2526,7 @@ struct processor_costs icelake_cost = {
> {8, 8, 8, 8, 16}, /* cost of unaligned stores. */
> 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> 20, 8, /* Gather load static, per_elt. */
> 22, 10, /* Gather store static, per_elt. */
> 64, /* size of l1 cache. */
> @@ -2638,6 +2657,7 @@ struct processor_costs alderlake_cost = {
> {8, 8, 8, 10, 15}, /* cost of unaligned storess. */
> 2, 3, 4, /* cost of moving XMM,YMM,ZMM register */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> 18, 6, /* Gather load static, per_elt. */
> 18, 6, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -2761,6 +2781,7 @@ const struct processor_costs btver1_cost = {
> {10, 10, 12, 48, 96}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 14, /* cost of moving SSE register to integer. */
> + 14, /* cost of moving integer register to SSE. */
> 10, 10, /* Gather load static, per_elt. */
> 10, 10, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -2881,6 +2902,7 @@ const struct processor_costs btver2_cost = {
> {10, 10, 12, 48, 96}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 14, /* cost of moving SSE register to integer. */
> + 14, /* cost of moving integer register to SSE. */
> 10, 10, /* Gather load static, per_elt. */
> 10, 10, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -3000,6 +3022,7 @@ struct processor_costs pentium4_cost = {
> {32, 32, 32, 64, 128}, /* cost of unaligned stores. */
> 12, 24, 48, /* cost of moving XMM,YMM,ZMM register */
> 20, /* cost of moving SSE register to integer. */
> + 20, /* cost of moving integer register to SSE. */
> 16, 16, /* Gather load static, per_elt. */
> 16, 16, /* Gather store static, per_elt. */
> 8, /* size of l1 cache. */
> @@ -3122,6 +3145,7 @@ struct processor_costs nocona_cost = {
> {24, 24, 24, 48, 96}, /* cost of unaligned stores. */
> 6, 12, 24, /* cost of moving XMM,YMM,ZMM register */
> 20, /* cost of moving SSE register to integer. */
> + 20, /* cost of moving integer register to SSE. */
> 12, 12, /* Gather load static, per_elt. */
> 12, 12, /* Gather store static, per_elt. */
> 8, /* size of l1 cache. */
> @@ -3242,6 +3266,7 @@ struct processor_costs atom_cost = {
> {16, 16, 16, 32, 64}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 8, /* cost of moving SSE register to integer. */
> + 8, /* cost of moving integer register to SSE. */
> 8, 8, /* Gather load static, per_elt. */
> 8, 8, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -3362,6 +3387,7 @@ struct processor_costs slm_cost = {
> {16, 16, 16, 32, 64}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 8, /* cost of moving SSE register to integer. */
> + 8, /* cost of moving integer register to SSE. */
> 8, 8, /* Gather load static, per_elt. */
> 8, 8, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -3494,6 +3520,7 @@ struct processor_costs tremont_cost = {
> {6, 6, 6, 10, 15}, /* cost of unaligned storess. */
> 2, 3, 4, /* cost of moving XMM,YMM,ZMM register */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> 18, 6, /* Gather load static, per_elt. */
> 18, 6, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -3616,6 +3643,7 @@ struct processor_costs intel_cost = {
> {10, 10, 10, 10, 10}, /* cost of unaligned loads. */
> 2, 2, 2, /* cost of moving XMM,YMM,ZMM register */
> 4, /* cost of moving SSE register to integer. */
> + 4, /* cost of moving integer register to SSE. */
> 6, 6, /* Gather load static, per_elt. */
> 6, 6, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -3731,15 +3759,16 @@ struct processor_costs lujiazui_cost = {
> {6, 6, 6}, /* cost of loading integer registers
> in QImode, HImode and SImode.
> Relative to reg-reg move (2). */
> - {6, 6, 6}, /* cost of storing integer registers. */
> + {6, 6, 6}, /* cost of storing integer registers. */
> {6, 6, 6, 10, 15}, /* cost of loading SSE register
> - in 32bit, 64bit, 128bit, 256bit and 512bit. */
> + in 32bit, 64bit, 128bit, 256bit and 512bit. */
> {6, 6, 6, 10, 15}, /* cost of storing SSE register
> - in 32bit, 64bit, 128bit, 256bit and 512bit. */
> + in 32bit, 64bit, 128bit, 256bit and 512bit. */
> {6, 6, 6, 10, 15}, /* cost of unaligned loads. */
> {6, 6, 6, 10, 15}, /* cost of unaligned storess. */
> - 2, 3, 4, /* cost of moving XMM,YMM,ZMM register. */
> - 6, /* cost of moving SSE register to integer. */
> + 2, 3, 4, /* cost of moving XMM,YMM,ZMM register. */
> + 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> 18, 6, /* Gather load static, per_elt. */
> 18, 6, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -3864,6 +3893,7 @@ struct processor_costs yongfeng_cost = {
> {8, 8, 8, 12, 15}, /* cost of unaligned storess. */
> 2, 3, 4, /* cost of moving XMM,YMM,ZMM register. */
> 8, /* cost of moving SSE register to integer. */
> + 8, /* cost of moving integer register to SSE. */
> 18, 6, /* Gather load static, per_elt. */
> 18, 6, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -3987,6 +4017,7 @@ struct processor_costs shijidadao_cost = {
> {8, 8, 8, 12, 15}, /* cost of unaligned storess. */
> 2, 3, 4, /* cost of moving XMM,YMM,ZMM register. */
> 8, /* cost of moving SSE register to integer. */
> + 8, /* cost of moving integer register to SSE. */
> 18, 6, /* Gather load static, per_elt. */
> 18, 6, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -4116,6 +4147,7 @@ struct processor_costs generic_cost = {
> {6, 6, 6, 10, 15}, /* cost of unaligned storess. */
> 2, 3, 4, /* cost of moving XMM,YMM,ZMM register */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> 18, 6, /* Gather load static, per_elt. */
> 18, 6, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -4249,6 +4281,7 @@ struct processor_costs core_cost = {
> {6, 6, 6, 6, 12}, /* cost of unaligned stores. */
> 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */
> 2, /* cost of moving SSE register to integer. */
> + 2, /* cost of moving integer register to SSE. */
> /* VGATHERDPD is 7 uops, rec throughput 5, while VGATHERDPD is 9 uops,
> rec. throughput 6.
> So 5 uops statically and one uops per load. */
> diff --git a/gcc/testsuite/gcc.target/i386/minmax-6.c b/gcc/testsuite/gcc.target/i386/minmax-6.c
> index 615f919ba0a..23f61c52d80 100644
> --- a/gcc/testsuite/gcc.target/i386/minmax-6.c
> +++ b/gcc/testsuite/gcc.target/i386/minmax-6.c
> @@ -15,4 +15,4 @@ UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> /* We do not want the RA to spill %esi for it's dual-use but using
> pmaxsd is OK. */
> /* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */
> -/* { dg-final { scan-assembler "pmaxsd" } } */
> +/* { dg-final { scan-assembler "pmaxsd" { xfail *-*-* } } } */
> diff --git a/gcc/testsuite/gcc.target/i386/minmax-7.c b/gcc/testsuite/gcc.target/i386/minmax-7.c
> index 619a93946c7..b2cb1c24d7e 100644
> --- a/gcc/testsuite/gcc.target/i386/minmax-7.c
> +++ b/gcc/testsuite/gcc.target/i386/minmax-7.c
> @@ -17,4 +17,4 @@ void bar (int aleft, int axcenter)
> /* We do not want the RA to spill %esi for it's dual-use but using
> pminsd is OK. */
> /* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */
> -/* { dg-final { scan-assembler "pminsd" } } */
> +/* { dg-final { scan-assembler "pminsd" { xfail *-*-* } } } */