> On 10.05.2025 at 22:28, Jan Hubicka <hubi...@ucw.cz> wrote:
>
> Hi,
> this patch fixes some of the problems with costing in the scalar to vector
> pass.  In particular:
> 1) the pass uses optimize_insn_for_size, which is intended to be used by
> expanders and splitters and requires the optimization pass to call
> set_rtl_profile (bb) for the currently processed bb.
> This is not done, so we get random stale info about the hotness of the insn.
> 2) register allocator move costs are all relative to the integer reg-reg move,
> which has a cost of 2, so they are (except for the size tables and i386)
> the latency of the instruction multiplied by 2.
> These costs have been duplicated and are now used in combination with
> rtx costs, which are all based on COSTS_N_INSNS, i.e. latency multiplied
> by 4.
> Some of the vectorizer costing contains COSTS_N_INSNS (move_cost) / 2
> to compensate, but some new code does not.  This patch adds the compensation
> (see the sketch after this list).
>
> Perhaps we should update the cost tables to use COSTS_N_INSNS everywhere,
> but I think we want to first fix the inconsistencies.  Also the tables would
> get visually much longer, since we have many move costs and COSTS_N_INSNS
> is a lot of characters.
> 3) the variable m, which decides how much to multiply the integer variant
> (to account for the fact that with -m32 all 64bit computations need 2
> instructions), is declared unsigned, which makes the signed computation of
> the instruction gain be done in an unsigned type and breaks e.g. for division.
> 4) I added integer_to_sse costs, which are currently all duplicates of
> sse_to_integer.  AMD chips are asymmetric and moving in one direction is
> faster than the other.  I will change the costs incrementally once the
> vectorizer part is fixed up, too.
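>
> To make the unit mismatch in 2) and the signedness pitfall in 3) concrete,
> here is a small stand-alone sketch (not GCC code; COSTS_N_INSNS is reduced
> to its latency-times-4 definition and the concrete numbers are made up):
>
>   #define COSTS_N_INSNS(n) ((n) * 4)
>
>   /* RA move-cost tables store roughly latency * 2 (integer reg-reg move
>      == 2), while rtx costs use COSTS_N_INSNS, i.e. latency * 4.  Halving
>      the COSTS_N_INSNS of a table entry brings it onto the rtx-cost scale.  */
>   static int
>   move_cost_in_rtx_units (int table_cost)   /* table_cost == latency * 2 */
>   {
>     return COSTS_N_INSNS (table_cost) / 2;  /* == latency * 4 */
>   }
>
>   int
>   main (void)
>   {
>     /* The unsigned "m" pitfall: the subtraction is evaluated in unsigned
>        arithmetic, so the final division no longer yields the intended
>        negative gain.  */
>     unsigned m = 1;
>     int bad  = COSTS_N_INSNS (2 * m - 3) / 2;        /* huge positive, not -2 */
>     int good = COSTS_N_INSNS (2 * (int) m - 3) / 2;  /* -2, as intended */
>     __builtin_printf ("%d %d %d\n", move_cost_in_rtx_units (6), bad, good);
>     return 0;
>   }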
>
> There are two failures, gcc.target/i386/minmax-6.c and
> gcc.target/i386/minmax-7.c.
> Both test STV on Haswell, which no longer triggers since SSE->INT and
> INT->SSE moves are now more expensive.
>
> There is only one instruction to convert:
>
> Computing gain for chain #1...
> Instruction gain 8 for 11: {r110:SI=smax(r116:SI,0);clobber flags:CC;}
> Instruction conversion gain: 8
> Registers conversion cost: 8 <- this is integer_to_sse and sse_to_integer
> Total gain: 0
>
> The total gain used to be 4, since the patch doubles the conversion costs.
> According to Agner Fog's tables the cost should be 1 cycle, which is correct
> here.
>
> The final code generated is:
>
> vmovd %esi, %xmm0 * latency 1
> cmpl %edx, %esi
> je .L2
> vpxor %xmm1, %xmm1, %xmm1 * latency 1
> vpmaxsd %xmm1, %xmm0, %xmm0 * latency 1
> vmovd %xmm0, %eax * latency 1
> imull %edx, %eax
> cltq
> movzwl (%rdi,%rax,2), %eax
> ret
>
> cmpl %edx, %esi
> je .L2
> xorl %eax, %eax * latency 1
> testl %esi, %esi * latency 1
> cmovs %eax, %esi * latency 2
> imull %edx, %esi
> movslq %esi, %rsi
> movzwl (%rdi,%rsi,2), %eax
> ret
>
> The instructions annotated with latency info are the ones that really differ.
> So the unconverted code has a sum of latencies of 4 and a real latency of 3.
> The converted code has a sum of latencies of 4 and a real latency of 3
> (vmovd+vpmaxsd+vmovd).
> So I do not quite see why it should be a win.
Note this was historically done because cmov performance behaves erratically at
least on some uarchs compared to SSE min/max, especially if there are
back-to-back cmovs (the latter, i.e. throughput, is not modeled at all in the
cost tables or in the pass).  IIRC it was hmmer from SPEC 2006 that exhibited
such a back-to-back case.
Richard
> There is also a bug in the costing of MIN/MAX:
>
> case ABS:
> case SMAX:
> case SMIN:
> case UMAX:
> case UMIN:
> /* We do not have any conditional move cost, estimate it as a
> reg-reg move. Comparisons are costed as adds. */
> igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
> /* Integer SSE ops are all costed the same. */
> igain -= ix86_cost->sse_op;
> break;
>
> Now COSTS_N_INSNS (2) is not quite right, since a reg-reg move should be 1 or
> perhaps 0.
> For Haswell cmov really is 2 cycles, but I guess we want to have that in the
> cost vectors like all other instructions.
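>
> Purely as a sketch of that direction (the cmov field is hypothetical, it does
> not exist in processor_costs today), the costing could then read:
>
>     case ABS:
>     case SMAX:
>     case SMIN:
>     case UMAX:
>     case UMIN:
>       /* Hypothetical per-uarch cmov cost instead of the hard-coded
>          COSTS_N_INSNS (2); comparisons still costed as adds.  */
>       igain += m * (ix86_cost->cmov + ix86_cost->add);
>       igain -= ix86_cost->sse_op;
>       break;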
>
> I am not sure if this is really a win in this case (the other minmax testcases
> seem to make sense).  I have xfailed it for now and will check if that affects
> SPEC on the LNT testers.
>
> Bootstrapped/regtested x86_64-linux, committed.
>
> I will proceed with similar fixes on the vectorizer cost side.  Sadly those
> introduce quite a few differences in the testsuite (partly triggered by other
> costing problems, such as the one with scatter/gather).
>
> gcc/ChangeLog:
>
> * config/i386/i386-features.cc
> (general_scalar_chain::vector_const_cost): Add BB parameter; handle
> size costs; use COSTS_N_INSNS to compute move costs.
> (general_scalar_chain::compute_convert_gain): Use optimize_bb_for_size_p
> instead of optimize_insn_for_size_p; use COSTS_N_INSNS to compute move
> costs; update calls of general_scalar_chain::vector_const_cost; use
> ix86_cost->integer_to_sse.
> (timode_immed_const_gain): Add bb parameter; use
> optimize_bb_for_size_p.
> (timode_scalar_chain::compute_convert_gain): Use optimize_bb_for_size_p.
> * config/i386/i386-features.h (class general_scalar_chain): Update
> prototype of vector_const_cost.
> * config/i386/i386.h (struct processor_costs): Add integer_to_sse.
> * config/i386/x86-tune-costs.h (struct processor_costs): Copy
> sse_to_integer to integer_to_sse everywhere.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/minmax-6.c: xfail test that pmax is used.
> * gcc.target/i386/minmax-7.c: xfail test that pmin is used.
>
> diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
> index 1ba5ac4faa4..54b3f6d33b2 100644
> --- a/gcc/config/i386/i386-features.cc
> +++ b/gcc/config/i386/i386-features.cc
> @@ -518,15 +518,17 @@ scalar_chain::build (bitmap candidates, unsigned insn_uid, bitmap disallowed)
> instead of using a scalar one. */
>
> int
> -general_scalar_chain::vector_const_cost (rtx exp)
> +general_scalar_chain::vector_const_cost (rtx exp, basic_block bb)
> {
> gcc_assert (CONST_INT_P (exp));
>
> if (standard_sse_constant_p (exp, vmode))
> return ix86_cost->sse_op;
> + if (optimize_bb_for_size_p (bb))
> + return COSTS_N_BYTES (8);
> /* We have separate costs for SImode and DImode, use SImode costs
> for smaller modes. */
> - return ix86_cost->sse_load[smode == DImode ? 1 : 0];
> + return COSTS_N_INSNS (ix86_cost->sse_load[smode == DImode ? 1 : 0]) / 2;
> }
>
> /* Compute a gain for chain conversion. */
> @@ -547,7 +549,7 @@ general_scalar_chain::compute_convert_gain ()
> smaller modes than SImode the int load/store costs need to be
> adjusted as well. */
> unsigned sse_cost_idx = smode == DImode ? 1 : 0;
> - unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
> + int m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
>
> EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
> {
> @@ -555,26 +557,55 @@ general_scalar_chain::compute_convert_gain ()
> rtx def_set = single_set (insn);
> rtx src = SET_SRC (def_set);
> rtx dst = SET_DEST (def_set);
> + basic_block bb = BLOCK_FOR_INSN (insn);
> int igain = 0;
>
> if (REG_P (src) && REG_P (dst))
> - igain += 2 * m - ix86_cost->xmm_move;
> + {
> + if (optimize_bb_for_size_p (bb))
> + /* reg-reg move is 2 bytes, while SSE 3. */
> + igain += COSTS_N_BYTES (2 * m - 3);
> + else
> + /* Move costs are normalized to reg-reg move having cost 2. */
> + igain += COSTS_N_INSNS (2 * m - ix86_cost->xmm_move) / 2;
> + }
> else if (REG_P (src) && MEM_P (dst))
> - igain
> - += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
> + {
> + if (optimize_bb_for_size_p (bb))
> + /* Integer load/store is 3+ bytes and SSE 4+. */
> + igain += COSTS_N_BYTES (3 * m - 4);
> + else
> + igain
> + += COSTS_N_INSNS (m * ix86_cost->int_store[2]
> + - ix86_cost->sse_store[sse_cost_idx]) / 2;
> + }
> else if (MEM_P (src) && REG_P (dst))
> - igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
> + {
> + if (optimize_bb_for_size_p (bb))
> + igain += COSTS_N_BYTES (3 * m - 4);
> + else
> + igain += COSTS_N_INSNS (m * ix86_cost->int_load[2]
> + - ix86_cost->sse_load[sse_cost_idx]) / 2;
> + }
> else
> {
> /* For operations on memory operands, include the overhead
> of explicit load and store instructions. */
> if (MEM_P (dst))
> - igain += optimize_insn_for_size_p ()
> - ? -COSTS_N_BYTES (8)
> - : (m * (ix86_cost->int_load[2]
> - + ix86_cost->int_store[2])
> - - (ix86_cost->sse_load[sse_cost_idx] +
> - ix86_cost->sse_store[sse_cost_idx]));
> + {
> + if (optimize_bb_for_size_p (bb))
> + /* ??? This probably should account size difference
> + of SSE and integer load rather than full SSE load. */
> + igain -= COSTS_N_BYTES (8);
> + else
> + {
> + int cost = (m * (ix86_cost->int_load[2]
> + + ix86_cost->int_store[2])
> + - (ix86_cost->sse_load[sse_cost_idx] +
> + ix86_cost->sse_store[sse_cost_idx]));
> + igain += COSTS_N_INSNS (cost) / 2;
> + }
> + }
>
> switch (GET_CODE (src))
> {
> @@ -595,7 +626,7 @@ general_scalar_chain::compute_convert_gain ()
> igain += ix86_cost->shift_const - ix86_cost->sse_op;
>
> if (CONST_INT_P (XEXP (src, 0)))
> - igain -= vector_const_cost (XEXP (src, 0));
> + igain -= vector_const_cost (XEXP (src, 0), bb);
> break;
>
> case ROTATE:
> @@ -631,16 +662,17 @@ general_scalar_chain::compute_convert_gain ()
> igain += m * ix86_cost->add;
>
> if (CONST_INT_P (XEXP (src, 0)))
> - igain -= vector_const_cost (XEXP (src, 0));
> + igain -= vector_const_cost (XEXP (src, 0), bb);
> if (CONST_INT_P (XEXP (src, 1)))
> - igain -= vector_const_cost (XEXP (src, 1));
> + igain -= vector_const_cost (XEXP (src, 1), bb);
> if (MEM_P (XEXP (src, 1)))
> {
> - if (optimize_insn_for_size_p ())
> + if (optimize_bb_for_size_p (bb))
> igain -= COSTS_N_BYTES (m == 2 ? 3 : 5);
> else
> - igain += m * ix86_cost->int_load[2]
> - - ix86_cost->sse_load[sse_cost_idx];
> + igain += COSTS_N_INSNS
> + (m * ix86_cost->int_load[2]
> + - ix86_cost->sse_load[sse_cost_idx]) / 2;
> }
> break;
>
> @@ -698,7 +730,7 @@ general_scalar_chain::compute_convert_gain ()
> case CONST_INT:
> if (REG_P (dst))
> {
> - if (optimize_insn_for_size_p ())
> + if (optimize_bb_for_size_p (bb))
> {
> /* xor (2 bytes) vs. xorps (3 bytes). */
> if (src == const0_rtx)
> @@ -722,14 +754,14 @@ general_scalar_chain::compute_convert_gain ()
> /* DImode can be immediate for TARGET_64BIT
> and SImode always. */
> igain += m * COSTS_N_INSNS (1);
> - igain -= vector_const_cost (src);
> + igain -= vector_const_cost (src, bb);
> }
> }
> else if (MEM_P (dst))
> {
> igain += (m * ix86_cost->int_store[2]
> - ix86_cost->sse_store[sse_cost_idx]);
> - igain -= vector_const_cost (src);
> + igain -= vector_const_cost (src, bb);
> }
> break;
>
> @@ -737,13 +769,14 @@ general_scalar_chain::compute_convert_gain ()
> if (XVECEXP (XEXP (src, 1), 0, 0) == const0_rtx)
> {
> // movd (4 bytes) replaced with movdqa (4 bytes).
> - if (!optimize_insn_for_size_p ())
> - igain += ix86_cost->sse_to_integer - ix86_cost->xmm_move;
> + if (!optimize_bb_for_size_p (bb))
> + igain += COSTS_N_INSNS (ix86_cost->sse_to_integer
> + - ix86_cost->xmm_move) / 2;
> }
> else
> {
> // pshufd; movd replaced with pshufd.
> - if (optimize_insn_for_size_p ())
> + if (optimize_bb_for_size_p (bb))
> igain += COSTS_N_BYTES (4);
> else
> igain += ix86_cost->sse_to_integer;
> @@ -769,11 +802,11 @@ general_scalar_chain::compute_convert_gain ()
> /* Cost the integer to sse and sse to integer moves. */
> if (!optimize_function_for_size_p (cfun))
> {
> - cost += n_sse_to_integer * ix86_cost->sse_to_integer;
> + cost += n_sse_to_integer * COSTS_N_INSNS (ix86_cost->sse_to_integer) / 2;
> /* ??? integer_to_sse but we only have that in the RA cost table.
> Assume sse_to_integer/integer_to_sse are the same which they
> are at the moment. */
> - cost += n_integer_to_sse * ix86_cost->sse_to_integer;
> + cost += n_integer_to_sse * COSTS_N_INSNS (ix86_cost->integer_to_sse) / 2;
> }
> else if (TARGET_64BIT || smode == SImode)
> {
> @@ -1508,13 +1541,13 @@ general_scalar_chain::convert_insn (rtx_insn *insn)
> with numerous special cases. */
>
> static int
> -timode_immed_const_gain (rtx cst)
> +timode_immed_const_gain (rtx cst, basic_block bb)
> {
> /* movabsq vs. movabsq+vmovq+vunpacklqdq. */
> if (CONST_WIDE_INT_P (cst)
> && CONST_WIDE_INT_NUNITS (cst) == 2
> && CONST_WIDE_INT_ELT (cst, 0) == CONST_WIDE_INT_ELT (cst, 1))
> - return optimize_insn_for_size_p () ? -COSTS_N_BYTES (9)
> + return optimize_bb_for_size_p (bb) ? -COSTS_N_BYTES (9)
> : -COSTS_N_INSNS (2);
> /* 2x movabsq ~ vmovdqa. */
> return 0;
> @@ -1546,33 +1579,34 @@ timode_scalar_chain::compute_convert_gain ()
> rtx src = SET_SRC (def_set);
> rtx dst = SET_DEST (def_set);
> HOST_WIDE_INT op1val;
> + basic_block bb = BLOCK_FOR_INSN (insn);
> int scost, vcost;
> int igain = 0;
>
> switch (GET_CODE (src))
> {
> case REG:
> - if (optimize_insn_for_size_p ())
> + if (optimize_bb_for_size_p (bb))
> igain = MEM_P (dst) ? COSTS_N_BYTES (6) : COSTS_N_BYTES (3);
> else
> igain = COSTS_N_INSNS (1);
> break;
>
> case MEM:
> - igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (7)
> + igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (7)
> : COSTS_N_INSNS (1);
> break;
>
> case CONST_INT:
> if (MEM_P (dst)
> && standard_sse_constant_p (src, V1TImode))
> - igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (11) : 1;
> + igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (11) : 1;
> break;
>
> case CONST_WIDE_INT:
> /* 2 x mov vs. vmovdqa. */
> if (MEM_P (dst))
> - igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (3)
> + igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (3)
> : COSTS_N_INSNS (1);
> break;
>
> @@ -1587,14 +1621,14 @@ timode_scalar_chain::compute_convert_gain ()
> if (!MEM_P (dst))
> igain = COSTS_N_INSNS (1);
> if (CONST_SCALAR_INT_P (XEXP (src, 1)))
> - igain += timode_immed_const_gain (XEXP (src, 1));
> + igain += timode_immed_const_gain (XEXP (src, 1), bb);
> break;
>
> case ASHIFT:
> case LSHIFTRT:
> /* See ix86_expand_v1ti_shift. */
> op1val = INTVAL (XEXP (src, 1));
> - if (optimize_insn_for_size_p ())
> + if (optimize_bb_for_size_p (bb))
> {
> if (op1val == 64 || op1val == 65)
> scost = COSTS_N_BYTES (5);
> @@ -1628,7 +1662,7 @@ timode_scalar_chain::compute_convert_gain ()
> case ASHIFTRT:
> /* See ix86_expand_v1ti_ashiftrt. */
> op1val = INTVAL (XEXP (src, 1));
> - if (optimize_insn_for_size_p ())
> + if (optimize_bb_for_size_p (bb))
> {
> if (op1val == 64 || op1val == 127)
> scost = COSTS_N_BYTES (7);
> @@ -1706,7 +1740,7 @@ timode_scalar_chain::compute_convert_gain ()
> case ROTATERT:
> /* See ix86_expand_v1ti_rotate. */
> op1val = INTVAL (XEXP (src, 1));
> - if (optimize_insn_for_size_p ())
> + if (optimize_bb_for_size_p (bb))
> {
> scost = COSTS_N_BYTES (13);
> if ((op1val & 31) == 0)
> @@ -1738,16 +1772,16 @@ timode_scalar_chain::compute_convert_gain ()
> {
> if (GET_CODE (XEXP (src, 0)) == AND)
> /* and;and;or (9 bytes) vs. ptest (5 bytes). */
> - igain = optimize_insn_for_size_p() ? COSTS_N_BYTES (4)
> - : COSTS_N_INSNS (2);
> + igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (4)
> + : COSTS_N_INSNS (2);
> /* or (3 bytes) vs. ptest (5 bytes). */
> - else if (optimize_insn_for_size_p ())
> + else if (optimize_bb_for_size_p (bb))
> igain = -COSTS_N_BYTES (2);
> }
> else if (XEXP (src, 1) == const1_rtx)
> /* and;cmp -1 (7 bytes) vs. pcmpeqd;pxor;ptest (13 bytes). */
> - igain = optimize_insn_for_size_p() ? -COSTS_N_BYTES (6)
> - : -COSTS_N_INSNS (1);
> + igain = optimize_bb_for_size_p (bb) ? -COSTS_N_BYTES (6)
> + : -COSTS_N_INSNS (1);
> break;
>
> default:
> diff --git a/gcc/config/i386/i386-features.h b/gcc/config/i386/i386-features.h
> index 24b0c4ed0cd..7f7c0f78c96 100644
> --- a/gcc/config/i386/i386-features.h
> +++ b/gcc/config/i386/i386-features.h
> @@ -188,7 +188,7 @@ class general_scalar_chain : public scalar_chain
>
> private:
> void convert_insn (rtx_insn *insn) final override;
> - int vector_const_cost (rtx exp);
> + int vector_const_cost (rtx exp, basic_block bb);
> rtx convert_rotate (enum rtx_code, rtx op0, rtx op1, rtx_insn *insn);
> };
>
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index 6a38de30de4..18fa97a9eb0 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -179,6 +179,7 @@ struct processor_costs {
> const int xmm_move, ymm_move, /* cost of moving XMM and YMM register. */
> zmm_move;
> const int sse_to_integer; /* cost of moving SSE register to integer. */
> + const int integer_to_sse; /* cost of moving integer register to SSE. */
> const int gather_static, gather_per_elt; /* Cost of gather load is computed
> as static + per_item * nelts. */
> const int scatter_static, scatter_per_elt; /* Cost of gather store is
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index 6cce70a6c40..e5091293509 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -107,6 +107,7 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
> in 128bit, 256bit and 512bit */
> 4, 4, 6, /* cost of moving XMM,YMM,ZMM register */
> 4, /* cost of moving SSE register to integer. */
> + 4, /* cost of moving integer register to SSE. */
> COSTS_N_BYTES (5), 0, /* Gather load static, per_elt. */
> COSTS_N_BYTES (5), 0, /* Gather store static, per_elt. */
> 0, /* size of l1 cache */
> @@ -227,6 +228,7 @@ struct processor_costs i386_cost = { /* 386 specific costs */
> {4, 8, 16, 32, 64}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 3, /* cost of moving SSE register to integer. */
> + 3, /* cost of moving integer register to SSE. */
> 4, 4, /* Gather load static, per_elt. */
> 4, 4, /* Gather store static, per_elt. */
> 0, /* size of l1 cache */
> @@ -345,6 +347,7 @@ struct processor_costs i486_cost = { /* 486 specific costs */
> {4, 8, 16, 32, 64}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 3, /* cost of moving SSE register to integer. */
> + 3, /* cost of moving integer register to SSE. */
> 4, 4, /* Gather load static, per_elt. */
> 4, 4, /* Gather store static, per_elt. */
> 4, /* size of l1 cache. 486 has 8kB cache
> @@ -465,6 +468,7 @@ struct processor_costs pentium_cost = {
> {4, 8, 16, 32, 64}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 3, /* cost of moving SSE register to integer. */
> + 3, /* cost of moving integer register to SSE. */
> 4, 4, /* Gather load static, per_elt. */
> 4, 4, /* Gather store static, per_elt. */
> 8, /* size of l1 cache. */
> @@ -576,6 +580,7 @@ struct processor_costs lakemont_cost = {
> {4, 8, 16, 32, 64}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 3, /* cost of moving SSE register to integer. */
> + 3, /* cost of moving integer register to SSE. */
> 4, 4, /* Gather load static, per_elt. */
> 4, 4, /* Gather store static, per_elt. */
> 8, /* size of l1 cache. */
> @@ -702,6 +707,7 @@ struct processor_costs pentiumpro_cost = {
> {4, 8, 16, 32, 64}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 3, /* cost of moving SSE register to integer. */
> + 3, /* cost of moving integer register to SSE. */
> 4, 4, /* Gather load static, per_elt. */
> 4, 4, /* Gather store static, per_elt. */
> 8, /* size of l1 cache. */
> @@ -819,6 +825,7 @@ struct processor_costs geode_cost = {
> {2, 2, 8, 16, 32}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> 2, 2, /* Gather load static, per_elt. */
> 2, 2, /* Gather store static, per_elt. */
> 64, /* size of l1 cache. */
> @@ -936,6 +943,7 @@ struct processor_costs k6_cost = {
> {2, 2, 8, 16, 32}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> 2, 2, /* Gather load static, per_elt. */
> 2, 2, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -1059,6 +1067,7 @@ struct processor_costs athlon_cost = {
> {4, 4, 10, 10, 20}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 5, /* cost of moving SSE register to integer. */
> + 5, /* cost of moving integer register to SSE. */
> 4, 4, /* Gather load static, per_elt. */
> 4, 4, /* Gather store static, per_elt. */
> 64, /* size of l1 cache. */
> @@ -1184,6 +1193,7 @@ struct processor_costs k8_cost = {
> {4, 4, 10, 10, 20}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 5, /* cost of moving SSE register to integer. */
> + 5, /* cost of moving integer register to SSE. */
> 4, 4, /* Gather load static, per_elt. */
> 4, 4, /* Gather store static, per_elt. */
> 64, /* size of l1 cache. */
> @@ -1322,6 +1332,7 @@ struct processor_costs amdfam10_cost = {
> {4, 4, 5, 10, 20}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 3, /* cost of moving SSE register to integer. */
> + 3, /* cost of moving integer register to SSE. */
> 4, 4, /* Gather load static, per_elt. */
> 4, 4, /* Gather store static, per_elt. */
> 64, /* size of l1 cache. */
> @@ -1452,6 +1463,7 @@ const struct processor_costs bdver_cost = {
> {10, 10, 10, 40, 60}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 16, /* cost of moving SSE register to integer. */
> + 16, /* cost of moving integer register to SSE. */
> 12, 12, /* Gather load static, per_elt. */
> 10, 10, /* Gather store static, per_elt. */
> 16, /* size of l1 cache. */
> @@ -1603,6 +1615,7 @@ struct processor_costs znver1_cost = {
> {8, 8, 8, 16, 32}, /* cost of unaligned stores. */
> 2, 3, 6, /* cost of moving XMM,YMM,ZMM register. */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
> throughput 12. Approx 9 uops do not depend on vector size and every load
> is 7 uops. */
> @@ -1770,6 +1783,7 @@ struct processor_costs znver2_cost = {
> 2, 2, 3, /* cost of moving XMM,YMM,ZMM
> register. */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
> throughput 12. Approx 9 uops do not depend on vector size and every load
> is 7 uops. */
> @@ -1912,6 +1926,7 @@ struct processor_costs znver3_cost = {
> 2, 2, 3, /* cost of moving XMM,YMM,ZMM
> register. */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> /* VGATHERDPD is 15 uops and throughput is 4, VGATHERDPS is 23 uops,
> throughput 9. Approx 7 uops do not depend on vector size and every load
> is 4 uops. */
> @@ -2056,6 +2071,7 @@ struct processor_costs znver4_cost = {
> 2, 2, 2, /* cost of moving XMM,YMM,ZMM
> register. */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> /* VGATHERDPD is 17 uops and throughput is 4, VGATHERDPS is 24 uops,
> throughput 5. Approx 7 uops do not depend on vector size and every load
> is 5 uops. */
> @@ -2204,6 +2220,7 @@ struct processor_costs znver5_cost = {
> 2, 2, 2, /* cost of moving XMM,YMM,ZMM
> register. */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
>
> /* TODO: gather and scatter instructions are currently disabled in
> x86-tune.def. In some cases they are however a win, see PR116582
> @@ -2372,6 +2389,7 @@ struct processor_costs skylake_cost = {
> {8, 8, 8, 8, 16}, /* cost of unaligned stores. */
> 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> 20, 8, /* Gather load static, per_elt. */
> 22, 10, /* Gather store static, per_elt. */
> 64, /* size of l1 cache. */
> @@ -2508,6 +2526,7 @@ struct processor_costs icelake_cost = {
> {8, 8, 8, 8, 16}, /* cost of unaligned stores. */
> 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> 20, 8, /* Gather load static, per_elt. */
> 22, 10, /* Gather store static, per_elt. */
> 64, /* size of l1 cache. */
> @@ -2638,6 +2657,7 @@ struct processor_costs alderlake_cost = {
> {8, 8, 8, 10, 15}, /* cost of unaligned storess. */
> 2, 3, 4, /* cost of moving XMM,YMM,ZMM register */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> 18, 6, /* Gather load static, per_elt. */
> 18, 6, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -2761,6 +2781,7 @@ const struct processor_costs btver1_cost = {
> {10, 10, 12, 48, 96}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 14, /* cost of moving SSE register to integer. */
> + 14, /* cost of moving integer register to SSE. */
> 10, 10, /* Gather load static, per_elt. */
> 10, 10, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -2881,6 +2902,7 @@ const struct processor_costs btver2_cost = {
> {10, 10, 12, 48, 96}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 14, /* cost of moving SSE register to integer. */
> + 14, /* cost of moving integer register to SSE. */
> 10, 10, /* Gather load static, per_elt. */
> 10, 10, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -3000,6 +3022,7 @@ struct processor_costs pentium4_cost = {
> {32, 32, 32, 64, 128}, /* cost of unaligned stores. */
> 12, 24, 48, /* cost of moving XMM,YMM,ZMM register */
> 20, /* cost of moving SSE register to integer. */
> + 20, /* cost of moving integer register to SSE. */
> 16, 16, /* Gather load static, per_elt. */
> 16, 16, /* Gather store static, per_elt. */
> 8, /* size of l1 cache. */
> @@ -3122,6 +3145,7 @@ struct processor_costs nocona_cost = {
> {24, 24, 24, 48, 96}, /* cost of unaligned stores. */
> 6, 12, 24, /* cost of moving XMM,YMM,ZMM register */
> 20, /* cost of moving SSE register to integer. */
> + 20, /* cost of moving integer register to SSE. */
> 12, 12, /* Gather load static, per_elt. */
> 12, 12, /* Gather store static, per_elt. */
> 8, /* size of l1 cache. */
> @@ -3242,6 +3266,7 @@ struct processor_costs atom_cost = {
> {16, 16, 16, 32, 64}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 8, /* cost of moving SSE register to integer. */
> + 8, /* cost of moving integer register to SSE. */
> 8, 8, /* Gather load static, per_elt. */
> 8, 8, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -3362,6 +3387,7 @@ struct processor_costs slm_cost = {
> {16, 16, 16, 32, 64}, /* cost of unaligned stores. */
> 2, 4, 8, /* cost of moving XMM,YMM,ZMM register */
> 8, /* cost of moving SSE register to integer. */
> + 8, /* cost of moving integer register to SSE. */
> 8, 8, /* Gather load static, per_elt. */
> 8, 8, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -3494,6 +3520,7 @@ struct processor_costs tremont_cost = {
> {6, 6, 6, 10, 15}, /* cost of unaligned storess. */
> 2, 3, 4, /* cost of moving XMM,YMM,ZMM register */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> 18, 6, /* Gather load static, per_elt. */
> 18, 6, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -3616,6 +3643,7 @@ struct processor_costs intel_cost = {
> {10, 10, 10, 10, 10}, /* cost of unaligned loads. */
> 2, 2, 2, /* cost of moving XMM,YMM,ZMM register */
> 4, /* cost of moving SSE register to integer. */
> + 4, /* cost of moving integer register to SSE. */
> 6, 6, /* Gather load static, per_elt. */
> 6, 6, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -3731,15 +3759,16 @@ struct processor_costs lujiazui_cost = {
> {6, 6, 6}, /* cost of loading integer registers
> in QImode, HImode and SImode.
> Relative to reg-reg move (2). */
> - {6, 6, 6}, /* cost of storing integer registers. */
> + {6, 6, 6}, /* cost of storing integer registers. */
> {6, 6, 6, 10, 15}, /* cost of loading SSE register
> - in 32bit, 64bit, 128bit, 256bit and 512bit. */
> + in 32bit, 64bit, 128bit, 256bit and 512bit. */
> {6, 6, 6, 10, 15}, /* cost of storing SSE register
> - in 32bit, 64bit, 128bit, 256bit and 512bit. */
> + in 32bit, 64bit, 128bit, 256bit and 512bit. */
> {6, 6, 6, 10, 15}, /* cost of unaligned loads. */
> {6, 6, 6, 10, 15}, /* cost of unaligned storess. */
> - 2, 3, 4, /* cost of moving XMM,YMM,ZMM register. */
> - 6, /* cost of moving SSE register to integer. */
> + 2, 3, 4, /* cost of moving XMM,YMM,ZMM register. */
> + 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> 18, 6, /* Gather load static, per_elt. */
> 18, 6, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -3864,6 +3893,7 @@ struct processor_costs yongfeng_cost = {
> {8, 8, 8, 12, 15}, /* cost of unaligned storess. */
> 2, 3, 4, /* cost of moving XMM,YMM,ZMM register. */
> 8, /* cost of moving SSE register to integer. */
> + 8, /* cost of moving integer register to SSE. */
> 18, 6, /* Gather load static, per_elt. */
> 18, 6, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -3987,6 +4017,7 @@ struct processor_costs shijidadao_cost = {
> {8, 8, 8, 12, 15}, /* cost of unaligned storess. */
> 2, 3, 4, /* cost of moving XMM,YMM,ZMM register. */
> 8, /* cost of moving SSE register to integer. */
> + 8, /* cost of moving integer register to SSE. */
> 18, 6, /* Gather load static, per_elt. */
> 18, 6, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -4116,6 +4147,7 @@ struct processor_costs generic_cost = {
> {6, 6, 6, 10, 15}, /* cost of unaligned storess. */
> 2, 3, 4, /* cost of moving XMM,YMM,ZMM register */
> 6, /* cost of moving SSE register to integer. */
> + 6, /* cost of moving integer register to SSE. */
> 18, 6, /* Gather load static, per_elt. */
> 18, 6, /* Gather store static, per_elt. */
> 32, /* size of l1 cache. */
> @@ -4249,6 +4281,7 @@ struct processor_costs core_cost = {
> {6, 6, 6, 6, 12}, /* cost of unaligned stores. */
> 2, 2, 4, /* cost of moving XMM,YMM,ZMM register */
> 2, /* cost of moving SSE register to integer. */
> + 2, /* cost of moving integer register to SSE. */
> /* VGATHERDPD is 7 uops, rec throughput 5, while VGATHERDPD is 9 uops,
> rec. throughput 6.
> So 5 uops statically and one uops per load. */
> diff --git a/gcc/testsuite/gcc.target/i386/minmax-6.c b/gcc/testsuite/gcc.target/i386/minmax-6.c
> index 615f919ba0a..23f61c52d80 100644
> --- a/gcc/testsuite/gcc.target/i386/minmax-6.c
> +++ b/gcc/testsuite/gcc.target/i386/minmax-6.c
> @@ -15,4 +15,4 @@ UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> /* We do not want the RA to spill %esi for it's dual-use but using
> pmaxsd is OK. */
> /* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */
> -/* { dg-final { scan-assembler "pmaxsd" } } */
> +/* { dg-final { scan-assembler "pmaxsd" { xfail *-*-* } } } */
> diff --git a/gcc/testsuite/gcc.target/i386/minmax-7.c b/gcc/testsuite/gcc.target/i386/minmax-7.c
> index 619a93946c7..b2cb1c24d7e 100644
> --- a/gcc/testsuite/gcc.target/i386/minmax-7.c
> +++ b/gcc/testsuite/gcc.target/i386/minmax-7.c
> @@ -17,4 +17,4 @@ void bar (int aleft, int axcenter)
> /* We do not want the RA to spill %esi for it's dual-use but using
> pminsd is OK. */
> /* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */
> -/* { dg-final { scan-assembler "pminsd" } } */
> +/* { dg-final { scan-assembler "pminsd" { xfail *-*-* } } } */