On Sun, May 11, 2025 at 4:28 AM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> Hi,
> this patch fixes some of the problems with costing in the scalar to vector
> pass.  In particular
This caused: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120215

> 1) the pass uses optimize_insn_for_size_p, which is intended to be used by
>    expanders and splitters and requires the optimization pass to use
>    set_rtl_profile (bb) for the currently processed bb.
>    This is not done, so we get random stale info about the hotness of insns.
> 2) register allocator move costs are all relative to the integer reg-reg
>    move, which has a cost of 2, so they are (except for the size tables and
>    i386) the instruction latency multiplied by 2.
>    These costs have been duplicated and are now used in combination with
>    rtx costs, which are all based on COSTS_N_INSNS, i.e. latency multiplied
>    by 4.
>    Some of the vectorizer costing contains COSTS_N_INSNS (move_cost) / 2
>    to compensate, but some new code does not.  This patch adds the missing
>    compensation.
>
>    Perhaps we should update the cost tables to use COSTS_N_INSNS everywhere,
>    but I think we want to fix the inconsistencies first.  Also the tables
>    would get visually much longer, since we have many move costs and
>    COSTS_N_INSNS is a lot of characters.
> 3) the variable m, which decides how much to multiply the integer variant
>    (to account for the fact that with -m32 all 64-bit computations need 2
>    instructions), is declared unsigned, which makes the signed computation
>    of the instruction gain happen in an unsigned type and breaks e.g. for
>    division.
> 4) I added integer_to_sse costs, which are currently all duplicates of
>    sse_to_integer.  AMD chips are asymmetric and moving in one direction is
>    faster than in the other.  I will change the costs incrementally once
>    the vectorizer part is fixed up, too.
>
> There are two failures, gcc.target/i386/minmax-6.c and
> gcc.target/i386/minmax-7.c.  Both test STV on Haswell; the conversion no
> longer happens since SSE->INT and INT->SSE moves are now more expensive.
>
> There is only one instruction to convert:
>
>   Computing gain for chain #1...
>     Instruction gain 8 for    11: {r110:SI=smax(r116:SI,0);clobber flags:CC;}
>     Instruction conversion gain: 8
>     Registers conversion cost: 8   <- this is integer_to_sse and sse_to_integer
>     Total gain: 0
>
> The total gain used to be 4, since the patch doubles the conversion costs.
> According to Agner Fog's tables the cost should be 1 cycle, which is correct
> here.
>
> The final code generated is:
>
>         vmovd   %esi, %xmm0            * latency 1
>         cmpl    %edx, %esi
>         je      .L2
>         vpxor   %xmm1, %xmm1, %xmm1    * latency 1
>         vpmaxsd %xmm1, %xmm0, %xmm0    * latency 1
>         vmovd   %xmm0, %eax            * latency 1
>         imull   %edx, %eax
>         cltq
>         movzwl  (%rdi,%rax,2), %eax
>         ret
>
>         cmpl    %edx, %esi
>         je      .L2
>         xorl    %eax, %eax             * latency 1
>         testl   %esi, %esi             * latency 1
>         cmovs   %eax, %esi             * latency 2
>         imull   %edx, %esi
>         movslq  %esi, %rsi
>         movzwl  (%rdi,%rsi,2), %eax
>         ret
>
> The instructions annotated with latency info are the ones that actually
> differ.  The unconverted code has a sum of latencies of 4 and a critical
> path latency of 3; the converted code likewise has a sum of latencies of 4
> and a critical path latency of 3 (vmovd+vpmaxsd+vmovd).  So I do not quite
> see why it should be a win.
>
> There is also a bug in costing MIN/MAX:
>
>       case ABS:
>       case SMAX:
>       case SMIN:
>       case UMAX:
>       case UMIN:
>         /* We do not have any conditional move cost, estimate it as a
>            reg-reg move.  Comparisons are costed as adds.  */
>         igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
>         /* Integer SSE ops are all costed the same.  */
>         igain -= ix86_cost->sse_op;
>         break;
>
> Now COSTS_N_INSNS (2) is not quite right, since a reg-reg move should cost
> 1 or perhaps 0.  For Haswell, cmov really is 2 cycles, but I guess we want
> to have that in the cost vectors like all other instructions.
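Two notes to make the costing points above concrete.  On point 2: the RA
move tables store twice the latency (the integer reg-reg move is 2), while
COSTS_N_INSNS scales latency by 4, so COSTS_N_INSNS (move_cost) / 2 converts
a move-table entry to the rtx-cost scale; e.g. a latency-1 move has table
cost 2 and COSTS_N_INSNS (2) / 2 == COSTS_N_INSNS (1).  On point 3, the
following minimal sketch (made-up cost values, not the pass's actual code)
shows how an unsigned m breaks the signed gain computation once a division
is involved:

  /* Illustration only: an unsigned multiplier forces the whole gain
     expression into unsigned arithmetic.  */
  #include <stdio.h>

  int
  main (void)
  {
    unsigned m_old = 2;           /* declared unsigned before the patch  */
    int m_new = 2;                /* declared int after the patch        */
    int scalar = 4, vector = 20;  /* hypothetical per-insn costs         */

    /* 2*4 - 20 wraps to 4294967284u; the division then yields
       2147483642 instead of a negative gain.  */
    int gain_old = (m_old * scalar - vector) / 2;
    /* Signed arithmetic gives the intended -6.  */
    int gain_new = (m_new * scalar - vector) / 2;

    printf ("unsigned m: %d\nsigned m:   %d\n", gain_old, gain_new);
    return 0;
  }

This prints 2147483642 for the unsigned variant and -6 for the signed one,
which is why the patch changes "unsigned m" to "int m".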
>
> I am not sure this is really a win in this case (the other minmax testcases
> seem to make sense).  I have xfailed it for now and will check whether it
> affects SPEC on the LNT testers.
>
> Bootstrapped/regtested x86_64-linux, committed.
>
> I will proceed with similar fixes on the vectorizer cost side.  Sadly those
> introduce quite a few differences in the testsuite (partly triggered by
> other costing problems, such as the one with scatter/gather).
>
> gcc/ChangeLog:
>
>         * config/i386/i386-features.cc
>         (general_scalar_chain::vector_const_cost): Add BB parameter; handle
>         size costs; use COSTS_N_INSNS to compute move costs.
>         (general_scalar_chain::compute_convert_gain): Use
>         optimize_bb_for_size_p instead of optimize_insn_for_size_p; use
>         COSTS_N_INSNS to compute move costs; update calls of
>         general_scalar_chain::vector_const_cost; use
>         ix86_cost->integer_to_sse.
>         (timode_immed_const_gain): Add bb parameter; use
>         optimize_bb_for_size_p.
>         (timode_scalar_chain::compute_convert_gain): Use
>         optimize_bb_for_size_p.
>         * config/i386/i386-features.h (class general_scalar_chain): Update
>         prototype of vector_const_cost.
>         * config/i386/i386.h (struct processor_costs): Add integer_to_sse.
>         * config/i386/x86-tune-costs.h (struct processor_costs): Copy
>         sse_to_integer to integer_to_sse everywhere.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/i386/minmax-6.c: xfail test that pmax is used.
>         * gcc.target/i386/minmax-7.c: xfail test that pmin is used.
>
> diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
> index 1ba5ac4faa4..54b3f6d33b2 100644
> --- a/gcc/config/i386/i386-features.cc
> +++ b/gcc/config/i386/i386-features.cc
> @@ -518,15 +518,17 @@ scalar_chain::build (bitmap candidates, unsigned insn_uid, bitmap disallowed)
>     instead of using a scalar one.  */
>
>  int
> -general_scalar_chain::vector_const_cost (rtx exp)
> +general_scalar_chain::vector_const_cost (rtx exp, basic_block bb)
>  {
>    gcc_assert (CONST_INT_P (exp));
>
>    if (standard_sse_constant_p (exp, vmode))
>      return ix86_cost->sse_op;
> +  if (optimize_bb_for_size_p (bb))
> +    return COSTS_N_BYTES (8);
>    /* We have separate costs for SImode and DImode, use SImode costs
>       for smaller modes.  */
> -  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
> +  return COSTS_N_INSNS (ix86_cost->sse_load[smode == DImode ? 1 : 0]) / 2;
>  }
>
>  /* Compute a gain for chain conversion.  */
> @@ -547,7 +549,7 @@ general_scalar_chain::compute_convert_gain ()
>       smaller modes than SImode the int load/store costs need to be
>       adjusted as well.  */
>    unsigned sse_cost_idx = smode == DImode ? 1 : 0;
> -  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
> +  int m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
>
>    EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
>      {
> @@ -555,26 +557,55 @@ general_scalar_chain::compute_convert_gain ()
>        rtx def_set = single_set (insn);
>        rtx src = SET_SRC (def_set);
>        rtx dst = SET_DEST (def_set);
> +      basic_block bb = BLOCK_FOR_INSN (insn);
>        int igain = 0;
>
>        if (REG_P (src) && REG_P (dst))
> -        igain += 2 * m - ix86_cost->xmm_move;
> +        {
> +          if (optimize_bb_for_size_p (bb))
> +            /* reg-reg move is 2 bytes, while SSE 3.  */
> +            igain += COSTS_N_BYTES (2 * m - 3);
> +          else
> +            /* Move costs are normalized to reg-reg move having cost 2.  */
> +            igain += COSTS_N_INSNS (2 * m - ix86_cost->xmm_move) / 2;
> +        }
>        else if (REG_P (src) && MEM_P (dst))
> -        igain
> -          += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
> +        {
> +          if (optimize_bb_for_size_p (bb))
> +            /* Integer load/store is 3+ bytes and SSE 4+.  */
> +            igain += COSTS_N_BYTES (3 * m - 4);
> +          else
> +            igain
> +              += COSTS_N_INSNS (m * ix86_cost->int_store[2]
> +                                - ix86_cost->sse_store[sse_cost_idx]) / 2;
> +        }
>        else if (MEM_P (src) && REG_P (dst))
> -        igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
> +        {
> +          if (optimize_bb_for_size_p (bb))
> +            igain += COSTS_N_BYTES (3 * m - 4);
> +          else
> +            igain += COSTS_N_INSNS (m * ix86_cost->int_load[2]
> +                                    - ix86_cost->sse_load[sse_cost_idx]) / 2;
> +        }
>        else
>          {
>            /* For operations on memory operands, include the overhead
>               of explicit load and store instructions.  */
>            if (MEM_P (dst))
> -            igain += optimize_insn_for_size_p ()
> -                     ? -COSTS_N_BYTES (8)
> -                     : (m * (ix86_cost->int_load[2]
> -                             + ix86_cost->int_store[2])
> -                        - (ix86_cost->sse_load[sse_cost_idx] +
> -                           ix86_cost->sse_store[sse_cost_idx]));
> +            {
> +              if (optimize_bb_for_size_p (bb))
> +                /* ??? This probably should account size difference
> +                   of SSE and integer load rather than full SSE load.  */
> +                igain -= COSTS_N_BYTES (8);
> +              else
> +                {
> +                  int cost = (m * (ix86_cost->int_load[2]
> +                                   + ix86_cost->int_store[2])
> +                              - (ix86_cost->sse_load[sse_cost_idx] +
> +                                 ix86_cost->sse_store[sse_cost_idx]));
> +                  igain += COSTS_N_INSNS (cost) / 2;
> +                }
> +            }
>
>            switch (GET_CODE (src))
>              {
> @@ -595,7 +626,7 @@ general_scalar_chain::compute_convert_gain ()
>                  igain += ix86_cost->shift_const - ix86_cost->sse_op;
>
>                if (CONST_INT_P (XEXP (src, 0)))
> -                igain -= vector_const_cost (XEXP (src, 0));
> +                igain -= vector_const_cost (XEXP (src, 0), bb);
>                break;
>
>              case ROTATE:
> @@ -631,16 +662,17 @@ general_scalar_chain::compute_convert_gain ()
>                  igain += m * ix86_cost->add;
>
>                if (CONST_INT_P (XEXP (src, 0)))
> -                igain -= vector_const_cost (XEXP (src, 0));
> +                igain -= vector_const_cost (XEXP (src, 0), bb);
>                if (CONST_INT_P (XEXP (src, 1)))
> -                igain -= vector_const_cost (XEXP (src, 1));
> +                igain -= vector_const_cost (XEXP (src, 1), bb);
>                if (MEM_P (XEXP (src, 1)))
>                  {
> -                  if (optimize_insn_for_size_p ())
> +                  if (optimize_bb_for_size_p (bb))
>                      igain -= COSTS_N_BYTES (m == 2 ? 3 : 5);
>                    else
> -                    igain += m * ix86_cost->int_load[2]
> -                             - ix86_cost->sse_load[sse_cost_idx];
> +                    igain += COSTS_N_INSNS
> +                               (m * ix86_cost->int_load[2]
> +                                - ix86_cost->sse_load[sse_cost_idx]) / 2;
>                  }
>                break;
>
> @@ -698,7 +730,7 @@
>              case CONST_INT:
>                if (REG_P (dst))
>                  {
> -                  if (optimize_insn_for_size_p ())
> +                  if (optimize_bb_for_size_p (bb))
>                      {
>                        /* xor (2 bytes) vs. xorps (3 bytes).  */
>                        if (src == const0_rtx)
> @@ -722,14 +754,14 @@
>                        /* DImode can be immediate for TARGET_64BIT
>                           and SImode always.  */
>                        igain += m * COSTS_N_INSNS (1);
> -                      igain -= vector_const_cost (src);
> +                      igain -= vector_const_cost (src, bb);
>                      }
>                  }
>                else if (MEM_P (dst))
>                  {
>                    igain += (m * ix86_cost->int_store[2]
>                              - ix86_cost->sse_store[sse_cost_idx]);
> -                  igain -= vector_const_cost (src);
> +                  igain -= vector_const_cost (src, bb);
>                  }
>                break;
>
> @@ -737,13 +769,14 @@
>                if (XVECEXP (XEXP (src, 1), 0, 0) == const0_rtx)
>                  {
>                    // movd (4 bytes) replaced with movdqa (4 bytes).
> -                  if (!optimize_insn_for_size_p ())
> -                    igain += ix86_cost->sse_to_integer - ix86_cost->xmm_move;
> +                  if (!optimize_bb_for_size_p (bb))
> +                    igain += COSTS_N_INSNS (ix86_cost->sse_to_integer
> +                                            - ix86_cost->xmm_move) / 2;
>                  }
>                else
>                  {
>                    // pshufd; movd replaced with pshufd.
> -                  if (optimize_insn_for_size_p ())
> +                  if (optimize_bb_for_size_p (bb))
>                      igain += COSTS_N_BYTES (4);
>                    else
>                      igain += ix86_cost->sse_to_integer;
> @@ -769,11 +802,11 @@
>    /* Cost the integer to sse and sse to integer moves.  */
>    if (!optimize_function_for_size_p (cfun))
>      {
> -      cost += n_sse_to_integer * ix86_cost->sse_to_integer;
> +      cost += n_sse_to_integer * COSTS_N_INSNS (ix86_cost->sse_to_integer) / 2;
>        /* ??? integer_to_sse but we only have that in the RA cost table.
>           Assume sse_to_integer/integer_to_sse are the same which they
>           are at the moment.  */
> -      cost += n_integer_to_sse * ix86_cost->sse_to_integer;
> +      cost += n_integer_to_sse * COSTS_N_INSNS (ix86_cost->integer_to_sse) / 2;
>      }
>    else if (TARGET_64BIT || smode == SImode)
>      {
> @@ -1508,13 +1541,13 @@ general_scalar_chain::convert_insn (rtx_insn *insn)
>     with numerous special cases.  */
>
>  static int
> -timode_immed_const_gain (rtx cst)
> +timode_immed_const_gain (rtx cst, basic_block bb)
>  {
>    /* movabsq vs. movabsq+vmovq+vunpacklqdq.  */
>    if (CONST_WIDE_INT_P (cst)
>        && CONST_WIDE_INT_NUNITS (cst) == 2
>        && CONST_WIDE_INT_ELT (cst, 0) == CONST_WIDE_INT_ELT (cst, 1))
> -    return optimize_insn_for_size_p () ? -COSTS_N_BYTES (9)
> +    return optimize_bb_for_size_p (bb) ? -COSTS_N_BYTES (9)
>                                         : -COSTS_N_INSNS (2);
>    /* 2x movabsq ~ vmovdqa.  */
>    return 0;
> @@ -1546,33 +1579,34 @@ timode_scalar_chain::compute_convert_gain ()
>        rtx src = SET_SRC (def_set);
>        rtx dst = SET_DEST (def_set);
>        HOST_WIDE_INT op1val;
> +      basic_block bb = BLOCK_FOR_INSN (insn);
>        int scost, vcost;
>        int igain = 0;
>
>        switch (GET_CODE (src))
>          {
>          case REG:
> -          if (optimize_insn_for_size_p ())
> +          if (optimize_bb_for_size_p (bb))
>              igain = MEM_P (dst) ? COSTS_N_BYTES (6) : COSTS_N_BYTES (3);
>            else
>              igain = COSTS_N_INSNS (1);
>            break;
>
>          case MEM:
> -          igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (7)
> +          igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (7)
>                                                : COSTS_N_INSNS (1);
>            break;
>
>          case CONST_INT:
>            if (MEM_P (dst)
>                && standard_sse_constant_p (src, V1TImode))
> -            igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (11) : 1;
> +            igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (11) : 1;
>            break;
>
>          case CONST_WIDE_INT:
>            /* 2 x mov vs. vmovdqa.  */
>            if (MEM_P (dst))
> -            igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (3)
> +            igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (3)
>                                                  : COSTS_N_INSNS (1);
>            break;
>
> @@ -1587,14 +1621,14 @@
>            if (!MEM_P (dst))
>              igain = COSTS_N_INSNS (1);
>            if (CONST_SCALAR_INT_P (XEXP (src, 1)))
> -            igain += timode_immed_const_gain (XEXP (src, 1));
> +            igain += timode_immed_const_gain (XEXP (src, 1), bb);
>            break;
>
>          case ASHIFT:
>          case LSHIFTRT:
>            /* See ix86_expand_v1ti_shift.  */
>            op1val = INTVAL (XEXP (src, 1));
> -          if (optimize_insn_for_size_p ())
> +          if (optimize_bb_for_size_p (bb))
>              {
>                if (op1val == 64 || op1val == 65)
>                  scost = COSTS_N_BYTES (5);
> @@ -1628,7 +1662,7 @@
>          case ASHIFTRT:
>            /* See ix86_expand_v1ti_ashiftrt.  */
>            op1val = INTVAL (XEXP (src, 1));
> -          if (optimize_insn_for_size_p ())
> +          if (optimize_bb_for_size_p (bb))
>              {
>                if (op1val == 64 || op1val == 127)
>                  scost = COSTS_N_BYTES (7);
> @@ -1706,7 +1740,7 @@
>          case ROTATERT:
>            /* See ix86_expand_v1ti_rotate.  */
>            op1val = INTVAL (XEXP (src, 1));
> -          if (optimize_insn_for_size_p ())
> +          if (optimize_bb_for_size_p (bb))
>              {
>                scost = COSTS_N_BYTES (13);
>                if ((op1val & 31) == 0)
> @@ -1738,16 +1772,16 @@
>              {
>                if (GET_CODE (XEXP (src, 0)) == AND)
>                  /* and;and;or (9 bytes) vs. ptest (5 bytes).  */
> -                igain = optimize_insn_for_size_p() ? COSTS_N_BYTES (4)
> -                                                   : COSTS_N_INSNS (2);
> +                igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (4)
> +                                                    : COSTS_N_INSNS (2);
>                /* or (3 bytes) vs. ptest (5 bytes).  */
> -              else if (optimize_insn_for_size_p ())
> +              else if (optimize_bb_for_size_p (bb))
>                  igain = -COSTS_N_BYTES (2);
>              }
>            else if (XEXP (src, 1) == const1_rtx)
>              /* and;cmp -1 (7 bytes) vs. pcmpeqd;pxor;ptest (13 bytes).  */
> -            igain = optimize_insn_for_size_p() ? -COSTS_N_BYTES (6)
> -                                               : -COSTS_N_INSNS (1);
> +            igain = optimize_bb_for_size_p (bb) ? -COSTS_N_BYTES (6)
> +                                                : -COSTS_N_INSNS (1);
>            break;
>
>          default:
> diff --git a/gcc/config/i386/i386-features.h b/gcc/config/i386/i386-features.h
> index 24b0c4ed0cd..7f7c0f78c96 100644
> --- a/gcc/config/i386/i386-features.h
> +++ b/gcc/config/i386/i386-features.h
> @@ -188,7 +188,7 @@ class general_scalar_chain : public scalar_chain
>
>   private:
>    void convert_insn (rtx_insn *insn) final override;
> -  int vector_const_cost (rtx exp);
> +  int vector_const_cost (rtx exp, basic_block bb);
>    rtx convert_rotate (enum rtx_code, rtx op0, rtx op1, rtx_insn *insn);
>  };
>
> diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> index 6a38de30de4..18fa97a9eb0 100644
> --- a/gcc/config/i386/i386.h
> +++ b/gcc/config/i386/i386.h
> @@ -179,6 +179,7 @@ struct processor_costs {
>    const int xmm_move, ymm_move,      /* cost of moving XMM and YMM register.  */
>              zmm_move;
>    const int sse_to_integer;          /* cost of moving SSE register to integer.  */
> +  const int integer_to_sse;          /* cost of moving integer register to SSE.  */
>    const int gather_static, gather_per_elt; /* Cost of gather load is computed
>                                    as static + per_item * nelts.  */
>    const int scatter_static, scatter_per_elt; /* Cost of gather store is
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index 6cce70a6c40..e5091293509 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -107,6 +107,7 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
>                                        in 128bit, 256bit and 512bit */
>    4, 4, 6,                          /* cost of moving XMM,YMM,ZMM register */
>    4,                                /* cost of moving SSE register to integer.  */
> +  4,                                /* cost of moving integer register to SSE.  */
>    COSTS_N_BYTES (5), 0,             /* Gather load static, per_elt.  */
>    COSTS_N_BYTES (5), 0,             /* Gather store static, per_elt.  */
>    0,                                /* size of l1 cache  */
> @@ -227,6 +228,7 @@ struct processor_costs i386_cost = {  /* 386 specific costs */
>    {4, 8, 16, 32, 64},               /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    3,                                /* cost of moving SSE register to integer.  */
> +  3,                                /* cost of moving integer register to SSE.  */
>    4, 4,                             /* Gather load static, per_elt.  */
>    4, 4,                             /* Gather store static, per_elt.  */
>    0,                                /* size of l1 cache  */
> @@ -345,6 +347,7 @@ struct processor_costs i486_cost = {  /* 486 specific costs */
>    {4, 8, 16, 32, 64},               /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    3,                                /* cost of moving SSE register to integer.  */
> +  3,                                /* cost of moving integer register to SSE.  */
>    4, 4,                             /* Gather load static, per_elt.  */
>    4, 4,                             /* Gather store static, per_elt.  */
>    4,                                /* size of l1 cache.  486 has 8kB cache
> @@ -465,6 +468,7 @@ struct processor_costs pentium_cost = {
>    {4, 8, 16, 32, 64},               /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    3,                                /* cost of moving SSE register to integer.  */
> +  3,                                /* cost of moving integer register to SSE.  */
>    4, 4,                             /* Gather load static, per_elt.  */
>    4, 4,                             /* Gather store static, per_elt.  */
>    8,                                /* size of l1 cache.  */
> @@ -576,6 +580,7 @@ struct processor_costs lakemont_cost = {
>    {4, 8, 16, 32, 64},               /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    3,                                /* cost of moving SSE register to integer.  */
> +  3,                                /* cost of moving integer register to SSE.  */
>    4, 4,                             /* Gather load static, per_elt.  */
>    4, 4,                             /* Gather store static, per_elt.  */
>    8,                                /* size of l1 cache.  */
> @@ -702,6 +707,7 @@ struct processor_costs pentiumpro_cost = {
>    {4, 8, 16, 32, 64},               /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    3,                                /* cost of moving SSE register to integer.  */
> +  3,                                /* cost of moving integer register to SSE.  */
>    4, 4,                             /* Gather load static, per_elt.  */
>    4, 4,                             /* Gather store static, per_elt.  */
>    8,                                /* size of l1 cache.  */
> @@ -819,6 +825,7 @@ struct processor_costs geode_cost = {
>    {2, 2, 8, 16, 32},                /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    2, 2,                             /* Gather load static, per_elt.  */
>    2, 2,                             /* Gather store static, per_elt.  */
>    64,                               /* size of l1 cache.  */
> @@ -936,6 +943,7 @@ struct processor_costs k6_cost = {
>    {2, 2, 8, 16, 32},                /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    2, 2,                             /* Gather load static, per_elt.  */
>    2, 2,                             /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -1059,6 +1067,7 @@ struct processor_costs athlon_cost = {
>    {4, 4, 10, 10, 20},               /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    5,                                /* cost of moving SSE register to integer.  */
> +  5,                                /* cost of moving integer register to SSE.  */
>    4, 4,                             /* Gather load static, per_elt.  */
>    4, 4,                             /* Gather store static, per_elt.  */
>    64,                               /* size of l1 cache.  */
> @@ -1184,6 +1193,7 @@ struct processor_costs k8_cost = {
>    {4, 4, 10, 10, 20},               /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    5,                                /* cost of moving SSE register to integer.  */
> +  5,                                /* cost of moving integer register to SSE.  */
>    4, 4,                             /* Gather load static, per_elt.  */
>    4, 4,                             /* Gather store static, per_elt.  */
>    64,                               /* size of l1 cache.  */
> @@ -1322,6 +1332,7 @@ struct processor_costs amdfam10_cost = {
>    {4, 4, 5, 10, 20},                /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    3,                                /* cost of moving SSE register to integer.  */
> +  3,                                /* cost of moving integer register to SSE.  */
>    4, 4,                             /* Gather load static, per_elt.  */
>    4, 4,                             /* Gather store static, per_elt.  */
>    64,                               /* size of l1 cache.  */
> @@ -1452,6 +1463,7 @@ const struct processor_costs bdver_cost = {
>    {10, 10, 10, 40, 60},             /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    16,                               /* cost of moving SSE register to integer.  */
> +  16,                               /* cost of moving integer register to SSE.  */
>    12, 12,                           /* Gather load static, per_elt.  */
>    10, 10,                           /* Gather store static, per_elt.  */
>    16,                               /* size of l1 cache.  */
> @@ -1603,6 +1615,7 @@ struct processor_costs znver1_cost = {
>    {8, 8, 8, 16, 32},                /* cost of unaligned stores.  */
>    2, 3, 6,                          /* cost of moving XMM,YMM,ZMM register.  */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
>       throughput 12.  Approx 9 uops do not depend on vector size and every load
>       is 7 uops.  */
> @@ -1770,6 +1783,7 @@ struct processor_costs znver2_cost = {
>    2, 2, 3,                          /* cost of moving XMM,YMM,ZMM register.  */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    /* VGATHERDPD is 23 uops and throughput is 9, VGATHERDPD is 35 uops,
>       throughput 12.  Approx 9 uops do not depend on vector size and every load
>       is 7 uops.  */
> @@ -1912,6 +1926,7 @@ struct processor_costs znver3_cost = {
>    2, 2, 3,                          /* cost of moving XMM,YMM,ZMM register.  */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    /* VGATHERDPD is 15 uops and throughput is 4, VGATHERDPS is 23 uops,
>       throughput 9.  Approx 7 uops do not depend on vector size and every load
>       is 4 uops.  */
> @@ -2056,6 +2071,7 @@ struct processor_costs znver4_cost = {
>    2, 2, 2,                          /* cost of moving XMM,YMM,ZMM register.  */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    /* VGATHERDPD is 17 uops and throughput is 4, VGATHERDPS is 24 uops,
>       throughput 5.  Approx 7 uops do not depend on vector size and every load
>       is 5 uops.  */
> @@ -2204,6 +2220,7 @@ struct processor_costs znver5_cost = {
>    2, 2, 2,                          /* cost of moving XMM,YMM,ZMM register.  */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>
>    /* TODO: gather and scatter instructions are currently disabled in
>       x86-tune.def.  In some cases they are however a win, see PR116582
> @@ -2372,6 +2389,7 @@ struct processor_costs skylake_cost = {
>    {8, 8, 8, 8, 16},                 /* cost of unaligned stores.  */
>    2, 2, 4,                          /* cost of moving XMM,YMM,ZMM register */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    20, 8,                            /* Gather load static, per_elt.  */
>    22, 10,                           /* Gather store static, per_elt.  */
>    64,                               /* size of l1 cache.  */
> @@ -2508,6 +2526,7 @@ struct processor_costs icelake_cost = {
>    {8, 8, 8, 8, 16},                 /* cost of unaligned stores.  */
>    2, 2, 4,                          /* cost of moving XMM,YMM,ZMM register */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    20, 8,                            /* Gather load static, per_elt.  */
>    22, 10,                           /* Gather store static, per_elt.  */
>    64,                               /* size of l1 cache.  */
> @@ -2638,6 +2657,7 @@ struct processor_costs alderlake_cost = {
>    {8, 8, 8, 10, 15},                /* cost of unaligned storess.  */
>    2, 3, 4,                          /* cost of moving XMM,YMM,ZMM register */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    18, 6,                            /* Gather load static, per_elt.  */
>    18, 6,                            /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -2761,6 +2781,7 @@ const struct processor_costs btver1_cost = {
>    {10, 10, 12, 48, 96},             /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    14,                               /* cost of moving SSE register to integer.  */
> +  14,                               /* cost of moving integer register to SSE.  */
>    10, 10,                           /* Gather load static, per_elt.  */
>    10, 10,                           /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -2881,6 +2902,7 @@ const struct processor_costs btver2_cost = {
>    {10, 10, 12, 48, 96},             /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    14,                               /* cost of moving SSE register to integer.  */
> +  14,                               /* cost of moving integer register to SSE.  */
>    10, 10,                           /* Gather load static, per_elt.  */
>    10, 10,                           /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -3000,6 +3022,7 @@ struct processor_costs pentium4_cost = {
>    {32, 32, 32, 64, 128},            /* cost of unaligned stores.  */
>    12, 24, 48,                       /* cost of moving XMM,YMM,ZMM register */
>    20,                               /* cost of moving SSE register to integer.  */
> +  20,                               /* cost of moving integer register to SSE.  */
>    16, 16,                           /* Gather load static, per_elt.  */
>    16, 16,                           /* Gather store static, per_elt.  */
>    8,                                /* size of l1 cache.  */
> @@ -3122,6 +3145,7 @@ struct processor_costs nocona_cost = {
>    {24, 24, 24, 48, 96},             /* cost of unaligned stores.  */
>    6, 12, 24,                        /* cost of moving XMM,YMM,ZMM register */
>    20,                               /* cost of moving SSE register to integer.  */
> +  20,                               /* cost of moving integer register to SSE.  */
>    12, 12,                           /* Gather load static, per_elt.  */
>    12, 12,                           /* Gather store static, per_elt.  */
>    8,                                /* size of l1 cache.  */
> @@ -3242,6 +3266,7 @@ struct processor_costs atom_cost = {
>    {16, 16, 16, 32, 64},             /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    8,                                /* cost of moving SSE register to integer.  */
> +  8,                                /* cost of moving integer register to SSE.  */
>    8, 8,                             /* Gather load static, per_elt.  */
>    8, 8,                             /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -3362,6 +3387,7 @@ struct processor_costs slm_cost = {
>    {16, 16, 16, 32, 64},             /* cost of unaligned stores.  */
>    2, 4, 8,                          /* cost of moving XMM,YMM,ZMM register */
>    8,                                /* cost of moving SSE register to integer.  */
> +  8,                                /* cost of moving integer register to SSE.  */
>    8, 8,                             /* Gather load static, per_elt.  */
>    8, 8,                             /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -3494,6 +3520,7 @@ struct processor_costs tremont_cost = {
>    {6, 6, 6, 10, 15},                /* cost of unaligned storess.  */
>    2, 3, 4,                          /* cost of moving XMM,YMM,ZMM register */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    18, 6,                            /* Gather load static, per_elt.  */
>    18, 6,                            /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -3616,6 +3643,7 @@ struct processor_costs intel_cost = {
>    {10, 10, 10, 10, 10},             /* cost of unaligned loads.  */
>    2, 2, 2,                          /* cost of moving XMM,YMM,ZMM register */
>    4,                                /* cost of moving SSE register to integer.  */
> +  4,                                /* cost of moving integer register to SSE.  */
>    6, 6,                             /* Gather load static, per_elt.  */
>    6, 6,                             /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -3731,15 +3759,16 @@ struct processor_costs lujiazui_cost = {
>    {6, 6, 6},                        /* cost of loading integer registers
>                                         in QImode, HImode and SImode.
>                                         Relative to reg-reg move (2).  */
> -  {6, 6, 6},                    /* cost of storing integer registers.  */
> +  {6, 6, 6},                        /* cost of storing integer registers.  */
>    {6, 6, 6, 10, 15},                /* cost of loading SSE register
> -                                       in 32bit, 64bit, 128bit, 256bit and 512bit.  */
> +                                       in 32bit, 64bit, 128bit, 256bit
> +                                       and 512bit.  */
>    {6, 6, 6, 10, 15},                /* cost of storing SSE register
> -                                       in 32bit, 64bit, 128bit, 256bit and 512bit.  */
> +                                       in 32bit, 64bit, 128bit, 256bit
> +                                       and 512bit.  */
>    {6, 6, 6, 10, 15},                /* cost of unaligned loads.  */
>    {6, 6, 6, 10, 15},                /* cost of unaligned storess.  */
> -  2, 3, 4,                      /* cost of moving XMM,YMM,ZMM register.  */
> -  6,                            /* cost of moving SSE register to integer.  */
> +  2, 3, 4,                          /* cost of moving XMM,YMM,ZMM register.  */
> +  6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    18, 6,                            /* Gather load static, per_elt.  */
>    18, 6,                            /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -3864,6 +3893,7 @@ struct processor_costs yongfeng_cost = {
>    {8, 8, 8, 12, 15},                /* cost of unaligned storess.  */
>    2, 3, 4,                          /* cost of moving XMM,YMM,ZMM register.  */
>    8,                                /* cost of moving SSE register to integer.  */
> +  8,                                /* cost of moving integer register to SSE.  */
>    18, 6,                            /* Gather load static, per_elt.  */
>    18, 6,                            /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -3987,6 +4017,7 @@ struct processor_costs shijidadao_cost = {
>    {8, 8, 8, 12, 15},                /* cost of unaligned storess.  */
>    2, 3, 4,                          /* cost of moving XMM,YMM,ZMM register.  */
>    8,                                /* cost of moving SSE register to integer.  */
> +  8,                                /* cost of moving integer register to SSE.  */
>    18, 6,                            /* Gather load static, per_elt.  */
>    18, 6,                            /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -4116,6 +4147,7 @@ struct processor_costs generic_cost = {
>    {6, 6, 6, 10, 15},                /* cost of unaligned storess.  */
>    2, 3, 4,                          /* cost of moving XMM,YMM,ZMM register */
>    6,                                /* cost of moving SSE register to integer.  */
> +  6,                                /* cost of moving integer register to SSE.  */
>    18, 6,                            /* Gather load static, per_elt.  */
>    18, 6,                            /* Gather store static, per_elt.  */
>    32,                               /* size of l1 cache.  */
> @@ -4249,6 +4281,7 @@ struct processor_costs core_cost = {
>    {6, 6, 6, 6, 12},                 /* cost of unaligned stores.  */
>    2, 2, 4,                          /* cost of moving XMM,YMM,ZMM register */
>    2,                                /* cost of moving SSE register to integer.  */
> +  2,                                /* cost of moving integer register to SSE.  */
>    /* VGATHERDPD is 7 uops, rec throughput 5, while VGATHERDPD is 9 uops,
>       rec. throughput 6.
>       So 5 uops statically and one uops per load.  */
> diff --git a/gcc/testsuite/gcc.target/i386/minmax-6.c b/gcc/testsuite/gcc.target/i386/minmax-6.c
> index 615f919ba0a..23f61c52d80 100644
> --- a/gcc/testsuite/gcc.target/i386/minmax-6.c
> +++ b/gcc/testsuite/gcc.target/i386/minmax-6.c
> @@ -15,4 +15,4 @@ UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
>  /* We do not want the RA to spill %esi for it's dual-use but using
>     pmaxsd is OK.  */
>  /* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */
> -/* { dg-final { scan-assembler "pmaxsd" } } */
> +/* { dg-final { scan-assembler "pmaxsd" { xfail *-*-* } } } */
> diff --git a/gcc/testsuite/gcc.target/i386/minmax-7.c b/gcc/testsuite/gcc.target/i386/minmax-7.c
> index 619a93946c7..b2cb1c24d7e 100644
> --- a/gcc/testsuite/gcc.target/i386/minmax-7.c
> +++ b/gcc/testsuite/gcc.target/i386/minmax-7.c
> @@ -17,4 +17,4 @@ void bar (int aleft, int axcenter)
>  /* We do not want the RA to spill %esi for it's dual-use but using
>     pminsd is OK.  */
>  /* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */
> -/* { dg-final { scan-assembler "pminsd" } } */
> +/* { dg-final { scan-assembler "pminsd" { xfail *-*-* } } } */

-- 
H.J.