https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119298
--- Comment #14 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
> > I am OK with using addss cost of 3 for trunk&release branches and make this
> > more precise next stage1.
>
> That's what we use now?  But I still don't understand why exactly
> 538.imagick_r regresses.

We currently use 2 for znver5.  In r15-3441-g4292297a0f938f (causing this
regression) I changed

  COSTS_N_INSNS (3),                   /* cost of ADDSS/SD SUBSS/SD insns.  */

which was copied from the znver4 tables, to

  /* ADDSS has throughput 2 and latency 2
     (in some cases when source is another addition).  */
  COSTS_N_INSNS (2),                   /* cost of ADDSS/SD SUBSS/SD insns.  */

since ADDSS has a latency of 2 on znver5 (at least in some, but common, cases).

Martin claims he tested that this particular change causes the regression and
moreover that the difference is:

-  _882 = {_929, D__lsm.193_931, D__lsm.194_932, D__lsm.195_930};
-  vectp.214_881 = prephitmp_900 + 32;
-  MEM <vector(4) long unsigned int> [(long unsigned int *)vectp.214_881] = _882;
+  MEM[(long unsigned int *)prephitmp_900 + 40B] = D__lsm.193_931;
+  MEM[(long unsigned int *)prephitmp_900 + 48B] = D__lsm.194_932;
+  MEM[(long unsigned int *)prephitmp_900 + 56B] = D__lsm.195_930;
+  MEM[(long unsigned int *)prephitmp_900 + 32B] = _929;

Which boils down to:

long test[4];
void
foo (unsigned long a, unsigned long b, unsigned long c, unsigned long d)
{
  test[0] = a;
  test[1] = b;
  test[2] = c;
  test[3] = d;
}

being vectorized with addss cost of 2 (trunk) as:

_Z3foommmm:
.LFB0:
        .cfi_startproc
        vmovq   %rdx, %xmm2
        vmovq   %rdi, %xmm3
        vpinsrq $1, %rcx, %xmm2, %xmm1
        vpinsrq $1, %rsi, %xmm3, %xmm0
        vinserti64x2    $0x1, %xmm1, %ymm0, %ymm0
        vmovdqa %ymm0, test(%rip)
        vzeroupper
        ret

while with cost of 3 we get:

_Z3foommmm:
.LFB0:
        .cfi_startproc
        vmovq   %rdi, %xmm2
        vmovq   %rdx, %xmm3
        vpinsrq $1, %rsi, %xmm2, %xmm1
        vpinsrq $1, %rcx, %xmm3, %xmm0
        vmovdqa %xmm1, test(%rip)
        vmovdqa %xmm0, test+16(%rip)
        ret

I would not guess which one is better.  I can see that the trunk version saves
one move and a potential store->load stall.  In a standalone benchmark (which
does not consume the stored values) both variants have the same runtime; a
sketch of such a harness is included below.

The costing changes from:

a.0_1 1 times scalar_store costs 16 in body
b.1_2 1 times scalar_store costs 16 in body
c.2_3 1 times scalar_store costs 16 in body
d.3_4 1 times scalar_store costs 16 in body
node 0x396b4c0 1 times vec_construct costs 40 in prologue
a.0_1 1 times vector_store costs 24 in body
r.C:5:16: note: Cost model analysis for part in loop 0:
  Vector cost: 64
  Scalar cost: 64

to:

a.0_1 1 times scalar_store costs 16 in body
b.1_2 1 times scalar_store costs 16 in body
c.2_3 1 times scalar_store costs 16 in body
d.3_4 1 times scalar_store costs 16 in body
node 0x396b4c0 1 times vec_construct costs 44 in prologue
a.0_1 1 times vector_store costs 24 in body
r.C:5:16: note: Cost model analysis for part in loop 0:
  Vector cost: 68
  Scalar cost: 64

so with addss cost of 2 the vector and scalar costs tie (64 == 64) and the
stores get vectorized, while with cost of 3 the vector cost (68) exceeds the
scalar cost (64) and we keep the scalar stores.

Counting latencies, I think vinserti64x2 is 1 cycle, while vpinsrq is an
integer->SSE move that is slower and set to 4 cycles.
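For reference, the standalone comparison can be done with a harness along
these lines (just a sketch, not necessarily the exact harness used; build the
same source with both cost settings at the same optimization level; the
stored values are deliberately never read back):

#include <stdio.h>
#include <time.h>

long test[4];

/* noinline keeps the calls, so the store sequence stays as in the
   codegen above.  */
__attribute__ ((noinline)) void
foo (unsigned long a, unsigned long b, unsigned long c, unsigned long d)
{
  test[0] = a;
  test[1] = b;
  test[2] = c;
  test[3] = d;
}

int
main (void)
{
  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (unsigned long i = 0; i < 1000000000ul; i++)
    foo (i, i + 1, i + 2, i + 3);
  clock_gettime (CLOCK_MONOTONIC, &t1);
  printf ("%.3fs\n", (t1.tv_sec - t0.tv_sec)
                     + (t1.tv_nsec - t0.tv_nsec) * 1e-9);
  return 0;
}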
Overall it is wrong that we use the addss cost to estimate vec_construct:

    case vec_construct:
      {
        int n = TYPE_VECTOR_SUBPARTS (vectype);
        /* N - 1 element inserts into an SSE vector, the possible
           GPR -> XMM move is accounted for in add_stmt_cost.  */
        if (GET_MODE_BITSIZE (mode) <= 128)
          return (n - 1) * ix86_cost->sse_op;
        /* One vinserti128 for combining two SSE vectors for AVX256.  */
        else if (GET_MODE_BITSIZE (mode) == 256)
          return ((n - 2) * ix86_cost->sse_op
                  + ix86_vec_cost (mode, ix86_cost->addss));
        /* One vinserti64x4 and two vinserti128 for combining SSE and
           AVX256 vectors to AVX512.  */
        else if (GET_MODE_BITSIZE (mode) == 512)
          return ((n - 4) * ix86_cost->sse_op
                  + 3 * ix86_vec_cost (mode, ix86_cost->addss));
        gcc_unreachable ();
      }

I think we may want to have ix86_cost->hard_register->integer_to_sse to cost
the construction in integer modes instead of addss?

Overall r15-3441-g4292297a0f938f is right for addss itself, but wrong for all
the other instructions we glob into the addss cost (conversions, packing etc.).
It was a silly mistake to forget that the addss cost is used in many other
cases.

So I think it still makes sense to have at least one extra cost entry (for a
common simple FP operation that is more expensive than the common simple
integer operation we already have) and to name it in a way that makes it clear
it represents various kinds of typical FP operations (unless we want to have
costs for all of them).
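To make the vec_construct part concrete, a minimal sketch of the direction
(untested; I use the existing sse_op entry as a placeholder for the vinserti*
combining insns, while integer_to_sse or a dedicated new entry may turn out
to be the better fit):

    case vec_construct:
      {
        int n = TYPE_VECTOR_SUBPARTS (vectype);
        /* N - 1 element inserts into an SSE vector, the possible
           GPR -> XMM move is accounted for in add_stmt_cost.  */
        if (GET_MODE_BITSIZE (mode) <= 128)
          return (n - 1) * ix86_cost->sse_op;
        /* The vinserti128 / vinserti64x4 combining steps are plain lane
           inserts, not FP additions, so do not cost them via addss
           (sse_op here is only a placeholder).  */
        else if (GET_MODE_BITSIZE (mode) == 256)
          return ((n - 2) * ix86_cost->sse_op
                  + ix86_vec_cost (mode, ix86_cost->sse_op));
        else if (GET_MODE_BITSIZE (mode) == 512)
          return ((n - 4) * ix86_cost->sse_op
                  + 3 * ix86_vec_cost (mode, ix86_cost->sse_op));
        gcc_unreachable ();
      }

A new "typical simple FP operation" entry would then only be consulted by the
places that really model FP work (conversions, packing etc.), so tuning the
addss entry itself would no longer change vec_construct decisions like the
one above.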