The following fixes vec_construct cost calculation to properly consider that the inserts will happen to SSE regs thus forgo the multiplication done in ix86_vec_cost which is passed the wrong mode. This gets rid of the only call passing false to ix86_vec_cost (so consider the patch amended to remove the arg if approved).
Bootstrapped and tested on x86_64-unknown-linux-gnu. OK for trunk? I am considering to make the factor we apply in ix86_vec_cost which currently depends on X86_TUNE_AVX128_OPTIMAL and X86_TUNE_SSE_SPLIT_REGS part of the actual cost tables since the reason we apply them are underlying CPU architecture details. Was the original reason of doing the multiplication based on those tunings to be able to "share" the same basic cost table across architectures that differ in this important detail? I see X86_TUNE_SSE_SPLIT_REGS is only used for m_ATHLON_K8 and X86_TUNE_AVX128_OPTIMAL is used for m_BDVER, m_BTVER2 and m_ZNVER1. Those all have (multiple) exclusive processor_cost_table entries. As a first step I'd like to remove the use of ix86_vec_cost for the entries that already have entries for multiple modes (loads and stores) and apply the factor there. For example Zen can do two 128bit loads per cycle but only one 128bit store. With multiplying AVX256 costs by two we seem to cost sth like # instructions to dispatch * instruction latency which is an odd thing. I'd have expected # instructions to dispatch / instruction throughput * instruction latency - so a AVX256 add would cost the same as a AVX128 add, likewise for loads but stores would be more expensive because of the throughput issue. This all ignores resource utilization across multiple insns but that's how the cost model works ... Thanks, Richard. 2018-10-11 Richard Biener <rguent...@suse.de> * config/i386/i386.c (ix86_vec_cost): Remove !parallel path and argument. (ix86_builtin_vectorization_cost): For vec_construct properly cost insertion into SSE regs. (...): Adjust calls to ix86_vec_cost. Index: gcc/config/i386/i386.c =================================================================== --- gcc/config/i386/i386.c (revision 265022) +++ gcc/config/i386/i386.c (working copy) @@ -39846,11 +39846,10 @@ ix86_set_reg_reg_cost (machine_mode mode static int ix86_vec_cost (machine_mode mode, int cost, bool parallel) { + gcc_assert (parallel); if (!VECTOR_MODE_P (mode)) return cost; - - if (!parallel) - return cost * GET_MODE_NUNITS (mode); + if (GET_MODE_BITSIZE (mode) == 128 && TARGET_SSE_SPLIT_REGS) return cost * 2; @@ -45190,8 +45189,9 @@ ix86_builtin_vectorization_cost (enum ve case vec_construct: { - /* N element inserts. */ - int cost = ix86_vec_cost (mode, ix86_cost->sse_op, false); + gcc_assert (VECTOR_MODE_P (mode)); + /* N element inserts into SSE vectors. */ + int cost = GET_MODE_NUNITS (mode) * ix86_cost->sse_op; /* One vinserti128 for combining two SSE vectors for AVX256. */ if (GET_MODE_BITSIZE (mode) == 256) cost += ix86_vec_cost (mode, ix86_cost->addss, true);