https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
--- Comment #30 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #23)
> >   _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565,
> > _576, _125, _143, _161, _179};
> > The cost of vec_construct in i386 backend is 64, calculated as 16 x 4
>
> cut from i386.c
> ---
> /* N element inserts into SSE vectors.  */
> int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> ---
>
> From perspective of pipeline latency, it seems ok, but from perspective of
> rtx_cost, it seems inaccurate since it would be initialized as
> ---
>         vmovd   %eax, %xmm0
>         vpinsrb $1, 1(%rsi), %xmm0, %xmm0
>         vmovd   %eax, %xmm7
>         vpinsrb $1, 3(%rsi), %xmm7, %xmm7
>         vmovd   %eax, %xmm3
>         vpinsrb $1, 17(%rsi), %xmm3, %xmm3
>         vmovd   %eax, %xmm6
>         vpinsrb $1, 19(%rsi), %xmm6, %xmm6
>         vmovd   %eax, %xmm1
>         vpinsrb $1, 33(%rsi), %xmm1, %xmm1
>         vmovd   %eax, %xmm5
>         vpinsrb $1, 35(%rsi), %xmm5, %xmm5
>         vmovd   %eax, %xmm2
>         vpinsrb $1, 49(%rsi), %xmm2, %xmm2
>         vmovd   %eax, %xmm4
>         vpinsrb $1, 51(%rsi), %xmm4, %xmm4
>         vpunpcklwd      %xmm6, %xmm3, %xmm3
>         vpunpcklwd      %xmm4, %xmm2, %xmm2
>         vpunpcklwd      %xmm7, %xmm0, %xmm0
>         vpunpcklwd      %xmm5, %xmm1, %xmm1
>         vpunpckldq      %xmm2, %xmm1, %xmm1
>         vpunpckldq      %xmm3, %xmm0, %xmm0
>         vpunpcklqdq     %xmm1, %xmm0, %xmm0
> ---
>
> it's 16 "vector insert" + (4 + 2 + 1) "vector concat/permutation", so cost
> should be 92 (23 * 4).

So the important part for any target is that it makes the scalar and vector
costs apples to apples, because they end up being compared against each other.
For loops the most important metric tends to be latency, which is also the
only thing that can be reasonably costed when looking at a single statement
at a time.
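
For reference, the two figures in the quoted comment (64 and 92) can be reproduced with a small standalone sketch. This is not code from i386.c: `SSE_OP_COST` is a hypothetical stand-in for `ix86_cost->sse_op`, set to 4 only because the quoted "16 x 4" implies that value, and both function names are invented for illustration. The second function assumes the merge step is a binary tree of unpacks over a power-of-two number of partial vectors, which matches the quoted sequence (8 partial vectors, 4 + 2 + 1 unpacks).

```c
#include <assert.h>

/* Hypothetical stand-in for ix86_cost->sse_op; 4 is the value implied by
   the "16 x 4" figure in the quoted comment.  */
enum { SSE_OP_COST = 4 };

/* Model used by the quoted i386.c snippet: one SSE op per element.  */
int
construct_cost_per_element (int nelts)
{
  assert (nelts > 0);
  return nelts * SSE_OP_COST;
}

/* Alternative model from the quoted comment: one op per element plus a
   binary tree of unpacks that merges NPARTIAL partial vectors into one.
   Assumes NPARTIAL is a power of two.  */
int
construct_cost_with_unpacks (int nelts, int npartial)
{
  assert (nelts > 0 && npartial > 0);
  assert ((npartial & (npartial - 1)) == 0);

  int ops = nelts;
  /* Each tree level halves the number of live vectors.  */
  for (int n = npartial / 2; n >= 1; n /= 2)
    ops += n;
  return ops * SSE_OP_COST;
}
```

With 16 elements and 8 partial vectors, the first function gives 16 * 4 and the second gives (16 + 4 + 2 + 1) * 4, i.e. the 64 versus 92 under discussion.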

For all other factors coming in there's (in theory) the finish_cost hook where,
after gathering individual stmt data from add_stmt_cost, a target hook can
apply adjustments based on, say, functional unit allocation (IIRC the powerpc
backend looks at whether there are "many" shifts and disparages vectorization
in that case).

For the vector construction the x86 backend does a reasonable job in costing -
the only thing that's not very well modeled is the extra cost of constructing
from values in GPRs compared to values in XMM regs (on some CPU archs that
even has extra penalties). But as seen above, "GPR" values can also come from
memory, where the difference vanishes (for AVX, not for SSE).
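
To illustrate the split between per-statement costing and a whole-loop adjustment, here is a toy model in C. It does not use the real hook signatures: every `toy_` name is hypothetical, and the "more than half of the statements are shifts" threshold and the 50% penalty are invented purely for the example, not taken from the powerpc backend.

```c
#include <assert.h>

/* Hypothetical statement kinds; a real vectorizer distinguishes many more.  */
enum toy_stmt_kind { TOY_STMT_SHIFT, TOY_STMT_OTHER };

/* Hypothetical accumulator filled in while statements are costed.  */
struct toy_cost_data
{
  int total_cost;
  int n_stmts;
  int n_shifts;
};

/* Per-statement step: only the statement's own cost is known here, so
   all this can do is sum costs and record what kind of statement it saw.  */
void
toy_add_stmt_cost (struct toy_cost_data *data, enum toy_stmt_kind kind,
                   int cost)
{
  assert (cost >= 0);
  data->total_cost += cost;
  data->n_stmts++;
  if (kind == TOY_STMT_SHIFT)
    data->n_shifts++;
}

/* Whole-loop step: apply an adjustment that no single statement could
   justify.  The threshold and the penalty are invented for illustration.  */
int
toy_finish_cost (const struct toy_cost_data *data)
{
  int cost = data->total_cost;
  if (data->n_stmts > 0 && 2 * data->n_shifts > data->n_stmts)
    cost += cost / 2;
  return cost;
}
```

In this sketch a loop body of three shifts and one other statement, each costed at 4, ends up at 24 instead of 16, while one shift and three other statements stays at 16.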