Hi,

currently the vec_construct cost is simply TYPE_VECTOR_SUBPARTS / 2 + 1, which is a reasonable estimate only if other target stmt costs are close to 1 (the idea being that you need that many vector stmts).  Thus the following patch, which should fix skewed costs for bdver2, for example, which has a vec_stmt_cost of 6.
Fixing this gets important for a fix for PR62283, which will consider building vectors up from parts during basic-block vectorization and relies on the cost model to reject too expensive ones.  For example gcc.dg/vect/bb-slp-14.c will now be vectorized (with the generic cost model and just SSE2) as

Cost model analysis:
  Vector inside of basic block cost: 2
  Vector prologue cost: 7
  Vector epilogue cost: 0
  Scalar cost of basic block: 10

.LFB7:
	.cfi_startproc
	subq	$24, %rsp
	.cfi_def_cfa_offset 32
	movl	in+12(%rip), %eax
	testl	%edi, %edi
	movd	in+4(%rip), %xmm0
	movd	in(%rip), %xmm1
	movl	%eax, 12(%rsp)
	movd	in+4(%rip), %xmm4
	movd	12(%rsp), %xmm3
	movl	%edi, 12(%rsp)
	punpckldq	%xmm4, %xmm1
	punpckldq	%xmm3, %xmm0
	punpcklqdq	%xmm0, %xmm1
	movd	12(%rsp), %xmm0
	movl	%esi, 12(%rsp)
	movd	12(%rsp), %xmm5
	paddd	.LC2(%rip), %xmm1
	movdqa	%xmm1, %xmm2
	psrlq	$32, %xmm1
	punpckldq	%xmm5, %xmm0
	punpcklqdq	%xmm0, %xmm0
	pmuludq	%xmm0, %xmm2
	psrlq	$32, %xmm0
	pmuludq	%xmm1, %xmm0
	pshufd	$8, %xmm2, %xmm1
	pshufd	$8, %xmm0, %xmm0
	punpckldq	%xmm0, %xmm1
	movaps	%xmm1, out(%rip)
	je	.L12

vs. the scalar variant

.LFB7:
	.cfi_startproc
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	movl	in(%rip), %edx
	movl	in+4(%rip), %eax
	movl	in+12(%rip), %ecx
	addl	$23, %edx
	imull	%edi, %edx
	leal	31(%rcx), %r8d
	movl	%edx, out(%rip)
	leal	142(%rax), %edx
	addl	$2, %eax
	imull	%edi, %eax
	imull	%esi, %edx
	movl	%eax, out+8(%rip)
	movl	%r8d, %eax
	imull	%esi, %eax
	testl	%edi, %edi
	movl	%edx, out+4(%rip)
	movl	%eax, out+12(%rip)
	je	.L12

Some excessive PRE across the conditional asm() keeps part of the scalar computes live (yes, the cost model accounts for that).  Previously we didn't vectorize the basic block because the loads from in[] could not be vectorized.  Now we will build up a vector from the scalar loads.
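For reference, a scalar kernel of roughly the shape SLP sees here, reconstructed from the assembly above purely for illustration (the actual gcc.dg/vect/bb-slp-14.c differs in details such as the surrounding condition and asm):

```c
/* Illustrative reconstruction of the computation in the dumps above;
   not the actual bb-slp-14.c source.  */
unsigned int in[4];
unsigned int out[4];

void
f (unsigned int x, unsigned int y)
{
  /* SLP groups these four stores; the loads in[0], in[1], in[1], in[3]
     cannot form a contiguous vector load, so the vectorizer now builds
     the vector operand from scalars instead of giving up.  */
  out[0] = (in[0] + 23) * x;
  out[1] = (in[1] + 142) * y;
  out[2] = (in[1] + 2) * x;
  out[3] = (in[3] + 31) * y;
}
```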
The vectorized code is generated from

<bb 2>:
  vect_cst_.19_43 = {x_10(D), y_13(D), x_10(D), y_13(D)};
  _3 = in[0];
  _5 = in[1];
  _8 = in[3];
  vect_cst_.16_47 = {_3, _5, _5, _8};
  vect_a0_4.15_42 = vect_cst_.16_47 + { 23, 142, 2, 31 };
  vect__11.18_44 = vect_a0_4.15_42 * vect_cst_.19_43;
  MEM[(unsigned int *)&out] = vect__11.18_44;

thus the code we generate for

  _3 = in[0];
  _5 = in[1];
  _8 = in[3];
  vect_cst_.16_47 = {_3, _5, _5, _8};

is quite bad.  It gets better for -mavx, but I wonder where we should try to optimize code generation for constructors... (we can vectorize the loads by enhancing load permutation support, of course - another vectorizer improvement I have some partial patches for).

Well, anyway - below is the "obvious" cost model patch.  Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.

Ok for trunk?

Thanks,
Richard.

2015-04-21  Richard Biener  <rguent...@suse.de>

	* config/i386/i386.c (ix86_builtin_vectorization_cost): Scale
	vec_construct cost by vec_stmt_cost.

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 222230)
+++ gcc/config/i386/i386.c	(working copy)
@@ -46731,7 +46731,7 @@ ix86_builtin_vectorization_cost (enum ve

       case vec_construct:
	elements = TYPE_VECTOR_SUBPARTS (vectype);
-	return elements / 2 + 1;
+	return ix86_cost->vec_stmt_cost * (elements / 2 + 1);

       default:
	gcc_unreachable ();