Hi,
currently vec_construct cost is simply TYPE_VECTOR_SUBPARTS / 2 + 1,
a reasonable estimate only if the target's other stmt costs are close to 1.
The idea was that you need that many vector stmts, hence the following
patch, which should fix skewed costs for bdver2, for example, which
has a vec_stmt_cost of 6.
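To make the skew concrete, a small worked example, assuming a
4-element vector such as V4SI (so TYPE_VECTOR_SUBPARTS is 4):

  old cost (any target):   4 / 2 + 1       = 3
  new cost on bdver2:      6 * (4 / 2 + 1) = 18

That is, on bdver2 a vector constructor was previously accounted as
if vector stmts were as cheap as on targets with unit costs.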
Fixing this becomes important for a fix for PR62283, which will consider
building up vectors from parts during basic-block vectorization
and relies on the cost model to reject cases that are too expensive.
For example, gcc.dg/vect/bb-slp-14.c will now be vectorized (with
the generic cost model and just SSE2) as
Cost model analysis:
  Vector inside of basic block cost: 2
  Vector prologue cost: 7
  Vector epilogue cost: 0
  Scalar cost of basic block: 10
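Given these numbers the block is deemed profitable to vectorize:
2 + 7 + 0 = 9 total vector cost vs. 10 scalar cost.  A minimal
sketch of that comparison (the variable names here are made up; the
real check in vect_bb_vectorization_profitable_p is more involved):

  /* Vectorize only if the vector version is strictly cheaper.  */
  if (vec_inside_cost + vec_prologue_cost + vec_epilogue_cost
      >= scalar_cost)
    return false;  /* With the numbers above: 9 >= 10 is false.  */

The resulting vectorized code is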
.LFB7:
        .cfi_startproc
        subq $24, %rsp
        .cfi_def_cfa_offset 32
        movl in+12(%rip), %eax
        testl %edi, %edi
        movd in+4(%rip), %xmm0
        movd in(%rip), %xmm1
        movl %eax, 12(%rsp)
        movd in+4(%rip), %xmm4
        movd 12(%rsp), %xmm3
        movl %edi, 12(%rsp)
        punpckldq %xmm4, %xmm1
        punpckldq %xmm3, %xmm0
        punpcklqdq %xmm0, %xmm1
        movd 12(%rsp), %xmm0
        movl %esi, 12(%rsp)
        movd 12(%rsp), %xmm5
        paddd .LC2(%rip), %xmm1
        movdqa %xmm1, %xmm2
        psrlq $32, %xmm1
        punpckldq %xmm5, %xmm0
        punpcklqdq %xmm0, %xmm0
        pmuludq %xmm0, %xmm2
        psrlq $32, %xmm0
        pmuludq %xmm1, %xmm0
        pshufd $8, %xmm2, %xmm1
        pshufd $8, %xmm0, %xmm0
        punpckldq %xmm0, %xmm1
        movaps %xmm1, out(%rip)
        je .L12
vs. the scalar variant
.LFB7:
        .cfi_startproc
        subq $8, %rsp
        .cfi_def_cfa_offset 16
        movl in(%rip), %edx
        movl in+4(%rip), %eax
        movl in+12(%rip), %ecx
        addl $23, %edx
        imull %edi, %edx
        leal 31(%rcx), %r8d
        movl %edx, out(%rip)
        leal 142(%rax), %edx
        addl $2, %eax
        imull %edi, %eax
        imull %esi, %edx
        movl %eax, out+8(%rip)
        movl %r8d, %eax
        imull %esi, %eax
        testl %edi, %edi
        movl %edx, out+4(%rip)
        movl %eax, out+12(%rip)
        je .L12
Some excessive PRE across the conditional asm() keeps part
of the scalar computations live (yes, the cost model accounts
for that).  Previously we didn't vectorize the basic block
because the loads from in[] could not be vectorized.  Now
we will build up a vector from the scalar loads.
The vectorized code is generated from
<bb 2>:
  vect_cst_.19_43 = {x_10(D), y_13(D), x_10(D), y_13(D)};
  _3 = in[0];
  _5 = in[1];
  _8 = in[3];
  vect_cst_.16_47 = {_3, _5, _5, _8};
  vect_a0_4.15_42 = vect_cst_.16_47 + { 23, 142, 2, 31 };
  vect__11.18_44 = vect_a0_4.15_42 * vect_cst_.19_43;
  MEM[(unsigned int *)&out] = vect__11.18_44;
Thus the code we generate for

  _3 = in[0];
  _5 = in[1];
  _8 = in[3];
  vect_cst_.16_47 = {_3, _5, _5, _8};

is quite bad.  It gets better with -mavx, but I wonder where we
should try to optimize code generation for constructors...
(we can vectorize the loads by enhancing load permutation support,
of course - another vectorizer improvement I have some partial
patches for).
Well, anyway - below is the "obvious" cost model patch.
Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
Ok for trunk?
Thanks,
Richard.
2015-04-21  Richard Biener  <[email protected]>

        * config/i386/i386.c (ix86_builtin_vectorization_cost): Scale
        vec_construct cost by vec_stmt_cost.
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c (revision 222230)
+++ gcc/config/i386/i386.c (working copy)
@@ -46731,7 +46731,7 @@ ix86_builtin_vectorization_cost (enum ve
       case vec_construct:
         elements = TYPE_VECTOR_SUBPARTS (vectype);
-        return elements / 2 + 1;
+        return ix86_cost->vec_stmt_cost * (elements / 2 + 1);
 
       default:
         gcc_unreachable ();
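For illustration, a small standalone sketch (not GCC code; bdver2's
vec_stmt_cost of 6 is from above, the generic value of 1 is an
assumption) that models the old vs. new formula:

  #include <stdio.h>

  /* Old formula: ignores the per-target vector stmt cost.  */
  static int
  old_vec_construct_cost (int elements)
  {
    return elements / 2 + 1;
  }

  /* New formula: scale by the per-target vec_stmt_cost.  */
  static int
  new_vec_construct_cost (int elements, int vec_stmt_cost)
  {
    return vec_stmt_cost * (elements / 2 + 1);
  }

  int
  main (void)
  {
    /* V4SI has 4 elements.  */
    printf ("generic: old %d, new %d\n",
            old_vec_construct_cost (4), new_vec_construct_cost (4, 1));
    printf ("bdver2:  old %d, new %d\n",
            old_vec_construct_cost (4), new_vec_construct_cost (4, 6));
    return 0;
  }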