On Fri, 13 Nov 2015, Alan Lawrence wrote:

> On 10/11/15 09:34, Richard Biener wrote:
> >
> > The following fixes PR56118 by adjusting the cost model handling of
> > basic-block vectorization to favor the vectorized version in case
> > estimated cost is the same as the estimated cost of the scalar
> > version.  This makes sense because we over-estimate the vectorized
> > cost in several places.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.
> >
> > Richard.
> >
> > 2015-11-10  Richard Biener  <rguent...@suse.de>
> >
> > 	PR tree-optimization/56118
> > 	* tree-vect-slp.c (vect_bb_vectorization_profitable_p): Make equal
> > 	cost favor vectorized version.
> >
> > 	* gcc.target/i386/pr56118.c: New testcase.
>
> On AArch64 and ARM targets, this causes PASS->FAIL of
>
> gcc.dg/vect/bb-slp-32.c scan-tree-dump slp2 "vectorization is not profitable"
> gcc.dg/vect/bb-slp-32.c -flto -ffat-lto-objects scan-tree-dump slp2
> "vectorization is not profitable"
>
> ....that sounds like a good thing ;)
It depends ;)  You may want to look at the generated code with/without
vectorization and decide for yourselves.  The testcase is

int foo (int *p)
{
  int x[4];
  int tem0, tem1, tem2, tem3;
  tem0 = p[0] + 1;
  x[0] = tem0;
  tem1 = p[1] + 2;
  x[1] = tem1;
  tem2 = p[2] + 3;
  x[2] = tem2;
  tem3 = p[3] + 4;
  x[3] = tem3;
  bar (x);
  return tem0 + tem1 + tem2 + tem3;
}

which was added to cover the situation where we vectorize the store to
x[] but have to keep the scalar computations for tem[0-3] for the final
reduction.

The scalar cost for this kernel is 3*4 while the vector cost is
unaligned load + aligned load + vector op + aligned vector store.  We
compensate for the out-of-kernel uses by making the scalar cost just
the stores (4).  Now if all of the vector cost pieces are 1 then we
have 4 vs. 4 here.  On x86_64 the unaligned load cost is 2 and thus
vectorization is deemed non-profitable (a small stand-alone sketch of
this arithmetic follows at the end of this mail).  In reality this
depends on the actual constants used in the plus as that tells you
whether the constant is free for the scalar plus or not (for vectors
it almost always comes from the constant pool).

On x86_64 the assembler difference is

	movl	(%rdi), %eax
	movdqu	(%rdi), %xmm0
	leal	1(%rax), %r13d
	movl	4(%rdi), %eax
	paddd	.LC0(%rip), %xmm0
	movaps	%xmm0, (%rsp)
	leal	2(%rax), %r12d
	movl	8(%rdi), %eax
	addl	%r13d, %r12d
	leal	3(%rax), %ebp
	movl	12(%rdi), %eax
	movq	%rsp, %rdi
	addl	%r12d, %ebp
	leal	4(%rax), %ebx
	call	bar
	leal	0(%rbp,%rbx), %eax

vs.

	movl	(%rdi), %eax
	movl	12(%rdi), %ebx
	leal	1(%rax), %r13d
	movl	4(%rdi), %eax
	addl	$4, %ebx
	movl	%ebx, 12(%rsp)
	movl	%r13d, (%rsp)
	leal	2(%rax), %r12d
	movl	8(%rdi), %eax
	movq	%rsp, %rdi
	movl	%r12d, 4(%rsp)
	leal	3(%rax), %ebp
	movl	%ebp, 8(%rsp)
	call	bar
	leal	0(%r13,%r12), %eax
	addl	%ebp, %eax
	addl	%ebx, %eax

Clearly the testcase may need adjustment so that it is not so much on
the edge for individual targets.  I'm considering changing it.

The testcase also shows the lack of reduction vectorization in BBs (I
have partially finished patches for this but got distracted...).

> , so I imagine the xfail directive may
> just need updating.  The test also looks to be failing on powerpc64 (according
> to https://gcc.gnu.org/ml/gcc-testresults/2015-11/msg01327.html).

I'll try making the testcase more complicated instead.

Richard.

> Regards, Alan
>

-- 
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nuernberg)
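
For reference, here is a minimal stand-alone sketch of the cost
arithmetic discussed above.  It is not the actual GCC code in
vect_bb_vectorization_profitable_p; the struct, the function name and
the individual cost values are illustrative assumptions wired into a
tiny C program, using the "all pieces cost 1" and the x86_64-like
numbers from the mail.

#include <stdio.h>
#include <stdbool.h>

/* Toy cost model for the vectorized version of the kernel: one
   unaligned load of p[0..3], one aligned load of the constant vector
   from the constant pool, the vector add and the aligned vector
   store to x[].  */
struct cost_model
{
  int unaligned_load;
  int aligned_load;
  int vector_op;
  int aligned_store;
};

static bool
bb_vectorization_profitable_p (struct cost_model c)
{
  /* Each scalar group is load + add + store (3), four groups in the
     kernel.  Because tem0..tem3 are live after the kernel the loads
     and adds stay anyway, so only the four scalar stores count.  */
  int scalar_cost = 4;
  int vector_cost = c.unaligned_load + c.aligned_load
		    + c.vector_op + c.aligned_store;
  /* Toy rendering of "equal cost favors the vectorized version":
     a tie no longer rejects vectorization.  */
  return vector_cost <= scalar_cost;
}

int
main (void)
{
  struct cost_model generic = { 1, 1, 1, 1 };	/* 4 vs. 4  */
  struct cost_model x86_64 = { 2, 1, 1, 1 };	/* 5 vs. 4  */
  printf ("generic: %s\n",
	  bb_vectorization_profitable_p (generic)
	  ? "vectorize" : "not profitable");
  printf ("x86_64:  %s\n",
	  bb_vectorization_profitable_p (x86_64)
	  ? "vectorize" : "not profitable");
  return 0;
}

With these numbers the generic case prints "vectorize" and the x86_64
case prints "not profitable", matching the 4 vs. 4 and 5 vs. 4
comparisons described above.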