On Fri, 13 Nov 2015, Alan Lawrence wrote:

> On 10/11/15 09:34, Richard Biener wrote:
> >
> > The following fixes PR56118 by adjusting the cost model handling of
> > basic-block vectorization to favor the vectorized version in case
> > estimated cost is the same as the estimated cost of the scalar
> > version.  This makes sense because we over-estimate the vectorized
> > cost in several places.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.
> >
> > Richard.
> >
> > 2015-11-10  Richard Biener  <rguent...@suse.de>
> >
> > 	PR tree-optimization/56118
> > 	* tree-vect-slp.c (vect_bb_vectorization_profitable_p): Make equal
> > 	cost favor vectorized version.
> >
> > 	* gcc.target/i386/pr56118.c: New testcase.
>
> On AArch64 and ARM targets, this causes PASS->FAIL of
>
> gcc.dg/vect/bb-slp-32.c scan-tree-dump slp2 "vectorization is not profitable"
> gcc.dg/vect/bb-slp-32.c -flto -ffat-lto-objects scan-tree-dump slp2
> "vectorization is not profitable"
>
> ....that sounds like a good thing ;)
It depends ;)  You may want to look at the generated code with/without
vectorization and decide for yourselves.  The testcase is

int foo (int *p)
{
  int x[4];
  int tem0, tem1, tem2, tem3;
  tem0 = p[0] + 1;
  x[0] = tem0;
  tem1 = p[1] + 2;
  x[1] = tem1;
  tem2 = p[2] + 3;
  x[2] = tem2;
  tem3 = p[3] + 4;
  x[3] = tem3;
  bar (x);
  return tem0 + tem1 + tem2 + tem3;
}

which was added to cover the situation where we vectorize the store to
x[] but have to keep the scalar computations for tem[0-3] for the final
reduction.

The scalar cost for this kernel is 3*4 while the vector cost is
unaligned load + aligned load + vector op + aligned vector store.  We
compensate for the out-of-kernel uses by making the scalar cost just
the stores (4).  Now if all of the vector cost pieces are 1 then we
have 4 vs. 4 here.  On x86_64 the unaligned load cost is 2 and thus
vectorization is deemed non-profitable (a small stand-alone sketch of
this arithmetic follows at the end of this mail).  In reality this
depends on the actual constants used in the plus as that tells you
whether the constant is free for the scalar plus or not (for vectors
it almost always comes from the constant pool).

On x86_64 the assembler difference is

	movl	(%rdi), %eax
	movdqu	(%rdi), %xmm0
	leal	1(%rax), %r13d
	movl	4(%rdi), %eax
	paddd	.LC0(%rip), %xmm0
	movaps	%xmm0, (%rsp)
	leal	2(%rax), %r12d
	movl	8(%rdi), %eax
	addl	%r13d, %r12d
	leal	3(%rax), %ebp
	movl	12(%rdi), %eax
	movq	%rsp, %rdi
	addl	%r12d, %ebp
	leal	4(%rax), %ebx
	call	bar
	leal	0(%rbp,%rbx), %eax

vs.

	movl	(%rdi), %eax
	movl	12(%rdi), %ebx
	leal	1(%rax), %r13d
	movl	4(%rdi), %eax
	addl	$4, %ebx
	movl	%ebx, 12(%rsp)
	movl	%r13d, (%rsp)
	leal	2(%rax), %r12d
	movl	8(%rdi), %eax
	movq	%rsp, %rdi
	movl	%r12d, 4(%rsp)
	leal	3(%rax), %ebp
	movl	%ebp, 8(%rsp)
	call	bar
	leal	0(%r13,%r12), %eax
	addl	%ebp, %eax
	addl	%ebx, %eax

Clearly the testcase may need adjustment so that it is not so much on
the edge for individual targets.  I'm considering changing it.

The testcase also shows the lack of reduction vectorization in BBs (I
have partially finished patches for this but got distracted...).

> , so I imagine the xfail directive may
> just need updating.  The test also looks to be failing on powerpc64 (according
> to https://gcc.gnu.org/ml/gcc-testresults/2015-11/msg01327.html).

I'll try making the testcase more complicated instead.

Richard.

> Regards, Alan
>

-- 
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nuernberg)
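
For reference, here is a minimal stand-alone sketch of the cost
arithmetic discussed above.  It is not the actual GCC code in
vect_bb_vectorization_profitable_p; the struct, the function name and
the individual cost values are illustrative assumptions wired into a
tiny C program, using the "all pieces cost 1" and the x86_64-like
numbers from the mail.

#include <stdio.h>
#include <stdbool.h>

/* Toy cost model for the vectorized version of the kernel: one
   unaligned load of p[0..3], one aligned load of the constant vector
   from the constant pool, the vector add and the aligned vector
   store to x[].  */
struct cost_model
{
  int unaligned_load;
  int aligned_load;
  int vector_op;
  int aligned_store;
};

static bool
bb_vectorization_profitable_p (struct cost_model c)
{
  /* Each scalar group is load + add + store (3), four groups in the
     kernel.  Because tem0..tem3 are live after the kernel the loads
     and adds stay anyway, so only the four scalar stores count.  */
  int scalar_cost = 4;
  int vector_cost = c.unaligned_load + c.aligned_load
		    + c.vector_op + c.aligned_store;
  /* Toy rendering of "equal cost favors the vectorized version":
     a tie no longer rejects vectorization.  */
  return vector_cost <= scalar_cost;
}

int
main (void)
{
  struct cost_model generic = { 1, 1, 1, 1 };	/* 4 vs. 4  */
  struct cost_model x86_64 = { 2, 1, 1, 1 };	/* 5 vs. 4  */
  printf ("generic: %s\n",
	  bb_vectorization_profitable_p (generic)
	  ? "vectorize" : "not profitable");
  printf ("x86_64:  %s\n",
	  bb_vectorization_profitable_p (x86_64)
	  ? "vectorize" : "not profitable");
  return 0;
}

With these numbers the generic case prints "vectorize" and the x86_64
case prints "not profitable", matching the 4 vs. 4 and 5 vs. 4
comparisons described above.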