On Tue, 13 Feb 2018, Jeff Law wrote:

> On 01/30/2018 02:59 AM, Richard Biener wrote:
> > 
> > This patch tries to deal with the "easy" part of a function ABI,
> > the return value location, in vectorization costing.  The testcase
> > shows that if we vectorize the returned value but the function
> > doesn't return in memory or in a vector register but as in this
> > case in an integer register pair (reg:TI ax) (bah, ABI details
> > exposed late?  why's this not a parallel?) we end up spilling
> > badly.
> PARALLEL is used when the ABI mandates a value be returned in multiple
> places.  Typically that happens when the value is returned in different
> types of registers (integer, floating point, vector).
> 
> Presumably it's not a PARALLEL in this case because the value is only
> returned in %eax.

It's returned in %eax and %rdx (TImode after all).  But maybe
"standard register pairs" are not represented as PARALLEL ...

> > 
> > The idea is to account for such spilling so if vectorization
> > benefits outweight the spilling we'll vectorize anyway.
> That's a pretty serious bleed of the target into the vectorizer.  But
> we've already deemed that the vectorizer is going to have these target
> dependencies.  So I won't object on those grounds.
> 
> 
> > 
> > I think the particular testcase could be fixed in the subreg
> > pass basically undoing the vectorization but I realize that
> > generally this is a too hard problem and avoiding vectorization
> > is better.  Still this patch is somewhat fragile in that it
> > depends on us "seeing" that the stored to decl is returned
> > (see cfun_returns).
> > 
> > Bootstrap & regtest running on x86_64-unknown-linux-gnu.
> > 
> > I'd like to hear opinions on my use of hard_function_value
> > and also from other target maintainers.  I'm not sure we
> > have sufficient testsuite coverage of _profitable_ vectorization
> > of a return value.  Feel free to add to this for your
> > target.
> Well, it's the right way to get at the information.  I'm not aware of
> any other way to get what you want.  We could possibly hide the RTL bits
> to avoid GET_MODE and friends within the vectorizer -- your call.
>
> I'm not sure the bits in vect_mode_store_cost are right though.  ISTM
> you want to penalize if and only if the return value is not stored in a
> vector-capable location.  If it's a PARALLEL and any element is a
> suitable vector register, then do not penalize -- the spills are
> unavoidable in that case.

So we vectorize sth like

  res[0] = a;
  res[1] = b;

to

  vector_reg = {a, b};
  MEM[&res] = vector_reg;

and if 'res' is a PARALLEL we'd end up spilling vector_reg so we
can assign it to MEM[&res] and/or decompose 'res' according to the
PARALLEL.  Without this we should be able to manage to emit
direct sets of the PARALLELs components to a and b?  Certainly
it works that way for the (reg:TI ax) testcase (which is not a
PARALLEL ...).

> So I think you have to iterate over elements in the PARALLEL case to
> verify none of them are suitable for holding the vector result.

Ok, so in case ret was bigger than just res[0] and res[1] and the
PARALLEL for it would contain both a vector of two elements and
sth for the rest then yes, I guess we might be able to generate
optimal code for the assignment to the vector part of the PARALLEL.

> I'm not entirely sure what to do with CONCAT.  I wasn't immediately
> aware it could show up in that context.

It was just a guess that it might occur for complex vars.  OTOH
I wasn't able to create a testcase that didn't end up using
a SSA name for the actual return value so we don't catch this
case with the patch.

void bar(_Complex double *);
_Complex double foo (double x, double y)
{
  _Complex double z;
  bar (&z);
  __real z = x;
  __imag z = y;
  return z;
}

gets vectorized to

foo:
.LFB0:
        .cfi_startproc
        subq    $40, %rsp
        .cfi_def_cfa_offset 48
        unpcklpd        %xmm1, %xmm0
        leaq    16(%rsp), %rdi
        movaps  %xmm0, (%rsp)
        call    bar
        movapd  (%rsp), %xmm0
        movaps  %xmm0, 16(%rsp)
        movsd   16(%rsp), %xmm0
        movsd   24(%rsp), %xmm1
        addq    $40, %rsp
        .cfi_def_cfa_offset 8
        ret

vs. non-vectorized

foo:
.LFB0:
        .cfi_startproc
        subq    $40, %rsp
        .cfi_def_cfa_offset 48
        leaq    16(%rsp), %rdi
        movsd   %xmm0, 8(%rsp)
        movsd   %xmm1, (%rsp)
        call    bar
        movsd   8(%rsp), %xmm0
        movsd   (%rsp), %xmm1
        addq    $40, %rsp
        .cfi_def_cfa_offset 8
        ret

but hard_function_value indeed returns a PARALLEL here:

(parallel:DC [
        (expr_list:REG_DEP_TRUE (reg:DF 21 xmm0)
            (const_int 0 [0]))
        (expr_list:REG_DEP_TRUE (reg:DF 22 xmm1)
            (const_int 8 [0x8]))
    ])


> Or am I missing something here?

No idea - how would a testcase with a PARALLEL like you mention
look like?  On x86_64 stuff like struct A { v2df x; v2df y; };
is returned via an invisible by-reference parameter.

ISTR PARALLELs are really "parallel" so are they in a specific
order when returned from hard_function_value?  Are there ever
PARALLELs with just one element in this context?

I can happily restrict the patch to the REG_P && ! VECTOR_MODE_P
case for now as that is the only testcase I have sofar.
I also agree it's somewhat ugly and a hack only given it doesn't
handle the _Complex testcase above...

Richard.

> 
> > 
> > Ok for trunk?
> > 
> > Thanks,
> > Richard.
> > 
> > 2018-01-30  Richard Biener  <rguent...@suse.de>
> > 
> >     PR tree-optimization/84101
> >     * tree-vect-stmts.c: Include explow.h for hard_function_value.
> >     (cfun_returns): New helper.
> >     (vect_model_store_cost): When vectorizing a store to a decl
> >     we return and the function ABI returns via a non-vector
> >     register account for the possible spilling that will happen.
> > 
> >     * gcc.target/i386/pr84101.c: New testcase.
> 
> 
> Jeff
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nuernberg)

Reply via email to