On Tue, 13 Feb 2018, Jeff Law wrote: > On 01/30/2018 02:59 AM, Richard Biener wrote: > > > > This patch tries to deal with the "easy" part of a function ABI, > > the return value location, in vectorization costing. The testcase > > shows that if we vectorize the returned value but the function > > doesn't return in memory or in a vector register but as in this > > case in an integer register pair (reg:TI ax) (bah, ABI details > > exposed late? why's this not a parallel?) we end up spilling > > badly. > PARALLEL is used when the ABI mandates a value be returned in multiple > places. Typically that happens when the value is returned in different > types of registers (integer, floating point, vector). > > Presumably it's not a PARALLEL in this case because the value is only > returned in %eax.
It's returned in %eax and %rdx (TImode after all). But maybe "standard register pairs" are not represented as PARALLEL ... > > > > The idea is to account for such spilling so if vectorization > > benefits outweight the spilling we'll vectorize anyway. > That's a pretty serious bleed of the target into the vectorizer. But > we've already deemed that the vectorizer is going to have these target > dependencies. So I won't object on those grounds. > > > > > > I think the particular testcase could be fixed in the subreg > > pass basically undoing the vectorization but I realize that > > generally this is a too hard problem and avoiding vectorization > > is better. Still this patch is somewhat fragile in that it > > depends on us "seeing" that the stored to decl is returned > > (see cfun_returns). > > > > Bootstrap & regtest running on x86_64-unknown-linux-gnu. > > > > I'd like to hear opinions on my use of hard_function_value > > and also from other target maintainers. I'm not sure we > > have sufficient testsuite coverage of _profitable_ vectorization > > of a return value. Feel free to add to this for your > > target. > Well, it's the right way to get at the information. I'm not aware of > any other way to get what you want. We could possibly hide the RTL bits > to avoid GET_MODE and friends within the vectorizer -- your call. > > I'm not sure the bits in vect_mode_store_cost are right though. ISTM > you want to penalize if and only if the return value is not stored in a > vector-capable location. If it's a PARALLEL and any element is a > suitable vector register, then do not penalize -- the spills are > unavoidable in that case. So we vectorize sth like res[0] = a; res[1] = b; to vector_reg = {a, b}; MEM[&res] = vector_reg; and if 'res' is a PARALLEL we'd end up spilling vector_reg so we can assign it to MEM[&res] and/or decompose 'res' according to the PARALLEL. Without this we should be able to manage to emit direct sets of the PARALLELs components to a and b? Certainly it works that way for the (reg:TI ax) testcase (which is not a PARALLEL ...). > So I think you have to iterate over elements in the PARALLEL case to > verify none of them are suitable for holding the vector result. Ok, so in case ret was bigger than just res[0] and res[1] and the PARALLEL for it would contain both a vector of two elements and sth for the rest then yes, I guess we might be able to generate optimal code for the assignment to the vector part of the PARALLEL. > I'm not entirely sure what to do with CONCAT. I wasn't immediately > aware it could show up in that context. It was just a guess that it might occur for complex vars. OTOH I wasn't able to create a testcase that didn't end up using a SSA name for the actual return value so we don't catch this case with the patch. void bar(_Complex double *); _Complex double foo (double x, double y) { _Complex double z; bar (&z); __real z = x; __imag z = y; return z; } gets vectorized to foo: .LFB0: .cfi_startproc subq $40, %rsp .cfi_def_cfa_offset 48 unpcklpd %xmm1, %xmm0 leaq 16(%rsp), %rdi movaps %xmm0, (%rsp) call bar movapd (%rsp), %xmm0 movaps %xmm0, 16(%rsp) movsd 16(%rsp), %xmm0 movsd 24(%rsp), %xmm1 addq $40, %rsp .cfi_def_cfa_offset 8 ret vs. non-vectorized foo: .LFB0: .cfi_startproc subq $40, %rsp .cfi_def_cfa_offset 48 leaq 16(%rsp), %rdi movsd %xmm0, 8(%rsp) movsd %xmm1, (%rsp) call bar movsd 8(%rsp), %xmm0 movsd (%rsp), %xmm1 addq $40, %rsp .cfi_def_cfa_offset 8 ret but hard_function_value indeed returns a PARALLEL here: (parallel:DC [ (expr_list:REG_DEP_TRUE (reg:DF 21 xmm0) (const_int 0 [0])) (expr_list:REG_DEP_TRUE (reg:DF 22 xmm1) (const_int 8 [0x8])) ]) > Or am I missing something here? No idea - how would a testcase with a PARALLEL like you mention look like? On x86_64 stuff like struct A { v2df x; v2df y; }; is returned via an invisible by-reference parameter. ISTR PARALLELs are really "parallel" so are they in a specific order when returned from hard_function_value? Are there ever PARALLELs with just one element in this context? I can happily restrict the patch to the REG_P && ! VECTOR_MODE_P case for now as that is the only testcase I have sofar. I also agree it's somewhat ugly and a hack only given it doesn't handle the _Complex testcase above... Richard. > > > > > Ok for trunk? > > > > Thanks, > > Richard. > > > > 2018-01-30 Richard Biener <rguent...@suse.de> > > > > PR tree-optimization/84101 > > * tree-vect-stmts.c: Include explow.h for hard_function_value. > > (cfun_returns): New helper. > > (vect_model_store_cost): When vectorizing a store to a decl > > we return and the function ABI returns via a non-vector > > register account for the possible spilling that will happen. > > > > * gcc.target/i386/pr84101.c: New testcase. > > > Jeff > > -- Richard Biener <rguent...@suse.de> SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)