On Mon, 13 Jun 2016, Richard Biener wrote:
> On Fri, 10 Jun 2016, Richard Biener wrote:
>
> >
> > With the proposed cost change for vector construction we will end up
> > vectorizing the testcase in PR68961 again (on x86_64 and likely
> > on ppc64le as well after that target gets adjustments). Currently
> > we can't optimize that away again noticing the direct overlap of
> > argument and return registers. The obstackle is
> >
> > (insn 7 4 8 2 (set (reg:V2DF 93)
> > (vec_concat:V2DF (reg/v:DF 91 [ a ])
> > (reg/v:DF 92 [ aa ])))
> > ...
> > (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
> > (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
> > (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
> > (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
> >
> > which we eventually optimize to DFmode subregs of (reg:V2DF 93).
> >
> > First of all simplify_subreg doesn't handle the subregs of a vec_concat
> > (easy fix below).
> >
> > Then combine doesn't like to simplify the multi-use (it tries some
> > parallel it seems). So I went to forwprop which eventually manages
> > to do this but throws away the result (reg:DF 91) or (reg:DF 92)
> > because it is not a constant. Thus I allow arbitrary simplification
> > results for SUBREGs of [VEC_]CONCAT operations. There doesn't seem
> > to be a magic flag to tell it to restrict to the case where all
> > uses can be simplified or so, nor to restrict simplifications to a REG.
> > But I don't see any undesirable simplifications of (subreg
> > ([vec_]concat)).
> >
> > For the testcase I'm not sure if I have to exclude some ABIs (mingw?).
> >
> > Boostrap and regtest in progress on x86_64-unknown-linux-gnu, I'll
> > install the simplify-rtx.c if that succeeds but like to have opinions
> > on the fwprop.c change.
>
> So the bootstrap exposes a latent issue in simplify-rtx.c in the changed
> hunk via gcc.target/i386/mmx-8.c on i?86 which ends up with a
>
> (vec_concat:V2SI (reg:SI 103)
> (const_int 0 [0]))
>
> and thus a VOIDmode 2nd operand (I'm sure this can happen for
> complex integer concat as well, thus latent). I am adjusting the
> simplify_subreg hunk to always pass GET_MODE_INNER (innermode)
> (that hopefully exercises it a bit more than just using that
> if GET_MODE (part) == VOIDmode - and hopefully they should always
> agree).
>
> Re-bootstrap / regtest running on x86_64-unknown-linux-gnu.
That works worse given that vec_concat can be
(vec_concat:V16QI (us_truncate:V8QI (reg:V8HI 159))
(us_truncate:V8QI (reg:V8HI 160)))
... now I think the VOIDmode case can only happen for scalar vec_concat
and thus
enum machine_mode part_mode = GET_MODE (part);
if (part_mode == VOIDmode)
part_mode = GET_MODE_INNER (GET_MODE (op));
should work. Re-testing with that... (ok, I know it has coverage of
exactly one testcase on x86_64 as it would otherwise ICE).
Richard.
2016-06-13 Richard Biener <[email protected]>
PR rtl-optimization/68961
* simplify-rtx.c (simplify_subreg): Handle VEC_CONCAT like CONCAT.
* fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
to simplify to a non-constant.
* gcc.target/i386/pr68961.c: New testcase.
Index: gcc/simplify-rtx.c
===================================================================
*** gcc/simplify-rtx.c (revision 237372)
--- gcc/simplify-rtx.c (working copy)
*************** simplify_subreg (machine_mode outermode,
*** 6108,6116 ****
&& GET_MODE_SIZE (outermode) <= GET_MODE_SIZE (GET_MODE (op)))
return adjust_address_nv (op, outermode, byte);
! /* Handle complex values represented as CONCAT
! of real and imaginary part. */
! if (GET_CODE (op) == CONCAT)
{
unsigned int part_size, final_offset;
rtx part, res;
--- 6108,6117 ----
&& GET_MODE_SIZE (outermode) <= GET_MODE_SIZE (GET_MODE (op)))
return adjust_address_nv (op, outermode, byte);
! /* Handle complex or vector values represented as CONCAT or VEC_CONCAT
! of two parts. */
! if (GET_CODE (op) == CONCAT
! || GET_CODE (op) == VEC_CONCAT)
{
unsigned int part_size, final_offset;
rtx part, res;
*************** simplify_subreg (machine_mode outermode,
*** 6130,6139 ****
if (final_offset + GET_MODE_SIZE (outermode) > part_size)
return NULL_RTX;
! res = simplify_subreg (outermode, part, GET_MODE (part), final_offset);
if (res)
return res;
! if (validate_subreg (outermode, GET_MODE (part), part, final_offset))
return gen_rtx_SUBREG (outermode, part, final_offset);
return NULL_RTX;
}
--- 6131,6143 ----
if (final_offset + GET_MODE_SIZE (outermode) > part_size)
return NULL_RTX;
! enum machine_mode part_mode = GET_MODE (part);
! if (part_mode == VOIDmode)
! part_mode = GET_MODE_INNER (GET_MODE (op));
! res = simplify_subreg (outermode, part, part_mode, final_offset);
if (res)
return res;
! if (validate_subreg (outermode, part_mode, part, final_offset))
return gen_rtx_SUBREG (outermode, part, final_offset);
return NULL_RTX;
}
Index: gcc/fwprop.c
===================================================================
*** gcc/fwprop.c (revision 237372)
--- gcc/fwprop.c (working copy)
*************** propagate_rtx (rtx x, machine_mode mode,
*** 664,670 ****
|| (GET_CODE (new_rtx) == SUBREG
&& REG_P (SUBREG_REG (new_rtx))
&& (GET_MODE_SIZE (mode)
! <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx))))))
flags |= PR_CAN_APPEAR;
if (!varying_mem_p (new_rtx))
flags |= PR_HANDLE_MEM;
--- 664,673 ----
|| (GET_CODE (new_rtx) == SUBREG
&& REG_P (SUBREG_REG (new_rtx))
&& (GET_MODE_SIZE (mode)
! <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx)))))
! || ((GET_CODE (new_rtx) == VEC_CONCAT
! || GET_CODE (new_rtx) == CONCAT)
! && GET_CODE (x) == SUBREG))
flags |= PR_CAN_APPEAR;
if (!varying_mem_p (new_rtx))
flags |= PR_HANDLE_MEM;
Index: gcc/testsuite/gcc.target/i386/pr68961.c
===================================================================
*** gcc/testsuite/gcc.target/i386/pr68961.c (revision 0)
--- gcc/testsuite/gcc.target/i386/pr68961.c (working copy)
***************
*** 0 ****
--- 1,19 ----
+ /* { dg-do compile { target lp64 } } */
+ /* { dg-options "-O3 -fno-vect-cost-model -fdump-tree-slp2-details" } */
+
+ struct x { double d[2]; };
+
+ struct x
+ pack (double a, double aa)
+ {
+ struct x u;
+ u.d[0] = a;
+ u.d[1] = aa;
+ return u;
+ }
+
+ /* The function should be optimized to just return as arguments and
+ result exactly overlap even when previously vectorized. */
+
+ /* { dg-final { scan-tree-dump "basic block vectorized" "slp2" } } */
+ /* { dg-final { scan-assembler-not "mov" } } */