"Richard Earnshaw (lists)" <richard.earns...@arm.com> writes: > On 16/05/18 09:37, Kyrill Tkachov wrote: >> >> On 15/05/18 10:58, Richard Biener wrote: >>> On Tue, May 15, 2018 at 10:20 AM Kyrill Tkachov >>> <kyrylo.tkac...@foss.arm.com> >>> wrote: >>> >>>> Hi all, >>>> This is a respin of James's patch from: >>> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html >>>> The original patch was approved and committed but was later reverted >>> because of failures on big-endian. >>>> This tweaked version fixes the big-endian failures in >>> aarch64_expand_vector_init by picking the right >>>> element of VALS to move into the low part of the vector register >>> depending on endianness. The rest of the patch >>>> stays the same. I'm looking for approval on the aarch64 parts, as they >>> are the ones that have changed >>>> since the last approved version of the patch. >>>> ----------------------------------------------------------------------- >>>> In the testcase in this patch we create an SLP vector with only two >>>> elements. Our current vector initialisation code will first duplicate >>>> the first element to both lanes, then overwrite the top lane with a new >>>> value. >>>> This duplication can be clunky and wasteful. >>>> Better would be to simply use the fact that we will always be >>>> overwriting >>>> the remaining bits, and simply move the first element to the corrcet >>>> place >>>> (implicitly zeroing all other bits). >>>> This reduces the code generation for this case, and can allow more >>>> efficient addressing modes, and other second order benefits for AArch64 >>>> code which has been vectorized to V2DI mode. >>>> Note that the change is generic enough to catch the case for any vector >>>> mode, but is expected to be most useful for 2x64-bit vectorization. 
>>>> Unfortunately, on its own, this would cause failures in
>>>> gcc.target/aarch64/load_v2vec_lanes_1.c and
>>>> gcc.target/aarch64/store_v2vec_lanes.c, which expect to see many more
>>>> vec_merge and vec_duplicate for their simplifications to apply.  To fix
>>>> this, add a special case to the AArch64 code if we are loading from two
>>>> memory addresses, and use the load_pair_lanes patterns directly.
>>>>
>>>> We also need a new pattern in simplify-rtx.c:simplify_ternary_operation
>>>> to catch:
>>>>
>>>>   (vec_merge:OUTER
>>>>      (vec_duplicate:OUTER x:INNER)
>>>>      (subreg:OUTER y:INNER 0)
>>>>      (const_int N))
>>>>
>>>> and simplify it to:
>>>>
>>>>   (vec_concat:OUTER x:INNER y:INNER) or (vec_concat y x)
>>>>
>>>> This is similar to the existing patterns which are tested in this
>>>> function, without requiring the second operand to also be a
>>>> vec_duplicate.
>>>>
>>>> Bootstrapped and tested on aarch64-none-linux-gnu and tested on
>>>> aarch64-none-elf.
>>>>
>>>> Note that this requires
>>>> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
>>>> if we don't want to ICE creating broken vector zero extends.
>>>>
>>>> Are the non-AArch64 parts OK?
>>>
>>> Is (vec_merge (subreg ..) (vec_duplicate)) canonicalized to the form
>>> you handle?  I see the (vec_merge (vec_duplicate...) (vec_concat)) case
>>> also doesn't handle the swapped operand case.
>>>
>>> Otherwise the middle-end parts look ok.
>>
>> I don't see any explicit canonicalisation code for it.
>> I've updated the simplify-rtx part to handle the swapped operand case.
>> Is the attached patch better in this regard?  I couldn't think of a
>> clean way to avoid duplicating some logic (beyond creating a new
>> function away from the callsite).
>>
>> Thanks,
>> Kyrill
>>
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> James
>>>>
>>>> ---
>>>>
>>>> 2018-05-15  James Greenhalgh  <james.greenha...@arm.com>
>>>>             Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
>>>>
>>>>         * config/aarch64/aarch64.c (aarch64_expand_vector_init): Modify
>>>>         code generation for cases where splatting a value is not useful.
>>>>         * simplify-rtx.c (simplify_ternary_operation): Simplify
>>>>         vec_merge across a vec_duplicate and a paradoxical subreg
>>>>         forming a vector mode to a vec_concat.
>>>>
>>>> 2018-05-15  James Greenhalgh  <james.greenha...@arm.com>
>>>>
>>>>         * gcc.target/aarch64/vect-slp-dup.c: New.
>>
>
> I'm surprised we don't seem to have a function in the compiler that
> performs this check:
>
> +	  && rtx_equal_p (XEXP (x1, 0),
> +			  plus_constant (Pmode,
> +					 XEXP (x0, 0),
> +					 GET_MODE_SIZE (inner_mode))))
>
> without generating dead RTL (plus_constant will rarely be able to return
> a subexpression of the original pattern).  I would have thought this
> sort of test was not that uncommon.
FWIW, I think the way to write it without generating dead RTL is:

  rtx_equal_p (strip_offset (XEXP (x0, 0), &x0_offset),
	       strip_offset (XEXP (x1, 0), &x1_offset))
  && known_eq (x1_offset, x0_offset + GET_MODE_SIZE (inner_mode))

But yeah, a helper would be nice at some point.

> However, I don't think that needs to hold up this patch.
>
> OK.
>
> R.

>>
>> vec-splat.patch
>>
>>
>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index a2003fe52875f1653d644347bafd7773d1f01e91..6bf6c05535b61eef1021d46bcd8448fb3a0b25f4 100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -13916,9 +13916,54 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>>  	  maxv = matches[i][1];
>>  	}
>>
>> -  /* Create a duplicate of the most common element.  */
>> -  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
>> -  aarch64_emit_move (target, gen_vec_duplicate (mode, x));
>> +  /* Create a duplicate of the most common element, unless all elements
>> +     are equally useless to us, in which case just immediately set the
>> +     vector register using the first element.  */
>> +
>> +  if (maxv == 1)
>> +    {
>> +      /* For vectors of two 64-bit elements, we can do even better.  */
>> +      if (n_elts == 2
>> +	  && (inner_mode == E_DImode
>> +	      || inner_mode == E_DFmode))
>> +	{
>> +	  rtx x0 = XVECEXP (vals, 0, 0);
>> +	  rtx x1 = XVECEXP (vals, 0, 1);
>> +	  /* Combine can pick up this case, but handling it directly
>> +	     here leaves clearer RTL.
>> +
>> +	     This is load_pair_lanes<mode>, and also gives us a clean-up
>> +	     for store_pair_lanes<mode>.  */
>> +	  if (memory_operand (x0, inner_mode)
>> +	      && memory_operand (x1, inner_mode)
>> +	      && !STRICT_ALIGNMENT
>> +	      && rtx_equal_p (XEXP (x1, 0),
>> +			      plus_constant (Pmode,
>> +					     XEXP (x0, 0),
>> +					     GET_MODE_SIZE (inner_mode))))
>> +	    {
>> +	      rtx t;
>> +	      if (inner_mode == DFmode)
>> +		t = gen_load_pair_lanesdf (target, x0, x1);
>> +	      else
>> +		t = gen_load_pair_lanesdi (target, x0, x1);
>> +	      emit_insn (t);
>> +	      return;
>> +	    }
>> +	}
>> +      /* The subreg-move sequence below will move into lane zero of the
>> +	 vector register.  For big-endian we want that position to hold
>> +	 the last element of VALS.  */
>> +      maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
>> +      rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
>> +      aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
>> +    }
>> +  else
>> +    {
>> +      rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
>> +      aarch64_emit_move (target, gen_vec_duplicate (mode, x));
>> +    }
>>
>>    /* Insert the rest.  */
>>    for (int i = 0; i < n_elts; i++)
>> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
>> index e96c9d1b441fcfcbddcbffb0c1c7c0e2a871a2a3..d2714db7ae8ef946a6ce035bea59ddbec890e905 100644
>> --- a/gcc/simplify-rtx.c
>> +++ b/gcc/simplify-rtx.c
>> @@ -5891,6 +5891,60 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
>>  	  return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
>>  	}
>>
>> +      /* Replace:
>> +
>> +	  (vec_merge:outer (vec_duplicate:outer x:inner)
>> +			   (subreg:outer y:inner 0)
>> +			   (const_int N))
>> +
>> +	 with (vec_concat:outer x:inner y:inner) if N == 1,
>> +	 or (vec_concat:outer y:inner x:inner) if N == 2.
>> +
>> +	 Implicitly, this means we have a paradoxical subreg, but such
>> +	 a check is cheap, so make it anyway.
>> +
>> +	 Only applies for vectors of two elements.  */
>> +      if (GET_CODE (op0) == VEC_DUPLICATE
>> +	  && GET_CODE (op1) == SUBREG
>> +	  && GET_MODE (op1) == GET_MODE (op0)
>> +	  && GET_MODE (SUBREG_REG (op1)) == GET_MODE (XEXP (op0, 0))
>> +	  && paradoxical_subreg_p (op1)
>> +	  && subreg_lowpart_p (op1)
>> +	  && known_eq (GET_MODE_NUNITS (GET_MODE (op0)), 2)
>> +	  && known_eq (GET_MODE_NUNITS (GET_MODE (op1)), 2)
>> +	  && IN_RANGE (sel, 1, 2))
>> +	{
>> +	  rtx newop0 = XEXP (op0, 0);
>> +	  rtx newop1 = SUBREG_REG (op1);
>> +	  if (sel == 2)
>> +	    std::swap (newop0, newop1);
>> +	  return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
>> +	}
>> +
>> +      /* Same as above but with switched operands:
>> +	 Replace (vec_merge:outer (subreg:outer x:inner 0)
>> +				  (vec_duplicate:outer y:inner)
>> +				  (const_int N))
>> +
>> +	 with (vec_concat:outer x:inner y:inner) if N == 1,
>> +	 or (vec_concat:outer y:inner x:inner) if N == 2.  */
>> +      if (GET_CODE (op1) == VEC_DUPLICATE
>> +	  && GET_CODE (op0) == SUBREG
>> +	  && GET_MODE (op0) == GET_MODE (op1)
>> +	  && GET_MODE (SUBREG_REG (op0)) == GET_MODE (XEXP (op1, 0))
>> +	  && paradoxical_subreg_p (op0)
>> +	  && subreg_lowpart_p (op0)
>> +	  && known_eq (GET_MODE_NUNITS (GET_MODE (op1)), 2)
>> +	  && known_eq (GET_MODE_NUNITS (GET_MODE (op0)), 2)
>> +	  && IN_RANGE (sel, 1, 2))
>> +	{
>> +	  rtx newop0 = SUBREG_REG (op0);
>> +	  rtx newop1 = XEXP (op1, 0);
>> +	  if (sel == 2)
>> +	    std::swap (newop0, newop1);
>> +	  return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
>> +	}
>> +
>>        /* Replace (vec_merge (vec_duplicate x) (vec_duplicate y)
>>  			   (const_int n))
>>  	 with (vec_concat x y) or (vec_concat y x) depending on value
>> diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
>> new file mode 100644
>> index 0000000000000000000000000000000000000000..0541e480d1f8561dbd9b2a56926c8df60d667a54
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
>> @@ -0,0 +1,20 @@
>> +/* { dg-do compile } */
>> +
>> +/* { dg-options "-O3 -ftree-vectorize -fno-vect-cost-model" } */
>> +
>> +void bar (double);
>> +
>> +void
>> +foo (double *restrict in, double *restrict in2,
>> +     double *restrict out1, double *restrict out2)
>> +{
>> +  for (int i = 0; i < 1024; i++)
>> +    {
>> +      out1[i] = in[i] + 2.0 * in[i+128];
>> +      out1[i+1] = in[i+1] + 2.0 * in2[i];
>> +      bar (in[i]);
>> +    }
>> +}
>> +
>> +/* { dg-final { scan-assembler-not "dup\tv\[0-9\]+.2d, v\[0-9\]+" } } */