"Richard Earnshaw (lists)" <richard.earns...@arm.com> writes: > On 16/05/18 09:37, Kyrill Tkachov wrote: >> >> On 15/05/18 10:58, Richard Biener wrote: >>> On Tue, May 15, 2018 at 10:20 AM Kyrill Tkachov >>> <kyrylo.tkac...@foss.arm.com> >>> wrote: >>> >>>> Hi all, >>>> This is a respin of James's patch from: >>> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html >>>> The original patch was approved and committed but was later reverted >>> because of failures on big-endian. >>>> This tweaked version fixes the big-endian failures in >>> aarch64_expand_vector_init by picking the right >>>> element of VALS to move into the low part of the vector register >>> depending on endianness. The rest of the patch >>>> stays the same. I'm looking for approval on the aarch64 parts, as they >>> are the ones that have changed >>>> since the last approved version of the patch. >>>> ----------------------------------------------------------------------- >>>> In the testcase in this patch we create an SLP vector with only two >>>> elements. Our current vector initialisation code will first duplicate >>>> the first element to both lanes, then overwrite the top lane with a new >>>> value. >>>> This duplication can be clunky and wasteful. >>>> Better would be to simply use the fact that we will always be >>>> overwriting >>>> the remaining bits, and simply move the first element to the corrcet >>>> place >>>> (implicitly zeroing all other bits). >>>> This reduces the code generation for this case, and can allow more >>>> efficient addressing modes, and other second order benefits for AArch64 >>>> code which has been vectorized to V2DI mode. >>>> Note that the change is generic enough to catch the case for any vector >>>> mode, but is expected to be most useful for 2x64-bit vectorization. 
>>>> Unfortunately, on its own, this would cause failures in
>>>> gcc.target/aarch64/load_v2vec_lanes_1.c and
>>>> gcc.target/aarch64/store_v2vec_lanes.c, which expect to see many more
>>>> vec_merge and vec_duplicate for their simplifications to apply.  To fix
>>>> this, add a special case to the AArch64 code if we are loading from two
>>>> memory addresses, and use the load_pair_lanes patterns directly.
>>>>
>>>> We also need a new pattern in simplify-rtx.c:simplify_ternary_operation
>>>> to catch:
>>>>
>>>>   (vec_merge:OUTER
>>>>      (vec_duplicate:OUTER x:INNER)
>>>>      (subreg:OUTER y:INNER 0)
>>>>      (const_int N))
>>>>
>>>> and simplify it to:
>>>>
>>>>   (vec_concat:OUTER x:INNER y:INNER) or (vec_concat y x)
>>>>
>>>> This is similar to the existing patterns which are tested in this
>>>> function, without requiring the second operand to also be a
>>>> vec_duplicate.
>>>>
>>>> Bootstrapped and tested on aarch64-none-linux-gnu and tested on
>>>> aarch64-none-elf.
>>>>
>>>> Note that this requires
>>>> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
>>>> if we don't want to ICE creating broken vector zero extends.
>>>>
>>>> Are the non-AArch64 parts OK?
>>>
>>> Is (vec_merge (subreg ..) (vec_duplicate)) canonicalized to the form
>>> you handle?  I see the (vec_merge (vec_duplicate...) (vec_concat)) case
>>> also doesn't handle the swapped operand case.
>>>
>>> Otherwise the middle-end parts look ok.
>>
>> I don't see any explicit canonicalisation code for it.
>> I've updated the simplify-rtx part to handle the swapped operand case.
>> Is the attached patch better in this regard?  I couldn't think of a
>> clean way to avoid duplicating some logic (beyond creating a new
>> function away from the callsite).
>>
>> Thanks,
>> Kyrill
>>
>>> Thanks,
>>> Richard.
>>>
>>>> Thanks,
>>>> James
>>>>
>>>> ---
>>>>
>>>> 2018-05-15  James Greenhalgh  <james.greenha...@arm.com>
>>>>             Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
>>>>
>>>>         * config/aarch64/aarch64.c (aarch64_expand_vector_init): Modify
>>>>         code generation for cases where splatting a value is not useful.
>>>>         * simplify-rtx.c (simplify_ternary_operation): Simplify
>>>>         vec_merge across a vec_duplicate and a paradoxical subreg
>>>>         forming a vector mode to a vec_concat.
>>>>
>>>> 2018-05-15  James Greenhalgh  <james.greenha...@arm.com>
>>>>
>>>>         * gcc.target/aarch64/vect-slp-dup.c: New.
>>
>
> I'm surprised we don't seem to have a function in the compiler that
> performs this check:
>
> +	  && rtx_equal_p (XEXP (x1, 0),
> +			  plus_constant (Pmode,
> +					 XEXP (x0, 0),
> +					 GET_MODE_SIZE (inner_mode))))
>
> without generating dead RTL (plus_constant will rarely be able to return
> a subexpression of the original pattern).  I would have thought this
> sort of test was not that uncommon.
FWIW, I think the way to write it without generating dead RTL is:

  rtx_equal_p (strip_offset (XEXP (x0, 0), &x0_offset),
	       strip_offset (XEXP (x1, 0), &x1_offset))
  && known_eq (x1_offset, x0_offset + GET_MODE_SIZE (inner_mode))

But yeah, a helper would be nice at some point.

> However, I don't think that needs to hold up this patch.
>
> OK.
>
> R.

>>
>> vec-splat.patch
>>
>>
>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index a2003fe52875f1653d644347bafd7773d1f01e91..6bf6c05535b61eef1021d46bcd8448fb3a0b25f4 100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -13916,9 +13916,54 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>>  	  maxv = matches[i][1];
>>  	}
>>
>> -  /* Create a duplicate of the most common element.  */
>> -  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
>> -  aarch64_emit_move (target, gen_vec_duplicate (mode, x));
>> +  /* Create a duplicate of the most common element, unless all elements
>> +     are equally useless to us, in which case just immediately set the
>> +     vector register using the first element.  */
>> +
>> +  if (maxv == 1)
>> +    {
>> +      /* For vectors of two 64-bit elements, we can do even better.  */
>> +      if (n_elts == 2
>> +	  && (inner_mode == E_DImode
>> +	      || inner_mode == E_DFmode))
>> +	{
>> +	  rtx x0 = XVECEXP (vals, 0, 0);
>> +	  rtx x1 = XVECEXP (vals, 0, 1);
>> +	  /* Combine can pick up this case, but handling it directly
>> +	     here leaves clearer RTL.
>> +
>> +	     This is load_pair_lanes<mode>, and also gives us a clean-up
>> +	     for store_pair_lanes<mode>.  */
>> +	  if (memory_operand (x0, inner_mode)
>> +	      && memory_operand (x1, inner_mode)
>> +	      && !STRICT_ALIGNMENT
>> +	      && rtx_equal_p (XEXP (x1, 0),
>> +			      plus_constant (Pmode,
>> +					     XEXP (x0, 0),
>> +					     GET_MODE_SIZE (inner_mode))))
>> +	    {
>> +	      rtx t;
>> +	      if (inner_mode == DFmode)
>> +		t = gen_load_pair_lanesdf (target, x0, x1);
>> +	      else
>> +		t = gen_load_pair_lanesdi (target, x0, x1);
>> +	      emit_insn (t);
>> +	      return;
>> +	    }
>> +	}
>> +      /* The subreg-move sequence below will move into lane zero of the
>> +	 vector register.  For big-endian we want that position to hold
>> +	 the last element of VALS.  */
>> +      maxelement = BYTES_BIG_ENDIAN ? n_elts - 1 : 0;
>> +      rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
>> +      aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
>> +    }
>> +  else
>> +    {
>> +      rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
>> +      aarch64_emit_move (target, gen_vec_duplicate (mode, x));
>> +    }
>>
>>    /* Insert the rest.  */
>>    for (int i = 0; i < n_elts; i++)
>> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
>> index e96c9d1b441fcfcbddcbffb0c1c7c0e2a871a2a3..d2714db7ae8ef946a6ce035bea59ddbec890e905 100644
>> --- a/gcc/simplify-rtx.c
>> +++ b/gcc/simplify-rtx.c
>> @@ -5891,6 +5891,60 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
>>  	  return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
>>  	}
>>
>> +      /* Replace:
>> +
>> +	  (vec_merge:outer (vec_duplicate:outer x:inner)
>> +			   (subreg:outer y:inner 0)
>> +			   (const_int N))
>> +
>> +	 with (vec_concat:outer x:inner y:inner) if N == 1,
>> +	 or (vec_concat:outer y:inner x:inner) if N == 2.
>> +
>> +	 Implicitly, this means we have a paradoxical subreg, but such
>> +	 a check is cheap, so make it anyway.
>> +
>> +	 Only applies for vectors of two elements.  */
>> +      if (GET_CODE (op0) == VEC_DUPLICATE
>> +	  && GET_CODE (op1) == SUBREG
>> +	  && GET_MODE (op1) == GET_MODE (op0)
>> +	  && GET_MODE (SUBREG_REG (op1)) == GET_MODE (XEXP (op0, 0))
>> +	  && paradoxical_subreg_p (op1)
>> +	  && subreg_lowpart_p (op1)
>> +	  && known_eq (GET_MODE_NUNITS (GET_MODE (op0)), 2)
>> +	  && known_eq (GET_MODE_NUNITS (GET_MODE (op1)), 2)
>> +	  && IN_RANGE (sel, 1, 2))
>> +	{
>> +	  rtx newop0 = XEXP (op0, 0);
>> +	  rtx newop1 = SUBREG_REG (op1);
>> +	  if (sel == 2)
>> +	    std::swap (newop0, newop1);
>> +	  return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
>> +	}
>> +
>> +      /* Same as above but with switched operands:
>> +	 Replace (vec_merge:outer (subreg:outer x:inner 0)
>> +				  (vec_duplicate:outer y:inner)
>> +				  (const_int N))
>> +
>> +	 with (vec_concat:outer x:inner y:inner) if N == 1,
>> +	 or (vec_concat:outer y:inner x:inner) if N == 2.  */
>> +      if (GET_CODE (op1) == VEC_DUPLICATE
>> +	  && GET_CODE (op0) == SUBREG
>> +	  && GET_MODE (op0) == GET_MODE (op1)
>> +	  && GET_MODE (SUBREG_REG (op0)) == GET_MODE (XEXP (op1, 0))
>> +	  && paradoxical_subreg_p (op0)
>> +	  && subreg_lowpart_p (op0)
>> +	  && known_eq (GET_MODE_NUNITS (GET_MODE (op1)), 2)
>> +	  && known_eq (GET_MODE_NUNITS (GET_MODE (op0)), 2)
>> +	  && IN_RANGE (sel, 1, 2))
>> +	{
>> +	  rtx newop0 = SUBREG_REG (op0);
>> +	  rtx newop1 = XEXP (op1, 0);
>> +	  if (sel == 2)
>> +	    std::swap (newop0, newop1);
>> +	  return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
>> +	}
>> +
>>        /* Replace (vec_merge (vec_duplicate x) (vec_duplicate y)
>>  			   (const_int n))
>>  	 with (vec_concat x y) or (vec_concat y x) depending on value
>> diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
>> new file mode 100644
>> index 0000000000000000000000000000000000000000..0541e480d1f8561dbd9b2a56926c8df60d667a54
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
>> @@ -0,0 +1,20 @@
>> +/* { dg-do compile } */
>> +
>> +/* { dg-options "-O3 -ftree-vectorize -fno-vect-cost-model" } */
>> +
>> +void bar (double);
>> +
>> +void
>> +foo (double *restrict in, double *restrict in2,
>> +     double *restrict out1, double *restrict out2)
>> +{
>> +  for (int i = 0; i < 1024; i++)
>> +    {
>> +      out1[i] = in[i] + 2.0 * in[i+128];
>> +      out1[i+1] = in[i+1] + 2.0 * in2[i];
>> +      bar (in[i]);
>> +    }
>> +}
>> +
>> +/* { dg-final { scan-assembler-not "dup\tv\[0-9\]+.2d, v\[0-9\]+" } } */