On Mon, 20 May 2019, Richard Biener wrote:

> On Sun, 19 May 2019, Richard Sandiford wrote:
>
> > Richard Biener <rguent...@suse.de> writes:
> > > This adds, incrementally on top of moving VEC_PERM_EXPR folding
> > > to match.pd (but not incremental patch - sorry...), folding
> > > of single-element insert permutations to BIT_INSERT_EXPR.
> > >
> > > Things are quite awkward with the new poly-int vec-perm stuff
> > > so effectively this doesn't work for SVE and is still very
> > > ugly.  I wonder how to make it generic enough so SVE would
> > > be happy and / or how to make the code prettier.
> > >
> > > I also can't find a helper to read a specific vector element
> > > from a VECTOR_CST/CONSTRUCTOR (can I even do that "generally"
> > > with a poly-int index?!), but there surely must be one.
> >
> > Yeah, would be nice to have a helper that handles both VECTOR_CST
> > and CONSTRUCTOR, even if just for constant indices.
> >
> > CONSTRUCTORs are still very much fixed-length, so it wouldn't really
> > make sense to fold a poly_int index at compile time.  The only case we
> > could handle is when the index is known to be beyond the last nonzero
> > element.
> >
> > We could fold some poly_int VECTOR_CST_ELT indices to poly_ints, but
> > it'd depend on the values involved.
> >
> > Indexing specific poly_int elements of a single vector is a bit dubious
> > in length-agnostic code though.  Not saying it'll never be useful, but
> > it's certainly not a native SVE operation.  So I think even for SVE,
> > constant VECTOR_CST/CONSTRUCTOR elements are the only interesting case
> > for now.
> >
> > > [...]
> > > @@ -11774,61 +11777,7 @@ fold_ternary_loc (location_t loc, enum t
> > >          poly_uint64 nelts = TYPE_VECTOR_SUBPARTS (type);
> > >          bool single_arg = (op0 == op1);
> > >          vec_perm_indices sel (builder, single_arg ? 1 : 2, nelts);
> > > -
> > > -        /* Check for cases that fold to OP0 or OP1 in their original
> > > -           element order.  */
> > > -        if (sel.series_p (0, 1, 0, 1))
> > > -          return op0;
> > > -        if (sel.series_p (0, 1, nelts, 1))
> > > -          return op1;
> > > -
> > > -        if (!single_arg)
> > > -          {
> > > -            if (sel.all_from_input_p (0))
> > > -              op1 = op0;
> > > -            else if (sel.all_from_input_p (1))
> > > -              {
> > > -                op0 = op1;
> > > -                sel.rotate_inputs (1);
> > > -              }
> > > -          }
> >
> > Since this isn't an incremental patch... :-)
> >
> > One of the holes of the current code is that we still allow two
> > permute indices for the same permutation, e.g. { 4, 1, 2, 3 } and
> > { 0, 5, 6, 7 }.  IMO we should canonicalize it so that the first index
> > selects from the first vector.  So maybe the above should be:
> >
> >   if (!single_arg)
> >     {
> >       if (known_ge (sel[0], nunits))
> >         {
> >           std::swap (op0, op1);
> >           sel.rotate_inputs (1);
> >         }
> >       if (sel.all_from_input_p (0))
> >         op1 = op0;
> >     }
> >
> > > [...]
> > > +        /* See if the permutation is performing a single element
> > > +           insert from a CONSTRUCTOR or constant and use a BIT_INSERT_EXPR
> > > +           in that case.  */
> > > +        unsigned HOST_WIDE_INT cnelts;
> > > +        if ((TREE_CODE (cop0) == VECTOR_CST
> > > +             || TREE_CODE (cop0) == CONSTRUCTOR
> > > +             || TREE_CODE (cop1) == VECTOR_CST
> > > +             || TREE_CODE (cop1) == CONSTRUCTOR)
> > > +            && nelts.is_constant (&cnelts))
> > > +          {
> > > +            unsigned first = 0, first_oo = 0, first_i;
> > > +            unsigned second = 0, second_oo = 0, second_i;
> > > +            HOST_WIDE_INT idx;
> > > +            for (unsigned HOST_WIDE_INT i = 0; i < cnelts; ++i)
> > > +              if (!sel[i].is_constant (&idx))
> > > +                {
> > > +                  first = second = 0;
> > > +                  break;
> > > +                }
> > > +              else if ((unsigned HOST_WIDE_INT)idx < cnelts)
> > > +                {
> > > +                  first_i = i;
> > > +                  first++;
> > > +                  first_oo += (unsigned HOST_WIDE_INT)idx != i;
> > > +                }
> > > +              else
> > > +                {
> > > +                  second_i = i;
> > > +                  second++;
> > > +                  second_oo += (unsigned HOST_WIDE_INT)idx != i + cnelts;
> > > +                }
> >
> > This won't handle the case in which the inserted element comes from
> > the same vector.
> >
> > If we add the extra canonicalization above, we'd only ever be inserting
> > into the second vector at element 0.  The test for that would be:
> >
> >   if (sel.series_p (1, 1, nelts + 1, 1))
> >     // insert sel[0] into index 0 of the second vector
> >
> > I think the SVE-friendly way of doing the check for the first vector
> > would be:
> >
> >   unsigned int encoded_nelts = sel.encoding ().encoded_nelts ();
> >   unsigned int i = 0;
> >   for (; i < encoded_nelts; ++i)
> >     if (maybe_ne (sel[i], i))
> >       break;
> >   if (i < encoded_nelts && sel.series_p (i + 1, 1, i + 1, 1))
> >     // insert sel[i] into index i of the first vector
> >
> > Although that's two searches, one of them will halt very early.
> >
> > The SVE-friendly way of getting the VECTOR_CST/CONSTRUCTOR element
> > would be:
> >
> >   poly_uint64 idx = sel[i];
> >   unsigned HOST_WIDE_INT const_idx;
> >   if (known_le (idx, nelts) && idx.is_constant (&const_idx))
> >     // const_idx from first vector
> >   else if (known_ge (idx, nelts) && (idx - nelts).is_constant (&const_idx))
> >     // const_idx from second vector
> >   else
> >     // bail out
> >
> > ...which is a bit ugly, I admit.
>
> So in the end it's all still a bit smaller than before.  I've split
> out a fold_read_from_vector routine.
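To make the selector discussion above concrete, here is a minimal
sketch in GNU C (the function names are made up for illustration;
this is not part of the patch):

  typedef int v4si __attribute__ ((__vector_size__ (16)));

  /* Selector { 4, 1, 2, 3 } takes element 0 from Y and elements 1-3
     from X, i.e. it inserts Y[0] at index 0 of X.  */
  v4si
  perm_a (v4si x, v4si y)
  {
    return __builtin_shuffle (x, y, (v4si) { 4, 1, 2, 3 });
  }

  /* The same permutation written with swapped operands and selector
     { 0, 5, 6, 7 }.  With the canonicalization discussed above (the
     first index selects from the first vector) both forms map to one
     canonical VEC_PERM_EXPR.  */
  v4si
  perm_b (v4si x, v4si y)
  {
    return __builtin_shuffle (y, x, (v4si) { 0, 5, 6, 7 });
  }

Note that the BIT_INSERT_EXPR folding itself additionally requires
one operand to be a VECTOR_CST or CONSTRUCTOR, as in the testcase
below.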
And the following is what I applied after fixing all sign-compare
issues.  Bootstrapped and tested on x86_64-unknown-linux-gnu.

2019-05-21  Richard Biener  <rguent...@suse.de>

	PR middle-end/90510
	* fold-const.c (fold_read_from_vector): New function.
	* fold-const.h (fold_read_from_vector): Declare.
	* match.pd (VEC_PERM_EXPR): Build BIT_INSERT_EXPRs for
	single-element insert permutations.  Canonicalize selector
	further and fix issue with last commit.

	* gcc.target/i386/pr90510.c: New testcase.

Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c	(revision 271412)
+++ gcc/fold-const.c	(working copy)
@@ -13793,6 +13793,28 @@ fold_read_from_constant_string (tree exp
   return NULL;
 }
 
+/* Folds a read from vector element at IDX of vector ARG.  */
+
+tree
+fold_read_from_vector (tree arg, poly_uint64 idx)
+{
+  unsigned HOST_WIDE_INT i;
+  if (known_lt (idx, TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)))
+      && known_ge (idx, 0u)
+      && idx.is_constant (&i))
+    {
+      if (TREE_CODE (arg) == VECTOR_CST)
+	return VECTOR_CST_ELT (arg, i);
+      else if (TREE_CODE (arg) == CONSTRUCTOR)
+	{
+	  if (i >= CONSTRUCTOR_NELTS (arg))
+	    return build_zero_cst (TREE_TYPE (TREE_TYPE (arg)));
+	  return CONSTRUCTOR_ELT (arg, i)->value;
+	}
+    }
+  return NULL_TREE;
+}
+
 /* Return the tree for neg (ARG0) when ARG0 is known to be either
    an integer constant, real, or fixed-point constant.
Index: gcc/fold-const.h
===================================================================
--- gcc/fold-const.h	(revision 271412)
+++ gcc/fold-const.h	(working copy)
@@ -100,6 +100,7 @@ extern tree fold_bit_and_mask (tree, tre
 				tree, enum tree_code, tree, tree,
 				tree, enum tree_code, tree, tree, tree *);
 extern tree fold_read_from_constant_string (tree);
+extern tree fold_read_from_vector (tree, poly_uint64);
 #if GCC_VEC_PERN_INDICES_H
 extern tree fold_vec_perm (tree, tree, tree, const vec_perm_indices &);
 #endif
Index: gcc/match.pd
===================================================================
--- gcc/match.pd	(revision 271412)
+++ gcc/match.pd	(working copy)
@@ -5406,6 +5406,11 @@ (define_operator_list COND_TERNARY
 	     op0 = op1;
 	     sel.rotate_inputs (1);
 	   }
+	 else if (known_ge (poly_uint64 (sel[0]), nelts))
+	   {
+	     std::swap (op0, op1);
+	     sel.rotate_inputs (1);
+	   }
       }
     gassign *def;
     tree cop0 = op0, cop1 = op1;
@@ -5429,9 +5434,46 @@ (define_operator_list COND_TERNARY
    (with
     {
       bool changed = (op0 == op1 && !single_arg);
+      tree ins = NULL_TREE;
+      unsigned at = 0;
+
+      /* See if the permutation is performing a single element
+	 insert from a CONSTRUCTOR or constant and use a BIT_INSERT_EXPR
+	 in that case.  But only if the vector mode is supported,
+	 otherwise this is invalid GIMPLE.  */
+      if (TYPE_MODE (type) != BLKmode
+	  && (TREE_CODE (cop0) == VECTOR_CST
+	      || TREE_CODE (cop0) == CONSTRUCTOR
+	      || TREE_CODE (cop1) == VECTOR_CST
+	      || TREE_CODE (cop1) == CONSTRUCTOR))
+	{
+	  if (sel.series_p (1, 1, nelts + 1, 1))
+	    {
+	      /* After canonicalizing the first elt to come from the
+		 first vector we only can insert the first elt from
+		 the first vector.  */
+	      at = 0;
+	      ins = fold_read_from_vector (cop0, 0);
+	      op0 = op1;
+	    }
+	  else
+	    {
+	      unsigned int encoded_nelts = sel.encoding ().encoded_nelts ();
+	      for (at = 0; at < encoded_nelts; ++at)
+		if (maybe_ne (sel[at], at))
+		  break;
+	      if (at < encoded_nelts && sel.series_p (at + 1, 1, at + 1, 1))
+		{
+		  if (known_lt (at, nelts))
+		    ins = fold_read_from_vector (cop0, sel[at]);
+		  else
+		    ins = fold_read_from_vector (cop1, sel[at] - nelts);
+		}
+	    }
+	}
 
       /* Generate a canonical form of the selector.  */
-      if (sel.encoding () != builder)
+      if (!ins && sel.encoding () != builder)
 	{
 	  /* Some targets are deficient and fail to expand a single
 	     argument permutation while still allowing an equivalent
@@ -5450,10 +5492,12 @@ (define_operator_list COND_TERNARY
 	     so use the preferred form.  */
 	  op2 = vec_perm_indices_to_tree (TREE_TYPE (op2), sel);
 	}
-      /* Differences in the encoder do not necessarily mean
-	 differences in the resulting vector.  */
-      changed = !operand_equal_p (op2, oldop2, 0);
+      if (!operand_equal_p (op2, oldop2, 0))
+	changed = true;
 	}
     }
-    (if (changed)
-     (vec_perm { op0; } { op1; } { op2; })))))))))
+    (if (ins)
+     (bit_insert { op0; } { ins; }
+		  { bitsize_int (at * tree_to_uhwi (TYPE_SIZE (TREE_TYPE (type)))); })
+     (if (changed)
+      (vec_perm { op0; } { op1; } { op2; }))))))))))
Index: gcc/testsuite/gcc.target/i386/pr90510.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr90510.c	(revision 0)
+++ gcc/testsuite/gcc.target/i386/pr90510.c	(working copy)
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2 -fdump-tree-optimized" } */
+
+typedef double __v2df __attribute__ ((__vector_size__ (16)));
+typedef long long __v2di __attribute__ ((__vector_size__ (16)));
+
+__v2df
+_mm_add_sd_A (__v2df x, __v2df y)
+{
+  double tem = x[0] + y[0];
+  return __builtin_shuffle ( x, (__v2df) { tem, tem }, (__v2di) { 2, 1 } );
+}
+
+__v2df
+_mm_add_sd_B (__v2df x, __v2df y)
+{
+  __v2df z = { (x[0] + y[0]), x[1] };
+  return z;
+}
+
+/* { dg-final { scan-tree-dump-times "BIT_INSERT_EXPR" 2 "optimized" } } */
+/* { dg-final { scan-assembler-not "unpck" } } */
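As a further illustration, the bit position passed to BIT_INSERT_EXPR
is at * TYPE_SIZE of the vector element type, so an insert at a
nonzero index works out as follows (a sketch only, with a made-up
function name, not part of the committed testcase):

  typedef double __v2df __attribute__ ((__vector_size__ (16)));
  typedef long long __v2di __attribute__ ((__vector_size__ (16)));

  __v2df
  insert_upper (__v2df x, double tem)
  {
    /* Selector { 0, 2 } keeps x[0] and takes element 0 of the second
       vector, i.e. it inserts tem at index 1.  With 64-bit double
       elements the expected fold is BIT_INSERT_EXPR <x, tem, 64>.  */
    return __builtin_shuffle (x, (__v2df) { tem, tem }, (__v2di) { 0, 2 });
  }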