Richard Biener <rguent...@suse.de> writes:
> The following avoids over/under-read of storage when vectorizing
> a non-grouped load with SLP.  Instead of forcing peeling for gaps,
> use a smaller load for the last vector, which might otherwise access
> excess elements.  This builds upon the existing optimization avoiding
> peeling for gaps, generalizing it to all gap widths that leave a
> power-of-two number of remaining elements (but it doesn't replace
> or improve that particular case at this point).
>
> I wonder if the poly relational compares I set up are good enough
> to guarantee /* remain should now be > 0 and < nunits. */.
>
> There is existing test coverage that always runs into
> /* DR will be unused. */ when the gap is wider than nunits.
> Compared to the existing gap == nunits/2 case this only adjusts the
> load that will cause the overrun at the end, not every load.  Apart
> from the poly relational compares it should reliably cover these
> cases, but I'll leave it for stage1 to remove.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu; I've also
> built and tested SPEC CPU 2017.
>
> OK?
>
> 	PR tree-optimization/112736
> 	* tree-vect-stmts.cc (vectorizable_load): Extend the optimization
> 	to avoid peeling for gaps to handle single-element non-groups
> 	we now allow with SLP.
>
> 	* gcc.dg/torture/pr112736.c: New testcase.

Mostly LGTM FWIW.
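To make the shrink-the-last-load idea concrete (made-up numbers, not
taken from the patch or the PR), here's a standalone sketch of the
arithmetic behind the new condition; plain C, not GCC internals:

  #include <stdio.h>

  int main (void)
  {
    unsigned nunits = 4;                      /* elements per vector */
    unsigned group_size = 2, vf = 4, gap = 2; /* hypothetical values */
    unsigned scalar_elems = group_size * vf - gap;       /* 6 */
    unsigned nloads = (scalar_elems + nunits - 1) / nunits;
    for (unsigned i = 0; i < nloads; i++)
      if ((i + 1) * nunits > scalar_elems)    /* the new check */
        {
          unsigned remain = scalar_elems - i * nunits;   /* 2 */
          /* nunits is a multiple of remain, so a 2-element load
             plus zero padding can replace peeling for gaps.  */
          printf ("load %u: %u-element load, %u zeros\n",
                  i, remain, nunits - remain);
        }
      else
        printf ("load %u: full %u-element load\n", i, nunits);
    return 0;
  }

With these numbers the second (and last) vector load would touch
elements 4..7 while the scalar loop stops at element 5, so only that
load gets shrunk.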
A couple of comments below:

> ---
>  gcc/testsuite/gcc.dg/torture/pr112736.c | 27 ++++++++
>  gcc/tree-vect-stmts.cc                  | 86 ++++++++++++++++++++-----
>  2 files changed, 96 insertions(+), 17 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/torture/pr112736.c
>
> diff --git a/gcc/testsuite/gcc.dg/torture/pr112736.c b/gcc/testsuite/gcc.dg/torture/pr112736.c
> new file mode 100644
> index 00000000000..6abb56edba3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/pr112736.c
> @@ -0,0 +1,27 @@
> +/* { dg-do run { target *-*-linux* *-*-gnu* *-*-uclinux* } } */
> +
> +#include <sys/mman.h>
> +#include <unistd.h>
> +
> +int a, c[3][5];
> +
> +void __attribute__((noipa))
> +fn1 (int * __restrict b)
> +{
> +  int e;
> +  for (a = 2; a >= 0; a--)
> +    for (e = 0; e < 4; e++)
> +      c[a][e] = b[a];
> +}
> +
> +int main()
> +{
> +  long pgsz = sysconf (_SC_PAGESIZE);
> +  void *p = mmap (NULL, pgsz * 2, PROT_READ|PROT_WRITE,
> +                  MAP_ANONYMOUS|MAP_PRIVATE, 0, 0);
> +  if (p == MAP_FAILED)
> +    return 0;
> +  mprotect (p, pgsz, PROT_NONE);
> +  fn1 (p + pgsz);
> +  return 0;
> +}
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 390c8472fd6..c03c4c08c9d 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -11465,26 +11465,70 @@ vectorizable_load (vec_info *vinfo,
>                      if (new_vtype != NULL_TREE)
>                        ltype = half_vtype;
>                    }
> +                /* Try to use a single smaller load when we are about
> +                   to load accesses excess elements compared to the

s/accesses //

> +                   unrolled scalar loop.
> +                   ??? This should cover the above case as well.  */
> +                else if (known_gt ((vec_num * j + i + 1) * nunits,
> +                                   (group_size * vf - gap)))

At first it seemed odd to be using known_gt rather than maybe_gt here,
given that peeling for gaps is a correctness issue.  But as things
stand this is just an optimisation, and VLA vectors (to whatever
extent they're handled by this code) allegedly work correctly without
it.  So I agree known_gt is correct.  We might need to revisit it when
dealing with the ??? though.
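To spell out the known/maybe distinction for the archives, here's a
toy model of the comparison semantics.  It is only an illustration
(degree-1 polys in a single indeterminate N >= 0, roughly "number of
extra vector chunks" on a VLA target), not GCC's poly-int.h:

  #include <stdbool.h>
  #include <stdio.h>

  /* A poly value c0 + c1 * N for a runtime parameter N >= 0.  */
  struct poly { long c0, c1; };

  /* True if a > b for every N >= 0.  */
  static bool known_gt (struct poly a, struct poly b)
  { return a.c0 > b.c0 && a.c1 >= b.c1; }

  /* True if a > b for at least one N >= 0.  */
  static bool maybe_gt (struct poly a, struct poly b)
  { return a.c0 > b.c0 || a.c1 > b.c1; }

  int main (void)
  {
    /* Overruns for every N: known_gt holds, the new path triggers.  */
    struct poly end1 = { 8, 4 }, limit1 = { 6, 4 };
    /* Overruns only for N >= 2: maybe_gt holds but known_gt doesn't,
       so the new path stays off; acceptable while this is just an
       optimisation on top of peeling for gaps.  */
    struct poly end2 = { 4, 4 }, limit2 = { 6, 2 };
    printf ("known: %d %d\n",
            known_gt (end1, limit1), known_gt (end2, limit2));
    printf ("maybe: %d %d\n",
            maybe_gt (end1, limit1), maybe_gt (end2, limit2));
    return 0;
  }

The second pair is the case a correctness check would have to catch;
an optimisation is free to miss it.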
> +                  {
> +                    if (known_ge ((vec_num * j + i + 1) * nunits
> +                                  - (group_size * vf - gap), nunits))
> +                      /* DR will be unused.  */
> +                      ltype = NULL_TREE;
> +                    else if (alignment_support_scheme == dr_aligned)
> +                      /* Aligned access to excess elements is OK if
> +                         at least one element is accessed in the
> +                         scalar loop.  */
> +                      ;
> +                    else
> +                      {
> +                        auto remain
> +                          = ((group_size * vf - gap)
> +                             - (vec_num * j + i) * nunits);
> +                        /* remain should now be > 0 and < nunits.  */
> +                        unsigned num;
> +                        if (constant_multiple_p (nunits, remain, &num))
> +                          {
> +                            tree ptype;
> +                            new_vtype
> +                              = vector_vector_composition_type (vectype,
> +                                                                num,
> +                                                                &ptype);
> +                            if (new_vtype)
> +                              ltype = ptype;
> +                          }
> +                        /* Else use multiple loads or a masked load?  */
> +                      }
> +                  }
>                  tree offset
>                    = (dataref_offset ? dataref_offset
>                       : build_int_cst (ref_type, 0));
> -                if (ltype != vectype
> -                    && memory_access_type == VMAT_CONTIGUOUS_REVERSE)
> +                if (!ltype)
> +                  ;
> +                else if (ltype != vectype
> +                         && memory_access_type == VMAT_CONTIGUOUS_REVERSE)
>                    {
>                      unsigned HOST_WIDE_INT gap_offset
> -                      = gap * tree_to_uhwi (TYPE_SIZE_UNIT (elem_type));
> +                      = (tree_to_uhwi (TYPE_SIZE_UNIT (vectype))
> +                         - tree_to_uhwi (TYPE_SIZE_UNIT (ltype)));
>                      tree gapcst = build_int_cst (ref_type, gap_offset);
>                      offset = size_binop (PLUS_EXPR, offset, gapcst);

Can this trigger for the VLA case?  Probably not.  But it might be
worth using tree_to_poly_uint64 instead of tree_to_uhwi, and auto
instead of unsigned HOST_WIDE_INT, since the calculation is no longer
based on scalars.

>                    }
> -                data_ref
> -                  = fold_build2 (MEM_REF, ltype, dataref_ptr, offset);
> -                if (alignment_support_scheme == dr_aligned)
> -                  ;
> -                else
> -                  TREE_TYPE (data_ref)
> -                    = build_aligned_type (TREE_TYPE (data_ref),
> -                                          align * BITS_PER_UNIT);
> -                if (ltype != vectype)
> +                if (ltype)
> +                  {
> +                    data_ref
> +                      = fold_build2 (MEM_REF, ltype, dataref_ptr, offset);
> +                    if (alignment_support_scheme == dr_aligned)
> +                      ;
> +                    else
> +                      TREE_TYPE (data_ref)
> +                        = build_aligned_type (TREE_TYPE (data_ref),
> +                                              align * BITS_PER_UNIT);
> +                  }
> +                if (!ltype)
> +                  data_ref = build_constructor (vectype, NULL);
> +                else if (ltype != vectype)
>                    {
>                      vect_copy_ref_info (data_ref,
>                                          DR_REF (first_dr_info->dr));
> @@ -11494,18 +11538,26 @@ vectorizable_load (vec_info *vinfo,
>                                           gsi);
>                      data_ref = NULL;
>                      vec<constructor_elt, va_gc> *v;
> -                    vec_alloc (v, 2);
> +                    unsigned num
> +                      = (exact_div (tree_to_poly_uint64
> +                                      (TYPE_SIZE_UNIT (vectype)),
> +                                    tree_to_poly_uint64
> +                                      (TYPE_SIZE_UNIT (ltype)))
> +                         .to_constant ());

FWIW, vector_unroll_factor probably fits here, but maybe that's a bit
of a stretch.  The current version is OK with me too.

Thanks,
Richard

> +                    vec_alloc (v, num);
>                      if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
>                        {
> -                        CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> -                                                build_zero_cst (ltype));
> +                        while (--num)
> +                          CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> +                                                  build_zero_cst (ltype));
>                          CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, tem);
>                        }
>                      else
>                        {
>                          CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, tem);
> -                        CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> -                                                build_zero_cst (ltype));
> +                        while (--num)
> +                          CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> +                                                  build_zero_cst (ltype));
>                        }
>                      gcc_assert (new_vtype != NULL_TREE);
>                      if (new_vtype == vectype)
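P.S. Purely as a sanity check of the bookkeeping in those last hunks,
here's the same arithmetic on made-up numbers (again plain C, not GCC
code):

  #include <stdio.h>

  int main (void)
  {
    unsigned nunits = 8, elem_size = 4;      /* say V8SI, hypothetical */
    unsigned scalar_elems = 18;              /* group_size * vf - gap */
    unsigned remain = scalar_elems % nunits; /* 2, for the last load */
    if (remain != 0 && nunits % remain == 0)
      {
        unsigned num = nunits / remain;      /* constant_multiple_p's num */
        /* One remain-element piece is loaded; the constructor pads
           the other num - 1 pieces with zeros.  */
        printf ("%u-element load plus %u zero pieces\n", remain, num - 1);
        /* For VMAT_CONTIGUOUS_REVERSE the offset is bumped by
           size (vectype) - size (ltype) bytes, as in the hunk above.  */
        printf ("reverse offset: %u bytes\n",
                (nunits - remain) * elem_size);
      }
    return 0;
  }

For V8SI that gives a 2-element load, three zero V2SI pieces, and a
24-byte offset adjustment in the reversed case, which all lines up.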