Richard Biener <rguent...@suse.de> writes:
> On Wed, 1 Mar 2017, Richard Sandiford wrote:
>
>> Richard Biener <rguent...@suse.de> writes:
>> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
>> >
>> >> Sorry for the late reply, but:
>> >>
>> >> Richard Biener <rguent...@suse.de> writes:
>> >> > On Mon, 7 Nov 2016, Richard Biener wrote:
>> >> >
>> >> >>
>> >> >> Currently we force peeling for gaps whenever element overrun can occur
>> >> >> but for aligned accesses we know that the loads won't trap and thus
>> >> >> we can avoid this.
>> >> >>
>> >> >> Bootstrap and regtest running on x86_64-unknown-linux-gnu (I expect
>> >> >> some testsuite fallout here so didn't bother to invent a new testcase).
>> >> >>
>> >> >> Just in case somebody thinks the overrun is a bad idea in general
>> >> >> (even when not trapping). Like for ASAN or valgrind.
>> >> >
>> >> > This is what I applied.
>> >> >
>> >> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
>> >> >
>> >> > Richard.
>> >> [...]
>> >> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
>> >> > index 15aec21..c29e73d 100644
>> >> > --- a/gcc/tree-vect-stmts.c
>> >> > +++ b/gcc/tree-vect-stmts.c
>> >> > @@ -1789,6 +1794,10 @@ get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
>> >> >        /* If there is a gap at the end of the group then these optimizations
>> >> >           would access excess elements in the last iteration.  */
>> >> >        bool would_overrun_p = (gap != 0);
>> >> > +      /* If the access is aligned an overrun is fine.  */
>> >> > +      if (would_overrun_p
>> >> > +          && aligned_access_p (STMT_VINFO_DATA_REF (stmt_info)))
>> >> > +        would_overrun_p = false;
>> >> >        if (!STMT_VINFO_STRIDED_P (stmt_info)
>> >> >           && (can_overrun_p || !would_overrun_p)
>> >> >           && compare_step_with_zero (stmt) > 0)
>> >>
>> >> ...is this right for all cases? I think it only looks for single-vector
>> >> alignment, but the gap can in principle be vector-sized or larger,
>> >> at least for load-lanes.
>> >>
>> >> E.g. say we have a 128-bit vector of doubles in a group of size 4
>> >> and a gap of 2 or 3. Even if the access itself is aligned, the group
>> >> spans two vectors and we have no guarantee that the second one
>> >> is mapped.
>> >
>> > The check assumes that if aligned_access_p () returns true then the
>> > whole access is aligned in a way that it can't cross page boundaries.
>> > That's of course not the case if alignment is 16 bytes but the access
>> > will be a multiple of that.
>> >
>> >> I haven't been able to come up with a testcase though. We seem to be
>> >> overly conservative when computing alignments.
>> >
>> > Not sure if we can run into this with load-lanes given that bumps the
>> > vectorization factor. Also does load-lane work with gaps?
>> >
>> > I think that gap can never be larger than nunits-1 so it is by definition
>> > in the last "vector" independent of the VF.
>> >
>> > Classical gap case is
>> >
>> >  for (i=0; i<n; ++i)
>> >    {
>> >      y[3*i + 0] = x[4*i + 0];
>> >      y[3*i + 1] = x[4*i + 1];
>> >      y[3*i + 2] = x[4*i + 2];
>> >    }
>> >
>> > where x has a gap of 1. You'll get VF of 12 for the above. Make
>> > the y's different streams and you should get the perfect case for
>> > load-lane:
>> >
>> >  for (i=0; i<n; ++i)
>> >    {
>> >      y[i] = x[4*i + 0];
>> >      z[i] = x[4*i + 1];
>> >      w[i] = x[4*i + 2];
>> >    }
>> >
>> > previously we'd peel at least 4 iterations into the epilogue for
>> > the fear of accessing x[4*i + 3]. When x is V4SI aligned that's
>> > ok.
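To make the overrun above concrete, here is a rough scalar-equivalent
sketch of one load-lane vector iteration of the second loop (the V4SI
width and VF of 4 are taken from the quoted message; the function name
is made up and this is purely illustrative, not generated code):

/* Scalar-equivalent of one LD4-style vector iteration of the
   load-lane loop quoted above, assuming V4SI vectors (VF = 4).  */
void
ld4_iteration_sketch (int *restrict y, int *restrict z, int *restrict w,
                      const int *x, int i)
{
  for (int lane = 0; lane < 4; ++lane)
    {
      int j = i + lane;
      y[j] = x[4 * j + 0];
      z[j] = x[4 * j + 1];
      w[j] = x[4 * j + 2];
      /* The structure load also brings in the gap element x[4*j + 3],
         which the scalar loop never reads.  If x is 16-byte (V4SI)
         aligned, that element lies in the same aligned 16-byte chunk
         as x[4*j + 0..2], so touching it cannot fault.  */
      int gap_lane = x[4 * j + 3];
      (void) gap_lane;
    }
}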
>>
>> The case I was thinking of was like the second, but with the
>> element type being DI or DF and with the + 2 statement removed.
>> E.g.:
>>
>> double __attribute__((noinline))
>> foo (double *a)
>> {
>>   double res = 0.0;
>>   for (int n = 0; n < 256; n += 4)
>>     res += a[n] + a[n + 1];
>>   return res;
>> }
>>
>> (with -ffast-math). We do use LD4 for this, and having "a" aligned
>> to V2DF isn't enough to guarantee that we can access a[n + 2]
>> and a[n + 3].
>
> Yes, indeed. It's safe when peeling for gaps would remove
> N < alignof (ref) / sizeof (ref) scalar iterations.
>
> Peeling for gaps simply subtracts one from the niter of the vectorized
> loop.
I think subtracting one is enough in all cases. It's only the final
iteration of the scalar loop that can't access a[n + 2] and a[n + 3].

(Of course, subtracting one happens before peeling for niters, so it
only makes a difference if the original niters was a multiple of the
VF, in which case we peel a full vector's worth of iterations instead
of peeling none.)

Thanks,
Richard
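P.S. For what it's worth, a tiny standalone sketch of that arithmetic
(not GCC code; the function name "show" and the "(niters - 1) / vf"
form are mine, the latter being my reading of "subtract one from the
vectorized niter before peeling for niters", and the VF of 2 for the
V2DF load-lanes case above is an assumption):

#include <stdio.h>

/* Compare the vectorized iteration count with and without peeling
   for gaps, plus the resulting scalar epilogue length.  */
static void
show (unsigned scalar_niters, unsigned vf)
{
  unsigned vec_iters_plain = scalar_niters / vf;        /* no gap peeling */
  unsigned vec_iters_gaps = (scalar_niters - 1) / vf;   /* peeling for gaps */
  unsigned epilogue = scalar_niters - vec_iters_gaps * vf;
  printf ("niters=%u vf=%u: %u -> %u vector iterations, %u scalar epilogue\n",
          scalar_niters, vf, vec_iters_plain, vec_iters_gaps, epilogue);
}

int
main (void)
{
  /* 64 scalar iterations (the foo () example above) with an assumed VF
     of 2: niters is a multiple of VF, so peeling for gaps moves a full
     vector's worth (2 iterations) into the epilogue.  */
  show (64, 2);
  /* Not a multiple of VF: the vector iteration count is unchanged and
     the existing epilogue already covers the final scalar iteration.  */
  show (63, 2);
  return 0;
}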