On Mon, 9 Jan 2017, Jakub Jelinek wrote:
> On Sat, Jan 07, 2017 at 12:46:26PM +0100, Richard Biener wrote:
> > >The following patch tweaks tree-if-conv.c so that when it wants to
> > >version
> > >an outer loop, it actually transforms:
> > > loop1
> > > loop2
> > >into:
> > > if (LOOP_VECTORIZED (1, 3))
> > > {
> > > loop1
> > > loop2
> > > }
> > > else
> > > loop3 (copy of loop1)
> > > if (LOOP_VECTORIZED (4, 5))
> > > loop4 (copy of loop2)
> > > else
> > > loop5 (copy of loop4)
> >
> > Huh. Why isn't the else case equal to the if case for the vectorizer?
> > That is, we have the inner loop if-converted And thus for the if case
> > either outer or inner loop vectorization should be possible.
> >
> > So - why doesn't it work that way?
>
> The patch is a consequence of the r242520 changes:
> http://gcc.gnu.org/ml/gcc-patches/2016-11/msg01541.html
> Previously, we had
> loop1
> loop2
> transformed into:
> loop1
> if (LOOP_VECTORIZED (2, 3))
> loop2
> else
> loop3 (copy of loop2)
> (if actually we find if-conversion of loop2 desirable, there are masked
> loads/stores etc.). loop2 is if-converted after the versioning.
> This works well if the inner loop is vectorized, doesn't work at all
> (prevents it) outer loop vectorization.
> Now, with the r242520, it is transformed into:
> if (LOOP_VECTORIZED (1, 3))
> {
> loop1
> loop2
> }
> else
> loop3 (copy of loop1)
> loop4 (copy of loop2)
> if if-conversion thinks outer loop vectorization might be successful.
> In this case, loop2 is if-converted. This works well if the outer loop
> versioning is subsequently successful, doesn't work at all if it is
> unsuccessful (loop2 itself isn't LOOP_VECTORIZED guarded, so when we are
> vectorizing, we use loop2 itself as its scalar loop (so it will contain
> MASK_LOAD/MASK_STORE etc. that we then can't expand; also, as loop1 isn't
> vectorized, LOOP_VECTORIZED (1, 3) is folded into false and thus we
> effectively are vectorizing loop2 in dead code, loop3/loop4 will be used
> instead (loop3 is marked as dont_vectorize, so we don't try to vectorize it,
> loop4 isn't, so might be vectorized, but only if no if-conversion is needed
> for it (but tree-if-conversion determined it is needed)).
> With my patch, we have instead:
> if (LOOP_VECTORIZED (1, 3))
> {
> loop1
> loop2
> }
> else
> loop3 (copy of loop1)
> if (LOOP_VECTORIZED (4, 5))
> loop4 (copy of loop2)
> else
> loop5 (copy of loop4)
> loop2 and loop4 are if-converted, so either outer loop vectorization of
> loop1 is successful, then we use loop1/loop2 as the vectorized loop
> and loop3/loop5 as the corresponding scalar loop, or it is unsuccessful,
> then we use non-vectorized loop3, either with successfully vectorized
> loop4 as inner loop (loop5 is corresponding scalar_loop, for epilogues,
> versioning for alignment etc.), or we fail to vectorize anything and
> end up with scalar loop3/loop5.
But that causes even more versioning (plus redundant if-conversion).
> Without the patch I've posted, there are some remaining options, but those
> will mean big amount of work in the loop manipulation code etc.
>
> One option is to keep r242520 in, then the problem is that:
> 1) we would need to defer folding LOOP_VECTORIZED (1, 3) into false when
> outer loop vectorization failed, if there is still possible inner loop
> vectorization (not that difficult)
Yeah, that's something r242520 missed to address it seems. We've
expected this works as expected... heh. Testsuite coverage is low
here it seems.
> 2) we'd need to use loop4 as the scalar_loop for the vectorization of
> loop2, but that loop is not adjacent to the vectorized loop, so we'd need
> to somehow transform all the SSA_NAMEs that might be affected by that
> different placement (as if all the loop3 PHIs were loop1 PHIs instead,
> and deal with the SSA_NAMEs set in loop4 and used outside of loop4 as
> if those were those in loop2 instead); this is the hard part I'm not really
> enthusiastic to write
We use the special scalar_loop always if we have the loop_vectorized
guard, right? Which is of course good. And required for masked
loads/stores.
I see how this looks somewhat unfortunate.
> Another option is to revert r242520, and then do something for the outer loop
> vectorization. Right now we expect certain fixed form (5 basic blocks in
> the outer loop, lots of assumptions about the cfg of that, dunno where
> everywhere it is hardcoded). We'd need to allow also 7+ basic block form,
> where one of the extra loops is just if LOOP_VECTORIZED (x, y), then there
> is if-converted single basic block loop x, and then perhaps many basic
> blocks loop y. Lots of the vectorization code expects ->inner to be THE
> inner loop, that would no longer be the case, it would be either the scalar
> or vector loop, etc.
Yeah, this is why we went the r242520 route...
> To me my patch at least for GCC7 looks like far less work than both other
> option would require. I know it has the drawback that compared to the other
> options there is more loop copying that - for this regard, the option to
> revert r242520 and deal with it would be the most effective, only a single
> loop is versioned, with r242520 and without my patch it is 2 loops versioned
> and with my patch 3 loops versioned.
Certainly handling r242520 in a different way looks best then (w/o
r242520 the followup to always version loops in if-conversion shows
a lot of vect.exp regressions).
> There is one thing my patch doesn't do but should for efficiency, if loop1
> (outer loop) is not successfully outer-loop vectorized, then we should mark
> loop2 (its inner loop) as dont_vectorize if the outer loop has been
> LOOP_VECTORIZED guarded. Then the gcc.dg/gomp/pr68128-1.c change
> wouldn't be actually needed.
Yes.
Are you willing to try re-doing r242520?
Thanks,
Richard.