On 15/07/2019 11:54, Richard Biener wrote:
On Mon, 15 Jul 2019, Andre Vieira (lists) wrote:



On 12/07/2019 11:19, Richard Biener wrote:
On Thu, 11 Jul 2019, Andre Vieira (lists) wrote:


I have code that can split the condition and alias checks in
'vect_loop_versioning'. For this approach I am considering keeping that bit of
code and seeing if I can patch up the checks after vectorizing the epilogue
further. I think initially I will just go with a "hacked up" way of passing
down the bb with the iteration check and split the false edge every time we
vectorize it further. Will keep you posted on progress. If you have any
pointers/tips they are most welc ome!

I thought to somehow force the idea that we have a prologue loop
to the vectorizer so it creates the number-of-vectorized iterations
check and branch around the main (highest VF) vectorized loop.


Hmm I think I may have skimmed over this earlier. I am reading it now and am not entirely sure what you mean by this. Specifically the "number of vectorized iterations" or how a prologue loop plays a role in this.



The advantage is that this would just use the epilogue vectorization
code and it would avoid excessive code growth if you have many
VFs to consider (on x86 we now have 8 byte, 16 byte, 32 byte and
64 byte vectors...).  The disadvantage is of course that a small
number of loops will not enter the vector code at all - namely
those that would pass the alias check for lowest_VF but not the
one for highest_VF.  I'm sure this isn't a common situation and
in quite a number of cases we formulate the alias check in a way
that it isn't dependent on the VF anyways.

The code growth is indeed a factor and I can see the argument for choosing
this approach over the other. Cases of such specific overlaps are most likely
oddities rather than the common situation.

Yeah, it also looks simplest to me (and a motivation to enable
epilogue vectorization by default).

There's also possibly
an extra branch for the case the highest_VF loop isn't entered
(unless there already was a prologue loop).
I don't understand this one, can you elaborate?

The branch around the main vectorized loop I talked about above.
So I'd fool the versioning condition to use the lowest VF for
the iteration count checking and use the code that handles
zero-trip iteration count for the vector loop unconditionally.

I guess this sheds some light on the comment above. And it definitely implies we would need to know the lowest VF when creating this condition. Which is tricky.

In some way this makes checking the niter condition on the version
check pointless (at least if we have a really low lowest VF like
on x64 where it will likely be 2), so we may want to elide that
completely?  For the check to be "correct" we'd also need to
compute the lowest VF a vectorized epilogue is still profitable
(on x86 those will run once or never, but we can also end up
with say main AVX512 vectorization, and a single vectorized
epilogue with SSE2 if we somehow figure AVX256 vectorization
isn't profitable for it - we can also end up with non-vectorizable
epilogue).  So with the current setup how we vectorize epilogues
we maybe want to have a location of the version niter check we
can "patch up" later after (not) vectorizing the epilogue(s).

I think you come to the same conclusion here as I mentioned above. Somehow I wish I had understood this better when I first read it ... but eh such is life :)

I went on and continued hacking around the approach of splitting the niter and alias check I had earlier. I got it to work with a single loop. However, when dealing with nested loops I run into the problem that I'd need to sink the niter checks. Otherwise you could end up with an alias check and niter checks outside the outer loop. Where the 2nd and consequent VF niter checks point to the corresponding epilogues in the inner loop. However, once those are done and it iterates over the outer-loop, it will go through the higher VF's first, leading to wrong behavior.

To illustrate what I mean, here is a very simplistic illustration of what is happening:

BB1: Alias check
BB2: niter check VF 32
BB3: niter check VF 16
BB4: Vectorized loop VF32
BB5: Vectorized loop VF16
BB6: Remaining epilogue scalar loop
BB7: Outer loop iteration (updates IV's and DRs of inner loop)
BB8: Scalar inner&outer loop

With edges:
BB1 -T-> BB2
BB1 -F-> BB8
BB2 -T-> BB4
BB2 -F-> BB3
BB3 -T-> BB5
BB3 -F-> BB8
BB4 -> BB5
BB5 -> BB6
BB6 -> BB7
BB7 -> BB4

Where -T-> is a True edge and -F-> is a False edge

So my first thought to solve this is to sink BB2 and BB3 into the loop for which BB7 is the latch.

I.e. make BB7 -> BB2

But then I would argue, it would be good to introduce a BB9:
BB1 -T-> BB9
BB9 -T-> BB2
BB9 -F-> BB8

Where BB9 checks that niter is at least the lowest VF.

Sorry if I am repeating what you were telling me to do all along :')

Cheers,
Andre

PS: I often find myself having to patch the DOMINATOR information, sometimes its easy to, but sometimes it can get pretty complicated. I wonder whether it would make sense to write something that traverses a loop and corrects this, if it doesn't exist already.



Richard.

Reply via email to