Re: Prototype implementation: Improving effectiveness and generality of auto-vectorization

Richard Biener Wed, 21 Oct 2015 04:54:08 -0700

On Thu, Oct 15, 2015 at 12:20 PM, sameera <sameera.deshpa...@imgtec.com> wrote:
> Hi Richard,
>
> This is with reference to our discussion at GNU Tools Cauldron 2015
> regarding my talk titled "Improving the effectiveness and generality of GCC
> auto-vectorization." We, at Imaginations Technologies, have further worked
> on finalizing the algorithms for transformations to achieve efficient
> target-aware reordering instruction selection. We have implemented the
> prototype in python to demonstrate the capabilities of the algorithm and
> verify the claims made in the presentation.
>
> I am attaching the prototype along with the documented algorithm for review
> purposes. We would be starting with design and implementation of the same in
> GCC and would be glad to receive comments, feedback and suggestions.


So I finally sat down and skimmed over auto_vectorization.txt.  The first thing
I notice is that you constrain your input language to sth you define yourself.
In the end you will need to work on GIMPLE in SSA form.  Thus the
Algorithm as described needs to construct its AST from that GIMPLE
representation.
Loop count and data reference analysis is provided by GCC and you need to work
with the way their result is presented.

As with the presentation the paper is mostly about optimizing the interleaving
code.  That's fine in principle but I suspect that the AST needs to get explicit
support for "vectorized stmts" that perform type conversions (like type
conversions themselves or widening operations), that is, represent the fact
that for certain loops you need N vectorized stmts for stmt A and M vectorized
stmts for stmt B.  This is an important observation once you get to the point
supporting targets with multiple vector sizes (and instructions like the x86_64
integer - double conversions which go from 128bit to 256bit vectors).

I somewhat miss an idea on how to deal with irregular SLP cases - that is,
does the AST model each (current) SLP node as a single statement or
does it model individual vector elements (so you can insert no-op compensation
code to make the operations match).  Consider

 a[i] = b[i] * 3;
 a[i+1] = b[i+1];

which can be vectorized with SLP if you realize you can multiply b[i+1] by 1.

As of implementing the whole thing in GCC I recently spent some more time
thinking and came to the conclusion that the path to the fastest improvements
in GCC code-gen would be to build the new AST after the current analysis
phase finished, do the optimization and drive code-generation off the new AST.
Thus keep things like data-ref and dependence analysis as is, as well as
group analysis and SLP tree construction.  Build the AST from the vectorizer
"IL" at the point vect_transform_loop / vect_slp_transform_bb is called.

More and more of the rest of the vectorizer code can then be "slowly" moved
to work on the AST and AST construction be moved earlier.

Richard.

>
> - Thanks and regards,
>   Sameera Deshpande

Re: Prototype implementation: Improving effectiveness and generality of auto-vectorization

Reply via email to