On Wednesday 21 October 2015 05:23 PM, Richard Biener wrote:
On Thu, Oct 15, 2015 at 12:20 PM, sameera <sameera.deshpa...@imgtec.com> wrote:
Hi Richard,

This is with reference to our discussion at GNU Tools Cauldron 2015
regarding my talk titled "Improving the effectiveness and generality of GCC
auto-vectorization." We, at Imaginations Technologies, have further worked
on finalizing the algorithms for transformations to achieve efficient
target-aware reordering instruction selection. We have implemented the
prototype in python to demonstrate the capabilities of the algorithm and
verify the claims made in the presentation.

I am attaching the prototype along with the documented algorithm for review
purposes. We would be starting with design and implementation of the same in
GCC and would be glad to receive comments, feedback and suggestions.

So I finally sat down and skimmed over auto_vectorization.txt.  The first thing
I notice is that you constrain your input language to something you define yourself.
In the end you will need to work on GIMPLE in SSA form.  Thus the
algorithm as described needs to construct its AST from that GIMPLE
representation.
Richard, we have defined the input language for convenience in the prototype implementation. However, we will be using GIMPLE as our IR. In the grammar of our tree, a p-tree node denotes the permute order associated with a statement, whereas a c-tree node is the GIMPLE instruction that performs the compute operation. I looked at the structures used in SLP, but they cannot be used as they are: the main difference between the current SLP implementation in GCC and our representation is that in GCC the permute order is part of the SLP tree node, whereas in our representation the permute order is an independent tree node. Hence, I have created a new tree structure for our pass, which creates p-tree nodes for permute orders and c-tree nodes pointing to the corresponding GIMPLE statements.
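To make the distinction concrete, a minimal sketch of the node layout we have in mind could look like the following (the type and field names are illustrative placeholders, not the final GCC structures):

  /* Illustrative sketch only: names are placeholders, not final GCC code.  */
  struct gimple;                     /* GCC's GIMPLE statement type, opaque here.  */

  enum pc_node_kind { P_NODE /* permute order */, C_NODE /* compute */ };

  struct pc_tree_node
  {
    enum pc_node_kind kind;

    /* P_NODE: the permute order is a tree node of its own, rather than
       an attribute of the statement as in the current SLP tree.  */
    unsigned *perm_order;            /* Lane mapping applied to the children.  */
    unsigned perm_len;

    /* C_NODE: points to the GIMPLE statement that performs the compute
       operation; it carries no reordering information itself.  */
    struct gimple *stmt;

    struct pc_tree_node **children;
    unsigned n_children;
  };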

Loop count and data reference analysis is provided by GCC and you need to work
with the way their result is presented.
I am trying to figure out where and how an interleave pattern encapsulating the whole loop can be represented, as the interleave pattern carries not only the loop-related information but also the order in which the destination array is written. The data reference analysis fits nicely with the data structures we have designed, as the introduction of p-tree nodes does not alter the attributes of the c-tree (the GIMPLE stmt).

I am also trying to identify the relation between the chain of recurrences for each SSA variable and the vec_size associated with each tree node in our structure. Logically, both compute the same information, and I am checking whether it can be propagated through our tree.
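As a concrete (purely illustrative) example of the kind of loop I have in mind:

  /* The interleave pattern for this loop has to record both the loop
     information (trip count n, unit step) and the order in which the
     destination array is written: even lanes from b[], odd lanes from c[].  */
  for (i = 0; i < n; i++)
    {
      a[2*i]     = b[i];
      a[2*i + 1] = c[i];
    }

  /* Scalar evolution analysis describes the index of a[2*i] as the chain
     of recurrences {0, +, 2}_1 and that of a[2*i + 1] as {1, +, 2}_1,
     i.e. a start value plus a per-iteration step.  The step is closely
     related to the interleave factor, which is the kind of relation
     between the chrecs and the vec_size on our tree nodes that I am
     trying to establish.  */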


As with the presentation the paper is mostly about optimizing the interleaving
code.  That's fine in principle but I suspect that the AST needs to get explicit
support for "vectorized stmts" that perform type conversions (like type
conversions themselves or widening operations), that is, represent the fact
that for certain loops you need N vectorized stmts for stmt A and M vectorized
stmts for stmt B.  This is an important observation once you get to the point
of supporting targets with multiple vector sizes (and instructions like the x86_64
integer-to-double conversions, which go from 128-bit to 256-bit vectors).
Yes, we haven't given much thought to type conversions, because our assumption is that type conversions are order-preserving transformations (c-tree), not order-altering transformations (p-tree). Because of that, those instructions will be generated like any other compute instruction. I see that GCC makes the same assumption, which is why it treats vec_perm_const patterns differently from the vec_pack/unpack* patterns even though the instructions generated can be the same. And, as each statement can have its own vectorization count, the scenario you mention can be taken care of. However, I will look into it further to see whether type conversions need additional handling.
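For reference, a small example of the situation you describe (illustrative only, not taken from our prototype):

  /* int -> double conversion.  With 128-bit integer vectors (4 ints) and
     128-bit double vectors (2 doubles), one vectorized load of s[] feeds
     two vectorized conversions and two vectorized stores of d[], i.e.
     N = 1 copies of one statement versus M = 2 copies of another.  On
     x86_64 with AVX the conversion can instead go from a 128-bit integer
     vector to a 256-bit double vector.  */
  void
  convert (double *d, const int *s, int n)
  {
    for (int i = 0; i < n; i++)
      d[i] = (double) s[i];
  }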

I somewhat miss an idea on how to deal with irregular SLP cases - that is,
does the AST model each (current) SLP node as a single statement or
does it model individual vector elements (so you can insert no-op compensation
code to make the operations match)?  Consider

  a[i] = b[i] * 3;
  a[i+1] = b[i+1];

which can be vectorized with SLP if you realize you can multiply b[i+1] by 1.

The pass structure we have designed for our optimization is as follows:
- New pass target-aware-loop-vect with following subpasses:
  - Loop structure identification
  - Restricted loop construction
  - Loop analysis (6 phases of target-aware reordering-instruction selection 
algorithm)
  - Loop transformation (cost-aware GIMPLE code generation)
We might need some optimizations to be performed on the loop body before we start our reordering-instruction selection, so that the input program can be transformed to fit into our restricted loop structure. These transformations can be performed by the restricted loop construction subpass; e.g., multi-dimensional arrays, nested loops, reduction chains, etc. can be supported by transforming the input loop. The case you mention can be transformed at this level.
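For instance (purely illustrative), the restricted loop construction subpass could rewrite the example you give into a uniform shape by inserting the no-op multiply you mention:

  a[i]   = b[i]   * 3;
  a[i+1] = b[i+1] * 1;   /* no-op compensation inserted by the subpass */

after which both statements perform the same operation and group as a regular SLP instance multiplying by the constant vector {3, 1}.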

Also, we have introduced a new operator "PERM" to support irregular permute orders which cannot be matched by vec_perm_const patterns, but can be generated using the vec_perm pattern.
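At the source level the distinction can be illustrated with GCC's existing __builtin_shuffle (only an illustration of the two expanders, not of our implementation):

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  shuffle_example (v4si a, v4si b, v4si mask)
  {
    /* Constant selector: the target can match this with one of its
       vec_perm_const patterns (here an interleave of the low halves).  */
    v4si c = __builtin_shuffle (a, b, (v4si) { 0, 4, 1, 5 });

    /* Run-time selector: no vec_perm_const pattern applies, so the
       generic vec_perm expander is used -- the kind of irregular order
       our PERM operator is meant to represent.  */
    v4si d = __builtin_shuffle (a, b, mask);

    return c + d;
  }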

As for implementing the whole thing in GCC, I recently spent some more time
thinking and came to the conclusion that the path to the fastest improvements
in GCC code-gen would be to build the new AST after the current analysis
phase has finished, do the optimization and drive code-generation off the new AST.
Thus keep things like data-ref and dependence analysis as is, as well as
group analysis and SLP tree construction.  Build the AST from the vectorizer
"IL" at the point vect_transform_loop / vect_slp_transform_bb is called.

The current pass structure that we have designed does exactly that. The only difference is that we are planning to use the transform phase as a co-process of our transform pass, rather than reconstructing an AST from our representation to pass to vect_transform_loop.


More and more of the rest of the vectorizer code can then be "slowly" moved
to work on the AST and AST construction be moved earlier.

Richard.


Thanks again for your inputs. I will share the complete design and our plan of 
action for implementation in GCC shortly.

- Thanks and regards,
   Sameera Deshpande
