On Tue, Oct 27, 2015 at 2:39 PM, Richard Biener <richard.guent...@gmail.com> 
wrote:
On Mon, Oct 26, 2015 at 6:59 AM, sameera <sameera.deshpa...@imgtec.com> wrote:

Richard, we defined the input language for convenience in the prototype
implementation; however, we will be using GIMPLE as our IR. In the grammar
of our tree, a p-tree node denotes the permute order associated with a
statement, whereas a c-tree node is the actual GIMPLE instruction which
performs the compute operation. I tried looking at the structures used in
SLP, however they cannot be used as they are: the main difference between
the current SLP implementation in GCC and our representation is that the
permute order in SLP is part of the tree node, whereas in our representation
the permute order is represented as an independent tree node. Hence, I have
created a new tree structure for our pass, which creates p-tree nodes for
the permute order and c-tree nodes which point to the appropriate GIMPLE
statements.
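
A minimal sketch of how such nodes might be laid out (the type and field
names here are purely illustrative, not the actual structures in our pass):

  /* Illustrative only: internal nodes carry a permute order (p-tree),
     leaves point at the GIMPLE statement doing the compute (c-tree).  */
  enum node_kind { P_TREE, C_TREE };

  struct pc_tree_node
  {
    enum node_kind kind;
    /* P_TREE: permute order applied to the operands below.  */
    unsigned *permute_order;
    struct pc_tree_node **operands;
    unsigned num_operands;
    /* C_TREE: the GIMPLE compute statement.  */
    gimple *stmt;
  };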

Yes, that's the whole purpose - to give the vectorizer (and SLP) a better
data structure which is explicit about permutes.

I somewhat miss an idea of how to deal with irregular SLP cases - that is,
does the AST model each (current) SLP node as a single statement, or does
it model individual vector elements (so you can insert no-op compensation
code to make the operations match)?  Consider

  a[i] = b[i] * 3;
  a[i+1] = b[i+1];

which can be vectorized with SLP if you realize you can multiply b[i+1]
by 1.
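
That is, the group would conceptually be treated as if it had been
written as:

  a[i] = b[i] * 3;
  a[i+1] = b[i+1] * 1;   /* no-op multiply inserted as compensation */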


The pass structure we have designed for our optimization is as follows:
- New pass target-aware-loop-vect with the following subpasses:
  - Loop structure identification
  - Restricted loop construction
  - Loop analysis (the 6 phases of the target-aware reordering-instruction
selection algorithm)
  - Loop transformation (cost-aware GIMPLE code generation)
We might need some optimizations to be performed on the loop body before we
start our optimization for reordering-instruction selection, so that the
input program can be transformed to fit into our restricted loop structure.
These transformations can be performed by the restricted loop construction
subpass.  E.g. multi-dimensional arrays, nested loops, reduction chains,
etc. can be supported by transforming the input loop.  The case that you
have mentioned can be transformed at this level.

As for implementing the whole thing in GCC, I recently spent some more time
thinking and came to the conclusion that the path to the fastest
improvements in GCC code-gen would be to build the new AST after the
current analysis phase has finished, do the optimization, and drive code
generation off the new AST.
Thus keep things like data-ref and dependence analysis as-is, as well as
group analysis and SLP tree construction.  Build the AST from the
vectorizer "IL" at the point vect_transform_loop / vect_slp_transform_bb
is called.


The current pass structure that we have designed does exactly that. The
only difference is that we are planning to use the transform phase as a
co-process of our transform pass, rather than reconstruct an AST from our
representation to pass to vect_transform_loop.


More and more of the rest of the vectorizer code can then be "slowly"
moved to work on the AST, and AST construction can be moved earlier.

So I think an incremental approach is much more likely to get us some benefit, and I'm wondering how much of the benefit here stems from which bit, and how many of these changes can be broken off.

For example, making permutes into explicit operations on the SLP tree, and adding a pass that tries to push them up/down the tree into each other, might be done on the existing codebase, and would be a significant win in itself (bringing some/much(?) of the improvements to interleaving code from your proposal).
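
As a purely illustrative sketch of the kind of rewrite such a pass could perform (the notation is made up, not existing vectorizer IL): two identical permutes feeding an elementwise operation can be pushed through it and merged,

  /* permute(x, P) + permute(y, P)   ==>   permute(x + y, P)  */

and the single resulting permute can then combine with, or cancel against, permutes further up or down the tree.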

I'm not really sure how much of the benefit stems from what, though - e.g. does reducing the vector length down to fit the target machine only after the other steps bring any benefits besides raising the level of abstraction? (Which can certainly be a benefit!) AFAICT this could be done independently of the other changes?
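
For concreteness (and this is only my reading of that step), the idea would be to keep operating on one logical wide vector throughout the analysis, and only split it to the target's vector width at the very end:

  /* one logical 8-wide add, lowered late to two 4-wide target ops  */
  t = a + b;   ==>   t_lo = a_lo + b_lo;  t_hi = a_hi + b_hi;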

Another limitation of the present SLP structure is that it doesn't understand sharing (until later CSE, and hence costs things wrongly); it would be good to fix this (making the SLP tree into a DAG, and potentially reusing subtrees by adding permutes). This is a step we'd like to take anyway, but it might tie in with your "Bottom-up commoning of p-node subtrees". It would also help with (BB as well as loop) cost calculations.
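
For instance (a made-up scalar example, not from any particular benchmark), in

  a[i]   = b[i]   + c[i];
  a[i+1] = b[i+1] + c[i+1];
  d[i]   = b[i+1] + c[i+1];
  d[i+1] = b[i]   + c[i];

the adds for d are just a lane-swap of the vector computed for a, so a DAG could reuse that add node and insert a single permute rather than recomputing it - and cost it accordingly.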

Are there any really fundamental differences that we cannot overcome in steps? (Or are there any steps we shouldn't take in isolation, because we'd have to 'throw them away' later?)

Alan
