On Mon, Oct 26, 2015 at 6:59 AM, sameera <sameera.deshpa...@imgtec.com> wrote:
> On Wednesday 21 October 2015 05:23 PM, Richard Biener wrote:
>>
>> On Thu, Oct 15, 2015 at 12:20 PM, sameera <sameera.deshpa...@imgtec.com>
>> wrote:
>>>
>>> Hi Richard,
>>>
>>> This is with reference to our discussion at GNU Tools Cauldron 2015
>>> regarding my talk titled "Improving the effectiveness and generality of
>>> GCC auto-vectorization."  We at Imagination Technologies have further
>>> worked on finalizing the algorithms for the transformations needed to
>>> achieve efficient target-aware reordering instruction selection.  We have
>>> implemented a prototype in Python to demonstrate the capabilities of the
>>> algorithm and to verify the claims made in the presentation.
>>>
>>> I am attaching the prototype along with the documented algorithm for
>>> review.  We will be starting with the design and implementation of the
>>> same in GCC and would be glad to receive comments, feedback and
>>> suggestions.
>>
>> So I finally sat down and skimmed over auto_vectorization.txt.  The first
>> thing I notice is that you constrain your input language to something you
>> define yourself.  In the end you will need to work on GIMPLE in SSA form.
>> Thus the algorithm as described needs to construct its AST from that
>> GIMPLE representation.
>
> Richard, we have defined the input language for convenience in the
> prototype implementation.  However, we will be using GIMPLE as our IR.  In
> the grammar of our tree, a p-tree denotes the permute order associated with
> a statement, whereas a c-tree is the GIMPLE instruction that performs the
> compute operation.  I tried looking at the structures used in SLP, but they
> cannot be used as they are: the main difference between the current SLP
> implementation in GCC and our representation is that in GCC the permute
> order is part of the SLP tree node, whereas in our representation the
> permute order is an independent tree node.  Hence, I have created a new
> tree structure for our pass, which creates p-tree nodes for permute orders
> and c-tree nodes that point to the corresponding GIMPLE statements.
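To make the p-tree/c-tree split described above concrete, a minimal
illustrative sketch follows; the type and field names are invented for
illustration only and are not taken from the prototype or from GCC:

  /* Sketch only: the permute order becomes a node of its own (p-tree)
     instead of being an attribute of an SLP node, while a compute node
     (c-tree) merely wraps the statement it stands for.  */

  struct gimple_stmt;                 /* placeholder for a GIMPLE statement */

  enum ptree_kind { PT_PERMUTE, PT_COMPUTE };

  struct ptree_node
  {
    enum ptree_kind kind;

    /* PT_PERMUTE: the permute order applied to the operand subtrees.  */
    unsigned int *perm;
    unsigned int perm_len;

    /* PT_COMPUTE: the compute statement this node represents.  */
    struct gimple_stmt *stmt;

    /* Operand subtrees (p-tree or c-tree nodes).  */
    struct ptree_node **ops;
    unsigned int num_ops;
  };
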
Yes, that's the whole purpose - get the vectorizer (and SLP) a better data
structure which is explicit about permutes.

>> Loop count and data reference analysis is provided by GCC and you need to
>> work with the way their result is presented.
>
> I am trying to figure out where and how an interleave pattern encapsulating
> the whole loop can be represented, as the interleave pattern has not only
> the loop-related information but also the order in which the destination
> array is written.  The data reference analysis can be used nicely with the
> data structures we have designed, as the introduction of p-tree nodes does
> not alter the attributes of the c-tree (GIMPLE stmt).
>
> I am also trying to identify the relation between the chain of recurrences
> for each SSA variable and the vec_size associated with each tree node in
> our structure.  Logically, both of them compute the same information, and I
> am seeing if it can be propagated in our tree.
>
>> As with the presentation the paper is mostly about optimizing the
>> interleaving code.  That's fine in principle but I suspect that the AST
>> needs to get explicit support for "vectorized stmts" that perform type
>> conversions (like type conversions themselves or widening operations),
>> that is, represent the fact that for certain loops you need N vectorized
>> stmts for stmt A and M vectorized stmts for stmt B.  This is an important
>> observation once you get to the point of supporting targets with multiple
>> vector sizes (and instructions like the x86_64 integer - double
>> conversions which go from 128bit to 256bit vectors).
>
> Yes, we haven't given much thought to type conversions, because our
> assumption is that type conversions are order-preserving transformations
> (c-tree), not order-altering transformations (p-tree).  Because of that,
> those instructions will be generated like any other compute instruction.
> I see that GCC makes the same assumption, which is why it treats
> vec_perm_const patterns differently from the vec_pack/unpack* patterns,
> even though the generated instructions can be the same.  And, as each
> statement can have its own vectorization count, the scenario you mention
> can be taken care of.  However, I will look into it again to see whether
> type conversions need additional care.  Thanks.
>
>> I somewhat miss an idea on how to deal with irregular SLP cases - that is,
>> does the AST model each (current) SLP node as a single statement or does
>> it model individual vector elements (so you can insert no-op compensation
>> code to make the operations match).  Consider
>>
>>   a[i] = b[i] * 3;
>>   a[i+1] = b[i+1];
>>
>> which can be vectorized with SLP if you realize you can multiply b[i+1]
>> by 1.
>
> The pass structure we have for our optimization is as follows:
>  - New pass target-aware-loop-vect with the following subpasses:
>    - Loop structure identification
>    - Restricted loop construction
>    - Loop analysis (6 phases of the target-aware reordering-instruction
>      selection algorithm)
>    - Loop transformation (cost-aware GIMPLE code generation)
> We might need some optimizations to be performed on the loop body before we
> start our reordering-instruction selection, so that the input program can
> be transformed to fit our restricted loop structure.  These transformations
> can be performed by the restricted loop construction subpass.  E.g.
> multi-dimensional arrays, nested loops, reduction chains, etc. can be
> supported by transforming the input loop.
> The case that you have mentioned can be transformed at this level.
>
> Also, we have introduced a new operator "PERM" to support irregular permute
> orders which cannot be matched by vec_perm_const patterns, but which can be
> generated using the vec_perm pattern.
>
>> As for implementing the whole thing in GCC, I recently spent some more
>> time thinking and came to the conclusion that the path to the fastest
>> improvements in GCC code-gen would be to build the new AST after the
>> current analysis phase has finished, do the optimization and drive
>> code-generation off the new AST.  Thus keep things like data-ref and
>> dependence analysis as is, as well as group analysis and SLP tree
>> construction.  Build the AST from the vectorizer "IL" at the point
>> vect_transform_loop / vect_slp_transform_bb is called.
>
> The current pass structure that we have designed does exactly that.  The
> only difference is that we are planning to use the transform phase as a
> co-process of our transform pass, rather than reconstruct an AST from our
> representation to pass to vect_transform_loop.
>
>> More and more of the rest of the vectorizer code can then be "slowly"
>> moved to work on the AST, and AST construction be moved earlier.
>>
>> Richard.
>
> Thanks again for your inputs.  I will share the complete design and our
> plan of action for implementation in GCC shortly.
>
> - Thanks and regards,
> Sameera D.
>
>>> - Thanks and regards,
>>> Sameera Deshpande
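
For reference, a minimal sketch of the irregular-SLP example discussed
above, using the multiply-by-one compensation.  It is written with GCC's
generic vector extension purely for illustration (it is not vectorizer
output); the function names, the 128-bit vector width and the assumption
that n is a multiple of 4 are all made up for the example:

  /* Scalar form: only every other element is scaled.  */
  void
  foo (int *a, int *b, int n)
  {
    for (int i = 0; i < n; i += 2)
      {
        a[i]     = b[i] * 3;
        a[i + 1] = b[i + 1];        /* effectively b[i + 1] * 1 */
      }
  }

  /* Conceptual vectorized form: the "missing" multiply is made regular by
     multiplying by 1, so a single vector multiply covers both statements.  */
  typedef int v4si __attribute__ ((vector_size (16)));

  void
  foo_vec (int *a, int *b, int n)
  {
    v4si mul = { 3, 1, 3, 1 };      /* no-op compensation in odd lanes */
    for (int i = 0; i < n; i += 4)  /* assumes n % 4 == 0 */
      {
        v4si vb;
        __builtin_memcpy (&vb, &b[i], sizeof vb);  /* unaligned-safe load */
        vb = vb * mul;                             /* elementwise multiply */
        __builtin_memcpy (&a[i], &vb, sizeof vb);  /* store */
      }
  }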