On Mon, Oct 26, 2015 at 6:59 AM, sameera <sameera.deshpa...@imgtec.com> wrote:
> On Wednesday 21 October 2015 05:23 PM, Richard Biener wrote:
>>
>> On Thu, Oct 15, 2015 at 12:20 PM, sameera <sameera.deshpa...@imgtec.com>
>> wrote:
>>>
>>> Hi Richard,
>>>
>>> This is with reference to our discussion at GNU Tools Cauldron 2015
>>> regarding my talk titled "Improving the effectiveness and generality of
>>> GCC auto-vectorization."  We at Imagination Technologies have further
>>> worked on finalizing the algorithms for the transformations needed to
>>> achieve efficient target-aware reordering instruction selection.  We have
>>> implemented a prototype in Python to demonstrate the capabilities of the
>>> algorithm and to verify the claims made in the presentation.
>>>
>>> I am attaching the prototype along with the documented algorithm for
>>> review.  We will be starting with the design and implementation of the
>>> same in GCC and would be glad to receive comments, feedback and
>>> suggestions.
>>
>> So I finally sat down and skimmed over auto_vectorization.txt.  The first
>> thing I notice is that you constrain your input language to something you
>> define yourself.  In the end you will need to work on GIMPLE in SSA form.
>> Thus the algorithm as described needs to construct its AST from that
>> GIMPLE representation.
>
> Richard, we have defined the input language for convenience in the
> prototype implementation.  However, we will be using GIMPLE as our IR.  In
> the grammar of our tree, a p-tree denotes the permute order associated with
> a statement, whereas a c-tree is the GIMPLE instruction that performs the
> compute operation.  I tried looking at the structures used in SLP, but they
> cannot be used as they are: the main difference between the current SLP
> implementation in GCC and our representation is that in GCC the permute
> order is part of the SLP tree node, whereas in our representation the
> permute order is an independent tree node.  Hence, I have created a new
> tree structure for our pass, which creates p-tree nodes for permute orders
> and c-tree nodes that point to the corresponding GIMPLE statements.
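To make the p-tree/c-tree split described above concrete, a minimal
illustrative sketch follows; the type and field names are invented for
illustration only and are not taken from the prototype or from GCC:

  /* Sketch only: the permute order becomes a node of its own (p-tree)
     instead of being an attribute of an SLP node, while a compute node
     (c-tree) merely wraps the statement it stands for.  */

  struct gimple_stmt;                 /* placeholder for a GIMPLE statement */

  enum ptree_kind { PT_PERMUTE, PT_COMPUTE };

  struct ptree_node
  {
    enum ptree_kind kind;

    /* PT_PERMUTE: the permute order applied to the operand subtrees.  */
    unsigned int *perm;
    unsigned int perm_len;

    /* PT_COMPUTE: the compute statement this node represents.  */
    struct gimple_stmt *stmt;

    /* Operand subtrees (p-tree or c-tree nodes).  */
    struct ptree_node **ops;
    unsigned int num_ops;
  };
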
Yes, that's the whole purpose - get the vectorizer (and SLP) a better data
structure which is explicit about permutes.

>> Loop count and data reference analysis is provided by GCC and you need to
>> work with the way their result is presented.
>
> I am trying to figure out where and how an interleave pattern encapsulating
> the whole loop can be represented, as the interleave pattern has not only
> the loop-related information but also the order in which the destination
> array is written.  The data reference analysis can be used nicely with the
> data structures we have designed, as the introduction of p-tree nodes does
> not alter the attributes of the c-tree (GIMPLE stmt).
>
> I am also trying to identify the relation between the chain of recurrences
> for each SSA variable and the vec_size associated with each tree node in
> our structure.  Logically, both of them compute the same information, and I
> am seeing if it can be propagated in our tree.
>
>> As with the presentation the paper is mostly about optimizing the
>> interleaving code.  That's fine in principle but I suspect that the AST
>> needs to get explicit support for "vectorized stmts" that perform type
>> conversions (like type conversions themselves or widening operations),
>> that is, represent the fact that for certain loops you need N vectorized
>> stmts for stmt A and M vectorized stmts for stmt B.  This is an important
>> observation once you get to the point of supporting targets with multiple
>> vector sizes (and instructions like the x86_64 integer - double
>> conversions which go from 128bit to 256bit vectors).
>
> Yes, we haven't given much thought to type conversions, because our
> assumption is that type conversions are order-preserving transformations
> (c-tree), not order-altering transformations (p-tree).  Because of that,
> those instructions will be generated like any other compute instruction.
> I see that GCC makes the same assumption, which is why it treats
> vec_perm_const patterns differently from the vec_pack/unpack* patterns,
> even though the generated instructions can be the same.  And, as each
> statement can have its own vectorization count, the scenario you mention
> can be taken care of.  However, I will look into it again to see whether
> type conversions need additional care.  Thanks.
>
>> I somewhat miss an idea on how to deal with irregular SLP cases - that is,
>> does the AST model each (current) SLP node as a single statement or does
>> it model individual vector elements (so you can insert no-op compensation
>> code to make the operations match).  Consider
>>
>>   a[i] = b[i] * 3;
>>   a[i+1] = b[i+1];
>>
>> which can be vectorized with SLP if you realize you can multiply b[i+1]
>> by 1.
>
> The pass structure we have for our optimization is as follows:
>  - New pass target-aware-loop-vect with the following subpasses:
>    - Loop structure identification
>    - Restricted loop construction
>    - Loop analysis (6 phases of the target-aware reordering-instruction
>      selection algorithm)
>    - Loop transformation (cost-aware GIMPLE code generation)
> We might need some optimizations to be performed on the loop body before we
> start our reordering-instruction selection, so that the input program can
> be transformed to fit our restricted loop structure.  These transformations
> can be performed by the restricted loop construction subpass.  E.g.
> multi-dimensional arrays, nested loops, reduction chains, etc. can be
> supported by transforming the input loop.
> The case that you have mentioned can be transformed at this level.
>
> Also, we have introduced a new operator "PERM" to support irregular permute
> orders which cannot be matched by vec_perm_const patterns, but which can be
> generated using the vec_perm pattern.
>
>> As for implementing the whole thing in GCC, I recently spent some more
>> time thinking and came to the conclusion that the path to the fastest
>> improvements in GCC code-gen would be to build the new AST after the
>> current analysis phase has finished, do the optimization and drive
>> code-generation off the new AST.  Thus keep things like data-ref and
>> dependence analysis as is, as well as group analysis and SLP tree
>> construction.  Build the AST from the vectorizer "IL" at the point
>> vect_transform_loop / vect_slp_transform_bb is called.
>
> The current pass structure that we have designed does exactly that.  The
> only difference is that we are planning to use the transform phase as a
> co-process of our transform pass, rather than reconstruct an AST from our
> representation to pass to vect_transform_loop.
>
>> More and more of the rest of the vectorizer code can then be "slowly"
>> moved to work on the AST, and AST construction be moved earlier.
>>
>> Richard.
>
> Thanks again for your inputs.  I will share the complete design and our
> plan of action for implementation in GCC shortly.
>
> - Thanks and regards,
> Sameera D.
>
>>> - Thanks and regards,
>>> Sameera Deshpande
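
For reference, a minimal sketch of the irregular-SLP example discussed
above, using the multiply-by-one compensation.  It is written with GCC's
generic vector extension purely for illustration (it is not vectorizer
output); the function names, the 128-bit vector width and the assumption
that n is a multiple of 4 are all made up for the example:

  /* Scalar form: only every other element is scaled.  */
  void
  foo (int *a, int *b, int n)
  {
    for (int i = 0; i < n; i += 2)
      {
        a[i]     = b[i] * 3;
        a[i + 1] = b[i + 1];        /* effectively b[i + 1] * 1 */
      }
  }

  /* Conceptual vectorized form: the "missing" multiply is made regular by
     multiplying by 1, so a single vector multiply covers both statements.  */
  typedef int v4si __attribute__ ((vector_size (16)));

  void
  foo_vec (int *a, int *b, int n)
  {
    v4si mul = { 3, 1, 3, 1 };      /* no-op compensation in odd lanes */
    for (int i = 0; i < n; i += 4)  /* assumes n % 4 == 0 */
      {
        v4si vb;
        __builtin_memcpy (&vb, &b[i], sizeof vb);  /* unaligned-safe load */
        vb = vb * mul;                             /* elementwise multiply */
        __builtin_memcpy (&a[i], &vb, sizeof vb);  /* store */
      }
  }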