On Thu, Oct 15, 2015 at 12:20 PM, sameera <sameera.deshpa...@imgtec.com> wrote: > Hi Richard, > > This is with reference to our discussion at GNU Tools Cauldron 2015 > regarding my talk titled "Improving the effectiveness and generality of GCC > auto-vectorization." We, at Imaginations Technologies, have further worked > on finalizing the algorithms for transformations to achieve efficient > target-aware reordering instruction selection. We have implemented the > prototype in python to demonstrate the capabilities of the algorithm and > verify the claims made in the presentation. > > I am attaching the prototype along with the documented algorithm for review > purposes. We would be starting with design and implementation of the same in > GCC and would be glad to receive comments, feedback and suggestions.
So I finally sat down and skimmed over auto_vectorization.txt. The first thing I notice is that you constrain your input language to sth you define yourself. In the end you will need to work on GIMPLE in SSA form. Thus the Algorithm as described needs to construct its AST from that GIMPLE representation. Loop count and data reference analysis is provided by GCC and you need to work with the way their result is presented. As with the presentation the paper is mostly about optimizing the interleaving code. That's fine in principle but I suspect that the AST needs to get explicit support for "vectorized stmts" that perform type conversions (like type conversions themselves or widening operations), that is, represent the fact that for certain loops you need N vectorized stmts for stmt A and M vectorized stmts for stmt B. This is an important observation once you get to the point supporting targets with multiple vector sizes (and instructions like the x86_64 integer - double conversions which go from 128bit to 256bit vectors). I somewhat miss an idea on how to deal with irregular SLP cases - that is, does the AST model each (current) SLP node as a single statement or does it model individual vector elements (so you can insert no-op compensation code to make the operations match). Consider a[i] = b[i] * 3; a[i+1] = b[i+1]; which can be vectorized with SLP if you realize you can multiply b[i+1] by 1. As of implementing the whole thing in GCC I recently spent some more time thinking and came to the conclusion that the path to the fastest improvements in GCC code-gen would be to build the new AST after the current analysis phase finished, do the optimization and drive code-generation off the new AST. Thus keep things like data-ref and dependence analysis as is, as well as group analysis and SLP tree construction. Build the AST from the vectorizer "IL" at the point vect_transform_loop / vect_slp_transform_bb is called. More and more of the rest of the vectorizer code can then be "slowly" moved to work on the AST and AST construction be moved earlier. Richard. > > - Thanks and regards, > Sameera Deshpande