Prototype implementation: Improving effectiveness and generality of auto-vectorization

2015-10-15 Thread sameera

Hi Richard,

This is with reference to our discussion at GNU Tools Cauldron 2015 regarding my talk titled "Improving the effectiveness and generality of GCC
auto-vectorization." We, at Imagination Technologies, have further worked on finalizing the algorithms for the transformations needed to achieve
efficient target-aware reordering-instruction selection. We have implemented the prototype in Python to demonstrate the capabilities of the algorithm
and verify the claims made in the presentation.


I am attaching the prototype along with the documented algorithm for review. We will now start the design and implementation of the same in GCC,
and would be glad to receive comments, feedback and suggestions.


- Thanks and regards,
  Sameera Deshpande
Introduction

Current methods of auto-vectorization take the vectorization decision through a
generic analysis which is almost target independent
(except for target features such as vector sizes). However, the instruction
selection mechanism is largely ad hoc, in the sense that the
ideas used for instruction selection on one target may not carry over seamlessly
to other targets. This has led to a proliferation of
target-specific code in the generic parts of compiler frameworks such as GCC.

We propose a generic approach to reordering-instruction selection, based on the
insight that effective data reordering for
vectorization can be defined in terms of a small set of well-defined primitive
operations. Interestingly, the same set of operations can be
used to describe the relevant target instructions of almost all targets that we
know of. Thus, these operations form the basic building blocks of
reorderings for effective vectorization.

Primitive operations

Let V_N denote a vector of N scalar elements. A collection of k such vectors is
denoted V_N, ...k..., V_N.
We propose the following operations:
- Concatenation_k^N : V_N x ...k... x V_N --> V_kN. CONCAT_k^N (V_N, ...k..., V_N)
  concatenates k vectors of size N to produce a single vector of size kN.

- Split_k,i^kN : V_kN --> V_N. The split operation is the inverse of
  concatenation: SPLT_k,i^kN (V_kN) splits a vector of size kN into k vectors of
  size N each and returns the ith vector.

- Interleave_k^N : V_N x ...k... x V_N --> V_kN. ILV_k^N (V_N, ...k..., V_N)
  creates a vector of size kN such that the ith element of the output vector is
  the (i/k)th element of the (i%k)th input vector.

- Extract_k,i^kN : V_kN --> V_N. The extract operation is the inverse of
  interleave: EXTR_k,i^kN (V_kN) divides the input vector into N parts of size k,
  forms k vectors of size N by taking the jth element of each part as the jth
  vector, and returns the ith such vector.

- Permute^N : V_N x I_M --> V_N. The permute operation takes a vector of size N
  and an integer permute order I_M of size M. It divides the vector into N/M
  subvectors of size M and permutes each subvector with the order I_M.
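
To make the semantics concrete, here is a small Python model of these primitives
on plain lists (an illustrative sketch of our own; the attached prototype's
actual code may differ):

def concat(*vectors):
    # CONCAT_k^N: append k vectors of size N into one vector of size kN.
    return [x for v in vectors for x in v]

def split(k, i, v):
    # SPLT_k,i^kN: cut a vector of size kN into k contiguous vectors of size N
    # and return the ith one.
    n = len(v) // k
    return v[i * n:(i + 1) * n]

def interleave(*vectors):
    # ILV_k^N: element i of the result is element i/k of input vector i%k.
    return [x for group in zip(*vectors) for x in group]

def extract(k, i, v):
    # EXTR_k,i^kN: inverse of interleave; take the ith element of each of the
    # N groups of size k.
    return v[i::k]

def permute(order, v):
    # permute^N: split v into len(v)/len(order) subvectors of size M and
    # reorder each subvector according to 'order'.
    m = len(order)
    return [v[j - j % m + order[j % m]] for j in range(len(v))]

For example, interleave([1, 2], [3, 4]) gives [1, 3, 2, 4], and extract(2, 0, ...)
and extract(2, 1, ...) recover the two original vectors from it.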

Phase structure of the algorithm

The algorithm for target-aware reordering-instruction selection for efficient
auto-vectorization is divided into the following phases:
Phase 1 : Create a tree for each loop per destination - Implemented by Sameera Deshpande
Phase 2 : k-arity reduction/promotion - Implemented by Sameera Deshpande
Phase 3 : Top-down distribution of p-node subtree - Implemented by Sameera Deshpande
Phase 4 : Optimizations - Implemented by Sameera Deshpande
Phase 5 : Bottom-up commoning of p-node subtree - Implemented by Sameera Deshpande
  - Cost computation of p-tree - Implemented by Sameera Deshpande
Phase 6 : Vector-size reduction - Implemented by Sameera Deshpande
Phase 4 (again) : Optimizations - Implemented by Sameera Deshpande
Phase 7 : Cost computation - store min_cost_tree - Implemented by Prachi Godbole
Phase 8 : Target-specific code generation for min_cost_tree - Implemented by Prachi Godbole

Input loop

To enable auto-vectorization using this algorithm, the loop structure should be
as follows: a loop with k statements reading from S_1, S_2, ... S_m and writing
to destination D, where stmt_j is of the form
D[k * i + d_j] = op_j (S_1[k_1 * i + c_j1], S_2[k_2 * i + c_j2], ... S_m[k_m * i + c_jm])
such that

– All statements stmt_1, ... stmt_k in the loop body are independent and
  vectorizable for a target with vector size VS.
– The iteration count N > VS.
– The induction variable i of the loop is not altered by stmt_1, stmt_2, ... stmt_k.
– All of S_1, ... S_m and D are of the same type.
– The indices of all S and D have the same multiplicative index. If they differ,
  normalize by taking the LCM and duplicate the statements appropriately,
  adjusting the offsets.
– c_jp < k_p for all j and p.


FOR(i = 0 ... N − 1)
{
  stmt 1 : D[k * i + d_1] = op_1 (S_1[k_1 * i + c_11], S_2[k_2 * i + c_12], ... S_m[k_m * i + c_1m])
  stmt 2 : D[k * i + d_2] = op_2 (S_1[k_1 * i + c_21], S_2[k_2 * i + c_22], ... S_m[k_m * i + c_2m])
  ...
  stmt k : D[k * i + d_k] = op_k (S_1[k_1 * i + c_k1], S_2[k_2 * i + c_k2], ... S_m[k_m * i + c_km])
}
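
As a concrete instance (our own illustrative example, not taken from the
prototype), the following Python loop fits this form with k = k_1 = k_2 = 2,
m = 2, d_1 = 0, d_2 = 1, c_11 = 0, c_12 = 1, c_21 = 1 and c_22 = 0; both
statements are independent, i is not modified in the body, and every offset is
smaller than 2:

N = 4
S_1 = list(range(2 * N))                 # first source array
S_2 = list(range(100, 100 + 2 * N))      # second source array
D = [0] * (2 * N)                        # destination array
for i in range(N):
    D[2 * i + 0] = S_1[2 * i + 0] + S_2[2 * i + 1]   # stmt 1, op_1 = +
    D[2 * i + 1] = S_1[2 * i + 1] - S_2[2 * i + 0]   # stmt 2, op_2 = -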

Re: Prototype implementation: Improving effectiveness and generality of auto-vectorization

2015-10-25 Thread sameera

On Wednesday 21 October 2015 05:23 PM, Richard Biener wrote:

On Thu, Oct 15, 2015 at 12:20 PM, sameera  wrote:

Hi Richard,

This is with reference to our discussion at GNU Tools Cauldron 2015
regarding my talk titled "Improving the effectiveness and generality of GCC
auto-vectorization." We, at Imagination Technologies, have further worked
on finalizing the algorithms for the transformations needed to achieve
efficient target-aware reordering-instruction selection. We have implemented
the prototype in Python to demonstrate the capabilities of the algorithm and
verify the claims made in the presentation.

I am attaching the prototype along with the documented algorithm for review.
We will now start the design and implementation of the same in GCC, and would
be glad to receive comments, feedback and suggestions.


So I finally sat down and skimmed over auto_vectorization.txt.  The first thing
I notice is that you constrain your input language to something you define yourself.
In the end you will need to work on GIMPLE in SSA form.  Thus the
algorithm as described needs to construct its AST from that GIMPLE
representation.
Richard, we defined the input language only for convenience in the prototype implementation; we will be using GIMPLE as our IR. In the grammar of
our tree, a p-tree denotes the permute order associated with a statement, whereas a c-tree is the GIMPLE instruction that performs the compute
operation. I tried looking at the structures used in SLP, but they cannot be used as they are: the main difference between the current SLP
implementation in GCC and our representation is that in GCC the permute order is part of the SLP tree node, whereas in our representation the permute
order is an independent tree node. Hence, I have created a new tree structure for our pass, which creates p-tree nodes for the permute orders and
c-tree nodes which point to the corresponding GIMPLE statements.
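
A rough sketch of that distinction, in the spirit of the Python prototype
(the class and field names below are ours, not the actual GCC structures):

class PTreeNode:
    """Permute node: a reordering operation kept as a tree node of its own."""
    def __init__(self, op, children, vec_size=None):
        self.op = op                # e.g. 'ILV', 'EXTR', 'CONCAT', 'SPLT'
        self.children = children    # child p-tree or c-tree nodes
        self.vec_size = vec_size    # vector size this node operates on

class CTreeNode:
    """Compute node: simply points at the GIMPLE statement doing the work."""
    def __init__(self, gimple_stmt):
        self.stmt = gimple_stmt     # no permute order is stored here

In current SLP the permute order lives inside the SLP node itself; here it is an
explicit PTreeNode sitting between the compute nodes whose operands it reorders.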



Loop count and data reference analysis is provided by GCC and you need to work
with the way their result is presented.
I am trying to figure out where and how an interleave pattern encapsulating the whole loop can be represented, as the interleave pattern holds not
only the loop-related information but also the order in which the destination array is written. The data reference analysis fits nicely with the data
structures we have designed, as the introduction of p-tree nodes does not alter the attributes of the c-tree (GIMPLE stmt).


I am also trying to identify the relation between the chain of recurrences for each SSA variable and the vec_size associated with each tree node in
our structure. Logically, both compute the same information, and I am checking whether it can be propagated through our tree.




As with the presentation the paper is mostly about optimizing the interleaving
code.  That's fine in principle but I suspect that the AST needs to get explicit
support for "vectorized stmts" that perform type conversions (like type
conversions themselves or widening operations), that is, represent the fact
that for certain loops you need N vectorized stmts for stmt A and M vectorized
stmts for stmt B.  This is an important observation once you get to the point of
supporting targets with multiple vector sizes (and instructions like the x86_64
integer - double conversions which go from 128-bit to 256-bit vectors).
Yes, we haven't given much thought to type conversions, because our assumption is that type conversions are order-preserving transformations
(c-tree), not order-altering transformations (p-tree). Because of that, those instructions will be generated like any other compute instruction. I see
that GCC makes the same assumption, which is why it treats vec_perm_const patterns differently from the vec_pack/unpack* patterns even though the
generated instructions can be the same. And, as each statement can have its own vectorization count, the scenario you are mentioning can be taken
care of. However, I will look into it further to see whether type conversions need additional care.
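
As a back-of-the-envelope illustration of per-statement vectorization counts
(the numbers below are assumptions of ours, not taken from the thread): with
128-bit vectors, widening 16-bit loads to 32-bit results means the conversion
and store statements need twice as many vector statements as the load.

VECTOR_BITS = 128                        # assumed target vector size
short_lanes = VECTOR_BITS // 16          # 8 shorts per vectorized load
int_lanes = VECTOR_BITS // 32            # 4 ints per vectorized convert/store
# For a loop doing a[i] = (int) b[i] with b of a 16-bit type, each vectorized
# load therefore needs short_lanes // int_lanes follow-up vector statements.
stmts_per_load = short_lanes // int_lanes
assert stmts_per_load == 2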


I somewhat miss an idea on how to deal with irregular SLP cases - that is,
does the AST model each (current) SLP node as a single statement or
does it model individual vector elements (so you can insert no-op compensation
code to make the operations match).  Consider

  a[i] = b[i] * 3;
  a[i+1] = b[i+1];

which can be vectorized with SLP if you realize you can multiply b[i+1] by 1.
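
(A toy Python illustration of that compensation, ours rather than Richard's:
padding the second statement with a multiply-by-1 lets one vector multiply
cover both lanes.)

b = [5, 7]                    # b[i], b[i+1]
factors = [3, 1]              # the 1 is the inserted no-op compensation
a = [b[lane] * factors[lane] for lane in range(2)]
assert a == [15, 7]           # same result as the two scalar statements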


The pass structure we plan for our optimization is as follows:
- New pass target-aware-loop-vect with the following subpasses:
  - Loop structure identification
  - Restricted loop construction
  - Loop analysis (the 6 phases of the target-aware reordering-instruction selection algorithm)
  - Loop transformation (cost-aware GIMPLE code generation)
We might need some optimizations to be performed on the loop body before we start our reordering-instruction selection, so that the input
program can

[Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.

2016-06-07 Thread Sameera Deshpande
A PRIMOP_TREE can have INTERLEAVE, CONCAT, EXTRACT, SPLIT, ITER or any compute
operation as an intermediate node. Leaf nodes can be memory references,
constants or vectors of loop invariants. Depending upon the operation,
PRIMOP_TREE holds the information about the statements within the loop that is
necessary for vectorization.

At this stage, these data structures are populated by gathering all the
information about the loop, the statements within the loop and the correlation
of those statements. Moreover, the loop body is analyzed to check whether
vectorization of each statement is possible. Note, however, that this
analysis phase gives a worst-case estimate of instruction selection, as it only
checks whether a specific named pattern is defined in the target's .md file. It
does not necessarily give the optimal cover, which is the aim of the
transformation phase using the tree tiling algorithm - and that can be invoked
only once the loop body is represented using the primitive reorder tree.

At this stage, the focus is on creating the permute order tree for a loop with
LOAD and STORE instructions only. The code we intend to compile is of the form
FOR(i = 0; i < N; i++)
{
  stmt 1 : D[k * i + d_1] = S_1[k * i + c_11]
  stmt 2 : D[k * i + d_2] = S_1[k * i + c_21]
  ...
  stmt k : D[k * i + d_k] = S_1[k * i + c_k1]
}
Here we are assuming that any data reference can be represented as base + k *
index + offset (the data structure struct data_reference from GCC is currently
used for this purpose). If not, the address is normalized to such a
representation.
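
For instance (our own toy example, not taken from the patch), for k = 2 and the
loop body D[2 * i + 0] = S_1[2 * i + 1]; D[2 * i + 1] = S_1[2 * i + 0], the
permute order tree built at this stage would conceptually be
STORE(D) <- ILV_2(EXTR_2,1(LOAD S_1), EXTR_2,0(LOAD S_1)), which the following
self-contained Python check confirms on one vector-sized chunk:

def extr(k, i, v):                       # EXTR_k,i on a plain list
    return v[i::k]

def ilv(*vs):                            # ILV_k on plain lists
    return [x for group in zip(*vs) for x in group]

s1 = [10, 11, 12, 13]                    # one chunk of S_1 (two iterations)
d_scalar = [s1[1], s1[0], s1[3], s1[2]]  # what the scalar loop stores into D
d_tree = ilv(extr(2, 1, s1), extr(2, 0, s1))
assert d_tree == d_scalar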

We are looking forward to your suggestions and insights in this regard for
better execution of this project.

Also, as this is a long-term project, can we create a branch in GCC to keep all
our patches in one place, so that anyone interested can download and tweak
the code?

- Thanks and regards,
  Sameera D.
Index: gcc/Makefile.in
===
--- gcc/Makefile.in	(revision 236862)
+++ gcc/Makefile.in	(working copy)
@@ -1522,6 +1522,7 @@
 	tree-vect-loop-manip.o \
 	tree-vect-slp.o \
 	tree-vectorizer.o \
+	tree-vect-unified.o \
 	tree-vrp.o \
 	tree.o \
 	valtrack.o \
Index: gcc/common.opt
===
--- gcc/common.opt	(revision 236862)
+++ gcc/common.opt	(working copy)
@@ -2570,6 +2570,10 @@
 Common Report Var(flag_tree_loop_vectorize) Optimization
 Enable loop vectorization on trees.
 
+ftree-loop-vectorize-unified
+Common Report Var(flag_tree_loop_vectorize_unified) Optimization
+Enable loop vectorization using unified representation on trees.
+
 ftree-slp-vectorize
 Common Report Var(flag_tree_slp_vectorize) Optimization
 Enable basic block vectorization (SLP) on trees.
Index: gcc/hsa.c
===
--- gcc/hsa.c	(revision 236862)
+++ gcc/hsa.c	(working copy)
@@ -840,6 +840,7 @@
 fn_opts = optimization_default_node;
   fn_opts = copy_node (fn_opts);
   TREE_OPTIMIZATION (fn_opts)->x_flag_tree_loop_vectorize = false;
+  TREE_OPTIMIZATION (fn_opts)->x_flag_tree_loop_vectorize_unified = false;
   TREE_OPTIMIZATION (fn_opts)->x_flag_tree_slp_vectorize = false;
   DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl) = fn_opts;
 
Index: gcc/passes.def
===
--- gcc/passes.def	(revision 236862)
+++ gcc/passes.def	(working copy)
@@ -278,6 +278,11 @@
 	  NEXT_PASS (pass_expand_omp_ssa);
 	  NEXT_PASS (pass_ch_vect);
 	  NEXT_PASS (pass_if_conversion);
+	  NEXT_PASS (pass_unified_vectorize);
+  PUSH_INSERT_PASSES_WITHIN (pass_unified_vectorize)
+	  NEXT_PASS (pass_dce);
+  POP_INSERT_PASSES ()
+
 	  /* pass_vectorize must immediately follow pass_if_conversion.
 	 Please do not add any other passes in between.  */
 	  NEXT_PASS (pass_vectorize);
Index: gcc/tree-data-ref.c
===
--- gcc/tree-data-ref.c	(revision 236862)
+++ gcc/tree-data-ref.c	(working copy)
@@ -749,16 +749,19 @@
   return build_fold_addr_expr (TREE_OPERAND (addr, 0));
 }
 
-/* Analyzes the behavior of the memory reference DR in the innermost loop or
-   basic block that contains it.  Returns true if analysis succeed or false
-   otherwise.  */
-
+/* Analyze behavior of memory reference in the innermost loop or basic block
+   that contains it.  STMT is statement having memory reference.  REF is the
+   memory reference under consideration.  Returns true if analysis succeeds,
+   false otherwise.  PBASE, STEP and OFFSET are returned if analysis is
+   successful where REF is represented as PBASE + PSTEP * iter# + POFFSET.
+   The initial address of memory reference is returned in PINIT, and desired
+   alignment in ALIGNED_TO.  */
 bool
-dr_analyze_innermost (struct data_reference *dr, struct loop *nest)
+memref_analyze_innermost (gimple *stmt, tree ref, struct loop *nest,
+			  tr

Re: [Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.

2016-06-13 Thread Sameera Deshpande

On Thursday 09 June 2016 05:45 PM, Richard Biener wrote:

On Thu, Jun 9, 2016 at 10:54 AM, Richard Biener
 wrote:

On Tue, Jun 7, 2016 at 3:59 PM, Sameera Deshpande
 wrote:

Hi Richard,

This is with reference to our discussion at GNU Tools Cauldron 2015 regarding my talk 
titled "Improving the effectiveness and generality of GCC auto-vectorization." 
Further to our prototype implementation of the concept, we have started implementing this 
concept in GCC.

We are following an incremental model to add language support in our front-end,
and the corresponding back-end (for the auto-vectorizer) will be added for
feature completion.

Looking at the complexity and scale of the project, we have divided it
into the subtasks listed below, for ease of implementation, testing and
review.

0. Add a new pass to perform auto-vectorization using the unified representation -
The current GCC framework does not give a complete overview of the loop to be
vectorized: it breaks the loop up either across the body or across iterations.
Because of this, the existing data structures cannot be reused for our approach,
which gathers all the information about the loop body in one place using
primitive permute operations. Hence, define new data structures and populate them.

1. Add support for vectorization of LOAD/STORE instructions
 a. Create the permute order tree for the loop with LOAD and STORE instructions
for single- or multi-dimensional arrays and aggregates within nested loops.
 b. Basic transformation phase to generate vectorized code for the
primitive reorder tree generated at stage 1a using the tree tiling algorithm. This
phase handles code generation for SCATTER, GATHER, strided memory accesses
etc., along with permute instruction generation.

2. Implementation of k-arity promotion/reduction : The permute nodes within the
primitive reorder tree generated from the input program can have any arity.
However, most targets support a maximum arity of 2.
Hence, we need to promote or reduce the arity of the permute order tree to enable
successful tree tiling.

3. Vector size reduction : Depending upon the vector size of the target, reduce
the vector size per statement and adjust the loop count of the vectorized loop
accordingly.

4. Support simple arithmetic operations :
 a. Add support for analyzing statements with simple arithmetic operations
like +, -, *, / for vectorization, and create the primitive reorder tree with
compute_op.
 b. Generate vector code for the primitive reorder tree generated at stage 4a
using the tree tiling algorithm - here, support for complex patterns like
multiply-add should be checked and the appropriate instruction generated.

5. Support reduction operations :
 a. Add support for reduction operation analysis and primitive reorder tree
generation. The reduction operation needs special handling, as the finish
statement should COLLAPSE the temporary reduction vector TEMP_VAR into the
original reduction variable.
 b. The code generation for the primitive reorder tree does not need any special
handling, as the reduction tree is the same as the tree generated in 4a, the only
difference being that in 4a the destination is a MEMREF (because of the STORE
operation) whereas for reduction it is TEMP_VAR. At this stage, generate code for
the COLLAPSE node in the finish statements. (A rough sketch of this follows the list.)

6. Support other vectorizable statements like complex arithmetic operations, 
bitwise operations, type conversions etc.
 a. Add support for analysis and primitive reorder tree generation.
 b. Vector code generation.

7. Cost-effective tree tiling algorithm : Until now, the tree tiling happens
without considering the cost of computation. However, there can be
multiple target instructions covering the tree - hence, instead of picking the
first-matched largest instruction cover, select the instruction cover based on
the cost of the instructions given in the .md file for the target.

8. Optimizations on the created primitive reorder tree : This stage is open-ended;
its scope will be defined based on performance analysis.
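
As a rough Python model of the reduction handling sketched in item 5 (the names
and the vector size are assumptions of ours, not the patch's code):

VS = 4                                   # assumed vector size in elements
a = list(range(32))                      # iteration count is a multiple of VS
temp_var = [0] * VS                      # the temporary reduction vector TEMP_VAR
for i in range(0, len(a), VS):           # vectorized loop body (stage 5a tree)
    for lane in range(VS):
        temp_var[lane] += a[i + lane]    # destination is TEMP_VAR, not a MEMREF
total = sum(temp_var)                    # COLLAPSE node in the finish statements
assert total == sum(a)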

The current patch attached herewith handles stages 0 and 1a : it adds a new
pass to perform auto-vectorization using the unified representation, defines new
data structures to cater to this requirement, and creates the primitive reorder
tree for LOAD/STORE instructions within the loop.

The whole loop is represented using an ITER_NODE, which has information about
- The preparatory statements for vectorization to be executed before entering
the loop (like initialization of vectors, prepping for reduction operations,
peeling etc.)
- The vectorizable loop body, represented as a PRIMOP_TREE (primitive reordering tree)
- Final statements (for peeling, variable loop bounds, the COLLAPSE operation for
reduction etc.)
- Other loop attributes (loop bound, peeling needed, dependences, etc.)
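
For illustration only, a Python-flavoured view of what an ITER_NODE gathers
(the field names are ours, not the patch's declarations):

class IterNode:
    def __init__(self):
        self.prep_stmts = []      # before the loop: vector init, reduction prep, peeling
        self.body = None          # PRIMOP_TREE for the vectorizable loop body
        self.finish_stmts = []    # peel remainder, COLLAPSE for reductions, ...
        self.loop_bound = None    # plus other loop attributes:
        self.needs_peeling = False
        self.dependences = []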

Memory accesses within a loop have a definite repetitive pattern which can be
captured using primitive permute operators, which in turn can be used to
determine the desired permute order for the vector computation.

Re: [Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.

2016-06-19 Thread Sameera Deshpande

On Wednesday 15 June 2016 05:52 PM, Richard Biener wrote:

On Mon, Jun 13, 2016 at 12:56 PM, Sameera Deshpande
 wrote:


RE: [Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.

2016-07-07 Thread Sameera Deshpande


From: Richard Biener [richard.guent...@gmail.com]
Sent: 06 July 2016 16:46:17
To: Sameera Deshpande
Cc: Matthew Fortune; Rich Fuhler; Prachi Godbole; gcc@gcc.gnu.org; Jaydeep Patil
Subject: Re: [Patch 0,1a] Improving effectiveness and generality of 
autovectorization using unified representation.

On Wed, Jul 6, 2016 at 12:49 PM, Sameera Deshpande
 wrote:
> 
> From: Sameera Deshpande [sameera.deshpa...@imgtec.com]
> Sent: 20 June 2016 11:37:58
> To: Richard Biener
> Cc: Matthew Fortune; Rich Fuhler; Prachi Godbole; gcc@gcc.gnu.org; Jaydeep 
> Patil
> Subject: Re: [Patch 0,1a] Improving effectiveness and generality of 
> autovectorization using unified representation.
>
> On Wednesday 15 June 2016 05:52 PM, Richard Biener wrote:
>> On Mon, Jun 13, 2016 at 12:56 PM, Sameera Deshpande
>>  wrote: