Prototype implementation: Improving effectiveness and generality of auto-vectorization
Hi Richard,

This is with reference to our discussion at GNU Tools Cauldron 2015 regarding my talk titled "Improving the effectiveness and generality of GCC auto-vectorization." We, at Imagination Technologies, have further worked on finalizing the algorithms for the transformations needed to achieve efficient target-aware reordering-instruction selection. We have implemented a prototype in Python to demonstrate the capabilities of the algorithm and to verify the claims made in the presentation. I am attaching the prototype along with the documented algorithm for review. We will be starting with the design and implementation of the same in GCC and would be glad to receive comments, feedback and suggestions.

- Thanks and regards,
  Sameera Deshpande

Introduction

Current methods of auto-vectorization take the vectorization decision through a generic analysis which is almost target independent (except for target features such as vector sizes). However, the instruction selection mechanism is largely ad hoc, in the sense that the ideas used for instruction selection on one target may not carry over seamlessly to another. This has led to a proliferation of target-specific code in the generic parts of compiler frameworks such as GCC. We propose a generic approach to reordering-instruction selection, based on the insight that effective data reordering for vectorization can be defined in terms of a small set of well-defined primitive operations. Interestingly, the same set of operations can be used to describe the target instructions of almost all targets we know of. These operations therefore form the basic building blocks of reorderings for effective vectorization.

Primitive operations

Let V_N denote a vector of N scalar elements. A collection of k such vectors is denoted by V_N, ...k..., V_N. We propose the following operations:

- Concatenation_k^N : V_N x ...k... x V_N --> V_kN. CONCAT_k^N(V_N, ...k..., V_N) concatenates k vectors of size N to produce a single vector of size kN.

- Split_k,i^kN : V_kN --> V_N. Split is the inverse of concatenation. SPLT_k,i^kN(V_kN) splits a vector of size kN into k vectors of size N each and returns the i-th vector.

- Interleave_k^N : V_N x ...k... x V_N --> V_kN. ILV_k^N(V_N, ...k..., V_N) creates a vector of size kN such that the i-th element of the output vector is the (i/k)-th element of the (i%k)-th input vector.

- Extract_k,i^kN : V_kN --> V_N. Extract is the inverse of interleave. EXTR_k,i^kN(V_kN) divides the input vector into N parts of k elements each and forms k vectors of size N, the j-th such vector taking the j-th element of each part; it returns the i-th of these vectors.

- Permute^N : V_N x I_M --> V_N. Permute takes a vector of size N and an integer permute order of size M. It divides the vector into N/M subvectors of size M and permutes each subvector with the permute order I_M.
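To make the semantics concrete, the following is a minimal Python sketch of the five primitives exactly as defined above (an illustration written for this note, not the attached prototype; the function names are ours):

# Minimal sketch of the primitive reordering operations defined above.
# Vectors are plain Python lists; names follow the operators in the text.

def concat(*vectors):                      # CONCAT_k^N
    """Concatenate k vectors of size N into one vector of size k*N."""
    return [x for v in vectors for x in v]

def split(v, k, i):                        # SPLT_k,i^kN
    """Split a vector of size k*N into k vectors of size N; return the i-th."""
    n = len(v) // k
    return v[i * n:(i + 1) * n]

def interleave(*vectors):                  # ILV_k^N
    """Element i of the result is element i//k of input vector i%k."""
    k, n = len(vectors), len(vectors[0])
    return [vectors[i % k][i // k] for i in range(k * n)]

def extract(v, k, i):                      # EXTR_k,i^kN
    """Inverse of interleave: pick element i from each group of k elements."""
    return v[i::k]

def permute(v, order):                     # PERMUTE^N with order I_M
    """Permute each M-sized subvector of v by 'order' (M = len(order))."""
    m = len(order)
    return [v[base + j] for base in range(0, len(v), m) for j in order]

For example, extract(interleave(a, b), 2, 0) == a and extract(interleave(a, b), 2, 1) == b, which is the inverse relationship stated above.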
Phase structure of the algorithm

The algorithm for target-aware reordering-instruction selection for efficient auto-vectorization is divided into the following phases:

Phase 1: Create tree for each loop per destination - Implemented by Sameera Deshpande
Phase 2: k-arity reduction/promotion - Implemented by Sameera Deshpande
Phase 3: Top-down distribution of p-node subtree - Implemented by Sameera Deshpande
Phase 4: Optimizations - Implemented by Sameera Deshpande
Phase 5: Bottom-up commoning of p-node subtree - Implemented by Sameera Deshpande
         Cost computation of p-tree - Implemented by Sameera Deshpande
Phase 6: Vector-size reduction - Implemented by Sameera Deshpande
Phase 4 (again): Optimization - Implemented by Sameera Deshpande
Phase 7: Cost computation - store min_cost_tree - Implemented by Prachi Godbole
Phase 8: Target-specific code generation for min_cost_tree - Implemented by Prachi Godbole

Input loop

To enable auto-vectorization using this algorithm, the loop structure should be as follows: a loop with k statements reading from S_1, S_2, ..., S_m and writing to destination D, where stmt_j has the form

  D[k * i + d_j] = op_j(S_1[k_1 * i + c_j1], S_2[k_2 * i + c_j2], ..., S_m[k_m * i + c_jm])

such that

- All statements stmt_1, ..., stmt_k in the loop body are independent and vectorizable for a target with vector size VS.
- The iteration count N > VS.
- The induction variable i of the loop is not altered by stmt_1, stmt_2, ..., stmt_k.
- All of S_1, ..., S_m and D are of the same type.
- The indices of all S and D have the same multiplicative factor. If they differ, normalize by taking the LCM and duplicate the statements appropriately, adjusting the offsets.
- c_jp < k_p

FOR (i = 0 ... N - 1)
{
  stmt_1: D[k * i + d_1] = op_1(S_1[k_1 * i + c_11], S_2[k_2 * i + c_12], ..., S_m[k_m * i + c_1m])
  stmt_2: D[k * i + d_2] = op_2(S_1[k_1 * i + c_21], S_2[k_2 * i + c_22], ..., S_m[k_m * i + c_2m])
  ...
  stmt_k: D[k * i + d_k] = op_k(S_1[k_1 * i + c_k1], S_2[k_2 * i + c_k2], ..., S_m[k_m * i + c_km])
}
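As a concrete instance of the required input form (our example, not taken from the attached document), the following loop has k = 2, a single source S1, and two independent statements indexed with the same multiplicative factor:

# Example loop matching the required input form, with k = 2, m = 1:
#   stmt_1: D[2*i]     = S1[2*i]     + 1
#   stmt_2: D[2*i + 1] = S1[2*i + 1] * 2
# Both statements are independent, use the same induction variable i,
# and index D and S1 with the same multiplicative factor (2).
def example_loop(D, S1, N):
    for i in range(N):
        D[2 * i] = S1[2 * i] + 1
        D[2 * i + 1] = S1[2 * i + 1] * 2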
Re: Prototype implementation: Improving effectiveness and generality of auto-vectorization
On Wednesday 21 October 2015 05:23 PM, Richard Biener wrote:
> On Thu, Oct 15, 2015 at 12:20 PM, sameera wrote:
>> Hi Richard, This is with reference to our discussion at GNU Tools Cauldron 2015 regarding my talk titled "Improving the effectiveness and generality of GCC auto-vectorization." We, at Imagination Technologies, have further worked on finalizing the algorithms for transformations to achieve efficient target-aware reordering-instruction selection. We have implemented the prototype in Python to demonstrate the capabilities of the algorithm and verify the claims made in the presentation. I am attaching the prototype along with the documented algorithm for review purposes. We would be starting with design and implementation of the same in GCC and would be glad to receive comments, feedback and suggestions.
>
> So I finally sat down and skimmed over auto_vectorization.txt. The first thing I notice is that you constrain your input language to something you define yourself. In the end you will need to work on GIMPLE in SSA form. Thus the algorithm as described needs to construct its AST from that GIMPLE representation.

Richard, we have defined the input language for convenience in the prototype implementation. However, we will be using GIMPLE as our IR. As per the grammar of our tree, a p-tree denotes the permute order associated with a statement, whereas a c-tree is the GIMPLE instruction which performs the compute operation. I tried looking at the structures used in SLP; however, they cannot be used as they are, since the main difference between the current SLP implementation in GCC and our representation is that the permute order in SLP is part of the tree node, whereas in our representation the permute order is an independent tree node. Hence, I have created a new tree structure for our pass, which creates p-tree nodes for permute orders and c-tree nodes which point to the appropriate GIMPLE statements.

> Loop count and data reference analysis is provided by GCC and you need to work with the way their result is presented.

I am trying to figure out where and how an interleave pattern encapsulating the whole loop can be represented, as the interleave pattern carries not only the loop-related information, but also the order in which the destination array is written. The data reference analysis can be used nicely with the data structures we have designed, as the introduction of p-tree nodes does not alter the attributes of the c-tree (the GIMPLE stmt). I am also trying to identify relations between the chain of recurrences for each SSA variable and the vec_size associated with each tree node in our structure. Logically, both compute the same information, and I am looking at whether it can be propagated in our tree.

> As with the presentation the paper is mostly about optimizing the interleaving code. That's fine in principle but I suspect that the AST needs to get explicit support for "vectorized stmts" that perform type conversions (like type conversions themselves or widening operations), that is, represent the fact that for certain loops you need N vectorized stmts for stmt A and M vectorized stmts for stmt B. This is an important observation once you get to the point of supporting targets with multiple vector sizes (and instructions like the x86_64 integer - double conversions which go from 128bit to 256bit vectors).

Yes, we haven't given much thought to type conversions, because our assumption is that type conversions are order-preserving transformations (c-tree), and not order-altering transformations (p-tree).
Because of that assumption, those instructions will be generated just like any other compute instruction. And I see that GCC makes the same assumption, which is why it treats vec_perm_const patterns differently from the vec_pack/unpack* patterns even though the generated instructions can be the same. And, as each statement can have its own vectorization count, the scenario you are mentioning can be taken care of. However, I will look into it again to see whether type conversions need additional care.

> I somewhat miss an idea on how to deal with irregular SLP cases - that is, does the AST model each (current) SLP node as a single statement or does it model individual vector elements (so you can insert no-op compensation code to make the operations match). Consider
>   a[i] = b[i] * 3;
>   a[i+1] = b[i+1];
> which can be vectorized with SLP if you realize you can multiply b[i+1] by 1.

The pass structure we are having for our optimization is as follows:

- New pass target-aware-loop-vect with the following subpasses:
  - Loop structure identification
  - Restricted loop construction
  - Loop analysis (6 phases of the target-aware reordering-instruction selection algorithm)
  - Loop transformation (cost-aware GIMPLE code generation)

We might need some optimizations to be performed on the loop body before we start our optimization for reordering-instruction selection, so that the input program can
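As a side note, the irregular-SLP example quoted above can be pictured with a small sketch (purely illustrative, not GCC or prototype code) of the no-op compensation Richard describes - the second lane is multiplied by 1 so that both lanes perform the same operation:

# Illustration of no-op compensation for the irregular SLP case above:
#   a[i]   = b[i]   * 3
#   a[i+1] = b[i+1]
# Multiplying the second lane by 1 makes both lanes a multiply, so the
# pair can be handled as one two-lane vector statement.
def slp_compensated(a, b, n):
    factors = [3, 1]                 # 1 is the inserted no-op multiplier
    for i in range(0, n, 2):
        lanes = [b[i], b[i + 1]]     # conceptual 2-element vector load
        a[i:i + 2] = [x * f for x, f in zip(lanes, factors)]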
[Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.
can have INTERLEAVE, CONCAT, EXTRACT, SPLIT, ITER or any compute operation as an intermediate node. Leaf nodes can be a memory reference, a constant, or a vector of loop invariants. Depending upon the operation, PRIMOP_TREE holds the appropriate information about the statement within the loop which is necessary for vectorization. At this stage, these data structures are populated by gathering all the information about the loop, the statements within the loop, and the correlation of the statements within the loop. Moreover, the loop body is analyzed to check whether vectorization of each statement is possible. One has to note, however, that this analysis phase gives a worst-case estimate of instruction selection, as it only checks whether a specific named pattern is defined in the .md file for the target. It does not necessarily give the optimal cover, which is the aim of the transformation phase using the tree tiling algorithm - and that phase can be invoked only once the loop body is represented using the primitive reorder tree.

At this stage, the focus is to create the permute order tree for the loop with LOAD and STORE instructions only. The code we intend to compile is of the form

FOR (i = 0; i < N; i++)
{
  stmt_1: D[k * i + d_1] = S_1[k * i + c_11]
  stmt_2: D[k * i + d_2] = S_1[k * i + c_21]
  ...
  stmt_k: D[k * i + d_k] = S_1[k * i + c_k1]
}

Here we are assuming that any data reference can be represented as base + k * index + offset (the data structure struct data_reference from GCC is currently used for this purpose). If not, the address is normalized to convert it to such a representation.

We are looking forward to your suggestions and insight in this regard for better execution of this project. Also, as this is a long-term project, can we create a branch in GCC to put all our patches in one place, so that anyone interested can download and tweak the code?

- Thanks and regards,
  Sameera D.

Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in	(revision 236862)
+++ gcc/Makefile.in	(working copy)
@@ -1522,6 +1522,7 @@
 	tree-vect-loop-manip.o \
 	tree-vect-slp.o \
 	tree-vectorizer.o \
+	tree-vect-unified.o \
 	tree-vrp.o \
 	tree.o \
 	valtrack.o \
Index: gcc/common.opt
===================================================================
--- gcc/common.opt	(revision 236862)
+++ gcc/common.opt	(working copy)
@@ -2570,6 +2570,10 @@
 Common Report Var(flag_tree_loop_vectorize) Optimization
 Enable loop vectorization on trees.
 
+ftree-loop-vectorize-unified
+Common Report Var(flag_tree_loop_vectorize_unified) Optimization
+Enable loop vectorization using unified representation on trees.
+
 ftree-slp-vectorize
 Common Report Var(flag_tree_slp_vectorize) Optimization
 Enable basic block vectorization (SLP) on trees.
Index: gcc/hsa.c
===================================================================
--- gcc/hsa.c	(revision 236862)
+++ gcc/hsa.c	(working copy)
@@ -840,6 +840,7 @@
   fn_opts = optimization_default_node;
   fn_opts = copy_node (fn_opts);
   TREE_OPTIMIZATION (fn_opts)->x_flag_tree_loop_vectorize = false;
+  TREE_OPTIMIZATION (fn_opts)->x_flag_tree_loop_vectorize_unified = false;
   TREE_OPTIMIZATION (fn_opts)->x_flag_tree_slp_vectorize = false;
   DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl) = fn_opts;
Index: gcc/passes.def
===================================================================
--- gcc/passes.def	(revision 236862)
+++ gcc/passes.def	(working copy)
@@ -278,6 +278,11 @@
 	  NEXT_PASS (pass_expand_omp_ssa);
 	  NEXT_PASS (pass_ch_vect);
 	  NEXT_PASS (pass_if_conversion);
+	  NEXT_PASS (pass_unified_vectorize);
+	  PUSH_INSERT_PASSES_WITHIN (pass_unified_vectorize)
+	      NEXT_PASS (pass_dce);
+	  POP_INSERT_PASSES ()
+
 	  /* pass_vectorize must immediately follow pass_if_conversion.
 	     Please do not add any other passes in between.
	     */
 	  NEXT_PASS (pass_vectorize);
Index: gcc/tree-data-ref.c
===================================================================
--- gcc/tree-data-ref.c	(revision 236862)
+++ gcc/tree-data-ref.c	(working copy)
@@ -749,16 +749,19 @@
   return build_fold_addr_expr (TREE_OPERAND (addr, 0));
 }
 
-/* Analyzes the behavior of the memory reference DR in the innermost loop or
-   basic block that contains it.  Returns true if analysis succeed or false
-   otherwise.  */
-
+/* Analyze behavior of memory reference in the innermost loop or basic block
+   that contains it.  STMT is statement having memory reference.  REF is the
+   memory reference under consideration.  Returns true if analysis succeeds,
+   false otherwise.  PBASE, STEP and OFFSET are returned if analysis is
+   successful where REF is represented as PBASE + PSTEP * iter# + POFFSET.
+   The initial address of memory reference is returned in PINIT, and desired
+   alignment in ALIGNED_TO.  */
 bool
-dr_analyze_innermost (struct data_reference *dr, struct loop *nest)
+memref_analyze_innermost (gimple *stmt, tree ref, struct loop *nest,
+			  tr
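For illustration only, here is a tiny Python sketch (our own, not GCC code, with hypothetical helper names) of the base + step * iter + offset decomposition that the renamed memref_analyze_innermost interface documents, applied to a reference of the form S[k*i + c]:

# Illustrative decomposition of a reference S[k*i + c] into the
# base + step * iter + offset shape discussed above (not GCC code).
def analyze_ref(base_addr, elem_size, k, c):
    """Return (base, step, offset) so that the address of S[k*i + c]
    is base + step * i + offset for iteration number i."""
    step = k * elem_size          # bytes advanced per iteration
    offset = c * elem_size        # constant byte offset within each group
    return base_addr, step, offset

# Example: S1[2*i + 1] with 4-byte elements at base address 0x1000
# decomposes to (0x1000, 8, 4).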
Re: [Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.
On Thursday 09 June 2016 05:45 PM, Richard Biener wrote:
On Thu, Jun 9, 2016 at 10:54 AM, Richard Biener wrote:
On Tue, Jun 7, 2016 at 3:59 PM, Sameera Deshpande wrote:

Hi Richard,

This is with reference to our discussion at GNU Tools Cauldron 2015 regarding my talk titled "Improving the effectiveness and generality of GCC auto-vectorization." Further to our prototype implementation of the concept, we have started implementing this concept in GCC.

We are following an incremental model to add language support in our front-end, and the corresponding back-end (for the auto-vectorizer) will be added for feature completion. Looking at the complexity and scale of the project, we have divided it into the subtasks listed below, for ease of implementation, testing and review.

0. Add a new pass to perform auto-vectorization using the unified representation - The current GCC framework does not give a complete overview of the loop to be vectorized: it either breaks the loop across the body, or across iterations. Because of this, the existing data structures cannot be reused for our approach, which gathers all the information of the loop body in one place using primitive permute operations. Hence, define new data structures and populate them.

1. Add support for vectorization of LOAD/STORE instructions
   a. Create the permute order tree for the loop with LOAD and STORE instructions for single- or multi-dimensional arrays and aggregates within nested loops.
   b. Basic transformation phase to generate vectorized code for the primitive reorder tree generated at stage 1a using the tree tiling algorithm. This phase handles code generation for SCATTER, GATHER, strided memory accesses etc. along with permute instruction generation.

2. Implementation of k-arity promotion/reduction: The permute nodes within the primitive reorder tree generated from the input program can have any arity. However, the target can support a maximum arity of 2 in most cases. Hence, we need to promote or reduce the arity of the permute order tree to enable successful tree tiling.

3. Vector size reduction: Depending upon the vector size of the target, reduce the vector size per statement and adjust the loop count for the vectorized loop accordingly.

4. Support simple arithmetic operations:
   a. Add support for analyzing statements with simple arithmetic operations like +, -, *, / for vectorization, and create a primitive reorder tree with compute_op.
   b. Generate vector code for the primitive reorder tree generated at stage 4a using the tree tiling algorithm - here support for complex patterns like multiply-add should be checked and the appropriate instruction generated.

5. Support reduction operations:
   a. Add support for reduction operation analysis and primitive reorder tree generation. The reduction operation needs special handling, as the finish statement should COLLAPSE the temporary reduction vector TEMP_VAR into the original reduction variable.
   b. The code generation for the primitive reorder tree does not need any special handling - the reduction tree is the same as the tree generated in 4a, with the only difference that in 4a the destination is a MEMREF (because of the STORE operation) whereas for reduction it is TEMP_VAR. At this stage, generate code for the COLLAPSE node in the finish statements.

6. Support other vectorizable statements like complex arithmetic operations, bitwise operations, type conversions etc.
   a. Add support for analysis and primitive reorder tree generation.
   b. Vector code generation.

7. Cost-effective tree tiling algorithm: Until now, the tree tiling happens without considering the cost of computation.
However, there can be multiple target instructions covering the tree - hence, instead of picking the first matched largest instruction cover, select the instruction cover based on the cost of the instructions given in the .md file for the target.

8. Optimizations on the created primitive reorder tree: This stage is open-ended; its scope can be defined depending upon performance analysis.

The current patch I have attached herewith handles stages 0 and 1a: it adds a new pass to perform auto-vectorization using the unified representation, defines new data structures to cater to this requirement, and creates the primitive reorder tree for LOAD/STORE instructions within the loop.

The whole loop is represented using the ITER_NODE, which has information about
- the preparatory statements for vectorization to be executed before entering the loop (like initialization of vectors, prepping for reduction operations, peeling etc.),
- the vectorizable loop body represented as a PRIMOP_TREE (primitive reordering tree),
- the final statements (for peeling, variable loop bound, the COLLAPSE operation for reduction etc.),
- other loop attributes (loop bound, whether peeling is needed, dependences, etc.).

Memory accesses within a loop have a definite repetitive pattern which can be captured using primitive permute operators, which can be used to determine the desired permute order for the vector computations
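To make subtask 2 (k-arity promotion/reduction) more concrete, here is a short Python sketch (ours, not part of the patch) of the kind of rewrite involved: an arity-4 interleave expressed through arity-2 interleaves, which is what tree tiling needs when the target only provides two-operand permutes:

# Sketch: reduce an arity-4 interleave to nested arity-2 interleaves,
# using the ILV semantics from the original write-up.
def ilv2(a, b):
    """Arity-2 interleave: [a0, b0, a1, b1, ...]."""
    return [x for pair in zip(a, b) for x in pair]

def ilv4(a, b, c, d):
    """Arity-4 interleave: [a0, b0, c0, d0, a1, b1, c1, d1, ...]."""
    return [x for quad in zip(a, b, c, d) for x in quad]

# The identity used for the arity reduction:
#   ILV_4(a, b, c, d) == ILV_2(ILV_2(a, c), ILV_2(b, d))
a, b, c, d = [0, 1], [10, 11], [20, 21], [30, 31]
assert ilv4(a, b, c, d) == ilv2(ilv2(a, c), ilv2(b, d))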
Re: [Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.
On Wednesday 15 June 2016 05:52 PM, Richard Biener wrote:
> On Mon, Jun 13, 2016 at 12:56 PM, Sameera Deshpande wrote:
>> On Thursday 09 June 2016 05:45 PM, Richard Biener wrote:
>>> On Thu, Jun 9, 2016 at 10:54 AM, Richard Biener wrote:
>>>> On Tue, Jun 7, 2016 at 3:59 PM, Sameera Deshpande wrote:
>>>>> Hi Richard, This is with reference to our discussion at GNU Tools Cauldron 2015 regarding my talk titled "Improving the effectiveness and generality of GCC auto-vectorization." Further to our prototype implementation of the concept, we have started implementing this concept in GCC.
RE: [Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.
From: Richard Biener [richard.guent...@gmail.com]
Sent: 06 July 2016 16:46:17
To: Sameera Deshpande
Cc: Matthew Fortune; Rich Fuhler; Prachi Godbole; gcc@gcc.gnu.org; Jaydeep Patil
Subject: Re: [Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.

On Wed, Jul 6, 2016 at 12:49 PM, Sameera Deshpande wrote:
>
> From: Sameera Deshpande [sameera.deshpa...@imgtec.com]
> Sent: 20 June 2016 11:37:58
> To: Richard Biener
> Cc: Matthew Fortune; Rich Fuhler; Prachi Godbole; gcc@gcc.gnu.org; Jaydeep Patil
> Subject: Re: [Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.
>
> On Wednesday 15 June 2016 05:52 PM, Richard Biener wrote:
>> On Mon, Jun 13, 2016 at 12:56 PM, Sameera Deshpande wrote:
>>> On Thursday 09 June 2016 05:45 PM, Richard Biener wrote:
>>>> On Thu, Jun 9, 2016 at 10:54 AM, Richard Biener wrote:
>>>>> On Tue, Jun 7, 2016 at 3:59 PM, Sameera Deshpande wrote:
>>>>>> Hi Richard,
>>>>>>
>>>>>> This is with reference to our discussion at GNU Tools Cauldron 2015 regarding my talk titled "Improving the effectiveness and generality of GCC auto-vectorization." Further to our prototype implementation of the concept, we have started implementing this concept in GCC.