[patch,ira]: Improve the updated memory cost in the coloring pass of the integrated register allocator.
This patch improves the updated memory cost in the coloring pass of the integrated register allocator. Only the enter_freq of the loop is now considered when updating the memory cost in the coloring pass; the exit_freq is ignored. Considering only enter_freq is based on the observation that a value live-out at the entry (header) of a loop is live-in and live-out throughout the loop. Ignoring exit_freq increases the updated memory cost, which gives more chances of reducing the spill and fetch and a better assignment.

The claim that a value live-out at the header of a loop is live-in and live-out throughout the loop is based on the following argument (see the C illustration after the patch). If a variable v is live-out at the header of the loop, then v is live-in at every node in the loop. To prove this, consider a loop L with header h such that the variable v defined at d is live-in at h. Since v is live at h, d is not part of L; this follows from the dominance property, i.e. h is strictly dominated by d. Furthermore, there exists a path from h to a use of v which does not go through d. For every node p in the loop, since the loop is a strongly connected component of the CFG, there exists a path, consisting only of nodes of L, from p to h. Concatenating these two paths proves that v is live-in and live-out of p.

Bootstrapped on x86_64. A performance run was done on the SPEC CPU2000 benchmarks, with the following results.

SPEC INT benchmarks (mean score with this patch vs mean score without: 3729.777 vs 3717.083).

Benchmarks      Gains
186.crafty      2.78%
176.gcc         0.7%
253.perlbmk     0.75%
255.vortex      0.82%

SPEC FP benchmarks (mean score with this patch vs mean score without: 4774.65 vs 4751.838).

Benchmarks      Gains
168.wupwise     0.77%
171.swim        1.5%
177.mesa        1.2%
200.sixtrack    1.2%
178.galgel      0.6%
179.art         0.6%
183.equake      0.5%
187.facerec     0.7%

ChangeLog:
2016-01-23  Ajit Agarwal

	* ira-color.c (color_pass): Consider only the enter_freq in the
	calculation of the updated memory cost.

Signed-off-by: Ajit Agarwal ajit...@xilinx.com
---
 gcc/ira-color.c | 12 +---
 1 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/gcc/ira-color.c b/gcc/ira-color.c
index 1e4c64f..201017c 100644
--- a/gcc/ira-color.c
+++ b/gcc/ira-color.c
@@ -3173,7 +3173,7 @@ static void
 color_pass (ira_loop_tree_node_t loop_tree_node)
 {
   int regno, hard_regno, index = -1, n;
-  int cost, exit_freq, enter_freq;
+  int cost, enter_freq;
   unsigned int j;
   bitmap_iterator bi;
   machine_mode mode;
@@ -3297,7 +3297,6 @@ color_pass (ira_loop_tree_node_t loop_tree_node)
 	    }
 	  continue;
 	}
-      exit_freq = ira_loop_edge_freq (subloop_node, regno, true);
       enter_freq = ira_loop_edge_freq (subloop_node, regno, false);
       ira_assert (regno < ira_reg_equiv_len);
       if (ira_equiv_no_lvalue_p (regno))
@@ -3315,15 +3314,14 @@ color_pass (ira_loop_tree_node_t loop_tree_node)
       else if (hard_regno < 0)
 	{
 	  ALLOCNO_UPDATED_MEMORY_COST (subloop_allocno)
-	    -= ((ira_memory_move_cost[mode][rclass][1] * enter_freq)
-		+ (ira_memory_move_cost[mode][rclass][0] * exit_freq));
+	    -= ((ira_memory_move_cost[mode][rclass][1] * enter_freq));
 	}
       else
 	{
 	  aclass = ALLOCNO_CLASS (subloop_allocno);
 	  ira_init_register_move_cost_if_necessary (mode);
 	  cost = (ira_register_move_cost[mode][rclass][rclass]
-		  * (exit_freq + enter_freq));
+		  * (enter_freq));
 	  ira_allocate_and_set_or_copy_costs
 	    (&ALLOCNO_UPDATED_HARD_REG_COSTS (subloop_allocno), aclass,
 	     ALLOCNO_UPDATED_CLASS_COST (subloop_allocno),
@@ -3339,8 +3337,8 @@ color_pass (ira_loop_tree_node_t loop_tree_node)
 	  ALLOCNO_UPDATED_CLASS_COST (subloop_allocno)
 	    = ALLOCNO_UPDATED_HARD_REG_COSTS (subloop_allocno)[index];
 	  ALLOCNO_UPDATED_MEMORY_COST (subloop_allocno)
-	    += (ira_memory_move_cost[mode][rclass][0] * enter_freq
-		+ ira_memory_move_cost[mode][rclass][1] * exit_freq);
+	    += (ira_memory_move_cost[mode][rclass][0] * enter_freq);
 	}
     }
 }
-- 
1.7.1

Thanks & Regards
Ajit

ira.patch
Description: ira.patch
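To make the liveness argument above concrete, here is a minimal, self-contained C illustration (the helper functions are hypothetical, introduced only for this sketch): v is defined at d before the loop, is live-in at the header h, and is therefore live at every block of the loop body.

extern int d (void);
extern int p (int);
extern void q (int);
extern void r (int);

void
example (void)
{
  int v = d ();   /* def of v at d, outside the loop L; d dominates h */
  while (p (v))   /* header h: v is live-in here                      */
    q (v);        /* every node of L reaches h without passing d,
                     so v is live-in and live-out at each of them     */
  r (v);          /* the use beyond L keeps v live-out of the loop    */
}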
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
-----Original Message-----
From: Jeff Law [mailto:l...@redhat.com]
Sent: Wednesday, January 27, 2016 12:48 PM
To: Ajit Kumar Agarwal; Richard Biener
Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

On 01/18/2016 11:27 AM, Ajit Kumar Agarwal wrote:
>>
>> Ajit, can you confirm which of adpcm_code or adpcm_decode path
>> splitting is showing a gain on?  I suspect it's the former but would
>> like to make sure so that I can adjust the heuristics properly.

>>> I'd still like to have this answered when you can, Ajit, just to be
>>> 100% sure that it's the path splitting in adpcm_code that's
>>> responsible for the improvements you're seeing in adpcm.

> The adpcm_coder gets optimized with path splitting, whereas the
> adpcm_decoder is not optimized further with path splitting. In
> adpcm_decoder the join node is duplicated into its predecessors, and
> with the duplication of the join node the code is not optimized further.

>>Right.  Just wanted to make sure my analysis corresponded with what you
>>were seeing in your benchmarking -- and it does.

>>I suspect that if we looked at this problem from the angle of isolating
>>paths based on how constant PHI arguments feed into and allow
>>simplifications in later blocks, we might get better long term results --
>>including improving adpcm_decoder, which has the same idiom as
>>adpcm_coder -- it's just in the wrong spot in the CFG.

>>But that's obviously gcc-7 material.

>>Jeff

Can I look into it?

Thanks & Regards
Ajit
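For reference, here is a hypothetical C sketch (not the adpcm source) of the idiom being discussed: the join block of the if-then-else is also the loop latch, and duplicating it into each arm lets a constant PHI argument feed simplifications in the copied block.

void
sketch (int *out, int n, int step, int mask)
{
  int i = 0;
  while (n-- > 0)
    {
      int delta;
      if (out[n] < 0)
        delta = 4;        /* constant PHI argument on this path */
      else
        delta = step;
      /* Join block == loop latch.  After path splitting, the THEN arm
         gets its own copy where "delta & mask" folds to "4 & mask".  */
      out[i++] = delta & mask;
    }
}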
[Patch,microblaze]: Better register allocation to minimize the spill and fetch.
This patch improves the allocation of registers in the given function. The allocation is optimized for conditional branches: the temporary register used in a conditional branch to store the comparison result is now allocated to the fixed register r18.

Currently such temporaries are allocated from the free registers in the given function. Due to this, one of the free registers is reserved for the temporaries and the given function is left with fewer registers; this is suboptimal for MicroBlaze. In MicroBlaze r18 is marked as fixed and cannot be allocated to pseudos in the given function, but r18 can be used as a temporary for conditional branches with compare and branch. Using r18 as a temporary for conditional branches saves one of the free registers, which can then be allocated to other pseudos, and hence gives better register allocation.

This usage of r18 reduces the spill and fetch because one more free register is available to other pseudos instead of being used for conditional temporaries. The scope of the temporaries is limited to the conditional branches, so using r18 as a temporary for such conditional branches is safe and preserves the functionality of the function.

Regtested for the MicroBlaze target.

Performance runs were done with the Mibench/EEMBC benchmarks; the following gains are achieved.

Benchmarks              Gains
automotive_qsort1       1.630730524%
network_dijkstra        1.527506256%
office_stringsearch1    1.81356288%
security_rijndael_d     3.26129357%
basefp01_lite           4.465120185%
a2time01_lite           1.893862857%
cjpeg_lite              3.286496675%
djpeg_lite              3.120150612%
qos_lite                2.63964381%
office_ispell           1.531340405%

Code size improvements:

Reduction in number of instructions for Mibench: 12927.
Reduction in number of instructions for EEMBC: 212.

ChangeLog:
2016-01-29  Ajit Agarwal

	* config/microblaze/microblaze.c
	(microblaze_expand_conditional_branch): Use of MB_ABI_ASM_TEMP_REGNUM
	for temporary conditional branch.
	(microblaze_expand_conditional_branch_reg): Use of
	MB_ABI_ASM_TEMP_REGNUM for temporary conditional branch.
	(microblaze_expand_conditional_branch_sf): Use of MB_ABI_ASM_TEMP_REGNUM
	for temporary conditional branch.

Signed-off-by: Ajit Agarwal ajit...@xilinx.com
---
 gcc/config/microblaze/microblaze.c | 6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/config/microblaze/microblaze.c b/gcc/config/microblaze/microblaze.c
index baff67a..b4277ad 100644
--- a/gcc/config/microblaze/microblaze.c
+++ b/gcc/config/microblaze/microblaze.c
@@ -3402,7 +3402,7 @@ microblaze_expand_conditional_branch (machine_mode mode, rtx operands[])
   rtx cmp_op0 = operands[1];
   rtx cmp_op1 = operands[2];
   rtx label1 = operands[3];
-  rtx comp_reg = gen_reg_rtx (SImode);
+  rtx comp_reg = gen_rtx_REG (SImode, MB_ABI_ASM_TEMP_REGNUM);
   rtx condition;

   gcc_assert ((GET_CODE (cmp_op0) == REG) || (GET_CODE (cmp_op0) == SUBREG));
@@ -3439,7 +3439,7 @@ microblaze_expand_conditional_branch_reg (enum machine_mode mode,
   rtx cmp_op0 = operands[1];
   rtx cmp_op1 = operands[2];
   rtx label1 = operands[3];
-  rtx comp_reg = gen_reg_rtx (SImode);
+  rtx comp_reg = gen_rtx_REG (SImode, MB_ABI_ASM_TEMP_REGNUM);
   rtx condition;

   gcc_assert ((GET_CODE (cmp_op0) == REG)
@@ -3483,7 +3483,7 @@ microblaze_expand_conditional_branch_sf (rtx operands[])
   rtx condition;
   rtx cmp_op0 = XEXP (operands[0], 0);
   rtx cmp_op1 = XEXP (operands[0], 1);
-  rtx comp_reg = gen_reg_rtx (SImode);
+  rtx comp_reg = gen_rtx_REG (SImode, MB_ABI_ASM_TEMP_REGNUM);

   emit_insn (gen_cstoresf4 (comp_reg, operands[0], cmp_op0, cmp_op1));
   condition = gen_rtx_NE (SImode, comp_reg, const0_rtx);
-- 
1.7.1

Thanks & Regards
Ajit

better-reg-alloc.patch
Description: better-reg-alloc.patch
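The shape of the code the expander now produces can be sketched as follows (hypothetical RTL, for illustration only; the exact insns depend on the comparison). The key property is that r18 is born at the compare and dies at the branch, so no other live range crosses it.

/* Hypothetical sketch of the insns emitted for "if (a < b) goto L1":

     (set (reg:SI 18) (lt:SI (reg:SI a) (reg:SI b)))    ;; compare into r18
     (set (pc) (if_then_else (ne (reg:SI 18) (const_int 0))
                             (label_ref L1)
                             (pc)))                     ;; branch on r18

   r18 is set by the compare and last used by the branch immediately
   after it, so the temporary never competes with pseudos for an
   allocatable register.  */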
RE: [Patch,microblaze]: Better register allocation to minimize the spill and fetch.
-----Original Message-----
From: Michael Eager [mailto:ea...@eagerm.com]
Sent: Friday, January 29, 2016 11:33 PM
To: Ajit Kumar Agarwal; GCC Patches
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,microblaze]: Better register allocation to minimize the spill and fetch.

On 01/29/2016 02:31 AM, Ajit Kumar Agarwal wrote:
>
> This patch improves the allocation of registers in the given function.
> The allocation is optimized for the conditional branches. The
> temporary register used in the conditional branches to store the
> comparison results is allocated to the fixed register r18.
>
> Currently such temporaries are allocated from the free registers in the
> given function. Due to this one of the free registers is reserved for
> the temporaries and the given function is left with fewer registers.
> In MicroBlaze r18 is marked as fixed and cannot be allocated to pseudos
> in the given function. Instead r18 can be used as a temporary for the
> conditional branches with compare and branch. Use of r18 as a
> temporary for conditional branches will save one of the free registers
> to be allocated. The free registers can be used for other pseudos and
> hence the better register allocation.
>
> The usage of r18 as above reduces the spill and fetch because of the
> availability of one of the free registers to other pseudos instead of
> being used for conditional temporaries.
>
> The scope of the temporaries is limited to the conditional branches,
> so the usage of r18 as a temporary for such conditional branches
> preserves the functionality of the function.
>
> Regtested for MicroBlaze target.
>
> Performance runs are done with Mibench/EEMBC benchmarks.
> Following gains are achieved.
>
> Benchmarks              Gains
>
> automotive_qsort1       1.630730524%
> network_dijkstra        1.527506256%
> office_stringsearch1    1.81356288%
> security_rijndael_d     3.26129357%
> basefp01_lite           4.465120185%
> a2time01_lite           1.893862857%
> cjpeg_lite              3.286496675%
> djpeg_lite              3.120150612%
> qos_lite                2.63964381%
> office_ispell           1.531340405%
>
> Code Size improvements:
>
> Reduction in number of instructions for Mibench: 12927.
> Reduction in number of instructions for EEMBC: 212.
>
> ChangeLog:
> 2016-01-29  Ajit Agarwal
>
> 	* config/microblaze/microblaze.c
> 	(microblaze_expand_conditional_branch): Use of MB_ABI_ASM_TEMP_REGNUM
> 	for temporary conditional branch.
> 	(microblaze_expand_conditional_branch_reg): Use of
> 	MB_ABI_ASM_TEMP_REGNUM for temporary conditional branch.
> 	(microblaze_expand_conditional_branch_sf): Use of MB_ABI_ASM_TEMP_REGNUM
> 	for temporary conditional branch.

>>You can combine these ChangeLog entries:

>>	* config/microblaze/microblaze.c
>>	(microblaze_expand_conditional_branch,
>>	microblaze_expand_conditional_branch_reg,
>>	microblaze_expand_conditional_branch_sf): Use
>>	MB_ABI_ASM_TEMP_REGNUM for temp reg.

Attached is the patch with the update in the ChangeLog. The modified ChangeLog is given below.

ChangeLog:
2016-01-29  Ajit Agarwal

	* config/microblaze/microblaze.c
	(microblaze_expand_conditional_branch,
	microblaze_expand_conditional_branch_reg,
	microblaze_expand_conditional_branch_sf): Use
	MB_ABI_ASM_TEMP_REGNUM for temp reg.

Thanks & Regards
Ajit

>>Otherwise, OK.

> Signed-off-by: Ajit Agarwal ajit...@xilinx.com
> ---
>  gcc/config/microblaze/microblaze.c | 6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/config/microblaze/microblaze.c
> b/gcc/config/microblaze/microblaze.c
> index baff67a..b4277ad 100644
> --- a/gcc/config/microblaze/microblaze.c
> +++ b/gcc/config/microblaze/microblaze.c
> @@ -3402,7 +3402,7 @@ microblaze_expand_conditional_branch (machine_mode mode, rtx operands[])
>    rtx cmp_op0 = operands[1];
>    rtx cmp_op1 = operands[2];
>    rtx label1 = operands[3];
> -  rtx comp_reg = gen_reg_rtx (SImode);
> +  rtx comp_reg = gen_rtx_REG (SImode, MB_ABI_ASM_TEMP_REGNUM);
>    rtx condition;
>
>    gcc_assert ((GET_CODE (cmp_op0) == REG) || (GET_CODE (cmp_op0) == SUBREG));
> @@ -3439,7 +3439,7 @@ microblaze_expand_conditional_branch_reg (enum machine_mode mode,
>    rtx cmp_op0 = operands[1];
>    rtx cmp_op1 = operands[2];
>    rtx label1 = operands[3];
> -  rtx c
RE: [Patch,microblaze]: Better register allocation to minimize the spill and fetch.
-----Original Message-----
From: Mike Stump [mailto:mikest...@comcast.net]
Sent: Tuesday, February 02, 2016 12:12 AM
To: Ajit Kumar Agarwal
Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,microblaze]: Better register allocation to minimize the spill and fetch.

On Jan 29, 2016, at 2:31 AM, Ajit Kumar Agarwal wrote:
>
> This patch improves the allocation of registers in the given function.

>>Is it just me, or would it be even better to change the ABI and make
>>MB_ABI_ASM_TEMP_REGNUM be allocated by the register allocator?

Yes, it would be even better to make r18 (MB_ABI_ASM_TEMP_REGNUM) allocatable by the register allocator in the given function. Currently r18 is marked in FIXED_REGISTERS and cannot be allocated in the given function; it is used only for temporaries whose liveness is limited. r18 is used in some of the shift patterns and also for the conditional branch temporaries.

I have made some ABI changes marking r18 as not fixed in my local branch, so that the given function can be allocated r18. This reduces the spill and fetch to a great extent, and I am seeing a good amount of gains with this change for the Mibench/EEMBC benchmarks.

However, since the ABI specifies that r18 is reserved for assembler temporaries, a lot of kernel code and glibc code uses r18 as a temporary in inline assembly, without storing and fetching it once its scope is gone. Changing the ABI would require a lot of changes in the MicroBlaze-specific kernel code. Currently we don't have plans to change the ABI, as that might break the existing kernel and glibc code; it would require a good amount of work in the kernel and glibc code for MicroBlaze.

Thanks & Regards
Ajit
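A hypothetical sketch of the kind of hand-written code that blocks the ABI change; the function and the mnemonic syntax are made up for illustration, but the pattern -- inline asm that clobbers r18 without saving it -- is the one described above.

/* Hand-written MicroBlaze asm is entitled to use r18 as scratch and
   never saves or restores it; making r18 allocatable would silently
   corrupt a value the compiler had placed there.  */
static inline int
read_word (volatile int *p)
{
  register int t __asm__ ("r18");     /* pin the temp to r18         */
  __asm__ volatile ("lwi %0, %1, 0"   /* load word, assumed syntax   */
                    : "=r" (t)
                    : "r" (p));
  return t;
}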
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
-----Original Message-----
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: Thursday, July 16, 2015 4:30 PM
To: Ajit Kumar Agarwal
Cc: l...@redhat.com; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

On Tue, Jul 7, 2015 at 3:22 PM, Ajit Kumar Agarwal wrote:
>
> -----Original Message-----
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Tuesday, July 07, 2015 2:21 PM
> To: Ajit Kumar Agarwal
> Cc: l...@redhat.com; GCC Patches; Vinod Kathail; Shail Aditya Gupta;
> Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on
> tree ssa representation
>
> On Sat, Jul 4, 2015 at 2:39 PM, Ajit Kumar Agarwal wrote:
>>
>> -----Original Message-----
>> From: Richard Biener [mailto:richard.guent...@gmail.com]
>> Sent: Tuesday, June 30, 2015 4:42 PM
>> To: Ajit Kumar Agarwal
>> Cc: l...@redhat.com; GCC Patches; Vinod Kathail; Shail Aditya Gupta;
>> Vidhumouli Hunsigida; Nagaraju Mekala
>> Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass
>> on tree ssa representation
>>
>> On Tue, Jun 30, 2015 at 10:16 AM, Ajit Kumar Agarwal wrote:
>>> All:
>>>
>>> The below patch adds a new path splitting optimization pass on the SSA
>>> representation. The path splitting optimization pass moves the join
>>> block of an if-then-else that is the same as the loop latch into its
>>> predecessors and merges it with the predecessors, preserving the SSA
>>> representation.
>>>
>>> The patch is tested for the MicroBlaze and i386 targets. The
>>> EEMBC/Mibench benchmarks were run with the MicroBlaze target, and a
>>> performance gain of 9.15% is seen in rgbcmy01_lite (EEMBC benchmarks).
>>> The DejaGnu tests were run for the MicroBlaze target; no regression is
>>> seen for MicroBlaze, and the new attached testcases pass.
>>>
>>> For i386, bootstrapping goes through fine and the SPEC CPU2000
>>> benchmarks were run with this patch. The following observations were
>>> made with the SPEC CPU2000 benchmarks:
>>>
>>> Ratio with the path splitting change vs without it is 3653.353 vs
>>> 3652.14 for INT benchmarks.
>>> Ratio with the path splitting change vs without it is 4353.812 vs
>>> 4345.351 for FP benchmarks.
>>>
>>> Based on comments on the RFC patch, the following changes were made:
>>>
>>> 1. Added a new pass for the path splitting changes.
>>> 2. Placed the new path splitting optimization pass before the copy
>>>    propagation pass.
>>> 3. The join block, being the same as the loop latch, is wired into its
>>>    predecessors so that the CFG cleanup pass will merge the blocks
>>>    wired together.
>>> 4. The copy propagation routines added for the path splitting changes
>>>    are not needed, as suggested by Jeff, and are removed in the patch;
>>>    copy propagation in the copied join blocks will be done by the
>>>    existing copy propagation pass and the update-ssa pass.
>>> 5. Only the propagation of phi results of the join block with the phi
>>>    argument is done, which will not be done by the existing update_ssa
>>>    or copy propagation pass on the tree SSA representation.
>>> 6. Added 2 tests:
>>>    a) compilation check tests.
>>>    b) execution tests.
>>> 7. Refactored the code for the feasibility check and for finding the
>>>    join block that is the same as the loop latch node.
>>>
>>> [Patch,tree-optimization]: Add new path Splitting pass on tree ssa
>>> representation.
>>>
>>> Added a new pass for path splitting on the tree SSA representation.
>>> The path splitting optimization does the CFG transformation where the
>>> join block of the if-then-else that is the same as the loop latch node
>>> is moved and merged with the predecessor blocks, preserving the SSA
>>> representation.
>>>
>>> ChangeLog:
>>> 2015-06-30  Ajit Agarwal
>>>
>>> 	* gcc/Makefile.in: Add the build of the new file
>>> 	tree-ssa-path-split.c.
>>> 	* gcc/common.opt: Add the new flag ftree-path-split.
>>> 	* gcc/opts.c: Add an entry for the path splitting pass
>>> 	with optimization flag greater than and equal to O2.
>>> 	* gcc/passes.def: Enable and add the new pass path splitting.
>>> 	* gcc/timevar.def: Add the new entry for TV_TREE_PAT
[Patch,optimization]: Optimized changes in the estimate register pressure cost.
I have made the following changes in the estimate_reg_pressure_cost function used by loop invariant motion and IVOPTS.

Earlier, estimate_reg_pressure_cost used only the cost of the n_new variables generated by loop invariant motion and IVOPTS. That is not sufficient for a register pressure calculation. The register pressure cost calculation should use n_new + n_old to compute the cost, where n_old is the registers used inside the loops. The effect of the n_new variables on register pressure depends on how they interact with the registers already used inside the loops: the increase or decrease in register pressure comes from the impact of the new variables on the registers used inside the loops, so the register-register move cost or the spill cost should be charged on the overall n_new + n_old variables, not on the new variables alone (see the sketch below).

Bootstrapped on i386 and reg-tested on i386 with the change; all fine.

SPEC CPU 2000 benchmarks were run, with the following impact on performance and code size.

Ratio with the optimization vs ratio without it for INT benchmarks: 3807.632 vs 3804.661.
Ratio with the optimization vs ratio without it for FP benchmarks: 4668.743 vs 4778.741.

Code size reduction with respect to the FP SPEC CPU 2000 benchmarks:

Number of instructions with the optimization    = 1094117
Number of instructions without the optimization = 1094659

Reduction in number of instructions with the optimization: 542 instructions.

[Patch,optimization]: Optimized changes in the estimate register pressure cost.

Earlier, estimate_reg_pressure_cost used only the cost of the n_new variables generated by loop invariant motion and IVOPTS. That is not sufficient for a register pressure calculation; it should use n_new + n_old, where n_old is the registers used inside the loops, since the effect of the n_new variables on register pressure depends on how they interact with the registers already used inside the loops.

ChangeLog:
2015-09-26  Ajit Agarwal

	* cfgloopanal.c (estimate_reg_pressure_cost): Add changes
	to consider n_new plus n_old in the register pressure cost.

Signed-off-by: Ajit Agarwal ajit...@xilinx.com

Thanks & Regards
Ajit

0001-Patch-optimization-Optimized-changes-in-the-estimate.patch
Description: 0001-Patch-optimization-Optimized-changes-in-the-estimate.patch
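A minimal sketch of the changed costing inside estimate_reg_pressure_cost (cfgloopanal.c), assuming the surrounding GCC definitions (target_reg_cost, target_spill_cost, target_res_regs, available_regs) from that file; before the patch the multipliers were n_new alone.

  unsigned regs_needed = n_new + n_old;

  if (regs_needed + target_res_regs <= available_regs)
    return 0;                                       /* no pressure at all */

  if (regs_needed <= available_regs)
    /* Close to running out of registers -- try to preserve them.  */
    cost = target_reg_cost [speed] * regs_needed;   /* was: * n_new */
  else
    /* Out of registers -- adding one more is very expensive.  */
    cost = target_spill_cost [speed] * regs_needed; /* was: * n_new */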
RE: [Patch,optimization]: Optimized changes in the estimate register pressure cost.
-----Original Message-----
From: Segher Boessenkool [mailto:seg...@kernel.crashing.org]
Sent: Sunday, September 27, 2015 7:49 PM
To: Ajit Kumar Agarwal
Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,optimization]: Optimized changes in the estimate register pressure cost.

On Sat, Sep 26, 2015 at 04:51:20AM +, Ajit Kumar Agarwal wrote:
> SPEC CPU 2000 benchmarks are run and there is following impact on the
> performance and code size.
>
> ratio with the optimization vs ratio without optimization for INT
> benchmarks (3807.632 vs 3804.661)
>
> ratio with the optimization vs ratio without optimization for FP
> benchmarks (4668.743 vs 4778.741)

>>Did you swap these?  You're saying FP got significantly worse?

Sorry for the typo. The corrected numbers: the ratio with the optimization vs the ratio without it for the FP benchmarks is 4668.743 vs 4668.741. With the optimization, FP has slightly better performance.

Thanks & Regards
Ajit

Segher
RE: [Patch,optimization]: Optimized changes in the estimate register pressure cost.
-----Original Message-----
From: Bin.Cheng [mailto:amker.ch...@gmail.com]
Sent: Monday, September 28, 2015 7:05 AM
To: Ajit Kumar Agarwal
Cc: Segher Boessenkool; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,optimization]: Optimized changes in the estimate register pressure cost.

On Sun, Sep 27, 2015 at 11:13 PM, Ajit Kumar Agarwal wrote:
>
> -----Original Message-----
> From: Segher Boessenkool [mailto:seg...@kernel.crashing.org]
> Sent: Sunday, September 27, 2015 7:49 PM
> To: Ajit Kumar Agarwal
> Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli
> Hunsigida; Nagaraju Mekala
> Subject: Re: [Patch,optimization]: Optimized changes in the estimate
> register pressure cost.
>
> On Sat, Sep 26, 2015 at 04:51:20AM +, Ajit Kumar Agarwal wrote:
>> SPEC CPU 2000 benchmarks are run and there is following impact on the
>> performance and code size.
>>
>> ratio with the optimization vs ratio without optimization for INT
>> benchmarks (3807.632 vs 3804.661)
>>
>> ratio with the optimization vs ratio without optimization for FP
>> benchmarks (4668.743 vs 4778.741)
>
>>>Did you swap these?  You're saying FP got significantly worse?
>
> Sorry for the typo error. Please find the corrected one.
>
> Ratio with the optimization vs ratio without optimization for FP
> benchmarks (4668.743 vs 4668.741). With the optimization FP has slightly
> better performance.

>>Did you mis-type the number again?  Or this must be noise.  Now I
>>remember why I didn't get a perf improvement from this.  Changing reg_new
>>to reg_new + reg_old doesn't have a big impact because it just increases
>>the starting number for each scenario.  Maybe it still makes sense for
>>cases on the verge of exceeding the target's available register number.
>>I will try to collect benchmark data on ARM, but it may take some time.

This is the correct one.

Thanks & Regards
Ajit

Thanks,
bin
RE: [Patch,optimization]: Optimized changes in the estimate register pressure cost.
-----Original Message-----
From: Aaron Sawdey [mailto:acsaw...@linux.vnet.ibm.com]
Sent: Monday, September 28, 2015 11:55 PM
To: Ajit Kumar Agarwal
Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,optimization]: Optimized changes in the estimate register pressure cost.

On Sat, 2015-09-26 at 04:51 +, Ajit Kumar Agarwal wrote:
> I have made the following changes in the estimate_reg_pressure_cost
> function used by the loop invariant and IVOPTS.
>
> Earlier the estimate_reg_pressure cost uses the cost of n_new
> variables that are generated by the Loop Invariant and IVOPTS. These
> are not sufficient for register pressure calculation. The register
> pressure cost calculation should use the n_new + n_old (numbers) to
> consider the cost. n_old is the register used inside the loops and
> the effect of n_new new variables generated by loop invariant and
> IVOPTS on register pressure is based on how the new variables impact
> on register used inside the loops. The increase or decrease in register
> pressure is due to the impact of new variables on the register used
> inside the loops. The register-register move cost or the spill cost
> should consider the cost associated with register used and the new
> variables generated.
>
> Bootstrap for i386 and reg tested on i386 with the change is fine.
>
> SPEC CPU 2000 benchmarks are run and there is following impact on the
> performance and code size.
>
> ratio with the optimization vs ratio without optimization for INT
> benchmarks (3807.632 vs 3804.661)
>
> ratio with the optimization vs ratio without optimization for FP
> benchmarks (4668.743 vs 4778.741)
>
> Code size reduction with respect to FP SPEC CPU 2000 benchmarks
>
> Number of instruction with optimization = 1094117
> Number of instruction without optimization = 1094659
>
> Reduction in number of instruction with the optimization = 542
> instructions.
>
> ChangeLog:
> 2015-09-26  Ajit Agarwal
>
> 	* cfgloopanal.c (estimate_reg_pressure_cost): Add changes
> 	to consider the n_new plus n_old in the register pressure
> 	cost.
>
> Signed-off-by: Ajit Agarwal ajit...@xilinx.com

>>Ajit,
>>It looks to me like your change doesn't do anything at all inside the
>>loop-invariant.c code.  There it's doing a difference between two
>>estimate_reg_pressure_cost calls, so adding n_old (regs_used) to both is
>>canceled out.

>>  size_cost = (estimate_reg_pressure_cost (new_regs[0] + regs_needed[0],
>>                                           regs_used, speed, call_p)
>>               - estimate_reg_pressure_cost (new_regs[0],
>>                                             regs_used, speed, call_p));

>>I'm not quite sure I understand the "why" of the heuristic you've added
>>here -- can you explain your reasoning further?

Aaron:

Here is an extract from the function estimate_reg_pressure_cost () where the changes are made:

  if (regs_needed <= available_regs)
    /* If we are close to running out of registers, try to preserve
       them.  */
    /* Case 1 */
    cost = target_reg_cost [speed] * regs_needed;
  else
    /* If we run out of registers, it is very expensive to add another
       one.  */
    /* Case 2 */
    cost = target_spill_cost [speed] * regs_needed;

If the first estimate_reg_pressure_cost call falls into the category of Case 1 or Case 2 and the second call falls into the same category for Case 1 or C
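To spell out the cancellation with hypothetical numbers (all costs made up for illustration; assume target_reg_cost = 2, target_spill_cost = 7, available_regs = 6, target_res_regs = 3, new_regs[0] = 0, regs_needed[0] = 2):

  size_cost = estimate_reg_pressure_cost (0 + 2, regs_used, ...)
            - estimate_reg_pressure_cost (0, regs_used, ...)

With regs_used = 4, both calls land in Case 1:

  patched:   2 * (2 + 4) - 2 * (0 + 4) = 12 - 8 = 4
  unpatched: 2 * 2       - 2 * 0       = 4

so the regs_used term cancels and the patch changes nothing. With regs_used = 5, the two calls straddle the boundary (2 + 5 = 7 > 6):

  patched:   7 * (2 + 5) - 2 * (0 + 5) = 49 - 10 = 39
  unpatched: 7 * 2       - 2 * 0       = 14

So the patch only changes the outcome when the added registers push the loop across the register/spill boundary, which is the Case 1 vs Case 2 distinction described above.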
[RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS.
Proposed changes:

Changes are made in loop invariant motion (LICM) at the RTL level and also in the induction variable optimization (IVOPTS) based on the SSA representation.

The logic used in LICM for the registers used inside loops is changed. The live-out set of the loop latch node and the live-in sets of the destinations of the loop exit edges are used to form the loop's liveness at the exit of the loop. The registers used is the number of live variables at the exit of the loop, calculated as above.

For induction variable optimization on the tree SSA representation, the registers-used logic is based on the number of phi nodes at the loop header, representing the liveness in the loop. The current logic uses only the number of phi nodes at the loop header. I have made changes so that the phi operands are also represented as live in the loop; thus the number of phi operands also increments the number of registers used (see the GIMPLE sketch below).

Performance runs:

Bootstrapping with i386 goes through fine. The SPEC CPU 2000 benchmarks were run, and the following performance numbers and code size for the i386 target are seen.

Ratio with the above optimization changes vs without them for INT benchmarks: 3785.261 vs 3783.064.
Ratio with the above optimization changes vs without them for FP benchmarks: 4676.763189 vs 4676.072428.

Code size reduction for INT benchmarks: 2324 instructions.
Code size reduction for FP benchmarks: 1283 instructions.

For the MicroBlaze target the Mibench and EEMBC benchmarks were run, and the following improvements are seen:

qos_lite (5.3%), consumer_jpeg_c (1.34%), security_rijndael_d (1.8%), security_rijndael_e (1.4%)

Code size reduction for Mibench: 16164 instructions.
Code size reduction for EEMBC: 98 instructions.

Patch ChangeLog:

[RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS.

Changes are made in loop invariant motion (LICM) at the RTL level and also in the induction variable optimization based on the SSA representation: the live-out of the loop latch node and the live-in of the destinations of the exit edges form the loop's liveness at its exits, and the number of live variables there gives the registers used; for IVOPTS, the phi operands at the loop header are counted as live in addition to the phi results.

ChangeLog:
2015-10-09  Ajit Agarwal

	* loop-invariant.c (compute_loop_liveness): New.
	(determine_regs_used): New.
	(find_invariants_to_move): Use of determine_regs_used.
	* tree-ssa-loop-ivopts.c (determine_set_costs): Consider the phi
	arguments for register used.

Signed-off-by: Ajit Agarwal ajit...@xilinx.com

Thanks & Regards
Ajit

0001-RFC-Patch-Optimized-changes-in-the-register-used-ins.patch
Description: 0001-RFC-Patch-Optimized-changes-in-the-register-used-ins.patch
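A sketch in GIMPLE form of the IVOPTS side of the change (the SSA names are hypothetical):

/* At the loop header:

     # i_1 = PHI <i_0 (preheader), i_2 (latch)>

   The existing code counts only the PHI result i_1 toward the
   registers used.  With this change the PHI arguments i_0 and i_2
   are counted as well, since they also represent values live around
   the loop.  */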
RE: [RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS.
-----Original Message-----
From: Bin.Cheng [mailto:amker.ch...@gmail.com]
Sent: Thursday, October 08, 2015 10:29 AM
To: Ajit Kumar Agarwal
Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS.

On Thu, Oct 8, 2015 at 12:32 PM, Ajit Kumar Agarwal wrote:
> Following Proposed:
>
> Changes are done in the Loop Invariant (LICM) at RTL level and also the
> induction variable optimization based on SSA representation.
> The current logic used in LICM for register used inside the loops is
> changed. The live-out of the loop latch node and the live-in of the
> destination of the exit edges is used to set the loop's liveness at the
> exit of the loop. The register used is the number of live variables at
> the exit of the loop calculated above.
>
> For induction variable optimization on tree SSA representation, the
> register used logic is based on the number of phi nodes at the loop
> header to represent the liveness at the loop. Current logic used only
> the number of phi nodes at the loop header. I have made changes to
> represent the phi operands also live at the loop. Thus the number of
> phi operands also gets incremented in the number of registers used.

Hi,
>>For the GIMPLE IVO part, I don't think the change is reasonable enough.
>>IMHO, IVO fails to restrict iv number in some complex cases, and your
>>change tries to rectify that by increasing register pressure irrespective
>>of out-of-ssa and coalescing.  I think the original code models
>>reg-pressure better; what needs to be changed is how we compute cost from
>>register pressure and use that to restrict iv number.

Considering the liveness with respect to all the phi arguments will not increase the register pressure. It improves the heuristics for restricting the IVs that increase the register pressure. The cost model uses regs_used, and modelling the liveness with respect to the phi arguments measures register pressure better. The number of phi nodes in the loop header is not the only criterion for regs_used; the liveness with respect to the loop should be the criterion to measure appropriate register pressure.

Thanks & Regards
Ajit

>>As for the specific function determine_set_costs, I think one change is
>>necessary: to rule out all floating point phi nodes, because they do not
>>have an impact on IVO register pressure.  Actually this change would
>>further reduce register pressure for fp related cases.

Thanks,
bin

> Performance runs:
>
> Bootstrapping with i386 goes through fine. The spec cpu 2000 benchmarks
> is run and following performance runs and the code size for i386 target
> seen.
>
> Ratio with the above optimization changes vs ratio without above
> optimizations for INT benchmarks (3785.261 vs 3783.064).
> Ratio with the above optimization changes vs ratio without above
> optimization for FP benchmarks (4676.763189 vs 4676.072428).
>
> Code size reduction for INT benchmarks: 2324 instructions.
> Code size reduction for FP benchmarks: 1283 instructions.
>
> For Microblaze target the Mibench and EEMBC benchmarks is run and the
> following improvements is seen.
>
> (qos_lite(5.3%), consumer_jpeg_c(1.34%), security_rijndael_d(1.8%),
> security_rijndael_e(1.4%))
>
> Code Size reduction for Mibench = 16164 instructions.
> Code Size reduction for EEMBC = 98 instructions.
>
> ChangeLog:
> 2015-10-09  Ajit Agarwal
>
> 	* loop-invariant.c (compute_loop_liveness): New.
> 	(determine_regs_used): New.
> 	(find_invariants_to_move): Use of determine_regs_used.
> 	* tree-ssa-loop-ivopts.c (determine_set_costs): Consider the phi
> 	arguments for register used.
>
> Signed-off-by: Ajit Agarwal ajit...@xilinx.com
>
> Thanks & Regards
> Ajit
RE: [RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS.
-----Original Message-----
From: Bin.Cheng [mailto:amker.ch...@gmail.com]
Sent: Friday, October 09, 2015 8:15 AM
To: Ajit Kumar Agarwal
Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS.

On Thu, Oct 8, 2015 at 1:53 PM, Ajit Kumar Agarwal wrote:
>
> -----Original Message-----
> From: Bin.Cheng [mailto:amker.ch...@gmail.com]
> Sent: Thursday, October 08, 2015 10:29 AM
> To: Ajit Kumar Agarwal
> Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli
> Hunsigida; Nagaraju Mekala
> Subject: Re: [RFC, Patch]: Optimized changes in the register used inside
> loop for LICM and IVOPTS.
>
> On Thu, Oct 8, 2015 at 12:32 PM, Ajit Kumar Agarwal wrote:
>> Following Proposed:
>>
>> Changes are done in the Loop Invariant (LICM) at RTL level and also the
>> induction variable optimization based on SSA representation.
>> The current logic used in LICM for register used inside the loops is
>> changed. The live-out of the loop latch node and the live-in of the
>> destination of the exit edges is used to set the loop's liveness at the
>> exit of the loop. The register used is the number of live variables at
>> the exit of the loop calculated above.
>>
>> For induction variable optimization on tree SSA representation, the
>> register used logic is based on the number of phi nodes at the loop
>> header to represent the liveness at the loop. Current logic used only
>> the number of phi nodes at the loop header. I have made changes to
>> represent the phi operands also live at the loop. Thus the number of
>> phi operands also gets incremented in the number of registers used.
>
> Hi,
>>>For the GIMPLE IVO part, I don't think the change is reasonable enough.
>>>IMHO, IVO fails to restrict iv number in some complex cases, and your
>>>change tries to rectify that by increasing register pressure
>>>irrespective of out-of-ssa and coalescing.  I think the original code
>>>models reg-pressure better; what needs to be changed is how we compute
>>>cost from register pressure and use that to restrict iv number.
>
> Considering the liveness with respect to all the phi arguments will
> not increase the register pressure. It improves the heuristics for
> restricting the IVs that increase the register pressure. The cost
> model uses regs_used, and modelling the liveness with respect to the
> phi arguments measures register pressure better.

>>I think register pressure is increased along with regs_needed; it doesn't
>>matter if it will be canceled in estimate_reg_pressure_cost for both ends
>>of the cost comparison.

>>I agree the IV number should be controlled for some cases, but not by
>>increasing `n' using the phi argument number unconditionally.
>>Considering summary reduction as an example, most likely the ssa names
>>will be coalesced and held in a single register.  Furthermore, there is
>>no reason to count phi node/arg number for floating point phi nodes.

> Number of phi nodes in the loop header is not the only criterion for
> regs_used; the liveness with respect to the loop should be the criterion
> to measure appropriate register pressure.

>>IMHO, it's hard to accurately track liveness info on SSA (PHI), because
>>of coalescing etc.  So could you give some examples/proof for this?

I agree with you that it is hard to predict the exact mapping from SSA to the actual register allocation due to coalescing and out-of-SSA.

The interference of phi arguments and results is an important criterion for register pressure on SSA. In conventional SSA the phi arguments don't interfere, but most current compilers don't maintain conventional SSA: copy propagation of SSA names can make the phi arguments interfere, giving non-conventional SSA (a sketch is given below). Because of this non-conventional nature of SSA, the interfering phi arguments should be considered in the registers used. I interpret the registers used as the number of interfering live ranges, which is what leads to an increase or decrease in register pressure.

On top of the above, out-of-SSA (SSA name coalescing) for conventional SSA is quite simple: each phi node is assigned a new variable, the defs and uses are replaced with the new variable (making the case for assigning a single register), and then the corresponding phi node is removed. But for non-conventional SSA, out-of-SSA first makes the SSA conventional by inserting a copy in each of the predecessor nodes, assigning it to a new va
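To illustrate the non-conventional SSA point, here is a sketched GIMPLE example (hypothetical SSA names) where copy propagation makes phi arguments interfere:

/* Before copy propagation (conventional SSA; a_1 is a copy of x_0):

     B1:  a_1 = x_0;
          goto B3;
     B2:  b_2 = ...;
          goto B3;
     B3:  # t_3 = PHI <a_1 (B1), b_2 (B2)>
          use (t_3);
          use (x_0);

   After copy propagation replaces a_1 with x_0:

     B3:  # t_3 = PHI <x_0 (B1), b_2 (B2)>

   Now x_0 is live across B2 as well (it is used after the PHI), so the
   PHI arguments x_0 and b_2 interfere and cannot be trivially coalesced
   into one register; out-of-SSA must insert a copy on one incoming
   edge.  */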
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
Hello Jeff:

Did you get a chance to look at the below response? Please let me know your opinion on the below.

Thanks & Regards
Ajit

-----Original Message-----
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Ajit Kumar Agarwal
Sent: Saturday, September 12, 2015 4:09 PM
To: Jeff Law; Richard Biener
Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

-----Original Message-----
From: Jeff Law [mailto:l...@redhat.com]
Sent: Thursday, September 10, 2015 3:10 AM
To: Ajit Kumar Agarwal; Richard Biener
Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

On 08/26/2015 11:29 PM, Ajit Kumar Agarwal wrote:
>
> Thanks. The following testcase testsuite/gcc.dg/tree-ssa/ifc-5.c
>
> void
> dct_unquantize_h263_inter_c (short *block, int n, int qscale, int nCoeffs)
> {
>   int i, level, qmul, qadd;
>
>   qadd = (qscale - 1) | 1;
>   qmul = qscale << 1;
>
>   for (i = 0; i <= nCoeffs; i++)
>     {
>       level = block[i];
>       if (level < 0)
>         level = level * qmul - qadd;
>       else
>         level = level * qmul + qadd;
>       block[i] = level;
>     }
> }
>
> The above loop is a candidate for path splitting, as the IF block merges
> at the latch of the loop, and path splitting duplicates the latch of the
> loop -- the statement block[i] = level -- into the predecessor THEN and
> ELSE blocks.
>
> Due to the above path splitting, if-conversion is disabled, the above
> IF-THEN-ELSE is not if-converted, and the test case fails.

>>So I think the question then becomes which of the two styles generally
>>results in better code?  The path-split version or the older
>>if-converted version.

>>If the latter, then this may suggest that we've got the path splitting
>>code at the wrong stage in the optimizer pipeline, or that we need better
>>heuristics for when to avoid applying path splitting.

The code generated by path splitting is useful when it exposes DCE, PRE and CCP candidates, whereas if-conversion is useful when it exposes vectorization candidates. If if-conversion doesn't expose vectorization and path splitting doesn't expose DCE or PRE redundancy candidates, it's hard to predict. If if-conversion does not expose vectorization and, in a similar case, path splitting exposes DCE, PRE and CCP redundancy candidates, then path splitting is useful. Path splitting also increases the granularity of the THEN and ELSE paths, which makes for better register allocation and code scheduling.

The suggestion of keeping path splitting later in the pipeline, after if-conversion and vectorization, is useful, as it doesn't break the existing DejaGnu tests. It is also useful because path splitting can still duplicate the merge node into its predecessors after the if-conversion and vectorization passes when those are not applicable to the loops. But it suppresses the CCP and PRE candidates that appear earlier in the optimization pipeline.

>
> There were the following review comments on the above patch.
>
> +/* This function performs the feasibility tests for path splitting
>> +   to perform.  Return false if the feasibility for path splitting
>> +   is not done and returns true if the feasibility for path splitting
>> +   is done.  Following feasibility tests are performed.
>> +
>> +   1. Return false if the join block has rhs casting for assign
>> +      gimple statements.

> Comments from Jeff:
>
>>> These seem totally arbitrary.  What's the reason behind each of
>>> these restrictions?  None should be a correctness requirement AFAICT.
>
> In the above patch I have made the check given in point 1 in the loop
> latch: path splitting is disabled, the if-conversion happens, and the
> test case passes.

>>That sounds more like a work-around/hack.  There's nothing inherent with
>>a type conversion that should disable path splitting.

I have sent the patch with this change and I will remove the above check from the patch.

>>What happens if we delay path splitting to a point after if-conversion is
>>complete?

This is the better suggestion, as explained above, but the gains achieved through path splitting by keeping it earlier in the pipeline -- before if-conversion, tree vectorization and tree-vrp -- are suppressed if the optimizations following path splitting are not applicable to the given loops. I have made the above changes and
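For reference, path splitting transforms the quoted loop roughly as follows at the source level (a sketch, not compiler output): the latch statement is duplicated into both arms, and the arms become the new latches.

  for (i = 0; i <= nCoeffs; i++)
    {
      level = block[i];
      if (level < 0)
        {
          level = level * qmul - qadd;
          block[i] = level;   /* duplicated latch, THEN arm */
        }
      else
        {
          level = level * qmul + qadd;
          block[i] = level;   /* duplicated latch, ELSE arm */
        }
    }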
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
-----Original Message-----
From: Jeff Law [mailto:l...@redhat.com]
Sent: Friday, November 13, 2015 3:28 AM
To: Richard Biener
Cc: Ajit Kumar Agarwal; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

On 11/12/2015 11:32 AM, Jeff Law wrote:
> On 11/12/2015 10:05 AM, Jeff Law wrote:
>>> But IIRC you mentioned it should enable vectorization or so?  In
>>> this case that's obviously too late.
>> The opposite.  Path splitting interferes with if-conversion &
>> vectorization.  Path splitting mucks up the CFG enough that
>> if-conversion won't fire and as a result vectorization is inhibited.
>> It also creates multi-latch loops, which isn't a great situation either.
>>
>> It *may* be the case that dropping it that far down in the pipeline
>> and making the modifications necessary to handle simple latches may
>> in turn make the path splitting code play better with if-conversion
>> and vectorization and avoid creation of multi-latch loops.  At least
>> that's how it looks on paper when I draw out the CFG manipulations.
>>
>> I'll do some experiments.
> It doesn't look too terrible to revamp the recognition code to work
> later in the pipeline with simple latches.  Sadly that doesn't seem to
> have fixed the bad interactions with if-conversion.
>
> *But* that does open up the possibility of moving the path splitting
> pass even deeper in the pipeline -- in particular we can move it past
> the vectorizer.  Which may be a win.
>
> So the big question is whether or not we'll still see enough benefits
> from having it so late in the pipeline.  It's still early enough that
> we get DOM, VRP, reassoc, forwprop, phiopt, etc.
>
> Ajit, I'll pass along an updated patch after doing some more testing.

Hello Jeff:

>>So here's what I'm working with.  It runs after the vectorizer now.

>>Ajit, if you could benchmark this it would be greatly appreciated.  I
>>know you saw significant improvements on one or more benchmarks in the
>>past.  It'd be good to know that the updated placement of the pass
>>doesn't invalidate the gains you saw.

>>With the updated pass placement, we don't have to worry about switching
>>the pass on/off based on whether or not the vectorizer & if-conversion
>>are enabled.  So that hackery is gone.

>>I think I've beefed up the test to identify the diamond patterns we want
>>so that it's stricter in what we accept.  The call to ignore_bb_p is a
>>part of that test so that we're actually looking at the right block in a
>>world where we're doing this transformation with simple latches.

>>I've also put a graphical comment before perform_path_splitting which
>>hopefully shows the CFG transformation we're making a bit clearer.

>>This bootstraps and regression tests cleanly on x86_64-linux-gnu.

Thank you for the inputs. I will build the compiler and run the SPEC CPU 2000 benchmarks for the x86 target and respond as soon as the run is done. I will also run the EEMBC/Mibench benchmarks for the MicroBlaze target, and will let you know the results at the earliest.

Thanks & Regards
Ajit
RE: [RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS.
-----Original Message-----
From: Jeff Law [mailto:l...@redhat.com]
Sent: Friday, November 13, 2015 11:44 AM
To: Ajit Kumar Agarwal; GCC Patches
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS.

On 10/07/2015 10:32 PM, Ajit Kumar Agarwal wrote:
>
> 0001-RFC-Patch-Optimized-changes-in-the-register-used-ins.patch
>
> From f164fd80953f3cffd96a492c8424c83290cd43cc Mon Sep 17 00:00:00 2001
> From: Ajit Kumar Agarwal
> Date: Wed, 7 Oct 2015 20:50:40 +0200
> Subject: [PATCH] [RFC, Patch]: Optimized changes in the register used
>  inside loop for LICM and IVOPTS.
>
> Changes are done in the Loop Invariant (LICM) at RTL level and also the
> induction variable optimization based on SSA representation. The
> current logic used in LICM for register used inside the loops is
> changed. The Live Out of the loop latch node and the Live in of the
> destination of the exit nodes is used to set the Loops Liveness at the
> exit of the Loop. The register used is the number of live variables at
> the exit of the Loop calculated above.
>
> For Induction variable optimization on tree SSA representation, the
> register used logic is based on the number of phi nodes at the loop
> header to represent the liveness at the loop.  Current Logic used only
> the number of phi nodes at the loop header.  Changes are made to
> represent the phi operands also live at the loop.  Thus number of phi
> operands also gets incremented in the number of registers used.
>
> ChangeLog:
> 2015-10-09  Ajit Agarwal
>
> 	* loop-invariant.c (compute_loop_liveness): New.
> 	(determine_regs_used): New.
> 	(find_invariants_to_move): Use of determine_regs_used.
> 	* tree-ssa-loop-ivopts.c (determine_set_costs): Consider the phi
> 	arguments for register used.

>>I think Bin rejected the tree-ssa-loop-ivopts change.  However, the
>>loop-invariant change is still pending, right?

> Signed-off-by: Ajit Agarwal ajit...@xilinx.com
> ---
>  gcc/loop-invariant.c       | 72 +-
>  gcc/tree-ssa-loop-ivopts.c |  4 +--
>  2 files changed, 60 insertions(+), 16 deletions(-)
>
> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
> index 52c8ae8..e4291c9 100644
> --- a/gcc/loop-invariant.c
> +++ b/gcc/loop-invariant.c
> @@ -1413,6 +1413,19 @@ set_move_mark (unsigned invno, int gain)
>      }
>  }
>
> +static int
> +determine_regs_used ()
> +{
> +  unsigned int j;
> +  unsigned int reg_used = 2;
> +  bitmap_iterator bi;
> +
> +  EXECUTE_IF_SET_IN_BITMAP (&LOOP_DATA (curr_loop)->regs_live, 0, j, bi)
> +    (reg_used)++;
> +
> +  return reg_used;
> +}

>>Isn't this just bitmap_count_bits (regs_live) + 2?

> @@ -2055,9 +2057,43 @@ calculate_loop_reg_pressure (void)
>      }
>  }
>
> -
> +static void
> +calculate_loop_liveness (void)

>>Needs a function comment.

I will incorporate the above comments.

> +{
> +  basic_block bb;
> +  struct loop *loop;
>
> -/* Move the invariants out of the loops.  */
> +  FOR_EACH_LOOP (loop, 0)
> +    if (loop->aux == NULL)
> +      {
> +        loop->aux = xcalloc (1, sizeof (struct loop_data));
> +        bitmap_initialize (&LOOP_DATA (loop)->regs_live, &reg_obstack);
> +      }
> +
> +  FOR_EACH_BB_FN (bb, cfun)

>>Why loop over blocks here?  Why not just iterate through all the loops
>>in the loop structure.  Order isn't particularly important AFAICT for
>>this code.

Iterating over the loop structure is enough. We don't need to iterate over the basic blocks.

> +    {
> +      int i;
> +      edge e;
> +      vec<edge> edges;
> +      edges = get_loop_exit_edges (loop);
> +      FOR_EACH_VEC_ELT (edges, i, e)
> +        {
> +          bitmap_ior_into (&LOOP_DATA (loop)->regs_live, DF_LR_OUT(e->src));
> +          bitmap_ior_into (&LOOP_DATA (loop)->regs_live, DF_LR_IN(e->dest));

>>Space before the open-paren in the previous two lines:
>>DF_LR_OUT (e->src) and DF_LR_IN (e->dest).

I will incorporate this.

> +        }
> +    }
> +}
> +
> +/* Move the invariants ut of the loops.  */

>>Looks like you introduced a typo.

>>I'd like to see testcases which show the change in # regs used
>>computation helping generate better code.

We need to measure the test case with the scenario where the new variable created for the loop invariant increases the register pressure and the cost with respect to reg_used and new_regs increases, leading to spill and fetch, and drop the i
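Jeff's simplification, sketched against the loop-invariant.c internals quoted above (LOOP_DATA, curr_loop, and GCC's bitmap API are assumed from that context):

static int
determine_regs_used (void)
{
  /* Count the registers live at the loop exits; the "+ 2" keeps the
     original function's starting value.  */
  return bitmap_count_bits (&LOOP_DATA (curr_loop)->regs_live) + 2;
}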
RE: [RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS.
Sorry I missed out some of the points in earlier mail which is given below. -Original Message- From: Ajit Kumar Agarwal Sent: Monday, November 16, 2015 11:07 PM To: 'Jeff Law'; GCC Patches Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: RE: [RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS. -Original Message- From: Jeff Law [mailto:l...@redhat.com] Sent: Friday, November 13, 2015 11:44 AM To: Ajit Kumar Agarwal; GCC Patches Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS. On 10/07/2015 10:32 PM, Ajit Kumar Agarwal wrote: > > 0001-RFC-Patch-Optimized-changes-in-the-register-used-ins.patch > > > From f164fd80953f3cffd96a492c8424c83290cd43cc Mon Sep 17 00:00:00 > 2001 > From: Ajit Kumar Agarwal > Date: Wed, 7 Oct 2015 20:50:40 +0200 > Subject: [PATCH] [RFC, Patch]: Optimized changes in the register used inside > loop for LICM and IVOPTS. > > Changes are done in the Loop Invariant(LICM) at RTL level and also the > Induction variable optimization based on SSA representation. The > current logic used in LICM for register used inside the loops is > changed. The Live Out of the loop latch node and the Live in of the > destination of the exit nodes is used to set the Loops Liveness at the exit > of the Loop. > The register used is the number of live variables at the exit of the > Loop calculated above. > > For Induction variable optimization on tree SSA representation, the > register used logic is based on the number of phi nodes at the loop > header to represent the liveness at the loop. Current Logic used only > the number of phi nodes at the loop header. Changes are made to > represent the phi operands also live at the loop. Thus number of phi > operands also gets incremented in the number of registers used. > > ChangeLog: > 2015-10-09 Ajit Agarwal > > * loop-invariant.c (compute_loop_liveness): New. > (determine_regs_used): New. > (find_invariants_to_move): Use of determine_regs_used. > * tree-ssa-loop-ivopts.c (determine_set_costs): Consider the phi > arguments for register used. >>I think Bin rejected the tree-ssa-loop-ivopts change. However, the >>loop-invariant change is still pending, right? > > Signed-off-by:Ajit agarwalajit...@xilinx.com > --- > gcc/loop-invariant.c | 72 > +- > gcc/tree-ssa-loop-ivopts.c | 4 +-- > 2 files changed, 60 insertions(+), 16 deletions(-) > > diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c index > 52c8ae8..e4291c9 100644 > --- a/gcc/loop-invariant.c > +++ b/gcc/loop-invariant.c > @@ -1413,6 +1413,19 @@ set_move_mark (unsigned invno, int gain) > } > } > > +static int > +determine_regs_used() > +{ > + unsigned int j; > + unsigned int reg_used = 2; > + bitmap_iterator bi; > + > + EXECUTE_IF_SET_IN_BITMAP (&LOOP_DATA (curr_loop)->regs_live, 0, j, bi) > +(reg_used) ++; > + > + return reg_used; > +} >>Isn't this just bitmap_count_bits (regs_live) + 2? > @@ -2055,9 +2057,43 @@ calculate_loop_reg_pressure (void) > } > } > > - > +static void > +calculate_loop_liveness (void) >>Needs a function comment. I will incorporate the above comments. > +{ > + basic_block bb; > + struct loop *loop; > > -/* Move the invariants out of the loops. 
*/ > + FOR_EACH_LOOP (loop, 0) +if (loop->aux == NULL) + { +loop->aux = xcalloc (1, sizeof (struct loop_data)); +bitmap_initialize (&LOOP_DATA (loop)->regs_live, &reg_obstack); + } + + FOR_EACH_BB_FN (bb, cfun) >>Why loop over blocks here? Why not just iterate through all the loops >>in the loop structure. Order isn't particularly important AFAICT for >>this code. Iterating over the loop structure is enough. We don't need to iterate over the basic blocks. > + { > + int i; > + edge e; > + vec<edge> edges; > + edges = get_loop_exit_edges (loop); > + FOR_EACH_VEC_ELT (edges, i, e) > + { > + bitmap_ior_into (&LOOP_DATA (loop)->regs_live, DF_LR_OUT(e->src)); > + bitmap_ior_into (&LOOP_DATA (loop)->regs_live, > + DF_LR_IN(e->dest)); >>Space before the open-paren in the previous two lines DF_LR_OUT >>(e->src) and DF_LR_IN (e->dest) I will incorporate this. > + } > + } > + } +} + > +/* Move the invariants ut of the loops. */ >>Looks lik
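Jeff's bitmap_count_bits remark above boils down to a one-line replacement; a minimal sketch, assuming the LOOP_DATA/curr_loop bookkeeping of loop-invariant.c (the later revision of this patch adopts exactly this form):

/* Number of registers the current loop uses: the bits live at the
   loop boundary plus an initial bound of 2 standing for induction
   variables etc. that are not detected.  */
static int
determine_regs_used (void)
{
  return bitmap_count_bits (&LOOP_DATA (curr_loop)->regs_live) + 2;
}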
RE: [Patch,testsuite]: Fix the tree-ssa/split-path-1.c testcase
Hello Jeff: Please ignore my previous mails as they bounced back. Sorry for that. I have fixed the problem with the testcase. The path splitting optimization remains intact. Attached is the patch. The problem was in the testcase itself: the loop bound went beyond the end of the malloced array. There was also a problem with accessing the elements of EritePtr. ChangeLog: 2015-11-18 Ajit Agarwal * gcc.dg/tree-ssa/split-path-1.c: Fix the testcase. Signed-off-by:Ajit Agarwal ajit...@xilinx.com Thanks & Regards Ajit testcase.patch Description: testcase.patch
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
-Original Message- From: Tom de Vries [mailto:tom_devr...@mentor.com] Sent: Wednesday, November 18, 2015 1:14 PM To: Jeff Law; Richard Biener Cc: Ajit Kumar Agarwal; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation On 14/11/15 00:35, Jeff Law wrote: > Anyway, bootstrapped and regression tested on x86_64-linux-gnu. > Installed on the trunk. > [Patch,tree-optimization]: Add new path Splitting pass on tree ssa > representation > > * Makefile.in (OBJS): Add gimple-ssa-split-paths.o > * common.opt (-fsplit-paths): New flag controlling path splitting. > * doc/invoke.texi (fsplit-paths): Document. > * opts.c (default_options_table): Add -fsplit-paths to -O2. > * passes.def: Add split_paths pass. > * timevar.def (TV_SPLIT_PATHS): New timevar. > * tracer.c: Include "tracer.h" > (ignore_bb_p): No longer static. > (transform_duplicate): New function, broken out of tail_duplicate. > (tail_duplicate): Use transform_duplicate. > * tracer.h (ignore_bb_p): Declare > (transform_duplicate): Likewise. > * tree-pass.h (make_pass_split_paths): Declare. > * gimple-ssa-split-paths.c: New file. > > * gcc.dg/tree-ssa/split-path-1.c: New test. >>I've filed PR68402 - FAIL: gcc.dg/tree-ssa/split-path-1.c execution test with >>-m32. I have fixed the above PR and the patch is submitted. https://gcc.gnu.org/ml/gcc-patches/2015-11/msg02217.html Thanks & Regards Ajit Thanks, - Tom
RE: [RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS.
Hello Jeff: -Original Message- From: Jeff Law [mailto:l...@redhat.com] Sent: Tuesday, November 17, 2015 4:30 AM To: Ajit Kumar Agarwal; GCC Patches Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS. On 11/16/2015 10:36 AM, Ajit Kumar Agarwal wrote: >> For Induction variable optimization on tree SSA representation, the >> register used logic is based on the number of phi nodes at the loop >> header to represent the liveness at the loop. Current Logic used >> only the number of phi nodes at the loop header. Changes are made to >> represent the phi operands also live at the loop. Thus number of phi >> operands also gets incremented in the number of registers used. >>But my question is why is the # of PHI operands useful here. You'd have a >>stronger argument if it was the number of unique operands in each PHI. >> While I don't doubt this patch improved things, I think it's just putting a >> band-aid over the problem. >>I think anything that just looks at PHIs or at register liveness at loop >>boundaries is inherently underestimating the register pressure implications >>of code motion from inside to outside a >>loop. >>If an object is pulled out of the loop, then it's going to conflict with >>nearly every object that births in the loop (because the object being moved >>now has to live throughout the loop). >>There's exceptions, but I doubt they >>matter in practice. The object is also going to conflict with anything else >>that is live through the loop. At least that's how it seems to me at first >>thought. >>So build the set of objects (SSA_NAMEs) that either birth or are live through >>the loop that have the same type class as the object we want to hoist out of >>the loop (scalar, floating point, >>vector). Use that set of objects to >>estimate register pressure. I agree with the above. To add to it, we only need to calculate the set of objects (SSA_NAMEs) that are live at the birth or the header of the loop. We don't need to compute liveness through the loop by considering the live-in and live-out of every basic block of the loop, because the set of objects (SSA_NAMEs) that are live-in at the birth or header of the loop will be live-in at every node in the loop. If v is live-out at the header of the loop then the variable is live-in at every node in the loop. To prove this, consider a loop L with header h such that the variable v defined at d is live-in at h. Since v is live at h, d is not part of L. This follows from the dominance property, i.e. h is strictly dominated by d. Furthermore, there exists a path from h to a use of v which does not go through d. For every node p of the loop, since the loop is a strongly connected component of the CFG, there exists a path, consisting only of nodes of L, from p to h. Concatenating those two paths proves that v is live-in and live-out of p. On top of the live-in at the birth or header of the loop as proven above, we can also calculate the live-out of the exit blocks of the loop and the live-in at the destinations of the exit edges. This accounts for the liveness just outside of the loop. These two cases form the basis of a better estimator for register pressure as far as LICM is concerned. If you agree with the above, I will add this to the patch for the regs_used estimates, for a better estimate of register pressure for LICM.
Thanks & Regards Ajit >>It won't be exact because some of those objects could end up coloring the >>same. BUt it's probably still considerably more accurate than what we have >>now. >>I suspect that would be a better estimator for register pressure as far as >>LICM is concerned. jeff
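For concreteness, the estimate agreed on above could look like the following at the RTL level; a hedged sketch, assuming the LOOP_DATA bookkeeping of loop-invariant.c, with record_loop_liveness a hypothetical name (the patch posted later in this thread follows this shape):

/* Record the registers live at LOOP's header -- live-in there is live
   at every node of the loop, per the argument above -- plus the
   liveness across the loop's exit edges.  */
static void
record_loop_liveness (struct loop *loop)
{
  unsigned i;
  edge e;
  vec<edge> edges = get_loop_exit_edges (loop);

  bitmap_ior_into (&LOOP_DATA (loop)->regs_live, DF_LR_IN (loop->header));
  FOR_EACH_VEC_ELT (edges, i, e)
    {
      bitmap_ior_into (&LOOP_DATA (loop)->regs_live, DF_LR_OUT (e->src));
      bitmap_ior_into (&LOOP_DATA (loop)->regs_live, DF_LR_IN (e->dest));
    }
  edges.release ();
}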
[Patch,SLP]: Correction in the comment for SLP vectorization profitable case.
This patch corrects the comment for the SLP vectorization profitability case. The comment contradicted the condition vec_outside_cost + vec_inside_cost > scalar_cost. ChangeLog: 2015-11-30 Ajit Agarwal * tree-vect-slp.c (vect_bb_vectorization_profitable_p): Correction in the comment. Signed-off-by:Ajit Agarwal ajit...@xilinx.com Thanks & Regards Ajit comment-slp.patch Description: comment-slp.patch
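The check the comment has to describe is, in essence, the following guard; a sketch against vect_bb_vectorization_profitable_p, with the cost variables assumed from that function:

/* Vectorizing the basic block is profitable only if the total vector
   cost stays below the scalar cost.  */
if (vec_outside_cost + vec_inside_cost > scalar_cost)
  return false;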
[Patch,microblaze]: Instruction prefetch optimization for microblaze.
This patch implements the instruction prefetch optimization for Microblaze. Reg tested for the Microblaze target. The "wic" Microblaze instruction is the instruction prefetch instruction. The optimization generates the iprefetch instruction on the fall-through path at call sites. It is enabled with the Microblaze target flag mxl-prefetch. The flag is needed because selection of the "wic" instruction has to be enabled in the reconfigurable design and is not enabled by default. ChangeLog: 2015-12-01 Ajit Agarwal * config/microblaze/microblaze.c (get_branch_target): New. (insert_wic_for_ilb_runout): New. (insert_wic): New. (microblaze_machine_dependent_reorg): New. (TARGET_MACHINE_DEPENDENT_REORG): Define macro. * config/microblaze/microblaze.md (UNSPEC_IPREFETCH): Define. (iprefetch): New pattern. * config/microblaze/microblaze.opt (mxl-prefetch): New flag. Signed-off-by:Ajit Agarwal ajit...@xilinx.com Thanks & Regards Ajit iprefetch.patch Description: iprefetch.patch
RE: [Patch,microblaze]: Instruction prefetch optimization for microblaze.
Moreover this patch is tested and run on hardware with Mibench/EEMBC benchmarks for Microblaze target. The reconfigurable design is enabled with the selection of "wic" instruction prefetch instruction and above benchmarks compiled with -mxl-prefetch flags. Thanks & Regards Ajit -Original Message----- From: Ajit Kumar Agarwal Sent: Tuesday, December 01, 2015 2:19 PM To: GCC Patches Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: [Patch,microblaze]: Instruction prefetch optimization for microblaze. The changes are made in this patch for the instruction prefetch optimizations for Microblaze. Reg tested for Microblaze target. The changes are made for instruction prefetch optimizations for Microblaze. The "wic" microblaze instruction is the instruction prefetch instruction. The instruction prefetch optimization is done to generate the iprefetch instruction at the call site fall through path. This optimization is enabled with microblaze target flag mxl-prefetch. The purpose of adding the flags is that selection of "wic" instruction should be enabled in the reconfigurable design and the selection is not enabled by default. ChangeLog: 2015-12-01 Ajit Agarwal * config/microblaze/microblaze.c (get_branch_target): New. (insert_wic_for_ilb_runout): New. (insert_wic): New. (microblaze_machine_dependent_reorg): New. (TARGET_MACHINE_DEPENDENT_REORG): Define macro. * config/microblaze/microblaze.md (UNSPEC_IPREFETCH): Define. (iprefetch): New pattern * config/microblaze/microblaze.opt (mxl-prefetch): New flag. Signed-off-by:Ajit Agarwal ajit...@xilinx.com Thanks & Regards Ajit
RE: [Patch,microblaze]: Instruction prefetch optimization for microblaze.
-Original Message- From: Michael Eager [mailto:ea...@eagerm.com] Sent: Thursday, December 03, 2015 7:27 PM To: Ajit Kumar Agarwal; GCC Patches Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,microblaze]: Instruction prefetch optimization for microblaze. On 12/01/2015 12:49 AM, Ajit Kumar Agarwal wrote: > The changes are made in this patch for the instruction prefetch optimizations > for Microblaze. > > Reg tested for Microblaze target. > > The changes are made for instruction prefetch optimizations for > Microblaze. The "wic" microblaze instruction is the instruction > prefetch instruction. The instruction prefetch optimization is done to > generate the iprefetch instruction at the call site fall through path. > This optimization is enabled with microblaze target flag mxl-prefetch. The > purpose of adding the flags is that selection of "wic" instruction should be > enabled in the reconfigurable design and the selection is not enabled by > default. > > ChangeLog: > 2015-12-01 Ajit Agarwal > > * config/microblaze/microblaze.c > (get_branch_target): New. > (insert_wic_for_ilb_runout): New. > (insert_wic): New. > (microblaze_machine_dependent_reorg): New. > (TARGET_MACHINE_DEPENDENT_REORG): Define macro. > * config/microblaze/microblaze.md > (UNSPEC_IPREFETCH): Define. > (iprefetch): New pattern > * config/microblaze/microblaze.opt > (mxl-prefetch): New flag. > > Signed-off-by:Ajit Agarwal ajit...@xilinx.com > > > Thanks & Regards > Ajit > >>+ rtx_insn *insn, *before_4 = 0, *before_16 = 0; int addr = 0, length, >>+ first_addr = -1; int wic_addr0 = 128 * 4, wic_addr1 = 128 * 4; >>Especially when there are initializers, I prefer to see each variable >>declared on a separate line. If the meaning of a variable is not clear (and >>most of these are not), include a comment >>before the declaration. >>+if (first_addr == -1) >>+ first_addr = INSN_ADDRESSES (INSN_UID (insn)); >>Can be moved to initialize first_addr. >>+addr = INSN_ADDRESSES (INSN_UID (insn)) - first_addr; >>Is "addr" and address or offset? If the latter, use a more descriptive name. >>+if (before_4 == 0 && addr + length >= 4 * 4) >>+ before_4 = insn; ... >>Please add comments to describe what you are doing here. What are before_4 >>and before_16? What are all these conditions testing? >>+ loop_optimizer_finalize(); >>Space before parens. All the above comments are incorporated. Updated patch is attached. Regtested for Microblaze target. Mibench/EEMBC benchmarks are run on the hardware enabling the mxl-prefetch and the run goes through fine With the generation of "wic" instruction. [Patch,microblaze]: Instruction prefetch optimization for microblaze. The changes are made for instruction prefetch optimizations for Microblaze. The "wic" microblaze instruction is the instruction prefetch instruction. The instruction prefetch optimization is done to generate the iprefetch instruction at the call site fall through path. This optimization is enabled with microblaze target flag mxl-prefetch. The purpose of adding the flags is that selection of "wic" instruction should be enabled in the reconfigurable design and the selection is not enabled by default. ChangeLog: 2015-12-07 Ajit Agarwal * config/microblaze/microblaze.c (get_branch_target): New. (insert_wic_for_ilb_runout): New. (insert_wic): New. (microblaze_machine_dependent_reorg): New. (TARGET_MACHINE_DEPENDENT_REORG): Define macro. * config/microblaze/microblaze.md (UNSPEC_IPREFETCH): Define. 
(iprefetch): New pattern. * config/microblaze/microblaze.opt (mxl-prefetch): New flag. Signed-off-by:Ajit Agarwal ajit...@xilinx.com Thanks & Regards Ajit -- Michael Eager ea...@eagercon.com 1960 Park Blvd., Palo Alto, CA 94306 650-325-8077 iprefetch.patch Description: iprefetch.patch
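For reference, the declaration style Michael asks for in insert_wic_for_ilb_runout would look roughly as below. This is a sketch only; the roles given in the comments are inferred from the quoted fragments, not taken from the attached patch:

rtx_insn *insn;           /* Insn currently being scanned.  */
rtx_insn *before_4 = 0;   /* Insn seen before the 4-word boundary.  */
rtx_insn *before_16 = 0;  /* Insn seen before the 16-word boundary.  */
int addr = 0;             /* Offset of INSN from the first address.  */
int length;               /* Length of INSN.  */
int first_addr = -1;      /* Address of the first insn, -1 until seen.  */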
[Patch,rtl Optimization]: Better register pressure estimate for Loop Invariant Code Motion.
Based on the comments on the RFC patch, this patch incorporates all the comments from Jeff. Thanks Jeff for the valuable feedback. This patch enables a better register pressure estimate for loop invariant code motion. It calculates the loop liveness used for regs_used, which in turn feeds the register pressure estimate in the cost calculation.

Loop liveness is based on the following properties: we only need to calculate the set of objects that are live at the birth or the header of the loop. We don't need to compute liveness through the loop by considering the live-in and live-out of every basic block of the loop, because the set of objects that are live-in at the birth or header of the loop will be live-in at every node in the loop.

If v is live-out at the header of the loop then the variable is live-in at every node in the loop. To prove this, consider a loop L with header h such that the variable v defined at d is live-in at h. Since v is live at h, d is not part of L. This follows from the dominance property, i.e. h is strictly dominated by d. Furthermore, there exists a path from h to a use of v which does not go through d. For every node p of the loop, since the loop is a strongly connected component of the CFG, there exists a path, consisting only of nodes of L, from p to h. Concatenating those two paths proves that v is live-in and live-out of p.

Also calculate the live-out and live-in across the exit edges of the loop. This considers liveness not only at the loop latch but also just outside the loop.

Bootstrapped and regtested for the x86_64 and microblaze targets.

There is one extra failure for the testcase gcc.dg/loop-invariant.c with this change, and it looks correct to me. This is because available_regs = 6, regs_needed = 1, new_regs = 0 and regs_used = 10. As regs_used, based on the liveness described above, is greater than available_regs, the invariant becomes a spill candidate and the register pressure estimate computes a spill cost. This spill cost is greater than inv_cost, so the gain comes out negative, which disables the loop invariant motion for the above testcase. Disabling the loop invariant motion for the testcase loop-invariant.c with this patch looks correct to me, assuming the calculation of available_regs in cfgloopanal.c is correct.

ChangeLog:
2015-12-09  Ajit Agarwal

	* loop-invariant.c (calculate_loop_liveness): New.
	(move_loop_invariants): Use of calculate_loop_liveness.

Signed-off-by:Ajit Agarwal ajit...@xilinx.com

---
 gcc/loop-invariant.c | 77 ++
 1 files changed, 65 insertions(+), 12 deletions(-)

diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
index 53d1377..ac08594 100644
--- a/gcc/loop-invariant.c
+++ b/gcc/loop-invariant.c
@@ -1464,18 +1464,7 @@ find_invariants_to_move (bool speed, bool call_p)
      registers used; we put some initial bound here to stand for
      induction variables etc. that we do not detect.  */
     {
-      unsigned int n_regs = DF_REG_SIZE (df);
-
-      regs_used = 2;
-
-      for (i = 0; i < n_regs; i++)
-	{
-	  if (!DF_REGNO_FIRST_DEF (i) && DF_REGNO_LAST_USE (i))
-	    {
-	      /* This is a value that is used but not changed inside loop.  */
-	      regs_used++;
-	    }
-	}
+      regs_used = bitmap_count_bits (&LOOP_DATA (curr_loop)->regs_live) + 2;
     }
 
   if (! flag_ira_loop_pressure)
@@ -1966,7 +1955,63 @@ mark_ref_regs (rtx x)
     }
 }
 
+/* Calculate the loop liveness used for regs_used in the heuristics
+   that estimate the register pressure.  */
+
+static void
+calculate_loop_liveness (void)
+{
+  struct loop *loop;
+
+  FOR_EACH_LOOP (loop, 0)
+    if (loop->aux == NULL)
+      {
+	loop->aux = xcalloc (1, sizeof (struct loop_data));
+	bitmap_initialize (&LOOP_DATA (loop)->regs_live, &reg_obstack);
+      }
+
+  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
+    {
+      int i;
+      edge e;
+      vec<edge> edges;
+      edges = get_loop_exit_edges (loop);
+
+      /* Loop liveness is based on the following properties:
+	 we only require the set of objects that are live at the birth
+	 or the header of the loop.  We don't need to compute liveness
+	 through the loop by considering the live-in and live-out of
+	 every basic block of the loop, because the set of objects that
+	 are live-in at the birth or header of the loop will be live-in
+	 at every node in the loop.
+
+	 If v is live-out at the header of the loop then the variable
+	 is live-in at every node in the loop.  To prove this, consider
+	 a loop L with header h such that the variable v defined at d is
+	 live-in at h.  Since v is live at h, d is not part of L.  This
+	 follows from the dominance property, i.e. h is strictly
+	 dominated by d.  Furthermore, there exists a path from h to
+	 a use of v which does not go
RE: [Patch,rtl Optimization]: Better register pressure estimate for Loop Invariant Code Motion.
-Original Message- From: Richard Biener [mailto:richard.guent...@gmail.com] Sent: Wednesday, December 09, 2015 4:06 PM To: Ajit Kumar Agarwal Cc: Jeff Law; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,rtl Optimization]: Better register pressure estimate for Loop Invariant Code Motion. On Tue, Dec 8, 2015 at 11:24 PM, Ajit Kumar Agarwal wrote: > Based on the comments on RFC patch this patch incorporates all the comments > from Jeff. Thanks Jeff for the valuable feedback. > > This patch enables the better register pressure estimate for Loop > Invariant code motion. This patch Calculate the Loop Liveness used for > regs_used used to calculate the register pressure used in the cost > estimation. > > Loop Liveness is based on the following properties. we only require > to calculate the set of objects that are live at the birth or the > header of the loop. We don't need to calculate the live through the Loop > considering Live in and Live out of all the basic blocks of the Loop. This is > because the set of objects. That are live-in at the birth or header of the > loop will be live-in at every node in the Loop. > > If a v live out at the header of the loop then the variable is live-in > at every node in the Loop. To prove this, Consider a Loop L with header h > such that The variable v defined at d is live-in at h. Since v is live at h, > d is not part of L. This follows from the dominance property, i.e. h is > strictly dominated by d. > Furthermore, there exists a path from h to a use of v which does not > go through d. For every node of the loop, p, since the loop is strongly > connected Component of the CFG, there exists a path, consisting only of nodes > of L from p to h. Concatenating those two paths prove that v is live-in and > live-out Of p. > > Also Calculate the Live-Out and Live-In for the exit edge of the loop. This > considers liveness for not only the Loop latch but also the liveness outside > the Loops. > > Bootstrapped and Regtested for x86_64 and microblaze target. > > There is an extra failure for the testcase gcc.dg/loop-invariant.c with the > change that looks correct to me. > > This is because the available_regs = 6 and the regs_needed = 1 and > new_regs = 0 and the regs_used = 10. As the reg_used that are based > on the Liveness given above is greater than the available_regs, then it's > candidate of spill and the estimate_register_pressure calculates the spill > cost. This spill cost is greater than inv_cost and gain comes to be negative. > The disables the loop invariant for the above testcase. > > Disabling of the Loop invariant for the testcase loop-invariant.c with > this patch looks correct to me considering the calculation of available_regs > in cfgloopanal.c is correct. >>You keep a lot of data (bitmaps) live just to use the number of bits set in >>the end. >>I'll note that this also increases the complexity of the pass which is >>enabled at -O1+ where >>-O1 should limit itself to O(n*log n) algorithms but live is quadratic. >.So I don't think doing this unconditionally is a good idea (and we have >-fira-loop-pressure after all). >>Please watch not making -O1 worse again after we spent so much time making it >>suitable for all the weird autogenerated code. I can also implement without keeping the bitmaps live data and would not require to set the bits in the end. I am interested in the Liveness in the loop header and the loop exit which I can calculate where I am setting the regs_used for the curr_loop. 
This avoids setting and keeping the liveness bitmaps, and makes the count available just for the curr_loop whose invariants are candidates for moving. Thanks & Regards Ajit Richard. > ChangeLog: > 2015-12-09 Ajit Agarwal > > * loop-invariant.c > (calculate_loop_liveness): New. > (move_loop_invariants): Use of calculate_loop_liveness. > > Signed-off-by:Ajit Agarwal ajit...@xilinx.com > > --- > gcc/loop-invariant.c | 77 > ++ > 1 files changed, 65 insertions(+), 12 deletions(-) > > diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c index > 53d1377..ac08594 100644 > --- a/gcc/loop-invariant.c > +++ b/gcc/loop-invariant.c > @@ -1464,18 +1464,7 @@ find_invariants_to_move (bool speed, bool call_p) > registers used; we put some initial bound here to stand for > induction variables etc. that we do not detect. */ > { > - unsigned int n_regs = DF_REG_SIZE (df); > - > - regs_used = 2; > - > - for (i = 0; i < n_regs; i++) > - { > -
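Concretely, the alternative described in this reply could be a small helper computed on demand for the current loop only; a hedged sketch, where the helper name and exact placement are assumptions, not part of any posted patch:

/* Count the registers live at LOOP's header and across its exit
   edges, without retaining any per-loop bitmap.  */
static int
loop_regs_used (struct loop *loop)
{
  unsigned i;
  edge e;
  int count;
  bitmap_head live;
  vec<edge> edges = get_loop_exit_edges (loop);

  bitmap_initialize (&live, &reg_obstack);
  bitmap_ior_into (&live, DF_LR_IN (loop->header));
  FOR_EACH_VEC_ELT (edges, i, e)
    bitmap_ior_into (&live, DF_LR_IN (e->dest));
  edges.release ();

  /* Plus the usual initial bound of 2 for undetected induction
     variables etc.  */
  count = bitmap_count_bits (&live) + 2;
  bitmap_clear (&live);
  return count;
}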
RE: [Patch,rtl Optimization]: Better register pressure estimate for Loop Invariant Code Motion.
-Original Message- From: Bernd Schmidt [mailto:bschm...@redhat.com] Sent: Wednesday, December 09, 2015 7:34 PM To: Ajit Kumar Agarwal; Richard Biener Cc: Jeff Law; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,rtl Optimization]: Better register pressure estimate for Loop Invariant Code Motion. On 12/09/2015 12:22 PM, Ajit Kumar Agarwal wrote: > > This is because the available_regs = 6 and the regs_needed = 1 and > new_regs = 0 and the regs_used = 10. As the reg_used that are based > on the Liveness given above is greater than the available_regs, then > it's candidate of spill and the estimate_register_pressure calculates > the > spill cost. This spill cost is greater than inv_cost and gain > comes to be > negative. The disables the loop invariant for the above > testcase. >>As far as I can tell this loop does not lead to a spill. Hence, failure of >>this testcase would suggest there is something wrong with your idea. As far as I can see there is nothing wrong with the idea, the existing implementation of the calculation of available_regs results to 6 which is not the case for this testcase. The estimate of regs_used based on the Liveness ( idea behind this patch) is correct, but the incorrect estimate of the available_regs (6) results in the gain to be negative. >> + FOR_EACH_LOOP (loop, LI_FROM_INNERMOST) { >>Formatting. >> +/* Loop Liveness is based on the following proprties. >>"properties" I will incorporate the change. >> + we only require to calculate the set of objects that are live at >> + the birth or the header of the loop. >> + We don't need to calculate the live through the Loop considering >> + Live in and Live out of all the basic blocks of the Loop. This is >> + because the set of objects. That are live-in at the birth or header >> + of the loop will be live-in at every node in the Loop. >> + If a v live out at the header of the loop then the variable is >> live-in >> + at every node in the Loop. To prove this, Consider a Loop L with >> header >> + h such that The variable v defined at d is live-in at h. Since v is >> live >> + at h, d is not part of L. This follows from the dominance property, >> i.e. >> + h is strictly dominated by d. Furthermore, there exists a path from >> h to >> + a use of v which does not go through d. For every node of the loop, >> p, >> + since the loop is strongly connected Component of the CFG, there >> exists >> + a path, consisting only of nodes of L from p to h. Concatenating >> those >> + two paths prove that v is live-in and live-out Of p. */ >>Please Refrain From Randomly Capitalizing Words, and please also fix the >>grammar ("set of objects. That are live-in"). These problems make this >>comment (and also your emails) >>extremely hard to read. >>Partly for that reason, I can't quite make up my mind whether this patch >>makes things better or just different. The testcase failure makes me think >>it's probably not an improvement. I'll correct it. I am seeing the gains for Mibench/EEMBC benchmarks for Microblaze target with this patch. I will run SPEC CPU 2000 benchmarks for i386 with this patch. >> +bitmap_ior_into (&LOOP_DATA (loop)->regs_live, DF_LR_IN (loop->header)); >> +bitmap_ior_into (&LOOP_DATA (loop)->regs_live, DF_LR_OUT >> + (loop->header)); >>Formatting. I'll incorporate the change. Thanks & Regards Ajit Bernd
RE: [RFC][ARM]: Fix reload spill failure (PR 60617)
You can assign the same register to more than one operand based on the liveness. It will be tricky if, on the basis of liveness, no available register is found. In that case you need to spill one of the operands and reuse the register assigned to the other operand. This forms the basis of spilling one of the values to assign the same register to another operand. This pattern is very specific to LO_REGS, and I expect the register pressure for such operands to be low. Is this pattern used with dedicated registers for the operands? Thanks & Regards Ajit -Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Venkataramanan Kumar Sent: Monday, June 16, 2014 6:23 PM To: gcc-patches@gcc.gnu.org; Richard Earnshaw; Ramana Radhakrishnan; Marcus Shawcroft; Patch Tracking; Maxim Kuvyrkov Subject: [RFC][ARM]: Fix reload spill failure (PR 60617) Hi Maintainers, This patch fixes the PR 60617 that occurs when we turn on the reload pass in thumb2 mode. It occurs for the pattern "*ior_scc_scc" that gets generated for the 3rd argument of the below function call. JIT:emitStoreInt32(dst,regT0m, (op1 == dst || op2 == dst)));
(snip---)
(insn 634 633 635 27 (parallel [
            (set (reg:SI 3 r3)
                (ior:SI (eq:SI (reg/v:SI 110 [ dst ])  <== This operand gets register r5 assigned
                            (reg/v:SI 112 [ op2 ]))
                        (eq:SI (reg/v:SI 110 [ dst ])  <== This operand
                            (reg/v:SI 111 [ op1 ]))))
            (clobber (reg:CC 100 cc))
        ]) ../Source/JavaScriptCore/jit/JITArithmetic32_64.cpp:179 300 {*ior_scc_scc}
(snip---)
The issue here is that the above pattern demands 5 registers (LO_REGS). But when we are in reload, register r0 is used for the pointer to the class, r1 and r2 for the first and second arguments. r7 is used as the stack pointer. So we are left with r3, r4, r5 and r6. But the above pattern needs five LO_REGS. Hence we get a spill failure when processing the last register operand in that pattern. In the ARM port, TARGET_LIKELY_SPILLED_CLASS is defined for Thumb-1, and for Thumb-2 mode there is a mention of using LO_REG in the comment as below. "Care should be taken to avoid adding thumb-2 patterns that require many low registers" So the conservative fix is to not allow this pattern for Thumb-2 mode. I allowed this pattern for Thumb-2 when we have constant operands for comparison. That makes the target tests arm/thum2-cond-cmp-1.c to thum2-cond-cmp-4.c pass. Regression tested with the gcc 4.9 branch, since in trunk this bug is masked by revision 209897. Please provide your suggestions on this patch. regards, Venkat.
RE: [Patch] OPT: Update heuristics for loop-invariant for address arithmetic.
-Original Message- From: Richard Sandiford [mailto:rdsandif...@googlemail.com] Sent: Friday, April 24, 2015 12:40 AM To: Ajit Kumar Agarwal Cc: vmaka...@redhat.com; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch] OPT: Update heuristics for loop-invariant for address arithmetic. Very delayed answer, sorry... Ajit Kumar Agarwal writes: > Hello All: > > The changes are made in the patch to update the heuristics for loop > invariant for address arithemetic at RTL Level. The heuristics are > updated with the consideration of single def and use for register > pressure calculation instead Of ignoring it and also to update the > estimated register pressure cost along with the check of actual uses > with Address uses. > > With the above change, gains are seen in the Geomean for Mibench/EEMBC > benchmarks for microblaze target. No Regression is seen in deja GNU > regressions tests for microblaze. >>Since this patch is basically removing code, were you able to analyse why that >>code was having a detrimental effect? I assume it benefited some target >>originally? This patch modified the estimated register pressure cost for non-IRA-based register pressure (flag_ira_loop_pressure is not set). The following change was made in the estimated register pressure cost:

size_cost = (estimate_reg_pressure_cost (new_regs[0] + regs_needed[0],
                                         regs_used, speed, call_p)
             - estimate_reg_pressure_cost (new_regs[0],
                                           regs_used, speed, call_p));

is changed to

size_cost = estimate_reg_pressure_cost (regs_needed[0],
                                        regs_used, speed, call_p);

This looks like a reasonable change to the estimate_reg_pressure_cost calculation. The other change I have made: for a single use of a given def, the current code does not include such invariants in the register pressure calculation; I have enabled including them. Though the comment in the code says that there won't be a new register for a single use after moving, moving such invariants outside the loop will still affect the register pressure, as the span of the live range after moving out of the loop differs from the original loop. Since the live range span varies, such cases should surely affect the register pressure. The above explanation looks reasonable to me, and the code that excluded such invariants from the register pressure is removed in the patch. I don't see any background or past patches that made the above check explicitly as part of a performance improvement or bug fix. Thanks & Regards Ajit Thanks, Richard
[Patch,microblaze]: Optimized usage of pcmp conditional instruction.
Hello All: Please find the patch for the optimized usage of pcmp instructions in microblaze. No regressions are seen in the deja GNU tests. There are many testcases already in deja GNU that check the generation of pcmpne/pcmpeq instructions and are used to check the validity. commit b74acf44ce4286649e5be7cff7518d814cb2491f Author: Ajit Kumar Agarwal Date: Wed Feb 25 15:33:02 2015 +0530 [Patch,microblaze]: Optimized usage of pcmp conditional instruction. The changes are made in the patch for optimized usage of pcmpne/pcmpeq instructions. The register-to-register xor is replaced with pcmpeq/pcmpne instructions, while for an immediate check the xori will still be used. The purpose of the change is to achieve aggressive usage of pcmpne/pcmpeq instructions instead of xor for comparison. ChangeLog: 2015-02-25 Ajit Agarwal * config/microblaze/microblaze.md (cbranchsi4): Added immediate constraints. (cbranchsi4_reg): New. * config/microblaze/microblaze.c (microblaze_expand_conditional_branch_reg): New. * config/microblaze/microblaze-protos.h (microblaze_expand_conditional_branch_reg): New prototype. Signed-off-by:Ajit Agarwal ajit...@xilinx.com Thanks & Regards Ajit 0001-Patch-microblaze-Optimized-usage-of-pcmp-conditional.patch Description: 0001-Patch-microblaze-Optimized-usage-of-pcmp-conditional.patch
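As an illustration (mine, not taken from the patch), the kind of comparison this targets, with hedged before/after instruction shapes in the comments:

/* A register-register (in)equality test such as...  */
int
not_equal (int a, int b)
{
  return a != b;
}
/* ...can set its result with "pcmpne r3,r5,r6" directly instead of an
   "xor r3,r5,r6" based sequence, while a comparison against an
   immediate (a != 5) keeps using "xori r3,r5,5".  The register
   numbers here are illustrative only.  */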
[Patch,microblaze]: Optimized usage of fint instruction.
Hello All: Please find the patch for the optimized usage of the fint instruction. No regression is seen in the deja GNU tests. commit ed4dc0b96bf43c200cacad97f73a98ab7048e51b Author: Ajit Kumar Agarwal Date: Wed Feb 25 15:36:29 2015 +0530 [Patch,microblaze]: Optimized usage of fint instruction. The changes in this patch optimize the usage of the fint instruction. The sequence fint/cond_branch is replaced with fcmp/cond_branch. The fint instruction takes 6/7 cycles, compared to the fcmp instruction which takes 1 cycle. The conversion from float to int with the fint instruction is not required; the comparison can be done directly with the fcmp instruction, taking 1 cycle instead of the 6/7 cycles of fint. ChangeLog: 2015-02-25 Ajit Agarwal * config/microblaze/microblaze.md (peephole2): New. Signed-off-by:Ajit Agarwal ajit...@xilinx.com Thanks & Regards Ajit 0001-Patch-microblaze-Optimized-usage-of-fint-instruction.patch Description: 0001-Patch-microblaze-Optimized-usage-of-fint-instruction.patch
[Patch] IRA: Update heuristics for optimal coalescing
Hello Vladimir: The changes in the patch are made in the frequency heuristics for optimal coalescing. The loop back edge frequencies are taken instead of the block frequency for optimal coalescing. The frequency is ignored for a loop when the allocno has no references inside the loop but spans the loop and is live at the exit block of the loop. Another similar change is made not to use the allocno frequency in the cost calculation but to use the loop back edge frequencies for allocnos that have references, span the loop and are live at the exit of the block. We have tested the changes with MiBench and EEMBC benchmarks and there is a gain in the geomean over the overall benchmarks for the Microblaze target. Also no regressions are seen in the deja GNU tests run for microblaze. Please let us know your feedback. commit e6a2edd3794080a973695f80e77df3e7de55452d Author: Ajit Kumar Agarwal Date: Fri Feb 27 11:15:48 2015 +0530 IRA: Update heuristics for optimal coalescing. The changes are made in the frequency heuristics for optimal coalescing. The loop back edge frequencies are taken instead of the block frequency for optimal coalescing. The frequency is ignored for a loop having no references but spanning the loop and live at the exit block of the loop. Another similar change: do not use the allocno frequency in the cost calculation but the loop back edge frequencies for allocnos having references, spanning the loop and live at the exit of the block. ChangeLog: 2015-02-27 Ajit Agarwal * ira-color.c (ira_loop_back_edge_freq): New. (coalesce_allocnos): Use of ira_loop_back_edge_freq to update the back edge frequencies. (setup_coalesced_allocno_costs_and_nums): Use of ira_loop_back_edge_freq to update the cost. Signed-off-by:Ajit Agarwal ajit...@xilinx.com Thanks & Regards Ajit 0001-IRA-Update-heuristics-for-optimal-coalescing.patch Description: 0001-IRA-Update-heuristics-for-optimal-coalescing.patch
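A hedged sketch of what the new ira_loop_back_edge_freq helper could compute; the real signature and implementation are in the attached patch, which this mail does not show:

/* Sum the frequencies of LOOP's back edges, i.e. the header's
   incoming edges whose source lies inside the loop.  */
static int
ira_loop_back_edge_freq (struct loop *loop)
{
  edge e;
  edge_iterator ei;
  int freq = 0;

  FOR_EACH_EDGE (e, ei, loop->header->preds)
    if (flow_bb_inside_loop_p (loop, e->src))
      freq += EDGE_FREQUENCY (e);

  return freq;
}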
[Patch] OPT: Update heuristics for loop-invariant for address arithmetic.
Hello All: The changes in this patch update the heuristics for loop invariants for address arithmetic at the RTL level. The heuristics are updated to consider a single def and use in the register pressure calculation instead of ignoring it, and to update the estimated register pressure cost along with the check of actual uses against address uses. With the above change, gains are seen in the geomean for Mibench/EEMBC benchmarks for the microblaze target. No regression is seen in the deja GNU regression tests for microblaze. Please let us know your feedback. commit 039b95028c93f99fc1da7fa255f9b5fff4e17223 Author: Ajit Kumar Agarwal Date: Wed Mar 4 15:46:45 2015 +0530 [Patch] OPT: Update heuristics for loop-invariant for address arithmetic. The changes update the heuristics for loop invariants for address arithmetic. The heuristics are changed to incorporate a single def and use in the register pressure calculation in order to move the invariant out of the loop. The heuristics are further changed not to use the check for addr uses against actual uses. Changes are also made in the heuristics of the estimated register pressure cost. ChangeLog: 2015-03-04 Ajit Agarwal * loop-invariant.c (gain_for_invariant): Update the heuristics for estimate_reg_pressure_cost. (get_inv_cost): Remove the check for addr uses with actual uses. Add the single def and use in the register pressure for invariants. Signed-off-by:Ajit Agarwal ajit...@xilinx.com Thanks & Regards Ajit 0001-Patch-OPT-Update-heuristics-for-loop-invariant-for-a.patch Description: 0001-Patch-OPT-Update-heuristics-for-loop-invariant-for-a.patch
RE: [Patch,microblaze]: Optimized usage of fint instruction.
-Original Message- From: Michael Eager [mailto:ea...@eagerm.com] Sent: Thursday, February 26, 2015 4:33 AM To: Ajit Kumar Agarwal; GCC Patches Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,microblaze]: Optimized usage of fint instruction. On 02/25/15 02:20, Ajit Kumar Agarwal wrote: > Hello All: > > Please find the patch for the optimized usage of fint instruction > changes. No regression is seen in the deja GNU tests. > > commit ed4dc0b96bf43c200cacad97f73a98ab7048e51b > Author: Ajit Kumar Agarwal > Date: Wed Feb 25 15:36:29 2015 +0530 > > [Patch,microblaze]: Optimized usage of fint instruction. > > The changes are made in the patch for optimized usage of fint > instruction. > The sequence of fint/cond_branch is replaced with fcmp/cond_branch. The > fint instruction takes 6/7 cycles as compared to fcmp instruction which > takes 1 cycles. The conversion from float to int with fint instruction > is not required and can directly compared with fcmp instruction which > takes 1 cycle as compared to 6/7 cycles with fint instruction. > > ChangeLog: > 2015-02-25 Ajit Agarwal > > * config/microblaze/microblaze.md (peephole2): New. > +emit_insn (gen_cstoresf4 (comp_reg, operands[2], + gen_rtx_REG(SFmode,REGNO(cmp_op0)), + gen_rtx_REG(SFmode,REGNO(cmp_op1; >>Spaces before left parens and after comma in last two lines. Changes are incorporated. Please find the log for updated patch. commit 492b0d0b67a5b12d2dc239de3215630c8838edea Author: Ajit Kumar Agarwal Date: Wed Mar 4 17:15:16 2015 +0530 [Patch,microblaze]: Optimized usage of fint instruction. The changes are made in the patch for optimized usage of fint instruction. The sequence of fint/cond_branch is replaced with fcmp/cond_branch. The fint instruction takes 6/7 cycles as compared to fcmp instruction which takes 1 cycles. The conversion from float to int with fint instruction is not required and can directly compared with fcmp instruction which takes 1 cycle as compared to 6/7 cycles with fint instruction. ChangeLog: 2015-03-04 Ajit Agarwal * config/microblaze/microblaze.md (peephole2): New. Signed-off-by:Ajit Agarwal ajit...@xilinx.com Thanks & Regards Ajit Michael Eagerea...@eagercon.com 1960 Park Blvd., Palo Alto, CA 94306 650-325-8077 0001-Patch-microblaze-Optimized-usage-of-fint-instruction.patch Description: 0001-Patch-microblaze-Optimized-usage-of-fint-instruction.patch
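For the record, the quoted call with Michael's formatting fixes applied would read as follows; a sketch, with the surrounding context assumed from the patch:

emit_insn (gen_cstoresf4 (comp_reg, operands[2],
			  gen_rtx_REG (SFmode, REGNO (cmp_op0)),
			  gen_rtx_REG (SFmode, REGNO (cmp_op1))));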
RE: [Patch,microblaze]: Optimized usage of pcmp conditional instruction.
-Original Message- From: Michael Eager [mailto:ea...@eagerm.com] Sent: Thursday, February 26, 2015 4:29 AM To: Ajit Kumar Agarwal; GCC Patches Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,microblaze]: Optimized usage of pcmp conditional instruction. On 02/25/15 02:19, Ajit Kumar Agarwal wrote: > Hello All: > > Please find the patch for the optimized usage of pcmp instructions in > microblaze. No regressions is seen In deja GNU tests. There are many > testcases that are already there in deja GNU to check the generation of > pcmpne/pcmpeq instructions and are used to check the validity. > > commit b74acf44ce4286649e5be7cff7518d814cb2491f > Author: Ajit Kumar Agarwal > Date: Wed Feb 25 15:33:02 2015 +0530 > > [Patch,microblaze]: Optimized usage of pcmp conditional instruction. > > The changes are made in the patch for optimized usage of pcmpne/pcmpeq > instructions. The xor with register to register is replaced with pcmpeq > /pcmpne instructions and for immediate check still the xori will be used. > The purpose of the change is to acheive the aggressive usage of pcmpne > /pcmpeq instructions instead of xor being used for comparison. > > ChangeLog: > 2015-02-25 Ajit Agarwal > > * config/microblaze/microblaze.md (cbranchsi4): Added immediate > constraints. > (cbranchsi4_reg): New. > * config/microblaze/microblaze.c > (microblaze_expand_conditional_branch_reg): New. > * config/microblaze/microblaze-protos.h > (microblaze_expand_conditional_branch_reg): New prototype. + if (cmp_op1 == const0_rtx) +{ + comp_reg = cmp_op0; + condition = gen_rtx_fmt_ee (signed_condition (code), + SImode, comp_reg, const0_rtx); + emit_jump_insn (gen_condjump (condition, label1)); +} + + else if (code == EQ || code == NE) +{ + if (code == NE) +{ + emit_insn (gen_sne_internal_pat (comp_reg, cmp_op0, + cmp_op1)); + condition = gen_rtx_NE (SImode, comp_reg, const0_rtx); +} + else +{ + emit_insn (gen_seq_internal_pat (comp_reg, + cmp_op0, cmp_op1)); + condition = gen_rtx_EQ (SImode, comp_reg, const0_rtx); +} + emit_jump_insn (gen_condjump (condition, label1)); +} + else +{ ... >>No blank line between end brace of if and else. >>Replace with >>+ else if (code == EQ) >>+{ >>+ emit_insn (gen_seq_internal_pat (comp_reg, cmp_op0, cmp_op1)); >>+ condition = gen_rtx_EQ (SImode, comp_reg, const0_rtx); >>+ emit_jump_insn (gen_condjump (condition, label1)); >>+} >>+ else if (code == NE) >>+{ >>+ emit_insn (gen_sne_internal_pat (comp_reg, cmp_op0, cmp_op1)); >>+ condition = gen_rtx_NE (SImode, comp_reg, const0_rtx); >>+ emit_jump_insn (gen_condjump (condition, label1)); >>+ } >>+ else >>+{ >>... >>-- Changes are incorporated. Please find the log of the updated patch. commit 91f275c144165320850ddf18e3a1e059a66c Author: Ajit Kumar Agarwal Date: Fri Mar 6 09:55:11 2015 +0530 [Patch,microblaze]: Optimized usage of pcmp conditional instruction. The changes are made in the patch for optimized usage of pcmpne/pcmpeq instructions. The xor with register to register is replaced with pcmpeq /pcmpne instructions and for immediate check still the xori will be used. The purpose of the change is to acheive the aggressive usage of pcmpne /pcmpeq instructions instead of xor being used for comparison. ChangeLog: 2015-03-06 Ajit Agarwal * config/microblaze/microblaze.md (cbranchsi4): Added immediate constraints. (cbranchsi4_reg): New. * config/microblaze/microblaze.c (microblaze_expand_conditional_branch_reg): New. 
* config/microblaze/microblaze-protos.h (microblaze_expand_conditional_branch_reg): New prototype. Signed-off-by:Ajit Agarwal ajit...@xilinx.com Thanks & Regards Ajit Michael Eager ea...@eagercon.com 1960 Park Blvd., Palo Alto, CA 94306 650-325-8077 0001-Patch-microblaze-Optimized-usage-of-pcmp-conditional.patch Description: 0001-Patch-microblaze-Optimized-usage-of-pcmp-conditional.patch
RE: [Patch] IRA: Update heuristics for optimal coalescing
Hello Vladimir: Did you get a chance to look at the below patch. Thanks & Regards Ajit -Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Ajit Kumar Agarwal Sent: Friday, February 27, 2015 11:25 AM To: vmaka...@redhat.com; Jeff Law; GCC Patches Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: [Patch] IRA: Update heuristics for optimal coalescing Hello Vladimir: The changes in the patch are made in the frequency heuristics for optimal coalescing. The Loop back edge frequencies are taken instead of the block frequency for optimal coalescing. Ignores the frequency for the loop for the allocno not having references inside the loops but spans the loops and live at the exit block of the loop. Another similar change are made not to consider allcono frequency at the cost calculation but to consider the loop back edge frequencies having references and spans through the loop and live at the exit of the block. We have tested the changes with MIBench and EEMBC benchmarks and there is a gain in the Geomean for the overall benchmarks for Microblaze target. Also no regressions are seen in deja GNU tests run for microblaze. Please let us know with your feedbacks. commit e6a2edd3794080a973695f80e77df3e7de55452d Author: Ajit Kumar Agarwal Date: Fri Feb 27 11:15:48 2015 +0530 IRA: Update heuristics for optimal coalescing. The changes are made in the frequency heuristics for optimal coalescing. The Loop back edge frequencies are taken instead of the block frequency for optimal coalescing. Ignores the frequency for the loop having not any references but spans the loop and live at the exit block of the loop. Another similar change not to consider allcono frequency at the cost calculation but to consider the loop back edge frequencies having references and spans through the loop and live at the exit of the block. ChangeLog: 2015-02-27 Ajit Agarwal * ira-color.c (ira_loop_back_edge_freq): New. (coalesce_allocnos): Use of ira_loop_back_edge_freq to update the back edge frequencies. (setup_coalesced_allocno_costs_and_nums): Use of ira_loop_back_edge_freq to update the cost. Signed-off-by:Ajit Agarwal ajit...@xilinx.com Thanks & Regards Ajit
RE: [Patch] OPT: Update heuristics for loop-invariant for address arithmetic.
Hello All: Did you get a chance to look at the below patch. Thanks & Regards Ajit -Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Ajit Kumar Agarwal Sent: Wednesday, March 04, 2015 3:57 PM To: vmaka...@redhat.com; GCC Patches Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: [Patch] OPT: Update heuristics for loop-invariant for address arithmetic. Hello All: The changes are made in the patch to update the heuristics for loop invariant for address arithemetic at RTL Level. The heuristics are updated with the consideration of single def and use for register pressure calculation instead Of ignoring it and also to update the estimated register pressure cost along with the check of actual uses with Address uses. With the above change, gains are seen in the Geomean for Mibench/EEMBC benchmarks for microblaze target. No Regression is seen in deja GNU regressions tests for microblaze. Please let us know your feedback. commit 039b95028c93f99fc1da7fa255f9b5fff4e17223 Author: Ajit Kumar Agarwal Date: Wed Mar 4 15:46:45 2015 +0530 [Patch] OPT: Update heuristics for loop-invariant for address arithmetic. The changes are made in the patch to update the heuristics for loop invariant for address arithmetic. The heuristics is changed to incorporate the single def and use in the register pressure calculation in order to move the invariant out of loop. The heuristics are further changed to not to use the check for addr uses with actual uses. Also changes are done in the heuristics of estimated register pressure cost. ChangeLog: 2015-03-04 Ajit Agarwal * loop-invariant.c (gain_for_invariant): update the heuristics for estimate_reg_pressure_cost. (get_inv_cost): Remove the check for addr uses with actual uses. Add the single def and use in the register pressure for invariants. Signed-off-by:Ajit Agarwal ajit...@xilinx.com Thanks & Regards Ajit
[Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
 make_pass_ccp (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_path_split (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_phi_only_cprop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_build_ssa (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_build_alias (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-path-split.c b/gcc/tree-ssa-path-split.c
new file mode 100644
index 000..3da7791
--- /dev/null
+++ b/gcc/tree-ssa-path-split.c
@@ -0,0 +1,462 @@
+/* Support routines for Path Splitting.
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   Contributed by Ajit Kumar Agarwal .
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "flags.h"
+#include "tree.h"
+#include "stor-layout.h"
+#include "calls.h"
+#include "predict.h"
+#include "vec.h"
+#include "hashtab.h"
+#include "hash-set.h"
+#include "machmode.h"
+#include "hard-reg-set.h"
+#include "input.h"
+#include "function.h"
+#include "dominance.h"
+#include "cfg.h"
+#include "cfganal.h"
+#include "basic-block.h"
+#include "tree-ssa-alias.h"
+#include "internal-fn.h"
+#include "gimple-fold.h"
+#include "tree-eh.h"
+#include "gimple-expr.h"
+#include "is-a.h"
+#include "gimple.h"
+#include "gimple-iterator.h"
+#include "gimple-walk.h"
+#include "gimple-ssa.h"
+#include "tree-cfg.h"
+#include "tree-phinodes.h"
+#include "ssa-iterators.h"
+#include "stringpool.h"
+#include "tree-ssanames.h"
+#include "tree-ssa-loop-manip.h"
+#include "tree-ssa-loop-niter.h"
+#include "tree-ssa-loop.h"
+#include "tree-into-ssa.h"
+#include "tree-ssa.h"
+#include "tree-pass.h"
+#include "tree-dump.h"
+#include "gimple-pretty-print.h"
+#include "diagnostic-core.h"
+#include "intl.h"
+#include "cfgloop.h"
+#include "tree-scalar-evolution.h"
+#include "tree-ssa-propagate.h"
+#include "tree-chrec.h"
+#include "tree-ssa-threadupdate.h"
+#include "expr.h"
+#include "insn-codes.h"
+#include "optabs.h"
+#include "tree-ssa-threadedge.h"
+#include "wide-int.h"
+
+/* This function propagates the phi results with the first phi
+   argument into each of the copied join blocks wired into its
+   predecessors.  It is called from replace_uses_phi to replace the
+   uses of the first phi argument with the second phi argument in the
+   next copy of the join block.  */
+
+static void
+replace_use_phi_operand1_with_operand2 (basic_block b,
+					tree use1,
+					tree use2)
+{
+  use_operand_p use;
+  ssa_op_iter iter;
+  gimple_stmt_iterator gsi;
+
+  for (gsi = gsi_start_bb (b); !gsi_end_p (gsi);)
+    {
+      gimple stmt = gsi_stmt (gsi);
+      FOR_EACH_SSA_USE_OPERAND (use, stmt, iter, SSA_OP_USE)
+	{
+	  tree tuse = USE_FROM_PTR (use);
+	  if (use1 == tuse || use1 == NULL_TREE)
+	    {
+	      propagate_value (use, use2);
+	      update_stmt (stmt);
+	    }
+	}
+      gsi_next (&gsi);
+    }
+}
+
+/* This function propagates the phi result into the use points with
+   the phi arguments.  The join block is copied and wired into the
+   predecessors.  Since the use points of the phi results will be the
+   same in each of the copies of the join block in the predecessors,
+   it propagates the phi arguments in the copy of the join blocks
+   wired into its predecessor.  */
+
+static
+void replace_uses_phi (basic_block b, basic_block temp_bb)
+{
+  gimple_seq phis = phi_nodes (b);
+  gimple phi = gimple_seq_first_stmt (phis);
+  tree def = gimple_phi_result (phi), use = gimple_phi_arg_def (phi, 0);
+  tree use2 = gimple_phi_arg_def (phi, 1);
+
+  if (virtual_operand_p (def))
+    {
+      imm_use_iterator iter;
+      use_operand_p use_p;
+      gimple stmt;
+
+      FOR_EACH_IMM_USE_STMT (stmt, iter, def)
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
-Original Message- From: Bernhard Reutner-Fischer [mailto:rep.dot@gmail.com] Sent: Tuesday, June 30, 2015 3:57 PM To: Ajit Kumar Agarwal; l...@redhat.com; GCC Patches Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation On June 30, 2015 10:16:01 AM GMT+02:00, Ajit Kumar Agarwal wrote: >All: > >The below patch added a new path Splitting optimization pass on SSA >representation. The Path Splitting optimization Pass moves the join >block of if-then-else same as loop latch to its predecessors and get >merged with the predecessors Preserving the SSA representation. > >The patch is tested for Microblaze and i386 target. The EEMBC/Mibench >benchmarks is run with the Microblaze target And the performance gain >of 9.15% and rgbcmy01_lite(EEMBC benchmarks). >The Deja GNU tests is run for Mircroblaze Target and no regression is >seen for Microblaze target and the new testcase attached are passed. > >For i386 bootstrapping goes through fine and the Spec cpu2000 >benchmarks is run with this patch. Following observation were seen with >spec cpu2000 benchmarks. > >Ratio of path splitting change vs Ratio of not having path splitting >change is 3653.353 vs 3652.14 for INT benchmarks. >Ratio of path splitting change vs Ratio of not having path splitting >change is 4353.812 vs 4345.351 for FP benchmarks. > >Based on comments from RFC patch following changes were done. > >1. Added a new pass for path splitting changes. >2. Placed the new path Splitting Optimization pass before the copy >propagation pass. >3. The join block same as the Loop latch is wired into its predecessors >so that the CFG Cleanup pass will merge the blocks Wired together. >4. Copy propagation routines added for path splitting changes is not >needed as suggested by Jeff. They are removed in the patch as The copy >propagation in the copied join blocks will be done by the existing copy >propagation pass and the update ssa pass. >5. Only the propagation of phi results of the join block with the phi >argument is done which will not be done by the existing update_ssa Or >copy propagation pass on tree ssa representation. >6. Added 2 tests. >a) compilation check tests. > b) execution tests. >>The 2 tests seem to be identical, so why do you have both? >>Also, please remove cleanup-tree-dump, this is now done automatically. The testcase path-split-1.c is to check for execution which is present in gcc.dg top directory . The one present in the gcc.dg/tree-ssa/path-split-2.c is to check the compilation as the action item is compilation. For the execution tests path-split-1.c the action is compile and run. Thanks & Regards Ajit Thanks, >7. Refactoring of the code for the feasibility check and finding the >join block same as loop latch node.
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
-Original Message- From: Richard Biener [mailto:richard.guent...@gmail.com] Sent: Tuesday, June 30, 2015 4:42 PM To: Ajit Kumar Agarwal Cc: l...@redhat.com; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation On Tue, Jun 30, 2015 at 10:16 AM, Ajit Kumar Agarwal wrote: > All: > > The below patch added a new path Splitting optimization pass on SSA > representation. The Path Splitting optimization Pass moves the join > block of if-then-else same as loop latch to its predecessors and get merged > with the predecessors Preserving the SSA representation. > > The patch is tested for Microblaze and i386 target. The EEMBC/Mibench > benchmarks is run with the Microblaze target And the performance gain > of 9.15% and rgbcmy01_lite(EEMBC benchmarks). The Deja GNU tests is run for > Mircroblaze Target and no regression is seen for Microblaze target and the > new testcase attached are passed. > > For i386 bootstrapping goes through fine and the Spec cpu2000 > benchmarks is run with this patch. Following observation were seen with spec > cpu2000 benchmarks. > > Ratio of path splitting change vs Ratio of not having path splitting change > is 3653.353 vs 3652.14 for INT benchmarks. > Ratio of path splitting change vs Ratio of not having path splitting change > is 4353.812 vs 4345.351 for FP benchmarks. > > Based on comments from RFC patch following changes were done. > > 1. Added a new pass for path splitting changes. > 2. Placed the new path Splitting Optimization pass before the copy > propagation pass. > 3. The join block same as the Loop latch is wired into its > predecessors so that the CFG Cleanup pass will merge the blocks Wired > together. > 4. Copy propagation routines added for path splitting changes is not > needed as suggested by Jeff. They are removed in the patch as The copy > propagation in the copied join blocks will be done by the existing copy > propagation pass and the update ssa pass. > 5. Only the propagation of phi results of the join block with the phi > argument is done which will not be done by the existing update_ssa Or copy > propagation pass on tree ssa representation. > 6. Added 2 tests. > a) compilation check tests. >b) execution tests. > 7. Refactoring of the code for the feasibility check and finding the join > block same as loop latch node. > > [Patch,tree-optimization]: Add new path Splitting pass on tree ssa > representation. > > Added a new pass on path splitting on tree SSA representation. The path > splitting optimization does the CFG transformation of join block of the > if-then-else same as the loop latch node is moved and merged with the > predecessor blocks after preserving the SSA representation. > > ChangeLog: > 2015-06-30 Ajit Agarwal > > * gcc/Makefile.in: Add the build of the new file > tree-ssa-path-split.c > * gcc/common.opt: Add the new flag ftree-path-split. > * gcc/opts.c: Add an entry for Path splitting pass > with optimization flag greater and equal to O2. > * gcc/passes.def: Enable and add new pass path splitting. > * gcc/timevar.def: Add the new entry for TV_TREE_PATH_SPLIT. > * gcc/tree-pass.h: Extern Declaration of make_pass_path_split. > * gcc/tree-ssa-path-split.c: New file for path splitting pass. > * gcc/testsuite/gcc.dg/tree-ssa/path-split-2.c: New testcase. > * gcc/testsuite/gcc.dg/path-split-1.c: New testcase. 
>>I'm not 100% sure I understand the transform but what I see from the >>testcases is that it tail-duplicates from a conditional up to a loop latch block (not >>sure if it includes it and thus ends up creating a loop nest or not).
If the loop latch block is the same as the join block, the path splitting pass wires the duplicated loop latch block into both of its predecessor paths. The CFG cleanup phase of the path splitting transformation then merges the wired-in duplicates with the original predecessors, leaving the loop latch block as a plain forwarding block of the predecessors: its sequential statements are emptied so that it holds only the phi nodes. The loop semantics with respect to the loop latch edge are preserved, and the SSA updates are preserved as well. Thanks & Regards Ajit
>>An observation I have is that the pass should at least share the transform >>stage to some extent with the existing tracer pass (tracer.c) which >>essentially does the same but not restricted to loops in any way. So I >>wonder if your pass could be simply another heuristic t
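Continuing the example given earlier in the thread, a hypothetical source-level view of the same loop after path splitting (my sketch, not output from the pass) would be:

/* Hypothetical post-split form of the earlier count() example: the
   join/latch statement has been duplicated into both arms, so each
   arm performs the accumulation itself and jumps straight back to
   the loop header.  */
int
count_split (int *a, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    {
      if (a[i] < 0)
        sum += -a[i];   /* duplicated join/latch statement */
      else
        sum += a[i];    /* duplicated join/latch statement */
    }
  return sum;
}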
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
I forgot to attach the link to the RFC comments from Jeff for reference. https://gcc.gnu.org/ml/gcc/2015-05/msg00302.html Thanks & Regards Ajit
-Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Ajit Kumar Agarwal Sent: Tuesday, June 30, 2015 1:46 PM To: l...@redhat.com; GCC Patches Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
[original posting snipped; its text and ChangeLog are identical to the message quoted in the replies above.]
Signed-off-by: Ajit Agarwal ajit...@xilinx.com.
 gcc/Makefile.in                              |   1 +
 gcc/common.opt                               |   4 +
 gcc/opts.c                                   |   1 +
 gcc/passes.def                               |   1 +
 gcc/testsuite/gcc.dg/path-split-1.c          |  65
 gcc/testsuite/gcc.dg/tree-ssa/path-split-2.c |  62
 gcc/timevar.def                              |   1 +
 gcc/tree-pass.h                              |   1 +
 gcc/tree-ssa-path-split.c                    | 462 +++
 9 files changed, 598 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/path-split-1.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/path-split-2.c
 create mode 100644 gcc/tree-ssa-path-split.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 5f9261f..35ac363 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1476,6 +1476,7 @@ OBJS = \
 	tree-vect-slp.o \
 	tree-vectorizer.o \
 	tree-vrp.o \
+	tree-ssa-path-split.o \
 	tree.o \
 	valtrack.o \
 	value-prof.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index e104269..c63b100 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2328,6 +2328,10 @@ ftree-vrp
 Common Report Var(flag_tree_vrp) Init(0) Optimization
 Perform Value Range Propagation on trees

+ftree-path-split
+Common Re
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
-Original Message- From: Joseph Myers [mailto:jos...@codesourcery.com] Sent: Wednesday, July 01, 2015 3:48 AM To: Ajit Kumar Agarwal Cc: l...@redhat.com; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
On Tue, 30 Jun 2015, Ajit Kumar Agarwal wrote:
> * gcc/common.opt: Add the new flag ftree-path-split.
>>All options need documenting in invoke.texi.
Sure.
> +#include "tm.h"
>>Why? Does some other header depend on this, or are you using a target macro?
I am not using any target macro. There are many header files that include tm.h, and there are also many tree-ssa optimization files that include "tm.h"; to list some of them: tree-ssa-threadupdate.c, tree-vrp.c, tree-ssa-threadedge.c. Thanks & Regards Ajit
-- Joseph S. Myers jos...@codesourcery.com
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
-Original Message- From: Richard Biener [mailto:richard.guent...@gmail.com] Sent: Tuesday, June 30, 2015 4:42 PM To: Ajit Kumar Agarwal Cc: l...@redhat.com; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
On Tue, Jun 30, 2015 at 10:16 AM, Ajit Kumar Agarwal wrote:
> [full patch description and ChangeLog snipped; quoted in full earlier in the thread.]
>>I'm not 100% sure I understand the transform but what I see from the >>testcases is that it tail-duplicates from a conditional up to a loop latch block (not >>sure if it includes it and thus ends up creating a loop nest or not).
>>An observation I have is that the pass should at least share the transform >>stage to some extent with the existing tracer pass (tracer.c) which >>essentially does the same but not restricted to loops in any way.
The following piece of code from tracer.c can be shared with the path splitting pass:

{
  e = find_edge (bb, bb2);
  copy = duplicate_block (bb2, e, bb);
  flush_pending_stmts (e);
  add_phi_args_after_copy (&copy, 1, NULL);
}

Sharing the above transform-stage code of tracer.c with the path splitting pass has the following limitation.
1. The duplicated loop latch node is wired to its predecessors, and the existing phi nodes in the loop latch node, with the phi arguments from the corresponding predecessors, are moved to the duplicated loop latch nodes that are wired into the predecessors. Due to this, the duplicated
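As a concrete reading of the snippet above, the shared helper (referred to later in this thread as transform_duplicate in tracer.c) could look roughly like the following; this is my sketch, and the exact signature and return type in the final patch may differ:

/* Sketch of the shared transform stage: tail-duplicate BB2 onto the
   edge coming from BB and fix up the SSA form.  Based on the
   tracer.c snippet quoted above; names are illustrative.  */
static void
transform_duplicate (basic_block bb, basic_block bb2)
{
  edge e;
  basic_block copy;

  e = find_edge (bb, bb2);

  /* Duplicate BB2; the copy replaces BB2 on the edge BB -> BB2.  */
  copy = duplicate_block (bb2, e, bb);
  flush_pending_stmts (e);

  /* Rewrite the phi arguments that refer to the duplicated block.  */
  add_phi_args_after_copy (&copy, 1, NULL);
}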
[Patch,microblaze]: Optimized usage of reserved stack space for function arguments.
All: The patch below optimizes the usage of the reserved stack space for function arguments. The stack space is reserved if the function is a libcall, takes a variable number of arguments, has parameters of aggregate data types, or passes some parameters in registers and some on the stack. In addition to the above conditions, no stack space is reserved if no arguments are passed. No regressions are seen in the Deja GNU tests for Microblaze.

[Patch,microblaze]: Optimized usage of reserved stack space for function arguments.

The changes in the patch are for optimized usage of the reserved stack space for arguments. The stack space is reserved if the function is a libcall, takes a variable number of arguments, has parameters of aggregate data types, or passes some parameters in registers and some on the stack. In addition to the above conditions, no stack space is reserved if no arguments are passed.

ChangeLog:
2015-07-06  Ajit Agarwal

        * config/microblaze/microblaze.c
        (microblaze_parm_needs_stack): New.
        (microblaze_function_parms_need_stack): New.
        (microblaze_reg_parm_stack_space): New.
        * config/microblaze/microblaze.h (REG_PARM_STACK_SPACE): Modify
        the macro.
        * config/microblaze/microblaze-protos.h
        (microblaze_reg_parm_stack_space): Declare.

Signed-off-by: Ajit Agarwal ajit...@xilinx.com
---
 gcc/config/microblaze/microblaze-protos.h |   1 +
 gcc/config/microblaze/microblaze.c        | 140 +
 gcc/config/microblaze/microblaze.h        |   2 +-
 3 files changed, 142 insertions(+), 1 deletions(-)

diff --git a/gcc/config/microblaze/microblaze-protos.h b/gcc/config/microblaze/microblaze-protos.h
index 57879b1..d27d3e1 100644
--- a/gcc/config/microblaze/microblaze-protos.h
+++ b/gcc/config/microblaze/microblaze-protos.h
@@ -56,6 +56,7 @@ extern bool microblaze_tls_referenced_p (rtx);
 extern int symbol_mentioned_p (rtx);
 extern int label_mentioned_p (rtx);
 extern bool microblaze_cannot_force_const_mem (machine_mode, rtx);
+extern int microblaze_reg_parm_stack_space(tree fun);
 #endif  /* RTX_CODE */

 /* Declare functions in microblaze-c.c.  */
diff --git a/gcc/config/microblaze/microblaze.c b/gcc/config/microblaze/microblaze.c
index 566b78c..0eae4cd 100644
--- a/gcc/config/microblaze/microblaze.c
+++ b/gcc/config/microblaze/microblaze.c
@@ -3592,7 +3592,147 @@ microblaze_legitimate_constant_p (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
   return true;
 }

+/* Heuristics and criteria for a param needing stack.  */
+static bool
+microblaze_parm_needs_stack (cumulative_args_t args_so_far, tree type)
+{
+  enum machine_mode mode;
+  int unsignedp;
+  rtx entry_parm;
+
+  /* Catch errors.  */
+  if (type == NULL || type == error_mark_node)
+    return true;
+
+  /* Handle types with no storage requirement.  */
+  if (TYPE_MODE (type) == VOIDmode)
+    return false;
+
+  /* Handle complex types.  */
+  if (TREE_CODE (type) == COMPLEX_TYPE)
+    return (microblaze_parm_needs_stack (args_so_far, TREE_TYPE (type))
+	    || microblaze_parm_needs_stack (args_so_far, TREE_TYPE (type)));
+
+  /* Handle transparent aggregates.  */
+  if ((TREE_CODE (type) == UNION_TYPE || TREE_CODE (type) == RECORD_TYPE)
+      && TYPE_TRANSPARENT_AGGR (type))
+    type = TREE_TYPE (first_field (type));
+
+  /* See if this arg was passed by invisible reference.  */
+  if (pass_by_reference (get_cumulative_args (args_so_far),
+			 TYPE_MODE (type), type, true))
+    type = build_pointer_type (type);
+
+  /* Find mode as it is passed by the ABI.  */
+  unsignedp = TYPE_UNSIGNED (type);
+  mode = promote_mode (type, TYPE_MODE (type), &unsignedp);
+
+  /* If there is no incoming register, we need a stack.  */
+  entry_parm = microblaze_function_arg (args_so_far, mode, type, true);
+
+  if (entry_parm == NULL)
+    return true;
+
+  /* Likewise if we need to pass both in registers and on the stack.  */
+  if (GET_CODE (entry_parm) == PARALLEL
+      && XEXP (XVECEXP (entry_parm, 0, 0), 0) == NULL_RTX)
+    return true;
+
+  /* Also true if we're partially in registers and partially not.  */
+  if (function_arg_partial_bytes (args_so_far, mode, type, true) != 0)
+    return true;
+
+  /* Update info on where next arg arrives in registers.  */
+  microblaze_function_arg_advance (args_so_far, mode, type, true);
+
+  return false;
+}
+
+/* A function needs stack for params if
+   1. The function is a libcall.
+   2. It takes a variable number of arguments.
+   3. A param is of an aggregate data type.
+   4. Some params are passed partially in registers and partially
+      on the stack.  */
+
+static bool
+microblaze_function_parms_need_stack (tree fun, bool incoming)
+{
+  tree fntype, result;
+  CUMULATIVE_ARGS args_so_far_v;
+  cumulative_args_t args_so_far;
+  int num_of_args = 0;
+
+  /* Must be a libcall, all of which only use reg parms.  */
+  if (!fun)
+    return tru
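The microblaze.h hunk itself is not visible above. Based on the ChangeLog entry, the REG_PARM_STACK_SPACE macro presumably just delegates to the new function, along these lines (a guess at the shape; the exact definition is an assumption):

/* Hypothetical sketch of the REG_PARM_STACK_SPACE change described
   in the ChangeLog: delegate to the new function so that the
   reserved area can be omitted when no argument actually needs it.  */
#define REG_PARM_STACK_SPACE(FNDECL) \
  microblaze_reg_parm_stack_space (FNDECL)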
RE: [Patch,microblaze]: Optimized usage of reserved stack space for function arguments.
-Original Message- From: Oleg Endo [mailto:oleg.e...@t-online.de] Sent: Monday, July 06, 2015 7:07 PM To: Ajit Kumar Agarwal Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,microblaze]: Optimized usage of reserved stack space for function arguments.
>>Hi, >>Just some general comments...
Thanks.
On 06 Jul 2015, at 22:05, Ajit Kumar Agarwal wrote:
> +static bool
> +microblaze_parm_needs_stack (cumulative_args_t args_so_far, tree
> +type) {
> + enum machine_mode mode;
>>'enum' is not required in C++, please omit it. >>We've been trying to remove unnecessary 'struct' and 'enum' after the >>switch to C++. Although there are still some of them around, please >>don't add new ones.
Sure.
> + int unsignedp;
> + rtx entry_parm;
>>Please declare variables at their first use. >>(there are other such cases in your patch)
I have declared the above variables in the scope of their first use. Sorry, it's not clear to me - did you mean to declare them in the following way?

int unsignedp = TYPE_UNSIGNED (type);
rtx entry_parm = microblaze_function_arg (args_so_far, mode, type, true);

Thanks & Regards Ajit
Cheers, Oleg
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
-Original Message- From: Richard Biener [mailto:richard.guent...@gmail.com] Sent: Tuesday, July 07, 2015 2:21 PM To: Ajit Kumar Agarwal Cc: l...@redhat.com; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
On Sat, Jul 4, 2015 at 2:39 PM, Ajit Kumar Agarwal wrote:
> [the June 30 posting and the earlier review exchange, quoted in full; snipped - see the messages above.]
[Patch,microblaze]: Optimized Instruction prefetch with the generation of wic
All: Please find the patch for optimized instruction prefetch with the generation of the MicroBlaze instruction "wic". No regressions are seen in the Deja GNU tests for Microblaze.

[Patch,microblaze]: Optimized instruction prefetch with the generation of the wic instruction.

The changes in the patch are for optimized instruction prefetch with the generation of the wic MicroBlaze instruction. The wic instruction is the MicroBlaze instruction-cache prefetch instruction. The wic instruction is generated on the fall-through path of a call site and is enabled with the flag -mxl-prefetch. The purpose of the flag is that the wic instruction has to be selected for the particular FPGA design, and so it is not enabled by default.

ChangeLog:
2015-07-07  Ajit Agarwal

        * config/microblaze/microblaze.c
        (get_branch_target): New.
        (insert_wic_for_ilb_runout): New.
        (insert_wic): New.
        (microblaze_machine_dependent_reorg): New.
        (TARGET_MACHINE_DEPENDENT_REORG): Define macro.
        * config/microblaze/microblaze.md (UNSPEC_IPREFETCH): Define.
        (iprefetch): New pattern.
        * config/microblaze/microblaze.opt (mxl-prefetch): New flag.

Signed-off-by: Ajit Agarwal ajit...@xilinx.com.
---
 gcc/config/microblaze/microblaze.c   | 139 ++
 gcc/config/microblaze/microblaze.md  |  14
 gcc/config/microblaze/microblaze.opt |   4 +
 3 files changed, 157 insertions(+), 0 deletions(-)

diff --git a/gcc/config/microblaze/microblaze.c b/gcc/config/microblaze/microblaze.c
index 566b78c..eea2f67 100644
--- a/gcc/config/microblaze/microblaze.c
+++ b/gcc/config/microblaze/microblaze.c
@@ -71,6 +71,7 @@
 #include "cgraph.h"
 #include "builtins.h"
 #include "rtl-iter.h"
+#include "cfgloop.h"

 #define MICROBLAZE_VERSION_COMPARE(VA,VB) strcasecmp (VA, VB)

@@ -3593,6 +3594,141 @@ microblaze_legitimate_constant_p (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
   return true;
 }

+static rtx
+get_branch_target (rtx branch)
+{
+  if (CALL_P (branch))
+    {
+      rtx call;
+
+      call = XVECEXP (PATTERN (branch), 0, 0);
+      if (GET_CODE (call) == SET)
+	call = SET_SRC (call);
+      if (GET_CODE (call) != CALL)
+	abort ();
+      return XEXP (XEXP (call, 0), 0);
+    }
+
+  /* Not a call; there is no target to return.  */
+  return NULL_RTX;
+}
+
+/* Heuristics to identify where to insert at the
+   fall-through path of the caller function.  If there
+   is a call after the caller branch delay slot then
+   we do not generate the instruction prefetch instruction.  */
+
+static void
+insert_wic_for_ilb_runout (rtx_insn *first)
+{
+  rtx_insn *insn, *before_4 = 0, *before_16 = 0;
+  int addr = 0, length, first_addr = -1;
+  int wic_addr0 = 128 * 4, wic_addr1 = 128 * 4;
+  int insert_lnop_after = 0;
+
+  for (insn = first; insn; insn = NEXT_INSN (insn))
+    if (INSN_P (insn))
+      {
+	if (first_addr == -1)
+	  first_addr = INSN_ADDRESSES (INSN_UID (insn));
+
+	addr = INSN_ADDRESSES (INSN_UID (insn)) - first_addr;
+	length = get_attr_length (insn);
+
+	if (before_4 == 0 && addr + length >= 4 * 4)
+	  before_4 = insn;
+
+	if (JUMP_P (insn))
+	  return;
+	if (before_16 == 0 && addr + length >= 14 * 4)
+	  before_16 = insn;
+	if (CALL_P (insn) || tablejump_p (insn, 0, 0))
+	  return;
+	if (addr + length >= 32 * 4)
+	  {
+	    gcc_assert (before_4 && before_16);
+	    if (wic_addr0 > 4 * 4)
+	      {
+		insn =
+		  emit_insn_before (gen_iprefetch
+				    (gen_int_mode (addr, SImode)),
+				    before_4);
+		recog_memoized (insn);
+		INSN_LOCATION (insn) = INSN_LOCATION (before_4);
+		INSN_ADDRESSES_NEW (insn, INSN_ADDRESSES (INSN_UID (before_4)));
+		return;
+	      }
+	  }
+      }
+}
+
+/* Insert instruction prefetch instruction at the fall
+   through path of the function call.  */
+
+static void
+insert_wic (void)
+{
+  rtx_insn *insn;
+  int i, j;
+  basic_block bb, prev = 0;
+  rtx branch_target = 0;
+
+  shorten_branches (get_insns ());
+
+  for (i = 0; i < n_basic_blocks_for_fn (cfun) - 1; i++)
+    {
+      edge e;
+      edge_iterator ei;
+      bool simple_loop = false;
+
+      bb = BASIC_BLOCK_FOR_FN (cfun, i);
+
+      if (bb == NULL)
+	continue;
+
+      if ((prev != 0) && (prev != bb))
+	continue;
+      else
+	prev = 0;
+
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	if (e->src == bb)
+	  {
+	    simple_loop = true;
+	    prev = e->dest;
+	    break;
+	  }
+
+      for (insn = BB_END (bb); insn; insn = PREV_INSN (insn))
+	{
+	  if (INSN_P (insn) && !simple_loop
+	      && CALL_P (insn))
+	    {
+	      if ((branc
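The rest of the message, including the microblaze.md and microblaze.opt hunks, is cut off above. Going by the ChangeLog, the new code is wired in through the machine-dependent reorg hook; a rough sketch of that glue (my own condensation - the actual body in the submitted patch is not visible here, and TARGET_PREFETCH is assumed to be the option mask generated for -mxl-prefetch) could be:

/* Hypothetical sketch of the reorg-hook glue described in the
   ChangeLog above.  Runs the wic insertion only when the prefetch
   option is enabled.  */
static void
microblaze_machine_dependent_reorg (void)
{
  if (TARGET_PREFETCH)		/* Assumed mask for -mxl-prefetch.  */
    {
      compute_bb_for_insn ();
      loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
      insert_wic ();
      loop_optimizer_finalize ();
      free_bb_for_insn ();
    }
}

#undef  TARGET_MACHINE_DEPENDENT_REORG
#define TARGET_MACHINE_DEPENDENT_REORG microblaze_machine_dependent_reorg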
RE: [PATCH IRA] update_equiv_regs fails to set EQUIV reg-note for pseudo with more than one definition
-Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Jeff Law Sent: Tuesday, February 10, 2015 5:03 AM To: Bin.Cheng Cc: Alex Velenko; Felix Yang; Yangfei (Felix); Marcus Shawcroft; GCC Patches; vmaka...@redhat.com Subject: Re: [PATCH IRA] update_equiv_regs fails to set EQUIV reg-note for pseudo with more than one definition
On 02/03/15 20:03, Bin.Cheng wrote:
> I looked into the test and can confirm the previous compilation is correct.
> The cover letter of this patch said IRA mis-handled REG_EQUIV before,
> but in this case it is REG_EQUAL that is lost. The full dump (without
> this patch) after IRA is like:
Right, but a REG_EQUIV is generated based on the incoming REG_EQUAL notes in the insn stream. Basically update_equiv_regs will scan the insn stream and some REG_EQUAL notes will be promoted to REG_EQUIV notes. The REG_EQUIV is a function-wide equivalence, meaning that one could substitute the REG_EQUIV note for any use of the destination register and still have a valid representation of the program. REG_EQUAL's validity is limited to the point after the insn in which it appears and before the next insn.
> Before r216169 (with REG_EQUAL in insn9), jumps from basic block 6/7/8 -> 9 can be merged because r110 equals -1 afterwards. But with the patch, the equal information of r110 == -1 in basic block 8 is lost. As a result, the jump from 8->9 can't be merged and two additional instructions are generated.
>
> I suppose the REG_EQUAL note is correct in insn9? According to GCC Internals, it only means that r110 set by insn9 will be equal to the value at run time at the end of this insn but not necessarily elsewhere in the function.
>>If you previously got a REG_EQUIV note on any of those insns it was wrong, and >>this is the precise kind of situation that the change was trying to fix. >>R110 can have the value -1 (BB6, BB7, BB8) or 0 (BB5). Thus there is no >>single value across the entire function that one can validly use for r110.
Should the value of r110 not change across all the callee paths from the given caller function? If the value is aliased, then how do the call side effects ensure that the value remains the same across every callee chain path from the given caller function? I am curious to know how it is identified that the value remains constant throughout the function in the case of aliasing and in the interprocedural case.
>>I think you could mark this as a missed optimization, but not a regression, since the desired output was relying on a bug in the compiler.
>>
>>If I were to start looking at this, my first thought would be to look at why we have multiple sets of r110, particularly if there are lifetimes that are disjoint.
> I also found another problem (or is it misleading?) with the document:
> "Thus, compiler passes prior to register allocation need only check for REG_EQUAL notes and passes subsequent to register allocation need only check for REG_EQUIV notes". This seems not true now, as in this example passes after register allocation do take advantage of REG_EQUAL in optimization, and we can't achieve that by using REG_EQUIV.
>>I think that's a long standing (and incorrect) generalization. IIRC we can get a REG_EQUIV note earlier for certain argument setup situations. >>And I think it's been the case for a long time that a pass after reload could try to exploit REG_EQUAL notes.
Thanks & Regards Ajit
jeff
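As a hypothetical illustration of Jeff's point (my example, not the failing testcase itself): a pseudo assigned different constants on different paths has no single function-wide value, so only path-local REG_EQUAL notes are valid for it:

/* Hypothetical example.  The pseudo holding "level" gets the value
   -1 on one path and 0 on the other, so no REG_EQUIV note (a
   function-wide equivalence) can be attached to it; at most a
   REG_EQUAL note is valid after each individual set.  */
int
classify (int x)
{
  int level;

  if (x < 0)
    level = -1;   /* set #1: REG_EQUAL (-1) valid only after this insn */
  else
    level = 0;    /* set #2: REG_EQUAL (0) valid only after this insn */

  return level;
}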
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
All: Please find the updated patch with the suggestions and feedback incorporated. Thanks Jeff and Richard for the review comments. The following changes were done based on the feedback on the RFC comments and the review of the previous patch.

1. The tracer and path splitting passes are separate passes, so two instances of the pass will run in the end, one doing path splitting and one doing tracing, at different times in the optimization pipeline.
2. The transform code is shared between the tracer and path splitting passes. The common code is extracted into the function transform_duplicate, placed in tracer.c, and the path splitting pass uses the transform code.
3. The analysis that populates the basic blocks and traverses them using the Fibonacci heap is used in common. This cannot be factored out into a new function, as the tracer pass does more analysis based on the profile, and different heuristics are used in the tracer and path splitting passes.
4. The included headers are the minimal ones required for the path splitting pass.
5. The earlier patch did the SSA updating with a replace function to preserve the SSA representation required to move the loop latch node (the same as the join block) into its predecessors, leaving the loop latch node as just a forwarding block. Such replace functions are not required, as suggested by Jeff. They go away with this patch, and the transformed code is factored into a function which is shared between the tracer and path splitting passes.

Bootstrapping with the i386 and Microblaze targets works fine. No regression is seen in the Deja GNU tests for Microblaze; there are fewer failures. The Mibench/EEMBC benchmarks were run for the Microblaze target, and a gain of 9.3% is seen in rgbcmy_lite of the EEMBC benchmarks. The SPEC 2000 benchmarks were run with the i386 target, and the following performance numbers were achieved.

INT benchmarks with path splitting (ratio) vs INT benchmarks without path splitting (ratio) = 3661.225091 vs 3621.520572
FP benchmarks with path splitting (ratio) vs FP benchmarks without path splitting (ratio) = 4339.986209 vs 4339.775527

The maximum gain achieved is 9.03%, with the 252.eon INT benchmark.

ChangeLog:
2015-08-15  Ajit Agarwal

        * gcc/Makefile.in: Add the build of the new file
        tree-ssa-path-split.c
        * gcc/common.opt (ftree-path-split): Add the new flag.
        * gcc/opts.c (OPT_ftree_path_split) : Add an entry for
        Path splitting pass with optimization flag greater and
        equal to O2.
        * gcc/passes.def (path_split): add new path splitting pass.
        * gcc/timevar.def (TV_TREE_PATH_SPLIT): New.
        * gcc/tree-pass.h (make_pass_path_split): New declaration.
        * gcc/tree-ssa-path-split.c: New.
        * gcc/tracer.c (transform_duplicate): New.
        * gcc/testsuite/gcc.dg/tree-ssa/path-split-2.c: New.
        * gcc/testsuite/gcc.dg/path-split-1.c: New.
        * gcc/doc/invoke.texi (ftree-path-split): Document.
        (fdump-tree-path_split): Document.

Signed-off-by: Ajit Agarwal ajit...@xilinx.com.
Thanks & Regards Ajit
-Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Ajit Kumar Agarwal Sent: Wednesday, July 29, 2015 10:13 AM To: Richard Biener; Jeff Law Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
[the earlier exchange with Richard Biener, quoted in full; snipped - see the messages above.]
RE: [PATCH GCC]Improve bound information in loop niter analysis
All: Does the logic that calculates the loop bound information through value range analysis use the post-dominator and dominator info? The iteration branches, as opposed to the loop exit condition alone, can be calculated through post-dominator info. If a node in the loop has two successors and the two successors post-dominate the loop header, then that node can be an iteration branch.

For all the nodes L in the Loop B
  If (L1, L2 belongs to successors of (L) && L1, L2 belongs to PosDom(Header of Loop))
  {
    I = I union L1
  }

Thus "I" will have the whole set of iteration branches. This will handle more cases of loop bound information that will be accurate through the exact iteration count for the compile-time-known cases, along with the value range information, where the deciding condition is not a loop exit but some other node in the loop.

Thanks & Regards Ajit

-Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Bin.Cheng Sent: Monday, August 17, 2015 3:32 PM To: Richard Biener Cc: Bin Cheng; GCC Patches Subject: Re: [PATCH GCC]Improve bound information in loop niter analysis
Thanks for all your reviews.
On Fri, Aug 14, 2015 at 4:17 PM, Richard Biener wrote:
> On Tue, Jul 28, 2015 at 11:36 AM, Bin Cheng wrote:
>> Hi,
>> Loop niter computes inaccurate bound information for different loops.
>> This patch improves it by using the loop initial condition in
>> determine_value_range. Generally, loop niter is computed by
>> subtracting the start var from the end var in the loop exit condition.
>> Moreover, the loop bound is computed using value range information of
>> both the start and end variables.
>> The basic idea of this patch is to check if the loop initial condition
>> implies more range information for the start/end variables. If yes,
>> we refine the range information and use that to compute the loop bound.
>> With this improvement, more accurate loop bound information is
>> computed for the test cases added by this patch.
>
> +  c0 = fold_convert (type, c0);
> +  c1 = fold_convert (type, c1);
> +
> +  if (operand_equal_p (var, c0, 0))
>
> I believe if c0 is not already of type type operand_equal_p will never
> succeed.
It's quite a specific case targeting the comparison between var and its range bounds. Given c0 is in the form of "var + offc0", the comparison "var + offc0 != range bounds" doesn't carry any useful information. Maybe useless type conversion could be handled here though; it might be even a corner case.
>
> (side-note: we should get rid of the GMP use, that's expensive and now
> we have wide-int available which should do the trick as well)
>
> +  /* Case of comparing with the bounds of the type.  */
> +  if (TYPE_MIN_VALUE (type)
> +      && operand_equal_p (c1, TYPE_MIN_VALUE (type), 0))
> +    cmp = GT_EXPR;
> +  if (TYPE_MAX_VALUE (type)
> +      && operand_equal_p (c1, TYPE_MAX_VALUE (type), 0))
> +    cmp = LT_EXPR;
>
> don't use TYPE_MIN/MAX_VALUE. Instead use the type's precision and all
> wide_int operations (see match.pd wi::max_value use).
Done.
>
> +  else if (!operand_equal_p (var, varc0, 0))
> +    goto end_2;
>
> ick - goto. We need sth like an auto_mpz class with a destructor.
Label end_2 removed.
>
> struct auto_mpz
> {
>   auto_mpz () { mpz_init (m_val); }
>   ~auto_mpz () { mpz_clear (m_val); }
>   mpz& operator() { return m_val; }
>   mpz m_val;
> };
>
>> Is it OK?
>
> I see the code follows existing practice in niter analysis even though
> my overall plan was to transition its copying of value-range related
> optimizations to use the VRP infrastructure.
Yes, I think it's easy to push it to the VRP infrastructure. Actually, from the name of the function, it's more VRP related. For now, the function is called only by bound_difference, so not as many calls as VRP queries. We would need a cache facility in VRP, otherwise it would be expensive.
>
> I'm still ok with improving the existing code on the basis that I
> won't get to that for GCC 6.
>
> So - ok with the TYPE_MIN/MAX_VALUE change suggested above.
>
> Refactoring with auto_mpz welcome.
That will be an independent patch, so I skipped it in this one. New version attached. Bootstrapped and tested on x86_64.
Thanks, bin
>
> Thanks,
> Richard.
>
>> Thanks,
>> bin
>>
>> 2015-07-28  Bin Cheng
>>
>>         * tree-ssa-loop-niter.c (refine_value_range_using_guard): New.
>>         (determine_value_range): Call refine_value_range_using_guard for
>>         each loop initial condition to improve value range.
>>
>> gcc/testsuite/ChangeLog
>> 2015-07-28  Bin Cheng
>>
>>         * gcc.dg/tree-ssa/loop-bound-1.c: New test.
>>         * gcc.dg/tree-ssa/loop-bound-3.c: New test.
>>         * gcc.dg/tree-ssa/loop-bound-5.c: New test.
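A hypothetical example of the kind of case the patch improves (my illustration; the actual testcases are the loop-bound-*.c files listed in the ChangeLog):

/* Illustrative sketch.  The loop initial condition n < 100, which
   guards entry to the loop, implies a tighter range for the end
   variable n, so niter analysis can derive a bound of at most 99
   iterations instead of assuming n may be as large as INT_MAX.  */
void
foo (int *a, int n)
{
  if (n < 100)                  /* refines the range of n to <= 99 */
    for (int i = 0; i < n; i++) /* bound: at most 99 iterations */
      a[i] = i;
}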
RE: [PATCH GCC]Improve bound information in loop niter analysis
Oops, there is a typo: instead of L it was typed as L1. Here is the corrected one.

For all the nodes L in the Loop B
  If (L1, L2 belongs to successors of (L) && L1, L2 belongs to PosDom(Header of Loop))
  {
    I = I union L
  }

Thanks & Regards Ajit
-Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Ajit Kumar Agarwal Sent: Monday, August 17, 2015 4:19 PM To: Bin.Cheng; Richard Biener Cc: Bin Cheng; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: RE: [PATCH GCC]Improve bound information in loop niter analysis
[previous message and the quoted review exchange repeated verbatim; snipped.]
RE: [PATCH GCC]Improve bound information in loop niter analysis
-Original Message- From: Bin.Cheng [mailto:amker.ch...@gmail.com] Sent: Tuesday, August 18, 2015 1:08 PM To: Ajit Kumar Agarwal Cc: Richard Biener; Bin Cheng; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [PATCH GCC]Improve bound information in loop niter analysis
On Mon, Aug 17, 2015 at 6:49 PM, Ajit Kumar Agarwal wrote:
> [proposal on calculating iteration branches through post-dominator info, quoted in full; snipped - see above.]
>>I don't quite follow your words here. Could you please give a simple example about it? Especially I don't know how post-dom helps the loop bound analysis. >>It seems your pseudo code is collecting some comparison basic blocks of the loop?
The algorithm I have given above is based on post-dominator info, which helps to calculate the iteration branches. The iteration branches are the branches that determine the loop exit condition: based on the condition, the block either branches to the header of the loop, branches to a block dominated by the header, or exits the loop. The above algorithm finds such iteration branches and thus decides on the loop bound or iteration count. Such iteration branches need not be at the back-edge node; they may be nodes inside the loop, guarded by some conditions. Finding such iteration branches can be done through the post-dominator info using the above algorithm. Based on the iteration branches, the conditions can be analyzed, and that helps in finding the iteration bound for the known cases. Known cases are the cases where the loop bound can be determined at compile time. One example would be multi-exit loops, where a loop exit condition can be at the back edge or at a block inside the loop that breaks out based on IF conditions, thus giving multiple exits. Such iteration branches can be found using the above algorithm.
Thanks & Regards Ajit
Thanks, bin
> [earlier messages quoted again; snipped.]
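A sketch of how the collection step could be written against GCC's dominance API (my hypothetical code, not from any posted patch; it assumes post-dominator info has already been computed with calculate_dominance_info (CDI_POST_DOMINATORS)):

/* Hypothetical sketch of the pseudocode above: collect the
   "iteration branches" of LOOP, i.e. two-successor blocks whose
   successors both post-dominate the loop header.  */
static void
collect_iteration_branches (struct loop *loop,
			    vec<basic_block> *iter_branches)
{
  basic_block *body = get_loop_body (loop);

  for (unsigned i = 0; i < loop->num_nodes; i++)
    {
      basic_block l = body[i];
      if (EDGE_COUNT (l->succs) == 2)
	{
	  basic_block l1 = EDGE_SUCC (l, 0)->dest;
	  basic_block l2 = EDGE_SUCC (l, 1)->dest;

	  /* L1, L2 in PosDom (header of loop): both successors
	     post-dominate the loop header.  */
	  if (dominated_by_p (CDI_POST_DOMINATORS, loop->header, l1)
	      && dominated_by_p (CDI_POST_DOMINATORS, loop->header, l2))
	    iter_branches->safe_push (l);
	}
    }
  free (body);
}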
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
-Original Message- From: Jeff Law [mailto:l...@redhat.com] Sent: Thursday, August 20, 2015 1:13 AM To: Ajit Kumar Agarwal; Richard Biener Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
On 08/15/2015 11:01 AM, Ajit Kumar Agarwal wrote:
> All:
>
> Please find the updated patch with the suggestions and feedback incorporated.
>
> Thanks Jeff and Richard for the review comments.
>
> The following changes were done based on the feedback on the RFC comments and the review of the previous patch.
>
> 1. The tracer and path splitting passes are separate passes, so two instances of the pass will run in the end, one doing path splitting and one doing tracing, at different times in the optimization pipeline.
>>I'll have to think about this. I'm not sure I agree totally with Richi's assertion that we should share code with the tracer pass, but I'll give it a good looksie.
> 2. The transform code is shared between the tracer and path splitting passes. The common code is extracted into the function transform_duplicate and placed in tracer.c, and the path splitting pass uses the transform code.
>>OK. I'll take a good look at that.
> 3. The analysis that populates the basic blocks and traverses them using the Fibonacci heap is used in common. This cannot be factored out into a new function, as the tracer pass does more analysis based on the profile, and different heuristics are used in the tracer and path splitting passes.
>>Understood.
> 4. The included headers are the minimal ones required for the path splitting pass.
>>Thanks.
> 5. The earlier patch did the SSA updating with a replace function to preserve the SSA representation required to move the loop latch node (the same as the join block) into its predecessors, leaving the loop latch node as just a forwarding block. Such replace functions are not required, as suggested by Jeff. They go away with this patch, and the transformed code is factored into a function which is shared between the tracer and path splitting passes.
>>Sounds good.
>
> Bootstrapping with the i386 and Microblaze targets works fine. No regression is seen in the Deja GNU tests for Microblaze; there are fewer failures. The Mibench/EEMBC benchmarks were run for the Microblaze target, and a gain of 9.3% is seen in rgbcmy_lite of the EEMBC benchmarks.
>>What do you mean by there are "fewer failures"? Are you saying there are cases where path splitting generates incorrect code, or cases where path splitting produces code that is less efficient, or something else?
I meant that more Deja GNU testcases pass with the path splitting changes.
>>Ah, in that case, that's definitely good news!
Thanks.
>
> SPEC 2000 benchmarks were run with the i386 target and the following performance numbers were achieved.
>
> INT benchmarks with path splitting (ratio) vs INT benchmarks without path splitting (ratio) = 3661.225091 vs 3621.520572
>>That's an impressive improvement. >>Anyway, I'll start taking a close look at this momentarily.
Thanks & Regards Ajit
Jeff
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
-Original Message- From: Jeff Law [mailto:l...@redhat.com] Sent: Thursday, August 20, 2015 3:16 AM To: Ajit Kumar Agarwal; Richard Biener Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
On 08/15/2015 11:01 AM, Ajit Kumar Agarwal wrote:
>
> From cf2b64cc1d6623424d770f2a9ea257eb7e58e887 Mon Sep 17 00:00:00 2001
> From: Ajit Kumar Agarwal
> Date: Sat, 15 Aug 2015 18:19:14 +0200
> Subject: [PATCH] [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation.
>
> Added a new pass on path splitting on tree SSA representation. The
> path splitting optimization does the CFG transformation of the join block
> of the if-then-else same as the loop latch node is moved and merged
> with the predecessor blocks after preserving the SSA representation.
>
> ChangeLog:
> 2015-08-15  Ajit Agarwal
>
>         * gcc/Makefile.in: Add the build of the new file
>         tree-ssa-path-split.c
Instead:
>>         * Makefile.in (OBJS): Add tree-ssa-path-split.o.
>         * gcc/opts.c (OPT_ftree_path_split) : Add an entry for
>         Path splitting pass with optimization flag greater and
>         equal to O2.
>>         * opts.c (default_options_table): Add entry for path splitting
>>         optimization at -O2 and above.
>         * gcc/passes.def (path_split): add new path splitting pass.
>>Capitalize "add".
>         * gcc/tree-ssa-path-split.c: New.
>>Use "New file".
>         * gcc/tracer.c (transform_duplicate): New.
>>Use "New function".
>         * gcc/testsuite/gcc.dg/tree-ssa/path-split-2.c: New.
>         * gcc/testsuite/gcc.dg/path-split-1.c: New.
>>These belong in gcc/testsuite/ChangeLog and remove the "gcc/testsuite" prefix.
>         * gcc/doc/invoke.texi
>         (ftree-path-split): Document.
>         (fdump-tree-path_split): Document.
>>Should just be two lines instead of three.
>>And more generally, there's no need to prefix ChangeLog entries with "gcc/".
>>Now that the ChangeLog nits are out of the way, let's get to stuff that's more interesting.
I will incorporate all the above changes in the upcoming patches.
>
> Signed-off-by: Ajit Agarwal ajit...@xilinx.com
> ---
>  gcc/Makefile.in                              |   1 +
>  gcc/common.opt                               |   4 +
>  gcc/doc/invoke.texi                          |  16 +-
>  gcc/opts.c                                   |   1 +
>  gcc/passes.def                               |   1 +
>  gcc/testsuite/gcc.dg/path-split-1.c          |  65 ++
>  gcc/testsuite/gcc.dg/tree-ssa/path-split-2.c |  60 +
>  gcc/timevar.def                              |   1 +
>  gcc/tracer.c                                 |  37 +--
>  gcc/tree-pass.h                              |   1 +
>  gcc/tree-ssa-path-split.c                    | 330 +++
>  11 files changed, 503 insertions(+), 14 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/path-split-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/path-split-2.c
>  create mode 100644 gcc/tree-ssa-path-split.c
>
> diff --git a/gcc/common.opt b/gcc/common.opt
> index e80eadf..1d02582 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -2378,6 +2378,10 @@ ftree-vrp
>  Common Report Var(flag_tree_vrp) Init(0) Optimization
>  Perform Value Range Propagation on trees
>
> +ftree-path-split
> +Common Report Var(flag_tree_path_split) Init(0) Optimization
> +Perform Path Splitting
>>Maybe "Perform Path Splitting for loop backedges" or something which is a little more descriptive. The above isn't exactly right, so don't use it as-is.
> @@ -9068,6 +9075,13 @@ enabled by default at @option{-O2} and higher. Null pointer check
>  elimination is only done if @option{-fdelete-null-pointer-checks} is enabled.
>
> +@item -ftree-path-split
> +@opindex ftree-path-split
> +Perform Path Splitting on trees. The join blocks of IF-THEN-ELSE same
> +as loop latch node is moved to its predecessor and the loop latch node
> +will be forwarding block. This is enabled by default at @option{-O2}
> +and higher.
>>Needs some work. Maybe something along the lines of >>When two paths of execution merge immediately before a loop latch node, >>try to duplicate the merge node into the two paths.
I will incorporate all the above changes.
> diff --git a/gcc/passes.def b/gcc/passes.def
> index 6b66f8f..20ddf3d 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -82,6 +82,7 @@ along w
[RFC]: Vectorization cost benefit changes.
All: I have done the vectorization cost changes as given below. I have considered only the cost associated with the inside of the loop instead of the outside. The inside scalar and vector costs are considered because the inside costs are more cost-determining than the outside costs.

min_profitable_iters = ((scalar_single_iter_cost - vec_inside_cost) *vf);

The scalar_single_iter_cost calculation uses the hardcoded value 50, which is used for most of the targets, and the scalar cost is multiplied by 50. The vector cost is subtracted from this scalar cost, and as the scalar cost increases, the chance of vectorization is higher for the same vectorization factor, so more loops will be vectorized.

I have not changed the iteration count, which is hardcoded as 50; I will do the changes to replace the 50 with static estimates of the iteration count if you agree on the changes below.

I ran the SPEC CPU2000 benchmarks with the changes below for the i386 target, and significant gains are achieved with respect to the INT and FP benchmarks. Here is the data.

Ratio with the vectorization cost changes (FP benchmarks) vs ratio without the vectorization cost changes (FP benchmarks) = 4640.102 vs 4583.379.
Ratio with the vectorization cost changes (INT benchmarks) vs ratio without the vectorization cost changes (INT benchmarks) = 3812.883 vs 3778.558.

Please give your feedback on the changes below for the vectorization cost benefit.

diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 422b883..35d538f 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -2987,11 +2987,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     min_profitable_iters = 1;
   else
     {
-      min_profitable_iters = ((vec_outside_cost - scalar_outside_cost) * vf
-			      - vec_inside_cost * peel_iters_prologue
-			      - vec_inside_cost * peel_iters_epilogue)
-			     / ((scalar_single_iter_cost * vf)
-				- vec_inside_cost);
+      min_profitable_iters = ((scalar_single_iter_cost
+			       - vec_inside_cost) *vf);

       if ((scalar_single_iter_cost * vf * min_profitable_iters)
 	  <= (((int) vec_inside_cost * min_profitable_iters)

Thanks & Regards Ajit
vect.diff Description: vect.diff
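For reference, my reading of what the original expression computes (based on the break-even comment above vect_estimate_min_profitable_iters in tree-vect-loop.c; this summary is mine, not part of the RFC): the vector version becomes profitable once

  SIC * niters + SOC > VIC * ((niters - PL_ITERS - EP_ITERS) / VF) + VOC

where SIC = scalar iteration cost, VIC = vector inside (per-iteration) cost, VOC = vector outside cost, SOC = scalar outside cost, VF = vectorization factor, and PL_ITERS / EP_ITERS = prologue / epilogue iterations. Solving that inequality for niters yields the original expression

  min_profitable_iters
    = ((VOC - SOC) * VF - VIC * (PL_ITERS + EP_ITERS))
      / (SIC * VF - VIC)

so the proposed replacement drops the outside-cost and peeling terms entirely.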
RE: [RFC]: Vectorization cost benefit changes.
-Original Message- From: Richard Biener [mailto:richard.guent...@gmail.com] Sent: Friday, August 21, 2015 2:03 PM To: Ajit Kumar Agarwal Cc: Jeff Law; GCC Patches; g...@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [RFC]: Vectorization cost benefit changes. On Fri, Aug 21, 2015 at 7:18 AM, Ajit Kumar Agarwal wrote: > All: > > I have done the vectorization cost changes as given below. I have considered > only the cost associated with the inner instead of outside. > The consideration of inside scalar and vector cost is done as the inner cost > are the most cost effective than the outside cost. >>I think you are confused about what the variables cost are associated to. >>You are changing a place that computes also the cost for >>non-outer-loop->>vectorization so your patch is clearly not applicable. >>vec_outside_cost is the cost of setting up invariants for example. >>All costs apply to the "outer" loop - if there is a nested loop inside that >>loop its costs are folded into the "outer" loop cost already at this stage. >>So I think your analysis is simply wrong and thus your patch. >>You need to find another place to fix inner loop cost. Thanks for your valuable suggestions and feedback. I will certainly look into it. Thanks & Regards Ajit Richard. > min_profitable_iters = ((scalar_single_iter_cost > - vec_inside_cost) *vf); > > The Scalar_single_iter_cost consider the hardcoded value 50 which is > used for most of the targets and the scalar cost is multiplied With > 50. This scalar cost is subtracted with vector cost and as the scalar cost is > increased the chances of vectorization is more with same Vectorization factor > and more loops will be vectorized. > > I have not changed the iteration count which is hardcoded with 50 and > I will do the changes to replace the 50 with the static Estimates of > iteration count if you agree upon the below changes. > > I have ran the SPEC cpu 2000 benchmarks with the below changes for > i386 targets and the significant gains are achieved with respect To INT and > FP benchmarks. > > Here is the data. > > Ratio of vectorization cost changes(FP benchmarks) vs Ratio of without > vectorization cost changes( FP benchmarks) = 4640.102 vs 4583.379. > Ratio of vectorization cost changes (INT benchmarks ) vs Ratio of > without vectorization cost changes( INT benchmarks0 = 3812.883 vs > 3778.558 > > Please give your feedback on the below changes for vectorization cost benefit. > > diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index > 422b883..35d538f 100644 > --- a/gcc/tree-vect-loop.c > +++ b/gcc/tree-vect-loop.c > @@ -2987,11 +2987,8 @@ vect_estimate_min_profitable_iters (loop_vec_info > loop_vinfo, > min_profitable_iters = 1; >else > { > - min_profitable_iters = ((vec_outside_cost - scalar_outside_cost) * > vf > - - vec_inside_cost * peel_iters_prologue > - - vec_inside_cost * peel_iters_epilogue) > - / ((scalar_single_iter_cost * vf) > -- vec_inside_cost); > + min_profitable_iters = ((scalar_single_iter_cost > +- vec_inside_cost) *vf); > >if ((scalar_single_iter_cost * vf * min_profitable_iters) ><= (((int) vec_inside_cost * min_profitable_iters) > > Thanks & Regards > Ajit
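To illustrate the distinction Richard is drawing between inside and outside costs (my example, with informal labels rather than GCC's internal variable names):

    /* Sketch: where the cost-model terms come from for a simple loop.  */
    void
    axpy (int *a, int *b, int c, int n)
    {
      for (int i = 0; i < n; i++)
        a[i] = b[i] + c;
    }

    /* When vectorized:
         vc = {c, c, c, c};         -- paid once: part of vec_outside_cost
                                       (setting up the invariant broadcast)
         per vf-wide iteration:
           va = load b[i..i+vf-1];
           va = va + vc;
           store a[i..i+vf-1];      -- paid per iteration: vec_inside_cost
       Prologue/epilogue peel iterations run scalar code and are costed
       separately via peel_iters_prologue/peel_iters_epilogue.  */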
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
-Original Message-
From: Jeff Law [mailto:l...@redhat.com]
Sent: Thursday, August 20, 2015 9:19 PM
To: Ajit Kumar Agarwal; Richard Biener
Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

On 08/20/2015 09:38 AM, Ajit Kumar Agarwal wrote:
>>
>> Bootstrapping with i386 and Microblaze target works fine. No
>> regression is seen in Deja GNU tests for Microblaze. There are lesser
>> failures. Mibench/EEMBC benchmarks were run for Microblaze target and
>> the gain of 9.3% is seen in rgbcmy_lite the EEMBC benchmarks.
>>> What do you mean by there are "lesser failures"? Are you saying there are
>>> cases where path splitting generates incorrect code, or cases where path
>>> splitting produces code that is less efficient, or something else?
>
> I meant there are more Deja GNU testcases passes with the path splitting
> changes.
>>Ah, in that case, that's definitely good news!

Thanks. The following testcase is testsuite/gcc.dg/tree-ssa/ifc-5.c:

void dct_unquantize_h263_inter_c (short *block, int n, int qscale, int nCoeffs)
{
  int i, level, qmul, qadd;

  qadd = (qscale - 1) | 1;
  qmul = qscale << 1;

  for (i = 0; i <= nCoeffs; i++)
    {
      level = block[i];
      if (level < 0)
        level = level * qmul - qadd;
      else
        level = level * qmul + qadd;
      block[i] = level;
    }
}

The above loop is a candidate for path splitting, since the IF paths merge at the latch of the loop: path splitting duplicates the latch of the loop, i.e. the statement block[i] = level, into the predecessor THEN and ELSE blocks. Because of this path splitting, if-conversion is disabled, the IF-THEN-ELSE is not if-converted, and the test case fails.

The following review comments were made on the earlier version of the patch.

> +/* This function performs the feasibility tests for path splitting
> + to perform. Return false if the feasibility for path splitting
> + is not done and returns true if the feasibility for path splitting
> + is done. Following feasibility tests are performed.
> +
> + 1. Return false if the join block has rhs casting for assign
> + gimple statements.

Comments from Jeff:

>>These seem totally arbitrary. What's the reason behind each of these
>>restrictions? None should be a correctness requirement AFAICT.

In the above patch I perform the check given in point 1 on the loop latch; path splitting is then disabled, if-conversion happens, and the test case passes. When I incorporate the review comment and drop the feasibility check of point 1, the above test case goes through path splitting, if-conversion no longer happens, and the test case fails (it expects the pattern "Applying if conversion" to be present). With the patch given for review, the feasibility check for a cast assignment in the loop latch, as given in point 1, disables path splitting, if-conversion happens, and the test case passes.

Please let me know whether to keep the feasibility check given in point 1, or what more appropriate change is required for this path-splitting-vs-if-conversion test case scenario.

Thanks & Regards
Ajit

jeff
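A source-level sketch of the transform on this loop body may make the conflict clearer (my sketch; the pass actually works on the GIMPLE CFG, not on source):

    /* Loop body before path splitting: a diamond merging at the latch,
       exactly the shape the if-converter wants.  */
    level = block[i];
    if (level < 0)
      level = level * qmul - qadd;
    else
      level = level * qmul + qadd;
    block[i] = level;              /* join block == latch */

    /* After path splitting the latch statement is duplicated into both
       arms, so the if-converter no longer sees a simple diamond:  */
    level = block[i];
    if (level < 0)
      {
        level = level * qmul - qadd;
        block[i] = level;          /* copy in THEN arm */
      }
    else
      {
        level = level * qmul + qadd;
        block[i] = level;          /* copy in ELSE arm */
      }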
RE: [PATCH GCC][rework]Improve loop bound info by simplifying conversions in iv base
-Original Message-
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Bin Cheng
Sent: Thursday, August 27, 2015 3:12 PM
To: gcc-patches@gcc.gnu.org
Subject: [PATCH GCC][rework]Improve loop bound info by simplifying conversions in iv base

Hi,
>>This is a rework for https://gcc.gnu.org/ml/gcc-patches/2015-07/msg02335.html, with review comments addressed. For now, SCEV may compute iv base in the form of "(signed T)((unsigned T)base + step))". This complicates other optimizations/analysis depending on SCEV because it's hard to dive into type conversions. This kind of type conversions can be simplified with additional range information implied by loop initial conditions. This patch does such simplification.
>>With simplified iv base, loop niter analysis can compute more accurate bound information since sensible value range can be derived for "base+step". For example, accurate loop bound&may_be_zero information is computed for cases added by this patch.
>>The code is actually moved from loop_exits_before_overflow. After this patch, the corresponding code in loop_exits_before_overflow will never be executed, so I removed that part of the code. The patch also includes some code format changes.
>>Bootstrap and test on x86_64. Is it OK?

Scalar evolution calculates the chrec ("base", "+", "step") through chains of recurrences over induction variable expressions, propagating values in the SSA representation to arrive at the above chrec. If the value assigned to the base is unsigned while the declaration of the base is signed, is the above chrec then derived with a conversion from unsigned to signed? Such type conversions can be ignored in the calculation of the iteration bound, as they cannot overflow in any case. Is that what the below patch aims at?

Thanks & Regards
Ajit

Thanks,
bin

2015-08-27 Bin Cheng

* tree-ssa-loop-niter.c (tree_simplify_using_condition_1): Support new parameter.
(tree_simplify_using_condition): Ditto.
(simplify_using_initial_conditions): Ditto.
(loop_exits_before_overflow): Pass new argument to function simplify_using_initial_conditions. Remove case for type conversions simplification.
* tree-ssa-loop-niter.h (simplify_using_initial_conditions): New parameter.
* tree-scalar-evolution.c (simple_iv): Simplify type conversions in iv base using loop initial conditions.

gcc/testsuite/ChangeLog
2015-08-27 Bin Cheng

* gcc.dg/tree-ssa/loop-bound-2.c: New test.
* gcc.dg/tree-ssa/loop-bound-4.c: New test.
* gcc.dg/tree-ssa/loop-bound-6.c: New test.
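To illustrate the kind of chrec the patch targets (my example; the exact trees GCC builds may differ from this):

    /* For a loop like this, SCEV can end up describing an IV whose base
       is (int)((unsigned int)start + 1), i.e. the "+ 1" is performed in
       the unsigned type and converted back.  The loop's initial
       condition (start < n here) can imply that the unsigned addition
       does not wrap, which lets niter analysis strip the conversions,
       see the base as start + 1 directly, and compute a tighter bound.  */
    void
    f (int *a, int start, int n)
    {
      if (start < n)
        for (int i = start; i < n; i++)
          a[i] = 0;
    }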
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
All:

Thanks Jeff for the review comments. The attached patch incorporates all the review comments given below. Bootstrapped on i386 and Microblaze; the Deja GNU test results for Microblaze look fine.

[Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation.

Added a new pass for path splitting on the tree SSA representation. The path splitting optimization does the CFG transformation when the two execution paths of the IF-THEN-ELSE merge at the latch node of a loop: it duplicates the merge node into the two paths, preserving the SSA semantics.

ChangeLog:
2015-09-05 Ajit Agarwal

* Makefile.in (OBJS): Add tree-ssa-path-split.o.
* common.opt (ftree-path-split): Add the new flag.
* opts.c (default_options_table): Add an entry for path splitting optimization at -O2 and above.
* passes.def (path_split): Add new path splitting pass.
* timevar.def (TV_TREE_PATH_SPLIT): New.
* tree-pass.h (make_pass_path_split): New declaration.
* tree-ssa-path-split.c: New file.
* tracer.c (transform_duplicate): New function.
* tracer.h: New header file.
* doc/invoke.texi (ftree-path-split): Document. (fdump-tree-path_split): Document.
* testsuite/gcc.dg/path-split-1.c: New.

Signed-off-by: Ajit Agarwal ajit...@xilinx.com

Thanks & Regards
Ajit

-Original Message-
From: Jeff Law [mailto:l...@redhat.com]
Sent: Thursday, August 20, 2015 3:16 AM
To: Ajit Kumar Agarwal; Richard Biener
Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

On 08/15/2015 11:01 AM, Ajit Kumar Agarwal wrote:
>
> From cf2b64cc1d6623424d770f2a9ea257eb7e58e887 Mon Sep 17 00:00:00 2001
> From: Ajit Kumar Agarwal
> Date: Sat, 15 Aug 2015 18:19:14 +0200
> Subject: [PATCH] [Patch,tree-optimization]: Add new path Splitting pass on
> tree ssa representation.
>
> Added a new pass on path splitting on tree SSA representation. The
> path splitting optimization does the CFG transformation of join block
> of the if-then-else same as the loop latch node is moved and merged
> with the predecessor blocks after preserving the SSA representation.
>
> ChangeLog:
> 2015-08-15 Ajit Agarwal
>
> * gcc/Makefile.in: Add the build of the new file
> tree-ssa-path-split.c

Instead:
* Makefile.in (OBJS): Add tree-ssa-path-split.o.

> * gcc/opts.c (OPT_ftree_path_split) : Add an entry for
> Path splitting pass with optimization flag greater and
> equal to O2.

* opts.c (default_options_table): Add entry for path splitting optimization at -O2 and above.

> * gcc/passes.def (path_split): add new path splitting pass.

Capitalize "add".

> * gcc/tree-ssa-path-split.c: New.

Use "New file".

> * gcc/tracer.c (transform_duplicate): New.

Use "New function".

> * gcc/testsuite/gcc.dg/tree-ssa/path-split-2.c: New.
> * gcc/testsuite/gcc.dg/path-split-1.c: New.

These belong in gcc/testsuite/ChangeLog and remove the "gcc/testsuite" prefix.

> * gcc/doc/invoke.texi
> (ftree-path-split): Document.
> (fdump-tree-path_split): Document.

Should just be two lines instead of three. And more generally, there's no need to prefix ChangeLog entries with "gcc/".

Now that the ChangeLog nits are out of the way, let's get to stuff that's more interesting.
> > Signed-off-by:Ajit agarwalajit...@xilinx.com > --- > gcc/Makefile.in | 1 + > gcc/common.opt | 4 + > gcc/doc/invoke.texi | 16 +- > gcc/opts.c | 1 + > gcc/passes.def | 1 + > gcc/testsuite/gcc.dg/path-split-1.c | 65 ++ > gcc/testsuite/gcc.dg/tree-ssa/path-split-2.c | 60 + > gcc/timevar.def | 1 + > gcc/tracer.c | 37 +-- > gcc/tree-pass.h | 1 + > gcc/tree-ssa-path-split.c| 330 > +++ > 11 files changed, 503 insertions(+), 14 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/path-split-1.c > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/path-split-2.c > create mode 100644 gcc/tree-ssa-path-split.c > > diff --git a/gcc/common.opt b/gcc/common.opt > index e80eadf..1d02582 100644 > --- a/gcc/common.opt > +++ b/gcc/common.opt > @@ -2378,6 +2378,10 @@ ftree-vrp > Common Report Var(flag_tree_vrp) Init(0) Optimization > Perform Value Range Propagation on trees > > +ftree-path-spli
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
-Original Message-
From: Jeff Law [mailto:l...@redhat.com]
Sent: Thursday, September 10, 2015 3:10 AM
To: Ajit Kumar Agarwal; Richard Biener
Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

On 08/26/2015 11:29 PM, Ajit Kumar Agarwal wrote:
>
> Thanks. The following testcase testsuite/gcc.dg/tree-ssa/ifc-5.c
>
> void dct_unquantize_h263_inter_c (short *block, int n, int qscale, int nCoeffs)
> {
>   int i, level, qmul, qadd;
>
>   qadd = (qscale - 1) | 1;
>   qmul = qscale << 1;
>
>   for (i = 0; i <= nCoeffs; i++)
>     {
>       level = block[i];
>       if (level < 0)
>         level = level * qmul - qadd;
>       else
>         level = level * qmul + qadd;
>       block[i] = level;
>     }
> }
>
> The above Loop is a candidate of path splitting as the IF block merges
> at the latch of the Loop and the path splitting duplicates the latch
> of the loop which is the statement block[i] = level into the
> predecessors THEN and ELSE block.
>
> Due to above path splitting, the IF conversion is disabled and the
> above IF-THEN-ELSE is not IF-converted and the test case fails.

>>So I think the question then becomes which of the two styles generally
>>results in better code? The path-split version or the older if-converted
>>version.
>>If the latter, then this may suggest that we've got the path splitting code
>>at the wrong stage in the optimizer pipeline or that we need better
>>heuristics for when to avoid applying path splitting.

The code generated by path splitting is useful when it exposes DCE, PRE and CCP candidates, whereas if-conversion is useful when it exposes vectorization candidates. If if-conversion does not expose vectorization and path splitting does not expose DCE/PRE redundancy candidates, it is hard to predict which is better. If if-conversion does not expose vectorization but path splitting exposes DCE, PRE and CCP redundancy candidates, then path splitting is useful. Path splitting also increases the granularity of the THEN and ELSE paths, which makes for better register allocation and code scheduling.

The suggestion to keep path splitting later in the pipeline, after if-conversion and vectorization, is useful in that it does not break the existing Deja GNU tests. Keeping path splitting after if-conversion and vectorization is also useful because path splitting can still duplicate the merge node into its predecessors when if-conversion and vectorization turn out not to be applicable to the loop. But it suppresses the CCP and PRE candidates, since those passes run earlier in the optimization pipeline.

>
> There were the following review comments on the above patch.
>
> +/* This function performs the feasibility tests for path splitting
>> + to perform. Return false if the feasibility for path splitting
>> + is not done and returns true if the feasibility for path
>> splitting + is done. Following feasibility tests are performed.
>> + + 1. Return false if the join block has rhs casting for assign
>> + gimple statements.
>
> Comments from Jeff:
>
>>> These seem totally arbitrary. What's the reason behind each of
>>> these restrictions? None should be a correctness requirement
>>> AFAICT.
>
> In the above patch I have made a check given in point 1. in the loop
> latch and the Path splitting is disabled and the IF-conversion happens
> and the test case passes.

>>That sounds more like a work-around/hack. There's nothing inherent with a
>>type conversion that should disable path splitting.

I have sent the patch with this change and I will remove the above check from the patch.

>>What happens if we delay path splitting to a point after if-conversion is
>>complete?

This is a better suggestion, as explained above, but the gains achieved by keeping path splitting earlier in the pipeline, before if-conversion, tree-vectorization and tree-vrp, are suppressed when the optimizations that follow path splitting are not applicable to the given loops. I have made the above changes and the existing setup does not break, but the gain achieved in benchmarks like rgbcmy_lite (EEMBC) is suppressed. Path splitting gives a gain of 9% on that EEMBC benchmark, and for such loops if-conversion and vectorization are not applicable, so the gain is exposed by the path splitting optimization alone.

>>Alternately, could if-conversion export a routine which indicates if a
>>particular sub-graph is likely to be if-convertable? The path splitting pass
>>could then use that routine t
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
Hello Jeff: Sorry for the delay in sending the benchmarks run with Split-Path change. Here is the Summary of the results. SPEC CPU 2000 INT benchmarks ( Target i386) ( Geomean Score without Split-Paths changes vs Geomean Score with Split-Path changes = 3740.789 vs 3745.193). SPEC CPU 2000 FP benchmarks. ( Target i386) ( Geomean Score without Split-Paths changes vs Geomean Score with Split-Path changes = 4721.655 vs 4741.825). Mibench/EEMBC benchmarks (Target Microblaze) Automotive_qsort1(4.03%), Office_ispell(4.29%), Office_stringsearch1(3.5%). Telecom_adpcm_d( 1.37%), ospfv2_lite(1.35%). We are seeing minor negative gains that are mainly noise.(less than 0.5%) Thanks & Regards Ajit -Original Message- From: Jeff Law [mailto:l...@redhat.com] Sent: Friday, December 11, 2015 1:39 AM To: Richard Biener Cc: Ajit Kumar Agarwal; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation On 12/03/2015 07:38 AM, Richard Biener wrote: > This pass is now enabled by default with -Os but has no limits on the > amount of stmts it copies. The more statements it copies, the more likely it is that the path spitting will turn out to be useful! It's counter-intuitive. The primary benefit AFAICT with path splitting is that it exposes additional CSE, DCE, etc opportunities. IIRC Ajit posited that it could help with live/conflict analysis, I never saw that, and with the changes to push splitting deeper into the pipeline I'd further life/conflict analysis since that work also involved preserving the single latch property. It also will make all loops with this shape have at least two > exits (if the resulting loop will be disambiguated the inner loop will > have two exits). > Having more than one exit will disable almost all loop optimizations after it. Hmmm, the updated code keeps the single latch property, but I'm pretty sure it won't keep a single exit policy. To keep a single exit policy would require keeping an additional block around. Each of the split paths would unconditionally transfer to this new block. The new block would then either transfer to the latch block or out of the loop. > > The pass itself documents the transform it does but does zero to motivate it. > > What's the benefit of this pass (apart from disrupting further optimizations)? It's essentially building superblocks in a special case to enable additional CSE, DCE and the like. Unfortunately what is is missing is heuristics and de-duplication. The former to drive cases when it's not useful and the latter to reduce codesize for any statements that did not participate in optimizations when they were duplicated. The de-duplication is the "sink-statements-through-phi" problems, cross jumping, tail merging and the like class of problems. It was only after I approved this code after twiddling it for Ajit that I came across Honza's tracer implementation, which may in fact be retargettable to these loops and do a better job. I haven't experimented with that. > > I can see a _single_ case where duplicating the latch will allow > threading one of the paths through the loop header to eliminate the > original exit. Then disambiguation may create a nice nested loop out > of this. Of course that is only profitable again if you know the > remaining single exit of the inner loop (exiting to the outer one) is > executed infrequently (thus the inner loop actually loops). It wasn't ever about threading. 
> > But no checks other than on the CFG shape exist (oh, it checks it will > at _least_ copy two stmts!). Again, the more statements it copies the more likely it is to be profitable. Think superblocks to expose CSE, DCE and the like. > > Given the profitability constraints above (well, correct me if I am > wrong on these) it looks like the whole transform should be done > within the FSM threading code which might be able to compute whether > there will be an inner loop with a single exit only. While it shares some concepts with jump threading, I don't think the transformation belongs in jump threading. > > I'm inclined to request the pass to be removed again or at least > disabled by default. I wouldn't lose any sleep if we disabled by default or removed, particularly if we can repurpose Honza's code. In fact, I might strongly support the former until we hear back from Ajit on performance data. I also keep coming back to Click's paper on code motion -- in that context, copying statements would be a way to break dependencies and give the global code motion algorithm more freedom. The advantage of doing it in a framework like Click's is it's got a built-in sinking step. > > What
RE: [Patch,rtl Optimization]: Better register pressure estimate for Loop Invariant Code Motion.
The patch is modified based on the review comments from Richard, Bernd and David. The following changes were made to incorporate the comments received on the previous mail.

1. With this patch the liveness of the loop is not stored in the LOOPDATA structures. The liveness is calculated for the current loop for which the invariant is checked, and regs_used is calculated from it.
2. Memory leaks are fixed.
3. Reworked the comments section based on Bernd's comments.

Bootstrapped and regtested for i386 and Microblaze targets. SPEC CPU 2000 benchmarks were run on the i386 target; the following is a summary of the results.

SPEC CPU 2000 INT benchmarks (Geomean score without the change vs Geomean score with the reg pressure change = 3745.193 vs 3745.328).
SPEC CPU 2000 FP benchmarks (Geomean score without the change vs Geomean score with the reg pressure change = 4741.825 vs 4748.364).

[Patch,rtl Optimization]: Better register pressure estimate for loop invariant code motion

Calculate the loop liveness, used for regs_used, in the register pressure estimate of the cost calculation. Loop liveness is based on the following properties. We only need to find the set of objects that are live at the birth, or the header, of the loop. We don't need to calculate the live-through set of the loop by considering the live-in and live-out of all the basic blocks of the loop. This is based on the point that the set of objects that are live-in at the birth or header of the loop will be live-in at every node in the loop.

If a variable v is live out at the header of the loop, then the variable is live-in at every node in the loop. To prove this, consider a loop L with header h such that the variable v defined at d is live-in at h. Since v is live at h, d is not part of L. This follows from the dominance property, i.e. h is strictly dominated by d. Furthermore, there exists a path from h to a use of v which does not go through d. For every node p in the loop, since the loop is a strongly connected component of the CFG, there exists a path, consisting only of nodes of L, from p to h. Concatenating these two paths proves that v is live-in and live-out of p.

Calculate the live-out and live-in for the exit edge of the loop. This patch considers liveness not only for the loop latch but also outside the loop.

ChangeLog:
2015-12-15 Ajit Agarwal

* loop-invariant.c (find_invariants_to_move): Add the logic of regs_used based on liveness.
* cfgloopanal.c (estimate_reg_pressure_cost): Update the heuristics in presence of call_p.

Signed-off-by: Ajit Agarwal ajit...@xilinx.com.

Thanks & Regards
Ajit

-Original Message-
From: Bernd Schmidt [mailto:bschm...@redhat.com]
Sent: Wednesday, December 09, 2015 7:34 PM
To: Ajit Kumar Agarwal; Richard Biener
Cc: Jeff Law; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,rtl Optimization]: Better register pressure estimate for Loop Invariant Code Motion.

On 12/09/2015 12:22 PM, Ajit Kumar Agarwal wrote:
>
> This is because the available_regs = 6 and the regs_needed = 1 and
> new_regs = 0 and the regs_used = 10. As the reg_used that are based
> on the Liveness given above is greater than the available_regs, then
> it's candidate of spill and the estimate_register_pressure calculates
> the spill cost. This spill cost is greater than inv_cost and gain
> comes to be negative. The disables the loop invariant for the above
> testcase.

As far as I can tell this loop does not lead to a spill.
Hence, failure of this testcase would suggest there is something wrong with your idea. >> + FOR_EACH_LOOP (loop, LI_FROM_INNERMOST) { Formatting. >> +/* Loop Liveness is based on the following proprties. "properties" >> + we only require to calculate the set of objects that are live at >> + the birth or the header of the loop. >> + We don't need to calculate the live through the Loop considering >> + Live in and Live out of all the basic blocks of the Loop. This is >> + because the set of objects. That are live-in at the birth or header >> + of the loop will be live-in at every node in the Loop. >> + If a v live out at the header of the loop then the variable is >> live-in >> + at every node in the Loop. To prove this, Consider a Loop L with >> header >> + h such that The variable v defined at d is live-in at h. Since v is >> live >> + at h, d is not part of L. This follows from the dominance property, >> i.e. >> + h is strictly dominated by d. Furthermore, there exists a path from >> h to >> + a use of v which does
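As a side note on the header-liveness idea above, here is a minimal sketch (my own, not the patch's actual code) of how the property can translate into a regs_used estimate, assuming GCC's dataflow machinery has run so that DF_LR_OUT is valid:

    /* Sketch only: estimate the registers live around LOOP from the
       live-out set of its header, per the property argued above.
       Assumes df_analyze has been called.  */
    static unsigned
    loop_regs_used (struct loop *loop)
    {
      unsigned count = 0, regno;
      bitmap_iterator bi;

      /* Everything live out of the header is live throughout the loop,
         so counting pseudos in this one set suffices.  */
      EXECUTE_IF_SET_IN_BITMAP (DF_LR_OUT (loop->header),
                                FIRST_PSEUDO_REGISTER, regno, bi)
        count++;

      return count;
    }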
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
Hello Jeff: Here is more of a data you have asked for. SPEC FP benchmarks. a) No Path Splitting + tracer enabled Geomean Score = 4749.726. b) Path Splitting enabled + tracer enabled. Geomean Score = 4781.655. Conclusion: With both Path Splitting and tracer enabled we got maximum gains. I think we need to have Path Splitting pass. SPEC INT benchmarks. a) Path Splitting enabled + tracer not enabled. Geomean Score = 3745.193. b) No Path Splitting + tracer enabled. Geomean Score = 3738.558. c) Path Splitting enabled + tracer enabled. Geomean Score = 3742.833. Conclusion: We are getting more gains with Path Splitting as compared to tracer. With both Path Splitting and tracer enabled we are also getting gains. I think we should have Path Splitting pass. One more observation: Richard's concern is the creation of multiple exits with Splitting paths through duplication. My observation is, in tracer pass also there is a creation of multiple exits through duplication. I don’t think that’s an issue with the practicality considering the gains we are getting with Splitting paths with more PRE, CSE and DCE. Thanks & Regards Ajit -Original Message- From: Jeff Law [mailto:l...@redhat.com] Sent: Wednesday, December 16, 2015 5:20 AM To: Richard Biener Cc: Ajit Kumar Agarwal; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation On 12/11/2015 03:05 AM, Richard Biener wrote: > On Thu, Dec 10, 2015 at 9:08 PM, Jeff Law wrote: >> On 12/03/2015 07:38 AM, Richard Biener wrote: >>> >>> This pass is now enabled by default with -Os but has no limits on >>> the amount of stmts it copies. >> >> The more statements it copies, the more likely it is that the path >> spitting will turn out to be useful! It's counter-intuitive. > > Well, it's still not appropriate for -Os (nor -O2 I think). -ftracer > is enabled with -fprofile-use (but it is also properly driven to only > trace hot paths) and otherwise not by default at any optimization level. Definitely not appropriate for -Os. But as I mentioned, I really want to look at the tracer code as it may totally subsume path splitting. > > Don't see how this would work for the CFG pattern it operates on > unless you duplicate the exit condition into that new block creating > an even more obfuscated CFG. Agreed, I don't see any way to fix the multiple exit problem. Then again, this all runs after the tree loop optimizer, so I'm not sure how big of an issue it is in practice. >> It was only after I approved this code after twiddling it for Ajit >> that I came across Honza's tracer implementation, which may in fact >> be retargettable to these loops and do a better job. I haven't >> experimented with that. > > Well, I originally suggested to merge this with the tracer pass... I missed that, or it didn't sink into my brain. >> Again, the more statements it copies the more likely it is to be profitable. >> Think superblocks to expose CSE, DCE and the like. > > Ok, so similar to tracer (where I think the main benefit is actually > increasing scheduling opportunities for architectures where it matters). Right. They're both building superblocks, which has the effect of larger windows for scheduling, DCE, CSE, etc. > > Note that both passes are placed quite late and thus won't see much > of the GIMPLE optimizations (DOM mainly). I wonder why they were > not placed adjacent to each other. Ajit had it fairly early, but that didn't play well with if-conversion. 
I just pushed it past if-conversion and vectorization, but before the last DOM pass. That turns out to be where tracer lives too as you noted. >> >> I wouldn't lose any sleep if we disabled by default or removed, particularly >> if we can repurpose Honza's code. In fact, I might strongly support the >> former until we hear back from Ajit on performance data. > > See above for what we do with -ftracer. path-splitting should at _least_ > restrict itself to operate on optimize_loop_for_speed_p () loops. I think we need to decide if we want the code at all, particularly given the multiple-exit problem. The difficulty is I think Ajit posted some recent data that shows it's helping. So maybe the thing to do is ask Ajit to try the tracer independent of path splitting and take the obvious actions based on Ajit's data. > > It should also (even if counter-intuitive) limit the amount of stmt copying > it does - after all there is sth like an instruction cache size which > exceeeding > for loops will never be a good idea (and even smaller special loop caches on > so
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
-Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Richard Biener Sent: Wednesday, December 16, 2015 3:27 PM To: Ajit Kumar Agarwal Cc: Jeff Law; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation On Wed, Dec 16, 2015 at 8:43 AM, Ajit Kumar Agarwal wrote: > Hello Jeff: > > Here is more of a data you have asked for. > > SPEC FP benchmarks. > a) No Path Splitting + tracer enabled > Geomean Score = 4749.726. > b) Path Splitting enabled + tracer enabled. > Geomean Score = 4781.655. > > Conclusion: With both Path Splitting and tracer enabled we got maximum gains. > I think we need to have Path Splitting pass. > > SPEC INT benchmarks. > a) Path Splitting enabled + tracer not enabled. > Geomean Score = 3745.193. > b) No Path Splitting + tracer enabled. > Geomean Score = 3738.558. > c) Path Splitting enabled + tracer enabled. > Geomean Score = 3742.833. >>I suppose with SPEC you mean SPEC CPU 2006? The performance data is with respect to SPEC CPU 2000 benchmarks. >>Can you disclose the architecture you did the measurements on and the compile >>flags you used otherwise? Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz cpu cores : 10 cache size : 25600 KB I have used -O3 and enable the tracer with -ftracer . Thanks & Regards Ajit >>Note that tracer does a very good job only when paired with FDO so can you >>re-run SPEC with FDO and compare with path-splitting enabled on top of that? Thanks, Richard. > Conclusion: We are getting more gains with Path Splitting as compared to > tracer. With both Path Splitting and tracer enabled we are also getting > gains. > I think we should have Path Splitting pass. > > One more observation: Richard's concern is the creation of multiple > exits with Splitting paths through duplication. My observation is, in > tracer pass also there is a creation of multiple exits through duplication. I > don’t think that’s an issue with the practicality considering the gains we > are getting with Splitting paths with more PRE, CSE and DCE. > > Thanks & Regards > Ajit > > > > > -----Original Message- > From: Jeff Law [mailto:l...@redhat.com] > Sent: Wednesday, December 16, 2015 5:20 AM > To: Richard Biener > Cc: Ajit Kumar Agarwal; GCC Patches; Vinod Kathail; Shail Aditya > Gupta; Vidhumouli Hunsigida; Nagaraju Mekala > Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on > tree ssa representation > > On 12/11/2015 03:05 AM, Richard Biener wrote: >> On Thu, Dec 10, 2015 at 9:08 PM, Jeff Law wrote: >>> On 12/03/2015 07:38 AM, Richard Biener wrote: >>>> >>>> This pass is now enabled by default with -Os but has no limits on >>>> the amount of stmts it copies. >>> >>> The more statements it copies, the more likely it is that the path >>> spitting will turn out to be useful! It's counter-intuitive. >> >> Well, it's still not appropriate for -Os (nor -O2 I think). -ftracer >> is enabled with -fprofile-use (but it is also properly driven to only >> trace hot paths) and otherwise not by default at any optimization level. > Definitely not appropriate for -Os. But as I mentioned, I really want to > look at the tracer code as it may totally subsume path splitting. > >> >> Don't see how this would work for the CFG pattern it operates on >> unless you duplicate the exit condition into that new block creating >> an even more obfuscated CFG. > Agreed, I don't see any way to fix the multiple exit problem. 
Then again, > this all runs after the tree loop optimizer, so I'm not sure how big of an > issue it is in practice. > > >>> It was only after I approved this code after twiddling it for Ajit >>> that I came across Honza's tracer implementation, which may in fact >>> be retargettable to these loops and do a better job. I haven't >>> experimented with that. >> >> Well, I originally suggested to merge this with the tracer pass... > I missed that, or it didn't sink into my brain. > >>> Again, the more statements it copies the more likely it is to be profitable. >>> Think superblocks to expose CSE, DCE and the like. >> >> Ok, so similar to tracer (where I think the main benefit is actually >> increasing scheduling opportunities for architectures where it matters). > Right. They're both building superblocks, which has the effect of larger > windows for sched
[RFC, rtl optimization]: Better heuristics for estimate_reg_pressure_cost in presence of call for LICM.
The estimate of target_clobbered_regs based on the call_used arrays is not correct: it is a worst-case heuristic for the registers actually clobbered. It disables many of the loop invariant code motion opportunities in the presence of a call. Instead of considering the spill cost, we consider only the target_reg_cost, aggressively. This change, together with the old logic used for regs_used, gave the following gains.

diff --git a/gcc/cfgloopanal.c b/gcc/cfgloopanal.c
--- a/gcc/cfgloopanal.c
+++ b/gcc/cfgloopanal.c
@@ -373,15 +373,23 @@ estimate_reg_pressure_cost (unsigned n_new, unsigned n_old, bool speed,
   /* If there is a call in the loop body, the call-clobbered registers
      are not available for loop invariants.  */
+  if (call_p)
     available_regs = available_regs - target_clobbered_regs;
-
+
   /* If we have enough registers, we should use them and not restrict
      the transformations unnecessarily.  */
   if (regs_needed + target_res_regs <= available_regs)
     return 0;
 
-  if (regs_needed <= available_regs)
+  /* Estimation of target_clobbered register is based on the call_used
+     arrays which is not the right estimate for the clobbered register
+     used in called function.  Instead of considering the spill cost we
+     consider only the reg_cost aggressively.  */
+
+  if ((regs_needed <= available_regs)
+      || (call_p && (regs_needed <=
+                     (available_regs + target_clobbered_regs
     /* If we are close to running out of registers, try to preserve
        them.  */
     cost = target_reg_cost [speed] * n_new;

SPEC CPU 2000 INT benchmarks (Geomean score without the above change vs Geomean score with the above change = 3745.193 vs 3752.717).
SPEC CPU 2000 FP benchmarks (Geomean score without the above change vs Geomean score with the above change = 4741.825 vs 4792.085).

Thanks & Regards
Ajit
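A small worked example, with invented numbers purely for illustration, of how the relaxed test changes the outcome when the loop body contains a call (call_p):

    /* Invented numbers, not from any benchmark.  */
    int available_regs = 6;          /* after subtracting clobbered regs */
    int target_clobbered_regs = 8;
    int regs_needed = 10;

    /* Old test: regs_needed <= available_regs is 10 <= 6, which fails,
       so the invariant is costed as a spill and the motion is usually
       rejected.
       New test: regs_needed <= available_regs + target_clobbered_regs
       is 10 <= 14, which passes, so only target_reg_cost is charged
       and the invariant can still be moved.  */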
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
Hello Jeff and Richard: Here is the Summary of the FDO(Feedback Directed Optimization ) performance results. SPEC CPU2000 INT benchmarks. a) FDO + Splitting Paths enabled + tracer enabled Geomean Score = 3907.751673. b) FDO + No Splitting Paths + tracer enabled Geomean Score = 3895.191536. SPEC CPU2000 FP benchmarks. a) FDO + Splitting Paths enabled + tracer enabled Geomean Score = 4793.321963 b) FDO + No Splitting Paths + tracer enabled Geomean Score = 4770.855467 The gains are maximum with Split Paths enabled + tracer pass enabled as compared to No Split Paths + tracer enabled. The Split Paths pass is very much required. Thanks & Regards Ajit -Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Ajit Kumar Agarwal Sent: Wednesday, December 16, 2015 3:44 PM To: Richard Biener Cc: Jeff Law; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation -Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Richard Biener Sent: Wednesday, December 16, 2015 3:27 PM To: Ajit Kumar Agarwal Cc: Jeff Law; GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation On Wed, Dec 16, 2015 at 8:43 AM, Ajit Kumar Agarwal wrote: > Hello Jeff: > > Here is more of a data you have asked for. > > SPEC FP benchmarks. > a) No Path Splitting + tracer enabled > Geomean Score = 4749.726. > b) Path Splitting enabled + tracer enabled. > Geomean Score = 4781.655. > > Conclusion: With both Path Splitting and tracer enabled we got maximum gains. > I think we need to have Path Splitting pass. > > SPEC INT benchmarks. > a) Path Splitting enabled + tracer not enabled. > Geomean Score = 3745.193. > b) No Path Splitting + tracer enabled. > Geomean Score = 3738.558. > c) Path Splitting enabled + tracer enabled. > Geomean Score = 3742.833. >>I suppose with SPEC you mean SPEC CPU 2006? The performance data is with respect to SPEC CPU 2000 benchmarks. >>Can you disclose the architecture you did the measurements on and the compile >>flags you used otherwise? Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz cpu cores : 10 cache size : 25600 KB I have used -O3 and enable the tracer with -ftracer . Thanks & Regards Ajit >>Note that tracer does a very good job only when paired with FDO so can you >>re-run SPEC with FDO and compare with path-splitting enabled on top of that? Thanks, Richard. > Conclusion: We are getting more gains with Path Splitting as compared to > tracer. With both Path Splitting and tracer enabled we are also getting > gains. > I think we should have Path Splitting pass. > > One more observation: Richard's concern is the creation of multiple > exits with Splitting paths through duplication. My observation is, in > tracer pass also there is a creation of multiple exits through duplication. I > don’t think that’s an issue with the practicality considering the gains we > are getting with Splitting paths with more PRE, CSE and DCE. 
> > Thanks & Regards > Ajit > > > > > -Original Message- > From: Jeff Law [mailto:l...@redhat.com] > Sent: Wednesday, December 16, 2015 5:20 AM > To: Richard Biener > Cc: Ajit Kumar Agarwal; GCC Patches; Vinod Kathail; Shail Aditya > Gupta; Vidhumouli Hunsigida; Nagaraju Mekala > Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on > tree ssa representation > > On 12/11/2015 03:05 AM, Richard Biener wrote: >> On Thu, Dec 10, 2015 at 9:08 PM, Jeff Law wrote: >>> On 12/03/2015 07:38 AM, Richard Biener wrote: >>>> >>>> This pass is now enabled by default with -Os but has no limits on >>>> the amount of stmts it copies. >>> >>> The more statements it copies, the more likely it is that the path >>> spitting will turn out to be useful! It's counter-intuitive. >> >> Well, it's still not appropriate for -Os (nor -O2 I think). -ftracer >> is enabled with -fprofile-use (but it is also properly driven to only >> trace hot paths) and otherwise not by default at any optimization level. > Definitely not appropriate for -Os. But as I mentioned, I really want to > look at the tracer code as it may totally subsume path splitting. > >> >> Don't see how this would work for the CFG pattern it operates on >> unless you duplicate the exit condition into that new block creating >> an even more obfuscated CFG.
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
Hello Jeff: I am out on vacation till 3rd Jan 2016. Is it okay If I respond on the below once I am back in office. Thanks & Regards Ajit -Original Message- From: Jeff Law [mailto:l...@redhat.com] Sent: Wednesday, December 23, 2015 12:06 PM To: Ajit Kumar Agarwal; Richard Biener Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation On 12/11/2015 02:11 AM, Ajit Kumar Agarwal wrote: > > Mibench/EEMBC benchmarks (Target Microblaze) > > Automotive_qsort1(4.03%), Office_ispell(4.29%), Office_stringsearch1(3.5%). > Telecom_adpcm_d( 1.37%), ospfv2_lite(1.35%). I'm having a real tough time reproducing any of these results. In fact, I'm having a tough time seeing cases where path splitting even applies to the Mibench/EEMBC benchmarks mentioned above. In the very few cases where split-paths might apply, the net resulting assembly code I get is the same with and without split-paths. How consistent are these results? What functions are being affected that in turn impact performance? What options are you using to compile the benchmarks? I'm trying with -O2 -fsplit-paths and -O3 in my attempts to trigger the transformation so that I can look more closely at possible heuristics. Is this with the standard microblaze-elf target? Or with some other target? jeff
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
-Original Message- From: Jeff Law [mailto:l...@redhat.com] Sent: Wednesday, December 23, 2015 12:06 PM To: Ajit Kumar Agarwal; Richard Biener Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation On 12/11/2015 02:11 AM, Ajit Kumar Agarwal wrote: > > Mibench/EEMBC benchmarks (Target Microblaze) > > Automotive_qsort1(4.03%), Office_ispell(4.29%), Office_stringsearch1(3.5%). > Telecom_adpcm_d( 1.37%), ospfv2_lite(1.35%). >>I'm having a real tough time reproducing any of these results. In fact, I'm >>having a tough time seeing cases where path splitting even applies to the >>Mibench/EEMBC benchmarks >>mentioned above. >>In the very few cases where split-paths might apply, the net resulting >>assembly code I get is the same with and without split-paths. >>How consistent are these results? I am consistently getting the gains for office_ispell and office_stringsearch1, telcom_adpcm_d. I ran it again today and we see gains in the same bench mark tests with the split path changes. >>What functions are being affected that in turn impact performance? For office_ispell: The function are Function "linit (linit, funcdef_no=0, decl_uid=2535, cgraph_uid=0, symbol_order=2) for lookup.c file". "Function checkfile (checkfile, funcdef_no=1, decl_uid=2478, cgraph_uid=1, symbol_order=4)" " Function correct (correct, funcdef_no=2, decl_uid=2503, cgraph_uid=2, symbol_order=5)" " Function askmode (askmode, funcdef_no=24, decl_uid=2464, cgraph_uid=24, symbol_order=27)" for correct.c file. For office_stringsearch1: The function is Function "bmhi_search (bmhi_search, funcdef_no=1, decl_uid=2178, cgraph_uid=1, symbol_order=5)" for bmhisrch.c file. >>What options are you using to compile the benchmarks? I'm trying with >>-O2 -fsplit-paths and -O3 in my attempts to trigger the transformation so >>that I can look more closely at possible heuristics. I am using the following flags. -O3 mlittle-endian -mxl-barrel-shift -mno-xl-soft-div -mhard-float -mxl-float-convert -mxl-float-sqrt -mno-xl-soft-mul -mxl-multiply-high -mxl-pattern-compare. To disable split paths -fno-split-paths is used on top of the above flags. >>Is this with the standard microblaze-elf target? Or with some other target? I am using the --target=microblaze-xilinx-elf to build the microblaze target. Thanks & Regards Ajit jeff
[Patch,microblaze]: Optimized register reorganization for Microblaze.
The patch changes the FIXED_REGISTERS and CALL_USED_REGISTERS macros. Earlier, register r21 was marked as fixed and also marked as 1 in CALL_USED_REGISTERS; on top of that, r21 was never assigned to any temporaries in RTL insns. This made it impossible to use r21 in callee functions, wasting one register that could otherwise be allocated there. The change makes r21 allocatable in callee functions; its availability there reduces spills and fetches. This is done by marking r21 as not fixed and setting its CALL_USED_REGISTERS entry to 0. Register r20 was likewise marked as fixed; the change unmarks it, again allowing the register to be used and reducing spills and fetches.

Regtested for Microblaze.

Performance runs were made on the Mibench/EEMBC benchmarks for Microblaze. The following benchmarks show gains.

Benchmarks Gains
automotive_qsort1 = 3.96%
automotive_susan_c = 7.68%
consumer_mad = 9.6%
security_rijndael_d = 19.57%
telecom_CRC32 = 7.66%
bitmnp01_lite = 10.61%
a2time01_lite = 6.97%

ChangeLog:
2016-01-12 Ajit Agarwal

* config/microblaze/microblaze.h (FIXED_REGISTERS): Update in macro.
(CALL_USED_REGISTERS): Update in macro.

Signed-off-by: Ajit Agarwal ajit...@xilinx.com.

---
 gcc/config/microblaze/microblaze.h | 4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/config/microblaze/microblaze.h b/gcc/config/microblaze/microblaze.h
index e115c42..dbfb652 100644
--- a/gcc/config/microblaze/microblaze.h
+++ b/gcc/config/microblaze/microblaze.h
@@ -253,14 +253,14 @@ extern enum pipeline_type microblaze_pipe;
 #define FIXED_REGISTERS \
 { \
   1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, \
-  1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+  1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
   1, 1, 1, 1 \
 }
 
 #define CALL_USED_REGISTERS \
 { \
   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \
-  1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
+  1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
   1, 1, 1, 1 \
 }
 
 #define GP_REG_FIRST    0
-- 
1.7.1

Thanks & Regards
Ajit

reg-reorg.patch
Description: reg-reorg.patch
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
-Original Message- From: Jeff Law [mailto:l...@redhat.com] Sent: Saturday, January 16, 2016 12:03 PM To: Ajit Kumar Agarwal; Richard Biener Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation On 01/04/2016 07:32 AM, Ajit Kumar Agarwal wrote: > > > -Original Message- From: Jeff Law [mailto:l...@redhat.com] > Sent: Wednesday, December 23, 2015 12:06 PM To: Ajit Kumar Agarwal; > Richard Biener Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; > Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: > [Patch,tree-optimization]: Add new path Splitting pass on tree ssa > representation > > On 12/11/2015 02:11 AM, Ajit Kumar Agarwal wrote: >> >> Mibench/EEMBC benchmarks (Target Microblaze) >> >> Automotive_qsort1(4.03%), Office_ispell(4.29%), >> Office_stringsearch1(3.5%). Telecom_adpcm_d( 1.37%), >> ospfv2_lite(1.35%). >>> I'm having a real tough time reproducing any of these results. >>> In fact, I'm having a tough time seeing cases where path splitting >>> even applies to the Mibench/EEMBC benchmarks >>> >>mentioned above. > >>> In the very few cases where split-paths might apply, the net >>> resulting assembly code I get is the same with and without >>> split-paths. > >>> How consistent are these results? > > I am consistently getting the gains for office_ispell and > office_stringsearch1, telcom_adpcm_d. I ran it again today and we see > gains in the same bench mark tests with the split path changes. > >>> What functions are being affected that in turn impact performance? > > For office_ispell: The function are Function "linit (linit, > funcdef_no=0, decl_uid=2535, cgraph_uid=0, symbol_order=2) for > lookup.c file". "Function checkfile (checkfile, funcdef_no=1, > decl_uid=2478, cgraph_uid=1, symbol_order=4)" " Function correct > (correct, funcdef_no=2, decl_uid=2503, cgraph_uid=2, symbol_order=5)" > " Function askmode (askmode, funcdef_no=24, decl_uid=2464, > cgraph_uid=24, symbol_order=27)" for correct.c file. > > For office_stringsearch1: The function is Function "bmhi_search > (bmhi_search, funcdef_no=1, decl_uid=2178, cgraph_uid=1, > symbol_order=5)" for bmhisrch.c file. >>Can you send me the pre-processed lookup.c, correct.c and bmhi_search.c? >>I generated mine using x86 and that may be affecting my ability to reproduce >>your results on the microblaze target. Looking specifically at bmhi_search.c >>and correct.c, I see they are >>going to be sensitive to the target headers. >>If (for exmaple) they use FORTIFY_SOURCE or macros for toupper. >>In the bmhi_search I'm looking at, I don't see any opportunities for the path >>splitter to do anything. The CFG just doesn't have the right shape. Again, >>that may be an artifact of how >>toupper is implemented in the system header >>files -- hence my request for the cpp output on each of the important files. Would you like me to send the above files and function pre-processed with -E option flag. Thanks & Regards Ajit Jeff
RE: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation
-Original Message-
From: Jeff Law [mailto:l...@redhat.com]
Sent: Saturday, January 16, 2016 4:33 AM
To: Ajit Kumar Agarwal; Richard Biener
Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,tree-optimization]: Add new path Splitting pass on tree ssa representation

On 01/14/2016 01:55 AM, Jeff Law wrote:

[ Replying to myself again, mostly to make sure we've got these thoughts in the archives. ]

> Anyway, going back to adpcm_decode, we do end up splitting this path:
>
>    # vpdiff_12 = PHI
>    if (sign_41 != 0)
>      goto <bb 15>;
>    else
>      goto <bb 16>;
> ;;    succ:       15
> ;;                16
>
> ;; basic block 15, loop depth 1
> ;;    pred:       14
>    valpred_51 = valpred_76 - vpdiff_12;
>    goto <bb 17>;
> ;;    succ:       17
>
> ;; basic block 16, loop depth 1
> ;;    pred:       14
>    valpred_52 = vpdiff_12 + valpred_76;
> ;;    succ:       17
>
> ;; basic block 17, loop depth 1
> ;;    pred:       15
> ;;                16
>    # valpred_7 = PHI
>    _85 = MAX_EXPR ;
>    valpred_13 = MIN_EXPR <_85, 32767>;
>    step_53 = stepsizeTable[index_62];
>    outp_54 = outp_69 + 2;
>    _55 = (short int) valpred_13;
>    MEM[base: outp_54, offset: -2B] = _55;
>    if (outp_54 != _74)
>      goto ;
>    else
>      goto ;
>
> This doesn't result in anything particularly interesting/good AFAICT.
> We propagate valpred_51/52 into the use in the MAX_EXPR in the
> duplicate paths, but that doesn't allow any further simplification.

>>So with the heuristic I'm poking at, this gets rejected. Essentially it
>>doesn't think it's likely to expose CSE/DCE opportunities (and it's correct).
>>The number of statements in predecessor blocks that feed operands in the
>>to-be-copied-block is too small relative to the size of the
>>to-be-copied-block.

> Ajit, can you confirm which of adpcm_code or adpcm_decode where path
> splitting is showing a gain? I suspect it's the former but would like
> to make sure so that I can adjust the heuristics properly.

>>I'd still like to have this answered when you can Ajit, just to be 100%
>>that it's the path splitting in adpcm_code that's responsible for the
>>improvements you're seeing in adpcm.

It is adpcm_coder that gets optimized with path splitting, whereas adpcm_decoder is not optimized further by it. In adpcm_decoder the join node is duplicated into its predecessors, but the duplication does not enable any further optimization. In adpcm_coder, path splitting triggers the following optimization.

1. /* Output last step, if needed */
   if ( !bufferstep )
     *outp++ = outputbuffer;

The IF-THEN inside the loop is taken when bufferstep is 1; then the flip happens and bufferstep becomes 0. On the exit path, if bufferstep is 1, the flip converts it to 0 and the IF-THEN above generates the store that assigns outputbuffer to outp.

This sequence is optimized by path splitting: if bufferstep is 1, the exit branch of the loop branches directly to the above store, which does not require flipping bufferstep with an XOR against immediate 1. With this optimization there is a single level of exit branch on the bufferstep == 1 path, which lets the exit branch be scheduled with a meaningful instruction (the store) instead of the XOR with immediate 1.

Without path splitting, if bufferstep is 1 the exit branch of the loop branches to the piece of code flipping it to zero, and the IF-THEN outside the loop then performs the store that assigns outputbuffer to outp. Thus, without path splitting, there are two levels of branching on the exit path where bufferstep is 1 inside the loop, generating unoptimized code. Also, without path splitting, the two-level exit branch of the loop is scheduled with the XOR with immediate 1.

Thanks & Regards
Ajit

jeff
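For reference, the MiBench adpcm_coder output-buffering code being discussed has roughly this shape (paraphrased from the benchmark source; treat it as a sketch rather than an exact quote):

    /* Inside the coding loop: emit one byte for every two 4-bit deltas.  */
    if (bufferstep)
      outputbuffer = (delta << 4) & 0xf0;        /* hold high nibble */
    else
      *outp++ = (delta & 0x0f) | outputbuffer;   /* combine and store */
    bufferstep = !bufferstep;                    /* the "flip" (XOR with 1) */

    /* After the loop: output last step, if needed.  */
    if (!bufferstep)
      *outp++ = outputbuffer;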
RE: [PATCH 0/5] Fix handling of word subregs of wide registers
-Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Richard Sandiford Sent: Monday, September 22, 2014 12:54 PM To: Jeff Law Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH 0/5] Fix handling of word subregs of wide registers Jeff Law writes: > On 09/19/14 01:23, Richard Sandiford wrote: >> Jeff Law writes: >>> On 09/18/14 04:07, Richard Sandiford wrote: This series is a cleaned-up version of: https://gcc.gnu.org/ml/gcc/2014-03/msg00163.html The underlying problem is that the semantics of subregs depend on the word size. You can't have a subreg for byte 2 of a 4-byte word, say, but you can have a subreg for word 2 of a 4-word value (as well as lowpart subregs of that word, etc.). This causes problems when an architecture has wider-than-word registers, since the addressability of a word can then depend on which register class is used. The register allocators need to fix up cases where a subreg turns out to be invalid for a particular class. This is really an extension of what we need to do for CANNOT_CHANGE_MODE_CLASS. Tested on x86_64-linux-gnu, powerpc64-linux-gnu and aarch64_be-elf. >>> I thought we fixed these problems long ago with the change to subreg_byte?!? >> >> No, that was fixing something else. (I'm just about old enough to >> remember that too!) The problem here is that (say): >> >> (subreg:SI (reg:DI X) 4) >> >> is independently addressable on little-endian AArch32 if X assigned >> to a GPR, but not if X is assigned to a vector register. We need to >> allow these kinds of subreg on pseudos in order to decompose >> multiword arithmetic. It's then up to the RA to realise that a >> reload would be needed if X were assigned to a vector register, since >> the upper half of a vector register cannot be independently accessed. >> >> Note that you could write this example even with the old word-style >> offsets and IIRC the effect would have been the same. > OK. So I kept thinking in terms of the byte offset stuff. But what > you're tackling is related to the mess around the mode of the subreg > having a different meaning if its smaller than a word vs word-sized or > greater. > > Right? >>Yeah, that's right. Addressability is based on words, which is inconvenient >>when your registers are bigger than a word. If the architecture like Microblaze which doesn't support the 1 byte or 2 byte registers. In this scenario what should be returned when SUBREG_WORD is used. Thanks, Richard
RE: [PATCH 0/5] Fix handling of word subregs of wide registers
-----Original Message-----
From: Richard Sandiford [mailto:richard.sandif...@arm.com]
Sent: Monday, September 22, 2014 4:56 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 0/5] Fix handling of word subregs of wide registers

Ajit Kumar Agarwal writes:
> Jeff Law writes:
>> On 09/19/14 01:23, Richard Sandiford wrote:
>>> Jeff Law writes:
>>>> On 09/18/14 04:07, Richard Sandiford wrote:
>>>>> This series is a cleaned-up version of:
>>>>>
>>>>>     https://gcc.gnu.org/ml/gcc/2014-03/msg00163.html
>>>>>
>>>>> The underlying problem is that the semantics of subregs depend on
>>>>> the word size.  You can't have a subreg for byte 2 of a 4-byte
>>>>> word, say, but you can have a subreg for word 2 of a 4-word value
>>>>> (as well as lowpart subregs of that word, etc.).  This causes
>>>>> problems when an architecture has wider-than-word registers, since
>>>>> the addressability of a word can then depend on which register
>>>>> class is used.
>>>>>
>>>>> The register allocators need to fix up cases where a subreg turns
>>>>> out to be invalid for a particular class.  This is really an
>>>>> extension of what we need to do for CANNOT_CHANGE_MODE_CLASS.
>>>>>
>>>>> Tested on x86_64-linux-gnu, powerpc64-linux-gnu and aarch64_be-elf.
>>>> I thought we fixed these problems long ago with the change to
>>>> subreg_byte?!?
>>>
>>> No, that was fixing something else.  (I'm just about old enough to
>>> remember that too!)  The problem here is that (say):
>>>
>>>    (subreg:SI (reg:DI X) 4)
>>>
>>> is independently addressable on little-endian AArch32 if X is assigned
>>> to a GPR, but not if X is assigned to a vector register.  We need to
>>> allow these kinds of subreg on pseudos in order to decompose
>>> multiword arithmetic.  It's then up to the RA to realise that a
>>> reload would be needed if X were assigned to a vector register,
>>> since the upper half of a vector register cannot be independently accessed.
>>>
>>> Note that you could write this example even with the old word-style
>>> offsets and IIRC the effect would have been the same.
>> OK.  So I kept thinking in terms of the byte offset stuff.  But what
>> you're tackling is related to the mess around the mode of the subreg
>> having a different meaning if it's smaller than a word vs word-sized
>> or greater.
>>
>> Right?
>
>>>Yeah, that's right.  Addressability is based on words, which is
>>>inconvenient when your registers are bigger than a word.
>
> If an architecture like MicroBlaze doesn't support 1-byte or 2-byte
> registers, what should be returned in this scenario when SUBREG_WORD
> is used?

>>I don't understand the question, sorry.  Subreg offsets are still represented as bytes rather than words.  The patch doesn't change the way that subregs are represented or the rules about which subregs are valid.

>>Both before and after the patch, the semantics of subregs say that if you have 4-byte words, only one of:

>>  (subreg:QI (reg:SI X) 0)
>>  (subreg:QI (reg:SI X) 1)
>>  (subreg:QI (reg:SI X) 2)
>>  (subreg:QI (reg:SI X) 3)

>>is ever valid (0 for little-endian, 3 for big-endian).  Writing to that one valid subreg will change the whole of X, unless the subreg is wrapped in a strict_lowpart.  In other words, subregs are defined so that individual parts of a word are not independently addressable.

>>However, individual words of a multiword register _are_ addressable.  I.e.:

>>  (subreg:SI (reg:DI Y) 0)
>>  (subreg:SI (reg:DI Y) 4)

>>are both valid.  Writing to one does not change the other.

>>The problem the patch was trying to solve was that you can have targets with 4-byte words but some 8-byte registers.  In those cases, it's still possible to form both of the Y subregs above if Y is allocated to a word-sized register, but not if Y is allocated to a doubleword-sized register.

Thanks, Richard, for the explanation.

Thanks,
Richard
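To make the decomposition concrete, here is a small illustrative sketch of my own (not from the thread); the function name add64 and the comments are assumptions for illustration only.

/* Illustrative sketch only: on a 32-bit target, expanding this 64-bit
   addition decomposes the DImode pseudos into word-sized subregs such
   as (subreg:SI (reg:DI a) 0) and (subreg:SI (reg:DI a) 4).  Both
   subregs are valid and independently addressable while the pseudo
   lives in a pair of 32-bit GPRs; if the register allocator instead
   puts the pseudo into one 8-byte (e.g. vector) register, the
   upper-word subreg is no longer independently accessible and the RA
   must fix it up with a reload -- the situation this series handles.  */
unsigned long long
add64 (unsigned long long a, unsigned long long b)
{
  /* Becomes an add/add-with-carry pair on the two SImode halves.  */
  return a + b;
}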
RE: [Patch,Microblaze]: Added Break Handler Support
Hello Michael:

The following patch handles software and hardware breaks in the MicroBlaze architecture.  There are no regressions in the DejaGnu testsuite, and the attached testcase passes.  Review comments are incorporated.

Okay for trunk?

Thanks & Regards
Ajit

From 15dfaee8feef37430745d3dbc58f74bed876aabb Mon Sep 17 00:00:00 2001
From: Ajit Kumar Agarwal
Date: Tue, 13 May 2014 13:25:52 +0530
Subject: [PATCH] [Patch, microblaze] Added Break Handler support

Added Break Handler support to incorporate the hardware and software
break.  The Break Handler routine will generate the rtbd instruction.
At the call point, software breaks are generated with the brki
instruction using r16 as the register operand.

2014-05-13  Ajit Agarwal

	* config/microblaze/microblaze.c
	(microblaze_break_function_p, microblaze_is_break_handler): New.
	* config/microblaze/microblaze.h (BREAK_HANDLER_NAME): New macro.
	* config/microblaze/microblaze.md: Extended support for generation
	of brki instruction and rtbd instruction.
	* config/microblaze/microblaze-protos.h
	(microblaze_break_function_p, microblaze_is_break_handler): New
	declarations.
	* testsuite/gcc.target/microblaze/others/break_handler.c: New.

Signed-off-by: Nagaraju
---
 gcc/config/microblaze/microblaze-protos.h          |    4 +-
 gcc/config/microblaze/microblaze.c                 |   47 +---
 gcc/config/microblaze/microblaze.h                 |    2 +-
 gcc/config/microblaze/microblaze.md                |   34 ++
 .../gcc.target/microblaze/others/break_handler.c   |   15 ++
 5 files changed, 83 insertions(+), 19 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/microblaze/others/break_handler.c

diff --git a/gcc/config/microblaze/microblaze-protos.h b/gcc/config/microblaze/microblaze-protos.h
index b03e9e1..f3cc099 100644
--- a/gcc/config/microblaze/microblaze-protos.h
+++ b/gcc/config/microblaze/microblaze-protos.h
@@ -40,10 +40,12 @@ extern void print_operand_address (FILE *, rtx);
 extern void init_cumulative_args (CUMULATIVE_ARGS *,tree, rtx);
 extern bool microblaze_legitimate_address_p (enum machine_mode, rtx, bool);
 extern int microblaze_is_interrupt_variant (void);
+extern int microblaze_is_break_handler (void);
+extern int microblaze_break_function_p (tree func);
 extern rtx microblaze_return_addr (int, rtx);
 extern int simple_memory_operand (rtx, enum machine_mode);
 extern int double_memory_operand (rtx, enum machine_mode);
-
+extern void microblaze_order_regs_for_local_alloc (void);
 extern int microblaze_regno_ok_for_base_p (int, int);
 extern HOST_WIDE_INT microblaze_initial_elimination_offset (int, int);
 extern void microblaze_declare_object (FILE *, const char *, const char *,
diff --git a/gcc/config/microblaze/microblaze.c b/gcc/config/microblaze/microblaze.c
index ba8109b..fc458a5 100644
--- a/gcc/config/microblaze/microblaze.c
+++ b/gcc/config/microblaze/microblaze.c
@@ -209,6 +209,7 @@ enum reg_class microblaze_regno_to_class[] =
      and epilogue and use appropriate interrupt return.
    save_volatiles       - Similar to interrupt handler, but use normal return.  */
 int interrupt_handler;
+int break_handler;
 int fast_interrupt;
 int save_volatiles;

@@ -217,6 +218,8 @@ const struct attribute_spec microblaze_attribute_table[] = {
      affects_type_identity */
   {"interrupt_handler", 0,       0,     true,    false,   false,        NULL,
     false },
+  {"break_handler",     0,       0,     true,    false,   false,        NULL,
+    false },
   {"fast_interrupt",    0,       0,     true,    false,   false,        NULL,
     false },
   {"save_volatiles"   , 0,       0,     true,    false,   false,        NULL,
@@ -1866,7 +1869,18 @@ microblaze_fast_interrupt_function_p (tree func)
   a = lookup_attribute ("fast_interrupt", DECL_ATTRIBUTES (func));
   return a != NULL_TREE;
 }
+int
+microblaze_break_function_p (tree func)
+{
+  tree a;
+  if (!func)
+    return 0;
+  if (TREE_CODE (func) != FUNCTION_DECL)
+    return 0;

+  a = lookup_attribute ("break_handler", DECL_ATTRIBUTES (func));
+  return a != NULL_TREE;
+}
 /* Return true if FUNC is an interrupt function which uses
    normal return, indicated by the "save_volatiles" attribute.  */
@@ -1891,6 +1905,13 @@ microblaze_is_interrupt_variant (void)
 {
   return (interrupt_handler || fast_interrupt);
 }
+int
+microblaze_is_break_handler (void)
+{
+  return break_handler;
+}
+
+
 /* Determine of register must be saved/restored in call.  */

 static int
@@ -1994,9 +2015,14 @@ compute_frame_size (HOST_WIDE_INT size)
   interrupt_handler =
     microblaze_interrupt_function_p (current_function_decl);
+  break_handler =
+    microblaze_break_function_p (current_function_decl);
+
   fast_interrupt =
     microblaze_fast_interrupt_function_p (current_function_decl);
   save_volatiles = microblaze_save_volatiles (current_function_decl);
+  if (break_handler
RE: [Patch,Microblaze]: Added Break Handler Support
Hello Michael:

Thanks for the comments on the ChangeLog.  The modified ChangeLog is inlined below.

2014-05-13  Ajit Agarwal

	* config/microblaze/microblaze.c (break_handler): New declaration.
	(microblaze_break_function_p, microblaze_is_break_handler): New
	functions.
	(compute_frame_size): Use microblaze_break_function_p.  Add the
	test of break_handler.
	(microblaze_function_prologue, microblaze_function_epilogue): Add
	the test of the variable break_handler.
	(microblaze_globalize_label): Add the test of break_handler.
	* config/microblaze/microblaze.h (BREAK_HANDLER_NAME): New macro.
	* config/microblaze/microblaze.md (*,_internal): Add
	microblaze_is_break_handler () test.
	(call_internal1, call_value_intern): Use microblaze_break_function_p.
	Use SYMBOL_REF_DECL.
	* config/microblaze/microblaze-protos.h
	(microblaze_break_function_p, microblaze_is_break_handler): New
	declarations.
	* testsuite/gcc.target/microblaze/others/break_handler.c: New.

Thanks & Regards
Ajit

-----Original Message-----
From: Michael Eager [mailto:ea...@eagercon.com]
Sent: Tuesday, May 13, 2014 10:30 PM
To: Ajit Kumar Agarwal; gcc-patches@gcc.gnu.org
Cc: Vinod Kathail; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,Microblaze]: Added Break Handler Support

On 05/13/14 02:14, Ajit Kumar Agarwal wrote:
> Hello Michael:
>
> The following patch is to handle Software and Hardware breaks in Microblaze Architecture.
> Deja GNU testcase does not have any regressions and the testcase attached passes through.
> Review comments are incorporated.
>
> Okay for trunk?

Just saying OK would only be appropriate if you had write access to the repository.

> Added Break Handler support to incorporate the hardware and software
> break. The Break Handler routine will be generating the rtbd
> instruction. At the call point where the software breaks are generated
> with the instruction brki with register operand as r16.
>
> 2014-05-13  Ajit Agarwal
>
>     * config/microblaze/microblaze.c
>     (microblaze_break_function_p,microblaze_is_break_handler) : New
>
>     * config/microblaze/microblaze.h (BREAK_HANDLER_NAME) : New macro
>
>     * config/microblaze/microblaze.md :
>     Extended support for generation of brki instruction and rtbd instruction.

A better ChangeLog entry is

    * config/microblaze/microblaze.md (*,_internal): Add
    microblaze_is_break_handler () test.

Give specifics, naming functions, rather than making general comments.
As the ChangeLog standard says:

    It's important to name the changed function or variable in full.
    Don't abbreviate function or variable names, and don't combine
    them.  Subsequent maintainers will often search for a function
    name to find all the change log entries that pertain to it; if
    you abbreviate the name, they won't find it when they search.

    Mention each place where there are changes.

There should be a ChangeLog entry for each non-trivial change.  Your patch made four significant changes to microblaze.md.  There appear to be several changes in microblaze.c, not just the definition of the new functions as shown in your entry.

> * config/microblaze/microblaze-protos.h
> (microblaze_break_function_p,microblaze_is_break_handler) : New
> Declaration.
>
> * testsuite/gcc.target/microblaze/others/break_handler.c : New.

Thanks for the test case.

As mentioned previously, add documentation for _break_handler.

> diff --git a/gcc/config/microblaze/microblaze-protos.h
> b/gcc/config/microblaze/microblaze-protos.h
> index b03e9e1..f3cc099 100644
> --- a/gcc/config/microblaze/microblaze-protos.h
> +++ b/gcc/config/microblaze/microblaze-protos.h

Please include the patch only once, not both inline and again as an attachment.

--
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077
RE: [Patch,Microblaze]: Added Break Handler Support
Hello Michael:

Resubmitting the patch with documentation for _break_handler in config/microblaze/microblaze.h.

Thanks & Regards
Ajit

-----Original Message-----
From: Michael Eager [mailto:ea...@eagercon.com]
Sent: Wednesday, May 14, 2014 12:55 AM
To: Ajit Kumar Agarwal; gcc-patches@gcc.gnu.org
Cc: Vinod Kathail; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,Microblaze]: Added Break Handler Support

On 05/13/14 12:15, Ajit Kumar Agarwal wrote:
> Hello Michael:
>
> Thanks for the comments on ChangeLog. Modified ChangeLog is inlined below.

Please resubmit the patch with documentation for _break_handler.

--
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077

0001-Patch-microblaze-Added-Break-Handler-support.patch
Description: 0001-Patch-microblaze-Added-Break-Handler-support.patch
RE: [Patch,Microblaze]: Added Break Handler Support
;;
+  {
+    if (microblaze_is_break_handler ())
+      return "rtbd\tr16, 8\;%#";
+    else if (microblaze_is_interrupt_variant ()
+             && (!microblaze_is_break_handler()))
+      return "rtid\tr14, 0\;%#";
     else
       return "rtsd\tr15, 8\;%#";
   }
@@ -1962,9 +1965,12 @@
   [(any_return)
    (use (match_operand:SI 0 "register_operand" ""))]
   ""
-  {
-    if (microblaze_is_interrupt_variant ())
-      return "rtid\tr14,0 \;%#";
+  {
+    if (microblaze_is_break_handler ())
+      return "rtbd\tr16,8\;%#";
+    else if (microblaze_is_interrupt_variant ()
+             && (!microblaze_is_break_handler()))
+      return "rtid\tr14,0 \;%#";
     else
       return "rtsd\tr15,8 \;%#";
   }
@@ -2068,8 +2074,14 @@
     register rtx target2 = gen_rtx_REG (Pmode,
                   GP_REG_FIRST + MB_ABI_SUB_RETURN_ADDR_REGNUM);
     if (GET_CODE (target) == SYMBOL_REF)
       {
-        gen_rtx_CLOBBER (VOIDmode, target2);
-        return "brlid\tr15,%0\;%#";
+        if (microblaze_break_function_p (SYMBOL_REF_DECL (target))) {
+          gen_rtx_CLOBBER (VOIDmode, target2);
+          return "brki\tr16,%0\;%#";
+        }
+        else {
+          gen_rtx_CLOBBER (VOIDmode, target2);
+          return "brlid\tr15,%0\;%#";
+        }
       }
     else if (GET_CODE (target) == CONST_INT)
       return "la\t%@,r0,%0\;brald\tr15,%@\;%#";
     else if (GET_CODE (target) == REG)
@@ -2173,13 +2185,15 @@
     if (GET_CODE (target) == SYMBOL_REF)
       {
         gen_rtx_CLOBBER (VOIDmode,target2);
-        if (SYMBOL_REF_FLAGS (target) & SYMBOL_FLAG_FUNCTION)
+        if (microblaze_break_function_p (SYMBOL_REF_DECL (target)))
+          return "brki\tr16,%1\;%#";
+        else if (SYMBOL_REF_FLAGS (target) & SYMBOL_FLAG_FUNCTION)
           {
             return "brlid\tr15,%1\;%#";
           }
         else
           {
-            return "bralid\tr15,%1\;%#";
+            return "bralid\tr15,%1\;%#";
           }
       }
     else if (GET_CODE (target) == CONST_INT)
diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index 9780d92..adb6410 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -3776,6 +3776,18 @@ registers) are saved in the function prologue.  If the function is a leaf
 function, only volatiles used by the function are saved.  A normal function
 return is generated instead of a return from interrupt.

+@item break_handler
+@cindex break handler functions
+Use this attribute on the MicroBlaze ports to indicate that
+the specified function is a break handler.  The compiler generates function
+entry and exit sequences suitable for use in a break handler when this
+attribute is present.  The return from @code{break_handler} is done through
+the @code{rtbd} instead of @code{rtsd}.
+
+@smallexample
+void f () __attribute__ ((break_handler));
+@end smallexample
+
 @item section ("@var{section-name}")
 @cindex @code{section} function attribute
 Normally, the compiler places the code it generates in the @code{text} section.
diff --git a/gcc/testsuite/gcc.target/microblaze/others/break_handler.c b/gcc/testsuite/gcc.target/microblaze/others/break_handler.c
new file mode 100644
index 000..1ccafd0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/microblaze/others/break_handler.c
@@ -0,0 +1,15 @@
+int func () __attribute__ ((break_handler));
+volatile int intr_occurred;
+
+int func ()
+{
+
+  /* { dg-final { scan-assembler "rtbd\tr(\[0-9]\|\[1-2]\[0-9]\|3\[0-1]),8" } } */
+  intr_occurred += 1;
+}
+int main()
+{
+  /* { dg-final { scan-assembler "brki\tr16" } } */
+  func();
+  return 0;
+}
--
1.7.1

-----Original Message-----
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Michael Eager
Sent: Wednesday, May 14, 2014 3:33 AM
To: Ajit Kumar Agarwal; gcc-patches@gcc.gnu.org
Cc: Vinod Kathail; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [Patch,Microblaze]: Added Break Handler Support

On 05/13/14 14:42, Ajit Kumar Agarwal wrote:
> Hello Michael:
>
> Resubmitting the patch with documentation for _break_handler in
> config/microblaze/microblaze.h.

Please put everything together in one place.  When you resubmit a patch, include the ChangeLog.

I'm not sure what you changed, but there are no changes to gcc/doc/extend.texi in your patch.

--
Michael Eager    ea...@eagercon.com
1960 Park Blvd., Palo Alto, CA 94306  650-325-8077

0001-Patch-MicroBlaze-Add-break-Handler-Support.patch
Description: 0001-Patch-MicroBlaze-Add-break-Handler-Support.patch
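To summarize the intended code generation, here is a hedged sketch of my own, based only on the .md patterns and testcase above; the names handler and caller are hypothetical.

/* Hedged sketch of the expected MicroBlaze code generation for the
   break_handler attribute, following the patterns in the patch:

     caller:   brki  r16, handler    # software break; link address in r16
               ...
     handler:  ...
               rtbd  r16, 8          # return from break via r16, delay slot

   A normal call/return would instead use brlid r15,... / rtsd r15, 8.  */
void handler (void) __attribute__ ((break_handler));

void
caller (void)
{
  handler ();
}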
[Patch, microblaze]: Add Init_priority support.
Please find below the patch for init_priority support on MicroBlaze.

Testing done: no regressions seen in the gcc and g++ regression testsuites.

[Patch, microblaze]: Add Init_priority support.

Added TARGET_ASM_CONSTRUCTOR and TARGET_ASM_DESTRUCTOR macros.  These
macros allow users to control the order of initialization of objects
defined at namespace scope with the init_priority attribute by
specifying a relative priority.

ChangeLog:
2014-07-28  Ajit Agarwal

	* config/microblaze/microblaze.c (microblaze_elf_asm_cdtor): New.
	(microblaze_elf_asm_constructor, microblaze_elf_asm_destructor): New.
	* config/microblaze/microblaze.h
	(TARGET_ASM_CONSTRUCTOR, TARGET_ASM_DESTRUCTOR): New macros.

Signed-off-by: Ajit Agarwal ajit...@xilinx.com

Thanks & Regards
Ajit

0001-Patch-microblaze-Add-Init_priority-support.patch
Description: 0001-Patch-microblaze-Add-Init_priority-support.patch
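For context, a usage sketch of my own (not from the submission): priorities given with the C constructor attribute, like the C++ init_priority attribute on namespace-scope objects, flow through TARGET_ASM_CONSTRUCTOR, which must emit each function's address into a priority-ranked constructor section so the startup code runs them in order.

/* Illustrative only: lower priority numbers run earlier.  The new
   TARGET_ASM_CONSTRUCTOR hook places each function's address into a
   priority-specific constructor section so the runtime honors the
   requested order.  Priorities 0-100 are reserved for the
   implementation.  */
#include <stdio.h>

__attribute__ ((constructor (200)))
static void
init_first (void)
{
  puts ("initialized first (priority 200)");
}

__attribute__ ((constructor (300)))
static void
init_second (void)
{
  puts ("initialized second (priority 300)");
}

int
main (void)
{
  return 0;
}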