Re: Help understand the may_be_zero field in loop niter information
On Thu, Jun 12, 2014 at 7:59 PM, Zdenek Dvorak wrote:
> Hi,
>
>> > I noticed there is below code/comments about the may_be_zero field in
>> > the loop niter desc:
>> >
>> >   tree may_be_zero;  /* The boolean expression.  If it evaluates to true,
>> >                         the loop will exit in the first iteration (i.e.
>> >                         its latch will not be executed), even if the niter
>> >                         field says otherwise.  */
>> >
>> > I had difficulty in understanding this because I ran into some cases
>> > in which it didn't behave as said.
>
> actually, in all the examples below, the field behaves as described,
> i.e.,
>
>   the number of iterations = may_be_zero ? 0 : niter;
>
> In particular, the fact that may_be_zero is false *does not imply*
> that the number of iterations as described by niter is non-zero.
>
>> > Example 1, the dump of the loop before sccp is like:
>> >
>> >   :
>> >   bnd_4 = len_3(D) + 1;
>> >
>> >   :
>> >   # ivtmp_1 = PHI <0(2), ivtmp_11(4)>
>> >   _6 = ivtmp_1 + len_3(D);
>> >   _7 = a[ivtmp_1];
>> >   _8 = b[ivtmp_1];
>> >   _9 = _7 + _8;
>> >   a[_6] = _9;
>> >   ivtmp_11 = ivtmp_1 + 1;
>> >   if (bnd_4 > ivtmp_11)
>> >     goto ;
>> >   else
>> >     goto ;
>> >
>> >   :
>> >   goto ;
>> >
>> > The loop niter information analyzed in sccp is like:
>> >
>> >   Analyzing # of iterations of loop 1
>> >     exit condition [1, + , 1] < len_3(D) + 1
>> >     bounds on difference of bases: -1 ... 4294967294
>> >   result:
>> >     zero if len_3(D) == 4294967295
>> >     # of iterations len_3(D), bounded by 4294967294
>> >
>> > Question 1: shouldn't it be like "len_3 + 1 <= 1", because the latch
>> > won't be executed when "len_3 == 0", right?
>
> the analysis determines the number of iterations as len_3, that is
> 0 if len_3 == 0.  So, the information is computed correctly here.
>
>> > But when the boundary condition is the only case in which the latch
>> > gets executed ZERO times, the may_be_zero info will not be computed.
>> > See Example 2, with the dump of the loop before sccp like:
>> >
>> > foo (int M)
>> >
>> >   :
>> >   if (M_4(D) > 0)
>> >     goto ;
>> >   else
>> >     goto ;
>> >
>> >   :
>> >   return;
>> >
>> >   :
>> >
>> >   :
>> >   # i_13 = PHI <0(4), i_10(6)>
>> >   _5 = i_13 + M_4(D);
>> >   _6 = a[i_13];
>> >   _7 = b[i_13];
>> >   _8 = _6 + _7;
>> >   a[_5] = _8;
>> >   i_10 = i_13 + 1;
>> >   if (M_4(D) > i_10)
>> >     goto ;
>> >   else
>> >     goto ;
>> >
>> >   :
>> >   goto ;
>> >
>> > The niter information analyzed in sccp is like:
>> >
>> >   Analyzing # of iterations of loop 1
>> >     exit condition [1, + , 1](no_overflow) < M_4(D)
>> >     bounds on difference of bases: 0 ... 2147483646
>> >   result:
>> >     # of iterations (unsigned int) M_4(D) + 4294967295, bounded by 2147483646
>> >
>> > So may_be_zero is always false here, but the latch may be executed
>> > ZERO times when "M_4 == 1".
>
> Again, this is correct, since then ((unsigned int) M_4) + 4294967295 == 0.
>
>> > Starting from Example 1, we can create Example 3, which makes no sense
>> > to me.  Again, the dump of the loop is like:
>> >
>> >   :
>> >   bnd_4 = len_3(D) + 1;
>> >
>> >   :
>> >   # ivtmp_1 = PHI <0(2), ivtmp_11(4)>
>> >   _6 = ivtmp_1 + len_3(D);
>> >   _7 = a[ivtmp_1];
>> >   _8 = b[ivtmp_1];
>> >   _9 = _7 + _8;
>> >   a[_6] = _9;
>> >   ivtmp_11 = ivtmp_1 + 4;
>> >   if (bnd_4 > ivtmp_11)
>> >     goto ;
>> >   else
>> >     goto ;
>> >
>> >   :
>> >   goto ;
>> >
>> >   :
>> >   return 0;
>> >
>> > The niter info is like:
>> >
>> >   Analyzing # of iterations of loop 1
>> >     exit condition [4, + , 4] < len_3(D) + 1
>> >     bounds on difference of bases: -4 ... 4294967291
>> >   result:
>> >     under assumptions len_3(D) + 1 <= 4294967292
>> >     zero if len_3(D) == 4294967295
>> >     # of iterations len_3(D) / 4, bounded by 1073741823
>> >
>> > The problem is: won't the latch be executed ZERO times when
>> > "len_3 == 0/1/2/3"?
>
> Again, in all these cases the number of iterations is len_3 / 4 == 0.
> Zdenek

Hi Zdenek,

I spent some more time pondering over this and I think I understand the (at least one) motivation for why may_be_zero acts as it does now.  At least for IVOPTs, the boundary condition under which the loop latch is not executed doesn't need to be handled specially when trying to eliminate condition iv uses.

So I am thinking whether it's OK for me to send a documentation patch describing how this works, since it was a little bit confusing to me at first glance.

Thanks,
bin

-- 
Best Regards.
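Zdenek's invariant, namely that the number of latch executions equals may_be_zero ? 0 : niter, can be checked with a small standalone C sketch of Examples 1 and 3 (the array bodies do not affect the trip count and are omitted; the function names below are invented for this illustration, they are not GCC APIs):

```c
#include <assert.h>
#include <limits.h>

/* Example 1: exit test is bnd > iv + 1 with bnd = len + 1, step 1.
   Returns how many times the latch (the back edge) executes.  */
static unsigned
latch_count_ex1 (unsigned len)
{
  unsigned bnd = len + 1;       /* wraps to 0 when len == UINT_MAX */
  unsigned iv = 0, latches = 0;
  for (;;)
    {
      unsigned next = iv + 1;   /* ivtmp_11 = ivtmp_1 + 1 */
      if (!(bnd > next))
        break;
      latches++;                /* back edge taken: latch executed */
      iv = next;
    }
  return latches;
}

/* Example 3: same loop but with step 4, valid under the dump's
   assumption len + 1 <= 4294967292.  */
static unsigned
latch_count_ex3 (unsigned len)
{
  unsigned bnd = len + 1;
  unsigned iv = 0, latches = 0;
  for (;;)
    {
      unsigned next = iv + 4;
      if (!(bnd > next))
        break;
      latches++;
      iv = next;
    }
  return latches;
}

/* The formulas from the niter dumps: may_be_zero ? 0 : niter.  */
static unsigned
niter_ex1 (unsigned len)
{
  return len == UINT_MAX ? 0 : len;       /* zero if len_3 == 4294967295 */
}

static unsigned
niter_ex3 (unsigned len)
{
  return len == UINT_MAX ? 0 : len / 4;   /* # of iterations len_3 / 4 */
}
```

For len in 0..3, Example 3's latch runs zero times even though may_be_zero is false, matching Zdenek's point that niter itself (len / 4) already evaluates to 0 there.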
Re: [PATCH] tell gcc optimizer to never introduce new data races
Adding "--param allow-store-data-races=0" to the GCC options for the kernel breaks C=1 because Sparse isn't expecting a GCC option with that format.  It thinks allow-store-data-races=0 is the name of the file we are trying to test.

Try using Sparse on linux-next to see the problem.

  $ make C=2 mm/slab_common.o
    CHK     include/config/kernel.release
    CHK     include/generated/uapi/linux/version.h
    CHK     include/generated/utsrelease.h
    CALL    scripts/checksyscalls.sh
    CHECK   scripts/mod/empty.c
  No such file: allow-store-data-races=0
  make[2]: *** [scripts/mod/empty.o] Error 1
  make[1]: *** [scripts/mod] Error 2
  make: *** [scripts] Error 2
  $

regards,
dan carpenter
Re: [PATCH] tell gcc optimizer to never introduce new data races
Dan Carpenter writes:

> Adding "--param allow-store-data-races=0" to the GCC options for the
> kernel breaks C=1 because Sparse isn't expecting a GCC option with that
> format.

Please try --param=allow-store-data-races=0 instead.

Andreas.

-- 
Andreas Schwab, SUSE Labs, sch...@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."
Re: [GSoC] decision tree first steps
On Mon, Jun 16, 2014 at 1:07 AM, Prathamesh Kulkarni wrote:
> On Sat, Jun 14, 2014 at 12:43 PM, Richard Biener wrote:
>
> I have attached a patch that tries to implement the decision tree using
> the above algorithm.  (I haven't done it for built-in functions yet, but
> that would be similar to expr, so I guess no new issues may come up for
> that.)

Great.

> * AST representation
> Added two more classes to the AST - true_operand and match_operand - to
> represent "true" and "match" operands respectively.  Captures are built
> during parsing, and are "lowered" to either true_operand or
> match_operand while inserting AST operands in the decision tree
> (lower_capture).

Hmm, ok.  I'd have made them decision tree node classes instead, but
it's a matter of taste I guess.

+  // or maybe keep a parallel bool indexes_empty array instead of using capture_max to denote "not seen" ?
+  for (unsigned i = 0; i < capture_max; ++i)
+    indexes[i] = level_max;

using a special value is fine.

> * Mapping capture index to preorder level
> dt_simplify::indexes (unsigned *indexes) provides the mapping from
> capture index -> level.
> eg: indexes[1] = 2 represents that @1 is at level 2 in the preorder
> traversal of the AST.
>
> * true_operand is always placed as the last child of the decision tree
> node during insertion (dt_node::append_node), since we want to process
> it last (if all other decisions fail).

right.

> * Code gen
> Unfortunately, the patch still contains hacks for code-gen.
> One such hack is adding three fields - (parent, preorder_level, pos) -
> to operand.  They should really be part of the decision tree, but since
> code-gen happens off the AST, I needed to place them there.
> For removing that, I am thinking to put the information required for
> code-gen in another struct (say operand_info?):
>
>   struct operand_info
>   {
>     operand *op;
>     unsigned pos;
>     operand *parent;
>     unsigned preorder_level;
>   };

Eventually you can just pass the info to the code-generators as extra
arguments?
That is, I would get rid of the AST methods for generating the matching
code and just do everything in the DT traversal.  That is, find a better
abstraction here.

> a) The metadata of the operand (pos, parent, preorder_level) can be
> computed during preorder traversal in walk_operand_preorder.
> b) Stick operand_info into the decision tree (dt_operand) instead of
> operand.
> Is that fine?

Yes, that would work, but as code-gen off the DT should be quite simple
I'd rather not complicate things with too much C++ abstraction (yeah,
it's probably my fault to introduce it in the first place).

> Code-gen for operands is slightly changed.
> The temporary is created at the expression's operand node rather than
> at the expression's node itself.  Each operand knows its name.
>
> Its name is computed as follows (dt_operand::gen_gimple):
>   opname = op  (if the operand's parent is root)
>   or opname = o  if the operand's parent is true_operand or match_operand
>   or opname = gimple_assign_rhs (def_stmt parent node>);  // if the
>     operand's parent is a non-root expr
> for built-in functions it would be:
>   or opname = gimple_call_arg (def_stmt, );

Hmm, in code-gen I see

  if (code == MINUS_EXPR)
    {
      {
        tree o1 = op0;
        if (TREE_CODE (o1) == SSA_NAME)
          {
            gimple def_stmt1 = SSA_NAME_DEF_STMT (o1);
            if (is_gimple_assign (def_stmt1)
                && gimple_assign_rhs_code (def_stmt1) == PLUS_EXPR)
              {
                ...
              }
          }
      }
      {
        tree o1 = op0;
        if (TREE_CODE (o1) == SSA_NAME)
          {
            gimple def_stmt1 = SSA_NAME_DEF_STMT (o1);
            if (is_gimple_assign (def_stmt1)
                && gimple_assign_rhs_code (def_stmt1) == MINUS_EXPR)
              {
                ...

for the DT part

  root, 2
  |--operand: MINUS_EXPR, 2
     |--operand: PLUS_EXPR, 1
        ...
     |--operand: MINUS_EXPR, 1
        ...

but I would have expected the preamble for the inner PLUS_EXPR/MINUS_EXPR
check to be unified.  Thus

  if (code == MINUS_EXPR)
    {
      tree o1 = op0;
      if (TREE_CODE (o1) == SSA_NAME)
        {
          gimple def_stmt1 = SSA_NAME_DEF_STMT (o1);
          if (is_gimple_assign (def_stmt1))
            {
              if (gimple_assign_rhs_code (def_stmt1) == PLUS_EXPR)
                {
                  ...
                }
              else if (gimple_assign_rhs_code (def_stmt1) == MINUS_EXPR)
                {
                  ...
                }

That means a better factoring of code-generation would be necessary,
with possibly sorting the kids array after operand kind.

> * Added do_valueize () in gimple-match-head.c.  The generated code
> calls do_valueize to valueize the operand.  This makes code-gen
> simpler (no goto).

Good.

> Example:
> for the pattern:
>
>   (match_and_simplify
>     (minus (plus @0 @1) @1)
>     @0)
>
> it produces the following code (literally taken from gimple-match.c
> after running it through indent):
> http://pastebin.com/EaFHZMAF
>
> For non-matching captures (capt->what->type == operand::OP_EXPR), I
> tested with a few bog
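The pattern under discussion, (minus (plus @0 @1) @1) -> @0, can be illustrated on a toy expression IR.  This is NOT GCC's gimple API: the struct and all names below are invented for the sketch.  The point is only the shape of the factored matcher Richard asks for, i.e. inspect the inner operand once, then dispatch on its code:

```c
#include <assert.h>
#include <stddef.h>

/* Toy expression IR, invented for this sketch (not GCC's).  */
enum toy_code { TOY_VAR, TOY_PLUS, TOY_MINUS };

struct toy_expr
{
  enum toy_code code;
  struct toy_expr *op0, *op1;   /* NULL for TOY_VAR */
};

/* Match (minus (plus @0 @1) @1) and simplify to @0; otherwise return
   the expression unchanged.  The inner-operand inspection is the
   shared preamble; further patterns on op0 would branch off the same
   dispatch rather than repeating the lookup.  */
static struct toy_expr *
toy_simplify (struct toy_expr *e)
{
  if (e->code != TOY_MINUS)
    return e;
  struct toy_expr *o0 = e->op0;      /* shared preamble: fetch op0 once */
  if (o0->code == TOY_PLUS)          /* dispatch on the inner code */
    {
      if (o0->op1 == e->op1)         /* capture @1 must match */
        return o0->op0;              /* simplified result: @0 */
      /* A commutated variant, (minus (plus @1 @0) @1), would be an
         additional arm here; not handled in this sketch.  */
    }
  return e;
}
```

In a generated matcher, each else-if arm on o0->code corresponds to one child of the decision tree node, which is why sorting the kids array by operand kind makes the factoring fall out naturally.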
Re: [PATCH] tell gcc optimizer to never introduce new data races
On Mon, 16 Jun 2014, Andreas Schwab wrote:

>> Adding "--param allow-store-data-races=0" to the GCC options for the
>> kernel breaks C=1 because Sparse isn't expecting a GCC option with
>> that format.
>
> Please try --param=allow-store-data-races=0 instead.

How reliable is this format across GCC versions?  The GCC manpage
doesn't seem to list it as a valid alternative.

-- 
Jiri Kosina
SUSE Labs
vector load Rematerialization!!
Hello All:

There has been work done for load rematerialization: instead of storing and loading variables, they are kept in registers for their live range.  Till now we are doing rematerialization only for scalar loads.  Is it feasible to have rematerialization for vector loads?  This would help reduce the vectorized stores and loads for the dependencies across vectorized loops.

I was looking at a presentation which mentioned that load rematerialization is implemented from GCC 4.8.2 onwards.  Does this implementation take care of rematerialization of vector loads?  Can we have this approach?

Please let me know what you think.

Thanks & Regards
Ajit
Register Pressure guided Unroll and Jam in GCC !!
Hello All:

I have worked on the Open64 compiler, where register-pressure-guided unroll and jam gave a good amount of performance improvement for the C and C++ SPEC benchmarks and also Fortran benchmarks.

Unroll and jam increases the register pressure in the unrolled loop, leading to an increase in spills and fetches that degrades the performance of the unrolled loop.  The cache-locality benefit achieved through unroll and jam is degraded by the presence of spill instructions due to the increased register pressure.  It is better to base the decision on the unroll factor of the loop on a performance model of the register pressure.

Most loop optimizations, like unroll and jam, are implemented in the high-level IR.  Register-pressure-based unroll and jam requires the calculation of register pressure in the high-level IR, which would be similar to the register pressure we calculate during register allocation.  This makes the implementation complex.

To overcome this, the Open64 compiler makes the unrolling decision both in the high-level IR and at the code generation level, deferring some of the decisions to the end of code generation.  The advantage of this approach is that it can use the register pressure information calculated by the register allocator, which makes the implementation much simpler and less complex.

Can we have this approach in GCC: making the unroll-and-jam decisions in the high-level IR and also deferring some of the decisions to the code generation level, like Open64?

Please let me know what you think.

Thanks & Regards
Ajit
Re: [GSoC] decision tree first steps
Hi,

On Mon, 16 Jun 2014, Richard Biener wrote:

> For
>
>   (match_and_simplify
>     (MINUS_EXPR @2 (PLUS_EXPR@2 @0 @1))
>     @1)

Btw, this just triggered my eye.  So with lumping the predicate to the capture without special separator syntax, it means that there's a difference between "minus_expr @2" and "minus_expr@2" with a meaningful whitespace (despite 'r' and '@' already being a natural word boundary), which seems less than ideal.  Just mentioning :)

Ciao,
Michael.
Re: Register Pressure guided Unroll and Jam in GCC !!
On Mon, Jun 16, 2014 at 4:14 PM, Ajit Kumar Agarwal wrote:
> Hello All:
>
> I have worked on the Open64 compiler where the Register Pressure Guided Unroll and Jam gave a good amount of performance improvement for the C and C++ Spec Benchmark and also Fortran benchmarks.
>
> The Unroll and Jam increases the register pressure in the Unrolled Loop leading to increase in the Spill and Fetch degrading the performance of the Unrolled Loop.  The Performance of Cache locality achieved through Unroll and Jam is degraded with the presence of Spilling instruction due to increases in register pressure.  Its better to do the decision of Unrolled Factor of the Loop based on the Performance model of the register pressure.
>
> Most of the Loop Optimization Like Unroll and Jam is implemented in the High Level IR.  The register pressure based Unroll and Jam requires the calculation of register pressure in the High Level IR which will be similar to register pressure we calculate on Register Allocation.  This makes the implementation complex.
>
> To overcome this, the Open64 compiler does the decision of Unrolling to both High Level IR and also at the Code Generation Level.  Some of the decisions way at the end of the Code Generation.  The advantage of using this approach like Open64 helps in using the register pressure information calculated by the Register Allocator.  This helps the implementation much simpler and less complex.
>
> Can we have this approach in GCC of the Decisions of Unroll and Jam in the High Level IR and also to defer some of the decision at the Code Generation Level like Open64?
>
> Please let me know what do you think.

Sure, you can for example compute validity of the transform during the GIMPLE loop opts, annotate the loop meta-information with the desired transform and apply it (or not) later during RTL unrolling.

Richard.

> Thanks & Regards
> Ajit
Re: [PATCH] tell gcc optimizer to never introduce new data races
On Mon, Jun 16, 2014 at 12:52:10PM +0200, Andreas Schwab wrote:
> Dan Carpenter writes:
>> Adding "--param allow-store-data-races=0" to the GCC options for the
>> kernel breaks C=1 because Sparse isn't expecting a GCC option with
>> that format.
>
> Please try --param=allow-store-data-races=0 instead.

That appears to work for me.
RE: Register Pressure guided Unroll and Jam in GCC !!
-----Original Message-----
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: Monday, June 16, 2014 7:55 PM
To: Ajit Kumar Agarwal
Cc: gcc@gcc.gnu.org; Vladimir Makarov; Michael Eager; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Register Pressure guided Unroll and Jam in GCC !!

On Mon, Jun 16, 2014 at 4:14 PM, Ajit Kumar Agarwal wrote:
> Hello All:
>
> I have worked on the Open64 compiler where the Register Pressure Guided Unroll and Jam gave a good amount of performance improvement for the C and C++ Spec Benchmark and also Fortran benchmarks.
>
> The Unroll and Jam increases the register pressure in the Unrolled Loop leading to increase in the Spill and Fetch degrading the performance of the Unrolled Loop.  The Performance of Cache locality achieved through Unroll and Jam is degraded with the presence of Spilling instruction due to increases in register pressure.  Its better to do the decision of Unrolled Factor of the Loop based on the Performance model of the register pressure.
>
> Most of the Loop Optimization Like Unroll and Jam is implemented in the High Level IR.  The register pressure based Unroll and Jam requires the calculation of register pressure in the High Level IR which will be similar to register pressure we calculate on Register Allocation.  This makes the implementation complex.
>
> To overcome this, the Open64 compiler does the decision of Unrolling to both High Level IR and also at the Code Generation Level.  Some of the decisions way at the end of the Code Generation.  The advantage of using this approach like Open64 helps in using the register pressure information calculated by the Register Allocator.  This helps the implementation much simpler and less complex.
>
> Can we have this approach in GCC of the Decisions of Unroll and Jam in the High Level IR and also to defer some of the decision at the Code Generation Level like Open64?
> Please let me know what do you think.

>> Sure, you can for example compute validity of the transform during the GIMPLE loop opts, annotate the loop meta-information with the desired transform and apply it (or not) later during RTL unrolling.

Thanks!  Has RTL unrolling already been implemented?

>> Richard.

> Thanks & Regards
> Ajit
RE: Register Pressure guided Unroll and Jam in GCC !!
On June 16, 2014 6:39:58 PM CEST, Ajit Kumar Agarwal wrote:
>
> -----Original Message-----
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Monday, June 16, 2014 7:55 PM
> To: Ajit Kumar Agarwal
> Cc: gcc@gcc.gnu.org; Vladimir Makarov; Michael Eager; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: Re: Register Pressure guided Unroll and Jam in GCC !!
>
> On Mon, Jun 16, 2014 at 4:14 PM, Ajit Kumar Agarwal wrote:
>> Hello All:
>>
>> I have worked on the Open64 compiler where the Register Pressure Guided Unroll and Jam gave a good amount of performance improvement for the C and C++ Spec Benchmark and also Fortran benchmarks.
>>
>> The Unroll and Jam increases the register pressure in the Unrolled Loop leading to increase in the Spill and Fetch degrading the performance of the Unrolled Loop.  The Performance of Cache locality achieved through Unroll and Jam is degraded with the presence of Spilling instruction due to increases in register pressure.  Its better to do the decision of Unrolled Factor of the Loop based on the Performance model of the register pressure.
>>
>> Most of the Loop Optimization Like Unroll and Jam is implemented in the High Level IR.  The register pressure based Unroll and Jam requires the calculation of register pressure in the High Level IR which will be similar to register pressure we calculate on Register Allocation.  This makes the implementation complex.
>>
>> To overcome this, the Open64 compiler does the decision of Unrolling to both High Level IR and also at the Code Generation Level.  Some of the decisions way at the end of the Code Generation.  The advantage of using this approach like Open64 helps in using the register pressure information calculated by the Register Allocator.  This helps the implementation much simpler and less complex.
>>
>> Can we have this approach in GCC of the Decisions of Unroll and Jam in the High Level IR and also to defer some of the decision at the Code Generation Level like Open64?
>>
>> Please let me know what do you think.
>
>>> Sure, you can for example compute validity of the transform during the GIMPLE loop opts, annotate the loop meta-information with the desired transform and apply it (or not) later during RTL unrolling.
>
> Thanks!  Has RTL unrolling already been implemented?

Yes, but not of non-innermost loops afaik.

Richard

> Richard.
>
>> Thanks & Regards
>> Ajit
Re: Register Pressure guided Unroll and Jam in GCC !!
On Mon, 2014-06-16 at 14:14 +, Ajit Kumar Agarwal wrote:
> Hello All:
>
> I have worked on the Open64 compiler where the Register Pressure Guided Unroll and Jam gave a good amount of performance improvement for the C and C++ Spec Benchmark and also Fortran benchmarks.
>
> The Unroll and Jam increases the register pressure in the Unrolled Loop leading to increase in the Spill and Fetch degrading the performance of the Unrolled Loop.  The Performance of Cache locality achieved through Unroll and Jam is degraded with the presence of Spilling instruction due to increases in register pressure.  Its better to do the decision of Unrolled Factor of the Loop based on the Performance model of the register pressure.
>
> Most of the Loop Optimization Like Unroll and Jam is implemented in the High Level IR.  The register pressure based Unroll and Jam requires the calculation of register pressure in the High Level IR which will be similar to register pressure we calculate on Register Allocation.  This makes the implementation complex.
>
> To overcome this, the Open64 compiler does the decision of Unrolling to both High Level IR and also at the Code Generation Level.  Some of the decisions way at the end of the Code Generation.  The advantage of using this approach like Open64 helps in using the register pressure information calculated by the Register Allocator.  This helps the implementation much simpler and less complex.
>
> Can we have this approach in GCC of the Decisions of Unroll and Jam in the High Level IR and also to defer some of the decision at the Code Generation Level like Open64?
>
> Please let me know what do you think.

I have been working on calculating something analogous to register pressure using a count of the number of live SSA values during the ipa-inline pass.
I've been working on steering inlining (especially in LTO) away from decisions that explode the register pressure downstream, with a similar goal of avoiding situations that cause a lot of spill code.

I have been working in a branch if you want to take a look:
gcc/branches/lto-pressure

Aaron

> Thanks & Regards
> Ajit

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
Re: Register Pressure guided Unroll and Jam in GCC !!
On 2014-06-16, 10:14 AM, Ajit Kumar Agarwal wrote:
> Hello All:
>
> I have worked on the Open64 compiler where the Register Pressure Guided Unroll and Jam gave a good amount of performance improvement for the C and C++ Spec Benchmark and also Fortran benchmarks.
>
> The Unroll and Jam increases the register pressure in the Unrolled Loop leading to increase in the Spill and Fetch degrading the performance of the Unrolled Loop.  The Performance of Cache locality achieved through Unroll and Jam is degraded with the presence of Spilling instruction due to increases in register pressure.  Its better to do the decision of Unrolled Factor of the Loop based on the Performance model of the register pressure.
>
> Most of the Loop Optimization Like Unroll and Jam is implemented in the High Level IR.  The register pressure based Unroll and Jam requires the calculation of register pressure in the High Level IR which will be similar to register pressure we calculate on Register Allocation.  This makes the implementation complex.
>
> To overcome this, the Open64 compiler does the decision of Unrolling to both High Level IR and also at the Code Generation Level.  Some of the decisions way at the end of the Code Generation.  The advantage of using this approach like Open64 helps in using the register pressure information calculated by the Register Allocator.  This helps the implementation much simpler and less complex.
>
> Can we have this approach in GCC of the Decisions of Unroll and Jam in the High Level IR and also to defer some of the decision at the Code Generation Level like Open64?
>
> Please let me know what do you think.

Most loop optimizations are a good target for register-pressure-sensitive algorithms, as loops are usually program hot spots and any pressure increase there would be harmful, as no RA can undo such complex transformations.  So I guess your proposal could work.
Right now we have only pressure-sensitive modulo scheduling (SMS) and loop-invariant motion (as I remember, switching loop-invariant motion from a very inaccurate register-pressure evaluation to one based on the RA's pressure evaluation gave a nice improvement of about 1% for SPECFP2000 on some targets).
Re: Register Pressure guided Unroll and Jam in GCC !!
On 2014-06-16, 2:25 PM, Aaron Sawdey wrote:
> On Mon, 2014-06-16 at 14:14 +, Ajit Kumar Agarwal wrote:
>> Hello All:
>>
>> I have worked on the Open64 compiler where the Register Pressure Guided Unroll and Jam gave a good amount of performance improvement for the C and C++ Spec Benchmark and also Fortran benchmarks.
>>
>> The Unroll and Jam increases the register pressure in the Unrolled Loop leading to increase in the Spill and Fetch degrading the performance of the Unrolled Loop.  The Performance of Cache locality achieved through Unroll and Jam is degraded with the presence of Spilling instruction due to increases in register pressure.  Its better to do the decision of Unrolled Factor of the Loop based on the Performance model of the register pressure.
>>
>> Most of the Loop Optimization Like Unroll and Jam is implemented in the High Level IR.  The register pressure based Unroll and Jam requires the calculation of register pressure in the High Level IR which will be similar to register pressure we calculate on Register Allocation.  This makes the implementation complex.
>>
>> To overcome this, the Open64 compiler does the decision of Unrolling to both High Level IR and also at the Code Generation Level.  Some of the decisions way at the end of the Code Generation.  The advantage of using this approach like Open64 helps in using the register pressure information calculated by the Register Allocator.  This helps the implementation much simpler and less complex.
>>
>> Can we have this approach in GCC of the Decisions of Unroll and Jam in the High Level IR and also to defer some of the decision at the Code Generation Level like Open64?
>>
>> Please let me know what do you think.
>
> I have been working on calculating something analogous to register pressure using a count of the number of live SSA values during the ipa-inline pass.
> I've been working on steering inlining (especially in LTO) away from decisions that explode the register pressure downstream, with a similar goal of avoiding situations that cause a lot of spill code.
>
> I have been working in a branch if you want to take a look:
> gcc/branches/lto-pressure

Any pressure evaluation is better than its absence.  But on this level it is hard to evaluate it accurately.

E.g. pressure in a loop can be high for general regs, for fp regs, or both.  Using live SSA values is still very inaccurate for making the right decision for the transformations.
Re: Register Pressure guided Unroll and Jam in GCC !!
On Mon, 2014-06-16 at 14:42 -0400, Vladimir Makarov wrote:
> On 2014-06-16, 2:25 PM, Aaron Sawdey wrote:
>> On Mon, 2014-06-16 at 14:14 +, Ajit Kumar Agarwal wrote:
>>> Hello All:
>>>
>>> I have worked on the Open64 compiler where the Register Pressure Guided Unroll and Jam gave a good amount of performance improvement for the C and C++ Spec Benchmark and also Fortran benchmarks.
>>>
>>> The Unroll and Jam increases the register pressure in the Unrolled Loop leading to increase in the Spill and Fetch degrading the performance of the Unrolled Loop.  The Performance of Cache locality achieved through Unroll and Jam is degraded with the presence of Spilling instruction due to increases in register pressure.  Its better to do the decision of Unrolled Factor of the Loop based on the Performance model of the register pressure.
>>>
>>> Most of the Loop Optimization Like Unroll and Jam is implemented in the High Level IR.  The register pressure based Unroll and Jam requires the calculation of register pressure in the High Level IR which will be similar to register pressure we calculate on Register Allocation.  This makes the implementation complex.
>>>
>>> To overcome this, the Open64 compiler does the decision of Unrolling to both High Level IR and also at the Code Generation Level.  Some of the decisions way at the end of the Code Generation.  The advantage of using this approach like Open64 helps in using the register pressure information calculated by the Register Allocator.  This helps the implementation much simpler and less complex.
>>>
>>> Can we have this approach in GCC of the Decisions of Unroll and Jam in the High Level IR and also to defer some of the decision at the Code Generation Level like Open64?
>>>
>>> Please let me know what do you think.
>>
>> I have been working on calculating something analogous to register pressure using a count of the number of live SSA values during the ipa-inline pass.  I've been working on steering inlining (especially in LTO) away from decisions that explode the register pressure downstream, with a similar goal of avoiding situations that cause a lot of spill code.
>>
>> I have been working in a branch if you want to take a look:
>> gcc/branches/lto-pressure
>
> Any pressure evaluation is better than its absence.  But on this level it is hard to evaluate it accurately.
>
> E.g. pressure in a loop can be high for general regs, for fp regs, or both.  Using live SSA values is still very inaccurate for making the right decision for the transformations.

Yes, the jump I have not made yet is to classify the pressure by which register class the values might end up in.  The other big piece that's potentially missing at that point is pressure caused by temps and by scheduling.  But I think you can still get order-of-magnitude type estimates.

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520  home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
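A crude version of the live-value count Aaron describes can be sketched as follows.  This is a toy model, assuming each value has one linear live range from its defining statement to its last use; real SSA liveness over a CFG is more involved, and, as Vladimir notes, this does not classify values by register class:

```c
#include <stddef.h>

/* Toy pressure proxy: value v is live over statements
   def[v] .. last_use[v] inclusive.  Return the maximum number of
   simultaneously live values over all nstmts statements.  */
static unsigned
max_live (const unsigned *def, const unsigned *last_use,
          size_t nvals, unsigned nstmts)
{
  unsigned max = 0;
  for (unsigned s = 0; s < nstmts; s++)
    {
      unsigned live = 0;
      for (size_t v = 0; v < nvals; v++)
        if (def[v] <= s && s <= last_use[v])
          live++;                 /* value v is live at statement s */
      if (live > max)
        max = live;
    }
  return max;
}
```

Comparing this maximum against the number of available hard registers gives the kind of order-of-magnitude estimate discussed above, without running the register allocator.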
[GSoC] Status - 20140616
Hi Community,

We are 1 week away from midterm evaluations of students' work.  Mentors, please start looking closely into your student's progress and draft up evaluation notes.

Midterm evaluations are very important in GSoC.  Students who fail this evaluation are immediately kicked out of the program.  Students who pass get their midterm payment ($2250).

Both mentors and students will need to submit midterm evaluations between June 23-27.  There is no excuse for not submitting your evaluations.  Please let me know if you have any problems submitting your evaluation in the period June 23-27.

For evaluations, you might find this guide helpful: http://en.flossmanuals.net/GSoCMentoring/evaluations/

On another note, copyright assignments are now completed for 4 out of 5 students.  I have pinged the last student to get his assignment in order.

-- 
Maxim Kuvyrkov
www.linaro.org