LTO IPA inline decisions in GCC trunk.
Hi Honza, I experimented building Coremark with both PGO and LTO at -O3 level on Aarch64 machine. First I generated profiles using the recommended seeds in Coremark's readme.txt. Then compiled again with -O3 -flto and -fprofile-use. I tried using GCC Linaro compiler (september) which is based on FSF 4.9 and GCC trunk 30-sep-2014. With linaro compiler perf events show 5% less instruction counts compared to the GCC trunk version I used. I looked at the generated code and seeing that IPA inlining have changed between linaro and trunk. Linaro compiler does not seem to inline a function called "crcu32", but trunk inlines it but does not inline "crcu16. Also trunk does not detect an IPA indirect inlining on a function called "cmp_complex". The number of partitions for ltrans is 3 in Linaro compiler and reduced to 2 in trunk. Eyeballing the dump it seems --param max-inline-insns-auto limit reached and hence deciding not to inline some functions. I tried increasing this limit from 40 to 45, 50 and 100. But is not helping in inlining "crcu32" in trunk, but inlines "cmp_complex" when set to limit set 45. But this is not reducing the instruction count. With Linaro compiler I tried to manually not to inline crcu16. Now Linaro compiler behaves in same way as trunk. It inlines crcu32, crcu16 is not inlined and instruction count increases. So inlining "crcu16", seem to increasing the instruction counts in trunk. I tried to latest trunk on X86_64 machine only and inlining behavior is same to the trunk version I used in Aarch64. LTO may not be best thing to try on Coremark, but just wanted to check if trunk (5.0) is better compared to GCC 4.9. Can you suggest where should I look in GCC to see why these inline decisions changes in trunk? Also compared to FSF 4.9, inline size calculation in IPA have changed now in trunk? Please advise. regards, Venkat.
Re: Match-and-simplify and COND_EXPR
On Wed, 5 Nov 2014, Andrew Pinski wrote: > Hi, > I was trying to hook up tree-ssa-phiopt to match-and-simplify using > either gimple_build (or rather using gimple_simplify depending on if > we want to produce cond_expr for conditional move). I ran into a > problem. > With the pattern below: > /* a ? 0 : 1 -> a if 0 and 1 are integral types. */ > (simplify > (cond_expr @0 integer_zerop integer_onep) > (if (INTEGRAL_TYPE_P (type)) > (convert @0))) > > This produces a gimple statement with an incorrect gimple statement > where we have have a convert with an compare expression. > > t.c: In function ‘f’: > t.c:1:5: error: invalid operand in unary operation > int f(int c, int a, int b) > ^ > _4 = (int) (c_2(D) != 0); > > > Did I implement the pattern incorrectly or is there a bug due to the > way cond_expr handles its 0th operand in that it is valid for compares > there in gimple form. There are probably still corner-cases where the GENERIC exprs we embed into GIMPLE operands are not handled correctly or optimally. We jump through multiple hoops to get that right already but I'm sure I missed some ;) COND_EXPRs are difficult because the embedded comparison is simplified separately but how the result needs to be represented depends on the context. So more example patterns certainly help here. Feel free to open a bugreport. Btw, I usually "test" this by inspecting the generated code for a single pattern in a test.pd file and ./build/genmatch --gimple test.pd Thanks, Richard. -- Richard Biener SUSE LINUX GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendoerffer, HRB 21284 (AG Nuernberg) Maxfeldstrasse 5, 90409 Nuernberg, Germany
Re: RFC: Update ISL under gcc/infrastructure/ ? // Remove CLooG?
On Thu, Nov 6, 2014 at 8:02 AM, Tobias Grosser wrote: > On 06.11.2014 07:04, Roman Gareev wrote: >>> >>> CLooG is not necessarily needed. You can run graphite just with ISL. The >>> main reason that ISL code generation is not enabled by default is that we >>> did not yet get extensive testing and it was unclear who will have the >>> time >>> to fix possible bugs. >> >> >> Could you please advise me which test suites should be used to make >> performance comparison between CLooG and ISL generator? (I would like >> to do this, even though the old generator is removed). > > > I do not have specific advices. You can use various open source benchmarks > e.g. the LLVM test suite or, if you have access, you could run SPEC or > something. > >>> @Mircae, Roman: Would you have time to help with bug-fixing if we do the >>> switch now? (I am happy to review patches and give advice, but can not do >>> the full move myself) >> >> >> I could find time for this. What do you mean by ‘switch’? If I’m not >> mistaken, ISL generator is already used by default. Should we remove >> support of CLooG generator and all files related to it? > > > Wow, I must really have been sleeping (or just forgetting). The switch > already happened. This is amazing. > > As the ISL code generator has been default since a while and we did not get > many bug reports, the actual switch seems to have worked well. We could > probably still need some testing, but in this case it is most likely time to > drop the CLooG support entirely. Are you interested to provide the relevant > patches? > > Also, as Tobias suggested we should raise the minimal supported isl level to > 0.14 to be sure PR 62289 is fixed. As I am testing with system isl/cloog that would be unfortunate as it means I'd either drop testing graphite for 4.8 and 4.9 or for 5.0 ... AFAIK ISL 12.x and 13+ cannot co-exist in the system due to include file conflicts. Richard. > Cheers, > Tobias >
Re: RFC: Update ISL under gcc/infrastructure/ ? // Remove CLooG?
On 06.11.2014 10:05, Richard Biener wrote: On Thu, Nov 6, 2014 at 8:02 AM, Tobias Grosser wrote: On 06.11.2014 07:04, Roman Gareev wrote: CLooG is not necessarily needed. You can run graphite just with ISL. The main reason that ISL code generation is not enabled by default is that we did not yet get extensive testing and it was unclear who will have the time to fix possible bugs. Could you please advise me which test suites should be used to make performance comparison between CLooG and ISL generator? (I would like to do this, even though the old generator is removed). I do not have specific advices. You can use various open source benchmarks e.g. the LLVM test suite or, if you have access, you could run SPEC or something. @Mircae, Roman: Would you have time to help with bug-fixing if we do the switch now? (I am happy to review patches and give advice, but can not do the full move myself) I could find time for this. What do you mean by ‘switch’? If I’m not mistaken, ISL generator is already used by default. Should we remove support of CLooG generator and all files related to it? Wow, I must really have been sleeping (or just forgetting). The switch already happened. This is amazing. As the ISL code generator has been default since a while and we did not get many bug reports, the actual switch seems to have worked well. We could probably still need some testing, but in this case it is most likely time to drop the CLooG support entirely. Are you interested to provide the relevant patches? Also, as Tobias suggested we should raise the minimal supported isl level to 0.14 to be sure PR 62289 is fixed. As I am testing with system isl/cloog that would be unfortunate as it means I'd either drop testing graphite for 4.8 and 4.9 or for 5.0 ... AFAIK ISL 12.x and 13+ cannot co-exist in the system due to include file conflicts. Sven, is there any chance we can add the deprecated isl_int includes back into isl 0.14.1. This would unblock the testing and we could remove them as soon as gcc 4.8/4.9 has been phased out. This would also fix my polly code coverage tests which do not work since isl_int has been moved into different header files, as ubuntu does not want to update isl as long as such an update breaks gcc. Cheers, Tobias
Re: LTO IPA inline decisions in GCC trunk.
> Hi Honza, Hello, > > I experimented building Coremark with both PGO and LTO at -O3 level on > Aarch64 machine. First I generated profiles using the recommended seeds in > Coremark's readme.txt. Then compiled again with -O3 -flto and -fprofile-use. > > I tried using GCC Linaro compiler (september) which is based on FSF 4.9 and > GCC trunk 30-sep-2014. > > With linaro compiler perf events show 5% less instruction counts compared > to the GCC trunk version I used. > > I looked at the generated code and seeing that IPA inlining have changed > between linaro and trunk. > > Linaro compiler does not seem to inline a function called "crcu32", but > trunk inlines it but does not inline "crcu16. Also trunk does not detect an > IPA indirect inlining on a function called "cmp_complex". Can you please compile with -fdump-ipa-inline-details -fdump-tree-release_ssa and send me the dumps from both compilers? It should not be that hard to debug this. > > The number of partitions for ltrans is 3 in Linaro compiler and reduced to > 2 in trunk. > > Eyeballing the dump it seems --param max-inline-insns-auto limit reached > and hence deciding not to inline some functions. I tried increasing this > limit from 40 to 45, 50 and 100. But is not helping in inlining "crcu32" in > trunk, but inlines "cmp_complex" when set to limit set 45. But this is not > reducing the instruction count. > > With Linaro compiler I tried to manually not to inline crcu16. Now Linaro > compiler behaves in same way as trunk. It inlines crcu32, crcu16 is not > inlined and instruction count increases. > > So inlining "crcu16", seem to increasing the instruction counts in trunk. I > tried to latest trunk on X86_64 machine only and inlining behavior is same > to the trunk version I used in Aarch64. > > LTO may not be best thing to try on Coremark, but just wanted to check if > trunk (5.0) is better compared to GCC 4.9. > > Can you suggest where should I look in GCC to see why these inline > decisions changes in trunk? Also compared to FSF 4.9, inline size > calculation in IPA have changed now in trunk? One important change for mainline compared to 4.9 is that with profile feedback it can now bypass max-inline-insns-single/auto limits. This is change I did in early stage1 https://gcc.gnu.org/ml/gcc-patches/2014-04/msg01110.html and I wanted to see if there are any testcases. I think we may make more selective decisions about what call is considered hot in this case (our current cgraph_maybe_hot_edge_p is very conservative). So in your case the main problem seems to be not inlining crcu16? Of course the change above does not directly explain it, but perhaps some expensive inlining early in the decision stage prevents useful inlining later.. Honza > > Please advise. > > regards, > Venkat.
Re: RFC: Update ISL under gcc/infrastructure/ ? // Remove CLooG?
On 11/6/14, Tobias Grosser wrote: > On 06.11.2014 10:05, Richard Biener wrote: >> On Thu, Nov 6, 2014 at 8:02 AM, Tobias Grosser wrote: >>> On 06.11.2014 07:04, Roman Gareev wrote: > > CLooG is not necessarily needed. You can run graphite just with ISL. > The > main reason that ISL code generation is not enabled by default is that > we > did not yet get extensive testing and it was unclear who will have the > time > to fix possible bugs. Could you please advise me which test suites should be used to make performance comparison between CLooG and ISL generator? (I would like to do this, even though the old generator is removed). >>> >>> >>> I do not have specific advices. You can use various open source >>> benchmarks >>> e.g. the LLVM test suite or, if you have access, you could run SPEC or >>> something. >>> > @Mircae, Roman: Would you have time to help with bug-fixing if we do > the > switch now? (I am happy to review patches and give advice, but can not > do > the full move myself) I could find time for this. What do you mean by ‘switch’? If I’m not mistaken, ISL generator is already used by default. Should we remove support of CLooG generator and all files related to it? >>> >>> >>> Wow, I must really have been sleeping (or just forgetting). The switch >>> already happened. This is amazing. >>> >>> As the ISL code generator has been default since a while and we did not >>> get >>> many bug reports, the actual switch seems to have worked well. We could >>> probably still need some testing, but in this case it is most likely time >>> to >>> drop the CLooG support entirely. Are you interested to provide the >>> relevant >>> patches? >>> >>> Also, as Tobias suggested we should raise the minimal supported isl level >>> to >>> 0.14 to be sure PR 62289 is fixed. >> >> As I am testing with system isl/cloog that would be unfortunate as it >> means >> I'd either drop testing graphite for 4.8 and 4.9 or for 5.0 ... AFAIK >> ISL >> 12.x and 13+ cannot co-exist in the system due to include file conflicts. > > Sven, > > is there any chance we can add the deprecated isl_int includes back into > isl 0.14.1. This would unblock the testing and we could remove them as > soon as gcc 4.8/4.9 has been phased out. > > This would also fix my polly code coverage tests which do not work since > isl_int has been moved into different header files, as ubuntu does not > want to update isl as long as such an update breaks gcc. Ah, so it's still there? Maybe we can simply add some configury to detect its location and fix build with a patch for 4.8 and 4.9? Is there maybe even a preprocessor macro one can check that newer ISL provide that can tell us where to look for isl_int? Thanks, Richard. > Cheers, > Tobias > >
Re: RFC: Update ISL under gcc/infrastructure/ ? // Remove CLooG?
On 06.11.2014 11:15, Richard Biener wrote: On 11/6/14, Tobias Grosser wrote: On 06.11.2014 10:05, Richard Biener wrote: On Thu, Nov 6, 2014 at 8:02 AM, Tobias Grosser wrote: On 06.11.2014 07:04, Roman Gareev wrote: CLooG is not necessarily needed. You can run graphite just with ISL. The main reason that ISL code generation is not enabled by default is that we did not yet get extensive testing and it was unclear who will have the time to fix possible bugs. Could you please advise me which test suites should be used to make performance comparison between CLooG and ISL generator? (I would like to do this, even though the old generator is removed). I do not have specific advices. You can use various open source benchmarks e.g. the LLVM test suite or, if you have access, you could run SPEC or something. @Mircae, Roman: Would you have time to help with bug-fixing if we do the switch now? (I am happy to review patches and give advice, but can not do the full move myself) I could find time for this. What do you mean by ‘switch’? If I’m not mistaken, ISL generator is already used by default. Should we remove support of CLooG generator and all files related to it? Wow, I must really have been sleeping (or just forgetting). The switch already happened. This is amazing. As the ISL code generator has been default since a while and we did not get many bug reports, the actual switch seems to have worked well. We could probably still need some testing, but in this case it is most likely time to drop the CLooG support entirely. Are you interested to provide the relevant patches? Also, as Tobias suggested we should raise the minimal supported isl level to 0.14 to be sure PR 62289 is fixed. As I am testing with system isl/cloog that would be unfortunate as it means I'd either drop testing graphite for 4.8 and 4.9 or for 5.0 ... AFAIK ISL 12.x and 13+ cannot co-exist in the system due to include file conflicts. Sven, is there any chance we can add the deprecated isl_int includes back into isl 0.14.1. This would unblock the testing and we could remove them as soon as gcc 4.8/4.9 has been phased out. This would also fix my polly code coverage tests which do not work since isl_int has been moved into different header files, as ubuntu does not want to update isl as long as such an update breaks gcc. Ah, so it's still there? Yes, we just need to include some additional header files. Besides isl/aff.h we now also would need isl/deprecated/aff.h Maybe we can simply add some configury to detect its location and fix build with a patch for 4.8 and 4.9? > Is there maybe even a preprocessor macro one can check that newer ISL provide that can tell us where to look for isl_int? We could detect the need to include these files based on isl's version macros. So you suggest that adding conditional includes would allow us to move forward and would be part of some of the next gcc point releases. This would be perfect. Tobias
Re: RFC: Update ISL under gcc/infrastructure/ ? // Remove CLooG?
On 11/6/14, Tobias Grosser wrote: > On 06.11.2014 11:15, Richard Biener wrote: >> On 11/6/14, Tobias Grosser wrote: >>> On 06.11.2014 10:05, Richard Biener wrote: On Thu, Nov 6, 2014 at 8:02 AM, Tobias Grosser wrote: > On 06.11.2014 07:04, Roman Gareev wrote: >>> >>> CLooG is not necessarily needed. You can run graphite just with ISL. >>> The >>> main reason that ISL code generation is not enabled by default is >>> that >>> we >>> did not yet get extensive testing and it was unclear who will have >>> the >>> time >>> to fix possible bugs. >> >> >> Could you please advise me which test suites should be used to make >> performance comparison between CLooG and ISL generator? (I would like >> to do this, even though the old generator is removed). > > > I do not have specific advices. You can use various open source > benchmarks > e.g. the LLVM test suite or, if you have access, you could run SPEC or > something. > >>> @Mircae, Roman: Would you have time to help with bug-fixing if we do >>> the >>> switch now? (I am happy to review patches and give advice, but can >>> not >>> do >>> the full move myself) >> >> >> I could find time for this. What do you mean by ‘switch’? If I’m not >> mistaken, ISL generator is already used by default. Should we remove >> support of CLooG generator and all files related to it? > > > Wow, I must really have been sleeping (or just forgetting). The switch > already happened. This is amazing. > > As the ISL code generator has been default since a while and we did > not > get > many bug reports, the actual switch seems to have worked well. We > could > probably still need some testing, but in this case it is most likely > time > to > drop the CLooG support entirely. Are you interested to provide the > relevant > patches? > > Also, as Tobias suggested we should raise the minimal supported isl > level > to > 0.14 to be sure PR 62289 is fixed. As I am testing with system isl/cloog that would be unfortunate as it means I'd either drop testing graphite for 4.8 and 4.9 or for 5.0 ... AFAIK ISL 12.x and 13+ cannot co-exist in the system due to include file conflicts. >>> >>> Sven, >>> >>> is there any chance we can add the deprecated isl_int includes back into >>> isl 0.14.1. This would unblock the testing and we could remove them as >>> soon as gcc 4.8/4.9 has been phased out. >>> >>> This would also fix my polly code coverage tests which do not work since >>> isl_int has been moved into different header files, as ubuntu does not >>> want to update isl as long as such an update breaks gcc. >> >> Ah, so it's still there? > > Yes, we just need to include some additional header files. > > Besides isl/aff.h we now also would need isl/deprecated/aff.h > >> Maybe we can simply add some configury to detect >> its location and fix build with a patch for 4.8 and 4.9? > > Is there maybe even >> a preprocessor macro one can check that newer ISL provide that can tell >> us where to look for isl_int? > > We could detect the need to include these files based on isl's version > macros. > > So you suggest that adding conditional includes would allow us to move > forward and would be part of some of the next gcc point releases. This > would be perfect. Yes. Please make sure - if you produce a patch - that it works with both inline copy of ISL and system installed one. Thanks, Richard. > Tobias > > > > > >
Re: Match-and-simplify and COND_EXPR
On Wed, 5 Nov 2014, Andrew Pinski wrote: > Hi, > I was trying to hook up tree-ssa-phiopt to match-and-simplify using > either gimple_build (or rather using gimple_simplify depending on if > we want to produce cond_expr for conditional move). I ran into a > problem. > With the pattern below: > /* a ? 0 : 1 -> a if 0 and 1 are integral types. */ > (simplify > (cond_expr @0 integer_zerop integer_onep) > (if (INTEGRAL_TYPE_P (type)) > (convert @0))) Ok, so you are capturing a GENERIC expr here but nothing knows that. It would work if you'd do (ugh) (for op (lt le eq ne ge gt) (simplify (cond_expr (op @0 @1) integer_zerop integer_onep) (if (INTEGRAL_TYPE_P (type)) (convert (op @0 @1) (simplify (cond_expr SSA_NAME@0 integer_zerop integer_onep) (if (INTEGRAL_TYPE_P (type)) (convert @0 as a workaround. To make your version work will require (quite) some special-casing in the code generator or maybe the resimplify helper. Let me see if I can cook up a "simple" fix. Richard. > This produces a gimple statement with an incorrect gimple statement > where we have have a convert with an compare expression. > > t.c: In function ‘f’: > t.c:1:5: error: invalid operand in unary operation > int f(int c, int a, int b) > ^ > _4 = (int) (c_2(D) != 0); > > > Did I implement the pattern incorrectly or is there a bug due to the > way cond_expr handles its 0th operand in that it is valid for compares > there in gimple form. > > I don't have a good patch to test tree-ssa-phiopt.c connection due the > patch having extra code in it already which handles creating cond_expr > for conditional moves already. > > Thanks, > Andrew Pinski > > -- Richard Biener SUSE LINUX GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendoerffer, HRB 21284 (AG Nuernberg) Maxfeldstrasse 5, 90409 Nuernberg, Germany
Re: Match-and-simplify and COND_EXPR
On Thu, 6 Nov 2014, Richard Biener wrote: > On Wed, 5 Nov 2014, Andrew Pinski wrote: > > > Hi, > > I was trying to hook up tree-ssa-phiopt to match-and-simplify using > > either gimple_build (or rather using gimple_simplify depending on if > > we want to produce cond_expr for conditional move). I ran into a > > problem. > > With the pattern below: > > /* a ? 0 : 1 -> a if 0 and 1 are integral types. */ > > (simplify > > (cond_expr @0 integer_zerop integer_onep) > > (if (INTEGRAL_TYPE_P (type)) > > (convert @0))) > > Ok, so you are capturing a GENERIC expr here but nothing knows that. > It would work if you'd do (ugh) > > (for op (lt le eq ne ge gt) > (simplify > (cond_expr (op @0 @1) integer_zerop integer_onep) > (if (INTEGRAL_TYPE_P (type)) >(convert (op @0 @1) > (simplify > (cond_expr SSA_NAME@0 integer_zerop integer_onep) > (if (INTEGRAL_TYPE_P (type)) >(convert @0 > > as a workaround. To make your version work will require (quite) > some special-casing in the code generator or maybe the resimplify > helper. Let me see if I can cook up a "simple" fix. Sth like below (for the real fix this has to be replicated in all gimple_resimplifyN functions). I'm missing a testcase where the pattern would apply (and not be already folded by fold), so I didn't check if it actually works. Bah, of course we should fix COND_EXPRs to not embed a GENERIC expr... Richard. Index: gcc/gimple-match-head.c === --- gcc/gimple-match-head.c (revision 217035) +++ gcc/gimple-match-head.c (working copy) @@ -90,6 +90,13 @@ gimple_resimplify1 (gimple_seq *seq, code_helper *res_code, tree type, tree *res_ops, tree (*valueize)(tree)) { + /* ??? Stupid tcc_comparison GENERIC trees in COND_EXPRs. */ + if (COMPARISON_CLASS_P (res_ops[0])) +res_ops[0] = gimple_build (seq, + TREE_CODE (res_ops[0]), TREE_TYPE (res_ops[0]), + TREE_OPERAND (res_ops[0], 0), + TREE_OPERAND (res_ops[0], 1)); + if (constant_for_folding (res_ops[0])) { tree tem = NULL_TREE;
Re: RFC: Update ISL under gcc/infrastructure/ ? // Remove CLooG?
On Thu, Nov 06, 2014 at 10:43:16AM +0100, Tobias Grosser wrote: > is there any chance we can add the deprecated isl_int includes back into isl > 0.14.1. This would unblock the testing and we could remove them as soon as > gcc 4.8/4.9 has been phased out. I prefer Richard's solution. In any case, you need to test later versions of isl before you accept for use in older versions of gcc. Even if there are no incompatibilities in the API, the internal representations of results of operations may very well change, which may, for example, result in significantly different output from the AST generator. skimo
Loop invariant motion from cold block
It appears to me that both loop invariant motion passes (tree/rtl) don't look at basic block frequencies and will gladly hoist invariant code from a cold block within a loop. This can impact performance by executing (possibly costly) code that would otherwise not be executed, adds another register that is live through the loop, and possibly increases number of callee-saved registers used in the procedure (which require save/restore) if there is a call in the loop. Following simplified test was patterned after code observed in Boost library, which apparently has a fair number of asserts sprinkled throughout. volatile int x; void assert_fail(int, char *, char *); void foo(int *a, int n, int k) { int i; for(i=0; iCompiling the above with -O2, tree-LIM (tree-ssa-loop-im.c) will yank the 'k/5' expression out of the loop and rtl-LIM (loop-invariant.c) will yank the address computations for the 2 strings out of the loop. The default probability for __builtin_expect is 90%, but even adding --param builtin-expect-probability=100 to the compile such that the block containing the call looks to never be executed still results in invariant hoisting. A simple change like the following prevents the hoisting in the rtl pass, I assume something similar could also be added to the tree pass. But I'd like to hear others thoughts on whether this is an acceptable approach/issue that should be fixed. The code below prevents motion if the block containing it is executed <= 10% of the time the loop header is. Is there a better approach there instead of another magic number? Should it be based on PARAM_VALUE (BUILTIN_EXPECT_PROBABILITY)? Basing on BUILTIN_EXPECT_PROBABILITY seems to make sense to me for the above case, but what about the more general case where __builtin_expect() isn't used and branch probabilities are guessed based on other heurstics? Index: loop-invariant.c === --- loop-invariant.c(revision 216843) +++ loop-invariant.c(working copy) @@ -1000,6 +1000,11 @@ find_invariants_bb (basic_block bb, bool { rtx_insn *insn; + + /* Don't move invariants from cold block inside the loop. */ + if (bb->frequency <= (bb->loop_father->header->frequency / 10) ) +return; + FOR_BB_INSNS (bb, insn) { if (!NONDEBUG_INSN_P (insn)) Thanks for any feedback, Pat
Re: Loop invariant motion from cold block
On Thu, Nov 6, 2014 at 7:08 PM, Pat Haugen wrote: > It appears to me that both loop invariant motion passes (tree/rtl) don't > look at basic block frequencies and will gladly hoist invariant code from a > cold block within a loop. This can impact performance by executing (possibly > costly) code that would otherwise not be executed, adds another register > that is live through the loop, and possibly increases number of callee-saved > registers used in the procedure (which require save/restore) if there is a > call in the loop. Following simplified test was patterned after code > observed in Boost library, which apparently has a fair number of asserts > sprinkled throughout. > > volatile int x; > void assert_fail(int, char *, char *); > void foo(int *a, int n, int k) { > int i; > > for(i=0; i if (__builtin_expect(x,0)) > assert_fail(k / 5, "one", "two"); > a[i] = k; > } > } > > Compiling the above with -O2, tree-LIM (tree-ssa-loop-im.c) will yank the > 'k/5' expression out of the loop and rtl-LIM (loop-invariant.c) will yank > the address computations for the 2 strings out of the loop. The default > probability for __builtin_expect is 90%, but even adding --param > builtin-expect-probability=100 to the compile such that the block containing > the call looks to never be executed still results in invariant hoisting. > > A simple change like the following prevents the hoisting in the rtl pass, I > assume something similar could also be added to the tree pass. But I'd like > to hear others thoughts on whether this is an acceptable approach/issue that > should be fixed. The code below prevents motion if the block containing it > is executed <= 10% of the time the loop header is. Is there a better > approach there instead of another magic number? Should it be based on > PARAM_VALUE (BUILTIN_EXPECT_PROBABILITY)? Basing on > BUILTIN_EXPECT_PROBABILITY seems to make sense to me for the above case, but > what about the more general case where __builtin_expect() isn't used and > branch probabilities are guessed based on other heurstics? Shouldn't we never hoist anything from a bb with lower execution frequency to a bb with higher one? It seems LIM simply assumes that inside a loop is always higher frequency than outside of it. So - why artificially have that factor of 0.1 instead of simply checking for less than? Btw, we also hoist through multiple loop nests so even an unlikely branch in the innermost loop may execute more times than the outermost loop entry. So I believe the change has to be made somewhere else, like in invariantness_dom_walker::before_dom_children. Btw, for your example with calling a noreturn function in the loop exit destination the block containing the k/5 shouldn't be in the loop at all. I wonder why they are not using __attribute__((noreturn)) in which case they'd even get better profiles can can get rid of the __builtin_expect? Richard. > Index: loop-invariant.c > === > --- loop-invariant.c(revision 216843) > +++ loop-invariant.c(working copy) > @@ -1000,6 +1000,11 @@ find_invariants_bb (basic_block bb, bool > { >rtx_insn *insn; > > + > + /* Don't move invariants from cold block inside the loop. */ > + if (bb->frequency <= (bb->loop_father->header->frequency / 10) ) > +return; > + >FOR_BB_INSNS (bb, insn) > { >if (!NONDEBUG_INSN_P (insn)) > > > Thanks for any feedback, > Pat > > >
Re: RFC: Update ISL under gcc/infrastructure/ ? // Remove CLooG?
> As the ISL code generator has been default since a while and we did not get > many bug reports, the actual switch seems to have worked well. We could > probably still need some testing, but in this case it is most likely time to > drop the CLooG support entirely. Are you interested to provide the relevant > patches? > > Also, as Tobias suggested we should raise the minimal supported isl level to > 0.14 to be sure PR 62289 is fixed. Yes, if you don’t mind, I’ll try to provide them by the end of the week. I’m planning to make two patches: the first one removes files related to CLooG code generator, the second one removes support of CLooG library from configure and make files. What do you think about this? -- Cheers, Roman Gareev.
Re: RFC: Update ISL under gcc/infrastructure/ ? // Remove CLooG?
On 06.11.2014 20:08, Roman Gareev wrote: As the ISL code generator has been default since a while and we did not get many bug reports, the actual switch seems to have worked well. We could probably still need some testing, but in this case it is most likely time to drop the CLooG support entirely. Are you interested to provide the relevant patches? Also, as Tobias suggested we should raise the minimal supported isl level to 0.14 to be sure PR 62289 is fixed. Yes, if you don’t mind, I’ll try to provide them by the end of the week. I’m planning to make two patches: the first one removes files related to CLooG code generator, the second one removes support of CLooG library from configure and make files. What do you think about this? Sounds good as long as they will compile and pass tests independently. Please also remember updating the documentation. Cheers, Tobias
Re: Loop invariant motion from cold block
On 11/06/2014 01:00 PM, Richard Biener wrote: Shouldn't we never hoist anything from a bb with lower execution frequency to a bb with higher one? It seems LIM simply assumes that inside a loop is always higher frequency than outside of it. So - why artificially have that factor of 0.1 instead of simply checking for less than? Yes, that would seem to be the best general approach. My initial hack was done wrt to example given and apparently I just kept my tunnel vision goggles on. Btw, we also hoist through multiple loop nests so even an unlikely branch in the innermost loop may execute more times than the outermost loop entry. So I believe the change has to be made somewhere else, like in invariantness_dom_walker::before_dom_children. Btw, for your example with calling a noreturn function in the loop exit destination the block containing the k/5 shouldn't be in the loop at all. I wonder why they are not using __attribute__((noreturn)) in which case they'd even get better profiles can can get rid of the __builtin_expect? Good point, and probably the approach that should really be taken in that code. Thanks, Pat
Re: RFC: Update ISL under gcc/infrastructure/ ? // Remove CLooG?
Richard Biener wrote: On Thu, Nov 6, 2014 at 8:02 AM, Tobias Grosser wrote: Also, as Tobias suggested we should raise the minimal supported isl level to 0.14 to be sure PR 62289 is fixed. As I am testing with system isl/cloog that would be unfortunate as it means I'd either drop testing graphite for 4.8 and 4.9 or for 5.0 ... AFAIK ISL 12.x and 13+ cannot co-exist in the system due to include file conflicts. An alternative would be to still accept ISL 0.12.x in terms of configure, while updating infrastructure / contrib/download_prerequisites to ISL 0.14 (while keeping 0.12 for GCC 4.8/4.9 on the ftp server). The question is only what then to write at https://gcc.gnu.org/install/prerequisites.html? "0.14.0" (but still accepting 0.12)? "0.12.2 or later, recommended 0.14.0"? Tobias
gcc-4.8-20141106 is now available
Snapshot gcc-4.8-20141106 is now available on ftp://gcc.gnu.org/pub/gcc/snapshots/4.8-20141106/ and on various mirrors, see http://gcc.gnu.org/mirrors.html for details. This snapshot has been generated from the GCC 4.8 SVN branch with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_8-branch revision 217203 You'll find: gcc-4.8-20141106.tar.bz2 Complete GCC MD5=3a3c474c18240199f009efa5362e3f14 SHA1=b5605b0972fd79967afa63012ae673f78619cd79 Diffs from 4.8-20141030 are available in the diffs/ subdirectory. When a particular snapshot is ready for public consumption the LATEST-4.8 link is updated and a message is sent to the gcc list. Please do not use a snapshot before it has been announced that way.
RE: [Aarch64] LRA
That's what I assumed. However, can reload spill GPRs into FPRs as LRA does? For even after specifying -mno-lra, I still see excessive slots in FPRs. Thank you, -- Evandro Menezes Austin, TX > -Original Message- > From: Richard Earnshaw [mailto:rearn...@arm.com] > Sent: Wednesday, November 05, 2014 11:49 > To: Evandro Menezes; gcc@gcc.gnu.org > Subject: Re: [Aarch64] LRA > > On 05/11/14 17:14, Evandro Menezes wrote: > > It doesn't seem that the option -mno-lra does what it implies. If so, > > what does it do, for the it does result in differences. > > It causes the compiler to use 'reload' rather than LRA for handling part of > the register allocation process. > > R.
Re: Question on param MAX_PENDING_LIST_LENGTH in sched-deps
On 11/04/14 20:29, Bin.Cheng wrote: Hi, The parameter MAX_PENDING_LIST_LENGTH is set to 32 by default. It seems to me the length of pending list can't be larger than 32. But in sched-deps.c, below code is used: /* Pending lists can't get larger with a readonly context. */ if (!deps->readonly && ((deps->pending_read_list_length + deps->pending_write_list_length) > MAX_PENDING_LIST_LENGTH)) Since we compares it using ">", the list can have 33 instructions at most, which is inconsistent with the parameter name. Well, it's some kind of nit picking. Yea, seems like a bit of a nit. Might be worth fixing as someone might size an array based on the param's value only to find out it can have an extra element. Another question. This parameter is introduced to prevent GCC from running for too long time. I am not clear if the running time is a quadratic function of pending list, or a quadratic function of dependency nodes? If it's the latter, could we use the number of dependency nodes as the parameter directly? Because in some cases like memory initialization, there are more than 32 store instructions in flow, but the dependency are actually very simple. See: https://gcc.gnu.org/ml/gcc-patches/2001-07/msg01668.html I don't recall what part of the algorithm was quadratic, but it should be too hard to find out given the thread references the target & the testcase. jeff
Re: [Aarch64] LRA
On Thu, Nov 6, 2014 at 3:03 PM, Evandro Menezes wrote: > That's what I assumed. However, can reload spill GPRs into FPRs as LRA > does? For even after specifying -mno-lra, I still see excessive slots in > FPRs. Not fully. What is happening most likely is IRA is deciding to use some FPRs for some psedu-registers due to the costs (look at the .ira dump) and then reloading from FPRs to GPRs. Thanks, Andrew > > Thank you, > > -- > Evandro Menezes Austin, TX > > >> -Original Message- >> From: Richard Earnshaw [mailto:rearn...@arm.com] >> Sent: Wednesday, November 05, 2014 11:49 >> To: Evandro Menezes; gcc@gcc.gnu.org >> Subject: Re: [Aarch64] LRA >> >> On 05/11/14 17:14, Evandro Menezes wrote: >> > It doesn't seem that the option -mno-lra does what it implies. If so, >> > what does it do, for the it does result in differences. >> >> It causes the compiler to use 'reload' rather than LRA for handling part > of >> the register allocation process. >> >> R. > >
Re: Question on param MAX_PENDING_LIST_LENGTH in sched-deps
On Fri, Nov 7, 2014 at 7:10 AM, Jeff Law wrote: > On 11/04/14 20:29, Bin.Cheng wrote: >> >> Hi, >> The parameter MAX_PENDING_LIST_LENGTH is set to 32 by default. It >> seems to me the length of pending list can't be larger than 32. But >> in sched-deps.c, below code is used: >>/* Pending lists can't get larger with a readonly context. */ >>if (!deps->readonly >>&& ((deps->pending_read_list_length + >> deps->pending_write_list_length) >>> MAX_PENDING_LIST_LENGTH)) >> >> Since we compares it using ">", the list can have 33 instructions at >> most, which is inconsistent with the parameter name. >> >> Well, it's some kind of nit picking. > > Yea, seems like a bit of a nit. Might be worth fixing as someone might size > an array based on the param's value only to find out it can have an extra > element. Yes, I found this only because in some extreme case with lots of consecutive stores, length of 33 breaks the store pairing one time in the middle of instruction flow. > >> >> Another question. This parameter is introduced to prevent GCC from >> running for too long time. I am not clear if the running time is a >> quadratic function of pending list, or a quadratic function of >> dependency nodes? If it's the latter, could we use the number of >> dependency nodes as the parameter directly? Because in some cases >> like memory initialization, there are more than 32 store instructions >> in flow, but the dependency are actually very simple. > > See: > > https://gcc.gnu.org/ml/gcc-patches/2001-07/msg01668.html > > I don't recall what part of the algorithm was quadratic, but it should be > too hard to find out given the thread references the target & the testcase. At least dependency analysis part is of quadratic complexity because for each new memory access, it iterates over all insns in the list to see if it depends on them. But this seems not the real problem, there are comments in both source code and the patch you mentioned, that perf issue is caused by too many dependency nodes created in this scenario. Also this is introduced at time we only have low frequency/memory machines, I bet it can be relax to a larger number without hurting compilation time. The reason I didn't is because I can't get anything good from it, seems it barely has any impact on performance. Thanks, bin
Re: Match-and-simplify and COND_EXPR
On Thu, Nov 6, 2014 at 2:40 AM, Richard Biener wrote: > On Thu, 6 Nov 2014, Richard Biener wrote: > >> On Wed, 5 Nov 2014, Andrew Pinski wrote: >> >> > Hi, >> > I was trying to hook up tree-ssa-phiopt to match-and-simplify using >> > either gimple_build (or rather using gimple_simplify depending on if >> > we want to produce cond_expr for conditional move). I ran into a >> > problem. >> > With the pattern below: >> > /* a ? 0 : 1 -> a if 0 and 1 are integral types. */ >> > (simplify >> > (cond_expr @0 integer_zerop integer_onep) >> > (if (INTEGRAL_TYPE_P (type)) >> > (convert @0))) >> >> Ok, so you are capturing a GENERIC expr here but nothing knows that. >> It would work if you'd do (ugh) >> >> (for op (lt le eq ne ge gt) >> (simplify >> (cond_expr (op @0 @1) integer_zerop integer_onep) >> (if (INTEGRAL_TYPE_P (type)) >>(convert (op @0 @1) >> (simplify >> (cond_expr SSA_NAME@0 integer_zerop integer_onep) >> (if (INTEGRAL_TYPE_P (type)) >>(convert @0 >> >> as a workaround. To make your version work will require (quite) >> some special-casing in the code generator or maybe the resimplify >> helper. Let me see if I can cook up a "simple" fix. > > Sth like below (for the real fix this has to be replicated in > all gimple_resimplifyN functions). I'm missing a testcase > where the pattern would apply (and not be already folded by fold), > so I didn't check if it actually works. You do need to check if seq is NULL though as gimple_build depends on seq not being NULL. But otherwise yes this works for me. > > Bah, of course we should fix COND_EXPRs to not embed a GENERIC > expr... Yes totally agree. For my changes to tree-ssa-phiopt, I no longer embed it. Though we need to change loop ifconvert still. Thanks, Andrew > > Richard. > > Index: gcc/gimple-match-head.c > === > --- gcc/gimple-match-head.c (revision 217035) > +++ gcc/gimple-match-head.c (working copy) > @@ -90,6 +90,13 @@ gimple_resimplify1 (gimple_seq *seq, > code_helper *res_code, tree type, tree *res_ops, > tree (*valueize)(tree)) > { > + /* ??? Stupid tcc_comparison GENERIC trees in COND_EXPRs. */ > + if (COMPARISON_CLASS_P (res_ops[0])) > +res_ops[0] = gimple_build (seq, > + TREE_CODE (res_ops[0]), TREE_TYPE (res_ops[0]), > + TREE_OPERAND (res_ops[0], 0), > + TREE_OPERAND (res_ops[0], 1)); > + >if (constant_for_folding (res_ops[0])) > { >tree tem = NULL_TREE; >
Re: Match-and-simplify and COND_EXPR
On November 7, 2014 5:03:19 AM CET, Andrew Pinski wrote: >On Thu, Nov 6, 2014 at 2:40 AM, Richard Biener >wrote: >> On Thu, 6 Nov 2014, Richard Biener wrote: >> >>> On Wed, 5 Nov 2014, Andrew Pinski wrote: >>> >>> > Hi, >>> > I was trying to hook up tree-ssa-phiopt to match-and-simplify >using >>> > either gimple_build (or rather using gimple_simplify depending on >if >>> > we want to produce cond_expr for conditional move). I ran into a >>> > problem. >>> > With the pattern below: >>> > /* a ? 0 : 1 -> a if 0 and 1 are integral types. */ >>> > (simplify >>> > (cond_expr @0 integer_zerop integer_onep) >>> > (if (INTEGRAL_TYPE_P (type)) >>> > (convert @0))) >>> >>> Ok, so you are capturing a GENERIC expr here but nothing knows that. >>> It would work if you'd do (ugh) >>> >>> (for op (lt le eq ne ge gt) >>> (simplify >>> (cond_expr (op @0 @1) integer_zerop integer_onep) >>> (if (INTEGRAL_TYPE_P (type)) >>>(convert (op @0 @1) >>> (simplify >>> (cond_expr SSA_NAME@0 integer_zerop integer_onep) >>> (if (INTEGRAL_TYPE_P (type)) >>>(convert @0 >>> >>> as a workaround. To make your version work will require (quite) >>> some special-casing in the code generator or maybe the resimplify >>> helper. Let me see if I can cook up a "simple" fix. >> >> Sth like below (for the real fix this has to be replicated in >> all gimple_resimplifyN functions). I'm missing a testcase >> where the pattern would apply (and not be already folded by fold), >> so I didn't check if it actually works. > >You do need to check if seq is NULL though as gimple_build depends on >seq not being NULL. But otherwise yes this works for me. > >> >> Bah, of course we should fix COND_EXPRs to not embed a GENERIC >> expr... > >Yes totally agree. For my changes to tree-ssa-phiopt, I no longer >embed it. Though we need to change loop ifconvert still. Istr expansion or code quality does not like us to cse the condition of two cobd_exprs either. After all I had a patch set at some point doing that conversion (though as well for gimple_conds). Richard. >Thanks, >Andrew > >> >> Richard. >> >> Index: gcc/gimple-match-head.c >> === >> --- gcc/gimple-match-head.c (revision 217035) >> +++ gcc/gimple-match-head.c (working copy) >> @@ -90,6 +90,13 @@ gimple_resimplify1 (gimple_seq *seq, >> code_helper *res_code, tree type, tree *res_ops, >> tree (*valueize)(tree)) >> { >> + /* ??? Stupid tcc_comparison GENERIC trees in COND_EXPRs. */ >> + if (COMPARISON_CLASS_P (res_ops[0])) >> +res_ops[0] = gimple_build (seq, >> + TREE_CODE (res_ops[0]), TREE_TYPE >(res_ops[0]), >> + TREE_OPERAND (res_ops[0], 0), >> + TREE_OPERAND (res_ops[0], 1)); >> + >>if (constant_for_folding (res_ops[0])) >> { >>tree tem = NULL_TREE; >>
Re: Match-and-simplify and COND_EXPR
> On Nov 6, 2014, at 11:24 PM, Richard Biener wrote: > >> On November 7, 2014 5:03:19 AM CET, Andrew Pinski wrote: >> On Thu, Nov 6, 2014 at 2:40 AM, Richard Biener >> wrote: >>> On Thu, 6 Nov 2014, Richard Biener wrote: >>> > On Wed, 5 Nov 2014, Andrew Pinski wrote: > > Hi, > I was trying to hook up tree-ssa-phiopt to match-and-simplify >> using > either gimple_build (or rather using gimple_simplify depending on >> if > we want to produce cond_expr for conditional move). I ran into a > problem. > With the pattern below: > /* a ? 0 : 1 -> a if 0 and 1 are integral types. */ > (simplify > (cond_expr @0 integer_zerop integer_onep) > (if (INTEGRAL_TYPE_P (type)) >(convert @0))) Ok, so you are capturing a GENERIC expr here but nothing knows that. It would work if you'd do (ugh) (for op (lt le eq ne ge gt) (simplify (cond_expr (op @0 @1) integer_zerop integer_onep) (if (INTEGRAL_TYPE_P (type)) (convert (op @0 @1) (simplify (cond_expr SSA_NAME@0 integer_zerop integer_onep) (if (INTEGRAL_TYPE_P (type)) (convert @0 as a workaround. To make your version work will require (quite) some special-casing in the code generator or maybe the resimplify helper. Let me see if I can cook up a "simple" fix. >>> >>> Sth like below (for the real fix this has to be replicated in >>> all gimple_resimplifyN functions). I'm missing a testcase >>> where the pattern would apply (and not be already folded by fold), >>> so I didn't check if it actually works. >> >> You do need to check if seq is NULL though as gimple_build depends on >> seq not being NULL. But otherwise yes this works for me. >> >>> >>> Bah, of course we should fix COND_EXPRs to not embed a GENERIC >>> expr... >> >> Yes totally agree. For my changes to tree-ssa-phiopt, I no longer >> embed it. Though we need to change loop ifconvert still. > > Istr expansion or code quality does not like us to cse the condition of two > cobd_exprs either. After all I had a patch set at some point doing that > conversion (though as well for gimple_conds). I thought I changed that when I did the expansion of cond_expr into conditional move. We need to something similar for cond_expr of jumps too. Thanks, Andrew > > Richard. > >> Thanks, >> Andrew >> >>> >>> Richard. >>> >>> Index: gcc/gimple-match-head.c >>> === >>> --- gcc/gimple-match-head.c (revision 217035) >>> +++ gcc/gimple-match-head.c (working copy) >>> @@ -90,6 +90,13 @@ gimple_resimplify1 (gimple_seq *seq, >>>code_helper *res_code, tree type, tree *res_ops, >>>tree (*valueize)(tree)) >>> { >>> + /* ??? Stupid tcc_comparison GENERIC trees in COND_EXPRs. */ >>> + if (COMPARISON_CLASS_P (res_ops[0])) >>> +res_ops[0] = gimple_build (seq, >>> + TREE_CODE (res_ops[0]), TREE_TYPE >> (res_ops[0]), >>> + TREE_OPERAND (res_ops[0], 0), >>> + TREE_OPERAND (res_ops[0], 1)); >>> + >>> if (constant_for_folding (res_ops[0])) >>> { >>> tree tem = NULL_TREE; >>> > >