Re: GCC mini-summit - compiling for a particular architecture
Mike Stump wrote:
> On Apr 20, 2007, at 6:42 PM, Robert Dewar wrote:
>> One possibility would be to have a -Om switch (or whatever) that says
>> "do all optimizations for this machine that help".
> Ick, gross. No.

Well OK, Ick, but below you recommend removing the overly pedantic rule. I agree with that, but the above is a compromise suggestion if we can't remove the rule. So, Mike, my question is: assuming we cannot remove the rule, what do you want to do?

a) nothing
b) something like the above
c) something else, please specify

I must say the rule about all optimizations being the same on all machines seems odd to me.

> I'd look at it this way: it isn't unreasonable to have cost metrics that are in fact different for each cpu, and possibly each tune choice, that greatly affect _any_ codegen choice. Sure, we can unroll the loops always on all targets, but we can crank up the costs of extra instructions on chips where those costs are high; net result, almost no unrolling. For chips where the costs are cheap and they need the exposed instructions to be able to optimize further, trivially, the costs involved are totally different. Net result: better codegen for each. I do however think the concept of not allowing targets to set and unset optimization choices is, well, overly pedantic.
Re: GCC mini-summit - benchmarks
Kenneth Hoste wrote:
> I'm not sure what 'tests' means here... Are test cases being extracted from the SPEC CPU2006 sources? Or are you referring to the validity tests of the SPEC framework itself (to check whether the output generated by some binary conforms with their reference output)?

The claim is that SPEC CPU2006 has source code bugs that cause it to fail when compiled by gcc. We weren't given a specific list of problems. There are known problems with older SPEC benchmarks, though. For instance, vortex fails on some targets unless compiled with -fno-strict-aliasing.
-- Jim Wilson, GNU Tools Support, http://www.specifix.com
Re: Problem building gcc on Cygwin
Tom Dickens wrote:
> ../gcc/configure -enable-languages=c,c++,fortran
> make[1]: Leaving directory `/cygdrive/c/gcc-4.1.2/obj'

You ran the wrong configure script. You must always run the toplevel configure script, not the one inside the gcc directory. So instead of doing

  cd gcc-4.1.2
  mkdir obj
  cd obj
  ../gcc/configure

which will fail, you should instead do

  mkdir obj
  cd obj
  ../gcc-4.1.2/configure

which will work.
-- Jim Wilson, GNU Tools Support, http://www.specifix.com
Re: A question on gimplifier
H. J. Lu wrote:
> __builtin_ia32_vec_set_v2di will be expanded to
>
>   [(set (match_operand:V2DI 0 "register_operand" "=x")
>         (vec_merge:V2DI
>           (vec_duplicate:V2DI
>             (match_operand:DI 2 "nonimmediate_operand" "rm"))
>           (match_operand:V2DI 1 "register_operand" "0")
>           (match_operand:SI 3 "const_pow2_1_to_2_operand" "n")))]

Named rtl expanders aren't allowed to clobber their inputs. You will need to generate a pseudo-reg temp in the expander, copy the first input to the temp, and then use the temp as the output/input argument. There are probably lots of existing examples in the i386 *.md files to look at. See for instance the reduc_splus_v4sf pattern in the sse.md file.
-- Jim Wilson, GNU Tools Support, http://www.specifix.com
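As a rough sketch of the temp-copy approach Jim describes (the pattern name and template here are hypothetical, not the actual sse.md code; gen_reg_rtx and emit_move_insn are the usual GCC internal helpers for this):

```
(define_expand "vec_set_example"
  [(match_operand:V2DI 0 "register_operand" "")
   (match_operand:V2DI 1 "register_operand" "")
   (match_operand:DI 2 "nonimmediate_operand" "")]
  ""
{
  /* Fresh pseudo so the named expander never clobbers operands[1].  */
  rtx tmp = gen_reg_rtx (V2DImode);
  emit_move_insn (tmp, operands[1]);
  /* ... emit the vec_merge insn with tmp as the "0"-constrained
     output/input operand, then move tmp to operands[0] ...  */
  emit_move_insn (operands[0], tmp);
  DONE;
})
```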
Re: GCC mini-summit - compiling for a particular architecture
On Fri, 2007-04-20 at 19:28 -0400, Robert Dewar wrote:
> Steve Ellcey wrote:
>> This seems unfortunate. I was hoping I might be able to turn on loop
>> unrolling for IA64 at -O2 to improve performance. I have only started
>> looking into this idea but it seems to help performance quite a bit,
>> though it is also increasing size quite a bit too so it may need some
>> modification of the unrolling parameters to make it practical.
>
> To me it is obvious that optimizations are target dependent. For
> instance loop unrolling is really a totally different optimization
> on the ia64 as a result of the rotating registers.

My feeling is that it would be much more useful to have more detailed documentation of optimization flags in the GCC manual, at least mentioning the type of source code and the architectures where each optimization option is interesting, than to mess with new flags or change longstanding -On policies. Look at what we're starting from:

<<
@item -funroll-loops
@opindex funroll-loops
Unroll loops whose number of iterations can be determined at compile
time or upon entry to the loop.  @option{-funroll-loops} implies
@option{-frerun-cse-after-loop}.  This option makes code larger, and
may or may not make it run faster.

@item -funroll-all-loops
@opindex funroll-all-loops
Unroll all loops, even if their number of iterations is uncertain when
the loop is entered.  This usually makes programs run more slowly.
@option{-funroll-all-loops} implies the same options as
@option{-funroll-loops},
>>

It could gain a few more paragraphs written by knowledgeable people. And expanding documentation doesn't introduce regressions :).

Laurent
Re: GCC mini-summit - compiling for a particular architecture
On Apr 21, 2007, at 3:12 AM, Robert Dewar wrote:
> So, Mike, my question is, assuming we cannot remove the rule what do you want to do

I think in the end, each situation is different and we have to find the best solution for each situation. So, in that spirit, let's open a discussion for the exact case you're thinking of. Now, the closest I've come to -Om in the past would be -fast, which means, tune for spec. :-)
How do you get the benefit of -fstrict-aliasing?
I've decided to try to contribute modifications to the C code that is generated by the Gambit Scheme->C compiler so that (a) it doesn't have any aliasing violations and (b) more aliasing distinctions can be made (the car and cdr of a pair don't overlap with the entries of a vector, etc.). This was in response to a measured 20% speedup in some numerical code with -fstrict-aliasing instead of -fno-strict-aliasing, nearly all of which came because gcc then knew that stores to a vector of doubles didn't change the values of variables on the stack.

Part (a) is essentially a non-issue for user-written code, since the only aliasing problems of which I am aware are in the bignum library, so as a preliminary test I added -fstrict-aliasing to the gcc command line and reran the benchmark suite on a 2GHz G5. To my surprise, while there were some improvements, the -fstrict-aliasing option led to slower code overall, in some cases quite severely (7.014 seconds to 11.794 seconds, for example), and, perhaps not surprisingly, compilation times were significantly longer. This was true both with Apple's 4.0.1 and FSF 4.1.2.

So I'm wondering whether certain options have to be included on the command line to get the benefits of -fstrict-aliasing. The current command line is

gcc -mcpu=970 -m64 -no-cpp-precomp -Wall -W -Wno-unused -O1
    -fno-math-errno -fschedule-insns2 -fno-trapping-math
    -fno-strict-aliasing -fwrapv -fexpensive-optimizations -fforce-addr
    -fpeephole2 -falign-jumps -falign-functions -fno-function-cse
    -ftree-copyrename -ftree-fre -ftree-dce -fregmove -fgcse-las
    -freorder-functions -fcaller-saves -fno-if-conversion2
    -foptimize-sibling-calls -fcse-skip-blocks -funit-at-a-time
    -finline-functions -fomit-frame-pointer -fPIC -fno-common -bundle
    -flat_namespace -undefined suppress -fstrict-aliasing

where the optimizations between -fwrapv (which is no longer necessary; I should remove that) and -fstrict-aliasing were chosen by some experiments with genetic algorithms.
I didn't think that adding aliasing information could lead to worse code. So I'm wondering how to use that aliasing information more effectively to get better code. Brad
Re: How do you get the benefit of -fstrict-aliasing?
On 4/21/07, Bradley Lucier <[EMAIL PROTECTED]> wrote:
> I didn't think that adding aliasing information could lead to worse code. So I'm wondering how to use that aliasing information more effectively to get better code.

What aliasing information can do is allow an optimization pass to create register pressure, which causes our current RA (register allocator) to go crazy and make code worse. This is true of any optimization, even one that takes register pressure into account (which is actually the wrong thing to do, really).

Thanks,
Andrew Pinski
maybe_infinite_loop?
We still have some lno bits in our tree. We tried to remove them and found:

  gzip    +0.5%
  vpr     -0.4%
  gcc     -3.2%
  mcf     -0.3%
  crafty  +0.2%
  parser  +0.2%
  perlbmk -2.2%
  gap     +0.2%
  vortex  -0.1%
  bzip2   +1.9%
  twolf   -0.7%

on x86 (probably a core2 duo) in our 4.2 tree (with the rest of our local patches). -3.2% means a 3.2% better codegen (roughly) with the lno bits. I didn't rerun the numbers for mainline to see if they are still applicable.

Of all the LNO bits, the last major bit seems to be the one below. I don't even know if it is responsible for the benefit we see. I thought I'd mention it, as a 2-3% win on two of the spec tests seems worthwhile. I'd be interested in finding someone who might be interested in tracking down where the benefit comes from in the patch and pushing into mainline what goodness there is to be had from it. Any takers? If I can find someone, I'd be happy to send out the version of the patch for mainline. [hmm, just 567 lines] On second thought, I'll just include it at the end for reference. Note, there is one soft conflict to resolve in going from the 4.2 context to mainline, which I've not yet resolved.

2004-07-13  Zdenek Dvorak  <[EMAIL PROTECTED]>

	* Makefile.in (tree-ssa-loop.o, tree-ssa-dce.o): Add function.h
	dependency.
	* builtins.c (expand_builtin): Handle BUILT_IN_MAYBE_INFINITE_LOOP.
	* builtins.def (BUILT_IN_MAYBE_INFINITE_LOOP): New builtin.
	* function.h (struct function): Add marked_maybe_inf_loops field.
	* timevar.def (TV_MARK_MILOOPS): New timevar.
	* tree-flow.h (mark_maybe_infinite_loops): Declare.
	* tree-optimize.c (init_tree_optimization_passes): Add
	pass_mark_maybe_inf_loops.
	* tree-pass.h (pass_mark_maybe_inf_loops): Declare.
	* tree-ssa-dce.c: Include function.h.
	(find_obviously_necessary_stmts): Mark back edges only if they
	were not marked already.
	(perform_tree_ssa_dce): Do not call mark_dfs_back_edges here.
	* tree-ssa-loop-niter.c (unmark_surely_finite_loop,
	mark_maybe_infinite_loops): New functions.
	* tree-ssa-loop.c: Include function.h.
	(tree_mark_maybe_inf_loops, gate_tree_mark_maybe_inf_loops,
	pass_mark_maybe_inf_loops): New pass.
	* tree-ssa-operands.c (function_ignores_memory_p): Add
	BUILT_IN_MAYBE_INFINITE_LOOP.

Doing diffs in .:
--- ./builtins.c.~1~	2007-04-13 10:06:18.0 -0700
+++ ./builtins.c	2007-04-21 15:54:01.0 -0700
@@ -6562,6 +6562,12 @@ expand_builtin (tree exp, rtx target, rt
       return target;
       break;

+    /* APPLE LOCAL begin lno */
+    case BUILT_IN_MAYBE_INFINITE_LOOP:
+      /* This is just a fake statement that expands to nothing.  */
+      return const0_rtx;
+    /* APPLE LOCAL end lno */
+
     case BUILT_IN_FETCH_AND_ADD_1:
     case BUILT_IN_FETCH_AND_ADD_2:
     case BUILT_IN_FETCH_AND_ADD_4:
--- ./builtins.def.~1~	2007-04-13 10:06:19.0 -0700
+++ ./builtins.def	2007-04-21 15:54:01.0 -0700
@@ -639,6 +639,8 @@ DEF_LIB_BUILTIN(BUILT_IN_FREE, "
 DEF_GCC_BUILTIN(BUILT_IN_FROB_RETURN_ADDR, "frob_return_addr", BT_FN_PTR_PTR, ATTR_NULL)
 DEF_EXT_LIB_BUILTIN(BUILT_IN_GETTEXT, "gettext", BT_FN_STRING_CONST_STRING, ATTR_FORMAT_ARG_1)
 DEF_C99_BUILTIN(BUILT_IN_IMAXABS, "imaxabs", BT_FN_INTMAX_INTMAX, ATTR_CONST_NOTHROW_LIST)
+/* APPLE LOCAL lno */
+DEF_GCC_BUILTIN(BUILT_IN_MAYBE_INFINITE_LOOP, "maybe_infinite_loop", BT_FN_VOID, ATTR_NULL)
 DEF_GCC_BUILTIN(BUILT_IN_INIT_DWARF_REG_SIZES, "init_dwarf_reg_size_table", BT_FN_VOID_PTR, ATTR_NULL)
 DEF_EXT_LIB_BUILTIN(BUILT_IN_FINITE, "finite", BT_FN_INT_DOUBLE, ATTR_CONST_NOTHROW_LIST)
 DEF_EXT_LIB_BUILTIN(BUILT_IN_FINITEF, "finitef", BT_FN_INT_FLOAT, ATTR_CONST_NOTHROW_LIST)
--- ./cfghooks.c.~1~	2007-02-12 20:10:38.0 -0800
+++ ./cfghooks.c	2007-04-21 15:59:31.0 -0700
@@ -405,6 +405,10 @@ edge
 split_block (basic_block bb, void *i)
 {
   basic_block new_bb;
+  /* APPLE LOCAL begin lno */
+  bool irr = (bb->flags & BB_IRREDUCIBLE_LOOP) != 0;
+  int flags = EDGE_FALLTHRU;
+  /* APPLE LOCAL end lno */

   if (!cfg_hooks->split_block)
     internal_error ("%s does not support split_block", cfg_hooks->name);
@@ -416,6 +420,13 @@ split_block (basic_block bb, void *i)
   new_bb->count = bb->count;
   new_bb->frequency = bb->frequency;
   new_bb->loop_depth = bb->loop_depth;
+  /* APPLE LOCAL begin lno */
+  if (irr)
+    {
+      new_bb->flags |= BB_IRREDUCIBLE_LOOP;
+      flags |= EDGE_IRREDUCIBLE_LOOP;
+    }
+  /* APPLE LOCAL end lno */

   if (dom_info_available_p (CDI_DOMINATORS))
     {
@@ -560,6 +571,15 @@ split_edge (edge e)
     }
 }

+  /* APPLE LOCAL begin lno */
+  if (irr)
+    {
+      ret->flags |= BB_IRREDUCIBLE_LOOP;
+      EDGE_PRED (ret, 0)->flags |= EDGE_IRREDUCIBLE_LOOP;
+      EDGE_SUCC (ret, 0)->flags |= EDGE_IR
Re: maybe_infinite_loop?
On 4/21/07, Mike Stump <[EMAIL PROTECTED]> wrote:
> We still have some lno bits in our tree. We tried to remove them and found: Of all the LNO bits, the last major bits seems to be the below bit. I don't even know if it is responsible for the benefit we see. I thought I'd mention it, as a 2-3% win on two of the spec tests seems worthwhile.

The only benefit as far as I can tell is causing an extra call at the tree level, which could cause aliasing analysis to go wrong with call-clobbered variables. The empty-loop removal pass in 4.1.0 and above removes more empty loops than the LNO patch could ever remove. So really I think you are just seeing bogus effects of slightly different aliasing and register pressure. Nothing to get your hopes up about, anyway.

Thanks,
Andrew Pinski