Re: on removal of line number notes at the end of BBs
> On Oct 11, 2006, Ian Lance Taylor <[EMAIL PROTECTED]> wrote: > > >> int x; int f() { x = 0; > >> while(1); } > > >> We get line number notes for code only up to "x = 0;". > > > I assume this is only a problem when not optimizing. > > The opposite, actually. It's optimization that breaks it. > > Of course optimization can change stuff and debug info sometimes is > lost, but in this case we *do* have code for that loop, so we might as > well try to preserve the line number info somehow. We shouldn't drop > it just because we turn annotated-with-line-numbers jumps into > fallthru edges that later have to be re-emitted without line numbers. > > > Without looking at the code, the problem looks quite similar to one I > > fixed here: > > It is similar, indeed, but this removal takes place in RTL.

Yep, I guess it is cfglayout just replacing the jump with a fresh one. Adding code to record locators on edges, in the same field the tree-level code already uses, is probably not too difficult. I can try to get it done next week...

Honza
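P.S. A rough, untested sketch of the idea, just to make it concrete (the field and accessor names here are approximate and from memory, not from an actual patch):

   /* When cfglayout discards an annotated jump in favour of a fallthru
      edge, stash the jump's locator on the edge (the tree level already
      keeps a goto_locus there)...  */
   if (JUMP_P (insn) && single_succ_p (bb))
     single_succ_edge (bb)->goto_locus = INSN_LOCATOR (insn);

   /* ...and when a fresh jump later has to be emitted for edge E,
      reuse the saved locator instead of emitting it without one.  */
   rtx jump = emit_jump_insn_after (gen_jump (label), BB_END (e->src));
   if (e->goto_locus)
     INSN_LOCATOR (jump) = e->goto_locus;

The only point is that the information survives on the edge between the two places; the real patch would of course have to match whatever representation the tree level already uses for goto_locus.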
Memory usage of 4.2 versus 4.3 (at branchpoints)
Hi, to give some perspective to the discussion on memory usage, I generated a comparison of the 4.2 branchpoint to the 4.3 branchpoint from logs of our memory tester. I would say it is quite pleasing to see that 4.3 is not really a regression relative to 4.2 in most tests, as was customary in previous releases, but still we ought to do a lot better ;) There is a possibly interesting 35% regression at -O1 on combine.c...

Honza

comparing combine.c compilation at -O0 level:
Peak amount of GGC memory allocated before garbage collecting run decreased from 9595k to 8929k, overall -7.46%
Peak amount of GGC memory still allocated after garbage collecting decreased from 8942k to 8558k, overall -4.49%
Amount of produced GGC garbage decreased from 40099k to 34878k, overall -14.97%
Amount of memory still referenced at the end of compilation decreased from 6705k to 6073k, overall -10.41%

Overall memory needed: 24905k -> 24797k
Peak memory use before GGC: 9595k -> 8929k
Peak memory use after GGC: 8942k -> 8558k
Maximum of released memory in single GGC run: 2737k -> 2576k
Garbage: 40099k -> 34878k
Leak: 6705k -> 6073k
Overhead: 5788k -> 4715k
GGC runs: 317 -> 294

comparing combine.c compilation at -O1 level:
Overall memory allocated via mmap and sbrk increased from 26820k to 36237k, overall 35.11%
Amount of produced GGC garbage decreased from 60618k to 55748k, overall -8.74%
Amount of memory still referenced at the end of compilation decreased from 6888k to 6151k, overall -11.98%

Overall memory needed: 26820k -> 36237k
Peak memory use before GGC: 17364k -> 16999k
Peak memory use after GGC: 17180k -> 16830k
Maximum of released memory in single GGC run: 2372k -> 2342k
Garbage: 60618k -> 55748k
Leak: 6888k -> 6151k
Overhead: 7578k -> 6045k
GGC runs: 387 -> 369

comparing combine.c compilation at -O2 level:
Amount of memory still referenced at the end of compilation decreased from 6973k to 6252k, overall -11.53%

Overall memory needed: 26820k -> 26496k
Peak memory use before GGC: 17367k -> 16999k
Peak memory use after GGC: 17180k -> 16830k
Maximum of released memory in single GGC run: 2452k -> 2884k
Garbage: 77388k -> 76521k
Leak: 6973k -> 6252k
Overhead: 10022k -> 8785k
GGC runs: 456 -> 443

comparing combine.c compilation at -O3 level:
Overall memory allocated via mmap and sbrk decreased from 26820k to 25596k, overall -4.78%
Amount of memory still referenced at the end of compilation decreased from 7030k to 6317k, overall -11.28%

Overall memory needed: 26820k -> 25596k
Peak memory use before GGC: 18365k -> 17988k
Peak memory use after GGC: 17995k -> 17536k
Maximum of released memory in single GGC run: 3510k -> 4130k
Garbage: 107793k -> 107354k
Leak: 7030k -> 6317k
Overhead: 13563k -> 12408k
GGC runs: 509 -> 490

comparing insn-attrtab.c compilation at -O0 level:
Overall memory allocated via mmap and sbrk increased from 80924k to 83700k, overall 3.43%
Amount of produced GGC garbage decreased from 146623k to 125964k, overall -16.40%
Amount of memory still referenced at the end of compilation decreased from 9856k to 9117k, overall -8.11%

Overall memory needed: 80924k -> 83700k
Peak memory use before GGC: 69469k -> 68247k
Peak memory use after GGC: 45007k -> 43913k
Maximum of released memory in single GGC run: 36247k -> 35708k
Garbage: 146623k -> 125964k
Leak: 9856k -> 9117k
Overhead: 19791k -> 16830k
GGC runs: 252 -> 231

comparing insn-attrtab.c compilation at -O1 level:
Overall memory allocated via mmap and sbrk increased from 111696k to 118444k, overall 6.04%
Peak amount of GGC memory allocated before garbage collecting increased from 94037k to 94551k, overall 0.55%
Peak amount of GGC memory still allocated after garbage collecting increased from 83553k to 90403k, overall 8.20%
Amount of memory still referenced at the end of compilation decreased from 10072k to 8977k, overall -12.20%

Overall memory needed: 111696k -> 118444k
Peak memory use before GGC: 94037k -> 94551k
Peak memory use after GGC: 83553k -> 90403k
Maximum of released memory in single GGC run: 32589k -> 31807k
Garbage: 289765k -> 289427k
Leak: 10072k -> 8977k
Overhead: 36663k -> 29408k
GGC runs: 245 -> 240

comparing insn-attrtab.c compilation at -O2 level:
Overall memory allocated via mmap and sbrk decreased from 127120k to 114404k, overall -11.11%
Peak amount of GGC memory allocated before garbage collecting run decreased from 113347k to 95237k, overall -19.02%
Peak amount of GGC memory still allocated after garbage collecting increased from 83466k to 90625k, overall 8.58%
Amount of produced GGC garbage decreased from 372181k to 328157k, overall -13.42%
Amount of memory still referenced at the end of compilation decreased from 10176k to 8982k, overall -13.30%

Overall memory needed: 127120k -> 114404k
Peak memory
Re: compiling very large functions.
> On 11/4/06, Kenneth Zadeck <[EMAIL PROTECTED]> wrote: > >Richard Guenther wrote: > >> On 11/4/06, Kenneth Zadeck <[EMAIL PROTECTED]> wrote: > >>> Richard Guenther wrote: > >>> > On 11/4/06, Kenneth Zadeck <[EMAIL PROTECTED]> wrote: > >>> >> I think that it is time that we in the GCC community took some > >>> time to > >>> >> address the problem of compiling very large functions in a somewhat > >>> >> systematic manner. > >>> >> > >>> >> GCC has two competing interests here: it needs to be able to provide > >>> >> state of the art optimization for modest sized functions and it > >>> needs to > >>> >> be able to properly process very large machine generated functions > >>> using > >>> >> reasonable resources. > >>> >> > >>> >> I believe that the default behavior for the compiler should be that > >>> >> certain non essential passes be skipped if a very large function is > >>> >> encountered. > >>> >> > >>> >> There are two problems here: > >>> >> > >>> >> 1) defining the set of optimizations that need to be skipped. > >>> >> 2) defining the set of functions that trigger the special processing. > >>> >> > >>> >> > >>> >> For (1) I would propose that three measures be made of each function. > >>> >> These measures should be made before inlining occurs. The three > >>> measures > >>> >> are the number of variables, the number of statements, and the > >>> number of > >>> >> basic blocks. > >>> > > >>> > Why before inlining? These three numbers can change quite > >>> significantly > >>> > as a function passes through the pass pipeline. So we should try > >>> to keep > >>> > them up-to-date to have an accurate measurement. > >>> > > >>> I am flexible here. We may want inlining to be able to update the > >>> numbers. However, I think that we should drive the inlining aggression > >>> based on these numbers. > >> > >> Well, for example jump threading and tail duplication can cause these > >> numbers to significantly change. Also CFG instrumentation and PRE > >> can increase the BB count. So we need to deal with cases where an > >> optimization produces an overly large number of basic blocks or > >> instructions. > >> (like by throttling those passes properly) > >> > >I lean to leave the numbers static even if they do increase as time goes > >by. Otherwise you get two effects, the first optimizations get to be > >run more, and you get the weird non-linear step functions where small > >changes in some upstream function affect the downstream ones. > > Ok, I guess we can easily flag each function as having > - many BBs > - big BBs > - complex CFG (many edges) > and set these flags at CFG construction time during the lowering phase > (which is after the early inlining pass I believe). > > The number of basic blocks is kept up-to-date during optimization, the > other numbers would need to be re-generated if we want to keep them > up-to-date.

We definitely need some heuristics to throttle down expensive optimizations for very huge functions. I am not sure I like the static flag rather than a per-case approach. The idea of big versus small is very different for individual optimizations, the properties change significantly (i.e. expansion of a min/max function will turn a big BB into a big CFG), and it will be a lot of fun to set up proper thresholds here. I also do believe that we need to take care not to have O(n^k) where k is significantly higher than necessary; most of our current scalability issues fall into this category, and we have large complexity just out of a little laziness.

We are implementing many fancier algorithms in order to be scalable, so we should try not to undermine that by stupid mistakes. To take the specific example (that triggered this discussion) of an extreme testcase triggering on the new fwprop pass, I think it is quite different. Analyzing the testcase, there were a number of problems that were trivial to solve and are fixed in mainline now. The remaining problems are usually easy too - quadratic removal from singly linked lists in out-of-SSA and the scheduler. I think those ought to be cured, since they hurt not only such monstrosities. On the other hand, GVN-PRE/df.c do have algorithmic problems that we might or might not want to solve. I care less whether we add code to disable the optimization or come up with a better algorithm, and I am very happy Daniel considers the second alternative for PRE. The df.c issue can probably also be solved either by using the FUD graph built by our SSA algorithm over RTL, or by turning the FUD graph into DU/UD chains, avoiding the need for the bitmaps. This might be an interesting task if such big functions turn out to be a common bottleneck.

> > But with just using three (or even only one?) flag, we can easily fit this > information in struct function. > > I also like the idea of a "hot" function flag to be able to dynamically > switch between optimize_size and !optimize_size in the tree optimizers. > Profile based inlining already tries to follow that route. See cfun->function_f
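For what it is worth, the flags Richard describes above would amount to something along these lines - purely illustrative, with made-up threshold names, not a real patch:

   /* In function.h, three cheap bits in struct function:  */
   unsigned int has_many_blocks : 1;
   unsigned int has_big_blocks : 1;
   unsigned int has_complex_cfg : 1;	/* many edges */

   /* Set once at the end of CFG construction (after early inlining);
      BB_THRESHOLD and EDGE_THRESHOLD are hypothetical --params.  */
   cfun->has_many_blocks = n_basic_blocks > BB_THRESHOLD;
   cfun->has_complex_cfg = n_edges > EDGE_THRESHOLD;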
Re: compiling very large functions.
> Brooks Moses wrote on 11/06/06 17:41: > > >Is there a need for any fine-grained control on this knob, though, or > >would it be sufficient to add an -O4 option that's equivalent to -O3 but > >with no optimization throttling? > > > We need to distinguish two orthogonal issues here: effort and enabled > transformations. Currently, -O3 means enabling transformations that (a) > may not result in an optimization improvement, and (b) may change the > semantics of the program. -O3 will also enable "maximum effort" out of

-O3 enables inlining, unswitching and GCSE after reload. How do those change the semantics of the program? For me -O3 always meant that we enable the code-expanding optimizations that usually speed things up, but where the code size cost might not be worth it.

Honza

> every transformation. > > In terms of effort, we currently have individual knobs in the form of -f > and/or --params settings. It should not be hard to introduce a global > -Oeffort=xxx parameter. But, it will take some tweaking to coordinate > what -f/--params/-m switches should that enable.
Re: block reordering at GIMPLE level?
Hi, I know little about CLI, but assuming that your backend is nonstandard enough that it makes sense to replace the RTL bits, I guess it would make sense to make bb-reorder run at the GIMPLE level too, while keeping bb-reorder at the RTL level for the common compilation path. This is an example of a pass that has very little dependency on the particular IL, so our CFG manipulation abstraction can probably be extended rather easily to make it practically IL independent. The reason why it is run late is that the RTL backend modifies the CFG enough to make this important. CLI might have the same property. What you might want to consider is to simply port our CFG code to the CLI IL representation, whatever it is, and share the pass.

The tracer pass, very similar to bb-reorder in nature, has been ported to work on gimple, but the implementation is not in mainline yet. You might want to take a look at the changes necessary, as bb-reorder should be about the same (minus the SSA updating, since you probably want to bb-reorder after leaving SSA form).

Honza

> > Hello, > > While working on our CLI port, I realized that we were missing, among > others, the block reordering pass. This is because we emit CLI code > before the RTL passes are reached. > Looking at the backend optimizations, it is clear that some modify the > CFG. But my understanding is that loop optimizations and unrolling are > also being moved to GIMPLE. I do not know about others. > > Could it be that sometime all the optimizations that modify the CFG are > run on GIMPLE? > Is there any plan/interest in having a block layout pass running at > GIMPLE level? > > Cheers, > > -- > Erven.
Re: block reordering at GIMPLE level?
> Hello, > CLI back-end uses GIMPLE representation (to be more precise, a subset of > GIMPLE, the code undergoes a CLI-specific GIMPLE lowering at the end of > middle-end passes) and just emits bytecode in a CLI-specific assembler > generation pass. > Because of that, we (I mean CLI back-end project) wouldn't even have to > redefine our CFG, we already use CFG for GIMPLE. > I think it's interesting for us to check whether the existing RTL > reordering pass may be reused with little or no modification and, if > not, to see if it can be made more IL independent.

The BB reordering pass has some IL-specific parts for hot/cold function splitting, but the rest should work fine with little change and cleanup. (The main algorithm basically duplicates blocks via the already virtualized interface and constructs the new ordering via bb->aux pointers. It then relies on RTL-specific cfglayout code to do the actual reordering, which you can do on gimple with little effort since the GIMPLE BBs are easily reorderable.)

Do CLI's conditional jumps have one destination plus a fallthrough, or two destinations? If you have no natural fallthrough edges, reordering blocks is easy. If you do have a fallthrough after a conditional jump, you will need to imitate the cfglayout code inserting unconditional jumps onto edges.

Honza

> > Cheers, > Roberto
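To illustrate the mechanism (nothing more than a sketch - the real code lives in bb-reorder.c, cfglayout.c and cfghooks.c):

   /* 1) The IL-independent part: the reordering algorithm records the
         chosen order by chaining blocks through bb->aux (here trivially
         keeping the current order, just to show the mechanism).  */
   basic_block bb, prev = ENTRY_BLOCK_PTR;
   FOR_EACH_BB (bb)
     {
       prev->aux = bb;
       prev = bb;
     }
   prev->aux = NULL;

   /* 2) The IL-specific part: walk the chain and physically relink the
         blocks.  On GIMPLE this is essentially just moving blocks around
         (e.g. via the move_block_after cfghook); on RTL, cfglayout's
         fixup_reorder_chain additionally inserts unconditional jumps
         wherever the fallthrough assumption no longer holds.  */
   for (bb = ENTRY_BLOCK_PTR->aux; bb && bb->aux; bb = bb->aux)
     move_block_after (bb->aux, bb);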
Re: RFC: SMS problem with emit_copy_of_insn_after copying REG_NOTEs
> Hello all, > > I'm preparing and testing SMS correction/improvements patch and while > testing it on the SPU with the vectorizer testcases I've got an ICE in > the "gcc_assert ( MAX_RECOG_OPERANDS - i)" in function copy_insn_1 in > emit_rtl.c. The call traces back to the loop versioning called in > modulo-sched.c before the SMSing actually starts. The specific > instruction it tries to copy when it fails is > > (insn 32 31 33 4 (parallel [ >(set (reg:SI 162) >(div:SI (reg:SI 164) >(reg:SI 156))) >(set (reg:SI 163) >(mod:SI (reg:SI 164) >(reg:SI 156))) >(clobber (scratch:SI)) >(clobber (scratch:SI)) >(clobber (scratch:SI)) >(clobber (scratch:SI)) >(clobber (scratch:SI)) >(clobber (scratch:SI)) >(clobber (scratch:SI)) >(clobber (scratch:SI)) >(clobber (scratch:SI)) >(clobber (reg:SI 130 hbr)) >]) 129 {divmodsi4} (insn_list:REG_DEP_TRUE 30 > (insn_list:REG_DEP_TRUE 31 (nil))) >(expr_list:REG_DEAD (reg:SI 164) >(expr_list:REG_DEAD (reg:SI 156) >(expr_list:REG_UNUSED (reg:SI 130 hbr) >(expr_list:REG_UNUSED (scratch:SI) >(expr_list:REG_UNUSED (scratch:SI) >(expr_list:REG_UNUSED (scratch:SI) >(expr_list:REG_UNUSED (scratch:SI) >(expr_list:REG_UNUSED (scratch:SI) >(expr_list:REG_UNUSED (scratch:SI) >(expr_list:REG_UNUSED (scratch:SI) >(expr_list:REG_UNUSED >(scratch:SI) >(expr_list:REG_UNUSED > (scratch:SI) > (expr_list:REG_UNUSED (reg:SI 163) >(nil))) > > The error happens in the first call to copy_insn_1 in the loop below > (copied from emit_copy_of_insn_after from emit_rtl.c): > > > for (link = REG_NOTES (insn); link; link = XEXP (link, 1)) >if (REG_NOTE_KIND (link) != REG_LABEL) > { >if (GET_CODE (link) == EXPR_LIST) > REG_NOTES (new) >= copy_insn_1 (gen_rtx_EXPR_LIST (REG_NOTE_KIND (link), > XEXP (link, 0), > REG_NOTES (new))); >else > REG_NOTES (new) >= copy_insn_1 (gen_rtx_INSN_LIST (REG_NOTE_KIND (link), > XEXP (link, 0), > REG_NOTES (new))); > } >

Thanks for sending the updated patch, I will try to look over it tomorrow.

> Tracing the execution of copy_insn_1, it seems that it goes over the > same REG_NOTES many times (it seems to be a quadratic time complexity > algorithm). This causes "copy_insn_n_scratches++" to be executed more > times than there are SCRATCH registers (and even REG_NOTES) leading to > the failure in the assert. There are 9 SCRATCH registers used in the > instruction and MAX_RECOG_OPERANDS is 30 for the SPU. > > Since copy_insn_n_scratches is initialized in copy_insn and since we > go over regnotes over and over again, I've modified in the loop > above the two calls to copy_insn_1 with the calls to copy_insn. This > caused the ICEs in the testsuite to disappear. > > I wonder if this constitutes a legitimate fix or I'm missing something?

I believe you really want to avoid the quadratic amount of work. This is probably best done by

  REG_NOTES (new)
    = gen_rtx_EXPR_LIST (REG_NOTE_KIND (link),
			 copy_insn_1 (XEXP (link, 0)), REG_NOTES (new));

so copy_insn_1 doesn't recursively descend into the already copied chain.

Honza

> > Thanks in advance, > Vladimir
Re: RFC: SMS problem with emit_copy_of_insn_after copying REG_NOTEs
> Hi, Jan, > Thanks for fast response! > > I've tested the change you proposed and we still failed in the assert > checking that the number of SCRATCHes being too large (>30) while > copying the REG_NOTES of the instruction (see below) using just 9 > SCRATCH registers.

Hi, apparently there is another reason copy_insn_1 can do a quadratic amount of work besides this one; I don't see it yet, however. Just to be sure, did you update both cases of the wrong recursion, the EXPR_LIST one I sent and the INSN_LIST hunk just below? Otherwise adding a breakpoint on copy_insn_1 and seeing how it manages to do so many recursions will surely help :)

Honza
Re: RFC: SMS problem with emit_copy_of_insn_after copying REG_NOTEs
Hi, thanks for testing. I've bootstrapped/regtested this variant of the patch and committed it as obvious.

Honza

2006-12-30  Jan Hubicka  <[EMAIL PROTECTED]>
	    Vladimir Yanovsky  <[EMAIL PROTECTED]>

	* emit-rtl.c (emit_copy_of_insn_after): Fix bug causing exponential
	amount of copies of REG_NOTEs list.

Index: emit-rtl.c
===
--- emit-rtl.c	(revision 120274)
+++ emit-rtl.c	(working copy)
@@ -5297,14 +5297,12 @@ emit_copy_of_insn_after (rtx insn, rtx a
       {
 	if (GET_CODE (link) == EXPR_LIST)
 	  REG_NOTES (new)
-	    = copy_insn_1 (gen_rtx_EXPR_LIST (REG_NOTE_KIND (link),
-					      XEXP (link, 0),
-					      REG_NOTES (new)));
+	    = gen_rtx_EXPR_LIST (REG_NOTE_KIND (link),
+				 copy_insn_1 (XEXP (link, 0)), REG_NOTES (new));
 	else
 	  REG_NOTES (new)
-	    = copy_insn_1 (gen_rtx_INSN_LIST (REG_NOTE_KIND (link),
-					      XEXP (link, 0),
-					      REG_NOTES (new)));
+	    = gen_rtx_INSN_LIST (REG_NOTE_KIND (link),
+				 copy_insn_1 (XEXP (link, 0)), REG_NOTES (new));
       }
 
   /* Fix the libcall sequences.  */
Re: RFC: SMS problem with emit_copy_of_insn_after copying REG_NOTEs
Hi, I do apologize for the breakage; apparently I tested the only target that didn't break. Andrew's patch seems OK to me (as does the patch just omitting the copy_insn_1 call in the second branch, which should make the situation no worse than before my patch and still avoid the quadratic memory consumption).

> > > > ChangeLog: > > > > * emit-rtl.c (emit_copy_of_insn_after): Copy REG_LIBCALL note specially. > > Copy REG_RETVAL note specially and fix it and the referencing > > REG_LIBCALL note. > > Use copy_rtx instead of copy_insn_1 for EXPR_LIST note. > > Abort if we get a INSN_LIST for the note. > > > > Thanks, > > Andrew Pinski > > Also I should mention, this also fixes a possible bug with libcalls that > are embedded in one another. Before we were just assuming if we have a > REG_RETVAL, > then the previous REG_LIBCALL would be the start of the libcall but that > would be > incorrect with embedded libcalls.

We should not have nested libcalls at all. One level of libcalls is painful enough and we take care not to do this.

Honza

> > Thanks, > Andrew Pinski
Re: Nested libcalls (was: Re: RFC: SMS problem with emit_copy_of_insn_after copying REG_NOTEs)
> On Sunday 31 December 2006 00:59, Jan Hubicka wrote: > > > Also I should mention, this also fixes a possible bug with libcalls that > > > are embedded in one another. Before we were just assuming if we have a > > > REG_RETVAL, then the previous REG_LIBCALL would be the start of the > > > libcall but that would be incorrect with embedded libcalls. > > > > We should not have nested libcalls at all. One level of libcalls is > > painful enough and we take care to not do this. > > It's unclear whether we can have nested libcalls or not. We expect them > in some places (especially, see libcall_stack in gcse.c:local_cprop_pass) > but are bound to fail miserably in others. > > This is something I've been wondering for a while. Maybe someone can > give a definitive answer: Can libcalls be nested, or not?

My understanding is that libcalls should not be nested and there is code to avoid nesting in emit_libcall_block. I believe this was also the outcome of some discussions with Rth a while back. Historically some code was written to support nested libcalls, but I believe it was never useful and it just got propagated into new code. If we care, we could probably tweak verify_flow_info to verify that libcalls are unnested and don't span BB boundaries, but I have too many other verifier-related plans right now to be able to promise to get into this soon.

Honza

> > Gr. > Steven
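Such a check would be fairly small; an untested sketch of the kind of verification verify_flow_info could grow (just to show the shape of it, not a patch):

   basic_block bb;
   rtx insn;

   FOR_EACH_BB (bb)
     {
       int depth = 0;

       FOR_BB_INSNS (bb, insn)
	 if (INSN_P (insn))
	   {
	     if (find_reg_note (insn, REG_LIBCALL, NULL_RTX))
	       depth++;
	     if (find_reg_note (insn, REG_RETVAL, NULL_RTX))
	       depth--;
	     /* Libcalls must not nest...  */
	     gcc_assert (depth == 0 || depth == 1);
	   }
       /* ...and must not span a basic block boundary.  */
       gcc_assert (depth == 0);
     }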
Re: RFC: SMS problem with emit_copy_of_insn_after copying REG_NOTEs
> Hi, > Sorry for possibly causing confusion. I had tested the patch on my ICE > testcase and bootstrapped for -enable-languages=C, but didn't run the > full bootstrap. Bootstrapping the latest Andrew's patch on ppc-linux > and testing it on SPU.

Vladimir, I bootstrapped/regtested the patch myself on i686 before committing it, so the rule was met here. Unfortunately i686 doesn't seem to show the regression. I've bootstrapped/regtested x86_64 and i686 with Andrew's patch and it all works fine.

Honza
Re: RFC: SMS problem with emit_copy_of_insn_after copying REG_NOTEs
Hi, I've committed the following patch that fixes the obvious problem of calling copy_insn_1 for an INSN_LIST argument. It seems to solve the problems I can reproduce, and it bootstraps x86_64-linux/i686-linux and Darwin (thanks to andreast). The patch was preapproved by Ian.

This is meant as a fast fix to unbreak bootstrap. Andrew's optimization still makes sense as a micro-optimization, and the nested libcall issue probably ought to be resolved, but both can be dealt with incrementally. My apologies for the problems.

Honza

Index: ChangeLog
===
--- ChangeLog	(revision 120315)
+++ ChangeLog	(working copy)
@@ -1,3 +1,8 @@
+2007-01-01  Jan Hubicka  <[EMAIL PROTECTED]>
+
+	* emit-rtl.c (emit_copy_of_insn_after): Do not call copy_insn_1 for
+	INSN_LIST.
+
 2007-01-01  Mike Stump  <[EMAIL PROTECTED]>
 
 	* configure.ac (HAVE_GAS_LITERAL16): Add autoconf check for
Index: emit-rtl.c
===
--- emit-rtl.c	(revision 120313)
+++ emit-rtl.c	(working copy)
@@ -5302,7 +5302,7 @@ emit_copy_of_insn_after (rtx insn, rtx a
 	else
 	  REG_NOTES (new)
 	    = gen_rtx_INSN_LIST (REG_NOTE_KIND (link),
-				 copy_insn_1 (XEXP (link, 0)), REG_NOTES (new));
+				 XEXP (link, 0), REG_NOTES (new));
       }
 
   /* Fix the libcall sequences.  */
Retiring IPA-branch
Hi, thanks to Diego, Andrew (MacLeod), Daniel and Roger's effort on reviewing the IPA branch merge patches, I hope to commit the patch to enable IPA-SSA today, after re-testing. This means that the main part of the IPA branch has been merged. There are still features on the IPA branch that I hope will follow shortly (in particular the pre-inline local optimization, IP-CP on SSA form by Razya, and aliasing by Olga), but those are all rather localized changes.

I would like to retire that branch, since it has gathered a lot of dust and a lot of things have been renamed while merging to mainline, making it outdated. Because the branch is actually used as a base for other development, I would like to ask if there is a need for some kind of "transition plan" to mainline. If so, I can merge mainline into the branch and remove most of the no longer needed stuff, if that seems to make things easier, or do something else. I will also be happy to help with any problems with updating the code to the mainline implementation or, naturally, with solving any issues that appear with the merge.

At the end of stage 1 I would also like to open a new branch targeted at further improvements and cleanups of the IPA infrastructure, which can co-exist in parallel with the LTO branch solving other aspects we need to address before getting usable link-time optimization. Issues I would like to deal with are further cleanups to make writing IPA passes more convenient, probably finally solving the multiple declaration problems in the C and Fortran frontends (if not done in mainline before that), writing some basic easy IPA passes (such as removal of unused arguments and return values, some basic argument passing simplification and similar cheap stuff), and solving as many of the scalability issues as possible.

Again, everyone is welcome to help, and I hope I didn't discourage anyone from that by missing stage 1 of 4.2 and thus delaying the whole ipa-branch project by a year. I really didn't anticipate the 4.2 schedule being so delayed, and I had problems justifying all the changes in late stage 1 of 4.2 without having enough code developed to take advantage of them.

Honza
Re: Build problem with gcc 4.3.0 20070108 (experimental)
> >`/rb.exphome/tools/gcc/obj-i686-pc-linux-gnu/gcc' > >make[2]: *** [all-stage2-gcc] Error 2 > >make[2]: Leaving directory `/rb.exphome/tools/gcc/obj-i686-pc-linux-gnu' > >make[1]: *** [stage2-bubble] Error 2 > >make[1]: Leaving directory `/rb.exphome/tools/gcc/obj-i686-pc-linux-gnu' > >make: *** [bootstrap] Error 2 > > That's honza's patch - but bootstrap doesn't abort for me at that point.

Hi, I've committed the obvious fix. Sorry for that. I wonder why the bootstrap doesn't fail on all targets?

Honza

Index: ChangeLog
===
*** ChangeLog	(revision 120588)
--- ChangeLog	(working copy)
***************
*** 1,3 ****
--- 1,7 ----
+ 2007-01-08  Jan Hubicka  <[EMAIL PROTECTED]>
+ 
+ 	* tree-vectorizer.c (gate_increase_alignment): Fix return type.
+ 
  2007-01-08  Richard Guenther  <[EMAIL PROTECTED]>
  
  	* tree-ssa-ccp.c (maybe_fold_offset_to_array_ref): Use type
Index: tree-vectorizer.c
===
*** tree-vectorizer.c	(revision 120588)
--- tree-vectorizer.c	(working copy)
*************** increase_alignment (void)
*** 2255,2261 ****
    return 0;
  }
  
! static int
  gate_increase_alignment (void)
  {
    return flag_section_anchors && flag_tree_vectorize;
--- 2255,2261 ----
    return 0;
  }
  
! static bool
  gate_increase_alignment (void)
  {
    return flag_section_anchors && flag_tree_vectorize;
Re: Build problem with gcc 4.3.0 20070108 (experimental)
> > > > > >`/rb.exphome/tools/gcc/obj-i686-pc-linux-gnu/gcc' > > > >make[2]: *** [all-stage2-gcc] Error 2 > > > >make[2]: Leaving directory `/rb.exphome/tools/gcc/obj-i686-pc-linux-gnu' > > > >make[1]: *** [stage2-bubble] Error 2 > > > >make[1]: Leaving directory `/rb.exphome/tools/gcc/obj-i686-pc-linux-gnu' > > > >make: *** [bootstrap] Error 2 > > > > > > That's honza's patch - but bootstrap doesn't abort for me at that point. > > > > Hi, > > I've committed the obvious fix. Sorry for that. I wonder why the > > bootstrap doesn't fail on all targets? > > This only fixes one of the problems, the other one is > function_and_variable_visibility needs to return unsigned int and 0. > This fixes an ICE building libgcc for spu-elf on powerpc-linux-gnu.

Hi, I've committed the obvious fix too, and I doubly apologize for the problems. I've double-checked that the patch tested was indeed the same as the one committed, and it is the case. I don't see how it could possibly pass a -Werror bootstrap; I will investigate it tomorrow. Hope that all bootstraps are fine now!

Index: ChangeLog
===
--- ChangeLog	(revision 120589)
+++ ChangeLog	(working copy)
@@ -1,6 +1,7 @@
 2007-01-08  Jan Hubicka  <[EMAIL PROTECTED]>
 
 	* tree-vectorizer.c (gate_increase_alignment): Fix return type.
+	* ipa.c (function_and_variable_visibility): Fix return type.
 
 2007-01-08  Richard Guenther  <[EMAIL PROTECTED]>
 
Index: ipa.c
===
--- ipa.c	(revision 120580)
+++ ipa.c	(working copy)
@@ -220,7 +220,7 @@ cgraph_remove_unreachable_nodes (bool be
    in language point of view but we want to overwrite this default
    via visibilities for the backend point of view.  */
 
-static void
+static unsigned int
 function_and_variable_visibility (void)
 {
   struct cgraph_node *node;
@@ -272,6 +272,7 @@ function_and_variable_visibility (void)
       fprintf (dump_file, "\n\n");
     }
   cgraph_function_flags_ready = true;
+  return 0;
 }
 
 struct tree_opt_pass pass_ipa_function_and_variable_visibility =
Re: Jan Hubicka and Uros Bizjak appointed i386 maintainers
> I am pleased to announce that the GCC Steering Committee has > appointed Jan Hubicka and Uros Bizjak as co-maintainers of the i386 port. > > Please join me in congratulating Jan and Uros on their new role. > Jan and Uros, please update your listings in the MAINTAINERS file.

Thank you! I will try to start reviewing the i386 patch backlog shortly, and I've updated the MAINTAINERS file with the attached patch.

Honza

Index: ChangeLog
===
--- ChangeLog	(revision 120590)
+++ ChangeLog	(working copy)
@@ -1,3 +1,7 @@
+2007-01-08  Jan Hubicka  <[EMAIL PROTECTED]>
+
+	* MAINTAINERS: Add myself as i386 maintainer.
+
 2007-01-08  Kai Tietz  <[EMAIL PROTECTED]>
 
 	* configure.in: Add support for an x86_64-mingw* target.
Index: MAINTAINERS
===
--- MAINTAINERS	(revision 120590)
+++ MAINTAINERS	(working copy)
@@ -53,6 +53,7 @@ h8 port			Kazu Hirata		[EMAIL PROTECTED]
 hppa port		Jeff Law		[EMAIL PROTECTED]
 hppa port		Dave Anglin		[EMAIL PROTECTED]
 i386 port		Richard Henderson	[EMAIL PROTECTED]
+i386 port		Jan Hubicka		[EMAIL PROTECTED]
 ia64 port		Jim Wilson		[EMAIL PROTECTED]
 iq2000 port		Nick Clifton		[EMAIL PROTECTED]
 m32c port		DJ Delorie		[EMAIL PROTECTED]
Re: Mis-handled ColdFire submission?
> I know Andrew replied privately, but I hope he doesn't mind me raising > the issue on-list. I just wanted to gauge the general feeling as to > whether I'd screwed up, and whether I should have submitted the patches > in a different way.

I guess one should rather thank you for taking the time to split the patch into well-described pieces! I think it actually is a good way to allow proper review of a bigger merge. Perhaps it would make sense to hold back patches that depend on earlier ones, or are similar in nature to those already posted, so that changes requested during review can be applied to them without too many re-posts; but I tend to believe that many patches are a lot better than a single big merge patch doing many different changes.

Honza
Re: GCC trunk revision 120704 failed to build spec cpu2k/gcc
> Hello, > > > > > GCC trunk revision 120704 failed to build SPEC cpu2000/gcc on -O1 and > > > > higher optimization level at x86_64-redhat-linux. > > > > > > > > reload1.c: In function 'reload': > > > > reload1.c:449: error: address taken, but ADDRESSABLE bit not set > > > > bad_spill_regs > > > > > > > > reload1.c:449: error: address taken, but ADDRESSABLE bit not set > > > > bad_spill_regs > > > > > > > > reload1.c:449: internal compiler error: verify_stmts failed > > > > Please submit a full bug report, > > > > with preprocessed source if appropriate. > > > > See <http://gcc.gnu.org/bugs.html> for instructions. > > > > > > > > Does anybody see the same? > > > > I can reproduce this on i686. I will check what the problem is, > > this is probably due to some of the recent IPA changes. > ipa-reference.c:analyze_function does not traverse phi nodes, > and thus misses that the address of bad_spill_regs is taken inside the > phi node (and clears the addressable flag from it, causing an ICE > later). I am working on the patch.

Thanks for looking into that. The problem seems to be that ipa-reference and the early optimization changes were more or less tested independently of each other, so I never noticed this problem (it does not reproduce on the IPA branch that has both changes in it, however). The proper fix is probably to make ipa-reference, and symmetrically the other IPA passes, use the PHI operand infrastructure as suggested by the comments earlier.

Honza

> > Zdenek
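For reference, the missing bit is roughly the following (4.x tree-SSA API from memory; the walker callback and its data arguments are only approximate - the point is merely that the PHI arguments of every block get scanned, not only the real statements):

   basic_block bb;
   tree phi;
   int i;

   FOR_EACH_BB_FN (bb, DECL_STRUCT_FUNCTION (decl))
     for (phi = phi_nodes (bb); phi; phi = PHI_CHAIN (phi))
       for (i = 0; i < PHI_NUM_ARGS (phi); i++)
	 /* Scan each PHI argument the same way the pass scans operands
	    of ordinary statements, so an &bad_spill_regs hiding in a
	    PHI argument is noticed.  */
	 walk_tree (&PHI_ARG_DEF (phi, i), scan_for_static_refs,
		    fn, visited_nodes);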
Re: remarks about g++ 4.3 and some comparison to msvc & icc on ia32
> On 1/28/07, tbp <[EMAIL PROTECTED]> wrote: > >Let it be clear from the start this is a potshot and while those > >trends aren't exactly new or specific to my code, i haven't tried to > >provide anything but specific data from one of my app, on > >win32/cygwin. > > > >Primo, gcc getting much better wrt inlining exacerbates the fact that > >it's not as good as other compilers at shrinking the stack frame size, > >and perhaps as was suggested by Uros when discussing that point a pass > >to address that would make sense. > >As i'm too lazy to properly measure cruft across multiple compilers, > >i'll use my rtrt app where i mostly control large scale inlining by > >hand. > >objdump -wdrfC --no-show-raw-insn $1|perl -pe 's/^\s+\w+:\s+//'|perl > >-ne 'printf "%4d\n", hex($1) if /sub\s+\$(0x\w+),%esp/'|sort -r| head > >-n 10 > > > >msvc:2196 2100 1772 1692 1688 1444 1428 1312 1308 1160 > >icc: 2412 2280 2172 2044 1928 1848 1820 1588 1428 1396 > >gcc: 2604 2596 2412 2076 2028 1932 1900 1756 1720 1132 > > It would have been nice to tell us what the particular columns in > this table mean - now we have to decrypt objdump params and > perl postprocessing ourselves. > > (If you are interested in stack size related to inlining you may want > to tune --param large-stack-frame and --param large-stack-frame-growth).

Also, having some testcases showing inlining defects in GCC would be very interesting for me. Now that IPA-SSA has been merged, I plan to do some retuning of the inliner for the 4.3 release, since a lot has changed about the properties of its input and it was originally designed to operate well on the IL used by early tree-ssa. Considering information about stack frame size in the inlining costs is one of the things I believe we should do, but it is also difficult to tune without interesting testcases for it.

Honza

> > Richard.
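(For anyone wanting to experiment with the knobs Richard mentions, the invocation is simply along the lines of

   g++ -O3 --param large-stack-frame=512 --param large-stack-frame-growth=200 rt_render_packet.cc

where the two values here are arbitrary examples rather than recommended settings - larger values let the inliner accept bigger frames and more frame growth before it refuses to inline.)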
Re: remarks about g++ 4.3 and some comparison to msvc & icc on ia32
> On 1/28/07, Richard Guenther <[EMAIL PROTECTED]> wrote: > >On 1/28/07, tbp <[EMAIL PROTECTED]> wrote: > >> objdump -wdrfC --no-show-raw-insn $1|perl -pe 's/^\s+\w+:\s+//'|perl > >> -ne 'printf "%4d\n", hex($1) if /sub\s+\$(0x\w+),%esp/'|sort -r| head > >> -n 10 > >> > >> msvc:2196 2100 1772 1692 1688 1444 1428 1312 1308 1160 > >> icc: 2412 2280 2172 2044 1928 1848 1820 1588 1428 1396 > >> gcc: 2604 2596 2412 2076 2028 1932 1900 1756 1720 1132 > > > >It would have been nice to tell us what the particular columns in > >this table mean - now we have to decrypt objdump params and > >perl postprocessing ourselves. > I should have known better than to post on a sunday morning. Sorry. > That's the sorted 10 largest stack allocations in binaries produced by > each compiler (presuming most everything falls in place). > Each time i verify codegen for a function across all 3, gcc always has > the largest frame by a substantial amount (on ia32). And that's what > that rigorous table is trying to demonstrate ;) > > Basically i'm wondering if a stack frame shrinking pass [ ] is > possible, [ ] makes no sense, [ ] has been done, [ ] is planned etc...

Actually we do have one stack frame shrinking pass already. It depends on where the bloat is coming from - we can pack (with some limitations) the memory used by structures/arrays of different inline functions or lexical blocks. We don't do any packing of spilled registers, nor the shrink wrapping other compilers sometimes implement.

Honza

> > >(If you are interested in stack size related to inlining you may want > >to tune --param large-stack-frame and --param large-stack-frame-growth). > Recently g++ 4.3 has started to complain about "warning: inlining > failed in call to 'xxx': --param large-stack-frame-growth limit > reached [-Winline]. Bumping said large-function-growth by an ungodly > amount did the trick. But it was the sure sign inlining was being > fixed. > There's much less need to babysit it, thanks a lot to whomever wrote > those patches.
Re: remarks about g++ 4.3 and some comparison to msvc & icc on ia32
> On 1/28/07, Jan Hubicka <[EMAIL PROTECTED]> wrote: > >Also having some testcases showing inlining defects in GCC would be > >very interesting for me. Now that IPA-SSA has been merged, I plan to > >do some retuning of the inliner for the 4.3 release since a lot has changed > >about the properties of its input and it was originally designed to operate > >well on the IL used by early tree-ssa. > Gcc, well g++ really, used to be so bad at the inlining game, ie > single op functions/ctors suddenly left out, there were no other > options than to explicitly direct inlining if one cared about > performance. So i don't have much to show, for what i monitored wasn't

I am not quite sure what you mean by direct inlining here. At -O2 G++ still doesn't inline any functions not marked inline by the user; at -O3 we do. I plan to propose changing this behaviour to make -O2 auto-inline everything that is expected to reduce code size. I am just about to leave for 5 days, but I plan to run a couple of benchmarks after that and propose this. I would be interested to know about obvious mistakes GCC makes - GCC now has logic to set the cost of inlining "wrapper" functions (i.e. functions doing just one extra call and casts) to at most 0. It might be interesting to know if some common scenarios are missed.

> under g++ jurisdiction. > Now i know it has improved (much) because obviously other parts are > being stressed.

Well, we are working on it ;) You can take a look at the C++ benchmarks at http://www.suse.de/~gcctest - the work has been ongoing since cgraph was implemented in 2003, another retuning happened around the 4.0 timeframe, and 4.3 has the SSA-based IPA that should be another improvement.

> > >Considering information about stack frame size in the inlining costs is > >one of the things I believe we should do but it is also difficult to tune > >without interesting testcases for it. > I have no idea what would make such testcase interesting to you. > But i can try. > You'll find 2 preprocessed GPLed sources attached with > > frontend.cc, app::frontend_loop() > (i don't particularly care about that function, but on ia32 - x86-64 > is immune - g++ is quite creative about it (large frame, oodles of > upfront zeroing, even if it's a bit better with the gcc-4.3-20070119 > snapshot)) > frame size, msvc 1152 bytes, icc 2108, g++ 2604 > > rt_render_packet.cc, horde::grunt_render_tiles_packet(...) > (this one i care about, inlining is controlled) > frame size, msvc 1688, icc 1804, gcc 1932 > Performance wise on that one msvc lags by 25% and gcc has a slight > lead of a couple percent on icc. > > note: take 2, http://ompf.org/vault/frontend.ii.bz2 > http://ompf.org/vault/rt_render_packet.ii.bz2

Thanks - what is definitely most interesting for me is a self-contained testcase I can easily compile and run, like we have with tramp3d. I will definitely take a look at your testcases, but perhaps only after returning from my trip next weekend, since I am running out of time for all my TODOs today ;)

Concerning the frame sizes, we really need some kind of analysis of where the bloat is coming from - i.e. whether GCC simply inlines too much together, fails to pack the structures well using the existing algorithm, or it is a register pressure problem.

Thanks,
Honza
Re: remarks about g++ 4.3 and some comparison to msvc & icc on ia32
> tbp wrote: > > > Secundo, while i very much appreciate the brand new string ops, it > > seems that on ia32 some array initialization cases were left out, > > hence i still see oodles of 'movl $0x0' when generating code for k8. > > Also those zeroings get coalesced at the top of functions on ia32, and > > i have a function where there's 3 pages of those right after prologue. > > See the attached 'grep 'movl $0x0' dump. > > It looks like Jan and Richard have answered some of your questions about > inlining (or are in the process of doing so), but I haven't seen a > response to this point. > > Certainly, if we're generating zillions of zero-initializations to > contiguous memory, rather than using memset, or an inline loop, that > seems unfortunate. Would you please file a bug report?

I thought the comment was more referring to the fact that we will happily generate

  movl $0x0, place1
  movl $0x0, place2
  ...
  movl $0x0, placeMillion

rather than the shorter

  xor %eax, %eax
  movl %eax, ...

but indeed both of those issues should be addressed (and it would be interesting to know where we fail to synthesize memset in real scenarios). With the repeated mov issue, unfortunately, I don't know what the best place would be: we obviously don't want to constrain register allocation too much, and after regalloc I guess only a machine-dependent pass is the hope, which is pretty ugly (but not that difficult to code, at least at the local level).

Honza

> > Thanks, > > -- > Mark Mitchell > CodeSourcery > [EMAIL PROTECTED] > (650) 331-3385 x713
Re: remarks about g++ 4.3 and some comparison to msvc & icc on ia32
> Jan Hubicka wrote: > > > I thought the comment was more referring to the fact that we will happily > > generate > > movl $0x0, place1 > > movl $0x0, place2 > > ... > > movl $0x0, placeMillion > > > > rather than shorter > > xor %eax, %eax > > movl %eax, ... > > Yes, that would be an improvement, but, as you say, at some point we > want to call memset. > > > With the repeated mov issue unfortunately I don't know what would be the > > best place: we obviously don't want to constrain register allocation too > > much and after regalloc I guess only machine dependent pass > > I would hope that we could notice this much earlier than that. Wouldn't > this be evident even at the tree level or at least after > stack-allocation in the RTL layer? I wouldn't expect the zeroing to be > coming from machine-dependent code.

What I meant is the generic problem that constants on i386 (especially for moves) increase the instruction encoding, so when multiple copies of the same constant appear in the instruction stream and a register is available, one can add an extra move and use that register instead. Of course we can also have a pass detecting large sets of unrolled mov instructions and packing them into a memset. We can do it either at the early RTL level or with some lowering of initializers at the tree level (I guess many of those sequences actually come from expanding initializers, which are sort of black boxes for most tree optimizers).

A somewhat similar transformation is done by Tomas, who uses the vectorizer infrastructure to detect loops doing memset/memcpy. Those are pretty common, especially for floats/doubles, and after unrolling they also lead to such sequences. I hope he will polish and send the patch soonish.

Honza

> > One possibility is that we're doing something dumb with arrays. Another > possibility is that we're SRA-ing a lot of small structures, which add > up to a ton of stack space. > > I realize that we need a full bug report to be sure, though. > > -- > Mark Mitchell > CodeSourcery > [EMAIL PROTECTED] > (650) 331-3385 x713
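Just to make the loop-to-memset case concrete, this is the kind of source pattern in question (an illustration of mine, not a testcase from this thread): a trivially zeroing loop that, once recognized, becomes a single library call and from there a short sequence of stores.

   /* Illustration only.  A pass that recognizes this loop can rewrite
      the body as memset (d, 0, sizeof (d)); for double, the all-zero
      bit pattern is exactly 0.0, so the transformation is safe.  */
   double d[256];

   void
   zero_d (void)
   {
     int i;

     for (i = 0; i < 256; i++)
       d[i] = 0.0;
   }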
Re: remarks about g++ 4.3 and some comparison to msvc & icc on ia32
> On 1/28/07, Jan Hubicka <[EMAIL PROTECTED]> wrote: > >I am not quite sure what you mean by direct inlining here. At -O2 G++ > Decorating everything in sight with attribute always_inline/noinline > (flatten wasn't an option because it used to be troublesome and not as > 'portable' across compilers).

BTW, when inlining seems to make such a noticeable difference, did you try using profile feedback?

> > >I would be interested to know about obvious mistakes GCC makes - GCC now > >has logic to set the cost of inlining "wrapper" functions (ie functions > >doing just one extra call and casts) to at most 0. It might be > >interesting to know if some common scenarios are missed. > I guess i should remove those attribute and see what it looks like. > > >Well, we are working on it ;) > >You can take a look at the c++ benchmarks http://www.suse.de/~gcctest the > >work is ongoing since cgraph was implemented in 2003, another retuning > >happened at about the 4.0 timeframe, 4.3 has the SSA based IPA that should be > >another improvement. > I'm aware of that progression and some of my code is already being tested > http://www.suse.de/~gcctest/c++bench/raytracer/ ;)

I see, we don't seem to have made that much progress on this testcase performance-wise yet ;)

> >Concerning the frame sizes, we really need some kind of analysis of > >where it is coming from - ie whether GCC simply inlines too much together, or > >fails to pack well the structures using the existing algorithm or it is a > >register pressure problem. > I'm out of my league. I know the frontend_loop function isn't as > horrible on x86-64, giving some credit to the register pressure > hypothesis, but then that code isn't doing anything fancy. > > For the other function, which heavily uses SSE vector intrinsics, g++ > is really doing a good job, if only for the, sometimes, duplicated > structures here & there and the larger frame. But you can rule out > g++'s inlining heuristic as it has no (or shouldn't have) any freedom.

Hmm, so then it should be either structure packing or regalloc. I will be able to take a look only after returning from a course.

Honza

> > If there's anything i can do, do not hesitate. > And thanks for taking notice.
Re: Scheduling an early complete loop unrolling pass?
> > Hi, > > currently with -ftree-vectorize we generate for > > for (i=0; i<3; ++i) > # SFT.4346_507 = VDEF > # SFT.4347_508 = VDEF > # SFT.4348_509 = VDEF > d[i] = 0.0;

Also, Tomas' patch is supposed to catch this special case and convert it into a memset that should subsequently be optimized into an assignment, which should be good enough (which reminds me that I forgot to merge the memset part of the stringop optimizations). Perhaps this can be made a bit more generic and construct INIT_EXPRs for small arrays directly from Tomas's pass (going from memset to assignment works only in special cases). Tomas, what is the status of your patch?

> > for (j=0; j x[j] = d; > > (that is, zero a small vector and use that to initialize an array > of vectors) > > :; > vect_cst_.4501_723 = { 0.0, 0.0 }; > vect_p.4506_724 = (vector double *) &D.76822; > vect_p.4502_725 = vect_p.4506_724; > > # ivtmp.4508_728 = PHI <0(6), ivtmp.4508_729(11)> > # ivtmp.4507_726 = PHI > # ivtmp.4461_601 = PHI <3(6), ivtmp.4461_485(11)> > # SFT.4348_612 = PHI > # SFT.4347_611 = PHI > # SFT.4346_610 = PHI > # i_582 = PHI <0(6), i_118(11)> > :; > # SFT.4346_507 = VDEF > # SFT.4347_508 = VDEF > # SFT.4348_509 = VDEF > *ivtmp.4507_726 = vect_cst_.4501_723; > i_118 = i_582 + 1; > ivtmp.4461_485 = ivtmp.4461_601 - 1; > ivtmp.4507_727 = ivtmp.4507_726 + 16B; > ivtmp.4508_729 = ivtmp.4508_728 + 1; > if (ivtmp.4508_729 < 1) goto ; else goto ; > > # i_722 = PHI > # ivtmp.4461_717 = PHI > :; > > # ivtmp.4461_706 = PHI > # SFT.4348_707 = PHI > # SFT.4347_708 = PHI > # SFT.4346_709 = PHI > # i_710 = PHI > :; > # SFT.4346_711 = VDEF > # SFT.4347_712 = VDEF > # SFT.4348_713 = VDEF > D.76822.D.44378.values[i_710] = 0.0; > i_714 = i_710 + 1; > ivtmp.4461_715 = ivtmp.4461_706 - 1; > if (ivtmp.4461_715 != 0) goto ; else goto ; > > ... > > and we are later not able to do constant propagation to the > second loop which we can do if we first unroll such small loops. > > As we also only vectorize innermost loops I believe doing a > complete unrolling pass early will help in general (I pushed > for this some time ago).

Did you run some benchmarks?

Honza

> > Thoughts? > > Thanks, > Richard. > > -- > Richard Guenther <[EMAIL PROTECTED]> > Novell / SUSE Labs
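For readers without the dump handy, a scalar-simplified, self-contained version of the pattern Richard is describing (my own reduction, not the original testcase) looks like this; if the first loop is fully unrolled early, later passes can propagate the zeros straight into the second loop instead of keeping the temporary array and the vectorized prologue:

   /* Simplified illustration of the "zero a small temporary, then copy
      it" pattern.  With early complete unrolling of the first loop,
      constant propagation can turn the second loop into plain stores
      of 0.0.  */
   void
   init (double *x)
   {
     double d[3];
     int i, j;

     for (i = 0; i < 3; ++i)
       d[i] = 0.0;
     for (j = 0; j < 3; ++j)
       x[j] = d[j];
   }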
Re: Scheduling an early complete loop unrolling pass?
> Richard Guenther <[EMAIL PROTECTED]> wrote on 05/02/2007 18:16:05: > > > On Mon, 5 Feb 2007, Jan Hubicka wrote: > ... > > > Did you run some benchmarks? > > > > Not yet - I'm looking at the C++ SPEC 2006 benchmarks at the moment > > and using vectorization there seems to do a lot of collateral damage > > (maybe not measurable though). > > > > Interesting. In SPEC 2000 there is also a hot small loop in the only C++ > benchmark (eon), which gets vectorized, and as a result degrades > performance. We really should not vectorize such loops, and the solution > is: > 1. FORNOW: use --param min-vect-loop-bound=2 (or some value greater than > 0).

Well, if Tomas manages to post his patch, I think the eon case is dealt with there too (the loop is just an internal loop zeroing a tiny vector, or something like that, as far as I can remember).

Honza

> 2. SOON: rely on the vectorizer to do the cost analysis and decide not to > vectorize such loops, using a cost model - this is in the works. > > dorit > > > Richard. > > > > -- > > Richard Guenther <[EMAIL PROTECTED]> > > Novell / SUSE Labs
Re: False 'noreturn' function does return warnings
> In an OS kernel functions that do not return are commonly used and > practically always their code is beyond gcc's ability to recognize > noreturn functions. A typical example would for example be the BUG() > function in Linux which is implemented as something like: > > static inline void __attribute__((noreturn)) BUG(void) > { > __asm__ __volatile__("trap"); > } > > So the code doesn't return but gcc isn't able to figure that out and the > caring programmer trying to help gcc to do a better job by adding the

Well, sadly the caring programmer won't get the optimization he wants, as after inlining this attribute information is completely lost. What about __builtin_trap? It results in 'int 6', which might not be applicable, but adding some control over it to the i386 backend is definitely an option.

> noreturn is punished with > > warning: 'noreturn' function does return > > There are other such situations in plain C. A common solution is to add > something like while(1) - but that does generate code. Quite a bit > for frequently called or very small functions. This frequently makes the > use of noreturn functions unattractive. So two suggested solutions: > > 1) Bury the warning for good. The programmer has explicitly said >noreturn so if the function does return it's his problem. > > 2) Introduce a __builtin_unreached() builtin function which tells gcc that >it is not being reached, ever, and does not generate any code. This could >be used like: > > static inline void __attribute__((noreturn)) BUG(void) > { > __asm__ __volatile__("trap"); > __builtin_unreached();

This is a bit difficult to do in general, since it introduces a new kind of control flow construct. It would be better to express such functions explicitly to GCC.

Honza

> } > > It would even conveniently be usable in macros or conditionals. > > Ralf
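To make the __builtin_trap suggestion concrete - only a sketch of the alternative, with the caveat already mentioned about which trap instruction the i386 backend picks:

   /* __builtin_trap is already known to GCC not to return, so this
      version gets neither the "'noreturn' function does return"
      warning nor any padding code after the trap.  Whether the
      instruction it expands to (the 'int 6'-style trap mentioned
      above) is acceptable for a kernel is the open question.  */
   static inline void __attribute__((noreturn)) BUG(void)
   {
     __builtin_trap ();
   }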
Re: Inserting profiling function calls
> Dear All, > >In order to implement a specific basic block profiling, i have to > insert function calls at the end of each basic block or/and at the end > of each function. > To do this I'd like to add a profiling pass similar to the arc profiling. > I'm a beginner in the GCC internal implementation and i hope this > subject will be interesting enough for you to give me attention. > I began to localize where to insert/add this functionality but more > detail on "a magic how to" insert function calls in the generated code > would be very helpful for me. > Is there somewhere a guideline to add a profiling pass?

There is no such guide, but adding a profiling pass is not that different from adding any other pass. For a start, would you please explain what you need to do that can't be done using the existing arc and value profiling?

Honza

> > Thank you for your help, > Patrice
Re: Inserting profiling function calls
> > >Would you for a start please > >explain what do you need to do that can't be done using existing arc and > >value profiling? > > > Sorry, my first mail was not clear about the goal. > Objectives are to follow the execution of functions and basic blocks at > execution time. > To do this, we plan to insert function calls, like mcount is inserted > at the function level for gprof but at the basic block level. > A user library linked with the application can then implement these > functions.

Hi, interesting - I guess you want to generate some sort of traces through the program, or path profiles. This is definitely doable, and I think the best place to do it would be at the same time value profiling is done (that also inserts calls, though for different reasons, so you probably want to look into it). Ironically, we used to have such basic block profiling, but the implementation was removed in favour of edge profiling years ago by me (the implementation was broken, however, and would have needed to be reimplemented anyway).

It would be interesting to get infrastructure for path profiling as described in the Ball & Larus paper http://citeseer.ist.psu.edu/ball96efficient.html which is what some compilers implement. I didn't get around to implementing it since I don't see much use for it in optimization (however, if you look at the follow-up papers, there are some code-duplication-based ideas that might prove to be effective).

Honza

> > Thank you, > Patrice
Re: Any hints on this problem? Thanks!
> ?? wrote: > >Now, my question becomes clear. How to make my inserted function call > >not affect the original state of program? > > Try looking at a similar feature. One such similar feature is the > mcount calls emitted for profiling. The various solutions for mcount > include > 1) saving lots of registers before the call, and restoring lots of > registers after the call. This has a high cost which may not work in > your case. > 2) Writing mcount in assembly language, so that you can avoid clobbering > any registers.

Actually I would say that copying the mcount code would be difficult (because we need to emit the calls at arbitrary basic blocks rather than in the prologue) and not very maintainable. I believe it would be better to simply insert the calls at the gimple level, the way edge/value profiling does. Using custom calling conventions to speed the process up might be interesting; I guess it can be done via a target hook + attribute, but I would leave that for a later time.

Honza
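A very rough sketch of what "insert the calls at the gimple level" could look like, using the pre-tuples tree-SSA API from memory (the __bb_trace hook name is made up, and the code glosses over empty blocks and abnormal edges - it is only meant to show the shape of such a pass; the value-profiling insertion code in tree-profile.c is the real reference):

   /* Illustrative sketch only: call a user hook __bb_trace (bb_index)
      at the end of every basic block, before any control statement.  */
   static void
   insert_bb_trace_calls (void)
   {
     tree fntype = build_function_type_list (void_type_node,
					     unsigned_type_node, NULL_TREE);
     tree fndecl = build_fn_decl ("__bb_trace", fntype);
     basic_block bb;

     FOR_EACH_BB (bb)
       {
	 block_stmt_iterator bsi = bsi_last (bb);
	 tree arg = build_int_cst (unsigned_type_node, bb->index);
	 tree call = build_function_call_expr (fndecl,
					       tree_cons (NULL_TREE, arg,
							  NULL_TREE));

	 if (!bsi_end_p (bsi) && stmt_ends_bb_p (bsi_stmt (bsi)))
	   /* Put the call just before the statement ending the block
	      (a conditional jump, return, ...).  */
	   bsi_insert_before (&bsi, call, BSI_SAME_STMT);
	 else
	   bsi_insert_after (&bsi, call, BSI_NEW_STMT);
       }
   }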
Re: i386.md:3705: error: undefined machine-specific constraint at this point: "Y"
> I just got this error building a cross-compiler from sparc-sun-solaris2.10 > targetted to i686-unknown-linux-gnu. This worked as recently as last > week: > > > build/genoutput ../../egcc-SVN20070216/gcc/config/i386/i386.md > insn-conditions.md > tmp-output.c > > config/i386/i386.md:3705: error: undefined machine-specific constraint at > this point: "Y" > > config/i386/i386.md:3705: note: in operand 1 > > make[2]: *** [s-output] Error 1 > > anybody else seeing this?

Hi, I hadn't seen it, but the underlying problem is quite obvious. I've sent the attached patch for testing and will try to investigate why it doesn't seem to show up for everyone as soon as I have some time for it.

Honza

	* i386.md (dummy_extendsfdf2): Fix constraints.

Index: i386.md
===
--- i386.md	(revision 122057)
+++ i386.md	(working copy)
@@ -3704,7 +3704,7 @@
 
 ;; %%% Kill these when call knows how to work out a DFmode push earlier.
 (define_insn "*dummy_extendsfdf2"
   [(set (match_operand:DF 0 "push_operand" "=<")
-	(float_extend:DF (match_operand:SF 1 "nonimmediate_operand" "fY")))]
+	(float_extend:DF (match_operand:SF 1 "nonimmediate_operand" "fY2")))]
   "0"
   "#")
Re: 40% performance regression SPEC2006/leslie3d on gcc-4_2-branch
> Grigory Zagorodnev wrote: > > Mark Mitchell wrote: > >> Excellent question; I should have asked for that as well. If 4.2 has > >> gained on 4.1 in other respects, the 4.7% drop might represent a smaller > >> regression relative to 4.1. > >> > > There is the 4.2 (r120817) vs. 4.1.2 release FP performance comparison > > numbers. SPECfp_base2006 of gcc 4.2 has 19% performance gain over 4.1.2. > > Thank you for the measurements. > > In that case, I think we have absolutely nothing to worry about for > 4.2.0. Whether we deliver 19% SPECfp, 23% SPECfp, or 15% SPECfp > improvements isn't so important; all of those numbers are a vast > improvement over 4.1.x. Given that, I think we should just leave > Danny's conservative changes in, and not worry.

It should be understood that the large improvement on Cores is a special case caused by adding a generic model and CPU-specific tuning (we originally measured a 28% speedup on P4 in SPECfp2000 just for that change). The situation can be less optimistic on other (sub)targets. Still, we made important progress on SPECfp in the 4.x series, so a 4% slowdown would not bring us back to the performance of GCCs from the mid '90s, as a 4% slowdown on SPECint perhaps would...

Honza

> > Thanks, > > -- > Mark Mitchell > CodeSourcery > [EMAIL PROTECTED] > (650) 331-3385 x713
Re: ix86_data_alignment: bad defaults?
> > Why do we use 256 instead of BIGGEST ALIGNMENT in ix86_data_alignment? > This is causing all sorts of build problems for djgpp, as I'm getting > lots of warnings about too-big alignments, and with -Werror... It is to improve performance of string functions on larger chunks of data. x86-64 specify this, for x86 it is optional. I don't think we should end up warning here - it is done only for static variables where the alignment can be higher than what BIGGEST_ALIGNMENT promise. Honza > > Index: i386.c > === > --- i386.c (revision 11) > +++ i386.c (working copy) > @@ -15417,7 +15417,7 @@ > int > ix86_data_alignment (tree type, int align) > { > - int max_align = optimize_size ? BITS_PER_WORD : 256; > + int max_align = optimize_size ? BITS_PER_WORD : BIGGEST_ALIGNMENT; > >if (AGGREGATE_TYPE_P (type) >&& TYPE_SIZE (type)
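To make the effect concrete: ix86_data_alignment may raise the alignment of large file-scope objects above what BIGGEST_ALIGNMENT guarantees, purely as an optimization. A small illustration (the 32-byte figure is an assumption matching the 256-bit max_align above, not something the ABI promises):

/* A large static array may be given 32-byte (256-bit) alignment by
   ix86_data_alignment so that memcpy/memset over it can use wider
   aligned accesses; this exceeds BIGGEST_ALIGNMENT on ia32.  */
static char buffer[4096];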
Re: ix86_data_alignment: bad defaults?
> > > > I like the "min (256, MAX_OFILE_ALIGNMENT)" fix... > > > > So do I. > > Ok to apply then? Tested via djgpp cross-compile and cross-host.

Yes, this is OK. (To be very pedantic, we could assert that MAX_OFILE_ALIGNMENT >= 256 on x86-64 targets, but well.) I fully agree with Richard's interpretation of the BIGGEST_ALIGNMENT meaning - i.e. in special cases it definitely makes sense to use higher alignments than BIGGEST_ALIGNMENT for performance (such as cache line or page alignments); BIGGEST_ALIGNMENT is what the CPU requires and the runtime must provide when asked for. Honza

> > * config/i386/i386.c (ix86_data_alignment): Don't specify an > alignment bigger than the object file can handle. > > Index: i386.c > === > --- i386.c(revision 122271) > +++ i386.c(working copy) > @@ -15417,7 +15417,7 @@ > int > ix86_data_alignment (tree type, int align) > { > - int max_align = optimize_size ? BITS_PER_WORD : 256; > + int max_align = optimize_size ? BITS_PER_WORD : MIN (256, > MAX_OFILE_ALIGNMENT); > >if (AGGREGATE_TYPE_P (type) >&& TYPE_SIZE (type)
Re: ix86_data_alignment: bad defaults?
> > > > > > I like the "min (256, MAX_OFILE_ALIGNMENT)" fix... > > > > > > So do I. > > > > Ok to apply then? Tested via djgpp cross-compile and cross-host. > > Yes, this is OK. (to be very pedantic, we can assert that > MAX_OFILE_ALIGNMENT>=256 on x86-64 targets, but well). I fully agree with > Richard's interpretation concerning BIGGEST_ALIGNMENT meaning - ie in > special cases for perofrmance it definitly makes sense to use higher > alignments than BIGGEST_ALIGNMENT (such as cache line or page > alignments), BIGGEST_ALIGNMENT is what CPU require and runtime must > provide when asked for.

One extra bit - we do use alignments of base > 32 bytes for code alignment. What would be the behaviour on targets with MAX_OFILE_ALIGNMENT set to 16 bytes? I.e. if we end up with gas producing many nops that would then on a random basis end up 16- or 32-byte aligned, it might be a good idea to forcibly reduce alignments to base 16 on those targets. Honza
Re: spec2k comparison of gcc 4.1 and 4.2 on AMD K8
> "Vladimir N. Makarov" <[EMAIL PROTECTED]> writes: > > > I run SPEC2000 several times per week and always look at 3 runs (to be > > sure that is nothing wrong happened) but I never saw such big > > "confidence" intervals (as I understand that is difference between max > > and min of 3 runs divided by the score). [...] > > No, it is much more complex than that, I've used generally accepted > definition of a confidence interval, see > http://en.wikipedia.org/wiki/Confidence_interval > which basically tells that with 95% probabilty (the confidence level I've > choosed) > true value lies in this interval. > > I've used conservative estimate of confidence intervals in this case > because I didn't assume gaussian distribution of numbers which I > reported as difference between two run times, and this estimate is somewhat > bigger than difference between max and min of 3 runs :) > > > [...] If the machine has only 512 Mb memory (even they > > write that it is enough for SPEC2000), the scores for some benchmark > > programs may be unstable. [...] > > My box is equipped with 2Gigs of RAM so I believe this is not the case, > Also the computer was *absolutely* idle when it was running spec2k. > (booted with init=/bin/sh and no other processes were running). > > And no, > > [...] acknowledge that I never ran SPEC2000 on AMD machines and some > > processors generates less "confident intervals". [...] > this is not the case, I'm absolutely sure. I am running SPEC on both AMD and Intel machines quite commonly and I must say that there seems to be difference in between those two. For P4 and Core I get results within something like 1-2 SPEC point (0.1%) of overall SPEC score, for Athlon I was never able to get so close, the difference tends to be up to one percent that is often more than expected speedup I am looking for. Of course it might be property of the boxes I have, but there is no difference in setup of those machines, just it seems to be happening this way. Running the tests more times in sequence tends to stabilize Athlon results, so what I often do is to simply configure peak runs to do something interesting and use same base runs, since peak scores tends to be slightly better than base scores even for identical binaries. (that makes development easier, but not GCC better :) Honza
Re: spec2k comparison of gcc 4.1 and 4.2 on AMD K8
> Jan Hubicka wrote: > > >I am running SPEC on both AMD and Intel machines quite commonly and I > >must say that there seems to be difference in between those two. For P4 > >and Core I get results within something like 1-2 SPEC point (0.1%) of > >overall > >SPEC score, for Athlon I was never able to get so close, the > >difference tends to be up to one percent that is often more than > >expected speedup I am looking for. > > > >Of course it might be property of the boxes I have, but there is no > >difference in setup of those machines, just it seems to be happening > >this way. Running the tests more times in sequence tends to stabilize > >Athlon results, so what I often do is to simply configure peak runs to > >do something interesting and use same base runs, since peak scores tends > >to be slightly better than base scores even for identical binaries. > >(that makes development easier, but not GCC better :) > > > > > Interesting, Jan. I did not know that AMD machines are so unstable. Well, rather than unstable, they seems to be more memory layout sensitive I would say. (the differences are more or less reproducible, not completely random, but independent on the binary itself. I can't think of much else than memory layout to cause it). I always wondered if things like page coloring have chance to reduce this noise, but I never actually got around trying it. Honza
Re: Massive SPEC failures on trunk
> Grigory Zagorodnev wrote on 03/03/07 02:27: > > > There are three checkins, candidates for the root of regression: > > http://gcc.gnu.org/viewcvs?view=rev&revision=122487 > > http://gcc.gnu.org/viewcvs?view=rev&revision=122484 > > http://gcc.gnu.org/viewcvs?view=rev&revision=122479 > > > SPEC2k works as usual[1] for me on x86_64 as of revision 122484. The only new > compile failure I see is building 300.twolf with: > > mt.c: In function 'MTEnd': > mt.c:46: warning: incompatible implicit declaration of built-in function > 'free' > mt.c:46: error: too many arguments to function 'free' Diego, this is actually bug in twolf calling free with two arguments (block and size). We used to tolerate it but since free is now a builtin, we no longer do. SPEC has alternate source tree without this bug. Honza > specmake: *** [mt.o] Error 1 > > Ian, looks like your VRP patch may be involved. > > > [1] 176.gcc and 253.perlbmk usually miscompare for me. Not sure why.
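For concreteness, a paraphrased illustration of the kind of call the quoted error is about (this is not the literal twolf source, just the shape of it):

/* 300.twolf passes a second "size" argument to free(); now that GCC
   treats free as a builtin with the standard one-argument prototype,
   the call is rejected instead of silently tolerated.  */
free (block, blocksize);   /* error: too many arguments to function 'free' */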
Re: tuples: data structure separation from trees
> I think something like > > struct gimple_statment_base > { > enum gimple_stmt_code code : 8; > unsigned int subcode : 24; > source_locus locus; > tree block;

Just jumping late into the debug info discussion: RTL locators combine TREE blocks and source_locuses into a single integer. This works well since they are related (when you want to preserve the block you likely want to preserve the line number too and vice versa), and you have one integer per statement plus an integer+pointer pair in the on-the-side lookup table for each of the block/line numbers, so I would say it is as efficient as one would like. Doing the same at the GIMPLE level would leave us with one set of locators to use in the backend. I already played a bit with this idea, saving the tree pointer for a different purpose, as a followup of my patch http://gcc.gnu.org/ml/gcc-patches/2007-03/msg01169.html It is moderately ugly because the approach does not work too well for frontends, so I decided not to do it for the moment, but if we separate gimple and frontend trees, it would be a clean solution. One approach I was also considering is to extend the libcpp linemaps to deal with blocks too (via an extension API). We could then keep frontends using their BLOCK trees, but during conversion to gimple we would simply produce new locus numbers that already combine both pieces of info. This way we could get by without the tree block pointer completely, assuming that frontends don't need it and we finish the linemap conversion.

> > > * Unfortunately, we're going to have to duplicate a lot of the functionality > > we currently have for trees. For example, since STATEMENT_LISTs are used > > before gimplification, we'll have to come up with the equivalent thing > > for tuples (say, GIMPLE_STATMENT_LISTS). > > Sort of. I think you'd want to merge tree_statement_list_node > into gimple_statement_d, to avoid the extra indirection to get > to the statement.

I am also happy if merging the tree list into the statement itself is considered. From earlier discussions it seemed to me that this idea had already been given up. Honza

> > > r~
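A minimal sketch of the locator scheme described above (illustrative data layout only, not the actual GCC structures):

/* Each statement stores a single integer locator.  Two sorted side
   tables map ranges of locator values back to the lexical BLOCK and to
   the source line; a binary search over the range starts recovers both
   pieces of information from the one integer.  */
struct block_locator { int first_locator; void *block; /* BLOCK tree */ };
struct line_locator  { int first_locator; int line; };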
Re: GIMPLE tuples document uploaded to wiki
> > I have added the design document and links to most of the discussions > we've had so far. Aldy updated the document to reflect the latest thread. > > http://gcc.gnu.org/wiki/tuples

Looks great. Still, I think "locus" and "block" could both be merged into a single integer, like RTL land has INSN_LOCATOR. After gimplification, the only place where we modify blocks but not locators is the inliner. While doing the gimplification, we can probably keep track of line numbers and blocks in parallel. Also, the ssa_operands structures should be somewhere in the header, and a uid would be handy for on-the-side datastructures. In the CFG, getting rid of labels in GS_COND would actually save us a noticeable amount of memory by avoiding the need for the labels. Perhaps we can simply define the "true branch target"/"false branch target" to point to a label or a BB depending on CFG presence, or to be NULL after CFG conversion, relying on the CFG edges. GS_SWITCH would be harder, since the association with CFG edges is not so direct. Honza
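A sketch of the GS_COND idea in the last paragraph (purely illustrative layout, not the proposed tuple definition):

/* Before the CFG exists the two target slots hold labels; once the CFG
   is built they are set to NULL and the true/false successors are read
   off the basic block's two outgoing edges instead.  */
struct gs_cond_sketch
{
  unsigned code : 8;
  unsigned subcode : 24;
  void *lhs, *rhs;        /* operands of the comparison       */
  void *true_target;      /* label, or NULL once CFG is built */
  void *false_target;     /* label, or NULL once CFG is built */
};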
Re: GIMPLE tuples document uploaded to wiki
> Jan Hubicka wrote on 04/14/07 16:14: > > > Looks great, still I think "locus" and "block" could be both merged into > > single integer, like RTL land has INSN_LOCATOR. > > That's the idea. But it's simpler to do this for now. The insn locator > is easily done at anytime during the implementation.

Sure, it can (and should) be done independently of the main transformation to tuples. I just wondered whether your document is documenting the final shape or what should be done during the first transition. If the second, probably 2 words should be accounted for the location, as source_locus is currently a structure.

> > > Also ssa_operands structures should be somewhere in the header and uid > > would be handy for on-side datastructures. > > No. SSA operands need to be split in the instructions that actually > need them. Also, UIDs are tempting but not really needed. I would only > consider them if using pointer-maps or hash tables gets outrageously > expensive.

So you expect the ssa_operands to be associated via a hashtable or placed lower in the inheritance hierarchy? (The second is what I had in mind - surely it makes no sense to allocate them for labels, but they have to be somewhere.) Concerning uids, it is always difficult to get good data on this sort of thing. It seems to me that the UID would be handy and easy to bundle with some other integer, but it is not too important, especially if we get some handy abstraction for mapping data to statements that we can easily switch between hashtables and arrays to see the difference, instead of writing hashtables by hand in every pass doing this.

> > > > In CFG getting rid of labels in GS_COND woulsd actually save us > > noticeable amount of memory by avoiding the need for labels. Perhaps we > > can simply define "true branch target"/"false branch target" to point to > > label or BB depending on CFG presence or to be NULL after CFG conversion > > and rely on CFG edges. GS_SWITCH would be harder, since association with > > CFG edges is not so direct. > > Sure, that would be something to consider.

I have some data for this, let's discuss it at ICE. Honza
Re: GIMPLE tuples document uploaded to wiki
> Jan Hubicka wrote on 04/14/07 21:14: > > > I just wondered if your document is documenting the final shape or what > > should be done during hte first transition. If the second, probably 2 > > words should be accounted for location as source_locues is currently a > > structure. > > The document is describing what the initial implementation will look > like. It will/should evolve as the implementation progresses. > > > > So you expect the ssa_operands to be associated via a hashtable > > Hmm? No, they are there. Notice gimple_statement_with_ops and > gimple_statement_with_memory_ops.

Uh, I've somehow missed that (I was looking for the ssa-operands pattern, not for the unwound structure).

> > > > Concerning uids, it is always dificult to get some good data on this > > sort of thing. It seems to me that the UID would be handly and easy to > > bundle to some other integer, but it is not too important, especially if > > get some handy abstraction to map data with statements that we can > > easilly turn in between hashtables and arrays to see the difference > > instead of writting hasthables by hand in every pass doing this. > > I grepped for uid and IIRC there are only two passes using UIDs: DSE and > PRE. We should see how they react to having uid taken away from them.

This is because we now put the data into the annotations themselves or use the aux pointer. I am not quite sure how to easily grep for aux use of stmt annotations; loop-im definitely does that. Also, addresses_taken and value_handles can probably be made separate arrays, since they definitely don't need to live at the IPA level. Honza

> > > > I have some data for this, lets discuss it at ICE. > > Sounds good.
Re: GIMPLE tuples document uploaded to wiki
> PRE is only using stmt_ann->uid as a convenient place to store the uid > for local dominance for purposes of Load PRE. > It's making up the UID on it's own :). > > If there was stmt->aux we'd put it there instead (note that the > current way wastes memory, since we really only care about UID's for > statements that generate vdefs/vuses)

Well, I think we are mostly discussing how to associate your local data (uids or whatever) with statements, i.e. whether to use hashtables, uid-indexed arrays or aux pointers. In the uid-indexed-arrays case your example would translate into some sort of local UID and a translation array, like the RTL optimizers commonly use. I would personally like to avoid aux pointers. Honza

> > > ann = stmt_ann (stmt); > > ann->uid = stmt_uid++; > >/* See if the vuse is defined by a statement that > comes before us in the block. Phi nodes are not > stores, so they do not count. */ > if (TREE_CODE (def) != PHI_NODE > && stmt_ann (def)->uid < stmt_ann (stmt)->uid) >{
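A tiny sketch of the "local UID plus translation array" idiom referred to above (illustrative names only; this is not code from any GCC pass, and the arrays are assumed to be allocated and initialized to -1 by the pass):

/* The pass numbers only the statements it cares about and keeps its
   per-statement data in plain arrays indexed by that dense local uid,
   instead of storing it in stmt annotations or aux pointers.  */
static int *stmt_to_luid;     /* indexed by the global statement uid */
static int *luid_dom_order;   /* per-pass data, indexed by local uid */
static int next_luid;

static int
get_luid (int stmt_uid)
{
  if (stmt_to_luid[stmt_uid] < 0)
    stmt_to_luid[stmt_uid] = next_luid++;
  return stmt_to_luid[stmt_uid];
}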
Re: Duplicate assembler function names in cgraph
Hi, > Hello all, > > I'm doing in my IPA pass: > for(node = cgraph_nodes; node; node = node->next) { >reg_cgraph_node(IDENTIFIER_POINTER(DECL_ASSEMBLER_NAME(node->decl))); > } > > to get all the function names in the cgraph. I'm adding them to a list > and I'm assuming that two nodes do not have the same > DECL_ASSEMBLER_NAME but I'm wrong. In a C++ file I'm getting two > functions with name _ZN4listIiE6appendEPS0_, DECL_NAME = append. > Why is this? The code is at > http://pastebin.ca/442691

The callgraph currently tries to avoid use of DECL_ASSEMBLER_NAME; the motivation is that for C++ the DECL_ASSEMBLER_NAMEs are very long and expensive, and thus it is not a good idea to compute them for symbols not output to the final assembly (DECL_ASSEMBLER_NAME triggers lazy construction of the names). So if you don't have a good reason for using the names, you should not do it. Cgraph relies on the frontend ensuring that there are no duplicate FUNCTION_DECLs representing the same function (with the same assembler name); that seems to be broken in your testcase. Would it be possible to have a small testcase reproducing the problem? Honza

> > Is there a way to transverse the cgraph but never going through the > same twice? Or should I just ignore the node if the function name is > already registered? > > Cheers, > -- > Paulo Jorge Matos - pocm at soton.ac.uk > http://www.personal.soton.ac.uk/pocm > PhD Student @ ECS > University of Southampton, UK
Re: Builtin functions?
> On 4/16/07, Paulo J. Matos <[EMAIL PROTECTED]> wrote: > >Hello all, > > > >I'm going through the bodies of all user-defined functions. I'm using > >as user-defined function as one that: > >DECL_BUILT_IN(node) == 0. > > > > >Problem is that for a function (derived from a C++ file) whose output > >from my pass is (output is self-explanatory, I think): > >Registering cgraph node: > >_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc > >[operator<<]... SUCCESSFUL > >Declared on > >/home/pmatos/gsvt-bin/lib/gcc/x86_64-unknown-linux-gnu/4.1.1/../../../../include/c++/4.1.1/bits/ostream.tcc, > >line 735 > >Decl Node Code is function_decl > >Registering output void ...SUCCESSFUL > >Arguments: __out : reference_type, __s : pointer_type > > > >Well, this is definitely builtin but DECL_BUILT_IN == 0, which means > >that when I do FOR_EACH_BB_FN, I'm getting a segmentation fault > > First, it's not built in, because it's defined in a source file. > Builtin functions are those defined by the compiler. > > Second, we should make FOR_EACH_BB_FN never crash on empty tree functions. > It seems really rude to do otherwise. Well, it works on empty functions, but why would you ever want to walk body of function that is not there? cgraph_function_body_availability should be checked by IPA passes to see what bodies are there and what can or can not change by linking... Honza > Just because we don't have a body for a function doesn't mean we > should crash. Users shouldn't have to be checking random things like > DECL_SAVED_TREE to determine if FOR_EACH_BB_FN will work (this is not > to say that they should be able to pass any random crap to it, but it > should be detecting if the function has a body)
Re: Builtin functions?
> On 4/16/07, Daniel Berlin <[EMAIL PROTECTED]> wrote: > > > >First, it's not built in, because it's defined in a source file. > >Builtin functions are those defined by the compiler. > > > >Second, we should make FOR_EACH_BB_FN never crash on empty tree functions. > >It seems really rude to do otherwise. > >Just because we don't have a body for a function doesn't mean we > >should crash. Users shouldn't have to be checking random things like > >DECL_SAVED_TREE to determine if FOR_EACH_BB_FN will work (this is not > >to say that they should be able to pass any random crap to it, but it > >should be detecting if the function has a body) > > > > Is there a way to check if the function was or not defined by the > user, i.e., it comes from the users source file? cgraph_function_body_availability is your friend ;) Honza > > Cheers, > -- > Paulo Jorge Matos - pocm at soton.ac.uk > http://www.personal.soton.ac.uk/pocm > PhD Student @ ECS > University of Southampton, UK
Re: Duplicate assembler function names in cgraph
> On 4/16/07, Jan Hubicka <[EMAIL PROTECTED]> wrote: > >Hi, > >> Hello all, > >> > >> I'm doing in my IPA pass: > >> for(node = cgraph_nodes; node; node = node->next) { > >>reg_cgraph_node(IDENTIFIER_POINTER(DECL_ASSEMBLER_NAME(node->decl))); > >> } > >> > >> to get all the function names in the cgraph. I'm adding them to a list > >> and I'm assuming that two nodes do not have the same > >> DECL_ASSEMBLER_NAME but I'm wrong. In a C++ file I'm getting two > >> functions with name _ZN4listIiE6appendEPS0_, DECL_NAME = append. > >> Why is this? The code is at > >> http://pastebin.ca/442691 > > > >Callgraph is currently trying to avoid use of DECL_ASSEMBLER_NAME, the > >motivation is that for C++, the DECL_ASSEMBLER_NAMEs are very long and > >expensive and thus it is not good idea to compute them for symbols not > >output to final assembly (DECL_ASSEMBLER_NAME triggers lazy construction > >of the names). So if you don't have good reason for using the names, > >you should not do it. > > My only reason to use DECL_ASSEMBLER_NAME is, when I'm transversing > cgraph_nodes, to have an ID for the nodes I've already 'analyzed'. Why you don't use something like cgraph->uid? > > > > >Cgraph rely on frontend that there are no duplicate FUNCTION_DECLs > >representing the same function (with same assembler node), that seems to > >be broken in your testcase. Would be possible to have a small tewstcase > >reproducing the problem? > > > > Sure, however, I'm developing over 4.1.1, still you might still have > the error on current head, (I know 4.1.1 is quite old). What do you > mean by a test case? Do you want a short version of my IPA pass which > shows up the issue? Either that or of you can just minimize the testcase (perhaps with delta) so it can be inspected by hand, it is probably easiest for me ;) Honza
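To make the cgraph->uid suggestion above concrete, a rough sketch (the bookkeeping helpers are hypothetical stand-ins for the poster's own code; node->uid is assumed to be the uid field of the 4.x struct cgraph_node):

/* Walk the callgraph keyed on node uids instead of forcing lazy
   construction of assembler names.  already_seen and record_node_by_uid
   are hypothetical helpers standing in for the pass's own bookkeeping.  */
struct cgraph_node *node;
for (node = cgraph_nodes; node; node = node->next)
  if (!already_seen (node->uid))
    record_node_by_uid (node->uid);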
Re: Builtin functions?
> > If you just want to scan every function you have around, the obvious > way to do it is > > For each function > FOR_EACH_BB_FN (function). > > This is probably slightly slower than > > For each function > if cgraph_function_body_availability != NOT_AVAILABLE >FOR_EACH_BB_FN (function) > > But about 20x more intuitive.

Well, what about: for each available function, FOR_EACH_BB_FN (function)? At least that was my plan. You are probably going to do other things to the functions than just walking the CFG (looking into variables/SSA names etc.), and that way you won't get a crash either. But I don't have a strong feeling here; both alternatives seem OK to me. Honza

> > > >cgraph_function_body_availability should be checked by IPA passes > >to see what bodies are there and what can or can not change by > >linking... > Again, this only matters if you care :)
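A sketch of the "for each available function" idiom being suggested (GCC 4.1-era API names from memory, so treat them as approximate rather than authoritative):

/* Walk only functions whose body is actually available, then iterate
   over their basic blocks.  */
struct cgraph_node *node;
basic_block bb;

for (node = cgraph_nodes; node; node = node->next)
  if (node->analyzed
      && cgraph_function_body_availability (node) > AVAIL_NOT_AVAILABLE)
    FOR_EACH_BB_FN (bb, DECL_STRUCT_FUNCTION (node->decl))
      {
        /* process bb */
      }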
Re: GCC mini-summit - compiling for a particular architecture
> > Look from what we're starting: > > > > << > > @item -funroll-loops > > @opindex funroll-loops > > Unroll loops whose number of iterations can be determined at compile > > time or upon entry to the loop. @option{-funroll-loops} implies > > @option{-frerun-cse-after-loop}. This option makes code larger, > > and may or may not make it run faster. > > > > @item -funroll-all-loops > > @opindex funroll-all-loops > > Unroll all loops, even if their number of iterations is uncertain when > > the loop is entered. This usually makes programs run more slowly. > > @option{-funroll-all-loops} implies the same options as > > @option{-funroll-loops}, > > >> > > > > It could gain a few more paragraphs written by knowledgeable people. > > And expanding documentation doesn't introduce regressions :). > > but also does not make anyone actually use the options. Nobody reads > the documention. Of course, this is a bit overstatement, but with a > few exceptions, people in general do not enable non-default flags. I don't think this is fair. Most people don't read the docs because they don't care about performance, but most people who develop code that spends a lot of CPU cycles actually read the docs at least up to loop unrolling. BTW there is even www.funroll-loops.org ;) The content however can be found only in wayback http://web.archive.org/web/20060513022941/http://www.funroll-loops.org/ Honza > > Zdenek
Re: GCC mini-summit - compiling for a particular architecture
> On Sun, 2007-04-22 at 14:44 +0200, Richard Guenther wrote: > > On 4/22/07, Laurent GUERBY <[EMAIL PROTECTED]> wrote: > > > > > but also does not make anyone actually use the options. Nobody reads > > > > > the documention. Of course, this is a bit overstatement, but with a > > > > > few exceptions, people in general do not enable non-default flags. > > > > > > > > I don't think this is fair. > > > > Most people don't read the docs because they don't care about > > > > performance, but most people who develop code that spends a lot of CPU > > > > cycles actually read the docs at least up to loop unrolling. > > > > > > Exactly my experience. > > > > > > Unfortunately there's no useful information on this topic in the GCC > > > manual... > > > > Well, we have too many switches really. So the default is use -O2. If you > > want extra speed, try -O3, or even better use profile feedback. (Not many > > people realize that with profile feedback you get faster code than with > > -O3 and smaller code than with -Os - at least for C++ programs) > > At work we use -O3 since it gives 5% performance gain against -O2. > profile-feedback has many flags and there is no overview of it in the > doc IIRC. Who will use it except GCC developpers? Who knows about your > advice?

Well, this is why -fprofile-generate and -fprofile-use were invented. Perhaps the docs can be improved so people actually discover them. Do you have any suggestions? (Perhaps a chapter on FDO or a subchapter of the gcov docs would do?) Honza

> > The GCC user documentation is the place... > > Laurent >
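For the record, the basic workflow in question is just two extra flags; something along these lines would be the core of such a doc chapter (file names and training input are of course illustrative):

gcc -O2 -fprofile-generate app.c -o app
./app < training-input          # writes the *.gcda profile data
gcc -O2 -fprofile-use app.c -o app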
Re: Bootstrap failure for current gcc trunk on cygwin: in set_curr_insn_source_location, at cfglayout.c:284
> On 4/23/07, Paul Richard Thomas <[EMAIL PROTECTED]> wrote: > >on x86_ia64/fc5 is not a coincidence? > > More over, there were a lot of targets by this patch because they > would call insn_locators_initialize when generating the thunks (x86 > did not because it uses text based thunks and not RTL based thunks).

I've reverted the patch, so it should be OK now. My apologies for the breakage. Honza

> > -- Pinski
Re: GCC -On optimization passes: flag and doc issues
> > I'd rather remove this "hack" and use the inliners code size estimator, like > that patch from early 2005 (attached)... Uh yes, I think it is way to go (and additionally making -O2 to autoinline small functions like -Os does). The patch would be OK if it still works ;) Even if CSiBE regress, I would still rather preffer fixing inliner code size estimations than keeping the parameter tweaking code. Honza > > Richard. > > 2005-03-02 Richard Guenther <[EMAIL PROTECTED]> > >* opts.c (decode_options): Do not fiddle with inlining >parameters in case of optimizing for size. >* cgraphunit.c (cgraph_check_inline_limits): If optimizing >for size make sure we do not grow the unit-size by inlining. >(cgraph_decide_recursive_inlining): Likewise. > 2005-03-02 Richard Guenther <[EMAIL PROTECTED]> > > * opts.c (decode_options): Do not fiddle with inlining > parameters in case of optimizing for size. > * cgraphunit.c (cgraph_check_inline_limits): If optimizing > for size make sure we do not grow the unit-size by inlining. > (cgraph_decide_recursive_inlining): Likewise. > > Index: opts.c > === > RCS file: /cvs/gcc/gcc/gcc/opts.c,v > retrieving revision 1.94 > diff -c -3 -p -r1.94 opts.c > *** opts.c24 Feb 2005 09:24:13 - 1.94 > --- opts.c2 Mar 2005 13:10:58 - > *** decode_options (unsigned int argc, const > *** 572,580 > > if (optimize_size) > { > ! /* Inlining of very small functions usually reduces total size. */ > ! set_param_value ("max-inline-insns-single", 5); > ! set_param_value ("max-inline-insns-auto", 5); > flag_inline_functions = 1; > > /* We want to crossjump as much as possible. */ > --- 572,579 > > if (optimize_size) > { > ! /* Inlining of functions reducing size is a good idea regardless > ! of them being declared inline. */ > flag_inline_functions = 1; > > /* We want to crossjump as much as possible. */ > Index: cgraphunit.c > === > RCS file: /cvs/gcc/gcc/gcc/cgraphunit.c,v > retrieving revision 1.93.2.1 > diff -c -3 -p -r1.93.2.1 cgraphunit.c > *** cgraphunit.c 2 Mar 2005 10:10:30 - 1.93.2.1 > --- cgraphunit.c 2 Mar 2005 13:10:58 - > *** cgraph_check_inline_limits (struct cgrap > *** 1184,1189 > --- 1189,1201 > limit += limit * PARAM_VALUE (PARAM_LARGE_FUNCTION_GROWTH) / 100; > > newsize = cgraph_estimate_size_after_inlining (times, to, what); > + if (optimize_size > + && newsize > to->global.insns) > + { > + if (reason) > + *reason = N_("optimizing for size"); > + return false; > + } > if (newsize > PARAM_VALUE (PARAM_LARGE_FUNCTION_INSNS) > && newsize > limit) > { > *** cgraph_decide_recursive_inlining (struct > *** 1279,1284 > --- 1291,1297 > struct cgraph_node *master_clone; > int depth = 0; > int n = 0; > + int newsize; > > if (DECL_DECLARED_INLINE_P (node->decl)) > { > *** cgraph_decide_recursive_inlining (struct > *** 1287,1294 > } > > /* Make sure that function is small enough to be considered for inlining. > */ > ! if (!max_depth > ! || cgraph_estimate_size_after_inlining (1, node, node) >= limit) > return; > lookup_recursive_calls (node, node, &first_call, &last_call); > if (!first_call) > --- 1300,1309 > } > > /* Make sure that function is small enough to be considered for inlining. > */ > ! newsize = cgraph_estimate_size_after_inlining (1, node, node); > ! if (! max_depth > ! || newsize >= limit > ! || (optimize_size && newsize > node->global.insns)) > return; > lookup_recursive_calls (node, node, &first_call, &last_call); > if (!first_call)
Re: Inlining and estimate_num_insns
> On Monday 28 February 2005 10:25, Richard Guenther wrote: > > > I can only wonder why we are having this discussion just after GCC 4.0 > > > was branched, while it was obvious already two years ago that inlining > > > heuristics were going to be a difficult item with tree-ssa. > > > > There were of course complaints and discussions about this, and I even > > tried to tweak inlining parameters once. See the audit trails of PR7863 > > and PR8704. There were people telling me "well in branch XYZ we do so much > > better", as always, so I was not encouraged to persue this further. > > > > Anyway, I think we should try the patch on mainline and I'll plan to > > re-submit it together with a 10% lowering of the inlining parameters > > compared to 3.4 (this is conservative for the mean size change for C code, > > for C++ we're still too high). I personally cannot afford to do so much > > testing to please everyone. > > I tested your -fobey-inline patch a bit using the test case from PR8361. > The run was still going after 3 minutes (without the flag it takes 20s) > so I terminated it and took the following oprofile: > > CPU: Hammer, speed 1394.98 MHz (estimated) > Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit > mask of 0x00 (No unit mask) count 4000 > Counted DATA_CACHE_MISSES events (Data cache misses) with a unit mask of 0x00 > (No unit mask) count 1000 > samples %samples %image name symbol name > 4607300 78.7190 9878479.4179 cc1plus > cgraph_remove_edge > 861258 14.7152 1530812.3070 cc1plus > cgraph_remove_node > 60871 1.0400 999 0.8032 cc1plus ggc_set_mark > 56907 0.9723 2054 1.6513 cc1plus cgraph_optimize > 36513 0.6239 1132 0.9101 cc1plus > cgraph_clone_inlined_nodes > 29570 0.5052 843 0.6777 cc1plus cgraph_postorder > 16187 0.2766 367 0.2951 cc1plus ggc_alloc_stat > 7787 0.1330 970.0780 cc1plus > gt_ggc_mx_cgraph_node > 6851 0.1171 138 0.1109 cc1plus cgraph_edge > 6671 0.1140 305 0.2452 cc1plus comptypes > 5776 0.0987 950.0764 cc1plus > gt_ggc_mx_cgraph_edge > 5243 0.0896 930.0748 cc1plus > gt_ggc_mx_lang_tree_node > > Honza, it seems the cgraph code needs whipping here. I think I can shot down the cgraph_remove_node lazyness by simple reference counting, but concerning removal of edges, only alternative I see is going for vectors/doubly linked lists. I would still expect this time to be dominated by later inlining/compation explossion so I would not take that too seriously (unless proved otherwise by cgraph_remove_edge being top on overall profile ;) Honza > > Gr. > Steven
Re: Inlining and estimate_num_insns
> On Tuesday 01 March 2005 01:33, Jan Hubicka wrote: > > > On Monday 28 February 2005 10:25, Richard Guenther wrote: > > > > > I can only wonder why we are having this discussion just after GCC > > > > > 4.0 was branched, while it was obvious already two years ago that > > > > > inlining heuristics were going to be a difficult item with tree-ssa. > > > > > > > > There were of course complaints and discussions about this, and I even > > > > tried to tweak inlining parameters once. See the audit trails of > > > > PR7863 and PR8704. There were people telling me "well in branch XYZ we > > > > do so much better", as always, so I was not encouraged to persue this > > > > further. > > > > > > > > Anyway, I think we should try the patch on mainline and I'll plan to > > > > re-submit it together with a 10% lowering of the inlining parameters > > > > compared to 3.4 (this is conservative for the mean size change for C > > > > code, for C++ we're still too high). I personally cannot afford to do > > > > so much testing to please everyone. > > > > > > I tested your -fobey-inline patch a bit using the test case from PR8361. > > > The run was still going after 3 minutes (without the flag it takes 20s) > > > so I terminated it and took the following oprofile: > > > > > > CPU: Hammer, speed 1394.98 MHz (estimated) > > > Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a > > > unit mask of 0x00 (No unit mask) count 4000 Counted DATA_CACHE_MISSES > > > events (Data cache misses) with a unit mask of 0x00 (No unit mask) count > > > 1000 samples %samples %image name symbol > > > name 4607300 78.7190 9878479.4179 cc1plus > > > cgraph_remove_edge 861258 14.7152 1530812.3070 cc1plus > > > cgraph_remove_node 60871 1.0400 999 0.8032 cc1plus > > > ggc_set_mark 56907 0.9723 2054 1.6513 cc1plus > > > cgraph_optimize 36513 0.6239 1132 0.9101 cc1plus > > >cgraph_clone_inlined_nodes 29570 0.5052 843 > > > 0.6777 cc1plus cgraph_postorder 16187 0.2766 367 > > > 0.2951 cc1plus ggc_alloc_stat 7787 0.1330 97 > > > 0.0780 cc1plus gt_ggc_mx_cgraph_node 6851 > > > 0.1171 138 0.1109 cc1plus cgraph_edge 6671 > > > 0.1140 305 0.2452 cc1plus comptypes 5776 > > > 0.0987 950.0764 cc1plus gt_ggc_mx_cgraph_edge > > > 5243 0.0896 930.0748 cc1plus > > > gt_ggc_mx_lang_tree_node > > > > > > Honza, it seems the cgraph code needs whipping here. > > > > I think I can shot down the cgraph_remove_node lazyness by simple > > reference counting, but concerning removal of edges, only alternative I > > see is going for vectors/doubly linked lists. > > Doubly linked lists would mean doubly-doubly-linked list, for the caller > and the callee, no? Does not sound attractive. With VECs, on the other > hand, you'd only need two integers on the cgraph edges (index in the > caller and callee edge vectors). Sounds like the fastest way to do this > to me. Of course, I'm high on VECs after the edge-vector-branch work... Memory wise, both sollution seems pretty comparable to me (for double double linked lists you have two extra pointers, for VECs you have two extra integers that ends up smaller on 64bit hosts but you pay some extra allocated memory in the growing arrays), but perhaps vecs has chance to be slightly faster, don't know and I guess it does not matter terribly much and for consistency, vectors seems like choice now. 
> > > I would still expect this > > time to be dominated by later inlining/compation explossion so I would > > not take that too seriously (unless proved otherwise by > > cgraph_remove_edge being top on overall profile ;) > > After half an hour, I had this: > > 35490975 82.9876 779873 85.0702 cc1pluscgraph_remove_edge > 6706686 15.6821 123588 13.4812 cc1pluscgraph_remove_node > > and cc1plus was still not into even the tree optimizers by then. > So I think we can safely say this is a serious bottleneck.

You still didn't get to the fun part of actually inlining all the inlines in Gerald's testcase ;)

> > Note that I've seen these cgraph functions show up in less unrealistic > test runs also.

OK, I will put this higher on the TODO list (but this is not for 4.0 either). What were those less unrealistic tests? I remember seeing node removal in profiles, but edge removal comes first here. Looks like I finally recovered everything from the disc crash, so I will have time to look into this in the rest of the week... Honza

> > Gr. > Steven
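A small sketch of why the vector representation makes edge removal cheap (illustrative C, not GCC's actual cgraph code): each edge remembers its index in the caller's and callee's vectors, so removal is a constant-time swap with the last element.

/* Remove edge E from CALLER's vector of outgoing edges in O(1):
   overwrite its slot with the last edge and update that edge's
   remembered index.  The callee side works the same way.  */
struct cg_edge { int caller_index, callee_index; /* ... */ };
struct cg_node { struct cg_edge **callees; int n_callees; /* ... */ };

static void
remove_callee_edge (struct cg_node *caller, struct cg_edge *e)
{
  struct cg_edge *last = caller->callees[--caller->n_callees];
  caller->callees[e->caller_index] = last;
  last->caller_index = e->caller_index;
}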
Re: Inlining and estimate_num_insns
> On Tue, 1 Mar 2005 02:03:57 +0100, Jan Hubicka <[EMAIL PROTECTED]> wrote: > > > OK, I will put this higher in the TODO list (but this is not 4.0 > > either). What was those less unrealistic tests? I remember seeing node > > removal in profiles, but edge removal comes first here. Looks like I > > finally recovered everything from disc crash so will have time to look > > into this in the rest of the week... > > I see those functions high up in the profiles with tramp3d and leafify turned > on, too. I remember bugging you about this a few month ago... > > http://gcc.gnu.org/ml/gcc/2004-11/msg00737.html

OK, thanks, this time I will remember it ;)) I believe I oprofiled something back then that shot remove_edge down from the profile, but I have to re-try. Honza

> > Richard.
Re: Inlining and estimate_num_insns
> On Tue, 1 Mar 2005 13:30:41 +0100, Jan Hubicka <[EMAIL PROTECTED]> wrote: > > > On Tue, 1 Mar 2005 02:03:57 +0100, Jan Hubicka <[EMAIL PROTECTED]> wrote: > > > > > > > OK, I will put this higher in the TODO list (but this is not 4.0 > > > > either). What was those less unrealistic tests? I remember seeing node > > > > removal in profiles, but edge removal comes first here. Looks like I > > > > finally recovered everything from disc crash so will have time to look > > > > into this in the rest of the week... > > > > > > I see those functions high up in the profiles with tramp3d and leafify > > > turned > > > on, too. I remember bugging you about this a few month ago... > > > > > > http://gcc.gnu.org/ml/gcc/2004-11/msg00737.html > > > > OK, thanks, this time I will remember this ;)) I believe that I > > oprofiled it that shot remove_edge down from profile, but I have to > > re-try. > > Try this,. bootstrapped on i686-pc-linux-gnu in progress.

Looks nice; you might also consider turning the next_clone list into a doubly linked list to speed up remove_node somewhat. Not sure how much that one counts. Can you post --disable-checking benchmarks on your testcase with leafify? Honza

> > Richard.
Re: Inlining and estimate_num_insns
Hi, so after reading the whole discussion, here are some of my thoughts, combined into one mail for a sanity check and to reduce inbox pollution ;)

Concerning Richard's cost tweaks: There is a longer story why I like it ;) I originally considered further tweaking of the cost function a mostly lost case, as the representation of the program at that time is way too close to the source level to estimate its cost properly. This got even a bit worse in 4.0 times with the gimplifier introducing not very predictable noise. Instead I went for the plan of optimizing functions early, which ought to give better estimates. It seems to me that we need to know both the code size and the expected time consumed by a function to have a chance of predicting the benefits in some way. On tree-profiling, plus some of my local patches I hope to sort out soonish, I am mostly there and I did some limited benchmarking. Overall the early optimization seems to do a good job for SPEC (over 1% speedup in whole-program mode is more than I expected), but it does almost nothing for the C++ testcases (about 10% speedup on POOMA and about 0 on Gerald's application). I believe the reason is that the C++ testcases consist of little functions that are unoptimizable by themselves, so the context is not big enough.

In parallel with Richard's efforts, I thought that the problem there is indeed with the "abstraction functions", i.e. functions just accepting arguments and calling another function or returning some field. There is an extremely high number of those (from profiling early one can see that for every operation executed in the resulting program, there are hundreds of function calls eliminated by inlining). Clearly, with any inlining limits, if the cost function assigns non-zero cost to such forwarders, we are going to have a difficult time finding thresholds. (See the small sketch of such forwarders below.) I planned to write pattern matching for these functions to bump them to 0 cost, but Richard's patch looks like a pretty interesting idea. His results with the limits set to 50 show that he indeed managed to get those forwarders very cheap, so I believe this idea might indeed work well with some additional tweaking. The only thing I am afraid of is that the number of inlines will no longer be a linear function of the code size estimate increase, which is limited to a linear fraction of the whole unit. However, only "forwarders" having at most one call come out free, so this is still dominated by the longest path in the callgraph consisting of these in the program. Unfortunately this can be high and we can produce _a lot_ of garbage inlining these.

One trick that I came across is to do two-stage inlining - first inline just the functions whose growth estimates are <= 0 in the bottom-up approach, do early optimizations to reduce garbage, and then do the "real inlining job". This way we might throttle the amount of garbage produced by the inliner and get more realistic estimates of the function bodies, but I am not at all sure about this. It would definitely also help profiling performance on tramp3d.

Concerning -fobey-inline: I really doubt this is going to help C++ programmers. I think it might be useful for the kernel, and I can make a slightly cleaner implementation (without changes in the frontends) if there is a really good use for it. Can someone point me to an existing codebase where -fobey-inline brings considerable improvements over the default inlining heuristics? I've seen a lot of arguing in this direction but never an actual bigger application that needs it.
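Before moving on - to make the "abstraction functions" discussed above concrete, a minimal C illustration (all names invented): these bodies do nothing but forward arguments or return a field, so any cost estimate above zero for them makes threshold tuning hard.

/* Typical abstraction forwarders: their real cost after inlining is ~0.  */
struct point { int x, y; };
extern int dist2_impl (const struct point *, const struct point *);

static inline int get_x (const struct point *p) { return p->x; }
static inline int dist2 (const struct point *a, const struct point *b)
{
  return dist2_impl (a, b);   /* just forwards its arguments */
}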
It might also be possible to strengthen the effect the "inline" keyword has on the heuristics - either multiply the priority by 2 for functions declared inline so those candidates get in first, or do two-stage inlining, first for inline functions and then for auto-inlining. But this is probably not going to help those folks complaining mostly about -O2 ignoring inline, right?

Concerning multiple heuristics: I really don't like this idea ;) I still think we can make the heuristics adapt to the programming style they are fed, not least because programs often consist of a mix of such styles.

Concerning compilation time/speed tradeoffs: Since the whole task of the inliner is to slow down the compiler in order to improve the resulting code, it is difficult to blame it for doing its job. While I was in an easy position with the original heuristics, where the pre-cgraph code produced just as much inlining so it was easy to speed up both, now we obviously do too little inlining, so we need to expect some slowdowns. I would define success of the heuristics as resulting in faster and smaller code; the compilation time is kind of secondary. However, for code bases (like SPEC) where extra inlining doesn't help, we definitely should not slow down seriously (over 1%, I guess).

Concerning growth limits:
Re: Inlining and estimate_num_insns
> On Tue, 1 Mar 2005 16:49:04 +0100, Richard Guenther > <[EMAIL PROTECTED]> wrote: > > On Tue, 1 Mar 2005 16:14:14 +0100, Jan Hubicka <[EMAIL PROTECTED]> wrote: > > > > > Concerning growth limits: > > > > > > If you take a look on when -finline-unit-growth limit hits, it is clear > > > that it hits very often on small units (several times in the kernel, > > > testsuite and such) just because there is tinny space to manuever. It > > > hits almost never on medium units (in GCC bootstrap it hits almost > > > never) and almost always on big units > > > > > > My intuition alwas has been that for larger units the limits should be > > > much smaller and pooma was major counter example. If we suceed solving > > > this, I would guess we can introduce something like small-unit-insns > > > limit and limit size of units that exceeds this. Does this sound sane? > > > > POOMA hitting the unit-growth limit is caused by the abstraction penalty > > of the inliner and is no longer an issue with the new code size estimate. > > Though with the new estimate our INSNS_PER_CALL is probably too high, > > so we're reducing the unit-size too much inlining one-statement functions > > and as such getting more room for further inlining and finally regress badly > > in compile-time (as max-inline-insns-single is so high we're inlining stuff > > we shouldn't, but as it fit's the unit-growth limit now, we do ...) > > Experimenting further with replacing INSNS_PER_CALL by one plus the > move cost of the function arguments shows promising results, also not > artificially overestimating RDIV_EXPR and friends helps us not regress > on some testcases if we lower the default limits to 100. But I have too many > patches pending now - I'll let the dust settle down for now. Maybe we > should create a branch for the inlining stuff, or use tree-profiling branch > (though that has probably too many unrelated stuff). Problem with separate branch would be that tree-profiling already contains a lot of code for 4.1 that seriously affect inlining decisions, so probably 4.1 stuff should be tuned against that. We need some stabilization on tree-profiling right now and I hope to move on inlining stuff there later this week. Honza > > Richard.
Merge to tree-profiling
Hi, I did a merge to tree-profiling yesterday and the commit to the tree didn't go correctly, so the tree was messed up till today. If something breaks for you, please just update and hopefully everything will be OK now. Honza
Re: memcpy(a,b,CONST) is not inlined by gcc 3.4.1 in Linux kernel
> On Wednesday 30 March 2005 05:27, Gerold Jury wrote: > > > > >> On Tue, Mar 29, 2005 at 05:37:06PM +0300, Denis Vlasenko wrote: > > >> > /* > > >> > * This looks horribly ugly, but the compiler can optimize it totally, > > >> > * as the count is constant. > > >> > */ > > >> > static inline void * __constant_memcpy(void * to, const void * from, > > >> > size_t n) { > > >> > if (n <= 128) > > >> > return __builtin_memcpy(to, from, n); > > >> > > >> The problem is that in GCC < 4.0 there is no constant propagation > > >> pass before expanding builtin functions, so the __builtin_memcpy > > >> call above sees a variable rather than a constant. > > > > > >or change "size_t n" to "const size_t n" will also fix the issue. > > >As we do some (well very little and with inlining and const values) > > >const progation before 4.0.0 on the trees before expanding the builtin. > > > > > >-- Pinski > > >- > > I used the following "const size_t n" change on x86_64 > > and it reduced the memcpy count from 1088 to 609 with my setup and gcc > > 3.4.3. > > (kernel 2.6.12-rc1, running now) > > What do you mean, 'reduced'? > > (/me is checking) > > Oh shit... It still emits half of memcpys, to be exact - for > struct copies: > > arch/i386/kernel/process.c: > > int copy_thread(int nr, unsigned long clone_flags, unsigned long esp, > unsigned long unused, > struct task_struct * p, struct pt_regs * regs) > { > struct pt_regs * childregs; > struct task_struct *tsk; > int err; > > childregs = ((struct pt_regs *) (THREAD_SIZE + (unsigned long) > p->thread_info)) - 1; > *childregs = *regs; > ^^^ > childregs->eax = 0; > childregs->esp = esp; > > # make arch/i386/kernel/process.s > > copy_thread: > pushl %ebp > movl%esp, %ebp > pushl %edi > pushl %esi > pushl %ebx > subl$20, %esp > movl24(%ebp), %eax > movl4(%eax), %esi > pushl $60 > leal8132(%esi), %ebx > pushl 28(%ebp) > pushl %ebx > callmemcpy <= > movl$0, 24(%ebx) > movl16(%ebp), %eax > movl%eax, 52(%ebx) > movl24(%ebp), %edx > addl$8192, %esi > movl%ebx, 516(%edx) > movl%esi, -32(%ebp) > movl%esi, 504(%edx) > movl$ret_from_fork, 512(%edx) > > Jakub, is there a way to instruct gcc to inine this copy, or better yet, > to use user-supplied inline version of memcpy? You can't inline struct copy as it is not function call at first place. You can experiment with -minline-all-stringops where GCC will use it's own memcpy implementation for this. Honza > -- > vda
Re: Canonical form of the RTL CFG for an IF-THEN-ELSE block?
> Hi, > > We would like to know if there is some way to find the true and false > branches of a conditional jump in RTL. In the tree CFG, we have two > edge flags for that, EDGE_{TRUE,FALSE}_VALUE, but those flags have no > meaning for the RTL CFG. So our question is, is there some other way > to tell what edge will be taken in a conditional jump if the condition > is true? > > It seems that some passes assume a canonical form of IF-THEN-ELSE even > on RTL. From ifcvt.c:find_if_header: > > /* The THEN edge is canonically the one that falls through. */ > if (then_edge->flags & EDGE_FALLTHRU) > ; > else if (else_edge->flags & EDGE_FALLTHRU) > { > edge e = else_edge; > else_edge = then_edge; > then_edge = e; > } > else > /* Otherwise this must be a multiway branch of some sort. */ > return NULL; > > On the other hand, in cfgexpand.c:expand_gimple_cond_expr we have, > > false_edge->flags |= EDGE_FALLTHRU; > > and loop-unswitch.c assumes that the BRANCH_EDGE is the true_edge: > > true_edge = BRANCH_EDGE (unswitch_on_alt); > false_edge = FALLTHRU_EDGE (unswitch_on); > > So which is it? Is BRANCH_EDGE always taken if the condition is true, > or FALLTHRU_EDGE, or do you have to look at the condition to know? > Who knows an answer? :-)

:) It depends on how the conditional is constructed. If you use get_condition, the edge taken when the condition is true is always BRANCH_EDGE if one exists (it is possible to have a conditional jump to the following instruction, where you have only one edge, carrying the EDGE_FALLTHRU flag). Otherwise you have to look into the conditional jump RTL yourself to figure out whether it has the form (set (pc) (if_then_else (cond) (pc) (label_ref xxx))) or (set (pc) (if_then_else (cond) (label_ref xxx) (pc))) In the first case we take the branch edge when the condition is false. Honza

> > Gr. > Steven
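A short sketch of the check described above (RTL accessor names from memory, so treat them as approximate): look at the THEN arm of the if_then_else; if it is (pc), the jump is taken only when the condition is false.

/* For a conditional jump INSN, decide whether the branch edge is taken
   when the condition is true.  */
rtx set = pc_set (insn);              /* the (set (pc) ...) pattern   */
rtx cond_jump = SET_SRC (set);        /* the if_then_else expression  */
int branch_taken_if_true = GET_CODE (XEXP (cond_jump, 1)) != PC;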
Re: powerpc64-linux bootstrap failure
> That might be related to the bootstrap failure on AIX as well.

Hopefully this is fixed now by Jeff's patch.

> > Also, the commit modified files not listed in the ChangeLog: > > gcc/tree-pass.h > gcc/cp/method.c > > adding function tree_lowering_passes()

Uhm, I apparently cut&pasted the changelog somehow incompletely. I guess all I can do now is to fix it with the next commit? Honza

> > David
Re: powerpc64-linux bootstrap failure
> >>>>> Jan Hubicka writes: > > >> That might be related to the bootstrap failure on AIX as well. > > Jan> Hopefully this is fixed now by Jeff's patch. > > The libjava failure is fixed, but the patch will not affect the > AIX libgfortran failure. > > I have verified that either the cgraph inlining patch or one of > the Zdenek's loop patches is the cause of the bootstrap failure. > Reverting patches and quickstrapping stage3 does not fix the problem, so > either some dependency is missing or cc1 itself is being miscompiled. > > I am bootstrapping with the cgraph patches in place to try to > track down the specific commit that caused the failure. > > >> Also, the commit modified files not listed in the ChangeLog: > >> > >> gcc/tree-pass.h > >> gcc/cp/method.c > >> > >> adding function tree_lowering_passes() > > Jan> Uhm, I apparently cut&pasted changelog somehow incomplette. I guess all > Jan> I can do now is to fix them with next commit? > > Please fix the ChangeLogs *now*. OK, fixed thus... Honza > > Thanks, David >
Re: powerpc64-linux bootstrap failure
> >>>>> Jan Hubicka writes: > > >> That might be related to the bootstrap failure on AIX as well. > > Jan> Hopefully this is fixed now by Jeff's patch. > > The libjava failure is fixed, but the patch will not affect the > AIX libgfortran failure. > > I have verified that either the cgraph inlining patch or one of > the Zdenek's loop patches is the cause of the bootstrap failure. > Reverting patches and quickstrapping stage3 does not fix the problem, so > either some dependency is missing or cc1 itself is being miscompiled. > > I am bootstrapping with the cgraph patches in place to try to > track down the specific commit that caused the failure. Can you please try the attached patch? It fixes ICE on AIX cross for the testcase Steven sent me. (the problem seems to be that on AIX we produce function for static cdtors late in a game and we don't get it properly lowered as it is not passed throught the IPA optimizers) I apologize for the breakage (apparently many non-ELF targets was hit) Honza Index: cgraphunit.c === RCS file: /cvs/gcc/gcc/gcc/cgraphunit.c,v retrieving revision 1.105 diff -c -3 -p -r1.105 cgraphunit.c *** cgraphunit.c17 May 2005 16:56:22 - 1.105 --- cgraphunit.c19 May 2005 10:29:26 - *** cgraph_expand_function (struct cgraph_no *** 989,994 --- 989,996 if (flag_unit_at_a_time) announce_function (decl); + cgraph_lower_function (node); + /* Generate RTL for the body of DECL. */ lang_hooks.callgraph.expand_function (decl);
Re: Bootstrap failure for target AVR, probably linked to Patch "2005-05-19 Jan Hubicka <[EMAIL PROTECTED]>"
> Hi, > > I am observing a bootstrap failure for the avr target that seems to be > related > to the patch > > 2005-05-19 Jan Hubicka <[EMAIL PROTECTED]> > ... > * basic-block.h (REG_BR_PROB_BASE): Define. > ... > * rtl.h (REG_BR_PROB_BASE): Kill. > > . Bootstrap using the switches > configure --target=avr --with-dwarf2 --enable-languages="c,c++" --disable-nls > failed with the error message > > gcc -g -O2 -DIN_GCC -DCROSS_COMPILE -W -Wall -Wwrite-strings > -Wstrict-prototypes -Wmissing-prototypes -fno-common -DHAVE_CONFIG_H > -I. -I. -I../../gcc/gcc -I../../gcc/gcc/. -I../../gcc/gcc/../include > -I../../gcc/gcc/../libcpp/include -c insn-emit.c \ > -o insn-emit.o > ../../gcc/gcc/config/avr/avr.md: In function `gen_movmemhi': > ../../gcc/gcc/config/avr/avr.md:380: error: `REG_BR_PROB_BASE' undeclared > (first use in this function) > ../../gcc/gcc/config/avr/avr.md:380: error: (Each undeclared identifier is > reported only once > ../../gcc/gcc/config/avr/avr.md:380: error: for each function it appears in.) > make[1]: *** [insn-emit.o] Fehler 1 > make[1]: Leaving directory `/home/bmh/gnucvs/cleanhead/build/gcc' > make: *** [all-gcc] Fehler 2 > > The code causing the problems is enclosed in the prepration statements of a > define_expand pattern: > > (define_expand "movmemhi" > [(parallel [(set (match_operand:BLK 0 "memory_operand" "") > (match_operand:BLK 1 "memory_operand" "")) > (use (match_operand:HI 2 "const_int_operand" "")) > (use (match_operand:HI 3 "const_int_operand" ""))])] > "" > "{ > > ... > > /* Work out branch probability for latter use. */ > prob = REG_BR_PROB_BASE - REG_BR_PROB_BASE / count; > > > Since the missing macros seem to have moved from rtl.h to basic-block.h, I'd > like to know at which place one would need gcc make include the additional > header. IIUC, instruction-emit.c is the machine-generated source file that is > generated by the machine-description parsers. > > Simply adding basic-block.h to the include list of the generated emit-insn.c > does not solve the problem but causes follow-up failures. > > I'd appreciate help. I am looking into that now. I would preffer the way of adding basic-block.h and friends into includes of insn-emit.c as in general I would like to make expanders/splitters/output templates aware of the profile (and thus BB they are in) so we can switch in between -O2/-Os on local basis. I must admit I am not 100% sure how to reach this elegantly as all the ingerfaces are bit interwinded (even for the splitters we split in final.c where we no longer have CFG available for all targets) Honza > > Yours, > > Björn
Re: Bootstrap failure for target AVR, probably linked to Patch "2005-05-19 Jan Hubicka <[EMAIL PROTECTED]>"
> > > > Since the missing macros seem to have moved from rtl.h to basic-block.h, > > I'd > > like to know at which place one would need gcc make include the additional > > header. IIUC, instruction-emit.c is the machine-generated source file that > > is > > generated by the machine-description parsers. > > > > Simply adding basic-block.h to the include list of the generated > > emit-insn.c > > does not solve the problem but causes follow-up failures. > > > > I'd appreciate help. > > I am looking into that now. I would preffer the way of adding > basic-block.h and friends into includes of insn-emit.c as in general I > would like to make expanders/splitters/output templates aware of the > profile (and thus BB they are in) so we can switch in between -O2/-Os on > local basis. > I must admit I am not 100% sure how to reach this elegantly as all the > ingerfaces are bit interwinded (even for the splitters we split in > final.c where we no longer have CFG available for all targets) The attached patch seems to fix the problem to me (at least to the point so I can build cc1 binarry). What kind of other problems you are seeing? Index: Makefile.in === RCS file: /cvs/gcc/gcc/gcc/Makefile.in,v retrieving revision 1.1488 diff -c -3 -p -r1.1488 Makefile.in *** Makefile.in 18 May 2005 20:45:02 - 1.1488 --- Makefile.in 20 May 2005 18:11:58 - *** s-constants : $(MD_DEPS) build/genconsta *** 2452,2458 insn-emit.o : insn-emit.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) \ $(RTL_H) $(EXPR_H) real.h output.h insn-config.h $(OPTABS_H) reload.h \ ! $(RECOG_H) toplev.h function.h $(FLAGS_H) hard-reg-set.h $(RESOURCE_H) $(TM_P_H) $(CC) $(ALL_CFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) -c insn-emit.c \ $(OUTPUT_OPTION) --- 2452,2459 insn-emit.o : insn-emit.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) \ $(RTL_H) $(EXPR_H) real.h output.h insn-config.h $(OPTABS_H) reload.h \ ! $(RECOG_H) toplev.h function.h $(FLAGS_H) hard-reg-set.h $(RESOURCE_H) \ ! $(TM_P_H) $(BASIC_BLOCK_H) $(CC) $(ALL_CFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) -c insn-emit.c \ $(OUTPUT_OPTION) Index: genemit.c === RCS file: /cvs/gcc/gcc/gcc/genemit.c,v retrieving revision 1.98 diff -c -3 -p -r1.98 genemit.c *** genemit.c 5 Mar 2005 14:01:00 - 1.98 --- genemit.c 20 May 2005 18:11:58 - *** from the machine description file `md'. *** 845,850 --- 845,851 printf ("#include \"reload.h\"\n"); printf ("#include \"toplev.h\"\n"); printf ("#include \"ggc.h\"\n\n"); + printf ("#include \"basic-block.h\"\n\n"); printf ("#define FAIL return (end_sequence (), _val)\n"); printf ("#define DONE return (_val = get_insns (), end_sequence (), _val)\n\n");
Re: Bootstrap failure for target AVR, probably linked to Patch "2005-05-19 Jan Hubicka <[EMAIL PROTECTED]>"
> On Fri, May 20, 2005 at 07:41:53PM +0200, Jan Hubicka wrote: > > I am looking into that now. I would preffer the way of adding > > basic-block.h and friends into includes of insn-emit.c as in general I > > would like to make expanders/splitters/output templates aware of the > > profile (and thus BB they are in) so we can switch in between -O2/-Os on > > local basis. > > We're a *long* way from that. Please just move REG_BR_PROB_BASE > back to rtl.h. It is a property of an *rtl* note, after all. The problem with that is that I have to include rtl.h in cgraphunit and ipa-inline.c then as well. This is not much prettier either, but sure can do that if that sounds preferable. (I meant to rename it to BR_PROB_BASE afterwards as currently the use in edge->probability is more common than the note itself) Honza > > > r~
Re: [rfc] mainline slush
> On Sat, May 21, 2005 at 09:08:45AM +0200, Eric Botcazou wrote: > > Are these specific to SPARC? > > No. I believe Andrew mentioned that there is a patch for this? (it is a lack of sync between the operands and the stmt itself) Honza > > > r~
Re: [rfc] mainline slush
> On Sat, May 21, 2005 at 09:45:38PM +0200, Eric Botcazou wrote: > > Btw, is it me or the individual RTL dump options are broken? > > The initial rtl dump is broken. The rest work. > > I think one of Jan's IPA pass manager changes broke it. What are the symptoms? -fdump-tree-expand seems to work fine for me, as does -d (I am not sure whether -fdump-rtl-expand ever worked, but I might try to restore it if it did). Honza > > > r~
Re: [rfc] mainline slush
> On Sunday 22 May 2005 00:16, Jan Hubicka wrote: > > (not sure of -fdump-rtl-expand ever worked, but I > > might try to restore it if it did). > > It very definitely did work, and quite nicely too. > Try -fdump-rtl-expand-details. Yeah, but on tree-profiling it always was -fdump-tree-expand-details. Now I see what went wrong. I am still testing the patch but it should rename it back to -fdump-rtl-expand 2005-05-22 Jan Hubicka <[EMAIL PROTECTED]> * tree-optimize.c (init_tree_optimization_passes): Fix flags of all_passes and all_ipa_passes. Index: tree-optimize.c === RCS file: /cvs/gcc/gcc/gcc/tree-optimize.c,v retrieving revision 2.101 diff -c -3 -p -r2.101 tree-optimize.c *** tree-optimize.c 19 May 2005 10:38:42 - 2.101 --- tree-optimize.c 22 May 2005 08:50:25 - *** init_tree_optimization_passes (void) *** 487,499 #undef NEXT_PASS register_dump_files (all_lowering_passes, false, PROP_gimple_any); ! register_dump_files (all_passes, false, PROP_gimple_any ! | PROP_gimple_lcf ! | PROP_gimple_leh | PROP_cfg); ! register_dump_files (all_ipa_passes, true, PROP_gimple_any !| PROP_gimple_lcf !| PROP_gimple_leh | PROP_cfg); } --- 488,496 #undef NEXT_PASS register_dump_files (all_lowering_passes, false, PROP_gimple_any); ! register_dump_files (all_passes, false, PROP_gimple_leh | PROP_cfg); ! register_dump_files (all_ipa_passes, true, PROP_gimple_leh | PROP_cfg); }
Re: i387 control word register definition is missing
> Hello! > > > Well you really want both the fpcr and the mxcsr registers, since the fpcr > > only controls the x87 and the mxcsr controls the xmm registers. Note, in > > adding these registers, you are going to have to go through all of the > > floating > > point patterns to add (use:HI FPCR_REG) and (use:SI MXCSR_REG) to each and > > every pattern so that the optimizer can be told not to move a floating point > > operation past the setting of the control word. > > I think that (use:...) clauses are needed only for (float)->(int) patterns If you make FPCTR/MXCSR real registers, you will need to add use to all the arithmetic and move pattern that would consume quite some memory and confuse optimizers. I think you can get better around simply using volatile unspecs inserted by LCM pass (this would limit scheduling, but I don't think it is that big deal) > (fix_trunc.. & co.). For i386, we could calculate new mode word in advance > (this > calculation is inserted by LCM), and fldcw insn is inserted just before > fist/frndint. > > (define_insn_and_split "fix_trunc_i387_2" > [(set (match_operand:X87MODEI12 0 "memory_operand" "=m") > (fix:X87MODEI12 (match_operand 1 "register_operand" "f"))) >(use (match_operand:HI 2 "memory_operand" "m")) >(use (match_operand:HI 3 "memory_operand" "m"))] > "TARGET_80387 && !TARGET_FISTTP >&& FLOAT_MODE_P (GET_MODE (operands[1])) >&& !SSE_FLOAT_MODE_P (GET_MODE (operands[1]))" > "#" > "reload_completed" > [(set (reg:HI FPCR_REG) > (unspec:HI [(match_dup 3)] UNSPEC_FLDCW)) >(parallel [(set (match_dup 0) (fix:X87MODEI12 (match_dup 1))) > (use (reg:HI FPCR_REG))])] > "" > [(set_attr "type" "fistp") >(set_attr "i387_cw" "trunc") >(set_attr "mode" "")]) > > > (define_insn "*fix_trunc_i387" > [(set (match_operand:X87MODEI12 0 "memory_operand" "=m") > (fix:X87MODEI12 (match_operand 1 "register_operand" "f"))) >(use (reg:HI FPCR_REG))] > "TARGET_80387 && !TARGET_FISTTP >&& FLOAT_MODE_P (GET_MODE (operands[1])) >&& !SSE_FLOAT_MODE_P (GET_MODE (operands[1]))" > "* return output_fix_trunc (insn, operands, 0);" > [(set_attr "type" "fistp") >(set_attr "i387_cw" "trunc") >(set_attr "mode" "")]) > > I'm trying to use MODE_ENTRY and MODE_EXIT macros to insert mode calculations > in My main motivation for stopping on this point was that reload might insert new fld/fst instructions in the places where control word is changes resulting in wrong rounding. it seems to me that we would have to make the second LCM pass happen post reloading, that is definitly doable, just I never got across doing that. > proper places. Currently, I have a somehow working prototype that switches > between 2 modes: MODE_UNINITIALIZED, MODE_TRUNC (and MODE_ANY). The trick here > is, that MODE_ENTRY and MODE_EXIT are defined to MODE_UNINITIALIZED. Secondly, > every asm statement and call insn switches to MODE_UNINITIALIZED, and when > mode > is switched _from_ MODE_TRUNC _to_ MODE_UNINITIALIZED before these two > statements (or in exit BBs), an UNSPEC_VOLATILE type fldcw is emitted (again > via > LCM) that switches fpu to saved mode. [UNSPEC_VOLATILE is needed to prevent > optimizers to remove this pattern]. So, 2 fldcw patterns are defined: If we use the second LCM pass and we make it to insert code as late as possible, it seems to be safe to me to just have MODE_ and MODE_UNINITIALIZED and insert loads accordingly belivin that the first LCM pass laredy inserted the computations on correct points. 
> > (define_insn "x86_fldcw_1" > [(set (reg:HI FPCR_REG) > (unspec:HI [(match_operand:HI 0 "memory_operand" "m")] >UNSPEC_FLDCW))] > "TARGET_80387" > "fldcw\t%0" > [(set_attr "length" "2") >(set_attr "mode" "HI") >(set_attr "unit" "i387") >(set_attr "athlon_decode" "vector")]) > > (define_insn "x86_fldcw_2" > [(set (reg:HI FPCR_REG) > (unspec_volatile:HI [(match_operand:HI 0 "memory_operand" "m")] > UNSPECV_FLDCW))] > "TARGET_80387" > "fldcw\t%0" > [(set_attr "length" "2") >(set_attr "mode" "HI") >(set_attr "unit" "i387") >(set_attr "athlon_decode" "vector")]) > > By using this approach, testcase: > > int test (int *a, double *x) { > int i; > > for (i = 10; i; i--) { > a[i] = x[i]; > } > > return 0; > } > > is compiled (with -O2 -fomit-frame-pointer -fgcse-after-reload) into: > > test: > pushl %ebx > xorl %edx, %edx > subl $4, %esp > fnstcw 2(%esp) <- store current cw > movl 12(%esp), %ebx > movl 16(%esp), %ecx > movzwl 2(%esp), %eax > orw $3072, %ax > movw %ax, (%esp) <- store new cw > .p2align 4,,15 > .L2: > fldcw (%esp) <- hello? gcse-after-reload? > fldl 80(%ecx,%edx,8) > fistpl 40(%ebx,%edx,4) > decl %edx >
Re: [RFC] Provide SSE/SSE2 ABI math builtins for ia32
> First off, is anyone working on providing GCC or libm with a (complete) > set of C99 math functions with SSE/SSE2 calling conventions? If not, I > will start with something like the following plan: > > 1. There is already support for switching the ABI function-wise in > the backend for local static functions. As a first step I'd like > to expose this functionality as a function attribute. > > 2. In the -mfpmath=sse case I'd like to attach this attribute to GCCs > builtin math functions, either dependent on availability of such > function in libgcc(?) or glibc libm (detected by some configure > magic). > > 3. Emit calls to alternate the functions during RTL expansion, possibly > following the naming scheme from the Intel compiler (to be able to > use their libm replacement for testing). > > 4. Provide either libgcc or libm with SSE/SSE2 implementations for > the math functions, possibly with both ABIs as entry point. > > 5. Eventually provide GCC with RTL inline intrinsics for commonly used > math functions in the -ffast-math case. > > > Does this sound reasonable and profitable? As far as I see glibc libm > currently has no SSE/SSE2 implementations, neither for amd64 nor for > ia32. There are definitly (at least some) math functions implemented in SSE assembly for x86-64 in glibc, but having them available in 32bit would be nice too. Your plan makes sense to me (we might also include our own memcpy and other string functions stubs and call to them in register passing conventions once having infrastructure on place), compatibility with Intel's builtins too. Honza > > Thanks, > Richard.
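A rough sketch of what step 1 of the plan above could look like from the user's side. The attribute spelling "sseregparm" is an assumption made for illustration; only the idea of a per-function ABI switch comes from the message:

/* Hypothetical per-function SSE calling convention, as proposed in
   step 1 above: the attribute (name assumed here) marks a function
   whose FP arguments and return value travel in SSE registers rather
   than via the x87 stack.  */
extern double fast_sin (double x) __attribute__ ((sseregparm));

double
twice_sin (double a)
{
  return 2.0 * fast_sin (a);   /* call site uses the SSE register ABI */
}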
Re: Edges, predictions, and GC crashes ...
> Hello, > > I'm seeing compiler crashes during garbage collection when using mudflap. > > The problem appears to be that some basic_block_def structures point to > edge_prediction structures which point to edge_def structures that have > already been ggc_free()'d. > > Now, looking at remove_edge (cfg.c) it does indeed appear that it is > possible for edges to get deleted without the corresponding prediction > structure being removed as well ... > > How is this supposed to work? I didn't have any cleanup_cfg between the earliest place that puts predictions and the profiling pass consuming them, so this scenario didn't happen. This has, however, changed a long time ago. I guess teaching remove_edge to walk the prediction list, if it is present, and kill the now-dead predictions would not be a performance overkill, as predictions are rare; I can implement that if you pass me a testcase. Honza > > Bye, > Ulrich > > -- > Dr. Ulrich Weigand > Linux on zSeries Development > [EMAIL PROTECTED]
Re: Edges, predictions, and GC crashes ...
> Jan Hubicka wrote: > > > I didn't have any cleanup_cfg in between earliest place putting > > predictions and the profiling pass consuming them, so this scenario > > didn't happen. This has however changed a long time ago. I guess just > > teaching remove_edge to walk prediction list if it is present and kill > > now dead predictions would not be perfomrance overkill as the > > predictions are rare, I can implement that if you pass me some testcase. > > FAIL: libmudflap.c++/pass57-frag.cxx (test for excess errors) > WARNING: libmudflap.c++/pass57-frag.cxx compilation failed to produce > executable > FAIL: libmudflap.c++/pass57-frag.cxx (-static) (test for excess errors) > WARNING: libmudflap.c++/pass57-frag.cxx (-static) compilation failed to > produce executable > FAIL: libmudflap.c++/pass57-frag.cxx (-O2) (test for excess errors) > WARNING: libmudflap.c++/pass57-frag.cxx (-O2) compilation failed to produce > executable > FAIL: libmudflap.c++/pass57-frag.cxx (-O3) (test for excess errors) > WARNING: libmudflap.c++/pass57-frag.cxx (-O3) compilation failed to produce > executable > > with current mainline on s390-ibm-linux. > > Thanks for looking into this issue! Hi, I've comitted the attached patch. I didn't suceed to reproduce your failures, but Danny reported it fixes his and it bootstrap/regtests i686-pc-gnu-linux. Honza 2005-06-03 Jan Hubicka <[EMAIL PROTECTED]> * basic-block.h (remove_predictions_associated_with_edge): Declare. * cfg.c (remove_edge): Use it. * predict.c (remove_predictions_associated_with_edge): New function. Index: basic-block.h === RCS file: /cvs/gcc/gcc/gcc/basic-block.h,v retrieving revision 1.261 diff -c -3 -p -r1.261 basic-block.h *** basic-block.h 1 Jun 2005 12:07:42 - 1.261 --- basic-block.h 2 Jun 2005 21:45:58 - *** extern void tree_predict_edge (edge, enu *** 869,874 --- 869,875 extern void rtl_predict_edge (edge, enum br_predictor, int); extern void predict_edge_def (edge, enum br_predictor, enum prediction); extern void guess_outgoing_edge_probabilities (basic_block); + extern void remove_predictions_associated_with_edge (edge); /* In flow.c */ extern void init_flow (void); Index: cfg.c === RCS file: /cvs/gcc/gcc/gcc/cfg.c,v retrieving revision 1.92 diff -c -3 -p -r1.92 cfg.c *** cfg.c 12 May 2005 22:32:08 - 1.92 --- cfg.c 2 Jun 2005 21:45:58 - *** make_single_succ_edge (basic_block src, *** 349,354 --- 349,355 void remove_edge (edge e) { + remove_predictions_associated_with_edge (e); execute_on_shrinking_pred (e); disconnect_src (e); Index: predict.c === RCS file: /cvs/gcc/gcc/gcc/predict.c,v retrieving revision 1.146 diff -c -3 -p -r1.146 predict.c *** predict.c 27 May 2005 22:06:33 - 1.146 --- predict.c 2 Jun 2005 21:45:58 - *** tree_predict_edge (edge e, enum br_predi *** 240,245 --- 240,263 i->edge = e; } + /* Remove all predictions on given basic block that are attached +to edge E. */ + void + remove_predictions_associated_with_edge (edge e) + { + if (e->src->predictions) + { + struct edge_prediction **prediction = &e->src->predictions; + while (*prediction) + { + if ((*prediction)->edge == e) + *prediction = (*prediction)->next; + else + prediction = &((*prediction)->next); + } + } + } + /* Return true when we can store prediction on insn INSN. At the moment we represent predictions only on conditional jumps, not at computed jump or other complicated cases. */
Re: Edges, predictions, and GC crashes ...
> Jan Hubicka wrote: > > > I've comitted the attached patch. I didn't suceed to reproduce your > > failures, but Danny reported it fixes his and it bootstrap/regtests > > i686-pc-gnu-linux. > > Thanks; this does fix one crash on s390x, but doesn't fix the > pass57-frag crashes on s390. > > What happens is that after the predictions are created, but before > remove_edge is called, the edge is modified in rtl_split_block > (called from tree_expand_cfg): > > /* Redirect the outgoing edges. */ > new_bb->succs = bb->succs; > bb->succs = NULL; > FOR_EACH_EDGE (e, ei, new_bb->succs) > e->src = new_bb; > > Now the 'src' link points to a different basic block, but the old > basic block still has the prediction pointing to the edge. > > When remove_edge is finally called, your new code tries to find > and remove the prediction from the *new* basic block's prediction > list -- but it still remains on the old one's list ... Uhm, I will test fix for this too. Thanks! Honza > > Bye, > Ulrich > > -- > Dr. Ulrich Weigand > Linux on zSeries Development > [EMAIL PROTECTED]
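A hypothetical sketch of the follow-up fix being discussed here: when the successor edges are handed over to the new block, the predictions recorded on the old block refer to those edges and have to move with them. The field name follows the patch earlier in the thread; this is not the committed fix:

/* Hypothetical helper: called after NEW_BB has taken over BB's
   successor edges (as in the rtl_split_block hunk quoted above), so
   the old block does not keep predictions pointing at edges whose
   source is now NEW_BB.  Illustration only.  */
static void
move_block_predictions (basic_block bb, basic_block new_bb)
{
  new_bb->predictions = bb->predictions;
  bb->predictions = NULL;
}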
Re: Big differences on SpecFP results for gcc and icc
> Hello! > > There is an interesting comparison of SPEC scores between gcc and icc: > http://people.redhat.com/dnovillo/spec2000.i686/gcc/individual-run-ratio.html > . A quick look at the graphs shows a big differences in achieved scores > between gcc and icc, mostly in SpecFP tests. I was trying to find some > information on this matter, but none can be found in the archives on gcc's > site. > > An interesting examples are: > -177.mesa (this is a c test), where icc is almost 40% faster > -178.galgel, where icc is again 40% faster > -179.art, where llvm is more than 1.5x faster than both gcc and icc > -187.facere, where icc is 100% faster than gcc > -189.lucas, where icc is 60% faster > > I know that these graphs don't show the results of most aggresive > optimization options for gcc, but that is also the case with icc (only > -O2). However, it looks that gcc and icc are not even in the same class > regarding FP performance. Perhaps there is some critical optimizations, > that are not present in gcc? > > I think I'm not the only person, that finds these results rather > "dissapointing". As Scott is currently writing a paper on gcc's FP > performance, perhaps someone has an explanation, why gcc's results are > so low on Pentium4 for these tests? Part of reason is the fact that ICC defaults to SSE math while GCC to x87 math on 32bit. I am not sure what it does in setup Diego use (ie whether vectorization is done or if loops are unrolled). Andreas's tester (http://www.suse.de/~aj/SPEC/amd64) shows similar comparsions on Opteron for both 32bit and 64bit runs. The ICC runs uses same flag as AMD published results so presumably good choice of aggressive optimization flags. This is comparing apples to oranges too as 64bit runs suffers from memory problems, 32bit runs from x87 and ICC from lack of Opteron support but gives some more idea. On Opteron we lose score in mesa because ICC runs are with profile feedback and there is division by value that is always 360 in the internal loop. You can see tree-profiling branch scores to be better when profile feedback is available on one point of history... Mesa also suffers from code size being too large for caches of Opteron CPU we use. In 64bit compilation Art suffers from register pressure caused by our tree optimizers (at least last time I tried). Swim (and perhaps some other benchmark too?) suffers from the fact htat loops needs to be interchanged, this was fixed by DannyB recently but requires special flag so you don't see it in scores (is this going to be by default) I didn't look too closely to fortran benchmarks. I always assumed that we did poorly on optimizing fortran loops accessing variably sized arrays and we lack vectorization. Zdenek this week improved SPECfp scores by ivopts patches quite impressively (PPC shows order of mangitude improvements, but you can see improvement on Opteron too), so we seem to do somewhat better now... Honza > > Uros.
Re: PATCH: Explicitly pass --64 to assembler on AMD64 targets
> On Mon, Jun 13, 2005 at 07:17:24PM -0700, Zack Weinberg wrote: > > Or, if GAS can be told which mode it should be in via directives in > > its input (.code32/.code64?), then we could add something like > > > > fputs (TARGET_64BIT ? "\t.code64\n" : "\t.code32", > > asm_out_file); > > > > to x86_file_start, and kill the spec hackery altogether. > > I'm a fan of such directives. I suspect that we'll have to keep > the spec hackery for a while yet. We don't usually force binutils > upgrades with compiler upgrades... Putting a .code64 directive in the assembly file and compiling with gas defaulting to 32bit would result in a 32bit ELF image containing 64bit assembly encoding (and it will die horribly once we hit missing relocations). Honza > > > r~
Re: PATCH: Explicitly pass --64 to assembler on AMD64 targets
> Richard Henderson <[EMAIL PROTECTED]> writes: > > > On Mon, Jun 13, 2005 at 07:17:24PM -0700, Zack Weinberg wrote: > >> Or, if GAS can be told which mode it should be in via directives in > >> its input (.code32/.code64?), then we could add something like > >> > >> fputs (TARGET_64BIT ? "\t.code64\n" : "\t.code32", > >> asm_out_file); > >> > >> to x86_file_start, and kill the spec hackery altogether. > > > > I'm a fan of such directives. I suspect that we'll have to keep > > the spec hackery for a while yet. We don't usually force binutils > > upgrades with compiler upgrades... > > So I take it that such directives do not already exist? Darn. They exist: {"code16gcc", set_16bit_gcc_code_flag, CODE_16BIT}, {"code16", set_code_flag, CODE_16BIT}, {"code32", set_code_flag, CODE_32BIT}, {"code64", set_code_flag, CODE_64BIT}, but they only switch ASM encoding, not the output ELF file format as they are intended for stuff where you really mix 32bit and 64bit code, such as in the boot loader. Honza > > zw
Re: x86-linux bootstrap broken on mainline?
> stage1/xgcc -Bstage1/ > -B/home/guerby/work/gcc/install/install-20050616T132922/i686-pc-linux-gnu/bin/ > -c -O2 -g -fomit-frame-pointer -DIN_GCC -W -Wall -Wwrite-strings > -Wstrict-prototypes -Wmissing-prototypes -pedantic -Wno-long-long > -Wno-variadic-macros -Wold-style-definition -Werror -fno-common > -DHAVE_CONFIG_H-I. -I. -I/home/guerby/work/gcc/version-head/gcc > -I/home/guerby/work/gcc/version-head/gcc/. > -I/home/guerby/work/gcc/version-head/gcc/../include > -I/home/guerby/work/gcc/version-head/gcc/../libcpp/include \ > /home/guerby/work/gcc/version-head/gcc/config/i386/i386.c -o i386.o > /home/guerby/work/gcc/version-head/gcc/config/i386/i386.c: In function > 'ix86_expand_builtin': > /home/guerby/work/gcc/version-head/gcc/config/i386/i386.c:15127: internal > compiler error: Segmentation fault > Please submit a full bug report, > with preprocessed source if appropriate. > See http://gcc.gnu.org/bugs.html> for instructions. > make[2]: *** [i386.o] Error 1 > > It looks like the only patch between my last successful bootstrap and the > current failure is > the following, am I alone in seeing this? Can I have backtrace? Both patches was bootstrapped/regtested on i686 so everything shall be fine. I am double checking now. I also received reports from memory tester so it's build apparently passed too. Honza > > Laurent > > $ cvs diff -u -D 'Wed Jun 15 21:20:45 UTC 2005' -D 'Thu Jun 16 11:26:54 UTC > 2005' ChangeLog > Index: ChangeLog > === > RCS file: /cvs/gcc/gcc/gcc/ChangeLog,v > retrieving revision 2.9159 > retrieving revision 2.9161 > diff -u -r2.9159 -r2.9161 > --- ChangeLog 15 Jun 2005 20:13:04 - 2.9159 > +++ ChangeLog 16 Jun 2005 10:31:51 - 2.9161 > @@ -1,3 +1,59 @@ > +2005-06-16 Jan Hubicka <[EMAIL PROTECTED]> > + > + * basic-block.h (rtl_bb_info): Break out head_, end_, > + global_live_at_start, global_live_at_end from ... > + (basic_block_def): ... here; update all references > + (BB_RTL): New flag. > + (init_rtl_bb_info): Declare. > + * cfgexpand.c (expand_gimple_basic_block): Init bb info, set BB_RTL > + flag. > + * cfgrtl.c: Include ggc.h > + (create_basic_block_structure): Init bb info. > + (rtl_verify_flow_info_1): Check BB_RTL flag and rtl_bb_info pointer. > + (init_rtl_bb_info): New function. > + (rtl_merge_block, cfglayout_merge_block): Copy global_live_at_end > here. > + * cfghooks.c (merge_block): Do not copy global_live_at_end here. > + * cfg.c (clear_bb_flags): Skip BB_RTL flag. > + (dump_flow_info): Gueard global_live_* dumping. > + > + * Makefile.in (cfg.o): Add new dependencies. > + * basic-block.h (reorder_block_def): Kill > + original/copy/duplicated/copy_number fields. > + (BB_DUPLICATED): New flag. > + (initialize_original_copy_tables, free_original_copy_tables, > + set_bb_original, get_bb_original, set_bb_copy, get_bb_copy): New. > + * cfg.c: Include hashtab.h and alloc-pool.h > + (bb_original, bb_copy, original_copy_bb_pool): New static vars. > + (htab_bb_copy_original_entry): New struct. > + (bb_copy_original_hash, bb_copy_original_eq): New static functions. > + (initialize_original_copy_tables, free_original_copy_tables, > + set_bb_original, get_bb_original, set_bb_copy, get_bb_copy): New > + global functions. > + * cfghooks.c (duplicate_block): Update original/copy handling. > + * cfglayout.c (fixup_reorder_chain): Likewise. > + (cfg_layout_initialize): Initialize orignal_copy tables. > + (cfg_layout_finalize): FInalize original_copy tables. > + (can_copy_bbs_p): Use BB_DUPLICATED flag. > + (copy_bbs): Likewise. 
> + * cfgloopmanip.c (update-single_exits_after_duplication): Likewise. > + (duplicate_loop_to_header_edge): Likewise; update handling of > + copy_number. > + (loop_version): Likewise. > + * dominance.c (get_dominated_by_region): Use BB_DUPLICATED_FLAG. > + * except.c (expand_resx_expr): Check that reg->resume is not set. > + * loop-unroll.c (unroll_loop_constant_iterations, > + unroll_loop_runtime_iterations, apply_opt_in_copies): Update > + copy/original handling. > + * loop-unwitch.c (unswitch_loop): Likewise. > + * tree-cfg.c (create_bb): Do not initialize RBI. > + (disband_implicit_edges): Do not kill RBI. > + (add_phi_args_after_copy_bb): Use new original/copy mapping. > + (add_phi_args_after_copy): Use BB_DUPLICATED flag. > + (tree_duplicate_sese_region): Update original/copy handling. > + * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Likewise. > + * tree-ssa-loop-manip.c (copy_phi_node_args): Likewise. > + * tree-ssa-loop-unswitch.c (tree_unswitch_single_loop): Likewise. > + >
Re: x86-linux bootstrap broken on mainline?
> stage1/xgcc -Bstage1/ > -B/home/guerby/work/gcc/install/install-20050616T132922/i686-pc-linux-gnu/bin/ > -c -O2 -g -fomit-frame-pointer -DIN_GCC -W -Wall -Wwrite-strings > -Wstrict-prototypes -Wmissing-prototypes -pedantic -Wno-long-long > -Wno-variadic-macros -Wold-style-definition -Werror -fno-common > -DHAVE_CONFIG_H-I. -I. -I/home/guerby/work/gcc/version-head/gcc > -I/home/guerby/work/gcc/version-head/gcc/. > -I/home/guerby/work/gcc/version-head/gcc/../include > -I/home/guerby/work/gcc/version-head/gcc/../libcpp/include \ > /home/guerby/work/gcc/version-head/gcc/config/i386/i386.c -o i386.o > /home/guerby/work/gcc/version-head/gcc/config/i386/i386.c: In function > 'ix86_expand_builtin': > /home/guerby/work/gcc/version-head/gcc/config/i386/i386.c:15127: internal > compiler error: Segmentation fault > Please submit a full bug report, > with preprocessed source if appropriate. > See http://gcc.gnu.org/bugs.html> for instructions. > make[2]: *** [i386.o] Error 1 > > It looks like the only patch between my last successful bootstrap and the > current failure is > the following, am I alone in seeing this? Apparently the following patch http://gcc.gnu.org/ml/gcc-patches/2005-06/msg00971.html should fix your problem. I got the dependencies wrong and this seem to show up only in some memory configurations. My apologizes for that! Honza
Re: Regressions
> On Friday 17 June 2005 08:30, Steve Kargl wrote: > > On Fri, Jun 17, 2005 at 08:01:47AM +0200, FX Coudert wrote: > > > Jerry DeLisle wrote: > > > >There appears to be numerous regression failures this evening. I > > > >suppose these are back end related. > > > > > > On i686-freebsd, i386-linux and x86_64-linux, I see failures for > > > gfortran.dg/pr19657.f and gfortran.dg/select_2.f90 at -O3, > > > gfortran.dg/vect/vect-2.f90 at -O. And gfortran.dg/vect/vect-5.f90, but > > > that one is not new. > > > > > > They were not present in 20050615, and appeared in 20050616. It is due > > > to an ICE, at -O3: > > > > > > O3.f: In function 'MAIN__': > > > O3.f:11: internal compiler error: in tree_verify_flow_info, at > > > tree-cfg.c:3716 > > > > > > This is now known as PR 22100. > > > > I can confirm the problem on amd64-*-freebsd. It is quite > > annoying that someone would make a change to the backend > > without testing it. > > Indeed. > See http://gcc.gnu.org/ml/gcc/2005-06/msg00728.html... Apparently I screwed up testing (i.e. I tested one file, but sent a different one with the same name). I will try to fix it later today or tomorrow. Sorry for the breakage. Honza > > Gr. > Steven
Re: Fortran left broken for a couple of days now
> Honza, > > Your patch here: http://gcc.gnu.org/ml/gcc-patches/2005-06/msg00976.html > has left a number of fortran test cases broken (e.g. gfortran.dg/select_2). > > The problem seems to be that you used the aux field as a replacement for > rbi->copy_number, but tree_verify_flow_info assumes aux is cleared before > it is called (see the SWITCH_EXPR case, "gcc_assert (!label_bb->aux )"). > You must have seen this if you tested your patch with checking enabled, the > patch broke fortran on all platforms. I really apologize for that. I must have messed up testing here seriously. I am pretty sure I was testing both with checking disabled and enabled to see runtime performance, and perhaps I just saw the first report from the tester, but that still doesn't explain how I missed the vectorizer failures (though I am pretty sure I saw those failures previously, and my incarnation of the PR22088 fix actually comes from an older version of the patch). It is obviously pilot error here, and I believed that writing scripts to mostly automate testing would prevent me from such stupid bugs... I think the proper fix would be to simply avoid this verify_flow_info call when the transformation is not finished yet, like we do in other passes that use the aux field. I am traveling now, but I will try to dig out the backtrace today or tomorrow and fix that. Honza > > Can you please fix this? (It is also http://gcc.gnu.org/PR22100) > > Gr. > Steven >
Re: gcc-4.1-20050702 ICE in cgraph_early_inlining, at ipa-inline.c:990
> I get this error compiling linux-2.6.11.3 with gcc-4.1-20050702 on many > targets: > > drivers/char/random.c: In function 'extract_entropy': > drivers/char/random.c:634: sorry, unimplemented: inlining failed in call to > 'add_entropy_words': function not considered for inlining > drivers/char/random.c:1325: sorry, unimplemented: called from here > drivers/char/random.c: At top level: > drivers/char/random.c:1813: internal compiler error: in > cgraph_early_inlining, at ipa-inline.c:990 > > Line 1813 is > EXPORT_SYMBOL(generate_random_uuid); > > I don't have the preprocessed source handy, but I can provide it if > this hasn't already been reported. Having a preprocessed testcase would be nice. The obvious fix would be to disable the warning during early inlining, but I can't come up with a scenario where the early inliner should miss an always_inline call. Honza
Re: -fprofile-arcs
> Hi, > > I am trying to profile the frequency of each basic block of > SPEC 2000 benchmarks by compiling them using -fprofile-arcs and opt -O3. > After running the benchmark, when I try to read "bb->count" while > compiling > using "-fbranch-probabilities and -O3", I get "0" values for basic blocks > which were known to execute for sure. Any clue as to what I am missing? > Is "bb->count" the right way to get the dynamic frequency in the second > pass? It is. You would need to provide more information for me to figure out what is going wrong. Of course, bb->count is initialized only after the profile is read in, which on current mainline happens pretty late in the compilation queue (after the bp RTL pass), but this is hopefully going to change on Monday. Honza > > regards, > Raj
Re: -fprofile-arcs
> Hi, > > Thanks a lot. Basically, I want to obtain dynamic basic block frequency at > RTL > level just before register allocation. Look at the following piece of > code(a.c): > > void foo(int i, int *a, int *p) { > int x1,j; > for(j=0;j<200;j++) { > x1=a[i]+j; > *p=99; > a[i]=x1; > } > } > > main() { > int *a,*p,i=0; > int x1,x2,x3,x4; > a=malloc(sizeof(int)); > p=malloc(sizeof(int)); > a[0]=0; > foo(0,a,p); > printf("\n%d ",*p); > for(i=0;i<1;i++) { > printf(" %d ",a[i]); > } > } > > This code was executed using "gcc -O3 -fprofile-arcs --param > max-unroll-times=0 a.c". > "a.out" was then executed (for profiling). > Now I compile using "gcc -O3 -fbranch-probabilities --param > max-unroll-times=0 a.c". > During this phase, I try to obtain dynamic frequencies of the statements > within the "for" > loop in "foo" method at RTL level. The frequencies return "0" using > "bb->count". I > would like this to reflect "200". How to obtain this information? I think the problem is that foo gets inlined into main, so the offline copy of foo is really never executed. Honza > > regards, > Raj > > > > > > > > Jan Hubicka <[EMAIL PROTECTED]> > 07/18/2005 10:29 AM > > To > Rajkishore Barik/India/[EMAIL PROTECTED] > cc > gcc@gcc.gnu.org > Subject > Re: -fprofile-arcs > > > > > > > > Hi, > > > > I am trying to profile the frequency of each basic block of > > SPEC 2000 benchmarks by compiling them using -fprofile-arcs and opt -O3. > > After running the benchmark, when I try to read "bb->count" while > > compiling > > using "-fbranch-probabilities and -O3", I get "0" values for basic > blocks > > which were known to execute for sure. Any clue as to where I am missing? > > Is "bb->count" is the right way to get the dynamic frequency in the > second > > pass? > > It is. You would need to provide more information for me to figure out > what is going wrong. Of course the bb->count is initialized only after > profile is read in that at current mainline is pretty late in > compilation queue (after bp RTL pass), but this is going to change > hopefully at monday. > > Honza > > > > regards, > > Raj >
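One cheap way to confirm this explanation, using only the standard GCC function attribute (whether it matches the reporter's setup is an assumption): keep foo out of line, so the offline copy is the one that actually executes and its counters reflect the 200 iterations.

/* Add before the definition of foo in a.c: prevents inlining into
   main, so bb->count of foo's loop body should read 200 in the
   -fbranch-probabilities compile.  */
void foo (int i, int *a, int *p) __attribute__ ((noinline));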
Re: -fprofile-generate and -fprofile-use
> On Wed, Jul 20, 2005 at 10:45:01AM -0700, girish vaitheeswaran wrote: > > > --- Steven Bosscher <[EMAIL PROTECTED]> wrote: > > > > > > > On Wednesday 20 July 2005 18:53, girish vaitheeswaran wrote: > > > > > I am seeing a 20% slowdown with feedback optimization. > > > > > Does anyone have any thoughts on this. > > > > > > > > My first thought is that you should probably first > > > > tell what compiler > > > > you are using. > > > > I am using gcc 3.4.3 > > -girish > > Which platform? I've seen slower code for profile-directed optimizations > on powerpc64-linux with GCC 4.0 and mainline. It's a bug, but I haven't > looked into it enough to provide a small test case for a problem report. Actually, I would be very interested in seeing testcases such as those (and Girish's slowdown too, if possible). In general, some slowdowns in corner cases are probably unavoidable, but both 3.4.3 and 4.0 seem to show pretty consistent improvements with profiling, at least for SPEC and i386, which I test pretty regularly. Such slowdowns usually indicate problems like an incorrectly updated profile, or a profile read in incorrectly because of a mismatch between the CFGs of the profile and feedback runs, which are rather difficult to notice and hunt down... Honza > > Janis
Re: -fprofile-generate and -fprofile-use
> I started with a clean slate in my build environment > and did not have any residual files hanging around. > Are the steps I have indicated in my earlier email > correct. Is there a way I can break down the problem > into a smaller sub-set of flags and eliminate the flag > causing the performance problem. What I mean is since > -fprofile-generate and -fprofile-use enable a bunch of > flags, would it make sense to avoid profiling and try > out some of the individual flags on a trial and error > basis. If so what would be the flags to start the It would be probably better to just turn off the individual optimizations with -fprofile-use (for optimizations that are implied by this flag there should be no need to re-profile each time). If you can find particular optimization that gets out of control, it would be lot easier to fix it... Honza > trials with. > > -girish > > --- Jan Hubicka <[EMAIL PROTECTED]> wrote: > > > > On Wed, Jul 20, 2005 at 10:45:01AM -0700, girish > > vaitheeswaran wrote: > > > > > --- Steven Bosscher <[EMAIL PROTECTED]> wrote: > > > > > > > > > > > On Wednesday 20 July 2005 18:53, girish > > vaitheeswaran wrote: > > > > > > > I am seeing a 20% slowdown with feedback > > optimization. > > > > > > > Does anyone have any thoughts on this. > > > > > > > > > > > > My first thought is that you should probably > > first > > > > > > tell what compiler > > > > > > you are using. > > > > > > > > I am using gcc 3.4.3 > > > > -girish > > > > > > Which platform? I've seen slower code for > > profile-directed optimizations > > > on powerpc64-linux with GCC 4.0 and mainline. > > It's a bug, but I haven't > > > looked into it enough to provide a small test case > > for a problem report. > > > > Actually I would be very interested in seeing > > testcases such as those. > > (and the Girish' slowdown too if possible). In > > general some slowdowns > > in side corners are probably unavoidable but both > > 3.4.3 and 4.0 seems to > > have pretty consistent improvements with profiling > > at least for SPEC and > > i386 I am testing pretty regularly. > > Such slodowns usually indicate problems like > > incorrectly updated profile > > or incorrectly readed in profile because of > > missmatch in CFGs in between > > profile and feedback run that are rather dificult to > > notice and hunt > > down... > > > > Honza > > > > > > Janis > >
Re: -fprofile-generate and -fprofile-use
> I have done quite a few experiments with this to > narrow down the problem. The performance numbers are > slower compared to *No Feedback optimization with just > -O3* Here are some of them. All the experiments were > done on a new build-area in order to eliminate effects > of old feedback files. > > 1. I built the app using -O3 and -fprofile-generate to > generate the feedback data. I then ran the workload > and then recompiled the app using -O3 and > -fprofile-use [app was 20% slower] > > 2. I built the app using -O3 and -fprofile-generate to > generate the feedback data. I then ran the workload > and then recompiled the app using -O3 and > -fprofile-use -fno-vpt -fno-unroll-loops > -fno-peel-loops -fno-tracer (Which is turn off all the > flags used by -fprofile-use) [App was still 20% > slower] > > 3. I have tried selectively turning of some of the > other flags in the above list as well, but the > performance regression persists. > > 4. I tried with the older flags namely -fprofile-arcs > and -fbranch-probabilities still no help. So it looks like the slowdown is caused by one of the profile based optimizations that are enabled by default (basic block reordering or register allocation). If you are getting such a noticable slodown, it probably means that your app has pretty small inner loop. Can you just look into assembly generated for it with and without profiling and try to spot what is gong wrong? Thanks, Honza > > Can someone help me out on how to proceed with this. > > Thanks > -girish > > > --- Jan Hubicka <[EMAIL PROTECTED]> wrote: > > > > I started with a clean slate in my build > > environment > > > and did not have any residual files hanging > > around. > > > Are the steps I have indicated in my earlier email > > > correct. Is there a way I can break down the > > problem > > > into a smaller sub-set of flags and eliminate the > > flag > > > causing the performance problem. What I mean is > > since > > > -fprofile-generate and -fprofile-use enable a > > bunch of > > > flags, would it make sense to avoid profiling and > > try > > > out some of the individual flags on a trial and > > error > > > basis. If so what would be the flags to start the > > It would be probably better to just turn off the > > individual > > optimizations with -fprofile-use (for optimizations > > that are implied by > > this flag there should be no need to re-profile each > > time). > > If you can find particular optimization that gets > > out of control, it > > would be lot easier to fix it... > > > > Honza > > > trials with. > > > > > > -girish > > > > > > --- Jan Hubicka <[EMAIL PROTECTED]> wrote: > > > > > > > > On Wed, Jul 20, 2005 at 10:45:01AM -0700, > > girish > > > > vaitheeswaran wrote: > > > > > > > --- Steven Bosscher <[EMAIL PROTECTED]> > > wrote: > > > > > > > > > > > > > > > On Wednesday 20 July 2005 18:53, girish > > > > vaitheeswaran wrote: > > > > > > > > > I am seeing a 20% slowdown with > > feedback > > > > optimization. > > > > > > > > > Does anyone have any thoughts on this. > > > > > > > > > > > > > > > > My first thought is that you should > > probably > > > > first > > > > > > > > tell what compiler > > > > > > > > you are using. > > > > > > > > > > > > I am using gcc 3.4.3 > > > > > > -girish > > > > > > > > > > Which platform? I've seen slower code for > > > > profile-directed optimizations > > > > > on powerpc64-linux with GCC 4.0 and mainline. > > > > It's a bug, but I haven't > > > > > looked into it enough to provide a small test > > case > > > > for a problem report. 
> > > > > > > > Actually I would be very interested in seeing > > > > testcases such as those. > > > > (and the Girish' slowdown too if possible). In > > > > general some slowdowns > > > > in side corners are probably unavoidable but > > both > > > > 3.4.3 and 4.0 seems to > > > > have pretty consistent improvements with > > profiling > > > > at least for SPEC and > > > > i386 I am testing pretty regularly. > > > > Such slodowns usually indicate problems like > > > > incorrectly updated profile > > > > or incorrectly readed in profile because of > > > > missmatch in CFGs in between > > > > profile and feedback run that are rather > > dificult to > > > > notice and hunt > > > > down... > > > > > > > > Honza > > > > > > > > > > Janis > > > > > >
Re: -fprofile-generate and -fprofile-use
> Jan, Hi, > That's going to be rather difficult given that the app > has over 1000 files. Is there a way I can turn off the > "default" options one at a time ? This is unforutnately not possible :( The optimizations used either profile feedback or profile guessed by GCC itself. It looks like for your case the profile guessed by GCC (even if departed from reality) causes GCC to produce better code than the real profile (or that the real profile got missread, but there are some sanity checks for this so this is quite unlikely). It seems to me that only way to proceed from here is some profiling. The way I usually look into such problems is to produce oprofile of both versions of code and then compare the times spent in individual functions then it is sometimes possible to identify the offending code more easilly Honza > Thx > -girish > > --- Jan Hubicka <[EMAIL PROTECTED]> wrote: > > > > I have done quite a few experiments with this to > > > narrow down the problem. The performance numbers > > are > > > slower compared to *No Feedback optimization with > > just > > > -O3* Here are some of them. All the experiments > > were > > > done on a new build-area in order to eliminate > > effects > > > of old feedback files. > > > > > > 1. I built the app using -O3 and > > -fprofile-generate to > > > generate the feedback data. I then ran the > > workload > > > and then recompiled the app using -O3 and > > > -fprofile-use [app was 20% slower] > > > > > > 2. I built the app using -O3 and > > -fprofile-generate to > > > generate the feedback data. I then ran the > > workload > > > and then recompiled the app using -O3 and > > > -fprofile-use -fno-vpt -fno-unroll-loops > > > -fno-peel-loops -fno-tracer (Which is turn off all > > the > > > flags used by -fprofile-use) [App was still 20% > > > slower] > > > > > > 3. I have tried selectively turning of some of the > > > other flags in the above list as well, but the > > > performance regression persists. > > > > > > 4. I tried with the older flags namely > > -fprofile-arcs > > > and -fbranch-probabilities still no help. > > > > So it looks like the slowdown is caused by one of > > the profile based > > optimizations that are enabled by default (basic > > block reordering or > > register allocation). If you are getting such a > > noticable slodown, it > > probably means that your app has pretty small inner > > loop. Can you just > > look into assembly generated for it with and without > > profiling and try > > to spot what is gong wrong? > > > > Thanks, > > Honza > > > > > > Can someone help me out on how to proceed with > > this. > > > > > > Thanks > > > -girish > > > > > > > > > --- Jan Hubicka <[EMAIL PROTECTED]> wrote: > > > > > > > > I started with a clean slate in my build > > > > environment > > > > > and did not have any residual files hanging > > > > around. > > > > > Are the steps I have indicated in my earlier > > email > > > > > correct. Is there a way I can break down the > > > > problem > > > > > into a smaller sub-set of flags and eliminate > > the > > > > flag > > > > > causing the performance problem. What I mean > > is > > > > since > > > > > -fprofile-generate and -fprofile-use enable a > > > > bunch of > > > > > flags, would it make sense to avoid profiling > > and > > > > try > > > > > out some of the individual flags on a trial > > and > > > > error > > > > > basis. 
If so what would be the flags to start > > the > > > > It would be probably better to just turn off the > > > > individual > > > > optimizations with -fprofile-use (for > > optimizations > > > > that are implied by > > > > this flag there should be no need to re-profile > > each > > > > time). > > > > If you can find particular optimization that > > gets > > > > out of control, it > > > > would be lot easier to fix it... > > > > > > > > Honza > > > > > trials with. > > > > > > > > > > -girish > > > > > > > > > > --- Jan Hubicka <[EMAIL PROTECT
IPA branch
Hi, I created the IPA branch yesterday and re-directed the current SPEC testers running the tree-profiling branch (now officially retired ;) to it ( http://www.suse.de/~aj/SPEC/amd64 ). The branch should be used for interprocedural optimization projects that have a serious chance to get into 4.2 (or perhaps longer-term plans, as long as they won't interfere with merging changes to 4.1). I plan to implement the SSA-based inlining on the branch and will start pushing patches probably sometime next week. I would especially welcome frontend fixes for the multiple-decls problems so we can compile SPEC in whole-program mode without ugly hacks ;)) Please use IPA in the subject line if you don't want me to miss patches for the branch. Honza
Re: IPA branch
> Hi, > I've branches the IPA branch yesterday and re-directed current SPEC > testers running tree-profiling branch (now officially retired ;) to it. > ( http://www.suse.de/~aj/SPEC/amd64 ). > The branch should be used for interprocedural optimization projects that > has serious chance to get into 4.2 (or perhaps longer term plans as > long as they won't interfere with merging changes to 4.1). > > I plan to implement the SSA based inlining to the branch and will start > pushing patches probably sometime next week. > > I would especially welcome frotend fixes for the multiple decls problems > so we can compile SPEC in whole program mode wihtout ugly hacks ;)) > Please use IPA in the subject line if you don't want me to miss the > patches to branch. Forgot to mention, the tag is ipa-branch ;) Honza > > Honza
Re: IPA branch
> On Thursday 04 August 2005 19:12, Jan Hubicka wrote: > > > Hi, > > > I've branches the IPA branch yesterday and re-directed current SPEC > > > testers running tree-profiling branch (now officially retired ;) to it. > > > ( http://www.suse.de/~aj/SPEC/amd64 ). > > > The branch should be used for interprocedural optimization projects that > > > has serious chance to get into 4.2 (or perhaps longer term plans as > > > long as they won't interfere with merging changes to 4.1). > > > > > > I plan to implement the SSA based inlining to the branch and will start > > > pushing patches probably sometime next week. > > > > > > I would especially welcome frotend fixes for the multiple decls problems > > > so we can compile SPEC in whole program mode wihtout ugly hacks ;)) > > > Please use IPA in the subject line if you don't want me to miss the > > > patches to branch. > > > > Forgot to mention, the tag is ipa-branch ;) > > I guess the web pages should be updated with something like the attached? This looks fine to me. Thanks! Perhaps even cvs.html should mention that tree-profiling was almost fully merged and retired? Honza > > Gr. > Steven > > Index: cvs.html > === > RCS file: /cvs/gcc/wwwdocs/htdocs/cvs.html,v > retrieving revision 1.198 > diff -u -3 -p -r1.198 cvs.html > --- cvs.html 28 Jul 2005 21:09:10 - 1.198 > +++ cvs.html 4 Aug 2005 18:30:25 - > @@ -145,12 +145,10 @@ generally of the form "gcc-X_Y > > > > - tree-profiling-branch > - This branch is for the development of profiling heuristics > - and profile based optimizations for trees, such as profile driven inline > - heuristics. Another goal of this branch is to demonstrate that maintaining > - the CFG and profile information over expanding from GIMPLE trees to RTL > - is feasible and can bring considerable performance improvements. > + ipa-branch > + This is a branch for the development interprocedural optimizations > + such as inlining and cloning, interprocedural alias analysis, and so on. > + This branch is being maintained by Jan Hubicka > >struct-reorg-branch >This branch is for the development of structure reorganization > @@ -476,6 +474,13 @@ be prefixed with the initials of the dis >of the passes. It has now been merged into mainline for the >4.1 release. > > + tree-profiling-branch > + This branch is for the development of profiling heuristics > + and profile based optimizations for trees, such as profile driven inline > + heuristics. Another goal of this branch is to demonstrate that maintaining > + the CFG and profile information over expanding from GIMPLE trees to RTL > + is feasible and can bring considerable performance improvements. > + >bje-unsw-branch >This branch was dedicated to some research work by Ben Elliston > at the University of New South Wales (UNSW) on transformation > Index: projects/tree-profiling.html > === > RCS file: /cvs/gcc/wwwdocs/htdocs/projects/tree-profiling.html,v > retrieving revision 1.2 > diff -u -3 -p -r1.2 tree-profiling.html > --- projects/tree-profiling.html 4 Jul 2004 21:00:54 - 1.2 > +++ projects/tree-profiling.html 4 Aug 2005 18:30:25 - > @@ -6,6 +6,10 @@ > > Improving GCC's Interprocedural Optimizaion Infrastructure > > +This page describes the work done in 2005 on the > +tree-profiling-branch. This branch is now retired. Work > +in the same area (IPA) has continued on the ipa-branch > + > This page describes ongoing work to improve GCC's infrastructure > for tree-based interprocedural optimizers. The work is done on a > branch in GCC's CVS repository called tree-profiling-branch.
Re: IPA branch
> On Saturday 06 August 2005 08:14, Andrew Pinski wrote: > > On Aug 5, 2005, at 9:24 PM, Canqun Yang wrote: > > > Hi, > > > > > > Patch from Michael Matz > > > (http://gcc.gnu.org/ml/fortran/2005-07/msg00331.html) may partly fix > > > the multiple decls problems. > > > > That will only help with the fortran problem, > That is the only problem we have to care about, isn't it? The C++ > front-end folks can fix their own front end. There is also a problem with the C frontend and K&R-style declarations. Either we refuse functions that are K&R-declared in one unit and properly declared in another unit, or we create duplicated decls... Both options are quite wrong :( Honza > > Gr. > Steven
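A minimal illustration of the C front-end issue mentioned above (file and function names made up): the same function seen K&R-style in one unit and prototyped in another, which leaves whole-program mode with two decls it cannot merge cleanly.

/* unit1.c -- K&R-style declaration: no parameter information, so the
   float argument is passed with the default promotion to double.  */
int f ();
int call_f (void) { return f (1.0f); }

/* unit2.c -- prototyped definition: float is not promoted here, so
   this type is incompatible with the K&R view in unit1.c.  */
int f (float x) { return (int) x; }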
Re: Your patch to skip local statics
> > Hi Jan, > > Your patch to mainline > > http://gcc.gnu.org/ml/gcc-cvs/2005-06/msg00388.html > > to defer handling of local statics has caused a regression > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22034 > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22583 > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=23045 > > hindering development. Please could you look into it as soon as possible? Sorry for that. I thought these problems were fixed, but apparently just one of the dupes was. I will look into it today. Honza > > Thanks, > > -- Gaby
Re: GCC-4.0.2 20050811: should GCC consider inlining functions in between different sections?
> On Aug 12, 2005, at 5:05 AM, Etienne Lorrain wrote: > > I have added a command to the linker file to forbid reference from > > one section to another: > > NOCROSSREFS (.text .xcode); > > It sounds like this feature isn't compatible with inlining, -fno-inline I suspect is one of the few ways to `fix' it in general, that, -fno-inline still won't suppress inlining of functions called once... Honza > or require that all functions be marked noinline that would otherwise > be wrong to inline. > > > The question is in fact: what is a section for GCC? > > Anything a user wants. We don't limit the uses, so this question > cannot, in general be answered, at best a few users could tell you > how a few of them are using sections. For example, I would not > expect most people to know about NOCROSSREFS.
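A sketch of the second suggestion above, with made-up section and function names: pin the functions that live in the other section and mark them noinline explicitly, instead of relying on -fno-inline.

/* Hypothetical example: the helper is forced to stay an out-of-line
   call in .xcode, so no code is ever inlined across the NOCROSSREFS
   boundary.  Both attributes are standard GCC; the names are made up.  */
void __attribute__ ((noinline, section (".xcode")))
xcode_helper (void)
{
  /* ... code that must stay in .xcode ... */
}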
Re: Inlining vs the stack
> > "Mike" == Mike Stump <[EMAIL PROTECTED]> writes: > > Mike> On Aug 12, 2005, at 10:39 AM, Dale Johannesen wrote: > >> We had a situation come up here where things are like this > >> (simplified, obviously): > >> > >> c() { char x[100]; } > > Mike> I think we should turn off inlining for functions > 100k stack > Mike> size. (Or maybe 500k, if you want). > > Why should stack size be a consideration? Code size I understand, but > stack size doesn't seem to matter. The problem is that the function you inline into starts consuming a lot of stack space even when it didn't previously. This showed up in the past in glibc where inlining some subfunction of printf caused printf to kill some threading library testsuite tests and linux kernel. Unforutnately the actual size we want to limit is somewhat variable (for kernel it is pretty small, in usual case few hounderd K is probably good choice). So I guess we can add command line flag for this. Other problem is how to realistically estimate stack consumption in current tree representation, but perhaps just summing size of all temporaries would work Honza > > paul >
Re: -fprofile-generate and -fprofile-use
> > There was some discussion a few weeks ago about some apps running slower > with FDO enabled. > > I've recently investigated a similar situation using mainline. In my case, > the fact that the loop_optimize pass is disabled during FDO was the cause > of the slowdown. It appears that was recently disabled as part of Jan > Hubicka's patch to eliminate RTL based profiling. The commentary indicates > that the old loop optimizer is incompatible with tree profiling. > > While this doesn't explain all of the degradations discussed (some were > showing up on older versions of the compiler), it may explain some. Do you have a specific testcase? It would be interesting to see whether the new optimizer can catch up, at least on the kill-loop branch. Thanks for investigating! Honza > > Pete
Re: -fprofile-generate and -fprofile-use
> > you may try adding the -fmove-loop-invariants flag, which enables the new > > invariant motion pass. > > That cleaned up both my simplified test case, and the code it > originated from. It also cleaned up a few other cases where I > was noticing worse performance with FDO enabled. Thanks!! > > Perhaps this option should be enabled by default when doing FDO > to replace the loop invariant motions done by the recently > disabled loop optimize pass. This sounds like a sane idea. Zdenek, is -fmove-loop-invariants dangerous in some way, or is it just disabled by default because the old loop optimizer does the same thing? Honza > > Pete
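For reference, the kind of loop the new pass is expected to handle (names made up): the multiplication below is invariant and should be hoisted out of the loop once -fmove-loop-invariants is in effect, roughly what the disabled loop_optimize pass used to do.

/* t * 4 does not change inside the loop; invariant motion should
   compute it once before the loop.  Example made up for illustration.  */
void
fill (int *a, int n, int t)
{
  int i;

  for (i = 0; i < n; i++)
    a[i] = t * 4;
}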
Re: Questionable code in fixup_reorder_chain
> Hi Jan, > > I think fixup_reorder_chain contains questionable code to cope with a > pathological case: > > /* The degenerated case of conditional jump jumping to the next > instruction can happen on target having jumps with side > effects. > > Create temporarily the duplicated edge representing branch. > It will get unidentified by force_nonfallthru_and_redirect > that would otherwise get confused by fallthru edge not pointing > to the next basic block. */ > if (!e_taken) > { > rtx note; > edge e_fake; > bool redirected; > > e_fake = unchecked_make_edge (bb, e_fall->dest, 0); > > redirected = redirect_jump (BB_END (bb), > block_label (bb), 0); > gcc_assert (redirected); > > Note the call to redirect_jump that creates a loop. It is responsible for > the > ICE on the attached Ada testcase with the 3.4.5pre compiler at -O3 because > the > edge and the jump disagree on the target. Duh. The code is indeed quite an ugly side case. > > The final patch: http://gcc.gnu.org/ml/gcc-cvs/2003-03/msg01294.html > The original version: http://gcc.gnu.org/ml/gcc-patches/2003-03/msg02097.html > > Am I right in thinking that the call to redirect_jump must be removed? I actually believe there was a reason for creating the loop (i.e. redirecting the edge to something other than the fallthru edge destination), as otherwise we screwed up in force_nonfallthru_and_redirect, and this function (called via force_nonfallthru) is supposed to redirect the jump back to the proper destination. This is a remarkably ugly hack :( Would it be possible to at least see the RTL and the precise ICE this code is causing? Honza