------- Additional Comments From rguenth at tat dot physik dot uni-tuebingen dot de 2004-12-07 15:09 -------
Subject: Re: [4.0 Regression] Inlining limits cause 340% performance regression

On 7 Dec 2004, hubicka at ucw dot cz wrote:

> > Yes, it seems so.  Really nice improvement.  Though profiling is
> > sloooooow.  I guess you avoid doing any CFG changing transformation
> > for the profiling stage?  I.e. not even inline the simplest functions?
>
> I can inline but only after actually instrumenting the functions.  That
> should minimize the costs, but I also noticed that tramp3d is
> surprisingly a lot slower with profiling.
>
> > That would be the reason the Intel compiler is unusable with profiling
> > for me.  -fprofile-generate comes with a 50fold increase in runtime!
>
> -fprofile-generate is actually a package of
> -fprofile-arcs/-fprofile-values + -fprofile-values-transformations
> It might be interesting to figure out whether -fprofile-arcs itself
> brings a similar slowdown.  The only reason I can think of why this can
> happen is that after instrumenting we again inline a lot less, or we
> produce too many redundant counters.  Perhaps it would make sense to
> think about inlining functions reducing code size before instrumenting
> as we would do that anyway, but it will be tricky to get gcov output and
> -f* flags independence right then.

Hm.  There are a lot of counters - maybe it is possible to merge the
counters themselves?  The resulting asm of tramp3d-v3 consists of 30%
addl/adcl lines for adding the profiling counts - where the total
number of lines is just wc -l of a -S -fverbose-asm compilation.
That is a lot.  And the additions are in a cache-unfriendly sequence,
too - dunno which optimization pass could improve this though.

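For readers wondering why every counter update costs two instructions:
the profile counters are 64 bits wide, and a 32-bit x86 target has to
increment them as a low-word add followed by an add-with-carry into the
high word.  The following is a minimal C sketch of that pair, not GCC's
actual instrumentation code; struct counter64 and the function names
are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one 64-bit profile counter, split into the
   two 32-bit halves that a 32-bit x86 target updates separately. */
struct counter64 {
    uint32_t lo;   /* low word: target of "addl $1, ..." */
    uint32_t hi;   /* high word: target of "adcl $0, ...+4" */
};

/* Increment by one the way the addl/adcl pair does: add to the low
   word, then propagate the carry into the high word. */
void counter_bump(struct counter64 *c)
{
    uint32_t old_lo = c->lo;
    c->lo = old_lo + 1u;            /* addl $1 */
    c->hi += (c->lo < old_lo);      /* adcl $0: adds the carry, if any */
}

/* Reassemble the 64-bit value from its halves. */
uint64_t counter_value(const struct counter64 *c)
{
    return ((uint64_t)c->hi << 32) | c->lo;
}

/* Self-check: the two-step update agrees with a plain 64-bit
   increment, including when the low word wraps around. */
void counter_self_check(void)
{
    struct counter64 c = { 0xFFFFFFFFu, 0u };
    uint64_t expected = counter_value(&c) + 1u;
    counter_bump(&c);
    assert(counter_value(&c) == expected);
}
```

Each instrumented edge pays for one such pair, which is how the
addl/adcl lines add up to such a large share of the output.
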
Consider

  static inline void foo() {}
  void bar() { foo(); }

which for -O2 -fprofile-generate produces

bar:
        addl    $1, .LPBX1
        pushl   %ebp
        movl    %esp, %ebp
        adcl    $0, .LPBX1+4
        addl    $1, .LPBX1+16
        popl    %ebp
        adcl    $0, .LPBX1+20
        addl    $1, .LPBX1+8
        adcl    $0, .LPBX1+12
        ret

that should be

bar:
        addl    $1, .LPBX1
        pushl   %ebp
        movl    %esp, %ebp
        adcl    $0, .LPBX1+4
        addl    $1, .LPBX1+8
        adcl    $0, .LPBX1+12
        addl    $1, .LPBX1+16
        adcl    $0, .LPBX1+20
        popl    %ebp
        ret

And of course all three counters could be merged.  But that would
need a changed gcov file format somehow representing a callgraph with
merged edges.

The Intel compiler is so much worse here because all the counter
adding is done thread-safe in a library (i.e. they have an extra call
for every edge and do not do any inlining).

> How does our profiling performance compare to ICC?

ICC is a lot worse.  ICC with -prof_gen causes a 10000-fold slowdown
(if the current snapshot of icc doesn't segfault compiling the tramp3d
testcase) - ICC is completely unusable for me.  So - GCC is great!

> > > It would be nice to experiment with this a little - in general the
> > > heuristics can be viewed as having three players.  There are the limits
> > > (specified via --param) that it must obey, there is the cost model
> > > (estimated growth for inlining into all callees without profiling and
> > > the execute_count to estimated growth for inlining to one call with
> > > profiling) and the bin packing algorithm optimizing the gains while
> > > obeying the limits.
> > >
> > > With profiling in, the cost model is pretty much realistic and it would
> > > be nice to figure out how the performance behaves when the individual
> > > limits are changed and why.  If you have some time for experimentation,
> > > it would be very useful.  I am trying to do the same with SPEC and GCC
> > > but I have difficulty playing with pooma or Gerald's application as I
> > > have little understanding of what is going on there.  I will try it
> > > myself next but any feedback can be very useful here.
> >
> > I can produce some numbers for the tramp testcase.
> Thanks!  Note that when changing the flags you should not need to
> re-profile now so you can save quite a lot of time.

Ah, that's indeed nice.

Richard.

--
Richard Guenther <richard dot guenther at uni-tuebingen dot de>
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
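
As an illustration of the counter merging suggested for bar() earlier
in this comment: when several counters sit on the same straight-line
path they always hold the same value, so the hot path could bump one
shared counter and the per-edge values could be filled in when the
profile is written out.  This is only a sketch of the idea under that
assumption, not something GCC or the gcov format does; N_EDGES and all
function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical: the three counters that bar() bumps in the example. */
enum { N_EDGES = 3 };

/* Unmerged: one 64-bit increment per counter on every call. */
void bar_unmerged(uint64_t counts[N_EDGES])
{
    counts[0] += 1;
    counts[1] += 1;
    counts[2] += 1;
}

/* Merged: a single 64-bit increment on the hot path. */
void bar_merged(uint64_t *shared)
{
    *shared += 1;
}

/* Expand the shared counter back into per-edge counts, as a profile
   writer would have to do before emitting the usual per-edge data. */
void expand_merged(uint64_t shared, uint64_t counts[N_EDGES])
{
    for (int i = 0; i < N_EDGES; i++)
        counts[i] = shared;
}
```

The expansion step is the part that needs extra information in the
profile data: something has to record which edges share a counter.
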