------- Comment #35 from hubicka at gcc dot gnu dot org 2008-01-30 17:58 -------

So, for a more proper analysis: the testcase is quite challenging for the inlining heuristics, and by introducing early inlining and reducing the call cost we now inline less than we did at the time I claimed that we inline everything. However, getting everything inlined again still does not solve the problem.
For inline decisions, the problematic bit seems to be accu1 and friends. They are templates that use simpler templates of the same form. For n=1:

double accu1(const double*, const double*) [with int n = 0] (p1, p2)
{
  double D.4655;
  double D.4654;
  double D.4653;

<bb 2>:
  D.4654_2 = *p1_1(D);
  D.4655_4 = *p2_3(D);
  D.4653_5 = D.4654_2 * D.4655_4;
  return D.4653_5;

}

With n>1 we simply copy the body a few times:

double accu1(const double*, const double*) [with int n = 1] (p1, p2)
{
  double D.17506;
  double D.17507;
  double D.17505;
  double D.17505;
  double d;
  double D.6664;
  double D.6663;
  double D.6662;

<bb 2>:
  D.6662_2 = *p1_1(D);
  D.6663_4 = *p2_3(D);
  d_5 = D.6662_2 * D.6663_4;
  p2_6 = p2_3(D) + 8;
  p1_7 = p1_1(D) + 8;
  D.17506_11 = *p1_7;
  D.17507_12 = *p2_6;
  D.17505_13 = D.17506_11 * D.17507_12;
  D.6664_9 = d_5 + D.17505_13;
  return D.6664_9;

}

The early inliner handles this well until the function grows too large, which happens at n=4, and for n=5 we end up not inlining:

double accu1(const double*, const double*) [with int n = 5] (p1, p2)
{
  double d;
  double D.6697;
  double D.6696;
  double D.6695;
  double D.6694;

<bb 2>:
  D.6694_2 = *p1_1(D);
  D.6695_4 = *p2_3(D);
  d_5 = D.6694_2 * D.6695_4;
  p2_6 = p2_3(D) + 8;
  p1_7 = p1_1(D) + 8;
  D.6697_8 = accu1 (p1_7, p2_6);
  D.6696_9 = D.6697_8 + d_5;
  return D.6696_9;

}

This is as expected: for n=4 the code is definitely longer than the call sequence, with 4 FP multiplies, 4 adds and 8 loads, and I don't think a simple heuristic can reasonably expect it to simplify. We inline these functions later, in late inlining, as expected, but since there are just too many calls to them we eventually hit the large-function and large-unit limits.

Now, to get everything inlined one needs --param inline-call-cost=9999 --param max-inline-insns-single=999999 (the second is needed for DCubic::DCubic, which is just big IMO).
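For readers without the testcase at hand, the shape of accu1 can be read back from the dumps above. The following is only a sketch reconstructed from those dumps, not the source from ttest.cc: the parameter names follow the dump, the `inline` keyword and the exact form of the base case are assumptions, and the real testcase may differ.

```cpp
// Sketch reconstructed from the GIMPLE dumps; NOT the actual testcase source.
// accu1<n> multiplies the first pair of elements and recurses on the rest,
// so accu1<n> covers n + 1 element pairs.
template <int n>
inline double accu1(const double* p1, const double* p2)
{
    double d = *p1 * *p2;                     // d_5 in the dump
    // "p1 + 8" in the dump is a byte offset, i.e. one double further on.
    return accu1<n - 1>(p1 + 1, p2 + 1) + d;  // the call seen in the n = 5 dump
}

// Base case, matching the "[with int n = 0]" dump: a single multiply.
template <>
inline double accu1<0>(const double* p1, const double* p2)
{
    return *p1 * *p2;
}
```

Each instance contains exactly one call to the next smaller instance, so the body of accu1<n> grows linearly with n once those calls are inlined, which is why a size-based heuristic stops at some depth.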
Now with this:

[EMAIL PROTECTED]:/aux/hubicka/trunk-write/buidl2$ time /aux/hubicka/gcc-install/bin/g++ -O3 ttest.cc -fpermissive --static -march=athlon-xp -Winline --param inline-call-cost=9999 --param max-inline-insns-single=999999
ttest.cc: In function 'void testv4c()':
ttest.cc:21: warning: inlining failed in call to 'tcdata::tcdata()': --param inline-unit-growth limit reached
ttest.cc:468: warning: called from here

real    1m0.934s
user    0m59.736s
sys     0m1.204s
[EMAIL PROTECTED]:/aux/hubicka/trunk-write/buidl2$ time ./a.out

real    0m7.055s
user    0m7.052s
sys     0m0.000s

We still have a long way to go to reach GCC 3.4 performance (5s, see my previous post). I suspect that alias analysis simply gives up. Setting inline-call-cost to 1 (the other extreme) leads to 6.9s.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17863