[Bug tree-optimization/17863] [4.0/4.1/4.2/4.3 Regression] performance loss (not inlining as much??)

hubicka at gcc dot gnu dot org Wed, 30 Jan 2008 09:59:37 -0800


------- Comment #35 from hubicka at gcc dot gnu dot org  2008-01-30 17:58 -------
Here is a more thorough analysis.  The testcase is quite challenging for the
inlining heuristics, and by introducing early inlining and reducing the call
cost we now inline less than we did at the time I claimed that we inline
everything.  However, making everything inlined again still does not solve
the problem.


For inline decisions, the problematic bit seems to be accu1 and friends.  They
are templates built from simpler instantiations of the same form.  For n=0:
double accu1(const double*, const double*) [with int n = 0] (p1, p2)
{
  double D.4655;
  double D.4654;
  double D.4653;

<bb 2>:
  D.4654_2 = *p1_1(D);
  D.4655_4 = *p2_3(D);
  D.4653_5 = D.4654_2 * D.4655_4;
  return D.4653_5;

}

With larger n we simply copy the body a few times:
double accu1(const double*, const double*) [with int n = 1] (p1, p2)
{
  double D.17506;
  double D.17507;
  double D.17505;
  double D.17505;
  double d;
  double D.6664;
  double D.6663;
  double D.6662;

<bb 2>:
  D.6662_2 = *p1_1(D);
  D.6663_4 = *p2_3(D);
  d_5 = D.6662_2 * D.6663_4;
  p2_6 = p2_3(D) + 8;
  p1_7 = p1_1(D) + 8;
  D.17506_11 = *p1_7;
  D.17507_12 = *p2_6;
  D.17505_13 = D.17506_11 * D.17507_12;
  D.6664_9 = d_5 + D.17505_13;
  return D.6664_9;

}
The early inliner handles this well until the function grows too large; that
happens at n=4, and for n=5 we end up not inlining:
double accu1(const double*, const double*) [with int n = 5] (p1, p2)
{
  double d;
  double D.6697;
  double D.6696;
  double D.6695;
  double D.6694;

<bb 2>:
  D.6694_2 = *p1_1(D);
  D.6695_4 = *p2_3(D);
  d_5 = D.6694_2 * D.6695_4;
  p2_6 = p2_3(D) + 8;
  p1_7 = p1_1(D) + 8;
  D.6697_8 = accu1 (p1_7, p2_6);
  D.6696_9 = D.6697_8 + d_5;
  return D.6696_9;

}
This is as expected: for n=4 the code is definitely longer than the call
sequence, with 4 FP multiplies, 4 adds and 8 loads.  I don't think a simple
heuristic can reasonably expect it to simplify.
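
A source-level sketch consistent with the dumps above.  This is an assumption:
the actual testcase source is not quoted in this comment, so the explicit
`inline`, the specialization form and the exact statement order are guesses
reconstructed from the GIMPLE.

```cpp
// Hypothetical reconstruction, not the real testcase source.
// accu1<n> multiplies the first pair of elements and recurses on the rest,
// so accu1<n> covers n+1 element pairs.
template <int n>
inline double accu1(const double* p1, const double* p2)
{
    double d = *p1 * *p2;                       // d_5 in the dumps
    return accu1<n - 1>(p1 + 1, p2 + 1) + d;    // the call the inliner must expand
}

// Base case: a single multiply, matching the n = 0 dump.
template <>
inline double accu1<0>(const double* p1, const double* p2)
{
    return *p1 * *p2;
}
```

Each level adds two loads, one multiply and one add, which is why the fully
expanded body grows linearly with n while the call sequence stays constant.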

We inline these functions later, in late inlining, as expected, but since
there are just too many calls to them we eventually hit the large-function
and large-unit growth limits.
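
To illustrate why the number of call sites matters, here is a toy model of a
growth budget.  It is only a sketch: the struct, the function name and the
cost arithmetic are hypothetical and are not GCC's actual accounting.

```cpp
// Toy model only: NOT GCC's real algorithm or cost metric.
struct ToyBudget {
    long unit_insns;      // hypothetical unit size before inlining
    long growth_percent;  // hypothetical allowed growth, in percent
};

// Returns how many of `call_sites` calls fit into the growth budget when
// each inlined copy adds `growth_per_call` instructions (callee body size
// minus the call sequence it replaces).
long inlinable_calls(const ToyBudget& b, long call_sites, long growth_per_call)
{
    if (growth_per_call <= 0)
        return call_sites;  // inlining does not grow the code at all
    long budget = b.unit_insns * b.growth_percent / 100;
    long fit = budget / growth_per_call;
    return fit < call_sites ? fit : call_sites;
}
```

In this model, a body that is only slightly larger than its call sequence can
still exhaust the budget once it is called from enough places.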

Now to get everything inlined one needs --param inline-call-cost=9999 --param
max-inline-insns-single=999999 (the second is needed for DCubic::DCubic, which
is just big IMO).

Now with this:
[EMAIL PROTECTED]:/aux/hubicka/trunk-write/buidl2$ time
/aux/hubicka/gcc-install/bin/g++  -O3 ttest.cc  -fpermissive --static
-march=athlon-xp  -Winline --param inline-call-cost=9999 --param
max-inline-insns-single=999999
ttest.cc: In function 'void testv4c()':
ttest.cc:21: warning: inlining failed in call to 'tcdata::tcdata()': --param
inline-unit-growth limit reached
ttest.cc:468: warning: called from here

real    1m0.934s
user    0m59.736s
sys     0m1.204s
[EMAIL PROTECTED]:/aux/hubicka/trunk-write/buidl2$ time ./a.out

real    0m7.055s
user    0m7.052s
sys     0m0.000s

We are still a long way from GCC 3.4 performance (5s, see my previous post).  I
suspect that aliasing simply gives up.  Setting inline-call-cost to 1 (the
other extreme) leads to 6.9s.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17863

[Bug tree-optimization/17863] [4.0/4.1/4.2/4.3 Regression] performance loss (not inlining as much??)
