On Thu, Feb 4, 2016 at 9:10 PM, Carlos Pita <[email protected]> wrote:
> PS 2 (last one, I swear): I've isolated what I think is the root of
> the problem. When einline expands g, there is plenty of call sites for
> f.call, so the full redundancy elimination pass replaces sum for
> f.call, making things easy for the late ipa inliner. But when g is not
> early inlined, there is only one call site for the global f.call and
> another one for the local/literal f.call, so the fre pass just lets
> them be. This is innocuous from the fre point of view, but disables
> further inlining as described above.
>
> On Thu, Feb 4, 2016 at 3:05 PM, Carlos Pita <[email protected]> wrote:
>> PS: I framed the issue between the inline_param and fixup_cfg passes
>> because I was only looking at the tree passes, but the really relevant
>> passes are tree-einline (obviously) and ipa-inline, which happens
>> between tree-inline_param2 and tree-fixup_cfg2. So, restating the
>> problem: if early inline is not happening, late inline will miss the
>> chance to inline the reference from the static const struct member.
>>
>> On Thu, Feb 4, 2016 at 1:08 PM, Carlos Pita <[email protected]> wrote:
>>> Hi all,
>>>
>>> I've been trying to understand some bizarre interaction between
>>> optimizing passes I've observed while compiling a heavily nested
>>> inlined numerical code of mine. I managed to reduce the issue down to
>>> this simple code:
>>>
>>> ``` test.c
>>>
>>> typedef struct F {
>>> int (*call)(int);
>>> } F;
>>>
>>> static int g(F f, int x) {
>>> x = f.call(x);
>>> x = f.call(x);
>>> x = f.call(x);
>>> x = f.call(x);
>>> x = f.call(x);
>>> x = f.call(x);
>>> x = f.call(x);
>>> x = f.call(x);
>>> return x;
>>> }
>>>
>>> static int sq(int x) {
>>> return x * x;
>>> }
>>>
>>> static const F f = {sq};
>>>
>>> void dosomething(int);
>>>
>>> int h(int x) {
>>> dosomething(g(f, x));
>>> dosomething(g((F){sq}, x));
>>> }
>>>
>>> ```
>>>
>>> Here we have a driver function h calling the workhorse g which
>>> delegates some simple task to the inline-wannabe f. The distinctive
>>> aspect of the above scheme is that f is referenced from a struct
>>> member. The first call to g passes a static const struct while the
>>> second call passes a compound literal (alternatively, a local version
>>> of the struct will have the same effect regarding what follows).
>>>
>>> Now, say I compile this code with:
>>>
>>> gcc -O3 -fdump-tree-all --param early-inlining-insns=10 -c test.c
>>>
>>> The einline pass will not be able to inline calls to g with such a low
>>> value for early-inlining-insns.
>>>
>>> The inline_param2 pass still shows:
>>>
>>> ```
>>> h (int x)
>>> {
>>> struct F D.1847;
>>> int _4;
>>> int _8;
>>>
>>> <bb 2>:
>>> _4 = g (f, x_2(D));
>>> dosomething (_4);
>>> D.1847.call = sq;
>>> _8 = g (D.1847, x_2(D));
>>> dosomething (_8);
>>> return;
>>>
>>> }
>>>
>>> ```
>>>
>>> The next tree pass is fixup_cfg4, which does the inline but just for
>>> the second all to g:
>>>
>>> ```
>>> h (int x)
>>> {
>>> ....
>>>
>>> <bb 2>:
>>> f = f;
>>> f$call_7 = MEM[(struct F *)&f];
>>> x_19 = f$call_7 (x_2(D));
>>> x_20 = f$call_7 (x_19);
>>> x_21 = f$call_7 (x_20);
>>> x_22 = f$call_7 (x_21);
>>> x_23 = f$call_7 (x_22);
>>> x_24 = f$call_7 (x_23);
>>> x_25 = f$call_7 (x_24);
>>> x_26 = f$call_7 (x_25);
>>> _43 = x_26;
>>> _4 = _43;
>>> dosomething (_4);
>>> D.1847.call = sq;
>>> f = D.1847;
>>> f$call_10 = MEM[(struct F *)&f];
>>> _33 = x_2(D) * x_2(D);
>>> _45 = _33;
>>> x_11 = _45;
>>> _32 = x_11 * x_11;
>>> _46 = _32;
>>> x_12 = _46;
>>> _31 = x_12 * x_12;
>>> _47 = _31;
>>> x_13 = _47;
>>> _30 = x_13 * x_13;
>>> _48 = _30;
>>> x_14 = _48;
>>> _29 = x_14 * x_14;
>>> _49 = _29;
>>> x_15 = _49;
>>> _28 = x_15 * x_15;
>>> _50 = _28;
>>> x_16 = _50;
>>> _27 = x_16 * x_16;
>>> _51 = _27;
>>> x_17 = _51;
>>> _3 = x_17 * x_17;
>>> _52 = _3;
>>> x_18 = _52;
>>> _53 = x_18;
>>> _8 = _53;
>>> dosomething (_8);
>>> return;
>>>
>>> }
>>> ```
>>>
>>> Now, say I recompile the code with a larger early-inlining-insns, so
>>> that einline is able to early inline both calls to g:
>>>
>>> gcc -O3 -fdump-tree-all --param early-inlining-insns=50 -c test.c
>>>
>>> After inline_param2 (that is, before fixup_cfg4), we now have:
>>>
>>> ```
>>> h (int x)
>>> {
>>> int x;
>>> int x;
>>>
>>> <bb 2>:
>>> x_13 = sq (x_2(D));
>>> x_14 = sq (x_13);
>>> x_15 = sq (x_14);
>>> x_16 = sq (x_15);
>>> x_17 = sq (x_16);
>>> x_18 = sq (x_17);
>>> x_19 = sq (x_18);
>>> x_20 = sq (x_19);
>>> dosomething (x_20);
>>> x_5 = sq (x_2(D));
>>> x_6 = sq (x_5);
>>> x_7 = sq (x_6);
>>> x_8 = sq (x_7);
>>> x_9 = sq (x_8);
>>> x_10 = sq (x_9);
>>> x_11 = sq (x_10);
>>> x_12 = sq (x_11);
>>> dosomething (x_12);
>>> return;
>>>
>>> }
>>> ```
>>>
>>> And fixup_cfg4 is able to do its job for both calls:
>>>
>>> ```
>>> h (int x)
>>> {
>>> ....
>>>
>>> <bb 2>:
>>> _36 = x_2(D) * x_2(D);
>>> _37 = _36;
>>> x_13 = _37;
>>> _35 = x_13 * x_13;
>>> _38 = _35;
>>> x_14 = _38;
>>> _34 = x_14 * x_14;
>>> _39 = _34;
>>> x_15 = _39;
>>> _33 = x_15 * x_15;
>>> _40 = _33;
>>> x_16 = _40;
>>> _32 = x_16 * x_16;
>>> _41 = _32;
>>> x_17 = _41;
>>> _31 = x_17 * x_17;
>>> _42 = _31;
>>> x_18 = _42;
>>> _30 = x_18 * x_18;
>>> _43 = _30;
>>> x_19 = _43;
>>> _29 = x_19 * x_19;
>>> _44 = _29;
>>> x_20 = _44;
>>> dosomething (x_20);
>>> _28 = x_2(D) * x_2(D);
>>> _45 = _28;
>>> x_5 = _45;
>>> _27 = x_5 * x_5;
>>> _46 = _27;
>>> x_6 = _46;
>>> _26 = x_6 * x_6;
>>> _47 = _26;
>>> x_7 = _47;
>>> _25 = x_7 * x_7;
>>> _48 = _25;
>>> x_8 = _48;
>>> _24 = x_8 * x_8;
>>> _49 = _24;
>>> x_9 = _49;
>>> _23 = x_9 * x_9;
>>> _50 = _23;
>>> x_10 = _50;
>>> _22 = x_10 * x_10;
>>> _51 = _22;
>>> x_11 = _51;
>>> _21 = x_11 * x_11;
>>> _52 = _21;
>>> x_12 = _52;
>>> dosomething (x_12);
>>> return;
>>>
>>> }
>>> ```
>>>
>>> The bottom line is that I get full inlining if einline manages to
>>> early inline both g calls, but I get incomplete inlining otherwise. I
>>> guess the problem is that fixup_cfg4 is not able to infer that
>>> f$call_7 is just sq in disguise when f is the global static const
>>> struct but it is able to get it when it's a local or literal one. In
>>> case einline expands the code early the successive passes will make
>>> fixup_cfg4 see just sq in both cases, making inlining of sq a trivial
>>> matter. But if einline hits its hard limits, fixup_cfg4 will have to
>>> figure out that f$call is sq by itself.
>>>
>>> I'm not sure whether this should be considered a proper bug or more of
>>> a quirk of the inlining system one must learn to live with. In the
>>> first case, I'll report it if you ask me to do it. In the second case,
>>> I would like to ask for some advice about the best way to cope with
>>> this scenario (besides blindly incrementing early-inlining-insns); I
>>> can provide more background regarding my real use case if necessary.
I think the real missed optimization is that for the passed aggregate D.1772
D.1772.call = sq;
_8 = g (D.1772, x_2(D));
IPA CP cannot figure out a jump-function with the sq function as
value. It _can_
for the call passing f:
callsite h/3 -> g/0 :
param 0: UNKNOWN
Aggregate passed by value:
offset: 0, cst: sq
Unknown alignment
param 1: PASS THROUGH: 0, op nop_expr
Unknown alignment
callsite h/3 -> dosomething/4 :
callsite h/3 -> g/0 :
param 0: UNKNOWN
Unknown alignment
param 1: PASS THROUGH: 0, op nop_expr
Unknown alignment
not sure how hard to fix that is, but it can't be too difficult. Martin?
You may want to open a bugreport.
Richard.
>>> Cheers
>>> --
>>> Carlos