https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99788

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-03-26
            Version|unknown                     |11.0
          Component|ipa                         |tree-optimization
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  The issue is that at -O3 we inline e().  Inside e() itself we
eliminate the call to foo () since the preceding for() loop does not terminate
(CCP figures this out), but in the inline copy the loop header PHI is not yet
simplified at the point CCP runs (and CCP doesn't run again later):

  <bb 3> [local count: 43379093]:
  a = 1;
  a.3_4 = a;

  <bb 4> [local count: 350976297]:
  # a.3_3 = PHI <a.3_5(4), a.3_4(3)>
  a.2_6 = (unsigned char) a.3_3;
  _7 = a.2_6 + 2;
  _8 = (char) _7;
  a = _8;
  a.3_5 = a;
  if (a.3_5 != 0)
    goto <bb 4>; [87.64%]
  else
    goto <bb 5>; [12.36%]

  <bb 5> [local count: 43379093]:
  foo ();

vs.

  <bb 3> [local count: 955630225]:
  # a.3_22 = PHI <_3(3), 1(2)>
  a.2_1 = (unsigned char) a.3_22;
  _2 = a.2_1 + 2;
  _3 = (char) _2;
  a = _3;
  if (_3 != 0)
    goto <bb 3>; [89.00%]
  else
    goto <bb 4>; [11.00%]

  <bb 4> [local count: 118111600]:
  foo ();

and the difference starts with loop header copying, which is applied to the
out-of-line copy of the loop but not to the inline copy.

Analyzing loop 1
Loop 1 is not do-while loop: latch is not empty.
    Will duplicate bb 4
  Not duplicating bb 3: it is single succ.
Duplicating header of the loop 1 up to edge 4->3, 3 insns.
Loop 1 is do-while loop
Loop 1 is now do-while loop.

vs.

Analyzing loop 1
Analyzing loop 2
Loop 2 is not do-while loop: latch is not empty.
  Not duplicating bb 5: optimizing for size.

where the decision to optimize for size is made because this is main().
Renaming main() to baz() fixes the issue.

But I wonder why we inline e() into cold main at all.  Honza?

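For readers following the dumps, the reason the loop cannot terminate can be checked by hand: the body adds 2 modulo 256 to a value that starts at 1, so it stays odd and never becomes 0.  The following is only a sketch, not the testcase from the report; the helper name steps_until_zero and its max_steps bound are made up for illustration, and it assumes an 8-bit char.

```c
#include <assert.h>

/* Hypothetical helper, not taken from the bug report.  It replays the
   loop body from the GIMPLE dump, a = (char) ((unsigned char) a + 2),
   at most max_steps times.  Returns the number of steps needed to
   reach 0, or -1 if 0 was not reached within max_steps.
   Assumes an 8-bit char, so the addition wraps modulo 256.  */
int
steps_until_zero (signed char start, int max_steps)
{
  assert (max_steps >= 0);
  /* Work in unsigned char: the zero test is the same as on the
     converted-back char value, and wraparound is well defined.  */
  unsigned char a = (unsigned char) start;
  for (int i = 1; i <= max_steps; i++)
    {
      a = (unsigned char) (a + 2);
      if (a == 0)
        return i;
    }
  return -1;
}
```

With these assumptions, steps_until_zero (1, 1000) never finds a zero, while an even start value such as 2 does reach 0 after wrapping around once.  That matches the a = 1 initialization in the dumps: once the initial value is known, the exit test can be folded away.
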
I see

Processing frequency f/9
  Called by main/11 that is normal or hot
t.c:24:3: note: Inlining f/9 to main/11 with frequency 1.00

so here main() is normal or hot, but loop header copying sees
optimize_loop_for_size_p () == true!?

IPA inlining sees

Considering d/10 with 20 size
 to be inlined into main/11 in t.c:17
 Estimated badness is -0.000046, frequency 0.00.
    Badness calculation for main/11 -> d/10
      size growth 16, time 8428.908463 unspec 8428.908463
      -0.000011: guessed profile. frequency 0.000400, count -1 caller count -1 time saved 0.004400 overall growth -4 (current) -4 (original) -4 (compensated)
      Adjusted by hints -0.000046
Updated mod-ref summary for main/11
  loads:
    Limits: 32 bases, 16 refs
    Every base
  stores:
    Limits: 32 bases, 16 refs
Accounting size:17.00, time:2.97 on predicate exec:(true)
Processing frequency d/10
  Called by main/11 that is executed once
Processing frequency e/13
  Called by d/10 that is executed once
Node e/13 promoted to executed once.
Accounting size:-2.00, time:-0.00 on predicate exec:(true)
Accounting size:1.00, time:0.40 on predicate exec:(true)
t.c:17:5: optimized:  Inlined d/10 into main/11 which now has time 8.370758 and size 24, net change of -4.

so something is off with how we process speed/size optimization.

Note that the loop copy in main also appears to become cold because it is
guarded by if (b), which is predicted as very cold:

  <bb 2> [local count: 1073741824]:
  b.0_2 = b;
  if (b.0_2 != 0)
    goto <bb 8>; [0.04%]
  else
    goto <bb 7>; [99.96%]

  <bb 8> [local count: 429496]:

  <bb 3> [local count: 43379093]:
  a = 1;
  goto <bb 5>; [100.00%]

  <bb 4> [local count: 350976297]:
  a.2_6 = (unsigned char) a.3_5;
  _7 = a.2_6 + 2;
  _8 = (char) _7;
  a = _8;

  <bb 5> [local count: 394355390]:
  a.3_5 = a;
  if (a.3_5 != 0)
    goto <bb 4>; [89.00%]
  else
    goto <bb 6>; [11.00%]

Still, when the function is not called main(), the optimize_loop_for_size_p ()
predicate does not evaluate to true (with the exact same local profile as
above!).
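
The last dump shows the loop still in test-at-top form (bb 3 jumps to the test in bb 5), which is the shape loop header copying would normally rewrite.  As a source-level sketch of that transformation, assuming a simple counted loop rather than the loop from the report (the function names below are made up for illustration):

```c
#include <assert.h>

/* Test-at-top form: the exit test sits in the loop header and is
   evaluated before every iteration, including the first.  */
int
count_while (int n)
{
  int iters = 0;
  int i = 0;
  while (i < n)
    {
      iters++;
      i++;
    }
  return iters;
}

/* Shape after header copying: one copy of the test guards loop entry,
   and the loop itself becomes a do-while with the test at the bottom.  */
int
count_copied (int n)
{
  int iters = 0;
  int i = 0;
  if (i < n)
    {
      do
        {
          iters++;
          i++;
        }
      while (i < n);
    }
  return iters;
}

/* Hypothetical check that both forms run the same number of iterations.  */
void
check_equivalent (int n)
{
  assert (count_while (n) == count_copied (n));
}
```

The guard copy is where a known initial value becomes useful: when the first test can be decided at compile time, later passes see a plain do-while loop.  The duplicated test costs code size, which would be consistent with the "Not duplicating bb 5: optimizing for size" message.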