Re: -fprofile-generate and -fprofile-use
There was some discussion a few weeks ago about some apps running slower with FDO enabled. I've recently investigated a similar situation using mainline. In my case, the cause of the slowdown was that the loop_optimize pass is disabled during FDO. It appears the pass was recently disabled as part of Jan Hubicka's patch to eliminate RTL-based profiling; the commentary indicates that the old loop optimizer is incompatible with tree profiling. While this doesn't explain all of the degradations discussed (some were showing up on older versions of the compiler), it may explain some.

Pete
Re: -fprofile-generate and -fprofile-use
> A more likely source of performance degradation is that loop unrolling
> is enabled when profiling, and loop unrolling is almost always a bad
> pessimization on 32 bits x86 targets.

To clarify, I was compiling with -funroll-loops and -fpeel-loops enabled in both cases. The FDO slowdown in my case was caused by some loop-invariant code that the loop optimizer pass removed from the loop in the non-FDO case. I'm running on powerpc-linux.

Pete
Re: -fprofile-generate and -fprofile-use
> Do you have specific testcase? It would be interesting to see if new
> optimizer can catch up at least on kill-loop branch.

Here is a simplified version of what I observed. In the non-FDO case, the loop-invariant load of the constant 32 is removed from the loop. When FDO is enabled, the load remains in the loop.

float farray[100];

int
main (int argc, char *argv[])
{
  int m;

  for (m = 0; m < 100; m++)
    {
      farray[m] = 32;
    }
  return 0;
}

I'm compiling it as follows, using a version of gcc built from mainline yesterday.

Non-FDO:

gcc -O3 -funroll-loops -fpeel-loops -o test test.c

FDO:

gcc -O3 -funroll-loops -fpeel-loops -fprofile-generate -o test test.c
./test
gcc -O3 -funroll-loops -fpeel-loops -fprofile-use -o test test.c

Pete
Re: -fprofile-generate and -fprofile-use
> you may try adding -fmove-loop-invariants flag, which enables new
> invariant motion pass.

That cleaned up both my simplified test case and the code it originated from. It also cleaned up a few other cases where I was noticing worse performance with FDO enabled. Thanks!!

Perhaps this option should be enabled by default when doing FDO, to replace the loop-invariant motion done by the recently disabled loop_optimize pass.

Pete
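P.S. For the archives, here is the FDO recipe from my earlier test case with the new flag added (passing it to both the -fprofile-generate and -fprofile-use compiles):

gcc -O3 -funroll-loops -fpeel-loops -fmove-loop-invariants -fprofile-generate -o test test.c
./test
gcc -O3 -funroll-loops -fpeel-loops -fmove-loop-invariants -fprofile-use -o test test.c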
Question on syntax in machine description files.
Quick question on syntax in md files, as I'm not finding documentation that explains it. If I see the following on an instruction definition:

(set_attr "type" "*")

what does * represent in this context as the value to assign to "type"? Thanks.

Pete
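P.S. For context, the construct appears in patterns along these lines. This is a made-up pattern, not from a real port; the set_attr in the last line is the part I'm asking about:

(define_insn "*example_add"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (plus:SI (match_operand:SI 1 "register_operand" "r")
                 (match_operand:SI 2 "register_operand" "r")))]
  ""
  "add %0,%1,%2"
  [(set_attr "type" "*")])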
Question on bad counters.
I'm not entirely sure how gcc's CFG structure all fits together yet, so I'll ask for some input on this one:

While looking through some dumps from a compile using -fprofile-use, I noticed the following in the "jump" dump file:

Basic block 164 prev 163, next -2, loop_depth 0, count 1672, freq 1482.
Predecessors: 163 [100.0%] count:836 (fallthru)
Successors: EXIT [100.0%] count:1672 (fallthru)
Invalid sum of incoming frequencies 741, should be 1482
Invalid sum of incoming counts 836, should be 1672

I decided to try and figure out why the sum of incoming frequencies was invalid. I've found where the mistake is introduced, and I think I know what needs to be done to fix it, but perhaps someone can confirm or suggest an alternative.

Here are the last few blocks dumped just before cfgbuild.find_bb_boundaries. The counts look good at this point (note especially block 162).

Basic block 153 prev 152, next 154, loop_depth 0, count 836, freq 741.
Predecessors: 0 [27.8%] count:232 12 151 [100.0%] count:604 152 [100.0%] (fallthru)
Successors: 162 [100.0%] count:836

Basic block 154 prev 153, next 155, loop_depth 1, count 195, freq 173, probably never executed.
Predecessors: 72 [8.1%] count:36 74 [39.0%] count:159 73 159 [100.0%]
Successors: 78 [66.7%] count:130 155 [33.3%] count:65 (fallthru)

Basic block 155 prev 154, next 162, loop_depth 0, count 65, freq 58, probably never executed.
Predecessors: 154 [33.3%] count:65 (fallthru)
Successors: 79 [100.0%] count:65

Basic block 162 prev 155, next -2, loop_depth 0, count 836, freq 741.
Predecessors: 153 [100.0%] count:836
Successors: EXIT [100.0%] count:836 (fallthru)

-

After cfgbuild.find_bb_boundaries, there are a few new blocks, and block 162 has been disconnected from the graph. Block 162 has no predecessors, yet its count remains at 836 and its frequency at 741, which are invalid. I believe blocks 163 and 164 were created while splitting block 162, and thus inherited the same count and frequency.

Basic block 153 prev 152, next 154, loop_depth 0, count 836, freq 741.
Predecessors: 0 [27.8%] count:232 12 151 [100.0%] count:604 152 [100.0%] (fallthru)
Successors:

Basic block 154 prev 153, next 155, loop_depth 1, count 195, freq 173, probably never executed.
Predecessors: 72 [8.1%] count:36 74 [39.0%] count:159 73 159 [100.0%]
Successors: 78 [66.7%] count:130 155 [33.3%] count:65 (fallthru)

Basic block 155 prev 154, next 162, loop_depth 0, count 65, freq 58, probably never executed.
Predecessors: 154 [33.3%] count:65 (fallthru)
Successors: 79 [100.0%] count:65

Basic block 162 prev 155, next 163, loop_depth 0, count 836, freq 741.
Predecessors:
Successors:
Invalid sum of incoming frequencies 0, should be 741
Invalid sum of incoming counts 0, should be 836

Basic block 163 prev 162, next 164, loop_depth 0, count 836, freq 741.
Predecessors:
Successors:
Invalid sum of incoming frequencies 0, should be 741
Invalid sum of incoming counts 0, should be 836

Basic block 164 prev 163, next -2, loop_depth 0, count 836, freq 741.
Predecessors:
Successors: EXIT [100.0%] count:836 (fallthru)
Invalid sum of incoming frequencies 0, should be 741
Invalid sum of incoming counts 0, should be 836

-

Later on, the cfg has apparently been rebuilt. Some of the blocks with bad counts were reattached to the graph and their counts propagated. The result is the invalid block count I initially observed. Note here that the count and frequency for block 164 are exactly double what they should be.

Basic block 159 prev 158, next 160, loop_depth 0, count 836, freq 741.
Predecessors: 1 [27.8%] count:232 13 157 [100.0%] count:604 158 [100.0%] (fallthru)
Successors: 163 [100.0%] count:836

Basic block 160 prev 159, next 161, loop_depth 1, count 195, freq 173, probably never executed.
Predecessors: 75 [8.1%] count:36 77 [39.0%] count:159 76 79 [100.0%]
Successors: 82 [66.7%] count:130 161 [33.3%] count:65 (fallthru)

Basic block 161 prev 160, next 163, loop_depth 0, count 65, freq 58, probably never executed.
Predecessors: 160 [33.3%] count:65 (fallthru)
Successors: 83 [100.0%] count:65

Basic block 163 prev 161, next 164, loop_depth 0, count 836, freq 741.
Predecessors: 159 [100.0%] count:836
Successors: 164 [100.0%] count:836 (fallthru)

Basic block 164 prev 163, next -2, loop_depth 0, count 1672, freq 1482.
Predecessors: 163 [100.0%] count:836 (fallthru)
Successors: EXIT [100.0%] count:1672 (fallthru)
Invalid sum of incoming frequencies 741, should be 1482
Invalid sum of incoming counts 836, should be 1672

The problem appears to be in the function cfgrtl.purge_dead_edges, which is called by find_bb_boundaries. There are a couple of cases where "remove_edge" is called to remove an edge from the graph. In the above example, the edge being removed is the only predecessor edge of the destination block (162). The edge is removed, but the count and frequency of the now orphaned destination block are left unchanged, so they are inherited as stale values when the block is later split.
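To make that concrete, here is a sketch (not a tested patch) of the kind of fix I have in mind, assuming the current RTL CFG interfaces (edge, basic_block, remove_edge and EDGE_COUNT from basic-block.h); the wrapper name is made up:

static void
remove_edge_and_zero_orphan (edge e)
{
  basic_block dest = e->dest;

  remove_edge (e);

  /* If that was the last incoming edge, DEST is now unreachable and
     its profile data is stale; zero it so that blocks created when
     DEST is later split do not inherit the old count/frequency.  */
  if (EDGE_COUNT (dest->preds) == 0)
    {
      dest->count = 0;
      dest->frequency = 0;
    }
}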
Re: Question about updating CFG block counters in purge_dead_edges [was "Question on bad counters."]
Added a better subject line..

Pete.

[EMAIL PROTECTED] wrote on 09/30/2005 11:03:59 AM:

> I'm not entirely sure how gcc's CFG structure all fits together yet, so
> I'll ask for some input on this one:
>
> [snip -- full text quoted in the previous message]
Declaration of a guard function for use on define_bypass
I'm using store_data_bypass_p from recog.c as the guard for a define_bypass within a machine description. I'm seeing the following warning/error that I'd like to clean up:

cc1: warnings being treated as errors
insn-automata.c: In function 'internal_insn_latency':
insn-automata.c:53265: warning: implicit declaration of function 'store_data_bypass_p'

Does anybody know what needs to be done to get recog.h included in the code created by genautomata (either directly or indirectly) to eliminate the implicit declaration? Thanks!

Pete
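P.S. For context, the bypass looks roughly like this. The latency and the reservation names here are made up; only the guard function is real:

;; Apply a shorter latency from the load unit to a dependent store,
;; but only when the loaded value is the data being stored rather
;; than part of the address (as I understand it, that is what the
;; store_data_bypass_p guard in recog.c tests).
(define_bypass 2 "my_cpu_load" "my_cpu_store" "store_data_bypass_p")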
Generalize ready list sorting via heuristics in rank_for_schedule.
I've been looking a bit at how haifa-sched.c sorts the ready list, and I think there may be some room for added flexibility and/or improvement. I'll throw out a few ideas for discussion.

Currently, within the ready_sort macro in haifa-sched.c, the call to qsort is passed "rank_for_schedule" to help it decide which of two instructions should be placed further towards the front of the ready list. rank_for_schedule uses a set of ordered heuristics (rank, priority, etc.) to make this decision. The set of heuristics is fixed for all target machines.

There can be cases, however, where a target machine may want to define heuristics driven by specific characteristics of that machine; those heuristics may be meaningless on other targets. Likewise, a given target machine may prefer a different order in which to check the heuristics, possibly going as far as checking them in a different order based on which pass of the scheduler is running.

I'd be interested in seeing rank_for_schedule converted into a scheduler target hook (a sketch follows in the P.S. below). Target machines would then have the flexibility to define further heuristics for determining sort order. The default for an undefined hook would be to use the current algorithm.

One could go a step further and break each heuristic out into a separate function. This would allow target machines to specify to the scheduler a list of which heuristics to apply and the order in which to apply them. A target machine could also define its own heuristic functions and include them in the heuristic ordering for that target. In addition, a different set of heuristics, or a different ordering, could be applied based on which pass of the scheduler is running.

I'd like to start experimenting with this, but would appreciate any comments or suggestions from others who may be familiar with this code. Thanks!

Pete.
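P.S. To make the first idea concrete, here is a rough sketch of a comparison function a target might supply through such a hook. This is not existing GCC code: the function and the hook contract are hypothetical, and only INSN_PRIORITY (from sched-int.h) is real.

/* Hypothetical target comparison function, called qsort-style by
   ready_sort in place of the generic rank_for_schedule.  Returns
   a negative value to schedule INSN1 earlier, positive for INSN2.  */
static int
mytarget_rank_for_schedule (rtx insn1, rtx insn2)
{
  /* First heuristic: prefer the insn with the higher priority.  */
  if (INSN_PRIORITY (insn1) != INSN_PRIORITY (insn2))
    return INSN_PRIORITY (insn2) - INSN_PRIORITY (insn1);

  /* Target-specific tie-breakers would go here, e.g. grouping insns
     that dispatch to the same functional unit on this machine.  */
  return 0;
}

The scheduler would call the hook when the target defines it, and fall back to the current rank_for_schedule heuristics otherwise.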