Re: -fprofile-generate and -fprofile-use

2005-08-30 Thread Peter Steinmetz

There was some discussion a few weeks ago about some apps running slower
with FDO enabled.

I've recently investigated a similar situation using mainline.  In my case,
the slowdown was caused by the loop_optimize pass being disabled during FDO.
It appears the pass was recently disabled as part of Jan Hubicka's patch to
eliminate RTL-based profiling.  The commentary indicates that the old loop
optimizer is incompatible with tree profiling.

While this doesn't explain all of the degradations discussed (some were
showing up on older versions of the compiler), it may explain some.

Pete



Re: -fprofile-generate and -fprofile-use

2005-08-30 Thread Peter Steinmetz
>A more likely source of performance degradation is that loop unrolling
>is enabled when profiling, and loop unrolling is almost always a bad
>pessimization on 32 bits x86 targets.

To clarify, I was compiling with -funroll-loops and -fpeel-loops
enabled in both cases.

The FDO slowdown in my case was caused by loop-invariant code that, in the
non-FDO case, the loop optimizer pass removes from the loop.

I'm running on powerpc-linux.

Pete



Re: -fprofile-generate and -fprofile-use

2005-08-30 Thread Peter Steinmetz
> Do you have specific testcase?  It would be interesting to see if new
> optimizer can catch up at least on kill-loop branch.

Here is a simplified version of what I observed.  In the non-FDO case, the
loop-invariant load of the constant 32 is removed from the loop.  When FDO
is enabled, the load remains in the loop.

float farray[100];

int main (int argc, char *argv[])
{
  int m;

  for (m = 0; m < 100; m++)
    {
      farray[m] = 32;
    }

  return 0;
}

I'm compiling it as follows using a version of gcc built from
mainline yesterday.

Non-FDO:
gcc -O3 -funroll-loops -fpeel-loops -o test test.c

FDO:
gcc -O3 -funroll-loops -fpeel-loops -fprofile-generate -o test test.c
./test
gcc -O3 -funroll-loops -fpeel-loops -fprofile-use -o test test.c

Pete



Re: -fprofile-generate and -fprofile-use

2005-08-31 Thread Peter Steinmetz
>you may try adding -fmove-loop-invariants flag, which enables new
>invariant motion pass.

That cleaned up both my simplified test case and the code it originated
from.  It also cleaned up a few other cases where I was noticing worse
performance with FDO enabled.  Thanks!!
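
For reference, the FDO sequence that now performs well is the same as
before, with the new flag added:

gcc -O3 -funroll-loops -fpeel-loops -fmove-loop-invariants -fprofile-generate -o test test.c
./test
gcc -O3 -funroll-loops -fpeel-loops -fmove-loop-invariants -fprofile-use -o test test.c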

Perhaps this option should be enabled by default when doing FDO, to replace
the loop-invariant motion done by the recently disabled loop_optimize pass.

Pete



Question on syntax in machine description files.

2005-09-20 Thread Peter Steinmetz

Quick question on syntax in md files, as I'm not finding documentation that
explains it.  If I see the following in an instruction definition:

(set_attr "type" "*")

What does * represent in this context as the value to assign to "type"?
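
For reference, the sort of pattern I'm looking at is roughly the following
(the pattern itself is made up, for illustration only):

(define_insn "*example_move"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (match_operand:SI 1 "register_operand" "r"))]
  ""
  "mr %0,%1"
  [(set_attr "type" "*")])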

Thanks.

Pete



Question on bad counters.

2005-09-30 Thread Peter Steinmetz

I'm not entirely sure how gcc's CFG structure all fits together yet, so
I'll ask for some input on this one:

While looking through some dumps from a compile using -fprofile-use, I
noticed the following in the "jump" dump file:

Basic block 164 prev 163, next -2, loop_depth 0, count 1672, freq 1482.
Predecessors:  163 [100.0%]  count:836 (fallthru)
Successors:  EXIT [100.0%]  count:1672 (fallthru)
Invalid sum of incoming frequencies 741, should be 1482
Invalid sum of incoming counts 836, should be 1672

I decided to try to figure out why the sum of incoming frequencies was
invalid.  I've found where the mistake is introduced, and I think I know
what needs to be done to fix it, but perhaps someone can confirm or suggest
an alternative.



Here are the last few blocks dumped just before
cfgbuild.find_bb_boundaries.  The counts look good at this point (note
especially block 162).

Basic block 153 prev 152, next 154, loop_depth 0, count 836, freq 741.
Predecessors:  0 [27.8%]  count:232 12 151 [100.0%]  count:604 152 [100.0%]
(fallthru)
Successors:  162 [100.0%]  count:836

Basic block 154 prev 153, next 155, loop_depth 1, count 195, freq 173,
probably never executed.
Predecessors:  72 [8.1%]  count:36 74 [39.0%]  count:159 73 159 [100.0%]
Successors:  78 [66.7%]  count:130 155 [33.3%]  count:65 (fallthru)

Basic block 155 prev 154, next 162, loop_depth 0, count 65, freq 58,
probably never executed.
Predecessors:  154 [33.3%]  count:65 (fallthru)
Successors:  79 [100.0%]  count:65

Basic block 162 prev 155, next -2, loop_depth 0, count 836, freq 741.
Predecessors:  153 [100.0%]  count:836
Successors:  EXIT [100.0%]  count:836 (fallthru)


--------------------------------------------------
After cfgbuild.find_bb_boundaries, there are a few new blocks, and block
162 has been disconnected from the graph.  Block 162 has no predecessors,
yet its count remains at 836 and its frequency at 741, both of which are
now invalid.  I believe blocks 163 and 164 were created while splitting
block 162, and thus inherited the same count and frequency.


Basic block 153 prev 152, next 154, loop_depth 0, count 836, freq 741.
Predecessors:  0 [27.8%]  count:232 12 151 [100.0%]  count:604 152 [100.0%]
(fallthru)
Successors:

Basic block 154 prev 153, next 155, loop_depth 1, count 195, freq 173,
probably never executed.
Predecessors:  72 [8.1%]  count:36 74 [39.0%]  count:159 73 159 [100.0%]
Successors:  78 [66.7%]  count:130 155 [33.3%]  count:65 (fallthru)

Basic block 155 prev 154, next 162, loop_depth 0, count 65, freq 58,
probably never executed.
Predecessors:  154 [33.3%]  count:65 (fallthru)
Successors:  79 [100.0%]  count:65

Basic block 162 prev 155, next 163, loop_depth 0, count 836, freq 741.
Predecessors:
Successors:
Invalid sum of incoming frequencies 0, should be 741
Invalid sum of incoming counts 0, should be 836

Basic block 163 prev 162, next 164, loop_depth 0, count 836, freq 741.
Predecessors:
Successors:
Invalid sum of incoming frequencies 0, should be 741
Invalid sum of incoming counts 0, should be 836

Basic block 164 prev 163, next -2, loop_depth 0, count 836, freq 741.
Predecessors:
Successors:  EXIT [100.0%]  count:836 (fallthru)
Invalid sum of incoming frequencies 0, should be 741
Invalid sum of incoming counts 0, should be 836


--------------------------------------------------
Later on, the CFG has apparently been rebuilt.  Some of the blocks with bad
counts were reattached to the graph and their counts propagated.  The
result is the invalid block count I initially observed.  Note here that the
count and frequency for block 164 are exactly double what they should be.

Basic block 159 prev 158, next 160, loop_depth 0, count 836, freq 741.
Predecessors:  1 [27.8%]  count:232 13 157 [100.0%]  count:604 158 [100.0%]
(fallthru)
Successors:  163 [100.0%]  count:836

Basic block 160 prev 159, next 161, loop_depth 1, count 195, freq 173,
probably never executed.
Predecessors:  75 [8.1%]  count:36 77 [39.0%]  count:159 76 79 [100.0%]
Successors:  82 [66.7%]  count:130 161 [33.3%]  count:65 (fallthru)

Basic block 161 prev 160, next 163, loop_depth 0, count 65, freq 58,
probably never executed.
Predecessors:  160 [33.3%]  count:65 (fallthru)
Successors:  83 [100.0%]  count:65

Basic block 163 prev 161, next 164, loop_depth 0, count 836, freq 741.
Predecessors:  159 [100.0%]  count:836
Successors:  164 [100.0%]  count:836 (fallthru)

Basic block 164 prev 163, next -2, loop_depth 0, count 1672, freq 1482.
Predecessors:  163 [100.0%]  count:836 (fallthru)
Successors:  EXIT [100.0%]  count:1672 (fallthru)
Invalid sum of incoming frequencies 741, should be 1482
Invalid sum of incoming counts 836, should be 1672




The problem appears to be in the function cfgrtl.purge_dead_edges, which is
called by find_bb_boundaries.  There are a couple of cases where
"remove_edge" is called to remove an edge from the graph.  In the above
example, the edge being removed is the only predecessor edge of the
destination block (162).  The edge is removed, but the count and frequency
of the now orphaned destination block are left unchanged, so the stale
values survive and get propagated when the CFG is later rebuilt.
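
A minimal sketch of the kind of fix I have in mind, to be applied at the
remove_edge call sites in purge_dead_edges (the exact placement is my
guess):

/* Hypothetical sketch: before removing the sole predecessor edge of
   a block, clear the destination's profile data so the stale count
   and frequency cannot be propagated when the CFG is rebuilt.  */
if (EDGE_COUNT (e->dest->preds) == 1)
  {
    e->dest->count = 0;
    e->dest->frequency = 0;
  }
remove_edge (e);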

Re: Question about updating CFG block counters in purge_dead_edges [was "Question on bad counters."]

2005-09-30 Thread Peter Steinmetz
Added a better subject line.  Pete.


[EMAIL PROTECTED] wrote on 09/30/2005 11:03:59 AM:

> [previous message quoted in full; trimmed here, see the message above]

Declaration of a guard function for use on define_bypass

2006-01-26 Thread Peter Steinmetz

I'm using store_data_bypass_p from recog.c as the guard for a define_bypass
within a machine description.  I'm seeing the following warning/error that
I'd like to clean up.

cc1: warnings being treated as errors
insn-automata.c: In function 'internal_insn_latency':
insn-automata.c:53265: warning: implicit declaration of function
'store_data_bypass_p'
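
For context, the bypass definition in question looks roughly like this
(the reservation names are made up):

(define_bypass 1 "my_store" "my_load" "store_data_bypass_p")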

Anybody know what needs to be done to get recog.h included in the code
generated by genautomata (either directly or indirectly) so as to eliminate
the implicit declaration?

Thanks!

Pete



Generalize ready list sorting via heuristics in rank_for_schedule.

2006-01-31 Thread Peter Steinmetz

I've been looking a bit at how haifa-sched.c sorts the ready list and think
there may be some room for added flexibility and/or improvement.  I'll
throw out a few ideas for discussion.

Currently, within the ready_sort macro in haifa-sched.c, the call to qsort
is passed "rank_for_schedule" to help it decide which of two instructions
should be placed further towards the front of the ready list.
Rank_for_schedule uses a set of ordered heuristics (rank, priority, etc.)
to make this decision.  The set of heuristics is fixed for all target
machines.

There can be cases, however, where a target machine may want to define
heuristics driven by specific characteristics of that machine.  Those
heuristics may be meaningless on other targets.

Likewise, a given target machine may prefer a different order in which to
check the heuristics, possibly even varying that order based on which pass
of the scheduler is running.

I'd be interested in seeing "rank_for_schedule" converted into a scheduler
target hook.  Target machines would then have the flexibility to define
further heuristics for determining sort order.  The default for an
undefined hook would be to use the current algorithm.
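
A rough sketch of what such a hook member might look like (the name and
signature are my guesses, modeled on the existing sched hooks):

/* Hypothetical new member for the scheduler target hooks: a
   qsort-style comparator over two insns in the ready list.  A target
   returning 0 would fall back to the generic rank_for_schedule
   ordering.  */
int (* rank_for_schedule) (rtx insn1, rtx insn2);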

One could go a step further and break each heuristic out into a separate
function.  This would allow target machines to specify to the scheduler a
list of heuristics to apply and the order in which to apply them.  A
target machine could also define its own heuristic functions and include
them in the heuristic ordering for that target.  In addition, a different
set of heuristics, or a different ordering, could be applied based on which
pass of the scheduler is running.
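
One possible shape for such a per-target heuristic list (all names here
are hypothetical):

/* Hypothetical: each heuristic is a qsort-style comparator; the
   first one to return a nonzero value decides the order of the two
   insns.  */
typedef int (*sched_heuristic_fn) (rtx insn1, rtx insn2);

static sched_heuristic_fn sched_heuristics[] =
{
  compare_rank,       /* existing rank heuristic */
  compare_priority,   /* existing priority heuristic */
  compare_target_foo, /* target-specific heuristic */
};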

I'd like to start experimenting with this, but would appreciate any
comments or suggestions from others who may be familiar with this code.

Thanks!

Pete.