------- Comment #6 from amylaar at gcc dot gnu dot org 2009-01-10 16:10 -------
(In reply to comment #5)
> Joern, re. comment #4, Richi refers to my patch to enable PRE at -Os, see
> [1].
> An extension to this patch that we tested on x86 machines, is to disable PRE
> for scalar integer registers, via SMALL_REGISTER_CLASSES.  I changed
> SMALL_REGISTER_CLASSES into a target hook for this purpose, see [2].  You could
> play with this, see if you can use this to cure your problem...
This is not a problem of having inserted more expressions.  The expression
count actually goes down: 7 add expressions are eliminated.  The problem is
that this comes at the cost of adding 127 phi nodes, which are not free.  It's
a cascade of 64/32/16/8/4/2/1 phi nodes that are used to compute the 7 chained
adds through all the possible paths through the 6 consecutive if-blocks, and
this requires 64/32/16/8/4/2/1 reg-reg copies to implement inside these
if-blocks, plus 64 unconditional constant loads at the start.

Requiring 64 + x registers for this one computation alone does cause register
allocation trouble for ARCompact, but it is in good company here, as lots of
other RISC architectures also have no more than 32 general purpose registers.
And even if you did this for a processor with lots of registers, like the
i960, requiring 64 constant loads in the unconditional path and then having
127 conditional reg-reg copies is certainly worse than having 7 conditional
add operations.

I think the problem is that the algorithm, like many SSA algorithms, has no
idea of the run time cost of a phi node, and uses the lower bound of 0 as an
approximation.  As you make the algorithms more sophisticated at approximating
the cost minimum of a flawed cost function, you will pessimize more and more
code.

Is there a way to determine when replacing one expression causes the number of
phi nodes in a dominator to increase?  I would think that this criterion would
be a possibly useful indicator of register pressure and instruction count
increase.

I haven't looked at what the SSA PRE does exactly, but from the code
transformation performed I would guess that it sees one expression which is
fed by a directed acyclic graph (DAG) of phi nodes with constant leaves, and
figures it only needs to use a replacement DAG of phi nodes where the
expression is evaluated for each constant, assuming that the original DAG will
be mostly dead then.  However, what happens in the testcase is that the
original DAG is still fully live, the replacement DAG is added in parallel,
and thus the number of phi nodes from the original DAG has been doubled.

For the testcase (and for fbital, if you have access to it), it is interesting
to observe that the problem would not arise if PRE had a notion of which
expressions are better left for conditional execution.  Unfortunately, on many
architectures, conditional execution requires two-address (wrt the ALU)
operations, which cannot be expressed in SSA.
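
To make the structure concrete, here is a minimal sketch of the kind of code
shape I mean.  This is only an assumed reduction, not the actual testcase
attached to this PR; the function and parameter names are made up.

/* Assumed shape: a running sum extended by chained conditional adds of
   constants through consecutive if-blocks, then used in one final add.  */
int
chained_adds (int c0, int c1, int c2, int c3, int c4, int c5, int x)
{
  int sum = 0;

  /* Each if-block conditionally extends the add chain by one constant.  */
  if (c0) sum += 1;
  if (c1) sum += 2;
  if (c2) sum += 4;
  if (c3) sum += 8;
  if (c4) sum += 16;
  if (c5) sum += 32;

  /* The final expression is fed by a phi DAG whose leaves are constants.  */
  return x + sum;
}

Left in this form, each if-block costs one conditional add plus the final one.
If PRE instead evaluates the final expression for every constant reachable
through the phi DAG and selects the result through a replacement DAG of phis,
you end up with the unconditional constant loads, the phi cascade and the
conditional reg-reg copies counted above.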