------- Comment #6 from amylaar at gcc dot gnu dot org 2009-01-10 16:10 -------
(In reply to comment #5)
> Joern, re. comment #4, Richi refers to my patch to enable PRE at -Os, see
> [1].
> An extension to this patch that we tested on x86 machines, is to disable PRE
> for scalar integer registers, via SMALL_REGISTER_CLASSES.  I changed
> SMALL_REGISTER_CLASSES into a target hook for this purpose, see [2].  You could
> play with this, see if you can use this to cure your problem...
This is not a problem of having inserted more expressions.  The expression
count actually goes down: 7 add expressions are eliminated.  The problem is
that this comes at the cost of adding 127 phi nodes, which are not free.  It's
a cascade of 64/32/16/8/4/2/1 phi nodes that are used to compute the 7 chained
adds through all the possible paths through the 6 consecutive if-blocks, and
this requires 64/32/16/8/4/2/1 reg-reg copies to implement inside these
if-blocks, plus 64 unconditional constant loads at the start.

Requiring 64 + x registers for this one computation alone does cause register
allocation trouble for ARCompact, but it is in good company here, as lots of
other RISC architectures also have no more than 32 general purpose registers.
And even if you did this for a processor with lots of registers, like the
i960, requiring 64 constant loads in the unconditional path and then having
127 conditional reg-reg copies is certainly worse than having 7 conditional
add operations.

I think the problem is that the algorithm, like many SSA algorithms, has no
idea of the run time cost of a phi node, and uses the lower bound of 0 as an
approximation.  As you make the algorithms more sophisticated at approximating
the cost minimum of a flawed cost function, you will pessimize more and more
code.

Is there a way to determine when replacing one expression causes the number of
phi nodes in a dominator to increase?  I would think that this criterion would
be a possibly useful indicator of register pressure and instruction count
increase.

I haven't looked at what the SSA PRE does exactly, but from the code
transformation performed I would guess that it sees one expression which is
fed by a directed acyclic graph (DAG) of phi nodes with constant leaves, and
figures it only needs to use a replacement DAG of phi nodes where the
expression is evaluated for each constant, assuming that the original DAG will
be mostly dead then.  However, what happens in the testcase is that the
original DAG is still fully live, the replacement DAG is added in parallel,
and thus the number of phi nodes from the original DAG has been doubled.

For the testcase (and for fbital, if you have access to it), it is interesting
to observe that the problem would not arise if PRE had a notion of which
expressions are better left for conditional execution.  Unfortunately, on many
architectures, conditional execution requires two-address (wrt the ALU)
operations, which cannot be expressed in SSA.
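
To make the structure concrete, here is a minimal sketch of the kind of code
shape I mean.  This is only an assumed reduction, not the actual testcase
attached to this PR; the function and parameter names are made up.

/* Assumed shape: a running sum extended by chained conditional adds of
   constants through consecutive if-blocks, then used in one final add.  */
int
chained_adds (int c0, int c1, int c2, int c3, int c4, int c5, int x)
{
  int sum = 0;

  /* Each if-block conditionally extends the add chain by one constant.  */
  if (c0) sum += 1;
  if (c1) sum += 2;
  if (c2) sum += 4;
  if (c3) sum += 8;
  if (c4) sum += 16;
  if (c5) sum += 32;

  /* The final expression is fed by a phi DAG whose leaves are constants.  */
  return x + sum;
}

Left in this form, each if-block costs one conditional add plus the final one.
If PRE instead evaluates the final expression for every constant reachable
through the phi DAG and selects the result through a replacement DAG of phis,
you end up with the unconditional constant loads, the phi cascade and the
conditional reg-reg copies counted above.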