Hi all,

I'm trying to reduce the cases where the midend calls the backend rtx costs on
bogus rtl for which the backend has no patterns or other means of handling.
Having to handle these kinds of rtxes sanely bloats the costs functions and
makes them harder to maintain.

One of the places where this occurs is combine, and
distribute_and_simplify_rtx in particular.
Quoting the comment on that function:
" See if X is of the form (* (+ A B) C), and if so convert to
   (+ (* A C) (* B C)) and try to simplify.
 Most of the time, this results in no change.  However, if some of
   the operands are the same or inverses of each other, simplifications
   will result."

The problem is that after it applies the distributive law it calls rtx costs
to figure out whether the rtx became simpler. This rtx can get pretty complex.
For example, on arm I've seen it try to cost:
(plus:SI (mult:SI (plus:SI (reg:SI 232 [ m1 ])
            (const_int 1 [0x1]))
        (reg:SI 232 [ m1 ]))
    (plus:SI (reg:SI 232 [ m1 ])
        (const_int 1 [0x1])))

which is never going to match anything on arm anyway, so why should the costs
function have to handle it?
In any case, I believe combine's design is such that it should first attempt
to recog (and split) the rtxes, and only if that succeeds should it make a
target-specific decision on which rtx to prefer. distribute_and_simplify_rtx
goes against that by calling rtx costs on an unverified rtx in an attempt to
gauge its complexity.

This patch remedies that by removing the call to rtx costs and instead
performing a relatively simple manual check on whether the resultant rtx was
actually simplified. That is, using the example from the comment, checking
whether (+ (* A C) (* B C)) still has + at the top and * in the two operands,
or was merely re-factored back into its original form. This should give a
good indication of whether any meaningful simplification was made (the '+'
and '*' operators in the example stand for any pair of operators over which
distribution is valid).
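
In terms of the variables in distribute_and_simplify_rtx (outer_code is the
code of X, inner_code the code of the operand being decomposed), the new test
boils down to the following sketch, mirroring the hunk in the patch below:

  /* tmp was not simplified if apply_distributive_law re-factored it back
     to outer_code at the top level, or if it still has the distributed
     shape (inner_code (outer_code A C) (outer_code B C)).  */
  if (GET_CODE (tmp) == outer_code
      || (GET_CODE (tmp) == inner_code
          && GET_CODE (XEXP (tmp, 0)) == outer_code
          && GET_CODE (XEXP (tmp, 1)) == outer_code))
    return NULL_RTX;

  return tmp;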

Initially, I wanted to just return the distributed version and let recog
reject the invalid rtxes, but that caused some code quality regressions on
arm: the original rtx would fail recog but would match a beneficial splitter,
whereas the distributed rtx would match neither.

With this patch I saw almost no codegen differences on arm for the whole of
SPEC2006.
The one exception was 416.gamess, where it managed to merge a mul and an add
into an mla, which resulted in a slightly better code sequence. That was in a
pretty large file and I don't speak Fortran'ese, so I couldn't really reduce
a testcase for it, but my guess is that before the patch the costs function
would return some essentially random value for the arbitrarily complex rtx it
was passed, which changed the decision in distribute_and_simplify_rtx on
whether to return the distributed rtx, which in turn could have affected
further optimisations in combine.

I tried it on x86_64 as well. Again, there were almost no codegen
differences. The exceptions were tonto and wrf, where a few instructions were
eliminated, but nothing significant. The resultant binaries for these two
were a tiny bit smaller, with no impact on runtime.

Therefore I claim that this is a safe thing to do, as it leaves the
target-specific rtx cost judgements in combine to be made only on valid
recog-ed rtxes, rather than having them cancel optimisations early because
rtx costs does not handle arbitrary rtxes well.

Bootstrapped on arm, x86_64 and aarch64 (all linux). Tested on arm and aarch64.

Ok for trunk?

Thanks,
Kyrill


2015-04-20  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>

    * combine.c (distribute_and_simplify_rtx): Do not check rtx costs.
    Look at the rtx codes to see if a simplification occurred.
commit e9833e5e3e996ac68b645fdca14738232f59e1a2
Author: Kyrylo Tkachov <kyrylo.tkac...@arm.com>
Date:   Wed Apr 15 14:11:16 2015 +0100

    [combine] Do not call rtx costs on potentially invalid rtx in distribute_and_simplify_rtx

diff --git a/gcc/combine.c b/gcc/combine.c
index 46cd6db..56d297b 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -9602,12 +9602,15 @@ distribute_and_simplify_rtx (rtx x, int n)
 
   tmp = apply_distributive_law (simplify_gen_binary (inner_code, mode,
 						     new_op0, new_op1));
-  if (GET_CODE (tmp) != outer_code
-      && (set_src_cost (tmp, optimize_this_for_speed_p)
-	  < set_src_cost (x, optimize_this_for_speed_p)))
-    return tmp;
 
-  return NULL_RTX;
+  /* We didn't manage to simplify.  */
+  if (GET_CODE (tmp) == outer_code
+      || (GET_CODE (tmp) == inner_code
+	  && GET_CODE (XEXP (tmp, 0)) == outer_code
+	  && GET_CODE (XEXP (tmp, 1)) == outer_code))
+    return NULL_RTX;
+
+  return tmp;
 }
 
 /* Simplify a logical `and' of VAROP with the constant CONSTOP, to be done
