Fix PR85726 (div-div suboptimization) and a rant on match.pd :s-flag

Hans-Peter Nilsson Wed, 09 May 2018 16:42:29 -0700

Replacing a division feeding a division helps only when the
second division is the only user, and "fusing" the divisions is
downright bad if another user of the result of first division is
a modulus of the same value as the second division, forming a
divmod pair.  See the test-case, where for the tested
architectures (which all fail the test-case before the patch)
the div and mod are implemented using the high-part of a widened
multiplication and shift, emitted separately but combined as
late as in rtl, with the multiplicaton and shift re-used.  That
of course does not happen if later passes see (y / 48; y % 3).
While in match.pd, I noticed the corresponding mul-mul match,
which I believe should be guarded the same way.


I noticed this spot investigating performance regressions for
mipsisa32r2el-linux-gnu comparing to gcc-4.7-era code generated
for a compute-intensive library.  I'd say the pattern in the
test-case is common in cases like implementing a 3x3 filter
using SIMD with a vector size 16.  The suboptimization was more
of an (the first, or second) eye-catcher in a hot degraded
function than an actual cause though; fixing it seems to amount
for no more than 2% (where there's a 13% regression) in that
function.  As a contrast, I see e.g. four times as many local
call-saved registers used for some loop-intensive functions, but
I can't dive into the larger problem, at least not right now.
(And no, it's not LRA, AFAICT.)

Tested using the gcc-8.1.0 release on aarch64-unknown-linux-gnu,
powerpc64-unknown-linux-gnu, x86_64-pc-linux-gnu and partly a
cross to mipsisa32r2el-linux-gnu.

The patch is generated using "-w" because otherwise
re-indentation for the single_use-wrapping makes the diff
practically unreadable.  The test-case uses rtl-scanning so the
result can be verified closer to the actual code but still not
restricting it to assembly code.  Scanning just after the (last)
match.pd pass would still be far from the divmod-combining
effects done in rtl.  Unfortunately, the pattern for the
truncated multiplication is observable in various numbers
depending on both the target and the bug, so to avoid littering
I restrict it to my target of interest at the moment.  At least
the presence and absence of the "1/48"-constant is stable across
the line with/without the bug with no false positives.

This is a optimization regression counting from at least 4.9.2,
most likely since r218211, so: ok for for all open branches?

Please also see the ":s"-rant after the patch.

gcc/testsuite:
        PR tree-optimization/85726
        * pr85726.c: New test.

gcc:
        * match.pd (div (div C1) C2): Guard with single_use of inner div.
        (mult (mult C1) C2): Similar.

--- /dev/null   Tue Oct 29 15:57:07 2002
+++ gcc.dg/pr85726.c    Tue May  8 07:18:19 2018
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-rtl-fwprop1" } */
+
+/* There should just be one mult-as-div sequence, re-used for the mod,
+   not one for each of the y / 3 and y % 3 (as in due to suboptimal
+   simplification as y / 48 and y % 3) and there should no be a trace of
+   the constant used for the 1 / 48 mult-as-div. */
+int ww, vv;
+int x(int y)
+{
+  int z = y / 16;
+  int w = z / 3;
+  int v = z % 3;
+  ww = w;
+  return v;
+}
+/* { dg-final { scan-rtl-dump-times "truncate:SI .lshiftrt:DI .mult:DI 
.sign_extend:DI" 1 "fwprop1" { target mips*-*-* } } }  */
+/* { dg-final { scan-rtl-dump-times " .0x2aaaaaab" 0 "fwprop1" } }  */

diff --git a/gcc/match.pd b/gcc/match.pd
index 0de4432..e8ebeac 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -278,11 +278,14 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
       && TYPE_UNSIGNED (type))
   (trunc_div @0 @1)))
 
-/* Combine two successive divisions.  Note that combining ceil_div
-   and floor_div is trickier and combining round_div even more so.  */
+/* Combine two successive divisions, if the second is the only
+   user of the result of the first, or else we'll just end up
+   with two divisions anyway.  Note that combining ceil_div and
+   floor_div is trickier and combining round_div even more so.  */
 (for div (trunc_div exact_div)
  (simplify
-  (div (div @0 INTEGER_CST@1) INTEGER_CST@2)
+  (div (div@3 @0 INTEGER_CST@1) INTEGER_CST@2)
+  (if (single_use (@3))
    (with {
      bool overflow_p;
      wide_int mul = wi::mul (wi::to_wide (@1), wi::to_wide (@2),
@@ -292,12 +295,13 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
      (div @0 { wide_int_to_tree (type, mul); })
      (if (TYPE_UNSIGNED (type)
          || mul != wi::min_value (TYPE_PRECISION (type), SIGNED))
-     { build_zero_cst (type); })))))
+      { build_zero_cst (type); }))))))
 
 /* Combine successive multiplications.  Similar to above, but handling
    overflow is different.  */
 (simplify
- (mult (mult @0 INTEGER_CST@1) INTEGER_CST@2)
+ (mult (mult@3 @0 INTEGER_CST@1) INTEGER_CST@2)
+ (if (single_use (@3))
   (with {
     bool overflow_p;
     wide_int mul = wi::mul (wi::to_wide (@1), wi::to_wide (@2),
@@ -306,7 +310,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
    /* Skip folding on overflow: the only special case is @1 * @2 == -INT_MIN,
       otherwise undefined overflow implies that @0 must be zero.  */
    (if (!overflow_p || TYPE_OVERFLOW_WRAPS (type))
-   (mult @0 { wide_int_to_tree (type, mul); }))))
+    (mult @0 { wide_int_to_tree (type, mul); })))))
 
 /* Optimize A / A to 1.0 if we don't care about
    NaNs or Infinities.  */

Now a rant on the match.pd ":s" flag, which reasonable people
may reasonably suggest I should have used in the patch instead
of the (if (single_use ...)).

Initially, I got the match.pd-language (which deserves a proper
name) all wrong.  Then I read the documentation and still got it
wrong.  I "misunderstood" that the ":s" on an operand "O" was
supposed to have the effect of conditionalize the replacement
"R" by wrapping it in "(if (single_use (O)) R)" as in the
suggested patch (above).  To wit, this does not work; it will
*not* stop the replacement as seen in the test-case (THIS IS NOT
A SUGGESTED PATCH):

 (for div (trunc_div exact_div)
  (simplify
-  (div (div @0 INTEGER_CST@1) INTEGER_CST@2)
+  (div (div:s @0 INTEGER_CST@1) INTEGER_CST@2)
   (with {
     bool overflow_p;
     wide_int mul = wi::mul (wi::to_wide (@1), wi::to_wide (@2),

In PR69556, it seems other people seem to have read the
documentation of ":s" the same way, but are corrected by other
comments there, so I guess it's not my reading that's flawed.

I suggest preferably (1) correcting the semantics of ":s" to do
as the documentation says because I don't understand the
explanation in PR69556 comment #4 that the replacement "is still
allowed if it is a single operation as that replaces at least
one other (the one we are simplifying)"; I see that as a
complete nullification of the :s flag, making it a nop.
Alternatively (2), if the :s is *not* a nop in some case add an
example of that case and sufficient explanation to
match-and-simplify.texi *and* the suggested ":S" flag
(i.e. *really* conditionalize the replacement by a single-use of
that operand).

That's just a detail, I do like that language.

brgds, H-P

Fix PR85726 (div-div suboptimization) and a rant on match.pd :s-flag

Reply via email to