Replacing a division feeding a division helps only when the second division is the only user, and "fusing" the divisions is downright bad if another user of the result of first division is a modulus of the same value as the second division, forming a divmod pair. See the test-case, where for the tested architectures (which all fail the test-case before the patch) the div and mod are implemented using the high-part of a widened multiplication and shift, emitted separately but combined as late as in rtl, with the multiplicaton and shift re-used. That of course does not happen if later passes see (y / 48; y % 3). While in match.pd, I noticed the corresponding mul-mul match, which I believe should be guarded the same way.
I noticed this spot investigating performance regressions for mipsisa32r2el-linux-gnu comparing to gcc-4.7-era code generated for a compute-intensive library. I'd say the pattern in the test-case is common in cases like implementing a 3x3 filter using SIMD with a vector size 16. The suboptimization was more of an (the first, or second) eye-catcher in a hot degraded function than an actual cause though; fixing it seems to amount for no more than 2% (where there's a 13% regression) in that function. As a contrast, I see e.g. four times as many local call-saved registers used for some loop-intensive functions, but I can't dive into the larger problem, at least not right now. (And no, it's not LRA, AFAICT.) Tested using the gcc-8.1.0 release on aarch64-unknown-linux-gnu, powerpc64-unknown-linux-gnu, x86_64-pc-linux-gnu and partly a cross to mipsisa32r2el-linux-gnu. The patch is generated using "-w" because otherwise re-indentation for the single_use-wrapping makes the diff practically unreadable. The test-case uses rtl-scanning so the result can be verified closer to the actual code but still not restricting it to assembly code. Scanning just after the (last) match.pd pass would still be far from the divmod-combining effects done in rtl. Unfortunately, the pattern for the truncated multiplication is observable in various numbers depending on both the target and the bug, so to avoid littering I restrict it to my target of interest at the moment. At least the presence and absence of the "1/48"-constant is stable across the line with/without the bug with no false positives. This is a optimization regression counting from at least 4.9.2, most likely since r218211, so: ok for for all open branches? Please also see the ":s"-rant after the patch. gcc/testsuite: PR tree-optimization/85726 * pr85726.c: New test. gcc: * match.pd (div (div C1) C2): Guard with single_use of inner div. (mult (mult C1) C2): Similar. --- /dev/null Tue Oct 29 15:57:07 2002 +++ gcc.dg/pr85726.c Tue May 8 07:18:19 2018 @@ -0,0 +1,18 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-rtl-fwprop1" } */ + +/* There should just be one mult-as-div sequence, re-used for the mod, + not one for each of the y / 3 and y % 3 (as in due to suboptimal + simplification as y / 48 and y % 3) and there should no be a trace of + the constant used for the 1 / 48 mult-as-div. */ +int ww, vv; +int x(int y) +{ + int z = y / 16; + int w = z / 3; + int v = z % 3; + ww = w; + return v; +} +/* { dg-final { scan-rtl-dump-times "truncate:SI .lshiftrt:DI .mult:DI .sign_extend:DI" 1 "fwprop1" { target mips*-*-* } } } */ +/* { dg-final { scan-rtl-dump-times " .0x2aaaaaab" 0 "fwprop1" } } */ diff --git a/gcc/match.pd b/gcc/match.pd index 0de4432..e8ebeac 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -278,11 +278,14 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) && TYPE_UNSIGNED (type)) (trunc_div @0 @1))) -/* Combine two successive divisions. Note that combining ceil_div - and floor_div is trickier and combining round_div even more so. */ +/* Combine two successive divisions, if the second is the only + user of the result of the first, or else we'll just end up + with two divisions anyway. Note that combining ceil_div and + floor_div is trickier and combining round_div even more so. */ (for div (trunc_div exact_div) (simplify - (div (div @0 INTEGER_CST@1) INTEGER_CST@2) + (div (div@3 @0 INTEGER_CST@1) INTEGER_CST@2) + (if (single_use (@3)) (with { bool overflow_p; wide_int mul = wi::mul (wi::to_wide (@1), wi::to_wide (@2), @@ -292,12 +295,13 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (div @0 { wide_int_to_tree (type, mul); }) (if (TYPE_UNSIGNED (type) || mul != wi::min_value (TYPE_PRECISION (type), SIGNED)) - { build_zero_cst (type); }))))) + { build_zero_cst (type); })))))) /* Combine successive multiplications. Similar to above, but handling overflow is different. */ (simplify - (mult (mult @0 INTEGER_CST@1) INTEGER_CST@2) + (mult (mult@3 @0 INTEGER_CST@1) INTEGER_CST@2) + (if (single_use (@3)) (with { bool overflow_p; wide_int mul = wi::mul (wi::to_wide (@1), wi::to_wide (@2), @@ -306,7 +310,7 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) /* Skip folding on overflow: the only special case is @1 * @2 == -INT_MIN, otherwise undefined overflow implies that @0 must be zero. */ (if (!overflow_p || TYPE_OVERFLOW_WRAPS (type)) - (mult @0 { wide_int_to_tree (type, mul); })))) + (mult @0 { wide_int_to_tree (type, mul); }))))) /* Optimize A / A to 1.0 if we don't care about NaNs or Infinities. */ Now a rant on the match.pd ":s" flag, which reasonable people may reasonably suggest I should have used in the patch instead of the (if (single_use ...)). Initially, I got the match.pd-language (which deserves a proper name) all wrong. Then I read the documentation and still got it wrong. I "misunderstood" that the ":s" on an operand "O" was supposed to have the effect of conditionalize the replacement "R" by wrapping it in "(if (single_use (O)) R)" as in the suggested patch (above). To wit, this does not work; it will *not* stop the replacement as seen in the test-case (THIS IS NOT A SUGGESTED PATCH): (for div (trunc_div exact_div) (simplify - (div (div @0 INTEGER_CST@1) INTEGER_CST@2) + (div (div:s @0 INTEGER_CST@1) INTEGER_CST@2) (with { bool overflow_p; wide_int mul = wi::mul (wi::to_wide (@1), wi::to_wide (@2), In PR69556, it seems other people seem to have read the documentation of ":s" the same way, but are corrected by other comments there, so I guess it's not my reading that's flawed. I suggest preferably (1) correcting the semantics of ":s" to do as the documentation says because I don't understand the explanation in PR69556 comment #4 that the replacement "is still allowed if it is a single operation as that replaces at least one other (the one we are simplifying)"; I see that as a complete nullification of the :s flag, making it a nop. Alternatively (2), if the :s is *not* a nop in some case add an example of that case and sufficient explanation to match-and-simplify.texi *and* the suggested ":S" flag (i.e. *really* conditionalize the replacement by a single-use of that operand). That's just a detail, I do like that language. brgds, H-P