Jeff Law <jeffreya...@gmail.com> writes: > The BZ in question is a failure to recognize a pair of shifts as a sign > extension. > > I originally thought simplify-rtx would be the right framework to > address this problem, but fwprop is actually better. We can write the > recognizer much simpler in that framework. > > fwprop already simplifies nested shifts/extensions to the desired RTL, > but it's not considered profitable and we throw away the good work done > by fwprop & simplifiers. > > It's hard to see a scenario where nested shifts or nested extensions > that simplify down to a single sign/zero extension isn't a profitable > transformation. So when fwprop has nested shifts/extensions that > simplifies to an extension, we consider it profitable. > > This allow us to simplify the testcase on rv64 with ZBB enabled from a > pair of shifts to a single byte or half-word sign extension.
Hmm. So just to summarise something that was discussed in the PR comments, this is a case where combine's expand_compound_operation/ make_compound_operation wrangler hurts us, because the process isn't idempotent, and combine produces two complex instructions: (insn 6 3 7 2 (set (reg:DI 137 [ _3 ]) (ashift:DI (reg:DI 139 [ x ]) (const_int 24 [0x18]))) "foo.c":2:20 305 {ashldi3} (expr_list:REG_DEAD (reg:DI 139 [ x ]) (nil))) (insn 12 7 13 2 (set (reg/i:DI 10 a0) (sign_extend:DI (ashiftrt:SI (subreg:SI (reg:DI 137 [ _3 ]) 0) (const_int 24 [0x18])))) "foo.c":2:27 321 {ashrsi3_extend} (expr_list:REG_DEAD (reg:DI 137 [ _3 ]) (nil))) given two simple instructions: (insn 6 3 7 2 (set (reg:SI 137 [ _3 ]) (sign_extend:SI (subreg:QI (reg/v:DI 136 [ x ]) 0))) "foo.c":2:20 533 {*extendqisi2_bitmanip} (expr_list:REG_DEAD (reg/v:DI 136 [ x ]) (nil))) (insn 7 6 12 2 (set (reg:DI 138 [ _3 ]) (sign_extend:DI (reg:SI 137 [ _3 ]))) "foo.c":2:20 discrim 1 133 {*extendsidi2_internal} (expr_list:REG_DEAD (reg:SI 137 [ _3 ]) (nil))) If I run with -fdisable-rtl-combine then late_combine1 already does the expected transformation. Although it would be nice to fix combine, that might be difficult. If we treat combine as immutable then the options are: (1) Teach simplify-rtx to simplify combine's output into a single sign_extend. (2) Allow fwprop1 to get in first, before combine has a chance to mess things up. The patch goes for (2). Is that a fair summary? Playing devil's advocate, I suppose one advantage of (1) is that it would allow the optimisation even if the original rtl looked like combine's output. And fwprop1 doesn't distinguish between cases in which the source instruction disappears from cases in which the source instruction is kept. Thus we could transform: (set (reg:SI R2) (sign_extend:SI (reg:QI R1))) (set (reg:DI R3) (sign_extend:DI (reg:SI R2))) into: (set (reg:SI R2) (sign_extend:SI (reg:QI R1))) (set (reg:DI R3) (sign_extend:DI (reg:QI R1))) which increases the register pressure between the two instructions (since R2 and R1 are both now live). In general, there could be quite a gap between the two instructions. On the other hand, even in that case, fwprop1 would be parallelising the extensions. And since we're talking about unary operations, even two-address targets would allow R1 to be extended without tying the source and destination. Also, it seems relatively unlikely that expand would produce code that looks like combine's, since the gimple optimisers should have simplified it into conversions. So initially I was going to agree that it's worth trying in fwprop. But... > Bootstrapped and regression tested on x86 as well.. Also tested in my > tester. I just don't remember when I installed it, so I don't know > which of the natives have been tested with it, but all the crosses have... > > Richard -- any thoughts here? > > Jeff > > PR rtl-optimization/109592 > gcc > * fwprop.cc (any_shift_p, any_extension_p): New functions. > (fwprop_propagation::classify_result): If we have nested shifts > or nested extensions that simplify to a single extension, then > consider it profitable. > > gcc/testsuite > * gcc.target/riscv/pr109592.c: New test. > > diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc > index 9b8fba59216..c844d870a87 100644 > --- a/gcc/fwprop.cc > +++ b/gcc/fwprop.cc > @@ -242,6 +242,24 @@ fwprop_propagation::check_mem (int old_num_changes, rtx > mem) > return true; > } > > +/* Return TRUE if X is a shift, FALSE otherwise. */ > + > +static bool > +any_shift_p (rtx x) > +{ > + enum rtx_code code = GET_CODE (x); > + return code == ASHIFTRT || code == LSHIFTRT || code == ASHIFT; > +} > + > +/* Return TRUE if X is an extension, FALSE otherwise. */ > + > +static bool > +any_extension_p (rtx x) > +{ > + enum rtx_code code = GET_CODE (x); > + return code == SIGN_EXTEND || code == ZERO_EXTEND; > +} > + > /* OLDX has been simplified to NEWX. Describe the change in terms of > result_flags. */ > > @@ -288,6 +306,22 @@ fwprop_propagation::classify_result (rtx old_rtx, rtx > new_rtx) > && !MEM_VOLATILE_P (new_rtx)) > return PROFITABLE; > > + /* If we had nested extensions that simplified to a single extension of > + the same underlying object, then consider it profitable. */ > + if (any_extension_p (old_rtx) > + && any_extension_p (XEXP (old_rtx, 0)) > + && any_extension_p (new_rtx) > + && rtx_equal_p (XEXP (XEXP (old_rtx, 0), 0), XEXP (new_rtx, 0))) > + return PROFITABLE; > + > + /* If we had nested shifts that simplify to a single extension of the > + same underlying object, then consider it profitable. */ > + if (any_shift_p (old_rtx) > + && any_shift_p (XEXP (old_rtx, 0)) > + && any_extension_p (new_rtx) > + && rtx_equal_p (XEXP (XEXP (old_rtx, 0), 0), XEXP (new_rtx, 0))) > + return PROFITABLE; ...I think we should make sure that XEXP (new_rtx, 0)) is a REG or a subreg of a REG (not just any SUBREG), so that we don't duplicate complex expressions. However, even that would lead to the odd situation in which simplifying (foo (bar (reg X))) to (reg X) would not be treated as profitable, but simplifying (foo (bar (reg X))) to (sign_extend (reg X)) would be treated as profitable. I think the two cases would need to be handled in the same way, and maybe also (foo (reg X)) to (reg X). That's beginning to feel like a big change in policy for this stage of development. And it looks like option (1) isn't simple either. As well as mangling the original instructions, combine also fuses the final operation into: (insn 12 7 13 2 (set (reg/i:DI 10 a0) (reg:DI 138 [ _3 ])) "foo.c":2:27 283 {*movdi_64bit} (expr_list:REG_DEAD (reg:DI 138 [ _3 ]) (nil))) so that it ends up with a hard register destination (as for the final insn 12 quoted earlier). late-combine currently punts on such instructions: // Avoid increasing the complexity of instructions that // reference allocatable hard registers. if (!REG_P (SET_SRC (set)) && !reload_completed && (accesses_include_nonfixed_hard_registers (use_insn->uses ()) || accesses_include_nonfixed_hard_registers (use_insn->defs ()))) return false; (the defs () check). Relaxing that would also be invasive at this stage. And this PR's missed optimisation is going to be more likely for the final return value, since the second extension comes from the ABI and is hidden from gimple. So combine is likely to replace a pseudo with a hard register in the motivating cases. So it seems like it's a bit of a mess :( If we do try to fix combine, I think something like the attached would fit within the current scheme. It is a pure shift-for-shift transformation, avoiding any extensions. Will think more about it, but wanted to get the above stream of consciousness out before I finish for the day :) Thanks, Richard diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index 223a0095978..4e1f3a5bbbd 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -4466,6 +4466,58 @@ simplify_context::simplify_binary_operation_1 (rtx_code code, return simplify_gen_binary (code, mode, op0, gen_int_shift_amount (mode, val)); } + + /* Simplify: + + (code:M1 + (subreg:M1 + ([al]shiftrt:M2 + (subreg:M2 + (ashift:M1 X C1)) + C2)) + C3) + + to: + + (code:M1 + ([al]shiftrt:M1 + (ashift:M1 X C1+N) + C2+N) + C3) + + where M1 is N bits wider than M2. Optimizing the (subreg:M1 ...) + directly would be arithmetically correct, but restricting the + simplification to shifts by constants is more conservative, + since it is more likely to lead to further simplifications. */ + if (is_a<scalar_int_mode> (mode, &int_mode) + && paradoxical_subreg_p (op0) + && is_a<scalar_int_mode> (GET_MODE (SUBREG_REG (op0)), &inner_mode) + && (GET_CODE (SUBREG_REG (op0)) == ASHIFTRT + || GET_CODE (SUBREG_REG (op0)) == LSHIFTRT) + && CONST_INT_P (op1)) + { + auto xcode = GET_CODE (SUBREG_REG (op0)); + rtx xop0 = XEXP (SUBREG_REG (op0), 0); + rtx xop1 = XEXP (SUBREG_REG (op0), 1); + if (SUBREG_P (xop0) + && GET_MODE (SUBREG_REG (xop0)) == mode + && GET_CODE (SUBREG_REG (xop0)) == ASHIFT + && CONST_INT_P (xop1)) + { + rtx yop0 = XEXP (SUBREG_REG (xop0), 0); + rtx yop1 = XEXP (SUBREG_REG (xop0), 1); + if (CONST_INT_P (yop1)) + { + auto bias = (GET_MODE_BITSIZE (int_mode) + - GET_MODE_BITSIZE (inner_mode)); + tem = simplify_gen_binary (ASHIFT, mode, yop0, + GEN_INT (INTVAL (yop1) + bias)); + tem = simplify_gen_binary (xcode, mode, tem, + GEN_INT (INTVAL (xop1) + bias)); + return simplify_gen_binary (code, mode, tem, op1); + } + } + } break; case SS_ASHIFT: