https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98986
Bug ID: 98986
Summary: Try matching both orders of commutative RTX operations when there is no canonical order
Product: gcc
Version: unknown
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ktkachov at gcc dot gnu.org
Target Milestone: ---

The motivating aarch64 testcase is this:

    #include <arm_neon.h>

    int32x4_t
    foo (int16x4_t a, int16x4_t b)
    {
      int16x4_t tmp = vdup_n_s16 (vget_lane_s16 (b, 3));
      return vmull_s16 (tmp, a);
    }

    int32x4_t
    foo2 (int16x4_t a, int16x4_t b)
    {
      int16x4_t tmp = vdup_n_s16 (vget_lane_s16 (b, 3));
      return vmull_s16 (a, tmp);
    }

Both functions should generate the widening-mult-by-lane form:

    smull   v0.4s, v0.4h, v1.h[3]   // 13 [c=16 l=4]  aarch64_vec_smult_lane_v4hi

However, only the second function, foo2, manages to match it.

We have a pattern for this in aarch64-simd.md:

    (define_insn "aarch64_vec_<su>mult_lane<Qlane>"
      [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
        (mult:<VWIDE>
          (ANY_EXTEND:<VWIDE>
            (match_operand:<VCOND> 1 "register_operand" "w"))
          (ANY_EXTEND:<VWIDE>
            (vec_duplicate:<VCOND>
              (vec_select:<VEL>
                (match_operand:VDQHS 2 "register_operand" "<vwx>")
                (parallel [(match_operand:SI 3 "immediate_operand" "i")]))))))]
      "TARGET_SIMD"
      {
        operands[3] = aarch64_endian_lane_rtx (<MODE>mode, INTVAL (operands[3]));
        return "<su>mull\\t%0.<Vwtype>, %1.<Vcondtype>, %2.<Vetype>[%3]";
      }
      [(set_attr "type" "neon_mul_<Vetype>_scalar_long")]
    )

For foo, combine tries and fails to match the following, with the vec_select in the first arm of the mult:

    (set (reg:V4SI 93 [ <retval> ])
        (mult:V4SI (sign_extend:V4SI (vec_duplicate:V4HI (vec_select:HI (reg:V4HI 99)
                        (parallel:V4HI [
                                (const_int 3 [0x3])
                            ]))))
            (sign_extend:V4SI (reg:V4HI 98))))

Unfortunately, due to the sign_extends on both arms of the mult, there is no canonical order for these expressions: both arms of the MULT are RTX_UNARY expressions, and swap_commutative_operands_p doesn't try to swap them around.
I guess we can work around this by adding more patterns in the backend to match the two different orders we can get in this situation, but we've got so many similar patterns in the backend...

Do you think it's feasible to get recog or combine to try out both permutations of such commutative operations when matching, without blowing up compile time?
Any other ideas for resolving this are welcome.
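
To make the question a bit more concrete, here is a minimal sketch in C of what "try both orders" could look like. It is a toy model and not GCC code: the toy_expr type, the TOY_* codes and the toy_match_* functions are all hypothetical names made up for illustration, and TOY_DUP_LANE folds the vec_duplicate/vec_select pair into a single node.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for RTL codes; hypothetical, not GCC's rtx representation. */
enum toy_code { TOY_REG, TOY_SIGN_EXTEND, TOY_DUP_LANE, TOY_MULT };

struct toy_expr {
  enum toy_code code;
  const struct toy_expr *op0;  /* first operand, NULL for TOY_REG */
  const struct toy_expr *op1;  /* second operand, only used by TOY_MULT */
  int value;                   /* register number or lane index */
};

/* Matches (sign_extend (reg)). */
static bool is_extended_reg(const struct toy_expr *e) {
  return e->code == TOY_SIGN_EXTEND && e->op0->code == TOY_REG;
}

/* Matches (sign_extend (dup_lane (reg))). */
static bool is_extended_lane(const struct toy_expr *e) {
  return e->code == TOY_SIGN_EXTEND && e->op0->code == TOY_DUP_LANE
         && e->op0->op0->code == TOY_REG;
}

/* Strict match: operands must appear in the order the pattern spells out. */
bool toy_match_mult_lane(const struct toy_expr *a, const struct toy_expr *b) {
  assert(a != NULL && b != NULL);
  return is_extended_reg(a) && is_extended_lane(b);
}

/* Commutative match: retry with the operands swapped if the first order fails. */
bool toy_match_mult_lane_commutative(const struct toy_expr *mult) {
  assert(mult != NULL && mult->code == TOY_MULT);
  return toy_match_mult_lane(mult->op0, mult->op1)
         || toy_match_mult_lane(mult->op1, mult->op0);
}

/* Shapes mirroring foo (lane operand first) and foo2 (lane operand second). */
static const struct toy_expr reg98 = { TOY_REG, NULL, NULL, 98 };
static const struct toy_expr reg99 = { TOY_REG, NULL, NULL, 99 };
static const struct toy_expr lane3 = { TOY_DUP_LANE, &reg99, NULL, 3 };
static const struct toy_expr ext_a = { TOY_SIGN_EXTEND, &reg98, NULL, 0 };
static const struct toy_expr ext_lane = { TOY_SIGN_EXTEND, &lane3, NULL, 0 };
const struct toy_expr toy_foo = { TOY_MULT, &ext_lane, &ext_a, 0 };
const struct toy_expr toy_foo2 = { TOY_MULT, &ext_a, &ext_lane, 0 };
```

In this model the strict matcher rejects toy_foo and accepts toy_foo2, while the commutative wrapper accepts both. The extra cost here is one additional match attempt for a single node; with nested commutative operations the number of attempts would multiply, which is the compile-time concern.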