[Bug rtl-optimization/98986] New: Try matching both orders of commutative RTX operations when there is no canonical order

ktkachov at gcc dot gnu.org via Gcc-bugs Sun, 07 Feb 2021 10:07:54 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98986


            Bug ID: 98986
           Summary: Try matching both orders of commutative RTX operations
                    when there is no canonical order
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---

The motivating aarch64 testcase is this:

#include <arm_neon.h>
int32x4_t
foo (int16x4_t a, int16x4_t b)
{
  int16x4_t tmp = vdup_n_s16 (vget_lane_s16 (b, 3));

  return vmull_s16 (tmp, a);
}

int32x4_t
foo2 (int16x4_t a, int16x4_t b)
{
  int16x4_t tmp = vdup_n_s16 (vget_lane_s16 (b, 3));

  return vmull_s16 (a, tmp);
}

Both functions should generate the widening-mult-by-lane form:
        smull   v0.4s, v0.4h, v1.h[3]   // 13   [c=16 l=4]  aarch64_vec_smult_lane_v4hi

However only the second function foo2 manages to match it.
We have a pattern for this in aarch64-simd.md:
(define_insn "aarch64_vec_<su>mult_lane<Qlane>"
  [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
        (mult:<VWIDE>
          (ANY_EXTEND:<VWIDE>
            (match_operand:<VCOND> 1 "register_operand" "w"))
          (ANY_EXTEND:<VWIDE>
            (vec_duplicate:<VCOND>
              (vec_select:<VEL>
                (match_operand:VDQHS 2 "register_operand" "<vwx>")
                (parallel [(match_operand:SI 3 "immediate_operand" "i")]))))))]
  "TARGET_SIMD"
  {
    operands[3] = aarch64_endian_lane_rtx (<MODE>mode, INTVAL (operands[3]));
    return "<su>mull\\t%0.<Vwtype>, %1.<Vcondtype>, %2.<Vetype>[%3]";
  }
  [(set_attr "type" "neon_mul_<Vetype>_scalar_long")]
)

For foo, combine tries and fails to match the vec_select in the first arm of the
mult:
(set (reg:V4SI 93 [ <retval> ])
    (mult:V4SI (sign_extend:V4SI (vec_duplicate:V4HI (vec_select:HI (reg:V4HI 99)
                    (parallel:V4HI [
                            (const_int 3 [0x3])
                        ]))))
        (sign_extend:V4SI (reg:V4HI 98))))

Unfortunately, because of the sign_extends on both arms of the mult, there is no
canonical order for these expressions: both arms of the MULT are RTX_UNARY
expressions, and swap_commutative_operands_p doesn't try to swap them around.
I guess we can work around this by adding more patterns in the backend to match
the two different orders we can get in this situation, but we've got
so many similar patterns in the backend...

Do you think it's feasible to get recog or combine to try out both permutations
of such commutative operations when matching without blowing up compile time?
Any other ideas for resolving this are welcome.
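
To make the question concrete, here is a minimal toy sketch in C of what "try both
permutations" could look like. It is not GCC code: the toy_* types, codes and
functions are all hypothetical and stand in for rtx, the rtx codes and the
recog/combine matching machinery. The swap is only attempted when the operation is
commutative and both operands are unary extends, which is how this sketch models
"no canonical order".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy expression codes.  Hypothetical: these are NOT GCC's rtx codes.  */
enum toy_code { TOY_REG, TOY_SIGN_EXTEND, TOY_VEC_DUP_LANE, TOY_MULT };

/* A toy expression node with up to two operands.  A pattern uses the
   same type; a TOY_REG in a pattern matches any TOY_REG.  */
struct toy_expr {
  enum toy_code code;
  const struct toy_expr *op0;
  const struct toy_expr *op1;
};

/* True if X is a commutative operation whose operands are both unary
   extends, so that no operand order is preferred over the other.  */
static bool
toy_no_canonical_order_p (const struct toy_expr *x)
{
  return x->code == TOY_MULT
	 && x->op0 != NULL && x->op1 != NULL
	 && x->op0->code == TOY_SIGN_EXTEND
	 && x->op1->code == TOY_SIGN_EXTEND;
}

/* Structurally match X against PAT.  If TRY_SWAPPED, also try the
   swapped operand order where there is no canonical order.  */
static bool
toy_match (const struct toy_expr *pat, const struct toy_expr *x,
	   bool try_swapped)
{
  if (pat == NULL || x == NULL)
    return pat == x;
  if (pat->code != x->code)
    return false;
  if (toy_match (pat->op0, x->op0, try_swapped)
      && toy_match (pat->op1, x->op1, try_swapped))
    return true;
  if (try_swapped && toy_no_canonical_order_p (x))
    return toy_match (pat->op0, x->op1, try_swapped)
	   && toy_match (pat->op1, x->op0, try_swapped);
  return false;
}

/* Toy versions of the expressions from this report.  */
static const struct toy_expr toy_reg = { TOY_REG, NULL, NULL };
static const struct toy_expr toy_lane = { TOY_VEC_DUP_LANE, &toy_reg, NULL };
static const struct toy_expr toy_ext_reg = { TOY_SIGN_EXTEND, &toy_reg, NULL };
static const struct toy_expr toy_ext_lane = { TOY_SIGN_EXTEND, &toy_lane, NULL };

/* The pattern has the lane operand second, as in aarch64-simd.md.  */
static const struct toy_expr toy_pattern
  = { TOY_MULT, &toy_ext_reg, &toy_ext_lane };
/* foo has the lane operand first; foo2 has it second.  */
static const struct toy_expr toy_foo
  = { TOY_MULT, &toy_ext_lane, &toy_ext_reg };
static const struct toy_expr toy_foo2
  = { TOY_MULT, &toy_ext_reg, &toy_ext_lane };

/* Check the three cases discussed above; returns 0 on success.  */
int
toy_demo (void)
{
  assert (toy_match (&toy_pattern, &toy_foo2, false)); /* foo2 matches as is */
  assert (!toy_match (&toy_pattern, &toy_foo, false)); /* foo does not */
  assert (toy_match (&toy_pattern, &toy_foo, true));   /* foo matches swapped */
  return 0;
}
```

In this toy every qualifying node can double the work for its subtree, so nested
commutative operations grow exponentially in the worst case. That is the
compile-time concern raised above; limiting the retry to the outermost operation
would be one way to bound it.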
