[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

rguenth at gcc dot gnu.org Fri, 26 May 2017 01:51:02 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846


--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(define_expand "<plusminus_insn><mode>3"
  [(set (match_operand:VI_AVX2 0 "register_operand")
        (plusminus:VI_AVX2
          (match_operand:VI_AVX2 1 "vector_operand")
          (match_operand:VI_AVX2 2 "vector_operand")))]
  "TARGET_SSE2"
  "ix86_fixup_binary_operands_no_copy (<CODE>, <MODE>mode, operands);")

so maybe this can be fixed up in ix86_fixup_binary_operands, which doesn't
seem to consider subregs in any way.

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c      (revision 248482)
+++ gcc/config/i386/i386.c      (working copy)
@@ -21270,6 +21270,11 @@ ix86_fixup_binary_operands (enum rtx_cod
   if (MEM_P (src1) && !rtx_equal_p (dst, src1))
     src1 = force_reg (mode, src1);

+  if (SUBREG_P (src1) && SUBREG_BYTE (src1) != 0)
+    src1 = force_reg (mode, src1);
+  if (SUBREG_P (src2) && SUBREG_BYTE (src2) != 0)
+    src2 = force_reg (mode, src2);
+
   /* Improve address combine.  */
   if (code == PLUS
       && GET_MODE_CLASS (mode) == MODE_INT

doesn't help though.  pre-LRA:

(insn 19 16 20 4 (set (reg:V4SI 103)
        (subreg:V4SI (reg:V8SI 90 [ vect_sum_11.6 ]) 16)) 1222 {movv4si_internal}
     (nil))
(insn 20 19 21 4 (set (reg:V4SI 98 [ _29 ])
        (plus:V4SI (reg:V4SI 103)
            (subreg:V4SI (reg:V8SI 90 [ vect_sum_11.6 ]) 0))) 2990 {*addv4si3}
     (expr_list:REG_DEAD (reg:V4SI 103)
        (expr_list:REG_DEAD (reg:V8SI 90 [ vect_sum_11.6 ])
            (nil))))

Of course, LRA not splitting live ranges when spilling (and thus forcing
the spill inside the loop) doesn't help either.  But we really don't want
to spill...

         Choosing alt 2 in insn 19:  (0) v  (1) vm {movv4si_internal}
            2 Non pseudo reload: reject++
          alt=1,overall=1,losers=0,rld_nregs=0
         Choosing alt 1 in insn 20:  (0) v  (1) v  (2) vm {*addv4si3}
          alt=1,overall=0,losers=0,rld_nregs=0

         Choosing alt 2 in insn 19:  (0) v  (1) vm {movv4si_internal}
            0 Non-pseudo reload: reject+=2
            0 Non input pseudo reload: reject++
            alt=0: Bad operand -- refuse
            0 Non-pseudo reload: reject+=2
            0 Non input pseudo reload: reject++
            alt=1: Bad operand -- refuse
            0 Non-pseudo reload: reject+=2
            0 Non input pseudo reload: reject++
            Cycle danger: overall += LRA_MAX_REJECT

         Choosing alt 1 in insn 20:  (0) v  (1) v  (2) vm {*addv4si3}
            alt=0: Bad operand -- refuse
            alt=1: Bad operand -- refuse
          alt=2,overall=0,losers=0,rld_nregs=0

so we don't seem to handle insn 19 well (why's that movv4si_internal rather
than some pextr?)
