Here are two straightforward ways to write the same operation, a dot product
(fun2 gives the same result only when n is a multiple of 8):

#define TYPE int

TYPE fun1(TYPE *x, TYPE *y, unsigned int n)
{
  int i, j;
  TYPE dot = 0;

  for (i = 0; i < n; i++)
    dot += *(x++) * *(y++);

  return dot;
}

TYPE fun2(TYPE *x, TYPE *y, unsigned int n)
{
  int i, j;
  TYPE dot = 0;

  for (i = 0; i < n / 8; i++)
    for (j = 0; j < 8; j++)
      dot += *(x++) * *(y++);

  return dot;
}

GCC 4.3 can vectorize both of them, but GCC 4.4 can only vectorize fun1.  I
think the vectorizer dump shows why:

reduc.c:17: note: === vect_analyze_scalar_cycles ===
reduc.c:17: note: Analyze phi: dot_103 = PHI <dot_110(5), 0(3)>

reduc.c:17: note: Access function of PHI: {0, +, ((((((D.1621_32 + D.1621_43) + D.1621_54) + D.1621_65) + D.1621_76) + D.1621_87) + D.1621_98) + D.1621_109}_1
reduc.c:17: note: step: ((((((D.1621_32 + D.1621_43) + D.1621_54) + D.1621_65) + D.1621_76) + D.1621_87) + D.1621_98) + D.1621_109,  init: 0
reduc.c:17: note: step unknown.

The cunrolli pass (which there is no way to disable) has completely unrolled
the inner loop, so the reduction step is now the sum of eight products rather
than a single one.  The vectorizer's SLP support cannot handle the unrolled
version of the loop.
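
For illustration, here is roughly what fun2 looks like once the inner loop has
been completely unrolled.  This is a hand-written sketch at the source level,
not actual compiler output; the name fun2_unrolled is made up for this example,
and the real GIMPLE will differ in detail.

```c
#define TYPE int

/* Hand-written approximation of fun2 after complete unrolling of the
   inner loop.  fun2_unrolled is a hypothetical name, not compiler output. */
TYPE fun2_unrolled(TYPE *x, TYPE *y, unsigned int n)
{
  unsigned int i;
  TYPE dot = 0;

  for (i = 0; i < n / 8; i++)
    {
      /* Eight products are accumulated per iteration, so the step of
         the reduction is their sum, as in the dump above.  */
      dot += x[0] * y[0];
      dot += x[1] * y[1];
      dot += x[2] * y[2];
      dot += x[3] * y[3];
      dot += x[4] * y[4];
      dot += x[5] * y[5];
      dot += x[6] * y[6];
      dot += x[7] * y[7];
      x += 8;
      y += 8;
    }

  return dot;
}
```

Like fun2, this sketch ignores any trailing n % 8 elements.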

Also observed on ARM NEON with TYPE == float.


-- 
           Summary: Complete unrolling (inner) versus vectorization of
                    reduction
           Product: gcc
           Version: 4.4.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: drow at gcc dot gnu dot org
GCC target triplet: x86_64-linux


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41881
