Here are two straightforward ways to write the same operation, a dot product
(the two agree whenever n is a multiple of 8):
#define TYPE int

TYPE fun1(TYPE *x, TYPE *y, unsigned int n)
{
  int i, j;
  TYPE dot = 0;

  for (i = 0; i < n; i++)
    dot += *(x++) * *(y++);

  return dot;
}

TYPE fun2(TYPE *x, TYPE *y, unsigned int n)
{
  int i, j;
  TYPE dot = 0;

  for (i = 0; i < n / 8; i++)
    for (j = 0; j < 8; j++)
      dot += *(x++) * *(y++);

  return dot;
}
GCC 4.3 vectorizes both of them; GCC 4.4 vectorizes only fun1. I figure the
vectorizer dump for fun2 shows why:
reduc.c:17: note: === vect_analyze_scalar_cycles ===
reduc.c:17: note: Analyze phi: dot_103 = PHI <dot_110(5), 0(3)>
reduc.c:17: note: Access function of PHI: {0, +, ((((((D.1621_32 + D.1621_43) + D.1621_54) + D.1621_65) + D.1621_76) + D.1621_87) + D.1621_98) + D.1621_109}_1
reduc.c:17: note: step: ((((((D.1621_32 + D.1621_43) + D.1621_54) + D.1621_65) + D.1621_76) + D.1621_87) + D.1621_98) + D.1621_109, init: 0
reduc.c:17: note: step unknown.
The cunrolli pass (which there is no way to disable) has completely unrolled
the inner loop, and the vectorizer's SLP support cannot handle the unrolled
version of the loop.
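
For illustration, the shape the vectorizer is left with can be approximated by
unrolling the inner loop of fun2 by hand. This is only a sketch: the name
fun2_unrolled is made up here, and the code is hand-written source, not what
cunrolli actually emits. It does show the eight separate products per outer
iteration, which correspond to the eight-term step in the dump above.

```c
#define TYPE int

/* Hand-unrolled approximation of fun2 (hypothetical name, not compiler
   output). Each outer iteration adds eight products to the accumulator. */
TYPE fun2_unrolled(const TYPE *x, const TYPE *y, unsigned int n)
{
  unsigned int i;
  TYPE dot = 0;

  for (i = 0; i < n / 8; i++)
    {
      dot += x[0] * y[0];
      dot += x[1] * y[1];
      dot += x[2] * y[2];
      dot += x[3] * y[3];
      dot += x[4] * y[4];
      dot += x[5] * y[5];
      dot += x[6] * y[6];
      dot += x[7] * y[7];
      x += 8;
      y += 8;
    }

  return dot;
}
```

Like fun2, this version ignores any trailing n % 8 elements, so it returns the
same result as fun2 for every n.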
Also observed on ARM NEON with TYPE == float.
--
Summary: Complete unrolling (inner) versus vectorization of reduction
Product: gcc
Version: 4.4.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: drow at gcc dot gnu dot org
GCC target triplet: x86_64-linux
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41881