Here are two pretty straightforward ways to write the same operation (fun2 assumes n is a multiple of 8):

#define TYPE int

TYPE fun1(TYPE *x, TYPE *y, unsigned int n)
{
  int i, j;
  TYPE dot = 0;

  for (i = 0; i < n; i++)
    dot += *(x++) * *(y++);

  return dot;
}

TYPE fun2(TYPE *x, TYPE *y, unsigned int n)
{
  int i, j;
  TYPE dot = 0;

  for (i = 0; i < n / 8; i++)
    for (j = 0; j < 8; j++)
      dot += *(x++) * *(y++);

  return dot;
}

GCC 4.3 can vectorize both of them. GCC 4.4 can only vectorize fun1. I figure this is why:

reduc.c:17: note: === vect_analyze_scalar_cycles ===
reduc.c:17: note: Analyze phi: dot_103 = PHI <dot_110(5), 0(3)>
reduc.c:17: note: Access function of PHI: {0, +, ((((((D.1621_32 + D.1621_43) + D.1621_54) + D.1621_65) + D.1621_76) + D.1621_87) + D.1621_98) + D.1621_109}_1
reduc.c:17: note: step: ((((((D.1621_32 + D.1621_43) + D.1621_54) + D.1621_65) + D.1621_76) + D.1621_87) + D.1621_98) + D.1621_109, init: 0
reduc.c:17: note: step unknown.

The cunrolli pass (which there is no way to disable) has completely unrolled the inner loop. The vectorizer's SLP support cannot handle the unrolled version of the loop.

Also observed on ARM NEON with TYPE == float.

-- 
           Summary: Complete unrolling (inner) versus vectorization of reduction
           Product: gcc
           Version: 4.4.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: drow at gcc dot gnu dot org
GCC target triplet: x86_64-linux

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41881
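
For illustration, here is a hand-written approximation of what fun2 looks like once its inner loop has been completely unrolled. It is a sketch only, not the GIMPLE that cunrolli actually emits, and the names dot_simple and dot_unrolled8 are made up for this example. It shows why the step of dot in the dump above is a sum of eight loop-varying products rather than a single one.

```c
#define TYPE int

/* Plain reduction, same shape as fun1 in the report. */
TYPE dot_simple(const TYPE *x, const TYPE *y, unsigned int n)
{
  TYPE dot = 0;
  unsigned int i;

  for (i = 0; i < n; i++)
    dot += x[i] * y[i];

  return dot;
}

/* Hypothetical hand-unrolled form of fun2: the inner j loop is gone and
   each outer iteration adds eight products to dot.  Like fun2, it
   ignores the trailing n % 8 elements.  */
TYPE dot_unrolled8(const TYPE *x, const TYPE *y, unsigned int n)
{
  TYPE dot = 0;
  unsigned int i;

  for (i = 0; i < n / 8; i++)
    {
      dot += x[0] * y[0];
      dot += x[1] * y[1];
      dot += x[2] * y[2];
      dot += x[3] * y[3];
      dot += x[4] * y[4];
      dot += x[5] * y[5];
      dot += x[6] * y[6];
      dot += x[7] * y[7];
      x += 8;
      y += 8;
    }

  return dot;
}
```

When n is a multiple of 8 the two functions return the same value; otherwise dot_unrolled8 stops at the last full group of eight, just as fun2 does.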