https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119860
Bug ID: 119860
Summary: needless vector unrolling causes less profitable vectorization
Product: gcc
Version: unknown
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Blocks: 53947, 115130
Target Milestone: ---

Consider the following loop:

#define N 512
#define END 505

long long x[N] __attribute__((aligned(32)));

int __attribute__((noipa))
foo (void)
{
  for (unsigned int i = 0; i < END; ++i)
    {
      if (x[i] > 0)
        return 1;
    }
  return -1;
}

When vectorized with -O3 -march=armv8-a this produces:

.L2:
        add     v29.4s, v29.4s, v26.4s
        add     v28.4s, v28.4s, v27.4s
        cmp     x1, x0
        beq     .L15
.L4:
        ldp     q31, q30, [x0], 32
        cmgt    v31.2d, v31.2d, #0
        cmgt    v30.2d, v30.2d, #0
        orr     v31.16b, v31.16b, v30.16b
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x3, d31
        cbz     x3, .L2

which is suboptimal due to the forced 2x unroll factor.

This happens because the SLP tree rooted in the if with the vector IV works on a smaller type than the rest of the loop.  The vectorizer enforces the same total data size rather than the same VF across the different SLP instances, so we end up with V2DI vs V4SI, and the V2DI statements need to be unrolled.  It could instead have chosen V2DI and V2SI.

The unroll factor makes early break vectorization less profitable depending on the loop size, and it also prevents further optimizations.

Perhaps we should try to enforce a common VF rather than a common total vector size first?

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115130
[Bug 115130] [meta-bug] early break vectorization