https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119860
Bug ID: 119860
Summary: needless vector unrolling causes less profitable
vectorization
Product: gcc
Version: unknown
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Blocks: 53947, 115130
Target Milestone: ---
Consider the following loop:
#define N 512
#define END 505
long long x[N] __attribute__((aligned(32)));

int __attribute__((noipa))
foo (void)
{
  for (unsigned int i = 0; i < END; ++i)
    {
      if (x[i] > 0)
        return 1;
    }
  return -1;
}
When vectorized with -O3 -march=armv8-a this produces:
.L2:
        add     v29.4s, v29.4s, v26.4s
        add     v28.4s, v28.4s, v27.4s
        cmp     x1, x0
        beq     .L15
.L4:
        ldp     q31, q30, [x0], 32
        cmgt    v31.2d, v31.2d, #0
        cmgt    v30.2d, v30.2d, #0
        orr     v31.16b, v31.16b, v30.16b
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x3, d31
        cbz     x3, .L2
which is suboptimal due to the forced 2x unroll factor.
This happens because the SLP tree rooted in the early-break condition, which
carries the vector IV, works on a smaller element type than the rest of the
loop.  The vectorizer unifies the different SLP instances on total vector size
rather than on VF, so we end up with V2DI vs. V4SI, and the V2DI instance has
to be unrolled to match.
It could instead have chosen V2DI and V2SI.
The unroll factor ends up making the early-break vectorization less profitable
depending on the loop size, and also prevents further optimizations.
Perhaps we should try to enforce the VF rather than the total vector size
first?
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115130
[Bug 115130] [meta-bug] early break vectorization