https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67682
Bug ID: 67682
Summary: Missed vectorization: (another) straight-line memcpy/memset not vectorized when equivalent loop is
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: alalaw01 at gcc dot gnu.org
Target Milestone: ---
Target: aarch64
This code:
void
test (int*__restrict a, int*__restrict b)
{
  a[0] = b[0];
  a[1] = b[1];
  a[2] = b[2];
  a[3] = b[3];
  a[4] = 0;
  a[5] = 0;
  a[6] = 0;
  a[7] = 0;
}
is not vectorized; the -fdump-tree-slp-details dump shows:
test.c:4:13: note: Build SLP failed: different operation in stmt MEM[(int *)a_4(D) + 28B] = 0;
test.c:4:13: note: original stmt *a_4(D) = _3;
test.c:4:13: note: === vect_slp_analyze_data_ref_dependences ===
test.c:4:13: note: === vect_slp_analyze_operations ===
test.c:4:13: note: not vectorized: bad operation in basic block.
test.c:4:13: note: ***** Re-trying analysis with vector size 8
...
test.c:4:13: note: Build SLP failed: different operation in stmt MEM[(int *)a_4(D) + 28B] = 0;
test.c:4:13: note: original stmt *a_4(D) = _3;
test.c:4:13: note: === vect_slp_analyze_data_ref_dependences ===
test.c:4:13: note: === vect_slp_analyze_operations ===
test.c:4:13: note: not vectorized: bad operation in basic block.
(the failure with vector size 8 is expected, but vector size 4 should succeed)
Output is:
test:
ldp w4, w3, [x1]
ldp w2, w1, [x1, 8]
stp w4, w3, [x0]
stp w2, w1, [x0, 8]
stp wzr, wzr, [x0, 16]
stp wzr, wzr, [x0, 24]
ret
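For comparison, the hoped-for result corresponds roughly to one 128-bit copy
followed by one 128-bit zero store. The following is only a hand-written sketch
using GCC's generic vector extension; the names test_vec and v4si are
illustrative and do not appear in the testcase, and the code the vectorizer
would actually emit may differ:

```c
#include <string.h>

/* Four ints packed into one 16-byte vector (GCC vector extension).
   The typedef name is an assumption made for this sketch.  */
typedef int v4si __attribute__ ((vector_size (16)));

void
test_vec (int *__restrict a, int *__restrict b)
{
  v4si copy;
  v4si zero = { 0, 0, 0, 0 };

  /* memcpy expresses an unaligned vector load/store without
     assuming anything about the alignment of a or b.  */
  memcpy (&copy, b, sizeof copy);     /* load b[0..3]   */
  memcpy (a, &copy, sizeof copy);     /* store a[0..3]  */
  memcpy (a + 4, &zero, sizeof zero); /* zero a[4..7]   */
}
```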
Curiously, similar code that writes elements a[0..3] and a[5..8] (skipping
a[4]) is SLP-vectorized, producing superior code:
test:
ldr q0, [x1]
movi v1.4s, 0
str q1, [x0, 20]
str q0, [x0]
ret
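The source for that variant is not quoted above. A reconstruction, assuming
that only the indices of the zero stores change (the function name test_gap is
made up here to keep it distinct from the first testcase), would look like:

```c
/* Hypothetical reconstruction of the variant: same copy of b[0..3],
   but the zero stores go to a[5..8], leaving a[4] untouched.  */
void
test_gap (int *__restrict a, int *__restrict b)
{
  a[0] = b[0];
  a[1] = b[1];
  a[2] = b[2];
  a[3] = b[3];
  /* a[4] is deliberately not written.  */
  a[5] = 0;
  a[6] = 0;
  a[7] = 0;
  a[8] = 0;
}
```

The zero stores then start at byte offset 20, which matches the
"str q1, [x0, 20]" in the output above.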
Similarly, the loop form (equivalent to the first example) is vectorized:
void
test (int*__restrict a, int*__restrict b)
{
  for (int i = 0; i < 4; i++)
    a[i] = b[i];
  for (int i = 4; i < 8; i++)
    a[i] = 0;
}
producing:
test:
movi v0.4s, 0
ldp x2, x3, [x1]
stp x2, x3, [x0]
str q0, [x0, 16]
ret
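The claim that the loop form is equivalent to the straight-line form can be
checked with a small harness. This is a sketch only; the names test_straight,
test_loop and forms_agree are invented for the check and are not part of the
testcase:

```c
#include <string.h>

/* Straight-line form from the first testcase.  */
static void
test_straight (int *__restrict a, int *__restrict b)
{
  a[0] = b[0];
  a[1] = b[1];
  a[2] = b[2];
  a[3] = b[3];
  a[4] = 0;
  a[5] = 0;
  a[6] = 0;
  a[7] = 0;
}

/* Loop form from the last testcase.  */
static void
test_loop (int *__restrict a, int *__restrict b)
{
  for (int i = 0; i < 4; i++)
    a[i] = b[i];
  for (int i = 4; i < 8; i++)
    a[i] = 0;
}

/* Returns 1 if both forms leave identical contents in an 8-int
   destination for the given 4-int source, 0 otherwise.  */
int
forms_agree (int *b)
{
  int x[8], y[8];

  /* Pre-fill both buffers so that a missed store would show up.  */
  memset (x, 0xff, sizeof x);
  memset (y, 0xff, sizeof y);
  test_straight (x, b);
  test_loop (y, b);
  return memcmp (x, y, sizeof x) == 0;
}
```

Since both forms store the same eight values, any difference in the generated
code comes from how the vectorizer handles each form, not from the semantics.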