https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104582
--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> ---
The patch will cause
FAIL: gcc.target/i386/pr91446.c scan-assembler-times vmovdqa[^\\n\\r]*xmm[0-9] 2
FAIL: gcc.target/i386/pr92658-avx512bw-2.c scan-assembler-times pmovsxdq 2
FAIL: gcc.target/i386/pr92658-sse4-2.c scan-assembler-times pmovsxbq 2
FAIL: gcc.target/i386/pr92658-sse4-2.c scan-assembler-times pmovsxdq 2
FAIL: gcc.target/i386/pr92658-sse4-2.c scan-assembler-times pmovsxwq 2
FAIL: gcc.target/i386/pr92658-sse4.c scan-assembler-times pmovzxbq 2
FAIL: gcc.target/i386/pr92658-sse4.c scan-assembler-times pmovzxdq 2
FAIL: gcc.target/i386/pr92658-sse4.c scan-assembler-times pmovzxwq 2
XPASS: gcc.target/i386/pr99881.c scan-assembler-not xmm[0-9]
I have to look into some of them. The pr92658 ones seem to be cases like
void
bar_u32_u64 (v2di * dst, v4si src)
{
unsigned long long tem[2];
tem[0] = src[0];
tem[1] = src[1];
dst[0] = *(v2di *) tem;
}
where we fail to recognize the BIT_FIELD_REF as accessing a pre-existing
vector (we only support a subset of cases during SLP discovery):
_1 = BIT_FIELD_REF <src_6(D), 32, 0>;
_2 = (long long unsigned int) _1;
tem[0] = _2;
_3 = BIT_FIELD_REF <src_6(D), 32, 32>;
_4 = (long long unsigned int) _3;
tem[1] = _4;
but when we vectorize just the store and the conversion, as in
<bb 2> [local count: 1073741824]:
_1 = BIT_FIELD_REF <src_6(D), 32, 0>;
_3 = BIT_FIELD_REF <src_6(D), 32, 32>;
_13 = {_1, _3};
vect__2.110_14 = (vector(2) long long unsigned int) _13;
MEM <vector(2) long long unsigned int> [(long long unsigned int *)&tem] = vect__2.110_14;
we can recover things on the RTL side.
So we are just reminded that costing is a difficult thing.
Cost model analysis:
_2 1 times scalar_store costs 12 in body
_4 1 times scalar_store costs 12 in body
(long long unsigned int) _1 1 times scalar_stmt costs 4 in body
(long long unsigned int) _3 1 times scalar_stmt costs 4 in body
(long long unsigned int) _1 1 times vector_stmt costs 4 in body
node 0x415e268 1 times vec_construct costs 20 in prologue
_2 1 times vector_store costs 16 in body
Cost model analysis for part in loop 0:
Vector cost: 40
Scalar cost: 32
not vectorized: vectorization is not profitable.
Note this uses icelake-server costs, which have an unusually high sse_to_integer
cost.
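To make the arithmetic in the dump above explicit, here is a small sketch that just adds up the listed costs. The struct and function names are hypothetical and are not GCC internals; the comparison assumes the vector total simply has to be lower than the scalar total, which may not match the vectorizer's exact check.

```c
#include <stddef.h>

/* Illustrative only: hypothetical names, not GCC code.
   Each entry mirrors one line of the cost dump above.  */
struct cost_line
{
  const char *desc;
  int cost;
};

static const struct cost_line scalar_lines[] = {
  { "_2 scalar_store", 12 },
  { "_4 scalar_store", 12 },
  { "(long long unsigned int) _1 scalar_stmt", 4 },
  { "(long long unsigned int) _3 scalar_stmt", 4 },
};

static const struct cost_line vector_lines[] = {
  { "(long long unsigned int) _1 vector_stmt", 4 },
  { "node vec_construct (prologue)", 20 },
  { "_2 vector_store", 16 },
};

/* Sum the cost column of N dump lines.  */
static int
sum_costs (const struct cost_line *lines, size_t n)
{
  int total = 0;
  for (size_t i = 0; i < n; i++)
    total += lines[i].cost;
  return total;
}

int
scalar_total (void)
{
  return sum_costs (scalar_lines, sizeof scalar_lines / sizeof scalar_lines[0]);
}

int
vector_total (void)
{
  return sum_costs (vector_lines, sizeof vector_lines / sizeof vector_lines[0]);
}

/* Assumed rule: vectorize only if the vector total is strictly lower.  */
int
vectorization_profitable (void)
{
  return vector_total () < scalar_total ();
}
```

The vec_construct entry alone accounts for half of the vector total, so the result depends heavily on how that one item is costed.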
The best fix here would of course be to recognize the BIT_FIELD_REF vector use.
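For trying the pr92658 case outside the testsuite, a self-contained sketch of the function quoted earlier follows. The typedefs are assumptions (unsigned 16-byte GNU vector types, chosen to match the zero-extending conversions in the GIMPLE) and are not copied from the testcase. The aligned attribute on tem is also an addition, made so that the vector load is safe when the function is called standalone.

```c
/* Assumed typedefs: GNU C vector extensions, 16 bytes wide, unsigned
   elements.  These are not taken from the testsuite file.  */
typedef unsigned int v4si __attribute__ ((vector_size (16)));
typedef unsigned long long v2di __attribute__ ((vector_size (16)));

void
bar_u32_u64 (v2di *dst, v4si src)
{
  /* The aligned attribute is not in the original; it is added here so
     the 16-byte vector load below is properly aligned.  */
  unsigned long long tem[2] __attribute__ ((aligned (16)));
  tem[0] = src[0];	/* zero-extend element 0 */
  tem[1] = src[1];	/* zero-extend element 1 */
  dst[0] = *(v2di *) tem;
}
```

With SSE4.1 enabled, the scan-assembler patterns listed above expect this kind of function to end up as a pmovzxdq.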