https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104582
--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> ---
The patch will cause

FAIL: gcc.target/i386/pr91446.c scan-assembler-times vmovdqa[^\\n\\r]*xmm[0-9] 2
FAIL: gcc.target/i386/pr92658-avx512bw-2.c scan-assembler-times pmovsxdq 2
FAIL: gcc.target/i386/pr92658-sse4-2.c scan-assembler-times pmovsxbq 2
FAIL: gcc.target/i386/pr92658-sse4-2.c scan-assembler-times pmovsxdq 2
FAIL: gcc.target/i386/pr92658-sse4-2.c scan-assembler-times pmovsxwq 2
FAIL: gcc.target/i386/pr92658-sse4.c scan-assembler-times pmovzxbq 2
FAIL: gcc.target/i386/pr92658-sse4.c scan-assembler-times pmovzxdq 2
FAIL: gcc.target/i386/pr92658-sse4.c scan-assembler-times pmovzxwq 2
XPASS: gcc.target/i386/pr99881.c scan-assembler-not xmm[0-9]

I have to look into some of them.  The pr92658 ones seem to be cases like

void bar_u32_u64 (v2di * dst, v4si src)
{
  unsigned long long tem[2];
  tem[0] = src[0];
  tem[1] = src[1];
  dst[0] = *(v2di *) tem;
}

where we fail to recognize the BIT_FIELD_REF as accessing a pre-existing
vector (we only support a subset of cases during SLP discovery):

  _1 = BIT_FIELD_REF <src_6(D), 32, 0>;
  _2 = (long long unsigned int) _1;
  tem[0] = _2;
  _3 = BIT_FIELD_REF <src_6(D), 32, 32>;
  _4 = (long long unsigned int) _3;
  tem[1] = _4;

but when vectorizing just the store and the conversion as

  <bb 2> [local count: 1073741824]:
  _1 = BIT_FIELD_REF <src_6(D), 32, 0>;
  _3 = BIT_FIELD_REF <src_6(D), 32, 32>;
  _13 = {_1, _3};
  vect__2.110_14 = (vector(2) long long unsigned int) _13;
  MEM <vector(2) long long unsigned int> [(long long unsigned int *)&tem] = vect__2.110_14;

we can recover things on the RTL side.  So we just realize that costing is
a difficult thing.
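For anyone wanting to reproduce the bar_u32_u64 case outside the testsuite,
here is a minimal self-contained sketch.  The two typedefs are an assumption:
the comment does not show them, and unsigned element types are chosen only
because the expected instruction is the zero-extending pmovzxdq.  The function
body is the one quoted above.

```c
/* Assumed typedefs (not shown in the comment): GCC vector extensions,
   16-byte vectors with unsigned elements so the widening zero-extends.  */
typedef unsigned int v4si __attribute__ ((vector_size (16)));
typedef unsigned long long v2di __attribute__ ((vector_size (16)));

/* Widen the low two 32-bit lanes of SRC to 64 bits and store them to DST,
   going through a scalar temporary array as in the quoted test case.  */
void
bar_u32_u64 (v2di *dst, v4si src)
{
  unsigned long long tem[2];
  tem[0] = src[0];
  tem[1] = src[1];
  dst[0] = *(v2di *) tem;
}
```

Compiling this for x86-64 with something like -O2 -msse4.1 and
-fdump-tree-slp-details should show whether the basic-block vectorizer picks
up the two lane extracts; the exact dump contents depend on the GCC revision
and the -mtune setting.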
Cost model analysis:
_2 1 times scalar_store costs 12 in body
_4 1 times scalar_store costs 12 in body
(long long unsigned int) _1 1 times scalar_stmt costs 4 in body
(long long unsigned int) _3 1 times scalar_stmt costs 4 in body
(long long unsigned int) _1 1 times vector_stmt costs 4 in body
node 0x415e268 1 times vec_construct costs 20 in prologue
_2 1 times vector_store costs 16 in body
Cost model analysis for part in loop 0:
  Vector cost: 40
  Scalar cost: 32
not vectorized: vectorization is not profitable.

Note that this uses icelake-server costs, which have an unusually high
sse_to_integer cost.  The best fix here would of course be to recognize the
BIT_FIELD_REF vector use.
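To make the arithmetic in the dump explicit, the sketch below sums the
per-statement costs listed above.  The function names are hypothetical, and
the final comparison is a simplification that only mirrors the outcome of
this one dump (40 versus 32); it is not the vectorizer's actual
profitability check.

```c
#include <stddef.h>

/* Costs copied from the dump: two scalar stores and two scalar
   conversions on the scalar side ...  */
static const int scalar_costs[] = { 12, 12, 4, 4 };
/* ... and one vector conversion, one vec_construct (prologue) and one
   vector store on the vector side.  */
static const int vector_costs[] = { 4, 20, 16 };

static int
sum_costs (const int *costs, size_t n)
{
  int total = 0;
  for (size_t i = 0; i < n; i++)
    total += costs[i];
  return total;
}

int
scalar_total (void)
{
  return sum_costs (scalar_costs,
                    sizeof scalar_costs / sizeof scalar_costs[0]);
}

int
vector_total (void)
{
  return sum_costs (vector_costs,
                    sizeof vector_costs / sizeof vector_costs[0]);
}

/* Hypothetical, simplified stand-in for the profitability decision:
   treat the vector form as worthwhile only if it is strictly cheaper.  */
int
looks_profitable (int vector_cost, int scalar_cost)
{
  return vector_cost < scalar_cost;
}
```

The vec_construct entry alone accounts for half of the vector total, which is
why recognizing the BIT_FIELD_REFs as a use of the existing vector, and so
avoiding the construct, would change the result.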