[Bug tree-optimization/104582] [11/12 Regression] Unoptimal code for __negdi2 (and others) from libgcc2 due to unwanted vectorization

rguenth at gcc dot gnu.org via Gcc-bugs Fri, 18 Feb 2022 03:31:56 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104582


--- Comment #15 from Richard Biener <rguenth at gcc dot gnu.org> ---
The patch will cause

FAIL: gcc.target/i386/pr91446.c scan-assembler-times vmovdqa[^\\n\\r]*xmm[0-9] 2
FAIL: gcc.target/i386/pr92658-avx512bw-2.c scan-assembler-times pmovsxdq 2
FAIL: gcc.target/i386/pr92658-sse4-2.c scan-assembler-times pmovsxbq 2
FAIL: gcc.target/i386/pr92658-sse4-2.c scan-assembler-times pmovsxdq 2
FAIL: gcc.target/i386/pr92658-sse4-2.c scan-assembler-times pmovsxwq 2
FAIL: gcc.target/i386/pr92658-sse4.c scan-assembler-times pmovzxbq 2
FAIL: gcc.target/i386/pr92658-sse4.c scan-assembler-times pmovzxdq 2
FAIL: gcc.target/i386/pr92658-sse4.c scan-assembler-times pmovzxwq 2
XPASS: gcc.target/i386/pr99881.c scan-assembler-not xmm[0-9]

I have to look into some of them.  The pr92658 ones seem to be cases like

void
bar_u32_u64 (v2di * dst, v4si src)
{
  unsigned long long tem[2];
  tem[0] = src[0];
  tem[1] = src[1];
  dst[0] = *(v2di *) tem;
}

where we fail to recognize the BIT_FIELD_REF as accessing a pre-existing
vector (we only support a subset of cases during SLP discovery):

  _1 = BIT_FIELD_REF <src_6(D), 32, 0>;
  _2 = (long long unsigned int) _1;
  tem[0] = _2;
  _3 = BIT_FIELD_REF <src_6(D), 32, 32>;
  _4 = (long long unsigned int) _3;
  tem[1] = _4;

but when vectorizing just the store and the conversion as

  <bb 2> [local count: 1073741824]:
  _1 = BIT_FIELD_REF <src_6(D), 32, 0>;
  _3 = BIT_FIELD_REF <src_6(D), 32, 32>;
  _13 = {_1, _3};
  vect__2.110_14 = (vector(2) long long unsigned int) _13;
  MEM <vector(2) long long unsigned int> [(long long unsigned int *)&tem] = vect__2.110_14;

we can recover things on the RTL side.
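
For anyone who wants to reproduce this outside the testsuite, here is a minimal self-contained sketch of the function quoted above. The typedefs are not shown in this comment, so the ones below are assumptions (unsigned element types, guessed from the name bar_u32_u64); the aligned attribute on tem and the helper check_bar_u32_u64 are also additions for illustration and are not part of the testcase.

```c
#include <assert.h>

/* Assumed typedefs: the comment does not show them.  Unsigned element
   types are a guess based on the function name bar_u32_u64.  */
typedef unsigned int v4si __attribute__ ((vector_size (16)));
typedef unsigned long long v2di __attribute__ ((vector_size (16)));

void
bar_u32_u64 (v2di * dst, v4si src)
{
  /* The 16-byte alignment is added here so that the vector load below
     is aligned even without optimization; the quoted function has no
     such attribute.  */
  unsigned long long tem[2] __attribute__ ((aligned (16)));
  tem[0] = src[0];   /* element extract, then widening conversion */
  tem[1] = src[1];
  dst[0] = *(v2di *) tem;
}

/* Hypothetical helper, not part of the testcase: returns 1 if the low
   two elements of src were zero-extended into dst.  */
int
check_bar_u32_u64 (void)
{
  v4si src = { 1u, 0xffffffffu, 3u, 4u };
  v2di dst = { 0, 0 };
  bar_u32_u64 (&dst, src);
  return dst[0] == 1ull && dst[1] == 0xffffffffull;
}
```

Compiling this with optimization and SSE4.1 enabled and looking for pmovzxdq in the assembly should show whether the two element extracts and conversions were combined into a single vector extension.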

So we just realize that costing is a difficult thing.

Cost model analysis:
_2 1 times scalar_store costs 12 in body
_4 1 times scalar_store costs 12 in body
(long long unsigned int) _1 1 times scalar_stmt costs 4 in body
(long long unsigned int) _3 1 times scalar_stmt costs 4 in body
(long long unsigned int) _1 1 times vector_stmt costs 4 in body
node 0x415e268 1 times vec_construct costs 20 in prologue
_2 1 times vector_store costs 16 in body
Cost model analysis for part in loop 0:
  Vector cost: 40
  Scalar cost: 32
not vectorized: vectorization is not profitable.

Note that this uses icelake-server costs, which have an unusually high
sse_to_integer cost.
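
To make the arithmetic in the dump explicit, the sketch below just sums the listed entries. The function and enumerator names are made up for illustration, and the profitability test is reduced to a plain comparison of the two totals, which is a simplifying assumption and not the vectorizer's actual logic.

```c
#include <assert.h>

/* Cost entries copied from the dump above.  */
enum
{
  SCALAR_STORE = 12,   /* per scalar_store, two of them */
  SCALAR_STMT = 4,     /* per scalar conversion, two of them */
  VECTOR_STMT = 4,     /* one vector conversion */
  VEC_CONSTRUCT = 20,  /* building {_1, _3}, charged to the prologue */
  VECTOR_STORE = 16    /* one vector store */
};

/* 12 + 12 + 4 + 4 */
int
scalar_cost (void)
{
  return 2 * SCALAR_STORE + 2 * SCALAR_STMT;
}

/* 4 + 20 + 16 */
int
vector_cost (void)
{
  return VECTOR_STMT + VEC_CONSTRUCT + VECTOR_STORE;
}

/* Simplifying assumption: profitable only if the vector total is
   strictly lower than the scalar total.  */
int
profitable (void)
{
  return vector_cost () < scalar_cost ();
}
```

The totals match the dump (40 against 32). The vec_construct entry alone accounts for half of the vector cost; the remaining vector entries sum to 20.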

The best fix here would of course be to recognize the BIT_FIELD_REF vector use.
