https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99881
Bug ID: 99881
Summary: Regression compare -O2 -ftree-vectorize with -O2 on SKX/CLX
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: crazylht at gmail dot com
Target Milestone: ---
The testcase is extracted from 557.xz_r:
void
foo (int* __restrict a, int n, int c)
{
  a[0] = n;
  a[1] = c;
}
gcc -O2 -ftree-vectorize -fvect-cost-model=very-cheap
foo(int*, int, int):
        movd    xmm0, esi
        movd    xmm1, edx
        punpckldq       xmm0, xmm1
        movq    QWORD PTR [rdi], xmm0
        ret
Without vectorization:
foo(int*, int, int):
        mov     DWORD PTR [rdi], esi
        mov     DWORD PTR [rdi+4], edx
        ret
Cost model:
scalar: 2 times scalar_store costs 24,
vector: 1 times unaligned_store costs 12, vec_construct costs 8
(20 in total, which is below 24, so the vector version is chosen).
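The comparison above can be sketched as a small standalone C model. This is
only an illustration of the arithmetic, not GCC's cost-model code: the
function and constant names are hypothetical, and the per-store scalar cost
of 12 is inferred from "2 times scalar_store costs 24".

```c
/* Costs taken from the report; all names here are hypothetical. */
enum {
  SCALAR_STORE_COST    = 12, /* inferred: 2 scalar stores cost 24 */
  UNALIGNED_STORE_COST = 12, /* one vector store, as quoted */
  VEC_CONSTRUCT_COST   = 8   /* building the 2-element vector, as quoted */
};

/* Total cost of n_stores separate scalar stores. */
int scalar_total (int n_stores)
{
  return n_stores * SCALAR_STORE_COST;
}

/* Total cost of one vector construct plus one unaligned vector store. */
int vector_total (void)
{
  return UNALIGNED_STORE_COST + VEC_CONSTRUCT_COST;
}

/* Nonzero when the vector version is strictly cheaper in this model. */
int vector_is_cheaper (int n_stores)
{
  return vector_total () < scalar_total (n_stores);
}
```

For the testcase, vector_is_cheaper (2) is nonzero because 20 < 24, which
matches the vectorized code shown above.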
I know that the current strategy of the cost model is to enable vectorization
as much as possible, but in the case above it hurts performance, because the
throughput of punpckldq is 1 on SKX/CLX, which becomes a bottleneck (znver2 is
ok). With -march=SKX, the second vmovd and the unpack are replaced by vpinsr,
and the regression gets worse, since vpinsr has a throughput of 2 on CLX/SKX.
So I'm thinking of adding an extra cost for 2-element vec_construct to prevent
the above vectorization while trying not to affect other vectorization
situations.
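A minimal sketch of how such an extra cost would change the decision for
this testcase, again as a standalone C model and not a patch to GCC. The
function names and the penalty value of 4 are assumptions for illustration;
the report does not propose a specific number. With the quoted costs, any
penalty of at least 4 raises the vector total from 20 to 24 or more, so it
is no longer strictly cheaper than the scalar total of 24.

```c
/* Costs quoted in the report; names and the penalty are hypothetical. */
enum {
  SCALAR_STORES_COST   = 24, /* 2 scalar stores, as quoted */
  VECTOR_STORE_COST    = 12, /* 1 unaligned vector store, as quoted */
  BASE_CONSTRUCT_COST  = 8,  /* vec_construct, as quoted */
  TWO_ELT_PENALTY      = 4   /* assumed value, not from the report */
};

/* vec_construct cost with an extra charge only for 2-element vectors,
   so that wider constructs are left unaffected. */
int construct_cost (int nelts)
{
  int cost = BASE_CONSTRUCT_COST;
  if (nelts == 2)
    cost += TWO_ELT_PENALTY;
  return cost;
}

/* Nonzero when the vector version is still strictly cheaper. */
int still_vectorized (int nelts)
{
  return VECTOR_STORE_COST + construct_cost (nelts) < SCALAR_STORES_COST;
}
```

In this model still_vectorized (2) is zero, so the two scalar stores are
kept. Whether a penalty of this size is right for real workloads, and how it
interacts with other 2-element cases, would need to be measured.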