https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99881

            Bug ID: 99881
           Summary: Regression compare -O2 -ftree-vectorize with -O2 on
                    SKX/CLX
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: crazylht at gmail dot com
  Target Milestone: ---

The testcase is extracted from 557.xz_r:

void
foo (int* __restrict a, int n, int c)
{
    a[0] = n;
    a[1] = c;
}

gcc -O2 -ftree-vectorize -fvect-cost-model=very-cheap

foo(int*, int, int):
        movd    xmm0, esi
        movd    xmm1, edx
        punpckldq       xmm0, xmm1
        movq    QWORD PTR [rdi], xmm0
        ret

without vectorization

foo(int*, int, int):
        mov     DWORD PTR [rdi], esi
        mov     DWORD PTR [rdi+4], edx
        ret

cost model:
scalar: 2 times scalar_store costs 24,
vector: 1 times unaligned_store costs 12, vec_construct 8
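
The comparison above is simple arithmetic and can be sketched as follows. This
is only an illustration: the helper names are hypothetical, the numbers are the
ones quoted in this report rather than values read from GCC's cost tables, and
it assumes the vector form is chosen whenever its summed cost is strictly lower
than the scalar cost.

```c
/* Hypothetical helpers; the costs passed in are the figures quoted in
   this report, not values taken from GCC's target cost tables.  */

/* Total cost of n_stores scalar stores.  */
static int
scalar_cost (int n_stores, int scalar_store_cost)
{
  return n_stores * scalar_store_cost;
}

/* Total cost of one unaligned vector store plus building the vector.  */
static int
vector_cost (int unaligned_store_cost, int vec_construct_cost)
{
  return unaligned_store_cost + vec_construct_cost;
}

/* Assumption: vectorize only when the vector form is strictly cheaper.  */
static int
vectorize_p (int scalar, int vector)
{
  return vector < scalar;
}
```

With the quoted figures, scalar_cost (2, 12) is 24 and vector_cost (12, 8) is
20, so under this assumption the vector form wins.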

I know that the current strategy of the cost model is to enable vectorization
as much as possible, but in the case above it hurts performance, because the
throughput of punpckldq is 1 on SKX/CLX, which becomes a bottleneck (znver2 is
ok). With -march=SKX, the second vmovd and the unpck are replaced by vpinsr,
and it regresses further since vpinsr has a throughput of 2 on CLX/SKX.

So I'm thinking of adding an extra cost for 2-element vec_construct to prevent
the above vectorization, while trying not to affect other vectorization
situations.
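
A rough sketch of what such an adjustment could look like, using the cost
figures quoted in this report. This is not GCC's implementation: the function
names and the penalty value are hypothetical, and it assumes vectorization is
rejected unless the vector cost is strictly below the scalar cost.

```c
/* Hypothetical sketch, not GCC code.  Charge an extra penalty only when
   a vector is constructed from exactly two elements; other element
   counts keep their base cost.  */
static int
construct_cost (int nelts, int base_cost, int two_elt_penalty)
{
  return nelts == 2 ? base_cost + two_elt_penalty : base_cost;
}

/* Assumption: the vector form is profitable only when the store cost
   plus the construction cost is strictly below the scalar cost.  */
static int
is_profitable (int scalar_cost, int store_cost, int construct)
{
  return store_cost + construct < scalar_cost;
}
```

With an illustrative penalty of 4, the 2-element construct costs 12, the vector
total becomes 24, and the testcase is no longer considered profitable against
the scalar cost of 24. A 4-element construct is still costed at its base value,
so other vectorization decisions are unchanged in this sketch.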
