https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Since r14-2007-g6f19cf7526168f we now vectorize the loop, but without SLP,
which means we get interleaving and a vectorization factor of 64.  Turning
off loop vectorization yields the following, which is now comparable to
what clang does.  Of course, the interleaving code produced by loop
vectorization is inefficient in the end ...

        .p2align 4
        .p2align 3
.L3:
        movq    %rax, %rdx
        movq    %rdi, %rax
        .p2align 4
        .p2align 3
.L4:
        vpinsrw $0, (%rax), %xmm0, %xmm0
        vmovss  (%rdx), %xmm1
        movzbl  2(%rax), %ecx
        addq    $4, %rdx
        addq    $4, %rax
        vpmovzxbd       %xmm0, %xmm0
        vmovsldup       %xmm1, %xmm4
        vcvtdq2ps       %xmm0, %xmm0
        vfmadd231ps     %xmm4, %xmm0, %xmm2
        vcvtsi2ssl      %ecx, %xmm5, %xmm0
        vfmadd231ss     %xmm0, %xmm1, %xmm3
        cmpq    %rsi, %rdx
        jne     .L4
        incl    %r9d
        movq    %r11, %rax
        addq    %rbx, %rdi
        addq    %rbp, %rsi
        cmpl    %r9d, %r10d
        je      .L2
        addq    %rbp, %r11
        jmp     .L3
        .p2align 4
        .p2align 3
.L2:
        vcvttps2dq      %xmm2, %xmm2
        vpmovdb %xmm2, %xmm2
        popq    %rbx
        .cfi_def_cfa_offset 16
        vcvttss2sil     %xmm3, %eax
        popq    %rbp
        .cfi_def_cfa_offset 8
        vpextrw $0, %xmm2, (%r8)
        movb    %al, 2(%r8)
        movb    $-1, 3(%r8)
        ret

The loop cost modeling looks like:

t.c:9:23: note:  Cost model analysis:
  Vector inside of loop cost: 1156
  Vector prologue cost: 24
  Vector epilogue cost: 5488
  Scalar iteration cost: 168
  Scalar outside cost: 32
  Vector outside cost: 5512
  prologue iterations: 0
  epilogue iterations: 32
  Calculated minimum iters for profitability: 33
t.c:9:23: note:    Runtime profitability threshold = 64
t.c:9:23: note:    Static estimate profitability threshold = 64

and we get a VF == 32 vectorized epilog as well:

t.c:9:23: note:  Cost model analysis: 
  Vector inside of loop cost: 620
  Vector prologue cost: 12
  Vector epilogue cost: 2752
  Scalar iteration cost: 168
  Scalar outside cost: 32 
  Vector outside cost: 2764
  prologue iterations: 0
  epilogue iterations: 16
  Calculated minimum iters for profitability: 17
t.c:9:23: note:    Runtime profitability threshold = 32
t.c:9:23: note:    Static estimate profitability threshold = 32

so at least we'll enter the BB SLP optimized scalar epilog in the likely case.
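
As a rough cross-check of the two cost model dumps above, the reported
numbers can be recomputed by hand.  The sketch below is an illustration
under assumptions, not a copy of the vectorizer source: the struct and
function names are hypothetical, and the break-even formula is one simple
model that happens to reproduce the "Vector outside cost", "Calculated
minimum iters for profitability" and threshold lines for both the VF == 64
main loop and the VF == 32 epilog.  Clamping the threshold to at least VF
plus the prologue iterations is likewise an assumption that matches the
dumps.

```c
#include <assert.h>

/* Hypothetical container mirroring the lines of the cost model dump. */
struct vect_costs {
    int vf;              /* vectorization factor */
    int vec_inside;      /* "Vector inside of loop cost" */
    int vec_prologue;    /* "Vector prologue cost" */
    int vec_epilogue;    /* "Vector epilogue cost" */
    int scalar_iter;     /* "Scalar iteration cost" */
    int scalar_outside;  /* "Scalar outside cost" */
    int prologue_iters;  /* "prologue iterations" */
    int epilogue_iters;  /* "epilogue iterations" */
};

/* "Vector outside cost" is the sum of prologue and epilogue cost. */
static int vec_outside_cost(const struct vect_costs *c)
{
    return c->vec_prologue + c->vec_epilogue;
}

/* Assumed break-even model: smallest iteration count at which the
   vector loop, including its outside overhead, beats the scalar loop. */
static int min_profitable_iters(const struct vect_costs *c)
{
    /* Cost saved by one vector iteration over VF scalar iterations. */
    int saving = c->scalar_iter * c->vf - c->vec_inside;
    assert(saving > 0);  /* otherwise vectorization never pays off */

    int overhead = (vec_outside_cost(c) - c->scalar_outside) * c->vf;
    int n = overhead
            - c->vec_inside * (c->prologue_iters + c->epilogue_iters);
    if (n <= 0)
        return 0;
    n /= saving;

    /* Round up while the scalar loop is still no more expensive. */
    if (c->scalar_iter * c->vf * n <= c->vec_inside * n + overhead)
        n++;
    return n;
}

/* Assumed clamp: the vector loop needs at least VF iterations plus
   any peeled prologue iterations to run once. */
static int profitability_threshold(const struct vect_costs *c)
{
    int n = min_profitable_iters(c);
    int lower_bound = c->vf + c->prologue_iters;
    return n < lower_bound ? lower_bound : n;
}
```

Fed with the main loop values (VF 64, inside cost 1156, prologue 24,
epilogue 5488, scalar iteration 168, scalar outside 32, 0 and 32 peeled
iterations) this gives 5512, 33 and 64; with the epilog values (VF 32, 620,
12, 2752, 168, 32, 0 and 16) it gives 2764, 17 and 32, matching the dumps.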
