https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Since r14-2007-g6f19cf7526168f we now vectorize the loop, but without SLP,
which means we get interleaving and a vectorization factor of 64. Turning
off loop vectorization yields the following, which is now comparable to
what clang does. Of course the interleaving produced by loop vectorization
is inefficient in the end ...
        .p2align 4
        .p2align 3
.L3:
        movq %rax, %rdx
        movq %rdi, %rax
        .p2align 4
        .p2align 3
.L4:
        vpinsrw $0, (%rax), %xmm0, %xmm0
        vmovss (%rdx), %xmm1
        movzbl 2(%rax), %ecx
        addq $4, %rdx
        addq $4, %rax
        vpmovzxbd %xmm0, %xmm0
        vmovsldup %xmm1, %xmm4
        vcvtdq2ps %xmm0, %xmm0
        vfmadd231ps %xmm4, %xmm0, %xmm2
        vcvtsi2ssl %ecx, %xmm5, %xmm0
        vfmadd231ss %xmm0, %xmm1, %xmm3
        cmpq %rsi, %rdx
        jne .L4
        incl %r9d
        movq %r11, %rax
        addq %rbx, %rdi
        addq %rbp, %rsi
        cmpl %r9d, %r10d
        je .L2
        addq %rbp, %r11
        jmp .L3
        .p2align 4
        .p2align 3
.L2:
        vcvttps2dq %xmm2, %xmm2
        vpmovdb %xmm2, %xmm2
        popq %rbx
        .cfi_def_cfa_offset 16
        vcvttss2sil %xmm3, %eax
        popq %rbp
        .cfi_def_cfa_offset 8
        vpextrw $0, %xmm2, (%r8)
        movb %al, 2(%r8)
        movb $-1, 3(%r8)
        ret
The loop cost modeling looks like
t.c:9:23: note: Cost model analysis:
Vector inside of loop cost: 1156
Vector prologue cost: 24
Vector epilogue cost: 5488
Scalar iteration cost: 168
Scalar outside cost: 32
Vector outside cost: 5512
prologue iterations: 0
epilogue iterations: 32
Calculated minimum iters for profitability: 33
t.c:9:23: note: Runtime profitability threshold = 64
t.c:9:23: note: Static estimate profitability threshold = 64
and we get a VF == 32 vectorized epilog as well:
t.c:9:23: note: Cost model analysis:
Vector inside of loop cost: 620
Vector prologue cost: 12
Vector epilogue cost: 2752
Scalar iteration cost: 168
Scalar outside cost: 32
Vector outside cost: 2764
prologue iterations: 0
epilogue iterations: 16
Calculated minimum iters for profitability: 17
t.c:9:23: note: Runtime profitability threshold = 32
t.c:9:23: note: Static estimate profitability threshold = 32
so at least we'll enter the BB SLP optimized scalar epilog in the likely case.
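
For reference, the "Calculated minimum iters" and "Runtime profitability threshold" figures in both dumps can be reproduced from the other printed costs. The sketch below is modeled on my reading of vect_estimate_min_profitable_iters and is not a copy of it. The struct, the function names and the -1 return convention are made up for illustration, and it assumes the non-partial-vectors path.

```c
/* Costs as printed in the vectorizer dump. Field names are illustrative,
   not GCC's own. */
struct vect_costs {
    int vf;             /* vectorization factor */
    int vec_inside;     /* "Vector inside of loop cost" */
    int vec_outside;    /* "Vector outside cost" (prologue + epilogue) */
    int scalar_iter;    /* "Scalar iteration cost" */
    int scalar_outside; /* "Scalar outside cost" */
    int prologue_iters; /* "prologue iterations" */
    int epilogue_iters; /* "epilogue iterations" */
};

/* Smallest scalar iteration count for which the vector loop is estimated
   to be cheaper.  Returns -1 (a convention of this sketch only) when one
   vector iteration does not save anything over VF scalar iterations. */
int min_profitable_iters(const struct vect_costs *c)
{
    int saving_per_viter = c->scalar_iter * c->vf - c->vec_inside;
    if (saving_per_viter <= 0)
        return -1;

    /* Extra fixed cost of the vector version, scaled by VF. */
    int overhead = (c->vec_outside - c->scalar_outside) * c->vf;
    int n = overhead
            - c->vec_inside * (c->prologue_iters + c->epilogue_iters);
    if (n <= 0)
        return 0;

    n /= saving_per_viter;
    /* Integer division rounds down; bump by one if the scalar loop is
       still no more expensive at n iterations. */
    if (c->scalar_iter * c->vf * n <= c->vec_inside * n + overhead)
        n++;
    return n;
}

/* The vector loop should run at least once, so the threshold is
   never below VF plus the prologue iterations. */
int profitability_threshold(const struct vect_costs *c)
{
    int n = min_profitable_iters(c);
    int at_least = c->vf + c->prologue_iters;
    return n < at_least ? at_least : n;
}
```

With the VF == 64 numbers above (1156, 5512, 168, 32, 0, 32) this gives 33 and a threshold of 64. With the VF == 32 epilog numbers (620, 2764, 168, 32, 0, 16) it gives 17 and 32. In both cases the threshold is set by the VF, not by the cost comparison.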