https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062
--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
Since r14-2007-g6f19cf7526168f we now vectorize the loop but without SLP,
which means we get interleaving and a vectorization factor of 64.

Turning off loop vectorization yields the following, which is now
comparable to what clang does.  Of course the loop vectorized
interleaving is inefficient in the end ...

        .p2align 4
        .p2align 3
.L3:
        movq    %rax, %rdx
        movq    %rdi, %rax
        .p2align 4
        .p2align 3
.L4:
        vpinsrw $0, (%rax), %xmm0, %xmm0
        vmovss  (%rdx), %xmm1
        movzbl  2(%rax), %ecx
        addq    $4, %rdx
        addq    $4, %rax
        vpmovzxbd       %xmm0, %xmm0
        vmovsldup       %xmm1, %xmm4
        vcvtdq2ps       %xmm0, %xmm0
        vfmadd231ps     %xmm4, %xmm0, %xmm2
        vcvtsi2ssl      %ecx, %xmm5, %xmm0
        vfmadd231ss     %xmm0, %xmm1, %xmm3
        cmpq    %rsi, %rdx
        jne     .L4
        incl    %r9d
        movq    %r11, %rax
        addq    %rbx, %rdi
        addq    %rbp, %rsi
        cmpl    %r9d, %r10d
        je      .L2
        addq    %rbp, %r11
        jmp     .L3
        .p2align 4
        .p2align 3
.L2:
        vcvttps2dq      %xmm2, %xmm2
        vpmovdb %xmm2, %xmm2
        popq    %rbx
        .cfi_def_cfa_offset 16
        vcvttss2sil     %xmm3, %eax
        popq    %rbp
        .cfi_def_cfa_offset 8
        vpextrw $0, %xmm2, (%r8)
        movb    %al, 2(%r8)
        movb    $-1, 3(%r8)
        ret

The loop cost modeling looks like

t.c:9:23: note:  Cost model analysis:
  Vector inside of loop cost: 1156
  Vector prologue cost: 24
  Vector epilogue cost: 5488
  Scalar iteration cost: 168
  Scalar outside cost: 32
  Vector outside cost: 5512
  prologue iterations: 0
  epilogue iterations: 32
  Calculated minimum iters for profitability: 33
t.c:9:23: note:    Runtime profitability threshold = 64
t.c:9:23: note:    Static estimate profitability threshold = 64

and we get a VF == 32 vectorized epilog as well:

t.c:9:23: note:  Cost model analysis:
  Vector inside of loop cost: 620
  Vector prologue cost: 12
  Vector epilogue cost: 2752
  Scalar iteration cost: 168
  Scalar outside cost: 32
  Vector outside cost: 2764
  prologue iterations: 0
  epilogue iterations: 16
  Calculated minimum iters for profitability: 17
t.c:9:23: note:    Runtime profitability threshold = 32
t.c:9:23: note:    Static estimate profitability threshold = 32

so at least we'll enter the BB SLP optimized scalar epilog in the likely
case.
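
As a sanity check on the two cost-model dumps above, the following Python sketch reproduces some of the reported figures from the others.  It is only a consistency check on the dumped numbers, not GCC's actual profitability formula: the helper names are made up for illustration, and the "break-even" rule (smallest iteration count at which the scalar cost reaches the vector outside cost) is an assumption that happens to match both dumps.

```python
# Consistency check on the numbers in the two cost-model dumps.
# NOTE: helper names are hypothetical and the break-even rule below is an
# assumption that matches these two dumps; it is not taken from GCC sources.

def vector_outside_cost(prologue_cost, epilogue_cost):
    """Vector cost paid once, outside the vector loop body."""
    return prologue_cost + epilogue_cost

def scalar_break_even(scalar_iter_cost, scalar_outside_cost, vec_outside_cost):
    """Smallest n with scalar_iter_cost * n + scalar_outside_cost >= vec_outside_cost."""
    overhead = vec_outside_cost - scalar_outside_cost
    return -(-overhead // scalar_iter_cost)  # integer ceiling division

# Values copied from the dumps (main loop with VF 64, epilog with VF 32).
dumps = [
    {"vf": 64, "prologue": 24, "epilogue": 5488,
     "scalar_iter": 168, "scalar_outside": 32},
    {"vf": 32, "prologue": 12, "epilogue": 2752,
     "scalar_iter": 168, "scalar_outside": 32},
]

for d in dumps:
    outside = vector_outside_cost(d["prologue"], d["epilogue"])
    min_iters = scalar_break_even(d["scalar_iter"], d["scalar_outside"], outside)
    print(f"VF {d['vf']}: vector outside cost {outside}, break-even iters {min_iters}")
```

For both dumps the computed outside cost and break-even count agree with the "Vector outside cost" and "Calculated minimum iters for profitability" lines, and in both cases the reported runtime threshold equals the VF rather than that smaller minimum.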