https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69873
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
ICC decides to have a cut-off at 8 iterations, having an unrolled
vectorized iteration

..B1.4:                         # Preds ..B1.4 ..B1.3
        addl      $8, %eax                                      #4.3
        movaps    %xmm0, a(,%rdx,8)                             #5.5
        cmpl      %ecx, %eax                                    #4.3
        movaps    %xmm0, 16+a(,%rdx,8)                          #5.5
        movaps    %xmm0, 32+a(,%rdx,8)                          #5.5
        movaps    %xmm0, 48+a(,%rdx,8)                          #5.5
        movl      %eax, %edx                                    #4.3
        jb        ..B1.4

with a shared scalar epilogue (shared with the non-vectorized path)

..B1.8:                         # Preds ..B1.8 ..B1.7
        incl      %edx                                          #4.3
        movq      %rax, a(,%rcx,8)                              #5.5
        incl      %ecx                                          #4.3
        cmpl      %edi, %edx                                    #4.3
        jb        ..B1.8

where at least that scalar loop with two IVs looks sub-optimal.

This also shows that unrolling from within the vectorizer might be more
profitable than doing it afterwards (generating an additional vectorized
epilogue for the remaining vectorized iterations).  Code-size wise, of
course.
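
For readers who prefer source over assembly, the loop structure ICC generates can be approximated by the C sketch below. It is only an illustration of the shape (a guarded, 8-elements-per-iteration main loop plus one scalar loop that serves both as epilogue and as the non-vectorized path). The function name fill_sketch, the array size N and the uint64_t element type are assumptions, not taken from the testcase, and the exact cut-off condition ICC uses is not visible in the listing, so n >= 8 is just one reading of it.

```c
#include <assert.h>
#include <stdint.h>

#define N 1024                      /* assumed array size, not from the report */
static uint64_t a[N];               /* assumed 8-byte elements (scale 8 in the asm) */

/* Hypothetical function mirroring the shape of the ICC output. */
void fill_sketch(uint64_t x, unsigned n)
{
    assert(n <= N);
    unsigned i = 0;

    if (n >= 8) {                   /* assumed cut-off: fewer than 8 iterations skip the main loop */
        unsigned limit = n & ~7u;   /* largest multiple of 8 not exceeding n */
        for (; i < limit; i += 8) {
            /* corresponds to ..B1.4: four 16-byte stores, 8 elements per iteration */
            a[i + 0] = x; a[i + 1] = x;
            a[i + 2] = x; a[i + 3] = x;
            a[i + 4] = x; a[i + 5] = x;
            a[i + 6] = x; a[i + 7] = x;
        }
    }

    /* corresponds to ..B1.8: scalar loop, also the whole loop when the main loop is skipped */
    for (; i < n; i++)
        a[i] = x;
}
```

Note that the sketch uses a single induction variable in the scalar loop, whereas the ICC output keeps two (%edx and %ecx). A call such as fill_sketch(7, 21) would run the main loop twice (16 elements) and leave 5 elements to the scalar loop.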