https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69873

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
ICC uses a cut-off of 8 iterations and emits a vectorized loop unrolled by
four (four 16-byte stores, i.e. 8 elements per trip)

..B1.4:                         # Preds ..B1.4 ..B1.3
        addl      $8, %eax                                      #4.3
        movaps    %xmm0, a(,%rdx,8)                             #5.5
        cmpl      %ecx, %eax                                    #4.3
        movaps    %xmm0, 16+a(,%rdx,8)                          #5.5
        movaps    %xmm0, 32+a(,%rdx,8)                          #5.5
        movaps    %xmm0, 48+a(,%rdx,8)                          #5.5
        movl      %eax, %edx                                    #4.3
        jb        ..B1.4 

with a shared scalar epilogue (shared with the non-vectorized path)

..B1.8:                         # Preds ..B1.8 ..B1.7
        incl      %edx                                          #4.3
        movq      %rax, a(,%rcx,8)                              #5.5
        incl      %ecx                                          #4.3
        cmpl      %edi, %edx                                    #4.3
        jb        ..B1.8  

where at least that scalar loop, with its two IVs (%edx and %ecx), looks
sub-optimal.
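
For readers who prefer C to assembly, the control flow of the two loops above
can be sketched roughly as follows. This is only an illustration under
assumptions: the original source loop is not quoted in this comment, so the
array size N, the function name fill_icc_like and the fill value v are
hypothetical, and any alignment handling ICC may emit before the vector loop is
left out. The caller is assumed to pass n <= N.

```c
enum { N = 64 };        /* assumed size; the comment does not give one */
static double a[N];     /* stand-in for the global array `a` in the listing */

/* Fill a[0..n) with v, mirroring the loop structure in the ICC output. */
static void fill_icc_like(unsigned n, double v)
{
    unsigned i = 0;

    if (n >= 8) {                     /* cut-off at 8 iterations */
        unsigned limit = n & ~7u;     /* largest multiple of 8 that is <= n */
        for (; i < limit; i += 8) {
            /* Each line models one 16-byte movaps store of two doubles. */
            a[i + 0] = v; a[i + 1] = v;
            a[i + 2] = v; a[i + 3] = v;
            a[i + 4] = v; a[i + 5] = v;
            a[i + 6] = v; a[i + 7] = v;
        }
    }

    /* Scalar epilogue, shared with the n < 8 path. A single induction
       variable is enough here, unlike the two in the generated code. */
    for (; i < n; i++)
        a[i] = v;
}
```

With n = 19, for example, the unrolled loop would cover elements 0 to 15 and
the scalar epilogue the remaining three.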

This also shows that unrolling from within the vectorizer might be more
profitable, at least in terms of code size, than doing it afterwards, which
generates an additional vectorized epilogue for the remaining vectorized
iterations.
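
To make the code-size point concrete, here is a rough C sketch of the shape one
would expect when a 2-wide vectorized loop is unrolled by four only after
vectorization. It is a hypothetical illustration, not output from GCC or ICC:
the names arr, LEN, store2 and fill_unrolled_after are made up, store2 merely
models a 16-byte vector store, and the caller is assumed to pass n <= LEN.

```c
enum { LEN = 64 };          /* assumed size, for illustration only */
static double arr[LEN];

/* Models one 2-wide vector store (two doubles). */
static void store2(double *p, double v)
{
    p[0] = v;
    p[1] = v;
}

static void fill_unrolled_after(unsigned n, double v)
{
    unsigned i = 0;

    /* Unrolled vector loop: four vector stores, 8 elements per trip. */
    for (; i + 8 <= n; i += 8) {
        store2(&arr[i + 0], v);
        store2(&arr[i + 2], v);
        store2(&arr[i + 4], v);
        store2(&arr[i + 6], v);
    }

    /* Extra vectorized epilogue left over from unrolling: up to three
       remaining vector iterations. */
    for (; i + 2 <= n; i += 2)
        store2(&arr[i], v);

    /* Scalar epilogue of the vectorizer: at most one element. */
    for (; i < n; i++)
        arr[i] = v;
}
```

Compared with the ICC output, this shape carries three loops instead of two;
the middle one is the additional vectorized epilogue.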
