https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78427

            Bug ID: 78427
           Summary: missed optimization of loop condition
           Product: gcc
           Version: 6.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vanyacpp at gmail dot com
  Target Milestone: ---

Consider the following code:

void and_1(uint64_t const* a, uint64_t* b, size_t n)
{
    for (size_t i = 0; i != n; ++i)
       b[i] = a[i] & b[i];
}

GCC 6.2 (-O2, -mtune=haswell) generates the following code:

and_1:
        xorl    %eax, %eax
        testq   %rdx, %rdx
        je      .L6
.L5:
        movq    (%rdi,%rax,8), %rcx
        andq    %rcx, (%rsi,%rax,8)
        addq    $1, %rax
        cmpq    %rax, %rdx
        jne     .L5
.L6:
        ret

If we change iteration indices from [0..N) to [-N..0), the end condition of the
loop can be simplified. I would prefer the generated code to be like this:

and_1:
        leaq    (%rdi,%rdx,8), %rdi
        leaq    (%rsi,%rdx,8), %rsi
        negq    %rdx
        jnc     .L6
.L5:
        movq    (%rdi,%rdx,8), %rcx
        andq    %rcx, (%rsi,%rdx,8)
        addq    $1, %rdx
        jne     .L5
.L6:
        ret

As you can see the inner loop is 1 instruction shorter (4 instead of 5). Also
this code use CF after neg instruction to check for zero.

The result loop is slightly faster. On my haswell machine it is 2% faster (5956
ms instead of 6087 ms). I expect the gain to be bigger on a in-order CPUs, but
I can not verify this claim.

I believe this optimization can be applied in many cases of array iteration.
The exact performance gain will depend on the operations performed by the loop
body, but I don't see the cases when this transformation can hurt the
performance of the loop.

Reply via email to