https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78427
Bug ID: 78427 Summary: missed optimization of loop condition Product: gcc Version: 6.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: vanyacpp at gmail dot com Target Milestone: --- Consider the following code: void and_1(uint64_t const* a, uint64_t* b, size_t n) { for (size_t i = 0; i != n; ++i) b[i] = a[i] & b[i]; } GCC 6.2 (-O2, -mtune=haswell) generates the following code: and_1: xorl %eax, %eax testq %rdx, %rdx je .L6 .L5: movq (%rdi,%rax,8), %rcx andq %rcx, (%rsi,%rax,8) addq $1, %rax cmpq %rax, %rdx jne .L5 .L6: ret If we change iteration indices from [0..N) to [-N..0), the end condition of the loop can be simplified. I would prefer the generated code to be like this: and_1: leaq (%rdi,%rdx,8), %rdi leaq (%rsi,%rdx,8), %rsi negq %rdx jnc .L6 .L5: movq (%rdi,%rdx,8), %rcx andq %rcx, (%rsi,%rdx,8) addq $1, %rdx jne .L5 .L6: ret As you can see the inner loop is 1 instruction shorter (4 instead of 5). Also this code use CF after neg instruction to check for zero. The result loop is slightly faster. On my haswell machine it is 2% faster (5956 ms instead of 6087 ms). I expect the gain to be bigger on a in-order CPUs, but I can not verify this claim. I believe this optimization can be applied in many cases of array iteration. The exact performance gain will depend on the operations performed by the loop body, but I don't see the cases when this transformation can hurt the performance of the loop.