http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60577

            Bug ID: 60577
           Summary: inefficient FDO instrumentation code
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: carrot at google dot com

This is actually a regression caused by r175916.

Compile the following code with options -O2 -fno-strict-aliasing
-fprofile-generate

struct thread_param
{
  long* buf;
  long iterations;
  long accesses;
} param;

void access_buf(struct thread_param* p)
{
  long i,j;
  long iterations = p->iterations;
  long accesses = p->accesses;
  for (i=0; i<iterations; i++)
    {
      long* pbuf = p->buf;
      for (j=0; j<accesses; j++)
        pbuf[j] += 1;
    }
}

Trunk gcc generates the following for the innermost loop:

.L9:
        addq    $1, __gcov0.access_buf(%rip)
        addq    $1, (%rax)
        addq    $8, %rax
        cmpq    %rdx, %rax
        jne     .L9

The FDO counter in memory is incremented in each iteration.

GCC at revision r175915 generates the following for the innermost loop:

        movq    .LPBX1(%rip), %rsi
        ...
.L4:
        addq    $1, (%rax)
        addq    $8, %rax
        cmpq    %rdx, %rax
        jne     .L4
        leaq    1(%rsi,%r9), %rsi
        ...
        movq    %rsi, .LPBX1(%rip)

The FDO counter doesn't add any overhead to the innermost loop.

GCC at revision r175916 generates the following for the innermost loop:

        movq    .LPBX1(%rip), %rcx
        xorl    %eax, %eax
        leaq    1(%rcx), %r8
        .p2align 4,,10
        .p2align 3
.L4:
        leaq    (%r8,%rax), %rcx
        movq    %rcx, .LPBX1(%rip)
        addq    $1, (%rdx,%rax,8)
        addq    $1, %rax
        cmpq    %rsi, %rax
        jne     .L4

The FDO counter is incremented and written to memory in each iteration.
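
For readers who prefer C to assembly, the two code shapes compared in this report can be sketched roughly as follows. This is only an illustration, not what the instrumentation pass actually emits: edge_counter, bump_in_loop and bump_after_loop are hypothetical names, and edge_counter is a plain global standing in for the real profile counter (__gcov0.access_buf / .LPBX1). The two functions are equivalent only under the assumption that pbuf never points at the counter.

```c
#include <stddef.h>

/* Hypothetical stand-in for the profile counter; not the real
   libgcov data structure. */
long edge_counter;

/* Shape of the slow code (trunk and r175916): the counter in memory
   is updated on every iteration of the inner loop. */
void bump_in_loop(long *pbuf, long accesses)
{
    for (long j = 0; j < accesses; j++) {
        edge_counter += 1;      /* memory update per iteration */
        pbuf[j] += 1;
    }
}

/* Shape of the fast code (r175915): the counter lives in a local
   variable while the loop runs and is stored back once.
   Assumes pbuf does not alias edge_counter. */
void bump_after_loop(long *pbuf, long accesses)
{
    long count = edge_counter;  /* one load before the loop */
    for (long j = 0; j < accesses; j++) {
        pbuf[j] += 1;
        count += 1;
    }
    edge_counter = count;       /* one store after the loop */
}
```

Both functions leave the same value in edge_counter and in the buffer; the only difference is how often the counter's memory location is touched inside the loop.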