https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80952
Martin Liška <marxin at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|missed-optimization         |
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-06-02
     Ever confirmed|0                           |1

--- Comment #3 from Martin Liška <marxin at gcc dot gnu.org> ---
Confirmed, it's caused by the addition of the -fprofile-update option in GCC 7.1:
https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html

By default (when one uses -pthread), it's -fprofile-update=atomic. The option
guarantees that the collected profile is not corrupted, as non-atomic updating
is racy. Using -fprofile-generate and -fprofile-update=single causes:

pr80952.cpp: In function ‘main._omp_fn.0’:
pr80952.cpp:40:1: error: corrupted profile info: profile data is not flow-consistent
 }
 ^
pr80952.cpp:40:1: error: corrupted profile info: number of executions for edge 3-4 thought to be 32
pr80952.cpp:40:1: error: corrupted profile info: number of executions for edge 3-13 thought to be -2
pr80952.cpp:40:1: error: corrupted profile info: number of executions for edge 11-8 thought to be -239929
pr80952.cpp:40:1: error: corrupted profile info: number of executions for edge 11-12 thought to be 247808373

Running perf confirms that locking is the bottleneck:

    0.00 :   4014a3:  lock addq $0x1,0x204024(%rip)   # 6054d0 <__gcov0.main._omp_fn.0+0x30>
   49.28 :   4014ac:  cmp    %ecx,%r8d
    0.00 :   4014af:  jl     4014c3 <main._omp_fn.0+0x93>
    0.00 :   4014b1:  lock addq $0x1,0x203ffe(%rip)   # 6054b8 <__gcov0.main._omp_fn.0+0x18>
         :          {
         :            if (dividend % divisor == 0) {
   50.12 :   4014ba:  mov    %esi,%eax
    0.01 :   4014bc:  cltd

Well, I planned to provide a profile update method where there will be
function-local counters that are merged into the global ones at function exit.
That would definitely help in this example.
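
To illustrate the idea of function-local counters merged at function exit, here is a minimal hand-written sketch. It is not what GCC emits and not the planned implementation: the names g_divisible_count, count_atomic_per_event and count_local_then_merge are hypothetical, and a single std::atomic stands in for one __gcov0 counter. The first function does one locked update per event, as in the perf output above. The second accumulates in a plain local variable and does a single atomic add when the function returns.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for one global profile counter (assumption:
// the real counters live in the __gcov0.* arrays seen in the perf output).
std::atomic<std::int64_t> g_divisible_count{0};

// Roughly what -fprofile-update=atomic amounts to for one counter:
// every event performs a locked read-modify-write on shared memory.
void count_atomic_per_event(int begin, int end, int divisor) {
    for (int dividend = begin; dividend < end; ++dividend) {
        if (dividend % divisor == 0) {
            g_divisible_count.fetch_add(1, std::memory_order_relaxed);
        }
    }
}

// Sketch of the proposed scheme: count in a function-local variable
// (no locking, no sharing between threads) and merge into the global
// counter once, at function exit.
void count_local_then_merge(int begin, int end, int divisor) {
    std::int64_t local_count = 0;
    for (int dividend = begin; dividend < end; ++dividend) {
        if (dividend % divisor == 0) {
            ++local_count;
        }
    }
    // Single atomic update per call instead of one per event.
    g_divisible_count.fetch_add(local_count, std::memory_order_relaxed);
}
```

If each thread of the parallel region ran something like count_local_then_merge on its own chunk of the range, the totals would be the same as with per-event atomic updates, while the number of locked instructions would drop from one per event to one per call.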