https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153
Nathan Kurz <nate at verse dot com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |nate at verse dot com

--- Comment #21 from Nathan Kurz <nate at verse dot com> ---
Created attachment 37629
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37629&action=edit
Test case using __builtin_expect()

I wondered if this might match some other effects I've seen recently, so I tested it on Skylake and Haswell with g++ 5.3.0. My current belief is that everything here is expected behavior, and that there is no bug in either the compiler or the processor.

The code spends most of its time in a tight loop whose branch behavior depends on runtime input, and the compiler has no way to know which branch is more likely. The addition of "count" changes the layout heuristic in recent compilers and, by chance, changes it for the worse. More directly, one can get the same effect with __builtin_expect(). Here are the (excerpted) results I see on Skylake:

nate@skylake:~/src/rust$ g++-5 -std=gnu++14 -O3 -Wall -g puzzlegen-expect.cc -o puzzlegen
nate@skylake:~/src/rust$ perf stat ./puzzlegen > /dev/null

        210.098864 task-clock (msec)      # 1.001 CPUs utilized
       711,412,710 cycles                 # 3.386 GHz
     1,930,668,439 instructions           # 2.71 insns per cycle
       625,004,194 branches               # 2974.810 M/sec
         2,920,721 branch-misses          # 0.47% of all branches

       0.209838807 seconds time elapsed

nate@skylake:~/src/rust$ g++-5 -std=gnu++14 -O3 -Wall -g puzzlegen-expect.cc -o puzzlegen -DUSE_EXPECT
nate@skylake:~/src/rust$ perf stat ./puzzlegen > /dev/null

        122.037566 task-clock (msec)      # 1.001 CPUs utilized
       413,227,074 cycles                 # 3.386 GHz
     1,930,678,854 instructions           # 4.67 insns per cycle
       625,011,748 branches               # 5121.470 M/sec
         2,952,030 branch-misses          # 0.47% of all branches

       0.121937547 seconds time elapsed

Both versions have the same number of instructions, branches, and branch-misses.
The version with USE_EXPECT is almost twice as fast because it executes almost twice as many instructions per cycle. GCC does a great job of producing a very tight loop that executes in one cycle per iteration:

 77.21 │210:┌─→add    $0x4,%rcx
  0.66 │    │  cmp    %rcx,%r8
       │    │↓ je     260
  4.20 │219:│  mov    (%rcx),%edi
  0.88 │    │  test   %r9d,%edi
       │    └──jne    210

The cmp/je and test/jne pairs each macro-fuse into a single µop, so the processor can execute the remaining 4 fused µops at one iteration per cycle. If the loop is rearranged so that the middle branch is taken, each iteration takes a minimum of 2 cycles instead. My suspicion was that we'd see GCC splitting the instructions in a way that prevents fusion (which I think it often does), but I think it's doing an excellent job here. It's possible there is a different effect in one of the other test cases, but I don't think the case that involves 'count' needs fixing.