https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

Nathan Kurz <nate at verse dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nate at verse dot com

--- Comment #21 from Nathan Kurz <nate at verse dot com> ---
Created attachment 37629
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37629&action=edit
Test case using __builtin_expect()

I wondered if this might match some other effects I've seen recently, so I
tested this on Skylake and Haswell with g++ 5.3.0.  My current belief is that
everything here is expected behavior, and there is no bug with either the
compiler or processor.  

The code spends most of its time in a tight loop that depends on runtime input,
and the compiler doesn't have any reason to know which branch is more likely.  
The addition of "count" changes the heuristic in recent compilers, and by
chance, changes it for the worse.   

More directly, one can get the same effect with __builtin_expect().  Here are
the (excerpted) results I see on Skylake:  

nate@skylake:~/src/rust$ g++-5 -std=gnu++14 -O3 -Wall -g puzzlegen-expect.cc -o
puzzlegen
nate@skylake:~/src/rust$ perf stat ./puzzlegen > /dev/null
        210.098864      task-clock (msec)         #    1.001 CPUs utilized
       711,412,710      cycles                    #    3.386 GHz
     1,930,668,439      instructions              #    2.71  insns per cycle
       625,004,194      branches                  # 2974.810 M/sec
         2,920,721      branch-misses             #    0.47% of all branches
       0.209838807 seconds time elapsed

nate@skylake:~/src/rust$ g++-5 -std=gnu++14 -O3 -Wall -g puzzlegen-expect.cc -o
puzzlegen -DUSE_EXPECT
nate@skylake:~/src/rust$ perf stat ./puzzlegen > /dev/null
        122.037566      task-clock (msec)         #    1.001 CPUs utilized
       413,227,074      cycles                    #    3.386 GHz
     1,930,678,854      instructions              #    4.67  insns per cycle
       625,011,748      branches                  # 5121.470 M/sec
         2,952,030      branch-misses             #    0.47% of all branches
       0.121937547 seconds time elapsed

Both versions have the same number of instructions, branches, and
branch-misses.  The version with USE_EXPECT is almost twice as fast because it
executes almost twice as many instructions per cycle.  GCC does a great job of
producing a very tight loop that executes in one cycle per iteration: 

   77.21 │210:┌─→add    $0x4,%rcx 
    0.66 │    │  cmp    %rcx,%r8  
         │    │↓ je     260       
    4.20 │219:│  mov    (%rcx),%edi 
    0.88 │    │  test   %r9d,%edi   
         │    └──jne    210

The cmp/je and test/jne instructions fuse into on µop each, so that the
processor can execute the remaining 4 fused µops at an iteration per cycle. If
the loop is rearranged so that the middle branch is taken, each iteration takes
a minimum of 2 cycles instead.

My suspicion was that we'd see GCC splitting the instructions in a way that
prevented fusion (which I think it often does), but I think it's doing an
excellent job here.  It's possible there is a different effect in one of the
other test cases, but I don't think the case that involves 'count' needs
fixing.

Reply via email to