https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568

--- Comment #5 from Peter Cordes <peter at cordes dot ca> ---
And BTW, this only helps if the SUB and JNE are consecutive, which GCC
(correctly) doesn't currently optimize for with XOR.

If this sub/jne is different from a normal sub/branch and won't already get
optimized for macro-fusion, we may get even more benefit from this change by
teaching gcc to keep them adjacent.

GCC currently sometimes splits up the instructions like this:

        xorq    %fs:40, %rdx
        movl    %ebx, %eax
        jne     .L7

from gcc8.3 (but not 9.1 or trunk in this case) on https://godbolt.org/z/nNjQ8u


#include <random>
unsigned int get_random_seed() {
    std::random_device rd;
    return rd();
}

Even with -O3 -march=skylake.
That's not wrong because XOR can't macro-fuse, but the point of switching to
SUB is that it *can* macro-fuse into a single sub-and-branch uop on
Sandybridge-family.  So we might need to teach gcc about that.

So when you change this, please make it aware of optimizing for macro-fusion by
keeping the sub and jne back to back.  Preferably with tune=generic (because
Sandybridge-family is fairly widespread and it doesn't hurt on other CPUs), but
definitely with -mtune=intel or -mtune=sandybridge or later.

Nehalem and earlier can only macro-fuse test/cmp

The potential downside of putting it adjacent instead of 1 or 2 insns earlier
for uarches that can't macro-fuse SUB/JNE should be about zero on average. 
These branches should predict very well, and there are no in-order x86 CPUs
still being sold.  So it's mostly just going to be variations in fetch/decode
that help sometimes, hurt sometimes, like any code alignment change.

Reply via email to