https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568
--- Comment #5 from Peter Cordes <peter at cordes dot ca> ---

And BTW, this only helps if the SUB and JNE are consecutive, which GCC (correctly) doesn't currently optimize for with XOR.

If this sub/jne is different from a normal sub/branch and won't already get optimized for macro-fusion, we may get even more benefit from this change by teaching gcc to keep them adjacent.

GCC currently sometimes splits up the instructions like this:

    xorq    %fs:40, %rdx
    movl    %ebx, %eax
    jne     .L7

from gcc8.3 (but not 9.1 or trunk in this case) on https://godbolt.org/z/nNjQ8u

    #include <random>

    unsigned int get_random_seed()
    {
      std::random_device rd;
      return rd();
    }

Even with -O3 -march=skylake.

That's not wrong because XOR can't macro-fuse, but the point of switching to SUB is that it *can* macro-fuse into a single sub-and-branch uop on Sandybridge-family. So we might need to teach gcc about that.

So when you change this, please make it aware of optimizing for macro-fusion by keeping the sub and jne back to back. Preferably with -mtune=generic (because Sandybridge-family is fairly widespread and it doesn't hurt on other CPUs), but definitely with -mtune=intel or -mtune=sandybridge or later.

Nehalem and earlier can only macro-fuse test/cmp.

The potential downside of putting it adjacent instead of 1 or 2 insns earlier for uarches that can't macro-fuse SUB/JNE should be about zero on average. These branches should predict very well, and there are no in-order x86 CPUs still being sold. So it's mostly just going to be variations in fetch/decode that help sometimes, hurt sometimes, like any code alignment change.
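
To make the adjacency requirement concrete, here is a rough C++ sketch of the check described above. It is only a toy model, not anything from GCC: the names is_fusible_alu, is_je_or_jne and count_fused_pairs are hypothetical, instructions are reduced to their AT&T mnemonics, and the only rule modelled is "a fusible flag-setting instruction immediately followed by je/jne". Real Sandybridge-family hardware has further restrictions on operand forms and branch conditions that this ignores.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model (assumption): flag-setting instructions that Sandybridge-family
// can macro-fuse with a following je/jne. XOR is deliberately not listed.
bool is_fusible_alu(const std::string& mnemonic) {
    for (const char* base : {"cmp", "test", "sub", "add", "and", "inc", "dec"}) {
        const std::string b(base);
        if (mnemonic == b) return true;
        // Accept an optional AT&T operand-size suffix: b, w, l or q.
        if (mnemonic.size() == b.size() + 1 &&
            mnemonic.compare(0, b.size(), b) == 0 &&
            std::string("bwlq").find(mnemonic.back()) != std::string::npos)
            return true;
    }
    return false;
}

// Only the equal / not-equal branches are modelled, since those are the
// ones relevant to the sequence discussed in this comment.
bool is_je_or_jne(const std::string& mnemonic) {
    return mnemonic == "je" || mnemonic == "jne" ||
           mnemonic == "jz" || mnemonic == "jnz";
}

// Count positions where a fusible instruction is immediately followed by
// the branch. Any instruction scheduled in between prevents fusion.
int count_fused_pairs(const std::vector<std::string>& insns) {
    int fused = 0;
    for (std::size_t i = 0; i + 1 < insns.size(); ++i) {
        if (is_fusible_alu(insns[i]) && is_je_or_jne(insns[i + 1]))
            ++fused;
    }
    return fused;
}
```

In this model the gcc8.3 sequence xorq / movl / jne yields no fused pair, and neither does subq / movl / jne, because the movl still separates the two. Only movl / subq / jne, with the sub and jne back to back, counts as one fused pair.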