https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78855
            Bug ID: 78855
           Summary: -mtune=generic should keep cmp/jcc together. AMD and Intel both macro-fuse
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86-64-*-*

-mtune=generic and -mtune=intel currently don't optimize for macro-fusion of CMP/JCC or TEST/JCC. They should, since it helps most CPUs from the last ~5 years or so, and I think it barely hurts old / low-power (Atom) ones at all.

int ffs_loop(unsigned *nums) {
    int total = 0;
    for (int i = 0; i < 1024; i++)
        total += __builtin_ffs(nums[i]);
    return total;
}

gcc7.0 20161113 -O3 produces:

        leaq    4096(%rdi), %rsi
        xorl    %eax, %eax
        movl    $-1, %ecx
.L2:
        bsfl    (%rdi), %edx
        cmove   %ecx, %edx
        addq    $4, %rdi
        cmpq    %rdi, %rsi           # can't macro-fuse: separated by LEA
        leal    1(%rdx,%rax), %eax
        jne     .L2                  # loop branch
        ret

instead of this (with -mtune=haswell):

        ...
        leal    1(%rdx,%rax), %eax
        cmpq    %rdi, %rsi           # can macro-fuse with jne on AMD and Intel
        jne     .L2

Intel Nehalem and Sandybridge-family can macro-fuse that. So can AMD Bulldozer-family. In 32-bit mode, Core2 can also macro-fuse that cmp/jcc. (See Agner's microarch pdf: http://agner.org/optimize/).

Sandybridge-family can even macro-fuse many ALU ops (like dec and sub) with some flavours of JCC. AMD can only fuse TEST and CMP, but can do it with any JCC, even obscure ones like JP.

Bizarrely, not even -mtune=intel tries to keep compare-and-branches together.

IMO, keeping them together should be enabled in -mtune=generic and -mtune=intel, and only disabled in -mtune=atom, silvermont, k8, and k10 (and other specific -mtune options for even older CPUs).

On CPUs that don't support fusion, the cost of doing the compare a couple of instructions later might be a mispredict penalty that is a couple of cycles higher, I think. So I don't think we'd be hurting Atom a lot to help more common CPUs a little.
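
For anyone who wants to reproduce the comparison, here is a minimal sketch of a standalone test file built around the ffs_loop function from the report. Everything other than ffs_loop itself is an assumption added for illustration: the helper names ffs_ref, ffs_loop_ref, fill_test_input and self_check are hypothetical, and the input pattern is arbitrary. The helpers only confirm that the loop computes the same sum whichever -mtune setting is used; the scheduling difference itself has to be read from the generated assembly.

```c
/*
 * Suggested way to inspect the scheduling (only flags named in the report,
 * plus the standard -S and -o options):
 *
 *   gcc -O3 -mtune=generic -S ffs_loop.c -o generic.s
 *   gcc -O3 -mtune=haswell -S ffs_loop.c -o haswell.s
 *
 * Then check whether cmpq is immediately followed by jne in the .L2 loop.
 * The file name ffs_loop.c is an assumption.
 */
#include <assert.h>

/* Test function, copied from the report. */
int ffs_loop(unsigned *nums)
{
    int total = 0;
    for (int i = 0; i < 1024; i++)
        total += __builtin_ffs(nums[i]);
    return total;
}

/* Hypothetical portable reference: 1-based index of the lowest set bit,
 * or 0 when x is zero (same contract as __builtin_ffs). */
static int ffs_ref(unsigned x)
{
    if (x == 0)
        return 0;
    int pos = 1;
    while ((x & 1u) == 0) {
        x >>= 1;
        pos++;
    }
    return pos;
}

/* Hypothetical reference loop used only for checking results. */
int ffs_loop_ref(const unsigned *nums)
{
    int total = 0;
    for (int i = 0; i < 1024; i++)
        total += ffs_ref(nums[i]);
    return total;
}

/* Arbitrary input pattern: some zeros mixed with scrambled values. */
void fill_test_input(unsigned *nums)
{
    for (int i = 0; i < 1024; i++)
        nums[i] = (i % 7 == 0) ? 0u : (unsigned)i * 2654435761u;
}

/* Checks that the builtin-based loop agrees with the reference loop. */
void self_check(void)
{
    unsigned nums[1024];
    fill_test_input(nums);
    assert(ffs_loop(nums) == ffs_loop_ref(nums));
}
```

Calling self_check() from a small main() is enough to confirm the function behaves identically under both tunings; the interesting difference is the position of cmpq relative to jne in the two .s files.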