https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78855

            Bug ID: 78855
           Summary: -mtune=generic should keep cmp/jcc together. AMD and
                    Intel both macro-fuse
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86-64-*-*

-mtune=generic and -mtune=intel currently don't optimize for macro-fusion of
CMP/JCC or TEST/JCC.  They should, since it helps most CPUs from the last ~5
years or so, and I think barely hurts old / low-power (atom) ones at all.

int ffs_loop(unsigned *nums) {
    int total = 0;
    for (int i = 0; i < 1024; i++)
        total +=  __builtin_ffs(nums[i]);
    return total;
}

gcc7.0 20161113 -O3 produces:

        leaq    4096(%rdi), %rsi
        xorl    %eax, %eax
        movl    $-1, %ecx
.L2:
        bsfl    (%rdi), %edx
        cmove   %ecx, %edx
        addq    $4, %rdi

        cmpq    %rdi, %rsi         # can't macro-fuse: separated by LEA
        leal    1(%rdx,%rax), %eax
        jne     .L2                # loop branch
        ret

instead of this (with -mtune=haswell):
        ...
        leal    1(%rdx,%rax), %eax
        cmpq    %rdi, %rsi         # can macro-fuse with jne on AMD and Intel
        jne     .L2
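
For anyone not used to reading the BSF/CMOV idiom, here is a rough C model of
what the loop in both listings computes.  It is only a sketch: ffs_model and
ffs_loop_model are names made up for illustration (nothing in GCC), and it
assumes GCC's __builtin_ctz is available.  The two listings differ only in
where the LEA sits relative to CMP/JNE, so the same model covers both.

```c
/* Hypothetical helper: models the bsfl / cmove / "+1" part of the loop body. */
static int ffs_model(unsigned x)
{
    /* bsfl sets ZF when the source is zero; cmove then selects -1 (%ecx). */
    int idx = x ? __builtin_ctz(x) : -1;
    /* The displacement 1 in "leal 1(%rdx,%rax), %eax" turns the bit index
       into the 1-based ffs result (0 for a zero input). */
    return idx + 1;
}

/* Hypothetical model of the whole loop, written pointer-style like the asm. */
int ffs_loop_model(const unsigned *nums)
{
    const unsigned *end = nums + 1024;   /* leaq 4096(%rdi), %rsi */
    int total = 0;                       /* xorl %eax, %eax */
    do {
        total += ffs_model(*nums);       /* bsfl, cmove, leal */
        nums++;                          /* addq $4, %rdi */
    } while (nums != end);               /* cmpq %rdi, %rsi ; jne .L2 */
    return total;
}
```

The compare-and-branch at the bottom of the do/while is the pair that the
scheduler should keep adjacent so it can macro-fuse.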

Intel Nehalem and Sandybridge-family can macro-fuse that cmp/jne pair.  So can
AMD Bulldozer-family.  In 32-bit mode, Core2 can also macro-fuse that cmp/jcc.
(See Agner's microarch pdf: http://agner.org/optimize/).  Sandybridge-family
can even macro-fuse many ALU ops (like dec and sub) with some flavours of JCC.
AMD can only fuse TEST and CMP, but can do so with any JCC, even obscure ones
like JP.

Bizarrely, not even -mtune=intel tries to keep compare-and-branches together.

IMO, that should be enabled in -mtune=generic and -mtune=intel, and only
disabled in -mtune=atom, silvermont, k8, and k10.  (and other specific -mtune
options for even older CPUs).

The cost of doing the compare a couple of instructions later on CPUs that
don't support fusion might be a mispredict penalty that is a couple of cycles
longer, I think.  So I don't think we'd be hurting Atom a lot to help more
common CPUs a little.
