https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

--- Comment #20 from Venkataramanan <venkataramanan.kumar at amd dot com> ---
I ran Intel SDE on mcf to get dynamic execution counts for the hot blocks.
<snip>
.L98:
        jle     .L97

        cmpl    $2, %r9d
        jne     .L97
.L99:
<snip>


BLOCK:     7   PC: 0000000000403252   ICOUNT: 9064729840   EXECUTIONS:
4532364920   #BYTES:  5   %:  3.08   cumltv%:  55.5  FN: primal_bea_mpp  IMG:
/home/gccuser/work/GCC_Team/vekumar/ALU-tune/benchspec/CPU2006/429.mcf/build/build_base_gcc-notune.exe.0000/mcf
XDIS 0000000000403252: BASE 83FF02                   cmp edi, 0x2
XDIS 0000000000403255: BASE 752A                     jnz 0x403281

BLOCK:     8   PC: 0000000000403250   ICOUNT: 6320686443   EXECUTIONS:
6320686443   #BYTES:  2   %:  2.15   cumltv%:  57.7  FN: primal_bea_mpp  IMG:
/home/gccuser/work/GCC_Team/vekumar/ALU-tune/benchspec/CPU2006/429.mcf/build/build_base_gcc-notune.exe.0000/mcf
XDIS 0000000000403250: BASE 7E2F                     jle 0x403281

When I swap the compares:

.L98:
        cmpl    $2, %r9d
        jne     .L97
        cmpq    $0, %rdi
        jle     .L97
.L99:

BLOCK:     4   PC: 0000000000403250   ICOUNT: 12641372886   EXECUTIONS:
6320686443   #BYTES:  5   %:  4.33   cumltv%:  46.3  FN: primal_bea_mpp  IMG:
/home/gccuser/work/GCC_Team/vekumar/ALU-tune/benchspec/CPU2006/429.mcf/build/build_base_gcc-notune.exe.0000/mcf
XDIS 0000000000403250: BASE 83FF02                   cmp edi, 0x2
XDIS 0000000000403253: BASE 7542                     jnz 0x403297

The second compare block (below) does not appear at all in the top 300 hot
blocks of mcf, which shows that it is executed very rarely. In the regressing
case we are spending more cycles.

        cmpq    $0, %rdi 
        jle     .L97           
.L99:
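
The counts quoted above can be cross-checked with a little arithmetic. The
sketch below only reuses the ICOUNT and EXECUTIONS values from the SDE output;
the variable and function names are made up for illustration. It counts
instructions, not cycles, and it ignores the second compare block of the
swapped version, whose count is not available because it is outside the top
300 blocks. Under those assumptions, the visible blocks execute 2744043397
fewer instructions after the swap, and in the original order about 71.7% of
the jle executions fall through to the second compare.

```python
# Counts copied from the SDE output above, as (ICOUNT, EXECUTIONS).
original_jle = (6320686443, 6320686443)    # BLOCK 8: jle
original_cmp2 = (9064729840, 4532364920)   # BLOCK 7: cmp $2 / jnz
swapped_cmp2 = (12641372886, 6320686443)   # BLOCK 4 after the swap: cmp $2 / jnz


def insns_per_execution(block):
    """Instructions executed each time the block is entered."""
    icount, executions = block
    return icount / executions


def fallthrough_percent(first, second):
    """Share of executions of `first` that continue into `second`."""
    return 100.0 * second[1] / first[1]


# Instructions spent in the two visible blocks of the original order.
original_icount = original_jle[0] + original_cmp2[0]

# Difference to the single visible block of the swapped order. The rarely
# executed second block of the swapped order is not included (count unknown).
saved = original_icount - swapped_cmp2[0]

print("instructions per execution, original cmp block:",
      insns_per_execution(original_cmp2))
print("instructions per execution, swapped cmp block:",
      insns_per_execution(swapped_cmp2))
print("original ICOUNT of both blocks:", original_icount)
print("ICOUNT difference after the swap:", saved)
print("jle fall-through in original order: %.1f%%"
      % fallthrough_percent(original_jle, original_cmp2))
```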

Next I tried profile-guided optimization on mcf. The compares are not
swapped, but basic block reordering has happened: the hot block is placed
right after the compare. However, this does not improve the run time.

  Pass1 -Ofast -march=znver1 -fprofile-generate
  Pass2 -Ofast -march=znver1 -fprofile-use

(Snip)
.L14:
        jle     .L13
        cmpl    $2, %edi
        je      .L16
.L13:                                                   <== hot block placed near
        addq    %r9, %rax
        cmpq    %rax, %r8
        jbe     .L12
(snip)

Compared to -Ofast -march=znver1:

(snip)
.L198:
        jle     .L197
        cmpl    $2, %r9d
        jne     .L197
.L199:                                                <== cold block 
        incq    %r15
        movq    %rdi, %r12
        movq    perm(,%r15,8), %r9
        sarq    $63, %r12
        movq    %rdi, 8(%r9)
        xorq    %r12, %rdi
        movq    %rax, (%r9)
        movq    %rdi, 16(%r9)
        subq    %r12, 16(%r9)
.L197:                                                 <== hot block 
        addq    %rbx, %rax
        cmpq    %rax, %r8
        jbe     .L196
(snip)

Runtime is better only when I swap the compares.
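
As a rough illustration of why the order of two short-circuited tests matters,
here is a minimal sketch on made-up data. It is not mcf's code: the (cost,
ident) pairs, the 70% and 1% rates, and all names are assumptions chosen only
to loosely echo the measurements above, where the first test usually passes
and the second one rarely does.

```python
def count_tests(items, first, second):
    """Evaluate `first(x) and second(x)` for every item with short-circuiting.

    Returns (first_evals, second_evals, hits).
    """
    first_evals = second_evals = hits = 0
    for item in items:
        first_evals += 1
        if not first(item):
            continue          # short-circuit: the second test is skipped
        second_evals += 1
        if second(item):
            hits += 1
    return first_evals, second_evals, hits


# Hypothetical data: cost > 0 for 7 items out of 10, ident == 2 for 1 in 100.
items = [((i % 10) - 2, 2 if i % 100 == 5 else 1) for i in range(1000)]


def cost_positive(item):
    return item[0] > 0


def ident_is_two(item):
    return item[1] == 2


# Original order: the frequently true test runs first.
cost_first = count_tests(items, cost_positive, ident_is_two)
# Swapped order: the rarely true test runs first.
ident_first = count_tests(items, ident_is_two, cost_positive)

print("cost test first:  first=%d second=%d hits=%d" % cost_first)
print("ident test first: first=%d second=%d hits=%d" % ident_first)
```

Both orders find the same hits, but putting the rarely true test first means
the second test is almost never evaluated.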
