https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200
--- Comment #20 from Venkataramanan <venkataramanan.kumar at amd dot com> ---

I tried Intel SDE on mcf to get the dynamic execution counts of the hot blocks.

<snip>
.L98:
        jle     .L97
        cmpl    $2, %r9d
        jne     .L97
.L99:
<snip>

BLOCK: 7 PC: 0000000000403252 ICOUNT: 9064729840 EXECUTIONS: 4532364920 #BYTES: 5 %: 3.08 cumltv%: 55.5 FN: primal_bea_mpp IMG: /home/gccuser/work/GCC_Team/vekumar/ALU-tune/benchspec/CPU2006/429.mcf/build/build_base_gcc-notune.exe.0000/mcf
XDIS 0000000000403252: BASE 83FF02 cmp edi, 0x2
XDIS 0000000000403255: BASE 752A jnz 0x403281

BLOCK: 8 PC: 0000000000403250 ICOUNT: 6320686443 EXECUTIONS: 6320686443 #BYTES: 2 %: 2.15 cumltv%: 57.7 FN: primal_bea_mpp IMG: /home/gccuser/work/GCC_Team/vekumar/ALU-tune/benchspec/CPU2006/429.mcf/build/build_base_gcc-notune.exe.0000/mcf
XDIS 0000000000403250: BASE 7E2F jle 0x403281

When I swap the compares:

.L98:
        cmpl    $2, %r9d
        jne     .L97
        cmpq    $0, %rdi
        jle     .L97
.L99:

BLOCK: 4 PC: 0000000000403250 ICOUNT: 12641372886 EXECUTIONS: 6320686443 #BYTES: 5 %: 4.33 cumltv%: 46.3 FN: primal_bea_mpp IMG: /home/gccuser/work/GCC_Team/vekumar/ALU-tune/benchspec/CPU2006/429.mcf/build/build_base_gcc-notune.exe.0000/mcf
XDIS 0000000000403250: BASE 83FF02 cmp edi, 0x2
XDIS 0000000000403253: BASE 7542 jnz 0x403297

The block is not visible at all in the top 300 hot blocks of MCF, which shows that it is executed very rarely. We are spending more cycles in the regressing case.

        cmpq    $0, %rdi
        jle     .L97
.L99:

Next I tried profile-guided optimization on MCF. The compares are not swapped, but basic block reordering has happened: the hot block is placed after the compare. However, this does not improve the run time.

Pass1: -Ofast -march=znver1 -fprofile-generate
Pass2: -Ofast -march=znver1 -fprofile-use

(snip)
.L14:
        jle     .L13
        cmpl    $2, %edi
        je      .L16
.L13:                           <== hot block placed near
        addq    %r9, %rax
        cmpq    %rax, %r8
        jbe     .L12
(snip)

Compared to -Ofast -march=znver1:

(snip)
.L198:
        jle     .L197
        cmpl    $2, %r9d
        jne     .L197
.L199:                          <== cold block
        incq    %r15
        movq    %rdi, %r12
        movq    perm(,%r15,8), %r9
        sarq    $63, %r12
        movq    %rdi, 8(%r9)
        xorq    %r12, %rdi
        movq    %rax, (%r9)
        movq    %rdi, 16(%r9)
        subq    %r12, 16(%r9)
.L197:                          <== hot block
        addq    %rbx, %rax
        cmpq    %rax, %r8
        jbe     .L196
(snip)

Runtime is better only when I swap the compares.
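
For illustration only, the compare swap can be pictured at the C level. This is a minimal sketch, assuming the source condition has the form "value > 0 && ident == 2", as the cmpq/jle and cmpl/jne pairs above suggest. The function and parameter names are hypothetical and are not taken from the mcf sources.

```c
#include <stdbool.h>

/* Hypothetical stand-ins: `cost` plays the role of the 64-bit value in
 * %rdi, `ident` the 32-bit value in %r9d. */

/* Unswapped order: the signed 64-bit compare against zero is evaluated
 * first (cmpq $0 / jle), then the compare against 2 (cmpl $2 / jne). */
static bool test_cost_first(long long cost, int ident)
{
    return cost > 0 && ident == 2;
}

/* Swapped order: the 32-bit compare against 2 is evaluated first, and
 * the compare against zero only runs when the first test passes. */
static bool test_ident_first(long long cost, int ident)
{
    return ident == 2 && cost > 0;
}
```

Both orders return the same result because neither operand has side effects. Only the number of compares and branches executed on the common path differs, depending on which test fails more often.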