[Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2

2023-07-31 Thread Changbin Du via Gcc-bugs
Hello, folks.
This is to discuss GCC's heuristic strategy for choosing between predicated
instructions and branches, and to suggest that something probably needs to be
improved.

[The story]
Weeks ago, I built a Huffman encoding program with O2, O3, and PGO respectively.
The program is nothing special, just some random code I found on the internet.
You can download it from http://cau.ac.kr/~bongbong/dsd08/huffman.c.

Build it with O2/O3/PGO (My GCC is 13.1):
$ gcc -O2 -march=native -g -o huffman huffman.c
$ gcc -O3 -march=native -g -o huffman.O3 huffman.c

$ gcc -O2 -march=native -g -fprofile-generate -o huffman.instrumented huffman.c
$ ./huffman.instrumented test.data
$ gcc -O2 -march=native -g -fprofile-use=huffman.instrumented.gcda -o huffman.pgo huffman.c

Run them on my 12900H laptop:
$ head -c 50M /dev/urandom > test.data
$ perf stat  -r3 --table -- taskset -c 0 ./huffman test.data
$ perf stat  -r3 --table -- taskset -c 0 ./huffman.O3 test.data
$ perf stat  -r3 --table -- taskset -c 0 ./huffman.pgo test.data

The result (p-core, no ht, no turbo, performance mode):

                                  O2              O3              PGO
cycles                            2,581,832,749   8,638,401,568   9,394,200,585
                                  (1.07s)         (3.49s)         (3.80s)
instructions                      12,609,600,094  11,827,675,782  12,036,010,638
branches                          2,303,416,221   2,671,184,833   2,723,414,574
branch-misses                     0.00%           7.94%           8.84%
cache-misses                      3,012,613       3,055,722       3,076,316
L1-icache-load-misses             11,416,391      12,112,703      11,896,077
icache_tag.stalls                 1,553,521       1,364,092       1,896,066
itlb_misses.stlb_hit              6,856           21,756          22,600
itlb_misses.walk_completed        14,430          4,454           15,084
baclears.any                      131,573         140,355         131,644
int_misc.clear_resteer_cycles     2,545,915       586,578,125     679,021,993
machine_clears.count              22,235          39,671          37,307
dsb2mite_switches.penalty_cycles  6,985,838       12,929,675      8,405,493
frontend_retired.any_dsb_miss     28,785,677      28,161,724      28,093,319
idq.dsb_cycles_any                1,986,038,896   5,683,820,258   5,971,969,906
idq.dsb_uops                      11,149,445,952  26,438,051,062  28,622,657,650
idq.mite_uops                     207,881,687     216,734,007     212,003,064


Above data shows:
  o O3/PGO lead to a *2.3x/2.6x* performance drop relative to O2.
  o O3/PGO reduce the instruction count by 6.2% and 4.5%. I attribute this to
    aggressive inlining.
  o O3/PGO introduce very bad branch prediction. I will explain this later; a
    rough cost estimate follows this list.
  o Code built with O3 has more iTLB misses but far fewer sTLB misses
    (completed page walks). This is beyond my expectation.
  o O3/PGO introduce 78% and 68% more machine clears. This is interesting and
    I don't know why. (The machine-clear subcategories are not measured yet.)
  o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
  o The idq.mite_uops of O3/PGO increased by about 4%, while idq.dsb_uops
    increased 2x. The DSB still hits well, so frontend fetching and decoding
    is not the problem for O3/PGO.
  o The other events are mainly side effects of the bad branch misprediction.
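
As a rough back-of-the-envelope estimate (my own, assuming the commonly cited
~15-20 cycles of lost work per mispredict on a core of this class): for O3,
2,671,184,833 branches x 7.94% is about 212 million mispredicted branches,
which at 15-20 cycles each is roughly 3-4 billion cycles, i.e. the bulk of the
~6 billion extra cycles O3 spends compared to O2.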

Additionally, here is the TMA level 2 analysis. The main changes in the
pipeline slots are in the Bad Speculation and Frontend Bound categories. Given
the frontend_retired.any_dsb_miss and idq.mite_uops data above, I doubt the
accuracy of tma_fetch_bandwidth.

$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman test.data
test.data.huf is 1.00% of test.data

 Performance counter stats for 'taskset -c 0 ./huffman test.data':

    tma_branch_mispredicts     0.0 %
    tma_core_bound             0.8 %
    tma_heavy_operations      11.4 %
    tma_light_operations      76.8 %
    tma_memory_bound           2.0 %
    tma_fetch_bandwidth        8.3 %
    tma_fetch_latency          0.8 %
    tma_machine_clears         0.0 %

   1.073381357 seconds time elapsed

   0.945233000 seconds user
   0.095719000 seconds sys


$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.O3 test.data
test.data.huf is 1.00% of test.data

 Performance counter stats for 'taskset -c 0 ./huffman.O3 test.data':

    tma_branch_mispredicts    38.2 %
    tma_core_bound             6.6 %
    tma_heavy_operations       3.5 %
    tma_light_operations      21.7 %
    tma_memory_bound           0.9 %
    tma_fetch_bandwidth       20.9 %
    tma_fetch_latency          7.5 %
    tma_machine_clears         0.8 %

   3.501875873 seconds time elapsed

   3.378572000 seconds user
   0.084163000 seconds sys


$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.pgo test.data
test.data.huf is 1.

Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2

2023-08-01 Thread Changbin Du via Gcc-bugs
On Mon, Jul 31, 2023 at 03:53:26PM +0200, Richard Biener wrote:
[snip]
> > The main difference in the compilation output around the mispredicted
> > branch is:
> >   o In O2: a predicated instruction (cmov here) is selected to eliminate
> > the branch. cmov is truly better than a branch here.
> >   o In O3/PGO: bitout() is inlined into encode_file(), and a branch
> > instruction is selected. But this branch is obviously *unpredictable*, and
> > the compiler doesn't know it. This is why O3/PGO are so bad for this
> > program.
> >
> > GCC doesn't support __builtin_unpredictable(), which has been introduced
> > by LLVM. So I tried to see whether __builtin_expect_with_probability(e, x,
> > 0.5) can serve the same purpose. The result is negative.
> 
> But does it appear to be predictable with your profiling data?
>
I profiled the branch-misses event on a Kaby Lake machine. 99% of the
mispredictions are attributed to the encode_file() function.

$ sudo perf record -e branch-instructions:pp,branch-misses:pp -c 1000 -- taskset -c 0 ./huffman.O3 test.data

Samples: 197K of event 'branch-misses:pp', Event count (approx.): 197618000
Overhead  Command     Shared Object     Symbol
  99.58%  huffman.O3  huffman.O3        [.] encode_file
   0.12%  huffman.O3  [kernel.vmlinux]  [k] __x86_indirect_thunk_array
   0.11%  huffman.O3  libc-2.31.so      [.] _IO_getc
   0.01%  huffman.O3  [kernel.vmlinux]  [k] common_file_perm

Then annotate the encode_file() function:

Samples: 197K of event 'branch-misses:pp', 1000 Hz, Event count (approx.): 197618000
encode_file  /work/myWork/linux/pgo/huffman.O3 [Percent: local period]
Percent│ ↑ je  38
   │ bitout():
   │ current_byte <<= 1;
   │ 70:   add %edi,%edi
   │ if (b == '1') current_byte |= 1;
 48.70 │┌──cmp $0x31,%dl
 47.11 │├──jne 7a
   ││  or  $0x1,%edi
   ││nbits++;  
   │ 7a:└─→inc %eax
   │ if (b == '1') current_byte |= 1;
   │   mov %edi,current_byte
   │ nbits++;
   │   mov %eax,nbits
   │ if (nbits == 8) {
  1.16 │   cmp $0x8,%eax
  3.03 │ ↓ je  a0
   │ encode_file():
   │ for (s=codes[ch]; *s; s++) bitout (outfile, *s);
   │   movzbl  0x1(%r13),%edx
   │   inc %r13
   │   test%dl,%dl
   │ ↑ jne 70
   │ ↑ jmp 38
   │   nop
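
For reference, here is the C shape of that hot path, a minimal sketch
reconstructed from the source lines interleaved in the annotation above. The
flush path taken when nbits == 8 is not shown by perf, so its body below is my
assumption of what huffman.c does there.

#include <stdio.h>

static int current_byte;   /* bit accumulator, kept in %edi above  */
static int nbits;          /* bits collected so far, kept in %eax  */

static void bitout (FILE *f, char b)
{
    current_byte <<= 1;                /* add %edi,%edi                     */
    if (b == '1') current_byte |= 1;   /* cmp $0x31,%dl ; jne 7a: the
                                          mispredicted branch               */
    nbits++;                           /* inc %eax                          */
    if (nbits == 8) {                  /* cmp $0x8,%eax ; je a0             */
        fputc (current_byte, f);       /* assumed: flush the completed byte */
        nbits = 0;
        current_byte = 0;
    }
}

/* Caller side, inlined into encode_file() by O3/PGO; one call per '0'/'1'
   character of the Huffman code string:
       for (s = codes[ch]; *s; s++) bitout (outfile, *s);                   */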

-- 
Cheers,
Changbin Du


Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2

2023-08-01 Thread Changbin Du via Gcc-bugs
On Tue, Aug 01, 2023 at 10:44:02AM +0200, Jan Hubicka wrote:
> > > If I comment it out as in the above patch, then O3/PGO get a 16% and 12%
> > > performance improvement over O2 on x86.
> > >
> > >                  O2              O3              PGO
> > > cycles           2,497,674,824   2,104,993,224   2,199,753,593
> > > instructions     10,457,508,646  9,723,056,131   10,457,216,225
> > > branches         2,303,029,380   2,250,522,323   2,302,994,942
> > > branch-misses    0.00%           0.01%           0.01%
> > >
> > > The main difference in the compilation output around the mispredicted
> > > branch is:
> > >   o In O2: a predicated instruction (cmov here) is selected to eliminate
> > > the branch. cmov is truly better than a branch here.
> > >   o In O3/PGO: bitout() is inlined into encode_file(), and a branch
> > > instruction is selected. But this branch is obviously *unpredictable*,
> > > and the compiler doesn't know it. This is why O3/PGO are so bad for this
> > > program.
> > >
> > > GCC doesn't support __builtin_unpredictable(), which has been introduced
> > > by LLVM. So I tried to see whether __builtin_expect_with_probability(e,
> > > x, 0.5) can serve the same purpose. The result is negative.
> > 
> > But does it appear to be predictable with your profiling data?
> 
> Also one thing is that __builtin_expect and
> __builtin_expect_with_probability only affect the static branch
> prediction algorithm, so with profile feedback they are ignored on every
> branch executed at least once during the train run.
> 
> Setting probability 0.5 is really not exactly the same as a hint that the
> branch will be mispredicted, since modern CPUs handle regularly behaving
> branches well (such as a branch firing on every even iteration of a loop).
>
Yeah. Setting probability 0.5 is just an experimental attempt. I don't know
how the heuristic works internally.
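
The attempt looked roughly like the sketch below (illustrative only, not the
exact patch; set_bit_unpredictable() and set_bit_hinted() are made-up helper
names, not functions from huffman.c):

static int current_byte;

/* clang only: __builtin_unpredictable() marks the condition as unpredictable,
   so if-conversion (cmov) is preferred over a branch.  GCC currently rejects
   this builtin. */
#ifdef __clang__
static void set_bit_unpredictable (char b)
{
    if (__builtin_unpredictable (b == '1'))
        current_byte |= 1;
}
#endif

/* What I tried with GCC (the builtin exists since GCC 9): the closest
   available hint.  It only feeds the static branch prediction, is ignored on
   branches covered by profile feedback as Honza notes above, and it did not
   make GCC emit a cmov here. */
static void set_bit_hinted (char b)
{
    if (__builtin_expect_with_probability (b == '1', 1, 0.5))
        current_byte |= 1;
}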

> So I think having the builtin is not a bad idea.  I was thinking whether it
> makes sense to represent it within the profile_probability type and I am
> not convinced, since an "unpredictable probability" sounds conceptually
> odd and we would need to keep the flag intact over all the probability
> updates we do.  For things like loop exits we recompute probabilities
> from frequencies after unrolling/vectorization and other transforms, and we
> would need to invent a new API to propagate the flag from the previous
> probability (which is not even part of the computation right now).
> 
> So I guess the challenge is how to pass this info down through the
> optimization pipeline, since we would need to annotate gimple
> conds/switches and propagate it to the RTL level.  On gimple we have flags
> and on the RTL level notes, so there is space for it, but we would need to
> maintain the info through CFG changes.
> 
> Auto-FDO may be an interesting way to detect such branches.
> 
So I suppose PGO could do that as well. But in my test a branch instruction is
selected, just as with O3. And the data shows that cmov works better than a
branch here.
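
For this particular bit update there is also a way to side-step the heuristic
entirely: express the update arithmetically so that no branch is needed in the
first place. This is a hypothetical rewrite (push_bit() is a made-up name and
I have not benchmarked it), shown only to illustrate the cmov-friendly shape:

static int current_byte;

/* Branch-free form of the bit update in bitout().  The code strings contain
   only '0' (0x30) and '1' (0x31), which is what the cmp $0x31 in the earlier
   annotation tests for, so the low bit of the character is exactly the bit
   value; the compiler needs neither a branch nor a cmov for this. */
static void push_bit (char b)
{
    current_byte = (current_byte << 1) | (b & 1);
}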

> Honza
> > 
> > > I think we can conclude that there must be something that can be improved
> > > in GCC's heuristic strategy for predicated instructions vs. branches, at
> > > least for O3 and PGO.
> > >
> > > And can we add __builtin_unpredictable() support to GCC? It is usually
> > > hard for the compiler to detect unpredictable branches on its own.
> > >
> > > --
> > > Cheers,
> > > Changbin Du

-- 
Cheers,
Changbin Du


Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2

2023-08-01 Thread Changbin Du via Gcc-bugs
On Mon, Jul 31, 2023 at 08:55:35PM +0800, Changbin Du wrote:
> The result (p-core, no ht, no turbo, performance mode):
> [snip]
>   o O3/PGO introduce 78% and 68% more machine clears. This is interesting
> and I don't know why. (The machine-clear subcategories are not measured yet.)
The MCs are caused by memory ordering conflicts and are attributed to the
kernel RCU lock in the I/O path, when ext4 tries to update its journal.


-- 
Cheers,
Changbin Du


Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2

2023-08-11 Thread Changbin Du via Gcc-bugs
On Mon, Jul 31, 2023 at 08:55:35PM +0800, Changbin Du wrote:
> [snip]