[Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2
Hello, folks.
This is to discuss GCC's heuristics for choosing between predicated instructions
and branches, and to suggest that something probably needs to be improved.

[The story]
Weeks ago, I built a huffman encoding program with O2, O3, and PGO respectively.
This program is nothing special, just random code I found on the internet. You
can download it from http://cau.ac.kr/~bongbong/dsd08/huffman.c.

Build it with O2/O3/PGO (my GCC is 13.1):
$ gcc -O2 -march=native -g -o huffman huffman.c
$ gcc -O3 -march=native -g -o huffman.O3 huffman.c

$ gcc -O2 -march=native -g -fprofile-generate -o huffman.instrumented huffman.c
$ ./huffman.instrumented test.data
$ gcc -O2 -march=native -g -fprofile-use=huffman.instrumented.gcda -o huffman.pgo huffman.c

Run them on my 12900H laptop:
$ head -c 50M /dev/urandom > test.data
$ perf stat -r3 --table -- taskset -c 0 ./huffman test.data
$ perf stat -r3 --table -- taskset -c 0 ./huffman.O3 test.data
$ perf stat -r3 --table -- taskset -c 0 ./huffman.pgo test.data

The result (p-core, no ht, no turbo, performance mode):

                                              O2              O3             PGO
 cycles                            2,581,832,749   8,638,401,568   9,394,200,585
                                         (1.07s)         (3.49s)         (3.80s)
 instructions                     12,609,600,094  11,827,675,782  12,036,010,638
 branches                          2,303,416,221   2,671,184,833   2,723,414,574
 branch-misses                             0.00%           7.94%           8.84%
 cache-misses                          3,012,613       3,055,722       3,076,316
 L1-icache-load-misses                11,416,391      12,112,703      11,896,077
 icache_tag.stalls                     1,553,521       1,364,092       1,896,066
 itlb_misses.stlb_hit                      6,856          21,756          22,600
 itlb_misses.walk_completed               14,430           4,454          15,084
 baclears.any                            131,573         140,355         131,644
 int_misc.clear_resteer_cycles         2,545,915     586,578,125     679,021,993
 machine_clears.count                     22,235          39,671          37,307
 dsb2mite_switches.penalty_cycles      6,985,838      12,929,675       8,405,493
 frontend_retired.any_dsb_miss        28,785,677      28,161,724      28,093,319
 idq.dsb_cycles_any                1,986,038,896   5,683,820,258   5,971,969,906
 idq.dsb_uops                     11,149,445,952  26,438,051,062  28,622,657,650
 idq.mite_uops                       207,881,687     216,734,007     212,003,064

The above data shows:
o O3/PGO lead to a *2.3x/2.6x* performance drop relative to O2, respectively.
o O3/PGO reduced instructions by 6.2% and 4.5%. I attribute this to aggressive
  inlining.
o O3/PGO introduced very bad branch prediction. I will explain this later.
o Code built with O3 has higher iTLB misses but much lower sTLB misses. This is
  beyond my expectation.
o O3/PGO introduced 78% and 68% more machine clears. This is interesting and I
  don't know why. (The machine-clear subcategories are not measured yet.)
o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
o The idq.mite_uops of O3/PGO increased by about 4%, while idq.dsb_uops increased
  by more than 2x. The DSB hit well, so frontend fetching and decoding is not a
  problem for O3/PGO.
o The other events are mainly affected by the bad branch misprediction.

Additionally, here is the TMA level-2 analysis: the main changes in pipeline
slots are in the Bad Speculation and Frontend Bound categories. I doubt the
accuracy of tma_fetch_bandwidth given the frontend_retired.any_dsb_miss and
idq.mite_uops data above.
$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman test.data
test.data.huf is 1.00% of test.data

 Performance counter stats for 'taskset -c 0 ./huffman test.data':

   %  tma_branch_mispredicts     0.0
   %  tma_core_bound             0.8
   %  tma_heavy_operations      11.4
   %  tma_light_operations      76.8
   %  tma_memory_bound           2.0
   %  tma_fetch_bandwidth        8.3
   %  tma_fetch_latency          0.8
   %  tma_machine_clears         0.0

        1.073381357 seconds time elapsed

        0.945233000 seconds user
        0.095719000 seconds sys


$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.O3 test.data
test.data.huf is 1.00% of test.data

 Performance counter stats for 'taskset -c 0 ./huffman.O3 test.data':

   %  tma_branch_mispredicts    38.2
   %  tma_core_bound             6.6
   %  tma_heavy_operations       3.5
   %  tma_light_operations      21.7
   %  tma_memory_bound           0.9
   %  tma_fetch_bandwidth       20.9
   %  tma_fetch_latency          7.5
   %  tma_machine_clears         0.8

        3.501875873 seconds time elapsed

        3.378572000 seconds user
        0.084163000 seconds sys


$ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.pgo test.data
test.data.huf is 1.
Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2
On Mon, Jul 31, 2023 at 03:53:26PM +0200, Richard Biener wrote:
[snip]
> > The main difference in the compilation output around the mispredicted
> > branch is:
> > o In O2: a predicated instruction (cmov here) is selected to eliminate the
> >   branch. cmov is truly better than a branch here.
> > o In O3/PGO: bitout() is inlined into encode_file(), and a branch
> >   instruction is selected. But this branch is obviously *unpredictable* and
> >   the compiler doesn't know it. This is why O3/PGO are so bad for this
> >   program.
> >
> > GCC doesn't support __builtin_unpredictable(), which has been introduced by
> > llvm. Then I tried to see if __builtin_expect_with_probability(e, x, 0.5)
> > can serve the same purpose. The result is negative.
>
> But does it appear to be predictable with your profiling data?
>
I profiled the branch-misses event on a Kaby Lake machine. 99% of the
mispredictions are attributed to the encode_file() function.

$ sudo perf record -e branch-instructions:pp,branch-misses:pp -c 1000 -- taskset -c 0 ./huffman.O3 test.data

Samples: 197K of event 'branch-misses:pp', Event count (approx.): 197618000
Overhead  Command     Shared Object     Symbol
  99.58%  huffman.O3  huffman.O3        [.] encode_file
   0.12%  huffman.O3  [kernel.vmlinux]  [k] __x86_indirect_thunk_array
   0.11%  huffman.O3  libc-2.31.so      [.] _IO_getc
   0.01%  huffman.O3  [kernel.vmlinux]  [k] common_file_perm

Then annotate the encode_file() function:

Samples: 197K of event 'branch-misses:pp', 1000 Hz, Event count (approx.): 197618000
encode_file  /work/myWork/linux/pgo/huffman.O3 [Percent: local period]
Percent│          ↑ je     38
       │        bitout():
       │          current_byte <<= 1;
       │    70:   add    %edi,%edi
       │          if (b == '1') current_byte |= 1;
 48.70 │┌──       cmp    $0x31,%dl
 47.11 │├──       jne    7a
       ││         or     $0x1,%edi
       ││         nbits++;
       │   7a:└─→ inc    %eax
       │          if (b == '1') current_byte |= 1;
       │          mov    %edi,current_byte
       │          nbits++;
       │          mov    %eax,nbits
       │          if (nbits == 8) {
  1.16 │          cmp    $0x8,%eax
  3.03 │        ↓ je     a0
       │        encode_file():
       │          for (s=codes[ch]; *s; s++) bitout (outfile, *s);
       │          movzbl 0x1(%r13),%edx
       │          inc    %r13
       │          test   %dl,%dl
       │        ↑ jne    70
       │        ↑ jmp    38
       │          nop

--
Cheers,
Changbin Du
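For reference, here is a minimal C sketch of the hot path as I reconstruct it
from the annotated source lines above (the byte flush inside the nbits == 8
case is an assumption, since its body is not visible in the listing), plus a
branchless variant that usually lets the compiler emit setcc/or or cmov instead
of the unpredictable branch:

    #include <stdio.h>

    /* Reconstructed sketch of bitout(), based on the annotated source lines
     * above. The real huffman.c may differ; the flush of the completed byte
     * in the nbits == 8 case is an assumption.                              */
    static unsigned char current_byte;
    static int nbits;

    static void bitout(FILE *outfile, char b)
    {
        current_byte <<= 1;
        if (b == '1')            /* data-dependent, ~50/50: mispredicts badly */
            current_byte |= 1;
        nbits++;
        if (nbits == 8) {
            fputc(current_byte, outfile);
            current_byte = 0;
            nbits = 0;
        }
    }

    /* Branchless variant of the same update: the comparison result is used
     * as an arithmetic value, so the compiler normally emits setcc/or (or
     * cmov) rather than a conditional branch for the unpredictable test.    */
    static void bitout_branchless(FILE *outfile, char b)
    {
        current_byte = (unsigned char)((current_byte << 1) | (b == '1'));
        nbits++;
        if (nbits == 8) {
            fputc(current_byte, outfile);
            current_byte = 0;
            nbits = 0;
        }
    }

With the branchless form the code generation no longer depends on the
compiler's if-conversion heuristics, which is essentially what the O2 output
already achieves automatically with cmov.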
Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2
On Tue, Aug 01, 2023 at 10:44:02AM +0200, Jan Hubicka wrote:
> > > If I comment it out as above patch, then O3/PGO can get 16% and 12%
> > > performance improvement compared to O2 on x86.
> > >
> > >                              O2              O3             PGO
> > > cycles            2,497,674,824   2,104,993,224   2,199,753,593
> > > instructions     10,457,508,646   9,723,056,131  10,457,216,225
> > > branches          2,303,029,380   2,250,522,323   2,302,994,942
> > > branch-misses             0.00%           0.01%           0.01%
> > >
> > > The main difference in the compilation output around the mispredicted
> > > branch is:
> > > o In O2: a predicated instruction (cmov here) is selected to eliminate
> > >   the branch. cmov is truly better than a branch here.
> > > o In O3/PGO: bitout() is inlined into encode_file(), and a branch
> > >   instruction is selected. But this branch is obviously *unpredictable*
> > >   and the compiler doesn't know it. This is why O3/PGO are so bad for
> > >   this program.
> > >
> > > GCC doesn't support __builtin_unpredictable(), which has been introduced
> > > by llvm. Then I tried to see if __builtin_expect_with_probability(e, x, 0.5)
> > > can serve the same purpose. The result is negative.
> >
> > But does it appear to be predictable with your profiling data?
>
> Also one thing is that __builtin_expect and
> __builtin_expect_with_probability only affect the static branch
> prediction algorithm, so with profile feedback they are ignored on every
> branch executed at least once during the train run.
>
> Setting probability 0.5 is really not exactly the same as a hint that the
> branch will be mispredicted, since modern CPUs handle regularly behaving
> branches well (such as a branch firing on every even iteration of a loop).
>
Yeah. Setting probability 0.5 is just an experimental attempt. I don't know how
the heuristic works internally.

> So I think having the builtin is not a bad idea. I was thinking whether it
> makes sense to represent it within the profile_probability type and I am
> not convinced, since an "unpredictable probability" sounds conceptually
> odd and we would need to keep the flag intact over all the probability
> updates we do. For things like loop exits we recompute probabilities
> from frequencies after unrolling/vectorization and other things, and we
> would need to invent a new API to propagate the flag from the previous
> probability (which is not even part of the computation right now).
>
> So I guess the challenge is how to pass this info down through the
> optimization pipeline, since we would need to annotate gimple
> conds/switches and carry it down to the RTL level. On gimple we have flags
> and on the RTL level notes, so there is space for it, but we would need to
> maintain the info through CFG changes.
>
> Auto-FDO may be an interesting way to detect such branches.
>
So I suppose PGO could as well. But a branch instruction is selected in my test,
just as with O3, and the data shows that cmov works better than a branch here.

> Honza
>
> > > I think we could come to the conclusion that there must be something that
> > > can be improved in GCC's heuristics for predicated instructions vs.
> > > branches, at least for O3 and PGO.
> > >
> > > And can we add __builtin_unpredictable() support to GCC? As usual it's
> > > hard for the compiler to detect unpredictable branches.
> > >
> > > --
> > > Cheers,
> > > Changbin Du

--
Cheers,
Changbin Du
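As a side note, here is a minimal, hypothetical sketch of what the two hints
discussed above look like in source form, applied to the reconstructed bitout()
hot path (this is only an illustration; huffman.c itself was not modified in
the measurements above):

    #include <stdio.h>

    /* Hypothetical annotated variant of the bitout() hot path.
     * __builtin_unpredictable() exists in Clang only (GCC does not provide
     * it); __builtin_expect_with_probability() exists in GCC >= 9 but, as
     * noted above, it only steers static prediction and is ignored for
     * branches covered by profile feedback.                                 */
    static unsigned char current_byte;
    static int nbits;

    static void bitout_hinted(FILE *outfile, char b)
    {
        current_byte <<= 1;
    #if defined(__clang__)
        /* Clang: mark the condition as unpredictable so the backend prefers
         * a conditional move / select over a branch.                        */
        if (__builtin_unpredictable(b == '1'))
            current_byte |= 1;
    #else
        /* GCC: the closest available hint; per the experiment in this
         * thread, a 0.5 probability did not change the generated code.      */
        if (__builtin_expect_with_probability(b == '1', 1, 0.5))
            current_byte |= 1;
    #endif
        nbits++;
        if (nbits == 8) {          /* assumed flush of the completed byte */
            fputc(current_byte, outfile);
            current_byte = 0;
            nbits = 0;
        }
    }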
Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2
On Mon, Jul 31, 2023 at 08:55:35PM +0800, Changbin Du wrote:
> The result (p-core, no ht, no turbo, performance mode):
>
>                                               O2              O3             PGO
>  cycles                            2,581,832,749   8,638,401,568   9,394,200,585
>                                          (1.07s)         (3.49s)         (3.80s)
>  instructions                     12,609,600,094  11,827,675,782  12,036,010,638
>  branches                          2,303,416,221   2,671,184,833   2,723,414,574
>  branch-misses                             0.00%           7.94%           8.84%
>  cache-misses                          3,012,613       3,055,722       3,076,316
>  L1-icache-load-misses                11,416,391      12,112,703      11,896,077
>  icache_tag.stalls                     1,553,521       1,364,092       1,896,066
>  itlb_misses.stlb_hit                      6,856          21,756          22,600
>  itlb_misses.walk_completed               14,430           4,454          15,084
>  baclears.any                            131,573         140,355         131,644
>  int_misc.clear_resteer_cycles         2,545,915     586,578,125     679,021,993
>  machine_clears.count                     22,235          39,671          37,307
>  dsb2mite_switches.penalty_cycles      6,985,838      12,929,675       8,405,493
>  frontend_retired.any_dsb_miss        28,785,677      28,161,724      28,093,319
>  idq.dsb_cycles_any                1,986,038,896   5,683,820,258   5,971,969,906
>  idq.dsb_uops                     11,149,445,952  26,438,051,062  28,622,657,650
>  idq.mite_uops                       207,881,687     216,734,007     212,003,064
>
> The above data shows:
> o O3/PGO lead to a *2.3x/2.6x* performance drop relative to O2, respectively.
> o O3/PGO reduced instructions by 6.2% and 4.5%. I attribute this to aggressive
>   inlining.
> o O3/PGO introduced very bad branch prediction. I will explain this later.
> o Code built with O3 has higher iTLB misses but much lower sTLB misses. This
>   is beyond my expectation.
> o O3/PGO introduced 78% and 68% more machine clears. This is interesting and I
>   don't know why. (The machine-clear subcategories are not measured yet.)

The MCs are caused by memory ordering conflicts and are attributable to the
kernel RCU lock in the I/O path, when ext4 tries to update its journal.

> o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
> o The idq.mite_uops of O3/PGO increased by about 4%, while idq.dsb_uops
>   increased by more than 2x. The DSB hit well, so frontend fetching and
>   decoding is not a problem for O3/PGO.
> o The other events are mainly affected by the bad branch misprediction.
>

--
Cheers,
Changbin Du
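If it helps the follow-up, the machine-clear breakdown can be measured
directly. A possible invocation (event names as reported by `perf list` on
recent Intel cores; availability depends on the exact CPU and perf version):

$ perf stat -e machine_clears.count,machine_clears.memory_ordering,machine_clears.smc \
      -- taskset -c 0 ./huffman.O3 test.data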
Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2
On Mon, Jul 31, 2023 at 08:55:35PM +0800, Changbin Du wrote:
> Hello, folks.
> This is to discuss GCC's heuristics for choosing between predicated
> instructions and branches, and to suggest that something probably needs to
> be improved.
>
> [The story]
> Weeks ago, I built a huffman encoding program with O2, O3, and PGO
> respectively. This program is nothing special, just random code I found on
> the internet. You can download it from
> http://cau.ac.kr/~bongbong/dsd08/huffman.c.
>
> Build it with O2/O3/PGO (my GCC is 13.1):
> $ gcc -O2 -march=native -g -o huffman huffman.c
> $ gcc -O3 -march=native -g -o huffman.O3 huffman.c
>
> $ gcc -O2 -march=native -g -fprofile-generate -o huffman.instrumented huffman.c
> $ ./huffman.instrumented test.data
> $ gcc -O2 -march=native -g -fprofile-use=huffman.instrumented.gcda -o huffman.pgo huffman.c
>
> Run them on my 12900H laptop:
> $ head -c 50M /dev/urandom > test.data
> $ perf stat -r3 --table -- taskset -c 0 ./huffman test.data
> $ perf stat -r3 --table -- taskset -c 0 ./huffman.O3 test.data
> $ perf stat -r3 --table -- taskset -c 0 ./huffman.pgo test.data
>
> The result (p-core, no ht, no turbo, performance mode):
>
>                                               O2              O3             PGO
>  cycles                            2,581,832,749   8,638,401,568   9,394,200,585
>                                          (1.07s)         (3.49s)         (3.80s)
>  instructions                     12,609,600,094  11,827,675,782  12,036,010,638
>  branches                          2,303,416,221   2,671,184,833   2,723,414,574
>  branch-misses                             0.00%           7.94%           8.84%
>  cache-misses                          3,012,613       3,055,722       3,076,316
>  L1-icache-load-misses                11,416,391      12,112,703      11,896,077
>  icache_tag.stalls                     1,553,521       1,364,092       1,896,066
>  itlb_misses.stlb_hit                      6,856          21,756          22,600
>  itlb_misses.walk_completed               14,430           4,454          15,084
>  baclears.any                            131,573         140,355         131,644
>  int_misc.clear_resteer_cycles         2,545,915     586,578,125     679,021,993
>  machine_clears.count                     22,235          39,671          37,307
>  dsb2mite_switches.penalty_cycles      6,985,838      12,929,675       8,405,493
>  frontend_retired.any_dsb_miss        28,785,677      28,161,724      28,093,319
>  idq.dsb_cycles_any                1,986,038,896   5,683,820,258   5,971,969,906
>  idq.dsb_uops                     11,149,445,952  26,438,051,062  28,622,657,650
>  idq.mite_uops                       207,881,687     216,734,007     212,003,064
>
> The above data shows:
> o O3/PGO lead to a *2.3x/2.6x* performance drop relative to O2, respectively.
> o O3/PGO reduced instructions by 6.2% and 4.5%. I attribute this to aggressive
>   inlining.
> o O3/PGO introduced very bad branch prediction. I will explain this later.
> o Code built with O3 has higher iTLB misses but much lower sTLB misses. This
>   is beyond my expectation.
> o O3/PGO introduced 78% and 68% more machine clears. This is interesting and I
>   don't know why. (The machine-clear subcategories are not measured yet.)
> o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
> o The idq.mite_uops of O3/PGO increased by about 4%, while idq.dsb_uops
>   increased by more than 2x. The DSB hit well, so frontend fetching and
>   decoding is not a problem for O3/PGO.
> o The other events are mainly affected by the bad branch misprediction.
>
> Additionally, here is the TMA level-2 analysis: the main changes in pipeline
> slots are in the Bad Speculation and Frontend Bound categories. I doubt the
> accuracy of tma_fetch_bandwidth given the frontend_retired.any_dsb_miss and
> idq.mite_uops data above.
>
> $ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman test.data
> test.data.huf is 1.00% of test.data
>
>  Performance counter stats for 'taskset -c 0 ./huffman test.data':
>
>    %  tma_branch_mispredicts     0.0
>    %  tma_core_bound             0.8
>    %  tma_heavy_operations      11.4
>    %  tma_light_operations      76.8
>    %  tma_memory_bound           2.0
>    %  tma_fetch_bandwidth        8.3
>    %  tma_fetch_latency          0.8
>    %  tma_machine_clears         0.0
>
>         1.073381357 seconds time elapsed
>
>         0.945233000 seconds user
>         0.095719000 seconds sys
>
>
> $ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.O3 test.data
> test.data.huf is 1.00% of test.data
>
>  Performance counter stats for 'taskset -c 0 ./huffman.O3 test.data':
>
>    %  tma_branch_mispredicts    38.2
>    %  tma_core_bound             6.6
>    %  tma_heavy_operations       3.5
>    %  tma_light_operations      21.7
>    %  tma_memory_bound           0.9