Re: PR 94454, ICE in template machinery

2020-04-14 Thread Richard Biener via Gcc-bugs
On Tue, Apr 14, 2020 at 7:36 PM Nathan Sidwell wrote:
>
> Richard,
> I think 94454 is a P1.  We have an inconsistency between specialization
> hasher and equality.  arguments may compare equal but hash differently.

It doesn't seem to be a regression, so bug priority has no meaning ;)
But yeah, it looks like quite a bad bug worth fixing.
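
For concreteness, the invariant at stake is the usual hash-table contract:
any two entries that compare equal must also hash equal, otherwise a lookup
can probe the wrong bucket and silently miss an existing entry. A minimal
C++ sketch of that contract, using hypothetical stand-in types rather than
GCC's actual specialization table (which hashes tree nodes, not ints):

    #include <cassert>
    #include <cstddef>
    #include <functional>

    // Hypothetical stand-in for a template specialization record.
    struct spec_entry {
      int tmpl;   // the template being specialized
      int args;   // canonicalized template arguments
    };

    // Equality: what the table considers "the same specialization".
    struct spec_eq {
      bool operator()(const spec_entry &a, const spec_entry &b) const {
        return a.tmpl == b.tmpl && a.args == b.args;
      }
    };

    // Hash: must consume exactly the fields equality compares, in the same
    // canonical form. If it hashes something equality ignores (or vice
    // versa), equal entries can land in different buckets.
    struct spec_hash {
      std::size_t operator()(const spec_entry &e) const {
        std::size_t h = std::hash<int>()(e.tmpl);
        return h * 31 + std::hash<int>()(e.args);
      }
    };

    int main() {
      spec_entry a{1, 42}, b{1, 42};
      // The contract the PR reports as broken: eq(a, b) implies equal hashes.
      assert(spec_eq()(a, b) && spec_hash()(a) == spec_hash()(b));
    }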

Richard.

>
> nathan
> --
> Nathan Sidwell


Re: [Predicated Ins vs Branches] O3 and PGO result in 2x performance drop relative to O2

2023-07-31 Thread Richard Biener via Gcc-bugs
On Mon, Jul 31, 2023 at 2:57 PM Changbin Du via Gcc wrote:
>
> Hello, folks.
> This is to discuss GCC's heuristics for choosing between predicated
> instructions and branches, and probably something there needs to be improved.
>
> [The story]
> Weeks ago, I built a Huffman encoding program with O2, O3, and PGO
> respectively. This program is nothing special, just random code I found on
> the internet. You can download it from
> http://cau.ac.kr/~bongbong/dsd08/huffman.c.
>
> Build it with O2/O3/PGO (my GCC is 13.1):
> $ gcc -O2 -march=native -g -o huffman huffman.c
> $ gcc -O3 -march=native -g -o huffman.O3 huffman.c
>
> $ gcc -O2 -march=native -g -fprofile-generate -o huffman.instrumented huffman.c
> $ ./huffman.instrumented test.data
> $ gcc -O2 -march=native -g -fprofile-use=huffman.instrumented.gcda -o huffman.pgo huffman.c
>
> Run them on my 12900H laptop:
> $ head -c 50M /dev/urandom > test.data
> $ perf stat -r3 --table -- taskset -c 0 ./huffman test.data
> $ perf stat -r3 --table -- taskset -c 0 ./huffman.O3 test.data
> $ perf stat -r3 --table -- taskset -c 0 ./huffman.pgo test.data
>
> The result (p-core, no ht, no turbo, performance mode):
>
>                                   O2              O3              PGO
> cycles                            2,581,832,749   8,638,401,568   9,394,200,585
>                                   (1.07s)         (3.49s)         (3.80s)
> instructions                      12,609,600,094  11,827,675,782  12,036,010,638
> branches                          2,303,416,221   2,671,184,833   2,723,414,574
> branch-misses                     0.00%           7.94%           8.84%
> cache-misses                      3,012,613       3,055,722       3,076,316
> L1-icache-load-misses             11,416,391      12,112,703      11,896,077
> icache_tag.stalls                 1,553,521       1,364,092       1,896,066
> itlb_misses.stlb_hit              6,856           21,756          22,600
> itlb_misses.walk_completed        14,430          4,454           15,084
> baclears.any                      131,573         140,355         131,644
> int_misc.clear_resteer_cycles     2,545,915       586,578,125     679,021,993
> machine_clears.count              22,235          39,671          37,307
> dsb2mite_switches.penalty_cycles  6,985,838       12,929,675      8,405,493
> frontend_retired.any_dsb_miss     28,785,677      28,161,724      28,093,319
> idq.dsb_cycles_any                1,986,038,896   5,683,820,258   5,971,969,906
> idq.dsb_uops                      11,149,445,952  26,438,051,062  28,622,657,650
> idq.mite_uops                     207,881,687     216,734,007     212,003,064
>
>
> The above data shows:
>   o O3/PGO lead to *2.3x/2.6x* performance drops relative to O2, respectively.
>   o O3/PGO reduced instructions by 6.2% and 4.5%. I think this is due to
>     more aggressive inlining.
>   o O3/PGO introduced very bad branch prediction. I will explain it later;
>     a sketch of the pattern follows this list.
>   o Code built with O3 has higher iTLB misses but much lower sTLB misses.
>     This is beyond my expectation.
>   o O3/PGO introduced 78% and 68% more machine clears. This is interesting
>     and I don't know why. (The machine-clear subcategories are not measured
>     yet.)
>   o O3 has much higher dsb2mite_switches.penalty_cycles than O2/PGO.
>   o The idq.mite_uops of O3/PGO increased 4%, while idq.dsb_uops increased
>     more than 2x. The DSB hit rate is good, so frontend fetching and
>     decoding is not the problem for O3/PGO.
>   o Other events are mainly affected by the bad branch misprediction.
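
A minimal C++ sketch of the kind of pattern presumably at stake here (this
is illustrative, not the actual huffman.c source): a condition driven by
random data is hostile to a branch predictor, but costs nothing to predict
when folded into data flow as a predicated (cmov-style) operation.

    #include <cstddef>
    #include <cstdint>

    // Branch form: taken/not-taken follows a random input bit, so on
    // /dev/urandom data it mispredicts roughly half the time.
    std::uint64_t count_branchy(const unsigned char *buf, std::size_t n) {
      std::uint64_t ones = 0;
      for (std::size_t i = 0; i < n; ++i)
        if (buf[i] & 1)
          ++ones;
      return ones;
    }

    // Predicated form: the same decision expressed as data flow; compilers
    // typically emit cmov/setcc here, leaving no branch to mispredict.
    std::uint64_t count_branchless(const unsigned char *buf, std::size_t n) {
      std::uint64_t ones = 0;
      for (std::size_t i = 0; i < n; ++i)
        ones += buf[i] & 1;
      return ones;
    }

    int main() {
      unsigned char buf[] = {0x3f, 0x80, 0x01, 0xfe};
      // Both forms compute the same count; returns 0.
      return (int)(count_branchy(buf, 4) - count_branchless(buf, 4));
    }

The 0.00% branch-miss rate at O2 versus ~8% at O3/PGO in the table above is
consistent with the hot path being predicated in the O2 build and branchy in
the O3/PGO builds.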
>
> Additionally, here is the TMA level 2 analysis: the main changes in the
> pipeline slots are in the Bad Speculation and Frontend Bound categories.
> I doubt the accuracy of tma_fetch_bandwidth given the
> frontend_retired.any_dsb_miss and idq.mite_uops data above.
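
As a cross-check on that doubt: taking DSB coverage as
idq.dsb_uops / (idq.dsb_uops + idq.mite_uops) from the table above (a rough
measure that ignores microcode-sequencer uops), O2 gets about
11,149M / (11,149M + 208M) ~= 98%, and O3 about
26,438M / (26,438M + 217M) ~= 99%. Nearly all uops come from the DSB in
every build, which is hard to square with a large tma_fetch_bandwidth
fraction.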
>
> $ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman test.data
> test.data.huf is 1.00% of test.data
>
>  Performance counter stats for 'taskset -c 0 ./huffman test.data':
>
>   tma_branch_mispredicts     0.0 %
>   tma_core_bound             0.8 %
>   tma_heavy_operations      11.4 %
>   tma_light_operations      76.8 %
>   tma_memory_bound           2.0 %
>   tma_fetch_bandwidth        8.3 %
>   tma_fetch_latency          0.8 %
>   tma_machine_clears         0.0 %
>
>1.073381357 seconds time elapsed
>
>0.945233000 seconds user
>0.095719000 seconds sys
>
>
> $ perf stat --topdown --td-level=2 --cputype core -- taskset -c 0 ./huffman.O3 test.data
> test.data.huf is 1.00% of test.data
>
>  Performance counter stats for 'taskset -c 0 ./huffman.O3 test.data':
>
>   tma_branch_mispredicts    38.2 %
>   tma_core_bound             6.6 %
>   tma_heavy_operations       3.5 %
>   tma_light_operations      21.7 %
>   tma_memory_bound           0.9 %
>   tma_fetch_bandwidth       20.9 %