https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123126

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
On Zen4 I can still reproduce an 8% slowdown compared to GCC 15.2 with -Ofast
-march=znver4 -flto with PGO.

    18.71%         88671  cpuxalan_r_peak  cpuxalan_r_peak.gcc7-m64  [.] xalanc_1_10::XStringCached::~XStringCached()

vs.

     5.75%         25439  cpuxalan_r_peak  cpuxalan_r_peak.gcc7-m64  [.] xalanc_1_10::XStringCached::~XStringCached()

but

     6.14%         27091  cpuxalan_r_peak  cpuxalan_r_peak.gcc7-m64  [.] xalanc_1_10::ReusableArenaAllocator<xalanc_1_10::XalanDOMString>::destroyObject(xalanc_1_10::XalanDOMString*) [clone .isra.0]

is now inlined.  So it's definitely difficult to analyze.  The stupid thing
with perf is that it doesn't seem to understand LTO debug info.

What the profile shows looks like branch prediction issues, possibly due to
high branch density.

    339 │       mov          %rcx,%rbx
     31 │       cmp          0x8(%rax),%rdx
  12492 │       je           230
        │       lea          0x10(%rax),%rbx
     56 │       cmp          (%rbx),%rdx
  13683 │       je           230
      1 │       lea          0x10(%rcx),%rbx
    112 │       cmp          0x10(%rcx),%rdx
  12606 │       je           230
        │       lea          0x18(%rcx),%rax
    160 │360:   mov          %rax,%rbx
   2334 │       cmp          %rax,%rsi
        │       je           2b7
        │       cmp          (%rax),%rdx
  15186 │       jne          330
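
For readers who want a small source-level picture of this pattern: the
listing compares 8-byte memory operands against a fixed value in %rdx and
branches after each compare, which is what a peeled linear search over an
array of pointers tends to look like.  The sketch below is only an
assumption about the shape of such a loop, not the actual xalan source;
the name find_ptr and its signature are hypothetical.

```cpp
#include <cstddef>

// Hypothetical stand-in, NOT the xalan code: linear search for a pointer
// value in a contiguous array of pointers.  Each iteration is one load,
// one compare and one conditional branch, so peeling or unrolling the
// loop places several conditional branches within a few bytes of code.
const void* const* find_ptr(const void* const* first,
                            const void* const* last,
                            const void* key)
{
    for (; first != last; ++first) {
        if (*first == key)   // compare the stored pointer with the key
            return first;    // found: return the matching slot
    }
    return last;             // not found
}
```

Compiling such a function with and without peeling and comparing the
resulting compare/branch sequences is one way to check whether branch
density changes; whether it reproduces the slowdown seen here is not
established.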


Btw, there is no difference in runtime when I add -fno-tree-vectorize to
the set of optimization options.  The exact same slowdown reproduces.
So the cause is also not CH (loop header copying) peeling twice: we run
CH a second time before vectorization, but only if flag_tree_loop_vectorize
is set.
