[Bug tree-optimization/123126] [16 Regression] 6-10% slowdown of xalancbmk_r on Zen{4,5} since r16-6104-gb5c64db0a49d46 with PGO

rguenth at gcc dot gnu.org via Gcc-bugs Tue, 10 Feb 2026 04:14:27 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123126


--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
On Zen4 I can still reproduce an 8% slowdown compared to GCC 15.2 with -Ofast
-march=znver4 -flto with PGO.

    18.71%         88671  cpuxalan_r_peak  cpuxalan_r_peak.gcc7-m64  [.]
xalanc_1_10::XStringCached::~XStringCached()

vs.

     5.75%         25439  cpuxalan_r_peak  cpuxalan_r_peak.gcc7-m64  [.]
xalanc_1_10::XStringCached::~XStringCached()

but

     6.14%         27091  cpuxalan_r_peak  cpuxalan_r_peak.gcc7-m64  [.]
xalanc_1_10::ReusableArenaAllocator<xalanc_1_10::XalanDOMString>::destroyObject(xalanc_1_10::XalanDOMString*)
[clone .isra.0]

is now inlined.  So it's definitely difficult to analyze.  The stupid thing
with perf is that it doesn't seem to understand LTO debug info.

What's showing is apparent branch prediction issues, possibly due to high
branch density.

    339 │       mov          %rcx,%rbx
     31 │       cmp          0x8(%rax),%rdx
  12492 │       je           230
        │       lea          0x10(%rax),%rbx
     56 │       cmp          (%rbx),%rdx
  13683 │       je           230
      1 │       lea          0x10(%rcx),%rbx
    112 │       cmp          0x10(%rcx),%rdx
  12606 │       je           230
        │       lea          0x18(%rcx),%rax
    160 │360:   mov          %rax,%rbx
   2334 │       cmp          %rax,%rsi
        │       je           2b7
        │       cmp          (%rax),%rdx
  15186 │       jne          330
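
For readers who don't want to decode the annotate output: the samples pile up
on a run of back-to-back compare-and-branch pairs that look like a linear
search for one 8-byte value, with a few iterations peeled off in front of the
loop at 360. Below is a minimal C++ sketch of that shape. It is not the
xalancbmk source and not what GCC emits; Elem, find_plain and find_peeled are
hypothetical names made up for illustration, and the peel count of three is
only read off the cmp/je pairs above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical element type: an 8-byte value compared against one key,
// which is what the cmp (...),%rdx instructions suggest.
using Elem = const void*;

// Plain linear search: one compare-and-branch per element.
const Elem* find_plain(const Elem* first, const Elem* last, Elem key) {
    for (; first != last; ++first)
        if (*first == key)
            return first;
    return last;
}

// Same search with the first three iterations peeled out of the loop.
// Each peeled copy brings its own conditional branches, so several
// branches end up packed closely together ahead of the loop body.
const Elem* find_peeled(const Elem* first, const Elem* last, Elem key) {
    if (first == last) return last;
    if (*first == key) return first;   // peeled iteration 1
    ++first;
    if (first == last) return last;
    if (*first == key) return first;   // peeled iteration 2
    ++first;
    if (first == last) return last;
    if (*first == key) return first;   // peeled iteration 3
    ++first;
    for (; first != last; ++first)     // remaining iterations
        if (*first == key)
            return first;
    return last;
}
```

Both functions return the same pointer for any input; only the branch layout
differs, which is the property the profile above is pointing at.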


Btw, there is no difference in runtime when I add -fno-tree-vectorize to
the set of optimization options.  The exact same slowdown reproduces.
So it's also not peeling twice in CH, given we run CH again
before vectorization (but only if flag_tree_loop_vectorize).
