https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123126
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
On Zen4 I can still reproduce an 8% slowdown compared to GCC 15.2 with -Ofast
-march=znver4 -flto and PGO.
18.71% 88671 cpuxalan_r_peak cpuxalan_r_peak.gcc7-m64 [.] xalanc_1_10::XStringCached::~XStringCached()
vs.
5.75% 25439 cpuxalan_r_peak cpuxalan_r_peak.gcc7-m64 [.] xalanc_1_10::XStringCached::~XStringCached()
but
6.14% 27091 cpuxalan_r_peak cpuxalan_r_peak.gcc7-m64 [.] xalanc_1_10::ReusableArenaAllocator<xalanc_1_10::XalanDOMString>::destroyObject(xalanc_1_10::XalanDOMString*) [clone .isra.0]
is now inlined. So it's definitely difficult to analyze. The stupid thing
with perf is that it doesn't seem to understand LTO debug info.
What shows up are apparent branch prediction issues, possibly due to high
branch density:
  339 │      mov    %rcx,%rbx
   31 │      cmp    0x8(%rax),%rdx
12492 │      je     230
      │      lea    0x10(%rax),%rbx
   56 │      cmp    (%rbx),%rdx
13683 │      je     230
    1 │      lea    0x10(%rcx),%rbx
  112 │      cmp    0x10(%rcx),%rdx
12606 │      je     230
      │      lea    0x18(%rcx),%rax
  160 │360:  mov    %rax,%rbx
 2334 │      cmp    %rax,%rsi
      │      je     2b7
      │      cmp    (%rax),%rdx
15186 │      jne    330
Btw, there is no difference in runtime when I add -fno-tree-vectorize to
the set of optimization options. The exact same slowdown reproduces.
So it's also not loop header copying (CH) peeling twice, given that we
run CH again before vectorization only if flag_tree_loop_vectorize is set.
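
For readers who want a source-level picture of "high branch density": the
sketch below is an illustration only, assuming the hot loop is a linear
search for a pointer value. The names Elem, find_plain and find_peeled are
hypothetical and come from neither Xalan nor GCC. It is not a claim about
which pass produced the code above. It only shows how peeling a few
iterations of a search loop puts several data-dependent compare-and-branch
pairs back to back ahead of the loop.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical element type (assumption): the real code compares
// XalanDOMString pointers, here any object pointer will do.
using Elem = const void*;

// Plain linear search: one bounds check and one compare per iteration.
const Elem* find_plain(const Elem* first, const Elem* last, Elem key) {
    for (; first != last; ++first)
        if (*first == key)
            return first;
    return last;
}

// The same search with the first three iterations peeled by hand.
// Each peeled iteration contributes its own compare-and-branch, so the
// entry path contains several closely spaced conditional branches.
const Elem* find_peeled(const Elem* first, const Elem* last, Elem key) {
    if (first == last) return last;
    if (*first == key) return first;   // peeled iteration 1
    ++first;
    if (first == last) return last;
    if (*first == key) return first;   // peeled iteration 2
    ++first;
    if (first == last) return last;
    if (*first == key) return first;   // peeled iteration 3
    ++first;
    for (; first != last; ++first)     // remaining iterations
        if (*first == key)
            return first;
    return last;
}
```

Both functions return the same pointer for any input. Only the number and
placement of the conditional branches differ, which is the property that
matters for the branch predictor.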
