[Bug target/119628] Need better mechanisms to manage register saves in callee for tail calls (inc. preserve_none for x86_64?)

kenjin4096 at gmail dot com via Gcc-bugs Tue, 15 Apr 2025 19:37:38 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119628


--- Comment #14 from Ken Jin <kenjin4096 at gmail dot com> ---
No speedup (within noise) with the latest patch over the previous patch, so
Andrew might be right about the register shuffling.
However, note that pystones is just one Python benchmark and not the
full benchmark suite we use (which takes very long to run), so I'm not sure
the results are fully representative. I also have to apologize for a previous
statement I made:

> I noticed it's still about 10% slower than clang-20 though.

This is only with LTO+PGO. With just LTO, GCC 15 is faster than Clang 20!
Interestingly, PGO *slows down* GCC 15 with tail calls and preserve_none on
CPython. So the gap might instead be a missed optimization in GCC 15 when PGO
and musttail are combined.

===

Results:

GCC15 with tail calls + NO preserve_none + LTO:
This machine benchmarks at 974700 pystones/second

GCC15 with tail calls + latest preserve_none patch + LTO:
This machine benchmarks at 1.02208e+06 pystones/second

GCC15 with tail calls + latest preserve_none patch + LTO + PGO:
This machine benchmarks at 917500 pystones/second

Clang-20 with tail calls + preserve_none + ThinLTO*:
This machine benchmarks at 962309 pystones/second

Clang-20 with tail calls + preserve_none + ThinLTO + PGO:
This machine benchmarks at 1.09019e+06 pystones/second

For reference:

GCC15 with indirect goto + LTO:
This machine benchmarks at 900922 pystones/second

GCC15 with indirect goto + LTO + PGO:
This machine benchmarks at 1.11972e+06 pystones/second

* ThinLTO is default clang policy for CPython
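
As a quick sanity check on the figures above, the relative differences can be worked out with a few lines of Python. This is only a sketch: the numbers are copied from the results listed above, while the dictionary keys and the pct_change helper are names made up for illustration.

```python
# pystones/second, copied from the Turbo Boost-enabled results above.
# The dictionary keys are shorthand labels invented for this sketch.
pystones = {
    "gcc_tail_lto": 974700,
    "gcc_tail_pn_lto": 1.02208e6,
    "gcc_tail_pn_lto_pgo": 917500,
    "clang_tail_pn_thinlto": 962309,
    "clang_tail_pn_thinlto_pgo": 1.09019e6,
    "gcc_goto_lto": 900922,
    "gcc_goto_lto_pgo": 1.11972e6,
}


def pct_change(new, old):
    """Relative change of `new` over `old`, in percent (positive = faster)."""
    return (new - old) / old * 100.0


# Effect of adding PGO to each configuration.
pgo_effect = {
    "GCC 15 tail calls + preserve_none": pct_change(
        pystones["gcc_tail_pn_lto_pgo"], pystones["gcc_tail_pn_lto"]
    ),
    "Clang 20 tail calls + preserve_none": pct_change(
        pystones["clang_tail_pn_thinlto_pgo"], pystones["clang_tail_pn_thinlto"]
    ),
    "GCC 15 indirect goto": pct_change(
        pystones["gcc_goto_lto_pgo"], pystones["gcc_goto_lto"]
    ),
}

# GCC 15 relative to Clang 20 (tail calls + preserve_none), without and with PGO.
gcc_vs_clang = {
    "LTO only": pct_change(
        pystones["gcc_tail_pn_lto"], pystones["clang_tail_pn_thinlto"]
    ),
    "LTO + PGO": pct_change(
        pystones["gcc_tail_pn_lto_pgo"], pystones["clang_tail_pn_thinlto_pgo"]
    ),
}

for label, pct in pgo_effect.items():
    print(f"PGO effect, {label}: {pct:+.1f}%")
for label, pct in gcc_vs_clang.items():
    print(f"GCC 15 vs Clang 20, {label}: {pct:+.1f}%")
```

On these single runs, PGO costs the GCC 15 tail-call build roughly a tenth of its throughput while it helps both the Clang 20 build and the GCC 15 indirect-goto build, which is the asymmetry described above.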

I also tried a toy interpreter, from here:
https://github.com/brandtbucher/bf-dispatch-study/tree/main

Timings for GCC 15 WITHOUT preserve_none:
bf-tail:
3.22user 0.00system 0:03.22elapsed 99%CPU (0avgtext+0avgdata 1440maxresident)k
0inputs+0outputs (0major+119minor)pagefaults 0swaps

Timings for GCC 15 WITH preserve_none:
bf-tail:
3.39user 0.00system 0:03.39elapsed 99%CPU (0avgtext+0avgdata 1440maxresident)k
0inputs+0outputs (0major+121minor)pagefaults 0swaps


Finally, to reduce noise in the results, I turned off Turbo Boost on my
system and reran:

GCC15 with tail calls + NO preserve_none + LTO:
This machine benchmarks at 496950 pystones/second

GCC15 with tail calls + latest preserve_none patch + LTO:
This machine benchmarks at 508273 pystones/second

===

In short, great work H.J.!
preserve_none is a win (2-3%) for the complex interpreter (CPython); for the
simpler toy interpreter, which makes no calls to external functions, it's a
loss (5%). This makes sense to me.
Note that for CPython, results without PGO are generally not as accurate,
as we have found that they vary quite a bit without it.
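
For reference, the percentages in the summary can be reproduced from the raw numbers earlier in this comment. The snippet below is a rough sketch; the variable names and the pct_change helper are invented here, and each figure comes from a single run.

```python
def pct_change(new, old):
    """Relative change of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# CPython pystones/second (higher is better), LTO only, tail calls.
# Pairs are (without preserve_none, with latest preserve_none patch).
turbo_on = (974700, 1.02208e6)
turbo_off = (496950, 508273)

pn_gain_turbo_on = pct_change(turbo_on[1], turbo_on[0])
pn_gain_turbo_off = pct_change(turbo_off[1], turbo_off[0])

# bf-tail user time in seconds (lower is better).
bf_without_pn = 3.22
bf_with_pn = 3.39
bf_slowdown = pct_change(bf_with_pn, bf_without_pn)

print(f"CPython, Turbo Boost on:  {pn_gain_turbo_on:+.1f}% pystones/s")
print(f"CPython, Turbo Boost off: {pn_gain_turbo_off:+.1f}% pystones/s")
print(f"bf-tail run time:         {bf_slowdown:+.1f}%")
```

The Turbo Boost-off run lands in the 2-3% range quoted above, the Turbo Boost-on run comes out somewhat higher, and the bf-tail run time increases by about 5%.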
