https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119628
--- Comment #14 from Ken Jin <kenjin4096 at gmail dot com> ---
No speedup (within noise) with the latest patch over the previous patch. So Andrew might be right there on the register shuffling. However, note that pystones is just one benchmark in Python and not the full benchmark suite we use (that takes very long to run), so I'm not sure if the results are fully representative.

Though I have to apologize for a previous statement I made:

> I noticed it's still about 10% slower than clang-20 though.

This is only with LTO+PGO. With just LTO, GCC 15 is faster than Clang 20! Interestingly, PGO *slows down* GCC 15 with tail calls and preserve_none on CPython. It might instead be a PGO and musttail missed optimization happening on GCC 15.

===
Results:

GCC15 with tail calls + NO preserve_none + LTO:
This machine benchmarks at 974700 pystones/second

GCC15 with tail calls + latest preserve_none patch + LTO:
This machine benchmarks at 1.02208e+06 pystones/second

GCC15 with tail calls + latest preserve_none patch + LTO + PGO:
This machine benchmarks at 917500 pystones/second

Clang-20 with tail calls + preserve_none + ThinLTO*:
This machine benchmarks at 962309 pystones/second

Clang-20 with tail calls + preserve_none + ThinLTO + PGO:
This machine benchmarks at 1.09019e+06 pystones/second

For reference:

GCC15 with indirect goto + LTO:
This machine benchmarks at 900922 pystones/second

GCC15 with indirect goto + LTO + PGO:
This machine benchmarks at 1.11972e+06 pystones/second

* ThinLTO is default clang policy for CPython

I tried another toy interpreter, from here:
https://github.com/brandtbucher/bf-dispatch-study/tree/main

Timings for GCC 15 WITHOUT preserve_none:
bf-tail:
3.22user 0.00system 0:03.22elapsed 99%CPU (0avgtext+0avgdata 1440maxresident)k
0inputs+0outputs (0major+119minor)pagefaults 0swaps

Timings for GCC 15 WITH preserve_none:
bf-tail:
3.39user 0.00system 0:03.39elapsed 99%CPU (0avgtext+0avgdata 1440maxresident)k
0inputs+0outputs (0major+121minor)pagefaults 0swaps

Finally, to make sure the results are quieter, I turned off Turbo Boost on my system:

GCC15 with tail calls + NO preserve_none + LTO:
This machine benchmarks at 496950 pystones/second

GCC15 with tail calls + latest preserve_none patch + LTO:
This machine benchmarks at 508273 pystones/second

===

In short, great work H.J! preserve_none for the complex interpreter (CPython) is a win (2-3%); for the simpler toy interpreter without any calls to external functions, it's a loss (5%). This makes sense to me. Note that for CPython, results without PGO are generally not as accurate, as we found results vary quite a bit without it.
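
For anyone who wants to re-derive the percentages in the summary, here is a minimal sketch of the arithmetic. The helper name `percent_change` and the variable names are made up for illustration; only the raw numbers come from the results above. The 2-3% figure seems to match the run with Turbo Boost off, while the run with Turbo Boost on shows a larger gap of a little under 5%.

```python
def percent_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# pystones/second, GCC 15 + tail calls + LTO: higher is better.
# Baseline is NO preserve_none, candidate is the latest preserve_none patch.
pystones_turbo_on = percent_change(974700, 1.02208e06)
pystones_turbo_off = percent_change(496950, 508273)

# bf-tail user time in seconds: lower is better, so a positive
# change here means preserve_none made the toy interpreter slower.
bf_time = percent_change(3.22, 3.39)

print(f"pystones, Turbo Boost on:  {pystones_turbo_on:+.2f}%")
print(f"pystones, Turbo Boost off: {pystones_turbo_off:+.2f}%")
print(f"bf-tail run time:          {bf_time:+.2f}%")
```

Note that the two metrics point in opposite directions: a positive change in pystones/second is a speedup, whereas a positive change in bf-tail run time is a slowdown.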