https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119168

Bug ID: 119168
Summary: [15 Regression] 5% 477.dealII slowdown since r15-7605-gc5752c1f01316a
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: pheeck at gcc dot gnu.org
CC: rsandifo at gcc dot gnu.org
Blocks: 26163
Target Milestone: ---
Host: x86_64-pc-linux-gnu
Target: x86_64-pc-linux-gnu

As seen here
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=291.140.0
there was a 5% exec time slowdown of the 477.dealII SPEC 2006 benchmark when
run with -O2 -march=native on an AMD Zen2 machine.

I bisected it to r15-7605-gc5752c1f01316a

commit c5752c1f01316ac26ec9cf8d171d68aea420a158
Author: Richard Sandiford <richard.sandif...@arm.com>
Date:   Tue Feb 18 11:00:57 2025 +0000

    late-combine: Tighten register class check [PR108840]

    gcc.target/aarch64/pr108840.c has failed since r15-268-g9dbff9c05520
    (which means that I really ought to have looked at it earlier).

    The test wants us to fold an SImode AND into all shifts that use it.
    This is something that late-combine is supposed to do, but:

    (1) the pre-RA pass chickened out because of a register pressure check

    (2) the post-RA pass can't handle it, because the shift uses are in
        QImode and the sets are in SImode

    Both are things that would be good to fix.  But (1) is particularly
    silly.  The constraints on the AND have "rk" for the destination
    (so allowing the stack pointer) and "r" for the first source.
    Including the stack pointer made the destination seem more permissive
    than the source.

    The intention was instead to check whether there are any *allocatable*
    registers in the destination class that aren't present in the source.

    That's enough for all tests but the last one.  The last one still fails
    because combine merges the final shift with the move into the hard
    return register, giving an arithmetic instruction with a hard register
    destination.  Pre-RA late-combine currently punts on those, again due
    to register pressure concerns.  That too is something I'd like to relax,
    but not for GCC 15.  In the interim, the best thing seems to be to
    disable combine for the test.

    gcc/
            PR rtl-optimization/108840
            * late-combine.cc (late_combine::check_register_pressure): Take
            only allocatable registers into account when checking the
            permissiveness of register classes.

    gcc/testsuite/
            PR rtl-optimization/108840
            * gcc.target/aarch64/pr108840.c: Run at -O2 but disable combine.

 gcc/late-combine.cc                         | 10 ++++++++--
 gcc/testsuite/gcc.target/aarch64/pr108840.c |  2 +-
 2 files changed, 9 insertions(+), 3 deletions(-)

Btw, this benchmark has sped up a bit recently. That was between
r15-7772-gdfdbad87aeb2de
r15-7895-gb191e8bdecf881
Maybe that was because of r15-7852-ge836d80374aa03a? Anyway, now the benchmark
is almost at the same speed as with GCC 14. See comparison with GCC 14 here:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.8=1051.140.0&plot.9=291.140.0&

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
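
For readers who want a concrete picture of the register class check described in the quoted commit message, here is a rough C++ sketch. It is not the code in late-combine.cc: the 8-register model and the names RegSet, kNumRegs, dest_more_permissive_naive and dest_more_permissive_allocatable are all hypothetical, chosen only to show why including a non-allocatable register (such as the stack pointer in "rk") can make the destination class look more permissive than the source class "r".

```cpp
#include <bitset>
#include <cstddef>

// Hypothetical toy model: 8 hard registers; bit i set means register i
// belongs to the set.  Real GCC register sets are target-specific.
constexpr std::size_t kNumRegs = 8;
using RegSet = std::bitset<kNumRegs>;

// Behaviour the commit message describes as unintended: the destination
// class counts as more permissive if it contains *any* register that the
// source class lacks, including non-allocatable ones.
bool dest_more_permissive_naive(const RegSet &dest, const RegSet &src) {
    return (dest & ~src).any();
}

// Intended behaviour per the commit message: only allocatable registers
// that are in the destination class but not in the source class count.
bool dest_more_permissive_allocatable(const RegSet &dest, const RegSet &src,
                                      const RegSet &allocatable) {
    return (dest & ~src & allocatable).any();
}
```

With registers 0 to 6 standing in for the general registers and register 7 for the stack pointer (assumed non-allocatable), a source class of 0x7F and a destination class of 0xFF make the naive check report the destination as more permissive, while the allocatable-only check does not.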