[Bug tree-optimization/120752] New: 5% slowdown of 525.x264_r since r16-1346-gb0d50cbb42ab2c

pheeck at gcc dot gnu.org via Gcc-bugs Sat, 21 Jun 2025 15:36:52 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120752


            Bug ID: 120752
           Summary: 5% slowdown of 525.x264_r since
                    r16-1346-gb0d50cbb42ab2c
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pheeck at gcc dot gnu.org
                CC: hubicka at gcc dot gnu.org
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-pc-linux-gnu, aarch64-gnu-linux
            Target: x86_64-pc-linux-gnu, aarch64-gnu-linux

As seen here

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=477.377.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=958.377.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=286.377.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=589.377.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=792.377.0

there was a 5% execution-time slowdown of the 525.x264_r SPEC 2017
benchmark when built with -Ofast -march=native -flto -fprofile-use. I've seen it
on x86_64 AMD, x86_64 Intel and AArch64 machines.
I bisected it to r16-1346-gb0d50cbb42ab2c.

b0d50cbb42ab2ce5fab8a832cb82fc54b371c914 is the first bad commit
commit b0d50cbb42ab2ce5fab8a832cb82fc54b371c914
Author: Jan Hubicka <hubi...@ucw.cz>
Date:   Fri Jun 6 17:57:00 2025 +0200

    Fix profile updating in ipa-cp

    Bootstrapping with autoprofiledbootstrap, LTO and checking enables ICEs in WPA
    because we end up mixing local and IPA count in
    ipa-cp.cc:update_specialized_profile.  This is because of missing call to
    profile_count::adjust_for_ipa_scaling.  While looking into that I however
    noticed that the function forgets to update indirect call edges. This made me
    to commonize same logic which currently exists in clone_inlined_nodes,
    update_specialized_profile, update_profiling_info and
    update_counts_for_self_gen_clones.

    While testing it I noticed that we also ICE when linking with
    -fdump-ipa-all-details-blocks since IPA and local counts are temporarily mixed
    during IPA transformation stage, so I also added check to profile_count::dump
    to not crash and added verifier to gimple_verify_flow_info.

    Other problem I also noticed is that while profile updates done by inliner (via
    cgraph_node::clone) are correctly using global0 profiles instead of erasing
    profile completely when IPA counts drops to 0, the scaling in ipa-cp is not
    doing that, so we lose info and possibly some code quality.  I will fix that
    incrementally. Similarly ipa-split, when offlining region with 0 entry count
    may re-do frequency propagation to get something useful.


This is not a regression against GCC 15. See the comparison
here:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=1016.377.0&plot.1=1179.377.0&plot.2=958.377.0&;

Btw, if you look at the graphs, at some point we got better execution times
than GCC 15, and this slowdown is just a rebound from that improvement (in some
cases not even a complete one).
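
For anyone repeating the bisection, here is a minimal sketch of the pass/fail
decision a `git bisect run` predicate could make from two timings. It is not
the script used for this report. The function names, the 100-second baseline
and the 3% threshold are illustrative assumptions; building GCC, training the
profile and timing 525.x264_r are left to a wrapper that is not shown.

```python
"""Decision logic for a `git bisect run` predicate (illustrative sketch).

git bisect run treats exit code 0 as "good", 1 as "bad" and 125 as "skip".
"""

GOOD, BAD, SKIP = 0, 1, 125

# Hypothetical threshold: a bit below the reported 5% so that run-to-run
# noise does not hide the regression.
THRESHOLD = 0.03


def slowdown(baseline_seconds, measured_seconds):
    """Relative slowdown of a measurement against a known-good baseline."""
    return (measured_seconds - baseline_seconds) / baseline_seconds


def bisect_exit_code(baseline_seconds, measured_seconds, threshold=THRESHOLD):
    """Map a timing to the exit code git bisect run expects."""
    if measured_seconds is None:
        # The revision did not build or the benchmark failed: skip it.
        return SKIP
    if slowdown(baseline_seconds, measured_seconds) > threshold:
        return BAD
    return GOOD
```

A wrapper would time the benchmark at the current revision and finish with
`sys.exit(bisect_exit_code(baseline, measured))`.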


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
