https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120614
--- Comment #18 from Jan Hubicka <hubicka at ucw dot cz> ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120614
>
> --- Comment #16 from kugan at gcc dot gnu.org ---
> I ran spec2017 again with recent gcc and SPE based autofdo (with local
> patches to enable SPE based profiling support for autofdo tools). I am
> seeing the following compared to PGO:
>
> 621.wrf_s        -23%
> 549.fotonik3d_r  -21%
> 525.x264_r       -17%
> 644.nab_s        -14%
> 603.bwaves_s     -13%
> 625.x264_s       -12%
> 623.xalancbmk_s  -12%
> 600.perlbench_s  -11%
> 500.perlbench_r  -10%

The LNT tester reports the following regressions:

SPEC/SPEC2017/FP/521.wrf_r        110.97%
SPEC/SPEC2017/FP/538.imagick_r     67.70%
SPEC/SPEC2017/FP/554.roms_r        15.77%
SPEC/SPEC2017/FP/503.bwaves_r      12.67%
SPEC/SPEC2017/INT/523.xalancbmk_r  11.29%
SPEC/SPEC2017/INT/548.exchange2_r  10.72%
SPEC/SPEC2017/FP/508.namd_r         8.78%
SPEC/SPEC2017/INT/531.deepsjeng_r   7.26%
SPEC/SPEC2017/INT/541.leela_r       6.54%
SPEC/SPEC2017/FP/519.lbm_r          5.72%
SPEC/SPEC2017/FP/549.fotonik3d_r    3.37%
SPEC/SPEC2017/INT/525.x264_r        3.09%
SPEC/SPEC2017/FP/510.parest_r       2.97%
SPEC/SPEC2017/FP/527.cam4_r         2.23%
SPEC/SPEC2017/INT/505.mcf_r         2.22%

In our setup the wrf training run is broken: it does nothing, yet does not
fail verification (which is odd). As a result the profile is almost empty and
everything is optimized for size. I wonder if that is true for you as well?
You can run gcov-dump on the profile and check whether you get any reasonably
large counts (see the sketch at the end of this comment). I am quite puzzled
by this issue but did not have time to debug it yet.

imagemagick has a broken train dataset in SPEC (it does not train the hot
loop, which disables vectorization). I hacked runspec so I can use
-train_with=refrun, and then imagemagick actually runs faster with autofdo
than without, so I think it is a non-issue. (With this hack autofdo now seems
to be an overall win for SPECfp, modulo wrf.)

roms regresses because vectorization is disabled: the loop header BB gets a
very low count. This is caused by the vectorizer not doing a very good job of
updating debug statements, but it also triggers a problem in create_gcov not
consuming the DWARF correctly. LLVM has 3-layer discriminators that can
record multiplicity, so vectorization can keep the iteration counts accurate.
I think it is a useful feature we should implement as well:
https://lists.llvm.org/pipermail/llvm-dev/2020-November/146694.html

fotonik was quite random for us, so we hacked the config file to train every
binary 8 times, which reduced the noise in the obtained profile. Here is the
history of runs
https://lnt.opensuse.org/db_default/v4/SPEC/graph?highlight_run=69309&plot.0=1370.527.0&plot.1=1288.527.0
and one can see that the randomness went away.

Some of the regressions go away for me with -fprofile-partial-training, since
the IPA code still has issues with an AFDO count of 0 not actually being 0.
In particular, if a function has only 0 samples in it, it gets a 0 AFDO
profile with its local profile preserved. If a function with a non-zero AFDO
profile is later inlined into it, it will get a 0 AFDO profile and end up
optimized for size.

I did not look into the other regressions yet. I think it would be
interesting to understand leela, deepsjeng and xalancbmk, since they are
quite C++ heavy.

povray, omnetpp, perlbench and gcc see out-of-noise improvements in our
setup. It would be interesting to know why perlbench regresses for you.
https://lnt.opensuse.org/db_default/v4/SPEC/graph?highlight_run=69309&plot.0=1370.327.0&plot.1=1288.327.0
https://lnt.opensuse.org/db_default/v4/SPEC/69309?compare_to=69261
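
For checking whether the wrf profile is essentially empty, something along
these lines should work. This is only a rough sketch: the binary name,
perf.data, the profile file name and the -gcov_version value are placeholders
from my setup and may need adjusting to your GCC/autofdo combination; since
create_gcov emits GCC's gcov file format, the gcov-dump from the matching GCC
should be able to print the records.

  # convert the perf samples into an AutoFDO profile (names are placeholders)
  create_gcov --binary=./wrf_s --profile=perf.data --gcov=wrf.afdo -gcov_version=1
  # dump the records; look for functions with reasonably large counts
  gcov-dump -l wrf.afdo

If all the dumped counters for the hot wrf routines are zero or tiny, the
training run produced no useful samples, which would match what we see here.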
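
For reference, the partial-training workaround mentioned above is just an
extra flag on the feedback build; a sketch with placeholder file names:

  # build with the AutoFDO profile, but do not optimize unprofiled/0-sample
  # functions for size as if they were proven cold
  gcc -O3 -fauto-profile=profile.afdo -fprofile-partial-training -o bench bench.c

-fprofile-partial-training makes GCC treat functions without profile data as
if they were built without feedback instead of optimizing them for size,
which is presumably why it hides the AFDO-0-is-not-really-0 problem until the
IPA side is fixed.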