> On Wed, Feb 19, 2025 at 9:06 PM Jan Hubicka <[email protected]> wrote:
> >
> > Hi,
> > this is a variant of a hook I benchmarked on cpu2016 with -Ofast -flto
> > and -O2 -flto. For non -Os and no Windows ABI should be practically the
> > same as your variant that was simply returning mem_cost - 2.
> >
> I've tested O2/(Ofast march=native) with SPEC2017 on SPR, mostly
> neutral (small improvement on povray).
So I got ryzen3 runs with -O2, -O3 and -fno-ipa-ra.
Overall differences are quite small, but I think it is expected. Here is
what I get with -O2:
                  Base    Base      Base       Peak    Peak      Peak
Benchmarks        Copies  Run Time  Rate       Copies  Run Time  Rate
----------------  ------  --------  ---------  ------  --------  ---------
500.perlbench_r        1       188  8.46 S          1       183  8.69 S
500.perlbench_r        1       187  8.52 *          1       182  8.75 *
500.perlbench_r        1       186  8.56 S          1       182  8.75 S
502.gcc_r              1       139  10.2 S          1       137  10.3 *
502.gcc_r              1       139  10.2 S          1       137  10.4 S
502.gcc_r              1       139  10.2 *          1       137  10.3 S
505.mcf_r              1       187  8.66 *          1       188  8.61 S
505.mcf_r              1       186  8.70 S          1       187  8.66 *
505.mcf_r              1       188  8.62 S          1       187  8.66 S
520.omnetpp_r          1       213  6.15 *          1       207  6.32 *
520.omnetpp_r          1       212  6.18 S          1       206  6.37 S
520.omnetpp_r          1       219  5.99 S          1       215  6.11 S
523.xalancbmk_r        1        --  CE              1        --  CE
525.x264_r             1       135  13.0 S          1       135  12.9 *
525.x264_r             1       135  13.0 *          1       135  12.9 S
525.x264_r             1       135  13.0 S          1       135  12.9 S
531.deepsjeng_r        1       167  6.86 *          1       167  6.85 S
531.deepsjeng_r        1       167  6.86 S          1       168  6.84 S
531.deepsjeng_r        1       167  6.86 S          1       167  6.85 *
541.leela_r            1       296  5.60 S          1       292  5.67 *
541.leela_r            1       293  5.65 S          1       293  5.65 S
541.leela_r            1       296  5.60 *          1       292  5.67 S
548.exchange2_r        1       208  12.6 S          1       208  12.6 S
548.exchange2_r        1       208  12.6 *          1       208  12.6 S
548.exchange2_r        1       208  12.6 S          1       208  12.6 *
557.xz_r               1       194  5.58 S          1       193  5.58 S
557.xz_r               1       192  5.62 S          1       193  5.60 S
557.xz_r               1       193  5.60 *          1       193  5.59 *
==========================================================================
500.perlbench_r        1       187  8.52 *          1       182  8.75 *
502.gcc_r              1       139  10.2 *          1       137  10.3 *
505.mcf_r              1       187  8.66 *          1       187  8.66 *
520.omnetpp_r          1       213  6.15 *          1       207  6.32 *
523.xalancbmk_r                     NR                           NR
525.x264_r             1       135  13.0 *          1       135  12.9 *
531.deepsjeng_r        1       167  6.86 *          1       167  6.85 *
541.leela_r            1       296  5.60 *          1       292  5.67 *
548.exchange2_r        1       208  12.6 *          1       208  12.6 *
557.xz_r               1       193  5.60 *          1       193  5.59 *
Est. SPECrate2017_int_base          8.17
Est. SPECrate2017_int_peak                                       8.24
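For reference, the two Est. SPECrate figures are consistent with a geometric
mean over the selected (`*`) rates of the nine benchmarks that ran. A minimal
Python sketch that rechecks this from the rounded rates in the table; the
helper name `geomean` is mine, and because the inputs are already rounded the
result should only be expected to agree with the reported figures to about
0.01:

```python
import math

def geomean(values):
    """Geometric mean, computed via logs so long lists do not overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Selected (*) rates copied from the summary rows; 523.xalancbmk_r is
# left out because it did not run (NR).
base_rates = [8.52, 10.2, 8.66, 6.15, 13.0, 6.86, 5.60, 12.6, 5.60]
peak_rates = [8.75, 10.3, 8.66, 6.32, 12.9, 6.85, 5.67, 12.6, 5.59]

base = geomean(base_rates)
peak = geomean(peak_rates)
print(f"base {base:.2f}  peak {peak:.2f}  peak/base {100 * peak / base:.2f}%")
```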
Perlbench seems to improve consistently without LTO (at -O2, -O3 and
-O2 -fno-ipa-ra), and I think it may be just luck with code layout.
gcc is quite consistent in all settings. Overall it seems to be a consistent
small win. For fp tests, I see only off-noise povray differences, and only in
-Ofast and -Ofast -flto.
Comparing code sizes at -O2 (size without the patch, size with the patch,
ratio in %):
500.perlbench_r/run/run_base_refrate_regalloc-m64.0000/perlbench_r_base.regalloc-m64
1699987 1731648 101.86
502.gcc_r/run/run_base_refrate_regalloc-m64.0000/cpugcc_r_base.regalloc-m64
7072031 7226911 102.19
503.bwaves_r/run/run_base_refrate_regalloc-m64.0000/bwaves_r_base.regalloc-m64
41327 41327 100.00
505.mcf_r/run/run_base_refrate_regalloc-m64.0000/mcf_r_base.regalloc-m64
17023 17023 100.00
507.cactuBSSN_r/run/run_base_refrate_regalloc-m64.0000/cactusBSSN_r_base.regalloc-m64
3432326 3464950 100.95
508.namd_r/run/run_base_refrate_regalloc-m64.0000/namd_r_base.regalloc-m64
835954 835457 99.94
510.parest_r/run/run_base_refrate_regalloc-m64.0000/parest_r_base.regalloc-m64
7498066 7587378 101.19
511.povray_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_511_base.regalloc-m64
18206 18222 100.08
511.povray_r/run/run_base_refrate_regalloc-m64.0000/povray_r_base.regalloc-m64
754591 761695 100.94
519.lbm_r/run/run_base_refrate_regalloc-m64.0000/lbm_r_base.regalloc-m64
10900 10916 100.14
520.omnetpp_r/run/run_base_refrate_regalloc-m64.0000/omnetpp_r_base.regalloc-m64
1403348 1425556 101.58
521.wrf_r/run/run_base_refrate_regalloc-m64.0000/diffwrf_521_base.regalloc-m64
16388136 16394552 100.03
521.wrf_r/run/run_base_refrate_regalloc-m64.0000/wrf_r_base.regalloc-m64
22293527 22302167 100.03
525.x264_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_525_base.regalloc-m64
18206 18222 100.08
525.x264_r/run/run_base_refrate_regalloc-m64.0000/ldecod_r_base.regalloc-m64
398564 401667 100.77
525.x264_r/run/run_base_refrate_regalloc-m64.0000/x264_r_base.regalloc-m64
405515 407051 100.37
526.blender_r/run/run_base_refrate_regalloc-m64.0000/blender_r_base.regalloc-m64
7567792 7631536 100.84
526.blender_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_526_base.regalloc-m64
18206 18222 100.08
527.cam4_r/run/run_base_refrate_regalloc-m64.0000/cam4_r_base.regalloc-m64
5957695 5969535 100.19
527.cam4_r/run/run_base_refrate_regalloc-m64.0000/cam4_validate_527_base.regalloc-m64
606591 608767 100.35
531.deepsjeng_r/run/run_base_refrate_regalloc-m64.0000/deepsjeng_r_base.regalloc-m64
75304 76248 101.25
538.imagick_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_538_base.regalloc-m64
18206 18222 100.08
538.imagick_r/run/run_base_refrate_regalloc-m64.0000/imagick_r_base.regalloc-m64
1638858 1651628 100.77
541.leela_r/run/run_base_refrate_regalloc-m64.0000/leela_r_base.regalloc-m64
132636 133146 100.38
544.nab_r/run/run_base_refrate_regalloc-m64.0000/nab_r_base.regalloc-m64
150146 150513 100.24
548.exchange2_r/run/run_base_refrate_regalloc-m64.0000/exchange2_r_base.regalloc-m64
76709 76709 100.00
549.fotonik3d_r/run/run_base_refrate_regalloc-m64.0000/fotonik3d_r_base.regalloc-m64
464940 465260 100.06
554.roms_r/run/run_base_refrate_regalloc-m64.0000/roms_r_base.regalloc-m64
833926 834166 100.02
557.xz_r/run/run_base_refrate_regalloc-m64.0000/xz_r_base.regalloc-m64
130345 133253 102.23
The 2% code size increase for gcc is not very nice, but I think it is also
expected, since we make the compiler use fewer push/pop instructions.
There are 34091 push instructions with the patch and 38939 without.
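The arithmetic behind these two figures can be rechecked with a few lines of
Python. This is only a sketch: `ratio_percent` is a name I made up, the sizes
are the cpugcc_r pair from the -O2 listing, and the push counts are the two
numbers quoted above.

```python
def ratio_percent(before, after):
    """Size after the change as a percentage of the size before."""
    return 100.0 * after / before

# cpugcc_r sizes from the -O2 listing (first and second number).
gcc_before, gcc_after = 7072031, 7226911
print(f"cpugcc_r: {ratio_percent(gcc_before, gcc_after):.2f}%")

# Push counts quoted above: 38939 without the patch, 34091 with it.
push_without, push_with = 38939, 34091
removed = push_without - push_with
print(f"push instructions removed: {removed} "
      f"({100.0 * removed / push_without:.1f}% fewer)")
```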
With -fno-ipa-ra the story is similar:
500.perlbench_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/perlbench_r_base.regalloc-O2-noipara-m64
1701299 1733024 101.86
502.gcc_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cpugcc_r_base.regalloc-O2-noipara-m64
7074527 7229855 102.19
503.bwaves_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/bwaves_r_base.regalloc-O2-noipara-m64
41327 41327 100.00
505.mcf_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/mcf_r_base.regalloc-O2-noipara-m64
17151 17151 100.00
507.cactuBSSN_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cactusBSSN_r_base.regalloc-O2-noipara-m64
3432326 3464950 100.95
508.namd_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/namd_r_base.regalloc-O2-noipara-m64
835954 835457 99.94
510.parest_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/parest_r_base.regalloc-O2-noipara-m64
7504722 7594098 101.19
511.povray_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_511_base.regalloc-O2-noipara-m64
18206 18222 100.08
511.povray_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/povray_r_base.regalloc-O2-noipara-m64
756639 763487 100.90
519.lbm_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/lbm_r_base.regalloc-O2-noipara-m64
10900 10916 100.14
520.omnetpp_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/omnetpp_r_base.regalloc-O2-noipara-m64
1403412 1425748 101.59
521.wrf_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/diffwrf_521_base.regalloc-O2-noipara-m64
16394344 16400504 100.03
521.wrf_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/wrf_r_base.regalloc-O2-noipara-m64
22300503 22308759 100.03
525.x264_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_525_base.regalloc-O2-noipara-m64
18206 18222 100.08
525.x264_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/ldecod_r_base.regalloc-O2-noipara-m64
399204 402179 100.74
525.x264_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/x264_r_base.regalloc-O2-noipara-m64
406251 408299 100.50
526.blender_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/blender_r_base.regalloc-O2-noipara-m64
7583536 7648304 100.85
526.blender_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_526_base.regalloc-O2-noipara-m64
18206 18222 100.08
527.cam4_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cam4_r_base.regalloc-O2-noipara-m64
5962335 5974239 100.19
527.cam4_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cam4_validate_527_base.regalloc-O2-noipara-m64
607327 609375 100.33
531.deepsjeng_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/deepsjeng_r_base.regalloc-O2-noipara-m64
75240 76248 101.33
538.imagick_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_538_base.regalloc-O2-noipara-m64
18206 18222 100.08
538.imagick_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagick_r_base.regalloc-O2-noipara-m64
1641226 1654060 100.78
541.leela_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/leela_r_base.regalloc-O2-noipara-m64
132764 133274 100.38
544.nab_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/nab_r_base.regalloc-O2-noipara-m64
150498 150929 100.28
548.exchange2_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/exchange2_r_base.regalloc-O2-noipara-m64
76921 76921 100.00
549.fotonik3d_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/fotonik3d_r_base.regalloc-O2-noipara-m64
464940 465260 100.06
554.roms_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/roms_r_base.regalloc-O2-noipara-m64
833926 834166 100.02
557.xz_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/xz_r_base.regalloc-O2-noipara-m64
130697 133573 102.20
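To rank binaries by growth, the listings above can be parsed mechanically. A
small Python sketch, assuming the layout used in this mail (one path line
followed by one line with two sizes and the ratio); `parse_size_listing` is a
hypothetical helper, not part of any existing script, and the sample is just
three entries copied from the -O2 listing:

```python
def parse_size_listing(text):
    """Parse alternating 'path' / 'old new percent' lines into tuples."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    rows = []
    for path, numbers in zip(lines[0::2], lines[1::2]):
        old, new, _reported = numbers.split()
        binary = path.rsplit("/", 1)[-1]  # keep only the file name
        rows.append((binary, int(old), int(new)))
    return rows

sample = """\
502.gcc_r/run/run_base_refrate_regalloc-m64.0000/cpugcc_r_base.regalloc-m64
7072031 7226911 102.19
508.namd_r/run/run_base_refrate_regalloc-m64.0000/namd_r_base.regalloc-m64
835954 835457 99.94
557.xz_r/run/run_base_refrate_regalloc-m64.0000/xz_r_base.regalloc-m64
130345 133253 102.23
"""

rows = parse_size_listing(sample)
# Sort by relative growth, largest first.
rows.sort(key=lambda r: r[2] / r[1], reverse=True)
for binary, old, new in rows:
    print(f"{binary:40s} {old:>9d} {new:>9d} {100.0 * new / old:7.2f}")
```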
Overall I think the new costing works reasonably well.
Honza