> On Wed, Feb 19, 2025 at 9:06 PM Jan Hubicka <hubi...@ucw.cz> wrote:
> >
> > Hi,
> > this is a variant of a hook I benchmarked on cpu2016 with -Ofast -flto
> > and -O2 -flto.  For non -Os and no Windows ABI should be practically the
> > same as your variant that was simply returning mem_cost - 2.
> >
> I've tested O2/(Ofast march=native) with SPEC2017 on SPR, mostly
> neutral (small improvement on povray).

So I got ryzen3 runs with -O2, -O3 and -fno-ipa-ra.  Overall the differences
are quite small, but I think that is expected.  Here is what I get with -O2:

                   Base      Base      Base      Peak      Peak      Peak
Benchmarks       Copies  Run Time      Rate     Copies  Run Time      Rate
--------------- ------- --------- ---------   ------- --------- ---------
500.perlbench_r       1       188      8.46 S       1       183      8.69 S
500.perlbench_r       1       187      8.52 *       1       182      8.75 *
500.perlbench_r       1       186      8.56 S       1       182      8.75 S
502.gcc_r             1       139      10.2 S       1       137      10.3 *
502.gcc_r             1       139      10.2 S       1       137      10.4 S
502.gcc_r             1       139      10.2 *       1       137      10.3 S
505.mcf_r             1       187      8.66 *       1       188      8.61 S
505.mcf_r             1       186      8.70 S       1       187      8.66 *
505.mcf_r             1       188      8.62 S       1       187      8.66 S
520.omnetpp_r         1       213      6.15 *       1       207      6.32 *
520.omnetpp_r         1       212      6.18 S       1       206      6.37 S
520.omnetpp_r         1       219      5.99 S       1       215      6.11 S
523.xalancbmk_r       1        --        CE         1        --        CE
525.x264_r            1       135      13.0 S       1       135      12.9 *
525.x264_r            1       135      13.0 *       1       135      12.9 S
525.x264_r            1       135      13.0 S       1       135      12.9 S
531.deepsjeng_r       1       167      6.86 *       1       167      6.85 S
531.deepsjeng_r       1       167      6.86 S       1       168      6.84 S
531.deepsjeng_r       1       167      6.86 S       1       167      6.85 *
541.leela_r           1       296      5.60 S       1       292      5.67 *
541.leela_r           1       293      5.65 S       1       293      5.65 S
541.leela_r           1       296      5.60 *       1       292      5.67 S
548.exchange2_r       1       208      12.6 S       1       208      12.6 S
548.exchange2_r       1       208      12.6 *       1       208      12.6 S
548.exchange2_r       1       208      12.6 S       1       208      12.6 *
557.xz_r              1       194      5.58 S       1       193      5.58 S
557.xz_r              1       192      5.62 S       1       193      5.60 S
557.xz_r              1       193      5.60 *       1       193      5.59 *
===========================================================================
500.perlbench_r       1       187      8.52 *       1       182      8.75 *
502.gcc_r             1       139      10.2 *       1       137      10.3 *
505.mcf_r             1       187      8.66 *       1       187      8.66 *
520.omnetpp_r         1       213      6.15 *       1       207      6.32 *
523.xalancbmk_r                          NR                            NR
525.x264_r            1       135      13.0 *       1       135      12.9 *
531.deepsjeng_r       1       167      6.86 *       1       167      6.85 *
541.leela_r           1       296      5.60 *       1       292      5.67 *
548.exchange2_r       1       208      12.6 *       1       208      12.6 *
557.xz_r              1       193      5.60 *       1       193      5.59 *
Est. SPECrate2017_int_base             8.17
Est. SPECrate2017_int_peak                                           8.24

Perlbench seems to improve consistently without LTO (with -O2, -O3 and
-O2 -fno-ipa-ra), though I think it may be just luck with code layout.
gcc is quite consistent in all settings.
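
For reference, the two "Est. SPECrate2017_int" figures are geometric means of
the selected (*) per-benchmark rates.  The small sketch below recomputes them
from the rounded rates in the summary table, so it should land close to, but
not necessarily exactly on, the reported 8.17 and 8.24.  The names BASE, PEAK
and geomean are just for illustration; 523.xalancbmk_r is left out because it
did not run (CE/NR).

```python
import math

# Selected (*) rates copied from the summary rows of the -O2 table.
# 523.xalancbmk_r is omitted: it failed to build (CE) and is reported as NR.
BASE = {
    "500.perlbench_r": 8.52, "502.gcc_r": 10.2, "505.mcf_r": 8.66,
    "520.omnetpp_r": 6.15, "525.x264_r": 13.0, "531.deepsjeng_r": 6.86,
    "541.leela_r": 5.60, "548.exchange2_r": 12.6, "557.xz_r": 5.60,
}
PEAK = {
    "500.perlbench_r": 8.75, "502.gcc_r": 10.3, "505.mcf_r": 8.66,
    "520.omnetpp_r": 6.32, "525.x264_r": 12.9, "531.deepsjeng_r": 6.85,
    "541.leela_r": 5.67, "548.exchange2_r": 12.6, "557.xz_r": 5.59,
}


def geomean(values):
    """Geometric mean, computed via logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    print(f"base geomean: {geomean(BASE.values()):.3f}")
    print(f"peak geomean: {geomean(PEAK.values()):.3f}")
    # Per-benchmark peak/base ratio in percent, to see where the gain comes from.
    for name in BASE:
        print(f"{name:16s} {100.0 * PEAK[name] / BASE[name]:7.2f}%")
```

The per-benchmark ratios make it easy to see that most of the difference comes
from perlbench and omnetpp.
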
Overall it seems to be a consistent small win.  For fp tests, I see
differences outside the noise only for povray, and only with -Ofast and
-Ofast -flto.

Comparing code sizes at -O2 (columns: binary, size without the patch, size
with the patch, ratio in %):

500.perlbench_r/run/run_base_refrate_regalloc-m64.0000/perlbench_r_base.regalloc-m64 1699987 1731648 101.86
502.gcc_r/run/run_base_refrate_regalloc-m64.0000/cpugcc_r_base.regalloc-m64 7072031 7226911 102.19
503.bwaves_r/run/run_base_refrate_regalloc-m64.0000/bwaves_r_base.regalloc-m64 41327 41327 100.00
505.mcf_r/run/run_base_refrate_regalloc-m64.0000/mcf_r_base.regalloc-m64 17023 17023 100.00
507.cactuBSSN_r/run/run_base_refrate_regalloc-m64.0000/cactusBSSN_r_base.regalloc-m64 3432326 3464950 100.95
508.namd_r/run/run_base_refrate_regalloc-m64.0000/namd_r_base.regalloc-m64 835954 835457 99.94
510.parest_r/run/run_base_refrate_regalloc-m64.0000/parest_r_base.regalloc-m64 7498066 7587378 101.19
511.povray_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_511_base.regalloc-m64 18206 18222 100.08
511.povray_r/run/run_base_refrate_regalloc-m64.0000/povray_r_base.regalloc-m64 754591 761695 100.94
519.lbm_r/run/run_base_refrate_regalloc-m64.0000/lbm_r_base.regalloc-m64 10900 10916 100.14
520.omnetpp_r/run/run_base_refrate_regalloc-m64.0000/omnetpp_r_base.regalloc-m64 1403348 1425556 101.58
521.wrf_r/run/run_base_refrate_regalloc-m64.0000/diffwrf_521_base.regalloc-m64 16388136 16394552 100.03
521.wrf_r/run/run_base_refrate_regalloc-m64.0000/wrf_r_base.regalloc-m64 22293527 22302167 100.03
525.x264_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_525_base.regalloc-m64 18206 18222 100.08
525.x264_r/run/run_base_refrate_regalloc-m64.0000/ldecod_r_base.regalloc-m64 398564 401667 100.77
525.x264_r/run/run_base_refrate_regalloc-m64.0000/x264_r_base.regalloc-m64 405515 407051 100.37
526.blender_r/run/run_base_refrate_regalloc-m64.0000/blender_r_base.regalloc-m64 7567792 7631536 100.84
526.blender_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_526_base.regalloc-m64 18206 18222 100.08
527.cam4_r/run/run_base_refrate_regalloc-m64.0000/cam4_r_base.regalloc-m64 5957695 5969535 100.19
527.cam4_r/run/run_base_refrate_regalloc-m64.0000/cam4_validate_527_base.regalloc-m64 606591 608767 100.35
531.deepsjeng_r/run/run_base_refrate_regalloc-m64.0000/deepsjeng_r_base.regalloc-m64 75304 76248 101.25
538.imagick_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_538_base.regalloc-m64 18206 18222 100.08
538.imagick_r/run/run_base_refrate_regalloc-m64.0000/imagick_r_base.regalloc-m64 1638858 1651628 100.77
541.leela_r/run/run_base_refrate_regalloc-m64.0000/leela_r_base.regalloc-m64 132636 133146 100.38
544.nab_r/run/run_base_refrate_regalloc-m64.0000/nab_r_base.regalloc-m64 150146 150513 100.24
548.exchange2_r/run/run_base_refrate_regalloc-m64.0000/exchange2_r_base.regalloc-m64 76709 76709 100.00
549.fotonik3d_r/run/run_base_refrate_regalloc-m64.0000/fotonik3d_r_base.regalloc-m64 464940 465260 100.06
554.roms_r/run/run_base_refrate_regalloc-m64.0000/roms_r_base.regalloc-m64 833926 834166 100.02
557.xz_r/run/run_base_refrate_regalloc-m64.0000/xz_r_base.regalloc-m64 130345 133253 102.23

The 2% code size increase for gcc is not very nice, but I think it is also
expected, since we make the compiler use fewer push/pop instructions.  There
are 34091 push instructions with the patch and 38939 without.

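To make the arithmetic behind the last column and the push counts explicit,
here is a minimal sketch.  It assumes the first size column is the unpatched
binary and the second the patched one (inferred from the text, which speaks of
a size increase with the patch); the function and variable names are only
illustrative.

```python
def size_ratio_percent(before, after):
    """Size of the patched binary relative to the unpatched one, in percent."""
    return round(100.0 * after / before, 2)


# (size without patch, size with patch), copied from the -O2 table.
# Assumption: column order is unpatched first, patched second.
SIZES = {
    "cpugcc_r": (7072031, 7226911),
    "xz_r": (130345, 133253),
    "namd_r": (835954, 835457),
}

# Push instruction counts quoted in the text.
PUSH_WITHOUT_PATCH = 38939
PUSH_WITH_PATCH = 34091

if __name__ == "__main__":
    for name, (before, after) in SIZES.items():
        growth = after - before
        print(f"{name:10s} {size_ratio_percent(before, after):7.2f}%  ({growth:+d})")

    removed = PUSH_WITHOUT_PATCH - PUSH_WITH_PATCH
    share = 100.0 * removed / PUSH_WITHOUT_PATCH
    print(f"push instructions removed: {removed} ({share:.1f}% of the original count)")
```

So roughly one in eight push instructions goes away, in exchange for about 2%
more code in the worst cases (gcc and xz).
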
With -fno-ipa-ra the story is similar:

500.perlbench_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/perlbench_r_base.regalloc-O2-noipara-m64 1701299 1733024 101.86
502.gcc_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cpugcc_r_base.regalloc-O2-noipara-m64 7074527 7229855 102.19
503.bwaves_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/bwaves_r_base.regalloc-O2-noipara-m64 41327 41327 100.00
505.mcf_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/mcf_r_base.regalloc-O2-noipara-m64 17151 17151 100.00
507.cactuBSSN_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cactusBSSN_r_base.regalloc-O2-noipara-m64 3432326 3464950 100.95
508.namd_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/namd_r_base.regalloc-O2-noipara-m64 835954 835457 99.94
510.parest_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/parest_r_base.regalloc-O2-noipara-m64 7504722 7594098 101.19
511.povray_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_511_base.regalloc-O2-noipara-m64 18206 18222 100.08
511.povray_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/povray_r_base.regalloc-O2-noipara-m64 756639 763487 100.90
519.lbm_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/lbm_r_base.regalloc-O2-noipara-m64 10900 10916 100.14
520.omnetpp_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/omnetpp_r_base.regalloc-O2-noipara-m64 1403412 1425748 101.59
521.wrf_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/diffwrf_521_base.regalloc-O2-noipara-m64 16394344 16400504 100.03
521.wrf_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/wrf_r_base.regalloc-O2-noipara-m64 22300503 22308759 100.03
525.x264_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_525_base.regalloc-O2-noipara-m64 18206 18222 100.08
525.x264_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/ldecod_r_base.regalloc-O2-noipara-m64 399204 402179 100.74
525.x264_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/x264_r_base.regalloc-O2-noipara-m64 406251 408299 100.50
526.blender_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/blender_r_base.regalloc-O2-noipara-m64 7583536 7648304 100.85
526.blender_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_526_base.regalloc-O2-noipara-m64 18206 18222 100.08
527.cam4_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cam4_r_base.regalloc-O2-noipara-m64 5962335 5974239 100.19
527.cam4_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cam4_validate_527_base.regalloc-O2-noipara-m64 607327 609375 100.33
531.deepsjeng_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/deepsjeng_r_base.regalloc-O2-noipara-m64 75240 76248 101.33
538.imagick_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_538_base.regalloc-O2-noipara-m64 18206 18222 100.08
538.imagick_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagick_r_base.regalloc-O2-noipara-m64 1641226 1654060 100.78
541.leela_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/leela_r_base.regalloc-O2-noipara-m64 132764 133274 100.38
544.nab_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/nab_r_base.regalloc-O2-noipara-m64 150498 150929 100.28
548.exchange2_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/exchange2_r_base.regalloc-O2-noipara-m64 76921 76921 100.00
549.fotonik3d_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/fotonik3d_r_base.regalloc-O2-noipara-m64 464940 465260 100.06
554.roms_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/roms_r_base.regalloc-O2-noipara-m64 833926 834166 100.02
557.xz_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/xz_r_base.regalloc-O2-noipara-m64 130697 133573 102.20

Overall I think the new costing works reasonably well.

Honza