Jan Hubicka <[email protected]> writes:
>> On Wed, Feb 19, 2025 at 9:06 PM Jan Hubicka <[email protected]> wrote:
>> >
>> > Hi,
>> > this is a variant of a hook I benchmarked on cpu2016 with -Ofast -flto
>> > and -O2 -flto. For non -Os and non-Windows ABI it should be practically
>> > the same as your variant that simply returned mem_cost - 2.
>> >
>> I've tested O2/(Ofast march=native) with SPEC2017 on SPR, mostly
>> neutral (small improvement on povray).
>
> So I got ryzen3 runs with -O2, -O3 and -fno-ipa-ra.
>
> Overall differences are quite small, but I think it is expected. Here is
> what I get with -O2:
> Benchmark        Base    Base      Base      Peak    Peak      Peak
>                  Copies  Run Time  Rate      Copies  Run Time  Rate
> ---------------  ------  --------  --------  ------  --------  --------
> 500.perlbench_r       1       188   8.46 S        1       183   8.69 S
> 500.perlbench_r       1       187   8.52 *        1       182   8.75 *
> 500.perlbench_r       1       186   8.56 S        1       182   8.75 S
> 502.gcc_r             1       139   10.2 S        1       137   10.3 *
> 502.gcc_r             1       139   10.2 S        1       137   10.4 S
> 502.gcc_r             1       139   10.2 *        1       137   10.3 S
> 505.mcf_r             1       187   8.66 *        1       188   8.61 S
> 505.mcf_r             1       186   8.70 S        1       187   8.66 *
> 505.mcf_r             1       188   8.62 S        1       187   8.66 S
> 520.omnetpp_r         1       213   6.15 *        1       207   6.32 *
> 520.omnetpp_r         1       212   6.18 S        1       206   6.37 S
> 520.omnetpp_r         1       219   5.99 S        1       215   6.11 S
> 523.xalancbmk_r       1        --        CE       1        --        CE
> 525.x264_r            1       135   13.0 S        1       135   12.9 *
> 525.x264_r            1       135   13.0 *        1       135   12.9 S
> 525.x264_r            1       135   13.0 S        1       135   12.9 S
> 531.deepsjeng_r       1       167   6.86 *        1       167   6.85 S
> 531.deepsjeng_r       1       167   6.86 S        1       168   6.84 S
> 531.deepsjeng_r       1       167   6.86 S        1       167   6.85 *
> 541.leela_r           1       296   5.60 S        1       292   5.67 *
> 541.leela_r           1       293   5.65 S        1       293   5.65 S
> 541.leela_r           1       296   5.60 *        1       292   5.67 S
> 548.exchange2_r       1       208   12.6 S        1       208   12.6 S
> 548.exchange2_r       1       208   12.6 *        1       208   12.6 S
> 548.exchange2_r       1       208   12.6 S        1       208   12.6 *
> 557.xz_r              1       194   5.58 S        1       193   5.58 S
> 557.xz_r              1       192   5.62 S        1       193   5.60 S
> 557.xz_r              1       193   5.60 *        1       193   5.59 *
> =======================================================================
> 500.perlbench_r       1       187   8.52 *        1       182   8.75 *
> 502.gcc_r             1       139   10.2 *        1       137   10.3 *
> 505.mcf_r             1       187   8.66 *        1       187   8.66 *
> 520.omnetpp_r         1       213   6.15 *        1       207   6.32 *
> 523.xalancbmk_r                          NR                          NR
> 525.x264_r            1       135   13.0 *        1       135   12.9 *
> 531.deepsjeng_r       1       167   6.86 *        1       167   6.85 *
> 541.leela_r           1       296   5.60 *        1       292   5.67 *
> 548.exchange2_r       1       208   12.6 *        1       208   12.6 *
> 557.xz_r              1       193   5.60 *        1       193   5.59 *
> Est. SPECrate2017_int_base 8.17
> Est. SPECrate2017_int_peak 8.24
>
> Perlbench seems to improve consistently without LTO (with -O2, -O3 and
> -O2 -fno-ipa-ra), and I think it may be just luck with code layout.
> gcc is quite consistent in all settings. Overall it seems to be a
> consistent small win. For fp tests, I see only off-noise povray
> differences, and only with -Ofast and -Ofast -flto.
>
> Comparing code sizes at -O2 (size without the patch, size with the
> patch, ratio in %):
>
> 500.perlbench_r/run/run_base_refrate_regalloc-m64.0000/perlbench_r_base.regalloc-m64
> 1699987 1731648 101.86
> 502.gcc_r/run/run_base_refrate_regalloc-m64.0000/cpugcc_r_base.regalloc-m64
> 7072031 7226911 102.19
> 503.bwaves_r/run/run_base_refrate_regalloc-m64.0000/bwaves_r_base.regalloc-m64
> 41327 41327 100.00
> 505.mcf_r/run/run_base_refrate_regalloc-m64.0000/mcf_r_base.regalloc-m64
> 17023 17023 100.00
> 507.cactuBSSN_r/run/run_base_refrate_regalloc-m64.0000/cactusBSSN_r_base.regalloc-m64
> 3432326 3464950 100.95
> 508.namd_r/run/run_base_refrate_regalloc-m64.0000/namd_r_base.regalloc-m64
> 835954 835457 99.94
> 510.parest_r/run/run_base_refrate_regalloc-m64.0000/parest_r_base.regalloc-m64
> 7498066 7587378 101.19
> 511.povray_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_511_base.regalloc-m64
> 18206 18222 100.08
> 511.povray_r/run/run_base_refrate_regalloc-m64.0000/povray_r_base.regalloc-m64
> 754591 761695 100.94
> 519.lbm_r/run/run_base_refrate_regalloc-m64.0000/lbm_r_base.regalloc-m64
> 10900 10916 100.14
> 520.omnetpp_r/run/run_base_refrate_regalloc-m64.0000/omnetpp_r_base.regalloc-m64
> 1403348 1425556 101.58
> 521.wrf_r/run/run_base_refrate_regalloc-m64.0000/diffwrf_521_base.regalloc-m64
> 16388136 16394552 100.03
> 521.wrf_r/run/run_base_refrate_regalloc-m64.0000/wrf_r_base.regalloc-m64
> 22293527 22302167 100.03
> 525.x264_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_525_base.regalloc-m64
> 18206 18222 100.08
> 525.x264_r/run/run_base_refrate_regalloc-m64.0000/ldecod_r_base.regalloc-m64
> 398564 401667 100.77
> 525.x264_r/run/run_base_refrate_regalloc-m64.0000/x264_r_base.regalloc-m64
> 405515 407051 100.37
> 526.blender_r/run/run_base_refrate_regalloc-m64.0000/blender_r_base.regalloc-m64
> 7567792 7631536 100.84
> 526.blender_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_526_base.regalloc-m64
> 18206 18222 100.08
> 527.cam4_r/run/run_base_refrate_regalloc-m64.0000/cam4_r_base.regalloc-m64
> 5957695 5969535 100.19
> 527.cam4_r/run/run_base_refrate_regalloc-m64.0000/cam4_validate_527_base.regalloc-m64
> 606591 608767 100.35
> 531.deepsjeng_r/run/run_base_refrate_regalloc-m64.0000/deepsjeng_r_base.regalloc-m64
> 75304 76248 101.25
> 538.imagick_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_538_base.regalloc-m64
> 18206 18222 100.08
> 538.imagick_r/run/run_base_refrate_regalloc-m64.0000/imagick_r_base.regalloc-m64
> 1638858 1651628 100.77
> 541.leela_r/run/run_base_refrate_regalloc-m64.0000/leela_r_base.regalloc-m64
> 132636 133146 100.38
> 544.nab_r/run/run_base_refrate_regalloc-m64.0000/nab_r_base.regalloc-m64
> 150146 150513 100.24
> 548.exchange2_r/run/run_base_refrate_regalloc-m64.0000/exchange2_r_base.regalloc-m64
> 76709 76709 100.00
> 549.fotonik3d_r/run/run_base_refrate_regalloc-m64.0000/fotonik3d_r_base.regalloc-m64
> 464940 465260 100.06
> 554.roms_r/run/run_base_refrate_regalloc-m64.0000/roms_r_base.regalloc-m64
> 833926 834166 100.02
> 557.xz_r/run/run_base_refrate_regalloc-m64.0000/xz_r_base.regalloc-m64
> 130345 133253 102.23
>
> The 2% code size increase for gcc is not very nice, but I think it is
> also expected, since we make the compiler use fewer push/pop
> instructions. There are 34091 push instructions with the patch and
> 38939 without.
>
> With -fno-ipa-ra the story is similar:
>
> 500.perlbench_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/perlbench_r_base.regalloc-O2-noipara-m64
> 1701299 1733024 101.86
> 502.gcc_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cpugcc_r_base.regalloc-O2-noipara-m64
> 7074527 7229855 102.19
> 503.bwaves_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/bwaves_r_base.regalloc-O2-noipara-m64
> 41327 41327 100.00
> 505.mcf_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/mcf_r_base.regalloc-O2-noipara-m64
> 17151 17151 100.00
> 507.cactuBSSN_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cactusBSSN_r_base.regalloc-O2-noipara-m64
> 3432326 3464950 100.95
> 508.namd_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/namd_r_base.regalloc-O2-noipara-m64
> 835954 835457 99.94
> 510.parest_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/parest_r_base.regalloc-O2-noipara-m64
> 7504722 7594098 101.19
> 511.povray_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_511_base.regalloc-O2-noipara-m64
> 18206 18222 100.08
> 511.povray_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/povray_r_base.regalloc-O2-noipara-m64
> 756639 763487 100.90
> 519.lbm_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/lbm_r_base.regalloc-O2-noipara-m64
> 10900 10916 100.14
> 520.omnetpp_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/omnetpp_r_base.regalloc-O2-noipara-m64
> 1403412 1425748 101.59
> 521.wrf_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/diffwrf_521_base.regalloc-O2-noipara-m64
> 16394344 16400504 100.03
> 521.wrf_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/wrf_r_base.regalloc-O2-noipara-m64
> 22300503 22308759 100.03
> 525.x264_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_525_base.regalloc-O2-noipara-m64
> 18206 18222 100.08
> 525.x264_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/ldecod_r_base.regalloc-O2-noipara-m64
> 399204 402179 100.74
> 525.x264_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/x264_r_base.regalloc-O2-noipara-m64
> 406251 408299 100.50
> 526.blender_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/blender_r_base.regalloc-O2-noipara-m64
> 7583536 7648304 100.85
> 526.blender_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_526_base.regalloc-O2-noipara-m64
> 18206 18222 100.08
> 527.cam4_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cam4_r_base.regalloc-O2-noipara-m64
> 5962335 5974239 100.19
> 527.cam4_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cam4_validate_527_base.regalloc-O2-noipara-m64
> 607327 609375 100.33
> 531.deepsjeng_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/deepsjeng_r_base.regalloc-O2-noipara-m64
> 75240 76248 101.33
> 538.imagick_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_538_base.regalloc-O2-noipara-m64
> 18206 18222 100.08
> 538.imagick_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagick_r_base.regalloc-O2-noipara-m64
> 1641226 1654060 100.78
> 541.leela_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/leela_r_base.regalloc-O2-noipara-m64
> 132764 133274 100.38
> 544.nab_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/nab_r_base.regalloc-O2-noipara-m64
> 150498 150929 100.28
> 548.exchange2_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/exchange2_r_base.regalloc-O2-noipara-m64
> 76921 76921 100.00
> 549.fotonik3d_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/fotonik3d_r_base.regalloc-O2-noipara-m64
> 464940 465260 100.06
> 554.roms_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/roms_r_base.regalloc-O2-noipara-m64
> 833926 834166 100.02
> 557.xz_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/xz_r_base.regalloc-O2-noipara-m64
> 130697 133573 102.20
>
> Overall I think the new costing works reasonably well.

Thanks for running these. I saw poor results for perlbench with my
initial aarch64 hooks because the hooks reduced the cost to zero for
the entry case:

  auto entry_cost = targetm.callee_save_cost
    (spill_cost_type::SAVE, hard_regno, mode, saved_nregs,
     ira_memory_move_cost[mode][rclass][0] * saved_nregs / nregs,
     allocated_callee_save_regs, existing_spills_p);
  /* In the event of a tie between caller-save and callee-save,
     prefer callee-save.  We apply this to the entry cost rather
     than the exit cost since the entry frequency must be at
     least as high as the exit frequency.  */
  if (entry_cost > 0)
    entry_cost -= 1;

I "fixed" that by bumping the cost to a minimum of 2, but I was
wondering whether the "entry_cost > 0" should instead be "entry_cost > 1",
so that the cost is always greater than not using a callee save for
registers that don't cross a call. WDYT?
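
To make the difference between the two checks concrete, here is a small
standalone sketch. It is not code from the patch: adjust_entry_cost and
its threshold parameter are made-up names that only model the tie-break
applied to whatever value the hook returns.

```cpp
#include <cassert>

// Hypothetical helper, not part of GCC: applies the tie-break to a cost
// returned by the callee_save_cost hook.  THRESHOLD is 0 for the current
// check ("entry_cost > 0") and 1 for the proposed one ("entry_cost > 1").
constexpr int
adjust_entry_cost (int entry_cost, int threshold)
{
  return entry_cost > threshold ? entry_cost - 1 : entry_cost;
}

// Compare the two variants for small hook results.
inline void
compare_variants ()
{
  // A hook result of 1 is the only case where the variants differ:
  // the current check makes the save free, the proposed one keeps it at 1.
  assert (adjust_entry_cost (1, 0) == 0);
  assert (adjust_entry_cost (1, 1) == 1);

  // Results of 0 and of 2 or more are treated identically by both.
  assert (adjust_entry_cost (0, 0) == 0 && adjust_entry_cost (0, 1) == 0);
  assert (adjust_entry_cost (2, 0) == 1 && adjust_entry_cost (2, 1) == 1);
  assert (adjust_entry_cost (3, 0) == 2 && adjust_entry_cost (3, 1) == 2);
}
```

In this model, a hook that returns at least 2 sees no change from the
"> 1" variant, and a hook that returns 1 no longer ends up with a free
entry save.
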
Richard