> On Wed, Feb 19, 2025 at 9:06 PM Jan Hubicka <hubi...@ucw.cz> wrote:
> >
> > Hi,
> > this is a variant of a hook I benchmarked on cpu2016 with -Ofast -flto
> > and -O2 -flto.  For non -Os and no Windows ABI should be practically the
> > same as your variant that was simply returning mem_cost - 2.
> >
> I've tested O2/(Ofast march=native) with SPEC2017 on SPR, mostly
> neutral (small improvement on povray).

So I got ryzen3 runs with -O2, -O3 and -fno-ipa-ra.

Overall differences are quite small, but I think it is expected. Here is
what I get with -O2:
                 Base      Base       Base        Peak      Peak       Peak
Benchmarks      Copies   Run Time     Rate       Copies   Run Time     Rate
--------------- -------  ---------  ---------    -------  ---------  ---------
500.perlbench_r       1        188       8.46  S       1        183       8.69  S
500.perlbench_r       1        187       8.52  *       1        182       8.75  *
500.perlbench_r       1        186       8.56  S       1        182       8.75  S
502.gcc_r             1        139      10.2   S       1        137      10.3   *
502.gcc_r             1        139      10.2   S       1        137      10.4   S
502.gcc_r             1        139      10.2   *       1        137      10.3   S
505.mcf_r             1        187       8.66  *       1        188       8.61  S
505.mcf_r             1        186       8.70  S       1        187       8.66  *
505.mcf_r             1        188       8.62  S       1        187       8.66  S
520.omnetpp_r         1        213       6.15  *       1        207       6.32  *
520.omnetpp_r         1        212       6.18  S       1        206       6.37  S
520.omnetpp_r         1        219       5.99  S       1        215       6.11  S
523.xalancbmk_r       1         --            CE       1         --            CE
525.x264_r            1        135      13.0   S       1        135      12.9   *
525.x264_r            1        135      13.0   *       1        135      12.9   S
525.x264_r            1        135      13.0   S       1        135      12.9   S
531.deepsjeng_r       1        167       6.86  *       1        167       6.85  S
531.deepsjeng_r       1        167       6.86  S       1        168       6.84  S
531.deepsjeng_r       1        167       6.86  S       1        167       6.85  *
541.leela_r           1        296       5.60  S       1        292       5.67  *
541.leela_r           1        293       5.65  S       1        293       5.65  S
541.leela_r           1        296       5.60  *       1        292       5.67  S
548.exchange2_r       1        208      12.6   S       1        208      12.6   S
548.exchange2_r       1        208      12.6   *       1        208      12.6   S
548.exchange2_r       1        208      12.6   S       1        208      12.6   *
557.xz_r              1        194       5.58  S       1        193       5.58  S
557.xz_r              1        192       5.62  S       1        193       5.60  S
557.xz_r              1        193       5.60  *       1        193       5.59  *
=================================================================================
500.perlbench_r       1        187       8.52  *       1        182       8.75  *
502.gcc_r             1        139      10.2   *       1        137      10.3   *
505.mcf_r             1        187       8.66  *       1        187       8.66  *
520.omnetpp_r         1        213       6.15  *       1        207       6.32  *
523.xalancbmk_r                               NR                               NR
525.x264_r            1        135      13.0   *       1        135      12.9   *
531.deepsjeng_r       1        167       6.86  *       1        167       6.85  *
541.leela_r           1        296       5.60  *       1        292       5.67  *
548.exchange2_r       1        208      12.6   *       1        208      12.6   *
557.xz_r              1        193       5.60  *       1        193       5.59  *
 Est. SPECrate2017_int_base              8.17
 Est. SPECrate2017_int_peak                                               8.24
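
As a rough sanity check on the table above, the two estimated SPECrate
figures can be approximated as the geometric mean of the selected ("*")
rates.  The following is only a minimal sketch: the names BASE_RATES,
PEAK_RATES and geometric_mean are made up for illustration, the rates are
copied by hand from the summary rows, and 523.xalancbmk_r is left out
because it has no result (CE/NR).

```python
import math

# Selected ("*") rates copied from the summary rows of the table above.
# 523.xalancbmk_r is omitted because it has no result (CE/NR).
BASE_RATES = [8.52, 10.2, 8.66, 6.15, 13.0, 6.86, 5.60, 12.6, 5.60]
PEAK_RATES = [8.75, 10.3, 8.66, 6.32, 12.9, 6.85, 5.67, 12.6, 5.59]


def geometric_mean(values):
    """Return the geometric mean of a sequence of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    base = geometric_mean(BASE_RATES)
    peak = geometric_mean(PEAK_RATES)
    # The inputs are already rounded to three significant digits, so the
    # result can differ from the reported estimate in the last digit.
    print(f"base {base:.2f}  peak {peak:.2f}  peak/base {peak / base:.4f}")
```

Since the table only shows rounded rates, the result should land close to
the reported 8.17 and 8.24 but need not match them to the last digit.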

Perlbench seems to improve consistently without LTO (with -O2, -O3 and
-O2 -fno-ipa-ra), and I think it may be just luck with code layout.
gcc is quite consistent in all settings.  Overall it seems to be a
consistent small win.  For fp tests, I see differences outside the noise
only for povray, and only with -Ofast and -Ofast -flto.

Comparing code sizes at -O2:

500.perlbench_r/run/run_base_refrate_regalloc-m64.0000/perlbench_r_base.regalloc-m64   1699987   1731648  101.86
502.gcc_r/run/run_base_refrate_regalloc-m64.0000/cpugcc_r_base.regalloc-m64   7072031   7226911  102.19
503.bwaves_r/run/run_base_refrate_regalloc-m64.0000/bwaves_r_base.regalloc-m64     41327     41327  100.00
505.mcf_r/run/run_base_refrate_regalloc-m64.0000/mcf_r_base.regalloc-m64     17023     17023  100.00
507.cactuBSSN_r/run/run_base_refrate_regalloc-m64.0000/cactusBSSN_r_base.regalloc-m64   3432326   3464950  100.95
508.namd_r/run/run_base_refrate_regalloc-m64.0000/namd_r_base.regalloc-m64    835954    835457   99.94
510.parest_r/run/run_base_refrate_regalloc-m64.0000/parest_r_base.regalloc-m64   7498066   7587378  101.19
511.povray_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_511_base.regalloc-m64     18206     18222  100.08
511.povray_r/run/run_base_refrate_regalloc-m64.0000/povray_r_base.regalloc-m64    754591    761695  100.94
519.lbm_r/run/run_base_refrate_regalloc-m64.0000/lbm_r_base.regalloc-m64     10900     10916  100.14
520.omnetpp_r/run/run_base_refrate_regalloc-m64.0000/omnetpp_r_base.regalloc-m64   1403348   1425556  101.58
521.wrf_r/run/run_base_refrate_regalloc-m64.0000/diffwrf_521_base.regalloc-m64  16388136  16394552  100.03
521.wrf_r/run/run_base_refrate_regalloc-m64.0000/wrf_r_base.regalloc-m64  22293527  22302167  100.03
525.x264_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_525_base.regalloc-m64     18206     18222  100.08
525.x264_r/run/run_base_refrate_regalloc-m64.0000/ldecod_r_base.regalloc-m64    398564    401667  100.77
525.x264_r/run/run_base_refrate_regalloc-m64.0000/x264_r_base.regalloc-m64    405515    407051  100.37
526.blender_r/run/run_base_refrate_regalloc-m64.0000/blender_r_base.regalloc-m64   7567792   7631536  100.84
526.blender_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_526_base.regalloc-m64     18206     18222  100.08
527.cam4_r/run/run_base_refrate_regalloc-m64.0000/cam4_r_base.regalloc-m64   5957695   5969535  100.19
527.cam4_r/run/run_base_refrate_regalloc-m64.0000/cam4_validate_527_base.regalloc-m64    606591    608767  100.35
531.deepsjeng_r/run/run_base_refrate_regalloc-m64.0000/deepsjeng_r_base.regalloc-m64     75304     76248  101.25
538.imagick_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_538_base.regalloc-m64     18206     18222  100.08
538.imagick_r/run/run_base_refrate_regalloc-m64.0000/imagick_r_base.regalloc-m64   1638858   1651628  100.77
541.leela_r/run/run_base_refrate_regalloc-m64.0000/leela_r_base.regalloc-m64    132636    133146  100.38
544.nab_r/run/run_base_refrate_regalloc-m64.0000/nab_r_base.regalloc-m64    150146    150513  100.24
548.exchange2_r/run/run_base_refrate_regalloc-m64.0000/exchange2_r_base.regalloc-m64     76709     76709  100.00
549.fotonik3d_r/run/run_base_refrate_regalloc-m64.0000/fotonik3d_r_base.regalloc-m64    464940    465260  100.06
554.roms_r/run/run_base_refrate_regalloc-m64.0000/roms_r_base.regalloc-m64    833926    834166  100.02
557.xz_r/run/run_base_refrate_regalloc-m64.0000/xz_r_base.regalloc-m64    130345    133253  102.23

The 2% code size increase for gcc is not very nice, but I think it is also
expected, since we make the compiler use fewer push/pop instructions.
There are 34091 push instructions with the patch and 38939 without.
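
To put the two numbers next to each other, here is a small sketch of the
arithmetic.  It assumes that the first size column in the listing is the
build without the patch and the second is the build with it (this is
inferred from the "2% increase" remark, not stated in the listing); the
names percent_of, GCC_SIZE_* and PUSH_* are made up for illustration.

```python
def percent_of(new, old):
    """Return `new` expressed as a percentage of `old`."""
    return 100.0 * new / old


# 502.gcc_r binary size at -O2, copied from the listing above.
# Assumption: first column = without patch, second column = with patch.
GCC_SIZE_WITHOUT = 7072031
GCC_SIZE_WITH = 7226911

# Push instruction counts quoted in the paragraph above.
PUSH_WITHOUT = 38939
PUSH_WITH = 34091

if __name__ == "__main__":
    size_pct = percent_of(GCC_SIZE_WITH, GCC_SIZE_WITHOUT)
    push_drop = 100.0 - percent_of(PUSH_WITH, PUSH_WITHOUT)
    print(f"gcc size with patch: {size_pct:.2f}% of unpatched")
    print(f"push instructions removed: {PUSH_WITHOUT - PUSH_WITH} "
          f"({push_drop:.2f}% fewer)")
```

Under that assumption the size ratio comes out at about 102.19%, matching
the last column of the listing, while the number of push instructions
drops by 4848, roughly 12.45%.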

With -fno-ipa-ra the story is similar:

500.perlbench_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/perlbench_r_base.regalloc-O2-noipara-m64   1701299   1733024  101.86
502.gcc_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cpugcc_r_base.regalloc-O2-noipara-m64   7074527   7229855  102.19
503.bwaves_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/bwaves_r_base.regalloc-O2-noipara-m64     41327     41327  100.00
505.mcf_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/mcf_r_base.regalloc-O2-noipara-m64     17151     17151  100.00
507.cactuBSSN_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cactusBSSN_r_base.regalloc-O2-noipara-m64   3432326   3464950  100.95
508.namd_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/namd_r_base.regalloc-O2-noipara-m64    835954    835457   99.94
510.parest_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/parest_r_base.regalloc-O2-noipara-m64   7504722   7594098  101.19
511.povray_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_511_base.regalloc-O2-noipara-m64     18206     18222  100.08
511.povray_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/povray_r_base.regalloc-O2-noipara-m64    756639    763487  100.90
519.lbm_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/lbm_r_base.regalloc-O2-noipara-m64     10900     10916  100.14
520.omnetpp_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/omnetpp_r_base.regalloc-O2-noipara-m64   1403412   1425748  101.59
521.wrf_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/diffwrf_521_base.regalloc-O2-noipara-m64  16394344  16400504  100.03
521.wrf_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/wrf_r_base.regalloc-O2-noipara-m64  22300503  22308759  100.03
525.x264_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_525_base.regalloc-O2-noipara-m64     18206     18222  100.08
525.x264_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/ldecod_r_base.regalloc-O2-noipara-m64    399204    402179  100.74
525.x264_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/x264_r_base.regalloc-O2-noipara-m64    406251    408299  100.50
526.blender_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/blender_r_base.regalloc-O2-noipara-m64   7583536   7648304  100.85
526.blender_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_526_base.regalloc-O2-noipara-m64     18206     18222  100.08
527.cam4_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cam4_r_base.regalloc-O2-noipara-m64   5962335   5974239  100.19
527.cam4_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cam4_validate_527_base.regalloc-O2-noipara-m64    607327    609375  100.33
531.deepsjeng_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/deepsjeng_r_base.regalloc-O2-noipara-m64     75240     76248  101.33
538.imagick_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_538_base.regalloc-O2-noipara-m64     18206     18222  100.08
538.imagick_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagick_r_base.regalloc-O2-noipara-m64   1641226   1654060  100.78
541.leela_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/leela_r_base.regalloc-O2-noipara-m64    132764    133274  100.38
544.nab_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/nab_r_base.regalloc-O2-noipara-m64    150498    150929  100.28
548.exchange2_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/exchange2_r_base.regalloc-O2-noipara-m64     76921     76921  100.00
549.fotonik3d_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/fotonik3d_r_base.regalloc-O2-noipara-m64    464940    465260  100.06
554.roms_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/roms_r_base.regalloc-O2-noipara-m64    833926    834166  100.02
557.xz_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/xz_r_base.regalloc-O2-noipara-m64    130697    133573  102.20

Overall I think the new costing works reasonably well.

Honza
