[Bug target/119298] New: 538.imagick_r is faster when compiled with GCC 14.2 and -Ofast -flto -march=native than with master on Zen5

jamborm at gcc dot gnu.org via Gcc-bugs Fri, 14 Mar 2025 11:16:29 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119298


            Bug ID: 119298
           Summary: 538.imagick_r is faster when compiled with GCC 14.2
                    and -Ofast -flto -march=native than with master on
                    Zen5
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jamborm at gcc dot gnu.org
                CC: hubicka at gcc dot gnu.org, rguenth at gcc dot gnu.org
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-linux-gnu
            Target: x86_64-linux-gnu

The SPEC FPrate 2017 benchmark 538.imagick_r is faster when compiled
with GCC 14.2 and -Ofast -flto -march=native than with trunk/master on
Zen 5 CPUs.

The regression was introduced by r15-3441-g4292297a0f938f (Jan
Hubicka: Zen5 tuning part 5: update instruction latencies in
x86-tune-costs).

The culprit is the modification of "cost of ADDSS/SD SUBSS/SD insns":
bumping it back to COSTS_N_INSNS(3) (instead of COSTS_N_INSNS(2))
makes the regression go away.  Nevertheless, Honza claims the new cost
should be correct.


Perf stat of the slow run:

         116866.57 msec task-clock:u                     #    1.000 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
              8347      page-faults:u                    #   71.423 /sec
      484499860679      cycles:u                         #    4.146 GHz
       21879349058      stalled-cycles-frontend:u        #    4.52% frontend cycles idle
     2030074730877      instructions:u                   #    4.19  insn per cycle
                                                         #    0.01  stalled cycles per insn
      224436542157      branches:u                       #    1.920 G/sec
        1716173329      branch-misses:u                  #    0.76% of all branches

     116.881252465 seconds time elapsed

     116.808499000 seconds user
       0.057350000 seconds sys

Perf report of the slow run (annotated assembly attached):

  # Samples: 470K of event 'cycles:Pu'
  # Event count (approx.): 484158470552
  #
  # Overhead       Samples  Command          Shared Object                    Symbol
  # ........  ............  ...............  ...............................  .............................................
  #
      44.71%        210348  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MeanShiftImage
      28.76%        135308  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] GetVirtualPixelsFromNexus
      25.50%        120106  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MorphologyApply





Perf stat of the fast run (with just the one cost reverted):

 Performance counter stats for 'taskset -c 0 specinvoke':

         108805.48 msec task-clock:u                     #    1.000 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
              8312      page-faults:u                    #   76.393 /sec
      450981792793      cycles:u                         #    4.145 GHz
       22610930072      stalled-cycles-frontend:u        #    5.01% frontend cycles idle
     1933965750890      instructions:u                   #    4.29  insn per cycle
                                                         #    0.01  stalled cycles per insn
      224433996552      branches:u                       #    2.063 G/sec
        1721069495      branch-misses:u                  #    0.77% of all branches

     108.819368844 seconds time elapsed

     108.763582000 seconds user
       0.041314000 seconds sys
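
As a quick cross-check of the two perf stat outputs, the following
Python sketch computes the relative difference of each counter between
the slow and the fast run.  The counter values are copied from the
outputs in this report; the dictionary keys and the helper pct_change
are illustrative names only, not anything perf provides.

```python
# Counter values copied verbatim from the two `perf stat` outputs above.
slow = {
    "task_clock_msec": 116866.57,
    "cycles": 484499860679,
    "instructions": 2030074730877,
    "branches": 224436542157,
    "branch_misses": 1716173329,
}
fast = {
    "task_clock_msec": 108805.48,
    "cycles": 450981792793,
    "instructions": 1933965750890,
    "branches": 224433996552,
    "branch_misses": 1721069495,
}


def pct_change(new, old):
    """Relative change of `new` versus `old`, in percent (hypothetical helper)."""
    return (new - old) / old * 100.0


# Positive values mean the slow run has a larger counter than the fast run.
deltas = {name: pct_change(slow[name], fast[name]) for name in slow}

# Instructions per cycle, as perf reports in the "insn per cycle" column.
ipc_slow = slow["instructions"] / slow["cycles"]
ipc_fast = fast["instructions"] / fast["cycles"]

for name, delta in deltas.items():
    print(f"{name:>16}: {delta:+6.2f}% (slow vs. fast)")
print(f"{'IPC':>16}: slow {ipc_slow:.2f}, fast {ipc_fast:.2f}")
```

With these inputs the slow build needs roughly 7% more cycles and
executes roughly 5% more instructions, while the branch count is
practically identical in both runs.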

Perf report of the fast run (annotated assembly attached):

# Samples: 427K of event 'cycles:Pu'
# Event count (approx.): 439380128661
#
# Overhead       Samples  Command          Shared Object                    Symbol
# ........  ............  ...............  ...............................  ..................................................
#
    44.53%        190164  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MeanShiftImage
    28.13%        120243  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MorphologyApply
    26.20%        111906  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] GetVirtualPixelsFromNexus
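
To see which functions account for the difference, the per-symbol
sample counts of the two perf reports can be compared directly.  The
Python sketch below does that with the numbers copied from the reports
above.  It assumes both profiles were recorded with the same sampling
period, so that raw sample counts are comparable; the variable names
are illustrative only.

```python
# Sample counts copied from the "Samples" column of the two perf reports.
slow_samples = {
    "MeanShiftImage": 210348,
    "GetVirtualPixelsFromNexus": 135308,
    "MorphologyApply": 120106,
}
fast_samples = {
    "MeanShiftImage": 190164,
    "MorphologyApply": 120243,
    "GetVirtualPixelsFromNexus": 111906,
}

# Extra samples in the slow run per symbol (negative means fewer samples).
extra = {sym: slow_samples[sym] - fast_samples[sym] for sym in slow_samples}

# The same difference relative to the fast run, in percent.
extra_pct = {sym: extra[sym] / fast_samples[sym] * 100.0 for sym in extra}

# Print the symbols with the largest absolute increase first.
for sym in sorted(extra, key=extra.get, reverse=True):
    print(f"{sym:<28} {extra[sym]:+8d} samples ({extra_pct[sym]:+6.2f}%)")
```

Under that assumption, the additional samples fall almost entirely in
MeanShiftImage and GetVirtualPixelsFromNexus, whereas MorphologyApply
is essentially unchanged between the two builds.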


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
