[Bug tree-optimization/94427] New: 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9

jamborm at gcc dot gnu.org Tue, 31 Mar 2020 10:34:38 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427


            Bug ID: 94427
           Summary: 456.hmmer is 8-17% slower when compiled at -Ofast than
                    with GCC 9
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jamborm at gcc dot gnu.org
                CC: rguenth at gcc dot gnu.org
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-linux
            Target: x86_64-linux

SPECINT 2006 benchmark 456.hmmer runs 18% slower on AMD Zen2 CPUs, 15%
slower on AMD Zen1 CPUs and 8% slower on Intel Cascade Lake server CPUs
when built with trunk (revision 26b3e568a60) and just -Ofast (so with
generic march/mtune) than when compiled with GCC 9.

Bisecting the regression leads to commit:

  commit 14ec49a7537004633b7fff859178cbebd288ca1d
  Author: Richard Biener <rguent...@suse.de>
  Date:   Tue Jul 2 07:35:23 2019 +0000

    re PR tree-optimization/58483 (missing optimization opportunity for const std::vector compared to std::array)

    2019-07-02  Richard Biener  <rguent...@suse.de>

            PR tree-optimization/58483
            * tree-ssa-scopedtables.c (avail_expr_hash): Use OEP_ADDRESS_OF
            for MEM_REF base hashing.
            (equal_mem_array_ref_p): Likewise for base comparison.

            * gcc.dg/tree-ssa/ssa-dom-cse-8.c: New testcase.

    From-SVN: r272922


The collected profiles are odd, almost the opposite of what I would
expect: the *slow* version spends less time in the cold section,
although IMHO both versions spend too much time there.  The following
data were collected on AMD Zen2, but the data from Intel are similar in
this regard.  The difference is that on Intel, perf stat reports a
doubling of branch misses and, because that machine has an older perf,
it does not report front-end/back-end stalls.

Before the aforementioned revision:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

         163360.87 msec task-clock:u              #    0.992 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
              7639      page-faults:u             #    0.047 K/sec
      525635661818      cycles:u                  #    
         809847511      stalled-cycles-frontend:u #    0.15% frontend cycles idle     (83.35%)
      299331255326      stalled-cycles-backend:u  #   56.95% backend cycles idle      (83.30%)
     1757801907547      instructions:u            #    3.34  insn per cycle
                                                  #    0.17  stalled cycles per insn  (83.34%)
      133496985084      branches:u                #  817.191 M/sec                    (83.35%)
         682351923      branch-misses:u           #    0.51% of all branches          (83.31%)

     164.659685804 seconds time elapsed

     163.325420000 seconds user
       0.022183000 seconds sys

# Samples: 637K of event 'cycles:u'
# Event count (approx.): 527143782584
#
# Overhead       Samples  Shared Object            Symbol
# ........  ............  .......................  ....................
#   
    58.43%        372284  hmmer_peak.mine-std-gen  [.] P7Viterbi
    35.12%        223887  hmmer_peak.mine-std-gen  [.] P7Viterbi.cold
     2.59%         16418  hmmer_peak.mine-std-gen  [.] FChoose
     2.51%         15906  hmmer_peak.mine-std-gen  [.] sre_random


At the aforementioned revision:

 Performance counter stats for 'numactl -C 0 -l specinvoke':                    

         191483.84 msec task-clock:u              #    0.994 CPUs utilized      
                 0      context-switches:u        #    0.000 K/sec              
                 0      cpu-migrations:u          #    0.000 K/sec              
              7639      page-faults:u             #    0.040 K/sec              
      622159384711      cycles:u                  #    
         817604010      stalled-cycles-frontend:u #    0.13% frontend cycles idle     (83.31%)
      439972264588      stalled-cycles-backend:u  #   70.72% backend cycles idle      (83.34%)
     1707838992202      instructions:u            #    2.75  insn per cycle
                                                  #    0.26  stalled cycles per insn  (83.35%)
       91309384910      branches:u                #  476.852 M/sec                    (83.32%)
         655463713      branch-misses:u           #    0.72% of all branches          (83.33%)

     192.564513355 seconds time elapsed

     191.443774000 seconds user
       0.023978000 seconds sys

# Samples: 752K of event 'cycles:u'
# Event count (approx.): 622947549968
#
# Overhead       Samples  Shared Object             Symbol
# ........  ............  ........................  ....................
#   
    83.68%        629645  hmmer_peak.small-std-gen  [.] P7Viterbi
    10.84%         81591  hmmer_peak.small-std-gen  [.] P7Viterbi.cold
     2.21%         16546  hmmer_peak.small-std-gen  [.] FChoose
     2.11%         15793  hmmer_peak.small-std-gen  [.] sre_random
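
For reference, the relative changes between the two Zen2 runs can be
derived directly from the counters quoted above.  The following is only
a small Python sketch and not part of the original measurements; the
dictionary keys and helper names (pct_change, ipc, branch_miss_rate) are
made up for illustration, and the numbers are copied verbatim from the
two perf stat outputs.

```python
# Counter values copied from the two perf stat outputs above (AMD Zen2).
# The key names are illustrative, not perf's own identifiers.
before = {
    "task_clock_ms": 163360.87,
    "cycles": 525635661818,
    "instructions": 1757801907547,
    "branches": 133496985084,
    "branch_misses": 682351923,
}
after = {
    "task_clock_ms": 191483.84,
    "cycles": 622159384711,
    "instructions": 1707838992202,
    "branches": 91309384910,
    "branch_misses": 655463713,
}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def ipc(counters):
    """Instructions retired per cycle."""
    return counters["instructions"] / counters["cycles"]


def branch_miss_rate(counters):
    """Branch misses as a percentage of all branches."""
    return counters["branch_misses"] / counters["branches"] * 100.0


if __name__ == "__main__":
    for key in before:
        print(f"{key:>14}: {pct_change(before[key], after[key]):+7.2f}%")
    print(f"IPC: {ipc(before):.2f} -> {ipc(after):.2f}")
    print(f"branch miss rate: {branch_miss_rate(before):.2f}% -> "
          f"{branch_miss_rate(after):.2f}%")
```

With these inputs the slow build needs roughly 18% more cycles and about
17% more task-clock time, while executing about 3% fewer instructions
and roughly 32% fewer branches; IPC falls from about 3.34 to about 2.75.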


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
