https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88751

            Bug ID: 88751
           Summary: Performance regression reload vs lra
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: krebbel at gcc dot gnu.org
  Target Milestone: ---

OpenJ9 shows a big performance drop after the project updated from GCC
4.8.5 to GCC 7.3.0.

- The performance regression disappears after compiling the byte code
interpreter loop with -mno-lra.
https://github.com/eclipse/openj9/blob/master/runtime/vm/BytecodeInterpreter.hpp

- The problem comes from the frequently accessed _pc and _sp variables being
assigned to stack slots instead of registers. With GCC 4.8 both variables end
up in hard regs.

- The problem can be seen on x86 as well as on S/390.

- In LRA the root cause of the problem is a threshold which prevents LRA from
running the full register coloring step (ira.c):

  /* If there are too many pseudos and/or basic blocks (e.g. 10K
     pseudos and 10K blocks or 100K pseudos and 1K blocks), we will
     use simplified and faster algorithms in LRA.  */
  lra_simple_p
    = (ira_use_lra_p
       && max_reg_num () >= (1 << 26) / last_basic_block_for_fn (cfun));

  For the huge run() function in the byte code interpreter the numbers are:

  (gdb) p max_reg_num()
  $6 = 27089
  (gdb) p last_basic_block_for_fn(cfun)
  $7 = 4799

  Forcing GCC to run the full coloring pass causes the _pc and _sp variables to
be assigned hard regs again.
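
To make the arithmetic concrete, here is a small standalone sketch of the same
comparison applied to the gdb numbers above. The helper name
lra_simple_threshold_hit is made up for illustration and is not part of GCC;
only the expression mirrors the ira.c snippet, and ira_use_lra_p is assumed to
be true.

```c
#include <stdbool.h>

/* Hypothetical helper, not GCC code: mirrors the ira.c comparison
   max_reg_num () >= (1 << 26) / last_basic_block_for_fn (cfun).
   Assumes num_blocks > 0 and that ira_use_lra_p is true.  */
static bool
lra_simple_threshold_hit (int num_regs, int num_blocks)
{
  return num_regs >= (1 << 26) / num_blocks;
}
```

With 4799 basic blocks the limit is (1 << 26) / 4799 = 13983 pseudos, so the
27089 registers in run() are roughly twice the limit and LRA takes the
simplified path.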


As a quick workaround we might want to turn this threshold into a parameter.
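
A minimal sketch of what such a parameterized check could look like, written
as a standalone function rather than as a patch. The names use_simple_lra and
budget_log2 are hypothetical and do not correspond to any existing GCC
parameter; a value of 26 reproduces the current hard-coded (1 << 26).

```c
#include <stdbool.h>

/* Hypothetical sketch, not GCC code.  budget_log2 stands in for a
   user-settable parameter; 26 matches the hard-coded (1 << 26) in
   ira.c.  Assumes num_blocks > 0 and 0 <= budget_log2 <= 30.  */
static bool
use_simple_lra (int num_regs, int num_blocks, int budget_log2)
{
  return num_regs >= (1 << budget_log2) / num_blocks;
}
```

For the run() function (27089 pseudos, 4799 blocks), budget_log2 = 26 selects
the simplified algorithms, while budget_log2 = 27 raises the limit to 27967
pseudos and would let the full coloring run.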

Long-term it would be good if we could either enable the heuristic to estimate
whether full coloring would be beneficial or improve the fallback coloring to
cover such important cases.
