https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88751
            Bug ID: 88751
           Summary: Performance regression reload vs lra
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: krebbel at gcc dot gnu.org
  Target Milestone: ---

There is a big performance drop in OpenJ9 after they updated from GCC 4.8.5 to GCC 7.3.0.

- The performance regression disappears after compiling the byte code interpreter loop with -mno-lra.
  https://github.com/eclipse/openj9/blob/master/runtime/vm/BytecodeInterpreter.hpp

- The problem comes from the frequently accessed _pc and _sp variables being assigned to stack slots instead of registers. With GCC 4.8 both variables end up in hard regs.

- The problem can be seen on x86 as well as on S/390.

- In LRA the root cause of the problem is a threshold which prevents LRA from running the full register coloring step (ira.c):

  /* If there are too many pseudos and/or basic blocks (e.g. 10K
     pseudos and 10K blocks or 100K pseudos and 1K blocks), we will
     use simplified and faster algorithms in LRA.  */
  lra_simple_p
    = (ira_use_lra_p
       && max_reg_num () >= (1 << 26) / last_basic_block_for_fn (cfun));

For the huge run() function in the byte code interpreter the numbers are:

(gdb) p max_reg_num()
$6 = 27089
(gdb) p last_basic_block_for_fn(cfun)
$7 = 4799

Forcing GCC to run the full coloring pass gets hard regs assigned to the _pc and _sp variables again.

As a quick workaround we might want to turn this threshold into a parameter.

Long-term it would be good if we could either teach the heuristic to estimate whether full coloring would be beneficial, or improve the fallback coloring to cover such important cases.
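
To make the numbers above concrete, here is a minimal standalone sketch of the threshold arithmetic. It is not GCC code: the helper names lra_simple_threshold and would_use_simple_lra are hypothetical, ira_use_lra_p is assumed to be true, and only the formula itself is taken from the ira.c snippet quoted above.

```c
#include <stdbool.h>

/* Hypothetical helper: the pseudo-register count at which the quoted
   ira.c condition starts to hold for a given basic block count.
   Integer division, as in the original expression.  */
int lra_simple_threshold(int last_basic_block)
{
    return (1 << 26) / last_basic_block;
}

/* Hypothetical helper: mirrors the quoted condition, assuming
   ira_use_lra_p is true.  */
bool would_use_simple_lra(int max_reg_num, int last_basic_block)
{
    return max_reg_num >= lra_simple_threshold(last_basic_block);
}
```

With the values reported for run(), lra_simple_threshold(4799) evaluates to 13983, so 27089 pseudos is roughly twice the limit and would_use_simple_lra(27089, 4799) is true, which matches the simplified algorithms being selected for this function.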