https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59811
--- Comment #16 from Richard Biener <rguenth at gcc dot gnu.org> ---
Callgrind-ing a release-checking stage3 shows get_ref_base_and_extent and
find_hard_regno_for_1 at the top.  And it shows wi::lshift_large called from
get_ref_base_and_extent - exactly what I feared... we do hit both
wi::lshift_large and wi::mul_internal.

perf confirms the hot spots get_ref_base_and_extent (9%) and
find_hard_regno_for_1 (19%), but wi::lshift_large is somewhat down (1.8%)
and wi::mul_internal is at 1%.  Note the shifts are all by 3
(BITS_PER_UNIT multiplication).

The

  /* The first unfilled output block is a left shift of the first block in
     XVAL.  The other output blocks contain bits from two consecutive input
     blocks.  */
  unsigned HOST_WIDE_INT carry = 0;
  for (unsigned int i = skip; i < len; ++i)
    {
      unsigned HOST_WIDE_INT x = safe_uhwi (xval, xlen, i - skip);
      val[i] = (x << small_shift) | carry;
      carry = x >> (-small_shift % HOST_BITS_PER_WIDE_INT);
    }

loop in lshift_large doesn't seem to be very latency friendly:

  4.02 │ e0:   mov    (%r11,%r9,8),%rax
  5.54 │ e4:   mov    %rax,%rdi
       │       mov    %r8d,%ecx
       │       shl    %cl,%rdi
  4.23 │       mov    %rdi,%rcx
  3.91 │       or     %r15,%rcx
  2.06 │       mov    %rcx,(%r14,%r9,8)
  7.38 │       mov    %r13d,%ecx
  1.41 │       add    $0x1,%r9
       │       shr    %cl,%rax
  3.04 │       cmp    %r12,%r9
  3.37 │       mov    %rax,%r15
  1.95 │     ↑ je     9e

I wonder if GCC can be more efficient here by special-casing skip == 0,
len == 2 and using an __int128 on hosts where that is available (a sketch
follows at the end of this comment).  In this case we're shifting
xlen == 1 values, but the precision might need two blocks (the byte
quantity fits one block, the scaled bit quantity may not), so
special-casing that might also make sense.  It helps a bit, but of course
all the extra testing has overhead as well.

Maybe a wi::bytes_to_bits helper is a better solution here (also sketched
below).

Anyway, somehow caching the get_ref_base_and_extent result (which we
re-compute only for the stores, btw, for stmt_may_clobber_ref_p) might
help more (see the last sketch below).

Note that with release checking the testcase compiles quite fast for me:

 alias stmt walking      :   4.53 (37%) usr   0.04 (18%) sys   4.44 (36%) wall       2 kB ( 0%) ggc
 dead store elim2        :   0.67 ( 5%) usr   0.04 (18%) sys   0.71 ( 6%) wall   87250 kB (57%) ggc
 combiner                :   0.20 ( 2%) usr   0.00 ( 0%) sys   0.20 ( 2%) wall    2709 kB ( 2%) ggc
 integrated RA           :   1.18 (10%) usr   0.01 ( 5%) sys   1.19 (10%) wall    6629 kB ( 4%) ggc
 LRA hard reg assignment :   2.69 (22%) usr   0.01 ( 5%) sys   2.69 (22%) wall       0 kB ( 0%) ggc
 reload CSE regs         :   0.56 ( 5%) usr   0.00 ( 0%) sys   0.56 ( 5%) wall    1064 kB ( 1%) ggc
 TOTAL                   :  12.21         0.22         12.42          152002 kB

(that's without mucking with timevars).
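
Roughly what I have in mind for the __int128 fast path, as a standalone
sketch rather than a patch against wide-int.cc - the names lshift_1_to_2
and hwi are made up for the example; the point is only that the
single-input-block, two-output-block case collapses to one double-word
shift:

  #include <cstdint>
  #include <cstdio>

  typedef uint64_t hwi;   /* stands in for unsigned HOST_WIDE_INT */

  /* Shift the single block X left by SHIFT (0 <= SHIFT < 64) into
     VAL[0] (low) and VAL[1] (high) with one double-word operation
     instead of the per-block carry loop.  */
  static inline void
  lshift_1_to_2 (hwi *val, hwi x, unsigned int shift)
  {
  #ifdef __SIZEOF_INT128__
    unsigned __int128 wide = (unsigned __int128) x << shift;
    val[0] = (hwi) wide;
    val[1] = (hwi) (wide >> 64);
  #else
    /* Portable fallback, careful to avoid an undefined shift by 64.  */
    val[0] = x << shift;
    val[1] = shift ? x >> (64 - shift) : 0;
  #endif
  }

  int
  main ()
  {
    hwi val[2];
    /* The byte-to-bit scaling case: shift by LOG2_BITS_PER_UNIT == 3.  */
    lshift_1_to_2 (val, 0x8000000000000001ULL, 3);
    printf ("%016llx %016llx\n",
            (unsigned long long) val[1], (unsigned long long) val[0]);
    return 0;
  }

This should compile to a couple of straight-line shift instructions, with
no loop-carried dependency through a carry register.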
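
And a sketch of the wi::bytes_to_bits idea - name and signature are
hypothetical, and it ignores the sign extension and precision canonization
the real wide-int code would have to do.  The point is that with the shift
amount fixed at LOG2_BITS_PER_UNIT == 3 the carry computation needs no
variable shift, so both the special-case test and the latency problem go
away:

  #include <cstdint>

  typedef uint64_t hwi;   /* stands in for unsigned HOST_WIDE_INT */

  /* Scale a non-negative byte quantity of XLEN blocks to bits, i.e.
     shift it left by 3.  Returns the output length, which grows by one
     block when the top bits spill over.  */
  static unsigned int
  bytes_to_bits (hwi *val, const hwi *xval, unsigned int xlen)
  {
    hwi carry = 0;
    for (unsigned int i = 0; i < xlen; ++i)
      {
        hwi x = xval[i];
        val[i] = (x << 3) | carry;
        carry = x >> 61;    /* 64 - 3; a constant, unlike in lshift_large */
      }
    unsigned int len = xlen;
    if (carry)
      val[len++] = carry;
    return len;
  }

For the common xlen == 1 case this is two constant shifts and a
conditional store, and callers scaling byte offsets would never go through
lshift_large at all.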
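
For the caching idea, something along these lines - purely a sketch: the
ref_extent struct, the cache variable and the wrapper name are all made
up, and the hard part (invalidating entries when the IL changes) is
hand-waved away:

  /* Hypothetical memoization of get_ref_base_and_extent, keyed on the
     reference tree.  */

  struct ref_extent
  {
    tree base;
    HOST_WIDE_INT offset, size, max_size;
  };

  static hash_map<tree, ref_extent> *ref_extent_cache;

  static tree
  cached_ref_base_and_extent (tree ref, HOST_WIDE_INT *poffset,
                              HOST_WIDE_INT *psize, HOST_WIDE_INT *pmax_size)
  {
    bool existed;
    ref_extent &e = ref_extent_cache->get_or_insert (ref, &existed);
    if (!existed)
      e.base = get_ref_base_and_extent (ref, &e.offset, &e.size,
                                        &e.max_size);
    *poffset = e.offset;
    *psize = e.size;
    *pmax_size = e.max_size;
    return e.base;
  }

stmt_may_clobber_ref_p would then decompose each store ref once instead of
once per query, which should help given alias stmt walking is 37% of
compile time here.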