https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359

Aldy Hernandez <aldyh at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |aldyh at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org

--- Comment #22 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
For the record, current mainline is even worse than when the testcase was
originally reported.  We are now at 116 bytes versus 96 for gcc-5-branch.

(In reply to Jakub Jelinek from comment #5)
> I see multiple points of the increases:
> r224048 added 4 bytes.
> r224995 added 8 bytes.
> r228318 added 4 bytes.
>
> Perhaps the middle one could change from
>  (if (single_use (@2)
>       || (TREE_CODE (@1) == INTEGER_CST && TREE_CODE (@3) == INTEGER_CST))
> to
>  (if (single_use (@2)
>       || (!optimize_size && TREE_CODE (@1) == INTEGER_CST && TREE_CODE (@3)
> == INTEGER_CST))
> (or some optimize*for_speed*)?

As Jakub mentions in comment #5, there are multiple patches that have slowly
bloated the generated code, but it is perhaps a fool's errand to tackle them
individually.  For instance, associating "buf + len - 1" differently than
r224995 does reduces the generated code by 8 bytes, but predicating the
association on optimize_for_speed seems fragile IMO.

Interestingly, disabling -ftree-forwprop reduces the byte count to 92, which
is even smaller than gcc-5-branch, so that is perhaps worth pursuing.  Even
on x86, disabling forwprop reduces the byte count by 13 bytes.  And on both
x86 and arm, we get one less branch without forwprop.

The first forwprop change is to replace an equality test with a greater-than
(the two are equivalent here because ui_21 = ui_7 / 10, so ui_21 != 0 exactly
when ui_7 > 9).  This hardly seems worthwhile (it is even a longer byte
sequence on x86), but is perhaps not a showstopper:

< if (ui_21 != 0)
---
> if (ui_7 > 9)

OTOH, the following changes things quite a bit on arm:

< p_22 = p_19 + 4294967295;
< *p_22 = 45;
---
> p_22 = p_8 + 4294967294;
> MEM[(char *)p_19 + 4294967295B] = 45;

For context, we are now using p_8, which is the pointer's value from the
previous iteration, and subtracting 2 from it to return p correctly.  What
the heck?

  <bb 2> [local count: 118111601]:
  _1 = ABS_EXPR <i_12(D)>;
  ui_13 = (unsigned int) _1;
  len.0_2 = (sizetype) len_14(D);
  _3 = len.0_2 + 4294967295;
  p_16 = buf_15(D) + _3;
  *p_16 = 0;

  <bb 3> [local count: 1073741825]:
  # ui_7 = PHI <ui_13(2), ui_21(3)>
  # p_8 = PHI <p_16(2), p_19(3)>
  _4 = ui_7 % 10;
  _5 = (char) _4;
  p_19 = p_8 + 4294967295;
  _6 = _5 + 48;
  MEM[base: p_19, offset: 0B] = _6;
  ui_21 = ui_7 / 10;
  if (ui_7 > 9)
    goto <bb 3>; [89.00%]
  else
    goto <bb 4>; [11.00%]

  <bb 4> [local count: 118111601]:
  if (i_12(D) < 0)
    goto <bb 5>; [41.00%]
  else
    goto <bb 6>; [59.00%]

  <bb 5> [local count: 48425756]:
  p_22 = p_8 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;

  <bb 6> [local count: 118111601]:
  # p_9 = PHI <p_22(5), p_19(4)>
  return p_9;

This finally yields assembly without auto-dec, and with an extra (forward!)
branch:

.L2:
        mov     r1, #10
        mov     r0, r6
        bl      __aeabi_uidivmod
        umull   r2, r3, r6, r8
        add     r1, r1, #48
        cmp     r6, #9
        sub     r4, r5, #1
        strb    r1, [r5, #-1]
        lsr     r3, r3, #3
        bhi     .L4                     ;; extra forward branch
        cmp     r7, #0
        movlt   r3, #45
        strblt  r3, [r4, #-1]
        sublt   r4, r5, #2
        mov     r0, r4
        pop     {r4, r5, r6, r7, r8, pc}
.L4:
        mov     r5, r4
        mov     r6, r3
        b       .L2

whereas without -ftree-forwprop we get auto-dec:

.L2:
        mov     r0, r5
        mov     r1, #10
        bl      __aeabi_uidivmod
        umull   r2, r3, r5, r7
        add     r1, r1, #48
        lsrs    r5, r3, #3
        strb    r1, [r4, #-1]!          ;; auto-dec, yay
        bne     .L2
        cmp     r6, #0
        movlt   r3, #45
        strblt  r3, [r4, #-1]!          ;; auto-dec, yay
        mov     r0, r4
        pop     {r4, r5, r6, r7, r8, pc}
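
For reference, the testcase boils down to an itoa-style routine.  Below is a
minimal sketch consistent with the GIMPLE above; the function and parameter
names are my own, this is not necessarily the exact testcase attached to this
PR, and the byte counts quoted above presumably come from building something
like it at -Os:

/* Reconstructed from the GIMPLE dump; not the original testcase.
   Writes the decimal form of I into the tail of BUF (LEN bytes) and
   returns a pointer to the first character written.  */
char *
int_to_str (int i, char *buf, int len)
{
  unsigned int ui = (i < 0) ? -i : i;   /* ABS_EXPR <i_12(D)>, then cast */
  char *p = buf + len - 1;              /* p_16 = buf_15(D) + _3 */

  *p = '\0';
  do
    *--p = '0' + ui % 10;               /* the strb [r4, #-1]! candidate */
  while ((ui /= 10) != 0);              /* forwprop turns this into ui > 9 */
  if (i < 0)
    *--p = '-';                         /* the second auto-dec candidate */
  return p;
}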