https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359

Aldy Hernandez <aldyh at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |aldyh at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org

--- Comment #22 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
For the record, current mainline is even worse than when the testcase was
originally reported.  We are now at 116 bytes versus 96 for gcc-5-branch.

(In reply to Jakub Jelinek from comment #5)
> I see multiple points of the increases:
> r224048 added 4 bytes.
> r224995 added 8 bytes.
> r228318 added 4 bytes.
>
> Perhaps the middle one could change from
>  (if (single_use (@2)
>       || (TREE_CODE (@1) == INTEGER_CST && TREE_CODE (@3) == INTEGER_CST))
> to
>  (if (single_use (@2)
>       || (!optimize_size && TREE_CODE (@1) == INTEGER_CST && TREE_CODE (@3)
> == INTEGER_CST))
> (or some optimize*for_speed*)?

As Jakub mentions in comment #5, there are multiple patches that have slowly
bloated the generated code, but it is perhaps a fool's errand to tackle them
individually.  For instance, associating "buf + len - 1" differently than
r224995 does reduces the generated code by 8 bytes, but predicating the
association on optimize_for_speed seems fragile IMO.

Interestingly, disabling -ftree-forwprop reduces the byte count to 92, which
is even smaller than gcc-5-branch, so that is perhaps worth pursuing.  Even
on x86, disabling forwprop reduces the byte count by 13 bytes.  And on both
x86 and arm, we get one less branch without forwprop.

The first forwprop change is to replace an equality test with a greater-than
(the two are equivalent here because ui_21 = ui_7 / 10, so ui_21 != 0 exactly
when ui_7 > 9).  This hardly seems worthwhile (it is even a longer byte
sequence on x86), but is perhaps not a showstopper:

< if (ui_21 != 0)
---
> if (ui_7 > 9)

OTOH, the following changes things quite a bit on arm:

< p_22 = p_19 + 4294967295;
< *p_22 = 45;
---
> p_22 = p_8 + 4294967294;
> MEM[(char *)p_19 + 4294967295B] = 45;

For context, we are now using p_8, which is the pointer's value from the
previous iteration, and subtracting 2 from it to return p correctly.  What
the heck?

  <bb 2> [local count: 118111601]:
  _1 = ABS_EXPR <i_12(D)>;
  ui_13 = (unsigned int) _1;
  len.0_2 = (sizetype) len_14(D);
  _3 = len.0_2 + 4294967295;
  p_16 = buf_15(D) + _3;
  *p_16 = 0;

  <bb 3> [local count: 1073741825]:
  # ui_7 = PHI <ui_13(2), ui_21(3)>
  # p_8 = PHI <p_16(2), p_19(3)>
  _4 = ui_7 % 10;
  _5 = (char) _4;
  p_19 = p_8 + 4294967295;
  _6 = _5 + 48;
  MEM[base: p_19, offset: 0B] = _6;
  ui_21 = ui_7 / 10;
  if (ui_7 > 9)
    goto <bb 3>; [89.00%]
  else
    goto <bb 4>; [11.00%]

  <bb 4> [local count: 118111601]:
  if (i_12(D) < 0)
    goto <bb 5>; [41.00%]
  else
    goto <bb 6>; [59.00%]

  <bb 5> [local count: 48425756]:
  p_22 = p_8 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;

  <bb 6> [local count: 118111601]:
  # p_9 = PHI <p_22(5), p_19(4)>
  return p_9;

This finally yields assembly without auto-dec, and with an extra (forward!)
branch:

.L2:
        mov     r1, #10
        mov     r0, r6
        bl      __aeabi_uidivmod
        umull   r2, r3, r6, r8
        add     r1, r1, #48
        cmp     r6, #9
        sub     r4, r5, #1
        strb    r1, [r5, #-1]
        lsr     r3, r3, #3
        bhi     .L4                     ;; extra forward branch
        cmp     r7, #0
        movlt   r3, #45
        strblt  r3, [r4, #-1]
        sublt   r4, r5, #2
        mov     r0, r4
        pop     {r4, r5, r6, r7, r8, pc}
.L4:
        mov     r5, r4
        mov     r6, r3
        b       .L2

whereas without -ftree-forwprop we get auto-dec:

.L2:
        mov     r0, r5
        mov     r1, #10
        bl      __aeabi_uidivmod
        umull   r2, r3, r5, r7
        add     r1, r1, #48
        lsrs    r5, r3, #3
        strb    r1, [r4, #-1]!          ;; auto-dec, yay
        bne     .L2
        cmp     r6, #0
        movlt   r3, #45
        strblt  r3, [r4, #-1]!          ;; auto-dec, yay
        mov     r0, r4
        pop     {r4, r5, r6, r7, r8, pc}
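
For reference, the testcase boils down to an itoa-style routine.  Below is a
minimal sketch consistent with the GIMPLE above; the function and parameter
names are my own, this is not necessarily the exact testcase attached to this
PR, and the byte counts quoted above presumably come from building something
like it at -Os:

/* Reconstructed from the GIMPLE dump; not the original testcase.
   Writes the decimal form of I into the tail of BUF (LEN bytes) and
   returns a pointer to the first character written.  */
char *
int_to_str (int i, char *buf, int len)
{
  unsigned int ui = (i < 0) ? -i : i;   /* ABS_EXPR <i_12(D)>, then cast */
  char *p = buf + len - 1;              /* p_16 = buf_15(D) + _3 */

  *p = '\0';
  do
    *--p = '0' + ui % 10;               /* the strb [r4, #-1]! candidate */
  while ((ui /= 10) != 0);              /* forwprop turns this into ui > 9 */
  if (i < 0)
    *--p = '-';                         /* the second auto-dec candidate */
  return p;
}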