https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108016
Alexey Merzlyakov <alexey.merzlyakov at samsung dot com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |alexey.merzlyakov at samsung dot c | |om --- Comment #4 from Alexey Merzlyakov <alexey.merzlyakov at samsung dot com> --- I suspect, there might be 3 items are related to the reported in this ticket case. === Item1 === This is related to the above analysis provided by Jeff, where the unnecessary stack store-load generated. The simplified C-testcase: struct value { unsigned int a; unsigned char b; }; struct value func(unsigned int a, unsigned int b) { struct value out; out.a = a; out.b = a > b; return out; } Here I am just wondering, even if the optimized tree has is going through the memory, why DSE is not optimizing-out unnecessary store-load chain to something like: (insn 16 14 17 2 (set (mem/c:QI (plus:DI (reg/f:DI 65 frame) (const_int -4 [0xfffffffffffffffc]))) (subreg:QI (reg:DI 140) 0)) (insn 27 21 28 2 (set (reg:DI 148 [ D.2377+4 ]) (zero_extend:DI (mem/c:SI (plus:DI (reg/f:DI 65 frame) (const_int -4 [0xfffffffffffffffc]))))) -> (insn 41 14 17 2 (set (reg:DI 154) (reg:DI 140))) (insn 27 21 28 2 (set (reg:DI 148 [ D.2377+4 ]) (zero_extend:DI (subreg/s/u:QI (reg:QI 154) 0))) ? At least in the testcase from above, it does similar optimization for another store-load [sp-0x8] chain. === Item2 === Even if get rid from unnecessary store-load generation, the stack is still being allocated for the func frame. Assembler, generated for the following test-case, contains unnecessary stack allocation in prologue/epilogue (thanks to my colleague, Ankit Mahato for supporting with the reduced testcase on this situation): struct value { unsigned int a; unsigned int b; }; struct value func(void) { struct value out; out.a = 0; out.b = 1; return out; } Here is as in previous case, tree-optimized dump containing MEM usage. However, for this test-case further memory store-load was optimized-out by DSE1, and function stack won't contain locals on it anymore. But GCC is still generating unnecessary stack allocation: func(): addi sp,sp,-16 li a0,1 slli a0,a0,32 addi sp,sp,16 jr ra I looks like that frame size once being calculated in expand-rtl, will never being changed. Even after DSE removed all unnecessary stack usage, stack offset value was not corrected. I suppose that DSE if changes the usage of stack (for some reasons), could it correct the "frame_offset" value, that is responsible for further SP size allocation in pro_and_epilogue. For that, I've added simple and ugly "frame_offset" value optimizing-out function (checking no insns contain RTX_FRAME) to the end of DSE optimization; and it works for me on the above test-case. The main question here - should GCC change frame size after RTX was expanded; or it is not intended to work like this? === Item3 === Generation of extra sext.w instruction does not seem to be related to the previous items. The following test-case contains ADD_OVERFLOW built-in function: unsigned int func(unsigned int a, unsigned int b) { unsigned int out; unsigned int overflow = __builtin_add_overflow(a, b, &out); return overflow & out; } It produces an extra sign_extend, whether LLVM does not: func(unsigned int, unsigned int): addw a1,a0,a1 sext.w a5,a1 sltu a0,a5,a0 and a0,a0,a1 ret The riscv.md > uaddv<mode>4 implementation contains "riscv_emit_binary (PLUS, operands[0], operands[1], operands[2])" which produces the sum of two values in SI-mode: (insn 11 10 12 (set (reg:SI 141) (plus:SI (subreg/s/u:SI (reg/v:DI 139 [ a ]) 0) (subreg/s/u:SI (reg/v:DI 140 [ b ]) 0))) Sign-extend for this is being made separately by calling "emit_insn (gen_extend_insn (t4, operands[0], DImode, SImode, 0))": (insn 12 11 13 (set (reg:DI 143) (sign_extend:DI (reg:SI 141))) (a+b) sum in DI-mode is used for expansion of conditional branch: (jump_insn 13 12 14 (set (pc) (if_then_else (ltu (reg:DI 143 [a + b]) (reg:DI 142 [a])) (label_ref 16) (pc))) while the output from the "uaddv<mode>4" - operand[0] == (a+b):SI is used further in the code. It appears that instructions generation, made in such way can not be optimized out by further fwprop, crprop, combine, etc... pipeline; because of extra-dependency outside it PLUS-SIGN_EXTEND chain. Thus, sext.w remains to be untouched. However, if change "riscv_emit_binary (PLUS, operands[0], operands[1], operands[2]);" routine in the "uaddv<mode>4" -> to something producing results in DI-mode, say "emit_insn (gen_add3_insn (operands[0], operands[1], operands[2]));"; and expand became to produce the following unoptimized code for ADD_OVERFLOW: (insn 8 7 9 (set (reg:DI 142) (sign_extend:DI (plus:SI (subreg/s/u:SI (reg/v:DI 139 [ a ]) 0) (subreg/s/u:SI (reg/v:DI 140 [ b ]) 0)))) "add_overflow3.c":5:19 -1 (insn 9 8 10 (set (reg:SI 141) <- optimized-out by fwprop1 (subreg/s/u:SI (reg:DI 142) 0)) "add_overflow3.c":5:19 -1 (insn 10 9 11 (set (reg:DI 143) <- optimized-out by crprop1 (DCE) (sign_extend:DI (reg:SI 141))) "add_overflow3.c":5:19 -1 However, this unoptimized code is being compensated by fwprop and crprop and the finally produced code won't contain sext.w instruction. I do not love this tricky approach, as we're initially expanding unoptimized code for ADD_OVERFLOW, in the hope of future optimizations will get rid of unnecessary stuff. But I still haven't found yet a better way in RTL-optimizations to handle extra sign_extend for the initial case. Any suggestions? It feels like the even number of errors leads to the right result, while the odd number leaves one error behind.