https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108016

Alexey Merzlyakov <alexey.merzlyakov at samsung dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alexey.merzlyakov at samsung 
dot c
                   |                            |om

--- Comment #4 from Alexey Merzlyakov <alexey.merzlyakov at samsung dot com> ---
I suspect, there might be 3 items are related to the reported in this ticket
case.

=== Item1 ===

This is related to the above analysis provided by Jeff, where the unnecessary
stack store-load generated.
The simplified C-testcase:

  struct value {
    unsigned int a;
    unsigned char b;
  };

  struct value func(unsigned int a, unsigned int b) {
    struct value out;
    out.a = a;
    out.b = a > b;
    return out;
  }

Here I am just wondering, even if the optimized tree has is going through the
memory, why DSE is not optimizing-out unnecessary store-load chain to something
like:

  (insn 16 14 17 2 (set (mem/c:QI (plus:DI (reg/f:DI 65 frame)
                (const_int -4 [0xfffffffffffffffc])))
        (subreg:QI (reg:DI 140) 0))
  (insn 27 21 28 2 (set (reg:DI 148 [ D.2377+4 ])
        (zero_extend:DI (mem/c:SI (plus:DI (reg/f:DI 65 frame)
                (const_int -4 [0xfffffffffffffffc])))))
  ->
  (insn 41 14 17 2 (set (reg:DI 154) (reg:DI 140)))
  (insn 27 21 28 2 (set (reg:DI 148 [ D.2377+4 ])
        (zero_extend:DI (subreg/s/u:QI (reg:QI 154) 0))) ?

At least in the testcase from above, it does similar optimization for another
store-load [sp-0x8] chain.

=== Item2 ===

Even if get rid from unnecessary store-load generation, the stack is still
being allocated for the func frame. Assembler, generated for the following
test-case, contains unnecessary stack allocation in prologue/epilogue (thanks
to my colleague, Ankit Mahato for supporting with the reduced testcase on this
situation):

  struct value {
    unsigned int a;
    unsigned int b;
  };

  struct value func(void) {
    struct value out;
    out.a = 0;
    out.b = 1;
    return out;
  }

Here is as in previous case, tree-optimized dump containing MEM usage. However,
for this test-case further memory store-load was optimized-out by DSE1, and
function stack won't contain locals on it anymore. But GCC is still generating
unnecessary stack allocation:

  func():
        addi    sp,sp,-16
        li      a0,1
        slli    a0,a0,32
        addi    sp,sp,16
        jr      ra

I looks like that frame size once being calculated in expand-rtl, will never
being changed. Even after DSE removed all unnecessary stack usage, stack offset
value was not corrected. I suppose that DSE if changes the usage of stack (for
some reasons), could it correct the "frame_offset" value, that is responsible
for further SP size allocation in pro_and_epilogue.

For that, I've added simple and ugly "frame_offset" value optimizing-out
function (checking no insns contain RTX_FRAME) to the end of DSE optimization;
and it works for me on the above test-case.

The main question here - should GCC change frame size after RTX was expanded;
or it is not intended to work like this?

=== Item3 ===

Generation of extra sext.w instruction does not seem to be related to the
previous items.
The following test-case contains ADD_OVERFLOW built-in function:

  unsigned int func(unsigned int a, unsigned int b) {
    unsigned int out;
    unsigned int overflow = __builtin_add_overflow(a, b, &out);
    return overflow & out;
  }

It produces an extra sign_extend, whether LLVM does not:

  func(unsigned int, unsigned int):
        addw    a1,a0,a1
        sext.w  a5,a1
        sltu    a0,a5,a0
        and     a0,a0,a1
        ret

The riscv.md > uaddv<mode>4 implementation contains "riscv_emit_binary (PLUS,
operands[0], operands[1], operands[2])" which produces the sum of two values in
SI-mode:

  (insn 11 10 12 (set (reg:SI 141)
        (plus:SI (subreg/s/u:SI (reg/v:DI 139 [ a ]) 0)
                (subreg/s/u:SI (reg/v:DI 140 [ b ]) 0)))

Sign-extend for this is being made separately by calling "emit_insn
(gen_extend_insn (t4, operands[0], DImode, SImode, 0))":

  (insn 12 11 13 (set (reg:DI 143)
        (sign_extend:DI (reg:SI 141)))

(a+b) sum in DI-mode is used for expansion of conditional branch:

  (jump_insn 13 12 14 (set (pc)
        (if_then_else (ltu (reg:DI 143 [a + b])
                        (reg:DI 142 [a]))
                (label_ref 16)
                (pc)))

while the output from the "uaddv<mode>4" - operand[0] == (a+b):SI is used
further in the code.

It appears that instructions generation, made in such way can not be optimized
out by further fwprop, crprop, combine, etc... pipeline; because of
extra-dependency outside it PLUS-SIGN_EXTEND chain. Thus, sext.w remains to be
untouched.

However, if change "riscv_emit_binary (PLUS, operands[0], operands[1],
operands[2]);" routine in the "uaddv<mode>4" -> to something producing results
in DI-mode, say "emit_insn (gen_add3_insn (operands[0], operands[1],
operands[2]));"; and expand became to produce the following unoptimized code
for ADD_OVERFLOW:

(insn 8 7 9 (set (reg:DI 142)
        (sign_extend:DI (plus:SI (subreg/s/u:SI (reg/v:DI 139 [ a ]) 0)
                (subreg/s/u:SI (reg/v:DI 140 [ b ]) 0))))
"add_overflow3.c":5:19 -1
(insn 9 8 10 (set (reg:SI 141) <- optimized-out by fwprop1
        (subreg/s/u:SI (reg:DI 142) 0)) "add_overflow3.c":5:19 -1
(insn 10 9 11 (set (reg:DI 143) <- optimized-out by crprop1 (DCE)
        (sign_extend:DI (reg:SI 141))) "add_overflow3.c":5:19 -1

However, this unoptimized code is being compensated by fwprop and crprop and
the finally produced code won't contain sext.w instruction. I do not love this
tricky approach, as we're initially expanding unoptimized code for
ADD_OVERFLOW, in the hope of future optimizations will get rid of unnecessary
stuff. But I still haven't found yet a better way in RTL-optimizations to
handle extra sign_extend for the initial case. Any suggestions? It feels like
the even number of errors leads to the right result, while the odd number
leaves one error behind.

Reply via email to