On 2023/01/21 0:14, Max Filippov wrote: > Hi Suwa-san, Hi!
> > On Wed, Jan 18, 2023 at 7:50 PM Takayuki 'January June' Suwa > <jjsuwa_sys3...@yahoo.co.jp> wrote: >> >> In the previous patch, if insn is JUMP_INSN or CALL_INSN, it bypasses the >> reg check (possibly FAIL). >> >> ===== >> In the case of the CALL0 ABI, values that must be retained before and >> after function calls are placed in the callee-saved registers (A12 >> through A15) and referenced later. However, it is often the case that >> the save and the reference are each only once and a simple register- >> register move (the frame pointer is needed to recover the stack pointer >> and must be excluded). >> >> e.g. in the following example, if there are no other occurrences of >> register A14: >> >> ;; before >> ; prologue { >> ... >> s32i.n a14, sp, 16 >> ... >> ; } prologue >> ... >> mov.n a14, a6 >> ... >> call0 foo >> ... >> mov.n a8, a14 >> ... >> ; epilogue { >> ... >> l32i.n a14, sp, 16 >> ... >> ; } epilogue >> >> It can be possible like this: >> >> ;; after >> ; prologue { >> ... >> (deleted) >> ... >> ; } prologue >> ... >> s32i.n a6, sp, 16 >> ... >> call0 foo >> ... >> l32i.n a8, sp, 16 >> ... >> ; epilogue { >> ... >> (deleted) >> ... >> ; } epilogue >> >> This patch introduces a new peephole2 pattern that implements the above. >> >> gcc/ChangeLog: >> >> * config/xtensa/xtensa.md: New peephole2 pattern that eliminates >> the use of callee-saved register that saves and restores only once >> for other register, by using its stack slot directly. >> --- >> gcc/config/xtensa/xtensa.md | 62 +++++++++++++++++++++++++++++++++++++ >> 1 file changed, 62 insertions(+) > > There are still issues with this change in the libgomp: > > FAIL: libgomp.c/examples-4/target-1.c execution test > FAIL: libgomp.c/examples-4/target-2.c execution test > > They come from the following function: > > code produced before the change: > .literal_position > .literal .LC8, init@PLT > .literal .LC9, 400000 > .literal .LC10, 100000 > .literal .LC11, -800000 > .literal .LC12, 800000 > .align 4 > .global vec_mult_ref > .type vec_mult_ref, @function > vec_mult_ref: > l32r a9, .LC11 > addi sp, sp, -16 > l32r a10, .LC9 > s32i.n a12, sp, 8 > s32i.n a13, sp, 4 > s32i.n a0, sp, 12 > add.n sp, sp, a9 > add.n a12, sp, a10 > l32r a9, .LC8 > mov.n a13, a2 > mov.n a3, sp > mov.n a2, a12 > callx0 a9 > l32r a7, .LC10 > mov.n a10, a12 > mov.n a11, sp > mov.n a2, a13 > loop a7, .L17_LEND > .L17: > l32i.n a9, a10, 0 > l32i.n a6, a11, 0 > addi.n a10, a10, 4 > mull a9, a9, a6 > addi.n a11, a11, 4 > s32i.n a9, a2, 0 > addi.n a2, a2, 4 > .L17_LEND: > l32r a9, .LC12 > add.n sp, sp, a9 > l32i.n a0, sp, 12 > l32i.n a12, sp, 8 > l32i.n a13, sp, 4 > addi sp, sp, 16 > ret.n > > -------------------- > > with the change: > .literal_position > .literal .LC8, init@PLT > .literal .LC9, 400000 > .literal .LC10, 100000 > .literal .LC11, -800000 > .literal .LC12, 800000 > .align 4 > .global vec_mult_ref > .type vec_mult_ref, @function > vec_mult_ref: > l32r a9, .LC11 > l32r a10, .LC9 > addi sp, sp, -16 > s32i.n a12, sp, 8 > s32i.n a0, sp, 12 > add.n sp, sp, a9 > add.n a12, sp, a10 > l32r a9, .LC8 > s32i.n a2, sp, 4 > mov.n a3, sp > mov.n a2, a12 > callx0 a9 > l32r a7, .LC10 > l32i.n a2, sp, 4 > mov.n a10, a12 > mov.n a11, sp > loop a7, .L17_LEND > .L17: > l32i.n a9, a10, 0 > l32i.n a6, a11, 0 > addi.n a10, a10, 4 > mull a9, a9, a6 > addi.n a11, a11, 4 > s32i.n a9, a2, 0 > addi.n a2, a2, 4 > .L17_LEND: > l32r a9, .LC12 > add.n sp, sp, a9 > l32i.n a0, sp, 12 > l32i.n a12, sp, 8 > addi sp, sp, 16 > ret.n > > the stack pointer is modified after saving callee-saved registers, > but the stack offset where a2 is stored and reloaded does not take > this into an account. > > After having this many attempts and getting to the issues that are > really hard to detect I wonder if the target backend is the right place > for this optimization? > I guess they are not hard to detect but just issues I didn't anticipate (and I just need a little more work). And where else should it be done? What about implementing a target-specific pass just for one-point optimization?