On Thu, Jul 6, 2023 at 2:04 PM Roger Sayle <[email protected]> wrote:
>
>
> Passing 128-bit integer (TImode) parameters on x86_64 can sometimes
> result in surprising code. Consider the example below (from PR 43644):
>
> __uint128 foo(__uint128 x, unsigned long long y) {
> return x+y;
> }
>
> which currently results in 6 consecutive movq instructions:
>
> foo: movq %rsi, %rax
> movq %rdi, %rsi
> movq %rdx, %rcx
> movq %rax, %rdi
> movq %rsi, %rax
> movq %rdi, %rdx
> addq %rcx, %rax
> adcq $0, %rdx
> ret
>
> The underlying issue is that during RTL expansion, we generate the
> following initial RTL for the x argument:
>
> (insn 4 3 5 2 (set (reg:TI 85)
> (subreg:TI (reg:DI 86) 0)) "pr43644-2.c":5:1 -1
> (nil))
> (insn 5 4 6 2 (set (subreg:DI (reg:TI 85) 8)
> (reg:DI 87)) "pr43644-2.c":5:1 -1
> (nil))
> (insn 6 5 7 2 (set (reg/v:TI 84 [ x ])
> (reg:TI 85)) "pr43644-2.c":5:1 -1
> (nil))
>
> which by combine/reload becomes
>
> (insn 25 3 22 2 (set (reg/v:TI 84 [ x ])
> (const_int 0 [0])) "pr43644-2.c":5:1 -1
> (nil))
> (insn 22 25 23 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 0)
> (reg:DI 93)) "pr43644-2.c":5:1 90 {*movdi_internal}
> (expr_list:REG_DEAD (reg:DI 93)
> (nil)))
> (insn 23 22 28 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 8)
> (reg:DI 94)) "pr43644-2.c":5:1 90 {*movdi_internal}
> (expr_list:REG_DEAD (reg:DI 94)
> (nil)))
>
> where the heavy use of SUBREG SET_DESTs creates challenges for both
> combine and register allocation.
>
> The improvement proposed here is to avoid these problematic SUBREGs
> by adding (two) special cases to ix86_expand_move. For insn 4, which
> sets a TImode destination from a paradoxical SUBREG, to assign the
> lowpart, we can use an explicit zero extension (zero_extendditi2 was
> added in July 2022), and for insn 5, which sets the highpart of a
> TImode register we can use the *insvti_highpart_1 instruction (that
> was added in May 2023, after being approved for stage1 in January).
> This allows combine to work its magic, merging these insns into a
> *concatditi3 and from there into other optimized forms.
How about we introduce *insvti_lowpart_1, similar to
*insvti_highpart_1, in the hope that combine is smart enough to also
combine these two instructions? IMO, faking insert to lowpart of the
register with zero_extend is a bit overkill, and could hinder some
other optimization opportunities (as perhaps hinted by failing
testcases).
Uros.
> So for the test case above, we now generate only a single movq:
>
> foo: movq %rdx, %rax
> xorl %edx, %edx
> addq %rdi, %rax
> adcq %rsi, %rdx
> ret
>
> But there is a little bad news. This patch causes two (minor) missed
> optimization regressions on x86_64; gcc.target/i386/pr82580.c and
> gcc.target/i386/pr91681-1.c. As shown in the test case above, we're
> no longer generating adcq $0, but instead using xorl. For the other
> FAIL, register allocation now has more freedom and is (arbitrarily)
> choosing a register assignment that doesn't match what the test is
> expecting. These issues are easier to explain and fix once this patch
> is in the tree.
>
> The good news is that this approach fixes a number of long standing
> issues, that need to checked in bugzilla, including PR target/110533
> which was just opened/reported earlier this week.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with only the two new FAILs described above. Ok for mainline?
>
> 2023-07-06 Roger Sayle <[email protected]>
>
> gcc/ChangeLog
> PR target/43644
> PR target/110533
> * config/i386/i386-expand.cc (ix86_expand_move): Convert SETs of
> TImode destinations from paradoxical SUBREGs (setting the lowpart)
> into explicit zero extensions. Use *insvti_highpart_1 instruction
> to set the highpart of a TImode destination.
>
> gcc/testsuite/ChangeLog
> PR target/43644
> PR target/110533
> * gcc.target/i386/pr110533.c: New test case.
> * gcc.target/i386/pr43644-2.c: Likewise.
>
>
> Thanks in advance,
> Roger
> --
>