The simple test case below demonstrates an interesting register
allocation challenge facing x86_64, imposed by ABI requirements
on __int128.
__int128 foo(__int128 x, __int128 y)
{
return x+y;
}
For this, GCC currently generates the unusual sequence:
movq %rsi, %rax
movq %rdi, %r8
movq %rax, %rdi
movq %rdx, %rax
movq %rcx, %rdx
addq %r8, %rax
adcq %rdi, %rdx
ret
The challenge is that the x86_64 ABI requires passing the first __int128,
x, in %rsi:%rdi (highpart in %rsi, lowpart in %rdi), whereas internally
GCC prefers TImode (double word) integers to be register allocated as
%rdi:%rsi (highpart in %rdi, lowpart in %rsi).  So after reload, we have
four mov instructions: two to move the double word to temporary registers
and then two to move them back.
This patch adds a peephole2 to spot this register shuffling, and with
-Os generates an xchg instruction, to produce:
xchgq %rsi, %rdi
movq %rdx, %rax
movq %rcx, %rdx
addq %rsi, %rax
adcq %rdi, %rdx
ret
or, when optimizing for speed, a three-mov sequence using just one of
the temporary registers, which ultimately results in the improved:
movq %rdi, %r8
movq %rdx, %rax
movq %rcx, %rdx
addq %r8, %rax
adcq %rsi, %rdx
ret
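For completeness, the addq/adcq pair above is the standard double-word
addition on 64-bit halves; a small C sketch of the equivalent computation
(illustrative only, using GCC's __builtin_add_overflow to model the carry
flag) looks like:

/* Illustrative only: the double-word addition that addq/adcq implement,
   written on explicit 64-bit halves.  */
static void add_double_word (unsigned long long xlo, unsigned long long xhi,
                             unsigned long long ylo, unsigned long long yhi,
                             unsigned long long *reslo,
                             unsigned long long *reshi)
{
  unsigned long long lo;
  /* addq: add the low halves, capturing the carry.  */
  int carry = __builtin_add_overflow (xlo, ylo, &lo);
  /* adcq: add the high halves plus the carry.  */
  *reslo = lo;
  *reshi = xhi + yhi + (unsigned long long) carry;
}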
I've a follow-up patch which improves things further, and with the
output in flux, I'd like to add the new testcase with part 2, once
we're back down to requiring only two movq instructions.
This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32} with
no new failures. Ok for mainline?
2022-06-02 Roger Sayle <[email protected]>
gcc/ChangeLog
* config/i386/i386.md (define_peephole2): Recognize double word
swap sequences, and replace them with more efficient idioms,
including using xchg when optimizing for size.
Thanks in advance,
Roger
--
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 2b1d65b..f3cf6e2 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -3016,6 +3016,36 @@
[(parallel [(set (match_dup 1) (match_dup 2))
(set (match_dup 2) (match_dup 1))])])
+;; Replace a double word swap that requires 4 mov insns with a
+;; 3 mov insn implementation (or an xchg when optimizing for size).
+(define_peephole2
+ [(set (match_operand:DWIH 0 "general_reg_operand")
+ (match_operand:DWIH 1 "general_reg_operand"))
+ (set (match_operand:DWIH 2 "general_reg_operand")
+ (match_operand:DWIH 3 "general_reg_operand"))
+ (clobber (match_operand:<DWI> 4 "general_reg_operand"))
+ (set (match_dup 3) (match_dup 0))
+ (set (match_dup 1) (match_dup 2))]
+ "REGNO (operands[0]) != REGNO (operands[3])
+ && REGNO (operands[1]) != REGNO (operands[2])
+ && REGNO (operands[1]) != REGNO (operands[3])
+ && REGNO (operands[3]) == REGNO (operands[4])
+ && peep2_reg_dead_p (4, operands[0])
+ && peep2_reg_dead_p (5, operands[2])"
+ [(parallel [(set (match_dup 1) (match_dup 3))
+ (set (match_dup 3) (match_dup 1))])]
+{
+ if (!optimize_insn_for_size_p ())
+ {
+ rtx tmp = REGNO (operands[0]) > REGNO (operands[2]) ? operands[0]
+ : operands[2];
+ emit_move_insn (tmp, operands[1]);
+ emit_move_insn (operands[1], operands[3]);
+ emit_move_insn (operands[3], tmp);
+ DONE;
+ }
+})
+
(define_expand "movstrict<mode>"
[(set (strict_low_part (match_operand:SWI12 0 "register_operand"))
(match_operand:SWI12 1 "general_operand"))]