This patch changes the way that __builtin_parity is expanded in i386.md.
This changes the code generated for the function: int test(unsigned char a) { unsigned char x = a + 1; return __builtin_parity(x); } from this before the patch: test: addl $1, %edi movzbl %dil, %edi movl %edi, %eax shrl $16, %edi xorl %edi, %eax xorb %ah, %al setnp %al movzbl %al, %eax ret to this with the patch: test: leal 1(%rdi), %eax testb %al, %al setnp %al movzbl %al, %eax ret GCC currently hides the shift and xor reduction inside a backend specific UNSPEC PARITY, making it invisible to the RTL optimizers until very late during compilation. It is normally reasonable for the middle-end to maintain wider mode representations for as long as possible and split them later, but this only helps if the semantics are visible at the RTL-level (to combine and other passes), but UNSPECs are black boxes, so in this case splitting early (during RTL expansion) is a better strategy. I'd originally written this patch after investigating whether GCC could generate the x86's "jp" and "jnp" conditional branch on parity flag, and noticed the inefficient reduction. It was only when I was preparing this patch for submission that I discovered that this is actually a regression fix, and that Uros had already implemented/thought about the above optimization back in 2007. Indeed the example above is taken from https://gcc.gnu.org/legacy-ml/gcc-patches/2007-02/msg00972.html Alas there is a mismatch between RTL's definition of PARITY operation which has an integer mode result, and the x86's parity flag. So when Uros addressed PR target/44481 in 2010, by introducing UNSPEC PARITY, we lost some of the optimization opportunities of his original patch. https://gcc.gnu.org/legacy-ml/gcc-patches/2010-06/msg01259.html The early splitting approach in this patch safely restores the 2007-2010 optimization opportunities. It turns out that that popcount instruction on modern x86_64 processors has (almost) made the integer parity flag in the x86 ALU completely obsolete, especially as POPCOUNT's integer semantics are a much better fit to RTL. The one remaining case where these transistors are useful is where __builtin_parity is immediately tested by a conditional branch, and therefore the result is wanted in a flags register rather than as an integer. This case is captured by two peephole2 optimizations in the attached patch. Hence the code: void foo(unsigned char x) { if (__builtin_parity(x)) bar(); } which on modern hardware is currently generated as: foo: movzbl %dil, %edi popcntl %edi, %edi andl $1, %edi jne .L4 ret .L4: jmp bar with this patch will be generated as: foo: testb %dil, %dil jnp .L4 ret .L4: jmp bar [I believe that popcount has a 3-cycle latency, but a test followed by a conditional branch can dispatch in the same cycle, but I'm not an expert]. The new parityhi2 and parityqi2 expanders are infrastructure to support (possible) future middle-end changes. This patch has been tested with a "make bootstrap" and "make -k check" on x86_64-pc-linux-gnu with no regressions. I've also added 7 new test cases; 4 tests gcc.target/i386/parity-[3-6].c check the existing behaviour, and 3 tests gcc.target/i386/parity-[7-9].c check the changes from this patch. Alas I'm very out of practice contributing patches (but my paperwork should still be valid), so if a friendly i386/x86_64 backend maintainer could take care of things from here, that would be very much appreciated. Many thanks in advance, Roger -- 2020-06-05 Roger Sayle <ro...@nextmovesoftware.com> * config/i386/i386.md (paritydi2, paritysi2): Expand reduction via shift and xor to an USPEC PARITY matching a parityhi2_cmp. (paritydi2_cmp, paritysi2_cmp): Delete these define_insn_and_split. (parityhi2, parityqi2): New expanders. (parityhi2_cmp): Implement set parity flag with xorb insn. (parityqi2_cmp): Implement set parity flag with testb insn. New peephole2s to use these insns (UNSPEC PARITY) when appropriate. 2020-06-05 Roger Sayle <ro...@nextmovesoftware.com> * gcc.target/i386/parity-[3-9].c: New tests.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 459cf62..6659657 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -14780,9 +14780,32 @@ "! TARGET_POPCNT" { rtx scratch = gen_reg_rtx (QImode); + rtx hipart1 = gen_reg_rtx (SImode); + rtx lopart1 = gen_reg_rtx (SImode); + rtx xor1 = gen_reg_rtx (SImode); + rtx shift2 = gen_reg_rtx (SImode); + rtx hipart2 = gen_reg_rtx (HImode); + rtx lopart2 = gen_reg_rtx (HImode); + rtx xor2 = gen_reg_rtx (HImode); - emit_insn (gen_paritydi2_cmp (NULL_RTX, NULL_RTX, - NULL_RTX, operands[1])); + if (TARGET_64BIT) + { + rtx shift1 = gen_reg_rtx (DImode); + emit_insn (gen_lshrdi3 (shift1, operands[1], GEN_INT (32))); + emit_move_insn (hipart1, gen_lowpart (SImode, shift1)); + } + else + emit_move_insn (hipart1, gen_highpart (SImode, operands[1])); + + emit_move_insn (lopart1, gen_lowpart (SImode, operands[1])); + emit_insn (gen_xorsi3 (xor1, hipart1, lopart1)); + + emit_insn (gen_lshrsi3 (shift2, xor1, GEN_INT (16))); + emit_move_insn (hipart2, gen_lowpart (HImode, shift2)); + emit_move_insn (lopart2, gen_lowpart (HImode, xor1)); + emit_insn (gen_xorhi3 (xor2, hipart2, lopart2)); + + emit_insn (gen_parityhi2_cmp (xor2)); ix86_expand_setcc (scratch, ORDERED, gen_rtx_REG (CCmode, FLAGS_REG), const0_rtx); @@ -14805,8 +14828,17 @@ "! TARGET_POPCNT" { rtx scratch = gen_reg_rtx (QImode); + rtx shift = gen_reg_rtx (SImode); + rtx hipart = gen_reg_rtx (HImode); + rtx lopart = gen_reg_rtx (HImode); + rtx tmp = gen_reg_rtx (HImode); + + emit_insn (gen_lshrsi3 (shift, operands[1], GEN_INT (16))); + emit_move_insn (hipart, gen_lowpart (HImode, shift)); + emit_move_insn (lopart, gen_lowpart (HImode, operands[1])); + emit_insn (gen_xorhi3 (tmp, hipart, lopart)); - emit_insn (gen_paritysi2_cmp (NULL_RTX, NULL_RTX, operands[1])); + emit_insn (gen_parityhi2_cmp (tmp)); ix86_expand_setcc (scratch, ORDERED, gen_rtx_REG (CCmode, FLAGS_REG), const0_rtx); @@ -14815,70 +14847,128 @@ DONE; }) -(define_insn_and_split "paritydi2_cmp" - [(set (reg:CC FLAGS_REG) - (unspec:CC [(match_operand:DI 3 "register_operand" "0")] - UNSPEC_PARITY)) - (clobber (match_scratch:DI 0 "=r")) - (clobber (match_scratch:SI 1 "=&r")) - (clobber (match_scratch:HI 2 "=Q"))] +(define_expand "parityhi2" + [(set (match_operand:HI 0 "register_operand") + (parity:HI (match_operand:HI 1 "register_operand")))] "! TARGET_POPCNT" - "#" - "&& reload_completed" - [(parallel - [(set (match_dup 1) - (xor:SI (match_dup 1) (match_dup 4))) - (clobber (reg:CC FLAGS_REG))]) - (parallel - [(set (reg:CC FLAGS_REG) - (unspec:CC [(match_dup 1)] UNSPEC_PARITY)) - (clobber (match_dup 1)) - (clobber (match_dup 2))])] { - operands[4] = gen_lowpart (SImode, operands[3]); + rtx scratch = gen_reg_rtx (QImode); - if (TARGET_64BIT) - { - emit_move_insn (operands[1], gen_lowpart (SImode, operands[3])); - emit_insn (gen_lshrdi3 (operands[3], operands[3], GEN_INT (32))); - } - else - operands[1] = gen_highpart (SImode, operands[3]); + emit_insn (gen_parityhi2_cmp (operands[1])); + + ix86_expand_setcc (scratch, ORDERED, + gen_rtx_REG (CCmode, FLAGS_REG), const0_rtx); + + emit_insn (gen_zero_extendqihi2 (operands[0], scratch)); + DONE; }) -(define_insn_and_split "paritysi2_cmp" - [(set (reg:CC FLAGS_REG) - (unspec:CC [(match_operand:SI 2 "register_operand" "0")] - UNSPEC_PARITY)) - (clobber (match_scratch:SI 0 "=r")) - (clobber (match_scratch:HI 1 "=&Q"))] +(define_expand "parityqi2" + [(set (match_operand:QI 0 "register_operand") + (parity:QI (match_operand:QI 1 "register_operand")))] "! TARGET_POPCNT" - "#" - "&& reload_completed" - [(parallel - [(set (match_dup 1) - (xor:HI (match_dup 1) (match_dup 3))) - (clobber (reg:CC FLAGS_REG))]) - (parallel - [(set (reg:CC FLAGS_REG) - (unspec:CC [(match_dup 1)] UNSPEC_PARITY)) - (clobber (match_dup 1))])] { - operands[3] = gen_lowpart (HImode, operands[2]); + emit_insn (gen_parityqi2_cmp (operands[1])); - emit_move_insn (operands[1], gen_lowpart (HImode, operands[2])); - emit_insn (gen_lshrsi3 (operands[2], operands[2], GEN_INT (16))); + ix86_expand_setcc (operands[0], ORDERED, + gen_rtx_REG (CCmode, FLAGS_REG), const0_rtx); + DONE; }) -(define_insn "*parityhi2_cmp" +(define_insn "parityhi2_cmp" [(set (reg:CC FLAGS_REG) - (unspec:CC [(match_operand:HI 1 "register_operand" "0")] + (unspec:CC [(match_operand:HI 0 "register_operand" "+Q")] UNSPEC_PARITY)) - (clobber (match_scratch:HI 0 "=Q"))] - "! TARGET_POPCNT" + (clobber (match_dup 0))] + "" "xor{b}\t{%h0, %b0|%b0, %h0}" [(set_attr "length" "2") - (set_attr "mode" "HI")]) + (set_attr "mode" "QI")]) + +(define_insn "parityqi2_cmp" + [(set (reg:CC FLAGS_REG) + (unspec:CC [(match_operand:QI 0 "register_operand")] + UNSPEC_PARITY))] + "" + "test{b}\t%b0, %b0" + [(set_attr "mode" "QI")]) + +;; Replace zero_extend:HI followed by parityhi2_cmp with parityqi2_cmp +(define_peephole2 + [(set (match_operand:HI 0 "register_operand") + (zero_extend:HI (match_operand:QI 1 "register_operand"))) + (parallel [(set (reg:CC FLAGS_REG) + (unspec:CC [(match_dup 0)] UNSPEC_PARITY)) + (clobber (match_dup 0))])] + "" + [(set (reg:CC FLAGS_REG) + (unspec:CC [(match_dup 1)] UNSPEC_PARITY))]) + +;; Eliminate QImode popcount&1 using parity flag +(define_peephole2 + [(set (match_operand:SI 0 "register_operand") + (zero_extend:SI (match_operand:QI 1 "register_operand"))) + (parallel [(set (match_operand:SI 2 "register_operand") + (popcount:SI (match_dup 0))) + (clobber (reg:CC FLAGS_REG))]) + (set (reg:CCZ FLAGS_REG) + (compare:CCZ (and:QI (match_operand:QI 3 "register_operand") + (const_int 1)) + (const_int 0))) + (set (pc) (if_then_else (match_operator 4 "bt_comparison_operator" + [(reg:CCZ FLAGS_REG) + (const_int 0)]) + (label_ref (match_operand 5)) + (pc)))] + "REGNO (operands[2]) == REGNO (operands[3]) + && peep2_reg_dead_p (3, operands[0]) + && peep2_reg_dead_p (3, operands[2]) + && peep2_regno_dead_p (4, FLAGS_REG)" + [(set (reg:CC FLAGS_REG) + (unspec:CC [(match_dup 1)] UNSPEC_PARITY)) + (set (pc) (if_then_else (match_op_dup 4 [(reg:CC FLAGS_REG) + (const_int 0)]) + (label_ref (match_dup 5)) + (pc)))] +{ + operands[4] = shallow_copy_rtx (operands[4]); + PUT_CODE (operands[4], GET_CODE (operands[4]) == EQ ? UNORDERED : ORDERED); +}) + +;; Eliminate HImode popcount&1 using parity flag +(define_peephole2 + [(match_scratch:HI 0 "Q") + (parallel [(set (match_operand:HI 1 "register_operand") + (popcount:HI + (match_operand:HI 2 "nonimmediate_operand"))) + (clobber (reg:CC FLAGS_REG))]) + (set (match_operand 3 "register_operand") + (zero_extend (match_dup 1))) + (set (reg:CCZ FLAGS_REG) + (compare:CCZ (and:QI (match_operand:QI 4 "register_operand") + (const_int 1)) + (const_int 0))) + (set (pc) (if_then_else (match_operator 5 "bt_comparison_operator" + [(reg:CCZ FLAGS_REG) + (const_int 0)]) + (label_ref (match_operand 6)) + (pc)))] + "REGNO (operands[3]) == REGNO (operands[4]) + && peep2_reg_dead_p (3, operands[1]) + && peep2_reg_dead_p (3, operands[3]) + && peep2_regno_dead_p (4, FLAGS_REG)" + [(set (match_dup 0) (match_dup 2)) + (parallel [(set (reg:CC FLAGS_REG) + (unspec:CC [(match_dup 0)] UNSPEC_PARITY)) + (clobber (match_dup 0))]) + (set (pc) (if_then_else (match_op_dup 5 [(reg:CC FLAGS_REG) + (const_int 0)]) + (label_ref (match_dup 6)) + (pc)))] +{ + operands[5] = shallow_copy_rtx (operands[5]); + PUT_CODE (operands[5], GET_CODE (operands[5]) == EQ ? UNORDERED : ORDERED); +}) ;; Thread-local storage patterns for ELF.
/* { dg-do compile } */ /* { dg-options "-O2 -march=core-avx2 -mno-popcnt" } */ /* { dg-final { scan-assembler "setp" } } */ /* { dg-final { scan-assembler "jnp" } } */ /* { dg-final { scan-assembler "jp" } } */ void dummy(void); int foo(unsigned int x) { return !__builtin_parity(x); } void bar(unsigned int x) { if (__builtin_parity(x)) dummy(); } void baz(unsigned int x) { if (!__builtin_parity(x)) dummy(); }
/* { dg-do compile } */ /* { dg-options "-O2 -march=core-avx2 -mno-popcnt" } */ /* { dg-final { scan-assembler "setp" } } */ /* { dg-final { scan-assembler "jnp" } } */ /* { dg-final { scan-assembler "jp" } } */ void dummy(void); int foo(unsigned long long x) { return !__builtin_parityll(x); } void bar(unsigned long long x) { if (__builtin_parityll(x)) dummy(); } void baz(unsigned long long x) { if (!__builtin_parityll(x)) dummy(); }
/* { dg-do compile } */ /* { dg-options "-O2 -march=core-avx2" } */ /* { dg-final { scan-assembler "popcnt" } } */ /* { dg-final { scan-assembler "and" } } */ int foo(unsigned int x) { return __builtin_parity(x); }
/* { dg-do compile } */ /* { dg-options "-O2 -march=core-avx2" } */ /* { dg-final { scan-assembler "popcnt" } } */ /* { dg-final { scan-assembler "and" } } */ int foo(unsigned long long x) { return __builtin_parityll(x); }
/* { dg-do compile } */ /* { dg-options "-O2 -march=core-avx2 -mno-popcnt" } */ /* { dg-final { scan-assembler-times "test" 2 } } */ /* { dg-final { scan-assembler-not "shr" } } */ int foo(unsigned char x) { return __builtin_parity(x); } int bar(unsigned char x) { return __builtin_parityll(x); }
/* { dg-do compile } */ /* { dg-options "-O2 -march=core-avx2 -mno-popcnt" } */ /* { dg-final { scan-assembler-not "shr" } } */ int foo(unsigned short x) { return __builtin_parity(x); } int bar(unsigned short x) { return __builtin_parityll(x); }
/* { dg-do compile } */ /* { dg-options "-O2 -march=core-avx2" } */ /* { dg-final { scan-assembler-not "popcnt" } } */ /* { dg-final { scan-assembler-not "shr" } } */ /* { dg-final { scan-assembler-times "jp" 2 } } */ /* { dg-final { scan-assembler-times "jnp" 2 } } */ void dummy(void); void pos8(unsigned char x) { if (__builtin_parity(x)) dummy(); } void neg8(unsigned char x) { if (!__builtin_parity(x)) dummy(); } void pos16(unsigned short x) { if (__builtin_parity(x)) dummy(); } void neg16(unsigned short x) { if (!__builtin_parity(x)) dummy(); }