RE: [RFC] expr: don't clear SUBREG_PROMOTED_VAR_P flag for a promoted subreg [target/111466]
I agree that this looks dubious. Normally, if the middle-end/optimizers wish to reuse a SUBREG in a context where the flags are not valid, it should create a new one with the desired flags, rather than "mutate" an existing (and possibly shared) RTX. I wonder if creating a new SUBREG here also fixes your problem? I'm not sure that clearing SUBREG_PROMOTED_VAR_P is needed at all, but given its motivation has been lost to history, it would be good to have a plan B, if Jeff's alpha testing uncovers a subtle issue.

Roger
--

> -----Original Message-----
> From: Vineet Gupta > Sent: 28 September 2023 22:44 > To: gcc-patches@gcc.gnu.org; Robin Dapp > Cc: kito.ch...@gmail.com; Jeff Law ; Palmer Dabbelt > ; gnu-toolch...@rivosinc.com; Roger Sayle > ; Jakub Jelinek ; Jivan > Hakobyan ; Vineet Gupta > Subject: [RFC] expr: don't clear SUBREG_PROMOTED_VAR_P flag for a promoted > subreg [target/111466]
>
> RISC-V suffers from extraneous sign extensions, despite/given the ABI guarantee > that 32-bit quantities are sign-extended into 64-bit registers, meaning incoming SI > function args need not be explicitly sign extended (so do SI return values as most > ALU insns implicitly sign-extend too.)
>
> Existing REE doesn't seem to handle this well and there are various ideas floating > around to smarten REE about it.
>
> RISC-V also seems to correctly implement middle-end hook PROMOTE_MODE etc.
>
> Another approach would be to prevent EXPAND from generating the sign_extend > in the first place which this patch tries to do.
>
> The hunk being removed was introduced way back in 1994 as
> 5069803972 ("expand_expr, case CONVERT_EXPR .. clear the promotion flag")
>
> This survived a full testsuite run for RISC-V rv64gc with surprisingly no > fallouts: test results before/after are exactly the same.
>
> |                               | # of unexpected case / # of unique unexpected case
> |                               | gcc     | g++   | gfortran |
> | rv64imafdc_zba_zbb_zbs_zicond/| 264 /87 | 5 / 2 | 72 / 12  |
> | lp64d/medlow
>
> Granted for something so old to have survived, there must be a valid reason. > Unfortunately the original change didn't have additional commentary or a test > case. That is not to say it can't/won't possibly break things on other arches/ABIs, > hence the RFC for someone to scream that this is just bonkers, don't do this :-)
>
> I've explicitly CC'ed Jakub and Roger who have last touched subreg promoted > notes in expr.cc for insight and/or screaming ;-)
>
> Thanks to Robin for narrowing this down in an amazing debugging session @ GNU > Cauldron.
>
> ```
> foo2:
> 	sext.w	a6,a1    <-- this goes away
> 	beq	a1,zero,.L4
> 	li	a5,0
> 	li	a0,0
> .L3:
> 	addw	a4,a2,a5
> 	addw	a5,a3,a5
> 	addw	a0,a4,a0
> 	bltu	a5,a6,.L3
> 	ret
> .L4:
> 	li	a0,0
> 	ret
> ```
>
> Signed-off-by: Vineet Gupta
> Co-developed-by: Robin Dapp
> ---
>  gcc/expr.cc                               |  7 ---
>  gcc/testsuite/gcc.target/riscv/pr111466.c | 15 +++
>  2 files changed, 15 insertions(+), 7 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/pr111466.c
>
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index 308ddc09e631..d259c6e53385 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -9332,13 +9332,6 @@ expand_expr_real_2 (sepops ops, rtx target, machine_mode tmode,
>        op0 = expand_expr (treeop0, target, VOIDmode, modifier);
>
> -      /* If the signedness of the conversion differs and OP0 is
> -	 a promoted SUBREG, clear that indication since we now
> -	 have to do the proper extension.  */
> -      if (TYPE_UNSIGNED (TREE_TYPE (treeop0)) != unsignedp
> -	  && GET_CODE (op0) == SUBREG)
> -	SUBREG_PROMOTED_VAR_P (op0) = 0;
> -
>        return REDUCE_BIT_FIELD (op0);
>      }
>
> diff --git a/gcc/testsuite/gcc.target/riscv/pr111466.c b/gcc/testsuite/gcc.target/riscv/pr111466.c
> new file mode 100644
> index ..007792466a51
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/pr111466.c
> @@ -0,0 +1,15 @@
> +/* Simplified variant of gcc.target/riscv/zba-adduw.c.  */
> +
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv64gc_zba_zbs -mabi=lp64" } */
> +/* { dg-skip-if "" { *-*-* } { "-O0" } } */
> +
> +int foo2(int unused, int n, unsigned y, unsigned delta){
> +  int s = 0;
> +  unsigned int x = 0;
> +  for (;x<n; x+=delta)
> +    s += x+y;
> +  return s;
> +}
> +
> +/* { dg-final { scan-assembler "\msext\M" } } */
> --
> 2.34.1
[ARC PATCH] Use rlc r0, 0 to implement scc_ltu (i.e. carry_flag ? 1 : 0)
This patch teaches the ARC backend that the contents of the carry flag can be placed in an integer register conveniently using the "rlc rX,0" instruction, which is a rotate-left-through-carry using zero as a source. This is a convenient special case for the LTU form of the scc pattern. unsigned int foo(unsigned int x, unsigned int y) { return (x+y) < x; } With -O2 -mcpu=em this is currently compiled to: foo:add.f 0,r0,r1 mov_s r0,1;3 j_s.d [blink] mov.hs r0,0 [which after an addition to set the carry flag, sets r0 to 1, followed by a conditional assignment of r0 to zero if the carry flag is clear]. With the new define_insn/optimization in this patch, this becomes: foo:add.f 0,r0,r1 j_s.d [blink] rlc r0,0 This define_insn is also a useful building block for implementing shifts and rotates. Tested on a cross-compiler to arc-linux (hosted on x86_64-pc-linux-gnu), and a partial tool chain, where the new case passes and there are no new regressions. Ok for mainline? 2023-09-29 Roger Sayle gcc/ChangeLog * config/arc/arc.md (CC_ltu): New mode iterator for CC and CC_C. (scc_ltu_): New define_insn to handle LTU form of scc_insn. (*scc_insn): Don't split to a conditional move sequence for LTU. gcc/testsuite/ChangeLog * gcc.target/arc/scc-ltu.c: New test case. Thanks in advance, Roger -- diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md index d37ecbf..fe2e7fb 100644 --- a/gcc/config/arc/arc.md +++ b/gcc/config/arc/arc.md @@ -3658,12 +3658,24 @@ archs4x, archs4xd" (define_expand "scc_insn" [(set (match_operand:SI 0 "dest_reg_operand" "=w") (match_operand:SI 1 ""))]) +(define_mode_iterator CC_ltu [CC_C CC]) + +(define_insn "scc_ltu_" + [(set (match_operand:SI 0 "dest_reg_operand" "=w") +(ltu:SI (reg:CC_ltu CC_REG) (const_int 0)))] + "" + "rlc\\t%0,0" + [(set_attr "type" "shift") + (set_attr "predicable" "no") + (set_attr "length" "4")]) + (define_insn_and_split "*scc_insn" [(set (match_operand:SI 0 "dest_reg_operand" "=w") (match_operator:SI 1 "proper_comparison_operator" [(reg CC_REG) (const_int 0)]))] "" "#" - "reload_completed" + "reload_completed + && GET_CODE (operands[1]) != LTU" [(set (match_dup 0) (const_int 1)) (cond_exec (match_dup 1) diff --git a/gcc/testsuite/gcc.target/arc/scc-ltu.c b/gcc/testsuite/gcc.target/arc/scc-ltu.c new file mode 100644 index 000..653c55d --- /dev/null +++ b/gcc/testsuite/gcc.target/arc/scc-ltu.c @@ -0,0 +1,12 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mcpu=em" } */ + +unsigned int foo(unsigned int x, unsigned int y) +{ + return (x+y) < x; +} + +/* { dg-final { scan-assembler "rlc\\s+r0,0" } } */ +/* { dg-final { scan-assembler "add.f\\s+0,r0,r1" } } */ +/* { dg-final { scan-assembler-not "mov_s\\s+r0,1" } } */ +/* { dg-final { scan-assembler-not "mov\.hs\\s+r0,0" } } */
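As a brief, hedged aside (an editorial C sketch, not part of the patch): the reason the LTU form of scc maps onto the carry flag is that, for unsigned x and y, the value of (x+y) < x is exactly the carry out of the addition, which is the bit "rlc rX,0" copies into a register.

  /* Illustration only: the carry out of an unsigned 32-bit addition
     equals ((x + y) < x), assuming the usual 32-bit unsigned int and
     64-bit unsigned long long.  */
  unsigned int add_carry (unsigned int x, unsigned int y)
  {
    unsigned long long wide = (unsigned long long) x + y;  /* 33-bit sum */
    unsigned int carry = (unsigned int) (wide >> 32);      /* carry out  */
    return carry;   /* same value as (x + y) < x for all x and y */
  }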
RE: [ARC PATCH] Use rlc r0, 0 to implement scc_ltu (i.e. carry_flag ? 1 : 0)
Hi Claudiu,

> The patch looks sane. Have you run dejagnu test suite?

I've not yet managed to set up an emulator or compile the entire toolchain, so my dejagnu results are only useful for catching (serious) problems in the compile only tests:

		=== gcc Summary ===

# of expected passes		91875
# of unexpected failures	23768
# of unexpected successes	23
# of expected failures		1038
# of unresolved testcases	19490
# of unsupported tests		3819
/home/roger/GCC/arc-linux/gcc/xgcc  version 14.0.0 20230828 (experimental) (GCC)

If someone could double check there are no issues on real hardware that would be great. I'm not sure if ARC is one of the targets covered by Jeff Law's compile farm?

> -----Original Message-----
> From: Roger Sayle > Sent: Friday, September 29, 2023 6:54 PM > To: gcc-patches@gcc.gnu.org > Cc: Claudiu Zissulescu > Subject: [ARC PATCH] Use rlc r0,0 to implement scc_ltu (i.e. carry_flag ? 1 : 0)
>
> This patch teaches the ARC backend that the contents of the carry flag can be > placed in an integer register conveniently using the "rlc rX,0" > instruction, which is a rotate-left-through-carry using zero as a source. > This is a convenient special case for the LTU form of the scc pattern.
>
> unsigned int foo(unsigned int x, unsigned int y) {
>   return (x+y) < x;
> }
>
> With -O2 -mcpu=em this is currently compiled to:
>
> foo:	add.f	0,r0,r1
> 	mov_s	r0,1	;3
> 	j_s.d	[blink]
> 	mov.hs	r0,0
>
> [which after an addition to set the carry flag, sets r0 to 1, followed by a > conditional assignment of r0 to zero if the carry flag is clear]. With the new > define_insn/optimization in this patch, this becomes:
>
> foo:	add.f	0,r0,r1
> 	j_s.d	[blink]
> 	rlc	r0,0
>
> This define_insn is also a useful building block for implementing shifts and rotates.
>
> Tested on a cross-compiler to arc-linux (hosted on x86_64-pc-linux-gnu), and a > partial tool chain, where the new case passes and there are no new regressions. > Ok for mainline?
>
> 2023-09-29 Roger Sayle
>
> gcc/ChangeLog
> 	* config/arc/arc.md (CC_ltu): New mode iterator for CC and CC_C.
> 	(scc_ltu_): New define_insn to handle LTU form of scc_insn.
> 	(*scc_insn): Don't split to a conditional move sequence for LTU.
>
> gcc/testsuite/ChangeLog
> 	* gcc.target/arc/scc-ltu.c: New test case.
>
> Thanks in advance,
> Roger
> --
RE: [ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER.
Hi Claudiu, Thanks for the answers to my technical questions. If you'd prefer to update arc.md's add3 pattern first, I'm happy to update/revise my patch based on this and your feedback, for example preferring add over asl_s (or controlling this choice with -Os). Thanks again. Roger -- > -Original Message- > From: Claudiu Zissulescu > Sent: 03 October 2023 15:26 > To: Roger Sayle ; gcc-patches@gcc.gnu.org > Subject: RE: [ARC PATCH] Split SImode shifts pre-reload on > !TARGET_BARREL_SHIFTER. > > Hi Roger, > > It was nice to meet you too. > > Thank you in looking into the ARC's non-Barrel Shifter configurations. I will dive > into your patch asap, but before starting here are a few of my comments: > > -Original Message- > From: Roger Sayle > Sent: Thursday, September 28, 2023 2:27 PM > To: gcc-patches@gcc.gnu.org > Cc: Claudiu Zissulescu > Subject: [ARC PATCH] Split SImode shifts pre-reload on > !TARGET_BARREL_SHIFTER. > > > Hi Claudiu, > It was great meeting up with you and the Synopsys ARC team at the GNU tools > Cauldron in Cambridge. > > This patch is the first in a series to improve SImode and DImode shifts and rotates > in the ARC backend. This first piece splits SImode shifts, for > !TARGET_BARREL_SHIFTER targets, after combine and before reload, in the split1 > pass, as suggested by the FIXME comment above output_shift in arc.cc. To do > this I've copied the implementation of the x86_pre_reload_split function from > i386 backend, and renamed it arc_pre_reload_split. > > Although the actual implementations of shifts remain the same (as in > output_shift), having them as explicit instructions in the RTL stream allows better > scheduling and use of compact forms when available. The benefits can be seen in > two short examples below. > > For the function: > unsigned int foo(unsigned int x, unsigned int y) { > return y << 2; > } > > GCC with -O2 -mcpu=em would previously generate: > foo:add r1,r1,r1 > add r1,r1,r1 > j_s.d [blink] > mov_s r0,r1 ;4 > > [CZI] The move shouldn't be generated indeed. The use of ADDs are slightly > beneficial for older ARCv1 arches. > > and with this patch now generates: > foo:asl_s r0,r1 > j_s.d [blink] > asl_s r0,r0 > > [CZI] Nice. This new sequence is as fast as we can get for our ARCv2 cpus. > > Notice the original (from shift_si3's output_shift) requires the shift sequence to be > monolithic with the same destination register as the source (requiring an extra > mov_s). The new version can eliminate this move, and schedule the second asl in > the branch delay slot of the return. > > For the function: > int x,y,z; > > void bar() > { > x <<= 3; > y <<= 3; > z <<= 3; > } > > GCC -O2 -mcpu=em currently generates: > bar:push_s r13 > ld.as r12,[gp,@x@sda] ;23 > ld.as r3,[gp,@y@sda] ;23 > mov r2,0 > add3 r12,r2,r12 > mov r2,0 > add3 r3,r2,r3 > ld.as r2,[gp,@z@sda] ;23 > st.as r12,[gp,@x@sda] ;26 > mov r13,0 > add3 r2,r13,r2 > st.as r3,[gp,@y@sda] ;26 > st.as r2,[gp,@z@sda] ;26 > j_s.d [blink] > pop_s r13 > > where each shift by 3, uses ARC's add3 instruction, which is similar to x86's lea > implementing x = (y<<3) + z, but requires the value zero to be placed in a > temporary register "z". Splitting this before reload allows these pseudos to be > shared/reused. 
With this patch, we get > > bar:ld.as r2,[gp,@x@sda] ;23 > mov_s r3,0;3 > add3r2,r3,r2 > ld.as r3,[gp,@y@sda] ;23 > st.as r2,[gp,@x@sda] ;26 > ld.as r2,[gp,@z@sda] ;23 > mov_s r12,0 ;3 > add3r3,r12,r3 > add3r2,r12,r2 > st.as r3,[gp,@y@sda] ;26 > st.as r2,[gp,@z@sda] ;26 > j_s [blink] > > [CZI] Looks great, but it also shows that I've forgot to add to ADD3 instruction the > Ra,LIMM,RC variant, which will lead to have instead of > mov_s r3,0;3 > add3r2,r3,r2 > Only this add3,0,r2, Indeed it is longer instruction but faster. > > Unfortunately, register allocation means that we only share two of the three > "mov_s z,0", but this is sufficient to reduce register pressure enough to avoid > spilling r13 in the prologue/epilogue. > > This patch also contains a (latent?) bug fix. The implementation of the default > insn "length" attribute, assumes instruc
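For readers unfamiliar with ARC's add3, a small hedged C sketch of why each shift by 3 above needs a zero temporary (illustration only; as noted above, add3 implements x = (y << 3) + z):

  /* Sketch: the "add3 r,z,y" instances above compute (y << 3) + z; with
     z pre-loaded with zero they implement y << 3, and splitting before
     reload lets the zero-valued pseudo be shared between shifts.  */
  unsigned int shl3_via_add3 (unsigned int y)
  {
    unsigned int z = 0;          /* the shared "mov rX,0" temporary */
    return (y << 3) + z;         /* add3 dst,z,y */
  }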
PING: PR rtl-optimization/110701
There are a small handful of middle-end maintainers/reviewers that understand and appreciate the difference between the RTL statements:

   (set (subreg:HI (reg:SI x)) (reg:HI y))

and

   (set (strict_lowpart:HI (reg:SI x)) (reg:HI y))

If one (or more) of them could please take a look at https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625532.html I'd very much appreciate it (one less wrong-code regression).

Many thanks in advance,
Roger
--
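As a rough editorial analogy, hedged and at the C level only (RTL semantics do not map exactly onto C): a store through a plain subreg is allowed to leave the rest of the wider register undefined, whereas a store through the strict lowpart must preserve it.

  /* Analogy only: x stands for (reg:SI x), y for (reg:HI y).  */
  unsigned int subreg_style_store (unsigned int x, unsigned short y)
  {
    (void) x;          /* the old upper bits of x need not survive */
    return y;          /* only the low 16 bits carry meaning here  */
  }

  unsigned int strict_lowpart_style_store (unsigned int x, unsigned short y)
  {
    return (x & 0xffff0000u) | y;   /* upper 16 bits of x are preserved */
  }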
[PATCH] Support g++ 4.8 as a host compiler.
The recent patch to remove poly_int_pod triggers a bug in g++ 4.8.5's C++11 support, which mistakenly believes poly_uint16 has a non-trivial constructor. This in turn prohibits it from being used as a member in a union (rtxunion) that is constructed statically, resulting in a (fatal) error during stage 1. A workaround is to add an explicit constructor to the problematic union, which allows mainline to be bootstrapped with the system compiler on older RedHat 7 systems.

This patch has been tested on x86_64-pc-linux-gnu where it allows a bootstrap to complete when using g++ 4.8.5 as the host compiler. Ok for mainline?

2023-10-04  Roger Sayle

gcc/ChangeLog
	* rtl.h (rtx_def::u): Add explicit constructor to work around
	an issue using g++ 4.8 as a host compiler.

diff --git a/gcc/rtl.h b/gcc/rtl.h
index 6850281..a7667f5 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -451,6 +451,9 @@ struct GTY((desc("0"), tag("0"),
     struct fixed_value fv;
     struct hwivec_def hwiv;
     struct const_poly_int_def cpi;
+#if defined(__GNUC__) && GCC_VERSION < 5000
+    u () {}
+#endif
   } GTY ((special ("rtx_def"), desc ("GET_CODE (&%0)"))) u;
 };
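To make the shape of the workaround concrete, here is a hedged, hypothetical reduction (the type and union names below are stand-ins for illustration, not the actual rtl.h declarations, and the member type shown is not the real poly_uint16):

  /* Stand-in for the member type that g++ 4.8.5 mis-classifies as having
     a non-trivial constructor (the real case is poly_uint16 after the
     poly_int_pod removal).  */
  struct member_t { unsigned short coeffs[1]; };

  union u_like {
    struct member_t m;
    int i;
  #if defined(__GNUC__) && __GNUC__ < 5
    u_like () {}   /* explicit, empty (not defaulted) constructor */
  #endif
  };

  /* The kind of statically constructed object that previously failed
     during stage 1.  */
  static union u_like global_u;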
[X86 PATCH] Split lea into shorter left shift by 2 or 3 bits with -Oz.
This patch avoids long lea instructions for performing x<<2 and x<<3 by splitting them into shorter sal and move (or xchg) instructions. Because this increases the number of instructions, but reduces the total size, it's suitable for -Oz (but not -Os).

The impact can be seen in the new test case:

int foo(int x) { return x<<2; }
int bar(int x) { return x<<3; }
long long fool(long long x) { return x<<2; }
long long barl(long long x) { return x<<3; }

where with -O2 we generate:

foo:	lea	0x0(,%rdi,4),%eax	// 7 bytes
	retq
bar:	lea	0x0(,%rdi,8),%eax	// 7 bytes
	retq
fool:	lea	0x0(,%rdi,4),%rax	// 8 bytes
	retq
barl:	lea	0x0(,%rdi,8),%rax	// 8 bytes
	retq

and with -Oz we now generate:

foo:	xchg	%eax,%edi	// 1 byte
	shl	$0x2,%eax	// 3 bytes
	retq
bar:	xchg	%eax,%edi	// 1 byte
	shl	$0x3,%eax	// 3 bytes
	retq
fool:	xchg	%rax,%rdi	// 2 bytes
	shl	$0x2,%rax	// 4 bytes
	retq
barl:	xchg	%rax,%rdi	// 2 bytes
	shl	$0x3,%rax	// 4 bytes
	retq

Over the entirety of the CSiBE code size benchmark this saves 1347 bytes (0.037%) for x86_64, and 1312 bytes (0.036%) with -m32. Conveniently, there's already a backend function in i386.cc for deciding whether to split an lea into its component instructions, ix86_avoid_lea_for_addr; all that's required is an additional clause checking for -Oz (i.e. optimize_size > 1).

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board='unix{-m32}' with no new failures. Additional testing was performed by repeating these steps after removing the "optimize_size > 1" condition, so that suitable lea instructions were always split [-Oz is not heavily tested, so this invoked the new code during the bootstrap and regression testing], again with no regressions. Ok for mainline?

2023-10-05  Roger Sayle

gcc/ChangeLog
	* config/i386/i386.cc (ix86_avoid_lea_for_addr): Split LEAs used
	to perform left shifts into shorter instructions with -Oz.

gcc/testsuite/ChangeLog
	* gcc.target/i386/lea-2.c: New test case.

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 477e6ce..9557bff 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -15543,6 +15543,13 @@ ix86_avoid_lea_for_addr (rtx_insn *insn, rtx operands[])
       && (regno0 == regno1 || regno0 == regno2))
     return true;
 
+  /* Split with -Oz if the encoding requires fewer bytes.  */
+  if (optimize_size > 1
+      && parts.scale > 1
+      && !parts.base
+      && (!parts.disp || parts.disp == const0_rtx))
+    return true;
+
   /* Check we need to optimize.  */
   if (!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun))
     return false;
 
diff --git a/gcc/testsuite/gcc.target/i386/lea-2.c b/gcc/testsuite/gcc.target/i386/lea-2.c
new file mode 100644
index 000..20aded8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/lea-2.c
@@ -0,0 +1,7 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-Oz" } */
+int foo(int x) { return x<<2; }
+int bar(int x) { return x<<3; }
+long long fool(long long x) { return x<<2; }
+long long barl(long long x) { return x<<3; }
+/* { dg-final { scan-assembler-not "lea\[lq\]" } } */
[X86 PATCH] Implement doubleword shift left by 1 bit using add+adc.
This patch tweaks the i386 back-end's ix86_split_ashl to implement doubleword left shifts by 1 bit, using an add followed by an add-with-carry (i.e. a doubleword x+x) instead of using the x86's shld instruction. The replacement sequence both requires fewer bytes and is faster on both Intel and AMD architectures (from Agner Fog's latency tables and confirmed by my own microbenchmarking). For the test case: __int128 foo(__int128 x) { return x << 1; } with -O2 we previously generated: foo:movq%rdi, %rax movq%rsi, %rdx shldq $1, %rdi, %rdx addq%rdi, %rax ret with this patch we now generate: foo:movq%rdi, %rax movq%rsi, %rdx addq%rdi, %rax adcq%rsi, %rdx ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-10-05 Roger Sayle gcc/ChangeLog * config/i386/i386-expand.cc (ix86_split_ashl): Split shifts by one into add3_cc_overflow_1 followed by add3_carry. * config/i386/i386.md (@add3_cc_overflow_1): Renamed from "*add3_cc_overflow_1" to provide generator function. gcc/testsuite/ChangeLog * gcc.target/i386/ashldi3-2.c: New 32-bit test case. * gcc.target/i386/ashlti3-3.c: New 64-bit test case. Thanks in advance, Roger --
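As a hedged C-level sketch of why the replacement is valid (illustration only; the 64-bit word type is an assumption): shifting a double word left by one bit is the same as adding the value to itself, with the high-word addition consuming the carry produced by the low-word addition, exactly the add/adc pair above.

  typedef unsigned long long u64;   /* assumed 64-bit word */

  /* 128-bit x <<= 1 expressed as x += x, i.e. add followed by
     add-with-carry, mirroring the addq/adcq sequence above.  */
  static void dword_shl1 (u64 *hi, u64 *lo)
  {
    u64 carry = *lo >> 63;     /* the bit that addq would leave in CF */
    *lo = *lo + *lo;           /* addq  %rdi, %rax */
    *hi = *hi + *hi + carry;   /* adcq  %rsi, %rdx */
  }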
RE: [X86 PATCH] Implement doubleword shift left by 1 bit using add+adc.
Doh! ENOPATCH. > -Original Message- > From: Roger Sayle > Sent: 05 October 2023 12:44 > To: 'gcc-patches@gcc.gnu.org' > Cc: 'Uros Bizjak' > Subject: [X86 PATCH] Implement doubleword shift left by 1 bit using add+adc. > > > This patch tweaks the i386 back-end's ix86_split_ashl to implement doubleword > left shifts by 1 bit, using an add followed by an add-with-carry (i.e. a doubleword > x+x) instead of using the x86's shld instruction. > The replacement sequence both requires fewer bytes and is faster on both Intel > and AMD architectures (from Agner Fog's latency tables and confirmed by my > own microbenchmarking). > > For the test case: > __int128 foo(__int128 x) { return x << 1; } > > with -O2 we previously generated: > > foo:movq%rdi, %rax > movq%rsi, %rdx > shldq $1, %rdi, %rdx > addq%rdi, %rax > ret > > with this patch we now generate: > > foo:movq%rdi, %rax > movq%rsi, %rdx > addq%rdi, %rax > adcq%rsi, %rdx > ret > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and > make -k check, both with and without --target_board=unix{-m32} with no new > failures. Ok for mainline? > > > 2023-10-05 Roger Sayle > > gcc/ChangeLog > * config/i386/i386-expand.cc (ix86_split_ashl): Split shifts by > one into add3_cc_overflow_1 followed by add3_carry. > * config/i386/i386.md (@add3_cc_overflow_1): Renamed from > "*add3_cc_overflow_1" to provide generator function. > > gcc/testsuite/ChangeLog > * gcc.target/i386/ashldi3-2.c: New 32-bit test case. > * gcc.target/i386/ashlti3-3.c: New 64-bit test case. > > > Thanks in advance, > Roger > -- diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index e42ff27..09e41c8 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -6342,6 +6342,18 @@ ix86_split_ashl (rtx *operands, rtx scratch, machine_mode mode) if (count > half_width) ix86_expand_ashl_const (high[0], count - half_width, mode); } + else if (count == 1) + { + if (!rtx_equal_p (operands[0], operands[1])) + emit_move_insn (operands[0], operands[1]); + rtx x3 = gen_rtx_REG (CCCmode, FLAGS_REG); + rtx x4 = gen_rtx_LTU (mode, x3, const0_rtx); + half_mode = mode == DImode ? SImode : DImode; + emit_insn (gen_add3_cc_overflow_1 (half_mode, low[0], +low[0], low[0])); + emit_insn (gen_add3_carry (half_mode, high[0], high[0], high[0], +x3, x4)); + } else { gen_shld = mode == DImode ? 
gen_x86_shld : gen_x86_64_shld; diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index eef8a0e..6a5bc16 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -8864,7 +8864,7 @@ [(set_attr "type" "alu") (set_attr "mode" "")]) -(define_insn "*add3_cc_overflow_1" +(define_insn "@add3_cc_overflow_1" [(set (reg:CCC FLAGS_REG) (compare:CCC (plus:SWI diff --git a/gcc/testsuite/gcc.target/i386/ashldi3-2.c b/gcc/testsuite/gcc.target/i386/ashldi3-2.c new file mode 100644 index 000..053389d --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/ashldi3-2.c @@ -0,0 +1,10 @@ +/* { dg-do compile { target ia32 } } */ +/* { dg-options "-O2 -mno-stv" } */ + +long long foo(long long x) +{ + return x << 1; +} + +/* { dg-final { scan-assembler "adcl" } } */ +/* { dg-final { scan-assembler-not "shldl" } } */ diff --git a/gcc/testsuite/gcc.target/i386/ashlti3-3.c b/gcc/testsuite/gcc.target/i386/ashlti3-3.c new file mode 100644 index 000..4f14ca0 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/ashlti3-3.c @@ -0,0 +1,10 @@ +/* { dg-do compile { target int128 } } */ +/* { dg-options "-O2" } */ + +__int128 foo(__int128 x) +{ + return x << 1; +} + +/* { dg-final { scan-assembler "adcq" } } */ +/* { dg-final { scan-assembler-not "shldq" } } */
RE: [X86 PATCH] Split lea into shorter left shift by 2 or 3 bits with -Oz.
Hi Uros, Very many thanks for the speedy reviews. Uros Bizjak wrote: > On Thu, Oct 5, 2023 at 11:06 AM Roger Sayle > wrote: > > > > > > This patch avoids long lea instructions for performing x<<2 and x<<3 > > by splitting them into shorter sal and move (or xchg instructions). > > Because this increases the number of instructions, but reduces the > > total size, its suitable for -Oz (but not -Os). > > > > The impact can be seen in the new test case: > > > > int foo(int x) { return x<<2; } > > int bar(int x) { return x<<3; } > > long long fool(long long x) { return x<<2; } long long barl(long long > > x) { return x<<3; } > > > > where with -O2 we generate: > > > > foo:lea0x0(,%rdi,4),%eax// 7 bytes > > retq > > bar:lea0x0(,%rdi,8),%eax// 7 bytes > > retq > > fool: lea0x0(,%rdi,4),%rax// 8 bytes > > retq > > barl: lea0x0(,%rdi,8),%rax// 8 bytes > > retq > > > > and with -Oz we now generate: > > > > foo:xchg %eax,%edi// 1 byte > > shl$0x2,%eax// 3 bytes > > retq > > bar:xchg %eax,%edi// 1 byte > > shl$0x3,%eax// 3 bytes > > retq > > fool: xchg %rax,%rdi// 2 bytes > > shl$0x2,%rax// 4 bytes > > retq > > barl: xchg %rax,%rdi// 2 bytes > > shl$0x3,%rax// 4 bytes > > retq > > > > Over the entirety of the CSiBE code size benchmark this saves 1347 > > bytes (0.037%) for x86_64, and 1312 bytes (0.036%) with -m32. > > Conveniently, there's already a backend function in i386.cc for > > deciding whether to split an lea into its component instructions, > > ix86_avoid_lea_for_addr, all that's required is an additional clause > > checking for -Oz (i.e. optimize_size > 1). > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board='unix{-m32}' > > with no new failures. Additional testing was performed by repeating > > these steps after removing the "optimize_size > 1" condition, so that > > suitable lea instructions were always split [-Oz is not heavily > > tested, so this invoked the new code during the bootstrap and > > regression testing], again with no regressions. Ok for mainline? > > > > > > 2023-10-05 Roger Sayle > > > > gcc/ChangeLog > > * config/i386/i386.cc (ix86_avoid_lea_for_addr): Split LEAs used > > to perform left shifts into shorter instructions with -Oz. > > > > gcc/testsuite/ChangeLog > > * gcc.target/i386/lea-2.c: New test case. > > > > OK, but ... > > @@ -0,0 +1,7 @@ > +/* { dg-do compile { target { ! ia32 } } } */ > > Is there a reason to avoid 32-bit targets? I'd expect that the optimization > also > triggers on x86_32 for 32bit integers. Good catch. You're 100% correct; because the test case just checks that an LEA is not used, and not for the specific sequence of shift instructions used instead, this test also passes with --target_board='unix{-m32}'. I'll remove the target clause from the dg-do compile directive. > +/* { dg-options "-Oz" } */ > +int foo(int x) { return x<<2; } > +int bar(int x) { return x<<3; } > +long long fool(long long x) { return x<<2; } long long barl(long long > +x) { return x<<3; } > +/* { dg-final { scan-assembler-not "lea\[lq\]" } } */ Thanks again. Roger --
[X86 PATCH] Implement doubleword right shifts by 1 bit using s[ha]r+rcr.
This patch tweaks the i386 back-end's ix86_split_ashr and ix86_split_lshr functions to implement doubleword right shifts by 1 bit, using a shift of the highpart that sets the carry flag followed by a rotate-carry-right (RCR) instruction on the lowpart. Conceptually this is similar to the recent left shift patch, but with two complicating factors. The first is that although the RCR sequence is shorter, and is a ~3x performance improvement on AMD, my micro-benchmarking shows it ~10% slower on Intel. Hence this patch also introduces a new X86_TUNE_USE_RCR tuning parameter. The second is that I believe this is the first time a "rotate-right-through-carry" and a right shift that sets the carry flag from the least significant bit has been modelled in GCC RTL (on a MODE_CC target). For this I've used the i386 back-end's UNSPEC_CC_NE which seems appropriate. Finally rcrsi2 and rcrdi2 are separate define_insns so that we can use their generator functions. For the pair of functions: unsigned __int128 foo(unsigned __int128 x) { return x >> 1; } __int128 bar(__int128 x) { return x >> 1; } with -O2 -march=znver4 we previously generated: foo:movq%rdi, %rax movq%rsi, %rdx shrdq $1, %rsi, %rax shrq%rdx ret bar:movq%rdi, %rax movq%rsi, %rdx shrdq $1, %rsi, %rax sarq%rdx ret with this patch we now generate: foo:movq%rsi, %rdx movq%rdi, %rax shrq%rdx rcrq%rax ret bar:movq%rsi, %rdx movq%rdi, %rax sarq%rdx rcrq%rax ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. And to provide additional testing, I've also bootstrapped and regression tested a version of this patch where the RCR is always generated (independent of the -march target) again with no regressions. Ok for mainline? 2023-10-06 Roger Sayle gcc/ChangeLog * config/i386/i386-expand.c (ix86_split_ashr): Split shifts by one into ashr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR or -Oz. (ix86_split_lshr): Likewise, split shifts by one bit into lshr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR or -Oz. * config/i386/i386.h (TARGET_USE_RCR): New backend macro. * config/i386/i386.md (rcrsi2): New define_insn for rcrl. (rcrdi2): New define_insn for rcrq. (3_carry): New define_insn for right shifts that set the carry flag from the least significant bit, modelled using UNSPEC_CC_NE. * config/i386/x86-tune.def (X86_TUNE_USE_RCR): New tuning parameter controlling use of rcr 1 vs. shrd, which is significantly faster on AMD processors. gcc/testsuite/ChangeLog * gcc.target/i386/rcr-1.c: New 64-bit test case. * gcc.target/i386/rcr-2.c: New 32-bit test case. Thanks in advance, Roger --
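A hedged C-level sketch of the new sequence (illustration only; the 64-bit word type is an assumption): the right shift of the high word leaves its discarded bit in the carry flag, and the rotate-through-carry then drops that bit into the top of the low word.

  typedef unsigned long long u64;   /* assumed 64-bit word */

  /* 128-bit logical right shift by one, as performed by shrq + rcrq
     (the arithmetic case uses sarq on the high word instead).  */
  static void dword_lshr1 (u64 *hi, u64 *lo)
  {
    u64 carry = *hi & 1;                  /* bit that shrq leaves in CF */
    *hi >>= 1;                            /* shrq %rdx */
    *lo = (*lo >> 1) | (carry << 63);     /* rcrq %rax: CF into bit 63  */
  }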
RE: [X86 PATCH] Implement doubleword right shifts by 1 bit using s[ha]r+rcr.
Grr! I've done it again. ENOPATCH. > -Original Message- > From: Roger Sayle > Sent: 06 October 2023 14:58 > To: 'gcc-patches@gcc.gnu.org' > Cc: 'Uros Bizjak' > Subject: [X86 PATCH] Implement doubleword right shifts by 1 bit using s[ha]r+rcr. > > > This patch tweaks the i386 back-end's ix86_split_ashr and ix86_split_lshr > functions to implement doubleword right shifts by 1 bit, using a shift of the > highpart that sets the carry flag followed by a rotate-carry-right > (RCR) instruction on the lowpart. > > Conceptually this is similar to the recent left shift patch, but with two > complicating factors. The first is that although the RCR sequence is shorter, and is > a ~3x performance improvement on AMD, my micro-benchmarking shows it > ~10% slower on Intel. Hence this patch also introduces a new > X86_TUNE_USE_RCR tuning parameter. The second is that I believe this is the > first time a "rotate-right-through-carry" and a right shift that sets the carry flag > from the least significant bit has been modelled in GCC RTL (on a MODE_CC > target). For this I've used the i386 back-end's UNSPEC_CC_NE which seems > appropriate. Finally rcrsi2 and rcrdi2 are separate define_insns so that we can > use their generator functions. > > For the pair of functions: > unsigned __int128 foo(unsigned __int128 x) { return x >> 1; } > __int128 bar(__int128 x) { return x >> 1; } > > with -O2 -march=znver4 we previously generated: > > foo:movq%rdi, %rax > movq%rsi, %rdx > shrdq $1, %rsi, %rax > shrq%rdx > ret > bar:movq%rdi, %rax > movq%rsi, %rdx > shrdq $1, %rsi, %rax > sarq%rdx > ret > > with this patch we now generate: > > foo:movq%rsi, %rdx > movq%rdi, %rax > shrq%rdx > rcrq%rax > ret > bar:movq%rsi, %rdx > movq%rdi, %rax > sarq%rdx > rcrq%rax > ret > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and > make -k check, both with and without --target_board=unix{-m32} with no new > failures. And to provide additional testing, I've also bootstrapped and regression > tested a version of this patch where the RCR is always generated (independent of > the -march target) again with no regressions. Ok for mainline? > > > 2023-10-06 Roger Sayle > > gcc/ChangeLog > * config/i386/i386-expand.c (ix86_split_ashr): Split shifts by > one into ashr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR > or -Oz. > (ix86_split_lshr): Likewise, split shifts by one bit into > lshr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR or -Oz. > * config/i386/i386.h (TARGET_USE_RCR): New backend macro. > * config/i386/i386.md (rcrsi2): New define_insn for rcrl. > (rcrdi2): New define_insn for rcrq. > (3_carry): New define_insn for right shifts that > set the carry flag from the least significant bit, modelled using > UNSPEC_CC_NE. > * config/i386/x86-tune.def (X86_TUNE_USE_RCR): New tuning parameter > controlling use of rcr 1 vs. shrd, which is significantly faster on > AMD processors. > > gcc/testsuite/ChangeLog > * gcc.target/i386/rcr-1.c: New 64-bit test case. > * gcc.target/i386/rcr-2.c: New 32-bit test case. 
> > > Thanks in advance, > Roger > -- diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index e42ff27..399eb8e 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -6496,6 +6496,22 @@ ix86_split_ashr (rtx *operands, rtx scratch, machine_mode mode) emit_insn (gen_ashr3 (low[0], low[0], GEN_INT (count - half_width))); } + else if (count == 1 + && (TARGET_USE_RCR || optimize_size > 1)) + { + if (!rtx_equal_p (operands[0], operands[1])) + emit_move_insn (operands[0], operands[1]); + if (mode == DImode) + { + emit_insn (gen_ashrsi3_carry (high[0], high[0])); + emit_insn (gen_rcrsi2 (low[0], low[0])); + } + else + { + emit_insn (gen_ashrdi3_carry (high[0], high[0])); + emit_insn (gen_rcrdi2 (low[0], low[0])); + } + } else { gen_shrd = mode == DImode ? gen_x86_shrd : gen_x86_64_shrd; @@ -6561,6 +6577,22 @@ ix86_split_lshr (rtx *operands, rtx scratch, machine_mode mode) emit_insn (gen_lshr3 (low[0], low[0],
[ARC PATCH] Improved SImode shifts and rotates on !TARGET_BARREL_SHIFTER.
This patch completes the ARC back-end's transition to using pre-reload splitters for SImode shifts and rotates on targets without a barrel shifter. The core part is that the shift_si3 define_insn is no longer needed, as shifts and rotates that don't require a loop are split before reload, and then because shift_si3_loop is the only caller of output_shift, both can be significantly cleaned up and simplified. The output_shift function (Claudiu's "the elephant in the room") is renamed output_shift_loop, which handles just the four instruction zero-overhead loop implementations. Aside from the clean-ups, the user visible changes are much improved implementations of SImode shifts and rotates on affected targets. For the function: unsigned int rotr_1 (unsigned int x) { return (x >> 1) | (x << 31); } GCC with -O2 -mcpu=em would previously generate: rotr_1: lsr_s r2,r0 bmsk_s r0,r0,0 ror r0,r0 j_s.d [blink] or_sr0,r0,r2 with this patch, we now generate: j_s.d [blink] ror r0,r0 For the function: unsigned int rotr_31 (unsigned int x) { return (x >> 31) | (x << 1); } GCC with -O2 -mcpu=em would previously generate: rotr_31: mov_s r2,r0 ;4 asl_s r0,r0 add.f 0,r2,r2 rlc r2,0 j_s.d [blink] or_sr0,r0,r2 with this patch we now generate an add.f followed by an adc: rotr_31: add.f r0,r0,r0 j_s.d [blink] add.cs r0,r0,1 Shifts by constants requiring a loop have been improved for even counts by performing two operations in each iteration: int shl10(int x) { return x >> 10; } Previously looked like: shl10: mov.f lp_count, 10 lpnz2f asr r0,r0 nop 2: # end single insn loop j_s [blink] And now becomes: shl10: mov lp_count,5 lp 2f asr r0,r0 asr r0,r0 2: # end single insn loop j_s [blink] So emulating ARC's SWAP on architectures that don't have it: unsigned int rotr_16 (unsigned int x) { return (x >> 16) | (x << 16); } previously required 10 instructions and ~70 cycles: rotr_16: mov_s r2,r0 ;4 mov.f lp_count, 16 lpnz2f add r0,r0,r0 nop 2: # end single insn loop mov.f lp_count, 16 lpnz2f lsr r2,r2 nop 2: # end single insn loop j_s.d [blink] or_sr0,r0,r2 now becomes just 4 instructions and ~18 cycles: rotr_16: mov lp_count,8 lp 2f ror r0,r0 ror r0,r0 2: # end single insn loop j_s [blink] This patch has been tested with a cross-compiler to arc-linux hosted on x86_64-pc-linux-gnu and (partially) tested with the compile-only portions of the testsuite with no regressions. Ok for mainline, if your own testing shows no issues? 2023-10-07 Roger Sayle gcc/ChangeLog * config/arc/arc-protos.h (output_shift): Rename to... (output_shift_loop): Tweak API to take an explicit rtx_code. (arc_split_ashl): Prototype new function here. (arc_split_ashr): Likewise. (arc_split_lshr): Likewise. (arc_split_rotl): Likewise. (arc_split_rotr): Likewise. * config/arc/arc.cc (output_shift): Delete local prototype. Rename. (output_shift_loop): New function replacing output_shift to output a zero overheap loop for SImode shifts and rotates on ARC targets without barrel shifter (i.e. no hardware support for these insns). (arc_split_ashl): New helper function to split *ashlsi3_nobs. (arc_split_ashr): New helper function to split *ashrsi3_nobs. (arc_split_lshr): New helper function to split *lshrsi3_nobs. (arc_split_rotl): New helper function to split *rotlsi3_nobs. (arc_split_rotr): New helper function to split *rotrsi3_nobs. * config/arc/arc.md (any_shift_rotate): New define_code_iterator. (define_code_attr insn): New code attribute to map to pattern name. (si3): New expander unifying previous ashlsi3, ashrsi3 and lshrsi3 define_expands. 
Adds rotlsi3 and rotrsi3. (*si3_nobs): New define_insn_and_split that unifies the previous *ashlsi3_nobs, *ashrsi3_nobs and *lshrsi3_nobs. We now call arc_split_ in arc.cc to implement each split. (shift_si3): Delete define_insn, all shifts/rotates are now split. (shift_si3_loop): Rename to... (si3_loop): define_insn to handle loop implementations of SImode shifts and rotates, calling output_shift_loop for template. (rotrsi3): Rename to... (*rotrsi3_insn): define_insn for TARGET_BARREL_SHIFTER's ror. (*rotlsi3): New define_insn_and_split to transform left rotates into right rotates before reload. (rotlsi3_cnt1): New define_in
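To make the even-count loop improvement concrete, a small hedged C sketch (illustration only; the real implementation emits an ARC zero-overhead loop rather than C): a shift by a known even count 2*k can be done in k iterations that each shift twice, halving lp_count.

  /* Shift right arithmetically by 10 in 5 iterations of two single-bit
     shifts, mirroring "mov lp_count,5; lp 2f; asr r0,r0; asr r0,r0".  */
  int asr_by_10 (int x)
  {
    for (int i = 0; i < 5; i++)
      {
        x >>= 1;   /* first asr in the loop body  */
        x >>= 1;   /* second asr in the loop body */
      }
    return x;
  }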
[PATCH] Optimize (ne:SI (subreg:QI (ashift:SI x 7) 0) 0) as (and:SI x 1).
This patch is the middle-end piece of an improvement to PRs 101955 and 106245, that adds a missing simplification to the RTL optimizers. This transformation is to simplify (char)(x << 7) != 0 as x & 1. Technically, the cast can be any truncation, where shift is by one less than the narrower type's precision, setting the most significant (only) bit from the least significant bit. This transformation applies to any target, but it's easy to see (and add a new test case) on x86, where the following function: int f(int a) { return (a << 31) >> 31; } currently gets compiled with -O2 to: foo:movl%edi, %eax sall$7, %eax sarb$7, %al movsbl %al, %eax ret but with this patch, we now generate the slightly simpler. foo:movl%edi, %eax sall$31, %eax sarl$31, %eax ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check with no new failures. Ok for mainline? 2023-10-10 Roger Sayle gcc/ChangeLog PR middle-end/101955 PR tree-optimization/106245 * simplify-rtx.c (simplify_relational_operation_1): Simplify the RTL (ne:SI (subreg:QI (ashift:SI x 7) 0) 0) to (and:SI x 1). gcc/testsuite/ChangeLog * gcc.target/i386/pr106245-1.c: New test case. Thanks in advance, Roger -- diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index bd9443d..69d8757 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -6109,6 +6109,23 @@ simplify_context::simplify_relational_operation_1 (rtx_code code, break; } + /* (ne:SI (subreg:QI (ashift:SI x 7) 0) 0) -> (and:SI x 1). */ + if (code == NE + && op1 == const0_rtx + && (op0code == TRUNCATE + || (partial_subreg_p (op0) + && subreg_lowpart_p (op0))) + && SCALAR_INT_MODE_P (mode) + && STORE_FLAG_VALUE == 1) +{ + rtx tmp = XEXP (op0, 0); + if (GET_CODE (tmp) == ASHIFT + && GET_MODE (tmp) == mode + && CONST_INT_P (XEXP (tmp, 1)) + && is_int_mode (GET_MODE (op0), &int_mode) + && INTVAL (XEXP (tmp, 1)) == GET_MODE_PRECISION (int_mode) - 1) + return simplify_gen_binary (AND, mode, XEXP (tmp, 0), const1_rtx); +} return NULL_RTX; } diff --git a/gcc/testsuite/gcc.target/i386/pr106245-1.c b/gcc/testsuite/gcc.target/i386/pr106245-1.c new file mode 100644 index 000..a0403e9 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106245-1.c @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +int f(int a) +{ +return (a << 31) >> 31; +} + +/* { dg-final { scan-assembler-not "sarb" } } */ +/* { dg-final { scan-assembler-not "movsbl" } } */
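For concreteness, a hedged C rendering of the equivalence the new simplification captures (editorial illustration, not part of the patch): truncating x << 7 to 8 bits leaves bit 0 of x as the only surviving (most significant) bit, so testing the truncated value for non-zero is the same as x & 1.

  /* Both functions return the same value for every x; the shift is done
     on an unsigned copy to keep this C-level example free of overflow.  */
  int before (int x) { return (unsigned char) ((unsigned) x << 7) != 0; }
  int after  (int x) { return x & 1; }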
[PATCH] PR 91865: Avoid ZERO_EXTEND of ZERO_EXTEND in make_compound_operation.
This patch is my proposed solution to PR rtl-optimization/91865. Normally RTX simplification canonicalizes a ZERO_EXTEND of a ZERO_EXTEND to a single ZERO_EXTEND, but as shown in this PR it is possible for combine's make_compound_operation to unintentionally generate a non-canonical ZERO_EXTEND of a ZERO_EXTEND, which is unlikely to be matched by the backend. For the new test case: const int table[2] = {1, 2}; int foo (char i) { return table[i]; } compiling with -O2 -mlarge on msp430 we currently see: Trying 2 -> 7: 2: r25:HI=zero_extend(R12:QI) REG_DEAD R12:QI 7: r28:PSI=sign_extend(r25:HI)#0 REG_DEAD r25:HI Failed to match this instruction: (set (reg:PSI 28 [ iD.1772 ]) (zero_extend:PSI (zero_extend:HI (reg:QI 12 R12 [ iD.1772 ] which results in the following code: foo:AND #0xff, R12 RLAM.A #4, R12 { RRAM.A #4, R12 RLAM.A #1, R12 MOVX.W table(R12), R12 RETA With this patch, we now see: Trying 2 -> 7: 2: r25:HI=zero_extend(R12:QI) REG_DEAD R12:QI 7: r28:PSI=sign_extend(r25:HI)#0 REG_DEAD r25:HI Successfully matched this instruction: (set (reg:PSI 28 [ iD.1772 ]) (zero_extend:PSI (reg:QI 12 R12 [ iD.1772 ]))) allowing combination of insns 2 and 7 original costs 4 + 8 = 12 replacement cost 8 foo:MOV.B R12, R12 RLAM.A #1, R12 MOVX.W table(R12), R12 RETA This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-10-14 Roger Sayle gcc/ChangeLog PR rtl-optimization/91865 * combine.cc (make_compound_operation): Avoid creating a ZERO_EXTEND of a ZERO_EXTEND. gcc/testsuite/ChangeLog PR rtl-optimization/91865 * gcc.target/msp430/pr91865.c: New test case. Thanks in advance, Roger -- diff --git a/gcc/combine.cc b/gcc/combine.cc index 360aa2f25e6..f47ff596782 100644 --- a/gcc/combine.cc +++ b/gcc/combine.cc @@ -8453,6 +8453,9 @@ make_compound_operation (rtx x, enum rtx_code in_code) new_rtx, GET_MODE (XEXP (x, 0))); if (tem) return tem; + /* Avoid creating a ZERO_EXTEND of a ZERO_EXTEND. */ + if (GET_CODE (new_rtx) == ZERO_EXTEND) + new_rtx = XEXP (new_rtx, 0); SUBST (XEXP (x, 0), new_rtx); return x; } diff --git a/gcc/testsuite/gcc.target/msp430/pr91865.c b/gcc/testsuite/gcc.target/msp430/pr91865.c new file mode 100644 index 000..8cc21c8b9e8 --- /dev/null +++ b/gcc/testsuite/gcc.target/msp430/pr91865.c @@ -0,0 +1,8 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mlarge" } */ + +const int table[2] = {1, 2}; +int foo (char i) { return table[i]; } + +/* { dg-final { scan-assembler-not "AND" } } */ +/* { dg-final { scan-assembler-not "RRAM" } } */
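As a brief, hedged C-level analogy of the canonicalization involved (illustration only; the real issue is in combine's RTL): a zero extension of a zero extension is no different from a single zero extension from the original mode, which is the form the backend actually matches.

  /* Both produce the same value; the intermediate widening through
     'unsigned short' adds nothing, just as the inner zero_extend above
     adds nothing to the outer one.  */
  long widened_twice (unsigned char c) { return (long) (unsigned short) c; }
  long widened_once  (unsigned char c) { return (long) c; }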
[PATCH] Improved RTL expansion of 1LL << x.
This patch improves the initial RTL expanded for double word shifts on architectures with conditional moves, so that later passes don't need to clean-up unnecessary and/or unused instructions. Consider the general case, x << y, which is expanded well as: t1 = y & 32; t2 = 0; t3 = x_lo >> 1; t4 = y ^ ~0; t5 = t3 >> t4; tmp_hi = x_hi << y; tmp_hi |= t5; tmp_lo = x_lo << y; out_hi = t1 ? tmp_lo : tmp_hi; out_lo = t1 ? t2 : tmp_lo; which is nearly optimal, the only thing that can be improved is that using a unary NOT operation "t4 = ~y" is better than XOR with -1, on targets that support it. [Note the one_cmpl_optab expander didn't fall back to XOR when this code was originally written, but has been improved since]. Now consider the relatively common idiom of 1LL << y, which currently produces the RTL equivalent of: t1 = y & 32; t2 = 0; t3 = 1 >> 1; t4 = y ^ ~0; t5 = t3 >> t4; tmp_hi = 0 << y; tmp_hi |= t5; tmp_lo = 1 << y; out_hi = t1 ? tmp_lo : tmp_hi; out_lo = t1 ? t2 : tmp_lo; Notice here that t3 is always zero, so the assignment of t5 is a variable shift of zero, which expands to a loop on many smaller targets, a similar shift by zero in the first tmp_hi assignment (another loop), that the value of t4 is no longer required (as t3 is zero), and that the ultimate value of tmp_hi is always zero. Fortunately, for many (but perhaps not all) targets this mess gets cleaned up by later optimization passes. However, this patch avoids generating unnecessary RTL at expand time, by calling simplify_expand_binop instead of expand_binop, and avoiding generating dead or unnecessary code when intermediate values are known to be zero. For the 1LL << y test case above, we now generate: t1 = y & 32; t2 = 0; tmp_hi = 0; tmp_lo = 1 << y; out_hi = t1 ? tmp_lo : tmp_hi; out_lo = t1 ? t2 : tmp_lo; On arc-elf, for example, there are 18 RTL INSN_P instructions generated by expand before this patch, but only 12 with this patch (improving both compile-time and memory usage). This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-10-15 Roger Sayle gcc/ChangeLog * optabs.cc (expand_subword_shift): Call simplify_expand_binop instead of expand_binop. Optimize cases (i.e. avoid generating RTL) when CARRIES or INTO_INPUT is zero. Use one_cmpl_optab (i.e. NOT) instead of xor_optab with ~0 to calculate ~OP1. Thanks in advance, Roger -- diff --git a/gcc/optabs.cc b/gcc/optabs.cc index e1898da..f0a048a 100644 --- a/gcc/optabs.cc +++ b/gcc/optabs.cc @@ -533,15 +533,13 @@ expand_subword_shift (scalar_int_mode op1_mode, optab binoptab, has unknown behavior. Do a single shift first, then shift by the remainder. It's OK to use ~OP1 as the remainder if shift counts are truncated to the mode size. 
*/ - carries = expand_binop (word_mode, reverse_unsigned_shift, - outof_input, const1_rtx, 0, unsignedp, methods); - if (shift_mask == BITS_PER_WORD - 1) - { - tmp = immed_wide_int_const - (wi::minus_one (GET_MODE_PRECISION (op1_mode)), op1_mode); - tmp = simplify_expand_binop (op1_mode, xor_optab, op1, tmp, - 0, true, methods); - } + carries = simplify_expand_binop (word_mode, reverse_unsigned_shift, + outof_input, const1_rtx, 0, + unsignedp, methods); + if (carries == const0_rtx) + tmp = const0_rtx; + else if (shift_mask == BITS_PER_WORD - 1) + tmp = expand_unop (op1_mode, one_cmpl_optab, op1, 0, true); else { tmp = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1, @@ -552,22 +550,29 @@ expand_subword_shift (scalar_int_mode op1_mode, optab binoptab, } if (tmp == 0 || carries == 0) return false; - carries = expand_binop (word_mode, reverse_unsigned_shift, - carries, tmp, 0, unsignedp, methods); + if (carries != const0_rtx && tmp != const0_rtx) +carries = simplify_expand_binop (word_mode, reverse_unsigned_shift, +carries, tmp, 0, unsignedp, methods); if (carries == 0) return false; - /* Shift INTO_INPUT logically by OP1. This is the last use of INTO_INPUT - so the result can go directly into INTO_TARGET if convenient. */ - tmp = expand_binop (word_mode, unsigned_shift, into_input, op1, - into_target, unsignedp, methods); - if (tmp == 0) -return false; + if (into_inp
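To relate the pseudo-code above to something executable, here is a hedged C sketch of the generic double-word shift-left expansion (editorial illustration with an assumed 32-bit word; the real expander works on RTL and relies on shift counts being truncated to the word size):

  typedef unsigned int u32;   /* assumed 32-bit word */

  static void dword_shl (u32 *hi, u32 *lo, unsigned int y)
  {
    u32 t1      = y & 32;                   /* does the count reach the high word?  */
    u32 carries = (*lo >> 1) >> (~y & 31);  /* t5: low-word bits entering high word */
    u32 tmp_hi  = (*hi << (y & 31)) | carries;
    u32 tmp_lo  = *lo << (y & 31);
    *hi = t1 ? tmp_lo : tmp_hi;             /* count >= 32: high word = lo << (y-32) */
    *lo = t1 ? 0 : tmp_lo;
  }

With the low word known to be 1 and the high word known to be 0, as in 1LL << y, 'carries' and the high-word shift fold to zero at expand time, which is exactly the dead code the patch now avoids emitting.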
[ARC PATCH] Split asl dst, 1, src into bset dst, 0, src to implement 1<<x
This patch adds a pre-reload splitter to arc.md, to use the bset (set specific bit instruction) to implement 1<<x.

gcc/ChangeLog
	* config/arc/arc.md (*ashlsi3_1): New pre-reload splitter to use
	bset dst,0,src to implement 1<<x.
RE: [ARC PATCH] Split asl dst, 1, src into bset dst, 0, src to implement 1<<x
I've done it again. ENOPATCH.

From: Roger Sayle
Sent: 15 October 2023 09:13
To: 'gcc-patches@gcc.gnu.org'
Cc: 'Claudiu Zissulescu'
Subject: [ARC PATCH] Split asl dst,1,src into bset dst,0,src to implement 1<<x

gcc/ChangeLog
	* config/arc/arc.md (*ashlsi3_1): New pre-reload splitter to use
	bset dst,0,src to implement 1<<x.

diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md
index a936a8b..22af0bf 100644
--- a/gcc/config/arc/arc.md
+++ b/gcc/config/arc/arc.md
@@ -3421,6 +3421,22 @@ archs4x, archs4xd"
    (set_attr "predicable" "no,no,yes,no,no")
    (set_attr "cond" "nocond,canuse,canuse,nocond,nocond")])
 
+;; Split asl dst,1,src into bset dst,0,src.
+(define_insn_and_split "*ashlsi3_1"
+  [(set (match_operand:SI 0 "dest_reg_operand")
+	(ashift:SI (const_int 1)
+		   (match_operand:SI 1 "nonmemory_operand")))]
+  "!TARGET_BARREL_SHIFTER
+   && arc_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+	(ior:SI (ashift:SI (const_int 1) (match_dup 1))
+		(const_int 0)))]
+  ""
+  [(set_attr "type" "shift")
+   (set_attr "length" "8")])
+
 (define_insn_and_split "*ashlsi3_nobs"
   [(set (match_operand:SI 0 "dest_reg_operand")
	(ashift:SI (match_operand:SI 1 "register_operand")
RE: [PATCH] PR 91865: Avoid ZERO_EXTEND of ZERO_EXTEND in make_compound_operation.
Hi Jeff, Thanks for the speedy review(s). > From: Jeff Law > Sent: 15 October 2023 00:03 > To: Roger Sayle ; gcc-patches@gcc.gnu.org > Subject: Re: [PATCH] PR 91865: Avoid ZERO_EXTEND of ZERO_EXTEND in > make_compound_operation. > > On 10/14/23 16:14, Roger Sayle wrote: > > > > This patch is my proposed solution to PR rtl-optimization/91865. > > Normally RTX simplification canonicalizes a ZERO_EXTEND of a > > ZERO_EXTEND to a single ZERO_EXTEND, but as shown in this PR it is > > possible for combine's make_compound_operation to unintentionally > > generate a non-canonical ZERO_EXTEND of a ZERO_EXTEND, which is > > unlikely to be matched by the backend. > > > > For the new test case: > > > > const int table[2] = {1, 2}; > > int foo (char i) { return table[i]; } > > > > compiling with -O2 -mlarge on msp430 we currently see: > > > > Trying 2 -> 7: > > 2: r25:HI=zero_extend(R12:QI) > >REG_DEAD R12:QI > > 7: r28:PSI=sign_extend(r25:HI)#0 > >REG_DEAD r25:HI > > Failed to match this instruction: > > (set (reg:PSI 28 [ iD.1772 ]) > > (zero_extend:PSI (zero_extend:HI (reg:QI 12 R12 [ iD.1772 ] > > > > which results in the following code: > > > > foo:AND #0xff, R12 > > RLAM.A #4, R12 { RRAM.A #4, R12 > > RLAM.A #1, R12 > > MOVX.W table(R12), R12 > > RETA > > > > With this patch, we now see: > > > > Trying 2 -> 7: > > 2: r25:HI=zero_extend(R12:QI) > >REG_DEAD R12:QI > > 7: r28:PSI=sign_extend(r25:HI)#0 > >REG_DEAD r25:HI > > Successfully matched this instruction: > > (set (reg:PSI 28 [ iD.1772 ]) > > (zero_extend:PSI (reg:QI 12 R12 [ iD.1772 ]))) allowing > > combination of insns 2 and 7 original costs 4 + 8 = 12 replacement > > cost 8 > > > > foo:MOV.B R12, R12 > > RLAM.A #1, R12 > > MOVX.W table(R12), R12 > > RETA > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline? > > > > 2023-10-14 Roger Sayle > > > > gcc/ChangeLog > > PR rtl-optimization/91865 > > * combine.cc (make_compound_operation): Avoid creating a > > ZERO_EXTEND of a ZERO_EXTEND. > > > > gcc/testsuite/ChangeLog > > PR rtl-optimization/91865 > > * gcc.target/msp430/pr91865.c: New test case. > Neither an ACK or NAK at this point. > > The bug report includes a patch from Segher which purports to fix this in > simplify- > rtx. Any thoughts on Segher's approach and whether or not it should be > considered? > > The BZ also indicates that removal of 2 patterns from msp430.md would solve > this > too (though it may cause regressions elsewhere?). Any thoughts on that > approach > as well? > Great questions. I believe Segher's proposed patch (in comment #4) was an msp430-specific proof-of-concept workaround rather than intended to be fix. Eliminating a ZERO_EXTEND simply by changing the mode of a hard register is not a solution that'll work on many platforms (and therefore not really suitable for target-independent middle-end code in the RTL optimizers). For example, zero_extend:TI (and:QI (reg:QI hard_r1) (const_int 0x0f)) can't universally be reduced to (and:TI (reg:TI hard_r1) (const_int 0x0f)). Notice that Segher's code doesn't check TARGET_HARD_REGNO_MODE_OK or TARGET_MODES_TIEABLE_P or any of the other backend hooks necessary to confirm such a transformation is safe/possible. Secondly, the hard register aspect is a bit of a red herring. This work-around fixes the issue in the original BZ description, but not the slightly modified test case in comment #2 (with a global variable). 
This doesn't have a hard register, but does have the dubious ZERO_EXTEND/SIGN_EXTEND of a ZERO_EXTEND. The underlying issue, which is applicable to all targets, is that combine.cc's make_compound_operation is expected to reverse the local transformations made by expand_compound_operation. Hence, if an RTL expression is canonical going into expand_compound_operation, it is expected (hoped) to be canonical (and equivalent) coming out of make_compound_operation. Hence, rather than be a MSP430 specific issue, no target should expect (or be expected to see) a ZERO_EXTEND of a ZERO_EXTEND, or a SIGN_EXTEND of a ZERO_EXTEND in the RTL stream. Much like a binary operator with two CONST_INT operands, or a shift by zero, it's somethi
RE: [PATCH] Support g++ 4.8 as a host compiler.
I'd like to ping my patch for restoring bootstrap using g++ 4.8.5 (the system compiler on RHEL 7 and later systems). https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632008.html Note the preprocessor #ifs can be removed; they are only there to document why the union u must have an explicit, empty (but not default) constructor.

I completely agree with the various opinions that we might consider upgrading the minimum host compiler for many good reasons (Ada, D, newer C++ features etc.). It's inevitable that older compilers and systems can't be supported indefinitely. Having said that, I don't think that this unintentional trivial breakage, which has a safe one-line workaround, is sufficient cause (or a non-negligible risk or support burden) to inconvenience a large number of GCC users (the impact/disruption to cfarm has already been mentioned).

Interestingly, "scl enable devtoolset-XX" to use a newer host compiler, v10 or v11, results in a significant increase (100+) in unexpected failures I see during mainline regression testing using "make -k check" (on RedHat 7.9). (Older) system compilers, despite their flaws, are selected for their (overall) stability and maturity. If another patch/change hits the compiler next week that reasonably means that 4.8.5 can no longer be supported, so be it, but it's an annoying (and unnecessary?) inconvenience in the meantime. Perhaps we should file a Bugzilla PR indicating that the documentation and release notes need updating, if my fix isn't considered acceptable?

Why this patch is a trigger issue (that requires significant discussion and deliberation) is somewhat of a mystery. Thanks in advance. Roger

> -----Original Message-----
> From: Jeff Law > Sent: 07 October 2023 17:20 > To: Roger Sayle ; gcc-patches@gcc.gnu.org > Cc: 'Richard Sandiford' > Subject: Re: [PATCH] Support g++ 4.8 as a host compiler. > > > > On 10/4/23 16:19, Roger Sayle wrote: > > > > The recent patch to remove poly_int_pod triggers a bug in g++ 4.8.5's > > C++ 11 support which mistakenly believes poly_uint16 has a non-trivial > > constructor. This in turn prohibits it from being used as a member in > > a union (rtxunion) that constructed statically, resulting in a (fatal) > > error during stage 1. A workaround is to add an explicit constructor > > to the problematic union, which allows mainline to be bootstrapped > > with the system compiler on older RedHat 7 systems. > > > > This patch has been tested on x86_64-pc-linux-gnu where it allows a > > bootstrap to complete when using g++ 4.8.5 as the host compiler. > > Ok for mainline? > > > > > > 2023-10-04 Roger Sayle > > > > gcc/ChangeLog > > * rtl.h (rtx_def::u): Add explicit constructor to workaround > > issue using g++ 4.8 as a host compiler. > I think the bigger question is whether or not we're going to step forward on > the > minimum build requirements. > > My recollection was we settled on gcc-4.8 for the benefit of RHEL 7 and > Centos 7 > which are rapidly approaching EOL (June 2024). > > I would certainly support stepping forward to a more modern compiler for the > build requirements, which might make this patch obsolete. > > Jeff
[x86 PATCH] PR 106245: Split (x<<31)>>31 as -(x&1) in i386.md
This patch is the backend piece of a solution to PRs 101955 and 106245, that adds a define_insn_and_split to the i386 backend, to perform sign extension of a single (least significant) bit using AND $1 then NEG. Previously, (x<<31)>>31 would be generated as sall$31, %eax // 3 bytes sarl$31, %eax // 3 bytes with this patch the backend now generates: andl$1, %eax// 3 bytes negl%eax// 2 bytes Not only is this smaller in size, but microbenchmarking confirms that it's a performance win on both Intel and AMD; Intel sees only a 2% improvement (perhaps just a size effect), but AMD sees a 7% win. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-10-17 Roger Sayle gcc/ChangeLog PR middle-end/101955 PR tree-optimization/106245 * config/i386/i386.md (*extv_1_0): New define_insn_and_split. gcc/testsuite/ChangeLog PR middle-end/101955 PR tree-optimization/106245 * gcc.target/i386/pr106245-2.c: New test case. * gcc.target/i386/pr106245-3.c: New 32-bit test case. * gcc.target/i386/pr106245-4.c: New 64-bit test case. * gcc.target/i386/pr106245-5.c: Likewise. Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 2a60df5..b7309be0 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -3414,6 +3414,21 @@ [(set_attr "type" "imovx") (set_attr "mode" "SI")]) +;; Split sign-extension of single least significant bit as and x,$1;neg x +(define_insn_and_split "*extv_1_0" + [(set (match_operand:SWI48 0 "register_operand" "=r") + (sign_extract:SWI48 (match_operand:SWI48 1 "register_operand" "0") + (const_int 1) + (const_int 0))) + (clobber (reg:CC FLAGS_REG))] + "" + "#" + "&& 1" + [(parallel [(set (match_dup 0) (and:SWI48 (match_dup 1) (const_int 1))) + (clobber (reg:CC FLAGS_REG))]) + (parallel [(set (match_dup 0) (neg:SWI48 (match_dup 0))) + (clobber (reg:CC FLAGS_REG))])]) + (define_expand "extzv" [(set (match_operand:SWI248 0 "register_operand") (zero_extract:SWI248 (match_operand:SWI248 1 "register_operand") diff --git a/gcc/testsuite/gcc.target/i386/pr106245-2.c b/gcc/testsuite/gcc.target/i386/pr106245-2.c new file mode 100644 index 000..47b0d27 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106245-2.c @@ -0,0 +1,10 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +int f(int a) +{ +return (a << 31) >> 31; +} + +/* { dg-final { scan-assembler "andl" } } */ +/* { dg-final { scan-assembler "negl" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr106245-3.c b/gcc/testsuite/gcc.target/i386/pr106245-3.c new file mode 100644 index 000..4ec6342 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106245-3.c @@ -0,0 +1,11 @@ +/* { dg-do compile { target ia32 } } */ +/* { dg-options "-O2" } */ + +long long f(long long a) +{ +return (a << 63) >> 63; +} + +/* { dg-final { scan-assembler "andl" } } */ +/* { dg-final { scan-assembler "negl" } } */ +/* { dg-final { scan-assembler "cltd" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr106245-4.c b/gcc/testsuite/gcc.target/i386/pr106245-4.c new file mode 100644 index 000..ef77ee5 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106245-4.c @@ -0,0 +1,10 @@ +/* { dg-do compile { target { ! 
ia32 } } } */ +/* { dg-options "-O2" } */ + +long long f(long long a) +{ +return (a << 63) >> 63; +} + +/* { dg-final { scan-assembler "andl" } } */ +/* { dg-final { scan-assembler "negq" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr106245-5.c b/gcc/testsuite/gcc.target/i386/pr106245-5.c new file mode 100644 index 000..0351866 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106245-5.c @@ -0,0 +1,11 @@ +/* { dg-do compile { target int128 } } */ +/* { dg-options "-O2" } */ + +__int128 f(__int128 a) +{ + return (a << 127) >> 127; +} + +/* { dg-final { scan-assembler "andl" } } */ +/* { dg-final { scan-assembler "negq" } } */ +/* { dg-final { scan-assembler "cqto" } } */
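As a quick sanity check of the identity the new define_insn_and_split relies on (not part of the patch itself), the two forms agree on all inputs. Note that shifting a set bit into the sign position is formally undefined in ISO C, but GCC defines it as two's complement arithmetic, which is what the RTL above models.

```
#include <assert.h>

/* (x << 31) >> 31 sign-extends the least significant bit of x,
   which in two's complement is exactly -(x & 1).  */
static int sext_lsb_shift (int x)  { return (x << 31) >> 31; }
static int sext_lsb_andneg (int x) { return -(x & 1); }

int main (void)
{
  for (int x = -100000; x < 100000; x++)
    assert (sext_lsb_shift (x) == sext_lsb_andneg (x));
  return 0;
}
```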
RE: [x86 PATCH] PR 106245: Split (x<<31)>>31 as -(x&1) in i386.md
Hi Uros, Thanks for the speedy review. > From: Uros Bizjak > Sent: 17 October 2023 17:38 > > On Tue, Oct 17, 2023 at 3:08 PM Roger Sayle > wrote: > > > > > > This patch is the backend piece of a solution to PRs 101955 and > > 106245, that adds a define_insn_and_split to the i386 backend, to > > perform sign extension of a single (least significant) bit using AND $1 > > then NEG. > > > > Previously, (x<<31)>>31 would be generated as > > > > sall$31, %eax // 3 bytes > > sarl$31, %eax // 3 bytes > > > > with this patch the backend now generates: > > > > andl$1, %eax// 3 bytes > > negl%eax// 2 bytes > > > > Not only is this smaller in size, but microbenchmarking confirms that > > it's a performance win on both Intel and AMD; Intel sees only a 2% > > improvement (perhaps just a size effect), but AMD sees a 7% win. > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline? > > > > > > 2023-10-17 Roger Sayle > > > > gcc/ChangeLog > > PR middle-end/101955 > > PR tree-optimization/106245 > > * config/i386/i386.md (*extv_1_0): New define_insn_and_split. > > > > gcc/testsuite/ChangeLog > > PR middle-end/101955 > > PR tree-optimization/106245 > > * gcc.target/i386/pr106245-2.c: New test case. > > * gcc.target/i386/pr106245-3.c: New 32-bit test case. > > * gcc.target/i386/pr106245-4.c: New 64-bit test case. > > * gcc.target/i386/pr106245-5.c: Likewise. > > +;; Split sign-extension of single least significant bit as and x,$1;neg > +x (define_insn_and_split "*extv_1_0" > + [(set (match_operand:SWI48 0 "register_operand" "=r") > + (sign_extract:SWI48 (match_operand:SWI48 1 "register_operand" "0") > +(const_int 1) > +(const_int 0))) > + (clobber (reg:CC FLAGS_REG))] > + "" > + "#" > + "&& 1" > > No need to use "&&" for an empty insn constraint. Just use "reload_completed" > in > this case. > > + [(parallel [(set (match_dup 0) (and:SWI48 (match_dup 1) (const_int 1))) > + (clobber (reg:CC FLAGS_REG))]) > + (parallel [(set (match_dup 0) (neg:SWI48 (match_dup 0))) > + (clobber (reg:CC FLAGS_REG))])]) > > Did you intend to split this after reload? If this is the case, then > reload_completed > is missing. Because this splitter neither required the allocation of a new pseudo, nor a hard register assignment, i.e. it's a splitter that can be run before or after reload, it's written to split "whenever". If you'd prefer it to only split after reload, I agree a "reload_completed" can be added (alternatively, adding "ix86_pre_reload_split ()" would also work). I now see from "*load_tp_" that "" is perhaps preferred over "&& 1" In these cases. Please let me know which you prefer. Cheers, Roger
[x86 PATCH] PR target/110511: Fix reg allocation for widening multiplications.
This patch contains clean-ups of the widening multiplication patterns in i386.md, and provides variants of the existing highpart multiplication peephole2 transformations (that tidy up register allocation after reload), and thereby fixes PR target/110511, which is a superfluous move instruction. For the new test case, compiled on x86_64 with -O2. Before: mulx64: movabsq $-7046029254386353131, %rcx movq%rcx, %rax mulq%rdi xorq%rdx, %rax ret After: mulx64: movabsq $-7046029254386353131, %rax mulq%rdi xorq%rdx, %rax ret The clean-ups are (i) that operand 1 is consistently made register_operand and operand 2 becomes nonimmediate_operand, so that predicates match the constraints, (ii) the representation of the BMI2 mulx instruction is updated to use the new umul_highpart RTX, and (iii) because operands 0 and 1 have different modes in widening multiplications, "a" is a more appropriate constraint than "0" (which avoids spills/reloads containing SUBREGs). The new peephole2 transformations are based upon those at around line 9951 of i386.md, that begins with the comment ;; Highpart multiplication peephole2s to tweak register allocation. ;; mov imm,%rdx; mov %rdi,%rax; imulq %rdx -> mov imm,%rax; imulq %rdi This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-10-17 Roger Sayle gcc/ChangeLog PR target/110511 * config/i386/i386.md (mul3): Make operands 1 and 2 take "regiser_operand" and "nonimmediate_operand" respectively. (mulqihi3): Likewise. (*bmi2_umul3_1): Operand 2 needs to be register_operand matching the %d constraint. Use umul_highpart RTX to represent the highpart multiplication. (*umul3_1): Operand 2 should use regiser_operand predicate, and "a" rather than "0" as operands 0 and 2 have different modes. (define_split): For mul to mulx conversion, use the new umul_highpart RTX representation. (*mul3_1): Operand 1 should be register_operand and the constraint %a as operands 0 and 1 have different modes. (*mulqihi3_1): Operand 1 should be register_operand matching the constraint %0. (define_peephole2): Providing widening multiplication variants of the peephole2s that tweak highpart multiplication register allocation. gcc/testsuite/ChangeLog PR target/110511 * gcc.target/i386/pr110511.c: New test case. 
Thanks in advance, Roger diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 2a60df5..22f18c2 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -9710,33 +9710,29 @@ [(parallel [(set (match_operand: 0 "register_operand") (mult: (any_extend: - (match_operand:DWIH 1 "nonimmediate_operand")) + (match_operand:DWIH 1 "register_operand")) (any_extend: - (match_operand:DWIH 2 "register_operand" + (match_operand:DWIH 2 "nonimmediate_operand" (clobber (reg:CC FLAGS_REG))])]) (define_expand "mulqihi3" [(parallel [(set (match_operand:HI 0 "register_operand") (mult:HI (any_extend:HI - (match_operand:QI 1 "nonimmediate_operand")) + (match_operand:QI 1 "register_operand")) (any_extend:HI - (match_operand:QI 2 "register_operand" + (match_operand:QI 2 "nonimmediate_operand" (clobber (reg:CC FLAGS_REG))])] "TARGET_QIMODE_MATH") (define_insn "*bmi2_umul3_1" [(set (match_operand:DWIH 0 "register_operand" "=r") (mult:DWIH - (match_operand:DWIH 2 "nonimmediate_operand" "%d") + (match_operand:DWIH 2 "register_operand" "%d") (match_operand:DWIH 3 "nonimmediate_operand" "rm"))) (set (match_operand:DWIH 1 "register_operand" "=r") - (truncate:DWIH - (lshiftrt: - (mult: (zero_extend: (match_dup 2)) - (zero_extend: (match_dup 3))) - (match_operand:QI 4 "const_int_operand"] - "TARGET_BMI2 && INTVAL (operands[4]) == * BITS_PER_UNIT + (umul_highpart:DWIH (match_dup 2) (match_dup 3)))] + "TARGET_BMI2 && !(MEM_P (operands[2]) && MEM_P (operands[3]))" "mulx\t{%3, %0, %1|%1, %0, %3}" [(set_attr "type" &qu
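The new test case itself is not quoted above, but the mulx64 assembly is consistent with something like the following (an assumed reconstruction, not the actual pr110511.c): a 64x64->128-bit widening multiply by the constant -7046029254386353131 (0x9e3779b97f4a7c15 unsigned), whose high and low halves are then xor'ed together.

```
typedef unsigned long long u64;
typedef unsigned __int128 u128;

u64 mulx64 (u64 x)
{
  /* The high half of the product lands in %rdx and the low half in %rax;
     with the patch the constant is loaded directly into %rax.  */
  u128 r = (u128) x * 0x9e3779b97f4a7c15ull;
  return (u64) r ^ (u64) (r >> 64);
}
```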
RE: [x86 PATCH] PR target/110551: Fix reg allocation for widening multiplications.
Many thanks to Tobias Burnus for pointing out the mistake/typo in the PR number. This fix is for PR 110551, not PR 110511. I'll update the ChangeLog and filename of the new testcase, if approved. Sorry for any inconvenience/confusion. Cheers, Roger -- > -Original Message- > From: Roger Sayle > Sent: 17 October 2023 20:06 > To: 'gcc-patches@gcc.gnu.org' > Cc: 'Uros Bizjak' > Subject: [x86 PATCH] PR target/110511: Fix reg allocation for widening > multiplications. > > > This patch contains clean-ups of the widening multiplication patterns in i386.md, > and provides variants of the existing highpart multiplication > peephole2 transformations (that tidy up register allocation after reload), and > thereby fixes PR target/110511, which is a superfluous move instruction. > > For the new test case, compiled on x86_64 with -O2. > > Before: > mulx64: movabsq $-7046029254386353131, %rcx > movq%rcx, %rax > mulq%rdi > xorq%rdx, %rax > ret > > After: > mulx64: movabsq $-7046029254386353131, %rax > mulq%rdi > xorq%rdx, %rax > ret > > The clean-ups are (i) that operand 1 is consistently made register_operand and > operand 2 becomes nonimmediate_operand, so that predicates match the > constraints, (ii) the representation of the BMI2 mulx instruction is updated to use > the new umul_highpart RTX, and (iii) because operands > 0 and 1 have different modes in widening multiplications, "a" is a more > appropriate constraint than "0" (which avoids spills/reloads containing SUBREGs). > The new peephole2 transformations are based upon those at around line 9951 of > i386.md, that begins with the comment ;; Highpart multiplication peephole2s to > tweak register allocation. > ;; mov imm,%rdx; mov %rdi,%rax; imulq %rdx -> mov imm,%rax; imulq %rdi > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and > make -k check, both with and without --target_board=unix{-m32} with no new > failures. Ok for mainline? > > > 2023-10-17 Roger Sayle > > gcc/ChangeLog > PR target/110511 > * config/i386/i386.md (mul3): Make operands 1 and > 2 take "regiser_operand" and "nonimmediate_operand" respectively. > (mulqihi3): Likewise. > (*bmi2_umul3_1): Operand 2 needs to be register_operand > matching the %d constraint. Use umul_highpart RTX to represent > the highpart multiplication. > (*umul3_1): Operand 2 should use regiser_operand > predicate, and "a" rather than "0" as operands 0 and 2 have > different modes. > (define_split): For mul to mulx conversion, use the new > umul_highpart RTX representation. > (*mul3_1): Operand 1 should be register_operand > and the constraint %a as operands 0 and 1 have different modes. > (*mulqihi3_1): Operand 1 should be register_operand matching > the constraint %0. > (define_peephole2): Providing widening multiplication variants > of the peephole2s that tweak highpart multiplication register > allocation. > > gcc/testsuite/ChangeLog > PR target/110511 > * gcc.target/i386/pr110511.c: New test case. > > > Thanks in advance, > Roger
RE: [Patch] nvptx: Use fatal_error when -march= is missing not an assert [PR111093]
Hi Tomas, Tobias and Tom, Thanks for asking. Interestingly, I've a patch (attached) from last year that tackled some of the issues here. The surface problem is that nvptx's march and misa are related in complicated ways. Specifying an arch defines the range of valid isa's, and specifying an isa restricts the set of valid arches. The current approach, which I agree is problematic, is to force these to be specified (compatibly) on the cc1 command line. Certainly, an error is better than an abort. My proposed solution was to allow either to imply a default for the other, and only issue an error if they are explicitly specified incompatibly. One reason for supporting this approach was to ultimately support an -march=native in the driver (calling libcuda.so to determine the hardware available on the current machine). The other use case is bumping the "default" nvptx architecture to something more recent, say sm_53, by providing/honoring a default arch at configure time. Alas, it turns out that specifying a recent arch during GCC bootstrap, allows the build to notice that the backend (now) supports 16-bit floats, which then prompts libgcc to contain the floathf and fixhf support that would be required. Then this in turn shows up as a limitation in the middle-end's handling of libcalls, which I submitted as a patch to back in July 2022: https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598848.html That patch hasn't yet been approved, so the whole nvptx -march= patch series became backlogged/forgotten. Hopefully, the attached "proof-of-concept" patch looks interesting (food for thought). If this approach seems reasonable, I'm happy to brush the dust off, and resubmit it (or a series of pieces) for review. Best regards, Roger -- > -Original Message- > From: Thomas Schwinge > Sent: 18 October 2023 11:16 > To: Tobias Burnus > Cc: gcc-patches@gcc.gnu.org; Tom de Vries ; Roger Sayle > > Subject: Re: [Patch] nvptx: Use fatal_error when -march= is missing not an > assert > [PR111093] > > Hi Tobias! > > On 2023-10-16T11:18:45+0200, Tobias Burnus > wrote: > > While mkoffload ensures that there is always a -march=, nvptx's > > cc1 can also be run directly. > > > > In my case, I wanted to know which target-specific #define are > > available; hence, I did run: > >accel/nvptx-none/cc1 -E -dM < /dev/null which gave an ICE. After > > some debugging, the reasons was clear (missing -march=) but somehow a > > (fatal) error would have been nicer than an ICE + debugging. > > > > OK for mainline? > > Yes, thanks. I think I prefer this over hard-coding some default > 'ptx_isa_option' -- > but may be convinced otherwise (incremental change), if that's maybe more > convenient for others? (Roger?) > > > Grüße > Thomas > > > > nvptx: Use fatal_error when -march= is missing not an assert > > [PR111093] > > > > gcc/ChangeLog: > > > > PR target/111093 > > * config/nvptx/nvptx.cc (nvptx_option_override): Issue fatal error > > instead of an assert ICE when no -march= has been specified. > > > > diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc > > index edef39fb5e1..634c31673be 100644 > > --- a/gcc/config/nvptx/nvptx.cc > > +++ b/gcc/config/nvptx/nvptx.cc > > @@ -335,8 +335,9 @@ nvptx_option_override (void) > >init_machine_status = nvptx_init_machine_status; > > > >/* Via nvptx 'OPTION_DEFAULT_SPECS', '-misa' always appears on the > command > > - line. */ > > - gcc_checking_assert (OPTION_SET_P (ptx_isa_option)); > > + line; but handle the case that the compiler is not run via the > > + driver. 
*/ if (!OPTION_SET_P (ptx_isa_option)) > > +fatal_error (UNKNOWN_LOCATION, "%<-march=%> must be specified"); > > > >handle_ptx_version_option (); > > > - > Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 > München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas > Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht > München, HRB 106955 diff --git a/gcc/calls.cc b/gcc/calls.cc index 6dd6f73..8a18eae 100644 --- a/gcc/calls.cc +++ b/gcc/calls.cc @@ -4795,14 +4795,20 @@ emit_library_call_value_1 (int retval, rtx orgfun, rtx value, else { /* Convert to the proper mode if a promotion has been active. */ - if (GET_MODE (valreg) != outmode) + enum machine_mode valmode = GET_MODE (valreg); + if (valmode != outmode) { int unsignedp = TYPE_UNSIGNED (tfom); gc
[PATCH] Replace a HWI_COMPUTABLE_MODE_P with wide-int in simplify-rtx.cc.
This patch enhances one of the optimizations in simplify_binary_operation_1 to allow it to simplify RTL expressions in modes wider than HOST_WIDE_INT by replacing a use of HWI_COMPUTABLE_MODE_P and UINTVAL with wide_int. The motivating example is a pending x86_64 backend patch that produces the following RTL in combine: (and:TI (zero_extend:TI (reg:DI 89)) (const_wide_int 0x0ffffffffffffffff)) where the AND is redundant, as the mask, ~0LL, is DImode's MODE_MASK. There's already an optimization that catches this for narrower modes, transforming (and:HI (zero_extend:HI (reg:QI x)) (const_int 0xff)) into (zero_extend:HI (reg:QI x)), but this currently only handles CONST_INT not CONST_WIDE_INT. Fixed by upgrading this transformation to use wide_int, specifically rtx_mode_t and wi::mask. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-05-23 Roger Sayle gcc/ChangeLog * simplify-rtx.cc (simplify_binary_operation_1) : Use wide-int instead of HWI_COMPUTABLE_MODE_P and UINTVAL in transformation of (and (extend X) C) as (zero_extend (and X C)), to also optimize modes wider than HOST_WIDE_INT. Thanks in advance, Roger -- diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index d4aeebc..8dc880b 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -3826,15 +3826,16 @@ simplify_context::simplify_binary_operation_1 (rtx_code code, there are no nonzero bits of C outside of X's mode. */ if ((GET_CODE (op0) == SIGN_EXTEND || GET_CODE (op0) == ZERO_EXTEND) - && CONST_INT_P (trueop1) - && HWI_COMPUTABLE_MODE_P (mode) - && (~GET_MODE_MASK (GET_MODE (XEXP (op0, 0))) - & UINTVAL (trueop1)) == 0) + && CONST_SCALAR_INT_P (trueop1) + && is_a (mode, &int_mode) + && is_a (GET_MODE (XEXP (op0, 0)), &inner_mode) + && (wi::mask (GET_MODE_PRECISION (inner_mode), true, + GET_MODE_PRECISION (int_mode)) + & rtx_mode_t (trueop1, mode)) == 0) { machine_mode imode = GET_MODE (XEXP (op0, 0)); - tem = simplify_gen_binary (AND, imode, XEXP (op0, 0), -gen_int_mode (INTVAL (trueop1), - imode)); + tem = immed_wide_int_const (rtx_mode_t (trueop1, mode), imode); + tem = simplify_gen_binary (AND, imode, XEXP (op0, 0), tem); return simplify_gen_unary (ZERO_EXTEND, mode, tem, imode); }
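For intuition, here is a C-level analogue (illustrative only; the patch operates on the RTL shown above, not on any particular source) of why the TImode AND is redundant: every nonzero bit of the mask lies within the low 64 bits, which already hold the zero-extended value, so the AND changes nothing and only the zero_extend need survive.

```
typedef unsigned __int128 u128;

/* The AND below is a no-op: the mask covers exactly the bits the
   zero-extension can produce.  */
u128 zext_and (unsigned long long x)
{
  return (u128) x & 0xffffffffffffffffull;   /* == (u128) x */
}
```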
[PATCH] PR target/107172: Avoid "unusual" MODE_CC comparisons in simplify-rtx.cc
I believe that a better (or supplementary) fix to PR target/107172 is to avoid producing incorrect (but valid) RTL in simplify_const_relational_operation when presented with questionable (obviously invalid) expressions, such as those produced during combine. Just as with the "first do no harm" clause with the Hippocratic Oath, simplify-rtx (probably) shouldn't unintentionally transform invalid RTL expressions, into incorrect (non-equivalent) but valid RTL that may be inappropriately recognized by recog. In this specific case, many GCC backends represent their flags register via MODE_CC, whose representation is intentionally "opaque" to the middle-end. The only use of MODE_CC comprehensible to the middle-end's RTL optimizers is relational comparisons between the result of a COMPARE rtx (op0) and zero (op1). Any other uses of MODE_CC should be left alone, and some might argue indicate representational issues in the backend. In practice, CPUs occasionally have numerous instructions that affect the flags register(s) other than comparisons [AVR's setc, powerpc's mtcrf, x86's clc, stc and cmc and x86_64's ptest that sets C and Z flags in non-obvious ways, c.f. PR target/109973]. Currently care has to be taken, wrapping these in UNSPEC, to avoid combine inappropriately merging flags setters with flags consumers (such as conditional jumps). It's safer to teach simplify_const_relational_operation not to modify expressions that it doesn't understand/recognize. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-05-26 Roger Sayle gcc/ChangeLog * simplify-rtx.cc (simplify_const_relational_operation): Return early if we have a MODE_CC comparison that isn't a COMPARE against const0_rtx. Thanks in advance, Roger -- diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index d4aeebc..d6444b4 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -6120,6 +6120,12 @@ simplify_const_relational_operation (enum rtx_code code, || (GET_MODE (op0) == VOIDmode && GET_MODE (op1) == VOIDmode)); + /* We only handle MODE_CC comparisons that are COMPARE against zero. */ + if (GET_MODE_CLASS (mode) == MODE_CC + && (op1 != const0_rtx + || GET_CODE (op0) != COMPARE)) +return NULL_RTX; + /* If op0 is a compare, extract the comparison arguments from it. */ if (GET_CODE (op0) == COMPARE && op1 == const0_rtx) {
[PATCH] Refactor wi::bswap as a function (instead of a method).
This patch implements Richard Sandiford's suggestion from https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618215.html that wi::bswap (and a new wi::bitreverse) should be functions, and ideally only accessors are member functions. This patch implements the first step, moving/refactoring wi::bswap. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-05-28 Roger Sayle gcc/ChangeLog * fold-const-call.cc (fold_const_call_ss) : Update call to wi::bswap. * simplify-rtx.cc (simplify_const_unary_operation) : Update call to wi::bswap. * tree-ssa-ccp.cc (evaluate_stmt) : Update calls to wi::bswap. * wide-int.cc (wide_int_storage::bswap): Remove/rename to... (wi::bswap_large): New function, with revised API. * wide-int.h (wi::bswap): New (template) function prototype. (wide_int_storage::bswap): Remove method. (sext_large, zext_large): Consistent indentation/line wrapping. (bswap_large): Prototype helper function containing implementation. (wi::bswap): New template wrapper around bswap_large. Thanks, Roger -- diff --git a/gcc/fold-const-call.cc b/gcc/fold-const-call.cc index 340cb66..663eae2 100644 --- a/gcc/fold-const-call.cc +++ b/gcc/fold-const-call.cc @@ -1060,7 +1060,8 @@ fold_const_call_ss (wide_int *result, combined_fn fn, const wide_int_ref &arg, case CFN_BUILT_IN_BSWAP32: case CFN_BUILT_IN_BSWAP64: case CFN_BUILT_IN_BSWAP128: - *result = wide_int::from (arg, precision, TYPE_SIGN (arg_type)).bswap (); + *result = wi::bswap (wide_int::from (arg, precision, + TYPE_SIGN (arg_type))); return true; default: diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index d4aeebc..d93d632 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -2111,7 +2111,7 @@ simplify_const_unary_operation (enum rtx_code code, machine_mode mode, break; case BSWAP: - result = wide_int (op0).bswap (); + result = wi::bswap (op0); break; case TRUNCATE: diff --git a/gcc/tree-ssa-ccp.cc b/gcc/tree-ssa-ccp.cc index 6fb371c..26d5e44 100644 --- a/gcc/tree-ssa-ccp.cc +++ b/gcc/tree-ssa-ccp.cc @@ -2401,11 +2401,12 @@ evaluate_stmt (gimple *stmt) wide_int wval = wi::to_wide (val.value); val.value = wide_int_to_tree (type, - wide_int::from (wval, prec, - UNSIGNED).bswap ()); + wi::bswap (wide_int::from (wval, prec, + UNSIGNED))); val.mask - = widest_int::from (wide_int::from (val.mask, prec, - UNSIGNED).bswap (), + = widest_int::from (wi::bswap (wide_int::from (val.mask, + prec, + UNSIGNED)), UNSIGNED); if (wi::sext (val.mask, prec) != -1) break; diff --git a/gcc/wide-int.cc b/gcc/wide-int.cc index c0987aa..1e4c046 100644 --- a/gcc/wide-int.cc +++ b/gcc/wide-int.cc @@ -731,16 +731,13 @@ wi::set_bit_large (HOST_WIDE_INT *val, const HOST_WIDE_INT *xval, } } -/* bswap THIS. */ -wide_int -wide_int_storage::bswap () const +/* Byte swap the integer represented by XVAL and LEN into VAL. Return + the number of blocks in VAL. Both XVAL and VAL have PRECISION bits. */ +unsigned int +wi::bswap_large (HOST_WIDE_INT *val, const HOST_WIDE_INT *xval, +unsigned int len, unsigned int precision) { - wide_int result = wide_int::create (precision); unsigned int i, s; - unsigned int len = BLOCKS_NEEDED (precision); - unsigned int xlen = get_len (); - const HOST_WIDE_INT *xval = get_val (); - HOST_WIDE_INT *val = result.write_val (); /* This is not a well defined operation if the precision is not a multiple of 8. 
*/ @@ -758,7 +755,7 @@ wide_int_storage::bswap () const unsigned int block = s / HOST_BITS_PER_WIDE_INT; unsigned int offset = s & (HOST_BITS_PER_WIDE_INT - 1); - byte = (safe_uhwi (xval, xlen, block) >> offset) & 0xff; + byte = (safe_uhwi (xval, len, block) >> offset) & 0xff; block = d / HOST_BITS_PER_WIDE_INT; offset = d & (HOST_BITS_PER_WIDE_INT - 1); @@ -766,8 +763,7 @@ wide_int_storage::bswap () const val[block] |= byte << offset; } - result.set_len (canonize (val, len, precision)); - return result; + return canonize (val, len, precision); } /* Fill VAL
[x86_64 PATCH] PR target/109973: CCZmode and CCCmode variants of [v]ptest.
This is my proposed minimal fix for PR target/109973 (hopefully suitable for backporting) that follows Jakub Jelinek's suggestion that we introduce CCZmode and CCCmode variants of ptest and vptest, so that the i386 backend treats [v]ptest instructions similarly to testl instructions; using different CCmodes to indicate which condition flags are desired, and then relying on the RTL cmpelim pass to eliminate redundant tests. This conveniently matches Intel's intrinsics, that provide different functions for retrieving different flags, _mm_testz_si128 tests the Z flag, _mm_testc_si128 tests the carry flag. Currently we use the same instruction (pattern) for both, and unfortunately the *ptest_and optimization is only valid when the ptest/vptest instruction is used to set/test the Z flag. The downside, as predicted by Jakub, is that GCC's cmpelim pass is currently COMPARE-centric and not able to merge the ptests from expressions such as _mm256_testc_si256 (a, b) + _mm256_testz_si256 (a, b), which is a known issue, PR target/80040. I've some follow-up patches to improve things, but this first patch fixes the wrong-code regression, replacing it with a rare missed-optimization (hopefully suitable for GCC 13). The only change that was unanticipated was the tweak to ix86_match_ccmode. Oddly, CCZmode is allowable for CCmode, but CCCmode isn't. Given that CCZmode means just the Z flag, CCCmode means just the C flag, and CCmode means all the flags, I'm guessing this asymmetry is unintentional. Perhaps a super-safe fix is to explicitly test for CCZmode, CCCmode or CCmode in the *_ptest pattern's predicate, and not attempt to re-use ix86_match_ccmode? This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-05-29 Roger Sayle gcc/ChangeLog PR targt/109973 * config/i386/i386-builtin.def (__builtin_ia32_ptestz128): Use new CODE_for_sse4_1_ptestzv2di. (__builtin_ia32_ptestc128): Use new CODE_for_sse4_1_ptestcv2di. (__builtin_ia32_ptestz256): Use new CODE_for_avx_ptestzv4di. (__builtin_ia32_ptestc256): Use new CODE_for_avx_ptestcv4di. * config/i386/i386-expand.cc (ix86_expand_branch): Use CCZmode when expanding UNSPEC_PTEST to compare against zero. * config/i386/i386-features.cc (scalar_chain::convert_compare): Likewise generate CCZmode UNSPEC_PTESTs when converting comparisons. (general_scalar_chain::convert_insn): Use CCZmode for COMPARE result. (timode_scalar_chain::convert_insn): Use CCZmode for COMPARE result. * config/i386/i386.cc (ix86_match_ccmode): Allow the SET_SRC to be an UNSPEC, in addition to a COMPARE. Consider CCCmode to be a form of CCmode. * config/i386/sse.md (define_split): When splitting UNSPEC_MOVMSK to UNSPEC_PTEST, preserve the FLAG_REG mode as CCZ. (*_ptest): Add asterisk to hide define_insn. Remove ":CC" flags specification, and use ix86_match_ccmode instead. (_ptestz): New define_expand to specify CCZ. (_ptestc): New define_expand to specify CCC. (_ptest): A define_expand using CC to preserve the current behavior. (*ptest_and): Specify CCZ to only perform this optimization when only the Z flag is required. gcc/testsuite/ChangeLog PR targt/109973 * gcc.target/i386/pr109973-1.c: New test case. * gcc.target/i386/pr109973-2.c: Likewise. 
Thanks, Roger -- diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def index c91e380..383b68a 100644 --- a/gcc/config/i386/i386-builtin.def +++ b/gcc/config/i386/i386-builtin.def @@ -1004,8 +1004,8 @@ BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_roundps_sfix, "__builtin_ia32_ BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_roundv4sf2, "__builtin_ia32_roundps_az", IX86_BUILTIN_ROUNDPS_AZ, UNKNOWN, (int) V4SF_FTYPE_V4SF) BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_roundv4sf2_sfix, "__builtin_ia32_roundps_az_sfix", IX86_BUILTIN_ROUNDPS_AZ_SFIX, UNKNOWN, (int) V4SI_FTYPE_V4SF) -BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestv2di, "__builtin_ia32_ptestz128", IX86_BUILTIN_PTESTZ, EQ, (int) INT_FTYPE_V2DI_V2DI_PTEST) -BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestv2di, "__builtin_ia32_ptestc128", IX86_BUILTIN_PTESTC, LTU, (int) INT_FTYPE_V2DI_V2DI_PTEST) +BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestzv2di, "__builtin_ia32_ptestz128", IX86_BUILTIN_PTESTZ, EQ, (int) INT_FTYPE_V2DI_V2DI_PTEST) +BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestcv2di, "__builtin_ia32_ptestc128", IX86_BUILTIN_PTESTC, LTU, (int) INT_FTYPE_V2DI_V2DI_PTEST) BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestv2di, "__builtin_ia32_ptes
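For reference, a small example (assumed for illustration; it is not one of the new pr109973 tests) of the PR target/80040 style code mentioned above, where both intrinsics perform the same [v]ptest and GCC does not yet merge them into a single flags-setting instruction followed by two setcc's:

```
#include <immintrin.h>

/* Compile with -O2 -mavx.  _mm256_testz_si256 reads the Z flag and
   _mm256_testc_si256 reads the C flag of the same vptest.  */
int test_zc (__m256i a, __m256i b)
{
  return _mm256_testc_si256 (a, b) + _mm256_testz_si256 (a, b);
}
```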
[PATCH] New wi::bitreverse function.
This patch provides a wide-int implementation of bitreverse, that implements both of Richard Sandiford's suggestions from the review at https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618215.html of an improved API (as a stand-alone function matching the bswap refactoring), and an implementation that works with any bit-width precision. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap (and a make check-gcc). Ok for mainline? Are the remaining pieces of the above patch pre-approved (pending re-testing)? The aim is that this new code will be thoroughly tested by the new *-2.c test cases in https://gcc.gnu.org/git/?p=gcc.git;h=c09471fbc7588db2480f036aa56a2403d3c03ae 5 with a minor tweak to use the BITREVERSE rtx in the NVPTX back-end, followed by similar tests on other targets that provide bit-reverse built-ins (such as ARM and xstormy16), in advance of support for a backend-independent solution to PR middle-end/50481. 2023-06-02 Roger Sayle gcc/ChangeLog * wide-int.cc (wi::bitreverse_large): New function implementing bit reversal of an integer. * wide-int.h (wi::bitreverse): New (template) function prototype. (bitreverse_large): Prototype helper function/implementation. (wi::bitreverse): New template wrapper around bitreverse_large. Thanks again, Roger -- diff --git a/gcc/fold-const-call.cc b/gcc/fold-const-call.cc index 340cb66..663eae2 100644 --- a/gcc/fold-const-call.cc +++ b/gcc/fold-const-call.cc @@ -1060,7 +1060,8 @@ fold_const_call_ss (wide_int *result, combined_fn fn, const wide_int_ref &arg, case CFN_BUILT_IN_BSWAP32: case CFN_BUILT_IN_BSWAP64: case CFN_BUILT_IN_BSWAP128: - *result = wide_int::from (arg, precision, TYPE_SIGN (arg_type)).bswap (); + *result = wi::bswap (wide_int::from (arg, precision, + TYPE_SIGN (arg_type))); return true; default: diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index d4aeebc..d93d632 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -2111,7 +2111,7 @@ simplify_const_unary_operation (enum rtx_code code, machine_mode mode, break; case BSWAP: - result = wide_int (op0).bswap (); + result = wi::bswap (op0); break; case TRUNCATE: diff --git a/gcc/tree-ssa-ccp.cc b/gcc/tree-ssa-ccp.cc index 6fb371c..26d5e44 100644 --- a/gcc/tree-ssa-ccp.cc +++ b/gcc/tree-ssa-ccp.cc @@ -2401,11 +2401,12 @@ evaluate_stmt (gimple *stmt) wide_int wval = wi::to_wide (val.value); val.value = wide_int_to_tree (type, - wide_int::from (wval, prec, - UNSIGNED).bswap ()); + wi::bswap (wide_int::from (wval, prec, + UNSIGNED))); val.mask - = widest_int::from (wide_int::from (val.mask, prec, - UNSIGNED).bswap (), + = widest_int::from (wi::bswap (wide_int::from (val.mask, + prec, + UNSIGNED)), UNSIGNED); if (wi::sext (val.mask, prec) != -1) break; diff --git a/gcc/wide-int.cc b/gcc/wide-int.cc index c0987aa..1e4c046 100644 --- a/gcc/wide-int.cc +++ b/gcc/wide-int.cc @@ -731,16 +731,13 @@ wi::set_bit_large (HOST_WIDE_INT *val, const HOST_WIDE_INT *xval, } } -/* bswap THIS. */ -wide_int -wide_int_storage::bswap () const +/* Byte swap the integer represented by XVAL and LEN into VAL. Return + the number of blocks in VAL. Both XVAL and VAL have PRECISION bits. 
*/ +unsigned int +wi::bswap_large (HOST_WIDE_INT *val, const HOST_WIDE_INT *xval, +unsigned int len, unsigned int precision) { - wide_int result = wide_int::create (precision); unsigned int i, s; - unsigned int len = BLOCKS_NEEDED (precision); - unsigned int xlen = get_len (); - const HOST_WIDE_INT *xval = get_val (); - HOST_WIDE_INT *val = result.write_val (); /* This is not a well defined operation if the precision is not a multiple of 8. */ @@ -758,7 +755,7 @@ wide_int_storage::bswap () const unsigned int block = s / HOST_BITS_PER_WIDE_INT; unsigned int offset = s & (HOST_BITS_PER_WIDE_INT - 1); - byte = (safe_uhwi (xval, xlen, block) >> offset) & 0xff; + byte = (safe_uhwi (xval, len, block) >> offset) & 0xff; block = d / HOST_BITS_PER_WIDE_INT; offset = d & (HOST_BITS_PER_WIDE_INT - 1); @@ -766,8 +763,7 @@ wide_int_storage::bswap () const val[block] |= byte << offset; } - result.set_len (canonize (val, le
RE: [PATCH] New wi::bitreverse function.
Doh! Wrong patch... Roger -- -Original Message- From: Roger Sayle Sent: Friday, June 2, 2023 3:17 PM To: 'gcc-patches@gcc.gnu.org' Cc: 'Richard Sandiford' Subject: [PATCH] New wi::bitreverse function. This patch provides a wide-int implementation of bitreverse, that implements both of Richard Sandiford's suggestions from the review at https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618215.html of an improved API (as a stand-alone function matching the bswap refactoring), and an implementation that works with any bit-width precision. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap (and a make check-gcc). Ok for mainline? Are the remaining pieces of the above patch pre-approved (pending re-testing)? The aim is that this new code will be thoroughly tested by the new *-2.c test cases in https://gcc.gnu.org/git/?p=gcc.git;h=c09471fbc7588db2480f036aa56a2403d3c03ae 5 with a minor tweak to use the BITREVERSE rtx in the NVPTX back-end, followed by similar tests on other targets that provide bit-reverse built-ins (such as ARM and xstormy16), in advance of support for a backend-independent solution to PR middle-end/50481. 2023-06-02 Roger Sayle gcc/ChangeLog * wide-int.cc (wi::bitreverse_large): New function implementing bit reversal of an integer. * wide-int.h (wi::bitreverse): New (template) function prototype. (bitreverse_large): Prototype helper function/implementation. (wi::bitreverse): New template wrapper around bitreverse_large. Thanks again, Roger -- diff --git a/gcc/wide-int.cc b/gcc/wide-int.cc index 1e4c046..24bdce2 100644 --- a/gcc/wide-int.cc +++ b/gcc/wide-int.cc @@ -766,6 +766,33 @@ wi::bswap_large (HOST_WIDE_INT *val, const HOST_WIDE_INT *xval, return canonize (val, len, precision); } +/* Bitreverse the integer represented by XVAL and LEN into VAL. Return + the number of blocks in VAL. Both XVAL and VAL have PRECISION bits. */ +unsigned int +wi::bitreverse_large (HOST_WIDE_INT *val, const HOST_WIDE_INT *xval, + unsigned int len, unsigned int precision) +{ + unsigned int i, s; + + for (i = 0; i < len; i++) +val[i] = 0; + + for (s = 0; s < precision; s++) +{ + unsigned int block = s / HOST_BITS_PER_WIDE_INT; + unsigned int offset = s & (HOST_BITS_PER_WIDE_INT - 1); + if (((safe_uhwi (xval, len, block) >> offset) & 1) != 0) + { + unsigned int d = (precision - 1) - s; + block = d / HOST_BITS_PER_WIDE_INT; + offset = d & (HOST_BITS_PER_WIDE_INT - 1); + val[block] |= 1 << offset; + } +} + + return canonize (val, len, precision); +} + /* Fill VAL with a mask where the lower WIDTH bits are ones and the bits above that up to PREC are zeros. The result is inverted if NEGATE is true. Return the number of blocks in VAL. 
*/ diff --git a/gcc/wide-int.h b/gcc/wide-int.h index e4723ad..498d14d 100644 --- a/gcc/wide-int.h +++ b/gcc/wide-int.h @@ -553,6 +553,7 @@ namespace wi UNARY_FUNCTION zext (const T &, unsigned int); UNARY_FUNCTION set_bit (const T &, unsigned int); UNARY_FUNCTION bswap (const T &); + UNARY_FUNCTION bitreverse (const T &); BINARY_FUNCTION min (const T1 &, const T2 &, signop); BINARY_FUNCTION smin (const T1 &, const T2 &); @@ -1748,6 +1749,8 @@ namespace wi unsigned int, unsigned int, unsigned int); unsigned int bswap_large (HOST_WIDE_INT *, const HOST_WIDE_INT *, unsigned int, unsigned int); + unsigned int bitreverse_large (HOST_WIDE_INT *, const HOST_WIDE_INT *, +unsigned int, unsigned int); unsigned int lshift_large (HOST_WIDE_INT *, const HOST_WIDE_INT *, unsigned int, unsigned int, unsigned int); @@ -2281,6 +2284,18 @@ wi::bswap (const T &x) return result; } +/* Bitreverse the integer X. */ +template +inline WI_UNARY_RESULT (T) +wi::bitreverse (const T &x) +{ + WI_UNARY_RESULT_VAR (result, val, T, x); + unsigned int precision = get_precision (result); + WIDE_INT_REF_FOR (T) xi (x, precision); + result.set_len (bitreverse_large (val, xi.val, xi.len, precision)); + return result; +} + /* Return the mininum of X and Y, treating them both as having signedness SGN. */ template
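For clarity, the semantics wi::bitreverse implements, sketched as ordinary C for a fixed 32-bit width (a reference model, not GCC code): bit i of the input becomes bit precision-1-i of the result, mirroring the d = (precision - 1) - s indexing in bitreverse_large above.

```
unsigned int bitreverse32 (unsigned int x)
{
  unsigned int r = 0;
  for (int i = 0; i < 32; i++)
    if (x & (1u << i))
      r |= 1u << (31 - i);
  return r;
}
```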
[x86_64 PATCH] PR target/110083: Fix-up REG_EQUAL notes on COMPARE in STV.
This patch fixes PR target/110083, an ICE-on-valid regression exposed by my recent PTEST improvements (to address PR target/109973). The latent bug (admittedly mine) is that the scalar-to-vector (STV) pass doesn't update or delete REG_EQUAL notes attached to COMPARE instructions. As a result the operands of COMPARE would be mismatched, with the register transformed to V1TImode, but the immediate operand left as const_wide_int, which is valid for TImode but not V1TImode. This remained latent when the STV conversion converted the mode of the COMPARE to CCmode, with later passes recognizing the REG_EQUAL note is obviously invalid as the modes didn't match, but now that we (correctly) preserve the CCZmode on COMPARE, the mismatched operand modes trigger a sanity checking ICE downstream. Fixed by updating (or deleting) any REG_EQUAL notes in convert_compare. Before: (expr_list:REG_EQUAL (compare:CCZ (reg:V1TI 119 [ ivin.29_38 ]) (const_wide_int 0x8000)) After: (expr_list:REG_EQUAL (compare:CCZ (reg:V1TI 119 [ ivin.29_38 ]) (const_vector:V1TI [ (const_wide_int 0x8000) ])) This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-06-03 Roger Sayle gcc/ChangeLog PR target/110083 * config/i386/i386-features.cc (scalar_chain::convert_compare): Update or delete REG_EQUAL notes, converting CONST_INT and CONST_WIDE_INT immediate operands to a suitable CONST_VECTOR. gcc/testsuite/ChangeLog PR target/110083 * gcc.target/i386/pr110083.c: New test case. Roger -- diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc index 3417f6b..4a3b07a 100644 --- a/gcc/config/i386/i386-features.cc +++ b/gcc/config/i386/i386-features.cc @@ -980,6 +980,39 @@ rtx scalar_chain::convert_compare (rtx op1, rtx op2, rtx_insn *insn) { rtx src, tmp; + + /* Handle any REG_EQUAL notes. */ + tmp = find_reg_equal_equiv_note (insn); + if (tmp) +{ + if (GET_CODE (XEXP (tmp, 0)) == COMPARE + && GET_MODE (XEXP (tmp, 0)) == CCZmode + && REG_P (XEXP (XEXP (tmp, 0), 0))) + { + rtx *op = &XEXP (XEXP (tmp, 0), 1); + if (CONST_SCALAR_INT_P (*op)) + { + if (constm1_operand (*op, GET_MODE (*op))) + *op = CONSTM1_RTX (vmode); + else + { + unsigned n = GET_MODE_NUNITS (vmode); + rtx *v = XALLOCAVEC (rtx, n); + v[0] = *op; + for (unsigned i = 1; i < n; ++i) + v[i] = const0_rtx; + *op = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v)); + } + tmp = NULL_RTX; + } + else if (REG_P (*op)) + tmp = NULL_RTX; + } + + if (tmp) + remove_note (insn, tmp); +} + /* Comparison against anything other than zero, requires an XOR. 
*/ if (op2 != const0_rtx) { diff --git a/gcc/testsuite/gcc.target/i386/pr110083.c b/gcc/testsuite/gcc.target/i386/pr110083.c new file mode 100644 index 000..4b38ca8 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr110083.c @@ -0,0 +1,26 @@ +/* { dg-do compile { target int128 } } */ +/* { dg-options "-O2 -msse4 -mstv -mno-stackrealign" } */ +typedef int TItype __attribute__ ((mode (TI))); +typedef unsigned int UTItype __attribute__ ((mode (TI))); + +void foo (void) +{ + static volatile TItype ivin, ivout; + static volatile float fv1, fv2; + ivin = ((TItype) (UTItype) ~ (((UTItype) ~ (UTItype) 0) >> 1)); + fv1 = ((TItype) (UTItype) ~ (((UTItype) ~ (UTItype) 0) >> 1)); + fv2 = ivin; + ivout = fv2; + if (ivin != ((TItype) (UTItype) ~ (((UTItype) ~ (UTItype) 0) >> 1)) + || 128) > sizeof (TItype) * 8 - 1)) && ivout != ivin) + || 128) > sizeof (TItype) * 8 - 1)) + && ivout != + ((TItype) (UTItype) ~ (((UTItype) ~ (UTItype) 0) >> 1))) + || fv1 != + (float) ((TItype) (UTItype) ~ (((UTItype) ~ (UTItype) 0) >> 1)) + || fv2 != + (float) ((TItype) (UTItype) ~ (((UTItype) ~ (UTItype) 0) >> 1)) + || fv1 != fv2) +__builtin_abort (); +} +
[x86 PATCH] Add support for stc, clc and cmc instructions in i386.md
This patch is the latest revision of my patch to add support for the STC (set carry flag), CLC (clear carry flag) and CMC (complement carry flag) instructions to the i386 backend, incorporating Uros' previous feedback. The significant changes are (i) the inclusion of CMC, (ii) the use of UNSPEC for pattern, (iii) Use of a new X86_TUNE_SLOW_STC tuning flag to use alternate implementations on pentium4 (which has a notoriously slow STC) when not optimizing for size. An example of the use of the stc instruction is: unsigned int foo (unsigned int a, unsigned int b, unsigned int *c) { return __builtin_ia32_addcarryx_u32 (1, a, b, c); } which previously generated: movl$1, %eax addb$-1, %al adcl%esi, %edi setc%al movl%edi, (%rdx) movzbl %al, %eax ret with this patch now generates: stc adcl%esi, %edi setc%al movl%edi, (%rdx) movzbl %al, %eax ret An example of the use of the cmc instruction (where the carry from a first adc is inverted/complemented as input to a second adc) is: unsigned int bar (unsigned int a, unsigned int b, unsigned int c, unsigned int d) { unsigned int c1 = __builtin_ia32_addcarryx_u32 (1, a, b, &o1); return __builtin_ia32_addcarryx_u32 (c1 ^ 1, c, d, &o2); } which previously generated: movl$1, %eax addb$-1, %al adcl%esi, %edi setnc %al movl%edi, o1(%rip) addb$-1, %al adcl%ecx, %edx setc%al movl%edx, o2(%rip) movzbl %al, %eax ret and now generates: stc adcl%esi, %edi cmc movl%edi, o1(%rip) adcl%ecx, %edx setc%al movl%edx, o2(%rip) movzbl %al, %eax ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2022-06-03 Roger Sayle gcc/ChangeLog * config/i386/i386-expand.cc (ix86_expand_builtin) : Use new x86_stc or negqi_ccc_1 instructions to set the carry flag. * config/i386/i386.h (TARGET_SLOW_STC): New define. * config/i386/i386.md (UNSPEC_CLC): New UNSPEC for clc. (UNSPEC_STC): New UNSPEC for stc. (UNSPEC_CMC): New UNSPEC for cmc. (*x86_clc): New define_insn. (*x86_clc_xor): New define_insn for pentium4 without -Os. (x86_stc): New define_insn. (define_split): Convert x86_stc into alternate implementation on pentium4. (x86_cmc): New define_insn. (*x86_cmc_1): New define_insn_and_split to recognize cmc pattern. (*setcc_qi_negqi_ccc_1_): New define_insn_and_split to recognize (and eliminate) the carry flag being copied to itself. (*setcc_qi_negqi_ccc_2_): Likewise. (neg_ccc_1): Renamed from *neg_ccc_1 for gen function. * config/i386/x86-tune.def (X86_TUNE_SLOW_STC): New tuning flag. gcc/testsuite/ChangeLog * gcc.target/i386/cmc-1.c: New test case. * gcc.target/i386/stc-1.c: Likewise. Thanks, Roger -- diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index 5d21810..9e02fdd 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -13948,8 +13948,6 @@ rdseed_step: arg3 = CALL_EXPR_ARG (exp, 3); /* unsigned int *sum_out. */ op1 = expand_normal (arg0); - if (!integer_zerop (arg0)) - op1 = copy_to_mode_reg (QImode, convert_to_mode (QImode, op1, 1)); op2 = expand_normal (arg1); if (!register_operand (op2, mode0)) @@ -13967,7 +13965,7 @@ rdseed_step: } op0 = gen_reg_rtx (mode0); - if (integer_zerop (arg0)) + if (op1 == const0_rtx) { /* If arg0 is 0, optimize right away into add or sub instruction that sets CCCmode flags. */ @@ -13977,7 +13975,14 @@ rdseed_step: else { /* Generate CF from input operand. 
*/ - emit_insn (gen_addqi3_cconly_overflow (op1, constm1_rtx)); + if (!CONST_INT_P (op1)) + { + op1 = convert_to_mode (QImode, op1, 1); + op1 = copy_to_mode_reg (QImode, op1); + emit_insn (gen_negqi_ccc_1 (op1, op1)); + } + else + emit_insn (gen_x86_stc ()); /* Generate instruction that consumes CF. */ op1 = gen_rtx_REG (CCCmode, FLAGS_REG); diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h index c7439f8..5ac9c78 100644 --- a/gcc/config/i386/i386.h +++ b/gcc/config/i386/i386.h @@ -448,6 +448,7 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST]; ix86_tune_features[X86_TUNE_V2DF_REDUCTION_PREFER_HADDPD] #define TARGET_DEST_FALSE_DEP_FOR_GLC \ ix86_tune_features[X86_TUNE_DEST_FALSE_DEP_FOR_
RE: [x86 PATCH] Add support for stc, clc and cmc instructions in i386.md
Hi Uros, This revision implements your suggestions/refinements. (i) Avoid the UNSPEC_CMC by using the canonical RTL idiom for *x86_cmc, (ii) Use peephole2s to convert x86_stc and *x86_cmc into alternate forms on TARGET_SLOW_STC CPUs (pentium4), when a suitable QImode register is available, (iii) Prefer the addqi_cconly_overflow idiom (addb $-1,%al) over negqi_ccc_1 (neg %al) for setting the carry from a QImode value, (iv) Use andl %eax,%eax to clear carry flag without requiring (clobbering) an additional register, as an alternate output template for *x86_clc. These changes required two minor edits to i386.cc: ix86_cc_mode had to be tweaked to suggest CCCmode for the new *x86_cmc pattern, and *x86_cmc needed to be handled/parameterized in ix86_rtx_costs so that combine would appreciate that this complex RTL expression was actually a fast, single byte instruction [i.e. preferable]. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2022-06-06 Roger Sayle Uros Bizjak gcc/ChangeLog * config/i386/i386-expand.cc (ix86_expand_builtin) : Use new x86_stc instruction when the carry flag must be set. * config/i386/i386.cc (ix86_cc_mode): Use CCCmode for *x86_cmc. (ix86_rtx_costs): Provide accurate rtx_costs for *x86_cmc. * config/i386/i386.h (TARGET_SLOW_STC): New define. * config/i386/i386.md (UNSPEC_CLC): New UNSPEC for clc. (UNSPEC_STC): New UNSPEC for stc. (*x86_clc): New define_insn (with implementation for pentium4). (x86_stc): New define_insn. (define_peephole2): Convert x86_stc into alternate implementation on pentium4 without -Os when a QImode register is available. (*x86_cmc): New define_insn. (define_peephole2): Convert *x86_cmc into alternate implementation on pentium4 without -Os when a QImode register is available. (*setccc): New define_insn_and_split for a no-op CCCmode move. (*setcc_qi_negqi_ccc_1_): New define_insn_and_split to recognize (and eliminate) the carry flag being copied to itself. (*setcc_qi_negqi_ccc_2_): Likewise. * config/i386/x86-tune.def (X86_TUNE_SLOW_STC): New tuning flag. gcc/testsuite/ChangeLog * gcc.target/i386/cmc-1.c: New test case. * gcc.target/i386/stc-1.c: Likewise. Thanks, Roger. -- -Original Message- From: Uros Bizjak Sent: 04 June 2023 18:53 To: Roger Sayle Cc: gcc-patches@gcc.gnu.org Subject: Re: [x86 PATCH] Add support for stc, clc and cmc instructions in i386.md On Sun, Jun 4, 2023 at 12:45 AM Roger Sayle wrote: > > > This patch is the latest revision of my patch to add support for the > STC (set carry flag), CLC (clear carry flag) and CMC (complement carry > flag) instructions to the i386 backend, incorporating Uros' > previous feedback. The significant changes are (i) the inclusion of > CMC, (ii) the use of UNSPEC for pattern, (iii) Use of a new > X86_TUNE_SLOW_STC tuning flag to use alternate implementations on > pentium4 (which has a notoriously slow STC) when not optimizing for > size. 
> > An example of the use of the stc instruction is: > unsigned int foo (unsigned int a, unsigned int b, unsigned int *c) { > return __builtin_ia32_addcarryx_u32 (1, a, b, c); } > > which previously generated: > movl$1, %eax > addb$-1, %al > adcl%esi, %edi > setc%al > movl%edi, (%rdx) > movzbl %al, %eax > ret > > with this patch now generates: > stc > adcl%esi, %edi > setc%al > movl%edi, (%rdx) > movzbl %al, %eax > ret > > An example of the use of the cmc instruction (where the carry from a > first adc is inverted/complemented as input to a second adc) is: > unsigned int bar (unsigned int a, unsigned int b, > unsigned int c, unsigned int d) { > unsigned int c1 = __builtin_ia32_addcarryx_u32 (1, a, b, &o1); > return __builtin_ia32_addcarryx_u32 (c1 ^ 1, c, d, &o2); } > > which previously generated: > movl$1, %eax > addb$-1, %al > adcl%esi, %edi > setnc %al > movl%edi, o1(%rip) > addb$-1, %al > adcl%ecx, %edx > setc%al > movl%edx, o2(%rip) > movzbl %al, %eax > ret > > and now generates: > stc > adcl%esi, %edi > cmc > movl%edi, o1(%rip) > adcl%ecx, %edx > setc%al > movl%edx, o2(%rip) > movzbl %al, %eax > ret > > > This patch has been tested on x86_64-pc-linux-gnu wit
RE: [x86 PATCH] Add support for stc, clc and cmc instructions in i386.md
Hi Uros, Might you be willing to approve the patch without the *x86_clc pieces? These can be submitted later, when they are actually used. For now, we're arguing about the performance of a pattern that's not yet generated on an obsolete microarchitecture that's no longer in use, and this is holding up real improvements on current processors. cmc, for example, should allow for better cmov if-conversion. Thanks in advance. Roger -- -Original Message- From: Uros Bizjak Sent: 06 June 2023 18:34 To: Roger Sayle Cc: gcc-patches@gcc.gnu.org Subject: Re: [x86 PATCH] Add support for stc, clc and cmc instructions in i386.md On Tue, Jun 6, 2023 at 5:14 PM Roger Sayle wrote: > > > Hi Uros, > This revision implements your suggestions/refinements. (i) Avoid the > UNSPEC_CMC by using the canonical RTL idiom for *x86_cmc, (ii) Use > peephole2s to convert x86_stc and *x86_cmc into alternate forms on > TARGET_SLOW_STC CPUs (pentium4), when a suitable QImode register is > available, (iii) Prefer the addqi_cconly_overflow idiom (addb $-1,%al) > over negqi_ccc_1 (neg %al) for setting the carry from a QImode value, > (iv) Use andl %eax,%eax to clear carry flag without requiring > (clobbering) an additional register, as an alternate output template for > *x86_clc. Uh, I don't think (iv) is OK. "xor reg,reg" will break the dependency chain, while "and reg,reg" won't. So, you are hurting out-of-order execution by depending on an instruction that calculates previous result in reg. You can use peephole2 trick to allocate an unused reg here, but then using AND is no better than using XOR, and the latter is guaranteed to break dependency chains. Uros. > These changes required two minor edits to i386.cc: ix86_cc_mode had > to be tweaked to suggest CCCmode for the new *x86_cmc pattern, and > *x86_cmc needed to be handled/parameterized in ix86_rtx_costs so that > combine would appreciate that this complex RTL expression was actually > a fast, single byte instruction [i.e. preferable]. > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > and make -k check, both with and without --target_board=unix{-m32} > with no new failures. Ok for mainline? > > 2022-06-06 Roger Sayle > Uros Bizjak > > gcc/ChangeLog > * config/i386/i386-expand.cc (ix86_expand_builtin) : > Use new x86_stc instruction when the carry flag must be set. > * config/i386/i386.cc (ix86_cc_mode): Use CCCmode for *x86_cmc. > (ix86_rtx_costs): Provide accurate rtx_costs for *x86_cmc. > * config/i386/i386.h (TARGET_SLOW_STC): New define. > * config/i386/i386.md (UNSPEC_CLC): New UNSPEC for clc. > (UNSPEC_STC): New UNSPEC for stc. > (*x86_clc): New define_insn (with implementation for pentium4). > (x86_stc): New define_insn. > (define_peephole2): Convert x86_stc into alternate implementation > on pentium4 without -Os when a QImode register is available. > (*x86_cmc): New define_insn. > (define_peephole2): Convert *x86_cmc into alternate implementation > on pentium4 without -Os when a QImode register is available. > (*setccc): New define_insn_and_split for a no-op CCCmode move. > (*setcc_qi_negqi_ccc_1_): New define_insn_and_split to > recognize (and eliminate) the carry flag being copied to itself. > (*setcc_qi_negqi_ccc_2_): Likewise. > * config/i386/x86-tune.def (X86_TUNE_SLOW_STC): New tuning flag. > > gcc/testsuite/ChangeLog > * gcc.target/i386/cmc-1.c: New test case. > * gcc.target/i386/stc-1.c: Likewise. > > > Thanks, Roger.
> -- > > -Original Message- > From: Uros Bizjak > Sent: 04 June 2023 18:53 > To: Roger Sayle > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [x86 PATCH] Add support for stc, clc and cmc instructions > in i386.md > > On Sun, Jun 4, 2023 at 12:45 AM Roger Sayle > wrote: > > > > > > This patch is the latest revision of my patch to add support for the > > STC (set carry flag), CLC (clear carry flag) and CMC (complement > > carry > > flag) instructions to the i386 backend, incorporating Uros' > > previous feedback. The significant changes are (i) the inclusion of > > CMC, (ii) the use of UNSPEC for pattern, (iii) Use of a new > > X86_TUNE_SLOW_STC tuning flag to use alternate implementations on > > pentium4 (which has a notoriously slow STC) when not optimizing for > > size. > > > > An example of the use of the stc instruction is: > > unsigned int foo (unsigned int a, unsigned int b, unsigned int *c) { > > return __builtin_ia32_addcarryx_u32 (1, a, b, c); } >
[x86_64 PATCH] PR target/110104: Missing peephole2 for addcarry.
This patch resolves PR target/110104, a missed optimization on x86 around adc with memory operands. In i386.md, there's a peephole2 after the pattern for *add3_cc_overflow_1 that converts the sequence reg = add(reg,mem); mem = reg [where the reg is dead afterwards] into the equivalent mem = add(mem,reg). The equivalent peephole2 for adc is missing (after addcarry), and is added by this patch. For the example code provided in the bugzilla PR: Before: movq%rsi, %rax mulq%rdx addq%rax, (%rdi) movq%rdx, %rax adcq8(%rdi), %rax adcq$0, 16(%rdi) movq%rax, 8(%rdi) ret After: movq%rsi, %rax mulq%rdx addq%rax, (%rdi) adcq%rdx, 8(%rdi) adcq$0, 16(%rdi) ret Note that the addq in this example has been transformed by the existing peephole2 described above. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-06-07 Roger Sayle gcc/ChangeLog PR target/110104 * config/i386/i386.md (define_peephole2): Transform reg=adc(reg,mem) followed by mem=reg into mem=adc(mem,reg) when applicable. gcc/testsuite/ChangeLog PR target/110104 * gcc.target/i386/pr110104.c: New test case. Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index e6ebc46..33ec45f 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -7870,6 +7870,51 @@ (set_attr "pent_pair" "pu") (set_attr "mode" "")]) +;; peephole2 for addcarry matching one for *add3_cc_overflow_1. +;; reg = adc(reg,mem); mem = reg -> mem = adc(mem,reg). +(define_peephole2 + [(parallel +[(set (reg:CCC FLAGS_REG) + (compare:CCC + (zero_extend: + (plus:SWI48 + (plus:SWI48 + (match_operator:SWI48 3 "ix86_carry_flag_operator" + [(match_operand 2 "flags_reg_operand") (const_int 0)]) + (match_operand:SWI48 0 "general_reg_operand")) + (match_operand:SWI48 1 "memory_operand"))) + (plus: + (zero_extend: (match_dup 1)) + (match_operator: 4 "ix86_carry_flag_operator" + [(match_dup 2) (const_int 0)] + (set (match_dup 0) + (plus:SWI48 (plus:SWI48 (match_op_dup 3 + [(match_dup 2) (const_int 0)]) + (match_dup 0)) + (match_dup 1)))]) + (set (match_dup 1) (match_dup 0))] + "(TARGET_READ_MODIFY_WRITE || optimize_insn_for_size_p ()) + && peep2_reg_dead_p (2, operands[0]) + && !reg_overlap_mentioned_p (operands[0], operands[1])" + [(parallel +[(set (reg:CCC FLAGS_REG) + (compare:CCC + (zero_extend: + (plus:SWI48 + (plus:SWI48 + (match_op_dup 3 [(match_dup 2) (const_int 0)]) + (match_dup 1)) + (match_dup 0))) + (plus: + (zero_extend: (match_dup 0)) + (match_op_dup 4 + [(match_dup 2) (const_int 0)] + (set (match_dup 1) + (plus:SWI48 (plus:SWI48 (match_op_dup 3 + [(match_dup 2) (const_int 0)]) + (match_dup 1)) + (match_dup 0)))])]) + (define_expand "addcarry_0" [(parallel [(set (reg:CCC FLAGS_REG) diff --git a/gcc/testsuite/gcc.target/i386/pr110104.c b/gcc/testsuite/gcc.target/i386/pr110104.c new file mode 100644 index 000..bd814f3 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr110104.c @@ -0,0 +1,16 @@ +/* { dg-do compile { target int128 } } */ +/* { dg-options "-O2" } */ + +typedef unsigned long long u64; +typedef unsigned __int128 u128; +void testcase1(u64 *acc, u64 a, u64 b) +{ + u128 res = (u128)a*b; + u64 lo = res, hi = res >> 64; + unsigned char cf = 0; + cf = __builtin_ia32_addcarryx_u64 (cf, lo, acc[0], acc+0); + cf = __builtin_ia32_addcarryx_u64 (cf, hi, acc[1], acc+1); + cf = __builtin_ia32_addcarryx_u64 (cf, 0, acc[2], acc+2); +} + +/* { dg-final { scan-assembler-times "movq" 1 } } */
[x86 PATCH] PR target/31985: Improve memory operand use with doubleword add.
This patch addresses the last remaining issue with PR target/31985, that GCC could make better use of memory addressing modes when implementing double word addition. This is achieved by adding a define_insn_and_split that combines an *add3_doubleword with a *concat3, so that the components of the concat can be used directly, without first being loaded into a double word register. For test_c in the bugzilla PR: Before: pushl %ebx subl$16, %esp movl28(%esp), %eax movl36(%esp), %ecx movl32(%esp), %ebx movl24(%esp), %edx addl%ecx, %eax adcl%ebx, %edx movl%eax, 8(%esp) movl%edx, 12(%esp) addl$16, %esp popl%ebx ret After: test_c: subl$20, %esp movl36(%esp), %eax movl32(%esp), %edx addl28(%esp), %eax adcl24(%esp), %edx movl%eax, 8(%esp) movl%edx, 12(%esp) addl$20, %esp ret If this approach is considered acceptable, similar splitters can be used for other doubleword operations. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-06-07 Roger Sayle gcc/ChangeLog PR target/31985 * config/i386/i386.md (*add3_doubleword_concat): New define_insn_and_split combine *add3_doubleword with a *concat3 for more efficient lowering after reload. gcc/testsuite/ChangeLog PR target/31985 * gcc.target/i386/pr31985.c: New test case. Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index e6ebc46..3592249 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -6124,6 +6124,36 @@ (clobber (reg:CC FLAGS_REG))])] "split_double_mode (mode, &operands[0], 2, &operands[0], &operands[3]);") +(define_insn_and_split "*add3_doubleword_concat" + [(set (match_operand: 0 "register_operand" "=r") + (plus: + (any_or_plus: + (ashift: + (zero_extend: + (match_operand:DWIH 2 "nonimmediate_operand" "rm")) + (match_operand: 3 "const_int_operand")) + (zero_extend: + (match_operand:DWIH 4 "nonimmediate_operand" "rm"))) + (match_operand: 1 "register_operand" "0"))) + (clobber (reg:CC FLAGS_REG))] + "INTVAL (operands[3]) == * BITS_PER_UNIT" + "#" + "&& reload_completed" + [(parallel [(set (reg:CCC FLAGS_REG) + (compare:CCC +(plus:DWIH (match_dup 1) (match_dup 4)) +(match_dup 1))) + (set (match_dup 0) + (plus:DWIH (match_dup 1) (match_dup 4)))]) + (parallel [(set (match_dup 5) + (plus:DWIH +(plus:DWIH + (ltu:DWIH (reg:CC FLAGS_REG) (const_int 0)) + (match_dup 6)) +(match_dup 2))) + (clobber (reg:CC FLAGS_REG))])] + "split_double_mode (mode, &operands[0], 2, &operands[0], &operands[5]);") + (define_insn "*add_1" [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r") (plus:SWI48 diff --git a/gcc/testsuite/gcc.target/i386/pr31985.c b/gcc/testsuite/gcc.target/i386/pr31985.c new file mode 100644 index 000..a6de1b5 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr31985.c @@ -0,0 +1,14 @@ +/* { dg-do compile { target ia32 } } */ +/* { dg-options "-O2" } */ + +void test_c (unsigned int a, unsigned int b, unsigned int c, unsigned int d) +{ + volatile unsigned int x, y; + unsigned long long __a = b | ((unsigned long long)a << 32); + unsigned long long __b = d | ((unsigned long long)c << 32); + unsigned long long __c = __a + __b; + x = (unsigned int)(__c & 0x); + y = (unsigned int)(__c >> 32); +} + +/* { dg-final { scan-assembler-times "movl" 4 } } */
RE: [x86_64 PATCH] PR target/110104: Missing peephole2 for addcarry.
Hi Jakub, Jakub Jelinek wrote: > Seems to be pretty much the same as one of the 12 define_peephole2 patterns I've posted in https://gcc.gnu.org/pipermail/gcc-patches/2023-June/620821.html Doh! Impressive work. I need to study how you handle constant carry flags. Fingers-crossed that patches that touch both the middle-end and a backend don't get delayed too long in the review/approval process. > The testcase will be useful though (but I'd go with including the intrin header and using the intrinsic rather than builtin). I find the use of intrin headers a pain when running cc1 under gdb, requiring additional paths to be specified with -I etc. Perhaps there's a trick that I'm missing? __builtins are more free-standing, and therefore work with cross-compilers to targets/development environments that I don't have. I withdraw my patch. Please feel free to assign PR 110104 to yourself in Bugzilla. Cheers (and thanks), Roger
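For reference, a minimal sketch of what the intrinsic-based variant of the pr110104 testcase could look like, assuming <x86intrin.h> and _addcarry_u64 (the test that eventually gets committed may of course differ):

/* Hypothetical intrinsic-based variant of gcc.target/i386/pr110104.c;
   _addcarry_u64 from <x86intrin.h> should expand to the same addcarry
   patterns as the __builtin_ia32_addcarryx_u64 builtin.  */
#include <x86intrin.h>

typedef unsigned long long u64;
typedef unsigned __int128 u128;

void testcase1 (u64 *acc, u64 a, u64 b)
{
  u128 res = (u128) a * b;
  u64 lo = res, hi = res >> 64;
  unsigned char cf = 0;
  cf = _addcarry_u64 (cf, lo, acc[0], &acc[0]);
  cf = _addcarry_u64 (cf, hi, acc[1], &acc[1]);
  cf = _addcarry_u64 (cf, 0, acc[2], &acc[2]);
}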
[Committed] Bug fix to new wi::bitreverse_large function.
Richard Sandiford was, of course, right to be wary of new code without much test coverage. Converting the nvptx backend to use the BITREVERSE rtx infrastructure has resulted in far more exhaustive testing and revealed a subtle bug in the new wi::bitreverse implementation. The code needs to use HOST_WIDE_INT_1U (instead of 1) to avoid unintended sign extension. This patch has been tested on nvptx-none hosted on x86_64-pc-linux-gnu (with a minor tweak to use BITREVERSE), where it fixes regressions of the 32-bit test vectors in gcc.target/nvptx/brev-2.c and the 64-bit test vectors in gcc.target/nvptx/brevll-2.c. Committed as obvious. 2023-06-07 Roger Sayle gcc/ChangeLog * wide-int.cc (wi::bitreverse_large): Use HOST_WIDE_INT_1U to avoid sign extension/undefined behaviour when setting each bit. Thanks, Roger -- diff --git a/gcc/wide-int.cc b/gcc/wide-int.cc index 24bdce2..ab92ee6 100644 --- a/gcc/wide-int.cc +++ b/gcc/wide-int.cc @@ -786,7 +786,7 @@ wi::bitreverse_large (HOST_WIDE_INT *val, const HOST_WIDE_INT *xval, unsigned int d = (precision - 1) - s; block = d / HOST_BITS_PER_WIDE_INT; offset = d & (HOST_BITS_PER_WIDE_INT - 1); - val[block] |= 1 << offset; + val[block] |= HOST_WIDE_INT_1U << offset; } }
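As an aside, a small standalone illustration (not part of the patch) of the sign-extension hazard that HOST_WIDE_INT_1U avoids:

/* Illustrative only: with a 32-bit int, shifting the literal 1 into the
   top bit sign-extends (strictly, overflows) when the result is widened
   to a 64-bit value, setting all of the upper bits; an unsigned 64-bit
   one shifts cleanly.  */
#include <stdio.h>

int main (void)
{
  int offset = 31;
  unsigned long long bad  = (unsigned long long) (1 << offset);
  unsigned long long good = 1ULL << offset;
  printf ("bad  = %016llx\n", bad);   /* typically ffffffff80000000 */
  printf ("good = %016llx\n", good);  /* 0000000080000000 */
  return 0;
}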
[nvptx PATCH] Update nvptx's bitrev2 pattern to use BITREVERSE rtx.
This minor tweak to the nvptx backend switches the representation of the brev instruction from an UNSPEC to instead use the new BITREVERSE rtx. This allows various RTL optimizations including evaluation (constant folding) of integer constant arguments at compile-time. This patch has been tested on nvptx-none with make and make -k check with no new failures. Ok for mainline? 2023-06-07 Roger Sayle gcc/ChangeLog * config/nvptx/nvptx.md (UNSPEC_BITREV): Delete. (bitrev2): Represent using bitreverse. Thanks in advance, Roger -- diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md index 1bb9304..7a7c994 100644 --- a/gcc/config/nvptx/nvptx.md +++ b/gcc/config/nvptx/nvptx.md @@ -34,8 +34,6 @@ UNSPEC_FPINT_CEIL UNSPEC_FPINT_NEARBYINT - UNSPEC_BITREV - UNSPEC_ALLOCA UNSPEC_SET_SOFTSTACK @@ -636,8 +634,7 @@ (define_insn "bitrev2" [(set (match_operand:SDIM 0 "nvptx_register_operand" "=R") - (unspec:SDIM [(match_operand:SDIM 1 "nvptx_register_operand" "R")] -UNSPEC_BITREV))] + (bitreverse:SDIM (match_operand:SDIM 1 "nvptx_register_operand" "R")))] "" "%.\\tbrev.b%T0\\t%0, %1;")
[GCC 13 PATCH] PR target/109973: CCZmode and CCCmode variants of [v]ptest.
This is a backport of the fixes for PR target/109973 and PR target/110083. This backport to the releases/gcc-13 branch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for gcc-13, or should we just close PR 109973 in Bugzilla? 2023-06-10 Roger Sayle Uros Bizjak gcc/ChangeLog PR target/109973 PR target/110083 * config/i386/i386-builtin.def (__builtin_ia32_ptestz128): Use new CODE_for_sse4_1_ptestzv2di. (__builtin_ia32_ptestc128): Use new CODE_for_sse4_1_ptestcv2di. (__builtin_ia32_ptestz256): Use new CODE_for_avx_ptestzv4di. (__builtin_ia32_ptestc256): Use new CODE_for_avx_ptestcv4di. * config/i386/i386-expand.cc (ix86_expand_branch): Use CCZmode when expanding UNSPEC_PTEST to compare against zero. * config/i386/i386-features.cc (scalar_chain::convert_compare): Likewise generate CCZmode UNSPEC_PTESTs when converting comparisons. Update or delete REG_EQUAL notes, converting CONST_INT and CONST_WIDE_INT immediate operands to a suitable CONST_VECTOR. (general_scalar_chain::convert_insn): Use CCZmode for COMPARE result. (timode_scalar_chain::convert_insn): Use CCZmode for COMPARE result. * config/i386/i386-protos.h (ix86_match_ptest_ccmode): Prototype. * config/i386/i386.cc (ix86_match_ptest_ccmode): New predicate to check for suitable matching modes for the UNSPEC_PTEST pattern. * config/i386/sse.md (define_split): When splitting UNSPEC_MOVMSK to UNSPEC_PTEST, preserve the FLAG_REG mode as CCZ. (*_ptest): Add asterisk to hide define_insn. Remove ":CC" mode of FLAGS_REG, instead use ix86_match_ptest_ccmode. (_ptestz): New define_expand to specify CCZ. (_ptestc): New define_expand to specify CCC. (_ptest): A define_expand using CC to preserve the current behavior. (*ptest_and): Specify CCZ to only perform this optimization when only the Z flag is required. gcc/testsuite/ChangeLog PR target/109973 PR target/110083 * gcc.target/i386/pr109973-1.c: New test case. * gcc.target/i386/pr109973-2.c: Likewise. * gcc.target/i386/pr110083.c: Likewise. 
Thanks, Roger -- diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def index 6dae697..37df018 100644 --- a/gcc/config/i386/i386-builtin.def +++ b/gcc/config/i386/i386-builtin.def @@ -1004,8 +1004,8 @@ BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_roundps_sfix, "__builtin_ia32_ BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_roundv4sf2, "__builtin_ia32_roundps_az", IX86_BUILTIN_ROUNDPS_AZ, UNKNOWN, (int) V4SF_FTYPE_V4SF) BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_roundv4sf2_sfix, "__builtin_ia32_roundps_az_sfix", IX86_BUILTIN_ROUNDPS_AZ_SFIX, UNKNOWN, (int) V4SI_FTYPE_V4SF) -BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestv2di, "__builtin_ia32_ptestz128", IX86_BUILTIN_PTESTZ, EQ, (int) INT_FTYPE_V2DI_V2DI_PTEST) -BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestv2di, "__builtin_ia32_ptestc128", IX86_BUILTIN_PTESTC, LTU, (int) INT_FTYPE_V2DI_V2DI_PTEST) +BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestzv2di, "__builtin_ia32_ptestz128", IX86_BUILTIN_PTESTZ, EQ, (int) INT_FTYPE_V2DI_V2DI_PTEST) +BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestcv2di, "__builtin_ia32_ptestc128", IX86_BUILTIN_PTESTC, LTU, (int) INT_FTYPE_V2DI_V2DI_PTEST) BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestv2di, "__builtin_ia32_ptestnzc128", IX86_BUILTIN_PTESTNZC, GTU, (int) INT_FTYPE_V2DI_V2DI_PTEST) /* SSE4.2 */ @@ -1164,8 +1164,8 @@ BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_vtestpd256, "__builtin_ia32_vtestnzc BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_vtestps256, "__builtin_ia32_vtestzps256", IX86_BUILTIN_VTESTZPS256, EQ, (int) INT_FTYPE_V8SF_V8SF_PTEST) BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_vtestps256, "__builtin_ia32_vtestcps256", IX86_BUILTIN_VTESTCPS256, LTU, (int) INT_FTYPE_V8SF_V8SF_PTEST) BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_vtestps256, "__builtin_ia32_vtestnzcps256", IX86_BUILTIN_VTESTNZCPS256, GTU, (int) INT_FTYPE_V8SF_V8SF_PTEST) -BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_ptestv4di, "__builtin_ia32_ptestz256", IX86_BUILTIN_PTESTZ256, EQ, (int) INT_FTYPE_V4DI_V4DI_PTEST) -BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_ptestv4di, "__builtin_ia32_ptestc256", IX86_BUILTIN_PTESTC256, LTU, (int) INT_FTYPE_V4DI_V4DI_PTEST) +BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_ptestzv4di, "__builtin_ia32_ptestz256", IX86_BUILTIN_PTESTZ256, EQ, (int) INT_FTYPE_V4DI_V4DI_PTEST) +BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_ptestcv4di, "__builtin_ia32_ptes
[PATCH] Avoid duplicate vector initializations during RTL expansion.
This middle-end patch avoids some redundant RTL for vector initialization during RTL expansion. For the simple test case: typedef __int128 v1ti __attribute__ ((__vector_size__ (16))); __int128 key; v1ti foo() { return (v1ti){key}; } the middle-end currently expands: (set (reg:V1TI 85) (const_vector:V1TI [ (const_int 0) ])) (set (reg:V1TI 85) (mem/c:V1TI (symbol_ref:DI ("key" where we create a dead instruction that initializes the vector to zero, immediately followed by a set of the entire vector. This patch skips this zeroing instruction when the vector has only a single element. It also updates the code to indicate when we've cleared the vector, so that we don't need to initialize zero elements. Interestingly, this code is very similar to my patch from April 2006: https://gcc.gnu.org/pipermail/gcc-patches/2006-April/192861.html This patch has been tested on x86_64-pc-linux-gnu with a make bootstrap and make -k check, both with and without --target_board=unix{-m32}, with no new failures. Ok for mainline? 2023-06-11 Roger Sayle gcc/ChangeLog * expr.cc (store_constructor) : Don't bother clearing vectors with only a single element. Set CLEARED if the vector was initialized to zero. Thanks, Roger -- diff --git a/gcc/expr.cc b/gcc/expr.cc index 868fa6e..62cd8fa 100644 --- a/gcc/expr.cc +++ b/gcc/expr.cc @@ -7531,8 +7531,11 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size, } /* Inform later passes that the old value is dead. */ - if (!cleared && !vector && REG_P (target)) - emit_move_insn (target, CONST0_RTX (mode)); + if (!cleared && !vector && REG_P (target) && maybe_gt (n_elts, 1u)) + { + emit_move_insn (target, CONST0_RTX (mode)); + cleared = 1; + } if (MEM_P (target)) alias = MEM_ALIAS_SET (target);
[PATCH] New finish_compare_by_pieces target hook (for x86).
The following simple test case, from PR 104610, shows that memcmp () == 0 can result in some bizarre code sequences on x86. int foo(char *a) { static const char t[] = "0123456789012345678901234567890"; return __builtin_memcmp(a, &t[0], sizeof(t)) == 0; } with -O2 currently contains both: xorl%eax, %eax xorl$1, %eax and also movl$1, %eax xorl$1, %eax Changing the return type of foo to _Bool results in the equally bizarre: xorl%eax, %eax testl %eax, %eax sete%al and also movl$1, %eax testl %eax, %eax sete%al All these sequences set the result to a constant, but this optimization opportunity only occurs very late during compilation, by basic block duplication in the 322r.bbro pass, too late for CSE or peephole2 to do anything about it. The problem is that the idiom expanded by compare_by_pieces for __builtin_memcmp_eq contains basic blocks that can't easily be optimized by if-conversion due to the multiple incoming edges on the fail block. In summary, compare_by_pieces generates code that looks like: if (x[0] != y[0]) goto fail_label; if (x[1] != y[1]) goto fail_label; ... if (x[n] != y[n]) goto fail_label; result = 1; goto end_label; fail_label: result = 0; end_label: In theory, the RTL if-conversion pass could be enhanced to tackle arbitrarily complex if-then-else graphs, but the solution proposed here is to allow suitable targets to perform if-conversion during compare_by_pieces. The x86, for example, can take advantage that all of the above comparisons set and test the zero flag (ZF), which can then be used in combination with sete. Hence compare_by_pieces could instead generate: if (x[0] != y[0]) goto fail_label; if (x[1] != y[1]) goto fail_label; ... if (x[n] != y[n]) goto fail_label; fail_label: sete result which requires one less basic block, and the redundant conditional branch to a label immediately after is cleaned up by GCC's existing RTL optimizations. For the test case above, where -O2 -msse4 previously generated: foo:movdqu (%rdi), %xmm0 pxor.LC0(%rip), %xmm0 ptest %xmm0, %xmm0 je .L5 .L2:movl$1, %eax xorl$1, %eax ret .L5:movdqu 16(%rdi), %xmm0 pxor.LC1(%rip), %xmm0 ptest %xmm0, %xmm0 jne .L2 xorl%eax, %eax xorl$1, %eax ret we now generate: foo:movdqu (%rdi), %xmm0 pxor.LC0(%rip), %xmm0 ptest %xmm0, %xmm0 jne .L2 movdqu 16(%rdi), %xmm0 pxor.LC1(%rip), %xmm0 ptest %xmm0, %xmm0 .L2:sete%al movzbl %al, %eax ret Using a target hook allows the large amount of intelligence already in compare_by_pieces to be re-used by the i386 backend, but this can also help other backends with condition flags where the equality result can be materialized. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-06-12 Roger Sayle gcc/ChangeLog * config/i386/i386.cc (ix86_finish_compare_by_pieces): New function to provide a backend specific implementation. (TARGET_FINISH_COMPARE_BY_PIECES): Use the above function. * doc/tm.texi.in (TARGET_FINISH_COMPARE_BY_PIECES): New @hook. * doc/tm.texi: Regenerate. * expr.cc (compare_by_pieces): Call finish_compare_by_pieces in targetm to finalize the RTL expansion. Move the current implementation to a default target hook. * target.def (finish_compare_by_pieces): New target hook to allow compare_by_pieces to be customized by the target. * targhooks.cc (default_finish_compare_by_pieces): Default implementation moved here from expr.cc's compare_by_pieces. * targhooks.h (default_finish_compare_by_pieces): Prototype. 
gcc/testsuite/ChangeLog * gcc.target/i386/pieces-memcmp-1.c: New test case. Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index 3a1444d..509c0ee 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -16146,6 +16146,20 @@ ix86_fp_compare_code_to_integer (enum rtx_code code) } } +/* Override compare_by_pieces' default implementation using the state + of the CCZmode FLAGS_REG and sete instruction. TARGET is the integral + mode result, and FAIL_LABEL is the branch target of mismatched + comparisons. */ + +void +ix86_finish_compare_by_pieces (rtx target, rtx_code_label *fail_label) +{ + rtx tmp = gen_reg_rtx (QImode); + emit_label (fail_label); + ix86_expand_setcc (tmp, NE, gen_rtx_REG (CCZmode, FLAGS_REG), const0_rtx); + convert_move (target, tmp,
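For readers following along, a rough sketch of what the default hook factored out of compare_by_pieces could look like, using the idiom described above (function and label names taken from the ChangeLog; the exact code in targhooks.cc may differ):

/* Sketch only: materialize 1 on the fall-through (all pieces equal)
   path, jump over the fail label, and materialize 0 on mismatch.  */
void
default_finish_compare_by_pieces (rtx target, rtx_code_label *fail_label)
{
  rtx_code_label *end_label = gen_label_rtx ();
  emit_move_insn (target, const1_rtx);   /* all pieces compared equal */
  emit_jump (end_label);
  emit_label (fail_label);               /* some piece differed */
  emit_move_insn (target, const0_rtx);
  emit_label (end_label);
}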
[x86 PATCH] Convert ptestz of pandn into ptestc.
This patch is the next instalment in a set of backend patches around improvements to ptest/vptest. A previous patch optimized the sequence t=pand(x,y); ptestz(t,t) into the equivalent ptestz(x,y), using the property that ZF is set to (X&Y) == 0. This patch performs a similar transformation, converting t=pandn(x,y); ptestz(t,t) into the (almost) equivalent ptestc(y,x), using the property that the CF flags is set to (~X&Y) == 0. The tricky bit is that this sets the CF flag instead of the ZF flag, so we can only perform this transformation when we can also convert the flags' consumer, as well as the producer. For the test case: int foo (__m128i x, __m128i y) { __m128i a = x & ~y; return __builtin_ia32_ptestz128 (a, a); } With -O2 -msse4.1 we previously generated: foo:pandn %xmm0, %xmm1 xorl%eax, %eax ptest %xmm1, %xmm1 sete%al ret with this patch we now generate: foo:xorl%eax, %eax ptest %xmm0, %xmm1 setc%al ret At the same time, this patch also provides alternative fixes for PR target/109973 and PR target/110118, by recognizing that ptestc(x,x) always sets the carry flag (X&~X is always zero). This is achieved both by recognizing the special case in ix86_expand_sse_ptest and with a splitter to convert an eligible ptest into an stc. The next piece is, of course, STV of "if (x & ~y)..." This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-06-13 Roger Sayle gcc/ChangeLog * config/i386/i386-expand.cc (ix86_expand_sse_ptest): Recognize expansion of ptestc with equal operands as returning const1_rtx. * config/i386/i386.cc (ix86_rtx_costs): Provide accurate cost estimates of UNSPEC_PTEST, where the ptest performs the PAND or PAND of its operands. * config/i386/sse.md (define_split): Transform CCCmode UNSPEC_PTEST of reg_equal_p operands into an x86_stc instruction. (define_split): Split pandn/ptestz/setne into ptestc/setnc. (define_split): Split pandn/ptestz/sete into ptestc/setc. (define_split): Split pandn/ptestz/je into ptestc/jc. (define_split): Split pandn/ptestz/jne into ptestc/jnc. gcc/testsuite/ChangeLog * gcc.target/i386/avx-vptest-4.c: New test case. * gcc.target/i386/avx-vptest-5.c: Likewise. * gcc.target/i386/avx-vptest-6.c: Likewise. * gcc.target/i386/pr109973-1.c: Update test case. * gcc.target/i386/pr109973-2.c: Likewise. * gcc.target/i386/sse4_1-ptest-4.c: New test case. * gcc.target/i386/sse4_1-ptest-5.c: Likewise. * gcc.target/i386/sse4_1-ptest-6.c: Likewise. Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index def060a..1d11af2 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -10222,6 +10222,13 @@ ix86_expand_sse_ptest (const struct builtin_description *d, tree exp, machine_mode mode1 = insn_data[d->icode].operand[1].mode; enum rtx_code comparison = d->comparison; + /* ptest reg, reg sets the carry flag. 
*/ + if (comparison == LTU + && (d->code == IX86_BUILTIN_PTESTC + || d->code == IX86_BUILTIN_PTESTC256) + && rtx_equal_p (op0, op1)) +return const1_rtx; + if (VECTOR_MODE_P (mode0)) op0 = safe_vector_operand (op0, mode0); if (VECTOR_MODE_P (mode1)) diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index 3a1444d..3e99e23 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -21423,16 +21423,23 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno, else if (XINT (x, 1) == UNSPEC_PTEST) { *total = cost->sse_op; - if (XVECLEN (x, 0) == 2 - && GET_CODE (XVECEXP (x, 0, 0)) == AND) + rtx test_op0 = XVECEXP (x, 0, 0); + if (!rtx_equal_p (test_op0, XVECEXP (x, 0, 1))) + return false; + if (GET_CODE (test_op0) == AND) { - rtx andop = XVECEXP (x, 0, 0); - *total += rtx_cost (XEXP (andop, 0), GET_MODE (andop), - AND, opno, speed) - + rtx_cost (XEXP (andop, 1), GET_MODE (andop), - AND, opno, speed); - return true; + rtx and_op0 = XEXP (test_op0, 0); + if (GET_CODE (and_op0) == NOT) + and_op0 = XEXP (and_op0, 0); + *total += rtx_cost (and_op0, GET_MODE (and_op0), + AND, 0, speed) + + rtx_cost (XEXP (test_op0, 1), GET_MODE (and_op0), + AND, 1, speed); } + else + *total = rtx_cost (test
RE: [x86 PATCH] PR target/31985: Improve memory operand use with doubleword add.
Hi Uros, > On the 7th June 2023, Uros Bizkak wrote: > The register allocator considers the instruction-to-be-split as one > instruction, so it > can allocate output register to match an input register (or a register that > forms an > input address), So, you have to either add an early clobber to the output, or > somehow prevent output to clobber registers in the second pattern. This implements your suggestion of adding an early clobber to the output, a one character ('&') change from the previous version of this patch. Retested with make bootstrap and make -k check, with and without -m32, to confirm there are no issues, and this still fixes the pr31985.c test case. As you've suggested, I'm also working on improving STV in this area. Ok for mainline? 2023-06-15 Roger Sayle Uros Bizjak gcc/ChangeLog PR target/31985 * config/i386/i386.md (*add3_doubleword_concat): New define_insn_and_split combine *add3_doubleword with a *concat3 for more efficient lowering after reload. gcc/testsuite/ChangeLog PR target/31985 * gcc.target/i386/pr31985.c: New test case. Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index e6ebc46..42c302d 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -6124,6 +6124,36 @@ (clobber (reg:CC FLAGS_REG))])] "split_double_mode (mode, &operands[0], 2, &operands[0], &operands[3]);") +(define_insn_and_split "*add3_doubleword_concat" + [(set (match_operand: 0 "register_operand" "=&r") + (plus: + (any_or_plus: + (ashift: + (zero_extend: + (match_operand:DWIH 2 "nonimmediate_operand" "rm")) + (match_operand: 3 "const_int_operand")) + (zero_extend: + (match_operand:DWIH 4 "nonimmediate_operand" "rm"))) + (match_operand: 1 "register_operand" "0"))) + (clobber (reg:CC FLAGS_REG))] + "INTVAL (operands[3]) == * BITS_PER_UNIT" + "#" + "&& reload_completed" + [(parallel [(set (reg:CCC FLAGS_REG) + (compare:CCC +(plus:DWIH (match_dup 1) (match_dup 4)) +(match_dup 1))) + (set (match_dup 0) + (plus:DWIH (match_dup 1) (match_dup 4)))]) + (parallel [(set (match_dup 5) + (plus:DWIH +(plus:DWIH + (ltu:DWIH (reg:CC FLAGS_REG) (const_int 0)) + (match_dup 6)) +(match_dup 2))) + (clobber (reg:CC FLAGS_REG))])] + "split_double_mode (mode, &operands[0], 2, &operands[0], &operands[5]);") + (define_insn "*add_1" [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r") (plus:SWI48 diff --git a/gcc/testsuite/gcc.target/i386/pr31985.c b/gcc/testsuite/gcc.target/i386/pr31985.c new file mode 100644 index 000..a6de1b5 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr31985.c @@ -0,0 +1,14 @@ +/* { dg-do compile { target ia32 } } */ +/* { dg-options "-O2" } */ + +void test_c (unsigned int a, unsigned int b, unsigned int c, unsigned int d) +{ + volatile unsigned int x, y; + unsigned long long __a = b | ((unsigned long long)a << 32); + unsigned long long __b = d | ((unsigned long long)c << 32); + unsigned long long __c = __a + __b; + x = (unsigned int)(__c & 0x); + y = (unsigned int)(__c >> 32); +} + +/* { dg-final { scan-assembler-times "movl" 4 } } */
RE: [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality.
> From: Hongtao Liu > Sent: 12 July 2023 01:45 > > On Wed, Jul 12, 2023 at 4:57 AM Roger Sayle > > > From: Hongtao Liu > > > Sent: 28 June 2023 04:23 > > > > From: Roger Sayle > > > > Sent: 27 June 2023 20:28 > > > > > > > > I've also come up with an alternate/complementary/supplementary > > > > fix of generating the PTEST during RTL expansion, rather than rely > > > > on this being caught/optimized later during STV. > > > > > > > > You may notice in this patch, the tests for TARGET_SSE4_1 and > > > > TImode appear last. When I was writing this, I initially also > > > > added support for AVX VPTEST and OImode, before realizing that x86 > > > > doesn't (yet) support 256-bit OImode (which also explains why we > > > > don't have an OImode to V1OImode scalar-to-vector pass). > > > > Retaining this clause ordering should minimize the lines changed if > > > > things > change in future. > > > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make > > > > bootstrap and make -k check, both with and without > > > > --target_board=unix{-m32} with no new failures. Ok for mainline? > > > > > > > > > > > > 2023-06-27 Roger Sayle > > > > > > > > gcc/ChangeLog > > > > * config/i386/i386-expand.cc (ix86_expand_int_compare): If > > > > testing a TImode SUBREG of a 128-bit vector register against > > > > zero, use a PTEST instruction instead of first moving it to > > > > scalar registers. > > > > > > > > > > + /* Attempt to use PTEST, if available, when testing vector modes for > > > + equality/inequality against zero. */ if (op1 == const0_rtx > > > + && SUBREG_P (op0) > > > + && cmpmode == CCZmode > > > + && SUBREG_BYTE (op0) == 0 > > > + && REG_P (SUBREG_REG (op0)) > > > Just register_operand (op0, TImode), > > > > I completely agree that in most circumstances, the early RTL > > optimizers should use standard predicates, such as register_operand, > > that don't distinguish between REG and SUBREG, allowing the choice > > (assignment) to be left to register allocation (reload). > > > > However in this case, unusually, the presence of the SUBREG, and > > treating it differently from a REG is critical (in fact the reason for > > the patch). x86_64 can very efficiently test whether a 128-bit value > > is zero, setting ZF, either in TImode, using orq %rax,%rdx in a single > > cycle/single instruction, or in V1TImode, using ptest %xmm0,%xmm0, in a > > single > cycle/single instruction. > > There's no reason to prefer one form over the other. A SUBREG, > > however, that moves the value from the scalar registers to a vector > > register, or from a vector register to scalar registers, requires two or > > three > instructions, often reading > > and writing values via memory, at a huge performance penalty. Hence the > > goal is to eliminate the (VIEW_CONVERT) SUBREG, and choose the > > appropriate single-cycle test instruction for where the data is > > located. Hence we want to leave REG_P alone, but optimize (only) the > SUBREG_P cases. > > register_operand doesn't help with this. > > > > Note this is counter to the usual advice. Normally, a SUBREG between > > scalar registers is cheap (in fact free) on x86, hence it is safe for > > predicates to ignore them prior to register allocation. But another > > use of SUBREG, to represent a VIEW_CONVERT_EXPR/transfer between > > processing units is closer to a conversion, and a very expensive one > > (going via memory with different size reads vs writes) at that. 
> > > > > > > + && VECTOR_MODE_P (GET_MODE (SUBREG_REG (op0))) > > > + && TARGET_SSE4_1 > > > + && GET_MODE (op0) == TImode > > > + && GET_MODE_SIZE (GET_MODE (SUBREG_REG (op0))) == 16) > > > +{ > > > + tmp = SUBREG_REG (op0); > > > and tmp = lowpart_subreg (V1TImode, force_reg (TImode, op0));? > > > I think RA can handle SUBREG correctly, no need for extra predicates. > > > > Likewise, your "tmp = lowpart_subreg (V1TImode, force_reg (TImode, ...))" > > is forcing there to always be an inter-unit transfer/pipeline stall, > > when this is idiom that we're trying to eliminate. > > >
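To make the discussion concrete, a tiny illustrative example (not one of the testcases in the patch) of the kind of source where this choice arises:

/* Illustrative only: testing a 128-bit value for zero.  If x lives in
   scalar registers, a single orq %rax,%rdx sets ZF; if it has just been
   produced in an SSE register, ptest %xmm0,%xmm0 does the same job.
   The expensive case to avoid is a SUBREG that forces the value across
   register files via memory before the test.  */
int is_zero (unsigned __int128 x)
{
  return x == 0;
}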
[x86_64 PATCH] Improved insv of DImode/DFmode {high, low}parts into TImode.
This is the next piece towards a fix for (the x86_64 ABI issues affecting) PR 88873. This patch generalizes the recent tweak to ix86_expand_move for setting the highpart of a TImode reg from a DImode source using *insvti_highpart_1, to handle both DImode and DFmode sources, and also use the recently added *insvti_lowpart_1 for setting the lowpart. Although this is another intermediate step (not yet a fix), towards enabling *insvti and *concat* patterns to be candidates for TImode STV (by using V2DI/V2DF instructions), it already improves things a little. For the test case from PR 88873 typedef struct { double x, y; } s_t; typedef double v2df __attribute__ ((vector_size (2 * sizeof(double; s_t foo (s_t a, s_t b, s_t c) { return (s_t) { fma(a.x, b.x, c.x), fma (a.y, b.y, c.y) }; } With -O2 -march=cascadelake, GCC currently generates: Before (29 instructions): vmovq %xmm2, -56(%rsp) movq-56(%rsp), %rdx vmovq %xmm4, -40(%rsp) movq$0, -48(%rsp) movq%rdx, -56(%rsp) movq-40(%rsp), %rdx vmovq %xmm0, -24(%rsp) movq%rdx, -40(%rsp) movq-24(%rsp), %rsi movq-56(%rsp), %rax movq$0, -32(%rsp) vmovq %xmm3, -48(%rsp) movq-48(%rsp), %rcx vmovq %xmm5, -32(%rsp) vmovq %rax, %xmm6 movq-40(%rsp), %rax movq$0, -16(%rsp) movq%rsi, -24(%rsp) movq-32(%rsp), %rsi vpinsrq $1, %rcx, %xmm6, %xmm6 vmovq %rax, %xmm7 vmovq %xmm1, -16(%rsp) vmovapd %xmm6, %xmm3 vpinsrq $1, %rsi, %xmm7, %xmm7 vfmadd132pd -24(%rsp), %xmm7, %xmm3 vmovapd %xmm3, -56(%rsp) vmovsd -48(%rsp), %xmm1 vmovsd -56(%rsp), %xmm0 ret After (20 instructions): vmovq %xmm2, -56(%rsp) movq-56(%rsp), %rax vmovq %xmm3, -48(%rsp) vmovq %xmm4, -40(%rsp) movq-48(%rsp), %rcx vmovq %xmm5, -32(%rsp) vmovq %rax, %xmm6 movq-40(%rsp), %rax movq-32(%rsp), %rsi vpinsrq $1, %rcx, %xmm6, %xmm6 vmovq %xmm0, -24(%rsp) vmovq %rax, %xmm7 vmovq %xmm1, -16(%rsp) vmovapd %xmm6, %xmm2 vpinsrq $1, %rsi, %xmm7, %xmm7 vfmadd132pd -24(%rsp), %xmm7, %xmm2 vmovapd %xmm2, -56(%rsp) vmovsd -48(%rsp), %xmm1 vmovsd -56(%rsp), %xmm0 ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. No testcase yet, as the above code will hopefully change dramatically with the next pieces. Ok for mainline? 2023-07-13 Roger Sayle gcc/ChangeLog * config/i386/i386-expand.cc (ix86_expand_move): Generalize special case inserting of 64-bit values into a TImode register, to handle both DImode and DFmode using either *insvti_lowpart_1 or *isnvti_highpart_1. Thanks again, Roger -- diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index 92ffa4b..fe87f8e 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -542,22 +542,39 @@ ix86_expand_move (machine_mode mode, rtx operands[]) } } - /* Use *insvti_highpart_1 to set highpart of TImode register. */ + /* Special case inserting 64-bit values into a TImode register. */ if (TARGET_64BIT - && mode == DImode + && (mode == DImode || mode == DFmode) && SUBREG_P (op0) - && SUBREG_BYTE (op0) == 8 && GET_MODE (SUBREG_REG (op0)) == TImode && REG_P (SUBREG_REG (op0)) && REG_P (op1)) { - wide_int mask = wi::mask (64, false, 128); - rtx tmp = immed_wide_int_const (mask, TImode); - op0 = SUBREG_REG (op0); - tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp); - op1 = gen_rtx_ZERO_EXTEND (TImode, op1); - op1 = gen_rtx_ASHIFT (TImode, op1, GEN_INT (64)); - op1 = gen_rtx_IOR (TImode, tmp, op1); + /* Use *insvti_lowpart_1 to set lowpart. 
*/ + if (SUBREG_BYTE (op0) == 0) + { + wide_int mask = wi::mask (64, true, 128); + rtx tmp = immed_wide_int_const (mask, TImode); + op0 = SUBREG_REG (op0); + tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp); + if (mode == DFmode) + op1 = force_reg (DImode, gen_lowpart (DImode, op1)); + op1 = gen_rtx_ZERO_EXTEND (TImode, op1); + op1 = gen_rtx_IOR (TImode, tmp, op1); + } + /* Use *insvti_highpart_1 to set highpart. */ + else if (SUBREG_BYTE (op0) == 8) + { + wide_int mask = wi::mask (64, false, 128); + rtx tmp = immed_wide_int_const (mask, TImode); + op0 = SUBREG_REG (op0); + tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp); + if (mode == DFmode) +
[x86 PATCH] PR target/110588: Add *bt_setncqi_2 to generate btl
This patch resolves PR target/110588 to catch another case in combine where the i386 backend should be generating a btl instruction. This adds another define_insn_and_split to recognize the RTL representation for this case. I also noticed that two related define_insn_and_split weren't using the preferred string style for single statement preparation-statements, so I've reformatted these to be consistent in style with the new one. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-07-13 Roger Sayle gcc/ChangeLog PR target/110588 * config/i386/i386.md (*bt_setcqi): Prefer string form preparation statement over braces for a single statement. (*bt_setncqi): Likewise. (*bt_setncqi_2): New define_insn_and_split. gcc/testsuite/ChangeLog PR target/110588 * gcc.target/i386/pr110588.c: New test case. Thanks again, Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index e47ced1..04eca049 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -16170,9 +16170,7 @@ (const_int 0))) (set (match_dup 0) (eq:QI (reg:CCC FLAGS_REG) (const_int 0)))] -{ - operands[2] = lowpart_subreg (SImode, operands[2], QImode); -}) + "operands[2] = lowpart_subreg (SImode, operands[2], QImode);") ;; Help combine recognize bt followed by setnc (define_insn_and_split "*bt_setncqi" @@ -16193,9 +16191,7 @@ (const_int 0))) (set (match_dup 0) (ne:QI (reg:CCC FLAGS_REG) (const_int 0)))] -{ - operands[2] = lowpart_subreg (SImode, operands[2], QImode); -}) + "operands[2] = lowpart_subreg (SImode, operands[2], QImode);") (define_insn_and_split "*bt_setnc" [(set (match_operand:SWI48 0 "register_operand") @@ -16219,6 +16215,27 @@ operands[2] = lowpart_subreg (SImode, operands[2], QImode); operands[3] = gen_reg_rtx (QImode); }) + +;; Help combine recognize bt followed by setnc (PR target/110588) +(define_insn_and_split "*bt_setncqi_2" + [(set (match_operand:QI 0 "register_operand") + (eq:QI + (zero_extract:SWI48 + (match_operand:SWI48 1 "register_operand") + (const_int 1) + (zero_extend:SI (match_operand:QI 2 "register_operand"))) + (const_int 0))) + (clobber (reg:CC FLAGS_REG))] + "TARGET_USE_BT && ix86_pre_reload_split ()" + "#" + "&& 1" + [(set (reg:CCC FLAGS_REG) +(compare:CCC + (zero_extract:SWI48 (match_dup 1) (const_int 1) (match_dup 2)) + (const_int 0))) + (set (match_dup 0) +(ne:QI (reg:CCC FLAGS_REG) (const_int 0)))] + "operands[2] = lowpart_subreg (SImode, operands[2], QImode);") ;; Store-flag instructions. diff --git a/gcc/testsuite/gcc.target/i386/pr110588.c b/gcc/testsuite/gcc.target/i386/pr110588.c new file mode 100644 index 000..4505c87 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr110588.c @@ -0,0 +1,18 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mtune=core2" } */ + +unsigned char foo (unsigned char x, int y) +{ + int _1 = (int) x; + int _2 = _1 >> y; + int _3 = _2 & 1; + unsigned char _8 = (unsigned char) _3; + unsigned char _6 = _8 ^ 1; + return _6; +} + +/* { dg-final { scan-assembler "btl" } } */ +/* { dg-final { scan-assembler "setnc" } } */ +/* { dg-final { scan-assembler-not "sarl" } } */ +/* { dg-final { scan-assembler-not "andl" } } */ +/* { dg-final { scan-assembler-not "xorl" } } */
RE: [x86 PATCH] PR target/110588: Add *bt_setncqi_2 to generate btl
> From: Uros Bizjak > Sent: 13 July 2023 19:21 > > On Thu, Jul 13, 2023 at 7:10 PM Roger Sayle > wrote: > > > > This patch resolves PR target/110588 to catch another case in combine > > where the i386 backend should be generating a btl instruction. This > > adds another define_insn_and_split to recognize the RTL representation > > for this case. > > > > I also noticed that two related define_insn_and_split weren't using > > the preferred string style for single statement > > preparation-statements, so I've reformatted these to be consistent in style > > with > the new one. > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline? > > > > > > 2023-07-13 Roger Sayle > > > > gcc/ChangeLog > > PR target/110588 > > * config/i386/i386.md (*bt_setcqi): Prefer string form > > preparation statement over braces for a single statement. > > (*bt_setncqi): Likewise. > > (*bt_setncqi_2): New define_insn_and_split. > > > > gcc/testsuite/ChangeLog > > PR target/110588 > > * gcc.target/i386/pr110588.c: New test case. > > +;; Help combine recognize bt followed by setnc (PR target/110588) > +(define_insn_and_split "*bt_setncqi_2" > + [(set (match_operand:QI 0 "register_operand") (eq:QI > + (zero_extract:SWI48 > +(match_operand:SWI48 1 "register_operand") > +(const_int 1) > +(zero_extend:SI (match_operand:QI 2 "register_operand"))) > + (const_int 0))) > + (clobber (reg:CC FLAGS_REG))] > + "TARGET_USE_BT && ix86_pre_reload_split ()" > + "#" > + "&& 1" > + [(set (reg:CCC FLAGS_REG) > +(compare:CCC > + (zero_extract:SWI48 (match_dup 1) (const_int 1) (match_dup 2)) > + (const_int 0))) > + (set (match_dup 0) > +(ne:QI (reg:CCC FLAGS_REG) (const_int 0)))] > + "operands[2] = lowpart_subreg (SImode, operands[2], QImode);") > > I don't think the above transformation is 100% correct, mainly due to the use > of > paradoxical subreg. > > The combined instruction is operating with a zero_extended QImode register, so > all bits of the register are well defined. You are splitting using > paradoxical subreg, > so you don't know what garbage is there in the highpart of the count register. > However, BTL/BTQ uses modulo 64 (or 32) of this register, so even with a > slightly > invalid RTX, everything checks out. > > + "operands[2] = lowpart_subreg (SImode, operands[2], QImode);") > > You probably need mode instead of SImode here. The define_insn for *bt is: (define_insn "*bt" [(set (reg:CCC FLAGS_REG) (compare:CCC (zero_extract:SWI48 (match_operand:SWI48 0 "nonimmediate_operand" "r,m") (const_int 1) (match_operand:SI 1 "nonmemory_operand" "r,")) (const_int 0)))] So isn't appropriate here. But now you've made me think about it, it's inconsistent that all of the shifts and rotates in i386.md standardize on QImode for shift counts, but the bit test instructions use SImode? I think this explains where the paradoxical SUBREGs come from, and in theory any_extend from QImode to SImode here could/should be handled/unnecessary. Is it worth investigating a follow-up patch to convert all ZERO_EXTRACTs and SIGN_EXTRACTs in i386.md to use QImode (instead of SImode)? Thanks in advance, Roger --
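As a small aside on the modulo behaviour mentioned above, a sketch (not from the patch) of the kind of source that maps onto bt with a register count:

/* Illustrative only: btq/btl use the count modulo the operand width
   (64 or 32 bits), so any garbage above the low bits of the count
   register, e.g. in the high part of a paradoxical SUBREG of a QImode
   count, cannot change the result.  */
int bit_set (unsigned long long x, unsigned char pos)
{
  return (x >> (pos & 63)) & 1;   /* candidate for btq %rsi, %rdi */
}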
[PATCH] Fix bootstrap failure (with g++ 4.8.5) in tree-if-conv.cc.
This patch fixes the bootstrap failure I'm seeing using gcc 4.8.5 as the host compiler. Ok for mainline? [I might be missing something] 2023-07-14 Roger Sayle gcc/ChangeLog * tree-if-conv.cc (predicate_scalar_phi): Make the arguments to the std::sort comparison lambda function const. Cheers, Roger -- diff --git a/gcc/tree-if-conv.cc b/gcc/tree-if-conv.cc index 91e2eff..799f071 100644 --- a/gcc/tree-if-conv.cc +++ b/gcc/tree-if-conv.cc @@ -2204,7 +2204,8 @@ predicate_scalar_phi (gphi *phi, gimple_stmt_iterator *gsi) } /* Sort elements based on rankings ARGS. */ - std::sort(argsKV.begin(), argsKV.end(), [](ArgEntry &left, ArgEntry &right) { + std::sort(argsKV.begin(), argsKV.end(), [](const ArgEntry &left, +const ArgEntry &right) { return left.second < right.second; });
RE: [x86 PATCH] Fix FAIL of gcc.target/i386/pr91681-1.c
> From: Jiang, Haochen > Sent: 17 July 2023 02:50 > > > From: Jiang, Haochen > > Sent: Friday, July 14, 2023 10:50 AM > > > > > The recent change in TImode parameter passing on x86_64 results in > > > the FAIL of pr91681-1.c. The issue is that with the extra > > > flexibility, the combine pass is now spoilt for choice between using > > > either the *add3_doubleword_concat or the > > > *add3_doubleword_zext patterns, when one operand is a *concat and > the other is a zero_extend. > > > The solution proposed below is provide an > > > *add3_doubleword_concat_zext define_insn_and_split, that can > > > benefit both from the register allocation of *concat, and still > > > avoid the xor normally required by zero extension. > > > > > > I'm investigating a follow-up refinement to improve register > > > allocation further by avoiding the early clobber in the =&r, and > > > handling (custom) reloads explicitly, but this piece resolves the > > > testcase > > failure. > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make > > > bootstrap and make -k check, both with and without > > > --target_board=unix{-m32} with no new failures. Ok for mainline? > > > > > > > > > 2023-07-11 Roger Sayle > > > > > > gcc/ChangeLog > > > PR target/91681 > > > * config/i386/i386.md (*add3_doubleword_concat_zext): New > > > define_insn_and_split derived from *add3_doubleword_concat > > > and *add3_doubleword_zext. > > > > Hi Roger, > > > > This commit currently changed the codegen of testcase p443644-2.c from: > > Oops, a typo, I mean pr43644-2.c. > > Haochen I'm working on a fix and hope to have this resolved soon (unfortunately fixing things in a post-reload splitter isn't working out due to reload's choices, so the solution will likely be a peephole2). The problem is that pr91681-1.c and pr43644-2.c can't both PASS (as written)! The operation x = y + 0, can be generated as either "mov y,x; add $0,x" or as "xor x,x; add y,x". pr91681-1.c checks there isn't an xor, pr43644-2.c checks there isn't a mov. Doh! As the author of both these test cases, I've painted myself into a corner. The solution is that add $0,x should be generated (optimal) when y is already in x, and "xor x,x; add y,x" used otherwise (as this is shorter than "mov y,x; add $0,x", both sequences being approximately equal performance-wise). > > movq%rdx, %rax > > xorl%edx, %edx > > addq%rdi, %rax > > adcq%rsi, %rdx > > to: > > movq%rdx, %rcx > > movq%rdi, %rax > > movq%rsi, %rdx > > addq%rcx, %rax > > adcq$0, %rdx > > > > which causes the testcase fail under -m64. > > Is this within your expectation? You're right that the original (using xor) is better for pr43644-2.c's test case. unsigned __int128 foo(unsigned __int128 x, unsigned long long y) { return x+y; } but the closely related (swapping the argument order): unsigned __int128 bar(unsigned long long y, unsigned __int128 x) { return x+y; } is better using "adcq $0", than having a superfluous xor. Executive summary: This FAIL isn't serious. I'll silence it soon. > > BRs, > > Haochen > > > > > > > > > > > Thanks, > > > Roger > > > --
[x86_64 PATCH] More TImode parameter passing improvements.
This patch is the next piece of a solution to the x86_64 ABI issues in PR 88873. This splits the *concat3_3 define_insn_and_split into two patterns, a TARGET_64BIT *concatditi3_3 and a !TARGET_64BIT *concatsidi3_3. This allows us to add an additional alternative to the the 64-bit version, enabling the register allocator to perform this operation using SSE registers, which is implemented/split after reload using vec_concatv2di. To demonstrate the improvement, the test case from PR88873: typedef struct { double x, y; } s_t; s_t foo (s_t a, s_t b, s_t c) { return (s_t){ __builtin_fma(a.x, b.x, c.x), __builtin_fma (a.y, b.y, c.y) }; } when compiled with -O2 -march=cascadelake, currently generates: foo:vmovq %xmm2, -56(%rsp) movq-56(%rsp), %rax vmovq %xmm3, -48(%rsp) vmovq %xmm4, -40(%rsp) movq-48(%rsp), %rcx vmovq %xmm5, -32(%rsp) vmovq %rax, %xmm6 movq-40(%rsp), %rax movq-32(%rsp), %rsi vpinsrq $1, %rcx, %xmm6, %xmm6 vmovq %xmm0, -24(%rsp) vmovq %rax, %xmm7 vmovq %xmm1, -16(%rsp) vmovapd %xmm6, %xmm2 vpinsrq $1, %rsi, %xmm7, %xmm7 vfmadd132pd -24(%rsp), %xmm7, %xmm2 vmovapd %xmm2, -56(%rsp) vmovsd -48(%rsp), %xmm1 vmovsd -56(%rsp), %xmm0 ret with this change, we avoid many of the reloads via memory, foo:vpunpcklqdq %xmm3, %xmm2, %xmm7 vpunpcklqdq %xmm1, %xmm0, %xmm6 vpunpcklqdq %xmm5, %xmm4, %xmm2 vmovdqa %xmm7, -24(%rsp) vmovdqa %xmm6, %xmm1 movq-16(%rsp), %rax vpinsrq $1, %rax, %xmm7, %xmm4 vmovapd %xmm4, %xmm6 vfmadd132pd %xmm1, %xmm2, %xmm6 vmovapd %xmm6, -24(%rsp) vmovsd -16(%rsp), %xmm1 vmovsd -24(%rsp), %xmm0 ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-07-19 Roger Sayle gcc/ChangeLog * config/i386/i386-expand.cc (ix86_expand_move): Don't call force_reg, to use SUBREG rather than create a new pseudo when inserting DFmode fields into TImode with insvti_{high,low}part. (*concat3_3): Split into two define_insn_and_split... (*concatditi3_3): 64-bit implementation. Provide alternative that allows register allocation to use SSE registers that is split into vec_concatv2di after reload. (*concatsidi3_3): 32-bit implementation. gcc/testsuite/ChangeLog * gcc.target/i386/pr88873.c: New test case. 
Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index f9b0dc6..9c3febe 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -558,7 +558,7 @@ ix86_expand_move (machine_mode mode, rtx operands[]) op0 = SUBREG_REG (op0); tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp); if (mode == DFmode) - op1 = force_reg (DImode, gen_lowpart (DImode, op1)); + op1 = gen_lowpart (DImode, op1); op1 = gen_rtx_ZERO_EXTEND (TImode, op1); op1 = gen_rtx_IOR (TImode, tmp, op1); } @@ -570,7 +570,7 @@ ix86_expand_move (machine_mode mode, rtx operands[]) op0 = SUBREG_REG (op0); tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp); if (mode == DFmode) - op1 = force_reg (DImode, gen_lowpart (DImode, op1)); + op1 = gen_lowpart (DImode, op1); op1 = gen_rtx_ZERO_EXTEND (TImode, op1); op1 = gen_rtx_ASHIFT (TImode, op1, GEN_INT (64)); op1 = gen_rtx_IOR (TImode, tmp, op1); diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 47ea050..8c54aa5 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -12408,21 +12408,47 @@ DONE; }) -(define_insn_and_split "*concat3_3" - [(set (match_operand: 0 "nonimmediate_operand" "=ro,r,r,&r") - (any_or_plus: - (ashift: - (zero_extend: - (match_operand:DWIH 1 "nonimmediate_operand" "r,m,r,m")) +(define_insn_and_split "*concatditi3_3" + [(set (match_operand:TI 0 "nonimmediate_operand" "=ro,r,r,&r,x") + (any_or_plus:TI + (ashift:TI + (zero_extend:TI + (match_operand:DI 1 "nonimmediate_operand" "r,m,r,m,x")) (match_operand:QI 2 "const_int_operand")) - (zero_extend: - (match_operand:DWIH 3 "nonimmediate_operand" "r,r,m,m"] - "INTVAL (operands[2]) == * BITS_PER_UNIT" + (zero_extend:TI + (match_operand:DI 3 "nonimmediate_operand" "r,r,m,m,0"] + "TARGET_64BIT + && INTVAL (operands[2]) == 64" + &q
[PATCH] PR c/110699: Defend against error_mark_node in gimplify.cc.
This patch resolves PR c/110699, an ICE-after-error regression, by adding a check that the array type isn't error_mark_node in gimplify_compound_lval. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-07-19 Roger Sayle gcc/ChangeLog PR c/110699 * gimplify.cc (gimplify_compound_lval): For ARRAY_REF and ARRAY_RANGE_REF return GS_ERROR if the array's type is error_mark_node. gcc/testsuite/ChangeLog PR c/110699 * gcc.dg/pr110699.c: New test case. Cheers, Roger -- diff --git a/gcc/gimplify.cc b/gcc/gimplify.cc index 36e5df0..4f40b24 100644 --- a/gcc/gimplify.cc +++ b/gcc/gimplify.cc @@ -3211,6 +3211,9 @@ gimplify_compound_lval (tree *expr_p, gimple_seq *pre_p, gimple_seq *post_p, if (TREE_CODE (t) == ARRAY_REF || TREE_CODE (t) == ARRAY_RANGE_REF) { + if (TREE_TYPE (TREE_OPERAND (t, 0)) == error_mark_node) + return GS_ERROR; + /* Deal with the low bound and element type size and put them into the ARRAY_REF. If these values are set, they have already been gimplified. */ diff --git a/gcc/testsuite/gcc.dg/pr110699.c b/gcc/testsuite/gcc.dg/pr110699.c new file mode 100644 index 000..be77613 --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr110699.c @@ -0,0 +1,14 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +typedef __attribute__((__vector_size__(64))) int T; + +void f(void) { + extern char a[64], b[64]; /* { dg-message "previous" "note" } */ + void *p = a; + T q = *(T *)&b[0]; +} + +void g() { + extern char b; /* { dg-error "conflicting types" } */ +}
RE: [x86_64 PATCH] More TImode parameter passing improvements.
Hi Uros, > From: Uros Bizjak > Sent: 20 July 2023 07:50 > > On Wed, Jul 19, 2023 at 10:07 PM Roger Sayle > wrote: > > > > This patch is the next piece of a solution to the x86_64 ABI issues in > > PR 88873. This splits the *concat3_3 define_insn_and_split > > into two patterns, a TARGET_64BIT *concatditi3_3 and a !TARGET_64BIT > > *concatsidi3_3. This allows us to add an additional alternative to > > the the 64-bit version, enabling the register allocator to perform > > this operation using SSE registers, which is implemented/split after > > reload using vec_concatv2di. > > > > To demonstrate the improvement, the test case from PR88873: > > > > typedef struct { double x, y; } s_t; > > > > s_t foo (s_t a, s_t b, s_t c) > > { > > return (s_t){ __builtin_fma(a.x, b.x, c.x), __builtin_fma (a.y, b.y, > > c.y) }; } > > > > when compiled with -O2 -march=cascadelake, currently generates: > > > > foo:vmovq %xmm2, -56(%rsp) > > movq-56(%rsp), %rax > > vmovq %xmm3, -48(%rsp) > > vmovq %xmm4, -40(%rsp) > > movq-48(%rsp), %rcx > > vmovq %xmm5, -32(%rsp) > > vmovq %rax, %xmm6 > > movq-40(%rsp), %rax > > movq-32(%rsp), %rsi > > vpinsrq $1, %rcx, %xmm6, %xmm6 > > vmovq %xmm0, -24(%rsp) > > vmovq %rax, %xmm7 > > vmovq %xmm1, -16(%rsp) > > vmovapd %xmm6, %xmm2 > > vpinsrq $1, %rsi, %xmm7, %xmm7 > > vfmadd132pd -24(%rsp), %xmm7, %xmm2 > > vmovapd %xmm2, -56(%rsp) > > vmovsd -48(%rsp), %xmm1 > > vmovsd -56(%rsp), %xmm0 > > ret > > > > with this change, we avoid many of the reloads via memory, > > > > foo:vpunpcklqdq %xmm3, %xmm2, %xmm7 > > vpunpcklqdq %xmm1, %xmm0, %xmm6 > > vpunpcklqdq %xmm5, %xmm4, %xmm2 > > vmovdqa %xmm7, -24(%rsp) > > vmovdqa %xmm6, %xmm1 > > movq-16(%rsp), %rax > > vpinsrq $1, %rax, %xmm7, %xmm4 > > vmovapd %xmm4, %xmm6 > > vfmadd132pd %xmm1, %xmm2, %xmm6 > > vmovapd %xmm6, -24(%rsp) > > vmovsd -16(%rsp), %xmm1 > > vmovsd -24(%rsp), %xmm0 > > ret > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline? > > > > > > 2023-07-19 Roger Sayle > > > > gcc/ChangeLog > > * config/i386/i386-expand.cc (ix86_expand_move): Don't call > > force_reg, to use SUBREG rather than create a new pseudo when > > inserting DFmode fields into TImode with insvti_{high,low}part. > > (*concat3_3): Split into two define_insn_and_split... > > (*concatditi3_3): 64-bit implementation. Provide alternative > > that allows register allocation to use SSE registers that is > > split into vec_concatv2di after reload. > > (*concatsidi3_3): 32-bit implementation. > > > > gcc/testsuite/ChangeLog > > * gcc.target/i386/pr88873.c: New test case. > > diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc > index f9b0dc6..9c3febe 100644 > --- a/gcc/config/i386/i386-expand.cc > +++ b/gcc/config/i386/i386-expand.cc > @@ -558,7 +558,7 @@ ix86_expand_move (machine_mode mode, rtx > operands[]) >op0 = SUBREG_REG (op0); >tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp); >if (mode == DFmode) > -op1 = force_reg (DImode, gen_lowpart (DImode, op1)); > +op1 = gen_lowpart (DImode, op1); > > Please note that gen_lowpart will ICE when op1 is a SUBREG. This is the reason > that we need to first force a SUBREG to a register and then perform > gen_lowpart, > and it is necessary to avoid ICE. The good news is that we know op1 is a register, as this is tested by "&& REG_P (op1)" on line 551. 
You'll also notice that I'm not removing the force_reg from before the call to gen_lowpart, but removing the call to force_reg after the call to gen_lowpart. When I originally wrote this, the hope was that placing this SUBREG in its own pseudo would help with register allocation/CSE. Unfortunately, increasing the number of pseudos (in this case) increases compile-time (due to quadratic behaviour in LRA), as shown by PR rtl-optimization/110587, and keeping the DF->DI conversion in a SUBREG inside the insvti_{high,low}part allows the register a
[x86 PATCH] Don't use insvti_{high, low}part with -O0 (for compile-time).
This patch attempts to help with PR rtl-optimization/110587, a regression of -O0 compile time for the pathological pr28071.c. My recent patch helps a bit, but hasn't returned -O0 compile-time to where it was before my ix86_expand_move changes. The obvious solution/workaround is to guard these new TImode parameter passing optimizations with "&& optimize", so they don't trigger when compiling with -O0. The very minor complication is that "&& optimize" alone leads to the regression of pr110533.c, where our improved TImode parameter passing fixes a wrong-code issue with naked functions, importantly, when compiling with -O0. This should explain the one line fix below "&& (optimize || ix86_function_naked (cfun))". I've an additional fix/tweak or two for this compile-time issue, but this change eliminates the part of the regression that I've caused. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-07-22 Roger Sayle gcc/ChangeLog * config/i386/i386-expand.cc (ix86_expand_move): Disable the 64-bit insertions into TImode optimizations with -O0, unless the function has the "naked" attribute (for PR target/110533). Cheers, Roger -- diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index 7e94447..cdef95e 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -544,6 +544,7 @@ ix86_expand_move (machine_mode mode, rtx operands[]) /* Special case inserting 64-bit values into a TImode register. */ if (TARGET_64BIT + && (optimize || ix86_function_naked (current_function_decl)) && (mode == DImode || mode == DFmode) && SUBREG_P (op0) && GET_MODE (SUBREG_REG (op0)) == TImode
[x86 PATCH] Use QImode for offsets in zero_extract/sign_extract in i386.md
As suggested by Uros, this patch changes the ZERO_EXTRACTs and SIGN_EXTRACTs in i386.md to consistently use QImode for bit offsets (i.e. third and fourth operands), matching the use of QImode for bit counts in shifts and rotates. There's no change in functionality, and the new patterns simply ensure that we continue to generate the same code (match revised patterns) as before. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2023-07-22 Roger Sayle gcc/ChangeLog * config/i386/i386.md (extv): Use QImode for offsets. (extzv): Likewise. (insv): Likewise. (*testqi_ext_3): Likewise. (*btr_2): Likewise. (define_split): Likewise. (*btsq_imm): Likewise. (*btrq_imm): Likewise. (*btcq_imm): Likewise. (define_peephole2 x3): Likewise. (*bt): Likewise (*bt_mask): New define_insn_and_split. (*jcc_bt): Use QImode for offsets. (*jcc_bt_1): Delete obsolete pattern. (*jcc_bt_mask): Use QImode offsets. (*jcc_bt_mask_1): Likewise. (define_split): Likewise. (*bt_setcqi): Likewise. (*bt_setncqi): Likewise. (*bt_setnc): Likewise. (*bt_setncqi_2): Likewise. (*bt_setc_mask): New define_insn_and_split. (bmi2_bzhi_3): Use QImode offsets. (*bmi2_bzhi_3): Likewise. (*bmi2_bzhi_3_1): Likewise. (*bmi2_bzhi_3_1_ccz): Likewise. (@tbm_bextri_): Likewise. Thanks, Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 47ea050..de8c3a5 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -3312,8 +3312,8 @@ (define_expand "extv" [(set (match_operand:SWI24 0 "register_operand") (sign_extract:SWI24 (match_operand:SWI24 1 "register_operand") - (match_operand:SI 2 "const_int_operand") - (match_operand:SI 3 "const_int_operand")))] + (match_operand:QI 2 "const_int_operand") + (match_operand:QI 3 "const_int_operand")))] "" { /* Handle extractions from %ah et al. */ @@ -3340,8 +3340,8 @@ (define_expand "extzv" [(set (match_operand:SWI248 0 "register_operand") (zero_extract:SWI248 (match_operand:SWI248 1 "register_operand") -(match_operand:SI 2 "const_int_operand") -(match_operand:SI 3 "const_int_operand")))] +(match_operand:QI 2 "const_int_operand") +(match_operand:QI 3 "const_int_operand")))] "" { if (ix86_expand_pextr (operands)) @@ -3428,8 +3428,8 @@ (define_expand "insv" [(set (zero_extract:SWI248 (match_operand:SWI248 0 "register_operand") -(match_operand:SI 1 "const_int_operand") -(match_operand:SI 2 "const_int_operand")) +(match_operand:QI 1 "const_int_operand") +(match_operand:QI 2 "const_int_operand")) (match_operand:SWI248 3 "register_operand"))] "" { @@ -10788,8 +10788,8 @@ (match_operator 1 "compare_operator" [(zero_extract:SWI248 (match_operand 2 "int_nonimmediate_operand" "rm") -(match_operand 3 "const_int_operand") -(match_operand 4 "const_int_operand")) +(match_operand:QI 3 "const_int_operand") +(match_operand:QI 4 "const_int_operand")) (const_int 0)]))] "/* Ensure that resulting mask is zero or sign extended operand. 
*/ INTVAL (operands[4]) >= 0 @@ -15904,7 +15904,7 @@ [(set (zero_extract:HI (match_operand:SWI12 0 "nonimmediate_operand") (const_int 1) - (zero_extend:SI (match_operand:QI 1 "register_operand"))) + (match_operand:QI 1 "register_operand")) (const_int 0)) (clobber (reg:CC FLAGS_REG))] "TARGET_USE_BT && ix86_pre_reload_split ()" @@ -15928,7 +15928,7 @@ [(set (zero_extract:HI (match_operand:SWI12 0 "register_operand") (const_int 1) - (zero_extend:SI (match_operand:QI 1 "register_operand"))) + (match_operand:QI 1 "register_operand")) (const_int 0)) (clobber (reg:CC FLAGS_REG))] "TARGET_USE_BT && ix86_pre_reload_split ()" @@ -15955,7 +15955,7 @@ (define_insn "*btsq_imm" [(set (zero_extract:DI (match_operand:DI 0 "nonimmedia
[PATCH] Replace lra-spills.cc's return_regno_p with return_reg_p.
This patch is my attempt to address the compile-time hog issue in PR rtl-optimization/110587. Richard Biener's analysis shows that compilation of pr28071.c with -O0 currently spends ~70% in timer "LRA non-specific" due to return_regno_p failing to filter a large number of calls to regno_in_use_p, resulting in quadratic behaviour. For this pathological test case, things can be improved significantly. Although the return register (%rax) is indeed mentioned a large number of times in this function, due to inlining, the inlined functions access the returned register in TImode, whereas the current function returns a DImode. Hence the check to see if we're the last SET of the return register, which should be followed by a USE, can be improved by also testing the mode. Implementation-wise, rather than pass an additional mode parameter to LRA's local return_regno_p function, which only has a single caller, it's more convenient to pass the rtx REG_P, and from this extract both the REGNO and the mode in the callee, and rename this function to return_reg_p. The good news is that with this change "LRA non-specific" drops from 70% to 13%. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, with no new failures. Ok for mainline? 2023-07-22 Roger Sayle gcc/ChangeLog PR middle-end/28071 PR rtl-optimization/110587 * lra-spills.cc (return_regno_p): Change argument and rename to... (return_reg_p): Check if the given register RTX has the same REGNO and machine mode as the function's return value. (lra_final_code_change): Update call to return_reg_p. Thanks in advance, Roger -- diff --git a/gcc/lra-spills.cc b/gcc/lra-spills.cc index 3a7bb7e..ae147ad 100644 --- a/gcc/lra-spills.cc +++ b/gcc/lra-spills.cc @@ -705,10 +705,10 @@ alter_subregs (rtx *loc, bool final_p) return res; } -/* Return true if REGNO is used for return in the current - function. */ +/* Return true if register REG, known to be REG_P, is used for return + in the current function. */ static bool -return_regno_p (unsigned int regno) +return_reg_p (rtx reg) { rtx outgoing = crtl->return_rtx; @@ -716,7 +716,8 @@ return_regno_p (unsigned int regno) return false; if (REG_P (outgoing)) -return REGNO (outgoing) == regno; +return REGNO (outgoing) == REGNO (reg) + && GET_MODE (outgoing) == GET_MODE (reg); else if (GET_CODE (outgoing) == PARALLEL) { int i; @@ -725,7 +726,9 @@ return_regno_p (unsigned int regno) { rtx x = XEXP (XVECEXP (outgoing, 0, i), 0); - if (REG_P (x) && REGNO (x) == regno) + if (REG_P (x) + && REGNO (x) == REGNO (reg) + && GET_MODE (x) == GET_MODE (reg)) return true; } } @@ -821,7 +824,7 @@ lra_final_code_change (void) if (NONJUMP_INSN_P (insn) && GET_CODE (pat) == SET && REG_P (SET_SRC (pat)) && REG_P (SET_DEST (pat)) && REGNO (SET_SRC (pat)) == REGNO (SET_DEST (pat)) - && (! return_regno_p (REGNO (SET_SRC (pat))) + && (! return_reg_p (SET_SRC (pat)) || ! regno_in_use_p (insn, REGNO (SET_SRC (pat) { lra_invalidate_insn_data (insn);
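To make the scenario concrete, a hypothetical source shape (not pr28071.c itself) in which the return register is mentioned in a wider mode than the function's own return value might look like the following; the helper name is made up purely for illustration:

/* wide_step returns a TImode value, delivered in the %rax:%rdx pair,
   so each call mentions the return register in TImode, while the
   enclosing function's crtl->return_rtx is a DImode %rax.  The new
   GET_MODE check lets return_reg_p reject such mentions immediately.  */
extern __int128 wide_step (__int128 x);

long
narrow_return (long a)
{
  __int128 t = wide_step ((__int128) a);
  t += wide_step (t);
  return (long) t;
}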
[Committed] PR target/110787: Revert QImode offsets in {zero, sign}_extract.
My recent patch to use QImode for bit offsets in ZERO_EXTRACTs and SIGN_EXTRACTs in the i386 backend shouldn't have resulted in any change behaviour, but as reported by Rainer it produces a bootstrap failure in gm2. This reverts the problematic patch whilst we investigate the underlying cause. Committed as obvious. 2023-07-23 Roger Sayle gcc/ChangeLog PR target/110787 PR target/110790 Revert patch. * config/i386/i386.md (extv): Use QImode for offsets. (extzv): Likewise. (insv): Likewise. (*testqi_ext_3): Likewise. (*btr_2): Likewise. (define_split): Likewise. (*btsq_imm): Likewise. (*btrq_imm): Likewise. (*btcq_imm): Likewise. (define_peephole2 x3): Likewise. (*bt): Likewise (*bt_mask): New define_insn_and_split. (*jcc_bt): Use QImode for offsets. (*jcc_bt_1): Delete obsolete pattern. (*jcc_bt_mask): Use QImode offsets. (*jcc_bt_mask_1): Likewise. (define_split): Likewise. (*bt_setcqi): Likewise. (*bt_setncqi): Likewise. (*bt_setnc): Likewise. (*bt_setncqi_2): Likewise. (*bt_setc_mask): New define_insn_and_split. (bmi2_bzhi_3): Use QImode offsets. (*bmi2_bzhi_3): Likewise. (*bmi2_bzhi_3_1): Likewise. (*bmi2_bzhi_3_1_ccz): Likewise. (@tbm_bextri_): Likewise. Sorry for the inconvenience, Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 2ce8e958565..4db210cc795 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -3312,8 +3312,8 @@ (define_expand "extv" [(set (match_operand:SWI24 0 "register_operand") (sign_extract:SWI24 (match_operand:SWI24 1 "register_operand") - (match_operand:QI 2 "const_int_operand") - (match_operand:QI 3 "const_int_operand")))] + (match_operand:SI 2 "const_int_operand") + (match_operand:SI 3 "const_int_operand")))] "" { /* Handle extractions from %ah et al. */ @@ -3340,8 +3340,8 @@ (define_expand "extzv" [(set (match_operand:SWI248 0 "register_operand") (zero_extract:SWI248 (match_operand:SWI248 1 "register_operand") -(match_operand:QI 2 "const_int_operand") -(match_operand:QI 3 "const_int_operand")))] +(match_operand:SI 2 "const_int_operand") +(match_operand:SI 3 "const_int_operand")))] "" { if (ix86_expand_pextr (operands)) @@ -3428,8 +3428,8 @@ (define_expand "insv" [(set (zero_extract:SWI248 (match_operand:SWI248 0 "register_operand") -(match_operand:QI 1 "const_int_operand") -(match_operand:QI 2 "const_int_operand")) +(match_operand:SI 1 "const_int_operand") +(match_operand:SI 2 "const_int_operand")) (match_operand:SWI248 3 "register_operand"))] "" { @@ -10788,8 +10788,8 @@ (match_operator 1 "compare_operator" [(zero_extract:SWI248 (match_operand 2 "int_nonimmediate_operand" "rm") -(match_operand:QI 3 "const_int_operand") -(match_operand:QI 4 "const_int_operand")) +(match_operand 3 "const_int_operand") +(match_operand 4 "const_int_operand")) (const_int 0)]))] "/* Ensure that resulting mask is zero or sign extended operand. 
*/ INTVAL (operands[4]) >= 0 @@ -15965,7 +15965,7 @@ [(set (zero_extract:HI (match_operand:SWI12 0 "nonimmediate_operand") (const_int 1) - (match_operand:QI 1 "register_operand")) + (zero_extend:SI (match_operand:QI 1 "register_operand"))) (const_int 0)) (clobber (reg:CC FLAGS_REG))] "TARGET_USE_BT && ix86_pre_reload_split ()" @@ -15989,7 +15989,7 @@ [(set (zero_extract:HI (match_operand:SWI12 0 "register_operand") (const_int 1) - (match_operand:QI 1 "register_operand")) + (zero_extend:SI (match_operand:QI 1 "register_operand"))) (const_int 0)) (clobber (reg:CC FLAGS_REG))] "TARGET_USE_BT && ix86_pre_reload_split ()" @@ -16016,7 +16016,7 @@ (define_insn "*btsq_imm" [(set (zero_extract:DI (match_operand:DI 0 "nonimmediate_operand" "+rm") (const_int 1) -
[PATCH] PR rtl-optimization/110587: Reduce useless moves in compile-time hog.
This patch is the third in a series of fixes for PR rtl-optimization/110587, a compile-time regression with -O0, that attempts to address the underlying cause. As noted previously, the pathological test case pr28071.c contains a large number of useless register-to-register moves that can produce quadratic behaviour (in LRA). These moves are generated during RTL expansion in emit_group_load_1, where the middle-end attempts to simplify the source before calling extract_bit_field. This is reasonable if the source is a complex expression (from before the tree-ssa optimizers), or a SUBREG, or a hard register, but it's not particularly useful to copy a pseudo register into a new pseudo register. This patch eliminates that redundancy. The -fdump-tree-expand for pr28071.c compiled with -O0 currently contains 777K lines, with this patch it contains 717K lines, i.e. saving about 60K lines (admittedly of debugging text output, but it makes the point). This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? As always, I'm happy to revert this change quickly if there's a problem, and investigate why this additional copy might (still) be needed on other non-x86 targets. 2023-07-25 Roger Sayle gcc/ChangeLog PR middle-end/28071 PR rtl-optimization/110587 * expr.cc (emit_group_load_1): Avoid copying a pseudo register into a new pseudo register, i.e. only copy hard regs into a new pseudo. Thanks in advance, Roger -- diff --git a/gcc/expr.cc b/gcc/expr.cc index fff09dc..11d041b 100644 --- a/gcc/expr.cc +++ b/gcc/expr.cc @@ -2622,6 +2622,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree type, be loaded directly into the destination. */ src = orig_src; if (!MEM_P (orig_src) + && (!REG_P (orig_src) || HARD_REGISTER_P (orig_src)) && (!CONSTANT_P (orig_src) || (GET_MODE (orig_src) != mode && GET_MODE (orig_src) != VOIDmode)))
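For readers unfamiliar with emit_group_load_1: it copies a value into a group of registers described by a PARALLEL, as used when a small aggregate is passed or returned in more than one register. A hedged illustration (not the PR's test case) of code that reaches this path on x86_64:

/* Under the SysV x86_64 ABI this aggregate is returned in a PARALLEL
   of an integer register and an SSE register, so copying P into that
   register group goes through emit_group_load_1, the function patched
   above.  */
struct pair { long l; double d; };

struct pair
make_pair (long l, double d)
{
  struct pair p = { l, d };
  return p;
}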
[PATCH] PR rtl-optimization/110701: Fix SUBREG SET_DEST handling in combine.
This patch is my proposed fix to PR rtl-optimization 110701, a latent bug in combine's record_dead_and_set_regs_1 exposed by recent improvements to simplify_subreg. The issue involves the handling of (normal) SUBREG SET_DESTs as in the instruction: (set (subreg:HI (reg:SI x) 0) (expr:HI y)) The semantics of this are that the bits specified by the SUBREG are set to the SET_SRC, y, and that the other bits of the SET_DEST are left/become undefined. To simplify explanation, we'll only consider lowpart SUBREGs (though in theory non-lowpart SUBREGS could be handled), and the fact that bits outside of the lowpart WORD retain their original values (treating these as undefined is a missed optimization rather than incorrect code bug, that only affects targets with less than 64-bit words). The bug is that combine simulates the behaviour of the above instruction, for calculating nonzero_bits and set_sign_bit_copies, in the function record_value_for_reg, by using the equivalent of: (set (reg:SI x) (subreg:SI (expr:HI y)) by calling gen_lowpart on the SET_SRC. Alas, the semantics of this revised instruction aren't always equivalent to the original. In the test case for PR110701, the original instruction (set (subreg:HI (reg:SI x), 0) (and:HI (subreg:HI (reg:SI y) 0) (const_int 340))) which (by definition) leaves the top bits of x undefined, is mistakenly considered to be equivalent to (set (reg:SI x) (and:SI (reg:SI y) (const_int 340))) where gen_lowpart's freedom to do anything with paradoxical SUBREG bits, has now cleared the high bits. The same bug also triggers when the SET_SRC is say (subreg:HI (reg:DI z)), where gen_lowpart transforms this into (subreg:SI (reg:DI z)) which defines bits 16-31 to be the same as bits 16-31 of z. The fix is that after calling record_value_for_reg, we need to mark the bits that should be undefined as undefined, in case gen_lowpart, which performs transforms appropriate for r-values, has changed the interpretation of the SUBREG when used as an l-value. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? I've a version of this patch that preserves the original bits outside of the lowpart WORD that can be submitted as a follow-up, but this is the piece that addresses the wrong code regression. 2023-07-26 Roger Sayle gcc/ChangeLog PR rtl-optimization/110701 * combine.cc (record_dead_and_set_regs_1): Split comment into pieces placed before the relevant clauses. When the SET_DEST is a partial_subreg_p, mark the bits outside of the updated portion of the destination as undefined. gcc/testsuite/ChangeLog PR rtl-optimization/110701 * gcc.target/i386/pr110701.c: New test case. Thanks in advance, Roger -- diff --git a/gcc/combine.cc b/gcc/combine.cc index 4bf867d..c5ebb78 100644 --- a/gcc/combine.cc +++ b/gcc/combine.cc @@ -13337,27 +13337,43 @@ record_dead_and_set_regs_1 (rtx dest, const_rtx setter, void *data) if (REG_P (dest)) { - /* If we are setting the whole register, we know its value. Otherwise -show that we don't know the value. We can handle a SUBREG if it's -the low part, but we must be careful with paradoxical SUBREGs on -RISC architectures because we cannot strip e.g. an extension around -a load and record the naked load since the RTL middle-end considers -that the upper bits are defined according to LOAD_EXTEND_OP. */ + /* If we are setting the whole register, we know its value. 
*/ if (GET_CODE (setter) == SET && dest == SET_DEST (setter)) record_value_for_reg (dest, record_dead_insn, SET_SRC (setter)); + /* We can handle a SUBREG if it's the low part, but we must be +careful with paradoxical SUBREGs on RISC architectures because +we cannot strip e.g. an extension around a load and record the +naked load since the RTL middle-end considers that the upper bits +are defined according to LOAD_EXTEND_OP. */ else if (GET_CODE (setter) == SET && GET_CODE (SET_DEST (setter)) == SUBREG && SUBREG_REG (SET_DEST (setter)) == dest && known_le (GET_MODE_PRECISION (GET_MODE (dest)), BITS_PER_WORD) && subreg_lowpart_p (SET_DEST (setter))) - record_value_for_reg (dest, record_dead_insn, - WORD_REGISTER_OPERATIONS - && word_register_operation_p (SET_SRC (setter)) - && paradoxical_subreg_p (SET_DEST (setter)) - ? SET_SRC (setter) - : gen_lowpart (GET_MODE (dest), -
RE: [PATCH] PR rtl-optimization/110587: Reduce useless moves in compile-time hog.
Hi Richard, You're 100% right. It’s possible to significantly clean-up this code, replacing the body of the conditional with a call to force_reg and simplifying the conditions under which it is called. These improvements are implemented in the patch below, which has been tested on x86_64-pc-linux-gnu, with a bootstrap and make -k check, both with and without -m32, as usual. Interestingly, the CONCAT clause afterwards is still required (I've learned something new), as calling force_reg (or gen_reg_rtx) with HCmode, actually returns a CONCAT instead of a REG, so although the code looks dead, it's required to build libgcc during a bootstrap. But the remaining clean-up is good, reducing the number of source lines and making the logic easier to understand. Ok for mainline? 2023-07-27 Roger Sayle Richard Biener gcc/ChangeLog PR middle-end/28071 PR rtl-optimization/110587 * expr.cc (emit_group_load_1): Simplify logic for calling force_reg on ORIG_SRC, to avoid making a copy if the source is already in a pseudo register. Roger -- > -Original Message- > From: Richard Biener > Sent: 25 July 2023 12:50 > > On Tue, Jul 25, 2023 at 1:31 PM Roger Sayle > wrote: > > > > This patch is the third in series of fixes for PR > > rtl-optimization/110587, a compile-time regression with -O0, that > > attempts to address the underlying cause. As noted previously, the > > pathological test case pr28071.c contains a large number of useless > > register-to-register moves that can produce quadratic behaviour (in > > LRA). These move are generated during RTL expansion in > > emit_group_load_1, where the middle-end attempts to simplify the > > source before calling extract_bit_field. This is reasonable if the > > source is a complex expression (from before the tree-ssa optimizers), > > or a SUBREG, or a hard register, but it's not particularly useful to > > copy a pseudo register into a new pseudo register. This patch eliminates > > that > redundancy. > > > > The -fdump-tree-expand for pr28071.c compiled with -O0 currently > > contains 777K lines, with this patch it contains 717K lines, i.e. > > saving about 60K lines (admittedly of debugging text output, but it makes > > the > point). > > > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failures. Ok for mainline? > > > > As always, I'm happy to revert this change quickly if there's a > > problem, and investigate why this additional copy might (still) be > > needed on other > > non-x86 targets. > > @@ -2622,6 +2622,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, > tree type, > be loaded directly into the destination. */ >src = orig_src; >if (!MEM_P (orig_src) > + && (!REG_P (orig_src) || HARD_REGISTER_P (orig_src)) > && (!CONSTANT_P (orig_src) > || (GET_MODE (orig_src) != mode > && GET_MODE (orig_src) != VOIDmode))) > > so that means the code guarded by the conditional could instead be transformed > to > >src = force_reg (mode, orig_src); > > ? Btw, the || (GET_MODE (orig_src) != mode && GET_MODE (orig_src) != > VOIDmode) case looks odd as in that case we'd use GET_MODE (orig_src) for the > move ... that might also mean we have to use force_reg (GET_MODE (orig_src) == > VOIDmode ? mode : GET_MODE (orig_src), orig_src)) > > Otherwise I think this is OK, as said, using force_reg somehow would improve > readability here I think. > > I also wonder how the > > else if (GET_CODE (src) == CONCAT) > > case will ever trigger with the current code. 
> > Richard. > > > > > 2023-07-25 Roger Sayle > > > > gcc/ChangeLog > > PR middle-end/28071 > > PR rtl-optimization/110587 > > * expr.cc (emit_group_load_1): Avoid copying a pseudo register into > > a new pseudo register, i.e. only copy hard regs into a new pseudo. > > > > diff --git a/gcc/expr.cc b/gcc/expr.cc index fff09dc..174f8ac 100644 --- a/gcc/expr.cc +++ b/gcc/expr.cc @@ -2622,16 +2622,11 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree type, be loaded directly into the destination. */ src = orig_src; if (!MEM_P (orig_src) - && (!CONSTANT_P (orig_src) - || (GET_MODE (orig_src) != mode - && GET_MODE (orig_src) != VOIDmode))) + && (!REG_P (orig_src) || HARD_REGISTER_P (orig_src)) + && !CONSTANT_P (orig_src)) { - if (GET_MODE (orig_src) == VOIDmode) - src = gen_reg_rtx (mode); - else - src = gen_reg_rtx (GET_MODE (orig_src)); - - emit_move_insn (src, orig_src); + gcc_assert (GET_MODE (orig_src) != VOIDmode); + src = force_reg (GET_MODE (orig_src), orig_src); } /* Optimize the access just a bit. */
[Committed] Use QImode for offsets in zero_extract/sign_extract in i386.md (take #2)
This patch reattempts to change the ZERO_EXTRACTs and SIGN_EXTRACTs in i386.md to consistently use QImode for bit offsets (i.e. third and fourth operands), matching the use of QImode for bit counts in shifts and rotates. This iteration corrects the "ne:QI" vs "eq:QI" mistake in the previous version, which was responsible for PR 110787 and PR 110790 and so was rapidly reverted last weekend. New test cases have been added to check the correct behaviour. This patch has been tested on x86_64-pc-linux-gnu with and without --enable-languages="all", with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Committed to mainline as an obvious fix to the previously approved patch. Sorry again for the temporary inconvenience, and thanks to Rainer Orth for identifying/confirming the problematic patch. 2023-07-29 Roger Sayle gcc/ChangeLog PR target/110790 * config/i386/i386.md (extv): Use QImode for offsets. (extzv): Likewise. (insv): Likewise. (*testqi_ext_3): Likewise. (*btr_2): Likewise. (define_split): Likewise. (*btsq_imm): Likewise. (*btrq_imm): Likewise. (*btcq_imm): Likewise. (define_peephole2 x3): Likewise. (*bt): Likewise (*bt_mask): New define_insn_and_split. (*jcc_bt): Use QImode for offsets. (*jcc_bt_1): Delete obsolete pattern. (*jcc_bt_mask): Use QImode offsets. (*jcc_bt_mask_1): Likewise. (define_split): Likewise. (*bt_setcqi): Likewise. (*bt_setncqi): Likewise. (*bt_setnc): Likewise. (*bt_setncqi_2): Likewise. (*bt_setc_mask): New define_insn_and_split. (bmi2_bzhi_3): Use QImode offsets. (*bmi2_bzhi_3): Likewise. (*bmi2_bzhi_3_1): Likewise. (*bmi2_bzhi_3_1_ccz): Likewise. (@tbm_bextri_): Likewise. gcc/testsuite/ChangeLog PR target/110790 * gcc.target/i386/pr110790-1.c: New test case. * gcc.target/i386/pr110790-2.c: Likewise. diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 4db210c..efac228 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -3312,8 +3312,8 @@ (define_expand "extv" [(set (match_operand:SWI24 0 "register_operand") (sign_extract:SWI24 (match_operand:SWI24 1 "register_operand") - (match_operand:SI 2 "const_int_operand") - (match_operand:SI 3 "const_int_operand")))] + (match_operand:QI 2 "const_int_operand") + (match_operand:QI 3 "const_int_operand")))] "" { /* Handle extractions from %ah et al. */ @@ -3340,8 +3340,8 @@ (define_expand "extzv" [(set (match_operand:SWI248 0 "register_operand") (zero_extract:SWI248 (match_operand:SWI248 1 "register_operand") -(match_operand:SI 2 "const_int_operand") -(match_operand:SI 3 "const_int_operand")))] +(match_operand:QI 2 "const_int_operand") +(match_operand:QI 3 "const_int_operand")))] "" { if (ix86_expand_pextr (operands)) @@ -3428,8 +3428,8 @@ (define_expand "insv" [(set (zero_extract:SWI248 (match_operand:SWI248 0 "register_operand") -(match_operand:SI 1 "const_int_operand") -(match_operand:SI 2 "const_int_operand")) +(match_operand:QI 1 "const_int_operand") +(match_operand:QI 2 "const_int_operand")) (match_operand:SWI248 3 "register_operand"))] "" { @@ -10788,8 +10788,8 @@ (match_operator 1 "compare_operator" [(zero_extract:SWI248 (match_operand 2 "int_nonimmediate_operand" "rm") -(match_operand 3 "const_int_operand") -(match_operand 4 "const_int_operand")) +(match_operand:QI 3 "const_int_operand") +(match_operand:QI 4 "const_int_operand")) (const_int 0)]))] "/* Ensure that resulting mask is zero or sign extended operand. 
*/ INTVAL (operands[4]) >= 0 @@ -15965,7 +15965,7 @@ [(set (zero_extract:HI (match_operand:SWI12 0 "nonimmediate_operand") (const_int 1) - (zero_extend:SI (match_operand:QI 1 "register_operand"))) + (match_operand:QI 1 "register_operand")) (const_int 0)) (clobber (reg:CC FLAGS_REG))] "TARGET_USE_BT && ix86_pre_reload_split ()" @@ -15989,7 +15989,7 @@ [(
[Committed] PR target/110843: Check TARGET_AVX512VL for V2DI rotates in STV.
This patch resolves PR target/110843, an ICE caused by my enhancement to support AVX512 DImode and SImode rotates in the scalar-to-vector (STV) pass. Although the vprotate instructions are available on all TARGET_AVX512F microarchitectures, the V2DI and V4SI variants are only available on the TARGET_AVX512VL subset, leading to problems when command line options enable AVX512 (i.e. AVX512F) but not the required AVX512VL functionality. The simple fix is to update/correct the target checks. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Committed to mainline as obvious. 2023-07-31 Roger Sayle gcc/ChangeLog PR target/110843 * config/i386/i386-features.cc (compute_convert_gain): Check TARGET_AVX512VL (not TARGET_AVX512F) when considering V2DImode and V4SImode rotates in STV. (general_scalar_chain::convert_rotate): Likewise. gcc/testsuite/ChangeLog PR target/110843 * gcc.target/i386/pr110843.c: New test case. Sorry again for the inconvenience. Roger -- diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc index 6da8395..cead397 100644 --- a/gcc/config/i386/i386-features.cc +++ b/gcc/config/i386/i386-features.cc @@ -587,7 +587,7 @@ general_scalar_chain::compute_convert_gain () case ROTATE: case ROTATERT: igain += m * ix86_cost->shift_const; - if (TARGET_AVX512F) + if (TARGET_AVX512VL) igain -= ix86_cost->sse_op; else if (smode == DImode) { @@ -1230,7 +1230,7 @@ general_scalar_chain::convert_rotate (enum rtx_code code, rtx op0, rtx op1, emit_insn_before (pat, insn); result = gen_lowpart (V2DImode, tmp1); } - else if (TARGET_AVX512F) + else if (TARGET_AVX512VL) result = simplify_gen_binary (code, V2DImode, op0, op1); else if (bits == 16 || bits == 48) { @@ -1276,7 +1276,7 @@ general_scalar_chain::convert_rotate (enum rtx_code code, rtx op0, rtx op1, emit_insn_before (pat, insn); result = gen_lowpart (V4SImode, tmp1); } - else if (TARGET_AVX512F) + else if (TARGET_AVX512VL) result = simplify_gen_binary (code, V4SImode, op0, op1); else { diff --git a/gcc/testsuite/gcc.target/i386/pr110843.c b/gcc/testsuite/gcc.target/i386/pr110843.c new file mode 100644 index 000..b9bcddb --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr110843.c @@ -0,0 +1,20 @@ +/* PR target/110843 */ +/* derived from gcc.target/i386/pr70007.c */ +/* { dg-do compile { target int128 } } */ +/* { dg-options "-Os -mavx512ifma -Wno-psabi" } */ + +typedef unsigned short v32u16 __attribute__ ((vector_size (32))); +typedef unsigned long long v32u64 __attribute__ ((vector_size (32))); +typedef unsigned __int128 u128; +typedef unsigned __int128 v32u128 __attribute__ ((vector_size (32))); + +u128 foo (v32u16 v32u16_0, v32u64 v32u64_0, v32u64 v32u64_1) +{ + do { +v32u16_0[13] |= v32u64_1[3] = (v32u64_1[3] >> 19) | (v32u64_1[3] << 45); +v32u64_1 %= ~v32u64_1; +v32u64_0 *= (v32u64) v32u16_0; + } while (v32u64_0[0]); + return v32u64_1[3]; +} +
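For context, the ISA distinction that the corrected checks respect: plain AVX512F only provides the 512-bit (ZMM) forms of the EVEX rotate instructions such as vprold/vprolq, while the 128-bit and 256-bit forms require AVX512VL. A hedged example of the kind of scalar rotate that STV may want to keep in an xmm register, and which therefore needs the V2DImode pattern only available with -mavx512vl:

unsigned long long
rot13 (unsigned long long x)
{
  return (x << 13) | (x >> 51);   /* recognized as a 64-bit rotate */
}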
[x86 PATCH] UNSPEC_PALIGNR optimizations and clean-ups.
This patch is a follow-up to Hongtao's fix for PR target/105854. That fix is perfectly correct, but the thing that caught my eye was why is the compiler generating a shift by zero at all. Digging deeper it turns out that we can easily optimize __builtin_ia32_palignr for alignments of 0 and 64 respectively, which may be simplified to moves from the highpart or lowpart. After adding optimizations to simplify the 64-bit DImode palignr, I started to add the corresponding optimizations for vpalignr (i.e. 128-bit). The first oddity is that sse.md uses TImode and a special SSESCALARMODE iterator, rather than V1TImode, and indeed the comment above SSESCALARMODE hints that this should be "dropped in favor of VIMAX_AVX2_AVX512BW". Hence this patch includes the migration of _palignr to use VIMAX_AVX2_AVX512BW, basically using V1TImode instead of TImode for 128-bit palignr. But it was only after I'd implemented this clean-up that I stumbled across the strange semantics of 128-bit [v]palignr. According to https://www.felixcloutier.com/x86/palignr, the semantics are subtly different based upon how the instruction is encoded. PALIGNR leaves the highpart unmodified, whilst VEX.128 encoded VPALIGNR clears the highpart, and (unless I'm mistaken) it looks like GCC currently uses the exact same RTL/templates for both, treating one as an alternative for the other. Hence I thought I'd post what I have so far (part optimization and part clean-up), to then ask the x86 experts for their opinions. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-,32}, with no new failures. Ok for mainline? 2022-06-30 Roger Sayle gcc/ChangeLog * config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti. * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use V1TImode and gen_ssse3_palignv1ti instead of TImode. * config/i386/sse.md (SSESCALARMODE): Delete. (define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode. (_palignr): Use VIMAX_AVX2_AVX512BW as a mode iterator instead of SSESCALARMODE. (ssse3_palignrdi): Optimize cases when operands[3] is 0 or 64, using a single move instruction (if required). (define_split): Likewise split UNSPEC_PALIGNR $0 into a move. (define_split): Likewise split UNSPEC_PALIGNR $64 into a move. gcc/testsuite/ChangeLog * gcc.target/i386/ssse3-palignr-2.c: New test case. Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def index e6daad4..fd16093 100644 --- a/gcc/config/i386/i386-builtin.def +++ b/gcc/config/i386/i386-builtin.def @@ -900,7 +900,7 @@ BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_psignv4si3, "__builtin_ia32_psig BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_ssse3_psignv2si3, "__builtin_ia32_psignd", IX86_BUILTIN_PSIGND, UNKNOWN, (int) V2SI_FTYPE_V2SI_V2SI) /* SSSE3. 
*/ -BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_palignrti, "__builtin_ia32_palignr128", IX86_BUILTIN_PALIGNR128, UNKNOWN, (int) V2DI_FTYPE_V2DI_V2DI_INT_CONVERT) +BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_palignrv1ti, "__builtin_ia32_palignr128", IX86_BUILTIN_PALIGNR128, UNKNOWN, (int) V2DI_FTYPE_V2DI_V2DI_INT_CONVERT) BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_ssse3_palignrdi, "__builtin_ia32_palignr", IX86_BUILTIN_PALIGNR, UNKNOWN, (int) V1DI_FTYPE_V1DI_V1DI_INT_CONVERT) /* SSE4.1 */ diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index 8bc5430..6a3fcde 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -19548,9 +19548,11 @@ expand_vec_perm_palignr (struct expand_vec_perm_d *d, bool single_insn_only_p) shift = GEN_INT (min * GET_MODE_UNIT_BITSIZE (d->vmode)); if (GET_MODE_SIZE (d->vmode) == 16) { - target = gen_reg_rtx (TImode); - emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, dcopy.op1), - gen_lowpart (TImode, dcopy.op0), shift)); + target = gen_reg_rtx (V1TImode); + emit_insn (gen_ssse3_palignrv1ti (target, + gen_lowpart (V1TImode, dcopy.op1), + gen_lowpart (V1TImode, dcopy.op0), + shift)); } else { diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index 8cd0f61..974deca 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -575,10 +575,6 @@ (define_mode_iterator VIMAX_AVX2 [(V2TI "TARGET_AVX2") V1TI]) -;; ??? This should probably be dropped in favor of VIMAX_AVX2_AVX512BW. -(define_mode_iterator SSESCALARMODE - [(V4TI "TARGET_AVX512BW") (V
[x86 PATCH] PR target/106122: Don't update %esp via the stack with -Oz.
When optimizing for size with -Oz, setting a register can be minimized by pushing an immediate value to the stack and popping it to the destination. Alas the one general register that shouldn't be updated via the stack is the stack pointer itself, where "pop %esp" can't be represented in GCC's RTL ("use of a register mentioned in pre_inc, pre_dec, post_inc or post_dec is not permitted within the same instruction"). This patch fixes PR target/106122 by explicitly checking for SP_REG in the problematic peephole2. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2022-06-30 Roger Sayle gcc/ChangeLog PR target/106122 * config/i386/i386.md (peephole2): Avoid generating pop %esp when optimizing for size. gcc/testsuite/ChangeLog PR target/106122 * gcc.target/i386/pr106122.c: New test case. Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 125a3b4..3b6f362 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -2588,7 +2588,8 @@ "optimize_insn_for_size_p () && optimize_size > 1 && operands[1] != const0_rtx && IN_RANGE (INTVAL (operands[1]), -128, 127) - && !ix86_red_zone_used" + && !ix86_red_zone_used + && REGNO (operands[0]) != SP_REG" [(set (match_dup 2) (match_dup 1)) (set (match_dup 0) (match_dup 3))] { diff --git a/gcc/testsuite/gcc.target/i386/pr106122.c b/gcc/testsuite/gcc.target/i386/pr106122.c new file mode 100644 index 000..7d24ed3 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106122.c @@ -0,0 +1,15 @@ +/* PR middle-end/106122 */ +/* { dg-do compile } */ +/* { dg-options "-Oz" } */ + +register volatile int a __asm__("%esp"); +void foo (void *); +void bar (void *); + +void +baz (void) +{ + foo (__builtin_return_address (0)); + a = 0; + bar (__builtin_return_address (0)); +}
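For reference, the transformation this peephole performs when the destination is an ordinary register (illustrative; sizes from the instruction encodings): with -Oz a small-immediate load such as "movl $5, %eax" (5 bytes) may instead be emitted as "pushq $5" followed by "popq %rax" (2 + 1 = 3 bytes). For example:

int
five (void)
{
  return 5;   /* -Oz may emit: pushq $5; popq %rax; ret */
}

The fix above simply keeps %esp/%rsp out of the "pop" position, where the instruction cannot be represented in RTL.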
[Committed] Add constraints to new andn_doubleword_bmi pattern in i386.md.
Many thanks to Uros for spotting that I'd forgotten to add constraints to the new define_insn_and_split *andn_doubleword_bmi when moving it from pre-reload to post-reload. I've pushed this obvious fix after a make bootstrap on x86_64-pc-linux-gnu. Sorry for the inconvenience to anyone building the tree with a non-default architecture that enables BMI. 2022-07-01 Roger Sayle Uroš Bizjak gcc/ChangeLog * config/i386/i386.md (*andn3_doubleword_bmi): Add constraints to post-reload define_insn_and_split. Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 3401814..352a21c 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -10405,10 +10405,10 @@ }) (define_insn_and_split "*andn3_doubleword_bmi" - [(set (match_operand: 0 "register_operand") + [(set (match_operand: 0 "register_operand" "=r") (and: - (not: (match_operand: 1 "register_operand")) - (match_operand: 2 "nonimmediate_operand"))) + (not: (match_operand: 1 "register_operand" "0")) + (match_operand: 2 "nonimmediate_operand" "ro"))) (clobber (reg:CC FLAGS_REG))] "TARGET_BMI" "#"
RE: [x86 PATCH] PR rtl-optimization/96692: ((A|B)^C)^A using andn with -mbmi.
Hi Uros, Thanks for the review. This patch implements all of your suggestions, both removing ix86_pre_reload_split from the combine splitter(s), and dividing the original splitter up into four simpler variants, that use match_dup to handle the variants/permutations caused by operator commutativity. This revised patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2022-07-04 Roger Sayle Uroš Bizjak gcc/ChangeLog PR rtl-optimization/96692 * config/i386/i386.md (define_split): Split ((A | B) ^ C) ^ D as (X & ~Y) ^ Z on target BMI when either C or D is A or B. gcc/testsuite/ChangeLog PR rtl-optimization/96692 * gcc.target/i386/bmi-andn-4.c: New test case. Thanks again, Roger -- > -Original Message- > From: Uros Bizjak > Sent: 26 June 2022 18:08 > To: Roger Sayle > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [x86 PATCH] PR rtl-optimization/96692: ((A|B)^C)^A using andn > with > -mbmi. > > On Sun, Jun 26, 2022 at 2:04 PM Roger Sayle > wrote: > > > > > > This patch addresses PR rtl-optimization/96692 on x86_64, by providing > > a define_split for combine to convert the three operation ((A|B)^C)^D > > into a two operation sequence using andn when either A or B is the > > same register as C or D. This is essentially a reassociation problem > > that's only a win if the target supports an and-not instruction (as with > > -mbmi). > > > > Hence for the new test case: > > > > int f(int a, int b, int c) > > { > > return (a ^ b) ^ (a | c); > > } > > > > GCC on x86_64-pc-linux-gnu wth -O2 -mbmi would previously generate: > > > > xorl%edi, %esi > > orl %edx, %edi > > movl%esi, %eax > > xorl%edi, %eax > > ret > > > > but with this patch now generates: > > > > andn%edx, %edi, %eax > > xorl%esi, %eax > > ret > > > > I'll investigate whether this optimization can also be implemented > > more generically in simplify_rtx when the backend provides accurate > > rtx_costs for "(and (not ..." (as there's no optab for andn). > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32}, > > with no new failures. Ok for mainline? > > > > > > 2022-06-26 Roger Sayle > > > > gcc/ChangeLog > > PR rtl-optimization/96692 > > * config/i386/i386.md (define_split): Split ((A | B) ^ C) ^ D > > as (X & ~Y) ^ Z on target BMI when either C or D is A or B. > > > > gcc/testsuite/ChangeLog > > PR rtl-optimization/96692 > > * gcc.target/i386/bmi-andn-4.c: New test case. > > + "TARGET_BMI > + && ix86_pre_reload_split () > + && (rtx_equal_p (operands[1], operands[3]) > + || rtx_equal_p (operands[1], operands[4]) > + || (REG_P (operands[2]) > + && (rtx_equal_p (operands[2], operands[3]) > + || rtx_equal_p (operands[2], operands[4]" > > You don't need a ix86_pre_reload_split for combine splitter* > > OTOH, please split the pattern to two for each commutative operand and use > (match_dup x) instead. Something similar to [1]. > > *combine splitter is described in the documentation as the splitter pattern > that > does *not* match any existing insn pattern. > > [1] https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596804.html > > Uros. 
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 20c3b9a..d114754 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -10522,6 +10522,82 @@ (set (match_dup 0) (match_op_dup 1 [(and:SI (match_dup 3) (match_dup 2)) (const_int 0)]))]) + +;; Variant 1 of 4: Split ((A | B) ^ A) ^ C as (B & ~A) ^ C. +(define_split + [(set (match_operand:SWI48 0 "register_operand") + (xor:SWI48 + (xor:SWI48 + (ior:SWI48 (match_operand:SWI48 1 "register_operand") +(match_operand:SWI48 2 "nonimmediate_operand")) + (match_dup 1)) + (match_operand:SWI48 3 "nonimmediate_operand"))) + (clobber (reg:CC FLAGS_REG))] + "TARGET_BMI" + [(parallel + [(set (match_dup 4) (and:SWI48 (not:SWI48 (match_dup 1)) (match_dup 2))) + (clobber (reg:CC FLAGS_REG))]) + (parallel + [(set (match_dup 0) (xor:SWI48 (match_dup 4) (match_
RE: [x86 PATCH] UNSPEC_PALIGNR optimizations and clean-ups.
Hi Hongtao, Many thanks for your review. This revised patch implements your suggestions of removing the combine splitters, and instead reusing the functionality of the ssse3_palignrdi define_insn_and split. This revised patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and with --target_board=unix{-32}, with no new failures. Is this revised version Ok for mainline? 2022-07-04 Roger Sayle Hongtao Liu gcc/ChangeLog * config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti. * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use V1TImode and gen_ssse3_palignv1ti instead of TImode. * config/i386/sse.md (SSESCALARMODE): Delete. (define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode. (_palignr): Use VIMAX_AVX2_AVX512BW as a mode iterator instead of SSESCALARMODE. (ssse3_palignrdi): Optimize cases where operands[3] is 0 or 64, using a single move instruction (if required). gcc/testsuite/ChangeLog * gcc.target/i386/ssse3-palignr-2.c: New test case. Thanks in advance, Roger -- > -Original Message- > From: Hongtao Liu > Sent: 01 July 2022 03:40 > To: Roger Sayle > Cc: GCC Patches > Subject: Re: [x86 PATCH] UNSPEC_PALIGNR optimizations and clean-ups. > > On Fri, Jul 1, 2022 at 10:12 AM Hongtao Liu wrote: > > > > On Fri, Jul 1, 2022 at 2:42 AM Roger Sayle > wrote: > > > > > > > > > This patch is a follow-up to Hongtao's fix for PR target/105854. > > > That fix is perfectly correct, but the thing that caught my eye was > > > why is the compiler generating a shift by zero at all. Digging > > > deeper it turns out that we can easily optimize > > > __builtin_ia32_palignr for alignments of 0 and 64 respectively, > > > which may be simplified to moves from the highpart or lowpart. > > > > > > After adding optimizations to simplify the 64-bit DImode palignr, I > > > started to add the corresponding optimizations for vpalignr (i.e. > > > 128-bit). The first oddity is that sse.md uses TImode and a special > > > SSESCALARMODE iterator, rather than V1TImode, and indeed the comment > > > above SSESCALARMODE hints that this should be "dropped in favor of > > > VIMAX_AVX2_AVX512BW". Hence this patch includes the migration of > > > _palignr to use VIMAX_AVX2_AVX512BW, basically > > > using V1TImode instead of TImode for 128-bit palignr. > > > > > > But it was only after I'd implemented this clean-up that I stumbled > > > across the strange semantics of 128-bit [v]palignr. According to > > > https://www.felixcloutier.com/x86/palignr, the semantics are subtly > > > different based upon how the instruction is encoded. PALIGNR leaves > > > the highpart unmodified, whilst VEX.128 encoded VPALIGNR clears the > > > highpart, and (unless I'm mistaken) it looks like GCC currently uses > > > the exact same RTL/templates for both, treating one as an > > > alternative for the other. > > I think as long as patterns or intrinsics only care about the low > > part, they should be ok. > > But if we want to use default behavior for upper bits, we need to > > restrict them under specific isa(.i.e. vmovq in vec_set_0). > > Generally, 128-bit sse legacy instructions have different behaviors > > for upper bits from AVX ones, and that's why vzeroupper is introduced > > for sse <-> avx instructions transition. > > > > > > Hence I thought I'd post what I have so far (part optimization and > > > part clean-up), to then ask the x86 experts for their opinions. 
> > > > > > This patch has been tested on x86_64-pc-linux-gnu with make > > > bootstrap and make -k check, both with and without > > > --target_board=unix{-,32}, with no new failures. Ok for mainline? > > > > > > > > > 2022-06-30 Roger Sayle > > > > > > gcc/ChangeLog > > > * config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change > > > CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti. > > > * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use > V1TImode > > > and gen_ssse3_palignv1ti instead of TImode. > > > * config/i386/sse.md (SSESCALARMODE): Delete. > > > (define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode. > > > (_palignr): Use VIMAX_AVX2_AVX512BW as a > mode > > > iterator instead of SSESCALARMODE. > >
[x86 PATCH take #2] Doubleword version of and; cmp to not; test optimization.
This patch is the latest revision of the patch originally posted at: https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596201.html This patch extends the earlier and;cmp to not;test optimization to also perform this transformation for TImode on TARGET_64BIT and DImode on -m32, One motivation for this is that it's a step to fixing the current failure of gcc.target/i386/pr65105-5.c on -m32. A more direct benefit for x86_64 is that the following code: int foo(__int128 x, __int128 y) { return (x & y) == y; } improves with -O2 from 15 instructions: movq%rdi, %r8 movq%rsi, %rax movq%rax, %rdi movq%r8, %rsi movq%rdx, %r8 andq%rdx, %rsi andq%rcx, %rdi movq%rsi, %rax movq%rdi, %rdx xorq%r8, %rax xorq%rcx, %rdx orq %rdx, %rax sete%al movzbl %al, %eax ret to the slightly better 13 instructions: movq%rdi, %r8 movq%rsi, %rax movq%r8, %rsi movq%rax, %rdi notq%rsi notq%rdi andq%rdx, %rsi andq%rcx, %rdi movq%rsi, %rax orq %rdi, %rax sete%al movzbl %al, %eax ret Now that all of the doubleword pieces are already in the tree, this patch is now much shorter (an rtx_costs improvement and a single new define_insn_and_split), however I couldn't resist including two very minor pattern naming tweaks/clean-ups to fix nits. This revised patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, where on TARGET_64BIT there are no new failures, and on --target_board=unix{-m32} with a single new failure; the other dg-final in gcc.target/i386/pr65105-5.c now also fails (as that code diverges further from the expected vectorized output). This is progress as both FAILs in pr65105-5.c may now be fixed by changes localized to the STV pass. OK for mainline? 2022-07-04 Roger Sayle gcc/ChangeLog * config/i386/i386.cc (ix86_rtx_costs) : Provide costs for double word comparisons and tests (comparisons against zero). * config/i386/i386.md (*test_not_doubleword): Split DWI and;cmp into andn;cmp $0 as a pre-reload splitter. (*andn3_doubleword_bmi): Use instead of in name. (*3_doubleword): Likewise. gcc/testsuite/ChangeLog * gcc.target/i386/testnot-3.c: New test case. Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index b15b489..70c9a27 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -20935,6 +20935,19 @@ ix86_rtx_costs (rtx x, machine_mode mode, int outer_code_i, int opno, return true; } + if (SCALAR_INT_MODE_P (GET_MODE (op0)) + && GET_MODE_SIZE (GET_MODE (op0)) > UNITS_PER_WORD) + { + if (op1 == const0_rtx) + *total = cost->add ++ rtx_cost (op0, GET_MODE (op0), outer_code, opno, speed); + else + *total = 3*cost->add ++ rtx_cost (op0, GET_MODE (op0), outer_code, opno, speed) ++ rtx_cost (op1, GET_MODE (op0), outer_code, opno, speed); + return true; + } + /* The embedded comparison operand is completely free. 
*/ if (!general_operand (op0, GET_MODE (op0)) && op1 == const0_rtx) *total = 0; diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 20c3b9a..2492ad4 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -9792,7 +9792,25 @@ (set (reg:CCZ FLAGS_REG) (compare:CCZ (and:SWI (match_dup 2) (match_dup 1)) (const_int 0)))] + "operands[2] = gen_reg_rtx (mode);") + +;; Split and;cmp (as optimized by combine) into andn;cmp $0 +(define_insn_and_split "*test_not_doubleword" + [(set (reg:CCZ FLAGS_REG) + (compare:CCZ + (and:DWI + (not:DWI (match_operand:DWI 0 "nonimmediate_operand")) + (match_operand:DWI 1 "nonimmediate_operand")) + (const_int 0)))] + "ix86_pre_reload_split ()" + "#" + "&& 1" + [(parallel + [(set (match_dup 2) (and:DWI (not:DWI (match_dup 0)) (match_dup 1))) + (clobber (reg:CC FLAGS_REG))]) + (set (reg:CCZ FLAGS_REG) (compare:CCZ (match_dup 2) (const_int 0)))] { + operands[0] = force_reg (mode, operands[0]); operands[2] = gen_reg_rtx (mode); }) @@ -10404,7 +10422,7 @@ operands[2] = gen_int_mode (INTVAL (operands[2]), QImode); }) -(define_insn_and_split "*andn3_doubleword_bmi" +(define_insn_and_split "*andn3_doubleword_bmi" [(set (match_operand: 0 "register_operand" "=r") (and: (not: (match_operand: 1 "register_operand" "r")) @@ -10542,7 +105
[x86 PATCH] Support *testdi_not_doubleword during STV pass.
This patch fixes the current two FAILs of pr65105-5.c on x86 when compiled with -m32. These (temporary) breakages were fallout from my patches to improve/upgrade (scalar) double word comparisons. On mainline, the i386 backend currently represents a critical comparison using (compare (and (not reg1) reg2) (const_int 0)) which isn't/wasn't recognized by the STV pass' convertible_comparison_p. This simple STV patch adds support for this pattern (*testdi_not_doubleword) and generates the vector pandn and ptest instructions expected in the existing (failing) test case. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, where with --target_board=unix{-m32} there are two fewer failures, and without, there are no new failures. Ok for mainline? 2022-07-07 Roger Sayle gcc/ChangeLog * config/i386/i386-features.cc (convert_compare): Add support for *testdi_not_doubleword pattern (i.e. "(compare (and (not ...") by generating a pandn followed by ptest. (convertible_comparison_p): Recognize both *cmpdi_doubleword and recent *testdi_not_doubleword comparison patterns. Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc index be38586..a7bd172 100644 --- a/gcc/config/i386/i386-features.cc +++ b/gcc/config/i386/i386-features.cc @@ -938,10 +938,10 @@ general_scalar_chain::convert_compare (rtx op1, rtx op2, rtx_insn *insn) { rtx tmp = gen_reg_rtx (vmode); rtx src; - convert_op (&op1, insn); /* Comparison against anything other than zero, requires an XOR. */ if (op2 != const0_rtx) { + convert_op (&op1, insn); convert_op (&op2, insn); /* If both operands are MEMs, explicitly load the OP1 into TMP. */ if (MEM_P (op1) && MEM_P (op2)) @@ -953,8 +953,25 @@ general_scalar_chain::convert_compare (rtx op1, rtx op2, rtx_insn *insn) src = op1; src = gen_rtx_XOR (vmode, src, op2); } + else if (GET_CODE (op1) == AND + && GET_CODE (XEXP (op1, 0)) == NOT) +{ + rtx op11 = XEXP (XEXP (op1, 0), 0); + rtx op12 = XEXP (op1, 1); + convert_op (&op11, insn); + convert_op (&op12, insn); + if (MEM_P (op11)) + { + emit_insn_before (gen_rtx_SET (tmp, op11), insn); + op11 = tmp; + } + src = gen_rtx_AND (vmode, gen_rtx_NOT (vmode, op11), op12); +} else -src = op1; +{ + convert_op (&op1, insn); + src = op1; +} emit_insn_before (gen_rtx_SET (tmp, src), insn); if (vmode == V2DImode) @@ -1399,17 +1416,29 @@ convertible_comparison_p (rtx_insn *insn, enum machine_mode mode) rtx op1 = XEXP (src, 0); rtx op2 = XEXP (src, 1); - if (!CONST_INT_P (op1) - && ((!REG_P (op1) && !MEM_P (op1)) - || GET_MODE (op1) != mode)) -return false; - - if (!CONST_INT_P (op2) - && ((!REG_P (op2) && !MEM_P (op2)) - || GET_MODE (op2) != mode)) -return false; + /* *cmp_doubleword. */ + if ((CONST_INT_P (op1) + || ((REG_P (op1) || MEM_P (op1)) + && GET_MODE (op1) == mode)) + && (CONST_INT_P (op2) + || ((REG_P (op2) || MEM_P (op2)) + && GET_MODE (op2) == mode))) +return true; + + /* *test_not_doubleword. */ + if (op2 == const0_rtx + && GET_CODE (op1) == AND + && GET_CODE (XEXP (op1, 0)) == NOT) +{ + rtx op11 = XEXP (XEXP (op1, 0), 0); + rtx op12 = XEXP (op1, 1); + return (REG_P (op11) || MEM_P (op11)) +&& (REG_P (op12) || MEM_P (op12)) +&& GET_MODE (op11) == mode +&& GET_MODE (op12) == mode; +} - return true; + return false; } /* The general version of scalar_to_vector_candidate_p. */
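A hedged illustration (not pr65105-5.c itself) of source code whose -m32 doubleword comparison is canonicalized by combine into the (compare (and (not reg1) reg2) (const_int 0)) form that convertible_comparison_p now accepts:

/* (x & y) == y is equivalent to (~x & y) == 0, which combine expresses
   as a test of (and (not x) y) against zero; with -m32 this is a
   DImode doubleword comparison that STV may now convert to pandn and
   ptest.  */
int
contains_all (unsigned long long x, unsigned long long y)
{
  return (x & y) == y;
}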
[PATCH/RFC] combine_completed global variable.
Hi Kewen (and Segher), Many thanks for stress testing my patch to improve multiplication by integer constants on rs6000 by using the rldmi instruction. Although I've not been able to reproduce your ICE (using gcc135 on the compile farm), I completely agree with Segher's analysis that the Achilles heel with my approach/patch is that there's currently no way for the backend/recog to know that we're in a pass after combine. Rather than give up on this optimization (and a similar one for I386.md where test;sete can be replaced by xor $1 when combine knows that nonzero_bits is 1, but loses that information afterwards), I thought I'd post this "strawman" proposal to add a combine_completed global variable, matching the reload_completed and regstack_completed global variables already used (to track progress) by the middle-end. I was wondering if I could ask you could test the attached patch in combination with my previous rs6000.md patch (with the obvious change of reload_completed to combine_completed) to confirm that it fixes the problems you were seeing. Segher/Richard, would this sort of patch be considered acceptable? Or is there a better approach/solution? 2022-07-07 Roger Sayle gcc/ChangeLog * combine.cc (combine_completed): New global variable. (rest_of_handle_combine): Set combine_completed after pass. * final.cc (rest_of_clean_state): Reset combine_completed. * rtl.h (combine_completed): Prototype here. Many thanks in advance, Roger -- > -Original Message- > From: Kewen.Lin > Sent: 27 June 2022 10:04 > To: Roger Sayle > Cc: gcc-patches@gcc.gnu.org; Segher Boessenkool > ; David Edelsohn > Subject: Re: [rs6000 PATCH] Improve constant integer multiply using rldimi. > > Hi Roger, > > on 2022/6/27 04:56, Roger Sayle wrote: > > > > > > This patch tweaks the code generated on POWER for integer > > multiplications > > > > by a constant, by making use of rldimi instructions. Much like x86's > > > > lea instruction, rldimi can be used to implement a shift and add pair > > > > in some circumstances. For rldimi this is when the shifted operand > > > > is known to have no bits in common with the added operand. > > > > > > > > Hence for the new testcase below: > > > > > > > > int foo(int x) > > > > { > > > > int t = x & 42; > > > > return t * 0x2001; > > > > } > > > > > > > > when compiled with -O2, GCC currently generates: > > > > > > > > andi. 3,3,0x2a > > > > slwi 9,3,13 > > > > add 3,9,3 > > > > extsw 3,3 > > > > blr > > > > > > > > with this patch, we now generate: > > > > > > > > andi. 3,3,0x2a > > > > rlwimi 3,3,13,0,31-13 > > > > extsw 3,3 > > > > blr > > > > > > > > It turns out this optimization already exists in the form of a combine > > > > splitter in rs6000.md, but the constraints on combine splitters, > > > > requiring three of four input instructions (and generating one or two > > > > output instructions) mean it doesn't get applied as often as it could. > > > > This patch converts the define_split into a define_insn_and_split to > > > > catch more cases (such as the one above). > > > > > > > > The one bit that's tricky/controversial is the use of RTL's > > > > nonzero_bits which is accurate during the combine pass when this > > > > pattern is first recognized, but not as advanced (not kept up to > > > > date) when this pattern is eventually split. To support this, > > > > I've used a "|| reload_completed" idiom. Does this approach seem > > > > reasonable? [I've another patch of x86 that uses the same idiom]. 
> > > > > > I tested this patch on powerpc64-linux-gnu, it caused the below ICE against > test > case gcc/testsuite/gcc.c-torture/compile/pr93098.c. > > gcc/testsuite/gcc.c-torture/compile/pr93098.c: In function ‘foo’: > gcc/testsuite/gcc.c-torture/compile/pr93098.c:10:1: error: unrecognizable > insn: > (insn 104 32 34 2 (set (reg:SI 185 [+4 ]) > (ior:SI (and:SI (reg:SI 200 [+4 ]) > (const_int 4294967295 [0x])) > (ashift:SI (reg:SI 140) > (const_int 32 [0x20] "gcc/testsuite/gcc.c- > torture/compile/pr93098.c":6:11 -1 > (nil)) > during RTL pass: subreg3 > dump file: pr93098.c.291r.subreg3 > gcc
[PATCH] Be careful with MODE_CC in simplify_const_relational_operation.
I think it's fair to describe RTL's representation of condition flags using MODE_CC as a little counter-intuitive. For example, the i386 backend represents the carry flag (in adc instructions) using RTL of the form "(ltu:SI (reg:CCC) (const_int 0))", where great care needs to be taken not to treat this like a normal RTX expression, after all LTU (less-than-unsigned) against const0_rtx would normally always be false. Hence, MODE_CC comparisons need to be treated with caution, and simplify_const_relational_operation returns early (to avoid problems) when GET_MODE_CLASS (GET_MODE (op0)) == MODE_CC. However, consider the (currently) hypothetical situation, where the RTL optimizers determine that a previous instruction unconditionally sets or clears the carry flag, and this gets propagated by combine into the above expression, we'd end up with something that looks like (ltu:SI (const_int 1) (const_int 0)), which doesn't mean what it says. Fortunately, simplify_const_relational_operation is passed the original mode of the comparison (cmp_mode, the original mode of op0) which can be checked for MODE_CC, even when op0 is now VOIDmode (const_int) after the substitution. Defending against this is clearly the right thing to do. More controversially, rather than just abort simplification/optimization in this case, we can use the comparison operator to infer/select the semantics of the CC_MODE flag. Hopefully, whenever a backend uses LTU, it represents the (set) carry flag (and behaves like i386.md), in which case the result of the simplified expression is the first operand. [If there's no standardization of semantics across backends, then we should always just return 0; but then miss potential optimizations]. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, with no new failures, and in combination with a i386 backend patch (that introduces support for x86's stc and clc instructions) where it avoids failures. However, I'm submitting this middle-end piece independently, to confirm that maintainers/reviewers are happy with the approach, and also to check there are no issues on other platforms, before building upon this infrastructure. Thoughts? Ok for mainline? 2022-07-07 Roger Sayle gcc/ChangeLog * simplify-rtx.cc (simplify_const_relational_operation): Handle case where both operands of a MODE_CC comparison have been simplified to constant integers. Thanks in advance, Roger -- diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index fa20665..73ec5c7 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -6026,6 +6026,18 @@ simplify_const_relational_operation (enum rtx_code code, return 0; } + /* Handle MODE_CC comparisons that have been simplified to + constants. */ + if (GET_MODE_CLASS (mode) == MODE_CC + && op1 == const0_rtx + && CONST_INT_P (op0)) +{ + /* LTU represents the carry flag. */ + if (code == LTU) + return op0 == const0_rtx ? const0_rtx : const_true_rtx; + return 0; +} + /* We can't simplify MODE_CC values since we don't know what the actual comparison is. */ if (GET_MODE_CLASS (GET_MODE (op0)) == MODE_CC)
[x86 PATCH] Fun with flags: Adding stc/clc instructions to i386.md.
This patch adds support for x86's single-byte encoded stc (set carry flag) and clc (clear carry flag) instructions to i386.md. The motivating example is the simple code snippet: unsigned int foo (unsigned int a, unsigned int b, unsigned int *c) { return __builtin_ia32_addcarryx_u32 (1, a, b, c); } which uses the target built-in to generate an adc instruction, adding together A and B with the incoming carry flag already set. Currently for this mainline GCC generates (with -O2): movl$1, %eax addb$-1, %al adcl%esi, %edi setc%al movl%edi, (%rdx) movzbl %al, %eax ret where the first two instructions (to load 1 into a byte register and then add 255 to it) are the idiom used to set the carry flag. This is a little inefficient as x86 has a "stc" instruction for precisely this purpose. With the attached patch we now generate: stc adcl%esi, %edi setc%al movl%edi, (%rdx) movzbl %al, %eax ret The central part of the patch is the addition of x86_stc and x86_clc define_insns, represented as "(set (reg:CCC FLAGS_REG) (const_int 1))" and "(set (reg:CCC FLAGS_REG) (const_int 0))" respectively, then using x86_stc appropriately in the ix86_expand_builtin. Alas this change exposes two latent bugs/issues in the compiler. The first is that there are several peephole2s in i386.md that propagate the flags register, but take its mode from the SET_SRC rather than preserve the mode of the original SET_DEST. The other, which is being discussed with Segher, is that the middle-end's simplify-rtx inappropriately tries to interpret/optimize MODE_CC comparisons, converting the above adc into an add, as it mistakenly believes (ltu:SI (const_int 1) (const_int 0))" is always const0_rtx even when the mode of the comparison is MODE_CCC. I believe Segher will review (and hopefully approve) the middle-end chunk of this patch independently, but hopefully this backend patch provides the necessary context to explain why that change is needed. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2022-07-08 Roger Sayle gcc/ChangeLog * config/i386/i386-expand.cc (ix86_expand_builtin) : Use new x86_stc or negqi_ccc_1 instructions to set the carry flag. * config/i386/i386.md (x86_clc): New define_insn. (x86_stc): Likewise, new define_insn to set the carry flag. (*setcc_qi_negqi_ccc_1_): New define_insn_and_split to recognize (and eliminate) the carry flag being copied to itself. (neg_ccc_1): Renamed from *neg_ccc_1 for gen function. (define_peephole2): Use match_operand of flags_reg_operand to capture and preserve the mode of FLAGS_REG. (define_peephole2): Likewise. * simplify-rtx.cc (simplify_const_relational_operation): Handle case where both operands of a MODE_CC comparison have been simplified to constant integers. gcc/testsuite/ChangeLog * gcc.target/i386/stc-1.c: New test case. Thanks in advance (both Uros and Segher), Roger -- > -Original Message- > From: Segher Boessenkool > Sent: 07 July 2022 23:39 > To: Roger Sayle > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [PATCH] Be careful with MODE_CC in > simplify_const_relational_operation. > > Hi! > > On Thu, Jul 07, 2022 at 10:08:04PM +0100, Roger Sayle wrote: > > I think it's fair to describe RTL's representation of condition flags > > using MODE_CC as a little counter-intuitive. 
> > "A little challenging", and you should see that as a good thing, as a puzzle to > crack :-) > > > For example, the i386 > > backend represents the carry flag (in adc instructions) using RTL of > > the form "(ltu:SI (reg:CCC) (const_int 0))", where great care needs to > > be taken not to treat this like a normal RTX expression, after all LTU > > (less-than-unsigned) against const0_rtx would normally always be > > false. > > A comparison of a MODE_CC thing against 0 means the result of a > *previous* comparison (or other cc setter) is looked at. Usually it simply looks > at some condition bits in a flags register. It does not do any actual comparison: > that has been done before (if at all even). > > > Hence, MODE_CC comparisons need to be treated with caution, and > > simplify_const_relational_operation returns early (to avoid > > problems) when GET_MODE_CLASS (GET_MODE (op0)) == MODE_CC. > > Not just to avoid problems: there simply isn't enough information to do a > correct job. > > > However, consider the (currently) hypothetical situation, where the > > RTL optimizers dete
[gcc12 backport] PR target/105930: Split *xordi3_doubleword after reload on x86.
This is a backport of the fix for PR target/105930 from mainline to the gcc12 release branch. This patch has been retested against the gcc12 branch on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for the gcc12 branch? 2022-07-09 Roger Sayle Uroš Bizjak gcc/ChangeLog PR target/105930 * config/i386/i386.md (*di3_doubleword): Split after reload. Use rtx_equal_p to avoid creating memory-to-memory moves, and emit NOTE_INSN_DELETED if operand[2] is zero (i.e. with -O0). Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 7c9560fc4..1c4781d 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -10400,22 +10400,25 @@ "ix86_expand_binary_operator (, mode, operands); DONE;") (define_insn_and_split "*di3_doubleword" - [(set (match_operand:DI 0 "nonimmediate_operand") + [(set (match_operand:DI 0 "nonimmediate_operand" "=ro,r") (any_or:DI -(match_operand:DI 1 "nonimmediate_operand") -(match_operand:DI 2 "x86_64_szext_general_operand"))) +(match_operand:DI 1 "nonimmediate_operand" "0,0") +(match_operand:DI 2 "x86_64_szext_general_operand" "re,o"))) (clobber (reg:CC FLAGS_REG))] "!TARGET_64BIT - && ix86_binary_operator_ok (, DImode, operands) - && ix86_pre_reload_split ()" + && ix86_binary_operator_ok (, DImode, operands)" "#" - "&& 1" + "&& reload_completed" [(const_int 0)] { + /* This insn may disappear completely when operands[2] == const0_rtx + and operands[0] == operands[1], which requires a NOTE_INSN_DELETED. */ + bool emit_insn_deleted_note_p = false; + split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]); if (operands[2] == const0_rtx) -emit_move_insn (operands[0], operands[1]); +emit_insn_deleted_note_p = true; else if (operands[2] == constm1_rtx) { if ( == IOR) @@ -10427,7 +10430,10 @@ ix86_expand_binary_operator (, SImode, &operands[0]); if (operands[5] == const0_rtx) -emit_move_insn (operands[3], operands[4]); +{ + if (emit_insn_deleted_note_p) + emit_note (NOTE_INSN_DELETED); +} else if (operands[5] == constm1_rtx) { if ( == IOR)
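As a purely illustrative aside (these functions are not from the PR or the testsuite), the kind of -m32 code this doubleword pattern covers looks like:

/* With -m32, each DImode logical operation below is split into two SImode
   operations, now after reload rather than before.  */
unsigned long long
flip_bits (unsigned long long x)
{
  return x ^ 0x100000001ULL;      /* one xorl per 32-bit half */
}

unsigned long long
set_low_word (unsigned long long x)
{
  return x | 0xffffffffULL;       /* low half becomes all-ones (the constm1
                                     case); the high half's "| 0" needs no
                                     instruction at all */
}

The NOTE_INSN_DELETED case in the ChangeLog corresponds to both halves being zero, i.e. an "x | 0" that survives all the way to reload, as can happen at -O0.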
[x86_64 PATCH] Improved Scalar-To-Vector (STV) support for TImode to V1TImode.
This patch upgrades x86_64's scalar-to-vector (STV) pass to more aggressively transform 128-bit scalar TImode operations into vector V1TImode operations performed on SSE registers. TImode functionality already exists in STV, but only for move operations, this changes brings support for logical operations (AND, IOR, XOR, NOT and ANDN) and comparisons. The effect of these changes are conveniently demonstrated by the new sse4_1-stv-5.c test case: __int128 a[16]; __int128 b[16]; __int128 c[16]; void foo() { for (unsigned int i=0; i<16; i++) a[i] = b[i] & ~c[i]; } which when currently compiled on mainline wtih -O2 -msse4 produces: foo:xorl%eax, %eax .L2:movqc(%rax), %rsi movqc+8(%rax), %rdi addq$16, %rax notq%rsi notq%rdi andqb-16(%rax), %rsi andqb-8(%rax), %rdi movq%rsi, a-16(%rax) movq%rdi, a-8(%rax) cmpq$256, %rax jne .L2 ret but with this patch now produces: foo:xorl%eax, %eax .L2:movdqa c(%rax), %xmm0 pandn b(%rax), %xmm0 addq$16, %rax movaps %xmm0, a-16(%rax) cmpq$256, %rax jne .L2 ret Technically, the STV pass is implemented by three C++ classes, a common abstract base class "scalar_chain" that contains common functionality, and two derived classes: general_scalar_chain (which handles SI and DI modes) and timode_scalar_chain (which handles TI modes). As mentioned previously, because only TI mode moves were handled the two worker classes behaved significantly differently. These changes bring the functionality of these two classes closer together, which is reflected by refactoring more shared code from general_scalar_chain to the parent scalar_chain and reusing it from timode. There still remain significant differences (and simplifications) so the existing division of classes (as specializations) continues to make sense. Obviously, there are more changes to come (shifts and rotates), and compute_convert_gain doesn't yet have its final (tuned) form, but is already an improvement over the "return 1;" used previously. This patch has been tested on x86_64-pc-linux-gnu with make boostrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2022-07-09 Roger Sayle gcc/ChangeLog * config/i386/i386-features.h (scalar_chain): Add fields insns_conv, n_sse_to_integer and n_integer_to_sse to this parent class, moved from general_scalar_chain. (scalar_chain::convert_compare): Protected method moved from general_scalar_chain. (mark_dual_mode_def): Make protected, not private virtual. (scalar_chain:convert_op): New private virtual method. (general_scalar_chain::general_scalar_chain): Simplify constructor. (general_scalar_chain::~general_scalar_chain): Delete destructor. (general_scalar_chain): Move insns_conv, n_sse_to_integer and n_integer_to_sse fields to parent class, scalar_chain. (general_scalar_chain::mark_dual_mode_def): Delete prototype. (general_scalar_chain::convert_compare): Delete prototype. (timode_scalar_chain::compute_convert_gain): Remove simplistic implementation, convert to a method prototype. (timode_scalar_chain::mark_dual_mode_def): Delete prototype. (timode_scalar_chain::convert_op): Prototype new virtual method. * config/i386/i386-features.cc (scalar_chain::scalar_chain): Allocate insns_conv and initialize n_sse_to_integer and n_integer_to_sse fields in constructor. (scalar_chain::scalar_chain): Free insns_conv in destructor. (general_scalar_chain::general_scalar_chain): Delete constructor, now defined in the class declaration. (general_scalar_chain::~general_scalar_chain): Delete destructor. 
(scalar_chain::mark_dual_mode_def): Renamed from general_scalar_chain::mark_dual_mode_def. (timode_scalar_chain::mark_dual_mode_def): Delete. (scalar_chain::convert_compare): Renamed from general_scalar_chain::convert_compare. (timode_scalar_chain::compute_convert_gain): New method to determine the gain from converting a TImode chain to V1TImode. (timode_scalar_chain::convert_op): New method to convert an operand from TImode to V1TImode. (timode_scalar_chain::convert_insn) : Only PUT_MODE on REG_EQUAL notes that were originally TImode (not CONST_INT). Handle AND, ANDN, XOR, IOR, NOT and COMPARE. (timode_mem_p): Helper predicate to check where operand is memory reference with sufficient alignment for TImode STV. (timode_scalar_to_vector_candidate_p): Use convertible_comparison_p to check whether COMPARE is convertible. Handle SET_DESTs that that are
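As an aside, a purely illustrative example (not one of the new tests) of the comparison support mentioned above: a 128-bit equality test like the one below is the kind of TImode operation convertible_comparison_p can now accept, replacing the two scalar 64-bit compares with a vector compare/ptest sequence whenever compute_convert_gain considers the conversion profitable.

__int128 x, y;

int
all_equal (void)
{
  return x == y;
}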
[PATCH] Move reload_completed and other rtl.h globals to crtl structure.
This patch builds upon Richard Biener's suggestion of avoiding global variables to track state/identify which passes have already been run. In the early middle-end, the tree-ssa passes use the curr_properties field in cfun to track this. This patch uses a new rtl_pass_progress int field in crtl to do something similar. This patch allows the global variables lra_in_progress, reload_in_progress, reload_completed, epilogue_completed and regstack_completed to be removed from rtl.h and implemented as bits within the new crtl->rtl_pass_progress. I've also taken the liberty of adding a new combine_completed bit at the same time [to respond the Segher's comment it's easy to change this to combine1_completed and combine2_completed if we ever perform multiple combine passes (or multiple reload/regstack passes)]. At the same time, I've also refactored bb_reorder_complete into the same new field; interestingly bb_reorder_complete was already a bool in crtl. One very minor advantage of this implementation/refactoring is that the predicate "can_create_pseudo_p ()" which is semantically defined to be !reload_in_progress && !reload_completed, can now be performed very efficiently as effectively the test (progress & 12) == 0, i.e. a single test instruction on x86. For consistency, I've also moved cse_not_expected (the last remaining global variable in rtl.h) into crtl, as its own bool field. The vast majority of this patch is then churn to handle these changes. Thanks to macros, most code is unaffected, assuming it treats those global variables as r-values, though some source files required/may require tweaks as these "variables" are now defined in emit-rtl.h instead of rtl.h. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, with no new failures. Might this clean-up be acceptable in stage 1, given the possible temporary disruption transitioning some backends? I'll start checking various backends myself with cross-compilers, but if Jeff Law could spin this patch on his build farm, that would help identify targets that need attention. 2022-07-10 Roger Sayle gcc/ChangeLog * bb-reorder.cc (reorder_basic_blocks): bb_reorder_complete is now a bit in crtl->rtl_pass_progress. * cfgrtl.cc (rtl_split_edge): Likewise. (fixup_partitions): Likewise. (verify_hot_cold_block_grouping): Likewise. (cfg_layout_initialize): Likewise. * combine.cc (rest_of_handle_combine): Set combine_completed bit in crtl->rtl_pass_progress. * cse.cc (rest_of_handle_cse): cse_not_expected is now a field in crtl. (rest_of_handle_cse2): Likewise. (rest_of_handle_cse_after_global_opts): Likewise. * df-problems.cc: Include emit-rtl.h to access RTL pass progress variables. * emit-rtl.h (PROGRESS_reload_completed): New bit masks. (rtl_data::rtl_pass_progress): New integer field to track progress. (rtl_data::bb_reorder_complete): Delete, no part of rtl_pass_progress. (rtl_data::cse_not_expected): New bool field, previously a global variable. (crtl_pass_progress): New convience macro. (combine_completed): New macro. (lra_in_progress): New macro replacing global variable. (reload_in_progress): Likewise. (reload_completed): Likewise. (bb_reorder_complete): New macro replacing bool field in crtl. (epilogue_completed): New macro replacing global variable. (regstack_completed): Likewise. (can_create_pseudo_p): Move from rtl.h and update definition. * explow.cc (memory_address_addr_space): cse_not_expected is now a field in crtl. (use_anchored_address): Likewise. 
* final.c (rest_of_clean_state): Reset crtl->rtl_pass_progress to zero. * function.cc (prepare_function_start): cse_not_expected is now a field in crtl. (thread_prologue_and_epilogue_insns): epilogue_completed is now a bit in crtl->rtl_pass_progress. * ifcvt.cc (noce_try_cmove_arith): cse_not_expected is now a field in crtl. * lra-eliminations.cc (init_elim_table): lra_in_progress is now a bit in crtl->rtl_pass_progress. * lra.cc (lra_in_progress): Delete global variable. (lra): lra_in_progress and reload_completed are now bits in crtl->rtl_pass_progress. * modulo-sched.cc (sms_schedule): reload_completed is now a bit in crtl->rtl_pass_progress. * passes.cc (skip_pass): reload_completed and epilogue_completed are now bits in crtl->rtl_pass_progress. * recog.cc (reload_completed): Delete global variable. (epilogue_completed): Likewise. * reg-stack.cc (regstack_completed): Likewise. (rest_of_handle_stack_r
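For readers following along, here is a hypothetical sketch (the names and bit values are invented for illustration; the real encoding is whatever the new emit-rtl.h defines) of how several pass-progress booleans can share one integer field, and why can_create_pseudo_p () collapses to a single masked test:

enum {
  PROGRESS_lra_in_progress    = 1,   /* assumed value */
  PROGRESS_reload_in_progress = 4,   /* assumed value */
  PROGRESS_reload_completed   = 8    /* assumed value */
};

struct rtl_data_sketch { int rtl_pass_progress; };
struct rtl_data_sketch crtl_sketch;

int
can_create_pseudo_p_sketch (void)
{
  /* !reload_in_progress && !reload_completed becomes (progress & 12) == 0.  */
  return (crtl_sketch.rtl_pass_progress
          & (PROGRESS_reload_in_progress | PROGRESS_reload_completed)) == 0;
}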
RE: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for TImode to V1TImode.
Hi Uros, Yes, I agree. I think it makes sense to have a single STV pass (after combine and before reload). Let's hear what HJ thinks, but I'm happy to investigate a follow-up patch that unifies the STV passes. But it'll be easier to confirm there are no "code generation" changes if those modifications are pushed independently of these ones. Time to look into the (git) history of multiple STV passes... Thanks for the review. I'll wait for HJ's thoughts. Cheers, Roger -- > -Original Message- > From: Uros Bizjak > Sent: 10 July 2022 19:06 > To: Roger Sayle > Cc: gcc-patches@gcc.gnu.org; H. J. Lu > Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for > TImode to V1TImode. > > On Sat, Jul 9, 2022 at 2:17 PM Roger Sayle > wrote: > > > > > > This patch upgrades x86_64's scalar-to-vector (STV) pass to more > > aggressively transform 128-bit scalar TImode operations into vector > > V1TImode operations performed on SSE registers. TImode functionality > > already exists in STV, but only for move operations, this changes > > brings support for logical operations (AND, IOR, XOR, NOT and ANDN) > > and comparisons. > > > > The effect of these changes are conveniently demonstrated by the new > > sse4_1-stv-5.c test case: > > > > __int128 a[16]; > > __int128 b[16]; > > __int128 c[16]; > > > > void foo() > > { > > for (unsigned int i=0; i<16; i++) > > a[i] = b[i] & ~c[i]; > > } > > > > which when currently compiled on mainline wtih -O2 -msse4 produces: > > > > foo:xorl%eax, %eax > > .L2:movqc(%rax), %rsi > > movqc+8(%rax), %rdi > > addq$16, %rax > > notq%rsi > > notq%rdi > > andqb-16(%rax), %rsi > > andqb-8(%rax), %rdi > > movq%rsi, a-16(%rax) > > movq%rdi, a-8(%rax) > > cmpq$256, %rax > > jne .L2 > > ret > > > > but with this patch now produces: > > > > foo:xorl%eax, %eax > > .L2:movdqa c(%rax), %xmm0 > > pandn b(%rax), %xmm0 > > addq$16, %rax > > movaps %xmm0, a-16(%rax) > > cmpq$256, %rax > > jne .L2 > > ret > > > > Technically, the STV pass is implemented by three C++ classes, a > > common abstract base class "scalar_chain" that contains common > > functionality, and two derived classes: general_scalar_chain (which > > handles SI and DI modes) and timode_scalar_chain (which handles TI > > modes). As mentioned previously, because only TI mode moves were > > handled the two worker classes behaved significantly differently. > > These changes bring the functionality of these two classes closer > > together, which is reflected by refactoring more shared code from > > general_scalar_chain to the parent scalar_chain and reusing it from > > timode. There still remain significant differences (and > > simplifications) so the existing division of classes (as specializations) > > continues > to make sense. > > Please note that there are in fact two STV passes, one before combine and the > other after combine. The TImode pass that previously handled only loads and > stores is positioned before combine (there was a reason for this decision, > but I > don't remember the details - let's ask HJ...). However, DImode STV pass > transforms much more instructions and the reason it was positioned after the > combine pass was that STV pass transforms optimized insn stream where > forward propagation was already performed. > > What is not clear to me from the above explanation is: is the new TImode STV > pass positioned after the combine pass, and if this is the case, how the > change > affects current load/store TImode STV pass. 
I must admit, I don't like two > separate STV passess, so if TImode is now similar to DImode, I suggest we > abandon STV1 pass and do everything concerning TImode after the combine > pass. HJ, what is your opinion on this? > > Other than the above, the patch LGTM to me. > > Uros. > > > Obviously, there are more changes to come (shifts and rotates), and > > compute_convert_gain doesn't yet have its final (tuned) form, but is > > already an improvement over the "return 1;" used previously. > > > > This patch has been tested on x86_64-pc-linux-gnu with make boostrap > > and make -k check, both with and without --target_board=unix{-m32} > > with no new failure
RE: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for TImode to V1TImode.
Hi HJ, I believe this should now be handled by the post-reload (CSE) pass. Consider the simple test case: __int128 a, b, c; void foo() { a = 0; b = 0; c = 0; } Without any STV, i.e. -O2 -msse4 -mno-stv, GCC get TI mode writes: movq$0, a(%rip) movq$0, a+8(%rip) movq$0, b(%rip) movq$0, b+8(%rip) movq$0, c(%rip) movq$0, c+8(%rip) ret But with STV, i.e. -O2 -msse4, things get converted to V1TI mode: pxor%xmm0, %xmm0 movaps %xmm0, a(%rip) movaps %xmm0, b(%rip) movaps %xmm0, c(%rip) ret You're quite right internally the STV actually generates the equivalent of: pxor%xmm0, %xmm0 movaps %xmm0, a(%rip) pxor%xmm0, %xmm0 movaps %xmm0, b(%rip) pxor%xmm0, %xmm0 movaps %xmm0, c(%rip) ret And currently because STV run before cse2 and combine, the const0_rtx gets CSE'd be the cse2 pass to produce the code we see. However, if you specify -fno-rerun-cse-after-loop (to disable the cse2 pass), you'll see we continue to generate the same optimized code, as the same const0_rtx gets CSE'd in postreload. I can't be certain until I try the experiment, but I believe that the postreload CSE will clean-up, all of the same common subexpressions. Hence, it should be safe to perform all STV at the same point (after combine), which for a few additional optimizations. Does this make sense? Do you have a test case, -fno-rerun-cse-after-loop produces different/inferior code for TImode STV chains? My guess is that the RTL passes have changed so much in the last six or seven years, that some of the original motivation no longer applies. Certainly we now try to keep TI mode operations visible longer, and then allow STV to behave like a pre-reload pass to decide which set of registers to use (vector V1TI or scalar doubleword DI). Any CSE opportunities that cse2 finds with V1TI mode, could/should equally well be found for TI mode (mostly). Cheers, Roger -- > -Original Message- > From: H.J. Lu > Sent: 10 July 2022 20:15 > To: Roger Sayle > Cc: Uros Bizjak ; GCC Patches > Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for > TImode to V1TImode. > > On Sun, Jul 10, 2022 at 11:36 AM Roger Sayle > wrote: > > > > > > Hi Uros, > > Yes, I agree. I think it makes sense to have a single STV pass (after > > combine and before reload). Let's hear what HJ thinks, but I'm happy > > to investigate a follow-up patch that unifies the STV passes. > > But it'll be easier to confirm there are no "code generation" changes > > if those modifications are pushed independently of these ones. > > Time to look into the (git) history of multiple STV passes... > > > > Thanks for the review. I'll wait for HJ's thoughts. > > The TImode STV pass is run before the CSE pass so that instructions changed or > generated by the STV pass can be CSEed. > > > Cheers, > > Roger > > -- > > > > > -Original Message- > > > From: Uros Bizjak > > > Sent: 10 July 2022 19:06 > > > To: Roger Sayle > > > Cc: gcc-patches@gcc.gnu.org; H. J. Lu > > > Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support > > > for TImode to V1TImode. > > > > > > On Sat, Jul 9, 2022 at 2:17 PM Roger Sayle > > > > > > wrote: > > > > > > > > > > > > This patch upgrades x86_64's scalar-to-vector (STV) pass to more > > > > aggressively transform 128-bit scalar TImode operations into > > > > vector V1TImode operations performed on SSE registers. TImode > > > > functionality already exists in STV, but only for move operations, > > > > this changes brings support for logical operations (AND, IOR, XOR, > > > > NOT and ANDN) and comparisons. 
> > > > > > > > The effect of these changes are conveniently demonstrated by the > > > > new sse4_1-stv-5.c test case: > > > > > > > > __int128 a[16]; > > > > __int128 b[16]; > > > > __int128 c[16]; > > > > > > > > void foo() > > > > { > > > > for (unsigned int i=0; i<16; i++) > > > > a[i] = b[i] & ~c[i]; > > > > } > > > > > > > > which when currently compiled on mainline wtih -O2 -msse4 produces: > > > > > > > > foo:xorl%eax, %eax > > > > .L2:movqc(%rax), %rsi > > > > movqc+8(%rax), %rdi > > > > addq$16, %rax > > > >
RE: [PATCH] Move reload_completed and other rtl.h globals to crtl structure.
On 11 July 2022 08:20, Richard Biener wrote: > On Sun, 10 Jul 2022, Roger Sayle wrote: > > > This patch builds upon Richard Biener's suggestion of avoiding global > > variables to track state/identify which passes have already been run. > > In the early middle-end, the tree-ssa passes use the curr_properties > > field in cfun to track this. This patch uses a new rtl_pass_progress > > int field in crtl to do something similar. > > Why not simply add PROP_rtl_... and use the existing curr_properties for this? > RTL passes are also passes and this has the advantage you can add things like > reload_completed to the passes properties_set field hand have the flag setting > handled by the pass manager as it was intended? > Great question, and I did initially consider simply adding more RTL fields to curr_properties. My hesitation was from the comments/documentation that the curr_properties field is used by the pass manager as a way to track (and verify) the properties/invariants that are required, provided and destroyed by each pass. This semantically makes sense for properties such as accurate data flow, ssa form, cfg_layout, nonzero_bits etc, where hypothetically the pass manager can dynamically schedule a pass/analysis to ensure the next pass has the pre-requisite information it needs. This seems semantically slightly different from tracking time/progress, where we really want something more like DEBUG_COUNTER that simply provides the "tick-tock" of a pass clock. Alas, the "pass number", as used in the suffix of dump-files (where 302 currently means reload) isn't particularly useful as these change/evolve continually. Perhaps the most obvious indication of this (subtle) difference is the curr_properties field (PROP_rtl_split_insns) which tracks whether instructions have been split, where at a finer granularity rtl_pass_progress may wish to distinguish split1 (after combine before reload), split2 (after reload before peephole2) and split3 (after peephole2). It's conceptually not a simple property, whether all insns have been split or not, as in practice this is more subtle with backends choosing which instructions get split at which "times". There's also the concern that we've a large number of passes (currently 62 RTL passes), and only a finite number of bits (in curr_properties), so having two integers reduces the risk of running out of bits and needing to use a "wider" data structure. To be honest, I just didn't want to hijack curr_properties to abuse it for a use that didn't quite match the original intention, without checking with you and the other maintainers first. If the above reasoning isn't convincing, I can try spinning an alternate patch using curr_properties (but I'd expect even more churn as backend source files would now need to #include tree-passes.h and function.h to get reload_completed). And of course, a volunteer is welcome to contribute that re-refactoring after this one. I've no strong feelings either way. It was an almost arbitrary engineering decision (but hopefully the above explains the balance of my reasoning). Roger --
RE: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for TImode to V1TImode.
On Mon, Jul 11, 2022, H.J. Lu wrote: > On Sun, Jul 10, 2022 at 2:38 PM Roger Sayle > wrote: > > Hi HJ, > > > > I believe this should now be handled by the post-reload (CSE) pass. > > Consider the simple test case: > > > > __int128 a, b, c; > > void foo() > > { > > a = 0; > > b = 0; > > c = 0; > > } > > > > Without any STV, i.e. -O2 -msse4 -mno-stv, GCC get TI mode writes: > > movq$0, a(%rip) > > movq$0, a+8(%rip) > > movq$0, b(%rip) > > movq$0, b+8(%rip) > > movq$0, c(%rip) > > movq$0, c+8(%rip) > > ret > > > > But with STV, i.e. -O2 -msse4, things get converted to V1TI mode: > > pxor%xmm0, %xmm0 > > movaps %xmm0, a(%rip) > > movaps %xmm0, b(%rip) > > movaps %xmm0, c(%rip) > > ret > > > > You're quite right internally the STV actually generates the equivalent of: > > pxor%xmm0, %xmm0 > > movaps %xmm0, a(%rip) > > pxor%xmm0, %xmm0 > > movaps %xmm0, b(%rip) > > pxor%xmm0, %xmm0 > > movaps %xmm0, c(%rip) > > ret > > > > And currently because STV run before cse2 and combine, the const0_rtx > > gets CSE'd be the cse2 pass to produce the code we see. However, if > > you specify -fno-rerun-cse-after-loop (to disable the cse2 pass), > > you'll see we continue to generate the same optimized code, as the > > same const0_rtx gets CSE'd in postreload. > > > > I can't be certain until I try the experiment, but I believe that the > > postreload CSE will clean-up, all of the same common subexpressions. > > Hence, it should be safe to perform all STV at the same point (after > > combine), which for a few additional optimizations. > > > > Does this make sense? Do you have a test case, > > -fno-rerun-cse-after-loop produces different/inferior code for TImode STV > chains? > > > > My guess is that the RTL passes have changed so much in the last six > > or seven years, that some of the original motivation no longer applies. > > Certainly we now try to keep TI mode operations visible longer, and > > then allow STV to behave like a pre-reload pass to decide which set of > > registers to use (vector V1TI or scalar doubleword DI). Any CSE > > opportunities that cse2 finds with V1TI mode, could/should equally > > well be found for TI mode (mostly). > > You are probably right. If there are no regressions in GCC testsuite, my > original > motivation is no longer valid. It was good to try the experiment, but H.J. is right, there is still some benefit (as well as some disadvantages) to running STV lowering before CSE2/combine. A clean-up patch to perform all STV conversion as a single pass (removing a pass from the compiler) results in just a single regression in the test suite: FAIL: gcc.target/i386/pr70155-17.c scan-assembler-times movv1ti_internal 8 which looks like: __int128 a, b, c, d, e, f; void foo (void) { a = 0; b = -1; c = 0; d = -1; e = 0; f = -1; } By performing STV after combine (without CSE), reload prefers to implement this function using a single register, that then requires 12 instructions rather than 8 (if using two registers). Alas there's nothing that postreload CSE/GCSE can do. Doh! pxor%xmm0, %xmm0 movaps %xmm0, a(%rip) pcmpeqd %xmm0, %xmm0 movaps %xmm0, b(%rip) pxor%xmm0, %xmm0 movaps %xmm0, c(%rip) pcmpeqd %xmm0, %xmm0 movaps %xmm0, d(%rip) pxor%xmm0, %xmm0 movaps %xmm0, e(%rip) pcmpeqd %xmm0, %xmm0 movaps %xmm0, f(%rip) ret I also note that even without STV, the scalar implementation of this function when compiled with -Os is also larger than it needs to be due to poor CSE (notice in the following we only need a single zero register, and an all_ones reg would be helpful). 
        xorl    %eax, %eax
        xorl    %edx, %edx
        xorl    %ecx, %ecx
        movq    $-1, b(%rip)
        movq    %rax, a(%rip)
        movq    %rax, a+8(%rip)
        movq    $-1, b+8(%rip)
        movq    %rdx, c(%rip)
        movq    %rdx, c+8(%rip)
        movq    $-1, d(%rip)
        movq    $-1, d+8(%rip)
        movq    %rcx, e(%rip)
        movq    %rcx, e+8(%rip)
        movq    $-1, f(%rip)
        movq    $-1, f+8(%rip)
        ret

I need to give the problem some more thought. It would be good to clean up/unify the STV passes, but I/we need to solve/CSE HJ's last test case before we do. Perhaps forbidding "(set (mem:ti) (const_int 0))" in movti_internal would force the zero register to become visible, and be CSE'd, benefiting both vector code and scalar -Os code; we could then use postreload/peephole2 to fix up the remaining scalar cases. It's tricky.

Cheers,
Roger
--
[PATCH] PR target/106278: Keep REG_EQUAL notes consistent during TImode STV.
This patch resolves PR target/106278 a regression on x86_64 caused by my recent TImode STV improvements. Now that TImode STV can handle comparisons such as "(set (regs:CC) (compare:CC (reg:TI) ...))" the convert_insn method sensibly checks that the mode of the SET_DEST is TImode before setting it to V1TImode [to avoid V1TImode appearing on the hard reg CC_FLAGS. Hence the current code looks like: if (GET_MODE (dst) == TImode) { tmp = find_reg_equal_equiv_note (insn); if (tmp && GET_MODE (XEXP (tmp, 0)) == TImode) PUT_MODE (XEXP (tmp, 0), V1TImode); PUT_MODE (dst, V1TImode); fix_debug_reg_uses (dst); } break; which checks GET_MODE (dst) before calling PUT_MODE, and when a change is made updating the REG_EQUAL_NOTE tmp if it exists. The logical flaw (oversight) is that due to RTL sharing, the destination of this set may already have been updated to V1TImode, as this chain is being converted, but we still need to update any REG_EQUAL_NOTE that still has TImode. Hence the correct code is actually: if (GET_MODE (dst) == TImode) { PUT_MODE (dst, V1TImode); fix_debug_reg_uses (dst); } if (GET_MODE (dst) == V1TImode) { tmp = find_reg_equal_equiv_note (insn); if (tmp && GET_MODE (XEXP (tmp, 0)) == TImode) PUT_MODE (XEXP (tmp, 0), V1TImode); } break; While fixing this behavior, I noticed I had some indentation whitespace issues and some vestigial dead code in this function/method that I've taken the liberty of cleaning up (as obvious) in this patch. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, with no new failures. Ok for mainline? 2022-07-14 Roger Sayle gcc/ChangeLog PR target/106278 * config/i386/i386-features.cc (general_scalar_chain::convert_insn): Fix indentation whitespace. (timode_scalar_chain::fix_debug_reg_uses): Likewise. (timode_scalar_chain::convert_insn): Delete dead code. Update TImode REG_EQUAL_NOTE even if the SET_DEST is already V1TI. Fix indentation whitespace. (convertible_comparison_p): Likewise. (timode_scalar_to_vector_candidate_p): Likewise. gcc/testsuite/ChangeLog PR target/106278 * gcc.dg/pr106278.c: New test case. Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc index f1b03c3..813b203 100644 --- a/gcc/config/i386/i386-features.cc +++ b/gcc/config/i386/i386-features.cc @@ -1054,13 +1054,13 @@ general_scalar_chain::convert_insn (rtx_insn *insn) else if (REG_P (dst) && GET_MODE (dst) == smode) { /* Replace the definition with a SUBREG to the definition we - use inside the chain. */ +use inside the chain. */ rtx *vdef = defs_map.get (dst); if (vdef) dst = *vdef; dst = gen_rtx_SUBREG (vmode, dst, 0); /* IRA doesn't like to have REG_EQUAL/EQUIV notes when the SET_DEST - is a non-REG_P. So kill those off. */ +is a non-REG_P. So kill those off. */ rtx note = find_reg_equal_equiv_note (insn); if (note) remove_note (insn, note); @@ -1246,7 +1246,7 @@ timode_scalar_chain::fix_debug_reg_uses (rtx reg) { rtx_insn *insn = DF_REF_INSN (ref); /* Make sure the next ref is for a different instruction, - so that we're not affected by the rescan. */ +so that we're not affected by the rescan. */ next = DF_REF_NEXT_REG (ref); while (next && DF_REF_INSN (next) == insn) next = DF_REF_NEXT_REG (next); @@ -1336,21 +1336,19 @@ timode_scalar_chain::convert_insn (rtx_insn *insn) rtx dst = SET_DEST (def_set); rtx tmp; - if (MEM_P (dst) && !REG_P (src)) -{ - /* There are no scalar integer instructions and therefore -temporary register usage is required. 
*/ -} switch (GET_CODE (dst)) { case REG: if (GET_MODE (dst) == TImode) { + PUT_MODE (dst, V1TImode); + fix_debug_reg_uses (dst); + } + if (GET_MODE (dst) == V1TImode) + { tmp = find_reg_equal_equiv_note (insn); if (tmp && GET_MODE (XEXP (tmp, 0)) == TImode) PUT_MODE (XEXP (tmp, 0), V1TImode); - PUT_MODE (dst, V1TImode); - fix_debug_reg_uses (dst); } break; case MEM: @@ -1410,8 +1408,8 @@ timode_scalar_chain::convert_insn (rtx_insn *insn) if (MEM_P (dst)) { tmp = gen_reg_rtx (V1TImode); - emit_insn_before (gen_rtx_SET (tmp, src), insn); - src = tmp; + emit_insn_before (gen_rtx_SET (tmp, src), insn); + src = tmp; } break; @@ -1434,8 +1432
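For illustration only (this is not the committed gcc.dg/pr106278.c, which is not reproduced here), the kind of situation described above involves a TImode pseudo whose shared rtx appears in more than one insn of the same STV chain, so its mode may already have been flipped to V1TImode by the time a later insn's REG_EQUAL note is inspected:

__int128 g;

void
foo (int c, __int128 a, __int128 b)
{
  __int128 t = c ? (a & b) : (a | b);   /* two sets of the same pseudo */
  g = t;
}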
[x86 PATCH] PR target/106273: Add earlyclobber to *andn3_doubleword_bmi
This patch resolves PR target/106273 which is a wrong code regression caused by the recent reorganization to split doubleword operations after reload on x86. For the failing test case, the constraints on the andnti3_doubleword_bmi pattern allow reload to allocate the output and operand in overlapping but non-identical registers, i.e. (insn 45 44 66 2 (parallel [ (set (reg/v:TI 5 di [orig:96 i ] [96]) (and:TI (not:TI (reg:TI 39 r11 [orig:83 _2 ] [83])) (reg/v:TI 4 si [orig:100 i ] [100]))) (clobber (reg:CC 17 flags)) ]) "pr106273.c":13:5 562 {*andnti3_doubleword_bmi} where the output is in registers 5 and 6, and the second operand is registers 4 and 5, which then leads to the incorrect split: (insn 113 44 114 2 (parallel [ (set (reg:DI 5 di [orig:96 i ] [96]) (and:DI (not:DI (reg:DI 39 r11 [orig:83 _2 ] [83])) (reg:DI 4 si [orig:100 i ] [100]))) (clobber (reg:CC 17 flags)) ]) "pr106273.c":13:5 566 {*andndi_1} (insn 114 113 66 2 (parallel [ (set (reg:DI 6 bp [ i+8 ]) (and:DI (not:DI (reg:DI 40 r12 [ _2+8 ])) (reg:DI 5 di [ i+8 ]))) (clobber (reg:CC 17 flags)) ]) "pr106273.c":13:5 566 {*andndi_1} [Notice that reg:DI 5 is set in the first instruction, but assumed to have its original value in the second]. My first thought was that this could be fixed by swapping the order of the split instructions (which works in this case), but in the general case, it's impossible to handle (set (reg:TI x) (op (reg:TI x+1) (reg:TI x-1)). Hence for correctness this pattern needs an earlyclobber "=&r", but we can also allow cases where the output is the same as one of the operands (using constraint "0"). The other binary logic operations (AND, IOR, XOR) are unaffected as they constrain the output to match the first operand, but BMI's andn is a three-operand instruction which can lead to the overlapping cases described above. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2022-07-15 Roger Sayle gcc/ChangeLog PR target/106273 * config/i386/i386.md (*andn3_doubleword_bmi): Update the constraints to reflect the output is earlyclobber, unless it is the same register (pair) as one of the operands. gcc/testsuite/ChangeLog PR target/106273 * gcc.target/i386/pr106273.c: New test case. Thanks again, and sorry for the inconvenience. 
Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 3b02d0c..585b2d5 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -10423,10 +10423,10 @@ }) (define_insn_and_split "*andn3_doubleword_bmi" - [(set (match_operand: 0 "register_operand" "=r") + [(set (match_operand: 0 "register_operand" "=&r,r,r") (and: - (not: (match_operand: 1 "register_operand" "r")) - (match_operand: 2 "nonimmediate_operand" "ro"))) + (not: (match_operand: 1 "register_operand" "r,0,r")) + (match_operand: 2 "nonimmediate_operand" "ro,ro,0"))) (clobber (reg:CC FLAGS_REG))] "TARGET_BMI" "#" diff --git a/gcc/testsuite/gcc.target/i386/pr106273.c b/gcc/testsuite/gcc.target/i386/pr106273.c new file mode 100644 index 000..8c2fbbb --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106273.c @@ -0,0 +1,27 @@ +/* { dg-do compile { target int128 } } */ +/* { dg-options "-Og -march=cascadelake" } */ +typedef unsigned char u8; +typedef unsigned short u16; +typedef unsigned long long u64; + +u8 g; + +void +foo (__int128 i, u8 *r) +{ + u16 a = __builtin_sub_overflow_p (0, i * g, 0); + i ^= g & i; + u64 s = (i >> 64) + i; + *r = ((union { u16 a; u8 b[2]; }) a).b[1] + s; +} + +int +main (void) +{ + u8 x; + foo (5, &x); + if (x != 5) +__builtin_abort (); + return 0; +} +/* { dg-final { scan-assembler-not "andn\[ \\t\]+%rdi, %r11, %rdi" } } */
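For context, a purely illustrative C-level view (not the new test case) of the three-operand andn at the heart of this: compiled with BMI enabled (e.g. -march=cascadelake as in pr106273.c), the TImode operation below becomes two DImode andn instructions after the doubleword split, and it is that split which must not let the first half clobber an input still needed by the second half.

unsigned __int128
andn128 (unsigned __int128 x, unsigned __int128 y)
{
  return ~x & y;
}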
[x86 PATCH] Fix issue with x86_64_const_vector_operand predicate.
This patch fixes (what I believe is) a latent bug in i386.md's x86_64_const_vector_operand define_predicate. According to the documentation, when a predicate is called with rtx operand OP and machine_mode operand MODE, we can't/shouldn't assume that MODE is (or has been checked to be) GET_MODE (OP). The failure mode is that recog can call x86_64_const_vector_operand on an arbitrary CONST_VECTOR, passing a MODE of V2QImode, but when the CONST_VECTOR is in fact V1TImode it's unsafe to directly call ix86_convert_const_vector_to_integer, which assumes that the CONST_VECTOR contains CONST_INTs when it actually contains CONST_WIDE_INTs. The checks in this define_predicate need to test OP's mode, and ideally confirm that this matches the passed-in/specified MODE.

This bug is currently latent, but adding an innocent/unrelated define_insn, such as "(set (reg:CCC FLAGS_REG) (const_int 0))", to i386.md can occasionally change the order in which genrecog generates its tests, which then ICEs during bootstrap due to V1TI CONST_VECTORs.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, with no new failures. Ok for mainline?

2022-07-16  Roger Sayle

gcc/ChangeLog
	* config/i386/predicates.md (x86_64_const_vector_operand):
	Check the operand's mode matches the specified mode argument.

Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
index c71c453..42053ea 100644
--- a/gcc/config/i386/predicates.md
+++ b/gcc/config/i386/predicates.md
@@ -1199,6 +1199,10 @@
 (define_predicate "x86_64_const_vector_operand"
   (match_code "const_vector")
 {
+  if (mode == VOIDmode)
+    mode = GET_MODE (op);
+  else if (GET_MODE (op) != mode)
+    return false;
   if (GET_MODE_SIZE (mode) > UNITS_PER_WORD)
     return false;
   HOST_WIDE_INT val = ix86_convert_const_vector_to_integer (op, mode);
[middle-end PATCH] PR c/106264: Silence warnings from __builtin_modf et al.
This middle-end patch resolves PR c/106264 which is a spurious warning regression caused by the tree-level expansion of modf, frexp and remquo producing "expression has no-effect" when the built-in function's result is ignored. When these built-ins were first expanded at tree-level, fold_builtin_n would blindly set TREE_NO_WARNING for all built-ins. Now that we're more discerning, we should precisely set TREE_NO_WARNING selectively on those COMPOUND_EXPRs that need them. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check with no new failures. Ok for mainline? 2022-07-16 Roger Sayle gcc/ChangeLog PR c/106264 * builtins.cc (fold_builtin_frexp): Set TREE_NO_WARNING on COMPOUND_EXPR to silence spurious warning if result isn't used. (fold_builtin_modf): Likewise. (do_mpfr_remquo): Likewise. gcc/testsuite/ChangeLog PR c/106264 * gcc.dg/pr106264.c: New test case. Thanks in advance, Roger -- diff --git a/gcc/builtins.cc b/gcc/builtins.cc index 35b9197..c745777 100644 --- a/gcc/builtins.cc +++ b/gcc/builtins.cc @@ -8625,7 +8625,7 @@ fold_builtin_frexp (location_t loc, tree arg0, tree arg1, tree rettype) if (TYPE_MAIN_VARIANT (TREE_TYPE (arg1)) == integer_type_node) { const REAL_VALUE_TYPE *const value = TREE_REAL_CST_PTR (arg0); - tree frac, exp; + tree frac, exp, res; switch (value->cl) { @@ -8656,7 +8656,9 @@ fold_builtin_frexp (location_t loc, tree arg0, tree arg1, tree rettype) /* Create the COMPOUND_EXPR (*arg1 = trunc, frac). */ arg1 = fold_build2_loc (loc, MODIFY_EXPR, rettype, arg1, exp); TREE_SIDE_EFFECTS (arg1) = 1; - return fold_build2_loc (loc, COMPOUND_EXPR, rettype, arg1, frac); + res = fold_build2_loc (loc, COMPOUND_EXPR, rettype, arg1, frac); + TREE_NO_WARNING (res) = 1; + return res; } return NULL_TREE; @@ -8682,6 +8684,7 @@ fold_builtin_modf (location_t loc, tree arg0, tree arg1, tree rettype) { const REAL_VALUE_TYPE *const value = TREE_REAL_CST_PTR (arg0); REAL_VALUE_TYPE trunc, frac; + tree res; switch (value->cl) { @@ -8711,8 +8714,10 @@ fold_builtin_modf (location_t loc, tree arg0, tree arg1, tree rettype) arg1 = fold_build2_loc (loc, MODIFY_EXPR, rettype, arg1, build_real (rettype, trunc)); TREE_SIDE_EFFECTS (arg1) = 1; - return fold_build2_loc (loc, COMPOUND_EXPR, rettype, arg1, - build_real (rettype, frac)); + res = fold_build2_loc (loc, COMPOUND_EXPR, rettype, arg1, +build_real (rettype, frac)); + TREE_NO_WARNING (res) = 1; + return res; } return NULL_TREE; @@ -10673,8 +10678,10 @@ do_mpfr_remquo (tree arg0, tree arg1, tree arg_quo) integer_quo)); TREE_SIDE_EFFECTS (result_quo) = 1; /* Combine the quo assignment with the rem. */ - result = non_lvalue (fold_build2 (COMPOUND_EXPR, type, - result_quo, result_rem)); + result = fold_build2 (COMPOUND_EXPR, type, + result_quo, result_rem); + TREE_NO_WARNING (result) = 1; + result = non_lvalue (result); } } } diff --git a/gcc/testsuite/gcc.dg/pr106264.c b/gcc/testsuite/gcc.dg/pr106264.c new file mode 100644 index 000..6b4af49 --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr106264.c @@ -0,0 +1,27 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -Wall" } */ +double frexp (double, int*); +double modf (double, double*); +double remquo (double, double, int*); + +int f (void) +{ + int y; + frexp (1.0, &y); + return y; +} + +double g (void) +{ + double y; + modf (1.0, &y); + return y; +} + +int h (void) +{ + int y; + remquo (1.0, 1.0, &y); + return y; +} +
[AVX512 PATCH] Add UNSPEC_MASKOP to kupck instructions in sse.md.
This AVX512 specific patch to sse.md is split out from an earlier patch: https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596199.html The new splitters proposed in that patch interfere with AVX512's kunpckdq instruction which is defined as identical RTL, DW:DI = (HI:SI<<32)|zero_extend(LO:SI). To distinguish these, and avoid AVX512 mask registers accidentally being (ab)used by reload to perform SImode scalar shifts, this patch adds the explicit (unspec UNSPEC_MASKOP) to the unpack mask operations, which matches what sse.md does for the other mask specific (logic) operations. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32} with no new failures. Ok for mainline? 2022-07-16 Roger Sayle gcc/ChangeLog * config/i386/sse.md (kunpckhi): Add UNSPEC_MASKOP unspec. (kunpcksi): Likewise, add UNSPEC_MASKOP unspec. (kunpckdi): Likewise, add UNSPEC_MASKOP unspec. (vec_pack_trunc_qi): Update to specify required UNSPEC_MASKOP unspec. (vec_pack_trunc_): Likewise. Thanks in advance, Roger -- diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index 62688f8..da50ffa 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -2072,7 +2072,8 @@ (ashift:HI (zero_extend:HI (match_operand:QI 1 "register_operand" "k")) (const_int 8)) - (zero_extend:HI (match_operand:QI 2 "register_operand" "k"] + (zero_extend:HI (match_operand:QI 2 "register_operand" "k" + (unspec [(const_int 0)] UNSPEC_MASKOP)] "TARGET_AVX512F" "kunpckbw\t{%2, %1, %0|%0, %1, %2}" [(set_attr "mode" "HI") @@ -2085,7 +2086,8 @@ (ashift:SI (zero_extend:SI (match_operand:HI 1 "register_operand" "k")) (const_int 16)) - (zero_extend:SI (match_operand:HI 2 "register_operand" "k"] + (zero_extend:SI (match_operand:HI 2 "register_operand" "k" + (unspec [(const_int 0)] UNSPEC_MASKOP)] "TARGET_AVX512BW" "kunpckwd\t{%2, %1, %0|%0, %1, %2}" [(set_attr "mode" "SI")]) @@ -2096,7 +2098,8 @@ (ashift:DI (zero_extend:DI (match_operand:SI 1 "register_operand" "k")) (const_int 32)) - (zero_extend:DI (match_operand:SI 2 "register_operand" "k"] + (zero_extend:DI (match_operand:SI 2 "register_operand" "k" + (unspec [(const_int 0)] UNSPEC_MASKOP)] "TARGET_AVX512BW" "kunpckdq\t{%2, %1, %0|%0, %1, %2}" [(set_attr "mode" "DI")]) @@ -17400,21 +17403,26 @@ }) (define_expand "vec_pack_trunc_qi" - [(set (match_operand:HI 0 "register_operand") - (ior:HI (ashift:HI (zero_extend:HI (match_operand:QI 2 "register_operand")) - (const_int 8)) - (zero_extend:HI (match_operand:QI 1 "register_operand"] + [(parallel +[(set (match_operand:HI 0 "register_operand") + (ior:HI + (ashift:HI (zero_extend:HI (match_operand:QI 2 "register_operand")) + (const_int 8)) + (zero_extend:HI (match_operand:QI 1 "register_operand" + (unspec [(const_int 0)] UNSPEC_MASKOP)])] "TARGET_AVX512F") (define_expand "vec_pack_trunc_" - [(set (match_operand: 0 "register_operand") - (ior: - (ashift: + [(parallel +[(set (match_operand: 0 "register_operand") + (ior: + (ashift: + (zero_extend: + (match_operand:SWI24 2 "register_operand")) + (match_dup 3)) (zero_extend: - (match_operand:SWI24 2 "register_operand")) - (match_dup 3)) - (zero_extend: - (match_operand:SWI24 1 "register_operand"] + (match_operand:SWI24 1 "register_operand" + (unspec [(const_int 0)] UNSPEC_MASKOP)])] "TARGET_AVX512BW" { operands[3] = GEN_INT (GET_MODE_BITSIZE (mode));
[x86_64 PATCH] PR target/106231: Optimize (any_extend:DI (ctz:SI ...)).
This patch resolves PR target/106231 by providing insns that recognize (zero_extend:DI (ctz:SI ...)) and (sign_extend:DI (ctz:SI ...)). The result of ctz:SI is always between 0 and 32 (or undefined), so sign_extension is the same as zero_extension, and the result is already extended in the destination register. Things are a little complicated, because the existing implementation of *ctzsi2 handles multiple cases, including false dependencies, which we continue to support in this patch. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check with no new failures. Ok for mainline? 2022-07-16 Roger Sayle gcc/ChangeLog PR target/106231 * config/i386/i386.md (*ctzsidi2_ext): New insn_and_split to recognize any_extend:DI of ctz:SI which is implicitly extended. (*ctzsidi2_ext_falsedep): New define_insn to model a DImode extended ctz:SI that has preceding xor to break false dependency. gcc/testsuite/ChangeLog PR target/106231 * gcc.target/i386/pr106231-1.c: New test case. * gcc.target/i386/pr106231-2.c: New test case. Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 3b02d0c..164b0c2 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -16431,6 +16431,66 @@ (set_attr "prefix_rep" "1") (set_attr "mode" "SI")]) +(define_insn_and_split "*ctzsidi2_ext" + [(set (match_operand:DI 0 "register_operand" "=r") + (any_extend:DI + (ctz:SI + (match_operand:SI 1 "nonimmediate_operand" "rm" + (clobber (reg:CC FLAGS_REG))] + "TARGET_64BIT" +{ + if (TARGET_BMI) +return "tzcnt{l}\t{%1, %k0|%k0, %1}"; + else if (TARGET_CPU_P (GENERIC) + && !optimize_function_for_size_p (cfun)) +/* tzcnt expands to 'rep bsf' and we can use it even if !TARGET_BMI. */ +return "rep%; bsf{l}\t{%1, %k0|%k0, %1}"; + return "bsf{l}\t{%1, %k0|%k0, %1}"; +} + "(TARGET_BMI || TARGET_CPU_P (GENERIC)) + && TARGET_AVOID_FALSE_DEP_FOR_BMI && epilogue_completed + && optimize_function_for_speed_p (cfun) + && !reg_mentioned_p (operands[0], operands[1])" + [(parallel +[(set (match_dup 0) + (any_extend:DI (ctz:SI (match_dup 1 + (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP) + (clobber (reg:CC FLAGS_REG))])] + "ix86_expand_clear (operands[0]);" + [(set_attr "type" "alu1") + (set_attr "prefix_0f" "1") + (set (attr "prefix_rep") + (if_then_else + (ior (match_test "TARGET_BMI") + (and (not (match_test "optimize_function_for_size_p (cfun)")) +(match_test "TARGET_CPU_P (GENERIC)"))) + (const_string "1") + (const_string "0"))) + (set_attr "mode" "SI")]) + +(define_insn "*ctzsidi2_ext_falsedep" + [(set (match_operand:DI 0 "register_operand" "=r") + (any_extend:DI + (ctz:SI + (match_operand:SI 1 "nonimmediate_operand" "rm" + (unspec [(match_operand:DI 2 "register_operand" "0")] + UNSPEC_INSN_FALSE_DEP) + (clobber (reg:CC FLAGS_REG))] + "TARGET_64BIT" +{ + if (TARGET_BMI) +return "tzcnt{l}\t{%1, %k0|%k0, %1}"; + else if (TARGET_CPU_P (GENERIC)) +/* tzcnt expands to 'rep bsf' and we can use it even if !TARGET_BMI. */ +return "rep%; bsf{l}\t{%1, %k0|%k0, %1}"; + else +gcc_unreachable (); +} + [(set_attr "type" "alu1") + (set_attr "prefix_0f" "1") + (set_attr "prefix_rep" "1") + (set_attr "mode" "SI")]) + (define_insn "bsr_rex64" [(set (reg:CCZ FLAGS_REG) (compare:CCZ (match_operand:DI 1 "nonimmediate_operand" "rm") diff --git a/gcc/testsuite/gcc.target/i386/pr106231-1.c b/gcc/testsuite/gcc.target/i386/pr106231-1.c new file mode 100644 index 000..d17297f --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106231-1.c @@ -0,0 +1,8 @@ +/* { dg-do compile { target { ! 
ia32 } } } */ +/* { dg-options "-O2 -mtune=generic" } */ +long long +foo(long long x, unsigned bits) +{ + return x + (unsigned) __builtin_ctz(bits); +} +/* { dg-final { scan-assembler-not "cltq" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr106231-2.c b/gcc/testsuite/gcc.target/i386/pr106231-2.c new file mode 100644 index 000..fd3a8e3 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106231-2.c @@ -0,0 +1,8 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -mtune=ivybridge" } */ +long long +foo(long long x, unsigned bits) +{ + return x + (unsigned) __builtin_ctz(bits); +} +/* { dg-final { scan-assembler-not "cltq" } } */
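As an aside, a purely illustrative variant (not one of the two new tests) that exercises the sign_extend side of any_extend: here __builtin_ctz returns int, so adding it to a long long sign-extends the result, yet because ctz of a nonzero value is always in [0, 31] the extension is still free and no cltq/movslq should be needed.

long long
foo_signed (long long x, unsigned bits)
{
  return x + __builtin_ctz (bits);
}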
[x86 PATCH] PR target/106303: Fix TImode STV related failures.
This patch resolves PR target/106303 (and the related PRs 106347, 106404, 106407) which are ICEs caused by my improvements to x86_64's 128-bit TImode to V1TImode Scalar to Vector (STV) pass. My apologies for the breakage. The issue is that data flow analysis is used to partition usage of each TImode pseudo into "chains", where each chain is analyzed and if suitable converted to vector operations. The problems appears when some chains for a pseudo are converted, and others aren't as RTL sharing can result in some mode changes leaking into other instructions that aren't/shouldn't/can't be converted, which eventually leads to an ICE for mismatched modes. My first approach to a fix was to unify more of the STV infrastructure, reasoning that if TImode STV was exhibiting these problems, but DImode and SImode STV weren't, the issue was likely to be caused/resolved by these remaining differences. This appeared to fix some but not all of the reported PRs. A better solution was then proposed by H.J. Lu in Bugzilla (thanks!) that we need to iterate the removal of candidates in the function timode_remove_non_convertible_regs until there are no further changes. As each chain is removed from consideration, it in turn may affect whether other insns/chains can safely be converted. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, with no new failures. Ok for mainline? 2022-07-23 Roger Sayle H.J. Lu gcc/ChangeLog PR target/106303 PR target/106347 * config/i386/i386-features.cc (make_vector_copies): Move from general_scalar_chain to scalar_chain. (convert_reg): Likewise. (convert_insn_common): New scalar_chain method split out from general_scalar_chain convert_insn. (convert_registers): Move from general_scalar_chain to scalar_chain. (scalar_chain::convert): Call convert_insn_common before calling convert_insn. (timode_remove_non_convertible_regs): Iterate until there are no further changes to the candidates. * config/i386/i386-features.h (scalar_chain::hash_map): Move from general_scalar_chain. (scalar_chain::convert_reg): Likewise. (scalar_chain::convert_insn_common): New shared method. (scalar_chain::make_vector_copies): Move from general_scalar_chain. (scalar_chain::convert_registers): Likewise. No longer virtual. (general_scalar_chain::hash_map): Delete. Moved to scalar_chain. (general_scalar_chain::convert_reg): Likewise. (general_scalar_chain::make_vector_copies): Likewise. (general_scalar_chain::convert_registers): Delete virtual method. (timode_scalar_chain::convert_registers): Likewise. gcc/testsuite/ChangeLog PR target/106303 PR target/106347 * gcc.target/i386/pr106303.c: New test case. * gcc.target/i386/pr106347.c: New test case. Thanks in advance (and sorry again for the inconvenience), Roger -- diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc index 813b203..aa5de71 100644 --- a/gcc/config/i386/i386-features.cc +++ b/gcc/config/i386/i386-features.cc @@ -708,7 +708,7 @@ gen_gpr_to_xmm_move_src (enum machine_mode vmode, rtx gpr) and replace its uses in a chain. */ void -general_scalar_chain::make_vector_copies (rtx_insn *insn, rtx reg) +scalar_chain::make_vector_copies (rtx_insn *insn, rtx reg) { rtx vreg = *defs_map.get (reg); @@ -772,7 +772,7 @@ general_scalar_chain::make_vector_copies (rtx_insn *insn, rtx reg) scalar uses outside of the chain. 
*/ void -general_scalar_chain::convert_reg (rtx_insn *insn, rtx dst, rtx src) +scalar_chain::convert_reg (rtx_insn *insn, rtx dst, rtx src) { start_sequence (); if (!TARGET_INTER_UNIT_MOVES_FROM_VEC) @@ -973,10 +973,10 @@ scalar_chain::convert_compare (rtx op1, rtx op2, rtx_insn *insn) UNSPEC_PTEST); } -/* Convert INSN to vector mode. */ +/* Helper function for converting INSN to vector mode. */ void -general_scalar_chain::convert_insn (rtx_insn *insn) +scalar_chain::convert_insn_common (rtx_insn *insn) { /* Generate copies for out-of-chain uses of defs and adjust debug uses. */ for (df_ref ref = DF_INSN_DEFS (insn); ref; ref = DF_REF_NEXT_LOC (ref)) @@ -1037,7 +1037,13 @@ general_scalar_chain::convert_insn (rtx_insn *insn) XEXP (note, 0) = *vreg; *DF_REF_REAL_LOC (ref) = *vreg; } +} + +/* Convert INSN to vector mode. */ +void +general_scalar_chain::convert_insn (rtx_insn *insn) +{ rtx def_set = single_set (insn); rtx src = SET_SRC (def_set); rtx dst = SET_DEST (def_set); @@ -1475,7 +1481,7 @@ timode_scalar_chain::convert_insn (rtx_insn *insn) Also populates defs_map which is used later by convert_insn. */ void -general_scalar_c
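A toy model (purely illustrative, not the GCC implementation; the data layout is invented) of the fixed-point iteration described above: each scan drops candidates whose prerequisite candidate has already been dropped, and scanning repeats until a full pass makes no change.

#include <stdbool.h>

bool
prune_once (bool cand[], const int dep[], int n)
{
  bool changed = false;
  for (int i = 0; i < n; i++)
    if (cand[i] && dep[i] >= 0 && !cand[dep[i]])
      {
        cand[i] = false;     /* its prerequisite chain was removed */
        changed = true;
      }
  return changed;
}

void
prune_to_fixed_point (bool cand[], const int dep[], int n)
{
  while (prune_once (cand, dep, n))
    ;   /* iterate until there are no further changes */
}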
[x86 PATCH take #3] PR target/91681: zero_extendditi2 pattern for more optimizations.
Hi Uros, This is the next iteration of the zero_extendditi2 patch last reviewed here: https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596204.html [1] The sse.md changes were split out, reviewed, approved and committed. [2] The *concat splitters have been moved post-reload matching what we now do for many/most of the double word functionality. [3] As you recommend, these *concat splitters now use split_double_mode to "subreg" operand[0] into parts, via a new helper function that can also handle overlapping registers, and even use xchg for the rare case that a double word is constructed from its high and low parts, but the wrong way around. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without -target_board=unix{-m32}, with no new failures. Ok for mainline? 2022-07-23 Roger Sayle Uroš Bizjak gcc/ChangeLog PR target/91681 * config/i386/i386-expand.cc (split_double_concat): A new helper function for setting a double word value from two word values. * config/i386/i386-protos.h (split_double_concat): Prototype here. * config/i386/i386.md (zero_extendditi2): New define_insn_and_split. (*add3_doubleword_zext): New define_insn_and_split. (*sub3_doubleword_zext): New define_insn_and_split. (*concat3_1): New define_insn_and_split replacing previous define_split for implementing DST = (HI<<32)|LO as pair of move instructions, setting lopart and hipart. (*concat3_2): Likewise. (*concat3_3): Likewise, where HI is zero_extended. (*concat3_4): Likewise, where HI is zero_extended. gcc/testsuite/ChangeLog PR target/91681 * g++.target/i386/pr91681.C: New test case (from the PR). * gcc.target/i386/pr91681-1.c: New int128 test case. * gcc.target/i386/pr91681-2.c: Likewise. * gcc.target/i386/pr91681-3.c: Likewise, but for ia32. Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index 40f821e..66d8f28 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -165,6 +165,46 @@ split_double_mode (machine_mode mode, rtx operands[], } } +/* Emit the double word assignment DST = { LO, HI }. */ + +void +split_double_concat (machine_mode mode, rtx dst, rtx lo, rtx hi) +{ + rtx dlo, dhi; + int deleted_move_count = 0; + split_double_mode (mode, &dst, 1, &dlo, &dhi); + if (!rtx_equal_p (dlo, hi)) +{ + if (!rtx_equal_p (dlo, lo)) + emit_move_insn (dlo, lo); + else + deleted_move_count++; + if (!rtx_equal_p (dhi, hi)) + emit_move_insn (dhi, hi); + else + deleted_move_count++; +} + else if (!rtx_equal_p (lo, dhi)) +{ + if (!rtx_equal_p (dhi, hi)) + emit_move_insn (dhi, hi); + else + deleted_move_count++; + if (!rtx_equal_p (dlo, lo)) + emit_move_insn (dlo, lo); + else + deleted_move_count++; +} + else if (mode == TImode) +emit_insn (gen_swapdi (dlo, dhi)); + else +emit_insn (gen_swapsi (dlo, dhi)); + + if (deleted_move_count == 2) +emit_note (NOTE_INSN_DELETED); +} + + /* Generate either "mov $0, reg" or "xor reg, reg", as appropriate for the target. 
*/ diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index cf84775..e27c14f 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -85,6 +85,7 @@ extern void print_reg (rtx, int, FILE*); extern void ix86_print_operand (FILE *, rtx, int); extern void split_double_mode (machine_mode, rtx[], int, rtx[], rtx[]); +extern void split_double_concat (machine_mode, rtx, rtx lo, rtx); extern const char *output_set_got (rtx, rtx); extern const char *output_387_binary_op (rtx_insn *, rtx*); diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 9aaeb69..4560681 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -4379,6 +4379,16 @@ (set_attr "type" "imovx,mskmov,mskmov") (set_attr "mode" "SI,QI,QI")]) +(define_insn_and_split "zero_extendditi2" + [(set (match_operand:TI 0 "nonimmediate_operand" "=r,o") + (zero_extend:TI (match_operand:DI 1 "nonimmediate_operand" "rm,r")))] + "TARGET_64BIT" + "#" + "&& reload_completed" + [(set (match_dup 3) (match_dup 1)) + (set (match_dup 4) (const_int 0))] + "split_double_mode (TImode, &operands[0], 1, &operands[3], &operands[4]);") + ;; Transform xorl; mov[bw] (set strict_low_part) into movz[bw]l. (define_peephole2 [(parallel [(set (match_operand:SWI48 0 "general_reg_operand") @@ -6512,6 +6522,31 @@ [(set_attr "type"
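To make the idiom concrete: the DST = (HI << 64) | LO shape that the new *concat splitters match (in its TImode/DImode-halves form) can be written in C as below. This is a hypothetical illustration, not one of the pr91681-*.c testcases from the patch.

```c
/* Hypothetical illustration only (not the pr91681 testcases): building a
   double word (TImode) value from its two DImode halves, the shape the
   *concat splitters recognise.  */
unsigned __int128
concat64 (unsigned long long hi, unsigned long long lo)
{
  return ((unsigned __int128) hi << 64) | lo;
}
```

With the post-reload splitters above, the expectation is that this lowers to a pair of word-sized moves (or an xchg when the halves arrive the wrong way around), rather than an explicit shift/or sequence.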
[Documentation] Correct RTL documentation: (use (mem ...)) is allowed.
This patch is a one line correction/clarification to GCC's current RTL documentation that explains a USE of a MEM is permissible. PR rtl-optimization/99930 is an interesting example on x86_64 where the backend generates better code when a USE is a (const) MEM than when it is a REG. In fact the backend relies on CSE to propagate the MEM (a constant pool reference) into the USE, to enable combine to merge/simplify instructions. This change has been tested with a make bootstrap, but as it might provoke a discussion, I've decided to not consider it "obvious". Ok for mainline (to document the actual current behavior)? 2022-07-23 Roger Sayle gcc/ChangeLog * doc/rtl.texi (use): Document that the operand may be a MEM. Roger -- diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi index 43c9ee8..995c8be 100644 --- a/gcc/doc/rtl.texi +++ b/gcc/doc/rtl.texi @@ -3283,7 +3283,8 @@ Represents the use of the value of @var{x}. It indicates that the value in @var{x} at this point in the program is needed, even though it may not be apparent why this is so. Therefore, the compiler will not attempt to delete previous instructions whose only effect is to -store a value in @var{x}. @var{x} must be a @code{reg} expression. +store a value in @var{x}. @var{x} must be a @code{reg} or a @code{mem} +expression. In some situations, it may be tempting to add a @code{use} of a register in a @code{parallel} to describe a situation where the value
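Purely for illustration (not part of the patch), the kind of RTL the clarified wording covers, and one way a backend might create it, is sketched below; "vec_cst" is a placeholder constant, and the fragment assumes GCC's usual internal headers and an expander context.

```c
/* Illustrative GCC-internals sketch, not from the patch.  "vec_cst" is a
   placeholder CONST_VECTOR.  The emitted insn has the form
   (use (mem/u/c:V2DF (symbol_ref ...))), i.e. a USE whose operand is a
   constant-pool MEM rather than a REG.  */
rtx cmem = validize_mem (force_const_mem (V2DFmode, vec_cst));
emit_insn (gen_rtx_USE (VOIDmode, cmem));
```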
[PATCH] Add new target hook: simplify_modecc_const.
This patch is a major revision of the patch I originally proposed here: https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598040.html The primary motivation of this patch is to avoid incorrect optimization of MODE_CC comparisons in simplify_const_relational_operation when/if a backend represents the (known) contents of a MODE_CC register using a CONST_INT. In such cases, the RTL optimizers don't know the semantics of this integer value, so shouldn't change anything (i.e. should return NULL_RTX from simplify_const_relational_operation). The secondary motivation is that by introducing a new target hook, called simplify_modecc_const, the backend can (optionally) encode and interpret a target dependent encoding of MODE_CC registers. The worked example provided with this patch is to allow the i386 backend to explicitly model the carry flag (MODE_CCC) using 1 to indicate that the carry flag is set, and 0 to indicate the carry flag is clear. This allows the instructions stc (set carry flag), clc (clear carry flag) and cmc (complement carry flag) to be represented in RTL. However an even better example would be the rs6000 backend, where this patch/target hook would allow improved modelling of the condition register CR. The powerpc's comparison instructions set fields/bits in the CR register [where bit 0 indicates less than, bit 1 greater than, bit 2 equal to and bit3 overflow] analogous to x86's flags register [containing bits for carry, zero, overflow, parity etc.]. These fields can be manipulated directly using crset (aka creqv) and crclr (aka crxor) instructions and even transferred from general purpose registers using mtcr. However, without a patch like this, it's impossible to safely model/represent these instructions in rs6000.md. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, and both with and without a patch to add stc, clc and cmc support to the x86 backend. I'll resubmit the x86 target pieces again with that follow-up backend patch, so for now I'm only looking for approval of the middle-end infrastructure pieces. The x86 hunks below are provided as context/documentation for how this hook could/should be used (but I wouldn't object to pre-approval of those bits by Uros). Ok for mainline? 2022-07-26 Roger Sayle gcc/ChangeLog * target.def (simplify_modecc_const): New target hook. * doc/tm.texi (TARGET_SIMPLIFY_MODECC_CONST): Document here. * doc/tm.texi.in (TARGET_SIMPLIFY_MODECC_CONST): Locate @hook here. * hooks.cc (hook_rtx_mode_int_rtx_null): Define default hook here. * hooks.h (hook_rtx_mode_int_rtx_null): Prototype here. * simplify-rtx.c (simplify_const_relational_operation): Avoid mis-optimizing MODE_CC comparisons by calling new target hook. * config/i386.cc (ix86_simplify_modecc_const): Implement new target hook, supporting interpreting MODE_CCC values as the x86 carry flag. (TARGET_SIMPLIFY_MODECC_CONST): Define as ix86_simplify_modecc_const. Thanks in advance, Roger -- > -Original Message- > From: Segher Boessenkool > Sent: 07 July 2022 23:39 > To: Roger Sayle > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [PATCH] Be careful with MODE_CC in > simplify_const_relational_operation. > > Hi! > > On Thu, Jul 07, 2022 at 10:08:04PM +0100, Roger Sayle wrote: > > I think it's fair to describe RTL's representation of condition flags > > using MODE_CC as a little counter-intuitive. 
> > "A little challenging", and you should see that as a good thing, as a puzzle to > crack :-) > > > For example, the i386 > > backend represents the carry flag (in adc instructions) using RTL of > > the form "(ltu:SI (reg:CCC) (const_int 0))", where great care needs to > > be taken not to treat this like a normal RTX expression, after all LTU > > (less-than-unsigned) against const0_rtx would normally always be > > false. > > A comparison of a MODE_CC thing against 0 means the result of a > *previous* comparison (or other cc setter) is looked at. Usually it simply looks > at some condition bits in a flags register. It does not do any actual comparison: > that has been done before (if at all even). > > > Hence, MODE_CC comparisons need to be treated with caution, and > > simplify_const_relational_operation returns early (to avoid > > problems) when GET_MODE_CLASS (GET_MODE (op0)) == MODE_CC. > > Not just to avoid problems: there simply isn't enough information to do a > correct job. > > > However, consider the (currently) hypothetical situation, where the > > RTL optimizers determine that a previous instruction unconditionally > > sets or clears the carry flag, and this gets pro
[PATCH] middle-end: More support for ABIs that pass FP values as wider ints.
Firstly many thanks again to Jeff Law for reviewing/approving the previous patch to add support for ABIs that pass FP values as wider integer modes. That has allowed significant progress on PR target/104489. As predicted enabling HFmode on nvptx-none automatically enables more testcases in the testsuite and making sure these all PASS has revealed a few missed spots and a deficiency in the middle-end. For example, support for HC mode, where a complex value is encoded as two 16-bit HFmode parts was insufficiently covered in my previous testing. More interesting is that __fixunshfti is required by GCC, and not natively supported by the nvptx backend, requiring softfp support in libgcc, which in turn revealed an interesting asymmetry in libcall handling in optabs.cc. In the expand_fixed_convert function, which is responsible for expanding libcalls for integer to floating point conversion, GCC calls prepare_libcall_arg that (specifically for integer arguments) calls promote_function_mode on the argument, so that the libcall ABI matches the regular target ABI. By comparison, the equivalent expand_fix function, for floating point to integer conversion, doesn't promote its argument. On nvptx, where the assembler is strongly typed, this produces a mismatch as the __fixunshfti function created by libgcc doesn't precisely match the signature assumed by optabs. The solution is to perform the same (or similar) prepare_libcall_arg preparation in both cases. In this patch, the existing (static) prepare_libcall_arg, which assumes an integer argument, is renamed prepare_libcall_int_arg, and a matching prepare_libcall_fp_arg is introduced. This should be safe on other platforms (fingers-crossed) as floating point argument promotion is rare [floats are passed in float registers, doubles are passed in double registers, etc.] This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, and on nvptx-none with a backend patch that resolves the rest of PR target/104489. Ok for mainline? 2022-07-26 Roger Sayle gcc/ChangeLog PR target/104489 * calls.cc (emit_library_call_value_1): Enable the FP return value of a libcall to be returned as a wider integer, by converting the int result to be converted to the desired floating point mode. (store_one_arg): Allow floating point arguments to be passed on the stack as wider integers using convert_float_to_wider_int. * function.cc (assign_parms_unsplit_complex): Likewise, allow complex floating point modes to be passed as wider integer parts, using convert_wider_int_to_float. * optabs.cc (prepare_libcall_fp_arg): New function. A floating point version of the previous prepare_libcall_arg that calls promote_function_mode on its argument. (expand_fix): Call new prepare_libcall_fp_arg on FIX argument. (prepare_libcall_int_arg): Renamed from prepare_libcall_arg. (expand_fixed_convert): Update call of prepare_libcall_arg to the new name, prepare_libcall_int_arg. Thanks again, Roger -- diff --git a/gcc/calls.cc b/gcc/calls.cc index 7f3cf5f..50d0495 100644 --- a/gcc/calls.cc +++ b/gcc/calls.cc @@ -4791,14 +4791,20 @@ emit_library_call_value_1 (int retval, rtx orgfun, rtx value, else { /* Convert to the proper mode if a promotion has been active. */ - if (GET_MODE (valreg) != outmode) + enum machine_mode valmode = GET_MODE (valreg); + if (valmode != outmode) { int unsignedp = TYPE_UNSIGNED (tfom); gcc_assert (promote_function_mode (tfom, outmode, &unsignedp, fndecl ? 
TREE_TYPE (fndecl) : fntype, 1) - == GET_MODE (valreg)); - valreg = convert_modes (outmode, GET_MODE (valreg), valreg, 0); + == valmode); + if (SCALAR_INT_MODE_P (valmode) + && SCALAR_FLOAT_MODE_P (outmode) + && known_gt (GET_MODE_SIZE (valmode), GET_MODE_SIZE (outmode))) + valreg = convert_wider_int_to_float (outmode, valmode, valreg); + else + valreg = convert_modes (outmode, valmode, valreg, 0); } if (value != 0) @@ -5003,8 +5009,20 @@ store_one_arg (struct arg_data *arg, rtx argblock, int flags, /* If we are promoting object (or for any other reason) the mode doesn't agree, convert the mode. */ - if (arg->mode != TYPE_MODE (TREE_TYPE (pval))) - arg->value = convert_modes (arg->mode, TYPE_MODE (TREE_TYPE (pval)), + machine_mode old_mode = TYPE_MODE (TREE_TYPE (pval)); + + /* Some ABIs require scalar floating point modes to be passed +in a wider scalar integer mode. We need to explicitly +reinterpret
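As a source-level illustration (not taken from the nvptx testsuite) of a conversion that reaches the __fixunshfti soft-fp routine discussed above, assuming a target/multilib with both _Float16 and __int128 support:

```c
/* Illustrative only: an unsigned fix conversion from HFmode to TImode,
   i.e. the __fixunshfti libcall mentioned above.  */
unsigned __int128
hf_to_u128 (_Float16 x)
{
  return (unsigned __int128) x;
}
```

The point of the optabs.cc change above is that the HFmode argument of such a libcall now receives the same promote_function_mode treatment as integer libcall arguments already do, so the libcall ABI matches the regular target ABI.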
RE: [PATCH] Add new target hook: simplify_modecc_const.
Hi Segher, It's very important to distinguish the invariants that exist for the RTL data structures as held in memory (rtx), vs. the use of "enum rtx_code"s, "machine_mode"s and operands in the various processing functions of the middle-end. Yes, it's very true that RTL integer constants don't specify a mode (are VOIDmode), so therefore operations like ZERO_EXTEND or EQ don't make sense with all constant operands. This is (one reason) why constant-only operands are disallowed from RTL (data structures), and why in APIs that perform/simplify these operations, the original operand mode (of the const_int(s)) must be/is always passed as a parameter. Hence, for say simplify_const_binary_operation, op0 and op1 can both be const_int, as the mode argument specifies the mode of the "code" operation. Likewise, in simplify_relational_operation, both op0 and op1 may be CONST_INT as "cmp_mode" explicitly specifies the mode that the operation is performed in and "mode" specifies the mode of the result. Your comment that "comparing two integer constants is invalid RTL *in all contexts*" is a serious misunderstanding of what's going on. At no point is a RTL rtx node ever allocated with two integer constant operands. RTL simplification is for hypothetical "what if" transformations (just like try_combine calls recog with RTL that may not be real instructions), and these simplifcations are even sometimes required to preserve the documented RTL invariants. Comparisons of two integers must be simplified to true/false precisely to ensure that they never appear in an actual COMPARE node. I worry this fundamental misunderstanding is the same issue that has been holding up understanding/approving a previous patch: https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578848.html For a related bug, consider PR rtl-optimization/67382, that's assigned to you in bugzilla. In this case, the RTL optimizers know that both operands to a COMPARE are integer constants (both -2), yet the compiler still performs a run-time comparison and conditional jump: movl$-2, %eax movl%eax, 12(%rsp) cmpl$-2, %eax je .L1 Failing to optimize/consider a comparison between two integer constants *in any context* just leads to poor code. Hopefully, this clears up that the documented constraints on RTL rtx aren't exactly the same as the constraints on the use of rtx_codes in simplify-rtx's functional APIs. So simplify_subreg really gets called on operands that are neither REG nor MEM, as this is unrelated to what the documentation of the SUBREG rtx specifies. If you don't believe that op0 and op1 can ever both be const_int in this function, perhaps consider it harmless dead code and humor me. Thanks in advance, Roger -- > -Original Message- > From: Segher Boessenkool > Sent: 26 July 2022 18:45 > To: Roger Sayle > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [PATCH] Add new target hook: simplify_modecc_const. > > Hi! > > On Tue, Jul 26, 2022 at 01:13:02PM +0100, Roger Sayle wrote: > > This patch is a major revision of the patch I originally proposed here: > > https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598040.html > > > > The primary motivation of this patch is to avoid incorrect > > optimization of MODE_CC comparisons in > > simplify_const_relational_operation when/if a backend represents the > > (known) contents of a MODE_CC register using a CONST_INT. In such > > cases, the RTL optimizers don't know the semantics of this integer > > value, so shouldn't change anything (i.e. 
should return NULL_RTX from > simplify_const_relational_operation). > > This is invalid RTL. What would (set (reg:CC) (const_int 0)) mean, for example? > If this was valid it would make most existing code using CC modes do essentially > random things :-( > > The documentation (in tm.texi, "Condition Code") says > Alternatively, you can use @code{BImode} if the comparison operator is > specified already in the compare instruction. In this case, you are not > interested in most macros in this section. > > > The worked example provided with this patch is to allow the i386 > > backend to explicitly model the carry flag (MODE_CCC) using 1 to > > indicate that the carry flag is set, and 0 to indicate the carry flag > > is clear. This allows the instructions stc (set carry flag), clc > > (clear carry flag) and cmc (complement carry flag) to be represented in RTL. > > Hrm, I wonder how other targets do this. > > On Power we have a separate hard register for the carry flag of course (it is a > separate bit in the hardware as well, XER[CA]). > > On Arm there is arm_carry_operatio
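A small GCC-internals fragment (illustrative only, not from any patch) of the distinction being drawn: at the simplify-rtx API level both operands may be constants, because the operation's mode is passed separately, and the call folds the result rather than ever building such an rtx.

```c
/* Illustrative fragment, assuming GCC's rtl.h; not part of a patch.
   Both operands are CONST_INTs, which is fine at this API level because
   SImode is supplied explicitly.  The result is (const_int 5); no PLUS
   rtx with two constant operands is ever created in the insn stream.  */
rtx sum = simplify_const_binary_operation (PLUS, SImode,
                                           GEN_INT (2), GEN_INT (3));
```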
RE: [PATCH] Add new target hook: simplify_modecc_const.
Hi Segher, > Thank you for telling the maintainer of combine the basics of what all of this > does! I hadn't noticed any of that before. You're welcome. I've also been maintaining combine for some time now: https://gcc.gnu.org/legacy-ml/gcc/2003-10/msg00455.html > They can be, as clearly documented (and obvious from the code), but you can > not ever have that in the RTL stream, which is needed for your patch to do > anything. That's the misunderstanding; neither this nor the previous SUBREG patch affects/changes what is in the RTL stream: no COMPARE nodes are ever changed or modified, only eliminated by the propagation/fusion in combine (or CSE). We have --enable-checking=rtl to guarantee that the documented invariants always hold in the RTL stream. Cheers, Roger
[PATCH] Some additional zero-extension related optimizations in simplify-rtx.
This patch implements some additional zero-extension and sign-extension related optimizations in simplify-rtx.cc. The original motivation comes from PR rtl-optimization/71775, where in comment #2 Andrew Pinski sees: Failed to match this instruction: (set (reg:DI 88 [ _1 ]) (sign_extend:DI (subreg:SI (ctz:DI (reg/v:DI 86 [ x ])) 0))) On many platforms the result of DImode CTZ is constrained to be a small unsigned integer (between 0 and 64), hence the truncation to 32-bits (using a SUBREG) and the following sign extension back to 64-bits are effectively a no-op, so the above should ideally (often) be simplified to "(set (reg:DI 88) (ctz:DI (reg/v:DI 86 [ x ])))". To implement this, and some closely related transformations, we build upon the existing val_signbit_known_clear_p predicate. In the first chunk, nonzero_bits knows that FFS and ABS can't leave the sign-bit bit set, so the simplification of ABS (ABS (x)) and ABS (FFS (x)) can itself be simplified. The second transformation is that we can canonicalize SIGN_EXTEND to ZERO_EXTEND (as in the PR 71775 case above) when the operand's sign-bit is known to be clear. The final two chunks are for SIGN_EXTEND of a truncating SUBREG, and ZERO_EXTEND of a truncating SUBREG respectively. The nonzero_bits of a truncating SUBREG pessimistically thinks that the upper bits may have an arbitrary value (by taking the SUBREG), so we need to look deeper at the SUBREG's operand to confirm that the high bits are known to be zero. Unfortunately, for PR rtl-optimization/71775, ctz:DI on x86_64 with default architecture options is undefined at zero, so we can't be sure the upper bits of reg:DI 88 will be sign extended (all zeros or all ones). nonzero_bits knows this, so the above transformations don't trigger, but the transformations themselves are perfectly valid for other operations such as FFS, POPCOUNT and PARITY, and on other targets/-march settings where CTZ is defined at zero. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, with no new failures. Testing with CSiBE shows these transformations trigger on several source files (and with -Os reduces the size of the code). Ok for mainline? 2022-07-27 Roger Sayle gcc/ChangeLog * simplify-rtx.cc (simplify_unary_operation_1) <ABS>: Simplify test as both FFS and ABS result in nonzero_bits returning a mask that satisfies val_signbit_known_clear_p. <SIGN_EXTEND>: Canonicalize SIGN_EXTEND to ZERO_EXTEND when val_signbit_known_clear_p is true of the operand. Simplify sign extensions of SUBREG truncations of operands that are already suitably (zero) extended. <ZERO_EXTEND>: Simplify zero extensions of SUBREG truncations of operands that are already suitably zero extended. Thanks in advance, Roger -- diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index fa20665..e62bf56 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -1366,9 +1366,8 @@ simplify_context::simplify_unary_operation_1 (rtx_code code, machine_mode mode, break; /* If operand is something known to be positive, ignore the ABS. */ - if (GET_CODE (op) == FFS || GET_CODE (op) == ABS - || val_signbit_known_clear_p (GET_MODE (op), - nonzero_bits (op, GET_MODE (op + if (val_signbit_known_clear_p (GET_MODE (op), +nonzero_bits (op, GET_MODE (op return op; /* If operand is known to be only -1 or 0, convert ABS to NEG.
*/ @@ -1615,6 +1614,24 @@ simplify_context::simplify_unary_operation_1 (rtx_code code, machine_mode mode, } } + /* We can canonicalize SIGN_EXTEND (op) as ZERO_EXTEND (op) when + we know the sign bit of OP must be clear. */ + if (val_signbit_known_clear_p (GET_MODE (op), +nonzero_bits (op, GET_MODE (op + return simplify_gen_unary (ZERO_EXTEND, mode, op, GET_MODE (op)); + + /* (sign_extend:DI (subreg:SI (ctz:DI ...))) is (ctz:DI ...). */ + if (GET_CODE (op) == SUBREG + && subreg_lowpart_p (op) + && GET_MODE (SUBREG_REG (op)) == mode + && is_a (mode, &int_mode) + && is_a (GET_MODE (op), &op_mode) + && GET_MODE_PRECISION (int_mode) <= HOST_BITS_PER_WIDE_INT + && GET_MODE_PRECISION (op_mode) < GET_MODE_PRECISION (int_mode) + && (nonzero_bits (SUBREG_REG (op), mode) + & ~(GET_MODE_MASK (op_mode)>>1)) == 0) + return SUBREG_REG (op); + #if defined(POINTERS_EXTEND_UNSIGNED) /* As we do not know which address space the pointer is referring to, we can do this only if the target does not support different pointer @@ -1765,6 +1782,18 @@ simplify_co
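For reference, a hypothetical source-level reconstruction (not the PR's exact testcase) of code that produces the RTL quoted from PR 71775 comment #2: __builtin_ctzll returns int, so widening its result back to a 64-bit type yields the sign_extend of a lowpart SUBREG of the DImode CTZ discussed above.

```c
/* Hypothetical reconstruction, not the PR's testcase: the int result of
   the DImode ctz is truncated to SImode and then sign-extended back to
   DImode by the return conversion.  */
long long
ctz64 (unsigned long long x)
{
  return __builtin_ctzll (x);
}
```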
[x86_64 PATCH] PR target/106450: Tweak timode_remove_non_convertible_regs.
This patch resolves PR target/106450, some more fall-out from more aggressive TImode scalar-to-vector (STV) optimizations. I continue to be caught out by how far TImode STV has diverged from DImode/SImode STV, and therefore requires additional (unexpected) tweaking. Many thanks to H.J. Lu for pointing out timode_remove_non_convertible_regs needs to be extended to handle XOR (and other new operations). Unhelpfully the comment above this function states that it's the TImode version of "remove_non_convertible_regs", which doesn't exist anymore, so I've resurrected an explanatory comment from the git history. By refactoring the checks for hard regs and already "marked" regs into timode_check_non_convertible_regs itself, all its callers are simplified. This patch then uses GET_RTX_CLASS to generically handle unary and binary operations, calling timode_check_non_convertible_regs on each TImode register operand in the single_set's SET_SRC. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, with no new failures. Ok for mainline? 2022-07-28 Roger Sayle gcc/ChangeLog PR target/106450 * config/i386/i386-features.cc (timode_check_non_convertible_regs): Do nothing if REGNO is set in the REGS bitmap, or is a hard reg. (timode_remove_non_convertible_regs): Update comment. Call timode_check_non_convertible_regs on all register operands of supported (binary and unary) operations. gcc/testsuite/ChangeLog PR target/106450 * gcc.target/i386/pr106450.c: New test case. Thanks in advance, Roger -- diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc index aa5de71..2a4097c 100644 --- a/gcc/config/i386/i386-features.cc +++ b/gcc/config/i386/i386-features.cc @@ -1808,6 +1808,11 @@ static void timode_check_non_convertible_regs (bitmap candidates, bitmap regs, unsigned int regno) { + /* Do nothing if REGNO is already in REGS or is a hard reg. */ + if (bitmap_bit_p (regs, regno) + || HARD_REGISTER_NUM_P (regno)) +return; + for (df_ref def = DF_REG_DEF_CHAIN (regno); def; def = DF_REF_NEXT_REG (def)) @@ -1843,7 +1848,13 @@ timode_check_non_convertible_regs (bitmap candidates, bitmap regs, } } -/* The TImode version of remove_non_convertible_regs. */ +/* For a given bitmap of insn UIDs scans all instructions and + remove insn from CANDIDATES in case it has both convertible + and not convertible definitions. + + All insns in a bitmap are conversion candidates according to + scalar_to_vector_candidate_p. Currently it implies all insns + are single_set. 
*/ static void timode_remove_non_convertible_regs (bitmap candidates) @@ -1861,21 +1872,40 @@ timode_remove_non_convertible_regs (bitmap candidates) rtx dest = SET_DEST (def_set); rtx src = SET_SRC (def_set); - if ((!REG_P (dest) -|| bitmap_bit_p (regs, REGNO (dest)) -|| HARD_REGISTER_P (dest)) - && (!REG_P (src) - || bitmap_bit_p (regs, REGNO (src)) - || HARD_REGISTER_P (src))) - continue; - if (REG_P (dest)) timode_check_non_convertible_regs (candidates, regs, REGNO (dest)); - if (REG_P (src)) - timode_check_non_convertible_regs (candidates, regs, -REGNO (src)); + switch (GET_RTX_CLASS (GET_CODE (src))) + { + case RTX_OBJ: + if (REG_P (src)) + timode_check_non_convertible_regs (candidates, regs, +REGNO (src)); + break; + + case RTX_UNARY: + if (REG_P (XEXP (src, 0)) + && GET_MODE (XEXP (src, 0)) == TImode) + timode_check_non_convertible_regs (candidates, regs, +REGNO (XEXP (src, 0))); + break; + + case RTX_COMM_ARITH: + case RTX_BIN_ARITH: + if (REG_P (XEXP (src, 0)) + && GET_MODE (XEXP (src, 0)) == TImode) + timode_check_non_convertible_regs (candidates, regs, +REGNO (XEXP (src, 0))); + if (REG_P (XEXP (src, 1)) + && GET_MODE (XEXP (src, 1)) == TImode) + timode_check_non_convertible_regs (candidates, regs, +REGNO (XEXP (src, 1))); + break; + + default: + break; + } } EXECUTE_IF_SET_IN_BITMAP (regs, 0, id, bi) diff --git a/gcc/testsuite/gcc.target/i386/pr106450.c b/gcc/testsuite/gcc.target/i386/pr106450.c new file mode 100644 index 000..d16231f --- /dev/null +++ b/gcc/testsuite/gcc.tar
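The new gcc.target/i386/pr106450.c is truncated above; purely as a hypothetical sketch (not the PR's reproducer), the kind of TImode binary operation whose register operands the refactored scan now visits looks like:

```c
/* Hypothetical sketch only; the real pr106450.c is not reproduced here.
   A TImode XOR whose operands are pseudo registers is the sort of binary
   operation timode_check_non_convertible_regs is now called on.  */
__int128 a, b, c;

void
foo (void)
{
  c = a ^ b;
}
```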
[x86 PATCH] Support logical shifts by (some) integer constants in TImode STV.
This patch improves TImode STV by adding support for logical shifts by integer constants that are multiples of 8. For the test case: __int128 a, b; void foo() { a = b << 16; } on x86_64, gcc -O2 currently generates: movqb(%rip), %rax movqb+8(%rip), %rdx shldq $16, %rax, %rdx salq$16, %rax movq%rax, a(%rip) movq%rdx, a+8(%rip) ret with this patch we now generate: movdqa b(%rip), %xmm0 pslldq $2, %xmm0 movaps %xmm0, a(%rip) ret This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check. both with and without --target_board=unix{-m32}, with no new failures. Ok for mainline? 2022-07-28 Roger Sayle gcc/ChangeLog * config/i386/i386-features.cc (compute_convert_gain): Add gain for converting suitable TImode shift to a V1TImode shift. (timode_scalar_chain::convert_insn): Add support for converting suitable ASHIFT and LSHIFTRT. (timode_scalar_to_vector_candidate_p): Consider logical shifts by integer constants that are multiples of 8 to be candidates. gcc/testsuite/ChangeLog * gcc.target/i386/sse4_1-stv-7.c: New test case. Thanks again, Roger -- diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc index aa5de71..e1e0645 100644 --- a/gcc/config/i386/i386-features.cc +++ b/gcc/config/i386/i386-features.cc @@ -1221,6 +1221,13 @@ timode_scalar_chain::compute_convert_gain () igain = COSTS_N_INSNS (1); break; + case ASHIFT: + case LSHIFTRT: + /* For logical shifts by constant multiples of 8. */ + igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (4) + : COSTS_N_INSNS (1); + break; + default: break; } @@ -1462,6 +1469,12 @@ timode_scalar_chain::convert_insn (rtx_insn *insn) src = convert_compare (XEXP (src, 0), XEXP (src, 1), insn); break; +case ASHIFT: +case LSHIFTRT: + convert_op (&XEXP (src, 0), insn); + PUT_MODE (src, V1TImode); + break; + default: gcc_unreachable (); } @@ -1796,6 +1809,14 @@ timode_scalar_to_vector_candidate_p (rtx_insn *insn) case NOT: return REG_P (XEXP (src, 0)) || timode_mem_p (XEXP (src, 0)); +case ASHIFT: +case LSHIFTRT: + /* Handle logical shifts by integer constants between 0 and 120 +that are multiples of 8. */ + return REG_P (XEXP (src, 0)) +&& CONST_INT_P (XEXP (src, 1)) +&& (INTVAL (XEXP (src, 1)) & ~0x78) == 0; + default: return false; } diff --git a/gcc/testsuite/gcc.target/i386/sse4_1-stv-7.c b/gcc/testsuite/gcc.target/i386/sse4_1-stv-7.c new file mode 100644 index 000..b0d5fce --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/sse4_1-stv-7.c @@ -0,0 +1,18 @@ +/* { dg-do compile { target int128 } } */ +/* { dg-options "-O2 -msse4.1 -mstv -mno-stackrealign" } */ + +unsigned __int128 a; +unsigned __int128 b; + +void foo() +{ + a = b << 16; +} + +void bar() +{ + a = b >> 16; +} + +/* { dg-final { scan-assembler "pslldq" } } */ +/* { dg-final { scan-assembler "psrldq" } } */
[x86_64 PATCH] Add rotl64ti2_doubleword pattern to i386.md
This patch adds rot[lr]64ti2_doubleword patterns to the x86_64 backend, to move splitting of 128-bit TImode rotates by 64 bits after reload, matching what we now do for 64-bit DImode rotations by 32 bits with -m32. In theory moving when this rotation is split should have little influence on code generation, but in practice "reload" sometimes decides to make use of the increased flexibility to reduce the number of registers used, and the code size, by using xchg. For example: __int128 x; __int128 y; __int128 a; __int128 b; void foo() { unsigned __int128 t = x; t ^= a; t = (t<<64) | (t>>64); t ^= b; y = t; } Before: movqx(%rip), %rsi movqx+8(%rip), %rdi xorqa(%rip), %rsi xorqa+8(%rip), %rdi movq%rdi, %rax movq%rsi, %rdx xorqb(%rip), %rax xorqb+8(%rip), %rdx movq%rax, y(%rip) movq%rdx, y+8(%rip) ret After: movqx(%rip), %rax movqx+8(%rip), %rdx xorqa(%rip), %rax xorqa+8(%rip), %rdx xchgq %rdx, %rax xorqb(%rip), %rax xorqb+8(%rip), %rdx movq%rax, y(%rip) movq%rdx, y+8(%rip) ret One some modern architectures this is a small win, on some older architectures this is a small loss. The decision which code to generate is made in "reload", and could probably be tweaked by register preferencing. The much bigger win is that (eventually) all TImode mode shifts and rotates by constants will become potential candidates for TImode STV. This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check with no new failures. Ok for mainline? 2022-07-29 Roger Sayle gcc/ChangeLog * config/i386/i386.md (define_expand ti3): For rotations by 64 bits use new rot[lr]64ti2_doubleword pattern. (rot[lr]64ti2_doubleword): New post-reload splitter. Thanks again, Roger -- diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index fab6aed..f1158e1 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -13820,6 +13820,8 @@ if (const_1_to_63_operand (operands[2], VOIDmode)) emit_insn (gen_ix86_ti3_doubleword (operands[0], operands[1], operands[2])); + else if (CONST_INT_P (operands[2]) && INTVAL (operands[2]) == 64) +emit_insn (gen_64ti2_doubleword (operands[0], operands[1])); else { rtx amount = force_reg (QImode, operands[2]); @@ -14045,6 +14047,24 @@ } }) +(define_insn_and_split "64ti2_doubleword" + [(set (match_operand:TI 0 "register_operand" "=r,r,r") + (any_rotate:TI (match_operand:TI 1 "nonimmediate_operand" "0,r,o") + (const_int 64)))] + "TARGET_64BIT" + "#" + "&& reload_completed" + [(set (match_dup 0) (match_dup 3)) + (set (match_dup 2) (match_dup 1))] +{ + split_double_mode (TImode, &operands[0], 2, &operands[0], &operands[2]); + if (rtx_equal_p (operands[0], operands[1])) +{ + emit_insn (gen_swapdi (operands[0], operands[2])); + DONE; +} +}) + (define_mode_attr rorx_immediate_operand [(SI "const_0_to_31_operand") (DI "const_0_to_63_operand")])
RE: [PATCH] Some additional zero-extension related optimizations in simplify-rtx.
Hi Segher, > On Wed, Jul 27, 2022 at 02:42:25PM +0100, Roger Sayle wrote: > > This patch implements some additional zero-extension and > > sign-extension related optimizations in simplify-rtx.cc. The original > > motivation comes from PR rtl-optimization/71775, where in comment #2 > Andrew Pinski sees: > > > > Failed to match this instruction: > > (set (reg:DI 88 [ _1 ]) > > (sign_extend:DI (subreg:SI (ctz:DI (reg/v:DI 86 [ x ])) 0))) > > > > On many platforms the result of DImode CTZ is constrained to be a > > small unsigned integer (between 0 and 64), hence the truncation to > > 32-bits (using a SUBREG) and the following sign extension back to > > 64-bits are effectively a no-op, so the above should ideally (often) > > be simplified to "(set (reg:DI 88) (ctz:DI (reg/v:DI 86 [ x ]))". > > And you can also do that if ctz is undefined for a zero argument! Forgive my perhaps poor use of terminology. The case of ctz 0 on x64_64 isn't "undefined behaviour" (UB) in the C/C++ sense that would allow us to do anything, but implementation defined (which Intel calls "undefined" in their documentation). Hence, we don't know which DI value is placed in the result register. In this case, truncating to SI mode, then sign extending the result is not a no-op, as the top bits will/must now all be the same [though admittedly to an unknown undefined signbit]. Hence the above optimization would be invalid, as it doesn't guarantee the result would be sign-extended. > > To implement this, and some closely related transformations, we build > > upon the existing val_signbit_known_clear_p predicate. In the first > > chunk, nonzero_bits knows that FFS and ABS can't leave the sign-bit > > bit set, > > Is that guaranteed in all cases? Also at -O0, also for args bigger than > 64 bits? val_signbit_known_clear_p should work for any size/precision arg. I'm not sure if the results are affected by -O0, but even if they are, this will not affect correctness only whether these optimizations are performed, which is precisely what -O0 controls. > > + /* (sign_extend:DI (subreg:SI (ctz:DI ...))) is (ctz:DI ...). */ > > + if (GET_CODE (op) == SUBREG > > + && subreg_lowpart_p (op) > > + && GET_MODE (SUBREG_REG (op)) == mode > > + && is_a (mode, &int_mode) > > + && is_a (GET_MODE (op), &op_mode) > > + && GET_MODE_PRECISION (int_mode) <= HOST_BITS_PER_WIDE_INT > > + && GET_MODE_PRECISION (op_mode) < GET_MODE_PRECISION > (int_mode) > > + && (nonzero_bits (SUBREG_REG (op), mode) > > + & ~(GET_MODE_MASK (op_mode)>>1)) == 0) > > (spaces around >> please) Doh! Good catch, thanks. > Please use val_signbit_known_{set,clear}_p? Alas, it's not just the SI mode's signbit that we care about, but all of the bits above it in the DImode operand/result. These all need to be zero, for the operand to already be zero-extended/sign_extended. > > + return SUBREG_REG (op); > > Also, this is not correct for C[LT]Z_DEFINED_VALUE_AT_ZERO non-zero if the > value it returns in its second arg does not survive sign extending unmodified (if it > is 0x for an extend from SI to DI for example). Fortunately, C[LT]Z_DEFINED_VALUE_AT_ZERO being defined to return a negative result, such as -1 is already handled (accounted for) in nonzero_bits. The relevant code in rtlanal.cc's nonzero_bits1 is: case CTZ: /* If CTZ has a known value at zero, then the nonzero bits are that value, plus the number of bits in the mode minus one. 
*/ if (CTZ_DEFINED_VALUE_AT_ZERO (mode, nonzero)) nonzero |= (HOST_WIDE_INT_1U << (floor_log2 (mode_width))) - 1; else nonzero = -1; break; Hence, any bits set by the constant returned by the target's DEFINED_VALUE_AT_ZERO will be set in the result of nonzero_bits. So if this is negative, say -1, then val_signbit_known_clear_p (or the more complex tests above) will return false. I'm currently bootstrapping and regression testing the whitespace change/correction suggested above. Thanks, Roger --
RE: [PATCH] Some additional zero-extension related optimizations in simplify-rtx.
Hi Segher, > > > To implement this, and some closely related transformations, we > > > build upon the existing val_signbit_known_clear_p predicate. In the > > > first chunk, nonzero_bits knows that FFS and ABS can't leave the > > > sign-bit bit set, > > > > Is that guaranteed in all cases? Also at -O0, also for args bigger > > than 64 bits? > > val_signbit_known_clear_p should work for any size/precision arg. No, you're right! Please forgive/excuse me. Neither val_signbit_p nor nonzero_bits has yet been updated to use "wide_int", so they don't work for TImode or wider modes. Doh! I'm shocked. Roger --
[x86_64 PATCH take #2] PR target/106450: Tweak timode_remove_non_convertible_regs.
Many thanks to H.J. for pointing out a better idiom for traversing the USEs (and also DEFs) of TImode registers in an instruction. This revised patched has been tested on x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and without --target_board=unix{-m32}, with no new failures. Ok for mainline? 2022-07-30 Roger Sayle H.J. Lu gcc/ChangeLog PR target/106450 * config/i386/i386-features.cc (timode_check_non_convertible_regs): Do nothing if REGNO is set in the REGS bitmap, or is a hard reg. (timode_remove_non_convertible_regs): Update comment. Call timode_check_non_convertible_reg on all TImode register DEFs and USEs in each instruction. gcc/testsuite/ChangeLog PR target/106450 * gcc.target/i386/pr106450.c: New test case. Thanks (H.J. and Uros), Roger -- > -Original Message- > From: H.J. Lu > Sent: 28 July 2022 17:55 > To: Roger Sayle > Cc: GCC Patches > Subject: Re: [x86_64 PATCH] PR target/106450: Tweak > timode_remove_non_convertible_regs. > > On Thu, Jul 28, 2022 at 9:43 AM Roger Sayle > wrote: > > > > This patch resolves PR target/106450, some more fall-out from more > > aggressive TImode scalar-to-vector (STV) optimizations. I continue to > > be caught out by how far TImode STV has diverged from DImode/SImode > > STV, and therefore requires additional (unexpected) tweaking. Many > > thanks to H.J. Lu for pointing out timode_remove_non_convertible_regs > > needs to be extended to handle XOR (and other new operations). > > > > Unhelpfully the comment above this function states that it's the > > TImode version of "remove_non_convertible_regs", which doesn't exist > > anymore, so I've resurrected an explanatory comment from the git history. > > By refactoring the checks for hard regs and already "marked" regs into > > timode_check_non_convertible_regs itself, all its callers are > > simplified. This patch then uses GET_RTX_CLASS to generically handle > > unary and binary operations, calling timode_check_non_convertible_regs > > on each TImode register operand in the single_set's SET_SRC. > > > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > > and make -k check, both with and without --target_board=unix{-m32}, > > with no new failures. Ok for mainline? > > > > > > 2022-07-28 Roger Sayle > > > > gcc/ChangeLog > > PR target/106450 > > * config/i386/i386-features.cc (timode_check_non_convertible_regs): > > Do nothing if REGNO is set in the REGS bitmap, or is a hard reg. > > (timode_remove_non_convertible_regs): Update comment. > > Call timode_check_non_convertible_regs on all register operands > > of supported (binary and unary) operations. > > Should we use > > df_ref ref; > FOR_EACH_INSN_USE (ref, insn) >if (!DF_REF_REG_MEM_P (ref)) > timode_check_non_convertible_regs (candidates, regs, > DF_REF_REGNO (ref)); > > to check each use? > > > gcc/testsuite/ChangeLog > > PR target/106450 > > * gcc.target/i386/pr106450.c: New test case. > > > > > > Thanks in advance, > > Roger > > -- > -- > H.J. diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc index aa5de71..e4cc4a3 100644 --- a/gcc/config/i386/i386-features.cc +++ b/gcc/config/i386/i386-features.cc @@ -1808,6 +1808,11 @@ static void timode_check_non_convertible_regs (bitmap candidates, bitmap regs, unsigned int regno) { + /* Do nothing if REGNO is already in REGS or is a hard reg. 
*/ + if (bitmap_bit_p (regs, regno) + || HARD_REGISTER_NUM_P (regno)) +return; + for (df_ref def = DF_REG_DEF_CHAIN (regno); def; def = DF_REF_NEXT_REG (def)) @@ -1843,7 +1848,13 @@ timode_check_non_convertible_regs (bitmap candidates, bitmap regs, } } -/* The TImode version of remove_non_convertible_regs. */ +/* For a given bitmap of insn UIDs scans all instructions and + remove insn from CANDIDATES in case it has both convertible + and not convertible definitions. + + All insns in a bitmap are conversion candidates according to + scalar_to_vector_candidate_p. Currently it implies all insns + are single_set. */ static void timode_remove_non_convertible_regs (bitmap candidates) @@ -1857,25 +1868,20 @@ timode_remove_non_convertible_regs (bitmap candidates) changed = false; EXECUTE_IF_SET_IN_BITMAP (candidates, 0, id, bi) { - rtx def_set = single_set (DF_INSN_UID_GET (id)->insn); - rtx dest = SET_DEST (def_set); - rtx src = SET_SRC (def_set); - -
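The hunk that replaces the old dest/src checks is cut off above. Based on the ChangeLog entry ("all TImode register DEFs and USEs") and H.J.'s FOR_EACH_INSN_USE suggestion quoted earlier, the replacement loop body presumably takes roughly the following shape; this is a sketch, not the verbatim diff.

```c
/* Sketch only, not the actual patch hunk: visit every TImode register
   DEF and USE of the candidate insn, as described in the ChangeLog.  */
rtx_insn *insn = DF_INSN_UID_GET (id)->insn;
df_ref ref;
FOR_EACH_INSN_DEF (ref, insn)
  if (!DF_REF_REG_MEM_P (ref)
      && GET_MODE (DF_REF_REAL_REG (ref)) == TImode)
    timode_check_non_convertible_regs (candidates, regs, DF_REF_REGNO (ref));
FOR_EACH_INSN_USE (ref, insn)
  if (!DF_REF_REG_MEM_P (ref)
      && GET_MODE (DF_REF_REAL_REG (ref)) == TImode)
    timode_check_non_convertible_regs (candidates, regs, DF_REF_REGNO (ref));
```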