RE: [RFC] expr: don't clear SUBREG_PROMOTED_VAR_P flag for a promoted subreg [target/111466]

2023-09-29 Thread Roger Sayle


I agree that this looks dubious.  Normally, if the middle-end/optimizers
wish to reuse a SUBREG in a context where the flags are not valid, it
should create a new one with the desired flags, rather than "mutate"
an existing (and possibly shared) RTX.

I wonder if creating a new SUBREG here also fixes your problem?
I'm not sure that clearing SUBREG_PROMOTED_VAR_P is needed
at all, but given that its motivation has been lost to history, it would
be good to have a plan B, if Jeff's alpha testing uncovers a subtle issue.
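
A rough, untested sketch of that plan B, in case it's useful: rather than
clearing the flag on a possibly shared SUBREG in expand_expr_real_2,
construct a fresh SUBREG that simply doesn't have SUBREG_PROMOTED_VAR_P
set, something along these lines (a sketch only, not a tested patch):

/* Untested sketch: build a new SUBREG instead of mutating a shared one.  */
if (TYPE_UNSIGNED (TREE_TYPE (treeop0)) != unsignedp
    && GET_CODE (op0) == SUBREG
    && SUBREG_PROMOTED_VAR_P (op0))
  op0 = gen_rtx_SUBREG (GET_MODE (op0), SUBREG_REG (op0),
                        SUBREG_BYTE (op0));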

Roger
--

> -Original Message-
> From: Vineet Gupta 
> Sent: 28 September 2023 22:44
> To: gcc-patches@gcc.gnu.org; Robin Dapp 
> Cc: kito.ch...@gmail.com; Jeff Law ; Palmer Dabbelt
> ; gnu-toolch...@rivosinc.com; Roger Sayle
> ; Jakub Jelinek ; Jivan
> Hakobyan ; Vineet Gupta 
> Subject: [RFC] expr: don't clear SUBREG_PROMOTED_VAR_P flag for a promoted
> subreg [target/111466]
> 
> RISC-V suffers from extraneous sign extensions, despite/given the ABI
> guarantee that 32-bit quantities are sign-extended into 64-bit registers,
> meaning incoming SI function args need not be explicitly sign extended
> (so do SI return values, as most ALU insns implicitly sign-extend too.)
> 
> Existing REE doesn't seem to handle this well and there are various ideas
> floating around to smarten REE about it.
> 
> RISC-V also seems to correctly implement middle-end hook PROMOTE_MODE etc.
> 
> Another approach would be to prevent EXPAND from generating the sign_extend
> in the first place which this patch tries to do.
> 
> The hunk being removed was introduced way back in 1994 as
>    5069803972 ("expand_expr, case CONVERT_EXPR .. clear the promotion flag")
> 
> This survived full testsuite run for RISC-V rv64gc with surprisingly no
> fallouts: test results before/after are exactly same.
> 
> |                                | # of unexpected case / # of unique unexpected case |
> |                                |   gcc    |  g++  | gfortran |
> | rv64imafdc_zba_zbb_zbs_zicond/ | 264 / 87 | 5 / 2 | 72 / 12  |
> |   lp64d/medlow                 |          |       |          |
> 
> Granted for something so old to have survived, there must be a valid reason.
> Unfortunately the original change didn't have additional commentary or a test
> case. That is not to say it can't/won't possibly break things on other
> arches/ABIs, hence the RFC for someone to scream that this is just bonkers,
> don't do this :-)
> 
> I've explicitly CC'ed Jakub and Roger who have last touched subreg promoted
> notes in expr.cc for insight and/or screaming ;-)
> 
> Thanks to Robin for narrowing this down in an amazing debugging session
> @ GNU Cauldron.
> 
> ```
> foo2:
>   sext.w  a6,a1 <-- this goes away
>   beq a1,zero,.L4
>   li  a5,0
>   li  a0,0
> .L3:
>   addwa4,a2,a5
>   addwa5,a3,a5
>   addwa0,a4,a0
>   bltua5,a6,.L3
>   ret
> .L4:
>   li  a0,0
>   ret
> ```
> 
> Signed-off-by: Vineet Gupta 
> Co-developed-by: Robin Dapp 
> ---
>  gcc/expr.cc   |  7 ---
>  gcc/testsuite/gcc.target/riscv/pr111466.c | 15 +++
>  2 files changed, 15 insertions(+), 7 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/pr111466.c
> 
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index 308ddc09e631..d259c6e53385 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -9332,13 +9332,6 @@ expand_expr_real_2 (sepops ops, rtx target,
> machine_mode tmode,
> op0 = expand_expr (treeop0, target, VOIDmode,
>modifier);
> 
> -   /* If the signedness of the conversion differs and OP0 is
> -  a promoted SUBREG, clear that indication since we now
> -  have to do the proper extension.  */
> -   if (TYPE_UNSIGNED (TREE_TYPE (treeop0)) != unsignedp
> -   && GET_CODE (op0) == SUBREG)
> - SUBREG_PROMOTED_VAR_P (op0) = 0;
> -
> return REDUCE_BIT_FIELD (op0);
>   }
> 
> diff --git a/gcc/testsuite/gcc.target/riscv/pr111466.c
> b/gcc/testsuite/gcc.target/riscv/pr111466.c
> new file mode 100644
> index ..007792466a51
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/pr111466.c
> @@ -0,0 +1,15 @@
> +/* Simplified varaint of gcc.target/riscv/zba-adduw.c.  */
> +
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv64gc_zba_zbs -mabi=lp64" } */
> +/* { dg-skip-if "" { *-*-* } { "-O0" } } */
> +
> +int foo2(int unused, int n, unsigned y, unsigned delta){
> +  int s = 0;
> +  unsigned int x = 0;
> +  for (;x<n;x += delta)
> +    s += x+y;
> +  return s;
> +}
> +
> +/* { dg-final { scan-assembler "\msext\M" } } */
> --
> 2.34.1




[ARC PATCH] Use rlc r0, 0 to implement scc_ltu (i.e. carry_flag ? 1 : 0)

2023-09-29 Thread Roger Sayle

This patch teaches the ARC backend that the contents of the carry flag
can be placed in an integer register conveniently using the "rlc rX,0"
instruction, which is a rotate-left-through-carry using zero as a source.
This is a convenient special case for the LTU form of the scc pattern.

unsigned int foo(unsigned int x, unsigned int y)
{
  return (x+y) < x;
}

With -O2 -mcpu=em this is currently compiled to:

foo:add.f 0,r0,r1
mov_s   r0,1;3
j_s.d   [blink]
mov.hs r0,0

[which after an addition to set the carry flag, sets r0 to 1,
followed by a conditional assignment of r0 to zero if the
carry flag is clear].  With the new define_insn/optimization
in this patch, this becomes:

foo:add.f 0,r0,r1
j_s.d   [blink]
rlc r0,0

This define_insn is also a useful building block for implementing
shifts and rotates.
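
As a quick sanity check of the underlying identity (illustrative only,
assuming 32-bit unsigned int; not part of the patch), (x+y) < x is exactly
the carry out of the addition, which is the bit that "rlc r0,0" copies
from the carry flag:

#include <assert.h>

int main (void)
{
  unsigned int tests[][2] = { {0u, 0u}, {1u, ~0u}, {~0u, ~0u}, {123u, 456u} };
  for (unsigned int i = 0; i < sizeof tests / sizeof tests[0]; i++)
    {
      unsigned int x = tests[i][0], y = tests[i][1];
      unsigned long long wide = (unsigned long long) x + y;
      assert (((x + y) < x) == (unsigned int) (wide >> 32));
    }
  return 0;
}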

Tested on a cross-compiler to arc-linux (hosted on x86_64-pc-linux-gnu),
and a partial tool chain, where the new case passes and there are no
new regressions.  Ok for mainline?


2023-09-29  Roger Sayle  

gcc/ChangeLog
* config/arc/arc.md (CC_ltu): New mode iterator for CC and CC_C.
(scc_ltu_<mode>): New define_insn to handle LTU form of scc_insn.
(*scc_insn): Don't split to a conditional move sequence for LTU.

gcc/testsuite/ChangeLog
* gcc.target/arc/scc-ltu.c: New test case.


Thanks in advance,
Roger
--

diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md
index d37ecbf..fe2e7fb 100644
--- a/gcc/config/arc/arc.md
+++ b/gcc/config/arc/arc.md
@@ -3658,12 +3658,24 @@ archs4x, archs4xd"
 (define_expand "scc_insn"
   [(set (match_operand:SI 0 "dest_reg_operand" "=w") (match_operand:SI 1 ""))])
 
+(define_mode_iterator CC_ltu [CC_C CC])
+
+(define_insn "scc_ltu_<mode>"
+  [(set (match_operand:SI 0 "dest_reg_operand" "=w")
+(ltu:SI (reg:CC_ltu CC_REG) (const_int 0)))]
+  ""
+  "rlc\\t%0,0"
+  [(set_attr "type" "shift")
+   (set_attr "predicable" "no")
+   (set_attr "length" "4")])
+
 (define_insn_and_split "*scc_insn"
   [(set (match_operand:SI 0 "dest_reg_operand" "=w")
(match_operator:SI 1 "proper_comparison_operator" [(reg CC_REG) 
(const_int 0)]))]
   ""
   "#"
-  "reload_completed"
+  "reload_completed
+   && GET_CODE (operands[1]) != LTU"
   [(set (match_dup 0) (const_int 1))
(cond_exec
  (match_dup 1)
diff --git a/gcc/testsuite/gcc.target/arc/scc-ltu.c 
b/gcc/testsuite/gcc.target/arc/scc-ltu.c
new file mode 100644
index 000..653c55d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arc/scc-ltu.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=em" } */
+
+unsigned int foo(unsigned int x, unsigned int y)
+{
+  return (x+y) < x;
+}
+
+/* { dg-final { scan-assembler "rlc\\s+r0,0" } } */
+/* { dg-final { scan-assembler "add.f\\s+0,r0,r1" } } */
+/* { dg-final { scan-assembler-not "mov_s\\s+r0,1" } } */
+/* { dg-final { scan-assembler-not "mov\.hs\\s+r0,0" } } */


RE: [ARC PATCH] Use rlc r0, 0 to implement scc_ltu (i.e. carry_flag ? 1 : 0)

2023-09-29 Thread Roger Sayle


Hi Claudiu,
> The patch looks sane. Have you run dejagnu test suite?

I've not yet managed to set up an emulator or compile the entire toolchain,
so my dejagnu results are only useful for catching (serious) problems in the
compile only tests:

=== gcc Summary ===

# of expected passes        91875
# of unexpected failures    23768
# of unexpected successes   23
# of expected failures      1038
# of unresolved testcases   19490
# of unsupported tests      3819
/home/roger/GCC/arc-linux/gcc/xgcc  version 14.0.0 20230828 (experimental) (GCC)

If someone could double check there are no issues on real hardware that
would be great.  I'm not sure if ARC is one of the targets covered by
Jeff Law's compile farm?


> -Original Message-
> From: Roger Sayle 
> Sent: Friday, September 29, 2023 6:54 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Claudiu Zissulescu 
> Subject: [ARC PATCH] Use rlc r0,0 to implement scc_ltu (i.e. carry_flag ?
1 : 0)
> 
> 
> This patch teaches the ARC backend that the contents of the carry flag can be
> placed in an integer register conveniently using the "rlc rX,0"
> instruction, which is a rotate-left-through-carry using zero as a source.
> This is a convenient special case for the LTU form of the scc pattern.
> 
> unsigned int foo(unsigned int x, unsigned int y) {
>   return (x+y) < x;
> }
> 
> With -O2 -mcpu=em this is currently compiled to:
> 
> foo:add.f 0,r0,r1
> mov_s   r0,1;3
> j_s.d   [blink]
> mov.hs r0,0
> 
> [which after an addition to set the carry flag, sets r0 to 1, followed by a
> conditional assignment of r0 to zero if the carry flag is clear].  With
> the new define_insn/optimization in this patch, this becomes:
> 
> foo:add.f 0,r0,r1
> j_s.d   [blink]
> rlc r0,0
> 
> This define_insn is also a useful building block for implementing shifts
> and rotates.
> 
> Tested on a cross-compiler to arc-linux (hosted on x86_64-pc-linux-gnu),
> and a partial tool chain, where the new case passes and there are no new
> regressions.  Ok for mainline?
> 
> 
> 2023-09-29  Roger Sayle  
> 
> gcc/ChangeLog
> * config/arc/arc.md (CC_ltu): New mode iterator for CC and CC_C.
> (scc_ltu_): New define_insn to handle LTU form of scc_insn.
> (*scc_insn): Don't split to a conditional move sequence for LTU.
> 
> gcc/testsuite/ChangeLog
> * gcc.target/arc/scc-ltu.c: New test case.
> 
> 
> Thanks in advance,
> Roger
> --




RE: [ARC PATCH] Split SImode shifts pre-reload on !TARGET_BARREL_SHIFTER.

2023-10-03 Thread Roger Sayle


Hi Claudiu,
Thanks for the answers to my technical questions.
If you'd prefer to update arc.md's add3 pattern first,
I'm happy to update/revise my patch based on this
and your feedback, for example preferring add over
asl_s (or controlling this choice with -Os).

Thanks again.
Roger
--

> -Original Message-
> From: Claudiu Zissulescu 
> Sent: 03 October 2023 15:26
> To: Roger Sayle ; gcc-patches@gcc.gnu.org
> Subject: RE: [ARC PATCH] Split SImode shifts pre-reload on
> !TARGET_BARREL_SHIFTER.
> 
> Hi Roger,
> 
> It was nice to meet you too.
> 
> Thank you for looking into the ARC's non-Barrel Shifter configurations.  I
> will dive into your patch asap, but before starting here are a few of my
> comments:
> 
> -Original Message-
> From: Roger Sayle 
> Sent: Thursday, September 28, 2023 2:27 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Claudiu Zissulescu 
> Subject: [ARC PATCH] Split SImode shifts pre-reload on
> !TARGET_BARREL_SHIFTER.
> 
> 
> Hi Claudiu,
> It was great meeting up with you and the Synopsys ARC team at the GNU tools
> Cauldron in Cambridge.
> 
> This patch is the first in a series to improve SImode and DImode shifts
> and rotates in the ARC backend.  This first piece splits SImode shifts, for
> !TARGET_BARREL_SHIFTER targets, after combine and before reload, in the
> split1 pass, as suggested by the FIXME comment above output_shift in arc.cc.
> To do this I've copied the implementation of the x86_pre_reload_split
> function from i386 backend, and renamed it arc_pre_reload_split.
> 
> Although the actual implementations of shifts remain the same (as in
> output_shift), having them as explicit instructions in the RTL stream
> allows better scheduling and use of compact forms when available.  The
> benefits can be seen in two short examples below.
> 
> For the function:
> unsigned int foo(unsigned int x, unsigned int y) {
>   return y << 2;
> }
> 
> GCC with -O2 -mcpu=em would previously generate:
> foo:add r1,r1,r1
> add r1,r1,r1
> j_s.d   [blink]
> mov_s   r0,r1   ;4
> 
> [CZI] The move shouldn't be generated indeed. The use of ADDs are slightly
> beneficial for older ARCv1 arches.
> 
> and with this patch now generates:
> foo:asl_s r0,r1
> j_s.d   [blink]
> asl_s r0,r0
> 
> [CZI] Nice. This new sequence is as fast as we can get for our ARCv2 cpus.
> 
> Notice the original (from shift_si3's output_shift) requires the shift
> sequence to be monolithic with the same destination register as the source
> (requiring an extra mov_s).  The new version can eliminate this move, and
> schedule the second asl in the branch delay slot of the return.
> 
> For the function:
> int x,y,z;
> 
> void bar()
> {
>   x <<= 3;
>   y <<= 3;
>   z <<= 3;
> }
> 
> GCC -O2 -mcpu=em currently generates:
> bar:push_s  r13
> ld.as   r12,[gp,@x@sda] ;23
> ld.as   r3,[gp,@y@sda]  ;23
> mov r2,0
> add3 r12,r2,r12
> mov r2,0
> add3 r3,r2,r3
> ld.as   r2,[gp,@z@sda]  ;23
> st.as   r12,[gp,@x@sda] ;26
> mov r13,0
> add3 r2,r13,r2
> st.as   r3,[gp,@y@sda]  ;26
> st.as   r2,[gp,@z@sda]  ;26
> j_s.d   [blink]
> pop_s   r13
> 
> where each shift by 3, uses ARC's add3 instruction, which is similar to
> x86's lea implementing x = (y<<3) + z, but requires the value zero to be
> placed in a temporary register "z".  Splitting this before reload allows
> these pseudos to be shared/reused.  With this patch, we get
> 
> bar:ld.as   r2,[gp,@x@sda]  ;23
> mov_s   r3,0;3
> add3    r2,r3,r2
> ld.as   r3,[gp,@y@sda]  ;23
> st.as   r2,[gp,@x@sda]  ;26
> ld.as   r2,[gp,@z@sda]  ;23
> mov_s   r12,0   ;3
> add3    r3,r12,r3
> add3    r2,r12,r2
> st.as   r3,[gp,@y@sda]  ;26
> st.as   r2,[gp,@z@sda]  ;26
> j_s [blink]
> 
> [CZI] Looks great, but it also shows that I've forgot to add to ADD3
> instruction the Ra,LIMM,RC variant, which will lead to have instead of
>     mov_s   r3,0    ;3
>     add3    r2,r3,r2
> Only this add3,0,r2, Indeed it is longer instruction but faster.
> 
> Unfortunately, register allocation means that we only share two of the
> three "mov_s z,0", but this is sufficient to reduce register pressure
> enough to avoid spilling r13 in the prologue/epilogue.
> 
> This patch also contains a (latent?) bug fix.  The implementation of the
> default insn "length" attribute, assumes instruc

PING: PR rtl-optimization/110701

2023-10-03 Thread Roger Sayle
 

There are a small handful of middle-end maintainers/reviewers that
understand and appreciate the difference between the RTL statements:

(set (subreg:HI (reg:SI x)) (reg:HI y))

and

(set (strict_lowpart:HI (reg:SI x)) (reg:HI y))
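
(As a purely illustrative C analogy, not exact RTL semantics: the plain
subreg store leaves the bits of x outside the HImode lowpart undefined,
whereas the strict_lowpart store preserves them, roughly as below, assuming
32-bit x and 16-bit y.)

unsigned int set_subreg (unsigned int x, unsigned short y)
{
  /* (set (subreg:HI (reg:SI x)) (reg:HI y)):
     bits outside the low 16 are left in an undefined state.  */
  (void) x;
  return y;
}

unsigned int set_strict_lowpart (unsigned int x, unsigned short y)
{
  /* (set (strict_lowpart:HI (reg:SI x)) (reg:HI y)):
     the upper 16 bits of x are explicitly preserved.  */
  return (x & 0xffff0000u) | y;
}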

 

If one (or more) of them could please take a look at
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625532.html
I'd very much appreciate it (one less wrong-code regression).

 

Many thanks in advance,

Roger

--

 



[PATCH] Support g++ 4.8 as a host compiler.

2023-10-04 Thread Roger Sayle

The recent patch to remove poly_int_pod triggers a bug in g++ 4.8.5's
C++ 11 support which mistakenly believes poly_uint16 has a non-trivial
constructor.  This in turn prohibits it from being used as a member in
a union (rtxunion) that is constructed statically, resulting in a (fatal)
error during stage 1.  A workaround is to add an explicit constructor
to the problematic union, which allows mainline to be bootstrapped with
the system compiler on older RedHat 7 systems.

This patch has been tested on x86_64-pc-linux-gnu where it allows a
bootstrap to complete when using g++ 4.8.5 as the host compiler.
Ok for mainline?


2023-10-04  Roger Sayle  

gcc/ChangeLog
* rtl.h (rtx_def::u): Add explicit constructor to workaround
issue using g++ 4.8 as a host compiler.

diff --git a/gcc/rtl.h b/gcc/rtl.h
index 6850281..a7667f5 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -451,6 +451,9 @@ struct GTY((desc("0"), tag("0"),
 struct fixed_value fv;
 struct hwivec_def hwiv;
 struct const_poly_int_def cpi;
+#if defined(__GNUC__) && GCC_VERSION < 5000
+u () {}
+#endif
   } GTY ((special ("rtx_def"), desc ("GET_CODE (&%0)"))) u;
 };
 


[X86 PATCH] Split lea into shorter left shift by 2 or 3 bits with -Oz.

2023-10-05 Thread Roger Sayle

This patch avoids long lea instructions for performing x<<2 and x<<3
by splitting them into shorter sal and move (or xchg instructions).
Because this increases the number of instructions but reduces the
total size, it's suitable for -Oz (but not -Os).

The impact can be seen in the new test case:

int foo(int x) { return x<<2; }
int bar(int x) { return x<<3; }
long long fool(long long x) { return x<<2; }
long long barl(long long x) { return x<<3; }

where with -O2 we generate:

foo:    lea     0x0(,%rdi,4),%eax   // 7 bytes
        retq
bar:    lea     0x0(,%rdi,8),%eax   // 7 bytes
        retq
fool:   lea     0x0(,%rdi,4),%rax   // 8 bytes
        retq
barl:   lea     0x0(,%rdi,8),%rax   // 8 bytes
        retq

and with -Oz we now generate:

foo:    xchg    %eax,%edi           // 1 byte
        shl     $0x2,%eax           // 3 bytes
        retq
bar:    xchg    %eax,%edi           // 1 byte
        shl     $0x3,%eax           // 3 bytes
        retq
fool:   xchg    %rax,%rdi           // 2 bytes
        shl     $0x2,%rax           // 4 bytes
        retq
barl:   xchg    %rax,%rdi           // 2 bytes
        shl     $0x3,%rax           // 4 bytes
        retq

Over the entirety of the CSiBE code size benchmark this saves 1347
bytes (0.037%) for x86_64, and 1312 bytes (0.036%) with -m32.
Conveniently, there's already a backend function in i386.cc for
deciding whether to split an lea into its component instructions,
ix86_avoid_lea_for_addr; all that's required is an additional clause
checking for -Oz (i.e. optimize_size > 1).

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board='unix{-m32}'
with no new failures.  Additional testing was performed by repeating
these steps after removing the "optimize_size > 1" condition, so that
suitable lea instructions were always split [-Oz is not heavily
tested, so this invoked the new code during the bootstrap and
regression testing], again with no regressions.  Ok for mainline?


2023-10-05  Roger Sayle  

gcc/ChangeLog
* config/i386/i386.cc (ix86_avoid_lea_for_addr): Split LEAs used
to perform left shifts into shorter instructions with -Oz.

gcc/testsuite/ChangeLog
* gcc.target/i386/lea-2.c: New test case.

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 477e6ce..9557bff 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -15543,6 +15543,13 @@ ix86_avoid_lea_for_addr (rtx_insn *insn, rtx 
operands[])
   && (regno0 == regno1 || regno0 == regno2))
 return true;
 
+  /* Split with -Oz if the encoding requires fewer bytes.  */
+  if (optimize_size > 1
+  && parts.scale > 1
+  && !parts.base
+  && (!parts.disp || parts.disp == const0_rtx)) 
+return true;
+
   /* Check we need to optimize.  */
   if (!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun))
 return false;
diff --git a/gcc/testsuite/gcc.target/i386/lea-2.c 
b/gcc/testsuite/gcc.target/i386/lea-2.c
new file mode 100644
index 000..20aded8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/lea-2.c
@@ -0,0 +1,7 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-Oz" } */
+int foo(int x) { return x<<2; }
+int bar(int x) { return x<<3; }
+long long fool(long long x) { return x<<2; }
+long long barl(long long x) { return x<<3; }
+/* { dg-final { scan-assembler-not "lea\[lq\]" } } */


[X86 PATCH] Implement doubleword shift left by 1 bit using add+adc.

2023-10-05 Thread Roger Sayle


This patch tweaks the i386 back-end's ix86_split_ashl to implement
doubleword left shifts by 1 bit, using an add followed by an add-with-carry
(i.e. a doubleword x+x) instead of using the x86's shld instruction.
The replacement sequence both requires fewer bytes and is faster on
both Intel and AMD architectures (from Agner Fog's latency tables and
confirmed by my own microbenchmarking).

For the test case:
__int128 foo(__int128 x) { return x << 1; }

with -O2 we previously generated:

foo:    movq    %rdi, %rax
        movq    %rsi, %rdx
        shldq   $1, %rdi, %rdx
        addq    %rdi, %rax
        ret

with this patch we now generate:

foo:    movq    %rdi, %rax
        movq    %rsi, %rdx
        addq    %rdi, %rax
        adcq    %rsi, %rdx
        ret
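
As a quick sanity check of the identity being exploited (illustrative only,
using GCC's __int128 extension; not part of the patch):

#include <stdint.h>
#include <assert.h>

int main (void)
{
  uint64_t lo = 0x8000000000000001ull, hi = 0x0123456789abcdefull;
  unsigned __int128 x = ((unsigned __int128) hi << 64) | lo;
  uint64_t new_lo = lo + lo;                /* the addq               */
  uint64_t carry = new_lo < lo;             /* carry out of that add  */
  uint64_t new_hi = hi + hi + carry;        /* the adcq               */
  assert ((uint64_t) (x << 1) == new_lo);
  assert ((uint64_t) ((x << 1) >> 64) == new_hi);
  return 0;
}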


This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-10-05  Roger Sayle  

gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_split_ashl): Split shifts by
one into add3_cc_overflow_1 followed by add3_carry.
* config/i386/i386.md (@add<mode>3_cc_overflow_1): Renamed from
"*add<mode>3_cc_overflow_1" to provide generator function.

gcc/testsuite/ChangeLog
* gcc.target/i386/ashldi3-2.c: New 32-bit test case.
* gcc.target/i386/ashlti3-3.c: New 64-bit test case.


Thanks in advance,
Roger
--




RE: [X86 PATCH] Implement doubleword shift left by 1 bit using add+adc.

2023-10-05 Thread Roger Sayle
Doh! ENOPATCH.

> -Original Message-
> From: Roger Sayle 
> Sent: 05 October 2023 12:44
> To: 'gcc-patches@gcc.gnu.org' 
> Cc: 'Uros Bizjak' 
> Subject: [X86 PATCH] Implement doubleword shift left by 1 bit using
add+adc.
> 
> 
> This patch tweaks the i386 back-end's ix86_split_ashl to implement
doubleword
> left shifts by 1 bit, using an add followed by an add-with-carry (i.e. a
doubleword
> x+x) instead of using the x86's shld instruction.
> The replacement sequence both requires fewer bytes and is faster on both
Intel
> and AMD architectures (from Agner Fog's latency tables and confirmed by my
> own microbenchmarking).
> 
> For the test case:
> __int128 foo(__int128 x) { return x << 1; }
> 
> with -O2 we previously generated:
> 
> foo:movq%rdi, %rax
> movq%rsi, %rdx
> shldq   $1, %rdi, %rdx
> addq%rdi, %rax
> ret
> 
> with this patch we now generate:
> 
> foo:movq%rdi, %rax
> movq%rsi, %rdx
> addq%rdi, %rax
> adcq%rsi, %rdx
> ret
> 
> 
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and
> make -k check, both with and without --target_board=unix{-m32} with no new
> failures.  Ok for mainline?
> 
> 
> 2023-10-05  Roger Sayle  
> 
> gcc/ChangeLog
> * config/i386/i386-expand.cc (ix86_split_ashl): Split shifts by
> one into add3_cc_overflow_1 followed by add3_carry.
> * config/i386/i386.md (@add3_cc_overflow_1): Renamed from
> "*add3_cc_overflow_1" to provide generator function.
> 
> gcc/testsuite/ChangeLog
> * gcc.target/i386/ashldi3-2.c: New 32-bit test case.
> * gcc.target/i386/ashlti3-3.c: New 64-bit test case.
> 
> 
> Thanks in advance,
> Roger
> --

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index e42ff27..09e41c8 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -6342,6 +6342,18 @@ ix86_split_ashl (rtx *operands, rtx scratch, 
machine_mode mode)
  if (count > half_width)
ix86_expand_ashl_const (high[0], count - half_width, mode);
}
+  else if (count == 1)
+   {
+ if (!rtx_equal_p (operands[0], operands[1]))
+   emit_move_insn (operands[0], operands[1]);
+ rtx x3 = gen_rtx_REG (CCCmode, FLAGS_REG);
+ rtx x4 = gen_rtx_LTU (mode, x3, const0_rtx);
+ half_mode = mode == DImode ? SImode : DImode;
+ emit_insn (gen_add3_cc_overflow_1 (half_mode, low[0],
+low[0], low[0]));
+ emit_insn (gen_add3_carry (half_mode, high[0], high[0], high[0],
+x3, x4));
+   }
   else
{
  gen_shld = mode == DImode ? gen_x86_shld : gen_x86_64_shld;
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index eef8a0e..6a5bc16 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -8864,7 +8864,7 @@
   [(set_attr "type" "alu")
 (set_attr "mode" "<MODE>")])
 
-(define_insn "*add<mode>3_cc_overflow_1"
+(define_insn "@add<mode>3_cc_overflow_1"
   [(set (reg:CCC FLAGS_REG)
(compare:CCC
(plus:SWI
diff --git a/gcc/testsuite/gcc.target/i386/ashldi3-2.c 
b/gcc/testsuite/gcc.target/i386/ashldi3-2.c
new file mode 100644
index 000..053389d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/ashldi3-2.c
@@ -0,0 +1,10 @@
+/* { dg-do compile { target ia32 } } */
+/* { dg-options "-O2 -mno-stv" } */
+
+long long foo(long long x)
+{
+  return x << 1;
+}
+
+/* { dg-final { scan-assembler "adcl" } } */
+/* { dg-final { scan-assembler-not "shldl" } } */
diff --git a/gcc/testsuite/gcc.target/i386/ashlti3-3.c 
b/gcc/testsuite/gcc.target/i386/ashlti3-3.c
new file mode 100644
index 000..4f14ca0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/ashlti3-3.c
@@ -0,0 +1,10 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+
+__int128 foo(__int128 x)
+{
+  return x << 1;
+}
+
+/* { dg-final { scan-assembler "adcq" } } */
+/* { dg-final { scan-assembler-not "shldq" } } */


RE: [X86 PATCH] Split lea into shorter left shift by 2 or 3 bits with -Oz.

2023-10-05 Thread Roger Sayle


Hi Uros,
Very many thanks for the speedy reviews.

Uros Bizjak wrote:
> On Thu, Oct 5, 2023 at 11:06 AM Roger Sayle 
> wrote:
> >
> >
> > This patch avoids long lea instructions for performing x<<2 and x<<3
> > by splitting them into shorter sal and move (or xchg instructions).
> > Because this increases the number of instructions, but reduces the
> > total size, its suitable for -Oz (but not -Os).
> >
> > The impact can be seen in the new test case:
> >
> > int foo(int x) { return x<<2; }
> > int bar(int x) { return x<<3; }
> > long long fool(long long x) { return x<<2; } long long barl(long long
> > x) { return x<<3; }
> >
> > where with -O2 we generate:
> >
> > foo:lea0x0(,%rdi,4),%eax// 7 bytes
> > retq
> > bar:lea0x0(,%rdi,8),%eax// 7 bytes
> > retq
> > fool:   lea0x0(,%rdi,4),%rax// 8 bytes
> > retq
> > barl:   lea0x0(,%rdi,8),%rax// 8 bytes
> > retq
> >
> > and with -Oz we now generate:
> >
> > foo:xchg   %eax,%edi// 1 byte
> > shl$0x2,%eax// 3 bytes
> > retq
> > bar:xchg   %eax,%edi// 1 byte
> > shl$0x3,%eax// 3 bytes
> > retq
> > fool:   xchg   %rax,%rdi// 2 bytes
> > shl$0x2,%rax// 4 bytes
> > retq
> > barl:   xchg   %rax,%rdi// 2 bytes
> > shl$0x3,%rax// 4 bytes
> > retq
> >
> > Over the entirety of the CSiBE code size benchmark this saves 1347
> > bytes (0.037%) for x86_64, and 1312 bytes (0.036%) with -m32.
> > Conveniently, there's already a backend function in i386.cc for
> > deciding whether to split an lea into its component instructions,
> > ix86_avoid_lea_for_addr, all that's required is an additional clause
> > checking for -Oz (i.e. optimize_size > 1).
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board='unix{-m32}'
> > with no new failures.  Additional testing was performed by repeating
> > these steps after removing the "optimize_size > 1" condition, so that
> > suitable lea instructions were always split [-Oz is not heavily
> > tested, so this invoked the new code during the bootstrap and
> > regression testing], again with no regressions.  Ok for mainline?
> >
> >
> > 2023-10-05  Roger Sayle  
> >
> > gcc/ChangeLog
> > * config/i386/i386.cc (ix86_avoid_lea_for_addr): Split LEAs used
> > to perform left shifts into shorter instructions with -Oz.
> >
> > gcc/testsuite/ChangeLog
> > * gcc.target/i386/lea-2.c: New test case.
> >
> 
> OK, but ...
> 
> @@ -0,0 +1,7 @@
> +/* { dg-do compile { target { ! ia32 } } } */
> 
> Is there a reason to avoid 32-bit targets? I'd expect that the optimization 
> also
> triggers on x86_32 for 32bit integers.

Good catch.  You're 100% correct; because the test case just checks that an LEA
is not used, and not for the specific sequence of shift instructions used 
instead,
this test also passes with --target_board='unix{-m32}'.  I'll remove the target 
clause
from the dg-do compile directive.

> +/* { dg-options "-Oz" } */
> +int foo(int x) { return x<<2; }
> +int bar(int x) { return x<<3; }
> +long long fool(long long x) { return x<<2; } long long barl(long long
> +x) { return x<<3; }
> +/* { dg-final { scan-assembler-not "lea\[lq\]" } } */

Thanks again.
Roger
--




[X86 PATCH] Implement doubleword right shifts by 1 bit using s[ha]r+rcr.

2023-10-06 Thread Roger Sayle


This patch tweaks the i386 back-end's ix86_split_ashr and ix86_split_lshr
functions to implement doubleword right shifts by 1 bit, using a shift
of the highpart that sets the carry flag followed by a rotate-carry-right
(RCR) instruction on the lowpart.

Conceptually this is similar to the recent left shift patch, but with two
complicating factors.  The first is that although the RCR sequence is
shorter, and is a ~3x performance improvement on AMD, my micro-benchmarking
shows it ~10% slower on Intel.  Hence this patch also introduces a new
X86_TUNE_USE_RCR tuning parameter.  The second is that I believe this is
the first time a "rotate-right-through-carry" and a right shift that sets
the carry flag from the least significant bit has been modelled in GCC RTL
(on a MODE_CC target).  For this I've used the i386 back-end's UNSPEC_CC_NE
which seems appropriate.  Finally rcrsi2 and rcrdi2 are separate
define_insns so that we can use their generator functions.

For the pair of functions:
unsigned __int128 foo(unsigned __int128 x) { return x >> 1; }
__int128 bar(__int128 x) { return x >> 1; }

with -O2 -march=znver4 we previously generated:

foo:    movq    %rdi, %rax
        movq    %rsi, %rdx
        shrdq   $1, %rsi, %rax
        shrq    %rdx
        ret
bar:    movq    %rdi, %rax
        movq    %rsi, %rdx
        shrdq   $1, %rsi, %rax
        sarq    %rdx
        ret

with this patch we now generate:

foo:    movq    %rsi, %rdx
        movq    %rdi, %rax
        shrq    %rdx
        rcrq    %rax
        ret
bar:    movq    %rsi, %rdx
        movq    %rdi, %rax
        sarq    %rdx
        rcrq    %rax
        ret
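
An illustrative C model of why the unsigned case works (not from the patch;
64-bit words and GCC's __int128 extension assumed):

#include <stdint.h>
#include <assert.h>

int main (void)
{
  uint64_t lo = 0xfedcba9876543211ull, hi = 0x0123456789abcdefull;
  unsigned __int128 x = ((unsigned __int128) hi << 64) | lo;
  uint64_t carry = hi & 1;                      /* CF left by shrq %rdx    */
  uint64_t new_hi = hi >> 1;
  uint64_t new_lo = (lo >> 1) | (carry << 63);  /* rcrq %rax rotates CF in */
  assert ((uint64_t) (x >> 1) == new_lo);
  assert ((uint64_t) ((x >> 1) >> 64) == new_hi);
  return 0;
}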

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  And to provide additional testing, I've also
bootstrapped and regression tested a version of this patch where the
RCR is always generated (independent of the -march target) again with
no regressions.  Ok for mainline?


2023-10-06  Roger Sayle  

gcc/ChangeLog
* config/i386/i386-expand.c (ix86_split_ashr): Split shifts by
one into ashr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR
or -Oz.
(ix86_split_lshr): Likewise, split shifts by one bit into
lshr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR or -Oz.
* config/i386/i386.h (TARGET_USE_RCR): New backend macro.
* config/i386/i386.md (rcrsi2): New define_insn for rcrl.
(rcrdi2): New define_insn for rcrq.
(3_carry): New define_insn for right shifts that
set the carry flag from the least significant bit, modelled using
UNSPEC_CC_NE.
* config/i386/x86-tune.def (X86_TUNE_USE_RCR): New tuning parameter
controlling use of rcr 1 vs. shrd, which is significantly faster on
AMD processors.

gcc/testsuite/ChangeLog
* gcc.target/i386/rcr-1.c: New 64-bit test case.
* gcc.target/i386/rcr-2.c: New 32-bit test case.


Thanks in advance,
Roger
--




RE: [X86 PATCH] Implement doubleword right shifts by 1 bit using s[ha]r+rcr.

2023-10-06 Thread Roger Sayle

Grr!  I've done it again.  ENOPATCH.

> -Original Message-
> From: Roger Sayle 
> Sent: 06 October 2023 14:58
> To: 'gcc-patches@gcc.gnu.org' 
> Cc: 'Uros Bizjak' 
> Subject: [X86 PATCH] Implement doubleword right shifts by 1 bit using
s[ha]r+rcr.
> 
> 
> This patch tweaks the i386 back-end's ix86_split_ashr and ix86_split_lshr
> functions to implement doubleword right shifts by 1 bit, using a shift of
the
> highpart that sets the carry flag followed by a rotate-carry-right
> (RCR) instruction on the lowpart.
> 
> Conceptually this is similar to the recent left shift patch, but with two
> complicating factors.  The first is that although the RCR sequence is
shorter, and is
> a ~3x performance improvement on AMD, my micro-benchmarking shows it
> ~10% slower on Intel.  Hence this patch also introduces a new
> X86_TUNE_USE_RCR tuning parameter.  The second is that I believe this is
the
> first time a "rotate-right-through-carry" and a right shift that sets the
carry flag
> from the least significant bit has been modelled in GCC RTL (on a MODE_CC
> target).  For this I've used the i386 back-end's UNSPEC_CC_NE which seems
> appropriate.  Finally rcrsi2 and rcrdi2 are separate define_insns so that
we can
> use their generator functions.
> 
> For the pair of functions:
> unsigned __int128 foo(unsigned __int128 x) { return x >> 1; }
> __int128 bar(__int128 x) { return x >> 1; }
> 
> with -O2 -march=znver4 we previously generated:
> 
> foo:movq%rdi, %rax
> movq%rsi, %rdx
> shrdq   $1, %rsi, %rax
> shrq%rdx
> ret
> bar:movq%rdi, %rax
> movq%rsi, %rdx
> shrdq   $1, %rsi, %rax
> sarq%rdx
> ret
> 
> with this patch we now generate:
> 
> foo:movq%rsi, %rdx
> movq%rdi, %rax
> shrq%rdx
> rcrq%rax
> ret
> bar:movq%rsi, %rdx
> movq%rdi, %rax
> sarq%rdx
> rcrq%rax
> ret
> 
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and
> make -k check, both with and without --target_board=unix{-m32} with no new
> failures.  And to provide additional testing, I've also bootstrapped and
regression
> tested a version of this patch where the RCR is always generated
(independent of
> the -march target) again with no regressions.  Ok for mainline?
> 
> 
> 2023-10-06  Roger Sayle  
> 
> gcc/ChangeLog
> * config/i386/i386-expand.c (ix86_split_ashr): Split shifts by
> one into ashr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR
> or -Oz.
> (ix86_split_lshr): Likewise, split shifts by one bit into
> lshr[sd]i3_carry followed by rcr[sd]i2, if TARGET_USE_RCR or -Oz.
> * config/i386/i386.h (TARGET_USE_RCR): New backend macro.
> * config/i386/i386.md (rcrsi2): New define_insn for rcrl.
> (rcrdi2): New define_insn for rcrq.
> (3_carry): New define_insn for right shifts that
> set the carry flag from the least significant bit, modelled using
> UNSPEC_CC_NE.
> * config/i386/x86-tune.def (X86_TUNE_USE_RCR): New tuning
parameter
> controlling use of rcr 1 vs. shrd, which is significantly faster
on
> AMD processors.
> 
> gcc/testsuite/ChangeLog
> * gcc.target/i386/rcr-1.c: New 64-bit test case.
> * gcc.target/i386/rcr-2.c: New 32-bit test case.
> 
> 
> Thanks in advance,
> Roger
> --

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index e42ff27..399eb8e 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -6496,6 +6496,22 @@ ix86_split_ashr (rtx *operands, rtx scratch, 
machine_mode mode)
emit_insn (gen_ashr3 (low[0], low[0],
  GEN_INT (count - half_width)));
}
+  else if (count == 1
+  && (TARGET_USE_RCR || optimize_size > 1))
+   {
+ if (!rtx_equal_p (operands[0], operands[1]))
+   emit_move_insn (operands[0], operands[1]);
+ if (mode == DImode)
+   {
+ emit_insn (gen_ashrsi3_carry (high[0], high[0]));
+ emit_insn (gen_rcrsi2 (low[0], low[0]));
+   }
+ else
+   {
+ emit_insn (gen_ashrdi3_carry (high[0], high[0]));
+ emit_insn (gen_rcrdi2 (low[0], low[0]));
+   }
+   }
   else
{
  gen_shrd = mode == DImode ? gen_x86_shrd : gen_x86_64_shrd;
@@ -6561,6 +6577,22 @@ ix86_split_lshr (rtx *operands, rtx scratch, 
machine_mode mode)
emit_insn (gen_lshr3 (low[0], low[0],

[ARC PATCH] Improved SImode shifts and rotates on !TARGET_BARREL_SHIFTER.

2023-10-08 Thread Roger Sayle

This patch completes the ARC back-end's transition to using pre-reload
splitters for SImode shifts and rotates on targets without a barrel
shifter.  The core part is that the shift_si3 define_insn is no longer
needed, as shifts and rotates that don't require a loop are split
before reload, and then because shift_si3_loop is the only caller
of output_shift, both can be significantly cleaned up and simplified.
The output_shift function (Claudiu's "the elephant in the room") is
renamed output_shift_loop, which handles just the four instruction
zero-overhead loop implementations.

Aside from the clean-ups, the user visible changes are much improved
implementations of SImode shifts and rotates on affected targets.

For the function:
unsigned int rotr_1 (unsigned int x) { return (x >> 1) | (x << 31); }

GCC with -O2 -mcpu=em would previously generate:

rotr_1: lsr_s r2,r0
bmsk_s r0,r0,0
ror r0,r0
j_s.d   [blink]
or_sr0,r0,r2

with this patch, we now generate:

j_s.d   [blink]
ror r0,r0

For the function:
unsigned int rotr_31 (unsigned int x) { return (x >> 31) | (x << 1); }

GCC with -O2 -mcpu=em would previously generate:

rotr_31:
mov_s   r2,r0   ;4
asl_s r0,r0
add.f 0,r2,r2
rlc r2,0
j_s.d   [blink]
or_sr0,r0,r2

with this patch we now generate an add.f followed by an adc:

rotr_31:
add.f   r0,r0,r0
j_s.d   [blink]
add.cs  r0,r0,1
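
The identity being used for rotr_31 can be checked quickly in C
(illustrative only, assuming 32-bit unsigned int; not part of the patch):
rotating right by 31 is rotating left by 1, i.e. (x << 1) plus the carry
out of x + x, which is what the add.f/add.cs pair computes.

#include <stdint.h>
#include <assert.h>

int main (void)
{
  uint32_t tests[] = { 0u, 1u, 0x80000000u, 0xdeadbeefu, 0xffffffffu };
  for (unsigned int i = 0; i < sizeof tests / sizeof tests[0]; i++)
    {
      uint32_t x = tests[i];
      uint32_t rotr31 = (x >> 31) | (x << 1);
      uint32_t carry = x >> 31;              /* carry out of add.f x,x  */
      assert (rotr31 == (x + x) + carry);    /* add.f then add.cs       */
    }
  return 0;
}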


Shifts by constants requiring a loop have been improved for even counts
by performing two operations in each iteration:

int shl10(int x) { return x >> 10; }

Previously looked like:

shl10:  mov.f lp_count, 10
        lpnz    2f
asr r0,r0
nop
2:  # end single insn loop
j_s [blink]


And now becomes:

shl10:
mov lp_count,5
lp  2f
asr r0,r0
asr r0,r0
2:  # end single insn loop
j_s [blink]


So emulating ARC's SWAP on architectures that don't have it:

unsigned int rotr_16 (unsigned int x) { return (x >> 16) | (x << 16); }

previously required 10 instructions and ~70 cycles:

rotr_16:
mov_s   r2,r0   ;4
mov.f lp_count, 16
        lpnz    2f
add r0,r0,r0
nop
2:  # end single insn loop
mov.f lp_count, 16
        lpnz    2f
lsr r2,r2
nop
2:  # end single insn loop
j_s.d   [blink]
or_sr0,r0,r2

now becomes just 4 instructions and ~18 cycles:

rotr_16:
mov lp_count,8
lp  2f
ror r0,r0
ror r0,r0
2:  # end single insn loop
j_s [blink]


This patch has been tested with a cross-compiler to arc-linux hosted
on x86_64-pc-linux-gnu and (partially) tested with the compile-only
portions of the testsuite with no regressions.  Ok for mainline, if
your own testing shows no issues?


2023-10-07  Roger Sayle  

gcc/ChangeLog
* config/arc/arc-protos.h (output_shift): Rename to...
(output_shift_loop): Tweak API to take an explicit rtx_code.
(arc_split_ashl): Prototype new function here.
(arc_split_ashr): Likewise.
(arc_split_lshr): Likewise.
(arc_split_rotl): Likewise.
(arc_split_rotr): Likewise.
* config/arc/arc.cc (output_shift): Delete local prototype.  Rename.
(output_shift_loop): New function replacing output_shift to output
a zero overheap loop for SImode shifts and rotates on ARC targets
without barrel shifter (i.e. no hardware support for these insns).
(arc_split_ashl): New helper function to split *ashlsi3_nobs.
(arc_split_ashr): New helper function to split *ashrsi3_nobs.
(arc_split_lshr): New helper function to split *lshrsi3_nobs.
(arc_split_rotl): New helper function to split *rotlsi3_nobs.
(arc_split_rotr): New helper function to split *rotrsi3_nobs.
* config/arc/arc.md (any_shift_rotate): New define_code_iterator.
(define_code_attr insn): New code attribute to map to pattern name.
(<insn>si3): New expander unifying previous ashlsi3,
ashrsi3 and lshrsi3 define_expands.  Adds rotlsi3 and rotrsi3.
(*<insn>si3_nobs): New define_insn_and_split that
unifies the previous *ashlsi3_nobs, *ashrsi3_nobs and *lshrsi3_nobs.
We now call arc_split_<insn> in arc.cc to implement each split.
(shift_si3): Delete define_insn, all shifts/rotates are now split.
(shift_si3_loop): Rename to...
(<insn>si3_loop): define_insn to handle loop implementations of
SImode shifts and rotates, calling output_shift_loop for template.
(rotrsi3): Rename to...
(*rotrsi3_insn): define_insn for TARGET_BARREL_SHIFTER's ror.
(*rotlsi3): New define_insn_and_split to transform left rotates
into right rotates before reload.
(rotlsi3_cnt1): New define_in

[PATCH] Optimize (ne:SI (subreg:QI (ashift:SI x 7) 0) 0) as (and:SI x 1).

2023-10-10 Thread Roger Sayle

This patch is the middle-end piece of an improvement to PRs 101955 and
106245, that adds a missing simplification to the RTL optimizers.
This transformation is to simplify (char)(x << 7) != 0 as x & 1.
Technically, the cast can be any truncation, where shift is by one
less than the narrower type's precision, setting the most significant
(only) bit from the least significant bit.
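
A quick illustration of the identity (not from the patch; an unsigned char
truncation is used to keep the C fully defined):

#include <assert.h>

int main (void)
{
  for (unsigned int x = 0; x < 512; x++)
    assert (((unsigned char) (x << 7) != 0) == ((x & 1) != 0));
  return 0;
}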

This transformation applies to any target, but it's easy to see
(and add a new test case) on x86, where the following function:

int f(int a) { return (a << 31) >> 31; }

currently gets compiled with -O2 to:

foo:    movl    %edi, %eax
        sall    $7, %eax
        sarb    $7, %al
        movsbl  %al, %eax
        ret

but with this patch, we now generate the slightly simpler:

foo:    movl    %edi, %eax
        sall    $31, %eax
        sarl    $31, %eax
        ret


This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check with no new failures.  Ok for mainline?


2023-10-10  Roger Sayle  

gcc/ChangeLog
PR middle-end/101955
PR tree-optimization/106245
* simplify-rtx.c (simplify_relational_operation_1): Simplify
the RTL (ne:SI (subreg:QI (ashift:SI x 7) 0) 0) to (and:SI x 1).

gcc/testsuite/ChangeLog
* gcc.target/i386/pr106245-1.c: New test case.


Thanks in advance,
Roger
--

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index bd9443d..69d8757 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -6109,6 +6109,23 @@ simplify_context::simplify_relational_operation_1 
(rtx_code code,
break;
   }
 
+  /* (ne:SI (subreg:QI (ashift:SI x 7) 0) 0) -> (and:SI x 1).  */
+  if (code == NE
+  && op1 == const0_rtx
+  && (op0code == TRUNCATE
+ || (partial_subreg_p (op0)
+ && subreg_lowpart_p (op0)))
+  && SCALAR_INT_MODE_P (mode)
+  && STORE_FLAG_VALUE == 1)
+{
+  rtx tmp = XEXP (op0, 0);
+  if (GET_CODE (tmp) == ASHIFT
+ && GET_MODE (tmp) == mode
+ && CONST_INT_P (XEXP (tmp, 1))
+ && is_int_mode (GET_MODE (op0), &int_mode)
+ && INTVAL (XEXP (tmp, 1)) == GET_MODE_PRECISION (int_mode) - 1)
+   return simplify_gen_binary (AND, mode, XEXP (tmp, 0), const1_rtx);
+}
   return NULL_RTX;
 }
 
diff --git a/gcc/testsuite/gcc.target/i386/pr106245-1.c 
b/gcc/testsuite/gcc.target/i386/pr106245-1.c
new file mode 100644
index 000..a0403e9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr106245-1.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int f(int a)
+{
+return (a << 31) >> 31;
+}
+
+/* { dg-final { scan-assembler-not "sarb" } } */
+/* { dg-final { scan-assembler-not "movsbl" } } */


[PATCH] PR 91865: Avoid ZERO_EXTEND of ZERO_EXTEND in make_compound_operation.

2023-10-14 Thread Roger Sayle

This patch is my proposed solution to PR rtl-optimization/91865.
Normally RTX simplification canonicalizes a ZERO_EXTEND of a ZERO_EXTEND
to a single ZERO_EXTEND, but as shown in this PR it is possible for
combine's make_compound_operation to unintentionally generate a
non-canonical ZERO_EXTEND of a ZERO_EXTEND, which is unlikely to be
matched by the backend.

For the new test case:

const int table[2] = {1, 2};
int foo (char i) { return table[i]; }

compiling with -O2 -mlarge on msp430 we currently see:

Trying 2 -> 7:
2: r25:HI=zero_extend(R12:QI)
  REG_DEAD R12:QI
7: r28:PSI=sign_extend(r25:HI)#0
  REG_DEAD r25:HI
Failed to match this instruction:
(set (reg:PSI 28 [ iD.1772 ])
(zero_extend:PSI (zero_extend:HI (reg:QI 12 R12 [ iD.1772 ]))))

which results in the following code:

foo:AND #0xff, R12
RLAM.A #4, R12 { RRAM.A #4, R12
RLAM.A  #1, R12
MOVX.W  table(R12), R12
RETA

With this patch, we now see:

Trying 2 -> 7:
2: r25:HI=zero_extend(R12:QI)
  REG_DEAD R12:QI
7: r28:PSI=sign_extend(r25:HI)#0
  REG_DEAD r25:HI
Successfully matched this instruction:
(set (reg:PSI 28 [ iD.1772 ])
(zero_extend:PSI (reg:QI 12 R12 [ iD.1772 ])))
allowing combination of insns 2 and 7
original costs 4 + 8 = 12
replacement cost 8

foo:MOV.B   R12, R12
RLAM.A  #1, R12
MOVX.W  table(R12), R12
RETA


This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?

2023-10-14  Roger Sayle  

gcc/ChangeLog
PR rtl-optimization/91865
* combine.cc (make_compound_operation): Avoid creating a
ZERO_EXTEND of a ZERO_EXTEND.

gcc/testsuite/ChangeLog
PR rtl-optimization/91865
* gcc.target/msp430/pr91865.c: New test case.


Thanks in advance,
Roger
--

diff --git a/gcc/combine.cc b/gcc/combine.cc
index 360aa2f25e6..f47ff596782 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -8453,6 +8453,9 @@ make_compound_operation (rtx x, enum rtx_code in_code)
new_rtx, GET_MODE (XEXP (x, 0)));
   if (tem)
return tem;
+  /* Avoid creating a ZERO_EXTEND of a ZERO_EXTEND.  */
+  if (GET_CODE (new_rtx) == ZERO_EXTEND)
+   new_rtx = XEXP (new_rtx, 0);
   SUBST (XEXP (x, 0), new_rtx);
   return x;
 }
diff --git a/gcc/testsuite/gcc.target/msp430/pr91865.c 
b/gcc/testsuite/gcc.target/msp430/pr91865.c
new file mode 100644
index 000..8cc21c8b9e8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/msp430/pr91865.c
@@ -0,0 +1,8 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mlarge" } */
+
+const int table[2] = {1, 2};
+int foo (char i) { return table[i]; }
+
+/* { dg-final { scan-assembler-not "AND" } } */
+/* { dg-final { scan-assembler-not "RRAM" } } */


[PATCH] Improved RTL expansion of 1LL << x.

2023-10-14 Thread Roger Sayle

This patch improves the initial RTL expanded for double word shifts
on architectures with conditional moves, so that later passes don't
need to clean-up unnecessary and/or unused instructions.

Consider the general case, x << y, which is expanded well as:

t1 = y & 32;
t2 = 0;
t3 = x_lo >> 1;
t4 = y ^ ~0;
t5 = t3 >> t4;
tmp_hi = x_hi << y;
tmp_hi |= t5;
tmp_lo = x_lo << y;
out_hi = t1 ? tmp_lo : tmp_hi;
out_lo = t1 ? t2 : tmp_lo;

which is nearly optimal, the only thing that can be improved is
that using a unary NOT operation "t4 = ~y" is better than XOR
with -1, on targets that support it.  [Note the one_cmpl_optab
expander didn't fall back to XOR when this code was originally
written, but has been improved since].
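
For reference, a C model of the pseudo-code above (illustrative only; the
word shift counts are truncated to 5 bits, as the expander assumes):

#include <stdint.h>
#include <assert.h>

static uint64_t dw_shl (uint32_t x_lo, uint32_t x_hi, unsigned int y)
{
  uint32_t t1 = y & 32;                    /* does the shift cross words?  */
  uint32_t t3 = x_lo >> 1;
  uint32_t t4 = ~y & 31;                   /* y ^ ~0, count truncated      */
  uint32_t t5 = t3 >> t4;
  uint32_t tmp_hi = (x_hi << (y & 31)) | t5;
  uint32_t tmp_lo = x_lo << (y & 31);
  uint32_t out_hi = t1 ? tmp_lo : tmp_hi;
  uint32_t out_lo = t1 ? 0 : tmp_lo;
  return ((uint64_t) out_hi << 32) | out_lo;
}

int main (void)
{
  for (unsigned int y = 0; y < 64; y++)
    assert (dw_shl (0x89abcdefu, 0x01234567u, y)
            == 0x0123456789abcdefULL << y);
  return 0;
}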

Now consider the relatively common idiom of 1LL << y, which
currently produces the RTL equivalent of:

t1 = y & 32;
t2 = 0;
t3 = 1 >> 1;
t4 = y ^ ~0;
t5 = t3 >> t4;
tmp_hi = 0 << y;
tmp_hi |= t5;
tmp_lo = 1 << y;
out_hi = t1 ? tmp_lo : tmp_hi;
out_lo = t1 ? t2 : tmp_lo;

Notice here that t3 is always zero, so the assignment of t5
is a variable shift of zero, which expands to a loop on many
smaller targets, a similar shift by zero in the first tmp_hi
assignment (another loop), that the value of t4 is no longer
required (as t3 is zero), and that the ultimate value of tmp_hi
is always zero.

Fortunately, for many (but perhaps not all) targets this mess
gets cleaned up by later optimization passes.  However, this
patch avoids generating unnecessary RTL at expand time, by
calling simplify_expand_binop instead of expand_binop, and
avoiding generating dead or unnecessary code when intermediate
values are known to be zero.  For the 1LL << y test case above,
we now generate:

t1 = y & 32;
t2 = 0;
tmp_hi = 0;
tmp_lo = 1 << y;
out_hi = t1 ? tmp_lo : tmp_hi;
out_lo = t1 ? t2 : tmp_lo;

On arc-elf, for example, there are 18 RTL INSN_P instructions
generated by expand before this patch, but only 12 with this patch
(improving both compile-time and memory usage).


This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-10-15  Roger Sayle  

gcc/ChangeLog
* optabs.cc (expand_subword_shift): Call simplify_expand_binop
instead of expand_binop.  Optimize cases (i.e. avoid generating
RTL) when CARRIES or INTO_INPUT is zero.  Use one_cmpl_optab
(i.e. NOT) instead of xor_optab with ~0 to calculate ~OP1.


Thanks in advance,
Roger
--

diff --git a/gcc/optabs.cc b/gcc/optabs.cc
index e1898da..f0a048a 100644
--- a/gcc/optabs.cc
+++ b/gcc/optabs.cc
@@ -533,15 +533,13 @@ expand_subword_shift (scalar_int_mode op1_mode, optab 
binoptab,
 has unknown behavior.  Do a single shift first, then shift by the
 remainder.  It's OK to use ~OP1 as the remainder if shift counts
 are truncated to the mode size.  */
-  carries = expand_binop (word_mode, reverse_unsigned_shift,
- outof_input, const1_rtx, 0, unsignedp, methods);
-  if (shift_mask == BITS_PER_WORD - 1)
-   {
- tmp = immed_wide_int_const
-   (wi::minus_one (GET_MODE_PRECISION (op1_mode)), op1_mode);
- tmp = simplify_expand_binop (op1_mode, xor_optab, op1, tmp,
-  0, true, methods);
-   }
+  carries = simplify_expand_binop (word_mode, reverse_unsigned_shift,
+  outof_input, const1_rtx, 0,
+  unsignedp, methods);
+  if (carries == const0_rtx)
+   tmp = const0_rtx;
+  else if (shift_mask == BITS_PER_WORD - 1)
+   tmp = expand_unop (op1_mode, one_cmpl_optab, op1, 0, true);
   else
{
  tmp = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1,
@@ -552,22 +550,29 @@ expand_subword_shift (scalar_int_mode op1_mode, optab 
binoptab,
 }
   if (tmp == 0 || carries == 0)
 return false;
-  carries = expand_binop (word_mode, reverse_unsigned_shift,
- carries, tmp, 0, unsignedp, methods);
+  if (carries != const0_rtx && tmp != const0_rtx)
+carries = simplify_expand_binop (word_mode, reverse_unsigned_shift,
+carries, tmp, 0, unsignedp, methods);
   if (carries == 0)
 return false;
 
-  /* Shift INTO_INPUT logically by OP1.  This is the last use of INTO_INPUT
- so the result can go directly into INTO_TARGET if convenient.  */
-  tmp = expand_binop (word_mode, unsigned_shift, into_input, op1,
- into_target, unsignedp, methods);
-  if (tmp == 0)
-return false;
+  if (into_inp

[ARC PATCH] Split asl dst, 1, src into bset dst, 0, src to implement 1<<x.

2023-10-15 Thread Roger Sayle
 

This patch adds a pre-reload splitter to arc.md, to use the bset (set
specific bit instruction) to implement 1<<x.

 

gcc/ChangeLog

* config/arc/arc.md (*ashlsi3_1): New pre-reload splitter to
use bset dst,0,src to implement 1<<x.

RE: [ARC PATCH] Split asl dst, 1, src into bset dst, 0, src to implement 1<<x.

2023-10-15 Thread Roger Sayle
I've done it again. ENOPATCH.

 

From: Roger Sayle  
Sent: 15 October 2023 09:13
To: 'gcc-patches@gcc.gnu.org' 
Cc: 'Claudiu Zissulescu' 
Subject: [ARC PATCH] Split asl dst,1,src into bset dst,0,src to implement
1<<x.

 

gcc/ChangeLog

* config/arc/arc.md (*ashlsi3_1): New pre-reload splitter to
use bset dst,0,src to implement 1<<x.

diff --git a/gcc/config/arc/arc.md b/gcc/config/arc/arc.md
index a936a8b..22af0bf 100644
--- a/gcc/config/arc/arc.md
+++ b/gcc/config/arc/arc.md
@@ -3421,6 +3421,22 @@ archs4x, archs4xd"
(set_attr "predicable" "no,no,yes,no,no")
(set_attr "cond" "nocond,canuse,canuse,nocond,nocond")])
 
+;; Split asl dst,1,src into bset dst,0,src.
+(define_insn_and_split "*ashlsi3_1"
+  [(set (match_operand:SI 0 "dest_reg_operand")
+   (ashift:SI (const_int 1)
+  (match_operand:SI 1 "nonmemory_operand")))]
+  "!TARGET_BARREL_SHIFTER
+   && arc_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+   (ior:SI (ashift:SI (const_int 1) (match_dup 1))
+   (const_int 0)))]
+  ""
+  [(set_attr "type" "shift")
+   (set_attr "length" "8")])
+
 (define_insn_and_split "*ashlsi3_nobs"
   [(set (match_operand:SI 0 "dest_reg_operand")
(ashift:SI (match_operand:SI 1 "register_operand")


RE: [PATCH] PR 91865: Avoid ZERO_EXTEND of ZERO_EXTEND in make_compound_operation.

2023-10-15 Thread Roger Sayle


Hi Jeff,
Thanks for the speedy review(s).

> From: Jeff Law 
> Sent: 15 October 2023 00:03
> To: Roger Sayle ; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] PR 91865: Avoid ZERO_EXTEND of ZERO_EXTEND in
> make_compound_operation.
> 
> On 10/14/23 16:14, Roger Sayle wrote:
> >
> > This patch is my proposed solution to PR rtl-optimization/91865.
> > Normally RTX simplification canonicalizes a ZERO_EXTEND of a
> > ZERO_EXTEND to a single ZERO_EXTEND, but as shown in this PR it is
> > possible for combine's make_compound_operation to unintentionally
> > generate a non-canonical ZERO_EXTEND of a ZERO_EXTEND, which is
> > unlikely to be matched by the backend.
> >
> > For the new test case:
> >
> > const int table[2] = {1, 2};
> > int foo (char i) { return table[i]; }
> >
> > compiling with -O2 -mlarge on msp430 we currently see:
> >
> > Trying 2 -> 7:
> >  2: r25:HI=zero_extend(R12:QI)
> >REG_DEAD R12:QI
> >  7: r28:PSI=sign_extend(r25:HI)#0
> >REG_DEAD r25:HI
> > Failed to match this instruction:
> > (set (reg:PSI 28 [ iD.1772 ])
> >  (zero_extend:PSI (zero_extend:HI (reg:QI 12 R12 [ iD.1772 ]
> >
> > which results in the following code:
> >
> > foo:AND #0xff, R12
> >  RLAM.A #4, R12 { RRAM.A #4, R12
> >  RLAM.A  #1, R12
> >  MOVX.W  table(R12), R12
> >  RETA
> >
> > With this patch, we now see:
> >
> > Trying 2 -> 7:
> >  2: r25:HI=zero_extend(R12:QI)
> >REG_DEAD R12:QI
> >  7: r28:PSI=sign_extend(r25:HI)#0
> >REG_DEAD r25:HI
> > Successfully matched this instruction:
> > (set (reg:PSI 28 [ iD.1772 ])
> >  (zero_extend:PSI (reg:QI 12 R12 [ iD.1772 ]))) allowing
> > combination of insns 2 and 7 original costs 4 + 8 = 12 replacement
> > cost 8
> >
> > foo:MOV.B   R12, R12
> >  RLAM.A  #1, R12
> >  MOVX.W  table(R12), R12
> >  RETA
> >
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures.  Ok for mainline?
> >
> > 2023-10-14  Roger Sayle  
> >
> > gcc/ChangeLog
> >  PR rtl-optimization/91865
> >  * combine.cc (make_compound_operation): Avoid creating a
> >  ZERO_EXTEND of a ZERO_EXTEND.
> >
> > gcc/testsuite/ChangeLog
> >  PR rtl-optimization/91865
> >  * gcc.target/msp430/pr91865.c: New test case.
> Neither an ACK or NAK at this point.
> 
> The bug report includes a patch from Segher which purports to fix this in 
> simplify-
> rtx.  Any thoughts on Segher's approach and whether or not it should be
> considered?
> 
> The BZ also indicates that removal of 2 patterns from msp430.md would solve 
> this
> too (though it may cause regressions elsewhere?).  Any thoughts on that 
> approach
> as well?
> 

Great questions.  I believe Segher's proposed patch (in comment #4) was an
msp430-specific proof-of-concept workaround rather than intended to be fix.
Eliminating a ZERO_EXTEND simply by changing the mode of a hard register
is not a solution that'll work on many platforms (and therefore not really 
suitable
for target-independent middle-end code in the RTL optimizers).

For example, zero_extend:TI (and:QI (reg:QI hard_r1) (const_int 0x0f)) can't
universally be reduced to (and:TI (reg:TI hard_r1) (const_int 0x0f)).  Notice 
that
Segher's code doesn't check TARGET_HARD_REGNO_MODE_OK or 
TARGET_MODES_TIEABLE_P or any of the other backend hooks necessary
to confirm such a transformation is safe/possible.

Secondly, the hard register aspect is a bit of a red herring.  This work-around
fixes the issue in the original BZ description, but not the slightly modified 
test
case in comment #2 (with a global variable).  This doesn't have a hard register,
but does have the dubious ZERO_EXTEND/SIGN_EXTEND of a ZERO_EXTEND.

The underlying issue, which is applicable to all targets, is that combine.cc's
make_compound_operation is expected to reverse the local transformations
made by expand_compound_operation.  Hence, if an RTL expression is
canonical going into expand_compound_operation, it is expected (hoped)
to be canonical (and equivalent) coming out of make_compound_operation.

Hence, rather than be a MSP430 specific issue, no target should expect (or
be expected to see) a ZERO_EXTEND of a ZERO_EXTEND, or a SIGN_EXTEND
of a ZERO_EXTEND in the RTL stream.  Much like a binary operator with two
CONST_INT operands, or a shift by zero, it's somethi

RE: [PATCH] Support g++ 4.8 as a host compiler.

2023-10-15 Thread Roger Sayle


I'd like to ping my patch for restoring bootstrap using g++ 4.8.5
(the system compiler on RHEL 7 and later systems).
https://gcc.gnu.org/pipermail/gcc-patches/2023-October/632008.html

Note the preprocessor #ifs can be removed; they are only there to document
why the union u must have an explicit, empty (but not default) constructor.

I completely agree with the various opinions that we might consider
upgrading the minimum host compiler for many good reasons (Ada,
D, newer C++ features etc.).  It's inevitable that older compilers and
systems can't be supported indefinitely.

Having said that, I don't think that this unintentional trivial breakage,
which has a safe one-line work around, is sufficient cause (or non-negligible
risk or support burden) to inconvenience a large number of GCC users
(the impact/disruption to cfarm has already been mentioned).

Interestingly, "scl enable devtoolset-XX" to use a newer host compiler,
v10 or v11, results in a significant increase (100+) in unexpected failures
I see during mainline regression testing using "make -k check" (on RedHat 7.9).
(Older) system compilers, despite their flaws, are selected for their
(overall) stability and maturity.

If another patch/change hits the compiler next week that reasonably
means that 4.8.5 can no longer be supported, so be it, but it's an
annoying (and unnecessary?) inconvenience in the meantime.

Perhaps we should file a Bugzilla PR indicating that the documentation
and release notes need updating, if my fix isn't considered acceptable?

Why this patch is a trigger issue (that requires significant discussion
and deliberation) is somewhat of a mystery.

Thanks in advance.
Roger
> -Original Message-
> From: Jeff Law 
> Sent: 07 October 2023 17:20
> To: Roger Sayle ; gcc-patches@gcc.gnu.org
> Cc: 'Richard Sandiford' 
> Subject: Re: [PATCH] Support g++ 4.8 as a host compiler.
> 
> 
> 
> On 10/4/23 16:19, Roger Sayle wrote:
> >
> > The recent patch to remove poly_int_pod triggers a bug in g++ 4.8.5's
> > C++ 11 support which mistakenly believes poly_uint16 has a non-trivial
> > constructor.  This in turn prohibits it from being used as a member in
> > a union (rtxunion) that is constructed statically, resulting in a (fatal)
> > error during stage 1.  A workaround is to add an explicit constructor
> > to the problematic union, which allows mainline to be bootstrapped
> > with the system compiler on older RedHat 7 systems.
> >
> > This patch has been tested on x86_64-pc-linux-gnu where it allows a
> > bootstrap to complete when using g++ 4.8.5 as the host compiler.
> > Ok for mainline?
> >
> >
> > 2023-10-04  Roger Sayle  
> >
> > gcc/ChangeLog
> > * rtl.h (rtx_def::u): Add explicit constructor to workaround
> > issue using g++ 4.8 as a host compiler.
> I think the bigger question is whether or not we're going to step forward on 
> the
> minimum build requirements.
> 
> My recollection was we settled on gcc-4.8 for the benefit of RHEL 7 and 
> Centos 7
> which are rapidly approaching EOL (June 2024).
> 
> I would certainly support stepping forward to a more modern compiler for the
> build requirements, which might make this patch obsolete.
> 
> Jeff



[x86 PATCH] PR 106245: Split (x<<31)>>31 as -(x&1) in i386.md

2023-10-17 Thread Roger Sayle

This patch is the backend piece of a solution to PRs 101955 and 106245,
that adds a define_insn_and_split to the i386 backend, to perform sign
extension of a single (least significant) bit using AND $1 then NEG.

Previously, (x<<31)>>31 would be generated as

sall$31, %eax   // 3 bytes
sarl$31, %eax   // 3 bytes

with this patch the backend now generates:

andl$1, %eax// 3 bytes
negl%eax// 2 bytes

Not only is this smaller in size, but microbenchmarking confirms
that it's a performance win on both Intel and AMD; Intel sees only a
2% improvement (perhaps just a size effect), but AMD sees a 7% win.
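
As a worked example (a sketch; the actual tests are the new pr106245-*.c
files below), both forms compute the same value under GCC's arithmetic
right-shift semantics for signed types:

    int lsb_sext (int x)
    {
      /* Sign-extend bit 0 of x into all 32 bits:
         -1 when x is odd, 0 when x is even.  */
      return (x << 31) >> 31;   /* previously: sall $31; sarl $31 */
      /* equivalently: return -(x & 1);  now: andl $1; negl */
    }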

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-10-17  Roger Sayle  

gcc/ChangeLog
PR middle-end/101955
PR tree-optimization/106245
* config/i386/i386.md (*extv_1_0): New define_insn_and_split.

gcc/testsuite/ChangeLog
PR middle-end/101955
PR tree-optimization/106245
* gcc.target/i386/pr106245-2.c: New test case.
* gcc.target/i386/pr106245-3.c: New 32-bit test case.
* gcc.target/i386/pr106245-4.c: New 64-bit test case.
* gcc.target/i386/pr106245-5.c: Likewise.


Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 2a60df5..b7309be0 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -3414,6 +3414,21 @@
   [(set_attr "type" "imovx")
(set_attr "mode" "SI")])
 
+;; Split sign-extension of single least significant bit as and x,$1;neg x
+(define_insn_and_split "*extv_1_0"
+  [(set (match_operand:SWI48 0 "register_operand" "=r")
+   (sign_extract:SWI48 (match_operand:SWI48 1 "register_operand" "0")
+   (const_int 1)
+   (const_int 0)))
+   (clobber (reg:CC FLAGS_REG))]
+  ""
+  "#"
+  "&& 1"
+  [(parallel [(set (match_dup 0) (and:SWI48 (match_dup 1) (const_int 1)))
+ (clobber (reg:CC FLAGS_REG))])
+   (parallel [(set (match_dup 0) (neg:SWI48 (match_dup 0)))
+ (clobber (reg:CC FLAGS_REG))])])
+
 (define_expand "extzv"
   [(set (match_operand:SWI248 0 "register_operand")
(zero_extract:SWI248 (match_operand:SWI248 1 "register_operand")
diff --git a/gcc/testsuite/gcc.target/i386/pr106245-2.c 
b/gcc/testsuite/gcc.target/i386/pr106245-2.c
new file mode 100644
index 000..47b0d27
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr106245-2.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int f(int a)
+{
+return (a << 31) >> 31;
+}
+
+/* { dg-final { scan-assembler "andl" } } */
+/* { dg-final { scan-assembler "negl" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr106245-3.c 
b/gcc/testsuite/gcc.target/i386/pr106245-3.c
new file mode 100644
index 000..4ec6342
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr106245-3.c
@@ -0,0 +1,11 @@
+/* { dg-do compile { target ia32 } } */
+/* { dg-options "-O2" } */
+
+long long f(long long a)
+{
+return (a << 63) >> 63;
+}
+
+/* { dg-final { scan-assembler "andl" } } */
+/* { dg-final { scan-assembler "negl" } } */
+/* { dg-final { scan-assembler "cltd" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr106245-4.c 
b/gcc/testsuite/gcc.target/i386/pr106245-4.c
new file mode 100644
index 000..ef77ee5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr106245-4.c
@@ -0,0 +1,10 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2" } */
+
+long long f(long long a)
+{
+return (a << 63) >> 63;
+}
+
+/* { dg-final { scan-assembler "andl" } } */
+/* { dg-final { scan-assembler "negq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr106245-5.c 
b/gcc/testsuite/gcc.target/i386/pr106245-5.c
new file mode 100644
index 000..0351866
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr106245-5.c
@@ -0,0 +1,11 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+
+__int128 f(__int128 a)
+{
+  return (a << 127) >> 127;
+}
+
+/* { dg-final { scan-assembler "andl" } } */
+/* { dg-final { scan-assembler "negq" } } */
+/* { dg-final { scan-assembler "cqto" } } */


RE: [x86 PATCH] PR 106245: Split (x<<31)>>31 as -(x&1) in i386.md

2023-10-17 Thread Roger Sayle


Hi Uros,
Thanks for the speedy review.

> From: Uros Bizjak 
> Sent: 17 October 2023 17:38
> 
> On Tue, Oct 17, 2023 at 3:08 PM Roger Sayle 
> wrote:
> >
> >
> > This patch is the backend piece of a solution to PRs 101955 and
> > 106245, that adds a define_insn_and_split to the i386 backend, to
> > perform sign extension of a single (least significant) bit using AND $1 
> > then NEG.
> >
> > Previously, (x<<31)>>31 would be generated as
> >
> > sall$31, %eax   // 3 bytes
> > sarl$31, %eax   // 3 bytes
> >
> > with this patch the backend now generates:
> >
> > andl$1, %eax// 3 bytes
> > negl%eax// 2 bytes
> >
> > Not only is this smaller in size, but microbenchmarking confirms that
> > it's a performance win on both Intel and AMD; Intel sees only a 2%
> > improvement (perhaps just a size effect), but AMD sees a 7% win.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures.  Ok for mainline?
> >
> >
> > 2023-10-17  Roger Sayle  
> >
> > gcc/ChangeLog
> > PR middle-end/101955
> > PR tree-optimization/106245
> > * config/i386/i386.md (*extv_1_0): New define_insn_and_split.
> >
> > gcc/testsuite/ChangeLog
> > PR middle-end/101955
> > PR tree-optimization/106245
> > * gcc.target/i386/pr106245-2.c: New test case.
> > * gcc.target/i386/pr106245-3.c: New 32-bit test case.
> > * gcc.target/i386/pr106245-4.c: New 64-bit test case.
> > * gcc.target/i386/pr106245-5.c: Likewise.
> 
> +;; Split sign-extension of single least significant bit as and x,$1;neg
> +x (define_insn_and_split "*extv_1_0"
> +  [(set (match_operand:SWI48 0 "register_operand" "=r")
> + (sign_extract:SWI48 (match_operand:SWI48 1 "register_operand" "0")
> +(const_int 1)
> +(const_int 0)))
> +   (clobber (reg:CC FLAGS_REG))]
> +  ""
> +  "#"
> +  "&& 1"
> 
> No need to use "&&" for an empty insn constraint. Just use "reload_completed" 
> in
> this case.
> 
> +  [(parallel [(set (match_dup 0) (and:SWI48 (match_dup 1) (const_int 1)))
> +  (clobber (reg:CC FLAGS_REG))])
> +   (parallel [(set (match_dup 0) (neg:SWI48 (match_dup 0)))
> +  (clobber (reg:CC FLAGS_REG))])])
> 
> Did you intend to split this after reload? If this is the case, then 
> reload_completed
> is missing.

Because this splitter requires neither the allocation of a new pseudo nor a
hard register assignment, i.e. it's a splitter that can be run before or after
reload, it's written to split "whenever".  If you'd prefer it to only split
after reload, I agree a "reload_completed" can be added (alternatively, adding
"ix86_pre_reload_split ()" would also work).

I now see from "*load_tp_" that "" is perhaps preferred over "&& 1"
in these cases.  Please let me know which you prefer.

Cheers,
Roger




[x86 PATCH] PR target/110511: Fix reg allocation for widening multiplications.

2023-10-17 Thread Roger Sayle

This patch contains clean-ups of the widening multiplication patterns in
i386.md, and provides variants of the existing highpart multiplication
peephole2 transformations (that tidy up register allocation after
reload), and thereby fixes PR target/110511, which is a superfluous
move instruction.

For the new test case, compiled on x86_64 with -O2.

Before:
mulx64: movabsq $-7046029254386353131, %rcx
movq%rcx, %rax
mulq%rdi
xorq%rdx, %rax
ret

After:
mulx64: movabsq $-7046029254386353131, %rax
mulq%rdi
xorq%rdx, %rax
ret

The clean-ups are (i) that operand 1 is consistently made register_operand
and operand 2 becomes nonimmediate_operand, so that predicates match the
constraints, (ii) the representation of the BMI2 mulx instruction is
updated to use the new umul_highpart RTX, and (iii) because operands
0 and 1 have different modes in widening multiplications, "a" is a more
appropriate constraint than "0" (which avoids spills/reloads containing
SUBREGs).  The new peephole2 transformations are based upon those at
around line 9951 of i386.md, that begins with the comment
;; Highpart multiplication peephole2s to tweak register allocation.
;; mov imm,%rdx; mov %rdi,%rax; imulq %rdx  ->  mov imm,%rax; imulq %rdi


This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-10-17  Roger Sayle  

gcc/ChangeLog
PR target/110511
* config/i386/i386.md (mul3): Make operands 1 and
2 take "regiser_operand" and "nonimmediate_operand" respectively.
(mulqihi3): Likewise.
(*bmi2_umul3_1): Operand 2 needs to be register_operand
matching the %d constraint.  Use umul_highpart RTX to represent
the highpart multiplication.
(*umul3_1):  Operand 2 should use register_operand
predicate, and "a" rather than "0" as operands 0 and 2 have
different modes.
(define_split): For mul to mulx conversion, use the new
umul_highpart RTX representation.
(*mul3_1):  Operand 1 should be register_operand
and the constraint %a as operands 0 and 1 have different modes.
(*mulqihi3_1): Operand 1 should be register_operand matching
the constraint %0.
(define_peephole2): Providing widening multiplication variants
of the peephole2s that tweak highpart multiplication register
allocation.

gcc/testsuite/ChangeLog
PR target/110511
* gcc.target/i386/pr110511.c: New test case.


Thanks in advance,
Roger

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 2a60df5..22f18c2 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -9710,33 +9710,29 @@
   [(parallel [(set (match_operand: 0 "register_operand")
   (mult:
 (any_extend:
-  (match_operand:DWIH 1 "nonimmediate_operand"))
+  (match_operand:DWIH 1 "register_operand"))
 (any_extend:
-  (match_operand:DWIH 2 "register_operand"
+  (match_operand:DWIH 2 "nonimmediate_operand"
  (clobber (reg:CC FLAGS_REG))])])
 
 (define_expand "mulqihi3"
   [(parallel [(set (match_operand:HI 0 "register_operand")
   (mult:HI
 (any_extend:HI
-  (match_operand:QI 1 "nonimmediate_operand"))
+  (match_operand:QI 1 "register_operand"))
 (any_extend:HI
-  (match_operand:QI 2 "register_operand"
+  (match_operand:QI 2 "nonimmediate_operand"
  (clobber (reg:CC FLAGS_REG))])]
   "TARGET_QIMODE_MATH")
 
 (define_insn "*bmi2_umul3_1"
   [(set (match_operand:DWIH 0 "register_operand" "=r")
(mult:DWIH
- (match_operand:DWIH 2 "nonimmediate_operand" "%d")
+ (match_operand:DWIH 2 "register_operand" "%d")
  (match_operand:DWIH 3 "nonimmediate_operand" "rm")))
(set (match_operand:DWIH 1 "register_operand" "=r")
-   (truncate:DWIH
- (lshiftrt:
-   (mult: (zero_extend: (match_dup 2))
-   (zero_extend: (match_dup 3)))
-   (match_operand:QI 4 "const_int_operand"]
-  "TARGET_BMI2 && INTVAL (operands[4]) ==  * BITS_PER_UNIT
+   (umul_highpart:DWIH (match_dup 2) (match_dup 3)))]
+  "TARGET_BMI2
&& !(MEM_P (operands[2]) && MEM_P (operands[3]))"
   "mulx\t{%3, %0, %1|%1, %0, %3}"
   [(set_attr "type" &qu

RE: [x86 PATCH] PR target/110551: Fix reg allocation for widening multiplications.

2023-10-18 Thread Roger Sayle


Many thanks to Tobias Burnus for pointing out the mistake/typo in the PR
number.  This fix is for PR 110551, not PR 110511.  I'll update the ChangeLog
and filename of the new testcase, if approved.

Sorry for any inconvenience/confusion.
Cheers,
Roger
--

> -Original Message-
> From: Roger Sayle 
> Sent: 17 October 2023 20:06
> To: 'gcc-patches@gcc.gnu.org' 
> Cc: 'Uros Bizjak' 
> Subject: [x86 PATCH] PR target/110511: Fix reg allocation for widening
> multiplications.
> 
> 
> This patch contains clean-ups of the widening multiplication patterns in
> i386.md, and provides variants of the existing highpart multiplication
> peephole2 transformations (that tidy up register allocation after reload),
> and thereby fixes PR target/110511, which is a superfluous move instruction.
> 
> For the new test case, compiled on x86_64 with -O2.
> 
> Before:
> mulx64: movabsq $-7046029254386353131, %rcx
> movq%rcx, %rax
> mulq%rdi
> xorq%rdx, %rax
> ret
> 
> After:
> mulx64: movabsq $-7046029254386353131, %rax
> mulq%rdi
> xorq%rdx, %rax
> ret
> 
> The clean-ups are (i) that operand 1 is consistently made register_operand
> and operand 2 becomes nonimmediate_operand, so that predicates match the
> constraints, (ii) the representation of the BMI2 mulx instruction is
> updated to use the new umul_highpart RTX, and (iii) because operands
> 0 and 1 have different modes in widening multiplications, "a" is a more
> appropriate constraint than "0" (which avoids spills/reloads containing
> SUBREGs).  The new peephole2 transformations are based upon those at around
> line 9951 of i386.md, that begins with the comment
> ;; Highpart multiplication peephole2s to tweak register allocation.
> ;; mov imm,%rdx; mov %rdi,%rax; imulq %rdx  ->  mov imm,%rax; imulq %rdi
> 
> 
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and
> make -k check, both with and without --target_board=unix{-m32} with no new
> failures.  Ok for mainline?
> 
> 
> 2023-10-17  Roger Sayle  
> 
> gcc/ChangeLog
> PR target/110511
> * config/i386/i386.md (mul3): Make operands 1 and
> 2 take "regiser_operand" and "nonimmediate_operand" respectively.
> (mulqihi3): Likewise.
> (*bmi2_umul3_1): Operand 2 needs to be register_operand
> matching the %d constraint.  Use umul_highpart RTX to represent
> the highpart multiplication.
> (*umul3_1):  Operand 2 should use register_operand
> predicate, and "a" rather than "0" as operands 0 and 2 have
> different modes.
> (define_split): For mul to mulx conversion, use the new
> umul_highpart RTX representation.
> (*mul3_1):  Operand 1 should be register_operand
> and the constraint %a as operands 0 and 1 have different modes.
> (*mulqihi3_1): Operand 1 should be register_operand matching
> the constraint %0.
> (define_peephole2): Providing widening multiplication variants
> of the peephole2s that tweak highpart multiplication register
> allocation.
> 
> gcc/testsuite/ChangeLog
> PR target/110511
> * gcc.target/i386/pr110511.c: New test case.
> 
> 
> Thanks in advance,
> Roger




RE: [Patch] nvptx: Use fatal_error when -march= is missing not an assert [PR111093]

2023-10-18 Thread Roger Sayle

Hi Tomas, Tobias and Tom,
Thanks for asking.  Interestingly, I've a patch (attached) from last year that
tackled some of the issues here.  The surface problem is that nvptx's march
and misa are related in complicated ways.  Specifying an arch defines the
range of valid isa's, and specifying an isa restricts the set of valid arches.

The current approach, which I agree is problematic, is to force these to
be specified (compatibly) on the cc1 command line.  Certainly, an error
is better than an abort.  My proposed solution was to allow either to 
imply a default for the other, and only issue an error if they are explicitly
specified incompatibly.

One reason for supporting this approach was to ultimately support an
-march=native in the driver (calling libcuda.so to determine the hardware
available on the current machine).

The other use case is bumping the "default" nvptx architecture to something
more recent, say sm_53, by providing/honoring a default arch at configure
time.

Alas, it turns out that specifying a recent arch during GCC bootstrap allows
the build to notice that the backend (now) supports 16-bit floats, which then
prompts libgcc to contain the floathf and fixhf support that would be required.
Then this in turn shows up as a limitation in the middle-end's handling of 
libcalls, which I submitted as a patch to back in July 2022:
https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598848.html

That patch hasn't yet been approved, so the whole nvptx -march= patch
series became backlogged/forgotten.

Hopefully, the attached "proof-of-concept" patch looks interesting (food
for thought).  If this approach seems reasonable, I'm happy to brush the
dust off, and resubmit it (or a series of pieces) for review.

Best regards,
Roger
--

> -Original Message-
> From: Thomas Schwinge 
> Sent: 18 October 2023 11:16
> To: Tobias Burnus 
> Cc: gcc-patches@gcc.gnu.org; Tom de Vries ; Roger Sayle
> 
> Subject: Re: [Patch] nvptx: Use fatal_error when -march= is missing not an 
> assert
> [PR111093]
> 
> Hi Tobias!
> 
> On 2023-10-16T11:18:45+0200, Tobias Burnus 
> wrote:
> > While mkoffload ensures that there is always a -march=, nvptx's
> > cc1 can also be run directly.
> >
> > In my case, I wanted to know which target-specific #define are
> > available; hence, I did run:
> >accel/nvptx-none/cc1 -E -dM < /dev/null which gave an ICE. After
> > some debugging, the reasons was clear (missing -march=) but somehow a
> > (fatal) error would have been nicer than an ICE + debugging.
> >
> > OK for mainline?
> 
> Yes, thanks.  I think I prefer this over hard-coding some default 
> 'ptx_isa_option' --
> but may be convinced otherwise (incremental change), if that's maybe more
> convenient for others?  (Roger?)
> 
> 
> Grüße
>  Thomas
> 
> 
> > nvptx: Use fatal_error when -march= is missing not an assert
> > [PR111093]
> >
> > gcc/ChangeLog:
> >
> >   PR target/111093
> >   * config/nvptx/nvptx.cc (nvptx_option_override): Issue fatal error
> >   instead of an assert ICE when no -march= has been specified.
> >
> > diff --git a/gcc/config/nvptx/nvptx.cc b/gcc/config/nvptx/nvptx.cc
> > index edef39fb5e1..634c31673be 100644
> > --- a/gcc/config/nvptx/nvptx.cc
> > +++ b/gcc/config/nvptx/nvptx.cc
> > @@ -335,8 +335,9 @@ nvptx_option_override (void)
> >init_machine_status = nvptx_init_machine_status;
> >
> >/* Via nvptx 'OPTION_DEFAULT_SPECS', '-misa' always appears on the
> command
> > - line.  */
> > -  gcc_checking_assert (OPTION_SET_P (ptx_isa_option));
> > + line; but handle the case that the compiler is not run via the
> > + driver.  */  if (!OPTION_SET_P (ptx_isa_option))
> > +fatal_error (UNKNOWN_LOCATION, "%<-march=%> must be specified");
> >
> >handle_ptx_version_option ();
> >
> -
> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634
> München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas
> Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht
> München, HRB 106955
diff --git a/gcc/calls.cc b/gcc/calls.cc
index 6dd6f73..8a18eae 100644
--- a/gcc/calls.cc
+++ b/gcc/calls.cc
@@ -4795,14 +4795,20 @@ emit_library_call_value_1 (int retval, rtx orgfun, rtx 
value,
   else
{
  /* Convert to the proper mode if a promotion has been active.  */
- if (GET_MODE (valreg) != outmode)
+ enum machine_mode valmode = GET_MODE (valreg);
+ if (valmode != outmode)
{
  int unsignedp = TYPE_UNSIGNED (tfom);
 
  gc

[PATCH] Replace a HWI_COMPUTABLE_MODE_P with wide-int in simplify-rtx.cc.

2023-05-26 Thread Roger Sayle

This patch enhances one of the optimizations in simplify_binary_operation_1
to allow it to simplify RTL expressions in modes wider than HOST_WIDE_INT by
replacing a use of HWI_COMPUTABLE_MODE_P and UINTVAL with wide_int.

The motivating example is a pending x86_64 backend patch that produces
the following RTL in combine:

(and:TI (zero_extend:TI (reg:DI 89))
(const_wide_int 0x0ffffffffffffffff))

where the AND is redundant, as the mask, ~0LL, is DImode's MODE_MASK.
There's already an optimization that catches this for narrower modes,
transforming (and:HI (zero_extend:HI (reg:QI x)) (const_int 0xff))
into (zero_extend:HI (reg:QI x)), but this currently only handles
CONST_INT not CONST_WIDE_INT.  Fixed by upgrading this transformation
to use wide_int, specifically rtx_mode_t and wi::mask.
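
In essence (a sketch of the validity check, not the exact code), the rewrite
(and:M (zero_extend:M X:N) C) -> (zero_extend:M (and:N X C)) is legitimate
whenever C has no bits set above N's precision, i.e. when

    (wi::mask (prec_N, /*negate_p=*/true, prec_M) & C) == 0

so with N = DImode (64 bits) and M = TImode (128 bits), the mask covers bits
64..127 and C = 0xffffffffffffffff passes the test, allowing the redundant
AND above to be removed.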

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-05-23  Roger Sayle  

gcc/ChangeLog
* simplify-rtx.cc (simplify_binary_operation_1) : Use wide-int
instead of HWI_COMPUTABLE_MODE_P and UINTVAL in transformation of
(and (extend X) C) as (zero_extend (and X C)), to also optimize
modes wider than HOST_WIDE_INT.


Thanks in advance,
Roger
--

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index d4aeebc..8dc880b 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -3826,15 +3826,16 @@ simplify_context::simplify_binary_operation_1 (rtx_code 
code,
 there are no nonzero bits of C outside of X's mode.  */
   if ((GET_CODE (op0) == SIGN_EXTEND
   || GET_CODE (op0) == ZERO_EXTEND)
- && CONST_INT_P (trueop1)
- && HWI_COMPUTABLE_MODE_P (mode)
- && (~GET_MODE_MASK (GET_MODE (XEXP (op0, 0)))
- & UINTVAL (trueop1)) == 0)
+ && CONST_SCALAR_INT_P (trueop1)
+ && is_a  (mode, &int_mode)
+ && is_a  (GET_MODE (XEXP (op0, 0)), &inner_mode)
+ && (wi::mask (GET_MODE_PRECISION (inner_mode), true,
+   GET_MODE_PRECISION (int_mode))
+ & rtx_mode_t (trueop1, mode)) == 0)
{
  machine_mode imode = GET_MODE (XEXP (op0, 0));
- tem = simplify_gen_binary (AND, imode, XEXP (op0, 0),
-gen_int_mode (INTVAL (trueop1),
-  imode));
+ tem = immed_wide_int_const (rtx_mode_t (trueop1, mode), imode);
+ tem = simplify_gen_binary (AND, imode, XEXP (op0, 0), tem);
  return simplify_gen_unary (ZERO_EXTEND, mode, tem, imode);
}
 


[PATCH] PR target/107172: Avoid "unusual" MODE_CC comparisons in simplify-rtx.cc

2023-05-26 Thread Roger Sayle

I believe that a better (or supplementary) fix to PR target/107172 is to
avoid producing incorrect (but valid) RTL in simplify_const_relational_operation
when presented with questionable (obviously invalid) expressions, such as those
produced during combine.  Just as with the "first do no harm" clause of the
Hippocratic Oath, simplify-rtx (probably) shouldn't unintentionally transform
invalid RTL expressions into incorrect (non-equivalent) but valid RTL that
may be inappropriately recognized by recog.

In this specific case, many GCC backends represent their flags register via
MODE_CC, whose representation is intentionally "opaque" to the middle-end.
The only use of MODE_CC comprehensible to the middle-end's RTL optimizers
is relational comparisons between the result of a COMPARE rtx (op0) and zero
(op1).  Any other uses of MODE_CC should be left alone, and some might argue
indicate representational issues in the backend.

In practice, CPUs occasionally have numerous instructions that affect the
flags register(s) other than comparisons [AVR's setc, powerpc's mtcrf,
x86's clc, stc and cmc and x86_64's ptest that sets C and Z flags in
non-obvious ways, c.f. PR target/109973].  Currently care has to be taken,
wrapping these in UNSPEC, to avoid combine inappropriately merging flags
setters with flags consumers (such as conditional jumps).  It's safer to
teach simplify_const_relational_operation not to modify expressions that
it doesn't understand/recognize.
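
For illustration (a hypothetical sketch, not RTL from the PR; the UNSPEC name
is made up), such a flag mutator might currently be modelled as

    (set (reg:CCC FLAGS_REG)
         (unspec:CCC [(reg:CCC FLAGS_REG)] UNSPEC_FLAGS_OP))

and any "relational" use of that CCC value which isn't a COMPARE against
const0_rtx is opaque to the middle-end, so simplify_const_relational_operation
now simply returns NULL_RTX for it rather than guessing.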

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-05-26  Roger Sayle  

gcc/ChangeLog
* simplify-rtx.cc (simplify_const_relational_operation): Return early
if we have a MODE_CC comparison that isn't a COMPARE against const0_rtx.


Thanks in advance,
Roger
--

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index d4aeebc..d6444b4 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -6120,6 +6120,12 @@ simplify_const_relational_operation (enum rtx_code code,
  || (GET_MODE (op0) == VOIDmode
  && GET_MODE (op1) == VOIDmode));
 
+  /* We only handle MODE_CC comparisons that are COMPARE against zero.  */
+  if (GET_MODE_CLASS (mode) == MODE_CC
+  && (op1 != const0_rtx
+ || GET_CODE (op0) != COMPARE))
+return NULL_RTX;
+
   /* If op0 is a compare, extract the comparison arguments from it.  */
   if (GET_CODE (op0) == COMPARE && op1 == const0_rtx)
 {


[PATCH] Refactor wi::bswap as a function (instead of a method).

2023-05-28 Thread Roger Sayle

This patch implements Richard Sandiford's suggestion from
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618215.html
that wi::bswap (and a new wi::bitreverse) should be functions,
and ideally only accessors are member functions.  This patch
implements the first step, moving/refactoring wi::bswap.
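
In terms of usage (illustrative), callers switch from the member-function
form to the free-function form:

    x.bswap ()       /* before: wide_int_storage method */
    wi::bswap (x)    /* after: free function in the wi:: namespace */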

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-05-28  Roger Sayle  

gcc/ChangeLog
* fold-const-call.cc (fold_const_call_ss) :
Update call to wi::bswap.
* simplify-rtx.cc (simplify_const_unary_operation) :
Update call to wi::bswap.
* tree-ssa-ccp.cc (evaluate_stmt) :
Update calls to wi::bswap.

* wide-int.cc (wide_int_storage::bswap): Remove/rename to...
(wi::bswap_large): New function, with revised API.
* wide-int.h (wi::bswap): New (template) function prototype.
(wide_int_storage::bswap): Remove method.
(sext_large, zext_large): Consistent indentation/line wrapping.
(bswap_large): Prototype helper function containing implementation.
(wi::bswap): New template wrapper around bswap_large.


Thanks,
Roger
--

diff --git a/gcc/fold-const-call.cc b/gcc/fold-const-call.cc
index 340cb66..663eae2 100644
--- a/gcc/fold-const-call.cc
+++ b/gcc/fold-const-call.cc
@@ -1060,7 +1060,8 @@ fold_const_call_ss (wide_int *result, combined_fn fn, 
const wide_int_ref &arg,
 case CFN_BUILT_IN_BSWAP32:
 case CFN_BUILT_IN_BSWAP64:
 case CFN_BUILT_IN_BSWAP128:
-  *result = wide_int::from (arg, precision, TYPE_SIGN (arg_type)).bswap ();
+  *result = wi::bswap (wide_int::from (arg, precision,
+  TYPE_SIGN (arg_type)));
   return true;
 
 default:
diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index d4aeebc..d93d632 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -2111,7 +2111,7 @@ simplify_const_unary_operation (enum rtx_code code, 
machine_mode mode,
  break;
 
case BSWAP:
- result = wide_int (op0).bswap ();
+ result = wi::bswap (op0);
  break;
 
case TRUNCATE:
diff --git a/gcc/tree-ssa-ccp.cc b/gcc/tree-ssa-ccp.cc
index 6fb371c..26d5e44 100644
--- a/gcc/tree-ssa-ccp.cc
+++ b/gcc/tree-ssa-ccp.cc
@@ -2401,11 +2401,12 @@ evaluate_stmt (gimple *stmt)
  wide_int wval = wi::to_wide (val.value);
  val.value
= wide_int_to_tree (type,
-   wide_int::from (wval, prec,
-   UNSIGNED).bswap ());
+   wi::bswap (wide_int::from (wval, prec,
+  UNSIGNED)));
  val.mask
-   = widest_int::from (wide_int::from (val.mask, prec,
-   UNSIGNED).bswap (),
+   = widest_int::from (wi::bswap (wide_int::from (val.mask,
+  prec,
+  UNSIGNED)),
UNSIGNED);
  if (wi::sext (val.mask, prec) != -1)
break;
diff --git a/gcc/wide-int.cc b/gcc/wide-int.cc
index c0987aa..1e4c046 100644
--- a/gcc/wide-int.cc
+++ b/gcc/wide-int.cc
@@ -731,16 +731,13 @@ wi::set_bit_large (HOST_WIDE_INT *val, const 
HOST_WIDE_INT *xval,
 }
 }
 
-/* bswap THIS.  */
-wide_int
-wide_int_storage::bswap () const
+/* Byte swap the integer represented by XVAL and LEN into VAL.  Return
+   the number of blocks in VAL.  Both XVAL and VAL have PRECISION bits.  */
+unsigned int
+wi::bswap_large (HOST_WIDE_INT *val, const HOST_WIDE_INT *xval,
+unsigned int len, unsigned int precision)
 {
-  wide_int result = wide_int::create (precision);
   unsigned int i, s;
-  unsigned int len = BLOCKS_NEEDED (precision);
-  unsigned int xlen = get_len ();
-  const HOST_WIDE_INT *xval = get_val ();
-  HOST_WIDE_INT *val = result.write_val ();
 
   /* This is not a well defined operation if the precision is not a
  multiple of 8.  */
@@ -758,7 +755,7 @@ wide_int_storage::bswap () const
   unsigned int block = s / HOST_BITS_PER_WIDE_INT;
   unsigned int offset = s & (HOST_BITS_PER_WIDE_INT - 1);
 
-  byte = (safe_uhwi (xval, xlen, block) >> offset) & 0xff;
+  byte = (safe_uhwi (xval, len, block) >> offset) & 0xff;
 
   block = d / HOST_BITS_PER_WIDE_INT;
   offset = d & (HOST_BITS_PER_WIDE_INT - 1);
@@ -766,8 +763,7 @@ wide_int_storage::bswap () const
   val[block] |= byte << offset;
 }
 
-  result.set_len (canonize (val, len, precision));
-  return result;
+  return canonize (val, len, precision);
 }
 
 /* Fill VAL

[x86_64 PATCH] PR target/109973: CCZmode and CCCmode variants of [v]ptest.

2023-05-29 Thread Roger Sayle

This is my proposed minimal fix for PR target/109973 (hopefully suitable
for backporting) that follows Jakub Jelinek's suggestion that we introduce
CCZmode and CCCmode variants of ptest and vptest, so that the i386
backend treats [v]ptest instructions similarly to testl instructions;
using different CCmodes to indicate which condition flags are desired,
and then relying on the RTL cmpelim pass to eliminate redundant tests.

This conveniently matches Intel's intrinsics, which provide different
functions for retrieving different flags: _mm_testz_si128 tests the
Z flag, _mm_testc_si128 tests the carry flag.  Currently we use the
same instruction (pattern) for both, and unfortunately the *ptest_and
optimization is only valid when the ptest/vptest instruction is used to
set/test the Z flag.
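
For reference (a sketch of the intrinsics' semantics, not code from the
patch): ptest sets ZF if (a & b) == 0 and CF if (~a & b) == 0, so folding a
preceding AND into the ptest operands is only valid when the Z flag is the
one being consumed.

    #include <immintrin.h>
    int f_z (__m128i a, __m128i b) { return _mm_testz_si128 (a, b); }  /* ZF */
    int f_c (__m128i a, __m128i b) { return _mm_testc_si128 (a, b); }  /* CF */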

The downside, as predicted by Jakub, is that GCC's cmpelim pass is
currently COMPARE-centric and not able to merge the ptests from expressions
such as _mm256_testc_si256 (a, b) + _mm256_testz_si256 (a, b), which is a
known issue, PR target/80040.  I've some follow-up patches to improve
things, but this first patch fixes the wrong-code regression, replacing
it with a rare missed-optimization (hopefully suitable for GCC 13).

The only change that was unanticipated was the tweak to ix86_match_ccmode.
Oddly, CCZmode is allowable for CCmode, but CCCmode isn't.  Given that
CCZmode means just the Z flag, CCCmode means just the C flag, and
CCmode means all the flags, I'm guessing this asymmetry is unintentional.
Perhaps a super-safe fix is to explicitly test for CCZmode, CCCmode or
CCmode
in the *_ptest pattern's predicate, and not attempt to
re-use ix86_match_ccmode?

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-05-29  Roger Sayle  

gcc/ChangeLog
PR target/109973
* config/i386/i386-builtin.def (__builtin_ia32_ptestz128): Use new
CODE_for_sse4_1_ptestzv2di.
(__builtin_ia32_ptestc128): Use new CODE_for_sse4_1_ptestcv2di.
(__builtin_ia32_ptestz256): Use new CODE_for_avx_ptestzv4di.
(__builtin_ia32_ptestc256): Use new CODE_for_avx_ptestcv4di.
* config/i386/i386-expand.cc (ix86_expand_branch): Use CCZmode
when expanding UNSPEC_PTEST to compare against zero.
* config/i386/i386-features.cc (scalar_chain::convert_compare):
Likewise generate CCZmode UNSPEC_PTESTs when converting comparisons.
(general_scalar_chain::convert_insn): Use CCZmode for COMPARE
result.
(timode_scalar_chain::convert_insn): Use CCZmode for COMPARE result.
* config/i386/i386.cc (ix86_match_ccmode): Allow the SET_SRC to be
an UNSPEC, in addition to a COMPARE.  Consider CCCmode to be a form
of CCmode.
* config/i386/sse.md (define_split): When splitting UNSPEC_MOVMSK
to UNSPEC_PTEST, preserve the FLAGS_REG mode as CCZ.
(*_ptest): Add asterisk to hide define_insn.
Remove ":CC" flags specification, and use ix86_match_ccmode instead.
(_ptestz): New define_expand to specify CCZ.
(_ptestc): New define_expand to specify CCC.
(_ptest): A define_expand using CC to preserve the
current behavior.
(*ptest_and): Specify CCZ to only perform this optimization
when only the Z flag is required.

gcc/testsuite/ChangeLog
PR target/109973
* gcc.target/i386/pr109973-1.c: New test case.
* gcc.target/i386/pr109973-2.c: Likewise.


Thanks,
Roger
--

diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def
index c91e380..383b68a 100644
--- a/gcc/config/i386/i386-builtin.def
+++ b/gcc/config/i386/i386-builtin.def
@@ -1004,8 +1004,8 @@ BDESC (OPTION_MASK_ISA_SSE4_1, 0, 
CODE_FOR_sse4_1_roundps_sfix, "__builtin_ia32_
 BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_roundv4sf2, 
"__builtin_ia32_roundps_az", IX86_BUILTIN_ROUNDPS_AZ, UNKNOWN, (int) 
V4SF_FTYPE_V4SF)
 BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_roundv4sf2_sfix, 
"__builtin_ia32_roundps_az_sfix", IX86_BUILTIN_ROUNDPS_AZ_SFIX, UNKNOWN, (int) 
V4SI_FTYPE_V4SF)
 
-BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestv2di, 
"__builtin_ia32_ptestz128", IX86_BUILTIN_PTESTZ, EQ, (int) 
INT_FTYPE_V2DI_V2DI_PTEST)
-BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestv2di, 
"__builtin_ia32_ptestc128", IX86_BUILTIN_PTESTC, LTU, (int) 
INT_FTYPE_V2DI_V2DI_PTEST)
+BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestzv2di, 
"__builtin_ia32_ptestz128", IX86_BUILTIN_PTESTZ, EQ, (int) 
INT_FTYPE_V2DI_V2DI_PTEST)
+BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestcv2di, 
"__builtin_ia32_ptestc128", IX86_BUILTIN_PTESTC, LTU, (int) 
INT_FTYPE_V2DI_V2DI_PTEST)
 BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestv2di, 
"__builtin_ia32_ptes

[PATCH] New wi::bitreverse function.

2023-06-02 Thread Roger Sayle

This patch provides a wide-int implementation of bitreverse, that
implements both of Richard Sandiford's suggestions from the review at
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618215.html of an
improved API (as a stand-alone function matching the bswap refactoring),
and an implementation that works with any bit-width precision.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
(and a make check-gcc).  Ok for mainline?  Are the remaining pieces
of the above patch pre-approved (pending re-testing)?  The aim is that
this new code will be thoroughly tested by the new *-2.c test cases in
https://gcc.gnu.org/git/?p=gcc.git;h=c09471fbc7588db2480f036aa56a2403d3c03ae5
with a minor tweak to use the BITREVERSE rtx in the NVPTX back-end,
followed by similar tests on other targets that provide bit-reverse
built-ins (such as ARM and xstormy16), in advance of support for a
backend-independent solution to PR middle-end/50481.
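
By way of illustration (a sketch using the new API, not part of the patch):

    wide_int x = wi::uhwi (0x06, 8);    /* the 8-bit value 0b00000110 */
    wide_int r = wi::bitreverse (x);    /* 0b01100000, i.e. 0x60 */

that is, bit i of the input maps to bit (precision - 1 - i) of the result,
for any precision (not just multiples of 8, as bswap requires).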


2023-06-02  Roger Sayle  

gcc/ChangeLog
* wide-int.cc (wi::bitreverse_large): New function implementing
bit reversal of an integer.
* wide-int.h (wi::bitreverse): New (template) function prototype.
(bitreverse_large): Prototype helper function/implementation.
(wi::bitreverse): New template wrapper around bitreverse_large.


Thanks again,
Roger
--

diff --git a/gcc/fold-const-call.cc b/gcc/fold-const-call.cc
index 340cb66..663eae2 100644
--- a/gcc/fold-const-call.cc
+++ b/gcc/fold-const-call.cc
@@ -1060,7 +1060,8 @@ fold_const_call_ss (wide_int *result, combined_fn fn, 
const wide_int_ref &arg,
 case CFN_BUILT_IN_BSWAP32:
 case CFN_BUILT_IN_BSWAP64:
 case CFN_BUILT_IN_BSWAP128:
-  *result = wide_int::from (arg, precision, TYPE_SIGN (arg_type)).bswap ();
+  *result = wi::bswap (wide_int::from (arg, precision,
+  TYPE_SIGN (arg_type)));
   return true;
 
 default:
diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index d4aeebc..d93d632 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -2111,7 +2111,7 @@ simplify_const_unary_operation (enum rtx_code code, 
machine_mode mode,
  break;
 
case BSWAP:
- result = wide_int (op0).bswap ();
+ result = wi::bswap (op0);
  break;
 
case TRUNCATE:
diff --git a/gcc/tree-ssa-ccp.cc b/gcc/tree-ssa-ccp.cc
index 6fb371c..26d5e44 100644
--- a/gcc/tree-ssa-ccp.cc
+++ b/gcc/tree-ssa-ccp.cc
@@ -2401,11 +2401,12 @@ evaluate_stmt (gimple *stmt)
  wide_int wval = wi::to_wide (val.value);
  val.value
= wide_int_to_tree (type,
-   wide_int::from (wval, prec,
-   UNSIGNED).bswap ());
+   wi::bswap (wide_int::from (wval, prec,
+  UNSIGNED)));
  val.mask
-   = widest_int::from (wide_int::from (val.mask, prec,
-   UNSIGNED).bswap (),
+   = widest_int::from (wi::bswap (wide_int::from (val.mask,
+  prec,
+  UNSIGNED)),
UNSIGNED);
  if (wi::sext (val.mask, prec) != -1)
break;
diff --git a/gcc/wide-int.cc b/gcc/wide-int.cc
index c0987aa..1e4c046 100644
--- a/gcc/wide-int.cc
+++ b/gcc/wide-int.cc
@@ -731,16 +731,13 @@ wi::set_bit_large (HOST_WIDE_INT *val, const 
HOST_WIDE_INT *xval,
 }
 }
 
-/* bswap THIS.  */
-wide_int
-wide_int_storage::bswap () const
+/* Byte swap the integer represented by XVAL and LEN into VAL.  Return
+   the number of blocks in VAL.  Both XVAL and VAL have PRECISION bits.  */
+unsigned int
+wi::bswap_large (HOST_WIDE_INT *val, const HOST_WIDE_INT *xval,
+unsigned int len, unsigned int precision)
 {
-  wide_int result = wide_int::create (precision);
   unsigned int i, s;
-  unsigned int len = BLOCKS_NEEDED (precision);
-  unsigned int xlen = get_len ();
-  const HOST_WIDE_INT *xval = get_val ();
-  HOST_WIDE_INT *val = result.write_val ();
 
   /* This is not a well defined operation if the precision is not a
  multiple of 8.  */
@@ -758,7 +755,7 @@ wide_int_storage::bswap () const
   unsigned int block = s / HOST_BITS_PER_WIDE_INT;
   unsigned int offset = s & (HOST_BITS_PER_WIDE_INT - 1);
 
-  byte = (safe_uhwi (xval, xlen, block) >> offset) & 0xff;
+  byte = (safe_uhwi (xval, len, block) >> offset) & 0xff;
 
   block = d / HOST_BITS_PER_WIDE_INT;
   offset = d & (HOST_BITS_PER_WIDE_INT - 1);
@@ -766,8 +763,7 @@ wide_int_storage::bswap () const
   val[block] |= byte << offset;
 }
 
-  result.set_len (canonize (val, le

RE: [PATCH] New wi::bitreverse function.

2023-06-02 Thread Roger Sayle

Doh!  Wrong patch...

Roger
--

-Original Message-
From: Roger Sayle  
Sent: Friday, June 2, 2023 3:17 PM
To: 'gcc-patches@gcc.gnu.org' 
Cc: 'Richard Sandiford' 
Subject: [PATCH] New wi::bitreverse function.


This patch provides a wide-int implementation of bitreverse, that implements
both of Richard Sandiford's suggestions from the review at
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/618215.html of an
improved API (as a stand-alone function matching the bswap refactoring), and
an implementation that works with any bit-width precision.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap (and a
make check-gcc).  Ok for mainline?  Are the remaining pieces of the above
patch pre-approved (pending re-testing)?  The aim is that this new code will
be thoroughly tested by the new *-2.c test cases in
https://gcc.gnu.org/git/?p=gcc.git;h=c09471fbc7588db2480f036aa56a2403d3c03ae5
with a minor tweak to use the BITREVERSE rtx in the NVPTX back-end, followed
by similar tests on other targets that provide bit-reverse built-ins (such
as ARM and xstormy16), in advance of support for a backend-independent
solution to PR middle-end/50481.


2023-06-02  Roger Sayle  

gcc/ChangeLog
* wide-int.cc (wi::bitreverse_large): New function implementing
bit reversal of an integer.
* wide-int.h (wi::bitreverse): New (template) function prototype.
(bitreverse_large): Prototype helper function/implementation.
(wi::bitreverse): New template wrapper around bitreverse_large.


Thanks again,
Roger
--

diff --git a/gcc/wide-int.cc b/gcc/wide-int.cc
index 1e4c046..24bdce2 100644
--- a/gcc/wide-int.cc
+++ b/gcc/wide-int.cc
@@ -766,6 +766,33 @@ wi::bswap_large (HOST_WIDE_INT *val, const HOST_WIDE_INT 
*xval,
   return canonize (val, len, precision);
 }
 
+/* Bitreverse the integer represented by XVAL and LEN into VAL.  Return
+   the number of blocks in VAL.  Both XVAL and VAL have PRECISION bits.  */
+unsigned int
+wi::bitreverse_large (HOST_WIDE_INT *val, const HOST_WIDE_INT *xval,
+ unsigned int len, unsigned int precision)
+{
+  unsigned int i, s;
+
+  for (i = 0; i < len; i++)
+val[i] = 0;
+
+  for (s = 0; s < precision; s++)
+{
+  unsigned int block = s / HOST_BITS_PER_WIDE_INT;
+  unsigned int offset = s & (HOST_BITS_PER_WIDE_INT - 1);
+  if (((safe_uhwi (xval, len, block) >> offset) & 1) != 0)
+   {
+ unsigned int d = (precision - 1) - s;
+ block = d / HOST_BITS_PER_WIDE_INT;
+ offset = d & (HOST_BITS_PER_WIDE_INT - 1);
+ val[block] |= HOST_WIDE_INT_1U << offset;
+   }
+}
+
+  return canonize (val, len, precision);
+}
+
 /* Fill VAL with a mask where the lower WIDTH bits are ones and the bits
above that up to PREC are zeros.  The result is inverted if NEGATE
is true.  Return the number of blocks in VAL.  */
diff --git a/gcc/wide-int.h b/gcc/wide-int.h
index e4723ad..498d14d 100644
--- a/gcc/wide-int.h
+++ b/gcc/wide-int.h
@@ -553,6 +553,7 @@ namespace wi
   UNARY_FUNCTION zext (const T &, unsigned int);
   UNARY_FUNCTION set_bit (const T &, unsigned int);
   UNARY_FUNCTION bswap (const T &);
+  UNARY_FUNCTION bitreverse (const T &);
 
   BINARY_FUNCTION min (const T1 &, const T2 &, signop);
   BINARY_FUNCTION smin (const T1 &, const T2 &);
@@ -1748,6 +1749,8 @@ namespace wi
  unsigned int, unsigned int, unsigned int);
   unsigned int bswap_large (HOST_WIDE_INT *, const HOST_WIDE_INT *,
unsigned int, unsigned int);
+  unsigned int bitreverse_large (HOST_WIDE_INT *, const HOST_WIDE_INT *,
+unsigned int, unsigned int);
   
   unsigned int lshift_large (HOST_WIDE_INT *, const HOST_WIDE_INT *,
 unsigned int, unsigned int, unsigned int);
@@ -2281,6 +2284,18 @@ wi::bswap (const T &x)
   return result;
 }
 
+/* Bitreverse the integer X.  */
+template 
+inline WI_UNARY_RESULT (T)
+wi::bitreverse (const T &x)
+{
+  WI_UNARY_RESULT_VAR (result, val, T, x);
+  unsigned int precision = get_precision (result);
+  WIDE_INT_REF_FOR (T) xi (x, precision);
+  result.set_len (bitreverse_large (val, xi.val, xi.len, precision));
+  return result;
+}
+
 /* Return the mininum of X and Y, treating them both as having
signedness SGN.  */
 template 


[x86_64 PATCH] PR target/110083: Fix-up REG_EQUAL notes on COMPARE in STV.

2023-06-03 Thread Roger Sayle

This patch fixes PR target/110083, an ICE-on-valid regression exposed by
my recent PTEST improvements (to address PR target/109973).  The latent
bug (admittedly mine) is that the scalar-to-vector (STV) pass doesn't update
or delete REG_EQUAL notes attached to COMPARE instructions.  As a result
the operands of COMPARE would be mismatched, with the register transformed
to V1TImode, but the immediate operand left as const_wide_int, which is
valid for TImode but not V1TImode.  This remained latent when the STV
conversion converted the mode of the COMPARE to CCmode, with later passes
recognizing the REG_EQUAL note is obviously invalid as the modes didn't
match, but now that we (correctly) preserve the CCZmode on COMPARE, the
mismatched operand modes trigger a sanity checking ICE downstream.

Fixed by updating (or deleting) any REG_EQUAL notes in convert_compare.

Before:
(expr_list:REG_EQUAL (compare:CCZ (reg:V1TI 119 [ ivin.29_38 ])
(const_wide_int 0x8000))

After:
(expr_list:REG_EQUAL (compare:CCZ (reg:V1TI 119 [ ivin.29_38 ])
(const_vector:V1TI [
(const_wide_int 0x8000)
 ]))

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-06-03  Roger Sayle  

gcc/ChangeLog
PR target/110083
* config/i386/i386-features.cc (scalar_chain::convert_compare):
Update or delete REG_EQUAL notes, converting CONST_INT and
CONST_WIDE_INT immediate operands to a suitable CONST_VECTOR.

gcc/testsuite/ChangeLog
PR target/110083
* gcc.target/i386/pr110083.c: New test case.


Roger
--

diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
index 3417f6b..4a3b07a 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -980,6 +980,39 @@ rtx
 scalar_chain::convert_compare (rtx op1, rtx op2, rtx_insn *insn)
 {
   rtx src, tmp;
+
+  /* Handle any REG_EQUAL notes.  */
+  tmp = find_reg_equal_equiv_note (insn);
+  if (tmp)
+{
+  if (GET_CODE (XEXP (tmp, 0)) == COMPARE
+ && GET_MODE (XEXP (tmp, 0)) == CCZmode
+ && REG_P (XEXP (XEXP (tmp, 0), 0)))
+   {
+ rtx *op = &XEXP (XEXP (tmp, 0), 1);
+ if (CONST_SCALAR_INT_P (*op))
+   {
+ if (constm1_operand (*op, GET_MODE (*op)))
+   *op = CONSTM1_RTX (vmode);
+ else
+   {
+ unsigned n = GET_MODE_NUNITS (vmode);
+ rtx *v = XALLOCAVEC (rtx, n);
+ v[0] = *op;
+ for (unsigned i = 1; i < n; ++i)
+   v[i] = const0_rtx;
+ *op = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
+   }
+ tmp = NULL_RTX;
+   }
+ else if (REG_P (*op))
+   tmp = NULL_RTX;
+   }
+
+  if (tmp)
+   remove_note (insn, tmp);
+}
+
   /* Comparison against anything other than zero, requires an XOR.  */
   if (op2 != const0_rtx)
 {
diff --git a/gcc/testsuite/gcc.target/i386/pr110083.c 
b/gcc/testsuite/gcc.target/i386/pr110083.c
new file mode 100644
index 000..4b38ca8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr110083.c
@@ -0,0 +1,26 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse4 -mstv -mno-stackrealign" } */
+typedef int TItype __attribute__ ((mode (TI)));
+typedef unsigned int UTItype __attribute__ ((mode (TI)));
+
+void foo (void)
+{
+  static volatile TItype ivin, ivout;
+  static volatile float fv1, fv2;
+  ivin = ((TItype) (UTItype) ~ (((UTItype) ~ (UTItype) 0) >> 1));
+  fv1 = ((TItype) (UTItype) ~ (((UTItype) ~ (UTItype) 0) >> 1));
+  fv2 = ivin;
+  ivout = fv2;
+  if (ivin != ((TItype) (UTItype) ~ (((UTItype) ~ (UTItype) 0) >> 1))
+  || 128) > sizeof (TItype) * 8 - 1)) && ivout != ivin)
+  || 128) > sizeof (TItype) * 8 - 1))
+ && ivout !=
+ ((TItype) (UTItype) ~ (((UTItype) ~ (UTItype) 0) >> 1)))
+  || fv1 !=
+  (float) ((TItype) (UTItype) ~ (((UTItype) ~ (UTItype) 0) >> 1))
+  || fv2 !=
+  (float) ((TItype) (UTItype) ~ (((UTItype) ~ (UTItype) 0) >> 1))
+  || fv1 != fv2)
+__builtin_abort ();
+}
+


[x86 PATCH] Add support for stc, clc and cmc instructions in i386.md

2023-06-03 Thread Roger Sayle

This patch is the latest revision of my patch to add support for the
STC (set carry flag), CLC (clear carry flag) and CMC (complement
carry flag) instructions to the i386 backend, incorporating Uros'
previous feedback.  The significant changes are (i) the inclusion
of CMC, (ii) the use of UNSPEC for pattern, (iii) Use of a new
X86_TUNE_SLOW_STC tuning flag to use alternate implementations on
pentium4 (which has a notoriously slow STC) when not optimizing
for size.

An example of the use of the stc instruction is:
unsigned int foo (unsigned int a, unsigned int b, unsigned int *c) {
  return __builtin_ia32_addcarryx_u32 (1, a, b, c);
}

which previously generated:
movl$1, %eax
addb$-1, %al
adcl%esi, %edi
setc%al
movl%edi, (%rdx)
movzbl  %al, %eax
ret

with this patch now generates:
stc
adcl%esi, %edi
setc%al
movl%edi, (%rdx)
movzbl  %al, %eax
ret

An example of the use of the cmc instruction (where the carry from
a first adc is inverted/complemented as input to a second adc) is:
unsigned int bar (unsigned int a, unsigned int b,
  unsigned int c, unsigned int d)
{
  unsigned int c1 = __builtin_ia32_addcarryx_u32 (1, a, b, &o1);
  return __builtin_ia32_addcarryx_u32 (c1 ^ 1, c, d, &o2);
}

which previously generated:
movl$1, %eax
addb$-1, %al
adcl%esi, %edi
setnc   %al
movl%edi, o1(%rip)
addb$-1, %al
adcl%ecx, %edx
setc%al
movl%edx, o2(%rip)
movzbl  %al, %eax
ret

and now generates:
stc
adcl%esi, %edi
cmc
movl%edi, o1(%rip)
adcl%ecx, %edx
setc%al
movl%edx, o2(%rip)
movzbl  %al, %eax
ret


This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-06-03  Roger Sayle  

gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_builtin) :
Use new x86_stc or negqi_ccc_1 instructions to set the carry flag.
* config/i386/i386.h (TARGET_SLOW_STC): New define.
* config/i386/i386.md (UNSPEC_CLC): New UNSPEC for clc.
(UNSPEC_STC): New UNSPEC for stc.
(UNSPEC_CMC): New UNSPEC for cmc.
(*x86_clc): New define_insn.
(*x86_clc_xor): New define_insn for pentium4 without -Os.
(x86_stc): New define_insn.
(define_split): Convert x86_stc into alternate implementation
on pentium4.
(x86_cmc): New define_insn.
(*x86_cmc_1): New define_insn_and_split to recognize cmc pattern.
(*setcc_qi_negqi_ccc_1_): New define_insn_and_split to
recognize (and eliminate) the carry flag being copied to itself.
(*setcc_qi_negqi_ccc_2_): Likewise.
(neg_ccc_1): Renamed from *neg_ccc_1 for gen function.
* config/i386/x86-tune.def (X86_TUNE_SLOW_STC): New tuning flag.

gcc/testsuite/ChangeLog
* gcc.target/i386/cmc-1.c: New test case.
* gcc.target/i386/stc-1.c: Likewise.


Thanks,
Roger
--

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 5d21810..9e02fdd 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -13948,8 +13948,6 @@ rdseed_step:
   arg3 = CALL_EXPR_ARG (exp, 3); /* unsigned int *sum_out.  */
 
   op1 = expand_normal (arg0);
-  if (!integer_zerop (arg0))
-   op1 = copy_to_mode_reg (QImode, convert_to_mode (QImode, op1, 1));
 
   op2 = expand_normal (arg1);
   if (!register_operand (op2, mode0))
@@ -13967,7 +13965,7 @@ rdseed_step:
}
 
   op0 = gen_reg_rtx (mode0);
-  if (integer_zerop (arg0))
+  if (op1 == const0_rtx)
{
  /* If arg0 is 0, optimize right away into add or sub
 instruction that sets CCCmode flags.  */
@@ -13977,7 +13975,14 @@ rdseed_step:
   else
{
  /* Generate CF from input operand.  */
- emit_insn (gen_addqi3_cconly_overflow (op1, constm1_rtx));
+ if (!CONST_INT_P (op1))
+   {
+ op1 = convert_to_mode (QImode, op1, 1);
+ op1 = copy_to_mode_reg (QImode, op1);
+ emit_insn (gen_negqi_ccc_1 (op1, op1));
+   }
+ else
+   emit_insn (gen_x86_stc ());
 
  /* Generate instruction that consumes CF.  */
  op1 = gen_rtx_REG (CCCmode, FLAGS_REG);
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index c7439f8..5ac9c78 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -448,6 +448,7 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
ix86_tune_features[X86_TUNE_V2DF_REDUCTION_PREFER_HADDPD]
 #define TARGET_DEST_FALSE_DEP_FOR_GLC \
ix86_tune_features[X86_TUNE_DEST_FALSE_DEP_FOR_

RE: [x86 PATCH] Add support for stc, clc and cmc instructions in i386.md

2023-06-06 Thread Roger Sayle

Hi Uros,
This revision implements your suggestions/refinements. (i) Avoid the
UNSPEC_CMC by using the canonical RTL idiom for *x86_cmc, (ii) Use
peephole2s to convert x86_stc and *x86_cmc into alternate forms on
TARGET_SLOW_STC CPUs (pentium4), when a suitable QImode register is
available, (iii) Prefer the addqi_cconly_overflow idiom (addb $-1,%al)
over negqi_ccc_1 (neg %al) for setting the carry from a QImode value,
(iv) Use andl %eax,%eax to clear the carry flag without requiring (clobbering)
an additional register, as an alternate output template for *x86_clc.
These changes required two minor edits to i386.cc:  ix86_cc_mode had
to be tweaked to suggest CCCmode for the new *x86_cmc pattern, and
*x86_cmc needed to be handled/parameterized in ix86_rtx_costs so that
combine would appreciate that this complex RTL expression was actually
a fast, single byte instruction [i.e. preferable].

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?

2023-06-06  Roger Sayle  
Uros Bizjak  

gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_builtin) :
Use new x86_stc instruction when the carry flag must be set.
* config/i386/i386.cc (ix86_cc_mode): Use CCCmode for *x86_cmc.
(ix86_rtx_costs): Provide accurate rtx_costs for *x86_cmc.
* config/i386/i386.h (TARGET_SLOW_STC): New define.
* config/i386/i386.md (UNSPEC_CLC): New UNSPEC for clc.
(UNSPEC_STC): New UNSPEC for stc.
(*x86_clc): New define_insn (with implementation for pentium4).
(x86_stc): New define_insn.
(define_peephole2): Convert x86_stc into alternate implementation
on pentium4 without -Os when a QImode register is available.
(*x86_cmc): New define_insn.
(define_peephole2): Convert *x86_cmc into alternate implementation
on pentium4 without -Os when a QImode register is available.
(*setccc): New define_insn_and_split for a no-op CCCmode move.
(*setcc_qi_negqi_ccc_1_): New define_insn_and_split to
recognize (and eliminate) the carry flag being copied to itself.
(*setcc_qi_negqi_ccc_2_): Likewise.
* config/i386/x86-tune.def (X86_TUNE_SLOW_STC): New tuning flag.

gcc/testsuite/ChangeLog
* gcc.target/i386/cmc-1.c: New test case.
* gcc.target/i386/stc-1.c: Likewise.


Thanks, Roger.
--

-Original Message-
From: Uros Bizjak  
Sent: 04 June 2023 18:53
To: Roger Sayle 
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [x86 PATCH] Add support for stc, clc and cmc instructions in 
i386.md

On Sun, Jun 4, 2023 at 12:45 AM Roger Sayle  wrote:
>
>
> This patch is the latest revision of my patch to add support for the 
> STC (set carry flag), CLC (clear carry flag) and CMC (complement carry 
> flag) instructions to the i386 backend, incorporating Uros'
> previous feedback.  The significant changes are (i) the inclusion of 
> CMC, (ii) the use of UNSPEC for pattern, (iii) Use of a new 
> X86_TUNE_SLOW_STC tuning flag to use alternate implementations on
> pentium4 (which has a notoriously slow STC) when not optimizing for 
> size.
>
> An example of the use of the stc instruction is:
> unsigned int foo (unsigned int a, unsigned int b, unsigned int *c) {
>   return __builtin_ia32_addcarryx_u32 (1, a, b, c); }
>
> which previously generated:
> movl$1, %eax
> addb$-1, %al
> adcl%esi, %edi
> setc%al
> movl%edi, (%rdx)
> movzbl  %al, %eax
> ret
>
> with this patch now generates:
> stc
> adcl%esi, %edi
> setc%al
> movl%edi, (%rdx)
> movzbl  %al, %eax
> ret
>
> An example of the use of the cmc instruction (where the carry from a 
> first adc is inverted/complemented as input to a second adc) is:
> unsigned int bar (unsigned int a, unsigned int b,
>   unsigned int c, unsigned int d) {
>   unsigned int c1 = __builtin_ia32_addcarryx_u32 (1, a, b, &o1);
>   return __builtin_ia32_addcarryx_u32 (c1 ^ 1, c, d, &o2); }
>
> which previously generated:
> movl$1, %eax
> addb$-1, %al
> adcl%esi, %edi
> setnc   %al
> movl%edi, o1(%rip)
> addb$-1, %al
> adcl%ecx, %edx
> setc%al
> movl%edx, o2(%rip)
> movzbl  %al, %eax
> ret
>
> and now generates:
> stc
> adcl%esi, %edi
> cmc
> movl%edi, o1(%rip)
> adcl%ecx, %edx
> setc%al
> movl%edx, o2(%rip)
> movzbl  %al, %eax
> ret
>
>
> This patch has been tested on x86_64-pc-linux-gnu wit

RE: [x86 PATCH] Add support for stc, clc and cmc instructions in i386.md

2023-06-06 Thread Roger Sayle


Hi Uros,
Might you be willing to approve the patch without the *x86_clc pieces?
These can be submitted later, when they are actually used.  For now,
we're arguing about the performance of a pattern that's not yet
generated on an obsolete microarchitecture that's no longer in use,
and this is holding up real improvements on current processors.
cmc, for example, should allow for better cmov if-conversion.

Thanks in advance.
Roger
--

-Original Message-
From: Uros Bizjak  
Sent: 06 June 2023 18:34
To: Roger Sayle 
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [x86 PATCH] Add support for stc, clc and cmc instructions in 
i386.md

On Tue, Jun 6, 2023 at 5:14 PM Roger Sayle  wrote:
>
>
> Hi Uros,
> This revision implements your suggestions/refinements. (i) Avoid the 
> UNSPEC_CMC by using the canonical RTL idiom for *x86_cmc, (ii) Use 
> peephole2s to convert x86_stc and *x86_cmc into alternate forms on 
> TARGET_SLOW_STC CPUs (pentium4), when a suitable QImode register is 
> available, (iii) Prefer the addqi_cconly_overflow idiom (addb $-1,%al) 
> over negqi_ccc_1 (neg %al) for setting the carry from a QImode value,
> (iv) Use andl %eax,%eax to clear carry flag without requiring 
> (clobbering) an additional register, as an alternate output template for 
> *x86_clc.

Uh, I don't think (iv) is OK. "xor reg,reg" will break the dependency chain, 
while "and reg,reg" won't. So, you are hurting out-of-order execution by 
depending on an instruction that calculates previous result in reg. You can use 
peephole2 trick to allocate an unused reg here, but then using AND is no better 
than using XOR, and the latter is guaranteed to break dependency chains.

Uros.

> These changes required two minor edits to i386.cc:  ix86_cc_mode had 
> to be tweaked to suggest CCCmode for the new *x86_cmc pattern, and 
> *x86_cmc needed to be handled/parameterized in ix86_rtx_costs so that 
> combine would appreciate that this complex RTL expression was actually 
> a fast, single byte instruction [i.e. preferable].
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap 
> and make -k check, both with and without --target_board=unix{-m32} 
> with no new failures.  Ok for mainline?
>
> 2023-06-06  Roger Sayle  
> Uros Bizjak  
>
> gcc/ChangeLog
> * config/i386/i386-expand.cc (ix86_expand_builtin) :
> Use new x86_stc instruction when the carry flag must be set.
> * config/i386/i386.cc (ix86_cc_mode): Use CCCmode for *x86_cmc.
> (ix86_rtx_costs): Provide accurate rtx_costs for *x86_cmc.
> * config/i386/i386.h (TARGET_SLOW_STC): New define.
> * config/i386/i386.md (UNSPEC_CLC): New UNSPEC for clc.
> (UNSPEC_STC): New UNSPEC for stc.
> (*x86_clc): New define_insn (with implementation for pentium4).
> (x86_stc): New define_insn.
> (define_peephole2): Convert x86_stc into alternate implementation
> on pentium4 without -Os when a QImode register is available.
> (*x86_cmc): New define_insn.
> (define_peephole2): Convert *x86_cmc into alternate implementation
> on pentium4 without -Os when a QImode register is available.
> (*setccc): New define_insn_and_split for a no-op CCCmode move.
> (*setcc_qi_negqi_ccc_1_): New define_insn_and_split to
> recognize (and eliminate) the carry flag being copied to itself.
> (*setcc_qi_negqi_ccc_2_): Likewise.
> * config/i386/x86-tune.def (X86_TUNE_SLOW_STC): New tuning flag.
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/cmc-1.c: New test case.
> * gcc.target/i386/stc-1.c: Likewise.
>
>
> Thanks, Roger.
> --
>
> -Original Message-
> From: Uros Bizjak 
> Sent: 04 June 2023 18:53
> To: Roger Sayle 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [x86 PATCH] Add support for stc, clc and cmc instructions 
> in i386.md
>
> On Sun, Jun 4, 2023 at 12:45 AM Roger Sayle  
> wrote:
> >
> >
> > This patch is the latest revision of my patch to add support for the 
> > STC (set carry flag), CLC (clear carry flag) and CMC (complement 
> > carry
> > flag) instructions to the i386 backend, incorporating Uros'
> > previous feedback.  The significant changes are (i) the inclusion of 
> > CMC, (ii) the use of UNSPEC for pattern, (iii) Use of a new 
> > X86_TUNE_SLOW_STC tuning flag to use alternate implementations on
> > pentium4 (which has a notoriously slow STC) when not optimizing for 
> > size.
> >
> > An example of the use of the stc instruction is:
> > unsigned int foo (unsigned int a, unsigned int b, unsigned int *c) {
> >   return __builtin_ia32_addcarryx_u32 (1, a, b, c); }
> 

[x86_64 PATCH] PR target/110104: Missing peephole2 for addcarry.

2023-06-06 Thread Roger Sayle

This patch resolves PR target/110104, a missed optimization on x86 around
adc with memory operands.  In i386.md, there's a peephole2 after the
pattern for *add3_cc_overflow_1 that converts the sequence
reg = add(reg,mem); mem = reg [where the reg is dead afterwards] into
the equivalent mem = add(mem,reg).  The equivalent peephole2 for adc
is missing (after addcarry), and is added by this patch.

For the example code provided in the bugzilla PR:
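
(I believe this is the same function added below as the new test case
gcc.target/i386/pr110104.c; reproduced here so the before/after assembly
is easier to follow.)

typedef unsigned long long u64;
typedef unsigned __int128 u128;

void testcase1 (u64 *acc, u64 a, u64 b)
{
  u128 res = (u128)a * b;
  u64 lo = res, hi = res >> 64;
  unsigned char cf = 0;
  cf = __builtin_ia32_addcarryx_u64 (cf, lo, acc[0], acc + 0);
  cf = __builtin_ia32_addcarryx_u64 (cf, hi, acc[1], acc + 1);
  cf = __builtin_ia32_addcarryx_u64 (cf,  0, acc[2], acc + 2);
}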

Before:
movq%rsi, %rax
mulq%rdx
addq%rax, (%rdi)
movq%rdx, %rax
adcq8(%rdi), %rax
adcq$0, 16(%rdi)
movq%rax, 8(%rdi)
ret

After:
movq%rsi, %rax
mulq%rdx
addq%rax, (%rdi)
adcq%rdx, 8(%rdi)
adcq$0, 16(%rdi)
ret

Note that the addq in this example has been transformed by the
existing peephole2 described above.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-06-07  Roger Sayle  

gcc/ChangeLog
PR target/110104
* config/i386/i386.md (define_peephole2): Transform reg=adc(reg,mem)
followed by mem=reg into mem=adc(mem,reg) when applicable.

gcc/testsuite/ChangeLog
PR target/110104
* gcc.target/i386/pr110104.c: New test case.


Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index e6ebc46..33ec45f 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -7870,6 +7870,51 @@
(set_attr "pent_pair" "pu")
(set_attr "mode" "")])
 
+;; peephole2 for addcarry matching one for *add3_cc_overflow_1.
+;; reg = adc(reg,mem); mem = reg  ->  mem = adc(mem,reg).
+(define_peephole2
+  [(parallel
+[(set (reg:CCC FLAGS_REG) 
+ (compare:CCC
+   (zero_extend:
+ (plus:SWI48
+   (plus:SWI48
+ (match_operator:SWI48 3 "ix86_carry_flag_operator"
+   [(match_operand 2 "flags_reg_operand") (const_int 0)])
+ (match_operand:SWI48 0 "general_reg_operand"))
+   (match_operand:SWI48 1 "memory_operand")))
+   (plus:
+ (zero_extend: (match_dup 1))
+   (match_operator: 4 "ix86_carry_flag_operator"
+ [(match_dup 2) (const_int 0)]
+ (set (match_dup 0)
+ (plus:SWI48 (plus:SWI48 (match_op_dup 3
+   [(match_dup 2) (const_int 0)])
+ (match_dup 0))
+ (match_dup 1)))])
+   (set (match_dup 1) (match_dup 0))]
+  "(TARGET_READ_MODIFY_WRITE || optimize_insn_for_size_p ())
+   && peep2_reg_dead_p (2, operands[0])
+   && !reg_overlap_mentioned_p (operands[0], operands[1])"
+  [(parallel
+[(set (reg:CCC FLAGS_REG)
+ (compare:CCC
+   (zero_extend:
+ (plus:SWI48
+   (plus:SWI48
+ (match_op_dup 3 [(match_dup 2) (const_int 0)])
+ (match_dup 1))
+   (match_dup 0)))
+   (plus:
+ (zero_extend: (match_dup 0))
+   (match_op_dup 4
+ [(match_dup 2) (const_int 0)]
+ (set (match_dup 1)
+ (plus:SWI48 (plus:SWI48 (match_op_dup 3
+   [(match_dup 2) (const_int 0)])
+ (match_dup 1))
+ (match_dup 0)))])])
+
 (define_expand "addcarry_0"
   [(parallel
  [(set (reg:CCC FLAGS_REG)
diff --git a/gcc/testsuite/gcc.target/i386/pr110104.c 
b/gcc/testsuite/gcc.target/i386/pr110104.c
new file mode 100644
index 000..bd814f3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr110104.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2" } */
+
+typedef unsigned long long u64;
+typedef unsigned __int128 u128;
+void testcase1(u64 *acc, u64 a, u64 b)
+{
+  u128 res = (u128)a*b;
+  u64 lo = res, hi = res >> 64;
+  unsigned char cf = 0;
+  cf = __builtin_ia32_addcarryx_u64 (cf, lo, acc[0], acc+0);
+  cf = __builtin_ia32_addcarryx_u64 (cf, hi, acc[1], acc+1);
+  cf = __builtin_ia32_addcarryx_u64 (cf,  0, acc[2], acc+2);
+}
+
+/* { dg-final { scan-assembler-times "movq" 1 } } */


[x86 PATCH] PR target/31985: Improve memory operand use with doubleword add.

2023-06-06 Thread Roger Sayle

This patch addresses the last remaining issue with PR target/31985, that
GCC could make better use of memory addressing modes when implementing
double word addition.  This is achieved by adding a define_insn_and_split
that combines an *add3_doubleword with a *concat3, so
that the components of the concat can be used directly, without first
being loaded into a double word register.

For test_c in the bugzilla PR:

Before:
pushl   %ebx
subl$16, %esp
movl28(%esp), %eax
movl36(%esp), %ecx
movl32(%esp), %ebx
movl24(%esp), %edx
addl%ecx, %eax
adcl%ebx, %edx
movl%eax, 8(%esp)
movl%edx, 12(%esp)
addl$16, %esp
popl%ebx
ret

After:
test_c:
subl$20, %esp
movl36(%esp), %eax
movl32(%esp), %edx
addl28(%esp), %eax
adcl24(%esp), %edx
movl%eax, 8(%esp)
movl%edx, 12(%esp)
addl$20, %esp
ret


If this approach is considered acceptable, similar splitters can be
used for other doubleword operations.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?

2023-06-07  Roger Sayle  

gcc/ChangeLog
PR target/31985
* config/i386/i386.md (*add3_doubleword_concat): New
define_insn_and_split combine *add3_doubleword with a
*concat3 for more efficient lowering after reload.

gcc/testsuite/ChangeLog
PR target/31985
* gcc.target/i386/pr31985.c: New test case.


Roger
--

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index e6ebc46..3592249 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -6124,6 +6124,36 @@
  (clobber (reg:CC FLAGS_REG))])]
  "split_double_mode (mode, &operands[0], 2, &operands[0], &operands[3]);")
 
+(define_insn_and_split "*add3_doubleword_concat"
+  [(set (match_operand: 0 "register_operand" "=r")
+   (plus:
+ (any_or_plus:
+   (ashift:
+ (zero_extend:
+   (match_operand:DWIH 2 "nonimmediate_operand" "rm"))
+ (match_operand: 3 "const_int_operand"))
+   (zero_extend:
+ (match_operand:DWIH 4 "nonimmediate_operand" "rm")))
+ (match_operand: 1 "register_operand" "0")))
+   (clobber (reg:CC FLAGS_REG))]
+  "INTVAL (operands[3]) ==  * BITS_PER_UNIT"
+  "#"
+  "&& reload_completed"
+  [(parallel [(set (reg:CCC FLAGS_REG)
+  (compare:CCC
+(plus:DWIH (match_dup 1) (match_dup 4))
+(match_dup 1)))
+ (set (match_dup 0)
+  (plus:DWIH (match_dup 1) (match_dup 4)))])
+   (parallel [(set (match_dup 5)
+  (plus:DWIH
+(plus:DWIH
+  (ltu:DWIH (reg:CC FLAGS_REG) (const_int 0))
+  (match_dup 6))
+(match_dup 2)))
+ (clobber (reg:CC FLAGS_REG))])]
+ "split_double_mode (mode, &operands[0], 2, &operands[0], &operands[5]);")
+
 (define_insn "*add_1"
   [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r")
(plus:SWI48
diff --git a/gcc/testsuite/gcc.target/i386/pr31985.c 
b/gcc/testsuite/gcc.target/i386/pr31985.c
new file mode 100644
index 000..a6de1b5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr31985.c
@@ -0,0 +1,14 @@
+/* { dg-do compile { target ia32 } } */
+/* { dg-options "-O2" } */
+
+void test_c (unsigned int a, unsigned int b, unsigned int c, unsigned int d)
+{
+  volatile unsigned int x, y;
+  unsigned long long __a = b | ((unsigned long long)a << 32);
+  unsigned long long __b = d | ((unsigned long long)c << 32);
+  unsigned long long __c = __a + __b;
+  x = (unsigned int)(__c & 0xffffffff);
+  y = (unsigned int)(__c >> 32);
+}
+
+/* { dg-final { scan-assembler-times "movl" 4 } } */


RE: [x86_64 PATCH] PR target/110104: Missing peephole2 for addcarry.

2023-06-06 Thread Roger Sayle


Hi Jakub,
Jakub Jelinek wrote:
> Seems to be pretty much the same as one of the 12 define_peephole2
> patterns I've posted in
> https://gcc.gnu.org/pipermail/gcc-patches/2023-June/620821.html

Doh!  Impressive work.  I need to study how you handle constant carry flags.
Fingers-crossed that patches that touch both the middle-end and a backend
don't get delayed too long in the review/approval process.

> The testcase will be useful though (but I'd go with including the intrin
> header and using the intrinsic rather than builtin).

I find the use of intrin headers a pain when running cc1 under gdb, requiring
additional paths to be specified with -I etc.  Perhaps there's a trick that
I'm missing?  __builtins are more free-standing, and therefore work with
cross-compilers to targets/development environments that I don't have.

I withdraw my patch.  Please feel free to assign PR 110104 to yourself in
Bugzilla.

Cheers (and thanks),
Roger




[Committed] Bug fix to new wi::bitreverse_large function.

2023-06-07 Thread Roger Sayle

Richard Sandiford was, of course, right to be wary of new code without
much test coverage.  Converting the nvptx backend to use the BITREVERSE
rtx infrastructure has resulted in far more exhaustive testing and
revealed a subtle bug in the new wi::bitreverse implementation.  The
code needs to use HOST_WIDE_INT_1U (instead of 1) to avoid unintended
sign extension.
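
As a standalone illustration (my sketch, not part of the patch) of the
pitfall: the shifted constant was a plain 32-bit int, so once bit 31 is
set the intermediate value is negative and sign-extends when OR'd into
the 64-bit word, and a shift count of 32 or more is undefined behaviour
outright.

#include <limits.h>
#include <stdio.h>

int main (void)
{
  unsigned long long val = 0;

  /* What the buggy form effectively produced for offset == 31:
     1 << 31 yields a negative int, which sign-extends on conversion.  */
  int as_int = INT_MIN;
  val |= as_int;
  printf ("buggy: %016llx\n", val);   /* ffffffff80000000 */

  /* The fixed form, analogous to HOST_WIDE_INT_1U << offset:
     an unsigned 64-bit constant, so only the intended bit is set.  */
  val = 0;
  val |= 1ULL << 31;
  printf ("fixed: %016llx\n", val);   /* 0000000080000000 */
  return 0;
}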

This patch has been tested on nvptx-none hosted on x86_64-pc-linux-gnu
(with a minor tweak to use BITREVERSE), where it fixes regressions of
the 32-bit test vectors in gcc.target/nvptx/brev-2.c and the 64-bit
test vectors in gcc.target/nvptx/brevll-2.c.  Committed as obvious.


2023-06-07  Roger Sayle  

gcc/ChangeLog
* wide-int.cc (wi::bitreverse_large): Use HOST_WIDE_INT_1U to
avoid sign extension/undefined behaviour when setting each bit.


Thanks,
Roger
--

diff --git a/gcc/wide-int.cc b/gcc/wide-int.cc
index 24bdce2..ab92ee6 100644
--- a/gcc/wide-int.cc
+++ b/gcc/wide-int.cc
@@ -786,7 +786,7 @@ wi::bitreverse_large (HOST_WIDE_INT *val, const 
HOST_WIDE_INT *xval,
  unsigned int d = (precision - 1) - s;
  block = d / HOST_BITS_PER_WIDE_INT;
  offset = d & (HOST_BITS_PER_WIDE_INT - 1);
-  val[block] |= 1 << offset;
+  val[block] |= HOST_WIDE_INT_1U << offset;
}
 }
 


[nvptx PATCH] Update nvptx's bitrev2 pattern to use BITREVERSE rtx.

2023-06-07 Thread Roger Sayle

This minor tweak to the nvptx backend switches the representation of
the brev instruction from an UNSPEC to the new BITREVERSE rtx.  This
allows various RTL optimizations, including evaluation (constant
folding) of integer constant arguments at compile-time.

This patch has been tested on nvptx-none with make and make -k check
with no new failures.  Ok for mainline?


2023-06-07  Roger Sayle  

gcc/ChangeLog
* config/nvptx/nvptx.md (UNSPEC_BITREV): Delete.
(bitrev2): Represent using bitreverse.


Thanks in advance,
Roger
--

diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 1bb9304..7a7c994 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -34,8 +34,6 @@
UNSPEC_FPINT_CEIL
UNSPEC_FPINT_NEARBYINT
 
-   UNSPEC_BITREV
-
UNSPEC_ALLOCA
 
UNSPEC_SET_SOFTSTACK
@@ -636,8 +634,7 @@
 
 (define_insn "bitrev2"
   [(set (match_operand:SDIM 0 "nvptx_register_operand" "=R")
-   (unspec:SDIM [(match_operand:SDIM 1 "nvptx_register_operand" "R")]
-UNSPEC_BITREV))]
+   (bitreverse:SDIM (match_operand:SDIM 1 "nvptx_register_operand" "R")))]
   ""
   "%.\\tbrev.b%T0\\t%0, %1;")
 


[GCC 13 PATCH] PR target/109973: CCZmode and CCCmode variants of [v]ptest.

2023-06-10 Thread Roger Sayle

This is a backport of the fixes for PR target/109973 and PR target/110083.

This backport to the releases/gcc-13 branch has been tested on
x86_64-pc-linux-gnu with make bootstrap and make -k check, both with and
without --target_board=unix{-m32} with no new failures.  Ok for gcc-13,
or should we just close PR 109973 in Bugzilla?


2023-06-10  Roger Sayle  
Uros Bizjak  

gcc/ChangeLog
PR target/109973
PR target/110083
* config/i386/i386-builtin.def (__builtin_ia32_ptestz128): Use new
CODE_for_sse4_1_ptestzv2di.
(__builtin_ia32_ptestc128): Use new CODE_for_sse4_1_ptestcv2di.
(__builtin_ia32_ptestz256): Use new CODE_for_avx_ptestzv4di.
(__builtin_ia32_ptestc256): Use new CODE_for_avx_ptestcv4di.
* config/i386/i386-expand.cc (ix86_expand_branch): Use CCZmode
when expanding UNSPEC_PTEST to compare against zero.
* config/i386/i386-features.cc (scalar_chain::convert_compare):
Likewise generate CCZmode UNSPEC_PTESTs when converting comparisons.
Update or delete REG_EQUAL notes, converting CONST_INT and
CONST_WIDE_INT immediate operands to a suitable CONST_VECTOR.
(general_scalar_chain::convert_insn): Use CCZmode for COMPARE
result.
(timode_scalar_chain::convert_insn): Use CCZmode for COMPARE result.
* config/i386/i386-protos.h (ix86_match_ptest_ccmode): Prototype.
* config/i386/i386.cc (ix86_match_ptest_ccmode): New predicate to
check for suitable matching modes for the UNSPEC_PTEST pattern.
* config/i386/sse.md (define_split): When splitting UNSPEC_MOVMSK
to UNSPEC_PTEST, preserve the FLAG_REG mode as CCZ. 
(*_ptest): Add asterisk to hide define_insn.  Remove
":CC" mode of FLAGS_REG, instead use ix86_match_ptest_ccmode.
(_ptestz): New define_expand to specify CCZ.
(_ptestc): New define_expand to specify CCC.
(_ptest): A define_expand using CC to preserve the
current behavior.
(*ptest_and): Specify CCZ to only perform this optimization
when only the Z flag is required.

gcc/testsuite/ChangeLog
PR target/109973
PR target/110083
* gcc.target/i386/pr109973-1.c: New test case.
* gcc.target/i386/pr109973-2.c: Likewise.
* gcc.target/i386/pr110083.c: Likewise.


Thanks,
Roger
--

diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def
index 6dae697..37df018 100644
--- a/gcc/config/i386/i386-builtin.def
+++ b/gcc/config/i386/i386-builtin.def
@@ -1004,8 +1004,8 @@ BDESC (OPTION_MASK_ISA_SSE4_1, 0, 
CODE_FOR_sse4_1_roundps_sfix, "__builtin_ia32_
 BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_roundv4sf2, 
"__builtin_ia32_roundps_az", IX86_BUILTIN_ROUNDPS_AZ, UNKNOWN, (int) 
V4SF_FTYPE_V4SF)
 BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_roundv4sf2_sfix, 
"__builtin_ia32_roundps_az_sfix", IX86_BUILTIN_ROUNDPS_AZ_SFIX, UNKNOWN, (int) 
V4SI_FTYPE_V4SF)
 
-BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestv2di, 
"__builtin_ia32_ptestz128", IX86_BUILTIN_PTESTZ, EQ, (int) 
INT_FTYPE_V2DI_V2DI_PTEST)
-BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestv2di, 
"__builtin_ia32_ptestc128", IX86_BUILTIN_PTESTC, LTU, (int) 
INT_FTYPE_V2DI_V2DI_PTEST)
+BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestzv2di, 
"__builtin_ia32_ptestz128", IX86_BUILTIN_PTESTZ, EQ, (int) 
INT_FTYPE_V2DI_V2DI_PTEST)
+BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestcv2di, 
"__builtin_ia32_ptestc128", IX86_BUILTIN_PTESTC, LTU, (int) 
INT_FTYPE_V2DI_V2DI_PTEST)
 BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestv2di, 
"__builtin_ia32_ptestnzc128", IX86_BUILTIN_PTESTNZC, GTU, (int) 
INT_FTYPE_V2DI_V2DI_PTEST)
 
 /* SSE4.2 */
@@ -1164,8 +1164,8 @@ BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_vtestpd256, 
"__builtin_ia32_vtestnzc
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_vtestps256, 
"__builtin_ia32_vtestzps256", IX86_BUILTIN_VTESTZPS256, EQ, (int) 
INT_FTYPE_V8SF_V8SF_PTEST)
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_vtestps256, 
"__builtin_ia32_vtestcps256", IX86_BUILTIN_VTESTCPS256, LTU, (int) 
INT_FTYPE_V8SF_V8SF_PTEST)
 BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_vtestps256, 
"__builtin_ia32_vtestnzcps256", IX86_BUILTIN_VTESTNZCPS256, GTU, (int) 
INT_FTYPE_V8SF_V8SF_PTEST)
-BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_ptestv4di, 
"__builtin_ia32_ptestz256", IX86_BUILTIN_PTESTZ256, EQ, (int) 
INT_FTYPE_V4DI_V4DI_PTEST)
-BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_ptestv4di, 
"__builtin_ia32_ptestc256", IX86_BUILTIN_PTESTC256, LTU, (int) 
INT_FTYPE_V4DI_V4DI_PTEST)
+BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_ptestzv4di, 
"__builtin_ia32_ptestz256", IX86_BUILTIN_PTESTZ256, EQ, (int) 
INT_FTYPE_V4DI_V4DI_PTEST)
+BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_ptestcv4di, 
"__builtin_ia32_ptes

[PATCH] Avoid duplicate vector initializations during RTL expansion.

2023-06-11 Thread Roger Sayle

This middle-end patch avoids some redundant RTL for vector initialization
during RTL expansion.  For the simple test case:

typedef __int128 v1ti __attribute__ ((__vector_size__ (16)));
__int128 key;

v1ti foo() {
return (v1ti){key};
}

the middle-end currently expands:

(set (reg:V1TI 85) (const_vector:V1TI [ (const_int 0) ]))

(set (reg:V1TI 85) (mem/c:V1TI (symbol_ref:DI ("key"

where we create a dead instruction that initializes the vector to zero,
immediately followed by a set of the entire vector.  This patch skips
this zeroing instruction when the vector has only a single element.
It also updates the code to indicate when we've cleared the vector,
so that we don't need to initialize zero elements.
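
As a sketch of the second part (my example, not from the patch): for a
constructor that mixes zero and non-zero elements, recording that the
register has already been cleared means only the non-zero elements need
to be inserted during expansion.

typedef int v4si __attribute__ ((vector_size (16)));
extern int e0, e3;

v4si bar (void)
{
  /* With CLEARED set by the initial zeroing move, the two zero
     elements no longer need explicit initialization.  */
  return (v4si){ e0, 0, 0, e3 };
}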

Interestingly, this code is very similar to my patch from April 2006:
https://gcc.gnu.org/pipermail/gcc-patches/2006-April/192861.html


This patch has been tested on x86_64-pc-linux-gnu with a make bootstrap
and make -k check, both with and without --target_board=unix{-m32}, with
no new failures.  Ok for mainline?


2023-06-11  Roger Sayle  

gcc/ChangeLog
* expr.cc (store_constructor) : Don't bother
clearing vectors with only a single element.  Set CLEARED if the
vector was initialized to zero.


Thanks,
Roger
--

diff --git a/gcc/expr.cc b/gcc/expr.cc
index 868fa6e..62cd8fa 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -7531,8 +7531,11 @@ store_constructor (tree exp, rtx target, int cleared, 
poly_int64 size,
  }
 
/* Inform later passes that the old value is dead.  */
-   if (!cleared && !vector && REG_P (target))
- emit_move_insn (target, CONST0_RTX (mode));
+   if (!cleared && !vector && REG_P (target) && maybe_gt (n_elts, 1u))
+ {
+   emit_move_insn (target, CONST0_RTX (mode));
+   cleared = 1;
+ }
 
 if (MEM_P (target))
  alias = MEM_ALIAS_SET (target);


[PATCH] New finish_compare_by_pieces target hook (for x86).

2023-06-12 Thread Roger Sayle

The following simple test case, from PR 104610, shows that memcmp () == 0
can result in some bizarre code sequences on x86.

int foo(char *a)
{
static const char t[] = "0123456789012345678901234567890";
return __builtin_memcmp(a, &t[0], sizeof(t)) == 0;
}

with -O2 currently contains both:
xorl%eax, %eax
xorl$1, %eax
and also
movl$1, %eax
xorl$1, %eax

Changing the return type of foo to _Bool results in the equally
bizarre:
xorl%eax, %eax
testl   %eax, %eax
sete%al
and also
movl$1, %eax
testl   %eax, %eax
sete%al

All these sequences set the result to a constant, but this optimization
opportunity only occurs very late during compilation, by basic block
duplication in the 322r.bbro pass, too late for CSE or peephole2 to
do anything about it.  The problem is that the idiom expanded by
compare_by_pieces for __builtin_memcmp_eq contains basic blocks that
can't easily be optimized by if-conversion due to the multiple
incoming edges on the fail block.

In summary, compare_by_pieces generates code that looks like:

if (x[0] != y[0]) goto fail_label;
if (x[1] != y[1]) goto fail_label;
...
if (x[n] != y[n]) goto fail_label;
result = 1;
goto end_label;
fail_label:
result = 0;
end_label:

In theory, the RTL if-conversion pass could be enhanced to tackle
arbitrarily complex if-then-else graphs, but the solution proposed
here is to allow suitable targets to perform if-conversion during
compare_by_pieces.  The x86, for example, can take advantage that
all of the above comparisons set and test the zero flag (ZF), which
can then be used in combination with sete.  Hence compare_by_pieces
could instead generate:

if (x[0] != y[0]) goto fail_label;
if (x[1] != y[1]) goto fail_label;
...
if (x[n] != y[n]) goto fail_label;
fail_label:
sete result

which requires one less basic block, and the redundant conditional
branch to a label immediately after is cleaned up by GCC's existing
RTL optimizations.

For the test case above, where -O2 -msse4 previously generated:

foo:movdqu  (%rdi), %xmm0
pxor.LC0(%rip), %xmm0
ptest   %xmm0, %xmm0
je  .L5
.L2:movl$1, %eax
xorl$1, %eax
ret
.L5:movdqu  16(%rdi), %xmm0
pxor.LC1(%rip), %xmm0
ptest   %xmm0, %xmm0
jne .L2
xorl%eax, %eax
xorl$1, %eax
ret

we now generate:

foo:movdqu  (%rdi), %xmm0
pxor.LC0(%rip), %xmm0
ptest   %xmm0, %xmm0
jne .L2
movdqu  16(%rdi), %xmm0
pxor.LC1(%rip), %xmm0
ptest   %xmm0, %xmm0
.L2:sete%al
movzbl  %al, %eax
ret

Using a target hook allows the large amount of intelligence already in
compare_by_pieces to be re-used by the i386 backend, but this can also
help other backends with condition flags where the equality result can
be materialized.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-06-12  Roger Sayle  

gcc/ChangeLog
* config/i386/i386.cc (ix86_finish_compare_by_pieces): New
function to provide a backend specific implementation.
(TARGET_FINISH_COMPARE_BY_PIECES): Use the above function.

* doc/tm.texi.in (TARGET_FINISH_COMPARE_BY_PIECES): New @hook.
* doc/tm.texi: Regenerate.

* expr.cc (compare_by_pieces): Call finish_compare_by_pieces in
targetm to finalize the RTL expansion.  Move the current
implementation to a default target hook.
* target.def (finish_compare_by_pieces): New target hook to allow
compare_by_pieces to be customized by the target.
* targhooks.cc (default_finish_compare_by_pieces): Default
implementation moved here from expr.cc's compare_by_pieces.
* targhooks.h (default_finish_compare_by_pieces): Prototype.

gcc/testsuite/ChangeLog
* gcc.target/i386/pieces-memcmp-1.c: New test case.


Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 3a1444d..509c0ee 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -16146,6 +16146,20 @@ ix86_fp_compare_code_to_integer (enum rtx_code code)
 }
 }
 
+/* Override compare_by_pieces' default implementation using the state
+   of the CCZmode FLAGS_REG and sete instruction.  TARGET is the integral
+   mode result, and FAIL_LABEL is the branch target of mismatched
+   comparisons.  */
+
+void
+ix86_finish_compare_by_pieces (rtx target, rtx_code_label *fail_label)
+{
+  rtx tmp = gen_reg_rtx (QImode);
+  emit_label (fail_label);
+  ix86_expand_setcc (tmp, NE, gen_rtx_REG (CCZmode, FLAGS_REG), const0_rtx);
+  convert_move (target, tmp,

[x86 PATCH] Convert ptestz of pandn into ptestc.

2023-06-13 Thread Roger Sayle

This patch is the next instalment in a set of backend patches around
improvements to ptest/vptest.  A previous patch optimized the sequence
t=pand(x,y); ptestz(t,t) into the equivalent ptestz(x,y), using the
property that ZF is set to (X&Y) == 0.  This patch performs a similar
transformation, converting t=pandn(x,y); ptestz(t,t) into the (almost)
equivalent ptestc(y,x), using the property that the CF flag is set to
(~X&Y) == 0.  The tricky bit is that this sets the CF flag instead of
the ZF flag, so we can only perform this transformation when we can
also convert the flags' consumer, as well as the producer.

For the test case:

int foo (__m128i x, __m128i y)
{
  __m128i a = x & ~y;
  return __builtin_ia32_ptestz128 (a, a);
}

With -O2 -msse4.1 we previously generated:

foo:pandn   %xmm0, %xmm1
xorl%eax, %eax
ptest   %xmm1, %xmm1
sete%al
ret

with this patch we now generate:

foo:xorl%eax, %eax
ptest   %xmm0, %xmm1
setc%al
ret

At the same time, this patch also provides alternative fixes for
PR target/109973 and PR target/110118, by recognizing that ptestc(x,x)
always sets the carry flag (X&~X is always zero).  This is achieved
both by recognizing the special case in ix86_expand_sse_ptest and with
a splitter to convert an eligible ptest into an stc.
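
A minimal sketch (mine, not one of the new tests) of that special case;
compiled with -O2 -msse4.1 the builtin can now fold to the constant 1:

typedef long long v2di __attribute__ ((vector_size (16)));

int always_one (v2di x)
{
  /* CF is set whenever (~x & x) == 0, which is always true.  */
  return __builtin_ia32_ptestc128 (x, x);
}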

The next piece is, of course, STV of "if (x & ~y)..."

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?

2023-06-13  Roger Sayle  

gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_sse_ptest): Recognize
expansion of ptestc with equal operands as returning const1_rtx.
* config/i386/i386.cc (ix86_rtx_costs): Provide accurate cost
estimates of UNSPEC_PTEST, where the ptest performs the PAND
or PAND of its operands.
* config/i386/sse.md (define_split): Transform CCCmode UNSPEC_PTEST
of reg_equal_p operands into an x86_stc instruction.
(define_split): Split pandn/ptestz/setne into ptestc/setnc.
(define_split): Split pandn/ptestz/sete into ptestc/setc.
(define_split): Split pandn/ptestz/je into ptestc/jc.
(define_split): Split pandn/ptestz/jne into ptestc/jnc.

gcc/testsuite/ChangeLog
* gcc.target/i386/avx-vptest-4.c: New test case.
* gcc.target/i386/avx-vptest-5.c: Likewise.
* gcc.target/i386/avx-vptest-6.c: Likewise.
* gcc.target/i386/pr109973-1.c: Update test case.
* gcc.target/i386/pr109973-2.c: Likewise.
* gcc.target/i386/sse4_1-ptest-4.c: New test case.
* gcc.target/i386/sse4_1-ptest-5.c: Likewise.
* gcc.target/i386/sse4_1-ptest-6.c: Likewise.


Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index def060a..1d11af2 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -10222,6 +10222,13 @@ ix86_expand_sse_ptest (const struct 
builtin_description *d, tree exp,
   machine_mode mode1 = insn_data[d->icode].operand[1].mode;
   enum rtx_code comparison = d->comparison;
 
+  /* ptest reg, reg sets the carry flag.  */
+  if (comparison == LTU
+  && (d->code == IX86_BUILTIN_PTESTC
+ || d->code == IX86_BUILTIN_PTESTC256)
+  && rtx_equal_p (op0, op1))
+return const1_rtx;
+
   if (VECTOR_MODE_P (mode0))
 op0 = safe_vector_operand (op0, mode0);
   if (VECTOR_MODE_P (mode1))
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 3a1444d..3e99e23 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -21423,16 +21423,23 @@ ix86_rtx_costs (rtx x, machine_mode mode, int 
outer_code_i, int opno,
   else if (XINT (x, 1) == UNSPEC_PTEST)
{
  *total = cost->sse_op;
- if (XVECLEN (x, 0) == 2
- && GET_CODE (XVECEXP (x, 0, 0)) == AND)
+ rtx test_op0 = XVECEXP (x, 0, 0);
+ if (!rtx_equal_p (test_op0, XVECEXP (x, 0, 1)))
+   return false;
+ if (GET_CODE (test_op0) == AND)
{
- rtx andop = XVECEXP (x, 0, 0);
- *total += rtx_cost (XEXP (andop, 0), GET_MODE (andop),
- AND, opno, speed)
-   + rtx_cost (XEXP (andop, 1), GET_MODE (andop),
-   AND, opno, speed);
- return true;
+ rtx and_op0 = XEXP (test_op0, 0);
+ if (GET_CODE (and_op0) == NOT)
+   and_op0 = XEXP (and_op0, 0);
+ *total += rtx_cost (and_op0, GET_MODE (and_op0),
+ AND, 0, speed)
+   + rtx_cost (XEXP (test_op0, 1), GET_MODE (and_op0),
+   AND, 1, speed);
}
+ else
+   *total = rtx_cost (test

RE: [x86 PATCH] PR target/31985: Improve memory operand use with doubleword add.

2023-06-15 Thread Roger Sayle

Hi Uros,

> On the 7th June 2023, Uros Bizjak wrote:
> The register allocator considers the instruction-to-be-split as one
> instruction, so it can allocate the output register to match an input
> register (or a register that forms an input address).  So, you have to
> either add an early clobber to the output, or somehow prevent the output
> from clobbering registers in the second pattern.

This implements your suggestion of adding an early clobber to the output, a
one character ('&') change from the previous version of this patch.  Retested
with make bootstrap and make -k check, with and without -m32, to confirm
there are no issues, and this still fixes the pr31985.c test case.

As you've suggested, I'm also working on improving STV in this area.

Ok for mainline?


2023-06-15  Roger Sayle  
Uros Bizjak  

gcc/ChangeLog
PR target/31985
* config/i386/i386.md (*add3_doubleword_concat): New
define_insn_and_split combine *add3_doubleword with a
*concat3 for more efficient lowering after reload.

gcc/testsuite/ChangeLog
PR target/31985
* gcc.target/i386/pr31985.c: New test case.

Roger
--

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index e6ebc46..42c302d 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -6124,6 +6124,36 @@
  (clobber (reg:CC FLAGS_REG))])]
  "split_double_mode (mode, &operands[0], 2, &operands[0], &operands[3]);")
 
+(define_insn_and_split "*add3_doubleword_concat"
+  [(set (match_operand: 0 "register_operand" "=&r")
+   (plus:
+ (any_or_plus:
+   (ashift:
+ (zero_extend:
+   (match_operand:DWIH 2 "nonimmediate_operand" "rm"))
+ (match_operand: 3 "const_int_operand"))
+   (zero_extend:
+ (match_operand:DWIH 4 "nonimmediate_operand" "rm")))
+ (match_operand: 1 "register_operand" "0")))
+   (clobber (reg:CC FLAGS_REG))]
+  "INTVAL (operands[3]) ==  * BITS_PER_UNIT"
+  "#"
+  "&& reload_completed"
+  [(parallel [(set (reg:CCC FLAGS_REG)
+  (compare:CCC
+(plus:DWIH (match_dup 1) (match_dup 4))
+(match_dup 1)))
+ (set (match_dup 0)
+  (plus:DWIH (match_dup 1) (match_dup 4)))])
+   (parallel [(set (match_dup 5)
+  (plus:DWIH
+(plus:DWIH
+  (ltu:DWIH (reg:CC FLAGS_REG) (const_int 0))
+  (match_dup 6))
+(match_dup 2)))
+ (clobber (reg:CC FLAGS_REG))])]
+ "split_double_mode (mode, &operands[0], 2, &operands[0], &operands[5]);")
+
 (define_insn "*add_1"
   [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r")
(plus:SWI48
diff --git a/gcc/testsuite/gcc.target/i386/pr31985.c 
b/gcc/testsuite/gcc.target/i386/pr31985.c
new file mode 100644
index 000..a6de1b5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr31985.c
@@ -0,0 +1,14 @@
+/* { dg-do compile { target ia32 } } */
+/* { dg-options "-O2" } */
+
+void test_c (unsigned int a, unsigned int b, unsigned int c, unsigned int d)
+{
+  volatile unsigned int x, y;
+  unsigned long long __a = b | ((unsigned long long)a << 32);
+  unsigned long long __b = d | ((unsigned long long)c << 32);
+  unsigned long long __c = __a + __b;
+  x = (unsigned int)(__c & 0xffffffff);
+  y = (unsigned int)(__c >> 32);
+}
+
+/* { dg-final { scan-assembler-times "movl" 4 } } */


RE: [x86 PATCH] Tweak ix86_expand_int_compare to use PTEST for vector equality.

2023-07-12 Thread Roger Sayle


> From: Hongtao Liu 
> Sent: 12 July 2023 01:45
> 
> On Wed, Jul 12, 2023 at 4:57 AM Roger Sayle 
> > > From: Hongtao Liu 
> > > Sent: 28 June 2023 04:23
> > > > From: Roger Sayle 
> > > > Sent: 27 June 2023 20:28
> > > >
> > > > I've also come up with an alternate/complementary/supplementary
> > > > fix of generating the PTEST during RTL expansion, rather than rely
> > > > on this being caught/optimized later during STV.
> > > >
> > > > You may notice in this patch, the tests for TARGET_SSE4_1 and
> > > > TImode appear last.  When I was writing this, I initially also
> > > > added support for AVX VPTEST and OImode, before realizing that x86
> > > > doesn't (yet) support 256-bit OImode (which also explains why we
> > > > don't have an OImode to V1OImode scalar-to-vector pass).
> > > > Retaining this clause ordering should minimize the lines changed if 
> > > > things change in future.
> > > >
> > > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > > bootstrap and make -k check, both with and without
> > > > --target_board=unix{-m32} with no new failures.  Ok for mainline?
> > > >
> > > >
> > > > 2023-06-27  Roger Sayle  
> > > >
> > > > gcc/ChangeLog
> > > > * config/i386/i386-expand.cc (ix86_expand_int_compare): If
> > > > testing a TImode SUBREG of a 128-bit vector register against
> > > > zero, use a PTEST instruction instead of first moving it to
> > > > to scalar registers.
> > > >
> > >
> > > +  /* Attempt to use PTEST, if available, when testing vector modes for
> > > + equality/inequality against zero.  */  if (op1 == const0_rtx
> > > +  && SUBREG_P (op0)
> > > +  && cmpmode == CCZmode
> > > +  && SUBREG_BYTE (op0) == 0
> > > +  && REG_P (SUBREG_REG (op0))
> > > Just register_operand (op0, TImode),
> >
> > I completely agree that in most circumstances, the early RTL
> > optimizers should use standard predicates, such as register_operand,
> > that don't distinguish between REG and SUBREG, allowing the choice
> > (assignment) to be left to register allocation (reload).
> >
> > However in this case, unusually, the presence of the SUBREG, and
> > treating it differently from a REG is critical (in fact the reason for
> > the patch).  x86_64 can very efficiently test whether a 128-bit value
> > is zero, setting ZF, either in TImode, using orq %rax,%rdx in a single
> > cycle/single instruction, or in V1TImode, using ptest %xmm0,%xmm0, in a 
> > single cycle/single instruction.
> > There's no reason to prefer one form over the other.  A SUBREG,
> > however, that moves the value from the scalar registers to a vector
> > register, or from a vector register to scalar registers, requires two or
> > three instructions, often reading
> > and writing values via memory, at a huge performance penalty.   Hence the
> > goal is to eliminate the (VIEW_CONVERT) SUBREG, and choose the
> > appropriate single-cycle test instruction for where the data is
> > located.  Hence we want to leave REG_P alone, but optimize (only) the
> > SUBREG_P cases.
> > register_operand doesn't help with this.
> >
> > Note this is counter to the usual advice.  Normally, a SUBREG between
> > scalar registers is cheap (in fact free) on x86, hence it safe for
> > predicates to ignore them prior to register allocation.  But another
> > use of SUBREG, to represent a VIEW_CONVERT_EXPR/transfer between
> > processing units is closer to a conversion, and a very expensive one
> > (going via memory with different size reads vs writes) at that.
> >
> >
> > > +  && VECTOR_MODE_P (GET_MODE (SUBREG_REG (op0)))
> > > +  && TARGET_SSE4_1
> > > +  && GET_MODE (op0) == TImode
> > > +  && GET_MODE_SIZE (GET_MODE (SUBREG_REG (op0))) == 16)
> > > +{
> > > +  tmp = SUBREG_REG (op0);
> > > and tmp = lowpart_subreg (V1TImode, force_reg (TImode, op0));?
> > > I think RA can handle SUBREG correctly, no need for extra predicates.
> >
> > Likewise, your "tmp = lowpart_subreg (V1TImode, force_reg (TImode, ...))"
> > is forcing there to always be an inter-unit transfer/pipeline stall,
> > when this is idiom that we're trying to eliminate.
> >
>

[x86_64 PATCH] Improved insv of DImode/DFmode {high, low}parts into TImode.

2023-07-13 Thread Roger Sayle

This is the next piece towards a fix for (the x86_64 ABI issues affecting)
PR 88873.  This patch generalizes the recent tweak to ix86_expand_move
for setting the highpart of a TImode reg from a DImode source using
*insvti_highpart_1, to handle both DImode and DFmode sources, and also
use the recently added *insvti_lowpart_1 for setting the lowpart.

Although this is another intermediate step (not yet a fix), towards
enabling *insvti and *concat* patterns to be candidates for TImode STV
(by using V2DI/V2DF instructions), it already improves things a little.

For the test case from PR 88873

typedef struct { double x, y; } s_t;
typedef double v2df __attribute__ ((vector_size (2 * sizeof(double;

s_t foo (s_t a, s_t b, s_t c)
{
  return (s_t) { fma(a.x, b.x, c.x), fma (a.y, b.y, c.y) };
}


With -O2 -march=cascadelake, GCC currently generates:

Before (29 instructions):
vmovq   %xmm2, -56(%rsp)
movq-56(%rsp), %rdx
vmovq   %xmm4, -40(%rsp)
movq$0, -48(%rsp)
movq%rdx, -56(%rsp)
movq-40(%rsp), %rdx
vmovq   %xmm0, -24(%rsp)
movq%rdx, -40(%rsp)
movq-24(%rsp), %rsi
movq-56(%rsp), %rax
movq$0, -32(%rsp)
vmovq   %xmm3, -48(%rsp)
movq-48(%rsp), %rcx
vmovq   %xmm5, -32(%rsp)
vmovq   %rax, %xmm6
movq-40(%rsp), %rax
movq$0, -16(%rsp)
movq%rsi, -24(%rsp)
movq-32(%rsp), %rsi
vpinsrq $1, %rcx, %xmm6, %xmm6
vmovq   %rax, %xmm7
vmovq   %xmm1, -16(%rsp)
vmovapd %xmm6, %xmm3
vpinsrq $1, %rsi, %xmm7, %xmm7
vfmadd132pd -24(%rsp), %xmm7, %xmm3
vmovapd %xmm3, -56(%rsp)
vmovsd  -48(%rsp), %xmm1
vmovsd  -56(%rsp), %xmm0
ret

After (20 instructions):
vmovq   %xmm2, -56(%rsp)
movq-56(%rsp), %rax
vmovq   %xmm3, -48(%rsp)
vmovq   %xmm4, -40(%rsp)
movq-48(%rsp), %rcx
vmovq   %xmm5, -32(%rsp)
vmovq   %rax, %xmm6
movq-40(%rsp), %rax
movq-32(%rsp), %rsi
vpinsrq $1, %rcx, %xmm6, %xmm6
vmovq   %xmm0, -24(%rsp)
vmovq   %rax, %xmm7
vmovq   %xmm1, -16(%rsp)
vmovapd %xmm6, %xmm2
vpinsrq $1, %rsi, %xmm7, %xmm7
vfmadd132pd -24(%rsp), %xmm7, %xmm2
vmovapd %xmm2, -56(%rsp)
vmovsd  -48(%rsp), %xmm1
vmovsd  -56(%rsp), %xmm0
ret

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  No testcase yet, as the above code will hopefully
change dramatically with the next pieces.  Ok for mainline?


2023-07-13  Roger Sayle  

gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_move): Generalize special
case inserting of 64-bit values into a TImode register, to handle
both DImode and DFmode using either *insvti_lowpart_1
or *insvti_highpart_1.


Thanks again,
Roger
--

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 92ffa4b..fe87f8e 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -542,22 +542,39 @@ ix86_expand_move (machine_mode mode, rtx operands[])
}
 }
 
-  /* Use *insvti_highpart_1 to set highpart of TImode register.  */
+  /* Special case inserting 64-bit values into a TImode register.  */
   if (TARGET_64BIT
-  && mode == DImode
+  && (mode == DImode || mode == DFmode)
   && SUBREG_P (op0)
-  && SUBREG_BYTE (op0) == 8
   && GET_MODE (SUBREG_REG (op0)) == TImode
   && REG_P (SUBREG_REG (op0))
   && REG_P (op1))
 {
-  wide_int mask = wi::mask (64, false, 128);
-  rtx tmp = immed_wide_int_const (mask, TImode);
-  op0 = SUBREG_REG (op0);
-  tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp);
-  op1 = gen_rtx_ZERO_EXTEND (TImode, op1);
-  op1 = gen_rtx_ASHIFT (TImode, op1, GEN_INT (64));
-  op1 = gen_rtx_IOR (TImode, tmp, op1);
+  /* Use *insvti_lowpart_1 to set lowpart.  */
+  if (SUBREG_BYTE (op0) == 0)
+   {
+ wide_int mask = wi::mask (64, true, 128);
+ rtx tmp = immed_wide_int_const (mask, TImode);
+ op0 = SUBREG_REG (op0);
+ tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp);
+ if (mode == DFmode)
+   op1 = force_reg (DImode, gen_lowpart (DImode, op1));
+ op1 = gen_rtx_ZERO_EXTEND (TImode, op1);
+ op1 = gen_rtx_IOR (TImode, tmp, op1);
+   }
+  /* Use *insvti_highpart_1 to set highpart.  */
+  else if (SUBREG_BYTE (op0) == 8)
+   {
+ wide_int mask = wi::mask (64, false, 128);
+ rtx tmp = immed_wide_int_const (mask, TImode);
+ op0 = SUBREG_REG (op0);
+ tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp);
+ if (mode == DFmode)
+

[x86 PATCH] PR target/110588: Add *bt_setncqi_2 to generate btl

2023-07-13 Thread Roger Sayle

This patch resolves PR target/110588 to catch another case in combine
where the i386 backend should be generating a btl instruction.  This adds
another define_insn_and_split to recognize the RTL representation for this
case.

I also noticed that two related define_insn_and_split patterns weren't using
the preferred string style for single-statement preparation statements, so
I've reformatted these to be consistent in style with the new one.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-07-13  Roger Sayle  

gcc/ChangeLog
PR target/110588
* config/i386/i386.md (*bt_setcqi): Prefer string form
preparation statement over braces for a single statement.
(*bt_setncqi): Likewise.
(*bt_setncqi_2): New define_insn_and_split.

gcc/testsuite/ChangeLog
PR target/110588
* gcc.target/i386/pr110588.c: New test case.


Thanks again,
Roger
--

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index e47ced1..04eca049 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -16170,9 +16170,7 @@
  (const_int 0)))
(set (match_dup 0)
 (eq:QI (reg:CCC FLAGS_REG) (const_int 0)))]
-{
-  operands[2] = lowpart_subreg (SImode, operands[2], QImode);
-})
+  "operands[2] = lowpart_subreg (SImode, operands[2], QImode);")
 
 ;; Help combine recognize bt followed by setnc
 (define_insn_and_split "*bt_setncqi"
@@ -16193,9 +16191,7 @@
  (const_int 0)))
(set (match_dup 0)
 (ne:QI (reg:CCC FLAGS_REG) (const_int 0)))]
-{
-  operands[2] = lowpart_subreg (SImode, operands[2], QImode);
-})
+  "operands[2] = lowpart_subreg (SImode, operands[2], QImode);")
 
 (define_insn_and_split "*bt_setnc"
   [(set (match_operand:SWI48 0 "register_operand")
@@ -16219,6 +16215,27 @@
   operands[2] = lowpart_subreg (SImode, operands[2], QImode);
   operands[3] = gen_reg_rtx (QImode);
 })
+
+;; Help combine recognize bt followed by setnc (PR target/110588)
+(define_insn_and_split "*bt_setncqi_2"
+  [(set (match_operand:QI 0 "register_operand")
+   (eq:QI
+ (zero_extract:SWI48
+   (match_operand:SWI48 1 "register_operand")
+   (const_int 1)
+   (zero_extend:SI (match_operand:QI 2 "register_operand")))
+ (const_int 0)))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_USE_BT && ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(set (reg:CCC FLAGS_REG)
+(compare:CCC
+ (zero_extract:SWI48 (match_dup 1) (const_int 1) (match_dup 2))
+ (const_int 0)))
+   (set (match_dup 0)
+(ne:QI (reg:CCC FLAGS_REG) (const_int 0)))]
+  "operands[2] = lowpart_subreg (SImode, operands[2], QImode);")
 
 ;; Store-flag instructions.
 
diff --git a/gcc/testsuite/gcc.target/i386/pr110588.c 
b/gcc/testsuite/gcc.target/i386/pr110588.c
new file mode 100644
index 000..4505c87
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr110588.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mtune=core2" } */
+
+unsigned char foo (unsigned char x, int y)
+{
+  int _1 = (int) x;
+  int _2 = _1 >> y;
+  int _3 = _2 & 1;
+  unsigned char _8 = (unsigned char) _3;
+  unsigned char _6 = _8 ^ 1;
+  return _6;
+}
+
+/* { dg-final { scan-assembler "btl" } } */
+/* { dg-final { scan-assembler "setnc" } } */
+/* { dg-final { scan-assembler-not "sarl" } } */
+/* { dg-final { scan-assembler-not "andl" } } */
+/* { dg-final { scan-assembler-not "xorl" } } */


RE: [x86 PATCH] PR target/110588: Add *bt_setncqi_2 to generate btl

2023-07-14 Thread Roger Sayle


> From: Uros Bizjak 
> Sent: 13 July 2023 19:21
> 
> On Thu, Jul 13, 2023 at 7:10 PM Roger Sayle 
> wrote:
> >
> > This patch resolves PR target/110588 to catch another case in combine
> > where the i386 backend should be generating a btl instruction.  This
> > adds another define_insn_and_split to recognize the RTL representation
> > for this case.
> >
> > I also noticed that two related define_insn_and_split weren't using
> > the preferred string style for single statement
> > preparation-statements, so I've reformatted these to be consistent in style 
> > with
> the new one.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures.  Ok for mainline?
> >
> >
> > 2023-07-13  Roger Sayle  
> >
> > gcc/ChangeLog
> > PR target/110588
> > * config/i386/i386.md (*bt_setcqi): Prefer string form
> > preparation statement over braces for a single statement.
> > (*bt_setncqi): Likewise.
> > (*bt_setncqi_2): New define_insn_and_split.
> >
> > gcc/testsuite/ChangeLog
> > PR target/110588
> > * gcc.target/i386/pr110588.c: New test case.
> 
> +;; Help combine recognize bt followed by setnc (PR target/110588)
> +(define_insn_and_split "*bt_setncqi_2"
> +  [(set (match_operand:QI 0 "register_operand")  (eq:QI
> +  (zero_extract:SWI48
> +(match_operand:SWI48 1 "register_operand")
> +(const_int 1)
> +(zero_extend:SI (match_operand:QI 2 "register_operand")))
> +  (const_int 0)))
> +   (clobber (reg:CC FLAGS_REG))]
> +  "TARGET_USE_BT && ix86_pre_reload_split ()"
> +  "#"
> +  "&& 1"
> +  [(set (reg:CCC FLAGS_REG)
> +(compare:CCC
> + (zero_extract:SWI48 (match_dup 1) (const_int 1) (match_dup 2))
> + (const_int 0)))
> +   (set (match_dup 0)
> +(ne:QI (reg:CCC FLAGS_REG) (const_int 0)))]
> +  "operands[2] = lowpart_subreg (SImode, operands[2], QImode);")
> 
> I don't think the above transformation is 100% correct, mainly due to the use 
> of
> paradoxical subreg.
> 
> The combined instruction is operating with a zero_extended QImode register, so
> all bits of the register are well defined. You are splitting using 
> paradoxical subreg,
> so you don't know what garbage is there in the highpart of the count register.
> However, BTL/BTQ uses modulo 64 (or 32) of this register, so even with a 
> slightly
> invalid RTX, everything checks out.
> 
> +  "operands[2] = lowpart_subreg (SImode, operands[2], QImode);")
> 
> You probably need <MODE>mode instead of SImode here.

The define_insn for *bt is:

(define_insn "*bt"
  [(set (reg:CCC FLAGS_REG)
(compare:CCC
  (zero_extract:SWI48
(match_operand:SWI48 0 "nonimmediate_operand" "r,m")
(const_int 1)
(match_operand:SI 1 "nonmemory_operand" "r,"))
  (const_int 0)))]

So <MODE>mode isn't appropriate here.

But now you've made me think about it, it's inconsistent that all of the shifts
and rotates in i386.md standardize on QImode for shift counts, but the bit test
instructions use SImode?  I think this explains where the paradoxical SUBREGs
come from, and in theory any_extend from QImode to SImode here could/should 
be handled/unnecessary.

Is it worth investigating a follow-up patch to convert all ZERO_EXTRACTs and
SIGN_EXTRACTs in i386.md to use QImode (instead of SImode)?

Thanks in advance,
Roger
--




[PATCH] Fix bootstrap failure (with g++ 4.8.5) in tree-if-conv.cc.

2023-07-14 Thread Roger Sayle
 

This patch fixes the bootstrap failure I'm seeing using gcc 4.8.5 as
the host compiler.  Ok for mainline?  [I might be missing something]


2023-07-14  Roger Sayle  

gcc/ChangeLog
* tree-if-conv.cc (predicate_scalar_phi): Make the arguments
to the std::sort comparison lambda function const.


Cheers,
Roger
--

diff --git a/gcc/tree-if-conv.cc b/gcc/tree-if-conv.cc
index 91e2eff..799f071 100644
--- a/gcc/tree-if-conv.cc
+++ b/gcc/tree-if-conv.cc
@@ -2204,7 +2204,8 @@ predicate_scalar_phi (gphi *phi, gimple_stmt_iterator 
*gsi)
 }
 
   /* Sort elements based on rankings ARGS.  */
-  std::sort(argsKV.begin(), argsKV.end(), [](ArgEntry &left, ArgEntry &right) {
+  std::sort(argsKV.begin(), argsKV.end(), [](const ArgEntry &left,
+const ArgEntry &right) {
 return left.second < right.second;
   });
 


RE: [x86 PATCH] Fix FAIL of gcc.target/i386/pr91681-1.c

2023-07-17 Thread Roger Sayle


> From: Jiang, Haochen 
> Sent: 17 July 2023 02:50
> 
> > From: Jiang, Haochen
> > Sent: Friday, July 14, 2023 10:50 AM
> >
> > > The recent change in TImode parameter passing on x86_64 results in
> > > the FAIL of pr91681-1.c.  The issue is that with the extra
> > > flexibility, the combine pass is now spoilt for choice between using
> > > either the *add3_doubleword_concat or the
> > > *add3_doubleword_zext patterns, when one operand is a *concat and
> the other is a zero_extend.
> > > The solution proposed below is provide an
> > > *add3_doubleword_concat_zext define_insn_and_split, that can
> > > benefit both from the register allocation of *concat, and still
> > > avoid the xor normally required by zero extension.
> > >
> > > I'm investigating a follow-up refinement to improve register
> > > allocation further by avoiding the early clobber in the =&r, and
> > > handling (custom) reloads explicitly, but this piece resolves the
> > > testcase
> > failure.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > bootstrap and make -k check, both with and without
> > > --target_board=unix{-m32} with no new failures.  Ok for mainline?
> > >
> > >
> > > 2023-07-11  Roger Sayle  
> > >
> > > gcc/ChangeLog
> > > PR target/91681
> > > * config/i386/i386.md (*add3_doubleword_concat_zext): New
> > > define_insn_and_split derived from
*add3_doubleword_concat
> > > and *add3_doubleword_zext.
> >
> > Hi Roger,
> >
> > This commit currently changed the codegen of testcase p443644-2.c from:
> 
> Oops, a typo, I mean pr43644-2.c.
> 
> Haochen

I'm working on a fix and hope to have this resolved soon (unfortunately
fixing things in a post-reload splitter isn't working out due to reload's
choices, so the solution will likely be a peephole2).

The problem is that pr91681-1.c and pr43644-2.c can't both PASS (as written)!
The operation x = y + 0 can be generated as either "mov y,x; add $0,x" or as
"xor x,x; add y,x".  pr91681-1.c checks there isn't an xor, pr43644-2.c checks
there isn't a mov.  Doh!  As the author of both these test cases, I've painted
myself into a corner.

The solution is that add $0,x should be generated (optimal) when y is already
in x, and "xor x,x; add y,x" used otherwise (as this is shorter than
"mov y,x; add $0,x", both sequences being approximately equal
performance-wise).

> > movq%rdx, %rax
> > xorl%edx, %edx
> > addq%rdi, %rax
> > adcq%rsi, %rdx
> > to:
> > movq%rdx, %rcx
> > movq%rdi, %rax
> > movq%rsi, %rdx
> > addq%rcx, %rax
> > adcq$0, %rdx
> >
> > which causes the testcase fail under -m64.
> > Is this within your expectation?

You're right that the original (using xor) is better for pr43644-2.c's test case:
unsigned __int128 foo(unsigned __int128 x, unsigned long long y) { return x+y; }
but the closely related (swapping the argument order):
unsigned __int128 bar(unsigned long long y, unsigned __int128 x) { return x+y; }
is better using "adcq $0" than having a superfluous xor.

Executive summary: This FAIL isn't serious.  I'll silence it soon.

> > BRs,
> > Haochen
> >
> > >
> > >
> > > Thanks,
> > > Roger
> > > --




[x86_64 PATCH] More TImode parameter passing improvements.

2023-07-19 Thread Roger Sayle

This patch is the next piece of a solution to the x86_64 ABI issues in
PR 88873.  This splits the *concat3_3 define_insn_and_split
into two patterns, a TARGET_64BIT *concatditi3_3 and a !TARGET_64BIT
*concatsidi3_3.  This allows us to add an additional alternative to the
the 64-bit version, enabling the register allocator to perform this
operation using SSE registers, which is implemented/split after reload
using vec_concatv2di.

To demonstrate the improvement, the test case from PR88873:

typedef struct { double x, y; } s_t;

s_t foo (s_t a, s_t b, s_t c)
{
  return (s_t){ __builtin_fma(a.x, b.x, c.x), __builtin_fma (a.y, b.y, c.y)
};
}

when compiled with -O2 -march=cascadelake, currently generates:

foo:vmovq   %xmm2, -56(%rsp)
movq-56(%rsp), %rax
vmovq   %xmm3, -48(%rsp)
vmovq   %xmm4, -40(%rsp)
movq-48(%rsp), %rcx
vmovq   %xmm5, -32(%rsp)
vmovq   %rax, %xmm6
movq-40(%rsp), %rax
movq-32(%rsp), %rsi
vpinsrq $1, %rcx, %xmm6, %xmm6
vmovq   %xmm0, -24(%rsp)
vmovq   %rax, %xmm7
vmovq   %xmm1, -16(%rsp)
vmovapd %xmm6, %xmm2
vpinsrq $1, %rsi, %xmm7, %xmm7
vfmadd132pd -24(%rsp), %xmm7, %xmm2
vmovapd %xmm2, -56(%rsp)
vmovsd  -48(%rsp), %xmm1
vmovsd  -56(%rsp), %xmm0
ret

with this change, we avoid many of the reloads via memory,

foo:vpunpcklqdq %xmm3, %xmm2, %xmm7
vpunpcklqdq %xmm1, %xmm0, %xmm6
vpunpcklqdq %xmm5, %xmm4, %xmm2
vmovdqa %xmm7, -24(%rsp)
vmovdqa %xmm6, %xmm1
movq-16(%rsp), %rax
vpinsrq $1, %rax, %xmm7, %xmm4
vmovapd %xmm4, %xmm6
vfmadd132pd %xmm1, %xmm2, %xmm6
vmovapd %xmm6, -24(%rsp)
vmovsd  -16(%rsp), %xmm1
vmovsd  -24(%rsp), %xmm0
ret


This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-07-19  Roger Sayle  

gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_move): Don't call
force_reg, to use SUBREG rather than create a new pseudo when
inserting DFmode fields into TImode with insvti_{high,low}part.
(*concat3_3): Split into two define_insn_and_split...
(*concatditi3_3): 64-bit implementation.  Provide alternative
that allows register allocation to use SSE registers that is
split into vec_concatv2di after reload.
(*concatsidi3_3): 32-bit implementation.

gcc/testsuite/ChangeLog
* gcc.target/i386/pr88873.c: New test case.


Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index f9b0dc6..9c3febe 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -558,7 +558,7 @@ ix86_expand_move (machine_mode mode, rtx operands[])
  op0 = SUBREG_REG (op0);
  tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp);
  if (mode == DFmode)
-   op1 = force_reg (DImode, gen_lowpart (DImode, op1));
+   op1 = gen_lowpart (DImode, op1);
  op1 = gen_rtx_ZERO_EXTEND (TImode, op1);
  op1 = gen_rtx_IOR (TImode, tmp, op1);
}
@@ -570,7 +570,7 @@ ix86_expand_move (machine_mode mode, rtx operands[])
  op0 = SUBREG_REG (op0);
  tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp);
  if (mode == DFmode)
-   op1 = force_reg (DImode, gen_lowpart (DImode, op1));
+   op1 = gen_lowpart (DImode, op1);
  op1 = gen_rtx_ZERO_EXTEND (TImode, op1);
  op1 = gen_rtx_ASHIFT (TImode, op1, GEN_INT (64));
  op1 = gen_rtx_IOR (TImode, tmp, op1);
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 47ea050..8c54aa5 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -12408,21 +12408,47 @@
   DONE;
 })
 
-(define_insn_and_split "*concat3_3"
-  [(set (match_operand: 0 "nonimmediate_operand" "=ro,r,r,&r")
-   (any_or_plus:
- (ashift:
-   (zero_extend:
- (match_operand:DWIH 1 "nonimmediate_operand" "r,m,r,m"))
+(define_insn_and_split "*concatditi3_3"
+  [(set (match_operand:TI 0 "nonimmediate_operand" "=ro,r,r,&r,x")
+   (any_or_plus:TI
+ (ashift:TI
+   (zero_extend:TI
+ (match_operand:DI 1 "nonimmediate_operand" "r,m,r,m,x"))
(match_operand:QI 2 "const_int_operand"))
- (zero_extend:
-   (match_operand:DWIH 3 "nonimmediate_operand" "r,r,m,m"]
-  "INTVAL (operands[2]) ==  * BITS_PER_UNIT"
+ (zero_extend:TI
+   (match_operand:DI 3 "nonimmediate_operand" "r,r,m,m,0"]
+  "TARGET_64BIT
+   && INTVAL (operands[2]) == 64"
+  &q

[PATCH] PR c/110699: Defend against error_mark_node in gimplify.cc.

2023-07-19 Thread Roger Sayle

This patch resolves PR c/110699, an ICE-after-error regression, by adding
a check that the array type isn't error_mark_node in gimplify_compound_lval.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-07-19  Roger Sayle  

gcc/ChangeLog
PR c/110699
* gimplify.cc (gimplify_compound_lval):  For ARRAY_REF and
ARRAY_RANGE_REF return GS_ERROR if the array's type is
error_mark_node.

gcc/testsuite/ChangeLog
PR c/110699
* gcc.dg/pr110699.c: New test case.


Cheers,
Roger
--

diff --git a/gcc/gimplify.cc b/gcc/gimplify.cc
index 36e5df0..4f40b24 100644
--- a/gcc/gimplify.cc
+++ b/gcc/gimplify.cc
@@ -3211,6 +3211,9 @@ gimplify_compound_lval (tree *expr_p, gimple_seq *pre_p, 
gimple_seq *post_p,
 
   if (TREE_CODE (t) == ARRAY_REF || TREE_CODE (t) == ARRAY_RANGE_REF)
{
+ if (TREE_TYPE (TREE_OPERAND (t, 0)) == error_mark_node)
+   return GS_ERROR;
+
  /* Deal with the low bound and element type size and put them into
 the ARRAY_REF.  If these values are set, they have already been
 gimplified.  */
diff --git a/gcc/testsuite/gcc.dg/pr110699.c b/gcc/testsuite/gcc.dg/pr110699.c
new file mode 100644
index 000..be77613
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110699.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+typedef __attribute__((__vector_size__(64))) int T;
+
+void f(void) {
+  extern char a[64], b[64];  /* { dg-message "previous" "note" } */
+  void *p = a;
+  T q = *(T *)&b[0];
+}
+
+void g() {
+  extern char b;  /* { dg-error "conflicting types" } */
+}


RE: [x86_64 PATCH] More TImode parameter passing improvements.

2023-07-20 Thread Roger Sayle


Hi Uros,

> From: Uros Bizjak 
> Sent: 20 July 2023 07:50
> 
> On Wed, Jul 19, 2023 at 10:07 PM Roger Sayle 
> wrote:
> >
> > This patch is the next piece of a solution to the x86_64 ABI issues in
> > PR 88873.  This splits the *concat3_3 define_insn_and_split
> > into two patterns, a TARGET_64BIT *concatditi3_3 and a !TARGET_64BIT
> > *concatsidi3_3.  This allows us to add an additional alternative to
> > the the 64-bit version, enabling the register allocator to perform
> > this operation using SSE registers, which is implemented/split after
> > reload using vec_concatv2di.
> >
> > To demonstrate the improvement, the test case from PR88873:
> >
> > typedef struct { double x, y; } s_t;
> >
> > s_t foo (s_t a, s_t b, s_t c)
> > {
> >   return (s_t){ __builtin_fma(a.x, b.x, c.x), __builtin_fma (a.y, b.y,
> > c.y) }; }
> >
> > when compiled with -O2 -march=cascadelake, currently generates:
> >
> > foo:vmovq   %xmm2, -56(%rsp)
> > movq-56(%rsp), %rax
> > vmovq   %xmm3, -48(%rsp)
> > vmovq   %xmm4, -40(%rsp)
> > movq-48(%rsp), %rcx
> > vmovq   %xmm5, -32(%rsp)
> > vmovq   %rax, %xmm6
> > movq-40(%rsp), %rax
> > movq-32(%rsp), %rsi
> > vpinsrq $1, %rcx, %xmm6, %xmm6
> > vmovq   %xmm0, -24(%rsp)
> > vmovq   %rax, %xmm7
> > vmovq   %xmm1, -16(%rsp)
> > vmovapd %xmm6, %xmm2
> > vpinsrq $1, %rsi, %xmm7, %xmm7
> > vfmadd132pd -24(%rsp), %xmm7, %xmm2
> > vmovapd %xmm2, -56(%rsp)
> > vmovsd  -48(%rsp), %xmm1
> > vmovsd  -56(%rsp), %xmm0
> > ret
> >
> > with this change, we avoid many of the reloads via memory,
> >
> > foo:vpunpcklqdq %xmm3, %xmm2, %xmm7
> > vpunpcklqdq %xmm1, %xmm0, %xmm6
> > vpunpcklqdq %xmm5, %xmm4, %xmm2
> > vmovdqa %xmm7, -24(%rsp)
> > vmovdqa %xmm6, %xmm1
> > movq-16(%rsp), %rax
> > vpinsrq $1, %rax, %xmm7, %xmm4
> > vmovapd %xmm4, %xmm6
> > vfmadd132pd %xmm1, %xmm2, %xmm6
> > vmovapd %xmm6, -24(%rsp)
> > vmovsd  -16(%rsp), %xmm1
> > vmovsd  -24(%rsp), %xmm0
> > ret
> >
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures.  Ok for mainline?
> >
> >
> > 2023-07-19  Roger Sayle  
> >
> > gcc/ChangeLog
> > * config/i386/i386-expand.cc (ix86_expand_move): Don't call
> > force_reg, to use SUBREG rather than create a new pseudo when
> > inserting DFmode fields into TImode with insvti_{high,low}part.
> > (*concat3_3): Split into two define_insn_and_split...
> > (*concatditi3_3): 64-bit implementation.  Provide alternative
> > that allows register allocation to use SSE registers that is
> > split into vec_concatv2di after reload.
> > (*concatsidi3_3): 32-bit implementation.
> >
> > gcc/testsuite/ChangeLog
> > * gcc.target/i386/pr88873.c: New test case.
> 
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index f9b0dc6..9c3febe 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -558,7 +558,7 @@ ix86_expand_move (machine_mode mode, rtx
> operands[])
>op0 = SUBREG_REG (op0);
>tmp = gen_rtx_AND (TImode, copy_rtx (op0), tmp);
>if (mode == DFmode)
> -op1 = force_reg (DImode, gen_lowpart (DImode, op1));
> +op1 = gen_lowpart (DImode, op1);
> 
> Please note that gen_lowpart will ICE when op1 is a SUBREG. This is the reason
> that we need to first force a SUBREG to a register and then perform 
> gen_lowpart,
> and it is necessary to avoid ICE.

The good news is that we know op1 is a register, as this is tested by
"&& REG_P (op1)" on line 551.  You'll also notice that I'm not removing
the force_reg from before the call to gen_lowpart, but removing the call
to force_reg after the call to gen_lowpart.  When I originally wrote this,
the hope was that placing this SUBREG in its own pseudo would help
with register allocation/CSE.  Unfortunately, increasing the number of
pseudos (in this case) increases compile-time (due to quadratic behaviour
in LRA), as shown by PR rtl-optimization/110587, and keeping the DF->DI
conversion in a SUBREG inside the insvti_{high,low}part allows the
register a

[x86 PATCH] Don't use insvti_{high, low}part with -O0 (for compile-time).

2023-07-22 Thread Roger Sayle

This patch attempts to help with PR rtl-optimization/110587, a regression
of -O0 compile time for the pathological pr28071.c.  My recent patch helps
a bit, but hasn't returned -O0 compile-time to where it was before my
ix86_expand_move changes.  The obvious solution/workaround is to guard
these new TImode parameter passing optimizations with "&& optimize", so
they don't trigger when compiling with -O0.  The very minor complication
is that "&& optimize" alone leads to the regression of pr110533.c, where
our improved TImode parameter passing fixes a wrong-code issue with naked
functions, importantly, when compiling with -O0.  This should explain
the one line fix below "&& (optimize || ix86_function_naked (cfun))".

I've an additional fix/tweak or two for this compile-time issue, but
this change eliminates the part of the regression that I've caused.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?

2023-07-22  Roger Sayle  

gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_move): Disable the
64-bit insertions into TImode optimizations with -O0, unless
the function has the "naked" attribute (for PR target/110533).

Cheers,
Roger
--

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 7e94447..cdef95e 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -544,6 +544,7 @@ ix86_expand_move (machine_mode mode, rtx operands[])
 
   /* Special case inserting 64-bit values into a TImode register.  */
   if (TARGET_64BIT
+  && (optimize || ix86_function_naked (current_function_decl))
   && (mode == DImode || mode == DFmode)
   && SUBREG_P (op0)
   && GET_MODE (SUBREG_REG (op0)) == TImode


[x86 PATCH] Use QImode for offsets in zero_extract/sign_extract in i386.md

2023-07-22 Thread Roger Sayle

As suggested by Uros, this patch changes the ZERO_EXTRACTs and SIGN_EXTRACTs
in i386.md to consistently use QImode for bit offsets (i.e. third and fourth
operands), matching the use of QImode for bit counts in shifts and rotates.

There's no change in functionality, and the new patterns simply ensure that
we continue to generate the same code (match revised patterns) as before.
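
To make the shape of these patterns concrete, here is a rough sketch (my
own illustration, not from the patch or the testsuite) of the kind of
source that exercises the extv/extzv/insv expanders; the length and bit
position of the extract/insert are the operands that now use QImode:

/* Sketch only: bit-field reads may expand through extv/extzv and
   bit-field writes through insv; the 9-bit width and bit position 4
   of "mid" become the const_int length/position operands of those
   expanders.  */
struct bits
{
  unsigned int lo : 4;
  unsigned int mid : 9;
  unsigned int hi : 19;
};

unsigned int get_mid (struct bits b) { return b.mid; }
void set_mid (struct bits *b, unsigned int v) { b->mid = v; }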

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2023-07-22  Roger Sayle  

gcc/ChangeLog
* config/i386/i386.md (extv): Use QImode for offsets.
(extzv): Likewise.
(insv): Likewise.
(*testqi_ext_3): Likewise.
(*btr_2): Likewise.
(define_split): Likewise.
(*btsq_imm): Likewise.
(*btrq_imm): Likewise.
(*btcq_imm): Likewise.
(define_peephole2 x3): Likewise.
(*bt): Likewise
(*bt_mask): New define_insn_and_split.
(*jcc_bt): Use QImode for offsets.
(*jcc_bt_1): Delete obsolete pattern.
(*jcc_bt_mask): Use QImode offsets.
(*jcc_bt_mask_1): Likewise.
(define_split): Likewise.
(*bt_setcqi): Likewise.
(*bt_setncqi): Likewise.
(*bt_setnc): Likewise.
(*bt_setncqi_2): Likewise.
(*bt_setc_mask): New define_insn_and_split.
(bmi2_bzhi_3): Use QImode offsets.
(*bmi2_bzhi_3): Likewise.
(*bmi2_bzhi_3_1): Likewise.
(*bmi2_bzhi_3_1_ccz): Likewise.
(@tbm_bextri_): Likewise.


Thanks,
Roger
--

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 47ea050..de8c3a5 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -3312,8 +3312,8 @@
 (define_expand "extv"
   [(set (match_operand:SWI24 0 "register_operand")
(sign_extract:SWI24 (match_operand:SWI24 1 "register_operand")
-   (match_operand:SI 2 "const_int_operand")
-   (match_operand:SI 3 "const_int_operand")))]
+   (match_operand:QI 2 "const_int_operand")
+   (match_operand:QI 3 "const_int_operand")))]
   ""
 {
   /* Handle extractions from %ah et al.  */
@@ -3340,8 +3340,8 @@
 (define_expand "extzv"
   [(set (match_operand:SWI248 0 "register_operand")
(zero_extract:SWI248 (match_operand:SWI248 1 "register_operand")
-(match_operand:SI 2 "const_int_operand")
-(match_operand:SI 3 "const_int_operand")))]
+(match_operand:QI 2 "const_int_operand")
+(match_operand:QI 3 "const_int_operand")))]
   ""
 {
   if (ix86_expand_pextr (operands))
@@ -3428,8 +3428,8 @@
 
 (define_expand "insv"
   [(set (zero_extract:SWI248 (match_operand:SWI248 0 "register_operand")
-(match_operand:SI 1 "const_int_operand")
-(match_operand:SI 2 "const_int_operand"))
+(match_operand:QI 1 "const_int_operand")
+(match_operand:QI 2 "const_int_operand"))
 (match_operand:SWI248 3 "register_operand"))]
   ""
 {
@@ -10788,8 +10788,8 @@
 (match_operator 1 "compare_operator"
  [(zero_extract:SWI248
 (match_operand 2 "int_nonimmediate_operand" "rm")
-(match_operand 3 "const_int_operand")
-(match_operand 4 "const_int_operand"))
+(match_operand:QI 3 "const_int_operand")
+(match_operand:QI 4 "const_int_operand"))
   (const_int 0)]))]
   "/* Ensure that resulting mask is zero or sign extended operand.  */
INTVAL (operands[4]) >= 0
@@ -15904,7 +15904,7 @@
   [(set (zero_extract:HI
  (match_operand:SWI12 0 "nonimmediate_operand")
  (const_int 1)
- (zero_extend:SI (match_operand:QI 1 "register_operand")))
+ (match_operand:QI 1 "register_operand"))
(const_int 0))
(clobber (reg:CC FLAGS_REG))]
   "TARGET_USE_BT && ix86_pre_reload_split ()"
@@ -15928,7 +15928,7 @@
   [(set (zero_extract:HI
  (match_operand:SWI12 0 "register_operand")
  (const_int 1)
- (zero_extend:SI (match_operand:QI 1 "register_operand")))
+ (match_operand:QI 1 "register_operand"))
(const_int 0))
(clobber (reg:CC FLAGS_REG))]
   "TARGET_USE_BT && ix86_pre_reload_split ()"
@@ -15955,7 +15955,7 @@
 (define_insn "*btsq_imm"
   [(set (zero_extract:DI (match_operand:DI 0 "nonimmedia

[PATCH] Replace lra-spill.cc's return_regno_p with return_reg_p.

2023-07-22 Thread Roger Sayle

This patch is my attempt to address the compile-time hog issue
in PR rtl-optimization/110587.  Richard Biener's analysis shows that
compilation of pr28071.c with -O0 currently spends ~70% in timer
"LRA non-specific" due to return_regno_p failing to filter a large
number of calls to regno_in_use_p, resulting in quadratic behaviour.

For this pathological test case, things can be improved significantly.
Although the return register (%rax) is indeed mentioned a large
number of times in this function, due to inlining, the inlined functions
access the returned register in TImode, whereas the current function
returns a DImode.  Hence the check to see if we're the last SET of the
return register, which should be followed by a USE, can be improved
by also testing the mode.  Implementation-wise, rather than pass an
additional mode parameter to LRA's local return_regno_p function, which
only has a single caller, it's more convenient to pass the rtx (known to
satisfy REG_P), extract both the REGNO and the mode from it in the callee,
and rename this function to return_reg_p.

The good news is that with this change "LRA non-specific" drops from
70% to 13%.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, with no new failures.  Ok for mainline?


2023-07-22  Roger Sayle  

gcc/ChangeLog
PR middle-end/28071
PR rtl-optimization/110587
* lra-spills.cc (return_regno_p): Change argument and rename to...
(return_reg_p): Check if the given register RTX has the same
REGNO and machine mode as the function's return value.
(lra_final_code_change): Update call to return_reg_p.


Thanks in advance,
Roger
--

diff --git a/gcc/lra-spills.cc b/gcc/lra-spills.cc
index 3a7bb7e..ae147ad 100644
--- a/gcc/lra-spills.cc
+++ b/gcc/lra-spills.cc
@@ -705,10 +705,10 @@ alter_subregs (rtx *loc, bool final_p)
   return res;
 }
 
-/* Return true if REGNO is used for return in the current
-   function.  */
+/* Return true if register REG, known to be REG_P, is used for return
+   in the current function.  */
 static bool
-return_regno_p (unsigned int regno)
+return_reg_p (rtx reg)
 {
   rtx outgoing = crtl->return_rtx;
 
@@ -716,7 +716,8 @@ return_regno_p (unsigned int regno)
 return false;
 
   if (REG_P (outgoing))
-return REGNO (outgoing) == regno;
+return REGNO (outgoing) == REGNO (reg)
+  && GET_MODE (outgoing) == GET_MODE (reg);
   else if (GET_CODE (outgoing) == PARALLEL)
 {
   int i;
@@ -725,7 +726,9 @@ return_regno_p (unsigned int regno)
{
  rtx x = XEXP (XVECEXP (outgoing, 0, i), 0);
 
- if (REG_P (x) && REGNO (x) == regno)
+ if (REG_P (x)
+ && REGNO (x) == REGNO (reg)
+ && GET_MODE (x) == GET_MODE (reg))
return true;
}
 }
@@ -821,7 +824,7 @@ lra_final_code_change (void)
  if (NONJUMP_INSN_P (insn) && GET_CODE (pat) == SET
  && REG_P (SET_SRC (pat)) && REG_P (SET_DEST (pat))
  && REGNO (SET_SRC (pat)) == REGNO (SET_DEST (pat))
- && (! return_regno_p (REGNO (SET_SRC (pat)))
+ && (! return_reg_p (SET_SRC (pat))
  || ! regno_in_use_p (insn, REGNO (SET_SRC (pat)
{
  lra_invalidate_insn_data (insn);


[Committed] PR target/110787: Revert QImode offsets in {zero, sign}_extract.

2023-07-24 Thread Roger Sayle
 

My recent patch to use QImode for bit offsets in ZERO_EXTRACTs and
SIGN_EXTRACTs in the i386 backend shouldn't have resulted in any change
in behaviour, but as reported by Rainer it produces a bootstrap failure in
gm2.  This reverts the problematic patch whilst we investigate the
underlying cause.

Committed as obvious.


2023-07-23  Roger Sayle  

gcc/ChangeLog
PR target/110787
PR target/110790
Revert patch.
* config/i386/i386.md (extv): Use QImode for offsets.
(extzv): Likewise.
(insv): Likewise.
(*testqi_ext_3): Likewise.
(*btr_2): Likewise.
(define_split): Likewise.
(*btsq_imm): Likewise.
(*btrq_imm): Likewise.
(*btcq_imm): Likewise.
(define_peephole2 x3): Likewise.
(*bt): Likewise
(*bt_mask): New define_insn_and_split.
(*jcc_bt): Use QImode for offsets.
(*jcc_bt_1): Delete obsolete pattern.
(*jcc_bt_mask): Use QImode offsets.
(*jcc_bt_mask_1): Likewise.
(define_split): Likewise.
(*bt_setcqi): Likewise.
(*bt_setncqi): Likewise.
(*bt_setnc): Likewise.
(*bt_setncqi_2): Likewise.
(*bt_setc_mask): New define_insn_and_split.
(bmi2_bzhi_3): Use QImode offsets.
(*bmi2_bzhi_3): Likewise.
(*bmi2_bzhi_3_1): Likewise.
(*bmi2_bzhi_3_1_ccz): Likewise.
(@tbm_bextri_): Likewise.


Sorry for the inconvenience,
Roger
--

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 2ce8e958565..4db210cc795 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -3312,8 +3312,8 @@
 (define_expand "extv"
   [(set (match_operand:SWI24 0 "register_operand")
(sign_extract:SWI24 (match_operand:SWI24 1 "register_operand")
-   (match_operand:QI 2 "const_int_operand")
-   (match_operand:QI 3 "const_int_operand")))]
+   (match_operand:SI 2 "const_int_operand")
+   (match_operand:SI 3 "const_int_operand")))]
   ""
 {
   /* Handle extractions from %ah et al.  */
@@ -3340,8 +3340,8 @@
 (define_expand "extzv"
   [(set (match_operand:SWI248 0 "register_operand")
(zero_extract:SWI248 (match_operand:SWI248 1 "register_operand")
-(match_operand:QI 2 "const_int_operand")
-(match_operand:QI 3 "const_int_operand")))]
+(match_operand:SI 2 "const_int_operand")
+(match_operand:SI 3 "const_int_operand")))]
   ""
 {
   if (ix86_expand_pextr (operands))
@@ -3428,8 +3428,8 @@
 
 (define_expand "insv"
   [(set (zero_extract:SWI248 (match_operand:SWI248 0 "register_operand")
-(match_operand:QI 1 "const_int_operand")
-(match_operand:QI 2 "const_int_operand"))
+(match_operand:SI 1 "const_int_operand")
+(match_operand:SI 2 "const_int_operand"))
 (match_operand:SWI248 3 "register_operand"))]
   ""
 {
@@ -10788,8 +10788,8 @@
 (match_operator 1 "compare_operator"
  [(zero_extract:SWI248
 (match_operand 2 "int_nonimmediate_operand" "rm")
-(match_operand:QI 3 "const_int_operand")
-(match_operand:QI 4 "const_int_operand"))
+(match_operand 3 "const_int_operand")
+(match_operand 4 "const_int_operand"))
   (const_int 0)]))]
   "/* Ensure that resulting mask is zero or sign extended operand.  */
INTVAL (operands[4]) >= 0
@@ -15965,7 +15965,7 @@
   [(set (zero_extract:HI
  (match_operand:SWI12 0 "nonimmediate_operand")
  (const_int 1)
- (match_operand:QI 1 "register_operand"))
+ (zero_extend:SI (match_operand:QI 1 "register_operand")))
(const_int 0))
(clobber (reg:CC FLAGS_REG))]
   "TARGET_USE_BT && ix86_pre_reload_split ()"
@@ -15989,7 +15989,7 @@
   [(set (zero_extract:HI
  (match_operand:SWI12 0 "register_operand")
  (const_int 1)
- (match_operand:QI 1 "register_operand"))
+ (zero_extend:SI (match_operand:QI 1 "register_operand")))
(const_int 0))
(clobber (reg:CC FLAGS_REG))]
   "TARGET_USE_BT && ix86_pre_reload_split ()"
@@ -16016,7 +16016,7 @@
 (define_insn "*btsq_imm"
   [(set (zero_extract:DI (match_operand:DI 0 "nonimmediate_operand" "+rm")
 (const_int 1)
-

[PATCH] PR rtl-optimization/110587: Reduce useless moves in compile-time hog.

2023-07-25 Thread Roger Sayle

This patch is the third in series of fixes for PR rtl-optimization/110587,
a compile-time regression with -O0, that attempts to address the underlying
cause.  As noted previously, the pathological test case pr28071.c contains
a large number of useless register-to-register moves that can produce
quadratic behaviour (in LRA).  These moves are generated during RTL
expansion in emit_group_load_1, where the middle-end attempts to simplify
the source before calling extract_bit_field.  This is reasonable if the
source is a complex expression (from before the tree-ssa optimizers), or
a SUBREG, or a hard register, but it's not particularly useful to copy
a pseudo register into a new pseudo register.  This patch eliminates that
redundancy.

The -fdump-tree-expand for pr28071.c compiled with -O0 currently contains
777K lines, with this patch it contains 717K lines, i.e. saving about 60K
lines (admittedly of debugging text output, but it makes the point).


This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?

As always, I'm happy to revert this change quickly if there's a problem,
and investigate why this additional copy might (still) be needed on other
non-x86 targets.


2023-07-25  Roger Sayle  

gcc/ChangeLog
PR middle-end/28071
PR rtl-optimization/110587
* expr.cc (emit_group_load_1): Avoid copying a pseudo register into
a new pseudo register, i.e. only copy hard regs into a new pseudo.


Thanks in advance,
Roger
--

diff --git a/gcc/expr.cc b/gcc/expr.cc
index fff09dc..11d041b 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -2622,6 +2622,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, tree 
type,
 be loaded directly into the destination.  */
   src = orig_src;
   if (!MEM_P (orig_src)
+ && (!REG_P (orig_src) || HARD_REGISTER_P (orig_src))
  && (!CONSTANT_P (orig_src)
  || (GET_MODE (orig_src) != mode
  && GET_MODE (orig_src) != VOIDmode)))


[PATCH] PR rtl-optimization/110701: Fix SUBREG SET_DEST handling in combine.

2023-07-26 Thread Roger Sayle

This patch is my proposed fix to PR rtl-optimization 110701, a latent bug
in combine's record_dead_and_set_regs_1 exposed by recent improvements to
simplify_subreg.

The issue involves the handling of (normal) SUBREG SET_DESTs as in the
instruction:

(set (subreg:HI (reg:SI x) 0) (expr:HI y))

The semantics of this are that the bits specified by the SUBREG are set
to the SET_SRC, y, and that the other bits of the SET_DEST are left/become
undefined.  To simplify explanation, we'll only consider lowpart SUBREGs
(though in theory non-lowpart SUBREGS could be handled), and the fact that
bits outside of the lowpart WORD retain their original values (treating
these as undefined is a missed optimization rather than an incorrect-code
bug, and only affects targets with less than 64-bit words).

The bug is that combine simulates the behaviour of the above instruction,
for calculating nonzero_bits and set_sign_bit_copies, in the function
record_value_for_reg, by using the equivalent of:

(set (reg:SI x) (subreg:SI (expr:HI y) 0))

by calling gen_lowpart on the SET_SRC.  Alas, the semantics of this
revised instruction aren't always equivalent to the original.

In the test case for PR110701, the original instruction

(set (subreg:HI (reg:SI x) 0)
 (and:HI (subreg:HI (reg:SI y) 0)
 (const_int 340)))

which (by definition) leaves the top bits of x undefined, is mistakenly
considered to be equivalent to

(set (reg:SI x) (and:SI (reg:SI y) (const_int 340)))

where gen_lowpart's freedom to do anything with paradoxical SUBREG bits,
has now cleared the high bits.  The same bug also triggers when the
SET_SRC is say (subreg:HI (reg:DI z)), where gen_lowpart transforms
this into (subreg:SI (reg:DI z)) which defines bits 16-31 to be the
same as bits 16-31 of z.

The fix is that after calling record_value_for_reg, we need to mark
the bits that should be undefined as undefined, in case gen_lowpart,
which performs transforms appropriate for r-values, has changed the
interpretation of the SUBREG when used as an l-value.


This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?

I've a version of this patch that preserves the original bits outside
of the lowpart WORD that can be submitted as a follow-up, but this is
the piece that addresses the wrong code regression.


2023-07-26  Roger Sayle  

gcc/ChangeLog
PR rtl-optimization/110701
* combine.cc (record_dead_and_set_regs_1): Split comment into
pieces placed before the relevant clauses.  When the SET_DEST
is a partial_subreg_p, mark the bits outside of the updated
portion of the destination as undefined.

gcc/testsuite/ChangeLog
PR rtl-optimization/110701
* gcc.target/i386/pr110701.c: New test case.


Thanks in advance,
Roger
--

diff --git a/gcc/combine.cc b/gcc/combine.cc
index 4bf867d..c5ebb78 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -13337,27 +13337,43 @@ record_dead_and_set_regs_1 (rtx dest, const_rtx 
setter, void *data)
 
   if (REG_P (dest))
 {
-  /* If we are setting the whole register, we know its value.  Otherwise
-show that we don't know the value.  We can handle a SUBREG if it's
-the low part, but we must be careful with paradoxical SUBREGs on
-RISC architectures because we cannot strip e.g. an extension around
-a load and record the naked load since the RTL middle-end considers
-that the upper bits are defined according to LOAD_EXTEND_OP.  */
+  /* If we are setting the whole register, we know its value.  */
   if (GET_CODE (setter) == SET && dest == SET_DEST (setter))
record_value_for_reg (dest, record_dead_insn, SET_SRC (setter));
+  /* We can handle a SUBREG if it's the low part, but we must be
+careful with paradoxical SUBREGs on RISC architectures because
+we cannot strip e.g. an extension around a load and record the
+naked load since the RTL middle-end considers that the upper bits
+are defined according to LOAD_EXTEND_OP.  */
   else if (GET_CODE (setter) == SET
   && GET_CODE (SET_DEST (setter)) == SUBREG
   && SUBREG_REG (SET_DEST (setter)) == dest
   && known_le (GET_MODE_PRECISION (GET_MODE (dest)),
BITS_PER_WORD)
   && subreg_lowpart_p (SET_DEST (setter)))
-   record_value_for_reg (dest, record_dead_insn,
- WORD_REGISTER_OPERATIONS
- && word_register_operation_p (SET_SRC (setter))
- && paradoxical_subreg_p (SET_DEST (setter))
- ? SET_SRC (setter)
- : gen_lowpart (GET_MODE (dest),
-

RE: [PATCH] PR rtl-optimization/110587: Reduce useless moves in compile-time hog.

2023-07-27 Thread Roger Sayle

Hi Richard,

You're 100% right.  It’s possible to significantly clean up this code,
replacing the body of the conditional with a call to force_reg and
simplifying the conditions under which it is called.  These improvements are
implemented in the patch below, which has been tested on x86_64-pc-linux-gnu,
with a bootstrap and make -k check, both with and without -m32, as usual.

Interestingly, the CONCAT clause afterwards is still required (I've learned
something new), as calling force_reg (or gen_reg_rtx) with HCmode actually
returns a CONCAT instead of a REG, so although the code looks dead, it's
required to build libgcc during a bootstrap.  But the remaining clean-up is
good, reducing the number of source lines and making the logic easier to
understand.
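
For the record, a minimal sketch (my own, not the libgcc source involved)
of a value whose mode is HCmode:

/* Sketch only; assumes a target with _Float16 support (e.g. x86-64
   with SSE2).  A complex _Float16 value has mode HCmode, and during
   expansion gen_reg_rtx/force_reg for a complex mode can return a
   CONCAT of two pseudos rather than a single REG, so the CONCAT
   clause in emit_group_load_1 remains reachable.  */
_Complex _Float16
hc_pass_through (_Complex _Float16 x)
{
  return x;
}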

Ok for mainline?

2023-07-27  Roger Sayle  
Richard Biener  

gcc/ChangeLog
PR middle-end/28071
PR rtl-optimization/110587
* expr.cc (emit_group_load_1): Simplify logic for calling
force_reg on ORIG_SRC, to avoid making a copy if the source
is already in a pseudo register.

Roger
--

> -Original Message-
> From: Richard Biener 
> Sent: 25 July 2023 12:50
> 
> On Tue, Jul 25, 2023 at 1:31 PM Roger Sayle 
> wrote:
> >
> > This patch is the third in series of fixes for PR
> > rtl-optimization/110587, a compile-time regression with -O0, that
> > attempts to address the underlying cause.  As noted previously, the
> > pathological test case pr28071.c contains a large number of useless
> > register-to-register moves that can produce quadratic behaviour (in
> > LRA).  These move are generated during RTL expansion in
> > emit_group_load_1, where the middle-end attempts to simplify the
> > source before calling extract_bit_field.  This is reasonable if the
> > source is a complex expression (from before the tree-ssa optimizers),
> > or a SUBREG, or a hard register, but it's not particularly useful to
> > copy a pseudo register into a new pseudo register.  This patch eliminates 
> > that
> redundancy.
> >
> > The -fdump-tree-expand for pr28071.c compiled with -O0 currently
> > contains 777K lines, with this patch it contains 717K lines, i.e.
> > saving about 60K lines (admittedly of debugging text output, but it makes 
> > the
> point).
> >
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures.  Ok for mainline?
> >
> > As always, I'm happy to revert this change quickly if there's a
> > problem, and investigate why this additional copy might (still) be
> > needed on other
> > non-x86 targets.
> 
> @@ -2622,6 +2622,7 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src,
> tree type,
>  be loaded directly into the destination.  */
>src = orig_src;
>if (!MEM_P (orig_src)
> + && (!REG_P (orig_src) || HARD_REGISTER_P (orig_src))
>   && (!CONSTANT_P (orig_src)
>   || (GET_MODE (orig_src) != mode
>   && GET_MODE (orig_src) != VOIDmode)))
> 
> so that means the code guarded by the conditional could instead be transformed
> to
> 
>src = force_reg (mode, orig_src);
> 
> ?  Btw, the || (GET_MODE (orig_src) != mode && GET_MODE (orig_src) !=
> VOIDmode) case looks odd as in that case we'd use GET_MODE (orig_src) for the
> move ... that might also mean we have to use force_reg (GET_MODE (orig_src) ==
> VOIDmode ? mode : GET_MODE (orig_src), orig_src))
> 
> Otherwise I think this is OK, as said, using force_reg somehow would improve
> readability here I think.
> 
> I also wonder how the
> 
>   else if (GET_CODE (src) == CONCAT)
> 
> case will ever trigger with the current code.
> 
> Richard.
> 
> >
> > 2023-07-25  Roger Sayle  
> >
> > gcc/ChangeLog
> > PR middle-end/28071
> > PR rtl-optimization/110587
> > * expr.cc (emit_group_load_1): Avoid copying a pseudo register into
> > a new pseudo register, i.e. only copy hard regs into a new pseudo.
> >
> >

diff --git a/gcc/expr.cc b/gcc/expr.cc
index fff09dc..174f8ac 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -2622,16 +2622,11 @@ emit_group_load_1 (rtx *tmps, rtx dst, rtx orig_src, 
tree type,
 be loaded directly into the destination.  */
   src = orig_src;
   if (!MEM_P (orig_src)
- && (!CONSTANT_P (orig_src)
- || (GET_MODE (orig_src) != mode
- && GET_MODE (orig_src) != VOIDmode)))
+ && (!REG_P (orig_src) || HARD_REGISTER_P (orig_src))
+ && !CONSTANT_P (orig_src))
{
- if (GET_MODE (orig_src) == VOIDmode)
-   src = gen_reg_rtx (mode);
- else
-   src = gen_reg_rtx (GET_MODE (orig_src));
-
- emit_move_insn (src, orig_src);
+ gcc_assert (GET_MODE (orig_src) != VOIDmode);
+ src = force_reg (GET_MODE (orig_src), orig_src);
}
 
   /* Optimize the access just a bit.  */


[Committed] Use QImode for offsets in zero_extract/sign_extract in i386.md (take #2)

2023-07-29 Thread Roger Sayle

This patch reattempts to change the ZERO_EXTRACTs and SIGN_EXTRACTs
in i386.md to consistently use QImode for bit offsets (i.e. third and fourth
operands), matching the use of QImode for bit counts in shifts and rotates.

This iteration corrects the "ne:QI" vs "eq:QI" mistake in the previous
version, which was responsible for PR 110787 and PR 110790 and so was
rapidly reverted last weekend.  New test cases have been added to check
the correct behaviour.

This patch has been tested on x86_64-pc-linux-gnu with and without
--enable-languages="all", with make bootstrap and make -k check, both
with and without --target_board=unix{-m32} with no new failures.
Committed to mainline as an obvious fix to the previously approved
patch.  Sorry again for the temporary inconvenience, and thanks to
Rainer Orth for identifying/confirming the problematic patch.


2023-07-29  Roger Sayle  

gcc/ChangeLog
PR target/110790
* config/i386/i386.md (extv): Use QImode for offsets.
(extzv): Likewise.
(insv): Likewise.
(*testqi_ext_3): Likewise.
(*btr_2): Likewise.
(define_split): Likewise.
(*btsq_imm): Likewise.
(*btrq_imm): Likewise.
(*btcq_imm): Likewise.
(define_peephole2 x3): Likewise.
(*bt): Likewise
(*bt_mask): New define_insn_and_split.
(*jcc_bt): Use QImode for offsets.
(*jcc_bt_1): Delete obsolete pattern.
(*jcc_bt_mask): Use QImode offsets.
(*jcc_bt_mask_1): Likewise.
(define_split): Likewise.
(*bt_setcqi): Likewise.
(*bt_setncqi): Likewise.
(*bt_setnc): Likewise.
(*bt_setncqi_2): Likewise.
(*bt_setc_mask): New define_insn_and_split.
(bmi2_bzhi_3): Use QImode offsets.
(*bmi2_bzhi_3): Likewise.
(*bmi2_bzhi_3_1): Likewise.
(*bmi2_bzhi_3_1_ccz): Likewise.
(@tbm_bextri_): Likewise.

gcc/testsuite/ChangeLog
PR target/110790
* gcc.target/i386/pr110790-1.c: New test case.
* gcc.target/i386/pr110790-2.c: Likewise.


diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 4db210c..efac228 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -3312,8 +3312,8 @@
 (define_expand "extv"
   [(set (match_operand:SWI24 0 "register_operand")
(sign_extract:SWI24 (match_operand:SWI24 1 "register_operand")
-   (match_operand:SI 2 "const_int_operand")
-   (match_operand:SI 3 "const_int_operand")))]
+   (match_operand:QI 2 "const_int_operand")
+   (match_operand:QI 3 "const_int_operand")))]
   ""
 {
   /* Handle extractions from %ah et al.  */
@@ -3340,8 +3340,8 @@
 (define_expand "extzv"
   [(set (match_operand:SWI248 0 "register_operand")
(zero_extract:SWI248 (match_operand:SWI248 1 "register_operand")
-(match_operand:SI 2 "const_int_operand")
-(match_operand:SI 3 "const_int_operand")))]
+(match_operand:QI 2 "const_int_operand")
+(match_operand:QI 3 "const_int_operand")))]
   ""
 {
   if (ix86_expand_pextr (operands))
@@ -3428,8 +3428,8 @@
 
 (define_expand "insv"
   [(set (zero_extract:SWI248 (match_operand:SWI248 0 "register_operand")
-(match_operand:SI 1 "const_int_operand")
-(match_operand:SI 2 "const_int_operand"))
+(match_operand:QI 1 "const_int_operand")
+(match_operand:QI 2 "const_int_operand"))
 (match_operand:SWI248 3 "register_operand"))]
   ""
 {
@@ -10788,8 +10788,8 @@
 (match_operator 1 "compare_operator"
  [(zero_extract:SWI248
 (match_operand 2 "int_nonimmediate_operand" "rm")
-(match_operand 3 "const_int_operand")
-(match_operand 4 "const_int_operand"))
+(match_operand:QI 3 "const_int_operand")
+(match_operand:QI 4 "const_int_operand"))
   (const_int 0)]))]
   "/* Ensure that resulting mask is zero or sign extended operand.  */
INTVAL (operands[4]) >= 0
@@ -15965,7 +15965,7 @@
   [(set (zero_extract:HI
  (match_operand:SWI12 0 "nonimmediate_operand")
  (const_int 1)
- (zero_extend:SI (match_operand:QI 1 "register_operand")))
+ (match_operand:QI 1 "register_operand"))
(const_int 0))
(clobber (reg:CC FLAGS_REG))]
   "TARGET_USE_BT && ix86_pre_reload_split ()"
@@ -15989,7 +15989,7 @@
   [(

[Committed] PR target/110843: Check TARGET_AVX512VL for V2DI rotates in STV.

2023-07-31 Thread Roger Sayle

This patch resolves PR target/110843, an ICE caused by my enhancement to
support AVX512 DImode and SImode rotates in the scalar-to-vector (STV) pass.
Although the vprotate instructions are available on all TARGET_AVX512F
microarchitectures, the V2DI and V4SI variants are only available on the
TARGET_AVX512VL subset, leading to problems when command line options
enable AVX512 (i.e. AVX512F) but not the required AVX512VL functionality.
The simple fix is to update/correct the target checks.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Committed to mainline as obvious.


2023-07-31  Roger Sayle  

gcc/ChangeLog
PR target/110843
* config/i386/i386-features.cc (compute_convert_gain): Check
TARGET_AVX512VL (not TARGET_AVX512F) when considering V2DImode
and V4SImode rotates in STV.
(general_scalar_chain::convert_rotate): Likewise.

gcc/testsuite/ChangeLog
PR target/110843
* gcc.target/i386/pr110843.c: New test case.


Sorry again for the inconvenience.
Roger
--

diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
index 6da8395..cead397 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -587,7 +587,7 @@ general_scalar_chain::compute_convert_gain ()
  case ROTATE:
  case ROTATERT:
igain += m * ix86_cost->shift_const;
-   if (TARGET_AVX512F)
+   if (TARGET_AVX512VL)
  igain -= ix86_cost->sse_op;
else if (smode == DImode)
  {
@@ -1230,7 +1230,7 @@ general_scalar_chain::convert_rotate (enum rtx_code code, 
rtx op0, rtx op1,
  emit_insn_before (pat, insn);
  result = gen_lowpart (V2DImode, tmp1);
}
-  else if (TARGET_AVX512F)
+  else if (TARGET_AVX512VL)
result = simplify_gen_binary (code, V2DImode, op0, op1);
   else if (bits == 16 || bits == 48)
{
@@ -1276,7 +1276,7 @@ general_scalar_chain::convert_rotate (enum rtx_code code, 
rtx op0, rtx op1,
   emit_insn_before (pat, insn);
   result = gen_lowpart (V4SImode, tmp1);
 }
-  else if (TARGET_AVX512F)
+  else if (TARGET_AVX512VL)
 result = simplify_gen_binary (code, V4SImode, op0, op1);
   else
 {
diff --git a/gcc/testsuite/gcc.target/i386/pr110843.c 
b/gcc/testsuite/gcc.target/i386/pr110843.c
new file mode 100644
index 000..b9bcddb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr110843.c
@@ -0,0 +1,20 @@
+/* PR target/110843 */
+/* derived from gcc.target/i386/pr70007.c */
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-Os -mavx512ifma -Wno-psabi" } */
+
+typedef unsigned short v32u16 __attribute__ ((vector_size (32)));
+typedef unsigned long long v32u64 __attribute__ ((vector_size (32)));
+typedef unsigned __int128 u128;
+typedef unsigned __int128 v32u128 __attribute__ ((vector_size (32)));
+
+u128 foo (v32u16 v32u16_0, v32u64 v32u64_0, v32u64 v32u64_1)
+{
+  do {
+v32u16_0[13] |= v32u64_1[3] = (v32u64_1[3] >> 19) | (v32u64_1[3] << 45);
+v32u64_1 %= ~v32u64_1;
+v32u64_0 *= (v32u64) v32u16_0;
+  } while (v32u64_0[0]);
+  return v32u64_1[3];
+}
+


[x86 PATCH] UNSPEC_PALIGNR optimizations and clean-ups.

2022-06-30 Thread Roger Sayle

This patch is a follow-up to Hongtao's fix for PR target/105854.  That
fix is perfectly correct, but the thing that caught my eye was why is
the compiler generating a shift by zero at all.  Digging deeper it
turns out that we can easily optimize __builtin_ia32_palignr for
alignments of 0 and 64 respectively, which may be simplified to moves
from the highpart or lowpart.
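
For example (a sketch of my own using the intrinsic wrapper, not the new
test case), a byte offset of 0 selects the low operand outright and an
offset of 8 selects the high operand, so neither needs a real shift:

/* Sketch only; compile with -mssse3.  _mm_alignr_pi8 concatenates
   {a:b}, shifts right by N bytes and keeps the low 64 bits, so N == 0
   (the builtin's bit alignment 0) yields b and N == 8 (bit alignment
   64) yields a; both can be emitted as plain moves.  */
#include <tmmintrin.h>

__m64 align_0 (__m64 a, __m64 b) { return _mm_alignr_pi8 (a, b, 0); }
__m64 align_8 (__m64 a, __m64 b) { return _mm_alignr_pi8 (a, b, 8); }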

After adding optimizations to simplify the 64-bit DImode palignr,
I started to add the corresponding optimizations for vpalignr (i.e.
128-bit).  The first oddity is that sse.md uses TImode and a special
SSESCALARMODE iterator, rather than V1TImode, and indeed the comment
above SSESCALARMODE hints that this should be "dropped in favor of
VIMAX_AVX2_AVX512BW".  Hence this patch includes the migration of
_palignr to use VIMAX_AVX2_AVX512BW, basically
using V1TImode instead of TImode for 128-bit palignr.

But it was only after I'd implemented this clean-up that I stumbled
across the strange semantics of 128-bit [v]palignr.  According to
https://www.felixcloutier.com/x86/palignr, the semantics are subtly
different based upon how the instruction is encoded.  PALIGNR leaves
the highpart unmodified, whilst VEX.128 encoded VPALIGNR clears the
highpart, and (unless I'm mistaken) it looks like GCC currently uses
the exact same RTL/templates for both, treating one as an alternative
for the other.

Hence I thought I'd post what I have so far (part optimization and
part clean-up), to then ask the x86 experts for their opinions.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-,32},
with no new failures.  Ok for mainline?


2022-06-30  Roger Sayle  

gcc/ChangeLog
* config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
* config/i386/i386-expand.cc (expand_vec_perm_palignr): Use V1TImode
and gen_ssse3_palignv1ti instead of TImode.
* config/i386/sse.md (SSESCALARMODE): Delete.
(define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
(_palignr): Use VIMAX_AVX2_AVX512BW as a mode
iterator instead of SSESCALARMODE.

(ssse3_palignrdi): Optimize cases when operands[3] is 0 or 64,
using a single move instruction (if required).
(define_split): Likewise split UNSPEC_PALIGNR $0 into a move.
(define_split): Likewise split UNSPEC_PALIGNR $64 into a move.

gcc/testsuite/ChangeLog
* gcc.target/i386/ssse3-palignr-2.c: New test case.


Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def
index e6daad4..fd16093 100644
--- a/gcc/config/i386/i386-builtin.def
+++ b/gcc/config/i386/i386-builtin.def
@@ -900,7 +900,7 @@ BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_psignv4si3, 
"__builtin_ia32_psig
 BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, 
CODE_FOR_ssse3_psignv2si3, "__builtin_ia32_psignd", IX86_BUILTIN_PSIGND, 
UNKNOWN, (int) V2SI_FTYPE_V2SI_V2SI)
 
 /* SSSE3.  */
-BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_palignrti, 
"__builtin_ia32_palignr128", IX86_BUILTIN_PALIGNR128, UNKNOWN, (int) 
V2DI_FTYPE_V2DI_V2DI_INT_CONVERT)
+BDESC (OPTION_MASK_ISA_SSSE3, 0, CODE_FOR_ssse3_palignrv1ti, 
"__builtin_ia32_palignr128", IX86_BUILTIN_PALIGNR128, UNKNOWN, (int) 
V2DI_FTYPE_V2DI_V2DI_INT_CONVERT)
 BDESC (OPTION_MASK_ISA_SSSE3 | OPTION_MASK_ISA_MMX, 0, 
CODE_FOR_ssse3_palignrdi, "__builtin_ia32_palignr", IX86_BUILTIN_PALIGNR, 
UNKNOWN, (int) V1DI_FTYPE_V1DI_V1DI_INT_CONVERT)
 
 /* SSE4.1 */
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 8bc5430..6a3fcde 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -19548,9 +19548,11 @@ expand_vec_perm_palignr (struct expand_vec_perm_d *d, 
bool single_insn_only_p)
   shift = GEN_INT (min * GET_MODE_UNIT_BITSIZE (d->vmode));
   if (GET_MODE_SIZE (d->vmode) == 16)
 {
-  target = gen_reg_rtx (TImode);
-  emit_insn (gen_ssse3_palignrti (target, gen_lowpart (TImode, dcopy.op1),
- gen_lowpart (TImode, dcopy.op0), shift));
+  target = gen_reg_rtx (V1TImode);
+  emit_insn (gen_ssse3_palignrv1ti (target,
+   gen_lowpart (V1TImode, dcopy.op1),
+   gen_lowpart (V1TImode, dcopy.op0),
+   shift));
 }
   else
 {
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 8cd0f61..974deca 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -575,10 +575,6 @@
 (define_mode_iterator VIMAX_AVX2
   [(V2TI "TARGET_AVX2") V1TI])
 
-;; ??? This should probably be dropped in favor of VIMAX_AVX2_AVX512BW.
-(define_mode_iterator SSESCALARMODE
-  [(V4TI "TARGET_AVX512BW") (V

[x86 PATCH] PR target/106122: Don't update %esp via the stack with -Oz.

2022-06-30 Thread Roger Sayle

When optimizing for size with -Oz, setting a register can be minimized by
pushing an immediate value to the stack and popping it to the destination.
Alas the one general register that shouldn't be updated via the stack is
the stack pointer itself, where "pop %esp" can't be represented in GCC's
RTL ("use of a register mentioned in pre_inc, pre_dec, post_inc or
post_dec is not permitted within the same instruction").  This patch
fixes PR target/106122 by explicitly checking for SP_REG in the
problematic peephole2.
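
To illustrate the peephole being guarded (a sketch of my own, not from
the PR): with -Oz a small constant load can be materialized through the
stack to save bytes, and the fix simply refuses to pick %esp as the pop
destination.

/* Sketch only; assumes -Oz on ia32 with no red zone.  "return 5" can
   be emitted as
       pushl $5        # 2 bytes
       popl  %eax      # 1 byte
   instead of
       movl  $5, %eax  # 5 bytes
   The ICE in the PR occurs when the destination is forced to be %esp,
   which the pop form cannot express in RTL.  */
int five (void) { return 5; }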

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2022-06-30  Roger Sayle  

gcc/ChangeLog
PR target/106122
* config/i386/i386.md (peephole2): Avoid generating pop %esp
when optimizing for size.

gcc/testsuite/ChangeLog
PR target/106122
* gcc.target/i386/pr106122.c: New test case.


Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 125a3b4..3b6f362 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -2588,7 +2588,8 @@
   "optimize_insn_for_size_p () && optimize_size > 1
&& operands[1] != const0_rtx
&& IN_RANGE (INTVAL (operands[1]), -128, 127)
-   && !ix86_red_zone_used"
+   && !ix86_red_zone_used
+   && REGNO (operands[0]) != SP_REG"
   [(set (match_dup 2) (match_dup 1))
(set (match_dup 0) (match_dup 3))]
 {
diff --git a/gcc/testsuite/gcc.target/i386/pr106122.c 
b/gcc/testsuite/gcc.target/i386/pr106122.c
new file mode 100644
index 000..7d24ed3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr106122.c
@@ -0,0 +1,15 @@
+/* PR middle-end/106122 */
+/* { dg-do compile } */
+/* { dg-options "-Oz" } */
+
+register volatile int a __asm__("%esp");
+void foo (void *);
+void bar (void *);
+
+void
+baz (void)
+{
+  foo (__builtin_return_address (0));
+  a = 0;
+  bar (__builtin_return_address (0));
+}


[Committed] Add constraints to new andn_doubleword_bmi pattern in i386.md.

2022-07-01 Thread Roger Sayle

Many thanks to Uros for spotting that I'd forgotten to add constraints
to the new define_insn_and_split *andn_doubleword_bmi when moving it
from pre-reload to post-reload.  I've pushed this obvious fix after a
make bootstrap on x86_64-pc-linux-gnu.  Sorry for the inconvenience to
anyone building the tree with a non-default architecture that enables
BMI.


2022-07-01  Roger Sayle  
Uroš Bizjak  

gcc/ChangeLog
* config/i386/i386.md (*andn3_doubleword_bmi): Add constraints
to post-reload define_insn_and_split.


Roger
--

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 3401814..352a21c 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -10405,10 +10405,10 @@
 })
 
 (define_insn_and_split "*andn3_doubleword_bmi"
-  [(set (match_operand: 0 "register_operand")
+  [(set (match_operand: 0 "register_operand" "=r")
(and:
- (not: (match_operand: 1 "register_operand"))
- (match_operand: 2 "nonimmediate_operand")))
+ (not: (match_operand: 1 "register_operand" "0"))
+ (match_operand: 2 "nonimmediate_operand" "ro")))
(clobber (reg:CC FLAGS_REG))]
   "TARGET_BMI"
   "#"


RE: [x86 PATCH] PR rtl-optimization/96692: ((A|B)^C)^A using andn with -mbmi.

2022-07-04 Thread Roger Sayle

Hi Uros,
Thanks for the review.  This patch implements all of your suggestions, both
removing ix86_pre_reload_split from the combine splitter(s), and dividing
the original splitter up into four simpler variants that use match_dup to
handle the variants/permutations caused by operator commutativity.

This revised patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32} with no
new failures.  Ok for mainline?

2022-07-04  Roger Sayle  
Uroš Bizjak  

gcc/ChangeLog
PR rtl-optimization/96692
* config/i386/i386.md (define_split): Split ((A | B) ^ C) ^ D
as (X & ~Y) ^ Z on target BMI when either C or D is A or B.

gcc/testsuite/ChangeLog
PR rtl-optimization/96692
* gcc.target/i386/bmi-andn-4.c: New test case.


Thanks again,
Roger
--

> -Original Message-
> From: Uros Bizjak 
> Sent: 26 June 2022 18:08
> To: Roger Sayle 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [x86 PATCH] PR rtl-optimization/96692: ((A|B)^C)^A using andn 
> with
> -mbmi.
> 
> On Sun, Jun 26, 2022 at 2:04 PM Roger Sayle 
> wrote:
> >
> >
> > This patch addresses PR rtl-optimization/96692 on x86_64, by providing
> > a define_split for combine to convert the three operation ((A|B)^C)^D
> > into a two operation sequence using andn when either A or B is the
> > same register as C or D.  This is essentially a reassociation problem
> > that's only a win if the target supports an and-not instruction (as with 
> > -mbmi).
> >
> > Hence for the new test case:
> >
> > int f(int a, int b, int c)
> > {
> > return (a ^ b) ^ (a | c);
> > }
> >
> > GCC on x86_64-pc-linux-gnu wth -O2 -mbmi would previously generate:
> >
> > xorl%edi, %esi
> > orl %edx, %edi
> > movl%esi, %eax
> > xorl%edi, %eax
> > ret
> >
> > but with this patch now generates:
> >
> > andn%edx, %edi, %eax
> > xorl%esi, %eax
> > ret
> >
> > I'll investigate whether this optimization can also be implemented
> > more generically in simplify_rtx when the backend provides accurate
> > rtx_costs for "(and (not ..." (as there's no optab for andn).
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32},
> > with no new failures.  Ok for mainline?
> >
> >
> > 2022-06-26  Roger Sayle  
> >
> > gcc/ChangeLog
> > PR rtl-optimization/96692
> > * config/i386/i386.md (define_split): Split ((A | B) ^ C) ^ D
> > as (X & ~Y) ^ Z on target BMI when either C or D is A or B.
> >
> > gcc/testsuite/ChangeLog
> > PR rtl-optimization/96692
> > * gcc.target/i386/bmi-andn-4.c: New test case.
> 
> +  "TARGET_BMI
> +   && ix86_pre_reload_split ()
> +   && (rtx_equal_p (operands[1], operands[3])
> +   || rtx_equal_p (operands[1], operands[4])
> +   || (REG_P (operands[2])
> +   && (rtx_equal_p (operands[2], operands[3])
> +   || rtx_equal_p (operands[2], operands[4]"
> 
> You don't need a ix86_pre_reload_split for combine splitter*
> 
> OTOH, please split the pattern to two for each commutative operand and use
> (match_dup x) instead. Something similar to [1].
> 
> *combine splitter is described in the documentation as the splitter pattern 
> that
> does *not* match any existing insn pattern.
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596804.html
> 
> Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 20c3b9a..d114754 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -10522,6 +10522,82 @@
(set (match_dup 0) (match_op_dup 1
 [(and:SI (match_dup 3) (match_dup 2))
 (const_int 0)]))])
+
+;; Variant 1 of 4: Split ((A | B) ^ A) ^ C as (B & ~A) ^ C.
+(define_split
+  [(set (match_operand:SWI48 0 "register_operand")
+   (xor:SWI48
+  (xor:SWI48
+ (ior:SWI48 (match_operand:SWI48 1 "register_operand")
+(match_operand:SWI48 2 "nonimmediate_operand"))
+ (match_dup 1))
+  (match_operand:SWI48 3 "nonimmediate_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_BMI"
+  [(parallel
+  [(set (match_dup 4) (and:SWI48 (not:SWI48 (match_dup 1)) (match_dup 2)))
+   (clobber (reg:CC FLAGS_REG))])
+   (parallel
+  [(set (match_dup 0) (xor:SWI48 (match_dup 4) (match_

RE: [x86 PATCH] UNSPEC_PALIGNR optimizations and clean-ups.

2022-07-04 Thread Roger Sayle

Hi Hongtao,
Many thanks for your review.  This revised patch implements your
suggestions of removing the combine splitters, and instead reusing
the functionality of the ssse3_palignrdi define_insn_and split.

This revised patch has been tested on x86_64-pc-linux-gnu with make
bootstrap and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Is this revised version Ok for mainline?


2022-07-04  Roger Sayle  
Hongtao Liu  

gcc/ChangeLog
* config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
* config/i386/i386-expand.cc (expand_vec_perm_palignr): Use V1TImode
and gen_ssse3_palignv1ti instead of TImode.
* config/i386/sse.md (SSESCALARMODE): Delete.
(define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
(_palignr): Use VIMAX_AVX2_AVX512BW as a mode
iterator instead of SSESCALARMODE.

(ssse3_palignrdi): Optimize cases where operands[3] is 0 or 64,
using a single move instruction (if required).

gcc/testsuite/ChangeLog
* gcc.target/i386/ssse3-palignr-2.c: New test case.


Thanks in advance,
Roger
--

> -Original Message-
> From: Hongtao Liu 
> Sent: 01 July 2022 03:40
> To: Roger Sayle 
> Cc: GCC Patches 
> Subject: Re: [x86 PATCH] UNSPEC_PALIGNR optimizations and clean-ups.
> 
> On Fri, Jul 1, 2022 at 10:12 AM Hongtao Liu  wrote:
> >
> > On Fri, Jul 1, 2022 at 2:42 AM Roger Sayle 
> wrote:
> > >
> > >
> > > This patch is a follow-up to Hongtao's fix for PR target/105854.
> > > That fix is perfectly correct, but the thing that caught my eye was
> > > why is the compiler generating a shift by zero at all.  Digging
> > > deeper it turns out that we can easily optimize
> > > __builtin_ia32_palignr for alignments of 0 and 64 respectively,
> > > which may be simplified to moves from the highpart or lowpart.
> > >
> > > After adding optimizations to simplify the 64-bit DImode palignr, I
> > > started to add the corresponding optimizations for vpalignr (i.e.
> > > 128-bit).  The first oddity is that sse.md uses TImode and a special
> > > SSESCALARMODE iterator, rather than V1TImode, and indeed the comment
> > > above SSESCALARMODE hints that this should be "dropped in favor of
> > > VIMAX_AVX2_AVX512BW".  Hence this patch includes the migration of
> > > _palignr to use VIMAX_AVX2_AVX512BW, basically
> > > using V1TImode instead of TImode for 128-bit palignr.
> > >
> > > But it was only after I'd implemented this clean-up that I stumbled
> > > across the strange semantics of 128-bit [v]palignr.  According to
> > > https://www.felixcloutier.com/x86/palignr, the semantics are subtly
> > > different based upon how the instruction is encoded.  PALIGNR leaves
> > > the highpart unmodified, whilst VEX.128 encoded VPALIGNR clears the
> > > highpart, and (unless I'm mistaken) it looks like GCC currently uses
> > > the exact same RTL/templates for both, treating one as an
> > > alternative for the other.
> > I think as long as patterns or intrinsics only care about the low
> > part, they should be ok.
> > But if we want to use default behavior for upper bits, we need to
> > restrict them under specific isa(.i.e. vmovq in vec_set_0).
> > Generally, 128-bit sse legacy instructions have different behaviors
> > for upper bits from AVX ones, and that's why vzeroupper is introduced
> > for sse <-> avx instructions transition.
> > >
> > > Hence I thought I'd post what I have so far (part optimization and
> > > part clean-up), to then ask the x86 experts for their opinions.
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > bootstrap and make -k check, both with and without
> > > --target_board=unix{-,32}, with no new failures.  Ok for mainline?
> > >
> > >
> > > 2022-06-30  Roger Sayle  
> > >
> > > gcc/ChangeLog
> > > * config/i386/i386-builtin.def (__builtin_ia32_palignr128): Change
> > > CODE_FOR_ssse3_palignrti to CODE_FOR_ssse3_palignrv1ti.
> > > * config/i386/i386-expand.cc (expand_vec_perm_palignr): Use
> V1TImode
> > > and gen_ssse3_palignv1ti instead of TImode.
> > > * config/i386/sse.md (SSESCALARMODE): Delete.
> > > (define_mode_attr ssse3_avx2): Handle V1TImode instead of TImode.
> > > (_palignr): Use VIMAX_AVX2_AVX512BW as a
> mode
> > > iterator instead of SSESCALARMODE.
> >

[x86 PATCH take #2] Doubleword version of and; cmp to not; test optimization.

2022-07-04 Thread Roger Sayle

This patch is the latest revision of the patch originally posted at:
https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596201.html

This patch extends the earlier and;cmp to not;test optimization to also
perform this transformation for TImode on TARGET_64BIT and DImode on -m32.
One motivation for this is that it's a step to fixing the current failure
of gcc.target/i386/pr65105-5.c on -m32.

A more direct benefit for x86_64 is that the following code:

int foo(__int128 x, __int128 y)
{
  return (x & y) == y;
}

improves with -O2 from 15 instructions:

movq%rdi, %r8
movq%rsi, %rax
movq%rax, %rdi
movq%r8, %rsi
movq%rdx, %r8
andq%rdx, %rsi
andq%rcx, %rdi
movq%rsi, %rax
movq%rdi, %rdx
xorq%r8, %rax
xorq%rcx, %rdx
orq %rdx, %rax
sete%al
movzbl  %al, %eax
ret

to the slightly better 13 instructions:

movq%rdi, %r8
movq%rsi, %rax
movq%r8, %rsi
movq%rax, %rdi
notq%rsi
notq%rdi
andq%rdx, %rsi
andq%rcx, %rdi
movq%rsi, %rax
orq %rdi, %rax
sete%al
movzbl  %al, %eax
ret
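
The transformation relies on the identity that (x & y) == y holds exactly
when (~x & y) == 0; a purely illustrative self-check (not part of the
patch):

/* Illustration only: exhaustively verifies the identity behind turning
   and;cmp into not;test (andn followed by a compare against zero).  */
#include <assert.h>

int
main (void)
{
  for (unsigned int x = 0; x < 256; x++)
    for (unsigned int y = 0; y < 256; y++)
      assert (((x & y) == y) == ((~x & y) == 0));
  return 0;
}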

Now that all of the doubleword pieces are already in the tree, this
patch is now much shorter (an rtx_costs improvement and a single new
define_insn_and_split), however I couldn't resist including two very
minor pattern naming tweaks/clean-ups to fix nits.

This revised patch has been tested on x86_64-pc-linux-gnu with
make bootstrap and make -k check, where on TARGET_64BIT there are
no new failures, and on --target_board=unix{-m32} with a single new
failure; the other dg-final in gcc.target/i386/pr65105-5.c now also
fails (as that code diverges further from the expected vectorized
output).  This is progress as both FAILs in pr65105-5.c may now be
fixed by changes localized to the STV pass.  OK for mainline?


2022-07-04  Roger Sayle  

gcc/ChangeLog
* config/i386/i386.cc (ix86_rtx_costs) : Provide costs
for double word comparisons and tests (comparisons against zero).
* config/i386/i386.md (*test_not_doubleword): Split DWI
and;cmp into andn;cmp $0 as a pre-reload splitter.
(*andn3_doubleword_bmi): Use  instead of  in name.
(*3_doubleword): Likewise.

gcc/testsuite/ChangeLog
* gcc.target/i386/testnot-3.c: New test case.


Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index b15b489..70c9a27 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -20935,6 +20935,19 @@ ix86_rtx_costs (rtx x, machine_mode mode, int 
outer_code_i, int opno,
  return true;
}
 
+  if (SCALAR_INT_MODE_P (GET_MODE (op0))
+ && GET_MODE_SIZE (GET_MODE (op0)) > UNITS_PER_WORD)
+   {
+ if (op1 == const0_rtx)
+   *total = cost->add
++ rtx_cost (op0, GET_MODE (op0), outer_code, opno, speed);
+ else
+   *total = 3*cost->add
++ rtx_cost (op0, GET_MODE (op0), outer_code, opno, speed)
++ rtx_cost (op1, GET_MODE (op0), outer_code, opno, speed);
+ return true;
+   }
+
   /* The embedded comparison operand is completely free.  */
   if (!general_operand (op0, GET_MODE (op0)) && op1 == const0_rtx)
*total = 0;
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 20c3b9a..2492ad4 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -9792,7 +9792,25 @@
(set (reg:CCZ FLAGS_REG)
(compare:CCZ (and:SWI (match_dup 2) (match_dup 1))
 (const_int 0)))]
+  "operands[2] = gen_reg_rtx (mode);")
+
+;; Split and;cmp (as optimized by combine) into andn;cmp $0
+(define_insn_and_split "*test_not_doubleword"
+  [(set (reg:CCZ FLAGS_REG)
+   (compare:CCZ
+ (and:DWI
+   (not:DWI (match_operand:DWI 0 "nonimmediate_operand"))
+   (match_operand:DWI 1 "nonimmediate_operand"))
+ (const_int 0)))]
+  "ix86_pre_reload_split ()"
+  "#"
+  "&& 1"
+  [(parallel
+  [(set (match_dup 2) (and:DWI (not:DWI (match_dup 0)) (match_dup 1)))
+   (clobber (reg:CC FLAGS_REG))])
+   (set (reg:CCZ FLAGS_REG) (compare:CCZ (match_dup 2) (const_int 0)))]
 {
+  operands[0] = force_reg (mode, operands[0]);
   operands[2] = gen_reg_rtx (mode);
 })
 
@@ -10404,7 +10422,7 @@
   operands[2] = gen_int_mode (INTVAL (operands[2]), QImode);
 })
 
-(define_insn_and_split "*andn3_doubleword_bmi"
+(define_insn_and_split "*andn3_doubleword_bmi"
   [(set (match_operand: 0 "register_operand" "=r")
(and:
  (not: (match_operand: 1 "register_operand" "r"))
@@ -10542,7 +105

[x86 PATCH] Support *testdi_not_doubleword during STV pass.

2022-07-07 Thread Roger Sayle

This patch fixes the current two FAILs of pr65105-5.c on x86 when
compiled with -m32.  These (temporary) breakages were fallout from my
patches to improve/upgrade (scalar) double word comparisons.
On mainline, the i386 backend currently represents a critical comparison
using (compare (and (not reg1) reg2) (const_int 0)) which isn't/wasn't
recognized by the STV pass' convertible_comparison_p.  This simple STV
patch adds support for this pattern (*testdi_not_doubleword) and
generates the vector pandn and ptest instructions expected in the
existing (failing) test case.
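
For illustration, a minimal -m32 example (an assumption on my part, not
the actual pr65105-5.c source) of code that combine canonicalizes into
the (compare (and (not reg1) reg2) (const_int 0)) form handled here:

/* Sketch only: "are all bits of MASK set in VAL" on a doubleword
   (DImode with -m32) type becomes (~val & mask) == 0 after combine.  */
int all_set (unsigned long long val, unsigned long long mask)
{
  return (val & mask) == mask;
}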

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, where with --target_board=unix{-m32} there are two
fewer failures, and without, there are no new failures.
Ok for mainline?


2022-07-07  Roger Sayle  

gcc/ChangeLog
* config/i386/i386-features.cc (convert_compare): Add support
for *testdi_not_doubleword pattern (i.e. "(compare (and (not ...")
by generating a pandn followed by ptest.
(convertible_comparison_p): Recognize both *cmpdi_doubleword and
recent *testdi_not_doubleword comparison patterns.


Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
index be38586..a7bd172 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -938,10 +938,10 @@ general_scalar_chain::convert_compare (rtx op1, rtx op2, 
rtx_insn *insn)
 {
   rtx tmp = gen_reg_rtx (vmode);
   rtx src;
-  convert_op (&op1, insn);
   /* Comparison against anything other than zero, requires an XOR.  */
   if (op2 != const0_rtx)
 {
+  convert_op (&op1, insn);
   convert_op (&op2, insn);
   /* If both operands are MEMs, explicitly load the OP1 into TMP.  */
   if (MEM_P (op1) && MEM_P (op2))
@@ -953,8 +953,25 @@ general_scalar_chain::convert_compare (rtx op1, rtx op2, 
rtx_insn *insn)
src = op1;
   src = gen_rtx_XOR (vmode, src, op2);
 }
+  else if (GET_CODE (op1) == AND
+  && GET_CODE (XEXP (op1, 0)) == NOT)
+{
+  rtx op11 = XEXP (XEXP (op1, 0), 0);
+  rtx op12 = XEXP (op1, 1);
+  convert_op (&op11, insn);
+  convert_op (&op12, insn);
+  if (MEM_P (op11))
+   {
+ emit_insn_before (gen_rtx_SET (tmp, op11), insn);
+ op11 = tmp;
+   }
+  src = gen_rtx_AND (vmode, gen_rtx_NOT (vmode, op11), op12);
+}
   else
-src = op1;
+{
+  convert_op (&op1, insn);
+  src = op1;
+}
   emit_insn_before (gen_rtx_SET (tmp, src), insn);
 
   if (vmode == V2DImode)
@@ -1399,17 +1416,29 @@ convertible_comparison_p (rtx_insn *insn, enum 
machine_mode mode)
   rtx op1 = XEXP (src, 0);
   rtx op2 = XEXP (src, 1);
 
-  if (!CONST_INT_P (op1)
-  && ((!REG_P (op1) && !MEM_P (op1))
- || GET_MODE (op1) != mode))
-return false;
-
-  if (!CONST_INT_P (op2)
-  && ((!REG_P (op2) && !MEM_P (op2))
- || GET_MODE (op2) != mode))
-return false;
+  /* *cmp_doubleword.  */
+  if ((CONST_INT_P (op1)
+   || ((REG_P (op1) || MEM_P (op1))
+   && GET_MODE (op1) == mode))
+  && (CONST_INT_P (op2)
+ || ((REG_P (op2) || MEM_P (op2))
+ && GET_MODE (op2) == mode)))
+return true;
+
+  /* *test_not_doubleword.  */
+  if (op2 == const0_rtx
+  && GET_CODE (op1) == AND
+  && GET_CODE (XEXP (op1, 0)) == NOT)
+{
+  rtx op11 = XEXP (XEXP (op1, 0), 0);
+  rtx op12 = XEXP (op1, 1);
+  return (REG_P (op11) || MEM_P (op11))
+&& (REG_P (op12) || MEM_P (op12))
+&& GET_MODE (op11) == mode
+&& GET_MODE (op12) == mode;
+}
 
-  return true;
+  return false;
 }
 
 /* The general version of scalar_to_vector_candidate_p.  */


[PATCH/RFC] combine_completed global variable.

2022-07-07 Thread Roger Sayle

Hi Kewen (and Segher),
Many thanks for stress testing my patch to improve multiplication
by integer constants on rs6000 by using the rldmi instruction.
Although I've not been able to reproduce your ICE (using gcc135
on the compile farm), I completely agree with Segher's analysis
that the Achilles heel with my approach/patch is that there's
currently no way for the backend/recog to know that we're in a
pass after combine.

Rather than give up on this optimization (and a similar one for
i386.md where test;sete can be replaced by xor $1 when combine
knows that nonzero_bits is 1, but loses that information afterwards),
I thought I'd post this "strawman" proposal to add a combine_completed
global variable, matching the reload_completed and regstack_completed
global variables already used (to track progress) by the middle-end.

I was wondering if I could ask you to test the attached patch
in combination with my previous rs6000.md patch (with the obvious
change of reload_completed to combine_completed) to confirm
that it fixes the problems you were seeing.

Segher/Richard, would this sort of patch be considered acceptable?
Or is there a better approach/solution?


2022-07-07  Roger Sayle  

gcc/ChangeLog
* combine.cc (combine_completed): New global variable.
(rest_of_handle_combine): Set combine_completed after pass.
* final.cc (rest_of_clean_state): Reset combine_completed.
* rtl.h (combine_completed): Prototype here.


Many thanks in advance,
Roger
--

> -Original Message-
> From: Kewen.Lin 
> Sent: 27 June 2022 10:04
> To: Roger Sayle 
> Cc: gcc-patches@gcc.gnu.org; Segher Boessenkool
> ; David Edelsohn 
> Subject: Re: [rs6000 PATCH] Improve constant integer multiply using rldimi.
> 
> Hi Roger,
> 
> on 2022/6/27 04:56, Roger Sayle wrote:
> >
> >
> > This patch tweaks the code generated on POWER for integer
> > multiplications
> >
> > by a constant, by making use of rldimi instructions.  Much like x86's
> >
> > lea instruction, rldimi can be used to implement a shift and add pair
> >
> > in some circumstances.  For rldimi this is when the shifted operand
> >
> > is known to have no bits in common with the added operand.
> >
> >
> >
> > Hence for the new testcase below:
> >
> >
> >
> > int foo(int x)
> >
> > {
> >
> >   int t = x & 42;
> >
> >   return t * 0x2001;
> >
> > }
> >
> >
> >
> > when compiled with -O2, GCC currently generates:
> >
> >
> >
> > andi. 3,3,0x2a
> >
> > slwi 9,3,13
> >
> > add 3,9,3
> >
> > extsw 3,3
> >
> > blr
> >
> >
> >
> > with this patch, we now generate:
> >
> >
> >
> > andi. 3,3,0x2a
> >
> > rlwimi 3,3,13,0,31-13
> >
> > extsw 3,3
> >
> > blr
> >
> >
> >
> > It turns out this optimization already exists in the form of a combine
> >
> > splitter in rs6000.md, but the constraints on combine splitters,
> >
> > requiring three of four input instructions (and generating one or two
> >
> > output instructions) mean it doesn't get applied as often as it could.
> >
> > This patch converts the define_split into a define_insn_and_split to
> >
> > catch more cases (such as the one above).
> >
> >
> >
> > The one bit that's tricky/controversial is the use of RTL's
> >
> > nonzero_bits which is accurate during the combine pass when this
> >
> > pattern is first recognized, but not as advanced (not kept up to
> >
> > date) when this pattern is eventually split.  To support this,
> >
> > I've used a "|| reload_completed" idiom.  Does this approach seem
> >
> > reasonable? [I've another patch of x86 that uses the same idiom].
> >
> >
> 
> I tested this patch on powerpc64-linux-gnu, it caused the below ICE against 
> test
> case gcc/testsuite/gcc.c-torture/compile/pr93098.c.
> 
> gcc/testsuite/gcc.c-torture/compile/pr93098.c: In function ‘foo’:
> gcc/testsuite/gcc.c-torture/compile/pr93098.c:10:1: error: unrecognizable 
> insn:
> (insn 104 32 34 2 (set (reg:SI 185 [+4 ])
> (ior:SI (and:SI (reg:SI 200 [+4 ])
> (const_int 4294967295 [0x]))
> (ashift:SI (reg:SI 140)
> (const_int 32 [0x20] "gcc/testsuite/gcc.c-
> torture/compile/pr93098.c":6:11 -1
>  (nil))
> during RTL pass: subreg3
> dump file: pr93098.c.291r.subreg3
> gcc

[PATCH] Be careful with MODE_CC in simplify_const_relational_operation.

2022-07-07 Thread Roger Sayle

I think it's fair to describe RTL's representation of condition flags
using MODE_CC as a little counter-intuitive.  For example, the i386
backend represents the carry flag (in adc instructions) using RTL of
the form "(ltu:SI (reg:CCC) (const_int 0))", where great care needs
to be taken not to treat this like a normal RTX expression, after all
LTU (less-than-unsigned) against const0_rtx would normally always be
false.  Hence, MODE_CC comparisons need to be treated with caution,
and simplify_const_relational_operation returns early (to avoid
problems) when GET_MODE_CLASS (GET_MODE (op0)) == MODE_CC.

However, consider the (currently) hypothetical situation, where the
RTL optimizers determine that a previous instruction unconditionally
sets or clears the carry flag, and this gets propagated by combine into
the above expression, we'd end up with something that looks like
(ltu:SI (const_int 1) (const_int 0)), which doesn't mean what it says.
Fortunately, simplify_const_relational_operation is passed the
original mode of the comparison (cmp_mode, the original mode of op0)
which can be checked for MODE_CC, even when op0 is now VOIDmode
(const_int) after the substitution.  Defending against this is clearly the
right thing to do.

More controversially, rather than just abort simplification/optimization
in this case, we can use the comparison operator to infer/select the
semantics of the CC_MODE flag.  Hopefully, whenever a backend uses LTU,
it represents the (set) carry flag (and behaves like i386.md), in which
case the result of the simplified expression is the first operand.
[If there's no standardization of semantics across backends, then
we should always just return 0; but then miss potential optimizations].
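
To spell out the proposed rule in isolation (a standalone sketch, not
GCC code; the carry-flag reading of LTU is exactly the assumption
described above):

#include <stdio.h>

/* Models the new case: (ltu (const_int op0) (const_int 0)) with a
   MODE_CC comparison mode asks "is the carry flag set", i.e. op0 != 0.  */
static int fold_constant_carry_ltu (long op0)
{
  return op0 != 0;
}

int main (void)
{
  printf ("%d\n", fold_constant_carry_ltu (1));  /* carry set: 1 */
  printf ("%d\n", fold_constant_carry_ltu (0));  /* carry clear: 0 */
  return 0;
}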

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures, and in combination with a i386 backend patch
(that introduces support for x86's stc and clc instructions) where it
avoids failures.  However, I'm submitting this middle-end piece
independently, to confirm that maintainers/reviewers are happy with
the approach, and also to check there are no issues on other platforms,
before building upon this infrastructure.

Thoughts?  Ok for mainline?


2022-07-07  Roger Sayle  

gcc/ChangeLog
* simplify-rtx.cc (simplify_const_relational_operation): Handle
case where both operands of a MODE_CC comparison have been
simplified to constant integers.


Thanks in advance,
Roger
--

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index fa20665..73ec5c7 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -6026,6 +6026,18 @@ simplify_const_relational_operation (enum rtx_code code,
return 0;
 }
 
+  /* Handle MODE_CC comparisons that have been simplified to
+ constants.  */
+  if (GET_MODE_CLASS (mode) == MODE_CC
+  && op1 == const0_rtx
+  && CONST_INT_P (op0))
+{
+  /* LTU represents the carry flag.  */
+  if (code == LTU)
+   return op0 == const0_rtx ? const0_rtx : const_true_rtx;
+  return 0;
+}
+
   /* We can't simplify MODE_CC values since we don't know what the
  actual comparison is.  */
   if (GET_MODE_CLASS (GET_MODE (op0)) == MODE_CC)


[x86 PATCH] Fun with flags: Adding stc/clc instructions to i386.md.

2022-07-08 Thread Roger Sayle

This patch adds support for x86's single-byte encoded stc (set carry flag)
and clc (clear carry flag) instructions to i386.md.

The motivating example is the simple code snippet:

unsigned int foo (unsigned int a, unsigned int b, unsigned int *c)
{
  return __builtin_ia32_addcarryx_u32 (1, a, b, c);
}

which uses the target built-in to generate an adc instruction, adding
together A and B with the incoming carry flag already set.  Currently
for this mainline GCC generates (with -O2):

movl    $1, %eax
addb    $-1, %al
adcl    %esi, %edi
setc    %al
movl    %edi, (%rdx)
movzbl  %al, %eax
ret

where the first two instructions (to load 1 into a byte register and
then add 255 to it) are the idiom used to set the carry flag.  This
is a little inefficient as x86 has a "stc" instruction for precisely
this purpose.  With the attached patch we now generate:

stc
adcl    %esi, %edi
setc    %al
movl    %edi, (%rdx)
movzbl  %al, %eax
ret

The central part of the patch is the addition of x86_stc and x86_clc
define_insns, represented as "(set (reg:CCC FLAGS_REG) (const_int 1))"
and "(set (reg:CCC FLAGS_REG) (const_int 0))" respectively, then using
x86_stc appropriately in the ix86_expand_builtin.

Alas this change exposes two latent bugs/issues in the compiler.
The first is that there are several peephole2s in i386.md that propagate
the flags register, but take its mode from the SET_SRC rather than
preserve the mode of the original SET_DEST.  The other, which is
being discussed with Segher, is that the middle-end's simplify-rtx
inappropriately tries to interpret/optimize MODE_CC comparisons,
converting the above adc into an add, as it mistakenly believes
(ltu:SI (const_int 1) (const_int 0))" is always const0_rtx even when
the mode of the comparison is MODE_CCC.

I believe Segher will review (and hopefully approve) the middle-end
chunk of this patch independently, but hopefully this backend patch
provides the necessary context to explain why that change is needed.


This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32} with
no new failures.  Ok for mainline?


2022-07-08  Roger Sayle  

gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_builtin) :
Use new x86_stc or negqi_ccc_1 instructions to set the carry flag.
* config/i386/i386.md (x86_clc): New define_insn.
(x86_stc): Likewise, new define_insn to set the carry flag.
(*setcc_qi_negqi_ccc_1_): New define_insn_and_split to
recognize (and eliminate) the carry flag being copied to itself.
(neg_ccc_1): Renamed from *neg_ccc_1 for gen function.
(define_peephole2): Use match_operand of flags_reg_operand to
capture and preserve the mode of FLAGS_REG.
(define_peephole2): Likewise.

* simplify-rtx.cc (simplify_const_relational_operation): Handle
case where both operands of a MODE_CC comparison have been
simplified to constant integers.

gcc/testsuite/ChangeLog
* gcc.target/i386/stc-1.c: New test case.


Thanks in advance (both Uros and Segher),
Roger
--

> -Original Message-
> From: Segher Boessenkool 
> Sent: 07 July 2022 23:39
> To: Roger Sayle 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] Be careful with MODE_CC in
> simplify_const_relational_operation.
> 
> Hi!
> 
> On Thu, Jul 07, 2022 at 10:08:04PM +0100, Roger Sayle wrote:
> > I think it's fair to describe RTL's representation of condition flags
> > using MODE_CC as a little counter-intuitive.
> 
> "A little challenging", and you should see that as a good thing, as a
puzzle to
> crack :-)
> 
> > For example, the i386
> > backend represents the carry flag (in adc instructions) using RTL of
> > the form "(ltu:SI (reg:CCC) (const_int 0))", where great care needs to
> > be taken not to treat this like a normal RTX expression, after all LTU
> > (less-than-unsigned) against const0_rtx would normally always be
> > false.
> 
> A comparison of a MODE_CC thing against 0 means the result of a
> *previous* comparison (or other cc setter) is looked at.  Usually it
simply looks
> at some condition bits in a flags register.  It does not do any actual
comparison:
> that has been done before (if at all even).
> 
> > Hence, MODE_CC comparisons need to be treated with caution, and
> > simplify_const_relational_operation returns early (to avoid
> > problems) when GET_MODE_CLASS (GET_MODE (op0)) == MODE_CC.
> 
> Not just to avoid problems: there simply isn't enough information to do a
> correct job.
> 
> > However, consider the (currently) hypothetical situation, where the
> > RTL optimizers dete

[gcc12 backport] PR target/105930: Split *xordi3_doubleword after reload on x86.

2022-07-09 Thread Roger Sayle

This is a backport of the fix for PR target/105930 from mainline to the
gcc12 release branch.  This patch has been retested against the gcc12
branch on x86_64-pc-linux-gnu with make bootstrap and make -k check,
both with and without --target_board=unix{-m32} with no new failures.
Ok for the gcc12 branch?


2022-07-09  Roger Sayle  
Uroš Bizjak  

gcc/ChangeLog
PR target/105930
* config/i386/i386.md (*di3_doubleword): Split after
reload.  Use rtx_equal_p to avoid creating memory-to-memory moves,
and emit NOTE_INSN_DELETED if operand[2] is zero (i.e. with -O0).


Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 7c9560fc4..1c4781d 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -10400,22 +10400,25 @@
   "ix86_expand_binary_operator (, mode, operands); DONE;")
 
 (define_insn_and_split "*di3_doubleword"
-  [(set (match_operand:DI 0 "nonimmediate_operand")
+  [(set (match_operand:DI 0 "nonimmediate_operand" "=ro,r")
(any_or:DI
-(match_operand:DI 1 "nonimmediate_operand")
-(match_operand:DI 2 "x86_64_szext_general_operand")))
+(match_operand:DI 1 "nonimmediate_operand" "0,0")
+(match_operand:DI 2 "x86_64_szext_general_operand" "re,o")))
(clobber (reg:CC FLAGS_REG))]
   "!TARGET_64BIT
-   && ix86_binary_operator_ok (, DImode, operands)
-   && ix86_pre_reload_split ()"
+   && ix86_binary_operator_ok (, DImode, operands)"
   "#"
-  "&& 1"
+  "&& reload_completed"
   [(const_int 0)]
 {
+  /* This insn may disappear completely when operands[2] == const0_rtx
+ and operands[0] == operands[1], which requires a NOTE_INSN_DELETED.  */
+  bool emit_insn_deleted_note_p = false;
+
   split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]);
 
   if (operands[2] == const0_rtx)
-emit_move_insn (operands[0], operands[1]);
+emit_insn_deleted_note_p = true;
   else if (operands[2] == constm1_rtx)
 {
   if ( == IOR)
@@ -10427,7 +10430,10 @@
 ix86_expand_binary_operator (, SImode, &operands[0]);
 
   if (operands[5] == const0_rtx)
-emit_move_insn (operands[3], operands[4]);
+{
+  if (emit_insn_deleted_note_p)
+   emit_note (NOTE_INSN_DELETED);
+}
   else if (operands[5] == constm1_rtx)
 {
   if ( == IOR)


[x86_64 PATCH] Improved Scalar-To-Vector (STV) support for TImode to V1TImode.

2022-07-09 Thread Roger Sayle

This patch upgrades x86_64's scalar-to-vector (STV) pass to more
aggressively transform 128-bit scalar TImode operations into vector
V1TImode operations performed on SSE registers.  TImode functionality
already exists in STV, but only for move operations; this change
brings support for logical operations (AND, IOR, XOR, NOT and ANDN)
and comparisons.

The effect of these changes are conveniently demonstrated by the new
sse4_1-stv-5.c test case:

__int128 a[16];
__int128 b[16];
__int128 c[16];

void foo()
{
  for (unsigned int i=0; i<16; i++)
a[i] = b[i] & ~c[i];
}

which when currently compiled on mainline with -O2 -msse4 produces:

foo:    xorl    %eax, %eax
.L2:    movq    c(%rax), %rsi
        movq    c+8(%rax), %rdi
        addq    $16, %rax
        notq    %rsi
        notq    %rdi
        andq    b-16(%rax), %rsi
        andq    b-8(%rax), %rdi
        movq    %rsi, a-16(%rax)
        movq    %rdi, a-8(%rax)
        cmpq    $256, %rax
        jne     .L2
        ret

but with this patch now produces:

foo:    xorl    %eax, %eax
.L2:    movdqa  c(%rax), %xmm0
        pandn   b(%rax), %xmm0
        addq    $16, %rax
        movaps  %xmm0, a-16(%rax)
        cmpq    $256, %rax
        jne     .L2
        ret

Technically, the STV pass is implemented by three C++ classes, a common
abstract base class "scalar_chain" that contains common functionality,
and two derived classes: general_scalar_chain (which handles SI and
DI modes) and timode_scalar_chain (which handles TI modes).  As
mentioned previously, because only TI mode moves were handled the
two worker classes behaved significantly differently.  These changes
bring the functionality of these two classes closer together, which
is reflected by refactoring more shared code from general_scalar_chain
to the parent scalar_chain and reusing it from timode.  There still
remain significant differences (and simplifications) so the existing
division of classes (as specializations) continues to make sense.

Obviously, there are more changes to come (shifts and rotates),
and compute_convert_gain doesn't yet have its final (tuned) form,
but is already an improvement over the "return 1;" used previously.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?


2022-07-09  Roger Sayle  

gcc/ChangeLog
* config/i386/i386-features.h (scalar_chain): Add fields
insns_conv, n_sse_to_integer and n_integer_to_sse to this
parent class, moved from general_scalar_chain.
(scalar_chain::convert_compare): Protected method moved
from general_scalar_chain.
(mark_dual_mode_def): Make protected, not private virtual.
(scalar_chain::convert_op): New private virtual method.

(general_scalar_chain::general_scalar_chain): Simplify constructor.
(general_scalar_chain::~general_scalar_chain): Delete destructor.
(general_scalar_chain): Move insns_conv, n_sse_to_integer and
n_integer_to_sse fields to parent class, scalar_chain.
(general_scalar_chain::mark_dual_mode_def): Delete prototype.
(general_scalar_chain::convert_compare): Delete prototype.

(timode_scalar_chain::compute_convert_gain): Remove simplistic
implementation, convert to a method prototype.
(timode_scalar_chain::mark_dual_mode_def): Delete prototype.
(timode_scalar_chain::convert_op): Prototype new virtual method.

* config/i386/i386-features.cc (scalar_chain::scalar_chain):
Allocate insns_conv and initialize n_sse_to_integer and
n_integer_to_sse fields in constructor.
(scalar_chain::scalar_chain): Free insns_conv in destructor.

(general_scalar_chain::general_scalar_chain): Delete
constructor, now defined in the class declaration.
(general_scalar_chain::~general_scalar_chain): Delete destructor.

(scalar_chain::mark_dual_mode_def): Renamed from
general_scalar_chain::mark_dual_mode_def.
(timode_scalar_chain::mark_dual_mode_def): Delete.
(scalar_chain::convert_compare): Renamed from
general_scalar_chain::convert_compare.

(timode_scalar_chain::compute_convert_gain): New method to
determine the gain from converting a TImode chain to V1TImode.
(timode_scalar_chain::convert_op): New method to convert an
operand from TImode to V1TImode.

(timode_scalar_chain::convert_insn) : Only PUT_MODE
on REG_EQUAL notes that were originally TImode (not CONST_INT).
Handle AND, ANDN, XOR, IOR, NOT and COMPARE.
(timode_mem_p): Helper predicate to check where operand is
memory reference with sufficient alignment for TImode STV.
(timode_scalar_to_vector_candidate_p): Use convertible_comparison_p
to check whether COMPARE is convertible.  Handle SET_DESTs that
that are

[PATCH] Move reload_completed and other rtl.h globals to crtl structure.

2022-07-10 Thread Roger Sayle

This patch builds upon Richard Biener's suggestion of avoiding global
variables to track state/identify which passes have already been run.
In the early middle-end, the tree-ssa passes use the curr_properties
field in cfun to track this.  This patch uses a new rtl_pass_progress
int field in crtl to do something similar.

This patch allows the global variables lra_in_progress, reload_in_progress,
reload_completed, epilogue_completed and regstack_completed to be removed
from rtl.h and implemented as bits within the new crtl->rtl_pass_progress.
I've also taken the liberty of adding a new combine_completed bit at the
same time [to respond the Segher's comment it's easy to change this to
combine1_completed and combine2_completed if we ever perform multiple
combine passes (or multiple reload/regstack passes)].  At the same time,
I've also refactored bb_reorder_complete into the same new field;
interestingly bb_reorder_complete was already a bool in crtl.

One very minor advantage of this implementation/refactoring is that the
predicate "can_create_pseudo_p ()" which is semantically defined to be
!reload_in_progress && !reload_completed, can now be performed very
efficiently as effectively the test (progress & 12) == 0, i.e. a single
test instruction on x86.
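
As a rough illustration (the macro names and bit values below are my
own assumptions, purely to show the shape of the test, not the patch's
actual definitions), the predicate reduces to a single mask test:

#include <stdio.h>

#define PROGRESS_combine_completed   0x01u
#define PROGRESS_lra_in_progress     0x02u
#define PROGRESS_reload_in_progress  0x04u
#define PROGRESS_reload_completed    0x08u

/* With the layout above, this is literally (progress & 12) == 0.  */
static int can_create_pseudo_p (unsigned int progress)
{
  return (progress
          & (PROGRESS_reload_in_progress | PROGRESS_reload_completed)) == 0;
}

int main (void)
{
  printf ("%d\n", can_create_pseudo_p (PROGRESS_combine_completed));  /* 1 */
  printf ("%d\n", can_create_pseudo_p (PROGRESS_reload_completed));   /* 0 */
  return 0;
}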

For consistency, I've also moved cse_not_expected (the last remaining
global variable in rtl.h) into crtl, as its own bool field.

The vast majority of this patch is then churn to handle these changes.
Thanks to macros, most code is unaffected, assuming it treats those
global variables as r-values, though some source files required/may
require tweaks as these "variables" are now defined in emit-rtl.h
instead of rtl.h.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Might this clean-up be acceptable in stage 1,
given the possible temporary disruption transitioning some backends?
I'll start checking various backends myself with cross-compilers, but if
Jeff Law could spin this patch on his build farm, that would help
identify targets that need attention.


2022-07-10  Roger Sayle  

gcc/ChangeLog
* bb-reorder.cc (reorder_basic_blocks): bb_reorder_complete is
now a bit in crtl->rtl_pass_progress.
* cfgrtl.cc (rtl_split_edge): Likewise.
(fixup_partitions): Likewise.
(verify_hot_cold_block_grouping): Likewise.
(cfg_layout_initialize): Likewise.
 * combine.cc (rest_of_handle_combine): Set combine_completed
bit in crtl->rtl_pass_progress.
* cse.cc (rest_of_handle_cse): cse_not_expected is now a field
in crtl.
(rest_of_handle_cse2): Likewise.
(rest_of_handle_cse_after_global_opts): Likewise.
* df-problems.cc: Include emit-rtl.h to access RTL pass progress
variables.

* emit-rtl.h (PROGRESS_reload_completed): New bit masks.
(rtl_data::rtl_pass_progress): New integer field to track progress.
(rtl_data::bb_reorder_complete): Delete, no part of
rtl_pass_progress.
(rtl_data::cse_not_expected): New bool field, previously a global
variable.
(crtl_pass_progress): New convenience macro.
(combine_completed): New macro.
(lra_in_progress): New macro replacing global variable.
(reload_in_progress): Likewise.
(reload_completed): Likewise.
(bb_reorder_complete): New macro replacing bool field in crtl.
(epilogue_completed): New macro replacing global variable.
(regstack_completed): Likewise.
(can_create_pseudo_p): Move from rtl.h and update definition.

* explow.cc (memory_address_addr_space): cse_not_expected is now
a field in crtl.
(use_anchored_address): Likewise.
* final.c (rest_of_clean_state): Reset crtl->rtl_pass_progress
to zero.
* function.cc (prepare_function_start): cse_not_expected is now
a field in crtl.
(thread_prologue_and_epilogue_insns): epilogue_completed is now
a bit in crtl->rtl_pass_progress.
* ifcvt.cc (noce_try_cmove_arith): cse_not_expected is now a
field in crtl.
* lra-eliminations.cc (init_elim_table): lra_in_progress is now
a bit in crtl->rtl_pass_progress.
* lra.cc (lra_in_progress): Delete global variable.
(lra): lra_in_progress and reload_completed are now bits in
crtl->rtl_pass_progress.
* modulo-sched.cc (sms_schedule): reload_completed is now a bit
in crtl->rtl_pass_progress.
* passes.cc (skip_pass): reload_completed and epilogue_completed
are now bits in crtl->rtl_pass_progress.
* recog.cc (reload_completed): Delete global variable.
(epilogue_completed): Likewise.
* reg-stack.cc (regstack_completed): Likewise.
(rest_of_handle_stack_r

RE: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for TImode to V1TImode.

2022-07-10 Thread Roger Sayle


Hi Uros,
Yes, I agree.  I think it makes sense to have a single STV pass (after
combine and before reload).  Let's hear what HJ thinks, but I'm
happy to investigate a follow-up patch that unifies the STV passes.
But it'll be easier to confirm there are no "code generation" changes
if those modifications are pushed independently of these ones.
Time to look into the (git) history of multiple STV passes...

Thanks for the review.  I'll wait for HJ's thoughts.
Cheers,
Roger
--

> -Original Message-
> From: Uros Bizjak 
> Sent: 10 July 2022 19:06
> To: Roger Sayle 
> Cc: gcc-patches@gcc.gnu.org; H. J. Lu 
> Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for
> TImode to V1TImode.
> 
> On Sat, Jul 9, 2022 at 2:17 PM Roger Sayle 
> wrote:
> >
> >
> > This patch upgrades x86_64's scalar-to-vector (STV) pass to more
> > aggressively transform 128-bit scalar TImode operations into vector
> > V1TImode operations performed on SSE registers.  TImode functionality
> > already exists in STV, but only for move operations, this changes
> > brings support for logical operations (AND, IOR, XOR, NOT and ANDN)
> > and comparisons.
> >
> > The effect of these changes are conveniently demonstrated by the new
> > sse4_1-stv-5.c test case:
> >
> > __int128 a[16];
> > __int128 b[16];
> > __int128 c[16];
> >
> > void foo()
> > {
> >   for (unsigned int i=0; i<16; i++)
> > a[i] = b[i] & ~c[i];
> > }
> >
> > which when currently compiled on mainline wtih -O2 -msse4 produces:
> >
> > foo:xorl%eax, %eax
> > .L2:movqc(%rax), %rsi
> > movqc+8(%rax), %rdi
> > addq$16, %rax
> > notq%rsi
> > notq%rdi
> > andqb-16(%rax), %rsi
> > andqb-8(%rax), %rdi
> > movq%rsi, a-16(%rax)
> > movq%rdi, a-8(%rax)
> > cmpq$256, %rax
> > jne .L2
> > ret
> >
> > but with this patch now produces:
> >
> > foo:xorl%eax, %eax
> > .L2:movdqa  c(%rax), %xmm0
> > pandn   b(%rax), %xmm0
> > addq$16, %rax
> > movaps  %xmm0, a-16(%rax)
> > cmpq$256, %rax
> > jne .L2
> > ret
> >
> > Technically, the STV pass is implemented by three C++ classes, a
> > common abstract base class "scalar_chain" that contains common
> > functionality, and two derived classes: general_scalar_chain (which
> > handles SI and DI modes) and timode_scalar_chain (which handles TI
> > modes).  As mentioned previously, because only TI mode moves were
> > handled the two worker classes behaved significantly differently.
> > These changes bring the functionality of these two classes closer
> > together, which is reflected by refactoring more shared code from
> > general_scalar_chain to the parent scalar_chain and reusing it from
> > timode.  There still remain significant differences (and
> > simplifications) so the existing division of classes (as specializations) 
> > continues
> to make sense.
> 
> Please note that there are in fact two STV passes, one before combine and the
> other after combine. The TImode pass that previously handled only loads and
> stores is positioned before combine (there was a reason for this decision, 
> but I
> don't remember the details - let's ask HJ...). However, DImode STV pass
> transforms much more instructions and the reason it was positioned after the
> combine pass was that STV pass transforms optimized insn stream where
> forward propagation was already performed.
> 
> What is not clear to me from the above explanation is: is the new TImode STV
> pass positioned after the combine pass, and if this is the case, how the 
> change
> affects current load/store TImode STV pass. I must admit, I don't like two
> separate STV passess, so if TImode is now similar to DImode, I suggest we
> abandon STV1 pass and do everything concerning TImode after the combine
> pass. HJ, what is your opinion on this?
> 
> Other than the above, the patch LGTM to me.
> 
> Uros.
> 
> > Obviously, there are more changes to come (shifts and rotates), and
> > compute_convert_gain doesn't yet have its final (tuned) form, but is
> > already an improvement over the "return 1;" used previously.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make boostrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failure

RE: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for TImode to V1TImode.

2022-07-10 Thread Roger Sayle


Hi HJ,

I believe this should now be handled by the post-reload (CSE) pass.
Consider the simple test case:

__int128 a, b, c;
void foo()
{
  a = 0;
  b = 0;
  c = 0;
}

Without any STV, i.e. -O2 -msse4 -mno-stv, GCC gets TI mode writes:
movq    $0, a(%rip)
movq    $0, a+8(%rip)
movq    $0, b(%rip)
movq    $0, b+8(%rip)
movq    $0, c(%rip)
movq    $0, c+8(%rip)
ret

But with STV, i.e. -O2 -msse4, things get converted to V1TI mode:
pxor    %xmm0, %xmm0
movaps  %xmm0, a(%rip)
movaps  %xmm0, b(%rip)
movaps  %xmm0, c(%rip)
ret

You're quite right; internally the STV actually generates the equivalent of:
pxor    %xmm0, %xmm0
movaps  %xmm0, a(%rip)
pxor    %xmm0, %xmm0
movaps  %xmm0, b(%rip)
pxor    %xmm0, %xmm0
movaps  %xmm0, c(%rip)
ret

And currently, because STV runs before cse2 and combine, the const0_rtx
gets CSE'd by the cse2 pass to produce the code we see.  However, if you
specify -fno-rerun-cse-after-loop (to disable the cse2 pass), you'll see we
continue to generate the same optimized code, as the same const0_rtx
gets CSE'd in postreload.

I can't be certain until I try the experiment, but I believe that the postreload
CSE will clean up all of the same common subexpressions.  Hence, it should
be safe to perform all STV at the same point (after combine), which allows
for a few additional optimizations.

Does this make sense?  Do you have a test case, -fno-rerun-cse-after-loop
produces different/inferior code for TImode STV chains?

My guess is that the RTL passes have changed so much in the last six or
seven years, that some of the original motivation no longer applies.
Certainly we now try to keep TI mode operations visible longer, and
then allow STV to behave like a pre-reload pass to decide which set of
registers to use (vector V1TI or scalar doubleword DI).  Any CSE opportunities
that cse2 finds with V1TI mode, could/should equally well be found for
TI mode (mostly).

Cheers,
Roger
--

> -Original Message-
> From: H.J. Lu 
> Sent: 10 July 2022 20:15
> To: Roger Sayle 
> Cc: Uros Bizjak ; GCC Patches 
> Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for
> TImode to V1TImode.
> 
> On Sun, Jul 10, 2022 at 11:36 AM Roger Sayle 
> wrote:
> >
> >
> > Hi Uros,
> > Yes, I agree.  I think it makes sense to have a single STV pass (after
> > combine and before reload).  Let's hear what HJ thinks, but I'm happy
> > to investigate a follow-up patch that unifies the STV passes.
> > But it'll be easier to confirm there are no "code generation" changes
> > if those modifications are pushed independently of these ones.
> > Time to look into the (git) history of multiple STV passes...
> >
> > Thanks for the review.  I'll wait for HJ's thoughts.
> 
> The TImode STV pass is run before the CSE pass so that instructions changed or
> generated by the STV pass can be CSEed.
> 
> > Cheers,
> > Roger
> > --
> >
> > > -Original Message-
> > > From: Uros Bizjak 
> > > Sent: 10 July 2022 19:06
> > > To: Roger Sayle 
> > > Cc: gcc-patches@gcc.gnu.org; H. J. Lu 
> > > Subject: Re: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support
> > > for TImode to V1TImode.
> > >
> > > On Sat, Jul 9, 2022 at 2:17 PM Roger Sayle
> > > 
> > > wrote:
> > > >
> > > >
> > > > This patch upgrades x86_64's scalar-to-vector (STV) pass to more
> > > > aggressively transform 128-bit scalar TImode operations into
> > > > vector V1TImode operations performed on SSE registers.  TImode
> > > > functionality already exists in STV, but only for move operations,
> > > > this changes brings support for logical operations (AND, IOR, XOR,
> > > > NOT and ANDN) and comparisons.
> > > >
> > > > The effect of these changes are conveniently demonstrated by the
> > > > new sse4_1-stv-5.c test case:
> > > >
> > > > __int128 a[16];
> > > > __int128 b[16];
> > > > __int128 c[16];
> > > >
> > > > void foo()
> > > > {
> > > >   for (unsigned int i=0; i<16; i++)
> > > > a[i] = b[i] & ~c[i];
> > > > }
> > > >
> > > > which when currently compiled on mainline wtih -O2 -msse4 produces:
> > > >
> > > > foo:xorl%eax, %eax
> > > > .L2:movqc(%rax), %rsi
> > > > movqc+8(%rax), %rdi
> > > > addq$16, %rax
> > > >   

RE: [PATCH] Move reload_completed and other rtl.h globals to crtl structure.

2022-07-11 Thread Roger Sayle
On 11 July 2022 08:20, Richard Biener wrote:
> On Sun, 10 Jul 2022, Roger Sayle wrote:
> 
> > This patch builds upon Richard Biener's suggestion of avoiding global
> > variables to track state/identify which passes have already been run.
> > In the early middle-end, the tree-ssa passes use the curr_properties
> > field in cfun to track this.  This patch uses a new rtl_pass_progress
> > int field in crtl to do something similar.
> 
> Why not simply add PROP_rtl_... and use the existing curr_properties for
this?
> RTL passes are also passes and this has the advantage you can add things
like
> reload_completed to the passes properties_set field hand have the flag
setting
> handled by the pass manager as it was intended?
> 

Great question, and I did initially consider simply adding more RTL fields
to curr_properties.  My hesitation was from the comments/documentation that
the curr_properties field is used by the pass manager as a way to track
(and verify) the properties/invariants that are required, provided and
destroyed by each pass.  This semantically makes sense for properties such
as accurate data flow, ssa form, cfg_layout, nonzero_bits etc, where
hypothetically the pass manager can dynamically schedule a pass/analysis
to ensure the next pass has the pre-requisite information it needs.

This seems semantically slightly different from tracking time/progress,
where we really want something more like DEBUG_COUNTER that simply provides
the "tick-tock" of a pass clock.  Alas, the "pass number", as used in the
suffix of dump-files (where 302 currently means reload) isn't particularly
useful as these change/evolve continually.

Perhaps the most obvious indication of this (subtle) difference is the
curr_properties field (PROP_rtl_split_insns) which tracks whether
instructions have been split, where at a finer granularity rtl_pass_progress
may wish to distinguish split1 (after combine before reload), split2 (after
reload before peephole2) and split3 (after peephole2).  It's conceptually
not a simple property, whether all insns have been split or not, as in
practice this is more subtle with backends choosing which instructions get
split at which "times".

There's also the concern that we've a large number of passes (currently
62 RTL passes), and only a finite number of bits (in curr_properties), so
having two integers reduces the risk of running out of bits and needing
to use a "wider" data structure.

To be honest, I just didn't want to hijack curr_properties to abuse it for
a use that didn't quite match the original intention, without checking with
you and the other maintainers first.  If the above reasoning isn't
convincing, I can try spinning an alternate patch using curr_properties
(but I'd expect even more churn as backend source files would now need to
#include tree-passes.h and function.h to get reload_completed).  And of
course, a volunteer is welcome to contribute that re-refactoring after
this one.

I've no strong feelings either way.  It was an almost arbitrary engineering
decision (but hopefully the above explains the balance of my reasoning).

Roger
--




RE: [x86_64 PATCH] Improved Scalar-To-Vector (STV) support for TImode to V1TImode.

2022-07-13 Thread Roger Sayle


On Mon, Jul 11, 2022, H.J. Lu  wrote:
> On Sun, Jul 10, 2022 at 2:38 PM Roger Sayle 
> wrote:
> > Hi HJ,
> >
> > I believe this should now be handled by the post-reload (CSE) pass.
> > Consider the simple test case:
> >
> > __int128 a, b, c;
> > void foo()
> > {
> >   a = 0;
> >   b = 0;
> >   c = 0;
> > }
> >
> > Without any STV, i.e. -O2 -msse4 -mno-stv, GCC get TI mode writes:
> > movq$0, a(%rip)
> > movq$0, a+8(%rip)
> > movq$0, b(%rip)
> > movq$0, b+8(%rip)
> > movq$0, c(%rip)
> > movq$0, c+8(%rip)
> > ret
> >
> > But with STV, i.e. -O2 -msse4, things get converted to V1TI mode:
> > pxor%xmm0, %xmm0
> > movaps  %xmm0, a(%rip)
> > movaps  %xmm0, b(%rip)
> > movaps  %xmm0, c(%rip)
> > ret
> >
> > You're quite right internally the STV actually generates the equivalent of:
> > pxor%xmm0, %xmm0
> > movaps  %xmm0, a(%rip)
> > pxor%xmm0, %xmm0
> > movaps  %xmm0, b(%rip)
> > pxor%xmm0, %xmm0
> > movaps  %xmm0, c(%rip)
> > ret
> >
> > And currently because STV run before cse2 and combine, the const0_rtx
> > gets CSE'd be the cse2 pass to produce the code we see.  However, if
> > you specify -fno-rerun-cse-after-loop (to disable the cse2 pass),
> > you'll see we continue to generate the same optimized code, as the
> > same const0_rtx gets CSE'd in postreload.
> >
> > I can't be certain until I try the experiment, but I believe that the
> > postreload CSE will clean-up, all of the same common subexpressions.
> > Hence, it should be safe to perform all STV at the same point (after
> > combine), which for a few additional optimizations.
> >
> > Does this make sense?  Do you have a test case,
> > -fno-rerun-cse-after-loop produces different/inferior code for TImode STV
> chains?
> >
> > My guess is that the RTL passes have changed so much in the last six
> > or seven years, that some of the original motivation no longer applies.
> > Certainly we now try to keep TI mode operations visible longer, and
> > then allow STV to behave like a pre-reload pass to decide which set of
> > registers to use (vector V1TI or scalar doubleword DI).  Any CSE
> > opportunities that cse2 finds with V1TI mode, could/should equally
> > well be found for TI mode (mostly).
> 
> You are probably right.  If there are no regressions in GCC testsuite, my 
> original
> motivation is no longer valid.

It was good to try the experiment, but H.J. is right, there is still some
benefit (as well as some disadvantages) to running STV lowering before
CSE2/combine.  A clean-up patch to perform all STV conversion as a single
pass (removing a pass from the compiler) results in just a single
regression in the test suite:
FAIL: gcc.target/i386/pr65105-5.c scan-assembler-times movv1ti_internal 8
which looks like:

__int128 a, b, c, d, e, f;
void foo (void)
{
  a = 0;
  b = -1;
  c = 0;
  d = -1;
  e = 0;
  f = -1;
}

By performing STV after combine (without CSE), reload prefers to implement
this function using a single register, that then requires 12 instructions rather
than 8 (if using two registers).  Alas there's nothing that postreload CSE/GCSE
can do.  Doh!

pxor    %xmm0, %xmm0
movaps  %xmm0, a(%rip)
pcmpeqd %xmm0, %xmm0
movaps  %xmm0, b(%rip)
pxor    %xmm0, %xmm0
movaps  %xmm0, c(%rip)
pcmpeqd %xmm0, %xmm0
movaps  %xmm0, d(%rip)
pxor    %xmm0, %xmm0
movaps  %xmm0, e(%rip)
pcmpeqd %xmm0, %xmm0
movaps  %xmm0, f(%rip)
ret

I also note that even without STV, the scalar implementation of this
function when compiled with -Os is also larger than it needs to be due to
poor CSE (notice in the following we only need a single zero register, and
an all_ones reg would be helpful).

xorl    %eax, %eax
xorl    %edx, %edx
xorl    %ecx, %ecx
movq    $-1, b(%rip)
movq    %rax, a(%rip)
movq    %rax, a+8(%rip)
movq    $-1, b+8(%rip)
movq    %rdx, c(%rip)
movq    %rdx, c+8(%rip)
movq    $-1, d(%rip)
movq    $-1, d+8(%rip)
movq    %rcx, e(%rip)
movq    %rcx, e+8(%rip)
movq    $-1, f(%rip)
movq    $-1, f+8(%rip)
ret

I need to give the problem some more thought.  It would be good to
clean-up/unify the STV passes, but I/we need to solve/CSE HJ's last test
case before we do.  Perhaps forbidding "(set (mem:ti) (const_int 0))" in
movti_internal would force the zero register to become visible, and CSE'd,
benefiting both vector code and scalar -Os code; we could then use
postreload/peephole2 to fix up the remaining scalar cases.  It's tricky.

Cheers,
Roger
--




[PATCH] PR target/106278: Keep REG_EQUAL notes consistent during TImode STV.

2022-07-14 Thread Roger Sayle

This patch resolves PR target/106278 a regression on x86_64 caused by my
recent TImode STV improvements.  Now that TImode STV can handle comparisons
such as "(set (regs:CC) (compare:CC (reg:TI) ...))" the convert_insn method
sensibly checks that the mode of the SET_DEST is TImode before setting
it to V1TImode [to avoid V1TImode appearing on the hard reg CC_FLAGS].

Hence the current code looks like:

  if (GET_MODE (dst) == TImode)
{
  tmp = find_reg_equal_equiv_note (insn);
  if (tmp && GET_MODE (XEXP (tmp, 0)) == TImode)
PUT_MODE (XEXP (tmp, 0), V1TImode);
  PUT_MODE (dst, V1TImode);
  fix_debug_reg_uses (dst);
}
  break;

which checks GET_MODE (dst) before calling PUT_MODE and, when a
change is made, updates the REG_EQUAL_NOTE tmp if it exists.

The logical flaw (oversight) is that due to RTL sharing, the destination
of this set may already have been updated to V1TImode, as this chain is
being converted, but we still need to update any REG_EQUAL_NOTE that
still has TImode.  Hence the correct code is actually:


  if (GET_MODE (dst) == TImode)
{
  PUT_MODE (dst, V1TImode);
  fix_debug_reg_uses (dst);
}
  if (GET_MODE (dst) == V1TImode)
{
  tmp = find_reg_equal_equiv_note (insn);
  if (tmp && GET_MODE (XEXP (tmp, 0)) == TImode)
PUT_MODE (XEXP (tmp, 0), V1TImode);
}
  break;

While fixing this behavior, I noticed I had some indentation whitespace
issues and some vestigial dead code in this function/method that I've
taken the liberty of cleaning up (as obvious) in this patch.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Ok for mainline?


2022-07-14  Roger Sayle  

gcc/ChangeLog
PR target/106278
* config/i386/i386-features.cc (general_scalar_chain::convert_insn):
Fix indentation whitespace.
(timode_scalar_chain::fix_debug_reg_uses): Likewise.
(timode_scalar_chain::convert_insn): Delete dead code.
Update TImode REG_EQUAL_NOTE even if the SET_DEST is already V1TI.
Fix indentation whitespace.
(convertible_comparison_p): Likewise.
(timode_scalar_to_vector_candidate_p): Likewise.

gcc/testsuite/ChangeLog
PR target/106278
* gcc.dg/pr106278.c: New test case.


Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
index f1b03c3..813b203 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -1054,13 +1054,13 @@ general_scalar_chain::convert_insn (rtx_insn *insn)
   else if (REG_P (dst) && GET_MODE (dst) == smode)
 {
   /* Replace the definition with a SUBREG to the definition we
- use inside the chain.  */
+use inside the chain.  */
   rtx *vdef = defs_map.get (dst);
   if (vdef)
dst = *vdef;
   dst = gen_rtx_SUBREG (vmode, dst, 0);
   /* IRA doesn't like to have REG_EQUAL/EQUIV notes when the SET_DEST
- is a non-REG_P.  So kill those off.  */
+is a non-REG_P.  So kill those off.  */
   rtx note = find_reg_equal_equiv_note (insn);
   if (note)
remove_note (insn, note);
@@ -1246,7 +1246,7 @@ timode_scalar_chain::fix_debug_reg_uses (rtx reg)
 {
   rtx_insn *insn = DF_REF_INSN (ref);
   /* Make sure the next ref is for a different instruction,
- so that we're not affected by the rescan.  */
+so that we're not affected by the rescan.  */
   next = DF_REF_NEXT_REG (ref);
   while (next && DF_REF_INSN (next) == insn)
next = DF_REF_NEXT_REG (next);
@@ -1336,21 +1336,19 @@ timode_scalar_chain::convert_insn (rtx_insn *insn)
   rtx dst = SET_DEST (def_set);
   rtx tmp;
 
-  if (MEM_P (dst) && !REG_P (src))
-{
-  /* There are no scalar integer instructions and therefore
-temporary register usage is required.  */
-}
   switch (GET_CODE (dst))
 {
 case REG:
   if (GET_MODE (dst) == TImode)
{
+ PUT_MODE (dst, V1TImode);
+ fix_debug_reg_uses (dst);
+   }
+  if (GET_MODE (dst) == V1TImode)
+   {
  tmp = find_reg_equal_equiv_note (insn);
  if (tmp && GET_MODE (XEXP (tmp, 0)) == TImode)
PUT_MODE (XEXP (tmp, 0), V1TImode);
- PUT_MODE (dst, V1TImode);
- fix_debug_reg_uses (dst);
}
   break;
 case MEM:
@@ -1410,8 +1408,8 @@ timode_scalar_chain::convert_insn (rtx_insn *insn)
   if (MEM_P (dst))
{
  tmp = gen_reg_rtx (V1TImode);
-  emit_insn_before (gen_rtx_SET (tmp, src), insn);
-  src = tmp;
+ emit_insn_before (gen_rtx_SET (tmp, src), insn);
+ src = tmp;
}
   break;
 
@@ -1434,8 +1432

[x86 PATCH] PR target/106273: Add earlyclobber to *andn3_doubleword_bmi

2022-07-15 Thread Roger Sayle
 

This patch resolves PR target/106273 which is a wrong code regression
caused by the recent reorganization to split doubleword operations after
reload on x86.  For the failing test case, the constraints on the
andnti3_doubleword_bmi pattern allow reload to allocate the output and
operand in overlapping but non-identical registers, i.e.

(insn 45 44 66 2 (parallel [
        (set (reg/v:TI 5 di [orig:96 i ] [96])
            (and:TI (not:TI (reg:TI 39 r11 [orig:83 _2 ] [83]))
                (reg/v:TI 4 si [orig:100 i ] [100])))
        (clobber (reg:CC 17 flags))
    ]) "pr106273.c":13:5 562 {*andnti3_doubleword_bmi}

where the output is in registers 5 and 6, and the second operand is
registers 4 and 5, which then leads to the incorrect split:

(insn 113 44 114 2 (parallel [
        (set (reg:DI 5 di [orig:96 i ] [96])
            (and:DI (not:DI (reg:DI 39 r11 [orig:83 _2 ] [83]))
                (reg:DI 4 si [orig:100 i ] [100])))
        (clobber (reg:CC 17 flags))
    ]) "pr106273.c":13:5 566 {*andndi_1}

(insn 114 113 66 2 (parallel [
        (set (reg:DI 6 bp [ i+8 ])
            (and:DI (not:DI (reg:DI 40 r12 [ _2+8 ]))
                (reg:DI 5 di [ i+8 ])))
        (clobber (reg:CC 17 flags))
    ]) "pr106273.c":13:5 566 {*andndi_1}

[Notice that reg:DI 5 is set in the first instruction, but assumed
to have its original value in the second].  My first thought was
that this could be fixed by swapping the order of the split instructions
(which works in this case), but in the general case, it's impossible
to handle (set (reg:TI x) (op (reg:TI x+1) (reg:TI x-1)).  Hence for
correctness this pattern needs an earlyclobber "=&r", but we can also
allow cases where the output is the same as one of the operands (using
constraint "0").  The other binary logic operations (AND, IOR, XOR)
are unaffected as they constrain the output to match the first
operand, but BMI's andn is a three-operand instruction which can
lead to the overlapping cases described above.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32} with
no new failures.  Ok for mainline?


2022-07-15  Roger Sayle  

gcc/ChangeLog
PR target/106273
* config/i386/i386.md (*andn3_doubleword_bmi): Update the
constraints to reflect the output is earlyclobber, unless it is
the same register (pair) as one of the operands.

gcc/testsuite/ChangeLog
PR target/106273
* gcc.target/i386/pr106273.c: New test case.


Thanks again, and sorry for the inconvenience.
Roger
--

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 3b02d0c..585b2d5 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -10423,10 +10423,10 @@
 })
 
 (define_insn_and_split "*andn3_doubleword_bmi"
-  [(set (match_operand: 0 "register_operand" "=r")
+  [(set (match_operand: 0 "register_operand" "=&r,r,r")
(and:
- (not: (match_operand: 1 "register_operand" "r"))
- (match_operand: 2 "nonimmediate_operand" "ro")))
+ (not: (match_operand: 1 "register_operand" "r,0,r"))
+ (match_operand: 2 "nonimmediate_operand" "ro,ro,0")))
(clobber (reg:CC FLAGS_REG))]
   "TARGET_BMI"
   "#"
diff --git a/gcc/testsuite/gcc.target/i386/pr106273.c 
b/gcc/testsuite/gcc.target/i386/pr106273.c
new file mode 100644
index 000..8c2fbbb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr106273.c
@@ -0,0 +1,27 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-Og -march=cascadelake" } */
+typedef unsigned char u8;
+typedef unsigned short u16;
+typedef unsigned long long u64;
+
+u8 g;
+
+void
+foo (__int128 i, u8 *r)
+{
+  u16 a = __builtin_sub_overflow_p (0, i * g, 0);
+  i ^= g & i;
+  u64 s = (i >> 64) + i;
+  *r = ((union { u16 a; u8 b[2]; }) a).b[1] + s;
+}
+
+int
+main (void)
+{
+  u8 x;
+  foo (5, &x);
+  if (x != 5)
+__builtin_abort ();
+  return 0;
+}
+/* { dg-final { scan-assembler-not "andn\[ \\t\]+%rdi, %r11, %rdi" } } */


[x86 PATCH] Fix issue with x86_64_const_vector_operand predicate.

2022-07-16 Thread Roger Sayle

This patch fixes (what I believe is) a latent bug in i386.md's
x86_64_const_vector_operand define_predicate.  According to the
documentation, when a predicate is called with rtx operand OP and
machine_mode operand MODE, we can't shouldn't assume that the
MODE is (or has been checked to be) GET_MODE (OP).

The failure mode is that recog can call x86_64_const_vector_operand
on an arbitrary CONST_VECTOR passing a MODE of V2QI_mode, but when
the CONST_VECTOR is in fact V1TImode, it's unsafe to directly call
ix86_convert_const_vector_to_integer, which assumes that the CONST_VECTOR
contains CONST_INTs when it actually contains CONST_WIDE_INTs.  The
checks in this define_predicate need to be testing OP's mode, and
ideally confirming that this matches the passed in/specified MODE.

This bug is currently latent, but adding an innocent/unrelated
define_insn, such as "(set (reg:CCC FLAGS_REG) (const_int 0))" to
i386.md can occasionally change the order in which genrecog generates
its tests, leading to an ICE during bootstrap due to V1TI CONST_VECTORs.


This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Ok for mainline?

2022-07-16  Roger Sayle  

gcc/ChangeLog
* config/i386/predicates.md (x86_64_const_vector_operand):
Check the operand's mode matches the specified mode argument.


Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
index c71c453..42053ea 100644
--- a/gcc/config/i386/predicates.md
+++ b/gcc/config/i386/predicates.md
@@ -1199,6 +1199,10 @@
 (define_predicate "x86_64_const_vector_operand"
   (match_code "const_vector")
 {
+  if (mode == VOIDmode)
+mode = GET_MODE (op);
+  else if (GET_MODE (op) != mode)
+return false;
   if (GET_MODE_SIZE (mode) > UNITS_PER_WORD)
 return false;
   HOST_WIDE_INT val = ix86_convert_const_vector_to_integer (op, mode);


[middle-end PATCH] PR c/106264: Silence warnings from __builtin_modf et al.

2022-07-16 Thread Roger Sayle

This middle-end patch resolves PR c/106264 which is a spurious warning
regression caused by the tree-level expansion of modf, frexp and remquo
producing "expression has no-effect" when the built-in function's result
is ignored.  When these built-ins were first expanded at tree-level,
fold_builtin_n would blindly set TREE_NO_WARNING for all built-ins.
Now that we're more discerning, we should precisely set TREE_NO_WARNING
selectively on those COMPOUND_EXPRs that need them.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check with no new failures.  Ok for mainline?

2022-07-16  Roger Sayle  

gcc/ChangeLog
PR c/106264
* builtins.cc (fold_builtin_frexp): Set TREE_NO_WARNING on
COMPOUND_EXPR to silence spurious warning if result isn't used.
(fold_builtin_modf): Likewise.
(do_mpfr_remquo): Likewise.

gcc/testsuite/ChangeLog
PR c/106264
* gcc.dg/pr106264.c: New test case.


Thanks in advance,
Roger
--

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index 35b9197..c745777 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -8625,7 +8625,7 @@ fold_builtin_frexp (location_t loc, tree arg0, tree arg1, 
tree rettype)
   if (TYPE_MAIN_VARIANT (TREE_TYPE (arg1)) == integer_type_node)
 {
   const REAL_VALUE_TYPE *const value = TREE_REAL_CST_PTR (arg0);
-  tree frac, exp;
+  tree frac, exp, res;
 
   switch (value->cl)
   {
@@ -8656,7 +8656,9 @@ fold_builtin_frexp (location_t loc, tree arg0, tree arg1, 
tree rettype)
   /* Create the COMPOUND_EXPR (*arg1 = trunc, frac). */
   arg1 = fold_build2_loc (loc, MODIFY_EXPR, rettype, arg1, exp);
   TREE_SIDE_EFFECTS (arg1) = 1;
-  return fold_build2_loc (loc, COMPOUND_EXPR, rettype, arg1, frac);
+  res = fold_build2_loc (loc, COMPOUND_EXPR, rettype, arg1, frac);
+  TREE_NO_WARNING (res) = 1;
+  return res;
 }
 
   return NULL_TREE;
@@ -8682,6 +8684,7 @@ fold_builtin_modf (location_t loc, tree arg0, tree arg1, 
tree rettype)
 {
   const REAL_VALUE_TYPE *const value = TREE_REAL_CST_PTR (arg0);
   REAL_VALUE_TYPE trunc, frac;
+  tree res;
 
   switch (value->cl)
   {
@@ -8711,8 +8714,10 @@ fold_builtin_modf (location_t loc, tree arg0, tree arg1, 
tree rettype)
   arg1 = fold_build2_loc (loc, MODIFY_EXPR, rettype, arg1,
  build_real (rettype, trunc));
   TREE_SIDE_EFFECTS (arg1) = 1;
-  return fold_build2_loc (loc, COMPOUND_EXPR, rettype, arg1,
- build_real (rettype, frac));
+  res = fold_build2_loc (loc, COMPOUND_EXPR, rettype, arg1,
+build_real (rettype, frac));
+  TREE_NO_WARNING (res) = 1;
+  return res;
 }
 
   return NULL_TREE;
@@ -10673,8 +10678,10 @@ do_mpfr_remquo (tree arg0, tree arg1, tree arg_quo)
  integer_quo));
  TREE_SIDE_EFFECTS (result_quo) = 1;
  /* Combine the quo assignment with the rem.  */
- result = non_lvalue (fold_build2 (COMPOUND_EXPR, type,
-   result_quo, result_rem));
+ result = fold_build2 (COMPOUND_EXPR, type,
+   result_quo, result_rem);
+ TREE_NO_WARNING (result) = 1;
+ result = non_lvalue (result);
}
}
}
diff --git a/gcc/testsuite/gcc.dg/pr106264.c b/gcc/testsuite/gcc.dg/pr106264.c
new file mode 100644
index 000..6b4af49
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr106264.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -Wall" } */
+double frexp (double, int*);
+double modf (double, double*);
+double remquo (double, double, int*);
+
+int f (void)
+{
+  int y;
+  frexp (1.0, &y);
+  return y;
+}
+
+double g (void)
+{
+  double y;
+  modf (1.0, &y);
+  return y;
+}
+
+int h (void)
+{
+  int y;
+  remquo (1.0, 1.0, &y);
+  return y;
+}
+


[AVX512 PATCH] Add UNSPEC_MASKOP to kupck instructions in sse.md.

2022-07-16 Thread Roger Sayle

This AVX512 specific patch to sse.md is split out from an earlier patch:
https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596199.html

The new splitters proposed in that patch interfere with AVX512's
kunpckdq instruction, whose RTL definition is identical:
DW:DI = (HI:SI<<32)|zero_extend(LO:SI).  To distinguish these,
and avoid AVX512 mask registers accidentally being (ab)used by reload
to perform SImode scalar shifts, this patch adds the explicit
(unspec UNSPEC_MASKOP) to the unpack mask operations, which matches
what sse.md does for the other mask specific (logic) operations.
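
At the source level, the shared RTL shape corresponds to word
concatenation along these lines (an illustrative sketch only):

unsigned long long
concat64 (unsigned int hi, unsigned int lo)
{
  /* Matches DW:DI = (zero_extend (HI:SI) << 32) | zero_extend (LO:SI),
     the same form as kunpckdq on mask registers.  */
  return ((unsigned long long) hi << 32) | lo;
}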

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?

2022-07-16  Roger Sayle  

gcc/ChangeLog
* config/i386/sse.md (kunpckhi): Add UNSPEC_MASKOP unspec.
(kunpcksi): Likewise, add UNSPEC_MASKOP unspec.
(kunpckdi): Likewise, add UNSPEC_MASKOP unspec.
(vec_pack_trunc_qi): Update to specify required UNSPEC_MASKOP
unspec.
(vec_pack_trunc_): Likewise.


Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 62688f8..da50ffa 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -2072,7 +2072,8 @@
  (ashift:HI
(zero_extend:HI (match_operand:QI 1 "register_operand" "k"))
(const_int 8))
- (zero_extend:HI (match_operand:QI 2 "register_operand" "k"]
+ (zero_extend:HI (match_operand:QI 2 "register_operand" "k"
+   (unspec [(const_int 0)] UNSPEC_MASKOP)]
   "TARGET_AVX512F"
   "kunpckbw\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "mode" "HI")
@@ -2085,7 +2086,8 @@
  (ashift:SI
(zero_extend:SI (match_operand:HI 1 "register_operand" "k"))
(const_int 16))
- (zero_extend:SI (match_operand:HI 2 "register_operand" "k"]
+ (zero_extend:SI (match_operand:HI 2 "register_operand" "k"
+   (unspec [(const_int 0)] UNSPEC_MASKOP)]
   "TARGET_AVX512BW"
   "kunpckwd\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "mode" "SI")])
@@ -2096,7 +2098,8 @@
  (ashift:DI
(zero_extend:DI (match_operand:SI 1 "register_operand" "k"))
(const_int 32))
- (zero_extend:DI (match_operand:SI 2 "register_operand" "k"]
+ (zero_extend:DI (match_operand:SI 2 "register_operand" "k"
+   (unspec [(const_int 0)] UNSPEC_MASKOP)]
   "TARGET_AVX512BW"
   "kunpckdq\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "mode" "DI")])
@@ -17400,21 +17403,26 @@
 })
 
 (define_expand "vec_pack_trunc_qi"
-  [(set (match_operand:HI 0 "register_operand")
-   (ior:HI (ashift:HI (zero_extend:HI (match_operand:QI 2 
"register_operand"))
-   (const_int 8))
-   (zero_extend:HI (match_operand:QI 1 "register_operand"]
+  [(parallel
+[(set (match_operand:HI 0 "register_operand")
+   (ior:HI
+  (ashift:HI (zero_extend:HI (match_operand:QI 2 "register_operand"))
+ (const_int 8))
+  (zero_extend:HI (match_operand:QI 1 "register_operand"
+ (unspec [(const_int 0)] UNSPEC_MASKOP)])]
   "TARGET_AVX512F")
 
 (define_expand "vec_pack_trunc_"
-  [(set (match_operand: 0 "register_operand")
-   (ior:
- (ashift:
+  [(parallel
+[(set (match_operand: 0 "register_operand")
+ (ior:
+   (ashift:
+ (zero_extend:
+   (match_operand:SWI24 2 "register_operand"))
+ (match_dup 3))
(zero_extend:
- (match_operand:SWI24 2 "register_operand"))
-   (match_dup 3))
- (zero_extend:
-   (match_operand:SWI24 1 "register_operand"]
+ (match_operand:SWI24 1 "register_operand"
+ (unspec [(const_int 0)] UNSPEC_MASKOP)])]
   "TARGET_AVX512BW"
 {
   operands[3] = GEN_INT (GET_MODE_BITSIZE (mode));


[x86_64 PATCH] PR target/106231: Optimize (any_extend:DI (ctz:SI ...)).

2022-07-16 Thread Roger Sayle

This patch resolves PR target/106231 by providing insns that recognize
(zero_extend:DI (ctz:SI ...)) and (sign_extend:DI (ctz:SI ...)).  The
result of ctz:SI is always between 0 and 32 (or undefined), so
sign_extension is the same as zero_extension, and the result is already
extended in the destination register.
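
The sign_extend form arises from source like this (a sketch,
complementing the zero-extend form used in the new test cases):

long long
ctz_widened (unsigned bits)
{
  /* The int result of __builtin_ctz is sign-extended to 64 bits; the
     value is already in [0, 31] (or undefined), so no extension is
     actually needed.  */
  return __builtin_ctz (bits);
}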

Things are a little complicated, because the existing implementation
of *ctzsi2 handles multiple cases, including false dependencies, which
we continue to support in this patch.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check with no new failures.  Ok for mainline?

2022-07-16  Roger Sayle  

gcc/ChangeLog
PR target/106231
* config/i386/i386.md (*ctzsidi2_ext): New insn_and_split
to recognize any_extend:DI of ctz:SI which is implicitly extended.
(*ctzsidi2_ext_falsedep): New define_insn to model a DImode
extended ctz:SI that has preceding xor to break false dependency.

gcc/testsuite/ChangeLog
PR target/106231
* gcc.target/i386/pr106231-1.c: New test case.
* gcc.target/i386/pr106231-2.c: New test case.


Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 3b02d0c..164b0c2 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -16431,6 +16431,66 @@
(set_attr "prefix_rep" "1")
(set_attr "mode" "SI")])
 
+(define_insn_and_split "*ctzsidi2_ext"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+   (any_extend:DI
+ (ctz:SI
+   (match_operand:SI 1 "nonimmediate_operand" "rm"
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_64BIT"
+{
+  if (TARGET_BMI)
+return "tzcnt{l}\t{%1, %k0|%k0, %1}";
+  else if (TARGET_CPU_P (GENERIC)
+  && !optimize_function_for_size_p (cfun))
+/* tzcnt expands to 'rep bsf' and we can use it even if !TARGET_BMI.  */
+return "rep%; bsf{l}\t{%1, %k0|%k0, %1}";
+  return "bsf{l}\t{%1, %k0|%k0, %1}";
+}
+  "(TARGET_BMI || TARGET_CPU_P (GENERIC))
+   && TARGET_AVOID_FALSE_DEP_FOR_BMI && epilogue_completed
+   && optimize_function_for_speed_p (cfun)
+   && !reg_mentioned_p (operands[0], operands[1])"
+  [(parallel
+[(set (match_dup 0)
+ (any_extend:DI (ctz:SI (match_dup 1
+ (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)
+ (clobber (reg:CC FLAGS_REG))])]
+  "ix86_expand_clear (operands[0]);"
+  [(set_attr "type" "alu1")
+   (set_attr "prefix_0f" "1")
+   (set (attr "prefix_rep")
+ (if_then_else
+   (ior (match_test "TARGET_BMI")
+   (and (not (match_test "optimize_function_for_size_p (cfun)"))
+(match_test "TARGET_CPU_P (GENERIC)")))
+   (const_string "1")
+   (const_string "0")))
+   (set_attr "mode" "SI")])
+
+(define_insn "*ctzsidi2_ext_falsedep"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+   (any_extend:DI
+ (ctz:SI
+   (match_operand:SI 1 "nonimmediate_operand" "rm"
+   (unspec [(match_operand:DI 2 "register_operand" "0")]
+  UNSPEC_INSN_FALSE_DEP)
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_64BIT"
+{
+  if (TARGET_BMI)
+return "tzcnt{l}\t{%1, %k0|%k0, %1}";
+  else if (TARGET_CPU_P (GENERIC))
+/* tzcnt expands to 'rep bsf' and we can use it even if !TARGET_BMI.  */
+return "rep%; bsf{l}\t{%1, %k0|%k0, %1}";
+  else
+gcc_unreachable ();
+}
+  [(set_attr "type" "alu1")
+   (set_attr "prefix_0f" "1")
+   (set_attr "prefix_rep" "1")
+   (set_attr "mode" "SI")])
+
 (define_insn "bsr_rex64"
   [(set (reg:CCZ FLAGS_REG)
(compare:CCZ (match_operand:DI 1 "nonimmediate_operand" "rm")
diff --git a/gcc/testsuite/gcc.target/i386/pr106231-1.c 
b/gcc/testsuite/gcc.target/i386/pr106231-1.c
new file mode 100644
index 000..d17297f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr106231-1.c
@@ -0,0 +1,8 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -mtune=generic" } */
+long long
+foo(long long x, unsigned bits)
+{
+  return x + (unsigned) __builtin_ctz(bits);
+}
+/* { dg-final { scan-assembler-not "cltq" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr106231-2.c 
b/gcc/testsuite/gcc.target/i386/pr106231-2.c
new file mode 100644
index 000..fd3a8e3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr106231-2.c
@@ -0,0 +1,8 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -mtune=ivybridge" } */
+long long
+foo(long long x, unsigned bits)
+{
+  return x + (unsigned) __builtin_ctz(bits);
+}
+/* { dg-final { scan-assembler-not "cltq" } } */


[x86 PATCH] PR target/106303: Fix TImode STV related failures.

2022-07-23 Thread Roger Sayle

This patch resolves PR target/106303 (and the related PRs 106347,
106404, 106407) which are ICEs caused by my improvements to x86_64's
128-bit TImode to V1TImode Scalar to Vector (STV) pass.  My apologies
for the breakage.  The issue is that data flow analysis is used to
partition usage of each TImode pseudo into "chains", where each
chain is analyzed and, if suitable, converted to vector operations.
The problem appears when some chains for a pseudo are converted
and others aren't, as RTL sharing can result in some mode changes
leaking into other instructions that aren't/shouldn't/can't be
converted, which eventually leads to an ICE for mismatched modes.

My first approach to a fix was to unify more of the STV infrastructure,
reasoning that if TImode STV was exhibiting these problems, but DImode
and SImode STV weren't, the issue was likely to be caused/resolved by
these remaining differences.  This appeared to fix some but not all of
the reported PRs.  A better solution was then proposed by H.J. Lu in
Bugzilla (thanks!) that we need to iterate the removal of candidates in the
function timode_remove_non_convertible_regs until there are no further
changes.  As each chain is removed from consideration, it in turn may
affect whether other insns/chains can safely be converted.
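
A minimal illustration of that fixed-point iteration, using a plain
array instead of GCC's insn bitmaps (purely a sketch of the idea, not
the patch's actual code):

#include <stdbool.h>

void
prune_candidates (bool candidate[], const int depends_on[], int n)
{
  bool changed;
  do
    {
      changed = false;
      for (int i = 0; i < n; i++)
        /* Removing one candidate may invalidate another that depends
           on it, so repeat the scan until nothing changes.  */
        if (candidate[i] && depends_on[i] >= 0
            && !candidate[depends_on[i]])
          {
            candidate[i] = false;
            changed = true;
          }
    }
  while (changed);
}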

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Ok for mainline?


2022-07-23  Roger Sayle  
H.J. Lu  

gcc/ChangeLog
PR target/106303
PR target/106347
* config/i386/i386-features.cc (make_vector_copies): Move from
general_scalar_chain to scalar_chain.
(convert_reg): Likewise.
(convert_insn_common): New scalar_chain method split out from
general_scalar_chain convert_insn.
(convert_registers): Move from general_scalar_chain to
scalar_chain.
(scalar_chain::convert): Call convert_insn_common before calling
convert_insn.
(timode_remove_non_convertible_regs): Iterate until there are
no further changes to the candidates.
* config/i386/i386-features.h (scalar_chain::hash_map): Move
from general_scalar_chain.
(scalar_chain::convert_reg): Likewise.
(scalar_chain::convert_insn_common): New shared method.
(scalar_chain::make_vector_copies): Move from general_scalar_chain.
(scalar_chain::convert_registers): Likewise.  No longer virtual.
(general_scalar_chain::hash_map): Delete.  Moved to scalar_chain.
(general_scalar_chain::convert_reg): Likewise.
(general_scalar_chain::make_vector_copies): Likewise.
(general_scalar_chain::convert_registers): Delete virtual method.
(timode_scalar_chain::convert_registers): Likewise.

gcc/testsuite/ChangeLog
PR target/106303
PR target/106347
* gcc.target/i386/pr106303.c: New test case.
* gcc.target/i386/pr106347.c: New test case.


Thanks in advance (and sorry again for the inconvenience),
Roger
--

diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
index 813b203..aa5de71 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -708,7 +708,7 @@ gen_gpr_to_xmm_move_src (enum machine_mode vmode, rtx gpr)
and replace its uses in a chain.  */
 
 void
-general_scalar_chain::make_vector_copies (rtx_insn *insn, rtx reg)
+scalar_chain::make_vector_copies (rtx_insn *insn, rtx reg)
 {
   rtx vreg = *defs_map.get (reg);
 
@@ -772,7 +772,7 @@ general_scalar_chain::make_vector_copies (rtx_insn *insn, 
rtx reg)
scalar uses outside of the chain.  */
 
 void
-general_scalar_chain::convert_reg (rtx_insn *insn, rtx dst, rtx src)
+scalar_chain::convert_reg (rtx_insn *insn, rtx dst, rtx src)
 {
   start_sequence ();
   if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
@@ -973,10 +973,10 @@ scalar_chain::convert_compare (rtx op1, rtx op2, rtx_insn 
*insn)
 UNSPEC_PTEST);
 }
 
-/* Convert INSN to vector mode.  */
+/* Helper function for converting INSN to vector mode.  */
 
 void
-general_scalar_chain::convert_insn (rtx_insn *insn)
+scalar_chain::convert_insn_common (rtx_insn *insn)
 {
   /* Generate copies for out-of-chain uses of defs and adjust debug uses.  */
   for (df_ref ref = DF_INSN_DEFS (insn); ref; ref = DF_REF_NEXT_LOC (ref))
@@ -1037,7 +1037,13 @@ general_scalar_chain::convert_insn (rtx_insn *insn)
XEXP (note, 0) = *vreg;
  *DF_REF_REAL_LOC (ref) = *vreg;
}
+}
+
+/* Convert INSN to vector mode.  */
 
+void
+general_scalar_chain::convert_insn (rtx_insn *insn)
+{
   rtx def_set = single_set (insn);
   rtx src = SET_SRC (def_set);
   rtx dst = SET_DEST (def_set);
@@ -1475,7 +1481,7 @@ timode_scalar_chain::convert_insn (rtx_insn *insn)
Also populates defs_map which is used later by convert_insn.  */
 
 void
-general_scalar_c

[x86 PATCH take #3] PR target/91681: zero_extendditi2 pattern for more optimizations.

2022-07-23 Thread Roger Sayle
 

Hi Uros,

This is the next iteration of the zero_extendditi2 patch last reviewed here:
https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596204.html

[1] The sse.md changes were split out, reviewed, approved and committed.

[2] The *concat splitters have been moved post-reload matching what we
now do for many/most of the double word functionality.

[3] As you recommend, these *concat splitters now use split_double_mode
to "subreg" operand[0] into parts, via a new helper function that can also
handle overlapping registers, and even use xchg for the rare case that a
double word is constructed from its high and low parts, but the wrong
way around.
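
The source-level shape of the doubleword concatenations that these
splitters (and zero_extendditi2) target is roughly the following
(a sketch only; the real test cases are listed in the ChangeLog below):

unsigned __int128
concat128 (unsigned long long hi, unsigned long long lo)
{
  /* DST = (HI << 64) | LO, split into a pair of word moves, or an
     xchg when the parts arrive the wrong way around.  */
  return ((unsigned __int128) hi << 64) | lo;
}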

 

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without -target_board=unix{-m32},
with no new failures.  Ok for mainline?

2022-07-23  Roger Sayle  
Uroš Bizjak  

gcc/ChangeLog
PR target/91681
* config/i386/i386-expand.cc (split_double_concat): A new helper
function for setting a double word value from two word values.
* config/i386/i386-protos.h (split_double_concat): Prototype here.
* config/i386/i386.md (zero_extendditi2): New define_insn_and_split.
(*add3_doubleword_zext): New define_insn_and_split.
(*sub3_doubleword_zext): New define_insn_and_split.
(*concat3_1): New define_insn_and_split replacing
previous define_split for implementing DST = (HI<<32)|LO as
pair of move instructions, setting lopart and hipart.
(*concat3_2): Likewise.
(*concat3_3): Likewise, where HI is zero_extended.
(*concat3_4): Likewise, where HI is zero_extended.

gcc/testsuite/ChangeLog
PR target/91681
* g++.target/i386/pr91681.C: New test case (from the PR).
* gcc.target/i386/pr91681-1.c: New int128 test case.
* gcc.target/i386/pr91681-2.c: Likewise.
* gcc.target/i386/pr91681-3.c: Likewise, but for ia32.


Thanks in advance,
Roger
--

 

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 40f821e..66d8f28 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -165,6 +165,46 @@ split_double_mode (machine_mode mode, rtx operands[],
 }
 }
 
+/* Emit the double word assignment DST = { LO, HI }.  */
+
+void
+split_double_concat (machine_mode mode, rtx dst, rtx lo, rtx hi)
+{
+  rtx dlo, dhi;
+  int deleted_move_count = 0;
+  split_double_mode (mode, &dst, 1, &dlo, &dhi);
+  if (!rtx_equal_p (dlo, hi))
+{
+  if (!rtx_equal_p (dlo, lo))
+   emit_move_insn (dlo, lo);
+  else
+   deleted_move_count++;
+  if (!rtx_equal_p (dhi, hi))
+   emit_move_insn (dhi, hi);
+  else
+   deleted_move_count++;
+}
+  else if (!rtx_equal_p (lo, dhi))
+{
+  if (!rtx_equal_p (dhi, hi))
+   emit_move_insn (dhi, hi);
+  else
+   deleted_move_count++;
+  if (!rtx_equal_p (dlo, lo))
+   emit_move_insn (dlo, lo);
+  else
+   deleted_move_count++;
+}
+  else if (mode == TImode)
+emit_insn (gen_swapdi (dlo, dhi));
+  else
+emit_insn (gen_swapsi (dlo, dhi));
+
+  if (deleted_move_count == 2)
+emit_note (NOTE_INSN_DELETED);
+}
+
+
 /* Generate either "mov $0, reg" or "xor reg, reg", as appropriate
for the target.  */
 
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index cf84775..e27c14f 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -85,6 +85,7 @@ extern void print_reg (rtx, int, FILE*);
 extern void ix86_print_operand (FILE *, rtx, int);
 
 extern void split_double_mode (machine_mode, rtx[], int, rtx[], rtx[]);
+extern void split_double_concat (machine_mode, rtx, rtx lo, rtx);
 
 extern const char *output_set_got (rtx, rtx);
 extern const char *output_387_binary_op (rtx_insn *, rtx*);
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 9aaeb69..4560681 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -4379,6 +4379,16 @@
(set_attr "type" "imovx,mskmov,mskmov")
(set_attr "mode" "SI,QI,QI")])
 
+(define_insn_and_split "zero_extendditi2"
+  [(set (match_operand:TI 0 "nonimmediate_operand" "=r,o")
+   (zero_extend:TI (match_operand:DI 1 "nonimmediate_operand" "rm,r")))]
+  "TARGET_64BIT"
+  "#"
+  "&& reload_completed"
+  [(set (match_dup 3) (match_dup 1))
+   (set (match_dup 4) (const_int 0))]
+  "split_double_mode (TImode, &operands[0], 1, &operands[3], &operands[4]);")
+
 ;; Transform xorl; mov[bw] (set strict_low_part) into movz[bw]l.
 (define_peephole2
   [(parallel [(set (match_operand:SWI48 0 "general_reg_operand")
@@ -6512,6 +6522,31 @@
   [(set_attr "type"

[Documentation] Correct RTL documentation: (use (mem ...)) is allowed.

2022-07-23 Thread Roger Sayle

This patch is a one line correction/clarification to GCC's current
RTL documentation that explains a USE of a MEM is permissible.

PR rtl-optimization/99930 is an interesting example on x86_64 where
the backend generates better code when a USE is a (const) MEM than
when it is a REG. In fact the backend relies on CSE to propagate the
MEM (a constant pool reference) into the USE, to enable combine to
merge/simplify instructions.

This change has been tested with a make bootstrap, but as it might
provoke a discussion, I've decided to not consider it "obvious".
Ok for mainline (to document the actual current behavior)?


2022-07-23  Roger Sayle   

gcc/ChangeLog
* doc/rtl.texi (use): Document that the operand may be a MEM.


Roger
--

diff --git a/gcc/doc/rtl.texi b/gcc/doc/rtl.texi
index 43c9ee8..995c8be 100644
--- a/gcc/doc/rtl.texi
+++ b/gcc/doc/rtl.texi
@@ -3283,7 +3283,8 @@ Represents the use of the value of @var{x}.  It indicates 
that the
 value in @var{x} at this point in the program is needed, even though
 it may not be apparent why this is so.  Therefore, the compiler will
 not attempt to delete previous instructions whose only effect is to
-store a value in @var{x}.  @var{x} must be a @code{reg} expression.
+store a value in @var{x}.  @var{x} must be a @code{reg} or a @code{mem}
+expression.
 
 In some situations, it may be tempting to add a @code{use} of a
 register in a @code{parallel} to describe a situation where the value


[PATCH] Add new target hook: simplify_modecc_const.

2022-07-26 Thread Roger Sayle

This patch is a major revision of the patch I originally proposed here:
https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598040.html

The primary motivation of this patch is to avoid incorrect optimization
of MODE_CC comparisons in simplify_const_relational_operation when/if a
backend represents the (known) contents of a MODE_CC register using a
CONST_INT.  In such cases, the RTL optimizers don't know the semantics
of this integer value, so shouldn't change anything (i.e. should return
NULL_RTX from simplify_const_relational_operation).

The secondary motivation is that by introducing a new target hook, called
simplify_modecc_const, the backend can (optionally) encode and interpret
a target dependent encoding of MODE_CC registers.

The worked example provided with this patch is to allow the i386 backend
to explicitly model the carry flag (MODE_CCC) using 1 to indicate that
the carry flag is set, and 0 to indicate the carry flag is clear.  This
allows the instructions stc (set carry flag), clc (clear carry flag) and
cmc (complement carry flag) to be represented in RTL.
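
Purely as an illustration of that encoding: once stc and adc can both
be expressed in RTL, a forced carry-in corresponds at the source level
to something like the following sketch:

unsigned long long
add_with_forced_carry (unsigned long long x, unsigned long long y)
{
  /* stc; adc computes x + y + 1, with the carry flag modelled as 1.  */
  return x + y + 1;
}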

However an even better example would be the rs6000 backend, where this
patch/target hook would allow improved modelling of the condition register
CR.  PowerPC's comparison instructions set fields/bits in the CR
register [where bit 0 indicates less than, bit 1 greater than, bit 2
equal to and bit3 overflow] analogous to x86's flags register [containing
bits for carry, zero, overflow, parity etc.].  These fields can be
manipulated directly using crset (aka creqv) and crclr (aka crxor)
instructions and even transferred from general purpose registers using
mtcr.  However, without a patch like this, it's impossible to safely
model/represent these instructions in rs6000.md.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
and both with and without a patch to add stc, clc and cmc support to
the x86 backend.  I'll resubmit the x86 target pieces again with that
follow-up backend patch, so for now I'm only looking for approval
of the middle-end infrastructure pieces.  The x86 hunks below are
provided as context/documentation for how this hook could/should be
used (but I wouldn't object to pre-approval of those bits by Uros).
Ok for mainline?


2022-07-26  Roger Sayle  

gcc/ChangeLog
* target.def (simplify_modecc_const): New target hook.
* doc/tm.texi (TARGET_SIMPLIFY_MODECC_CONST): Document here.
* doc/tm.texi.in (TARGET_SIMPLIFY_MODECC_CONST): Locate @hook here.
* hooks.cc (hook_rtx_mode_int_rtx_null): Define default hook here.
* hooks.h (hook_rtx_mode_int_rtx_null): Prototype here.
* simplify-rtx.c (simplify_const_relational_operation): Avoid
mis-optimizing MODE_CC comparisons by calling new target hook.

* config/i386.cc (ix86_simplify_modecc_const): Implement new target
hook, supporting interpreting MODE_CCC values as the x86 carry flag.
(TARGET_SIMPLIFY_MODECC_CONST): Define as
ix86_simplify_modecc_const.


Thanks in advance,
Roger
--

> -Original Message-
> From: Segher Boessenkool 
> Sent: 07 July 2022 23:39
> To: Roger Sayle 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] Be careful with MODE_CC in
> simplify_const_relational_operation.
> 
> Hi!
> 
> On Thu, Jul 07, 2022 at 10:08:04PM +0100, Roger Sayle wrote:
> > I think it's fair to describe RTL's representation of condition flags
> > using MODE_CC as a little counter-intuitive.
> 
> "A little challenging", and you should see that as a good thing, as a
puzzle to
> crack :-)
> 
> > For example, the i386
> > backend represents the carry flag (in adc instructions) using RTL of
> > the form "(ltu:SI (reg:CCC) (const_int 0))", where great care needs to
> > be taken not to treat this like a normal RTX expression, after all LTU
> > (less-than-unsigned) against const0_rtx would normally always be
> > false.
> 
> A comparison of a MODE_CC thing against 0 means the result of a
> *previous* comparison (or other cc setter) is looked at.  Usually it
simply looks
> at some condition bits in a flags register.  It does not do any actual
comparison:
> that has been done before (if at all even).
> 
> > Hence, MODE_CC comparisons need to be treated with caution, and
> > simplify_const_relational_operation returns early (to avoid
> > problems) when GET_MODE_CLASS (GET_MODE (op0)) == MODE_CC.
> 
> Not just to avoid problems: there simply isn't enough information to do a
> correct job.
> 
> > However, consider the (currently) hypothetical situation, where the
> > RTL optimizers determine that a previous instruction unconditionally
> > sets or clears the carry flag, and this gets pro

[PATCH] middle-end: More support for ABIs that pass FP values as wider ints.

2022-07-26 Thread Roger Sayle

Firstly many thanks again to Jeff Law for reviewing/approving the previous
patch to add support for ABIs that pass FP values as wider integer modes.
That has allowed significant progress on PR target/104489.  As predicted
enabling HFmode on nvptx-none automatically enables more testcases in the
testsuite and making sure these all PASS has revealed a few missed spots
and a deficiency in the middle-end.  For example, support for HC mode,
where a complex value is encoded as two 16-bit HFmode parts was
insufficiently covered in my previous testing.  More interesting is
that __fixunshfti is required by GCC, and not natively supported by
the nvptx backend, requiring softfp support in libgcc, which in turn
revealed an interesting asymmetry in libcall handling in optabs.cc.

In the expand_fixed_convert function, which is responsible for
expanding libcalls for integer to floating point conversion, GCC
calls prepare_libcall_arg that (specifically for integer arguments)
calls promote_function_mode on the argument, so that the libcall
ABI matches the regular target ABI.  By comparison, the equivalent
expand_fix function, for floating point to integer conversion, doesn't
promote its argument.  On nvptx, where the assembler is strongly
typed, this produces a mismatch as the __fixunshfti function created
by libgcc doesn't precisely match the signature assumed by optabs.
The solution is to perform the same (or similar) prepare_libcall_arg
preparation in both cases.  In this patch, the existing (static)
prepare_libcall_arg, which assumes an integer argument, is renamed
prepare_libcall_int_arg, and a matching prepare_libcall_fp_arg is
introduced.  This should be safe on other platforms (fingers-crossed)
as floating point argument promotion is rare [floats are passed in
float registers, doubles are passed in double registers, etc.]
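
The conversion that ends up in the __fixunshfti soft-fp routine has
this shape (a sketch; assumes a target with _Float16 enabled):

unsigned __int128
hf_to_u128 (_Float16 x)
{
  /* HFmode to unsigned TImode; on nvptx this becomes a libcall whose
     argument must be promoted to match the target ABI.  */
  return (unsigned __int128) x;
}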

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
and on nvptx-none with a backend patch that resolves the rest of
PR target/104489.  Ok for mainline?


2022-07-26  Roger Sayle  

gcc/ChangeLog
PR target/104489
* calls.cc (emit_library_call_value_1): Enable the FP return value
of a libcall to be returned as a wider integer, by converting the
wider integer result to the desired floating point mode.
(store_one_arg): Allow floating point arguments to be passed on
the stack as wider integers using convert_float_to_wider_int.
* function.cc (assign_parms_unsplit_complex): Likewise, allow
complex floating point modes to be passed as wider integer parts,
using convert_wider_int_to_float.
* optabs.cc (prepare_libcall_fp_arg): New function. A floating
point version of the previous prepare_libcall_arg that calls
promote_function_mode on its argument.
(expand_fix): Call new prepare_libcall_fp_arg on FIX argument.
(prepare_libcall_int_arg): Renamed from prepare_libcall_arg.
(expand_fixed_convert): Update call of prepare_libcall_arg to
the new name, prepare_libcall_int_arg.


Thanks again,
Roger
--

diff --git a/gcc/calls.cc b/gcc/calls.cc
index 7f3cf5f..50d0495 100644
--- a/gcc/calls.cc
+++ b/gcc/calls.cc
@@ -4791,14 +4791,20 @@ emit_library_call_value_1 (int retval, rtx orgfun, rtx 
value,
   else
{
  /* Convert to the proper mode if a promotion has been active.  */
- if (GET_MODE (valreg) != outmode)
+ enum machine_mode valmode = GET_MODE (valreg);
+ if (valmode != outmode)
{
  int unsignedp = TYPE_UNSIGNED (tfom);
 
  gcc_assert (promote_function_mode (tfom, outmode, &unsignedp,
 fndecl ? TREE_TYPE (fndecl) : 
fntype, 1)
- == GET_MODE (valreg));
- valreg = convert_modes (outmode, GET_MODE (valreg), valreg, 0);
+ == valmode);
+ if (SCALAR_INT_MODE_P (valmode)
+ && SCALAR_FLOAT_MODE_P (outmode)
+ && known_gt (GET_MODE_SIZE (valmode), GET_MODE_SIZE 
(outmode)))
+   valreg = convert_wider_int_to_float (outmode, valmode, valreg);
+ else
+   valreg = convert_modes (outmode, valmode, valreg, 0);
}
 
  if (value != 0)
@@ -5003,8 +5009,20 @@ store_one_arg (struct arg_data *arg, rtx argblock, int 
flags,
   /* If we are promoting object (or for any other reason) the mode
 doesn't agree, convert the mode.  */
 
-  if (arg->mode != TYPE_MODE (TREE_TYPE (pval)))
-   arg->value = convert_modes (arg->mode, TYPE_MODE (TREE_TYPE (pval)),
+  machine_mode old_mode = TYPE_MODE (TREE_TYPE (pval));
+
+  /* Some ABIs require scalar floating point modes to be passed
+in a wider scalar integer mode.  We need to explicitly
+reinterpret

RE: [PATCH] Add new target hook: simplify_modecc_const.

2022-07-26 Thread Roger Sayle


Hi Segher,
It's very important to distinguish the invariants that exist for the RTL
data structures as held in memory (rtx), vs. the use of "enum rtx_code"s,
"machine_mode"s and operands in the various processing functions
of the middle-end.

Yes, it's very true that RTL integer constants don't specify a mode
(are VOIDmode), so operations like ZERO_EXTEND or EQ
don't make sense with all constant operands.  This is (one reason)
why constant-only operands are disallowed from RTL (data structures),
and why in APIs that perform/simplify these operations, the original
operand mode (of the const_int(s)) must be/is always passed as a
parameter.

Hence, for say simplify_const_binary_operation, op0 and op1 can
both be const_int, as the mode argument specifies the mode of the
"code" operation. Likewise, in simplify_relational_operation, both
op0 and op1 may be CONST_INT as "cmp_mode" explicitly specifies
the mode that the operation is performed in and "mode" specifies
the mode of the result.

Your comment that "comparing two integer constants is invalid
RTL *in all contexts*" is a serious misunderstanding of what's
going on.  At no point is a RTL rtx node ever allocated with two
integer constant operands.  RTL simplification is for hypothetical
"what if" transformations (just like try_combine calls recog with
RTL that may not be real instructions), and these simplifications
are even sometimes required to preserve the documented RTL
invariants.  Comparisons of two integers must be simplified to
true/false precisely to ensure that they never appear in an actual
COMPARE node.

I worry this fundamental misunderstanding is the same issue that
has been holding up understanding/approving a previous patch:
https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578848.html

For a related bug, consider PR rtl-optimization/67382, which is assigned
to you in Bugzilla.  In this case, the RTL optimizers know that both
operands to a COMPARE are integer constants (both -2), yet the
compiler still performs a run-time comparison and conditional jump:

movl$-2, %eax
movl%eax, 12(%rsp)
cmpl$-2, %eax
je  .L1

Failing to optimize/consider a comparison between two integer
constants *in any context* just leads to poor code.

Hopefully, this clears up that the documented constraints on RTL rtx
aren't exactly the same as the constraints on the use of rtx_codes in
simplify-rtx's functional APIs.  So simplify_subreg really gets called
on operands that are neither REG nor MEM, as this is unrelated to
what the documentation of the SUBREG rtx specifies.

If you don't believe that op0 and op1 can ever both be const_int
in this function, perhaps consider it harmless dead code and humor
me.

Thanks in advance,
Roger
--

> -Original Message-
> From: Segher Boessenkool 
> Sent: 26 July 2022 18:45
> To: Roger Sayle 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] Add new target hook: simplify_modecc_const.
> 
> Hi!
> 
> On Tue, Jul 26, 2022 at 01:13:02PM +0100, Roger Sayle wrote:
> > This patch is a major revision of the patch I originally proposed here:
> > https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598040.html
> >
> > The primary motivation of this patch is to avoid incorrect
> > optimization of MODE_CC comparisons in
> > simplify_const_relational_operation when/if a backend represents the
> > (known) contents of a MODE_CC register using a CONST_INT.  In such
> > cases, the RTL optimizers don't know the semantics of this integer
> > value, so shouldn't change anything (i.e. should return NULL_RTX from
> simplify_const_relational_operation).
> 
> This is invalid RTL.  What would  (set (reg:CC) (const_int 0))  mean, for
example?
> If this was valid it would make most existing code using CC modes do
essentially
> random things :-(
> 
> The documentation (in tm.texi, "Condition Code") says
>   Alternatively, you can use @code{BImode} if the comparison operator is
>   specified already in the compare instruction.  In this case, you are not
>   interested in most macros in this section.
> 
> > The worked example provided with this patch is to allow the i386
> > backend to explicitly model the carry flag (MODE_CCC) using 1 to
> > indicate that the carry flag is set, and 0 to indicate the carry flag
> > is clear.  This allows the instructions stc (set carry flag), clc
> > (clear carry flag) and cmc (complement carry flag) to be represented in
RTL.
> 
> Hrm, I wonder how other targets do this.
> 
> On Power we have a separate hard register for the carry flag of course (it
is a
> separate bit in the hardware as well, XER[CA]).
> 
> On Arm there is arm_carry_operatio

RE: [PATCH] Add new target hook: simplify_modecc_const.

2022-07-27 Thread Roger Sayle
Hi Segher,

> Thank you for telling the maintainer of combine the basics of what all of
this
> does!  I hadn't noticed any of that before.

You're welcome.  I've also been maintaining combine for some time now:
https://gcc.gnu.org/legacy-ml/gcc/2003-10/msg00455.html

> They can be, as clearly documented (and obvious from the code), but you
can
> not ever have that in the RTL stream, which is needed for your patch to do
> anything.

That's the misunderstanding; neither this nor the previous SUBREG patch
affects/changes what is in the RTL stream: no COMPARE nodes are ever
changed or modified, only eliminated by the propagation/fusion in combine
(or CSE).

We have --enable-checking=rtl to guarantee that the documented invariants
always hold in the RTL stream.

Cheers,
Roger




[PATCH] Some additional zero-extension related optimizations in simplify-rtx.

2022-07-27 Thread Roger Sayle

This patch implements some additional zero-extension and sign-extension
related optimizations in simplify-rtx.cc.  The original motivation comes
from PR rtl-optimization/71775, where in comment #2 Andrew Pinski sees:

Failed to match this instruction:
(set (reg:DI 88 [ _1 ])
(sign_extend:DI (subreg:SI (ctz:DI (reg/v:DI 86 [ x ])) 0)))

On many platforms the result of DImode CTZ is constrained to be a
small unsigned integer (between 0 and 64), hence the truncation to
32-bits (using a SUBREG) and the following sign extension back to
64-bits are effectively a no-op, so the above should ideally (often)
be simplified to "(set (reg:DI 88) (ctz:DI (reg/v:DI 86 [ x ]))".
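
The RTL quoted above corresponds to source along these lines (a sketch):

long long
ctz64_narrowed (unsigned long long x)
{
  /* ctz:DI truncated to 32 bits and sign-extended back to 64 bits.  */
  return (int) __builtin_ctzll (x);
}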

To implement this, and some closely related transformations, we build
upon the existing val_signbit_known_clear_p predicate.  In the first
chunk, nonzero_bits knows that FFS and ABS can't leave the sign bit
set, so the simplification of ABS (ABS (x)) and ABS (FFS (x))
can itself be simplified.  The second transformation is that we can
canonicalize SIGN_EXTEND to ZERO_EXTEND (as in the PR 71775 case above)
when the operand's sign-bit is known to be clear.  The final two chunks
are for SIGN_EXTEND of a truncating SUBREG, and ZERO_EXTEND of a
truncating SUBREG respectively.  The nonzero_bits of a truncating
SUBREG pessimistically thinks that the upper bits may have an
arbitrary value (by taking the SUBREG), so we need to look deeper at the
SUBREG's operand to confirm that the high bits are known to be zero.

Unfortunately, for PR rtl-optimization/71775, ctz:DI on x86_64 with
default architecture options is undefined at zero, so we can't be sure
the upper bits of reg:DI 88 will be sign extended (all zeros or all ones).
nonzero_bits knows this, so the above transformations don't trigger,
but the transformations themselves are perfectly valid for other
operations such as FFS, POPCOUNT and PARITY, and on other targets/-march
settings where CTZ is defined at zero.
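
FFS, by contrast, is well-defined at zero, so the same narrow-then-widen
pattern can legitimately fold away (an illustrative sketch, not one of
the patch's test cases):

long long
ffs64_narrowed (unsigned long long x)
{
  /* ffs's int result is always in [0, 64], so the upper bits are known
     to be zero and the re-extension can be dropped.  */
  return __builtin_ffsll (x);
}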

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Testing with CSiBE shows these transformations
trigger on several source files (and with -Os reduces the size of the
code).  Ok for mainline?


2022-07-27  Roger Sayle  

gcc/ChangeLog
* simplify_rtx.cc (simplify_unary_operation_1) : Simplify
test as both FFS and ABS result in nonzero_bits returning a
mask that satisfies val_signbit_known_clear_p.
: Canonicalize SIGN_EXTEND to ZERO_EXTEND when
val_signbit_known_clear_p is true of the operand.
Simplify sign extensions of SUBREG truncations of operands
that are already suitably (zero) extended.
: Simplify zero extensions of SUBREG truncations
of operands that are already suitably zero extended.


Thanks in advance,
Roger
--

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index fa20665..e62bf56 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -1366,9 +1366,8 @@ simplify_context::simplify_unary_operation_1 (rtx_code 
code, machine_mode mode,
break;
 
   /* If operand is something known to be positive, ignore the ABS.  */
-  if (GET_CODE (op) == FFS || GET_CODE (op) == ABS
- || val_signbit_known_clear_p (GET_MODE (op),
-   nonzero_bits (op, GET_MODE (op
+  if (val_signbit_known_clear_p (GET_MODE (op),
+nonzero_bits (op, GET_MODE (op
return op;
 
   /* If operand is known to be only -1 or 0, convert ABS to NEG.  */
@@ -1615,6 +1614,24 @@ simplify_context::simplify_unary_operation_1 (rtx_code 
code, machine_mode mode,
}
}
 
+  /* We can canonicalize SIGN_EXTEND (op) as ZERO_EXTEND (op) when
+ we know the sign bit of OP must be clear.  */
+  if (val_signbit_known_clear_p (GET_MODE (op),
+nonzero_bits (op, GET_MODE (op
+   return simplify_gen_unary (ZERO_EXTEND, mode, op, GET_MODE (op));
+
+  /* (sign_extend:DI (subreg:SI (ctz:DI ...))) is (ctz:DI ...).  */
+  if (GET_CODE (op) == SUBREG
+ && subreg_lowpart_p (op)
+ && GET_MODE (SUBREG_REG (op)) == mode
+ && is_a  (mode, &int_mode)
+ && is_a  (GET_MODE (op), &op_mode)
+ && GET_MODE_PRECISION (int_mode) <= HOST_BITS_PER_WIDE_INT
+ && GET_MODE_PRECISION (op_mode) < GET_MODE_PRECISION (int_mode)
+ && (nonzero_bits (SUBREG_REG (op), mode)
+ & ~(GET_MODE_MASK (op_mode)>>1)) == 0)
+   return SUBREG_REG (op);
+
 #if defined(POINTERS_EXTEND_UNSIGNED)
   /* As we do not know which address space the pointer is referring to,
 we can do this only if the target does not support different pointer
@@ -1765,6 +1782,18 @@ simplify_co

[x86_64 PATCH] PR target/106450: Tweak timode_remove_non_convertible_regs.

2022-07-28 Thread Roger Sayle

This patch resolves PR target/106450, some more fall-out from more
aggressive TImode scalar-to-vector (STV) optimizations.  I continue
to be caught out by how far TImode STV has diverged from DImode/SImode
STV, and therefore requires additional (unexpected) tweaking.  Many
thanks to H.J. Lu for pointing out timode_remove_non_convertible_regs
needs to be extended to handle XOR (and other new operations).

Unhelpfully the comment above this function states that it's the TImode
version of "remove_non_convertible_regs", which doesn't exist anymore,
so I've resurrected an explanatory comment from the git history.
By refactoring the checks for hard regs and already "marked" regs
into timode_check_non_convertible_regs itself, all its callers are
simplified.  This patch then uses GET_RTX_CLASS to generically handle
unary and binary operations, calling timode_check_non_convertible_regs
on each TImode register operand in the single_set's SET_SRC.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Ok for mainline?


2022-07-28  Roger Sayle  

gcc/ChangeLog
PR target/106450
* config/i386/i386-features.cc (timode_check_non_convertible_regs):
Do nothing if REGNO is set in the REGS bitmap, or is a hard reg.
(timode_remove_non_convertible_regs): Update comment.
Call timode_check_non_convertible_regs on all register operands
of supported (binary and unary) operations.

gcc/testsuite/ChangeLog
PR target/106450
* gcc.target/i386/pr106450.c: New test case.


Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
index aa5de71..2a4097c 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -1808,6 +1808,11 @@ static void
 timode_check_non_convertible_regs (bitmap candidates, bitmap regs,
   unsigned int regno)
 {
+  /* Do nothing if REGNO is already in REGS or is a hard reg.  */
+  if (bitmap_bit_p (regs, regno)
+  || HARD_REGISTER_NUM_P (regno))
+return;
+
   for (df_ref def = DF_REG_DEF_CHAIN (regno);
def;
def = DF_REF_NEXT_REG (def))
@@ -1843,7 +1848,13 @@ timode_check_non_convertible_regs (bitmap candidates, 
bitmap regs,
 }
 }
 
-/* The TImode version of remove_non_convertible_regs.  */
+/* For a given bitmap of insn UIDs scans all instructions and
+   remove insn from CANDIDATES in case it has both convertible
+   and not convertible definitions.
+
+   All insns in a bitmap are conversion candidates according to
+   scalar_to_vector_candidate_p.  Currently it implies all insns
+   are single_set.  */
 
 static void
 timode_remove_non_convertible_regs (bitmap candidates)
@@ -1861,21 +1872,40 @@ timode_remove_non_convertible_regs (bitmap candidates)
rtx dest = SET_DEST (def_set);
rtx src = SET_SRC (def_set);
 
-   if ((!REG_P (dest)
-|| bitmap_bit_p (regs, REGNO (dest))
-|| HARD_REGISTER_P (dest))
-   && (!REG_P (src)
-   || bitmap_bit_p (regs, REGNO (src))
-   || HARD_REGISTER_P (src)))
- continue;
-
if (REG_P (dest))
  timode_check_non_convertible_regs (candidates, regs,
 REGNO (dest));
 
-   if (REG_P (src))
- timode_check_non_convertible_regs (candidates, regs,
-REGNO (src));
+   switch (GET_RTX_CLASS (GET_CODE (src)))
+ {
+ case RTX_OBJ:
+   if (REG_P (src))
+ timode_check_non_convertible_regs (candidates, regs,
+REGNO (src));
+   break;
+
+ case RTX_UNARY:
+   if (REG_P (XEXP (src, 0))
+   && GET_MODE (XEXP (src, 0)) == TImode)
+ timode_check_non_convertible_regs (candidates, regs,
+REGNO (XEXP (src, 0)));
+   break;
+
+ case RTX_COMM_ARITH:
+ case RTX_BIN_ARITH:
+   if (REG_P (XEXP (src, 0))
+   && GET_MODE (XEXP (src, 0)) == TImode)
+ timode_check_non_convertible_regs (candidates, regs,
+REGNO (XEXP (src, 0)));
+   if (REG_P (XEXP (src, 1))
+   && GET_MODE (XEXP (src, 1)) == TImode)
+ timode_check_non_convertible_regs (candidates, regs,
+REGNO (XEXP (src, 1)));
+   break;
+
+ default:
+   break;
+ }
   }
 
 EXECUTE_IF_SET_IN_BITMAP (regs, 0, id, bi)
diff --git a/gcc/testsuite/gcc.target/i386/pr106450.c 
b/gcc/testsuite/gcc.target/i386/pr106450.c
new file mode 100644
index 000..d16231f
--- /dev/null
+++ b/gcc/testsuite/gcc.tar

[x86 PATCH] Support logical shifts by (some) integer constants in TImode STV.

2022-07-28 Thread Roger Sayle

This patch improves TImode STV by adding support for logical shifts by
integer constants that are multiples of 8.  For the test case:

__int128 a, b;
void foo() { a = b << 16; }

on x86_64, gcc -O2 currently generates:

movqb(%rip), %rax
movqb+8(%rip), %rdx
shldq   $16, %rax, %rdx
salq$16, %rax
movq%rax, a(%rip)
movq%rdx, a+8(%rip)
ret

with this patch we now generate:

movdqa  b(%rip), %xmm0
pslldq  $2, %xmm0
movaps  %xmm0, a(%rip)
ret

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check. both with and without --target_board=unix{-m32},
with no new failures.  Ok for mainline?


2022-07-28  Roger Sayle  

gcc/ChangeLog
* config/i386/i386-features.cc (compute_convert_gain): Add gain
for converting suitable TImode shift to a V1TImode shift.
(timode_scalar_chain::convert_insn): Add support for converting
suitable ASHIFT and LSHIFTRT.
(timode_scalar_to_vector_candidate_p): Consider logical shifts
by integer constants that are multiples of 8 to be candidates.

gcc/testsuite/ChangeLog
* gcc.target/i386/sse4_1-stv-7.c: New test case.


Thanks again,
Roger
--

diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
index aa5de71..e1e0645 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -1221,6 +1221,13 @@ timode_scalar_chain::compute_convert_gain ()
igain = COSTS_N_INSNS (1);
  break;
 
+   case ASHIFT:
+   case LSHIFTRT:
+ /* For logical shifts by constant multiples of 8. */
+ igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (4)
+ : COSTS_N_INSNS (1);
+ break;
+
default:
  break;
}
@@ -1462,6 +1469,12 @@ timode_scalar_chain::convert_insn (rtx_insn *insn)
   src = convert_compare (XEXP (src, 0), XEXP (src, 1), insn);
   break;
 
+case ASHIFT:
+case LSHIFTRT:
+  convert_op (&XEXP (src, 0), insn);
+  PUT_MODE (src, V1TImode);
+  break;
+
 default:
   gcc_unreachable ();
 }
@@ -1796,6 +1809,14 @@ timode_scalar_to_vector_candidate_p (rtx_insn *insn)
 case NOT:
   return REG_P (XEXP (src, 0)) || timode_mem_p (XEXP (src, 0));
 
+case ASHIFT:
+case LSHIFTRT:
+  /* Handle logical shifts by integer constants between 0 and 120
+that are multiples of 8.  */
+  return REG_P (XEXP (src, 0))
+&& CONST_INT_P (XEXP (src, 1))
+&& (INTVAL (XEXP (src, 1)) & ~0x78) == 0;
+
 default:
   return false;
 }
diff --git a/gcc/testsuite/gcc.target/i386/sse4_1-stv-7.c 
b/gcc/testsuite/gcc.target/i386/sse4_1-stv-7.c
new file mode 100644
index 000..b0d5fce
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse4_1-stv-7.c
@@ -0,0 +1,18 @@
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -msse4.1 -mstv -mno-stackrealign" } */
+
+unsigned __int128 a;
+unsigned __int128 b;
+
+void foo()
+{
+  a = b << 16;
+}
+
+void bar()
+{
+  a = b >> 16;
+}
+
+/* { dg-final { scan-assembler "pslldq" } } */
+/* { dg-final { scan-assembler "psrldq" } } */


[x86_64 PATCH] Add rotl64ti2_doubleword pattern to i386.md

2022-07-28 Thread Roger Sayle

This patch adds rot[lr]64ti2_doubleword patterns to the x86_64 backend,
to move splitting of 128-bit TImode rotates by 64 bits after reload,
matching what we now do for 64-bit DImode rotations by 32 bits with -m32.

In theory, moving the point at which this rotation is split should have little
influence on code generation, but in practice "reload" sometimes
decides to make use of the increased flexibility to reduce the number
of registers used, and the code size, by using xchg.

For example:
__int128 x;
__int128 y;
__int128 a;
__int128 b;

void foo()
{
unsigned __int128 t = x;
t ^= a;
t = (t<<64) | (t>>64);
t ^= b;
y = t;
}

Before:
movqx(%rip), %rsi
movqx+8(%rip), %rdi
xorqa(%rip), %rsi
xorqa+8(%rip), %rdi
movq%rdi, %rax
movq%rsi, %rdx
xorqb(%rip), %rax
xorqb+8(%rip), %rdx
movq%rax, y(%rip)
movq%rdx, y+8(%rip)
ret

After:
movqx(%rip), %rax
movqx+8(%rip), %rdx
xorqa(%rip), %rax
xorqa+8(%rip), %rdx
xchgq   %rdx, %rax
xorqb(%rip), %rax
xorqb+8(%rip), %rdx
movq%rax, y(%rip)
movq%rdx, y+8(%rip)
ret

On some modern architectures this is a small win; on some older
architectures it is a small loss.  The decision about which code to
generate is made in "reload", and could probably be tweaked by
register preferencing.  The much bigger win is that (eventually) all
TImode mode shifts and rotates by constants will become potential
candidates for TImode STV.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check with no new failures.  Ok for mainline?


2022-07-29  Roger Sayle  

gcc/ChangeLog
* config/i386/i386.md (define_expand ti3): For
rotations by 64 bits use new rot[lr]64ti2_doubleword pattern.
(rot[lr]64ti2_doubleword): New post-reload splitter.


Thanks again,
Roger
--

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index fab6aed..f1158e1 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -13820,6 +13820,8 @@
   if (const_1_to_63_operand (operands[2], VOIDmode))
 emit_insn (gen_ix86_ti3_doubleword
(operands[0], operands[1], operands[2]));
+  else if (CONST_INT_P (operands[2]) && INTVAL (operands[2]) == 64)
+emit_insn (gen_64ti2_doubleword (operands[0], operands[1]));
   else
 {
   rtx amount = force_reg (QImode, operands[2]);
@@ -14045,6 +14047,24 @@
 }
 })
 
+(define_insn_and_split "64ti2_doubleword"
+ [(set (match_operand:TI 0 "register_operand" "=r,r,r")
+   (any_rotate:TI (match_operand:TI 1 "nonimmediate_operand" "0,r,o")
+  (const_int 64)))]
+ "TARGET_64BIT"
+ "#"
+ "&& reload_completed"
+ [(set (match_dup 0) (match_dup 3))
+  (set (match_dup 2) (match_dup 1))]
+{
+  split_double_mode (TImode, &operands[0], 2, &operands[0], &operands[2]);
+  if (rtx_equal_p (operands[0], operands[1]))
+{
+  emit_insn (gen_swapdi (operands[0], operands[2]));
+  DONE;
+}
+})
+
 (define_mode_attr rorx_immediate_operand
[(SI "const_0_to_31_operand")
 (DI "const_0_to_63_operand")])


RE: [PATCH] Some additional zero-extension related optimizations in simplify-rtx.

2022-07-28 Thread Roger Sayle


Hi Segher,
 
> On Wed, Jul 27, 2022 at 02:42:25PM +0100, Roger Sayle wrote:
> > This patch implements some additional zero-extension and
> > sign-extension related optimizations in simplify-rtx.cc.  The original
> > motivation comes from PR rtl-optimization/71775, where in comment #2
> Andrew Pinski sees:
> >
> > Failed to match this instruction:
> > (set (reg:DI 88 [ _1 ])
> > (sign_extend:DI (subreg:SI (ctz:DI (reg/v:DI 86 [ x ])) 0)))
> >
> > On many platforms the result of DImode CTZ is constrained to be a
> > small unsigned integer (between 0 and 64), hence the truncation to
> > 32-bits (using a SUBREG) and the following sign extension back to
> > 64-bits are effectively a no-op, so the above should ideally (often)
> > be simplified to "(set (reg:DI 88) (ctz:DI (reg/v:DI 86 [ x ]))".
> 
> And you can also do that if ctz is undefined for a zero argument!

Forgive my perhaps poor use of terminology.  The case of ctz 0 on
x86_64 isn't "undefined behaviour" (UB) in the C/C++ sense that
would allow us to do anything, but implementation defined (which
Intel calls "undefined" in their documentation).  Hence, we don't
know which DI value is placed in the result register.  In this case,
truncating to SI mode, then sign extending the result is not a no-op,
as the top bits will/must now all be the same [though admittedly to an
unknown undefined signbit].  Hence the above optimization would 
be invalid, as it doesn't guarantee the result would be sign-extended.

> > To implement this, and some closely related transformations, we build
> > upon the existing val_signbit_known_clear_p predicate.  In the first
> > chunk, nonzero_bits knows that FFS and ABS can't leave the sign-bit
> > bit set,
> 
> Is that guaranteed in all cases?  Also at -O0, also for args bigger than
> 64 bits?

val_signbit_known_clear_p should work for any size/precision arg.
I'm not sure if the results are affected by -O0, but even if they are, this
will not affect correctness, only whether these optimizations are performed,
which is precisely what -O0 controls.
 
> > +  /* (sign_extend:DI (subreg:SI (ctz:DI ...))) is (ctz:DI ...).  */
> > +  if (GET_CODE (op) == SUBREG
> > + && subreg_lowpart_p (op)
> > + && GET_MODE (SUBREG_REG (op)) == mode
> > + && is_a  (mode, &int_mode)
> > + && is_a  (GET_MODE (op), &op_mode)
> > + && GET_MODE_PRECISION (int_mode) <= HOST_BITS_PER_WIDE_INT
> > + && GET_MODE_PRECISION (op_mode) < GET_MODE_PRECISION
> (int_mode)
> > + && (nonzero_bits (SUBREG_REG (op), mode)
> > + & ~(GET_MODE_MASK (op_mode)>>1)) == 0)
> 
> (spaces around >> please)

Doh! Good catch, thanks.

> Please use val_signbit_known_{set,clear}_p?

Alas, it's not just the SI mode's signbit that we care about, but all of the
bits above it in the DImode operand/result.  These all need to be zero
for the operand to already be zero-extended/sign-extended.

> > +   return SUBREG_REG (op);
> 
> Also, this is not correct for C[LT]Z_DEFINED_VALUE_AT_ZERO non-zero if the
> value it returns in its second arg does not survive sign extending
unmodified (if it
> is 0x for an extend from SI to DI for example).

Fortunately, C[LT]Z_DEFINED_VALUE_AT_ZERO being defined to return a negative
result, such as -1, is already handled (accounted for) in nonzero_bits.  The
relevant code in rtlanal.cc's nonzero_bits1 is:

case CTZ:
  /* If CTZ has a known value at zero, then the nonzero bits are
 that value, plus the number of bits in the mode minus one.  */
  if (CTZ_DEFINED_VALUE_AT_ZERO (mode, nonzero))
nonzero
  |= (HOST_WIDE_INT_1U << (floor_log2 (mode_width))) - 1;
  else
nonzero = -1;
  break;

Hence, any bits set by the constant returned by the target's
DEFINED_VALUE_AT_ZERO will be set in the result of nonzero_bits.
So if this is negative, say -1, then val_signbit_known_clear_p (or the
more complex tests above) will return false.

I'm currently bootstrapping and regression testing the whitespace 
change/correction suggested above.

Thanks,
Roger
--




RE: [PATCH] Some additional zero-extension related optimizations in simplify-rtx.

2022-07-29 Thread Roger Sayle


Hi Segher,
> > > To implement this, and some closely related transformations, we
> > > build upon the existing val_signbit_known_clear_p predicate.  In the
> > > first chunk, nonzero_bits knows that FFS and ABS can't leave the
> > > sign-bit bit set,
> >
> > Is that guaranteed in all cases?  Also at -O0, also for args bigger
> > than 64 bits?
> 
> val_signbit_known_clear_p should work for any size/precision arg.

No, you're right!  Please forgive/excuse me.  Neither val_signbit_p nor
nonzero_bits has yet been updated to use "wide_int", so they don't work
for TImode or wider modes.  Doh!

I'm shocked.

Roger
--




[x86_64 PATCH take #2] PR target/106450: Tweak timode_remove_non_convertible_regs.

2022-07-30 Thread Roger Sayle

Many thanks to H.J. for pointing out a better idiom for traversing
the USEs (and also DEFs) of TImode registers in an instruction.

This revised patched has been tested on x86_64-pc-linux-gnu with
make bootstrap and make -k check, both with and without
--target_board=unix{-m32}, with no new failures.  Ok for mainline?


2022-07-30  Roger Sayle  
H.J. Lu  

gcc/ChangeLog
PR target/106450
* config/i386/i386-features.cc (timode_check_non_convertible_regs):
Do nothing if REGNO is set in the REGS bitmap, or is a hard reg.
(timode_remove_non_convertible_regs): Update comment.
Call timode_check_non_convertible_reg on all TImode register
DEFs and USEs in each instruction.

gcc/testsuite/ChangeLog
PR target/106450
* gcc.target/i386/pr106450.c: New test case.


Thanks (H.J. and Uros),
Roger
--

> -Original Message-
> From: H.J. Lu 
> Sent: 28 July 2022 17:55
> To: Roger Sayle 
> Cc: GCC Patches 
> Subject: Re: [x86_64 PATCH] PR target/106450: Tweak
> timode_remove_non_convertible_regs.
> 
> On Thu, Jul 28, 2022 at 9:43 AM Roger Sayle 
> wrote:
> >
> > This patch resolves PR target/106450, some more fall-out from more
> > aggressive TImode scalar-to-vector (STV) optimizations.  I continue to
> > be caught out by how far TImode STV has diverged from DImode/SImode
> > STV, and therefore requires additional (unexpected) tweaking.  Many
> > thanks to H.J. Lu for pointing out timode_remove_non_convertible_regs
> > needs to be extended to handle XOR (and other new operations).
> >
> > Unhelpfully the comment above this function states that it's the
> > TImode version of "remove_non_convertible_regs", which doesn't exist
> > anymore, so I've resurrected an explanatory comment from the git history.
> > By refactoring the checks for hard regs and already "marked" regs into
> > timode_check_non_convertible_regs itself, all its callers are
> > simplified.  This patch then uses GET_RTX_CLASS to generically handle
> > unary and binary operations, calling timode_check_non_convertible_regs
> > on each TImode register operand in the single_set's SET_SRC.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32},
> > with no new failures.  Ok for mainline?
> >
> >
> > 2022-07-28  Roger Sayle  
> >
> > gcc/ChangeLog
> > PR target/106450
> > * config/i386/i386-features.cc (timode_check_non_convertible_regs):
> > Do nothing if REGNO is set in the REGS bitmap, or is a hard reg.
> > (timode_remove_non_convertible_regs): Update comment.
> > Call timode_check_non_convertible_regs on all register operands
> > of supported (binary and unary) operations.
> 
> Should we use
> 
> df_ref ref;
> FOR_EACH_INSN_USE (ref, insn)
>if (!DF_REF_REG_MEM_P (ref))
>  timode_check_non_convertible_regs (candidates, regs,
>   DF_REF_REGNO (ref));
> 
> to check each use?
> 
> > gcc/testsuite/ChangeLog
> > PR target/106450
> > * gcc.target/i386/pr106450.c: New test case.
> >
> >
> > Thanks in advance,
> > Roger
> > --
> --
> H.J.
diff --git a/gcc/config/i386/i386-features.cc b/gcc/config/i386/i386-features.cc
index aa5de71..e4cc4a3 100644
--- a/gcc/config/i386/i386-features.cc
+++ b/gcc/config/i386/i386-features.cc
@@ -1808,6 +1808,11 @@ static void
 timode_check_non_convertible_regs (bitmap candidates, bitmap regs,
   unsigned int regno)
 {
+  /* Do nothing if REGNO is already in REGS or is a hard reg.  */
+  if (bitmap_bit_p (regs, regno)
+  || HARD_REGISTER_NUM_P (regno))
+return;
+
   for (df_ref def = DF_REG_DEF_CHAIN (regno);
def;
def = DF_REF_NEXT_REG (def))
@@ -1843,7 +1848,13 @@ timode_check_non_convertible_regs (bitmap candidates, 
bitmap regs,
 }
 }
 
-/* The TImode version of remove_non_convertible_regs.  */
+/* For a given bitmap of insn UIDs scans all instructions and
+   remove insn from CANDIDATES in case it has both convertible
+   and not convertible definitions.
+
+   All insns in a bitmap are conversion candidates according to
+   scalar_to_vector_candidate_p.  Currently it implies all insns
+   are single_set.  */
 
 static void
 timode_remove_non_convertible_regs (bitmap candidates)
@@ -1857,25 +1868,20 @@ timode_remove_non_convertible_regs (bitmap candidates)
 changed = false;
 EXECUTE_IF_SET_IN_BITMAP (candidates, 0, id, bi)
   {
-   rtx def_set = single_set (DF_INSN_UID_GET (id)->insn);
-   rtx dest = SET_DEST (def_set);
-   rtx src = SET_SRC (def_set);
-
-   
