Re: [PATCH] i386: Add indirect_return function attribute

2018-07-03 Thread H.J. Lu
On Fri, Jun 8, 2018 at 3:27 AM, H.J. Lu  wrote:
> On x86, swapcontext may return via indirect branch when shadow stack
> is enabled.  To support code instrumentation of control-flow transfers
> with -fcf-protection, add indirect_return function attribute to inform
> compiler that a function may return via indirect branch.
>
> Note: Unlike setjmp, swapcontext only returns once.  Mark it return
> twice will unnecessarily disable compiler optimization.
>
> OK for trunk?
>
> H.J.
> 
> gcc/
>
> PR target/85620
> * config/i386/i386.c (rest_of_insert_endbranch): Also generate
> ENDBRANCH for non-tail call which may return via indirect branch.
> * doc/extend.texi: Document indirect_return attribute.
>
> gcc/testsuite/
>
> PR target/85620
> * gcc.target/i386/pr85620-1.c: New test.
> * gcc.target/i386/pr85620-2.c: Likewise.
>

Here is the updated patch with a testcase to show the impact of the
returns_twice attribute.
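
For reference, the kind of contrast such a testcase can show looks
roughly like this (an illustrative sketch with made-up declarations,
not the actual pr85620-3.c/pr85620-4.c from the patch):

/* A swapcontext-like function: it returns only once, possibly via an
   indirect branch.  With -fcf-protection the compiler only needs to
   emit ENDBR after calls to it.  */
extern int my_swapcontext (void *oucp, const void *ucp)
  __attribute__ ((__indirect_return__));

/* A setjmp-like function: it may return more than once, so the caller
   has to keep values live across the call, which disables
   optimizations that indirect_return does not affect.  */
extern int my_setjmp (void *env)
  __attribute__ ((__returns_twice__));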

Jan, Uros, can you take a look?

Thanks.

-- 
H.J.
From 6115541e03073b93bd81f5eb81fdedd4e5b47b28 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Thu, 7 Jun 2018 20:05:15 -0700
Subject: [PATCH] i386: Add indirect_return function attribute

On x86, swapcontext may return via an indirect branch when shadow stack
is enabled.  To support code instrumentation of control-flow transfers
with -fcf-protection, add an indirect_return function attribute to
inform the compiler that a function may return via an indirect branch.

Note: Unlike setjmp, swapcontext only returns once.  Marking it as
returning twice would unnecessarily disable compiler optimizations, as
shown in the testcase here.

gcc/

	PR target/85620
	* config/i386/i386.c (rest_of_insert_endbranch): Also generate
	ENDBRANCH for non-tail call which may return via indirect branch.
	* doc/extend.texi: Document indirect_return attribute.

gcc/testsuite/

	PR target/85620
	* gcc.target/i386/pr85620-1.c: New test.
	* gcc.target/i386/pr85620-2.c: Likewise.
	* gcc.target/i386/pr85620-3.c: Likewise.
	* gcc.target/i386/pr85620-4.c: Likewise.
---
 gcc/config/i386/i386.c| 23 ++-
 gcc/doc/extend.texi   |  6 ++
 gcc/testsuite/gcc.target/i386/pr85620-1.c | 15 +++
 gcc/testsuite/gcc.target/i386/pr85620-2.c | 13 +
 gcc/testsuite/gcc.target/i386/pr85620-3.c | 18 ++
 gcc/testsuite/gcc.target/i386/pr85620-4.c | 18 ++
 6 files changed, 92 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr85620-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr85620-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr85620-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr85620-4.c

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index e6d17632142..41461d582a4 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2621,7 +2621,26 @@ rest_of_insert_endbranch (void)
 	{
 	  if (CALL_P (insn))
 	{
-	  if (find_reg_note (insn, REG_SETJMP, NULL) == NULL)
+	  bool need_endbr;
+	  need_endbr = find_reg_note (insn, REG_SETJMP, NULL) != NULL;
+	  if (!need_endbr && !SIBLING_CALL_P (insn))
+		{
+		  rtx call = get_call_rtx_from (insn);
+		  rtx fnaddr = XEXP (call, 0);
+
+		  /* Also generate ENDBRANCH for non-tail call which
+		 may return via indirect branch.  */
+		  if (MEM_P (fnaddr)
+		  && GET_CODE (XEXP (fnaddr, 0)) == SYMBOL_REF)
+		{
+		  tree fndecl = SYMBOL_REF_DECL (XEXP (fnaddr, 0));
+		  if (fndecl
+			  && lookup_attribute ("indirect_return",
+	   DECL_ATTRIBUTES (fndecl)))
+			need_endbr = true;
+		}
+		}
+	  if (!need_endbr)
 		continue;
 	  /* Generate ENDBRANCH after CALL, which can return more than
 		 twice, setjmp-like functions.  */
@@ -45897,6 +45916,8 @@ static const struct attribute_spec ix86_attribute_table[] =
 ix86_handle_fndecl_attribute, NULL },
   { "function_return", 1, 1, true, false, false, false,
 ix86_handle_fndecl_attribute, NULL },
+  { "indirect_return", 0, 0, true, false, false, false,
+ix86_handle_fndecl_attribute, NULL },
 
   /* End element.  */
   { NULL, 0, 0, false, false, false, false, NULL, NULL }
diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index 19c2da2e5db..97b1f78cade 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -5886,6 +5886,12 @@ foo (void)
 @}
 @end smallexample
 
+@item indirect_return
+@cindex @code{indirect_return} function attribute, x86
+
+The @code{indirect_return} attribute on a function is used to inform
+the compiler that the function may return via indiret branch.
+
 @end table
 
 On the x86, the inliner does not inline a
diff --git a/gcc/testsuite/gcc.target/i386/pr85620-1.c b/gcc/testsuite/gcc.target/i386/pr85620-1.c
new file mode 100644
index 000..32efb08e59e
--

Re: [PATCH] i386: Add indirect_return function attribute

2018-07-03 Thread H.J. Lu
On Tue, Jul 3, 2018 at 9:12 AM, Uros Bizjak  wrote:
> On Tue, Jul 3, 2018 at 5:32 PM, H.J. Lu  wrote:
>> On Fri, Jun 8, 2018 at 3:27 AM, H.J. Lu  wrote:
>>> On x86, swapcontext may return via indirect branch when shadow stack
>>> is enabled.  To support code instrumentation of control-flow transfers
>>> with -fcf-protection, add indirect_return function attribute to inform
>>> compiler that a function may return via indirect branch.
>>>
>>> Note: Unlike setjmp, swapcontext only returns once.  Mark it return
>>> twice will unnecessarily disable compiler optimization.
>>>
>>> OK for trunk?
>>>
>>> H.J.
>>> 
>>> gcc/
>>>
>>> PR target/85620
>>> * config/i386/i386.c (rest_of_insert_endbranch): Also generate
>>> ENDBRANCH for non-tail call which may return via indirect branch.
>>> * doc/extend.texi: Document indirect_return attribute.
>>>
>>> gcc/testsuite/
>>>
>>> PR target/85620
>>> * gcc.target/i386/pr85620-1.c: New test.
>>> * gcc.target/i386/pr85620-2.c: Likewise.
>>>
>>
>> Here is the updated patch with a testcase to show the impact of
>> returns_twice attribute.
>>
>> Jan, Uros, can you take a look?
>
> LGTM for the implementation, can't say if attribute is really needed or not.

This gives programmers more flexibility.

> +@item indirect_return
> +@cindex @code{indirect_return} function attribute, x86
> +
> +The @code{indirect_return} attribute on a function is used to inform
> +the compiler that the function may return via indiret branch.
>
> s/indiret/indirect/

Fixed.  Here is the updated patch.

Thanks.

-- 
H.J.
From bb98f6a31801659ae3c6689d6d31af33a3c28bb2 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Thu, 7 Jun 2018 20:05:15 -0700
Subject: [PATCH] i386: Add indirect_return function attribute

On x86, swapcontext may return via an indirect branch when shadow stack
is enabled.  To support code instrumentation of control-flow transfers
with -fcf-protection, add an indirect_return function attribute to
inform the compiler that a function may return via an indirect branch.

Note: Unlike setjmp, swapcontext only returns once.  Marking it as
returning twice would unnecessarily disable compiler optimizations, as
shown in the testcase here.

gcc/

	PR target/85620
	* config/i386/i386.c (rest_of_insert_endbranch): Also generate
	ENDBRANCH for non-tail call which may return via indirect branch.
	* doc/extend.texi: Document indirect_return attribute.

gcc/testsuite/

	PR target/85620
	* gcc.target/i386/pr85620-1.c: New test.
	* gcc.target/i386/pr85620-2.c: Likewise.
	* gcc.target/i386/pr85620-3.c: Likewise.
	* gcc.target/i386/pr85620-4.c: Likewise.
---
 gcc/config/i386/i386.c| 23 ++-
 gcc/doc/extend.texi   |  6 ++
 gcc/testsuite/gcc.target/i386/pr85620-1.c | 15 +++
 gcc/testsuite/gcc.target/i386/pr85620-2.c | 13 +
 gcc/testsuite/gcc.target/i386/pr85620-3.c | 18 ++
 gcc/testsuite/gcc.target/i386/pr85620-4.c | 18 ++
 6 files changed, 92 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr85620-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr85620-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr85620-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr85620-4.c

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index e6d17632142..41461d582a4 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2621,7 +2621,26 @@ rest_of_insert_endbranch (void)
 	{
 	  if (CALL_P (insn))
 	{
-	  if (find_reg_note (insn, REG_SETJMP, NULL) == NULL)
+	  bool need_endbr;
+	  need_endbr = find_reg_note (insn, REG_SETJMP, NULL) != NULL;
+	  if (!need_endbr && !SIBLING_CALL_P (insn))
+		{
+		  rtx call = get_call_rtx_from (insn);
+		  rtx fnaddr = XEXP (call, 0);
+
+		  /* Also generate ENDBRANCH for non-tail call which
+		 may return via indirect branch.  */
+		  if (MEM_P (fnaddr)
+		  && GET_CODE (XEXP (fnaddr, 0)) == SYMBOL_REF)
+		{
+		  tree fndecl = SYMBOL_REF_DECL (XEXP (fnaddr, 0));
+		  if (fndecl
+			  && lookup_attribute ("indirect_return",
+	   DECL_ATTRIBUTES (fndecl)))
+			need_endbr = true;
+		}
+		}
+	  if (!need_endbr)
 		continue;
 	  /* Generate ENDBRANCH after CALL, which can return more than
 		 twice, setjmp-like functions.  */
@@ -45897,6 +45916,8 @@ static const struct attribute_spec ix86_attribute_table[] =
 ix86_handle_fndecl_attribute, NULL },
   { "function_return", 1, 1, true, false, false, false,
 ix86_handle_fndecl_attribute, NULL },
+  { "indirect_return", 

[PATCH] x86: Tune Skylake, Cannonlake and Icelake as Haswell

2018-07-12 Thread H.J. Lu
r259399, which added PROCESSOR_SKYLAKE, disabled many x86 optimizations
which are enabled by PROCESSOR_HASWELL.  As a result, -mtune=skylake
generates slower code on Skylake than before.  The same also applies
to Cannonlake and Icelake tuning.

This patch changes -mtune={skylake|cannonlake|icelake} to tune like
-mtune=haswell until their tuning is properly adjusted.  It also
enables -mprefer-vector-width=256 for -mtune=haswell, which has no
impact on codegen when AVX512 isn't enabled.
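
As a concrete illustration of the -mprefer-vector-width=256 part (a
sketch only, not one of the pr84413-*.c tests from the patch), a loop
like the one below, compiled with -O3 -march=skylake-avx512, is now
vectorized with 256-bit ymm registers instead of 512-bit zmm registers,
while the generated code is unchanged when AVX512 isn't enabled:

/* Sketch: with avx256_optimal tuning the vectorizer prefers 256-bit
   vectors for this loop when AVX512 is available.  */
void
add_arrays (float *restrict a, const float *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    a[i] += b[i];
}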

Performance impacts on SPEC CPU 2017 rate with 1 copy using

-march=native -mfpmath=sse -O2 -m64

are

1. On Broadwell server:

500.perlbench_r -0.56%
502.gcc_r       -0.18%
505.mcf_r        0.24%
520.omnetpp_r    0.00%
523.xalancbmk_r -0.32%
525.x264_r      -0.17%
531.deepsjeng_r  0.00%
541.leela_r      0.00%
548.exchange2_r  0.12%
557.xz_r         0.00%
geomean          0.00%

503.bwaves_r     0.00%
507.cactuBSSN_r  0.21%
508.namd_r       0.00%
510.parest_r     0.19%
511.povray_r    -0.48%
519.lbm_r        0.00%
521.wrf_r        0.28%
526.blender_r    0.19%
527.cam4_r       0.39%
538.imagick_r    0.00%
544.nab_r       -0.36%
549.fotonik3d_r  0.51%
554.roms_r       0.00%
geomean          0.17%

On Skylake client:

500.perlbench_r  0.96%
502.gcc_r        0.13%
505.mcf_r       -1.03%
520.omnetpp_r   -1.11%
523.xalancbmk_r  1.02%
525.x264_r       0.50%
531.deepsjeng_r  2.97%
541.leela_r      0.50%
548.exchange2_r -0.95%
557.xz_r         2.41%
geomean          0.56%

503.bwaves_r     0.49%
507.cactuBSSN_r  3.17%
508.namd_r       4.05%
510.parest_r     0.15%
511.povray_r     0.80%
519.lbm_r        3.15%
521.wrf_r       10.56%
526.blender_r    2.97%
527.cam4_r       2.36%
538.imagick_r   46.40%
544.nab_r        2.04%
549.fotonik3d_r  0.00%
554.roms_r       1.27%
geomean          5.49%

On Skylake server:

500.perlbench_r  0.71%
502.gcc_r       -0.51%
505.mcf_r       -1.06%
520.omnetpp_r   -0.33%
523.xalancbmk_r -0.22%
525.x264_r       1.72%
531.deepsjeng_r -0.26%
541.leela_r      0.57%
548.exchange2_r -0.75%
557.xz_r        -1.28%
geomean         -0.21%

503.bwaves_r     0.00%
507.cactuBSSN_r  2.66%
508.namd_r       3.67%
510.parest_r     1.25%
511.povray_r     2.26%
519.lbm_r        1.69%
521.wrf_r       11.03%
526.blender_r    3.39%
527.cam4_r       1.69%
538.imagick_r   64.59%
544.nab_r       -0.54%
549.fotonik3d_r  2.68%
554.roms_r       0.00%
geomean          6.19%

This patch improves -march=native performance on Skylake by up to 60% and
leaves -march=native performance unchanged on Haswell.

OK for trunk?

Thanks.

H.J.
---
gcc/

2018-07-12  H.J. Lu  
Sunil K Pandey  

PR target/84413
* config/i386/i386.c (m_HASWELL): Add PROCESSOR_SKYLAKE,
PROCESSOR_SKYLAKE_AVX512, PROCESSOR_CANNONLAKE,
PROCESSOR_ICELAKE_CLIENT and PROCESSOR_ICELAKE_SERVER.
(m_SKYLAKE): Set to 0.
(m_SKYLAKE_AVX512): Likewise.
(m_CANNONLAKE): Likewise.
(m_ICELAKE_CLIENT): Likewise.
(m_ICELAKE_SERVER): Likewise.
* config/i386/x86-tune.def (avx256_optimal): Also enabled for
m_HASWELL.

gcc/testsuite/

2018-07-12  H.J. Lu  
Sunil K Pandey  

PR target/84413
* gcc.target/i386/pr84413-1.c: New test.
* gcc.target/i386/pr84413-2.c: Likewise.
* gcc.target/i386/pr84413-3.c: Likewise.
* gcc.target/i386/pr84413-4.c: Likewise.
---
 gcc/config/i386/i386.c| 17 +++--
 gcc/config/i386/x86-tune.def  |  9 ++---
 gcc/testsuite/gcc.target/i386/pr84413-1.c | 17 +
 gcc/testsuite/gcc.target/i386/pr84413-2.c | 17 +
 gcc/testsuite/gcc.target/i386/pr84413-3.c | 17 +
 gcc/testsuite/gcc.target/i386/pr84413-4.c | 17 +
 6 files changed, 85 insertions(+), 9 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr84413-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr84413-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr84413-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr84413-4.c

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 9e46b7b136f..762ab89fc9e 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -137,17 +137,22 @@ const struct processor_costs *ix86_cost = NULL;
 #define m_CORE2 (HOST_WIDE_INT_1U<

Re: [PATCH] x86: Tune Skylake, Cannonlake and Icelake as Haswell

2018-07-13 Thread H.J. Lu
On Fri, Jul 13, 2018 at 08:53:02AM +0200, Uros Bizjak wrote:
> On Thu, Jul 12, 2018 at 9:57 PM, H.J. Lu  wrote:
> 
> > r259399, which added PROCESSOR_SKYLAKE, disabled many x86 optimizations
> > which are enabled by PROCESSOR_HASWELL.  As the result, -mtune=skylake
> > generates slower codes on Skylake than before.  The same also applies
> > to Cannonlake and Icelak tuning.
> >
> > This patch changes -mtune={skylake|cannonlake|icelake} to tune like
> > -mtune=haswell for until their tuning is properly adjusted. It also
> > enables -mprefer-vector-width=256 for -mtune=haswell, which has no
> > impact on codegen when AVX512 isn't enabled.
> >
> > Performance impacts on SPEC CPU 2017 rate with 1 copy using
> >
> > -march=native -mfpmath=sse -O2 -m64
> >
> > are
> >
> > 1. On Broadwell server:
> >
> > 500.perlbench_r -0.56%
> > 502.gcc_r   -0.18%
> > 505.mcf_r   0.24%
> > 520.omnetpp_r   0.00%
> > 523.xalancbmk_r -0.32%
> > 525.x264_r  -0.17%
> > 531.deepsjeng_r 0.00%
> > 541.leela_r 0.00%
> > 548.exchange2_r 0.12%
> > 557.xz_r0.00%
> > geomean 0.00%
> >
> > 503.bwaves_r0.00%
> > 507.cactuBSSN_r 0.21%
> > 508.namd_r  0.00%
> > 510.parest_r0.19%
> > 511.povray_r-0.48%
> > 519.lbm_r   0.00%
> > 521.wrf_r   0.28%
> > 526.blender_r   0.19%
> > 527.cam4_r  0.39%
> > 538.imagick_r   0.00%
> > 544.nab_r   -0.36%
> > 549.fotonik3d_r 0.51%
> > 554.roms_r  0.00%
> > geomean 0.17%
> >
> > On Skylake client:
> >
> > 500.perlbench_r 0.96%
> > 502.gcc_r   0.13%
> > 505.mcf_r   -1.03%
> > 520.omnetpp_r   -1.11%
> > 523.xalancbmk_r 1.02%
> > 525.x264_r  0.50%
> > 531.deepsjeng_r 2.97%
> > 541.leela_r 0.50%
> > 548.exchange2_r -0.95%
> > 557.xz_r2.41%
> > geomean 0.56%
> >
> > 503.bwaves_r0.49%
> > 507.cactuBSSN_r 3.17%
> > 508.namd_r  4.05%
> > 510.parest_r0.15%
> > 511.povray_r0.80%
> > 519.lbm_r   3.15%
> > 521.wrf_r   10.56%
> > 526.blender_r   2.97%
> > 527.cam4_r  2.36%
> > 538.imagick_r   46.40%
> > 544.nab_r   2.04%
> > 549.fotonik3d_r 0.00%
> > 554.roms_r  1.27%
> > geomean 5.49%
> >
> > On Skylake server:
> >
> > 500.perlbench_r 0.71%
> > 502.gcc_r   -0.51%
> > 505.mcf_r   -1.06%
> > 520.omnetpp_r   -0.33%
> > 523.xalancbmk_r -0.22%
> > 525.x264_r  1.72%
> > 531.deepsjeng_r -0.26%
> > 541.leela_r 0.57%
> > 548.exchange2_r -0.75%
> > 557.xz_r-1.28%
> > geomean -0.21%
> >
> > 503.bwaves_r0.00%
> > 507.cactuBSSN_r 2.66%
> > 508.namd_r  3.67%
> > 510.parest_r1.25%
> > 511.povray_r2.26%
> > 519.lbm_r   1.69%
> > 521.wrf_r   11.03%
> > 526.blender_r   3.39%
> > 527.cam4_r  1.69%
> > 538.imagick_r   64.59%
> > 544.nab_r   -0.54%
> > 549.fotonik3d_r 2.68%
> > 554.roms_r  0.00%
> > geomean 6.19%
> >
> > This patch improves -march=native performance on Skylake up to 60% and
> > leaves -march=native performance unchanged on Haswell.
> >
> > OK for trunk?
> >
> > Thanks.
> >
> > H.J.
> > ---
> > gcc/
> >
> > 2018-07-12  H.J. Lu  
> > Sunil K Pandey  
> >
> > PR target/84413
> > * config/i386/i386.c (m_HASWELL): Add PROCESSOR_SKYLAKE,
> > PROCESSOR_SKYLAKE_AVX512, PROCESSOR_CANNONLAKE,
> > PROCESSOR_ICELAKE_CLIENT and PROCESSOR_ICELAKE_SERVER.
> > (m_SKYLAKE): Set to 0.
> > (m_SKYLAKE_AVX512): Likewise.
> > (m_CANNONLAKE): Likewise.
> > (m_ICELAKE_CLIENT): Likewise.
> > (m_ICELAKE_SERVER): Likewise.
> > * config/i386/x86-tune.def (avx256_optimal): Also enabled for
> > m_HASWELL.
> >

Re: [PATCH] x86: Tune Skylake, Cannonlake and Icelake as Haswell

2018-07-13 Thread H.J. Lu
On Fri, Jul 13, 2018 at 9:07 AM, Jan Hubicka  wrote:
>> > > > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
>> > > > index 9e46b7b136f..762ab89fc9e 100644
>> > > > --- a/gcc/config/i386/i386.c
>> > > > +++ b/gcc/config/i386/i386.c
>> > > > @@ -137,17 +137,22 @@ const struct processor_costs *ix86_cost = NULL;
>> > > >  #define m_CORE2 (HOST_WIDE_INT_1U<<PROCESSOR_CORE2)
>> > > >  #define m_NEHALEM (HOST_WIDE_INT_1U<<PROCESSOR_NEHALEM)
>> > > >  #define m_SANDYBRIDGE (HOST_WIDE_INT_1U<<PROCESSOR_SANDYBRIDGE)
>> > > > -#define m_HASWELL (HOST_WIDE_INT_1U<<PROCESSOR_HASWELL)
>> > > > +#define m_HASWELL ((HOST_WIDE_INT_1U<<PROCESSOR_HASWELL) \
>> > > > +  | (HOST_WIDE_INT_1U<<PROCESSOR_SKYLAKE) \
>> > > > +  | (HOST_WIDE_INT_1U<<PROCESSOR_SKYLAKE_AVX512) \
>> > > > +  | (HOST_WIDE_INT_1U<<PROCESSOR_CANNONLAKE) \
>> > > > +  | (HOST_WIDE_INT_1U<<PROCESSOR_ICELAKE_CLIENT) \
>> > > > +  | (HOST_WIDE_INT_1U<<PROCESSOR_ICELAKE_SERVER))
>> > > >
>> > >
>> > > Please introduce a new per-family define and group processors in this
>> > > define. Something like m_BDVER, m_BTVER and m_AMD_MULTIPLE for AMD
>> > targets.
>> > > We should not redefine m_HASWELL to include unrelated families.
>> > >
>> >
>> > Here is the updated patch.  OK for trunk if all tests pass?
>> >
>> >
>> OK.
>
> We have also noticed that benchmarks on skylake are not good compared to
> haswell, this nicely explains it.  I think this is a -march=native regression
> compared to GCC versions that did not support better CPUs than Haswell.  So
> it would be nice to backport it.

Yes, we should.   Here is the patch to backport to GCC 8.  OK for GCC 8 after
it has been checked into trunk?

Thanks.

-- 
H.J.
From 40a1050b330b421a1f445cb2a40b5a002da2e6d6 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Mon, 4 Jun 2018 19:16:06 -0700
Subject: [PATCH] x86: Tune Skylake, Cannonlake and Icelake as Haswell

r259399, which added PROCESSOR_SKYLAKE, disabled many x86 optimizations
which are enabled by PROCESSOR_HASWELL.  As a result, -mtune=skylake
generates slower code on Skylake than before.  The same also applies
to Cannonlake and Icelake tuning.

This patch changes -mtune={skylake|cannonlake|icelake} to tune like
-mtune=haswell until their tuning is properly adjusted.  It also
enables -mprefer-vector-width=256 for -mtune=haswell, which has no
impact on codegen when AVX512 isn't enabled.

Performance impacts on SPEC CPU 2017 rate with 1 copy using

-march=native -mfpmath=sse -O2 -m64

are

1. On Broadwell server:

500.perlbench_r		-0.56%
502.gcc_r		-0.18%
505.mcf_r		0.24%
520.omnetpp_r		0.00%
523.xalancbmk_r		-0.32%
525.x264_r		-0.17%
531.deepsjeng_r		0.00%
541.leela_r		0.00%
548.exchange2_r		0.12%
557.xz_r		0.00%
Geomean			0.00%

503.bwaves_r		0.00%
507.cactuBSSN_r		0.21%
508.namd_r		0.00%
510.parest_r		0.19%
511.povray_r		-0.48%
519.lbm_r		0.00%
521.wrf_r		0.28%
526.blender_r		0.19%
527.cam4_r		0.39%
538.imagick_r		0.00%
544.nab_r		-0.36%
549.fotonik3d_r		0.51%
554.roms_r		0.00%
Geomean			0.17%

On Skylake client:

500.perlbench_r		0.96%
502.gcc_r		0.13%
505.mcf_r		-1.03%
520.omnetpp_r		-1.11%
523.xalancbmk_r		1.02%
525.x264_r		0.50%
531.deepsjeng_r		2.97%
541.leela_r		0.50%
548.exchange2_r		-0.95%
557.xz_r		2.41%
Geomean			0.56%

503.bwaves_r		0.49%
507.cactuBSSN_r		3.17%
508.namd_r		4.05%
510.parest_r		0.15%
511.povray_r		0.80%
519.lbm_r		3.15%
521.wrf_r		10.56%
526.blender_r		2.97%
527.cam4_r		2.36%
538.imagick_r		46.40%
544.nab_r		2.04%
549.fotonik3d_r		0.00%
554.roms_r		1.27%
Geomean			5.49%

On Skylake server:

500.perlbench_r		0.71%
502.gcc_r		-0.51%
505.mcf_r		-1.06%
520.omnetpp_r		-0.33%
523.xalancbmk_r		-0.22%
525.x264_r		1.72%
531.deepsjeng_r		-0.26%
541.leela_r		0.57%
548.exchange2_r		-0.75%
557.xz_r		-1.28%
Geomean			-0.21%

503.bwaves_r		0.00%
507.cactuBSSN_r		2.66%
508.namd_r		3.67%
510.parest_r		1.25%
511.povray_r		2.26%
519.lbm_r		1.69%
521.wrf_r		11.03%
526.blender_r		3.39%
527.cam4_r		1.69%
538.imagick_r		64.59%
544.nab_r		-0.54%
549.fotonik3d_r		2.68%
554.roms_r		0.00%
Geomean			6.19%

This patch improves -march=native performance on Skylake by up to 60% and
leaves -march=native performance unchanged on Haswell.

gcc/

	Backport from mainline
	2018-07-12  H.J. Lu  
		Sunil K Pandey  

	PR target/84413
	* config/i386/i386.c (m_CORE_AVX512): New.
	(m_CORE_AVX2): Likewise.
	(m_CORE_ALL): Add m_CORE_AVX2.
	* config/i386/x86-tune.def: Replace m_HASWELL with m_CORE_AVX2.
	Replace m_SKYLAKE_AVX512 with m_CORE_AVX512 on avx256_optimal
	and remove the rest of m_SKYLAKE_AVX512.

gcc/testsuite/

	Backport from mainline
	2018-07-12  H.J. Lu  
		Sunil K Pandey  

	PR target/84413
	* gcc.target/i386/pr84413-1.c: New test.
	* gcc.target/i386/pr84413-2.c: Likewise.
	* gcc.target/i386/pr84413-3.c: Likewise.
	* gcc.tar
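
For reference, the per-family groupings named in the ChangeLog above
would look roughly like this (a sketch inferred from the ChangeLog
entries, not the verbatim hunk from the attached patch):

/* Sketch only; see the attached patch for the exact definitions.  */
#define m_CORE_AVX512 (m_SKYLAKE_AVX512 | m_CANNONLAKE \
                       | m_ICELAKE_CLIENT | m_ICELAKE_SERVER)
#define m_CORE_AVX2 (m_HASWELL | m_SKYLAKE | m_CORE_AVX512)
#define m_CORE_ALL (m_CORE2 | m_NEHALEM | m_SANDYBRIDGE | m_CORE_AVX2)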

Re: [PATCH] x86: Tune Skylake, Cannonlake and Icelake as Haswell

2018-07-13 Thread H.J. Lu
On Fri, Jul 13, 2018 at 9:31 AM, Jan Hubicka  wrote:
>> > We have also noticed that benchmarks on skylake are not good compared to
>> > haswell, this nicely explains it.  I think this is -march=native regression
>> > compared to GCC versions that did not suppored better CPUs than Haswell.  
>> > So it
>> > would be nice to backport it.
>>
>> Yes, we should.   Here is the patch to backport to GCC 8.  OK for GCC 8 after
>> it has been checked into trunk?
>
> OK,
> Honza
>>
>> Thanks.
>>
>> --
>> H.J.
>
>> From 40a1050b330b421a1f445cb2a40b5a002da2e6d6 Mon Sep 17 00:00:00 2001
>> From: "H.J. Lu" 
>> Date: Mon, 4 Jun 2018 19:16:06 -0700
>> Subject: [PATCH] x86: Tune Skylake, Cannonlake and Icelake as Haswell
>>
>> r259399, which added PROCESSOR_SKYLAKE, disabled many x86 optimizations
>> which are enabled by PROCESSOR_HASWELL.  As the result, -mtune=skylake
>> generates slower codes on Skylake than before.  The same also applies
>> to Cannonlake and Icelak tuning.
>>
>> This patch changes -mtune={skylake|cannonlake|icelake} to tune like
>> -mtune=haswell for until their tuning is properly adjusted. It also
>> enables -mprefer-vector-width=256 for -mtune=haswell, which has no
>> impact on codegen when AVX512 isn't enabled.
>>
>> Performance impacts on SPEC CPU 2017 rate with 1 copy using
>>
>> -march=native -mfpmath=sse -O2 -m64
>>
>> are
>>
>> 1. On Broadwell server:
>>
>> 500.perlbench_r   -0.56%
>> 502.gcc_r -0.18%
>> 505.mcf_r 0.24%
>> 520.omnetpp_r 0.00%
>> 523.xalancbmk_r   -0.32%
>> 525.x264_r-0.17%
>> 531.deepsjeng_r   0.00%
>> 541.leela_r   0.00%
>> 548.exchange2_r   0.12%
>> 557.xz_r  0.00%
>> Geomean   0.00%
>>
>> 503.bwaves_r  0.00%
>> 507.cactuBSSN_r   0.21%
>> 508.namd_r0.00%
>> 510.parest_r  0.19%
>> 511.povray_r  -0.48%
>> 519.lbm_r 0.00%
>> 521.wrf_r 0.28%
>> 526.blender_r 0.19%
>> 527.cam4_r0.39%
>> 538.imagick_r 0.00%
>> 544.nab_r -0.36%
>> 549.fotonik3d_r   0.51%
>> 554.roms_r0.00%
>> Geomean   0.17%
>>
>> On Skylake client:
>>
>> 500.perlbench_r   0.96%
>> 502.gcc_r 0.13%
>> 505.mcf_r -1.03%
>> 520.omnetpp_r -1.11%
>> 523.xalancbmk_r   1.02%
>> 525.x264_r0.50%
>> 531.deepsjeng_r   2.97%
>> 541.leela_r   0.50%
>> 548.exchange2_r   -0.95%
>> 557.xz_r  2.41%
>> Geomean   0.56%
>>
>> 503.bwaves_r  0.49%
>> 507.cactuBSSN_r   3.17%
>> 508.namd_r4.05%
>> 510.parest_r  0.15%
>> 511.povray_r  0.80%
>> 519.lbm_r 3.15%
>> 521.wrf_r 10.56%
>> 526.blender_r 2.97%
>> 527.cam4_r2.36%
>> 538.imagick_r 46.40%
>> 544.nab_r 2.04%
>> 549.fotonik3d_r   0.00%
>> 554.roms_r1.27%
>> Geomean   5.49%
>>
>> On Skylake server:
>>
>> 500.perlbench_r   0.71%
>> 502.gcc_r -0.51%
>> 505.mcf_r -1.06%
>> 520.omnetpp_r -0.33%
>> 523.xalancbmk_r   -0.22%
>> 525.x264_r1.72%
>> 531.deepsjeng_r   -0.26%
>> 541.leela_r   0.57%
>> 548.exchange2_r   -0.75%
>> 557.xz_r  -1.28%
>> Geomean   -0.21%
>>
>> 503.bwaves_r  0.00%
>> 507.cactuBSSN_r   2.66%
>> 508.namd_r3.67%
>> 510.parest_r  1.25%
>> 511.povray_r  2.26%
>> 519.lbm_r 1.69%
>> 521.wrf_r 11.03%
>> 526.blender_r 3.39%
>> 527.cam4_r1.69%
>> 538.imagick_r 64.59%
>> 544.nab_r -0.54%
>> 549.fotonik3d_r   2.68%
>> 554.roms_r0.00%
>> Geomean   6.19%
>>
>> This patch improves -march=native performance on Skylake up to 60% and
>> leaves -march=native performance unchanged on Haswell.
>>
>> gcc/
>>
>>   Backport from mainline
>>   20

Re: [PATCH] x86: Tune Skylake, Cannonlake and Icelake as Haswell

2018-07-14 Thread H.J. Lu
On Sat, Jul 14, 2018 at 06:09:47PM +0200, Gerald Pfeifer wrote:
> On Fri, 13 Jul 2018, H.J. Lu wrote:
> > I will do the same for GCC8 backport.
> 
> Can you please add a note to gcc-8/changes.html?  This seems big
> enough to warrant a note in a part for GCC 8.2.
> 
> (At gcc-7/changes.html you can see how to go about this for minor
> releases.)
> 

Like this?

H.J.
---
Index: changes.html
===
RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-8/changes.html,v
retrieving revision 1.88
diff -u -p -r1.88 changes.html
--- changes.html14 Jun 2018 13:52:35 -  1.88
+++ changes.html14 Jul 2018 21:17:10 -
@@ -1312,5 +1312,23 @@ known to be fixed in the 8.1 release. Th
 complete (that is, it is possible that some PRs that have been fixed
 are not listed here).
 
+
+GCC 8.2
+
+This is the https://gcc.gnu.org/bugzilla/buglist.cgi?bug_status=RESOLVED&resolution=FIXED&target_milestone=8.2";>list
+of problem reports (PRs) from GCC's bug tracking system that are
+known to be fixed in the 8.1 release. This list might not be
+complete (that is, it is possible that some PRs that have been fixed
+are not listed here).
+
+Target Specific Changes
+
+IA-32/x86-64
+  
+ -mtune=native performance regression
+https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84413";>PR84413
+on Intel Skylake processors has been fixed.
+  
+
 
 


[PATCH 2/3] i386: Change indirect_return to function type attribute

2018-07-18 Thread H.J. Lu
In

struct ucontext;
typedef struct ucontext ucontext_t;

extern int (*bar) (ucontext_t *__restrict __oucp,
   const ucontext_t *__restrict __ucp)
  __attribute__((__indirect_return__));

extern int res;

void
foo (ucontext_t *oucp, ucontext_t *ucp)
{
  res = bar (oucp, ucp);
}

bar() may return via an indirect branch.  This patch changes
indirect_return to a type attribute so that it can be applied to a
variable or type of function pointer, allowing ENDBR to be inserted
after the call to bar().

Tested on i386 and x86-64.  OK for trunk?

Thanks.


H.J.
---
gcc/

PR target/86560
* config/i386/i386.c (rest_of_insert_endbranch): Lookup
indirect_return as function type attribute.
(ix86_attribute_table): Change indirect_return to function
type attribute.
* doc/extend.texi: Update indirect_return attribute.

gcc/testsuite/

PR target/86560
* gcc.target/i386/pr86560-1.c: New test.
* gcc.target/i386/pr86560-2.c: Likewise.
* gcc.target/i386/pr86560-3.c: Likewise.
---
 gcc/config/i386/i386.c| 23 +++
 gcc/doc/extend.texi   |  5 +++--
 gcc/testsuite/gcc.target/i386/pr86560-1.c | 16 
 gcc/testsuite/gcc.target/i386/pr86560-2.c | 16 
 gcc/testsuite/gcc.target/i386/pr86560-3.c | 17 +
 5 files changed, 67 insertions(+), 10 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr86560-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr86560-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr86560-3.c

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index aec739c3974..ac27248370b 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2627,16 +2627,23 @@ rest_of_insert_endbranch (void)
{
  rtx call = get_call_rtx_from (insn);
  rtx fnaddr = XEXP (call, 0);
+ tree fndecl = NULL_TREE;
 
  /* Also generate ENDBRANCH for non-tail call which
 may return via indirect branch.  */
- if (MEM_P (fnaddr)
- && GET_CODE (XEXP (fnaddr, 0)) == SYMBOL_REF)
+ if (GET_CODE (XEXP (fnaddr, 0)) == SYMBOL_REF)
+   fndecl = SYMBOL_REF_DECL (XEXP (fnaddr, 0));
+ if (fndecl == NULL_TREE)
+   fndecl = MEM_EXPR (fnaddr);
+ if (fndecl
+ && TREE_CODE (TREE_TYPE (fndecl)) != FUNCTION_TYPE
+ && TREE_CODE (TREE_TYPE (fndecl)) != METHOD_TYPE)
+   fndecl = NULL_TREE;
+ if (fndecl && TYPE_ARG_TYPES (TREE_TYPE (fndecl)))
{
- tree fndecl = SYMBOL_REF_DECL (XEXP (fnaddr, 0));
- if (fndecl
- && lookup_attribute ("indirect_return",
-  DECL_ATTRIBUTES (fndecl)))
+ tree fntype = TREE_TYPE (fndecl);
+ if (lookup_attribute ("indirect_return",
+   TYPE_ATTRIBUTES (fntype)))
need_endbr = true;
}
}
@@ -46101,8 +46108,8 @@ static const struct attribute_spec 
ix86_attribute_table[] =
 ix86_handle_fndecl_attribute, NULL },
   { "function_return", 1, 1, true, false, false, false,
 ix86_handle_fndecl_attribute, NULL },
-  { "indirect_return", 0, 0, true, false, false, false,
-ix86_handle_fndecl_attribute, NULL },
+  { "indirect_return", 0, 0, false, true, true, false,
+NULL, NULL },
 
   /* End element.  */
   { NULL, 0, 0, false, false, false, false, NULL, NULL }
diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index 8b4d3fd9de3..edeaec6d872 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -5861,8 +5861,9 @@ foo (void)
 @item indirect_return
 @cindex @code{indirect_return} function attribute, x86
 
-The @code{indirect_return} attribute on a function is used to inform
-the compiler that the function may return via indirect branch.
+The @code{indirect_return} attribute can be applied to a function,
+as well as variable or type of function pointer to inform the
+compiler that the function may return via indirect branch.
 
 @end table
 
diff --git a/gcc/testsuite/gcc.target/i386/pr86560-1.c 
b/gcc/testsuite/gcc.target/i386/pr86560-1.c
new file mode 100644
index 000..a2b702695c5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr86560-1.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fcf-protection" } */
+/* { dg-final { scan-assembler-times {\mendbr} 2 } } */
+
+struct ucontext;
+
+extern int (*bar) (struct ucontext *)
+  __attribute__((__indirect_return__));
+
+extern int res;
+
+void
+foo (struct ucontext *oucp)
+{
+  res = bar (oucp);
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr86560-2.c 
b/gcc/testsuite/gcc.target/i386/pr

[PATCH] Call REAL(swapcontext) with indirect_return attribute on x86

2018-07-18 Thread H.J. Lu
asan/asan_interceptors.cc has

...
  int res = REAL(swapcontext)(oucp, ucp);
...

REAL(swapcontext) is a function pointer to swapcontext in libc.  Since
swapcontext may return via an indirect branch on x86 when shadow stack
is enabled, we need to call REAL(swapcontext) with the indirect_return
attribute on x86 so that the compiler can insert ENDBR after the
REAL(swapcontext) call.

I opened an LLVM bug:

https://bugs.llvm.org/show_bug.cgi?id=38207

But it won't get fixed before indirect_return attribute is added to
LLVM.  I'd like to get it fixed in GCC first.

Tested on i386 and x86-64.  OK for trunk after

https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01007.html

is approved?

Thanks.


H.J.
---
PR target/86560
* asan/asan_interceptors.cc (swapcontext): Call REAL(swapcontext)
with indirect_return attribute on x86.
---
 libsanitizer/asan/asan_interceptors.cc | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/libsanitizer/asan/asan_interceptors.cc 
b/libsanitizer/asan/asan_interceptors.cc
index a8f4b72723f..b8dde4f19c5 100644
--- a/libsanitizer/asan/asan_interceptors.cc
+++ b/libsanitizer/asan/asan_interceptors.cc
@@ -267,7 +267,13 @@ INTERCEPTOR(int, swapcontext, struct ucontext_t *oucp,
   uptr stack, ssize;
   ReadContextStack(ucp, &stack, &ssize);
   ClearShadowMemoryForContextStack(stack, ssize);
+#if defined(__x86_64__) || defined(__i386__)
+  int (*real_swapcontext) (struct ucontext_t *, struct ucontext_t *)
+__attribute__((__indirect_return__)) = REAL(swapcontext);
+  int res = real_swapcontext(oucp, ucp);
+#else
   int res = REAL(swapcontext)(oucp, ucp);
+#endif
   // swapcontext technically does not return, but program may swap context to
   // "oucp" later, that would look as if swapcontext() returned 0.
   // We need to clear shadow for ucp once again, as it may be in arbitrary
-- 
2.17.1



Re: [PATCH] Call REAL(swapcontext) with indirect_return attribute on x86

2018-07-18 Thread H.J. Lu
On Wed, Jul 18, 2018 at 11:18 AM, Kostya Serebryany  wrote:
> What's ENDBR and do we really need to have it in compiler-rt?

When shadow stack from Intel CET is enabled,  the first instruction of all
indirect branch targets must be a special instruction, ENDBR.  In this case,

int res = REAL(swapcontext)(oucp, ucp);
The call to REAL(swapcontext) may return via an indirect branch.

Here the compiler must insert ENDBR after the call, like this:

call *bar(%rip)
endbr64

> As usual, I am opposed to any gcc compiler-rt changes that bypass upstream.

We want it to be fixed upstream.  That is why I opened an LLVM bug.


> --kcc
>
> On Wed, Jul 18, 2018 at 8:37 AM H.J. Lu  wrote:
>>
>> asan/asan_interceptors.cc has
>>
>> ...
>>   int res = REAL(swapcontext)(oucp, ucp);
>> ...
>>
>> REAL(swapcontext) is a function pointer to swapcontext in libc.  Since
>> swapcontext may return via indirect branch on x86 when shadow stack is
>> enabled, we need to call REAL(swapcontext) with indirect_return attribute
>> on x86 so that compiler can insert ENDBR after REAL(swapcontext) call.
>>
>> I opened an LLVM bug:
>>
>> https://bugs.llvm.org/show_bug.cgi?id=38207
>>
>> But it won't get fixed before indirect_return attribute is added to
>> LLVM.  I'd like to get it fixed in GCC first.
>>
>> Tested on i386 and x86-64.  OK for trunk after
>>
>> https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01007.html
>>
>> is approved?
>>
>> Thanks.
>>
>>
>> H.J.
>> ---
>> PR target/86560
>> * asan/asan_interceptors.cc (swapcontext): Call REAL(swapcontext)
>> with indirect_return attribute on x86.
>> ---
>>  libsanitizer/asan/asan_interceptors.cc | 6 ++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/libsanitizer/asan/asan_interceptors.cc 
>> b/libsanitizer/asan/asan_interceptors.cc
>> index a8f4b72723f..b8dde4f19c5 100644
>> --- a/libsanitizer/asan/asan_interceptors.cc
>> +++ b/libsanitizer/asan/asan_interceptors.cc
>> @@ -267,7 +267,13 @@ INTERCEPTOR(int, swapcontext, struct ucontext_t *oucp,
>>uptr stack, ssize;
>>ReadContextStack(ucp, &stack, &ssize);
>>ClearShadowMemoryForContextStack(stack, ssize);
>> +#if defined(__x86_64__) || defined(__i386__)
>> +  int (*real_swapcontext) (struct ucontext_t *, struct ucontext_t *)
>> +__attribute__((__indirect_return__)) = REAL(swapcontext);
>> +  int res = real_swapcontext(oucp, ucp);
>> +#else
>>int res = REAL(swapcontext)(oucp, ucp);
>> +#endif
>>// swapcontext technically does not return, but program may swap context 
>> to
>>// "oucp" later, that would look as if swapcontext() returned 0.
>>// We need to clear shadow for ucp once again, as it may be in arbitrary
>> --
>> 2.17.1
>>



-- 
H.J.


Re: [PATCH] Call REAL(swapcontext) with indirect_return attribute on x86

2018-07-18 Thread H.J. Lu
On Wed, Jul 18, 2018 at 11:45 AM, Kostya Serebryany  wrote:
> On Wed, Jul 18, 2018 at 11:40 AM H.J. Lu  wrote:
>>
>> On Wed, Jul 18, 2018 at 11:18 AM, Kostya Serebryany  wrote:
>> > What's ENDBR and do we really need to have it in compiler-rt?
>>
>> When shadow stack from Intel CET is enabled,  the first instruction of all
>> indirect branch targets must be a special instruction, ENDBR.  In this case,
>
> I am confused.
> CET is a security mitigation feature (and ENDBR is a pretty weak form
> of such), while ASAN is a testing tool, rarely used in production and
> almost never as a mitigation (which it is not!).
> Why would anyone need to combine CET and ASAN in one process?
>

CET is transparent to ASAN.  It is perfectly OK to use -fcf-protection to
enable CET together with ASAN.
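
For instance, it is fine to build a program with both enabled at once
(an illustrative command line):

  gcc -O2 -fsanitize=address -fcf-protection -o test test.c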

> Also, CET doesn't exist in the hardware yet, at least not publicly available.
> Which means there should be no rush (am I wrong?) and we can do things
> in the correct order:
> implement the Clang/LLVM support, make the compiler-rt change in LLVM,
> merge back to GCC.

I am working with our LLVM people to address this.

H.J.
> --kcc
>
>>
>> int res = REAL(swapcontext)(oucp, ucp);
>>     This function may be
>> returned via an indirect branch.
>>
>> Here compiler must insert ENDBR after call, like
>>
>> call *bar(%rip)
>> endbr64
>>
>> > As usual, I am opposed to any gcc compiler-rt that bypass upstream.
>>
>> We want it to be fixed in upstream.  That is why I opened an LLVM bug.
>>
>>
>> > --kcc
>> >
>> > On Wed, Jul 18, 2018 at 8:37 AM H.J. Lu  wrote:
>> >>
>> >> asan/asan_interceptors.cc has
>> >>
>> >> ...
>> >>   int res = REAL(swapcontext)(oucp, ucp);
>> >> ...
>> >>
>> >> REAL(swapcontext) is a function pointer to swapcontext in libc.  Since
>> >> swapcontext may return via indirect branch on x86 when shadow stack is
>> >> enabled, we need to call REAL(swapcontext) with indirect_return attribute
>> >> on x86 so that compiler can insert ENDBR after REAL(swapcontext) call.
>> >>
>> >> I opened an LLVM bug:
>> >>
>> >> https://bugs.llvm.org/show_bug.cgi?id=38207
>> >>
>> >> But it won't get fixed before indirect_return attribute is added to
>> >> LLVM.  I'd like to get it fixed in GCC first.
>> >>
>> >> Tested on i386 and x86-64.  OK for trunk after
>> >>
>> >> https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01007.html
>> >>
>> >> is approved?
>> >>
>> >> Thanks.
>> >>
>> >>
>> >> H.J.
>> >> ---
>> >> PR target/86560
>> >> * asan/asan_interceptors.cc (swapcontext): Call REAL(swapcontext)
>> >> with indirect_return attribute on x86.
>> >> ---
>> >>  libsanitizer/asan/asan_interceptors.cc | 6 ++
>> >>  1 file changed, 6 insertions(+)
>> >>
>> >> diff --git a/libsanitizer/asan/asan_interceptors.cc 
>> >> b/libsanitizer/asan/asan_interceptors.cc
>> >> index a8f4b72723f..b8dde4f19c5 100644
>> >> --- a/libsanitizer/asan/asan_interceptors.cc
>> >> +++ b/libsanitizer/asan/asan_interceptors.cc
>> >> @@ -267,7 +267,13 @@ INTERCEPTOR(int, swapcontext, struct ucontext_t 
>> >> *oucp,
>> >>uptr stack, ssize;
>> >>ReadContextStack(ucp, &stack, &ssize);
>> >>ClearShadowMemoryForContextStack(stack, ssize);
>> >> +#if defined(__x86_64__) || defined(__i386__)
>> >> +  int (*real_swapcontext) (struct ucontext_t *, struct ucontext_t *)
>> >> +__attribute__((__indirect_return__)) = REAL(swapcontext);
>> >> +  int res = real_swapcontext(oucp, ucp);
>> >> +#else
>> >>int res = REAL(swapcontext)(oucp, ucp);
>> >> +#endif
>> >>// swapcontext technically does not return, but program may swap 
>> >> context to
>> >>// "oucp" later, that would look as if swapcontext() returned 0.
>> >>// We need to clear shadow for ucp once again, as it may be in 
>> >> arbitrary
>> >> --
>> >> 2.17.1
>> >>
>>
>>
>>
>> --
>> H.J.



-- 
H.J.


[PATCH] i386: Define __HAVE_INDIRECT_RETURN_ATTRIBUTE__

2018-07-19 Thread H.J. Lu
On Thu, Jul 19, 2018 at 10:35:27AM +0200, Richard Biener wrote:
> On Wed, Jul 18, 2018 at 5:33 PM H.J. Lu  wrote:
> >
> > In
> >
> > struct ucontext;
> > typedef struct ucontext ucontext_t;
> >
> > extern int (*bar) (ucontext_t *__restrict __oucp,
> >const ucontext_t *__restrict __ucp)
> >   __attribute__((__indirect_return__));
> >
> > extern int res;
> >
> > void
> > foo (ucontext_t *oucp, ucontext_t *ucp)
> > {
> >   res = bar (oucp, ucp);
> > }
> >
> > bar() may return via indirect branch.  This patch changes indirect_return
> > to type attribute to allow indirect_return attribute on variable or type
> > of function pointer so that ENDBR can be inserted after call to bar().
> >
> > Tested on i386 and x86-64.  OK for trunk?
> 
> OK.
> 

The new indirect_return attribute is intended to mark swapcontext in
<ucontext.h>.  This patch defines __HAVE_INDIRECT_RETURN_ATTRIBUTE__
so that it can be checked before using the indirect_return attribute
in <ucontext.h>.  It works when the indirect_return attribute is
backported to GCC 8.
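
To illustrate the intended use, a header could guard the attribute
roughly like this (a sketch of possible <ucontext.h> usage; the
__INDIRECT_RETURN helper macro is made up for the example):

#ifdef __HAVE_INDIRECT_RETURN_ATTRIBUTE__
# define __INDIRECT_RETURN __attribute__ ((__indirect_return__))
#else
# define __INDIRECT_RETURN
#endif

extern int swapcontext (ucontext_t *__restrict __oucp,
                        const ucontext_t *__restrict __ucp)
  __INDIRECT_RETURN;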

OK for trunk?

Thanks.

H.J.
---
gcc/

PR target/86560
* config/i386/i386-c.c (ix86_target_macros): Define
__HAVE_INDIRECT_RETURN_ATTRIBUTE__.

gcc/testsuite/

PR target/86560
* gcc.target/i386/pr86560-4.c: New test.
---
 gcc/config/i386/i386-c.c  |  2 ++
 gcc/testsuite/gcc.target/i386/pr86560-4.c | 19 +++
 2 files changed, 21 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr86560-4.c

diff --git a/gcc/config/i386/i386-c.c b/gcc/config/i386/i386-c.c
index 005e1a5b308..041d47c3ee6 100644
--- a/gcc/config/i386/i386-c.c
+++ b/gcc/config/i386/i386-c.c
@@ -695,6 +695,8 @@ ix86_target_macros (void)
   if (flag_cf_protection != CF_NONE)
 cpp_define_formatted (parse_in, "__CET__=%d",
  flag_cf_protection & ~CF_SET);
+
+  cpp_define (parse_in, "__HAVE_INDIRECT_RETURN_ATTRIBUTE__");
 }
 
 
diff --git a/gcc/testsuite/gcc.target/i386/pr86560-4.c 
b/gcc/testsuite/gcc.target/i386/pr86560-4.c
new file mode 100644
index 000..46ea923fdfc
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr86560-4.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fcf-protection" } */
+/* { dg-final { scan-assembler-times {\mendbr} 2 } } */
+
+struct ucontext;
+
+extern int (*bar) (struct ucontext *)
+#ifdef __HAVE_INDIRECT_RETURN_ATTRIBUTE__
+  __attribute__((__indirect_return__))
+#endif
+;
+
+extern int res;
+
+void
+foo (struct ucontext *oucp)
+{
+  res = bar (oucp);
+}
-- 
2.17.1



Re: [PATCH] i386: Define __HAVE_INDIRECT_RETURN_ATTRIBUTE__

2018-07-19 Thread H.J. Lu
On Thu, Jul 19, 2018 at 01:39:04PM +0200, Florian Weimer wrote:
> On 07/19/2018 01:33 PM, Jakub Jelinek wrote:
> > On Thu, Jul 19, 2018 at 04:21:26AM -0700, H.J. Lu wrote:
> > > The new indirect_return attribute is intended to mark swapcontext in
> > > .  This patch defines __HAVE_INDIRECT_RETURN_ATTRIBUTE__
> > > so that it can be used checked before using indirect_return attribute
> > > in .  It works when the indirect_return attribute is
> > > backported to GCC 8.
> > > 
> > > OK for trunk?
> > 
> > No.  Use
> > #ifdef __has_attribute
> > #if __has_attribute (indirect_return)
> > ...
> > #endif
> > #endif
> > instead, like for any other attribute.
> 
> That doesn't work because indirect_return is not in the implementation
> namespace and expanded in this context. I assume that __has_attribute
> (__indirect_return__) would work, though.
> 
> Could we add:
> 
> #ifdef __has_attribute
> # define __glibc_has_attribute(attr) __has_attribute (attr)
> #else
> # define __glibc_has_attribute 0
> #endif
> 
> And then use this:
> 
> #if __glibc_has_attribute (__indirect_return__)
> 
> Would that still work?
> 

Both __has_attribute (indirect_return) and __has_attribute (__indirect_return__)
work here.


H.J.
---
The new indirect_return attribute is intended to mark swapcontext in
<ucontext.h>.  Test __has_attribute (indirect_return) so that it
can be backported to GCC 8.

PR target/86560
* gcc.target/i386/pr86560-4.c: New test.
* gcc.target/i386/pr86560-5.c: Likewise.
---
 gcc/testsuite/gcc.target/i386/pr86560-4.c | 21 +
 gcc/testsuite/gcc.target/i386/pr86560-5.c | 21 +
 2 files changed, 42 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr86560-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr86560-5.c

diff --git a/gcc/testsuite/gcc.target/i386/pr86560-4.c 
b/gcc/testsuite/gcc.target/i386/pr86560-4.c
new file mode 100644
index 000..a623e3dcbeb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr86560-4.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fcf-protection" } */
+/* { dg-final { scan-assembler-times {\mendbr} 2 } } */
+
+struct ucontext;
+
+extern int (*bar) (struct ucontext *)
+#ifdef __has_attribute
+# if __has_attribute (indirect_return)
+  __attribute__((__indirect_return__))
+# endif
+#endif
+;
+
+extern int res;
+
+void
+foo (struct ucontext *oucp)
+{
+  res = bar (oucp);
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr86560-5.c 
b/gcc/testsuite/gcc.target/i386/pr86560-5.c
new file mode 100644
index 000..33b0f6424c2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr86560-5.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fcf-protection" } */
+/* { dg-final { scan-assembler-times {\mendbr} 2 } } */
+
+struct ucontext;
+
+extern int (*bar) (struct ucontext *)
+#ifdef __has_attribute
+# if __has_attribute (__indirect_return__)
+  __attribute__((__indirect_return__))
+# endif
+#endif
+;
+
+extern int res;
+
+void
+foo (struct ucontext *oucp)
+{
+  res = bar (oucp);
+}
-- 
2.17.1



Re: [PATCH] i386: Define __HAVE_INDIRECT_RETURN_ATTRIBUTE__

2018-07-19 Thread H.J. Lu
On Thu, Jul 19, 2018 at 4:56 AM, Jakub Jelinek  wrote:
> On Thu, Jul 19, 2018 at 01:54:46PM +0200, Florian Weimer wrote:
>> On 07/19/2018 01:48 PM, H.J. Lu wrote:
>> > Both __has_attribute (indirect_return) and __has_attribute 
>> > (__indirect_return__)
>> > work here.
>>
>> Applications can have
>>
>> #define indirect_return
>>
>> so the variant without underscore mangling is definitely not correct.
>
> Incorrect for what?  glibc header?  Yes.  The libsanitizer use, where we
> control the headers and what we define?  No.
>
> Jakub

I am checking my testcases to show how it works.

-- 
H.J.


Re: [PATCH] Call REAL(swapcontext) with indirect_return attribute on x86

2018-07-19 Thread H.J. Lu
On Wed, Jul 18, 2018 at 12:34:28PM -0700, Kostya Serebryany wrote:
> On Wed, Jul 18, 2018 at 12:29 PM H.J. Lu  wrote:
> >
> > On Wed, Jul 18, 2018 at 11:45 AM, Kostya Serebryany  wrote:
> > > On Wed, Jul 18, 2018 at 11:40 AM H.J. Lu  wrote:
> > >>
> > >> On Wed, Jul 18, 2018 at 11:18 AM, Kostya Serebryany  
> > >> wrote:
> > >> > What's ENDBR and do we really need to have it in compiler-rt?
> > >>
> > >> When shadow stack from Intel CET is enabled,  the first instruction of 
> > >> all
> > >> indirect branch targets must be a special instruction, ENDBR.  In this 
> > >> case,
> > >
> > > I am confused.
> > > CET is a security mitigation feature (and ENDBR is a pretty weak form of 
> > > such),
> > > while ASAN is a testing tool, rarely used in production is almost
> > > never as a mitigation (which it is not!).
> > > Why would anyone need to combine CET and ASAN in one process?
> > >
> >
> > CET is transparent to ASAN.  It is perfectly OK to use -fcf-protection to
> > enable CET together with ASAN.
> 
> It is ok, but does it make any sense?
> If anything, the current ASAN's interceptors are a large blob of
> security vulnerabilities.
> If we ever want to use ASAN (or, more likely, HWASAN) as a security
> mitigation feature,
> we will need to get rid of these interceptors entirely.
> 
> 
> >
> > > Also, CET doesn't exist in the hardware yet, at least not publicly 
> > > available.
> > > Which means there should be no rush (am I wrong?) and we can do things
> > > in the correct order:
> > > implement the Clang/LLVM support, make the compiler-rt change in LLVM,
> > > merge back to GCC.
> >
> > I am working with our LLVM people to address this.
> 
> Cool!
> 

I am testing this patch and will submit it upstream.

H.J.
---
asan/asan_interceptors.cc has

...
  int res = REAL(swapcontext)(oucp, ucp);
...

REAL(swapcontext) is a function pointer to swapcontext in libc.  Since
swapcontext may return via an indirect branch on x86 when shadow stack
is enabled, we need to call REAL(swapcontext) with the indirect_return
attribute on x86 so that the compiler can insert ENDBR after the
REAL(swapcontext) call.

PR target/86560
* asan/asan_interceptors.cc (swapcontext): Call REAL(swapcontext)
with indirect_return attribute on x86 if indirect_return attribute
is available.
---
 libsanitizer/asan/asan_interceptors.cc | 9 +
 1 file changed, 9 insertions(+)

diff --git a/libsanitizer/asan/asan_interceptors.cc 
b/libsanitizer/asan/asan_interceptors.cc
index a8f4b72723f..3ae473f210a 100644
--- a/libsanitizer/asan/asan_interceptors.cc
+++ b/libsanitizer/asan/asan_interceptors.cc
@@ -267,7 +267,16 @@ INTERCEPTOR(int, swapcontext, struct ucontext_t *oucp,
   uptr stack, ssize;
   ReadContextStack(ucp, &stack, &ssize);
   ClearShadowMemoryForContextStack(stack, ssize);
+#if defined(__has_attribute) && (defined(__x86_64__) || defined(__i386__))
+  int (*real_swapcontext) (struct ucontext_t *, struct ucontext_t *)
+# if __has_attribute (__indirect_return__)
+__attribute__((__indirect_return__))
+# endif
+= REAL(swapcontext);
+  int res = real_swapcontext(oucp, ucp);
+#else
   int res = REAL(swapcontext)(oucp, ucp);
+#endif
   // swapcontext technically does not return, but program may swap context to
   // "oucp" later, that would look as if swapcontext() returned 0.
   // We need to clear shadow for ucp once again, as it may be in arbitrary
-- 
2.17.1



[PATCH] i386: Remove _Unwind_Frames_Increment

2018-07-20 Thread H.J. Lu
Tested on CET SDV using the CET kernel on cet branch at:

https://github.com/yyu168/linux_cet/tree/cet

OK for trunk and GCC 8 branch?

Thanks.


H.J.
---
The CET kernel has been changed to place a restore token on the shadow
stack for signal handlers to enhance security.  This is usually
transparent to user programs since the kernel pops the restore token
when the signal handler returns.  But when an exception is thrown from
a signal handler, we now need to remove _Unwind_Frames_Increment to pop
the restore token from the shadow stack.  Otherwise, we get

FAIL: g++.dg/torture/pr85334.C   -O0  execution test
FAIL: g++.dg/torture/pr85334.C   -O1  execution test
FAIL: g++.dg/torture/pr85334.C   -O2  execution test
FAIL: g++.dg/torture/pr85334.C   -O3 -g  execution test
FAIL: g++.dg/torture/pr85334.C   -Os  execution test
FAIL: g++.dg/torture/pr85334.C   -O2 -flto -fno-use-linker-plugin 
-flto-partition=none  execution test

PR libgcc/85334
* config/i386/shadow-stack-unwind.h (_Unwind_Frames_Increment):
Removed.
---
 libgcc/config/i386/shadow-stack-unwind.h | 5 -
 1 file changed, 5 deletions(-)

diff --git a/libgcc/config/i386/shadow-stack-unwind.h 
b/libgcc/config/i386/shadow-stack-unwind.h
index a32f3e74b52..40f48df2aec 100644
--- a/libgcc/config/i386/shadow-stack-unwind.h
+++ b/libgcc/config/i386/shadow-stack-unwind.h
@@ -49,8 +49,3 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If 
not, see
}   \
 }  \
 while (0)
-
-/* Increment frame count.  Skip signal frames.  */
-#undef _Unwind_Frames_Increment
-#define _Unwind_Frames_Increment(context, frames) \
-  if (!_Unwind_IsSignalFrame (context)) frames++
-- 
2.17.1



[PATCH] libsanitizer: Mark REAL(swapcontext) with indirect_return attribute on x86

2018-07-20 Thread H.J. Lu
Cherry-pick compiler-rt revision 337603:

When shadow stack from Intel CET is enabled, the first instruction of all
indirect branch targets must be a special instruction, ENDBR.

lib/asan/asan_interceptors.cc has

...
  int res = REAL(swapcontext)(oucp, ucp);
...

REAL(swapcontext) is a function pointer to swapcontext in libc.  Since
swapcontext may return via an indirect branch on x86 when shadow stack
is enabled, as in this case:

int res = REAL(swapcontext)(oucp, ucp);

the call to REAL(swapcontext) may return via an indirect branch.
Here the compiler must insert ENDBR after the call, like this:

call *bar(%rip)
endbr64

I opened an LLVM bug:

https://bugs.llvm.org/show_bug.cgi?id=38207

to add the indirect_return attribute so that it can be used to inform
the compiler to insert ENDBR after the REAL(swapcontext) call.  We mark
REAL(swapcontext) with the indirect_return attribute if it is available.

This fixed:

https://bugs.llvm.org/show_bug.cgi?id=38249

Reviewed By: eugenis

Differential Revision: https://reviews.llvm.org/D49608

OK for trunk?

H.J.
---
PR target/86560
* asan/asan_interceptors.cc (swapcontext): Call REAL(swapcontext)
with indirect_return attribute on x86 if indirect_return attribute
is available.
* sanitizer_common/sanitizer_internal_defs.h (__has_attribute):
New.
---
 libsanitizer/asan/asan_interceptors.cc  | 8 
 libsanitizer/sanitizer_common/sanitizer_internal_defs.h | 5 +
 2 files changed, 13 insertions(+)

diff --git a/libsanitizer/asan/asan_interceptors.cc 
b/libsanitizer/asan/asan_interceptors.cc
index a8f4b72723f..552cf9347af 100644
--- a/libsanitizer/asan/asan_interceptors.cc
+++ b/libsanitizer/asan/asan_interceptors.cc
@@ -267,7 +267,15 @@ INTERCEPTOR(int, swapcontext, struct ucontext_t *oucp,
   uptr stack, ssize;
   ReadContextStack(ucp, &stack, &ssize);
   ClearShadowMemoryForContextStack(stack, ssize);
+#if __has_attribute(__indirect_return__) && \
+(defined(__x86_64__) || defined(__i386__))
+  int (*real_swapcontext)(struct ucontext_t *, struct ucontext_t *)
+__attribute__((__indirect_return__))
+= REAL(swapcontext);
+  int res = real_swapcontext(oucp, ucp);
+#else
   int res = REAL(swapcontext)(oucp, ucp);
+#endif
   // swapcontext technically does not return, but program may swap context to
   // "oucp" later, that would look as if swapcontext() returned 0.
   // We need to clear shadow for ucp once again, as it may be in arbitrary
diff --git a/libsanitizer/sanitizer_common/sanitizer_internal_defs.h 
b/libsanitizer/sanitizer_common/sanitizer_internal_defs.h
index edd6a21c122..4413a88bea0 100644
--- a/libsanitizer/sanitizer_common/sanitizer_internal_defs.h
+++ b/libsanitizer/sanitizer_common/sanitizer_internal_defs.h
@@ -104,6 +104,11 @@
 # define __has_feature(x) 0
 #endif
 
+// Older GCCs do not understand __has_attribute.
+#if !defined(__has_attribute)
+# define __has_attribute(x) 0
+#endif
+
 // For portability reasons we do not include stddef.h, stdint.h or any other
 // system header, but we do need some basic types that are not defined
 // in a portable way by the language itself.
-- 
2.17.1



Re: [PATCH] specify large command line option arguments (PR 82063)

2018-07-21 Thread H.J. Lu
On Fri, Jul 20, 2018 at 1:57 PM, Martin Sebor  wrote:
> On 07/19/2018 04:31 PM, Jeff Law wrote:
>>
>> On 06/24/2018 03:05 PM, Martin Sebor wrote:
>>>
>>> Storing integer command line option arguments in type int
>>> limits options such as -Wlarger-than= or -Walloca-larger-than
>>> to at most INT_MAX (see bug 71905).  Larger values wrap around
>>> zero.  The value zero is considered to disable the option,
>>> making it impossible to specify a zero limit.
>>>
>>> To get around these limitations, the -Walloc-size-larger-than=
>>> option accepts a string argument that it then parses itself
>>> and interprets as HOST_WIDE_INT.  The option also accepts byte
>>> size suffixes like KB, MB, GiB, etc. to make it convenient to
>>> specify very large limits.
>>>
>>> The int limitation is obviously less than ideal in a 64-bit
>>> world.  The treatment of zero as a toggle is just a minor wart.
>>> The special treatment to make it work for just a single option
>>> makes option handling inconsistent.  It should be possible for
>>> any option that takes an integer argument to use the same logic.
>>>
>>> The attached patch enhances GCC option processing to do that.
>>> It changes the storage type of option arguments from int to
>>> HOST_WIDE_INT and extends the existing (although undocumented)
>>> option property Host_Wide_Int to specify wide option arguments.
>>> It also introduces the ByteSize property for options for which
>>> specifying the byte-size suffix makes sense.
>>>
>>> To make it possible to consider zero as a meaningful argument
>>> value rather than a flag indicating that an option is disabled
>>> the patch also adds a CLVC_SIZE enumerator to the cl_var_type
>>> enumeration, and modifies how options of the kind are handled.
>>>
>>> Warning options that take large byte-size arguments can be
>>> disabled by specifying a value equal to or greater than
>>> HOST_WIDE_INT_M1U.  For convenience, aliases in the form of
>>> -Wno-xxx-larger-than have been provided for all the affected
>>> options.
>>>
>>> In the patch all the existing -larger-than options are set
>>> to PTRDIFF_MAX.  This makes them effectively enabled, but
>>> because the setting is exceedingly permissive, and because
>>> some of the existing warnings are already set to the same
>>> value and some other checks detect and reject such exceedingly
>>> large values with errors, this change shouldn't noticeably
>>> affect what constructs are diagnosed.
>>>
>>> Although all the options are set to PTRDIFF_MAX, I think it
>>> would make sense to consider setting some of them lower, say
>>> to PTRDIFF_MAX / 2.  I'd like to propose that in a followup
>>> patch.
>>>
>>> To minimize observable changes the -Walloca-larger-than and
>>> -Wvla-larger-than warnings required more extensive work to
>>> make use of the new mechanism because of the "unbounded" argument
>>> handling (the warnings trigger for arguments that are not
>>> visibly constrained), and because of the zero handling
>>> (the warnings also trigger
>>>
>>>
>>> Martin
>>>
>>>
>>> gcc-82063.diff
>>>
>>>
>>> PR middle-end/82063 - issues with arguments enabled by -Wall
>>>
>>> gcc/ada/ChangeLog:
>>>
>>> PR middle-end/82063
>>> * gcc-interface/misc.c (gnat_handle_option): Change function
>>> argument
>>> to HOST_WIDE_INT.
>>>
>>> gcc/brig/ChangeLog:
>>> * brig/brig-lang.c (brig_langhook_handle_option): Change function
>>> argument to HOST_WIDE_INT.
>>>
>>> gcc/c-family/ChangeLog:
>>>
>>> PR middle-end/82063
>>> * c-common.h (c_common_handle_option): Change function argument
>>> to HOST_WIDE_INT.
>>> * c-opts.c (c_common_init_options): Same.
>>> (c_common_handle_option): Same.  Remove special handling of
>>> OPT_Walloca_larger_than_ and OPT_Wvla_larger_than_.
>>> * c.opt (-Walloc-size-larger-than, -Walloca-larger-than): Change
>>> options to take a HOST_WIDE_INT argument and accept a byte-size
>>> suffix.  Initialize.
>>> (-Wvla-larger-than): Same.
>>> (-Wno-alloc-size-larger-than, -Wno-alloca-larger-than): New.
>>> (-Wno-vla-larger-than): Same.
>>>
>>>
>>> gcc/fortran/ChangeLog:
>>>
>>> PR middle-end/82063
>>> * gfortran.h (gfc_handle_option): Change function argument
>>> to HOST_WIDE_INT.
>>> * options.c (gfc_handle_option): Same.
>>>
>>> gcc/go/ChangeLog:
>>>
>>> PR middle-end/82063
>>> * go-lang.c (go_langhook_handle_option): Change function argument
>>> to HOST_WIDE_INT.
>>>
>>> gcc/lto/ChangeLog:
>>>
>>> PR middle-end/82063
>>> * lto-lang.c (lto_handle_option): Change function argument
>>> to HOST_WIDE_INT.
>>>
>>> gcc/testsuite/ChangeLog:
>>>
>>> PR middle-end/82063
>>> * gcc.dg/Walloc-size-larger-than-16.c: Adjust.
>>> * gcc.dg/Walloca-larger-than.c: New test.
>>> * gcc.dg/Wframe-larger-than-2.c: New test.
>>> * gcc.dg/Wlarger-than3.c: New test.
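
As a concrete illustration of the byte-size suffixes and the new
-Wno-xxx-larger-than aliases described above, consider a sketch like the
following.  The command lines, the 64KB threshold and the file name are
illustrative assumptions, not output from the patch:

/* Sketch only.  Hypothetical invocations:

     gcc -O2 -Walloca-larger-than=64KB  big-alloca.c   # warn above 64 KB
     gcc -O2 -Wno-alloca-larger-than    big-alloca.c   # no limit at all
*/
#include <alloca.h>
#include <string.h>

void
fill (void)
{
  char *buf = alloca (128 * 1024);	/* exceeds the 64KB limit above */
  memset (buf, 0, 128 * 1024);
}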
>>>

V3 [PATCH] C/C++: Add -Waddress-of-packed-member

2018-07-23 Thread H.J. Lu
On Mon, Jun 18, 2018 at 12:26 PM, Joseph Myers  wrote:
> On Mon, 18 Jun 2018, Jason Merrill wrote:
>
>> On Mon, Jun 18, 2018 at 11:59 AM, Joseph Myers  
>> wrote:
>> > On Mon, 18 Jun 2018, Jason Merrill wrote:
>> >
>> >> > +  if (TREE_CODE (rhs) == COND_EXPR)
>> >> > +{
>> >> > +  /* Check the THEN path first.  */
>> >> > +  tree op1 = TREE_OPERAND (rhs, 1);
>> >> > +  context = check_address_of_packed_member (type, op1);
>> >>
>> >> This should handle the GNU extension of re-using operand 0 if operand
>> >> 1 is omitted.
>> >
>> > Doesn't that just use a SAVE_EXPR?
>>
>> Hmm, I suppose it does, but many places in the compiler seem to expect
>> that it produces a COND_EXPR with TREE_OPERAND 1 as NULL_TREE.
>
> Maybe that's used somewhere inside the C++ front end.  For C a SAVE_EXPR
> is produced directly.
>

Here is the updated patch.  Changes from the last one:

1. Handle COMPOUND_EXPR.
2. Fixed typos in comments.
3. Combined warn_for_pointer_of_packed_member and
warn_for_address_of_packed_member into
warn_for_address_or_pointer_of_packed_member.
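
For reference, source along these lines exercises the COND_EXPR and
COMPOUND_EXPR handling; this is only a sketch, the actual tests are the
c-c++-common/pr51628-*.c files listed in the patch below:

struct pair_t
{
  char c;
  int i;
} __attribute__ ((packed));

extern struct pair_t p, q;

int *
cond_case (int x)
{
  return x ? &p.i : &q.i;	/* both arms are checked */
}

int *
compound_case (int x)
{
  return (x++, &p.i);		/* the value of the comma expression is checked */
}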

Tested on Linux/x86-64 and Linux/i686.  OK for trunk?

Thanks.

-- 
H.J.
From 2ddae2d57d2875e80c9186b281edfabfddb42e86 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Fri, 12 Jan 2018 21:12:05 -0800
Subject: [PATCH] C/C++: Add -Waddress-of-packed-member
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When the address of a packed member of a struct or union is taken, it may
result in an unaligned pointer value.  This patch adds -Waddress-of-packed-member
to check alignment at pointer assignment and to warn about unaligned addresses
as well as unaligned pointers:

$ cat x.i
struct pair_t
{
  char c;
  int i;
} __attribute__ ((packed));

extern struct pair_t p;
int *addr = &p.i;
$ gcc -O2 -S x.i
x.i:8:13:  warning: taking address of packed member of 'struct pair_t' may result in an unaligned pointer value [-Waddress-of-packed-member]
 int *addr = &p.i;
 ^
$ cat c.i
struct B { int i; };
struct C { struct B b; } __attribute__ ((packed));

long* g8 (struct C *p) { return p; }
$ gcc -O2 -S c.i -Wno-incompatible-pointer-types
c.i: In function ‘g8’:
c.i:4:33: warning: returning ‘struct C *’ from a function with incompatible return type ‘long int *’ [-Wincompatible-pointer-types]
 long* g8 (struct C *p) { return p; }
 ^
c.i:4:33: warning: converting a packed ‘struct C *’ pointer increases the alignment of ‘long int *’ pointer from 1 to 8 [-Waddress-of-packed-member]
c.i:2:8: note: defined here
 struct C { struct B b; } __attribute__ ((packed));
^
$

This warning is enabled by default.  Since read_encoded_value_with_base
in unwind-pe.h has

  union unaligned
{
  void *ptr;
  unsigned u2 __attribute__ ((mode (HI)));
  unsigned u4 __attribute__ ((mode (SI)));
  unsigned u8 __attribute__ ((mode (DI)));
  signed s2 __attribute__ ((mode (HI)));
  signed s4 __attribute__ ((mode (SI)));
  signed s8 __attribute__ ((mode (DI)));
} __attribute__((__packed__));
  _Unwind_Internal_Ptr result;

and GCC warns:

gcc/libgcc/unwind-pe.h:210:37: warning: taking address of packed member of 'union unaligned' may result in an unaligned pointer value [-Waddress-of-packed-member]
result = (_Unwind_Internal_Ptr) u->ptr;
^
we need to add a GCC pragma to ignore -Waddress-of-packed-member.
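
For example, the warning can be suppressed locally with the diagnostic
pragma.  The sketch below wraps it around the earlier x.i example rather
than the unwind-pe.h code:

struct pair_t
{
  char c;
  int i;
} __attribute__ ((packed));

extern struct pair_t p;

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Waddress-of-packed-member"
int *addr = &p.i;	/* no warning inside the pragma region */
#pragma GCC diagnostic pop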

gcc/

	PR c/51628
	* doc/invoke.texi: Document -Wno-address-of-packed-member.

gcc/c-family/

	PR c/51628
	* c-common.h (warn_for_address_or_pointer_of_packed_member): New.
	* c-warn.c (check_address_of_packed_member): New function.
	(warn_for_address_or_pointer_of_packed_member): Likewise.
	* c.opt: Add -Wno-address-of-packed-member.

gcc/c/

	PR c/51628
	* c-typeck.c (convert_for_assignment): Call
	warn_for_address_or_pointer_of_packed_member.

gcc/cp/

	PR c/51628
	* call.c (convert_for_arg_passing): Call
	warn_for_address_or_pointer_of_packed_member.
	* typeck.c (convert_for_assignment): Likewise.

gcc/testsuite/

	PR c/51628
	* c-c++-common/pr51628-1.c: New test.
	* c-c++-common/pr51628-2.c: Likewise.
	* c-c++-common/pr51628-3.c: Likewise.
	* c-c++-common/pr51628-4.c: Likewise.
	* c-c++-common/pr51628-5.c: Likewise.
	* c-c++-common/pr51628-6.c: Likewise.
	* c-c++-common/pr51628-7.c: Likewise.
	* c-c++-common/pr51628-8.c: Likewise.
	* c-c++-common/pr51628-9.c: Likewise.
	* c-c++-common/pr51628-10.c: Likewise.
	* c-c++-common/pr51628-11.c: Likewise.
	* c-c++-common/pr51628-12.c: Likewise.
	* c-c++-common/pr51628-13.c: Likewise.
	* c-c++-common/pr51628-14.c: Likewise.
	* c-c++-common/pr51628-15.c: Likewise.
	* c-c++-common/pr51628-26.c: Likewise.
	* gcc.dg/pr51628-17.c: Likewise.
	* gcc.dg/pr51628-18.c: 

PING [PATCH] libsanitizer: Mark REAL(swapcontext) with indirect_return attribute on x86

2018-07-26 Thread H.J. Lu
On Fri, Jul 20, 2018 at 1:11 PM, H.J. Lu  wrote:
> Cherry-pick compiler-rt revision 337603:
>
> When shadow stack from Intel CET is enabled, the first instruction of all
> indirect branch targets must be a special instruction, ENDBR.
>
> lib/asan/asan_interceptors.cc has
>
> ...
>   int res = REAL(swapcontext)(oucp, ucp);
> ...
>
> REAL(swapcontext) is a function pointer to swapcontext in libc.  Since
> swapcontext may return via indirect branch on x86 when shadow stack is
> enabled, as in this case,
>
> int res = REAL(swapcontext)(oucp, ucp);
>     This call may return via an indirect branch.
>
> Here the compiler must insert ENDBR after the call, like
>
> call *bar(%rip)
> endbr64
>
> I opened an LLVM bug:
>
> https://bugs.llvm.org/show_bug.cgi?id=38207
>
> to add the indirect_return attribute so that it can be used to inform
> the compiler to insert ENDBR after the REAL(swapcontext) call.  We mark
> REAL(swapcontext) with the indirect_return attribute if it is available.
>
> This fixed:
>
> https://bugs.llvm.org/show_bug.cgi?id=38249
>
> Reviewed By: eugenis
>
> Differential Revision: https://reviews.llvm.org/D49608
>
> OK for trunk?
>
> H.J.
> ---
> PR target/86560
> * asan/asan_interceptors.cc (swapcontext): Call REAL(swapcontext)
> with indirect_return attribute on x86 if indirect_return attribute
> is available.
> * sanitizer_common/sanitizer_internal_defs.h (__has_attribute):
> New.
> ---
>  libsanitizer/asan/asan_interceptors.cc  | 8 
>  libsanitizer/sanitizer_common/sanitizer_internal_defs.h | 5 +
>  2 files changed, 13 insertions(+)
>
> diff --git a/libsanitizer/asan/asan_interceptors.cc 
> b/libsanitizer/asan/asan_interceptors.cc
> index a8f4b72723f..552cf9347af 100644
> --- a/libsanitizer/asan/asan_interceptors.cc
> +++ b/libsanitizer/asan/asan_interceptors.cc
> @@ -267,7 +267,15 @@ INTERCEPTOR(int, swapcontext, struct ucontext_t *oucp,
>uptr stack, ssize;
>ReadContextStack(ucp, &stack, &ssize);
>ClearShadowMemoryForContextStack(stack, ssize);
> +#if __has_attribute(__indirect_return__) && \
> +(defined(__x86_64__) || defined(__i386__))
> +  int (*real_swapcontext)(struct ucontext_t *, struct ucontext_t *)
> +__attribute__((__indirect_return__))
> += REAL(swapcontext);
> +  int res = real_swapcontext(oucp, ucp);
> +#else
>int res = REAL(swapcontext)(oucp, ucp);
> +#endif
>// swapcontext technically does not return, but program may swap context to
>// "oucp" later, that would look as if swapcontext() returned 0.
>// We need to clear shadow for ucp once again, as it may be in arbitrary
> diff --git a/libsanitizer/sanitizer_common/sanitizer_internal_defs.h 
> b/libsanitizer/sanitizer_common/sanitizer_internal_defs.h
> index edd6a21c122..4413a88bea0 100644
> --- a/libsanitizer/sanitizer_common/sanitizer_internal_defs.h
> +++ b/libsanitizer/sanitizer_common/sanitizer_internal_defs.h
> @@ -104,6 +104,11 @@
>  # define __has_feature(x) 0
>  #endif
>
> +// Older GCCs do not understand __has_attribute.
> +#if !defined(__has_attribute)
> +# define __has_attribute(x) 0
> +#endif
> +
>  // For portability reasons we do not include stddef.h, stdint.h or any other
>  // system header, but we do need some basic types that are not defined
>  // in a portable way by the language itself.
> --
> 2.17.1
>
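
For reference, outside the sanitizer the same idiom looks roughly like
this.  Only the attribute name and the __has_attribute guard come from the
patch above; the wrapper function and everything else is illustrative:

#include <ucontext.h>

#if !defined(__has_attribute)
# define __has_attribute(x) 0
#endif

#if __has_attribute(__indirect_return__) && \
    (defined(__x86_64__) || defined(__i386__))
# define INDIRECT_RETURN __attribute__ ((__indirect_return__))
#else
# define INDIRECT_RETURN
#endif

int
swap_to (ucontext_t *oucp, const ucontext_t *ucp)
{
  /* With -fcf-protection, the compiler inserts ENDBR after this call,
     since the callee may return via an indirect branch.  */
  int (*real_swapcontext) (ucontext_t *, const ucontext_t *) INDIRECT_RETURN
    = swapcontext;
  return real_swapcontext (oucp, ucp);
}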

Any objections?


-- 
H.J.


PING [PATCH] i386: Remove _Unwind_Frames_Increment

2018-07-26 Thread H.J. Lu
On Fri, Jul 20, 2018 at 11:15 AM, H.J. Lu  wrote:
> Tested on CET SDV using the CET kernel on cet branch at:
>
> https://github.com/yyu168/linux_cet/tree/cet
>
> OK for trunk and GCC 8 branch?
>
> Thanks.
>
>
> H.J.
> ---
> The CET kernel has been changed to place a restore token on the shadow stack
> for signal handlers to enhance security.  It is usually transparent to user
> programs, since the kernel will pop the restore token when the signal handler
> returns.  But when an exception is thrown from a signal handler, we now
> need to remove _Unwind_Frames_Increment to pop the restore token
> from the shadow stack.  Otherwise, we get
>
> FAIL: g++.dg/torture/pr85334.C   -O0  execution test
> FAIL: g++.dg/torture/pr85334.C   -O1  execution test
> FAIL: g++.dg/torture/pr85334.C   -O2  execution test
> FAIL: g++.dg/torture/pr85334.C   -O3 -g  execution test
> FAIL: g++.dg/torture/pr85334.C   -Os  execution test
> FAIL: g++.dg/torture/pr85334.C   -O2 -flto -fno-use-linker-plugin 
> -flto-partition=none  execution test
>
> PR libgcc/85334
> * config/i386/shadow-stack-unwind.h (_Unwind_Frames_Increment):
> Removed.
> ---
>  libgcc/config/i386/shadow-stack-unwind.h | 5 -
>  1 file changed, 5 deletions(-)
>
> diff --git a/libgcc/config/i386/shadow-stack-unwind.h 
> b/libgcc/config/i386/shadow-stack-unwind.h
> index a32f3e74b52..40f48df2aec 100644
> --- a/libgcc/config/i386/shadow-stack-unwind.h
> +++ b/libgcc/config/i386/shadow-stack-unwind.h
> @@ -49,8 +49,3 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  
> If not, see
> }   \
>  }  \
>  while (0)
> -
> -/* Increment frame count.  Skip signal frames.  */
> -#undef _Unwind_Frames_Increment
> -#define _Unwind_Frames_Increment(context, frames) \
> -  if (!_Unwind_IsSignalFrame (context)) frames++
> --
> 2.17.1
>

I will check it into trunk tomorrow if there is no objection.


-- 
H.J.


Re: [PATCH] combine: Allow combining two insns to two insns

2018-07-31 Thread H.J. Lu
On Wed, Jul 25, 2018 at 1:28 AM, Richard Biener
 wrote:
> On Tue, Jul 24, 2018 at 7:18 PM Segher Boessenkool
>  wrote:
>>
>> This patch allows combine to combine two insns into two.  This helps
>> in many cases, by reducing instruction path length, and also allowing
>> further combinations to happen.  PR85160 is a typical example of code
>> that it can improve.
>>
>> This patch does not allow such combinations if either of the original
>> instructions was a simple move instruction.  In those cases combining
>> the two instructions increases register pressure without improving the
>> code.  With this move test, register pressure no longer increases
>> noticeably as far as I can tell.
>>
>> (At first I also didn't allow either of the resulting insns to be a
>> move instruction.  But that is actually a very good thing to have, as
>> should have been obvious).
>>
>> Tested for many months; tested on about 30 targets.
>>
>> I'll commit this later this week if there are no objections.
>
> Sounds good - but, _any_ testcase?  Please! ;)
>

Here is a testcase:

For

---
#define N 16
float f[N];
double d[N];
int n[N];

__attribute__((noinline)) void
f3 (void)
{
  int i;
  for (i = 0; i < N; i++)
d[i] = f[i];
}
---

r263067 improved -O3 -mavx2 -mtune=generic -m64 from

.cfi_startproc
vmovaps f(%rip), %xmm2
vmovaps f+32(%rip), %xmm3
vinsertf128 $0x1, f+16(%rip), %ymm2, %ymm0
vcvtps2pd %xmm0, %ymm1
vextractf128 $0x1, %ymm0, %xmm0
vmovaps %xmm1, d(%rip)
vextractf128 $0x1, %ymm1, d+16(%rip)
vcvtps2pd %xmm0, %ymm0
vmovaps %xmm0, d+32(%rip)
vextractf128 $0x1, %ymm0, d+48(%rip)
vinsertf128 $0x1, f+48(%rip), %ymm3, %ymm0
vcvtps2pd %xmm0, %ymm1
vextractf128 $0x1, %ymm0, %xmm0
vmovaps %xmm1, d+64(%rip)
vextractf128 $0x1, %ymm1, d+80(%rip)
vcvtps2pd %xmm0, %ymm0
vmovaps %xmm0, d+96(%rip)
vextractf128 $0x1, %ymm0, d+112(%rip)
vzeroupper
ret
.cfi_endproc

to

.cfi_startproc
vcvtps2pd f(%rip), %ymm0
vmovaps %xmm0, d(%rip)
vextractf128 $0x1, %ymm0, d+16(%rip)
vcvtps2pd f+16(%rip), %ymm0
vmovaps %xmm0, d+32(%rip)
vextractf128 $0x1, %ymm0, d+48(%rip)
vcvtps2pd f+32(%rip), %ymm0
vextractf128 $0x1, %ymm0, d+80(%rip)
vmovaps %xmm0, d+64(%rip)
vcvtps2pd f+48(%rip), %ymm0
vextractf128 $0x1, %ymm0, d+112(%rip)
vmovaps %xmm0, d+96(%rip)
vzeroupper
ret
.cfi_endproc

This is:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86752

H.J.


Re: [PATCH 01/11] Add __builtin_speculation_safe_value

2018-07-31 Thread H.J. Lu
On Mon, Jul 30, 2018 at 6:16 AM, Richard Biener  wrote:
> On Fri, 27 Jul 2018, Richard Earnshaw wrote:
>
>>
>> This patch defines a new intrinsic function
>> __builtin_speculation_safe_value.  A generic default implementation is
>> defined which will attempt to use the backend pattern
>> "speculation_safe_barrier".  If this pattern is not defined, or if it
>> is not available, then the compiler will emit a warning, but
>> compilation will continue.
>>
>> Note that the test spec-barrier-1.c will currently fail on all
>> targets.  This is deliberate, the failure will go away when
>> appropriate action is taken for each target backend.
>
> OK.
>
> Thanks,
> Richard.
>
>> gcc:
>>   * builtin-types.def (BT_FN_PTR_PTR_VAR): New function type.
>>   (BT_FN_I1_I1_VAR, BT_FN_I2_I2_VAR, BT_FN_I4_I4_VAR): Likewise.
>>   (BT_FN_I8_I8_VAR, BT_FN_I16_I16_VAR): Likewise.
>>   * builtin-attrs.def (ATTR_NOVOPS_NOTHROW_LEAF_LIST): New attribute
>>   list.
>>   * builtins.def (BUILT_IN_SPECULATION_SAFE_VALUE_N): New builtin.
>>   (BUILT_IN_SPECULATION_SAFE_VALUE_PTR): New internal builtin.
>>   (BUILT_IN_SPECULATION_SAFE_VALUE_1): Likewise.
>>   (BUILT_IN_SPECULATION_SAFE_VALUE_2): Likewise.
>>   (BUILT_IN_SPECULATION_SAFE_VALUE_4): Likewise.
>>   (BUILT_IN_SPECULATION_SAFE_VALUE_8): Likewise.
>>   (BUILT_IN_SPECULATION_SAFE_VALUE_16): Likewise.
>>   * builtins.c (expand_speculation_safe_value): New function.
>>   (expand_builtin): Call it.
>>   * doc/cpp.texi: Document predefine __HAVE_SPECULATION_SAFE_VALUE.
>>   * doc/extend.texi: Document __builtin_speculation_safe_value.
>>   * doc/md.texi: Document "speculation_barrier" pattern.
>>   * doc/tm.texi.in: Pull in TARGET_SPECULATION_SAFE_VALUE and
>>   TARGET_HAVE_SPECULATION_SAFE_VALUE.
>>   * doc/tm.texi: Regenerated.
>>   * target.def (have_speculation_safe_value, speculation_safe_value): New
>>   hooks.
>>   * targhooks.c (default_have_speculation_safe_value): New function.
>>   (default_speculation_safe_value): New function.
>>   * targhooks.h (default_have_speculation_safe_value): Add prototype.
>>   (default_speculation_safe_value): Add prototype.
>>

I got

../../src-trunk/gcc/targhooks.c: In function ‘bool
default_have_speculation_safe_value(bool)’:
../../src-trunk/gcc/targhooks.c:2319:43: error: unused parameter
‘active’ [-Werror=unused-parameter]
 default_have_speculation_safe_value (bool active)
  ~^~
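
A sketch of the usual way to quiet this in the GCC sources follows.  The
fallback define is only so the snippet stands alone (ATTRIBUTE_UNUSED
normally comes from include/ansidecl.h), and the function body is a
placeholder, not the hook's actual implementation:

#include <stdbool.h>

#ifndef ATTRIBUTE_UNUSED
# define ATTRIBUTE_UNUSED __attribute__ ((unused))
#endif

bool
default_have_speculation_safe_value (bool active ATTRIBUTE_UNUSED)
{
  return false;	/* placeholder body */
}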

-- 
H.J.


Re: [PATCH 10/11] x86 - add speculation_barrier pattern

2018-07-31 Thread H.J. Lu
On Sat, Jul 28, 2018 at 1:25 AM, Uros Bizjak  wrote:
> On Fri, Jul 27, 2018 at 11:37 AM, Richard Earnshaw
>  wrote:
>>
>> This patch adds a speculation barrier for x86, based on my
>> understanding of the required mitigation for that CPU, which is to use
>> an lfence instruction.
>>
>> This patch needs some review by an x86 expert and if adjustments are
>> needed, I'd appreciate it if they could be picked up by the port
>> maintainer.  This is supposed to serve as an example of how to deploy
>> the new __builtin_speculation_safe_value() intrinsic on this
>> architecture.
>>
>> * config/i386/i386.md (unspecv): Add UNSPECV_SPECULATION_BARRIER.
>> (speculation_barrier): New insn.
>
> The implementation is OK, but someone from Intel (CC'd) should clarify
> if lfence is the correct insn.
>

I checked with our people.  lfence is OK.

Thanks.

-- 
H.J.


Re: [PR 83141] Prevent SRA from removing type changing assignment

2018-07-31 Thread H.J. Lu
On Tue, Dec 5, 2017 at 4:00 AM, Martin Jambor  wrote:
> On Tue, Dec 05 2017, Martin Jambor wrote:
>> On Tue, Dec 05 2017, Martin Jambor wrote:
>> Hi,
>>
>>> Hi,
>>>
>>> this is a followup to Richi's
>>> https://gcc.gnu.org/ml/gcc-patches/2017-11/msg02396.html to fix PR
>>> 83141.  The basic idea is simple, be just as conservative about type
>>> changing MEM_REFs as we are about actual VCEs.
>>>
>>> I have checked how that would affect compilation of SPEC 2006 and (non
>>> LTO) Mozilla Firefox and am happy to report that the difference was
>>> tiny.  However, I had to make the test less strict, otherwise testcase
>>> gcc.dg/guality/pr54970.c kept failing because it contains folded memcpy
>>> and expects us to track values across:
>>>
>>>   int a[] = { 1, 2, 3 };
>>>   /* ... */
>>>   __builtin_memcpy (&a, (int [3]) { 4, 5, 6 }, sizeof (a));
>>>  /* { dg-final { gdb-test 31 "a\[0\]" "4" } } */
>>>  /* { dg-final { gdb-test 31 "a\[1\]" "5" } } */
>>>  /* { dg-final { gdb-test 31 "a\[2\]" "6" } } */
>>>
>>> SRA is able to load replacement of a[0] directly from the temporary
>>> array which is apparently necessary to generate proper debug info.  I
>>> have therefore allowed the current transformation to go forward if the
>>> source does not contain any padding or if it is a read-only declaration.
>>
>> Ah, the read-only test is of course bogus, it was a last minute addition
>> when I was apparently already too tired to think it through.  Please
>> disregard that line in the patch (it has passed bootstrap and testing
>> without it).
>>
>> Sorry for the noise,
>>
>> Martin
>>
>
> And for the record, below is the actual patch, after a fresh round of
> re-testing to double check I did not mess up anything else.  As before,
> I'd like to ask for review, especially of the type_contains_padding_p
> predicate and then would like to commit it to trunk.
>
> Thanks,
>
> Martin
>
>
> 2017-12-05  Martin Jambor  
>
> PR tree-optimization/83141
> * tree-sra.c (type_contains_padding_p): New function.
> (contains_vce_or_bfcref_p): Move up in the file, also test for
> MEM_REFs implicitely changing types with padding.  Remove inline
> keyword.
> (build_accesses_from_assign): Check contains_vce_or_bfcref_p
> before setting bit in should_scalarize_away_bitmap.
>

This caused:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86763

H.J.


Re: RFC/A: Add a targetm.vectorize.related_mode hook

2019-10-23 Thread H.J. Lu
On Wed, Oct 23, 2019 at 4:51 AM Richard Sandiford
 wrote:
>
> Richard Biener  writes:
> > On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford
> >  wrote:
> >>
> >> This patch is the first of a series that tries to remove two
> >> assumptions:
> >>
> >> (1) that all vectors involved in vectorisation must be the same size
> >>
> >> (2) that there is only one vector mode for a given element mode and
> >> number of elements
> >>
> >> Relaxing (1) helps with targets that support multiple vector sizes or
> >> that require the number of elements to stay the same.  E.g. if we're
> >> vectorising code that operates on narrow and wide elements, and the
> >> narrow elements use 64-bit vectors, then on AArch64 it would normally
> >> be better to use 128-bit vectors rather than pairs of 64-bit vectors
> >> for the wide elements.
> >>
> >> Relaxing (2) makes it possible for -msve-vector-bits=128 to produce
> >> fixed-length code for SVE.  It also allows unpacked/half-size SVE
> >> vectors to work with -msve-vector-bits=256.
> >>
> >> The patch adds a new hook that targets can use to control how we
> >> move from one vector mode to another.  The hook takes a starting vector
> >> mode, a new element mode, and (optionally) a new number of elements.
> >> The flexibility needed for (1) comes in when the number of elements
> >> isn't specified.
> >>
> >> All callers in this patch specify the number of elements, but a later
> >> vectoriser patch doesn't.  I won't be posting the vectoriser patch
> >> for a few days, hence the RFC/A tag.
> >>
> >> Tested individually on aarch64-linux-gnu and as a series on
> >> x86_64-linux-gnu.  OK to install?  Or if not yet, does the idea
> >> look OK?
> >
> > In isolation the idea looks good but maybe a bit limited?  I see
> > how it works for the same-size case but if you consider x86
> > where we have SSE, AVX256 and AVX512 what would it return
> > for related_vector_mode (V4SImode, SImode, 0)?  Or is this
> > kind of query not intended (where the component modes match
> > but nunits is zero)?
>
> In that case we'd normally get V4SImode back.  It's an allowed
> combination, but not very useful.
>
> > How do you get from SVE fixed 128bit to NEON fixed 128bit then?  Or is
> > it just used to stay in the same register set for different component
> > modes?
>
> Yeah, the idea is to use the original vector mode as essentially
> a base architecture.
>
> The follow-on patches replace vec_info::vector_size with
> vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes
> with targetm.vectorize.autovectorize_vector_modes.  These are the
> starting modes that would be passed to the hook in the nunits==0 case.
>

For a target with different vector sizes,
targetm.vectorize.autovectorize_vector_sizes doesn't return the optimal
vector sizes for known and unknown trip counts.  For a target with 128-bit
and 256-bit vectors, 256-bit followed by 128-bit works well for a known
trip count, since the vectorizer knows the maximum usable vector size.
But for an unknown trip count, we may want to use 128-bit vectors when the
256-bit code path won't be used at run time but the 128-bit one will.
At the moment, we can only use one set of vector sizes for both known and
unknown trip counts.  Can the vectorizer support two sets of vector sizes,
one for known trip counts and the other for unknown trip counts?

H.J.


Re: RFC/A: Add a targetm.vectorize.related_mode hook

2019-10-24 Thread H.J. Lu
On Thu, Oct 24, 2019 at 12:56 AM Richard Sandiford
 wrote:
>
> "H.J. Lu"  writes:
> > On Wed, Oct 23, 2019 at 4:51 AM Richard Sandiford
> >  wrote:
> >>
> >> Richard Biener  writes:
> >> > On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford
> >> >  wrote:
> >> >>
> >> >> This patch is the first of a series that tries to remove two
> >> >> assumptions:
> >> >>
> >> >> (1) that all vectors involved in vectorisation must be the same size
> >> >>
> >> >> (2) that there is only one vector mode for a given element mode and
> >> >> number of elements
> >> >>
> >> >> Relaxing (1) helps with targets that support multiple vector sizes or
> >> >> that require the number of elements to stay the same.  E.g. if we're
> >> >> vectorising code that operates on narrow and wide elements, and the
> >> >> narrow elements use 64-bit vectors, then on AArch64 it would normally
> >> >> be better to use 128-bit vectors rather than pairs of 64-bit vectors
> >> >> for the wide elements.
> >> >>
> >> >> Relaxing (2) makes it possible for -msve-vector-bits=128 to produce
> >> >> fixed-length code for SVE.  It also allows unpacked/half-size SVE
> >> >> vectors to work with -msve-vector-bits=256.
> >> >>
> >> >> The patch adds a new hook that targets can use to control how we
> >> >> move from one vector mode to another.  The hook takes a starting vector
> >> >> mode, a new element mode, and (optionally) a new number of elements.
> >> >> The flexibility needed for (1) comes in when the number of elements
> >> >> isn't specified.
> >> >>
> >> >> All callers in this patch specify the number of elements, but a later
> >> >> vectoriser patch doesn't.  I won't be posting the vectoriser patch
> >> >> for a few days, hence the RFC/A tag.
> >> >>
> >> >> Tested individually on aarch64-linux-gnu and as a series on
> >> >> x86_64-linux-gnu.  OK to install?  Or if not yet, does the idea
> >> >> look OK?
> >> >
> >> > In isolation the idea looks good but maybe a bit limited?  I see
> >> > how it works for the same-size case but if you consider x86
> >> > where we have SSE, AVX256 and AVX512 what would it return
> >> > for related_vector_mode (V4SImode, SImode, 0)?  Or is this
> >> > kind of query not intended (where the component modes match
> >> > but nunits is zero)?
> >>
> >> In that case we'd normally get V4SImode back.  It's an allowed
> >> combination, but not very useful.
> >>
> >> > How do you get from SVE fixed 128bit to NEON fixed 128bit then?  Or is
> >> > it just used to stay in the same register set for different component
> >> > modes?
> >>
> >> Yeah, the idea is to use the original vector mode as essentially
> >> a base architecture.
> >>
> >> The follow-on patches replace vec_info::vector_size with
> >> vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes
> >> with targetm.vectorize.autovectorize_vector_modes.  These are the
> >> starting modes that would be passed to the hook in the nunits==0 case.
> >>
> >
> > For a target with different vector sizes,
> > targetm.vectorize.autovectorize_vector_sizes
> > doesn't return the optimal vector sizes for known trip count and
> > unknown trip count.
> > For a target with 128-bit and 256-bit vectors, 256-bit followed by
> > 128-bit works well for
> > known trip count since vectorizer knows the maximum usable vector size.  
> > But for
> > unknown trip count, we may want to use 128-bit vector when 256-bit
> > code path won't
> > be used at run-time, but 128-bit vector will.  At the moment, we can
> > only use one
> > set of vector sizes for both known trip count and unknown trip count.
>
> Yeah, we're hit by this for AArch64 too.  Andre's recent patches:
>
> https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01564.html
> https://gcc.gnu.org/ml/gcc-patches/2019-09/msg00205.html
>
> should help.
>
> >   Can vectorizer
> > support 2 sets of vector sizes, one for known trip count and the other
> > for unknown
> > trip count?
>
> The approach Andre's taking is to continue to use the wider vector size
> for unknown trip counts, and instead ensure that the epilogue loop is
> vectorised at the narrower vector size if possible.  The patches then
> use this vectorised epilogue as a fallback "main" loop if the runtime
> trip count is too low for the wide vectors.

I tried it on 548.exchange2_r in SPEC CPU 2017.  There is a shortcut
to the vectorized epilogue for low trip counts.

-- 
H.J.


Re: [PR47785] COLLECT_AS_OPTIONS

2019-10-29 Thread H.J. Lu
On Sun, Oct 27, 2019 at 6:33 PM Kugan Vivekanandarajah
 wrote:
>
> Hi Richard,
>
> Thanks for the review.
>
> On Wed, 23 Oct 2019 at 23:07, Richard Biener  
> wrote:
> >
> > On Mon, Oct 21, 2019 at 10:04 AM Kugan Vivekanandarajah
> >  wrote:
> > >
> > > Hi Richard,
> > >
> > > Thanks for the pointers.
> > >
> > >
> > >
> > > On Fri, 11 Oct 2019 at 22:33, Richard Biener  
> > > wrote:
> > > >
> > > > On Fri, Oct 11, 2019 at 6:15 AM Kugan Vivekanandarajah
> > > >  wrote:
> > > > >
> > > > > Hi Richard,
> > > > > Thanks for the review.
> > > > >
> > > > > On Wed, 2 Oct 2019 at 20:41, Richard Biener 
> > > > >  wrote:
> > > > > >
> > > > > > On Wed, Oct 2, 2019 at 10:39 AM Kugan Vivekanandarajah
> > > > > >  wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > As mentioned in the PR, attached patch adds COLLECT_AS_OPTIONS for
> > > > > > > passing assembler options specified with -Wa, to the link-time 
> > > > > > > driver.
> > > > > > >
> > > > > > > The proposed solution only works for uniform -Wa options across 
> > > > > > > all
> > > > > > > TUs. As mentioned by Richard Biener, supporting non-uniform -Wa 
> > > > > > > flags
> > > > > > > would require either adjusting partitioning according to flags or
> > > > > > > emitting multiple object files  from a single LTRANS CU. We could
> > > > > > > consider this as a follow up.
> > > > > > >
> > > > > > > Bootstrapped and regression tests on  arm-linux-gcc. Is this OK 
> > > > > > > for trunk?
> > > > > >
> > > > > > While it works for your simple cases it is unlikely to work in 
> > > > > > practice since
> > > > > > your implementation needs the assembler options be present at the 
> > > > > > link
> > > > > > command line.  I agree that this might be the way for people to go 
> > > > > > when
> > > > > > they face the issue but then it needs to be documented somewhere
> > > > > > in the manual.
> > > > > >
> > > > > > That is, with COLLECT_AS_OPTION (why singular?  I'd expected
> > > > > > COLLECT_AS_OPTIONS) available to cc1 we could stream this string
> > > > > > to lto_options and re-materialize it at link time (and diagnose 
> > > > > > mismatches
> > > > > > even if we like).
> > > > > OK. I will try to implement this. So the idea is if we provide
> > > > > -Wa,options as part of the lto compile, this should be available
> > > > > during link time. Like in:
> > > > >
> > > > > arm-linux-gnueabihf-gcc -march=armv7-a -mthumb -O2 -flto
> > > > > -Wa,-mimplicit-it=always,-mthumb -c test.c
> > > > > arm-linux-gnueabihf-gcc  -flto  test.o
> > > > >
> > > > > I am not sure where should we stream this. Currently, cl_optimization
> > > > > has all the optimization flag provided for compiler and it is
> > > > > autogenerated and all the flags are integer values. Do you have any
> > > > > preference or example where this should be done.
> > > >
> > > > In lto_write_options, I'd simply append the contents of 
> > > > COLLECT_AS_OPTIONS
> > > > (with -Wa, prepended to each of them), then recover them in lto-wrapper
> > > > for each TU and pass them down to the LTRANS compiles (if they agree
> > > > for all TUs, otherwise I'd warn and drop them).
> > >
> > > Attached patch streams it and also make sure that the options are the
> > > same for all the TUs. Maybe it is a bit restrictive.
> > >
> > > What is the best place to document COLLECT_AS_OPTIONS. We don't seem
> > > to document COLLECT_GCC_OPTIONS anywhere ?
> >
> > Nowhere, it's an implementation detail then.
> >
> > > Attached patch passes regression and also fixes the original ARM
> > > kernel build issue with tumb2.
> >
> > Did you try this with multiple assembler options?  I see you stream
> > them as -Wa,-mfpu=xyz,-mthumb but then compare the whole
> > option strings so a mismatch with -Wa,-mthumb,-mfpu=xyz would be
> > diagnosed.  If there's a spec induced -Wa option do we get to see
> > that as well?  I can imagine -march=xyz enabling a -Wa option
> > for example.
> >
> > + *collect_as = XNEWVEC (char, strlen (args_text) + 1);
> > + strcpy (*collect_as, args_text);
> >
> > there's strdup.  Btw, I'm not sure why you don't simply leave
> > the -Wa option in the merged options [individually] and match
> > them up but go the route of comparing strings and carrying that
> > along separately.  I think that would be much better.
>
> Is attached patch which does this is OK?
>

Don't you need to handle -Xassembler as well?  Since -Wa, doesn't work for
assembler options that themselves contain a comma, like -mfoo=foo1,foo2,
one needs to use

-Xassembler -mfoo=foo1,foo2

to pass -mfoo=foo1,foo2 to the assembler.

-- 
H.J.


Re: [PR47785] COLLECT_AS_OPTIONS

2019-11-01 Thread H.J. Lu
On Thu, Oct 31, 2019 at 6:33 PM Kugan Vivekanandarajah
 wrote:
>
> On Wed, 30 Oct 2019 at 03:11, H.J. Lu  wrote:
> >
> > On Sun, Oct 27, 2019 at 6:33 PM Kugan Vivekanandarajah
> >  wrote:
> > >
> > > Hi Richard,
> > >
> > > Thanks for the review.
> > >
> > > On Wed, 23 Oct 2019 at 23:07, Richard Biener  
> > > wrote:
> > > >
> > > > On Mon, Oct 21, 2019 at 10:04 AM Kugan Vivekanandarajah
> > > >  wrote:
> > > > >
> > > > > Hi Richard,
> > > > >
> > > > > Thanks for the pointers.
> > > > >
> > > > >
> > > > >
> > > > > On Fri, 11 Oct 2019 at 22:33, Richard Biener 
> > > > >  wrote:
> > > > > >
> > > > > > On Fri, Oct 11, 2019 at 6:15 AM Kugan Vivekanandarajah
> > > > > >  wrote:
> > > > > > >
> > > > > > > Hi Richard,
> > > > > > > Thanks for the review.
> > > > > > >
> > > > > > > On Wed, 2 Oct 2019 at 20:41, Richard Biener 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Wed, Oct 2, 2019 at 10:39 AM Kugan Vivekanandarajah
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > As mentioned in the PR, attached patch adds 
> > > > > > > > > COLLECT_AS_OPTIONS for
> > > > > > > > > passing assembler options specified with -Wa, to the 
> > > > > > > > > link-time driver.
> > > > > > > > >
> > > > > > > > > The proposed solution only works for uniform -Wa options 
> > > > > > > > > across all
> > > > > > > > > TUs. As mentioned by Richard Biener, supporting non-uniform 
> > > > > > > > > -Wa flags
> > > > > > > > > would require either adjusting partitioning according to 
> > > > > > > > > flags or
> > > > > > > > > emitting multiple object files  from a single LTRANS CU. We 
> > > > > > > > > could
> > > > > > > > > consider this as a follow up.
> > > > > > > > >
> > > > > > > > > Bootstrapped and regression tests on  arm-linux-gcc. Is this 
> > > > > > > > > OK for trunk?
> > > > > > > >
> > > > > > > > While it works for your simple cases it is unlikely to work in 
> > > > > > > > practice since
> > > > > > > > your implementation needs the assembler options be present at 
> > > > > > > > the link
> > > > > > > > command line.  I agree that this might be the way for people to 
> > > > > > > > go when
> > > > > > > > they face the issue but then it needs to be documented somewhere
> > > > > > > > in the manual.
> > > > > > > >
> > > > > > > > That is, with COLLECT_AS_OPTION (why singular?  I'd expected
> > > > > > > > COLLECT_AS_OPTIONS) available to cc1 we could stream this string
> > > > > > > > to lto_options and re-materialize it at link time (and diagnose 
> > > > > > > > mismatches
> > > > > > > > even if we like).
> > > > > > > OK. I will try to implement this. So the idea is if we provide
> > > > > > > -Wa,options as part of the lto compile, this should be available
> > > > > > > during link time. Like in:
> > > > > > >
> > > > > > > arm-linux-gnueabihf-gcc -march=armv7-a -mthumb -O2 -flto
> > > > > > > -Wa,-mimplicit-it=always,-mthumb -c test.c
> > > > > > > arm-linux-gnueabihf-gcc  -flto  test.o
> > > > > > >
> > > > > > > I am not sure where should we stream this. Currently, 
> > > > > > > cl_optimization
> > > > > > > has all the optimization flag provided for compiler and it is
> > > > > > > autogenerated and all the flags are integer values. Do you have 
> > > > > > > any
> >

Re: [PR47785] COLLECT_AS_OPTIONS

2019-11-04 Thread H.J. Lu
On Sun, Nov 3, 2019 at 6:45 PM Kugan Vivekanandarajah
 wrote:
>
> Thanks for the reviews.
>
>
> On Sat, 2 Nov 2019 at 02:49, H.J. Lu  wrote:
> >
> > On Thu, Oct 31, 2019 at 6:33 PM Kugan Vivekanandarajah
> >  wrote:
> > >
> > > On Wed, 30 Oct 2019 at 03:11, H.J. Lu  wrote:
> > > >
> > > > On Sun, Oct 27, 2019 at 6:33 PM Kugan Vivekanandarajah
> > > >  wrote:
> > > > >
> > > > > Hi Richard,
> > > > >
> > > > > Thanks for the review.
> > > > >
> > > > > On Wed, 23 Oct 2019 at 23:07, Richard Biener 
> > > > >  wrote:
> > > > > >
> > > > > > On Mon, Oct 21, 2019 at 10:04 AM Kugan Vivekanandarajah
> > > > > >  wrote:
> > > > > > >
> > > > > > > Hi Richard,
> > > > > > >
> > > > > > > Thanks for the pointers.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, 11 Oct 2019 at 22:33, Richard Biener 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Fri, Oct 11, 2019 at 6:15 AM Kugan Vivekanandarajah
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > Hi Richard,
> > > > > > > > > Thanks for the review.
> > > > > > > > >
> > > > > > > > > On Wed, 2 Oct 2019 at 20:41, Richard Biener 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, Oct 2, 2019 at 10:39 AM Kugan Vivekanandarajah
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > As mentioned in the PR, attached patch adds 
> > > > > > > > > > > COLLECT_AS_OPTIONS for
> > > > > > > > > > > passing assembler options specified with -Wa, to the 
> > > > > > > > > > > link-time driver.
> > > > > > > > > > >
> > > > > > > > > > > The proposed solution only works for uniform -Wa options 
> > > > > > > > > > > across all
> > > > > > > > > > > TUs. As mentioned by Richard Biener, supporting 
> > > > > > > > > > > non-uniform -Wa flags
> > > > > > > > > > > would require either adjusting partitioning according to 
> > > > > > > > > > > flags or
> > > > > > > > > > > emitting multiple object files  from a single LTRANS CU. 
> > > > > > > > > > > We could
> > > > > > > > > > > consider this as a follow up.
> > > > > > > > > > >
> > > > > > > > > > > Bootstrapped and regression tests on  arm-linux-gcc. Is 
> > > > > > > > > > > this OK for trunk?
> > > > > > > > > >
> > > > > > > > > > While it works for your simple cases it is unlikely to work 
> > > > > > > > > > in practice since
> > > > > > > > > > your implementation needs the assembler options be present 
> > > > > > > > > > at the link
> > > > > > > > > > command line.  I agree that this might be the way for 
> > > > > > > > > > people to go when
> > > > > > > > > > they face the issue but then it needs to be documented 
> > > > > > > > > > somewhere
> > > > > > > > > > in the manual.
> > > > > > > > > >
> > > > > > > > > > That is, with COLLECT_AS_OPTION (why singular?  I'd expected
> > > > > > > > > > COLLECT_AS_OPTIONS) available to cc1 we could stream this 
> > > > > > > > > > string
> > > > > > > > > > to lto_options and re-materialize it at link time (and 
> > > > > > > > > > diagnose mismatches
> > > > > >

Re: [PATCH] Set AVX128_OPTIMAL for all avx targets.

2019-11-12 Thread H.J. Lu
On Tue, Nov 12, 2019 at 2:48 AM Hongtao Liu  wrote:
>
> On Tue, Nov 12, 2019 at 4:41 PM Richard Biener
>  wrote:
> >
> > On Tue, Nov 12, 2019 at 9:29 AM Hongtao Liu  wrote:
> > >
> > > On Tue, Nov 12, 2019 at 4:19 PM Richard Biener
> > >  wrote:
> > > >
> > > > On Tue, Nov 12, 2019 at 8:36 AM Hongtao Liu  wrote:
> > > > >
> > > > > Hi:
> > > > >   This patch is about to set X86_TUNE_AVX128_OPTIMAL as the default for
> > > > > all AVX targets, because we found there's still a performance gap between
> > > > > 128-bit auto-vectorization and 256-bit auto-vectorization even with the
> > > > > epilogue vectorized.
> > > > >   The performance influence of setting avx128_optimal as the default on
> > > > > SPEC2017 with the options "-march=native -funroll-loops -Ofast -flto" on
> > > > > CLX is as below:
> > > > >
> > > > > INT rate
> > > > > 500.perlbench_r -0.32%
> > > > > 502.gcc_r   -1.32%
> > > > > 505.mcf_r   -0.12%
> > > > > 520.omnetpp_r   -0.34%
> > > > > 523.xalancbmk_r -0.65%
> > > > > 525.x264_r  2.23%
> > > > > 531.deepsjeng_r 0.81%
> > > > > 541.leela_r -0.02%
> > > > > 548.exchange2_r 10.89%  --> big improvement
> > > > > 557.xz_r0.38%
> > > > > geomean for intrate 1.10%
> > > > >
> > > > > FP rate
> > > > > 503.bwaves_r1.41%
> > > > > 507.cactuBSSN_r -0.14%
> > > > > 508.namd_r  1.54%
> > > > > 510.parest_r-0.87%
> > > > > 511.povray_r0.28%
> > > > > 519.lbm_r   0.32%
> > > > > 521.wrf_r   -0.54%
> > > > > 526.blender_r   0.59%
> > > > > 527.cam4_r  -2.70%
> > > > > 538.imagick_r   3.92%
> > > > > 544.nab_r   0.59%
> > > > > 549.fotonik3d_r -5.44%  -> regression
> > > > > 554.roms_r  -2.34%
> > > > > geomean for fprate  -0.28%
> > > > >
> > > > > The 10% improvement of 548.exchange2_r is because there is a 9-layer
> > > > > nested loop, and the loop count for the innermost layer is small (enough
> > > > > for 128-bit vectorization, but not for 256-bit vectorization).
> > > > > Since the loop count is not known statically, the vectorizer will
> > > > > choose 256-bit vectorization, which would never be triggered.  The
> > > > > vectorization of the epilogue introduces some extra instructions;
> > > > > normally it wins back some performance, but since it's a 9-layer
> > > > > nested loop, the cost of the extra instructions outweighs the gain.
> > > > >
> > > > > The 5.44% regression of 549.fotonik3d_r is because 256-bit
> > > > > vectorization is better than 128-bit vectorization there.  Generally,
> > > > > enabling 256-bit or 512-bit vectorization reduces the clocktick count
> > > > > but also reduces the frequency.  When the frequency reduction is smaller
> > > > > than the clocktick reduction, the longer vector width is better than the
> > > > > shorter one, and otherwise the opposite.  The regression of
> > > > > 549.fotonik3d_r is due to this, and similarly for 554.roms_r and
> > > > > 527.cam4_r; for those 3 benchmarks, 512-bit vectorization is best.
> > > > >
> > > > > Bootstrap and regression test on i386 is ok.
> > > > > Ok for trunk?
> > > >
> > > > I don't think 128_optimal does what you think it does.  If you want to
> > > > prefer 128bit AVX adjust the preference, but 128_optimal describes
> > > > a microarchitectural detail (AVX256 ops are split into two AVX128 ops)
> > > But it will set target_prefer_avx128 by default.
> > > 
> > > 2694  /* Enable 128-bit AVX instruction generation
> > > 2695 for the auto-vectorizer.  */
> > > 2696  if (TARGET_AVX128_OPTIMAL
> > > 2697  && (opts_set->x_prefer_vector_width_type == PVW_NONE))
> > > 2698opts->x_prefer_vector_width_type = PVW_AVX128;
> > > -
> > > And it may be too confusing to add another tuning flag.
> >
> > Well, it's confusing to mix two things - defaulting the vector width 
> > preference
> > and the architectural detail of Bulldozer and early Zen parts.  So please 
> > split
> > the tuning.  And then re-benchmark with _just_ changing the preference
> Actually, the result is similar; I've tested both (the patch using
> avx128_optimal, and trunk gcc with an additional
> -mprefer-vector-width=128).
> And I will run a test to see the effect of FDO.

It is hard to tell whether the 128-bit or the 256-bit vector size works better.
For SPEC CPU 2017, the 128-bit vector size gives better overall scores.
One can always change the vector size, even to 512-bit, as some workloads
are faster with a 512-bit vector size.
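
To make the trip-count argument concrete, consider a sketch like this
(the loop shape and the run-time count are assumptions, not code from
548.exchange2_r):

/* If m is unknown at compile time but is, say, 6 at run time, a 256-bit
   body handling 8 ints per iteration never executes and only the scalar
   or vectorized epilogue runs, so the wider width buys nothing here,
   while a 128-bit body (4 ints) runs at least once.  Compare, e.g.,
   -mprefer-vector-width=256 against -mprefer-vector-width=128.  */
void
inner_step (int *a, const int *b, int m)
{
  for (int j = 0; j < m; j++)
    a[j] += b[j];
}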

-- 
H.J.


Re: [PR47785] COLLECT_AS_OPTIONS

2020-01-17 Thread H.J. Lu
On Tue, Jan 14, 2020 at 11:29 PM Prathamesh Kulkarni
 wrote:
>
> On Wed, 8 Jan 2020 at 15:50, Prathamesh Kulkarni
>  wrote:
> >
> > On Tue, 5 Nov 2019 at 17:38, Richard Biener  
> > wrote:
> > >
> > > On Tue, Nov 5, 2019 at 12:17 AM Kugan Vivekanandarajah
> > >  wrote:
> > > >
> > > > Hi,
> > > > Thanks for the review.
> > > >
> > > > On Tue, 5 Nov 2019 at 03:57, H.J. Lu  wrote:
> > > > >
> > > > > On Sun, Nov 3, 2019 at 6:45 PM Kugan Vivekanandarajah
> > > > >  wrote:
> > > > > >
> > > > > > Thanks for the reviews.
> > > > > >
> > > > > >
> > > > > > On Sat, 2 Nov 2019 at 02:49, H.J. Lu  wrote:
> > > > > > >
> > > > > > > On Thu, Oct 31, 2019 at 6:33 PM Kugan Vivekanandarajah
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Wed, 30 Oct 2019 at 03:11, H.J. Lu  
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Sun, Oct 27, 2019 at 6:33 PM Kugan Vivekanandarajah
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Richard,
> > > > > > > > > >
> > > > > > > > > > Thanks for the review.
> > > > > > > > > >
> > > > > > > > > > On Wed, 23 Oct 2019 at 23:07, Richard Biener 
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Oct 21, 2019 at 10:04 AM Kugan Vivekanandarajah
> > > > > > > > > > >  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Richard,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the pointers.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, 11 Oct 2019 at 22:33, Richard Biener 
> > > > > > > > > > > >  wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Oct 11, 2019 at 6:15 AM Kugan Vivekanandarajah
> > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Richard,
> > > > > > > > > > > > > > Thanks for the review.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, 2 Oct 2019 at 20:41, Richard Biener 
> > > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Oct 2, 2019 at 10:39 AM Kugan 
> > > > > > > > > > > > > > > Vivekanandarajah
> > > > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > As mentioned in the PR, attached patch adds 
> > > > > > > > > > > > > > > > COLLECT_AS_OPTIONS for
> > > > > > > > > > > > > > > > passing assembler options specified with -Wa, 
> > > > > > > > > > > > > > > > to the link-time driver.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The proposed solution only works for uniform 
> > > > > > > > > > > > > > > > -Wa options across all
> > > > > > > > > > > > > > > > TUs. As mentioned by Richard Biener, supporting 
> > > >

[PATCH] PR target/93319: x32: Add x32 support to -mtls-dialect=gnu2

2020-01-19 Thread H.J. Lu
To add x32 support to -mtls-dialect=gnu2, we need to replace DI with
P in the GNU2 TLS patterns.  Since the thread pointer is in ptr_mode, the
PLUS in the GNU2 TLS address computation must be done in ptr_mode to support
-maddress-mode=long.  Also drop the "q" suffix from lea to support
both "lea foo@TLSDESC(%rip), %eax" and "lea foo@TLSDESC(%rip), %rax".
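
A minimal source sketch of the kind of TLS access these patterns implement
follows; the file name and flags are illustrative assumptions, and the
patch's actual tests are the gcc.target/i386/pr93319-*.c files below:

/* Compile with something like: gcc -mx32 -fPIC -O2 -mtls-dialect=gnu2 -S tls.c  */
__thread int counter;

int
bump (void)
{
  return ++counter;
}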

Tested on Linux/x86-64.  OK for master?

Thanks.

H.J.
---
gcc/

PR target/93319
* config/i386/i386.c (legitimize_tls_address): Pass Pmode to
gen_tls_dynamic_gnu2_64.  Compute GNU2 TLS address in ptr_mode.
* config/i386/i386.md (tls_dynamic_gnu2_64): Renamed to ...
(@tls_dynamic_gnu2_64_<mode>): This.  Replace DI with P.
(*tls_dynamic_gnu2_lea_64): Renamed to ...
(*tls_dynamic_gnu2_lea_64_<mode>): This.  Replace DI with P.
Remove the {q} suffix from lea.
(*tls_dynamic_gnu2_call_64): Renamed to ...
(*tls_dynamic_gnu2_call_64_<mode>): This.  Replace DI with P.
(*tls_dynamic_gnu2_combine_64): Renamed to ...
(*tls_dynamic_gnu2_combine_64_<mode>): This.  Replace DI with P.
Pass Pmode to gen_tls_dynamic_gnu2_64.

gcc/testsuite/

PR target/93319
* gcc.target/i386/pr93319-1a.c: New test.
* gcc.target/i386/pr93319-1b.c: Likewise.
* gcc.target/i386/pr93319-1c.c: Likewise.
* gcc.target/i386/pr93319-1d.c: Likewise.
---
 gcc/config/i386/i386.c | 31 +++--
 gcc/config/i386/i386.md| 54 +++---
 gcc/testsuite/gcc.target/i386/pr93319-1a.c | 24 ++
 gcc/testsuite/gcc.target/i386/pr93319-1b.c |  7 +++
 gcc/testsuite/gcc.target/i386/pr93319-1c.c |  7 +++
 gcc/testsuite/gcc.target/i386/pr93319-1d.c |  7 +++
 6 files changed, 99 insertions(+), 31 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1c.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1d.c

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2c087a4a3e0..8c437dbe1f3 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -10764,12 +10764,24 @@ legitimize_tls_address (rtx x, enum tls_model model, 
bool for_mov)
   if (TARGET_GNU2_TLS)
{
  if (TARGET_64BIT)
-   emit_insn (gen_tls_dynamic_gnu2_64 (dest, x));
+   emit_insn (gen_tls_dynamic_gnu2_64 (Pmode, dest, x));
  else
emit_insn (gen_tls_dynamic_gnu2_32 (dest, x, pic));
 
  tp = get_thread_pointer (Pmode, true);
- dest = force_reg (Pmode, gen_rtx_PLUS (Pmode, tp, dest));
+
+ /* NB: Since thread pointer is in ptr_mode, make sure that
+PLUS is done in ptr_mode.  */
+ if (Pmode != ptr_mode)
+   {
+ tp = lowpart_subreg (ptr_mode, tp, Pmode);
+ dest = lowpart_subreg (ptr_mode, dest, Pmode);
+ dest = gen_rtx_PLUS (ptr_mode, tp, dest);
+ dest = gen_rtx_ZERO_EXTEND (Pmode, dest);
+   }
+ else
+   dest = gen_rtx_PLUS (Pmode, tp, dest);
+ dest = force_reg (Pmode, dest);
 
  if (GET_MODE (x) != Pmode)
x = gen_rtx_ZERO_EXTEND (Pmode, x);
@@ -10821,7 +10833,7 @@ legitimize_tls_address (rtx x, enum tls_model model, 
bool for_mov)
  rtx tmp = ix86_tls_module_base ();
 
  if (TARGET_64BIT)
-   emit_insn (gen_tls_dynamic_gnu2_64 (base, tmp));
+   emit_insn (gen_tls_dynamic_gnu2_64 (Pmode, base, tmp));
  else
emit_insn (gen_tls_dynamic_gnu2_32 (base, tmp, pic));
 
@@ -10864,7 +10876,18 @@ legitimize_tls_address (rtx x, enum tls_model model, 
bool for_mov)
 
   if (TARGET_GNU2_TLS)
{
- dest = force_reg (Pmode, gen_rtx_PLUS (Pmode, dest, tp));
+ /* NB: Since thread pointer is in ptr_mode, make sure that
+PLUS is done in ptr_mode.  */
+ if (Pmode != ptr_mode)
+   {
+ tp = lowpart_subreg (ptr_mode, tp, Pmode);
+ dest = lowpart_subreg (ptr_mode, dest, Pmode);
+ dest = gen_rtx_PLUS (ptr_mode, tp, dest);
+ dest = gen_rtx_ZERO_EXTEND (Pmode, dest);
+   }
+ else
+   dest = gen_rtx_PLUS (Pmode, tp, dest);
+ dest = force_reg (Pmode, dest);
 
  if (GET_MODE (x) != Pmode)
x = gen_rtx_ZERO_EXTEND (Pmode, x);
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index c9d2f338fe9..d53684096c4 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -15185,14 +15185,14 @@ (define_insn_and_split "*tls_dynamic_gnu2_combine_32"
   emit_insn (gen_tls_dynamic_gnu2_32 (operands[5], operands[1], operands[2]));
 })
 
-(define_expand "tls_dynamic_gnu2_64"
+(define_expand "@tls_dynamic_gnu2_64_<mode>"
   [(set (match_dup 2)
-   (unspec:DI [(match_operand 1 "tls_symbolic_operand")]
-  UNSPEC_TLSDESC))
+   (unspec:P

Re: New repository location

2020-01-19 Thread H.J. Lu
On Sun, Jan 19, 2020 at 6:33 AM Bill Schmidt  wrote:
>
> Question:  Is the new gcc git repository at gcc.gnu.org/git/gcc.git
> using the same location as the earlier git mirror did?  I'm curious
> whether our repository on pike is still syncing with the new master, or
> whether we need to make some adjustments before we next rebase pu
> against master.
>

The two repos are different.  I renamed my old mirror and created a new one:

https://gitlab.com/x86-gcc

-- 
H.J.


Re: [PATCH] PR target/93319: x32: Add x32 support to -mtls-dialect=gnu2

2020-01-19 Thread H.J. Lu
On Sun, Jan 19, 2020 at 9:48 AM Uros Bizjak  wrote:
>
> On Sun, Jan 19, 2020 at 6:43 PM Uros Bizjak  wrote:
> >
> > On Sun, Jan 19, 2020 at 2:58 PM H.J. Lu  wrote:
> > >
> > > To add x32 support to -mtls-dialect=gnu2, we need to replace DI with
> > > P in GNU2 TLS patterns.  Since thread pointer is in ptr_mode, PLUS in
> > > GNU2 TLS address computation must be done in ptr_mode to support
> > > -maddress-mode=long.  Also drop the "q" suffix from lea to support
> > > both "lea foo@TLSDESC(%rip), %eax" and "foo@TLSDESC(%rip), %rax".
> >
> > Please use "lea%z0" instead.
> >
> > > Tested on Linux/x86-64.  OK for master?
> > >
> > > Thanks.
> > >
> > > H.J.
> > > ---
> > > gcc/
> > >
> > > PR target/93319
> > > * config/i386/i386.c (legitimize_tls_address): Pass Pmode to
> > > gen_tls_dynamic_gnu2_64.  Compute GNU2 TLS address in ptr_mode.
> > > * config/i386/i386.md (tls_dynamic_gnu2_64): Renamed to ...
> > > (@tls_dynamic_gnu2_64_): This.  Replace DI with P.
> > > (*tls_dynamic_gnu2_lea_64): Renamed to ...
> > > (*tls_dynamic_gnu2_lea_64_): This.  Replace DI with P.
> > > Remove the {q} suffix from lea.
> > > (*tls_dynamic_gnu2_call_64): Renamed to ...
> > > (*tls_dynamic_gnu2_call_64_): This.  Replace DI with P.
> > > (*tls_dynamic_gnu2_combine_64): Renamed to ...
> > > (*tls_dynamic_gnu2_combine_64_): This.  Replace DI with P.
> > > Pass Pmode to gen_tls_dynamic_gnu2_64.
> > >
> > > gcc/testsuite/
> > >
> > > PR target/93319
> > > * gcc.target/i386/pr93319-1a.c: New test.
> > > * gcc.target/i386/pr93319-1b.c: Likewise.
> > > * gcc.target/i386/pr93319-1c.c: Likewise.
> > > * gcc.target/i386/pr93319-1d.c: Likewise.
> > > ---
> > >  gcc/config/i386/i386.c | 31 +++--
> > >  gcc/config/i386/i386.md| 54 +++---
> > >  gcc/testsuite/gcc.target/i386/pr93319-1a.c | 24 ++
> > >  gcc/testsuite/gcc.target/i386/pr93319-1b.c |  7 +++
> > >  gcc/testsuite/gcc.target/i386/pr93319-1c.c |  7 +++
> > >  gcc/testsuite/gcc.target/i386/pr93319-1d.c |  7 +++
> > >  6 files changed, 99 insertions(+), 31 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1a.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1b.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1c.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1d.c
> > >
> > > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> > > index 2c087a4a3e0..8c437dbe1f3 100644
> > > --- a/gcc/config/i386/i386.c
> > > +++ b/gcc/config/i386/i386.c
> > > @@ -10764,12 +10764,24 @@ legitimize_tls_address (rtx x, enum tls_model 
> > > model, bool for_mov)
> > >if (TARGET_GNU2_TLS)
> > > {
> > >   if (TARGET_64BIT)
> > > -   emit_insn (gen_tls_dynamic_gnu2_64 (dest, x));
> > > +   emit_insn (gen_tls_dynamic_gnu2_64 (Pmode, dest, x));
> > >   else
> > > emit_insn (gen_tls_dynamic_gnu2_32 (dest, x, pic));
> > >
> > >   tp = get_thread_pointer (Pmode, true);
> > > - dest = force_reg (Pmode, gen_rtx_PLUS (Pmode, tp, dest));
> > > +
> > > + /* NB: Since thread pointer is in ptr_mode, make sure that
> > > +PLUS is done in ptr_mode.  */
>
> Actually, thread_pointer is in Pmode, see the line just above your
> change. Also, dest is in Pmode, so why do we need all this subreg
> dance?

dest, as set by gen_tls_dynamic_gnu2_64, is in ptr_mode because it comes from

call *foo@TLSCALL(%rax)

(gdb) bt
#0  test () at lib.s:20
#1  0x00401075 in main () at main.c:13
(gdb) f 0
#0  test () at lib.s:20
20 addq %rax, %r12
(gdb) disass
Dump of assembler code for function test:
   0xf7fca120 <+0>: push   %r12
   0xf7fca122 <+2>: lea    0x2ef7(%rip),%rax        # 0xf7fcd020
   0xf7fca129 <+9>: lea    0xed0(%rip),%rdi         # 0xf7fcb000
   0xf7fca130 <+16>: mov    %fs:0x0,%r12d
   0xf7fca139 <+25>: callq  *(%rax)
=> 0xf7fca13b <+27>: add    %rax,%r12
^^ Wrong address in R12.
   0xf7fca13e <+30>: xor    %eax,%eax
   0xf7fca140 <+32>: mov    (%r12),%esi
   0xf7fca144 <+36>

Re: [PATCH] PR target/93319: x32: Add x32 support to -mtls-dialect=gnu2

2020-01-19 Thread H.J. Lu
On Sun, Jan 19, 2020 at 12:01 PM Uros Bizjak  wrote:
>
> On Sun, Jan 19, 2020 at 7:07 PM H.J. Lu  wrote:
> >
> > On Sun, Jan 19, 2020 at 9:48 AM Uros Bizjak  wrote:
> > >
> > > On Sun, Jan 19, 2020 at 6:43 PM Uros Bizjak  wrote:
> > > >
> > > > On Sun, Jan 19, 2020 at 2:58 PM H.J. Lu  wrote:
> > > > >
> > > > > To add x32 support to -mtls-dialect=gnu2, we need to replace DI with
> > > > > P in GNU2 TLS patterns.  Since thread pointer is in ptr_mode, PLUS in
> > > > > GNU2 TLS address computation must be done in ptr_mode to support
> > > > > -maddress-mode=long.  Also drop the "q" suffix from lea to support
> > > > > both "lea foo@TLSDESC(%rip), %eax" and "foo@TLSDESC(%rip), %rax".
> > > >
> > > > Please use "lea%z0" instead.
> > > >
> > > > > Tested on Linux/x86-64.  OK for master?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > H.J.
> > > > > ---
> > > > > gcc/
> > > > >
> > > > > PR target/93319
> > > > > * config/i386/i386.c (legitimize_tls_address): Pass Pmode to
> > > > > gen_tls_dynamic_gnu2_64.  Compute GNU2 TLS address in 
> > > > > ptr_mode.
> > > > > * config/i386/i386.md (tls_dynamic_gnu2_64): Renamed to ...
> > > > > (@tls_dynamic_gnu2_64_): This.  Replace DI with P.
> > > > > (*tls_dynamic_gnu2_lea_64): Renamed to ...
> > > > > (*tls_dynamic_gnu2_lea_64_): This.  Replace DI with P.
> > > > > Remove the {q} suffix from lea.
> > > > > (*tls_dynamic_gnu2_call_64): Renamed to ...
> > > > > (*tls_dynamic_gnu2_call_64_): This.  Replace DI with P.
> > > > > (*tls_dynamic_gnu2_combine_64): Renamed to ...
> > > > > (*tls_dynamic_gnu2_combine_64_): This.  Replace DI with 
> > > > > P.
> > > > > Pass Pmode to gen_tls_dynamic_gnu2_64.
> > > > >
> > > > > gcc/testsuite/
> > > > >
> > > > > PR target/93319
> > > > > * gcc.target/i386/pr93319-1a.c: New test.
> > > > > * gcc.target/i386/pr93319-1b.c: Likewise.
> > > > > * gcc.target/i386/pr93319-1c.c: Likewise.
> > > > > * gcc.target/i386/pr93319-1d.c: Likewise.
> > > > > ---
> > > > >  gcc/config/i386/i386.c | 31 +++--
> > > > >  gcc/config/i386/i386.md| 54 
> > > > > +++---
> > > > >  gcc/testsuite/gcc.target/i386/pr93319-1a.c | 24 ++
> > > > >  gcc/testsuite/gcc.target/i386/pr93319-1b.c |  7 +++
> > > > >  gcc/testsuite/gcc.target/i386/pr93319-1c.c |  7 +++
> > > > >  gcc/testsuite/gcc.target/i386/pr93319-1d.c |  7 +++
> > > > >  6 files changed, 99 insertions(+), 31 deletions(-)
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1a.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1b.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1c.c
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1d.c
> > > > >
> > > > > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> > > > > index 2c087a4a3e0..8c437dbe1f3 100644
> > > > > --- a/gcc/config/i386/i386.c
> > > > > +++ b/gcc/config/i386/i386.c
> > > > > @@ -10764,12 +10764,24 @@ legitimize_tls_address (rtx x, enum 
> > > > > tls_model model, bool for_mov)
> > > > >if (TARGET_GNU2_TLS)
> > > > > {
> > > > >   if (TARGET_64BIT)
> > > > > -   emit_insn (gen_tls_dynamic_gnu2_64 (dest, x));
> > > > > +   emit_insn (gen_tls_dynamic_gnu2_64 (Pmode, dest, x));
> > > > >   else
> > > > > emit_insn (gen_tls_dynamic_gnu2_32 (dest, x, pic));
> > > > >
> > > > >   tp = get_thread_pointer (Pmode, true);
> > > > > - dest = force_reg (Pmode, gen_rtx_PLUS (Pmode, tp, dest));
> > > > > +
> > > > > + /* NB: Since thread pointer is in ptr_mode, make sure that
> > > > > +PLUS i

Re: [PATCH] PR target/93319: x32: Add x32 support to -mtls-dialect=gnu2

2020-01-19 Thread H.J. Lu
On Sun, Jan 19, 2020 at 12:16 PM Uros Bizjak  wrote:
>
> On Sun, Jan 19, 2020 at 9:07 PM H.J. Lu  wrote:
> >
> > On Sun, Jan 19, 2020 at 12:01 PM Uros Bizjak  wrote:
> > >
> > > On Sun, Jan 19, 2020 at 7:07 PM H.J. Lu  wrote:
> > > >
> > > > On Sun, Jan 19, 2020 at 9:48 AM Uros Bizjak  wrote:
> > > > >
> > > > > On Sun, Jan 19, 2020 at 6:43 PM Uros Bizjak  wrote:
> > > > > >
> > > > > > On Sun, Jan 19, 2020 at 2:58 PM H.J. Lu  wrote:
> > > > > > >
> > > > > > > To add x32 support to -mtls-dialect=gnu2, we need to replace DI 
> > > > > > > with
> > > > > > > P in GNU2 TLS patterns.  Since thread pointer is in ptr_mode, 
> > > > > > > PLUS in
> > > > > > > GNU2 TLS address computation must be done in ptr_mode to support
> > > > > > > -maddress-mode=long.  Also drop the "q" suffix from lea to support
> > > > > > > both "lea foo@TLSDESC(%rip), %eax" and "foo@TLSDESC(%rip), %rax".
> > > > > >
> > > > > > Please use "lea%z0" instead.
> > > > > >
> > > > > > > Tested on Linux/x86-64.  OK for master?
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > H.J.
> > > > > > > ---
> > > > > > > gcc/
> > > > > > >
> > > > > > > PR target/93319
> > > > > > > * config/i386/i386.c (legitimize_tls_address): Pass Pmode 
> > > > > > > to
> > > > > > > gen_tls_dynamic_gnu2_64.  Compute GNU2 TLS address in 
> > > > > > > ptr_mode.
> > > > > > > * config/i386/i386.md (tls_dynamic_gnu2_64): Renamed to 
> > > > > > > ...
> > > > > > > (@tls_dynamic_gnu2_64_): This.  Replace DI with P.
> > > > > > > (*tls_dynamic_gnu2_lea_64): Renamed to ...
> > > > > > > (*tls_dynamic_gnu2_lea_64_): This.  Replace DI with 
> > > > > > > P.
> > > > > > > Remove the {q} suffix from lea.
> > > > > > > (*tls_dynamic_gnu2_call_64): Renamed to ...
> > > > > > > (*tls_dynamic_gnu2_call_64_): This.  Replace DI 
> > > > > > > with P.
> > > > > > > (*tls_dynamic_gnu2_combine_64): Renamed to ...
> > > > > > > (*tls_dynamic_gnu2_combine_64_): This.  Replace DI 
> > > > > > > with P.
> > > > > > > Pass Pmode to gen_tls_dynamic_gnu2_64.
> > > > > > >
> > > > > > > gcc/testsuite/
> > > > > > >
> > > > > > > PR target/93319
> > > > > > > * gcc.target/i386/pr93319-1a.c: New test.
> > > > > > > * gcc.target/i386/pr93319-1b.c: Likewise.
> > > > > > > * gcc.target/i386/pr93319-1c.c: Likewise.
> > > > > > > * gcc.target/i386/pr93319-1d.c: Likewise.
> > > > > > > ---
> > > > > > >  gcc/config/i386/i386.c | 31 +++--
> > > > > > >  gcc/config/i386/i386.md| 54 
> > > > > > > +++---
> > > > > > >  gcc/testsuite/gcc.target/i386/pr93319-1a.c | 24 ++
> > > > > > >  gcc/testsuite/gcc.target/i386/pr93319-1b.c |  7 +++
> > > > > > >  gcc/testsuite/gcc.target/i386/pr93319-1c.c |  7 +++
> > > > > > >  gcc/testsuite/gcc.target/i386/pr93319-1d.c |  7 +++
> > > > > > >  6 files changed, 99 insertions(+), 31 deletions(-)
> > > > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1a.c
> > > > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1b.c
> > > > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1c.c
> > > > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1d.c
> > > > > > >
> > > > > > > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> > > > > > > index 2c087a4a3e0..8c437dbe1f3 100644
> > > > > > > --- a/gcc/config/i386/i386.c

Re: [PATCH] Make target_clones resolver fn static.

2020-01-20 Thread H.J. Lu
On Mon, Jan 20, 2020 at 2:25 AM Richard Biener
 wrote:
>
> On Fri, Jan 17, 2020 at 10:25 AM Martin Liška  wrote:
> >
> > Hi.
> >
> > The patch removes need to have a gnu_indirect_function global
> > symbol. That aligns the code with what ppc64 target does.
> >
> > Patch can bootstrap on x86_64-linux-gnu and survives regression tests.
> >
> > Ready to be installed?
>
> Did you verify the result actually works?  I'm not sure we have any runtime 
> test
> coverage for the feature and non-public functions and you don't add a testcase
> either.  Maybe there's interesting coverage in the binutils or glibc testsuite
> (though both might not use the compilers ifunc feature...).
>
> The patch also suspiciously lacks removal of actually making the resolver
> TREE_PUBLIC if the default implementation was not ... so I wonder whether
> you verified that the resolver _is_ indeed local.
>
> HJ, do you know anything about this requirement?  It's that way since
> the original contribution of multi-versioning by Google...

We can do that only if the function is static:

[hjl@gnu-cfl-2 tmp]$ cat x.c
__attribute__((target_clones("avx","default")))
int
foo ()
{
  return -2;
}
[hjl@gnu-cfl-2 tmp]$ gcc -S -O2 x.c
[hjl@gnu-cfl-2 tmp]$ cat x.s
	.file	"x.c"
	.text
	.p2align 4
	.type	foo.default.1, @function
foo.default.1:
.LFB0:
	.cfi_startproc
	movl	$-2, %eax
	ret
	.cfi_endproc
.LFE0:
	.size	foo.default.1, .-foo.default.1
	.p2align 4
	.type	foo.avx.0, @function
foo.avx.0:
.LFB1:
	.cfi_startproc
	movl	$-2, %eax
	ret
	.cfi_endproc
.LFE1:
	.size	foo.avx.0, .-foo.avx.0
	.section	.text.foo.resolver,"axG",@progbits,foo.resolver,comdat
	.p2align 4
	.weak	foo.resolver
	.type	foo.resolver, @function
foo.resolver:
.LFB3:
	.cfi_startproc
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	call	__cpu_indicator_init
	movl	$foo.default.1, %eax
	movl	$foo.avx.0, %edx
	testb	$2, __cpu_model+13(%rip)
	cmovne	%rdx, %rax
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	ret
	.cfi_endproc
.LFE3:
	.size	foo.resolver, .-foo.resolver
	.globl	foo
	.type	foo, @gnu_indirect_function
	.set	foo,foo.resolver
	.ident	"GCC: (GNU) 9.2.1 20191120 (Red Hat 9.2.1-2)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-cfl-2 tmp]$

In this case, foo must be global.

> Richard.
>
> > Thanks,
> > Martin
> >
> > gcc/ChangeLog:
> >
> > 2020-01-17  Martin Liska  
> >
> > PR target/93274
> > * config/i386/i386-features.c (make_resolver_func):
> > Align the code with ppc64 target implementation.
> > We do not need to have gnu_indirect_function
> > as a global function.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 2020-01-17  Martin Liska  
> >
> > PR target/93274
> > * gcc.target/i386/pr81213.c: Adjust to not expect
> > a global unique name.
> > ---
> >   gcc/config/i386/i386-features.c | 20 +---
> >   gcc/testsuite/gcc.target/i386/pr81213.c |  4 ++--
> >   2 files changed, 7 insertions(+), 17 deletions(-)
> >
> >



-- 
H.J.


Re: [PATCH] PR target/93319: x32: Add x32 support to -mtls-dialect=gnu2

2020-01-20 Thread H.J. Lu
On Sun, Jan 19, 2020 at 11:53 PM Uros Bizjak  wrote:
>
> On Sun, Jan 19, 2020 at 10:00 PM H.J. Lu  wrote:
> >
> > On Sun, Jan 19, 2020 at 12:16 PM Uros Bizjak  wrote:
> > >
> > > On Sun, Jan 19, 2020 at 9:07 PM H.J. Lu  wrote:
> > > >
> > > > On Sun, Jan 19, 2020 at 12:01 PM Uros Bizjak  wrote:
> > > > >
> > > > > On Sun, Jan 19, 2020 at 7:07 PM H.J. Lu  wrote:
> > > > > >
> > > > > > On Sun, Jan 19, 2020 at 9:48 AM Uros Bizjak  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Sun, Jan 19, 2020 at 6:43 PM Uros Bizjak  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Sun, Jan 19, 2020 at 2:58 PM H.J. Lu  
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > To add x32 support to -mtls-dialect=gnu2, we need to replace 
> > > > > > > > > DI with
> > > > > > > > > P in GNU2 TLS patterns.  Since thread pointer is in ptr_mode, 
> > > > > > > > > PLUS in
> > > > > > > > > GNU2 TLS address computation must be done in ptr_mode to 
> > > > > > > > > support
> > > > > > > > > -maddress-mode=long.  Also drop the "q" suffix from lea to 
> > > > > > > > > support
> > > > > > > > > both "lea foo@TLSDESC(%rip), %eax" and "foo@TLSDESC(%rip), 
> > > > > > > > > %rax".
> > > > > > > >
> > > > > > > > Please use "lea%z0" instead.
> > > > > > > >
> > > > > > > > > Tested on Linux/x86-64.  OK for master?
> > > > > > > > >
> > > > > > > > > Thanks.
> > > > > > > > >
> > > > > > > > > H.J.
> > > > > > > > > ---
> > > > > > > > > gcc/
> > > > > > > > >
> > > > > > > > > PR target/93319
> > > > > > > > > * config/i386/i386.c (legitimize_tls_address): Pass 
> > > > > > > > > Pmode to
> > > > > > > > > gen_tls_dynamic_gnu2_64.  Compute GNU2 TLS address in 
> > > > > > > > > ptr_mode.
> > > > > > > > > * config/i386/i386.md (tls_dynamic_gnu2_64): Renamed 
> > > > > > > > > to ...
> > > > > > > > > (@tls_dynamic_gnu2_64_): This.  Replace DI with 
> > > > > > > > > P.
> > > > > > > > > (*tls_dynamic_gnu2_lea_64): Renamed to ...
> > > > > > > > > (*tls_dynamic_gnu2_lea_64_): This.  Replace DI 
> > > > > > > > > with P.
> > > > > > > > > Remove the {q} suffix from lea.
> > > > > > > > > (*tls_dynamic_gnu2_call_64): Renamed to ...
> > > > > > > > > (*tls_dynamic_gnu2_call_64_): This.  Replace DI 
> > > > > > > > > with P.
> > > > > > > > > (*tls_dynamic_gnu2_combine_64): Renamed to ...
> > > > > > > > > (*tls_dynamic_gnu2_combine_64_): This.  Replace 
> > > > > > > > > DI with P.
> > > > > > > > > Pass Pmode to gen_tls_dynamic_gnu2_64.
> > > > > > > > >
> > > > > > > > > gcc/testsuite/
> > > > > > > > >
> > > > > > > > > PR target/93319
> > > > > > > > > * gcc.target/i386/pr93319-1a.c: New test.
> > > > > > > > > * gcc.target/i386/pr93319-1b.c: Likewise.
> > > > > > > > > * gcc.target/i386/pr93319-1c.c: Likewise.
> > > > > > > > > * gcc.target/i386/pr93319-1d.c: Likewise.
> > > > > > > > > ---
> > > > > > > > >  gcc/config/i386/i386.c | 31 +++--
> > > > > > > > >  gcc/config/i386/i386.md| 54 
> > > > > > > > > +++---
> > > > > > > > >

Re: [PATCH] Make target_clones resolver fn static.

2020-01-20 Thread H.J. Lu
On Mon, Jan 20, 2020 at 5:36 AM Alexander Monakov  wrote:
>
>
>
> On Mon, 20 Jan 2020, H.J. Lu wrote:
> > We can do that only if the function is static:
> >
> [snip asm]
> >
> > In this case, foo must be global.
>
> H.J., can you rephrase more clearly? Your response seems contradictory and
> does not help to explain the matter.
>
> Alexander

For,

---
__attribute__((target_clones("avx","default")))
int
foo ()
{
  return -2;
}


foo's resolver must be global.  For

---
__attribute__((target_clones("avx","default")))
static int
foo ()
{
  return -2;
}
---

foo's resolver must be static.
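
For reference, the static case corresponds roughly to the following
hand-written ifunc, where both foo and its resolver can stay local to
the translation unit (a sketch only; the names and the AVX check are
mine, not what the compiler actually generates):

---
typedef int fn_t (void);

static int foo_avx (void) { return -2; }
static int foo_default (void) { return -2; }

/* Pick the clone at load time.  */
static fn_t *
foo_resolver (void)
{
  __builtin_cpu_init ();
  return __builtin_cpu_supports ("avx") ? foo_avx : foo_default;
}

/* Local ifunc symbol dispatched through the local resolver.  */
static int foo (void) __attribute__ ((ifunc ("foo_resolver")));

int
use_foo (void)
{
  return foo ();
}
---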

-- 
H.J.


Re: [PATCH] Make target_clones resolver fn static.

2020-01-20 Thread H.J. Lu
On Mon, Jan 20, 2020 at 6:16 AM Alexander Monakov  wrote:
>
>
>
> On Mon, 20 Jan 2020, H.J. Lu wrote:
> > For,
> >
> > ---
> > __attribute__((target_clones("avx","default")))
> > int
> > foo ()
> > {
> >   return -2;
> > }
> > 
> >
> > foo's resolver must be global.  For
> >
> > ---
> > __attribute__((target_clones("avx","default")))
> > static int
> > foo ()
> > {
> >   return -2;
> > }
> > ---
> >
> > foo's resolver must be static.
>
> Bare IFUNC's don't seem to have this restriction. Why do we want to
> constrain target clones this way?
>

foo's resolver acts as foo.  It should have the same visibility as foo.


-- 
H.J.


Re: [PATCH] Make target_clones resolver fn static.

2020-01-20 Thread H.J. Lu
On Mon, Jan 20, 2020 at 6:41 AM Alexander Monakov  wrote:
>
>
>
> On Mon, 20 Jan 2020, H.J. Lu wrote:
>
> > > Bare IFUNC's don't seem to have this restriction. Why do we want to
> > > constrain target clones this way?
> > >
> >
> > foo's resolver acts as foo.  It should have the same visibility as foo.
>
> What do you mean by that? From the implementation standpoint, there's
> two symbols of different type with the same value. There's no problem
> allowing one of them have local binding and the other have global binding.
>
> Is there something special about target clones that doesn't come into
> play with ifuncs?
>

I stand corrected.  The resolver should be static and it shouldn't be weak.


-- 
H.J.


Re: [PATCH] PR target/93319: x32: Add x32 support to -mtls-dialect=gnu2

2020-01-20 Thread H.J. Lu
On Mon, Jan 20, 2020 at 5:24 AM H.J. Lu  wrote:
>
> On Sun, Jan 19, 2020 at 11:53 PM Uros Bizjak  wrote:
> >
> > On Sun, Jan 19, 2020 at 10:00 PM H.J. Lu  wrote:
> > >
> > > On Sun, Jan 19, 2020 at 12:16 PM Uros Bizjak  wrote:
> > > >
> > > > On Sun, Jan 19, 2020 at 9:07 PM H.J. Lu  wrote:
> > > > >
> > > > > On Sun, Jan 19, 2020 at 12:01 PM Uros Bizjak  
> > > > > wrote:
> > > > > >
> > > > > > On Sun, Jan 19, 2020 at 7:07 PM H.J. Lu  wrote:
> > > > > > >
> > > > > > > On Sun, Jan 19, 2020 at 9:48 AM Uros Bizjak  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Sun, Jan 19, 2020 at 6:43 PM Uros Bizjak  
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Sun, Jan 19, 2020 at 2:58 PM H.J. Lu  
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > To add x32 support to -mtls-dialect=gnu2, we need to 
> > > > > > > > > > replace DI with
> > > > > > > > > > P in GNU2 TLS patterns.  Since thread pointer is in 
> > > > > > > > > > ptr_mode, PLUS in
> > > > > > > > > > GNU2 TLS address computation must be done in ptr_mode to 
> > > > > > > > > > support
> > > > > > > > > > -maddress-mode=long.  Also drop the "q" suffix from lea to 
> > > > > > > > > > support
> > > > > > > > > > both "lea foo@TLSDESC(%rip), %eax" and "foo@TLSDESC(%rip), 
> > > > > > > > > > %rax".
> > > > > > > > >
> > > > > > > > > Please use "lea%z0" instead.
> > > > > > > > >
> > > > > > > > > > Tested on Linux/x86-64.  OK for master?
> > > > > > > > > >
> > > > > > > > > > Thanks.
> > > > > > > > > >
> > > > > > > > > > H.J.
> > > > > > > > > > ---
> > > > > > > > > > gcc/
> > > > > > > > > >
> > > > > > > > > > PR target/93319
> > > > > > > > > > * config/i386/i386.c (legitimize_tls_address): Pass 
> > > > > > > > > > Pmode to
> > > > > > > > > > gen_tls_dynamic_gnu2_64.  Compute GNU2 TLS address 
> > > > > > > > > > in ptr_mode.
> > > > > > > > > > * config/i386/i386.md (tls_dynamic_gnu2_64): 
> > > > > > > > > > Renamed to ...
> > > > > > > > > > (@tls_dynamic_gnu2_64_): This.  Replace DI 
> > > > > > > > > > with P.
> > > > > > > > > > (*tls_dynamic_gnu2_lea_64): Renamed to ...
> > > > > > > > > > (*tls_dynamic_gnu2_lea_64_): This.  Replace 
> > > > > > > > > > DI with P.
> > > > > > > > > > Remove the {q} suffix from lea.
> > > > > > > > > > (*tls_dynamic_gnu2_call_64): Renamed to ...
> > > > > > > > > > (*tls_dynamic_gnu2_call_64_): This.  Replace 
> > > > > > > > > > DI with P.
> > > > > > > > > > (*tls_dynamic_gnu2_combine_64): Renamed to ...
> > > > > > > > > > (*tls_dynamic_gnu2_combine_64_): This.  
> > > > > > > > > > Replace DI with P.
> > > > > > > > > > Pass Pmode to gen_tls_dynamic_gnu2_64.
> > > > > > > > > >
> > > > > > > > > > gcc/testsuite/
> > > > > > > > > >
> > > > > > > > > > PR target/93319
> > > > > > > > > > * gcc.target/i386/pr93319-1a.c: New test.
> > > > > > > > > > * gcc.target/i386/pr93319-1b.c: Likewise.
> > > > > > > > > > * gcc.target/i386/pr93319-1c.c: Likewise.
> > > > > > > > > >

Re: [PATCH] PR target/93319: x32: Add x32 support to -mtls-dialect=gnu2

2020-01-21 Thread H.J. Lu
On Tue, Jan 21, 2020 at 2:29 AM Uros Bizjak  wrote:
>
> On Tue, Jan 21, 2020 at 9:47 AM Uros Bizjak  wrote:
> >
> > On Mon, Jan 20, 2020 at 10:46 PM H.J. Lu  wrote:
> >
> > > > > OK. Let's go with this version, but please investigate if we need to
> > > > > calculate TLS address in ptr_mode instead of Pmode. Due to quite some
> > > > > zero-extension from ptr_mode to Pmode hacks in this area, it looks to
> > > > > me that the whole calculation should be performed in ptr_mode (SImode
> > > > > in case of x32), and the result zero-extended to Pmode in case when
> > > > > Pmode = DImode.
> > > > >
> > > >
> > > > I checked it in.  I will investigate if we can use ptr_mode for TLS.
> > >
> > > Here is a patch to perform GNU2 TLS address computation in ptr_mode
> > > and zero-extend result to Pmode.
> >
> >  case TLS_MODEL_GLOBAL_DYNAMIC:
> > -  dest = gen_reg_rtx (Pmode);
> > +  dest = gen_reg_rtx (TARGET_GNU2_TLS ? ptr_mode : Pmode);
> >
> > Please put these in their respective arms of "if (TARGET_GNU2_TLS).
> >
> >  case TLS_MODEL_LOCAL_DYNAMIC:
> > -  base = gen_reg_rtx (Pmode);
> > +  base = gen_reg_rtx (TARGET_GNU2_TLS ? ptr_mode : Pmode);
> >
> > Also here.
> >
> > A question: Do we need to emit the following part in Pmode?
>
> To answer my own question: Yes. Linker doesn't like SImode relocs for
> x86_64 and for
>
> addl    $foo@dtpoff, %eax
>
> errors out with:
>
> pr93319-1a.s: Assembler messages:
> pr93319-1a.s:20: Error: relocated field and relocation type differ in 
> signedness
>
> So, the part below is OK, except:
>
> -  tp = get_thread_pointer (Pmode, true);
> -  set_unique_reg_note (get_last_insn (), REG_EQUAL,
> -   gen_rtx_MINUS (Pmode, tmp, tp));
> +  tp = get_thread_pointer (ptr_mode, true);
> +  tmp = gen_rtx_MINUS (ptr_mode, tmp, tp);
> +  if (GET_MODE (tmp) != Pmode)
> +tmp = gen_rtx_ZERO_EXTEND (Pmode, tmp);
> +  set_unique_reg_note (get_last_insn (), REG_EQUAL, tmp);
>
> I don't think we should attach this note to the thread pointer
> initialization. I have removed this part from the patch, but please
> review the decision.
>
> and
>
> -dest = gen_rtx_PLUS (Pmode, tp, dest);
> +dest = gen_rtx_PLUS (ptr_mode, tp, dest);
>
> Please leave Pmode here. ptr_mode == Pmode at this point, but Pmode
> better documents the mode selection logic.
>
> Also, the tests fail for me with:
>
> /usr/include/gnu/stubs.h:13:11: fatal error: gnu/stubs-x32.h: No such
> file or directory
>
> so either use __builtin_printf or some other approach that doesn't
> need to #include <stdio.h>.
>
> A patch that implements above changes is attached to the message.
>

Here is the updated patch.  OK for master?

Thanks.

-- 
H.J.
From 01b20630518882fa3952962b26bfbb2465e08036 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Mon, 20 Jan 2020 13:30:04 -0800
Subject: [PATCH] i386: Do GNU2 TLS address computation in ptr_mode

Since GNU2 TLS address from glibc run-time is in ptr_mode, we should do
GNU2 TLS address computation in ptr_mode and zero-extend result to Pmode.
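
The kind of access this is about looks roughly like the following
(a minimal sketch, not the committed pr93319 testcase; assumed to be
built as a shared object with something like
-mx32 -maddress-mode=long -mtls-dialect=gnu2 -fPIC):

---
/* Global-dynamic TLS access that goes through a TLS descriptor call
   under -mtls-dialect=gnu2.  Names are made up.  */
extern __thread int tls_counter;

int *
tls_counter_addr (void)
{
  return &tls_counter;
}
---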

2020-01-21  H.J. Lu  
	Uros Bizjak

gcc/

	PR target/93319
	* config/i386/i386.c (ix86_tls_module_base): Replace Pmode
	with ptr_mode.
	(legitimize_tls_address): Do GNU2 TLS address computation in
	ptr_mode and zero-extend result to Pmode.
	*  config/i386/i386.md (@tls_dynamic_gnu2_64_): Replace
	:P with :PTR and Pmode with ptr_mode.
	(*tls_dynamic_gnu2_lea_64_): Likewise.
	(*tls_dynamic_gnu2_call_64_): Likewise.
	(*tls_dynamic_gnu2_combine_64_): Likewise.

gcc/testsuite/

2020-01-21  Uros Bizjak

	PR target/93319
	* gcc.target/i386/pr93319-1a.c: Don't include <stdio.h>.
	(test1): Replace printf with __builtin_printf.
---
 gcc/config/i386/i386.c | 43 ---
 gcc/config/i386/i386.md| 48 +++---
 gcc/testsuite/gcc.target/i386/pr93319-1a.c |  6 +--
 3 files changed, 42 insertions(+), 55 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 0b8a4b9ee4f..ffe60baa72a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -10717,7 +10717,7 @@ ix86_tls_module_base (void)
   if (!ix86_tls_module_base_symbol)
 {
   ix86_tls_module_base_symbol
-	= gen_rtx_SYMBOL_REF (Pmode, "_TLS_MODULE_BASE_");
+	= gen_rtx_SYMBOL_REF (ptr_mode, "_TLS_MODULE_BASE_");
 
   SYMBOL_REF_FLAGS (ix86_tls_module_base_symbol)
 	|= TLS_MODEL_GLOBAL_DYNAMIC << SYMBOL_FLAG_TLS_SHIFT;
@@ -10748,8 +10748,6 @@ legitimize_tls_address (r

[PATCH] i386: Don't use ix86_tune_ctrl_string in parse_mtune_ctrl_str

2020-01-27 Thread H.J. Lu
There is

static void
parse_mtune_ctrl_str (bool dump)
{
  if (!ix86_tune_ctrl_string)
    return;

parse_mtune_ctrl_str is only called from set_ix86_tune_features, which
is only called from ix86_function_specific_restore and
ix86_option_override_internal.  parse_mtune_ctrl_str shouldn't use
ix86_tune_ctrl_string which is defined with global_options.  Instead,
opts should be passed to parse_mtune_ctrl_str.
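
As an illustration of where the global goes stale, per-function target
options are restored through ix86_function_specific_restore, e.g. for
something like the following (the attribute string is only an example):

---
/* Restoring the options of tuned_foo reaches set_ix86_tune_features,
   so the tuning control string must come from the opts being restored,
   not from global_options.  */
__attribute__ ((target ("tune=znver2")))
int
tuned_foo (int x)
{
  return x * 3;
}

int
plain_foo (int x)
{
  return x + 3;
}
---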

PR target/91399
* config/i386/i386-options.c (set_ix86_tune_features): Add an
argument of a pointer to struct gcc_options and pass it to
parse_mtune_ctrl_str.
(ix86_function_specific_restore): Pass opts to
set_ix86_tune_features.
(ix86_option_override_internal): Likewise.
(parse_mtune_ctrl_str): Add an argument of a pointer to struct
gcc_options and use it for x_ix86_tune_ctrl_string.
---
 gcc/config/i386/i386-options.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index 2acc9fb0cfe..e0be4932534 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -741,7 +741,8 @@ ix86_option_override_internal (bool main_args_p,
   struct gcc_options *opts,
   struct gcc_options *opts_set);
 static void
-set_ix86_tune_features (enum processor_type ix86_tune, bool dump);
+set_ix86_tune_features (struct gcc_options *opts,
+   enum processor_type ix86_tune, bool dump);
 
 /* Restore the current options */
 
@@ -810,7 +811,7 @@ ix86_function_specific_restore (struct gcc_options *opts,
 
   /* Recreate the tune optimization tests */
   if (old_tune != ix86_tune)
-set_ix86_tune_features (ix86_tune, false);
+set_ix86_tune_features (opts, ix86_tune, false);
 }
 
 /* Adjust target options after streaming them in.  This is mainly about
@@ -1538,13 +1539,13 @@ ix86_parse_stringop_strategy_string (char 
*strategy_str, bool is_memset)
print the features that are explicitly set.  */
 
 static void
-parse_mtune_ctrl_str (bool dump)
+parse_mtune_ctrl_str (struct gcc_options *opts, bool dump)
 {
-  if (!ix86_tune_ctrl_string)
+  if (!opts->x_ix86_tune_ctrl_string)
 return;
 
   char *next_feature_string = NULL;
-  char *curr_feature_string = xstrdup (ix86_tune_ctrl_string);
+  char *curr_feature_string = xstrdup (opts->x_ix86_tune_ctrl_string);
   char *orig = curr_feature_string;
   int i;
   do
@@ -1583,7 +1584,8 @@ parse_mtune_ctrl_str (bool dump)
processor type.  */
 
 static void
-set_ix86_tune_features (enum processor_type ix86_tune, bool dump)
+set_ix86_tune_features (struct gcc_options *opts,
+   enum processor_type ix86_tune, bool dump)
 {
   unsigned HOST_WIDE_INT ix86_tune_mask = HOST_WIDE_INT_1U << ix86_tune;
   int i;
@@ -1605,7 +1607,7 @@ set_ix86_tune_features (enum processor_type ix86_tune, 
bool dump)
  ix86_tune_features[i] ? "on" : "off");
 }
 
-  parse_mtune_ctrl_str (dump);
+  parse_mtune_ctrl_str (opts, dump);
 }
 
 
@@ -2364,7 +2366,7 @@ ix86_option_override_internal (bool main_args_p,
   XDELETEVEC (s);
 }
 
-  set_ix86_tune_features (ix86_tune, opts->x_ix86_dump_tunes);
+  set_ix86_tune_features (opts, ix86_tune, opts->x_ix86_dump_tunes);
 
   ix86_recompute_optlev_based_flags (opts, opts_set);
 
-- 
2.24.1



[PATCH] i386: Disable TARGET_SSE_TYPELESS_STORES for TARGET_AVX

2020-01-27 Thread H.J. Lu
movaps/movups is one byte shorter than movdaq/movdqu.  But it isn't the
case for AVX nor AVX512.  We should disable TARGET_SSE_TYPELESS_STORES
for TARGET_AVX.
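
For reference, the size difference can be seen on a simple 128-bit
aligned store; the encodings below are my own annotation for a store
to (%rdi), not something taken from the patch:

---
/* movaps  %xmm0,(%rdi)  -> 0f 29 07       3 bytes  (SSE)
   movdqa  %xmm0,(%rdi)  -> 66 0f 7f 07    4 bytes  (SSE)
   vmovaps %xmm0,(%rdi)  -> c5 f8 29 07    4 bytes  (AVX)
   vmovdqa %xmm0,(%rdi)  -> c5 f9 7f 07    4 bytes  (AVX)
   With VEX encoding both forms have the same length, so the typeless
   store preference buys nothing once AVX is enabled.  */
#include <immintrin.h>

void
store128 (__m128i *p, __m128i x)
{
  _mm_store_si128 (p, x);
}
---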

gcc/

PR target/91461
* config/i386/i386.h (TARGET_SSE_TYPELESS_STORES): Disable for
TARGET_AVX.
* config/i386/i386.md (*movoi_internal_avx): Remove
TARGET_SSE_TYPELESS_STORES check.

gcc/testsuite/

PR target/91461
* gcc.target/i386/pr91461-1.c: New test.
* gcc.target/i386/pr91461-2.c: Likewise.
* gcc.target/i386/pr91461-3.c: Likewise.
* gcc.target/i386/pr91461-4.c: Likewise.
* gcc.target/i386/pr91461-5.c: Likewise.
---
 gcc/config/i386/i386.h|  4 +-
 gcc/config/i386/i386.md   |  4 +-
 gcc/testsuite/gcc.target/i386/pr91461-1.c | 66 
 gcc/testsuite/gcc.target/i386/pr91461-2.c | 19 ++
 gcc/testsuite/gcc.target/i386/pr91461-3.c | 76 +++
 gcc/testsuite/gcc.target/i386/pr91461-4.c | 21 +++
 gcc/testsuite/gcc.target/i386/pr91461-5.c | 17 +
 7 files changed, 203 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-5.c

diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 943e9a5c783..c134b04c5c4 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -516,8 +516,10 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 #define TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL \
ix86_tune_features[X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL]
 #define TARGET_SSE_SPLIT_REGS  ix86_tune_features[X86_TUNE_SSE_SPLIT_REGS]
+/* NB: movaps/movups is one byte shorter than movdaq/movdqu.  But it
+   isn't the case for AVX nor AVX512.  */
 #define TARGET_SSE_TYPELESS_STORES \
-   ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES]
+   (!TARGET_AVX && ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES])
 #define TARGET_SSE_LOAD0_BY_PXOR ix86_tune_features[X86_TUNE_SSE_LOAD0_BY_PXOR]
 #define TARGET_MEMORY_MISMATCH_STALL \
ix86_tune_features[X86_TUNE_MEMORY_MISMATCH_STALL]
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 6e9c9bd2fb6..bb096133880 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -1980,9 +1980,7 @@
   (and (eq_attr "alternative" "1")
(match_test "TARGET_AVX512VL"))
 (const_string "XI")
-  (ior (match_test "TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL")
-   (and (eq_attr "alternative" "3")
-(match_test "TARGET_SSE_TYPELESS_STORES")))
+  (match_test "TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL")
 (const_string "V8SF")
  ]
  (const_string "OI")))])
diff --git a/gcc/testsuite/gcc.target/i386/pr91461-1.c 
b/gcc/testsuite/gcc.target/i386/pr91461-1.c
new file mode 100644
index 000..0c94b8e2b76
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr91461-1.c
@@ -0,0 +1,66 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx" } */
+/* { dg-final { scan-assembler "\tvmovdqa\t" } } */
+/* { dg-final { scan-assembler "\tvmovdqu\t" } } */
+/* { dg-final { scan-assembler "\tvmovapd\t" } } */
+/* { dg-final { scan-assembler "\tvmovupd\t" } } */
+/* { dg-final { scan-assembler-not "\tvmovaps\t" } } */
+/* { dg-final { scan-assembler-not "\tvmovups\t" } } */
+
+#include 
+
+void
+foo1 (__m128i *p, __m128i x)
+{
+  *p = x;
+}
+
+void
+foo2 (__m128d *p, __m128d x)
+{
+  *p = x;
+}
+
+void
+foo3 (__float128 *p, __float128 x)
+{
+  *p = x;
+}
+
+void
+foo4 (__m128i_u *p, __m128i x)
+{
+  *p = x;
+}
+
+void
+foo5 (__m128d_u *p, __m128d x)
+{
+  *p = x;
+}
+
+typedef __float128 __float128_u __attribute__ ((__aligned__ (1)));
+
+void
+foo6 (__float128_u *p, __float128 x)
+{
+  *p = x;
+}
+
+#ifdef __x86_64__
+typedef __int128 __int128_u __attribute__ ((__aligned__ (1)));
+
+extern __int128 int128;
+
+void
+foo7 (__int128 *p)
+{
+  *p = int128;
+}
+
+void
+foo8 (__int128_u *p)
+{
+  *p = int128;
+}
+#endif
diff --git a/gcc/testsuite/gcc.target/i386/pr91461-2.c 
b/gcc/testsuite/gcc.target/i386/pr91461-2.c
new file mode 100644
index 000..921cfaf9780
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr91461-2.c
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx" } */
+/* { dg-final { scan-assembler "\tvmovdqa\t" } } */
+/* { dg-final { scan-assembler "\tvmovapd\t" } } */
+/* { dg-final { scan-assembler-not "\tvmovaps\t" } } */
+
+#include 
+
+void
+foo1 (__m256i *p, __m256i x)
+{
+  *p = x;
+}
+
+void
+foo2 (__m256d *p, __m256d x)
+{
+  *p = x;
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr91461-3.c 
b/gcc/testsuite/gcc.target/i386/pr91461-3.c
new file mode 100644
index 000..c67a4

PING^5: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move

2020-01-27 Thread H.J. Lu
On Mon, Jul 8, 2019 at 8:19 AM H.J. Lu  wrote:
>
> On Tue, Jun 18, 2019 at 8:59 AM H.J. Lu  wrote:
> >
> > On Fri, May 31, 2019 at 10:38 AM H.J. Lu  wrote:
> > >
> > > On Tue, May 21, 2019 at 2:43 PM H.J. Lu  wrote:
> > > >
> > > > On Fri, Feb 22, 2019 at 8:25 AM H.J. Lu  wrote:
> > > > >
> > > > > Hi Jan, Uros,
> > > > >
> > > > > This patch fixes the wrong code bug:
> > > > >
> > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89229
> > > > >
> > > > > Tested on AVX2 and AVX512 with and without --with-arch=native.
> > > > >
> > > > > OK for trunk?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > H.J.
> > > > > --
> > > > > i386 backend has
> > > > >
> > > > > INT_MODE (OI, 32);
> > > > > INT_MODE (XI, 64);
> > > > >
> > > > > So, XI_MODE represents 64 INTEGER bytes = 64 * 8 = 512 bit operation,
> > > > > in case of const_1, all 512 bits set.
> > > > >
> > > > > We can load zeros with narrower instruction, (e.g. 256 bit by inherent
> > > > > zeroing of highpart in case of 128 bit xor), so TImode in this case.
> > > > >
> > > > > Some targets prefer V4SF mode, so they will emit float xorps for 
> > > > > zeroing.
> > > > >
> > > > > sse.md has
> > > > >
> > > > > (define_insn "mov_internal"
> > > > >   [(set (match_operand:VMOVE 0 "nonimmediate_operand"
> > > > >  "=v,v ,v ,m")
> > > > > (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand"
> > > > >  " C,BC,vm,v"))]
> > > > > 
> > > > >   /* There is no evex-encoded vmov* for sizes smaller than 
> > > > > 64-bytes
> > > > >  in avx512f, so we need to use workarounds, to access sse 
> > > > > registers
> > > > >  16-31, which are evex-only. In avx512vl we don't need 
> > > > > workarounds.  */
> > > > >   if (TARGET_AVX512F &&  < 64 && !TARGET_AVX512VL
> > > > >   && (EXT_REX_SSE_REG_P (operands[0])
> > > > >   || EXT_REX_SSE_REG_P (operands[1])))
> > > > > {
> > > > >   if (memory_operand (operands[0], mode))
> > > > > {
> > > > >   if ( == 32)
> > > > > return "vextract64x4\t{$0x0, %g1, %0|%0, 
> > > > > %g1, 0x0}";
> > > > >   else if ( == 16)
> > > > > return "vextract32x4\t{$0x0, %g1, %0|%0, 
> > > > > %g1, 0x0}";
> > > > >   else
> > > > > gcc_unreachable ();
> > > > > }
> > > > > ...
> > > > >
> > > > > However, since ix86_hard_regno_mode_ok has
> > > > >
> > > > >  /* TODO check for QI/HI scalars.  */
> > > > >   /* AVX512VL allows sse regs16+ for 128/256 bit modes.  */
> > > > >   if (TARGET_AVX512VL
> > > > >   && (mode == OImode
> > > > >   || mode == TImode
> > > > >   || VALID_AVX256_REG_MODE (mode)
> > > > >   || VALID_AVX512VL_128_REG_MODE (mode)))
> > > > > return true;
> > > > >
> > > > >   /* xmm16-xmm31 are only available for AVX-512.  */
> > > > >   if (EXT_REX_SSE_REGNO_P (regno))
> > > > > return false;
> > > > >
> > > > >   if (TARGET_AVX512F &&  < 64 && !TARGET_AVX512VL
> > > > >   && (EXT_REX_SSE_REG_P (operands[0])
> > > > >   || EXT_REX_SSE_REG_P (operands[1])))
> > > > >
> > > > > is a dead code.
> > > > >
> > > > > Also for
> > > > >
> > > > > long long *p;
> > > > > volatile __m256i yy;
> > > > >
> > > > > void
> > > > > foo (void)
> > > > > {
> > > > >_mm256_store_epi64 (p, yy);
> > > > &

Re: [PATCH] i386: Disable TARGET_SSE_TYPELESS_STORES for TARGET_AVX

2020-01-27 Thread H.J. Lu
On Mon, Jan 27, 2020 at 12:26 PM Uros Bizjak  wrote:
>
> On Mon, Jan 27, 2020 at 7:23 PM H.J. Lu  wrote:
> >
> > movaps/movups is one byte shorter than movdaq/movdqu.  But it isn't the
> > case for AVX nor AVX512.  We should disable TARGET_SSE_TYPELESS_STORES
> > for TARGET_AVX.
> >
> > gcc/
> >
> > PR target/91461
> > * config/i386/i386.h (TARGET_SSE_TYPELESS_STORES): Disable for
> > TARGET_AVX.
> > * config/i386/i386.md (*movoi_internal_avx): Remove
> > TARGET_SSE_TYPELESS_STORES check.
> >
> > gcc/testsuite/
> >
> > PR target/91461
> > * gcc.target/i386/pr91461-1.c: New test.
> > * gcc.target/i386/pr91461-2.c: Likewise.
> > * gcc.target/i386/pr91461-3.c: Likewise.
> > * gcc.target/i386/pr91461-4.c: Likewise.
> > * gcc.target/i386/pr91461-5.c: Likewise.
> > ---
> >  gcc/config/i386/i386.h|  4 +-
> >  gcc/config/i386/i386.md   |  4 +-
> >  gcc/testsuite/gcc.target/i386/pr91461-1.c | 66 
> >  gcc/testsuite/gcc.target/i386/pr91461-2.c | 19 ++
> >  gcc/testsuite/gcc.target/i386/pr91461-3.c | 76 +++
> >  gcc/testsuite/gcc.target/i386/pr91461-4.c | 21 +++
> >  gcc/testsuite/gcc.target/i386/pr91461-5.c | 17 +
> >  7 files changed, 203 insertions(+), 4 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-3.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-4.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-5.c
> >
> > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > index 943e9a5c783..c134b04c5c4 100644
> > --- a/gcc/config/i386/i386.h
> > +++ b/gcc/config/i386/i386.h
> > @@ -516,8 +516,10 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
> >  #define TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL \
> > ix86_tune_features[X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL]
> >  #define TARGET_SSE_SPLIT_REGS  ix86_tune_features[X86_TUNE_SSE_SPLIT_REGS]
> > +/* NB: movaps/movups is one byte shorter than movdaq/movdqu.  But it
> > +   isn't the case for AVX nor AVX512.  */
> >  #define TARGET_SSE_TYPELESS_STORES \
> > -   ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES]
> > +   (!TARGET_AVX && ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES])
>
> This is wrong place to disable the feature.

Like this?

diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index 2acc9fb0cfe..639969d736d 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -1597,6 +1597,11 @@ set_ix86_tune_features (enum processor_type
ix86_tune, bool dump)
 = !!(initial_ix86_tune_features[i] & ix86_tune_mask);
 }

+  /* NB: movaps/movups is one byte shorter than movdaq/movdqu.  But it
+ isn't the case for AVX nor AVX512.  */
+  if (TARGET_AVX)
+ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES] = 0;
+
   if (dump)
 {
   fprintf (stderr, "List of x86 specific tuning parameter names:\n");


-- 
H.J.


Re: [PATCH] i386: Disable TARGET_SSE_TYPELESS_STORES for TARGET_AVX

2020-01-27 Thread H.J. Lu
On Mon, Jan 27, 2020 at 2:17 PM H.J. Lu  wrote:
>
> On Mon, Jan 27, 2020 at 12:26 PM Uros Bizjak  wrote:
> >
> > On Mon, Jan 27, 2020 at 7:23 PM H.J. Lu  wrote:
> > >
> > > movaps/movups is one byte shorter than movdaq/movdqu.  But it isn't the
> > > case for AVX nor AVX512.  We should disable TARGET_SSE_TYPELESS_STORES
> > > for TARGET_AVX.
> > >
> > > gcc/
> > >
> > > PR target/91461
> > > * config/i386/i386.h (TARGET_SSE_TYPELESS_STORES): Disable for
> > > TARGET_AVX.
> > > * config/i386/i386.md (*movoi_internal_avx): Remove
> > > TARGET_SSE_TYPELESS_STORES check.
> > >
> > > gcc/testsuite/
> > >
> > > PR target/91461
> > > * gcc.target/i386/pr91461-1.c: New test.
> > > * gcc.target/i386/pr91461-2.c: Likewise.
> > > * gcc.target/i386/pr91461-3.c: Likewise.
> > > * gcc.target/i386/pr91461-4.c: Likewise.
> > > * gcc.target/i386/pr91461-5.c: Likewise.
> > > ---
> > >  gcc/config/i386/i386.h|  4 +-
> > >  gcc/config/i386/i386.md   |  4 +-
> > >  gcc/testsuite/gcc.target/i386/pr91461-1.c | 66 
> > >  gcc/testsuite/gcc.target/i386/pr91461-2.c | 19 ++
> > >  gcc/testsuite/gcc.target/i386/pr91461-3.c | 76 +++
> > >  gcc/testsuite/gcc.target/i386/pr91461-4.c | 21 +++
> > >  gcc/testsuite/gcc.target/i386/pr91461-5.c | 17 +
> > >  7 files changed, 203 insertions(+), 4 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-1.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-2.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-3.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-4.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-5.c
> > >
> > > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > > index 943e9a5c783..c134b04c5c4 100644
> > > --- a/gcc/config/i386/i386.h
> > > +++ b/gcc/config/i386/i386.h
> > > @@ -516,8 +516,10 @@ extern unsigned char 
> > > ix86_tune_features[X86_TUNE_LAST];
> > >  #define TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL \
> > > ix86_tune_features[X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL]
> > >  #define TARGET_SSE_SPLIT_REGS  
> > > ix86_tune_features[X86_TUNE_SSE_SPLIT_REGS]
> > > +/* NB: movaps/movups is one byte shorter than movdaq/movdqu.  But it
> > > +   isn't the case for AVX nor AVX512.  */
> > >  #define TARGET_SSE_TYPELESS_STORES \
> > > -   ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES]
> > > +   (!TARGET_AVX && ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES])
> >
> > This is wrong place to disable the feature.
>

Here is the updated patch on top of

https://gcc.gnu.org/ml/gcc-patches/2020-01/msg01742.html

so that set_ix86_tune_features can access per function setting.

OK for master branch?

Thanks.

-- 
H.J.
From 61482a7d4dff07075f2534840040bafa420e9f36 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Mon, 27 Jan 2020 09:35:11 -0800
Subject: [PATCH] i386: Disable TARGET_SSE_TYPELESS_STORES for TARGET_AVX

movaps/movups is one byte shorter than movdaq/movdqu.  But it isn't the
case for AVX nor AVX512.  We should disable TARGET_SSE_TYPELESS_STORES
for TARGET_AVX and adjust vmovups checks in assembly outputs.

gcc/

	PR target/91461
	* config/i386/i386-options.c (set_ix86_tune_features): Disable
	TARGET_SSE_TYPELESS_STORES for TARGET_AVX.
	* config/i386/i386.md (*movoi_internal_avx): Remove
	TARGET_SSE_TYPELESS_STORES check.

gcc/testsuite/

	PR target/91461
	* gcc.target/i386/avx256-unaligned-store-3.c: Don't check
	vmovups.
	* gcc.target/i386/pieces-memcpy-4.c: Likewise.
	* gcc.target/i386/pieces-memcpy-5.c: Likewise.
	* gcc.target/i386/pieces-memcpy-6.c: Likewise.
	* gcc.target/i386/pieces-strcpy-2.c: Likewise.
	* gcc.target/i386/pr90980-1.c: Likewise.
	* gcc.target/i386/pr87317-4.c: Check "\tvmovd\t" instead of
	"vmovd" to avoid matching "vmovdqu".
	* gcc.target/i386/pr87317-5.c: Likewise.
	* gcc.target/i386/pr87317-7.c: Likewise.
	* gcc.target/i386/pr91461-1.c: New test.
	* gcc.target/i386/pr91461-2.c: Likewise.
	* gcc.target/i386/pr91461-3.c: Likewise.
	* gcc.target/i386/pr91461-4.c: Likewise.
	* gcc.target/i386/pr91461-5.c: Likewise.
	* gcc.target/i386/pr91461-6.c: Likewise.
---
 gcc/config/i386/i386-options.c|  5 ++
 gcc/config/i386/i386.md   |  4 +-
 .../i386/avx256-unaligned-store-3.c   |  4 +-

Re: [PATCH] i386: Disable TARGET_SSE_TYPELESS_STORES for TARGET_AVX

2020-01-28 Thread H.J. Lu
On Mon, Jan 27, 2020 at 11:04 PM Uros Bizjak  wrote:
>
> On Mon, Jan 27, 2020 at 11:17 PM H.J. Lu  wrote:
> >
> > On Mon, Jan 27, 2020 at 12:26 PM Uros Bizjak  wrote:
> > >
> > > On Mon, Jan 27, 2020 at 7:23 PM H.J. Lu  wrote:
> > > >
> > > > movaps/movups is one byte shorter than movdaq/movdqu.  But it isn't the
> > > > case for AVX nor AVX512.  We should disable TARGET_SSE_TYPELESS_STORES
> > > > for TARGET_AVX.
> > > >
> > > > gcc/
> > > >
> > > > PR target/91461
> > > > * config/i386/i386.h (TARGET_SSE_TYPELESS_STORES): Disable for
> > > > TARGET_AVX.
> > > > * config/i386/i386.md (*movoi_internal_avx): Remove
> > > > TARGET_SSE_TYPELESS_STORES check.
> > > >
> > > > gcc/testsuite/
> > > >
> > > > PR target/91461
> > > > * gcc.target/i386/pr91461-1.c: New test.
> > > > * gcc.target/i386/pr91461-2.c: Likewise.
> > > > * gcc.target/i386/pr91461-3.c: Likewise.
> > > > * gcc.target/i386/pr91461-4.c: Likewise.
> > > > * gcc.target/i386/pr91461-5.c: Likewise.
> > > > ---
> > > >  gcc/config/i386/i386.h|  4 +-
> > > >  gcc/config/i386/i386.md   |  4 +-
> > > >  gcc/testsuite/gcc.target/i386/pr91461-1.c | 66 
> > > >  gcc/testsuite/gcc.target/i386/pr91461-2.c | 19 ++
> > > >  gcc/testsuite/gcc.target/i386/pr91461-3.c | 76 +++
> > > >  gcc/testsuite/gcc.target/i386/pr91461-4.c | 21 +++
> > > >  gcc/testsuite/gcc.target/i386/pr91461-5.c | 17 +
> > > >  7 files changed, 203 insertions(+), 4 deletions(-)
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-1.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-2.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-3.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-4.c
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-5.c
> > > >
> > > > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > > > index 943e9a5c783..c134b04c5c4 100644
> > > > --- a/gcc/config/i386/i386.h
> > > > +++ b/gcc/config/i386/i386.h
> > > > @@ -516,8 +516,10 @@ extern unsigned char 
> > > > ix86_tune_features[X86_TUNE_LAST];
> > > >  #define TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL \
> > > > ix86_tune_features[X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL]
> > > >  #define TARGET_SSE_SPLIT_REGS  
> > > > ix86_tune_features[X86_TUNE_SSE_SPLIT_REGS]
> > > > +/* NB: movaps/movups is one byte shorter than movdaq/movdqu.  But it
> > > > +   isn't the case for AVX nor AVX512.  */
> > > >  #define TARGET_SSE_TYPELESS_STORES \
> > > > -   ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES]
> > > > +   (!TARGET_AVX && 
> > > > ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES])
> > >
> > > This is wrong place to disable the feature.
> >
> > Like this?
>
> No.
>
> There is a mode attribute in i386.md/sse.md for relevant patterns.
> Please adapt calculation of mode attributes instead.
>

Like this?


-- 
H.J.
From 1ba0c9ce5f764b8faa8c66b70e676af187a57415 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Mon, 27 Jan 2020 09:35:11 -0800
Subject: [PATCH] i386: Disable TARGET_SSE_TYPELESS_STORES for TARGET_AVX

movaps/movups is one byte shorter than movdaq/movdqu.  But it isn't the
case for AVX nor AVX512.  We should disable TARGET_SSE_TYPELESS_STORES
for TARGET_AVX.

gcc/

	PR target/91461
	* config/i386/i386.md (*movoi_internal_avx): Remove
	TARGET_SSE_TYPELESS_STORES check.
	(*movti_internal): Disable TARGET_SSE_TYPELESS_STORES for
	TARGET_AVX.
	* config/i386/sse.md (mov_internal): Likewise.

gcc/testsuite/

	PR target/91461
	* gcc.target/i386/pr91461-1.c: New test.
	* gcc.target/i386/pr91461-2.c: Likewise.
	* gcc.target/i386/pr91461-3.c: Likewise.
	* gcc.target/i386/pr91461-4.c: Likewise.
	* gcc.target/i386/pr91461-5.c: Likewise.
---
 gcc/config/i386/i386.md   |  8 +--
 gcc/config/i386/sse.md|  2 +-
 gcc/testsuite/gcc.target/i386/pr91461-1.c | 66 
 gcc/testsuite/gcc.target/i386/pr91461-2.c | 19 ++
 gcc/testsuite/gcc.target/i386/pr91461-3.c | 76 +++
 gcc/testsuite/gcc.target/i386/pr91461-4.c | 21 +++
 gcc/testsuite/gcc.targe

Re: [PATCH] i386: Disable TARGET_SSE_TYPELESS_STORES for TARGET_AVX

2020-01-28 Thread H.J. Lu
On Tue, Jan 28, 2020 at 6:45 AM Uros Bizjak  wrote:
>
> On Tue, Jan 28, 2020 at 3:32 PM H.J. Lu  wrote:
> >
> > On Mon, Jan 27, 2020 at 11:04 PM Uros Bizjak  wrote:
> > >
> > > On Mon, Jan 27, 2020 at 11:17 PM H.J. Lu  wrote:
> > > >
> > > > On Mon, Jan 27, 2020 at 12:26 PM Uros Bizjak  wrote:
> > > > >
> > > > > On Mon, Jan 27, 2020 at 7:23 PM H.J. Lu  wrote:
> > > > > >
> > > > > > movaps/movups is one byte shorter than movdaq/movdqu.  But it isn't 
> > > > > > the
> > > > > > case for AVX nor AVX512.  We should disable 
> > > > > > TARGET_SSE_TYPELESS_STORES
> > > > > > for TARGET_AVX.
> > > > > >
> > > > > > gcc/
> > > > > >
> > > > > > PR target/91461
> > > > > > * config/i386/i386.h (TARGET_SSE_TYPELESS_STORES): Disable 
> > > > > > for
> > > > > > TARGET_AVX.
> > > > > > * config/i386/i386.md (*movoi_internal_avx): Remove
> > > > > > TARGET_SSE_TYPELESS_STORES check.
> > > > > >
> > > > > > gcc/testsuite/
> > > > > >
> > > > > > PR target/91461
> > > > > > * gcc.target/i386/pr91461-1.c: New test.
> > > > > > * gcc.target/i386/pr91461-2.c: Likewise.
> > > > > > * gcc.target/i386/pr91461-3.c: Likewise.
> > > > > > * gcc.target/i386/pr91461-4.c: Likewise.
> > > > > > * gcc.target/i386/pr91461-5.c: Likewise.
> > > > > > ---
> > > > > >  gcc/config/i386/i386.h|  4 +-
> > > > > >  gcc/config/i386/i386.md   |  4 +-
> > > > > >  gcc/testsuite/gcc.target/i386/pr91461-1.c | 66 
> > > > > >  gcc/testsuite/gcc.target/i386/pr91461-2.c | 19 ++
> > > > > >  gcc/testsuite/gcc.target/i386/pr91461-3.c | 76 
> > > > > > +++
> > > > > >  gcc/testsuite/gcc.target/i386/pr91461-4.c | 21 +++
> > > > > >  gcc/testsuite/gcc.target/i386/pr91461-5.c | 17 +
> > > > > >  7 files changed, 203 insertions(+), 4 deletions(-)
> > > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-1.c
> > > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-2.c
> > > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-3.c
> > > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-4.c
> > > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-5.c
> > > > > >
> > > > > > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > > > > > index 943e9a5c783..c134b04c5c4 100644
> > > > > > --- a/gcc/config/i386/i386.h
> > > > > > +++ b/gcc/config/i386/i386.h
> > > > > > @@ -516,8 +516,10 @@ extern unsigned char 
> > > > > > ix86_tune_features[X86_TUNE_LAST];
> > > > > >  #define TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL \
> > > > > > ix86_tune_features[X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL]
> > > > > >  #define TARGET_SSE_SPLIT_REGS  
> > > > > > ix86_tune_features[X86_TUNE_SSE_SPLIT_REGS]
> > > > > > +/* NB: movaps/movups is one byte shorter than movdaq/movdqu.  But 
> > > > > > it
> > > > > > +   isn't the case for AVX nor AVX512.  */
> > > > > >  #define TARGET_SSE_TYPELESS_STORES \
> > > > > > -   ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES]
> > > > > > +   (!TARGET_AVX && 
> > > > > > ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES])
> > > > >
> > > > > This is wrong place to disable the feature.
> > > >
> > > > Like this?
> > >
> > > No.
> > >
> > > There is a mode attribute in i386.md/sse.md for relevant patterns.
> > > Please adapt calculation of mode attributes instead.
> > >
> >
> > Like this?
>
> Still no.
>
> You could move
>
> (match_test "TARGET_AVX")
>   (const_string "TI")
>
> up to bypass the cases below.
>

I don't think we can do that.   There are 2 cases where we prefer movaps/movups:

/* Use packed single precision instructions where posisble.  I.e.
movups instead   of movupd.  */
DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL,
"sse_packed_single_insn_optimal",
  m_BDVER | m_ZNVER)

/* X86_TUNE_SSE_TYPELESS_STORES: Always movaps/movups for 128bit stores.   */
DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores",
  m_AMD_MULTIPLE | m_CORE_ALL | m_GENERIC)

We should always use movaps/movups for TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL.
It is wrong to bypass TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL with TARGET_AVX
as m_BDVER | m_ZNVER support AVX.

-- 
H.J.


[PATCH] i386: Prefer TARGET_AVX over TARGET_SSE_TYPELESS_STORES

2020-01-28 Thread H.J. Lu
On Tue, Jan 28, 2020 at 9:12 AM Uros Bizjak  wrote:
>
> On Tue, Jan 28, 2020 at 4:34 PM H.J. Lu  wrote:
>
> > > You could move
> > >
> > > (match_test "TARGET_AVX")
> > >   (const_string "TI")
> > >
> > > up to bypass the cases below.
> > >
> >
> > I don't think we can do that.   There are 2 cases where we prefer 
> > movaps/movups:
> >
> > /* Use packed single precision instructions where posisble.  I.e.
> > movups instead   of movupd.  */
> > DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL,
> > "sse_packed_single_insn_optimal",
> >   m_BDVER | m_ZNVER)
> >
> > /* X86_TUNE_SSE_TYPELESS_STORES: Always movaps/movups for 128bit stores.   
> > */
> > DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores",
> >   m_AMD_MULTIPLE | m_CORE_ALL | m_GENERIC)
> >
> > We should always use movaps/movups for 
> > TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL.
> > It is wrong to bypass TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL with TARGET_AVX
> > as m_BDVER | m_ZNVER support AVX.
>
> The reason for TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL on AMD target is
> only insn size, as advised in e.g. Software Optimization Guide for the
> AMD Family 15h Processors [1], section 7.1.2, where it is said:
>
> --quote--
> 7.1.2 Reduce Instruction Size
>
> Optimization
>
> Reduce the size of instructions when possible.
>
> Rationale
>
> Using smaller instruction sizes improves instruction fetch throughput.
> Specific examples include the following:
>
> * In SIMD code, use the single-precision (PS) form of instructions
> instead of the double-precision (PD) form. For example, for register
> to register moves, MOVAPS achieves the same result as MOVAPD, but uses
> one less byte to encode the instruction and has no prefix byte. Other
> examples in which single-precision forms can be substituted for
> double-precision forms include MOVUPS, MOVNTPS, XORPS, ORPS, ANDPS,
> and SHUFPS.
> ...
> --/quote--
>
> Please note that this optimization applies only to non-AVX forms, as
> demonstrated by:
>
>    0:   0f 28 c8         movaps %xmm0,%xmm1
>    3:   66 0f 28 c8      movapd %xmm0,%xmm1
>    7:   c5 f8 28 d1      vmovaps %xmm1,%xmm2
>    b:   c5 f9 28 d1      vmovapd %xmm1,%xmm2
>
> Also note that MOVDQA is missing in the above optimization. It is
> harmful to substitute MOVDQA with MOVAPS, as it can (and does)
> introduce +1 cycle forwarding penalty between FLT (FPA/FPM) and INT
> (VALU) FP clusters.
>
> Following the above optimization, it is obvious that
> TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL handling was cargo-culted from
> one pattern to another. Its use should be reviewed and fixed where not
> appropriate.
>
> [1] https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf
>
> Uros.

Here is the updated patch which moves TARGET_AVX before
TARGET_SSE_TYPELESS_STORES.   OK for master if there is
no regression?

Thanks.

-- 
H.J.
From cbcf8b23b29588f12e464076dacacd4600d0059b Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Mon, 27 Jan 2020 09:35:11 -0800
Subject: [PATCH] i386: Prefer TARGET_AVX over TARGET_SSE_TYPELESS_STORES

movaps/movups is one byte shorter than movdaq/movdqu.  But it isn't the
case for AVX nor AVX512.  This patch prefers TARGET_AVX over
TARGET_SSE_TYPELESS_STORES and adjust vmovups checks in assembly outputs.

gcc/

	PR target/91461
	* config/i386/i386.md (*movoi_internal_avx): Remove
	TARGET_SSE_TYPELESS_STORES check.
	(*movti_internal): Prefer TARGET_AVX over
	TARGET_SSE_TYPELESS_STORES.
	(*movtf_internal): Likewise.
	* config/i386/sse.md (mov_internal): Likewise.

gcc/testsuite/

	PR target/91461
	* gcc.target/i386/avx256-unaligned-store-2.c: Don't check
	vmovups.
	* gcc.target/i386/avx256-unaligned-store-3.c: Likewise.
	* gcc.target/i386/pieces-memcpy-4.c: Likewise.
	* gcc.target/i386/pieces-memcpy-5.c: Likewise.
	* gcc.target/i386/pieces-memcpy-6.c: Likewise.
	* gcc.target/i386/pieces-strcpy-2.c: Likewise.
	* gcc.target/i386/pr90980-1.c: Likewise.
	* gcc.target/i386/pr87317-4.c: Check "\tvmovd\t" instead of
	"vmovd" to avoid matching "vmovdqu".
	* gcc.target/i386/pr87317-5.c: Likewise.
	* gcc.target/i386/pr87317-7.c: Likewise.
	* gcc.target/i386/pr91461-1.c: New test.
	* gcc.target/i386/pr91461-2.c: Likewise.
	* gcc.target/i386/pr91461-3.c: Likewise.
	* gcc.target/i386/pr91461-4.c: Likewise.
	* gcc.target/i386/pr91461-5.c: Likewise.

---
 gcc/config/i386/i386.md   | 12 ++-
 gcc/config/i386/sse.md|  4 +-
 .../i386/avx256-unaligned-store-2.c   |  4 +-
 .../i386/avx256-unaligned-store-3.c   |  4 +-
 .../gcc

Re: [PATCH] i386: Prefer TARGET_AVX over TARGET_SSE_TYPELESS_STORES

2020-01-28 Thread H.J. Lu
On Tue, Jan 28, 2020 at 10:04 AM Uros Bizjak  wrote:
>
> On Tue, Jan 28, 2020 at 6:51 PM H.J. Lu  wrote:
> >
> > On Tue, Jan 28, 2020 at 9:12 AM Uros Bizjak  wrote:
> > >
> > > On Tue, Jan 28, 2020 at 4:34 PM H.J. Lu  wrote:
> > >
> > > > > You could move
> > > > >
> > > > > (match_test "TARGET_AVX")
> > > > >   (const_string "TI")
> > > > >
> > > > > up to bypass the cases below.
> > > > >
> > > >
> > > > I don't think we can do that.   There are 2 cases where we prefer 
> > > > movaps/movups:
> > > >
> > > > /* Use packed single precision instructions where posisble.  I.e.
> > > > movups instead   of movupd.  */
> > > > DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL,
> > > > "sse_packed_single_insn_optimal",
> > > >   m_BDVER | m_ZNVER)
> > > >
> > > > /* X86_TUNE_SSE_TYPELESS_STORES: Always movaps/movups for 128bit 
> > > > stores.   */
> > > > DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores",
> > > >   m_AMD_MULTIPLE | m_CORE_ALL | m_GENERIC)
> > > >
> > > > We should always use movaps/movups for 
> > > > TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL.
> > > > It is wrong to bypass TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL with 
> > > > TARGET_AVX
> > > > as m_BDVER | m_ZNVER support AVX.
> > >
> > > The reason for TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL on AMD target is
> > > only insn size, as advised in e.g. Software Optimization Guide for the
> > > AMD Family 15h Processors [1], section 7.1.2, where it is said:
> > >
> > > --quote--
> > > 7.1.2 Reduce Instruction Size
> > >
> > > Optimization
> > >
> > > Reduce the size of instructions when possible.
> > >
> > > Rationale
> > >
> > > Using smaller instruction sizes improves instruction fetch throughput.
> > > Specific examples include the following:
> > >
> > > * In SIMD code, use the single-precision (PS) form of instructions
> > > instead of the double-precision (PD) form. For example, for register
> > > to register moves, MOVAPS achieves the same result as MOVAPD, but uses
> > > one less byte to encode the instruction and has no prefix byte. Other
> > > examples in which single-precision forms can be substituted for
> > > double-precision forms include MOVUPS, MOVNTPS, XORPS, ORPS, ANDPS,
> > > and SHUFPS.
> > > ...
> > > --/quote--
> > >
> > > Please note that this optimization applies only to non-AVX forms, as
> > > demonstrated by:
> > >
> > >    0:   0f 28 c8         movaps %xmm0,%xmm1
> > >    3:   66 0f 28 c8      movapd %xmm0,%xmm1
> > >    7:   c5 f8 28 d1      vmovaps %xmm1,%xmm2
> > >    b:   c5 f9 28 d1      vmovapd %xmm1,%xmm2
> > >
> > > Also note that MOVDQA is missing in the above optimization. It is
> > > harmful to substitute MOVDQA with MOVAPS, as it can (and does)
> > > introduce +1 cycle forwarding penalty between FLT (FPA/FPM) and INT
> > > (VALU) FP clusters.
> > >
> > > Following the above optimization, it is obvious that
> > > TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL handling was cargo-culted from
> > > one pattern to another. Its use should be reviewed and fixed where not
> > > appropriate.
> > >
> > > [1] https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf
> > >
> > > Uros.
> >
> > Here is the updated patch which moves TARGET_AVX before
> > TARGET_SSE_TYPELESS_STORES.   OK for master if there is
> > no regression?
> >
> > Thanks.
>
>
> +   (match_test "TARGET_AVX")
> + (const_string "")
> (and (match_test " == 16")
>
> Only MODE_SIZE == 16 cases will be left here, since TARGET_AVX is
> necessary for MODE_SIZE > 16. This test can be removed.
>
> OK with the above change.
>

This is the patch I am going to check in.

Thanks.

-- 
H.J.
From 66c534dedc7a9a632aa38c32e3f7c251b8f2c778 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Mon, 27 Jan 2020 09:35:11 -0800
Subject: [PATCH] i386: Prefer TARGET_AVX over TARGET_SSE_TYPELESS_STORES

movaps/movups is one byte shorter than movdaq/movdqu.  But it isn't the
case for AVX nor AVX512.  This patch prefers TARGET_AVX over
TARGET_SSE_TYPELESS_STORES and adjust vmovu

Re: [PATCH] i386: Prefer TARGET_AVX over TARGET_SSE_TYPELESS_STORES

2020-01-28 Thread H.J. Lu
On Tue, Jan 28, 2020 at 10:58 AM Jakub Jelinek  wrote:
>
> On Tue, Jan 28, 2020 at 10:20:36AM -0800, H.J. Lu wrote:
> > From 66c534dedc7a9a632aa38c32e3f7c251b8f2c778 Mon Sep 17 00:00:00 2001
> > From: "H.J. Lu" 
> > Date: Mon, 27 Jan 2020 09:35:11 -0800
> > Subject: [PATCH] i386: Prefer TARGET_AVX over TARGET_SSE_TYPELESS_STORES
> >
> > movaps/movups is one byte shorter than movdaq/movdqu.  But it isn't the
> > case for AVX nor AVX512.  This patch prefers TARGET_AVX over
> > TARGET_SSE_TYPELESS_STORES and adjust vmovups checks in assembly outputs.
>
> If you haven't committed yet, please fix the movdaq typo in the description
> (to movdqa).
>

Will do.

Thanks.

-- 
H.J.


[PATCH] i386: Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY

2020-02-03 Thread H.J. Lu
Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY to make sure that
ENDBR is emitted before the patch area.  When -mfentry and -pg are used
together, there should be no ENDBR before "call __fentry__".
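
As an illustration (not one of the new tests; the options and the expected
layout in the comments are assumptions based on the description above):

/* Assumed options: -fcf-protection=branch -fpatchable-function-entry=1  */
void
foo (void)
{
}

/* The intent is that the output for foo looks like

   foo:
           endbr64
           nop            <- patchable area, after the ENDBR

   and that with -mfentry -pg there is no ENDBR before the
   "call __fentry__".  */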

OK for master if there is no regression?

Thanks.

H.J.
--
gcc/

PR target/93492
* config/i386/i386.c (ix86_asm_output_function_label): Set
function_label_emitted to true.
(ix86_print_patchable_function_entry): New function.

gcc/testsuite/

PR target/93492
* gcc.target/i386/pr93492-1.c: New test.
* gcc.target/i386/pr93492-2.c: Likewise.
* gcc.target/i386/pr93492-3.c: Likewise.


-- 
H.J.
From 5363c0289e3525139939bb678deeda98d06b2556 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Mon, 3 Feb 2020 10:22:57 -0800
Subject: [PATCH] i386: Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY

Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY to make sure that
ENDBR is emitted before the patch area.  When -mfentry and -pg are used
together, there should be no ENDBR before "call __fentry__".

gcc/

	PR target/93492
	* config/i386/i386.c (ix86_asm_output_function_label): Set
	function_label_emitted to true.
	(ix86_print_patchable_function_entry): New function.

gcc/testsuite/

	PR target/93492
	* gcc.target/i386/pr93492-1.c: New test.
	* gcc.target/i386/pr93492-2.c: Likewise.
	* gcc.target/i386/pr93492-3.c: Likewise.
---
 gcc/config/i386/i386.c| 46 ++
 gcc/config/i386/i386.h|  3 +
 gcc/testsuite/gcc.target/i386/pr93492-1.c | 73 +++
 gcc/testsuite/gcc.target/i386/pr93492-2.c | 12 
 gcc/testsuite/gcc.target/i386/pr93492-3.c | 13 
 5 files changed, 147 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-3.c

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index ffda3e8fd21..dc9bd095e9a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -1563,6 +1563,9 @@ ix86_asm_output_function_label (FILE *asm_out_file, const char *fname,
 {
   bool is_ms_hook = ix86_function_ms_hook_prologue (decl);
 
+  if (cfun)
+cfun->machine->function_label_emitted = true;
+
   if (is_ms_hook)
 {
   int i, filler_count = (TARGET_64BIT ? 32 : 16);
@@ -9118,6 +9121,45 @@ ix86_output_function_epilogue (FILE *file ATTRIBUTE_UNUSED)
 }
 }
 
+/* Implement TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY.  */
+
+void
+ix86_print_patchable_function_entry (FILE *file,
+ unsigned HOST_WIDE_INT patch_area_size,
+ bool record_p)
+{
+  if (cfun->machine->function_label_emitted)
+{
+  if ((flag_cf_protection & CF_BRANCH)
+	  && !lookup_attribute ("nocf_check",
+TYPE_ATTRIBUTES (TREE_TYPE (cfun->decl)))
+	  && (!flag_manual_endbr
+	  || lookup_attribute ("cf_check",
+   DECL_ATTRIBUTES (cfun->decl)))
+	  && !cgraph_node::get (cfun->decl)->only_called_directly_p ())
+	{
+	  /* Remove ENDBR that follows the patch area.  */
+	  rtx_insn *insn = next_real_nondebug_insn (get_insns ());
+	  if (insn
+	  && INSN_P (insn)
+	  && GET_CODE (PATTERN (insn)) == UNSPEC_VOLATILE
+	  && XINT (PATTERN (insn), 1) == UNSPECV_NOP_ENDBR)
+	delete_insn (insn);
+
+	  /* Remove the queued ENDBR.  */
+	  cfun->machine->endbr_queued_at_entrance = false;
+
+	  /* Insert an ENDBR before the patch area right after the
+	 function label and the .cfi_startproc directive.  */
+	  asm_fprintf (file, "\t%s\n",
+		   TARGET_64BIT ? "endbr64" : "endbr32");
+	}
+}
+
+  default_print_patchable_function_entry (file, patch_area_size,
+	  record_p);
+}
+
 /* Return a scratch register to use in the split stack prologue.  The
split stack prologue is used for -fsplit-stack.  It is the first
instructions in the function, even before the regular prologue.
@@ -22744,6 +22786,10 @@ ix86_run_selftests (void)
 #undef TARGET_ASM_FUNCTION_EPILOGUE
 #define TARGET_ASM_FUNCTION_EPILOGUE ix86_output_function_epilogue
 
+#undef TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY
+#define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY \
+  ix86_print_patchable_function_entry
+
 #undef TARGET_ENCODE_SECTION_INFO
 #ifndef SUBTARGET_ENCODE_SECTION_INFO
 #define TARGET_ENCODE_SECTION_INFO ix86_encode_section_info
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 943e9a5c783..46a809afb96 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2844,6 +2844,9 @@ struct GTY(()) machine_function {
   /* If true, ENDBR is queued at function entrance.  */
   BOOL_BITFIELD endbr_queued_at_entrance : 1;
 
+  /* If true, the function label has been emitted.  */
+  BOOL_BITFIELD function_label_emitted : 1;
+
   /* True if the function needs a stack frame.  */
   BOOL_BITFIELD stack_frame_required : 1;
 
diff --git a

Re: [PATCH] i386: Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY

2020-02-03 Thread H.J. Lu
On Mon, Feb 3, 2020 at 10:35 AM H.J. Lu  wrote:
>
> Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY to make sure that the
> ENDBR are emitted before the patch area.  When -mfentry -pg is also used
> together, there should be no ENDBR before "call __fentry__".
>
> OK for master if there is no regression?
>
> Thanks.
>
> H.J.
> --
> gcc/
>
> PR target/93492
> * config/i386/i386.c (ix86_asm_output_function_label): Set
> function_label_emitted to true.
> (ix86_print_patchable_function_entry): New function.
>
> gcc/testsuite/
>
> PR target/93492
> * gcc.target/i386/pr93492-1.c: New test.
> * gcc.target/i386/pr93492-2.c: Likewise.
> * gcc.target/i386/pr93492-3.c: Likewise.
>

This version works with both .cfi_startproc and DWARF debug info.


-- 
H.J.
From c4660acd1555f90f0f76f32a59f043a51c866553 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Mon, 3 Feb 2020 10:22:57 -0800
Subject: [PATCH] i386: Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY

Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY to delay patchable-area
generation until ENDBR generation.  It works with both .cfi_startproc
and DWARF debug info.

gcc/

	PR target/93492
	* config/i386/i386-features.c (rest_of_insert_endbranch): Set
	endbr_queued_at_entrance to TYPE_ENDBR.
	* config/i386/i386-protos.h (ix86_output_endbr): New.
	* config/i386/i386.c (ix86_asm_output_function_label): Set
	function_label_emitted to true.
	(ix86_print_patchable_function_entry): New function.
	(ix86_output_endbr): Likewise.
	(x86_function_profiler): Call ix86_output_endbr to generate
	ENDBR.
	(TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY): New.
	* i386.h (endbr_type): New.
	(machine_function): Add patch_area_size, record_patch_area and
	function_label_emitted.  Change endbr_queued_at_entrance to
	enum.
	* config/i386/i386.md (UNSPECV_PATCH_ENDBR): New.
	(patch_endbr): New.

gcc/testsuite/

	PR target/93492
	* gcc.target/i386/pr93492-1.c: New test.
	* gcc.target/i386/pr93492-2.c: Likewise.
	* gcc.target/i386/pr93492-3.c: Likewise.
---
 gcc/config/i386/i386-features.c   |  2 +-
 gcc/config/i386/i386-protos.h |  2 +
 gcc/config/i386/i386.c| 77 ++-
 gcc/config/i386/i386.h| 18 +-
 gcc/config/i386/i386.md   |  9 +++
 gcc/testsuite/gcc.target/i386/pr93492-1.c | 73 +
 gcc/testsuite/gcc.target/i386/pr93492-2.c | 12 
 gcc/testsuite/gcc.target/i386/pr93492-3.c | 13 
 8 files changed, 203 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-3.c

diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index b49e6f8d408..4d3d36e9ade 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -1963,7 +1963,7 @@ rest_of_insert_endbranch (void)
 {
   /* Queue ENDBR insertion to x86_function_profiler.  */
   if (crtl->profile && flag_fentry)
-	cfun->machine->endbr_queued_at_entrance = true;
+	cfun->machine->endbr_queued_at_entrance = TYPE_ENDBR;
   else
 	{
 	  cet_eb = gen_nop_endbr ();
diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 266381ca5a6..f9f5a243714 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -38,6 +38,8 @@ extern void ix86_expand_split_stack_prologue (void);
 extern void ix86_output_addr_vec_elt (FILE *, int);
 extern void ix86_output_addr_diff_elt (FILE *, int, int);
 
+extern void ix86_output_endbr (bool);
+
 extern enum calling_abi ix86_cfun_abi (void);
 extern enum calling_abi ix86_function_type_abi (const_tree);
 
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index ffda3e8fd21..e5b2565d5bd 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -1563,6 +1563,9 @@ ix86_asm_output_function_label (FILE *asm_out_file, const char *fname,
 {
   bool is_ms_hook = ix86_function_ms_hook_prologue (decl);
 
+  if (cfun)
+cfun->machine->function_label_emitted = true;
+
   if (is_ms_hook)
 {
   int i, filler_count = (TARGET_64BIT ? 32 : 16);
@@ -9118,6 +9121,73 @@ ix86_output_function_epilogue (FILE *file ATTRIBUTE_UNUSED)
 }
 }
 
+/* Implement TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY.  */
+
+void
+ix86_print_patchable_function_entry (FILE *file,
+ unsigned HOST_WIDE_INT patch_area_size,
+ bool record_p)
+{
+  if (cfun->machine->function_label_emitted)
+{
+  if ((flag_cf_protection & CF_BRANCH)
+	  && !lookup_attribute ("nocf_check",
+TYPE_ATTRIBUTES (TREE_TYPE (cfun->decl)))
+	  && (!flag_manual_endbr
+	  || lookup_attribute ("cf_check",
+   DECL_ATTRIBUTES (cfun->decl)))
+	  && !cgraph_node::get (cfun->de

Re: [PATCH] i386: Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY

2020-02-03 Thread H.J. Lu
On Mon, Feb 3, 2020 at 4:02 PM H.J. Lu  wrote:
>
> On Mon, Feb 3, 2020 at 10:35 AM H.J. Lu  wrote:
> >
> > Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY to make sure that the
> > ENDBR are emitted before the patch area.  When -mfentry -pg is also used
> > together, there should be no ENDBR before "call __fentry__".
> >
> > OK for master if there is no regression?
> >
> > Thanks.
> >
> > H.J.
> > --
> > gcc/
> >
> > PR target/93492
> > * config/i386/i386.c (ix86_asm_output_function_label): Set
> > function_label_emitted to true.
> > (ix86_print_patchable_function_entry): New function.
> >
> > gcc/testsuite/
> >
> > PR target/93492
> > * gcc.target/i386/pr93492-1.c: New test.
> > * gcc.target/i386/pr93492-2.c: Likewise.
> > * gcc.target/i386/pr93492-3.c: Likewise.
> >
>
> This version works with both .cfi_startproc and DWARF debug info.
>

-g and -fpatchable-function-entry=1 don't work together:

[hjl@gnu-cfl-1 pr93492]$ cat y.i
void f(){}
[hjl@gnu-cfl-1 pr93492]$ make y.s
/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc
-B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/
-g -fpatchable-function-entry=1 -S y.i
[hjl@gnu-cfl-1 pr93492]$ cat y.s
.file "y.i"
.text
.Ltext0:
.globl f
.type f, @function
f:
.section __patchable_function_entries,"aw",@progbits
.align 8
.quad .LPFE1
.text
.LPFE1:
nop
.LFB0:
.file 1 "y.i"
.loc 1 1 9
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
.loc 1 1 1
nop
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size f, .-f

I will update my patch to handle it.

-- 
H.J.


[PATCH] x86: Add UNSPECV_PATCHABLE_AREA

2020-02-04 Thread H.J. Lu
On Mon, Feb 03, 2020 at 06:10:49PM -0800, H.J. Lu wrote:
> On Mon, Feb 3, 2020 at 4:02 PM H.J. Lu  wrote:
> >
> > On Mon, Feb 3, 2020 at 10:35 AM H.J. Lu  wrote:
> > >
> > > Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY to make sure that the
> > > ENDBR are emitted before the patch area.  When -mfentry -pg is also used
> > > together, there should be no ENDBR before "call __fentry__".
> > >
> > > OK for master if there is no regression?
> > >
> > > Thanks.
> > >
> > > H.J.
> > > --
> > > gcc/
> > >
> > > PR target/93492
> > > * config/i386/i386.c (ix86_asm_output_function_label): Set
> > > function_label_emitted to true.
> > > (ix86_print_patchable_function_entry): New function.
> > >
> > > gcc/testsuite/
> > >
> > > PR target/93492
> > > * gcc.target/i386/pr93492-1.c: New test.
> > > * gcc.target/i386/pr93492-2.c: Likewise.
> > > * gcc.target/i386/pr93492-3.c: Likewise.
> > >
> >
> > This version works with both .cfi_startproc and DWARF debug info.
> >
> 
> -g -fpatchable-function-entry=1  doesn't work together:
> 

Here is a different approach with UNSPECV_PATCHABLE_AREA.


H.J.
---
Currently the patchable area is in the wrong place.  It is placed
immediately after the function label, before both .cfi_startproc and
ENDBR.  This patch adds UNSPECV_PATCHABLE_AREA as a pseudo patchable-area
instruction and changes the ENDBR insertion pass to also insert a dummy
patchable area.  TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY is defined to
provide the actual size of the patchable area.  It places the patchable
area immediately after .cfi_startproc and ENDBR.
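
For illustration (not part of the patch; the options and the assembly
sketches in the comments are assumptions based on the description above):

/* Assumed options: -g -fcf-protection=branch -fpatchable-function-entry=2  */
void
f (void)
{
}

/* Before this patch the patchable NOPs land right after the label,
   ahead of both .cfi_startproc and ENDBR:

   f:
           nop
           nop
           .cfi_startproc
           endbr64

   With UNSPECV_PATCHABLE_AREA the area is emitted after both:

   f:
           .cfi_startproc
           endbr64
           nop
           nop
 */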

gcc/

PR target/93492
* config/i386/i386-features.c (rest_of_insert_endbranch):
Renamed to ...
(rest_of_insert_endbr_and_patchable_area): Change return type
to void.  Don't call timevar_push nor timevar_pop.  Replace
endbr_queued_at_entrance with insn_queued_at_entrance.  Insert
UNSPECV_PATCHABLE_AREA for patchable area.
(pass_data_insert_endbranch): Renamed to ...
(pass_data_insert_endbr_and_patchable_area): This.  Change
pass name to endbr_and_patchable_area.
(pass_insert_endbranch): Renamed to ...
(pass_insert_endbr_and_patchable_area): This.  Add need_endbr
and need_patchable_area.
(pass_insert_endbr_and_patchable_area::gate): Set and check
need_endbr/need_patchable_area.
(pass_insert_endbr_and_patchable_area::execute): Call
timevar_push and timevar_pop.  Pass need_endbr and
need_patchable_area to rest_of_insert_endbr_and_patchable_area.
(make_pass_insert_endbranch): Renamed to ...
(make_pass_insert_endbr_and_patchable_area): This.
* config/i386/i386-passes.def: Replace pass_insert_endbranch
with pass_insert_endbr_and_patchable_area.
* config/i386/i386-protos.h (ix86_output_patchable_area): New.
(make_pass_insert_endbranch): Renamed to ...
(make_pass_insert_endbr_and_patchable_area): This.
* config/i386/i386.c (ix86_asm_output_function_label): Set
function_label_emitted to true.
(ix86_print_patchable_function_entry): New function.
(ix86_output_patchable_area): Likewise.
(x86_function_profiler): Replace endbr_queued_at_entrance with
insn_queued_at_entrance.  Generate ENDBR only for TYPE_ENDBR.
Call ix86_output_patchable_area to generate patchable area.
(TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY): New.
* i386.h (queued_insn_type): New.
(machine_function): Add patch_area_size, record_patch_area and
function_label_emitted.  Replace endbr_queued_at_entrance with
insn_queued_at_entrance.
* config/i386/i386.md (UNSPECV_PATCHABLE_AREA): New.
(patchable_area): New.

gcc/testsuite/

PR target/93492
* gcc.target/i386/pr93492-1.c: New test.
* gcc.target/i386/pr93492-2.c: Likewise.
* gcc.target/i386/pr93492-3.c: Likewise.
* gcc.target/i386/pr93492-4.c: Likewise.
* gcc.target/i386/pr93492-5.c: Likewise.
---
 gcc/config/i386/i386-features.c   | 139 ++
 gcc/config/i386/i386-passes.def   |   2 +-
 gcc/config/i386/i386-protos.h |   5 +-
 gcc/config/i386/i386.c|  90 +-
 gcc/config/i386/i386.h|  20 +++-
 gcc/config/i386/i386.md   |  14 +++
 gcc/testsuite/gcc.target/i386/pr93492-1.c |  73 
 gcc/testsuite/gcc.target/i386/pr93492-2.c |  12 ++
 gcc/testsuite/gcc.target/i386/pr93492-3.c |  13 ++
 gcc/testsuite/gcc.target/i386/pr93492-4.c |  11 ++
 gcc/testsuite/gcc.target/i386/pr93492-5.c |  12 ++
 11 files changed, 337 insertions(+), 

[PATCH 0/3] Update -fpatchable-function-entry implementation

2020-02-05 Thread H.J. Lu
The current -fpatchable-function-entry implementation works almost
behind the backend's back.  The backend doesn't know if and where the
patchable area will be generated during RTL passes.

TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY is only used to print out the
assembly code for the patchable area.  This leads to wrong code with
-fpatchable-function-entry:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92424
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93492

Also, .cfi_startproc and the DWARF info end up in the wrong places when
-fpatchable-function-entry is used.

This patch set has 3 parts:

1. Add a pseudo UNSPECV_PATCHABLE_AREA instruction and define
TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY to work around the
-fpatchable-function-entry implementation deficiency.
2. Add patch_area_size and patch_area_entry to cfun so that the
patchable area info is available in RTL passes and the backend can
handle the patchable area properly.  It also limits patch_area_size
and patch_area_entry to 65535, which is a reasonable maximum size
for the patchable area.  Other backends can also use it to generate
the patchable area properly.
3. Remove the workaround in UNSPECV_PATCHABLE_AREA generation.

If the patch set is acceptable, I can combine patch 1 and patch 3
into a single patch.

H.J. Lu (3):
  x86: Add UNSPECV_PATCHABLE_AREA
  Add patch_area_size and patch_area_entry to cfun
  x86: Simplify UNSPECV_PATCHABLE_AREA generation

 gcc/config/i386/i386-features.c   | 130 --
 gcc/config/i386/i386-passes.def   |   2 +-
 gcc/config/i386/i386-protos.h |   5 +-
 gcc/config/i386/i386.c|  51 ++-
 gcc/config/i386/i386.h|  14 +-
 gcc/config/i386/i386.md   |  17 +++
 gcc/doc/invoke.texi   |   1 +
 gcc/function.c|  35 +
 gcc/function.h|   6 +
 gcc/opts.c|   4 +-
 .../patchable_function_entry-error-1.c|   9 ++
 .../patchable_function_entry-error-2.c|   9 ++
 .../patchable_function_entry-error-3.c|  20 +++
 gcc/testsuite/gcc.target/i386/pr93492-1.c |  73 ++
 gcc/testsuite/gcc.target/i386/pr93492-2.c |  12 ++
 gcc/testsuite/gcc.target/i386/pr93492-3.c |  13 ++
 gcc/testsuite/gcc.target/i386/pr93492-4.c |  11 ++
 gcc/testsuite/gcc.target/i386/pr93492-5.c |  12 ++
 gcc/varasm.c  |  30 +---
 19 files changed, 375 insertions(+), 79 deletions(-)
 create mode 100644 
gcc/testsuite/c-c++-common/patchable_function_entry-error-1.c
 create mode 100644 
gcc/testsuite/c-c++-common/patchable_function_entry-error-2.c
 create mode 100644 
gcc/testsuite/c-c++-common/patchable_function_entry-error-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-5.c

-- 
2.24.1



[PATCH 1/3] x86: Add UNSPECV_PATCHABLE_AREA

2020-02-05 Thread H.J. Lu
Currently the patchable area is in the wrong place.  It is placed
immediately after the function label, before both .cfi_startproc and
ENDBR.  This patch adds UNSPECV_PATCHABLE_AREA as a pseudo patchable-area
instruction and changes the ENDBR insertion pass to also insert a dummy
patchable area.  TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY is defined to
provide the actual size of the patchable area.  It places the patchable
area immediately after .cfi_startproc and ENDBR.

gcc/

PR target/93492
* config/i386/i386-features.c (rest_of_insert_endbranch):
Renamed to ...
(rest_of_insert_endbr_and_patchable_area): Change return type
to void.  Don't call timevar_push nor timevar_pop.  Replace
endbr_queued_at_entrance with insn_queued_at_entrance.  Insert
UNSPECV_PATCHABLE_AREA for patchable area.
(pass_data_insert_endbranch): Renamed to ...
(pass_data_insert_endbr_and_patchable_area): This.  Change
pass name to endbr_and_patchable_area.
(pass_insert_endbranch): Renamed to ...
(pass_insert_endbr_and_patchable_area): This.  Add need_endbr
and need_patchable_area.
(pass_insert_endbr_and_patchable_area::gate): Set and check
need_endbr/need_patchable_area.
(pass_insert_endbr_and_patchable_area::execute): Call
timevar_push and timevar_pop.  Pass need_endbr and
need_patchable_area to rest_of_insert_endbr_and_patchable_area.
(make_pass_insert_endbranch): Renamed to ...
(make_pass_insert_endbr_and_patchable_area): This.
* config/i386/i386-passes.def: Replace pass_insert_endbranch
with pass_insert_endbr_and_patchable_area.
* config/i386/i386-protos.h (ix86_output_patchable_area): New.
(make_pass_insert_endbranch): Renamed to ...
(make_pass_insert_endbr_and_patchable_area): This.
* config/i386/i386.c (ix86_asm_output_function_label): Set
function_label_emitted to true.
(ix86_print_patchable_function_entry): New function.
(ix86_output_patchable_area): Likewise.
(x86_function_profiler): Replace endbr_queued_at_entrance with
insn_queued_at_entrance.  Generate ENDBR only for TYPE_ENDBR.
Call ix86_output_patchable_area to generate patchable area.
(TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY): New.
* i386.h (queued_insn_type): New.
(machine_function): Add patch_area_size, record_patch_area and
function_label_emitted.  Replace endbr_queued_at_entrance with
insn_queued_at_entrance.
* config/i386/i386.md (UNSPECV_PATCHABLE_AREA): New.
(patchable_area): New.

gcc/testsuite/

PR target/93492
* gcc.target/i386/pr93492-1.c: New test.
* gcc.target/i386/pr93492-2.c: Likewise.
* gcc.target/i386/pr93492-3.c: Likewise.
* gcc.target/i386/pr93492-4.c: Likewise.
* gcc.target/i386/pr93492-5.c: Likewise.
---
 gcc/config/i386/i386-features.c   | 136 +++---
 gcc/config/i386/i386-passes.def   |   2 +-
 gcc/config/i386/i386-protos.h |   5 +-
 gcc/config/i386/i386.c|  91 ++-
 gcc/config/i386/i386.h|  20 +++-
 gcc/config/i386/i386.md   |  15 +++
 gcc/testsuite/gcc.target/i386/pr93492-1.c |  73 
 gcc/testsuite/gcc.target/i386/pr93492-2.c |  12 ++
 gcc/testsuite/gcc.target/i386/pr93492-3.c |  13 +++
 gcc/testsuite/gcc.target/i386/pr93492-4.c |  11 ++
 gcc/testsuite/gcc.target/i386/pr93492-5.c |  12 ++
 11 files changed, 339 insertions(+), 51 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-5.c

diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index b49e6f8d408..be46f036126 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -1937,43 +1937,79 @@ make_pass_stv (gcc::context *ctxt)
   return new pass_stv (ctxt);
 }
 
-/* Inserting ENDBRANCH instructions.  */
+/* Inserting ENDBR and pseudo patchable-area instructions.  */
 
-static unsigned int
-rest_of_insert_endbranch (void)
+static void
+rest_of_insert_endbr_and_patchable_area (bool need_endbr,
+bool need_patchable_area)
 {
-  timevar_push (TV_MACH_DEP);
-
-  rtx cet_eb;
+  rtx endbr;
   rtx_insn *insn;
+  rtx_insn *endbr_insn = NULL;
   basic_block bb;
 
-  /* Currently emit EB if it's a tracking function, i.e. 'nocf_check' is
- absent among function attributes.  Later an optimization will be
- introduced to make analysis if an address of a static function is
- taken.  A static function whose address is not taken will get a
- nocf_check attribute.  This will allow to redu

[PATCH 2/3] Add patch_area_size and patch_area_entry to cfun

2020-02-05 Thread H.J. Lu
Currently the patchable area is in the wrong place.  It is placed
immediately after the function label and before .cfi_startproc.  A
backend should be able to add a pseudo patchable-area instruction
directly into RTL.  This patch adds patch_area_size and patch_area_entry
to cfun so that the patchable area info is available in RTL passes.

It also limits patch_area_size and patch_area_entry to 65535, which is a
reasonable maximum size for the patchable area.
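
For illustration (not one of the new tests; the function name is made up),
the existing patchable_function_entry attribute maps onto the new fields
like this:

/* 3 NOPs in total, 1 of them before the entry point: with this patch
   cfun->patch_area_size is 3 and cfun->patch_area_entry is 1, and
   values above 65535 are rejected with an error.  */
__attribute__ ((patchable_function_entry (3, 1)))
void
g (void)
{
}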

gcc/

PR target/93492
* function.c (expand_function_start): Set cfun->patch_area_size
and cfun->patch_area_entry.
* function.h (function): Add patch_area_size and patch_area_entry.
* opts.c (common_handle_option): Limit
function_entry_patch_area_size and function_entry_patch_area_start
to USHRT_MAX.  Fix a typo in error message.
* varasm.c (assemble_start_function): Use cfun->patch_area_size
and cfun->patch_area_entry.
* doc/invoke.texi: Document the maximum value for
-fpatchable-function-entry.

gcc/testsuite/

PR target/93492
* c-c++-common/patchable_function_entry-error-1.c: New test.
* c-c++-common/patchable_function_entry-error-2.c: Likewise.
* c-c++-common/patchable_function_entry-error-3.c: Likewise.
---
 gcc/doc/invoke.texi   |  1 +
 gcc/function.c| 35 +++
 gcc/function.h|  6 
 gcc/opts.c|  4 ++-
 .../patchable_function_entry-error-1.c|  9 +
 .../patchable_function_entry-error-2.c|  9 +
 .../patchable_function_entry-error-3.c| 20 +++
 gcc/varasm.c  | 30 ++--
 8 files changed, 85 insertions(+), 29 deletions(-)
 create mode 100644 
gcc/testsuite/c-c++-common/patchable_function_entry-error-1.c
 create mode 100644 
gcc/testsuite/c-c++-common/patchable_function_entry-error-2.c
 create mode 100644 
gcc/testsuite/c-c++-common/patchable_function_entry-error-3.c

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 35b341e759f..dd4835199b0 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -13966,6 +13966,7 @@ If @code{N=0}, no pad location is recorded.
 The NOP instructions are inserted at---and maybe before, depending on
 @var{M}---the function entry address, even before the prologue.
 
+The maximum value of @var{N} and @var{M} is 65535.
 @end table
 
 
diff --git a/gcc/function.c b/gcc/function.c
index d8008f60422..badbf538eec 100644
--- a/gcc/function.c
+++ b/gcc/function.c
@@ -5202,6 +5202,41 @@ expand_function_start (tree subr)
   /* If we are doing generic stack checking, the probe should go here.  */
   if (flag_stack_check == GENERIC_STACK_CHECK)
 stack_check_probe_note = emit_note (NOTE_INSN_DELETED);
+
+  unsigned HOST_WIDE_INT patch_area_size = function_entry_patch_area_size;
+  unsigned HOST_WIDE_INT patch_area_entry = function_entry_patch_area_start;
+
+  tree patchable_function_entry_attr
+= lookup_attribute ("patchable_function_entry",
+   DECL_ATTRIBUTES (cfun->decl));
+  if (patchable_function_entry_attr)
+{
+  tree pp_val = TREE_VALUE (patchable_function_entry_attr);
+  tree patchable_function_entry_value1 = TREE_VALUE (pp_val);
+
+  patch_area_size = tree_to_uhwi (patchable_function_entry_value1);
+  patch_area_entry = 0;
+  if (TREE_CHAIN (pp_val) != NULL_TREE)
+   {
+ tree patchable_function_entry_value2
+   = TREE_VALUE (TREE_CHAIN (pp_val));
+ patch_area_entry = tree_to_uhwi (patchable_function_entry_value2);
+   }
+  if (patch_area_size > USHRT_MAX || patch_area_entry > USHRT_MAX)
+   error ("invalid values for % attribute");
+}
+
+  if (patch_area_entry > patch_area_size)
+{
+  if (patch_area_size > 0)
+   warning (OPT_Wattributes,
+"patchable function entry %wu exceeds size %wu",
+patch_area_entry, patch_area_size);
+  patch_area_entry = 0;
+}
+
+  cfun->patch_area_size = patch_area_size;
+  cfun->patch_area_entry = patch_area_entry;
 }
 
 void
diff --git a/gcc/function.h b/gcc/function.h
index 1ee8ed3de53..1ed7c400f23 100644
--- a/gcc/function.h
+++ b/gcc/function.h
@@ -332,6 +332,12 @@ struct GTY(()) function {
   /* Last assigned dependence info clique.  */
   unsigned short last_clique;
 
+  /* How many NOP insns to place at each function entry by default.  */
+  unsigned short patch_area_size;
+
+  /* How far the real asm entry point is into this area.  */
+  unsigned short patch_area_entry;
+
   /* Collected bit flags.  */
 
   /* Number of units of general registers that need saving in stdarg
diff --git a/gcc/opts.c b/gcc/opts.c
index 7affeb41a96..c6011f1f9b7 100644
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -2598,10 +2598,12 @@ common_handle_option (struct gcc_options *opts,
function_entry_patch_area_start = 0;

[PATCH 3/3] x86: Simplify UNSPECV_PATCHABLE_AREA generation

2020-02-05 Thread H.J. Lu
Since patch_area_size and patch_area_entry have been added to cfun,
we can use them to directly insert the pseudo UNSPECV_PATCHABLE_AREA
instruction.

PR target/93492
* config/i386/i386-features.c
(rest_of_insert_endbr_and_patchable_area): Change
need_patchable_area argument to patchable_area_size.  Generate
UNSPECV_PATCHABLE_AREA instruction with proper arguments.
(pass_insert_endbr_and_patchable_area::gate): Set and check
patchable_area_size instead of need_patchable_area.
(pass_insert_endbr_and_patchable_area::execute): Pass
patchable_area_size to rest_of_insert_endbr_and_patchable_area.
(pass_insert_endbr_and_patchable_area): Replace
need_patchable_area with patchable_area_size.
* config/i386/i386.c (ix86_print_patchable_function_entry):
Just return if function table has been emitted.
(x86_function_profiler): Use cfun->patch_area_size and
cfun->patch_area_entry.
* config/i386/i386.h (machine_function): Remove patch_area_size
and record_patch_area.
* config/i386/i386.md (patchable_area): Set length attribute.
---
 gcc/config/i386/i386-features.c | 22 +---
 gcc/config/i386/i386.c  | 60 ++---
 gcc/config/i386/i386.h  |  6 
 gcc/config/i386/i386.md | 10 +++---
 4 files changed, 25 insertions(+), 73 deletions(-)

diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index be46f036126..d358abe7064 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -1941,7 +1941,7 @@ make_pass_stv (gcc::context *ctxt)
 
 static void
 rest_of_insert_endbr_and_patchable_area (bool need_endbr,
-bool need_patchable_area)
+unsigned int patchable_area_size)
 {
   rtx endbr;
   rtx_insn *insn;
@@ -1980,7 +1980,7 @@ rest_of_insert_endbr_and_patchable_area (bool need_endbr,
}
 }
 
-  if (need_patchable_area)
+  if (patchable_area_size)
 {
   if (crtl->profile && flag_fentry)
{
@@ -1992,10 +1992,9 @@ rest_of_insert_endbr_and_patchable_area (bool need_endbr,
}
   else
{
- /* ix86_print_patchable_function_entry will provide actual
-size.  */
- rtx patchable_area = gen_patchable_area (GEN_INT (0),
-  GEN_INT (0));
+ rtx patchable_area
+   = gen_patchable_area (GEN_INT (patchable_area_size),
+ GEN_INT (cfun->patch_area_entry == 0));
  if (endbr_insn)
emit_insn_after (patchable_area, endbr_insn);
  else
@@ -2123,25 +2122,22 @@ public:
   virtual bool gate (function *fun)
 {
   need_endbr = (flag_cf_protection & CF_BRANCH) != 0;
-  need_patchable_area
-   = (function_entry_patch_area_size
-  || lookup_attribute ("patchable_function_entry",
-   DECL_ATTRIBUTES (fun->decl)));
-  return need_endbr || need_patchable_area;
+  patchable_area_size = fun->patch_area_size - fun->patch_area_entry;
+  return need_endbr || patchable_area_size;
 }
 
   virtual unsigned int execute (function *)
 {
   timevar_push (TV_MACH_DEP);
   rest_of_insert_endbr_and_patchable_area (need_endbr,
-  need_patchable_area);
+  patchable_area_size);
   timevar_pop (TV_MACH_DEP);
   return 0;
 }
 
 private:
   bool need_endbr;
-  bool need_patchable_area;
+  unsigned int patchable_area_size;
 }; // class pass_insert_endbr_and_patchable_area
 
 } // anon namespace
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 051a1fcbdc2..79a8823f61e 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -9130,53 +9130,11 @@ ix86_print_patchable_function_entry (FILE *file,
 {
   if (cfun->machine->function_label_emitted)
 {
-  /* The insert_endbr_and_patchable_area pass inserted a dummy
-UNSPECV_PATCHABLE_AREA with 0 patchable area size.  If the
-patchable area is placed after the function label, we replace
-0 patchable area size with the real one.  Otherwise, the
-dummy UNSPECV_PATCHABLE_AREA will be ignored.  */
-  if (cfun->machine->insn_queued_at_entrance)
-   {
- /* Record the patchable area.  Both ENDBR and patchable area
-will be inserted by x86_function_profiler later.  */
- cfun->machine->patch_area_size = patch_area_size;
- cfun->machine->record_patch_area = record_p;
- return;
-   }
-
-  /* We can have
-
-UNSPECV_NOP_ENDBR
-UNSPECV_PATCHABLE_AREA
-
-or just
-
-UNSPECV_PATCHABLE_AREA
-   */
-  rtx_insn *patchable_insn;
-  rtx_insn *insn = next_real_nondebug_insn (get_insns ());
-  if (insn

[PATCH] x86-64: Pass aggregates with only float/double in GPRs for MS_ABI

2020-02-05 Thread H.J. Lu
MS_ABI requires passing aggregates with only float/double in integer
registers.  I checked GCC's output against Clang and fixed:

FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=54
-Wno-unused-variable -Wno-unused-parameter
-Wno-unused-but-set-variable -Wno-uninitialized -O0
-DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=54
-Wno-unused-variable -Wno-unused-parameter
-Wno-unused-but-set-variable -Wno-uninitialized -O2
-DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=55
-Wno-unused-variable -Wno-unused-parameter
-Wno-unused-but-set-variable -Wno-uninitialized -O0
-DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=55
-Wno-unused-variable -Wno-unused-parameter
-Wno-unused-but-set-variable -Wno-uninitialized -O2
-DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=56
-Wno-unused-variable -Wno-unused-parameter
-Wno-unused-but-set-variable -Wno-uninitialized -O0
-DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=56
-Wno-unused-variable -Wno-unused-parameter
-Wno-unused-but-set-variable -Wno-uninitialized -O2
-DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test

in libffi testsuite.
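
For illustration (not one of the new tests; the names are hypothetical),
this is the kind of aggregate the change affects:

struct accum { double d; };	/* 8-byte aggregate containing only FP.  */

__attribute__ ((ms_abi)) double
take_accum (struct accum a)
{
  /* With this patch the 8-byte aggregate argument is passed in an
     integer register (%rcx for the first argument), as MS_ABI and
     MSVC/Clang require, rather than in an SSE register.  */
  return a.d;
}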

OK for master and backports to GCC 8/9 branches?

gcc/

PR target/85667
* config/i386/i386.c (function_arg_ms_64): Add a type argument.
Don't return aggregates with only SFmode and DFmode in SSE
register.
(ix86_function_arg): Pass arg.type to function_arg_ms_64.

gcc/testsuite/

PR target/85667
* gcc.target/i386/pr85667-10.c: New test.
* gcc.target/i386/pr85667-7.c: Likewise.
* gcc.target/i386/pr85667-8.c: Likewise.
* gcc.target/i386/pr85667-9.c: Likewise.


-- 
H.J.
From e561fd8fcb46b8d8e40942c077e26ce120832747 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Wed, 5 Feb 2020 09:49:56 -0800
Subject: [PATCH] x86-64: Pass aggregates with only float/double in GPRs for
 MS_ABI

MS_ABI requires passing aggregates with only float/double in integer
registers.  I checked GCC's output against Clang and fixed:

FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=54 -Wno-unused-variable -Wno-unused-parameter -Wno-unused-but-set-variable -Wno-uninitialized -O0 -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=54 -Wno-unused-variable -Wno-unused-parameter -Wno-unused-but-set-variable -Wno-uninitialized -O2 -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=55 -Wno-unused-variable -Wno-unused-parameter -Wno-unused-but-set-variable -Wno-uninitialized -O0 -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=55 -Wno-unused-variable -Wno-unused-parameter -Wno-unused-but-set-variable -Wno-uninitialized -O2 -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=56 -Wno-unused-variable -Wno-unused-parameter -Wno-unused-but-set-variable -Wno-uninitialized -O0 -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=56 -Wno-unused-variable -Wno-unused-parameter -Wno-unused-but-set-variable -Wno-uninitialized -O2 -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test

in libffi testsuite.

gcc/

	PR target/85667
	* config/i386/i386.c (function_arg_ms_64): Add a type argument.
	Don't return aggregates with only SFmode and DFmode in SSE
	register.
	(ix86_function_arg): Pass arg.type to function_arg_ms_64.

gcc/testsuite/

	PR target/85667
	* gcc.target/i386/pr85667-10.c: New test.
	* gcc.target/i386/pr85667-7.c: Likewise.
	* gcc.target/i386/pr85667-8.c: Likewise.
	* gcc.target/i386/pr85667-9.c: Likewise.
---
 gcc/config/i386/i386.c | 10 --
 gcc/testsuite/gcc.target/i386/pr85667-10.c | 21 +
 gcc/testsuite/gcc.target/i386/pr85667-7.c  | 36 ++
 gcc/testsuite/gcc.target/i386/pr85667-8.c  | 21 +
 gcc/testsuite/gcc.target/i386/pr85667-9.c  | 36 ++
 5 files changed, 121 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr85667-10.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr85667-7.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr85667-8.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr85667-9.c

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index ffda3e8fd21..f769cb8f75e 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -3153,7 +3153,7 @@ function_arg_64 (const CUMULATIVE_ARGS *cum, machine_mode mode,
 
 static rtx
 function_arg_ms_64 (const CUMULATIVE_ARGS *cum, machine_mode mode,
-		ma

[PATCH] Add patch_area_size and patch_area_entry to crtl

2020-02-05 Thread H.J. Lu
On Wed, Feb 5, 2020 at 9:00 AM Richard Sandiford
 wrote:
>
> "H.J. Lu"  writes:
> > Currently patchable area is at the wrong place.
>
> Agreed :-)
>
> > It is placed immediately
> > after function label and before .cfi_startproc.  A backend should be able
> > to add a pseudo patchable area instruction durectly into RTL.  This patch
> > adds patch_area_size and patch_area_entry to cfun so that the patchable
> > area info is available in RTL passes.
>
> It might be better to add it to crtl, since it should only be needed
> during rtl generation.
>
> > It also limits patch_area_size and patch_area_entry to 65535, which is
> > a reasonable maximum size for patchable area.
> >
> > gcc/
> >
> >   PR target/93492
> >   * function.c (expand_function_start): Set cfun->patch_area_size
> >   and cfun->patch_area_entry.
> >   * function.h (function): Add patch_area_size and patch_area_entry.
> >   * opts.c (common_handle_option): Limit
> >   function_entry_patch_area_size and function_entry_patch_area_start
> >   to USHRT_MAX.  Fix a typo in error message.
> >   * varasm.c (assemble_start_function): Use cfun->patch_area_size
> >   and cfun->patch_area_entry.
> >   * doc/invoke.texi: Document the maximum value for
> >   -fpatchable-function-entry.
> >
> > gcc/testsuite/
> >
> >   PR target/93492
> >   * c-c++-common/patchable_function_entry-error-1.c: New test.
> >   * c-c++-common/patchable_function_entry-error-2.c: Likewise.
> >   * c-c++-common/patchable_function_entry-error-3.c: Likewise.
> > ---
> >  gcc/doc/invoke.texi   |  1 +
> >  gcc/function.c| 35 +++
> >  gcc/function.h|  6 
> >  gcc/opts.c|  4 ++-
> >  .../patchable_function_entry-error-1.c|  9 +
> >  .../patchable_function_entry-error-2.c|  9 +
> >  .../patchable_function_entry-error-3.c| 20 +++
> >  gcc/varasm.c  | 30 ++--
> >  8 files changed, 85 insertions(+), 29 deletions(-)
> >  create mode 100644 
> > gcc/testsuite/c-c++-common/patchable_function_entry-error-1.c
> >  create mode 100644 
> > gcc/testsuite/c-c++-common/patchable_function_entry-error-2.c
> >  create mode 100644 
> > gcc/testsuite/c-c++-common/patchable_function_entry-error-3.c
> >
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 35b341e759f..dd4835199b0 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -13966,6 +13966,7 @@ If @code{N=0}, no pad location is recorded.
> >  The NOP instructions are inserted at---and maybe before, depending on
> >  @var{M}---the function entry address, even before the prologue.
> >
> > +The maximum value of @var{N} and @var{M} is 65535.
> >  @end table
> >
> >
> > diff --git a/gcc/function.c b/gcc/function.c
> > index d8008f60422..badbf538eec 100644
> > --- a/gcc/function.c
> > +++ b/gcc/function.c
> > @@ -5202,6 +5202,41 @@ expand_function_start (tree subr)
> >/* If we are doing generic stack checking, the probe should go here.  */
> >if (flag_stack_check == GENERIC_STACK_CHECK)
> >  stack_check_probe_note = emit_note (NOTE_INSN_DELETED);
> > +
> > +  unsigned HOST_WIDE_INT patch_area_size = function_entry_patch_area_size;
> > +  unsigned HOST_WIDE_INT patch_area_entry = 
> > function_entry_patch_area_start;
> > +
> > +  tree patchable_function_entry_attr
> > += lookup_attribute ("patchable_function_entry",
> > + DECL_ATTRIBUTES (cfun->decl));
> > +  if (patchable_function_entry_attr)
> > +{
> > +  tree pp_val = TREE_VALUE (patchable_function_entry_attr);
> > +  tree patchable_function_entry_value1 = TREE_VALUE (pp_val);
> > +
> > +  patch_area_size = tree_to_uhwi (patchable_function_entry_value1);
> > +  patch_area_entry = 0;
> > +  if (TREE_CHAIN (pp_val) != NULL_TREE)
> > + {
> > +   tree patchable_function_entry_value2
> > + = TREE_VALUE (TREE_CHAIN (pp_val));
> > +   patch_area_entry = tree_to_uhwi (patchable_function_entry_value2);
> > + }
> > +  if (patch_area_size > USHRT_MAX || patch_area_size > USHRT_MAX)
> > + error ("invalid values for % attribute");
>
> This should probably go in handle_patchable_function_entry_attri

Re: [PATCH] Add patch_area_size and patch_area_entry to crtl

2020-02-05 Thread H.J. Lu
On Wed, Feb 5, 2020 at 12:20 PM H.J. Lu  wrote:
>
> On Wed, Feb 5, 2020 at 9:00 AM Richard Sandiford
>  wrote:
> >
> > "H.J. Lu"  writes:
> > > Currently patchable area is at the wrong place.
> >
> > Agreed :-)
> >
> > > It is placed immediately
> > > after function label and before .cfi_startproc.  A backend should be able
> > > to add a pseudo patchable area instruction durectly into RTL.  This patch
> > > adds patch_area_size and patch_area_entry to cfun so that the patchable
> > > area info is available in RTL passes.
> >
> > It might be better to add it to crtl, since it should only be needed
> > during rtl generation.
> >
> > > It also limits patch_area_size and patch_area_entry to 65535, which is
> > > a reasonable maximum size for patchable area.
> > >
> > > gcc/
> > >
> > >   PR target/93492
> > >   * function.c (expand_function_start): Set cfun->patch_area_size
> > >   and cfun->patch_area_entry.
> > >   * function.h (function): Add patch_area_size and patch_area_entry.
> > >   * opts.c (common_handle_option): Limit
> > >   function_entry_patch_area_size and function_entry_patch_area_start
> > >   to USHRT_MAX.  Fix a typo in error message.
> > >   * varasm.c (assemble_start_function): Use cfun->patch_area_size
> > >   and cfun->patch_area_entry.
> > >   * doc/invoke.texi: Document the maximum value for
> > >   -fpatchable-function-entry.
> > >
> > > gcc/testsuite/
> > >
> > >   PR target/93492
> > >   * c-c++-common/patchable_function_entry-error-1.c: New test.
> > >   * c-c++-common/patchable_function_entry-error-2.c: Likewise.
> > >   * c-c++-common/patchable_function_entry-error-3.c: Likewise.
> > > ---
> > >  gcc/doc/invoke.texi   |  1 +
> > >  gcc/function.c| 35 +++
> > >  gcc/function.h|  6 
> > >  gcc/opts.c|  4 ++-
> > >  .../patchable_function_entry-error-1.c|  9 +
> > >  .../patchable_function_entry-error-2.c|  9 +
> > >  .../patchable_function_entry-error-3.c| 20 +++
> > >  gcc/varasm.c  | 30 ++--
> > >  8 files changed, 85 insertions(+), 29 deletions(-)
> > >  create mode 100644 
> > > gcc/testsuite/c-c++-common/patchable_function_entry-error-1.c
> > >  create mode 100644 
> > > gcc/testsuite/c-c++-common/patchable_function_entry-error-2.c
> > >  create mode 100644 
> > > gcc/testsuite/c-c++-common/patchable_function_entry-error-3.c
> > >
> > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > index 35b341e759f..dd4835199b0 100644
> > > --- a/gcc/doc/invoke.texi
> > > +++ b/gcc/doc/invoke.texi
> > > @@ -13966,6 +13966,7 @@ If @code{N=0}, no pad location is recorded.
> > >  The NOP instructions are inserted at---and maybe before, depending on
> > >  @var{M}---the function entry address, even before the prologue.
> > >
> > > +The maximum value of @var{N} and @var{M} is 65535.
> > >  @end table
> > >
> > >
> > > diff --git a/gcc/function.c b/gcc/function.c
> > > index d8008f60422..badbf538eec 100644
> > > --- a/gcc/function.c
> > > +++ b/gcc/function.c
> > > @@ -5202,6 +5202,41 @@ expand_function_start (tree subr)
> > >/* If we are doing generic stack checking, the probe should go here.  
> > > */
> > >if (flag_stack_check == GENERIC_STACK_CHECK)
> > >  stack_check_probe_note = emit_note (NOTE_INSN_DELETED);
> > > +
> > > +  unsigned HOST_WIDE_INT patch_area_size = 
> > > function_entry_patch_area_size;
> > > +  unsigned HOST_WIDE_INT patch_area_entry = 
> > > function_entry_patch_area_start;
> > > +
> > > +  tree patchable_function_entry_attr
> > > += lookup_attribute ("patchable_function_entry",
> > > + DECL_ATTRIBUTES (cfun->decl));
> > > +  if (patchable_function_entry_attr)
> > > +{
> > > +  tree pp_val = TREE_VALUE (patchable_function_entry_attr);
> > > +  tree patchable_function_entry_value1 = TREE_VALUE (pp_val);
> > > +
> > > +  patch_area_size = tree_to_uhwi (patchable_function_entry_value1);
> > &g

Re: [PATCH] Add patch_area_size and patch_area_entry to crtl

2020-02-05 Thread H.J. Lu
On Wed, Feb 5, 2020 at 2:37 PM Marek Polacek  wrote:
>
> On Wed, Feb 05, 2020 at 02:24:48PM -0800, H.J. Lu wrote:
> > On Wed, Feb 5, 2020 at 12:20 PM H.J. Lu  wrote:
> > >
> > > On Wed, Feb 5, 2020 at 9:00 AM Richard Sandiford
> > >  wrote:
> > > >
> > > > "H.J. Lu"  writes:
> > > > > Currently patchable area is at the wrong place.
> > > >
> > > > Agreed :-)
> > > >
> > > > > It is placed immediately
> > > > > after function label and before .cfi_startproc.  A backend should be 
> > > > > able
> > > > > to add a pseudo patchable area instruction durectly into RTL.  This 
> > > > > patch
> > > > > adds patch_area_size and patch_area_entry to cfun so that the 
> > > > > patchable
> > > > > area info is available in RTL passes.
> > > >
> > > > It might be better to add it to crtl, since it should only be needed
> > > > during rtl generation.
> > > >
> > > > > It also limits patch_area_size and patch_area_entry to 65535, which is
> > > > > a reasonable maximum size for patchable area.
> > > > >
> > > > > gcc/
> > > > >
> > > > >   PR target/93492
> > > > >   * function.c (expand_function_start): Set cfun->patch_area_size
> > > > >   and cfun->patch_area_entry.
> > > > >   * function.h (function): Add patch_area_size and 
> > > > > patch_area_entry.
> > > > >   * opts.c (common_handle_option): Limit
> > > > >   function_entry_patch_area_size and 
> > > > > function_entry_patch_area_start
> > > > >   to USHRT_MAX.  Fix a typo in error message.
> > > > >   * varasm.c (assemble_start_function): Use cfun->patch_area_size
> > > > >   and cfun->patch_area_entry.
> > > > >   * doc/invoke.texi: Document the maximum value for
> > > > >   -fpatchable-function-entry.
> > > > >
> > > > > gcc/testsuite/
> > > > >
> > > > >   PR target/93492
> > > > >   * c-c++-common/patchable_function_entry-error-1.c: New test.
> > > > >   * c-c++-common/patchable_function_entry-error-2.c: Likewise.
> > > > >   * c-c++-common/patchable_function_entry-error-3.c: Likewise.
> > > > > ---
> > > > >  gcc/doc/invoke.texi   |  1 +
> > > > >  gcc/function.c| 35 
> > > > > +++
> > > > >  gcc/function.h|  6 
> > > > >  gcc/opts.c|  4 ++-
> > > > >  .../patchable_function_entry-error-1.c|  9 +
> > > > >  .../patchable_function_entry-error-2.c|  9 +
> > > > >  .../patchable_function_entry-error-3.c| 20 +++
> > > > >  gcc/varasm.c  | 30 ++--
> > > > >  8 files changed, 85 insertions(+), 29 deletions(-)
> > > > >  create mode 100644 
> > > > > gcc/testsuite/c-c++-common/patchable_function_entry-error-1.c
> > > > >  create mode 100644 
> > > > > gcc/testsuite/c-c++-common/patchable_function_entry-error-2.c
> > > > >  create mode 100644 
> > > > > gcc/testsuite/c-c++-common/patchable_function_entry-error-3.c
> > > > >
> > > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > > > index 35b341e759f..dd4835199b0 100644
> > > > > --- a/gcc/doc/invoke.texi
> > > > > +++ b/gcc/doc/invoke.texi
> > > > > @@ -13966,6 +13966,7 @@ If @code{N=0}, no pad location is recorded.
> > > > >  The NOP instructions are inserted at---and maybe before, depending on
> > > > >  @var{M}---the function entry address, even before the prologue.
> > > > >
> > > > > +The maximum value of @var{N} and @var{M} is 65535.
> > > > >  @end table
> > > > >
> > > > >
> > > > > diff --git a/gcc/function.c b/gcc/function.c
> > > > > index d8008f60422..badbf538eec 100644
> > > > > --- a/gcc/function.c
> > > > > +++ b/gcc/function.c
> > > > > @@ -5202,6 +5202,41 @@ expand_function_start (tree subr)
> > &

Re: [PATCH] Add patch_area_size and patch_area_entry to crtl

2020-02-05 Thread H.J. Lu
On Wed, Feb 5, 2020 at 2:51 PM H.J. Lu  wrote:
>
> On Wed, Feb 5, 2020 at 2:37 PM Marek Polacek  wrote:
> >
> > On Wed, Feb 05, 2020 at 02:24:48PM -0800, H.J. Lu wrote:
> > > On Wed, Feb 5, 2020 at 12:20 PM H.J. Lu  wrote:
> > > >
> > > > On Wed, Feb 5, 2020 at 9:00 AM Richard Sandiford
> > > >  wrote:
> > > > >
> > > > > "H.J. Lu"  writes:
> > > > > > Currently patchable area is at the wrong place.
> > > > >
> > > > > Agreed :-)
> > > > >
> > > > > > It is placed immediately
> > > > > > after function label and before .cfi_startproc.  A backend should 
> > > > > > be able
> > > > > > to add a pseudo patchable area instruction durectly into RTL.  This 
> > > > > > patch
> > > > > > adds patch_area_size and patch_area_entry to cfun so that the 
> > > > > > patchable
> > > > > > area info is available in RTL passes.
> > > > >
> > > > > It might be better to add it to crtl, since it should only be needed
> > > > > during rtl generation.
> > > > >
> > > > > > It also limits patch_area_size and patch_area_entry to 65535, which 
> > > > > > is
> > > > > > a reasonable maximum size for patchable area.
> > > > > >
> > > > > > gcc/
> > > > > >
> > > > > >   PR target/93492
> > > > > >   * function.c (expand_function_start): Set 
> > > > > > cfun->patch_area_size
> > > > > >   and cfun->patch_area_entry.
> > > > > >   * function.h (function): Add patch_area_size and 
> > > > > > patch_area_entry.
> > > > > >   * opts.c (common_handle_option): Limit
> > > > > >   function_entry_patch_area_size and 
> > > > > > function_entry_patch_area_start
> > > > > >   to USHRT_MAX.  Fix a typo in error message.
> > > > > >   * varasm.c (assemble_start_function): Use 
> > > > > > cfun->patch_area_size
> > > > > >   and cfun->patch_area_entry.
> > > > > >   * doc/invoke.texi: Document the maximum value for
> > > > > >   -fpatchable-function-entry.
> > > > > >
> > > > > > gcc/testsuite/
> > > > > >
> > > > > >   PR target/93492
> > > > > >   * c-c++-common/patchable_function_entry-error-1.c: New test.
> > > > > >   * c-c++-common/patchable_function_entry-error-2.c: Likewise.
> > > > > >   * c-c++-common/patchable_function_entry-error-3.c: Likewise.
> > > > > > ---
> > > > > >  gcc/doc/invoke.texi   |  1 +
> > > > > >  gcc/function.c| 35 
> > > > > > +++
> > > > > >  gcc/function.h|  6 
> > > > > >  gcc/opts.c|  4 ++-
> > > > > >  .../patchable_function_entry-error-1.c|  9 +
> > > > > >  .../patchable_function_entry-error-2.c|  9 +
> > > > > >  .../patchable_function_entry-error-3.c| 20 +++
> > > > > >  gcc/varasm.c  | 30 ++--
> > > > > >  8 files changed, 85 insertions(+), 29 deletions(-)
> > > > > >  create mode 100644 
> > > > > > gcc/testsuite/c-c++-common/patchable_function_entry-error-1.c
> > > > > >  create mode 100644 
> > > > > > gcc/testsuite/c-c++-common/patchable_function_entry-error-2.c
> > > > > >  create mode 100644 
> > > > > > gcc/testsuite/c-c++-common/patchable_function_entry-error-3.c
> > > > > >
> > > > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > > > > > index 35b341e759f..dd4835199b0 100644
> > > > > > --- a/gcc/doc/invoke.texi
> > > > > > +++ b/gcc/doc/invoke.texi
> > > > > > @@ -13966,6 +13966,7 @@ If @code{N=0}, no pad location is recorded.
> > > > > >  The NOP instructions are inserted at---and maybe before, depending 
> > > > > > on
> > > > > >  @var{M}---

[PATCH] Use the section flag 'o' for __patchable_function_entries

2020-02-06 Thread H.J. Lu
This commit in GNU binutils 2.35:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=commit;h=b7d072167715829eed0622616f6ae0182900de3e

added the section flag 'o' to the .section directive:

.section __patchable_function_entries,"awo",@progbits,foo

which specifies the symbol that the section references.  The assembler
creates a unique __patchable_function_entries section whose linked-to
section is the section where foo is defined.  The linker keeps a section
if its linked-to section is kept during garbage collection.

This patch checks assembler support for the section flag 'o' and uses it
to implement the __patchable_function_entries section.  Solaris may use
the GNU assembler with the Solaris ld; even if the GNU assembler supports
the section flag 'o', that doesn't mean the Solaris ld supports it.  This
feature is therefore disabled for Solaris targets.
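
For illustration (not one of the new tests; "foo" matches the directive
quoted above, the rest is an assumption):

/* Assumed options: -O2 -fpatchable-function-entry=1  */
void
foo (void)
{
}

/* With an assembler that supports the 'o' flag, the patch emits

   .section __patchable_function_entries,"awo",@progbits,foo

   so the linker can drop this entry's section when the section that
   defines foo is garbage-collected.  */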

gcc/

PR middle-end/93195
PR middle-end/93197
* configure.ac (HAVE_GAS_SECTION_LINK_ORDER): New.  Define if
the assembler supports the section flag 'o' for specifying
section with link-order.
* dwarf2out.c (output_comdat_type_unit): Pass 0 as flags2
to targetm.asm_out.named_section.
* config/sol2.c (solaris_elf_asm_comdat_section): Likewise.
* output.h (SECTION2_LINK_ORDER): New.
(switch_to_section): Add an unsigned int argument.
(default_no_named_section): Likewise.
(default_elf_asm_named_section): Likewise.
* target.def (asm_out.named_section): Likewise.
* targhooks.c (default_print_patchable_function_entry): Pass
current_function_decl to get_section and SECTION2_LINK_ORDER
to switch_to_section.
* varasm.c (default_no_named_section): Add an unsigned int
argument.
(default_elf_asm_named_section): Add an unsigned int argument,
flags2.  Use 'o' flag for SECTION2_LINK_ORDER if assembler
supports it.
(switch_to_section): Add an unsigned int argument and pass it
to targetm.asm_out.named_section.
(handle_vtv_comdat_section): Pass 0 to
targetm.asm_out.named_section.
* config.in: Regenerated.
* configure: Likewise.
* doc/tm.texi: Likewise.

gcc/testsuite/

PR middle-end/93195
* g++.dg/pr93195a.C: New test.
* g++.dg/pr93195b.C: Likewise.
* lib/target-supports.exp
(check_effective_target_o_flag_in_section): New proc.
---
 gcc/config.in |  6 
 gcc/config/sol2.c |  3 +-
 gcc/configure | 52 +++
 gcc/configure.ac  | 22 
 gcc/doc/tm.texi   |  5 +--
 gcc/dwarf2out.c   |  4 +--
 gcc/output.h  | 11 --
 gcc/target.def|  5 +--
 gcc/targhooks.c   |  4 ++-
 gcc/testsuite/g++.dg/pr93195a.C   | 27 ++
 gcc/testsuite/g++.dg/pr93195b.C   | 14 
 gcc/testsuite/lib/target-supports.exp | 40 +
 gcc/varasm.c  | 25 ++---
 13 files changed, 202 insertions(+), 16 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/pr93195a.C
 create mode 100644 gcc/testsuite/g++.dg/pr93195b.C

diff --git a/gcc/config.in b/gcc/config.in
index 48292861842..d1ecc5b15a6 100644
--- a/gcc/config.in
+++ b/gcc/config.in
@@ -1313,6 +1313,12 @@
 #endif
 
 
+/* Define if your assembler supports 'o' flag in .section directive. */
+#ifndef USED_FOR_TARGET
+#undef HAVE_GAS_SECTION_LINK_ORDER
+#endif
+
+
 /* Define 0/1 if your assembler supports marking sections with SHF_MERGE flag.
*/
 #ifndef USED_FOR_TARGET
diff --git a/gcc/config/sol2.c b/gcc/config/sol2.c
index cf9d9f1f684..62bbdec2f97 100644
--- a/gcc/config/sol2.c
+++ b/gcc/config/sol2.c
@@ -224,7 +224,8 @@ solaris_elf_asm_comdat_section (const char *name, unsigned 
int flags, tree decl)
  emits this as a regular section.  Emit section before .group
  directive since Sun as treats undeclared sections as @progbits,
  which conflicts with .bss* sections which are @nobits.  */
-  targetm.asm_out.named_section (section, flags & ~SECTION_LINKONCE, decl);
+  targetm.asm_out.named_section (section, flags & ~SECTION_LINKONCE,
+0, decl);
   
   /* Sun as separates declaration of a group section and of the group
  itself, using the .group directive and the #comdat flag.  */
diff --git a/gcc/configure b/gcc/configure
index 5fa565a40a4..a7315e33a62 100755
--- a/gcc/configure
+++ b/gcc/configure
@@ -24185,6 +24185,58 @@ cat >>confdefs.h <<_ACEOF
 _ACEOF
 
 
+# Test if the assembler supports the section flag 'o' for specifying
+# section with link-order.
+case "${target}" in
+  # Solaris may use GNU assembler with Solaris ld.  Even if GNU
+  # assembler supports the section flag 'o', it doesn't mean that
+  # Solaris ld supports it.
+  *-*-solaris2*)
+gcc_cv

Re: [PATCH] x86-64: Pass aggregates with only float/double in GPRs for MS_ABI

2020-02-06 Thread H.J. Lu
On Wed, Feb 05, 2020 at 09:51:14PM +0100, Uros Bizjak wrote:
> On Wed, Feb 5, 2020 at 6:59 PM H.J. Lu  wrote:
> >
> > MS_ABI requires passing aggregates with only float/double in integer
> > registers.  Checked gcc outputs against Clang and fixed:
> >
> > FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=54
> > -Wno-unused-variable -Wno-unused-parameter
> > -Wno-unused-but-set-variable -Wno-uninitialized -O0
> > -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
> > FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=54
> > -Wno-unused-variable -Wno-unused-parameter
> > -Wno-unused-but-set-variable -Wno-uninitialized -O2
> > -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
> > FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=55
> > -Wno-unused-variable -Wno-unused-parameter
> > -Wno-unused-but-set-variable -Wno-uninitialized -O0
> > -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
> > FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=55
> > -Wno-unused-variable -Wno-unused-parameter
> > -Wno-unused-but-set-variable -Wno-uninitialized -O2
> > -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
> > FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=56
> > -Wno-unused-variable -Wno-unused-parameter
> > -Wno-unused-but-set-variable -Wno-uninitialized -O0
> > -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
> > FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=56
> > -Wno-unused-variable -Wno-unused-parameter
> > -Wno-unused-but-set-variable -Wno-uninitialized -O2
> > -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
> >
> > in libffi testsuite.
> >
> > OK for master and backports to GCC 8/9 branches?
> >
> > gcc/
> >
> > PR target/85667
> > * config/i386/i386.c (function_arg_ms_64): Add a type argument.
> > Don't return aggregates with only SFmode and DFmode in SSE
> > register.
> > (ix86_function_arg): Pass arg.type to function_arg_ms_64.
> >
> > gcc/testsuite/
> >
> > PR target/85667
> > * gcc.target/i386/pr85667-10.c: New test.
> > * gcc.target/i386/pr85667-7.c: Likewise.
> > * gcc.target/i386/pr85667-8.c: Likewise.
> > * gcc.target/i386/pr85667-9.c: Likewise.
> 
> LGTM, but should really be reviewed by cygwin, mingw-w64 maintainer (CC'd).
> 

I checked the result against MSVC v19.10 at

https://godbolt.org/z/2NPygd

My patch matches MSVC v19.10.  I am checking it in tomorrow unless
mingw-w64 maintainer objects.
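
For context, the kind of aggregate at issue looks like this (a
hypothetical example in the spirit of the pr85667 tests, not taken
from the patch; the function and type names are made up):

  /* An 8-byte aggregate containing only floats.  With the fix, an
     ms_abi call passes it in an integer register (%rcx for the first
     argument), matching MSVC, instead of in an SSE register.  */
  struct twofloat { float x, y; };

  extern void __attribute__ ((ms_abi)) callee (struct twofloat);

  void __attribute__ ((ms_abi))
  caller (void)
  {
    struct twofloat a = { 1.0f, 2.0f };
    callee (a);
  }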

Thanks.

H.J.


PING^6: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move

2020-02-06 Thread H.J. Lu
On Mon, Jan 27, 2020 at 10:59 AM H.J. Lu  wrote:
>
> On Mon, Jul 8, 2019 at 8:19 AM H.J. Lu  wrote:
> >
> > On Tue, Jun 18, 2019 at 8:59 AM H.J. Lu  wrote:
> > >
> > > On Fri, May 31, 2019 at 10:38 AM H.J. Lu  wrote:
> > > >
> > > > On Tue, May 21, 2019 at 2:43 PM H.J. Lu  wrote:
> > > > >
> > > > > On Fri, Feb 22, 2019 at 8:25 AM H.J. Lu  wrote:
> > > > > >
> > > > > > Hi Jan, Uros,
> > > > > >
> > > > > > This patch fixes the wrong code bug:
> > > > > >
> > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89229
> > > > > >
> > > > > > Tested on AVX2 and AVX512 with and without --with-arch=native.
> > > > > >
> > > > > > OK for trunk?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > H.J.
> > > > > > --
> > > > > > i386 backend has
> > > > > >
> > > > > > INT_MODE (OI, 32);
> > > > > > INT_MODE (XI, 64);
> > > > > >
> > > > > > So, XI_MODE represents 64 INTEGER bytes = 64 * 8 = 512 bit 
> > > > > > operation,
> > > > > > in case of const_1, all 512 bits set.
> > > > > >
> > > > > > We can load zeros with narrower instruction, (e.g. 256 bit by 
> > > > > > inherent
> > > > > > zeroing of highpart in case of 128 bit xor), so TImode in this case.
> > > > > >
> > > > > > Some targets prefer V4SF mode, so they will emit float xorps for 
> > > > > > zeroing.
> > > > > >
> > > > > > sse.md has
> > > > > >
> > > > > > (define_insn "mov_internal"
> > > > > >   [(set (match_operand:VMOVE 0 "nonimmediate_operand"
> > > > > >  "=v,v ,v ,m")
> > > > > > (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand"
> > > > > >  " C,BC,vm,v"))]
> > > > > > 
> > > > > >   /* There is no evex-encoded vmov* for sizes smaller than 
> > > > > > 64-bytes
> > > > > >  in avx512f, so we need to use workarounds, to access sse 
> > > > > > registers
> > > > > >  16-31, which are evex-only. In avx512vl we don't need 
> > > > > > workarounds.  */
> > > > > >   if (TARGET_AVX512F &&  < 64 && !TARGET_AVX512VL
> > > > > >   && (EXT_REX_SSE_REG_P (operands[0])
> > > > > >   || EXT_REX_SSE_REG_P (operands[1])))
> > > > > > {
> > > > > >   if (memory_operand (operands[0], mode))
> > > > > > {
> > > > > >   if ( == 32)
> > > > > > return "vextract64x4\t{$0x0, %g1, 
> > > > > > %0|%0, %g1, 0x0}";
> > > > > >   else if ( == 16)
> > > > > > return "vextract32x4\t{$0x0, %g1, 
> > > > > > %0|%0, %g1, 0x0}";
> > > > > >   else
> > > > > > gcc_unreachable ();
> > > > > > }
> > > > > > ...
> > > > > >
> > > > > > However, since ix86_hard_regno_mode_ok has
> > > > > >
> > > > > >  /* TODO check for QI/HI scalars.  */
> > > > > >   /* AVX512VL allows sse regs16+ for 128/256 bit modes.  */
> > > > > >   if (TARGET_AVX512VL
> > > > > >   && (mode == OImode
> > > > > >   || mode == TImode
> > > > > >   || VALID_AVX256_REG_MODE (mode)
> > > > > >   || VALID_AVX512VL_128_REG_MODE (mode)))
> > > > > > return true;
> > > > > >
> > > > > >   /* xmm16-xmm31 are only available for AVX-512.  */
> > > > > >   if (EXT_REX_SSE_REGNO_P (regno))
> > > > > > return false;
> > > > > >
> > > > > >   if (TARGET_AVX512F &&  < 64 && !TARGET_AVX512VL
> > > > >

Re: [PATCH] x86-64: Pass aggregates with only float/double in GPRs for MS_ABI

2020-02-07 Thread H.J. Lu
On Fri, Feb 7, 2020 at 2:14 AM JonY <10wa...@gmail.com> wrote:
>
> On 2/7/20 3:23 AM, H.J. Lu wrote:
> > On Wed, Feb 05, 2020 at 09:51:14PM +0100, Uros Bizjak wrote:
> >> On Wed, Feb 5, 2020 at 6:59 PM H.J. Lu  wrote:
> >>>
> >>> MS_ABI requires passing aggregates with only float/double in integer
> >>> registers.  Checked gcc outputs against Clang and fixed:
> >>>
> >>> FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=54
> >>> -Wno-unused-variable -Wno-unused-parameter
> >>> -Wno-unused-but-set-variable -Wno-uninitialized -O0
> >>> -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
> >>> FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=54
> >>> -Wno-unused-variable -Wno-unused-parameter
> >>> -Wno-unused-but-set-variable -Wno-uninitialized -O2
> >>> -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
> >>> FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=55
> >>> -Wno-unused-variable -Wno-unused-parameter
> >>> -Wno-unused-but-set-variable -Wno-uninitialized -O0
> >>> -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
> >>> FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=55
> >>> -Wno-unused-variable -Wno-unused-parameter
> >>> -Wno-unused-but-set-variable -Wno-uninitialized -O2
> >>> -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
> >>> FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=56
> >>> -Wno-unused-variable -Wno-unused-parameter
> >>> -Wno-unused-but-set-variable -Wno-uninitialized -O0
> >>> -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
> >>> FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=56
> >>> -Wno-unused-variable -Wno-unused-parameter
> >>> -Wno-unused-but-set-variable -Wno-uninitialized -O2
> >>> -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test
> >>>
> >>> in libffi testsuite.
> >>>
> >>> OK for master and backports to GCC 8/9 branches?
> >>>
> >>> gcc/
> >>>
> >>> PR target/85667
> >>> * config/i386/i386.c (function_arg_ms_64): Add a type argument.
> >>> Don't return aggregates with only SFmode and DFmode in SSE
> >>> register.
> >>> (ix86_function_arg): Pass arg.type to function_arg_ms_64.
> >>>
> >>> gcc/testsuite/
> >>>
> >>> PR target/85667
> >>> * gcc.target/i386/pr85667-10.c: New test.
> >>> * gcc.target/i386/pr85667-7.c: Likewise.
> >>> * gcc.target/i386/pr85667-8.c: Likewise.
> >>> * gcc.target/i386/pr85667-9.c: Likewise.
> >>
> >> LGTM, but should really be reviewed by cygwin, mingw-w64 maintainer (CC'd).
> >>
> >
> > I checked the result against MSVC v19.10 at
> >
> > https://godbolt.org/z/2NPygd
> >
> > My patch matches MSVC v19.10.  I am checking it in tomorrow unless
> > mingw-w64 maintainer objects.
> >
>
> Please go ahead and thanks.
>

I checked it into master and backported it to releases/gcc-9 branch.
No plan to fix releases/gcc-8 branch.

Thanks.

-- 
H.J.


[PATCH] i386: Properly pop restore token in signal frame

2020-02-08 Thread H.J. Lu
The Linux CET kernel places a restore token on the shadow stack for the
signal handler to enhance security.  The restore token is 8 bytes and
aligned to 8 bytes.  It is usually transparent to user programs since
the kernel pops the restore token when the signal handler returns.  But
when an exception is thrown from a signal handler, we now need to pop
the restore token from the shadow stack.  For x86-64, we just need to
treat the signal frame as a normal frame.  For i386, we need to search
for the restore token to check whether the original shadow stack is
8-byte aligned.  If the original shadow stack is 8-byte aligned, we
just need to pop 2 slots (one restore token) from the shadow stack.
Otherwise, we need to pop 3 slots (one restore token + 4 bytes of
padding) from the shadow stack.

This patch also includes 2 tests: one has a restore token with 4-byte
padding and one without.
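
A minimal sketch of that slot accounting (illustrative only, not the
_Unwind_Frames_Increment change below; the helper name and the
assumption that the interrupted shadow stack pointer is already known
are mine):

  /* How many 4-byte shadow stack slots to pop for an i386 signal
     frame, given the shadow stack pointer value in effect before the
     signal was delivered (hypothetical helper for illustration).  */
  static unsigned int
  i386_signal_frame_slots (unsigned long prev_ssp)
  {
    /* 8-byte aligned: the kernel pushed only the 8-byte restore
       token, i.e. 2 slots.  Otherwise it also pushed 4 bytes of
       padding, i.e. 3 slots.  */
    return (prev_ssp & 7) == 0 ? 2 : 3;
  }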

Tested on Linux/x86-64 CET machine with and without -m32.  OK for master
and backport to GCC 8/9 branches?

Thanks.

H.J.
---
libgcc/

PR libgcc/85334
* config/i386/shadow-stack-unwind.h (_Unwind_Frames_Increment):
New.

gcc/testsuite/

PR libgcc/85334
* g++.target/i386/pr85334-1.C: New test.
* g++.target/i386/pr85334-2.C: Likewise.
---
 gcc/testsuite/g++.target/i386/pr85334-1.C | 55 +++
 gcc/testsuite/g++.target/i386/pr85334-2.C | 48 
 libgcc/config/i386/shadow-stack-unwind.h  | 43 ++
 3 files changed, 146 insertions(+)
 create mode 100644 gcc/testsuite/g++.target/i386/pr85334-1.C
 create mode 100644 gcc/testsuite/g++.target/i386/pr85334-2.C

diff --git a/gcc/testsuite/g++.target/i386/pr85334-1.C 
b/gcc/testsuite/g++.target/i386/pr85334-1.C
new file mode 100644
index 000..3c5ccad1714
--- /dev/null
+++ b/gcc/testsuite/g++.target/i386/pr85334-1.C
@@ -0,0 +1,55 @@
+// { dg-do run }
+// { dg-require-effective-target cet }
+// { dg-additional-options "-fexceptions -fnon-call-exceptions 
-fcf-protection" }
+
+// Delta between numbers of call stacks of pr85334-1.C and pr85334-2.C is 1.
+
+#include 
+#include 
+
+void sighandler (int signo, siginfo_t * si, void * uc)
+{
+  throw (5);
+}
+
+char *
+__attribute ((noinline, noclone))
+dosegv ()
+{
+  * ((volatile int *)0) = 12;
+  return 0;
+}
+
+int
+__attribute ((noinline, noclone))
+func2 ()
+{
+  try {
+dosegv ();
+  }
+  catch (int x) {
+return (x != 5);
+  }
+  return 1;
+}
+
+int
+__attribute ((noinline, noclone))
+func1 ()
+{
+  return func2 ();
+}
+
+int main ()
+{
+  struct sigaction sa;
+  int status;
+
+  sa.sa_sigaction = sighandler;
+  sa.sa_flags = SA_SIGINFO;
+
+  status = sigaction (SIGSEGV, & sa, NULL);
+  status = sigaction (SIGBUS, & sa, NULL);
+
+  return func1 ();
+}
diff --git a/gcc/testsuite/g++.target/i386/pr85334-2.C 
b/gcc/testsuite/g++.target/i386/pr85334-2.C
new file mode 100644
index 000..e2b5afe78cb
--- /dev/null
+++ b/gcc/testsuite/g++.target/i386/pr85334-2.C
@@ -0,0 +1,48 @@
+// { dg-do run }
+// { dg-require-effective-target cet }
+// { dg-additional-options "-fexceptions -fnon-call-exceptions 
-fcf-protection" }
+
+// Delta between numbers of call stacks of pr85334-1.C and pr85334-2.C is 1.
+
+#include 
+#include 
+
+void sighandler (int signo, siginfo_t * si, void * uc)
+{
+  throw (5);
+}
+
+char *
+__attribute ((noinline, noclone))
+dosegv ()
+{
+  * ((volatile int *)0) = 12;
+  return 0;
+}
+
+int
+__attribute ((noinline, noclone))
+func1 ()
+{
+  try {
+dosegv ();
+  }
+  catch (int x) {
+return (x != 5);
+  }
+  return 1;
+}
+
+int main ()
+{
+  struct sigaction sa;
+  int status;
+
+  sa.sa_sigaction = sighandler;
+  sa.sa_flags = SA_SIGINFO;
+
+  status = sigaction (SIGSEGV, & sa, NULL);
+  status = sigaction (SIGBUS, & sa, NULL);
+
+  return func1 ();
+}
diff --git a/libgcc/config/i386/shadow-stack-unwind.h 
b/libgcc/config/i386/shadow-stack-unwind.h
index a0244d2ee70..a5f9235b146 100644
--- a/libgcc/config/i386/shadow-stack-unwind.h
+++ b/libgcc/config/i386/shadow-stack-unwind.h
@@ -49,3 +49,46 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If 
not, see
}   \
 }  \
 while (0)
+
+/* Linux CET kernel places a restore token on shadow stack for signal
+   handler to enhance security.  The restore token is 8 byte and aligned
+   to 8 bytes.  It is usually transparent to user programs since kernel
+   will pop the restore token when signal handler returns.  But when an
+   exception is thrown from a signal handler, now we need to pop the
+   restore token from shadow stack.  For x86-64, we just need to treat
+   the signal frame as normal frame.  For i386, we need to search for
+   the restore token to check if the original shadow stack is 8 byte
+   aligned.  If the original shadow stack is 8 byte aligned, we just
+   need to pop 2 slots, one restore token, from shadow stack.  Otherwise,
+   we need to pop 3 slots, one restore token + 4

[PATCH] i386: Skip ENDBR32 at nested function entry

2020-02-10 Thread H.J. Lu
Since nested function isn't only called directly, there is ENDBR32 at
function entry and we need to skip it for direct jump in trampoline.

Tested on Linux/x86-64 CET machine with and without -m32.

gcc/

PR target/93656
* config/i386/i386.c (ix86_trampoline_init): Skip ENDBR32 at
nested function entry.

gcc/testsuite/

PR target/93656
* gcc.target/i386/pr93656.c: New test.
---
 gcc/config/i386/i386.c  | 7 ++-
 gcc/testsuite/gcc.target/i386/pr93656.c | 4 
 2 files changed, 10 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93656.c

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 44bc0e0176a..dbcae244acb 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -16839,9 +16839,14 @@ ix86_trampoline_init (rtx m_tramp, tree fndecl, rtx 
chain_value)
 the stack, we need to skip the first insn which pushes the
 (call-saved) register static chain; this push is 1 byte.  */
   offset += 5;
+  int skip = MEM_P (chain) ? 1 : 0;
+  /* Since nested function isn't only called directly, there is
+ENDBR32 at function entry and we need to skip it.  */
+  if (need_endbr)
+   skip += 4;
   disp = expand_binop (SImode, sub_optab, fnaddr,
   plus_constant (Pmode, XEXP (m_tramp, 0),
- offset - (MEM_P (chain) ? 1 : 0)),
+ offset - skip),
   NULL_RTX, 1, OPTAB_DIRECT);
   emit_move_insn (mem, disp);
 }
diff --git a/gcc/testsuite/gcc.target/i386/pr93656.c 
b/gcc/testsuite/gcc.target/i386/pr93656.c
new file mode 100644
index 000..f0ac8c8edaa
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr93656.c
@@ -0,0 +1,4 @@
+/* { dg-do run { target { ia32 && cet } } } */
+/* { dg-options "-O2 -fcf-protection" } */
+
+#include "pr67770.c"
-- 
2.24.1



Re: [PATCH] i386: Skip ENDBR32 at nested function entry

2020-02-10 Thread H.J. Lu
On Mon, Feb 10, 2020 at 11:40 AM Uros Bizjak  wrote:
>
> On Mon, Feb 10, 2020 at 8:22 PM H.J. Lu  wrote:
> >
> > Since nested function isn't only called directly, there is ENDBR32 at
> > function entry and we need to skip it for direct jump in trampoline.
>
> Hm, I'm afraid I don't understand this comment. Can you perhaps rephrase it?
>

ix86_trampoline_init has

 /* Compute offset from the end of the jmp to the target function.
 In the case in which the trampoline stores the static chain on
 the stack, we need to skip the first insn which pushes the
 (call-saved) register static chain; this push is 1 byte.  */
  offset += 5;
  disp = expand_binop (SImode, sub_optab, fnaddr,
   plus_constant (Pmode, XEXP (m_tramp, 0),
  offset - (MEM_P (chain) ? 1 : 0)),
   NULL_RTX, 1, OPTAB_DIRECT);
  emit_move_insn (mem, disp);

Without CET, we got

011 <bar>:
  11:   56                      push   %esi
  12:   55                      push   %ebp   <<<<<< trampoline jumps here.
  13:   89 e5                   mov    %esp,%ebp
  15:   83 ec 08                sub    $0x8,%esp

With CET, if bar isn't only called directly, we got

0015 <bar>:
  15:   f3 0f 1e fb             endbr32
  19:   56                      push   %esi
  1a:   55                      push   %ebp   <<<<<<<< trampoline jumps here.
  1b:   89 e5                   mov    %esp,%ebp
  1d:   83 ec 08                sub    $0x8,%esp

We need to add 4 bytes for trampoline to skip endbr32.
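
Spelling out the arithmetic in that hunk (my reading, for illustration
only): the stored rel32 is

  disp = fnaddr - (m_tramp + offset - skip)

and since the jmp ends at m_tramp + offset, the jump resolves to

  target = (m_tramp + offset) + disp = fnaddr + skip

i.e. entry + 1 (past the 1-byte static chain push) without CET, and
entry + 4 + 1 (past ENDBR32 plus the push) with CET, which is exactly
the push %ebp marked in the dumps above.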

Here is the updated patch to check if the nested function isn't only
called directly.


-- 
H.J.
From 10ffeb41d1cdbd42f19ba08fdd6ce4a9913a5b5b Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Mon, 10 Feb 2020 11:10:52 -0800
Subject: [PATCH] i386: Skip ENDBR32 at nested function entry

If nested function isn't only called directly, there is ENDBR32 at
function entry and we need to skip it for direct jump in trampoline.

Tested on Linux/x86-64 CET machine with and without -m32.

gcc/

	PR target/93656
	* config/i386/i386.c (ix86_trampoline_init): Skip ENDBR32 at
	nested function entry.

gcc/testsuite/

	PR target/93656
	* gcc.target/i386/pr93656.c: New test.
---
 gcc/config/i386/i386.c  | 8 +++-
 gcc/testsuite/gcc.target/i386/pr93656.c | 4 
 2 files changed, 11 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93656.c

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 44bc0e0176a..bc494ec19b6 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -16839,9 +16839,15 @@ ix86_trampoline_init (rtx m_tramp, tree fndecl, rtx chain_value)
 	 the stack, we need to skip the first insn which pushes the
 	 (call-saved) register static chain; this push is 1 byte.  */
   offset += 5;
+  int skip = MEM_P (chain) ? 1 : 0;
+  /* If nested function isn't only called directly, there is ENDBR32
+	 at function entry and we need to skip it.  */
+  if (need_endbr
+	  && !cgraph_node::get (fndecl)->only_called_directly_p ())
+	skip += 4;
   disp = expand_binop (SImode, sub_optab, fnaddr,
 			   plus_constant (Pmode, XEXP (m_tramp, 0),
-	  offset - (MEM_P (chain) ? 1 : 0)),
+	  offset - skip),
 			   NULL_RTX, 1, OPTAB_DIRECT);
   emit_move_insn (mem, disp);
 }
diff --git a/gcc/testsuite/gcc.target/i386/pr93656.c b/gcc/testsuite/gcc.target/i386/pr93656.c
new file mode 100644
index 000..f0ac8c8edaa
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr93656.c
@@ -0,0 +1,4 @@
+/* { dg-do run { target { ia32 && cet } } } */
+/* { dg-options "-O2 -fcf-protection" } */
+
+#include "pr67770.c"
-- 
2.24.1



Re: [PATCH] i386: Skip ENDBR32 at nested function entry

2020-02-12 Thread H.J. Lu
On Mon, Feb 10, 2020 at 12:01 PM Uros Bizjak  wrote:
>
> On Mon, Feb 10, 2020 at 8:53 PM H.J. Lu  wrote:
> >
> > On Mon, Feb 10, 2020 at 11:40 AM Uros Bizjak  wrote:
> > >
> > > On Mon, Feb 10, 2020 at 8:22 PM H.J. Lu  wrote:
> > > >
> > > > Since nested function isn't only called directly, there is ENDBR32 at
> > > > function entry and we need to skip it for direct jump in trampoline.
> > >
> > > Hm, I'm afraid I don't understand this comment. Can you perhaps rephrase 
> > > it?
> > >
> >
> > ix86_trampoline_init has
> >
> >  /* Compute offset from the end of the jmp to the target function.
> >  In the case in which the trampoline stores the static chain on
> >  the stack, we need to skip the first insn which pushes the
> >  (call-saved) register static chain; this push is 1 byte.  */
> >   offset += 5;
> >   disp = expand_binop (SImode, sub_optab, fnaddr,
> >plus_constant (Pmode, XEXP (m_tramp, 0),
> >   offset - (MEM_P (chain) ? 1 : 0)),
> >NULL_RTX, 1, OPTAB_DIRECT);
> >   emit_move_insn (mem, disp);
> >
> > Without CET, we got
> >
> > 011 :
> >   11: 56push   %esi
> >   12: 55push   %ebp   <<<<<< trampoline jumps here.
> >   13: 89 e5mov%esp,%ebp
> >   15: 83 ec 08  sub$0x8,%esp
> >
> > With CET, if bar isn't only called directly, we got
> >
> > 0015 :
> >   15: f3 0f 1e fb  endbr32
> >   19: 56push   %esi
> >   1a: 55push   %ebp   <<<<<<<< trampoline jumps here.
> >   1b: 89 e5mov%esp,%ebp
> >   1d: 83 ec 08  sub$0x8,%esp
> >
> > We need to add 4 bytes for trampoline to skip endbr32.
> >
> > Here is the updated patch to check if nested function isn't only
> > called directly,
>
> Please figure out the final patch. I don't want to waste my time
> reviewing different version every half hour. Ping me in a couple of
> days.

This is the final version:

https://gcc.gnu.org/ml/gcc-patches/2020-02/msg00586.html

You can try the testcase in the patch on any machine with CET binutils
since ENDBR32 is a NOP on non-CET machines.  Without this patch,
the test will fail.

Thanks.

-- 
H.J.


[PATCH] i386: Also skip ENDBR32 at the target function entry

2020-02-13 Thread H.J. Lu
On Thu, Feb 13, 2020 at 09:29:32AM +0100, Uros Bizjak wrote:
> On Wed, Feb 12, 2020 at 1:21 PM H.J. Lu  wrote:
> >
> > On Mon, Feb 10, 2020 at 12:01 PM Uros Bizjak  wrote:
> > >
> > > On Mon, Feb 10, 2020 at 8:53 PM H.J. Lu  wrote:
> > > >
> > > > On Mon, Feb 10, 2020 at 11:40 AM Uros Bizjak  wrote:
> > > > >
> > > > > On Mon, Feb 10, 2020 at 8:22 PM H.J. Lu  wrote:
> > > > > >
> > > > > > Since nested function isn't only called directly, there is ENDBR32 
> > > > > > at
> > > > > > function entry and we need to skip it for direct jump in trampoline.
> > > > >
> > > > > Hm, I'm afraid I don't understand this comment. Can you perhaps 
> > > > > rephrase it?
> > > > >
> > > >
> > > > ix86_trampoline_init has
> > > >
> > > >  /* Compute offset from the end of the jmp to the target function.
> > > >  In the case in which the trampoline stores the static chain on
> > > >  the stack, we need to skip the first insn which pushes the
> > > >  (call-saved) register static chain; this push is 1 byte.  */
> > > >   offset += 5;
> > > >   disp = expand_binop (SImode, sub_optab, fnaddr,
> > > >plus_constant (Pmode, XEXP (m_tramp, 0),
> > > >   offset - (MEM_P (chain) ? 1 : 
> > > > 0)),
> > > >NULL_RTX, 1, OPTAB_DIRECT);
> > > >   emit_move_insn (mem, disp);
> > > >
> > > > Without CET, we got
> > > >
> > > > 011 :
> > > >   11: 56push   %esi
> > > >   12: 55push   %ebp   <<<<<< trampoline jumps here.
> > > >   13: 89 e5mov%esp,%ebp
> > > >   15: 83 ec 08  sub$0x8,%esp
> > > >
> > > > With CET, if bar isn't only called directly, we got
> > > >
> > > > 0015 :
> > > >   15: f3 0f 1e fb  endbr32
> > > >   19: 56push   %esi
> > > >   1a: 55push   %ebp   <<<<<<<< trampoline jumps 
> > > > here.
> > > >   1b: 89 e5mov%esp,%ebp
> > > >   1d: 83 ec 08  sub$0x8,%esp
> > > >
> > > > We need to add 4 bytes for trampoline to skip endbr32.
> > > >
> > > > Here is the updated patch to check if nested function isn't only
> > > > called directly,
> > >
> > > Please figure out the final patch. I don't want to waste my time
> > > reviewing different version every half hour. Ping me in a couple of
> > > days.
> >
> > This is the final version:
> >
> > https://gcc.gnu.org/ml/gcc-patches/2020-02/msg00586.html
> >
> > You can try the testcase in the patch on any machine with CET binutils
> > since ENDBR32 is nop on none-CET machines.  Without this patch,
> > the test will fail.
> 
> Please rephrase the comment. I don't understand what it tries to say.
> 

Here is the patch with updated comments.  OK for master and 8/9 branches?

Thanks.

H.J.
---
Since pass_insert_endbranch inserts ENDBR32 at the entry of the target
function if it may be called indirectly, we also need to skip the
4-byte ENDBR32 for the direct jump in the trampoline in that case.

Tested on Linux/x86-64 CET machine with and without -m32.

gcc/

PR target/93656
* config/i386/i386.c (ix86_trampoline_init): Skip ENDBR32 at
the target function entry.

gcc/testsuite/

PR target/93656
* gcc.target/i386/pr93656.c: New test.
---
 gcc/config/i386/i386.c  | 9 -
 gcc/testsuite/gcc.target/i386/pr93656.c | 4 
 2 files changed, 12 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr93656.c

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 44bc0e0176a..52640b74cc8 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -16839,9 +16839,16 @@ ix86_trampoline_init (rtx m_tramp, tree fndecl, rtx 
chain_value)
 the stack, we need to skip the first insn which pushes the
 (call-saved) register static chain; this push is 1 byte.  */
   offset += 5;
+  int skip = MEM_P (chain) ? 1 : 0;
+  /* Since pass_insert_endbranch inserts ENDBR32 at entry of the
+target function if it may be called indirectly, we also need
+to skip 

Re: [PATCH] i386: Also skip ENDBR32 at the target function entry

2020-02-13 Thread H.J. Lu
On Thu, Feb 13, 2020 at 01:28:43PM +0100, Uros Bizjak wrote:
> On Thu, Feb 13, 2020 at 1:06 PM H.J. Lu  wrote:
> >
> > On Thu, Feb 13, 2020 at 09:29:32AM +0100, Uros Bizjak wrote:
> > > On Wed, Feb 12, 2020 at 1:21 PM H.J. Lu  wrote:
> > > >
> > > > On Mon, Feb 10, 2020 at 12:01 PM Uros Bizjak  wrote:
> > > > >
> > > > > On Mon, Feb 10, 2020 at 8:53 PM H.J. Lu  wrote:
> > > > > >
> > > > > > On Mon, Feb 10, 2020 at 11:40 AM Uros Bizjak  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Mon, Feb 10, 2020 at 8:22 PM H.J. Lu  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Since nested function isn't only called directly, there is 
> > > > > > > > ENDBR32 at
> > > > > > > > function entry and we need to skip it for direct jump in 
> > > > > > > > trampoline.
> > > > > > >
> > > > > > > Hm, I'm afraid I don't understand this comment. Can you perhaps 
> > > > > > > rephrase it?
> > > > > > >
> > > > > >
> > > > > > ix86_trampoline_init has
> > > > > >
> > > > > >  /* Compute offset from the end of the jmp to the target 
> > > > > > function.
> > > > > >  In the case in which the trampoline stores the static 
> > > > > > chain on
> > > > > >  the stack, we need to skip the first insn which pushes the
> > > > > >  (call-saved) register static chain; this push is 1 byte.  
> > > > > > */
> > > > > >   offset += 5;
> > > > > >   disp = expand_binop (SImode, sub_optab, fnaddr,
> > > > > >plus_constant (Pmode, XEXP (m_tramp, 0),
> > > > > >   offset - (MEM_P (chain) ? 
> > > > > > 1 : 0)),
> > > > > >NULL_RTX, 1, OPTAB_DIRECT);
> > > > > >   emit_move_insn (mem, disp);
> > > > > >
> > > > > > Without CET, we got
> > > > > >
> > > > > > 011 :
> > > > > >   11: 56push   %esi
> > > > > >   12: 55push   %ebp   <<<<<< trampoline jumps 
> > > > > > here.
> > > > > >   13: 89 e5mov%esp,%ebp
> > > > > >   15: 83 ec 08  sub$0x8,%esp
> > > > > >
> > > > > > With CET, if bar isn't only called directly, we got
> > > > > >
> > > > > > 0015 :
> > > > > >   15: f3 0f 1e fb  endbr32
> > > > > >   19: 56push   %esi
> > > > > >   1a: 55push   %ebp   <<<<<<<< trampoline jumps 
> > > > > > here.
> > > > > >   1b: 89 e5mov%esp,%ebp
> > > > > >   1d: 83 ec 08  sub$0x8,%esp
> > > > > >
> > > > > > We need to add 4 bytes for trampoline to skip endbr32.
> > > > > >
> > > > > > Here is the updated patch to check if nested function isn't only
> > > > > > called directly,
> > > > >
> > > > > Please figure out the final patch. I don't want to waste my time
> > > > > reviewing different version every half hour. Ping me in a couple of
> > > > > days.
> > > >
> > > > This is the final version:
> > > >
> > > > https://gcc.gnu.org/ml/gcc-patches/2020-02/msg00586.html
> > > >
> > > > You can try the testcase in the patch on any machine with CET binutils
> > > > since ENDBR32 is nop on none-CET machines.  Without this patch,
> > > > the test will fail.
> > >
> > > Please rephrase the comment. I don't understand what it tries to say.
> > >
> >
> > Here is the patch with updated comments.  OK for master and 8/9 branches?
> >
> > Thanks.
> >
> > H.J.
> > ---
> > Since pass_insert_endbranch inserts ENDBR32 at entry of the target
> > function if it may be called indirectly, we also need to skip the
> > 4-byte 

PING^7: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move

2020-02-13 Thread H.J. Lu
On Thu, Feb 6, 2020 at 8:17 PM H.J. Lu  wrote:
>
> On Mon, Jan 27, 2020 at 10:59 AM H.J. Lu  wrote:
> >
> > On Mon, Jul 8, 2019 at 8:19 AM H.J. Lu  wrote:
> > >
> > > On Tue, Jun 18, 2019 at 8:59 AM H.J. Lu  wrote:
> > > >
> > > > On Fri, May 31, 2019 at 10:38 AM H.J. Lu  wrote:
> > > > >
> > > > > On Tue, May 21, 2019 at 2:43 PM H.J. Lu  wrote:
> > > > > >
> > > > > > On Fri, Feb 22, 2019 at 8:25 AM H.J. Lu  
> > > > > > wrote:
> > > > > > >
> > > > > > > Hi Jan, Uros,
> > > > > > >
> > > > > > > This patch fixes the wrong code bug:
> > > > > > >
> > > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89229
> > > > > > >
> > > > > > > Tested on AVX2 and AVX512 with and without --with-arch=native.
> > > > > > >
> > > > > > > OK for trunk?
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > H.J.
> > > > > > > --
> > > > > > > i386 backend has
> > > > > > >
> > > > > > > INT_MODE (OI, 32);
> > > > > > > INT_MODE (XI, 64);
> > > > > > >
> > > > > > > So, XI_MODE represents 64 INTEGER bytes = 64 * 8 = 512 bit 
> > > > > > > operation,
> > > > > > > in case of const_1, all 512 bits set.
> > > > > > >
> > > > > > > We can load zeros with narrower instruction, (e.g. 256 bit by 
> > > > > > > inherent
> > > > > > > zeroing of highpart in case of 128 bit xor), so TImode in this 
> > > > > > > case.
> > > > > > >
> > > > > > > Some targets prefer V4SF mode, so they will emit float xorps for 
> > > > > > > zeroing.
> > > > > > >
> > > > > > > sse.md has
> > > > > > >
> > > > > > > (define_insn "mov_internal"
> > > > > > >   [(set (match_operand:VMOVE 0 "nonimmediate_operand"
> > > > > > >  "=v,v ,v ,m")
> > > > > > > (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand"
> > > > > > >  " C,BC,vm,v"))]
> > > > > > > 
> > > > > > >   /* There is no evex-encoded vmov* for sizes smaller than 
> > > > > > > 64-bytes
> > > > > > >  in avx512f, so we need to use workarounds, to access sse 
> > > > > > > registers
> > > > > > >  16-31, which are evex-only. In avx512vl we don't need 
> > > > > > > workarounds.  */
> > > > > > >   if (TARGET_AVX512F &&  < 64 && !TARGET_AVX512VL
> > > > > > >   && (EXT_REX_SSE_REG_P (operands[0])
> > > > > > >   || EXT_REX_SSE_REG_P (operands[1])))
> > > > > > > {
> > > > > > >   if (memory_operand (operands[0], mode))
> > > > > > > {
> > > > > > >   if ( == 32)
> > > > > > > return "vextract64x4\t{$0x0, %g1, 
> > > > > > > %0|%0, %g1, 0x0}";
> > > > > > >   else if ( == 16)
> > > > > > > return "vextract32x4\t{$0x0, %g1, 
> > > > > > > %0|%0, %g1, 0x0}";
> > > > > > >   else
> > > > > > > gcc_unreachable ();
> > > > > > > }
> > > > > > > ...
> > > > > > >
> > > > > > > However, since ix86_hard_regno_mode_ok has
> > > > > > >
> > > > > > >  /* TODO check for QI/HI scalars.  */
> > > > > > >   /* AVX512VL allows sse regs16+ for 128/256 bit modes.  */
> > > > > > >   if (TARGET_AVX512VL
> > > > > > >   && (mode == OImode
> > > > > > >   || mode == TImode
> > > > > > >   || VALID_AVX256_REG_MODE (mode

Re: [PATCH] i386: Also skip ENDBR32 at the target function entry

2020-02-13 Thread H.J. Lu
On Thu, Feb 13, 2020 at 5:10 AM Uros Bizjak  wrote:
>
> On Thu, Feb 13, 2020 at 1:42 PM H.J. Lu  wrote:
> >
> > On Thu, Feb 13, 2020 at 01:28:43PM +0100, Uros Bizjak wrote:
> > > On Thu, Feb 13, 2020 at 1:06 PM H.J. Lu  wrote:
> > > >
> > > > On Thu, Feb 13, 2020 at 09:29:32AM +0100, Uros Bizjak wrote:
> > > > > On Wed, Feb 12, 2020 at 1:21 PM H.J. Lu  wrote:
> > > > > >
> > > > > > On Mon, Feb 10, 2020 at 12:01 PM Uros Bizjak  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Mon, Feb 10, 2020 at 8:53 PM H.J. Lu  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Mon, Feb 10, 2020 at 11:40 AM Uros Bizjak 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Feb 10, 2020 at 8:22 PM H.J. Lu  
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Since nested function isn't only called directly, there is 
> > > > > > > > > > ENDBR32 at
> > > > > > > > > > function entry and we need to skip it for direct jump in 
> > > > > > > > > > trampoline.
> > > > > > > > >
> > > > > > > > > Hm, I'm afraid I don't understand this comment. Can you 
> > > > > > > > > perhaps rephrase it?
> > > > > > > > >
> > > > > > > >
> > > > > > > > ix86_trampoline_init has
> > > > > > > >
> > > > > > > >  /* Compute offset from the end of the jmp to the target 
> > > > > > > > function.
> > > > > > > >  In the case in which the trampoline stores the static 
> > > > > > > > chain on
> > > > > > > >  the stack, we need to skip the first insn which pushes 
> > > > > > > > the
> > > > > > > >  (call-saved) register static chain; this push is 1 
> > > > > > > > byte.  */
> > > > > > > >   offset += 5;
> > > > > > > >   disp = expand_binop (SImode, sub_optab, fnaddr,
> > > > > > > >plus_constant (Pmode, XEXP (m_tramp, 
> > > > > > > > 0),
> > > > > > > >   offset - (MEM_P 
> > > > > > > > (chain) ? 1 : 0)),
> > > > > > > >NULL_RTX, 1, OPTAB_DIRECT);
> > > > > > > >   emit_move_insn (mem, disp);
> > > > > > > >
> > > > > > > > Without CET, we got
> > > > > > > >
> > > > > > > > 011 :
> > > > > > > >   11: 56push   %esi
> > > > > > > >   12: 55push   %ebp   <<<<<< trampoline 
> > > > > > > > jumps here.
> > > > > > > >   13: 89 e5mov%esp,%ebp
> > > > > > > >   15: 83 ec 08  sub$0x8,%esp
> > > > > > > >
> > > > > > > > With CET, if bar isn't only called directly, we got
> > > > > > > >
> > > > > > > > 0015 :
> > > > > > > >   15: f3 0f 1e fb  endbr32
> > > > > > > >   19: 56push   %esi
> > > > > > > >   1a: 55push   %ebp   <<<<<<<< trampoline 
> > > > > > > > jumps here.
> > > > > > > >   1b: 89 e5mov%esp,%ebp
> > > > > > > >   1d: 83 ec 08  sub$0x8,%esp
> > > > > > > >
> > > > > > > > We need to add 4 bytes for trampoline to skip endbr32.
> > > > > > > >
> > > > > > > > Here is the updated patch to check if nested function isn't only
> > > > > > > > called directly,
> > > > > > >
> > > > > > > Please figure out the final patch. I don't want to waste my time
> > > > > > > reviewing dif

Re: Backports to 9.3

2020-02-14 Thread H.J. Lu
On Thu, Feb 13, 2020 at 2:46 PM Jakub Jelinek  wrote:
>
> Hi!
>
> I've backported following 15 commits from trunk to 9.3 branch,
> bootstrapped/regtested on x86_64-linux and i686-linux, committed.
>

Hi Jakub,

Are you preparing for GCC 9.3? I'd like to include this in GCC 9.3:

https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=1d69147af203d4dcd2270429f90c93f1a37ddfff

It is very safe.  Uros asked me to wait for a week before backporting to
GCC 9 branch.  I am planning to do it next Thursday.

Thanks.

-- 
H.J.


Re: Backports to 9.3

2020-02-14 Thread H.J. Lu
On Fri, Feb 14, 2020 at 7:51 AM Jakub Jelinek  wrote:
>
> On Fri, Feb 14, 2020 at 07:45:43AM -0800, H.J. Lu wrote:
> > Are you preparing for GCC 9.3? I'd like to include this in GCC 9.3:
> >
> > https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=1d69147af203d4dcd2270429f90c93f1a37ddfff
> >
> > It is very safe.  Uros asked me to wait for a week before backporting to
> > GCC 9 branch.  I am planning to do it next Thursday.
>
> Richi wants to do 8.4 first, am backporting a lot of patches to that now.
> I'd say we should aim for 8.4 rc next week or say on Monday 24th

I am planning to backport it to both GCC 8 and 9 branches next Thursday.
I think that will be fine.

> and release a week after that and 9.3 maybe one week later than that.
>
> Jakub
>

Thanks.


-- 
H.J.


[PATCH 02/10] i386: Use ix86_output_ssemov for XImode TYPE_SSEMOV

2020-02-15 Thread H.J. Lu
PR target/89229
* config/i386/i386.md (*movxi_internal_avx512f): Call
ix86_output_ssemov for TYPE_SSEMOV.
---
 gcc/config/i386/i386.md | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index f14683cd14f..b30e5a51edc 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -1902,11 +1902,7 @@ (define_insn "*movxi_internal_avx512f"
   return standard_sse_constant_opcode (insn, operands);
 
 case TYPE_SSEMOV:
-  if (misaligned_operand (operands[0], XImode)
- || misaligned_operand (operands[1], XImode))
-   return "vmovdqu32\t{%1, %0|%0, %1}";
-  else
-   return "vmovdqa32\t{%1, %0|%0, %1}";
+  return ix86_output_ssemov (insn, operands);
 
 default:
   gcc_unreachable ();
-- 
2.24.1



[PATCH 06/10] i386: Use ix86_output_ssemov for SImode TYPE_SSEMOV

2020-02-15 Thread H.J. Lu
There is no need to set mode attribute to XImode since ix86_output_ssemov
can properly encode xmm16-xmm31 registers with and without AVX512VL.

gcc/

PR target/89229
* config/i386/i386.md (*movsi_internal): Call ix86_output_ssemov
for TYPE_SSEMOV.  Remove ext_sse_reg_operand and TARGET_AVX512VL
check.

gcc/testsuite/

PR target/89229
* gcc.target/i386/pr89229-4a.c: New test.
* gcc.target/i386/pr89229-4b.c: Likewise.
* gcc.target/i386/pr89229-4c.c: Likewise.
---
 gcc/config/i386/i386.md| 25 ++
 gcc/testsuite/gcc.target/i386/pr89229-4a.c | 17 +++
 gcc/testsuite/gcc.target/i386/pr89229-4b.c |  6 ++
 gcc/testsuite/gcc.target/i386/pr89229-4c.c |  7 ++
 4 files changed, 32 insertions(+), 23 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-4a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-4b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-4c.c

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 03d8078e957..05815c5cf3b 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -2261,25 +2261,7 @@ (define_insn "*movsi_internal"
   gcc_unreachable ();
 
 case TYPE_SSEMOV:
-  switch (get_attr_mode (insn))
-   {
-   case MODE_SI:
-  return "%vmovd\t{%1, %0|%0, %1}";
-   case MODE_TI:
- return "%vmovdqa\t{%1, %0|%0, %1}";
-   case MODE_XI:
- return "vmovdqa32\t{%g1, %g0|%g0, %g1}";
-
-   case MODE_V4SF:
- return "%vmovaps\t{%1, %0|%0, %1}";
-
-   case MODE_SF:
- gcc_assert (!TARGET_AVX);
-  return "movss\t{%1, %0|%0, %1}";
-
-   default:
- gcc_unreachable ();
-   }
+  return ix86_output_ssemov (insn, operands);
 
 case TYPE_MMX:
   return "pxor\t%0, %0";
@@ -2345,10 +2327,7 @@ (define_insn "*movsi_internal"
  (cond [(eq_attr "alternative" "2,3")
  (const_string "DI")
(eq_attr "alternative" "8,9")
- (cond [(ior (match_operand 0 "ext_sse_reg_operand")
- (match_operand 1 "ext_sse_reg_operand"))
-  (const_string "XI")
-(match_test "TARGET_AVX")
+ (cond [(match_test "TARGET_AVX")
   (const_string "TI")
 (ior (not (match_test "TARGET_SSE2"))
  (match_test "optimize_function_for_size_p (cfun)"))
diff --git a/gcc/testsuite/gcc.target/i386/pr89229-4a.c 
b/gcc/testsuite/gcc.target/i386/pr89229-4a.c
new file mode 100644
index 000..fd56f447016
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr89229-4a.c
@@ -0,0 +1,17 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512" } */
+
+extern int i;
+
+int
+foo1 (void)
+{
+  register int xmm16 __asm ("xmm16") = i;
+  asm volatile ("" : "+v" (xmm16));
+  register int xmm17 __asm ("xmm17") = xmm16;
+  asm volatile ("" : "+v" (xmm17));
+  return xmm17;
+}
+
+/* { dg-final { scan-assembler-times 
"vmovdqa32\[^\n\r]*xmm1\[67]\[^\n\r]*xmm1\[67]" 1 } } */
+/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr89229-4b.c 
b/gcc/testsuite/gcc.target/i386/pr89229-4b.c
new file mode 100644
index 000..023e81253a0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr89229-4b.c
@@ -0,0 +1,6 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512 -mno-avx512vl" } */
+
+#include "pr89229-4a.c"
+
+/* { dg-final { scan-assembler-times 
"vmovdqa32\[^\n\r]*zmm1\[67]\[^\n\r]*zmm1\[67]" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr89229-4c.c 
b/gcc/testsuite/gcc.target/i386/pr89229-4c.c
new file mode 100644
index 000..bb728082e96
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr89229-4c.c
@@ -0,0 +1,7 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512 -mprefer-vector-width=512" } */
+
+#include "pr89229-4a.c"
+
+/* { dg-final { scan-assembler-times 
"vmovdqa32\[^\n\r]*xmm1\[67]\[^\n\r]*xmm1\[67]" 1 } } */
+/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */
-- 
2.24.1



[PATCH 03/10] i386: Use ix86_output_ssemov for OImode TYPE_SSEMOV

2020-02-15 Thread H.J. Lu
There is no need to set mode attribute to XImode since ix86_output_ssemov
can properly encode ymm16-ymm31 registers with and without AVX512VL.

PR target/89229
* config/i386/i386.md (*movoi_internal_avx): Call
ix86_output_ssemov for TYPE_SSEMOV.  Remove ext_sse_reg_operand
and TARGET_AVX512VL check.
---
 gcc/config/i386/i386.md | 26 ++
 1 file changed, 2 insertions(+), 24 deletions(-)

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index b30e5a51edc..9e9b17d0913 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -1925,21 +1925,7 @@ (define_insn "*movoi_internal_avx"
   return standard_sse_constant_opcode (insn, operands);
 
 case TYPE_SSEMOV:
-  if (misaligned_operand (operands[0], OImode)
- || misaligned_operand (operands[1], OImode))
-   {
- if (get_attr_mode (insn) == MODE_XI)
-   return "vmovdqu32\t{%1, %0|%0, %1}";
- else
-   return "vmovdqu\t{%1, %0|%0, %1}";
-   }
-  else
-   {
- if (get_attr_mode (insn) == MODE_XI)
-   return "vmovdqa32\t{%1, %0|%0, %1}";
- else
-   return "vmovdqa\t{%1, %0|%0, %1}";
-   }
+  return ix86_output_ssemov (insn, operands);
 
 default:
   gcc_unreachable ();
@@ -1948,15 +1934,7 @@ (define_insn "*movoi_internal_avx"
   [(set_attr "isa" "*,avx2,*,*")
(set_attr "type" "sselog1,sselog1,ssemov,ssemov")
(set_attr "prefix" "vex")
-   (set (attr "mode")
-   (cond [(ior (match_operand 0 "ext_sse_reg_operand")
-   (match_operand 1 "ext_sse_reg_operand"))
-(const_string "XI")
-  (and (eq_attr "alternative" "1")
-   (match_test "TARGET_AVX512VL"))
-(const_string "XI")
- ]
- (const_string "OI")))])
+   (set_attr "mode" "OI")])
 
 (define_insn "*movti_internal"
   [(set (match_operand:TI 0 "nonimmediate_operand" "=!r ,o ,v,v ,v ,m,?r,?Yd")
-- 
2.24.1



[PATCH 00/10] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move

2020-02-15 Thread H.J. Lu
This patch set was originally submitted in Feb 2019:

https://gcc.gnu.org/ml/gcc-patches/2019-02/msg01841.html

I broke it into 10 smaller patches for easy review.

On x86, when AVX and AVX512 are enabled, vector move instructions can
be encoded with either 2-byte/3-byte VEX (AVX) or 4-byte EVEX (AVX512):

   0:   c5 f9 6f d1 vmovdqa %xmm1,%xmm2
   4:   62 f1 fd 08 6f d1   vmovdqa64 %xmm1,%xmm2

We prefer VEX encoding over EVEX since VEX is shorter.  Also AVX512F
only supports 512-bit vector moves.  AVX512F + AVX512VL supports 128-bit
and 256-bit vector moves.  Mode attributes on x86 vector move patterns
indicate target preferences of vector move encoding.  For vector register
to vector register move, we can use 512-bit vector move instructions to
move 128-bit/256-bit vector if AVX512VL isn't available.  With AVX512F
and AVX512VL, we should use VEX encoding for 128-bit/256-bit vector moves
if upper 16 vector registers aren't used.  This patch adds a function,
ix86_output_ssemov, to generate vector moves:

1. If zmm registers are used, use EVEX encoding.
2. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE or VEX encoding
will be generated.
3. If xmm16-xmm31/ymm16-ymm31 registers are used:
   a. With AVX512VL, AVX512VL vector moves will be generated.
   b. Without AVX512VL, xmm16-xmm31/ymm16-ymm31 register to register
  move will be done with zmm register move.
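
A compressed sketch of that decision (an illustration only, not the
ix86_output_ssemov code itself; the helper and enum names are made up):

  #include <stdbool.h>

  /* Which encoding a 128-bit/256-bit vector register-to-register
     move would get under the rules above.  */
  enum move_encoding { ENC_SSE_OR_VEX, ENC_EVEX, ENC_ZMM_EVEX };

  static enum move_encoding
  choose_move_encoding (bool zmm_used, bool ext_reg_used, bool avx512vl)
  {
    if (zmm_used)
      return ENC_EVEX;         /* 512-bit moves are EVEX-only.  */
    if (!ext_reg_used)
      return ENC_SSE_OR_VEX;   /* prefer the shorter SSE/VEX forms.  */
    /* xmm16-xmm31/ymm16-ymm31 need EVEX; without AVX512VL the move
       is done as a zmm register move instead.  */
    return avx512vl ? ENC_EVEX : ENC_ZMM_EVEX;
  }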

Tested on AVX2 and AVX512 with and without --with-arch=native.

H.J. Lu (10):
  i386: Properly encode vector registers in vector move
  i386: Use ix86_output_ssemov for XImode TYPE_SSEMOV
  i386: Use ix86_output_ssemov for OImode TYPE_SSEMOV
  i386: Use ix86_output_ssemov for TImode TYPE_SSEMOV
  i386: Use ix86_output_ssemov for DImode TYPE_SSEMOV
  i386: Use ix86_output_ssemov for SImode TYPE_SSEMOV
  i386: Use ix86_output_ssemov for TFmode TYPE_SSEMOV
  i386: Use ix86_output_ssemov for DFmode TYPE_SSEMOV
  i386: Use ix86_output_ssemov for SFmode TYPE_SSEMOV
  i386: Use ix86_output_ssemov for MMX TYPE_SSEMOV

 gcc/config/i386/i386-protos.h |   2 +
 gcc/config/i386/i386.c| 274 ++
 gcc/config/i386/i386.md   | 212 +-
 gcc/config/i386/mmx.md|  29 +-
 gcc/config/i386/predicates.md |   5 -
 gcc/config/i386/sse.md|  98 +--
 .../gcc.target/i386/avx512vl-vmovdqa64-1.c|   7 +-
 gcc/testsuite/gcc.target/i386/pr89229-2a.c|  15 +
 gcc/testsuite/gcc.target/i386/pr89229-2b.c|  13 +
 gcc/testsuite/gcc.target/i386/pr89229-2c.c|   6 +
 gcc/testsuite/gcc.target/i386/pr89229-3a.c|  17 ++
 gcc/testsuite/gcc.target/i386/pr89229-3b.c|   6 +
 gcc/testsuite/gcc.target/i386/pr89229-3c.c|   7 +
 gcc/testsuite/gcc.target/i386/pr89229-4a.c|  17 ++
 gcc/testsuite/gcc.target/i386/pr89229-4b.c|   6 +
 gcc/testsuite/gcc.target/i386/pr89229-4c.c|   7 +
 gcc/testsuite/gcc.target/i386/pr89229-5a.c|  16 +
 gcc/testsuite/gcc.target/i386/pr89229-5b.c|  12 +
 gcc/testsuite/gcc.target/i386/pr89229-5c.c|   6 +
 gcc/testsuite/gcc.target/i386/pr89229-6a.c|  16 +
 gcc/testsuite/gcc.target/i386/pr89229-6b.c|   7 +
 gcc/testsuite/gcc.target/i386/pr89229-6c.c|   6 +
 gcc/testsuite/gcc.target/i386/pr89229-7a.c|  16 +
 gcc/testsuite/gcc.target/i386/pr89229-7b.c|   6 +
 gcc/testsuite/gcc.target/i386/pr89229-7c.c|   6 +
 gcc/testsuite/gcc.target/i386/pr89346.c   |  15 +
 26 files changed, 497 insertions(+), 330 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-2a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-2b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-2c.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-3a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-3b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-3c.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-4a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-4b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-4c.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-5a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-5b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-5c.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-6a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-6b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-6c.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-7a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-7b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-7c.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89346.c

-- 
2.24.1



[PATCH 04/10] i386: Use ix86_output_ssemov for TImode TYPE_SSEMOV

2020-02-15 Thread H.J. Lu
There is no need to set mode attribute to XImode since ix86_output_ssemov
can properly encode xmm16-xmm31 registers with and without AVX512VL.

gcc/

PR target/89229
* config/i386/i386.md (*movti_internal): Call ix86_output_ssemov
for TYPE_SSEMOV.  Remove ext_sse_reg_operand and TARGET_AVX512VL
check.

gcc/testsuite/

PR target/89229
* gcc.target/i386/pr89229-2a.c: New test.
* gcc.target/i386/pr89229-2b.c: Likewise.
* gcc.target/i386/pr89229-2c.c: Likewise.
---
 gcc/config/i386/i386.md| 28 +-
 gcc/testsuite/gcc.target/i386/pr89229-2a.c | 15 
 gcc/testsuite/gcc.target/i386/pr89229-2b.c | 13 ++
 gcc/testsuite/gcc.target/i386/pr89229-2c.c |  6 +
 4 files changed, 35 insertions(+), 27 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-2a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-2b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-2c.c

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 9e9b17d0913..5607d1ecddc 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -1955,27 +1955,7 @@ (define_insn "*movti_internal"
   return standard_sse_constant_opcode (insn, operands);
 
 case TYPE_SSEMOV:
-  /* TDmode values are passed as TImode on the stack.  Moving them
-to stack may result in unaligned memory access.  */
-  if (misaligned_operand (operands[0], TImode)
- || misaligned_operand (operands[1], TImode))
-   {
- if (get_attr_mode (insn) == MODE_V4SF)
-   return "%vmovups\t{%1, %0|%0, %1}";
- else if (get_attr_mode (insn) == MODE_XI)
-   return "vmovdqu32\t{%1, %0|%0, %1}";
- else
-   return "%vmovdqu\t{%1, %0|%0, %1}";
-   }
-  else
-   {
- if (get_attr_mode (insn) == MODE_V4SF)
-   return "%vmovaps\t{%1, %0|%0, %1}";
- else if (get_attr_mode (insn) == MODE_XI)
-   return "vmovdqa32\t{%1, %0|%0, %1}";
- else
-   return "%vmovdqa\t{%1, %0|%0, %1}";
-   }
+  return ix86_output_ssemov (insn, operands);
 
 default:
   gcc_unreachable ();
@@ -2002,12 +1982,6 @@ (define_insn "*movti_internal"
(set (attr "mode")
(cond [(eq_attr "alternative" "0,1")
 (const_string "DI")
-  (ior (match_operand 0 "ext_sse_reg_operand")
-   (match_operand 1 "ext_sse_reg_operand"))
-(const_string "XI")
-  (and (eq_attr "alternative" "3")
-   (match_test "TARGET_AVX512VL"))
-(const_string "XI")
   (match_test "TARGET_AVX")
 (const_string "TI")
   (ior (not (match_test "TARGET_SSE2"))
diff --git a/gcc/testsuite/gcc.target/i386/pr89229-2a.c 
b/gcc/testsuite/gcc.target/i386/pr89229-2a.c
new file mode 100644
index 000..0cf78039481
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr89229-2a.c
@@ -0,0 +1,15 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512" } */
+
+typedef __int128 __m128t __attribute__ ((__vector_size__ (16),
+__may_alias__));
+
+__m128t
+foo1 (void)
+{
+  register __int128 xmm16 __asm ("xmm16") = (__int128) -1;
+  asm volatile ("" : "+v" (xmm16));
+  return (__m128t) xmm16;
+}
+
+/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr89229-2b.c 
b/gcc/testsuite/gcc.target/i386/pr89229-2b.c
new file mode 100644
index 000..8d5d6c41d30
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr89229-2b.c
@@ -0,0 +1,13 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512 -mno-avx512vl" } */
+
+typedef __int128 __m128t __attribute__ ((__vector_size__ (16),
+__may_alias__));
+
+__m128t
+foo1 (void)
+{
+  register __int128 xmm16 __asm ("xmm16") = (__int128) -1; /* { dg-error 
"register specified for 'xmm16'" } */
+  asm volatile ("" : "+v" (xmm16));
+  return (__m128t) xmm16;
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr89229-2c.c 
b/gcc/testsuite/gcc.target/i386/pr89229-2c.c
new file mode 100644
index 000..218da46dcd0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr89229-2c.c
@@ -0,0 +1,6 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512 -mprefer-vector-width=512" } */
+
+#include "pr89229-2a.c"
+
+/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */
-- 
2.24.1



[PATCH 05/10] i386: Use ix86_output_ssemov for DImode TYPE_SSEMOV

2020-02-15 Thread H.J. Lu
There is no need to set mode attribute to XImode since ix86_output_ssemov
can properly encode xmm16-xmm31 registers with and without AVX512VL.

gcc/

PR target/89229
* config/i386/i386.md (*movdi_internal): Call ix86_output_ssemov
for TYPE_SSEMOV.  Remove ext_sse_reg_operand and TARGET_AVX512VL
check.

gcc/testsuite/

PR target/89229
* gcc.target/i386/pr89229-3a.c: New test.
* gcc.target/i386/pr89229-3b.c: Likewise.
* gcc.target/i386/pr89229-3c.c: Likewise.
---
 gcc/config/i386/i386.md| 31 ++
 gcc/testsuite/gcc.target/i386/pr89229-3a.c | 17 
 gcc/testsuite/gcc.target/i386/pr89229-3b.c |  6 +
 gcc/testsuite/gcc.target/i386/pr89229-3c.c |  7 +
 4 files changed, 32 insertions(+), 29 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-3a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-3b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-3c.c

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 5607d1ecddc..03d8078e957 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -2054,31 +2054,7 @@ (define_insn "*movdi_internal"
   return standard_sse_constant_opcode (insn, operands);
 
 case TYPE_SSEMOV:
-  switch (get_attr_mode (insn))
-   {
-   case MODE_DI:
- /* Handle broken assemblers that require movd instead of movq.  */
- if (!HAVE_AS_IX86_INTERUNIT_MOVQ
- && (GENERAL_REG_P (operands[0]) || GENERAL_REG_P (operands[1])))
-   return "%vmovd\t{%1, %0|%0, %1}";
- return "%vmovq\t{%1, %0|%0, %1}";
-
-   case MODE_TI:
- /* Handle AVX512 registers set.  */
- if (EXT_REX_SSE_REG_P (operands[0])
- || EXT_REX_SSE_REG_P (operands[1]))
-   return "vmovdqa64\t{%1, %0|%0, %1}";
- return "%vmovdqa\t{%1, %0|%0, %1}";
-
-   case MODE_V2SF:
- gcc_assert (!TARGET_AVX);
- return "movlps\t{%1, %0|%0, %1}";
-   case MODE_V4SF:
- return "%vmovaps\t{%1, %0|%0, %1}";
-
-   default:
- gcc_unreachable ();
-   }
+  return ix86_output_ssemov (insn, operands);
 
 case TYPE_SSECVT:
   if (SSE_REG_P (operands[0]))
@@ -2164,10 +2140,7 @@ (define_insn "*movdi_internal"
  (cond [(eq_attr "alternative" "2")
  (const_string "SI")
(eq_attr "alternative" "12,13")
- (cond [(ior (match_operand 0 "ext_sse_reg_operand")
- (match_operand 1 "ext_sse_reg_operand"))
-  (const_string "TI")
-(match_test "TARGET_AVX")
+ (cond [(match_test "TARGET_AVX")
   (const_string "TI")
 (ior (not (match_test "TARGET_SSE2"))
  (match_test "optimize_function_for_size_p (cfun)"))
diff --git a/gcc/testsuite/gcc.target/i386/pr89229-3a.c 
b/gcc/testsuite/gcc.target/i386/pr89229-3a.c
new file mode 100644
index 000..cb9b071e873
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr89229-3a.c
@@ -0,0 +1,17 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512 -mprefer-vector-width=512" } */
+
+extern long long i;
+
+long long
+foo1 (void)
+{
+  register long long xmm16 __asm ("xmm16") = i;
+  asm volatile ("" : "+v" (xmm16));
+  register long long xmm17 __asm ("xmm17") = xmm16;
+  asm volatile ("" : "+v" (xmm17));
+  return xmm17;
+}
+
+/* { dg-final { scan-assembler-times 
"vmovdqa64\[^\n\r]*xmm1\[67]\[^\n\r]*xmm1\[67]" 1 } } */
+/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr89229-3b.c 
b/gcc/testsuite/gcc.target/i386/pr89229-3b.c
new file mode 100644
index 000..9265fc0354b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr89229-3b.c
@@ -0,0 +1,6 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512 -mno-avx512vl" } */
+
+#include "pr89229-3a.c"
+
+/* { dg-final { scan-assembler-times 
"vmovdqa32\[^\n\r]*zmm1\[67]\[^\n\r]*zmm1\[67]" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr89229-3c.c 
b/gcc/testsuite/gcc.target/i386/pr89229-3c.c
new file mode 100644
index 000..be0ca78a37e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr89229-3c.c
@@ -0,0 +1,7 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512 -mprefer-vector-width=512" } */
+
+#include "pr89229-3a.c"
+
+/* { dg-final { scan-assembler-times 
"vmovdqa64\[^\n\r]*xmm1\[67]\[^\n\r]*xmm1\[67]" 1 } } */
+/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */
-- 
2.24.1



[PATCH 01/10] i386: Properly encode vector registers in vector move

2020-02-15 Thread H.J. Lu
On x86, when AVX and AVX512 are enabled, vector move instructions can
be encoded with either 2-byte/3-byte VEX (AVX) or 4-byte EVEX (AVX512):

   0:   c5 f9 6f d1 vmovdqa %xmm1,%xmm2
   4:   62 f1 fd 08 6f d1   vmovdqa64 %xmm1,%xmm2

We prefer VEX encoding over EVEX since VEX is shorter.  Also AVX512F
only supports 512-bit vector moves.  AVX512F + AVX512VL supports 128-bit
and 256-bit vector moves.  Mode attributes on x86 vector move patterns
indicate target preferences of vector move encoding.  For vector register
to vector register move, we can use 512-bit vector move instructions to
move 128-bit/256-bit vector if AVX512VL isn't available.  With AVX512F
and AVX512VL, we should use VEX encoding for 128-bit/256-bit vector moves
if upper 16 vector registers aren't used.  This patch adds a function,
ix86_output_ssemov, to generate vector moves:

1. If zmm registers are used, use EVEX encoding.
2. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE or VEX encoding
will be generated.
3. If xmm16-xmm31/ymm16-ymm31 registers are used:
   a. With AVX512VL, AVX512VL vector moves will be generated.
   b. Without AVX512VL, xmm16-xmm31/ymm16-ymm31 register to register
  move will be done with zmm register move.

Tested on AVX2 and AVX512 with and without --with-arch=native.

gcc/

PR target/89229
PR target/89346
* config/i386/i386-protos.h (ix86_output_ssemov): New prototype.
* config/i386/i386.c (ix86_get_ssemov): New function.
(ix86_output_ssemov): Likewise.
* config/i386/sse.md (VMOVE:mov_internal): Call
ix86_output_ssemov for TYPE_SSEMOV.  Remove TARGET_AVX512VL
check.

gcc/testsuite/

PR target/89229
PR target/89346
* gcc.target/i386/avx512vl-vmovdqa64-1.c: Updated.
* gcc.target/i386/pr89229-2a.c: New test.
---
 gcc/config/i386/i386-protos.h |   2 +
 gcc/config/i386/i386.c| 274 ++
 gcc/config/i386/sse.md|  98 +--
 .../gcc.target/i386/avx512vl-vmovdqa64-1.c|   7 +-
 gcc/testsuite/gcc.target/i386/pr89346.c   |  15 +
 5 files changed, 296 insertions(+), 100 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89346.c

diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 266381ca5a6..39fcaa0ad5f 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -38,6 +38,8 @@ extern void ix86_expand_split_stack_prologue (void);
 extern void ix86_output_addr_vec_elt (FILE *, int);
 extern void ix86_output_addr_diff_elt (FILE *, int, int);
 
+extern const char *ix86_output_ssemov (rtx_insn *, rtx *);
+
 extern enum calling_abi ix86_cfun_abi (void);
 extern enum calling_abi ix86_function_type_abi (const_tree);
 
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index dac7a3fc5fd..26f8c9494b9 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -4915,6 +4915,280 @@ ix86_pre_reload_split (void)
  && !(cfun->curr_properties & PROP_rtl_split_insns));
 }
 
+/* Return the opcode of the TYPE_SSEMOV instruction.  To move from
+   or to xmm16-xmm31/ymm16-ymm31 registers, we either require
+   TARGET_AVX512VL or it is a register to register move which can
+   be done with zmm register move. */
+
+static const char *
+ix86_get_ssemov (rtx *operands, unsigned size,
+enum attr_mode insn_mode, machine_mode mode)
+{
+  char buf[128];
+  bool misaligned_p = (misaligned_operand (operands[0], mode)
+  || misaligned_operand (operands[1], mode));
+  bool evex_reg_p = (EXT_REX_SSE_REG_P (operands[0])
+|| EXT_REX_SSE_REG_P (operands[1]));
+  machine_mode scalar_mode;
+
+  const char *opcode = NULL;
+  enum
+{
+  opcode_int,
+  opcode_float,
+  opcode_double
+} type = opcode_int;
+
+  switch (insn_mode)
+{
+case MODE_V16SF:
+case MODE_V8SF:
+case MODE_V4SF:
+  scalar_mode = E_SFmode;
+  break;
+case MODE_V8DF:
+case MODE_V4DF:
+case MODE_V2DF:
+  scalar_mode = E_DFmode;
+  break;
+case MODE_XI:
+case MODE_OI:
+case MODE_TI:
+  scalar_mode = GET_MODE_INNER (mode);
+  break;
+default:
+  gcc_unreachable ();
+}
+
+  if (SCALAR_FLOAT_MODE_P (scalar_mode))
+{
+  switch (scalar_mode)
+   {
+   case E_SFmode:
+ if (size == 64 || !evex_reg_p || TARGET_AVX512VL)
+   opcode = misaligned_p ? "%vmovups" : "%vmovaps";
+ else
+   type = opcode_float;
+ break;
+   case E_DFmode:
+ if (size == 64 || !evex_reg_p || TARGET_AVX512VL)
+   opcode = misaligned_p ? "%vmovupd" : "%vmovapd";
+ else
+   type = opcode_double;
+ break;
+   case E_TFmode:
+ if (size == 64)
+   opcode = misaligned_p ? "vmovdqu64" : "vmovdqa64";
+ else if (evex_reg_p)
+   {
+ if (TARGET_AVX512VL)
+   

[PATCH 08/10] i386: Use ix86_output_ssemov for DFmode TYPE_SSEMOV

2020-02-15 Thread H.J. Lu
There is no need to set the mode attribute to XImode or V8DFmode since
ix86_output_ssemov can properly encode xmm16-xmm31 registers with and
without AVX512VL.

gcc/

PR target/89229
* config/i386/i386.md (*movdf_internal): Call ix86_output_ssemov
for TYPE_SSEMOV.  Remove TARGET_AVX512F, TARGET_PREFER_AVX256,
TARGET_AVX512VL and ext_sse_reg_operand check.

gcc/testsuite/

PR target/89229
* gcc.target/i386/pr89229-6a.c: New test.
* gcc.target/i386/pr89229-6b.c: Likewise.
* gcc.target/i386/pr89229-6c.c: Likewise.
---
 gcc/config/i386/i386.md| 44 ++
 gcc/testsuite/gcc.target/i386/pr89229-6a.c | 16 
 gcc/testsuite/gcc.target/i386/pr89229-6b.c |  7 
 gcc/testsuite/gcc.target/i386/pr89229-6c.c |  6 +++
 4 files changed, 32 insertions(+), 41 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-6a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-6b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-6c.c

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index fdf0e5a8802..01892992adb 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -3307,37 +3307,7 @@ (define_insn "*movdf_internal"
   return standard_sse_constant_opcode (insn, operands);
 
 case TYPE_SSEMOV:
-  switch (get_attr_mode (insn))
-   {
-   case MODE_DF:
- if (TARGET_AVX && REG_P (operands[0]) && REG_P (operands[1]))
-   return "vmovsd\t{%d1, %0|%0, %d1}";
- return "%vmovsd\t{%1, %0|%0, %1}";
-
-   case MODE_V4SF:
- return "%vmovaps\t{%1, %0|%0, %1}";
-   case MODE_V8DF:
- return "vmovapd\t{%g1, %g0|%g0, %g1}";
-   case MODE_V2DF:
- return "%vmovapd\t{%1, %0|%0, %1}";
-
-   case MODE_V2SF:
- gcc_assert (!TARGET_AVX);
- return "movlps\t{%1, %0|%0, %1}";
-   case MODE_V1DF:
- gcc_assert (!TARGET_AVX);
- return "movlpd\t{%1, %0|%0, %1}";
-
-   case MODE_DI:
- /* Handle broken assemblers that require movd instead of movq.  */
- if (!HAVE_AS_IX86_INTERUNIT_MOVQ
- && (GENERAL_REG_P (operands[0]) || GENERAL_REG_P (operands[1])))
-   return "%vmovd\t{%1, %0|%0, %1}";
- return "%vmovq\t{%1, %0|%0, %1}";
-
-   default:
- gcc_unreachable ();
-   }
+  return ix86_output_ssemov (insn, operands);
 
 default:
   gcc_unreachable ();
@@ -3391,10 +3361,7 @@ (define_insn "*movdf_internal"
 
   /* xorps is one byte shorter for non-AVX targets.  */
   (eq_attr "alternative" "12,16")
-(cond [(and (match_test "TARGET_AVX512F")
-(not (match_test "TARGET_PREFER_AVX256")))
- (const_string "XI")
-   (match_test "TARGET_AVX")
+(cond [(match_test "TARGET_AVX")
  (const_string "V2DF")
(ior (not (match_test "TARGET_SSE2"))
 (match_test "optimize_function_for_size_p (cfun)"))
@@ -3410,12 +3377,7 @@ (define_insn "*movdf_internal"
 
   /* movaps is one byte shorter for non-AVX targets.  */
   (eq_attr "alternative" "13,17")
-(cond [(and (ior (not (match_test "TARGET_PREFER_AVX256"))
- (not (match_test "TARGET_AVX512VL")))
-(ior (match_operand 0 "ext_sse_reg_operand")
- (match_operand 1 "ext_sse_reg_operand")))
- (const_string "V8DF")
-   (match_test "TARGET_AVX")
+(cond [(match_test "TARGET_AVX")
  (const_string "DF")
(ior (not (match_test "TARGET_SSE2"))
 (match_test "optimize_function_for_size_p (cfun)"))
diff --git a/gcc/testsuite/gcc.target/i386/pr89229-6a.c 
b/gcc/testsuite/gcc.target/i386/pr89229-6a.c
new file mode 100644
index 000..5bc10d25619
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr89229-6a.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512" } */
+
+extern double d;
+
+void
+foo1 (double x)
+{
+  register double xmm16 __asm ("xmm16") = x;
+  asm volatile ("" : "+v" (xmm16));
+  register double xmm17 __asm ("xmm17") = xmm16;
+  asm volatile ("" : "+v" (xmm17));
+  d = xmm17;
+}
+
+/* { dg-final { scan-assembler-not "vmovapd" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr89229-6b.c 
b/gcc/testsuite/gcc.target/i386/pr89229-6b.c
new file mode 100644
index 000..b248a3726f4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr89229-6b.c
@@ -0,0 +1,7 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512 -mno-avx512vl" } */
+
+#include "pr89229-6a.c"
+
+/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */
+/* { dg-fina

[PATCH 09/10] i386: Use ix86_output_ssemov for SFmode TYPE_SSEMOV

2020-02-15 Thread H.J. Lu
There is no need to set the mode attribute to V16SFmode since
ix86_output_ssemov can properly encode xmm16-xmm31 registers with and
without AVX512VL.

gcc/

PR target/89229
* config/i386/i386.md (*movsf_internal): Call ix86_output_ssemov
for TYPE_SSEMOV.  Remove TARGET_PREFER_AVX256, TARGET_AVX512VL
and ext_sse_reg_operand check.

gcc/testsuite/

PR target/89229
* gcc.target/i386/pr89229-7a.c: New test.
* gcc.target/i386/pr89229-7b.c: Likewise.
* gcc.target/i386/pr89229-7c.c: Likewise.
---
 gcc/config/i386/i386.md| 26 ++
 gcc/testsuite/gcc.target/i386/pr89229-7a.c | 16 +
 gcc/testsuite/gcc.target/i386/pr89229-7b.c |  6 +
 gcc/testsuite/gcc.target/i386/pr89229-7c.c |  6 +
 4 files changed, 30 insertions(+), 24 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-7a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-7b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-7c.c

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 01892992adb..2dcf2d598c3 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -3469,24 +3469,7 @@ (define_insn "*movsf_internal"
   return standard_sse_constant_opcode (insn, operands);
 
 case TYPE_SSEMOV:
-  switch (get_attr_mode (insn))
-   {
-   case MODE_SF:
- if (TARGET_AVX && REG_P (operands[0]) && REG_P (operands[1]))
-   return "vmovss\t{%d1, %0|%0, %d1}";
- return "%vmovss\t{%1, %0|%0, %1}";
-
-   case MODE_V16SF:
- return "vmovaps\t{%g1, %g0|%g0, %g1}";
-   case MODE_V4SF:
- return "%vmovaps\t{%1, %0|%0, %1}";
-
-   case MODE_SI:
- return "%vmovd\t{%1, %0|%0, %1}";
-
-   default:
- gcc_unreachable ();
-   }
+  return ix86_output_ssemov (insn, operands);
 
 case TYPE_MMXMOV:
   switch (get_attr_mode (insn))
@@ -3558,12 +3541,7 @@ (define_insn "*movsf_internal"
  better to maintain the whole registers in single format
  to avoid problems on using packed logical operations.  */
   (eq_attr "alternative" "6")
-(cond [(and (ior (not (match_test "TARGET_PREFER_AVX256"))
- (not (match_test "TARGET_AVX512VL")))
-(ior (match_operand 0 "ext_sse_reg_operand")
- (match_operand 1 "ext_sse_reg_operand")))
- (const_string "V16SF")
-   (ior (match_test "TARGET_SSE_PARTIAL_REG_DEPENDENCY")
+(cond [(ior (match_test "TARGET_SSE_PARTIAL_REG_DEPENDENCY")
 (match_test "TARGET_SSE_SPLIT_REGS"))
  (const_string "V4SF")
   ]
diff --git a/gcc/testsuite/gcc.target/i386/pr89229-7a.c 
b/gcc/testsuite/gcc.target/i386/pr89229-7a.c
new file mode 100644
index 000..856115b2f5a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr89229-7a.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512" } */
+
+extern float d;
+
+void
+foo1 (float x)
+{
+  register float xmm16 __asm ("xmm16") = x;
+  asm volatile ("" : "+v" (xmm16));
+  register float xmm17 __asm ("xmm17") = xmm16;
+  asm volatile ("" : "+v" (xmm17));
+  d = xmm17;
+}
+
+/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr89229-7b.c 
b/gcc/testsuite/gcc.target/i386/pr89229-7b.c
new file mode 100644
index 000..93d1e43770c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr89229-7b.c
@@ -0,0 +1,6 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512 -mno-avx512vl" } */
+
+#include "pr89229-7a.c"
+
+/* { dg-final { scan-assembler-times 
"vmovaps\[^\n\r]*zmm1\[67]\[^\n\r]*zmm1\[67]" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr89229-7c.c 
b/gcc/testsuite/gcc.target/i386/pr89229-7c.c
new file mode 100644
index 000..e37ff2bf5bd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr89229-7c.c
@@ -0,0 +1,6 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512 -mprefer-vector-width=512" } */
+
+#include "pr89229-7a.c"
+
+/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */
-- 
2.24.1



[PATCH 10/10] i386: Use ix86_output_ssemov for MMX TYPE_SSEMOV

2020-02-15 Thread H.J. Lu
There is no need to set the mode attribute to XImode since
ix86_output_ssemov can properly encode xmm16-xmm31 registers with and
without AVX512VL.

Remove ext_sse_reg_operand since it is no longer needed.

PR target/89229
* config/i386/mmx.md (MMXMODE:*mov<mode>_internal): Call
ix86_output_ssemov for TYPE_SSEMOV.  Remove ext_sse_reg_operand
check.
* config/i386/predicates.md (ext_sse_reg_operand): Removed.
---
 gcc/config/i386/mmx.md| 29 ++---
 gcc/config/i386/predicates.md |  5 -
 2 files changed, 2 insertions(+), 32 deletions(-)

diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
index f695831b5b9..7d9db5d352c 100644
--- a/gcc/config/i386/mmx.md
+++ b/gcc/config/i386/mmx.md
@@ -118,29 +118,7 @@ (define_insn "*mov_internal"
   return standard_sse_constant_opcode (insn, operands);
 
 case TYPE_SSEMOV:
-  switch (get_attr_mode (insn))
-   {
-   case MODE_DI:
- /* Handle broken assemblers that require movd instead of movq.  */
- if (!HAVE_AS_IX86_INTERUNIT_MOVQ
- && (GENERAL_REG_P (operands[0]) || GENERAL_REG_P (operands[1])))
-   return "%vmovd\t{%1, %0|%0, %1}";
- return "%vmovq\t{%1, %0|%0, %1}";
-   case MODE_TI:
- return "%vmovdqa\t{%1, %0|%0, %1}";
-   case MODE_XI:
- return "vmovdqa64\t{%g1, %g0|%g0, %g1}";
-
-   case MODE_V2SF:
- if (TARGET_AVX && REG_P (operands[0]))
-   return "vmovlps\t{%1, %0, %0|%0, %0, %1}";
- return "%vmovlps\t{%1, %0|%0, %1}";
-   case MODE_V4SF:
- return "%vmovaps\t{%1, %0|%0, %1}";
-
-   default:
- gcc_unreachable ();
-   }
+  return ix86_output_ssemov (insn, operands);
 
 default:
   gcc_unreachable ();
@@ -189,10 +167,7 @@ (define_insn "*mov_internal"
  (cond [(eq_attr "alternative" "2")
  (const_string "SI")
(eq_attr "alternative" "11,12")
- (cond [(ior (match_operand 0 "ext_sse_reg_operand")
- (match_operand 1 "ext_sse_reg_operand"))
-   (const_string "XI")
-(match_test "mode == V2SFmode")
+ (cond [(match_test "mode == V2SFmode")
   (const_string "V4SF")
 (ior (not (match_test "TARGET_SSE2"))
  (match_test "optimize_function_for_size_p (cfun)"))
diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
index 1119366d54e..71f4cb1193c 100644
--- a/gcc/config/i386/predicates.md
+++ b/gcc/config/i386/predicates.md
@@ -61,11 +61,6 @@ (define_predicate "sse_reg_operand"
   (and (match_code "reg")
(match_test "SSE_REGNO_P (REGNO (op))")))
 
-;; True if the operand is an AVX-512 new register.
-(define_predicate "ext_sse_reg_operand"
-  (and (match_code "reg")
-   (match_test "EXT_REX_SSE_REGNO_P (REGNO (op))")))
-
 ;; Return true if op is a QImode register.
 (define_predicate "any_QIreg_operand"
   (and (match_code "reg")
-- 
2.24.1



[PATCH 07/10] i386: Use ix86_output_ssemov for TFmode TYPE_SSEMOV

2020-02-15 Thread H.J. Lu
gcc/

PR target/89229
* config/i386/i386.md (*movtf_internal): Call ix86_output_ssemov
for TYPE_SSEMOV.

gcc/testsuite/

PR target/89229
* gcc.target/i386/pr89229-5a.c: New test.
* gcc.target/i386/pr89229-5b.c: Likewise.
* gcc.target/i386/pr89229-5c.c: Likewise.
---
 gcc/config/i386/i386.md| 26 +-
 gcc/testsuite/gcc.target/i386/pr89229-5a.c | 16 +
 gcc/testsuite/gcc.target/i386/pr89229-5b.c | 12 ++
 gcc/testsuite/gcc.target/i386/pr89229-5c.c |  6 +
 4 files changed, 35 insertions(+), 25 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-5a.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-5b.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-5c.c

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 05815c5cf3b..fdf0e5a8802 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -3154,31 +3154,7 @@ (define_insn "*movtf_internal"
   return standard_sse_constant_opcode (insn, operands);
 
 case TYPE_SSEMOV:
-  /* Handle misaligned load/store since we
- don't have movmisaligntf pattern. */
-  if (misaligned_operand (operands[0], TFmode)
- || misaligned_operand (operands[1], TFmode))
-   {
- if (get_attr_mode (insn) == MODE_V4SF)
-   return "%vmovups\t{%1, %0|%0, %1}";
- else if (TARGET_AVX512VL
-  && (EXT_REX_SSE_REG_P (operands[0])
-  || EXT_REX_SSE_REG_P (operands[1])))
-   return "vmovdqu64\t{%1, %0|%0, %1}";
- else
-   return "%vmovdqu\t{%1, %0|%0, %1}";
-   }
-  else
-   {
- if (get_attr_mode (insn) == MODE_V4SF)
-   return "%vmovaps\t{%1, %0|%0, %1}";
- else if (TARGET_AVX512VL
-  && (EXT_REX_SSE_REG_P (operands[0])
-  || EXT_REX_SSE_REG_P (operands[1])))
-   return "vmovdqa64\t{%1, %0|%0, %1}";
- else
-   return "%vmovdqa\t{%1, %0|%0, %1}";
-   }
+  return ix86_output_ssemov (insn, operands);
 
 case TYPE_MULTI:
return "#";
diff --git a/gcc/testsuite/gcc.target/i386/pr89229-5a.c 
b/gcc/testsuite/gcc.target/i386/pr89229-5a.c
new file mode 100644
index 000..fcb85c366b6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr89229-5a.c
@@ -0,0 +1,16 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512" } */
+
+extern __float128 d;
+
+void
+foo1 (__float128 x)
+{
+  register __float128 xmm16 __asm ("xmm16") = x;
+  asm volatile ("" : "+v" (xmm16));
+  register __float128 xmm17 __asm ("xmm17") = xmm16;
+  asm volatile ("" : "+v" (xmm17));
+  d = xmm17;
+}
+
+/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr89229-5b.c 
b/gcc/testsuite/gcc.target/i386/pr89229-5b.c
new file mode 100644
index 000..37eb83c783b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr89229-5b.c
@@ -0,0 +1,12 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512 -mno-avx512vl" } */
+
+extern __float128 d;
+
+void
+foo1 (__float128 x)
+{
+  register __float128 xmm16 __asm ("xmm16") = x; /* { dg-error "register 
specified for 'xmm16'" } */
+  asm volatile ("" : "+v" (xmm16));
+  d = xmm16;
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr89229-5c.c 
b/gcc/testsuite/gcc.target/i386/pr89229-5c.c
new file mode 100644
index 000..529a520133c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr89229-5c.c
@@ -0,0 +1,6 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -march=skylake-avx512 -mprefer-vector-width=512" } */
+
+#include "pr89229-5a.c"
+
+/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */
-- 
2.24.1



[PATCH] i386: Use add for a = a + b and a = b + a when possible

2019-12-06 Thread H.J. Lu
Since except for Bonnell,

01 fb                   add    %edi,%ebx

is faster and shorter than

8d 1c 1f                lea    (%rdi,%rbx,1),%ebx

we should use add for a = a + b and a = b + a when possible if not
optimizing for Bonnell.

Tested on x86-64.

gcc/

PR target/92807
* config/i386/i386.c (ix86_lea_outperforms): Check !TARGET_BONNELL.
(ix86_avoid_lea_for_addr): When not optimizing for Bonnell, use add
for a = a + b and a = b + a.

gcc/testsuite/

PR target/92807
* gcc.target/i386/pr92807-1.c: New test.

-- 
H.J.
From ad803a967a6c18ae3bd6f8381ebc8a78c31a82ae Mon Sep 17 00:00:00 2001
From: "H.J. Lu" 
Date: Tue, 3 Dec 2019 15:27:51 -0800
Subject: [PATCH] i386: Use add for a = a + b and a = b + a when possible

Since except for Bonnell,

01 fb                   add    %edi,%ebx

is faster and shorter than

8d 1c 1f                lea    (%rdi,%rbx,1),%ebx

we should use add for a = a + b and a = b + a when possible if not
optimizing for Bonnell.

Tested on x86-64.

gcc/

	PR target/92807
	* config/i386/i386.c (ix86_lea_outperforms): Check !TARGET_BONNELL.
	(ix86_avoid_lea_for_addr): When not optimizing for Bonnell, use add
	for a = a + b and a = b + a.

gcc/testsuite/

	PR target/92807
	* gcc.target/i386/pr92807-1.c: New test.
---
 gcc/config/i386/i386.c| 27 +++
 gcc/testsuite/gcc.target/i386/pr92807-1.c | 11 +
 2 files changed, 29 insertions(+), 9 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr92807-1.c

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 04cbbd532c0d..65f0d44916a8 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -14393,11 +14393,10 @@ ix86_lea_outperforms (rtx_insn *insn, unsigned int regno0, unsigned int regno1,
 {
   int dist_define, dist_use;
 
-  /* For Silvermont if using a 2-source or 3-source LEA for
- non-destructive destination purposes, or due to wanting
- ability to use SCALE, the use of LEA is justified.  */
-  if (TARGET_SILVERMONT || TARGET_GOLDMONT || TARGET_GOLDMONT_PLUS
-  || TARGET_TREMONT || TARGET_INTEL)
+  /* For Atom processors newer than Bonnell, if using a 2-source or
+ 3-source LEA for non-destructive destination purposes, or due to
+ wanting ability to use SCALE, the use of LEA is justified.  */
+  if (!TARGET_BONNELL)
 {
   if (has_scale)
 	return true;
@@ -14532,10 +14531,6 @@ ix86_avoid_lea_for_addr (rtx_insn *insn, rtx operands[])
   struct ix86_address parts;
   int ok;
 
-  /* Check we need to optimize.  */
-  if (!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun))
-return false;
-
   /* The "at least two components" test below might not catch simple
  move or zero extension insns if parts.base is non-NULL and parts.disp
  is const0_rtx as the only components in the address, e.g. if the
@@ -14572,6 +14567,20 @@ ix86_avoid_lea_for_addr (rtx_insn *insn, rtx operands[])
   if (parts.index)
 regno2 = true_regnum (parts.index);
 
+  /* Use add for a = a + b and a = b + a since it is faster and shorter
+ than lea for most processors.  For the processors like BONNELL, if
+ the destination register of LEA holds an actual address which will
+ be used soon, LEA is better and otherwise ADD is better.  */
+  if (!TARGET_BONNELL
+  && parts.scale == 1
+  && (!parts.disp || parts.disp == const0_rtx)
+  && (regno0 == regno1 || regno0 == regno2))
+return true;
+
+  /* Check we need to optimize.  */
+  if (!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun))
+return false;
+
   split_cost = 0;
 
   /* Compute how many cycles we will add to execution time
diff --git a/gcc/testsuite/gcc.target/i386/pr92807-1.c b/gcc/testsuite/gcc.target/i386/pr92807-1.c
new file mode 100644
index ..00f92930af92
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr92807-1.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+unsigned int
+abs2 (unsigned int a) 
+{
+  unsigned int s = ((a>>15)&0x10001)*0xffff;
+  return (a+s)^s;
+}
+
+/* { dg-final { scan-assembler-not "leal" } } */
-- 
2.21.0



Re: [C++ coroutines] Initial implementation pushed to master.

2024-03-05 Thread H.J. Lu
On Sat, Jan 18, 2020 at 4:54 AM Iain Sandoe  wrote:
>
> Hi,
>
> Thanks to:
>
>* the reviewers, the code was definitely improved by your reviews.
>
>* those folks who tested the branch and/or compiler explorer
>  instance and reported problems with reproducers.
>
>   * WG21 colleagues, especially Lewis and Gor for valuable input
> and discussions on the design.
>
> = TL;DR:
>
> * This is not enabled by default (even for -std=c++2a), it needs -fcoroutines.
>
> * Like all the C++20 support, it is experimental, perhaps more experimental
>   than some other pieces because wording is still being amended.
>
> * The FE/ME tests are run for ALL targets; in principle this should be target-
>   agnostic, if we see fails then that is probably interesting input for the 
> ABI
>  panel.
>
>  * I regstrapped on 64b LE and BE platforms and a 32b LE host with no observed
>   issues or regressions.
>
>  * it’s just slightly too big to send uncompressed so attached as a bz2.
>
>  * commit is r10-6063-g49789fd08
>
> thanks again to all those who helped,
> Iain
>
> ==  The full covering note:
>
> This is the squashed version of the first 6 patches that were split to
> facilitate review.
>
> The changes to libiberty (7th patch) to support demangling the co_await
> operator stand alone and are applied separately.
>
> The patch series is an initial implementation of a coroutine feature,
> expected to be standardised in C++20.
>
> Standardisation status (and potential impact on this implementation)
> 
>
> The facility was accepted into the working draft for C++20 by WG21 in
> February 2019.  During following WG21 meetings, design and national body
> comments have been reviewed, with no significant change resulting.
>
> The current GCC implementation is against n4835 [1].
>
> At this stage, the remaining potential for change comes from:
>
> * Areas of national body comments that were not resolved in the version we
>   have worked to:
>   (a) handling of the situation where aligned allocation is available.
>   (b) handling of the situation where a user wants coroutines, but does not
>   want exceptions (e.g. a GPU).
>
> * Agreed changes that have not yet been worded in a draft standard that we
>   have worked to.
>
> It is not expected that the resolution to these can produce any major
> change at this phase of the standardisation process.  Such changes should be
> limited to the coroutine-specific code.
>
> ABI
> ---
>
> The various compiler developers 'vendors' have discussed a minimal ABI to
> allow one implementation to call coroutines compiled by another.
>
> This amounts to:
>
> 1. The layout of a public portion of the coroutine frame.
>
>  Coroutines need to preserve state across suspension points, the storage for
>  this is called a "coroutine frame".
>
>  The ABI mandates that pointers into the coroutine frame point to an area
>  begining with two function pointers (to the resume and destroy functions
>  described below); these are immediately followed by the "promise object"
>  described in the standard.
>
>  This is sufficient that the builtins can take a coroutine frame pointer and
>  determine the address of the promise (or call the resume/destroy functions).
>
> 2. A number of compiler builtins that the standard library might use.
>
>   These are implemented by this patch series.
>
> 3. This introduces a new operator 'co_await' the mangling for which is also
> agreed between vendors (and has an issue filed for that against the upstream
> c++abi).  Demangling for this is added to libiberty in a separate patch.
>
> The ABI has currently no target-specific content (a given psABI might elect
> to mandate alignment, but the common ABI does not do this).
>
> Standard Library impact
> ---
>
> The current implementations require addition of only a single header to
> the standard library (no change to the runtime).  This header is part of
> the patch.
>
> GCC Implementation outline
> --
>
> The standard's design for coroutines does not decorate the definition of
> a coroutine in any way, so that a function is only known to be a coroutine
> when one of the keywords (co_await, co_yield, co_return) is encountered.
>
> This means that we cannot special-case such functions from the outset, but
> must process them differently when they are finalised - which we do from
> "finish_function ()".
>
> At a high level, this design of coroutine produces four pieces from the
> original user's function:
>
>   1. A coroutine state frame (taking the logical place of the activation
>  record for a regular function).  One item stored in that state is the
>  index of the current suspend point.
>   2. A "ramp" function
>  This is what the user calls to construct the coroutine frame and start
>  the coroutine execution.  This will return some object representing the
>  coroutine's eventual ret

[PATCH v2] tree-profile: Don't instrument an IFUNC resolver nor its callees

2024-03-05 Thread H.J. Lu
We can't instrument an IFUNC resolver or its callees, since they may
require TLS, which hasn't been set up yet when the dynamic linker is
resolving IFUNC symbols.

Add an IFUNC resolver caller marker to cgraph_node and set it if the
function is called by an IFUNC resolver.  Update tree_profiling to skip
functions called by an IFUNC resolver.

Tested with profiledbootstrap on Fedora 39/x86-64.
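
For reference, a minimal sketch of the kind of code this protects.  It
is not the actual gcc.dg/pr114115.c and all names are illustrative; the
point is that neither the resolver nor the local helper it calls may be
instrumented, because both run before TLS is available:

typedef int (*foo_fn) (void);

static int impl_generic (void) { return 0; }
static int impl_avx2 (void) { return 1; }

/* Local helper called from the resolver; tree_profiling must skip it
   as well, which is what the new caller marker tracks.  */
static foo_fn
pick_impl (void)
{
  __builtin_cpu_init ();
  return __builtin_cpu_supports ("avx2") ? impl_avx2 : impl_generic;
}

/* The IFUNC resolver runs while the dynamic linker is still resolving
   symbols, before TLS has been set up.  */
static foo_fn
resolve_foo (void)
{
  return pick_impl ();
}

int foo (void) __attribute__ ((ifunc ("resolve_foo")));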

gcc/ChangeLog:

PR tree-optimization/114115
* cgraph.h (symtab_node): Add check_ifunc_callee_symtab_nodes.
(cgraph_node): Add called_by_ifunc_resolver.
* cgraphunit.cc (symbol_table::compile): Call
symtab_node::check_ifunc_callee_symtab_nodes.
* symtab.cc (check_ifunc_resolver): New.
(ifunc_ref_map): Likewise.
(is_caller_ifunc_resolver): Likewise.
(symtab_node::check_ifunc_callee_symtab_nodes): Likewise.
* tree-profile.cc (tree_profiling): Do not instrument an IFUNC
resolver nor its callees.

gcc/testsuite/ChangeLog:

PR tree-optimization/114115
* gcc.dg/pr114115.c: New test.
---
 gcc/cgraph.h|  6 +++
 gcc/cgraphunit.cc   |  2 +
 gcc/symtab.cc   | 89 +
 gcc/testsuite/gcc.dg/pr114115.c | 24 +
 gcc/tree-profile.cc |  4 ++
 5 files changed, 125 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/pr114115.c

diff --git a/gcc/cgraph.h b/gcc/cgraph.h
index 47f35e8078d..a8c3224802c 100644
--- a/gcc/cgraph.h
+++ b/gcc/cgraph.h
@@ -479,6 +479,9 @@ public:
  Return NULL if there's no such node.  */
   static symtab_node *get_for_asmname (const_tree asmname);
 
+  /* Check symbol table for callees of IFUNC resolvers.  */
+  static void check_ifunc_callee_symtab_nodes (void);
+
   /* Verify symbol table for internal consistency.  */
   static DEBUG_FUNCTION void verify_symtab_nodes (void);
 
@@ -896,6 +899,7 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cgraph_node : public 
symtab_node
   redefined_extern_inline (false), tm_may_enter_irr (false),
   ipcp_clone (false), declare_variant_alt (false),
   calls_declare_variant_alt (false), gc_candidate (false),
+  called_by_ifunc_resolver (false),
   m_uid (uid), m_summary_id (-1)
   {}
 
@@ -1495,6 +1499,8 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cgraph_node : 
public symtab_node
  is set for local SIMD clones when they are created and cleared if the
  vectorizer uses them.  */
   unsigned gc_candidate : 1;
+  /* Set if the function is called by an IFUNC resolver.  */
+  unsigned called_by_ifunc_resolver : 1;
 
 private:
   /* Unique id of the node.  */
diff --git a/gcc/cgraphunit.cc b/gcc/cgraphunit.cc
index d200166f7e9..2bd0289ffba 100644
--- a/gcc/cgraphunit.cc
+++ b/gcc/cgraphunit.cc
@@ -2317,6 +2317,8 @@ symbol_table::compile (void)
 
   symtab_node::checking_verify_symtab_nodes ();
 
+  symtab_node::check_ifunc_callee_symtab_nodes ();
+
   timevar_push (TV_CGRAPHOPT);
   if (pre_ipa_mem_report)
 dump_memory_report ("Memory consumption before IPA");
diff --git a/gcc/symtab.cc b/gcc/symtab.cc
index 4c7e3c135ca..3256133891d 100644
--- a/gcc/symtab.cc
+++ b/gcc/symtab.cc
@@ -1369,6 +1369,95 @@ symtab_node::verify (void)
   timevar_pop (TV_CGRAPH_VERIFY);
 }
 
+/* Return true and set *DATA to true if NODE is an ifunc resolver.  */
+
+static bool
+check_ifunc_resolver (cgraph_node *node, void *data)
+{
+  if (node->ifunc_resolver)
+{
+  bool *is_ifunc_resolver = (bool *) data;
+  *is_ifunc_resolver = true;
+  return true;
+}
+  return false;
+}
+
+static auto_bitmap ifunc_ref_map;
+
+/* Return true if any caller of NODE is an ifunc resolver.  */
+
+static bool
+is_caller_ifunc_resolver (cgraph_node *node)
+{
+  bool is_ifunc_resolver = false;
+
+  for (cgraph_edge *e = node->callers; e; e = e->next_caller)
+{
+  /* Return true if caller is known to be an IFUNC resolver.  */
+  if (e->caller->called_by_ifunc_resolver)
+   return true;
+
+  /* Check for recursive call.  */
+  if (e->caller == node)
+   continue;
+
+  /* Skip if it has been visited.  */
+  unsigned int uid = e->caller->get_uid ();
+  if (bitmap_bit_p (ifunc_ref_map, uid))
+   continue;
+  bitmap_set_bit (ifunc_ref_map, uid);
+
+  if (is_caller_ifunc_resolver (e->caller))
+   {
+ /* Return true if caller is an IFUNC resolver.  */
+ e->caller->called_by_ifunc_resolver = true;
+ return true;
+   }
+
+  /* Check if caller's alias is an IFUNC resolver.  */
+  e->caller->call_for_symbol_and_aliases (check_ifunc_resolver,
+ &is_ifunc_resolver,
+ true);
+  if (is_ifunc_resolver)
+   {
+ /* Return true if caller's alias is an IFUNC resolver.  */
+ e->caller->called_by_ifunc_resolver = true;
+ return true;
+   }
+}
+
+  return false;
+}
+
+/* Check symbol table for ca

Re: [PATCH] tree-profile: Don't instrument an IFUNC resolver nor its callees

2024-03-05 Thread H.J. Lu
On Thu, Feb 29, 2024 at 7:11 AM H.J. Lu  wrote:
>
> On Thu, Feb 29, 2024 at 7:06 AM Jan Hubicka  wrote:
> >
> > > > I am worried about scenario where ifunc selector calls function foo
> > > > defined locally and foo is also used from other places possibly in hot
> > > > loops.
> > > > >
> > > > > > So it is not really reliable fix (though I guess it will work a lot 
> > > > > > of
> > > > > > common code).  I wonder what would be alternatives.  In GCC 
> > > > > > generated
> > > > > > profling code we use TLS only for undirect call profiling (so there 
> > > > > > is
> > > > > > no need to turn off rest of profiling).  I wonder if there is any 
> > > > > > chance
> > > > > > to not make it seffault when it is done before TLS is set up?
> > > > >
> > > > > IFUNC selector should make minimum external calls, none is preferred.
> > > >
> > > > Edge porfiling only inserts (atomic) 64bit increments of counters.
> > > > If target supports these operations inline, no external calls will be
> > > > done.
> > > >
> > > > Indirect call profiling inserts the problematic TLS variable (to track
> > > > caller-callee pairs). Value profiling also inserts various additional
> > > > external calls to counters.
> > > >
> > > > I am perfectly fine with disabling instrumentation for ifunc selectors
> > > > and functions only reachable from them, but I am worried about calles
> > > > used also from non-ifunc path.
> > >
> > > Programmers need to understand not to do it.
> >
> > It would help to have this documented. Should we warn when ifunc
> > resolver calls external function, comdat of function reachable from
> > non-ifunc code?
>
> That will be nice.
>
> > >
> > > > For example selector implemented in C++ may do some string handling to
> > > > match CPU name and propagation will disable profiling for std::string
> > >
> > > On x86, they should use CPUID, not string functions.
> > >
> > > > member functions (which may not be effective if comdat section is
> > > > prevailed from other translation unit).
> > >
> > > String functions may lead to external function calls which is dangerous.
> > >
> > > > > Any external calls may lead to issues at run-time.  It is a very bad 
> > > > > idea
> > > > > to profile IFUNC selector via external function call.
> > > >
> > > > Looking at https://sourceware.org/glibc/wiki/GNU_IFUNC
> > > > there are other limitations on ifunc except for profiling, such as
> > > > -fstack-protector-all.  So perhaps your propagation can be used to
> > > > disable those features as well.
> > >
> > > So, it may not be tree-profile specific.  Where should these 2 bits
> > > be added?
> >
> > If we want to disable other transforms too, then I think having a bit in
> > cgraph_node for reachability from ifunc resolver makes sense.
> > I would still do the cycle detection using on-side hash_map to avoid
> > polution of the global datastructure.
> >
>
> I will see what I can do.
>
>

The v2 patch is at

https://patchwork.sourceware.org/project/gcc/list/?series=31627

-- 
H.J.


Re: [C++ coroutines] Initial implementation pushed to master.

2024-03-06 Thread H.J. Lu
On Wed, Mar 6, 2024 at 1:03 AM Iain Sandoe  wrote:
>
>
>
> > On 5 Mar 2024, at 17:31, H.J. Lu  wrote:
> >
> > On Sat, Jan 18, 2020 at 4:54 AM Iain Sandoe  wrote:
> >>
>
> >> 2020-01-18  Iain Sandoe  
> >>
> >>* Makefile.in: Add coroutine-passes.o.
> >>* builtin-types.def (BT_CONST_SIZE): New.
> >>(BT_FN_BOOL_PTR): New.
> >>(BT_FN_PTR_PTR_CONST_SIZE_BOOL): New.
> >>* builtins.def (DEF_COROUTINE_BUILTIN): New.
> >>* coroutine-builtins.def: New file.
> >>* coroutine-passes.cc: New file.
> >
> > There are
> >
> >  tree res_tgt = TREE_OPERAND (gimple_call_arg (stmt, 2), 0);
> >  tree &res_dest = destinations.get_or_insert (idx, &existed);
> >  if (existed && dump_file)
> >Why does this behavior depend on dump_file?
>
> This was checking for a potential wrong-code error during development;
> there is no point in making it into a diagnostic (since the user could not fix
> the problem if it happened).  I guess changing to a gcc_checking_assert()
> would be reasonable but I’d prefer to do that once GCC-15 opens.
>
> Have you found any instance where this results in a reported bug?

No, I haven't.  I only noticed it by chance.

> (I do not recall anything on my coroutines bug list that would seem to 
> indicate this).
>
> thanks for noting it.
> Iain
>
>
> >{
> >  fprintf (
> >dump_file,
> >"duplicate YIELD RESUME point (" HOST_WIDE_INT_PRINT_DEC
> >") ?\n",
> >idx);
> >  print_gimple_stmt (dump_file, stmt, 0, 
> > TDF_VOPS|TDF_MEMSYMS);
> >}
> >  else
> >res_dest = res_tgt;
> >
> > H.J.
>


-- 
H.J.


Re: libbacktrace patch committed: Don't assume compressed section aligned

2024-03-08 Thread H.J. Lu
On Fri, Mar 8, 2024 at 2:48 PM Fangrui Song  wrote:
>
> On ELF64, it looks like BFD uses 8-byte alignment for compressed
> `.debug_*` sections while gold/lld/mold use 1-byte alignment. I do not
> know how the Solaris linker sets the alignment.
>
> The specification's wording makes me confused whether it really
> requires 8-byte alignment, even if a non-packed `Elf64_Chdr` surely
> requires 8.

Since compressed sections begin with a compression header
structure that identifies the compression algorithm, compressed
sections must be aligned to the alignment of the compression
header.  I don't think there is any ambiguity here.
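
For context, the workaround amounts to copying the compression header
out of the (possibly unaligned) section contents instead of reading it
in place.  A hedged sketch of the idea, using the glibc <elf.h>
definitions rather than libbacktrace's own internal types, and with an
illustrative function name:

#include <elf.h>
#include <string.h>

/* Sketch only, not the actual elf.c change: memcpy has no alignment
   requirement, unlike dereferencing the buffer as an Elf64_Chdr.  */
static int
read_chdr (const unsigned char *compressed, size_t compressed_size,
	   Elf64_Chdr *chdr)
{
  if (compressed_size < sizeof *chdr)
    return 0;
  memcpy (chdr, compressed, sizeof *chdr);
  return chdr->ch_type == ELFCOMPRESS_ZLIB;
}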

> > The sh_size and sh_addralign fields of the section header for a compressed 
> > section reflect the requirements of the compressed section.
>
> There are many `.debug_*` sections. So avoiding some alignment padding
> seems a very natural extension (a DWARF v5 -gsplit-dwarf relocatable
> file has ~10 `.debug_*` sections), even if the specification doesn't
> allow it with a very strict interpretation...
>
> (Off-topic: I wonder whether ELF control structures should use
> unaligned LEB128 more. REL/RELA can naturally be replaced with a
> LEB128 one similar to wasm.)
>
> On Fri, Mar 8, 2024 at 1:57 PM Ian Lance Taylor  wrote:
> >
> > Reportedly when lld compresses debug sections, it fails to set the
> > alignment of the compressed section such that the compressed header
> > can be read directly.  To me this seems like a bug in lld.  However,
> > libbacktrace needs to work around it.  This patch, originally by the
> > GitHub user ubyte, does that.  Bootstrapped and tested on
> > x86_64-pc-linux-gnu.  Committed to mainline.
> >
> > Ian
> >
> > * elf.c (elf_uncompress_chdr): Don't assume compressed section is
> > aligned.
>
>
>
> --
> 宋方睿



-- 
H.J.


PING: [PATCH v2] tree-profile: Don't instrument an IFUNC resolver nor its callees

2024-04-02 Thread H.J. Lu
On Tue, Mar 5, 2024 at 1:45 PM H.J. Lu  wrote:
>
> We can't instrument an IFUNC resolver nor its callees as it may require
> TLS which hasn't been set up yet when the dynamic linker is resolving
> IFUNC symbols.
>
> Add an IFUNC resolver caller marker to cgraph_node and set it if the
> function is called by an IFUNC resolver.  Update tree_profiling to skip
> functions called by IFUNC resolver.
>
> Tested with profiledbootstrap on Fedora 39/x86-64.
>
> gcc/ChangeLog:
>
> PR tree-optimization/114115
> * cgraph.h (symtab_node): Add check_ifunc_callee_symtab_nodes.
> (cgraph_node): Add called_by_ifunc_resolver.
> * cgraphunit.cc (symbol_table::compile): Call
> symtab_node::check_ifunc_callee_symtab_nodes.
> * symtab.cc (check_ifunc_resolver): New.
> (ifunc_ref_map): Likewise.
> (is_caller_ifunc_resolver): Likewise.
> (symtab_node::check_ifunc_callee_symtab_nodes): Likewise.
> * tree-profile.cc (tree_profiling): Do not instrument an IFUNC
> resolver nor its callees.
>
> gcc/testsuite/ChangeLog:
>
> PR tree-optimization/114115
> * gcc.dg/pr114115.c: New test.
> ---
>  gcc/cgraph.h|  6 +++
>  gcc/cgraphunit.cc   |  2 +
>  gcc/symtab.cc   | 89 +
>  gcc/testsuite/gcc.dg/pr114115.c | 24 +
>  gcc/tree-profile.cc |  4 ++
>  5 files changed, 125 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.dg/pr114115.c
>
> diff --git a/gcc/cgraph.h b/gcc/cgraph.h
> index 47f35e8078d..a8c3224802c 100644
> --- a/gcc/cgraph.h
> +++ b/gcc/cgraph.h
> @@ -479,6 +479,9 @@ public:
>   Return NULL if there's no such node.  */
>static symtab_node *get_for_asmname (const_tree asmname);
>
> +  /* Check symbol table for callees of IFUNC resolvers.  */
> +  static void check_ifunc_callee_symtab_nodes (void);
> +
>/* Verify symbol table for internal consistency.  */
>static DEBUG_FUNCTION void verify_symtab_nodes (void);
>
> @@ -896,6 +899,7 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cgraph_node : 
> public symtab_node
>redefined_extern_inline (false), tm_may_enter_irr (false),
>ipcp_clone (false), declare_variant_alt (false),
>calls_declare_variant_alt (false), gc_candidate (false),
> +  called_by_ifunc_resolver (false),
>m_uid (uid), m_summary_id (-1)
>{}
>
> @@ -1495,6 +1499,8 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cgraph_node : 
> public symtab_node
>   is set for local SIMD clones when they are created and cleared if the
>   vectorizer uses them.  */
>unsigned gc_candidate : 1;
> +  /* Set if the function is called by an IFUNC resolver.  */
> +  unsigned called_by_ifunc_resolver : 1;
>
>  private:
>/* Unique id of the node.  */
> diff --git a/gcc/cgraphunit.cc b/gcc/cgraphunit.cc
> index d200166f7e9..2bd0289ffba 100644
> --- a/gcc/cgraphunit.cc
> +++ b/gcc/cgraphunit.cc
> @@ -2317,6 +2317,8 @@ symbol_table::compile (void)
>
>symtab_node::checking_verify_symtab_nodes ();
>
> +  symtab_node::check_ifunc_callee_symtab_nodes ();
> +
>timevar_push (TV_CGRAPHOPT);
>if (pre_ipa_mem_report)
>  dump_memory_report ("Memory consumption before IPA");
> diff --git a/gcc/symtab.cc b/gcc/symtab.cc
> index 4c7e3c135ca..3256133891d 100644
> --- a/gcc/symtab.cc
> +++ b/gcc/symtab.cc
> @@ -1369,6 +1369,95 @@ symtab_node::verify (void)
>timevar_pop (TV_CGRAPH_VERIFY);
>  }
>
> +/* Return true and set *DATA to true if NODE is an ifunc resolver.  */
> +
> +static bool
> +check_ifunc_resolver (cgraph_node *node, void *data)
> +{
> +  if (node->ifunc_resolver)
> +{
> +  bool *is_ifunc_resolver = (bool *) data;
> +  *is_ifunc_resolver = true;
> +  return true;
> +}
> +  return false;
> +}
> +
> +static auto_bitmap ifunc_ref_map;
> +
> +/* Return true if any caller of NODE is an ifunc resolver.  */
> +
> +static bool
> +is_caller_ifunc_resolver (cgraph_node *node)
> +{
> +  bool is_ifunc_resolver = false;
> +
> +  for (cgraph_edge *e = node->callers; e; e = e->next_caller)
> +{
> +  /* Return true if caller is known to be an IFUNC resolver.  */
> +  if (e->caller->called_by_ifunc_resolver)
> +   return true;
> +
> +  /* Check for recursive call.  */
> +  if (e->caller == node)
> +   continue;
> +
> +  /* Skip if it has been visited.  */
> +  unsigned int uid = e->caller->get_uid ();
> +  if (bitmap_bit_p (ifunc_ref_map, uid))
> +   continue;
> +  bitm

Re: PING: [PATCH v2] tree-profile: Don't instrument an IFUNC resolver nor its callees

2024-04-02 Thread H.J. Lu
On Tue, Apr 2, 2024 at 7:50 AM Jan Hubicka  wrote:
>
> > On Tue, Mar 5, 2024 at 1:45 PM H.J. Lu  wrote:
> > >
> > > We can't instrument an IFUNC resolver nor its callees as it may require
> > > TLS which hasn't been set up yet when the dynamic linker is resolving
> > > IFUNC symbols.
> > >
> > > Add an IFUNC resolver caller marker to cgraph_node and set it if the
> > > function is called by an IFUNC resolver.  Update tree_profiling to skip
> > > functions called by IFUNC resolver.
> > >
> > > Tested with profiledbootstrap on Fedora 39/x86-64.
> > >
> > > gcc/ChangeLog:
> > >
> > > PR tree-optimization/114115
> > > * cgraph.h (symtab_node): Add check_ifunc_callee_symtab_nodes.
> > > (cgraph_node): Add called_by_ifunc_resolver.
> > > * cgraphunit.cc (symbol_table::compile): Call
> > > symtab_node::check_ifunc_callee_symtab_nodes.
> > > * symtab.cc (check_ifunc_resolver): New.
> > > (ifunc_ref_map): Likewise.
> > > (is_caller_ifunc_resolver): Likewise.
> > > (symtab_node::check_ifunc_callee_symtab_nodes): Likewise.
> > > * tree-profile.cc (tree_profiling): Do not instrument an IFUNC
> > > resolver nor its callees.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > PR tree-optimization/114115
> > > * gcc.dg/pr114115.c: New test.
> >
> > PING.
>
> I am bit worried about commonly used functions getting "infected" by
> being called once from ifunc resolver.  I think we only use thread local
> storage for indirect call profiling, so we may just disable indirect
> call profiling for these functions.

Will change it.

> Also the patch will be noop with -flto -flto-partition=max, so probably
> we need to compute this flag at WPA time and stream to partitions.
>

Why is it a nop with -flto -flto-partition=max? I got

(gdb) bt
#0  symtab_node::check_ifunc_callee_symtab_nodes ()
at /export/gnu/import/git/gitlab/x86-gcc/gcc/symtab.cc:1440
#1  0x00e487d3 in symbol_table::compile (this=0x7fffea006000)
at /export/gnu/import/git/gitlab/x86-gcc/gcc/cgraphunit.cc:2320
#2  0x00d23ecf in lto_main ()
at /export/gnu/import/git/gitlab/x86-gcc/gcc/lto/lto.cc:687
#3  0x015254d2 in compile_file ()
at /export/gnu/import/git/gitlab/x86-gcc/gcc/toplev.cc:449
#4  0x015284a4 in do_compile ()
at /export/gnu/import/git/gitlab/x86-gcc/gcc/toplev.cc:2154
#5  0x01528864 in toplev::main (this=0x7fffd84a, argc=16,
argv=0x42261f0) at /export/gnu/import/git/gitlab/x86-gcc/gcc/toplev.cc:2310
#6  0x030a3fe2 in main (argc=16, argv=0x7fffd958)
at /export/gnu/import/git/gitlab/x86-gcc/gcc/main.cc:39

Do you have a testcase to show that it is a nop?

-- 
H.J.


  1   2   3   4   5   6   7   8   9   10   >