Re: [PATCH] i386: Add indirect_return function attribute
On Fri, Jun 8, 2018 at 3:27 AM, H.J. Lu wrote: > On x86, swapcontext may return via indirect branch when shadow stack > is enabled. To support code instrumentation of control-flow transfers > with -fcf-protection, add indirect_return function attribute to inform > compiler that a function may return via indirect branch. > > Note: Unlike setjmp, swapcontext only returns once. Mark it return > twice will unnecessarily disable compiler optimization. > > OK for trunk? > > H.J. > > gcc/ > > PR target/85620 > * config/i386/i386.c (rest_of_insert_endbranch): Also generate > ENDBRANCH for non-tail call which may return via indirect branch. > * doc/extend.texi: Document indirect_return attribute. > > gcc/testsuite/ > > PR target/85620 > * gcc.target/i386/pr85620-1.c: New test. > * gcc.target/i386/pr85620-2.c: Likewise. > Here is the updated patch with a testcase to show the impact of returns_twice attribute. Jan, Uros, can you take a look? Thanks. -- H.J. From 6115541e03073b93bd81f5eb81fdedd4e5b47b28 Mon Sep 17 00:00:00 2001 From: "H.J. Lu" Date: Thu, 7 Jun 2018 20:05:15 -0700 Subject: [PATCH] i386; Add indirect_return function attribute On x86, swapcontext may return via indirect branch when shadow stack is enabled. To support code instrumentation of control-flow transfers with -fcf-protection, add indirect_return function attribute to inform compiler that a function may return via indirect branch. Note: Unlike setjmp, swapcontext only returns once. Mark it return twice will unnecessarily disable compiler optimization as shown in the testcase here. gcc/ PR target/85620 * config/i386/i386.c (rest_of_insert_endbranch): Also generate ENDBRANCH for non-tail call which may return via indirect branch. * doc/extend.texi: Document indirect_return attribute. gcc/testsuite/ PR target/85620 * gcc.target/i386/pr85620-1.c: New test. * gcc.target/i386/pr85620-2.c: Likewise. * gcc.target/i386/pr85620-3.c: Likewise. * gcc.target/i386/pr85620-4.c: Likewise. --- gcc/config/i386/i386.c| 23 ++- gcc/doc/extend.texi | 6 ++ gcc/testsuite/gcc.target/i386/pr85620-1.c | 15 +++ gcc/testsuite/gcc.target/i386/pr85620-2.c | 13 + gcc/testsuite/gcc.target/i386/pr85620-3.c | 18 ++ gcc/testsuite/gcc.target/i386/pr85620-4.c | 18 ++ 6 files changed, 92 insertions(+), 1 deletion(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr85620-1.c create mode 100644 gcc/testsuite/gcc.target/i386/pr85620-2.c create mode 100644 gcc/testsuite/gcc.target/i386/pr85620-3.c create mode 100644 gcc/testsuite/gcc.target/i386/pr85620-4.c diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index e6d17632142..41461d582a4 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -2621,7 +2621,26 @@ rest_of_insert_endbranch (void) { if (CALL_P (insn)) { - if (find_reg_note (insn, REG_SETJMP, NULL) == NULL) + bool need_endbr; + need_endbr = find_reg_note (insn, REG_SETJMP, NULL) != NULL; + if (!need_endbr && !SIBLING_CALL_P (insn)) + { + rtx call = get_call_rtx_from (insn); + rtx fnaddr = XEXP (call, 0); + + /* Also generate ENDBRANCH for non-tail call which + may return via indirect branch. */ + if (MEM_P (fnaddr) + && GET_CODE (XEXP (fnaddr, 0)) == SYMBOL_REF) + { + tree fndecl = SYMBOL_REF_DECL (XEXP (fnaddr, 0)); + if (fndecl + && lookup_attribute ("indirect_return", + DECL_ATTRIBUTES (fndecl))) + need_endbr = true; + } + } + if (!need_endbr) continue; /* Generate ENDBRANCH after CALL, which can return more than twice, setjmp-like functions. 
*/ @@ -45897,6 +45916,8 @@ static const struct attribute_spec ix86_attribute_table[] = ix86_handle_fndecl_attribute, NULL }, { "function_return", 1, 1, true, false, false, false, ix86_handle_fndecl_attribute, NULL }, + { "indirect_return", 0, 0, true, false, false, false, +ix86_handle_fndecl_attribute, NULL }, /* End element. */ { NULL, 0, 0, false, false, false, false, NULL, NULL } diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi index 19c2da2e5db..97b1f78cade 100644 --- a/gcc/doc/extend.texi +++ b/gcc/doc/extend.texi @@ -5886,6 +5886,12 @@ foo (void) @} @end smallexample +@item indirect_return +@cindex @code{indirect_return} function attribute, x86 + +The @code{indirect_return} attribute on a function is used to inform +the compiler that the function may return via indiret branch. + @end table On the x86, the inliner does not inline a diff --git a/gcc/testsuite/gcc.target/i386/pr85620-1.c b/gcc/testsuite/gcc.target/i386/pr85620-1.c new file mode 100644 index 000..32efb08e59e --
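To make the intended usage concrete, here is a minimal caller-side sketch; my_swapcontext, struct context and res are illustrative stand-ins (not names from the patch or from glibc), but the shape mirrors the pr85620 testcases:

/* Sketch only.  Compiled with -O2 -fcf-protection, the patch makes GCC
   emit an ENDBR after the (non-tail) call to my_swapcontext, because
   the callee is declared with the indirect_return attribute.  */
struct context;

extern int my_swapcontext (struct context *oucp, struct context *ucp)
  __attribute__ ((indirect_return));

extern int res;

void
resume (struct context *oucp, struct context *ucp)
{
  /* Storing the result keeps this from becoming a sibling call; the
     patch only instruments non-tail calls.  */
  res = my_swapcontext (oucp, ucp);
}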
Re: [PATCH] i386: Add indirect_return function attribute
On Tue, Jul 3, 2018 at 9:12 AM, Uros Bizjak wrote: > On Tue, Jul 3, 2018 at 5:32 PM, H.J. Lu wrote: >> On Fri, Jun 8, 2018 at 3:27 AM, H.J. Lu wrote: >>> On x86, swapcontext may return via indirect branch when shadow stack >>> is enabled. To support code instrumentation of control-flow transfers >>> with -fcf-protection, add indirect_return function attribute to inform >>> compiler that a function may return via indirect branch. >>> >>> Note: Unlike setjmp, swapcontext only returns once. Mark it return >>> twice will unnecessarily disable compiler optimization. >>> >>> OK for trunk? >>> >>> H.J. >>> >>> gcc/ >>> >>> PR target/85620 >>> * config/i386/i386.c (rest_of_insert_endbranch): Also generate >>> ENDBRANCH for non-tail call which may return via indirect branch. >>> * doc/extend.texi: Document indirect_return attribute. >>> >>> gcc/testsuite/ >>> >>> PR target/85620 >>> * gcc.target/i386/pr85620-1.c: New test. >>> * gcc.target/i386/pr85620-2.c: Likewise. >>> >> >> Here is the updated patch with a testcase to show the impact of >> returns_twice attribute. >> >> Jan, Uros, can you take a look? > > LGTM for the implementation, can't say if attribute is really needed or not. This gives programmers more flexibly. > +@item indirect_return > +@cindex @code{indirect_return} function attribute, x86 > + > +The @code{indirect_return} attribute on a function is used to inform > +the compiler that the function may return via indiret branch. > > s/indiret/indirect/ Fixed. Here is the updated patch. Thanks. -- H.J. From bb98f6a31801659ae3c6689d6d31af33a3c28bb2 Mon Sep 17 00:00:00 2001 From: "H.J. Lu" Date: Thu, 7 Jun 2018 20:05:15 -0700 Subject: [PATCH] i386; Add indirect_return function attribute On x86, swapcontext may return via indirect branch when shadow stack is enabled. To support code instrumentation of control-flow transfers with -fcf-protection, add indirect_return function attribute to inform compiler that a function may return via indirect branch. Note: Unlike setjmp, swapcontext only returns once. Mark it return twice will unnecessarily disable compiler optimization as shown in the testcase here. gcc/ PR target/85620 * config/i386/i386.c (rest_of_insert_endbranch): Also generate ENDBRANCH for non-tail call which may return via indirect branch. * doc/extend.texi: Document indirect_return attribute. gcc/testsuite/ PR target/85620 * gcc.target/i386/pr85620-1.c: New test. * gcc.target/i386/pr85620-2.c: Likewise. * gcc.target/i386/pr85620-3.c: Likewise. * gcc.target/i386/pr85620-4.c: Likewise. 
--- gcc/config/i386/i386.c| 23 ++- gcc/doc/extend.texi | 6 ++ gcc/testsuite/gcc.target/i386/pr85620-1.c | 15 +++ gcc/testsuite/gcc.target/i386/pr85620-2.c | 13 + gcc/testsuite/gcc.target/i386/pr85620-3.c | 18 ++ gcc/testsuite/gcc.target/i386/pr85620-4.c | 18 ++ 6 files changed, 92 insertions(+), 1 deletion(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr85620-1.c create mode 100644 gcc/testsuite/gcc.target/i386/pr85620-2.c create mode 100644 gcc/testsuite/gcc.target/i386/pr85620-3.c create mode 100644 gcc/testsuite/gcc.target/i386/pr85620-4.c diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index e6d17632142..41461d582a4 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -2621,7 +2621,26 @@ rest_of_insert_endbranch (void) { if (CALL_P (insn)) { - if (find_reg_note (insn, REG_SETJMP, NULL) == NULL) + bool need_endbr; + need_endbr = find_reg_note (insn, REG_SETJMP, NULL) != NULL; + if (!need_endbr && !SIBLING_CALL_P (insn)) + { + rtx call = get_call_rtx_from (insn); + rtx fnaddr = XEXP (call, 0); + + /* Also generate ENDBRANCH for non-tail call which + may return via indirect branch. */ + if (MEM_P (fnaddr) + && GET_CODE (XEXP (fnaddr, 0)) == SYMBOL_REF) + { + tree fndecl = SYMBOL_REF_DECL (XEXP (fnaddr, 0)); + if (fndecl + && lookup_attribute ("indirect_return", + DECL_ATTRIBUTES (fndecl))) + need_endbr = true; + } + } + if (!need_endbr) continue; /* Generate ENDBRANCH after CALL, which can return more than twice, setjmp-like functions. */ @@ -45897,6 +45916,8 @@ static const struct attribute_spec ix86_attribute_table[] = ix86_handle_fndecl_attribute, NULL }, { "function_return", 1, 1, true, false, false, false, ix86_handle_fndecl_attribute, NULL }, + { "indirect_return",
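To illustrate the note about returns_twice versus indirect_return, here is a small hand-written comparison; ctx_switch_rt, ctx_switch_ir and use are made-up declarations, not the pr85620 testcases:

/* Sketch only.  ctx_switch_rt is treated like setjmp: the caller must
   assume it can return a second time, which blocks optimization of
   values live across the call.  ctx_switch_ir only asks GCC to place
   an ENDBR after the call under -fcf-protection; optimization of the
   caller is otherwise unaffected.  */
struct context;

extern int ctx_switch_rt (struct context *) __attribute__ ((returns_twice));
extern int ctx_switch_ir (struct context *) __attribute__ ((indirect_return));

extern void use (int, int);

void
caller (struct context *ucp, int x)
{
  int tmp = x * 2;
  use (ctx_switch_rt (ucp), tmp);   /* tmp must survive an abnormal return */
  use (ctx_switch_ir (ucp), tmp);   /* plain call, followed by ENDBR */
}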
[PATCH] x86: Tune Skylake, Cannonlake and Icelake as Haswell
r259399, which added PROCESSOR_SKYLAKE, disabled many x86 optimizations which are enabled by PROCESSOR_HASWELL. As the result, -mtune=skylake generates slower codes on Skylake than before. The same also applies to Cannonlake and Icelak tuning. This patch changes -mtune={skylake|cannonlake|icelake} to tune like -mtune=haswell for until their tuning is properly adjusted. It also enables -mprefer-vector-width=256 for -mtune=haswell, which has no impact on codegen when AVX512 isn't enabled. Performance impacts on SPEC CPU 2017 rate with 1 copy using -march=native -mfpmath=sse -O2 -m64 are 1. On Broadwell server: 500.perlbench_r -0.56% 502.gcc_r -0.18% 505.mcf_r 0.24% 520.omnetpp_r 0.00% 523.xalancbmk_r -0.32% 525.x264_r -0.17% 531.deepsjeng_r 0.00% 541.leela_r 0.00% 548.exchange2_r 0.12% 557.xz_r0.00% geomean 0.00% 503.bwaves_r0.00% 507.cactuBSSN_r 0.21% 508.namd_r 0.00% 510.parest_r0.19% 511.povray_r-0.48% 519.lbm_r 0.00% 521.wrf_r 0.28% 526.blender_r 0.19% 527.cam4_r 0.39% 538.imagick_r 0.00% 544.nab_r -0.36% 549.fotonik3d_r 0.51% 554.roms_r 0.00% geomean 0.17% On Skylake client: 500.perlbench_r 0.96% 502.gcc_r 0.13% 505.mcf_r -1.03% 520.omnetpp_r -1.11% 523.xalancbmk_r 1.02% 525.x264_r 0.50% 531.deepsjeng_r 2.97% 541.leela_r 0.50% 548.exchange2_r -0.95% 557.xz_r2.41% geomean 0.56% 503.bwaves_r0.49% 507.cactuBSSN_r 3.17% 508.namd_r 4.05% 510.parest_r0.15% 511.povray_r0.80% 519.lbm_r 3.15% 521.wrf_r 10.56% 526.blender_r 2.97% 527.cam4_r 2.36% 538.imagick_r 46.40% 544.nab_r 2.04% 549.fotonik3d_r 0.00% 554.roms_r 1.27% geomean 5.49% On Skylake server: 500.perlbench_r 0.71% 502.gcc_r -0.51% 505.mcf_r -1.06% 520.omnetpp_r -0.33% 523.xalancbmk_r -0.22% 525.x264_r 1.72% 531.deepsjeng_r -0.26% 541.leela_r 0.57% 548.exchange2_r -0.75% 557.xz_r-1.28% geomean -0.21% 503.bwaves_r0.00% 507.cactuBSSN_r 2.66% 508.namd_r 3.67% 510.parest_r1.25% 511.povray_r2.26% 519.lbm_r 1.69% 521.wrf_r 11.03% 526.blender_r 3.39% 527.cam4_r 1.69% 538.imagick_r 64.59% 544.nab_r -0.54% 549.fotonik3d_r 2.68% 554.roms_r 0.00% geomean 6.19% This patch improves -march=native performance on Skylake up to 60% and leaves -march=native performance unchanged on Haswell. OK for trunk? Thanks. H.J. --- gcc/ 2018-07-12 H.J. Lu Sunil K Pandey PR target/84413 * config/i386/i386.c (m_HASWELL): Add PROCESSOR_SKYLAKE, PROCESSOR_SKYLAKE_AVX512, PROCESSOR_CANNONLAKE, PROCESSOR_ICELAKE_CLIENT and PROCESSOR_ICELAKE_SERVER. (m_SKYLAKE): Set to 0. (m_SKYLAKE_AVX512): Likewise. (m_CANNONLAKE): Likewise. (m_ICELAKE_CLIENT): Likewise. (m_ICELAKE_SERVER): Likewise. * config/i386/x86-tune.def (avx256_optimal): Also enabled for m_HASWELL. gcc/testsuite/ 2018-07-12 H.J. Lu Sunil K Pandey PR target/84413 * gcc.target/i386/pr84413-1.c: New test. * gcc.target/i386/pr84413-2.c: Likewise. * gcc.target/i386/pr84413-3.c: Likewise. * gcc.target/i386/pr84413-4.c: Likewise. 
--- gcc/config/i386/i386.c| 17 +++-- gcc/config/i386/x86-tune.def | 9 ++--- gcc/testsuite/gcc.target/i386/pr84413-1.c | 17 + gcc/testsuite/gcc.target/i386/pr84413-2.c | 17 + gcc/testsuite/gcc.target/i386/pr84413-3.c | 17 + gcc/testsuite/gcc.target/i386/pr84413-4.c | 17 + 6 files changed, 85 insertions(+), 9 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr84413-1.c create mode 100644 gcc/testsuite/gcc.target/i386/pr84413-2.c create mode 100644 gcc/testsuite/gcc.target/i386/pr84413-3.c create mode 100644 gcc/testsuite/gcc.target/i386/pr84413-4.c diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 9e46b7b136f..762ab89fc9e 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -137,17 +137,22 @@ const struct processor_costs *ix86_cost = NULL; #define m_CORE2 (HOST_WIDE_INT_1U<
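The avx256_optimal part of the change is what enables -mprefer-vector-width=256 by default for these tunings. The loop below is only an illustration for inspecting the effect (it is not one of the pr84413 tests, and the file name vec.c is made up):

/* Try: gcc -O3 -march=skylake-avx512 -S vec.c
   With avx256_optimal in effect (or an explicit
   -mprefer-vector-width=256) the loop is vectorized with 256-bit ymm
   registers; -mprefer-vector-width=512 selects 512-bit zmm instead.  */
void
saxpy (float *restrict y, const float *restrict x, float a, int n)
{
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}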
Re: [PATCH] x86: Tune Skylake, Cannonlake and Icelake as Haswell
On Fri, Jul 13, 2018 at 08:53:02AM +0200, Uros Bizjak wrote: > On Thu, Jul 12, 2018 at 9:57 PM, H.J. Lu wrote: > > > r259399, which added PROCESSOR_SKYLAKE, disabled many x86 optimizations > > which are enabled by PROCESSOR_HASWELL. As the result, -mtune=skylake > > generates slower codes on Skylake than before. The same also applies > > to Cannonlake and Icelak tuning. > > > > This patch changes -mtune={skylake|cannonlake|icelake} to tune like > > -mtune=haswell for until their tuning is properly adjusted. It also > > enables -mprefer-vector-width=256 for -mtune=haswell, which has no > > impact on codegen when AVX512 isn't enabled. > > > > Performance impacts on SPEC CPU 2017 rate with 1 copy using > > > > -march=native -mfpmath=sse -O2 -m64 > > > > are > > > > 1. On Broadwell server: > > > > 500.perlbench_r -0.56% > > 502.gcc_r -0.18% > > 505.mcf_r 0.24% > > 520.omnetpp_r 0.00% > > 523.xalancbmk_r -0.32% > > 525.x264_r -0.17% > > 531.deepsjeng_r 0.00% > > 541.leela_r 0.00% > > 548.exchange2_r 0.12% > > 557.xz_r0.00% > > geomean 0.00% > > > > 503.bwaves_r0.00% > > 507.cactuBSSN_r 0.21% > > 508.namd_r 0.00% > > 510.parest_r0.19% > > 511.povray_r-0.48% > > 519.lbm_r 0.00% > > 521.wrf_r 0.28% > > 526.blender_r 0.19% > > 527.cam4_r 0.39% > > 538.imagick_r 0.00% > > 544.nab_r -0.36% > > 549.fotonik3d_r 0.51% > > 554.roms_r 0.00% > > geomean 0.17% > > > > On Skylake client: > > > > 500.perlbench_r 0.96% > > 502.gcc_r 0.13% > > 505.mcf_r -1.03% > > 520.omnetpp_r -1.11% > > 523.xalancbmk_r 1.02% > > 525.x264_r 0.50% > > 531.deepsjeng_r 2.97% > > 541.leela_r 0.50% > > 548.exchange2_r -0.95% > > 557.xz_r2.41% > > geomean 0.56% > > > > 503.bwaves_r0.49% > > 507.cactuBSSN_r 3.17% > > 508.namd_r 4.05% > > 510.parest_r0.15% > > 511.povray_r0.80% > > 519.lbm_r 3.15% > > 521.wrf_r 10.56% > > 526.blender_r 2.97% > > 527.cam4_r 2.36% > > 538.imagick_r 46.40% > > 544.nab_r 2.04% > > 549.fotonik3d_r 0.00% > > 554.roms_r 1.27% > > geomean 5.49% > > > > On Skylake server: > > > > 500.perlbench_r 0.71% > > 502.gcc_r -0.51% > > 505.mcf_r -1.06% > > 520.omnetpp_r -0.33% > > 523.xalancbmk_r -0.22% > > 525.x264_r 1.72% > > 531.deepsjeng_r -0.26% > > 541.leela_r 0.57% > > 548.exchange2_r -0.75% > > 557.xz_r-1.28% > > geomean -0.21% > > > > 503.bwaves_r0.00% > > 507.cactuBSSN_r 2.66% > > 508.namd_r 3.67% > > 510.parest_r1.25% > > 511.povray_r2.26% > > 519.lbm_r 1.69% > > 521.wrf_r 11.03% > > 526.blender_r 3.39% > > 527.cam4_r 1.69% > > 538.imagick_r 64.59% > > 544.nab_r -0.54% > > 549.fotonik3d_r 2.68% > > 554.roms_r 0.00% > > geomean 6.19% > > > > This patch improves -march=native performance on Skylake up to 60% and > > leaves -march=native performance unchanged on Haswell. > > > > OK for trunk? > > > > Thanks. > > > > H.J. > > --- > > gcc/ > > > > 2018-07-12 H.J. Lu > > Sunil K Pandey > > > > PR target/84413 > > * config/i386/i386.c (m_HASWELL): Add PROCESSOR_SKYLAKE, > > PROCESSOR_SKYLAKE_AVX512, PROCESSOR_CANNONLAKE, > > PROCESSOR_ICELAKE_CLIENT and PROCESSOR_ICELAKE_SERVER. > > (m_SKYLAKE): Set to 0. > > (m_SKYLAKE_AVX512): Likewise. > > (m_CANNONLAKE): Likewise. > > (m_ICELAKE_CLIENT): Likewise. > > (m_ICELAKE_SERVER): Likewise. > > * config/i386/x86-tune.def (avx256_optimal): Also enabled for > > m_HASWELL. > >
Re: [PATCH] x86: Tune Skylake, Cannonlake and Icelake as Haswell
On Fri, Jul 13, 2018 at 9:07 AM, Jan Hubicka wrote: >> > > > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c >> > > > index 9e46b7b136f..762ab89fc9e 100644 >> > > > --- a/gcc/config/i386/i386.c >> > > > +++ b/gcc/config/i386/i386.c >> > > > @@ -137,17 +137,22 @@ const struct processor_costs *ix86_cost = NULL; >> > > > #define m_CORE2 (HOST_WIDE_INT_1U<> > > > #define m_NEHALEM (HOST_WIDE_INT_1U<> > > > #define m_SANDYBRIDGE (HOST_WIDE_INT_1U<> > > > -#define m_HASWELL (HOST_WIDE_INT_1U<> > > > +#define m_HASWELL ((HOST_WIDE_INT_1U<> > > > + | (HOST_WIDE_INT_1U<> > > > + | (HOST_WIDE_INT_1U<> > > > + | (HOST_WIDE_INT_1U<> > > > + | (HOST_WIDE_INT_1U<> > > > + | (HOST_WIDE_INT_1U<> > > > >> > > >> > > Please introduce a new per-family define and group processors in this >> > > define. Something like m_BDVER, m_BTVER and m_AMD_MULTIPLE for AMD >> > targets. >> > > We should not redefine m_HASWELL to include unrelated families. >> > > >> > >> > Here is the updated patch. OK for trunk if all tests pass? >> > >> > >> OK. > > We have also noticed that benchmarks on skylake are not good compared to > haswell, this nicely explains it. I think this is -march=native regression > compared to GCC versions that did not suppored better CPUs than Haswell. So > it > would be nice to backport it. Yes, we should. Here is the patch to backport to GCC 8. OK for GCC 8 after it has been checked into trunk? Thanks. -- H.J. From 40a1050b330b421a1f445cb2a40b5a002da2e6d6 Mon Sep 17 00:00:00 2001 From: "H.J. Lu" Date: Mon, 4 Jun 2018 19:16:06 -0700 Subject: [PATCH] x86: Tune Skylake, Cannonlake and Icelake as Haswell r259399, which added PROCESSOR_SKYLAKE, disabled many x86 optimizations which are enabled by PROCESSOR_HASWELL. As the result, -mtune=skylake generates slower codes on Skylake than before. The same also applies to Cannonlake and Icelak tuning. This patch changes -mtune={skylake|cannonlake|icelake} to tune like -mtune=haswell for until their tuning is properly adjusted. It also enables -mprefer-vector-width=256 for -mtune=haswell, which has no impact on codegen when AVX512 isn't enabled. Performance impacts on SPEC CPU 2017 rate with 1 copy using -march=native -mfpmath=sse -O2 -m64 are 1. 
On Broadwell server: 500.perlbench_r -0.56% 502.gcc_r -0.18% 505.mcf_r 0.24% 520.omnetpp_r 0.00% 523.xalancbmk_r -0.32% 525.x264_r -0.17% 531.deepsjeng_r 0.00% 541.leela_r 0.00% 548.exchange2_r 0.12% 557.xz_r 0.00% Geomean 0.00% 503.bwaves_r 0.00% 507.cactuBSSN_r 0.21% 508.namd_r 0.00% 510.parest_r 0.19% 511.povray_r -0.48% 519.lbm_r 0.00% 521.wrf_r 0.28% 526.blender_r 0.19% 527.cam4_r 0.39% 538.imagick_r 0.00% 544.nab_r -0.36% 549.fotonik3d_r 0.51% 554.roms_r 0.00% Geomean 0.17% On Skylake client: 500.perlbench_r 0.96% 502.gcc_r 0.13% 505.mcf_r -1.03% 520.omnetpp_r -1.11% 523.xalancbmk_r 1.02% 525.x264_r 0.50% 531.deepsjeng_r 2.97% 541.leela_r 0.50% 548.exchange2_r -0.95% 557.xz_r 2.41% Geomean 0.56% 503.bwaves_r 0.49% 507.cactuBSSN_r 3.17% 508.namd_r 4.05% 510.parest_r 0.15% 511.povray_r 0.80% 519.lbm_r 3.15% 521.wrf_r 10.56% 526.blender_r 2.97% 527.cam4_r 2.36% 538.imagick_r 46.40% 544.nab_r 2.04% 549.fotonik3d_r 0.00% 554.roms_r 1.27% Geomean 5.49% On Skylake server: 500.perlbench_r 0.71% 502.gcc_r -0.51% 505.mcf_r -1.06% 520.omnetpp_r -0.33% 523.xalancbmk_r -0.22% 525.x264_r 1.72% 531.deepsjeng_r -0.26% 541.leela_r 0.57% 548.exchange2_r -0.75% 557.xz_r -1.28% Geomean -0.21% 503.bwaves_r 0.00% 507.cactuBSSN_r 2.66% 508.namd_r 3.67% 510.parest_r 1.25% 511.povray_r 2.26% 519.lbm_r 1.69% 521.wrf_r 11.03% 526.blender_r 3.39% 527.cam4_r 1.69% 538.imagick_r 64.59% 544.nab_r -0.54% 549.fotonik3d_r 2.68% 554.roms_r 0.00% Geomean 6.19% This patch improves -march=native performance on Skylake up to 60% and leaves -march=native performance unchanged on Haswell. gcc/ Backport from mainline 2018-07-12 H.J. Lu Sunil K Pandey PR target/84413 * config/i386/i386.c (m_CORE_AVX512): New. (m_CORE_AVX2): Likewise. (m_CORE_ALL): Add m_CORE_AVX2. * config/i386/x86-tune.def: Replace m_HASWELL with m_CORE_AVX2. Replace m_SKYLAKE_AVX512 with m_CORE_AVX512 on avx256_optimal and remove the rest of m_SKYLAKE_AVX512. gcc/testsuite/ Backport from mainline 2018-07-12 H.J. Lu Sunil K Pandey PR target/84413 * gcc.target/i386/pr84413-1.c: New test. * gcc.target/i386/pr84413-2.c: Likewise. * gcc.target/i386/pr84413-3.c: Likewise. * gcc.tar
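For readers not familiar with x86-tune.def, the per-family defines requested above amount to grouping processor bits into one mask per micro-architecture family, in the spirit of m_AMD_MULTIPLE. The self-contained mock below only demonstrates that bit-mask grouping; the enum values and member lists are assumptions for illustration, not the committed m_CORE_AVX2/m_CORE_AVX512 definitions:

/* Illustration only: a mock of per-family tune masks modelled on the
   m_CORE_AVX2 / m_CORE_AVX512 names in the ChangeLog above.  */
#include <stdio.h>

enum processor { P_HASWELL, P_SKYLAKE, P_SKYLAKE_AVX512,
                 P_CANNONLAKE, P_ICELAKE_CLIENT, P_ICELAKE_SERVER };

#define M(p)          (1ull << (p))
#define M_CORE_AVX512 (M (P_SKYLAKE_AVX512) | M (P_CANNONLAKE) \
                       | M (P_ICELAKE_CLIENT) | M (P_ICELAKE_SERVER))
#define M_CORE_AVX2   (M (P_HASWELL) | M (P_SKYLAKE) | M_CORE_AVX512)

int
main (void)
{
  /* A tuning flag keyed to the AVX2 family mask now covers Skylake and
     newer CPUs as well as Haswell, which is the point of the patch.  */
  printf ("Skylake covered: %s\n",
          (M_CORE_AVX2 & M (P_SKYLAKE)) ? "yes" : "no");
  printf ("Icelake server covered: %s\n",
          (M_CORE_AVX2 & M (P_ICELAKE_SERVER)) ? "yes" : "no");
  return 0;
}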
Re: [PATCH] x86: Tune Skylake, Cannonlake and Icelake as Haswell
On Fri, Jul 13, 2018 at 9:31 AM, Jan Hubicka wrote: >> > We have also noticed that benchmarks on skylake are not good compared to >> > haswell, this nicely explains it. I think this is -march=native regression >> > compared to GCC versions that did not suppored better CPUs than Haswell. >> > So it >> > would be nice to backport it. >> >> Yes, we should. Here is the patch to backport to GCC 8. OK for GCC 8 after >> it has been checked into trunk? > > OK, > Honza >> >> Thanks. >> >> -- >> H.J. > >> From 40a1050b330b421a1f445cb2a40b5a002da2e6d6 Mon Sep 17 00:00:00 2001 >> From: "H.J. Lu" >> Date: Mon, 4 Jun 2018 19:16:06 -0700 >> Subject: [PATCH] x86: Tune Skylake, Cannonlake and Icelake as Haswell >> >> r259399, which added PROCESSOR_SKYLAKE, disabled many x86 optimizations >> which are enabled by PROCESSOR_HASWELL. As the result, -mtune=skylake >> generates slower codes on Skylake than before. The same also applies >> to Cannonlake and Icelak tuning. >> >> This patch changes -mtune={skylake|cannonlake|icelake} to tune like >> -mtune=haswell for until their tuning is properly adjusted. It also >> enables -mprefer-vector-width=256 for -mtune=haswell, which has no >> impact on codegen when AVX512 isn't enabled. >> >> Performance impacts on SPEC CPU 2017 rate with 1 copy using >> >> -march=native -mfpmath=sse -O2 -m64 >> >> are >> >> 1. On Broadwell server: >> >> 500.perlbench_r -0.56% >> 502.gcc_r -0.18% >> 505.mcf_r 0.24% >> 520.omnetpp_r 0.00% >> 523.xalancbmk_r -0.32% >> 525.x264_r-0.17% >> 531.deepsjeng_r 0.00% >> 541.leela_r 0.00% >> 548.exchange2_r 0.12% >> 557.xz_r 0.00% >> Geomean 0.00% >> >> 503.bwaves_r 0.00% >> 507.cactuBSSN_r 0.21% >> 508.namd_r0.00% >> 510.parest_r 0.19% >> 511.povray_r -0.48% >> 519.lbm_r 0.00% >> 521.wrf_r 0.28% >> 526.blender_r 0.19% >> 527.cam4_r0.39% >> 538.imagick_r 0.00% >> 544.nab_r -0.36% >> 549.fotonik3d_r 0.51% >> 554.roms_r0.00% >> Geomean 0.17% >> >> On Skylake client: >> >> 500.perlbench_r 0.96% >> 502.gcc_r 0.13% >> 505.mcf_r -1.03% >> 520.omnetpp_r -1.11% >> 523.xalancbmk_r 1.02% >> 525.x264_r0.50% >> 531.deepsjeng_r 2.97% >> 541.leela_r 0.50% >> 548.exchange2_r -0.95% >> 557.xz_r 2.41% >> Geomean 0.56% >> >> 503.bwaves_r 0.49% >> 507.cactuBSSN_r 3.17% >> 508.namd_r4.05% >> 510.parest_r 0.15% >> 511.povray_r 0.80% >> 519.lbm_r 3.15% >> 521.wrf_r 10.56% >> 526.blender_r 2.97% >> 527.cam4_r2.36% >> 538.imagick_r 46.40% >> 544.nab_r 2.04% >> 549.fotonik3d_r 0.00% >> 554.roms_r1.27% >> Geomean 5.49% >> >> On Skylake server: >> >> 500.perlbench_r 0.71% >> 502.gcc_r -0.51% >> 505.mcf_r -1.06% >> 520.omnetpp_r -0.33% >> 523.xalancbmk_r -0.22% >> 525.x264_r1.72% >> 531.deepsjeng_r -0.26% >> 541.leela_r 0.57% >> 548.exchange2_r -0.75% >> 557.xz_r -1.28% >> Geomean -0.21% >> >> 503.bwaves_r 0.00% >> 507.cactuBSSN_r 2.66% >> 508.namd_r3.67% >> 510.parest_r 1.25% >> 511.povray_r 2.26% >> 519.lbm_r 1.69% >> 521.wrf_r 11.03% >> 526.blender_r 3.39% >> 527.cam4_r1.69% >> 538.imagick_r 64.59% >> 544.nab_r -0.54% >> 549.fotonik3d_r 2.68% >> 554.roms_r0.00% >> Geomean 6.19% >> >> This patch improves -march=native performance on Skylake up to 60% and >> leaves -march=native performance unchanged on Haswell. >> >> gcc/ >> >> Backport from mainline >> 20
Re: [PATCH] x86: Tune Skylake, Cannonlake and Icelake as Haswell
On Sat, Jul 14, 2018 at 06:09:47PM +0200, Gerald Pfeifer wrote: > On Fri, 13 Jul 2018, H.J. Lu wrote: > > I will do the same for GCC8 backport. > > Can you please add a note to gcc-8/changes.html? This seems big > enough to warrant a note in a part for GCC 8.2. > > (At gcc-7/changes.html you can see how to go about this for minor > releases.) > Like this? H.J. --- Index: changes.html === RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-8/changes.html,v retrieving revision 1.88 diff -u -p -r1.88 changes.html --- changes.html14 Jun 2018 13:52:35 - 1.88 +++ changes.html14 Jul 2018 21:17:10 - @@ -1312,5 +1312,23 @@ known to be fixed in the 8.1 release. Th complete (that is, it is possible that some PRs that have been fixed are not listed here). + +GCC 8.2 + +This is the https://gcc.gnu.org/bugzilla/buglist.cgi?bug_status=RESOLVED&resolution=FIXED&target_milestone=8.2";>list +of problem reports (PRs) from GCC's bug tracking system that are +known to be fixed in the 8.1 release. This list might not be +complete (that is, it is possible that some PRs that have been fixed +are not listed here). + +Target Specific Changes + +IA-32/x86-64 + + -mtune=native performance regression +https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84413";>PR84413 +on Intel Skylake processors has been fixed. + +
[PATCH 2/3] i386: Change indirect_return to function type attribute
In struct ucontext; typedef struct ucontext ucontext_t; extern int (*bar) (ucontext_t *__restrict __oucp, const ucontext_t *__restrict __ucp) __attribute__((__indirect_return__)); extern int res; void foo (ucontext_t *oucp, ucontext_t *ucp) { res = bar (oucp, ucp); } bar() may return via indirect branch. This patch changes indirect_return to type attribute to allow indirect_return attribute on variable or type of function pointer so that ENDBR can be inserted after call to bar(). Tested on i386 and x86-64. OK for trunk? Thanks. H.J. --- gcc/ PR target/86560 * config/i386/i386.c (rest_of_insert_endbranch): Lookup indirect_return as function type attribute. (ix86_attribute_table): Change indirect_return to function type attribute. * doc/extend.texi: Update indirect_return attribute. gcc/testsuite/ PR target/86560 * gcc.target/i386/pr86560-1.c: New test. * gcc.target/i386/pr86560-2.c: Likewise. * gcc.target/i386/pr86560-3.c: Likewise. --- gcc/config/i386/i386.c| 23 +++ gcc/doc/extend.texi | 5 +++-- gcc/testsuite/gcc.target/i386/pr86560-1.c | 16 gcc/testsuite/gcc.target/i386/pr86560-2.c | 16 gcc/testsuite/gcc.target/i386/pr86560-3.c | 17 + 5 files changed, 67 insertions(+), 10 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr86560-1.c create mode 100644 gcc/testsuite/gcc.target/i386/pr86560-2.c create mode 100644 gcc/testsuite/gcc.target/i386/pr86560-3.c diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index aec739c3974..ac27248370b 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -2627,16 +2627,23 @@ rest_of_insert_endbranch (void) { rtx call = get_call_rtx_from (insn); rtx fnaddr = XEXP (call, 0); + tree fndecl = NULL_TREE; /* Also generate ENDBRANCH for non-tail call which may return via indirect branch. */ - if (MEM_P (fnaddr) - && GET_CODE (XEXP (fnaddr, 0)) == SYMBOL_REF) + if (GET_CODE (XEXP (fnaddr, 0)) == SYMBOL_REF) + fndecl = SYMBOL_REF_DECL (XEXP (fnaddr, 0)); + if (fndecl == NULL_TREE) + fndecl = MEM_EXPR (fnaddr); + if (fndecl + && TREE_CODE (TREE_TYPE (fndecl)) != FUNCTION_TYPE + && TREE_CODE (TREE_TYPE (fndecl)) != METHOD_TYPE) + fndecl = NULL_TREE; + if (fndecl && TYPE_ARG_TYPES (TREE_TYPE (fndecl))) { - tree fndecl = SYMBOL_REF_DECL (XEXP (fnaddr, 0)); - if (fndecl - && lookup_attribute ("indirect_return", - DECL_ATTRIBUTES (fndecl))) + tree fntype = TREE_TYPE (fndecl); + if (lookup_attribute ("indirect_return", + TYPE_ATTRIBUTES (fntype))) need_endbr = true; } } @@ -46101,8 +46108,8 @@ static const struct attribute_spec ix86_attribute_table[] = ix86_handle_fndecl_attribute, NULL }, { "function_return", 1, 1, true, false, false, false, ix86_handle_fndecl_attribute, NULL }, - { "indirect_return", 0, 0, true, false, false, false, -ix86_handle_fndecl_attribute, NULL }, + { "indirect_return", 0, 0, false, true, true, false, +NULL, NULL }, /* End element. */ { NULL, 0, 0, false, false, false, false, NULL, NULL } diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi index 8b4d3fd9de3..edeaec6d872 100644 --- a/gcc/doc/extend.texi +++ b/gcc/doc/extend.texi @@ -5861,8 +5861,9 @@ foo (void) @item indirect_return @cindex @code{indirect_return} function attribute, x86 -The @code{indirect_return} attribute on a function is used to inform -the compiler that the function may return via indirect branch. +The @code{indirect_return} attribute can be applied to a function, +as well as variable or type of function pointer to inform the +compiler that the function may return via indirect branch. 
@end table diff --git a/gcc/testsuite/gcc.target/i386/pr86560-1.c b/gcc/testsuite/gcc.target/i386/pr86560-1.c new file mode 100644 index 000..a2b702695c5 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr86560-1.c @@ -0,0 +1,16 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fcf-protection" } */ +/* { dg-final { scan-assembler-times {\mendbr} 2 } } */ + +struct ucontext; + +extern int (*bar) (struct ucontext *) + __attribute__((__indirect_return__)); + +extern int res; + +void +foo (struct ucontext *oucp) +{ + res = bar (oucp); +} diff --git a/gcc/testsuite/gcc.target/i386/pr86560-2.c b/gcc/testsuite/gcc.target/i386/pr
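Since the attribute is now attached to the function type, it can also be spelled once on a function-pointer typedef; the typedef below is an assumption drawn from the type-attribute change, not one of the pr86560 tests, and swapcontext_fn is a made-up name:

/* Sketch only.  Calls made through pointers of this type are the ones
   the patch lets the compiler follow with an ENDBR when building with
   -fcf-protection.  */
struct ucontext;

typedef int (*swapcontext_fn) (struct ucontext *__restrict,
                               const struct ucontext *__restrict)
  __attribute__ ((__indirect_return__));

extern swapcontext_fn bar;
extern int res;

void
call_bar (struct ucontext *oucp, struct ucontext *ucp)
{
  res = bar (oucp, ucp);
}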
[PATCH] Call REAL(swapcontext) with indirect_return attribute on x86
asan/asan_interceptors.cc has ... int res = REAL(swapcontext)(oucp, ucp); ... REAL(swapcontext) is a function pointer to swapcontext in libc. Since swapcontext may return via indirect branch on x86 when shadow stack is enabled, we need to call REAL(swapcontext) with indirect_return attribute on x86 so that compiler can insert ENDBR after REAL(swapcontext) call. I opened an LLVM bug: https://bugs.llvm.org/show_bug.cgi?id=38207 But it won't get fixed before indirect_return attribute is added to LLVM. I'd like to get it fixed in GCC first. Tested on i386 and x86-64. OK for trunk after https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01007.html is approved? Thanks. H.J. --- PR target/86560 * asan/asan_interceptors.cc (swapcontext): Call REAL(swapcontext) with indirect_return attribute on x86. --- libsanitizer/asan/asan_interceptors.cc | 6 ++ 1 file changed, 6 insertions(+) diff --git a/libsanitizer/asan/asan_interceptors.cc b/libsanitizer/asan/asan_interceptors.cc index a8f4b72723f..b8dde4f19c5 100644 --- a/libsanitizer/asan/asan_interceptors.cc +++ b/libsanitizer/asan/asan_interceptors.cc @@ -267,7 +267,13 @@ INTERCEPTOR(int, swapcontext, struct ucontext_t *oucp, uptr stack, ssize; ReadContextStack(ucp, &stack, &ssize); ClearShadowMemoryForContextStack(stack, ssize); +#if defined(__x86_64__) || defined(__i386__) + int (*real_swapcontext) (struct ucontext_t *, struct ucontext_t *) +__attribute__((__indirect_return__)) = REAL(swapcontext); + int res = real_swapcontext(oucp, ucp); +#else int res = REAL(swapcontext)(oucp, ucp); +#endif // swapcontext technically does not return, but program may swap context to // "oucp" later, that would look as if swapcontext() returned 0. // We need to clear shadow for ucp once again, as it may be in arbitrary -- 2.17.1
Re: [PATCH] Call REAL(swapcontext) with indirect_return attribute on x86
On Wed, Jul 18, 2018 at 11:18 AM, Kostya Serebryany wrote: > What's ENDBR and do we really need to have it in compiler-rt? When shadow stack from Intel CET is enabled, the first instruction of all indirect branch targets must be a special instruction, ENDBR. In this case, int res = REAL(swapcontext)(oucp, ucp); This function may be returned via an indirect branch. Here compiler must insert ENDBR after call, like call *bar(%rip) endbr64 > As usual, I am opposed to any gcc compiler-rt that bypass upstream. We want it to be fixed in upstream. That is why I opened an LLVM bug. > --kcc > > On Wed, Jul 18, 2018 at 8:37 AM H.J. Lu wrote: >> >> asan/asan_interceptors.cc has >> >> ... >> int res = REAL(swapcontext)(oucp, ucp); >> ... >> >> REAL(swapcontext) is a function pointer to swapcontext in libc. Since >> swapcontext may return via indirect branch on x86 when shadow stack is >> enabled, we need to call REAL(swapcontext) with indirect_return attribute >> on x86 so that compiler can insert ENDBR after REAL(swapcontext) call. >> >> I opened an LLVM bug: >> >> https://bugs.llvm.org/show_bug.cgi?id=38207 >> >> But it won't get fixed before indirect_return attribute is added to >> LLVM. I'd like to get it fixed in GCC first. >> >> Tested on i386 and x86-64. OK for trunk after >> >> https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01007.html >> >> is approved? >> >> Thanks. >> >> >> H.J. >> --- >> PR target/86560 >> * asan/asan_interceptors.cc (swapcontext): Call REAL(swapcontext) >> with indirect_return attribute on x86. >> --- >> libsanitizer/asan/asan_interceptors.cc | 6 ++ >> 1 file changed, 6 insertions(+) >> >> diff --git a/libsanitizer/asan/asan_interceptors.cc >> b/libsanitizer/asan/asan_interceptors.cc >> index a8f4b72723f..b8dde4f19c5 100644 >> --- a/libsanitizer/asan/asan_interceptors.cc >> +++ b/libsanitizer/asan/asan_interceptors.cc >> @@ -267,7 +267,13 @@ INTERCEPTOR(int, swapcontext, struct ucontext_t *oucp, >>uptr stack, ssize; >>ReadContextStack(ucp, &stack, &ssize); >>ClearShadowMemoryForContextStack(stack, ssize); >> +#if defined(__x86_64__) || defined(__i386__) >> + int (*real_swapcontext) (struct ucontext_t *, struct ucontext_t *) >> +__attribute__((__indirect_return__)) = REAL(swapcontext); >> + int res = real_swapcontext(oucp, ucp); >> +#else >>int res = REAL(swapcontext)(oucp, ucp); >> +#endif >>// swapcontext technically does not return, but program may swap context >> to >>// "oucp" later, that would look as if swapcontext() returned 0. >>// We need to clear shadow for ucp once again, as it may be in arbitrary >> -- >> 2.17.1 >> -- H.J.
Re: [PATCH] Call REAL(swapcontext) with indirect_return attribute on x86
On Wed, Jul 18, 2018 at 11:45 AM, Kostya Serebryany wrote: > On Wed, Jul 18, 2018 at 11:40 AM H.J. Lu wrote: >> >> On Wed, Jul 18, 2018 at 11:18 AM, Kostya Serebryany wrote: >> > What's ENDBR and do we really need to have it in compiler-rt? >> >> When shadow stack from Intel CET is enabled, the first instruction of all >> indirect branch targets must be a special instruction, ENDBR. In this case, > > I am confused. > CET is a security mitigation feature (and ENDBR is a pretty weak form of > such), > while ASAN is a testing tool, rarely used in production is almost > never as a mitigation (which it is not!). > Why would anyone need to combine CET and ASAN in one process? > CET is transparent to ASAN. It is perfectly OK to use -fcf-protection to enable CET together with ASAN. > Also, CET doesn't exist in the hardware yet, at least not publicly available. > Which means there should be no rush (am I wrong?) and we can do things > in the correct order: > implement the Clang/LLVM support, make the compiler-rt change in LLVM, > merge back to GCC. I am working with our LLVM people to address this. H.J. > --kcc > >> >> int res = REAL(swapcontext)(oucp, ucp); >> This function may be >> returned via an indirect branch. >> >> Here compiler must insert ENDBR after call, like >> >> call *bar(%rip) >> endbr64 >> >> > As usual, I am opposed to any gcc compiler-rt that bypass upstream. >> >> We want it to be fixed in upstream. That is why I opened an LLVM bug. >> >> >> > --kcc >> > >> > On Wed, Jul 18, 2018 at 8:37 AM H.J. Lu wrote: >> >> >> >> asan/asan_interceptors.cc has >> >> >> >> ... >> >> int res = REAL(swapcontext)(oucp, ucp); >> >> ... >> >> >> >> REAL(swapcontext) is a function pointer to swapcontext in libc. Since >> >> swapcontext may return via indirect branch on x86 when shadow stack is >> >> enabled, we need to call REAL(swapcontext) with indirect_return attribute >> >> on x86 so that compiler can insert ENDBR after REAL(swapcontext) call. >> >> >> >> I opened an LLVM bug: >> >> >> >> https://bugs.llvm.org/show_bug.cgi?id=38207 >> >> >> >> But it won't get fixed before indirect_return attribute is added to >> >> LLVM. I'd like to get it fixed in GCC first. >> >> >> >> Tested on i386 and x86-64. OK for trunk after >> >> >> >> https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01007.html >> >> >> >> is approved? >> >> >> >> Thanks. >> >> >> >> >> >> H.J. >> >> --- >> >> PR target/86560 >> >> * asan/asan_interceptors.cc (swapcontext): Call REAL(swapcontext) >> >> with indirect_return attribute on x86. 
>> >> --- >> >> libsanitizer/asan/asan_interceptors.cc | 6 ++ >> >> 1 file changed, 6 insertions(+) >> >> >> >> diff --git a/libsanitizer/asan/asan_interceptors.cc >> >> b/libsanitizer/asan/asan_interceptors.cc >> >> index a8f4b72723f..b8dde4f19c5 100644 >> >> --- a/libsanitizer/asan/asan_interceptors.cc >> >> +++ b/libsanitizer/asan/asan_interceptors.cc >> >> @@ -267,7 +267,13 @@ INTERCEPTOR(int, swapcontext, struct ucontext_t >> >> *oucp, >> >>uptr stack, ssize; >> >>ReadContextStack(ucp, &stack, &ssize); >> >>ClearShadowMemoryForContextStack(stack, ssize); >> >> +#if defined(__x86_64__) || defined(__i386__) >> >> + int (*real_swapcontext) (struct ucontext_t *, struct ucontext_t *) >> >> +__attribute__((__indirect_return__)) = REAL(swapcontext); >> >> + int res = real_swapcontext(oucp, ucp); >> >> +#else >> >>int res = REAL(swapcontext)(oucp, ucp); >> >> +#endif >> >>// swapcontext technically does not return, but program may swap >> >> context to >> >>// "oucp" later, that would look as if swapcontext() returned 0. >> >>// We need to clear shadow for ucp once again, as it may be in >> >> arbitrary >> >> -- >> >> 2.17.1 >> >> >> >> >> >> -- >> H.J. -- H.J.
[PATCH] i386: Define __HAVE_INDIRECT_RETURN_ATTRIBUTE__
On Thu, Jul 19, 2018 at 10:35:27AM +0200, Richard Biener wrote: > On Wed, Jul 18, 2018 at 5:33 PM H.J. Lu wrote: > > > > In > > > > struct ucontext; > > typedef struct ucontext ucontext_t; > > > > extern int (*bar) (ucontext_t *__restrict __oucp, > >const ucontext_t *__restrict __ucp) > > __attribute__((__indirect_return__)); > > > > extern int res; > > > > void > > foo (ucontext_t *oucp, ucontext_t *ucp) > > { > > res = bar (oucp, ucp); > > } > > > > bar() may return via indirect branch. This patch changes indirect_return > > to type attribute to allow indirect_return attribute on variable or type > > of function pointer so that ENDBR can be inserted after call to bar(). > > > > Tested on i386 and x86-64. OK for trunk? > > OK. > The new indirect_return attribute is intended to mark swapcontext in . This patch defines __HAVE_INDIRECT_RETURN_ATTRIBUTE__ so that it can be used checked before using indirect_return attribute in . It works when the indirect_return attribute is backported to GCC 8. OK for trunk? Thanks. H.J. --- gcc/ PR target/86560 * config/i386/i386-c.c (ix86_target_macros): Define __HAVE_INDIRECT_RETURN_ATTRIBUTE__. gcc/testsuite/ PR target/86560 * gcc.target/i386/pr86560-4.c: New test. --- gcc/config/i386/i386-c.c | 2 ++ gcc/testsuite/gcc.target/i386/pr86560-4.c | 19 +++ 2 files changed, 21 insertions(+) create mode 100644 gcc/testsuite/gcc.target/i386/pr86560-4.c diff --git a/gcc/config/i386/i386-c.c b/gcc/config/i386/i386-c.c index 005e1a5b308..041d47c3ee6 100644 --- a/gcc/config/i386/i386-c.c +++ b/gcc/config/i386/i386-c.c @@ -695,6 +695,8 @@ ix86_target_macros (void) if (flag_cf_protection != CF_NONE) cpp_define_formatted (parse_in, "__CET__=%d", flag_cf_protection & ~CF_SET); + + cpp_define (parse_in, "__HAVE_INDIRECT_RETURN_ATTRIBUTE__"); } diff --git a/gcc/testsuite/gcc.target/i386/pr86560-4.c b/gcc/testsuite/gcc.target/i386/pr86560-4.c new file mode 100644 index 000..46ea923fdfc --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr86560-4.c @@ -0,0 +1,19 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fcf-protection" } */ +/* { dg-final { scan-assembler-times {\mendbr} 2 } } */ + +struct ucontext; + +extern int (*bar) (struct ucontext *) +#ifdef __HAVE_INDIRECT_RETURN_ATTRIBUTE__ + __attribute__((__indirect_return__)) +#endif +; + +extern int res; + +void +foo (struct ucontext *oucp) +{ + res = bar (oucp); +} -- 2.17.1
Re: [PATCH] i386: Define __HAVE_INDIRECT_RETURN_ATTRIBUTE__
On Thu, Jul 19, 2018 at 01:39:04PM +0200, Florian Weimer wrote: > On 07/19/2018 01:33 PM, Jakub Jelinek wrote: > > On Thu, Jul 19, 2018 at 04:21:26AM -0700, H.J. Lu wrote: > > > The new indirect_return attribute is intended to mark swapcontext in > > > . This patch defines __HAVE_INDIRECT_RETURN_ATTRIBUTE__ > > > so that it can be used checked before using indirect_return attribute > > > in . It works when the indirect_return attribute is > > > backported to GCC 8. > > > > > > OK for trunk? > > > > No. Use > > #ifdef __has_attribute > > #if __has_attribute (indirect_return) > > ... > > #endif > > #endif > > instead, like for any other attribute. > > That doesn't work because indirect_return is not in the implementation > namespace and expanded in this context. I assume that __has_attribute > (__indirect_return__) would work, though. > > Could we add: > > #ifdef __has_attribute > # define __glibc_has_attribute(attr) __has_attribute (attr) > #else > # define __glibc_has_attribute 0 > #endif > > And then use this: > > #if __glibc_has_attribute (__indirect_return__) > > Would that still work? > Both __has_attribute (indirect_return) and __has_attribute (__indirect_return__) work here. H.J. --- The new indirect_return attribute is intended to mark swapcontext in . Test __has_attribute (indirect_return) so that it can be backported to GCC 8. PR target/86560 * gcc.target/i386/pr86560-4.c: New test. * gcc.target/i386/pr86560-5.c: Likewise. --- gcc/testsuite/gcc.target/i386/pr86560-4.c | 21 + gcc/testsuite/gcc.target/i386/pr86560-5.c | 21 + 2 files changed, 42 insertions(+) create mode 100644 gcc/testsuite/gcc.target/i386/pr86560-4.c create mode 100644 gcc/testsuite/gcc.target/i386/pr86560-5.c diff --git a/gcc/testsuite/gcc.target/i386/pr86560-4.c b/gcc/testsuite/gcc.target/i386/pr86560-4.c new file mode 100644 index 000..a623e3dcbeb --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr86560-4.c @@ -0,0 +1,21 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fcf-protection" } */ +/* { dg-final { scan-assembler-times {\mendbr} 2 } } */ + +struct ucontext; + +extern int (*bar) (struct ucontext *) +#ifdef __has_attribute +# if __has_attribute (indirect_return) + __attribute__((__indirect_return__)) +# endif +#endif +; + +extern int res; + +void +foo (struct ucontext *oucp) +{ + res = bar (oucp); +} diff --git a/gcc/testsuite/gcc.target/i386/pr86560-5.c b/gcc/testsuite/gcc.target/i386/pr86560-5.c new file mode 100644 index 000..33b0f6424c2 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr86560-5.c @@ -0,0 +1,21 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fcf-protection" } */ +/* { dg-final { scan-assembler-times {\mendbr} 2 } } */ + +struct ucontext; + +extern int (*bar) (struct ucontext *) +#ifdef __has_attribute +# if __has_attribute (__indirect_return__) + __attribute__((__indirect_return__)) +# endif +#endif +; + +extern int res; + +void +foo (struct ucontext *oucp) +{ + res = bar (oucp); +} -- 2.17.1
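Putting the pieces together, a header that wants to annotate a swapcontext-style declaration could combine Florian's __glibc_has_attribute idea with the __has_attribute test used in these testcases. The fragment below is a sketch of such a header, not actual glibc code; giving the fallback macro a parameter is my adjustment, and my_swapcontext is a placeholder name:

/* Hypothetical header fragment.  */
#ifdef __has_attribute
# define __glibc_has_attribute(attr) __has_attribute (attr)
#else
# define __glibc_has_attribute(attr) 0
#endif

struct ucontext;

extern int my_swapcontext (struct ucontext *__restrict oucp,
                           const struct ucontext *__restrict ucp)
#if __glibc_has_attribute (__indirect_return__)
  __attribute__ ((__indirect_return__))
#endif
  ;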
Re: [PATCH] i386: Define __HAVE_INDIRECT_RETURN_ATTRIBUTE__
On Thu, Jul 19, 2018 at 4:56 AM, Jakub Jelinek wrote: > On Thu, Jul 19, 2018 at 01:54:46PM +0200, Florian Weimer wrote: >> On 07/19/2018 01:48 PM, H.J. Lu wrote: >> > Both __has_attribute (indirect_return) and __has_attribute >> > (__indirect_return__) >> > work here. >> >> Applications can have >> >> #define indirect_return >> >> so the variant without underscore mangling is definitely not correct. > > Incorrect for what? glibc header? Yes. The libsanitizer use, where we > control the headers and what we define? No. > > Jakub I am checking my testcases to show how it works. -- H.J.
Re: [PATCH] Call REAL(swapcontext) with indirect_return attribute on x86
On Wed, Jul 18, 2018 at 12:34:28PM -0700, Kostya Serebryany wrote: > On Wed, Jul 18, 2018 at 12:29 PM H.J. Lu wrote: > > > > On Wed, Jul 18, 2018 at 11:45 AM, Kostya Serebryany wrote: > > > On Wed, Jul 18, 2018 at 11:40 AM H.J. Lu wrote: > > >> > > >> On Wed, Jul 18, 2018 at 11:18 AM, Kostya Serebryany > > >> wrote: > > >> > What's ENDBR and do we really need to have it in compiler-rt? > > >> > > >> When shadow stack from Intel CET is enabled, the first instruction of > > >> all > > >> indirect branch targets must be a special instruction, ENDBR. In this > > >> case, > > > > > > I am confused. > > > CET is a security mitigation feature (and ENDBR is a pretty weak form of > > > such), > > > while ASAN is a testing tool, rarely used in production is almost > > > never as a mitigation (which it is not!). > > > Why would anyone need to combine CET and ASAN in one process? > > > > > > > CET is transparent to ASAN. It is perfectly OK to use -fcf-protection to > > enable CET together with ASAN. > > It is ok, but does it make any sense? > If anything, the current ASAN's intereceptors are a large blob of > security vulnerabilities. > If we ever want to use ASAN (or, more likely, HWASAN) as a security > mitigation feature, > we will need to get rid of these interceptors entirely. > > > > > > > Also, CET doesn't exist in the hardware yet, at least not publicly > > > available. > > > Which means there should be no rush (am I wrong?) and we can do things > > > in the correct order: > > > implement the Clang/LLVM support, make the compiler-rt change in LLVM, > > > merge back to GCC. > > > > I am working with our LLVM people to address this. > > Cool! > I am testing this patch and will submit it upstream. H.J. --- asan/asan_interceptors.cc has ... int res = REAL(swapcontext)(oucp, ucp); ... REAL(swapcontext) is a function pointer to swapcontext in libc. Since swapcontext may return via indirect branch on x86 when shadow stack is enabled, we need to call REAL(swapcontext) with indirect_return attribute on x86 so that compiler can insert ENDBR after REAL(swapcontext) call. PR target/86560 * asan/asan_interceptors.cc (swapcontext): Call REAL(swapcontext) with indirect_return attribute on x86 if indirect_return attribute is available. --- libsanitizer/asan/asan_interceptors.cc | 9 + 1 file changed, 9 insertions(+) diff --git a/libsanitizer/asan/asan_interceptors.cc b/libsanitizer/asan/asan_interceptors.cc index a8f4b72723f..3ae473f210a 100644 --- a/libsanitizer/asan/asan_interceptors.cc +++ b/libsanitizer/asan/asan_interceptors.cc @@ -267,7 +267,16 @@ INTERCEPTOR(int, swapcontext, struct ucontext_t *oucp, uptr stack, ssize; ReadContextStack(ucp, &stack, &ssize); ClearShadowMemoryForContextStack(stack, ssize); +#if defined(__has_attribute) && (defined(__x86_64__) || defined(__i386__)) + int (*real_swapcontext) (struct ucontext_t *, struct ucontext_t *) +# if __has_attribute (__indirect_return__) +__attribute__((__indirect_return__)) +# endif += REAL(swapcontext); + int res = real_swapcontext(oucp, ucp); +#else int res = REAL(swapcontext)(oucp, ucp); +#endif // swapcontext technically does not return, but program may swap context to // "oucp" later, that would look as if swapcontext() returned 0. // We need to clear shadow for ucp once again, as it may be in arbitrary -- 2.17.1
[PATCH] i386: Remove _Unwind_Frames_Increment
Tested on CET SDV using the CET kernel on the cet branch at:

https://github.com/yyu168/linux_cet/tree/cet

OK for trunk and GCC 8 branch?

Thanks.

H.J.
---
The CET kernel has been changed to place a restore token on the shadow
stack for the signal handler to enhance security.  This is usually
transparent to user programs since the kernel pops the restore token
when the signal handler returns.  But when an exception is thrown from
a signal handler, we now need to remove _Unwind_Frames_Increment to pop
the restore token from the shadow stack.  Otherwise, we get

FAIL: g++.dg/torture/pr85334.C   -O0  execution test
FAIL: g++.dg/torture/pr85334.C   -O1  execution test
FAIL: g++.dg/torture/pr85334.C   -O2  execution test
FAIL: g++.dg/torture/pr85334.C   -O3 -g  execution test
FAIL: g++.dg/torture/pr85334.C   -Os  execution test
FAIL: g++.dg/torture/pr85334.C   -O2 -flto -fno-use-linker-plugin -flto-partition=none  execution test

	PR libgcc/85334
	* config/i386/shadow-stack-unwind.h (_Unwind_Frames_Increment):
	Removed.
---
 libgcc/config/i386/shadow-stack-unwind.h | 5 -
 1 file changed, 5 deletions(-)

diff --git a/libgcc/config/i386/shadow-stack-unwind.h b/libgcc/config/i386/shadow-stack-unwind.h
index a32f3e74b52..40f48df2aec 100644
--- a/libgcc/config/i386/shadow-stack-unwind.h
+++ b/libgcc/config/i386/shadow-stack-unwind.h
@@ -49,8 +49,3 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
       }						\
     }						\
   while (0)
-
-/* Increment frame count.  Skip signal frames.  */
-#undef _Unwind_Frames_Increment
-#define _Unwind_Frames_Increment(context, frames) \
-  if (!_Unwind_IsSignalFrame (context)) frames++
-- 
2.17.1
[PATCH] libsanitizer: Mark REAL(swapcontext) with indirect_return attribute on x86
Cherry-pick compiler-rt revision 337603: When shadow stack from Intel CET is enabled, the first instruction of all indirect branch targets must be a special instruction, ENDBR. lib/asan/asan_interceptors.cc has ... int res = REAL(swapcontext)(oucp, ucp); ... REAL(swapcontext) is a function pointer to swapcontext in libc. Since swapcontext may return via indirect branch on x86 when shadow stack is enabled, as in this case, int res = REAL(swapcontext)(oucp, ucp); This function may be returned via an indirect branch. Here compiler must insert ENDBR after call, like call *bar(%rip) endbr64 I opened an LLVM bug: https://bugs.llvm.org/show_bug.cgi?id=38207 to add the indirect_return attribute so that it can be used to inform compiler to insert ENDBR after REAL(swapcontext) call. We mark REAL(swapcontext) with the indirect_return attribute if it is available. This fixed: https://bugs.llvm.org/show_bug.cgi?id=38249 Reviewed By: eugenis Differential Revision: https://reviews.llvm.org/D49608 OK for trunk? H.J. --- PR target/86560 * asan/asan_interceptors.cc (swapcontext): Call REAL(swapcontext) with indirect_return attribute on x86 if indirect_return attribute is available. * sanitizer_common/sanitizer_internal_defs.h (__has_attribute): New. --- libsanitizer/asan/asan_interceptors.cc | 8 libsanitizer/sanitizer_common/sanitizer_internal_defs.h | 5 + 2 files changed, 13 insertions(+) diff --git a/libsanitizer/asan/asan_interceptors.cc b/libsanitizer/asan/asan_interceptors.cc index a8f4b72723f..552cf9347af 100644 --- a/libsanitizer/asan/asan_interceptors.cc +++ b/libsanitizer/asan/asan_interceptors.cc @@ -267,7 +267,15 @@ INTERCEPTOR(int, swapcontext, struct ucontext_t *oucp, uptr stack, ssize; ReadContextStack(ucp, &stack, &ssize); ClearShadowMemoryForContextStack(stack, ssize); +#if __has_attribute(__indirect_return__) && \ +(defined(__x86_64__) || defined(__i386__)) + int (*real_swapcontext)(struct ucontext_t *, struct ucontext_t *) +__attribute__((__indirect_return__)) += REAL(swapcontext); + int res = real_swapcontext(oucp, ucp); +#else int res = REAL(swapcontext)(oucp, ucp); +#endif // swapcontext technically does not return, but program may swap context to // "oucp" later, that would look as if swapcontext() returned 0. // We need to clear shadow for ucp once again, as it may be in arbitrary diff --git a/libsanitizer/sanitizer_common/sanitizer_internal_defs.h b/libsanitizer/sanitizer_common/sanitizer_internal_defs.h index edd6a21c122..4413a88bea0 100644 --- a/libsanitizer/sanitizer_common/sanitizer_internal_defs.h +++ b/libsanitizer/sanitizer_common/sanitizer_internal_defs.h @@ -104,6 +104,11 @@ # define __has_feature(x) 0 #endif +// Older GCCs do not understand __has_attribute. +#if !defined(__has_attribute) +# define __has_attribute(x) 0 +#endif + // For portability reasons we do not include stddef.h, stdint.h or any other // system header, but we do need some basic types that are not defined // in a portable way by the language itself. -- 2.17.1
Re: [PATCH] specify large command line option arguments (PR 82063)
On Fri, Jul 20, 2018 at 1:57 PM, Martin Sebor wrote: > On 07/19/2018 04:31 PM, Jeff Law wrote: >> >> On 06/24/2018 03:05 PM, Martin Sebor wrote: >>> >>> Storing integer command line option arguments in type int >>> limits options such as -Wlarger-than= or -Walloca-larger-than >>> to at most INT_MAX (see bug 71905). Larger values wrap around >>> zero. The value zero is considered to disable the option, >>> making it impossible to specify a zero limit. >>> >>> To get around these limitations, the -Walloc-size-larger-than= >>> option accepts a string argument that it then parses itself >>> and interprets as HOST_WIDE_INT. The option also accepts byte >>> size suffixes like KB, MB, GiB, etc. to make it convenient to >>> specify very large limits. >>> >>> The int limitation is obviously less than ideal in a 64-bit >>> world. The treatment of zero as a toggle is just a minor wart. >>> The special treatment to make it work for just a single option >>> makes option handling inconsistent. It should be possible for >>> any option that takes an integer argument to use the same logic. >>> >>> The attached patch enhances GCC option processing to do that. >>> It changes the storage type of option arguments from int to >>> HOST_WIDE_INT and extends the existing (although undocumented) >>> option property Host_Wide_Int to specify wide option arguments. >>> It also introduces the ByteSize property for options for which >>> specifying the byte-size suffix makes sense. >>> >>> To make it possible to consider zero as a meaningful argument >>> value rather than a flag indicating that an option is disabled >>> the patch also adds a CLVC_SIZE enumerator to the cl_var_type >>> enumeration, and modifies how options of the kind are handled. >>> >>> Warning options that take large byte-size arguments can be >>> disabled by specifying a value equal to or greater than >>> HOST_WIDE_INT_M1U. For convenience, aliases in the form of >>> -Wno-xxx-larger-than have been provided for all the affected >>> options. >>> >>> In the patch all the existing -larger-than options are set >>> to PTRDIFF_MAX. This makes them effectively enabled, but >>> because the setting is exceedingly permissive, and because >>> some of the existing warnings are already set to the same >>> value and some other checks detect and reject such exceedingly >>> large values with errors, this change shouldn't noticeably >>> affect what constructs are diagnosed. >>> >>> Although all the options are set to PTRDIFF_MAX, I think it >>> would make sense to consider setting some of them lower, say >>> to PTRDIFF_MAX / 2. I'd like to propose that in a followup >>> patch. >>> >>> To minimize observable changes the -Walloca-larger-than and >>> -Wvla-larger-than warnings required more extensive work to >>> make of the new mechanism because of the "unbounded" argument >>> handling (the warnings trigger for arguments that are not >>> visibly constrained), and because of the zero handling >>> (the warnings also trigger >>> >>> >>> Martin >>> >>> >>> gcc-82063.diff >>> >>> >>> PR middle-end/82063 - issues with arguments enabled by -Wall >>> >>> gcc/ada/ChangeLog: >>> >>> PR middle-end/82063 >>> * gcc-interface/misc.c (gnat_handle_option): Change function >>> argument >>> to HOST_WIDE_INT. >>> >>> gcc/brig/ChangeLog: >>> * brig/brig-lang.c (brig_langhook_handle_option): Change function >>> argument to HOST_WIDE_INT. >>> >>> gcc/c-family/ChangeLog: >>> >>> PR middle-end/82063 >>> * c-common.h (c_common_handle_option): Change function argument >>> to HOST_WIDE_INT. 
>>> * c-opts.c (c_common_init_options): Same. >>> (c_common_handle_option): Same. Remove special handling of >>> OPT_Walloca_larger_than_ and OPT_Wvla_larger_than_. >>> * c.opt (-Walloc-size-larger-than, -Walloca-larger-than): Change >>> options to take a HOST_WIDE_INT argument and accept a byte-size >>> suffix. Initialize. >>> (-Wvla-larger-than): Same. >>> (-Wno-alloc-size-larger-than, -Wno-alloca-larger-than): New. >>> (-Wno-vla-larger-than): Same. >>> >>> >>> gcc/fortran/ChangeLog: >>> >>> PR middle-end/82063 >>> * gfortran.h (gfc_handle_option): Change function argument >>> to HOST_WIDE_INT. >>> * options.c (gfc_handle_option): Same. >>> >>> gcc/go/ChangeLog: >>> >>> PR middle-end/82063 >>> * go-lang.c (go_langhook_handle_option): Change function argument >>> to HOST_WIDE_INT. >>> >>> gcc/lto/ChangeLog: >>> >>> PR middle-end/82063 >>> * lto-lang.c (lto_handle_option): Change function argument >>> to HOST_WIDE_INT. >>> >>> gcc/testsuite/ChangeLog: >>> >>> PR middle-end/82063 >>> * gcc.dg/Walloc-size-larger-than-16.c: Adjust. >>> * gcc.dg/Walloca-larger-than.c: New test. >>> * gcc.dg/Wframe-larger-than-2.c: New test. >>> * gcc.dg/Wlarger-than3.c: New test. >>>
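As a concrete illustration of the byte-size suffixes (the file alloc.c and the 1MiB limit are invented for the example; only the option names come from the patch discussion):

/* Illustration: with
     gcc -O2 -Walloc-size-larger-than=1MiB alloc.c
   the 2 MiB request below is diagnosed.  The patch generalizes this so
   that the other -larger-than options accept the same suffixes and can
   take limits beyond INT_MAX, e.g. -Walloc-size-larger-than=8GiB.  */
#include <stdlib.h>

void *
grab (void)
{
  return malloc (2u << 20);   /* 2 MiB */
}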
V3 [PATCH] C/C++: Add -Waddress-of-packed-member
On Mon, Jun 18, 2018 at 12:26 PM, Joseph Myers wrote: > On Mon, 18 Jun 2018, Jason Merrill wrote: > >> On Mon, Jun 18, 2018 at 11:59 AM, Joseph Myers >> wrote: >> > On Mon, 18 Jun 2018, Jason Merrill wrote: >> > >> >> > + if (TREE_CODE (rhs) == COND_EXPR) >> >> > +{ >> >> > + /* Check the THEN path first. */ >> >> > + tree op1 = TREE_OPERAND (rhs, 1); >> >> > + context = check_address_of_packed_member (type, op1); >> >> >> >> This should handle the GNU extension of re-using operand 0 if operand >> >> 1 is omitted. >> > >> > Doesn't that just use a SAVE_EXPR? >> >> Hmm, I suppose it does, but many places in the compiler seem to expect >> that it produces a COND_EXPR with TREE_OPERAND 1 as NULL_TREE. > > Maybe that's used somewhere inside the C++ front end. For C a SAVE_EXPR > is produced directly. > Here is the updated patch. Changes from the last one: 1. Handle COMPOUND_EXPR. 2. Fixed typos in comments. 3. Combined warn_for_pointer_of_packed_member and warn_for_address_of_packed_member into warn_for_address_or_pointer_of_packed_member. Tested on Linux/x86-64 and Linux/i686. OK for trunk. Thanks. -- H.J. From 2ddae2d57d2875e80c9186b281edfabfddb42e86 Mon Sep 17 00:00:00 2001 From: "H.J. Lu" Date: Fri, 12 Jan 2018 21:12:05 -0800 Subject: [PATCH] C/C++: Add -Waddress-of-packed-member MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When address of packed member of struct or union is taken, it may result in an unaligned pointer value. This patch adds -Waddress-of-packed-member to check alignment at pointer assignment and warn unaligned address as well as unaligned pointer: $ cat x.i struct pair_t { char c; int i; } __attribute__ ((packed)); extern struct pair_t p; int *addr = &p.i; $ gcc -O2 -S x.i x.i:8:13: warning: taking address of packed member of 'struct pair_t' may result in an unaligned pointer value [-Waddress-of-packed-member] int *addr = &p.i; ^ $ cat c.i struct B { int i; }; struct C { struct B b; } __attribute__ ((packed)); long* g8 (struct C *p) { return p; } $ gcc -O2 -S c.i -Wno-incompatible-pointer-types c.i: In function ‘g8’: c.i:4:33: warning: returning ‘struct C *’ from a function with incompatible return type ‘long int *’ [-Wincompatible-pointer-types] long* g8 (struct C *p) { return p; } ^ c.i:4:33: warning: converting a packed ‘struct C *’ pointer increases the alignment of ‘long int *’ pointer from 1 to 8 [-Waddress-of-packed-member] c.i:2:8: note: defined here struct C { struct B b; } __attribute__ ((packed)); ^ $ This warning is enabled by default. Since read_encoded_value_with_base in unwind-pe.h has union unaligned { void *ptr; unsigned u2 __attribute__ ((mode (HI))); unsigned u4 __attribute__ ((mode (SI))); unsigned u8 __attribute__ ((mode (DI))); signed s2 __attribute__ ((mode (HI))); signed s4 __attribute__ ((mode (SI))); signed s8 __attribute__ ((mode (DI))); } __attribute__((__packed__)); _Unwind_Internal_Ptr result; and GCC warns: gcc/libgcc/unwind-pe.h:210:37: warning: taking address of packed member of 'union unaligned' may result in an unaligned pointer value [-Waddress-of-packed-member] result = (_Unwind_Internal_Ptr) u->ptr; ^ we need to add GCC pragma to ignore -Waddress-of-packed-member. gcc/c/ PR c/51628 * doc/invoke.texi: Document -Wno-address-of-packed-member. gcc/c-family/ PR c/51628 * c-common.h (warn_for_address_or_pointer_of_packed_member): New. * c-warn.c (check_address_of_packed_member): New function. (warn_for_address_or_pointer_of_packed_member): Likewise. * c.opt: Add -Wno-address-of-packed-member. 
gcc/c/ PR c/51628 * c-typeck.c (convert_for_assignment): Call warn_for_address_or_pointer_of_packed_member. gcc/cp/ PR c/51628 * call.c (convert_for_arg_passing): Call warn_for_address_or_pointer_of_packed_member. * typeck.c (convert_for_assignment): Likewise. gcc/testsuite/ PR c/51628 * c-c++-common/pr51628-1.c: New test. * c-c++-common/pr51628-2.c: Likewise. * c-c++-common/pr51628-3.c: Likewise. * c-c++-common/pr51628-4.c: Likewise. * c-c++-common/pr51628-5.c: Likewise. * c-c++-common/pr51628-6.c: Likewise. * c-c++-common/pr51628-7.c: Likewise. * c-c++-common/pr51628-8.c: Likewise. * c-c++-common/pr51628-9.c: Likewise. * c-c++-common/pr51628-10.c: Likewise. * c-c++-common/pr51628-11.c: Likewise. * c-c++-common/pr51628-12.c: Likewise. * c-c++-common/pr51628-13.c: Likewise. * c-c++-common/pr51628-14.c: Likewise. * c-c++-common/pr51628-15.c: Likewise. * c-c++-common/pr51628-26.c: Likewise. * gcc.dg/pr51628-17.c: Likewise. * gcc.dg/pr51628-18.c:
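Since the warning is enabled by default, code that takes the address of a packed member on purpose (as read_encoded_value_with_base in unwind-pe.h does) needs a local override. A minimal sketch of the diagnostic-pragma approach mentioned above, reusing the pair_t example; the file name is only illustrative:

$ cat y.i
struct pair_t
{
  char c;
  int i;
} __attribute__ ((packed));

extern struct pair_t p;

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Waddress-of-packed-member"
int *addr = &p.i; /* deliberately unaligned; no warning inside the pragma region */
#pragma GCC diagnostic pop
$ gcc -O2 -S y.i
$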
PING [PATCH] libsanitizer: Mark REAL(swapcontext) with indirect_return attribute on x86
On Fri, Jul 20, 2018 at 1:11 PM, H.J. Lu wrote: > Cherry-pick compiler-rt revision 337603: > > When shadow stack from Intel CET is enabled, the first instruction of all > indirect branch targets must be a special instruction, ENDBR. > > lib/asan/asan_interceptors.cc has > > ... > int res = REAL(swapcontext)(oucp, ucp); > ... > > REAL(swapcontext) is a function pointer to swapcontext in libc. Since > swapcontext may return via indirect branch on x86 when shadow stack is > enabled, as in this case, > > int res = REAL(swapcontext)(oucp, ucp); > This function may be > returned via an indirect branch. > > Here compiler must insert ENDBR after call, like > > call *bar(%rip) > endbr64 > > I opened an LLVM bug: > > https://bugs.llvm.org/show_bug.cgi?id=38207 > > to add the indirect_return attribute so that it can be used to inform > compiler to insert ENDBR after REAL(swapcontext) call. We mark > REAL(swapcontext) with the indirect_return attribute if it is available. > > This fixed: > > https://bugs.llvm.org/show_bug.cgi?id=38249 > > Reviewed By: eugenis > > Differential Revision: https://reviews.llvm.org/D49608 > > OK for trunk? > > H.J. > --- > PR target/86560 > * asan/asan_interceptors.cc (swapcontext): Call REAL(swapcontext) > with indirect_return attribute on x86 if indirect_return attribute > is available. > * sanitizer_common/sanitizer_internal_defs.h (__has_attribute): > New. > --- > libsanitizer/asan/asan_interceptors.cc | 8 > libsanitizer/sanitizer_common/sanitizer_internal_defs.h | 5 + > 2 files changed, 13 insertions(+) > > diff --git a/libsanitizer/asan/asan_interceptors.cc > b/libsanitizer/asan/asan_interceptors.cc > index a8f4b72723f..552cf9347af 100644 > --- a/libsanitizer/asan/asan_interceptors.cc > +++ b/libsanitizer/asan/asan_interceptors.cc > @@ -267,7 +267,15 @@ INTERCEPTOR(int, swapcontext, struct ucontext_t *oucp, >uptr stack, ssize; >ReadContextStack(ucp, &stack, &ssize); >ClearShadowMemoryForContextStack(stack, ssize); > +#if __has_attribute(__indirect_return__) && \ > +(defined(__x86_64__) || defined(__i386__)) > + int (*real_swapcontext)(struct ucontext_t *, struct ucontext_t *) > +__attribute__((__indirect_return__)) > += REAL(swapcontext); > + int res = real_swapcontext(oucp, ucp); > +#else >int res = REAL(swapcontext)(oucp, ucp); > +#endif >// swapcontext technically does not return, but program may swap context to >// "oucp" later, that would look as if swapcontext() returned 0. >// We need to clear shadow for ucp once again, as it may be in arbitrary > diff --git a/libsanitizer/sanitizer_common/sanitizer_internal_defs.h > b/libsanitizer/sanitizer_common/sanitizer_internal_defs.h > index edd6a21c122..4413a88bea0 100644 > --- a/libsanitizer/sanitizer_common/sanitizer_internal_defs.h > +++ b/libsanitizer/sanitizer_common/sanitizer_internal_defs.h > @@ -104,6 +104,11 @@ > # define __has_feature(x) 0 > #endif > > +// Older GCCs do not understand __has_attribute. > +#if !defined(__has_attribute) > +# define __has_attribute(x) 0 > +#endif > + > // For portability reasons we do not include stddef.h, stdint.h or any other > // system header, but we do need some basic types that are not defined > // in a portable way by the language itself. > -- > 2.17.1 > Any objections? -- H.J.
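Outside of libsanitizer, the same guarded pattern works for any code that calls swapcontext through a function pointer. A minimal, self-contained sketch; the wrapper name is made up, while the attribute spelling and the __has_attribute guard are taken from the patch above:

---
#include <ucontext.h>

#if defined (__has_attribute)
# if __has_attribute (__indirect_return__) \
     && (defined (__x86_64__) || defined (__i386__))
#  define INDIRECT_RETURN __attribute__ ((__indirect_return__))
# endif
#endif
#ifndef INDIRECT_RETURN
# define INDIRECT_RETURN
#endif

int
call_swapcontext (ucontext_t *oucp, const ucontext_t *ucp)
{
  /* Marking the pointer type tells the compiler that this call may
     return via an indirect branch, so that with -fcf-protection it
     places an ENDBR after the call instruction.  */
  int (*real_swapcontext) (ucontext_t *, const ucontext_t *)
    INDIRECT_RETURN = swapcontext;
  return real_swapcontext (oucp, ucp);
}
---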
PING [PATCH] i386: Remove _Unwind_Frames_Increment
On Fri, Jul 20, 2018 at 11:15 AM, H.J. Lu wrote: > Tested on CET SDV using the CET kernel on cet branch at: > > https://github.com/yyu168/linux_cet/tree/cet > > OK for trunk and GCC 8 branch? > > Thanks. > > > H.J. > --- > The CET kernel has been changed to place a restore token on shadow stack > for signal handler to enhance security. It is usually transparent to user > programs since kernel will pop the restore token when signal handler > returns. But when an exception is thrown from a signal handler, now > we need to remove _Unwind_Frames_Increment to pop the the restore token > from shadow stack. Otherwise, we get > > FAIL: g++.dg/torture/pr85334.C -O0 execution test > FAIL: g++.dg/torture/pr85334.C -O1 execution test > FAIL: g++.dg/torture/pr85334.C -O2 execution test > FAIL: g++.dg/torture/pr85334.C -O3 -g execution test > FAIL: g++.dg/torture/pr85334.C -Os execution test > FAIL: g++.dg/torture/pr85334.C -O2 -flto -fno-use-linker-plugin > -flto-partition=none execution test > > PR libgcc/85334 > * config/i386/shadow-stack-unwind.h (_Unwind_Frames_Increment): > Removed. > --- > libgcc/config/i386/shadow-stack-unwind.h | 5 - > 1 file changed, 5 deletions(-) > > diff --git a/libgcc/config/i386/shadow-stack-unwind.h > b/libgcc/config/i386/shadow-stack-unwind.h > index a32f3e74b52..40f48df2aec 100644 > --- a/libgcc/config/i386/shadow-stack-unwind.h > +++ b/libgcc/config/i386/shadow-stack-unwind.h > @@ -49,8 +49,3 @@ see the files COPYING3 and COPYING.RUNTIME respectively. > If not, see > } \ > } \ > while (0) > - > -/* Increment frame count. Skip signal frames. */ > -#undef _Unwind_Frames_Increment > -#define _Unwind_Frames_Increment(context, frames) \ > - if (!_Unwind_IsSignalFrame (context)) frames++ > -- > 2.17.1 > I will check it into trunk tomorrow if there is no objection. -- H.J.
Re: [PATCH] combine: Allow combining two insns to two insns
On Wed, Jul 25, 2018 at 1:28 AM, Richard Biener wrote: > On Tue, Jul 24, 2018 at 7:18 PM Segher Boessenkool > wrote: >> >> This patch allows combine to combine two insns into two. This helps >> in many cases, by reducing instruction path length, and also allowing >> further combinations to happen. PR85160 is a typical example of code >> that it can improve. >> >> This patch does not allow such combinations if either of the original >> instructions was a simple move instruction. In those cases combining >> the two instructions increases register pressure without improving the >> code. With this move test register pressure does no longer increase >> noticably as far as I can tell. >> >> (At first I also didn't allow either of the resulting insns to be a >> move instruction. But that is actually a very good thing to have, as >> should have been obvious). >> >> Tested for many months; tested on about 30 targets. >> >> I'll commit this later this week if there are no objections. > > Sounds good - but, _any_ testcase? Please! ;) > Here is a testcase: For --- #define N 16 float f[N]; double d[N]; int n[N]; __attribute__((noinline)) void f3 (void) { int i; for (i = 0; i < N; i++) d[i] = f[i]; } --- r263067 improved -O3 -mavx2 -mtune=generic -m64 from .cfi_startproc vmovaps f(%rip), %xmm2 vmovaps f+32(%rip), %xmm3 vinsertf128 $0x1, f+16(%rip), %ymm2, %ymm0 vcvtps2pd %xmm0, %ymm1 vextractf128 $0x1, %ymm0, %xmm0 vmovaps %xmm1, d(%rip) vextractf128 $0x1, %ymm1, d+16(%rip) vcvtps2pd %xmm0, %ymm0 vmovaps %xmm0, d+32(%rip) vextractf128 $0x1, %ymm0, d+48(%rip) vinsertf128 $0x1, f+48(%rip), %ymm3, %ymm0 vcvtps2pd %xmm0, %ymm1 vextractf128 $0x1, %ymm0, %xmm0 vmovaps %xmm1, d+64(%rip) vextractf128 $0x1, %ymm1, d+80(%rip) vcvtps2pd %xmm0, %ymm0 vmovaps %xmm0, d+96(%rip) vextractf128 $0x1, %ymm0, d+112(%rip) vzeroupper ret .cfi_endproc to .cfi_startproc vcvtps2pd f(%rip), %ymm0 vmovaps %xmm0, d(%rip) vextractf128 $0x1, %ymm0, d+16(%rip) vcvtps2pd f+16(%rip), %ymm0 vmovaps %xmm0, d+32(%rip) vextractf128 $0x1, %ymm0, d+48(%rip) vcvtps2pd f+32(%rip), %ymm0 vextractf128 $0x1, %ymm0, d+80(%rip) vmovaps %xmm0, d+64(%rip) vcvtps2pd f+48(%rip), %ymm0 vextractf128 $0x1, %ymm0, d+112(%rip) vmovaps %xmm0, d+96(%rip) vzeroupper ret .cfi_endproc This is: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86752 H.J.
Re: [PATCH 01/11] Add __builtin_speculation_safe_value
On Mon, Jul 30, 2018 at 6:16 AM, Richard Biener wrote: > On Fri, 27 Jul 2018, Richard Earnshaw wrote: > >> >> This patch defines a new intrinsic function >> __builtin_speculation_safe_value. A generic default implementation is >> defined which will attempt to use the backend pattern >> "speculation_safe_barrier". If this pattern is not defined, or if it >> is not available, then the compiler will emit a warning, but >> compilation will continue. >> >> Note that the test spec-barrier-1.c will currently fail on all >> targets. This is deliberate, the failure will go away when >> appropriate action is taken for each target backend. > > OK. > > Thanks, > Richard. > >> gcc: >> * builtin-types.def (BT_FN_PTR_PTR_VAR): New function type. >> (BT_FN_I1_I1_VAR, BT_FN_I2_I2_VAR, BT_FN_I4_I4_VAR): Likewise. >> (BT_FN_I8_I8_VAR, BT_FN_I16_I16_VAR): Likewise. >> * builtin-attrs.def (ATTR_NOVOPS_NOTHROW_LEAF_LIST): New attribute >> list. >> * builtins.def (BUILT_IN_SPECULATION_SAFE_VALUE_N): New builtin. >> (BUILT_IN_SPECULATION_SAFE_VALUE_PTR): New internal builtin. >> (BUILT_IN_SPECULATION_SAFE_VALUE_1): Likewise. >> (BUILT_IN_SPECULATION_SAFE_VALUE_2): Likewise. >> (BUILT_IN_SPECULATION_SAFE_VALUE_4): Likewise. >> (BUILT_IN_SPECULATION_SAFE_VALUE_8): Likewise. >> (BUILT_IN_SPECULATION_SAFE_VALUE_16): Likewise. >> * builtins.c (expand_speculation_safe_value): New function. >> (expand_builtin): Call it. >> * doc/cpp.texi: Document predefine __HAVE_SPECULATION_SAFE_VALUE. >> * doc/extend.texi: Document __builtin_speculation_safe_value. >> * doc/md.texi: Document "speculation_barrier" pattern. >> * doc/tm.texi.in: Pull in TARGET_SPECULATION_SAFE_VALUE and >> TARGET_HAVE_SPECULATION_SAFE_VALUE. >> * doc/tm.texi: Regenerated. >> * target.def (have_speculation_safe_value, speculation_safe_value): New >> hooks. >> * targhooks.c (default_have_speculation_safe_value): New function. >> (default_speculation_safe_value): New function. >> * targhooks.h (default_have_speculation_safe_value): Add prototype. >> (default_speculation_safe_value): Add prototype. >> I got ../../src-trunk/gcc/targhooks.c: In function ‘bool default_have_speculation_safe_value(bool)’: ../../src-trunk/gcc/targhooks.c:2319:43: error: unused parameter ‘active’ [-Werror=unused-parameter] default_have_speculation_safe_value (bool active) ~^~ -- H.J.
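Independent of the build failure above, here is a minimal usage sketch of the new builtin for the classic bounds-check case; the names are illustrative, and the __HAVE_SPECULATION_SAFE_VALUE predefine documented by this series guards the fallback for compilers without the builtin:

---
#define N 64
static int array[N];

int
load_guarded (unsigned int idx)
{
  if (idx < N)
    {
#ifdef __HAVE_SPECULATION_SAFE_VALUE
      /* If this point is reached only speculatively, after a
         mispredicted bounds check, the builtin forces idx to the
         fail value (0 by default), so the load cannot leak
         out-of-bounds data through the cache.  */
      return array[__builtin_speculation_safe_value (idx)];
#else
      return array[idx];
#endif
    }
  return 0;
}
---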
Re: [PATCH 10/11] x86 - add speculation_barrier pattern
On Sat, Jul 28, 2018 at 1:25 AM, Uros Bizjak wrote: > On Fri, Jul 27, 2018 at 11:37 AM, Richard Earnshaw > wrote: >> >> This patch adds a speculation barrier for x86, based on my >> understanding of the required mitigation for that CPU, which is to use >> an lfence instruction. >> >> This patch needs some review by an x86 expert and if adjustments are >> needed, I'd appreciate it if they could be picked up by the port >> maintainer. This is supposed to serve as an example of how to deploy >> the new __builtin_speculation_safe_value() intrinsic on this >> architecture. >> >> * config/i386/i386.md (unspecv): Add UNSPECV_SPECULATION_BARRIER. >> (speculation_barrier): New insn. > > The implementation is OK, but someone from Intel (CC'd) should clarify > if lfence is the correct insn. > I checked with our people. lfence is OK. Thanks. -- H.J.
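For reference, a sketch of what a single speculation barrier amounts to on x86 under this mitigation, written as inline asm rather than through the builtin:

---
static inline void
x86_speculation_barrier (void)
{
  /* lfence does not execute until all prior instructions have
     completed locally, which is the property the speculation_barrier
     pattern relies on here.  */
  __asm__ __volatile__ ("lfence" : : : "memory");
}
---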
Re: [PR 83141] Prevent SRA from removing type changing assignment
On Tue, Dec 5, 2017 at 4:00 AM, Martin Jambor wrote: > On Tue, Dec 05 2017, Martin Jambor wrote: >> On Tue, Dec 05 2017, Martin Jambor wrote: >> Hi, >> >>> Hi, >>> >>> this is a followup to Richi's >>> https://gcc.gnu.org/ml/gcc-patches/2017-11/msg02396.html to fix PR >>> 83141. The basic idea is simple, be just as conservative about type >>> changing MEM_REFs as we are about actual VCEs. >>> >>> I have checked how that would affect compilation of SPEC 2006 and (non >>> LTO) Mozilla Firefox and am happy to report that the difference was >>> tiny. However, I had to make the test less strict, otherwise testcase >>> gcc.dg/guality/pr54970.c kept failing because it contains folded memcpy >>> and expects us to track values accross: >>> >>> int a[] = { 1, 2, 3 }; >>> /* ... */ >>> __builtin_memcpy (&a, (int [3]) { 4, 5, 6 }, sizeof (a)); >>> /* { dg-final { gdb-test 31 "a\[0\]" "4" } } */ >>> /* { dg-final { gdb-test 31 "a\[1\]" "5" } } */ >>> /* { dg-final { gdb-test 31 "a\[2\]" "6" } } */ >>> >>> SRA is able to load replacement of a[0] directly from the temporary >>> array which is apparently necessary to generate proper debug info. I >>> have therefore allowed the current transformation to go forward if the >>> source does not contain any padding or if it is a read-only declaration. >> >> Ah, the read-only test is of course bogus, it was a last minute addition >> when I was apparently already too tired to think it through. Please >> disregard that line in the patch (it has passed bootstrap and testing >> without it). >> >> Sorry for the noise, >> >> Martin >> > > And for the record, below is the actual patch, after a fresh round of > re-testing to double check I did not mess up anything else. As before, > I'd like to ask for review, especially of the type_contains_padding_p > predicate and then would like to commit it to trunk. > > Thanks, > > Martin > > > 2017-12-05 Martin Jambor > > PR tree-optimization/83141 > * tree-sra.c (type_contains_padding_p): New function. > (contains_vce_or_bfcref_p): Move up in the file, also test for > MEM_REFs implicitely changing types with padding. Remove inline > keyword. > (build_accesses_from_assign): Check contains_vce_or_bfcref_p > before setting bit in should_scalarize_away_bitmap. > This caused: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86763 H.J.
Re: RFC/A: Add a targetm.vectorize.related_mode hook
On Wed, Oct 23, 2019 at 4:51 AM Richard Sandiford wrote: > > Richard Biener writes: > > On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford > > wrote: > >> > >> This patch is the first of a series that tries to remove two > >> assumptions: > >> > >> (1) that all vectors involved in vectorisation must be the same size > >> > >> (2) that there is only one vector mode for a given element mode and > >> number of elements > >> > >> Relaxing (1) helps with targets that support multiple vector sizes or > >> that require the number of elements to stay the same. E.g. if we're > >> vectorising code that operates on narrow and wide elements, and the > >> narrow elements use 64-bit vectors, then on AArch64 it would normally > >> be better to use 128-bit vectors rather than pairs of 64-bit vectors > >> for the wide elements. > >> > >> Relaxing (2) makes it possible for -msve-vector-bits=128 to preoduce > >> fixed-length code for SVE. It also allows unpacked/half-size SVE > >> vectors to work with -msve-vector-bits=256. > >> > >> The patch adds a new hook that targets can use to control how we > >> move from one vector mode to another. The hook takes a starting vector > >> mode, a new element mode, and (optionally) a new number of elements. > >> The flexibility needed for (1) comes in when the number of elements > >> isn't specified. > >> > >> All callers in this patch specify the number of elements, but a later > >> vectoriser patch doesn't. I won't be posting the vectoriser patch > >> for a few days, hence the RFC/A tag. > >> > >> Tested individually on aarch64-linux-gnu and as a series on > >> x86_64-linux-gnu. OK to install? Or if not yet, does the idea > >> look OK? > > > > In isolation the idea looks good but maybe a bit limited? I see > > how it works for the same-size case but if you consider x86 > > where we have SSE, AVX256 and AVX512 what would it return > > for related_vector_mode (V4SImode, SImode, 0)? Or is this > > kind of query not intended (where the component modes match > > but nunits is zero)? > > In that case we'd normally get V4SImode back. It's an allowed > combination, but not very useful. > > > How do you get from SVE fixed 128bit to NEON fixed 128bit then? Or is > > it just used to stay in the same register set for different component > > modes? > > Yeah, the idea is to use the original vector mode as essentially > a base architecture. > > The follow-on patches replace vec_info::vector_size with > vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes > with targetm.vectorize.autovectorize_vector_modes. These are the > starting modes that would be passed to the hook in the nunits==0 case. > For a target with different vector sizes, targetm.vectorize.autovectorize_vector_sizes doesn't return the optimal vector sizes for known trip count and unknown trip count. For a target with 128-bit and 256-bit vectors, 256-bit followed by 128-bit works well for known trip count since vectorizer knows the maximum usable vector size. But for unknown trip count, we may want to use 128-bit vector when 256-bit code path won't be used at run-time, but 128-bit vector will. At the moment, we can only use one set of vector sizes for both known trip count and unknown trip count. Can vectorizer support 2 sets of vector sizes, one for known trip count and the other for unknown trip count? H.J.
Re: RFC/A: Add a targetm.vectorize.related_mode hook
On Thu, Oct 24, 2019 at 12:56 AM Richard Sandiford wrote: > > "H.J. Lu" writes: > > On Wed, Oct 23, 2019 at 4:51 AM Richard Sandiford > > wrote: > >> > >> Richard Biener writes: > >> > On Wed, Oct 23, 2019 at 1:00 PM Richard Sandiford > >> > wrote: > >> >> > >> >> This patch is the first of a series that tries to remove two > >> >> assumptions: > >> >> > >> >> (1) that all vectors involved in vectorisation must be the same size > >> >> > >> >> (2) that there is only one vector mode for a given element mode and > >> >> number of elements > >> >> > >> >> Relaxing (1) helps with targets that support multiple vector sizes or > >> >> that require the number of elements to stay the same. E.g. if we're > >> >> vectorising code that operates on narrow and wide elements, and the > >> >> narrow elements use 64-bit vectors, then on AArch64 it would normally > >> >> be better to use 128-bit vectors rather than pairs of 64-bit vectors > >> >> for the wide elements. > >> >> > >> >> Relaxing (2) makes it possible for -msve-vector-bits=128 to preoduce > >> >> fixed-length code for SVE. It also allows unpacked/half-size SVE > >> >> vectors to work with -msve-vector-bits=256. > >> >> > >> >> The patch adds a new hook that targets can use to control how we > >> >> move from one vector mode to another. The hook takes a starting vector > >> >> mode, a new element mode, and (optionally) a new number of elements. > >> >> The flexibility needed for (1) comes in when the number of elements > >> >> isn't specified. > >> >> > >> >> All callers in this patch specify the number of elements, but a later > >> >> vectoriser patch doesn't. I won't be posting the vectoriser patch > >> >> for a few days, hence the RFC/A tag. > >> >> > >> >> Tested individually on aarch64-linux-gnu and as a series on > >> >> x86_64-linux-gnu. OK to install? Or if not yet, does the idea > >> >> look OK? > >> > > >> > In isolation the idea looks good but maybe a bit limited? I see > >> > how it works for the same-size case but if you consider x86 > >> > where we have SSE, AVX256 and AVX512 what would it return > >> > for related_vector_mode (V4SImode, SImode, 0)? Or is this > >> > kind of query not intended (where the component modes match > >> > but nunits is zero)? > >> > >> In that case we'd normally get V4SImode back. It's an allowed > >> combination, but not very useful. > >> > >> > How do you get from SVE fixed 128bit to NEON fixed 128bit then? Or is > >> > it just used to stay in the same register set for different component > >> > modes? > >> > >> Yeah, the idea is to use the original vector mode as essentially > >> a base architecture. > >> > >> The follow-on patches replace vec_info::vector_size with > >> vec_info::vector_mode and targetm.vectorize.autovectorize_vector_sizes > >> with targetm.vectorize.autovectorize_vector_modes. These are the > >> starting modes that would be passed to the hook in the nunits==0 case. > >> > > > > For a target with different vector sizes, > > targetm.vectorize.autovectorize_vector_sizes > > doesn't return the optimal vector sizes for known trip count and > > unknown trip count. > > For a target with 128-bit and 256-bit vectors, 256-bit followed by > > 128-bit works well for > > known trip count since vectorizer knows the maximum usable vector size. > > But for > > unknown trip count, we may want to use 128-bit vector when 256-bit > > code path won't > > be used at run-time, but 128-bit vector will. 
At the moment, we can > > only use one > > set of vector sizes for both known trip count and unknown trip count. > Yeah, we're hit by this for AArch64 too. Andre's recent patches: > > https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01564.html > https://gcc.gnu.org/ml/gcc-patches/2019-09/msg00205.html > > should help. > > > Can vectorizer > > support 2 sets of vector sizes, one for known trip count and the other > > for unknown > > trip count? > > The approach Andre's taking is to continue to use the wider vector size > for unknown trip counts, and instead ensure that the epilogue loop is > vectorised at the narrower vector size if possible. The patches then > use this vectorised epilogue as a fallback "main" loop if the runtime > trip count is too low for the wide vectors. I tried it on 548.exchange2_r in SPEC CPU 2017; there is a shortcut to the vectorized epilogue for low trip counts. -- H.J.
Re: [PR47785] COLLECT_AS_OPTIONS
On Sun, Oct 27, 2019 at 6:33 PM Kugan Vivekanandarajah wrote: > > Hi Richard, > > Thanks for the review. > > On Wed, 23 Oct 2019 at 23:07, Richard Biener > wrote: > > > > On Mon, Oct 21, 2019 at 10:04 AM Kugan Vivekanandarajah > > wrote: > > > > > > Hi Richard, > > > > > > Thanks for the pointers. > > > > > > > > > > > > On Fri, 11 Oct 2019 at 22:33, Richard Biener > > > wrote: > > > > > > > > On Fri, Oct 11, 2019 at 6:15 AM Kugan Vivekanandarajah > > > > wrote: > > > > > > > > > > Hi Richard, > > > > > Thanks for the review. > > > > > > > > > > On Wed, 2 Oct 2019 at 20:41, Richard Biener > > > > > wrote: > > > > > > > > > > > > On Wed, Oct 2, 2019 at 10:39 AM Kugan Vivekanandarajah > > > > > > wrote: > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > As mentioned in the PR, attached patch adds COLLECT_AS_OPTIONS for > > > > > > > passing assembler options specified with -Wa, to the link-time > > > > > > > driver. > > > > > > > > > > > > > > The proposed solution only works for uniform -Wa options across > > > > > > > all > > > > > > > TUs. As mentioned by Richard Biener, supporting non-uniform -Wa > > > > > > > flags > > > > > > > would require either adjusting partitioning according to flags or > > > > > > > emitting multiple object files from a single LTRANS CU. We could > > > > > > > consider this as a follow up. > > > > > > > > > > > > > > Bootstrapped and regression tests on arm-linux-gcc. Is this OK > > > > > > > for trunk? > > > > > > > > > > > > While it works for your simple cases it is unlikely to work in > > > > > > practice since > > > > > > your implementation needs the assembler options be present at the > > > > > > link > > > > > > command line. I agree that this might be the way for people to go > > > > > > when > > > > > > they face the issue but then it needs to be documented somewhere > > > > > > in the manual. > > > > > > > > > > > > That is, with COLLECT_AS_OPTION (why singular? I'd expected > > > > > > COLLECT_AS_OPTIONS) available to cc1 we could stream this string > > > > > > to lto_options and re-materialize it at link time (and diagnose > > > > > > mismatches > > > > > > even if we like). > > > > > OK. I will try to implement this. So the idea is if we provide > > > > > -Wa,options as part of the lto compile, this should be available > > > > > during link time. Like in: > > > > > > > > > > arm-linux-gnueabihf-gcc -march=armv7-a -mthumb -O2 -flto > > > > > -Wa,-mimplicit-it=always,-mthumb -c test.c > > > > > arm-linux-gnueabihf-gcc -flto test.o > > > > > > > > > > I am not sure where should we stream this. Currently, cl_optimization > > > > > has all the optimization flag provided for compiler and it is > > > > > autogenerated and all the flags are integer values. Do you have any > > > > > preference or example where this should be done. > > > > > > > > In lto_write_options, I'd simply append the contents of > > > > COLLECT_AS_OPTIONS > > > > (with -Wa, prepended to each of them), then recover them in lto-wrapper > > > > for each TU and pass them down to the LTRANS compiles (if they agree > > > > for all TUs, otherwise I'd warn and drop them). > > > > > > Attached patch streams it and also make sure that the options are the > > > same for all the TUs. Maybe it is a bit restrictive. > > > > > > What is the best place to document COLLECT_AS_OPTIONS. We don't seem > > > to document COLLECT_GCC_OPTIONS anywhere ? > > > > Nowhere, it's an implementation detail then. 
> > > > > Attached patch passes regression and also fixes the original ARM > > > kernel build issue with tumb2. > > > > Did you try this with multiple assembler options? I see you stream > > them as -Wa,-mfpu=xyz,-mthumb but then compare the whole > > option strings so a mismatch with -Wa,-mthumb,-mfpu=xyz would be > > diagnosed. If there's a spec induced -Wa option do we get to see > > that as well? I can imagine -march=xyz enabling a -Wa option > > for example. > > > > + *collect_as = XNEWVEC (char, strlen (args_text) + 1); > > + strcpy (*collect_as, args_text); > > > > there's strdup. Btw, I'm not sure why you don't simply leave > > the -Wa option in the merged options [individually] and match > > them up but go the route of comparing strings and carrying that > > along separately. I think that would be much better. > > Is attached patch which does this is OK? > Don't you need to also handle -Xassembler? Since -Wa, doesn't work with comma in assembler options, like -mfoo=foo1,foo2, one needs to use -Xassembler -mfoo=foo1,foo2 to pass -mfoo=foo1,foo2 to assembler. -- H.J.
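To make the -Xassembler point concrete (the -mfoo option is made up, exactly as above; the -Wa command line is the one from earlier in the thread):

$ arm-linux-gnueabihf-gcc -c -flto -Wa,-mimplicit-it=always,-mthumb test.c
Here -Wa, splits its argument on commas, so the assembler receives two separate options, -mimplicit-it=always and -mthumb.

$ gcc -c -flto -Xassembler -mfoo=foo1,foo2 test.c
Here -Xassembler passes its argument through verbatim, so an option whose value itself contains a comma reaches the assembler intact.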
Re: [PR47785] COLLECT_AS_OPTIONS
On Thu, Oct 31, 2019 at 6:33 PM Kugan Vivekanandarajah wrote: > > On Wed, 30 Oct 2019 at 03:11, H.J. Lu wrote: > > > > On Sun, Oct 27, 2019 at 6:33 PM Kugan Vivekanandarajah > > wrote: > > > > > > Hi Richard, > > > > > > Thanks for the review. > > > > > > On Wed, 23 Oct 2019 at 23:07, Richard Biener > > > wrote: > > > > > > > > On Mon, Oct 21, 2019 at 10:04 AM Kugan Vivekanandarajah > > > > wrote: > > > > > > > > > > Hi Richard, > > > > > > > > > > Thanks for the pointers. > > > > > > > > > > > > > > > > > > > > On Fri, 11 Oct 2019 at 22:33, Richard Biener > > > > > wrote: > > > > > > > > > > > > On Fri, Oct 11, 2019 at 6:15 AM Kugan Vivekanandarajah > > > > > > wrote: > > > > > > > > > > > > > > Hi Richard, > > > > > > > Thanks for the review. > > > > > > > > > > > > > > On Wed, 2 Oct 2019 at 20:41, Richard Biener > > > > > > > wrote: > > > > > > > > > > > > > > > > On Wed, Oct 2, 2019 at 10:39 AM Kugan Vivekanandarajah > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > As mentioned in the PR, attached patch adds > > > > > > > > > COLLECT_AS_OPTIONS for > > > > > > > > > passing assembler options specified with -Wa, to the > > > > > > > > > link-time driver. > > > > > > > > > > > > > > > > > > The proposed solution only works for uniform -Wa options > > > > > > > > > across all > > > > > > > > > TUs. As mentioned by Richard Biener, supporting non-uniform > > > > > > > > > -Wa flags > > > > > > > > > would require either adjusting partitioning according to > > > > > > > > > flags or > > > > > > > > > emitting multiple object files from a single LTRANS CU. We > > > > > > > > > could > > > > > > > > > consider this as a follow up. > > > > > > > > > > > > > > > > > > Bootstrapped and regression tests on arm-linux-gcc. Is this > > > > > > > > > OK for trunk? > > > > > > > > > > > > > > > > While it works for your simple cases it is unlikely to work in > > > > > > > > practice since > > > > > > > > your implementation needs the assembler options be present at > > > > > > > > the link > > > > > > > > command line. I agree that this might be the way for people to > > > > > > > > go when > > > > > > > > they face the issue but then it needs to be documented somewhere > > > > > > > > in the manual. > > > > > > > > > > > > > > > > That is, with COLLECT_AS_OPTION (why singular? I'd expected > > > > > > > > COLLECT_AS_OPTIONS) available to cc1 we could stream this string > > > > > > > > to lto_options and re-materialize it at link time (and diagnose > > > > > > > > mismatches > > > > > > > > even if we like). > > > > > > > OK. I will try to implement this. So the idea is if we provide > > > > > > > -Wa,options as part of the lto compile, this should be available > > > > > > > during link time. Like in: > > > > > > > > > > > > > > arm-linux-gnueabihf-gcc -march=armv7-a -mthumb -O2 -flto > > > > > > > -Wa,-mimplicit-it=always,-mthumb -c test.c > > > > > > > arm-linux-gnueabihf-gcc -flto test.o > > > > > > > > > > > > > > I am not sure where should we stream this. Currently, > > > > > > > cl_optimization > > > > > > > has all the optimization flag provided for compiler and it is > > > > > > > autogenerated and all the flags are integer values. Do you have > > > > > > > any > > &
Re: [PR47785] COLLECT_AS_OPTIONS
On Sun, Nov 3, 2019 at 6:45 PM Kugan Vivekanandarajah wrote: > > Thanks for the reviews. > > > On Sat, 2 Nov 2019 at 02:49, H.J. Lu wrote: > > > > On Thu, Oct 31, 2019 at 6:33 PM Kugan Vivekanandarajah > > wrote: > > > > > > On Wed, 30 Oct 2019 at 03:11, H.J. Lu wrote: > > > > > > > > On Sun, Oct 27, 2019 at 6:33 PM Kugan Vivekanandarajah > > > > wrote: > > > > > > > > > > Hi Richard, > > > > > > > > > > Thanks for the review. > > > > > > > > > > On Wed, 23 Oct 2019 at 23:07, Richard Biener > > > > > wrote: > > > > > > > > > > > > On Mon, Oct 21, 2019 at 10:04 AM Kugan Vivekanandarajah > > > > > > wrote: > > > > > > > > > > > > > > Hi Richard, > > > > > > > > > > > > > > Thanks for the pointers. > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, 11 Oct 2019 at 22:33, Richard Biener > > > > > > > wrote: > > > > > > > > > > > > > > > > On Fri, Oct 11, 2019 at 6:15 AM Kugan Vivekanandarajah > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > Hi Richard, > > > > > > > > > Thanks for the review. > > > > > > > > > > > > > > > > > > On Wed, 2 Oct 2019 at 20:41, Richard Biener > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > On Wed, Oct 2, 2019 at 10:39 AM Kugan Vivekanandarajah > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > > > > > As mentioned in the PR, attached patch adds > > > > > > > > > > > COLLECT_AS_OPTIONS for > > > > > > > > > > > passing assembler options specified with -Wa, to the > > > > > > > > > > > link-time driver. > > > > > > > > > > > > > > > > > > > > > > The proposed solution only works for uniform -Wa options > > > > > > > > > > > across all > > > > > > > > > > > TUs. As mentioned by Richard Biener, supporting > > > > > > > > > > > non-uniform -Wa flags > > > > > > > > > > > would require either adjusting partitioning according to > > > > > > > > > > > flags or > > > > > > > > > > > emitting multiple object files from a single LTRANS CU. > > > > > > > > > > > We could > > > > > > > > > > > consider this as a follow up. > > > > > > > > > > > > > > > > > > > > > > Bootstrapped and regression tests on arm-linux-gcc. Is > > > > > > > > > > > this OK for trunk? > > > > > > > > > > > > > > > > > > > > While it works for your simple cases it is unlikely to work > > > > > > > > > > in practice since > > > > > > > > > > your implementation needs the assembler options be present > > > > > > > > > > at the link > > > > > > > > > > command line. I agree that this might be the way for > > > > > > > > > > people to go when > > > > > > > > > > they face the issue but then it needs to be documented > > > > > > > > > > somewhere > > > > > > > > > > in the manual. > > > > > > > > > > > > > > > > > > > > That is, with COLLECT_AS_OPTION (why singular? I'd expected > > > > > > > > > > COLLECT_AS_OPTIONS) available to cc1 we could stream this > > > > > > > > > > string > > > > > > > > > > to lto_options and re-materialize it at link time (and > > > > > > > > > > diagnose mismatches > > > > > > &g
Re: [PATCH] Set AVX128_OPTIMAL for all avx targets.
On Tue, Nov 12, 2019 at 2:48 AM Hongtao Liu wrote: > > On Tue, Nov 12, 2019 at 4:41 PM Richard Biener > wrote: > > > > On Tue, Nov 12, 2019 at 9:29 AM Hongtao Liu wrote: > > > > > > On Tue, Nov 12, 2019 at 4:19 PM Richard Biener > > > wrote: > > > > > > > > On Tue, Nov 12, 2019 at 8:36 AM Hongtao Liu wrote: > > > > > > > > > > Hi: > > > > > This patch is about to set X86_TUNE_AVX128_OPTIMAL as default for > > > > > all AVX target because we found there's still performance gap between > > > > > 128-bit auto-vectorization and 256-bit auto-vectorization even with > > > > > epilog vectorized. > > > > > The performance influence of setting avx128_optimal as default on > > > > > SPEC2017 with option `-march=native -funroll-loops -Ofast -flto" on > > > > > CLX is as bellow: > > > > > > > > > > INT rate > > > > > 500.perlbench_r -0.32% > > > > > 502.gcc_r -1.32% > > > > > 505.mcf_r -0.12% > > > > > 520.omnetpp_r -0.34% > > > > > 523.xalancbmk_r -0.65% > > > > > 525.x264_r 2.23% > > > > > 531.deepsjeng_r 0.81% > > > > > 541.leela_r -0.02% > > > > > 548.exchange2_r 10.89% --> big improvement > > > > > 557.xz_r0.38% > > > > > geomean for intrate 1.10% > > > > > > > > > > FP rate > > > > > 503.bwaves_r1.41% > > > > > 507.cactuBSSN_r -0.14% > > > > > 508.namd_r 1.54% > > > > > 510.parest_r-0.87% > > > > > 511.povray_r0.28% > > > > > 519.lbm_r 0.32% > > > > > 521.wrf_r -0.54% > > > > > 526.blender_r 0.59% > > > > > 527.cam4_r -2.70% > > > > > 538.imagick_r 3.92% > > > > > 544.nab_r 0.59% > > > > > 549.fotonik3d_r -5.44% -> regression > > > > > 554.roms_r -2.34% > > > > > geomean for fprate -0.28% > > > > > > > > > > The 10% improvement of 548.exchange_r is because there is 9-layer > > > > > nested loop, and the loop count for innermost layer is small(enough > > > > > for 128-bit vectorization, but not for 256-bit vectorization). > > > > > Since loop count is not statically analyzed out, vectorizer will > > > > > choose 256-bit vectorization which would never never be triggered. The > > > > > vectorization of epilog will introduced some extra instructions, > > > > > normally it will bring back some performance, but since it's 9-layer > > > > > nested loop, costs of extra instructions will cover the gain. > > > > > > > > > > The 5.44% regression of 549.fotonik3d_r is because 256-bit > > > > > vectorization is better than 128-bit vectorization. Generally when > > > > > enabling 256-bit or 512-bit vectorization, there will be instruction > > > > > clocksticks reduction also with frequency reduction. when frequency > > > > > reduction is less than instructions clocksticks reduction, long vector > > > > > width vectorization would be better than shorter one, otherwise the > > > > > opposite. The regression of 549.fotonik3d_r is due to this, similar > > > > > for 554.roms_r, 528.cam4_r, for those 3 benchmarks, 512-bit > > > > > vectorization is best. > > > > > > > > > > Bootstrap and regression test on i386 is ok. > > > > > Ok for trunk? > > > > > > > > I don't think 128_optimal does what you think it does. If you want to > > > > prefer 128bit AVX adjust the preference, but 128_optimal describes > > > > a microarchitectural detail (AVX256 ops are split into two AVX128 ops) > > > But it will set target_prefer_avx128 by default. > > > > > > 2694 /* Enable 128-bit AVX instruction generation > > > 2695 for the auto-vectorizer. 
*/ > > > 2696 if (TARGET_AVX128_OPTIMAL > > > 2697 && (opts_set->x_prefer_vector_width_type == PVW_NONE)) > > > 2698 opts->x_prefer_vector_width_type = PVW_AVX128; > > > - > > > And it may be too confusing to add another tuning flag. > > Well, it's confusing to mix two things - defaulting the vector width > > preference > > and the architectural detail of Bulldozer and early Zen parts. So please > > split > > the tuning. And then re-benchmark with _just_ changing the preference > Actually, the result is similar; I've tested both (the patch using > avx128_optimal, and trunk gcc with an additional > -mprefer-vector-width=128). > And I will run a test to see the effect of FDO. It is hard to tell whether a 128-bit or a 256-bit vector size works better in general. For SPEC CPU 2017, 128-bit vectors give better overall scores. One can always change the vector size, even to 512-bit, since some workloads are faster with 512-bit vectors. -- H.J.
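For anyone who wants to reproduce the comparison without patching the tuning tables, the width preference can be switched per compilation; a sketch using the same flags as the measurements above, with an illustrative source file name:

$ gcc -Ofast -flto -funroll-loops -march=native -mprefer-vector-width=128 -c bench.c
$ gcc -Ofast -flto -funroll-loops -march=native -mprefer-vector-width=256 -c bench.c
$ gcc -Ofast -flto -funroll-loops -march=native -mprefer-vector-width=512 -c bench.c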
Re: [PR47785] COLLECT_AS_OPTIONS
On Tue, Jan 14, 2020 at 11:29 PM Prathamesh Kulkarni wrote: > > On Wed, 8 Jan 2020 at 15:50, Prathamesh Kulkarni > wrote: > > > > On Tue, 5 Nov 2019 at 17:38, Richard Biener > > wrote: > > > > > > On Tue, Nov 5, 2019 at 12:17 AM Kugan Vivekanandarajah > > > wrote: > > > > > > > > Hi, > > > > Thanks for the review. > > > > > > > > On Tue, 5 Nov 2019 at 03:57, H.J. Lu wrote: > > > > > > > > > > On Sun, Nov 3, 2019 at 6:45 PM Kugan Vivekanandarajah > > > > > wrote: > > > > > > > > > > > > Thanks for the reviews. > > > > > > > > > > > > > > > > > > On Sat, 2 Nov 2019 at 02:49, H.J. Lu wrote: > > > > > > > > > > > > > > On Thu, Oct 31, 2019 at 6:33 PM Kugan Vivekanandarajah > > > > > > > wrote: > > > > > > > > > > > > > > > > On Wed, 30 Oct 2019 at 03:11, H.J. Lu > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > On Sun, Oct 27, 2019 at 6:33 PM Kugan Vivekanandarajah > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > Hi Richard, > > > > > > > > > > > > > > > > > > > > Thanks for the review. > > > > > > > > > > > > > > > > > > > > On Wed, 23 Oct 2019 at 23:07, Richard Biener > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > On Mon, Oct 21, 2019 at 10:04 AM Kugan Vivekanandarajah > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > Hi Richard, > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the pointers. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, 11 Oct 2019 at 22:33, Richard Biener > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Oct 11, 2019 at 6:15 AM Kugan Vivekanandarajah > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Richard, > > > > > > > > > > > > > > Thanks for the review. > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2 Oct 2019 at 20:41, Richard Biener > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Oct 2, 2019 at 10:39 AM Kugan > > > > > > > > > > > > > > > Vivekanandarajah > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > As mentioned in the PR, attached patch adds > > > > > > > > > > > > > > > > COLLECT_AS_OPTIONS for > > > > > > > > > > > > > > > > passing assembler options specified with -Wa, > > > > > > > > > > > > > > > > to the link-time driver. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The proposed solution only works for uniform > > > > > > > > > > > > > > > > -Wa options across all > > > > > > > > > > > > > > > > TUs. As mentioned by Richard Biener, supporting > > > >
[PATCH] PR target/93319: x32: Add x32 support to -mtls-dialect=gnu2
To add x32 support to -mtls-dialect=gnu2, we need to replace DI with P in GNU2 TLS patterns. Since thread pointer is in ptr_mode, PLUS in GNU2 TLS address computation must be done in ptr_mode to support -maddress-mode=long. Also drop the "q" suffix from lea to support both "lea foo@TLSDESC(%rip), %eax" and "foo@TLSDESC(%rip), %rax". Tested on Linux/x86-64. OK for master? Thanks. H.J. --- gcc/ PR target/93319 * config/i386/i386.c (legitimize_tls_address): Pass Pmode to gen_tls_dynamic_gnu2_64. Compute GNU2 TLS address in ptr_mode. * config/i386/i386.md (tls_dynamic_gnu2_64): Renamed to ... (@tls_dynamic_gnu2_64_): This. Replace DI with P. (*tls_dynamic_gnu2_lea_64): Renamed to ... (*tls_dynamic_gnu2_lea_64_): This. Replace DI with P. Remove the {q} suffix from lea. (*tls_dynamic_gnu2_call_64): Renamed to ... (*tls_dynamic_gnu2_call_64_): This. Replace DI with P. (*tls_dynamic_gnu2_combine_64): Renamed to ... (*tls_dynamic_gnu2_combine_64_): This. Replace DI with P. Pass Pmode to gen_tls_dynamic_gnu2_64. gcc/testsuite/ PR target/93319 * gcc.target/i386/pr93319-1a.c: New test. * gcc.target/i386/pr93319-1b.c: Likewise. * gcc.target/i386/pr93319-1c.c: Likewise. * gcc.target/i386/pr93319-1d.c: Likewise. --- gcc/config/i386/i386.c | 31 +++-- gcc/config/i386/i386.md| 54 +++--- gcc/testsuite/gcc.target/i386/pr93319-1a.c | 24 ++ gcc/testsuite/gcc.target/i386/pr93319-1b.c | 7 +++ gcc/testsuite/gcc.target/i386/pr93319-1c.c | 7 +++ gcc/testsuite/gcc.target/i386/pr93319-1d.c | 7 +++ 6 files changed, 99 insertions(+), 31 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1a.c create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1b.c create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1c.c create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1d.c diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 2c087a4a3e0..8c437dbe1f3 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -10764,12 +10764,24 @@ legitimize_tls_address (rtx x, enum tls_model model, bool for_mov) if (TARGET_GNU2_TLS) { if (TARGET_64BIT) - emit_insn (gen_tls_dynamic_gnu2_64 (dest, x)); + emit_insn (gen_tls_dynamic_gnu2_64 (Pmode, dest, x)); else emit_insn (gen_tls_dynamic_gnu2_32 (dest, x, pic)); tp = get_thread_pointer (Pmode, true); - dest = force_reg (Pmode, gen_rtx_PLUS (Pmode, tp, dest)); + + /* NB: Since thread pointer is in ptr_mode, make sure that +PLUS is done in ptr_mode. */ + if (Pmode != ptr_mode) + { + tp = lowpart_subreg (ptr_mode, tp, Pmode); + dest = lowpart_subreg (ptr_mode, dest, Pmode); + dest = gen_rtx_PLUS (ptr_mode, tp, dest); + dest = gen_rtx_ZERO_EXTEND (Pmode, dest); + } + else + dest = gen_rtx_PLUS (Pmode, tp, dest); + dest = force_reg (Pmode, dest); if (GET_MODE (x) != Pmode) x = gen_rtx_ZERO_EXTEND (Pmode, x); @@ -10821,7 +10833,7 @@ legitimize_tls_address (rtx x, enum tls_model model, bool for_mov) rtx tmp = ix86_tls_module_base (); if (TARGET_64BIT) - emit_insn (gen_tls_dynamic_gnu2_64 (base, tmp)); + emit_insn (gen_tls_dynamic_gnu2_64 (Pmode, base, tmp)); else emit_insn (gen_tls_dynamic_gnu2_32 (base, tmp, pic)); @@ -10864,7 +10876,18 @@ legitimize_tls_address (rtx x, enum tls_model model, bool for_mov) if (TARGET_GNU2_TLS) { - dest = force_reg (Pmode, gen_rtx_PLUS (Pmode, dest, tp)); + /* NB: Since thread pointer is in ptr_mode, make sure that +PLUS is done in ptr_mode. 
*/ + if (Pmode != ptr_mode) + { + tp = lowpart_subreg (ptr_mode, tp, Pmode); + dest = lowpart_subreg (ptr_mode, dest, Pmode); + dest = gen_rtx_PLUS (ptr_mode, tp, dest); + dest = gen_rtx_ZERO_EXTEND (Pmode, dest); + } + else + dest = gen_rtx_PLUS (Pmode, tp, dest); + dest = force_reg (Pmode, dest); if (GET_MODE (x) != Pmode) x = gen_rtx_ZERO_EXTEND (Pmode, x); diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index c9d2f338fe9..d53684096c4 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -15185,14 +15185,14 @@ (define_insn_and_split "*tls_dynamic_gnu2_combine_32" emit_insn (gen_tls_dynamic_gnu2_32 (operands[5], operands[1], operands[2])); }) -(define_expand "tls_dynamic_gnu2_64" +(define_expand "@tls_dynamic_gnu2_64_" [(set (match_dup 2) - (unspec:DI [(match_operand 1 "tls_symbolic_operand")] - UNSPEC_TLSDESC)) + (unspec:P
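For a quick look at the code sequence this targets, the smallest possible TLS access is enough (the new pr93319 tests above are the real coverage; the file name here is illustrative):

$ cat tls.c
__thread int counter;
int
bump (void)
{
  return ++counter;
}
$ gcc -mx32 -maddress-mode=long -fPIC -mtls-dialect=gnu2 -O2 -S tls.c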
Re: New repository location
On Sun, Jan 19, 2020 at 6:33 AM Bill Schmidt wrote: > > Question: Is the new gcc git repository at gcc.gnu.org/git/gcc.git > using the same location as the earlier git mirror did? I'm curious > whether our repository on pike is still syncing with the new master, or > whether we need to make some adjustments before we next rebase pu > against master. > The two repos are different. I renamed my old mirror and created a new one: https://gitlab.com/x86-gcc -- H.J.
Re: [PATCH] PR target/93319: x32: Add x32 support to -mtls-dialect=gnu2
On Sun, Jan 19, 2020 at 9:48 AM Uros Bizjak wrote: > > On Sun, Jan 19, 2020 at 6:43 PM Uros Bizjak wrote: > > > > On Sun, Jan 19, 2020 at 2:58 PM H.J. Lu wrote: > > > > > > To add x32 support to -mtls-dialect=gnu2, we need to replace DI with > > > P in GNU2 TLS patterns. Since thread pointer is in ptr_mode, PLUS in > > > GNU2 TLS address computation must be done in ptr_mode to support > > > -maddress-mode=long. Also drop the "q" suffix from lea to support > > > both "lea foo@TLSDESC(%rip), %eax" and "foo@TLSDESC(%rip), %rax". > > > > Please use "lea%z0" instead. > > > > > Tested on Linux/x86-64. OK for master? > > > > > > Thanks. > > > > > > H.J. > > > --- > > > gcc/ > > > > > > PR target/93319 > > > * config/i386/i386.c (legitimize_tls_address): Pass Pmode to > > > gen_tls_dynamic_gnu2_64. Compute GNU2 TLS address in ptr_mode. > > > * config/i386/i386.md (tls_dynamic_gnu2_64): Renamed to ... > > > (@tls_dynamic_gnu2_64_): This. Replace DI with P. > > > (*tls_dynamic_gnu2_lea_64): Renamed to ... > > > (*tls_dynamic_gnu2_lea_64_): This. Replace DI with P. > > > Remove the {q} suffix from lea. > > > (*tls_dynamic_gnu2_call_64): Renamed to ... > > > (*tls_dynamic_gnu2_call_64_): This. Replace DI with P. > > > (*tls_dynamic_gnu2_combine_64): Renamed to ... > > > (*tls_dynamic_gnu2_combine_64_): This. Replace DI with P. > > > Pass Pmode to gen_tls_dynamic_gnu2_64. > > > > > > gcc/testsuite/ > > > > > > PR target/93319 > > > * gcc.target/i386/pr93319-1a.c: New test. > > > * gcc.target/i386/pr93319-1b.c: Likewise. > > > * gcc.target/i386/pr93319-1c.c: Likewise. > > > * gcc.target/i386/pr93319-1d.c: Likewise. > > > --- > > > gcc/config/i386/i386.c | 31 +++-- > > > gcc/config/i386/i386.md| 54 +++--- > > > gcc/testsuite/gcc.target/i386/pr93319-1a.c | 24 ++ > > > gcc/testsuite/gcc.target/i386/pr93319-1b.c | 7 +++ > > > gcc/testsuite/gcc.target/i386/pr93319-1c.c | 7 +++ > > > gcc/testsuite/gcc.target/i386/pr93319-1d.c | 7 +++ > > > 6 files changed, 99 insertions(+), 31 deletions(-) > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1a.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1b.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1c.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1d.c > > > > > > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c > > > index 2c087a4a3e0..8c437dbe1f3 100644 > > > --- a/gcc/config/i386/i386.c > > > +++ b/gcc/config/i386/i386.c > > > @@ -10764,12 +10764,24 @@ legitimize_tls_address (rtx x, enum tls_model > > > model, bool for_mov) > > >if (TARGET_GNU2_TLS) > > > { > > > if (TARGET_64BIT) > > > - emit_insn (gen_tls_dynamic_gnu2_64 (dest, x)); > > > + emit_insn (gen_tls_dynamic_gnu2_64 (Pmode, dest, x)); > > > else > > > emit_insn (gen_tls_dynamic_gnu2_32 (dest, x, pic)); > > > > > > tp = get_thread_pointer (Pmode, true); > > > - dest = force_reg (Pmode, gen_rtx_PLUS (Pmode, tp, dest)); > > > + > > > + /* NB: Since thread pointer is in ptr_mode, make sure that > > > +PLUS is done in ptr_mode. */ > > Actually, thread_pointer is in Pmode, see the line just above your > change. Also, dest is in Pmode, so why do we need all this subreg > dance? 
dest set from gen_tls_dynamic_gnu2_64 is in ptr_mode by call *foo@TLSCALL(%rax) (gdb) bt #0 test () at lib.s:20 #1 0x00401075 in main () at main.c:13 (gdb) f 0 #0 test () at lib.s:20 20 addq %rax, %r12 (gdb) disass Dump of assembler code for function test: 0xf7fca120 <+0>: push %r12 0xf7fca122 <+2>: lea0x2ef7(%rip),%rax# 0xf7fcd020 0xf7fca129 <+9>: lea0xed0(%rip),%rdi# 0xf7fcb000 0xf7fca130 <+16>: mov%fs:0x0,%r12d 0xf7fca139 <+25>: callq *(%rax) => 0xf7fca13b <+27>: add%rax,%r12 ^^ Wrong address in R12. 0xf7fca13e <+30>: xor%eax,%eax 0xf7fca140 <+32>: mov(%r12),%esi 0xf7fca144 <+36>
Re: [PATCH] PR target/93319: x32: Add x32 support to -mtls-dialect=gnu2
On Sun, Jan 19, 2020 at 12:01 PM Uros Bizjak wrote: > > On Sun, Jan 19, 2020 at 7:07 PM H.J. Lu wrote: > > > > On Sun, Jan 19, 2020 at 9:48 AM Uros Bizjak wrote: > > > > > > On Sun, Jan 19, 2020 at 6:43 PM Uros Bizjak wrote: > > > > > > > > On Sun, Jan 19, 2020 at 2:58 PM H.J. Lu wrote: > > > > > > > > > > To add x32 support to -mtls-dialect=gnu2, we need to replace DI with > > > > > P in GNU2 TLS patterns. Since thread pointer is in ptr_mode, PLUS in > > > > > GNU2 TLS address computation must be done in ptr_mode to support > > > > > -maddress-mode=long. Also drop the "q" suffix from lea to support > > > > > both "lea foo@TLSDESC(%rip), %eax" and "foo@TLSDESC(%rip), %rax". > > > > > > > > Please use "lea%z0" instead. > > > > > > > > > Tested on Linux/x86-64. OK for master? > > > > > > > > > > Thanks. > > > > > > > > > > H.J. > > > > > --- > > > > > gcc/ > > > > > > > > > > PR target/93319 > > > > > * config/i386/i386.c (legitimize_tls_address): Pass Pmode to > > > > > gen_tls_dynamic_gnu2_64. Compute GNU2 TLS address in > > > > > ptr_mode. > > > > > * config/i386/i386.md (tls_dynamic_gnu2_64): Renamed to ... > > > > > (@tls_dynamic_gnu2_64_): This. Replace DI with P. > > > > > (*tls_dynamic_gnu2_lea_64): Renamed to ... > > > > > (*tls_dynamic_gnu2_lea_64_): This. Replace DI with P. > > > > > Remove the {q} suffix from lea. > > > > > (*tls_dynamic_gnu2_call_64): Renamed to ... > > > > > (*tls_dynamic_gnu2_call_64_): This. Replace DI with P. > > > > > (*tls_dynamic_gnu2_combine_64): Renamed to ... > > > > > (*tls_dynamic_gnu2_combine_64_): This. Replace DI with > > > > > P. > > > > > Pass Pmode to gen_tls_dynamic_gnu2_64. > > > > > > > > > > gcc/testsuite/ > > > > > > > > > > PR target/93319 > > > > > * gcc.target/i386/pr93319-1a.c: New test. > > > > > * gcc.target/i386/pr93319-1b.c: Likewise. > > > > > * gcc.target/i386/pr93319-1c.c: Likewise. > > > > > * gcc.target/i386/pr93319-1d.c: Likewise. > > > > > --- > > > > > gcc/config/i386/i386.c | 31 +++-- > > > > > gcc/config/i386/i386.md| 54 > > > > > +++--- > > > > > gcc/testsuite/gcc.target/i386/pr93319-1a.c | 24 ++ > > > > > gcc/testsuite/gcc.target/i386/pr93319-1b.c | 7 +++ > > > > > gcc/testsuite/gcc.target/i386/pr93319-1c.c | 7 +++ > > > > > gcc/testsuite/gcc.target/i386/pr93319-1d.c | 7 +++ > > > > > 6 files changed, 99 insertions(+), 31 deletions(-) > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1a.c > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1b.c > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1c.c > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1d.c > > > > > > > > > > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c > > > > > index 2c087a4a3e0..8c437dbe1f3 100644 > > > > > --- a/gcc/config/i386/i386.c > > > > > +++ b/gcc/config/i386/i386.c > > > > > @@ -10764,12 +10764,24 @@ legitimize_tls_address (rtx x, enum > > > > > tls_model model, bool for_mov) > > > > >if (TARGET_GNU2_TLS) > > > > > { > > > > > if (TARGET_64BIT) > > > > > - emit_insn (gen_tls_dynamic_gnu2_64 (dest, x)); > > > > > + emit_insn (gen_tls_dynamic_gnu2_64 (Pmode, dest, x)); > > > > > else > > > > > emit_insn (gen_tls_dynamic_gnu2_32 (dest, x, pic)); > > > > > > > > > > tp = get_thread_pointer (Pmode, true); > > > > > - dest = force_reg (Pmode, gen_rtx_PLUS (Pmode, tp, dest)); > > > > > + > > > > > + /* NB: Since thread pointer is in ptr_mode, make sure that > > > > > +PLUS i
Re: [PATCH] PR target/93319: x32: Add x32 support to -mtls-dialect=gnu2
On Sun, Jan 19, 2020 at 12:16 PM Uros Bizjak wrote: > > On Sun, Jan 19, 2020 at 9:07 PM H.J. Lu wrote: > > > > On Sun, Jan 19, 2020 at 12:01 PM Uros Bizjak wrote: > > > > > > On Sun, Jan 19, 2020 at 7:07 PM H.J. Lu wrote: > > > > > > > > On Sun, Jan 19, 2020 at 9:48 AM Uros Bizjak wrote: > > > > > > > > > > On Sun, Jan 19, 2020 at 6:43 PM Uros Bizjak wrote: > > > > > > > > > > > > On Sun, Jan 19, 2020 at 2:58 PM H.J. Lu wrote: > > > > > > > > > > > > > > To add x32 support to -mtls-dialect=gnu2, we need to replace DI > > > > > > > with > > > > > > > P in GNU2 TLS patterns. Since thread pointer is in ptr_mode, > > > > > > > PLUS in > > > > > > > GNU2 TLS address computation must be done in ptr_mode to support > > > > > > > -maddress-mode=long. Also drop the "q" suffix from lea to support > > > > > > > both "lea foo@TLSDESC(%rip), %eax" and "foo@TLSDESC(%rip), %rax". > > > > > > > > > > > > Please use "lea%z0" instead. > > > > > > > > > > > > > Tested on Linux/x86-64. OK for master? > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > H.J. > > > > > > > --- > > > > > > > gcc/ > > > > > > > > > > > > > > PR target/93319 > > > > > > > * config/i386/i386.c (legitimize_tls_address): Pass Pmode > > > > > > > to > > > > > > > gen_tls_dynamic_gnu2_64. Compute GNU2 TLS address in > > > > > > > ptr_mode. > > > > > > > * config/i386/i386.md (tls_dynamic_gnu2_64): Renamed to > > > > > > > ... > > > > > > > (@tls_dynamic_gnu2_64_): This. Replace DI with P. > > > > > > > (*tls_dynamic_gnu2_lea_64): Renamed to ... > > > > > > > (*tls_dynamic_gnu2_lea_64_): This. Replace DI with > > > > > > > P. > > > > > > > Remove the {q} suffix from lea. > > > > > > > (*tls_dynamic_gnu2_call_64): Renamed to ... > > > > > > > (*tls_dynamic_gnu2_call_64_): This. Replace DI > > > > > > > with P. > > > > > > > (*tls_dynamic_gnu2_combine_64): Renamed to ... > > > > > > > (*tls_dynamic_gnu2_combine_64_): This. Replace DI > > > > > > > with P. > > > > > > > Pass Pmode to gen_tls_dynamic_gnu2_64. > > > > > > > > > > > > > > gcc/testsuite/ > > > > > > > > > > > > > > PR target/93319 > > > > > > > * gcc.target/i386/pr93319-1a.c: New test. > > > > > > > * gcc.target/i386/pr93319-1b.c: Likewise. > > > > > > > * gcc.target/i386/pr93319-1c.c: Likewise. > > > > > > > * gcc.target/i386/pr93319-1d.c: Likewise. > > > > > > > --- > > > > > > > gcc/config/i386/i386.c | 31 +++-- > > > > > > > gcc/config/i386/i386.md| 54 > > > > > > > +++--- > > > > > > > gcc/testsuite/gcc.target/i386/pr93319-1a.c | 24 ++ > > > > > > > gcc/testsuite/gcc.target/i386/pr93319-1b.c | 7 +++ > > > > > > > gcc/testsuite/gcc.target/i386/pr93319-1c.c | 7 +++ > > > > > > > gcc/testsuite/gcc.target/i386/pr93319-1d.c | 7 +++ > > > > > > > 6 files changed, 99 insertions(+), 31 deletions(-) > > > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1a.c > > > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1b.c > > > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1c.c > > > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr93319-1d.c > > > > > > > > > > > > > > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c > > > > > > > index 2c087a4a3e0..8c437dbe1f3 100644 > > > > > > > --- a/gcc/config/i386/i386.c
Re: [PATCH] Make target_clones resolver fn static.
On Mon, Jan 20, 2020 at 2:25 AM Richard Biener wrote: > > On Fri, Jan 17, 2020 at 10:25 AM Martin Liška wrote: > > > > Hi. > > > > The patch removes need to have a gnu_indirect_function global > > symbol. That aligns the code with what ppc64 target does. > > > > Patch can bootstrap on x86_64-linux-gnu and survives regression tests. > > > > Ready to be installed? > > Did you verify the result actually works? I'm not sure we have any runtime > test > coverage for the feature and non-public functions and you don't add a testcase > either. Maybe there's interesting coverage in the binutils or glibc testsuite > (though both might not use the compilers ifunc feature...). > > The patch also suspiciously lacks removal of actually making the resolver > TREE_PUBLIC if the default implementation was not ... so I wonder whether > you verified that the resolver _is_ indeed local. > > HJ, do you know anything about this requirement? It's that way since > the original contribution of multi-versioning by Google... We can that only if function is static: [hjl@gnu-cfl-2 tmp]$ cat x.c __attribute__((target_clones("avx","default"))) int foo () { return -2; } [hjl@gnu-cfl-2 tmp]$ gcc -S -O2 x.c [hjl@gnu-cfl-2 tmp]$ cat x.s .file "x.c" .text .p2align 4 .type foo.default.1, @function foo.default.1: .LFB0: .cfi_startproc movl $-2, %eax ret .cfi_endproc .LFE0: .size foo.default.1, .-foo.default.1 .p2align 4 .type foo.avx.0, @function foo.avx.0: .LFB1: .cfi_startproc movl $-2, %eax ret .cfi_endproc .LFE1: .size foo.avx.0, .-foo.avx.0 .section .text.foo.resolver,"axG",@progbits,foo.resolver,comdat .p2align 4 .weak foo.resolver .type foo.resolver, @function foo.resolver: .LFB3: .cfi_startproc subq $8, %rsp .cfi_def_cfa_offset 16 call __cpu_indicator_init movl $foo.default.1, %eax movl $foo.avx.0, %edx testb $2, __cpu_model+13(%rip) cmovne %rdx, %rax addq $8, %rsp .cfi_def_cfa_offset 8 ret .cfi_endproc .LFE3: .size foo.resolver, .-foo.resolver .globl foo .type foo, @gnu_indirect_function .set foo,foo.resolver .ident "GCC: (GNU) 9.2.1 20191120 (Red Hat 9.2.1-2)" .section .note.GNU-stack,"",@progbits [hjl@gnu-cfl-2 tmp]$ In this case, foo must be global. > Richard. > > > Thanks, > > Martin > > > > gcc/ChangeLog: > > > > 2020-01-17 Martin Liska > > > > PR target/93274 > > * config/i386/i386-features.c (make_resolver_func): > > Align the code with ppc64 target implementaion. > > We do not need to have gnu_indirect_function > > as a global function. > > > > gcc/testsuite/ChangeLog: > > > > 2020-01-17 Martin Liska > > > > PR target/93274 > > * gcc.target/i386/pr81213.c: Adjust to not expect > > a global unique name. > > --- > > gcc/config/i386/i386-features.c | 20 +--- > > gcc/testsuite/gcc.target/i386/pr81213.c | 4 ++-- > > 2 files changed, 7 insertions(+), 17 deletions(-) > > > > -- H.J.
Re: [PATCH] PR target/93319: x32: Add x32 support to -mtls-dialect=gnu2
On Sun, Jan 19, 2020 at 11:53 PM Uros Bizjak wrote: > > On Sun, Jan 19, 2020 at 10:00 PM H.J. Lu wrote: > > > > On Sun, Jan 19, 2020 at 12:16 PM Uros Bizjak wrote: > > > > > > On Sun, Jan 19, 2020 at 9:07 PM H.J. Lu wrote: > > > > > > > > On Sun, Jan 19, 2020 at 12:01 PM Uros Bizjak wrote: > > > > > > > > > > On Sun, Jan 19, 2020 at 7:07 PM H.J. Lu wrote: > > > > > > > > > > > > On Sun, Jan 19, 2020 at 9:48 AM Uros Bizjak > > > > > > wrote: > > > > > > > > > > > > > > On Sun, Jan 19, 2020 at 6:43 PM Uros Bizjak > > > > > > > wrote: > > > > > > > > > > > > > > > > On Sun, Jan 19, 2020 at 2:58 PM H.J. Lu > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > To add x32 support to -mtls-dialect=gnu2, we need to replace > > > > > > > > > DI with > > > > > > > > > P in GNU2 TLS patterns. Since thread pointer is in ptr_mode, > > > > > > > > > PLUS in > > > > > > > > > GNU2 TLS address computation must be done in ptr_mode to > > > > > > > > > support > > > > > > > > > -maddress-mode=long. Also drop the "q" suffix from lea to > > > > > > > > > support > > > > > > > > > both "lea foo@TLSDESC(%rip), %eax" and "foo@TLSDESC(%rip), > > > > > > > > > %rax". > > > > > > > > > > > > > > > > Please use "lea%z0" instead. > > > > > > > > > > > > > > > > > Tested on Linux/x86-64. OK for master? > > > > > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > H.J. > > > > > > > > > --- > > > > > > > > > gcc/ > > > > > > > > > > > > > > > > > > PR target/93319 > > > > > > > > > * config/i386/i386.c (legitimize_tls_address): Pass > > > > > > > > > Pmode to > > > > > > > > > gen_tls_dynamic_gnu2_64. Compute GNU2 TLS address in > > > > > > > > > ptr_mode. > > > > > > > > > * config/i386/i386.md (tls_dynamic_gnu2_64): Renamed > > > > > > > > > to ... > > > > > > > > > (@tls_dynamic_gnu2_64_): This. Replace DI with > > > > > > > > > P. > > > > > > > > > (*tls_dynamic_gnu2_lea_64): Renamed to ... > > > > > > > > > (*tls_dynamic_gnu2_lea_64_): This. Replace DI > > > > > > > > > with P. > > > > > > > > > Remove the {q} suffix from lea. > > > > > > > > > (*tls_dynamic_gnu2_call_64): Renamed to ... > > > > > > > > > (*tls_dynamic_gnu2_call_64_): This. Replace DI > > > > > > > > > with P. > > > > > > > > > (*tls_dynamic_gnu2_combine_64): Renamed to ... > > > > > > > > > (*tls_dynamic_gnu2_combine_64_): This. Replace > > > > > > > > > DI with P. > > > > > > > > > Pass Pmode to gen_tls_dynamic_gnu2_64. > > > > > > > > > > > > > > > > > > gcc/testsuite/ > > > > > > > > > > > > > > > > > > PR target/93319 > > > > > > > > > * gcc.target/i386/pr93319-1a.c: New test. > > > > > > > > > * gcc.target/i386/pr93319-1b.c: Likewise. > > > > > > > > > * gcc.target/i386/pr93319-1c.c: Likewise. > > > > > > > > > * gcc.target/i386/pr93319-1d.c: Likewise. > > > > > > > > > --- > > > > > > > > > gcc/config/i386/i386.c | 31 +++-- > > > > > > > > > gcc/config/i386/i386.md| 54 > > > > > > > > > +++--- > > > > > > > > >
Re: [PATCH] Make target_clones resolver fn static.
On Mon, Jan 20, 2020 at 5:36 AM Alexander Monakov wrote: > > > > On Mon, 20 Jan 2020, H.J. Lu wrote: > > We can that only if function is static: > > > [ship asm] > > > > In this case, foo must be global. > > H.J., can you rephrase more clearly? Your response seems contradictory and > does not help to explain the matter. > > Alexander For, --- __attribute__((target_clones("avx","default"))) int foo () { return -2; } foo's resolver must be global. For --- __attribute__((target_clones("avx","default"))) static int foo () { return -2; } --- foo's resolver must be static. -- H.J.
Re: [PATCH] Make target_clones resolver fn static.
On Mon, Jan 20, 2020 at 6:16 AM Alexander Monakov wrote: > > > > On Mon, 20 Jan 2020, H.J. Lu wrote: > > For, > > > > --- > > __attribute__((target_clones("avx","default"))) > > int > > foo () > > { > > return -2; > > } > > > > > > foo's resolver must be global. For > > > > --- > > __attribute__((target_clones("avx","default"))) > > static int > > foo () > > { > > return -2; > > } > > --- > > > > foo's resolver must be static. > > Bare IFUNC's don't seem to have this restriction. Why do we want to > constrain target clones this way? > foo's resolver acts as foo. It should have the same visibility as foo. -- H.J.
Re: [PATCH] Make target_clones resolver fn static.
On Mon, Jan 20, 2020 at 6:41 AM Alexander Monakov wrote: > > > > On Mon, 20 Jan 2020, H.J. Lu wrote: > > > > Bare IFUNC's don't seem to have this restriction. Why do we want to > > > constrain target clones this way? > > > > > > > foo's resolver acts as foo. It should have the same visibility as foo. > > What do you mean by that? From the implementation standpoint, there's > two symbols of different type with the same value. There's no problem > allowing one of them have local binding and the other have global binding. > > Is there something special about target clones that doesn't come into > play with ifuncs? > I stand corrected. Resolver should be static and it shouldn't be weak. -- H.J.
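A bare ifunc already behaves the way the thread converges on for target_clones: the resolver can be an ordinary static function while only the ifunc symbol itself is global. The following is a minimal sketch of the GNU ifunc attribute for comparison (my own illustration, not code from this thread or from Martin's patch), assuming a glibc/ELF target:

/* ifunc-sketch.c: build with e.g. "gcc -O2 -c ifunc-sketch.c".
   Only the ifunc symbol "foo" ends up global; the resolver stays local.  */

static int foo_avx (void)     { return 1; }
static int foo_default (void) { return 2; }

typedef int (*foo_fn) (void);

/* The resolver never needs to be visible outside this translation unit.  */
static foo_fn
resolve_foo (void)
{
  __builtin_cpu_init ();
  return __builtin_cpu_supports ("avx") ? foo_avx : foo_default;
}

/* Global symbol of type gnu_indirect_function, bound to the local resolver.  */
int foo (void) __attribute__ ((ifunc ("resolve_foo")));

readelf -s on the resulting object shows foo as a GLOBAL IFUNC symbol and resolve_foo as a LOCAL FUNC, which is the same layout the target_clones resolver can use once it is made static.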
Re: [PATCH] PR target/93319: x32: Add x32 support to -mtls-dialect=gnu2
On Mon, Jan 20, 2020 at 5:24 AM H.J. Lu wrote: > > On Sun, Jan 19, 2020 at 11:53 PM Uros Bizjak wrote: > > > > On Sun, Jan 19, 2020 at 10:00 PM H.J. Lu wrote: > > > > > > On Sun, Jan 19, 2020 at 12:16 PM Uros Bizjak wrote: > > > > > > > > On Sun, Jan 19, 2020 at 9:07 PM H.J. Lu wrote: > > > > > > > > > > On Sun, Jan 19, 2020 at 12:01 PM Uros Bizjak > > > > > wrote: > > > > > > > > > > > > On Sun, Jan 19, 2020 at 7:07 PM H.J. Lu wrote: > > > > > > > > > > > > > > On Sun, Jan 19, 2020 at 9:48 AM Uros Bizjak > > > > > > > wrote: > > > > > > > > > > > > > > > > On Sun, Jan 19, 2020 at 6:43 PM Uros Bizjak > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > On Sun, Jan 19, 2020 at 2:58 PM H.J. Lu > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > To add x32 support to -mtls-dialect=gnu2, we need to > > > > > > > > > > replace DI with > > > > > > > > > > P in GNU2 TLS patterns. Since thread pointer is in > > > > > > > > > > ptr_mode, PLUS in > > > > > > > > > > GNU2 TLS address computation must be done in ptr_mode to > > > > > > > > > > support > > > > > > > > > > -maddress-mode=long. Also drop the "q" suffix from lea to > > > > > > > > > > support > > > > > > > > > > both "lea foo@TLSDESC(%rip), %eax" and "foo@TLSDESC(%rip), > > > > > > > > > > %rax". > > > > > > > > > > > > > > > > > > Please use "lea%z0" instead. > > > > > > > > > > > > > > > > > > > Tested on Linux/x86-64. OK for master? > > > > > > > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > > > > H.J. > > > > > > > > > > --- > > > > > > > > > > gcc/ > > > > > > > > > > > > > > > > > > > > PR target/93319 > > > > > > > > > > * config/i386/i386.c (legitimize_tls_address): Pass > > > > > > > > > > Pmode to > > > > > > > > > > gen_tls_dynamic_gnu2_64. Compute GNU2 TLS address > > > > > > > > > > in ptr_mode. > > > > > > > > > > * config/i386/i386.md (tls_dynamic_gnu2_64): > > > > > > > > > > Renamed to ... > > > > > > > > > > (@tls_dynamic_gnu2_64_): This. Replace DI > > > > > > > > > > with P. > > > > > > > > > > (*tls_dynamic_gnu2_lea_64): Renamed to ... > > > > > > > > > > (*tls_dynamic_gnu2_lea_64_): This. Replace > > > > > > > > > > DI with P. > > > > > > > > > > Remove the {q} suffix from lea. > > > > > > > > > > (*tls_dynamic_gnu2_call_64): Renamed to ... > > > > > > > > > > (*tls_dynamic_gnu2_call_64_): This. Replace > > > > > > > > > > DI with P. > > > > > > > > > > (*tls_dynamic_gnu2_combine_64): Renamed to ... > > > > > > > > > > (*tls_dynamic_gnu2_combine_64_): This. > > > > > > > > > > Replace DI with P. > > > > > > > > > > Pass Pmode to gen_tls_dynamic_gnu2_64. > > > > > > > > > > > > > > > > > > > > gcc/testsuite/ > > > > > > > > > > > > > > > > > > > > PR target/93319 > > > > > > > > > > * gcc.target/i386/pr93319-1a.c: New test. > > > > > > > > > > * gcc.target/i386/pr93319-1b.c: Likewise. > > > > > > > > > > * gcc.target/i386/pr93319-1c.c: Likewise. > > > > > > > > > >
Re: [PATCH] PR target/93319: x32: Add x32 support to -mtls-dialect=gnu2
On Tue, Jan 21, 2020 at 2:29 AM Uros Bizjak wrote: > > On Tue, Jan 21, 2020 at 9:47 AM Uros Bizjak wrote: > > > > On Mon, Jan 20, 2020 at 10:46 PM H.J. Lu wrote: > > > > > > > OK. Let's go with this version, but please investigate if we need to > > > > > calculate TLS address in ptr_mode instead of Pmode. Due to quite some > > > > > zero-extension from ptr_mode to Pmode hacks in this area, it looks to > > > > > me that the whole calculation should be performed in ptr_mode (SImode > > > > > in case of x32), and the result zero-extended to Pmode in case when > > > > > Pmode = DImode. > > > > > > > > > > > > > I checked it in. I will investigate if we can use ptr_mode for TLS. > > > > > > Here is a patch to perform GNU2 TLS address computation in ptr_mode > > > and zero-extend result to Pmode. > > > > case TLS_MODEL_GLOBAL_DYNAMIC: > > - dest = gen_reg_rtx (Pmode); > > + dest = gen_reg_rtx (TARGET_GNU2_TLS ? ptr_mode : Pmode); > > > > Please put these in their respective arms of "if (TARGET_GNU2_TLS). > > > > case TLS_MODEL_LOCAL_DYNAMIC: > > - base = gen_reg_rtx (Pmode); > > + base = gen_reg_rtx (TARGET_GNU2_TLS ? ptr_mode : Pmode); > > > > Also here. > > > > A question: Do we need to emit the following part in Pmode? > > To answer my own question: Yes. Linker doesn't like SImode relocs fox > x86_64 and for > > addl$foo@dtpoff, %eax > > errors out with: > > pr93319-1a.s: Assembler messages: > pr93319-1a.s:20: Error: relocated field and relocation type differ in > signedness > > So, the part below is OK, except: > > - tp = get_thread_pointer (Pmode, true); > - set_unique_reg_note (get_last_insn (), REG_EQUAL, > - gen_rtx_MINUS (Pmode, tmp, tp)); > + tp = get_thread_pointer (ptr_mode, true); > + tmp = gen_rtx_MINUS (ptr_mode, tmp, tp); > + if (GET_MODE (tmp) != Pmode) > +tmp = gen_rtx_ZERO_EXTEND (Pmode, tmp); > + set_unique_reg_note (get_last_insn (), REG_EQUAL, tmp); > > I don't think we should attach this note to the thread pointer > initialization. I have removed this part from the patch, but please > review the decision. > > and > > -dest = gen_rtx_PLUS (Pmode, tp, dest); > +dest = gen_rtx_PLUS (ptr_mode, tp, dest); > > Please leave Pmode here. ptr_mode == Pmode at this point, but Pmode > better documents the mode selection logic. > > Also, the tests fail for me with: > > /usr/include/gnu/stubs.h:13:11: fatal error: gnu/stubs-x32.h: No such > file or directory > > so either use __builtin_printf or some other approach that doesn't > need to #include stdio.h. > > A patch that implements above changes is attached to the message. > Here is the updated patch. OK for master? Thanks. -- H.J. From 01b20630518882fa3952962b26bfbb2465e08036 Mon Sep 17 00:00:00 2001 From: "H.J. Lu" Date: Mon, 20 Jan 2020 13:30:04 -0800 Subject: [PATCH] i386: Do GNU2 TLS address computation in ptr_mode Since GNU2 TLS address from glibc run-time is in ptr_mode, we should do GNU2 TLS address computation in ptr_mode and zero-extend result to Pmode. 2020-01-21 H.J. Lu Uros Bizjak gcc/ PR target/93319 * config/i386/i386.c (ix86_tls_module_base): Replace Pmode with ptr_mode. (legitimize_tls_address): Do GNU2 TLS address computation in ptr_mode and zero-extend result to Pmode. * config/i386/i386.md (@tls_dynamic_gnu2_64_): Replace :P with :PTR and Pmode with ptr_mode. (*tls_dynamic_gnu2_lea_64_): Likewise. (*tls_dynamic_gnu2_call_64_): Likewise. (*tls_dynamic_gnu2_combine_64_): Likewise. gcc/testsuite/ 2020-01-21 Uros Bizjak PR target/93319 * gcc.target/i386/pr93319-1a.c: Don't include . 
(test1): Replace printf with __builtin_printf. --- gcc/config/i386/i386.c | 43 --- gcc/config/i386/i386.md| 48 +++--- gcc/testsuite/gcc.target/i386/pr93319-1a.c | 6 +-- 3 files changed, 42 insertions(+), 55 deletions(-) diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 0b8a4b9ee4f..ffe60baa72a 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -10717,7 +10717,7 @@ ix86_tls_module_base (void) if (!ix86_tls_module_base_symbol) { ix86_tls_module_base_symbol - = gen_rtx_SYMBOL_REF (Pmode, "_TLS_MODULE_BASE_"); + = gen_rtx_SYMBOL_REF (ptr_mode, "_TLS_MODULE_BASE_"); SYMBOL_REF_FLAGS (ix86_tls_module_base_symbol) |= TLS_MODEL_GLOBAL_DYNAMIC << SYMBOL_FLAG_TLS_SHIFT; @@ -10748,8 +10748,6 @@ legitimize_tls_address (r
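For anyone reproducing this outside the pr93319 tests, the code path under discussion is an ordinary TLS access built for x32 with the GNU2 dialect. A minimal example of my own (not one of the testcases from the patch), assuming an x86-64 compiler with x32 support enabled:

/* tls-sketch.c
   gcc -O2 -mx32 -maddress-mode=long -mtls-dialect=gnu2 -S tls-sketch.c
   The TLSDESC call hands back the offset in ptr_mode (SImode on x32), so
   the add of the thread pointer is also done in ptr_mode and the result
   is zero-extended to Pmode (DImode under -maddress-mode=long).  */

__thread int tls_counter;

int
bump_tls (void)
{
  return ++tls_counter;
}

Having no #include keeps the example independent of whether an x32 glibc (gnu/stubs-x32.h) is installed, which was the failure Uros ran into with the original tests.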
[PATCH] i386: Don't use ix86_tune_ctrl_string in parse_mtune_ctrl_str
There are static void parse_mtune_ctrl_str (bool dump) { if (!ix86_tune_ctrl_string) return; parse_mtune_ctrl_str is only called from set_ix86_tune_features, which is only called from ix86_function_specific_restore and ix86_option_override_internal. parse_mtune_ctrl_str shouldn't use ix86_tune_ctrl_string which is defined with global_options. Instead, opts should be passed to parse_mtune_ctrl_str. PR target/91399 * config/i386/i386-options.c (set_ix86_tune_features): Add an argument of a pointer to struct gcc_options and pass it to parse_mtune_ctrl_str. (ix86_function_specific_restore): Pass opts to set_ix86_tune_features. (ix86_option_override_internal): Likewise. (parse_mtune_ctrl_str): Add an argument of a pointer to struct gcc_options and use it for x_ix86_tune_ctrl_string. --- gcc/config/i386/i386-options.c | 18 ++ 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c index 2acc9fb0cfe..e0be4932534 100644 --- a/gcc/config/i386/i386-options.c +++ b/gcc/config/i386/i386-options.c @@ -741,7 +741,8 @@ ix86_option_override_internal (bool main_args_p, struct gcc_options *opts, struct gcc_options *opts_set); static void -set_ix86_tune_features (enum processor_type ix86_tune, bool dump); +set_ix86_tune_features (struct gcc_options *opts, + enum processor_type ix86_tune, bool dump); /* Restore the current options */ @@ -810,7 +811,7 @@ ix86_function_specific_restore (struct gcc_options *opts, /* Recreate the tune optimization tests */ if (old_tune != ix86_tune) -set_ix86_tune_features (ix86_tune, false); +set_ix86_tune_features (opts, ix86_tune, false); } /* Adjust target options after streaming them in. This is mainly about @@ -1538,13 +1539,13 @@ ix86_parse_stringop_strategy_string (char *strategy_str, bool is_memset) print the features that are explicitly set. */ static void -parse_mtune_ctrl_str (bool dump) +parse_mtune_ctrl_str (struct gcc_options *opts, bool dump) { - if (!ix86_tune_ctrl_string) + if (!opts->x_ix86_tune_ctrl_string) return; char *next_feature_string = NULL; - char *curr_feature_string = xstrdup (ix86_tune_ctrl_string); + char *curr_feature_string = xstrdup (opts->x_ix86_tune_ctrl_string); char *orig = curr_feature_string; int i; do @@ -1583,7 +1584,8 @@ parse_mtune_ctrl_str (bool dump) processor type. */ static void -set_ix86_tune_features (enum processor_type ix86_tune, bool dump) +set_ix86_tune_features (struct gcc_options *opts, + enum processor_type ix86_tune, bool dump) { unsigned HOST_WIDE_INT ix86_tune_mask = HOST_WIDE_INT_1U << ix86_tune; int i; @@ -1605,7 +1607,7 @@ set_ix86_tune_features (enum processor_type ix86_tune, bool dump) ix86_tune_features[i] ? "on" : "off"); } - parse_mtune_ctrl_str (dump); + parse_mtune_ctrl_str (opts, dump); } @@ -2364,7 +2366,7 @@ ix86_option_override_internal (bool main_args_p, XDELETEVEC (s); } - set_ix86_tune_features (ix86_tune, opts->x_ix86_dump_tunes); + set_ix86_tune_features (opts, ix86_tune, opts->x_ix86_dump_tunes); ix86_recompute_optlev_based_flags (opts, opts_set); -- 2.24.1
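For context, the string being parsed comes from -mtune-ctrl=, and the same parsing has to give the right answer when target options are swapped per function, which is why reading the global ix86_tune_ctrl_string is wrong. A hypothetical usage sketch (my own, not part of the patch; the feature names are the strings from the DEF_TUNE entries in x86-tune.def):

/* tune-sketch.c
   gcc -O2 -mtune-ctrl=^sse_typeless_stores -S tune-sketch.c
   ("^" turns the named tuning feature off.)  */

__attribute__ ((target ("arch=skylake-avx512")))
int
hot_path (int x)
{
  /* Switching to this function's target options goes through
     ix86_function_specific_restore -> set_ix86_tune_features, which is
     where the passed-in opts rather than global_options matters.  */
  return x * 3;
}

int
cold_path (int x)
{
  return x + 3;
}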
[PATCH] i386: Disable TARGET_SSE_TYPELESS_STORES for TARGET_AVX
movaps/movups is one byte shorter than movdaq/movdqu. But it isn't the case for AVX nor AVX512. We should disable TARGET_SSE_TYPELESS_STORES for TARGET_AVX. gcc/ PR target/91461 * config/i386/i386.h (TARGET_SSE_TYPELESS_STORES): Disable for TARGET_AVX. * config/i386/i386.md (*movoi_internal_avx): Remove TARGET_SSE_TYPELESS_STORES check. gcc/testsuite/ PR target/91461 * gcc.target/i386/pr91461-1.c: New test. * gcc.target/i386/pr91461-2.c: Likewise. * gcc.target/i386/pr91461-3.c: Likewise. * gcc.target/i386/pr91461-4.c: Likewise. * gcc.target/i386/pr91461-5.c: Likewise. --- gcc/config/i386/i386.h| 4 +- gcc/config/i386/i386.md | 4 +- gcc/testsuite/gcc.target/i386/pr91461-1.c | 66 gcc/testsuite/gcc.target/i386/pr91461-2.c | 19 ++ gcc/testsuite/gcc.target/i386/pr91461-3.c | 76 +++ gcc/testsuite/gcc.target/i386/pr91461-4.c | 21 +++ gcc/testsuite/gcc.target/i386/pr91461-5.c | 17 + 7 files changed, 203 insertions(+), 4 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-1.c create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-2.c create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-3.c create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-4.c create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-5.c diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h index 943e9a5c783..c134b04c5c4 100644 --- a/gcc/config/i386/i386.h +++ b/gcc/config/i386/i386.h @@ -516,8 +516,10 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST]; #define TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL \ ix86_tune_features[X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL] #define TARGET_SSE_SPLIT_REGS ix86_tune_features[X86_TUNE_SSE_SPLIT_REGS] +/* NB: movaps/movups is one byte shorter than movdaq/movdqu. But it + isn't the case for AVX nor AVX512. */ #define TARGET_SSE_TYPELESS_STORES \ - ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES] + (!TARGET_AVX && ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES]) #define TARGET_SSE_LOAD0_BY_PXOR ix86_tune_features[X86_TUNE_SSE_LOAD0_BY_PXOR] #define TARGET_MEMORY_MISMATCH_STALL \ ix86_tune_features[X86_TUNE_MEMORY_MISMATCH_STALL] diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 6e9c9bd2fb6..bb096133880 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -1980,9 +1980,7 @@ (and (eq_attr "alternative" "1") (match_test "TARGET_AVX512VL")) (const_string "XI") - (ior (match_test "TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL") - (and (eq_attr "alternative" "3") -(match_test "TARGET_SSE_TYPELESS_STORES"))) + (match_test "TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL") (const_string "V8SF") ] (const_string "OI")))]) diff --git a/gcc/testsuite/gcc.target/i386/pr91461-1.c b/gcc/testsuite/gcc.target/i386/pr91461-1.c new file mode 100644 index 000..0c94b8e2b76 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr91461-1.c @@ -0,0 +1,66 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mavx" } */ +/* { dg-final { scan-assembler "\tvmovdqa\t" } } */ +/* { dg-final { scan-assembler "\tvmovdqu\t" } } */ +/* { dg-final { scan-assembler "\tvmovapd\t" } } */ +/* { dg-final { scan-assembler "\tvmovupd\t" } } */ +/* { dg-final { scan-assembler-not "\tvmovaps\t" } } */ +/* { dg-final { scan-assembler-not "\tvmovups\t" } } */ + +#include + +void +foo1 (__m128i *p, __m128i x) +{ + *p = x; +} + +void +foo2 (__m128d *p, __m128d x) +{ + *p = x; +} + +void +foo3 (__float128 *p, __float128 x) +{ + *p = x; +} + +void +foo4 (__m128i_u *p, __m128i x) +{ + *p = x; +} + +void +foo5 (__m128d_u *p, __m128d x) +{ + *p = x; +} + +typedef __float128 __float128_u __attribute__ 
((__aligned__ (1))); + +void +foo6 (__float128_u *p, __float128 x) +{ + *p = x; +} + +#ifdef __x86_64__ +typedef __int128 __int128_u __attribute__ ((__aligned__ (1))); + +extern __int128 int128; + +void +foo7 (__int128 *p) +{ + *p = int128; +} + +void +foo8 (__int128_u *p) +{ + *p = int128; +} +#endif diff --git a/gcc/testsuite/gcc.target/i386/pr91461-2.c b/gcc/testsuite/gcc.target/i386/pr91461-2.c new file mode 100644 index 000..921cfaf9780 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr91461-2.c @@ -0,0 +1,19 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mavx" } */ +/* { dg-final { scan-assembler "\tvmovdqa\t" } } */ +/* { dg-final { scan-assembler "\tvmovapd\t" } } */ +/* { dg-final { scan-assembler-not "\tvmovaps\t" } } */ + +#include + +void +foo1 (__m256i *p, __m256i x) +{ + *p = x; +} + +void +foo2 (__m256d *p, __m256d x) +{ + *p = x; +} diff --git a/gcc/testsuite/gcc.target/i386/pr91461-3.c b/gcc/testsuite/gcc.target/i386/pr91461-3.c new file mode 100644 index 000..c67a4
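Put differently, the byte saving only exists for the legacy SSE encodings. The pr91461-*.c tests above are the real coverage; the stand-alone store below is just for orientation (my own example), showing the case the tuning used to rewrite:

/* store-sketch.c
   gcc -O2 -msse2 -S store-sketch.c  -> movaps under generic tuning, because
                                        X86_TUNE_SSE_TYPELESS_STORES types the
                                        integer store as packed single and
                                        saves a byte over movdqa
   gcc -O2 -mavx  -S store-sketch.c  -> vmovdqa once this patch is applied,
                                        since vmovaps would be no shorter  */
#include <emmintrin.h>

void
store128 (__m128i *p, __m128i x)
{
  _mm_store_si128 (p, x);
}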
PING^5: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move
On Mon, Jul 8, 2019 at 8:19 AM H.J. Lu wrote: > > On Tue, Jun 18, 2019 at 8:59 AM H.J. Lu wrote: > > > > On Fri, May 31, 2019 at 10:38 AM H.J. Lu wrote: > > > > > > On Tue, May 21, 2019 at 2:43 PM H.J. Lu wrote: > > > > > > > > On Fri, Feb 22, 2019 at 8:25 AM H.J. Lu wrote: > > > > > > > > > > Hi Jan, Uros, > > > > > > > > > > This patch fixes the wrong code bug: > > > > > > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89229 > > > > > > > > > > Tested on AVX2 and AVX512 with and without --with-arch=native. > > > > > > > > > > OK for trunk? > > > > > > > > > > Thanks. > > > > > > > > > > H.J. > > > > > -- > > > > > i386 backend has > > > > > > > > > > INT_MODE (OI, 32); > > > > > INT_MODE (XI, 64); > > > > > > > > > > So, XI_MODE represents 64 INTEGER bytes = 64 * 8 = 512 bit operation, > > > > > in case of const_1, all 512 bits set. > > > > > > > > > > We can load zeros with narrower instruction, (e.g. 256 bit by inherent > > > > > zeroing of highpart in case of 128 bit xor), so TImode in this case. > > > > > > > > > > Some targets prefer V4SF mode, so they will emit float xorps for > > > > > zeroing. > > > > > > > > > > sse.md has > > > > > > > > > > (define_insn "mov_internal" > > > > > [(set (match_operand:VMOVE 0 "nonimmediate_operand" > > > > > "=v,v ,v ,m") > > > > > (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand" > > > > > " C,BC,vm,v"))] > > > > > > > > > > /* There is no evex-encoded vmov* for sizes smaller than > > > > > 64-bytes > > > > > in avx512f, so we need to use workarounds, to access sse > > > > > registers > > > > > 16-31, which are evex-only. In avx512vl we don't need > > > > > workarounds. */ > > > > > if (TARGET_AVX512F && < 64 && !TARGET_AVX512VL > > > > > && (EXT_REX_SSE_REG_P (operands[0]) > > > > > || EXT_REX_SSE_REG_P (operands[1]))) > > > > > { > > > > > if (memory_operand (operands[0], mode)) > > > > > { > > > > > if ( == 32) > > > > > return "vextract64x4\t{$0x0, %g1, %0|%0, > > > > > %g1, 0x0}"; > > > > > else if ( == 16) > > > > > return "vextract32x4\t{$0x0, %g1, %0|%0, > > > > > %g1, 0x0}"; > > > > > else > > > > > gcc_unreachable (); > > > > > } > > > > > ... > > > > > > > > > > However, since ix86_hard_regno_mode_ok has > > > > > > > > > > /* TODO check for QI/HI scalars. */ > > > > > /* AVX512VL allows sse regs16+ for 128/256 bit modes. */ > > > > > if (TARGET_AVX512VL > > > > > && (mode == OImode > > > > > || mode == TImode > > > > > || VALID_AVX256_REG_MODE (mode) > > > > > || VALID_AVX512VL_128_REG_MODE (mode))) > > > > > return true; > > > > > > > > > > /* xmm16-xmm31 are only available for AVX-512. */ > > > > > if (EXT_REX_SSE_REGNO_P (regno)) > > > > > return false; > > > > > > > > > > if (TARGET_AVX512F && < 64 && !TARGET_AVX512VL > > > > > && (EXT_REX_SSE_REG_P (operands[0]) > > > > > || EXT_REX_SSE_REG_P (operands[1]))) > > > > > > > > > > is a dead code. > > > > > > > > > > Also for > > > > > > > > > > long long *p; > > > > > volatile __m256i yy; > > > > > > > > > > void > > > > > foo (void) > > > > > { > > > > >_mm256_store_epi64 (p, yy); > > > > &
Re: [PATCH] i386: Disable TARGET_SSE_TYPELESS_STORES for TARGET_AVX
On Mon, Jan 27, 2020 at 12:26 PM Uros Bizjak wrote: > > On Mon, Jan 27, 2020 at 7:23 PM H.J. Lu wrote: > > > > movaps/movups is one byte shorter than movdaq/movdqu. But it isn't the > > case for AVX nor AVX512. We should disable TARGET_SSE_TYPELESS_STORES > > for TARGET_AVX. > > > > gcc/ > > > > PR target/91461 > > * config/i386/i386.h (TARGET_SSE_TYPELESS_STORES): Disable for > > TARGET_AVX. > > * config/i386/i386.md (*movoi_internal_avx): Remove > > TARGET_SSE_TYPELESS_STORES check. > > > > gcc/testsuite/ > > > > PR target/91461 > > * gcc.target/i386/pr91461-1.c: New test. > > * gcc.target/i386/pr91461-2.c: Likewise. > > * gcc.target/i386/pr91461-3.c: Likewise. > > * gcc.target/i386/pr91461-4.c: Likewise. > > * gcc.target/i386/pr91461-5.c: Likewise. > > --- > > gcc/config/i386/i386.h| 4 +- > > gcc/config/i386/i386.md | 4 +- > > gcc/testsuite/gcc.target/i386/pr91461-1.c | 66 > > gcc/testsuite/gcc.target/i386/pr91461-2.c | 19 ++ > > gcc/testsuite/gcc.target/i386/pr91461-3.c | 76 +++ > > gcc/testsuite/gcc.target/i386/pr91461-4.c | 21 +++ > > gcc/testsuite/gcc.target/i386/pr91461-5.c | 17 + > > 7 files changed, 203 insertions(+), 4 deletions(-) > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-1.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-2.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-3.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-4.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-5.c > > > > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h > > index 943e9a5c783..c134b04c5c4 100644 > > --- a/gcc/config/i386/i386.h > > +++ b/gcc/config/i386/i386.h > > @@ -516,8 +516,10 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST]; > > #define TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL \ > > ix86_tune_features[X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL] > > #define TARGET_SSE_SPLIT_REGS ix86_tune_features[X86_TUNE_SSE_SPLIT_REGS] > > +/* NB: movaps/movups is one byte shorter than movdaq/movdqu. But it > > + isn't the case for AVX nor AVX512. */ > > #define TARGET_SSE_TYPELESS_STORES \ > > - ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES] > > + (!TARGET_AVX && ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES]) > > This is wrong place to disable the feature. Like this? diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c index 2acc9fb0cfe..639969d736d 100644 --- a/gcc/config/i386/i386-options.c +++ b/gcc/config/i386/i386-options.c @@ -1597,6 +1597,11 @@ set_ix86_tune_features (enum processor_type ix86_tune, bool dump) = !!(initial_ix86_tune_features[i] & ix86_tune_mask); } + /* NB: movaps/movups is one byte shorter than movdaq/movdqu. But it + isn't the case for AVX nor AVX512. */ + if (TARGET_AVX) +ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES] = 0; + if (dump) { fprintf (stderr, "List of x86 specific tuning parameter names:\n"); -- H.J.
Re: [PATCH] i386: Disable TARGET_SSE_TYPELESS_STORES for TARGET_AVX
On Mon, Jan 27, 2020 at 2:17 PM H.J. Lu wrote: > > On Mon, Jan 27, 2020 at 12:26 PM Uros Bizjak wrote: > > > > On Mon, Jan 27, 2020 at 7:23 PM H.J. Lu wrote: > > > > > > movaps/movups is one byte shorter than movdaq/movdqu. But it isn't the > > > case for AVX nor AVX512. We should disable TARGET_SSE_TYPELESS_STORES > > > for TARGET_AVX. > > > > > > gcc/ > > > > > > PR target/91461 > > > * config/i386/i386.h (TARGET_SSE_TYPELESS_STORES): Disable for > > > TARGET_AVX. > > > * config/i386/i386.md (*movoi_internal_avx): Remove > > > TARGET_SSE_TYPELESS_STORES check. > > > > > > gcc/testsuite/ > > > > > > PR target/91461 > > > * gcc.target/i386/pr91461-1.c: New test. > > > * gcc.target/i386/pr91461-2.c: Likewise. > > > * gcc.target/i386/pr91461-3.c: Likewise. > > > * gcc.target/i386/pr91461-4.c: Likewise. > > > * gcc.target/i386/pr91461-5.c: Likewise. > > > --- > > > gcc/config/i386/i386.h| 4 +- > > > gcc/config/i386/i386.md | 4 +- > > > gcc/testsuite/gcc.target/i386/pr91461-1.c | 66 > > > gcc/testsuite/gcc.target/i386/pr91461-2.c | 19 ++ > > > gcc/testsuite/gcc.target/i386/pr91461-3.c | 76 +++ > > > gcc/testsuite/gcc.target/i386/pr91461-4.c | 21 +++ > > > gcc/testsuite/gcc.target/i386/pr91461-5.c | 17 + > > > 7 files changed, 203 insertions(+), 4 deletions(-) > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-1.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-2.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-3.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-4.c > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-5.c > > > > > > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h > > > index 943e9a5c783..c134b04c5c4 100644 > > > --- a/gcc/config/i386/i386.h > > > +++ b/gcc/config/i386/i386.h > > > @@ -516,8 +516,10 @@ extern unsigned char > > > ix86_tune_features[X86_TUNE_LAST]; > > > #define TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL \ > > > ix86_tune_features[X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL] > > > #define TARGET_SSE_SPLIT_REGS > > > ix86_tune_features[X86_TUNE_SSE_SPLIT_REGS] > > > +/* NB: movaps/movups is one byte shorter than movdaq/movdqu. But it > > > + isn't the case for AVX nor AVX512. */ > > > #define TARGET_SSE_TYPELESS_STORES \ > > > - ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES] > > > + (!TARGET_AVX && ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES]) > > > > This is wrong place to disable the feature. > Here is the updated patch on top of https://gcc.gnu.org/ml/gcc-patches/2020-01/msg01742.html so that set_ix86_tune_features can access per function setting. OK for master branch? Thanks. -- H.J. From 61482a7d4dff07075f2534840040bafa420e9f36 Mon Sep 17 00:00:00 2001 From: "H.J. Lu" Date: Mon, 27 Jan 2020 09:35:11 -0800 Subject: [PATCH] i386: Disable TARGET_SSE_TYPELESS_STORES for TARGET_AVX movaps/movups is one byte shorter than movdaq/movdqu. But it isn't the case for AVX nor AVX512. We should disable TARGET_SSE_TYPELESS_STORES for TARGET_AVX and adjust vmovups checks in assembly ouputs. gcc/ PR target/91461 * config/i386/i386-options.c (set_ix86_tune_features): Disable TARGET_SSE_TYPELESS_STORES for TARGET_AVX. * config/i386/i386.md (*movoi_internal_avx): Remove TARGET_SSE_TYPELESS_STORES check. gcc/testsuite/ PR target/91461 * gcc.target/i386/avx256-unaligned-store-3.c: Don't check vmovups. * gcc.target/i386/pieces-memcpy-4.c: Likewise. * gcc.target/i386/pieces-memcpy-5.c: Likewise. * gcc.target/i386/pieces-memcpy-6.c: Likewise. * gcc.target/i386/pieces-strcpy-2.c: Likewise. 
* gcc.target/i386/pr90980-1.c: Likewise. * gcc.target/i386/pr87317-4.c: Check "\tvmovd\t" instead of "vmovd" to avoid matching "vmovdqu". * gcc.target/i386/pr87317-5.c: Likewise. * gcc.target/i386/pr87317-7.c: Likewise. * gcc.target/i386/pr91461-1.c: New test. * gcc.target/i386/pr91461-2.c: Likewise. * gcc.target/i386/pr91461-3.c: Likewise. * gcc.target/i386/pr91461-4.c: Likewise. * gcc.target/i386/pr91461-5.c: Likewise. * gcc.target/i386/pr91461-6.c: Likewise. --- gcc/config/i386/i386-options.c| 5 ++ gcc/config/i386/i386.md | 4 +- .../i386/avx256-unaligned-store-3.c | 4 +-
Re: [PATCH] i386: Disable TARGET_SSE_TYPELESS_STORES for TARGET_AVX
On Mon, Jan 27, 2020 at 11:04 PM Uros Bizjak wrote: > > On Mon, Jan 27, 2020 at 11:17 PM H.J. Lu wrote: > > > > On Mon, Jan 27, 2020 at 12:26 PM Uros Bizjak wrote: > > > > > > On Mon, Jan 27, 2020 at 7:23 PM H.J. Lu wrote: > > > > > > > > movaps/movups is one byte shorter than movdaq/movdqu. But it isn't the > > > > case for AVX nor AVX512. We should disable TARGET_SSE_TYPELESS_STORES > > > > for TARGET_AVX. > > > > > > > > gcc/ > > > > > > > > PR target/91461 > > > > * config/i386/i386.h (TARGET_SSE_TYPELESS_STORES): Disable for > > > > TARGET_AVX. > > > > * config/i386/i386.md (*movoi_internal_avx): Remove > > > > TARGET_SSE_TYPELESS_STORES check. > > > > > > > > gcc/testsuite/ > > > > > > > > PR target/91461 > > > > * gcc.target/i386/pr91461-1.c: New test. > > > > * gcc.target/i386/pr91461-2.c: Likewise. > > > > * gcc.target/i386/pr91461-3.c: Likewise. > > > > * gcc.target/i386/pr91461-4.c: Likewise. > > > > * gcc.target/i386/pr91461-5.c: Likewise. > > > > --- > > > > gcc/config/i386/i386.h| 4 +- > > > > gcc/config/i386/i386.md | 4 +- > > > > gcc/testsuite/gcc.target/i386/pr91461-1.c | 66 > > > > gcc/testsuite/gcc.target/i386/pr91461-2.c | 19 ++ > > > > gcc/testsuite/gcc.target/i386/pr91461-3.c | 76 +++ > > > > gcc/testsuite/gcc.target/i386/pr91461-4.c | 21 +++ > > > > gcc/testsuite/gcc.target/i386/pr91461-5.c | 17 + > > > > 7 files changed, 203 insertions(+), 4 deletions(-) > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-1.c > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-2.c > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-3.c > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-4.c > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-5.c > > > > > > > > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h > > > > index 943e9a5c783..c134b04c5c4 100644 > > > > --- a/gcc/config/i386/i386.h > > > > +++ b/gcc/config/i386/i386.h > > > > @@ -516,8 +516,10 @@ extern unsigned char > > > > ix86_tune_features[X86_TUNE_LAST]; > > > > #define TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL \ > > > > ix86_tune_features[X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL] > > > > #define TARGET_SSE_SPLIT_REGS > > > > ix86_tune_features[X86_TUNE_SSE_SPLIT_REGS] > > > > +/* NB: movaps/movups is one byte shorter than movdaq/movdqu. But it > > > > + isn't the case for AVX nor AVX512. */ > > > > #define TARGET_SSE_TYPELESS_STORES \ > > > > - ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES] > > > > + (!TARGET_AVX && > > > > ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES]) > > > > > > This is wrong place to disable the feature. > > > > Like this? > > No. > > There is a mode attribute in i386.md/sse.md for relevant patterns. > Please adapt calculation of mode attributes instead. > Like this? -- H.J. From 1ba0c9ce5f764b8faa8c66b70e676af187a57415 Mon Sep 17 00:00:00 2001 From: "H.J. Lu" Date: Mon, 27 Jan 2020 09:35:11 -0800 Subject: [PATCH] i386: Disable TARGET_SSE_TYPELESS_STORES for TARGET_AVX movaps/movups is one byte shorter than movdaq/movdqu. But it isn't the case for AVX nor AVX512. We should disable TARGET_SSE_TYPELESS_STORES for TARGET_AVX. gcc/ PR target/91461 * config/i386/i386.md (*movoi_internal_avx): Remove TARGET_SSE_TYPELESS_STORES check. (*movti_internal): Disable TARGET_SSE_TYPELESS_STORES for TARGET_AVX. * config/i386/sse.md (mov_internal): Likewise. gcc/testsuite/ PR target/91461 * gcc.target/i386/pr91461-1.c: New test. * gcc.target/i386/pr91461-2.c: Likewise. * gcc.target/i386/pr91461-3.c: Likewise. 
* gcc.target/i386/pr91461-4.c: Likewise. * gcc.target/i386/pr91461-5.c: Likewise. --- gcc/config/i386/i386.md | 8 +-- gcc/config/i386/sse.md| 2 +- gcc/testsuite/gcc.target/i386/pr91461-1.c | 66 gcc/testsuite/gcc.target/i386/pr91461-2.c | 19 ++ gcc/testsuite/gcc.target/i386/pr91461-3.c | 76 +++ gcc/testsuite/gcc.target/i386/pr91461-4.c | 21 +++ gcc/testsuite/gcc.targe
Re: [PATCH] i386: Disable TARGET_SSE_TYPELESS_STORES for TARGET_AVX
On Tue, Jan 28, 2020 at 6:45 AM Uros Bizjak wrote: > > On Tue, Jan 28, 2020 at 3:32 PM H.J. Lu wrote: > > > > On Mon, Jan 27, 2020 at 11:04 PM Uros Bizjak wrote: > > > > > > On Mon, Jan 27, 2020 at 11:17 PM H.J. Lu wrote: > > > > > > > > On Mon, Jan 27, 2020 at 12:26 PM Uros Bizjak wrote: > > > > > > > > > > On Mon, Jan 27, 2020 at 7:23 PM H.J. Lu wrote: > > > > > > > > > > > > movaps/movups is one byte shorter than movdaq/movdqu. But it isn't > > > > > > the > > > > > > case for AVX nor AVX512. We should disable > > > > > > TARGET_SSE_TYPELESS_STORES > > > > > > for TARGET_AVX. > > > > > > > > > > > > gcc/ > > > > > > > > > > > > PR target/91461 > > > > > > * config/i386/i386.h (TARGET_SSE_TYPELESS_STORES): Disable > > > > > > for > > > > > > TARGET_AVX. > > > > > > * config/i386/i386.md (*movoi_internal_avx): Remove > > > > > > TARGET_SSE_TYPELESS_STORES check. > > > > > > > > > > > > gcc/testsuite/ > > > > > > > > > > > > PR target/91461 > > > > > > * gcc.target/i386/pr91461-1.c: New test. > > > > > > * gcc.target/i386/pr91461-2.c: Likewise. > > > > > > * gcc.target/i386/pr91461-3.c: Likewise. > > > > > > * gcc.target/i386/pr91461-4.c: Likewise. > > > > > > * gcc.target/i386/pr91461-5.c: Likewise. > > > > > > --- > > > > > > gcc/config/i386/i386.h| 4 +- > > > > > > gcc/config/i386/i386.md | 4 +- > > > > > > gcc/testsuite/gcc.target/i386/pr91461-1.c | 66 > > > > > > gcc/testsuite/gcc.target/i386/pr91461-2.c | 19 ++ > > > > > > gcc/testsuite/gcc.target/i386/pr91461-3.c | 76 > > > > > > +++ > > > > > > gcc/testsuite/gcc.target/i386/pr91461-4.c | 21 +++ > > > > > > gcc/testsuite/gcc.target/i386/pr91461-5.c | 17 + > > > > > > 7 files changed, 203 insertions(+), 4 deletions(-) > > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-1.c > > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-2.c > > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-3.c > > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-4.c > > > > > > create mode 100644 gcc/testsuite/gcc.target/i386/pr91461-5.c > > > > > > > > > > > > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h > > > > > > index 943e9a5c783..c134b04c5c4 100644 > > > > > > --- a/gcc/config/i386/i386.h > > > > > > +++ b/gcc/config/i386/i386.h > > > > > > @@ -516,8 +516,10 @@ extern unsigned char > > > > > > ix86_tune_features[X86_TUNE_LAST]; > > > > > > #define TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL \ > > > > > > ix86_tune_features[X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL] > > > > > > #define TARGET_SSE_SPLIT_REGS > > > > > > ix86_tune_features[X86_TUNE_SSE_SPLIT_REGS] > > > > > > +/* NB: movaps/movups is one byte shorter than movdaq/movdqu. But > > > > > > it > > > > > > + isn't the case for AVX nor AVX512. */ > > > > > > #define TARGET_SSE_TYPELESS_STORES \ > > > > > > - ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES] > > > > > > + (!TARGET_AVX && > > > > > > ix86_tune_features[X86_TUNE_SSE_TYPELESS_STORES]) > > > > > > > > > > This is wrong place to disable the feature. > > > > > > > > Like this? > > > > > > No. > > > > > > There is a mode attribute in i386.md/sse.md for relevant patterns. > > > Please adapt calculation of mode attributes instead. > > > > > > > Like this? > > Still no. > > You could move > > (match_test "TARGET_AVX") > (const_string "TI") > > up to bypass the cases below. > I don't think we can do that. There are 2 cases where we prefer movaps/movups: /* Use packed single precision instructions where posisble. I.e. movups instead of movupd. 
*/ DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL, "sse_packed_single_insn_optimal", m_BDVER | m_ZNVER) /* X86_TUNE_SSE_TYPELESS_STORES: Always movaps/movups for 128bit stores. */ DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores", m_AMD_MULTIPLE | m_CORE_ALL | m_GENERIC) We should always use movaps/movups for TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL. It is wrong to bypass TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL with TARGET_AVX as m_BDVER | m_ZNVER support AVX. -- H.J.
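For completeness, the first DEF_TUNE quoted above is about rewrites like the one below (my own stand-alone illustration, not part of the patch), where a double-precision store is emitted with the packed-single mnemonic purely for the shorter non-AVX encoding:

/* ps-sketch.c
   gcc -O2 -msse2 -mtune=bdver2 -S ps-sketch.c
   With X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL the unaligned store below
   should come out as movups rather than movupd.  */
#include <emmintrin.h>

void
store2d (double *p, __m128d x)
{
  _mm_storeu_pd (p, x);
}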
[PATCH] i386: Prefer TARGET_AVX over TARGET_SSE_TYPELESS_STORES
On Tue, Jan 28, 2020 at 9:12 AM Uros Bizjak wrote: > > On Tue, Jan 28, 2020 at 4:34 PM H.J. Lu wrote: > > > > You could move > > > > > > (match_test "TARGET_AVX") > > > (const_string "TI") > > > > > > up to bypass the cases below. > > > > > > > I don't think we can do that. There are 2 cases where we prefer > > movaps/movups: > > > > /* Use packed single precision instructions where posisble. I.e. > > movups instead of movupd. */ > > DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL, > > "sse_packed_single_insn_optimal", > > m_BDVER | m_ZNVER) > > > > /* X86_TUNE_SSE_TYPELESS_STORES: Always movaps/movups for 128bit stores. > > */ > > DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores", > > m_AMD_MULTIPLE | m_CORE_ALL | m_GENERIC) > > > > We should always use movaps/movups for > > TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL. > > It is wrong to bypass TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL with TARGET_AVX > > as m_BDVER | m_ZNVER support AVX. > > The reason for TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL on AMD target is > only insn size, as advised in e.g. Software Optimization Guide for the > AMD Family 15h Processors [1], section 7.1.2, where it is said: > > --quote-- > 7.1.2 Reduce Instruction SizeOptimization > > Reduce the size of instructions when possible. > > Rationale > > Using smaller instruction sizes improves instruction fetch throughput. > Specific examples include the following: > > *In SIMD code, use the single-precision (PS) form of instructions > instead of the double-precision (PD) form. For example, for register > to register moves, MOVAPS achieves the same result as MOVAPD, but uses > one less byte to encode the instruction and has no prefix byte. Other > examples in which single-precision forms can be substituted for > double-precision forms include MOVUPS, MOVNTPS, XORPS, ORPS, ANDPS, > and SHUFPS. > ... > --/quote-- > > Please note that this optimization applies only to non-AVX forms, as > demonstrated by: > >0: 0f 28 c8movaps %xmm0,%xmm1 >3: 66 0f 28 c8 movapd %xmm0,%xmm1 >7: c5 f8 28 d1 vmovaps %xmm1,%xmm2 >b: c5 f9 28 d1 vmovapd %xmm1,%xmm2 > > Also note that MOVDQA is missing in the above optimization. It is > harmful to substitute MOVDQA with MOVAPS, as it can (and does) > introduce +1 cycle forwarding penalty between FLT (FPA/FPM) and INT > (VALU) FP clusters. > > Following the above optimization, it is obvious that > TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL handling was cargo-culted from > one pattern to another. Its use should be reviewed and fixed where not > appropriate. > > [1] https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf > > Uros. Here is the updated patch which moves TARGET_AVX before TARGET_SSE_TYPELESS_STORES. OK for master if there is no regression? Thanks. -- H.J. From cbcf8b23b29588f12e464076dacacd4600d0059b Mon Sep 17 00:00:00 2001 From: "H.J. Lu" Date: Mon, 27 Jan 2020 09:35:11 -0800 Subject: [PATCH] i386: Prefer TARGET_AVX over TARGET_SSE_TYPELESS_STORES movaps/movups is one byte shorter than movdaq/movdqu. But it isn't the case for AVX nor AVX512. This patch prefers TARGET_AVX over TARGET_SSE_TYPELESS_STORES and adjust vmovups checks in assembly ouputs. gcc/ PR target/91461 * config/i386/i386.md (*movoi_internal_avx): Remove TARGET_SSE_TYPELESS_STORES check. (*movti_internal): Prefer TARGET_AVX over TARGET_SSE_TYPELESS_STORES. (*movtf_internal): Likewise. * config/i386/sse.md (mov_internal): Likewise. gcc/testsuite/ PR target/91461 * gcc.target/i386/avx256-unaligned-store-2.c: Don't check vmovups. 
* gcc.target/i386/avx256-unaligned-store-3.c: Likewise. * gcc.target/i386/pieces-memcpy-4.c: Likewise. * gcc.target/i386/pieces-memcpy-5.c: Likewise. * gcc.target/i386/pieces-memcpy-6.c: Likewise. * gcc.target/i386/pieces-strcpy-2.c: Likewise. * gcc.target/i386/pr90980-1.c: Likewise. * gcc.target/i386/pr87317-4.c: Check "\tvmovd\t" instead of "vmovd" to avoid matching "vmovdqu". * gcc.target/i386/pr87317-5.c: Likewise. * gcc.target/i386/pr87317-7.c: Likewise. * gcc.target/i386/pr91461-1.c: New test. * gcc.target/i386/pr91461-2.c: Likewise. * gcc.target/i386/pr91461-3.c: Likewise. * gcc.target/i386/pr91461-4.c: Likewise. * gcc.target/i386/pr91461-5.c: Likewise. --- gcc/config/i386/i386.md | 12 ++- gcc/config/i386/sse.md| 4 +- .../i386/avx256-unaligned-store-2.c | 4 +- .../i386/avx256-unaligned-store-3.c | 4 +- .../gcc
Re: [PATCH] i386: Prefer TARGET_AVX over TARGET_SSE_TYPELESS_STORES
On Tue, Jan 28, 2020 at 10:04 AM Uros Bizjak wrote: > > On Tue, Jan 28, 2020 at 6:51 PM H.J. Lu wrote: > > > > On Tue, Jan 28, 2020 at 9:12 AM Uros Bizjak wrote: > > > > > > On Tue, Jan 28, 2020 at 4:34 PM H.J. Lu wrote: > > > > > > > > You could move > > > > > > > > > > (match_test "TARGET_AVX") > > > > > (const_string "TI") > > > > > > > > > > up to bypass the cases below. > > > > > > > > > > > > > I don't think we can do that. There are 2 cases where we prefer > > > > movaps/movups: > > > > > > > > /* Use packed single precision instructions where posisble. I.e. > > > > movups instead of movupd. */ > > > > DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL, > > > > "sse_packed_single_insn_optimal", > > > > m_BDVER | m_ZNVER) > > > > > > > > /* X86_TUNE_SSE_TYPELESS_STORES: Always movaps/movups for 128bit > > > > stores. */ > > > > DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores", > > > > m_AMD_MULTIPLE | m_CORE_ALL | m_GENERIC) > > > > > > > > We should always use movaps/movups for > > > > TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL. > > > > It is wrong to bypass TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL with > > > > TARGET_AVX > > > > as m_BDVER | m_ZNVER support AVX. > > > > > > The reason for TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL on AMD target is > > > only insn size, as advised in e.g. Software Optimization Guide for the > > > AMD Family 15h Processors [1], section 7.1.2, where it is said: > > > > > > --quote-- > > > 7.1.2 Reduce Instruction SizeOptimization > > > > > > Reduce the size of instructions when possible. > > > > > > Rationale > > > > > > Using smaller instruction sizes improves instruction fetch throughput. > > > Specific examples include the following: > > > > > > *In SIMD code, use the single-precision (PS) form of instructions > > > instead of the double-precision (PD) form. For example, for register > > > to register moves, MOVAPS achieves the same result as MOVAPD, but uses > > > one less byte to encode the instruction and has no prefix byte. Other > > > examples in which single-precision forms can be substituted for > > > double-precision forms include MOVUPS, MOVNTPS, XORPS, ORPS, ANDPS, > > > and SHUFPS. > > > ... > > > --/quote-- > > > > > > Please note that this optimization applies only to non-AVX forms, as > > > demonstrated by: > > > > > >0: 0f 28 c8movaps %xmm0,%xmm1 > > >3: 66 0f 28 c8 movapd %xmm0,%xmm1 > > >7: c5 f8 28 d1 vmovaps %xmm1,%xmm2 > > >b: c5 f9 28 d1 vmovapd %xmm1,%xmm2 > > > > > > Also note that MOVDQA is missing in the above optimization. It is > > > harmful to substitute MOVDQA with MOVAPS, as it can (and does) > > > introduce +1 cycle forwarding penalty between FLT (FPA/FPM) and INT > > > (VALU) FP clusters. > > > > > > Following the above optimization, it is obvious that > > > TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL handling was cargo-culted from > > > one pattern to another. Its use should be reviewed and fixed where not > > > appropriate. > > > > > > [1] https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf > > > > > > Uros. > > > > Here is the updated patch which moves TARGET_AVX before > > TARGET_SSE_TYPELESS_STORES. OK for master if there is > > no regression? > > > > Thanks. > > > + (match_test "TARGET_AVX") > + (const_string "") > (and (match_test " == 16") > > Only MODE_SIZE == 16 cases will be left here, since TARGET_AVX is > necessary for MODE_SIZE > 16. This test can be removed. > > OK with the above change. > This is the patch I am going to check in. Thanks. -- H.J. 
From 66c534dedc7a9a632aa38c32e3f7c251b8f2c778 Mon Sep 17 00:00:00 2001 From: "H.J. Lu" Date: Mon, 27 Jan 2020 09:35:11 -0800 Subject: [PATCH] i386: Prefer TARGET_AVX over TARGET_SSE_TYPELESS_STORES movaps/movups is one byte shorter than movdaq/movdqu. But it isn't the case for AVX nor AVX512. This patch prefers TARGET_AVX over TARGET_SSE_TYPELESS_STORES and adjust vmovu
Re: [PATCH] i386: Prefer TARGET_AVX over TARGET_SSE_TYPELESS_STORES
On Tue, Jan 28, 2020 at 10:58 AM Jakub Jelinek wrote: > > On Tue, Jan 28, 2020 at 10:20:36AM -0800, H.J. Lu wrote: > > From 66c534dedc7a9a632aa38c32e3f7c251b8f2c778 Mon Sep 17 00:00:00 2001 > > From: "H.J. Lu" > > Date: Mon, 27 Jan 2020 09:35:11 -0800 > > Subject: [PATCH] i386: Prefer TARGET_AVX over TARGET_SSE_TYPELESS_STORES > > > > movaps/movups is one byte shorter than movdaq/movdqu. But it isn't the > > case for AVX nor AVX512. This patch prefers TARGET_AVX over > > TARGET_SSE_TYPELESS_STORES and adjust vmovups checks in assembly ouputs. > > If you haven't committed yet, please fix the movdaq typo in the description > (to movdqa). > Will do. Thanks. -- H.J.
[PATCH] i386: Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY
Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY to make sure that the ENDBR are emitted before the patch area. When -mfentry -pg is also used together, there should be no ENDBR before "call __fentry__". OK for master if there is no regression? Thanks. H.J. -- gcc/ PR target/93492 * config/i386/i386.c (ix86_asm_output_function_label): Set function_label_emitted to true. (ix86_print_patchable_function_entry): New function. gcc/testsuite/ PR target/93492 * gcc.target/i386/pr93492-1.c: New test. * gcc.target/i386/pr93492-2.c: Likewise. * gcc.target/i386/pr93492-3.c: Likewise. -- H.J. From 5363c0289e3525139939bb678deeda98d06b2556 Mon Sep 17 00:00:00 2001 From: "H.J. Lu" Date: Mon, 3 Feb 2020 10:22:57 -0800 Subject: [PATCH] i386: Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY to make sure that the ENDBR are emitted before the patch area. When -mfentry -pg is also used together, there should be no ENDBR before "call __fentry__". gcc/ PR target/93492 * config/i386/i386.c (ix86_asm_output_function_label): Set function_label_emitted to true. (ix86_print_patchable_function_entry): New function. gcc/testsuite/ PR target/93492 * gcc.target/i386/pr93492-1.c: New test. * gcc.target/i386/pr93492-2.c: Likewise. * gcc.target/i386/pr93492-3.c: Likewise. --- gcc/config/i386/i386.c| 46 ++ gcc/config/i386/i386.h| 3 + gcc/testsuite/gcc.target/i386/pr93492-1.c | 73 +++ gcc/testsuite/gcc.target/i386/pr93492-2.c | 12 gcc/testsuite/gcc.target/i386/pr93492-3.c | 13 5 files changed, 147 insertions(+) create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-1.c create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-2.c create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-3.c diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index ffda3e8fd21..dc9bd095e9a 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -1563,6 +1563,9 @@ ix86_asm_output_function_label (FILE *asm_out_file, const char *fname, { bool is_ms_hook = ix86_function_ms_hook_prologue (decl); + if (cfun) +cfun->machine->function_label_emitted = true; + if (is_ms_hook) { int i, filler_count = (TARGET_64BIT ? 32 : 16); @@ -9118,6 +9121,45 @@ ix86_output_function_epilogue (FILE *file ATTRIBUTE_UNUSED) } } +/* Implement TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY. */ + +void +ix86_print_patchable_function_entry (FILE *file, + unsigned HOST_WIDE_INT patch_area_size, + bool record_p) +{ + if (cfun->machine->function_label_emitted) +{ + if ((flag_cf_protection & CF_BRANCH) + && !lookup_attribute ("nocf_check", +TYPE_ATTRIBUTES (TREE_TYPE (cfun->decl))) + && (!flag_manual_endbr + || lookup_attribute ("cf_check", + DECL_ATTRIBUTES (cfun->decl))) + && !cgraph_node::get (cfun->decl)->only_called_directly_p ()) + { + /* Remove ENDBR that follows the patch area. */ + rtx_insn *insn = next_real_nondebug_insn (get_insns ()); + if (insn + && INSN_P (insn) + && GET_CODE (PATTERN (insn)) == UNSPEC_VOLATILE + && XINT (PATTERN (insn), 1) == UNSPECV_NOP_ENDBR) + delete_insn (insn); + + /* Remove the queued ENDBR. */ + cfun->machine->endbr_queued_at_entrance = false; + + /* Insert a ENDBR before the patch area right after the + function label and the .cfi_startproc directive. */ + asm_fprintf (file, "\t%s\n", + TARGET_64BIT ? "endbr64" : "endbr32"); + } +} + + default_print_patchable_function_entry (file, patch_area_size, + record_p); +} + /* Return a scratch register to use in the split stack prologue. The split stack prologue is used for -fsplit-stack. 
It is the first instructions in the function, even before the regular prologue. @@ -22744,6 +22786,10 @@ ix86_run_selftests (void) #undef TARGET_ASM_FUNCTION_EPILOGUE #define TARGET_ASM_FUNCTION_EPILOGUE ix86_output_function_epilogue +#undef TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY +#define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY \ + ix86_print_patchable_function_entry + #undef TARGET_ENCODE_SECTION_INFO #ifndef SUBTARGET_ENCODE_SECTION_INFO #define TARGET_ENCODE_SECTION_INFO ix86_encode_section_info diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h index 943e9a5c783..46a809afb96 100644 --- a/gcc/config/i386/i386.h +++ b/gcc/config/i386/i386.h @@ -2844,6 +2844,9 @@ struct GTY(()) machine_function { /* If true, ENDBR is queued at function entrance. */ BOOL_BITFIELD endbr_queued_at_entrance : 1; + /* If true, the function label has been emitted. */ + BOOL_BITFIELD function_label_emitted : 1; + /* True if the function needs a stack frame. */ BOOL_BITFIELD stack_frame_required : 1; diff --git a
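To make the intended interaction concrete, a reproducer along these lines shows the two features that have to be ordered correctly (my own sketch; the pr93492-*.c tests added by the patch are the authoritative coverage):

/* entry-sketch.c
   gcc -O2 -fcf-protection -fpatchable-function-entry=1 -S entry-sketch.c
   The point of the patch is that the output should look roughly like

       foo:
               endbr64      <- landing pad first, right after the label
       .LPFE1:
               nop          <- patchable area only after the ENDBR
*/
extern int bar (int);

int
foo (int x)
{
  return bar (x) + 1;
}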
Re: [PATCH] i386: Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY
On Mon, Feb 3, 2020 at 10:35 AM H.J. Lu wrote: > > Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY to make sure that the > ENDBR are emitted before the patch area. When -mfentry -pg is also used > together, there should be no ENDBR before "call __fentry__". > > OK for master if there is no regression? > > Thanks. > > H.J. > -- > gcc/ > > PR target/93492 > * config/i386/i386.c (ix86_asm_output_function_label): Set > function_label_emitted to true. > (ix86_print_patchable_function_entry): New function. > > gcc/testsuite/ > > PR target/93492 > * gcc.target/i386/pr93492-1.c: New test. > * gcc.target/i386/pr93492-2.c: Likewise. > * gcc.target/i386/pr93492-3.c: Likewise. > This version works with both .cfi_startproc and DWARF debug info. -- H.J. From c4660acd1555f90f0f76f32a59f043a51c866553 Mon Sep 17 00:00:00 2001 From: "H.J. Lu" Date: Mon, 3 Feb 2020 10:22:57 -0800 Subject: [PATCH] i386: Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY to delay patchable area generation to ENDBR generation. It works with both .cfi_startproc and DWARF debug info. gcc/ PR target/93492 * config/i386/i386-features.c (rest_of_insert_endbranch): Set endbr_queued_at_entrance to TYPE_ENDBR. * config/i386/i386-protos.h (ix86_output_endbr): New. * config/i386/i386.c (ix86_asm_output_function_label): Set function_label_emitted to true. (ix86_print_patchable_function_entry): New function. (ix86_output_endbr): Likewise. (x86_function_profiler): Call ix86_output_endbr to generate ENDBR. (TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY): New. * i386.h (endbr_type): New. (machine_function): Add patch_area_size, record_patch_area and function_label_emitted. Change endbr_queued_at_entrance to enum. * config/i386/i386.md (UNSPECV_PATCH_ENDBR): New. (patch_endbr): New. gcc/testsuite/ PR target/93492 * gcc.target/i386/pr93492-1.c: New test. * gcc.target/i386/pr93492-2.c: Likewise. * gcc.target/i386/pr93492-3.c: Likewise. --- gcc/config/i386/i386-features.c | 2 +- gcc/config/i386/i386-protos.h | 2 + gcc/config/i386/i386.c| 77 ++- gcc/config/i386/i386.h| 18 +- gcc/config/i386/i386.md | 9 +++ gcc/testsuite/gcc.target/i386/pr93492-1.c | 73 + gcc/testsuite/gcc.target/i386/pr93492-2.c | 12 gcc/testsuite/gcc.target/i386/pr93492-3.c | 13 8 files changed, 203 insertions(+), 3 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-1.c create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-2.c create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-3.c diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c index b49e6f8d408..4d3d36e9ade 100644 --- a/gcc/config/i386/i386-features.c +++ b/gcc/config/i386/i386-features.c @@ -1963,7 +1963,7 @@ rest_of_insert_endbranch (void) { /* Queue ENDBR insertion to x86_function_profiler. 
*/ if (crtl->profile && flag_fentry) - cfun->machine->endbr_queued_at_entrance = true; + cfun->machine->endbr_queued_at_entrance = TYPE_ENDBR; else { cet_eb = gen_nop_endbr (); diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index 266381ca5a6..f9f5a243714 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -38,6 +38,8 @@ extern void ix86_expand_split_stack_prologue (void); extern void ix86_output_addr_vec_elt (FILE *, int); extern void ix86_output_addr_diff_elt (FILE *, int, int); +extern void ix86_output_endbr (bool); + extern enum calling_abi ix86_cfun_abi (void); extern enum calling_abi ix86_function_type_abi (const_tree); diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index ffda3e8fd21..e5b2565d5bd 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -1563,6 +1563,9 @@ ix86_asm_output_function_label (FILE *asm_out_file, const char *fname, { bool is_ms_hook = ix86_function_ms_hook_prologue (decl); + if (cfun) +cfun->machine->function_label_emitted = true; + if (is_ms_hook) { int i, filler_count = (TARGET_64BIT ? 32 : 16); @@ -9118,6 +9121,73 @@ ix86_output_function_epilogue (FILE *file ATTRIBUTE_UNUSED) } } +/* Implement TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY. */ + +void +ix86_print_patchable_function_entry (FILE *file, + unsigned HOST_WIDE_INT patch_area_size, + bool record_p) +{ + if (cfun->machine->function_label_emitted) +{ + if ((flag_cf_protection & CF_BRANCH) + && !lookup_attribute ("nocf_check", +TYPE_ATTRIBUTES (TREE_TYPE (cfun->decl))) + && (!flag_manual_endbr + || lookup_attribute ("cf_check", + DECL_ATTRIBUTES (cfun->decl))) + && !cgraph_node::get (cfun->de
Re: [PATCH] i386: Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY
On Mon, Feb 3, 2020 at 4:02 PM H.J. Lu wrote: > > On Mon, Feb 3, 2020 at 10:35 AM H.J. Lu wrote: > > > > Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY to make sure that the > > ENDBR are emitted before the patch area. When -mfentry -pg is also used > > together, there should be no ENDBR before "call __fentry__". > > > > OK for master if there is no regression? > > > > Thanks. > > > > H.J. > > -- > > gcc/ > > > > PR target/93492 > > * config/i386/i386.c (ix86_asm_output_function_label): Set > > function_label_emitted to true. > > (ix86_print_patchable_function_entry): New function. > > > > gcc/testsuite/ > > > > PR target/93492 > > * gcc.target/i386/pr93492-1.c: New test. > > * gcc.target/i386/pr93492-2.c: Likewise. > > * gcc.target/i386/pr93492-3.c: Likewise. > > > > This version works with both .cfi_startproc and DWARF debug info. > -g -fpatchable-function-entry=1 doesn't work together: [hjl@gnu-cfl-1 pr93492]$ cat y.i void f(){} [hjl@gnu-cfl-1 pr93492]$ make y.s /export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/ -g -fpatchable-function-entry=1 -S y.i [hjl@gnu-cfl-1 pr93492]$ cat y.s .file "y.i" .text .Ltext0: .globl f .type f, @function f: .section __patchable_function_entries,"aw",@progbits .align 8 .quad .LPFE1 .text .LPFE1: nop .LFB0: .file 1 "y.i" .loc 1 1 9 .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp .cfi_def_cfa_register 6 .loc 1 1 1 nop popq %rbp .cfi_def_cfa 7, 8 ret .cfi_endproc .LFE0: .size f, .-f I will update my patch to handle it. -- H.J.
[PATCH] x86: Add UNSPECV_PATCHABLE_AREA
On Mon, Feb 03, 2020 at 06:10:49PM -0800, H.J. Lu wrote: > On Mon, Feb 3, 2020 at 4:02 PM H.J. Lu wrote: > > > > On Mon, Feb 3, 2020 at 10:35 AM H.J. Lu wrote: > > > > > > Define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY to make sure that the > > > ENDBR are emitted before the patch area. When -mfentry -pg is also used > > > together, there should be no ENDBR before "call __fentry__". > > > > > > OK for master if there is no regression? > > > > > > Thanks. > > > > > > H.J. > > > -- > > > gcc/ > > > > > > PR target/93492 > > > * config/i386/i386.c (ix86_asm_output_function_label): Set > > > function_label_emitted to true. > > > (ix86_print_patchable_function_entry): New function. > > > > > > gcc/testsuite/ > > > > > > PR target/93492 > > > * gcc.target/i386/pr93492-1.c: New test. > > > * gcc.target/i386/pr93492-2.c: Likewise. > > > * gcc.target/i386/pr93492-3.c: Likewise. > > > > > > > This version works with both .cfi_startproc and DWARF debug info. > > > > -g -fpatchable-function-entry=1 doesn't work together: > Here is a differnt approach with UNSPECV_PATCHABLE_AREA. H.J. --- Currently patchable area is at the wrong place. It is placed immediately after function label, before both .cfi_startproc and ENDBR. This patch adds UNSPECV_PATCHABLE_AREA for pseudo patchable area instruction and changes ENDBR insertion pass to also insert a dummy patchable area. TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY is defined to provide the actual size for patchable area. It places patchable area immediately after .cfi_startproc and ENDBR. gcc/ PR target/93492 * config/i386/i386-features.c (rest_of_insert_endbranch): Renamed to ... (rest_of_insert_endbr_and_patchable_area): Change return type to void. Don't call timevar_push nor timevar_pop. Replace endbr_queued_at_entrance with insn_queued_at_entrance. Insert UNSPECV_PATCHABLE_AREA for patchable area. (pass_data_insert_endbranch): Renamed to ... (pass_data_insert_endbr_and_patchable_area): This. Change pass name to endbr_and_patchable_area. (pass_insert_endbranch): Renamed to ... (pass_insert_endbr_and_patchable_area): This. Add need_endbr and need_patchable_area. (pass_insert_endbr_and_patchable_area::gate): Set and check need_endbr/need_patchable_area. (pass_insert_endbr_and_patchable_area::execute): Call timevar_push and timevar_pop. Pass need_endbr amd need_patchable_area to rest_of_insert_endbr_and_patchable_area. (make_pass_insert_endbranch): Renamed to ... (make_pass_insert_endbr_and_patchable_area): This. * config/i386/i386-passes.def: Replace pass_insert_endbranch with pass_insert_endbr_and_patchable_area. * config/i386/i386-protos.h (ix86_output_patchable_area): New. (make_pass_insert_endbranch): Renamed to ... (make_pass_insert_endbr_and_patchable_area): This. * config/i386/i386.c (ix86_asm_output_function_label): Set function_label_emitted to true. (ix86_print_patchable_function_entry): New function. (ix86_output_patchable_area): Likewise. (x86_function_profiler): Replace endbr_queued_at_entrance with insn_queued_at_entrance. Generate ENDBR only for TYPE_ENDBR. Call ix86_output_patchable_area to generate patchable area. (TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY): New. * i386.h (queued_insn_type): New. (machine_function): Add patch_area_size, record_patch_area and function_label_emitted. Replace endbr_queued_at_entrance with insn_queued_at_entrance. * config/i386/i386.md (UNSPECV_PATCHABLE_AREA): New. (patchable_area): New. gcc/testsuite/ PR target/93492 * gcc.target/i386/pr93492-1.c: New test. * gcc.target/i386/pr93492-2.c: Likewise. 
* gcc.target/i386/pr93492-3.c: Likewise. * gcc.target/i386/pr93492-4.c: Likewise. * gcc.target/i386/pr93492-5.c: Likewise. --- gcc/config/i386/i386-features.c | 139 ++ gcc/config/i386/i386-passes.def | 2 +- gcc/config/i386/i386-protos.h | 5 +- gcc/config/i386/i386.c| 90 +- gcc/config/i386/i386.h| 20 +++- gcc/config/i386/i386.md | 14 +++ gcc/testsuite/gcc.target/i386/pr93492-1.c | 73 gcc/testsuite/gcc.target/i386/pr93492-2.c | 12 ++ gcc/testsuite/gcc.target/i386/pr93492-3.c | 13 ++ gcc/testsuite/gcc.target/i386/pr93492-4.c | 11 ++ gcc/testsuite/gcc.target/i386/pr93492-5.c | 12 ++ 11 files changed, 337 insertions(+),
[PATCH 0/3] Update -fpatchable-function-entry implementation
The current -fpatchable-function-entry implementation is done almost behind the backend back. Backend doesn't know if and where patchable area will be generated during RTL passes. TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY is only used to print out assembly codes for patchable area. This leads to wrong codes with -fpatchable-function-entry: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92424 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93492 Also .cfi_startproc and DWARF info are at the wrong places when -fpatchable-function-entry is used. This patch set has 3 parts: 1. Add a pseudo UNSPECV_PATCHABLE_AREA instruction and define TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY to work around the -fpatchable-function-entry implementation deficiency. 2. Add patch_area_size and patch_area_entry to cfun to make the patchable area info is available in RTL passes so that backend can handle patchable area properly. It also limits patch_area_size and patch_area_entry to 65535, which is a reasonable maximum size for patchable area. Other backends can also use it properly generate patchable area. 3. Remove workaround in UNSPECV_PATCHABLE_AREA generation. If the patch set is acceptable, I can combine patch 1 and patch 3 into a single patch. H.J. Lu (3): x86: Add UNSPECV_PATCHABLE_AREA Add patch_area_size and patch_area_entry to cfun x86: Simplify UNSPECV_PATCHABLE_AREA generation gcc/config/i386/i386-features.c | 130 -- gcc/config/i386/i386-passes.def | 2 +- gcc/config/i386/i386-protos.h | 5 +- gcc/config/i386/i386.c| 51 ++- gcc/config/i386/i386.h| 14 +- gcc/config/i386/i386.md | 17 +++ gcc/doc/invoke.texi | 1 + gcc/function.c| 35 + gcc/function.h| 6 + gcc/opts.c| 4 +- .../patchable_function_entry-error-1.c| 9 ++ .../patchable_function_entry-error-2.c| 9 ++ .../patchable_function_entry-error-3.c| 20 +++ gcc/testsuite/gcc.target/i386/pr93492-1.c | 73 ++ gcc/testsuite/gcc.target/i386/pr93492-2.c | 12 ++ gcc/testsuite/gcc.target/i386/pr93492-3.c | 13 ++ gcc/testsuite/gcc.target/i386/pr93492-4.c | 11 ++ gcc/testsuite/gcc.target/i386/pr93492-5.c | 12 ++ gcc/varasm.c | 30 +--- 19 files changed, 375 insertions(+), 79 deletions(-) create mode 100644 gcc/testsuite/c-c++-common/patchable_function_entry-error-1.c create mode 100644 gcc/testsuite/c-c++-common/patchable_function_entry-error-2.c create mode 100644 gcc/testsuite/c-c++-common/patchable_function_entry-error-3.c create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-1.c create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-2.c create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-3.c create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-4.c create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-5.c -- 2.24.1
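For reference, a minimal example of the two ways a patchable area can be requested (this is not part of the patch set; foo and bar are made-up names, and the attribute shown is the existing GCC patchable_function_entry attribute):

/* Compile with e.g. -fpatchable-function-entry=2,1 for a global
   default, or use the per-function attribute as below.  */
extern void bar (void);

/* Reserve 3 NOPs for this function only, 1 of them placed before the
   function entry label.  */
__attribute__ ((patchable_function_entry (3, 1)))
void
foo (void)
{
  bar ();
}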
[PATCH 1/3] x86: Add UNSPECV_PATCHABLE_AREA
Currently patchable area is at the wrong place. It is placed immediately after function label, before both .cfi_startproc and ENDBR. This patch adds UNSPECV_PATCHABLE_AREA for pseudo patchable area instruction and changes ENDBR insertion pass to also insert a dummy patchable area. TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY is defined to provide the actual size for patchable area. It places patchable area immediately after .cfi_startproc and ENDBR. gcc/ PR target/93492 * config/i386/i386-features.c (rest_of_insert_endbranch): Renamed to ... (rest_of_insert_endbr_and_patchable_area): Change return type to void. Don't call timevar_push nor timevar_pop. Replace endbr_queued_at_entrance with insn_queued_at_entrance. Insert UNSPECV_PATCHABLE_AREA for patchable area. (pass_data_insert_endbranch): Renamed to ... (pass_data_insert_endbr_and_patchable_area): This. Change pass name to endbr_and_patchable_area. (pass_insert_endbranch): Renamed to ... (pass_insert_endbr_and_patchable_area): This. Add need_endbr and need_patchable_area. (pass_insert_endbr_and_patchable_area::gate): Set and check need_endbr/need_patchable_area. (pass_insert_endbr_and_patchable_area::execute): Call timevar_push and timevar_pop. Pass need_endbr amd need_patchable_area to rest_of_insert_endbr_and_patchable_area. (make_pass_insert_endbranch): Renamed to ... (make_pass_insert_endbr_and_patchable_area): This. * config/i386/i386-passes.def: Replace pass_insert_endbranch with pass_insert_endbr_and_patchable_area. * config/i386/i386-protos.h (ix86_output_patchable_area): New. (make_pass_insert_endbranch): Renamed to ... (make_pass_insert_endbr_and_patchable_area): This. * config/i386/i386.c (ix86_asm_output_function_label): Set function_label_emitted to true. (ix86_print_patchable_function_entry): New function. (ix86_output_patchable_area): Likewise. (x86_function_profiler): Replace endbr_queued_at_entrance with insn_queued_at_entrance. Generate ENDBR only for TYPE_ENDBR. Call ix86_output_patchable_area to generate patchable area. (TARGET_ASM_PRINT_PATCHABLE_FUNCTION_ENTRY): New. * i386.h (queued_insn_type): New. (machine_function): Add patch_area_size, record_patch_area and function_label_emitted. Replace endbr_queued_at_entrance with insn_queued_at_entrance. * config/i386/i386.md (UNSPECV_PATCHABLE_AREA): New. (patchable_area): New. gcc/testsuite/ PR target/93492 * gcc.target/i386/pr93492-1.c: New test. * gcc.target/i386/pr93492-2.c: Likewise. * gcc.target/i386/pr93492-3.c: Likewise. * gcc.target/i386/pr93492-4.c: Likewise. * gcc.target/i386/pr93492-5.c: Likewise. 
--- gcc/config/i386/i386-features.c | 136 +++--- gcc/config/i386/i386-passes.def | 2 +- gcc/config/i386/i386-protos.h | 5 +- gcc/config/i386/i386.c| 91 ++- gcc/config/i386/i386.h| 20 +++- gcc/config/i386/i386.md | 15 +++ gcc/testsuite/gcc.target/i386/pr93492-1.c | 73 gcc/testsuite/gcc.target/i386/pr93492-2.c | 12 ++ gcc/testsuite/gcc.target/i386/pr93492-3.c | 13 +++ gcc/testsuite/gcc.target/i386/pr93492-4.c | 11 ++ gcc/testsuite/gcc.target/i386/pr93492-5.c | 12 ++ 11 files changed, 339 insertions(+), 51 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-1.c create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-2.c create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-3.c create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-4.c create mode 100644 gcc/testsuite/gcc.target/i386/pr93492-5.c diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c index b49e6f8d408..be46f036126 100644 --- a/gcc/config/i386/i386-features.c +++ b/gcc/config/i386/i386-features.c @@ -1937,43 +1937,79 @@ make_pass_stv (gcc::context *ctxt) return new pass_stv (ctxt); } -/* Inserting ENDBRANCH instructions. */ +/* Inserting ENDBR and pseudo patchable-area instructions. */ -static unsigned int -rest_of_insert_endbranch (void) +static void +rest_of_insert_endbr_and_patchable_area (bool need_endbr, +bool need_patchable_area) { - timevar_push (TV_MACH_DEP); - - rtx cet_eb; + rtx endbr; rtx_insn *insn; + rtx_insn *endbr_insn = NULL; basic_block bb; - /* Currently emit EB if it's a tracking function, i.e. 'nocf_check' is - absent among function attributes. Later an optimization will be - introduced to make analysis if an address of a static function is - taken. A static function whose address is not taken will get a - nocf_check attribute. This will allow to redu
[PATCH 2/3] Add patch_area_size and patch_area_entry to cfun
Currently patchable area is at the wrong place. It is placed immediately after function label and before .cfi_startproc. A backend should be able to add a pseudo patchable area instruction durectly into RTL. This patch adds patch_area_size and patch_area_entry to cfun so that the patchable area info is available in RTL passes. It also limits patch_area_size and patch_area_entry to 65535, which is a reasonable maximum size for patchable area. gcc/ PR target/93492 * function.c (expand_function_start): Set cfun->patch_area_size and cfun->patch_area_entry. * function.h (function): Add patch_area_size and patch_area_entry. * opts.c (common_handle_option): Limit function_entry_patch_area_size and function_entry_patch_area_start to USHRT_MAX. Fix a typo in error message. * varasm.c (assemble_start_function): Use cfun->patch_area_size and cfun->patch_area_entry. * doc/invoke.texi: Document the maximum value for -fpatchable-function-entry. gcc/testsuite/ PR target/93492 * c-c++-common/patchable_function_entry-error-1.c: New test. * c-c++-common/patchable_function_entry-error-2.c: Likewise. * c-c++-common/patchable_function_entry-error-3.c: Likewise. --- gcc/doc/invoke.texi | 1 + gcc/function.c| 35 +++ gcc/function.h| 6 gcc/opts.c| 4 ++- .../patchable_function_entry-error-1.c| 9 + .../patchable_function_entry-error-2.c| 9 + .../patchable_function_entry-error-3.c| 20 +++ gcc/varasm.c | 30 ++-- 8 files changed, 85 insertions(+), 29 deletions(-) create mode 100644 gcc/testsuite/c-c++-common/patchable_function_entry-error-1.c create mode 100644 gcc/testsuite/c-c++-common/patchable_function_entry-error-2.c create mode 100644 gcc/testsuite/c-c++-common/patchable_function_entry-error-3.c diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 35b341e759f..dd4835199b0 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -13966,6 +13966,7 @@ If @code{N=0}, no pad location is recorded. The NOP instructions are inserted at---and maybe before, depending on @var{M}---the function entry address, even before the prologue. +The maximum value of @var{N} and @var{M} is 65535. @end table diff --git a/gcc/function.c b/gcc/function.c index d8008f60422..badbf538eec 100644 --- a/gcc/function.c +++ b/gcc/function.c @@ -5202,6 +5202,41 @@ expand_function_start (tree subr) /* If we are doing generic stack checking, the probe should go here. 
*/ if (flag_stack_check == GENERIC_STACK_CHECK) stack_check_probe_note = emit_note (NOTE_INSN_DELETED); + + unsigned HOST_WIDE_INT patch_area_size = function_entry_patch_area_size; + unsigned HOST_WIDE_INT patch_area_entry = function_entry_patch_area_start; + + tree patchable_function_entry_attr += lookup_attribute ("patchable_function_entry", + DECL_ATTRIBUTES (cfun->decl)); + if (patchable_function_entry_attr) +{ + tree pp_val = TREE_VALUE (patchable_function_entry_attr); + tree patchable_function_entry_value1 = TREE_VALUE (pp_val); + + patch_area_size = tree_to_uhwi (patchable_function_entry_value1); + patch_area_entry = 0; + if (TREE_CHAIN (pp_val) != NULL_TREE) + { + tree patchable_function_entry_value2 + = TREE_VALUE (TREE_CHAIN (pp_val)); + patch_area_entry = tree_to_uhwi (patchable_function_entry_value2); + } + if (patch_area_size > USHRT_MAX || patch_area_size > USHRT_MAX) + error ("invalid values for % attribute"); +} + + if (patch_area_entry > patch_area_size) +{ + if (patch_area_size > 0) + warning (OPT_Wattributes, +"patchable function entry %wu exceeds size %wu", +patch_area_entry, patch_area_size); + patch_area_entry = 0; +} + + cfun->patch_area_size = patch_area_size; + cfun->patch_area_entry = patch_area_entry; } void diff --git a/gcc/function.h b/gcc/function.h index 1ee8ed3de53..1ed7c400f23 100644 --- a/gcc/function.h +++ b/gcc/function.h @@ -332,6 +332,12 @@ struct GTY(()) function { /* Last assigned dependence info clique. */ unsigned short last_clique; + /* How many NOP insns to place at each function entry by default. */ + unsigned short patch_area_size; + + /* How far the real asm entry point is into this area. */ + unsigned short patch_area_entry; + /* Collected bit flags. */ /* Number of units of general registers that need saving in stdarg diff --git a/gcc/opts.c b/gcc/opts.c index 7affeb41a96..c6011f1f9b7 100644 --- a/gcc/opts.c +++ b/gcc/opts.c @@ -2598,10 +2598,12 @@ common_handle_option (struct gcc_options *opts, function_entry_patch_area_start = 0;
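For illustration, a hypothetical input in the spirit of the new patchable_function_entry-error-*.c tests (this is not one of the tests added here, and too_big is a made-up name) that the added 65535 limit is meant to reject:

/* With this patch, a patchable-area size above 65535 is expected to be
   rejected with the new "invalid values for patchable_function_entry
   attribute" error raised in expand_function_start.  */
__attribute__ ((patchable_function_entry (70000, 0)))
void
too_big (void)
{
}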
[PATCH 3/3] x86: Simplify UNSPECV_PATCHABLE_AREA generation
Since patch_area_size and patch_area_entry have been added to cfun, we can use them to directly insert the pseudo UNSPECV_PATCHABLE_AREA instruction. PR target/93492 * config/i386/i386-features.c (rest_of_insert_endbr_and_patchable_area): Change need_patchable_area argument to patchable_area_size. Generate UNSPECV_PATCHABLE_AREA instruction with proper arguments. (pass_insert_endbr_and_patchable_area::gate): Set and check patchable_area_size instead of need_patchable_area. (pass_insert_endbr_and_patchable_area::execute): Pass patchable_area_size to rest_of_insert_endbr_and_patchable_area. (pass_insert_endbr_and_patchable_area): Replace need_patchable_area with patchable_area_size. * config/i386/i386.c (ix86_print_patchable_function_entry): Just return if function table has been emitted. (x86_function_profiler): Use cfun->patch_area_size and cfun->patch_area_entry. * config/i386/i386.h (machine_function): Remove patch_area_size and record_patch_area. * config/i386/i386.md (patchable_area): Set length attribute. --- gcc/config/i386/i386-features.c | 22 +--- gcc/config/i386/i386.c | 60 ++--- gcc/config/i386/i386.h | 6 gcc/config/i386/i386.md | 10 +++--- 4 files changed, 25 insertions(+), 73 deletions(-) diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c index be46f036126..d358abe7064 100644 --- a/gcc/config/i386/i386-features.c +++ b/gcc/config/i386/i386-features.c @@ -1941,7 +1941,7 @@ make_pass_stv (gcc::context *ctxt) static void rest_of_insert_endbr_and_patchable_area (bool need_endbr, -bool need_patchable_area) +unsigned int patchable_area_size) { rtx endbr; rtx_insn *insn; @@ -1980,7 +1980,7 @@ rest_of_insert_endbr_and_patchable_area (bool need_endbr, } } - if (need_patchable_area) + if (patchable_area_size) { if (crtl->profile && flag_fentry) { @@ -1992,10 +1992,9 @@ rest_of_insert_endbr_and_patchable_area (bool need_endbr, } else { - /* ix86_print_patchable_function_entry will provide actual -size. */ - rtx patchable_area = gen_patchable_area (GEN_INT (0), - GEN_INT (0)); + rtx patchable_area + = gen_patchable_area (GEN_INT (patchable_area_size), + GEN_INT (cfun->patch_area_entry == 0)); if (endbr_insn) emit_insn_after (patchable_area, endbr_insn); else @@ -2123,25 +2122,22 @@ public: virtual bool gate (function *fun) { need_endbr = (flag_cf_protection & CF_BRANCH) != 0; - need_patchable_area - = (function_entry_patch_area_size - || lookup_attribute ("patchable_function_entry", - DECL_ATTRIBUTES (fun->decl))); - return need_endbr || need_patchable_area; + patchable_area_size = fun->patch_area_size - fun->patch_area_entry; + return need_endbr || patchable_area_size; } virtual unsigned int execute (function *) { timevar_push (TV_MACH_DEP); rest_of_insert_endbr_and_patchable_area (need_endbr, - need_patchable_area); + patchable_area_size); timevar_pop (TV_MACH_DEP); return 0; } private: bool need_endbr; - bool need_patchable_area; + unsigned int patchable_area_size; }; // class pass_insert_endbr_and_patchable_area } // anon namespace diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 051a1fcbdc2..79a8823f61e 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -9130,53 +9130,11 @@ ix86_print_patchable_function_entry (FILE *file, { if (cfun->machine->function_label_emitted) { - /* The insert_endbr_and_patchable_area pass inserted a dummy -UNSPECV_PATCHABLE_AREA with 0 patchable area size. If the -patchable area is placed after the function label, we replace -0 patchable area size with the real one. 
Otherwise, the -dummy UNSPECV_PATCHABLE_AREA will be ignored. */ - if (cfun->machine->insn_queued_at_entrance) - { - /* Record the patchable area. Both ENDBR and patchable area -will be inserted by x86_function_profiler later. */ - cfun->machine->patch_area_size = patch_area_size; - cfun->machine->record_patch_area = record_p; - return; - } - - /* We can have - -UNSPECV_NOP_ENDBR -UNSPECV_PATCHABLE_AREA - -or just - -UNSPECV_PATCHABLE_AREA - */ - rtx_insn *patchable_insn; - rtx_insn *insn = next_real_nondebug_insn (get_insns ()); - if (insn
[PATCH] x86-64: Pass aggregates with only float/double in GPRs for MS_ABI
MS_ABI requires passing aggregates with only float/double in integer registers. Checked gcc outputs against Clang and fixed: FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=54 -Wno-unused-variable -Wno-unused-parameter -Wno-unused-but-set-variable -Wno-uninitialized -O0 -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=54 -Wno-unused-variable -Wno-unused-parameter -Wno-unused-but-set-variable -Wno-uninitialized -O2 -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=55 -Wno-unused-variable -Wno-unused-parameter -Wno-unused-but-set-variable -Wno-uninitialized -O0 -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=55 -Wno-unused-variable -Wno-unused-parameter -Wno-unused-but-set-variable -Wno-uninitialized -O2 -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=56 -Wno-unused-variable -Wno-unused-parameter -Wno-unused-but-set-variable -Wno-uninitialized -O0 -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=56 -Wno-unused-variable -Wno-unused-parameter -Wno-unused-but-set-variable -Wno-uninitialized -O2 -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test in libffi testsuite. OK for master and backports to GCC 8/9 branches? gcc/ PR target/85667 * config/i386/i386.c (function_arg_ms_64): Add a type argument. Don't return aggregates with only SFmode and DFmode in SSE register. (ix86_function_arg): Pass arg.type to function_arg_ms_64. gcc/testsuite/ PR target/85667 * gcc.target/i386/pr85667-10.c: New test. * gcc.target/i386/pr85667-7.c: Likewise. * gcc.target/i386/pr85667-8.c: Likewise. * gcc.target/i386/pr85667-9.c: Likewise. -- H.J. From e561fd8fcb46b8d8e40942c077e26ce120832747 Mon Sep 17 00:00:00 2001 From: "H.J. Lu" Date: Wed, 5 Feb 2020 09:49:56 -0800 Subject: [PATCH] x86-64: Pass aggregates with only float/double in GPRs for MS_ABI MS_ABI requires passing aggregates with only float/double in integer registers. 
Checked gcc outputs against Clang and fixed: FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=54 -Wno-unused-variable -Wno-unused-parameter -Wno-unused-but-set-variable -Wno-uninitialized -O0 -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=54 -Wno-unused-variable -Wno-unused-parameter -Wno-unused-but-set-variable -Wno-uninitialized -O2 -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=55 -Wno-unused-variable -Wno-unused-parameter -Wno-unused-but-set-variable -Wno-uninitialized -O0 -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=55 -Wno-unused-variable -Wno-unused-parameter -Wno-unused-but-set-variable -Wno-uninitialized -O2 -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=56 -Wno-unused-variable -Wno-unused-parameter -Wno-unused-but-set-variable -Wno-uninitialized -O0 -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=56 -Wno-unused-variable -Wno-unused-parameter -Wno-unused-but-set-variable -Wno-uninitialized -O2 -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test in libffi testsuite. gcc/ PR target/85667 * config/i386/i386.c (function_arg_ms_64): Add a type argument. Don't return aggregates with only SFmode and DFmode in SSE register. (ix86_function_arg): Pass arg.type to function_arg_ms_64. gcc/testsuite/ PR target/85667 * gcc.target/i386/pr85667-10.c: New test. * gcc.target/i386/pr85667-7.c: Likewise. * gcc.target/i386/pr85667-8.c: Likewise. * gcc.target/i386/pr85667-9.c: Likewise. --- gcc/config/i386/i386.c | 10 -- gcc/testsuite/gcc.target/i386/pr85667-10.c | 21 + gcc/testsuite/gcc.target/i386/pr85667-7.c | 36 ++ gcc/testsuite/gcc.target/i386/pr85667-8.c | 21 + gcc/testsuite/gcc.target/i386/pr85667-9.c | 36 ++ 5 files changed, 121 insertions(+), 3 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr85667-10.c create mode 100644 gcc/testsuite/gcc.target/i386/pr85667-7.c create mode 100644 gcc/testsuite/gcc.target/i386/pr85667-8.c create mode 100644 gcc/testsuite/gcc.target/i386/pr85667-9.c diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index ffda3e8fd21..f769cb8f75e 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -3153,7 +3153,7 @@ function_arg_64 (const CUMULATIVE_ARGS *cum, machine_mode mode, static rtx function_arg_ms_64 (const CUMULATIVE_ARGS *cum, machine_mode mode, - ma
[PATCH] Add patch_area_size and patch_area_entry to crtl
On Wed, Feb 5, 2020 at 9:00 AM Richard Sandiford wrote: > > "H.J. Lu" writes: > > Currently patchable area is at the wrong place. > > Agreed :-) > > > It is placed immediately > > after function label and before .cfi_startproc. A backend should be able > > to add a pseudo patchable area instruction durectly into RTL. This patch > > adds patch_area_size and patch_area_entry to cfun so that the patchable > > area info is available in RTL passes. > > It might be better to add it to crtl, since it should only be needed > during rtl generation. > > > It also limits patch_area_size and patch_area_entry to 65535, which is > > a reasonable maximum size for patchable area. > > > > gcc/ > > > > PR target/93492 > > * function.c (expand_function_start): Set cfun->patch_area_size > > and cfun->patch_area_entry. > > * function.h (function): Add patch_area_size and patch_area_entry. > > * opts.c (common_handle_option): Limit > > function_entry_patch_area_size and function_entry_patch_area_start > > to USHRT_MAX. Fix a typo in error message. > > * varasm.c (assemble_start_function): Use cfun->patch_area_size > > and cfun->patch_area_entry. > > * doc/invoke.texi: Document the maximum value for > > -fpatchable-function-entry. > > > > gcc/testsuite/ > > > > PR target/93492 > > * c-c++-common/patchable_function_entry-error-1.c: New test. > > * c-c++-common/patchable_function_entry-error-2.c: Likewise. > > * c-c++-common/patchable_function_entry-error-3.c: Likewise. > > --- > > gcc/doc/invoke.texi | 1 + > > gcc/function.c| 35 +++ > > gcc/function.h| 6 > > gcc/opts.c| 4 ++- > > .../patchable_function_entry-error-1.c| 9 + > > .../patchable_function_entry-error-2.c| 9 + > > .../patchable_function_entry-error-3.c| 20 +++ > > gcc/varasm.c | 30 ++-- > > 8 files changed, 85 insertions(+), 29 deletions(-) > > create mode 100644 > > gcc/testsuite/c-c++-common/patchable_function_entry-error-1.c > > create mode 100644 > > gcc/testsuite/c-c++-common/patchable_function_entry-error-2.c > > create mode 100644 > > gcc/testsuite/c-c++-common/patchable_function_entry-error-3.c > > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > > index 35b341e759f..dd4835199b0 100644 > > --- a/gcc/doc/invoke.texi > > +++ b/gcc/doc/invoke.texi > > @@ -13966,6 +13966,7 @@ If @code{N=0}, no pad location is recorded. > > The NOP instructions are inserted at---and maybe before, depending on > > @var{M}---the function entry address, even before the prologue. > > > > +The maximum value of @var{N} and @var{M} is 65535. > > @end table > > > > > > diff --git a/gcc/function.c b/gcc/function.c > > index d8008f60422..badbf538eec 100644 > > --- a/gcc/function.c > > +++ b/gcc/function.c > > @@ -5202,6 +5202,41 @@ expand_function_start (tree subr) > >/* If we are doing generic stack checking, the probe should go here. 
*/ > >if (flag_stack_check == GENERIC_STACK_CHECK) > > stack_check_probe_note = emit_note (NOTE_INSN_DELETED); > > + > > + unsigned HOST_WIDE_INT patch_area_size = function_entry_patch_area_size; > > + unsigned HOST_WIDE_INT patch_area_entry = > > function_entry_patch_area_start; > > + > > + tree patchable_function_entry_attr > > += lookup_attribute ("patchable_function_entry", > > + DECL_ATTRIBUTES (cfun->decl)); > > + if (patchable_function_entry_attr) > > +{ > > + tree pp_val = TREE_VALUE (patchable_function_entry_attr); > > + tree patchable_function_entry_value1 = TREE_VALUE (pp_val); > > + > > + patch_area_size = tree_to_uhwi (patchable_function_entry_value1); > > + patch_area_entry = 0; > > + if (TREE_CHAIN (pp_val) != NULL_TREE) > > + { > > + tree patchable_function_entry_value2 > > + = TREE_VALUE (TREE_CHAIN (pp_val)); > > + patch_area_entry = tree_to_uhwi (patchable_function_entry_value2); > > + } > > + if (patch_area_size > USHRT_MAX || patch_area_size > USHRT_MAX) > > + error ("invalid values for % attribute"); > > This should probably go in handle_patchable_function_entry_attri
Re: [PATCH] Add patch_area_size and patch_area_entry to crtl
On Wed, Feb 5, 2020 at 12:20 PM H.J. Lu wrote: > > On Wed, Feb 5, 2020 at 9:00 AM Richard Sandiford > wrote: > > > > "H.J. Lu" writes: > > > Currently patchable area is at the wrong place. > > > > Agreed :-) > > > > > It is placed immediately > > > after function label and before .cfi_startproc. A backend should be able > > > to add a pseudo patchable area instruction durectly into RTL. This patch > > > adds patch_area_size and patch_area_entry to cfun so that the patchable > > > area info is available in RTL passes. > > > > It might be better to add it to crtl, since it should only be needed > > during rtl generation. > > > > > It also limits patch_area_size and patch_area_entry to 65535, which is > > > a reasonable maximum size for patchable area. > > > > > > gcc/ > > > > > > PR target/93492 > > > * function.c (expand_function_start): Set cfun->patch_area_size > > > and cfun->patch_area_entry. > > > * function.h (function): Add patch_area_size and patch_area_entry. > > > * opts.c (common_handle_option): Limit > > > function_entry_patch_area_size and function_entry_patch_area_start > > > to USHRT_MAX. Fix a typo in error message. > > > * varasm.c (assemble_start_function): Use cfun->patch_area_size > > > and cfun->patch_area_entry. > > > * doc/invoke.texi: Document the maximum value for > > > -fpatchable-function-entry. > > > > > > gcc/testsuite/ > > > > > > PR target/93492 > > > * c-c++-common/patchable_function_entry-error-1.c: New test. > > > * c-c++-common/patchable_function_entry-error-2.c: Likewise. > > > * c-c++-common/patchable_function_entry-error-3.c: Likewise. > > > --- > > > gcc/doc/invoke.texi | 1 + > > > gcc/function.c| 35 +++ > > > gcc/function.h| 6 > > > gcc/opts.c| 4 ++- > > > .../patchable_function_entry-error-1.c| 9 + > > > .../patchable_function_entry-error-2.c| 9 + > > > .../patchable_function_entry-error-3.c| 20 +++ > > > gcc/varasm.c | 30 ++-- > > > 8 files changed, 85 insertions(+), 29 deletions(-) > > > create mode 100644 > > > gcc/testsuite/c-c++-common/patchable_function_entry-error-1.c > > > create mode 100644 > > > gcc/testsuite/c-c++-common/patchable_function_entry-error-2.c > > > create mode 100644 > > > gcc/testsuite/c-c++-common/patchable_function_entry-error-3.c > > > > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > > > index 35b341e759f..dd4835199b0 100644 > > > --- a/gcc/doc/invoke.texi > > > +++ b/gcc/doc/invoke.texi > > > @@ -13966,6 +13966,7 @@ If @code{N=0}, no pad location is recorded. > > > The NOP instructions are inserted at---and maybe before, depending on > > > @var{M}---the function entry address, even before the prologue. > > > > > > +The maximum value of @var{N} and @var{M} is 65535. > > > @end table > > > > > > > > > diff --git a/gcc/function.c b/gcc/function.c > > > index d8008f60422..badbf538eec 100644 > > > --- a/gcc/function.c > > > +++ b/gcc/function.c > > > @@ -5202,6 +5202,41 @@ expand_function_start (tree subr) > > >/* If we are doing generic stack checking, the probe should go here. 
> > > */ > > >if (flag_stack_check == GENERIC_STACK_CHECK) > > > stack_check_probe_note = emit_note (NOTE_INSN_DELETED); > > > + > > > + unsigned HOST_WIDE_INT patch_area_size = > > > function_entry_patch_area_size; > > > + unsigned HOST_WIDE_INT patch_area_entry = > > > function_entry_patch_area_start; > > > + > > > + tree patchable_function_entry_attr > > > += lookup_attribute ("patchable_function_entry", > > > + DECL_ATTRIBUTES (cfun->decl)); > > > + if (patchable_function_entry_attr) > > > +{ > > > + tree pp_val = TREE_VALUE (patchable_function_entry_attr); > > > + tree patchable_function_entry_value1 = TREE_VALUE (pp_val); > > > + > > > + patch_area_size = tree_to_uhwi (patchable_function_entry_value1); > > &g
Re: [PATCH] Add patch_area_size and patch_area_entry to crtl
On Wed, Feb 5, 2020 at 2:37 PM Marek Polacek wrote: > > On Wed, Feb 05, 2020 at 02:24:48PM -0800, H.J. Lu wrote: > > On Wed, Feb 5, 2020 at 12:20 PM H.J. Lu wrote: > > > > > > On Wed, Feb 5, 2020 at 9:00 AM Richard Sandiford > > > wrote: > > > > > > > > "H.J. Lu" writes: > > > > > Currently patchable area is at the wrong place. > > > > > > > > Agreed :-) > > > > > > > > > It is placed immediately > > > > > after function label and before .cfi_startproc. A backend should be > > > > > able > > > > > to add a pseudo patchable area instruction durectly into RTL. This > > > > > patch > > > > > adds patch_area_size and patch_area_entry to cfun so that the > > > > > patchable > > > > > area info is available in RTL passes. > > > > > > > > It might be better to add it to crtl, since it should only be needed > > > > during rtl generation. > > > > > > > > > It also limits patch_area_size and patch_area_entry to 65535, which is > > > > > a reasonable maximum size for patchable area. > > > > > > > > > > gcc/ > > > > > > > > > > PR target/93492 > > > > > * function.c (expand_function_start): Set cfun->patch_area_size > > > > > and cfun->patch_area_entry. > > > > > * function.h (function): Add patch_area_size and > > > > > patch_area_entry. > > > > > * opts.c (common_handle_option): Limit > > > > > function_entry_patch_area_size and > > > > > function_entry_patch_area_start > > > > > to USHRT_MAX. Fix a typo in error message. > > > > > * varasm.c (assemble_start_function): Use cfun->patch_area_size > > > > > and cfun->patch_area_entry. > > > > > * doc/invoke.texi: Document the maximum value for > > > > > -fpatchable-function-entry. > > > > > > > > > > gcc/testsuite/ > > > > > > > > > > PR target/93492 > > > > > * c-c++-common/patchable_function_entry-error-1.c: New test. > > > > > * c-c++-common/patchable_function_entry-error-2.c: Likewise. > > > > > * c-c++-common/patchable_function_entry-error-3.c: Likewise. > > > > > --- > > > > > gcc/doc/invoke.texi | 1 + > > > > > gcc/function.c| 35 > > > > > +++ > > > > > gcc/function.h| 6 > > > > > gcc/opts.c| 4 ++- > > > > > .../patchable_function_entry-error-1.c| 9 + > > > > > .../patchable_function_entry-error-2.c| 9 + > > > > > .../patchable_function_entry-error-3.c| 20 +++ > > > > > gcc/varasm.c | 30 ++-- > > > > > 8 files changed, 85 insertions(+), 29 deletions(-) > > > > > create mode 100644 > > > > > gcc/testsuite/c-c++-common/patchable_function_entry-error-1.c > > > > > create mode 100644 > > > > > gcc/testsuite/c-c++-common/patchable_function_entry-error-2.c > > > > > create mode 100644 > > > > > gcc/testsuite/c-c++-common/patchable_function_entry-error-3.c > > > > > > > > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > > > > > index 35b341e759f..dd4835199b0 100644 > > > > > --- a/gcc/doc/invoke.texi > > > > > +++ b/gcc/doc/invoke.texi > > > > > @@ -13966,6 +13966,7 @@ If @code{N=0}, no pad location is recorded. > > > > > The NOP instructions are inserted at---and maybe before, depending on > > > > > @var{M}---the function entry address, even before the prologue. > > > > > > > > > > +The maximum value of @var{N} and @var{M} is 65535. > > > > > @end table > > > > > > > > > > > > > > > diff --git a/gcc/function.c b/gcc/function.c > > > > > index d8008f60422..badbf538eec 100644 > > > > > --- a/gcc/function.c > > > > > +++ b/gcc/function.c > > > > > @@ -5202,6 +5202,41 @@ expand_function_start (tree subr) > > &
Re: [PATCH] Add patch_area_size and patch_area_entry to crtl
On Wed, Feb 5, 2020 at 2:51 PM H.J. Lu wrote: > > On Wed, Feb 5, 2020 at 2:37 PM Marek Polacek wrote: > > > > On Wed, Feb 05, 2020 at 02:24:48PM -0800, H.J. Lu wrote: > > > On Wed, Feb 5, 2020 at 12:20 PM H.J. Lu wrote: > > > > > > > > On Wed, Feb 5, 2020 at 9:00 AM Richard Sandiford > > > > wrote: > > > > > > > > > > "H.J. Lu" writes: > > > > > > Currently patchable area is at the wrong place. > > > > > > > > > > Agreed :-) > > > > > > > > > > > It is placed immediately > > > > > > after function label and before .cfi_startproc. A backend should > > > > > > be able > > > > > > to add a pseudo patchable area instruction durectly into RTL. This > > > > > > patch > > > > > > adds patch_area_size and patch_area_entry to cfun so that the > > > > > > patchable > > > > > > area info is available in RTL passes. > > > > > > > > > > It might be better to add it to crtl, since it should only be needed > > > > > during rtl generation. > > > > > > > > > > > It also limits patch_area_size and patch_area_entry to 65535, which > > > > > > is > > > > > > a reasonable maximum size for patchable area. > > > > > > > > > > > > gcc/ > > > > > > > > > > > > PR target/93492 > > > > > > * function.c (expand_function_start): Set > > > > > > cfun->patch_area_size > > > > > > and cfun->patch_area_entry. > > > > > > * function.h (function): Add patch_area_size and > > > > > > patch_area_entry. > > > > > > * opts.c (common_handle_option): Limit > > > > > > function_entry_patch_area_size and > > > > > > function_entry_patch_area_start > > > > > > to USHRT_MAX. Fix a typo in error message. > > > > > > * varasm.c (assemble_start_function): Use > > > > > > cfun->patch_area_size > > > > > > and cfun->patch_area_entry. > > > > > > * doc/invoke.texi: Document the maximum value for > > > > > > -fpatchable-function-entry. > > > > > > > > > > > > gcc/testsuite/ > > > > > > > > > > > > PR target/93492 > > > > > > * c-c++-common/patchable_function_entry-error-1.c: New test. > > > > > > * c-c++-common/patchable_function_entry-error-2.c: Likewise. > > > > > > * c-c++-common/patchable_function_entry-error-3.c: Likewise. > > > > > > --- > > > > > > gcc/doc/invoke.texi | 1 + > > > > > > gcc/function.c| 35 > > > > > > +++ > > > > > > gcc/function.h| 6 > > > > > > gcc/opts.c| 4 ++- > > > > > > .../patchable_function_entry-error-1.c| 9 + > > > > > > .../patchable_function_entry-error-2.c| 9 + > > > > > > .../patchable_function_entry-error-3.c| 20 +++ > > > > > > gcc/varasm.c | 30 ++-- > > > > > > 8 files changed, 85 insertions(+), 29 deletions(-) > > > > > > create mode 100644 > > > > > > gcc/testsuite/c-c++-common/patchable_function_entry-error-1.c > > > > > > create mode 100644 > > > > > > gcc/testsuite/c-c++-common/patchable_function_entry-error-2.c > > > > > > create mode 100644 > > > > > > gcc/testsuite/c-c++-common/patchable_function_entry-error-3.c > > > > > > > > > > > > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > > > > > > index 35b341e759f..dd4835199b0 100644 > > > > > > --- a/gcc/doc/invoke.texi > > > > > > +++ b/gcc/doc/invoke.texi > > > > > > @@ -13966,6 +13966,7 @@ If @code{N=0}, no pad location is recorded. > > > > > > The NOP instructions are inserted at---and maybe before, depending > > > > > > on > > > > > > @var{M}---
[PATCH] Use the section flag 'o' for __patchable_function_entries
This commit in GNU binutils 2.35: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=commit;h=b7d072167715829eed0622616f6ae0182900de3e added the section flag 'o' to .section directive: .section __patchable_function_entries,"awo",@progbits,foo which specifies the symbol name which the section references. Assembler creates a unique __patchable_function_entries section with the section, where foo is defined, as its linked-to section. Linker keeps a section if its linked-to section is kept during garbage collection. This patch checks assembler support for the section flag 'o' and uses it to implement __patchable_function_entries section. Since Solaris may use GNU assembler with Solairs ld. Even if GNU assembler supports the section flag 'o', it doesn't mean that Solairs ld supports it. This feature is disabled for Solairs targets. gcc/ PR middle-end/93195 PR middle-end/93197 * configure.ac (HAVE_GAS_SECTION_LINK_ORDER): New. Define if the assembler supports the section flag 'o' for specifying section with link-order. * dwarf2out.c (output_comdat_type_unit): Pass 0 as flags2 to targetm.asm_out.named_section. * config/sol2.c (solaris_elf_asm_comdat_section): Likewise. * output.h (SECTION2_LINK_ORDER): New. (switch_to_section): Add an unsigned int argument. (default_no_named_section): Likewise. (default_elf_asm_named_section): Likewise. * target.def (asm_out.named_section): Likewise. * targhooks.c (default_print_patchable_function_entry): Pass current_function_decl to get_section and SECTION2_LINK_ORDER to switch_to_section. * varasm.c (default_no_named_section): Add an unsigned int argument. (default_elf_asm_named_section): Add an unsigned int argument, flags2. Use 'o' flag for SECTION2_LINK_ORDER if assembler supports it. (switch_to_section): Add an unsigned int argument and pass it to targetm.asm_out.named_section. (handle_vtv_comdat_section): Pass 0 to targetm.asm_out.named_section. * config.in: Regenerated. * configure: Likewise. * doc/tm.texi: Likewise. gcc/testsuite/ PR middle-end/93195 * g++.dg/pr93195a.C: New test. * g++.dg/pr93195b.C: Likewise. * lib/target-supports.exp (check_effective_target_o_flag_in_section): New proc. --- gcc/config.in | 6 gcc/config/sol2.c | 3 +- gcc/configure | 52 +++ gcc/configure.ac | 22 gcc/doc/tm.texi | 5 +-- gcc/dwarf2out.c | 4 +-- gcc/output.h | 11 -- gcc/target.def| 5 +-- gcc/targhooks.c | 4 ++- gcc/testsuite/g++.dg/pr93195a.C | 27 ++ gcc/testsuite/g++.dg/pr93195b.C | 14 gcc/testsuite/lib/target-supports.exp | 40 + gcc/varasm.c | 25 ++--- 13 files changed, 202 insertions(+), 16 deletions(-) create mode 100644 gcc/testsuite/g++.dg/pr93195a.C create mode 100644 gcc/testsuite/g++.dg/pr93195b.C diff --git a/gcc/config.in b/gcc/config.in index 48292861842..d1ecc5b15a6 100644 --- a/gcc/config.in +++ b/gcc/config.in @@ -1313,6 +1313,12 @@ #endif +/* Define if your assembler supports 'o' flag in .section directive. */ +#ifndef USED_FOR_TARGET +#undef HAVE_GAS_SECTION_LINK_ORDER +#endif + + /* Define 0/1 if your assembler supports marking sections with SHF_MERGE flag. */ #ifndef USED_FOR_TARGET diff --git a/gcc/config/sol2.c b/gcc/config/sol2.c index cf9d9f1f684..62bbdec2f97 100644 --- a/gcc/config/sol2.c +++ b/gcc/config/sol2.c @@ -224,7 +224,8 @@ solaris_elf_asm_comdat_section (const char *name, unsigned int flags, tree decl) emits this as a regular section. Emit section before .group directive since Sun as treats undeclared sections as @progbits, which conflicts with .bss* sections which are @nobits. 
*/ - targetm.asm_out.named_section (section, flags & ~SECTION_LINKONCE, decl); + targetm.asm_out.named_section (section, flags & ~SECTION_LINKONCE, +0, decl); /* Sun as separates declaration of a group section and of the group itself, using the .group directive and the #comdat flag. */ diff --git a/gcc/configure b/gcc/configure index 5fa565a40a4..a7315e33a62 100755 --- a/gcc/configure +++ b/gcc/configure @@ -24185,6 +24185,58 @@ cat >>confdefs.h <<_ACEOF _ACEOF +# Test if the assembler supports the section flag 'o' for specifying +# section with link-order. +case "${target}" in + # Solaris may use GNU assembler with Solairs ld. Even if GNU + # assembler supports the section flag 'o', it doesn't mean that + # Solairs ld supports it. + *-*-solaris2*) +gcc_cv
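As an illustration of why the linked-to section matters (a made-up example, not one of the pr93195 tests): when a function with a patchable entry is garbage-collected by the linker, its __patchable_function_entries entry should be discarded with it, which the 'o' flag's linked-to section makes possible.

/* Built with -ffunction-sections and linked with --gc-sections: if
   unused_function is discarded, the __patchable_function_entries
   section whose linked-to section is .text.unused_function can be
   discarded along with it.  */
__attribute__ ((patchable_function_entry (1, 0)))
void
unused_function (void)
{
}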
Re: [PATCH] x86-64: Pass aggregates with only float/double in GPRs for MS_ABI
On Wed, Feb 05, 2020 at 09:51:14PM +0100, Uros Bizjak wrote: > On Wed, Feb 5, 2020 at 6:59 PM H.J. Lu wrote: > > > > MS_ABI requires passing aggregates with only float/double in integer > > registers. Checked gcc outputs against Clang and fixed: > > > > FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=54 > > -Wno-unused-variable -Wno-unused-parameter > > -Wno-unused-but-set-variable -Wno-uninitialized -O0 > > -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test > > FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=54 > > -Wno-unused-variable -Wno-unused-parameter > > -Wno-unused-but-set-variable -Wno-uninitialized -O2 > > -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test > > FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=55 > > -Wno-unused-variable -Wno-unused-parameter > > -Wno-unused-but-set-variable -Wno-uninitialized -O0 > > -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test > > FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=55 > > -Wno-unused-variable -Wno-unused-parameter > > -Wno-unused-but-set-variable -Wno-uninitialized -O2 > > -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test > > FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=56 > > -Wno-unused-variable -Wno-unused-parameter > > -Wno-unused-but-set-variable -Wno-uninitialized -O0 > > -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test > > FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=56 > > -Wno-unused-variable -Wno-unused-parameter > > -Wno-unused-but-set-variable -Wno-uninitialized -O2 > > -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test > > > > in libffi testsuite. > > > > OK for master and backports to GCC 8/9 branches? > > > > gcc/ > > > > PR target/85667 > > * config/i386/i386.c (function_arg_ms_64): Add a type argument. > > Don't return aggregates with only SFmode and DFmode in SSE > > register. > > (ix86_function_arg): Pass arg.type to function_arg_ms_64. > > > > gcc/testsuite/ > > > > PR target/85667 > > * gcc.target/i386/pr85667-10.c: New test. > > * gcc.target/i386/pr85667-7.c: Likewise. > > * gcc.target/i386/pr85667-8.c: Likewise. > > * gcc.target/i386/pr85667-9.c: Likewise. > > LGTM, but should really be reviewed by cygwin, mingw-w64 maintainer (CC'd). > I checked the result against MSVC v19.10 at https://godbolt.org/z/2NPygd My patch matches MSVC v19.10. I am checking it in tomorrow unless mingw-w64 maintainer objects. Thanks. H.J.
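For reference, a minimal illustration (not one of the pr85667-* tests; the names are made up) of the kind of aggregate whose passing convention is being compared against MSVC and Clang here:

/* Under the MS ABI this 8-byte aggregate, even though it contains only
   floats, must be passed in an integer register rather than in an SSE
   register.  */
struct twofloats { float x, y; };

extern void __attribute__ ((ms_abi)) callee (struct twofloats);

void __attribute__ ((ms_abi))
caller (struct twofloats a)
{
  callee (a);
}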
PING^6: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move
On Mon, Jan 27, 2020 at 10:59 AM H.J. Lu wrote: > > On Mon, Jul 8, 2019 at 8:19 AM H.J. Lu wrote: > > > > On Tue, Jun 18, 2019 at 8:59 AM H.J. Lu wrote: > > > > > > On Fri, May 31, 2019 at 10:38 AM H.J. Lu wrote: > > > > > > > > On Tue, May 21, 2019 at 2:43 PM H.J. Lu wrote: > > > > > > > > > > On Fri, Feb 22, 2019 at 8:25 AM H.J. Lu wrote: > > > > > > > > > > > > Hi Jan, Uros, > > > > > > > > > > > > This patch fixes the wrong code bug: > > > > > > > > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89229 > > > > > > > > > > > > Tested on AVX2 and AVX512 with and without --with-arch=native. > > > > > > > > > > > > OK for trunk? > > > > > > > > > > > > Thanks. > > > > > > > > > > > > H.J. > > > > > > -- > > > > > > i386 backend has > > > > > > > > > > > > INT_MODE (OI, 32); > > > > > > INT_MODE (XI, 64); > > > > > > > > > > > > So, XI_MODE represents 64 INTEGER bytes = 64 * 8 = 512 bit > > > > > > operation, > > > > > > in case of const_1, all 512 bits set. > > > > > > > > > > > > We can load zeros with narrower instruction, (e.g. 256 bit by > > > > > > inherent > > > > > > zeroing of highpart in case of 128 bit xor), so TImode in this case. > > > > > > > > > > > > Some targets prefer V4SF mode, so they will emit float xorps for > > > > > > zeroing. > > > > > > > > > > > > sse.md has > > > > > > > > > > > > (define_insn "mov_internal" > > > > > > [(set (match_operand:VMOVE 0 "nonimmediate_operand" > > > > > > "=v,v ,v ,m") > > > > > > (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand" > > > > > > " C,BC,vm,v"))] > > > > > > > > > > > > /* There is no evex-encoded vmov* for sizes smaller than > > > > > > 64-bytes > > > > > > in avx512f, so we need to use workarounds, to access sse > > > > > > registers > > > > > > 16-31, which are evex-only. In avx512vl we don't need > > > > > > workarounds. */ > > > > > > if (TARGET_AVX512F && < 64 && !TARGET_AVX512VL > > > > > > && (EXT_REX_SSE_REG_P (operands[0]) > > > > > > || EXT_REX_SSE_REG_P (operands[1]))) > > > > > > { > > > > > > if (memory_operand (operands[0], mode)) > > > > > > { > > > > > > if ( == 32) > > > > > > return "vextract64x4\t{$0x0, %g1, > > > > > > %0|%0, %g1, 0x0}"; > > > > > > else if ( == 16) > > > > > > return "vextract32x4\t{$0x0, %g1, > > > > > > %0|%0, %g1, 0x0}"; > > > > > > else > > > > > > gcc_unreachable (); > > > > > > } > > > > > > ... > > > > > > > > > > > > However, since ix86_hard_regno_mode_ok has > > > > > > > > > > > > /* TODO check for QI/HI scalars. */ > > > > > > /* AVX512VL allows sse regs16+ for 128/256 bit modes. */ > > > > > > if (TARGET_AVX512VL > > > > > > && (mode == OImode > > > > > > || mode == TImode > > > > > > || VALID_AVX256_REG_MODE (mode) > > > > > > || VALID_AVX512VL_128_REG_MODE (mode))) > > > > > > return true; > > > > > > > > > > > > /* xmm16-xmm31 are only available for AVX-512. */ > > > > > > if (EXT_REX_SSE_REGNO_P (regno)) > > > > > > return false; > > > > > > > > > > > > if (TARGET_AVX512F && < 64 && !TARGET_AVX512VL > > > > >
Re: [PATCH] x86-64: Pass aggregates with only float/double in GPRs for MS_ABI
On Fri, Feb 7, 2020 at 2:14 AM JonY <10wa...@gmail.com> wrote: > > On 2/7/20 3:23 AM, H.J. Lu wrote: > > On Wed, Feb 05, 2020 at 09:51:14PM +0100, Uros Bizjak wrote: > >> On Wed, Feb 5, 2020 at 6:59 PM H.J. Lu wrote: > >>> > >>> MS_ABI requires passing aggregates with only float/double in integer > >>> registers. Checked gcc outputs against Clang and fixed: > >>> > >>> FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=54 > >>> -Wno-unused-variable -Wno-unused-parameter > >>> -Wno-unused-but-set-variable -Wno-uninitialized -O0 > >>> -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test > >>> FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=54 > >>> -Wno-unused-variable -Wno-unused-parameter > >>> -Wno-unused-but-set-variable -Wno-uninitialized -O2 > >>> -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test > >>> FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=55 > >>> -Wno-unused-variable -Wno-unused-parameter > >>> -Wno-unused-but-set-variable -Wno-uninitialized -O0 > >>> -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test > >>> FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=55 > >>> -Wno-unused-variable -Wno-unused-parameter > >>> -Wno-unused-but-set-variable -Wno-uninitialized -O2 > >>> -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test > >>> FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=56 > >>> -Wno-unused-variable -Wno-unused-parameter > >>> -Wno-unused-but-set-variable -Wno-uninitialized -O0 > >>> -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test > >>> FAIL: libffi.bhaible/test-callback.c -W -Wall -Wno-psabi -DDGTEST=56 > >>> -Wno-unused-variable -Wno-unused-parameter > >>> -Wno-unused-but-set-variable -Wno-uninitialized -O2 > >>> -DABI_NUM=FFI_GNUW64 -DABI_ATTR=MSABI execution test > >>> > >>> in libffi testsuite. > >>> > >>> OK for master and backports to GCC 8/9 branches? > >>> > >>> gcc/ > >>> > >>> PR target/85667 > >>> * config/i386/i386.c (function_arg_ms_64): Add a type argument. > >>> Don't return aggregates with only SFmode and DFmode in SSE > >>> register. > >>> (ix86_function_arg): Pass arg.type to function_arg_ms_64. > >>> > >>> gcc/testsuite/ > >>> > >>> PR target/85667 > >>> * gcc.target/i386/pr85667-10.c: New test. > >>> * gcc.target/i386/pr85667-7.c: Likewise. > >>> * gcc.target/i386/pr85667-8.c: Likewise. > >>> * gcc.target/i386/pr85667-9.c: Likewise. > >> > >> LGTM, but should really be reviewed by cygwin, mingw-w64 maintainer (CC'd). > >> > > > > I checked the result against MSVC v19.10 at > > > > https://godbolt.org/z/2NPygd > > > > My patch matches MSVC v19.10. I am checking it in tomorrow unless > > mingw-w64 maintainer objects. > > > > Please go ahead and thanks. > I checked it into master and backported it to releases/gcc-9 branch. No plan to fix releases/gcc-8 branch. Thanks. -- H.J.
[PATCH] i386: Properly pop restore token in signal frame
Linux CET kernel places a restore token on shadow stack for signal handler to enhance security. The restore token is 8 byte and aligned to 8 bytes. It is usually transparent to user programs since kernel will pop the restore token when signal handler returns. But when an exception is thrown from a signal handler, now we need to pop the restore token from shadow stack. For x86-64, we just need to treat the signal frame as normal frame. For i386, we need to search for the restore token to check if the original shadow stack is 8 byte aligned. If the original shadow stack is 8 byte aligned, we just need to pop 2 slots, one restore token, from shadow stack. Otherwise, we need to pop 3 slots, one restore token + 4 byte pading, from shadow stack. This patch also includes 2 tests, one has a restore token with 4 byte padding and one without. Tested on Linux/x86-64 CET machine with and without -m32. OK for master and backport to GCC 8/9 branches? Thanks. H.J. --- libgcc/ PR libgcc/85334 * config/i386/shadow-stack-unwind.h (_Unwind_Frames_Increment): New. gcc/testsuite/ PR libgcc/85334 * g++.target/i386/pr85334-1.C: New test. * g++.target/i386/pr85334-2.C: Likewise. --- gcc/testsuite/g++.target/i386/pr85334-1.C | 55 +++ gcc/testsuite/g++.target/i386/pr85334-2.C | 48 libgcc/config/i386/shadow-stack-unwind.h | 43 ++ 3 files changed, 146 insertions(+) create mode 100644 gcc/testsuite/g++.target/i386/pr85334-1.C create mode 100644 gcc/testsuite/g++.target/i386/pr85334-2.C diff --git a/gcc/testsuite/g++.target/i386/pr85334-1.C b/gcc/testsuite/g++.target/i386/pr85334-1.C new file mode 100644 index 000..3c5ccad1714 --- /dev/null +++ b/gcc/testsuite/g++.target/i386/pr85334-1.C @@ -0,0 +1,55 @@ +// { dg-do run } +// { dg-require-effective-target cet } +// { dg-additional-options "-fexceptions -fnon-call-exceptions -fcf-protection" } + +// Delta between numbers of call stacks of pr85334-1.C and pr85334-2.C is 1. + +#include +#include + +void sighandler (int signo, siginfo_t * si, void * uc) +{ + throw (5); +} + +char * +__attribute ((noinline, noclone)) +dosegv () +{ + * ((volatile int *)0) = 12; + return 0; +} + +int +__attribute ((noinline, noclone)) +func2 () +{ + try { +dosegv (); + } + catch (int x) { +return (x != 5); + } + return 1; +} + +int +__attribute ((noinline, noclone)) +func1 () +{ + return func2 (); +} + +int main () +{ + struct sigaction sa; + int status; + + sa.sa_sigaction = sighandler; + sa.sa_flags = SA_SIGINFO; + + status = sigaction (SIGSEGV, & sa, NULL); + status = sigaction (SIGBUS, & sa, NULL); + + return func1 (); +} diff --git a/gcc/testsuite/g++.target/i386/pr85334-2.C b/gcc/testsuite/g++.target/i386/pr85334-2.C new file mode 100644 index 000..e2b5afe78cb --- /dev/null +++ b/gcc/testsuite/g++.target/i386/pr85334-2.C @@ -0,0 +1,48 @@ +// { dg-do run } +// { dg-require-effective-target cet } +// { dg-additional-options "-fexceptions -fnon-call-exceptions -fcf-protection" } + +// Delta between numbers of call stacks of pr85334-1.C and pr85334-2.C is 1. 
+ +#include +#include + +void sighandler (int signo, siginfo_t * si, void * uc) +{ + throw (5); +} + +char * +__attribute ((noinline, noclone)) +dosegv () +{ + * ((volatile int *)0) = 12; + return 0; +} + +int +__attribute ((noinline, noclone)) +func1 () +{ + try { +dosegv (); + } + catch (int x) { +return (x != 5); + } + return 1; +} + +int main () +{ + struct sigaction sa; + int status; + + sa.sa_sigaction = sighandler; + sa.sa_flags = SA_SIGINFO; + + status = sigaction (SIGSEGV, & sa, NULL); + status = sigaction (SIGBUS, & sa, NULL); + + return func1 (); +} diff --git a/libgcc/config/i386/shadow-stack-unwind.h b/libgcc/config/i386/shadow-stack-unwind.h index a0244d2ee70..a5f9235b146 100644 --- a/libgcc/config/i386/shadow-stack-unwind.h +++ b/libgcc/config/i386/shadow-stack-unwind.h @@ -49,3 +49,46 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see } \ } \ while (0) + +/* Linux CET kernel places a restore token on shadow stack for signal + handler to enhance security. The restore token is 8 byte and aligned + to 8 bytes. It is usually transparent to user programs since kernel + will pop the restore token when signal handler returns. But when an + exception is thrown from a signal handler, now we need to pop the + restore token from shadow stack. For x86-64, we just need to treat + the signal frame as normal frame. For i386, we need to search for + the restore token to check if the original shadow stack is 8 byte + aligned. If the original shadow stack is 8 byte aligned, we just + need to pop 2 slots, one restore token, from shadow stack. Otherwise, + we need to pop 3 slots, one restore token + 4
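To restate the i386 rule from the commit message in code form, a rough sketch only; this is not the actual _Unwind_Frames_Increment implementation (which is in the hunk above), and the function name is made up:

/* Shadow-stack slots are 4 bytes on i386; the restore token is 8 bytes
   and 8-byte aligned.  If the interrupted shadow stack pointer was
   already 8-byte aligned, undoing the token costs 2 slots; otherwise
   the kernel also added 4 bytes of padding, so it costs 3 slots.  */
static unsigned int
i386_signal_frame_extra_slots (unsigned int original_ssp)
{
  return (original_ssp % 8 == 0) ? 2 : 3;
}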
[PATCH] i386: Skip ENDBR32 at nested function entry
Since nested function isn't only called directly, there is ENDBR32 at function entry and we need to skip it for direct jump in trampoline. Tested on Linux/x86-64 CET machine with and without -m32. gcc/ PR target/93656 * config/i386/i386.c (ix86_trampoline_init): Skip ENDBR32 at nested function entry. gcc/testsuite/ PR target/93656 * gcc.target/i386/pr93656.c: New test. --- gcc/config/i386/i386.c | 7 ++- gcc/testsuite/gcc.target/i386/pr93656.c | 4 2 files changed, 10 insertions(+), 1 deletion(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr93656.c diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 44bc0e0176a..dbcae244acb 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -16839,9 +16839,14 @@ ix86_trampoline_init (rtx m_tramp, tree fndecl, rtx chain_value) the stack, we need to skip the first insn which pushes the (call-saved) register static chain; this push is 1 byte. */ offset += 5; + int skip = MEM_P (chain) ? 1 : 0; + /* Since nested function isn't only called directly, there is +ENDBR32 at function entry and we need to skip it. */ + if (need_endbr) + skip += 4; disp = expand_binop (SImode, sub_optab, fnaddr, plus_constant (Pmode, XEXP (m_tramp, 0), - offset - (MEM_P (chain) ? 1 : 0)), + offset - skip), NULL_RTX, 1, OPTAB_DIRECT); emit_move_insn (mem, disp); } diff --git a/gcc/testsuite/gcc.target/i386/pr93656.c b/gcc/testsuite/gcc.target/i386/pr93656.c new file mode 100644 index 000..f0ac8c8edaa --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr93656.c @@ -0,0 +1,4 @@ +/* { dg-do run { target { ia32 && cet } } } */ +/* { dg-options "-O2 -fcf-protection" } */ + +#include "pr67770.c" -- 2.24.1
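For readers not familiar with ix86_trampoline_init, the arithmetic being adjusted boils down to the following sketch (hypothetical helper with made-up names; it only models the displacement of the trampoline's 5-byte "jmp rel32"):

#include <stdint.h>

/* The ia32 trampoline ends with a direct "jmp rel32".  Its displacement is
   relative to the end of that jmp and must land past any leading bytes of
   the target we want to skip: the 1-byte push of the static-chain register
   when the chain is stored on the stack, and now also the 4-byte ENDBR32
   emitted when the nested function may be called indirectly.  */
static int32_t
tramp_jmp_disp (uintptr_t fnaddr, uintptr_t jmp_end,
		int chain_on_stack, int has_endbr32)
{
  uintptr_t target = fnaddr;

  if (chain_on_stack)
    target += 1;	/* skip "push %reg" of the static chain register  */
  if (has_endbr32)
    target += 4;	/* skip "endbr32" (f3 0f 1e fb)  */

  return (int32_t) (target - jmp_end);
}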
Re: [PATCH] i386: Skip ENDBR32 at nested function entry
On Mon, Feb 10, 2020 at 11:40 AM Uros Bizjak wrote: > > On Mon, Feb 10, 2020 at 8:22 PM H.J. Lu wrote: > > > > Since nested function isn't only called directly, there is ENDBR32 at > > function entry and we need to skip it for direct jump in trampoline. > > Hm, I'm afraid I don't understand this comment. Can you perhaps rephrase it? > ix86_trampoline_init has /* Compute offset from the end of the jmp to the target function. In the case in which the trampoline stores the static chain on the stack, we need to skip the first insn which pushes the (call-saved) register static chain; this push is 1 byte. */ offset += 5; disp = expand_binop (SImode, sub_optab, fnaddr, plus_constant (Pmode, XEXP (m_tramp, 0), offset - (MEM_P (chain) ? 1 : 0)), NULL_RTX, 1, OPTAB_DIRECT); emit_move_insn (mem, disp); Without CET, we got 011 : 11: 56push %esi 12: 55push %ebp <<<<<< trampoline jumps here. 13: 89 e5mov%esp,%ebp 15: 83 ec 08 sub$0x8,%esp With CET, if bar isn't only called directly, we got 0015 : 15: f3 0f 1e fb endbr32 19: 56push %esi 1a: 55push %ebp <<<<<<<< trampoline jumps here. 1b: 89 e5mov%esp,%ebp 1d: 83 ec 08 sub$0x8,%esp We need to add 4 bytes for trampoline to skip endbr32. Here is the updated patch to check if nested function isn't only called directly, -- H.J. From 10ffeb41d1cdbd42f19ba08fdd6ce4a9913a5b5b Mon Sep 17 00:00:00 2001 From: "H.J. Lu" Date: Mon, 10 Feb 2020 11:10:52 -0800 Subject: [PATCH] i386: Skip ENDBR32 at nested function entry If nested function isn't only called directly, there is ENDBR32 at function entry and we need to skip it for direct jump in trampoline. Tested on Linux/x86-64 CET machine with and without -m32. gcc/ PR target/93656 * config/i386/i386.c (ix86_trampoline_init): Skip ENDBR32 at nested function entry. gcc/testsuite/ PR target/93656 * gcc.target/i386/pr93656.c: New test. --- gcc/config/i386/i386.c | 8 +++- gcc/testsuite/gcc.target/i386/pr93656.c | 4 2 files changed, 11 insertions(+), 1 deletion(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr93656.c diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 44bc0e0176a..bc494ec19b6 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -16839,9 +16839,15 @@ ix86_trampoline_init (rtx m_tramp, tree fndecl, rtx chain_value) the stack, we need to skip the first insn which pushes the (call-saved) register static chain; this push is 1 byte. */ offset += 5; + int skip = MEM_P (chain) ? 1 : 0; + /* If nested function isn't only called directly, there is ENDBR32 + at function entry and we need to skip it. */ + if (need_endbr + && !cgraph_node::get (fndecl)->only_called_directly_p ()) + skip += 4; disp = expand_binop (SImode, sub_optab, fnaddr, plus_constant (Pmode, XEXP (m_tramp, 0), - offset - (MEM_P (chain) ? 1 : 0)), + offset - skip), NULL_RTX, 1, OPTAB_DIRECT); emit_move_insn (mem, disp); } diff --git a/gcc/testsuite/gcc.target/i386/pr93656.c b/gcc/testsuite/gcc.target/i386/pr93656.c new file mode 100644 index 000..f0ac8c8edaa --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr93656.c @@ -0,0 +1,4 @@ +/* { dg-do run { target { ia32 && cet } } } */ +/* { dg-options "-O2 -fcf-protection" } */ + +#include "pr67770.c" -- 2.24.1
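For concreteness, the situation arises with code along these lines (a hypothetical example; the actual test simply reuses gcc.target/i386/pr67770.c):

/* Build with -m32 -O2 -fcf-protection.  Because the nested function's
   address escapes, it is not only called directly, so pass_insert_endbranch
   puts ENDBR32 at its entry; the run-time trampoline built for "nested"
   still reaches it with a direct jmp, which must now skip those 4 bytes.  */
extern void use (int (*) (int));

int
outer (int x)
{
  int nested (int y) { return x + y; }	/* GNU C nested function  */

  use (nested);				/* address taken: trampoline created  */
  return nested (1);
}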
Re: [PATCH] i386: Skip ENDBR32 at nested function entry
On Mon, Feb 10, 2020 at 12:01 PM Uros Bizjak wrote: > > On Mon, Feb 10, 2020 at 8:53 PM H.J. Lu wrote: > > > > On Mon, Feb 10, 2020 at 11:40 AM Uros Bizjak wrote: > > > > > > On Mon, Feb 10, 2020 at 8:22 PM H.J. Lu wrote: > > > > > > > > Since nested function isn't only called directly, there is ENDBR32 at > > > > function entry and we need to skip it for direct jump in trampoline. > > > > > > Hm, I'm afraid I don't understand this comment. Can you perhaps rephrase > > > it? > > > > > > > ix86_trampoline_init has > > > > /* Compute offset from the end of the jmp to the target function. > > In the case in which the trampoline stores the static chain on > > the stack, we need to skip the first insn which pushes the > > (call-saved) register static chain; this push is 1 byte. */ > > offset += 5; > > disp = expand_binop (SImode, sub_optab, fnaddr, > >plus_constant (Pmode, XEXP (m_tramp, 0), > > offset - (MEM_P (chain) ? 1 : 0)), > >NULL_RTX, 1, OPTAB_DIRECT); > > emit_move_insn (mem, disp); > > > > Without CET, we got > > > > 011 : > > 11: 56push %esi > > 12: 55push %ebp <<<<<< trampoline jumps here. > > 13: 89 e5mov%esp,%ebp > > 15: 83 ec 08 sub$0x8,%esp > > > > With CET, if bar isn't only called directly, we got > > > > 0015 : > > 15: f3 0f 1e fb endbr32 > > 19: 56push %esi > > 1a: 55push %ebp <<<<<<<< trampoline jumps here. > > 1b: 89 e5mov%esp,%ebp > > 1d: 83 ec 08 sub$0x8,%esp > > > > We need to add 4 bytes for trampoline to skip endbr32. > > > > Here is the updated patch to check if nested function isn't only > > called directly, > > Please figure out the final patch. I don't want to waste my time > reviewing different version every half hour. Ping me in a couple of > days. This is the final version: https://gcc.gnu.org/ml/gcc-patches/2020-02/msg00586.html You can try the testcase in the patch on any machine with CET binutils since ENDBR32 is nop on none-CET machines. Without this patch, the test will fail. Thanks. -- H.J.
[PATCH] i386: Also skip ENDBR32 at the target function entry
On Thu, Feb 13, 2020 at 09:29:32AM +0100, Uros Bizjak wrote: > On Wed, Feb 12, 2020 at 1:21 PM H.J. Lu wrote: > > > > On Mon, Feb 10, 2020 at 12:01 PM Uros Bizjak wrote: > > > > > > On Mon, Feb 10, 2020 at 8:53 PM H.J. Lu wrote: > > > > > > > > On Mon, Feb 10, 2020 at 11:40 AM Uros Bizjak wrote: > > > > > > > > > > On Mon, Feb 10, 2020 at 8:22 PM H.J. Lu wrote: > > > > > > > > > > > > Since nested function isn't only called directly, there is ENDBR32 > > > > > > at > > > > > > function entry and we need to skip it for direct jump in trampoline. > > > > > > > > > > Hm, I'm afraid I don't understand this comment. Can you perhaps > > > > > rephrase it? > > > > > > > > > > > > > ix86_trampoline_init has > > > > > > > > /* Compute offset from the end of the jmp to the target function. > > > > In the case in which the trampoline stores the static chain on > > > > the stack, we need to skip the first insn which pushes the > > > > (call-saved) register static chain; this push is 1 byte. */ > > > > offset += 5; > > > > disp = expand_binop (SImode, sub_optab, fnaddr, > > > >plus_constant (Pmode, XEXP (m_tramp, 0), > > > > offset - (MEM_P (chain) ? 1 : > > > > 0)), > > > >NULL_RTX, 1, OPTAB_DIRECT); > > > > emit_move_insn (mem, disp); > > > > > > > > Without CET, we got > > > > > > > > 011 : > > > > 11: 56push %esi > > > > 12: 55push %ebp <<<<<< trampoline jumps here. > > > > 13: 89 e5mov%esp,%ebp > > > > 15: 83 ec 08 sub$0x8,%esp > > > > > > > > With CET, if bar isn't only called directly, we got > > > > > > > > 0015 : > > > > 15: f3 0f 1e fb endbr32 > > > > 19: 56push %esi > > > > 1a: 55push %ebp <<<<<<<< trampoline jumps > > > > here. > > > > 1b: 89 e5mov%esp,%ebp > > > > 1d: 83 ec 08 sub$0x8,%esp > > > > > > > > We need to add 4 bytes for trampoline to skip endbr32. > > > > > > > > Here is the updated patch to check if nested function isn't only > > > > called directly, > > > > > > Please figure out the final patch. I don't want to waste my time > > > reviewing different version every half hour. Ping me in a couple of > > > days. > > > > This is the final version: > > > > https://gcc.gnu.org/ml/gcc-patches/2020-02/msg00586.html > > > > You can try the testcase in the patch on any machine with CET binutils > > since ENDBR32 is nop on none-CET machines. Without this patch, > > the test will fail. > > Please rephrase the comment. I don't understand what it tries to say. > Here is the patch with updated comments. OK for master and 8/9 branches? Thanks. H.J. --- Since pass_insert_endbranch inserts ENDBR32 at entry of the target function if it may be called indirectly, we also need to skip the 4-byte ENDBR32 for direct jump in trampoline if it is the case. Tested on Linux/x86-64 CET machine with and without -m32. gcc/ PR target/93656 * config/i386/i386.c (ix86_trampoline_init): Skip ENDBR32 at the target function entry. gcc/testsuite/ PR target/93656 * gcc.target/i386/pr93656.c: New test. --- gcc/config/i386/i386.c | 9 - gcc/testsuite/gcc.target/i386/pr93656.c | 4 2 files changed, 12 insertions(+), 1 deletion(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr93656.c diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 44bc0e0176a..52640b74cc8 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -16839,9 +16839,16 @@ ix86_trampoline_init (rtx m_tramp, tree fndecl, rtx chain_value) the stack, we need to skip the first insn which pushes the (call-saved) register static chain; this push is 1 byte. */ offset += 5; + int skip = MEM_P (chain) ? 
1 : 0; + /* Since pass_insert_endbranch inserts ENDBR32 at entry of the +target function if it may be called indirectly, we also need +to skip
Re: [PATCH] i386: Also skip ENDBR32 at the target function entry
On Thu, Feb 13, 2020 at 01:28:43PM +0100, Uros Bizjak wrote: > On Thu, Feb 13, 2020 at 1:06 PM H.J. Lu wrote: > > > > On Thu, Feb 13, 2020 at 09:29:32AM +0100, Uros Bizjak wrote: > > > On Wed, Feb 12, 2020 at 1:21 PM H.J. Lu wrote: > > > > > > > > On Mon, Feb 10, 2020 at 12:01 PM Uros Bizjak wrote: > > > > > > > > > > On Mon, Feb 10, 2020 at 8:53 PM H.J. Lu wrote: > > > > > > > > > > > > On Mon, Feb 10, 2020 at 11:40 AM Uros Bizjak > > > > > > wrote: > > > > > > > > > > > > > > On Mon, Feb 10, 2020 at 8:22 PM H.J. Lu > > > > > > > wrote: > > > > > > > > > > > > > > > > Since nested function isn't only called directly, there is > > > > > > > > ENDBR32 at > > > > > > > > function entry and we need to skip it for direct jump in > > > > > > > > trampoline. > > > > > > > > > > > > > > Hm, I'm afraid I don't understand this comment. Can you perhaps > > > > > > > rephrase it? > > > > > > > > > > > > > > > > > > > ix86_trampoline_init has > > > > > > > > > > > > /* Compute offset from the end of the jmp to the target > > > > > > function. > > > > > > In the case in which the trampoline stores the static > > > > > > chain on > > > > > > the stack, we need to skip the first insn which pushes the > > > > > > (call-saved) register static chain; this push is 1 byte. > > > > > > */ > > > > > > offset += 5; > > > > > > disp = expand_binop (SImode, sub_optab, fnaddr, > > > > > >plus_constant (Pmode, XEXP (m_tramp, 0), > > > > > > offset - (MEM_P (chain) ? > > > > > > 1 : 0)), > > > > > >NULL_RTX, 1, OPTAB_DIRECT); > > > > > > emit_move_insn (mem, disp); > > > > > > > > > > > > Without CET, we got > > > > > > > > > > > > 011 : > > > > > > 11: 56push %esi > > > > > > 12: 55push %ebp <<<<<< trampoline jumps > > > > > > here. > > > > > > 13: 89 e5mov%esp,%ebp > > > > > > 15: 83 ec 08 sub$0x8,%esp > > > > > > > > > > > > With CET, if bar isn't only called directly, we got > > > > > > > > > > > > 0015 : > > > > > > 15: f3 0f 1e fb endbr32 > > > > > > 19: 56push %esi > > > > > > 1a: 55push %ebp <<<<<<<< trampoline jumps > > > > > > here. > > > > > > 1b: 89 e5mov%esp,%ebp > > > > > > 1d: 83 ec 08 sub$0x8,%esp > > > > > > > > > > > > We need to add 4 bytes for trampoline to skip endbr32. > > > > > > > > > > > > Here is the updated patch to check if nested function isn't only > > > > > > called directly, > > > > > > > > > > Please figure out the final patch. I don't want to waste my time > > > > > reviewing different version every half hour. Ping me in a couple of > > > > > days. > > > > > > > > This is the final version: > > > > > > > > https://gcc.gnu.org/ml/gcc-patches/2020-02/msg00586.html > > > > > > > > You can try the testcase in the patch on any machine with CET binutils > > > > since ENDBR32 is nop on none-CET machines. Without this patch, > > > > the test will fail. > > > > > > Please rephrase the comment. I don't understand what it tries to say. > > > > > > > Here is the patch with updated comments. OK for master and 8/9 branches? > > > > Thanks. > > > > H.J. > > --- > > Since pass_insert_endbranch inserts ENDBR32 at entry of the target > > function if it may be called indirectly, we also need to skip the > > 4-byte
PING^7: [PATCH] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move
On Thu, Feb 6, 2020 at 8:17 PM H.J. Lu wrote: > > On Mon, Jan 27, 2020 at 10:59 AM H.J. Lu wrote: > > > > On Mon, Jul 8, 2019 at 8:19 AM H.J. Lu wrote: > > > > > > On Tue, Jun 18, 2019 at 8:59 AM H.J. Lu wrote: > > > > > > > > On Fri, May 31, 2019 at 10:38 AM H.J. Lu wrote: > > > > > > > > > > On Tue, May 21, 2019 at 2:43 PM H.J. Lu wrote: > > > > > > > > > > > > On Fri, Feb 22, 2019 at 8:25 AM H.J. Lu > > > > > > wrote: > > > > > > > > > > > > > > Hi Jan, Uros, > > > > > > > > > > > > > > This patch fixes the wrong code bug: > > > > > > > > > > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89229 > > > > > > > > > > > > > > Tested on AVX2 and AVX512 with and without --with-arch=native. > > > > > > > > > > > > > > OK for trunk? > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > H.J. > > > > > > > -- > > > > > > > i386 backend has > > > > > > > > > > > > > > INT_MODE (OI, 32); > > > > > > > INT_MODE (XI, 64); > > > > > > > > > > > > > > So, XI_MODE represents 64 INTEGER bytes = 64 * 8 = 512 bit > > > > > > > operation, > > > > > > > in case of const_1, all 512 bits set. > > > > > > > > > > > > > > We can load zeros with narrower instruction, (e.g. 256 bit by > > > > > > > inherent > > > > > > > zeroing of highpart in case of 128 bit xor), so TImode in this > > > > > > > case. > > > > > > > > > > > > > > Some targets prefer V4SF mode, so they will emit float xorps for > > > > > > > zeroing. > > > > > > > > > > > > > > sse.md has > > > > > > > > > > > > > > (define_insn "mov_internal" > > > > > > > [(set (match_operand:VMOVE 0 "nonimmediate_operand" > > > > > > > "=v,v ,v ,m") > > > > > > > (match_operand:VMOVE 1 "nonimmediate_or_sse_const_operand" > > > > > > > " C,BC,vm,v"))] > > > > > > > > > > > > > > /* There is no evex-encoded vmov* for sizes smaller than > > > > > > > 64-bytes > > > > > > > in avx512f, so we need to use workarounds, to access sse > > > > > > > registers > > > > > > > 16-31, which are evex-only. In avx512vl we don't need > > > > > > > workarounds. */ > > > > > > > if (TARGET_AVX512F && < 64 && !TARGET_AVX512VL > > > > > > > && (EXT_REX_SSE_REG_P (operands[0]) > > > > > > > || EXT_REX_SSE_REG_P (operands[1]))) > > > > > > > { > > > > > > > if (memory_operand (operands[0], mode)) > > > > > > > { > > > > > > > if ( == 32) > > > > > > > return "vextract64x4\t{$0x0, %g1, > > > > > > > %0|%0, %g1, 0x0}"; > > > > > > > else if ( == 16) > > > > > > > return "vextract32x4\t{$0x0, %g1, > > > > > > > %0|%0, %g1, 0x0}"; > > > > > > > else > > > > > > > gcc_unreachable (); > > > > > > > } > > > > > > > ... > > > > > > > > > > > > > > However, since ix86_hard_regno_mode_ok has > > > > > > > > > > > > > > /* TODO check for QI/HI scalars. */ > > > > > > > /* AVX512VL allows sse regs16+ for 128/256 bit modes. */ > > > > > > > if (TARGET_AVX512VL > > > > > > > && (mode == OImode > > > > > > > || mode == TImode > > > > > > > || VALID_AVX256_REG_MODE (mode
Re: [PATCH] i386: Also skip ENDBR32 at the target function entry
On Thu, Feb 13, 2020 at 5:10 AM Uros Bizjak wrote: > > On Thu, Feb 13, 2020 at 1:42 PM H.J. Lu wrote: > > > > On Thu, Feb 13, 2020 at 01:28:43PM +0100, Uros Bizjak wrote: > > > On Thu, Feb 13, 2020 at 1:06 PM H.J. Lu wrote: > > > > > > > > On Thu, Feb 13, 2020 at 09:29:32AM +0100, Uros Bizjak wrote: > > > > > On Wed, Feb 12, 2020 at 1:21 PM H.J. Lu wrote: > > > > > > > > > > > > On Mon, Feb 10, 2020 at 12:01 PM Uros Bizjak > > > > > > wrote: > > > > > > > > > > > > > > On Mon, Feb 10, 2020 at 8:53 PM H.J. Lu > > > > > > > wrote: > > > > > > > > > > > > > > > > On Mon, Feb 10, 2020 at 11:40 AM Uros Bizjak > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > On Mon, Feb 10, 2020 at 8:22 PM H.J. Lu > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > Since nested function isn't only called directly, there is > > > > > > > > > > ENDBR32 at > > > > > > > > > > function entry and we need to skip it for direct jump in > > > > > > > > > > trampoline. > > > > > > > > > > > > > > > > > > Hm, I'm afraid I don't understand this comment. Can you > > > > > > > > > perhaps rephrase it? > > > > > > > > > > > > > > > > > > > > > > > > > ix86_trampoline_init has > > > > > > > > > > > > > > > > /* Compute offset from the end of the jmp to the target > > > > > > > > function. > > > > > > > > In the case in which the trampoline stores the static > > > > > > > > chain on > > > > > > > > the stack, we need to skip the first insn which pushes > > > > > > > > the > > > > > > > > (call-saved) register static chain; this push is 1 > > > > > > > > byte. */ > > > > > > > > offset += 5; > > > > > > > > disp = expand_binop (SImode, sub_optab, fnaddr, > > > > > > > >plus_constant (Pmode, XEXP (m_tramp, > > > > > > > > 0), > > > > > > > > offset - (MEM_P > > > > > > > > (chain) ? 1 : 0)), > > > > > > > >NULL_RTX, 1, OPTAB_DIRECT); > > > > > > > > emit_move_insn (mem, disp); > > > > > > > > > > > > > > > > Without CET, we got > > > > > > > > > > > > > > > > 011 : > > > > > > > > 11: 56push %esi > > > > > > > > 12: 55push %ebp <<<<<< trampoline > > > > > > > > jumps here. > > > > > > > > 13: 89 e5mov%esp,%ebp > > > > > > > > 15: 83 ec 08 sub$0x8,%esp > > > > > > > > > > > > > > > > With CET, if bar isn't only called directly, we got > > > > > > > > > > > > > > > > 0015 : > > > > > > > > 15: f3 0f 1e fb endbr32 > > > > > > > > 19: 56push %esi > > > > > > > > 1a: 55push %ebp <<<<<<<< trampoline > > > > > > > > jumps here. > > > > > > > > 1b: 89 e5mov%esp,%ebp > > > > > > > > 1d: 83 ec 08 sub$0x8,%esp > > > > > > > > > > > > > > > > We need to add 4 bytes for trampoline to skip endbr32. > > > > > > > > > > > > > > > > Here is the updated patch to check if nested function isn't only > > > > > > > > called directly, > > > > > > > > > > > > > > Please figure out the final patch. I don't want to waste my time > > > > > > > reviewing dif
Re: Backports to 9.3
On Thu, Feb 13, 2020 at 2:46 PM Jakub Jelinek wrote: > > Hi! > > I've backported the following 15 commits from trunk to the 9.3 branch, > bootstrapped/regtested on x86_64-linux and i686-linux, committed. > Hi Jakub, Are you preparing for GCC 9.3? I'd like to include this in GCC 9.3: https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=1d69147af203d4dcd2270429f90c93f1a37ddfff It is very safe. Uros asked me to wait for a week before backporting to the GCC 9 branch. I am planning to do it next Thursday. Thanks. -- H.J.
Re: Backports to 9.3
On Fri, Feb 14, 2020 at 7:51 AM Jakub Jelinek wrote: > > On Fri, Feb 14, 2020 at 07:45:43AM -0800, H.J. Lu wrote: > > Are you preparing for GCC 9.3? I'd like to include this in GCC 9.3: > > > > https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=1d69147af203d4dcd2270429f90c93f1a37ddfff > > > > It is very safe. Uros asked me to wait for a week before backporting to > > the GCC 9 branch. I am planning to do it next Thursday. > > Richi wants to do 8.4 first, am backporting a lot of patches to that now. > I'd say we should aim for 8.4 rc next week or say on Monday 24th I am planning to backport it to both the GCC 8 and 9 branches next Thursday. I think it will be fine. > and release a week after that and 9.3 maybe one week later than that. > > Jakub > Thanks. -- H.J.
[PATCH 02/10] i386: Use ix86_output_ssemov for XImode TYPE_SSEMOV
PR target/89229 * config/i386/i386.md (*movxi_internal_avx512f): Call ix86_output_ssemov for TYPE_SSEMOV. --- gcc/config/i386/i386.md | 6 +- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index f14683cd14f..b30e5a51edc 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -1902,11 +1902,7 @@ (define_insn "*movxi_internal_avx512f" return standard_sse_constant_opcode (insn, operands); case TYPE_SSEMOV: - if (misaligned_operand (operands[0], XImode) - || misaligned_operand (operands[1], XImode)) - return "vmovdqu32\t{%1, %0|%0, %1}"; - else - return "vmovdqa32\t{%1, %0|%0, %1}"; + return ix86_output_ssemov (insn, operands); default: gcc_unreachable (); -- 2.24.1
[PATCH 06/10] i386: Use ix86_output_ssemov for SImode TYPE_SSEMOV
There is no need to set mode attribute to XImode since ix86_output_ssemov can properly encode xmm16-xmm31 registers with and without AVX512VL. gcc/ PR target/89229 * config/i386/i386.md (*movsi_internal): Call ix86_output_ssemov for TYPE_SSEMOV. Remove ext_sse_reg_operand and TARGET_AVX512VL check. gcc/testsuite/ PR target/89229 * gcc.target/i386/pr89229-4a.c: New test. * gcc.target/i386/pr89229-4b.c: Likewise. * gcc.target/i386/pr89229-4c.c: Likewise. --- gcc/config/i386/i386.md| 25 ++ gcc/testsuite/gcc.target/i386/pr89229-4a.c | 17 +++ gcc/testsuite/gcc.target/i386/pr89229-4b.c | 6 ++ gcc/testsuite/gcc.target/i386/pr89229-4c.c | 7 ++ 4 files changed, 32 insertions(+), 23 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-4a.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-4b.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-4c.c diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 03d8078e957..05815c5cf3b 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -2261,25 +2261,7 @@ (define_insn "*movsi_internal" gcc_unreachable (); case TYPE_SSEMOV: - switch (get_attr_mode (insn)) - { - case MODE_SI: - return "%vmovd\t{%1, %0|%0, %1}"; - case MODE_TI: - return "%vmovdqa\t{%1, %0|%0, %1}"; - case MODE_XI: - return "vmovdqa32\t{%g1, %g0|%g0, %g1}"; - - case MODE_V4SF: - return "%vmovaps\t{%1, %0|%0, %1}"; - - case MODE_SF: - gcc_assert (!TARGET_AVX); - return "movss\t{%1, %0|%0, %1}"; - - default: - gcc_unreachable (); - } + return ix86_output_ssemov (insn, operands); case TYPE_MMX: return "pxor\t%0, %0"; @@ -2345,10 +2327,7 @@ (define_insn "*movsi_internal" (cond [(eq_attr "alternative" "2,3") (const_string "DI") (eq_attr "alternative" "8,9") - (cond [(ior (match_operand 0 "ext_sse_reg_operand") - (match_operand 1 "ext_sse_reg_operand")) - (const_string "XI") -(match_test "TARGET_AVX") + (cond [(match_test "TARGET_AVX") (const_string "TI") (ior (not (match_test "TARGET_SSE2")) (match_test "optimize_function_for_size_p (cfun)")) diff --git a/gcc/testsuite/gcc.target/i386/pr89229-4a.c b/gcc/testsuite/gcc.target/i386/pr89229-4a.c new file mode 100644 index 000..fd56f447016 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr89229-4a.c @@ -0,0 +1,17 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -march=skylake-avx512" } */ + +extern int i; + +int +foo1 (void) +{ + register int xmm16 __asm ("xmm16") = i; + asm volatile ("" : "+v" (xmm16)); + register int xmm17 __asm ("xmm17") = xmm16; + asm volatile ("" : "+v" (xmm17)); + return xmm17; +} + +/* { dg-final { scan-assembler-times "vmovdqa32\[^\n\r]*xmm1\[67]\[^\n\r]*xmm1\[67]" 1 } } */ +/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr89229-4b.c b/gcc/testsuite/gcc.target/i386/pr89229-4b.c new file mode 100644 index 000..023e81253a0 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr89229-4b.c @@ -0,0 +1,6 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -march=skylake-avx512 -mno-avx512vl" } */ + +#include "pr89229-4a.c" + +/* { dg-final { scan-assembler-times "vmovdqa32\[^\n\r]*zmm1\[67]\[^\n\r]*zmm1\[67]" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr89229-4c.c b/gcc/testsuite/gcc.target/i386/pr89229-4c.c new file mode 100644 index 000..bb728082e96 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr89229-4c.c @@ -0,0 +1,7 @@ +/* { dg-do compile { target { ! 
ia32 } } } */ +/* { dg-options "-O2 -march=skylake-avx512 -mprefer-vector-width=512" } */ + +#include "pr89229-4a.c" + +/* { dg-final { scan-assembler-times "vmovdqa32\[^\n\r]*xmm1\[67]\[^\n\r]*xmm1\[67]" 1 } } */ +/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */ -- 2.24.1
[PATCH 03/10] i386: Use ix86_output_ssemov for OImode TYPE_SSEMOV
There is no need to set mode attribute to XImode since ix86_output_ssemov can properly encode ymm16-ymm31 registers with and without AVX512VL. PR target/89229 * config/i386/i386.md (*movoi_internal_avx): Call ix86_output_ssemov for TYPE_SSEMOV. Remove ext_sse_reg_operand and TARGET_AVX512VL check. --- gcc/config/i386/i386.md | 26 ++ 1 file changed, 2 insertions(+), 24 deletions(-) diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index b30e5a51edc..9e9b17d0913 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -1925,21 +1925,7 @@ (define_insn "*movoi_internal_avx" return standard_sse_constant_opcode (insn, operands); case TYPE_SSEMOV: - if (misaligned_operand (operands[0], OImode) - || misaligned_operand (operands[1], OImode)) - { - if (get_attr_mode (insn) == MODE_XI) - return "vmovdqu32\t{%1, %0|%0, %1}"; - else - return "vmovdqu\t{%1, %0|%0, %1}"; - } - else - { - if (get_attr_mode (insn) == MODE_XI) - return "vmovdqa32\t{%1, %0|%0, %1}"; - else - return "vmovdqa\t{%1, %0|%0, %1}"; - } + return ix86_output_ssemov (insn, operands); default: gcc_unreachable (); @@ -1948,15 +1934,7 @@ (define_insn "*movoi_internal_avx" [(set_attr "isa" "*,avx2,*,*") (set_attr "type" "sselog1,sselog1,ssemov,ssemov") (set_attr "prefix" "vex") - (set (attr "mode") - (cond [(ior (match_operand 0 "ext_sse_reg_operand") - (match_operand 1 "ext_sse_reg_operand")) -(const_string "XI") - (and (eq_attr "alternative" "1") - (match_test "TARGET_AVX512VL")) -(const_string "XI") - ] - (const_string "OI")))]) + (set_attr "mode" "OI")]) (define_insn "*movti_internal" [(set (match_operand:TI 0 "nonimmediate_operand" "=!r ,o ,v,v ,v ,m,?r,?Yd") -- 2.24.1
[PATCH 00/10] i386: Properly encode xmm16-xmm31/ymm16-ymm31 for vector move
This patch set was originally submitted in Feb 2019: https://gcc.gnu.org/ml/gcc-patches/2019-02/msg01841.html I broke it into 10 smaller patches for easy review. On x86, when AVX and AVX512 are enabled, vector move instructions can be encoded with either 2-byte/3-byte VEX (AVX) or 4-byte EVEX (AVX512): 0: c5 f9 6f d1 vmovdqa %xmm1,%xmm2 4: 62 f1 fd 08 6f d1 vmovdqa64 %xmm1,%xmm2 We prefer VEX encoding over EVEX since VEX is shorter. Also AVX512F only supports 512-bit vector moves. AVX512F + AVX512VL supports 128-bit and 256-bit vector moves. Mode attributes on x86 vector move patterns indicate target preferences of vector move encoding. For vector register to vector register move, we can use 512-bit vector move instructions to move 128-bit/256-bit vector if AVX512VL isn't available. With AVX512F and AVX512VL, we should use VEX encoding for 128-bit/256-bit vector moves if upper 16 vector registers aren't used. This patch adds a function, ix86_output_ssemov, to generate vector moves: 1. If zmm registers are used, use EVEX encoding. 2. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE or VEX encoding will be generated. 3. If xmm16-xmm31/ymm16-ymm31 registers are used: a. With AVX512VL, AVX512VL vector moves will be generated. b. Without AVX512VL, xmm16-xmm31/ymm16-ymm31 register to register move will be done with zmm register move. Tested on AVX2 and AVX512 with and without --with-arch=native. H.J. Lu (10): i386: Properly encode vector registers in vector move i386: Use ix86_output_ssemov for XImode TYPE_SSEMOV i386: Use ix86_output_ssemov for OImode TYPE_SSEMOV i386: Use ix86_output_ssemov for TImode TYPE_SSEMOV i386: Use ix86_output_ssemov for DImode TYPE_SSEMOV i386: Use ix86_output_ssemov for SImode TYPE_SSEMOV i386: Use ix86_output_ssemov for TFmode TYPE_SSEMOV i386: Use ix86_output_ssemov for DFmode TYPE_SSEMOV i386: Use ix86_output_ssemov for SFmode TYPE_SSEMOV i386: Use ix86_output_ssemov for MMX TYPE_SSEMOV gcc/config/i386/i386-protos.h | 2 + gcc/config/i386/i386.c| 274 ++ gcc/config/i386/i386.md | 212 +- gcc/config/i386/mmx.md| 29 +- gcc/config/i386/predicates.md | 5 - gcc/config/i386/sse.md| 98 +-- .../gcc.target/i386/avx512vl-vmovdqa64-1.c| 7 +- gcc/testsuite/gcc.target/i386/pr89229-2a.c| 15 + gcc/testsuite/gcc.target/i386/pr89229-2b.c| 13 + gcc/testsuite/gcc.target/i386/pr89229-2c.c| 6 + gcc/testsuite/gcc.target/i386/pr89229-3a.c| 17 ++ gcc/testsuite/gcc.target/i386/pr89229-3b.c| 6 + gcc/testsuite/gcc.target/i386/pr89229-3c.c| 7 + gcc/testsuite/gcc.target/i386/pr89229-4a.c| 17 ++ gcc/testsuite/gcc.target/i386/pr89229-4b.c| 6 + gcc/testsuite/gcc.target/i386/pr89229-4c.c| 7 + gcc/testsuite/gcc.target/i386/pr89229-5a.c| 16 + gcc/testsuite/gcc.target/i386/pr89229-5b.c| 12 + gcc/testsuite/gcc.target/i386/pr89229-5c.c| 6 + gcc/testsuite/gcc.target/i386/pr89229-6a.c| 16 + gcc/testsuite/gcc.target/i386/pr89229-6b.c| 7 + gcc/testsuite/gcc.target/i386/pr89229-6c.c| 6 + gcc/testsuite/gcc.target/i386/pr89229-7a.c| 16 + gcc/testsuite/gcc.target/i386/pr89229-7b.c| 6 + gcc/testsuite/gcc.target/i386/pr89229-7c.c| 6 + gcc/testsuite/gcc.target/i386/pr89346.c | 15 + 26 files changed, 497 insertions(+), 330 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-2a.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-2b.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-2c.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-3a.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-3b.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-3c.c create 
mode 100644 gcc/testsuite/gcc.target/i386/pr89229-4a.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-4b.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-4c.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-5a.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-5b.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-5c.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-6a.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-6b.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-6c.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-7a.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-7b.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-7c.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89346.c -- 2.24.1
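Tying the three rules above to concrete mnemonics, the new pr89229-4* tests boil down to the following shape; the expected encodings in the comments come from those tests' scan patterns and from the VEX/EVEX example above, not from output reproduced here:

extern int i;

int
foo (void)
{
  register int xmm16 __asm ("xmm16") = i;
  asm volatile ("" : "+v" (xmm16));
  register int xmm17 __asm ("xmm17") = xmm16;	/* xmm16 -> xmm17 move  */
  asm volatile ("" : "+v" (xmm17));
  return xmm17;
}

/* -march=skylake-avx512:               vmovdqa32 %xmm16, %xmm17   (rule 3a)
   -march=skylake-avx512 -mno-avx512vl: vmovdqa32 %zmm16, %zmm17   (rule 3b)
   The same move between xmm1 and xmm2 would use the shorter VEX-encoded
   vmovdqa (rule 2).  */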
[PATCH 04/10] i386: Use ix86_output_ssemov for TImode TYPE_SSEMOV
There is no need to set mode attribute to XImode since ix86_output_ssemov can properly encode xmm16-xmm31 registers with and without AVX512VL. gcc/ PR target/89229 * config/i386/i386.md (*movti_internal): Call ix86_output_ssemov for TYPE_SSEMOV. Remove ext_sse_reg_operand and TARGET_AVX512VL check. gcc/testsuite/ PR target/89229 * gcc.target/i386/pr89229-2a.c: New test. * gcc.target/i386/pr89229-2b.c: Likewise. * gcc.target/i386/pr89229-2c.c: Likewise. --- gcc/config/i386/i386.md| 28 +- gcc/testsuite/gcc.target/i386/pr89229-2a.c | 15 gcc/testsuite/gcc.target/i386/pr89229-2b.c | 13 ++ gcc/testsuite/gcc.target/i386/pr89229-2c.c | 6 + 4 files changed, 35 insertions(+), 27 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-2a.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-2b.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-2c.c diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 9e9b17d0913..5607d1ecddc 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -1955,27 +1955,7 @@ (define_insn "*movti_internal" return standard_sse_constant_opcode (insn, operands); case TYPE_SSEMOV: - /* TDmode values are passed as TImode on the stack. Moving them -to stack may result in unaligned memory access. */ - if (misaligned_operand (operands[0], TImode) - || misaligned_operand (operands[1], TImode)) - { - if (get_attr_mode (insn) == MODE_V4SF) - return "%vmovups\t{%1, %0|%0, %1}"; - else if (get_attr_mode (insn) == MODE_XI) - return "vmovdqu32\t{%1, %0|%0, %1}"; - else - return "%vmovdqu\t{%1, %0|%0, %1}"; - } - else - { - if (get_attr_mode (insn) == MODE_V4SF) - return "%vmovaps\t{%1, %0|%0, %1}"; - else if (get_attr_mode (insn) == MODE_XI) - return "vmovdqa32\t{%1, %0|%0, %1}"; - else - return "%vmovdqa\t{%1, %0|%0, %1}"; - } + return ix86_output_ssemov (insn, operands); default: gcc_unreachable (); @@ -2002,12 +1982,6 @@ (define_insn "*movti_internal" (set (attr "mode") (cond [(eq_attr "alternative" "0,1") (const_string "DI") - (ior (match_operand 0 "ext_sse_reg_operand") - (match_operand 1 "ext_sse_reg_operand")) -(const_string "XI") - (and (eq_attr "alternative" "3") - (match_test "TARGET_AVX512VL")) -(const_string "XI") (match_test "TARGET_AVX") (const_string "TI") (ior (not (match_test "TARGET_SSE2")) diff --git a/gcc/testsuite/gcc.target/i386/pr89229-2a.c b/gcc/testsuite/gcc.target/i386/pr89229-2a.c new file mode 100644 index 000..0cf78039481 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr89229-2a.c @@ -0,0 +1,15 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -march=skylake-avx512" } */ + +typedef __int128 __m128t __attribute__ ((__vector_size__ (16), +__may_alias__)); + +__m128t +foo1 (void) +{ + register __int128 xmm16 __asm ("xmm16") = (__int128) -1; + asm volatile ("" : "+v" (xmm16)); + return (__m128t) xmm16; +} + +/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr89229-2b.c b/gcc/testsuite/gcc.target/i386/pr89229-2b.c new file mode 100644 index 000..8d5d6c41d30 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr89229-2b.c @@ -0,0 +1,13 @@ +/* { dg-do compile { target { ! 
ia32 } } } */ +/* { dg-options "-O2 -march=skylake-avx512 -mno-avx512vl" } */ + +typedef __int128 __m128t __attribute__ ((__vector_size__ (16), +__may_alias__)); + +__m128t +foo1 (void) +{ + register __int128 xmm16 __asm ("xmm16") = (__int128) -1; /* { dg-error "register specified for 'xmm16'" } */ + asm volatile ("" : "+v" (xmm16)); + return (__m128t) xmm16; +} diff --git a/gcc/testsuite/gcc.target/i386/pr89229-2c.c b/gcc/testsuite/gcc.target/i386/pr89229-2c.c new file mode 100644 index 000..218da46dcd0 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr89229-2c.c @@ -0,0 +1,6 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -march=skylake-avx512 -mprefer-vector-width=512" } */ + +#include "pr89229-2a.c" + +/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */ -- 2.24.1
[PATCH 05/10] i386: Use ix86_output_ssemov for DImode TYPE_SSEMOV
There is no need to set mode attribute to XImode since ix86_output_ssemov can properly encode xmm16-xmm31 registers with and without AVX512VL. gcc/ PR target/89229 * config/i386/i386.md (*movdi_internal): Call ix86_output_ssemov for TYPE_SSEMOV. Remove ext_sse_reg_operand and TARGET_AVX512VL check. gcc/testsuite/ PR target/89229 * gcc.target/i386/pr89229-3a.c: New test. * gcc.target/i386/pr89229-3b.c: Likewise. * gcc.target/i386/pr89229-3c.c: Likewise. --- gcc/config/i386/i386.md| 31 ++ gcc/testsuite/gcc.target/i386/pr89229-3a.c | 17 gcc/testsuite/gcc.target/i386/pr89229-3b.c | 6 + gcc/testsuite/gcc.target/i386/pr89229-3c.c | 7 + 4 files changed, 32 insertions(+), 29 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-3a.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-3b.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-3c.c diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 5607d1ecddc..03d8078e957 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -2054,31 +2054,7 @@ (define_insn "*movdi_internal" return standard_sse_constant_opcode (insn, operands); case TYPE_SSEMOV: - switch (get_attr_mode (insn)) - { - case MODE_DI: - /* Handle broken assemblers that require movd instead of movq. */ - if (!HAVE_AS_IX86_INTERUNIT_MOVQ - && (GENERAL_REG_P (operands[0]) || GENERAL_REG_P (operands[1]))) - return "%vmovd\t{%1, %0|%0, %1}"; - return "%vmovq\t{%1, %0|%0, %1}"; - - case MODE_TI: - /* Handle AVX512 registers set. */ - if (EXT_REX_SSE_REG_P (operands[0]) - || EXT_REX_SSE_REG_P (operands[1])) - return "vmovdqa64\t{%1, %0|%0, %1}"; - return "%vmovdqa\t{%1, %0|%0, %1}"; - - case MODE_V2SF: - gcc_assert (!TARGET_AVX); - return "movlps\t{%1, %0|%0, %1}"; - case MODE_V4SF: - return "%vmovaps\t{%1, %0|%0, %1}"; - - default: - gcc_unreachable (); - } + return ix86_output_ssemov (insn, operands); case TYPE_SSECVT: if (SSE_REG_P (operands[0])) @@ -2164,10 +2140,7 @@ (define_insn "*movdi_internal" (cond [(eq_attr "alternative" "2") (const_string "SI") (eq_attr "alternative" "12,13") - (cond [(ior (match_operand 0 "ext_sse_reg_operand") - (match_operand 1 "ext_sse_reg_operand")) - (const_string "TI") -(match_test "TARGET_AVX") + (cond [(match_test "TARGET_AVX") (const_string "TI") (ior (not (match_test "TARGET_SSE2")) (match_test "optimize_function_for_size_p (cfun)")) diff --git a/gcc/testsuite/gcc.target/i386/pr89229-3a.c b/gcc/testsuite/gcc.target/i386/pr89229-3a.c new file mode 100644 index 000..cb9b071e873 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr89229-3a.c @@ -0,0 +1,17 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -march=skylake-avx512 -mprefer-vector-width=512" } */ + +extern long long i; + +long long +foo1 (void) +{ + register long long xmm16 __asm ("xmm16") = i; + asm volatile ("" : "+v" (xmm16)); + register long long xmm17 __asm ("xmm17") = xmm16; + asm volatile ("" : "+v" (xmm17)); + return xmm17; +} + +/* { dg-final { scan-assembler-times "vmovdqa64\[^\n\r]*xmm1\[67]\[^\n\r]*xmm1\[67]" 1 } } */ +/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr89229-3b.c b/gcc/testsuite/gcc.target/i386/pr89229-3b.c new file mode 100644 index 000..9265fc0354b --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr89229-3b.c @@ -0,0 +1,6 @@ +/* { dg-do compile { target { ! 
ia32 } } } */ +/* { dg-options "-O2 -march=skylake-avx512 -mno-avx512vl" } */ + +#include "pr89229-3a.c" + +/* { dg-final { scan-assembler-times "vmovdqa32\[^\n\r]*zmm1\[67]\[^\n\r]*zmm1\[67]" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr89229-3c.c b/gcc/testsuite/gcc.target/i386/pr89229-3c.c new file mode 100644 index 000..be0ca78a37e --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr89229-3c.c @@ -0,0 +1,7 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -march=skylake-avx512 -mprefer-vector-width=512" } */ + +#include "pr89229-3a.c" + +/* { dg-final { scan-assembler-times "vmovdqa64\[^\n\r]*xmm1\[67]\[^\n\r]*xmm1\[67]" 1 } } */ +/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */ -- 2.24.1
[PATCH 01/10] i386: Properly encode vector registers in vector move
On x86, when AVX and AVX512 are enabled, vector move instructions can be encoded with either 2-byte/3-byte VEX (AVX) or 4-byte EVEX (AVX512): 0: c5 f9 6f d1 vmovdqa %xmm1,%xmm2 4: 62 f1 fd 08 6f d1 vmovdqa64 %xmm1,%xmm2 We prefer VEX encoding over EVEX since VEX is shorter. Also AVX512F only supports 512-bit vector moves. AVX512F + AVX512VL supports 128-bit and 256-bit vector moves. Mode attributes on x86 vector move patterns indicate target preferences of vector move encoding. For vector register to vector register move, we can use 512-bit vector move instructions to move 128-bit/256-bit vector if AVX512VL isn't available. With AVX512F and AVX512VL, we should use VEX encoding for 128-bit/256-bit vector moves if upper 16 vector registers aren't used. This patch adds a function, ix86_output_ssemov, to generate vector moves: 1. If zmm registers are used, use EVEX encoding. 2. If xmm16-xmm31/ymm16-ymm31 registers aren't used, SSE or VEX encoding will be generated. 3. If xmm16-xmm31/ymm16-ymm31 registers are used: a. With AVX512VL, AVX512VL vector moves will be generated. b. Without AVX512VL, xmm16-xmm31/ymm16-ymm31 register to register move will be done with zmm register move. Tested on AVX2 and AVX512 with and without --with-arch=native. gcc/ PR target/89229 PR target/89346 * config/i386/i386-protos.h (ix86_output_ssemov): New prototype. * config/i386/i386.c (ix86_get_ssemov): New function. (ix86_output_ssemov): Likewise. * config/i386/sse.md (VMOVE:mov_internal): Call ix86_output_ssemov for TYPE_SSEMOV. Remove TARGET_AVX512VL check. gcc/testsuite/ PR target/89229 PR target/89346 * gcc.target/i386/avx512vl-vmovdqa64-1.c: Updated. * gcc.target/i386/pr89229-2a.c: New test. --- gcc/config/i386/i386-protos.h | 2 + gcc/config/i386/i386.c| 274 ++ gcc/config/i386/sse.md| 98 +-- .../gcc.target/i386/avx512vl-vmovdqa64-1.c| 7 +- gcc/testsuite/gcc.target/i386/pr89346.c | 15 + 5 files changed, 296 insertions(+), 100 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr89346.c diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h index 266381ca5a6..39fcaa0ad5f 100644 --- a/gcc/config/i386/i386-protos.h +++ b/gcc/config/i386/i386-protos.h @@ -38,6 +38,8 @@ extern void ix86_expand_split_stack_prologue (void); extern void ix86_output_addr_vec_elt (FILE *, int); extern void ix86_output_addr_diff_elt (FILE *, int, int); +extern const char *ix86_output_ssemov (rtx_insn *, rtx *); + extern enum calling_abi ix86_cfun_abi (void); extern enum calling_abi ix86_function_type_abi (const_tree); diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index dac7a3fc5fd..26f8c9494b9 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -4915,6 +4915,280 @@ ix86_pre_reload_split (void) && !(cfun->curr_properties & PROP_rtl_split_insns)); } +/* Return the opcode of the TYPE_SSEMOV instruction. To move from + or to xmm16-xmm31/ymm16-ymm31 registers, we either require + TARGET_AVX512VL or it is a register to register move which can + be done with zmm register move. 
*/ + +static const char * +ix86_get_ssemov (rtx *operands, unsigned size, +enum attr_mode insn_mode, machine_mode mode) +{ + char buf[128]; + bool misaligned_p = (misaligned_operand (operands[0], mode) + || misaligned_operand (operands[1], mode)); + bool evex_reg_p = (EXT_REX_SSE_REG_P (operands[0]) +|| EXT_REX_SSE_REG_P (operands[1])); + machine_mode scalar_mode; + + const char *opcode = NULL; + enum +{ + opcode_int, + opcode_float, + opcode_double +} type = opcode_int; + + switch (insn_mode) +{ +case MODE_V16SF: +case MODE_V8SF: +case MODE_V4SF: + scalar_mode = E_SFmode; + break; +case MODE_V8DF: +case MODE_V4DF: +case MODE_V2DF: + scalar_mode = E_DFmode; + break; +case MODE_XI: +case MODE_OI: +case MODE_TI: + scalar_mode = GET_MODE_INNER (mode); + break; +default: + gcc_unreachable (); +} + + if (SCALAR_FLOAT_MODE_P (scalar_mode)) +{ + switch (scalar_mode) + { + case E_SFmode: + if (size == 64 || !evex_reg_p || TARGET_AVX512VL) + opcode = misaligned_p ? "%vmovups" : "%vmovaps"; + else + type = opcode_float; + break; + case E_DFmode: + if (size == 64 || !evex_reg_p || TARGET_AVX512VL) + opcode = misaligned_p ? "%vmovupd" : "%vmovapd"; + else + type = opcode_double; + break; + case E_TFmode: + if (size == 64) + opcode = misaligned_p ? "vmovdqu64" : "vmovdqa64"; + else if (evex_reg_p) + { + if (TARGET_AVX512VL) +
[PATCH 08/10] i386: Use ix86_output_ssemov for DFmode TYPE_SSEMOV
There is no need to set mode attribute to XImode nor V8DFmode since ix86_output_ssemov can properly encode xmm16-xmm31 registers with and without AVX512VL. gcc/ PR target/89229 * config/i386/i386.md (*movdf_internal): Call ix86_output_ssemov for TYPE_SSEMOV. Remove TARGET_AVX512F, TARGET_PREFER_AVX256, TARGET_AVX512VL and ext_sse_reg_operand check. gcc/testsuite/ PR target/89229 * gcc.target/i386/pr89229-6a.c: New test. * gcc.target/i386/pr89229-6b.c: Likewise. * gcc.target/i386/pr89229-6c.c: Likewise. --- gcc/config/i386/i386.md| 44 ++ gcc/testsuite/gcc.target/i386/pr89229-6a.c | 16 gcc/testsuite/gcc.target/i386/pr89229-6b.c | 7 gcc/testsuite/gcc.target/i386/pr89229-6c.c | 6 +++ 4 files changed, 32 insertions(+), 41 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-6a.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-6b.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-6c.c diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index fdf0e5a8802..01892992adb 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -3307,37 +3307,7 @@ (define_insn "*movdf_internal" return standard_sse_constant_opcode (insn, operands); case TYPE_SSEMOV: - switch (get_attr_mode (insn)) - { - case MODE_DF: - if (TARGET_AVX && REG_P (operands[0]) && REG_P (operands[1])) - return "vmovsd\t{%d1, %0|%0, %d1}"; - return "%vmovsd\t{%1, %0|%0, %1}"; - - case MODE_V4SF: - return "%vmovaps\t{%1, %0|%0, %1}"; - case MODE_V8DF: - return "vmovapd\t{%g1, %g0|%g0, %g1}"; - case MODE_V2DF: - return "%vmovapd\t{%1, %0|%0, %1}"; - - case MODE_V2SF: - gcc_assert (!TARGET_AVX); - return "movlps\t{%1, %0|%0, %1}"; - case MODE_V1DF: - gcc_assert (!TARGET_AVX); - return "movlpd\t{%1, %0|%0, %1}"; - - case MODE_DI: - /* Handle broken assemblers that require movd instead of movq. */ - if (!HAVE_AS_IX86_INTERUNIT_MOVQ - && (GENERAL_REG_P (operands[0]) || GENERAL_REG_P (operands[1]))) - return "%vmovd\t{%1, %0|%0, %1}"; - return "%vmovq\t{%1, %0|%0, %1}"; - - default: - gcc_unreachable (); - } + return ix86_output_ssemov (insn, operands); default: gcc_unreachable (); @@ -3391,10 +3361,7 @@ (define_insn "*movdf_internal" /* xorps is one byte shorter for non-AVX targets. */ (eq_attr "alternative" "12,16") -(cond [(and (match_test "TARGET_AVX512F") -(not (match_test "TARGET_PREFER_AVX256"))) - (const_string "XI") - (match_test "TARGET_AVX") +(cond [(match_test "TARGET_AVX") (const_string "V2DF") (ior (not (match_test "TARGET_SSE2")) (match_test "optimize_function_for_size_p (cfun)")) @@ -3410,12 +3377,7 @@ (define_insn "*movdf_internal" /* movaps is one byte shorter for non-AVX targets. */ (eq_attr "alternative" "13,17") -(cond [(and (ior (not (match_test "TARGET_PREFER_AVX256")) - (not (match_test "TARGET_AVX512VL"))) -(ior (match_operand 0 "ext_sse_reg_operand") - (match_operand 1 "ext_sse_reg_operand"))) - (const_string "V8DF") - (match_test "TARGET_AVX") +(cond [(match_test "TARGET_AVX") (const_string "DF") (ior (not (match_test "TARGET_SSE2")) (match_test "optimize_function_for_size_p (cfun)")) diff --git a/gcc/testsuite/gcc.target/i386/pr89229-6a.c b/gcc/testsuite/gcc.target/i386/pr89229-6a.c new file mode 100644 index 000..5bc10d25619 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr89229-6a.c @@ -0,0 +1,16 @@ +/* { dg-do compile { target { ! 
ia32 } } } */ +/* { dg-options "-O2 -march=skylake-avx512" } */ + +extern double d; + +void +foo1 (double x) +{ + register double xmm16 __asm ("xmm16") = x; + asm volatile ("" : "+v" (xmm16)); + register double xmm17 __asm ("xmm17") = xmm16; + asm volatile ("" : "+v" (xmm17)); + d = xmm17; +} + +/* { dg-final { scan-assembler-not "vmovapd" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr89229-6b.c b/gcc/testsuite/gcc.target/i386/pr89229-6b.c new file mode 100644 index 000..b248a3726f4 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr89229-6b.c @@ -0,0 +1,7 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -march=skylake-avx512 -mno-avx512vl" } */ + +#include "pr89229-6a.c" + +/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */ +/* { dg-fina
[PATCH 09/10] i386: Use ix86_output_ssemov for SFmode TYPE_SSEMOV
There is no need to set mode attribute to V16SFmode since ix86_output_ssemov can properly encode xmm16-xmm31 registers with and without AVX512VL. gcc/ PR target/89229 * config/i386/i386.md (*movdf_internal): Call ix86_output_ssemov for TYPE_SSEMOV. Remove TARGET_PREFER_AVX256, TARGET_AVX512VL and ext_sse_reg_operand check. gcc/testsuite/ PR target/89229 * gcc.target/i386/pr89229-7a.c: New test. * gcc.target/i386/pr89229-7b.c: Likewise. * gcc.target/i386/pr89229-7c.c: Likewise. --- gcc/config/i386/i386.md| 26 ++ gcc/testsuite/gcc.target/i386/pr89229-7a.c | 16 + gcc/testsuite/gcc.target/i386/pr89229-7b.c | 6 + gcc/testsuite/gcc.target/i386/pr89229-7c.c | 6 + 4 files changed, 30 insertions(+), 24 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-7a.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-7b.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-7c.c diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 01892992adb..2dcf2d598c3 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -3469,24 +3469,7 @@ (define_insn "*movsf_internal" return standard_sse_constant_opcode (insn, operands); case TYPE_SSEMOV: - switch (get_attr_mode (insn)) - { - case MODE_SF: - if (TARGET_AVX && REG_P (operands[0]) && REG_P (operands[1])) - return "vmovss\t{%d1, %0|%0, %d1}"; - return "%vmovss\t{%1, %0|%0, %1}"; - - case MODE_V16SF: - return "vmovaps\t{%g1, %g0|%g0, %g1}"; - case MODE_V4SF: - return "%vmovaps\t{%1, %0|%0, %1}"; - - case MODE_SI: - return "%vmovd\t{%1, %0|%0, %1}"; - - default: - gcc_unreachable (); - } + return ix86_output_ssemov (insn, operands); case TYPE_MMXMOV: switch (get_attr_mode (insn)) @@ -3558,12 +3541,7 @@ (define_insn "*movsf_internal" better to maintain the whole registers in single format to avoid problems on using packed logical operations. */ (eq_attr "alternative" "6") -(cond [(and (ior (not (match_test "TARGET_PREFER_AVX256")) - (not (match_test "TARGET_AVX512VL"))) -(ior (match_operand 0 "ext_sse_reg_operand") - (match_operand 1 "ext_sse_reg_operand"))) - (const_string "V16SF") - (ior (match_test "TARGET_SSE_PARTIAL_REG_DEPENDENCY") +(cond [(ior (match_test "TARGET_SSE_PARTIAL_REG_DEPENDENCY") (match_test "TARGET_SSE_SPLIT_REGS")) (const_string "V4SF") ] diff --git a/gcc/testsuite/gcc.target/i386/pr89229-7a.c b/gcc/testsuite/gcc.target/i386/pr89229-7a.c new file mode 100644 index 000..856115b2f5a --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr89229-7a.c @@ -0,0 +1,16 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -march=skylake-avx512" } */ + +extern float d; + +void +foo1 (float x) +{ + register float xmm16 __asm ("xmm16") = x; + asm volatile ("" : "+v" (xmm16)); + register float xmm17 __asm ("xmm17") = xmm16; + asm volatile ("" : "+v" (xmm17)); + d = xmm17; +} + +/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr89229-7b.c b/gcc/testsuite/gcc.target/i386/pr89229-7b.c new file mode 100644 index 000..93d1e43770c --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr89229-7b.c @@ -0,0 +1,6 @@ +/* { dg-do compile { target { ! 
ia32 } } } */ +/* { dg-options "-O2 -march=skylake-avx512 -mno-avx512vl" } */ + +#include "pr89229-7a.c" + +/* { dg-final { scan-assembler-times "vmovaps\[^\n\r]*zmm1\[67]\[^\n\r]*zmm1\[67]" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr89229-7c.c b/gcc/testsuite/gcc.target/i386/pr89229-7c.c new file mode 100644 index 000..e37ff2bf5bd --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr89229-7c.c @@ -0,0 +1,6 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -march=skylake-avx512 -mprefer-vector-width=512" } */ + +#include "pr89229-7a.c" + +/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */ -- 2.24.1
[PATCH 10/10] i386: Use ix86_output_ssemov for MMX TYPE_SSEMOV
There is no need to set mode attribute to XImode since ix86_output_ssemov can properly encode xmm16-xmm31 registers with and without AVX512VL. Remove ext_sse_reg_operand since it is no longer needed. PR target/89229 * config/i386/mmx.md (MMXMODE:*mov_internal): Call ix86_output_ssemov for TYPE_SSEMOV. Remove ext_sse_reg_operand check. * config/i386/predicates.md (ext_sse_reg_operand): Removed. --- gcc/config/i386/mmx.md| 29 ++--- gcc/config/i386/predicates.md | 5 - 2 files changed, 2 insertions(+), 32 deletions(-) diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md index f695831b5b9..7d9db5d352c 100644 --- a/gcc/config/i386/mmx.md +++ b/gcc/config/i386/mmx.md @@ -118,29 +118,7 @@ (define_insn "*mov_internal" return standard_sse_constant_opcode (insn, operands); case TYPE_SSEMOV: - switch (get_attr_mode (insn)) - { - case MODE_DI: - /* Handle broken assemblers that require movd instead of movq. */ - if (!HAVE_AS_IX86_INTERUNIT_MOVQ - && (GENERAL_REG_P (operands[0]) || GENERAL_REG_P (operands[1]))) - return "%vmovd\t{%1, %0|%0, %1}"; - return "%vmovq\t{%1, %0|%0, %1}"; - case MODE_TI: - return "%vmovdqa\t{%1, %0|%0, %1}"; - case MODE_XI: - return "vmovdqa64\t{%g1, %g0|%g0, %g1}"; - - case MODE_V2SF: - if (TARGET_AVX && REG_P (operands[0])) - return "vmovlps\t{%1, %0, %0|%0, %0, %1}"; - return "%vmovlps\t{%1, %0|%0, %1}"; - case MODE_V4SF: - return "%vmovaps\t{%1, %0|%0, %1}"; - - default: - gcc_unreachable (); - } + return ix86_output_ssemov (insn, operands); default: gcc_unreachable (); @@ -189,10 +167,7 @@ (define_insn "*mov_internal" (cond [(eq_attr "alternative" "2") (const_string "SI") (eq_attr "alternative" "11,12") - (cond [(ior (match_operand 0 "ext_sse_reg_operand") - (match_operand 1 "ext_sse_reg_operand")) - (const_string "XI") -(match_test "mode == V2SFmode") + (cond [(match_test "mode == V2SFmode") (const_string "V4SF") (ior (not (match_test "TARGET_SSE2")) (match_test "optimize_function_for_size_p (cfun)")) diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md index 1119366d54e..71f4cb1193c 100644 --- a/gcc/config/i386/predicates.md +++ b/gcc/config/i386/predicates.md @@ -61,11 +61,6 @@ (define_predicate "sse_reg_operand" (and (match_code "reg") (match_test "SSE_REGNO_P (REGNO (op))"))) -;; True if the operand is an AVX-512 new register. -(define_predicate "ext_sse_reg_operand" - (and (match_code "reg") - (match_test "EXT_REX_SSE_REGNO_P (REGNO (op))"))) - ;; Return true if op is a QImode register. (define_predicate "any_QIreg_operand" (and (match_code "reg") -- 2.24.1
[PATCH 07/10] i386: Use ix86_output_ssemov for TFmode TYPE_SSEMOV
gcc/ PR target/89229 * config/i386/i386.md (*movtf_internal): Call ix86_output_ssemov for TYPE_SSEMOV. gcc/testsuite/ PR target/89229 * gcc.target/i386/pr89229-5a.c: New test. * gcc.target/i386/pr89229-5b.c: Likewise. * gcc.target/i386/pr89229-5c.c: Likewise. --- gcc/config/i386/i386.md| 26 +- gcc/testsuite/gcc.target/i386/pr89229-5a.c | 16 + gcc/testsuite/gcc.target/i386/pr89229-5b.c | 12 ++ gcc/testsuite/gcc.target/i386/pr89229-5c.c | 6 + 4 files changed, 35 insertions(+), 25 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-5a.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-5b.c create mode 100644 gcc/testsuite/gcc.target/i386/pr89229-5c.c diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 05815c5cf3b..fdf0e5a8802 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -3154,31 +3154,7 @@ (define_insn "*movtf_internal" return standard_sse_constant_opcode (insn, operands); case TYPE_SSEMOV: - /* Handle misaligned load/store since we - don't have movmisaligntf pattern. */ - if (misaligned_operand (operands[0], TFmode) - || misaligned_operand (operands[1], TFmode)) - { - if (get_attr_mode (insn) == MODE_V4SF) - return "%vmovups\t{%1, %0|%0, %1}"; - else if (TARGET_AVX512VL - && (EXT_REX_SSE_REG_P (operands[0]) - || EXT_REX_SSE_REG_P (operands[1]))) - return "vmovdqu64\t{%1, %0|%0, %1}"; - else - return "%vmovdqu\t{%1, %0|%0, %1}"; - } - else - { - if (get_attr_mode (insn) == MODE_V4SF) - return "%vmovaps\t{%1, %0|%0, %1}"; - else if (TARGET_AVX512VL - && (EXT_REX_SSE_REG_P (operands[0]) - || EXT_REX_SSE_REG_P (operands[1]))) - return "vmovdqa64\t{%1, %0|%0, %1}"; - else - return "%vmovdqa\t{%1, %0|%0, %1}"; - } + return ix86_output_ssemov (insn, operands); case TYPE_MULTI: return "#"; diff --git a/gcc/testsuite/gcc.target/i386/pr89229-5a.c b/gcc/testsuite/gcc.target/i386/pr89229-5a.c new file mode 100644 index 000..fcb85c366b6 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr89229-5a.c @@ -0,0 +1,16 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -march=skylake-avx512" } */ + +extern __float128 d; + +void +foo1 (__float128 x) +{ + register __float128 xmm16 __asm ("xmm16") = x; + asm volatile ("" : "+v" (xmm16)); + register __float128 xmm17 __asm ("xmm17") = xmm16; + asm volatile ("" : "+v" (xmm17)); + d = xmm17; +} + +/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr89229-5b.c b/gcc/testsuite/gcc.target/i386/pr89229-5b.c new file mode 100644 index 000..37eb83c783b --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr89229-5b.c @@ -0,0 +1,12 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -march=skylake-avx512 -mno-avx512vl" } */ + +extern __float128 d; + +void +foo1 (__float128 x) +{ + register __float128 xmm16 __asm ("xmm16") = x; /* { dg-error "register specified for 'xmm16'" } */ + asm volatile ("" : "+v" (xmm16)); + d = xmm16; +} diff --git a/gcc/testsuite/gcc.target/i386/pr89229-5c.c b/gcc/testsuite/gcc.target/i386/pr89229-5c.c new file mode 100644 index 000..529a520133c --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr89229-5c.c @@ -0,0 +1,6 @@ +/* { dg-do compile { target { ! ia32 } } } */ +/* { dg-options "-O2 -march=skylake-avx512 -mprefer-vector-width=512" } */ + +#include "pr89229-5a.c" + +/* { dg-final { scan-assembler-not "%zmm\[0-9\]+" } } */ -- 2.24.1
[PATCH] i386: Use add for a = a + b and a = b + a when possible
Since except for Bonnell, 01 fbadd%edi,%ebx is faster and shorter than 8d 1c 1f lea(%rdi,%rbx,1),%ebx we should use add for a = a + b and a = b + a when possible if not optimizing for Bonnell. Tested on x86-64. gcc/ PR target/92807 * config/i386/i386.c (ix86_lea_outperforms): Check !TARGET_BONNELL. (ix86_avoid_lea_for_addr): When not optimizing for Bonnell, use add for a = a + b and a = b + a. gcc/testsuite/ PR target/92807 * gcc.target/i386/pr92807-1.c: New test. -- H.J. From ad803a967a6c18ae3bd6f8381ebc8a78c31a82ae Mon Sep 17 00:00:00 2001 From: "H.J. Lu" Date: Tue, 3 Dec 2019 15:27:51 -0800 Subject: [PATCH] i386: Use add for a = a + b and a = b + a when possible Since except for Bonnell, 01 fb add%edi,%ebx is faster and shorter than 8d 1c 1f lea(%rdi,%rbx,1),%ebx we should use add for a = a + b and a = b + a when possible if not optimizing for Bonnell. Tested on x86-64. gcc/ PR target/92807 * config/i386/i386.c (ix86_lea_outperforms): Check !TARGET_BONNELL. (ix86_avoid_lea_for_addr): When not optimizing for Bonnell, use add for a = a + b and a = b + a. gcc/testsuite/ PR target/92807 * gcc.target/i386/pr92807-1.c: New test. --- gcc/config/i386/i386.c| 27 +++ gcc/testsuite/gcc.target/i386/pr92807-1.c | 11 + 2 files changed, 29 insertions(+), 9 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr92807-1.c diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 04cbbd532c0d..65f0d44916a8 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -14393,11 +14393,10 @@ ix86_lea_outperforms (rtx_insn *insn, unsigned int regno0, unsigned int regno1, { int dist_define, dist_use; - /* For Silvermont if using a 2-source or 3-source LEA for - non-destructive destination purposes, or due to wanting - ability to use SCALE, the use of LEA is justified. */ - if (TARGET_SILVERMONT || TARGET_GOLDMONT || TARGET_GOLDMONT_PLUS - || TARGET_TREMONT || TARGET_INTEL) + /* For Atom processors newer than Bonnell, if using a 2-source or + 3-source LEA for non-destructive destination purposes, or due to + wanting ability to use SCALE, the use of LEA is justified. */ + if (!TARGET_BONNELL) { if (has_scale) return true; @@ -14532,10 +14531,6 @@ ix86_avoid_lea_for_addr (rtx_insn *insn, rtx operands[]) struct ix86_address parts; int ok; - /* Check we need to optimize. */ - if (!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun)) -return false; - /* The "at least two components" test below might not catch simple move or zero extension insns if parts.base is non-NULL and parts.disp is const0_rtx as the only components in the address, e.g. if the @@ -14572,6 +14567,20 @@ ix86_avoid_lea_for_addr (rtx_insn *insn, rtx operands[]) if (parts.index) regno2 = true_regnum (parts.index); + /* Use add for a = a + b and a = b + a since it is faster and shorter + than lea for most processors. For the processors like BONNELL, if + the destination register of LEA holds an actual address which will + be used soon, LEA is better and otherwise ADD is better. */ + if (!TARGET_BONNELL + && parts.scale == 1 + && (!parts.disp || parts.disp == const0_rtx) + && (regno0 == regno1 || regno0 == regno2)) +return true; + + /* Check we need to optimize. 
*/ + if (!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun)) +return false; + split_cost = 0; /* Compute how many cycles we will add to execution time diff --git a/gcc/testsuite/gcc.target/i386/pr92807-1.c b/gcc/testsuite/gcc.target/i386/pr92807-1.c new file mode 100644 index ..00f92930af92 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr92807-1.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +unsigned int +abs2 (unsigned int a) +{ + unsigned int s = ((a>>15)&0x10001)*0x; + return (a+s)^s; +} + +/* { dg-final { scan-assembler-not "leal" } } */ -- 2.21.0
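As a companion to pr92807-1.c, here is a hedged illustration of when the new check fires and when LEA is still preferred. The comments are expectations for -O2 generic tuning, not verified test assertions; exact output depends on register allocation and -march.

/* Illustration only, not part of the patch.  */

unsigned int
reuse_source (unsigned int a, unsigned int b)
{
  unsigned int s = a ^ 0x10001;
  return s + b;         /* Scale 1, no displacement, destination reuses a
                           source register: expected to be emitted as addl
                           rather than leal.  */
}

unsigned int
keep_lea (unsigned int a, unsigned int b)
{
  return a + b * 4 + 3; /* Scale and displacement present: the three-operand
                           address form still wants leal 3(%r..,%r..,4).  */
}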
Re: [C++ coroutines] Initial implementation pushed to master.
On Sat, Jan 18, 2020 at 4:54 AM Iain Sandoe wrote: > > Hi, > > Thanks to: > >* the reviewers, the code was definitely improved by your reviews. > >* those folks who tested the branch and/or compiler explorer > instance and reported problems with reproducers. > > * WG21 colleagues, especially Lewis and Gor for valuable input > and discussions on the design. > > = TL;DR: > > * This is not enabled by default (even for -std=c++2a), it needs -fcoroutines. > > * Like all the C++20 support, it is experimental, perhaps more experimental > than some other pieces because wording is still being amended. > > * The FE/ME tests are run for ALL targets; in principle this should be target- > agnostic, if we see fails then that is probably interesting input for the > ABI > panel. > > * I regstrapped on 64b LE and BE platforms and a 32b LE host with no observed > issues or regressions. > > * it’s just slightly too big to send uncompressed so attached as a bz2. > > * commit is r10-6063-g49789fd08 > > thanks again to all those who helped, > Iain > > == The full covering note: > > This is the squashed version of the first 6 patches that were split to > facilitate review. > > The changes to libiberty (7th patch) to support demangling the co_await > operator stand alone and are applied separately. > > The patch series is an initial implementation of a coroutine feature, > expected to be standardised in C++20. > > Standardisation status (and potential impact on this implementation) > > > The facility was accepted into the working draft for C++20 by WG21 in > February 2019. During following WG21 meetings, design and national body > comments have been reviewed, with no significant change resulting. > > The current GCC implementation is against n4835 [1]. > > At this stage, the remaining potential for change comes from: > > * Areas of national body comments that were not resolved in the version we > have worked to: > (a) handling of the situation where aligned allocation is available. > (b) handling of the situation where a user wants coroutines, but does not > want exceptions (e.g. a GPU). > > * Agreed changes that have not yet been worded in a draft standard that we > have worked to. > > It is not expected that the resolution to these can produce any major > change at this phase of the standardisation process. Such changes should be > limited to the coroutine-specific code. > > ABI > --- > > The various compiler developers 'vendors' have discussed a minimal ABI to > allow one implementation to call coroutines compiled by another. > > This amounts to: > > 1. The layout of a public portion of the coroutine frame. > > Coroutines need to preserve state across suspension points, the storage for > this is called a "coroutine frame". > > The ABI mandates that pointers into the coroutine frame point to an area > begining with two function pointers (to the resume and destroy functions > described below); these are immediately followed by the "promise object" > described in the standard. > > This is sufficient that the builtins can take a coroutine frame pointer and > determine the address of the promise (or call the resume/destroy functions). > > 2. A number of compiler builtins that the standard library might use. > > These are implemented by this patch series. > > 3. This introduces a new operator 'co_await' the mangling for which is also > agreed between vendors (and has an issue filed for that against the upstream > c++abi). Demangling for this is added to libiberty in a separate patch. 
> > The ABI has currently no target-specific content (a given psABI might elect > to mandate alignment, but the common ABI does not do this). > > Standard Library impact > --- > > The current implementations require addition of only a single header to > the standard library (no change to the runtime). This header is part of > the patch. > > GCC Implementation outline > -- > > The standard's design for coroutines does not decorate the definition of > a coroutine in any way, so that a function is only known to be a coroutine > when one of the keywords (co_await, co_yield, co_return) is encountered. > > This means that we cannot special-case such functions from the outset, but > must process them differently when they are finalised - which we do from > "finish_function ()". > > At a high level, this design of coroutine produces four pieces from the > original user's function: > > 1. A coroutine state frame (taking the logical place of the activation > record for a regular function). One item stored in that state is the > index of the current suspend point. > 2. A "ramp" function > This is what the user calls to construct the coroutine frame and start > the coroutine execution. This will return some object representing the > coroutine's eventual ret
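As a concrete reference for the pieces described in the covering note (promise type, ramp, co_return), a minimal translation unit accepted by the new implementation looks roughly like the following. This is an illustrative sketch rather than one of the committed tests, built with something like g++ -std=c++2a -fcoroutines demo.cc (the file name is made up).

#include <coroutine>

struct task
{
  struct promise_type
  {
    task get_return_object () { return {}; }
    std::suspend_never initial_suspend () noexcept { return {}; }
    std::suspend_never final_suspend () noexcept { return {}; }
    void return_void () {}
    void unhandled_exception () {}
  };
};

/* Calling demo() runs the "ramp": it builds the coroutine frame,
   constructs the promise, starts execution (no suspension points here),
   and returns the object produced by get_return_object().  */
task
demo ()
{
  co_return;
}

int
main ()
{
  demo ();
  return 0;
}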
[PATCH v2] tree-profile: Don't instrument an IFUNC resolver nor its callees
We can't instrument an IFUNC resolver nor its callees as it may require TLS which hasn't been set up yet when the dynamic linker is resolving IFUNC symbols. Add an IFUNC resolver caller marker to cgraph_node and set it if the function is called by an IFUNC resolver. Update tree_profiling to skip functions called by IFUNC resolver. Tested with profiledbootstrap on Fedora 39/x86-64. gcc/ChangeLog: PR tree-optimization/114115 * cgraph.h (symtab_node): Add check_ifunc_callee_symtab_nodes. (cgraph_node): Add called_by_ifunc_resolver. * cgraphunit.cc (symbol_table::compile): Call symtab_node::check_ifunc_callee_symtab_nodes. * symtab.cc (check_ifunc_resolver): New. (ifunc_ref_map): Likewise. (is_caller_ifunc_resolver): Likewise. (symtab_node::check_ifunc_callee_symtab_nodes): Likewise. * tree-profile.cc (tree_profiling): Do not instrument an IFUNC resolver nor its callees. gcc/testsuite/ChangeLog: PR tree-optimization/114115 * gcc.dg/pr114115.c: New test. --- gcc/cgraph.h| 6 +++ gcc/cgraphunit.cc | 2 + gcc/symtab.cc | 89 + gcc/testsuite/gcc.dg/pr114115.c | 24 + gcc/tree-profile.cc | 4 ++ 5 files changed, 125 insertions(+) create mode 100644 gcc/testsuite/gcc.dg/pr114115.c diff --git a/gcc/cgraph.h b/gcc/cgraph.h index 47f35e8078d..a8c3224802c 100644 --- a/gcc/cgraph.h +++ b/gcc/cgraph.h @@ -479,6 +479,9 @@ public: Return NULL if there's no such node. */ static symtab_node *get_for_asmname (const_tree asmname); + /* Check symbol table for callees of IFUNC resolvers. */ + static void check_ifunc_callee_symtab_nodes (void); + /* Verify symbol table for internal consistency. */ static DEBUG_FUNCTION void verify_symtab_nodes (void); @@ -896,6 +899,7 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cgraph_node : public symtab_node redefined_extern_inline (false), tm_may_enter_irr (false), ipcp_clone (false), declare_variant_alt (false), calls_declare_variant_alt (false), gc_candidate (false), + called_by_ifunc_resolver (false), m_uid (uid), m_summary_id (-1) {} @@ -1495,6 +1499,8 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cgraph_node : public symtab_node is set for local SIMD clones when they are created and cleared if the vectorizer uses them. */ unsigned gc_candidate : 1; + /* Set if the function is called by an IFUNC resolver. */ + unsigned called_by_ifunc_resolver : 1; private: /* Unique id of the node. */ diff --git a/gcc/cgraphunit.cc b/gcc/cgraphunit.cc index d200166f7e9..2bd0289ffba 100644 --- a/gcc/cgraphunit.cc +++ b/gcc/cgraphunit.cc @@ -2317,6 +2317,8 @@ symbol_table::compile (void) symtab_node::checking_verify_symtab_nodes (); + symtab_node::check_ifunc_callee_symtab_nodes (); + timevar_push (TV_CGRAPHOPT); if (pre_ipa_mem_report) dump_memory_report ("Memory consumption before IPA"); diff --git a/gcc/symtab.cc b/gcc/symtab.cc index 4c7e3c135ca..3256133891d 100644 --- a/gcc/symtab.cc +++ b/gcc/symtab.cc @@ -1369,6 +1369,95 @@ symtab_node::verify (void) timevar_pop (TV_CGRAPH_VERIFY); } +/* Return true and set *DATA to true if NODE is an ifunc resolver. */ + +static bool +check_ifunc_resolver (cgraph_node *node, void *data) +{ + if (node->ifunc_resolver) +{ + bool *is_ifunc_resolver = (bool *) data; + *is_ifunc_resolver = true; + return true; +} + return false; +} + +static auto_bitmap ifunc_ref_map; + +/* Return true if any caller of NODE is an ifunc resolver. */ + +static bool +is_caller_ifunc_resolver (cgraph_node *node) +{ + bool is_ifunc_resolver = false; + + for (cgraph_edge *e = node->callers; e; e = e->next_caller) +{ + /* Return true if caller is known to be an IFUNC resolver. 
*/ + if (e->caller->called_by_ifunc_resolver) + return true; + + /* Check for recursive call. */ + if (e->caller == node) + continue; + + /* Skip if it has been visited. */ + unsigned int uid = e->caller->get_uid (); + if (bitmap_bit_p (ifunc_ref_map, uid)) + continue; + bitmap_set_bit (ifunc_ref_map, uid); + + if (is_caller_ifunc_resolver (e->caller)) + { + /* Return true if caller is an IFUNC resolver. */ + e->caller->called_by_ifunc_resolver = true; + return true; + } + + /* Check if caller's alias is an IFUNC resolver. */ + e->caller->call_for_symbol_and_aliases (check_ifunc_resolver, + &is_ifunc_resolver, + true); + if (is_ifunc_resolver) + { + /* Return true if caller's alias is an IFUNC resolver. */ + e->caller->called_by_ifunc_resolver = true; + return true; + } +} + + return false; +} + +/* Check symbol table for ca
Re: [PATCH] tree-profile: Don't instrument an IFUNC resolver nor its callees
On Thu, Feb 29, 2024 at 7:11 AM H.J. Lu wrote: > > On Thu, Feb 29, 2024 at 7:06 AM Jan Hubicka wrote: > > > > > > I am worried about scenario where ifunc selector calls function foo > > > > defined locally and foo is also used from other places possibly in hot > > > > loops. > > > > > > > > > > > So it is not really reliable fix (though I guess it will work a lot > > > > > > of > > > > > > common code). I wonder what would be alternatives. In GCC > > > > > > generated > > > > > > profling code we use TLS only for undirect call profiling (so there > > > > > > is > > > > > > no need to turn off rest of profiling). I wonder if there is any > > > > > > chance > > > > > > to not make it seffault when it is done before TLS is set up? > > > > > > > > > > IFUNC selector should make minimum external calls, none is preferred. > > > > > > > > Edge porfiling only inserts (atomic) 64bit increments of counters. > > > > If target supports these operations inline, no external calls will be > > > > done. > > > > > > > > Indirect call profiling inserts the problematic TLS variable (to track > > > > caller-callee pairs). Value profiling also inserts various additional > > > > external calls to counters. > > > > > > > > I am perfectly fine with disabling instrumentation for ifunc selectors > > > > and functions only reachable from them, but I am worried about calles > > > > used also from non-ifunc path. > > > > > > Programmers need to understand not to do it. > > > > It would help to have this documented. Should we warn when ifunc > > resolver calls external function, comdat of function reachable from > > non-ifunc code? > > That will be nice. > > > > > > > > For example selector implemented in C++ may do some string handling to > > > > match CPU name and propagation will disable profiling for std::string > > > > > > On x86, they should use CPUID, not string functions. > > > > > > > member functions (which may not be effective if comdat section is > > > > prevailed from other translation unit). > > > > > > String functions may lead to external function calls which is dangerous. > > > > > > > > Any external calls may lead to issues at run-time. It is a very bad > > > > > idea > > > > > to profile IFUNC selector via external function call. > > > > > > > > Looking at https://sourceware.org/glibc/wiki/GNU_IFUNC > > > > there are other limitations on ifunc except for profiling, such as > > > > -fstack-protector-all. So perhaps your propagation can be used to > > > > disable those features as well. > > > > > > So, it may not be tree-profile specific. Where should these 2 bits > > > be added? > > > > If we want to disable other transforms too, then I think having a bit in > > cgraph_node for reachability from ifunc resolver makes sense. > > I would still do the cycle detection using on-side hash_map to avoid > > polution of the global datastructure. > > > > I will see what I can do. > > The v2 patch is at https://patchwork.sourceware.org/project/gcc/list/?series=31627 -- H.J.
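To make the scenario concrete, the kind of resolver (and resolver-only callee) that the patch exempts from instrumentation looks roughly like this. It is an illustrative sketch with made-up function names, not the pr114115.c test.

/* Sketch only.  Both resolve_foo and its local helper run while the
   dynamic linker is still resolving IFUNC symbols, before TLS (and the
   usual constructors) are set up, so neither may be instrumented.  */
typedef int (*foo_fn) (void);

static int foo_avx2 (void) { return 2; }
static int foo_generic (void) { return 1; }

/* Reached only from the resolver; the new cgraph flag marks it
   called_by_ifunc_resolver so it is skipped as well.  */
static int
prefer_avx2 (void)
{
  return __builtin_cpu_supports ("avx2");
}

static foo_fn
resolve_foo (void)
{
  /* CPUID-based dispatch, no external calls; __builtin_cpu_init is
     needed here because the resolver can run before constructors.  */
  __builtin_cpu_init ();
  return prefer_avx2 () ? foo_avx2 : foo_generic;
}

int foo (void) __attribute__ ((ifunc ("resolve_foo")));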
Re: [C++ coroutines] Initial implementation pushed to master.
On Wed, Mar 6, 2024 at 1:03 AM Iain Sandoe wrote: > > > > > On 5 Mar 2024, at 17:31, H.J. Lu wrote: > > > > On Sat, Jan 18, 2020 at 4:54 AM Iain Sandoe wrote: > >> > > >> 2020-01-18 Iain Sandoe > >> > >>* Makefile.in: Add coroutine-passes.o. > >>* builtin-types.def (BT_CONST_SIZE): New. > >>(BT_FN_BOOL_PTR): New. > >>(BT_FN_PTR_PTR_CONST_SIZE_BOOL): New. > >>* builtins.def (DEF_COROUTINE_BUILTIN): New. > >>* coroutine-builtins.def: New file. > >>* coroutine-passes.cc: New file. > > > > There are > > > > tree res_tgt = TREE_OPERAND (gimple_call_arg (stmt, 2), 0); > > tree &res_dest = destinations.get_or_insert (idx, &existed); > > if (existed && dump_file) > >Why does this behavior depend on dump_file? > > This was checking for a potential wrong-code error during development; > there is no point in making it into a diagnostic (since the user could not fix > the problem if it happened). I guess changing to a gcc_checking_assert() > would be reasonable but I’d prefer to do that once GCC-15 opens. > > Have you found any instance where this results in a reported bug? No, I haven't. I only noticed it by chance. > (I do not recall anything on my coroutines bug list that would seem to > indicate this). > > thanks for noting it. > Iain > > > >{ > > fprintf ( > >dump_file, > >"duplicate YIELD RESUME point (" HOST_WIDE_INT_PRINT_DEC > >") ?\n", > >idx); > > print_gimple_stmt (dump_file, stmt, 0, > > TDF_VOPS|TDF_MEMSYMS); > >} > > else > >res_dest = res_tgt; > > > > H.J. > -- H.J.
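For reference, the gcc_checking_assert() form suggested above would look something like this sketch. It is an untested fragment written against the snippet quoted in the thread, not a proposed patch.

/* Sketch of the suggested change: keep the duplicate check without tying
   the res_dest assignment to whether dumping is enabled.  */
bool existed;
tree res_tgt = TREE_OPERAND (gimple_call_arg (stmt, 2), 0);
tree &res_dest = destinations.get_or_insert (idx, &existed);
gcc_checking_assert (!existed);
res_dest = res_tgt;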
Re: libbacktrace patch committed: Don't assume compressed section aligned
On Fri, Mar 8, 2024 at 2:48 PM Fangrui Song wrote: > > On ELF64, it looks like BFD uses 8-byte alignment for compressed > `.debug_*` sections while gold/lld/mold use 1-byte alignment. I do not > know how the Solaris linker sets the alignment. > > The specification's wording makes me confused whether it really > requires 8-byte alignment, even if a non-packed `Elf64_Chdr` surely > requires 8. Since compressed sections begin with a compression header structure that identifies the compression algorithm, compressed sections must be aligned to the alignment of the compression header. I don't think there is any ambiguity here. > > The sh_size and sh_addralign fields of the section header for a compressed > > section reflect the requirements of the compressed section. > > There are many `.debug_*` sections. So avoiding some alignment padding > seems a very natural extension (a DWARF v5 -gsplit-dwarf relocatable > file has ~10 `.debug_*` sections), even if the specification doesn't > allow it with a very strict interpretation... > > (Off-topic: I wonder whether ELF control structures should use > unaligned LEB128 more. REL/RELA can naturally be replaced with a > LEB128 one similar to wasm.) > > On Fri, Mar 8, 2024 at 1:57 PM Ian Lance Taylor wrote: > > > > Reportedly when lld compresses debug sections, it fails to set the > > alignment of the compressed section such that the compressed header > > can be read directly. To me this seems like a bug in lld. However, > > libbacktrace needs to work around it. This patch, originally by the > > GitHub user ubyte, does that. Bootstrapped and tested on > > x86_64-pc-linux-gnu. Committed to mainline. > > > > Ian > > > > * elf.c (elf_uncompress_chdr): Don't assume compressed section is > > aligned. > > > > -- > 宋方睿 -- H.J.
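The workaround amounts to copying the compression header out of the section before reading its fields. A minimal sketch follows; it is not the libbacktrace code (which defines its own ELF types and also handles 32-bit and cross-endian objects) and uses the glibc <elf.h> names for brevity.

#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Read the compression header of a compressed .debug_* section without
   assuming the buffer is 8-byte aligned: gold/lld/mold may set
   sh_addralign to 1, so a direct (const Elf64_Chdr *) cast would be an
   unaligned access.  */
int
read_chdr64 (const unsigned char *data, size_t size, Elf64_Chdr *chdr)
{
  if (size < sizeof (Elf64_Chdr))
    return 0;
  /* memcpy instead of a cast: safe regardless of the section's alignment.  */
  memcpy (chdr, data, sizeof (Elf64_Chdr));
  return chdr->ch_type == ELFCOMPRESS_ZLIB;
}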
PING: [PATCH v2] tree-profile: Don't instrument an IFUNC resolver nor its callees
On Tue, Mar 5, 2024 at 1:45 PM H.J. Lu wrote: > > We can't instrument an IFUNC resolver nor its callees as it may require > TLS which hasn't been set up yet when the dynamic linker is resolving > IFUNC symbols. > > Add an IFUNC resolver caller marker to cgraph_node and set it if the > function is called by an IFUNC resolver. Update tree_profiling to skip > functions called by IFUNC resolver. > > Tested with profiledbootstrap on Fedora 39/x86-64. > > gcc/ChangeLog: > > PR tree-optimization/114115 > * cgraph.h (symtab_node): Add check_ifunc_callee_symtab_nodes. > (cgraph_node): Add called_by_ifunc_resolver. > * cgraphunit.cc (symbol_table::compile): Call > symtab_node::check_ifunc_callee_symtab_nodes. > * symtab.cc (check_ifunc_resolver): New. > (ifunc_ref_map): Likewise. > (is_caller_ifunc_resolver): Likewise. > (symtab_node::check_ifunc_callee_symtab_nodes): Likewise. > * tree-profile.cc (tree_profiling): Do not instrument an IFUNC > resolver nor its callees. > > gcc/testsuite/ChangeLog: > > PR tree-optimization/114115 > * gcc.dg/pr114115.c: New test. > --- > gcc/cgraph.h| 6 +++ > gcc/cgraphunit.cc | 2 + > gcc/symtab.cc | 89 + > gcc/testsuite/gcc.dg/pr114115.c | 24 + > gcc/tree-profile.cc | 4 ++ > 5 files changed, 125 insertions(+) > create mode 100644 gcc/testsuite/gcc.dg/pr114115.c > > diff --git a/gcc/cgraph.h b/gcc/cgraph.h > index 47f35e8078d..a8c3224802c 100644 > --- a/gcc/cgraph.h > +++ b/gcc/cgraph.h > @@ -479,6 +479,9 @@ public: > Return NULL if there's no such node. */ >static symtab_node *get_for_asmname (const_tree asmname); > > + /* Check symbol table for callees of IFUNC resolvers. */ > + static void check_ifunc_callee_symtab_nodes (void); > + >/* Verify symbol table for internal consistency. */ >static DEBUG_FUNCTION void verify_symtab_nodes (void); > > @@ -896,6 +899,7 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cgraph_node : > public symtab_node >redefined_extern_inline (false), tm_may_enter_irr (false), >ipcp_clone (false), declare_variant_alt (false), >calls_declare_variant_alt (false), gc_candidate (false), > + called_by_ifunc_resolver (false), >m_uid (uid), m_summary_id (-1) >{} > > @@ -1495,6 +1499,8 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cgraph_node : > public symtab_node > is set for local SIMD clones when they are created and cleared if the > vectorizer uses them. */ >unsigned gc_candidate : 1; > + /* Set if the function is called by an IFUNC resolver. */ > + unsigned called_by_ifunc_resolver : 1; > > private: >/* Unique id of the node. */ > diff --git a/gcc/cgraphunit.cc b/gcc/cgraphunit.cc > index d200166f7e9..2bd0289ffba 100644 > --- a/gcc/cgraphunit.cc > +++ b/gcc/cgraphunit.cc > @@ -2317,6 +2317,8 @@ symbol_table::compile (void) > >symtab_node::checking_verify_symtab_nodes (); > > + symtab_node::check_ifunc_callee_symtab_nodes (); > + >timevar_push (TV_CGRAPHOPT); >if (pre_ipa_mem_report) > dump_memory_report ("Memory consumption before IPA"); > diff --git a/gcc/symtab.cc b/gcc/symtab.cc > index 4c7e3c135ca..3256133891d 100644 > --- a/gcc/symtab.cc > +++ b/gcc/symtab.cc > @@ -1369,6 +1369,95 @@ symtab_node::verify (void) >timevar_pop (TV_CGRAPH_VERIFY); > } > > +/* Return true and set *DATA to true if NODE is an ifunc resolver. 
*/ > + > +static bool > +check_ifunc_resolver (cgraph_node *node, void *data) > +{ > + if (node->ifunc_resolver) > +{ > + bool *is_ifunc_resolver = (bool *) data; > + *is_ifunc_resolver = true; > + return true; > +} > + return false; > +} > + > +static auto_bitmap ifunc_ref_map; > + > +/* Return true if any caller of NODE is an ifunc resolver. */ > + > +static bool > +is_caller_ifunc_resolver (cgraph_node *node) > +{ > + bool is_ifunc_resolver = false; > + > + for (cgraph_edge *e = node->callers; e; e = e->next_caller) > +{ > + /* Return true if caller is known to be an IFUNC resolver. */ > + if (e->caller->called_by_ifunc_resolver) > + return true; > + > + /* Check for recursive call. */ > + if (e->caller == node) > + continue; > + > + /* Skip if it has been visited. */ > + unsigned int uid = e->caller->get_uid (); > + if (bitmap_bit_p (ifunc_ref_map, uid)) > + continue; > + bitm
Re: PING: [PATCH v2] tree-profile: Don't instrument an IFUNC resolver nor its callees
On Tue, Apr 2, 2024 at 7:50 AM Jan Hubicka wrote: > > > On Tue, Mar 5, 2024 at 1:45 PM H.J. Lu wrote: > > > > > > We can't instrument an IFUNC resolver nor its callees as it may require > > > TLS which hasn't been set up yet when the dynamic linker is resolving > > > IFUNC symbols. > > > > > > Add an IFUNC resolver caller marker to cgraph_node and set it if the > > > function is called by an IFUNC resolver. Update tree_profiling to skip > > > functions called by IFUNC resolver. > > > > > > Tested with profiledbootstrap on Fedora 39/x86-64. > > > > > > gcc/ChangeLog: > > > > > > PR tree-optimization/114115 > > > * cgraph.h (symtab_node): Add check_ifunc_callee_symtab_nodes. > > > (cgraph_node): Add called_by_ifunc_resolver. > > > * cgraphunit.cc (symbol_table::compile): Call > > > symtab_node::check_ifunc_callee_symtab_nodes. > > > * symtab.cc (check_ifunc_resolver): New. > > > (ifunc_ref_map): Likewise. > > > (is_caller_ifunc_resolver): Likewise. > > > (symtab_node::check_ifunc_callee_symtab_nodes): Likewise. > > > * tree-profile.cc (tree_profiling): Do not instrument an IFUNC > > > resolver nor its callees. > > > > > > gcc/testsuite/ChangeLog: > > > > > > PR tree-optimization/114115 > > > * gcc.dg/pr114115.c: New test. > > > > PING. > > I am bit worried about commonly used functions getting "infected" by > being called once from ifunc resolver. I think we only use thread local > storage for indirect call profiling, so we may just disable indirect > call profiling for these functions. Will change it. > Also the patch will be noop with -flto -flto-partition=max, so probably > we need to compute this flag at WPA time and stream to partitions. > Why is it a nop with -flto -flto-partition=max? I got (gdb) bt #0 symtab_node::check_ifunc_callee_symtab_nodes () at /export/gnu/import/git/gitlab/x86-gcc/gcc/symtab.cc:1440 #1 0x00e487d3 in symbol_table::compile (this=0x7fffea006000) at /export/gnu/import/git/gitlab/x86-gcc/gcc/cgraphunit.cc:2320 #2 0x00d23ecf in lto_main () at /export/gnu/import/git/gitlab/x86-gcc/gcc/lto/lto.cc:687 #3 0x015254d2 in compile_file () at /export/gnu/import/git/gitlab/x86-gcc/gcc/toplev.cc:449 #4 0x015284a4 in do_compile () at /export/gnu/import/git/gitlab/x86-gcc/gcc/toplev.cc:2154 #5 0x01528864 in toplev::main (this=0x7fffd84a, argc=16, argv=0x42261f0) at /export/gnu/import/git/gitlab/x86-gcc/gcc/toplev.cc:2310 #6 0x030a3fe2 in main (argc=16, argv=0x7fffd958) at /export/gnu/import/git/gitlab/x86-gcc/gcc/main.cc:39 Do you have a testcase to show that it is a nop? -- H.J.