Re: [committed] libstdc++: Simplify std::shared_ptr construction from std::weak_ptr

2020-10-26 Thread Stephan Bergmann via Gcc-patches

On 21/10/2020 22:14, Jonathan Wakely via Gcc-patches wrote:

The _M_add_ref_lock() and _M_add_ref_lock_nothrow() members of
_Sp_counted_base are very similar, except that the former throws an
exception when the use count is zero and the latter returns false. The
former (and its callers) can be implemented in terms of the latter.
This results in a small reduction in code size, because throwing an
exception now only happens in one place.
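
As a minimal sketch of the refactoring pattern (simplified names and
structure, not the actual libstdc++ sources):

  #include <memory>  // std::bad_weak_ptr

  struct counted_base
  {
    // Returns false instead of throwing when the use count is zero.
    bool _M_add_ref_lock_nothrow() noexcept;

    // Throwing variant, now expressed in terms of the nothrow one,
    // so the throw expression is emitted in a single place.
    void _M_add_ref_lock()
    {
      if (!_M_add_ref_lock_nothrow())
        throw std::bad_weak_ptr();
    }
  };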

libstdc++-v3/ChangeLog:

* include/bits/shared_ptr.h (shared_ptr(const weak_ptr&, nothrow_t)):
Add noexcept.
* include/bits/shared_ptr_base.h (_Sp_counted_base::_M_add_ref_lock):
Remove specializations and just call _M_add_ref_lock_nothrow.
(__shared_count, __shared_ptr): Use nullptr for null pointer
constants.
(__shared_count(const __weak_count&)): Use _M_add_ref_lock_nothrow
instead of _M_add_ref_lock.
(__shared_count(const __weak_count&, nothrow_t)): Add noexcept.
(__shared_ptr::operator bool()): Add noexcept.
(__shared_ptr(const __weak_ptr&, nothrow_t)): Add noexcept.

Tested powerpc64le-linux. Committed to trunk.


Clang now complains about


~gcc/trunk/inst/lib/gcc/x86_64-pc-linux-gnu/11.0.0/../../../../include/c++/11.0.0/bits/shared_ptr_base.h:230:5: error: '_M_add_ref_lock_nothrow' is missing exception specification 'noexcept'
_M_add_ref_lock_nothrow()
^
~gcc/trunk/inst/lib/gcc/x86_64-pc-linux-gnu/11.0.0/../../../../include/c++/11.0.0/bits/shared_ptr_base.h:158:7: note: previous declaration is here
  _M_add_ref_lock_nothrow() noexcept;
  ^
~gcc/trunk/inst/lib/gcc/x86_64-pc-linux-gnu/11.0.0/../../../../include/c++/11.0.0/bits/shared_ptr_base.h:241:5: error: '_M_add_ref_lock_nothrow' is missing exception specification 'noexcept'
_M_add_ref_lock_nothrow()
^
~gcc/trunk/inst/lib/gcc/x86_64-pc-linux-gnu/11.0.0/../../../../include/c++/11.0.0/bits/shared_ptr_base.h:158:7: note: previous declaration is here
  _M_add_ref_lock_nothrow() noexcept;
  ^
~gcc/trunk/inst/lib/gcc/x86_64-pc-linux-gnu/11.0.0/../../../../include/c++/11.0.0/bits/shared_ptr_base.h:255:5: error: '_M_add_ref_lock_nothrow' is missing exception specification 'noexcept'
_M_add_ref_lock_nothrow()
^
~gcc/trunk/inst/lib/gcc/x86_64-pc-linux-gnu/11.0.0/../../../../include/c++/11.0.0/bits/shared_ptr_base.h:158:7: note: previous declaration is here
  _M_add_ref_lock_nothrow() noexcept;
  ^
~gcc/trunk/inst/lib/gcc/x86_64-pc-linux-gnu/11.0.0/../../../../include/c++/11.0.0/bits/shared_ptr_base.h:876:5: error: exception specification in declaration does not match previous declaration
__shared_count(const __weak_count<_Lp>& __r, std::nothrow_t) noexcept
^
~gcc/trunk/inst/lib/gcc/x86_64-pc-linux-gnu/11.0.0/../../../../include/c++/11.0.0/bits/shared_ptr_base.h:696:16: note: previous declaration is here
  explicit __shared_count(const __weak_count<_Lp>& __r, std::nothrow_t);
   ^
4 errors generated.


which would be fixed with


diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h
index a9e1c9bb1d5..10c9c831411 100644
--- a/libstdc++-v3/include/bits/shared_ptr_base.h
+++ b/libstdc++-v3/include/bits/shared_ptr_base.h
@@ -227,7 +227,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template<>
 inline bool
 _Sp_counted_base<_S_single>::
-_M_add_ref_lock_nothrow()
+_M_add_ref_lock_nothrow() noexcept
 {
   if (_M_use_count == 0)
return false;
@@ -238,7 +238,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template<>
 inline bool
 _Sp_counted_base<_S_mutex>::
-_M_add_ref_lock_nothrow()
+_M_add_ref_lock_nothrow() noexcept
 {
   __gnu_cxx::__scoped_lock sentry(*this);
   if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, 1) == 0)
@@ -252,7 +252,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template<>
 inline bool
 _Sp_counted_base<_S_atomic>::
-_M_add_ref_lock_nothrow()
+_M_add_ref_lock_nothrow() noexcept
 {
   // Perform lock-free add-if-not-zero operation.
   _Atomic_word __count = _M_get_use_count();
@@ -693,7 +693,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   explicit __shared_count(const __weak_count<_Lp>& __r);
 
   // Does not throw if __r._M_get_use_count() == 0, caller must check.
-  explicit __shared_count(const __weak_count<_Lp>& __r, std::nothrow_t);
+  explicit __shared_count(const __weak_count<_Lp>& __r, std::nothrow_t) noexcept;
 
   ~__shared_count() noexcept
   {




Re: Materialize clones on demand

2020-10-26 Thread Richard Biener
On Fri, 23 Oct 2020, Jan Hubicka wrote:

> > Hi,
> > 
> > On Thu, Oct 22 2020, Jan Hubicka wrote:
> > > Hi,
> > > this patch removes the pass to materialize all clones and instead this
> > > is now done on demand.  The motivation is to reduce lifetime of function
> > > bodies in ltrans that should noticeably reduce memory use for highly
> > > parallel compilations of large programs (like Martin does) or with
> > > partitioning reduced/disabled. For cc1 with one partition the memory use
> > > seems to go down from 4gb to circa 1.5gb (judging from top, so this is not
> > > particularly accurate).
> > >
> > 
> > Nice.
> 
> Sadly this is only true w/o debug info.  I collected memory usage stats
> at the end of the ltrans stage and it is as follows:
> 
>  - after streaming in global stream: 126M GGC and 41M heap
>  - after streaming symbol table: 373M GGC and 92M heap
>  - after streaming in summaries: 394M GGC and 92M heap 
>(only large summary seems to be ipa-cp transformation summary)
>  - then compilation starts and memory goes slowly up to 3527M at the end
>of compilation
> 
> The following accounts for more than 1% GGC:
> 
> Time variable                        usr            sys           wall           GGC
>  ipa inlining heuristics  :    6.99 (  0%)    4.62 (  1%)   11.17 (  1%)    241M (  1%)
>  ipa lto gimple in        :   50.04 (  3%)   29.72 (  7%)   80.22 (  4%)   3129M ( 14%)
>  ipa lto decl in          :    0.79 (  0%)    0.36 (  0%)    1.15 (  0%)    135M (  1%)
>  ipa lto cgraph I/O       :    0.95 (  0%)    0.20 (  0%)    1.15 (  0%)    269M (  1%)
>  cfg cleanup              :   25.83 (  2%)    2.52 (  1%)   28.15 (  1%)    154M (  1%)
>  df reg dead/unused notes :   24.08 (  2%)    2.09 (  1%)   26.77 (  1%)    180M (  1%)
>  alias analysis           :   16.94 (  1%)    1.05 (  0%)   17.71 (  1%)    383M (  2%)
>  integration              :   45.76 (  3%)   44.30 ( 11%)   88.99 (  5%)   2328M ( 10%)
>  tree VRP                 :   41.38 (  3%)   15.67 (  4%)   57.71 (  3%)    560M (  2%)
>  tree SSA rewrite         :    6.71 (  0%)    2.17 (  1%)    8.96 (  0%)    194M (  1%)
>  tree SSA incremental     :   26.99 (  2%)    8.23 (  2%)   34.42 (  2%)    144M (  1%)
>  tree operand scan        :   65.34 (  4%)   61.50 ( 15%)  127.02 (  7%)    886M (  4%)
>  dominator optimization   :   41.53 (  3%)   13.56 (  3%)   55.78 (  3%)    407M (  2%)
>  tree split crit edges    :    1.08 (  0%)    0.65 (  0%)    1.63 (  0%)    127M (  1%)
>  tree PRE                 :   34.30 (  2%)   14.52 (  4%)   49.08 (  3%)    337M (  1%)
>  tree code sinking        :    2.92 (  0%)    0.58 (  0%)    3.51 (  0%)    122M (  1%)
>  tree iv optimization     :    6.71 (  0%)    1.19 (  0%)    8.46 (  0%)    133M (  1%)
>  expand                   :   45.56 (  3%)    8.24 (  2%)   55.02 (  3%)   1980M (  9%)
>  forward prop             :   11.89 (  1%)    1.39 (  0%)   12.59 (  1%)    130M (  1%)
>  dead store elim2         :   10.03 (  1%)    0.70 (  0%)   11.23 (  1%)    138M (  1%)
>  loop init                :   11.96 (  1%)    4.95 (  1%)   17.11 (  1%)    378M (  2%)
>  CPROP                    :   22.63 (  2%)    2.78 (  1%)   25.19 (  1%)    359M (  2%)
>  combiner                 :   41.39 (  3%)    2.57 (  1%)   43.30 (  2%)    558M (  2%)
>  reload CSE regs          :   22.38 (  2%)    1.25 (  0%)   23.06 (  1%)    186M (  1%)
>  final                    :   32.33 (  2%)    4.28 (  1%)   36.75 (  2%)   1105M (  5%)
>  symout                   :   49.04 (  3%)    2.23 (  1%)   52.33 (  3%)   2517M ( 11%)
>  var-tracking emit        :   33.26 (  2%)    1.02 (  0%)   34.35 (  2%)    582M (  3%)
>  rest of compilation      :   38.05 (  3%)   15.61 (  4%)   52.42 (  3%)    114M (  1%)
>  TOTAL                    : 1486.02        408.79        1899.96          22512M
> 
> We seem to leak some hashtables:
> dwarf2out.c:28850 (dwarf2out_init)                 31M: 23.8%       47M       19 :  0.0%   ggc

that one likely keeps quite some memory live...

> cselib.c:3137 (cselib_init)                        34M: 25.9%       34M    1514k: 17.3%  heap
> tree-scalar-evolution.c:2984 (scev_initialize)     37M: 27.6%       50M     228k:  2.6%   ggc

Hmm, so we do

  scalar_evolution_info = hash_table::create_ggc (100);

and

  scalar_evolution_info->empty ();
  scalar_evolution_info = NULL;

to reclaim.  ->empty () will IIRC at least allocate 7 elements, which we
then eventually should reclaim during a GC walk - I guess the hashtable
statistics do not really handle GC-reclaimed portions?

If there's a friendlier way of releasing a GC allocated hash-tab
we can switch to that.  Note that in principle the hash-table doesn't
need to be GC allocated but it needs to be walked since it refers to
trees that might not be referenced in other ways.

Re: [PATCH] Add debug_bb_details and debug_bb_n_details

2020-10-26 Thread Richard Biener
On Mon, 26 Oct 2020, Xionghu Luo wrote:

> 
> On 2020/10/23 18:18, Richard Biener wrote:
> > On Fri, 23 Oct 2020, Xiong Hu Luo wrote:
> > 
> >> Sometimes debug_bb_slim & debug_bb_n_slim are not enough; how about adding
> >> debug_bb_details & debug_bb_n_details?  Or does any other similar call
> >> already exist?
> > There's already debug_bb and debug_bb_n in cfg.c which works on both
> > RTL and GIMPLE.  How about instead adding overloads that accept
> > a flags argument so you can do
> > 
> > debug_bb_n (5, TDF_DETAILS)
> > 
> > ?  The debug_bb_slim variant would then just be a forwarder.
> > 
> 
> Thanks.  Updated the patch as below:

OK.

Richard.

> 
> [PATCH v2] Add overloaded debug_bb and debug_bb_n with dump flags
> 
> 
> Add overloads that accept a flags argument so we can print
> debug_bb_n (5, TDF_DETAILS) in gdb, also the debug_bb_slim
> variant would then be just a forwarder.
> 
> gcc/ChangeLog:
> 
> 2020-10-26  Xionghu Luo  
> 
>   * cfg.c (debug_bb): New overloaded function.
>   (debug_bb_n): New overloaded function.
>   * cfg.h (debug_bb_n): New declaration.
>   (debug_bb_n): New declaration.
>   * print-rtl.c (debug_bb_slim): Call debug_bb with flags.
> ---
>  gcc/cfg.c   | 20 +++-
>  gcc/cfg.h   |  2 ++
>  gcc/print-rtl.c |  2 +-
>  3 files changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/gcc/cfg.c b/gcc/cfg.c
> index 270a48f729a..05f922f5470 100644
> --- a/gcc/cfg.c
> +++ b/gcc/cfg.c
> @@ -720,7 +720,7 @@ free_aux_for_edges (void)
>  DEBUG_FUNCTION void
>  debug_bb (basic_block bb)
>  {
> -  dump_bb (stderr, bb, 0, dump_flags);
> +  debug_bb (bb, dump_flags);
>  }
>  
>  DEBUG_FUNCTION basic_block
> @@ -731,6 +731,24 @@ debug_bb_n (int n)
>return bb;
>  }
>  
> +/* Print bb with specified flags.  */
> +
> +DEBUG_FUNCTION void
> +debug_bb (basic_block bb, dump_flags_t flags)
> +{
> +  dump_bb (stderr, bb, 0, flags);
> +}
> +
> +/* Print bb numbered n with specified flags.  */
> +
> +DEBUG_FUNCTION basic_block
> +debug_bb_n (int n, dump_flags_t flags)
> +{
> +  basic_block bb = BASIC_BLOCK_FOR_FN (cfun, n);
> +  debug_bb (bb, flags);
> +  return bb;
> +}
> +
>  /* Dumps cfg related information about basic block BB to OUTF.
> If HEADER is true, dump things that appear before the instructions
> contained in BB.  If FOOTER is true, dump things that appear after.
> diff --git a/gcc/cfg.h b/gcc/cfg.h
> index 1eb7866bac9..93fde6df2bf 100644
> --- a/gcc/cfg.h
> +++ b/gcc/cfg.h
> @@ -108,6 +108,8 @@ extern void clear_aux_for_edges (void);
>  extern void free_aux_for_edges (void);
>  extern void debug_bb (basic_block);
>  extern basic_block debug_bb_n (int);
> +extern void debug_bb (basic_block, dump_flags_t);
> +extern basic_block debug_bb_n (int, dump_flags_t);
>  extern void dump_bb_info (FILE *, basic_block, int, dump_flags_t, bool, 
> bool);
>  extern void brief_dump_cfg (FILE *, dump_flags_t);
>  extern void update_bb_profile_for_threading (basic_block, profile_count, 
> edge);
> diff --git a/gcc/print-rtl.c b/gcc/print-rtl.c
> index 25265efc71b..d514b1c5373 100644
> --- a/gcc/print-rtl.c
> +++ b/gcc/print-rtl.c
> @@ -2139,7 +2139,7 @@ extern void debug_bb_slim (basic_block);
>  DEBUG_FUNCTION void
>  debug_bb_slim (basic_block bb)
>  {
> -  dump_bb (stderr, bb, 0, TDF_SLIM | TDF_BLOCKS);
> +  debug_bb (bb, TDF_SLIM | TDF_BLOCKS);
>  }
>  
>  extern void debug_bb_n_slim (int);
> 
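
With the patch applied, both overloads can be called from a debugger,
e.g. (illustrative gdb session; the basic-block number is made up):

  (gdb) call debug_bb_n (5)
  (gdb) call debug_bb_n (5, TDF_DETAILS)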

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer


Re: [PATCH, OpenMP 5.0] Implement structure element mapping changes in 5.0

2020-10-26 Thread Jakub Jelinek via Gcc-patches
On Sat, Oct 24, 2020 at 01:43:26AM +0800, Chung-Lin Tang wrote:
> On 2020/10/23 8:13 PM, Jakub Jelinek wrote:
> > > In general, upon encountering a construct, we can't statically determine 
> > > and insert alloc/release maps
> > > for each element of a structure variable, since we don't really know 
> > > which region of the structure is
> > > currently mapped or not, hence this probably can't be properly 
> > > implemented in the compiler.
> > > 
> > > Instead this patch tries to do the equivalent in the runtime: I've 
> > > modified the handling of the
> > > (GOMP_MAP_STRUCT, , , ...) sequence to:
> > > 
> > >(1) Create just a single splay_tree_key to represent the entire 
> > > structure's mapped-region
> > >(all element target_var_desc's now reference this same key instead 
> > > of creating their own), and
> > I'm not sure that is what we want.  If we create just a single
> > splay_tree_key spanning the whole structure mapped region, then we can't
> > diagnose various mapping errors.  E.g. if I have:
> > void bar (struct S *);
> > struct S { int a, b, c, d, e; };
> > void foo (struct S s)
> > {
> >#pragma omp target data map(tofrom: s.b, s.d)
> >#pragma omp target map (s.b, s.c)
> >bar (&s);
> > }
> > then target data maps the &s.b to &s.d + 1 region of the struct, but s.c
> > wasn't mapped and so the target region's mapping should fail, even when it
> > is in the middle of the mapped region.
> 
> Are you really sure this is what we want? I don't quite see anything harmful
> about implicitly mapping "middle fields" like s.c, in fact the corresponding
> memory is actually "mapped" anyways.

Yes, it is a QoI (quality of implementation) issue and it is important not
to regress there.
Furthermore, the more we diverge from what the spec says, the harder it
will be for us to implement, not just now, but in the future too.
What I wrote about the actual implementation is actually not accurate, we
need the master and slaves to be the struct splay_tree_key_s objects.
And that one already has the aux field that could be used for the slaves,
so we could e.g. use another magic refcount value, REFCOUNT_SLAVE =
~(uintptr_t) 2, and in that case aux would point to the master
splay_tree_key_s.
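
A rough sketch of that scheme (illustrative only; REFCOUNT_SLAVE and the
helper below are hypothetical, not existing libgomp code):

  #define REFCOUNT_SLAVE (~(uintptr_t) 2)

  /* Resolve K to the key holding the real reference count: a slave key
     stores REFCOUNT_SLAVE in refcount and reuses aux to point at its
     master splay_tree_key_s.  */
  static inline splay_tree_key
  resolve_master (splay_tree_key k)
  {
    if (k->refcount == REFCOUNT_SLAVE)
      return (splay_tree_key) k->aux;
    return k;
  }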

And the 
"If the corresponding list item’s reference count was not already incremented 
because of the
effect of a map clause on the construct then:
a) The corresponding list item’s reference count is incremented by one;"
and
"If the map-type is not delete and the corresponding list item’s reference 
count is finite and
was not already decremented because of the effect of a map clause on the 
construct then:
a) The corresponding list item’s reference count is decremented by one;"
rules we need to implement in any case, I don't see a way around that.
The same list item can now be mapped (or unmapped) multiple times on the same
construct.
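
In pseudo-code, the two rules amount to something like this (a sketch
only; the per-construct flags are hypothetical bookkeeping, not existing
libgomp fields):

  /* Map (entry): increment each list item's count at most once per
     construct, no matter how many map clauses name it.  */
  if (!l->incremented_on_this_construct)
    {
      l->refcount++;
      l->incremented_on_this_construct = true;
    }

  /* Unmap (exit): if the map-type is not 'delete' and the count is
     finite, decrement it at most once per construct.  */
  if (!is_delete && l->refcount != REFCOUNT_INFINITY
      && !l->decremented_on_this_construct)
    {
      l->refcount--;
      l->decremented_on_this_construct = true;
    }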

Jakub



[PATCH V2] aarch64: Add bfloat16 vldN_lane_bf16 + vldNq_lane_bf16 intrinsics

2020-10-26 Thread Andrea Corallo via Gcc-patches
Hi all,

Second version of the patch here implementing the bfloat16_t neon
related load intrinsics: vld2_lane_bf16, vld2q_lane_bf16,
vld3_lane_bf16, vld3q_lane_bf16, vld4_lane_bf16, vld4q_lane_bf16.
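
For reference, a minimal usage sketch for one of the new intrinsics
(hypothetical example, not taken from the testsuite):

  #include <arm_neon.h>

  bfloat16x4x2_t
  reload_lane (const bfloat16_t *p, bfloat16x4x2_t v)
  {
    /* Replace lane 3 of both vectors in v with data loaded from p.  */
    return vld2_lane_bf16 (p, v, 3);
  }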

This version narrows the testcases better, so they do not cause
regressions for the arm backend, where these intrinsics are not yet
present.

Please refer to:
ACLE 
ISA  

Okay for trunk?

Thanks!

  Andrea

>From 08bd8d745bc46ca4b9dd24906dea2743dda66cc5 Mon Sep 17 00:00:00 2001
From: Andrea Corallo 
Date: Thu, 15 Oct 2020 10:16:18 +0200
Subject: [PATCH] aarch64: Add bfloat16 vldN_lane_bf16 + vldNq_lane_bf16
 intrinsics

gcc/ChangeLog

2020-10-15  Andrea Corallo  

* config/aarch64/arm_neon.h (__LDX_LANE_FUNC): Move to the bottom
of the file so we can use these also for defining the bf16 related
intrinsics.
(vld2_lane_bf16, vld2q_lane_bf16, vld3_lane_bf16, vld3q_lane_bf16)
(vld4_lane_bf16, vld4q_lane_bf16): Add new intrinsics.

gcc/testsuite/ChangeLog

2020-10-15  Andrea Corallo  

* gcc.target/aarch64/advsimd-intrinsics/bf16_vldN_lane_1.c: New
testcase.
* gcc.target/aarch64/advsimd-intrinsics/bf16_vldN_lane_2.c:
Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vld2_lane_bf16_indices_1.c:
Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vld2q_lane_bf16_indices_1.c:
Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vld3_lane_bf16_indices_1.c:
Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vld3q_lane_bf16_indices_1.c:
Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vld4_lane_bf16_indices_1.c:
Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vld4q_lane_bf16_indices_1.c:
Likewise.
---
 gcc/config/aarch64/arm_neon.h | 792 +-
 .../advsimd-intrinsics/bf16_vldN_lane_1.c |  74 ++
 .../advsimd-intrinsics/bf16_vldN_lane_2.c |  52 ++
 .../vld2_lane_bf16_indices_1.c|  17 +
 .../vld2q_lane_bf16_indices_1.c   |  17 +
 .../vld3_lane_bf16_indices_1.c|  17 +
 .../vld3q_lane_bf16_indices_1.c   |  17 +
 .../vld4_lane_bf16_indices_1.c|  17 +
 .../vld4q_lane_bf16_indices_1.c   |  17 +
 9 files changed, 629 insertions(+), 391 deletions(-)
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/bf16_vldN_lane_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/bf16_vldN_lane_2.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld2_lane_bf16_indices_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld2q_lane_bf16_indices_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld3_lane_bf16_indices_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld3q_lane_bf16_indices_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld4_lane_bf16_indices_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vld4q_lane_bf16_indices_1.c

diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index d943f63a274..2bb20e15069 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -20792,311 +20792,6 @@ vld4q_dup_p64 (const poly64_t * __a)
   return ret;
 }
 
-/* vld2_lane */
-
-#define __LD2_LANE_FUNC(intype, vectype, largetype, ptrtype, mode,\
-qmode, ptrmode, funcsuffix, signedtype)   \
-__extension__ extern __inline intype \
-__attribute__ ((__always_inline__, __gnu_inline__,__artificial__)) \
-vld2_lane_##funcsuffix (const ptrtype * __ptr, intype __b, const int __c)  \
-{ \
-  __builtin_aarch64_simd_oi __o;  \
-  largetype __temp;   \
-  __temp.val[0] = \
-vcombine_##funcsuffix (__b.val[0], vcreate_##funcsuffix (0)); \
-  __temp.val[1] = \
-vcombine_##funcsuffix (__b.val[1], vcreate_##funcsuffix (0)); \
-  __o = __builtin_aarch64_set_qregoi##qmode (__o, \
-   (signedtype) __temp.val[0],\
-   0);\
-  __o = __builtin_aarch64_set_qregoi##qmode (__o, \
-   (signedtype) __temp.val[1],\
-   1);\
-  __o =__builtin_aarch64_ld2_lane##mode (  
   \
- (__builtin_aarch64_simd_##ptrmode *) __ptr, __o, __c);  

[PATCH V2] aarch64: Add vstN_lane_bf16 + vstNq_lane_bf16 intrinsics

2020-10-26 Thread Andrea Corallo via Gcc-patches
Hi all,

Second version of the patch here implementing the bfloat16_t neon
related store intrinsics: vst2_lane_bf16, vst2q_lane_bf16,
vst3_lane_bf16, vst3q_lane_bf16, vst4_lane_bf16, vst4q_lane_bf16.
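
For reference, a minimal usage sketch for one of the new intrinsics
(hypothetical example, not taken from the testsuite):

  #include <arm_neon.h>

  void
  store_lane (bfloat16_t *p, bfloat16x4x2_t v)
  {
    /* Store lane 1 of each of the two vectors in v to p.  */
    vst2_lane_bf16 (p, v, 1);
  }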

This version narrows the testcases better, so they do not cause
regressions for the arm backend, where these intrinsics are not yet
present.

Please refer to:
ACLE 
ISA  

Okay for trunk?

Thanks!

  Andrea

>From 16803710f96889ec89349c5bb6ff1fb96a9d32d8 Mon Sep 17 00:00:00 2001
From: Andrea Corallo 
Date: Thu, 8 Oct 2020 11:02:09 +0200
Subject: [PATCH] aarch64: Add vstN_lane_bf16 + vstNq_lane_bf16 intrinsics

gcc/ChangeLog

2020-10-19  Andrea Corallo  

* config/aarch64/arm_neon.h (__STX_LANE_FUNC): Move to the bottom
of the file so we can use these also for defining the bf16 related
intrinsics.
(vst2_lane_bf16, vst2q_lane_bf16, vst3_lane_bf16, vst3q_lane_bf16)
(vst4_lane_bf16, vst4q_lane_bf16): Add new intrinsics.

gcc/testsuite/ChangeLog

2020-10-19  Andrea Corallo  

* gcc.target/aarch64/advsimd-intrinsics/arm-neon-ref.h
(hbfloat16_t): Define type.
(CHECK_FP): Make it working for bfloat types.
* gcc.target/aarch64/advsimd-intrinsics/bf16_vstN_lane_1.c: New file.
* gcc.target/aarch64/advsimd-intrinsics/bf16_vstN_lane_2.c: Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vst2_lane_bf16_indices_1.c:
Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vst2q_lane_bf16_indices_1.c:
Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vst3_lane_bf16_indices_1.c:
Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vst3q_lane_bf16_indices_1.c:
Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vst4_lane_bf16_indices_1.c:
Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vst4q_lane_bf16_indices_1.c:
Likewise.
---
 gcc/config/aarch64/arm_neon.h | 534 +-
 .../aarch64/advsimd-intrinsics/arm-neon-ref.h |   4 +-
 .../advsimd-intrinsics/bf16_vstN_lane_1.c | 227 
 .../advsimd-intrinsics/bf16_vstN_lane_2.c |  52 ++
 .../vst2_lane_bf16_indices_1.c|  16 +
 .../vst2q_lane_bf16_indices_1.c   |  16 +
 .../vst3_lane_bf16_indices_1.c|  16 +
 .../vst3q_lane_bf16_indices_1.c   |  16 +
 .../vst4_lane_bf16_indices_1.c|  16 +
 .../vst4q_lane_bf16_indices_1.c   |  16 +
 10 files changed, 656 insertions(+), 257 deletions(-)
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/bf16_vstN_lane_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/bf16_vstN_lane_2.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst2_lane_bf16_indices_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst2q_lane_bf16_indices_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst3_lane_bf16_indices_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst3q_lane_bf16_indices_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst4_lane_bf16_indices_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vst4q_lane_bf16_indices_1.c

diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 2bb20e15069..0088ea9896f 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -10873,262 +10873,6 @@ __STRUCTN (poly, 8, 4)
 __STRUCTN (float, 64, 4)
 #undef __STRUCTN
 
-
-#define __ST2_LANE_FUNC(intype, largetype, ptrtype, mode,   \
-   qmode, ptr_mode, funcsuffix, signedtype) \
-__extension__ extern __inline void  \
-__attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) \
-vst2_lane_ ## funcsuffix (ptrtype *__ptr,   \
- intype __b, const int __c) \
-{   \
-  __builtin_aarch64_simd_oi __o;\
-  largetype __temp; \
-  __temp.val[0]
 \
-= vcombine_##funcsuffix (__b.val[0],\
-vcreate_##funcsuffix (__AARCH64_UINT64_C (0))); \
-  __temp.val[1]
 \
-= vcombine_##funcsuffix (__b.val[1],\
-vcreate_##funcsuffix (__AARCH64_UINT64_C 

[PATCH V2] aarch64: Add vcopy(q)_lane(q)_bf16 intrinsics

2020-10-26 Thread Andrea Corallo via Gcc-patches
Hi all,

Second version of the patch here implementing the bfloat16_t neon
related copy intrinsics: vcopy_lane_bf16, vcopyq_lane_bf16,
vcopyq_laneq_bf16, vcopy_laneq_bf16.
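
For reference, a minimal usage sketch (hypothetical example, not taken
from the testsuite):

  #include <arm_neon.h>

  bfloat16x4_t
  copy_lane (bfloat16x4_t a, bfloat16x4_t b)
  {
    /* Insert lane 2 of b into lane 0 of a.  */
    return vcopy_lane_bf16 (a, 0, b, 2);
  }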

This version narrows the testcases better, so they do not cause
regressions for the arm backend, where these intrinsics are not yet
present.

Please refer to:
ACLE 
ISA  

Okay for trunk?

Regards

  Andrea

>From 8b53c3679501e600c845f3023d2fe69506500cf7 Mon Sep 17 00:00:00 2001
From: Andrea Corallo 
Date: Thu, 8 Oct 2020 12:29:00 +0200
Subject: [PATCH] aarch64: Add vcopy(q)_lane(q)_bf16 intrinsics

gcc/ChangeLog

2020-10-20  Andrea Corallo  

* config/aarch64/arm_neon.h (vcopy_lane_bf16, vcopyq_lane_bf16)
(vcopyq_laneq_bf16, vcopy_laneq_bf16): New intrinsics.

gcc/testsuite/ChangeLog

2020-10-20  Andrea Corallo  

* gcc.target/aarch64/advsimd-intrinsics/bf16_vect_copy_lane_1.c:
New test.
* gcc.target/aarch64/advsimd-intrinsics/vcopy_lane_bf16_indices_1.c:
Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vcopy_lane_bf16_indices_2.c:
Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vcopy_laneq_bf16_indices_1.c:
Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vcopy_laneq_bf16_indices_2.c:
Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vcopyq_lane_bf16_indices_1.c:
Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vcopyq_lane_bf16_indices_2.c:
Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vcopyq_laneq_bf16_indices_1.c:
Likewise.
* gcc.target/aarch64/advsimd-intrinsics/vcopyq_laneq_bf16_indices_2.c:
Likewise.
---
 gcc/config/aarch64/arm_neon.h | 36 +++
 .../bf16_vect_copy_lane_1.c   | 32 +
 .../vcopy_lane_bf16_indices_1.c   | 18 ++
 .../vcopy_lane_bf16_indices_2.c   | 18 ++
 .../vcopy_laneq_bf16_indices_1.c  | 17 +
 .../vcopy_laneq_bf16_indices_2.c  | 17 +
 .../vcopyq_lane_bf16_indices_1.c  | 17 +
 .../vcopyq_lane_bf16_indices_2.c  | 17 +
 .../vcopyq_laneq_bf16_indices_1.c | 17 +
 .../vcopyq_laneq_bf16_indices_2.c | 17 +
 10 files changed, 206 insertions(+)
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/bf16_vect_copy_lane_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcopy_lane_bf16_indices_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcopy_lane_bf16_indices_2.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcopy_laneq_bf16_indices_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcopy_laneq_bf16_indices_2.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcopyq_lane_bf16_indices_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcopyq_lane_bf16_indices_2.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcopyq_laneq_bf16_indices_1.c
 create mode 100644 
gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcopyq_laneq_bf16_indices_2.c

diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 0088ea9896f..9c801661775 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -35155,6 +35155,42 @@ vcvtq_high_bf16_f32 (bfloat16x8_t __inactive, 
float32x4_t __a)
   return __builtin_aarch64_bfcvtn2v8bf (__inactive, __a);
 }
 
+__extension__ extern __inline bfloat16x4_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vcopy_lane_bf16 (bfloat16x4_t __a, const int __lane1,
+bfloat16x4_t __b, const int __lane2)
+{
+  return __aarch64_vset_lane_any (__aarch64_vget_lane_any (__b, __lane2),
+ __a, __lane1);
+}
+
+__extension__ extern __inline bfloat16x8_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vcopyq_lane_bf16 (bfloat16x8_t __a, const int __lane1,
+ bfloat16x4_t __b, const int __lane2)
+{
+  return __aarch64_vset_lane_any (__aarch64_vget_lane_any (__b, __lane2),
+ __a, __lane1);
+}
+
+__extension__ extern __inline bfloat16x4_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vcopy_laneq_bf16 (bfloat16x4_t __a, const int __lane1,
+ bfloat16x8_t __b, const int __lane2)
+{
+  return __aarch64_vset_lane_any (__aarch64_vget_lane_any (__b, __lane2),
+ __a, __lane1);
+}
+
+__extension__ extern __inline bfloat16x8_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vcopyq_laneq_bf16 (bfloat16x8_t __a, const int __lane1,
+  bfloat16x8_t __b, const int __lane2)
+{
+  return __aarc

[PATCH] PR tree-optimization/97546 Bail out of find_bswap_or_nop on non-INTEGER_CST sizes

2020-10-26 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

This patch fixes the ICE in the PR by bailing out of find_bswap_or_nop on 
poly_int sizes.
I don't think it intends to handle them and from my reading of the code it's 
the most appropriate place to reject them
here rather than in the callers.

Bootstrapped and tested on aarch64-none-linux-gnu.

Ok for trunk?
Thanks,
Kyrill

gcc/
PR tree-optimization/97546
* gimple-ssa-store-merging.c (find_bswap_or_nop): Return NULL if type is
not INTEGER_CST.

gcc/testsuite/
PR tree-optimization/97546
* gcc.target/aarch64/sve/acle/general/pr97546.c: New test.


sm-poly.patch
Description: sm-poly.patch


Re: [PATCH] PR tree-optimization/97546 Bail out of find_bswap_or_nop on non-INTEGER_CST sizes

2020-10-26 Thread Jakub Jelinek via Gcc-patches
On Mon, Oct 26, 2020 at 09:20:42AM +, Kyrylo Tkachov via Gcc-patches wrote:
> This patch fixes the ICE in the PR by bailing out of find_bswap_or_nop on 
> poly_int sizes.
> I don't think it intends to handle them and from my reading of the code it's 
> the most appropriate place to reject them
> here rather than in the callers.
> 
> Bootstrapped and tested on aarch64-none-linux-gnu.
> 
> Ok for trunk?
> Thanks,
> Kyrill
> 
> gcc/
>   PR tree-optimization/97546
>   * gimple-ssa-store-merging.c (find_bswap_or_nop): Return NULL if type is
>   not INTEGER_CST.

I think it's better to use tree_fits_uhwi_p instead of cst_and_fits_hwi,
and tree_to_uhwi instead of TREE_INT_CST_LOW.
A TYPE_SIZE_UNIT which doesn't fit into uhwi but fits into shwi is
something that really shouldn't appear.
Otherwise LGTM.
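
A sketch of the suggested shape (illustrative only; the exact statement
the size is taken from is in the attached patch):

  tree type_size = TYPE_SIZE_UNIT (gimple_expr_type (stmt));
  if (!type_size || !tree_fits_uhwi_p (type_size))
    return NULL;
  unsigned HOST_WIDE_INT size = tree_to_uhwi (type_size);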

> gcc/testsuite/
>   PR tree-optimization/97546
>   * gcc.target/aarch64/sve/acle/general/pr97546.c: New test.



Jakub



Re: Materialize clones on demand

2020-10-26 Thread Jan Hubicka
> > We seem to leak some hashtables:
> > dwarf2out.c:28850 (dwarf2out_init)  31M: 23.8%   
> > 47M   19 :  0.0%   ggc
> 
> that one likely keeps quite some memory live...

Yep, having in-memory dwarf2out for the whole of cc1plus eats a lot of
memory quite naturally.
> 
> > cselib.c:3137 (cselib_init) 34M: 25.9%   
> > 34M 1514k: 17.3%  heap
> > tree-scalar-evolution.c:2984 (scev_initialize)  37M: 27.6%   
> > 50M  228k:  2.6%   ggc
> 
> Hmm, so we do
> 
>   scalar_evolution_info = hash_table::create_ggc (100);
> 
> and
> 
>   scalar_evolution_info->empty ();
>   scalar_evolution_info = NULL;
> 
> to reclaim.  ->empty () will IIRC at least allocate 7 elements which we
> the eventually should reclaim during a GC walk - I guess the hashtable
> statistics do not really handle GC reclaimed portions?
> 
> If there's a friendlier way of releasing a GC allocated hash-tab
> we can switch to that.  Note that in principle the hash-table doesn't
> need to be GC allocated but it needs to be walked since it refers to
> trees that might not be referenced in other ways.

The hashtable has a destructor that does ggc_free, so I think ggc_delete
is the right way to free it.
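
For scev_finalize, for instance, that would be (a sketch based on the
code quoted above):

  /* Instead of scalar_evolution_info->empty ():  */
  ggc_delete (scalar_evolution_info);
  scalar_evolution_info = NULL;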
> 
> > and hashmaps:
> > ipa-reference.c:1133 (ipa_reference_read_optimiz  2047k:  3.0%     3071k        9 :  0.0%  heap
> > tree-ssa.c:60 (redirect_edge_var_map_add)          4125k:  6.1%     4126k     8190 :  0.1%  heap
> 
> Similar as SCEV, probably mis-accounting?
> 
> > alias.c:1200 (record_alias_subset)                4510k:  6.6%     4510k     4546 :  0.0%   ggc
> > ipa-prop.h:986 (ipcp_transformation_t)            8191k: 12.0%       11M       16 :  0.0%   ggc
> > dwarf2out.c:5957 (dwarf2out_register_external_di    47M: 72.2%       71M       12 :  0.0%   ggc
> > 
> > and hashsets:
> > ipa-devirt.c:3093 (possible_polymorphic_call_tar    15k:  0.9%       23k        8 :  0.0%  heap
> > ipa-devirt.c:1599 (add_type_duplicate)              412k: 22.2%      412k     4065 :  0.0%  heap
> > tree-ssa-threadbackward.c:40 (thread_jumps)        1432k: 77.0%     1433k     119k:  0.8%  heap
> > 
> > and vectors:
> > tree-ssa-structalias.c:5783 (push_fields_onto_fi         8    847k:  0.3%   976k   475621:  0.8%    17k    24k
> 
> Huh.  It's an auto_vec<>

Hmm, those maybe get miscounted, I will check.
> 
> > tree-ssa-pre.c:334 (alloc_expression_id)                48   1125k:  0.4%  1187k   198336:  0.3%    23k    34k
> > tree-into-ssa.c:1787 (register_new_update_single         8   1196k:  0.5%  1264k   380385:  0.6%    24k    36k
> > ggc-page.c:1264 (add_finalizer)                          8   1232k:  0.5%  1848k       43:  0.0%    77k    81k
> > tree-ssa-structalias.c:1609 (topo_visit)                 8   1302k:  0.5%  1328k   892964:  1.4%    27k    33k
> > graphds.c:254 (graphds_dfs)                              4   1469k:  0.6%  1675k  2101780:  3.4%    30k    34k
> > dominance.c:955 (get_dominated_to_depth)                 8   2251k:  0.9%  2266k   685140:  1.1%    46k    50k
> > tree-ssa-structalias.c:410 (new_var_info)               32   2264k:  0.9%  2341k   330758:  0.5%    47k    63k
> > tree-ssa-structalias.c:3104 (process_constraint)        48   2376k:  0.9%  2606k   405451:  0.7%    49k    83k
> > symtab.c:612 (create_reference)                          8   3314k:  1.3%  4897k    75213:  0.1%   414k   612k
> > vec.h:1734 (copy)                                       48    233M: 90.5%   234M  6243163: 10.1%  4982k  5003k

Also I should annotate copy.
> 
> Those all look OK to me, not sure why we even think there's a leak?

I think we do not need to hold references anymore (perhaps for aliases -
I will check).  Also all function bodies should be freed by now.
> 
> > However main problem is
> > cfg.c:202 (connect_src)                           5745k:  0.2%   271M:  1.9%  1754k:  0.0%  1132k:  0.2%  7026k
> > cfg.c:212 (connect_dest)                          6307k:  0.2%   281M:  2.0% 10129k:  0.2%  2490k:  0.5%  7172k
> > varasm.c:3359 (build_constant_desc)               7387k:  0.2%      0 :  0.0%     0 :  0.0%     0 :  0.0%    51k
> > emit-rtl.c:486 (gen_raw_REG)                      7799k:  0.2%   215M:  1.5%    96 :  0.0%     0 :  0.0%  9502k
> > dwarf2cfi.c:2341 (add_cfis_to_fde)                8027k:  0.2%      0 :  0.0%  4906k:  0.1%  1405k:  0.3%    78k
> > emit-rtl.c:4074 (make_jump_insn_raw)              8239k:  0.2%    93M:  0.7%     0 :  0.0%     0 :  0.0%  1442k
> > tree-ssanames.c:308 (make_ssa_name_fn)            9130k:  0.2%   45

Re: [PATCH 2/2] combine: Don't turn (mult (extend x) 2^n) into extract

2020-10-26 Thread Alex Coplan via Gcc-patches
Hi Segher,

On 22/10/2020 15:39, Segher Boessenkool wrote:
> On Thu, Oct 15, 2020 at 09:59:24AM +0100, Alex Coplan wrote:
> > Currently, make_extraction() identifies where we can emit an ASHIFT of
> > an extend in place of an extraction, but fails to make the corresponding
> > canonicalization/simplification when presented with a MULT by a power of
> > two. Such a representation is canonical when representing a left-shifted
> > address inside a MEM.
> > 
> > This patch remedies this situation: after the patch, make_extraction()
> > now also identifies RTXs such as:
> > 
> > (mult:DI (subreg:DI (reg:SI r)) (const_int 2^n))
> > 
> > and rewrites this as:
> > 
> > (mult:DI (sign_extend:DI (reg:SI r)) (const_int 2^n))
> > 
> > instead of using a sign_extract.
> 
> That is only correct if SUBREG_PROMOTED_VAR_P is true and
> SUBREG_PROMOTED_UNSIGNED_P is false for r.  Is that guaranteed to be
> true here (and how then?)

Sorry, I didn't give enough context here. For this subreg,
SUBREG_PROMOTED_VAR_P is not set, so I agree that this transformation in
isolation is not valid.

The crucial piece of missing information is that we only make this
transformation in calls to make_extraction where len = 32 + n and
pos_rtx = pos = 0 (so we're extracting the bottom 32 + n bits), and
unsignedp is false (so we're doing a sign_extract).

Below is a proposed commit message, updated with this information.

OK for trunk?

Thanks,
Alex

---

Currently, make_extraction() identifies where we can emit an ashift of
an extend in place of an extraction, but fails to make the corresponding
canonicalization/simplification when presented with a mult by a power of
two. Such a representation is canonical when representing a left-shifted
address inside a mem.

This patch remedies this situation. For rtxes such as:

(mult:DI (subreg:DI (reg:SI r) 0) (const_int 2^n))

where the bottom 32 + n bits are valid (the higher-order bits are
undefined) and make_extraction() is being asked to sign_extract the
lower (valid) bits, after the patch, we rewrite this as:

(mult:DI (sign_extend:DI (reg:SI r)) (const_int 2^n))

instead of using a sign_extract.

(This patch also fixes up a comment in expand_compound_operation() which
appears to have suffered from bitrot.)

For an example of the existing behavior in the ashift case, compiling
the following C testcase at -O2 on AArch64:

int h(void);
struct c d;
struct c {
int e[1];
};

void g(void) {
  int k = 0;
  for (;; k = h()) {
asm volatile ("" :: "r"(&d.e[k]));
  }
}

make_extraction gets called with len=34, pos_rtx=pos=0, unsignedp=false
(so we're sign_extracting the bottom 34 bits), and the following rtx in
inner:

(ashift:DI (subreg:DI (reg/v:SI 93 [ k ]) 0)
(const_int 2 [0x2]))

where it is clear that the bottom 34 bits are valid (and the
higher-order bits are undefined). We then hit the block:

  else if (GET_CODE (inner) == ASHIFT
   && CONST_INT_P (XEXP (inner, 1))
   && pos_rtx == 0 && pos == 0
   && len > UINTVAL (XEXP (inner, 1)))
{
  /* We're extracting the least significant bits of an rtx
 (ashift X (const_int C)), where LEN > C.  Extract the
 least significant (LEN - C) bits of X, giving an rtx
 whose mode is MODE, then shift it left C times.  */
  new_rtx = make_extraction (mode, XEXP (inner, 0),
 0, 0, len - INTVAL (XEXP (inner, 1)),
 unsignedp, in_dest, in_compare);
  if (new_rtx != 0)
return gen_rtx_ASHIFT (mode, new_rtx, XEXP (inner, 1));
}

and the recursive call to make_extraction() is asked to sign_extract the
bottom (LEN - C) = 32 bits of:

(subreg:DI (reg/v:SI 93) 0)

which gives us:

(sign_extend:DI (reg/v:SI 93 [ k ])). The gen_rtx_ASHIFT call then gives
us the final result:

(ashift:DI (sign_extend:DI (reg/v:SI 93 [ k ]))
(const_int 2 [0x2])).

Now all that this patch does is to teach the block that looks for these
shifts how to handle the same thing written as a mult instead of an
ashift. In particular, for the testcase in the PR (96998), we hit
make_extraction with len=34, unsignedp=false, pos_rtx=pos=0, and inner
as:

(mult:DI (subreg:DI (reg/v:SI 92 [ g ]) 0)
(const_int 4 [0x4]))

It should be clear from the above that this can be handled in an
analogous way: the recursive case is precisely the same, the only
difference is that we take the log2 of the shift amount and write the
end result as a mult instead.
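
Sketched in the same shape as the ashift block quoted above (an
illustration of the approach, not the exact patch text):

  else if (GET_CODE (inner) == MULT
           && CONST_INT_P (XEXP (inner, 1))
           && pos_rtx == 0 && pos == 0)
    {
      /* We're extracting the least significant bits of an rtx
         (mult X (const_int 2^C)), where LEN > C.  Extract the least
         significant (LEN - C) bits of X, giving an rtx whose mode is
         MODE, then multiply by 2^C to recover the canonical MULT.  */
      int shift_amt = exact_log2 (UINTVAL (XEXP (inner, 1)));
      if (shift_amt > 0 && len > (unsigned HOST_WIDE_INT) shift_amt)
        {
          new_rtx = make_extraction (mode, XEXP (inner, 0), 0, 0,
                                     len - shift_amt, unsignedp,
                                     in_dest, in_compare);
          if (new_rtx != 0)
            return gen_rtx_MULT (mode, new_rtx, XEXP (inner, 1));
        }
    }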

This fixes several quality regressions on AArch64 after removing support
for addresses represented as sign_extract insns (1/2).

In particular, after the fix for PR96998, for the relevant testcase, we
have:

.L2:
sxtwx0, w0  // 8[c=4 l=4]  *extendsidi2_aarch64/0
add x0, x19, x0, lsl 2  // 39   [c=8 l=4]  *add_lsl_di
bl  h   // 11   [c=4 l=4]  *call_value_insn/1
b   .L2 // 54   [c=4 l=4]  jump

and after this patch, we have:

.L2:
add

Re: [committed] libstdc++: Simplify std::shared_ptr construction from std::weak_ptr

2020-10-26 Thread Jonathan Wakely via Gcc-patches

On 26/10/20 08:07 +0100, Stephan Bergmann wrote:

On 21/10/2020 22:14, Jonathan Wakely via Gcc-patches wrote:

[...]


Committed, thanks.






Re: Materialize clones on demand

2020-10-26 Thread Richard Biener
On Mon, 26 Oct 2020, Jan Hubicka wrote:

> > > We seem to leak some hashtables:
> > > dwarf2out.c:28850 (dwarf2out_init)  31M: 23.8%   
> > > 47M   19 :  0.0%   ggc
> > 
> > that one likely keeps quite some memory live...
> 
> Yep, having in-memory dwaf2out for whole cc1plus eats a lot of memory
> quite naturally.

OTOH the late debug shouldn't be so big ...

> > 
> > > cselib.c:3137 (cselib_init) 34M: 25.9%   
> > > 34M 1514k: 17.3%  heap
> > > tree-scalar-evolution.c:2984 (scev_initialize)  37M: 27.6%   
> > > 50M  228k:  2.6%   ggc
> > 
> > Hmm, so we do
> > 
> >   scalar_evolution_info = hash_table::create_ggc (100);
> > 
> > and
> > 
> >   scalar_evolution_info->empty ();
> >   scalar_evolution_info = NULL;
> > 
> > to reclaim.  ->empty () will IIRC at least allocate 7 elements which we
> > the eventually should reclaim during a GC walk - I guess the hashtable
> > statistics do not really handle GC reclaimed portions?
> > 
> > If there's a friendlier way of releasing a GC allocated hash-tab
> > we can switch to that.  Note that in principle the hash-table doesn't
> > need to be GC allocated but it needs to be walked since it refers to
> > trees that might not be referenced in other ways.
> 
> hashtable has destructor that does ggc_free, so i think ggc_delete is
> right way to free.

Can you try whether that helps?  As said, in the end it's probably
miscounting in the stats.

> > 
> > > and hashmaps:
> > > ipa-reference.c:1133 (ipa_reference_read_optimiz  2047k:  3.0% 
> > > 3071k9 :  0.0%  heap
> > > tree-ssa.c:60 (redirect_edge_var_map_add) 4125k:  6.1% 
> > > 4126k 8190 :  0.1%  heap
> > 
> > Similar as SCEV, probably mis-accounting?
> > 
> > > alias.c:1200 (record_alias_subset)4510k:  6.6% 
> > > 4510k 4546 :  0.0%   ggc
> > > ipa-prop.h:986 (ipcp_transformation_t)8191k: 12.0%   
> > > 11M   16 :  0.0%   ggc
> > > dwarf2out.c:5957 (dwarf2out_register_external_di47M: 72.2%   
> > > 71M   12 :  0.0%   ggc
> > > 
> > > and hashsets:
> > > ipa-devirt.c:3093 (possible_polymorphic_call_tar15k:  0.9%   
> > > 23k8 :  0.0%  heap
> > > ipa-devirt.c:1599 (add_type_duplicate) 412k: 22.2%  
> > > 412k 4065 :  0.0%  heap
> > > tree-ssa-threadbackward.c:40 (thread_jumps)   1432k: 77.0% 
> > > 1433k  119k:  0.8%  heap
> > > 
> > > and vectors:
> > > tree-ssa-structalias.c:5783 (push_fields_onto_fi  8   847k: 
> > > 0.3%  976k475621: 0.8%17k24k
> > 
> > Huh.  It's an auto_vec<>
> 
> Hmm, those maybe gets miscounted, i will check.
> > 
> > > tree-ssa-pre.c:334 (alloc_expression_id) 48  1125k: 
> > > 0.4% 1187k198336: 0.3%23k34k
> > > tree-into-ssa.c:1787 (register_new_update_single  8  1196k: 
> > > 0.5% 1264k380385: 0.6%24k36k
> > > ggc-page.c:1264 (add_finalizer)   8  1232k: 
> > > 0.5% 1848k43: 0.0%77k81k
> > > tree-ssa-structalias.c:1609 (topo_visit)  8  1302k: 
> > > 0.5% 1328k892964: 1.4%27k33k
> > > graphds.c:254 (graphds_dfs)   4  1469k: 
> > > 0.6% 1675k   2101780: 3.4%30k34k
> > > dominance.c:955 (get_dominated_to_depth)  8  2251k: 
> > > 0.9% 2266k685140: 1.1%46k50k
> > > tree-ssa-structalias.c:410 (new_var_info)32  2264k: 
> > > 0.9% 2341k330758: 0.5%47k63k
> > > tree-ssa-structalias.c:3104 (process_constraint) 48  2376k: 
> > > 0.9% 2606k405451: 0.7%49k83k
> > > symtab.c:612 (create_reference)   8  3314k: 
> > > 1.3% 4897k 75213: 0.1%   414k   612k
> > > vec.h:1734 (copy)48   
> > > 233M:90.5%  234M   6243163:10.1%  4982k  5003k
> 
> Also I should annotate copy.

Yeah, some missing annotations might cause issues.

> > 
> > Those all look OK to me, not sure why we even think there's a leak?
> 
> I think we do not need to hold references anymore (perhaps for aliases -
> i will check).  Also all function bodies should be freed by now.
> > 
> > > However main problem is
> > > cfg.c:202 (connect_src)   5745k:  0.2%  
> > > 271M:  1.9% 1754k:  0.0% 1132k:  0.2% 7026k
> > > cfg.c:212 (connect_dest)  6307k:  0.2%  
> > > 281M:  2.0%10129k:  0.2% 2490k:  0.5% 7172k
> > > varasm.c:3359 (build_constant_desc)   7387k:  0.2%
> > > 0 :  0.0%0 :  0.0%0 :  0.0%   51k
> > > emit-rtl.c:486 (gen_raw_REG)  7

[PATCH] middle-end/97554 - avoid overflow in alloc size compute

2020-10-26 Thread Richard Biener
This avoids overflow in the allocation size computations in
sbitmap_vector_alloc when the result exceeds 2GB.

Bootstrapped / tested on x86_64-unknown-linux-gnu, pushed.
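
As a rough illustration of the failure mode (numbers assumed, not taken
from the PR): with n_vecs = 70000 and n_elms = 500000, each bitmap needs
on the order of 62 kB, so the total allocation is about 70000 * 62 kB,
roughly 4.4 GB; computed in a 32-bit unsigned int that wraps modulo 2^32
to under 100 MB, and the vector would be badly under-allocated.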

2020-10-26  Richard Biener  

* sbitmap.c (sbitmap_vector_alloc): Use size_t for byte
quantities to avoid overflow.
---
 gcc/sbitmap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/sbitmap.c b/gcc/sbitmap.c
index 292e7eede5a..3a43fe35bb1 100644
--- a/gcc/sbitmap.c
+++ b/gcc/sbitmap.c
@@ -139,7 +139,8 @@ sbitmap_realloc (sbitmap src, unsigned int n_elms)
 sbitmap *
 sbitmap_vector_alloc (unsigned int n_vecs, unsigned int n_elms)
 {
-  unsigned int i, bytes, offset, elm_bytes, size, amt, vector_bytes;
+  unsigned int i, size;
+  size_t amt, bytes, vector_bytes, elm_bytes, offset;
   sbitmap *bitmap_vector;
 
   size = SBITMAP_SET_SIZE (n_elms);
-- 
2.26.2


[PATCH] tree-optimization/97539 - reset out-of-loop debug uses before peeling

2020-10-26 Thread Richard Biener
This makes sure to reset out-of-loop debug uses before vectorizer
loop peeling as we cannot make sure to retain the use-def dominance
relationship when there are no LC SSA nodes.

Bootstrapped / tested on x86_64-unknown-linux-gnu, pushed.

2020-10-26  Richard Biener  

PR tree-optimization/97539
* tree-vect-loop-manip.c (vect_do_peeling): Reset out-of-loop
debug uses before peeling.

* gcc.dg/pr97539.c: New testcase.
---
 gcc/testsuite/gcc.dg/pr97539.c | 17 ++
 gcc/tree-vect-loop-manip.c | 41 +-
 2 files changed, 57 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr97539.c

diff --git a/gcc/testsuite/gcc.dg/pr97539.c b/gcc/testsuite/gcc.dg/pr97539.c
new file mode 100644
index 000..def55e1d6ee
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr97539.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -g" } */
+
+int a, b;
+void c() {
+  char d;
+  for (; b;)
+for (;;)
+  for (; d <= 7; d += 1) {
+a = 7;
+for (; a; a += 1)
+e:
+  d += d;
+d ^= 0;
+  }
+  goto e;
+}
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 7cf00e6eed4..5d00b6fb956 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -2545,6 +2545,45 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, 
tree nitersm1,
   if (!prolog_peeling && !epilog_peeling)
 return NULL;
 
+  /* Before doing any peeling make sure to reset debug binds outside of
+ the loop refering to defs not in LC SSA.  */
+  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  for (unsigned i = 0; i < loop->num_nodes; ++i)
+{
+  basic_block bb = LOOP_VINFO_BBS (loop_vinfo)[i];
+  imm_use_iterator ui;
+  gimple *use_stmt;
+  for (gphi_iterator gsi = gsi_start_phis (bb); !gsi_end_p (gsi);
+  gsi_next (&gsi))
+   {
+ FOR_EACH_IMM_USE_STMT (use_stmt, ui, gimple_phi_result (gsi.phi ()))
+   if (gimple_debug_bind_p (use_stmt)
+   && loop != gimple_bb (use_stmt)->loop_father
+   && !flow_loop_nested_p (loop,
+   gimple_bb (use_stmt)->loop_father))
+ {
+   gimple_debug_bind_reset_value (use_stmt);
+   update_stmt (use_stmt);
+ }
+   }
+  for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
+  gsi_next (&gsi))
+   {
+ ssa_op_iter op_iter;
+ def_operand_p def_p;
+ FOR_EACH_SSA_DEF_OPERAND (def_p, gsi_stmt (gsi), op_iter, SSA_OP_DEF)
+   FOR_EACH_IMM_USE_STMT (use_stmt, ui, DEF_FROM_PTR (def_p))
+ if (gimple_debug_bind_p (use_stmt)
+ && loop != gimple_bb (use_stmt)->loop_father
+ && !flow_loop_nested_p (loop,
+ gimple_bb (use_stmt)->loop_father))
+   {
+ gimple_debug_bind_reset_value (use_stmt);
+ update_stmt (use_stmt);
+   }
+   }
+}
+
   prob_vector = profile_probability::guessed_always ().apply_scale (9, 10);
   estimated_vf = vect_vf_for_cost (loop_vinfo);
   if (estimated_vf == 2)
@@ -2552,7 +2591,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, 
tree nitersm1,
   prob_prolog = prob_epilog = profile_probability::guessed_always ()
.apply_scale (estimated_vf - 1, estimated_vf);
 
-  class loop *prolog, *epilog = NULL, *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  class loop *prolog, *epilog = NULL;
   class loop *first_loop = loop;
   bool irred_flag = loop_preheader_edge (loop)->flags & EDGE_IRREDUCIBLE_LOOP;
 
-- 
2.26.2



Re: Materialize clones on demand

2020-10-26 Thread Jan Hubicka
> > > 
> > > > cselib.c:3137 (cselib_init) 34M: 25.9%  
> > > >  34M 1514k: 17.3%  heap
> > > > tree-scalar-evolution.c:2984 (scev_initialize)  37M: 27.6%  
> > > >  50M  228k:  2.6%   ggc
> > > 
> > > Hmm, so we do
> > > 
> > >   scalar_evolution_info = hash_table::create_ggc (100);
> > > 
> > > and
> > > 
> > >   scalar_evolution_info->empty ();
> > >   scalar_evolution_info = NULL;
> > > 
> > > to reclaim.  ->empty () will IIRC at least allocate 7 elements which we
> > > the eventually should reclaim during a GC walk - I guess the hashtable
> > > statistics do not really handle GC reclaimed portions?
> > > 
> > > If there's a friendlier way of releasing a GC allocated hash-tab
> > > we can switch to that.  Note that in principle the hash-table doesn't
> > > need to be GC allocated but it needs to be walked since it refers to
> > > trees that might not be referenced in other ways.
> > 
> > hashtable has destructor that does ggc_free, so i think ggc_delete is
> > right way to free.
> 
> Can you try if that helps?  As said, in the end it's probably
> miscountings in the stats.

I do not think we are miscounting here.  empty () really allocates a
small hashtable and leaves it alone.
It should be ggc_delete.  I will test it.
> 
> > > 
> > > > and hashmaps:
> > > > ipa-reference.c:1133 (ipa_reference_read_optimiz  2047k:  3.0% 
> > > > 3071k9 :  0.0%  heap
> > > > tree-ssa.c:60 (redirect_edge_var_map_add) 4125k:  6.1% 
> > > > 4126k 8190 :  0.1%  heap
> > > 
> > > Similar as SCEV, probably mis-accounting?
> > > 
> > > > alias.c:1200 (record_alias_subset)4510k:  6.6% 
> > > > 4510k 4546 :  0.0%   ggc
> > > > ipa-prop.h:986 (ipcp_transformation_t)8191k: 12.0%  
> > > >  11M   16 :  0.0%   ggc
> > > > dwarf2out.c:5957 (dwarf2out_register_external_di47M: 72.2%  
> > > >  71M   12 :  0.0%   ggc
> > > > 
> > > > and hashsets:
> > > > ipa-devirt.c:3093 (possible_polymorphic_call_tar15k:  0.9%  
> > > >  23k8 :  0.0%  heap
> > > > ipa-devirt.c:1599 (add_type_duplicate) 412k: 22.2%  
> > > > 412k 4065 :  0.0%  heap
> > > > tree-ssa-threadbackward.c:40 (thread_jumps)   1432k: 77.0% 
> > > > 1433k  119k:  0.8%  heap
> > > > 
> > > > and vectors:
> > > > tree-ssa-structalias.c:5783 (push_fields_onto_fi  8   847k: 
> > > > 0.3%  976k475621: 0.8%17k24k
> > > 
> > > Huh.  It's an auto_vec<>
> > 
> > Hmm, those maybe gets miscounted, i will check.
> > > 
> > > > tree-ssa-pre.c:334 (alloc_expression_id) 48  1125k: 
> > > > 0.4% 1187k198336: 0.3%23k34k
> > > > tree-into-ssa.c:1787 (register_new_update_single  8  1196k: 
> > > > 0.5% 1264k380385: 0.6%24k36k
> > > > ggc-page.c:1264 (add_finalizer)   8  1232k: 
> > > > 0.5% 1848k43: 0.0%77k81k
> > > > tree-ssa-structalias.c:1609 (topo_visit)  8  1302k: 
> > > > 0.5% 1328k892964: 1.4%27k33k
> > > > graphds.c:254 (graphds_dfs)   4  1469k: 
> > > > 0.6% 1675k   2101780: 3.4%30k34k
> > > > dominance.c:955 (get_dominated_to_depth)  8  2251k: 
> > > > 0.9% 2266k685140: 1.1%46k50k
> > > > tree-ssa-structalias.c:410 (new_var_info)32  2264k: 
> > > > 0.9% 2341k330758: 0.5%47k63k
> > > > tree-ssa-structalias.c:3104 (process_constraint) 48  2376k: 
> > > > 0.9% 2606k405451: 0.7%49k83k
> > > > symtab.c:612 (create_reference)   8  3314k: 
> > > > 1.3% 4897k 75213: 0.1%   414k   612k
> > > > vec.h:1734 (copy)48   
> > > > 233M:90.5%  234M   6243163:10.1%  4982k  5003k
> > 
> > Also I should annotate copy.
> 
> Yeah, some missing annotations might cause issues.

It will only let us see who copies the vectors ;)

auto_vecs I think are special since we may manage to miscount the
pre-allocated space.  I will look into that.
> > > 
> > > Well, we're building a DIE tree for the whole unit here so I'm not sure
> > > what parts we can optimize.  The structures may keep quite some stuff
> > > on the tree side live through the decl -> DIE and block -> DIE maps
> > > and the external_die_map used for LTO streaming (but if we lazily stream
> > > bodies we do need to keep this map ... unless we add some
> > > start/end-stream-body hooks and doing the map per function.  But then
> > > we build the DIEs lazily as well so the query of the map is lazy :/)
> > 
> > Yep, not sure how much we could do here.  Of course ggc_collect when
> > invoked will do quite a lot of walking t

[Committed] IBM Z: Add vcond_mask expander

2020-10-26 Thread Andreas Krebbel via Gcc-patches
After adding vec_cmp expanders we have seen various performance-related
regressions in the testsuite.  These appear to be caused by a missing
vcond_mask definition in the backend.  Fixed with this patch.
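
For illustration, the kind of source loop affected looks like this
(hypothetical example, not one of the failing tests):

  void f (int *r, int *a, int *b, int *m, int n)
  {
    for (int i = 0; i < n; i++)
      r[i] = m[i] ? a[i] : b[i];
  }

The vectorizer expresses the select as a VEC_COND_EXPR on a mask, and
without a vcond_mask pattern the backend could not expand that form
directly.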

The patch fixes the following testsuite fails:

FAIL: gcc.dg/vect/vect-21.c -flto -ffat-lto-objects  scan-tree-dump-times vect "vectorized 3 loops" 1
FAIL: gcc.dg/vect/vect-21.c scan-tree-dump-times vect "vectorized 3 loops" 1
FAIL: gcc.dg/vect/vect-23.c -flto -ffat-lto-objects  scan-tree-dump-times vect "vectorized 3 loops" 1
FAIL: gcc.dg/vect/vect-23.c scan-tree-dump-times vect "vectorized 3 loops" 1
FAIL: gcc.dg/vect/vect-24.c -flto -ffat-lto-objects  scan-tree-dump-times vect "vectorized 3 loops" 1
FAIL: gcc.dg/vect/vect-24.c scan-tree-dump-times vect "vectorized 3 loops" 1
FAIL: gcc.dg/vect/vect-live-6.c -flto -ffat-lto-objects  scan-tree-dump vect "vectorized 1 loops"
FAIL: gcc.dg/vect/vect-live-6.c scan-tree-dump vect "vectorized 1 loops"
FAIL: gcc.target/s390/vector/vcond-shift.c scan-assembler-times vesrab\\t%v.?,%v.?,7 6
FAIL: gcc.target/s390/vector/vcond-shift.c scan-assembler-times vesraf\\t%v.?,%v.?,31 6
FAIL: gcc.target/s390/vector/vcond-shift.c scan-assembler-times vesrah\\t%v.?,%v.?,15 6
FAIL: gcc.target/s390/vector/vcond-shift.c scan-assembler-times vesrlb\\t%v.?,%v.?,7 4
FAIL: gcc.target/s390/vector/vcond-shift.c scan-assembler-times vesrlf\\t%v.?,%v.?,31 4
FAIL: gcc.target/s390/vector/vcond-shift.c scan-assembler-times vesrlh\\t%v.?,%v.?,15 4

Bootstrapped and regression tested on s390x.

gcc/ChangeLog:

* config/s390/vector.md ("vcond_mask_<mode><tointvec>"): New expander.
---
 gcc/config/s390/vector.md | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/gcc/config/s390/vector.md b/gcc/config/s390/vector.md
index 3c01cd1b1e1..3e621daf7b1 100644
--- a/gcc/config/s390/vector.md
+++ b/gcc/config/s390/vector.md
@@ -658,6 +658,17 @@ (define_expand "vcondu"
   DONE;
 })
 
+(define_expand "vcond_mask_<mode><tointvec>"
+  [(set (match_operand:V 0 "register_operand" "")
+       (if_then_else:V
+        (eq (match_operand:<TOINTVEC> 3 "register_operand" "")
+            (match_dup 4))
+        (match_operand:V 2 "register_operand" "")
+        (match_operand:V 1 "register_operand" "")))]
+  "TARGET_VX"
+  "operands[4] = CONST0_RTX (<TOINTVEC>mode);")
+
+
 ; We only have HW support for byte vectors.  The middle-end is
 ; supposed to lower the mode if required.
 (define_insn "vec_permv16qi"
-- 
2.25.1



Re: [PATCH 2/2] combine: Don't turn (mult (extend x) 2^n) into extract

2020-10-26 Thread Segher Boessenkool
Hi!

On Mon, Oct 26, 2020 at 10:09:41AM +, Alex Coplan wrote:
> On 22/10/2020 15:39, Segher Boessenkool wrote:
> > On Thu, Oct 15, 2020 at 09:59:24AM +0100, Alex Coplan wrote:
> > > Currently, make_extraction() identifies where we can emit an ASHIFT of
> > > an extend in place of an extraction, but fails to make the corresponding
> > > canonicalization/simplification when presented with a MULT by a power of
> > > two. Such a representation is canonical when representing a left-shifted
> > > address inside a MEM.
> > > 
> > > This patch remedies this situation: after the patch, make_extraction()
> > > now also identifies RTXs such as:
> > > 
> > > (mult:DI (subreg:DI (reg:SI r)) (const_int 2^n))
> > > 
> > > and rewrites this as:
> > > 
> > > (mult:DI (sign_extend:DI (reg:SI r)) (const_int 2^n))
> > > 
> > > instead of using a sign_extract.
> > 
> > That is only correct if SUBREG_PROMOTED_VAR_P is true and
> > SUBREG_PROMOTED_UNSIGNED_P is false for r.  Is that guaranteed to be
> > true here (and how then?)
> 
> Sorry, I didn't give enough context here. For this subreg,
> SUBREG_PROMOTED_VAR_P is not set, so I agree that this transformation in
> isolation is not valid.
> 
> The crucial piece of missing information is that we only make this
> transformation in calls to make_extraction where len = 32 + n and
> pos_rtx = pos = 0 (so we're extracting the bottom 32 + n bits), and
> unsignedp is false (so we're doing a sign_extract).

The high half of a DI subreg of a SI reg is *undefined* if
SUBREG_PROMOTED_VAR_P is not set.  So the code you get as input:

> (ashift:DI (subreg:DI (reg/v:SI 93 [ k ]) 0)
> (const_int 2 [0x2]))

... is already incorrect.  Please fix that?

> where it is clear that the bottom 34 bits are valid (and the
> higher-order bits are undefined). We then hit the block:

No, only the bottom 32 bits are valid.

> diff --git a/gcc/combine.c b/gcc/combine.c
> index c88382efbd3..fe8eff2b464 100644
> --- a/gcc/combine.c
> +++ b/gcc/combine.c
> @@ -7419,8 +7419,8 @@ expand_compound_operation (rtx x)
>  }
>  
>/* If we reach here, we want to return a pair of shifts.  The inner
> - shift is a left shift of BITSIZE - POS - LEN bits.  The outer
> - shift is a right shift of BITSIZE - LEN bits.  It is arithmetic or
> + shift is a left shift of MODEWIDTH - POS - LEN bits.  The outer
> + shift is a right shift of MODEWIDTH - LEN bits.  It is arithmetic or
>   logical depending on the value of UNSIGNEDP.
>  
>   If this was a ZERO_EXTEND or ZERO_EXTRACT, this pair of shifts will be

MODEWIDTH isn't defined here yet, it is initialised just below to
MODE_PRECISION (mode).


Segher


Re: [PATCH] arm: Fix multiple inheritance thunks for thumb-1 with -mpure-code

2020-10-26 Thread Christophe Lyon via Gcc-patches
On Thu, 22 Oct 2020 at 17:22, Richard Earnshaw
 wrote:
>
> On 22/10/2020 09:45, Christophe Lyon via Gcc-patches wrote:
> > On Wed, 21 Oct 2020 at 19:36, Richard Earnshaw
> >  wrote:
> >>
> >> On 21/10/2020 17:11, Christophe Lyon via Gcc-patches wrote:
> >>> On Wed, 21 Oct 2020 at 18:07, Richard Earnshaw
> >>>  wrote:
> 
>  On 21/10/2020 16:49, Christophe Lyon via Gcc-patches wrote:
> > On Tue, 20 Oct 2020 at 13:25, Richard Earnshaw
> >  wrote:
> >>
> >> On 20/10/2020 12:22, Richard Earnshaw wrote:
> >>> On 19/10/2020 17:32, Christophe Lyon via Gcc-patches wrote:
>  On Mon, 19 Oct 2020 at 16:39, Richard Earnshaw
>   wrote:
> >
> > On 12/10/2020 08:59, Christophe Lyon via Gcc-patches wrote:
> >> On Thu, 8 Oct 2020 at 11:58, Richard Earnshaw
> >>  wrote:
> >>>
> >>> On 08/10/2020 10:07, Christophe Lyon via Gcc-patches wrote:
>  On Tue, 6 Oct 2020 at 18:02, Richard Earnshaw
>   wrote:
> >
> > On 29/09/2020 20:50, Christophe Lyon via Gcc-patches wrote:
> >> When mi_delta is > 255 and -mpure-code is used, we cannot load 
> >> delta
> >> from code memory (like we do without -mpure-code).
> >>
> >> This patch builds the value of mi_delta into r3 with a series 
> >> of
> >> movs/adds/lsls.
> >>
> >> We also do some cleanup by not emitting the function address 
> >> and delta
> >> via .word directives at the end of the thunk since we don't 
> >> use them
> >> with -mpure-code.
> >>
> >> No need for new testcases, this bug was already identified by
> >> eg. pr46287-3.C
> >>
> >> 2020-09-29  Christophe Lyon  
> >>
> >>   gcc/
> >>   * config/arm/arm.c (arm_thumb1_mi_thunk): Build mi_delta 
> >> in r3 and
> >>   do not emit function address and delta when -mpure-code 
> >> is used.
> >
>  Hi Richard,
> 
>  Thanks for your comments.
> 
> > There are some optimizations you can make to this code.
> >
> > Firstly, for values between 256 and 510 (inclusive), it would 
> > be better
> > to just expand a mov of 255 followed by an add.
>  I now see the splitter for the "Pe" constraint which I hadn't noticed
>  before, so I can write something similar indeed.
> 
>  However, I'm not quite sure I understand the benefit of the split
>  when -mpure-code is NOT used.
>  Consider:
>  int f3_1 (void) { return 510; }
>  int f3_2 (void) { return 511; }
>  Compile with -O2 -mcpu=cortex-m0:
>  f3_1:
>  movs    r0, #255
>  lsls    r0, r0, #1
>  bx  lr
>  f3_2:
>  ldr r0, .L4
>  bx  lr
> 
>  The splitter makes the code bigger, does it "compensate" for 
>  this by
>  not having to load the constant?
>  Actually the constant uses 4 more bytes, which should be taken 
>  into
>  account when comparing code size,
> >>>
> >>> Yes, the size of the literal pool entry needs to be taken into 
> >>> account.
> >>>  It might happen that the entry could be shared with another use 
> >>> of that
> >>> literal, but in general that's rare.
> >>>
>  so f3_1 uses 6 bytes, and f3_2 uses 8, so as you say below three
>  thumb1 instructions would be equivalent in size compared to 
>  loading
>  from the literal pool. Should the 256-510 range be extended?
> >>>
> >>> It's a bit borderline at three instructions when literal pools 
> >>> are not
> >>> expensive to use, but in thumb1 literal pools tend to be quite 
> >>> small due
> >>> to the limited pc offsets we can use.  I think on balance we 
> >>> probably
> >>> want to use the instruction sequence unless optimizing for size.
> >>>
> 
> 
> > This is also true for
> > the literal pools alternative as well, so should be handled 
> > before all
> > this.
>  I am not sure what you mean: with -mpure-code, the above sample 
>  is compiled as:
>  f3_1:
>  movs    r0, #255
>  lsls    r0, r0, #1
>  bx  lr
>  f3_2:
>  movs    r0, #1
> 

RE: [PATCH 6/6] ipa-cp: Separate and increase the large-unit parameter

2020-10-26 Thread Tamar Christina via Gcc-patches
Hi Martin,

I have been playing with --param ipa-cp-large-unit-insns but it doesn't seem
to have any meaningful effect on exchange2 and I still can't recover the 12%
regression vs GCC 10.

Do I need to use another parameter here?

Thanks,
Tamar

> -Original Message-
> From: Gcc-patches  On Behalf Of Martin
> Jambor
> Sent: Monday, September 21, 2020 3:25 PM
> To: GCC Patches 
> Cc: Jan Hubicka 
> Subject: [PATCH 6/6] ipa-cp: Separate and increase the large-unit parameter
> 
> A previous patch in the series has taught IPA-CP to identify the important
> cloning opportunities in 548.exchange2_r as worthwhile on their own, but
> the optimization is still prevented from taking place because of the overall
> unit-growth limit.  This patch raises that limit so that it takes place and
> the benchmark runs 30% faster (on AMD Zen2 CPU at least).
> 
> Before this patch, IPA-CP uses the following formulae to arrive at the
> overall_size limit:
> 
> base = MAX(orig_size, param_large_unit_insns)
> unit_growth_limit = base + base * param_ipa_cp_unit_growth / 100
> 
> since param_ipa_cp_unit_growth has default 10 and param_large_unit_insns
> has default value 10000.
> 
> The problem with exchange2 (at least on zen2 but I have had a quick look on
> aarch64 too) is that the original estimated unit size is 10513 and so
> param_large_unit_insns does not apply and the default limit is therefore
> 11564 which is good enough only for one of the ideal 8 clonings, we need the
> limit to be at least 16291.
> 
> I would like to raise param_ipa_cp_unit_growth a little bit more soon too,
> but most certainly not to 55.  Therefore, the large_unit must be increased.
> In this patch, I decided to decouple the inlining and ipa-cp large-unit
> parameters.  It also makes sense because IPA-CP uses it only at -O3 while
> inlining also at -O2 (IIUC).  But if we agree we can try raising
> param_large_unit_insns to 13-14 thousand "instructions," perhaps it is not
> necessary.  But then again, it may
> make sense to actually increase the IPA-CP limit further.
> 
> I plan to experiment with IPA-CP tuning on a larger set of programs.
> Meanwhile, mainly to address the 548.exchange2_r regression, I'm
> suggesting this simple change.
> 
> gcc/ChangeLog:
> 
> 2020-09-07  Martin Jambor  
> 
>   * params.opt (ipa-cp-large-unit-insns): New parameter.
>   * ipa-cp.c (get_max_overall_size): Use the new parameter.
> ---
>  gcc/ipa-cp.c   | 2 +-
>  gcc/params.opt | 4 ++++
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
> index 12acf24c553..2152f9e5876 100644
> --- a/gcc/ipa-cp.c
> +++ b/gcc/ipa-cp.c
> @@ -3448,7 +3448,7 @@ static long
>  get_max_overall_size (cgraph_node *node)
>  {
>    long max_new_size = orig_overall_size;
> -  long large_unit = opt_for_fn (node->decl, param_large_unit_insns);
> +  long large_unit = opt_for_fn (node->decl, param_ipa_cp_large_unit_insns);
>    if (max_new_size < large_unit)
>      max_new_size = large_unit;
>    int unit_growth = opt_for_fn (node->decl, param_ipa_cp_unit_growth);
> diff --git a/gcc/params.opt b/gcc/params.opt
> index acb59f17e45..9d177ab50ad 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -218,6 +218,10 @@ Percentage penalty functions containing a single call to another function will r
>  Common Joined UInteger Var(param_ipa_cp_unit_growth) Init(10) Param Optimization
>  How much can given compilation unit grow because of the interprocedural constant propagation (in percent).
> 
> +-param=ipa-cp-large-unit-insns=
> +Common Joined UInteger Var(param_ipa_cp_large_unit_insns) Optimization Init(16000) Param
> +The size of translation unit that IPA-CP pass considers large.
> +
>  -param=ipa-cp-value-list-size=
>  Common Joined UInteger Var(param_ipa_cp_value_list_size) Init(8) Param Optimization
>  Maximum size of a list of values associated with each parameter for interprocedural constant propagation.
> --
> 2.28.0
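
For concreteness, plugging the quoted numbers into the formulae above (a
worked check added for illustration, not text from the original mails):

  base              = MAX (10513, 10000)        = 10513
  unit_growth_limit = 10513 + 10513 * 10 / 100  = 11564   (one cloning fits)

  with the new default large_unit = 16000:
  base              = MAX (10513, 16000)        = 16000
  unit_growth_limit = 16000 + 16000 * 10 / 100  = 17600   (>= 16291, all 8 fit)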


Re: [PATCH 2/2] combine: Don't turn (mult (extend x) 2^n) into extract

2020-10-26 Thread Alex Coplan via Gcc-patches
On 26/10/2020 05:48, Segher Boessenkool wrote:
> Hi!
> 
> On Mon, Oct 26, 2020 at 10:09:41AM +, Alex Coplan wrote:
> > On 22/10/2020 15:39, Segher Boessenkool wrote:
> > > On Thu, Oct 15, 2020 at 09:59:24AM +0100, Alex Coplan wrote:
> > > > Currently, make_extraction() identifies where we can emit an ASHIFT of
> > > > an extend in place of an extraction, but fails to make the corresponding
> > > > canonicalization/simplification when presented with a MULT by a power of
> > > > two. Such a representation is canonical when representing a left-shifted
> > > > address inside a MEM.
> > > > 
> > > > This patch remedies this situation: after the patch, make_extraction()
> > > > now also identifies RTXs such as:
> > > > 
> > > > (mult:DI (subreg:DI (reg:SI r)) (const_int 2^n))
> > > > 
> > > > and rewrites this as:
> > > > 
> > > > (mult:DI (sign_extend:DI (reg:SI r)) (const_int 2^n))
> > > > 
> > > > instead of using a sign_extract.
> > > 
> > > That is only correct if SUBREG_PROMOTED_VAR_P is true and
> > > SUBREG_PROMOTED_UNSIGNED_P is false for r.  Is that guaranteed to be
> > > true here (and how then?)
> > 
> > Sorry, I didn't give enough context here. For this subreg,
> > SUBREG_PROMOTED_VAR_P is not set, so I agree that this transformation in
> > isolation is not valid.
> > 
> > The crucial piece of missing information is that we only make this
> > transformation in calls to make_extraction where len = 32 + n and
> > pos_rtx = pos = 0 (so we're extracting the bottom 32 + n bits), and
> > unsignedp is false (so we're doing a sign_extract).
> 
> The high half of a DI subreg of a SI reg is *undefined* if
> SUBREG_PROMOTED_VAR_P is not set.  So the code you get as input:
> 
> > (ashift:DI (subreg:DI (reg/v:SI 93 [ k ]) 0)
> > (const_int 2 [0x2]))
> 
> ... is already incorrect.  Please fix that?
> 
> > where it is clear that the bottom 34 bits are valid (and the
> > higher-order bits are undefined). We then hit the block:
> 
> No, only the bottom 32 bits are valid.

Well, only the low 32 bits of the subreg are valid. But because those
low 32 bits are shifted left 2 times, the low 34 bits of the ashift are
valid: the bottom 2 bits of the ashift are zeros, and the 32 bits above
those are from the inner SImode reg (with the upper 62 bits being
undefined).
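
To picture the layout being described (an illustrative sketch, not text from
the original mail):

  (ashift:DI (subreg:DI (reg:SI r) 0) (const_int 2))

  bits 63..34: undefined (from the paradoxical subreg)
  bits 33..2:  r[31..0], shifted into place
  bits  1..0:  zero (introduced by the shift)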

> 
> > diff --git a/gcc/combine.c b/gcc/combine.c
> > index c88382efbd3..fe8eff2b464 100644
> > --- a/gcc/combine.c
> > +++ b/gcc/combine.c
> > @@ -7419,8 +7419,8 @@ expand_compound_operation (rtx x)
> >  }
> >  
> >/* If we reach here, we want to return a pair of shifts.  The inner
> > - shift is a left shift of BITSIZE - POS - LEN bits.  The outer
> > - shift is a right shift of BITSIZE - LEN bits.  It is arithmetic or
> > + shift is a left shift of MODEWIDTH - POS - LEN bits.  The outer
> > + shift is a right shift of MODEWIDTH - LEN bits.  It is arithmetic or
> >   logical depending on the value of UNSIGNEDP.
> >  
> >   If this was a ZERO_EXTEND or ZERO_EXTRACT, this pair of shifts will be
> 
> MODEWIDTH isn't defined here yet, it is initialised just below to
> MODE_PRECISION (mode).

Yes, but bitsize isn't defined at all in this function AFAICT. Are
comments not permitted to refer to variables defined immediately beneath
them?

Alex


[ping*n] aarch64: move and adjust PROBE_STACK_*_REG

2020-10-26 Thread Olivier Hainque
Ping, please ?

Thanks in advance,

Olivier

> On 15 Oct 2020, at 08:38, Olivier Hainque  wrote:
> 
> Ping, please ?
> 
> Patch re-attached for convenience.
> 
> Thanks in advance!
> 
> Best Regards,
> 
> Olivier
> 
>> On 24 Sep 2020, at 11:46, Olivier Hainque  wrote:
>> 
>> Re-proposing this patch after re-testing with a recent
>> mainline on on aarch64-linux (bootstrap and regression test
>> with --enable-languages=all), and more than a year of in-house
>> use in production for a few aarch64 ports on a gcc-9 base.
>> 
>> The change moves the definitions of PROBE_STACK_FIRST_REG
>> and PROBE_STACK_SECOND_REG to a more appropriate place for such
>> items (here, in aarch64.md as suggested by Richard), and adjusts
>> their value from r9/r10 to r10/r11 to free r9 for a possibly
>> more general purpose (e.g. as a static chain at least on targets
>> which have a private use of r18, such as Windows or Vxworks).
>> 
>> OK to commit?
>> 
>> Thanks in advance,
>> 
>> With Kind Regards,
>> 
>> Olivier
>> 
>> 2020-11-07  Olivier Hainque  
>> 
>>  * config/aarch64/aarch64.md: Define PROBE_STACK_FIRST_REGNUM
>>  and PROBE_STACK_SECOND_REGNUM constants, designating r10/r11.
>>  Replacements for the PROBE_STACK_FIRST/SECOND_REG constants in
>>  aarch64.c.
>>  * config/aarch64/aarch64.c (PROBE_STACK_FIRST_REG): Remove.
>>  (PROBE_STACK_SECOND_REG): Remove.
>>  (aarch64_emit_probe_stack_range): Adjust to the _REG -> _REGNUM
>>  suffix update for PROBE_STACK register numbers.
> 
> 



Re: [PATCH 2/2] combine: Don't turn (mult (extend x) 2^n) into extract

2020-10-26 Thread Alex Coplan via Gcc-patches
On 26/10/2020 11:06, Alex Coplan via Gcc-patches wrote:

> Well, only the low 32 bits of the subreg are valid. But because those
> low 32 bits are shifted left 2 times, the low 34 bits of the ashift are
> valid: the bottom 2 bits of the ashift are zeros, and the 32 bits above
> those are from the inner SImode reg (with the upper 62 bits being
> undefined).

s/upper 62 bits/upper 30 bits/


Re: [PATCH]AArch64 Fix overflow in memcopy expansion on aarch64.

2020-10-26 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
>   /* We can't do anything smart if the amount to copy is not constant.  */
>   if (!CONST_INT_P (operands[2]))
>     return false;
> 
> -  n = INTVAL (operands[2]);
> +  /* This may get truncated but that's fine as it would be above our maximum
> + memset inline limit.  */
> +  unsigned tmp = INTVAL (operands[2]);

That's not true for (1ULL << 32) + 1 for example, since the truncated
value will come under the limit.  I think we should just do:

  unsigned HOST_WIDE_INT tmp = UINTVAL (operands[2]);

without a comment.
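
To spell out the hazard (an illustrative standalone snippet, not part of the
review itself):

  #include <stdio.h>

  int
  main (void)
  {
    /* A copy of (1ULL << 32) + 1 bytes truncates to 1 in a 32-bit
       'unsigned' and would wrongly pass a small inline-copy limit.  */
    unsigned long long size = (1ULL << 32) + 1;
    unsigned truncated = size;
    printf ("%llu truncates to %u\n", size, truncated);
    return 0;
  }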

Thanks,
Richard


RE: [PATCH] PR tree-optimization/97546 Bail out of find_bswap_or_nop on non-INTEGER_CST sizes

2020-10-26 Thread Kyrylo Tkachov via Gcc-patches


> -Original Message-
> From: Jakub Jelinek 
> Sent: 26 October 2020 09:32
> To: Kyrylo Tkachov 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] PR tree-optimization/97546 Bail out of
> find_bswap_or_nop on non-INTEGER_CST sizes
> 
> On Mon, Oct 26, 2020 at 09:20:42AM +, Kyrylo Tkachov via Gcc-patches
> wrote:
> > This patch fixes the ICE in the PR by bailing out of find_bswap_or_nop on
> poly_int sizes.
> > I don't think it intends to handle them and from my reading of the code it's
> the most appropriate place to reject them
> > here rather than in the callers.
> >
> > Bootstrapped and tested on aarch64-none-linux-gnu.
> >
> > Ok for trunk?
> > Thanks,
> > Kyrill
> >
> > gcc/
> > PR tree-optimization/97546
> > * gimple-ssa-store-merging.c (find_bswap_or_nop): Return NULL if
> type is
> > not INTEGER_CST.
> 
> I think better use tree_fits_uhwi_p instead of cst_and_fits_hwi and
> instead of TREE_INT_CST_LOW use tree_to_uhwi.
> TYPE_SIZE_UNIT which doesn't fit into uhwi but fits into shwi is something
> that really shouldn't appear.
> Otherwise LGTM.

Thanks, that makes sense.
Is the attached patch ok?
Kyrill

> 
> > gcc/testsuite/
> > PR tree-optimization/97546
> > * gcc.target/aarch64/sve/acle/general/pr97546.c: New test.
> 
> 
> 
>   Jakub



sm-poly.patch
Description: sm-poly.patch


Re: [PATCH] PR tree-optimization/97546 Bail out of find_bswap_or_nop on non-INTEGER_CST sizes

2020-10-26 Thread Jakub Jelinek via Gcc-patches
On Mon, Oct 26, 2020 at 11:32:43AM +, Kyrylo Tkachov wrote:
> Thanks, that makes sense.
> Is the attached patch ok?

--- a/gcc/gimple-ssa-store-merging.c
+++ b/gcc/gimple-ssa-store-merging.c
@@ -851,12 +851,16 @@ find_bswap_or_nop_finalize (struct symbolic_number *n, uint64_t *cmpxchg,
 gimple *
 find_bswap_or_nop (gimple *stmt, struct symbolic_number *n, bool *bswap)
 {
+  tree type_size = TYPE_SIZE_UNIT (gimple_expr_type (stmt));
+  if (!tree_fits_uhwi_p  (type_size))
+    return NULL;

Just one space before ( above.  Ok for trunk with that nit fixed.
Thanks.

Jakub



Re: [PATCH 2/2] combine: Don't turn (mult (extend x) 2^n) into extract

2020-10-26 Thread Segher Boessenkool
On Mon, Oct 26, 2020 at 11:06:22AM +, Alex Coplan wrote:
> Well, only the low 32 bits of the subreg are valid. But because those
> low 32 bits are shifted left 2 times, the low 34 bits of the ashift are
> valid: the bottom 2 bits of the ashift are zeros, and the 32 bits above
> those are from the inner SImode reg (with the upper 62 bits being
> undefined).

Ugh.  Yes, I think you are right.  One more reason why we should only
use *explicit* sign/zero extends, none of this confusing subreg
business :-(

> > > diff --git a/gcc/combine.c b/gcc/combine.c
> > > index c88382efbd3..fe8eff2b464 100644
> > > --- a/gcc/combine.c
> > > +++ b/gcc/combine.c
> > > @@ -7419,8 +7419,8 @@ expand_compound_operation (rtx x)
> > >  }
> > >  
> > >/* If we reach here, we want to return a pair of shifts.  The inner
> > > - shift is a left shift of BITSIZE - POS - LEN bits.  The outer
> > > - shift is a right shift of BITSIZE - LEN bits.  It is arithmetic or
> > > + shift is a left shift of MODEWIDTH - POS - LEN bits.  The outer
> > > + shift is a right shift of MODEWIDTH - LEN bits.  It is arithmetic or
> > >   logical depending on the value of UNSIGNEDP.
> > >  
> > >   If this was a ZERO_EXTEND or ZERO_EXTRACT, this pair of shifts will 
> > > be
> > 
> > MODEWIDTH isn't defined here yet, it is initialised just below to
> > MODE_PRECISION (mode).
> 
> Yes, but bitsize isn't defined at all in this function AFAICT. Are
> comments not permitted to refer to variables defined immediately beneath
> them?

Of course you can -- comments are free form text after all -- but as
written it suggest there already is an initialised variable "modewidth".

Just move the initialisation to above this comment?


Segher


Re: [RS6000] Tests that use int128_t and -m32

2020-10-26 Thread Alan Modra via Gcc-patches
On Sun, Oct 25, 2020 at 10:43:12AM -0400, David Edelsohn wrote:
> On Sun, Oct 25, 2020 at 7:20 AM Alan Modra  wrote:
> >
> > All these tests fail with -m32 due to lack of int128 support, in some
> > cases with what I thought was not the best error message.  For example
> > vsx_mask-move-runnable.c:34:3: error: unknown type name 'vector'
> > is misleading.  The problem isn't "vector" but "vector __uint128_t".
> >
> > * gcc.target/powerpc/vsx-load-element-extend-char.c: Require int128.
> > * gcc.target/powerpc/vsx-load-element-extend-int.c: Likewise.
> > * gcc.target/powerpc/vsx-load-element-extend-longlong.c: Likewise.
> > * gcc.target/powerpc/vsx-load-element-extend-short.c: Likewise.
> > * gcc.target/powerpc/vsx-store-element-truncate-char.c: Likewise.
> > * gcc.target/powerpc/vsx-store-element-truncate-int.c: Likewise.
> > * gcc.target/powerpc/vsx-store-element-truncate-longlong.c: 
> > Likewise.
> > * gcc.target/powerpc/vsx-store-element-truncate-short.c: Likewise.
> > * gcc.target/powerpc/vsx_mask-count-runnable.c: Likewise.
> > * gcc.target/powerpc/vsx_mask-expand-runnable.c: Likewise.
> > * gcc.target/powerpc/vsx_mask-extract-runnable.c: Likewise.
> > * gcc.target/powerpc/vsx_mask-move-runnable.c: Likewise.
> 
> Good catch.
> 
> Another problem with all of the vsx_mask test cases is that they use
> -mcpu=power10 instead of  -mdejagnu-cpu=power10.  Can you follow up
> with that fix or do you want me to?

Sure, I can do that if you're pre-approving the patch.
gcc.target/powerpc/pr93122.c too.

-- 
Alan Modra
Australia Development Lab, IBM


Re: [PATCH]AArch64 Fix overflow in memcopy expansion on aarch64.

2020-10-26 Thread Tamar Christina via Gcc-patches
Hi Richard,

The 10/26/2020 11:29, Richard Sandiford wrote:
> Tamar Christina  writes:
> >   /* We can't do anything smart if the amount to copy is not constant.  */
> >   if (!CONST_INT_P (operands[2]))
> >     return false;
> > 
> > -  n = INTVAL (operands[2]);
> > +  /* This may get truncated but that's fine as it would be above our 
> > maximum
> > + memset inline limit.  */
> > +  unsigned tmp = INTVAL (operands[2]);
> 
> That's not true for (1ULL << 32) + 1 for example, since the truncated
> value will come under the limit.  I think we should just do:
> 
>   unsigned HOST_WIDE_INT tmp = UINTVAL (operands[2]);
> 
> without a comment.
> 

Updated patch attached.

Ok for master and GCC 8, 9, 10?

Thanks,
Tamar

> Thanks,
> Richard

-- 
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index a8cc545c37044345c3f1d3bf09151c8a9578a032..35d6f2e2f017206eb73dc4091f1a15506d3563ab 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -21299,6 +21299,8 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
 bool
 aarch64_expand_cpymem (rtx *operands)
 {
+  /* These need to be signed as we need to perform arithmetic on n as
+ signed operations.  */
   int n, mode_bits;
   rtx dst = operands[0];
   rtx src = operands[1];
@@ -21309,21 +21311,24 @@ aarch64_expand_cpymem (rtx *operands)
   /* When optimizing for size, give a better estimate of the length of a
  memcpy call, but use the default otherwise.  Moves larger than 8 bytes
  will always require an even number of instructions to do now.  And each
- operation requires both a load+store, so devide the max number by 2.  */
-  int max_num_moves = (speed_p ? 16 : AARCH64_CALL_RATIO) / 2;
+ operation requires both a load+store, so divide the max number by 2.  */
+  unsigned int max_num_moves = (speed_p ? 16 : AARCH64_CALL_RATIO) / 2;
 
   /* We can't do anything smart if the amount to copy is not constant.  */
   if (!CONST_INT_P (operands[2]))
 return false;
 
-  n = INTVAL (operands[2]);
+  unsigned HOST_WIDE_INT tmp = INTVAL (operands[2]);
 
   /* Try to keep the number of instructions low.  For all cases we will do at
  most two moves for the residual amount, since we'll always overlap the
  remainder.  */
-  if (((n / 16) + (n % 16 ? 2 : 0)) > max_num_moves)
+  if (((tmp / 16) + (tmp % 16 ? 2 : 0)) > max_num_moves)
 return false;
 
+  /* At this point tmp is known to have to fit inside an int.  */
+  n = tmp;
+
   base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
   dst = adjust_automodify_address (dst, VOIDmode, base, 0);
 
diff --git a/gcc/testsuite/gcc.target/aarch64/pr97535.c b/gcc/testsuite/gcc.target/aarch64/pr97535.c
new file mode 100644
index ..6f83b3f571413577180682c18400d913bb13124d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr97535.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+
#include <string.h>
+
+#define SIZE 2181038080
+
+extern char raw_buffer[SIZE];
+
+void setRaw(const void *raw)
+{
+memcpy(raw_buffer, raw, SIZE);
+}
+
+/* At any optimization level this should be a function call
+   and not inlined.  */
+/* { dg-final { scan-assembler "bl\tmemcpy" } } */



Re: [PATCH 2/2] combine: Don't turn (mult (extend x) 2^n) into extract

2020-10-26 Thread Segher Boessenkool
Hi!

On Thu, Oct 15, 2020 at 09:59:24AM +0100, Alex Coplan wrote:
> @@ -7650,20 +7650,27 @@ make_extraction (machine_mode mode, rtx inner, HOST_WIDE_INT pos,
>   is_mode = GET_MODE (SUBREG_REG (inner));
>inner = SUBREG_REG (inner);
>  }
> +  else if ((GET_CODE (inner) == ASHIFT || GET_CODE (inner) == MULT)
> +&& pos_rtx == 0 && pos == 0)
> +{
> +  const HOST_WIDE_INT ci = INTVAL (XEXP (inner, 1));
> +  const auto code = GET_CODE (inner);
> +  const HOST_WIDE_INT shift_amt = (code == MULT) ? exact_log2 (ci) : ci;

Can you instead replace the mult by a shift somewhere earlier in
make_extract?  That would make a lot more sense :-)


Segher


Re: [PATCH] g++, libstdc++: implement __is_nothrow_{constructible, assignable}

2020-10-26 Thread Jonathan Wakely via Gcc-patches

On 24/10/20 02:32 +0300, Ville Voutilainen via Libstdc++ wrote:

@@ -1118,15 +1080,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 };
 
   template<typename _Tp, typename _Up>
-    struct __is_nt_assignable_impl
-    : public integral_constant<bool, noexcept(declval<_Tp>() = declval<_Up>())>
-    { };
-
-  template<typename _Tp, typename _Up>
-    struct __is_nothrow_assignable_impl
-    : public __and_<__bool_constant<__is_assignable(_Tp, _Up)>,
-                    __is_nt_assignable_impl<_Tp, _Up>>
-    { };
+    using __is_nothrow_assignable_impl
+    = __bool_constant<__is_nothrow_assignable(_Tp, _Up)>;


Please indent the "= __bool_constant<...>;" line two more spaces,
rather than lining it up with the "using".
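
That is, presumably the requested layout (an editorial rendering of the
review comment):

  template<typename _Tp, typename _Up>
    using __is_nothrow_assignable_impl
      = __bool_constant<__is_nothrow_assignable(_Tp, _Up)>;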

The library changes are OK with that tweak. Thanks!



Re: [RS6000] Tests that use int128_t and -m32

2020-10-26 Thread Segher Boessenkool
Hi Alan,

On Sun, Oct 25, 2020 at 09:50:01PM +1030, Alan Modra wrote:
> All these tests fail with -m32 due to lack of int128 support,

Is there any good reason __int128 is not enabled for rs6000 -m32, btw?

> in some
> cases with what I thought was not the best error message.  For example
> vsx_mask-move-runnable.c:34:3: error: unknown type name 'vector'
> is misleading.  The problem isn't "vector" but "vector __uint128_t".

Ouch, yes.  Do you see a simple way to fix that?

> diff --git a/gcc/testsuite/gcc.target/powerpc/vsx-load-element-extend-char.c b/gcc/testsuite/gcc.target/powerpc/vsx-load-element-extend-char.c
> index 0b8cfd610f8..7a7cb77c3a0 100644
> --- a/gcc/testsuite/gcc.target/powerpc/vsx-load-element-extend-char.c
> +++ b/gcc/testsuite/gcc.target/powerpc/vsx-load-element-extend-char.c
> @@ -4,6 +4,7 @@
>  
>  /* { dg-do compile {target power10_ok} } */
>  /* { dg-do run {target power10_hw} } */
> +/* { dg-require-effective-target { int128 } } */
>  /* { dg-options "-mdejagnu-cpu=power10 -O3" } */

You might want to write this as {int128}, to keep the same style as the
other statements.  Or leave off the braces completely, they aren't
necessary here, int128 is a single word :-)

> diff --git a/gcc/testsuite/gcc.target/powerpc/vsx_mask-count-runnable.c b/gcc/testsuite/gcc.target/powerpc/vsx_mask-count-runnable.c
> index 5862517eae9..6ac4ed2173f 100644
> --- a/gcc/testsuite/gcc.target/powerpc/vsx_mask-count-runnable.c
> +++ b/gcc/testsuite/gcc.target/powerpc/vsx_mask-count-runnable.c
> @@ -1,7 +1,7 @@
>  /* { dg-do run { target { power10_hw } } } */
>  /* { dg-do link { target { ! power10_hw } } } */
>  /* { dg-options "-mcpu=power10 -O2" } */
> -/* { dg-require-effective-target power10_ok } */
> +/* { dg-require-effective-target { int128 && power10_ok } } */

Or write it as two require statements, as we do most of the time?
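
That is, something along the lines of (an illustrative example of the
two-statement style):

/* { dg-require-effective-target int128 } */
/* { dg-require-effective-target power10_ok } */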

Okay for trunk (with those tweaks if you want).  Thanks!


Segher


Re: [RS6000] Tests that use int128_t and -m32

2020-10-26 Thread Segher Boessenkool
On Mon, Oct 26, 2020 at 10:34:20PM +1030, Alan Modra wrote:
> On Sun, Oct 25, 2020 at 10:43:12AM -0400, David Edelsohn wrote:
> > Another problem with all of the vsx_mask test cases is that they use
> > -mcpu=power10 instead of  -mdejagnu-cpu=power10.  Can you follow up
> > with that fix or do you want me to?
> 
> Sure, I can do that if you're pre-approving the patch.
> gcc.target/powerpc/pr93122.c too.

This is obvious and trivial, doesn't need approval (just send a mail
what you did).  Thanks :-)


Segher


[committed] libstdc++: Fix declarations of memalign etc. for freestanding [PR 97570]

2020-10-26 Thread Jonathan Wakely via Gcc-patches
libstdc++-v3/ChangeLog:

PR libstdc++/97570
* libsupc++/new_opa.cc: Declare size_t in global namespace.
Remove unused header.

Tested x86_64-linux. Successfully built for avr cross (with avr-libc
2.0).

Committed to trunk.

commit 93e9a7bcd5434a24c945de33cd7fa01a25f68418
Author: Jonathan Wakely 
Date:   Mon Oct 26 12:02:50 2020

libstdc++: Fix declarations of memalign etc. for freestanding [PR 97570]

libstdc++-v3/ChangeLog:

PR libstdc++/97570
* libsupc++/new_opa.cc: Declare size_t in global namespace.
Remove unused header.

diff --git a/libstdc++-v3/libsupc++/new_opa.cc b/libstdc++-v3/libsupc++/new_opa.cc
index b935936e19a..732fe827cda 100644
--- a/libstdc++-v3/libsupc++/new_opa.cc
+++ b/libstdc++-v3/libsupc++/new_opa.cc
@@ -26,7 +26,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include "new"
 
@@ -44,6 +43,7 @@ using std::new_handler;
 using std::bad_alloc;
 
 #if ! _GLIBCXX_HOSTED
+using std::size_t;
 extern "C"
 {
 # if _GLIBCXX_HAVE_ALIGNED_ALLOC


Re: [PATCH]AArch64 Fix overflow in memcopy expansion on aarch64.

2020-10-26 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
> Hi Richard,
>
> The 10/26/2020 11:29, Richard Sandiford wrote:
>> Tamar Christina  writes:
>> >   /* We can't do anything smart if the amount to copy is not constant.  */
>> >   if (!CONST_INT_P (operands[2]))
>> >     return false;
>> > 
>> > -  n = INTVAL (operands[2]);
>> > +  /* This may get truncated but that's fine as it would be above our 
>> > maximum
>> > + memset inline limit.  */
>> > +  unsigned tmp = INTVAL (operands[2]);
>> 
>> That's not true for (1ULL << 32) + 1 for example, since the truncated
>> value will come under the limit.  I think we should just do:
>> 
>>   unsigned HOST_WIDE_INT tmp = UINTVAL (operands[2]);
>> 
>> without a comment.
>> 
>
> Updated patch attached.
>
> Ok for master and GCC 8, 9, 10?

OK, thanks.

Richard


Re: [PATCH PR94442] [AArch64] Redundant ldp/stp instructions emitted at -O3

2020-10-26 Thread Richard Sandiford via Gcc-patches
xiezhiheng  writes:
>> -Original Message-
>> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
>> Sent: Wednesday, October 21, 2020 12:54 AM
>> To: xiezhiheng 
>> Cc: Richard Biener ; gcc-patches@gcc.gnu.org
>> Subject: Re: [PATCH PR94442] [AArch64] Redundant ldp/stp instructions
>> emitted at -O3
>> 
>> xiezhiheng  writes:
>> > I made two separate patches for these two groups, get/set register
>> intrinsics and store intrinsics.
>> >
>> > Note: It does not matter which patch is applied first.
>> >
>> > Bootstrapped and tested on aarch64 Linux platform.
>> 
>> Thanks.  I pushed the get/set patch.  For the store patch, I think
>> we should have:
>> 
>> const unsigned int FLAG_STORE = FLAG_WRITE_MEMORY | FLAG_AUTO_FP;
>> 
>> since the FP forms don't (for example) read the FPCR.
>> 
>
> That's true.  I added FLAG_STORE for the store intrinsics and made the patch 
> for them.
>
> Bootstrapped and tested on aarch64 Linux platform.

Thanks, pushed to trunk.

Sorry for the delayed response.

Richard

>
> Thanks,
> Xie Zhiheng
>
>
> diff --git a/gcc/ChangeLog b/gcc/ChangeLog
> index 59fa1ad4d5d..26edaa309c8 100644
> --- a/gcc/ChangeLog
> +++ b/gcc/ChangeLog
> @@ -1,3 +1,10 @@
> +2020-10-22  Zhiheng Xie  
> + Nannan Zheng  
> +
> + * config/aarch64/aarch64-builtins.c: Add FLAG_STORE.
> + * config/aarch64/aarch64-simd-builtins.def: Add proper FLAG
> + for store intrinsics.
> +


Fix simdclones pass

2020-10-26 Thread Jan Hubicka
Hi,
this patch makes cleaning of stmt pointers in references more robust so
late IPA passes do not break.

Bootstrapped/regtested x86_64-linux, comitted.

Honza

gcc/ChangeLog:

2020-10-26  Jan Hubicka  

PR ipa/97576
* cgraphclones.c (cgraph_node::materialize_clone): Clear stmt
references.
* cgraphunit.c (mark_functions_to_output): Do not clear them here.
* ipa-inline-transform.c (inline_transform): Clear stmt references.
* symtab.c (symtab_node::clear_stmts_in_references): Make recursive
for clones.
* tree-ssa-structalias.c (ipa_pta_execute): Do not clear references.

gcc/testsuite/ChangeLog:

2020-10-26  Jan Hubicka  

PR ipa/97576
* gcc.c-torture/compile/pr97576.c: New test.

diff --git a/gcc/cgraphclones.c b/gcc/cgraphclones.c
index 41c6efb10ac..0ed63078c91 100644
--- a/gcc/cgraphclones.c
+++ b/gcc/cgraphclones.c
@@ -1115,6 +1115,7 @@ cgraph_node::materialize_clone ()
   if (clone.param_adjustments)
clone.param_adjustments->dump (symtab->dump_file);
 }
+  clear_stmts_in_references ();
   /* Copy the OLD_VERSION_NODE function tree to the new version.  */
   tree_function_versioning (clone_of->decl, decl,
clone.tree_map, clone.param_adjustments,
diff --git a/gcc/cgraphunit.c b/gcc/cgraphunit.c
index d2d98c8dc8a..08b93cb00ee 100644
--- a/gcc/cgraphunit.c
+++ b/gcc/cgraphunit.c
@@ -1600,7 +1600,6 @@ mark_functions_to_output (void)
   FOR_EACH_FUNCTION (node)
 {
   tree decl = node->decl;
-  node->clear_stmts_in_references ();
 
   gcc_assert (!node->process || node->same_comdat_group);
   if (node->process)
diff --git a/gcc/ipa-inline-transform.c b/gcc/ipa-inline-transform.c
index 279ba2f7cb0..4df1b7fb9ee 100644
--- a/gcc/ipa-inline-transform.c
+++ b/gcc/ipa-inline-transform.c
@@ -716,6 +716,7 @@ inline_transform (struct cgraph_node *node)
   if (n->decl != node->decl)
n->materialize_clone ();
 }
+  node->clear_stmts_in_references ();
 
   /* We might need the body of this function so that we can expand
  it inline somewhere else.  */
diff --git a/gcc/symtab.c b/gcc/symtab.c
index bc2865f4121..067ae2e28a0 100644
--- a/gcc/symtab.c
+++ b/gcc/symtab.c
@@ -752,7 +752,8 @@ symtab_node::remove_stmt_references (gimple *stmt)
   i++;
 }
 
-/* Remove all stmt references in non-speculative references.
+/* Remove all stmt references in non-speculative references in THIS
+   and all clones.
Those are not maintained during inlining & cloning.
The exception are speculative references that are updated along
with callgraph edges associated with them.  */
@@ -770,6 +771,13 @@ symtab_node::clear_stmts_in_references (void)
r->lto_stmt_uid = 0;
r->speculative_id = 0;
   }
+  cgraph_node *cnode = dyn_cast <cgraph_node *> (this);
+  if (cnode)
+{
+  if (cnode->clones)
+   for (cnode = cnode->clones; cnode; cnode = cnode->next_sibling_clone)
+ cnode->clear_stmts_in_references ();
+}
 }
 
 /* Remove all references in ref list.  */
diff --git a/gcc/testsuite/gcc.c-torture/compile/pr97576.c b/gcc/testsuite/gcc.c-torture/compile/pr97576.c
new file mode 100644
index 000..8d6a6c6d634
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/compile/pr97576.c
@@ -0,0 +1,18 @@
+void
+pc (void);
+
+void __attribute__ ((simd))
+ty (void);
+
+void __attribute__ ((simd))
+gf ()
+{
+  ty ();
+}
+
+void __attribute__ ((simd))
+ty (void)
+{
+  gf (pc);
+  gf (gf);
+}
diff --git a/gcc/tree-ssa-structalias.c b/gcc/tree-ssa-structalias.c
index 9bac06f97af..a4832b75436 100644
--- a/gcc/tree-ssa-structalias.c
+++ b/gcc/tree-ssa-structalias.c
@@ -8138,10 +8138,6 @@ ipa_pta_execute (void)
   from = constraints.length ();
 }
 
-  /* FIXME: Clone materialization is not preserving stmt references.  */
-  FOR_EACH_DEFINED_FUNCTION (node)
-    node->clear_stmts_in_references ();
-
   /* Build the constraints.  */
   FOR_EACH_DEFINED_FUNCTION (node)
 {


Re: [PATCH 2/2] combine: Don't turn (mult (extend x) 2^n) into extract

2020-10-26 Thread Alex Coplan via Gcc-patches
On 26/10/2020 06:51, Segher Boessenkool wrote:
> On Mon, Oct 26, 2020 at 11:06:22AM +, Alex Coplan wrote:
> > Well, only the low 32 bits of the subreg are valid. But because those
> > low 32 bits are shifted left 2 times, the low 34 bits of the ashift are
> > valid: the bottom 2 bits of the ashift are zeros, and the 32 bits above
> > those are from the inner SImode reg (with the upper 62 bits being
> > undefined).
> 
> Ugh.  Yes, I think you are right.  One more reason why we should only
> use *explicit* sign/zero extends, none of this confusing subreg
> business :-(

Yeah. IIRC expand_compound_operation() introduces the subreg because it
explicitly wants to rewrite the sign_extend using a pair of shifts (without
using an extend rtx). Something like:

(ashiftrt:DI
  (ashift:DI
(subreg:DI (reg:SI r) 0)
(const_int 32))
  (const_int 32))

> 
> > > > diff --git a/gcc/combine.c b/gcc/combine.c
> > > > index c88382efbd3..fe8eff2b464 100644
> > > > --- a/gcc/combine.c
> > > > +++ b/gcc/combine.c
> > > > @@ -7419,8 +7419,8 @@ expand_compound_operation (rtx x)
> > > >  }
> > > >  
> > > >/* If we reach here, we want to return a pair of shifts.  The inner
> > > > - shift is a left shift of BITSIZE - POS - LEN bits.  The outer
> > > > - shift is a right shift of BITSIZE - LEN bits.  It is arithmetic or
> > > > + shift is a left shift of MODEWIDTH - POS - LEN bits.  The outer
> > > > + shift is a right shift of MODEWIDTH - LEN bits.  It is arithmetic 
> > > > or
> > > >   logical depending on the value of UNSIGNEDP.
> > > >  
> > > >   If this was a ZERO_EXTEND or ZERO_EXTRACT, this pair of shifts 
> > > > will be
> > > 
> > > MODEWIDTH isn't defined here yet, it is initialised just below to
> > > MODE_PRECISION (mode).
> > 
> > Yes, but bitsize isn't defined at all in this function AFAICT. Are
> > comments not permitted to refer to variables defined immediately beneath
> > them?
> 
> Of course you can -- comments are free form text after all -- but as
> written it suggest there already is an initialised variable "modewidth".
> 
> Just move the initialisation to above this comment?

Sure, see the revised patch attached.

Thanks,
Alex
diff --git a/gcc/combine.c b/gcc/combine.c
index 4782e1d9dcc..d4793c1c575 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -7418,9 +7418,11 @@ expand_compound_operation (rtx x)
 
 }
 
+  modewidth = GET_MODE_PRECISION (mode);
+
   /* If we reach here, we want to return a pair of shifts.  The inner
- shift is a left shift of BITSIZE - POS - LEN bits.  The outer
- shift is a right shift of BITSIZE - LEN bits.  It is arithmetic or
+ shift is a left shift of MODEWIDTH - POS - LEN bits.  The outer
+ shift is a right shift of MODEWIDTH - LEN bits.  It is arithmetic or
  logical depending on the value of UNSIGNEDP.
 
  If this was a ZERO_EXTEND or ZERO_EXTRACT, this pair of shifts will be
@@ -7433,7 +7435,6 @@ expand_compound_operation (rtx x)
  extraction.  Then the constant of 31 would be substituted in
  to produce such a position.  */
 
-  modewidth = GET_MODE_PRECISION (mode);
   if (modewidth >= pos + len)
 {
   tem = gen_lowpart (mode, XEXP (x, 0));
@@ -7650,20 +7651,27 @@ make_extraction (machine_mode mode, rtx inner, HOST_WIDE_INT pos,
is_mode = GET_MODE (SUBREG_REG (inner));
   inner = SUBREG_REG (inner);
 }
-  else if (GET_CODE (inner) == ASHIFT
+  else if ((GET_CODE (inner) == ASHIFT || GET_CODE (inner) == MULT)
   && CONST_INT_P (XEXP (inner, 1))
-  && pos_rtx == 0 && pos == 0
-  && len > UINTVAL (XEXP (inner, 1)))
-{
-  /* We're extracting the least significant bits of an rtx
-(ashift X (const_int C)), where LEN > C.  Extract the
-least significant (LEN - C) bits of X, giving an rtx
-whose mode is MODE, then shift it left C times.  */
-  new_rtx = make_extraction (mode, XEXP (inner, 0),
-0, 0, len - INTVAL (XEXP (inner, 1)),
-unsignedp, in_dest, in_compare);
-  if (new_rtx != 0)
-   return gen_rtx_ASHIFT (mode, new_rtx, XEXP (inner, 1));
+  && pos_rtx == 0 && pos == 0)
+{
+  const HOST_WIDE_INT ci = INTVAL (XEXP (inner, 1));
+  const auto code = GET_CODE (inner);
+  const HOST_WIDE_INT shift_amt = (code == MULT) ? exact_log2 (ci) : ci;
+
+  if (shift_amt > 0 && len > (unsigned HOST_WIDE_INT)shift_amt)
+   {
+ /* We're extracting the least significant bits of an rtx
+(ashift X (const_int C)) or (mult X (const_int (2^C))),
+where LEN > C.  Extract the least significant (LEN - C) bits
+of X, giving an rtx whose mode is MODE, then shift it left
+C times.  */
+ new_rtx = make_extraction (mode, XEXP (inner, 0),
+0, 0, len - shift_amt,
+unsignedp, in_dest, in_compare);
+ if (new_rtx)
+   re

Re: [PATCH 2/2] combine: Don't turn (mult (extend x) 2^n) into extract

2020-10-26 Thread Alex Coplan via Gcc-patches
On 26/10/2020 07:12, Segher Boessenkool wrote:
> Hi!
> 
> On Thu, Oct 15, 2020 at 09:59:24AM +0100, Alex Coplan wrote:
> > @@ -7650,20 +7650,27 @@ make_extraction (machine_mode mode, rtx inner, HOST_WIDE_INT pos,
> > is_mode = GET_MODE (SUBREG_REG (inner));
> >inner = SUBREG_REG (inner);
> >  }
> > +  else if ((GET_CODE (inner) == ASHIFT || GET_CODE (inner) == MULT)
> > +  && pos_rtx == 0 && pos == 0)
> > +{
> > +  const HOST_WIDE_INT ci = INTVAL (XEXP (inner, 1));
> > +  const auto code = GET_CODE (inner);
> > +  const HOST_WIDE_INT shift_amt = (code == MULT) ? exact_log2 (ci) : 
> > ci;
> 
> Can you instead replace the mult by a shift somewhere earlier in
> make_extract?  That would make a lot more sense :-)

I guess we could do this, the only complication being that we can't
unconditionally rewrite the expression using a shift, since mult is canonical
inside a mem (which is why we see it in the testcase in the PR).

So if we did this, we'd have to remember that we did it earlier on, and rewrite
it back to a mult accordingly.
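
For reference, the two canonical forms in question look roughly like this
(an illustrative sketch; the register numbers are made up):

  (mem:SI (plus:DI (mult:DI (reg:DI x1) (const_int 4))
                   (reg:DI x0)))        ; mult is canonical inside a mem

  (plus:DI (ashift:DI (reg:DI x1) (const_int 2))
           (reg:DI x0))                 ; ashift is canonical elsewhere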

Would you still like to see a version of the patch that does that, or is this
version OK: https://gcc.gnu.org/pipermail/gcc-patches/2020-October/557050.html ?

Thanks,
Alex


[PATCH] cp/decl.c: Set DECL_INITIAL before attribute processing

2020-10-26 Thread Jozef Lawrynowicz
Attribute handlers may want to examine DECL_INITIAL for a decl, to
validate the attribute being applied. For C++, DECL_INITIAL is currently
not set until cp_finish_decl, by which time attribute validation has
already been performed.

For msp430-elf this causes the "persistent" attribute to always be
rejected for C++, since DECL_INITIAL must be non-null for the
attribute to be applied to a decl.

This patch ensures DECL_INITIAL is set for initialized decls early in
start_decl, before attribute handlers run. This allows the
initialization status of the decl to be examined by the handlers.
DECL_INITIAL must be restored to its initial value after attribute
validation is performed, so as to not interfere with later decl
processing.
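
As an illustration, a handler can then distinguish the two cases along these
lines (a sketch using the generic attribute_spec handler signature; the
names are hypothetical and this is not code from the patch):

  static tree
  handle_persistent_attribute (tree *node, tree name, tree args,
                               int flags, bool *no_add_attrs)
  {
    /* DECL_INITIAL is now set before cplus_decl_attributes runs, so an
       uninitialized decl can be detected and the attribute rejected.  */
    if (DECL_INITIAL (*node) == NULL_TREE)
      {
        warning (OPT_Wattributes,
                 "%qE attribute ignored on uninitialized variable", name);
        *no_add_attrs = true;
      }
    return NULL_TREE;
  }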

Successfully bootstrapped and regtested for x86_64-pc-linux-gnu, and
regtested for arm-eabi and msp430-elf.

Ok for trunk?
From 6fbb18ab081069fb2730360f9e09425b9b1f6a7d Mon Sep 17 00:00:00 2001
From: Jozef Lawrynowicz 
Date: Tue, 20 Oct 2020 14:03:42 +0100
Subject: [PATCH] cp/decl.c: Set DECL_INITIAL before attribute processing

Attribute handlers may want to examine DECL_INITIAL for a decl, to
validate the attribute being applied. For C++, DECL_INITIAL is currently
not set until cp_finish_decl, by which time attribute validation has
already been performed.

For msp430-elf this causes the "persistent" attribute to always be
rejected for C++, since DECL_INITIAL must be non-null for the
attribute to be applied to a decl.

This patch ensures DECL_INITIAL is set for initialized decls early in
start_decl, before attribute handlers run. This allows the
initialization status of the decl to be examined by the handlers.
DECL_INITIAL must be restored to its initial value after attribute
validation is performed, so as to not interfere with later decl
processing.

gcc/cp/ChangeLog:

* decl.c (start_decl): Set DECL_INITIAL for initialized decls
before attribute processing.

gcc/testsuite/ChangeLog:

* gcc.target/msp430/data-attributes-2.c: Adjust test.
* g++.target/msp430/data-attributes.C: New test.
* g++.target/msp430/msp430.exp: New test.
---
 gcc/cp/decl.c | 13 +
 .../g++.target/msp430/data-attributes.C   | 52 +++
 gcc/testsuite/g++.target/msp430/msp430.exp| 44 
 .../gcc.target/msp430/data-attributes-2.c |  1 +
 4 files changed, 110 insertions(+)
 create mode 100644 gcc/testsuite/g++.target/msp430/data-attributes.C
 create mode 100644 gcc/testsuite/g++.target/msp430/msp430.exp

diff --git a/gcc/cp/decl.c b/gcc/cp/decl.c
index 5f370e60b4e..0f32dd88bad 100644
--- a/gcc/cp/decl.c
+++ b/gcc/cp/decl.c
@@ -5210,6 +5210,7 @@ start_decl (const cp_declarator *declarator,
   bool was_public;
   int flags;
   bool alias;
+  tree initial;
 
   *pushed_scope_p = NULL_TREE;
 
@@ -5234,6 +5235,10 @@ start_decl (const cp_declarator *declarator,
   return error_mark_node;
 }
 
+  /* Save the DECL_INITIAL value since we clobber it with error_mark_node if
+ INITIALIZED is true.  */
+  initial = DECL_INITIAL (decl);
+
   if (initialized)
 {
   if (! toplevel_bindings_p ()
@@ -5243,6 +5248,10 @@ start_decl (const cp_declarator *declarator,
   DECL_EXTERNAL (decl) = 0;
   if (toplevel_bindings_p ())
TREE_STATIC (decl) = 1;
+  /* Tell 'cplus_decl_attributes' this is an initialized decl,
+even though we might not yet have the initializer expression.  */
+  if (!DECL_INITIAL (decl))
+   DECL_INITIAL (decl) = error_mark_node;
 }
   alias = lookup_attribute ("alias", DECL_ATTRIBUTES (decl)) != 0;
   
@@ -5261,6 +5270,10 @@ start_decl (const cp_declarator *declarator,
   /* Set attributes here so if duplicate decl, will have proper attributes.  */
   cplus_decl_attributes (&decl, attributes, flags);
 
+  /* Restore the original DECL_INITIAL that we may have clobbered earlier to
+ assist with attribute validation.  */
+  DECL_INITIAL (decl) = initial;
+
   /* Dllimported symbols cannot be defined.  Static data members (which
  can be initialized in-class and dllimported) go through grokfield,
  not here, so we don't need to exclude those decls when checking for
diff --git a/gcc/testsuite/g++.target/msp430/data-attributes.C b/gcc/testsuite/g++.target/msp430/data-attributes.C
new file mode 100644
index 000..4e2139e93f7
--- /dev/null
+++ b/gcc/testsuite/g++.target/msp430/data-attributes.C
@@ -0,0 +1,52 @@
+/* { dg-do compile } */
+/* { dg-skip-if "" { *-*-* } { "-mcpu=msp430" } } */
+/* { dg-options "-mlarge" } */
+
+/* The msp430-specific variable attributes "lower", "upper", "either", "noinit"
+   and "persistent", all conflict with one another.
+   These attributes also conflict with the "section" attribute, since they
+   specify sections to put the variables into.  */
+int __attribute__((persistent)) p = 10;
+int __attribute__((persistent,lower)) pl = 20; /* { dg-warning "ignoring 
attribute 'lower' because it conflict

Re: [PATCH] cp/decl.c: Set DECL_INITIAL before attribute processing

2020-10-26 Thread Jozef Lawrynowicz
On Mon, Oct 26, 2020 at 01:30:29PM +, Jozef Lawrynowicz wrote:
> Attribute handlers may want to examine DECL_INITIAL for a decl, to
> validate the attribute being applied. For C++, DECL_INITIAL is currently
> not set until cp_finish_decl, by which time attribute validation has
> already been performed.
> 
> For msp430-elf this causes the "persistent" attribute to always be
> rejected for C++, since DECL_INITIAL must be non-null for the
> attribute to be applied to a decl.
> 
> This patch ensures DECL_INITIAL is set for initialized decls early in
> start_decl, before attribute handlers run. This allows the
> initialization status of the decl to be examined by the handlers.
DECL_INITIAL must be restored to its initial value after attribute
> validation is performed, so as to not interfere with later decl
> processing.
> 
> Successfully bootstrapped and regtested for x86_64-pc-linux-gnu, and
> regtested for arm-eabi and msp430-elf.
> 
> Ok for trunk?

Amended slightly misleading comment, shown below, in the attached patch.

diff --git a/gcc/cp/decl.c b/gcc/cp/decl.c
index 0f32dd88bad..4c959377077 100644
--- a/gcc/cp/decl.c
+++ b/gcc/cp/decl.c
@@ -5235,8 +5235,8 @@ start_decl (const cp_declarator *declarator,
   return error_mark_node;
 }

-  /* Save the DECL_INITIAL value since we clobber it with error_mark_node if
- INITIALIZED is true.  */
+  /* Save the DECL_INITIAL value in case it gets clobbered to assist
+ with attribute validation.  */
   initial = DECL_INITIAL (decl);

   if (initialized)

From fad6cc4df13e00c55d381e82772438161282a008 Mon Sep 17 00:00:00 2001
From: Jozef Lawrynowicz 
Date: Tue, 20 Oct 2020 14:03:42 +0100
Subject: [PATCH] cp/decl.c: Set DECL_INITIAL before attribute processing

Attribute handlers may want to examine DECL_INITIAL for a decl, to
validate the attribute being applied. For C++, DECL_INITIAL is currently
not set until cp_finish_decl, by which time attribute validation has
already been performed.

For msp430-elf this causes the "persistent" attribute to always be
rejected for C++, since DECL_INITIAL must be non-null for the
attribute to be applied to a decl.

This patch ensures DECL_INITIAL is set for initialized decls early in
start_decl, before attribute handlers run. This allows the
initialization status of the decl to be examined by the handlers.
DECL_INITIAL must be restored to its initial value after attribute
validation is performed, so as to not interfere with later decl
processing.

gcc/cp/ChangeLog:

* decl.c (start_decl): Set DECL_INITIAL for initialized decls
before attribute processing.

gcc/testsuite/ChangeLog:

* gcc.target/msp430/data-attributes-2.c: Adjust test.
* g++.target/msp430/data-attributes.C: New test.
* g++.target/msp430/msp430.exp: New test.
---
 gcc/cp/decl.c | 13 +
 .../g++.target/msp430/data-attributes.C   | 52 +++
 gcc/testsuite/g++.target/msp430/msp430.exp| 44 
 .../gcc.target/msp430/data-attributes-2.c |  1 +
 4 files changed, 110 insertions(+)
 create mode 100644 gcc/testsuite/g++.target/msp430/data-attributes.C
 create mode 100644 gcc/testsuite/g++.target/msp430/msp430.exp

diff --git a/gcc/cp/decl.c b/gcc/cp/decl.c
index 5f370e60b4e..4c959377077 100644
--- a/gcc/cp/decl.c
+++ b/gcc/cp/decl.c
@@ -5210,6 +5210,7 @@ start_decl (const cp_declarator *declarator,
   bool was_public;
   int flags;
   bool alias;
+  tree initial;
 
   *pushed_scope_p = NULL_TREE;
 
@@ -5234,6 +5235,10 @@ start_decl (const cp_declarator *declarator,
   return error_mark_node;
 }
 
+  /* Save the DECL_INITIAL value in case it gets clobbered to assist
+ with attribute validation.  */
+  initial = DECL_INITIAL (decl);
+
   if (initialized)
 {
   if (! toplevel_bindings_p ()
@@ -5243,6 +5248,10 @@ start_decl (const cp_declarator *declarator,
   DECL_EXTERNAL (decl) = 0;
   if (toplevel_bindings_p ())
TREE_STATIC (decl) = 1;
+  /* Tell 'cplus_decl_attributes' this is an initialized decl,
+even though we might not yet have the initializer expression.  */
+  if (!DECL_INITIAL (decl))
+   DECL_INITIAL (decl) = error_mark_node;
 }
   alias = lookup_attribute ("alias", DECL_ATTRIBUTES (decl)) != 0;
   
@@ -5261,6 +5270,10 @@ start_decl (const cp_declarator *declarator,
   /* Set attributes here so if duplicate decl, will have proper attributes.  */
   cplus_decl_attributes (&decl, attributes, flags);
 
+  /* Restore the original DECL_INITIAL that we may have clobbered earlier to
+ assist with attribute validation.  */
+  DECL_INITIAL (decl) = initial;
+
   /* Dllimported symbols cannot be defined.  Static data members (which
  can be initialized in-class and dllimported) go through grokfield,
  not here, so we don't need to exclude those decls when checking for
diff --git a/gcc/testsuite/g++.target/msp430/data-attributes.C 
b/gcc/testsuite/g++.target/msp430/data-

[PATCH] nvptx: Cache stacks block for OpenMP kernel launch

2020-10-26 Thread Julian Brown
Hi,

This patch adds caching for the stack block allocated for offloaded
OpenMP kernel launches on NVPTX. This is a performance optimisation --
we observed an average 11% or so performance improvement with this patch
across a set of accelerated GPU benchmarks on one machine (results vary
according to individual benchmark and with hardware used).

A given kernel launch will reuse the stack block from the previous launch
if it is large enough, else it is freed and reallocated. A slight caveat
is that memory will not be freed until the device is closed, so e.g. if
code is using highly variable launch geometries and large amounts of
GPU RAM, you might run out of resources slightly quicker with this patch.
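
The acquire path has roughly this shape (a sketch of the scheme described
above, using the omp_stacks fields from the diff below; the real
nvptx_stacks_acquire in the patch differs in details):

  static CUdeviceptr
  nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
  {
    pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
    if (ptx_dev->omp_stacks.ptr && ptx_dev->omp_stacks.size >= size * num)
      return ptx_dev->omp_stacks.ptr;        /* Reuse the cached block.  */
    if (ptx_dev->omp_stacks.ptr)
      cuMemFree (ptx_dev->omp_stacks.ptr);   /* Too small: free it...  */
    cuMemAlloc (&ptx_dev->omp_stacks.ptr, size * num);  /* ...and grow.  */
    ptx_dev->omp_stacks.size = size * num;
    return ptx_dev->omp_stacks.ptr;  /* Lock released in nvptx_stacks_release.  */
  }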

Another way this patch gains performance is by omitting the
synchronisation at the end of an OpenMP offload kernel launch -- it's
safe for the GPU and CPU to continue executing in parallel at that point,
because e.g. copies-back from the device will be synchronised properly
with kernel completion anyway.

In turn, the last part necessitates a change to the way "(perhaps abort
was called)" errors are detected and reported.

Tested with offloading to NVPTX. OK for mainline?

Thanks,

Julian

2020-10-26  Julian Brown  

libgomp/
* plugin/plugin-nvptx.c (maybe_abort_message): Add function.
(CUDA_CALL_ERET, CUDA_CALL_ASSERT): Use above function.
(struct ptx_device): Add omp_stacks struct.
(nvptx_open_device): Initialise cached-stacks housekeeping info.
(nvptx_close_device): Free cached stacks block and mutex.
(nvptx_stacks_alloc): Rename to...
(nvptx_stacks_acquire): This.  Cache stacks block between runs if same
size or smaller is required.
(nvptx_stacks_free): Rename to...
(nvptx_stacks_release): This.  Do not free stacks block, but release
mutex.
(GOMP_OFFLOAD_run): Adjust for changes to above functions, and remove
special-case "abort" error handling and synchronisation after kernel
launch.
---
 libgomp/plugin/plugin-nvptx.c | 91 ++-
 1 file changed, 68 insertions(+), 23 deletions(-)

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 11d4ceeae62e..e7ff5d5213e0 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -137,6 +137,15 @@ init_cuda_lib (void)
 #define MIN(X,Y) ((X) < (Y) ? (X) : (Y))
 #define MAX(X,Y) ((X) > (Y) ? (X) : (Y))
 
+static const char *
+maybe_abort_message (unsigned errmsg)
+{
+  if (errmsg == CUDA_ERROR_LAUNCH_FAILED)
+return " (perhaps abort was called)";
+  else
+return "";
+}
+
 /* Convenience macros for the frequently used CUDA library call and
error handling sequence as well as CUDA library calls that
do the error checking themselves or don't do it at all.  */
@@ -147,8 +156,9 @@ init_cuda_lib (void)
   = CUDA_CALL_PREFIX FN (__VA_ARGS__); \
 if (__r != CUDA_SUCCESS)   \
   {\
-   GOMP_PLUGIN_error (#FN " error: %s",\
-  cuda_error (__r));   \
+   GOMP_PLUGIN_error (#FN " error: %s%s",  \
+  cuda_error (__r),\
+  maybe_abort_message (__r));  \
return ERET;\
   }\
   } while (0)
@@ -162,8 +172,9 @@ init_cuda_lib (void)
   = CUDA_CALL_PREFIX FN (__VA_ARGS__); \
 if (__r != CUDA_SUCCESS)   \
   {\
-   GOMP_PLUGIN_fatal (#FN " error: %s",\
-  cuda_error (__r));   \
+   GOMP_PLUGIN_fatal (#FN " error: %s%s",  \
+  cuda_error (__r),\
+  maybe_abort_message (__r));  \
   }\
   } while (0)
 
@@ -307,6 +318,14 @@ struct ptx_device
   struct ptx_free_block *free_blocks;
   pthread_mutex_t free_blocks_lock;
 
+  /* OpenMP stacks, cached between kernel invocations.  */
+  struct
+{
+  CUdeviceptr ptr;
+  size_t size;
+  pthread_mutex_t lock;
+} omp_stacks;
+
   struct ptx_device *next;
 };
 
@@ -514,6 +533,10 @@ nvptx_open_device (int n)
   ptx_dev->free_blocks = NULL;
   pthread_mutex_init (&ptx_dev->free_blocks_lock, NULL);
 
+  ptx_dev->omp_stacks.ptr = 0;
+  ptx_dev->omp_stacks.size = 0;
+  pthread_mutex_init (&ptx_dev->omp_stacks.lock, NULL);
+
   return ptx_dev;
 }
 
@@ -534,6 +557,11 @@ nvptx_close_device (struct ptx_device *ptx_dev)
   pthread_mutex_destroy (&ptx_dev->free_blocks_lock);
   pthread_mutex_destroy (&ptx_dev->image_lock);
 
+  pthread_mutex_destroy (&ptx_dev->omp_stacks.lock);
+
+  if (ptx_dev->omp_stacks.ptr)
+CUDA_CALL (cuMemFree, ptx_dev->omp_stacks.ptr);
+
   if (!ptx_dev->ctx_shared)
 CUDA_CALL (cuCtxDestroy, ptx_dev->ctx);
 
@@ -1866,26 +1894,49 @@ nvptx_stack

Implement three-level optimize_for_size predicates

2020-10-26 Thread Jan Hubicka
Hi,
this patch implements the three-level optimize_for_size predicates, so with -Os
and with profile feedback for never-executed code they return OPTIMIZE_SIZE_MAX,
while in cases where we decide to optimize for size based on branch prediction
logic they return OPTIMIZE_SIZE_BALANCED.

The idea is that in places where we guess that code is unlikely we do not
want to do extreme optimizations for size that lead to manyfold slowdowns
(using idiv rather than a few shifts, or using rep-based inlined stringops).
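
(Illustratively, a consumer in the backend could then distinguish the levels
like this -- a sketch only, with hypothetical emit_* helpers:)

/* Sketch: only OPTIMIZE_SIZE_MAX justifies transformations with manyfold
   slowdowns; the emit_* helpers below are hypothetical.  */
static void
expand_division_for_size (void)
{
  if (optimize_insn_for_size_p () == OPTIMIZE_SIZE_MAX)
    emit_idiv_division ();    /* Smallest code, much slower.  */
  else
    emit_shift_division ();   /* Bigger but fast; also used for
                                 OPTIMIZE_SIZE_BALANCED and _NO.  */
}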

I will update the RTL handling code to also support this with BB granularity
(which we don't do currently).  We should also make the cold attribute lead to
OPTIMIZE_SIZE_MAX, I would say.

LLVM has -Os and -Oz levels, where -Oz is our -Os and LLVM's -Os would
correspond to OPTIMIZE_SIZE_BALANCED.  I wonder if we want to export this to
the command line somehow?  For me it would definitely be useful for testing
things; I am not sure how desired a "weaker" -Os is in practice.

Bootstrapped/regtested x86_64-linux, I will commit it later today if
there are no comments.

H.J., can you please update your patch on stringopts?

Honza

gcc/ChangeLog:

2020-10-26  Jan Hubicka  

* cgraph.h (cgraph_node::optimize_for_size_p): Return
optimize_size_level.
(cgraph_node::optimize_for_size_p): Update.
* coretypes.h (enum optimize_size_level): New enum.
* predict.c (unlikely_executed_edge_p): Microoptimize.
(optimize_function_for_size_p): Return optimize_size_level.
(optimize_bb_for_size_p): Likewise.
(optimize_edge_for_size_p): Likewise.
(optimize_insn_for_size_p): Likewise.
(optimize_loop_nest_for_size_p): Likewise.
* predict.h (optimize_function_for_size_p): Update declaration.
(optimize_bb_for_size_p): Update declaration.
(optimize_edge_for_size_p): Update declaration.
(optimize_insn_for_size_p): Update declaration.
(optimize_loop_for_size_p): Update declaration.
(optimize_loop_nest_for_size_p): Update declaration.

diff --git a/gcc/cgraph.h b/gcc/cgraph.h
index 65e4646efcd..fb3ad95e064 100644
--- a/gcc/cgraph.h
+++ b/gcc/cgraph.h
@@ -1279,7 +1279,7 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cgraph_node : public symtab_node
   bool check_calls_comdat_local_p ();
 
   /* Return true if function should be optimized for size.  */
-  bool optimize_for_size_p (void);
+  enum optimize_size_level optimize_for_size_p (void);
 
   /* Dump the callgraph to file F.  */
   static void dump_cgraph (FILE *f);
@@ -3315,15 +3315,17 @@ cgraph_node::mark_force_output (void)
 
 /* Return true if function should be optimized for size.  */
 
-inline bool
+inline enum optimize_size_level
 cgraph_node::optimize_for_size_p (void)
 {
   if (opt_for_fn (decl, optimize_size))
-return true;
+return OPTIMIZE_SIZE_MAX;
+  if (count == profile_count::zero ())
+return OPTIMIZE_SIZE_MAX;
   if (frequency == NODE_FREQUENCY_UNLIKELY_EXECUTED)
-return true;
+return OPTIMIZE_SIZE_BALANCED;
   else
-return false;
+return OPTIMIZE_SIZE_NO;
 }
 
 /* Return symtab_node for NODE or create one if it is not present
diff --git a/gcc/coretypes.h b/gcc/coretypes.h
index 81a1b594dcd..da178b6a9f6 100644
--- a/gcc/coretypes.h
+++ b/gcc/coretypes.h
@@ -444,6 +444,18 @@ enum excess_precision_type
   EXCESS_PRECISION_TYPE_FAST
 };
 
+/* Level of size optimization.  */
+
+enum optimize_size_level
+{
+  /* Do not optimize for size.  */
+  OPTIMIZE_SIZE_NO,
+  /* Optimize for size but not at extreme performance costs.  */
+  OPTIMIZE_SIZE_BALANCED,
+  /* Optimize for size as much as possible.  */
+  OPTIMIZE_SIZE_MAX
+};
+
 /* Support for user-provided GGC and PCH markers.  The first parameter
is a pointer to a pointer, the second a cookie.  */
 typedef void (*gt_pointer_operator) (void *, void *);
diff --git a/gcc/predict.c b/gcc/predict.c
index 5983889209f..361c4019eec 100644
--- a/gcc/predict.c
+++ b/gcc/predict.c
@@ -243,7 +243,7 @@ probably_never_executed_bb_p (struct function *fun, const_basic_block bb)
 static bool
 unlikely_executed_edge_p (edge e)
 {
-  return (e->count () == profile_count::zero ()
+  return (e->src->count == profile_count::zero ()
  || e->probability == profile_probability::never ())
 || (e->flags & (EDGE_EH | EDGE_FAKE));
 }
@@ -260,13 +260,15 @@ probably_never_executed_edge_p (struct function *fun, edge e)
 
 /* Return true if function FUN should always be optimized for size.  */
 
-bool
+optimize_size_level
 optimize_function_for_size_p (struct function *fun)
 {
   if (!fun || !fun->decl)
-return optimize_size;
+return optimize_size ? OPTIMIZE_SIZE_MAX : OPTIMIZE_SIZE_NO;
   cgraph_node *n = cgraph_node::get (fun->decl);
-  return n && n->optimize_for_size_p ();
+  if (n)
+return n->optimize_for_size_p ();
+  return OPTIMIZE_SIZE_NO;
 }
 
 /* Return true if function FUN should always be optimized for speed.  */
@@ -289,11 +291,16 @@ function_optimization_type (struct function *fun)
 
 /*

[PATCH] Re: error: ‘EVRP_MODE_DEBUG’ was not declared – was: [PUSHED] Ranger classes.

2020-10-26 Thread Andrew MacLeod via Gcc-patches

On 10/25/20 8:37 PM, Maciej W. Rozycki wrote:

On Tue, 6 Oct 2020, Andrew MacLeod via Gcc-patches wrote:


Build fails here now with: gimple-range.h:168:59: error:
‘EVRP_MODE_DEBUG’ was not declared in this scope


And now builds – as the "Hybrid EVRP and testcases" was pushed as well,
a bit more than a quarter of an hour later. (At least it finished
building the compiler itself, I do not expect surprises in the library
parts.)

Tobias

Guess I should have just pushed it all as one commit.  I thought the first part
was pretty separate from the second... and it was, except for one line :-P  Of
course, I had problems getting the second one out, or it would have followed
more quickly.

  It is still broken at `-O0', does not build with `--enable-werror-always'
(which IMO should be on by default except for releases, just as we do with
binutils AFAIK, so as to make sure people do not introduce build problems
too easily):

.../gcc/gimple-range.cc: In function 'bool range_of_builtin_call(range_query&, 
irange&, gcall*)':
.../gcc/gimple-range.cc:677:15: error: 'zerov' may be used uninitialized 
[-Werror=maybe-uninitialized]
   677 |   if (zerov == prec)
   |   ^~
cc1plus: all warnings being treated as errors
make[2]: *** [Makefile:1122: gimple-range.o] Error 1

   Maciej

I can't reproduce it on x86_64-pc-linux-gnu; I presume this is some
other target.


Eyeballing it, it seems that there was a missed initialization when the
builtin code was ported, which might show up on a target that defines
CLZ_DEFINED_VALUE_AT_ZERO to be non-zero but doesn't always set the
zerov parameter...  Or maybe it's some optimization-ordering thing.


Anyway, the following patch has been pushed as an obvious fix to make
the code match what's in vr-values.c.


Andrew




commit 425bb53b54aece8ffe8298686c9ba5259ab17b0e
Author: Andrew MacLeod 
Date:   Mon Oct 26 10:13:58 2020 -0400

Re: error: ‘EVRP_MODE_DEBUG’ was not declared – was: [PUSHED] Ranger 
classes.

Initialize zerov to match vr-values.c.

* gimple-range.cc (range_of_builtin_call): Initialize zerov to 0.

diff --git a/gcc/gimple-range.cc b/gcc/gimple-range.cc
index 267ebad757f..f5c6a1ca620 100644
--- a/gcc/gimple-range.cc
+++ b/gcc/gimple-range.cc
@@ -611,7 +611,7 @@ range_of_builtin_call (range_query &query, irange &r, gcall 
*call)
 
   tree type = gimple_call_return_type (call);
   tree arg;
-  int mini, maxi, zerov, prec;
+  int mini, maxi, zerov = 0, prec;
   scalar_int_mode mode;
 
   switch (func)


Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch

2020-10-26 Thread Jakub Jelinek via Gcc-patches
On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> This patch adds caching for the stack block allocated for offloaded
> OpenMP kernel launches on NVPTX. This is a performance optimisation --
> we observed an average 11% or so performance improvement with this patch
> across a set of accelerated GPU benchmarks on one machine (results vary
> according to individual benchmark and with hardware used).
> 
> A given kernel launch will reuse the stack block from the previous launch
> if it is large enough, else it is freed and reallocated. A slight caveat
> is that memory will not be freed until the device is closed, so e.g. if
> code is using highly variable launch geometries and large amounts of
> GPU RAM, you might run out of resources slightly quicker with this patch.
> 
> Another way this patch gains performance is by omitting the
> synchronisation at the end of an OpenMP offload kernel launch -- it's
> safe for the GPU and CPU to continue executing in parallel at that point,
> because e.g. copies-back from the device will be synchronised properly
> with kernel completion anyway.
> 
> In turn, the last part necessitates a change to the way "(perhaps abort
> was called)" errors are detected and reported.
> 
> Tested with offloading to NVPTX. OK for mainline?

I'm afraid I don't know the plugin nor CUDA well enough to review this
properly (therefore I'd like to hear from Thomas, Tom and/or Alexander).
Anyway, just two questions.  First, wouldn't it make sense to add some upper
bound limit over which the stacks wouldn't be cached, so that caching would
still happen most of the time for normal programs, but one really excessive
kernel followed by many normal ones wouldn't result in memory allocation
failures?

And, in which context are cuStreamAddCallback-registered callbacks run?
E.g. if it is inside of an asynchronous interrupt, using locking in there
might not be the best thing to do.

> -  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
> -  if (r == CUDA_ERROR_LAUNCH_FAILED)
> -GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
> -maybe_abort_msg);
> -  else if (r != CUDA_SUCCESS)
> -GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
> -  nvptx_stacks_free (stacks, teams * threads);
> +  CUDA_CALL_ASSERT (cuStreamAddCallback, NULL, nvptx_stacks_release,
> + (void *) ptx_dev, 0);
>  }
>  
>  /* TODO: Implement GOMP_OFFLOAD_async_run. */
> -- 
> 2.28.0

Jakub



Re: Implement three-level optimize_for_size predicates

2020-10-26 Thread H.J. Lu via Gcc-patches
On Mon, Oct 26, 2020 at 7:23 AM Jan Hubicka  wrote:
>
> Hi,
> this patch implements the three-level optimize_for_size predicates, so with -Os
> and with profile feedback for never-executed code they return OPTIMIZE_SIZE_MAX,
> while in cases where we decide to optimize for size based on branch prediction
> logic they return OPTIMIZE_SIZE_BALANCED.
>
> The idea is that in places where we guess that code is unlikely we do not
> want to do extreme optimizations for size that lead to manyfold slowdowns
> (using idiv rather than a few shifts, or using rep-based inlined stringops).
>
> I will update the RTL handling code to also support this with BB granularity
> (which we don't do currently).  We should also make the cold attribute lead to
> OPTIMIZE_SIZE_MAX, I would say.
>
> LLVM has -Os and -Oz levels, where -Oz is our -Os and LLVM's -Os would
> correspond to OPTIMIZE_SIZE_BALANCED.  I wonder if we want to export this to
> the command line somehow?  For me it would definitely be useful for testing
> things; I am not sure how desired a "weaker" -Os is in practice.
>
> Bootstrapped/regtested x86_64-linux, I will commit it later today if
> there are no comments.
>
> H.J., can you please update your patch on stringopts?
>

Please go ahead.  My patches should be orthogonal to yours.

-- 
H.J.


Re: Implement three-level optimize_for_size predicates

2020-10-26 Thread Martin Liška

On 10/26/20 3:22 PM, Jan Hubicka wrote:

Hi,
this patch implements the three-level optimize_for_size predicates, so with -Os
and with profile feedback for never-executed code they return OPTIMIZE_SIZE_MAX,
while in cases where we decide to optimize for size based on branch prediction
logic they return OPTIMIZE_SIZE_BALANCED.


Hello.

Do we want this to somehow correspond to -fprofile-partial-training? Or is the
-fprofile-partial-training option basically dead with your new levels?

Thanks,
Martin


Re: Implement three-level optimize_for_size predicates

2020-10-26 Thread Jan Hubicka
> On 10/26/20 3:22 PM, Jan Hubicka wrote:
> > Hi,
> > this patch implements the three-level optimize_for_size predicates, so
> > with -Os and with profile feedback for never-executed code they return
> > OPTIMIZE_SIZE_MAX, while in cases where we decide to optimize for size
> > based on branch prediction logic they return OPTIMIZE_SIZE_BALANCED.
> 
> Hello.
> 
> Do we want this to somehow correspond to -fprofile-partial-training? Or is the
> -fprofile-partial-training option basically dead with your new levels?

partial-training will set counts to a guessed 0 instead of an absolute 0, so
we will use OPTIMIZE_SIZE_BALANCED instead of MAX for things that were not
executed.
We will still use MAX for portions with optimize_size and for code detected
by safe heuristics, like code just before abort.

Honza
> 
> Thanks,
> Martin


Re: Implement three-level optimize_for_size predicates

2020-10-26 Thread Jan Hubicka
> On Mon, Oct 26, 2020 at 7:23 AM Jan Hubicka  wrote:
> >
> > Hi,
> > this patch implements the three-level optimize_for_size predicates, so
> > with -Os and with profile feedback for never-executed code they return
> > OPTIMIZE_SIZE_MAX, while in cases where we decide to optimize for size
> > based on branch prediction logic they return OPTIMIZE_SIZE_BALANCED.
> >
> > The idea is that in places where we guess that code is unlikely we do not
> > want to do extreme optimizations for size that lead to manyfold slowdowns
> > (using idiv rather than a few shifts, or using rep-based inlined stringops).
> >
> > I will update the RTL handling code to also support this with BB
> > granularity (which we don't do currently).  We should also make the cold
> > attribute lead to OPTIMIZE_SIZE_MAX, I would say.
> >
> > LLVM has -Os and -Oz levels, where -Oz is our -Os and LLVM's -Os would
> > correspond to OPTIMIZE_SIZE_BALANCED.  I wonder if we want to export this
> > to the command line somehow?  For me it would definitely be useful for
> > testing things; I am not sure how desired a "weaker" -Os is in practice.
> >
> > Bootstrapped/regtested x86_64-linux, I will commit it later today if
> > there are no comments.
> >
> > H.J., can you please update your patch on stringopts?
> >
> 
> Please go ahead.  My patches should be orthogonal to yours.

For example, you had a patch that limited "rep cmpsb" expansion to
-minline-all-stringops.  Now the condition could be
-minline-all-stringops || optimize_insn_for_size () == OPTIMIZE_SIZE_MAX,
since it is still a useful size optimization.

I am not sure if you had other changes of this nature? (It is a bit hard
to grep the compiler for things like this, and I would like to get these
organized into optimize_size levels now.)

Honza
> 
> -- 
> H.J.


Re: [PATCH][middle-end][i386][Version 4] Add -fzero-call-used-regs=[skip|used-gpr-arg|used-arg|all-arg|used-gpr|all-gpr|used|all]

2020-10-26 Thread Qing Zhao via Gcc-patches
>> 
>> +/* Generate insns to zero all st/mm registers together.
>> +   Return true when zeroing instructions are generated.
>> +   Assume the number of st registers that are zeroed is num_of_st,
>> +   we will emit the following sequence to zero them together:
>> + fldz; \
>> + fldz; \
>> + ...
>> + fldz; \
>> + fstp %%st(0); \
>> + fstp %%st(0); \
>> + ...
>> + fstp %%st(0);
>> +   i.e., num_of_st fldz followed by num_of_st fstp to clear the stack
>> +   mark stack slots empty.  */
>> +
>> +static bool
>> +zero_all_st_mm_registers (HARD_REG_SET need_zeroed_hardregs)
>> +{
>> +  unsigned int num_of_st = 0;
>> +  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
>> +if (STACK_REGNO_P (regno)
>> +   && TEST_HARD_REG_BIT (need_zeroed_hardregs, regno)
>> +   /* When the corresponding mm register also need to be cleared too.  
>> */
>> +   && TEST_HARD_REG_BIT (need_zeroed_hardregs,
>> + (regno - FIRST_STACK_REG + FIRST_MMX_REG)))
>> +  num_of_st++;
> 
> I don't think the above logic is correct. It should go like this:
> 
> - If the function is returning an MMX register,

How do I check for this? Is the following correct?

If (GET_CODE(crtl->return_rtx) == REG 
&& (MMX_REG_P (REGNO (crtl->return_rtx)))

   The function is returning an MMX register.


> then the function
> exits in MMX mode, and MMX registers should be cleared in the same way
> as XMM registers.

When clearing XMM registers we used V4SFmode; what mode should we use to
clear MMX registers?

> Otherwise the ABI specifies that the function exits
> in x87 mode and x87 stack should be cleared (but see below).
> 
> - There is no direct mapping of stack registers to hard register
> numbers. If a stack register is used, we don't know where in the stack
> the value remains. So, if _any_ stack register is touched, the whole
> stack should be cleared (value, returning in x87 stack register should
> obviously be excluded).

Then how do we exclude the x87 stack register that holds the function return
value when we need to clear the whole stack?
I am a little confused here; could you explain in a little more detail?
> 
> - There is no x87 argument register. 32bit targets use MMX0-3 argument
> registers and return value in the XMM register. Please also note that
> complex values take two stack slots in x87 stack.

You mean the complex return value will be returned in two  x87 registers? 

thanks.

Qing
> 
> Uros.
> 
>> +
>> +  if (num_of_st == 0)



*PING* RE: [Patch] testsuite: Avoid TCL errors when rootme or ASAN/TSAN/UBSAN is not available (was: Re: [Patch] testsuite: Avoid TCL errors when ASAN/TSAN/UBSAN is not available)

2020-10-26 Thread Burnus, Tobias



-Original Message-
From: Tobias Burnus [mailto:tob...@codesourcery.com]
Sent: Monday, October 19, 2020 6:03 PM
To: gcc-patches ; Rainer Orth 
; Mike Stump 
Subject: [Patch] testsuite: Avoid TCL errors when rootme or ASAN/TSAN/UBSAN is 
not available (was: Re: [Patch] testsuite: Avoid TCL errors when 
ASAN/TSAN/UBSAN is not available)

Thomas Schwinge and Joseph convinced me that 'rootme' only makes sense for
in-tree testing and, hence, does not need to (or: should not) be set in site.exp.

Thus, if it is not set, we have to check for its existence before using it, to
avoid similar TCL errors.
Hence, I updated the patch to also check for 'rootme'.

OK?

Tobias

On 10/19/20 11:46 AM, Tobias Burnus wrote:
> In a --disable-libsanitizer build, I see errors such as:
>   g++.sum:ERROR: can't read "asan_saved_library_path": no such
> variable
>
> I believe the following patch is the right way to solve this.
> OK?
>
> Tobias
>
-
Mentor Graphics (Deutschland) GmbH, Arnulfstraße 201, 80634 München / Germany
Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Alexander 
Walter


[PATCH] Refactor SLP instance analysis

2020-10-26 Thread Richard Biener
This refactors the toplevel entry for analyzing an SLP instance to
expose a worker that analyzes a vector of stmts with a given SLP
instance kind.

Bootstrap & regtest running on x86_64-unknown-linux-gnu.

2020-10-26  Richard Biener  

* tree-vect-slp.c (enum slp_instance_kind): New.
(vect_build_slp_instance): Split out from...
(vect_analyze_slp_instance): ... this.
---
 gcc/tree-vect-slp.c | 260 ++--
 1 file changed, 152 insertions(+), 108 deletions(-)

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 4544f0f84a8..014bcba7819 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1997,124 +1997,50 @@ calculate_unrolling_factor (poly_uint64 nunits, unsigned int group_size)
   return exact_div (common_multiple (nunits, group_size), group_size);
 }
 
-/* Analyze an SLP instance starting from a group of grouped stores.  Call
-   vect_build_slp_tree to build a tree of packed stmts if possible.
-   Return FALSE if it's impossible to SLP any stmt in the loop.  */
+enum slp_instance_kind {
+    slp_inst_kind_store,
+    slp_inst_kind_reduc_group,
+    slp_inst_kind_reduc_chain,
+    slp_inst_kind_ctor
+};
 
 static bool
 vect_analyze_slp_instance (vec_info *vinfo,
   scalar_stmts_to_slp_tree_map_t *bst_map,
-  stmt_vec_info stmt_info, unsigned max_tree_size)
-{
-  slp_instance new_instance;
-  slp_tree node;
-  unsigned int group_size;
-  unsigned int i;
-  struct data_reference *dr = STMT_VINFO_DATA_REF (stmt_info);
-  vec<stmt_vec_info> scalar_stmts;
-  bool constructor = false;
-
-  if (is_a <bb_vec_info> (vinfo))
-vect_location = stmt_info->stmt;
-  if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
-{
-  group_size = DR_GROUP_SIZE (stmt_info);
-}
-  else if (!dr && REDUC_GROUP_FIRST_ELEMENT (stmt_info))
-{
-  gcc_assert (is_a <loop_vec_info> (vinfo));
-  group_size = REDUC_GROUP_SIZE (stmt_info);
-}
-  else if (is_gimple_assign (stmt_info->stmt)
-   && gimple_assign_rhs_code (stmt_info->stmt) == CONSTRUCTOR)
-{
-  group_size = CONSTRUCTOR_NELTS (gimple_assign_rhs1 (stmt_info->stmt));
-  constructor = true;
-}
-  else
-{
-  gcc_assert (is_a <loop_vec_info> (vinfo));
-  group_size = as_a <loop_vec_info> (vinfo)->reductions.length ();
-}
+  stmt_vec_info stmt_info, unsigned max_tree_size);
 
-  /* Create a node (a root of the SLP tree) for the packed grouped stores.  */
-  scalar_stmts.create (group_size);
-  stmt_vec_info next_info = stmt_info;
-  if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
-{
-  /* Collect the stores and store them in SLP_TREE_SCALAR_STMTS.  */
-  while (next_info)
-{
- scalar_stmts.safe_push (vect_stmt_to_vectorize (next_info));
- next_info = DR_GROUP_NEXT_ELEMENT (next_info);
-}
-}
-  else if (!dr && REDUC_GROUP_FIRST_ELEMENT (stmt_info))
-{
-  /* Collect the reduction stmts and store them in
-SLP_TREE_SCALAR_STMTS.  */
-  while (next_info)
-{
- scalar_stmts.safe_push (vect_stmt_to_vectorize (next_info));
- next_info = REDUC_GROUP_NEXT_ELEMENT (next_info);
-}
-  /* Mark the first element of the reduction chain as reduction to properly
-transform the node.  In the reduction analysis phase only the last
-element of the chain is marked as reduction.  */
-  STMT_VINFO_DEF_TYPE (stmt_info)
-   = STMT_VINFO_DEF_TYPE (scalar_stmts.last ());
-  STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info))
-   = STMT_VINFO_REDUC_DEF (vect_orig_stmt (scalar_stmts.last ()));
-}
-  else if (constructor)
-{
-  tree rhs = gimple_assign_rhs1 (stmt_info->stmt);
-  tree val;
-  FOR_EACH_CONSTRUCTOR_VALUE (CONSTRUCTOR_ELTS (rhs), i, val)
-   {
- if (TREE_CODE (val) == SSA_NAME)
-   {
- gimple* def = SSA_NAME_DEF_STMT (val);
- stmt_vec_info def_info = vinfo->lookup_stmt (def);
- /* Value is defined in another basic block.  */
- if (!def_info)
-   return false;
- def_info = vect_stmt_to_vectorize (def_info);
- scalar_stmts.safe_push (def_info);
-   }
- else
-   return false;
-   }
-  if (dump_enabled_p ())
-   dump_printf_loc (MSG_NOTE, vect_location,
-"Analyzing vectorizable constructor: %G\n",
-stmt_info->stmt);
-}
-  else
-{
-  /* Collect reduction statements.  */
-  vec<stmt_vec_info> reductions = as_a <loop_vec_info> (vinfo)->reductions;
-  for (i = 0; reductions.iterate (i, &next_info); i++)
-   scalar_stmts.safe_push (next_info);
-}
+/* Analyze an SLP instance starting from SCALAR_STMTS which are a group
+   of KIND.  Return true if successful.  */
 
+static bool
+vect_build_slp_instance (vec_info *vinfo,
+slp_instance_kind kind,
+vec<stmt_vec_info> scalar_stmts,
+stmt_vec_info root_stmt_in

Re: Implement three-level optimize_for_size predicates

2020-10-26 Thread H.J. Lu via Gcc-patches
On Mon, Oct 26, 2020 at 7:36 AM Jan Hubicka  wrote:
>
> > On Mon, Oct 26, 2020 at 7:23 AM Jan Hubicka  wrote:
> > >
> > > Hi,
> > > this patch implements the three-level optimize_for_size predicates, so
> > > with -Os and with profile feedback for never-executed code they return
> > > OPTIMIZE_SIZE_MAX, while in cases where we decide to optimize for size
> > > based on branch prediction logic they return OPTIMIZE_SIZE_BALANCED.
> > >
> > > The idea is that in places where we guess that code is unlikely we do not
> > > want to do extreme optimizations for size that lead to manyfold
> > > slowdowns (using idiv rather than a few shifts, or using rep-based
> > > inlined stringops).
> > >
> > > I will update the RTL handling code to also support this with BB
> > > granularity (which we don't do currently).  We should also make the cold
> > > attribute lead to OPTIMIZE_SIZE_MAX, I would say.
> > >
> > > LLVM has -Os and -Oz levels, where -Oz is our -Os and LLVM's -Os would
> > > correspond to OPTIMIZE_SIZE_BALANCED.  I wonder if we want to export
> > > this to the command line somehow?  For me it would definitely be useful
> > > for testing things; I am not sure how desired a "weaker" -Os is in
> > > practice.
> > >
> > > Bootstrapped/regtested x86_64-linux, I will commit it later today if
> > > there are no comments.
> > >
> > > H.J., can you please update your patch on stringopts?
> > >
> >
> > Please go ahead.  My patches should be orthogonal to yours.
>
> For example, you had a patch that limited "rep cmpsb" expansion to
> -minline-all-stringops.  Now the condition could be
> -minline-all-stringops || optimize_insn_for_size () == OPTIMIZE_SIZE_MAX,
> since it is still a useful size optimization.
>
> I am not sure if you had other changes of this nature? (It is a bit hard
> to grep the compiler for things like this, and I would like to get these
> organized into optimize_size levels now.)

Shouldn't it apply to all functions inlined by -minline-all-stringops?


-- 
H.J.


Re: Extend builtin fnspecs

2020-10-26 Thread Richard Biener
On Mon, 19 Oct 2020, Jan Hubicka wrote:

> > > +  /* True if memory reached by the argument is read.
> > > + Valid only if all loads are known.  */
> > > +  bool
> > > +  arg_read_p (unsigned int i)
> > > +  {
> > > +unsigned int idx = arg_idx (i);
> > > +gcc_checking_assert (arg_specified_p (i));
> > > +gcc_checking_assert (loads_known_p ());
> > 
> > I see loads_known_p () is 'const' (introducing new terminology
> > as alias for sth else is IMHO bad).  So what do you think
> > arg_read_p () guarantees?  Even on a not 'const' function
> > 'r' or 'R' or 'w' or 'W' means the argument could be read??!
> 
> The original intention was for !arg_read_p to guarantee that the argument
> is not read from (and for that you need const), but I updated it.
> > 
> > > +return str[idx] == 'r' || str[idx] == 'R'
> > > +|| str[idx] == 'w' || str[idx] == 'W';
> > > +  }
> > > +
> > > +  /* True if memory reached by the argument is read.
> > > + Valid only if all loads are known.  */
> > > +  bool
> > > +  arg_written_p (unsigned int i)
> > > +  {
> > > +unsigned int idx = arg_idx (i);
> > > +gcc_checking_assert (arg_specified_p (i));
> > > +gcc_checking_assert (stores_known_p ());
> > 
> > Likewise.  IMHO those will cause lots of confusion.  For example
> > arg_readonly_p doesn't imply arg_read_p.
> > 
> > Please keep the number of core predicates at a minimum!
> 
> Well, my intention is/was that while fnspec strings themselves are
> necessarily a bit ad hoc (trying to pack multiple things into 2 characters
> and choosing just a few special cases we care about), we should abstract
> this and have predicates for the individual properties we care about.
> 
> Indeed I did not name them very well; hopefully things are better now :)
> In the future modref can detect those properties of functions and provide
> a symmetric API for testing them without the need to go through fnspec
> limitations.  Similarly I would like to decouple the logic around
> const/pure functions better as well (since they pack multiple things
> together - the presence of side effects, whether the function is
> deterministic, and info about global memory accesses).
> 
> Concerning arg_read_p and arg_written_p, I introduced them while I was
> still intending to make 'w' and 'W' mean what 'o' and 'O' do.
> With original predicates the code handling loads
> 
> if (fnspec.loads_known_p ())
>   for each argument i
> if (fnspec.arg_readonly_p (i))
>   argument is read from
> else
>   argument is not read, ignore it.
> else
>  ask pta if base is local
> 
> Which looked like an odd re-use of a readonly_p predicate intended for
> something else.  I think the main confusion is that I interpreted the specs
> with cCpP a bit weirdly.  In particular:
> 
>  ". W . R "
> means
>  - arg 1 is written&read directly and noescapes,
>  - arg 2 is written, read directly or indirectly and escapes,
>  - arg 3 is read only directly, not written to and does not escape.
> 
> With previus patch 
>  ".cW . R "
> means:
>  - arg 1 is written&read directly and noescapes
>  - arg 2 is used only as pointer value (i.e. for NULL check)
>  - arg 3 is read only directly, not written to and does not escape.
> 
> With current patch is means:
>  - arg 1 is written&read directly and noescapes
>  - arg 2 is written&read directly or indirectly and may escape (in other 
> stores allowed by the function)
>  - arg 3 is read only directly, not written to and does not escape.
> 
> With current patch the loop is:
> if (!fnspec.global_memory_read_p ())
>   for each argument i
> if (POINTER_TYPE_P (TREE_TYPE (arg))
>   && fnspec.arg_maybe_read_p (i))
>   argument is read from
> else
>   argument is not read.
> else
>   ask pta if base is local
> 
> Which seems better and follows what we did before.
> However the extra POINTER_TYPE_P test is needed since we do not want to
> disambiguate non-pointer parameters that also have the '.' specifier.  I am
> not sure that this is safe/feasible with the gimple type system.

Hmm, so PTA does track pointers passed through uintptr_t for example
so I think you would need to write

  if (!POINTER_TYPE_P (TREE_TYPE (arg))
  || fnspec.arg_maybe_read_p (i))
argument may be read from

?

> I think it should be:
> 
> for each argument i
>   if (POINTER_TYPE_P (TREE_TYPE (arg))
>   && fnspec.arg_maybe_read_p (i))
> argument is read from
>   else
> argument is not read.
> if (fnspec.global_memory_read_p ())
>   give up if pta thinks this is not local.
> 
> Because if I have a function with spec, say,
>  ". R "
> and call it as foo (&localvar), I think localvar should end up non-escaping
> and thus not alias with function calls except where it is a parameter.

Yes.
 
> We do not have builtins like this right now though.

I think the Fortran FE essentially creates those for scalars passed
by reference and INTENT(IN).  So the Fortran FE is a good way
to write testcases for all of this ;)

> > 
> > > +return str[idx] == 'w' || str[idx] == 'W'
> > > +  

Re: [PATCH v2] builtins: (not just) rs6000: Add builtins for fegetround, feclearexcept and feraiseexcept [PR94193]

2020-10-26 Thread Raoni Fassina Firmino via Gcc-patches
On Thu, Oct 01, 2020 at 03:08:19PM -0500, Segher Boessenkool wrote:
> On Thu, Oct 01, 2020 at 08:08:01AM +0200, Richard Biener wrote:
> > On Wed, 30 Sep 2020, Segher Boessenkool wrote:
> > > It's going to be challenging to find a reasonable spot in there.
> > > Oh well.
> > 
> > Put it next to fmin/fmax docs or sin, etc. - at least the section
> > should be clear ;)  But yeah, patterns seem to be quite randomly
> > "sorted"...
> 
> And no chapter toc etc.
> 
> Yeah, this should just go with the other fp things, of course.  Duh.
> Thanks!

It doesn't help that the order in doc/md.texi is different from the order in
optabs.def.

I ended up putting it after all the trigonometric ones.


Re: [PATCH v2] builtins: rs6000: Add builtins for fegetround, feclearexcept and feraiseexcept [PR94193]

2020-10-26 Thread Raoni Fassina Firmino via Gcc-patches
On Mon, Oct 05, 2020 at 10:36:22AM -0500, Segher Boessenkool wrote:
> Should this pattern not allow setting more than one exception bit at
> once, btw?

Turns out allowing more than one bit was no problem at all.

On Mon, Oct 05, 2020 at 10:36:22AM -0500, Segher Boessenkool wrote:
> On Sun, Oct 04, 2020 at 09:56:01PM -0400, Hans-Peter Nilsson wrote:
> > > > +  rtx tmp = gen_rtx_CONST_INT (SImode, __builtin_clz (INTVAL 
> > > > (operands[1])));
> > 
> > This doesn't appear to be very portable, to any-cxx11-compiler
> > that doesn't pretend to be gcc-intrinsics-compatible.
> 
> Yeah, very good point!

And with that, no ffs() or clz() necessary at all :)


[PATCH] Move SLP nodes to an alloc-pool

2020-10-26 Thread Richard Biener
This introduces a global alloc-pool for SLP nodes, to reduce the overhead
of SLP allocation churn (which will get worse) and to eventually be able
to release SLP cycles, which retain a refcount of one and thus are never
freed at the moment.

Bootstrap / regtest pending on x86_64-unknown-linux-gnu.

2020-10-26  Richard Biener  

* tree-vectorizer.h (slp_tree_pool): Declare.
(_slp_tree::operator new): Likewise.
(_slp_tree::operator delete): Likewise.
* tree-vectorizer.c (vectorize_loops): Allocate and free the
slp_tree_pool.
(pass_slp_vectorize::execute): Likewise.
* tree-vect-slp.c (slp_tree_pool): Define.
(_slp_tree::operator new): Likewise.
(_slp_tree::operator delete): Likewise.
---
 gcc/tree-vect-slp.c   | 17 +
 gcc/tree-vectorizer.c |  9 +
 gcc/tree-vectorizer.h |  9 +
 3 files changed, 35 insertions(+)

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 014bcba7819..894f045c0fe 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -52,6 +52,23 @@ along with GCC; see the file COPYING3.  If not see
 static bool vectorizable_slp_permutation (vec_info *, gimple_stmt_iterator *,
  slp_tree, stmt_vector_for_cost *);
 
+object_allocator<_slp_tree> *slp_tree_pool;
+
+void *
+_slp_tree::operator new (size_t n)
+{
+  gcc_assert (n == sizeof (_slp_tree));
+  return slp_tree_pool->allocate_raw ();
+}
+
+void
+_slp_tree::operator delete (void *node, size_t n)
+{
+  gcc_assert (n == sizeof (_slp_tree));
+  slp_tree_pool->remove_raw (node);
+}
+
+
 /* Initialize a SLP node.  */
 
 _slp_tree::_slp_tree ()
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 778177a583b..0e08652ed10 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -1170,6 +1170,8 @@ vectorize_loops (void)
   if (vect_loops_num <= 1)
 return 0;
 
+  slp_tree_pool = new object_allocator<_slp_tree> ("SLP nodes for vect");
+
   if (cfun->has_simduid_loops)
 note_simd_array_uses (&simd_array_to_simduid_htab);
 
@@ -1292,6 +1294,8 @@ vectorize_loops (void)
 shrink_simd_arrays (simd_array_to_simduid_htab, simduid_to_vf_htab);
   delete simduid_to_vf_htab;
   cfun->has_simduid_loops = false;
+  delete slp_tree_pool;
+  slp_tree_pool = NULL;
 
   if (num_vectorized_loops > 0)
 {
@@ -1427,8 +1431,13 @@ pass_slp_vectorize::execute (function *fun)
}
 }
 
+  slp_tree_pool = new object_allocator<_slp_tree> ("SLP nodes for slp");
+
   vect_slp_function (fun);
 
+  delete slp_tree_pool;
+  slp_tree_pool = NULL;
+
   if (!in_loop_pipeline)
 {
   scev_finalize ();
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index b56073c4ee3..9c55383a3ee 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -26,6 +26,7 @@ typedef class _stmt_vec_info *stmt_vec_info;
 #include "tree-data-ref.h"
 #include "tree-hash-traits.h"
 #include "target.h"
+#include "alloc-pool.h"
 
 
 /* Used for naming of new temporaries.  */
@@ -115,6 +116,8 @@ typedef hash_map *slp_tree_pool;
+
 /* A computation tree of an SLP instance.  Each node corresponds to a group of
stmts to be packed in a SIMD stmt.  */
 struct _slp_tree {
@@ -163,6 +166,12 @@ struct _slp_tree {
   enum tree_code code;
 
   int vertex;
+
+  /* Allocate from slp_tree_pool.  */
+  static void *operator new (size_t);
+
+  /* Return memory to slp_tree_pool.  */
+  static void operator delete (void *, size_t);
 };
 
 
-- 
2.26.2


Re: [PATCH V2] aarch64: Add bfloat16 vldN_lane_bf16 + vldNq_lane_bf16 intrisics

2020-10-26 Thread Richard Sandiford via Gcc-patches
Andrea Corallo via Gcc-patches  writes:
> Hi all,
>
> Second version of the patch here implementing the bfloat16_t neon
> related load intrinsics: vld2_lane_bf16, vld2q_lane_bf16,
> vld3_lane_bf16, vld3q_lane_bf16 vld4_lane_bf16, vld4q_lane_bf16.
>
> This better narrows testcases so they do not cause regressions for the
> arm backend where these intrinsics are not yet present.
>
> Please refer to:
> ACLE 
> ISA  

The intrinsics are documented to require +bf16, but it looks like this
makes the bf16 forms available without that.  (This is enforced indirectly,
by complaining that the intrinsic wrapper can't be inlined into a caller
that uses incompatible target flags.)

Perhaps we should keep the existing intrinsics where they are and
just move the #undefs to the end, similarly to __aarch64_vget_lane_any.

Thanks,
Richard


[PATCH 1/x] arm: Add vld1_lane_bf16 + vldq_lane_bf16 intrinsics

2020-10-26 Thread Andrea Corallo via Gcc-patches
Hi all,

I'd like to submit the following patch implementing the bfloat16_t
neon related load intrinsics: vld1_lane_bf16, vld1q_lane_bf16.

Please refer to:
ACLE 
ISA  

Regtested and bootstrapped.

Okay for trunk?

  Andrea

From 64e375906abeba1ab14d06106a9714b0371b7105 Mon Sep 17 00:00:00 2001
From: Andrea Corallo 
Date: Wed, 21 Oct 2020 11:16:01 +0200
Subject: [PATCH] arm: Add vld1_lane_bf16 + vldq_lane_bf16 intrinsics

gcc/ChangeLog

2020-10-21  Andrea Corallo  

* config/arm/arm_neon_builtins.def: Add to LOAD1LANE v4bf, v8bf.
* config/arm/arm_neon.h (vld1_lane_bf16, vld1q_lane_bf16): Add
intrinsics.

gcc/testsuite/ChangeLog

2020-10-21  Andrea Corallo  

* gcc.target/arm/simd/vld1_lane_bf16_1.c: New testcase.
* gcc.target/arm/simd/vld1_lane_bf16_indices_1.c: Likewise.
* gcc.target/arm/simd/vld1q_lane_bf16_indices_1.c: Likewise.
---
 gcc/config/arm/arm_neon.h | 14 +
 gcc/config/arm/arm_neon_builtins.def  |  4 ++--
 .../gcc.target/arm/simd/vld1_lane_bf16_1.c| 21 +++
 .../arm/simd/vld1_lane_bf16_indices_1.c   | 17 +++
 .../arm/simd/vld1q_lane_bf16_indices_1.c  | 17 +++
 5 files changed, 71 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/simd/vld1_lane_bf16_1.c
 create mode 100644 gcc/testsuite/gcc.target/arm/simd/vld1_lane_bf16_indices_1.c
 create mode 100644 
gcc/testsuite/gcc.target/arm/simd/vld1q_lane_bf16_indices_1.c

diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
index aa21730dea0..fcd8020425e 100644
--- a/gcc/config/arm/arm_neon.h
+++ b/gcc/config/arm/arm_neon.h
@@ -19665,6 +19665,20 @@ vld4q_dup_bf16 (const bfloat16_t * __ptr)
   return __rv.__i;
 }
 
+__extension__ extern __inline bfloat16x4_t
+__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
+vld1_lane_bf16 (const bfloat16_t * __a, bfloat16x4_t __b, const int __c)
+{
+  return __builtin_neon_vld1_lanev4bf (__a, __b, __c);
+}
+
+__extension__ extern __inline bfloat16x8_t
+__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
+vld1q_lane_bf16 (const bfloat16_t * __a, bfloat16x8_t __b, const int __c)
+{
+  return __builtin_neon_vld1_lanev8bf (__a, __b, __c);
+}
+
 #pragma GCC pop_options
 
 #ifdef __cplusplus
diff --git a/gcc/config/arm/arm_neon_builtins.def 
b/gcc/config/arm/arm_neon_builtins.def
index 34c1945c0a1..7cdcd251243 100644
--- a/gcc/config/arm/arm_neon_builtins.def
+++ b/gcc/config/arm/arm_neon_builtins.def
@@ -312,8 +312,8 @@ VAR1 (TERNOP, vtbx3, v8qi)
 VAR1 (TERNOP, vtbx4, v8qi)
 VAR12 (LOAD1, vld1,
 v8qi, v4hi, v4hf, v2si, v2sf, di, v16qi, v8hi, v8hf, v4si, v4sf, v2di)
-VAR10 (LOAD1LANE, vld1_lane,
-   v8qi, v4hi, v2si, v2sf, di, v16qi, v8hi, v4si, v4sf, v2di)
+VAR12 (LOAD1LANE, vld1_lane,
+v8qi, v4hi, v2si, v2sf, di, v16qi, v8hi, v4si, v4sf, v2di, v4bf, v8bf)
 VAR10 (LOAD1, vld1_dup,
v8qi, v4hi, v2si, v2sf, di, v16qi, v8hi, v4si, v4sf, v2di)
 VAR12 (STORE1, vst1,
diff --git a/gcc/testsuite/gcc.target/arm/simd/vld1_lane_bf16_1.c 
b/gcc/testsuite/gcc.target/arm/simd/vld1_lane_bf16_1.c
new file mode 100644
index 000..fa4e45b7217
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/vld1_lane_bf16_1.c
@@ -0,0 +1,21 @@
+/* { dg-do assemble } */
+/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
+/* { dg-add-options arm_v8_2a_bf16_neon } */
+/* { dg-additional-options "-O3 --save-temps" } */
+
+#include "arm_neon.h"
+
+bfloat16x4_t
+test_vld1_lane_bf16 (bfloat16_t *a, bfloat16x4_t b)
+{
+  return vld1_lane_bf16 (a, b, 1);
+}
+
+bfloat16x8_t
+test_vld1q_lane_bf16 (bfloat16_t *a, bfloat16x8_t b)
+{
+  return vld1q_lane_bf16 (a, b, 2);
+}
+
+/* { dg-final { scan-assembler "vld1.16\t{d0\\\[1\\\]}, \\\[r0\\\]" } } */
+/* { dg-final { scan-assembler "vld1.16\t{d0\\\[2\\\]}, \\\[r0\\\]" } } */
diff --git a/gcc/testsuite/gcc.target/arm/simd/vld1_lane_bf16_indices_1.c 
b/gcc/testsuite/gcc.target/arm/simd/vld1_lane_bf16_indices_1.c
new file mode 100644
index 000..c83eb53234d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/vld1_lane_bf16_indices_1.c
@@ -0,0 +1,17 @@
+/* { dg-do assemble } */
+/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
+/* { dg-add-options arm_v8_2a_bf16_neon } */
+
+#include "arm_neon.h"
+
+bfloat16x4_t
+test_vld1_lane_bf16 (bfloat16_t *a, bfloat16x4_t b)
+{
+  bfloat16x4_t res;
+  res = vld1_lane_bf16 (a, b, -1);
+  res = vld1_lane_bf16 (a, b, 4);
+  return res;
+}
+
+/* { dg-error "lane -1 out of range 0 - 3" "" { target *-*-* } 0 } */
+/* { dg-error "lane 4 out of range 0 - 3" "" { target *-*-* } 0 } */
diff --git a/gcc/testsuite/gcc.target/arm/simd/vld1q_lane_bf16_indices_1.c 
b/gcc/testsuite/gcc.target/arm/simd/vld1q_lane_bf16_indices_1.c
new file mode 100644
index 000..8e21e61c9c0
--- /dev/null
+++ b/gcc/testsuite/gcc.targe

[PATCH 2/x] arm: add vst1_lane_bf16 + vstq_lane_bf16 intrinsics

2020-10-26 Thread Andrea Corallo via Gcc-patches
Hi all,

Second patch of the series, adding the vst1_lane_bf16 and vst1q_lane_bf16
bfloat16-related neon intrinsics.

Please refer to:
ACLE 
ISA  

Regtested and bootstrapped.

Okay for trunk?

  Andrea
  
From 4b66af535f7c08c58633096210f1aef945de36f5 Mon Sep 17 00:00:00 2001
From: Andrea Corallo 
Date: Fri, 23 Oct 2020 14:21:56 +0200
Subject: [PATCH] arm: Add vst1_lane_bf16 + vstq_lane_bf16 intrinsics

gcc/ChangeLog

2020-10-23  Andrea Corallo  

* config/arm/arm-builtins.c (VAR14): Define macro.
* config/arm/arm_neon.h (vst1_lane_bf16, vst1q_lane_bf16): Add
intrinsics.
* config/arm/arm_neon_builtins.def (STORE1LANE): Add v4bf, v8bf.

gcc/testsuite/ChangeLog

2020-10-23  Andrea Corallo  

* gcc.target/arm/simd/vst1_lane_bf16_1.c: New testcase.
* gcc.target/arm/simd/vstq1_lane_bf16_indices_1.c: Likewise.
* gcc.target/arm/simd/vst1_lane_bf16_indices_1.c: Likewise.
---
 gcc/config/arm/arm-builtins.c |  3 +++
 gcc/config/arm/arm_neon.h | 14 +
 gcc/config/arm/arm_neon_builtins.def  |  4 ++--
 .../gcc.target/arm/simd/vst1_lane_bf16_1.c| 21 +++
 .../arm/simd/vst1_lane_bf16_indices_1.c   | 15 +
 .../arm/simd/vstq1_lane_bf16_indices_1.c  | 15 +
 6 files changed, 70 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/simd/vst1_lane_bf16_1.c
 create mode 100644 gcc/testsuite/gcc.target/arm/simd/vst1_lane_bf16_indices_1.c
 create mode 100644 
gcc/testsuite/gcc.target/arm/simd/vstq1_lane_bf16_indices_1.c

diff --git a/gcc/config/arm/arm-builtins.c b/gcc/config/arm/arm-builtins.c
index 33e8015b140..6dc5df93216 100644
--- a/gcc/config/arm/arm-builtins.c
+++ b/gcc/config/arm/arm-builtins.c
@@ -946,6 +946,9 @@ typedef struct {
 #define VAR13(T, N, A, B, C, D, E, F, G, H, I, J, K, L, M) \
   VAR12 (T, N, A, B, C, D, E, F, G, H, I, J, K, L) \
   VAR1 (T, N, M)
+#define VAR14(T, N, A, B, C, D, E, F, G, H, I, J, K, L, M, O) \
+  VAR13 (T, N, A, B, C, D, E, F, G, H, I, J, K, L, M) \
+  VAR1 (T, N, O)
 
 /* The builtin data can be found in arm_neon_builtins.def, arm_vfp_builtins.def
and arm_acle_builtins.def.  The entries in arm_neon_builtins.def require
diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
index fcd8020425e..432d77fb272 100644
--- a/gcc/config/arm/arm_neon.h
+++ b/gcc/config/arm/arm_neon.h
@@ -19679,6 +19679,20 @@ vld1q_lane_bf16 (const bfloat16_t * __a, bfloat16x8_t 
__b, const int __c)
   return __builtin_neon_vld1_lanev8bf (__a, __b, __c);
 }
 
+__extension__ extern __inline void
+__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
+vst1_lane_bf16 (bfloat16_t * __a, bfloat16x4_t __b, const int __c)
+{
+  __builtin_neon_vst1_lanev4bf (__a, __b, __c);
+}
+
+__extension__ extern __inline void
+__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
+vst1q_lane_bf16 (bfloat16_t * __a, bfloat16x8_t __b, const int __c)
+{
+  __builtin_neon_vst1_lanev8bf (__a, __b, __c);
+}
+
 #pragma GCC pop_options
 
 #ifdef __cplusplus
diff --git a/gcc/config/arm/arm_neon_builtins.def 
b/gcc/config/arm/arm_neon_builtins.def
index 7cdcd251243..3db7fb9f1f3 100644
--- a/gcc/config/arm/arm_neon_builtins.def
+++ b/gcc/config/arm/arm_neon_builtins.def
@@ -318,8 +318,8 @@ VAR10 (LOAD1, vld1_dup,
v8qi, v4hi, v2si, v2sf, di, v16qi, v8hi, v4si, v4sf, v2di)
 VAR12 (STORE1, vst1,
v8qi, v4hi, v4hf, v2si, v2sf, di, v16qi, v8hi, v8hf, v4si, v4sf, v2di)
-VAR12 (STORE1LANE, vst1_lane,
-   v8qi, v4hi, v4hf, v2si, v2sf, di, v16qi, v8hi, v8hf, v4si, v4sf, v2di)
+VAR14 (STORE1LANE, vst1_lane,
+   v8qi, v4hi, v4hf, v2si, v2sf, di, v16qi, v8hi, v8hf, v4si, v4sf, v2di, v4bf, v8bf)
 VAR13 (LOAD1, vld2,
v8qi, v4hi, v4hf, v2si, v2sf, di, v16qi, v8hi, v8hf, v4si, v4sf, v4bf, v8bf)
 VAR9 (LOAD1LANE, vld2_lane,
diff --git a/gcc/testsuite/gcc.target/arm/simd/vst1_lane_bf16_1.c 
b/gcc/testsuite/gcc.target/arm/simd/vst1_lane_bf16_1.c
new file mode 100644
index 000..e018ec6592f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/vst1_lane_bf16_1.c
@@ -0,0 +1,21 @@
+/* { dg-do assemble } */
+/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
+/* { dg-add-options arm_v8_2a_bf16_neon } */
+/* { dg-additional-options "-O3 --save-temps" } */
+
+#include "arm_neon.h"
+
+void
+test_vst1_lane_bf16 (bfloat16_t *a, bfloat16x4_t b)
+{
+  vst1_lane_bf16 (a, b, 1);
+}
+
+void
+test_vst1q_lane_bf16 (bfloat16_t *a, bfloat16x8_t b)
+{
+  vst1q_lane_bf16 (a, b, 2);
+}
+
+/* { dg-final { scan-assembler "vst1.16\t{d0\\\[1\\\]}, \\\[r0\\\]" } } */
+/* { dg-final { scan-assembler "vst1.16\t{d0\\\[2\\\]}, \\\[r0\\\]" } } */
diff --git a/gcc/testsuite/gcc.target/arm/simd/vst1_lane_bf16_indices_1.c 
b/gcc/testsuite/gcc.target/arm/simd/vst1_lane_bf16_indices_1.c
new file mode 100644
index 000..39870

Re: [PATCH v2] builtins: rs6000: Add builtins for fegetround, feclearexcept and feraiseexcept [PR94193]

2020-10-26 Thread Raoni Fassina Firmino via Gcc-patches
On Mon, Sep 28, 2020 at 11:42:13AM -0500, will schmidt wrote:
> > +/* Expand call EXP to either feclearexcept or feraiseexcept builtins (from 
> > C99
> > +fenv.h), returning the result and setting it in TARGET.  Otherwise 
> > return
> > +NULL_RTX on failure.  */
> > +static rtx
> > +expand_builtin_feclear_feraise_except (tree exp, rtx target,
> > +  machine_mode target_mode, optab op_optab)
> > +{
> > +  if (!validate_arglist (exp, INTEGER_TYPE, VOID_TYPE))
> > +return NULL_RTX;
> > +  rtx op0 = expand_normal (CALL_EXPR_ARG (exp, 0));
> > +
> > +  insn_code icode = direct_optab_handler (op_optab, SImode);
> > +  if (icode == CODE_FOR_nothing)
> > +return NULL_RTX;
> > +
> > +  if (target == 0
> > +  || GET_MODE (target) != target_mode
> > +  || ! (*insn_data[icode].operand[0].predicate) (target, target_mode))
> > +target = gen_reg_rtx (target_mode);
> > +
> > +  rtx pat = GEN_FCN (icode) (target, op0);
> > +  if (!pat)
> > +return NULL_RTX;
> > +  emit_insn (pat);
> > +
> > +  return target;
> > +}
> 
> 
> I don't see any references to feclearexcept or feraiseexcept in the
> functions there.   I see the callers pass in those values via optab,
> but nothing in these functions explicitly checks or limits that in a
> way that is obvious upon my reading...  Thus I wonder if there may be
> different, more generic names that would be appropriate for the functions.

Yes, I wonder the same.  I guess in theory it could be used with any
int(int) builtin at least, but looking at other, more generic
expand_builtin_* functions I was not confident enough that this one has the
necessary boilerplate code to handle other builtins.  Also, I looked
around builtins.c trying to find an expander that I could reuse, and I
noticed that the vast majority of builtins use a dedicated expand_*, so I
just followed suit.  I am not really thrilled by this name anyway, so if
it is something useful in a more generic way (which I don't have enough
knowledge to judge) I am happy to change it.


> > +;; FE_INEXACT, FE_DIVBYZERO, FE_UNDERFLOW and FE_OVERFLOW flags.
> > +;; It doesn't handle values out of range, and always returns 0.
> > +;; Note that FE_INVALID is unsupported because it maps to more than
> > +;; one bit on FPSCR register.
> 
> Should FE_INVALID have an explicit case statement path to FAIL?

Because there are only 4 valid flags, I am doing it the other way around:
just checking whether the value is one of the valid flags and FAILing for
any other value, so there is no need for an explicit FE_INVALID case.
Whether it is better to have one anyway, to make the intention clear in the
code, I don't know.


> No further comments or nits..

I applied all other suggestions for now,
thanks for the feedback Will :)


o/
Raoni


Re: [RS6000] Unsupported test options for -m32

2020-10-26 Thread David Edelsohn via Gcc-patches
FAIL: gcc.target/powerpc/swaps-p8-22.c (test for excess errors)
Excess errors:
cc1: error: '-mcmodel' not supported in this configuration

* gcc.target/powerpc/swaps-p8-22.c: Disable for -m32.

diff --git a/gcc/testsuite/gcc.target/powerpc/swaps-p8-22.c
b/gcc/testsuite/gcc.target/powerpc/swaps-p8-22.c
index 83f6ab3a1c0..bceada41b75 100644
--- a/gcc/testsuite/gcc.target/powerpc/swaps-p8-22.c
+++ b/gcc/testsuite/gcc.target/powerpc/swaps-p8-22.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-require-effective-target { lp64 && powerpc_p8vector_ok } } */
 /* { dg-options "-O2 -mdejagnu-cpu=power8 -maltivec -mcmodel=large" } */

 /* The expansion for vector character multiply introduces a vperm operation.


Please don't fix the failure this way.  This is incorrect.  -m32 means
more than Linux.  This reverts my hard work to run more of the powerpc
testsuite on AIX.  AIX also is -m32.

This probably should be fixed with

{ dg-additional-options "-mcmodel=large" { target { lp64 ||
!powerpc*-*-linux* } } }

or whatever the appropriate incantation to omit only ppc32 linux.  Or maybe

{ dg-do compile { target { lp64 || !powerpc*-*-linux* } } }

Thanks, David


Re: [PATCH][middle-end][i386][Version 4] Add -fzero-call-used-regs=[skip|used-gpr-arg|used-arg|all-arg|used-gpr|all-gpr|used|all]

2020-10-26 Thread Uros Bizjak via Gcc-patches
On Mon, Oct 26, 2020 at 3:45 PM Qing Zhao  wrote:
>
>
> +/* Generate insns to zero all st/mm registers together.
> +   Return true when zeroing instructions are generated.
> +   Assume the number of st registers that are zeroed is num_of_st,
> +   we will emit the following sequence to zero them together:
> + fldz; \
> + fldz; \
> + ...
> + fldz; \
> + fstp %%st(0); \
> + fstp %%st(0); \
> + ...
> + fstp %%st(0);
> +   i.e., num_of_st fldz followed by num_of_st fstp to clear the stack
> +   mark stack slots empty.  */
> +
> +static bool
> +zero_all_st_mm_registers (HARD_REG_SET need_zeroed_hardregs)
> +{
> +  unsigned int num_of_st = 0;
> +  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
> +if (STACK_REGNO_P (regno)
> +   && TEST_HARD_REG_BIT (need_zeroed_hardregs, regno)
> +   /* When the corresponding mm register also need to be cleared too.  */
> +   && TEST_HARD_REG_BIT (need_zeroed_hardregs,
> + (regno - FIRST_STACK_REG + FIRST_MMX_REG)))
> +  num_of_st++;
>
>
> I don't think the above logic is correct. It should go like this:
>
> - If the function is returning an MMX register,
>
>
> How to check on this? Is the following correct?
>
> If (GET_CODE(crtl->return_rtx) == REG
> && (MMX_REG_P (REGNO (crtl->return_rtx)))

Yes, but please use

if (MMX_REG_P (crtl->return_rtx))

>
>The function is returning an MMX register.
>
>
> then the function
> exits in MMX mode, and MMX registers should be cleared in the same way
> as XMM registers.
>
>
> When clearing XMM registers we used V4SFmode; what mode should we use to
> clear MMX registers?

It doesn't matter that much; any 8-byte vector mode will do (including
DImode).  Let's use V4HImode.

> Otherwise the ABI specifies that the function exits
> in x87 mode and x87 stack should be cleared (but see below).
>
> - There is no direct mapping of stack registers to hard register
> numbers. If a stack register is used, we don't know where in the stack
> the value remains. So, if _any_ stack register is touched, the whole
> stack should be cleared (value, returning in x87 stack register should
> obviously be excluded).
>
>
> Then how do we exclude the x87 stack register that holds the function return
> value when we need to clear the whole stack?
> I am a little confused here; could you explain in a little more detail?

x87 returns values in the top register (the top two for complex values), so
simply load 7 zeros (with 7 corresponding pops).  This will preserve the
return value but clear the whole remaining stack.
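
(Schematically, something like the following -- a sketch only; gen_fldz and
gen_fstp_top are hypothetical stand-ins for the actual i386 patterns:)

/* Sketch of the sequence described above: push NUM zeros above the return
   value, then pop them again, which leaves the return value on top and
   marks the remaining slots empty.  NUM is 7 for a scalar x87 return value
   and 6 for a complex one occupying two slots.  */
static void
clear_remaining_x87_stack (unsigned int num)
{
  for (unsigned int i = 0; i < num; i++)
    emit_insn (gen_fldz ());
  for (unsigned int i = 0; i < num; i++)
    emit_insn (gen_fstp_top ());
}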

> - There is no x87 argument register. 32bit targets use MMX0-3 argument
> registers and return value in the XMM register. Please also note that
> complex values take two stack slots in x87 stack.
>
>
> You mean the complex return value will be returned in two  x87 registers?

Yes, please see ix86_class_max_nregs.  Please note that in the case of a
complex return value, only 6 zeros should be loaded, to avoid clobbering
the complex return value.

Uros.


Re: [PATCH V2] aarch64: Add bfloat16 vldN_lane_bf16 + vldNq_lane_bf16 intrisics

2020-10-26 Thread Andrea Corallo via Gcc-patches
Richard Sandiford  writes:

> Andrea Corallo via Gcc-patches  writes:
>> Hi all,
>>
>> Second version of the patch here implementing the bfloat16_t neon
>> related load intrinsics: vld2_lane_bf16, vld2q_lane_bf16,
>> vld3_lane_bf16, vld3q_lane_bf16 vld4_lane_bf16, vld4q_lane_bf16.
>>
>> This better narrows testcases so they do not cause regressions for the
>> arm backend where these intrinsics are not yet present.
>>
>> Please refer to:
>> ACLE 
>> ISA  
>
> The intrinsics are documented to require +bf16, but it looks like this
> makes the bf16 forms available without that.  (This is enforced indirectly,
> by complaining that the intrinsic wrapper can't be inlined into a caller
> that uses incompatible target flags.)
>
> Perhaps we should keep the existing intrinsics where they are and
> just move the #undefs to the end, similarly to __aarch64_vget_lane_any.
>
> Thanks,
> Richard

Hi Richard,

thanks for reviewing.  I was wondering if it wouldn't be better to wrap the
new intrinsic definition into the correct pragma so the macro definition
stays narrowed.  WDYT?

Thanks

  Andrea
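For reference, a minimal sketch of the pragma-wrapping idea, following
the conventions arm_neon.h already uses (the exact target string below
is an assumption, not taken from the patch):

#pragma GCC push_options
#pragma GCC target ("arch=armv8.2-a+bf16")

/* bf16 intrinsics defined in this region are compiled with +bf16
   regardless of the command-line flags of the including translation
   unit, so the helper macros can stay narrowly scoped.  */

#pragma GCC pop_options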


Re: [RS6000] Unsupported test options for -m32

2020-10-26 Thread Iain Sandoe via Gcc-patches

David Edelsohn via Gcc-patches  wrote:


FAIL: gcc.target/powerpc/swaps-p8-22.c (test for excess errors)
Excess errors:
cc1: error: '-mcmodel' not supported in this configuration

* gcc.target/powerpc/swaps-p8-22.c: Disable for -m32.

diff --git a/gcc/testsuite/gcc.target/powerpc/swaps-p8-22.c
b/gcc/testsuite/gcc.target/powerpc/swaps-p8-22.c
index 83f6ab3a1c0..bceada41b75 100644
--- a/gcc/testsuite/gcc.target/powerpc/swaps-p8-22.c
+++ b/gcc/testsuite/gcc.target/powerpc/swaps-p8-22.c
@@ -1,5 +1,5 @@
/* { dg-do compile } */
-/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-require-effective-target { lp64 && powerpc_p8vector_ok } } */
/* { dg-options "-O2 -mdejagnu-cpu=power8 -maltivec -mcmodel=large" } */

/* The expansion for vector character multiply introduces a vperm  
operation.



Please don't fix the failure this way.  This is incorrect.  -m32 means
more than Linux.  This reverts my hard work to run more of the powerpc
testsuite on AIX.  AIX also is -m32.


Darwin also is (powerpc-darwin) and has an -m32 multilib (powerpc64-darwin),
so it is not reliable there either.


This probably should be fixed with

{ dg-additional-options "-mcmodel=large" { target { lp64 ||
!powerpc*-*-linux* } } }

or whatever the appropriate incantation to omit only ppc32 linux.  Or maybe

{ dg-do compile { target { lp64 || !powerpc*-*-linux* } } }


mcmodel will also break for powerpc64 and powerpc / m64 Darwin, so if this is
meant to be Linux-specific, that seems to be the thing to mention.

thanks
Iain



Re: [PATCH][middle-end][i386][Version 4] Add -fzero-call-used-regs=[skip|used-gpr-arg|used-arg|all-arg|used-gpr|all-gpr|used|all]

2020-10-26 Thread Qing Zhao via Gcc-patches



> On Oct 26, 2020, at 11:13 AM, Uros Bizjak  wrote:
> 
> On Mon, Oct 26, 2020 at 3:45 PM Qing Zhao wrote:
>> 
>> 
>> +/* Generate insns to zero all st/mm registers together.
>> +   Return true when zeroing instructions are generated.
>> +   Assume the number of st registers that are zeroed is num_of_st,
>> +   we will emit the following sequence to zero them together:
>> + fldz; \
>> + fldz; \
>> + ...
>> + fldz; \
>> + fstp %%st(0); \
>> + fstp %%st(0); \
>> + ...
>> + fstp %%st(0);
>> +   i.e., num_of_st fldz followed by num_of_st fstp to clear the stack
>> +   and mark the stack slots empty.  */
>> +
>> +static bool
>> +zero_all_st_mm_registers (HARD_REG_SET need_zeroed_hardregs)
>> +{
>> +  unsigned int num_of_st = 0;
>> +  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
>> +if (STACK_REGNO_P (regno)
>> +   && TEST_HARD_REG_BIT (need_zeroed_hardregs, regno)
>> +   /* The corresponding mm register also needs to be cleared.  */
>> +   && TEST_HARD_REG_BIT (need_zeroed_hardregs,
>> + (regno - FIRST_STACK_REG + FIRST_MMX_REG)))
>> +  num_of_st++;
>> 
>> 
>> I don't think the above logic is correct. It should go like this:
>> 
>> - If the function is returning an MMX register,
>> 
>> 
>> How do we check this?  Is the following correct?
>> 
>> if (GET_CODE (crtl->return_rtx) == REG
>>&& (MMX_REG_P (REGNO (crtl->return_rtx)))
> 
> Yes, but please use
> 
> if (MMX_REG_P (crtl->return_rtx))

Okay.
> 
>> 
>>   The function is returning an MMX register.
>> 
>> 
>> then the function
>> exits in MMX mode, and MMX registers should be cleared in the same way
>> as XMM registers.
>> 
>> 
>> When clearing XMM registers, we used V4SFmode; what’s the mode we should use
>> to clear
>> mmx registers?
> 
> It doesn't matter that much, any 8byte vector mode will do (including
> DImode). Let's use V4HImode.
Okay.

> 
>> Otherwise the ABI specifies that the function exits
>> in x87 mode and x87 stack should be cleared (but see below).
>> 
>> - There is no direct mapping of stack registers to hard register
>> numbers. If a stack register is used, we don't know where in the stack
>> the value remains. So, if _any_ stack register is touched, the whole
>> stack should be cleared (value, returning in x87 stack register should
>> obviously be excluded).
>> 
>> 
>> Then, how do we exclude the x87 stack register that returns the function
>> return value when we need to
>> clear the whole stack?
>> I am a little confused here.  Could you explain in a little more detail?
> 
> x87 returns in the top (two for complex values) register, so simply
> load 7 zeros (and 7 corresponding pops). This will preserve the return
> value but clear the whole remaining stack.

I see. 
> 
>> - There is no x87 argument register. 32bit targets use MMX0-3 argument
>> registers and return value in the XMM register. Please also note that
>> complex values take two stack slots in x87 stack.
>> 
>> 
>> You mean the complex return value will be returned in two x87 registers?
> 
> Yes, please see ix86_class_max_nregs. Please note that in case of
> complex return value, only 6 zeros should be loaded to avoid
> clobbering the complex return value.

Okay, I see. 

thanks.

Qing
> 
> Uros.



Re: [PATCH v2] builtins: rs6000: Add builtins for fegetround, feclearexcept and feraiseexcept [PR94193]

2020-10-26 Thread Segher Boessenkool
On Mon, Oct 26, 2020 at 01:05:00PM -0300, Raoni Fassina Firmino wrote:
> On Mon, Sep 28, 2020 at 11:42:13AM -0500, will schmidt wrote:
> > > +;; FE_INEXACT, FE_DIVBYZERO, FE_UNDERFLOW and FE_OVERFLOW flags.
> > > +;; It doesn't handle values out of range, and always returns 0.
> > > +;; Note that FE_INVALID is unsupported because it maps to more than
> > > +;; one bit on FPSCR register.
> > 
> > Should FE_INVALID have an explicit case statement path to FAIL?
> 
> Because there are only 4 valid flags, I am doing it the other way around,
> just checking if it is any of the valid flags and FAIL for any other
> value, so there is no need of an explicit FE_INVALID case, but if it is
> better to have one anyway to make the intention clear through code, I
> don't know.

To clear VX ("invalid", bit 34) you need to clear all different VX bits
(one per cause), so 39-44, 53-55.  To *set* it you need to pick which
one you want to set.  Maybe this is all best left to the libc in use,
which should have its own policy for that, it might not be the same on
all libcs.


Segher


Re: Implement three-level optimize_for_size predicates

2020-10-26 Thread Jan Hubicka
> >
> > For example you had patch that limited "rep cmpsb" expansion for
> > -minline-all-stringops.  Now the conditions could be
> > -minline-all-stringops || optimize_insn_for_size () == OPTIMIZE_SIZE_MAX
> > since it is still useful size optimization.
> >
> > I am not sure if you had other changes of this nature? (It is a bit hard
> > to grep compiler for things like this and I would like to get these
> > organized to optimize_size levels now).
> 
> Shouldn't it apply to all functions inlined by -minline-all-stringops?

I think we handle the other cases, for code optimized for size we go for
ix86_size_memcpy and ix86_size_memset tables that say inline all with
rep movsb.  We do not inline strlen since the way it is implemented gets
too long (short inline version would be welcome).

I will look through backend, but if you are aware of more checks like
one in ix86_expand_cmpstrn_or_cmpmem which disable size optimization
even at -Os, let me know.  They are not that easy to find...

Honza
> 
> 
> -- 
> H.J.


Re: Implement three-level optimize_for_size predicates

2020-10-26 Thread H.J. Lu via Gcc-patches
On Mon, Oct 26, 2020 at 10:14 AM Jan Hubicka  wrote:
>
> > >
> > > For example you had patch that limited "rep cmpsb" expansion for
> > > -minline-all-stringops.  Now the conditions could be
> > > -minline-all-stringops || optimize_insn_for_size () == OPTIMIZE_SIZE_MAX
> > > since it is still useful size optimization.
> > >
> > > I am not sure if you had other changes of this nature? (It is a bit hard
> > > to grep compiler for things like this and I would like to get these
> > > organized to optimize_size levels now).
> >
> > Shouldn't it apply to all functions inlined by -minline-all-stringops?
>
> I think we handle the other cases, for code optimized for size we go for
> ix86_size_memcpy and ix86_size_memset tables that say inline all with
> rep movsb.  We do not inline strlen since the way it is implemented gets
> too long (short inline version would be welcome).
>
> I will look through backend, but if you are aware of more checks like
> one in ix86_expand_cmpstrn_or_cmpmem which disable size optimization
> even at -Os, let me know.  They are not that easy to find...
>

[hjl@gnu-cfl-2 gcc]$ cat /tmp/x.c
int
func (char *d, unsigned int l)
{
  return __builtin_strncmp (d, "foo", l) ? 1 : 2;
}
[hjl@gnu-cfl-2 gcc]$ gcc -c -Os  /tmp/x.c
[hjl@gnu-cfl-2 gcc]$ nm x.o
 T func
 U strncmp
[hjl@gnu-cfl-2 gcc]$ size x.o
   text    data    bss    dec    hex filename
138   0   0 138  8a x.o
[hjl@gnu-cfl-2 gcc]$ gcc -c -O2  /tmp/x.c
[hjl@gnu-cfl-2 gcc]$ size x.o
   text    data    bss    dec    hex filename
146   0   0 146  92 x.o
[hjl@gnu-cfl-2 gcc]$ nm x.o
 T func
[hjl@gnu-cfl-2 gcc]$

-Os shouldn't inline strncmp.

-- 
H.J.


Re: [PATCH][middle-end][i386][Version 4] Add -fzero-call-used-regs=[skip|used-gpr-arg|used-arg|all-arg|used-gpr|all-gpr|used|all]

2020-10-26 Thread Qing Zhao via Gcc-patches


The following is the current change in i386.c, could you check whether the 
logic is good?

thanks.

Qing 

/* Check whether the register REGNO should be zeroed on X86.
   When ALL_SSE_ZEROED is true, all SSE registers have been zeroed
   together, no need to zero it again.
   When EXIT_WITH_MMX_MODE is true, MMX registers should be cleared.  */

static bool
zero_call_used_regno_p (const unsigned int regno,
bool all_sse_zeroed,
bool exit_with_mmx_mode)
{
  return GENERAL_REGNO_P (regno)
 || (!all_sse_zeroed && SSE_REGNO_P (regno))
 || MASK_REGNO_P (regno)
 || (exit_with_mmx_mode && MMX_REGNO_P (regno));
}

/* Return the machine_mode that is used to zero register REGNO.  */

static machine_mode
zero_call_used_regno_mode (const unsigned int regno)
{
  /* NB: We only need to zero the lower 32 bits for integer registers
 and the lower 128 bits for vector registers since destinations are
 zero-extended to the full register width.  */
  if (GENERAL_REGNO_P (regno))
return SImode;
  else if (SSE_REGNO_P (regno))
return V4SFmode;
  else if (MASK_REGNO_P (regno))
return HImode;
  else if (MMX_REGNO_P (regno))
return V4HImode;
  else
gcc_unreachable ();
}

/* Generate a rtx to zero all vector registers together if possible,
   otherwise, return NULL.  */

static rtx
zero_all_vector_registers (HARD_REG_SET need_zeroed_hardregs)
{
  if (!TARGET_AVX)
return NULL;

  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
    if ((IN_RANGE (regno, FIRST_SSE_REG, LAST_SSE_REG)
         || (TARGET_64BIT
             && (REX_SSE_REGNO_P (regno)
                 || (TARGET_AVX512F && EXT_REX_SSE_REGNO_P (regno)))))
        && !TEST_HARD_REG_BIT (need_zeroed_hardregs, regno))
      return NULL;

  return gen_avx_vzeroall ();
}


/* Generate insns to zero all st registers together.
   Return true when zeroing instructions are generated.
   Assume the number of st registers that are zeroed is num_of_st,
   we will emit the following sequence to zero them together:
  fldz; \
  fldz; \
  ...
  fldz; \
  fstp %%st(0); \
  fstp %%st(0); \
  ...
  fstp %%st(0);
   i.e., num_of_st fldz followed by num_of_st fstp to clear the stack
   and mark the stack slots empty.

   How to compute the num_of_st?
   There is no direct mapping from stack registers to hard register
   numbers.  If one stack register need to be cleared, we don't know
   where in the stack the value remains.  So, if any stack register
   need to be cleared, the whole stack should be cleared.  However,
   x87 stack registers that hold the return value should be excluded.
   x87 returns in the top (two for complex values) register, so
   num_of_st should be 7/6 when x87 returns, otherwise it will be 8.  */


static bool
zero_all_st_registers (HARD_REG_SET need_zeroed_hardregs)
{
  unsigned int num_of_st = 0;
  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
if (STACK_REGNO_P (regno)
&& TEST_HARD_REG_BIT (need_zeroed_hardregs, regno))
  {
num_of_st++;
break;
  }

  if (num_of_st == 0)
return false;

  bool return_with_x87 = false;
  return_with_x87 = ((GET_CODE (crtl->return_rtx) == REG)
  && (STACK_REG_P (crtl->return_rtx)));

  bool complex_return = false;
  complex_return = (COMPLEX_MODE_P (GET_MODE (crtl->return_rtx)));

  if (return_with_x87)
if (complex_return)
  num_of_st = 6;
else
  num_of_st = 7;
  else
num_of_st = 8;

  rtx st_reg = gen_rtx_REG (XFmode, FIRST_STACK_REG);

  for (unsigned int i = 0; i < num_of_st; i++)
emit_insn (gen_rtx_SET (st_reg, CONST0_RTX (XFmode)));

  for (unsigned int i = 0; i < num_of_st; i++)
{
  rtx insn;
  insn = emit_insn (gen_rtx_SET (st_reg, st_reg));
  add_reg_note (insn, REG_DEAD, st_reg);
}
  return true;
}
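
For readers unfamiliar with the reg-stack pass, the two loops above
rely on an i386 idiom (my reading; the mail does not spell it out):
reg-stack.c later materializes the emitted RTL roughly as

  (set (reg:XF st) (const_double 0.0))  ; each one becomes an fldz push
  (set (reg:XF st) (reg:XF st))         ; with its REG_DEAD note, each
                                        ; one becomes an fstp %st(0) pop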

/* TARGET_ZERO_CALL_USED_REGS.  */
/* Generate a sequence of instructions that zero registers specified by
   NEED_ZEROED_HARDREGS.  Return the ZEROED_HARDREGS that are actually
   zeroed.  */
static HARD_REG_SET
ix86_zero_call_used_regs (HARD_REG_SET need_zeroed_hardregs)
{
  HARD_REG_SET zeroed_hardregs;
  bool all_sse_zeroed = false;
  bool st_zeroed = false;

  /* First, let's see whether we can zero all vector registers together.  */
  rtx zero_all_vec_insn = zero_all_vector_registers (need_zeroed_hardregs);
  if (zero_all_vec_insn)
{
  emit_insn (zero_all_vec_insn);
  all_sse_zeroed = true;
}

  /* Then, decide which mode (MMX mode or x87 mode) the function exits with,
     in order to decide whether we need to clear the MMX registers or the
     stack registers.  */
  bool exit_with_mmx_mode = false;

  exit_with_mmx_mode = ((GET_CODE (crtl->return_rtx) == REG)
&& (MMX_REG_P (crtl->return_rtx)));

  

Re: [PATCH V2] aarch64: Add vcopy(q)__lane(q)_bf16 intrinsics

2020-10-26 Thread Richard Sandiford via Gcc-patches
Andrea Corallo via Gcc-patches  writes:
> diff --git 
> a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcopy_lane_bf16_indices_1.c
>  
> b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcopy_lane_bf16_indices_1.c
> new file mode 100644
> index 000..9cbb5ea8110
> --- /dev/null
> +++ 
> b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vcopy_lane_bf16_indices_1.c
> @@ -0,0 +1,18 @@
> +#include <arm_neon.h>
> +
> +/* { dg-do compile { target { aarch64*-*-* } } } */
> +/* { dg-skip-if "" { *-*-* } { "-fno-fat-lto-objects" } } */
> +/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok { target { arm*-*-* 
> } } } */
> +/* { dg-add-options arm_v8_2a_bf16_neon }  */

Realise this probably comes from elsewhere, but why is the
dg-require-effective-target dependent on arm*-*-*?  It and the
dg-add-options should usually be used as a pair, with the same
target guards.

In particular:

proc add_options_for_arm_v8_2a_bf16_neon { flags } {
if { ! [check_effective_target_arm_v8_2a_bf16_neon_ok] } {
return "$flags"
}
global et_arm_v8_2a_bf16_neon_flags
return "$flags $et_arm_v8_2a_bf16_neon_flags"
}

will do nothing when arm_v8_2a_bf16_neon_ok is false, and so in that
case I'd expect the testcase to be compiled without +bf16.  We'd then
get an error about using a bf16 function without the required target
feature.

Given that that hasn't been causing people problems in practice,
I assume most people testing on AArch64 use RUNTESTFLAGS that support
arm_v8_2a_bf16_neon_ok (as hoped).  But in principle it could be
false for AArch64 too.  So I think we should just remove the
“{ target arm*-*-* } ”.

Same for the other tests.
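
Concretely, dropping the guard leaves the usual unconditional pairing
(a sketch of the fixed directives):

/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
/* { dg-add-options arm_v8_2a_bf16_neon } */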

OK with that change if it works (for trunk and for whichever
branches need it).

Thanks,
Richard


Re: [PATCH 2/2] combine: Don't turn (mult (extend x) 2^n) into extract

2020-10-26 Thread Segher Boessenkool
On Mon, Oct 26, 2020 at 01:28:42PM +, Alex Coplan wrote:
> On 26/10/2020 07:12, Segher Boessenkool wrote:
> > On Thu, Oct 15, 2020 at 09:59:24AM +0100, Alex Coplan wrote:
> > Can you instead replace the mult by a shift somewhere earlier in
> > make_extract?  That would make a lot more sense :-)
> 
> I guess we could do this, the only complication being that we can't
> unconditionally rewrite the expression using a shift, since mult is canonical
> inside a mem (which is why we see it in the testcase in the PR).

You can do it just inside the block you are already editing.

> So if we did this, we'd have to remember that we did it earlier on, and 
> rewrite
> it back to a mult accordingly.

Yes, this function has ridiculously complicated control flow.  So I
cannot trick you into improving it? ;-)

> Would you still like to see a version of the patch that does that, or is this
> version OK: 
> https://gcc.gnu.org/pipermail/gcc-patches/2020-October/557050.html ?

I do not like handling both mult and ashift in one case like this, it
complicates things for no good reason.  Write it as two cases, and it
should be good.


Segher
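
For context, the canonicalization both messages refer to looks like
this (illustrative RTL; the pseudo register numbers are made up):

    ;; Inside a MEM, a scaled index is canonically a MULT ...
    (mem:SI (plus:DI (mult:DI (reg:DI 100) (const_int 4))
                     (reg:DI 101)))
    ;; ... while outside a MEM the same scaling is an ASHIFT.
    (ashift:DI (reg:DI 100) (const_int 2))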


Re: [PATCH 2/2] combine: Don't turn (mult (extend x) 2^n) into extract

2020-10-26 Thread Segher Boessenkool
On Mon, Oct 26, 2020 at 01:18:54PM +, Alex Coplan wrote:
> -  else if (GET_CODE (inner) == ASHIFT
> +  else if ((GET_CODE (inner) == ASHIFT || GET_CODE (inner) == MULT)

As I wrote in the other mail, write this as two cases.  Write something
in the comment for the mult one that this is for the canonicalisation of
memory addresses (feel free to use swear words).

> +{
> +  const HOST_WIDE_INT ci = INTVAL (XEXP (inner, 1));
> +  const auto code = GET_CODE (inner);
> +  const HOST_WIDE_INT shift_amt = (code == MULT) ? exact_log2 (ci) : ci;
> +
> +  if (shift_amt > 0 && len > (unsigned HOST_WIDE_INT)shift_amt)

Space after cast; better is to not need a cast at all (and you do not
need one, len is unsigned HOST_WIDE_INT already).


Segher


[PATCH] Handle signed 1-bit ranges in irange::invert.

2020-10-26 Thread Aldy Hernandez via Gcc-patches
The problem here is we are trying to add 1 to a -1 in a signed 1-bit
field and coming up with UNDEFINED because of the overflow.

Signed 1-bits are annoying because you can't really add or subtract
one, because the one is unrepresentable.  For invert() we have a
special subtract_one() function that handles 1-bit signed fields.

This patch implements the analogous add_one() function so that invert
works.
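
A small standalone illustration of the trick, using a C bit-field as a
stand-in for the compiler's 1-bit type (the program is mine, not from
the patch):

#include <stdio.h>

struct s { signed int a : 1; };  /* 1-bit signed: only -1 and 0 exist */

int main (void)
{
  struct s x = { -1 };
  /* In the compiler's wide-int arithmetic, -1 + 1 in one bit overflows
     because +1 is unrepresentable, while -1 - (-1) yields 0 with no
     overflow -- which is exactly what add_one exploits below.  */
  x.a = x.a - (-1);
  printf ("%d\n", x.a);          /* prints 0 */
  return 0;
}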

Pushed.

gcc/ChangeLog:

PR tree-optimization/97555
* range-op.cc (range_tests): Test 1-bit signed invert.
* value-range.cc (subtract_one): Adjust comment.
(add_one): New.
(irange::invert): Call add_one.

gcc/testsuite/ChangeLog:

* gcc.dg/pr97555.c: New test.
---
 gcc/range-op.cc                | 17 +++--
 gcc/testsuite/gcc.dg/pr97555.c | 22 ++
 gcc/value-range.cc             | 23 +--
 3 files changed, 54 insertions(+), 8 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr97555.c

diff --git a/gcc/range-op.cc b/gcc/range-op.cc
index ee62f103598..74ab2e57fde 100644
--- a/gcc/range-op.cc
+++ b/gcc/range-op.cc
@@ -3680,15 +3680,28 @@ range_tests ()
   // Test 1-bit signed integer union.
   // [-1,-1] U [0,0] = VARYING.
   tree one_bit_type = build_nonstandard_integer_type (1, 0);
+  tree one_bit_min = vrp_val_min (one_bit_type);
+  tree one_bit_max = vrp_val_max (one_bit_type);
   {
-tree one_bit_min = vrp_val_min (one_bit_type);
-tree one_bit_max = vrp_val_max (one_bit_type);
 int_range<2> min (one_bit_min, one_bit_min);
 int_range<2> max (one_bit_max, one_bit_max);
 max.union_ (min);
 ASSERT_TRUE (max.varying_p ());
   }
 
+  // Test inversion of 1-bit signed integers.
+  {
+int_range<2> min (one_bit_min, one_bit_min);
+int_range<2> max (one_bit_max, one_bit_max);
+int_range<2> t;
+t = min;
+t.invert ();
+ASSERT_TRUE (t == max);
+t = max;
+t.invert ();
+ASSERT_TRUE (t == min);
+  }
+
   // Test that NOT(255) is [0..254] in 8-bit land.
   int_range<1> not_255 (UCHAR (255), UCHAR (255), VR_ANTI_RANGE);
   ASSERT_TRUE (not_255 == int_range<1> (UCHAR (0), UCHAR (254)));
diff --git a/gcc/testsuite/gcc.dg/pr97555.c b/gcc/testsuite/gcc.dg/pr97555.c
new file mode 100644
index 000..625bc6fa14b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr97555.c
@@ -0,0 +1,22 @@
+// { dg-do run }
+// { dg-options "-Os" }
+
+struct {
+  int a:1;
+} b;
+
+int c, d, e, f = 1, g;
+
+int main ()
+{
+  for (; d < 3; d++) {
+char h = 1 % f, i = ~(0 || ~0);
+c = h;
+f = ~b.a;
+~b.a | 1 ^ ~i && g;
+if (~e)
+  i = b.a;
+b.a = i;
+  }
+  return 0;
+}
diff --git a/gcc/value-range.cc b/gcc/value-range.cc
index 7847104050c..0e633c1c673 100644
--- a/gcc/value-range.cc
+++ b/gcc/value-range.cc
@@ -1772,19 +1772,30 @@ irange::irange_intersect (const irange &r)
 verify_range ();
 }
 
+// Signed 1-bits are strange.  You can't subtract 1, because you can't
+// represent the number 1.  This works around that for the invert routine.
+
 static wide_int inline
 subtract_one (const wide_int &x, tree type, wi::overflow_type &overflow)
 {
-  // A signed 1-bit bit-field, has a range of [-1,0] so subtracting +1
-  // overflows, since +1 is unrepresentable.  This is why we have an
-  // addition of -1 here.
   if (TYPE_SIGN (type) == SIGNED)
-return wi::add (x, -1 , SIGNED, &overflow);
+return wi::add (x, -1, SIGNED, &overflow);
   else
 return wi::sub (x, 1, UNSIGNED, &overflow);
 }
 
-/* Return the inverse of a range.  */
+// The analogous function for adding 1.
+
+static wide_int inline
+add_one (const wide_int &x, tree type, wi::overflow_type &overflow)
+{
+  if (TYPE_SIGN (type) == SIGNED)
+return wi::sub (x, -1, SIGNED, &overflow);
+  else
+return wi::add (x, 1, UNSIGNED, &overflow);
+}
+
+// Return the inverse of a range.
 
 void
 irange::invert ()
@@ -1881,7 +1892,7 @@ irange::invert ()
   // set the overflow bit.
   if (type_max != wi::to_wide (orig_range.m_base[i]))
 {
-  tmp = wi::add (wi::to_wide (orig_range.m_base[i]), 1, sign, &ovf);
+  tmp = add_one (wi::to_wide (orig_range.m_base[i]), ttype, ovf);
   m_base[nitems++] = wide_int_to_tree (ttype, tmp);
   m_base[nitems++] = wide_int_to_tree (ttype, type_max);
   if (ovf)
-- 
2.26.2



Re: [RS6000] VSX_MM_SUFFIX

2020-10-26 Thread Segher Boessenkool
On Sun, Oct 25, 2020 at 05:16:10AM -0500, Segher Boessenkool wrote:
> On Sun, Oct 25, 2020 at 11:55:39AM +1030, Alan Modra wrote:
> > > If you use a macro that doesn't exist, the compiler simply does not
> > > build!
> > 
> > My empirical evidence to the contrary says your theoretical arguments
> > are invalid.  :-)
> > 
> > $ gcc/xgcc -Bgcc/ -S 
> > ~/src/gcc/gcc/testsuite/gcc.target/powerpc/vsx_mask-count-runnable.c -O2 
> > -mcpu=power10
> > $ grep VSX_MM vsx_mask-count-runnable.s
> > vcntmb 9,0,1
> > vcntmb 9,0,1
> > vcntmb 9,0,1
> > vcntmb 9,0,1
> 
> Oh, wow.  How unexpected (to me, anyway).  I'll open a PR.

This is PR97583 now.


Segher


Re: Make default duplicate and insert methods of summaries abort; fix fallout

2020-10-26 Thread Martin Liška

On 10/25/20 2:22 PM, Jan Hubicka wrote:

Hi,
the default duplicate and insert methods of summaries produce an empty
summary that is not useful for anything and makes it easy to introduce
bugs.

This patch makes the default hooks abort, and summaries that do not
need duplication/insertion disable the corresponding hooks.  I also
implemented the missing insertion hook for ipa-sra, which forced me to
move the analysis out of the anonymous namespace.

We already have disable_insertion_hook; I also added
disable_duplication_hook.  Martin (Liska), it would be nice to simply
unregister the hooks instead of having a bool controlling them, so we
save some indirect calls.


Hi.

Good idea.  I've done some refactoring, and the following patch directly
enables and disables the callgraph hooks.

Patch can bootstrap on x86_64-linux-gnu and survives regression tests.

Ready to be installed?
Thanks,
Martin



Bootstrapped/regtested x86_64-linux, plan to commit it tomorrow if there
are no comments.

Honza

2020-10-23  Jan Hubicka  

* cgraph.h (struct cgraph_node): Make ipa_transforms_to_apply vl_ptr.
* ipa-inline-analysis.c (initialize_growth_caches): Disable insertion
and duplication hooks.
* ipa-inline-transform.c (clone_inlined_nodes): Clear
ipa_transforms_to_apply.
(save_inline_function_body): Disable insertion hoook for
ipa_saved_clone_sources.
* ipa-prop.c (ipcp_transformation_initialize): Disable insertion hook.
* ipa-prop.h (ipa_node_params_t): Disable insertion hook.
* ipa-reference.c (propagate): Disable insertion hoook.
* ipa-sra.c (ipa_sra_summarize_function): Move out of anonymous
namespace.
(ipa_sra_function_summaries::insert): New virtual function.
* passes.c (execute_one_pass): Do not add transforms to inline clones.
* symbol-summary.h (function_summary_base): Make insert and duplicate
hooks fail instead of silently producing empty summaries; add way to
disable duplication hooks
(call_summary_base): Likewise.
* tree-nested.c (nested_function_info::get_create): Disable insertion
hooks
(maybe_record_nested_function): Likewise.


diff --git a/gcc/cgraph.h b/gcc/cgraph.h
index 9eb48d5b62f..65e4646efcd 100644
--- a/gcc/cgraph.h
+++ b/gcc/cgraph.h
@@ -1402,7 +1402,7 @@ struct GTY((tag ("SYMTAB_FUNCTION"))) cgraph_node : 
public symtab_node
/* Interprocedural passes scheduled to have their transform functions
   applied next time we execute local pass on them.  We maintain it
   per-function in order to allow IPA passes to introduce new functions.  */
-  vec<ipa_opt_pass> GTY((skip)) ipa_transforms_to_apply;
+  vec<ipa_opt_pass, va_heap, vl_ptr> GTY((skip)) ipa_transforms_to_apply;
  
/* For inline clones this points to the function they will be

   inlined into.  */
diff --git a/gcc/ipa-inline-analysis.c b/gcc/ipa-inline-analysis.c
index acbf82e84d9..bd0e322605f 100644
--- a/gcc/ipa-inline-analysis.c
+++ b/gcc/ipa-inline-analysis.c
@@ -127,6 +127,9 @@ initialize_growth_caches ()
  = new fast_call_summary<edge_growth_cache_entry *, va_heap> (symtab);
node_context_cache
  = new fast_function_summary<node_context_cache_entry *, va_heap> (symtab);
+  edge_growth_cache->disable_duplication_hook ();
+  node_context_cache->disable_insertion_hook ();
+  node_context_cache->disable_duplication_hook ();
  }
  
  /* Free growth caches.  */

diff --git a/gcc/ipa-inline-transform.c b/gcc/ipa-inline-transform.c
index 3782cce12e3..279ba2f7cb0 100644
--- a/gcc/ipa-inline-transform.c
+++ b/gcc/ipa-inline-transform.c
@@ -231,6 +231,11 @@ clone_inlined_nodes (struct cgraph_edge *e, bool duplicate,
  e->callee->remove_from_same_comdat_group ();
  
e->callee->inlined_to = inlining_into;

+  if (e->callee->ipa_transforms_to_apply.length ())
+{
+  e->callee->ipa_transforms_to_apply.release ();
+  e->callee->ipa_transforms_to_apply = vNULL;
+}
  
/* Recursively clone all bodies.  */

for (e = e->callee->callees; e; e = next)
@@ -606,7 +611,10 @@ save_inline_function_body (struct cgraph_node *node)
  
tree prev_body_holder = node->decl;

if (!ipa_saved_clone_sources)
-ipa_saved_clone_sources = new function_summary <tree *> (symtab);
+{
+  ipa_saved_clone_sources = new function_summary <tree *> (symtab);
+  ipa_saved_clone_sources->disable_insertion_hook ();
+}
else
  {
tree *p = ipa_saved_clone_sources->get (node);
diff --git a/gcc/ipa-prop.c b/gcc/ipa-prop.c
index a848f1db95e..6014766b418 100644
--- a/gcc/ipa-prop.c
+++ b/gcc/ipa-prop.c
@@ -4211,7 +4211,10 @@ ipcp_transformation_initialize (void)
if (!ipa_vr_hash_table)
  ipa_vr_hash_table = hash_table<ipa_vr_ggc_hash_traits>::create_ggc (37);
if (ipcp_transformation_sum == NULL)
-ipcp_transformation_sum = ipcp_transformation_t::create_ggc (symtab);
+{
+  ipcp_transformation_sum = ipcp_transformation_t::create_ggc (symtab);
+  ipcp_transformation_sum->disable_insertion_hook ();
+}
  }
  
  /* Release the IPA CP transformation summary.  */

diff --git a/gcc/ipa-p

Re: Make default duplicate and insert methods of summaries abort; fix fallout

2020-10-26 Thread Jan Hubicka
> 
> gcc/ChangeLog:
> 
>   * symbol-summary.h (function_summary_base::unregister_hooks):
>   Call disable_insertion_hook and disable_duplication_hook.
>   (function_summary_base::symtab_insertion): New field.
>   (function_summary_base::symtab_removal): Likewise.
>   (function_summary_base::symtab_duplication): Likewise.
>   Register hooks in function_summary_base and directly register
>   (or unregister) hooks.

OK, thanks a lot!

Honza
> ---
>  gcc/symbol-summary.h | 127 ++-
>  1 file changed, 65 insertions(+), 62 deletions(-)
> 
> diff --git a/gcc/symbol-summary.h b/gcc/symbol-summary.h
> index af5f4e6da62..97106c7c25b 100644
> --- a/gcc/symbol-summary.h
> +++ b/gcc/symbol-summary.h
> @@ -28,12 +28,22 @@ class function_summary_base
>  {
>  public:
>/* Default construction takes SYMTAB as an argument.  */
> -  function_summary_base (symbol_table *symtab CXX_MEM_STAT_INFO):
> -  m_symtab (symtab),
> -  m_insertion_enabled (true),
> -  m_duplication_enabled (true),
> +  function_summary_base (symbol_table *symtab,
> +  cgraph_node_hook symtab_insertion,
> +  cgraph_node_hook symtab_removal,
> +  cgraph_2node_hook symtab_duplication
> +  CXX_MEM_STAT_INFO):
> +  m_symtab (symtab), m_symtab_insertion (symtab_insertion),
> +  m_symtab_removal (symtab_removal),
> +  m_symtab_duplication (symtab_duplication),
> +  m_symtab_insertion_hook (NULL), m_symtab_duplication_hook (NULL),
>m_allocator ("function summary" PASS_MEM_STAT)
> -  {}
> +  {
> +enable_insertion_hook ();
> +m_symtab_removal_hook
> +  = m_symtab->add_cgraph_removal_hook (m_symtab_removal, this);
> +enable_duplication_hook ();
> +  }
>  
>/* Basic implementation of insert operation.  */
>virtual void insert (cgraph_node *, T *)
> @@ -56,25 +66,37 @@ public:
>/* Enable insertion hook invocation.  */
>void enable_insertion_hook ()
>{
> -m_insertion_enabled = true;
> +if (m_symtab_insertion_hook == NULL)
> +  m_symtab_insertion_hook
> + = m_symtab->add_cgraph_insertion_hook (m_symtab_insertion, this);
>}
>  
>/* Disable insertion hook invocation.  */
>void disable_insertion_hook ()
>{
> -m_insertion_enabled = false;
> +if (m_symtab_insertion_hook != NULL)
> +  {
> + m_symtab->remove_cgraph_insertion_hook (m_symtab_insertion_hook);
> + m_symtab_insertion_hook = NULL;
> +  }
>}
>  
>/* Enable duplication hook invocation.  */
>void enable_duplication_hook ()
>{
> -m_duplication_enabled = true;
> +if (m_symtab_duplication_hook == NULL)
> +  m_symtab_duplication_hook
> + = m_symtab->add_cgraph_duplication_hook (m_symtab_duplication, this);
>}
>  
>/* Disable duplication hook invocation.  */
>void disable_duplication_hook ()
>{
> -m_duplication_enabled = false;
> +if (m_symtab_duplication_hook != NULL)
> +  {
> + m_symtab->remove_cgraph_duplication_hook (m_symtab_duplication_hook);
> + m_symtab_duplication_hook = NULL;
> +  }
>}
>  
>  protected:
> @@ -99,19 +121,22 @@ protected:
>/* Unregister all call-graph hooks.  */
>void unregister_hooks ();
>  
> +  /* Symbol table the summary is registered to.  */
> +  symbol_table *m_symtab;
> +
> +  /* Insertion function defined by a summary.  */
> +  cgraph_node_hook m_symtab_insertion;
> +  /* Removal function defined by a summary.  */
> +  cgraph_node_hook m_symtab_removal;
> +  /* Duplication function defined by a summary.  */
> +  cgraph_2node_hook m_symtab_duplication;
> +
>/* Internal summary insertion hook pointer.  */
>cgraph_node_hook_list *m_symtab_insertion_hook;
>/* Internal summary removal hook pointer.  */
>cgraph_node_hook_list *m_symtab_removal_hook;
>/* Internal summary duplication hook pointer.  */
>cgraph_2node_hook_list *m_symtab_duplication_hook;
> -  /* Symbol table the summary is registered to.  */
> -  symbol_table *m_symtab;
> -
> -  /* Indicates if insertion hook is enabled.  */
> -  bool m_insertion_enabled;
> -  /* Indicates if duplication hook is enabled.  */
> -  bool m_duplication_enabled;
>  
>  private:
>/* Return true when the summary uses GGC memory for allocation.  */
> @@ -125,9 +150,9 @@ template <typename T>
>  void
>  function_summary_base<T>::unregister_hooks ()
>  {
> -  m_symtab->remove_cgraph_insertion_hook (m_symtab_insertion_hook);
> +  disable_insertion_hook ();
>m_symtab->remove_cgraph_removal_hook (m_symtab_removal_hook);
> -  m_symtab->remove_cgraph_duplication_hook (m_symtab_duplication_hook);
> +  disable_duplication_hook ();
>  }
>  
>  /* We want to pass just pointer types as argument for function_summary
> @@ -242,19 +267,11 @@ private:
>  template <typename T>
>  function_summary<T>::function_summary (symbol_table *symtab, bool ggc
>MEM_STAT_DECL):
> -  function_summary_base (symtab PASS_MEM_ST

Re: Implement three-level optimize_for_size predicates

2020-10-26 Thread Jan Hubicka
> On Mon, Oct 26, 2020 at 10:14 AM Jan Hubicka  wrote:
> >
> > > >
> > > > For example you had patch that limited "rep cmpsb" expansion for
> > > > -minline-all-stringops.  Now the conditions could be
> > > > -minline-all-stringops || optimize_insn_for_size () == OPTIMIZE_SIZE_MAX
> > > > since it is still useful size optimization.
> > > >
> > > > I am not sure if you had other changes of this nature? (It is a bit hard
> > > > to grep compiler for things like this and I would like to get these
> > > > organized to optimize_size levels now).
> > >
> > > Shouldn't it apply to all functions inlined by -minline-all-stringops?
> >
> > I think we handle the other cases, for code optimized for size we go for
> > ix86_size_memcpy and ix86_size_memset tables that say inline all with
> > rep movsb.  We do not inline strlen since the way it is implemented gets
> > too long (short inline version would be welcome).
> >
> > I will look through backend, but if you are aware of more checks like
> > one in ix86_expand_cmpstrn_or_cmpmem which disable size optimization
> > even at -Os, let me know.  They are not that easy to find...
> >
> 
> [hjl@gnu-cfl-2 gcc]$ cat /tmp/x.c
> int
> func (char *d, unsigned int l)
> {
>   return __builtin_strncmp (d, "foo", l) ? 1 : 2;
> }
> [hjl@gnu-cfl-2 gcc]$ gcc -c -Os  /tmp/x.c
> [hjl@gnu-cfl-2 gcc]$ nm x.o
>  T func
>  U strncmp
> [hjl@gnu-cfl-2 gcc]$ size x.o
>    text    data    bss    dec    hex filename
> 138   0   0 138  8a x.o
> [hjl@gnu-cfl-2 gcc]$ gcc -c -O2  /tmp/x.c
> [hjl@gnu-cfl-2 gcc]$ size x.o
>    text    data    bss    dec    hex filename
> 146   0   0 146  92 x.o
> [hjl@gnu-cfl-2 gcc]$ nm x.o
>  T func
> [hjl@gnu-cfl-2 gcc]$
> 
> -Os shouldn't inline strncmp.
Interesting, I would expect cmpsb to still win.  Well, this makes it
easier.  Sorry for delaying the patches - somehow I got them connected
with the -Os refactoring.

Honza
> 
> -- 
> H.J.


[PATCH] lto: no sub-make when --jobserver-auth= is missing

2020-10-26 Thread Martin Liška

We now correctly detect that a jobserver is not active for
an LTO link:

lto-wrapper: warning: jobserver is not available: '--jobserver-auth=' is not 
present in 'MAKEFLAGS'

In that situation we should not call make -f abc.mk, as it can lead
to N^2 LTRANS units.
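
For reference, the detection looks at MAKEFLAGS; with an active
jobserver, GNU make 4.2+ exports something like (the concrete numbers
below are illustrative file descriptors):

  MAKEFLAGS="-j8 --jobserver-auth=3,4"

When the token is missing, lto-wrapper now runs the LTRANS phase
serially instead of spawning a sub-make.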

Ready for master?
Thanks,
Martin

gcc/ChangeLog:

* lto-wrapper.c (run_gcc): Do not use sub-make when jobserver is
not detected properly.
---
 gcc/lto-wrapper.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/gcc/lto-wrapper.c b/gcc/lto-wrapper.c
index b2af3caa021..fe10f4f4fbb 100644
--- a/gcc/lto-wrapper.c
+++ b/gcc/lto-wrapper.c
@@ -1582,7 +1582,11 @@ run_gcc (unsigned argc, char *argv[])
 {
   const char *jobserver_error = jobserver_active_p ();
   if (jobserver && jobserver_error != NULL)
-   warning (0, jobserver_error);
+   {
+ warning (0, jobserver_error);
+ parallel = 0;
+ jobserver = 0;
+   }
   else if (!jobserver && jobserver_error == NULL)
{
  parallel = 1;
--
2.29.0



Re: [RS6000] Unsupported test options for -m32

2020-10-26 Thread Segher Boessenkool
On Sun, Oct 25, 2020 at 09:51:29PM +1030, Alan Modra wrote:
> FAIL: gcc.target/powerpc/swaps-p8-22.c (test for excess errors)
> Excess errors:
> cc1: error: '-mcmodel' not supported in this configuration

This is because your build is not biarch.  We really should not allow
such configurations, they are extra work for us, for no added value.

> diff --git a/gcc/testsuite/gcc.target/powerpc/swaps-p8-22.c 
> b/gcc/testsuite/gcc.target/powerpc/swaps-p8-22.c
> index 83f6ab3a1c0..bceada41b75 100644
> --- a/gcc/testsuite/gcc.target/powerpc/swaps-p8-22.c
> +++ b/gcc/testsuite/gcc.target/powerpc/swaps-p8-22.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-require-effective-target powerpc_p8vector_ok } */
> +/* { dg-require-effective-target { lp64 && powerpc_p8vector_ok } } */
>  /* { dg-options "-O2 -mdejagnu-cpu=power8 -maltivec -mcmodel=large" } */
>  
>  /* The expansion for vector character multiply introduces a vperm operation.

Add a comment why we need lp64?  (Easiest is if you use two separate
dg-require phrases.)  With that, okay for trunk.  Thanks!
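
I.e., something like this, with the comment wording left to the author
(a sketch, not the committed hunk):

/* -mcmodel is only supported for 64-bit code.  */
/* { dg-require-effective-target lp64 } */
/* { dg-require-effective-target powerpc_p8vector_ok } */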


Segher


Re: [RS6000] Remove -mpcrel from tests

2020-10-26 Thread Segher Boessenkool
On Sun, Oct 25, 2020 at 09:52:40PM +1030, Alan Modra wrote:
> When running with -m32
> FAIL: gcc.target/powerpc/pr94740.c (test for excess errors)
> Excess errors:
> cc1: error: '-mpcrel' requires '-mcmodel=medium'
> 
> The others don't run for -m32, but remove the unnecessary -mpcrel
> anyway.
> 
>   * gcc.target/powerpc/localentry-1.c: Remove -mpcrel from options.
>   * gcc.target/powerpc/notoc-direct-1.c: Likewise.
>   * gcc.target/powerpc/pr94740.c: Likewise.

Okay for trunk.  Thanks!


Segher


Re: [PATCH][middle-end][i386][Version 4] Add -fzero-call-used-regs=[skip|used-gpr-arg|used-arg|all-arg|used-gpr|all-gpr|used|all]

2020-10-26 Thread Uros Bizjak via Gcc-patches
On Mon, Oct 26, 2020 at 6:30 PM Qing Zhao  wrote:
>
>
> The following is the current change in i386.c, could you check whether the 
> logic is good?

x87 handling looks good to me.

One remaining question: If the function uses MMX regs (either
internally or as an argument register), but exits in x87 mode, does
your logic clear the x87 stack?

(The ABI in the above case requires EMMS before exit, but the values
from MMX regs still remain as their aliases in the x87 stack.)
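
A minimal sketch of that aliasing (AT&T syntax; illustrative only):

        movq    %mm0, %mm1      # any MMX write marks the aliased x87
                                # stack slots valid and resets the top
        emms                    # required before an x87-mode return:
                                # re-tags all eight slots as empty, but
                                # the bit patterns stay in the registers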

Uros.

> thanks.
>
> Qing
>
> /* Check whether the register REGNO should be zeroed on X86.
>When ALL_SSE_ZEROED is true, all SSE registers have been zeroed
>together, no need to zero it again.
>When EXIT_WITH_MMX_MODE is true, MMX registers should be cleared.  */
>
> static bool
> zero_call_used_regno_p (const unsigned int regno,
> bool all_sse_zeroed,
> bool exit_with_mmx_mode)
> {
>   return GENERAL_REGNO_P (regno)
>  || (!all_sse_zeroed && SSE_REGNO_P (regno))
>  || MASK_REGNO_P (regno)
>  || (exit_with_mmx_mode && MMX_REGNO_P (regno));
> }
>
> /* Return the machine_mode that is used to zero register REGNO.  */
>
> static machine_mode
> zero_call_used_regno_mode (const unsigned int regno)
> {
>   /* NB: We only need to zero the lower 32 bits for integer registers
>  and the lower 128 bits for vector registers since destinations are
>  zero-extended to the full register width.  */
>   if (GENERAL_REGNO_P (regno))
> return SImode;
>   else if (SSE_REGNO_P (regno))
> return V4SFmode;
>   else if (MASK_REGNO_P (regno))
> return HImode;
>   else if (MMX_REGNO_P (regno))
> return V4HImode;
>   else
> gcc_unreachable ();
> }
>
> /* Generate a rtx to zero all vector registers together if possible,
>otherwise, return NULL.  */
>
> static rtx
> zero_all_vector_registers (HARD_REG_SET need_zeroed_hardregs)
> {
>   if (!TARGET_AVX)
> return NULL;
>
>   for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
> if ((IN_RANGE (regno, FIRST_SSE_REG, LAST_SSE_REG)
>  || (TARGET_64BIT
>  && (REX_SSE_REGNO_P (regno)
>  || (TARGET_AVX512F && EXT_REX_SSE_REGNO_P (regno)))))
> && !TEST_HARD_REG_BIT (need_zeroed_hardregs, regno))
>   return NULL;
>
>   return gen_avx_vzeroall ();
> }
>
>
> /* Generate insns to zero all st registers together.
>Return true when zeroing instructions are generated.
>Assume the number of st registers that are zeroed is num_of_st,
>we will emit the following sequence to zero them together:
>   fldz; \
>   fldz; \
>   ...
>   fldz; \
>   fstp %%st(0); \
>   fstp %%st(0); \
>   ...
>   fstp %%st(0);
>i.e., num_of_st fldz followed by num_of_st fstp to clear the stack
>and mark the stack slots empty.
>
>How to compute the num_of_st?
>There is no direct mapping from stack registers to hard register
>numbers.  If one stack register need to be cleared, we don't know
>where in the stack the value remains.  So, if any stack register
>need to be cleared, the whole stack should be cleared.  However,
>x87 stack registers that hold the return value should be excluded.
>x87 returns in the top (two for complex values) register, so
>num_of_st should be 7/6 when x87 returns, otherwise it will be 8.  */
>
>
> static bool
> zero_all_st_registers (HARD_REG_SET need_zeroed_hardregs)
> {
>   unsigned int num_of_st = 0;
>   for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
> if (STACK_REGNO_P (regno)
> && TEST_HARD_REG_BIT (need_zeroed_hardregs, regno))
>   {
> num_of_st++;
> break;
>   }
>
>   if (num_of_st == 0)
> return false;
>
>   bool return_with_x87 = false;
>   return_with_x87 = ((GET_CODE (crtl->return_rtx) == REG)
>   && (STACK_REG_P (crtl->return_rtx)));
>
>   bool complex_return = false;
>   complex_return = (COMPLEX_MODE_P (GET_MODE (crtl->return_rtx)));
>
>   if (return_with_x87)
> if (complex_return)
>   num_of_st = 6;
> else
>   num_of_st = 7;
>   else
> num_of_st = 8;
>
>   rtx st_reg = gen_rtx_REG (XFmode, FIRST_STACK_REG);
>
>   for (unsigned int i = 0; i < num_of_st; i++)
> emit_insn (gen_rtx_SET (st_reg, CONST0_RTX (XFmode)));
>
>   for (unsigned int i = 0; i < num_of_st; i++)
> {
>   rtx insn;
>   insn = emit_insn (gen_rtx_SET (st_reg, st_reg));
>   add_reg_note (insn, REG_DEAD, st_reg);
> }
>   return true;
> }
>
> /* TARGET_ZERO_CALL_USED_REGS.  */
> /* Generate a sequence of instructions that zero registers specified by
>NEED_ZEROED_HARDREGS.  Return the ZEROED_HARDREGS that are actually
>zeroed.  */
> static HARD_REG_SET
> ix86_zero_call_used_regs (HARD_REG_SET need_zeroed_hardregs)
> {
>   HARD_REG_SET zeroed_hardregs;
>   bool al

Re: [RS6000] biarch test fail

2020-10-26 Thread Segher Boessenkool
On Sun, Oct 25, 2020 at 09:55:32PM +1030, Alan Modra wrote:
> I thought this one was worth at least commenting as to why it fails
> when biarch testing.  OK?
> 
>   * gcc.target/powerpc/bswap64-4.c: Comment.
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/bswap64-4.c 
> b/gcc/testsuite/gcc.target/powerpc/bswap64-4.c
> index a3c05539652..11787000409 100644
> --- a/gcc/testsuite/gcc.target/powerpc/bswap64-4.c
> +++ b/gcc/testsuite/gcc.target/powerpc/bswap64-4.c
> @@ -7,6 +7,12 @@
 /* { dg-final { scan-assembler-times "ldbrx" 1 { target has_arch_pwr7 } } } */
 /* { dg-final { scan-assembler-times "stdbrx" 1 { target has_arch_pwr7 } } } */
>  
> +/* This test will fail when biarch testing with
> +   "RUNTESTFLAGS=--target_board=unix'{-m64,-m32}'" because the -m32 is
> +   added on the command line after the dg-options -mpowerpc64, and

That depends on your dejagnu version.  (That is also why we have
-mdejagnu-cpu=, same reason).

> +   common/config/rs6000/rs6000-common.c:rs6000_handle_option disables
> +   -mpowerpc64 for -m32.  */

Okay for trunk if you add some words about the dejagnu version.  Thanks!


Segher


Re: [PATCH][middle-end][i386][Version 4] Add -fzero-call-used-regs=[skip|used-gpr-arg|used-arg|all-arg|used-gpr|all-gpr|used|all]

2020-10-26 Thread Richard Sandiford via Gcc-patches
Qing Zhao  writes:
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index c9f7299..3a884e1 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -3992,6 +3992,49 @@ performing a link with relocatable output (i.e.@: 
> @code{ld -r}) on them.
>  A declaration to which @code{weakref} is attached and that is associated
>  with a named @code{target} must be @code{static}.
>  
> +@item zero_call_used_regs ("@var{choice}")
> +@cindex @code{zero_call_used_regs} function attribute
> +
> +The @code{zero_call_used_regs} attribute causes the compiler to zero
> +a subset of all call-used registers at function return according to
> +@var{choice}.
> +This is used to increase the program security by either mitigating
> +Return-Oriented Programming (ROP) or preventing information leak
> +through registers.
> +
> +A "call-used" register is a register that is clobbered by function calls,
> +as a result, the caller has to save and restore it before or after a
> +function call.  It is also called as "call-clobbered", "caller-saved", or
> +"volatile".

texinfo quoting is to use ``…'' rather than "…".  So maybe:

---
A ``call-used'' register is a register whose contents can be changed by
a function call; therefore, a caller cannot assume that the register has
the same contents on return from the function as it had before calling
the function.  Such registers are also called ``call-clobbered'',
``caller-saved'', or ``volatile''.
---

> +In order to satisfy users with different security needs and control the
> +run-time overhead at the same time,  GCC provides a flexible way to choose

nit: should only be one space after the comma

> +the subset of the call-used registers to be zeroed.

Maybe add “The three basic values of @var{choice} are:”

> +
> +@samp{skip} doesn't zero any call-used registers.
> +@samp{used} zeros call-used registers which are used in the function.  A 
> "used"

Maybe s/zeros/only zeros/?

s/which/that/

> +register is one whose content has been set or referenced in the function.
> +@samp{all} zeros all call-used registers.

I think this would be better formatted using a @table.

> +In addition to the above three basic choices, the register set can be further
> +limited by adding "-gpr" (i.e., general purpose register), "-arg" (i.e.,
> +argument register), or both as following:

How about:

---
In addition to these three basic choices, it is possible to modify
@samp{used} or @samp{all} as follows:

@itemize @bullet
@item
Adding @samp{-gpr} restricts the zeroing to general-purpose registers.

@item
Adding @samp{-arg} restricts the zeroing to registers that are used
to pass parameters.  When applied to @samp{all}, this includes all
parameter registers defined by the platform's calling convention,
regardless of whether the function uses those parameter registers.
@end @itemize

The modifiers can be used individually or together.  If they are used
together, they must appear in the order above.

The full list of @var{choice}s is therefore:
---

with the list repeating @var{skip}, @var{used} and @var{all}.

(untested)

> +@samp{used-gpr-arg} zeros used call-used general purpose registers that
> +pass parameters.
> +@samp{used-arg} zeros used call-used registers that pass parameters.
> +@samp{all-gpr-arg} zeros all call-used general purpose registers that pass
> +parameters.
> +@samp{all-arg} zeros all call-used registers that pass parameters.
> +@samp{used-gpr} zeros call-used general purpose registers which are used in 
> the
> +function.
> +@samp{all-gpr} zeros all call-used general purpose registers.

I think this too should be a @table.

> +
> +Among this list, "used-gpr-arg", "used-arg", "all-gpr-arg", and "all-arg" are
> +mainly used for ROP mitigation.

Should be quoted using @samp rather than ".

> +@item -fzero-call-used-regs=@var{choice}
> +@opindex fzero-call-used-regs
> +Zero call-used registers at function return to increase the program
> +security by either mitigating Return-Oriented Programming (ROP) or
> +preventing information leak through registers.

After this, we should probably say something like:

---
The possible values of @var{choice} are the same as for the
@samp{zero_call_used_regs} attribute (@pxref{…}).  The default
is @samp{skip}.
---

(with the xref filled in)
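
For readers following along, an example of the attribute and option
being documented (the choice string is one of those listed in the
hunk above):

int __attribute__ ((zero_call_used_regs ("used-gpr")))
foo (int x)
{
  return x + 1;
}

/* or, for a whole translation unit: gcc -fzero-call-used-regs=used-gpr */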

> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 97437e8..3b75c46 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -12053,6 +12053,18 @@ argument list due to stack realignment.  Return 
> @code{NULL} if no DRAP
>  is needed.
>  @end deftypefn
>  
> +@deftypefn {Target Hook} HARD_REG_SET TARGET_ZERO_CALL_USED_R

Re: [RFC] Add support for the "retain" attribute utilizing SHF_GNU_RETAIN

2020-10-26 Thread Pedro Alves via Gcc-patches
On 10/6/20 12:10 PM, Jozef Lawrynowicz wrote:

> Should "used" apply SHF_GNU_RETAIN?
> ===
> Another talking point is whether the existing "used" attribute should
> apply the SHF_GNU_RETAIN flag to the containing section.
> 
> It seems unlikely that a user applies the "used" attribute to a
> declaration, and means for it to be saved from only compiler
> optimization, but *not* linker optimization. So perhaps it would be
> beneficial for "used" to apply SHF_GNU_RETAIN in some way.
> 
> If "used" did apply SHF_GNU_RETAIN, we would also have to
> consider the above options for how to apply SHF_GNU_RETAIN to the
> section. Since the "used" attribute has been around for a while 
> it might not be appropriate for its behavior to be changed to place the
> associated declaration in its own, unique section, as in option (2).
> 

To me, if I use attribute((used)), and the linker still garbage
collects the symbol, then the toolchain has a bug.  Is there any
use case that would suggest otherwise?

Thanks,
Pedro Alves
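
For concreteness, the combination under discussion would look
something like this once a separate attribute exists (hypothetical
usage; the RFC is precisely about whether plain "used" should already
imply the linker half):

/* Survives compiler GC via "used" and linker --gc-sections via the
   proposed "retain", which sets SHF_GNU_RETAIN on the section.  */
__attribute__ ((used, retain))
static const char marker[] = "keep-me";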



Re: [PATCH][middle-end][i386][Version 4] Add -fzero-call-used-regs=[skip|used-gpr-arg|used-arg|all-arg|used-gpr|all-gpr|used|all]

2020-10-26 Thread Qing Zhao via Gcc-patches



> On Oct 26, 2020, at 1:42 PM, Uros Bizjak  wrote:
> 
> On Mon, Oct 26, 2020 at 6:30 PM Qing Zhao  wrote:
>> 
>> 
>> The following is the current change in i386.c, could you check whether the 
>> logic is good?
> 
> x87 handling looks good to me.
> 
> One remaining question: If the function uses MMX regs (either
> internally or as an argument register), but exits in x87 mode, does
> your logic clear the x87 stack?

Yes, but not completely.

FIRST, As following:

  /* Then, decide which mode (MMX mode or x87 mode) the function exits with,
     in order to decide whether we need to clear the MMX registers or the
     stack registers.  */
  bool exit_with_mmx_mode = false;

  exit_with_mmx_mode = ((GET_CODE (crtl->return_rtx) == REG) 
&& (MMX_REG_P (crtl->return_rtx)));

  /* Then, let's see whether we can zero all st registers together.  */
  if (!exit_with_mmx_mode)
st_zeroed = zero_all_st_registers (need_zeroed_hardregs);


We first check whether this routine exits with MMX mode; if not, then it’s x87
mode
(at exit, “EMMS” should already have been called per the ABI), and then
the st/mm registers will be cleared as x87 stack registers.

However, within the routine “zero_all_st_registers”:

static bool
zero_all_st_registers (HARD_REG_SET need_zeroed_hardregs)
{
  unsigned int num_of_st = 0;
  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
if (STACK_REGNO_P (regno)
&& TEST_HARD_REG_BIT (need_zeroed_hardregs, regno))
  {
num_of_st++;
break;
  }

  if (num_of_st == 0)
return false;


In the above, I currently only check whether any “stack” registers need to be
zeroed.
But it looks like we should also check whether any “MMX” register needs to be
zeroed.  If there is any
“MMX” register that needs to be zeroed, do we still need to clear the whole x87 stack?


BTW, is it convenient for you to provide me 3 small test cases for the
following situations:

1. Return with an MMX register;
2. Return with an x87 stack register;
3. Return with 2 x87 stack registers (i.e. a complex value).

Then it will be much easier for me to verify that my implementation is good at
my side.

Thanks a lot for your help.

Qing

> 
> (The ABI in the above case requires EMMS before exit, but the values
> from MMX regs still remain as their aliases in the x87 stack.)
> 
> Uros.
> 
>> thanks.
>> 
>> Qing
>> 
>> /* Check whether the register REGNO should be zeroed on X86.
>>   When ALL_SSE_ZEROED is true, all SSE registers have been zeroed
>>   together, no need to zero it again.
>>   When EXIT_WITH_MMX_MODE is true, MMX registers should be cleared.  */
>> 
>> static bool
>> zero_call_used_regno_p (const unsigned int regno,
>>bool all_sse_zeroed,
>>bool exit_with_mmx_mode)
>> {
>>  return GENERAL_REGNO_P (regno)
>> || (!all_sse_zeroed && SSE_REGNO_P (regno))
>> || MASK_REGNO_P (regno)
>> || (exit_with_mmx_mode && MMX_REGNO_P (regno));
>> }
>> 
>> /* Return the machine_mode that is used to zero register REGNO.  */
>> 
>> static machine_mode
>> zero_call_used_regno_mode (const unsigned int regno)
>> {
>>  /* NB: We only need to zero the lower 32 bits for integer registers
>> and the lower 128 bits for vector registers since destinations are
>> zero-extended to the full register width.  */
>>  if (GENERAL_REGNO_P (regno))
>>return SImode;
>>  else if (SSE_REGNO_P (regno))
>>return V4SFmode;
>>  else if (MASK_REGNO_P (regno))
>>return HImode;
>>  else if (MMX_REGNO_P (regno))
>>return V4HImode;
>>  else
>>gcc_unreachable ();
>> }
>> 
>> /* Generate a rtx to zero all vector registers together if possible,
>>   otherwise, return NULL.  */
>> 
>> static rtx
>> zero_all_vector_registers (HARD_REG_SET need_zeroed_hardregs)
>> {
>>  if (!TARGET_AVX)
>>return NULL;
>> 
>>  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
>>    if ((IN_RANGE (regno, FIRST_SSE_REG, LAST_SSE_REG)
>> || (TARGET_64BIT
>> && (REX_SSE_REGNO_P (regno)
>> || (TARGET_AVX512F && EXT_REX_SSE_REGNO_P (regno)))))
>>&& !TEST_HARD_REG_BIT (need_zeroed_hardregs, regno))
>>  return NULL;
>> 
>>  return gen_avx_vzeroall ();
>> }
>> 
>> 
>> /* Generate insns to zero all st registers together.
>>   Return true when zeroing instructions are generated.
>>   Assume the number of st registers that are zeroed is num_of_st,
>>   we will emit the following sequence to zero them together:
>>  fldz; \
>>  fldz; \
>>  ...
>>  fldz; \
>>  fstp %%st(0); \
>>  fstp %%st(0); \
>>  ...
>>  fstp %%st(0);
>>   i.e., num_of_st fldz followed by num_of_st fstp to clear the stack
>>   and mark the stack slots empty.
>> 
>>   How to compute the num_of_st?
>>   There is no direct mapping from stack registers to hard register
>>  

Re: [RFC] Add support for the "retain" attribute utilizing SHF_GNU_RETAIN

2020-10-26 Thread Pedro Alves via Gcc-patches
On 10/6/20 12:10 PM, Jozef Lawrynowicz wrote:
> The changes would also only affect targets
> that support the GNU ELF OSABI, which would lead to inconsistent
> behavior between non-GNU OS's.

Well, a separate __attribute__((retain)) will necessarily only work
on GNU ELF targets, so that just shifts the "inconsistent" behavior
elsewhere.

Pedro Alves



Re: [PATCH] rs6000, Power 10 testsuite fixes

2020-10-26 Thread Segher Boessenkool
Hi!

On Fri, Oct 23, 2020 at 02:43:40PM -0700, Carl Love wrote:
> The following patch fixes a few issues with the tests.  The DEBUG is
> defined in each of the files thus the #ifdef DEBUG should just be #if
> DEBUG.  The other issue is a some of the line lengths for the error
> prints exceed 80 characters.  The patch fixes the prints.

Testcases can use whatever formatting they want (but readability is good
of course).

> --- a/gcc/testsuite/gcc.target/powerpc/vec-replace-word-runnable.c
> +++ b/gcc/testsuite/gcc.target/powerpc/vec-replace-word-runnable.c

> -printf("ERROR, vec_replace_unaligned (src_vb_double, src_va_double, 
> index)\
> +printf("ERROR, vec_replace_unaligned (src_vb_double, src_va_double, "
> +"index)  \
>  n");

This is wrong (was wrong already :-) -- it should be  "index)\n");

Okay for trunk with that fixed as well.  Thanks!


Segher


Re: [PATCH] Re: error: ‘EVRP_MODE_DEBUG’ was not declared – was: [PUSHED] Ranger classes.

2020-10-26 Thread Maciej W. Rozycki
On Mon, 26 Oct 2020, Andrew MacLeod wrote:

> >   It is still broken at `-O0', does not build with `--enable-werror-always'
> > (which IMO should be on by default except for releases, just as we do with
> > binutils AFAIK, so as to make sure people do not introduce build problems
> > too easily):
> >
> > .../gcc/gimple-range.cc: In function 'bool
> > range_of_builtin_call(range_query&, irange&, gcall*)':
> > .../gcc/gimple-range.cc:677:15: error: 'zerov' may be used uninitialized
> > [-Werror=maybe-uninitialized]
> >677 |   if (zerov == prec)
> >|   ^~
> > cc1plus: all warnings being treated as errors
> > make[2]: *** [Makefile:1122: gimple-range.o] Error 1
> >
> >Maciej
> >
> I can't reproduce it on x86_64-pc-linux-gnu; I presume this is some other
> target.
> 
> Eyeballing it, it seems that there was a missed initialization  when the
> builtin code was ported that might show up on a target that defines
> CLZ_DEFINED_VALUE_AT_ZERO  to be non-zero but doesnt always set the zerov
> parameter...  Or maybe it some optimization ordering thing.

 Let me see...

$ g++ [...] .../gcc/gimple-range.cc -E -dD | grep CLZ_DEFINED_VALUE_AT_ZERO
#define CLZ_DEFINED_VALUE_AT_ZERO(MODE,VALUE) 0
$

I'm fairly sure this is what the default is.

> Anyway, the following patch has been pushed as an obvious fix to make the code
> match what's in vr-values.

 Thank you!

  Maciej


Re: [PATCH][middle-end][i386][Version 4] Add -fzero-call-used-regs=[skip|used-gpr-arg|used-arg|all-arg|used-gpr|all-gpr|used|all]

2020-10-26 Thread Uros Bizjak via Gcc-patches
On Mon, Oct 26, 2020 at 8:10 PM Qing Zhao  wrote:
>
>
>
> > On Oct 26, 2020, at 1:42 PM, Uros Bizjak  wrote:
> >
> > On Mon, Oct 26, 2020 at 6:30 PM Qing Zhao  wrote:
> >>
> >>
> >> The following is the current change in i386.c, could you check whether the 
> >> logic is good?
> >
> > x87 handling looks good to me.
> >
> > One remaining question: If the function uses MMX regs (either
> > internally or as an argument register), but exits in x87 mode, does
> > your logic clear the x87 stack?
>
> Yes, but not completely.
>
> First, as follows:
>
>   /* Then, decide which mode (MMX mode or x87 mode) the function exits with,
>  in order to decide whether we need to clear the MMX registers or the
>  stack registers.  */
>   bool exit_with_mmx_mode = false;
>
>   exit_with_mmx_mode = ((GET_CODE (crtl->return_rtx) == REG)
> && (MMX_REG_P (crtl->return_rtx)));
>
>   /* Then, let's see whether we can zero all st registers together.  */
>   if (!exit_with_mmx_mode)
> st_zeroed = zero_all_st_registers (need_zeroed_hardregs);
>
>
> We first check whether this routine exits in MMX mode; if not, then it's
> x87 mode (at exit, "EMMS" should already have been called per the ABI),
> and the st/mm registers will be cleared as x87 stack registers.
>
> However, within the routine “zero_all_st_registers”:
>
> static bool
> zero_all_st_registers (HARD_REG_SET need_zeroed_hardregs)
> {
>   unsigned int num_of_st = 0;
>   for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
> if (STACK_REGNO_P (regno)
> && TEST_HARD_REG_BIT (need_zeroed_hardregs, regno))
>   {
> num_of_st++;
> break;
>   }
>
>   if (num_of_st == 0)
> return false;
>
>
> In the above, I currently only check whether any "stack" registers need to
> be zeroed.
> But it looks like we should also check whether any "MMX" register needs to
> be zeroed too. If there is any "MMX" register that needs to be zeroed, do
> we still need to clear the whole x87 stack?

I think so, but I have to check the details.

> BTW, is it convenient for you to provide me 3 small test cases for the
> following situations:
>
> 1. Return with an MMX register;
> 2. Return with an x87 stack register;
> 3. Return with 2 x87 stack registers (i.e. the complex value).
>
> Then it will be much easier for me to verify whether my implementation is
> good on my side.

--cut here--
typedef int __v2si __attribute__ ((vector_size (8)));

__v2si ret_mmx (void)
{
  return (__v2si) { 123, 345 };
}

long double ret_x87 (void)
{
  return 1.1L;
}

_Complex long double ret_x87_cplx (void)
{
  return 1.1L + 1.2iL;
}
--cut here--

Please compile this with "-m32 -mmmx".

ret_mmx returns value in MMX register.
ret_x87 returns value in x87 register.
ret_x87_cplx returns value in memory.

"-m64"

ret_mmx returns value in XMM register.
ret_x87 returns value in x87 register.
ret_x87_cplx returns value in two x87 registers.

Uros.


Re: [RS6000] Non-pcrel tests when power10

2020-10-26 Thread Segher Boessenkool
Hi!

On Thu, Oct 22, 2020 at 05:28:17PM +1030, Alan Modra wrote:
> These tests require -mno-pcrel because they are testing features
> of the non-pcrel ABI.

> --- a/gcc/testsuite/gcc.target/powerpc/cprophard.c
> +++ b/gcc/testsuite/gcc.target/powerpc/cprophard.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile { target { powerpc*-*-* && lp64 } } } */

Make this { target lp64 } if you want?

> --- a/gcc/testsuite/gcc.target/powerpc/pr79439-1.c
> +++ b/gcc/testsuite/gcc.target/powerpc/pr79439-1.c
> @@ -1,5 +1,5 @@

(another)

> --- a/gcc/testsuite/gcc.target/powerpc/pr79439-2.c
> +++ b/gcc/testsuite/gcc.target/powerpc/pr79439-2.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile { target { powerpc*-*-linux* && lp64 } } } */

(wow there are many)


Okay for trunk (with or without that extra cleanup).  Thanks!


Segher


Re: [RS6000] dimode_off.c test

2020-10-26 Thread Segher Boessenkool
On Thu, Oct 22, 2020 at 05:29:49PM +1030, Alan Modra wrote:
> This tests behaviour near the limit of 16-bit signed offsets.  If
> power10 prefix instructions are enabled, no such testing occurs.
> 
>   * gcc.target/powerpc/dimode_off.c: Add -mno-prefixed to options.
> 
> Regstrapped powerpc64le-linux power10 and power8.  OK?

Yes please.  Thanks!


Segher


Re: [PATCH][middle-end][i386][Version 4] Add -fzero-call-used-regs=[skip|used-gpr-arg|used-arg|all-arg|used-gpr|all-gpr|used|all]

2020-10-26 Thread Uros Bizjak via Gcc-patches
On Mon, Oct 26, 2020 at 9:05 PM Uros Bizjak  wrote:
>
> On Mon, Oct 26, 2020 at 8:10 PM Qing Zhao  wrote:
> >
> >
> >
> > > On Oct 26, 2020, at 1:42 PM, Uros Bizjak  wrote:
> > >
> > > On Mon, Oct 26, 2020 at 6:30 PM Qing Zhao  wrote:
> > >>
> > >>
> > >> The following is the current change in i386.c, could you check whether 
> > >> the logic is good?
> > >
> > > x87 handling looks good to me.
> > >
> > > One remaining question: If the function uses MMX regs (either
> > > internally or as an argument register), but exits in x87 mode, does
> > > your logic clear the x87 stack?
> >
> > Yes, but not completely.
> >
> > First, as follows:
> >
> >   /* Then, decide which mode (MMX mode or x87 mode) the function exits
> >  with, in order to decide whether we need to clear the MMX registers
> >  or the stack registers.  */
> >   bool exit_with_mmx_mode = false;
> >
> >   exit_with_mmx_mode = ((GET_CODE (crtl->return_rtx) == REG)
> > && (MMX_REG_P (crtl->return_rtx)));
> >
> >   /* Then, let's see whether we can zero all st registers together.  */
> >   if (!exit_with_mmx_mode)
> > st_zeroed = zero_all_st_registers (need_zeroed_hardregs);
> >
> >
> > We first check whether this routine exits in MMX mode; if not, then it's
> > x87 mode (at exit, "EMMS" should already have been called per the ABI),
> > and the st/mm registers will be cleared as x87 stack registers.
> >
> > However, within the routine “zero_all_st_registers”:
> >
> > static bool
> > zero_all_st_registers (HARD_REG_SET need_zeroed_hardregs)
> > {
> >   unsigned int num_of_st = 0;
> >   for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
> > if (STACK_REGNO_P (regno)
> > && TEST_HARD_REG_BIT (need_zeroed_hardregs, regno))
> >   {
> > num_of_st++;
> > break;
> >   }
> >
> >   if (num_of_st == 0)
> > return false;
> >
> >
> > In the above, I currently only check whether any "stack" registers need
> > to be zeroed.
> > But it looks like we should also check whether any "MMX" register needs
> > to be zeroed too. If there is any "MMX" register that needs to be zeroed,
> > do we still need to clear the whole x87 stack?
>
> I think so, but I have to check the details.

Please compile the following testcase with "-m32 -mmmx":

--cut here--
#include <stdio.h>

typedef int __v2si __attribute__ ((vector_size (8)));

__v2si zzz;

void
__attribute__ ((noinline))
mmx (__v2si a, __v2si b, __v2si c)
{
  __v2si res;

  res = __builtin_ia32_paddd (a, b);
  zzz = __builtin_ia32_paddd (res, c);

  __builtin_ia32_emms ();
}


int main ()
{
  __v2si a = { 123, 345 };
  __v2si b = { 234, 456 };
  __v2si c = { 345, 567 };

  mmx (a, b, c);

  printf ("%i, %i\n", zzz[0], zzz[1]);

  return 0;
}
--cut here--

at the end of mmx() function:

0x080491ed in mmx ()
(gdb) disass
Dump of assembler code for function mmx:
  0x080491e0 <+0>: paddd  %mm1,%mm0
  0x080491e3 <+3>: paddd  %mm2,%mm0
  0x080491e6 <+6>: movq   %mm0,0x804c020
=> 0x080491ed <+13>: emms
   0x080491ef <+15>: ret
End of assembler dump.
(gdb) i r flo
st0 (raw 0x055802be)
st1 (raw 0x01c800ea)
st2 (raw 0x02370159)
st30   (raw 0x)
st40   (raw 0x)
st50   (raw 0x)
st60   (raw 0x)
st70   (raw 0x)
fctrl  0x37f   895
fstat  0x0 0
ftag   0x556a  21866
fiseg  0x0 0
fioff  0x0 0
foseg  0x0 0
fooff  0x0 0
fop0x0 0

There are still values in the MMX registers. However, we are in x87
mode, so the whole stack has to be cleared.

Now, what to do if the function uses x87 registers and exits in MMX
mode? I guess we have to clear all MMX registers (modulo return value
reg).
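
A rough sketch of what that MMX clearing could look like (a hypothetical
illustration on top of GCC's internal APIs, not code from the patch under
review; skipping the return-value register is omitted for brevity):

--cut here--
static void
zero_all_mmx_registers (HARD_REG_SET need_zeroed_hardregs)
{
  /* Zero every MMX register the caller asked us to clear.  */
  for (unsigned int regno = FIRST_MMX_REG; regno <= LAST_MMX_REG; regno++)
    if (TEST_HARD_REG_BIT (need_zeroed_hardregs, regno))
      emit_move_insn (gen_rtx_REG (V2SImode, regno),
		      CONST0_RTX (V2SImode));
}
--cut here--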

Uros.


Re: [RS6000] Link power10 testcases

2020-10-26 Thread Segher Boessenkool
On Thu, Oct 22, 2020 at 05:31:15PM +1030, Alan Modra wrote:
> Running the assembler and linker catches more errors.
> 
>   * gcc.target/powerpc/cfuged-1.c,
>   * gcc.target/powerpc/cntlzdm-1.c,

There should be no star on the second and subsequent lines of one entry.

Okay for trunk.  Thanks!


Segher


Re: [RFC] Add support for the "retain" attribute utilizing SHF_GNU_RETAIN

2020-10-26 Thread Jozef Lawrynowicz
On Mon, Oct 26, 2020 at 07:08:06PM +, Pedro Alves via Gcc-patches wrote:
> On 10/6/20 12:10 PM, Jozef Lawrynowicz wrote:
> 
> > Should "used" apply SHF_GNU_RETAIN?
> > ===
> > Another talking point is whether the existing "used" attribute should
> > apply the SHF_GNU_RETAIN flag to the containing section.
> > 
> > It seems unlikely that a user applies the "used" attribute to a
> > declaration, and means for it to be saved from only compiler
> > optimization, but *not* linker optimization. So perhaps it would be
> > beneficial for "used" to apply SHF_GNU_RETAIN in some way.
> > 
> > If "used" did apply SHF_GNU_RETAIN, we would also have to
> > consider the above options for how to apply SHF_GNU_RETAIN to the
> > section. Since the "used" attribute has been around for a while 
> > it might not be appropriate for its behavior to be changed to place the
> > associated declaration in its own, unique section, as in option (2).
> > 
> 
> To me, if I use attribute((used)), and the linker still garbage
> collects the symbol, then the toolchain has a bug.  Is there any
> use case that would suggest otherwise?
> 
> Thanks,
> Pedro Alves
> 

I agree that "used" should imply SHF_GNU_RETAIN on whatever section
contains the declaration that the attribute is applied to. However, I
think that apart from the section flag being applied to the section, the
behaviour of "used" shouldn't be modified i.e. the declaration shouldn't
be put in a unique section.
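
As a hedged illustration of that semantic (a made-up example, not from the
patch series): with "used" implying SHF_GNU_RETAIN, the section containing
keep_me below would be marked retained, without keep_me being moved to a
unique section of its own.

--cut here--
/* Kept by the compiler today; under the proposal, its containing
   section would also carry SHF_GNU_RETAIN for the linker.  */
static void __attribute__ ((used))
keep_me (void)
{
}
--cut here--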

I originally justified the addition of a "retain" attribute, alongside
"used" implying SHF_GNU_RETAIN, as indicating that the declaration
should be placed in its own section. In hindsight, this is unnecessary;
if the user wants to increase the granularity of the portions of their
program being retained, they should build with
-f{function,data}-sections, or manually put the declaration in its own
section with the "section" attribute.

So we could shelve the "retain" attribute, and just modify the "used"
attribute to imply SHF_GNU_RETAIN. If we get consensus on that, I could
go ahead and implement it, but I never got any specific feedback on the
GCC behavior from anyone apart from you. I don't know whether to
interpret that lack of feedback, whilst the other aspects of the
implementation were commented on, as support for the "retain" attribute.

(I appreciate you giving that feedback in the Binutils discussions, and
should have engaged in those discussions more at the time. There were
just a lot of opinions flying about on many aspects of it, which is
attention for this proposal that I now miss...)

Since I'm not proposing to modify the behavior of "used" apart from
applying SHF_GNU_RETAIN to its section, I'm hoping the GCC side of
things won't be too controversial.

However, the assembler will have to support mismatched section
declarations, i.e.:
  .section .text,"ax",%progbits
  ...
  .section .text,"axR",%progbits
  ...

The Binutils patch that supported this would create two separate .text
sections in the assembled object file, one with SHF_GNU_RETAIN and one
without.
Perhaps they should be merged into a single .text section, with
SHF_GNU_RETAIN applied to that merged section, so as to truly not
interfere with "used" attribute behavior.

There was an opinion that allowing these separate .section directives
with the same name but different flags was undesirable.

Personally, I don't see it as a problem; this exception is beneficial
and makes sense, and if the assembler merges the sections it is as if they
all had the flag applied anyway.

On Mon, Oct 26, 2020 at 07:12:45PM +, Pedro Alves via Gcc-patches wrote:
> On 10/6/20 12:10 PM, Jozef Lawrynowicz wrote:
> > The changes would also only affect targets
> > that support the GNU ELF OSABI, which would lead to inconsistent
> > behavior between non-GNU OSes.
> 
> Well, a separate __attribute__((retain)) will necessarily only work
> on GNU ELF targets, so that just shifts the "inconsistent" behavior
> elsewhere.

True, a note in the documentation would cover this. For example:
  "As a GNU ELF extension, the used attribute will also prevent the
  linker from garbage collecting the section containing the symbol"

Thanks,
Jozef


[PATCH] libstdc++: Implement C++20 features for <sstream>

2020-10-26 Thread Thomas Rodgers
From: Thomas Rodgers 

New ctors and ::view() accessor for -
  * basic_stringbuf
  * basic_istringstream
  * basic_ostringstream
  * basic_stringstream

New ::get_allocator() accessor for basic_stringbuf.
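
A hedged usage sketch of the new interface (compile with -std=c++20; this
example is illustrative only, not part of the patch):

--cut here--
#include <sstream>
#include <string>
#include <string_view>

int main ()
{
  // Move a string into the stream instead of copying it (new ctor).
  std::ostringstream os (std::string ("abc"), std::ios_base::ate);
  os << "def";
  // view () reads the buffer contents without copying them out.
  std::string_view v = os.view ();
  return v == "abcdef" ? 0 : 1;
}
--cut here--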

libstdc++-v3/ChangeLog:
* acinclude.m4 (glibcxx_SUBDIRS): Add src/c++20.
* config/abi/pre/gnu.ver: Update GLIBCXX_3.4.29 for the addition of -
basic_stringbuf::basic_stringbuf(allocator const&),
basic_stringbuf::basic_stringbuf(openmode, allocator const&),
basic_stringbuf::basic_stringbuf(basic_string&&, openmode),
basic_stringbuf::basic_stringbuf(basic_stringbuf&&, allocator const&),
basic_stringbuf::get_allocator(),
basic_stringbuf::view(),
basic_istringstream::basic_istringstream(basic_string&&, openmode),
basic_istringstream::basic_istringstream(openmode, allocator const&),
basic_istringstream::view(),
basic_ostringstream::basic_ostringstream(basic_string&&, openmode),
basic_ostringstream::basic_ostringstream(openmode, allocator const&),
basic_ostringstream::view(),
basic_stringstream::basic_stringstream(basic_string&&, openmode),
basic_stringstream::basic_stringstream(openmode, allocator const&),
basic_stringstream::view().
* configure: Regenerate.
* include/std/sstream:
(basic_stringbuf::basic_stringbuf(allocator const&)): New constructor.
(basic_stringbuf::basic_stringbuf(openmode, allocator const&)): 
Likewise.
(basic_stringbuf::basic_stringbuf(basic_string&&, openmode)): Likewise.
(basic_stringbuf::basic_stringbuf(basic_stringbuf&&, allocator 
const&)): Likewise.
(basic_stringbuf::get_allocator()): New method.
(basic_stringbuf::view()): Likewise.
(basic_istringstream::basic_istringstream(basic_string&&, openmode)):
New constructor.
(basic_istringstream::basic_istringstream(openmode, allocator const&)):
Likewise
(basic_istringstream::view()): New method.
(basic_ostringstream::basic_ostringstream(basic_string&&, openmode)):
New constructor.
(basic_ostringstream::basic_ostringstream(openmode, allocator const&)):
Likewise
(basic_ostringstream::view()): New method.
(basic_stringstream::basic_stringstream(basic_string&&, openmode)):
New constructor.
(basic_stringstream::basic_stringstream(openmode, allocator const&)):
Likewise
(basic_stringstream::view()): New method.
* src/Makefile.am: Add c++20 directory.
* src/Makefile.in: Regenerate.
* src/c++20/Makefile.am: Add makefile for new sub-directory.
* src/c++20/Makefile.in: Generate.
* src/c++20/sstream-inst.cc: New file defining explicit
instantiations for basic_stringbuf, basic_istringstream,
basic_ostringstream, and basic_stringstream member functions
added in C++20.
* testsuite/27_io/basic_stringbuf/cons/char/2.cc: New test.
* testsuite/27_io/basic_stringbuf/cons/wchar_t/2.cc: Likewise.
* testsuite/27_io/basic_stringbuf/view/char/2.cc: Likewise.
* testsuite/27_io/basic_stringbuf/view/wchar_t/2.cc: Likewise.
* testsuite/27_io/basic_istringstream/cons/char/2.cc: Likewise.
* testsuite/27_io/basic_istringstream/cons/wchar_t/2.cc: Likewise.
* testsuite/27_io/basic_istringstream/view/char/2.cc: Likewise.
* testsuite/27_io/basic_istringstream/view/wchar_t/2.cc: Likewise.
* testsuite/27_io/basic_ostringstream/cons/char/2.cc: Likewise.
* testsuite/27_io/basic_ostringstream/cons/wchar_t/2.cc: Likewise.
* testsuite/27_io/basic_ostringstream/view/char/2.cc: Likewise.
* testsuite/27_io/basic_ostringstream/view/wchar_t/2.cc: Likewise.
* testsuite/27_io/basic_stringstream/cons/char/2.cc: Likewise.
* testsuite/27_io/basic_stringstream/cons/wchar_t/2.cc: Likewise.
* testsuite/27_io/basic_stringstream/view/char/2.cc: Likewise.
* testsuite/27_io/basic_stringstream/view/wchar_t/2.cc: Likewise.
---
 libstdc++-v3/acinclude.m4 |   2 +-
 libstdc++-v3/config/abi/pre/gnu.ver   |  45 ++
 libstdc++-v3/configure|  16 +-
 libstdc++-v3/include/std/sstream  | 190 +
 libstdc++-v3/src/Makefile.am  |  12 +-
 libstdc++-v3/src/Makefile.in  |  14 +-
 libstdc++-v3/src/c++20/Makefile.am| 105 +++
 libstdc++-v3/src/c++20/Makefile.in| 735 ++
 libstdc++-v3/src/c++20/sstream-inst.cc| 108 +++
 .../27_io/basic_istringstream/cons/char/1.cc  |  85 ++
 .../basic_istringstream/cons/wchar_t/1.cc |  85 ++
 .../27_io/basic_istringstream/view/char/1.cc  |  35 +
 .../basic_istringstream/view/wchar_t/1.cc |  35 +
 .../27_io/basic_ostringstream/cons/char/1.cc  |  85 ++
 .../basic_ostringstream/cons/wchar_t/1.cc |  85 ++
 .../27_io/basic_ostringstr

[PATCH v3] builtins: (not just) rs6000: Add builtins for fegetround, feclearexcept and feraiseexcept [PR94193]

2020-10-26 Thread Raoni Fassina Firmino via Gcc-patches
Changes since v2[1]:
  - Added documentation for the new optabs;
  - Remove use of non portable __builtin_clz;
  - Changed feclearexcept and feraiseexcept to accept all 4 valid
flags at the same time and added more test for that case;
  - Extended feclearexcept and feraiseexcept testcases to match
accepting multiple flags;
  - Fixed builtin-feclearexcept-feraiseexcept-2.c testcase comparison
after feclearexcept tests;
  - Updated commit message to reflect change in feclearexcept and
feraiseexcept from the glibc counterpart;
  - Fixed English spelling and typos;
  - Fixed code-style;
  - Changed subject line tag to make clear it is not just rs6000 code.

Tested on top of master (47d13acbda9a5d8eb57ff169ba74857cd54108e4)
on the following platforms with no regression:
  - powerpc64le-linux-gnu (Power 9)
  - powerpc64le-linux-gnu (Power 8)

[1] https://gcc.gnu.org/pipermail/gcc-patches/2020-September/553297.html

 8< 

These optimizations were originally in glibc, but were removed,
and it was suggested that they were a good fit as gcc builtins[1].

feclearexcept and feraiseexcept were extended (in comparison to the
glibc version) to accept any combination of the accepted flags, not
limited to just one flag bit at a time anymore.
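
For illustration, the calls these builtins cover look like this (a hedged
sketch, not one of the testcases from the patch):

--cut here--
#include <fenv.h>

int
fenv_demo (void)
{
  /* With the new optabs these can expand inline on powerpc instead of
     calling into the C library.  */
  feclearexcept (FE_INEXACT | FE_OVERFLOW);   /* multiple flags at once */
  feraiseexcept (FE_INVALID);
  return fegetround ();   /* e.g. FE_TONEAREST */
}
--cut here--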

The associated bugreport: PR target/94193

[1] https://sourceware.org/legacy-ml/libc-alpha/2020-03/msg00047.html
https://sourceware.org/legacy-ml/libc-alpha/2020-03/msg00080.html

2020-08-13  Raoni Fassina Firmino  

gcc/ChangeLog:

* builtins.c (expand_builtin_fegetround): New function.
(expand_builtin_feclear_feraise_except): New function.
(expand_builtin): Add cases for BUILT_IN_FEGETROUND,
BUILT_IN_FECLEAREXCEPT and BUILT_IN_FERAISEEXCEPT.
* config/rs6000/rs6000.md (fegetroundsi): New pattern.
(feclearexceptsi): New pattern.
(feraiseexceptsi): New pattern.
* optabs.def (fegetround_optab): New optab.
(feclearexcept_optab): New optab.
(feraiseexcept_optab): New optab.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/builtin-feclearexcept-feraiseexcept-1.c: New test.
* gcc.target/powerpc/builtin-feclearexcept-feraiseexcept-2.c: New test.
* gcc.target/powerpc/builtin-fegetround.c: New test.

Signed-off-by: Raoni Fassina Firmino 
---
 gcc/builtins.c|  76 +++
 gcc/config/rs6000/rs6000.md   |  81 +++
 gcc/doc/md.texi   |  18 ++
 gcc/optabs.def|   4 +
 .../builtin-feclearexcept-feraiseexcept-1.c   |  76 +++
 .../builtin-feclearexcept-feraiseexcept-2.c   | 203 ++
 .../gcc.target/powerpc/builtin-fegetround.c   |  36 
 7 files changed, 494 insertions(+)
 create mode 100644 
gcc/testsuite/gcc.target/powerpc/builtin-feclearexcept-feraiseexcept-1.c
 create mode 100644 
gcc/testsuite/gcc.target/powerpc/builtin-feclearexcept-feraiseexcept-2.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/builtin-fegetround.c

diff --git a/gcc/builtins.c b/gcc/builtins.c
index 72627b5b859..bc5459dcc2c 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -115,6 +115,9 @@ static rtx expand_builtin_mathfn_3 (tree, rtx, rtx);
 static rtx expand_builtin_mathfn_ternary (tree, rtx, rtx);
 static rtx expand_builtin_interclass_mathfn (tree, rtx);
 static rtx expand_builtin_sincos (tree);
+static rtx expand_builtin_fegetround (tree, rtx, machine_mode);
+static rtx expand_builtin_feclear_feraise_except (tree, rtx, machine_mode,
+ optab);
 static rtx expand_builtin_cexpi (tree, rtx);
 static rtx expand_builtin_int_roundingfn (tree, rtx);
 static rtx expand_builtin_int_roundingfn_2 (tree, rtx);
@@ -2886,6 +2889,59 @@ expand_builtin_sincos (tree exp)
   return const0_rtx;
 }
 
+/* Expand call EXP to the fegetround builtin (from C99 fenv.h), returning the
+   result and setting it in TARGET.  Otherwise return NULL_RTX on failure.  */
+static rtx
+expand_builtin_fegetround (tree exp, rtx target, machine_mode target_mode)
+{
+  if (!validate_arglist (exp, VOID_TYPE))
+return NULL_RTX;
+
+  insn_code icode = direct_optab_handler (fegetround_optab, SImode);
+  if (icode == CODE_FOR_nothing)
+return NULL_RTX;
+
+  if (target == 0
+  || GET_MODE (target) != target_mode
+  || ! (*insn_data[icode].operand[0].predicate) (target, target_mode))
+target = gen_reg_rtx (target_mode);
+
+  rtx pat = GEN_FCN (icode) (target);
+  if (! pat)
+return NULL_RTX;
+  emit_insn (pat);
+
+  return target;
+}
+
+/* Expand call EXP to either feclearexcept or feraiseexcept builtins (from C99
+fenv.h), returning the result and setting it in TARGET.  Otherwise return
+NULL_RTX on failure.  */
+static rtx
+expand_builtin_feclear_feraise_except (tree exp, rtx target,
+  machine_mode target_mode, optab op_optab)
+{
+  if (!validate_arglist (exp, INTEGER_TYPE, VOID_TYPE))
+return NUL

[PATCH] PR fortran/97491 - Wrong restriction for VALUE arguments of pure procedures

2020-10-26 Thread Harald Anlauf
As found/reported by Thomas, the redefinition of dummy arguments with the
VALUE attribute was erroneously rejected for pure procedures.  A related
purity check did not take VALUE into account and was therefore adjusted.

Regtested on x86_64-pc-linux-gnu.

OK for master?

Thanks,
Harald


PR fortran/97491 - Wrong restriction for VALUE arguments of pure procedures

A dummy argument with the VALUE attribute may be redefined in a PURE or
ELEMENTAL procedure.  Adjust the associated purity check.

gcc/fortran/ChangeLog:

* resolve.c (gfc_impure_variable): A dummy argument with the VALUE
attribute may be redefined without making a procedure impure.

gcc/testsuite/ChangeLog:

* gfortran.dg/value_8.f90: New test.

diff --git a/gcc/fortran/resolve.c b/gcc/fortran/resolve.c
index a210f9aad43..096108f4317 100644
--- a/gcc/fortran/resolve.c
+++ b/gcc/fortran/resolve.c
@@ -16476,6 +16507,7 @@ gfc_impure_variable (gfc_symbol *sym)

   proc = sym->ns->proc_name;
   if (sym->attr.dummy
+  && !sym->attr.value
   && ((proc->attr.subroutine && sym->attr.intent == INTENT_IN)
 	  || proc->attr.function))
 return 1;
diff --git a/gcc/testsuite/gfortran.dg/value_8.f90 b/gcc/testsuite/gfortran.dg/value_8.f90
new file mode 100644
index 000..8273fe88b60
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/value_8.f90
@@ -0,0 +1,16 @@
+! { dg-do compile }
+! PR97491 - Wrong restriction for VALUE arguments of pure procedures
+
+pure function foo (x) result (ret)
+  integer:: ret
+  integer, value :: x
+  x = x / 2
+  ret = x
+end function foo
+
+elemental function foo1 (x)
+  integer:: foo1
+  integer, value :: x
+  x = x / 2
+  foo1 = x
+end function foo1


Re: [PATCH] libstdc++: Implement C++20 features for <sstream>

2020-10-26 Thread Jonathan Wakely via Gcc-patches

On 26/10/20 13:47 -0700, Thomas Rodgers wrote:

From: Thomas Rodgers 

New ctors and ::view() accessor for -
 * basic_stringbuf
 * basic_istringstream
 * basic_ostringstream
 * basic_stringstream

New ::get_allocator() accessor for basic_stringbuf.

libstdc++-v3/ChangeLog:
* acinclude.m4 (glibcxx_SUBDIRS): Add src/c++20.
   * config/abi/pre/gnu.ver: Update GLIBCXX_3.4.29 for the addition of -
basic_stringbuf::basic_stringbuf(allocator const&),
basic_stringbuf::basic_stringbuf(openmode, allocator const&),
basic_stringbuf::basic_stringbuf(basic_string&&, openmode),
basic_stringbuf::basic_stringbuf(basic_stringbuf&&, allocator const&),
basic_stringbuf::get_allocator(),
basic_stringbuf::view(),
basic_istringstream::basic_istringstream(basic_string&&, openmode),
basic_istringstream::basic_istringstream(openmode, allocator const&),
basic_istringstream::view(),
basic_ostringstream::basic_ostringstream(basic_string&&, openmode),
basic_ostringstream::basic_ostringstream(openmode, allocator const&),
basic_ostringstream::view(),
basic_stringstream::basic_stringstream(basic_string&&, openmode),
basic_stringstream::basic_stringstream(openmode, allocator const&),
basic_stringstream::view().


As discussed on IRC, please don't name every one of these functions
for the linker script changes; it's just redundant noise. They're
already listed below in the include/std/sstream changes.

Look at past changelog entries for the gnu.ver file.
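
For instance, a single condensed entry along these lines (a hypothetical
wording, not a prescribed one):

* config/abi/pre/gnu.ver (GLIBCXX_3.4.29): Export new symbols.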


* configure: Regenerate.
* include/std/sstream:
(basic_stringbuf::basic_stringbuf(allocator const&)): New constructor.
(basic_stringbuf::basic_stringbuf(openmode, allocator const&)): 
Likewise.
(basic_stringbuf::basic_stringbuf(basic_string&&, openmode)): Likewise.
(basic_stringbuf::basic_stringbuf(basic_stringbuf&&, allocator 
const&)): Likewise.


New line before the Likewise.


There are a few formatting changes mentioned below. OK for trunk with
those changes. Thanks. Go ahead and commit the  patch
after this one too.



(basic_stringbuf::get_allocator()): New method.
(basic_stringbuf::view()): Likewise.
(basic_istringstream::basic_istringstream(basic_string&&, openmode)):
New constructor.
(basic_istringstream::basic_istringstream(openmode, allocator const&)):
Likewise
(basic_istringstream::view()): New method.
(basic_ostringstream::basic_ostringstream(basic_string&&, openmode)):
New constructor.
(basic_ostringstream::basic_ostringstream(openmode, allocator const&)):
Likewise
(basic_ostringstream::view()): New method.
(basic_stringstream::basic_stringstream(basic_string&&, openmode)):
New constructor.
(basic_stringstream::basic_stringstream(openmode, allocator const&)):
Likewise
(basic_stringstream::view()): New method.
* src/Makefile.am: Add c++20 directory.
* src/Makefile.in: Regenerate.
* src/c++20/Makefile.am: Add makefile for new sub-directory.
* src/c++20/Makefile.in: Generate.
* src/c++20/sstream-inst.cc: New file defining explicit
instantiations for basic_stringbuf, basic_istringstream,
basic_ostringstream, and basic_stringstream member functions
added in C++20.
* testsuite/27_io/basic_stringbuf/cons/char/2.cc: New test.
* testsuite/27_io/basic_stringbuf/cons/wchar_t/2.cc: Likewise.
* testsuite/27_io/basic_stringbuf/view/char/2.cc: Likewise.
* testsuite/27_io/basic_stringbuf/view/wchar_t/2.cc: Likewise.
* testsuite/27_io/basic_istringstream/cons/char/2.cc: Likewise.
* testsuite/27_io/basic_istringstream/cons/wchar_t/2.cc: Likewise.
* testsuite/27_io/basic_istringstream/view/char/2.cc: Likewise.
* testsuite/27_io/basic_istringstream/view/wchar_t/2.cc: Likewise.
* testsuite/27_io/basic_ostringstream/cons/char/2.cc: Likewise.
* testsuite/27_io/basic_ostringstream/cons/wchar_t/2.cc: Likewise.
* testsuite/27_io/basic_ostringstream/view/char/2.cc: Likewise.
* testsuite/27_io/basic_ostringstream/view/wchar_t/2.cc: Likewise.
* testsuite/27_io/basic_stringstream/cons/char/2.cc: Likewise.
* testsuite/27_io/basic_stringstream/cons/wchar_t/2.cc: Likewise.
* testsuite/27_io/basic_stringstream/view/char/2.cc: Likewise.
* testsuite/27_io/basic_stringstream/view/wchar_t/2.cc: Likewise.
---
libstdc++-v3/acinclude.m4 |   2 +-
libstdc++-v3/config/abi/pre/gnu.ver   |  45 ++
libstdc++-v3/configure|  16 +-
libstdc++-v3/include/std/sstream  | 190 +
libstdc++-v3/src/Makefile.am  |  12 +-
libstdc++-v3/src/Makefile.in  |  14 +-
libstdc++-v3/src/c++20/Makefile.am| 105 +++
libstdc++-v3/src/c++20/

Re: [PATCH] rs6000: Don't split constant operator add before reload, move to temp register for future optimization

2020-10-26 Thread Segher Boessenkool
On Wed, Oct 21, 2020 at 03:25:29AM -0500, Xionghu Luo wrote:
> Don't split code from add<mode>3 for SDI to allow a later pass to split.

This is very problematic.

> This allows later logic to hoist out constant load in add instructions.

Later logic should be able to do that any way (I do not say that works
perfectly, mind; it no doubt could be improved).

> In a loop, lis+ori could be hoisted out to improve performance compared with
> the previous addis+addi (about 15% on a typical case); the weak point is that
> one more register is used and one more instruction is generated.  I.e.:

Yes, better performance on one testcase, and worse code always :-(

> addis 3,3,0x6765
> addi 3,3,0x4321
> 
> =>
> 
> lis 9,0x6765
> ori 9,9,0x4321
> add 3,3,9

This is the typical kind of clumsy code you get if you generate RTL that
matches actual machine instructions too late ("split too late").

So, please make it possible to hoist 2-insn-immediate sequences out of
loops, *without* changing them to fake 1-insn things.
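
As a concrete illustration, the kind of loop in question looks like this
(a hypothetical example, not a testcase from the patch):

--cut here--
void
f (long *a, long n)
{
  for (long i = 0; i < n; i++)
    a[i] += 0x67654321;   /* 32-bit constant, built with lis+ori */
}
--cut here--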

Some comments on the patch:

> +   rs6000_emit_move (tmp, operands[2], <MODE>mode);
> +   emit_insn (gen_add<mode>3 (operands[0], operands[1], tmp));
> DONE;
>   }
> +  else
> + {

You don't need an else here: everything in the "if" has done a DONE or
FAIL.  You can just keep the existing code as-is, there is no need to
obfuscate the code :-)

> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/add-const.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile { target { lp64 } } } */

{ target lp64 }
(no extra braces needed)

> +/* Ensure the lis,ori are generated, which indicates they have
> +   been hoisted outside of the loop.  */

That is a very fragile test.

> --- a/gcc/testsuite/gcc.target/powerpc/prefix-add.c
> +++ b/gcc/testsuite/gcc.target/powerpc/prefix-add.c
> @@ -2,13 +2,13 @@
>  /* { dg-require-effective-target powerpc_prefixed_addr } */
>  /* { dg-options "-O2 -mdejagnu-cpu=power10" } */
>  
> -/* Test that PADDI is generated to add a large constant.  */
> +/* Test that PLI is generated to add a large constant.  */

Nope, that is a bad idea.  The test tested that we generate good code,
we should keep it that way.


Segher

