Re: [PATCH/RFC] combine: Tweak the condition of last_set invalidation

2021-01-15 Thread Kewen.Lin via Gcc-patches
Hi Segher,

Thanks for the comments!

on 2021/1/15 8:22 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Wed, Dec 16, 2020 at 04:49:49PM +0800, Kewen.Lin wrote:
>> When I was investigating unsigned int vec_init issue on Power,
>> I happened to find there seems something we can enhance in how
>> combine pass invalidate last_set (set last_set_invalid nonzero).
>>
>> Currently we have the check:
>>
>>   if (!insn
>>       || (value && rsp->last_set_table_tick >= label_tick_ebb_start))
>>     rsp->last_set_invalid = 1;
>>
>> which means that if we want to record some value for some reg and
>> this reg was referred to before in a valid scope, we invalidate the
>> set of the reg (set last_set_invalid to 1).  This avoids finding the
>> wrong set for a reg reference, in cases like:
>>
>>... op regX  // this regX could find wrong last_set below
>>regX = ...   // if we think this set is valid
>>... op regX
>>
>> But because of retry's existence, the last_set_table_tick could
> 
> It is not just because of retry: combine can change other insns than
> just i2 and i3, too.  And even changing i2 requires this!
> 

Ah, thanks for the information!  Retry is just one example: we can
revisit an instruction while the stored reg-reference information comes
from an instruction after the current one that was visited earlier.

> The whole reg_stat stuff is an ugly hack that does not work well.  For
> example, as in your example, some "known" value can be invalidated
> before the combination that wants to know that value is tried.
> 
> We need to have this outside of combine, in a dataflow(-like) thing
> for example.  This could take the place of REG_EQ* as well probably
> (which is good, there are various problems with that as well).
> 

Good point, but IIUC we still need to keep updating (tracking)
information like what we put into the reg_stat stuff; it's not static,
since, as you pointed out above, combine can change i2/i3 etc., so
we need to update the information for those changes.  Anyway, that's not
what this patch tries to solve.  :-P

>> This proposal is to check whether the last_set_table_tick update safely
>> happens after the current set, and to keep the set valid if so.
> 
> I don't think this is safe to do like this, unfortunately.  There are
> more places that set last_set_invalid (well, one more), so at the very
> minimum this needs a lot more justification.
> 

Let me try to explain it more.
* Background *

There are two places that set last_set_invalid to 1.

CASE 1:

```
  if (CALL_P (insn))
    {
      HARD_REG_SET callee_clobbers
        = insn_callee_abi (insn).full_and_partial_reg_clobbers ();
      hard_reg_set_iterator hrsi;
      EXECUTE_IF_SET_IN_HARD_REG_SET (callee_clobbers, 0, i, hrsi)
        {
          reg_stat_type *rsp;

          /* ??? We could try to preserve some information from the last
             set of register I if the call doesn't actually clobber
             (reg:last_set_mode I), which might be true for ABIs with
             partial clobbers.  However, it would be difficult to
             update last_set_nonzero_bits and last_sign_bit_copies
             to account for the part of I that actually was clobbered.
             It wouldn't help much anyway, since we rarely see this
             situation before RA.  */
          rsp = &reg_stat[i];
          rsp->last_set_invalid = 1;
```

The justification for this part is that if the ABI defines some
callee-clobbered registers, we need to invalidate their sets so that
later references do not use unexpected values.

CASE 2:

```
  for (i = regno; i < endregno; i++)
    {
      rsp = &reg_stat[i];
      rsp->last_set_label = label_tick;
      if (!insn
          || (value && rsp->last_set_table_tick >= label_tick_ebb_start))
        rsp->last_set_invalid = 1;
      else
        rsp->last_set_invalid = 0;
    }
```

The justification here is: if the insn is NULL, we simply invalidate
this reg set; if the value is NULL, the reg is simply clobbered, so
invalidate it; and if a reference to the reg being set has already been
seen, invalidate it as well.

The last part follows the comments:

  /* Now update the status of each register being set.
 If someone is using this register in this block, set this register
 to invalid since we will get confused between the two lives in this
 basic block.  This makes using this register always invalid.  In cse, we
 scan the table to invalidate all entries using this register, but this
 is too much work for us.  */

It's understandable to invalidate it to avoid the case:

   ... op regX   // (a)
   regX = ...    // (b)
   ... op regX   // (c)
 
When we are revisiting (a), this avoids using the reg set from (b).
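The bookkeeping above can be modeled with a toy sketch (a simplification of combine's reg_stat handling; the helper names and types are mine, not GCC's):

```python
# Toy model of combine's last_set bookkeeping: a recorded set is only
# trusted when no use of the register has been seen earlier in the walk.

label_tick_ebb_start = 1  # tick at which the current EBB started


class RegStat:
    """Simplified stand-in for combine's reg_stat_type."""
    def __init__(self):
        self.last_set_value = None
        self.last_set_table_tick = 0  # tick of the last *use* of the reg
        self.last_set_invalid = False


def record_use(rsp, tick):
    # Every reference to the reg bumps its table tick.
    rsp.last_set_table_tick = tick


def record_set(rsp, value):
    # Mirrors: if (value && rsp->last_set_table_tick >= label_tick_ebb_start)
    #            rsp->last_set_invalid = 1;
    if value is not None and rsp.last_set_table_tick >= label_tick_ebb_start:
        rsp.last_set_invalid = True
    else:
        rsp.last_set_invalid = False
        rsp.last_set_value = value


# (a) ... op regX   (b) regX = 42   (c) ... op regX
rx = RegStat()
record_use(rx, tick=1)    # (a): regX referenced before the set
record_set(rx, value=42)  # (b): invalidated, so (c) won't pick up 42
assert rx.last_set_invalid
```

A register that is set before any use in the EBB keeps its recorded value, which is exactly the case the invalidation is protecting against losing.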

* Problem *

First, the problem this patch tries to solve is mainly related to
CASE 2, so it doesn't touch CASE 1.

In the context of CASE 2, the current condition

  (rsp->last_set_table_tick >= label_tick_ebb_start)

is completely safe but too conservative.

[PATCH] vect: Use factored nloads for load cost modeling [PR82255]

2021-01-15 Thread Kewen.Lin via Gcc-patches
Hi,

This patch follows Richard's suggestion in the thread discussion [1];
it factors out the nloads computation in vectorizable_load for
strided access, to ensure we obtain consistent information
when estimating the costs.

btw, the reason why I didn't try to save the information into
stmt_info during the analysis phase and then fetch it in the transform
phase is that the information is only for strided SLP loading, and
re-computing it looks inexpensive and acceptable.
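To make the "consistent information" point concrete, the kind of computation being factored out can be sketched roughly like this (my simplification, not GCC's actual code; the real vectorizable_load also considers vector composition types and masks):

```python
def strided_nloads(nunits, group_size):
    """Rough sketch: how many loads (nloads) of lnel elements each are
    needed to build one vector of nunits elements for a strided access.
    Both the cost model and the transform must agree on this split."""
    if group_size < nunits and nunits % group_size == 0:
        # Several group-sized pieces compose one vector.
        nloads, lnel = nunits // group_size, group_size
    elif group_size >= nunits and group_size % nunits == 0:
        # One contiguous load fills the whole vector.
        nloads, lnel = 1, nunits
    else:
        # Fall back to loading each element separately.
        nloads, lnel = nunits, 1
    assert nloads * lnel == nunits
    return nloads, lnel


print(strided_nloads(8, 2))  # (4, 2): four 2-element pieces per vector
```

If the cost model recomputed this differently from the transform phase, it could charge a vec_construct cost for a load sequence that is never generated, which is the PR82255 symptom.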

Bootstrapped/regtested on powerpc64le-linux-gnu P9,
x86_64-redhat-linux and aarch64-linux-gnu.

Is it ok for trunk?  Or does it belong to next stage 1?

BR,
Kewen

[1] https://gcc.gnu.org/legacy-ml/gcc-patches/2017-09/msg01433.html

gcc/ChangeLog:

PR tree-optimization/82255
* tree-vect-stmts.c (vector_vector_composition_type): Adjust function
location.
(struct strided_load_info): New structure.
(vect_get_strided_load_info): New function factored out from...
(vectorizable_load): ...this.  Call function
vect_get_strided_load_info accordingly.
(vect_model_load_cost): Call function vect_get_strided_load_info.

gcc/testsuite/ChangeLog:

2021-01-15  Bill Schmidt  
Kewen Lin  

PR tree-optimization/82255
* gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c: New test.

diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c 
b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c
new file mode 100644
index 000..aaeefc39595
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c
@@ -0,0 +1,32 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+
+
+/* PR82255: Ensure we don't require a vec_construct cost when we aren't
+   going to generate a strided load.  */
+
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__))
+__attribute__ ((__const__));
+
+static int
+foo (unsigned char *w, int i, unsigned char *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+{
+#pragma GCC unroll 16
+  for (int b = 0; b < 16; b++)
+   tot += abs (w[b] - x[b]);
+  w += i;
+  x += j;
+}
+  return tot;
+}
+
+void
+bar (unsigned char *w, unsigned char *x, int i, int *result)
+{
+  *result = foo (w, 16, x, i);
+}
+
+/* { dg-final { scan-tree-dump-times "vec_construct" 0 "vect" } } */
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 068e4982303..d1cbc55a676 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -897,6 +897,146 @@ cfun_returns (tree decl)
   return false;
 }
 
+/* Function VECTOR_VECTOR_COMPOSITION_TYPE
+
+   This function returns a vector type which can be composed with NELTS pieces,
+   whose type is recorded in PTYPE.  VTYPE should be a vector type, and has the
+   same vector size as the returned vector.  It first checks whether the target
+   supports a pieces-size vector mode for construction; if not, it further
+   checks a pieces-size scalar mode for construction.  It returns NULL_TREE if
+   it fails to find an available composition.
+
+   For example, for (vtype=V16QI, nelts=4), we can probably get:
+ - V16QI with PTYPE V4QI.
+ - V4SI with PTYPE SI.
+ - NULL_TREE.  */
+
+static tree
+vector_vector_composition_type (tree vtype, poly_uint64 nelts, tree *ptype)
+{
+  gcc_assert (VECTOR_TYPE_P (vtype));
+  gcc_assert (known_gt (nelts, 0U));
+
+  machine_mode vmode = TYPE_MODE (vtype);
+  if (!VECTOR_MODE_P (vmode))
+return NULL_TREE;
+
+  poly_uint64 vbsize = GET_MODE_BITSIZE (vmode);
+  unsigned int pbsize;
+  if (constant_multiple_p (vbsize, nelts, &pbsize))
+{
+  /* First check if vec_init optab supports construction from
+vector pieces directly.  */
+  scalar_mode elmode = SCALAR_TYPE_MODE (TREE_TYPE (vtype));
+  poly_uint64 inelts = pbsize / GET_MODE_BITSIZE (elmode);
+  machine_mode rmode;
+  if (related_vector_mode (vmode, elmode, inelts).exists (&rmode)
+ && (convert_optab_handler (vec_init_optab, vmode, rmode)
+ != CODE_FOR_nothing))
+   {
+ *ptype = build_vector_type (TREE_TYPE (vtype), inelts);
+ return vtype;
+   }
+
+  /* Otherwise check if exists an integer type of the same piece size and
+if vec_init optab supports construction from it directly.  */
+  if (int_mode_for_size (pbsize, 0).exists (&elmode)
+ && related_vector_mode (vmode, elmode, nelts).exists (&rmode)
+ && (convert_optab_handler (vec_init_optab, rmode, elmode)
+ != CODE_FOR_nothing))
+   {
+ *ptype = build_nonstandard_integer_type (pbsize, 1);
+ return build_vector_type (*ptype, nelts);
+   }
+}
+
+  return NULL_TREE;
+}
+
+/* Hold information for VMAT_ELEMENTWISE or VMAT_STRIDED_SLP strided
+   loads in function vectorizable_load.  */
+struct strided_load_info {
+  /* Number of loads required.  */
+  int nloads;
+  /* Number of vector units advanced for each load.  */
+  in

Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-01-15 Thread Richard Biener




On Thu, 14 Jan 2021, Qing Zhao wrote:


Hi, 
More data on code size and compilation time with CPU2017:

Compilation time data:   the numbers are the slowdown against the
default “no”:

benchmarks        A/no    D/no

500.perlbench_r   5.19%   1.95%
502.gcc_r         0.46%  -0.23%
505.mcf_r         0.00%   0.00%
520.omnetpp_r     0.85%   0.00%
523.xalancbmk_r   0.79%  -0.40%
525.x264_r       -4.48%   0.00%
531.deepsjeng_r  16.67%  16.67%
541.leela_r       0.00%   0.00%
557.xz_r          0.00%   0.00%

507.cactuBSSN_r   1.16%   0.58%
508.namd_r        9.62%   8.65%
510.parest_r      0.48%   1.19%
511.povray_r      3.70%   3.70%
519.lbm_r         0.00%   0.00%
521.wrf_r         0.05%   0.02%
526.blender_r     0.33%   1.32%
527.cam4_r       -0.93%  -0.93%
538.imagick_r     1.32%   3.95%
544.nab_r         0.00%   0.00%

From the above data, it looks like the compilation time impact of
implementations A and D is almost the same.
***code size data: the numbers are the code size increase against the
default “no”:
benchmarks        A/no    D/no

500.perlbench_r   2.84%   0.34%
502.gcc_r         2.59%   0.35%
505.mcf_r         3.55%   0.39%
520.omnetpp_r     0.54%   0.03%
523.xalancbmk_r   0.36%   0.39%
525.x264_r        1.39%   0.13%
531.deepsjeng_r   2.15%  -1.12%
541.leela_r       0.50%  -0.20%
557.xz_r          0.31%   0.13%

507.cactuBSSN_r   5.00%  -0.01%
508.namd_r        3.64%  -0.07%
510.parest_r      1.12%   0.33%
511.povray_r      4.18%   1.16%
519.lbm_r         8.83%   6.44%
521.wrf_r         0.08%   0.02%
526.blender_r     1.63%   0.45%
527.cam4_r        0.16%   0.06%
538.imagick_r     3.18%  -0.80%
544.nab_r         5.76%  -1.11%

Avg               2.52%   0.36%
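As a sanity check of the Avg row (this snippet is mine, not part of the original mail), the per-column means can be recomputed from the quoted data:

```python
# Code-size increases (%) copied from the table above: A/no and D/no.
a = [2.84, 2.59, 3.55, 0.54, 0.36, 1.39, 2.15, 0.50, 0.31,
     5.00, 3.64, 1.12, 4.18, 8.83, 0.08, 1.63, 0.16, 3.18, 5.76]
d = [0.34, 0.35, 0.39, 0.03, 0.39, 0.13, -1.12, -0.20, 0.13,
     -0.01, -0.07, 0.33, 1.16, 6.44, 0.02, 0.45, 0.06, -0.80, -1.11]

avg_a = round(sum(a) / len(a), 2)
avg_d = round(sum(d) / len(d), 2)
print(avg_a, avg_d)  # 2.52 0.36, matching the quoted Avg row
```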

From the above data, implementation D is always better than A; that's
surprising to me, and I'm not sure what the reason is.


D probably inhibits most interesting loop transforms (check SPEC FP
performance).  It will also most definitely disallow SRA which, when
an aggregate is not completely elided, tends to grow code.


For stack usage data, I added -fstack-usage to the compilation line when
compiling the CPU2017 benchmarks, so all the *.su files were generated
for each of the modules.
Since there are a lot of such files, with the stack size information
embedded in each of them, I just picked one benchmark, 511.povray, to
check; it is the one that has the most runtime overhead when adding
initialization (both A and D).

I identified all the *.su files that differ between A and D and diffed
them; it looks like the stack size is much higher with D than with A,
for example:

$ diff build_base_auto_init.D./bbox.su build_base_auto_init.A./bbox.su
5c5
< bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int) 160 static
---
> bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int) 96 static

$ diff build_base_auto_init.D./image.su build_base_auto_init.A./image.su
9c9
< image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*) 624 static
---
> image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*) 272 static
...
It looks like implementation D has more stack size impact than A.

Do you have any insight on what the reason for this?


D will keep all initialized aggregates as aggregates and live, which
means stack will be allocated for them.  With A the usual optimizations
to reduce stack usage can be applied.


Let me know if you have any comments and suggestions.


First of all I would check whether the prototype implementations
work as expected.

Richard.



thanks.

Qing
  On Jan 13, 2021, at 1:39 AM, Richard Biener 
  wrote:

  On Tue, 12 Jan 2021, Qing Zhao wrote:

Hi, 

Just check in to see whether you have any comments
and suggestions on this:

FYI, I have been continue with Approach D
implementation since last week:

D. Adding calls to .DEFERRED_INIT during
gimplification, expanding the .DEFERRED_INIT during
expand to
real initialization, and adjusting the uninitialized
pass for the new refs with ".DEFERRED_INIT".

For the remaining work of Approach D:

** complete the implementation of
-ftrivial-auto-var-init=pattern;
** complete the implementation of uninitialized
warnings maintenance work for D. 

I have completed the uninitialized warnings
maintenance work for D,
and finished part of the
-ftrivial-auto-var-init=pattern implementation.

The following are remaining work of Approach D:

  ** -ftrivial-auto-var-init=pattern for VLA;
  ** add a new attribute for variables:
__attribute((uninitialized))
the marked variable is intentionally uninitialized
for performance reasons.
  ** adding complete test cases;


Please let me know if you have any objection to my
current decision on implementing approach

Re: [PATCH] c-family, v2: Improve MEM_REF printing for diagnostics [PR98597]

2021-01-15 Thread Jakub Jelinek via Gcc-patches
On Thu, Jan 14, 2021 at 07:26:36PM +0100, Jakub Jelinek via Gcc-patches wrote:
> Is this ok for trunk if it passes bootstrap/regtest?

So, x86_64-linux bootstrap unfortunately broke due to the -march=i486
changes, but at least i686-linux bootstrap succeeded and shows 2
regressions.

One is on g++.dg/gomp/allocate-2.C, which used to print:
allocate-2.C:9:36: error: user defined reduction not found for ‘s’
but now prints:
allocate-2.C:9:36: error: user defined reduction not found for ‘*&s’
because of -O0 and therefore -fno-strict-aliasing.
The problem is that for !flag_strict_aliasing get_deref_alias_set returns 0
and so the:
&& get_deref_alias_set (TREE_OPERAND (e, 1)) == get_alias_set (op)
check fails.  So, shall the code use
&& (!flag_strict_aliasing
|| get_deref_alias_set (TREE_OPERAND (e, 1)) == get_alias_set (op))
instead, or
get_alias_set (TREE_TYPE (TREE_TYPE (TREE_OPERAND (e, 1))))
== get_alias_set (op)
?
The other is on gcc.dg/gomp/_Atomic-3.c test, where we used to print
_Atomic-3.c:22:34: error: ‘_Atomic’ ‘k’ in ‘reduction’ clause
but now print
_Atomic-3.c:22:34: error: ‘_Atomic’ ‘*(_Atomic int (*)[4])(&k[0])’ in 
‘reduction’ clause
Apparently in this case the C FE considers the two _Atomic int [4] types
incompatible, one is created through
c_build_qualified_type (type=, type_quals=8, 
orig_qual_type=, orig_qual_indirect=1)
on an int [4] type, i.e. adding _Atomic qualifier to an unqualified array
type, and the other is created through
build_array_type (elt_type=, 
index_type=, typeless_storage=false)
i.e. creating an array with _Atomic int elements.
That seems like a C FE bug to me.

Anyway, I can fix or workaround that by:
--- gcc/c/c-typeck.c.jj 2021-01-04 10:25:49.65329 +0100
+++ gcc/c/c-typeck.c2021-01-15 09:53:29.590611264 +0100
@@ -13979,7 +13979,9 @@ c_finish_omp_clauses (tree clauses, enum
  size = size_binop (MINUS_EXPR, size, size_one_node);
  size = save_expr (size);
  tree index_type = build_index_type (size);
- tree atype = build_array_type (type, index_type);
+ tree atype = build_array_type (TYPE_MAIN_VARIANT (type),
+index_type);
+ atype = c_build_qualified_type (atype, TYPE_QUALS (type));
  tree ptype = build_pointer_type (type);
  if (TREE_CODE (TREE_TYPE (t)) == ARRAY_TYPE)
t = build_fold_addr_expr (t);
and then we're back to the above allocate-2.C issue, i.e. at -O0
we still print *&k rather than k.

And another question is if in case we punted because of the TBAA check
we shouldn't just force printing the access type, so never print
*&k but print instead *(access type)&k.

Jakub



[PATCH][testsuite] (committed) Fix sed script errors in complex tests

2021-01-15 Thread Tamar Christina via Gcc-patches
Hi All,

I ran a sed script late over these tests, which accidentally
introduced a syntax error in them.

This fixes it.
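For reference, the repair could have been done with a sed command along these lines (a hypothetical reconstruction; the actual command wasn't posted). The earlier script had turned `int i=0` into `int i+=0` / `int i-=0`:

```shell
# Reproduce the damage in a scratch file, then undo it.
printf 'for (int i+=0; i < N; i++)\nfor (int i-=0; i < N; i++)\n' > /tmp/loops.c
sed -i 's/int i[+-]=0/int i=0/' /tmp/loops.c
cat /tmp/loops.c
```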

Committed under the obvious rule.

Regtested on aarch64-none-linux-gnu and no issues.


Thanks,
Tamar

gcc/testsuite/ChangeLog:

* gcc.dg/vect/complex/complex-mla-template.c: Fix sed.
* gcc.dg/vect/complex/complex-mls-template.c: Likewise.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/complex/complex-mla-template.c 
b/gcc/testsuite/gcc.dg/vect/complex/complex-mla-template.c
index 
8995e0a9f6bbfa535fa3630dc65bc3baad1016e5..4b5c42b29f1b40bc88c39e6baedd8c930c823dd1
 100644
--- a/gcc/testsuite/gcc.dg/vect/complex/complex-mla-template.c
+++ b/gcc/testsuite/gcc.dg/vect/complex/complex-mla-template.c
@@ -3,77 +3,77 @@
 void fma0 (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
   _Complex TYPE c[restrict N])
 {
-  for (int i+=0; i < N; i++)
+  for (int i=0; i < N; i++)
 c[i] += a[i] * b[i];
 }
 
 void fma90snd (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
   _Complex TYPE c[restrict N])
 {
-  for (int i+=0; i < N; i++)
+  for (int i=0; i < N; i++)
 c[i] += a[i] * (b[i] * I);
 }
 
 void fma180snd (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
_Complex TYPE c[restrict N])
 {
-  for (int i+=0; i < N; i++)
+  for (int i=0; i < N; i++)
 c[i] += a[i] * (b[i] * I * I);
 }
 
 void fma270snd (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
_Complex TYPE c[restrict N])
 {
-  for (int i+=0; i < N; i++)
+  for (int i=0; i < N; i++)
 c[i] += a[i] * (b[i] * I * I * I);
 }
 
 void fma90fst (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
   _Complex TYPE c[restrict N])
 {
-  for (int i+=0; i < N; i++)
+  for (int i=0; i < N; i++)
 c[i] += (a[i] * I) * b[i];
 }
 
 void fma180fst (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
_Complex TYPE c[restrict N])
 {
-  for (int i+=0; i < N; i++)
+  for (int i=0; i < N; i++)
 c[i] += (a[i] * I * I) * b[i];
 }
 
 void fma270fst (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
_Complex TYPE c[restrict N])
 {
-  for (int i+=0; i < N; i++)
+  for (int i=0; i < N; i++)
 c[i] += (a[i] * I * I * I) * b[i];
 }
 
 void fmaconjfst (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
 _Complex TYPE c[restrict N])
 {
-  for (int i+=0; i < N; i++)
+  for (int i=0; i < N; i++)
 c[i] += ~a[i] * b[i];
 }
 
 void fmaconjsnd (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
 _Complex TYPE c[restrict N])
 {
-  for (int i+=0; i < N; i++)
+  for (int i=0; i < N; i++)
 c[i] += a[i] * ~b[i];
 }
 
 void fmaconjboth (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
  _Complex TYPE c[restrict N])
 {
-  for (int i+=0; i < N; i++)
+  for (int i=0; i < N; i++)
 c[i] += ~a[i] * ~b[i];
 }
 
 void fma_elem (_Complex TYPE a[restrict N], _Complex TYPE b,
   _Complex TYPE c[restrict N])
 {
-  for (int i+=0; i < N; i++)
+  for (int i=0; i < N; i++)
 c[i] += a[i] * b;
 }
 
@@ -81,21 +81,21 @@ void fma_elem (_Complex TYPE a[restrict N], _Complex TYPE b,
 void fma_elemconjfst (_Complex TYPE a[restrict N], _Complex TYPE b,
  _Complex TYPE c[restrict N])
 {
-  for (int i+=0; i < N; i++)
+  for (int i=0; i < N; i++)
 c[i] += ~a[i] * b;
 }
 
 void fma_elemconjsnd (_Complex TYPE a[restrict N], _Complex TYPE b,
  _Complex TYPE c[restrict N])
 {
-  for (int i+=0; i < N; i++)
+  for (int i=0; i < N; i++)
 c[i] += a[i] * ~b;
 }
 
 void fma_elemconjboth (_Complex TYPE a[restrict N], _Complex TYPE b,
   _Complex TYPE c[restrict N])
 {
-  for (int i+=0; i < N; i++)
+  for (int i=0; i < N; i++)
 c[i] += ~a[i] * ~b;
 }
 
diff --git a/gcc/testsuite/gcc.dg/vect/complex/complex-mls-template.c 
b/gcc/testsuite/gcc.dg/vect/complex/complex-mls-template.c
index 
2940be46eaefbfb8224f999a2c3c78c95d46b41e..1954be8b06ad4db91d5ae4ced01aefaaf22b4071
 100644
--- a/gcc/testsuite/gcc.dg/vect/complex/complex-mls-template.c
+++ b/gcc/testsuite/gcc.dg/vect/complex/complex-mls-template.c
@@ -3,77 +3,77 @@
 void fms0 (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
   _Complex TYPE c[restrict N])
 {
-  for (int i-=0; i < N; i++)
+  for (int i=0; i < N; i++)
 c[i] -= a[i] * b[i];
 }
 
 void fms90snd (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
   _Complex TYPE c[restrict N])
 {
-  for (int i-=0; i < N; i++)
+  for (int i=0; i < N; i++)
 c[i] -= a[i] * (b[i] * I);
 }
 
 void fms180snd (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
_Complex TYPE c[restrict N])
 {
-  for (int i-=0; i < N; i++)
+  for (int i=0; i < N; i++)
 c[i] -= a[i] * (b[i] * I * I);
 }
 
 void fms270snd (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
 

Re: [PATCH 2/3] arm: Auto-vectorization for MVE: vshl

2021-01-15 Thread Christophe Lyon via Gcc-patches
ping^3?

On Thu, 7 Jan 2021 at 13:20, Christophe Lyon  wrote:
>
> ping^2?
>
> On Wed, 30 Dec 2020 at 11:34, Christophe Lyon
>  wrote:
> >
> > ping?
> >
> > On Thu, 17 Dec 2020 at 18:48, Christophe Lyon
> >  wrote:
> > >
> > > This patch enables MVE vshlq instructions for auto-vectorization.
> > >
> > > The existing mve_vshlq_n_ is kept, as it takes a single
> > > immediate as second operand, and is used by arm_mve.h.
> > >
> > > We move the vashl3 insn from neon.md to an expander in
> > > vec-common.md, and the mve_vshlq_ insn from mve.md to
> > > vec-common.md, adding the second alternative from neon.md.
> > >
> > > mve_vshlq_ will be used by a later patch enabling
> > > vectorization for vshr, as a unified version of
> > > ashl3_[signed|unsigned] from neon.md. Keeping the use of unspec
> > > VSHLQ makes it possible to generate both 's' and 'u' variants.
> > >
> > > It is not clear whether the neon_shift_[reg|imm] attribute is still
> > > suitable, since this insn is also used for MVE.
> > >
> > > I kept the mve_vshlq_ naming instead of renaming it to
> > > ashl3__ as discussed because the reference in
> > > arm_mve_builtins.def automatically inserts the "mve_" prefix and I
> > > didn't want to make a special case for this.
> > >
> > > I haven't yet found why the v16qi and v8hi tests are not vectorized.
> > > With dest[i] = a[i] << b[i] and:
> > >   {
> > > int i;
> > > unsigned int i.24_1;
> > > unsigned int _2;
> > > int16_t * _3;
> > > short int _4;
> > > int _5;
> > > int16_t * _6;
> > > short int _7;
> > > int _8;
> > > int _9;
> > > int16_t * _10;
> > > short int _11;
> > > unsigned int ivtmp_42;
> > > unsigned int ivtmp_43;
> > >
> > >  [local count: 119292720]:
> > >
> > >  [local count: 954449105]:
> > > i.24_1 = (unsigned int) i_23;
> > > _2 = i.24_1 * 2;
> > > _3 = a_15(D) + _2;
> > > _4 = *_3;
> > > _5 = (int) _4;
> > > _6 = b_16(D) + _2;
> > > _7 = *_6;
> > > _8 = (int) _7;
> > > _9 = _5 << _8;
> > > _10 = dest_17(D) + _2;
> > > _11 = (short int) _9;
> > > *_10 = _11;
> > > i_19 = i_23 + 1;
> > > ivtmp_42 = ivtmp_43 - 1;
> > > if (ivtmp_42 != 0)
> > >   goto ; [87.50%]
> > > else
> > >   goto ; [12.50%]
> > >
> > >  [local count: 835156386]:
> > > goto ; [100.00%]
> > >
> > >  [local count: 119292720]:
> > > return;
> > >
> > >   }
> > > the vectorizer says:
> > > mve-vshl.c:37:96: note:   ==> examining statement: _5 = (int) _4;
> > > mve-vshl.c:37:96: note:   vect_is_simple_use: operand *_3, type of def: 
> > > internal
> > > mve-vshl.c:37:96: note:   vect_is_simple_use: vectype vector(8) short int
> > > mve-vshl.c:37:96: missed:   conversion not supported by target.
> > > mve-vshl.c:37:96: note:   vect_is_simple_use: operand *_3, type of def: 
> > > internal
> > > mve-vshl.c:37:96: note:   vect_is_simple_use: vectype vector(8) short int
> > > mve-vshl.c:37:96: note:   vect_is_simple_use: operand *_3, type of def: 
> > > internal
> > > mve-vshl.c:37:96: note:   vect_is_simple_use: vectype vector(8) short int
> > > mve-vshl.c:37:117: missed:   not vectorized: relevant stmt not supported: 
> > > _5 = (int) _4;
> > > mve-vshl.c:37:96: missed:  bad operation or unsupported loop bound.
> > > mve-vshl.c:37:96: note:  * Analysis failed with vector mode V8HI
> > >
> > > 2020-12-03  Christophe Lyon  
> > >
> > > gcc/
> > > * config/arm/mve.md (mve_vshlq_): Move to
> > > vec-commond.md.
> > > * config/arm/neon.md (vashl3): Delete.
> > > * config/arm/vec-common.md (mve_vshlq_): New.
> > > (vashl3): New expander.
> > >
> > > gcc/testsuite/
> > > * gcc.target/arm/simd/mve-vshl.c: Add tests for vshl.
> > > ---
> > >  gcc/config/arm/mve.md| 13 +-
> > >  gcc/config/arm/neon.md   | 19 -
> > >  gcc/config/arm/vec-common.md | 30 ++
> > >  gcc/testsuite/gcc.target/arm/simd/mve-vshl.c | 62 
> > > 
> > >  4 files changed, 93 insertions(+), 31 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
> > >
> > > diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> > > index 673a83c..8bdb451 100644
> > > --- a/gcc/config/arm/mve.md
> > > +++ b/gcc/config/arm/mve.md
> > > @@ -822,18 +822,7 @@ (define_insn "mve_vcmpneq_"
> > >
> > >  ;;
> > >  ;; [vshlq_s, vshlq_u])
> > > -;;
> > > -(define_insn "mve_vshlq_"
> > > -  [
> > > -   (set (match_operand:MVE_2 0 "s_register_operand" "=w")
> > > -   (unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
> > > -  (match_operand:MVE_2 2 "s_register_operand" "w")]
> > > -VSHLQ))
> > > -  ]
> > > -  "TARGET_HAVE_MVE"
> > > -  "vshl.%#\t%q0, %q1, %q2"
> > > -  [(set_attr "type" "mve_move")
> > > -])
> > > +;; See vec-common.md
> > >
> > >  ;;
> > >  ;; [vabdq_s, vabdq_u])
> > > diff --git a

Re: [PATCH 3/3] arm: Auto-vectorization for MVE: vshr

2021-01-15 Thread Christophe Lyon via Gcc-patches
ping^3?

On Thu, 7 Jan 2021 at 13:20, Christophe Lyon  wrote:
>
> ping^2?
>
> On Wed, 30 Dec 2020 at 11:34, Christophe Lyon
>  wrote:
> >
> > ping?
> >
> > On Thu, 17 Dec 2020 at 18:48, Christophe Lyon
> >  wrote:
> > >
> > > This patch enables MVE vshr instructions for auto-vectorization.  New
> > > MVE patterns are introduced that take a vector of constants as second
> > > operand, all constants being equal.
> > >
> > > The existing mve_vshrq_n_ is kept, as it takes a single
> > > immediate as second operand, and is used by arm_mve.h.
> > >
> > > The vashr3 and vlshr3 expanders are moved from neon.md to
> > > vec-common.md, updated to rely on the normal expansion scheme to
> > > generate shifts by immediate.
> > >
> > > 2020-12-03  Christophe Lyon  
> > >
> > > gcc/
> > > * config/arm/mve.md (mve_vshrq_n_s_imm): New entry.
> > > (mve_vshrq_n_u_imm): Likewise.
> > > * config/arm/neon.md (vashr3, vlshr3): Move to ...
> > > * config/arm/vec-common.md: ... here.
> > >
> > > gcc/testsuite/
> > > * gcc.target/arm/simd/mve-vshr.c: Add tests for vshr.
> > > ---
> > >  gcc/config/arm/mve.md| 34 
> > >  gcc/config/arm/neon.md   | 34 
> > >  gcc/config/arm/vec-common.md | 38 +-
> > >  gcc/testsuite/gcc.target/arm/simd/mve-vshr.c | 59 
> > > 
> > >  4 files changed, 130 insertions(+), 35 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> > >
> > > diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> > > index 8bdb451..eea8b20 100644
> > > --- a/gcc/config/arm/mve.md
> > > +++ b/gcc/config/arm/mve.md
> > > @@ -763,6 +763,7 @@ (define_insn "mve_vcreateq_"
> > >  ;;
> > >  ;; [vshrq_n_s, vshrq_n_u])
> > >  ;;
> > > +;; Version that takes an immediate as operand 2.
> > >  (define_insn "mve_vshrq_n_"
> > >[
> > > (set (match_operand:MVE_2 0 "s_register_operand" "=w")
> > > @@ -775,6 +776,39 @@ (define_insn "mve_vshrq_n_"
> > >[(set_attr "type" "mve_move")
> > >  ])
> > >
> > > +;; Versions that take constant vectors as operand 2 (with all elements
> > > +;; equal).
> > > +(define_insn "mve_vshrq_n_s_imm"
> > > +  [
> > > +   (set (match_operand:MVE_2 0 "s_register_operand" "=w")
> > > +   (ashiftrt:MVE_2 (match_operand:MVE_2 1 "s_register_operand" "w")
> > > +   (match_operand:MVE_2 2 
> > > "imm_for_neon_rshift_operand" "i")))
> > > +  ]
> > > +  "TARGET_HAVE_MVE"
> > > +  {
> > > +return neon_output_shift_immediate ("vshr", 's', &operands[2],
> > > +   mode,
> > > +   VALID_NEON_QREG_MODE (mode),
> > > +   true);
> > > +  }
> > > +  [(set_attr "type" "mve_move")
> > > +])
> > > +(define_insn "mve_vshrq_n_u_imm"
> > > +  [
> > > +   (set (match_operand:MVE_2 0 "s_register_operand" "=w")
> > > +   (lshiftrt:MVE_2 (match_operand:MVE_2 1 "s_register_operand" "w")
> > > +   (match_operand:MVE_2 2 
> > > "imm_for_neon_rshift_operand" "i")))
> > > +  ]
> > > +  "TARGET_HAVE_MVE"
> > > +  {
> > > +return neon_output_shift_immediate ("vshr", 'u', &operands[2],
> > > +   mode,
> > > +   VALID_NEON_QREG_MODE (mode),
> > > +   true);
> > > +  }
> > > +  [(set_attr "type" "mve_move")
> > > +])
> > > +
> > >  ;;
> > >  ;; [vcvtq_n_from_f_s, vcvtq_n_from_f_u])
> > >  ;;
> > > diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> > > index ac9bf74..a0e8d7a 100644
> > > --- a/gcc/config/arm/neon.md
> > > +++ b/gcc/config/arm/neon.md
> > > @@ -899,40 +899,6 @@ (define_insn "ashl3_unsigned"
> > >[(set_attr "type" "neon_shift_reg")]
> > >  )
> > >
> > > -(define_expand "vashr3"
> > > -  [(set (match_operand:VDQIW 0 "s_register_operand")
> > > -   (ashiftrt:VDQIW (match_operand:VDQIW 1 "s_register_operand")
> > > -   (match_operand:VDQIW 2 
> > > "imm_rshift_or_reg_neon")))]
> > > -  "TARGET_NEON"
> > > -{
> > > -  if (s_register_operand (operands[2], mode))
> > > -{
> > > -  rtx neg = gen_reg_rtx (mode);
> > > -  emit_insn (gen_neon_neg2 (neg, operands[2]));
> > > -  emit_insn (gen_ashl3_signed (operands[0], operands[1], neg));
> > > -}
> > > -  else
> > > -emit_insn (gen_vashr3_imm (operands[0], operands[1], 
> > > operands[2]));
> > > -  DONE;
> > > -})
> > > -
> > > -(define_expand "vlshr3"
> > > -  [(set (match_operand:VDQIW 0 "s_register_operand")
> > > -   (lshiftrt:VDQIW (match_operand:VDQIW 1 "s_register_operand")
> > > -   (match_operand:VDQIW 2 
> > > "imm_rshift_or_reg_neon")))]
> > > -  "TARGET_NEON"
> > > -{
> > > -  if (s_register_operand (operands[2], mode))
> > > -{
> > > -  rtx neg = gen_reg_rtx (mode);
> > > -  emit_

RE: [PATCH 2/3] arm: Auto-vectorization for MVE: vshl

2021-01-15 Thread Kyrylo Tkachov via Gcc-patches



> -Original Message-
> From: Gcc-patches  On Behalf Of
> Christophe Lyon via Gcc-patches
> Sent: 17 December 2020 17:48
> To: gcc-patches@gcc.gnu.org
> Subject: [PATCH 2/3] arm: Auto-vectorization for MVE: vshl
> 
> This patch enables MVE vshlq instructions for auto-vectorization.
> 
> The existing mve_vshlq_n_ is kept, as it takes a single
> immediate as second operand, and is used by arm_mve.h.
> 
> We move the vashl3 insn from neon.md to an expander in
> vec-common.md, and the mve_vshlq_ insn from mve.md to
> vec-common.md, adding the second alternative from neon.md.
> 
> mve_vshlq_ will be used by a later patch enabling
> vectorization for vshr, as a unified version of
> ashl3_[signed|unsigned] from neon.md. Keeping the use of unspec
> VSHLQ makes it possible to generate both 's' and 'u' variants.
> 
> It is not clear whether the neon_shift_[reg|imm] attribute is still
> suitable, since this insn is also used for MVE.
> 
> I kept the mve_vshlq_ naming instead of renaming it to
> ashl3__ as discussed because the reference in
> arm_mve_builtins.def automatically inserts the "mve_" prefix and I
> didn't want to make a special case for this.
> 
> I haven't yet found why the v16qi and v8hi tests are not vectorized.
> With dest[i] = a[i] << b[i] and:
>   {
> int i;
> unsigned int i.24_1;
> unsigned int _2;
> int16_t * _3;
> short int _4;
> int _5;
> int16_t * _6;
> short int _7;
> int _8;
> int _9;
> int16_t * _10;
> short int _11;
> unsigned int ivtmp_42;
> unsigned int ivtmp_43;
> 
>  [local count: 119292720]:
> 
>  [local count: 954449105]:
> i.24_1 = (unsigned int) i_23;
> _2 = i.24_1 * 2;
> _3 = a_15(D) + _2;
> _4 = *_3;
> _5 = (int) _4;
> _6 = b_16(D) + _2;
> _7 = *_6;
> _8 = (int) _7;
> _9 = _5 << _8;
> _10 = dest_17(D) + _2;
> _11 = (short int) _9;
> *_10 = _11;
> i_19 = i_23 + 1;
> ivtmp_42 = ivtmp_43 - 1;
> if (ivtmp_42 != 0)
>   goto ; [87.50%]
> else
>   goto ; [12.50%]
> 
>  [local count: 835156386]:
> goto ; [100.00%]
> 
>  [local count: 119292720]:
> return;
> 
>   }
> the vectorizer says:
> mve-vshl.c:37:96: note:   ==> examining statement: _5 = (int) _4;
> mve-vshl.c:37:96: note:   vect_is_simple_use: operand *_3, type of def:
> internal
> mve-vshl.c:37:96: note:   vect_is_simple_use: vectype vector(8) short int
> mve-vshl.c:37:96: missed:   conversion not supported by target.
> mve-vshl.c:37:96: note:   vect_is_simple_use: operand *_3, type of def:
> internal
> mve-vshl.c:37:96: note:   vect_is_simple_use: vectype vector(8) short int
> mve-vshl.c:37:96: note:   vect_is_simple_use: operand *_3, type of def:
> internal
> mve-vshl.c:37:96: note:   vect_is_simple_use: vectype vector(8) short int
> mve-vshl.c:37:117: missed:   not vectorized: relevant stmt not supported: _5
> = (int) _4;
> mve-vshl.c:37:96: missed:  bad operation or unsupported loop bound.
> mve-vshl.c:37:96: note:  * Analysis failed with vector mode V8HI
> 

Can you file a bug report once this is committed so we can revisit in the
future, please?

> 2020-12-03  Christophe Lyon  
> 
>   gcc/
>   * config/arm/mve.md (mve_vshlq_): Move to
>   vec-common.md.
>   * config/arm/neon.md (vashl3): Delete.
>   * config/arm/vec-common.md (mve_vshlq_): New.
>   (vashl3): New expander.
> 
>   gcc/testsuite/
>   * gcc.target/arm/simd/mve-vshl.c: Add tests for vshl.
> ---
>  gcc/config/arm/mve.md| 13 +-
>  gcc/config/arm/neon.md   | 19 -
>  gcc/config/arm/vec-common.md | 30 ++
>  gcc/testsuite/gcc.target/arm/simd/mve-vshl.c | 62
> 
>  4 files changed, 93 insertions(+), 31 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
> 
> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> index 673a83c..8bdb451 100644
> --- a/gcc/config/arm/mve.md
> +++ b/gcc/config/arm/mve.md
> @@ -822,18 +822,7 @@ (define_insn "mve_vcmpneq_"
> 
>  ;;
>  ;; [vshlq_s, vshlq_u])
> -;;
> -(define_insn "mve_vshlq_"
> -  [
> -   (set (match_operand:MVE_2 0 "s_register_operand" "=w")
> - (unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
> -(match_operand:MVE_2 2 "s_register_operand" "w")]
> -  VSHLQ))
> -  ]
> -  "TARGET_HAVE_MVE"
> -  "vshl.%#\t%q0, %q1, %q2"
> -  [(set_attr "type" "mve_move")
> -])
> +;; See vec-common.md
> 
>  ;;
>  ;; [vabdq_s, vabdq_u])
> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> index 50220be..ac9bf74 100644
> --- a/gcc/config/arm/neon.md
> +++ b/gcc/config/arm/neon.md
> @@ -845,25 +845,6 @@ (define_insn "*smax3_neon"
>  ; generic vectorizer code.  It ends up creating a V2DI constructor with
>  ; SImode elements.
> 
> -(define_insn "vashl3"
> -  [(set (match_operand:VDQIW 0 "s_register_operand" "=w,w")
> - (ashift:VDQIW (match_operand:VDQIW 1 "s_register

RE: [PATCH 3/3] arm: Auto-vectorization for MVE: vshr

2021-01-15 Thread Kyrylo Tkachov via Gcc-patches



> -Original Message-
> From: Gcc-patches  On Behalf Of
> Christophe Lyon via Gcc-patches
> Sent: 17 December 2020 17:48
> To: gcc-patches@gcc.gnu.org
> Subject: [PATCH 3/3] arm: Auto-vectorization for MVE: vshr
> 
> This patch enables MVE vshr instructions for auto-vectorization.  New
> MVE patterns are introduced that take a vector of constants as second
> operand, all constants being equal.
> 
> The existing mve_vshrq_n_ is kept, as it takes a single
> immediate as second operand, and is used by arm_mve.h.
> 
> The vashr3 and vlshr3 expanders are moved from neon.md to
> vec-common.md, updated to rely on the normal expansion scheme to
> generate shifts by immediate.

Ok.
Thanks,
Kyrill

> 
> 2020-12-03  Christophe Lyon  
> 
>   gcc/
>   * config/arm/mve.md (mve_vshrq_n_s_imm): New entry.
>   (mve_vshrq_n_u_imm): Likewise.
>   * config/arm/neon.md (vashr3, vlshr3): Move to ...
>   * config/arm/vec-common.md: ... here.
> 
>   gcc/testsuite/
>   * gcc.target/arm/simd/mve-vshr.c: Add tests for vshr.
> ---
>  gcc/config/arm/mve.md| 34 
>  gcc/config/arm/neon.md   | 34 
>  gcc/config/arm/vec-common.md | 38 +-
>  gcc/testsuite/gcc.target/arm/simd/mve-vshr.c | 59
> 
>  4 files changed, 130 insertions(+), 35 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vshr.c
> 
> diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> index 8bdb451..eea8b20 100644
> --- a/gcc/config/arm/mve.md
> +++ b/gcc/config/arm/mve.md
> @@ -763,6 +763,7 @@ (define_insn "mve_vcreateq_"
>  ;;
>  ;; [vshrq_n_s, vshrq_n_u])
>  ;;
> +;; Version that takes an immediate as operand 2.
>  (define_insn "mve_vshrq_n_"
>[
> (set (match_operand:MVE_2 0 "s_register_operand" "=w")
> @@ -775,6 +776,39 @@ (define_insn "mve_vshrq_n_"
>[(set_attr "type" "mve_move")
>  ])
> 
> +;; Versions that take constant vectors as operand 2 (with all elements
> +;; equal).
> +(define_insn "mve_vshrq_n_s_imm"
> +  [
> +   (set (match_operand:MVE_2 0 "s_register_operand" "=w")
> + (ashiftrt:MVE_2 (match_operand:MVE_2 1 "s_register_operand" "w")
> + (match_operand:MVE_2 2
> "imm_for_neon_rshift_operand" "i")))
> +  ]
> +  "TARGET_HAVE_MVE"
> +  {
> +return neon_output_shift_immediate ("vshr", 's', &operands[2],
> + mode,
> + VALID_NEON_QREG_MODE
> (mode),
> + true);
> +  }
> +  [(set_attr "type" "mve_move")
> +])
> +(define_insn "mve_vshrq_n_u_imm"
> +  [
> +   (set (match_operand:MVE_2 0 "s_register_operand" "=w")
> + (lshiftrt:MVE_2 (match_operand:MVE_2 1 "s_register_operand" "w")
> + (match_operand:MVE_2 2
> "imm_for_neon_rshift_operand" "i")))
> +  ]
> +  "TARGET_HAVE_MVE"
> +  {
> +return neon_output_shift_immediate ("vshr", 'u', &operands[2],
> + mode,
> + VALID_NEON_QREG_MODE
> (mode),
> + true);
> +  }
> +  [(set_attr "type" "mve_move")
> +])
> +
>  ;;
>  ;; [vcvtq_n_from_f_s, vcvtq_n_from_f_u])
>  ;;
> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> index ac9bf74..a0e8d7a 100644
> --- a/gcc/config/arm/neon.md
> +++ b/gcc/config/arm/neon.md
> @@ -899,40 +899,6 @@ (define_insn "ashl3_unsigned"
>[(set_attr "type" "neon_shift_reg")]
>  )
> 
> -(define_expand "vashr3"
> -  [(set (match_operand:VDQIW 0 "s_register_operand")
> - (ashiftrt:VDQIW (match_operand:VDQIW 1 "s_register_operand")
> - (match_operand:VDQIW 2
> "imm_rshift_or_reg_neon")))]
> -  "TARGET_NEON"
> -{
> -  if (s_register_operand (operands[2], mode))
> -{
> -  rtx neg = gen_reg_rtx (mode);
> -  emit_insn (gen_neon_neg2 (neg, operands[2]));
> -  emit_insn (gen_ashl3_signed (operands[0], operands[1], neg));
> -}
> -  else
> -emit_insn (gen_vashr3_imm (operands[0], operands[1],
> operands[2]));
> -  DONE;
> -})
> -
> -(define_expand "vlshr3"
> -  [(set (match_operand:VDQIW 0 "s_register_operand")
> - (lshiftrt:VDQIW (match_operand:VDQIW 1 "s_register_operand")
> - (match_operand:VDQIW 2
> "imm_rshift_or_reg_neon")))]
> -  "TARGET_NEON"
> -{
> -  if (s_register_operand (operands[2], mode))
> -{
> -  rtx neg = gen_reg_rtx (mode);
> -  emit_insn (gen_neon_neg2 (neg, operands[2]));
> -  emit_insn (gen_ashl3_unsigned (operands[0], operands[1], neg));
> -}
> -  else
> -emit_insn (gen_vlshr3_imm (operands[0], operands[1],
> operands[2]));
> -  DONE;
> -})
> -
>  ;; 64-bit shifts
> 
>  ;; This pattern loads a 32-bit shift count into a 64-bit NEON register,
> diff --git a/gcc/config/arm/vec-common.md b/gcc/config/arm/vec-
> common.md
> index 3a282f0..e126557 100644
> --- a/gcc/config/arm/vec-common.md
> +++ b/gcc/config/arm

Re: Add dg-require-wchars to libstdc++ testsuite

2021-01-15 Thread Jonathan Wakely via Gcc-patches
On Thu, 14 Jan 2021, 22:22 Alexandre Oliva,  wrote:

> On Jan 14, 2021, Jonathan Wakely  wrote:
>
> > The problem is that <locale> uses wchar_t in default
> > template arguments:
>
> > I think we should fix the header, not disable tests that don't use
> > that default template argument. The attached patch should allow you to
> > use wstring_convert and wbuffer_convert without wchar_t support. The
> > tests which instantiate it with char16_t or char32_t instead of
> > wchar_t should work with this patch, right?
>
> Thanks, I'll give it a spin.  That said, ...
>
> > 
> > Is it the case that the wchar_t type is defined on this target, it's
> > just that libc doesn't have support for wcslen etc?
>
> ... it is definitely the case that the target currently defines wchar_t,
> and it even offers wchar.h and a lot of (maybe all?) wcs* functions.
> This was likely not the case when the patch was first written.
>
> I'll double check whether any of the patch is still needed for current
> versions.
>
> I figured it would be a waste to just discard Corentin's identification of
> testcases that failed when glibc wchar_t support was not enabled.
>

Definitely not a waste, as it's led to this discussion and plan for
improvement.



> This also means that the test results I'm going to get are likely to not
> reflect the conditions for which these patches were originally written.
>
>
> FWIW, I like very much the notion of offering a fallback wchar_t
> implementation within libstdc++-v3, so that users get the expected C++
> functionality even when libc doesn't offer it.  Even a (conditional?)
> typedef to introduce wchar_t could be there.
>
> Perhaps the test that sets or clears _GLIBCXX_USE_WCHAR_T should be used
> to decide whether or not to offer a wchar.h header in libstdc++, and
> then (pipe dream?) all other uses of this macro would be just gone?
>


That would be great. We might be able to get close even if not all the way
there. However, some small embedded systems might not want the extra
symbols for explicit instantiations of std::wstring, std::wistream in
libstdc++.so so we might want a way to suppress them (they could still
instantiate those templates implicitly by using them, we just wouldn't have
them pre-instantiated in the library).

Anyway, let's start just by making wstring_convert usable without wchar_t.


[PATCH] tree-optimization/98685 - fix placement of extern converts

2021-01-15 Thread Richard Biener
Avoid advancing to the next stmt when inserting at region boundary
and deal with a vector def not being the only child.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

2021-01-15  Richard Biener  

PR tree-optimization/98685
* tree-vect-slp.c (vect_schedule_slp_node): Refactor handling
of vector extern defs.

* gcc.dg/vect/bb-slp-pr98685.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-pr98685.c | 15 +++
 gcc/tree-vect-slp.c| 11 ---
 2 files changed, 23 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-pr98685.c

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr98685.c 
b/gcc/testsuite/gcc.dg/vect/bb-slp-pr98685.c
new file mode 100644
index 000..b213335da78
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr98685.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3" } */
+
+char onelock_lock[16];
+void write(void);
+
+void lockit(int count) {
+  for (; count;) {
+int pid, i;
+char *p;
+for (i = 0, p = (char *)&pid; i < sizeof 0; i++)
+  onelock_lock[i] = *p++;
+write();
+  }
+}
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 6b6c9ccc0a0..1787ad74268 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -5915,6 +5915,7 @@ vect_schedule_slp_node (vec_info *vinfo,
   /* Emit other stmts after the children vectorized defs which is
 earliest possible.  */
   gimple *last_stmt = NULL;
+  bool seen_vector_def = false;
   FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
if (SLP_TREE_DEF_TYPE (child) == vect_internal_def)
  {
@@ -5966,8 +5967,7 @@ vect_schedule_slp_node (vec_info *vinfo,
   we do not insert before the region boundary.  */
if (SLP_TREE_SCALAR_OPS (child).is_empty ()
&& !vinfo->lookup_def (SLP_TREE_VEC_DEFS (child)[0]))
- last_stmt = gsi_stmt (gsi_after_labels
- (as_a  (vinfo)->bbs[0]));
+ seen_vector_def = true;
else
  {
unsigned j;
@@ -5987,7 +5987,12 @@ vect_schedule_slp_node (vec_info *vinfo,
 constants.  */
   if (!last_stmt)
last_stmt = vect_find_first_scalar_stmt_in_slp (node)->stmt;
-  if (is_a  (last_stmt))
+  if (!last_stmt)
+   {
+ gcc_assert (seen_vector_def);
+ si = gsi_after_labels (as_a  (vinfo)->bbs[0]);
+   }
+  else if (is_a  (last_stmt))
si = gsi_after_labels (gimple_bb (last_stmt));
   else
{
-- 
2.26.2


Re: [PATCH] c-family, v2: Improve MEM_REF printing for diagnostics [PR98597]

2021-01-15 Thread Richard Biener
On Thu, 14 Jan 2021, Jakub Jelinek wrote:

> On Thu, Jan 14, 2021 at 10:49:42AM -0700, Martin Sebor wrote:
> > > In the light of Martins patch this is probably reasonable but still
> > > the general direction is wrong (which is why I didn't approve Martins
> > > original patch).  I'm also somewhat disappointed we're breaking this
> > > so late in the cycle.
> > 
> > So am I.  I didn't test this change as exhaustively as I could and
> > (in light of the poor test coverage) should have.  That's my bad.
> > FWIW, I did do it for the first patch (by instrumenting GCC and
> > formatting every MEM_REF it came across), but it didn't occur to
> > me to do it this time around.  I have now completed this testing
> > (it found one more ICE elsewhere that I'll fix soon).
> 
> Ok, here is an updated patch which fixes what I found, and implements what
> has been discussed on the mailing list and on IRC, i.e. if the types
> are compatible as well as alias sets are same, then it prints
> what c_fold_indirect_ref_for_warn managed to create, otherwise it uses
> that info for printing offsets using offsetof (except when it starts
> with ARRAY_REFs, because one can't have offsetof (struct T[2][2], [1][0].x.y)).
> 
> The uninit-38.c test (which was the only one I believe which had tests on the
> exact spelling of MEM_REF printing) contains mainly changes to have space
> before * for pointer types (as that is how the C pretty-printers normally
> print types, int * rather than int*), plus what might be considered a
> regression from what Martin printed, but it is actually a correctness fix.
> 
> When the arg is a pointer with type pointer to VLA with char element type
> (let's say the pointer is p), which is what happens in several of the
> uninit-38.c tests, omitting the (char *) cast is incorrect, as p + 1
> is not the 1 byte after p, but pointer to the end of the VLA.
> It only happened to work because of the hacks (which I don't like at all
> and are dangerous, DECL_ARTIFICIAL var names with dot inside can be pretty
> much anything, e.g. a lot of passes construct their helper vars from some
> prefix that designates intended use of the var plus numeric suffix), where
> the a.1 pointer to VLA is printed as a which if one is lucky happens to be
> a variable with VLA type (rather than pointer to it), and for such vars
> a + 1 is indeed &a[0] + 1 rather than &a + 1.  But if we want to do this
> reliably, we'd need to make sure it comes from VLA (e.g. verify that the
> SSA_NAME is defined to __builtin_alloca_with_align and that there exists
> a corresponding VAR_DECL with DECL_VALUE_EXPR that has the a.1 variable
> in it).
> 
> Is this ok for trunk if it passes bootstrap/regtest?

OK.

Thanks,
Richard.

> 2021-01-14  Jakub Jelinek  
> 
>   PR tree-optimization/98597
>   * c-pretty-print.c (c_fold_indirect_ref_for_warn): New function.
>   (print_mem_ref): Use it.  If it returns something that has compatible
>   type and is TBAA compatible with zero offset, print it and return,
>   otherwise print it using offsetof syntax or array ref syntax.  Fix up
>   printing if MEM_REFs first operand is ADDR_EXPR, or when the first
>   argument has pointer to array type.  Print pointers using the standard
>   formatting.
> 
>   * gcc.dg/uninit-38.c: Expect a space in between type name and asterisk.
>   Expect for now a (char *) cast for VLAs.
>   * gcc.dg/uninit-40.c: New test.
> 
> --- gcc/c-family/c-pretty-print.c.jj  2021-01-13 15:27:09.822834600 +0100
> +++ gcc/c-family/c-pretty-print.c 2021-01-14 19:02:21.299138891 +0100
> @@ -1809,6 +1809,113 @@ pp_c_call_argument_list (c_pretty_printe
>pp_c_right_paren (pp);
>  }
>  
> +/* Try to fold *(type *)&op into op.fld.fld2[1] if possible.
> +   Only used for printing expressions.  Should punt if ambiguous
> +   (e.g. in unions).  */
> +
> +static tree
> +c_fold_indirect_ref_for_warn (location_t loc, tree type, tree op,
> +   offset_int &off)
> +{
> +  tree optype = TREE_TYPE (op);
> +  if (off == 0)
> +{
> +  if (lang_hooks.types_compatible_p (optype, type))
> + return op;
> +  /* *(foo *)&complexfoo => __real__ complexfoo */
> +  else if (TREE_CODE (optype) == COMPLEX_TYPE
> +&& lang_hooks.types_compatible_p (type, TREE_TYPE (optype)))
> + return build1_loc (loc, REALPART_EXPR, type, op);
> +}
> +  /* ((foo*)&complexfoo)[1] => __imag__ complexfoo */
> +  else if (TREE_CODE (optype) == COMPLEX_TYPE
> +&& lang_hooks.types_compatible_p (type, TREE_TYPE (optype))
> +&& tree_to_uhwi (TYPE_SIZE_UNIT (type)) == off)
> +{
> +  off = 0;
> +  return build1_loc (loc, IMAGPART_EXPR, type, op);
> +}
> +  /* ((foo *)&fooarray)[x] => fooarray[x] */
> +  if (TREE_CODE (optype) == ARRAY_TYPE
> +  && TYPE_SIZE_UNIT (TREE_TYPE (optype))
> +  && TREE_CODE (TYPE_SIZE_UNIT (TREE_TYPE (optype))) == INTEGER_CST
> +  && !integer_zerop (TYPE_SIZE_UNIT (TREE_TYPE (op

Re: [PATCH 2/3] arm: Auto-vectorization for MVE: vshl

2021-01-15 Thread Christophe Lyon via Gcc-patches
On Fri, 15 Jan 2021 at 10:42, Kyrylo Tkachov  wrote:
>
>
>
> > -Original Message-
> > From: Gcc-patches  On Behalf Of
> > Christophe Lyon via Gcc-patches
> > Sent: 17 December 2020 17:48
> > To: gcc-patches@gcc.gnu.org
> > Subject: [PATCH 2/3] arm: Auto-vectorization for MVE: vshl
> >
> > This patch enables MVE vshlq instructions for auto-vectorization.
> >
> > The existing mve_vshlq_n_ is kept, as it takes a single
> > immediate as second operand, and is used by arm_mve.h.
> >
> > We move the vashl3 insn from neon.md to an expander in
> > vec-common.md, and the mve_vshlq_ insn from mve.md to
> > vec-common.md, adding the second alternative from neon.md.
> >
> > mve_vshlq_ will be used by a later patch enabling
> > vectorization for vshr, as a unified version of
> > ashl3_[signed|unsigned] from neon.md. Keeping the use of unspec
> > VSHLQ makes it possible to generate both 's' and 'u' variants.
> >
> > It is not clear whether the neon_shift_[reg|imm] attribute is still
> > suitable, since this insn is also used for MVE.
> >
> > I kept the mve_vshlq_ naming instead of renaming it to
> > ashl3__ as discussed because the reference in
> > arm_mve_builtins.def automatically inserts the "mve_" prefix and I
> > didn't want to make a special case for this.
> >
> > I haven't yet found why the v16qi and v8hi tests are not vectorized.
> > With dest[i] = a[i] << b[i] and:
> >   {
> > int i;
> > unsigned int i.24_1;
> > unsigned int _2;
> > int16_t * _3;
> > short int _4;
> > int _5;
> > int16_t * _6;
> > short int _7;
> > int _8;
> > int _9;
> > int16_t * _10;
> > short int _11;
> > unsigned int ivtmp_42;
> > unsigned int ivtmp_43;
> >
> >  [local count: 119292720]:
> >
> >  [local count: 954449105]:
> > i.24_1 = (unsigned int) i_23;
> > _2 = i.24_1 * 2;
> > _3 = a_15(D) + _2;
> > _4 = *_3;
> > _5 = (int) _4;
> > _6 = b_16(D) + _2;
> > _7 = *_6;
> > _8 = (int) _7;
> > _9 = _5 << _8;
> > _10 = dest_17(D) + _2;
> > _11 = (short int) _9;
> > *_10 = _11;
> > i_19 = i_23 + 1;
> > ivtmp_42 = ivtmp_43 - 1;
> > if (ivtmp_42 != 0)
> >   goto ; [87.50%]
> > else
> >   goto ; [12.50%]
> >
> >  [local count: 835156386]:
> > goto ; [100.00%]
> >
> >  [local count: 119292720]:
> > return;
> >
> >   }
> > the vectorizer says:
> > mve-vshl.c:37:96: note:   ==> examining statement: _5 = (int) _4;
> > mve-vshl.c:37:96: note:   vect_is_simple_use: operand *_3, type of def:
> > internal
> > mve-vshl.c:37:96: note:   vect_is_simple_use: vectype vector(8) short int
> > mve-vshl.c:37:96: missed:   conversion not supported by target.
> > mve-vshl.c:37:96: note:   vect_is_simple_use: operand *_3, type of def:
> > internal
> > mve-vshl.c:37:96: note:   vect_is_simple_use: vectype vector(8) short int
> > mve-vshl.c:37:96: note:   vect_is_simple_use: operand *_3, type of def:
> > internal
> > mve-vshl.c:37:96: note:   vect_is_simple_use: vectype vector(8) short int
> > mve-vshl.c:37:117: missed:   not vectorized: relevant stmt not supported: _5
> > = (int) _4;
> > mve-vshl.c:37:96: missed:  bad operation or unsupported loop bound.
> > mve-vshl.c:37:96: note:  * Analysis failed with vector mode V8HI
> >
>
> Can you file a bug report once this is committed so we can revisit in the
> future, please?

OK, I filed: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98697

>
> > 2020-12-03  Christophe Lyon  
> >
> >   gcc/
> >   * config/arm/mve.md (mve_vshlq_): Move to
> >   vec-common.md.
> >   * config/arm/neon.md (vashl3): Delete.
> >   * config/arm/vec-common.md (mve_vshlq_): New.
> >   (vashl3): New expander.
> >
> >   gcc/testsuite/
> >   * gcc.target/arm/simd/mve-vshl.c: Add tests for vshl.
> > ---
> >  gcc/config/arm/mve.md| 13 +-
> >  gcc/config/arm/neon.md   | 19 -
> >  gcc/config/arm/vec-common.md | 30 ++
> >  gcc/testsuite/gcc.target/arm/simd/mve-vshl.c | 62
> > 
> >  4 files changed, 93 insertions(+), 31 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/arm/simd/mve-vshl.c
> >
> > diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
> > index 673a83c..8bdb451 100644
> > --- a/gcc/config/arm/mve.md
> > +++ b/gcc/config/arm/mve.md
> > @@ -822,18 +822,7 @@ (define_insn "mve_vcmpneq_"
> >
> >  ;;
> >  ;; [vshlq_s, vshlq_u])
> > -;;
> > -(define_insn "mve_vshlq_"
> > -  [
> > -   (set (match_operand:MVE_2 0 "s_register_operand" "=w")
> > - (unspec:MVE_2 [(match_operand:MVE_2 1 "s_register_operand" "w")
> > -(match_operand:MVE_2 2 "s_register_operand" "w")]
> > -  VSHLQ))
> > -  ]
> > -  "TARGET_HAVE_MVE"
> > -  "vshl.%#\t%q0, %q1, %q2"
> > -  [(set_attr "type" "mve_move")
> > -])
> > +;; See vec-common.md
> >
> >  ;;
> >  ;; [vabdq_s, vabdq_u])
> > diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> 

Re: [PATCH] c-family, v2: Improve MEM_REF printing for diagnostics [PR98597]

2021-01-15 Thread Richard Biener
On Fri, 15 Jan 2021, Jakub Jelinek wrote:

> On Thu, Jan 14, 2021 at 07:26:36PM +0100, Jakub Jelinek via Gcc-patches wrote:
> > Is this ok for trunk if it passes bootstrap/regtest?
> 
> So, x86_64-linux bootstrap unfortunately broke due to the -march=i486
> changes, but at least i686-linux bootstrap succeeded and shows 2
> regressions.
> 
> One is on g++.dg/gomp/allocate-2.C, which used to print:
> allocate-2.C:9:36: error: user defined reduction not found for ‘s’
> but now prints:
> allocate-2.C:9:36: error: user defined reduction not found for ‘*&s’
> because of -O0 and therefore -fno-strict-aliasing.
> The problem is that for !flag_strict_aliasing get_deref_alias_set returns 0
> and so the:
> && get_deref_alias_set (TREE_OPERAND (e, 1)) == get_alias_set (op)
> check fails.  So, shall the code use
> && (!flag_strict_aliasing
> || get_deref_alias_set (TREE_OPERAND (e, 1)) == get_alias_set (op))
> instead, or
> get_alias_set (TREE_TYPE (TREE_TYPE (TREE_OPERAND (e, 1
> == get_alias_set (op)
> ?

Elsewhere we use

  tree decl = TREE_OPERAND (TREE_OPERAND (*t, 0), 0);
  tree alias_type = TREE_TYPE (TREE_OPERAND (*t, 1));
...
  /* Same TBAA behavior with -fstrict-aliasing.  */
  && !TYPE_REF_CAN_ALIAS_ALL (alias_type)
  && (TYPE_MAIN_VARIANT (TREE_TYPE (decl))
  == TYPE_MAIN_VARIANT (TREE_TYPE (alias_type)))

to guard eliding of the MEM_REF.  So maybe use this form which doesn't
depend on alias sets.

> The other is on gcc.dg/gomp/_Atomic-3.c test, where we used to print
> _Atomic-3.c:22:34: error: ‘_Atomic’ ‘k’ in ‘reduction’ clause
> but now print
> _Atomic-3.c:22:34: error: ‘_Atomic’ ‘*(_Atomic int (*)[4])(&k[0])’ in 
> ‘reduction’ clause
> Apparently in this case the C FE considers the two _Atomic int [4] types
> incompatible, one is created through
> c_build_qualified_type (type=, type_quals=8, 
> orig_qual_type=, orig_qual_indirect=1)
> on an int [4] type, i.e. adding _Atomic qualifier to an unqualified array
> type, and the other is created through
> build_array_type (elt_type=, 
> index_type=, typeless_storage=false)
> i.e. creating an array with _Atomic int elements.
> That seems like a C FE bug to me.
> 
> Anyway, I can fix or workaround that by:
> --- gcc/c/c-typeck.c.jj   2021-01-04 10:25:49.65329 +0100
> +++ gcc/c/c-typeck.c  2021-01-15 09:53:29.590611264 +0100
> @@ -13979,7 +13979,9 @@ c_finish_omp_clauses (tree clauses, enum
> size = size_binop (MINUS_EXPR, size, size_one_node);
> size = save_expr (size);
> tree index_type = build_index_type (size);
> -   tree atype = build_array_type (type, index_type);
> +   tree atype = build_array_type (TYPE_MAIN_VARIANT (type),
> +  index_type);
> +   atype = c_build_qualified_type (atype, TYPE_QUALS (type));
> tree ptype = build_pointer_type (type);
> if (TREE_CODE (TREE_TYPE (t)) == ARRAY_TYPE)
>   t = build_fold_addr_expr (t);
> and then we're back to the above allocate-2.C issue, i.e. at -O0
> we still print *&k rather than k.
> 
> And another question is if in case we punted because of the TBAA check
> we shouldn't just force printing the access type, so never print
> *&k but print instead *(access type)&k.

I guess so.

As said, I'm not a fan of too much magic here.  A MEM_REF is what it is...

Richard.


Re: [PATCH] arm: Implement vceqq_p64, vceqz_p64 and vceqzq_p64 intrinsics

2021-01-15 Thread Christophe Lyon via Gcc-patches
ping?

On Fri, 6 Nov 2020 at 16:22, Christophe Lyon  wrote:
>
> On Thu, 5 Nov 2020 at 12:55, Christophe Lyon  
> wrote:
> >
> > On Thu, 5 Nov 2020 at 10:36, Kyrylo Tkachov  wrote:
> > >
> > > Hi, Christophe,
> > >
> > > > -Original Message-
> > > > From: Gcc-patches  On Behalf Of
> > > > Christophe Lyon via Gcc-patches
> > > > Sent: 15 October 2020 18:23
> > > > To: gcc-patches@gcc.gnu.org
> > > > Subject: [PATCH] arm: Implement vceqq_p64, vceqz_p64 and vceqzq_p64
> > > > intrinsics
> > > >
> > > > This patch adds implementations for vceqq_p64, vceqz_p64 and
> > > > vceqzq_p64 intrinsics.
> > > >
> > > > vceqq_p64 uses the existing vceq_p64 after splitting the input vectors
> > > > into their high and low halves.
> > > >
> > > > vceqz[q] simply call the vceq and vceqq with a second argument equal
> > > > to zero.
> > > >
> > > > The added (executable) testcases make sure that the poly64x2_t
> > > > variants have results with one element of all zeroes (false) and the
> > > > other element with all bits set to one (true).
> > > >
> > > > 2020-10-15  Christophe Lyon  
> > > >
> > > >   gcc/
> > > >   * config/arm/arm_neon.h (vceqz_p64, vceqq_p64, vceqzq_p64):
> > > > New.
> > > >
> > > >   gcc/testsuite/
> > > >   * gcc.target/aarch64/advsimd-intrinsics/p64_p128.c: Add tests for
> > > >   vceqz_p64, vceqq_p64 and vceqzq_p64.
> > > > ---
> > > >  gcc/config/arm/arm_neon.h  | 31 +++
> > > >  .../aarch64/advsimd-intrinsics/p64_p128.c  | 46
> > > > +-
> > > >  2 files changed, 76 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
> > > > index aa21730..f7eff37 100644
> > > > --- a/gcc/config/arm/arm_neon.h
> > > > +++ b/gcc/config/arm/arm_neon.h
> > > > @@ -16912,6 +16912,37 @@ vceq_p64 (poly64x1_t __a, poly64x1_t __b)
> > > >return vreinterpret_u64_u32 (__m);
> > > >  }
> > > >
> > > > +__extension__ extern __inline uint64x1_t
> > > > +__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
> > > > +vceqz_p64 (poly64x1_t __a)
> > > > +{
> > > > +  poly64x1_t __b = vreinterpret_p64_u32 (vdup_n_u32 (0));
> > > > +  return vceq_p64 (__a, __b);
> > > > +}
> > >
> > > This approach is okay, but can we have some kind of test to confirm it 
> > > generates the VCEQ instruction with immediate zero rather than having a 
> > > separate DUP...
> >
> > I had checked that manually, but I'll add a test.
> > However, I have noticed that although vceqz_p64 uses vceq.i32 dX, dY, #0,
> > the vceqzq_p64 version below first sets
> > vmov dZ, #0
> > and then emits two
> > vceq dX, dY, dZ
> >
> > I'm looking at why this happens.
> >
>
> Hi,
>
> Here is an updated version, which adds two tests (arm/simd/vceqz_p64.c
> and arm/simd/vceqzq_p64.c).
>
> The vceqzq_p64 test does not currently expect instructions with
> immediate zero, because we generate:
> vmov.i32 q9, #0  @ v4si
> [...]
> vceq.i32 d16, d16, d19
> vceq.i32 d17, d17, d19
>
> Looking at the traces, I can see this in reload:
> (insn 19 8 15 2 (set (reg:V2SI 48 d16 [orig:128 _18 ] [128])
> (neg:V2SI (eq:V2SI (reg:V2SI 48 d16 [orig:139 v1 ] [139])
> (reg:V2SI 54 d19 [ _5+8 ]
> "/home/christophe.lyon/src/GCC/builds/gcc-fsf-git-neon-intrinsics/tools/lib/gcc/arm-none-linux-gnueabihf/11.0.0/include/arm_neon.h":2404:22
> 1650 {neon_vceqv2si_insn}
>  (expr_list:REG_EQUAL (neg:V2SI (eq:V2SI (subreg:V2SI (reg:DI 48
> d16 [orig:139 v1 ] [139]) 0)
> (const_vector:V2SI [
> (const_int 0 [0]) repeated x2
> ])))
> (nil)))
> (insn 15 19 20 2 (set (reg:V2SI 50 d17 [orig:121 _11 ] [121])
> (neg:V2SI (eq:V2SI (reg:V2SI 50 d17 [orig:141 v2 ] [141])
> (reg:V2SI 54 d19 [ _5+8 ]
> "/home/christophe.lyon/src/GCC/builds/gcc-fsf-git-neon-intrinsics/tools/lib/gcc/arm-none-linux-gnueabihf/11.0.0/include/arm_neon.h":2404:22
> 1650 {neon_vceqv2si_insn}
>  (expr_list:REG_EQUAL (neg:V2SI (eq:V2SI (subreg:V2SI (reg:DI 50
> d17 [orig:141 v2 ] [141]) 0)
> (const_vector:V2SI [
> (const_int 0 [0]) repeated x2
> ])))
> (nil)))
>
> but it says:
>  Choosing alt 0 in insn 19:  (0) =w  (1) w  (2) w {neon_vceqv2si_insn}
>   alt=0,overall=0,losers=0,rld_nregs=0
>  Choosing alt 0 in insn 15:  (0) =w  (1) w  (2) w {neon_vceqv2si_insn}
>   alt=0,overall=0,losers=0,rld_nregs=0
>
> Why isn't it picking alternative 1 with the Dz constraint?
>
> Christophe
>
>
> > Thanks,
> >
> > Christophe
> >
> >
> > > Thanks,
> > > Kyrill
> > >
> > > > +
> > > > +/* For vceqq_p64, we rely on vceq_p64 for each of the two elements.  */
> > > > +__extension__ extern __inline uint64x2_t
> > > > +__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
> > > > +vceqq_p64 (poly64x2_t __a, poly64x2_t __b)
> > > > +{
> > > >

RE: [PATCH] arm: Implement vceqq_p64, vceqz_p64 and vceqzq_p64 intrinsics

2021-01-15 Thread Kyrylo Tkachov via Gcc-patches


> -Original Message-
> From: Christophe Lyon 
> Sent: 06 November 2020 15:23
> To: Kyrylo Tkachov 
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] arm: Implement vceqq_p64, vceqz_p64 and vceqzq_p64
> intrinsics
> 
> On Thu, 5 Nov 2020 at 12:55, Christophe Lyon 
> wrote:
> >
> > On Thu, 5 Nov 2020 at 10:36, Kyrylo Tkachov 
> wrote:
> > >
> > > Hi, Christophe,
> > >
> > > > -Original Message-
> > > > From: Gcc-patches  On Behalf Of
> > > > Christophe Lyon via Gcc-patches
> > > > Sent: 15 October 2020 18:23
> > > > To: gcc-patches@gcc.gnu.org
> > > > Subject: [PATCH] arm: Implement vceqq_p64, vceqz_p64 and
> vceqzq_p64
> > > > intrinsics
> > > >
> > > > This patch adds implementations for vceqq_p64, vceqz_p64 and
> > > > vceqzq_p64 intrinsics.
> > > >
> > > > vceqq_p64 uses the existing vceq_p64 after splitting the input vectors
> > > > into their high and low halves.
> > > >
> > > > vceqz[q] simply call the vceq and vceqq with a second argument equal
> > > > to zero.
> > > >
> > > > The added (executable) testcases make sure that the poly64x2_t
> > > > variants have results with one element of all zeroes (false) and the
> > > > other element with all bits set to one (true).
> > > >
> > > > 2020-10-15  Christophe Lyon  
> > > >
> > > >   gcc/
> > > >   * config/arm/arm_neon.h (vceqz_p64, vceqq_p64, vceqzq_p64):
> > > > New.
> > > >
> > > >   gcc/testsuite/
> > > >   * gcc.target/aarch64/advsimd-intrinsics/p64_p128.c: Add tests for
> > > >   vceqz_p64, vceqq_p64 and vceqzq_p64.
> > > > ---
> > > >  gcc/config/arm/arm_neon.h  | 31 +++
> > > >  .../aarch64/advsimd-intrinsics/p64_p128.c  | 46
> > > > +-
> > > >  2 files changed, 76 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
> > > > index aa21730..f7eff37 100644
> > > > --- a/gcc/config/arm/arm_neon.h
> > > > +++ b/gcc/config/arm/arm_neon.h
> > > > @@ -16912,6 +16912,37 @@ vceq_p64 (poly64x1_t __a, poly64x1_t
> __b)
> > > >return vreinterpret_u64_u32 (__m);
> > > >  }
> > > >
> > > > +__extension__ extern __inline uint64x1_t
> > > > +__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
> > > > +vceqz_p64 (poly64x1_t __a)
> > > > +{
> > > > +  poly64x1_t __b = vreinterpret_p64_u32 (vdup_n_u32 (0));
> > > > +  return vceq_p64 (__a, __b);
> > > > +}
> > >
> > > This approach is okay, but can we have some kind of test to confirm it
> generates the VCEQ instruction with immediate zero rather than having a
> separate DUP...
> >
> > I had checked that manually, but I'll add a test.
> > However, I have noticed that although vceqz_p64 uses vceq.i32 dX, dY, #0,
> > the vceqzq_p64 version below first sets
> > vmov dZ, #0
> > and then emits two
> > vceq dX, dY, dZ
> >
> > I'm looking at why this happens.
> >
> 
> Hi,
> 
> Here is an updated version, which adds two tests (arm/simd/vceqz_p64.c
> and arm/simd/vceqzq_p64.c).
> 
> The vceqzq_p64 test does not currently expect instructions with
> immediate zero, because we generate:
> vmov.i32q9, #0  @ v4si
> [...]
> vceq.i32d16, d16, d19
> vceq.i32d17, d17, d19
> 
> Looking at the traces, I can see this in reload:
> (insn 19 8 15 2 (set (reg:V2SI 48 d16 [orig:128 _18 ] [128])
> (neg:V2SI (eq:V2SI (reg:V2SI 48 d16 [orig:139 v1 ] [139])
> (reg:V2SI 54 d19 [ _5+8 ]
> "/home/christophe.lyon/src/GCC/builds/gcc-fsf-git-neon-
> intrinsics/tools/lib/gcc/arm-none-linux-
> gnueabihf/11.0.0/include/arm_neon.h":2404:22
> 1650 {neon_vceqv2si_insn}
>  (expr_list:REG_EQUAL (neg:V2SI (eq:V2SI (subreg:V2SI (reg:DI 48
> d16 [orig:139 v1 ] [139]) 0)
> (const_vector:V2SI [
> (const_int 0 [0]) repeated x2
> ])))
> (nil)))
> (insn 15 19 20 2 (set (reg:V2SI 50 d17 [orig:121 _11 ] [121])
> (neg:V2SI (eq:V2SI (reg:V2SI 50 d17 [orig:141 v2 ] [141])
> (reg:V2SI 54 d19 [ _5+8 ]
> "/home/christophe.lyon/src/GCC/builds/gcc-fsf-git-neon-
> intrinsics/tools/lib/gcc/arm-none-linux-
> gnueabihf/11.0.0/include/arm_neon.h":2404:22
> 1650 {neon_vceqv2si_insn}
>  (expr_list:REG_EQUAL (neg:V2SI (eq:V2SI (subreg:V2SI (reg:DI 50
> d17 [orig:141 v2 ] [141]) 0)
> (const_vector:V2SI [
> (const_int 0 [0]) repeated x2
> ])))
> (nil)))
> 
> but it says:
>  Choosing alt 0 in insn 19:  (0) =w  (1) w  (2) w {neon_vceqv2si_insn}
>   alt=0,overall=0,losers=0,rld_nregs=0
>  Choosing alt 0 in insn 15:  (0) =w  (1) w  (2) w {neon_vceqv2si_insn}
>   alt=0,overall=0,losers=0,rld_nregs=0
> 
> Why isn't it picking alternative 1 with the Dz constraint?
> 

Not sure, but the intrinsics implementation looks correct so let's go ahead 
with that and improve the codegen later.
Thanks,
Kyrill

> Christophe
> 
> 
> > Thanks,
> >
> > Christophe
> >
> >
> > 

[PATCH v5 01/33] Add and restructure function declaration macros

2021-01-15 Thread Daniel Engel
Most of these changes support subsequent patches in this series.
Particularly, the FUNC_START macro becomes part of a new macro chain:

  * FUNC_ENTRY  Common global symbol directives
  * FUNC_START_SECTION  FUNC_ENTRY to start a new <section>
  * FUNC_START  FUNC_START_SECTION <".text">

The effective definition of FUNC_START is unchanged from the previous
version of lib1funcs.  See code comments for detailed usage.

The new names FUNC_ENTRY and FUNC_START_SECTION were chosen specifically
to complement the existing FUNC_START name.  Alternate name patterns are
possible (such as {FUNC_SYMBOL, FUNC_START_SECTION, FUNC_START_TEXT}),
but any change to FUNC_START would require refactoring much of libgcc.

Additionally, a parallel chain of new macros supports weak functions:

  * WEAK_ENTRY
  * WEAK_START_SECTION
  * WEAK_START
  * WEAK_ALIAS

Moving the CFI_* macros earlier in the file increases their scope, making
them available for use in additional functions.

gcc/libgcc/ChangeLog:
2021-01-14 Daniel Engel 

* config/arm/lib1funcs.S:
(LLSYM): New macro prefix ".L" for strippable local symbols.
(CFI_START_FUNCTION, CFI_END_FUNCTION): Moved earlier in the file.
(FUNC_ENTRY): New macro for symbols with no ".section" directive.
(WEAK_ENTRY): New macro FUNC_ENTRY + ".weak".
(FUNC_START_SECTION): New macro FUNC_ENTRY with <section> argument.
(WEAK_START_SECTION): New macro FUNC_START_SECTION + ".weak".
(FUNC_START): Redefined in terms of FUNC_START_SECTION <".text">.
(WEAK_START): New macro FUNC_START + ".weak".
(WEAK_ALIAS): New macro FUNC_ALIAS + ".weak".
(FUNC_END): Moved after FUNC_START macro group.
(THUMB_FUNC_START): Moved near the other *FUNC* macros.
(THUMB_SYNTAX, ARM_SYM_START, SYM_END): Deleted unused macros.
---
 libgcc/config/arm/lib1funcs.S | 109 +-
 1 file changed, 69 insertions(+), 40 deletions(-)

diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index c2fcfc503ec..f14662d7e15 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -69,11 +69,13 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  
If not, see
 #define TYPE(x) .type SYM(x),function
 #define SIZE(x) .size SYM(x), . - SYM(x)
 #define LSYM(x) .x
+#define LLSYM(x) .L##x
 #else
 #define __PLT__
 #define TYPE(x)
 #define SIZE(x)
 #define LSYM(x) x
+#define LLSYM(x) x
 #endif
 
 /* Function end macros.  Variants for interworking.  */
@@ -182,6 +184,16 @@ LSYM(Lend_fde):
 #endif
 .endm
 
+.macro CFI_START_FUNCTION
+   .cfi_startproc
+   .cfi_remember_state
+.endm
+
+.macro CFI_END_FUNCTION
+   .cfi_restore_state
+   .cfi_endproc
+.endm
+
 /* Don't pass dirn, it's there just to get token pasting right.  */
 
 .macro RETLDM  regs=, cond=, unwind=, dirn=ia
@@ -324,10 +336,6 @@ LSYM(Lend_fde):
 .endm
 #endif
 
-.macro FUNC_END name
-   SIZE (__\name)
-.endm
-
 .macro DIV_FUNC_END name signed
cfi_start   __\name, LSYM(Lend_div0)
 LSYM(Ldiv0):
@@ -340,48 +348,76 @@ LSYM(Ldiv0):
FUNC_END \name
 .endm
 
-.macro THUMB_FUNC_START name
-   .globl  SYM (\name)
-   TYPE(\name)
-   .thumb_func
-SYM (\name):
-.endm
-
 /* Function start macros.  Variants for ARM and Thumb.  */
 
 #ifdef __thumb__
 #define THUMB_FUNC .thumb_func
 #define THUMB_CODE .force_thumb
-# if defined(__thumb2__)
-#define THUMB_SYNTAX
-# else
-#define THUMB_SYNTAX
-# endif
 #else
 #define THUMB_FUNC
 #define THUMB_CODE
-#define THUMB_SYNTAX
 #endif
 
+.macro THUMB_FUNC_START name
+   .globl  SYM (\name)
+   TYPE(\name)
+   .thumb_func
+SYM (\name):
+.endm
+
+/* Strong global symbol, ".text" section.
+   The default macro for function declarations. */
 .macro FUNC_START name
-   .text
+   FUNC_START_SECTION \name .text
+.endm
+
+/* Weak global symbol, ".text" section.
+   Use WEAK_* macros to declare a function/object that may be discarded by
+the linker when another library or object exports the same name.
+   Typically, functions declared with WEAK_* macros implement a subset of
+functionality provided by the overriding definition, and are discarded
+when the full functionality is required. */
+.macro WEAK_START name
+   .weak SYM(__\name)
+   FUNC_START_SECTION \name .text
+.endm
+
+/* Strong global symbol, alternate section.
+   Use the *_START_SECTION macros for declarations that the linker should
+place in a non-default section (e.g. ".rodata", ".text.subsection"). */
+.macro FUNC_START_SECTION name section
+   .section \section,"x"
+   .align 0
+   FUNC_ENTRY \name
+.endm
+
+/* Weak global symbol, alternate section. */
+.macro WEAK_START_SECTION name section
+   .weak SYM(__\name)
+   FUNC_START_SECTION \name \section
+.endm
+
+/* Strong global symbol.
+   Use *_ENTRY macros internal to a function/object body to declare a second
+or subsequent entry point wi

[PATCH v5 00/33] libgcc: Thumb-1 Floating-Point Library for Cortex M0

2021-01-15 Thread Daniel Engel
Changes since v4: 

* Revised all commit messages per GCC standard form. 
* Split preamble patch 1 into 4 distinct changes. 
* Flattened previously-created directory "bits"
* Added patch to fix unified syntax compiler warnings.
* Moved CFI macro changes to preamble patch 1. 
* Added interim copyright message to refactored files. 
* Added explanation and usage comments for the IT() macro.
* Renamed new __ARM_FEATURE_IT macro as __HAVE_FEATURE_IT.

---

This patch series adds an assembly-language implementation of IEEE-754 compliant
single-precision functions designed for the Cortex M0 (v6m) architecture.  
There 
are improvements to most of the EABI integer functions as well.  This is the
libgcc component of a larger library project originally proposed in 2018:

https://gcc.gnu.org/legacy-ml/gcc/2018-11/msg00043.html

As one point of comparison, a test program [1] links 916 bytes from libgcc with
the patched toolchain vs 10276 bytes with gcc-arm-none-eabi-9-2020-q2 toolchain.
That's a 90% size reduction.

I have extensive test vectors [2], and this patch passes all tests on an
STM32F051.
These vectors were derived from UCB [3], Testfloat [4], and IEEECC754 [5], plus
many of my own generation.

There may be some follow-on projects worth discussing:

* The library is currently integrated into the ARM v6s-m multilib only.  It
is likely that some other architectures would benefit from these routines.
However, I have NOT profiled the existing implementations (ieee754-sf.S) to
estimate where improvements may be found.

* GCC currently lacks tests for some functions, such as __aeabi_[u]ldivmod().
There may be useful bits in [1] that can be integrated.

On Cortex M0, the library has (approximately) the following properties:

Function(s)                     Size (bytes)        Cycles            Stack  Accuracy
__clzsi2                        50                  20                0      exact
__clzsi2 (OPTIMIZE_SIZE)        22                  51                0      exact
__clzdi2                        8+__clzsi2          4+__clzsi2        0      exact

__clrsbsi2                      8+__clzsi2          6+__clzsi2        0      exact
__clrsbdi2                      18+__clzsi2         (8..10)+__clzsi2  0      exact

__ctzsi2                        52                  21                0      exact
__ctzsi2 (OPTIMIZE_SIZE)        24                  52                0      exact
__ctzdi2                        8+__ctzsi2          5+__ctzsi2        0      exact

__ffssi2                        8                   6..(5+__ctzsi2)   0      exact
__ffsdi2                        14+__ctzsi2         9..(8+__ctzsi2)   0      exact

__popcountsi2                   52                  25                0      exact
__popcountsi2 (OPTIMIZE_SIZE)   14                  9..201            0      exact
__popcountdi2                   34+__popcountsi2    46                0      exact
__popcountdi2 (OPTIMIZE_SIZE)   12+__popcountsi2    17..401           0      exact

__paritysi2                     24                  14                0      exact
__paritysi2 (OPTIMIZE_SIZE)     16                  38                0      exact
__paritydi2                     2+__paritysi2       1+__paritysi2     0      exact

__umulsidi3                     44                  24                0      exact
__mulsidi3                      30+__umulsidi3      24+__umulsidi3    8      exact
__muldi3 (__aeabi_lmul)         10+__umulsidi3      6+__umulsidi3     0      exact
__ashldi3 (__aeabi_llsl)        22                  13                0      exact
__lshrdi3 (__aeabi_llsr)        22                  13                0      exact
__ashrdi3 (__aeabi_lasr)        22                  13                0      exact

__aeabi_lcmp                    20                  13                0      exact
__aeabi_ulcmp                   16                  10                0      exact

__udivsi3 (__aeabi_uidiv)       56                  72..385           0      < 1 lsb
__divsi3 (__aeabi_idiv)         38+__udivsi3        26+__udivsi3      8      < 1 lsb
__udivdi3 (__aeabi_uldiv)       164                 103..1394         16     < 1 lsb
__udivdi3 (OPTIMIZE_SIZE)       142                 120..1392         16     < 1 lsb
__divdi3 (__aeabi_ldiv)         54+__udivdi3        36+__udivdi3      32     < 1 lsb

__shared_float                  178
__shared_float (OPTIMIZE_SIZE)  154

__addsf3 (__aeabi_fadd)         116+__shared_float  31..76            8      <= 0.5 ulp
__addsf3 (OPTIMIZE_SIZE)        112+__shared_float  74                8      <= 0.5 ulp
__subsf3 (__aeabi_fsub)         6+__addsf3          3+__addsf3        8      <= 0.5 ulp
__aeabi_frsub                   8+__addsf3          6+__addsf3        8      <= 0.5 ulp
__mulsf3 (__aeabi_fmul)         112+__shared_float  73..97

[PATCH v5 02/33] Rename THUMB_FUNC_START to THUMB_FUNC_ENTRY

2021-01-15 Thread Daniel Engel
Since THUMB_FUNC_START does not insert the ".text" directive, it aligns
more closely with the new FUNC_ENTRY macro and is renamed accordingly.

THUMB_FUNC_START usage has been universally synonymous with the
".force_thumb" directive, so this is now folded into the definition.
Usage of ".force_thumb" and ".thumb_func" is now tightly coupled
throughout the "arm" subdirectory.

gcc/libgcc/ChangeLog:
2021-01-14 Daniel Engel 

* config/arm/lib1funcs.S: (THUMB_FUNC_START): Renamed to ...
(THUMB_FUNC_ENTRY): for consistency; also added ".force_thumb".
(_call_via_r0): Removed redundant preceding ".force_thumb".
(__gnu_thumb1_case_sqi, __gnu_thumb1_case_uqi, __gnu_thumb1_case_shi,
__gnu_thumb1_case_si): Removed redundant ".force_thumb" and ".syntax".
---
 libgcc/config/arm/lib1funcs.S | 32 +++-
 1 file changed, 11 insertions(+), 21 deletions(-)

diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index f14662d7e15..65d070d8178 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -358,10 +358,11 @@ LSYM(Ldiv0):
 #define THUMB_CODE
 #endif
 
-.macro THUMB_FUNC_START name
+.macro THUMB_FUNC_ENTRY name
.globl  SYM (\name)
TYPE(\name)
.thumb_func
+   .force_thumb
 SYM (\name):
 .endm
 
@@ -1944,10 +1945,9 @@ ARM_FUNC_START ctzsi2

.text
.align 0
-.force_thumb
 
 .macro call_via register
-   THUMB_FUNC_START _call_via_\register
+   THUMB_FUNC_ENTRY _call_via_\register
 
bx  \register
nop
@@ -2030,7 +2030,7 @@ _arm_return_r11:
 .macro interwork_with_frame frame, register, name, return
.code   16
 
-   THUMB_FUNC_START \name
+   THUMB_FUNC_ENTRY \name
 
bx  pc
nop
@@ -2047,7 +2047,7 @@ _arm_return_r11:
 .macro interwork register
.code   16
 
-   THUMB_FUNC_START _interwork_call_via_\register
+   THUMB_FUNC_ENTRY _interwork_call_via_\register
 
bx  pc
nop
@@ -2084,7 +2084,7 @@ LSYM(Lchange_\register):
/* The LR case has to be handled a little differently...  */
.code 16
 
-   THUMB_FUNC_START _interwork_call_via_lr
+   THUMB_FUNC_ENTRY _interwork_call_via_lr
 
bx  pc
nop
@@ -2112,9 +2112,7 @@ LSYM(Lchange_\register):

.text
.align 0
-.force_thumb
-   .syntax unified
-   THUMB_FUNC_START __gnu_thumb1_case_sqi
+   THUMB_FUNC_ENTRY __gnu_thumb1_case_sqi
push{r1}
mov r1, lr
lsrsr1, r1, #1
@@ -2131,9 +2129,7 @@ LSYM(Lchange_\register):

.text
.align 0
-.force_thumb
-   .syntax unified
-   THUMB_FUNC_START __gnu_thumb1_case_uqi
+   THUMB_FUNC_ENTRY __gnu_thumb1_case_uqi
push{r1}
mov r1, lr
lsrsr1, r1, #1
@@ -2150,9 +2146,7 @@ LSYM(Lchange_\register):

.text
.align 0
-.force_thumb
-   .syntax unified
-   THUMB_FUNC_START __gnu_thumb1_case_shi
+   THUMB_FUNC_ENTRY __gnu_thumb1_case_shi
push{r0, r1}
mov r1, lr
lsrsr1, r1, #1
@@ -2170,9 +2164,7 @@ LSYM(Lchange_\register):

.text
.align 0
-.force_thumb
-   .syntax unified
-   THUMB_FUNC_START __gnu_thumb1_case_uhi
+   THUMB_FUNC_ENTRY __gnu_thumb1_case_uhi
push{r0, r1}
mov r1, lr
lsrsr1, r1, #1
@@ -2190,9 +2182,7 @@ LSYM(Lchange_\register):

.text
.align 0
-.force_thumb
-   .syntax unified
-   THUMB_FUNC_START __gnu_thumb1_case_si
+   THUMB_FUNC_ENTRY __gnu_thumb1_case_si
push{r0, r1}
mov r1, lr
adds.n  r1, r1, #2  /* Align to word.  */
-- 
2.25.1



[PATCH v5 03/33] Fix syntax warnings on conditional instructions

2021-01-15 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2021-01-14 Daniel Engel 

* config/arm/lib1funcs.S (RETLDM, ARM_DIV_BODY, ARM_MOD_BODY,
_interwork_call_via_lr): Moved condition code after the flags
update specifier "s".
(ARM_FUNC_START, THUMB_LDIV0): Removed redundant ".syntax".
---
 libgcc/config/arm/lib1funcs.S | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index 65d070d8178..b8693be8e4f 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -204,7 +204,7 @@ LSYM(Lend_fde):
 # if defined(__thumb2__)
pop\cond{\regs, lr}
 # else
-   ldm\cond\dirn   sp!, {\regs, lr}
+   ldm\dirn\cond   sp!, {\regs, lr}
 # endif
.endif
.ifnc "\unwind", ""
@@ -220,7 +220,7 @@ LSYM(Lend_fde):
 # if defined(__thumb2__)
pop\cond{\regs, pc}
 # else
-   ldm\cond\dirn   sp!, {\regs, pc}
+   ldm\dirn\cond   sp!, {\regs, pc}
 # endif
.endif
 #endif
@@ -292,7 +292,6 @@ LSYM(Lend_fde):
pop {r1, pc}
 
 #elif defined(__thumb2__)
-   .syntax unified
.ifc \signed, unsigned
cbz r0, 1f
	mov r0, #0xffffffff
@@ -429,7 +428,6 @@ SYM (__\name):
 /* For Thumb-2 we build everything in thumb mode.  */
 .macro ARM_FUNC_START name
FUNC_START \name
-   .syntax unified
 .endm
 #define EQUIV .thumb_set
 .macro  ARM_CALL name
@@ -643,7 +641,7 @@ pc  .reqr15
orrhs   \result,   \result,   \curbit,  lsr #3
cmp \dividend, #0   @ Early termination?
do_it   ne, t
-   movnes  \curbit,   \curbit,  lsr #4 @ No, any more bits to do?
+   movsne  \curbit,   \curbit,  lsr #4 @ No, any more bits to do?
movne   \divisor,  \divisor, lsr #4
bne 1b
 
@@ -745,7 +743,7 @@ pc  .reqr15
subhs   \dividend, \dividend, \divisor, lsr #3
cmp \dividend, #1
mov \divisor, \divisor, lsr #4
-   subges  \order, \order, #4
+   subsge  \order, \order, #4
bge 1b
 
tst \order, #3
@@ -2093,7 +2091,7 @@ LSYM(Lchange_\register):
.globl .Lchange_lr
 .Lchange_lr:
tst lr, #1
-   stmeqdb r13!, {lr, pc}
+   stmdbeq r13!, {lr, pc}
mov ip, lr
adreq   lr, _arm_return
bx  ip
-- 
2.25.1



[PATCH v5 04/33] Reorganize LIB1ASMFUNCS object wrapper macros

2021-01-15 Thread Daniel Engel
This will make it easier to isolate changes in subsequent patches.

gcc/libgcc/ChangeLog:
2021-01-14 Daniel Engel 

* config/arm/t-elf (LIB1ASMFUNCS): Split macros into logical groups.
---
 libgcc/config/arm/t-elf | 66 +
 1 file changed, 53 insertions(+), 13 deletions(-)

diff --git a/libgcc/config/arm/t-elf b/libgcc/config/arm/t-elf
index 9da6cd37054..93ea1cd8f76 100644
--- a/libgcc/config/arm/t-elf
+++ b/libgcc/config/arm/t-elf
@@ -14,19 +14,59 @@ LIB1ASMFUNCS += _arm_muldf3 _arm_mulsf3
 endif
 endif # !__symbian__
 
-# For most CPUs we have an assembly soft-float implementations.
-# However this is not true for ARMv6M.  Here we want to use the soft-fp C
-# implementation.  The soft-fp code is only build for ARMv6M.  This pulls
-# in the asm implementation for other CPUs.
-LIB1ASMFUNCS += _udivsi3 _divsi3 _umodsi3 _modsi3 _dvmd_tls _bb_init_func \
-   _call_via_rX _interwork_call_via_rX \
-   _lshrdi3 _ashrdi3 _ashldi3 \
-   _arm_negdf2 _arm_addsubdf3 _arm_muldivdf3 _arm_cmpdf2 _arm_unorddf2 \
-   _arm_fixdfsi _arm_fixunsdfsi \
-   _arm_truncdfsf2 _arm_negsf2 _arm_addsubsf3 _arm_muldivsf3 \
-   _arm_cmpsf2 _arm_unordsf2 _arm_fixsfsi _arm_fixunssfsi \
-   _arm_floatdidf _arm_floatdisf _arm_floatundidf _arm_floatundisf \
-   _clzsi2 _clzdi2 _ctzsi2
+# This pulls in the available assembly function implementations.
+# The soft-fp code is only built for ARMv6M, since there is no
+# assembly implementation here for double-precision values.
+
+
+# Group 1: Integer function objects.
+LIB1ASMFUNCS += \
+   _ashldi3 \
+   _ashrdi3 \
+   _lshrdi3 \
+   _clzdi2 \
+   _clzsi2 \
+   _ctzsi2 \
+   _dvmd_tls \
+   _divsi3 \
+   _modsi3 \
+   _udivsi3 \
+   _umodsi3 \
+
+
+# Group 2: Single precision floating point function objects.
+LIB1ASMFUNCS += \
+   _arm_addsubsf3 \
+   _arm_cmpsf2 \
+   _arm_fixsfsi \
+   _arm_fixunssfsi \
+   _arm_floatdisf \
+   _arm_floatundisf \
+   _arm_muldivsf3 \
+   _arm_negsf2 \
+   _arm_unordsf2 \
+
+
+# Group 3: Double precision floating point function objects.
+LIB1ASMFUNCS += \
+   _arm_addsubdf3 \
+   _arm_cmpdf2 \
+   _arm_fixdfsi \
+   _arm_fixunsdfsi \
+   _arm_floatdidf \
+   _arm_floatundidf \
+   _arm_muldivdf3 \
+   _arm_negdf2 \
+   _arm_truncdfsf2 \
+   _arm_unorddf2 \
+
+
+# Group 4: Miscellaneous function objects.
+LIB1ASMFUNCS += \
+   _bb_init_func \
+   _call_via_rX \
+   _interwork_call_via_rX \
+
 
 # Currently there is a bug somewhere in GCC's alias analysis
 # or scheduling code that is breaking _fpmul_parts in fp-bit.c.
-- 
2.25.1



[PATCH v5 05/33] Add the __HAVE_FEATURE_IT and IT() macros

2021-01-15 Thread Daniel Engel
These macros complement and extend the existing do_it() macro.
Together, they streamline the process of optimizing short branchless
conditional sequences to support ARM, Thumb-2, and Thumb-1.

The inherent architecture limitations of Thumb-1 means that writing
assembly code is somewhat more tedious.  And, while such code will run
unmodified in an ARM or Thumb-2 environment, it will lack one of the
key performance optimizations available there.

The first idea might be to split an instruction sequence
with #ifdef(s): one path for Thumb-1 and the other for ARM/Thumb-2.
This could suffice if conditional execution optimizations were rare.

However, #ifdef(s) break the flow of an algorithm and shift focus to the
architectural differences instead of the similarities.  On functions
with a high percentage of conditional execution, it starts to become
attractive to split everything into distinct architecture-specific
function objects -- even when the underlying algorithm is identical.

Additionally, duplicated code and comments (whether an individual
operand, a line, or a larger block) become a future maintenance
liability if the two versions aren't kept in sync.

See code comments for limitations and expected usage.

gcc/libgcc/ChangeLog:
2021-01-14 Daniel Engel 

* config/arm/lib1funcs.S (__HAVE_FEATURE_IT, IT): New macros.
---
 libgcc/config/arm/lib1funcs.S | 68 +++
 1 file changed, 68 insertions(+)

diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index b8693be8e4f..1233b8c0992 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -230,6 +230,7 @@ LSYM(Lend_fde):
ARM and Thumb-2.  However this is only supported by recent gas, so define
a set of macros to allow ARM code on older assemblers.  */
 #if defined(__thumb2__)
+#define __HAVE_FEATURE_IT
 .macro do_it cond, suffix=""
it\suffix   \cond
 .endm
@@ -245,6 +246,9 @@ LSYM(Lend_fde):
\name \dest, \src1, \tmp
 .endm
 #else
+#if !defined(__thumb__)
+#define __HAVE_FEATURE_IT
+#endif
 .macro do_it cond, suffix=""
 .endm
 .macro shift1 op, arg0, arg1, arg2
@@ -259,6 +263,70 @@ LSYM(Lend_fde):
 
 #define COND(op1, op2, cond) op1 ## op2 ## cond
 
+
+/* The IT() macro streamlines the construction of short branchless conditional
+sequences that support ARM, Thumb-2, and Thumb-1.  It is intended as an
+extension to the .do_it macro defined above.  Code not written with the
+intent to support Thumb-1 need not use IT().
+
+   IT()'s main advantage is the minimization of syntax differences.  Unified
+functions can support Thumb-1 without imposing an undue performance
+penalty on ARM and Thumb-2.  Writing code without duplicate instructions
+and operands keeps the high level function flow clearer and should reduce
+the incidence of maintenance bugs.
+
+   Where conditional execution is supported by ARM and Thumb-2, the specified
+instruction compiles with the conditional suffix 'c'.
+
+   Where Thumb-1 and v6m do not support IT, the given instruction compiles
+with the standard unified syntax suffix "s", and a preceding branch
+instruction is required to implement conditional behavior.
+
+   (Aside: The Thumb-1 "s"-suffix pattern is somewhat simplistic, since it
+does not support 'cmp' or 'tst' with a non-"s" suffix.  It also appends
+"s" to 'mov' and 'add' with high register operands which are otherwise
+legal on v6m.  Use of IT() will result in a compiler error for all of
+these exceptional cases, and a full #ifdef code split will be required.
+However, it is unlikely that code written with Thumb-1 compatibility
+in mind will use such patterns, so IT() still promises a good value.)
+
+   Typical if/then/else usage is:
+
+#ifdef __HAVE_FEATURE_IT
+// ARM and Thumb-2 'true' condition.
+do_it   c,  tee
+#else
+// Thumb-1 'false' condition.  This must be opposite the
+//  sense of the ARM and Thumb-2 condition, since the
+//  branch is taken to skip the 'true' instruction block.
+b!c else_label
+#endif
+
+// Conditional 'true' execution for all compile modes.
+ IT(ins1,c) op1,op2
+ IT(ins2,c) op1,op2
+
+#ifndef __HAVE_FEATURE_IT
+// Thumb-1 branch to skip the 'else' instruction block.
+// Omitted for if/then usage.
+b   end_label
+#endif
+
+   else_label:
+// Conditional 'false' execution for all compile modes.
+// Omitted for if/then usage.
+ IT(ins3,!c) op1,   op2
+ IT(ins4,!c) op1,   op2
+
+   end_label:
+// Unconditional execution resumes here.
+ */
+#ifdef __HAVE_FEATURE_IT
+  #define IT(ins,c) ins##c
+#else
+  #define IT(ins,c) ins##s
+#endif
+
 #ifdef __ARM_EABI__
 .macro ARM_LDIV0 name signed
cmp r0, #0
-- 
2.25.1



[PATCH v5 06/33] Refactor 'clz' functions into a new file

2021-01-15 Thread Daniel Engel
This will make it easier to isolate changes in subsequent patches.

gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/lib1funcs.S (__clzsi2, __clzdi2): Moved to ...
* config/arm/clz2.S: New file.
---
 libgcc/config/arm/clz2.S  | 145 ++
 libgcc/config/arm/lib1funcs.S | 123 +---
 2 files changed, 146 insertions(+), 122 deletions(-)
 create mode 100644 libgcc/config/arm/clz2.S

diff --git a/libgcc/config/arm/clz2.S b/libgcc/config/arm/clz2.S
new file mode 100644
index 000..2ad9a81892c
--- /dev/null
+++ b/libgcc/config/arm/clz2.S
@@ -0,0 +1,145 @@
+/* Copyright (C) 1995-2021 Free Software Foundation, Inc.
+
+This file is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+This file is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+General Public License for more details.
+
+Under Section 7 of GPL version 3, you are granted additional
+permissions described in the GCC Runtime Library Exception, version
+3.1, as published by the Free Software Foundation.
+
+You should have received a copy of the GNU General Public License and
+a copy of the GCC Runtime Library Exception along with this program;
+see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+.  */
+
+
+#ifdef L_clzsi2
+#ifdef NOT_ISA_TARGET_32BIT
+FUNC_START clzsi2
+   movsr1, #28
+   movsr3, #1
+   lslsr3, r3, #16
+   cmp r0, r3 /* 0x10000 */
+   bcc 2f
+   lsrsr0, r0, #16
+   subsr1, r1, #16
+2: lsrsr3, r3, #8
+   cmp r0, r3 /* #0x100 */
+   bcc 2f
+   lsrsr0, r0, #8
+   subsr1, r1, #8
+2: lsrsr3, r3, #4
+   cmp r0, r3 /* #0x10 */
+   bcc 2f
+   lsrsr0, r0, #4
+   subsr1, r1, #4
+2: adr r2, 1f
+   ldrbr0, [r2, r0]
+   addsr0, r0, r1
+   bx lr
+.align 2
+1:
+.byte 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0
+   FUNC_END clzsi2
+#else
+ARM_FUNC_START clzsi2
+# if defined (__ARM_FEATURE_CLZ)
+   clz r0, r0
+   RET
+# else
+   mov r1, #28
+   cmp r0, #0x10000
+   do_it   cs, t
+   movcs   r0, r0, lsr #16
+   subcs   r1, r1, #16
+   cmp r0, #0x100
+   do_it   cs, t
+   movcs   r0, r0, lsr #8
+   subcs   r1, r1, #8
+   cmp r0, #0x10
+   do_it   cs, t
+   movcs   r0, r0, lsr #4
+   subcs   r1, r1, #4
+   adr r2, 1f
+   ldrbr0, [r2, r0]
+   add r0, r0, r1
+   RET
+.align 2
+1:
+.byte 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0
+# endif /* !defined (__ARM_FEATURE_CLZ) */
+   FUNC_END clzsi2
+#endif
+#endif /* L_clzsi2 */
+
+#ifdef L_clzdi2
+#if !defined (__ARM_FEATURE_CLZ)
+
+# ifdef NOT_ISA_TARGET_32BIT
+FUNC_START clzdi2
+   push{r4, lr}
+   cmp xxh, #0
+   bne 1f
+#  ifdef __ARMEB__
+   movsr0, xxl
+   bl  __clzsi2
+   addsr0, r0, #32
+   b 2f
+1:
+   bl  __clzsi2
+#  else
+   bl  __clzsi2
+   addsr0, r0, #32
+   b 2f
+1:
+   movsr0, xxh
+   bl  __clzsi2
+#  endif
+2:
+   pop {r4, pc}
+# else /* NOT_ISA_TARGET_32BIT */
+ARM_FUNC_START clzdi2
+   do_push {r4, lr}
+   cmp xxh, #0
+   bne 1f
+#  ifdef __ARMEB__
+   mov r0, xxl
+   bl  __clzsi2
+   add r0, r0, #32
+   b 2f
+1:
+   bl  __clzsi2
+#  else
+   bl  __clzsi2
+   add r0, r0, #32
+   b 2f
+1:
+   mov r0, xxh
+   bl  __clzsi2
+#  endif
+2:
+   RETLDM  r4
+   FUNC_END clzdi2
+# endif /* NOT_ISA_TARGET_32BIT */
+
+#else /* defined (__ARM_FEATURE_CLZ) */
+
+ARM_FUNC_START clzdi2
+   cmp xxh, #0
+   do_it   eq, et
+   clzeq   r0, xxl
+   clzne   r0, xxh
+   addeq   r0, r0, #32
+   RET
+   FUNC_END clzdi2
+
+#endif
+#endif /* L_clzdi2 */
+
diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index 1233b8c0992..d92f73ba0c9 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -1803,128 +1803,7 @@ LSYM(Lover12):
 
 #endif /* __symbian__ */
 
-#ifdef L_clzsi2
-#ifdef NOT_ISA_TARGET_32BIT
-FUNC_START clzsi2
-   movsr1, #28
-   movsr3, #1
-   lslsr3, r3, #16
-   cmp r0, r3 /* 0x10000 */
-   bcc 2f
-   lsrsr0, r0, #16
-   subsr1, r1, #16
-2: lsrsr3, r3, #8
-   cmp r0, r3 /* #0x100 */
-   bcc 2f
-   lsrsr0, r0, #8
-   subsr1, r1, #8
-2: lsrsr3, r3, #4
-   cmp r0, r3 /* #0x10 */
-   bcc 2f
-   lsrsr0,

[PATCH v5 07/33] Refactor 'ctz' functions into a new file

2021-01-15 Thread Daniel Engel
This will make it easier to isolate changes in subsequent patches.

gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/lib1funcs.S (__ctzsi2): Moved to ...
* config/arm/ctz2.S: New file.
---
 libgcc/config/arm/ctz2.S  | 86 +++
 libgcc/config/arm/lib1funcs.S | 65 +-
 2 files changed, 87 insertions(+), 64 deletions(-)
 create mode 100644 libgcc/config/arm/ctz2.S

diff --git a/libgcc/config/arm/ctz2.S b/libgcc/config/arm/ctz2.S
new file mode 100644
index 000..8702c9afb94
--- /dev/null
+++ b/libgcc/config/arm/ctz2.S
@@ -0,0 +1,86 @@
+/* Copyright (C) 1995-2021 Free Software Foundation, Inc.
+
+This file is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+This file is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+General Public License for more details.
+
+Under Section 7 of GPL version 3, you are granted additional
+permissions described in the GCC Runtime Library Exception, version
+3.1, as published by the Free Software Foundation.
+
+You should have received a copy of the GNU General Public License and
+a copy of the GCC Runtime Library Exception along with this program;
+see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+.  */
+
+
+#ifdef L_ctzsi2
+#ifdef NOT_ISA_TARGET_32BIT
+FUNC_START ctzsi2
+   negsr1, r0
+   andsr0, r0, r1
+   movsr1, #28
+   movsr3, #1
+   lslsr3, r3, #16
+   cmp r0, r3 /* 0x10000 */
+   bcc 2f
+   lsrsr0, r0, #16
+   subsr1, r1, #16
+2: lsrsr3, r3, #8
+   cmp r0, r3 /* #0x100 */
+   bcc 2f
+   lsrsr0, r0, #8
+   subsr1, r1, #8
+2: lsrsr3, r3, #4
+   cmp r0, r3 /* #0x10 */
+   bcc 2f
+   lsrsr0, r0, #4
+   subsr1, r1, #4
+2: adr r2, 1f
+   ldrbr0, [r2, r0]
+   subsr0, r0, r1
+   bx lr
+.align 2
+1:
+.byte  27, 28, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31
+   FUNC_END ctzsi2
+#else
+ARM_FUNC_START ctzsi2
+   rsb r1, r0, #0
+   and r0, r0, r1
+# if defined (__ARM_FEATURE_CLZ)
+   clz r0, r0
+   rsb r0, r0, #31
+   RET
+# else
+   mov r1, #28
+   cmp r0, #0x10000
+   do_it   cs, t
+   movcs   r0, r0, lsr #16
+   subcs   r1, r1, #16
+   cmp r0, #0x100
+   do_it   cs, t
+   movcs   r0, r0, lsr #8
+   subcs   r1, r1, #8
+   cmp r0, #0x10
+   do_it   cs, t
+   movcs   r0, r0, lsr #4
+   subcs   r1, r1, #4
+   adr r2, 1f
+   ldrbr0, [r2, r0]
+   sub r0, r0, r1
+   RET
+.align 2
+1:
+.byte  27, 28, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31
+# endif /* !defined (__ARM_FEATURE_CLZ) */
+   FUNC_END ctzsi2
+#endif
+#endif /* L_ctzsi2 */
+
diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index d92f73ba0c9..b1df00ac597 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -1804,70 +1804,7 @@ LSYM(Lover12):
 #endif /* __symbian__ */
 
 #include "clz2.S"
-
-#ifdef L_ctzsi2
-#ifdef NOT_ISA_TARGET_32BIT
-FUNC_START ctzsi2
-   negs r1, r0
-   ands r0, r0, r1
-   movs r1, #28
-   movs r3, #1
-   lsls r3, r3, #16
-   cmp r0, r3 /* 0x10000 */
-   bcc 2f
-   lsrs r0, r0, #16
-   subs r1, r1, #16
-2: lsrs r3, r3, #8
-   cmp r0, r3 /* #0x100 */
-   bcc 2f
-   lsrs r0, r0, #8
-   subs r1, r1, #8
-2: lsrs r3, r3, #4
-   cmp r0, r3 /* #0x10 */
-   bcc 2f
-   lsrs r0, r0, #4
-   subs r1, r1, #4
-2: adr r2, 1f
-   ldrb r0, [r2, r0]
-   subs r0, r0, r1
-   bx lr
-.align 2
-1:
-.byte  27, 28, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31
-   FUNC_END ctzsi2
-#else
-ARM_FUNC_START ctzsi2
-   rsb r1, r0, #0
-   and r0, r0, r1
-# if defined (__ARM_FEATURE_CLZ)
-   clz r0, r0
-   rsb r0, r0, #31
-   RET
-# else
-   mov r1, #28
-   cmp r0, #0x10000
-   do_it   cs, t
-   movcs   r0, r0, lsr #16
-   subcs   r1, r1, #16
-   cmp r0, #0x100
-   do_it   cs, t
-   movcs   r0, r0, lsr #8
-   subcs   r1, r1, #8
-   cmp r0, #0x10
-   do_it   cs, t
-   movcs   r0, r0, lsr #4
-   subcs   r1, r1, #4
-   adr r2, 1f
-   ldrb r0, [r2, r0]
-   sub r0, r0, r1
-   RET
-.align 2
-1:
-.byte  27, 28, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31
-# endif /* !defined (__ARM_FEATURE_CLZ) */
-   FUNC_

[PATCH v5 08/33] Refactor 64-bit shift functions into a new file

2021-01-15 Thread Daniel Engel
This will make it easier to isolate changes in subsequent patches.

gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/lib1funcs.S (__ashldi3, __ashrdi3, __lshrdi3): Moved to ...
* config/arm/eabi/lshift.S: New file.
---
 libgcc/config/arm/eabi/lshift.S | 123 
 libgcc/config/arm/lib1funcs.S   | 103 +-
 2 files changed, 124 insertions(+), 102 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/lshift.S

diff --git a/libgcc/config/arm/eabi/lshift.S b/libgcc/config/arm/eabi/lshift.S
new file mode 100644
index 000..0974a72c377
--- /dev/null
+++ b/libgcc/config/arm/eabi/lshift.S
@@ -0,0 +1,123 @@
+/* Copyright (C) 1995-2021 Free Software Foundation, Inc.
+
+This file is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+This file is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+General Public License for more details.
+
+Under Section 7 of GPL version 3, you are granted additional
+permissions described in the GCC Runtime Library Exception, version
+3.1, as published by the Free Software Foundation.
+
+You should have received a copy of the GNU General Public License and
+a copy of the GCC Runtime Library Exception along with this program;
+see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+<http://www.gnu.org/licenses/>.  */
+
+
+#ifdef L_lshrdi3
+
+   FUNC_START lshrdi3
+   FUNC_ALIAS aeabi_llsr lshrdi3
+   
+#ifdef __thumb__
+   lsrs al, r2
+   movs r3, ah
+   lsrs ah, r2
+   mov ip, r3
+   subs r2, #32
+   lsrs r3, r2
+   orrs al, r3
+   negs r2, r2
+   mov r3, ip
+   lsls r3, r2
+   orrs al, r3
+   RET
+#else
+   subs r3, r2, #32
+   rsb ip, r2, #32
+   movmi   al, al, lsr r2
+   movpl   al, ah, lsr r3
+   orrmi   al, al, ah, lsl ip
+   mov ah, ah, lsr r2
+   RET
+#endif
+   FUNC_END aeabi_llsr
+   FUNC_END lshrdi3
+
+#endif
+   
+#ifdef L_ashrdi3
+   
+   FUNC_START ashrdi3
+   FUNC_ALIAS aeabi_lasr ashrdi3
+   
+#ifdef __thumb__
+   lsrs al, r2
+   movs r3, ah
+   asrs ah, r2
+   subs r2, #32
+   @ If r2 is negative at this point the following step would OR
+   @ the sign bit into all of AL.  That's not what we want...
+   bmi 1f
+   mov ip, r3
+   asrs r3, r2
+   orrs al, r3
+   mov r3, ip
+1:
+   negs r2, r2
+   lsls r3, r2
+   orrs al, r3
+   RET
+#else
+   subs r3, r2, #32
+   rsb ip, r2, #32
+   movmi   al, al, lsr r2
+   movpl   al, ah, asr r3
+   orrmi   al, al, ah, lsl ip
+   mov ah, ah, asr r2
+   RET
+#endif
+
+   FUNC_END aeabi_lasr
+   FUNC_END ashrdi3
+
+#endif
+
+#ifdef L_ashldi3
+
+   FUNC_START ashldi3
+   FUNC_ALIAS aeabi_llsl ashldi3
+   
+#ifdef __thumb__
+   lsls ah, r2
+   movs r3, al
+   lsls al, r2
+   mov ip, r3
+   subs r2, #32
+   lsls r3, r2
+   orrs ah, r3
+   negs r2, r2
+   mov r3, ip
+   lsrs r3, r2
+   orrs ah, r3
+   RET
+#else
+   subs r3, r2, #32
+   rsb ip, r2, #32
+   movmi   ah, ah, lsl r2
+   movpl   ah, al, lsl r3
+   orrmi   ah, ah, al, lsr ip
+   mov al, al, lsl r2
+   RET
+#endif
+   FUNC_END aeabi_llsl
+   FUNC_END ashldi3
+
+#endif
+
diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index b1df00ac597..7ac50230725 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -1699,108 +1699,7 @@ LSYM(Lover12):
 
 /* Prevent __aeabi double-word shifts from being produced on SymbianOS.  */
 #ifndef __symbian__
-
-#ifdef L_lshrdi3
-
-   FUNC_START lshrdi3
-   FUNC_ALIAS aeabi_llsr lshrdi3
-   
-#ifdef __thumb__
-   lsrs al, r2
-   movs r3, ah
-   lsrs ah, r2
-   mov ip, r3
-   subs r2, #32
-   lsrs r3, r2
-   orrs al, r3
-   negs r2, r2
-   mov r3, ip
-   lsls r3, r2
-   orrs al, r3
-   RET
-#else
-   subs r3, r2, #32
-   rsb ip, r2, #32
-   movmi   al, al, lsr r2
-   movpl   al, ah, lsr r3
-   orrmi   al, al, ah, lsl ip
-   mov ah, ah, lsr r2
-   RET
-#endif
-   FUNC_END aeabi_llsr
-   FUNC_END lshrdi3
-
-#endif
-   
-#ifdef L_ashrdi3
-   
-   FUNC_START ashrdi3
-   FUNC_ALIAS aeabi_lasr ashrdi3
-   
-#ifdef __thumb__
-   lsrs al, r2
-   movs r3, ah
-   asrs ah, r2
-   subs r2, #32
-

[PATCH v5 09/33] Import 'clz' functions from the CM0 library

2021-01-15 Thread Daniel Engel
On architectures without __ARM_FEATURE_CLZ, this version combines __clzdi2()
with __clzsi2() into a single object with an efficient tail call.  Also, this
version merges the formerly separate Thumb and ARM code implementations
into a unified instruction sequence.  This change significantly improves
Thumb performance without affecting ARM performance.  Finally, this version
adds a new __OPTIMIZE_SIZE__ build option (binary search loop).

There is no change to the code for architectures with __ARM_FEATURE_CLZ.
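For reference, the binary-search strategy used on architectures without
__ARM_FEATURE_CLZ, and the word-splitting tail call from __clzdi2() into
__clzsi2(), can be sketched as a C model (illustrative only; the names
clz32/clz64 are hypothetical stand-ins for the assembly routines):

```c
#include <assert.h>
#include <stdint.h>

// C model of the binary-search CLZ: narrow the window by halves,
// accumulating the count of leading zeros.
int clz32(uint32_t x)
{
    if (x == 0)
        return 32;
    int n = 0;
    if (x <= 0x0000FFFFu) { n += 16; x <<= 16; }
    if (x <= 0x00FFFFFFu) { n += 8;  x <<= 8;  }
    if (x <= 0x0FFFFFFFu) { n += 4;  x <<= 4;  }
    if (x <= 0x3FFFFFFFu) { n += 2;  x <<= 2;  }
    if (x <= 0x7FFFFFFFu) { n += 1; }
    return n;
}

// The double-word count reduces to a single-word count, mirroring
// the efficient tail call described above.
int clz64(uint64_t x)
{
    uint32_t hi = (uint32_t)(x >> 32);
    return hi ? clz32(hi) : 32 + clz32((uint32_t)x);
}
```

The assembly versions additionally share one instruction stream between the
two entry points, which is where the size saving comes from.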

gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/bits/clz2.S (__clzsi2, __clzdi2): Reduced code size on
architectures without __ARM_FEATURE_CLZ.
* config/arm/t-elf (LIB1ASMFUNCS): Moved _clzsi2 to new weak group.
---
 libgcc/config/arm/clz2.S | 362 +--
 libgcc/config/arm/t-elf  |   7 +-
 2 files changed, 236 insertions(+), 133 deletions(-)

diff --git a/libgcc/config/arm/clz2.S b/libgcc/config/arm/clz2.S
index 2ad9a81892c..dc246708a82 100644
--- a/libgcc/config/arm/clz2.S
+++ b/libgcc/config/arm/clz2.S
@@ -1,145 +1,243 @@
-/* Copyright (C) 1995-2021 Free Software Foundation, Inc.
+/* clz2.S: Cortex M0 optimized 'clz' functions
 
-This file is free software; you can redistribute it and/or modify it
-under the terms of the GNU General Public License as published by the
-Free Software Foundation; either version 3, or (at your option) any
-later version.
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel (g...@danielengel.com)
 
-This file is distributed in the hope that it will be useful, but
-WITHOUT ANY WARRANTY; without even the implied warranty of
-MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-General Public License for more details.
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
 
-Under Section 7 of GPL version 3, you are granted additional
-permissions described in the GCC Runtime Library Exception, version
-3.1, as published by the Free Software Foundation.
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
 
-You should have received a copy of the GNU General Public License and
-a copy of the GCC Runtime Library Exception along with this program;
-see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
-<http://www.gnu.org/licenses/>.  */
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#if defined(__ARM_FEATURE_CLZ) && __ARM_FEATURE_CLZ
+
+#ifdef L_clzdi2
+
+// int __clzdi2(long long)
+// Counts leading zero bits in $r1:$r0.
+// Returns the result in $r0.
+FUNC_START_SECTION clzdi2 .text.sorted.libgcc.clz2.clzdi2
+CFI_START_FUNCTION
+
+// Moved here from lib1funcs.S
+cmp xxh, #0
+do_it   eq, et
+clzeq   r0, xxl
+clzne   r0, xxh
+addeq   r0, #32
+RET
+
+CFI_END_FUNCTION
+FUNC_END clzdi2
+
+#endif /* L_clzdi2 */
 
 
 #ifdef L_clzsi2
-#ifdef NOT_ISA_TARGET_32BIT
-FUNC_START clzsi2
-   movs r1, #28
-   movs r3, #1
-   lsls r3, r3, #16
-   cmp r0, r3 /* 0x10000 */
-   bcc 2f
-   lsrs r0, r0, #16
-   subs r1, r1, #16
-2: lsrs r3, r3, #8
-   cmp r0, r3 /* #0x100 */
-   bcc 2f
-   lsrs r0, r0, #8
-   subs r1, r1, #8
-2: lsrs r3, r3, #4
-   cmp r0, r3 /* #0x10 */
-   bcc 2f
-   lsrs r0, r0, #4
-   subs r1, r1, #4
-2: adr r2, 1f
-   ldrb r0, [r2, r0]
-   adds r0, r0, r1
-   bx lr
-.align 2
-1:
-.byte 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0
-   FUNC_END clzsi2
-#else
-ARM_FUNC_START clzsi2
-# if defined (__ARM_FEATURE_CLZ)
-   clz r0, r0
-   RET
-# else
-   mov r1, #28
-   cmp r0, #0x10000
-   do_it   cs, t
-   movcs   r0, r0, lsr #16
-   subcs   r1, r1, #16
-   cmp r0, #0x100
-   do_it   cs, t
-   movcs   r0, r0, lsr #8
-   subcs   r1, r1, #8
-   cmp r0, #0x10
-   do_it   cs, t
-   movcs   r0, r0, lsr #4
-   subcs   r1, r1, #4
-   adr r2, 1f
-   ldrb r0, [r2, r0]
-   add r0, r0, r1
-   RET
-.align 2
-1:
-.byte 4, 3, 2, 2, 1, 1, 1

[PATCH v5 10/33] Import 'ctz' functions from the CM0 library

2021-01-15 Thread Daniel Engel
This version combines __ctzdi2() with __ctzsi2() into a single object with
an efficient tail call.  The former implementation of __ctzdi2() was in C.

On architectures without __ARM_FEATURE_CLZ, this version merges the formerly
separate Thumb and ARM code sequences into a unified instruction sequence.
This change significantly improves Thumb performance without affecting ARM
performance.  Finally, this version adds a new __OPTIMIZE_SIZE__ build option.

On architectures with __ARM_FEATURE_CLZ, __ctzsi2(0) now returns 32.  Formerly,
__ctzsi2(0) would return -1.  Architectures without __ARM_FEATURE_CLZ have
always returned 32, so this change makes the return value consistent.
This change costs 2 extra instructions (branchless).

Likewise on architectures with __ARM_FEATURE_CLZ,  __ctzdi2(0) now returns
64 instead of 31.
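The behavior described above can be summarized with a small C model
(illustrative only; ctz32/ctz64 are hypothetical stand-ins, and the real
implementation derives the count from 'clz' where available via
ctz(x) == 31 - clz(x & -x)):

```c
#include <assert.h>
#include <stdint.h>

// Simple reference clz for the model; clz32ref(0) == 32.
static int clz32ref(uint32_t x)
{
    int n = 32;
    while (x) { n--; x >>= 1; }
    return n;
}

// ctz via clz of the isolated least-significant set bit (x & -x).
int ctz32(uint32_t x)
{
    if (x == 0)
        return 32;                 // new behavior: __ctzsi2(0) == 32
    return 31 - clz32ref(x & -x);
}

int ctz64(uint64_t x)
{
    uint32_t lo = (uint32_t)x;
    if (lo)
        return ctz32(lo);
    uint32_t hi = (uint32_t)(x >> 32);
    return hi ? 32 + ctz32(hi) : 64;   // new behavior: __ctzdi2(0) == 64
}
```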

gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/bits/ctz2.S (__ctzdi2): Added a new function.
(__ctzsi2): Reduced size on architectures without __ARM_FEATURE_CLZ;
changed so __ctzsi2(0)=32 on architectures with __ARM_FEATURE_CLZ.
* config/arm/t-elf (LIB1ASMFUNCS): Added _ctzdi2;
moved _ctzsi2 to the weak function objects group.
---
 libgcc/config/arm/ctz2.S | 307 +--
 libgcc/config/arm/t-elf  |   3 +-
 2 files changed, 232 insertions(+), 78 deletions(-)

diff --git a/libgcc/config/arm/ctz2.S b/libgcc/config/arm/ctz2.S
index 8702c9afb94..ee6df6d6d01 100644
--- a/libgcc/config/arm/ctz2.S
+++ b/libgcc/config/arm/ctz2.S
@@ -1,86 +1,239 @@
-/* Copyright (C) 1995-2021 Free Software Foundation, Inc.
+/* ctz2.S: ARM optimized 'ctz' functions
 
-This file is free software; you can redistribute it and/or modify it
-under the terms of the GNU General Public License as published by the
-Free Software Foundation; either version 3, or (at your option) any
-later version.
+   Copyright (C) 2020-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel (g...@danielengel.com)
 
-This file is distributed in the hope that it will be useful, but
-WITHOUT ANY WARRANTY; without even the implied warranty of
-MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-General Public License for more details.
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
 
-Under Section 7 of GPL version 3, you are granted additional
-permissions described in the GCC Runtime Library Exception, version
-3.1, as published by the Free Software Foundation.
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
 
-You should have received a copy of the GNU General Public License and
-a copy of the GCC Runtime Library Exception along with this program;
-see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
-<http://www.gnu.org/licenses/>.  */
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
 
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
 
-#ifdef L_ctzsi2
-#ifdef NOT_ISA_TARGET_32BIT
-FUNC_START ctzsi2
-   negs r1, r0
-   ands r0, r0, r1
-   movs r1, #28
-   movs r3, #1
-   lsls r3, r3, #16
-   cmp r0, r3 /* 0x10000 */
-   bcc 2f
-   lsrs r0, r0, #16
-   subs r1, r1, #16
-2: lsrs r3, r3, #8
-   cmp r0, r3 /* #0x100 */
-   bcc 2f
-   lsrs r0, r0, #8
-   subs r1, r1, #8
-2: lsrs r3, r3, #4
-   cmp r0, r3 /* #0x10 */
-   bcc 2f
-   lsrs r0, r0, #4
-   subs r1, r1, #4
-2: adr r2, 1f
-   ldrb r0, [r2, r0]
-   subs r0, r0, r1
-   bx lr
-.align 2
-1:
-.byte  27, 28, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31
-   FUNC_END ctzsi2
+
+// When the hardware 'ctz' function is available, an efficient version
+//  of __ctzsi2(x) can be created by calculating '31 - __ctzsi2(lsb(x))',
+//  where lsb(x) is 'x' with only the least-significant '1' bit set.
+// The following offset applies to all of the functions in this file.
+#if defined(__ARM_FEATURE_CLZ) && __ARM_FEATURE_CLZ
+  #define CTZ_RESULT_OFFSET 1
 #else
-ARM_FUNC_START ctzsi2
-   rsb r1, r0, #0
-   and r0, r0, r1
-# if defined (__ARM_FEATURE_CLZ)
-   clz r0, r0
-   rsb r0, r0, #31
-   RET
-# else
-   mov r1, #28
-   cmp 

[PATCH v5 11/33] Import 64-bit shift functions from the CM0 library

2021-01-15 Thread Daniel Engel
The Thumb versions of these functions are each 1-2 instructions smaller
and faster, and branchless when the IT instruction is available.

The ARM versions were converted to the "xxl/xxh" big-endian register
naming convention, but are otherwise unchanged.
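The word-splitting logic common to these shifts can be sketched in C
(a model of the logical right shift only, with the same restriction as
the assembly; 'llsr' is an illustrative name):

```c
#include <assert.h>
#include <stdint.h>

// C model of the two-word logical right shift.  For counts under 32,
// the low word receives a "remainder" of (32 - n) bits from the high
// word; for larger counts the shift crosses entirely into the low word.
// Like the assembly, only counts 0..63 are guaranteed.
uint64_t llsr(uint64_t v, unsigned n)
{
    uint32_t lo = (uint32_t)v, hi = (uint32_t)(v >> 32);
    if (n == 0)
        return v;
    if (n < 32) {
        lo = (lo >> n) | (hi << (32 - n));  // merge the cross-word bits
        hi >>= n;
    } else {
        lo = hi >> (n - 32);
        hi = 0;
    }
    return ((uint64_t)hi << 32) | lo;
}
```

The arithmetic and left-shift variants differ only in which word carries
the remainder and whether the vacated bits replicate the sign.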

gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/bits/shift.S (__ashldi3, __ashrdi3, __lshrdi3):
Reduced code size on Thumb architectures;
updated big-endian register naming convention to "xxl/xxh".
---
 libgcc/config/arm/eabi/lshift.S | 338 +---
 1 file changed, 228 insertions(+), 110 deletions(-)

diff --git a/libgcc/config/arm/eabi/lshift.S b/libgcc/config/arm/eabi/lshift.S
index 0974a72c377..16cf2dcef04 100644
--- a/libgcc/config/arm/eabi/lshift.S
+++ b/libgcc/config/arm/eabi/lshift.S
@@ -1,123 +1,241 @@
-/* Copyright (C) 1995-2021 Free Software Foundation, Inc.
+/* lshift.S: ARM optimized 64-bit integer shift
 
-This file is free software; you can redistribute it and/or modify it
-under the terms of the GNU General Public License as published by the
-Free Software Foundation; either version 3, or (at your option) any
-later version.
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
 
-This file is distributed in the hope that it will be useful, but
-WITHOUT ANY WARRANTY; without even the implied warranty of
-MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-General Public License for more details.
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
 
-Under Section 7 of GPL version 3, you are granted additional
-permissions described in the GCC Runtime Library Exception, version
-3.1, as published by the Free Software Foundation.
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
 
-You should have received a copy of the GNU General Public License and
-a copy of the GCC Runtime Library Exception along with this program;
-see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
-<http://www.gnu.org/licenses/>.  */
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
 
 
 #ifdef L_lshrdi3
 
-   FUNC_START lshrdi3
-   FUNC_ALIAS aeabi_llsr lshrdi3
-   
-#ifdef __thumb__
-   lsrs al, r2
-   movs r3, ah
-   lsrs ah, r2
-   mov ip, r3
-   subs r2, #32
-   lsrs r3, r2
-   orrs al, r3
-   negs r2, r2
-   mov r3, ip
-   lsls r3, r2
-   orrs al, r3
-   RET
-#else
-   subs r3, r2, #32
-   rsb ip, r2, #32
-   movmi   al, al, lsr r2
-   movpl   al, ah, lsr r3
-   orrmi   al, al, ah, lsl ip
-   mov ah, ah, lsr r2
-   RET
-#endif
-   FUNC_END aeabi_llsr
-   FUNC_END lshrdi3
-
-#endif
-   
+// long long __aeabi_llsr(long long, int)
+// Logical shift right the 64 bit value in $r1:$r0 by the count in $r2.
+// The result is only guaranteed for shifts in the range of '0' to '63'.
+// Uses $r3 as scratch space.
+FUNC_START_SECTION aeabi_llsr .text.sorted.libgcc.lshrdi3
+FUNC_ALIAS lshrdi3 aeabi_llsr
+CFI_START_FUNCTION
+
+  #if defined(__thumb__) && __thumb__
+
+// Save a copy for the remainder.
+movs r3, xxh
+
+// Assume a simple shift.
+lsrs xxl, r2
+lsrs xxh, r2
+
+// Test if the shift distance is larger than 1 word.
+subs r2, #32
+
+#ifdef __HAVE_FEATURE_IT
+do_it   lo,te
+
+// The remainder is opposite the main shift, (32 - x) bits.
+rsblo   r2, #0
+lsllo   r3, r2
+
+// The remainder shift extends into the hi word.
+lsrhs   r3, r2
+
+#else /* !__HAVE_FEATURE_IT */
+bhs LLSYM(__llsr_large)
+
+// The remainder is opposite the main shift, (32 - x) bits.
+rsbs r2, #0
+lsls r3, r2
+
+// Cancel any remaining shift.
+eors r2, r2
+
+  LLSYM(__llsr_large):
+// Apply any remaining shift to the hi word.
+lsrs r3, r2
+
+#endif /* !__HAVE_FEATURE_IT */
+
+// Merge remainder and result.
+adds xxl, r3
+RET
+
+  #else /* 

[PATCH v5 12/33] Import 'clrsb' functions from the CM0 library

2021-01-15 Thread Daniel Engel
This implementation provides an efficient tail call to __clzsi2(), making the
functions rather smaller and faster than the C versions.
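The sign-folding identity that makes the tail call possible can be shown
as a C model (illustrative; assumes arithmetic right shift on signed
values, as GCC defines):

```c
#include <assert.h>
#include <stdint.h>

// Reference clz for the model; clz32ref(0) == 32.
static int clz32ref(uint32_t x)
{
    int n = 32;
    while (x) { n--; x >>= 1; }
    return n;
}

// XORing with the sign mask turns redundant sign bits into leading
// zeros, so clrsb(x) == clz(x ^ (x >> 31)) - 1.  The sign bit itself
// is by definition not redundant, hence the trailing -1.
int clrsb32(int32_t x)
{
    uint32_t folded = (uint32_t)x ^ (uint32_t)(x >> 31);
    return clz32ref(folded) - 1;
}
```

Note the all-zero / all-one inputs fold to 0, giving the maximum result
of 31, which matches the constant preloaded in the assembly fallback.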

gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/bits/clz2.S (__clrsbsi2, __clrsbdi2):
Added new functions.
* config/arm/t-elf (LIB1ASMFUNCS):
Added new function objects _clrsbsi2 and _clrsbdi2).
---
 libgcc/config/arm/clz2.S | 108 ++-
 libgcc/config/arm/t-elf  |   2 +
 2 files changed, 108 insertions(+), 2 deletions(-)

diff --git a/libgcc/config/arm/clz2.S b/libgcc/config/arm/clz2.S
index dc246708a82..5f608c0c2a3 100644
--- a/libgcc/config/arm/clz2.S
+++ b/libgcc/config/arm/clz2.S
@@ -1,4 +1,4 @@
-/* clz2.S: Cortex M0 optimized 'clz' functions
+/* clz2.S: ARM optimized 'clz' and related functions
 
Copyright (C) 2018-2021 Free Software Foundation, Inc.
Contributed by Daniel Engel (g...@danielengel.com)
@@ -23,7 +23,7 @@
   <http://www.gnu.org/licenses/>.  */
 
 
-#if defined(__ARM_FEATURE_CLZ) && __ARM_FEATURE_CLZ
+#ifdef __ARM_FEATURE_CLZ
 
 #ifdef L_clzdi2
 
@@ -241,3 +241,107 @@ FUNC_END clzdi2
 
 #endif /* !__ARM_FEATURE_CLZ */
 
+
+#ifdef L_clrsbdi2
+
+// int __clrsbdi2(long long)
+// Counts the number of "redundant sign bits" in $r1:$r0.
+// Returns the result in $r0.
+// Uses $r2 and $r3 as scratch space.
+FUNC_START_SECTION clrsbdi2 .text.sorted.libgcc.clz2.clrsbdi2
+CFI_START_FUNCTION
+
+  #if defined(__ARM_FEATURE_CLZ) && __ARM_FEATURE_CLZ
+// Invert negative signs to keep counting zeros.
+asrs r3, xxh, #31
+eors xxl, r3
+eors xxh, r3
+
+// Same as __clzdi2(), except that the 'C' flag is pre-calculated.
+// Also, the trailing 'subs', since the last bit is not redundant.
+do_it   eq, et
+clzeq   r0, xxl
+clzne   r0, xxh
+addeq   r0, #32
+subs r0, #1
+RET
+
+  #else  /* !__ARM_FEATURE_CLZ */
+// Result if all the bits in the argument are zero.
+// Set it here to keep the flags clean after 'eors' below.
+movs r2, #31
+
+// Invert negative signs to keep counting zeros.
+asrs r3, xxh, #31
+eors xxh, r3
+
+#if defined(__ARMEB__) && __ARMEB__
+// If the upper word is non-zero, return '__clzsi2(upper) - 1'.
+bne SYM(__internal_clzsi2)
+
+// The upper word is zero, prepare the lower word.
+movs r0, r1
+eors r0, r3
+
+#else /* !__ARMEB__ */
+// Save the lower word temporarily.
+// This somewhat awkward construction adds one cycle when the
+//  branch is not taken, but prevents a double-branch.
+eors r3, r0
+
+// If the upper word is non-zero, return '__clzsi2(upper) - 1'.
+movs r0, r1
+bne SYM(__internal_clzsi2)
+
+// Restore the lower word.
+movs r0, r3
+
+#endif /* !__ARMEB__ */
+
+// The upper word is zero, return '31 + __clzsi2(lower)'.
+adds r2, #32
+b   SYM(__internal_clzsi2)
+
+  #endif /* !__ARM_FEATURE_CLZ */
+
+CFI_END_FUNCTION
+FUNC_END clrsbdi2
+
+#endif /* L_clrsbdi2 */
+
+
+#ifdef L_clrsbsi2
+
+// int __clrsbsi2(int)
+// Counts the number of "redundant sign bits" in $r0.
+// Returns the result in $r0.
+// Uses $r2 and possibly $r3 as scratch space.
+FUNC_START_SECTION clrsbsi2 .text.sorted.libgcc.clz2.clrsbsi2
+CFI_START_FUNCTION
+
+// Invert negative signs to keep counting zeros.
+asrs r2, r0, #31
+eors r0, r2
+
+  #if defined(__ARM_FEATURE_CLZ) && __ARM_FEATURE_CLZ
+// Count.
+clz r0, r0
+
+// The result for a positive value will always be >= 1.
+// By definition, the last bit is not redundant.
+subs r0, #1
+RET
+
+  #else /* !__ARM_FEATURE_CLZ */
+// Result if all the bits in the argument are zero.
+// By definition, the last bit is not redundant.
+movs r2, #31
+b   SYM(__internal_clzsi2)
+
+  #endif  /* !__ARM_FEATURE_CLZ */
+
+CFI_END_FUNCTION
+FUNC_END clrsbsi2
+
+#endif /* L_clrsbsi2 */
+
diff --git a/libgcc/config/arm/t-elf b/libgcc/config/arm/t-elf
index 33b83ac4adf..89071cebe45 100644
--- a/libgcc/config/arm/t-elf
+++ b/libgcc/config/arm/t-elf
@@ -31,6 +31,8 @@ LIB1ASMFUNCS += \
_ashldi3 \
_ashrdi3 \
_lshrdi3 \
+   _clrsbsi2 \
+   _clrsbdi2 \
_clzdi2 \
_ctzdi2 \
_dvmd_tls \
-- 
2.25.1



[PATCH v5 13/33] Import 'ffs' functions from the CM0 library

2021-01-15 Thread Daniel Engel
This implementation provides an efficient tail call to __ctzsi2(), making the
functions rather smaller and faster than the C versions.
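The relation the tail call exploits is simply ffs(x) == ctz(x) + 1 for
non-zero x, with ffs(0) defined as 0.  A C model (illustrative names):

```c
#include <assert.h>
#include <stdint.h>

// Reference ctz for the model; caller guarantees x != 0.
static int ctz32ref(uint32_t x)
{
    int n = 0;
    while (!(x & 1)) { n++; x >>= 1; }
    return n;
}

int ffs32(uint32_t x)
{
    return x ? ctz32ref(x) + 1 : 0;
}

int ffs64(uint64_t x)
{
    uint32_t lo = (uint32_t)x, hi = (uint32_t)(x >> 32);
    if (lo)
        return ctz32ref(lo) + 1;
    return hi ? ctz32ref(hi) + 33 : 0;  // one word of bias plus the +1
}
```

In the assembly this bias is folded into the scratch register preloaded
before the branch to the shared 'ctz' core, so no extra add is needed.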

gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/bits/ctz2.S (__ffssi2, __ffsdi2): New functions.
* config/arm/t-elf (LIB1ASMFUNCS): Added _ffssi2 and _ffsdi2.
---
 libgcc/config/arm/ctz2.S | 77 +++-
 libgcc/config/arm/t-elf  |  2 ++
 2 files changed, 78 insertions(+), 1 deletion(-)

diff --git a/libgcc/config/arm/ctz2.S b/libgcc/config/arm/ctz2.S
index ee6df6d6d01..545f8f94d71 100644
--- a/libgcc/config/arm/ctz2.S
+++ b/libgcc/config/arm/ctz2.S
@@ -1,4 +1,4 @@
-/* ctz2.S: ARM optimized 'ctz' functions
+/* ctz2.S: ARM optimized 'ctz' and related functions
 
Copyright (C) 2020-2021 Free Software Foundation, Inc.
Contributed by Daniel Engel (g...@danielengel.com)
@@ -237,3 +237,78 @@ FUNC_END ctzdi2
 
 #endif /* L_ctzsi2 || L_ctzdi2 */
 
+
+#ifdef L_ffsdi2
+
+// int __ffsdi2(long long)
+// Return the index of the least significant 1-bit in $r1:r0,
+//  or zero if $r1:r0 is zero.  The least significant bit is index 1.
+// Returns the result in $r0.
+// Uses $r2 and possibly $r3 as scratch space.
+// Same section as __ctzsi2() for sake of the tail call branches.
+FUNC_START_SECTION ffsdi2 .text.sorted.libgcc.ctz2.ffsdi2
+CFI_START_FUNCTION
+
+// Simplify branching by assuming a non-zero lower word.
+// For all such, ffssi2(x) == ctzsi2(x) + 1.
+movs r2, #(33 - CTZ_RESULT_OFFSET)
+
+  #if defined(__ARMEB__) && __ARMEB__
+// HACK: Save the upper word in a scratch register.
+movs r3, r0
+
+// Test the lower word.
+movs r0, r1
+bne SYM(__internal_ctzsi2)
+
+// Test the upper word.
+movs r2, #(65 - CTZ_RESULT_OFFSET)
+movs r0, r3
+bne SYM(__internal_ctzsi2)
+
+  #else /* !__ARMEB__ */
+// Test the lower word.
+cmp r0, #0
+bne SYM(__internal_ctzsi2)
+
+// Test the upper word.
+movs r2, #(65 - CTZ_RESULT_OFFSET)
+movs r0, r1
+bne SYM(__internal_ctzsi2)
+
+  #endif /* !__ARMEB__ */
+
+// Upper and lower words are both zero.
+RET
+
+CFI_END_FUNCTION
+FUNC_END ffsdi2
+
+#endif /* L_ffsdi2 */
+
+
+#ifdef L_ffssi2
+
+// int __ffssi2(int)
+// Return the index of the least significant 1-bit in $r0,
+//  or zero if $r0 is zero.  The least significant bit is index 1.
+// Returns the result in $r0.
+// Uses $r2 and possibly $r3 as scratch space.
+// Same section as __ctzsi2() for sake of the tail call branches.
+FUNC_START_SECTION ffssi2 .text.sorted.libgcc.ctz2.ffssi2
+CFI_START_FUNCTION
+
+// Simplify branching by assuming a non-zero argument.
+// For all such, ffssi2(x) == ctzsi2(x) + 1.
+movs r2, #(33 - CTZ_RESULT_OFFSET)
+
+// Test for zero, return unmodified.
+cmp r0, #0
+bne SYM(__internal_ctzsi2)
+RET
+
+CFI_END_FUNCTION
+FUNC_END ffssi2
+
+#endif /* L_ffssi2 */
+
diff --git a/libgcc/config/arm/t-elf b/libgcc/config/arm/t-elf
index 89071cebe45..346fc766f17 100644
--- a/libgcc/config/arm/t-elf
+++ b/libgcc/config/arm/t-elf
@@ -35,6 +35,8 @@ LIB1ASMFUNCS += \
_clrsbdi2 \
_clzdi2 \
_ctzdi2 \
+   _ffssi2 \
+   _ffsdi2 \
_dvmd_tls \
_divsi3 \
_modsi3 \
-- 
2.25.1



[PATCH v5 14/33] Import 'parity' functions from the CM0 library

2021-01-15 Thread Daniel Engel
The functional overlap between the single- and double-word functions
makes this implementation about half the size of the C functions
if both functions are linked in the same application.
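The overlap is easy to see in a C model: the double-word parity reduces
to the single-word parity after one XOR, so the two entry points share
nearly all of their code (illustrative sketch; the assembly folds toward
the MSB, this model equivalently folds toward the LSB):

```c
#include <assert.h>
#include <stdint.h>

// Fold halves together repeatedly; the parity of the whole value
// accumulates into the low bit.
int parity32(uint32_t x)
{
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1;
}

// __paritydi2 just XORs the two words and falls into __paritysi2;
// byte-endianness does not matter for parity.
int parity64(uint64_t x)
{
    return parity32((uint32_t)x ^ (uint32_t)(x >> 32));
}
```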

gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/parity.S: New file for __paritysi2/di2().
* config/arm/lib1funcs.S: #include bit/parity.S
* config/arm/t-elf (LIB1ASMFUNCS): Added _paritysi2/di2.
---
 libgcc/config/arm/lib1funcs.S |   1 +
 libgcc/config/arm/parity.S| 120 ++
 libgcc/config/arm/t-elf   |   2 +
 3 files changed, 123 insertions(+)
 create mode 100644 libgcc/config/arm/parity.S

diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index 7ac50230725..600ea2dfdc9 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -1704,6 +1704,7 @@ LSYM(Lover12):
 
 #include "clz2.S"
 #include "ctz2.S"
+#include "parity.S"
 
 /*  */
 /* These next two sections are here despite the fact that they contain Thumb 
diff --git a/libgcc/config/arm/parity.S b/libgcc/config/arm/parity.S
new file mode 100644
index 000..45233bc9d8f
--- /dev/null
+++ b/libgcc/config/arm/parity.S
@@ -0,0 +1,120 @@
+/* parity.S: ARM optimized parity functions
+
+   Copyright (C) 2020-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel (g...@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#ifdef L_paritydi2
+
+// int __paritydi2(long long)
+// Returns '0' if the number of bits set in $r1:r0 is even, and '1' otherwise.
+// Returns the result in $r0.
+FUNC_START_SECTION paritydi2 .text.sorted.libgcc.paritydi2
+CFI_START_FUNCTION
+
+// Combine the upper and lower words, then fall through.
+// Byte-endianness does not matter for this function.
+eors r0, r1
+
+#endif /* L_paritydi2 */
+
+
+// The implementation of __paritydi2() tightly couples with __paritysi2(),
+//  such that instructions must appear consecutively in the same memory
+//  section for proper flow control.  However, this construction inhibits
+//  the ability to discard __paritydi2() when only using __paritysi2().
+// Therefore, this block configures __paritysi2() for compilation twice.
+// The first version is a minimal standalone implementation, and the second
+//  version is the continuation of __paritydi2().  The standalone version must
+//  be declared WEAK, so that the combined version can supersede it and
+//  provide both symbols when required.
+// '_paritysi2' should appear before '_paritydi2' in LIB1ASMFUNCS.
+#if defined(L_paritysi2) || defined(L_paritydi2)
+
+#ifdef L_paritysi2
+// int __paritysi2(int)
+// Returns '0' if the number of bits set in $r0 is even, and '1' otherwise.
+// Returns the result in $r0.
+// Uses $r2 as scratch space.
+WEAK_START_SECTION paritysi2 .text.sorted.libgcc.paritysi2
+CFI_START_FUNCTION
+
+#else /* L_paritydi2 */
+FUNC_ENTRY paritysi2
+
+#endif
+
+  #if defined(__thumb__) && __thumb__
+#if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+
+// Size optimized: 16 bytes, 40 cycles
+// Speed optimized: 24 bytes, 14 cycles
+movs r2, #16
+
+LLSYM(__parity_loop):
+// Calculate the parity of successively smaller half-words into the MSB.
+movs r1, r0
+lsls r1, r2
+eors r0, r1
+lsrs r2, #1
+bne LLSYM(__parity_loop)
+
+#else /* !__OPTIMIZE_SIZE__ */
+
+// Unroll the loop.  The 'libgcc' reference C implementation replaces
+//  the x2 and the x1 shifts with a constant.  However, since it takes
+//  4 cycles to load, index, and mask the constant result, it doesn't
+//  cost anything to keep shifting (and saves a few bytes).
+lsls r1, r0, #16
+eors r0, r1
+lsls r1, r0, #8
+eors r0, r1
+lsls r1, r0, #4
+eors r0, r1
+lslsr1, r0, 

[PATCH v5 15/33] Import 'popcnt' functions from the CM0 library

2021-01-15 Thread Daniel Engel
The functional overlap between the single- and double-word functions
makes this implementation about 30% smaller than the C functions
if both functions are linked together in the same application.
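As with parity, the double-word count reduces to two single-word counts,
which is where the sharing comes from.  A C model of the word split,
using a standard branch-free (SWAR) fold for the 32-bit count
(illustrative only; the assembly uses its own loop/unrolled variants):

```c
#include <assert.h>
#include <stdint.h>

// Classic SWAR popcount: sum bits in 2-, 4-, then 8-bit groups,
// then add the four byte sums with a multiply.
int popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                  // 2-bit sums
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  // 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  // 8-bit sums
    return (int)((x * 0x01010101u) >> 24);             // total in top byte
}

// __popcountdi2 is the sum of the counts of the two words.
int popcount64(uint64_t x)
{
    return popcount32((uint32_t)x) + popcount32((uint32_t)(x >> 32));
}
```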

gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/popcnt.S (__popcountsi2, __popcountdi2): New file.
* config/arm/lib1funcs.S: #include bit/popcnt.S
* config/arm/t-elf (LIB1ASMFUNCS): Add _popcountsi2/di2.
---
 libgcc/config/arm/lib1funcs.S |   1 +
 libgcc/config/arm/popcnt.S| 189 ++
 libgcc/config/arm/t-elf   |   2 +
 3 files changed, 192 insertions(+)
 create mode 100644 libgcc/config/arm/popcnt.S

diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index 600ea2dfdc9..bd84a3e4281 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -1705,6 +1705,7 @@ LSYM(Lover12):
 #include "clz2.S"
 #include "ctz2.S"
 #include "parity.S"
+#include "popcnt.S"
 
 /*  */
 /* These next two sections are here despite the fact that they contain Thumb 
diff --git a/libgcc/config/arm/popcnt.S b/libgcc/config/arm/popcnt.S
new file mode 100644
index 000..51b1ed745ee
--- /dev/null
+++ b/libgcc/config/arm/popcnt.S
@@ -0,0 +1,189 @@
+/* popcnt.S: ARM optimized popcount functions
+
+   Copyright (C) 2020-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel (g...@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+
+#ifdef L_popcountdi2
+
+// int __popcountdi2(int)
+// Returns the number of bits set in $r1:$r0.
+// Returns the result in $r0.
+FUNC_START_SECTION popcountdi2 .text.sorted.libgcc.popcountdi2
+CFI_START_FUNCTION
+
+  #if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+// Initialize the result.
+// Compensate for the two extra loop (one for each word)
+//  required to detect zero arguments.
+movs r2, #2
+
+LLSYM(__popcountd_loop):
+// Same as __popcounts_loop below, except for $r1.
+subs r2, #1
+subs r3, r1, #1
+ands r1, r3
+bcs LLSYM(__popcountd_loop)
+
+// Repeat the operation for the second word.
+b   LLSYM(__popcounts_loop)
+
+  #else /* !__OPTIMIZE_SIZE__ */
+// Load the one-bit alternating mask.
+ldr r3, =0x55555555
+
+// Reduce the second word.
+lsrs r2, r1, #1
+ands r2, r3
+subs r1, r2
+
+// Reduce the first word.
+lsrs r2, r0, #1
+ands r2, r3
+subs r0, r2
+
+// Load the two-bit alternating mask.
+ldr r3, =0x33333333
+
+// Reduce the second word.
+lsrs r2, r1, #2
+ands r2, r3
+ands r1, r3
+adds r1, r2
+
+// Reduce the first word.
+lsrs r2, r0, #2
+ands r2, r3
+ands r0, r3
+adds r0, r2
+
+// There will be a maximum of 8 bits in each 4-bit field.
+// Jump into the single word flow to combine and complete.
+b   LLSYM(__popcounts_merge)
+
+  #endif /* !__OPTIMIZE_SIZE__ */
+#endif /* L_popcountdi2 */
+
+
+// The implementation of __popcountdi2() tightly couples with __popcountsi2(),
+//  such that instructions must appear consecutively in the same memory
+//  section for proper flow control.  However, this construction inhibits
+//  the ability to discard __popcountdi2() when only using __popcountsi2().
+// Therefore, this block configures __popcountsi2() for compilation twice.
+// The first version is a minimal standalone implementation, and the second
+//  version is the continuation of __popcountdi2().  The standalone version must
+//  be declared WEAK, so that the combined version can supersede it and
+//  provide both symbols when required.
+// '_popcountsi2' should appear before '_popcountdi2' in LIB1ASMFUNCS.
+#if defined(L_popcountsi2) || def

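The mask-and-add reduction the fast path above performs is the classic SWAR popcount. A minimal C sketch for a single word follows (illustrative only; `popcount32` is a hypothetical name, and the assembly differs slightly by summing the two words of a double-word argument before the 4-bit stage, since each 4-bit field can then hold up to 8):

```c
#include <stdint.h>

/* Classic mask-and-add reduction: 2-bit pairs, then nibbles, then bytes. */
uint32_t popcount32(uint32_t x)
{
    x -= (x >> 1) & 0x55555555u;                      /* 2-bit field sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit field sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit field sums */
    return (x * 0x01010101u) >> 24;                   /* add the four bytes */
}
```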
[PATCH v5 16/33] Refactor Thumb-1 64-bit comparison into a new file

2021-01-15 Thread Daniel Engel
This will make it easier to isolate changes in subsequent patches.

gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/bpabi-v6m.S (__aeabi_lcmp, __aeabi_ulcmp): Moved to ...
* config/arm/eabi/lcmp.S: New file.
* config/arm/lib1funcs.S: #include eabi/lcmp.S.
---
 libgcc/config/arm/bpabi-v6m.S | 46 --
 libgcc/config/arm/eabi/lcmp.S | 73 +++
 libgcc/config/arm/lib1funcs.S |  1 +
 3 files changed, 74 insertions(+), 46 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/lcmp.S

diff --git a/libgcc/config/arm/bpabi-v6m.S b/libgcc/config/arm/bpabi-v6m.S
index 069fcbbf48c..a051c1530a4 100644
--- a/libgcc/config/arm/bpabi-v6m.S
+++ b/libgcc/config/arm/bpabi-v6m.S
@@ -33,52 +33,6 @@
.eabi_attribute 25, 1
 #endif /* __ARM_EABI__ */
 
-#ifdef L_aeabi_lcmp
-
-FUNC_START aeabi_lcmp
-   cmp xxh, yyh
-   beq 1f
-   bgt 2f
-   movs r0, #1
-   negs r0, r0
-   RET
-2:
-   movs r0, #1
-   RET
-1:
-   subs r0, xxl, yyl
-   beq 1f
-   bhi 2f
-   movs r0, #1
-   negs r0, r0
-   RET
-2:
-   movs r0, #1
-1:
-   RET
-   FUNC_END aeabi_lcmp
-
-#endif /* L_aeabi_lcmp */
-   
-#ifdef L_aeabi_ulcmp
-
-FUNC_START aeabi_ulcmp
-   cmp xxh, yyh
-   bne 1f
-   subs r0, xxl, yyl
-   beq 2f
-1:
-   bcs 1f
-   movs r0, #1
-   negs r0, r0
-   RET
-1:
-   movs r0, #1
-2:
-   RET
-   FUNC_END aeabi_ulcmp
-
-#endif /* L_aeabi_ulcmp */
 
 .macro test_div_by_zero signed
cmp yyh, #0
diff --git a/libgcc/config/arm/eabi/lcmp.S b/libgcc/config/arm/eabi/lcmp.S
new file mode 100644
index 000..336db1d398c
--- /dev/null
+++ b/libgcc/config/arm/eabi/lcmp.S
@@ -0,0 +1,73 @@
+/* Miscellaneous BPABI functions.  Thumb-1 implementation, suitable for ARMv4T,
+   ARMv6-M and ARMv8-M Baseline like ISA variants.
+
+   Copyright (C) 2006-2020 Free Software Foundation, Inc.
+   Contributed by CodeSourcery.
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+
+#ifdef L_aeabi_lcmp
+
+FUNC_START aeabi_lcmp
+cmp xxh, yyh
+beq 1f
+bgt 2f
+movs r0, #1
+negs r0, r0
+RET
+2:
+movs r0, #1
+RET
+1:
+subs r0, xxl, yyl
+beq 1f
+bhi 2f
+movs r0, #1
+negs r0, r0
+RET
+2:
+movs r0, #1
+1:
+RET
+FUNC_END aeabi_lcmp
+
+#endif /* L_aeabi_lcmp */
+
+#ifdef L_aeabi_ulcmp
+
+FUNC_START aeabi_ulcmp
+cmp xxh, yyh
+bne 1f
+subs r0, xxl, yyl
+beq 2f
+1:
+bcs 1f
+movs r0, #1
+negs r0, r0
+RET
+1:
+movs r0, #1
+2:
+RET
+FUNC_END aeabi_ulcmp
+
+#endif /* L_aeabi_ulcmp */
+
diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index bd84a3e4281..5e24d0a6749 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -1991,5 +1991,6 @@ LSYM(Lchange_\register):
 #include "bpabi.S"
 #else /* NOT_ISA_TARGET_32BIT */
 #include "bpabi-v6m.S"
+#include "eabi/lcmp.S"
 #endif /* NOT_ISA_TARGET_32BIT */
 #endif /* !__symbian__ */
-- 
2.25.1



[PATCH v5 17/33] Import 64-bit comparison from CM0 library

2021-01-15 Thread Daniel Engel
These are 2-5 instructions smaller and just as fast.  Branches are
minimized, which will allow easier adaptation to Thumb-2/ARM mode.

gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/eabi/lcmp.S (__aeabi_lcmp, __aeabi_ulcmp): Replaced;
add macro configuration to build __cmpdi2() and __ucmpdi2().
* config/arm/t-elf (LIB1ASMFUNCS): Added _cmpdi2 and _ucmpdi2.
---
 libgcc/config/arm/eabi/lcmp.S | 151 +-
 libgcc/config/arm/t-elf   |   2 +
 2 files changed, 112 insertions(+), 41 deletions(-)

diff --git a/libgcc/config/arm/eabi/lcmp.S b/libgcc/config/arm/eabi/lcmp.S
index 336db1d398c..2ac9d178b34 100644
--- a/libgcc/config/arm/eabi/lcmp.S
+++ b/libgcc/config/arm/eabi/lcmp.S
@@ -1,8 +1,7 @@
-/* Miscellaneous BPABI functions.  Thumb-1 implementation, suitable for ARMv4T,
-   ARMv6-M and ARMv8-M Baseline like ISA variants.
+/* lcmp.S: Thumb-1 optimized 64-bit integer comparison
 
-   Copyright (C) 2006-2020 Free Software Foundation, Inc.
-   Contributed by CodeSourcery.
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
 
This file is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
@@ -24,50 +23,120 @@
.  */
 
 
+#if defined(L_aeabi_lcmp) || defined(L_cmpdi2)
+
 #ifdef L_aeabi_lcmp
+  #define LCMP_NAME aeabi_lcmp
+  #define LCMP_SECTION .text.sorted.libgcc.lcmp
+#else
+  #define LCMP_NAME cmpdi2
+  #define LCMP_SECTION .text.sorted.libgcc.cmpdi2
+#endif
+
+// int __aeabi_lcmp(long long, long long)
+// int __cmpdi2(long long, long long)
+// Compares the 64 bit signed values in $r1:$r0 and $r3:$r2.
+// lcmp() returns $r0 = { -1, 0, +1 } for orderings { <, ==, > } respectively.
+// cmpdi2() returns $r0 = { 0, 1, 2 } for orderings { <, ==, > } respectively.
+// Object file duplication assumes typical programs follow one runtime ABI.
+FUNC_START_SECTION LCMP_NAME LCMP_SECTION
+CFI_START_FUNCTION
+
+// Calculate the difference $r1:$r0 - $r3:$r2.
+subs xxl, yyl
+sbcs xxh, yyh
+
+// With $r2 free, create a known offset value without affecting
+//  the N or Z flags.
+// BUG? The originally unified instruction for v6m was 'mov r2, r3'.
+//  However, this resulted in a compile error with -mthumb:
+//"MOV Rd, Rs with two low registers not permitted".
+// Since unified syntax deprecates the "cpy" instruction, shouldn't
+//  there be a backwards-compatible translation available?
+cpy r2, r3
+
+// Evaluate the comparison result.
+blt LLSYM(__lcmp_lt)
+
+// The reference offset ($r2 - $r3) will be +2 iff the first
+//  argument is larger, otherwise the offset value remains 0.
+adds r2, #2
+
+// Check for zero (equality in 64 bits).
+// It doesn't matter which register was originally "hi".
+orrs r0, r1
+
+// The result is already 0 on equality.
+beq LLSYM(__lcmp_return)
+
+LLSYM(__lcmp_lt):
+// Create +1 or -1 from the offset value defined earlier.
+adds r3, #1
+subs r0, r2, r3
+
+LLSYM(__lcmp_return):
+  #ifdef L_cmpdi2
+// Offset to the correct output specification.
+adds r0, #1
+  #endif
 
-FUNC_START aeabi_lcmp
-cmp xxh, yyh
-beq 1f
-bgt 2f
-movs r0, #1
-negs r0, r0
-RET
-2:
-movs r0, #1
-RET
-1:
-subs r0, xxl, yyl
-beq 1f
-bhi 2f
-movs r0, #1
-negs r0, r0
-RET
-2:
-movs r0, #1
-1:
 RET
-FUNC_END aeabi_lcmp
 
-#endif /* L_aeabi_lcmp */
+CFI_END_FUNCTION
+FUNC_END LCMP_NAME
+
+#endif /* L_aeabi_lcmp || L_cmpdi2 */
+
+
+#if defined(L_aeabi_ulcmp) || defined(L_ucmpdi2)
 
 #ifdef L_aeabi_ulcmp
+  #define ULCMP_NAME aeabi_ulcmp
+  #define ULCMP_SECTION .text.sorted.libgcc.ulcmp
+#else
+  #define ULCMP_NAME ucmpdi2
+  #define ULCMP_SECTION .text.sorted.libgcc.ucmpdi2
+#endif
+
+// int __aeabi_ulcmp(unsigned long long, unsigned long long)
+// int __ucmpdi2(unsigned long long, unsigned long long)
+// Compares the 64 bit unsigned values in $r1:$r0 and $r3:$r2.
+// ulcmp() returns $r0 = { -1, 0, +1 } for orderings { <, ==, > } respectively.
+// ucmpdi2() returns $r0 = { 0, 1, 2 } for orderings { <, ==, > } respectively.
+// Object file duplication assumes typical programs follow one runtime ABI.
+FUNC_START_SECTION ULCMP_NAME ULCMP_SECTION
+CFI_START_FUNCTION
+
+// Calculate the 'C' flag.
+subs xxl, yyl
+sbcs xxh, yyh
+
+// Capture the carry flag.
+// $r2 will contain -1 if the first value is smaller,
+//  0 if the first value is larger or equa

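The two return conventions the patch above folds into one body can be summarized in C. These reference functions only restate the documented semantics; the `ref_` names are illustrative, not part of the patch:

```c
/* __aeabi_lcmp convention: returns { -1, 0, +1 } for { <, ==, > }.  */
int ref_lcmp(long long a, long long b)
{
    return (a < b) ? -1 : (a > b) ? 1 : 0;
}

/* __cmpdi2 convention: returns { 0, 1, 2 }, i.e. the same ordering
   result offset by one -- which is exactly the single 'adds r0, #1'
   the shared epilogue applies when built as L_cmpdi2.  */
int ref_cmpdi2(long long a, long long b)
{
    return ref_lcmp(a, b) + 1;
}
```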
[PATCH v5 18/33] Merge Thumb-2 optimizations for 64-bit comparison

2021-01-15 Thread Daniel Engel
This effectively merges support for all architecture variants into a
common function path with appropriate build conditions.
ARM performance is 1-2 instructions faster; Thumb-2 is about 50% faster.

gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/bpabi.S (__aeabi_lcmp, __aeabi_ulcmp): Removed.
* config/arm/eabi/lcmp.S (__aeabi_lcmp, __aeabi_ulcmp): Added
conditional execution on supported architectures (__ARM_FEATURE_IT).
* config/arm/lib1funcs.S: Moved #include scope of eabi/lcmp.S.
---
 libgcc/config/arm/bpabi.S | 42 ---
 libgcc/config/arm/eabi/lcmp.S | 47 ++-
 libgcc/config/arm/lib1funcs.S |  2 +-
 3 files changed, 47 insertions(+), 44 deletions(-)

diff --git a/libgcc/config/arm/bpabi.S b/libgcc/config/arm/bpabi.S
index 2cbb67d54ad..4281a2be594 100644
--- a/libgcc/config/arm/bpabi.S
+++ b/libgcc/config/arm/bpabi.S
@@ -34,48 +34,6 @@
.eabi_attribute 25, 1
 #endif /* __ARM_EABI__ */
 
-#ifdef L_aeabi_lcmp
-
-ARM_FUNC_START aeabi_lcmp
-   cmp xxh, yyh
-   do_it   lt
-   movlt   r0, #-1
-   do_it   gt
-   movgt   r0, #1
-   do_it   ne
-   RETc(ne)
-   subs r0, xxl, yyl
-   do_it   lo
-   movlo   r0, #-1
-   do_it   hi
-   movhi   r0, #1
-   RET
-   FUNC_END aeabi_lcmp
-
-#endif /* L_aeabi_lcmp */
-   
-#ifdef L_aeabi_ulcmp
-
-ARM_FUNC_START aeabi_ulcmp
-   cmp xxh, yyh
-   do_it   lo
-   movlo   r0, #-1
-   do_it   hi
-   movhi   r0, #1
-   do_it   ne
-   RETc(ne)
-   cmp xxl, yyl
-   do_it   lo
-   movlo   r0, #-1
-   do_it   hi
-   movhi   r0, #1
-   do_it   eq
-   moveq   r0, #0
-   RET
-   FUNC_END aeabi_ulcmp
-
-#endif /* L_aeabi_ulcmp */
-
 .macro test_div_by_zero signed
 /* Tail-call to divide-by-zero handlers which may be overridden by the user,
so unwinding works properly.  */
diff --git a/libgcc/config/arm/eabi/lcmp.S b/libgcc/config/arm/eabi/lcmp.S
index 2ac9d178b34..f1a9c3b8fe0 100644
--- a/libgcc/config/arm/eabi/lcmp.S
+++ b/libgcc/config/arm/eabi/lcmp.S
@@ -46,6 +46,19 @@ FUNC_START_SECTION LCMP_NAME LCMP_SECTION
 subs xxl, yyl
 sbcs xxh, yyh
 
+#ifdef __HAVE_FEATURE_IT
+do_it   lt,t
+
+  #ifdef L_aeabi_lcmp
+movlt   r0, #-1
  #else
+movlt   r0, #0
+  #endif
+
+// Early return on '<'.
+RETc(lt)
+
+#else /* !__HAVE_FEATURE_IT */
 // With $r2 free, create a known offset value without affecting
 //  the N or Z flags.
 // BUG? The originally unified instruction for v6m was 'mov r2, r3'.
@@ -62,17 +75,27 @@ FUNC_START_SECTION LCMP_NAME LCMP_SECTION
 //  argument is larger, otherwise the offset value remains 0.
 adds r2, #2
 
+#endif
+
 // Check for zero (equality in 64 bits).
 // It doesn't matter which register was originally "hi".
 orrs r0, r1
 
+#ifdef __HAVE_FEATURE_IT
+// The result is already 0 on equality.
+// -1 already returned, so just force +1.
+do_it   ne
+movne   r0, #1
+
+#else /* !__HAVE_FEATURE_IT */
 // The result is already 0 on equality.
 beq LLSYM(__lcmp_return)
 
-LLSYM(__lcmp_lt):
+  LLSYM(__lcmp_lt):
 // Create +1 or -1 from the offset value defined earlier.
 adds r3, #1
 subs r0, r2, r3
+#endif
 
 LLSYM(__lcmp_return):
   #ifdef L_cmpdi2
@@ -111,21 +134,43 @@ FUNC_START_SECTION ULCMP_NAME ULCMP_SECTION
 subs xxl, yyl
 sbcs xxh, yyh
 
+#ifdef __HAVE_FEATURE_IT
+do_it   lo,t
+
+  #ifdef L_aeabi_ulcmp
+movlo   r0, #-1
+  #else
+movlo   r0, #0
+  #endif
+
+// Early return on '<'.
+RETc(lo)
+
+#else
+// Capture the carry flag.
+// $r2 will contain -1 if the first value is smaller,
+//  0 if the first value is larger or equal.
+sbcs r2, r2
+#endif
 
 // Check for zero (equality in 64 bits).
 // It doesn't matter which register was originally "hi".
 orrs r0, r1
 
+#ifdef __HAVE_FEATURE_IT
+// The result is already 0 on equality.
+// -1 already returned, so just force +1.
+do_it   ne
+movne   r0, #1
+
+#else /* !__HAVE_FEATURE_IT */
 // The result is already 0 on equality.
 beq LLSYM(__ulcmp_return)
 
 // Assume +1.  If -1 is correct, $r2 will override.
 movs r0, #1
 orrs r0, r2
+#endif
 
 LLSYM(__ulcmp_return):
   #ifdef L_ucmpdi2
diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index 5e24d0a6749..f41354f811e 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -1991,6 +1991,6 @@ LSYM(Lchange_\regis

[PATCH v5 19/33] Import 32-bit division from the CM0 library

2021-01-15 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2021-01-07 Daniel Engel 

* config/arm/eabi/idiv.S: New file for __udivsi3() and __divsi3().
* config/arm/lib1funcs.S: #include eabi/idiv.S (v6m only).
---
 libgcc/config/arm/eabi/idiv.S | 299 ++
 libgcc/config/arm/lib1funcs.S |  19 ++-
 2 files changed, 317 insertions(+), 1 deletion(-)
 create mode 100644 libgcc/config/arm/eabi/idiv.S

diff --git a/libgcc/config/arm/eabi/idiv.S b/libgcc/config/arm/eabi/idiv.S
new file mode 100644
index 000..7381e8f57a3
--- /dev/null
+++ b/libgcc/config/arm/eabi/idiv.S
@@ -0,0 +1,299 @@
+/* div.S: Thumb-1 size-optimized 32-bit integer division
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+
+#ifndef __GNUC__
+
+// int __aeabi_idiv0(int)
+// Helper function for division by 0.
+WEAK_START_SECTION aeabi_idiv0 .text.sorted.libgcc.idiv.idiv0
+FUNC_ALIAS cm0_idiv0 aeabi_idiv0
+CFI_START_FUNCTION
+
+  #if defined(TRAP_EXCEPTIONS) && TRAP_EXCEPTIONS
+svc #(SVC_DIVISION_BY_ZERO)
+  #endif
+
+RET
+
+CFI_END_FUNCTION
+FUNC_END cm0_idiv0
+FUNC_END aeabi_idiv0
+
+#endif /* !__GNUC__ */
+
+
+#ifdef L_divsi3
+
+// int __aeabi_idiv(int, int)
+// idiv_return __aeabi_idivmod(int, int)
+// Returns signed $r0 after division by $r1.
+// Also returns the signed remainder in $r1.
+// Same parent section as __divsi3() to keep branches within range.
+FUNC_START_SECTION divsi3 .text.sorted.libgcc.idiv.divsi3
+
+#ifndef __symbian__
+  FUNC_ALIAS aeabi_idiv divsi3
+  FUNC_ALIAS aeabi_idivmod divsi3
+#endif
+
+CFI_START_FUNCTION
+
+// Extend signs.
+asrs r2, r0, #31
+asrs r3, r1, #31
+
+// Absolute value of the denominator, abort on division by zero.
+eors r1, r3
+subs r1, r3
+  #if defined(PEDANTIC_DIV0) && PEDANTIC_DIV0
+beq LLSYM(__idivmod_zero)
+  #else
+beq SYM(__uidivmod_zero)
+  #endif
+
+// Absolute value of the numerator.
+eors r0, r2
+subs r0, r2
+
+// Keep the sign of the numerator in bit[31] (for the remainder).
+// Save the XOR of the signs in bits[15:0] (for the quotient).
+push { rT, lr }
+.cfi_remember_state
+.cfi_adjust_cfa_offset 8
+.cfi_rel_offset rT, 0
+.cfi_rel_offset lr, 4
+
+lsrs rT, r3, #16
+eors rT, r2
+
+// Handle division as unsigned.
+bl  SYM(__uidivmod_nonzero) __PLT__
+
+// Set the sign of the remainder.
+asrs r2, rT, #31
+eors r1, r2
+subs r1, r2
+
+// Set the sign of the quotient.
+sxth r3, rT
+eors r0, r3
+subs r0, r3
+
+LLSYM(__idivmod_return):
+pop { rT, pc }
+.cfi_restore_state
+
+  #if defined(PEDANTIC_DIV0) && PEDANTIC_DIV0
+LLSYM(__idivmod_zero):
+// Set up the *div0() parameter specified in the ARM runtime ABI:
+//  * 0 if the numerator is 0,
+//  * Or, the largest value of the type manipulated by the calling
+// division function if the numerator is positive,
+//  * Or, the least value of the type manipulated by the calling
+// division function if the numerator is negative.
+subs r1, r0
+orrs r0, r1
+asrs r0, #31
+lsrs r0, #1
+eors r0, r2
+
+// At least the __aeabi_idiv0() call is common.
+b   SYM(__uidivmod_zero2)
+  #endif /* PEDANTIC_DIV0 */
+
+CFI_END_FUNCTION
+FUNC_END divsi3
+
+#ifndef __symbian__
+  FUNC_END aeabi_idiv
+  FUNC_END aeabi_idivmod
+#endif 
+
+#endif /* L_divsi3 */
+
+
+#ifdef L_udivsi3
+
+// int __aeabi_uidiv(unsigned int, unsigned int)
+// idiv_return __aeabi_uidivmod(unsigned int, unsigned int)
+// Returns unsigned $r0 a

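The `__idivmod_zero` comments above restate the ARM runtime ABI rule for the value passed to (and conventionally returned from) the divide-by-zero handler. A C restatement of that rule, for reference only (`ref_idiv0_result` is an illustrative name):

```c
#include <limits.h>

/* The ARM runtime ABI specifies the quotient result when the divisor
   is zero: 0 for a zero numerator, otherwise the largest value of the
   type for a positive numerator and the least value for a negative one. */
int ref_idiv0_result(int numerator)
{
    if (numerator == 0)
        return 0;
    return (numerator > 0) ? INT_MAX : INT_MIN;
}
```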
[PATCH v5 20/33] Refactor Thumb-1 64-bit division into a new file

2021-01-15 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/bpabi-v6m.S (__aeabi_ldivmod/ldivmod): Moved to ...
* config/arm/eabi/ldiv.S: New file.
* config/arm/lib1funcs.S: #include eabi/ldiv.S (v6m only).
---
 libgcc/config/arm/bpabi-v6m.S |  81 -
 libgcc/config/arm/eabi/ldiv.S | 107 ++
 libgcc/config/arm/lib1funcs.S |   1 +
 3 files changed, 108 insertions(+), 81 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/ldiv.S

diff --git a/libgcc/config/arm/bpabi-v6m.S b/libgcc/config/arm/bpabi-v6m.S
index a051c1530a4..b3dc3bf8f4d 100644
--- a/libgcc/config/arm/bpabi-v6m.S
+++ b/libgcc/config/arm/bpabi-v6m.S
@@ -34,87 +34,6 @@
 #endif /* __ARM_EABI__ */
 
 
-.macro test_div_by_zero signed
-   cmp yyh, #0
-   bne 7f
-   cmp yyl, #0
-   bne 7f
-   cmp xxh, #0
-   .ifc \signed, unsigned
-   bne 2f
-   cmp xxl, #0
-2:
-   beq 3f
-   movs xxh, #0
-   mvns xxh, xxh   @ 0xffffffff
-   movs xxl, xxh
-3:
-   .else
-   blt 6f
-   bgt 4f
-   cmp xxl, #0
-   beq 5f
-4: movs xxl, #0
-   mvns xxl, xxl   @ 0xffffffff
-   lsrs xxh, xxl, #1   @ 0x7fffffff
-   b   5f
-6: movs xxh, #0x80
-   lsls xxh, xxh, #24  @ 0x80000000
-   movs xxl, #0
-5:
-   .endif
-   @ tailcalls are tricky on v6-m.
-   push {r0, r1, r2}
-   ldr r0, 1f
-   adr r1, 1f
-   adds r0, r1
-   str r0, [sp, #8]
-   @ We know we are not on armv4t, so pop pc is safe.
-   pop {r0, r1, pc}
-   .align  2
-1:
-   .word   __aeabi_ldiv0 - 1b
-7:
-.endm
-
-#ifdef L_aeabi_ldivmod
-
-FUNC_START aeabi_ldivmod
-   test_div_by_zero signed
-
-   push {r0, r1}
-   mov r0, sp
-   push {r0, lr}
-   ldr r0, [sp, #8]
-   bl  SYM(__gnu_ldivmod_helper)
-   ldr r3, [sp, #4]
-   mov lr, r3
-   add sp, sp, #8
-   pop {r2, r3}
-   RET
-   FUNC_END aeabi_ldivmod
-
-#endif /* L_aeabi_ldivmod */
-
-#ifdef L_aeabi_uldivmod
-
-FUNC_START aeabi_uldivmod
-   test_div_by_zero unsigned
-
-   push {r0, r1}
-   mov r0, sp
-   push {r0, lr}
-   ldr r0, [sp, #8]
-   bl  SYM(__udivmoddi4)
-   ldr r3, [sp, #4]
-   mov lr, r3
-   add sp, sp, #8
-   pop {r2, r3}
-   RET
-   FUNC_END aeabi_uldivmod
-   
-#endif /* L_aeabi_uldivmod */
-
 #ifdef L_arm_addsubsf3
 
 FUNC_START aeabi_frsub
diff --git a/libgcc/config/arm/eabi/ldiv.S b/libgcc/config/arm/eabi/ldiv.S
new file mode 100644
index 000..3c8280ef580
--- /dev/null
+++ b/libgcc/config/arm/eabi/ldiv.S
@@ -0,0 +1,107 @@
+/* Miscellaneous BPABI functions.  Thumb-1 implementation, suitable for ARMv4T,
+   ARMv6-M and ARMv8-M Baseline like ISA variants.
+
+   Copyright (C) 2006-2020 Free Software Foundation, Inc.
+   Contributed by CodeSourcery.
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+
+.macro test_div_by_zero signed
+cmp yyh, #0
+bne 7f
+cmp yyl, #0
+bne 7f
+cmp xxh, #0
+.ifc \signed, unsigned
+bne 2f
+cmp xxl, #0
+2:
+beq 3f
+movs xxh, #0
+mvns xxh, xxh   @ 0xffffffff
+movs xxl, xxh
+3:
+.else
+blt 6f
+bgt 4f
+cmp xxl, #0
+beq 5f
+4:  movs xxl, #0
+mvns xxl, xxl   @ 0xffffffff
+lsrs xxh, xxl, #1   @ 0x7fffffff
+b   5f
+6:  movs xxh, #0x80
+lsls xxh, xxh, #24   @ 0x80000000
+movs xxl, #0
+5:
+.endif
+@ tailcalls are tricky on v6-m.
+push {r0, r1, r2}
+ldr r0, 1f
+adr r1, 1f
+adds r0, r1
+str r0, [sp, #8]
+@ We know we are not on armv4t, so pop 

[PATCH v5 21/33] Import 64-bit division from the CM0 library

2021-01-15 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/bpabi.c: Deleted unused file.
* config/arm/eabi/ldiv.S (__aeabi_ldivmod, __aeabi_uldivmod):
Replaced wrapper functions with a complete implementation.
* config/arm/t-bpabi (LIB2ADD_ST): Removed bpabi.c.
* config/arm/t-elf (LIB1ASMFUNCS): Added _divdi3 and _udivdi3.
---
 libgcc/config/arm/bpabi.c |  42 ---
 libgcc/config/arm/eabi/ldiv.S | 542 +-
 libgcc/config/arm/t-bpabi |   3 +-
 libgcc/config/arm/t-elf   |   9 +
 4 files changed, 474 insertions(+), 122 deletions(-)
 delete mode 100644 libgcc/config/arm/bpabi.c

diff --git a/libgcc/config/arm/bpabi.c b/libgcc/config/arm/bpabi.c
deleted file mode 100644
index bf6ba757964..000
--- a/libgcc/config/arm/bpabi.c
+++ /dev/null
@@ -1,42 +0,0 @@
-/* Miscellaneous BPABI functions.
-
-   Copyright (C) 2003-2021 Free Software Foundation, Inc.
-   Contributed by CodeSourcery, LLC.
-
-   This file is free software; you can redistribute it and/or modify it
-   under the terms of the GNU General Public License as published by the
-   Free Software Foundation; either version 3, or (at your option) any
-   later version.
-
-   This file is distributed in the hope that it will be useful, but
-   WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   General Public License for more details.
-
-   Under Section 7 of GPL version 3, you are granted additional
-   permissions described in the GCC Runtime Library Exception, version
-   3.1, as published by the Free Software Foundation.
-
-   You should have received a copy of the GNU General Public License and
-   a copy of the GCC Runtime Library Exception along with this program;
-   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
-   .  */
-
-extern long long __divdi3 (long long, long long);
-extern unsigned long long __udivdi3 (unsigned long long, 
-unsigned long long);
-extern long long __gnu_ldivmod_helper (long long, long long, long long *);
-
-
-long long
-__gnu_ldivmod_helper (long long a, 
- long long b, 
- long long *remainder)
-{
-  long long quotient;
-
-  quotient = __divdi3 (a, b);
-  *remainder = a - b * quotient;
-  return quotient;
-}
-
diff --git a/libgcc/config/arm/eabi/ldiv.S b/libgcc/config/arm/eabi/ldiv.S
index 3c8280ef580..c225e5973b2 100644
--- a/libgcc/config/arm/eabi/ldiv.S
+++ b/libgcc/config/arm/eabi/ldiv.S
@@ -1,8 +1,7 @@
-/* Miscellaneous BPABI functions.  Thumb-1 implementation, suitable for ARMv4T,
-   ARMv6-M and ARMv8-M Baseline like ISA variants.
+/* ldiv.S: Thumb-1 optimized 64-bit integer division
 
-   Copyright (C) 2006-2020 Free Software Foundation, Inc.
-   Contributed by CodeSourcery.
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
 
This file is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
@@ -24,84 +23,471 @@
.  */
 
 
-.macro test_div_by_zero signed
-cmp yyh, #0
-bne 7f
-cmp yyl, #0
-bne 7f
-cmp xxh, #0
-.ifc \signed, unsigned
-bne 2f
-cmp xxl, #0
-2:
-beq 3f
-movs xxh, #0
-mvns xxh, xxh   @ 0xffffffff
-movs xxl, xxh
-3:
-.else
-blt 6f
-bgt 4f
-cmp xxl, #0
-beq 5f
-4:  movs xxl, #0
-mvns xxl, xxl   @ 0xffffffff
-lsrs xxh, xxl, #1   @ 0x7fffffff
-b   5f
-6:  movs xxh, #0x80
-lsls xxh, xxh, #24   @ 0x80000000
-movs xxl, #0
-5:
-.endif
-@ tailcalls are tricky on v6-m.
-push {r0, r1, r2}
-ldr r0, 1f
-adr r1, 1f
-adds r0, r1
-str r0, [sp, #8]
-@ We know we are not on armv4t, so pop pc is safe.
-pop {r0, r1, pc}
-.align  2
-1:
-.word   __aeabi_ldiv0 - 1b
-7:
-.endm
-
-#ifdef L_aeabi_ldivmod
-
-FUNC_START aeabi_ldivmod
-test_div_by_zero signed
-
-push {r0, r1}
-mov r0, sp
-push {r0, lr}
-ldr r0, [sp, #8]
-bl  SYM(__gnu_ldivmod_helper)
-ldr r3, [sp, #4]
-mov lr, r3
-add sp, sp, #8
-pop {r2, r3}
+#ifndef __GNUC__
+
+// long long __aeabi_ldiv0(long long)
+// Helper function for division by 0.
+WEAK_START_SECTION aeabi_ldiv0 .text.sorted.libgcc.ldiv.ldiv0
+CFI_START_FUNCTION
+
+  #if defined(TRAP_EXCEPTIONS) && TRAP_EXCEPTIONS
+svc #(SVC_DIVISION_BY_ZERO)
+  #endif
+
 RET
-FUNC_END aea

[PATCH v5 22/33] Import integer multiplication from the CM0 library

2021-01-15 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2021-01-07 Daniel Engel 

* config/arm/eabi/lmul.S: New file for __muldi3(), __mulsidi3(), and
 __umulsidi3().
* config/arm/lib1funcs.S: #include eabi/lmul.S (v6m only).
* config/arm/t-elf: Add the new objects to LIB1ASMFUNCS.
---
 libgcc/config/arm/eabi/lmul.S | 218 ++
 libgcc/config/arm/lib1funcs.S |   1 +
 libgcc/config/arm/t-elf   |  13 +-
 3 files changed, 230 insertions(+), 2 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/lmul.S

diff --git a/libgcc/config/arm/eabi/lmul.S b/libgcc/config/arm/eabi/lmul.S
new file mode 100644
index 000..9fec4364a26
--- /dev/null
+++ b/libgcc/config/arm/eabi/lmul.S
@@ -0,0 +1,218 @@
+/* lmul.S: Thumb-1 optimized 64-bit integer multiplication
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+
+#ifdef L_muldi3
+
+// long long __aeabi_lmul(long long, long long)
+// Returns the least significant 64 bits of a 64 bit multiplication.
+// Expects the two multiplicands in $r1:$r0 and $r3:$r2.
+// Returns the product in $r1:$r0 (does not distinguish signed types).
+// Uses $r4 and $r5 as scratch space.
+// Same parent section as __umulsidi3() to keep tail call branch within range.
+FUNC_START_SECTION muldi3 .text.sorted.libgcc.lmul.muldi3
+
+#ifndef __symbian__
+  FUNC_ALIAS aeabi_lmul muldi3
+#endif
+
+CFI_START_FUNCTION
+
+// $r1:$r0 = 0x
+// $r3:$r2 = 0x
+
+// The following operations that only affect the upper 64 bits
+//  can be safely discarded:
+//    * 
+//    * 
+//    * 
+//    * 
+//    * 
+//    * 
+
+// MAYBE: Test for multiply by ZERO on implementations with a 32-cycle
+//  'muls' instruction, and skip over the operation in that case.
+
+// (0x * 0x), free $r1
+muls xxh, yyl
+
+// (0x * 0x), free $r3
+muls yyh, xxl
+adds yyh, xxh
+
+// Put the parameters in the correct form for umulsidi3().
+movs xxh, yyl
+b   LLSYM(__mul_overflow)
+
+CFI_END_FUNCTION
+FUNC_END muldi3
+
+#ifndef __symbian__
+  FUNC_END aeabi_lmul
+#endif
+
+#endif /* L_muldi3 */
+
+
+// The following implementation of __umulsidi3() integrates with __muldi3()
+//  above to allow the fast tail call while still preserving the extra
+//  hi-shifted bits of the result.  However, these extra bits add a few
+//  instructions not otherwise required when using only __umulsidi3().
+// Therefore, this block configures __umulsidi3() for compilation twice.
+// The first version is a minimal standalone implementation, and the second
+//  version adds the hi bits of __muldi3().  The standalone version must
+//  be declared WEAK, so that the combined version can supersede it and
+//  provide both symbols in programs that multiply long doubles.
+// This means '_umulsidi3' should appear before '_muldi3' in LIB1ASMFUNCS.
+#if defined(L_muldi3) || defined(L_umulsidi3)
+
+#ifdef L_umulsidi3
+// unsigned long long __umulsidi3(unsigned int, unsigned int)
+// Returns all 64 bits of a 32 bit multiplication.
+// Expects the two multiplicands in $r0 and $r1.
+// Returns the product in $r1:$r0.
+// Uses $r3, $r4 and $ip as scratch space.
+WEAK_START_SECTION umulsidi3 .text.sorted.libgcc.lmul.umulsidi3
+CFI_START_FUNCTION
+
+#else /* L_muldi3 */
+FUNC_ENTRY umulsidi3
+CFI_START_FUNCTION
+
+// 32x32 multiply with 64 bit result.
+// Expand the multiply into 4 parts, since muls only returns 32 bits.
+// (a16h * b16h / 2^32)
+//   + (a16h * b16l / 2^48) + (a16l * b16h / 2^48)
+//   + (a16l * b16l / 2^64)
+
+// MAYBE: Test for multiply by 0 on implementations with a 32-cycle
+//  'muls' instruction, and skip over the operation in that case.
+
+ 

[PATCH v5 23/33] Refactor Thumb-1 float comparison into a new file

2021-01-15 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/bpabi-v6m.S (__aeabi_cfcmpeq, __aeabi_cfcmple,
__aeabi_cfrcmple, __aeabi_fcmpeq, __aeabi_fcmple, aeabi_fcmple,
__aeabi_fcmpgt, aeabi_fcmpge): Moved to ...
* config/arm/eabi/fcmp.S: New file.
* config/arm/lib1funcs.S: #include eabi/fcmp.S (v6m only).
---
 libgcc/config/arm/bpabi-v6m.S | 63 -
 libgcc/config/arm/eabi/fcmp.S | 89 +++
 libgcc/config/arm/lib1funcs.S |  1 +
 3 files changed, 90 insertions(+), 63 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/fcmp.S

diff --git a/libgcc/config/arm/bpabi-v6m.S b/libgcc/config/arm/bpabi-v6m.S
index b3dc3bf8f4d..7c874f06218 100644
--- a/libgcc/config/arm/bpabi-v6m.S
+++ b/libgcc/config/arm/bpabi-v6m.S
@@ -49,69 +49,6 @@ FUNC_START aeabi_frsub
 
 #endif /* L_arm_addsubsf3 */
 
-#ifdef L_arm_cmpsf2
-
-FUNC_START aeabi_cfrcmple
-
-   mov ip, r0
-   movs r0, r1
-   mov r1, ip
-   b   6f
-
-FUNC_START aeabi_cfcmpeq
-FUNC_ALIAS aeabi_cfcmple aeabi_cfcmpeq
-
-   @ The status-returning routines are required to preserve all
-   @ registers except ip, lr, and cpsr.
-6: push {r0, r1, r2, r3, r4, lr}
-   bl  __lesf2
-   @ Set the Z flag correctly, and the C flag unconditionally.
-   cmp r0, #0
-   @ Clear the C flag if the return value was -1, indicating
-   @ that the first operand was smaller than the second.
-   bmi 1f
-   movs r1, #0
-   cmn r0, r1
-1:
-   pop {r0, r1, r2, r3, r4, pc}
-
-   FUNC_END aeabi_cfcmple
-   FUNC_END aeabi_cfcmpeq
-   FUNC_END aeabi_cfrcmple
-
-FUNC_START aeabi_fcmpeq
-
-   push {r4, lr}
-   bl  __eqsf2
-   negs r0, r0
-   adds r0, r0, #1
-   pop {r4, pc}
-
-   FUNC_END aeabi_fcmpeq
-
-.macro COMPARISON cond, helper, mode=sf2
-FUNC_START aeabi_fcmp\cond
-
-   push {r4, lr}
-   bl  __\helper\mode
-   cmp r0, #0
-   b\cond  1f
-   movs r0, #0
-   pop {r4, pc}
-1:
-   movs r0, #1
-   pop {r4, pc}
-
-   FUNC_END aeabi_fcmp\cond
-.endm
-
-COMPARISON lt, le
-COMPARISON le, le
-COMPARISON gt, ge
-COMPARISON ge, ge
-
-#endif /* L_arm_cmpsf2 */
-
 #ifdef L_arm_addsubdf3
 
 FUNC_START aeabi_drsub
diff --git a/libgcc/config/arm/eabi/fcmp.S b/libgcc/config/arm/eabi/fcmp.S
new file mode 100644
index 000..96d627f1fea
--- /dev/null
+++ b/libgcc/config/arm/eabi/fcmp.S
@@ -0,0 +1,89 @@
+/* Miscellaneous BPABI functions.  Thumb-1 implementation, suitable for ARMv4T,
+   ARMv6-M and ARMv8-M Baseline like ISA variants.
+
+   Copyright (C) 2006-2020 Free Software Foundation, Inc.
+   Contributed by CodeSourcery.
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#ifdef L_arm_cmpsf2
+
+FUNC_START aeabi_cfrcmple
+
+   mov ip, r0
+   movs r0, r1
+   mov r1, ip
+   b   6f
+
+FUNC_START aeabi_cfcmpeq
+FUNC_ALIAS aeabi_cfcmple aeabi_cfcmpeq
+
+   @ The status-returning routines are required to preserve all
+   @ registers except ip, lr, and cpsr.
+6: push {r0, r1, r2, r3, r4, lr}
+   bl  __lesf2
+   @ Set the Z flag correctly, and the C flag unconditionally.
+   cmp r0, #0
+   @ Clear the C flag if the return value was -1, indicating
+   @ that the first operand was smaller than the second.
+   bmi 1f
+   movs r1, #0
+   cmn r0, r1
+1:
+   pop {r0, r1, r2, r3, r4, pc}
+
+   FUNC_END aeabi_cfcmple
+   FUNC_END aeabi_cfcmpeq
+   FUNC_END aeabi_cfrcmple
+
+FUNC_START aeabi_fcmpeq
+
+   push {r4, lr}
+   bl  __eqsf2
+   negs r0, r0
+   adds r0, r0, #1
+   pop {r4, pc}
+
+   FUNC_END aeabi_fcmpeq
+
+.macro COMPARISON cond, helper, mode=sf2
+FUNC_START aeabi_fcmp\cond
+
+   push {r4, lr}
+   bl  __\helper\mode
+   cmp r0, #0
+   b\cond  1f
+   movs r0, #0
+   pop {r4, pc}
+1:
+   movs r0, #1
+  

[PATCH v5 24/33] Import float comparison from the CM0 library

2021-01-15 Thread Daniel Engel
These functions are significantly smaller and faster than the wrapper
functions and soft-float implementation they replace.  Using the first
comparison operator (e.g. '<=') in any program costs about 70 bytes
initially, but every additional operator incrementally adds just 4 bytes.

NOTE: It seems that the __aeabi_cfcmp*() routines formerly in bpabi-v6m.S
were not well tested, as they returned wrong results for the 'C' flag.
The replacement functions are fully tested.

gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/eabi/fcmp.S (__cmpsf2, __eqsf2, __gesf2,
__aeabi_fcmpne, __aeabi_fcmpun): Added new functions.
(__aeabi_fcmpeq, __aeabi_fcmpne, __aeabi_fcmplt, __aeabi_fcmple,
 __aeabi_fcmpge, __aeabi_fcmpgt, __aeabi_cfcmple, __aeabi_cfcmpeq,
 __aeabi_cfrcmple): Replaced with branches to __internal_cmpsf2().
* config/arm/eabi/fplib.h: New file with fcmp-specific constants
and general build configuration macros.
* config/arm/lib1funcs.S: #include eabi/fplib.h (v6m only).
* config/arm/t-elf (LIB1ASMFUNCS): Added _internal_cmpsf2,
_arm_cfcmpeq, _arm_cfcmple, _arm_cfrcmple, _arm_fcmpeq,
_arm_fcmpge, _arm_fcmpgt, _arm_fcmple, _arm_fcmplt, _arm_fcmpne,
_arm_eqsf2, and _arm_gesf2.
---
 libgcc/config/arm/eabi/fcmp.S  | 643 +
 libgcc/config/arm/eabi/fplib.h |  83 +
 libgcc/config/arm/lib1funcs.S  |   1 +
 libgcc/config/arm/t-elf|  18 +
 4 files changed, 681 insertions(+), 64 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/fplib.h

diff --git a/libgcc/config/arm/eabi/fcmp.S b/libgcc/config/arm/eabi/fcmp.S
index 96d627f1fea..cada33f4d35 100644
--- a/libgcc/config/arm/eabi/fcmp.S
+++ b/libgcc/config/arm/eabi/fcmp.S
@@ -1,8 +1,7 @@
-/* Miscellaneous BPABI functions.  Thumb-1 implementation, suitable for ARMv4T,
-   ARMv6-M and ARMv8-M Baseline like ISA variants.
+/* fcmp.S: Thumb-1 optimized 32-bit float comparison
 
-   Copyright (C) 2006-2020 Free Software Foundation, Inc.
-   Contributed by CodeSourcery.
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
 
This file is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
@@ -24,66 +23,582 @@
<http://www.gnu.org/licenses/>.  */
 
 
+// The various compare functions in this file all expect to tail call __cmpsf2()
+//  with flags set for a particular comparison mode.  The __internal_cmpsf2()
+//  symbol itself is unambiguous, but there is a remote risk that the linker
+//  will prefer some other symbol in place of __cmpsf2().  Importing an archive
+//  file that also exports __cmpsf2() will throw an error in this case.
+// As a workaround, this block configures __cmpsf2() for compilation twice.
+// The first version configures __internal_cmpsf2() as a WEAK standalone symbol,
+//  and the second exports __cmpsf2() and __internal_cmpsf2() normally.
+// A small bonus: programs not using __cmpsf2() itself will be slightly smaller.
+// 'L_internal_cmpsf2' should appear before 'L_arm_cmpsf2' in LIB1ASMFUNCS.
+#if defined(L_arm_cmpsf2) || defined(L_internal_cmpsf2)
+
+#define CMPSF2_SECTION .text.sorted.libgcc.fcmp.cmpsf2
+
+// int __cmpsf2(float, float)
+// 
+// Returns the three-way comparison result of $r0 with $r1:
+//  * +1 if ($r0 > $r1), or either argument is NAN
+//  *  0 if ($r0 == $r1)
+//  * -1 if ($r0 < $r1)
+// Uses $r2, $r3, and $ip as scratch space.
+#ifdef L_arm_cmpsf2
+FUNC_START_SECTION cmpsf2 CMPSF2_SECTION
+FUNC_ALIAS lesf2 cmpsf2
+FUNC_ALIAS ltsf2 cmpsf2
+CFI_START_FUNCTION
+
+// Assumption: The 'libgcc' functions should raise exceptions.
+movs r2, #(FCMP_UN_POSITIVE + FCMP_RAISE_EXCEPTIONS + FCMP_3WAY)
+
+// int,int __internal_cmpsf2(float, float, int)
+// Internal function expects a set of control flags in $r2.
+// If ordered, returns a comparison type { 0, 1, 2 } in $r3
+FUNC_ENTRY internal_cmpsf2
+
+#else /* L_internal_cmpsf2 */
+WEAK_START_SECTION internal_cmpsf2 CMPSF2_SECTION
+CFI_START_FUNCTION
+
+#endif 
+
+// When operand signs are considered, the comparison result falls
+//  within one of the following quadrants:
+//
+// $r0  $r1  $r0-$r1*  flags  result
+//  +    +       >      C=0     GT
+//  +    +       =      Z=1     EQ
+//  +    +       <      C=1     LT
+//  +    -       >      C=1     GT
+//  +    -       =      C=1     GT
+//  +    -       <      C=1     GT
+//  -    +       >      C=0     LT
+//  -    +       =      C=0     LT
+//  -    +       <      C=0     LT
+//  -    -       >      C=0     LT
+//  -    -       =      Z=1     EQ
+//  -    -       <      C=1     GT
+//
+ 

[PATCH v5 25/33] Refactor Thumb-1 float subtraction into a new file

2021-01-15 Thread Daniel Engel
This will make it easier to isolate changes in subsequent patches.

gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/bpabi-v6m.S (__aeabi_frsub): Moved to ...
* config/arm/eabi/fadd.S: New file.
* config/arm/lib1funcs.S: #include eabi/fadd.S (v6m only).
---
 libgcc/config/arm/bpabi-v6m.S | 16 ---
 libgcc/config/arm/eabi/fadd.S | 38 +++
 libgcc/config/arm/lib1funcs.S |  1 +
 3 files changed, 39 insertions(+), 16 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/fadd.S

diff --git a/libgcc/config/arm/bpabi-v6m.S b/libgcc/config/arm/bpabi-v6m.S
index 7c874f06218..c76c3b0568b 100644
--- a/libgcc/config/arm/bpabi-v6m.S
+++ b/libgcc/config/arm/bpabi-v6m.S
@@ -33,22 +33,6 @@
.eabi_attribute 25, 1
 #endif /* __ARM_EABI__ */
 
-
-#ifdef L_arm_addsubsf3
-
-FUNC_START aeabi_frsub
-
-  push {r4, lr}
-  movs r4, #1
-  lsls r4, #31
-  eors r0, r0, r4
-  bl   __aeabi_fadd
-  pop  {r4, pc}
-
-  FUNC_END aeabi_frsub
-
-#endif /* L_arm_addsubsf3 */
-
 #ifdef L_arm_addsubdf3
 
 FUNC_START aeabi_drsub
diff --git a/libgcc/config/arm/eabi/fadd.S b/libgcc/config/arm/eabi/fadd.S
new file mode 100644
index 000..fffbd91d1bc
--- /dev/null
+++ b/libgcc/config/arm/eabi/fadd.S
@@ -0,0 +1,38 @@
+/* Copyright (C) 2006-2021 Free Software Foundation, Inc.
+   Contributed by CodeSourcery.
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#ifdef L_arm_addsubsf3
+
+FUNC_START aeabi_frsub
+
+  push {r4, lr}
+  movs r4, #1
+  lsls r4, #31
+  eors r0, r0, r4
+  bl   __aeabi_fadd
+  pop  {r4, pc}
+
+  FUNC_END aeabi_frsub
+
+#endif /* L_arm_addsubsf3 */
+
diff --git a/libgcc/config/arm/lib1funcs.S b/libgcc/config/arm/lib1funcs.S
index 236b7a7763f..31132633f32 100644
--- a/libgcc/config/arm/lib1funcs.S
+++ b/libgcc/config/arm/lib1funcs.S
@@ -2012,6 +2012,7 @@ LSYM(Lchange_\register):
 #include "bpabi-v6m.S"
 #include "eabi/fplib.h"
 #include "eabi/fcmp.S"
+#include "eabi/fadd.S"
 #endif /* NOT_ISA_TARGET_32BIT */
 #include "eabi/lcmp.S"
 #endif /* !__symbian__ */
-- 
2.25.1



[PATCH v5 26/33] Import float addition and subtraction from the CM0 library

2021-01-15 Thread Daniel Engel
Since this is the first import of single-precision functions, some common
parsing and formatting routines are also included.  These common routines
will be referenced by other functions in subsequent commits.
However, even if the size penalty is accounted entirely to __addsf3(),
the total compiled size is still less than half the size of soft-float.

gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/eabi/fadd.S (__addsf3, __subsf3): Added new functions.
* config/arm/eabi/fneg.S (__negsf2): Added new file.
* config/arm/eabi/futil.S (__fp_normalize2, __fp_lalign2, __fp_assemble,
__fp_overflow, __fp_zero, __fp_check_nan): Added new file with shared
helper functions.
* config/arm/lib1funcs.S: #include eabi/fneg.S and eabi/futil.S (v6m only).
* config/arm/t-elf (LIB1ASMFUNCS): Added _arm_addsf3, _arm_frsubsf3,
_fp_exceptionf, _fp_checknanf, _fp_assemblef, and _fp_normalizef.
---
 libgcc/config/arm/eabi/fadd.S  | 306 +++-
 libgcc/config/arm/eabi/fneg.S  |  76 ++
 libgcc/config/arm/eabi/fplib.h |   3 -
 libgcc/config/arm/eabi/futil.S | 418 +
 libgcc/config/arm/lib1funcs.S  |   2 +
 libgcc/config/arm/t-elf|   6 +
 6 files changed, 798 insertions(+), 13 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/fneg.S
 create mode 100644 libgcc/config/arm/eabi/futil.S

diff --git a/libgcc/config/arm/eabi/fadd.S b/libgcc/config/arm/eabi/fadd.S
index fffbd91d1bc..77b81d62b3b 100644
--- a/libgcc/config/arm/eabi/fadd.S
+++ b/libgcc/config/arm/eabi/fadd.S
@@ -1,5 +1,7 @@
-/* Copyright (C) 2006-2021 Free Software Foundation, Inc.
-   Contributed by CodeSourcery.
+/* fadd.S: Thumb-1 optimized 32-bit float addition and subtraction
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
 
This file is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
@@ -21,18 +23,302 @@
.  */
 
 
+#ifdef L_arm_frsubsf3
+
+// float __aeabi_frsub(float, float)
+// Returns the floating point difference of $r1 - $r0 in $r0.
+// Subsection ordering within fpcore keeps conditional branches within range.
+FUNC_START_SECTION aeabi_frsub .text.sorted.libgcc.fpcore.b.frsub
+CFI_START_FUNCTION
+
+  #if defined(STRICT_NANS) && STRICT_NANS
+// Check if $r0 is NAN before modifying.
+lsls r2, r0, #1
+movs r3, #255
+lsls r3, #24
+
+// Let fadd() find the NAN in the normal course of operation,
+//  moving it to $r0 and checking the quiet/signaling bit.
+cmp r2, r3
+bhi SYM(__aeabi_fadd)
+  #endif
+
+// Flip sign and run through fadd().
+movs r2, #1
+lsls r2, #31
+adds r0, r2
+b   SYM(__aeabi_fadd)
+
+CFI_END_FUNCTION
+FUNC_END aeabi_frsub
+
+#endif /* L_arm_frsubsf3 */
+
+
 #ifdef L_arm_addsubsf3
 
-FUNC_START aeabi_frsub
+// float __aeabi_fsub(float, float)
+// Returns the floating point difference of $r0 - $r1 in $r0.
+// Subsection ordering within fpcore keeps conditional branches within range.
+FUNC_START_SECTION aeabi_fsub .text.sorted.libgcc.fpcore.c.faddsub
+FUNC_ALIAS subsf3 aeabi_fsub
+CFI_START_FUNCTION
 
-  push {r4, lr}
-  movs r4, #1
-  lsls r4, #31
-  eors r0, r0, r4
-  bl   __aeabi_fadd
-  pop  {r4, pc}
+  #if defined(STRICT_NANS) && STRICT_NANS
+// Check if $r1 is NAN before modifying.
+lsls r2, r1, #1
+movs r3, #255
+lsls r3, #24
 
-  FUNC_END aeabi_frsub
+// Let fadd() find the NAN in the normal course of operation,
+//  moving it to $r0 and checking the quiet/signaling bit.
+cmp r2, r3
+bhi SYM(__aeabi_fadd)
+  #endif
+
+// Flip sign and fall into fadd().
+movs r2, #1
+lsls r2, #31
+adds r1, r2
 
 #endif /* L_arm_addsubsf3 */
 
+
+// The execution of __subsf3() flows directly into __addsf3(), such that
+//  instructions must appear consecutively in the same memory section.
+//  However, this construction inhibits the ability to discard __subsf3()
+//  when only using __addsf3().
+// Therefore, this block configures __addsf3() for compilation twice.
+// The first version is a minimal standalone implementation, and the second
+//  version is the continuation of __subsf3().  The standalone version must
+//  be declared WEAK, so that the combined version can supersede it and
+//  provide both symbols when required.
+// '_arm_addsf3' should appear before '_arm_addsubsf3' in LIB1ASMFUNCS.
+#if defined(L_arm_addsf3) || defined(L_arm_addsubsf3)
+
+#ifdef L_arm_addsf3
+// float __aeabi_fadd(float, float)
+// Returns the floating point 

[PATCH v5 27/33] Import float multiplication from the CM0 library

2021-01-15 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/eabi/fmul.S (__mulsf3): New file.
* config/arm/lib1funcs.S: #include eabi/fmul.S (v6m only).
* config/arm/t-elf (LIB1ASMFUNCS): Moved _mulsf3 to global scope
(this object was previously blocked on v6m builds).
---
 libgcc/config/arm/eabi/fmul.S | 215 ++
 libgcc/config/arm/lib1funcs.S |   1 +
 libgcc/config/arm/t-elf   |   3 +-
 3 files changed, 218 insertions(+), 1 deletion(-)
 create mode 100644 libgcc/config/arm/eabi/fmul.S

diff --git a/libgcc/config/arm/eabi/fmul.S b/libgcc/config/arm/eabi/fmul.S
new file mode 100644
index 000..767de988f0b
--- /dev/null
+++ b/libgcc/config/arm/eabi/fmul.S
@@ -0,0 +1,215 @@
+/* fmul.S: Thumb-1 optimized 32-bit float multiplication
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#ifdef L_arm_mulsf3
+
+// float __aeabi_fmul(float, float)
+// Returns $r0 after multiplication by $r1.
+// Subsection ordering within fpcore keeps conditional branches within range.
+FUNC_START_SECTION aeabi_fmul .text.sorted.libgcc.fpcore.m.fmul
+FUNC_ALIAS mulsf3 aeabi_fmul
+CFI_START_FUNCTION
+
+// Standard registers, compatible with exception handling.
+push { rT, lr }
+.cfi_remember_state
+.cfi_remember_state
+.cfi_adjust_cfa_offset 8
+.cfi_rel_offset rT, 0
+.cfi_rel_offset lr, 4
+
+// Save the sign of the result.
+movs rT, r1
+eors rT, r0
+lsrs rT, #31
+lsls rT, #31
+mov ip, rT
+
+// Set up INF for comparison.
+movs rT, #255
+lsls rT, #24
+
+// Check for multiplication by zero.
+lsls r2, r0, #1
+beq LLSYM(__fmul_zero1)
+
+lsls r3, r1, #1
+beq LLSYM(__fmul_zero2)
+
+// Check for INF/NAN.
+cmp r3, rT
+bhs LLSYM(__fmul_special2)
+
+cmp r2, rT
+bhs LLSYM(__fmul_special1)
+
+// Because neither operand is INF/NAN, the result will be finite.
+// It is now safe to modify the original operand registers.
+lsls r0, #9
+
+// Isolate the first exponent.  When normal, add back the implicit '1'.
+// The result is always aligned with the MSB in bit [31].
+// Subnormal mantissas remain effectively multiplied by 2x relative to
+//  normals, but this works because the weight of a subnormal is -126.
+lsrs r2, #24
+beq LLSYM(__fmul_normalize2)
+adds r0, #1
+rors r0, r0
+
+LLSYM(__fmul_normalize2):
+// IMPORTANT: exp10i() jumps in here!
+// Repeat for the mantissa of the second operand.
+// Short-circuit when the mantissa is 1.0, as the
+//  first mantissa is already prepared in $r0
+lsls r1, #9
+
+// When normal, add back the implicit '1'.
+lsrs r3, #24
+beq LLSYM(__fmul_go)
+adds r1, #1
+rors r1, r1
+
+LLSYM(__fmul_go):
+// Calculate the final exponent, relative to bit [30].
+adds rT, r2, r3
+subs rT, #127
+
+  #if !defined(__OPTIMIZE_SIZE__) || !__OPTIMIZE_SIZE__
+// Short-circuit on multiplication by powers of 2.
+lsls r3, r0, #1
+beq LLSYM(__fmul_simple1)
+
+lsls r3, r1, #1
+beq LLSYM(__fmul_simple2)
+  #endif
+
+// Save $ip across the call.
+// (Alternatively, a separate register could be pushed/popped, but the
+//  four instructions here are equally fast without imposing on the stack.)
+add rT, ip
+
+// 32x32 unsigned multiplication, 64 bit result.
+bl  SYM(__umulsidi3) __PLT__
+
+// Separ

[PATCH v5 28/33] Import float division from the CM0 library

2021-01-15 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2021-01-08 Daniel Engel 

* config/arm/eabi/fdiv.S (__divsf3, __fp_divloopf): New file.
* config/arm/lib1funcs.S: #include eabi/fdiv.S (v6m only).
* config/arm/t-elf (LIB1ASMFUNCS): Added _divsf3 and _fp_divloopf.
---
 libgcc/config/arm/eabi/fdiv.S | 261 ++
 libgcc/config/arm/lib1funcs.S |   1 +
 libgcc/config/arm/t-elf   |   2 +
 3 files changed, 264 insertions(+)
 create mode 100644 libgcc/config/arm/eabi/fdiv.S

diff --git a/libgcc/config/arm/eabi/fdiv.S b/libgcc/config/arm/eabi/fdiv.S
new file mode 100644
index 000..118f4e94676
--- /dev/null
+++ b/libgcc/config/arm/eabi/fdiv.S
@@ -0,0 +1,261 @@
+/* fdiv.S: Cortex M0 optimized 32-bit float division
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#ifdef L_arm_divsf3
+
+// float __aeabi_fdiv(float, float)
+// Returns $r0 after division by $r1.
+// Subsection ordering within fpcore keeps conditional branches within range.
+FUNC_START_SECTION aeabi_fdiv .text.sorted.libgcc.fpcore.n.fdiv
+FUNC_ALIAS divsf3 aeabi_fdiv
+CFI_START_FUNCTION
+
+// Standard registers, compatible with exception handling.
+push { rT, lr }
+.cfi_remember_state
+.cfi_remember_state
+.cfi_adjust_cfa_offset 8
+.cfi_rel_offset rT, 0
+.cfi_rel_offset lr, 4
+
+// Save the sign of the result.
+movs r3, r1
+eors r3, r0
+lsrs rT, r3, #31
+lsls rT, #31
+mov ip, rT
+
+// Set up INF for comparison.
+movs rT, #255
+lsls rT, #24
+
+// Check for divide by 0.  Automatically catches 0/0.
+lsls r2, r1, #1
+beq LLSYM(__fdiv_by_zero)
+
+// Check for INF/INF, or a number divided by itself.
+lsls r3, #1
+beq LLSYM(__fdiv_equal)
+
+// Check the numerator for INF/NAN.
+eors r3, r2
+cmp r3, rT
+bhs LLSYM(__fdiv_special1)
+
+// Check the denominator for INF/NAN.
+cmp r2, rT
+bhs LLSYM(__fdiv_special2)
+
+// Check the numerator for zero.
+cmp r3, #0
+beq SYM(__fp_zero)
+
+// No action if the numerator is subnormal.
+//  The mantissa will normalize naturally in the division loop.
+lsls r0, #9
+lsrs r1, r3, #24
+beq LLSYM(__fdiv_denominator)
+
+// Restore the numerator's implicit '1'.
+adds r0, #1
+rors r0, r0
+
+LLSYM(__fdiv_denominator):
+// The denominator must be normalized and left aligned.
+bl  SYM(__fp_normalize2)
+
+// 25 bits of precision will be sufficient.
+movs rT, #64
+
+// Run division.
+bl  SYM(__fp_divloopf)
+b   SYM(__fp_assemble)
+
+LLSYM(__fdiv_equal):
+  #if defined(EXCEPTION_CODES) && EXCEPTION_CODES
+movs r3, #(DIVISION_INF_BY_INF)
+  #endif
+
+// The absolute value of both operands are equal, but not 0.
+// If both operands are INF, create a new NAN.
+cmp r2, rT
+beq SYM(__fp_exception)
+
+  #if defined(TRAP_NANS) && TRAP_NANS
+// If both operands are NAN, return the NAN in $r0.
+bhi SYM(__fp_check_nan)
+  #else
+bhi LLSYM(__fdiv_return)
+  #endif
+
+// Return 1.0f, with appropriate sign.
+movs r0, #127
+lsls r0, #23
+add r0, ip
+
+LLSYM(__fdiv_return):
+pop { rT, pc }
+.cfi_restore_state
+
+LLSYM(__fdiv_special2):
+// The denominator is either INF or NAN, numerator is neither.
+// Also, the denominator is not equal to 0.
+// If the denominator is INF, the result goes t

[PATCH v5 29/33] Import integer-to-float conversion from the CM0 library

2021-01-15 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/bpabi-lib.h (__floatdisf, __floatundisf):
Remove obsolete RENAME_LIBRARY directives.
* config/arm/eabi/ffloat.S (__aeabi_i2f, __aeabi_l2f, __aeabi_ui2f,
__aeabi_ul2f): New file.
* config/arm/lib1funcs.S: #include eabi/ffloat.S (v6m only).
* config/arm/t-elf (LIB1ASMFUNCS): Added _arm_floatunsisf,
_arm_floatsisf, and _internal_floatundisf.
Moved _arm_floatundisf to the weak function group.
---
 libgcc/config/arm/bpabi-lib.h   |   6 -
 libgcc/config/arm/eabi/ffloat.S | 247 
 libgcc/config/arm/lib1funcs.S   |   1 +
 libgcc/config/arm/t-elf |   5 +-
 4 files changed, 252 insertions(+), 7 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/ffloat.S

diff --git a/libgcc/config/arm/bpabi-lib.h b/libgcc/config/arm/bpabi-lib.h
index 3cb90b4b345..1e651ead4ac 100644
--- a/libgcc/config/arm/bpabi-lib.h
+++ b/libgcc/config/arm/bpabi-lib.h
@@ -56,9 +56,6 @@
 #ifdef L_floatdidf
 #define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (floatdidf, l2d)
 #endif
-#ifdef L_floatdisf
-#define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (floatdisf, l2f)
-#endif
 
 /* These renames are needed on ARMv6M.  Other targets get them from
assembly routines.  */
@@ -71,9 +68,6 @@
 #ifdef L_floatundidf
 #define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (floatundidf, ul2d)
 #endif
-#ifdef L_floatundisf
-#define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (floatundisf, ul2f)
-#endif
 
 /* For ARM bpabi, we only want to use a "__gnu_" prefix for the fixed-point
helper functions - not everything in libgcc - in the interests of
diff --git a/libgcc/config/arm/eabi/ffloat.S b/libgcc/config/arm/eabi/ffloat.S
new file mode 100644
index 000..9690ab85081
--- /dev/null
+++ b/libgcc/config/arm/eabi/ffloat.S
@@ -0,0 +1,247 @@
+/* ffloat.S: Thumb-1 optimized integer-to-float conversion
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+
+#ifdef L_arm_floatsisf
+
+// float __aeabi_i2f(int)
+// Converts a signed integer in $r0 to float.
+
+// On little-endian cores (including all Cortex-M), __floatsisf() can be
+//  implemented as below in 5 instructions.  However, it can also be
+//  implemented by prefixing a single instruction to __floatdisf().
+// A memory savings of 4 instructions at a cost of only 2 execution cycles
+//  seems reasonable enough.  Plus, the trade-off only happens in programs
+//  that require both __floatsisf() and __floatdisf().  Programs only using
+//  __floatsisf() always get the smallest version.
+// When the combined version will be provided, this standalone version
+//  must be declared WEAK, so that the combined version can supersede it.
+// '_arm_floatsisf' should appear before '_arm_floatdisf' in LIB1ASMFUNCS.
+// Same parent section as __ul2f() to keep tail call branch within range.
+#if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+WEAK_START_SECTION aeabi_i2f .text.sorted.libgcc.fpcore.p.floatsisf
+WEAK_ALIAS floatsisf aeabi_i2f
+CFI_START_FUNCTION
+
+#else /* !__OPTIMIZE_SIZE__ */
+FUNC_START_SECTION aeabi_i2f .text.sorted.libgcc.fpcore.p.floatsisf
+FUNC_ALIAS floatsisf aeabi_i2f
+CFI_START_FUNCTION
+
+#endif /* !__OPTIMIZE_SIZE__ */
+
+// Save the sign.
+asrs r3, r0, #31
+
+// Absolute value of the input.
+eors r0, r3
+subs r0, r3
+
+// Sign extension to long long unsigned.
+eors r1, r1
+b   SYM(__internal_floatundisf_noswap)
+
+CFI_END_FUNCTION
+FUNC_END floatsisf
+FUNC_END aeabi_i2f
+
+#endif /* L_arm_floatsisf */
+
+
+#ifdef L_arm_floatdisf
+
+// float __aeabi_l2f(long long)
+// Converts a signed 64-bit integer in $r1:$r0 to a float in $r0.
+// See build comments for __floatsisf() above.
+// Same parent section as __ul2f() to keep tail call branch within range.
+#if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+FUNC_START_SECTION aeabi

[PATCH v5 31/33] Import float<->double conversion from the CM0 library

2021-01-15 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/eabi/fcast.S (__aeabi_d2f, __aeabi_f2d): New file.
* config/arm/lib1funcs.S: #include eabi/fcast.S (v6m only).
* config/arm/t-elf (LIB1ASMFUNCS): Added _arm_d2f and _arm_f2d.
---
 libgcc/config/arm/eabi/fcast.S | 256 +
 libgcc/config/arm/lib1funcs.S  |   1 +
 libgcc/config/arm/t-elf|   2 +
 3 files changed, 259 insertions(+)
 create mode 100644 libgcc/config/arm/eabi/fcast.S

diff --git a/libgcc/config/arm/eabi/fcast.S b/libgcc/config/arm/eabi/fcast.S
new file mode 100644
index 000..b1184ee1d53
--- /dev/null
+++ b/libgcc/config/arm/eabi/fcast.S
@@ -0,0 +1,256 @@
+/* fcast.S: Thumb-1 optimized 32- and 64-bit float conversions
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+
+#ifdef L_arm_f2d
+
+// double __aeabi_f2d(float)
+// Converts a single-precision float in $r0 to double-precision in $r1:$r0.
+// Rounding, overflow, and underflow are impossible.
+// INF and ZERO are returned unmodified.
+FUNC_START_SECTION aeabi_f2d .text.sorted.libgcc.fpcore.v.f2d
+FUNC_ALIAS extendsfdf2 aeabi_f2d
+CFI_START_FUNCTION
+
+// Save the sign.
+lsrs    r1, r0, #31
+lsls    r1, #31
+
+// Set up registers for __fp_normalize2().
+push    { rT, lr }
+.cfi_remember_state
+.cfi_adjust_cfa_offset 8
+.cfi_rel_offset rT, 0
+.cfi_rel_offset lr, 4
+
+// Test for zero.
+lsls    r0, #1
+beq LLSYM(__f2d_return)
+
+// Split the exponent and mantissa into separate registers.
+// This is the most efficient way to convert subnormals in the
+//  single-precision form into normals in double-precision.
+// This does add a leading implicit '1' to INF and NAN,
+//  but that will be absorbed when the value is re-assembled.
+movs    r2, r0
+bl  SYM(__fp_normalize2) __PLT__
+
+// Set up the exponent bias.  For INF/NAN values, the bias
+//  is 1791 (2047 - 255 - 1), where the last '1' accounts
+//  for the implicit '1' in the mantissa.
+movs    r0, #3
+lsls    r0, #9
+adds    r0, #255
+
+// Test for INF/NAN, promote exponent if necessary
+cmp r2, #255
+beq LLSYM(__f2d_indefinite)
+
+// For normal values, the exponent bias is 895 (1023 - 127 - 1),
+//  which is half of the prepared INF/NAN bias.
+lsrs    r0, #1
+
+LLSYM(__f2d_indefinite):
+// Assemble exponent with bias correction.
+adds    r2, r0
+lsls    r2, #20
+adds    r1, r2
+
+// Assemble the high word of the mantissa.
+lsrs    r0, r3, #11
+add r1, r0
+
+// Remainder of the mantissa in the low word of the result.
+lsls    r0, r3, #21
+
+LLSYM(__f2d_return):
+pop { rT, pc }
+.cfi_restore_state
+
+CFI_END_FUNCTION
+FUNC_END extendsfdf2
+FUNC_END aeabi_f2d
+
+#endif /* L_arm_f2d */
+
+
+#if defined(L_arm_d2f) || defined(L_arm_truncdfsf2)
+
+// HACK: Build two separate implementations:
+//  * __aeabi_d2f() rounds to nearest per traditional IEEE-754 rules.
+//  * __truncdfsf2() rounds towards zero per GCC specification.
+// Presumably, a program will consistently use one ABI or the other,
+//  which means that code size will not be duplicated in practice.
+// Merging two versions with dynamic rounding would be rather hard.
+#ifdef L_arm_truncdfsf2
+  #define D2F_NAME truncdfsf2
+  #define D2F_SECTION .text.sorted.libgcc.fpcore.x.truncdfsf2
+#else
+  #define D2F_NAME aeabi_d2f
+  #define D2F_SECTION .text.sorted.libgcc.fpcore.w.d2f
+#endif
+
+// float __aeabi_d2f(double)
+// Converts a double-precision float in $r1:$r0 to single-precision in $r0.
+// Values out of range become ZERO or 
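For readers following the bias arithmetic in __aeabi_f2d() above, here is a bit-level sketch of the widening conversion in Python. It keeps the implicit mantissa bit implicit, so it uses the plain 1023 - 127 = 896 rebias rather than the assembly's 895-plus-implicit-'1' trick, and the normalization loop stands in for what the comments describe __fp_normalize2() doing. This is an illustration of the semantics, not the patch's control flow:

```python
import struct

def f2d_bits(f):
    """Widen an IEEE-754 single (as a 32-bit int) to double (as a 64-bit int).
    Rounding, overflow, and underflow are impossible in this direction."""
    sign = (f >> 31) << 63
    exp = (f >> 23) & 0xFF
    mant = f & 0x7FFFFF
    if exp == 0xFF:                         # INF/NaN widen in place
        return sign | (0x7FF << 52) | (mant << 29)
    if exp == 0:
        if mant == 0:
            return sign                     # signed zero
        while not (mant & 0x800000):        # normalize a subnormal input
            mant <<= 1
            exp -= 1
        exp += 1
        mant &= 0x7FFFFF
    return sign | ((exp + (1023 - 127)) << 52) | (mant << 29)

# Cross-check against Python's native single->double conversion:
src = struct.unpack('<I', struct.pack('<f', -2.5))[0]
assert struct.unpack('<d', struct.pack('<Q', f2d_bits(src)))[0] == -2.5
```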

[PATCH v5 30/33] Import float-to-integer conversion from the CM0 library

2021-01-15 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/bpabi-lib.h (muldi3): Removed duplicate.
(fixunssfsi) Removed obsolete RENAME_LIBRARY directive.
* config/arm/eabi/ffixed.S (__aeabi_f2iz, __aeabi_f2uiz,
__aeabi_f2lz, __aeabi_f2ulz): New file.
* config/arm/lib1funcs.S: #include eabi/ffixed.S (v6m only).
* config/arm/t-elf (LIB1ASMFUNCS): Added _internal_fixsfdi,
_internal_fixsfsi, _arm_fixsfdi, and _arm_fixunssfdi.
---
 libgcc/config/arm/bpabi-lib.h   |   6 -
 libgcc/config/arm/eabi/ffixed.S | 414 
 libgcc/config/arm/lib1funcs.S   |   1 +
 libgcc/config/arm/t-elf |   4 +
 4 files changed, 419 insertions(+), 6 deletions(-)
 create mode 100644 libgcc/config/arm/eabi/ffixed.S

diff --git a/libgcc/config/arm/bpabi-lib.h b/libgcc/config/arm/bpabi-lib.h
index 1e651ead4ac..a1c631640bb 100644
--- a/libgcc/config/arm/bpabi-lib.h
+++ b/libgcc/config/arm/bpabi-lib.h
@@ -32,9 +32,6 @@
 #ifdef L_muldi3
 #define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (muldi3, lmul)
 #endif
-#ifdef L_muldi3
-#define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (muldi3, lmul)
-#endif
 #ifdef L_fixdfdi
 #define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (fixdfdi, d2lz) \
   extern DWtype __fixdfdi (DFtype) __attribute__((pcs("aapcs"))); \
@@ -62,9 +59,6 @@
 #ifdef L_fixunsdfsi
 #define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (fixunsdfsi, d2uiz)
 #endif
-#ifdef L_fixunssfsi
-#define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (fixunssfsi, f2uiz)
-#endif
 #ifdef L_floatundidf
 #define DECLARE_LIBRARY_RENAMES RENAME_LIBRARY (floatundidf, ul2d)
 #endif
diff --git a/libgcc/config/arm/eabi/ffixed.S b/libgcc/config/arm/eabi/ffixed.S
new file mode 100644
index 000..8ced3a701ff
--- /dev/null
+++ b/libgcc/config/arm/eabi/ffixed.S
@@ -0,0 +1,414 @@
+/* ffixed.S: Thumb-1 optimized float-to-integer conversion
+
+   Copyright (C) 2018-2021 Free Software Foundation, Inc.
+   Contributed by Daniel Engel, Senva Inc (g...@danielengel.com)
+
+   This file is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by the
+   Free Software Foundation; either version 3, or (at your option) any
+   later version.
+
+   This file is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   .  */
+
+
+// The implementation of __aeabi_f2uiz() expects to tail call __internal_f2iz()
+//  with the flags register set for unsigned conversion.  The __internal_f2iz()
+//  symbol itself is unambiguous, but there is a remote risk that the linker
+//  will prefer some other symbol in place of __aeabi_f2iz().  Importing an
+//  archive file that exports __aeabi_f2iz() will throw an error in this case.
+// As a workaround, this block configures __aeabi_f2iz() for compilation twice.
+// The first version configures __internal_f2iz() as a WEAK standalone symbol,
+//  and the second exports __aeabi_f2iz() and __internal_f2iz() normally.
+// A small bonus: programs only using __aeabi_f2uiz() will be slightly smaller.
+// '_internal_fixsfsi' should appear before '_arm_fixsfsi' in LIB1ASMFUNCS.
+#if defined(L_arm_fixsfsi) || \
+   (defined(L_internal_fixsfsi) && \
+  !(defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__))
+
+// Subsection ordering within fpcore keeps conditional branches within range.
+#define F2IZ_SECTION .text.sorted.libgcc.fpcore.r.fixsfsi
+
+// int __aeabi_f2iz(float)
+// Converts a float in $r0 to signed integer, rounding toward 0.
+// Values out of range are forced to either INT_MAX or INT_MIN.
+// NAN becomes zero.
+#ifdef L_arm_fixsfsi
+FUNC_START_SECTION aeabi_f2iz F2IZ_SECTION
+FUNC_ALIAS fixsfsi aeabi_f2iz
+CFI_START_FUNCTION
+#endif
+
+  #if defined(__OPTIMIZE_SIZE__) && __OPTIMIZE_SIZE__
+// Flag for unsigned conversion.
+movs    r1, #33
+b   SYM(__internal_fixsfdi)
+
+  #else /* !__OPTIMIZE_SIZE__ */
+
+#ifdef L_arm_fixsfsi
+// Flag for signed conversion.
+movs    r3, #1
+
+// [unsigned] int internal_f2iz(float, int)
+// Internal function expects a boolean flag in $r1.
+// If the boolean flag is 0, the result is unsigned.
+// If the boolean flag is 1, the result is signed.
+FUNC_ENTRY internal_f2iz
+
+#else /* L_internal_fixsfsi */
+WEAK_START_SECTION internal_f2iz F2IZ_SECTION
+CFI_START_FUNCTION
+
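The conversion rules stated in the comments above — truncate toward zero, saturate out-of-range values to INT_MAX/INT_MIN, map NAN to zero — can be modeled from the raw float bits. This is an illustrative model of the semantics, not the assembly's control flow:

```python
import struct

INT_MAX, INT_MIN = 2**31 - 1, -2**31

def float_bits(x):
    """Raw IEEE-754 single-precision bit pattern of x."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def f2iz(bits):
    """Model of __aeabi_f2iz semantics: round toward zero,
    saturate out-of-range values, NaN becomes zero."""
    neg = bits >> 31
    exp = (bits >> 23) & 0xFF
    mant = bits & 0x7FFFFF
    if exp == 0xFF and mant:
        return 0                            # NaN
    e = exp - 127                           # unbiased exponent
    if e < 0:
        return 0                            # |x| < 1 truncates to zero
    if e >= 31:
        return INT_MIN if neg else INT_MAX  # saturate; also covers +/-INF
    full = mant | (1 << 23)                 # restore the implicit '1'
    val = full >> (23 - e) if e <= 23 else full << (e - 23)
    return -val if neg else val
```

Note that exactly -2^31 takes the saturating path and still produces the correct INT_MIN.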

[PATCH v5 33/33] Drop single-precision Thumb-1 soft-float functions

2021-01-15 Thread Daniel Engel
With the complete CM0 library integrated, regression testing showed new
failures with the message "compilation failed to produce executable":

gcc.dg/fixed-point/convert-float-1.c
gcc.dg/fixed-point/convert-float-3.c
gcc.dg/fixed-point/convert-sat.c

Investigating, this appears to be caused by the linker.  I can't find a
comprehensive linker specification to claim this is actually a bug, but it
certainly doesn't match my expectations.  Digging further, I found issues
with the link order of these symbols:

  * __aeabi_fmul()
  * __aeabi_f2d()
  * __aeabi_f2iz()

Specifically, I expect the linker to import the _first_ definition of any
symbol.  This is the basic behavior that allows the soft-float library to
supply missing symbols on architectures without optimized routines.

Comparing the v6-m multilib with the default, I see symbol exports for all
of the affected symbols:

gcc-obj/gcc/libgcc.a:

// assembly routines

_arm_mulsf3.o:
 W __aeabi_fmul
 W __mulsf3

_arm_addsubdf3.o:
0368 T __aeabi_f2d
0368 T __extendsfdf2

_arm_fixsfsi.o:
 T __aeabi_f2iz
 T __fixsfsi

mulsf3.o:


fixsfsi.o:


extendsfdf2.o:


gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a:

// assembly routines

_arm_mulsf3.o:
 T __aeabi_fmul
 U __fp_assemble
 U __fp_exception
 U __fp_infinity
 U __fp_zero
 T __mulsf3
 U __umulsidi3

_arm_fixsfsi.o:
 T __aeabi_f2iz
 T __fixsfsi
0002 T __internal_f2iz

_arm_f2d.o:
 T __aeabi_f2d
 T __extendsfdf2
 U __fp_normalize2

// soft-float library

mulsf3.o:
 T __aeabi_fmul

fixsfsi.o:
 T __aeabi_f2iz

extendsfdf2.o:
 T __aeabi_f2d

Given the order of the archive file, I expect the linker to import the affected
functions from the _arm_* archive elements.
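That expectation corresponds to the classic single-pass archive scan: the linker pulls a member only if it defines a symbol that is still undefined, so the first definition encountered wins. A toy model of that scan (member names follow the listings above; the _umulsidi3.o member is invented here for illustration):

```python
def scan_archive(undefined, members):
    """One-pass archive scan: pull a member iff it defines a symbol that is
    currently undefined; the member's own undefined symbols join the worklist."""
    undefined, defined, pulled = set(undefined), set(), []
    for name, defines, needs in members:
        if undefined & defines:
            pulled.append(name)
            defined |= defines
            undefined = (undefined | needs) - defined
    return pulled

# Each entry: (member name, symbols it defines, symbols it needs).
members = [
    ("_arm_mulsf3.o", {"__aeabi_fmul"}, {"__umulsidi3"}),
    ("_umulsidi3.o",  {"__umulsidi3"},  set()),
    ("mulsf3.o",      {"__aeabi_fmul"}, set()),   # soft-float fallback
]
# The first definition wins; the soft-float mulsf3.o is never pulled.
assert scan_archive({"__aeabi_fmul"}, members) == ["_arm_mulsf3.o", "_umulsidi3.o"]
```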

For "convert-sat.c", all is well with -march=armv7-m.
...
(/home/mirdan/gcc-obj/gcc/libgcc.a)_arm_muldf3.o
OK> (/home/mirdan/gcc-obj/gcc/libgcc.a)_arm_mulsf3.o
(/home/mirdan/gcc-obj/gcc/libgcc.a)_arm_cmpsf2.o
(/home/mirdan/gcc-obj/gcc/libgcc.a)_arm_fixsfsi.o
(/home/mirdan/gcc-obj/gcc/libgcc.a)_arm_fixunssfsi.o
OK> (/home/mirdan/gcc-obj/gcc/libgcc.a)_arm_addsubdf3.o
(/home/mirdan/gcc-obj/gcc/libgcc.a)_arm_cmpdf2.o
(/home/mirdan/gcc-obj/gcc/libgcc.a)_arm_fixdfsi.o
(/home/mirdan/gcc-obj/gcc/libgcc.a)_arm_fixunsdfsi.o
OK> (/home/mirdan/gcc-obj/gcc/libgcc.a)_fixsfdi.o
(/home/mirdan/gcc-obj/gcc/libgcc.a)_fixdfdi.o
(/home/mirdan/gcc-obj/gcc/libgcc.a)_fixunssfdi.o
(/home/mirdan/gcc-obj/gcc/libgcc.a)_fixunsdfdi.o
...

However, with -march=armv6s-m, the linker imports these symbols from the
soft-float library.  (NOTE: The CM0 library only implements single-precision
float operations, so imports from muldf3.o, fixdfsi.o, etc. are expected.)
...
??> (/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)mulsf3.o
??> (/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)fixsfsi.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)muldf3.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)fixdfsi.o
??> (/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)extendsfdf2.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_clzsi2.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_arm_fcmpge.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_arm_fcmple.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_fixsfdi.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_fixunssfdi.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_fixunssfsi.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_arm_cmpdf2.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_fixunsdfsi.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_fixdfdi.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_fixunsdfdi.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)eqdf2.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)gedf2.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)ledf2.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)subdf3.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)floatunsidf.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_arm_cmpsf2.o
(/home/mirdan/gcc-obj/gcc/thumb/v6-m/nofp/libgcc.a)_fixsfsi.o
...

It seems that the order in which the linker resolves symbols matters.  In the
affected test cases, the linker begins searching for fixed-point function
symbols first: _subQQ.o, _cmpQQ.o, etc.  T

[PATCH v5 32/33] Import float<->__fp16 conversion from the CM0 library

2021-01-15 Thread Daniel Engel
gcc/libgcc/ChangeLog:
2021-01-13 Daniel Engel 

* config/arm/eabi/fcast.S (__aeabi_h2f, __aeabi_f2h): Added functions.
* config/arm/fp16 (__gnu_f2h_ieee, __gnu_h2f_ieee, 
__gnu_f2h_alternative,
__gnu_h2f_alternative): Disable build for v6m multilibs.
* config/arm/t-bpabi (LIB1ASMFUNCS): Added _aeabi_f2h_ieee,
_aeabi_h2f_ieee, _aeabi_f2h_alt, and _aeabi_h2f_alt (v6m only).
---
 libgcc/config/arm/eabi/fcast.S | 277 +
 libgcc/config/arm/fp16.c   |   4 +
 libgcc/config/arm/t-bpabi  |   7 +
 3 files changed, 288 insertions(+)

diff --git a/libgcc/config/arm/eabi/fcast.S b/libgcc/config/arm/eabi/fcast.S
index b1184ee1d53..e5a34d69578 100644
--- a/libgcc/config/arm/eabi/fcast.S
+++ b/libgcc/config/arm/eabi/fcast.S
@@ -254,3 +254,280 @@ FUNC_END D2F_NAME
 
 #endif /* L_arm_d2f || L_arm_truncdfsf2 */
 
+
+#if defined(L_aeabi_h2f_ieee) || defined(L_aeabi_h2f_alt)
+
+#ifdef L_aeabi_h2f_ieee
+  #define H2F_NAME aeabi_h2f
+  #define H2F_ALIAS gnu_h2f_ieee
+#else
+  #define H2F_NAME aeabi_h2f_alt
+  #define H2F_ALIAS gnu_h2f_alternative
+#endif
+
+// float __aeabi_h2f(short hf)
+// float __aeabi_h2f_alt(short hf)
+// Converts a half-precision float in $r0 to single-precision.
+// Rounding, overflow, and underflow conditions are impossible.
+// In IEEE mode, INF, ZERO, and NAN are returned unmodified.
+FUNC_START_SECTION H2F_NAME .text.sorted.libgcc.h2f
+FUNC_ALIAS H2F_ALIAS H2F_NAME
+CFI_START_FUNCTION
+
+// Set up registers for __fp_normalize2().
+push    { rT, lr }
+.cfi_remember_state
+.cfi_adjust_cfa_offset 8
+.cfi_rel_offset rT, 0
+.cfi_rel_offset lr, 4
+
+// Save the mantissa and exponent.
+lsls    r2, r0, #17
+
+// Isolate the sign.
+lsrs    r0, #15
+lsls    r0, #31
+
+// Align the exponent at bit[24] for normalization.
+// If zero, return the original sign.
+lsrs    r2, #3
+
+  #ifdef __HAVE_FEATURE_IT
+do_it   eq
+RETc(eq)
+  #else
+beq LLSYM(__h2f_return)
+  #endif
+
+// Split the exponent and mantissa into separate registers.
+// This is the most efficient way to convert subnormals in the
+//  half-precision form into normals in single-precision.
+// This does add a leading implicit '1' to INF and NAN,
+//  but that will be absorbed when the value is re-assembled.
+bl  SYM(__fp_normalize2) __PLT__
+
+   #ifdef L_aeabi_h2f_ieee
+// Set up the exponent bias.  For INF/NAN values, the bias is 223,
+//  where the last '1' accounts for the implicit '1' in the mantissa.
+adds    r2, #(255 - 31 - 1)
+
+// Test for INF/NAN.
+cmp r2, #254
+
+  #ifdef __HAVE_FEATURE_IT
+do_it   ne
+  #else
+beq LLSYM(__h2f_assemble)
+  #endif
+
+// For normal values, the bias should have been 111.
+// However, this offset must be adjusted per the INF check above.
+ IT(sub,ne) r2, #((255 - 31 - 1) - (127 - 15 - 1))
+
+#else /* L_aeabi_h2f_alt */
+// Set up the exponent bias.  All values are normal.
+adds    r2, #(127 - 15 - 1)
+#endif
+
+LLSYM(__h2f_assemble):
+// Combine exponent and sign.
+lsls    r2, #23
+adds    r0, r2
+
+// Combine mantissa.
+lsrs    r3, #8
+add r0, r3
+
+LLSYM(__h2f_return):
+pop { rT, pc }
+.cfi_restore_state
+
+CFI_END_FUNCTION
+FUNC_END H2F_NAME
+FUNC_END H2F_ALIAS
+
+#endif /* L_aeabi_h2f_ieee || L_aeabi_h2f_alt */
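A bit-level Python sketch of the half-to-single widening above. It uses the plain 127 - 15 = 112 rebias because the implicit bit stays implicit here, unlike the 111/223 biases in the assembly, which account for it explicitly; the loop mirrors the subnormal normalization the comments attribute to __fp_normalize2(). An illustration, not the patch's control flow:

```python
import struct

def h2f_bits(h):
    """Widen an IEEE-754 half (as a 16-bit int) to single (as a 32-bit int).
    Rounding, overflow, and underflow are impossible in this direction."""
    sign = (h >> 15) << 31
    exp = (h >> 10) & 0x1F
    mant = h & 0x3FF
    if exp == 0x1F:                         # INF/NaN widen in place
        return sign | 0x7F800000 | (mant << 13)
    if exp == 0:
        if mant == 0:
            return sign                     # signed zero
        while not (mant & 0x400):           # normalize a subnormal input
            mant <<= 1
            exp -= 1
        exp += 1
        mant &= 0x3FF
    return sign | ((exp + (127 - 15)) << 23) | (mant << 13)

# Cross-check the smallest half subnormal against Python's native float:
assert struct.unpack('<f', struct.pack('<I', h2f_bits(0x0001)))[0] == 2.0 ** -24
```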
+
+
+#if defined(L_aeabi_f2h_ieee) || defined(L_aeabi_f2h_alt)
+
+#ifdef L_aeabi_f2h_ieee
+  #define F2H_NAME aeabi_f2h
+  #define F2H_ALIAS gnu_f2h_ieee
+#else
+  #define F2H_NAME aeabi_f2h_alt
+  #define F2H_ALIAS gnu_f2h_alternative
+#endif
+
+// short __aeabi_f2h(float f)
+// short __aeabi_f2h_alt(float f)
+// Converts a single-precision float in $r0 to half-precision,
+//  rounding to nearest, ties to even.
+// Values out of range are forced to either ZERO or INF.
+// In IEEE mode, the upper 12 bits of a NAN will be preserved.
+FUNC_START_SECTION F2H_NAME .text.sorted.libgcc.f2h
+FUNC_ALIAS F2H_ALIAS F2H_NAME
+CFI_START_FUNCTION
+
+// Set up the sign.
+lsrs    r2, r0, #31
+lsls    r2, #15
+
+// Save the exponent and mantissa.
+// If ZERO, return the original sign.
+lsls    r0, #1
+
+  #ifdef __HAVE_FEATURE_IT
+do_it   ne,t
+addne   r0, r2
+RETc(ne)
+  #else
+beq LLSYM(__f2h_return)
+  #endif
+
+// Isolate the exponent.
+lsrs    r1, r0, #24
+
+  #ifdef L_aeabi_f2h_ieee
+// Check for NAN.
+cmp r1, #255
+beq LLSYM(__f2h_indefinite)
+
+// Che
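The behavior specified in the comments above — round to nearest, ties to even, out-of-range values forced to ZERO or INF — can be sketched at the bit level as follows. The NaN handling is a simplification: it only guarantees a NaN does not collapse to INF, rather than modeling exactly which payload bits the assembly preserves.

```python
import struct

def float_bits(x):
    """Raw IEEE-754 single-precision bit pattern of x."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def f2h_bits(f):
    """Narrow an IEEE-754 single (32-bit int) to half (16-bit int),
    rounding to nearest, ties to even."""
    sign = (f >> 31) << 15
    exp = (f >> 23) & 0xFF
    mant = f & 0x7FFFFF
    if exp == 0xFF:                         # INF/NaN
        h = sign | 0x7C00 | (mant >> 13)
        if mant and (h & 0x3FF) == 0:
            h |= 1                          # keep a NaN from collapsing to INF
        return h
    e = exp - 127 + 15                      # rebias toward half-precision
    if e >= 31:
        return sign | 0x7C00                # overflow -> INF
    full = mant | (0x800000 if exp else 0)  # restore the implicit '1'
    if e <= 0:                              # result is subnormal (or zero)
        if e < -10:
            return sign                     # deep underflow -> signed zero
        shift = 13 + (1 - e)
    else:
        shift = 13
    keep = full >> shift
    rem = full & ((1 << shift) - 1)
    if rem > (1 << (shift - 1)) or (rem == (1 << (shift - 1)) and (keep & 1)):
        keep += 1                           # round to nearest, ties to even
    if e <= 0:
        return sign | keep                  # a rounding carry promotes to normal
    return sign | ((e << 10) + (keep - (1 << 10)))
```

A rounding carry out of the mantissa falls through naturally: in the normal path it bumps the exponent (possibly into INF), and in the subnormal path it promotes the result to the smallest normal half.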

Re: [PATCH] x86: Error on -fcf-protection with incompatible target

2021-01-15 Thread Matthias Klose
On 1/14/21 4:18 PM, H.J. Lu via Gcc-patches wrote:
> On Thu, Jan 14, 2021 at 6:51 AM Uros Bizjak  wrote:
>>
>> On Thu, Jan 14, 2021 at 3:05 PM H.J. Lu  wrote:
>>>
>>> -fcf-protection with CF_BRANCH inserts ENDBR32 at function entries.
>>> ENDBR32 is NOP only on 64-bit processors and 32-bit TARGET_CMOVE
>>> processors.  Issue an error for -fcf-protection with CF_BRANCH when
>>> compiling for 32-bit non-TARGET_CMOVE targets.
>>>
>>> gcc/
>>>
>>> PR target/98667
>>> * config/i386/i386-options.c (ix86_option_override_internal):
>>> Issue an error for -fcf-protection with CF_BRANCH when compiling
>>> for 32-bit non-TARGET_CMOVE targets.
>>>
>>> gcc/testsuite/
>>>
>>> PR target/98667
>>> * gcc.target/i386/pr98667-1.c: New file.
>>> * gcc.target/i386/pr98667-2.c: Likewise.
>>> * gcc.target/i386/pr98667-3.c: Likewise.
>>> ---
>>>  gcc/config/i386/i386-options.c| 9 -
>>>  gcc/testsuite/gcc.target/i386/pr98667-1.c | 9 +
>>>  gcc/testsuite/gcc.target/i386/pr98667-2.c | 9 +
>>>  gcc/testsuite/gcc.target/i386/pr98667-3.c | 7 +++
>>>  4 files changed, 33 insertions(+), 1 deletion(-)
>>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr98667-1.c
>>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr98667-2.c
>>>  create mode 100644 gcc/testsuite/gcc.target/i386/pr98667-3.c
>>>
>>> diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
>>> index 4e0165ff32c..1489871b36f 100644
>>> --- a/gcc/config/i386/i386-options.c
>>> +++ b/gcc/config/i386/i386-options.c
>>> @@ -3016,8 +3016,15 @@ ix86_option_override_internal (bool main_args_p,
>>>  }
>>>
>>>if (opts->x_flag_cf_protection != CF_NONE)
>>> -opts->x_flag_cf_protection
>>> +{
>>> +  if ((opts->x_flag_cf_protection & CF_BRANCH) == CF_BRANCH
>>> + && !TARGET_64BIT
>>> + && !TARGET_CMOVE)
>>
>> You need TARGET_CMOV (note, no E) here. Also, please put both tests on one 
>> line.
>>
>> LGTM with the above change.
> 
> This is the patch I am checking in.

I might be doing something wrong, but this breaks the -m32 multilib build for
me.  Looking at a trunk 20210110 build, -fcf-protection -mshstk are passed to
the -m32 build as well, which now errors out.

Matthias


Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0

2021-01-15 Thread Daniel Engel
Hi Christophe,

On Mon, Jan 11, 2021, at 8:39 AM, Christophe Lyon wrote:
> On Mon, 11 Jan 2021 at 17:18, Daniel Engel  wrote:
> >
> > On Mon, Jan 11, 2021, at 8:07 AM, Christophe Lyon wrote:
> > > On Sat, 9 Jan 2021 at 14:09, Christophe Lyon  
> > > wrote:
> > > >
> > > > On Sat, 9 Jan 2021 at 13:27, Daniel Engel  
> > > > wrote:
> > > > >
> > > > > On Thu, Jan 7, 2021, at 4:56 AM, Richard Earnshaw wrote:
> > > > > > On 07/01/2021 00:59, Daniel Engel wrote:
> > > > > > > --snip--
> > > > > > >
> > > > > > > On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
> > > > > > > --snip--
> > > > > > >
> > > > > > >> - finally, your popcount implementations have data in the code 
> > > > > > >> segment.
> > > > > > >>  That's going to cause problems when we have compilation options 
> > > > > > >> such as
> > > > > > >> -mpure-code.
> > > > > > >
> > > > > > > I am just following the precedent of existing lib1funcs (e.g. 
> > > > > > > __clz2si).
> > > > > > > If this matters, you'll need to point in the right direction for 
> > > > > > > the
> > > > > > > fix.  I'm not sure it does matter, since these functions are PIC 
> > > > > > > anyway.
> > > > > >
> > > > > > That might be a bug in the clz implementations - Christophe: Any 
> > > > > > thoughts?
> > > > >
> > > > > __clzsi2() has test coverage in 
> > > > > "gcc.c-torture/execute/builtin-bitops-1.c"
> > > > Thanks, I'll have a closer look at why I didn't see problems.
> > > >
> > >
> > > So, that's because the code goes to the .text section (as opposed to
> > > .text.noread)
> > > and does not have the PURECODE flag. The compiler takes care of this
> > > when generating code with -mpure-code.
> > > And the simulator does not complain because it only checks loads from
> > > the segment with the PURECODE flag set.
> > >
> > This is far out of my depth, but can something like:
> >
> > ifeq (,$(findstring __symbian__,$(shell $(gcc_compile_bare) -dM -E - 
> >  >
> > be adapted to:
> >
> > a) detect the state of the -mpure-code switch, and
> > b) pass that flag to the preprocessor?
> >
> > If so, I can probably fix both the target section and the data usage.
> > Just have to add a few instructions to finish unrolling the loop.
> 
> I must confess I never checked libgcc's Makefile deeply before,
> but it looks like you can probably detect whether -mpure-code is
> part of $CFLAGS.
> 
> However, it might be better to write pure-code-safe code
> unconditionally because the toolchain will probably not
> be rebuilt with -mpure-code as discussed before.
> Or that could mean adding a -mpure-code multilib

I have learned a few things since the last update.  I think I know how
to get -mpure-code out of CFLAGS and into a macro.  However, I have hit
something of a wall with testing.  I can't seem to compile any flavor of
libgcc with CFLAGS_FOR_TARGET="-mpure-code".

1.  Configuring --with-multilib-list=rmprofile results in build failure:

checking for suffix of object files... configure: error: in 
`/home/mirdan/gcc-obj/arm-none-eabi/libgcc':
configure: error: cannot compute suffix of object files: cannot compile
See `config.log' for more details

   cc1: error: -mpure-code only supports non-pic code on M-profile targets
   
2.  Attempting to filter the multib list results in configuration error.
This might have been misguided, but it was something I tried:

Error: --with-multilib-list=armv6s-m not supported.

Error: --with-multilib-list=mthumb/march=armv6s-m/mfloat-abi=soft not 
supported

3.  Attempting to configure a single architecture results in a build error.  

--with-mode=thumb --with-arch=armv6s-m --with-float=soft

checking for suffix of object files... configure: error: in 
`/home/mirdan/gcc-obj/arm-none-eabi/arm/autofp/v5te/fpu/libgcc':
configure: error: cannot compute suffix of object files: cannot compile
See `config.log' for more details

conftest.c:9:10: fatal error: ac_nonexistent.h: No such file or directory
9 | #include <ac_nonexistent.h>
  |          ^~~~~~~~~~~~~~~~~~

This has me wondering whether pure-code in libgcc is a real issue ... 
If there's a way to build libgcc with -mpure-code, please enlighten me.

> >
> > > > > The 'clzs' and 'ctz' functions should never have problems.   
> > > > > -mpure-code
> > > > > appears to be valid only when the 'movt' instruction is available, 
> > > > > which
> > > > > means that the 'clz' instruction will also be available, so no array 
> > > > > loads.
> > > > No, -mpure-code is also supported with v6m.
> > > >
> > > > > Is the -mpure-code state detectable as a preprocessor flag?  While
> > > > No.
> > > >
> > > > > 'movw'/'movt' appears to be the canonical solution, I'm not sure it
> > > > > should be the default just because a processor supports Thumb-2.
> > > > >
> > > > > Do users wanting to use -mpure-code recompile the toolchain to avoid
> > > > > constant data in compiled C functions?  I don't think this is the
> > > > > default for the typical toolchain s

[committed][OG10] Fix offload dwarf info

2021-01-15 Thread Andrew Stubbs
This patch corrects a problem in which GDB ignores the debug info for 
offload kernel entry functions because they're represented as nested 
functions inside a function that does not exist on the accelerator 
device (only on the host).


The fix is to add a notional code range to the non-existent parent 
function. Setting it the same as the inner function is good enough 
because GDB selects the innermost.


I'll submit this to mainline when stage 1 opens. Committed to 
devel/omp/gcc-10 for now.


Andrew
Fix offload dwarf info

Add a notional code range to the notional parent function of offload kernel
functions.  This is enough to prevent GDB discarding them.

gcc/ChangeLog:

	* dwarf2out.c (gen_subprogram_die): Add high/low_pc attributes for
	parents of offload kernels.

diff --git a/gcc/dwarf2out.c b/gcc/dwarf2out.c
index 4d84a9e9607..a4a1b934dc7 100644
--- a/gcc/dwarf2out.c
+++ b/gcc/dwarf2out.c
@@ -23079,6 +23079,20 @@ gen_subprogram_die (tree decl, dw_die_ref context_die)
 	  /* We have already generated the labels.  */
  add_AT_low_high_pc (subr_die, fde->dw_fde_begin,
  fde->dw_fde_end, false);
+
+	 /* Offload kernel functions are nested within a parent function
+	that doesn't actually exist within the offload object.  GDB
+		will ignore the function and everything nested within unless
+		we give it a notional code range (the values aren't
+		important, as long as they are valid).  */
+	 if (flag_generate_offload
+		 && lookup_attribute ("omp target entrypoint",
+  DECL_ATTRIBUTES (decl))
+		 && subr_die->die_parent
+		 && subr_die->die_parent->die_tag == DW_TAG_subprogram
+		 && !get_AT_low_pc (subr_die->die_parent))
+	   add_AT_low_high_pc (subr_die->die_parent, fde->dw_fde_begin,
+   fde->dw_fde_end, false);
 	}
 	  else
 	{


[committed][OG10] amdgcn: Fix DWARF variables with alloca

2021-01-15 Thread Andrew Stubbs
This patch fixes DWARF frame calculations for functions that use alloca 
on AMD GCN.


Like many other platforms, it achieves this by switching to 
frame-pointer mode for this function.


The frame pointer is necessary for debugability only, so if the user 
specifies -fomit-frame-pointer then this is honoured.


Committed to devel/omp/gcc-10. The prerequisite CFI patches don't exist 
on mainline yet.


Andrew
amdgcn: Fix DWARF variables with alloca

Require a frame pointer for entry functions that use alloca because it isn't
possible to encode the DWARF frame otherwise.  Adjust the CFA definition
expressions accordingly.

gcc/ChangeLog:

	* config/gcn/gcn.c (gcn_expand_prologue): Use the frame pointer for
	the DWARF CFA, if it exists.
	(gcn_frame_pointer_rqd): Require a frame pointer for entry functions
	that use alloca.

diff --git a/gcc/config/gcn/gcn.c b/gcc/config/gcn/gcn.c
index ea88b5e9124..1cb92714f24 100644
--- a/gcc/config/gcn/gcn.c
+++ b/gcc/config/gcn/gcn.c
@@ -3009,6 +3009,16 @@ gcn_expand_prologue ()
 		gen_rtx_SET (sp, gen_rtx_PLUS (DImode, sp,
 		   dbg_adjustment)));
 
+  if (offsets->need_frame_pointer)
+	{
+	  /* Set the CFA to the entry stack address, as an offset from the
+	 frame pointer.  This is necessary when alloca is used, and
+	 harmless otherwise.  */
+	  rtx neg_adjust = gen_int_mode (-offsets->callee_saves, DImode);
+	  add_reg_note (insn, REG_CFA_DEF_CFA,
+			gen_rtx_PLUS (DImode, fp, neg_adjust));
+	}
+
   /* Make sure the flat scratch reg doesn't get optimised away.  */
   emit_insn (gen_prologue_use (gen_rtx_REG (DImode, FLAT_SCRATCH_REG)));
 }
@@ -3120,10 +3130,13 @@ bool
 gcn_frame_pointer_rqd (void)
 {
   /* GDB needs the frame pointer in order to unwind properly,
- but that's not important for the entry point.
- We should also repect the -fomit-frame-pointer flag.  */
-  return (cfun && cfun->machine && cfun->machine->normal_function
-	  && !flag_omit_frame_pointer);
+ but that's not important for the entry point, unless alloca is used.
+ It's not important for code execution, so we should repect the
+ -fomit-frame-pointer flag.  */
+  return (!flag_omit_frame_pointer
+	  && cfun
+	  && (cfun->calls_alloca
+	  || (cfun->machine && cfun->machine->normal_function)));
 }
 
 /* Implement TARGET_CAN_ELIMINATE.


[PATCH] libatomic, libgomp, libitc: Fix bootstrap [PR70454]

2021-01-15 Thread Jakub Jelinek via Gcc-patches
On Thu, Jan 14, 2021 at 04:08:20PM -0800, H.J. Lu wrote:
> Here is the updated patch.  OK for master?

Here is my version of the entire patch.

Bootstrapped/regtested on x86_64-linux and i686-linux and additionally
tested with i686-linux --with-arch=i386 and x86_64-linux --with-arch_32=i386
(non-bootstrap) builds to verify -march=i486 additions in that case.

Ok for trunk?

2021-01-15  Jakub Jelinek  

PR target/70454
* configure.tgt: For i?86 and x86_64 determine if -march=i486 needs to
be added through preprocessor check on
__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4.  Determine if try_ifunc is needed
based on preprocessor check on __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16
or __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8.

* configure.tgt: For i?86 and x86_64 determine if -march=i486 needs to
be added through preprocessor check on
__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4.

* configure.tgt: For i?86 and x86_64 determine if -march=i486 needs to
be added through preprocessor check on
__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4.

--- libatomic/configure.tgt.jj  2021-01-15 11:08:13.659545929 +0100
+++ libatomic/configure.tgt 2021-01-15 11:21:09.071740967 +0100
@@ -81,32 +81,40 @@ case "${target_cpu}" in
ARCH=sparc
;;
 
-  i[3456]86)
-   case " ${CC} ${CFLAGS} " in
- *" -m64 "*|*" -mx32 "*)
-   ;;
- *)
-   if test -z "$with_arch"; then
- XCFLAGS="${XCFLAGS} -march=i486 -mtune=${target_cpu}"
- XCFLAGS="${XCFLAGS} -fomit-frame-pointer"
-   fi
-   esac
-   ARCH=x86
-   # ??? Detect when -march=i686 is already enabled.
-   try_ifunc=yes
-   ;;
-  x86_64)
-   case " ${CC} ${CFLAGS} " in
- *" -m32 "*)
+  i[3456]86 | x86_64)
+   cat > conftestx.c < /dev/null 2>&1; then
+ :
+   else
+ if test "${target_cpu}" = x86_64; then
XCFLAGS="${XCFLAGS} -march=i486 -mtune=generic"
-   XCFLAGS="${XCFLAGS} -fomit-frame-pointer"
-   ;;
- *)
-   ;;
-   esac
+ else
+   XCFLAGS="${XCFLAGS} -march=i486 -mtune=${target_cpu}"
+ fi
+ XCFLAGS="${XCFLAGS} -fomit-frame-pointer"
+   fi
+   cat > conftestx.c < /dev/null 2>&1; then
+ try_ifunc=no
+   else
+ try_ifunc=yes
+   fi
+   rm -f conftestx.c
ARCH=x86
-   # ??? Detect when -mcx16 is already enabled.
-   try_ifunc=yes
;;
 
   *)   ARCH="${target_cpu}" ;;
--- libgomp/configure.tgt.jj2021-01-15 11:08:13.659545929 +0100
+++ libgomp/configure.tgt   2021-01-15 11:20:54.809902917 +0100
@@ -73,28 +73,23 @@ if test x$enable_linux_futex = xyes; the
;;
 
 # Note that bare i386 is not included here.  We need cmpxchg.
-i[456]86-*-linux*)
+i[456]86-*-linux* | x86_64-*-linux*)
config_path="linux/x86 linux posix"
-   case " ${CC} ${CFLAGS} " in
- *" -m64 "*|*" -mx32 "*)
-   ;;
- *)
-   if test -z "$with_arch"; then
- XCFLAGS="${XCFLAGS} -march=i486 -mtune=${target_cpu}"
-   fi
-   esac
-   ;;
-
-# Similar jiggery-pokery for x86_64 multilibs, except here we
-# can't rely on the --with-arch configure option, since that
-# applies to the 64-bit side.
-x86_64-*-linux*)
-   config_path="linux/x86 linux posix"
-   case " ${CC} ${CFLAGS} " in
- *" -m32 "*)
+   cat > conftestx.c < /dev/null 2>&1; then
+ :
+   else
+ if test "${target_cpu}" = x86_64; then
XCFLAGS="${XCFLAGS} -march=i486 -mtune=generic"
-   ;;
-   esac
+ else
+   XCFLAGS="${XCFLAGS} -march=i486 -mtune=${target_cpu}"
+ fi
+   fi
+   rm -f conftestx.c
;;
 
 # Note that sparcv7 and sparcv8 is not included here.  We need cas.
--- libitm/configure.tgt.jj 2021-01-15 11:08:13.659545929 +0100
+++ libitm/configure.tgt2021-01-15 11:21:28.611519095 +0100
@@ -59,16 +59,23 @@ case "${target_cpu}" in
 
   arm*)ARCH=arm ;;
 
-  i[3456]86)
-   case " ${CC} ${CFLAGS} " in
- *" -m64 "*|*" -mx32 "*)
-   ;;
- *)
-   if test -z "$with_arch"; then
- XCFLAGS="${XCFLAGS} -march=i486 -mtune=${target_cpu}"
- XCFLAGS="${XCFLAGS} -fomit-frame-pointer"
-   fi
-   esac
+  i[3456]86 | x86_64)
+   cat > conftestx.c <<EOF
+#ifndef __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4
+#error need -march=i486
+#endif
+EOF
+   if ${CC} ${CFLAGS} -E conftestx.c > /dev/null 2>&1; then
+ :
+   else
+ if test "${target_cpu}" = x86_64; then
+   XCFLAGS="${XCFLAGS} -march=i486 -mtune=generic"
+ else
+   XCFLAGS="${XCFLAGS} -march=i486 -mtune=${target_cpu}"
+ fi
+ XCFLAGS="${XCFLAGS} -fomit-frame-pointer"
+   fi
+   rm -f conftestx.c
XCFLAGS="${XCFLAGS} -mrtm"
ARCH=x86
;;
@@ -103,16 +110,6 @@ case "${target_cpu}" in
ARCH=sparc
;;
 
-  x86_64)
-   cas

Re: [EXTERNAL] Re: [PATCH][tree-optimization]Optimize combination of comparisons to dec+compare

2021-01-15 Thread Richard Biener via Gcc-patches
On Thu, Jan 14, 2021 at 10:04 PM Eugene Rozenfeld
 wrote:
>
> I got more feedback for the patch from Gabriel Ravier and Jakub Jelinek in 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96674 and re-worked it 
> accordingly.
>
> The changes from the previous patch are:
> 1. Switched the tests to use __attribute__((noipa)) instead of 
> __attribute__((noinline)).
> 2. Fixed a typo in the pattern comment.
> 3. Added :c for top-level bit_ior expression.
> 4. Added :s for the subexpressions.
> 5. Added a pattern for the negated expression:
> x >= y && y != XXX_MIN --> x > y - 1
> and the corresponding tests.
>
> The new patch is attached.

OK.

Thanks,
Richard.

> Eugene
>
> -Original Message-
> From: Richard Biener 
> Sent: Tuesday, January 5, 2021 4:21 AM
> To: Eugene Rozenfeld 
> Cc: gcc-patches@gcc.gnu.org
> Subject: [EXTERNAL] Re: [PATCH][tree-optimization]Optimize combination of 
> comparisons to dec+compare
>
> On Mon, Jan 4, 2021 at 9:50 PM Eugene Rozenfeld 
>  wrote:
> >
> > Ping.
> >
> > -Original Message-
> > From: Eugene Rozenfeld
> > Sent: Tuesday, December 22, 2020 3:01 PM
> > To: Richard Biener ;
> > gcc-patches@gcc.gnu.org
> > Subject: RE: Optimize combination of comparisons to dec+compare
> >
> > Re-sending my question and re-attaching the patch.
> >
> > Richard, can you please clarify your feedback?
>
> Hmm, OK.
>
> The patch is OK.
>
> Thanks,
> Richard.
>
>
> > Thanks,
> >
> > Eugene
> >
> > -Original Message-
> > From: Gcc-patches  On Behalf Of
> > Eugene Rozenfeld via Gcc-patches
> > Sent: Tuesday, December 15, 2020 2:06 PM
> > To: Richard Biener 
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: [EXTERNAL] Re: Optimize combination of comparisons to
> > dec+compare
> >
> > Richard,
> >
> > > Do we already handle x < y || x <= CST to x <= y - CST?
> >
> > That is an invalid transformation: e.g., consider x=3, y=4, CST=2.
> > Can you please clarify?
> >
> > Thanks,
> >
> > Eugene
> >
> > -Original Message-
> > From: Richard Biener 
> > Sent: Thursday, December 10, 2020 12:21 AM
> > To: Eugene Rozenfeld 
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: Optimize combination of comparisons to dec+compare
> >
> > On Thu, Dec 10, 2020 at 1:52 AM Eugene Rozenfeld via Gcc-patches 
> >  wrote:
> > >
> > > This patch adds a pattern for optimizing x < y || y == XXX_MIN to
> > > x <= y-1 if y is an integer with TYPE_OVERFLOW_WRAPS.
> >
> > Do we already handle x < y || x <= CST to x <= y - CST?
> > That is, the XXX_MIN case is just a special-case of generic anti-range 
> > testing?  For anti-range testing with signed types we pun to unsigned when 
> > possible.
> >
> > > This fixes pr96674.
> > >
> > > Tested on x86_64-pc-linux-gnu.
> > >
> > > For this function
> > >
> > > bool f(unsigned a, unsigned b)
> > > {
> > > return (b == 0) | (a < b);
> > > }
> > >
> > > the code without the patch is
> > >
> > > test   esi,esi
> > > sete   al
> > > cmp    esi,edi
> > > seta   dl
> > > or eax,edx
> > > ret
> > >
> > > the code with the patch is
> > >
> > > sub    esi,0x1
> > > cmp    esi,edi
> > > setae  al
> > > ret
> > >
> > > Eugene
> > >
> > > gcc/
> > > PR tree-optimization/96674
> > > * match.pd: New pattern x < y || y == XXX_MIN --> x <= y - 1
> > >
> > > gcc/testsuite
> > > * gcc.dg/pr96674.c: New test.
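As a side note on the quoted pattern: its soundness for wrapping unsigned types can be checked exhaustively on a small type, using the `(b == 0) | (a < b)` example above.  A sketch only — `check_dec_compare` is an illustrative name, not part of the patch:

```c
#include <stdint.h>

/* Exhaustive check of the rewrite on uint8_t:
   (b == 0) | (a < b)  is equivalent to  a <= (uint8_t)(b - 1),
   relying on the decrement wrapping to 255 when b == 0.
   Returns 1 iff the equivalence holds for all 256*256 pairs.  */
int
check_dec_compare (void)
{
  for (unsigned a = 0; a < 256; a++)
    for (unsigned b = 0; b < 256; b++)
      {
        int lhs = (b == 0) | (a < b);
        int rhs = (uint8_t) a <= (uint8_t) (b - 1);
        if (lhs != rhs)
          return 0;
      }
  return 1;
}
```

The decrement wraps to the type's maximum when b == 0, which is exactly why the pattern is guarded by TYPE_OVERFLOW_WRAPS.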
> > >


Re: [PATCH] vect: Use factored nloads for load cost modeling [PR82255]

2021-01-15 Thread Richard Biener via Gcc-patches
On Fri, Jan 15, 2021 at 9:11 AM Kewen.Lin  wrote:
>
> Hi,
>
> This patch follows Richard's suggestion in the thread discussion[1],
> it's to factor out the nloads computation in vectorizable_load for
> strided access, to ensure we can obtain the consistent information
> when estimating the costs.
>
> btw, the reason why I didn't try to save the information into
> stmt_info during analysis phase and then fetch it in transform phase
> is that the information is only for strided SLP loading, and
> re-computing it looks inexpensive and acceptable.
>
> Bootstrapped/regtested on powerpc64le-linux-gnu P9,
> x86_64-redhat-linux and aarch64-linux-gnu.
>
> Is it ok for trunk?  Or it belongs to next stage 1?

First of all I think this is stage1 material now.  As we now do
SLP costing from vectorizable_* as well I would prefer to have
vectorizable_* be structured so that costing is done next to
the transform.  Thus rather than finish the vectorizable_*
function when !vec_stmt go along further but in places where
code is generated depending on !vec_stmt perform costing.
This makes it easier to keep costing and transform in sync
and match up.  It might not be necessary for the simple
vectorizable_ functions but for loads and stores there are
so many paths through code generation that matching it up
with vect_model_{load/store}_cost is almost impossible.

Richard.

> BR,
> Kewen
>
> [1] https://gcc.gnu.org/legacy-ml/gcc-patches/2017-09/msg01433.html
>
> gcc/ChangeLog:
>
> PR tree-optimization/82255
> * tree-vect-stmts.c (vector_vector_composition_type): Adjust function
> location.
> (struct strided_load_info): New structure.
> (vect_get_strided_load_info): New function factored out from...
> (vectorizable_load): ...this.  Call function
> vect_get_strided_load_info accordingly.
> (vect_model_load_cost): Call function vect_get_strided_load_info.
>
> gcc/testsuite/ChangeLog:
>
> 2021-01-15  Bill Schmidt  
> Kewen Lin  
>
> PR tree-optimization/82255
> * gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c: New test.
>


Re: [PATCH] libatomic, libgomp, libitm: Fix bootstrap [PR70454]

2021-01-15 Thread Richard Biener
On Fri, 15 Jan 2021, Jakub Jelinek wrote:

> On Thu, Jan 14, 2021 at 04:08:20PM -0800, H.J. Lu wrote:
> > Here is the updated patch.  OK for master?
> 
> Here is my version of the entire patch.
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux and additionally
> tested with i686-linux --with-arch=i386 and x86_64-linux --with-arch_32=i386
> (non-bootstrap) builds to verify -march=i486 additions in that case.
> 
> Ok for trunk?

OK.

Thanks,
Richard.

> 2021-01-15  Jakub Jelinek  
> 
>   PR target/70454
> libatomic/
>   * configure.tgt: For i?86 and x86_64 determine if -march=i486 needs to
>   be added through preprocessor check on
>   __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4.  Determine if try_ifunc is needed
>   based on preprocessor check on __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16
>   or __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8.
> 
> libgomp/
>   * configure.tgt: For i?86 and x86_64 determine if -march=i486 needs to
>   be added through preprocessor check on
>   __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4.
> 
> libitm/
>   * configure.tgt: For i?86 and x86_64 determine if -march=i486 needs to
>   be added through preprocessor check on
>   __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4.
> 
> --- libatomic/configure.tgt.jj    2021-01-15 11:08:13.659545929 +0100
> +++ libatomic/configure.tgt   2021-01-15 11:21:09.071740967 +0100
> @@ -81,32 +81,40 @@ case "${target_cpu}" in
>   ARCH=sparc
>   ;;
>  
> -  i[3456]86)
> - case " ${CC} ${CFLAGS} " in
> -   *" -m64 "*|*" -mx32 "*)
> - ;;
> -   *)
> - if test -z "$with_arch"; then
> -   XCFLAGS="${XCFLAGS} -march=i486 -mtune=${target_cpu}"
> -   XCFLAGS="${XCFLAGS} -fomit-frame-pointer"
> - fi
> - esac
> - ARCH=x86
> - # ??? Detect when -march=i686 is already enabled.
> - try_ifunc=yes
> - ;;
> -  x86_64)
> - case " ${CC} ${CFLAGS} " in
> -   *" -m32 "*)
> +  i[3456]86 | x86_64)
> + cat > conftestx.c <<EOF
> +#ifndef __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4
> +#error need -march=i486
> +#endif
> +EOF
> + if ${CC} ${CFLAGS} -E conftestx.c > /dev/null 2>&1; then
> +   :
> + else
> +   if test "${target_cpu}" = x86_64; then
>   XCFLAGS="${XCFLAGS} -march=i486 -mtune=generic"
> - XCFLAGS="${XCFLAGS} -fomit-frame-pointer"
> - ;;
> -   *)
> - ;;
> - esac
> +   else
> + XCFLAGS="${XCFLAGS} -march=i486 -mtune=${target_cpu}"
> +   fi
> +   XCFLAGS="${XCFLAGS} -fomit-frame-pointer"
> + fi
> + cat > conftestx.c <<EOF
> +#ifdef __x86_64__
> +#ifndef __GCC_HAVE_SYNC_COMPARE_AND_SWAP_16
> +#error need -mcx16
> +#endif
> +#else
> +#ifndef __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8
> +#error need -march=i686
> +#endif
> +#endif
> +EOF
> + if ${CC} ${CFLAGS} -E conftestx.c > /dev/null 2>&1; then
> +   try_ifunc=no
> + else
> +   try_ifunc=yes
> + fi
> + rm -f conftestx.c
>   ARCH=x86
> - # ??? Detect when -mcx16 is already enabled.
> - try_ifunc=yes
>   ;;
>  
>*) ARCH="${target_cpu}" ;;
> --- libgomp/configure.tgt.jj  2021-01-15 11:08:13.659545929 +0100
> +++ libgomp/configure.tgt 2021-01-15 11:20:54.809902917 +0100
> @@ -73,28 +73,23 @@ if test x$enable_linux_futex = xyes; the
>   ;;
>  
>  # Note that bare i386 is not included here.  We need cmpxchg.
> -i[456]86-*-linux*)
> +i[456]86-*-linux* | x86_64-*-linux*)
>   config_path="linux/x86 linux posix"
> - case " ${CC} ${CFLAGS} " in
> -   *" -m64 "*|*" -mx32 "*)
> - ;;
> -   *)
> - if test -z "$with_arch"; then
> -   XCFLAGS="${XCFLAGS} -march=i486 -mtune=${target_cpu}"
> - fi
> - esac
> - ;;
> -
> -# Similar jiggery-pokery for x86_64 multilibs, except here we
> -# can't rely on the --with-arch configure option, since that
> -# applies to the 64-bit side.
> -x86_64-*-linux*)
> - config_path="linux/x86 linux posix"
> - case " ${CC} ${CFLAGS} " in
> -   *" -m32 "*)
> + cat > conftestx.c <<EOF
> +#ifndef __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4
> +#error need -march=i486
> +#endif
> +EOF
> + if ${CC} ${CFLAGS} -E conftestx.c > /dev/null 2>&1; then
> +   :
> + else
> +   if test "${target_cpu}" = x86_64; then
>   XCFLAGS="${XCFLAGS} -march=i486 -mtune=generic"
> - ;;
> - esac
> +   else
> + XCFLAGS="${XCFLAGS} -march=i486 -mtune=${target_cpu}"
> +   fi
> + fi
> + rm -f conftestx.c
>   ;;
>  
>  # Note that sparcv7 and sparcv8 is not included here.  We need cas.
> --- libitm/configure.tgt.jj   2021-01-15 11:08:13.659545929 +0100
> +++ libitm/configure.tgt  2021-01-15 11:21:28.611519095 +0100
> @@ -59,16 +59,23 @@ case "${target_cpu}" in
>  
>arm*)  ARCH=arm ;;
>  
> -  i[3456]86)
> - case " ${CC} ${CFLAGS} " in
> -   *" -m64 "*|*" -mx32 "*)
> - ;;
> -   *)
> - if test -z "$with_arch"; then
> -   XCFLAGS="${XCFLAGS} -march
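The configure.tgt hunks above all use the same trick: instead of pattern-matching `${CC} ${CFLAGS}` for `-m32`/`-m64`, they preprocess a tiny probe file and let a compiler-predefined macro decide.  A minimal standalone sketch of that probe — the `need_i486` variable and the final `echo` are illustrative additions, not part of the patch:

```shell
# Probe technique from the patch above: ask the compiler whether the
# 4-byte compare-and-swap builtin is available by default; if the
# preprocessor hits the #error, -march=i486 must be added.
CC=${CC:-gcc}
cat > conftestx.c <<EOF
#ifndef __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4
#error need -march=i486
#endif
EOF
if ${CC} ${CFLAGS} -E conftestx.c > /dev/null 2>&1; then
  need_i486=no
else
  need_i486=yes
fi
rm -f conftestx.c
echo "need_i486=${need_i486}"
```

Asking the preprocessor also catches defaults the old CFLAGS grepping missed, such as a baseline set via --with-arch at configure time.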

[committed][OG10] amdgcn: DWARF address spaces

2021-01-15 Thread Andrew Stubbs
This patch implements DWARF address spaces for pointers to LDS, etc., on 
AMD GCN.


The address space mappings are defined by AMD in their DWARF proposals, 
and the LLVM implementation.


I don't believe ROCGDB actually supports this feature yet, but it will do so soonish.


Committed to devel/omp/gcc-10. Queued for mainline.

Andrew
amdgcn: DWARF address spaces

Map GCN address spaces to the proposed DWARF address spaces defined by AMD.

gcc/ChangeLog:

	* config/gcn/gcn.c: Include dwarf2.h.
	(gcn_addr_space_debug): New function.
	(TARGET_ADDR_SPACE_DEBUG): New hook.

diff --git a/gcc/config/gcn/gcn.c b/gcc/config/gcn/gcn.c
index 1cb92714f24..f0e4636c06a 100644
--- a/gcc/config/gcn/gcn.c
+++ b/gcc/config/gcn/gcn.c
@@ -51,6 +51,7 @@
 #include "intl.h"
 #include "rtl-iter.h"
 #include "gimple.h"
+#include "dwarf2.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -1502,6 +1503,31 @@ gcn_addr_space_convert (rtx op, tree from_type, tree to_type)
 gcc_unreachable ();
 }
 
+/* Implement TARGET_ADDR_SPACE_DEBUG.
+
+   Return the dwarf address space class for each hardware address space.  */
+
+static int
+gcn_addr_space_debug (addr_space_t as)
+{
+  switch (as)
+{
+  case ADDR_SPACE_DEFAULT:
+  case ADDR_SPACE_FLAT:
+  case ADDR_SPACE_GLOBAL:
+  case ADDR_SPACE_SCALAR_FLAT:
+  case ADDR_SPACE_FLAT_SCRATCH:
+	return DW_ADDR_none;
+  case ADDR_SPACE_LDS:
+	return 3;  // DW_ADDR_LLVM_group
+  case ADDR_SPACE_SCRATCH:
+	return 4;  // DW_ADDR_LLVM_private
+  case ADDR_SPACE_GDS:
+	return 0x8000; // DW_ADDR_AMDGPU_region
+}
+  gcc_unreachable ();
+}
+
 
 /* Implement REGNO_MODE_CODE_OK_FOR_BASE_P via gcn.h

@@ -6366,6 +6392,8 @@ gcn_dwarf_register_span (rtx rtl)
 
 #undef  TARGET_ADDR_SPACE_ADDRESS_MODE
 #define TARGET_ADDR_SPACE_ADDRESS_MODE gcn_addr_space_address_mode
+#undef  TARGET_ADDR_SPACE_DEBUG
+#define TARGET_ADDR_SPACE_DEBUG gcn_addr_space_debug
 #undef  TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P
 #define TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P \
   gcn_addr_space_legitimate_address_p


[committed][OG10] DWARF address space for variables

2021-01-15 Thread Andrew Stubbs
This patch adds DWARF support for "local" variables that are actually 
located in a different address space.


This situation occurs for variables shared between all the worker 
threads of an OpenACC gang. On AMD GCN the variables are allocated to 
the low-latency LDS memory associated with each physical compute unit.


The patch depends on my previous patch "amdgcn: DWARF address spaces" to 
actually do anything useful, as it uses the hook defined there. The 
patch has no effect on ports that do not define that hook.


Committed to devel/omp/gcc-10. This will be submitted to mainline in 
stage 1.


Andrew
DWARF address space for variables

Add DWARF address class attributes for variables that exist outside the
generic address space.  In particular, this is the case for gang-private
variables in OpenACC offload kernels.

gcc/ChangeLog:

	* dwarf2out.c (add_location_or_const_value_attribute): Set
	DW_AT_address_class, if appropriate.

diff --git a/gcc/dwarf2out.c b/gcc/dwarf2out.c
index a4a1b934dc7..dedfeaf865f 100644
--- a/gcc/dwarf2out.c
+++ b/gcc/dwarf2out.c
@@ -20294,6 +20294,15 @@ add_location_or_const_value_attribute (dw_die_ref die, tree decl, bool cache_p)
   if (list)
 {
   add_AT_location_description (die, DW_AT_location, list);
+
+  addr_space_t as = TYPE_ADDR_SPACE (TREE_TYPE (decl));
+  if (!ADDR_SPACE_GENERIC_P (as))
+	{
+	  int action = targetm.addr_space.debug (as);
+	  /* Positive values indicate an address_class.  */
+	  if (action >= 0)
+	add_AT_unsigned (die, DW_AT_address_class, action);
+	}
   return true;
 }
   /* None of that worked, so it must not really have a location;


Re: [PATCH] Add pytest for a GCOV test-case

2021-01-15 Thread Rainer Orth
Hi Martin,

 * If we now have an (even optional) dependency on python/pytest, this
 (with the exact versions and use) needs to be documented in
 install.texi.
>>>
>>> Done that.
>> +be installed. Some optional tests also require Python3 and pytest
>> module.
>> It would be better to be more specific here.  Or would Python 3.0 and
>> pytest 2.0.0 do ;-)
>
> I would leave it as it is. Python3 is a well established term. About pytest:

... but unfortunately not exactly precise beyond distinguishing between
Python 2.x and 3.x.  I'd specificially asked about 3.0 because IIRC the
early 3.x versions were not only incompatible with 2.7, but even among
themselves.  Don't remember when this finally stabilized.

> I don't know how to investigate a minimal version right now.

It needn't necessarily be a minimal version, it could be one known to
work (and worded appropriately to indicate that's what it is, not a
minimum).  E.g. from what's bundled with Solaris 11.3, I know that
pytest 2.6.4 does work (apart from the python3 vs. python3.4 issue).

 * On to the implementation: your test for the presence of pytest is
 wrong:
   set result [remote_exec host "pytest -m pytest --version"]
 has nothing to do with what you actually use later: on all of Fedora
 29, Ubuntu 20.04, and Solaris 11.4 (with a caveat) pytest is Python
 2.7 based, but you don't check that.  It is well possible that pytest
 for 2.7 is installed, but pytest for Python 3.x isn't.
 Besides, while Solaris 11.4 does bundle pytest, they don't deliver
 pytest, but only py.test due to a conflict with a different pytest from
 logilab-common, cf. https://github.com/pytest-dev/pytest/issues/1833.
 This is immaterial, however, since what you actually run is
   spawn -noecho python3 -m pytest --color=no -rA -s --tb=no
 $srcdir/$subdir/$pytest_script
 So you should just run python3 -m pytest --version instead to check
 for the presence of the version you're going to use.
 Btw., there's a mess with pytest on Fedora 29: running the above gives
>>>
>>> I must confirm this is mess. I definitely don't want to support Python2 and
>>> I think
>>> the best way would be to use 'env python3', hope it's portable enough.
>>> @David: What do you think?
>> As I mentioned, it's not: Solaris 11.3 has no python3, only (for the 3.x
>> series) python3.4.
>> However, I don't understand what you expect to gain from running
>> $ env python3
>> rather than just
>> $ python3
>> (or a suitable Python 3.x version by any name)?
>
> All right, let's replace it just with 'python3'.
>
>> I just had a quick look and the autoconf-archive has AX_PYTHON which
>> claims to do that:
>>  https://www.gnu.org/software/autoconf-archive/ax_python.html
>> Unfortunately, it doesn't know about Python 3.8+ yet.

I think we can do better than hardcoding python3 here.  If it were only
for a quirk of an older Solaris version (11.3) which misses the python3
name, together with one (or soon two) optional testcases, I wouldn't
care much.  When looking at a VM with FreeBSD 12.2 (the current
release), it took quite some searching to find how to get python2 and
python3 commands installed, and I suspect those are not they only ones
where it's either difficult or missing completely.  Given that David
also plans to use python3, we could use another approach:

Rather than relying on an obscure/unmaintained macro from the
autoconf-archive, automake includes AM_PATH_PYTHON which allows to
specify a minimum version and knows about python2, python3, and
python[23].x names.  While the version of our current prerequisite,
Automake 1.15.1, has a hardcoded list going up to python3.5 only, the
git version has removed that limitation.

What we should do to get this right is:

* import that git automake m4/python.m4 into our toplevel config

* call AM_PATH_PYTHON appropriately from gcc/configure.ac

* record the resulting python path in site.exp

* use it from there e.g. in gcov.exp

This way we're future-proof for all possible uses of python3, rather
than fiddling with the issue over and over again.
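The configure.ac side of the plan above could look roughly like this — a sketch only: the minimum version is a placeholder, and AM_PATH_PYTHON here means the git automake version of the macro, which knows names such as python3 and python3.x:

```m4
dnl Sketch: locate a Python 3 interpreter under any of the names
dnl automake's python.m4 knows, without aborting configure when
dnl none is found (PYTHON is then ":").
AM_PATH_PYTHON([3.4],, [:])
AC_SUBST([PYTHON])
```

The Makefile rule that generates site.exp would then append something like `set PYTHON3 "$(PYTHON)"` (variable name illustrative), which gcov.exp can query instead of hardcoding `python3`.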

I'm sorry to burden you with this, but given that you're the first to
follow through with using python3 in the core of gcc rather than in some
contrib script, we should get things right now if reasonably possible.

The rest of the patch looks good now and passed my testing.

Rainer

-- 
-
Rainer Orth, Center for Biotechnology, Bielefeld University


Re: [PATCH v3] libgcc: Thumb-1 Floating-Point Library for Cortex M0

2021-01-15 Thread Christophe Lyon via Gcc-patches
On Fri, 15 Jan 2021 at 12:39, Daniel Engel  wrote:
>
> Hi Christophe,
>
> On Mon, Jan 11, 2021, at 8:39 AM, Christophe Lyon wrote:
> > On Mon, 11 Jan 2021 at 17:18, Daniel Engel  wrote:
> > >
> > > On Mon, Jan 11, 2021, at 8:07 AM, Christophe Lyon wrote:
> > > > On Sat, 9 Jan 2021 at 14:09, Christophe Lyon 
> > > >  wrote:
> > > > >
> > > > > On Sat, 9 Jan 2021 at 13:27, Daniel Engel  
> > > > > wrote:
> > > > > >
> > > > > > On Thu, Jan 7, 2021, at 4:56 AM, Richard Earnshaw wrote:
> > > > > > > On 07/01/2021 00:59, Daniel Engel wrote:
> > > > > > > > --snip--
> > > > > > > >
> > > > > > > > On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote:
> > > > > > > > --snip--
> > > > > > > >
> > > > > > > >> - finally, your popcount implementations have data in the code 
> > > > > > > >> segment.
> > > > > > > >>  That's going to cause problems when we have compilation 
> > > > > > > >> options such as
> > > > > > > >> -mpure-code.
> > > > > > > >
> > > > > > > > I am just following the precedent of existing lib1funcs (e.g. 
> > > > > > > > __clz2si).
> > > > > > > > If this matters, you'll need to point in the right direction 
> > > > > > > > for the
> > > > > > > > fix.  I'm not sure it does matter, since these functions are 
> > > > > > > > PIC anyway.
> > > > > > >
> > > > > > > That might be a bug in the clz implementations - Christophe: Any 
> > > > > > > thoughts?
> > > > > >
> > > > > > __clzsi2() has test coverage in 
> > > > > > "gcc.c-torture/execute/builtin-bitops-1.c"
> > > > > Thanks, I'll have a closer look at why I didn't see problems.
> > > > >
> > > >
> > > > So, that's because the code goes to the .text section (as opposed to
> > > > .text.noread)
> > > > and does not have the PURECODE flag. The compiler takes care of this
> > > > when generating code with -mpure-code.
> > > > And the simulator does not complain because it only checks loads from
> > > > the segment with the PURECODE flag set.
> > > >
> > > This is far out of my depth, but can something like:
> > >
> > > ifeq (,$(findstring __symbian__,$(shell $(gcc_compile_bare) -dM -E - </dev/null)))
> > > be adapted to:
> > >
> > > a) detect the state of the -mpure-code switch, and
> > > b) pass that flag to the preprocessor?
> > >
> > > If so, I can probably fix both the target section and the data usage.
> > > Just have to add a few instructions to finish unrolling the loop.
> >
> > I must confess I never checked libgcc's Makefile deeply before,
> > but it looks like you can probably detect whether -mpure-code is
> > part of $CFLAGS.
> >
> > However, it might be better to write pure-code-safe code
> > unconditionally because the toolchain will probably not
> > be rebuilt with -mpure-code as discussed before.
> > Or that could mean adding a -mpure-code multilib
>
> I have learned a few things since the last update.  I think I know how
> to get -mpure-code out of CFLAGS and into a macro.  However, I have hit
> something of a wall with testing.  I can't seem to compile any flavor of
> libgcc with CFLAGS_FOR_TARGET="-mpure-code".
>
> 1.  Configuring --with-multilib-list=rmprofile results in build failure:
>
> checking for suffix of object files... configure: error: in 
> `/home/mirdan/gcc-obj/arm-none-eabi/libgcc':
> configure: error: cannot compute suffix of object files: cannot compile
> See `config.log' for more details
>
>cc1: error: -mpure-code only supports non-pic code on M-profile targets
>

Yes, I did hit that wall too :-)

Hence what we discussed earlier: the toolchain is not rebuilt with -mpure-code.

Note that there are problems in newlib too, but users of -mpure-code seem
to be able to work around that (eg. using their own startup code and no stdlib)

> 2.  Attempting to filter the multib list results in configuration error.
> This might have been misguided, but it was something I tried:
>
> Error: --with-multilib-list=armv6s-m not supported.
>
> Error: --with-multilib-list=mthumb/march=armv6s-m/mfloat-abi=soft not 
> supported

I think only 2 values are supported: aprofile and rmprofile.

> 3.  Attempting to configure a single architecture results in a build error.
>
> --with-mode=thumb --with-arch=armv6s-m --with-float=soft
>
> checking for suffix of object files... configure: error: in 
> `/home/mirdan/gcc-obj/arm-none-eabi/arm/autofp/v5te/fpu/libgcc':
> configure: error: cannot compute suffix of object files: cannot compile
> See `config.log' for more details
>
> conftest.c:9:10: fatal error: ac_nonexistent.h: No such file or directory
> 9 | #include 
>   |  ^~
I never saw that error message, but I never build using --with-arch.
I do use --with-cpu though.

> This has me wondering whether pure-code in libgcc is a real issue ...
> If there's a way to build libgcc with -mpure-code, please enlighten me.
I haven't done so yet. Maybe building the toolchain --with-cpu=cortex-m0
works?

Thanks,

Christophe

> > > > > > The 'clzs' and 'ctz' fun

[PATCH] testsuite/96098 - remove redundant testcase

2021-01-15 Thread Richard Biener
The testcase morphed in a way that no longer tests what it was originally
supposed to, and slightly altering it shows the original issue isn't fixed
(anymore).
The limit as set as result of PR91403 (and dups) prevents the issue for larger
arrays but the testcase has

double a[128][128];

which results in a group size of "just" 512 (the limit is 4096).  Avoiding
the 'BB vectorization with gaps at the end of a load is not supported'
by altering it to do

void foo(void)
{
  b[0] = a[0][0];
  b[1] = a[1][0];
  b[2] = a[2][0];
  b[3] = a[3][127];
}

shows that costing has improved further to not account the dead loads making
the previous test inefficient.  In fact the underlying issue isn't fixed
(we do code-generate dead loads).

In fact the vector permute load is even profitable, just the excessive
code-generation issue exists (and is "fixed" by capping it a constant
boundary, just too high for this particular testcase).

The testcase now has "dups", so I'll simply remove it.

2021-01-15  Richard Biener  

PR testsuite/96098
* gcc.dg/vect/bb-slp-pr68892.c: Remove.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-pr68892.c | 20 
 1 file changed, 20 deletions(-)
 delete mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-pr68892.c

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr68892.c 
b/gcc/testsuite/gcc.dg/vect/bb-slp-pr68892.c
deleted file mode 100644
index e9909cf0dfa..00000000000
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-pr68892.c
+++ /dev/null
@@ -1,20 +0,0 @@
-/* { dg-do compile } */
-/* { dg-additional-options "-fvect-cost-model=dynamic" } */
-/* { dg-require-effective-target vect_double } */
-
-double a[128][128];
-double b[128];
-
-void foo(void)
-{
-  b[0] = a[0][0];
-  b[1] = a[1][0];
-  b[2] = a[2][0];
-  b[3] = a[3][0];
-}
-
-/* ???  Due to the gaps we fall back to scalar loads which makes the
-   vectorization profitable.  */
-/* { dg-final { scan-tree-dump "not profitable" "slp2" { xfail { ! 
aarch64*-*-* } } } } */
-/* { dg-final { scan-tree-dump "BB vectorization with gaps at the end of a 
load is not supported" "slp2" } } */
-/* { dg-final { scan-tree-dump-times "Basic block will be vectorized" 1 "slp2" 
{ xfail aarch64*-*-* } } } */
-- 
2.26.2


Re: [PATCH] libatomic, libgomp, libitm: Fix bootstrap [PR70454]

2021-01-15 Thread H.J. Lu via Gcc-patches
On Fri, Jan 15, 2021 at 4:07 AM Richard Biener  wrote:
>
> On Fri, 15 Jan 2021, Jakub Jelinek wrote:
>
> > On Thu, Jan 14, 2021 at 04:08:20PM -0800, H.J. Lu wrote:
> > > Here is the updated patch.  OK for master?
> >
> > Here is my version of the entire patch.
> >
> > Bootstrapped/regtested on x86_64-linux and i686-linux and additionally
> > tested with i686-linux --with-arch=i386 and x86_64-linux --with-arch_32=i386
> > (non-bootstrap) builds to verify -march=i486 additions in that case.
> >
> > Ok for trunk?
>
> OK.
>

Thanks.

-- 
H.J.


[PATCH] testsuite/96147 - remove scanning for ! vect_hw_misalign

2021-01-15 Thread Richard Biener
This removes scanning that's too difficult to get correct for all
targets, leaving the correctness test for them and keeping the
vectorization capability check to vect_hw_misalign targets.

Pushed.

2021-01-15  Richard Biener  

PR testsuite/96147
* gcc.dg/vect/slp-43.c: Remove ! vect_hw_misalign scan.
---
 gcc/testsuite/gcc.dg/vect/slp-43.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/slp-43.c 
b/gcc/testsuite/gcc.dg/vect/slp-43.c
index 0344cc98625..3cee613bdbe 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-43.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-43.c
@@ -78,4 +78,6 @@ int main()
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 13 "vect" { target 
vect_hw_misalign } } } */
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" { target { 
! vect_hw_misalign } } } } */
+/* For ! vect_hw_misalign it depends on vector size and actual alignment
+   requirements of the target which functions can be vectorized.  Avoid
+   that bean-counting and per-target listing here.  */
-- 
2.26.2


[PATCH] testsuite/96147 - key scanning on vect_hw_misalign

2021-01-15 Thread Richard Biener
gcc.dg/vect/slp-45.c failed to key the vectorization capability
scanning on vect_hw_misalign.  Since the stores are strided
they cannot be (all) analyzed to be aligned.

Pushed.

2021-01-15  Richard Biener  

PR testsuite/96147
* gcc.dg/vect/slp-45.c: Key scanning on
vect_hw_misalign.
---
 gcc/testsuite/gcc.dg/vect/slp-45.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/slp-45.c 
b/gcc/testsuite/gcc.dg/vect/slp-45.c
index 1e35d354203..fadc4e59243 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-45.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-45.c
@@ -77,4 +77,4 @@ int main()
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 13 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 13 "vect" { target 
vect_hw_misalign } } } */
-- 
2.26.2


[PATCH] testsuite/96147 - align vector access

2021-01-15 Thread Richard Biener
This aligns p so that the testcase is meaningful for targets
without a hw misaligned access.

Pushed.

2021-01-15  Richard Biener  

PR testsuite/96147
* gcc.dg/vect/bb-slp-32.c: Align p.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-32.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-32.c 
b/gcc/testsuite/gcc.dg/vect/bb-slp-32.c
index 020b6365e02..84cc4370f09 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-32.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-32.c
@@ -8,6 +8,7 @@ int foo (int *p, int a, int b)
   int x[4];
   int tem0, tem1, tem2, tem3;
   int sum = 0;
+  p = __builtin_assume_aligned (p, __BIGGEST_ALIGNMENT__);
   tem0 = p[0] + 1 + a;
   sum += tem0;
   x[0] = tem0;
-- 
2.26.2
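For reference, a minimal standalone sketch of the builtin used in the patch above.  `__builtin_assume_aligned` is a real GCC builtin; `sum4` is an illustrative function, not part of the testsuite:

```c
#include <stdint.h>

/* __builtin_assume_aligned returns its argument and promises to the
   optimizer that the pointer has (at least) the given alignment, so
   the vectorizer can use aligned accesses without a runtime check
   or loop peeling.  */
int
sum4 (int *p)
{
  int *q = (int *) __builtin_assume_aligned (p, 16);
  return q[0] + q[1] + q[2] + q[3];
}
```

In the patch, aligning `p` to `__BIGGEST_ALIGNMENT__` makes the test behave the same on targets with and without hardware misaligned access.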


[PATCH] testsuite/96147 - scan for vectorized load

2021-01-15 Thread Richard Biener
This changes gcc.dg/vect/bb-slp-9.c to scan for a vectorized load
instead of a vectorized BB which then correctly captures the
unaligned load we try to test and not some intermediate built
from scalar vector.

Pushed.

2021-01-15  Richard Biener  

PR testsuite/96147
* gcc.dg/vect/bb-slp-9.c: Scan for a vector load transform.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-9.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-9.c 
b/gcc/testsuite/gcc.dg/vect/bb-slp-9.c
index b4cc1017f7e..2a42411afe4 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-9.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-9.c
@@ -46,5 +46,5 @@ int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "optimized: basic block" 1 "slp2"  { 
xfail  { vect_no_align && { ! vect_hw_misalign } } } } } */
+/* { dg-final { scan-tree-dump-times "transform load" 1 "slp2"  { xfail  { 
vect_no_align && { ! vect_hw_misalign } } } } } */
   
-- 
2.26.2


Re: [PATCH] Add pytest for a GCOV test-case

2021-01-15 Thread Martin Liška

On 1/15/21 1:28 PM, Rainer Orth wrote:

Hi Martin,


* If we now have an (even optional) dependency on python/pytest, this
 (with the exact versions and use) needs to be documented in
 install.texi.


Done that.

+be installed. Some optional tests also require Python3 and pytest
module.
It would be better to be more specific here.  Or would Python 3.0 and
pytest 2.0.0 do ;-)


Hello.


I would leave it as it is. Python3 is a well established term. About pytest:


... but unfortunately not exactly precise beyond distinguishing between
Python 2.x and 3.x.  I'd specificially asked about 3.0 because IIRC the
early 3.x versions were not only incompatible with 2.7, but even among
themselves.  Don't remember when this finally stabilized.


I've got this, but as Python 3.0 was released more than 10 years ago, I'm 
leaving
Python3 for now. Feel free to propose a better wording.




I don't know how to investigate a minimal version right now.


It needn't necessarily be a minimal version, it could be one known to
work (and worded appropriately to indicate that's what it is, not a
minimum).  E.g. from what's bundled with Solaris 11.3, I know that
pytest 2.6.4 does work (apart from the python3 vs. python3.4 issue).


* On to the implementation: your test for the presence of pytest is
 wrong:
   set result [remote_exec host "pytest -m pytest --version"]
 has nothing to do with what you actually use later: on all of Fedora
 29, Ubuntu 20.04, and Solaris 11.4 (with a caveat) pytest is Python
 2.7 based, but you don't check that.  It is well possible that pytest
 for 2.7 is installed, but pytest for Python 3.x isn't.
 Besides, while Solaris 11.4 does bundle pytest, they don't deliver
 pytest, but only py.test due to a conflict with a different pytest from
 logilab-common, cf. https://github.com/pytest-dev/pytest/issues/1833.
 This is immaterial, however, since what you actually run is
   spawn -noecho python3 -m pytest --color=no -rA -s --tb=no $srcdir/$subdir/$pytest_script
 So you should just run python3 -m pytest --version instead to check
 for the presence of the version you're going to use.
 Btw., there's a mess with pytest on Fedora 29: running the above gives


I must confirm this is a mess. I definitely don't want to support Python2 and
I think
the best way would be to use 'env python3', hope it's portable enough.
@David: What do you think?

As I mentioned, it's not: Solaris 11.3 has no python3, only (for the 3.x
series) python3.4.
However, I don't understand what you expect to gain from running
$ env python3
rather than just
$ python3
(or a suitable Python 3.x version by any name)?


All right, let's replace it just with 'python3'.


I just had a quick look and the autoconf-archive has AX_PYTHON which
claims to do that:
https://www.gnu.org/software/autoconf-archive/ax_python.html
Unfortunately, it doesn't know about Python 3.8+ yet.


I think we can do better than hardcoding python3 here.  If it were only
for a quirk of an older Solaris version (11.3) which misses the python3
name, together with one (or soon two) optional testcases, I wouldn't
care much.  When looking at a VM with FreeBSD 12.2 (the current
release), it took quite some searching to find how to get python2 and
python3 commands installed, and I suspect those are not the only ones
where it's either difficult or missing completely.  Given that David
also plans to use python3, we could use another approach:


Where's a Python3 binary on a FreeBSD system?



Rather than relying on an obscure/unmaintained macro from the
autoconf-archive, automake includes AM_PATH_PYTHON, which allows specifying
a minimum version and knows about python2, python3, and
python[23].x names.  While the version of our current prerequisite,
Automake 1.15.1, has a hardcoded list going up to python3.5 only, the
git version has removed that limitation.

What we should do to get this right is:

* import that git automake m4/python.m4 into our toplevel config

* call AM_PATH_PYTHON appropriately from gcc/configure.ac

* record the resulting python path in site.exp

* use it from there e.g. in gcov.exp

This way we're future-proof for all possible uses of python3, rather
than fiddling with the issue over and over again.


Feel free to propose a patch please.



I'm sorry to burden you with this, but given that you're the first to
follow through with using python3 in the core of gcc rather than in some
contrib script, we should get things right now if reasonably possible.


Sure, but we're talking about optional tests.



The rest of the patch looks good now and passed my testing.


I'm going to install the patch.

Martin



Rainer





[PATCH] tree-optimization/96376 - do check alignment for invariant loads

2021-01-15 Thread Richard Biener
The testcases show that we fail to disregard alignment for invariant
loads.  The patch handles them like we handle gather and scatter.

Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.

2021-01-15  Richard Biener  

PR tree-optimization/96376
* tree-vect-stmts.c (get_load_store_type): Disregard alignment
for VMAT_INVARIANT.
---
 gcc/tree-vect-stmts.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 4d72c4db2f7..f180ced3124 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -2378,19 +2378,26 @@ get_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
   else
 {
   int cmp = compare_step_with_zero (vinfo, stmt_info);
-  if (cmp < 0)
-   *memory_access_type = get_negative_load_store_type
- (vinfo, stmt_info, vectype, vls_type, ncopies);
-  else if (cmp == 0)
+  if (cmp == 0)
{
  gcc_assert (vls_type == VLS_LOAD);
  *memory_access_type = VMAT_INVARIANT;
+ /* Invariant accesses perform only component accesses, alignment
+is irrelevant for them.  */
+ *alignment_support_scheme = dr_unaligned_supported;
}
   else
-   *memory_access_type = VMAT_CONTIGUOUS;
-  *alignment_support_scheme
-   = vect_supportable_dr_alignment (vinfo,
-STMT_VINFO_DR_INFO (stmt_info), false);
+   {
+ if (cmp < 0)
+   *memory_access_type = get_negative_load_store_type
+  (vinfo, stmt_info, vectype, vls_type, ncopies);
+ else
+   *memory_access_type = VMAT_CONTIGUOUS;
+ *alignment_support_scheme
+   = vect_supportable_dr_alignment (vinfo,
+STMT_VINFO_DR_INFO (stmt_info),
+false);
+   }
 }
 
   if ((*memory_access_type == VMAT_ELEMENTWISE
-- 
2.26.2


[COMMITTED] IBM Z: Fix linking to libatomic in target test cases

2021-01-15 Thread Marius Hillenbrand via Gcc-patches
Regtested on s390x-linux-gnu. Approved offline by Andreas Krebbel.
Pushed.

>8--->8--->8->8--->8--->8-

One of the test cases failed to link because of missing paths to
libatomic. Reuse procedures in lib/atomic-dg.exp to gather these paths.

gcc/testsuite/ChangeLog:

2021-01-15  Marius Hillenbrand  

* gcc.target/s390/s390.exp: Call lib atomic-dg.exp to link
libatomic into testcases in gcc.target/s390/md.
	* gcc.target/s390/md/atomic_exchange-1.c: Remove now-unnecessary
	-latomic.
---
 gcc/testsuite/gcc.target/s390/md/atomic_exchange-1.c | 2 +-
 gcc/testsuite/gcc.target/s390/s390.exp   | 4 
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.target/s390/md/atomic_exchange-1.c b/gcc/testsuite/gcc.target/s390/md/atomic_exchange-1.c
index f82b2131015..54e97d34172 100644
--- a/gcc/testsuite/gcc.target/s390/md/atomic_exchange-1.c
+++ b/gcc/testsuite/gcc.target/s390/md/atomic_exchange-1.c
@@ -1,7 +1,7 @@
 /* Machine description pattern tests.  */
 
 /* { dg-do compile } */
-/* { dg-options "-lpthread -latomic" } */
+/* { dg-options "-lpthread" } */
 /* { dg-do run { target { s390_useable_hw } } } */
 
 /**/
diff --git a/gcc/testsuite/gcc.target/s390/s390.exp b/gcc/testsuite/gcc.target/s390/s390.exp
index 57b2690f8ab..df460600d42 100644
--- a/gcc/testsuite/gcc.target/s390/s390.exp
+++ b/gcc/testsuite/gcc.target/s390/s390.exp
@@ -28,6 +28,7 @@ if ![istarget s390*-*-*] then {
 load_lib gcc-dg.exp
 load_lib target-supports.exp
 load_lib gfortran-dg.exp
+load_lib atomic-dg.exp
 
 # Return 1 if the the assembler understands .machine and .machinemode.  The
 # target attribute needs that feature to work.
@@ -250,6 +251,8 @@ dg-runtest [lsort [glob -nocomplain $srcdir/$subdir/arch13/*.{c,S}]] \
 dg-runtest [lsort [glob -nocomplain $srcdir/$subdir/vxe/*.{c,S}]] \
"" "-O3 -march=arch12 -mzarch"
 
+# Some md tests require libatomic
+atomic_init
 dg-runtest [lsort [glob -nocomplain $srcdir/$subdir/md/*.{c,S}]] \
"" $DEFAULT_CFLAGS
 
@@ -294,4 +297,5 @@ foreach t [list $srcdir/$subdir/pr80080-3.c] {
 }
 
 # All done.
+atomic_finish
 dg-finish
-- 
2.26.2



Re: [PATCH] [WIP] openmp: Add OpenMP 5.0 task detach clause support

2021-01-15 Thread Kwok Cheung Yeung

On 10/12/2020 2:38 pm, Jakub Jelinek wrote:

On Wed, Dec 09, 2020 at 05:37:24PM +, Kwok Cheung Yeung wrote:

--- a/gcc/c/c-typeck.c
+++ b/gcc/c/c-typeck.c
@@ -14942,6 +14942,11 @@ c_finish_omp_clauses (tree clauses, enum c_omp_region_type ort)
  pc = &OMP_CLAUSE_CHAIN (c);
  continue;
  
+	case OMP_CLAUSE_DETACH:

+ t = OMP_CLAUSE_DECL (c);
+ pc = &OMP_CLAUSE_CHAIN (c);
+ continue;
+


If you wouldn't need to do anything for C for the detach clause, just would
just add:
case OMP_CLAUSE_DETACH:
at the end of the case list that starts below:

case OMP_CLAUSE_IF:
case OMP_CLAUSE_NUM_THREADS:
case OMP_CLAUSE_NUM_TEAMS:


But you actually do need to do something, even for C.

There are two restrictions:
- At most one detach clause can appear on the directive.
- If a detach clause appears on the directive, then a mergeable clause cannot 
appear on the same directive.
that should be checked and diagnosed.  One place to do that would be
like usually in all the FEs separately, that would mean adding
   bool mergeable_seen = false, detach_seen = false;
vars and for those clauses setting the *_seen, plus for DETACH
already complain if detach_seen is already true and remove the clause.
And at the end of the loop if mergeable_seen && detach_seen, diagnose
and remove one of them (perhaps better detach clause).
There is the optional second loop that can be used for the removal...

Testcase coverage should include:
   #pragma omp task detach (x) detach (y)
as well as
   #pragma omp task mergeable detach (x)
and
   #pragma omp task detach (x) mergeable
(and likewise for Fortran).



I have implemented checking for multiple detach clauses and usage with 
mergeable. I have included testcases in c-c++-common/gomp/task-detach-1.c and
gfortran.dg/gomp/task-detach-1.f90.
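As an aside, Jakub's seen-flag scheme described above can be modeled in isolation. The following is a hypothetical, simplified sketch using plain C++ containers rather than GCC's tree clause chains; the `clause` enum and `finish_clauses` name are invented for illustration only:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for GCC's OMP_CLAUSE_* codes.
enum class clause { if_clause, num_threads, mergeable, detach };

// Sketch of the suggested c_finish_omp_clauses checks: one pass records
// the *_seen flags and drops duplicate detach clauses; the "optional
// second loop" then removes detach when it clashes with mergeable.
std::vector<clause>
finish_clauses (const std::vector<clause> &clauses,
		std::vector<std::string> &diagnostics)
{
  bool mergeable_seen = false, detach_seen = false;
  std::vector<clause> kept;
  for (clause c : clauses)
    {
      if (c == clause::detach)
	{
	  if (detach_seen)
	    {
	      diagnostics.push_back ("too many detach clauses on a task construct");
	      continue;  // remove the duplicate clause
	    }
	  detach_seen = true;
	}
      else if (c == clause::mergeable)
	mergeable_seen = true;
      kept.push_back (c);
    }
  // detach and mergeable are mutually exclusive: diagnose and drop the
  // detach clause (the suggestion above prefers removing detach).
  if (mergeable_seen && detach_seen)
    {
      diagnostics.push_back ("detach clause must not appear together with mergeable");
      kept.erase (std::remove (kept.begin (), kept.end (), clause::detach),
		  kept.end ());
    }
  return kept;
}
```

The three suggested testcases (`detach (x) detach (y)`, `mergeable detach (x)`, `detach (x) mergeable`) each produce exactly one diagnostic and leave at most one of the conflicting clauses in the list.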


+  if (cp_lexer_next_token_is_not (parser->lexer, CPP_NAME))
+{
+  cp_parser_error (parser, "expected identifier");
+  return list;
+}
+
+  location_t id_loc = cp_lexer_peek_token (parser->lexer)->location;
+  tree t, identifier = cp_parser_identifier (parser);
+
+  if (identifier == error_mark_node)
+t = error_mark_node;
+  else
+{
+  t = cp_parser_lookup_name_simple
+   (parser, identifier,
+cp_lexer_peek_token (parser->lexer)->location);
+  if (t == error_mark_node)
+   cp_parser_name_lookup_error (parser, identifier, t, NLE_NULL,
+id_loc);


The above doesn't match what cp_parser_omp_var_list_no_open does,
in particular it should use cp_parser_id_expression
instead of cp_parser_identifier etc.



Changed to use cp_parser_id_expression, and added extra logic from 
cp_parser_omp_var_list in looking up the decl.



+  else
+   {
+ tree type = TYPE_MAIN_VARIANT (TREE_TYPE (t));
+ if (!INTEGRAL_TYPE_P (type)
+ || TREE_CODE (type) != ENUMERAL_TYPE
+ || DECL_NAME (TYPE_NAME (type))
+  != get_identifier ("omp_event_handle_t"))
+   {
+		  error_at (id_loc, "%<detach%> clause event handle "
+			    "has type %qT rather than "
+			    "%<omp_event_handle_t%>",
+			    type);
+ return list;


You can't do this here for C++, it needs to be done in finish_omp_clauses
instead and only be done if the type is not a dependent type.
Consider (e.g. should be in testsuite)
template <typename T>
void
foo ()
{
   T t;
   #pragma omp task detach (t)
   ;
}

template <typename T>
void
bar ()
{
   T t;
   #pragma omp task detach (t)
   ;
}

void
baz ()
{
   foo <omp_event_handle_t> ();
   bar <int> (); // Instantiating this should error
}



Moved type checking to finish_omp_clauses, and testcase added at 
g++.dg/gomp/task-detach-1.C.



@@ -7394,6 +7394,9 @@ finish_omp_clauses (tree clauses, enum c_omp_region_type 
ort)
}
}
  break;
+   case OMP_CLAUSE_DETACH:
+ t = OMP_CLAUSE_DECL (c);
+ break;
  


Again, restriction checking here, plus check the type if it is
non-dependent, otherwise defer that checking for finish_omp_clauses when
it will not be dependent anymore.

I think you need to handle OMP_CLAUSE_DETACH in cp/pt.c too.



Done. g++.dg/gomp/task-detach-1.C contains a test for templates.


--- a/gcc/gimplify.c
+++ b/gcc/gimplify.c
@@ -9733,6 +9733,19 @@ gimplify_scan_omp_clauses (tree *list_p, gimple_seq *pre_p,
}
  break;
  
+	case OMP_CLAUSE_DETACH:

+ decl = OMP_CLAUSE_DECL (c);
+ if (outer_ctx)
+   {
+ splay_tree_node on
+   = splay_tree_lookup (outer_ctx->variables,
+(splay_tree_key)decl);
+ if (on == NULL || (on->value & GOVD_DATA_SHARE_CLASS) == 0)
+   omp_firstprivatize_variable (outer_ctx, decl);
+ omp_notice_variable (outer_ctx, decl, true);
+   }
+ break;


I don't understand this.  My reading of:
"The event-handl

Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-01-15 Thread Qing Zhao via Gcc-patches



> On Jan 15, 2021, at 2:11 AM, Richard Biener  wrote:
> 
> 
> 
> On Thu, 14 Jan 2021, Qing Zhao wrote:
> 
>> Hi, 
>> More data on code size and compilation time with CPU2017:
>> Compilation time data:   the numbers are the slowdown against the
>> default “no”:
>> benchmarks  A/no D/no
>> 
>> 500.perlbench_r 5.19% 1.95%
>> 502.gcc_r 0.46% -0.23%
>> 505.mcf_r 0.00% 0.00%
>> 520.omnetpp_r 0.85% 0.00%
>> 523.xalancbmk_r 0.79% -0.40%
>> 525.x264_r -4.48% 0.00%
>> 531.deepsjeng_r 16.67% 16.67%
>> 541.leela_r  0.00%  0.00%
>> 557.xz_r 0.00%  0.00%
>> 
>> 507.cactuBSSN_r 1.16% 0.58%
>> 508.namd_r 9.62% 8.65%
>> 510.parest_r 0.48% 1.19%
>> 511.povray_r 3.70% 3.70%
>> 519.lbm_r 0.00% 0.00%
>> 521.wrf_r 0.05% 0.02%
>> 526.blender_r 0.33% 1.32%
>> 527.cam4_r -0.93% -0.93%
>> 538.imagick_r 1.32% 3.95%
>> 544.nab_r  0.00% 0.00%
>> From the above data, looks like that the compilation time impact
>> from implementation A and D are almost the same.
>> ***code size data: the numbers are the code size increase against the
>> default “no”:
>> benchmarks A/no D/no
>> 
>> 500.perlbench_r 2.84% 0.34%
>> 502.gcc_r 2.59% 0.35%
>> 505.mcf_r 3.55% 0.39%
>> 520.omnetpp_r 0.54% 0.03%
>> 523.xalancbmk_r 0.36%  0.39%
>> 525.x264_r 1.39% 0.13%
>> 531.deepsjeng_r 2.15% -1.12%
>> 541.leela_r 0.50% -0.20%
>> 557.xz_r 0.31% 0.13%
>> 
>> 507.cactuBSSN_r 5.00% -0.01%
>> 508.namd_r 3.64% -0.07%
>> 510.parest_r 1.12% 0.33%
>> 511.povray_r 4.18% 1.16%
>> 519.lbm_r 8.83% 6.44%
>> 521.wrf_r 0.08% 0.02%
>> 526.blender_r 1.63% 0.45%
>> 527.cam4_r  0.16% 0.06%
>> 538.imagick_r 3.18% -0.80%
>> 544.nab_r 5.76% -1.11%
>> Avg 2.52% 0.36%
>> From the above data, the implementation D is always better than A, which is
>> surprising to me; I'm not sure what the reason for this is.
> 
> D probably inhibits most interesting loop transforms (check SPEC FP
> performance).

The call to .DEFERRED_INIT is marked as ECF_CONST:

/* A function to represent an artificial initialization to an uninitialized
   automatic variable. The first argument is the variable itself, the
   second argument is the initialization type.  */
DEF_INTERNAL_FN (DEFERRED_INIT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)

So, I assume that such const call should minimize the impact to loop 
optimizations. But yes, it will still inhibit some of the loop transformations.

>  It will also most definitely disallow SRA which, when
> an aggregate is not completely elided, tends to grow code.

Make sense to me. 

The run-time performance data for D and A are actually very similar as I posted 
in the previous email (I listed it here for convenience)

Run-time performance overhead with A and D:

benchmarks  A / no  D /no

500.perlbench_r 1.25%   1.25%
502.gcc_r   0.68%   1.80%
505.mcf_r   0.68%   0.14%
520.omnetpp_r   4.83%   4.68%
523.xalancbmk_r 0.18%   1.96%
525.x264_r  1.55%   2.07%
531.deepsjeng_r 11.57%  11.85%
541.leela_r 0.64%   0.80%
557.xz_r -0.41% -0.41%

507.cactuBSSN_r 0.44%   0.44%
508.namd_r  0.34%   0.34%
510.parest_r0.17%   0.25%
511.povray_r56.57%  57.27%
519.lbm_r   0.00%   0.00%
521.wrf_r-0.28% -0.37%
526.blender_r   16.96%  17.71%
527.cam4_r  0.70%   0.53%
538.imagick_r   2.40%   2.40%
544.nab_r   0.00%   -0.65%

avg 5.17%   5.37%

Especially for the SPEC FP benchmarks, I didn’t see too much performance 
difference between A and D. 
I guess that the RTL optimizations might be enough to get rid of most of the 
overhead introduced by the additional initialization. 

> 
>> stack usage data, I added -fstack-usage to the compilation line when
>> compiling CPU2017 benchmarks. And all the *.su files were generated for each
>> of the modules.
>> Since there a lot of such files, and the stack size information are embedded
>> in each of the files.  I just picked up one benchmark 511.povray to
>> check. Which is the one that 
>> has the most runtime overhead when adding initialization (both A and D). 
>> I identified all the *.su files that are different between A and D and do a
>> diff on those *.su files, and looks like that the stack size is much higher
>> with D than that with A, for example:
>> $ diff build_base_auto_init.D./bbox.su build_base_auto_init.A./bbox.su
>> 5c5
>> < bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int) 160 static
>> ---
>> > bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**, pov::BBOX_TREE**&, long int*, long int, long int) 96 static
>> $ diff build_base_auto_init.D./image.su build_base_auto_init.A./image.su
>> 9c9
>> < image.cpp:240:6:void pov::bump_map(double*, pov::TNORMAL*, double*) 624 static
>> ---
>> > image.cpp:240:6:void pov::bump_map(double*, 

Re: BoF DWARF5 patches (25% .debug section size reduction)

2021-01-15 Thread Jakub Jelinek via Gcc-patches
On Sun, Nov 15, 2020 at 11:41:24PM +0100, Mark Wielaard wrote:
> On Tue, 2020-09-29 at 15:56 +0200, Mark Wielaard wrote:
> > On Thu, 2020-09-10 at 13:16 +0200, Jakub Jelinek wrote:
> > > On Wed, Sep 09, 2020 at 09:57:54PM +0200, Mark Wielaard wrote:
> > > > --- a/gcc/doc/invoke.texi
> > > > +++ b/gcc/doc/invoke.texi
> > > > @@ -9057,13 +9057,14 @@ possible.
> > > >  @opindex gdwarf
> > > >  Produce debugging information in DWARF format (if that is supported).
> > > >  The value of @var{version} may be either 2, 3, 4 or 5; the default 
> > > > version
> > > > -for most targets is 4.  DWARF Version 5 is only experimental.
> > > > +for most targets is 5 (with the exception of vxworks and darwin which
> > > > +default to version 2).
> > > 
> > > I think in documentation we should spell these VxWorks and Darwin/Mac OS X
> > 
> > OK. As attached.
> > 
> > Are we ready to flip the default to 5?
> 
> Ping. It would be good to get this in now so that we can fix issues (if
> any) with the DWARF5 support in the general bugfixing stage 3.
> 
> Thanks,
> 
> Mark

> From c04727b6209ad4d52d1b9ba86873961bda0e1724 Mon Sep 17 00:00:00 2001
> From: Mark Wielaard 
> Date: Tue, 29 Sep 2020 15:52:44 +0200
> Subject: [PATCH] Default to DWARF5
> 
> gcc/ChangeLog:
> 
>   * common.opt (gdwarf-): Init(5).
>   * doc/invoke.texi (-gdwarf): Document default to 5.

Ok for trunk.

Jakub



Re: Add dg-require-wchars to libstdc++ testsuite

2021-01-15 Thread Alexandre Oliva
On Jan 15, 2021, Jonathan Wakely  wrote:

> On Thu, 14 Jan 2021, 22:22 Alexandre Oliva,  wrote:
>> ... it is definitely the case that the target currently defines wchar_t,
>> and it even offers wchar.h and a lot of (maybe all?) wcs* functions.
>> This was likely not the case when the patch was first written.
>> 
>> I'll double check whether any of the patch is still needed for current
>> versions.

With the tests I've run since yesterday, I've determined that:

- the wchar-related patches for the libstdc++ testsuite, that I had
  proposed last year, are no longer needed

- your two patchlets did not bring about any regressions to test
  results, not in mainline x86_64-linux-gnu native, not with the trivial
  backports to the gcc-10 tree for x-arm-vxw7r2 that was the focus of my
  immediate attention.

So, I withdraw my submissions of the testsuite patches, and I encourage
you to proceed with the two changes you proposed.

However, for avoidance of any doubt, I'll restate that I cannot vow for
whether they're enough to fix the issues we'd run into back when
wchar/wcs* were not supported in the target system, because now they
are, so the changes do not bring any visible improvements to our results
either.

Anyway, thanks for the feedback!

-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist GNU Toolchain Engineer
Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


[PATCH] c++: Fix up potential_constant_expression_1 FOR/WHILE_STMT handling [PR98672]

2021-01-15 Thread Jakub Jelinek via Gcc-patches
Hi!

The following testcase is rejected even when it is valid.
The problem is that potential_constant_expression_1 doesn't have the
accurate *jump_target tracking cxx_eval_* has, and when the loop has
a condition that isn't guaranteed to be always true, the body isn't walked
at all.  That is mostly a correct conservative behavior, except that it
doesn't detect if there are any return statements in the body, which means
the loop might return instead of falling through to the next statement.
We already have code for return stmt discovery in code snippets we don't
try to evaluate for switches, so this patch reuses that for FOR_STMT
and WHILE_STMT bodies.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

Note, I haven't touched FOR_EXPR, with statement expressions it could
have return stmts in it too, or it could have break or continue statements
that wouldn't bind to the current loop but to something outer.  That
case is clearly mishandled by potential_constant_expression_1 even
when the condition is missing or is always true, and it wouldn't surprise me
if cxx_eval_* didn't handle it right either, so I'm deferring that to
separate PR for later.  We'd need proper test coverage for all of that.

2021-01-15  Jakub Jelinek  

PR c++/98672
	* constexpr.c (potential_constant_expression_1) <case FOR_STMT>,
	<case WHILE_STMT>: If the condition isn't constant true, check if
the loop body can contain a return stmt.

* g++.dg/cpp1y/constexpr-98672.C: New test.

--- gcc/cp/constexpr.c.jj   2021-01-13 19:19:44.368469462 +0100
+++ gcc/cp/constexpr.c  2021-01-14 12:02:27.347042704 +0100
@@ -8190,7 +8190,17 @@ potential_constant_expression_1 (tree t,
  /* If we couldn't evaluate the condition, it might not ever be
 true.  */
  if (!integer_onep (tmp))
-   return true;
+   {
+ /* Before returning true, check if the for body can contain
+a return.  */
+ hash_set<tree> pset;
+ check_for_return_continue_data data = { &pset, NULL_TREE };
+ if (tree ret_expr
+ = cp_walk_tree (&FOR_BODY (t), check_for_return_continue,
+ &data, &pset))
+   *jump_target = ret_expr;
+ return true;
+   }
}
   if (!RECUR (FOR_EXPR (t), any))
return false;
@@ -8219,7 +8229,17 @@ potential_constant_expression_1 (tree t,
tmp = cxx_eval_outermost_constant_expr (tmp, true);
   /* If we couldn't evaluate the condition, it might not ever be true.  */
   if (!integer_onep (tmp))
-   return true;
+   {
+ /* Before returning true, check if the while body can contain
+a return.  */
+ hash_set<tree> pset;
+ check_for_return_continue_data data = { &pset, NULL_TREE };
+ if (tree ret_expr
+ = cp_walk_tree (&WHILE_BODY (t), check_for_return_continue,
+ &data, &pset))
+   *jump_target = ret_expr;
+ return true;
+   }
   if (!RECUR (WHILE_BODY (t), any))
return false;
   if (breaks (jump_target) || continues (jump_target))
--- gcc/testsuite/g++.dg/cpp1y/constexpr-98672.C.jj	2021-01-14 12:19:24.842438847 +0100
+++ gcc/testsuite/g++.dg/cpp1y/constexpr-98672.C	2021-01-14 12:07:33.935551155 +0100
@@ -0,0 +1,35 @@
+// PR c++/98672
+// { dg-do compile { target c++14 } }
+
+void
+foo ()
+{
+}
+
+constexpr int
+bar ()
+{
+  for (int i = 0; i < 5; ++i)
+return i;
+  foo ();
+  return 0;
+}
+
+constexpr int
+baz ()
+{
+  int i = 0;
+  while (i < 5)
+{
+  if (i == 3)
+   return i;
+  else
+   ++i;
+}
+  foo ();
+  return 0;
+}
+
+constexpr int i = bar ();
+constexpr int j = baz ();
+static_assert (i == 0 && j == 3, "");

Jakub



Re: [PATCH] c++: ICE with constrained placeholder return type [PR98346]

2021-01-15 Thread Patrick Palka via Gcc-patches
On Mon, 11 Jan 2021, Jason Merrill wrote:

> On 1/7/21 4:06 PM, Patrick Palka wrote:
> > This is essentially a followup to r11-3714 -- we ICEing from another
> > "unguarded" call to build_concept_check, this time in do_auto_deduction,
> > due to the presence of templated trees when !processing_template_decl.
> > 
> > Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look OK for
> > trunk and perhaps the 10 branch?
> > 
> > gcc/cp/ChangeLog:
> > 
> > PR c++/98346
> > * pt.c (do_auto_deduction): Temporarily increment
> > processing_template_decl before calling build_concept_check.
> > 
> > gcc/testsuite/ChangeLog:
> > 
> > PR c++/98346
> > * g++.dg/cpp2a/concepts-placeholder3.C: New test.
> > ---
> >   gcc/cp/pt.c   |  2 ++
> >   .../g++.dg/cpp2a/concepts-placeholder3.C  | 15 +++
> >   2 files changed, 17 insertions(+)
> >   create mode 100644 gcc/testsuite/g++.dg/cpp2a/concepts-placeholder3.C
> > 
> > diff --git a/gcc/cp/pt.c b/gcc/cp/pt.c
> > index beabcc4b027..111a694e0c5 100644
> > --- a/gcc/cp/pt.c
> > +++ b/gcc/cp/pt.c
> > @@ -29464,7 +29464,9 @@ do_auto_deduction (tree type, tree init, tree auto_node,
> > cargs = targs;
> > /* Rebuild the check using the deduced arguments.  */
> > +   ++processing_template_decl;
> > check = build_concept_check (cdecl, cargs, tf_none);
> > +   --processing_template_decl;
> 
> This shouldn't be necessary; if processing_template_decl is 0, we should have
> non-dependent args.
> 
> I think your patch only works for this testcase because the concept is trivial
> and doesn't actually try to to do anything with the arguments.
> 
> Handling of PLACEHOLDER_TYPE_CONSTRAINTS is overly complex, partly because the
> 'auto' is represented as an argument in its own constraints.
> 
> A constrained auto variable declaration has the same problem.

D'oh, good point..  We need to also substitute the template arguments of
the current instantiation into the constraint at some point.   This is
actually PR96443 / PR96444, which I reported and posted a patch for back
in August: https://gcc.gnu.org/pipermail/gcc-patches/2020-August/551375.html

The approach the August patch used was to substitute into the
PLACEHOLDER_TYPE_CONSTRAINTS during tsubst, which was ruled out.  We can
instead do the same substitution during do_auto_deduction, as in the
patch below.  Does this approach look better?  It seems consistent with
how type_deducible_p substitutes into the return-type-requirement of a
compound-requirement.

Alternatively we could not substitute into PLACEHOLDER_TYPE_CONSTRAINTS
at all and instead pass the targs of the enclosing function directly
into satisfaction, but that seems inconsistent with type_deducible_p.

-- >8 --

Subject: [PATCH] c++: dependent constraint on placeholder return type
 [PR96443]

We're never substituting the template arguments of the enclosing
function into the constraint of a placeholder variable or return type,
which leads to errors during satisfaction when the constraint is
dependent.  This patch fixes this issue by doing the appropriate
substitution in do_auto_deduction before checking satisfaction.

Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look OK
for trunk?  Also tested on cmcstl2 and range-v3.

gcc/cp/ChangeLog:

PR c++/96443
* pt.c (do_auto_deduction): Try checking the placeholder
	constraint at template parse time.  Substitute the template
arguments of the containing function into the placeholder
constraint.  If the constraint is still dependent, defer
deduction until instantiation time.

gcc/testsuite/ChangeLog:

PR c++/96443
* g++.dg/concepts/concepts-ts1.C: Add dg-bogus directive to the
call to f15 that we expect to accept.
* g++.dg/cpp2a/concepts-placeholder3.C: New test.
---
 gcc/cp/pt.c   | 19 ++-
 .../g++.dg/cpp2a/concepts-placeholder3.C  | 16 
 gcc/testsuite/g++.dg/cpp2a/concepts-ts1.C |  2 +-
 3 files changed, 35 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/cpp2a/concepts-placeholder3.C

diff --git a/gcc/cp/pt.c b/gcc/cp/pt.c
index c6b7318b378..b70a9a451e1 100644
--- a/gcc/cp/pt.c
+++ b/gcc/cp/pt.c
@@ -29455,7 +29455,7 @@ do_auto_deduction (tree type, tree init, tree auto_node,
 }
 
   /* Check any placeholder constraints against the deduced type. */
-  if (flag_concepts && !processing_template_decl)
+  if (flag_concepts)
 if (tree check = NON_ERROR (PLACEHOLDER_TYPE_CONSTRAINTS (auto_node)))
   {
 /* Use the deduced type to check the associated constraints. If we
@@ -29475,6 +29475,23 @@ do_auto_deduction (tree type, tree init, tree auto_node,
 else
   cargs = targs;
 
+   if ((context == adc_return_type || context == adc_variable_type)
+   && current_function_decl
+   && DECL_TEMPLATE_

[pushed] rtl-ssa: Fix a silly typo

2021-01-15 Thread Richard Sandiford via Gcc-patches
s/ref/reg/ on a previously unused function name.

Sorry for the blunder.  Tested on aarch64-linux-gnu, aarch64_be-elf
and x86_64-linux-gnu, pushed as obvious.

Richard


gcc/
* rtl-ssa/functions.h (function_info::ref_defs): Rename to...
(function_info::reg_defs): ...this.
* rtl-ssa/member-fns.inl (function_info::ref_defs): Rename to...
(function_info::reg_defs): ...this.
---
 gcc/rtl-ssa/functions.h| 2 +-
 gcc/rtl-ssa/member-fns.inl | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/rtl-ssa/functions.h b/gcc/rtl-ssa/functions.h
index 25896fc1138..f64bd3f290a 100644
--- a/gcc/rtl-ssa/functions.h
+++ b/gcc/rtl-ssa/functions.h
@@ -100,7 +100,7 @@ public:
   // Return a list of all definitions of register REGNO, in reverse postorder.
   // This includes both real stores by instructions and artificial
   // definitions by things like phi nodes.
-  iterator_range<def_iterator> ref_defs (unsigned int regno) const;
+  iterator_range<def_iterator> reg_defs (unsigned int regno) const;
 
   // Check if all uses of register REGNO are either unconditionally undefined
   // or use the same single dominating definition.  Return the definition
diff --git a/gcc/rtl-ssa/member-fns.inl b/gcc/rtl-ssa/member-fns.inl
index 4b3eacbd4b4..e1ab7d1ba84 100644
--- a/gcc/rtl-ssa/member-fns.inl
+++ b/gcc/rtl-ssa/member-fns.inl
@@ -883,7 +883,7 @@ function_info::mem_defs () const
 }
 
inline iterator_range<def_iterator>
-function_info::ref_defs (unsigned int regno) const
+function_info::reg_defs (unsigned int regno) const
 {
   return { m_defs[regno + 1], nullptr };
 }


[pushed] recog: Fix insn_change_watermark destructor

2021-01-15 Thread Richard Sandiford via Gcc-patches
Noticed while working on something else that the insn_change_watermark
destructor could call cancel_changes for changes that no longer exist.
The loop in cancel_changes is a nop in that case, but:

  num_changes = num;

can mess things up.

I think this would only affect nested uses of insn_change_watermark.
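The failure mode can be shown with a self-contained model. This is a hypothetical simplification of recog.c's change stack (a bare counter instead of the real `changes` array), with the watermark class carrying the guarded destructor from the patch:

```cpp
#include <cassert>

// Hypothetical model of recog.c's validated-change counter.
int num_changes = 0;

int num_validated_changes () { return num_changes; }

// Like cancel_changes (NUM): roll everything back to NUM entries.
// The trailing "num_changes = num" is harmless when NUM is at or
// below the current count, but harmful when it is above it.
void cancel_changes (int num) { num_changes = num; }

class insn_change_watermark
{
public:
  insn_change_watermark () : m_old_num_changes (num_validated_changes ()) {}
  ~insn_change_watermark ()
  {
    // The fix: cancel only if changes above the watermark still exist.
    // An unconditional cancel_changes (m_old_num_changes) would bump
    // num_changes back *up* if everything was already cancelled.
    if (m_old_num_changes < num_validated_changes ())
      cancel_changes (m_old_num_changes);
  }
  void keep () { m_old_num_changes = num_validated_changes (); }

private:
  int m_old_num_changes;
};
```

With nested watermarks, if all changes are cancelled while an inner watermark recorded a nonzero count, the old unconditional destructor would reset `num_changes` to that stale count, resurrecting cancelled changes; the guarded version is a no-op.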

Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu,
pushed as obvious.

Richard


gcc/
* recog.h (insn_change_watermark::~insn_change_watermark): Avoid
calling cancel_changes for changes that no longer exist.
---
 gcc/recog.h | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/gcc/recog.h b/gcc/recog.h
index 269094a30f1..e96e66e99f2 100644
--- a/gcc/recog.h
+++ b/gcc/recog.h
@@ -547,13 +547,19 @@ class insn_change_watermark
 {
 public:
   insn_change_watermark () : m_old_num_changes (num_validated_changes ()) {}
-  ~insn_change_watermark () { cancel_changes (m_old_num_changes); }
+  ~insn_change_watermark ();
   void keep () { m_old_num_changes = num_validated_changes (); }
 
 private:
   int m_old_num_changes;
 };
 
+inline insn_change_watermark::~insn_change_watermark ()
+{
+  if (m_old_num_changes < num_validated_changes ())
+cancel_changes (m_old_num_changes);
+}
+
 #endif
 
 #endif /* GCC_RECOG_H */


[pushed] aarch64: Add a minipass for fusing CC insns [PR88836]

2021-01-15 Thread Richard Sandiford via Gcc-patches
This patch adds a small target-specific pass to remove redundant SVE
PTEST instructions.  There are two important uses of this:

- Removing PTESTs after WHILELOs (PR88836).  The original testcase
  no longer exhibits the problem due to more recent optimisations,
  but it can still be seen in simple cases like the one in the patch.
  It also shows up in 450.soplex.

- Removing PTESTs after RDFFRs in ACLE code.

This is just an interim “solution” for GCC 11.  I hope to replace
it with something generic and target-independent for GCC 12.
However, the use cases above are very important for performance,
so I'd rather not leave the bug unfixed for yet another release cycle.

Since the pass is intended to be short-lived, I've not added
a command-line option for it.  The pass can be disabled using
-fdisable-rtl-cc_fusion if necessary.

Although what the pass does is independent of SVE, it's motivated
only by SVE cases and doesn't trigger for any non-SVE test I've seen.
I've therefore gated it on TARGET_SVE and restricted it to PTEST
patterns.

Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu,
pushed to trunk.  (The Makefile.in part seemed obvious.)

Richard


gcc/
PR target/88836
* config.gcc (aarch64*-*-*): Add aarch64-cc-fusion.o to extra_objs.
* Makefile.in (RTL_SSA_H): New variable.
* config/aarch64/t-aarch64 (aarch64-cc-fusion.o): New rule.
* config/aarch64/aarch64-protos.h (make_pass_cc_fusion): Declare.
* config/aarch64/aarch64-passes.def: Add pass_cc_fusion after
pass_combine.
* config/aarch64/aarch64-cc-fusion.cc: New file.

gcc/testsuite/
PR target/88836
* gcc.target/aarch64/sve/acle/general/ldff1_8.c: New test.
* gcc.target/aarch64/sve/ptest_1.c: Likewise.
---
 gcc/Makefile.in   |   7 +
 gcc/config.gcc|   2 +-
 gcc/config/aarch64/aarch64-cc-fusion.cc   | 296 ++
 gcc/config/aarch64/aarch64-passes.def |   1 +
 gcc/config/aarch64/aarch64-protos.h   |   1 +
 gcc/config/aarch64/t-aarch64  |   6 +
 .../aarch64/sve/acle/general/ldff1_8.c|  32 ++
 .../gcc.target/aarch64/sve/ptest_1.c  |  10 +
 8 files changed, 354 insertions(+), 1 deletion(-)
 create mode 100644 gcc/config/aarch64/aarch64-cc-fusion.cc
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general/ldff1_8.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/ptest_1.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index de8af617488..a63c5d9cab6 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1024,6 +1024,13 @@ PLUGIN_H = plugin.h $(GCC_PLUGIN_H)
 PLUGIN_VERSION_H = plugin-version.h configargs.h
 CONTEXT_H = context.h
 GENSUPPORT_H = gensupport.h read-md.h optabs.def
+RTL_SSA_H = $(PRETTY_PRINT_H) insn-config.h splay-tree-utils.h \
+   $(RECOG_H) $(REGS_H) function-abi.h obstack-utils.h \
+   mux-utils.h rtlanal.h memmodel.h $(EMIT_RTL_H) \
+   rtl-ssa/accesses.h rtl-ssa/insns.h rtl-ssa/blocks.h \
+   rtl-ssa/changes.h rtl-ssa/functions.h rtl-ssa/is-a.inl \
+   rtl-ssa/access-utils.h rtl-ssa/insn-utils.h rtl-ssa/movement.h \
+   rtl-ssa/change-utils.h rtl-ssa/member-fns.inl
 
 #
 # Now figure out from those variables how to compile and link.
diff --git a/gcc/config.gcc b/gcc/config.gcc
index 9fb57e96121..17fea83b2e4 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -321,7 +321,7 @@ aarch64*-*-*)
c_target_objs="aarch64-c.o"
cxx_target_objs="aarch64-c.o"
d_target_objs="aarch64-d.o"
-	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch64-bti-insert.o"
+	extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch64-bti-insert.o aarch64-cc-fusion.o"
 	target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.c \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
target_has_targetm_common=yes
;;
diff --git a/gcc/config/aarch64/aarch64-cc-fusion.cc b/gcc/config/aarch64/aarch64-cc-fusion.cc
new file mode 100644
index 000..09069a20de2
--- /dev/null
+++ b/gcc/config/aarch64/aarch64-cc-fusion.cc
@@ -0,0 +1,296 @@
+// Pass to fuse CC operations with other instructions.
+// Copyright (C) 2021 Free Software Foundation, Inc.
+//
+// This file is part of GCC.
+//
+// GCC is free software; you can redistribute it and/or modify it under
+// the terms of the GNU General Public License as published by the Free
+// Software Foundation; either version 3, or (at your option) any later
+/

c++: Fix langspecs with -fsyntax-only [PR98591]

2021-01-15 Thread Nathan Sidwell


-fsyntax-only is handled specially in the driver and causes it to add
 '-o /dev/null' (or a suitable OS-specific variant thereof).  PCH is
 handled in the language driver.  I'd not sufficiently protected the
 -fmodule-only action of adding a dummy assembler from the actions of
 -fsyntax-only, so we ended up with two -o options.

PR c++/98591
gcc/cp/
* lang-specs.h: Fix handling of -fmodule-only with -fsyntax-only.   

--
Nathan Sidwell
diff --git c/gcc/cp/lang-specs.h w/gcc/cp/lang-specs.h
index f16279142be..8902ae1d2ed 100644
--- c/gcc/cp/lang-specs.h
+++ w/gcc/cp/lang-specs.h
@@ -52,9 +52,11 @@ along with GCC; see the file COPYING3.  If not see
   "  %{!save-temps*:%{!no-integrated-cpp:%(cpp_unique_options)}}"
   "  %{fmodules-ts:-fmodule-header %{fpreprocessed:-fdirectives-only}}"
   "  %(cc1_options) %2"
-  "  %{!S:-o %g.s%V}"
-  "  %{!fsyntax-only:%{!fmodule-*:%{!fmodules-*:%{!fdump-ada-spec*:"
-  "	 %{!o*:--output-pch=%i.gch}%W{o*:--output-pch=%*",
+  "  %{!fsyntax-only:"
+  "%{!S:-o %g.s%V}"
+  "%{!fmodule-*:%{!fmodules-*:%{!fdump-ada-spec*:"
+  "	 %{!o*:--output-pch=%i.gch}%W{o*:--output-pch=%*}"
+  "}}}",
  CPLUSPLUS_CPP_SPEC, 0, 0},
   {"@c++-system-header",
   "%{E|M|MM:cc1plus -E"
@@ -68,11 +70,14 @@ along with GCC; see the file COPYING3.  If not see
   "%{fmodules-ts:-fdirectives-only}"
   " 	   %{save-temps*:%b.ii} %{!save-temps*:%g.ii}}"
   "  %{!save-temps*:%{!no-integrated-cpp:%(cpp_unique_options)}}"
-  "  %{fmodules-ts:-fmodule-header=system %{fpreprocessed:-fdirectives-only}}"
+  "  %{fmodules-ts:-fmodule-header=system"
+  "%{fpreprocessed:-fdirectives-only}}"
   "  %(cc1_options) %2"
-  "  %{!S:-o %g.s%V}"
-  "  %{!fsyntax-only:%{!fmodule-*:%{!fmodules-*:%{!fdump-ada-spec*:"
-  "	 %{!o*:--output-pch=%i.gch}%W{o*:--output-pch=%*",
+  "  %{!fsyntax-only:"
+  "%{!S:-o %g.s%V}"
+  "%{!fmodule-*:%{!fmodules-*:%{!fdump-ada-spec*:"
+  "	 %{!o*:--output-pch=%i.gch}%W{o*:--output-pch=%*}"
+  "}}}",
  CPLUSPLUS_CPP_SPEC, 0, 0},
   {"@c++-user-header",
   "%{E|M|MM:cc1plus -E"
@@ -88,9 +93,11 @@ along with GCC; see the file COPYING3.  If not see
   "  %{!save-temps*:%{!no-integrated-cpp:%(cpp_unique_options)}}"
   "  %{fmodules-ts:-fmodule-header=user %{fpreprocessed:-fdirectives-only}}"
   "  %(cc1_options) %2"
-  "  %{!S:-o %g.s%V}"
-  "  %{!fsyntax-only:%{!fmodule-*:%{!fmodules-*:%{!fdump-ada-spec*:"
-  "	 %{!o*:--output-pch=%i.gch}%W{o*:--output-pch=%*",
+  "  %{!fsyntax-only:"
+  "%{!S:-o %g.s%V}"
+  "%{!fmodule-*:%{!fmodules-*:%{!fdump-ada-spec*:"
+  "	 %{!o*:--output-pch=%i.gch}%W{o*:--output-pch=%*}"
+  "}}}",
  CPLUSPLUS_CPP_SPEC, 0, 0},
   {"@c++",
   "%{E|M|MM:cc1plus -E %(cpp_options) %2 %(cpp_debug_options)}"
@@ -101,13 +108,16 @@ along with GCC; see the file COPYING3.  If not see
   " 	   %{save-temps*:%b.ii} %{!save-temps*:%g.ii}}"
   "  %{!save-temps*:%{!no-integrated-cpp:%(cpp_unique_options)}}"
   "  %(cc1_options) %2"
-  "  %{fmodule-only:%{!S:-o %g.s%V}}"
-  "  %{!fsyntax-only:%{!fmodule-only:%(invoke_as)}",
+  "  %{!fsyntax-only:"
+  "%{fmodule-only:%{!S:-o %g.s%V}}"
+  "%{!fmodule-only:%(invoke_as)}}"
+  "}}}",
   CPLUSPLUS_CPP_SPEC, 0, 0},
   {".ii", "@c++-cpp-output", 0, 0, 0},
   {"@c++-cpp-output",
   "%{!E:%{!M:%{!MM:"
   "  cc1plus -fpreprocessed %i %(cc1_options) %2"
-  "  %{fmodule-only:%{!S:-o %g.s%V}}"
-  "  %{!fsyntax-only:%{!fmodule-only:%{!fmodule-header*:"
-  " %(invoke_as)}}", 0, 0, 0},
+  "  %{!fsyntax-only:"
+  "%{fmodule-only:%{!S:-o %g.s%V}}"
+  "%{!fmodule-only:%{!fmodule-header*:%(invoke_as)}}}"
+  "}}}", 0, 0, 0},
diff --git c/gcc/testsuite/g++.dg/modules/pr98591.H w/gcc/testsuite/g++.dg/modules/pr98591.H
new file mode 100644
index 000..ad397de2ecb
--- /dev/null
+++ w/gcc/testsuite/g++.dg/modules/pr98591.H
@@ -0,0 +1,3 @@
+// { dg-additional-options {-fmodules-ts -fmodule-header -fsyntax-only} }
+// PR 98591 -fsyntax-only -> output filename specified twice
+// specs are hard


preprocessor: Make quoting : [PR 95253]

2021-01-15 Thread Nathan Sidwell
I changed the quoting of ':'; this restores the previous behaviour.  Make
doesn't need ':' quoted (in a filename).


PR preprocessor/95253
libcpp/
* mkdeps.c (munge): Do not escape ':'.

--
Nathan Sidwell
diff --git i/libcpp/mkdeps.c w/libcpp/mkdeps.c
index 471e449a19d..1867e0089d7 100644
--- i/libcpp/mkdeps.c
+++ w/libcpp/mkdeps.c
@@ -162,7 +162,6 @@ munge (const char *str, const char *trail = nullptr)
 	  /* FALLTHROUGH  */
 
 	case '#':
-	case ':':
 	  buf[dst++] = '\\';
 	  /* FALLTHROUGH  */
 


Re: [PATCH] [WIP] openmp: Add OpenMP 5.0 task detach clause support

2021-01-15 Thread Kwok Cheung Yeung

On 15/01/2021 3:07 pm, Kwok Cheung Yeung wrote:
I have tested bootstrapping on x86_64 (no offloading) with no issues, and 
running the libgomp testsuite with Nvidia offloading shows no regressions. I 
have also tested all the gomp.exp tests in the main gcc testsuite, also with no 
issues. I am currently still running the full testsuite, but do not anticipate 
any problems.


Okay to commit on trunk, if the full testsuite run does not show any 
regressions?


Found an issue already :-( - the libgomp include files are not found when the 
tests are run via 'make check'. I have now included the relevant parts of the 
include files in the tests themselves. Okay for trunk (to be merged into the 
main patch)?


Thanks

Kwok
diff --git a/gcc/testsuite/c-c++-common/gomp/task-detach-1.c b/gcc/testsuite/c-c++-common/gomp/task-detach-1.c
index c7dda82..f50f748 100644
--- a/gcc/testsuite/c-c++-common/gomp/task-detach-1.c
+++ b/gcc/testsuite/c-c++-common/gomp/task-detach-1.c
@@ -1,7 +1,12 @@
 /* { dg-do compile } */
 /* { dg-options "-fopenmp" } */
 
-#include 
+typedef enum omp_event_handle_t
+{
+  __omp_event_handle_t_max__ = __UINTPTR_MAX__
+} omp_event_handle_t;
+
+extern void omp_fulfill_event (omp_event_handle_t);
 
 void f (omp_event_handle_t x, omp_event_handle_t y, int z)
 {
diff --git a/gcc/testsuite/g++.dg/gomp/task-detach-1.C b/gcc/testsuite/g++.dg/gomp/task-detach-1.C
index 443d3e8..2f0c650 100644
--- a/gcc/testsuite/g++.dg/gomp/task-detach-1.C
+++ b/gcc/testsuite/g++.dg/gomp/task-detach-1.C
@@ -1,7 +1,10 @@
 // { dg-do compile }
 // { dg-options "-fopenmp" }
 
-#include 
+typedef enum omp_event_handle_t
+{
+  __omp_event_handle_t_max__ = __UINTPTR_MAX__
+} omp_event_handle_t;
 
 template 
 void func ()
diff --git a/gcc/testsuite/gcc.dg/gomp/task-detach-1.c b/gcc/testsuite/gcc.dg/gomp/task-detach-1.c
index fa7315e..611044d 100644
--- a/gcc/testsuite/gcc.dg/gomp/task-detach-1.c
+++ b/gcc/testsuite/gcc.dg/gomp/task-detach-1.c
@@ -1,7 +1,12 @@
 /* { dg-do compile } */
 /* { dg-options "-fopenmp" } */
 
-#include 
+typedef enum omp_event_handle_t
+{
+  __omp_event_handle_t_max__ = __UINTPTR_MAX__
+} omp_event_handle_t;
+
+extern void omp_fulfill_event (omp_event_handle_t);
 
 void f (omp_event_handle_t x)
 {
diff --git a/gcc/testsuite/gfortran.dg/gomp/task-detach-1.f90 b/gcc/testsuite/gfortran.dg/gomp/task-detach-1.f90
index dc51345..114068e 100644
--- a/gcc/testsuite/gfortran.dg/gomp/task-detach-1.f90
+++ b/gcc/testsuite/gfortran.dg/gomp/task-detach-1.f90
@@ -2,8 +2,10 @@
 ! { dg-options "-fopenmp" }
 
 program task_detach_1
-  use omp_lib
-
+  use iso_c_binding, only: c_intptr_t
+  implicit none
+  
+  integer, parameter :: omp_event_handle_kind = c_intptr_t
   integer (kind=omp_event_handle_kind) :: x, y
   integer :: z
   


[PATCH] i386: Use cpp_define_formatted for __SIZEOF_FLOAT80__ definition

2021-01-15 Thread Uros Bizjak via Gcc-patches
2021-01-15  Uroš Bizjak  

gcc/
* config/i386/i386-c.c (ix86_target_macros):
Use cpp_define_formatted for __SIZEOF_FLOAT80__ definition.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Pushed to mainline.

Uros.
diff --git a/gcc/config/i386/i386-c.c b/gcc/config/i386/i386-c.c
index a64d9be6106..ed4b098c810 100644
--- a/gcc/config/i386/i386-c.c
+++ b/gcc/config/i386/i386-c.c
@@ -757,10 +757,8 @@ ix86_target_macros (void)
   if (TARGET_LONG_DOUBLE_128)
 cpp_define (parse_in, "__LONG_DOUBLE_128__");
 
-  if (TARGET_128BIT_LONG_DOUBLE)
-cpp_define (parse_in, "__SIZEOF_FLOAT80__=16");
-  else
-cpp_define (parse_in, "__SIZEOF_FLOAT80__=12");
+  cpp_define_formatted (parse_in, "__SIZEOF_FLOAT80__=%d",
+   GET_MODE_SIZE (XFmode));
 
   cpp_define (parse_in, "__SIZEOF_FLOAT128__=16");
 
@@ -780,8 +778,7 @@ ix86_target_macros (void)
   cpp_define (parse_in, "__SEG_GS");
 
   if (flag_cf_protection != CF_NONE)
-cpp_define_formatted (parse_in, "__CET__=%d",
- flag_cf_protection & ~CF_SET);
+cpp_define_formatted (parse_in, "__CET__=%d", flag_cf_protection & ~CF_SET);
 }
 
 


Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-01-15 Thread Richard Biener
On January 15, 2021 5:16:40 PM GMT+01:00, Qing Zhao  
wrote:
>
>
>> On Jan 15, 2021, at 2:11 AM, Richard Biener 
>wrote:
>> 
>> 
>> 
>> On Thu, 14 Jan 2021, Qing Zhao wrote:
>> 
>>> Hi, 
>>> More data on code size and compilation time with CPU2017:
>>> Compilation time data:   the numbers are the slowdown
>against the
>>> default “no”:
>>> benchmarks  A/no D/no
>>> 
>>> 500.perlbench_r 5.19% 1.95%
>>> 502.gcc_r 0.46% -0.23%
>>> 505.mcf_r 0.00% 0.00%
>>> 520.omnetpp_r 0.85% 0.00%
>>> 523.xalancbmk_r 0.79% -0.40%
>>> 525.x264_r -4.48% 0.00%
>>> 531.deepsjeng_r 16.67% 16.67%
>>> 541.leela_r  0.00%  0.00%
>>> 557.xz_r 0.00%  0.00%
>>> 
>>> 507.cactuBSSN_r 1.16% 0.58%
>>> 508.namd_r 9.62% 8.65%
>>> 510.parest_r 0.48% 1.19%
>>> 511.povray_r 3.70% 3.70%
>>> 519.lbm_r 0.00% 0.00%
>>> 521.wrf_r 0.05% 0.02%
>>> 526.blender_r 0.33% 1.32%
>>> 527.cam4_r -0.93% -0.93%
>>> 538.imagick_r 1.32% 3.95%
>>> 544.nab_r  0.00% 0.00%
>>> From the above data, looks like that the compilation time impact
>>> from implementation A and D are almost the same.
>>> ***code size data: the numbers are the code size increase
>against the
>>> default “no”:
>>> benchmarks A/no D/no
>>> 
>>> 500.perlbench_r 2.84% 0.34%
>>> 502.gcc_r 2.59% 0.35%
>>> 505.mcf_r 3.55% 0.39%
>>> 520.omnetpp_r 0.54% 0.03%
>>> 523.xalancbmk_r 0.36%  0.39%
>>> 525.x264_r 1.39% 0.13%
>>> 531.deepsjeng_r 2.15% -1.12%
>>> 541.leela_r 0.50% -0.20%
>>> 557.xz_r 0.31% 0.13%
>>> 
>>> 507.cactuBSSN_r 5.00% -0.01%
>>> 508.namd_r 3.64% -0.07%
>>> 510.parest_r 1.12% 0.33%
>>> 511.povray_r 4.18% 1.16%
>>> 519.lbm_r 8.83% 6.44%
>>> 521.wrf_r 0.08% 0.02%
>>> 526.blender_r 1.63% 0.45%
>>> 527.cam4_r  0.16% 0.06%
>>> 538.imagick_r 3.18% -0.80%
>>> 544.nab_r 5.76% -1.11%
>>> Avg 2.52% 0.36%
>>> From the above data, the implementation D is always better than A,
>it’s a
>>> surprising to me, not sure what’s the reason for this.
>> 
>> D probably inhibits most interesting loop transforms (check SPEC FP
>> performance).
>
>The call to .DEFERRED_INIT is marked as ECF_CONST:
>
>/* A function to represent an artificial initialization to an uninitialized
>   automatic variable. The first argument is the variable itself, the
>   second argument is the initialization type.  */
>DEF_INTERNAL_FN (DEFERRED_INIT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
>
>So, I assume that such const call should minimize the impact to loop
>optimizations. But yes, it will still inhibit some of the loop
>transformations.
>
>>  It will also most definitely disallow SRA which, when
>> an aggregate is not completely elided, tends to grow code.
>
>Make sense to me. 
>
>The run-time performance data for D and A are actually very similar as
>I posted in the previous email (I listed it here for convenience)
>
>Run-time performance overhead with A and D:
>
>benchmarks A / no  D /no
>
>500.perlbench_r1.25%   1.25%
>502.gcc_r  0.68%   1.80%
>505.mcf_r  0.68%   0.14%
>520.omnetpp_r  4.83%   4.68%
>523.xalancbmk_r0.18%   1.96%
>525.x264_r 1.55%   2.07%
>531.deepsjeng_ 11.57%  11.85%
>541.leela_r0.64%   0.80%
>557.xz_ -0.41% -0.41%
>
>507.cactuBSSN_r0.44%   0.44%
>508.namd_r 0.34%   0.34%
>510.parest_r   0.17%   0.25%
>511.povray_r   56.57%  57.27%
>519.lbm_r  0.00%   0.00%
>521.wrf_r   -0.28% -0.37%
>526.blender_r  16.96%  17.71%
>527.cam4_r 0.70%   0.53%
>538.imagick_r  2.40%   2.40%
>544.nab_r  0.00%   -0.65%
>
>avg5.17%   5.37%
>
>Especially for the SPEC FP benchmarks, I didn’t see too much
>performance difference between A and D. 
>I guess that the RTL optimizations might be enough to get rid of most
>of the overhead introduced by the additional initialization. 
>
>> 
>>> stack usage data, I added -fstack-usage to the compilation
>line when
>>> compiling CPU2017 benchmarks. And all the *.su files were generated
>for each
>>> of the modules.
>>> Since there a lot of such files, and the stack size information are
>embedded
>>> in each of the files.  I just picked up one benchmark 511.povray to
>>> check. Which is the one that 
>>> has the most runtime overhead when adding initialization (both A and
>D). 
>>> I identified all the *.su files that are different between A and D
>and do a
>>> diff on those *.su files, and looks like that the stack size is much
>higher
>>> with D than that with A, for example:
>>> $ diff build_base_auto_init.D./bbox.su
>>> build_base_auto_init.A./bbox.su5c5
>>> < bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**,
>>> pov::BBOX_TREE**&, long int*, long int, long int) 160 static
>>> ---
>>> > bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**,
>>> pov::BBOX_TREE**&, long int*, long int, long int) 96 static
>>> $ diff build_base_

Re: [PATCH]AArch64: Add NEON, SVE and SVE2 RTL patterns for Multiply, FMS and FMA.

2021-01-15 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
> Hi All,
>
> This adds implementations of the optabs for complex operations.  With this the
> following C code:
>
>   void g (float complex a[restrict N], float complex b[restrict N],
> float complex c[restrict N])
>   {
> for (int i=0; i < N; i++)
>   c[i] =  a[i] * b[i];
>   }
>
> generates
>
>
> NEON:
>
> g:
> moviv3.4s, 0
> mov x3, 0
> .p2align 3,,7
> .L2:
> mov v0.16b, v3.16b
> ldr q2, [x1, x3]
> ldr q1, [x0, x3]
> fcmla   v0.4s, v1.4s, v2.4s, #0
> fcmla   v0.4s, v1.4s, v2.4s, #90
> str q0, [x2, x3]
> add x3, x3, 16
> cmp x3, 1600
> bne .L2
> ret
>
> SVE:
>
> g:
> mov x3, 0
> mov x4, 400
> ptrue   p1.b, all
> whilelo p0.s, xzr, x4
> mov z3.s, #0
> .p2align 3,,7
> .L2:
> ld1wz1.s, p0/z, [x0, x3, lsl 2]
> ld1wz2.s, p0/z, [x1, x3, lsl 2]
> movprfx z0, z3
> fcmla   z0.s, p1/m, z1.s, z2.s, #0
> fcmla   z0.s, p1/m, z1.s, z2.s, #90
> st1wz0.s, p0, [x2, x3, lsl 2]
> incwx3
> whilelo p0.s, x3, x4
> b.any   .L2
> ret
>
> SVE2 (with int instead of float)
> g:
> mov x3, 0
> mov x4, 400
> mov z3.b, #0
> whilelo p0.s, xzr, x4
> .p2align 3,,7
> .L2:
> ld1wz1.s, p0/z, [x0, x3, lsl 2]
> ld1wz2.s, p0/z, [x1, x3, lsl 2]
> movprfx z0, z3
> cmlaz0.s, z1.s, z2.s, #0
> cmlaz0.s, z1.s, z2.s, #90
> st1wz0.s, p0, [x2, x3, lsl 2]
> incwx3
> whilelo p0.s, x3, x4
> b.any   .L2
> ret
>
>
> It defined a new iterator VALL_ARITH which contains types for which we can do
> general arithmetic (excludes bfloat16).

It doesn't look like anything uses this though.  Is it just left over
from the previous version?

>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> Checked with armv8-a+sve2+fp16 and no issues.  Note that due to a mid-end
> limitation SLP for SVE currently fails for some permutes.  The tests have 
> these
> marked as XFAIL.  I do intend to fix this soon.
>
> Execution tests verified with QEMU.
>
> Matching tests for these are in the mid-end patches.  This I will turn on for
> these patterns in a separate patch.
>
> Ok for master?
>
> Thanks,
> Tamar
>
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-simd.md (cml4,
>   cmul3): New.
>   * config/aarch64/iterators.md (VALL_ARITH, UNSPEC_FCMUL,
>   UNSPEC_FCMUL180, UNSPEC_FCMLA_CONJ, UNSPEC_FCMLA180_CONJ,
>   UNSPEC_CMLA_CONJ, UNSPEC_CMLA180_CONJ, UNSPEC_CMUL, UNSPEC_CMUL180,
>   FCMLA_OP, FCMUL_OP, rot_op, rotsplit1, rotsplit2, fcmac1, sve_rot1,
>   sve_rot2, SVE2_INT_CMLA_OP, SVE2_INT_CMUL_OP, SVE2_INT_CADD_OP): New.
>   (rot): Add UNSPEC_FCMUL, UNSPEC_FCMUL180.
>   * config/aarch64/aarch64-sve.md (cml4,
>   cmul3): New.
>   * config/aarch64/aarch64-sve2.md (cml4,
>   cmul3): New.
>
> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 4b869ded918fd91ffd41e6ba068239a752b331e5..8a5f1dad224a99a8ba30669139259922a1250d0e 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -516,6 +516,47 @@ (define_insn "aarch64_fcmlaq_lane"
>[(set_attr "type" "neon_fcmla")]
>  )
>  
> +;; The complex mla/mls operations always need to expand to two instructions.
> +;; The first operation does half the computation and the second does the
> +;; remainder.  Because of this, expand early.
> +(define_expand "cml4"
> +  [(set (match_operand:VHSDF 0 "register_operand")
> + (plus:VHSDF (match_operand:VHSDF 1 "register_operand")
> + (unspec:VHSDF [(match_operand:VHSDF 2 "register_operand")
> +(match_operand:VHSDF 3 "register_operand")]
> +FCMLA_OP)))]
> +  "TARGET_COMPLEX && !BYTES_BIG_ENDIAN"
> +{
> +  rtx tmp = gen_reg_rtx (mode);
> +  emit_insn (gen_aarch64_fcmla (tmp, operands[1],
> +  operands[3], operands[2]));
> +  emit_insn (gen_aarch64_fcmla (operands[0], tmp,
> +  operands[3], operands[2]));
> +  DONE;
> +})
> +
> +;; The complex mul operations always need to expand to two instructions.
> +;; The first operation does half the computation and the second does the
> +;; remainder.  Because of this, expand early.
> +(define_expand "cmul3"
> +  [(set (match_operand:VHSDF 0 "register_operand")
> + (unspec:VHSDF [(match_operand:VHSDF 1 "register_operand")
> +(match_operand:VHSDF 2 "register_operand")]
> +FCMUL_OP))]
> +  "TARGET_COMPLEX && !BYTES_BIG_ENDIAN"
> +{
> +  rtx tmp = gen_reg_rtx (mode);
> +  rtx res1 = 

Re: [patch] gcc.dg/analyzer tests: relax dependency on alloca.h

2021-01-15 Thread Alexandre Oliva
On Jan 15, 2021, Olivier Hainque  wrote:

> On 14 Jan 2021, at 22:13, Alexandre Oliva  wrote:

>> Would you mind if I submitted an alternate patch to do so?

> Not at all, thanks for your feedback and for proposing
> an alternative!

Here's the modified patch.  Regstrapped on x86_64-linux-gnu, also tested
on x-arm-vx7r2.  David, I'm leaning towards putting it in as "obvious",
barring any objections.


gcc.dg/analyzer tests: use __builtin_alloca, not alloca.h

From: Alexandre Oliva 

Use __builtin_alloca.  Some systems don't have alloca.h or alloca.


Co-Authored-By: Olivier Hainque 

for  gcc/testsuite/ChangeLog

* gcc.dg/analyzer/alloca-leak.c: Drop alloca.h, use builtin.
* gcc.dg/analyzer/data-model-1.c: Likewise.
* gcc.dg/analyzer/malloc-1.c: Likewise.
* gcc.dg/analyzer/malloc-paths-8.c: Likewise.
---
 gcc/testsuite/gcc.dg/analyzer/alloca-leak.c|4 +---
 gcc/testsuite/gcc.dg/analyzer/data-model-1.c   |5 ++---
 gcc/testsuite/gcc.dg/analyzer/malloc-1.c   |3 +--
 gcc/testsuite/gcc.dg/analyzer/malloc-paths-8.c |7 +++
 4 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/analyzer/alloca-leak.c b/gcc/testsuite/gcc.dg/analyzer/alloca-leak.c
index 93319932d44ac..073f97e1ade32 100644
--- a/gcc/testsuite/gcc.dg/analyzer/alloca-leak.c
+++ b/gcc/testsuite/gcc.dg/analyzer/alloca-leak.c
@@ -1,10 +1,8 @@
 /* { dg-require-effective-target alloca } */
 
-#include 
-
 void *test (void)
 {
-  void *ptr = alloca (64);
+  void *ptr = __builtin_alloca (64);
   return ptr;
 }
 /* TODO: warn about escaping alloca.  */
diff --git a/gcc/testsuite/gcc.dg/analyzer/data-model-1.c b/gcc/testsuite/gcc.dg/analyzer/data-model-1.c
index 3f16a38ab14d4..f6681b678af61 100644
--- a/gcc/testsuite/gcc.dg/analyzer/data-model-1.c
+++ b/gcc/testsuite/gcc.dg/analyzer/data-model-1.c
@@ -3,7 +3,6 @@
 #include 
 #include 
 #include 
-#include 
 #include "analyzer-decls.h"
 
 struct foo
@@ -140,8 +139,8 @@ void test_11 (void)
 
 void test_12 (void)
 {
-  void *p = alloca (256);
-  void *q = alloca (256);
+  void *p = __builtin_alloca (256);
+  void *q = __builtin_alloca (256);
 
   /* alloca results should be unique.  */
   __analyzer_eval (p == q); /* { dg-warning "FALSE" } */
diff --git a/gcc/testsuite/gcc.dg/analyzer/malloc-1.c b/gcc/testsuite/gcc.dg/analyzer/malloc-1.c
index 26d828848a259..448b8558ffe11 100644
--- a/gcc/testsuite/gcc.dg/analyzer/malloc-1.c
+++ b/gcc/testsuite/gcc.dg/analyzer/malloc-1.c
@@ -1,6 +1,5 @@
 /* { dg-require-effective-target alloca } */
 
-#include 
 #include 
 
 extern int foo (void);
@@ -273,7 +272,7 @@ int *test_23a (int n)
 
 int test_24 (void)
 {
-  void *ptr = alloca (sizeof (int)); /* { dg-message "memory is allocated on the stack here" } */
+  void *ptr = __builtin_alloca (sizeof (int)); /* { dg-message "memory is allocated on the stack here" } */
   free (ptr); /* { dg-warning "'free' of memory allocated on the stack by 'alloca' \\('ptr'\\) will corrupt the heap \\\[CWE-590\\\]" } */
 }
 
diff --git a/gcc/testsuite/gcc.dg/analyzer/malloc-paths-8.c b/gcc/testsuite/gcc.dg/analyzer/malloc-paths-8.c
index 35c9385b20611..9a7c414920ce2 100644
--- a/gcc/testsuite/gcc.dg/analyzer/malloc-paths-8.c
+++ b/gcc/testsuite/gcc.dg/analyzer/malloc-paths-8.c
@@ -2,7 +2,6 @@
 /* { dg-require-effective-target alloca } */
 
 #include 
-#include 
 #include 
 
 extern void do_stuff (const void *);
@@ -15,7 +14,7 @@ void test_1 (size_t sz)
   if (sz >= LIMIT)
 ptr = malloc (sz);
   else
-ptr = alloca (sz);
+ptr = __builtin_alloca (sz);
 
   do_stuff (ptr);
 
@@ -27,7 +26,7 @@ void test_2 (size_t sz)
 {
   void *ptr;
   if (sz < LIMIT)
-ptr = alloca (sz);
+ptr = __builtin_alloca (sz);
   else
 ptr = malloc (sz);
 
@@ -41,7 +40,7 @@ void test_3 (size_t sz)
 {
   void *ptr;
   if (sz <= LIMIT)
-ptr = alloca (sz); /* { dg-message "memory is allocated on the stack here" } */
+ptr = __builtin_alloca (sz); /* { dg-message "memory is allocated on the stack here" } */
   else
 ptr = malloc (sz);
 


-- 
Alexandre Oliva, happy hacker  https://FSFLA.org/blogs/lxo/
   Free Software Activist GNU Toolchain Engineer
Vim, Vi, Voltei pro Emacs -- GNUlius Caesar


Re: The performance data for two different implementation of new security feature -ftrivial-auto-var-init

2021-01-15 Thread Qing Zhao via Gcc-patches



> On Jan 15, 2021, at 11:22 AM, Richard Biener  wrote:
> 
> On January 15, 2021 5:16:40 PM GMT+01:00, Qing Zhao  > wrote:
>> 
>> 
>>> On Jan 15, 2021, at 2:11 AM, Richard Biener 
>> wrote:
>>> 
>>> 
>>> 
>>> On Thu, 14 Jan 2021, Qing Zhao wrote:
>>> 
 Hi, 
 More data on code size and compilation time with CPU2017:
 Compilation time data:   the numbers are the slowdown
>> against the
 default “no”:
 benchmarks  A/no D/no
 
 500.perlbench_r 5.19% 1.95%
 502.gcc_r 0.46% -0.23%
 505.mcf_r 0.00% 0.00%
 520.omnetpp_r 0.85% 0.00%
 523.xalancbmk_r 0.79% -0.40%
 525.x264_r -4.48% 0.00%
 531.deepsjeng_r 16.67% 16.67%
 541.leela_r  0.00%  0.00%
 557.xz_r 0.00%  0.00%
 
 507.cactuBSSN_r 1.16% 0.58%
 508.namd_r 9.62% 8.65%
 510.parest_r 0.48% 1.19%
 511.povray_r 3.70% 3.70%
 519.lbm_r 0.00% 0.00%
 521.wrf_r 0.05% 0.02%
 526.blender_r 0.33% 1.32%
 527.cam4_r -0.93% -0.93%
 538.imagick_r 1.32% 3.95%
 544.nab_r  0.00% 0.00%
 From the above data, looks like that the compilation time impact
 from implementation A and D are almost the same.
 ***code size data: the numbers are the code size increase
>> against the
 default “no”:
 benchmarks A/no D/no
 
 500.perlbench_r 2.84% 0.34%
 502.gcc_r 2.59% 0.35%
 505.mcf_r 3.55% 0.39%
 520.omnetpp_r 0.54% 0.03%
 523.xalancbmk_r 0.36%  0.39%
 525.x264_r 1.39% 0.13%
 531.deepsjeng_r 2.15% -1.12%
 541.leela_r 0.50% -0.20%
 557.xz_r 0.31% 0.13%
 
 507.cactuBSSN_r 5.00% -0.01%
 508.namd_r 3.64% -0.07%
 510.parest_r 1.12% 0.33%
 511.povray_r 4.18% 1.16%
 519.lbm_r 8.83% 6.44%
 521.wrf_r 0.08% 0.02%
 526.blender_r 1.63% 0.45%
 527.cam4_r  0.16% 0.06%
 538.imagick_r 3.18% -0.80%
 544.nab_r 5.76% -1.11%
 Avg 2.52% 0.36%
 From the above data, the implementation D is always better than A,
>> it’s a
 surprising to me, not sure what’s the reason for this.
>>> 
>>> D probably inhibits most interesting loop transforms (check SPEC FP
>>> performance).
>> 
>> The call to .DEFERRED_INIT is marked as ECF_CONST:
>> 
>> /* A function to represent an artificial initialization to an uninitialized
>>  automatic variable. The first argument is the variable itself, the
>>  second argument is the initialization type.  */
>> DEF_INTERNAL_FN (DEFERRED_INIT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
>> 
>> So, I assume that such const call should minimize the impact to loop
>> optimizations. But yes, it will still inhibit some of the loop
>> transformations.
>> 
>>> It will also most definitely disallow SRA which, when
>>> an aggregate is not completely elided, tends to grow code.
>> 
>> Make sense to me. 
>> 
>> The run-time performance data for D and A are actually very similar as
>> I posted in the previous email (I listed it here for convenience)
>> 
>> Run-time performance overhead with A and D:
>> 
>> benchmarks   A / no  D /no
>> 
>> 500.perlbench_r  1.25%   1.25%
>> 502.gcc_r0.68%   1.80%
>> 505.mcf_r0.68%   0.14%
>> 520.omnetpp_r4.83%   4.68%
>> 523.xalancbmk_r  0.18%   1.96%
>> 525.x264_r   1.55%   2.07%
>> 531.deepsjeng_   11.57%  11.85%
>> 541.leela_r  0.64%   0.80%
>> 557.xz_   -0.41% -0.41%
>> 
>> 507.cactuBSSN_r  0.44%   0.44%
>> 508.namd_r   0.34%   0.34%
>> 510.parest_r 0.17%   0.25%
>> 511.povray_r 56.57%  57.27%
>> 519.lbm_r0.00%   0.00%
>> 521.wrf_r -0.28% -0.37%
>> 526.blender_r16.96%  17.71%
>> 527.cam4_r   0.70%   0.53%
>> 538.imagick_r2.40%   2.40%
>> 544.nab_r0.00%   -0.65%
>> 
>> avg  5.17%   5.37%
>> 
>> Especially for the SPEC FP benchmarks, I didn’t see too much
>> performance difference between A and D. 
>> I guess that the RTL optimizations might be enough to get rid of most
>> of the overhead introduced by the additional initialization. 
>> 
>>> 
 stack usage data, I added -fstack-usage to the compilation
>> line when
 compiling CPU2017 benchmarks. And all the *.su files were generated
>> for each
 of the modules.
 Since there a lot of such files, and the stack size information are
>> embedded
 in each of the files.  I just picked up one benchmark 511.povray to
 check. Which is the one that 
 has the most runtime overhead when adding initialization (both A and
>> D). 
 I identified all the *.su files that are different between A and D
>> and do a
 diff on those *.su files, and looks like that the stack size is much
>> higher
 with D than that with A, for example:
 $ diff build_base_auto_init.D./bbox.su
 build_base_auto_init.A./bbox.su5c5
 < bbox.cpp:1782:12:int pov::sort_and_split(pov::BBOX_TREE**,
 pov::BBOX_TREE**&, long int*,

Re: [PATCH] [WIP] openmp: Add OpenMP 5.0 task detach clause support

2021-01-15 Thread Jakub Jelinek via Gcc-patches
On Fri, Jan 15, 2021 at 04:58:25PM +, Kwok Cheung Yeung wrote:
> On 15/01/2021 3:07 pm, Kwok Cheung Yeung wrote:
> > I have tested bootstrapping on x86_64 (no offloading) with no issues,
> > and running the libgomp testsuite with Nvidia offloading shows no
> > regressions. I have also tested all the gomp.exp tests in the main gcc
> > testsuite, also with no issues. I am currently still running the full
> > testsuite, but do not anticipate any problems.
> > 
> > Okay to commit on trunk, if the full testsuite run does not show any 
> > regressions?
> 
> Found an issue already :-( - the libgomp include files are not found when
> the tests are run via 'make check'. I have now included the relevant parts
> of the include files in the tests themselves. Okay for trunk (to be merged
> into the main patch)?

This incremental patch is ok.
I'll try to review the previous patch tomorrow.

Jakub



Re: [patch] gcc.dg/analyzer tests: relax dependency on alloca.h

2021-01-15 Thread David Malcolm via Gcc-patches
On Fri, 2021-01-15 at 14:45 -0300, Alexandre Oliva wrote:
> On Jan 15, 2021, Olivier Hainque  wrote:
> 
> > On 14 Jan 2021, at 22:13, Alexandre Oliva 
> > wrote:
> > > Would you mind if I submitted an alternate patch to do so?
> > Not at all, thanks for your feedback and for proposing
> > an alternative!
> 
> Here's the modified patch.  Regstrapped on x86_64-linux-gnu, also
> tested
> on x-arm-vx7r2.  David, I'm leaning towards putting it in as
> "obvious",
> barring any objections.

I think an issue here was that I assumed check_effective_target_alloca
checks that "alloca" is supported, whereas I now see that I was wrong;
it actually checks for "__builtin_alloca".

I have no objections to the patch.

Thanks
Dave




Re: [PATCH 1/3] PowerPC: Add long double target-supports.

2021-01-15 Thread Joseph Myers
On Thu, 14 Jan 2021, Michael Meissner via Gcc-patches wrote:

> +return [check_runtime_nocache ppc_long_double_ovveride_ibm128 {

> +return [check_runtime_nocache ppc_long_double_ovveride_ieee128 {

> +return [check_runtime_nocache ppc_long_double_ovveride_64bit {

All these places have the typo "ovveride".

-- 
Joseph S. Myers
jos...@codesourcery.com


[PATCH] aarch64: Implement vmlsl[_high]* intrinsics using builtins

2021-01-15 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

This patch reimplements some more intrinsics using RTL builtins in the 
straightforward way.
Thankfully most of the RTL infrastructure is already in place for it.

Bootstrapped and tested on aarch64-none-linux-gnu.

Pushing to trunk.
Thanks,
Kyrill

gcc/
* config/aarch64/aarch64-simd.md (*aarch64_mlsl_hi): Rename 
to...
(aarch64_mlsl_hi): ... This.
(aarch64_mlsl_hi): Define.
(*aarch64_mlslmlsl

vmlsl.patch
Description: vmlsl.patch


[PATCH] match.pd: Optimize (x < 0) ^ (y < 0) to (x ^ y) < 0 etc. [PR96681]

2021-01-15 Thread Jakub Jelinek via Gcc-patches
Hi!

This patch simplifies xors of two comparisons that each test the sign bit.
If the comparisons are both < 0 or both >= 0, then we should xor the operands
together and compare the result to < 0; if the comparisons differ,
we should compare to >= 0.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2021-01-15  Jakub Jelinek  

PR tree-optimization/96681
* match.pd ((x < 0) ^ (y < 0) to (x ^ y) < 0): New simplification.
((x >= 0) ^ (y >= 0) to (x ^ y) < 0): Likewise.
((x < 0) ^ (y >= 0) to (x ^ y) >= 0): Likewise.
((x >= 0) ^ (y < 0) to (x ^ y) >= 0): Likewise.

* gcc.dg/tree-ssa/pr96681.c: New test.

--- gcc/match.pd.jj 2021-01-15 13:12:11.232019067 +0100
+++ gcc/match.pd2021-01-15 14:00:21.567135280 +0100
@@ -3993,6 +3993,24 @@ (define_operator_list COND_TERNARY
(if (single_use (@2))
 (cmp @0 @1)
 
+/* Simplify (x < 0) ^ (y < 0) to (x ^ y) < 0 and
+   (x >= 0) ^ (y >= 0) to (x ^ y) < 0.  */
+(for cmp (lt ge)
+ (simplify
+  (bit_xor (cmp:s @0 integer_zerop) (cmp:s @1 integer_zerop))
+   (if (INTEGRAL_TYPE_P (TREE_TYPE (@0))
+   && !TYPE_UNSIGNED (TREE_TYPE (@0))
+   && types_match (TREE_TYPE (@0), TREE_TYPE (@1)))
+(lt (bit_xor @0 @1) { build_zero_cst (TREE_TYPE (@0)); }
+/* Simplify (x < 0) ^ (y >= 0) to (x ^ y) >= 0 and
+   (x >= 0) ^ (y < 0) to (x ^ y) >= 0.  */
+(simplify
+ (bit_xor:c (lt:s @0 integer_zerop) (ge:s @1 integer_zerop))
+  (if (INTEGRAL_TYPE_P (TREE_TYPE (@0))
+   && !TYPE_UNSIGNED (TREE_TYPE (@0))
+   && types_match (TREE_TYPE (@0), TREE_TYPE (@1)))
+   (ge (bit_xor @0 @1) { build_zero_cst (TREE_TYPE (@0)); })))
+
 /* Transform comparisons of the form X * C1 CMP 0 to X CMP 0 in the
signed arithmetic case.  That form is created by the compiler
often enough for folding it to be of value.  One example is in
--- gcc/testsuite/gcc.dg/tree-ssa/pr96681.c.jj  2021-01-15 14:16:25.254366911 
+0100
+++ gcc/testsuite/gcc.dg/tree-ssa/pr96681.c 2021-01-15 14:18:30.618941775 
+0100
@@ -0,0 +1,35 @@
+/* PR tree-optimization/96681 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+/* { dg-final { scan-tree-dump-times " \\^ " 5 "optimized" } } */
+/* { dg-final { scan-tree-dump-times " (?:<|>=) 0" 5 "optimized" } } */
+
+int
+foo (int x, int y)
+{
+  return (x < 0) ^ (y < 0);
+}
+
+int
+bar (int x, int y)
+{
+  return (x > -1) ^ (y > -1);
+}
+
+int
+baz (int x, int y)
+{
+  return (x ^ y) < 0;
+}
+
+int
+qux (int x, int y)
+{
+  return (x ^ y) >= 0;
+}
+
+int
+corge (int x, int y)
+{
+  return (x >= 0) ^ (y < 0);
+}

Jakub



[committed] testsuite: Add testcase coverage for already fixed [PR96671]

2021-01-15 Thread Jakub Jelinek via Gcc-patches
Hi!

The fix for this PR didn't come with any test coverage; I've added
tests that make sure we optimize it no matter what order of the x ^ y ^ z
operands is used.

Bootstrapped/regtested on x86_64-linux and i686-linux, committed to trunk.

2021-01-15  Jakub Jelinek  

PR tree-optimization/96671
* gcc.dg/tree-ssa/pr96671-1.c: New test.
* gcc.dg/tree-ssa/pr96671-2.c: New test.

--- gcc/testsuite/gcc.dg/tree-ssa/pr96671-1.c.jj2021-01-15 
14:44:54.694936995 +0100
+++ gcc/testsuite/gcc.dg/tree-ssa/pr96671-1.c   2021-01-15 14:45:40.248419337 
+0100
@@ -0,0 +1,51 @@
+/* PR tree-optimization/96671 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+/* { dg-final { scan-tree-dump-times " \\^ " 6 "optimized" } } */
+/* { dg-final { scan-tree-dump-times " ~" 6 "optimized" } } */
+/* { dg-final { scan-tree-dump-times " & " 6 "optimized" } } */
+
+int
+foo (int a, int b, int c)
+{
+  return (a ^ b) & ((b ^ c) ^ a);
+}
+
+int
+bar (int a, int b, int c)
+{
+  return (a ^ b) & ((b ^ a) ^ c);
+}
+
+int
+baz (int a, int b, int c)
+{
+  return (a ^ b) & ((a ^ c) ^ b);
+}
+
+int
+qux (int a, int b, int c)
+{
+  int d = a ^ b;
+  int e = b ^ c;
+  int f = e ^ a;
+  return d & f;
+}
+
+int
+corge (int a, int b, int c)
+{
+  int d = a ^ b;
+  int e = b ^ a;
+  int f = c ^ e;
+  return d & f;
+}
+
+int
+garply (int a, int b, int c)
+{
+  int d = a ^ b;
+  int e = a ^ c;
+  int f = b ^ e;
+  return d & f;
+}
--- gcc/testsuite/gcc.dg/tree-ssa/pr96671-2.c.jj2021-01-15 
14:44:57.665903235 +0100
+++ gcc/testsuite/gcc.dg/tree-ssa/pr96671-2.c   2021-01-15 14:45:49.565313469 
+0100
@@ -0,0 +1,51 @@
+/* PR tree-optimization/96671 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+/* { dg-final { scan-tree-dump-times " \\^ " 6 "optimized" } } */
+/* { dg-final { scan-tree-dump-not " ~" "optimized" } } */
+/* { dg-final { scan-tree-dump-times " \\| " 6 "optimized" } } */
+
+int
+foo (int a, int b, int c)
+{
+  return (a ^ b) | ((b ^ c) ^ a);
+}
+
+int
+bar (int a, int b, int c)
+{
+  return (a ^ b) | ((b ^ a) ^ c);
+}
+
+int
+baz (int a, int b, int c)
+{
+  return (a ^ b) | ((a ^ c) ^ b);
+}
+
+int
+qux (int a, int b, int c)
+{
+  int d = a ^ b;
+  int e = b ^ c;
+  int f = e ^ a;
+  return d | f;
+}
+
+int
+corge (int a, int b, int c)
+{
+  int d = a ^ b;
+  int e = b ^ a;
+  int f = c ^ e;
+  return d | f;
+}
+
+int
+garply (int a, int b, int c)
+{
+  int d = a ^ b;
+  int e = a ^ c;
+  int f = b ^ e;
+  return d | f;
+}


Jakub



[committed] bootstrap: fix failing diagnostic selftest on Windows [PR98696]

2021-01-15 Thread David Malcolm via Gcc-patches
In one of the selftests in g:f10960558540636800cf5d3d6355969621fbc17e
I didn't consider that paths can contain backslashes, which happens
for the tempfiles on Windows hosts.

Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu.
Confirmed by the reporter as fixing the issue on Windows.

Pushed as r11-6730-ga3128bf01289a243a9e0ebb4e34c23bcb04cb938.

gcc/ChangeLog:
PR bootstrap/98696
* diagnostic.c
(selftest::test_print_parseable_fixits_bytes_vs_display_columns):
Escape the tempfile name when constructing the expected output.
---
 gcc/diagnostic.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/gcc/diagnostic.c b/gcc/diagnostic.c
index 7d65ac7379f..11ac9064d42 100644
--- a/gcc/diagnostic.c
+++ b/gcc/diagnostic.c
@@ -2155,7 +2155,12 @@ test_print_parseable_fixits_bytes_vs_display_columns ()
   where.m_finish = linemap_position_for_column (line_table, 17);
   richloc.add_fixit_replace (where, "color");
 
-  const int buf_len = strlen (fname) + 100;
+  /* Escape fname.  */
+  pretty_printer tmp_pp;
+  print_escaped_string (&tmp_pp, fname);
+  char *escaped_fname = xstrdup (pp_formatted_text (&tmp_pp));
+
+  const int buf_len = strlen (escaped_fname) + 100;
   char *const expected = XNEWVEC (char, buf_len);
 
   {
@@ -2163,7 +2168,7 @@ test_print_parseable_fixits_bytes_vs_display_columns ()
 print_parseable_fixits (&pp, &richloc, DIAGNOSTICS_COLUMN_UNIT_BYTE,
tabstop);
 snprintf (expected, buf_len,
- "fix-it:\"%s\":{1:12-1:18}:\"color\"\n", fname);
+ "fix-it:%s:{1:12-1:18}:\"color\"\n", escaped_fname);
 ASSERT_STREQ (expected, pp_formatted_text (&pp));
   }
   {
@@ -2171,11 +2176,12 @@ test_print_parseable_fixits_bytes_vs_display_columns ()
 print_parseable_fixits (&pp, &richloc, DIAGNOSTICS_COLUMN_UNIT_DISPLAY,
tabstop);
 snprintf (expected, buf_len,
- "fix-it:\"%s\":{1:10-1:16}:\"color\"\n", fname);
+ "fix-it:%s:{1:10-1:16}:\"color\"\n", escaped_fname);
 ASSERT_STREQ (expected, pp_formatted_text (&pp));
   }
 
   XDELETEVEC (expected);
+  free (escaped_fname);
 }
 
 /* Verify that
-- 
2.26.2



[PATCH] strlen: Return TODO_update_address_taken when memcmp has been optimized [PR96271]

2021-01-15 Thread Jakub Jelinek via Gcc-patches
Hi!

On the following testcase, handle_builtin_memcmp in the strlen pass folds
the memcmp into comparison of two MEM_REFs.  But nothing triggers updating
of addressable vars afterwards, so even when the parameters are no longer
address taken, we force the parameters to stack and back anyway.

The following patch fixes that.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2021-01-15  Jakub Jelinek  

PR tree-optimization/96271
* tree-ssa-strlen.c (handle_builtin_memcmp): Add UPDATE_ADDRESS_TAKEN
argument, set what it points to to true if optimizing memcmp into
a simple MEM_REF comparison.
(strlen_check_and_optimize_call): Add UPDATE_ADDRESS_TAKEN argument
and pass it through to handle_builtin_memcmp.
(check_and_optimize_stmt): Add UPDATE_ADDRESS_TAKEN argument
and pass it through to strlen_check_and_optimize_call.
(strlen_dom_walker): Replace m_cleanup_cfg with todo.
(strlen_dom_walker::before_dom_children): Adjust for the above change,
adjust check_and_optimize_stmt caller and or in into todo
TODO_cleanup_cfg and/or TODO_update_address_taken.
(printf_strlen_execute): Return todo instead of conditionally
TODO_cleanup_cfg.

* gcc.target/i386/pr96271.c: New test.

--- gcc/tree-ssa-strlen.c.jj2021-01-04 10:25:40.0 +0100
+++ gcc/tree-ssa-strlen.c   2021-01-15 15:13:09.847839781 +0100
@@ -3813,7 +3813,7 @@ use_in_zero_equality (tree res, bool exc
return true when call is transformed, return false otherwise.  */
 
 static bool
-handle_builtin_memcmp (gimple_stmt_iterator *gsi)
+handle_builtin_memcmp (gimple_stmt_iterator *gsi, bool *update_address_taken)
 {
   gcall *stmt = as_a  (gsi_stmt (*gsi));
   tree res = gimple_call_lhs (stmt);
@@ -3858,6 +3858,7 @@ handle_builtin_memcmp (gimple_stmt_itera
   boolean_type_node,
   arg1, arg2));
  gimplify_and_update_call_from_tree (gsi, res);
+ *update_address_taken = true;
  return true;
}
 }
@@ -5110,7 +5111,7 @@ is_char_type (tree type)
 
 static bool
 strlen_check_and_optimize_call (gimple_stmt_iterator *gsi, bool *zero_write,
-   pointer_query &ptr_qry)
+   bool *update_address_taken, pointer_query 
&ptr_qry)
 {
   gimple *stmt = gsi_stmt (*gsi);
 
@@ -5179,7 +5180,7 @@ strlen_check_and_optimize_call (gimple_s
return false;
   break;
 case BUILT_IN_MEMCMP:
-  if (handle_builtin_memcmp (gsi))
+  if (handle_builtin_memcmp (gsi, update_address_taken))
return false;
   break;
 case BUILT_IN_STRCMP:
@@ -5341,12 +5342,13 @@ handle_integral_assign (gimple_stmt_iter
 /* Attempt to check for validity of the performed access a single statement
at *GSI using string length knowledge, and to optimize it.
If the given basic block needs clean-up of EH, CLEANUP_EH is set to
-   true.  Return true to let the caller advance *GSI to the next statement
-   in the basic block and false otherwise.  */
+   true.  If it is to update addressables at the end of the pass, set
+   *UPDATE_ADDRESS_TAKEN to true.  Return true to let the caller advance *GSI
+   to the next statement in the basic block and false otherwise.  */
 
 static bool
 check_and_optimize_stmt (gimple_stmt_iterator *gsi, bool *cleanup_eh,
-pointer_query &ptr_qry)
+bool *update_address_taken, pointer_query &ptr_qry)
 {
   gimple *stmt = gsi_stmt (*gsi);
 
@@ -5356,7 +5358,8 @@ check_and_optimize_stmt (gimple_stmt_ite
 
   if (is_gimple_call (stmt))
 {
-  if (!strlen_check_and_optimize_call (gsi, &zero_write, ptr_qry))
+  if (!strlen_check_and_optimize_call (gsi, &zero_write,
+  update_address_taken, ptr_qry))
return false;
 }
   else if (!flag_optimize_strlen || !strlen_optimize)
@@ -5488,7 +5491,7 @@ public:
 evrp (false),
 ptr_qry (&evrp, &var_cache),
 var_cache (),
-m_cleanup_cfg (false)
+todo (0)
   { }
 
   virtual edge before_dom_children (basic_block);
@@ -5503,9 +5506,8 @@ public:
   pointer_query ptr_qry;
   pointer_query::cache_type var_cache;
 
-  /* Flag that will trigger TODO_cleanup_cfg to be returned in strlen
- execute function.  */
-  bool m_cleanup_cfg;
+  /* TODO_* flags for the pass.  */
+  int todo;
 };
 
 /* Callback for walk_dominator_tree.  Attempt to optimize various
@@ -5586,6 +5588,7 @@ strlen_dom_walker::before_dom_children (
 }
 
   bool cleanup_eh = false;
+  bool update_address_taken = false;
 
   /* Attempt to optimize individual statements.  */
   for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi); )
@@ -5599,12 +5602,15 @@ strlen_dom_walker::before_dom_children (
   /* Reset search depth preformance counter.  */
   ptr_qry.depth = 0;
 
-   

[PATCH] match.pd: Generalize the PR64309 simplifications [PR96669]

2021-01-15 Thread Jakub Jelinek via Gcc-patches
Hi!

The following patch generalizes the PR64309 simplifications, so that instead
of working only with the constants 1 and 1 it works with any two power-of-two
constants, and it also handles right shifts (in that case it rules out a
negative first constant, since the shift is arithmetic then).

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2021-01-15  Jakub Jelinek  

PR tree-optimization/96669
* match.pd (((1 << A) & 1) != 0 -> A == 0,
((1 << A) & 1) == 0 -> A != 0): Generalize for 1s replaced by
possibly different power of two constants and to right shift too.

* gcc.dg/tree-ssa/pr96669-1.c: New test.

--- gcc/match.pd.jj 2021-01-15 14:00:21.567135280 +0100
+++ gcc/match.pd2021-01-15 17:03:49.207071209 +0100
@@ -3117,13 +3117,26 @@ (define_operator_list COND_TERNARY
   (op @0 { build_int_cst (TREE_TYPE (@1), low); })))
 
 
-/* ((1 << A) & 1) != 0 -> A == 0
-   ((1 << A) & 1) == 0 -> A != 0 */
+/* Simplify ((C << x) & D) != 0 where C and D are power of two constants,
+   either to false if D is smaller (unsigned comparison) than C, or to
+   x == log2 (D) - log2 (C).  Similarly for right shifts.  */
 (for cmp (ne eq)
  icmp (eq ne)
  (simplify
-  (cmp (bit_and (lshift integer_onep @0) integer_onep) integer_zerop)
-  (icmp @0 { build_zero_cst (TREE_TYPE (@0)); })))
+  (cmp (bit_and (lshift integer_pow2p@1 @0) integer_pow2p@2) integer_zerop)
+   (with { int c1 = wi::clz (wi::to_wide (@1));
+  int c2 = wi::clz (wi::to_wide (@2)); }
+(if (c1 < c2)
+ { constant_boolean_node (cmp == NE_EXPR ? false : true, type); }
+ (icmp @0 { build_int_cst (TREE_TYPE (@0), c1 - c2); }
+ (simplify
+  (cmp (bit_and (rshift integer_pow2p@1 @0) integer_pow2p@2) integer_zerop)
+   (if (tree_int_cst_sgn (@1) > 0)
+(with { int c1 = wi::clz (wi::to_wide (@1));
+   int c2 = wi::clz (wi::to_wide (@2)); }
+ (if (c1 > c2)
+  { constant_boolean_node (cmp == NE_EXPR ? false : true, type); }
+  (icmp @0 { build_int_cst (TREE_TYPE (@0), c2 - c1); }))
 
 /* (CST1 << A) == CST2 -> A == ctz (CST2) - ctz (CST1)
(CST1 << A) != CST2 -> A != ctz (CST2) - ctz (CST1)
--- gcc/testsuite/gcc.dg/tree-ssa/pr96669-1.c.jj2021-01-15 
17:12:11.067414204 +0100
+++ gcc/testsuite/gcc.dg/tree-ssa/pr96669-1.c   2021-01-15 17:11:55.486589792 
+0100
@@ -0,0 +1,59 @@
+/* PR tree-optimization/96669 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-original" } */
+/* { dg-final { scan-tree-dump "return a == 0;" "original" } } */
+/* { dg-final { scan-tree-dump "return 1;" "original" } } */
+/* { dg-final { scan-tree-dump "return c == 3;" "original" } } */
+/* { dg-final { scan-tree-dump "return d != 1;" "original" } } */
+/* { dg-final { scan-tree-dump "return e != 0;" "original" } } */
+/* { dg-final { scan-tree-dump "return f == 1;" "original" } } */
+/* { dg-final { scan-tree-dump "return 0;" "original" } } */
+/* { dg-final { scan-tree-dump "return h != 1;" "original" } } */
+
+int
+f1 (int a)
+{
+  return ((1 << a) & 1) != 0;
+}
+
+int
+f2 (int b)
+{
+  return ((2 << b) & 1) == 0;
+}
+
+int
+f3 (int c)
+{
+  return ((2 << c) & 16) != 0;
+}
+
+int
+f4 (int d)
+{
+  return ((16 << d) & 32) == 0;
+}
+
+int
+f5 (int e)
+{
+  return ((1 >> e) & 1) == 0;
+}
+
+int
+f6 (int f)
+{
+  return ((2 >> f) & 1) != 0;
+}
+
+int
+f7 (int g)
+{
+  return ((1 >> g) & 2) != 0;
+}
+
+int
+f8 (int h)
+{
+  return ((32 >> h) & 16) == 0;
+}

Jakub



[pushed] c++: Fix list-init of array of no-copy type [PR63707]

2021-01-15 Thread Jason Merrill via Gcc-patches
build_vec_init_elt models initialization from some arbitrary object of the
type, i.e. copy, but in the case of list-initialization we don't do a copy
from the elements, we initialize them directly.

Tested x86_64-pc-linux-gnu, applying to trunk.  And 9/10, soon.

gcc/cp/ChangeLog:

PR c++/63707
* tree.c (build_vec_init_expr): Don't call build_vec_init_elt
if we got a CONSTRUCTOR.

gcc/testsuite/ChangeLog:

PR c++/63707
* g++.dg/cpp0x/initlist-array13.C: New test.
---
 gcc/cp/tree.c | 10 +-
 gcc/testsuite/g++.dg/cpp0x/initlist-array13.C | 16 
 2 files changed, 25 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/g++.dg/cpp0x/initlist-array13.C

diff --git a/gcc/cp/tree.c b/gcc/cp/tree.c
index 3a9a86de34a..290e73bad83 100644
--- a/gcc/cp/tree.c
+++ b/gcc/cp/tree.c
@@ -787,7 +787,15 @@ build_vec_init_expr (tree type, tree init, tsubst_flags_t 
complain)
 {
   tree slot;
   bool value_init = false;
-  tree elt_init = build_vec_init_elt (type, init, complain);
+  tree elt_init;
+  if (init && TREE_CODE (init) == CONSTRUCTOR)
+{
+  gcc_assert (!BRACE_ENCLOSED_INITIALIZER_P (init));
+  /* We built any needed constructor calls in digest_init.  */
+  elt_init = init;
+}
+  else
+elt_init = build_vec_init_elt (type, init, complain);
 
   if (init == void_type_node)
 {
diff --git a/gcc/testsuite/g++.dg/cpp0x/initlist-array13.C 
b/gcc/testsuite/g++.dg/cpp0x/initlist-array13.C
new file mode 100644
index 000..92fe97164cd
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/initlist-array13.C
@@ -0,0 +1,16 @@
+// PR c++/63707
+// { dg-do compile { target c++11 } }
+
+struct Child
+{
+  Child (int);
+  ~Child ();
+  Child (const Child &) = delete;
+};
+
+struct Parent
+{
+  Parent () : children {{5}, {7}} {}
+
+  Child children[2];
+};

base-commit: c0194736b477aef3cf0d15ccd12c64572869cf3f
-- 
2.27.0



Re: [PATCH v5] rs6000, vector integer multiply/divide/modulo instructions

2021-01-15 Thread Segher Boessenkool
Hi!

On Wed, Jan 13, 2021 at 02:15:04PM -0800, Carl Love wrote:
> The patch was compiled and tested on:
> 
>powerpc64le-unknown-linux-gnu (Power 8 BE)

(I assume you mean powerpc64-linux instead?)

> > Presumably it is safe (no side affects) when adding V4SI and V2DI here,
> > with respect to other current users of 'bits'.
> > Is it worth adding the
> > other modes while we are here? (V1TI, V8HI, V16QI ).
> 
> I did not add the additional modes.  I don't see any reason it would
> hurt but feel it is best to only add them when they are needed.

Either works, sure.  Having all but one vector modes covered is silly
(and can give people the idea something is special with that one mode),
but you have only two so far, so :-)

> --- a/gcc/config/rs6000/rs6000.md
> +++ b/gcc/config/rs6000/rs6000.md
> @@ -670,7 +670,8 @@
>  
>  ;; How many bits in this mode?
>  (define_mode_attr bits [(QI "8") (HI "16") (SI "32") (DI "64")
> -(SF "32") (DF "64")])
> +(SF "32") (DF "64")
> +(V4SI "32") (V2DI "64")])

The comment needs a clarification what this means for vector modes.
"How many bits (per element) in this mode?" perhaps?  Does that sound
good?

The patch is okay for trunk with such a tweak.  Thank you!


Segher


[pushed] c++: Improve copy elision for base subobjects [PR98642]

2021-01-15 Thread Jason Merrill via Gcc-patches

Three patches:

1) Rewrite a complete constructor call to call a base constructor if 
we're eliding a copy into a base subobject.
2) Elide the copy from a prvalue built for list-initialization into a 
base subobject.
3) Elide other copies from prvalues representing a constructor call into 
base subobjects.


The first two are clear bugfixes.

The third is a behavior change to elide more copies, as specified by 
C++17, but makes constructor prvalues more different from other function 
call prvalues, from which we can't consistently elide copies (CWG2403). 
Getting closer to the C++17 prvalue model seems desirable, but it's a 
departure from the behavior that we and other compilers have settled on. 
 So perhaps I'll hold off on the last one for now.


Tested x86_64-pc-linux-gnu, applying patches 1 and 2 to trunk.
commit c2afbe403389d32a3f36e35a461beda75d0e82f4
Author: Jason Merrill 
Date:   Wed Jan 13 13:27:53 2021 -0500

c++: Fix copy elision for base initialization

While working on PR98642 I noticed that in this testcase we were eliding the
copy, calling the complete default constructor to initialize the B base
subobject, and therefore wrongly initializing the non-existent A subobject
of B.  The test doesn't care whether the copy is elided or not, but checks
that we are actually calling a base constructor for B.

The patch preserves the elision, but changes the initializer to call the
base constructor instead of the complete constructor.

gcc/cp/ChangeLog:

* call.c (base_ctor_for, make_base_init_ok): New.
(build_over_call): Use make_base_init_ok.

gcc/testsuite/ChangeLog:

* g++.dg/cpp1z/elide4.C: New test.

diff --git a/gcc/cp/call.c b/gcc/cp/call.c
index 218157088ef..c194af74612 100644
--- a/gcc/cp/call.c
+++ b/gcc/cp/call.c
@@ -8425,6 +8425,60 @@ call_copy_ctor (tree a, tsubst_flags_t complain)
   return r;
 }
 
+/* Return the base constructor corresponding to COMPLETE_CTOR or NULL_TREE.  */
+
+static tree
+base_ctor_for (tree complete_ctor)
+{
+  tree clone;
+  FOR_EACH_CLONE (clone, DECL_CLONED_FUNCTION (complete_ctor))
+if (DECL_BASE_CONSTRUCTOR_P (clone))
+  return clone;
+  return NULL_TREE;
+}
+
+/* Try to make EXP suitable to be used as the initializer for a base subobject,
+   and return whether we were successful.  EXP must have already been cleared
+   by unsafe_copy_elision_p.  */
+
+static bool
+make_base_init_ok (tree exp)
+{
+  if (TREE_CODE (exp) == TARGET_EXPR)
+exp = TARGET_EXPR_INITIAL (exp);
+  while (TREE_CODE (exp) == COMPOUND_EXPR)
+exp = TREE_OPERAND (exp, 1);
+  if (TREE_CODE (exp) == COND_EXPR)
+{
+  bool ret = make_base_init_ok (TREE_OPERAND (exp, 2));
+  if (tree op1 = TREE_OPERAND (exp, 1))
+	{
+	  bool r1 = make_base_init_ok (op1);
+	  /* If unsafe_copy_elision_p was false, the arms should match.  */
+	  gcc_assert (r1 == ret);
+	}
+  return ret;
+}
+  if (TREE_CODE (exp) != AGGR_INIT_EXPR)
+/* A trivial copy is OK.  */
+return true;
+  if (!AGGR_INIT_VIA_CTOR_P (exp))
+/* unsafe_copy_elision_p must have said this is OK.  */
+return true;
+  tree fn = cp_get_callee_fndecl_nofold (exp);
+  if (DECL_BASE_CONSTRUCTOR_P (fn))
+return true;
+  gcc_assert (DECL_COMPLETE_CONSTRUCTOR_P (fn));
+  fn = base_ctor_for (fn);
+  if (!fn || DECL_HAS_IN_CHARGE_PARM_P (fn))
+/* The base constructor has more parameters, so we can't just change the
+   call target.  It would be possible to splice in the appropriate
+   arguments, but probably not worth the complexity.  */
+return false;
+  AGGR_INIT_EXPR_FN (exp) = build_address (fn);
+  return true;
+}
+
 /* Return true iff T refers to a base or potentially-overlapping field, which
cannot be used for return by invisible reference.  We avoid doing C++17
mandatory copy elision when this is true.
@@ -9152,6 +9206,10 @@ build_over_call (struct z_candidate *cand, int flags, tsubst_flags_t complain)
   else
 	cp_warn_deprecated_use (fn, complain);
 
+  if (eliding_temp && DECL_BASE_CONSTRUCTOR_P (fn)
+	  && !make_base_init_ok (arg))
+	unsafe = true;
+
   /* If we're creating a temp and we already have one, don't create a
 	 new one.  If we're not creating a temp but we get one, use
 	 INIT_EXPR to collapse the temp into our target.  Otherwise, if the
diff --git a/gcc/testsuite/g++.dg/cpp1z/elide4.C b/gcc/testsuite/g++.dg/cpp1z/elide4.C
new file mode 100644
index 000..03335e4ffbd
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp1z/elide4.C
@@ -0,0 +1,24 @@
+// { dg-do compile { target c++11 } }
+
+// Check that there's a call to some base constructor of B: either the default
+// constructor, if the copy is elided, or the copy constructor.
+
+// { dg-final { scan-assembler {call[ \t]*_?_ZN1BC2} { target { i?86-*-* x86_64-*-* } } } }
+
+int count;
+struct A { int i = count++; };
+struct B: virtual A {
+  B() { }
+  B(const B& b);
+};
+bool x;
+struct C: B
+
