Re: [PATCH] doc: clarify semantics of vector bitwise shifts
On Thu, 25 May 2023, Richard Biener wrote: > On Wed, May 24, 2023 at 8:36 PM Alexander Monakov wrote: > > > > > > On Wed, 24 May 2023, Richard Biener via Gcc-patches wrote: > > > > > I’d have to check the ISAs what they actually do here - it of course > > > depends > > > on RTL semantics as well but as you say those are not strictly defined > > > here > > > either. > > > > Plus, we can add the following executable test to the testsuite: > > Yeah, that's probably a good idea. I think your documentation change > with the added sentence about the truncation is OK. I am no longer confident in my patch, sorry. My claim about vector shift semantics in OpenCL was wrong. In fact it specifies that RHS of a vector shift is masked to the exact bitwidth of the element type. So, to collect various angles: 1. OpenCL semantics would need an 'AND' before a shift (except VSX/Altivec). 2. From user side we had a request to follow C integer promotion semantics in https://gcc.gnu.org/PR91838 but I now doubt we can do that. 3. LLVM makes oversized vector shifts UB both for 'vector_size' and 'ext_vector_type'. 4. Vector lowering does not emit promotions, and starting from gcc-12 ranger treats oversized shifts according to the documentation you cite below, and optimizes (e.g. with '-O2 -mno-sse') typedef short v8hi __attribute__((vector_size(16))); void f(v8hi *p) { *p >>= 16; } to zeroing '*p'. If this looks unintended, I can file a bug. I still think we need to clarify semantics of vector shifts, but probably not in the way I proposed initially. What do you think? Thanks. Alexander > Note we have > > /* Shift operations for shift and rotate. >Shift means logical shift if done on an >unsigned type, arithmetic shift if done on a signed type. >The second operand is the number of bits to >shift by; it need not be the same type as the first operand and result. >Note that the result is undefined if the second operand is larger >than or equal to the first operand's type size. > >The first operand of a shift can have either an integer or a >(non-integer) fixed-point type. We follow the ISO/IEC TR 18037:2004 >semantics for the latter. > >Rotates are defined for integer types only. */ > DEFTREECODE (LSHIFT_EXPR, "lshift_expr", tcc_binary, 2) > > in tree.def which implies short << 24 is undefined behavior (similar > wording in generic.texi). The rtl docs say nothing about behavior > but I think the semantics should carry over. That works for x86 > even for scalar instructions working on GPRs (masking is applied > but fixed to 5 or 6 bits even for QImode or HImode shifts). > > Note that when we make these shifts well-defined there's > also arithmetic on signed types smaller than int (which again > doesn't exist in C) where overflow invokes undefined behavior > in the middle-end. Unless we want to change that as well > this is somewhat inconsistent then. > > There's also the issue that C 'int' is defined by INT_TYPE_SIZE > and thus target dependent which makes what is undefined and > what not target dependent. > > Richard. 
> > #include <stdint.h>
> >
> > #define CHECK(TYPE, WIDTH, OP, COUNT, INVERT)              \
> >   {                                                        \
> >     typedef TYPE vec __attribute__((vector_size(WIDTH)));  \
> >                                                            \
> >     static volatile vec zero;                              \
> >     vec tmp = (zero-2) OP (COUNT);                         \
> >     vec ref = INVERT zero;                                 \
> >     if (__builtin_memcmp(&tmp, &ref, sizeof tmp))          \
> >       __builtin_abort();                                   \
> >   }
> >
> > int main(void)
> > {
> >   CHECK( uint8_t, 16, <<, 8, )
> >   CHECK( uint8_t, 16, <<, 31, )
> >   CHECK( uint8_t, 16, >>, 8, )
> >   CHECK( uint8_t, 16, >>, 31, )
> >   CHECK(  int8_t, 16, <<, 8, )
> >   CHECK(  int8_t, 16, <<, 31, )
> >   CHECK(  int8_t, 16, >>, 8, ~)
> >   CHECK(  int8_t, 16, >>, 31, ~)
> >   CHECK(uint16_t, 16, <<, 16, )
> >   CHECK(uint16_t, 16, <<, 31, )
> >   CHECK(uint16_t, 16, >>, 16, )
> >   CHECK(uint16_t, 16, >>, 31, )
> >   CHECK( int16_t, 16, <<, 16, )
> >   CHECK( int16_t, 16, <<, 31, )
> >   CHECK( int16_t, 16, >>, 16, ~)
> >   CHECK( int16_t, 16, >>, 31, ~)
> >   // Per-lane-variable shifts:
> >   CHECK( uint8_t, 16, <<, zero+8, )
> >   CHECK( uint8_t, 16, <<, zero+31, )
> >   CHECK( uint8_t, 16, >>, zero+8, )
> >   CHECK( uint8_t, 16, >>, zero+31, )
> >   CHECK(  int8_t, 16, <<, zero+8, )
> >   CHECK(  int8_t, 16, <<, zero+31, )
> >   CHECK(  int8_t, 16, >>, zero+8, ~)
> >   CHECK(  int8_t, 16, >>, zero+31, ~)
> >   CHECK(uint16_t, 16, <<, zero+16, )
> >   CHECK(uint16_t, 16, <<, zero+31, )
> >   CHECK(uint16_t, 16, >>, z
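As a side note on point 2 above (the request in PR91838 to follow C integer
promotion semantics), here is a small illustration of the scalar/vector
difference; this is my own example, not part of the patch or of the test above:

typedef short v8hi __attribute__((vector_size (16)));

short s;
v8hi  v;

/* Scalar: 's' is promoted to int, so a shift count of 24 is in range.  */
int f (void) { return s << 24; }

/* Vector: lanes are not promoted, each lane is shifted as a 16-bit value,
   so a count of 24 is out of range for the element type.  */
v8hi g (void) { return v << 24; }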
Re: [PATCH] doc: clarify semantics of vector bitwise shifts
On Wed, 31 May 2023, Richard Biener wrote: > On Tue, May 30, 2023 at 4:49 PM Alexander Monakov wrote: > > > > > > On Thu, 25 May 2023, Richard Biener wrote: > > > > > On Wed, May 24, 2023 at 8:36 PM Alexander Monakov > > > wrote: > > > > > > > > > > > > On Wed, 24 May 2023, Richard Biener via Gcc-patches wrote: > > > > > > > > > I’d have to check the ISAs what they actually do here - it of course > > > > > depends > > > > > on RTL semantics as well but as you say those are not strictly > > > > > defined here > > > > > either. > > > > > > > > Plus, we can add the following executable test to the testsuite: > > > > > > Yeah, that's probably a good idea. I think your documentation change > > > with the added sentence about the truncation is OK. > > > > I am no longer confident in my patch, sorry. > > > > My claim about vector shift semantics in OpenCL was wrong. In fact it > > specifies > > that RHS of a vector shift is masked to the exact bitwidth of the element > > type. > > > > So, to collect various angles: > > > > 1. OpenCL semantics would need an 'AND' before a shift (except VSX/Altivec). > > > > 2. From user side we had a request to follow C integer promotion semantics > >in https://gcc.gnu.org/PR91838 but I now doubt we can do that. > > > > 3. LLVM makes oversized vector shifts UB both for 'vector_size' and > >'ext_vector_type'. > > I had the impression GCC desired to do 3. as well, matching what we do > for scalar shifts. > > > 4. Vector lowering does not emit promotions, and starting from gcc-12 > >ranger treats oversized shifts according to the documentation you > >cite below, and optimizes (e.g. with '-O2 -mno-sse') > > > > typedef short v8hi __attribute__((vector_size(16))); > > > > void f(v8hi *p) > > { > > *p >>= 16; > > } > > > >to zeroing '*p'. If this looks unintended, I can file a bug. > > > > I still think we need to clarify semantics of vector shifts, but probably > > not in the way I proposed initially. What do you think? > > I think the intent at some point was to adhere to the OpenCL spec > for the GCC vector extension (because that's a written spec while > GCCs vector extension docs are lacking). Originally the powerpc > altivec 'vector' keyword spurred most of the development IIRC > so it might be useful to see how they specify shifts. It doesn't look like they document the semantics of '<<' and '>>' operators for vector types. > So yes, we probably should clarify the semantics to match the > implementation (since we have two targets doing things differently > since forever we can only document it as UB) and also note the > difference from OpenCL (in case OpenCL is still relevant these > days we might want to offer a -fopencl-vectors to emit the required > AND). It doesn't have to be UB, in principle we could say that shift amount is taken modulo some power of two depending on the target without UB. But since LLVM already treats that as UB, we might as well follow. I think for addition/multiplication of signed vectors everybody expects them to have wrapping semantics without UB on overflow though? Revised patch below. > It would be also good to amend the RTL documentation. > > It would be very nice to start an internals documentation section > around collecting what the middle-end considers undefined > or implementation defined (aka target defined) behavior in the > GENERIC, GIMPLE and RTL ILs and what predicates eventually > control that (like TYPE_OVERFLOW_UNDEFINED). 
Maybe spread it over > {gimple,generic,rtl}.texi, though gimple.texi is only about the representation > and all semantics are shared and documented in generic.texi. Hm, noted. Thanks. ---8<--- >From e4e8d9e262f2f8dbc91a94291cf7accb74d27e7c Mon Sep 17 00:00:00 2001 From: Alexander Monakov Date: Wed, 24 May 2023 15:48:29 +0300 Subject: [PATCH] doc: clarify semantics of vector bitwise shifts Explicitly say that attempted shift past element bit width is UB for vector types. Mention that integer promotions do not happen. gcc/ChangeLog: * doc/extend.texi (Vector Extensions): Clarify bitwise shift semantics. --- gcc/doc/extend.texi | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi index e426a2eb7d..3723cfe467 100644 --- a/gcc/doc/extend.texi +++ b/gcc/doc/extend.texi @@ -12026,7 +12026,14 @@ elements in the operand. It is possible to use shifting operators @code{<<}, @code{>>} on integer-type vectors. The operation is defined as following: @code{@{a0, a1, @dots{}, an@} >> @{b0, b1, @dots{}, bn@} == @{a0 >> b0, a1 >> b1, -@dots{}, an >> bn@}}@. Vector operands must have the same number of +@dots{}, an >> bn@}}@. Unlike OpenCL, values of @code{b} are not +implicitly taken modulo bit width of the base type @code{B}, and the behavior +is undefined if any @code{bi} is greater than or equal to @code{B}. + +In contrast to scalar operations in C and C+
Re: [PATCH] doc: clarify semantics of vector bitwise shifts
On Fri, 2 Jun 2023, Matthias Kretz wrote: > On Thursday, 1 June 2023 20:25:14 CEST Alexander Monakov wrote: > > On Wed, 31 May 2023, Richard Biener wrote: > > > So yes, we probably should clarify the semantics to match the > > > implementation (since we have two targets doing things differently > > > since forever we can only document it as UB) and also note the > > > difference from OpenCL (in case OpenCL is still relevant these > > > days we might want to offer a -fopencl-vectors to emit the required > > > AND). > > > > It doesn't have to be UB, in principle we could say that shift amount > > is taken modulo some power of two depending on the target without UB. > > But since LLVM already treats that as UB, we might as well follow. > > I prefer UB (as your patch states 👍). If a user requires the AND, let them > state it explicitly. Don't let everybody pay in performance. What I suggested does not imply a performance cost. All targets take some lower bits of the shift amount anyway. It's only OpenCL's exact masking that would imply a performance cost (and I agree it's inappropriate for GCC's generic vectors). > > I think for addition/multiplication of signed vectors everybody > > expects them to have wrapping semantics without UB on overflow though? > > simd x = ...; > bool t = all_of(x < x + 1); // unconditionally true or not? > > I'd expect t to be unconditionally true. Because simd simply is a data- > parallel version of int. Okay, I see opinions will vary here. I was thinking about our immintrin.h which is partially implemented in terms of generic vectors. Imagine we extend UBSan to trap on signed overflow for vector types. I expect that will blow up on existing code that uses Intel intrinsics. But use of generic vectors in immintrin.h is our implementation detail, and people might have expected intrinsics to be overflow-safe, like for aliasing (where we use __attribute__((may_alias)) in immintrin.h). Although, we can solve that by inventing overflow-wraps attribute for types, maybe? > > Revised patch below. > > This can be considered a breaking change. Does it need a mention in the > release notes? I'm not sure what you consider a breaking change here. Is that the implied threat to use undefinedness for range deduction and other optimizations? Thanks. Alexander
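For concreteness, the explicit AND mentioned above (the user stating the
OpenCL-style masking in the source) would look like this; my own sketch, only
code that wants the wrap-around of the shift count pays for the extra operation:

typedef unsigned char v16qu __attribute__((vector_size (16)));

v16qu shl_masked (v16qu a, v16qu n)
{
  /* Mask each per-lane count to the element bit width (8 bits) before
     shifting, giving well-defined results for any value of 'n'.  */
  return a << (n & 7);
}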
Re: [PATCH] doc: clarify semantics of vector bitwise shifts
On Fri, 2 Jun 2023, Matthias Kretz wrote: > > Okay, I see opinions will vary here. I was thinking about our immintrin.h > > which is partially implemented in terms of generic vectors. Imagine we > > extend UBSan to trap on signed overflow for vector types. I expect that > > will blow up on existing code that uses Intel intrinsics. > > _mm_add_epi32 is already implemented via __v4su addition (i.e. unsigned). So > the intrinsic would continue to wrap on signed overflow. Ah, if our intrinsics take care of it, that alleviates my concern. > > I'm not sure what you consider a breaking change here. Is that the implied > > threat to use undefinedness for range deduction and other optimizations? > > Consider the stdx::simd implementation. It currently follows semantics of the > builtin types. So simd can be shifted by 30 without UB. The > implementation of the shift operator depends on the current behavior, even if > it is target-dependent. For PPC the simd implementation adds extra code to > avoid the "UB". With nailing down shifts > sizeof(T) as UB this extra code > now > needs to be added for all targets. What does stdx::simd do on LLVM, where that has always been UB even on x86? Alexander
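The idiom mentioned above (doing the lane arithmetic in the unsigned vector
type so that overflow wraps regardless of how signed vector overflow is
treated) looks roughly like this; a sketch for illustration:

typedef int      v4si __attribute__((vector_size (16)));
typedef unsigned v4su __attribute__((vector_size (16)));

v4si add_wrap (v4si a, v4si b)
{
  /* Unsigned addition is well defined modulo 2^32 per lane; casting back
     gives the two's-complement wrapping result.  */
  return (v4si) ((v4su) a + (v4su) b);
}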
Re: [PATCH] c-family: implement -ffp-contract=on
Ping for the front-end maintainers' input. On Mon, 22 May 2023, Richard Biener wrote: > On Thu, May 18, 2023 at 11:04 PM Alexander Monakov via Gcc-patches > wrote: > > > > Implement -ffp-contract=on for C and C++ without changing default > > behavior (=off for -std=cNN, =fast for C++ and -std=gnuNN). > > The documentation changes mention the defaults are changed for > standard modes, I suppose you want to remove that hunk. > > > gcc/c-family/ChangeLog: > > > > * c-gimplify.cc (fma_supported_p): New helper. > > (c_gimplify_expr) [PLUS_EXPR, MINUS_EXPR]: Implement FMA > > contraction. > > > > gcc/ChangeLog: > > > > * common.opt (fp_contract_mode) [on]: Remove fallback. > > * config/sh/sh.md (*fmasf4): Correct flag_fp_contract_mode test. > > * doc/invoke.texi (-ffp-contract): Update. > > * trans-mem.cc (diagnose_tm_1): Skip internal function calls. > > --- > > gcc/c-family/c-gimplify.cc | 78 ++ > > gcc/common.opt | 3 +- > > gcc/config/sh/sh.md| 2 +- > > gcc/doc/invoke.texi| 8 ++-- > > gcc/trans-mem.cc | 3 ++ > > 5 files changed, 88 insertions(+), 6 deletions(-) > > > > diff --git a/gcc/c-family/c-gimplify.cc b/gcc/c-family/c-gimplify.cc > > index ef5c7d919f..f7635d3b0c 100644 > > --- a/gcc/c-family/c-gimplify.cc > > +++ b/gcc/c-family/c-gimplify.cc > > @@ -41,6 +41,8 @@ along with GCC; see the file COPYING3. If not see > > #include "c-ubsan.h" > > #include "tree-nested.h" > > #include "context.h" > > +#include "tree-pass.h" > > +#include "internal-fn.h" > > > > /* The gimplification pass converts the language-dependent trees > > (ld-trees) emitted by the parser into language-independent trees > > @@ -686,6 +688,14 @@ c_build_bind_expr (location_t loc, tree block, tree > > body) > >return bind; > > } > > > > +/* Helper for c_gimplify_expr: test if target supports fma-like FN. */ > > + > > +static bool > > +fma_supported_p (enum internal_fn fn, tree type) > > +{ > > + return direct_internal_fn_supported_p (fn, type, OPTIMIZE_FOR_BOTH); > > +} > > + > > /* Gimplification of expression trees. */ > > > > /* Do C-specific gimplification on *EXPR_P. PRE_P and POST_P are as in > > @@ -739,6 +749,74 @@ c_gimplify_expr (tree *expr_p, gimple_seq *pre_p > > ATTRIBUTE_UNUSED, > > break; > >} > > > > +case PLUS_EXPR: > > +case MINUS_EXPR: > > + { > > + tree type = TREE_TYPE (*expr_p); > > + /* For -ffp-contract=on we need to attempt FMA contraction only > > + during initial gimplification. Late contraction across statement > > + boundaries would violate language semantics. */ > > + if (SCALAR_FLOAT_TYPE_P (type) > > + && flag_fp_contract_mode == FP_CONTRACT_ON > > + && cfun && !(cfun->curr_properties & PROP_gimple_any) > > + && fma_supported_p (IFN_FMA, type)) > > + { > > + bool neg_mul = false, neg_add = code == MINUS_EXPR; > > + > > + tree *op0_p = &TREE_OPERAND (*expr_p, 0); > > + tree *op1_p = &TREE_OPERAND (*expr_p, 1); > > + > > + /* Look for ±(x * y) ± z, swapping operands if necessary. */ > > + if (TREE_CODE (*op0_p) == NEGATE_EXPR > > + && TREE_CODE (TREE_OPERAND (*op0_p, 0)) == MULT_EXPR) > > + /* '*EXPR_P' is '-(x * y) ± z'. This is fine. 
*/; > > + else if (TREE_CODE (*op0_p) != MULT_EXPR) > > + { > > + std::swap (op0_p, op1_p); > > + std::swap (neg_mul, neg_add); > > + } > > + if (TREE_CODE (*op0_p) == NEGATE_EXPR) > > + { > > + op0_p = &TREE_OPERAND (*op0_p, 0); > > + neg_mul = !neg_mul; > > + } > > + if (TREE_CODE (*op0_p) != MULT_EXPR) > > + break; > > + auto_vec ops (3); > > + ops.quick_push (TREE_OPERAND (*op0_p, 0)); > > + ops.quick_push (TREE_OPERAND (*op0_p, 1)); > > + ops.quick_push (*op1_p); > > + > > + enum internal_fn ifn = IFN_FMA; > > + if (neg_mul) > > + { > > + if
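For reference, a minimal illustration of the semantics the patch implements
(my own example, not from the patch): with -ffp-contract=on a multiply-add may
be contracted only within a single expression, never across statement
boundaries.

double fma_ok (double x, double y, double z)
{
  /* One expression: may be emitted as a single fused multiply-add.  */
  return x * y + z;
}

double no_fma (double x, double y, double z)
{
  /* Two statements: =on must keep the intermediate rounding of 't',
     while =fast is allowed to contract across the boundary.  */
  double t = x * y;
  return t + z;
}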
[PATCH v2] ipa-visibility: Optimize TLS access [PR99619]
From: Artem Klimov Fix PR99619, which asks to optimize TLS model based on visibility. The fix is implemented as an IPA optimization: this allows to take optimized visibility status into account (as well as avoid modifying all language frontends). 2022-04-17 Artem Klimov gcc/ChangeLog: * ipa-visibility.cc (function_and_variable_visibility): Promote TLS access model afer visibility optimizations. * varasm.cc (have_optimized_refs): New helper. (optimize_dyn_tls_for_decl_p): New helper. Use it ... (decl_default_tls_model): ... here in place of 'optimize' check. gcc/testsuite/ChangeLog: * gcc.dg/tls/vis-attr-gd.c: New test. * gcc.dg/tls/vis-attr-hidden-gd.c: New test. * gcc.dg/tls/vis-attr-hidden.c: New test. * gcc.dg/tls/vis-flag-hidden-gd.c: New test. * gcc.dg/tls/vis-flag-hidden.c: New test. * gcc.dg/tls/vis-pragma-hidden-gd.c: New test. * gcc.dg/tls/vis-pragma-hidden.c: New test. Co-Authored-By: Alexander Monakov Signed-off-by: Artem Klimov --- v2: run the new loop in ipa-visibility only in the whole-program IPA pass; in decl_default_tls_model, check if any referring function is optimized when 'optimize == 0' (when running in LTO mode) Note for reviewers: I noticed there's a place which tries to avoid TLS promotion, but the comment seems wrong and I could not find a testcase. I'd suggest we remove it. The compiler can only promote general-dynamic to local-dynamic and initial-exec to local-exec. The comment refers to promoting x-dynamic to y-exec, but that cannot happen AFAICT: https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=8e1ba78f1b8eedd6c65c6f0e6d6d09a801de5d3d gcc/ipa-visibility.cc | 19 +++ gcc/testsuite/gcc.dg/tls/vis-attr-gd.c| 12 +++ gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c | 13 gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c| 12 +++ gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c | 13 gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c| 12 +++ .../gcc.dg/tls/vis-pragma-hidden-gd.c | 17 ++ gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c | 16 ++ gcc/varasm.cc | 32 ++- 9 files changed, 145 insertions(+), 1 deletion(-) create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-gd.c create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden-gd.c create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c diff --git a/gcc/ipa-visibility.cc b/gcc/ipa-visibility.cc index 8a27e7bcd..3ed2b7cf6 100644 --- a/gcc/ipa-visibility.cc +++ b/gcc/ipa-visibility.cc @@ -873,6 +873,25 @@ function_and_variable_visibility (bool whole_program) } } + if (symtab->state >= IPA_SSA) +{ + FOR_EACH_VARIABLE (vnode) + { + tree decl = vnode->decl; + + /* Upgrade TLS access model based on optimized visibility status, +unless it was specified explicitly or no references remain. 
*/ + if (DECL_THREAD_LOCAL_P (decl) + && !lookup_attribute ("tls_model", DECL_ATTRIBUTES (decl)) + && vnode->ref_list.referring.length ()) + { + enum tls_model new_model = decl_default_tls_model (decl); + gcc_checking_assert (new_model >= decl_tls_model (decl)); + set_decl_tls_model (decl, new_model); + } + } +} + if (dump_file) { fprintf (dump_file, "\nMarking local functions:"); diff --git a/gcc/varasm.cc b/gcc/varasm.cc index 4db8506b1..de149e82c 100644 --- a/gcc/varasm.cc +++ b/gcc/varasm.cc @@ -6679,6 +6679,36 @@ init_varasm_once (void) #endif } +/* Determine whether SYMBOL is used in any optimized function. */ + +static bool +have_optimized_refs (struct symtab_node *symbol) +{ + struct ipa_ref *ref; + + for (int i = 0; symbol->iterate_referring (i, ref); i++) +{ + cgraph_node *cnode = dyn_cast (ref->referring); + + if (cnode && opt_for_fn (cnode->decl, optimize)) + return true; +} + + return false; +} + +/* Check if promoting general-dynamic TLS access model to local-dynamic is + desirable for DECL. */ + +static bool +optimize_dyn_tls_for_decl_p (const_tree decl) +{ + if (optimize) +return true; + return symtab->state >= IPA && have_optimized_refs (symtab_node::get (decl)); +} + + enum tls_model decl_default_tls_model (const_tree decl) { @@ -6696,7 +6726,7 @@ decl_default_tls_model (const_tree decl) /* Local dynamic is inefficient when we're not combining the parts of the address. */ - else if (optimize && is_local) + else if (is_local && optimize_dyn_tls_for_decl_p (decl)) kind = TLS_MODEL_LOCAL
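A rough sketch of the situation the new tests cover (illustration only, the
actual test contents are not reproduced in this message): a TLS variable whose
visibility is constrained can be accessed with a cheaper model once the IPA
visibility pass has localized it.

/* Compiled with -fpic: without the patch the access uses the general-dynamic
   model; once the hidden visibility is taken into account it can be promoted
   to local-dynamic (and initial-exec to local-exec in the analogous cases).  */
__attribute__((visibility ("hidden")))
__thread int counter;

int bump (void)
{
  return ++counter;
}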
Re: [PATCH 3/3] lto-plugin: implement LDPT_GET_API_VERSION
On Fri, 8 Jul 2022, Martin Liška wrote:

> Hi.
>
> All right, there's updated version of the patch that reflects the following
> suggestions:
>
> 1) strings are used for version identification
> 2) thread-safe API version (1) is not used if target does not support locking
> via pthreads
>
> Patch can bootstrap on x86_64-linux-gnu and survives regression tests.
>
> Ready to be installed?

Note that mold side will need to be adjusted, because it did not really
implement the proposed contract. Maybe we should be more clear how the linker
is supposed to implement this? Preliminary mold patch does this:

static PluginLinkerAPIVersion
get_api_version(const char *plugin_identifier, unsigned plugin_version,
                PluginLinkerAPIVersion minimal_api_supported,
                PluginLinkerAPIVersion maximal_api_supported,
                const char **linker_identifier, unsigned *linker_version) {
  assert(maximal_api_supported >= LAPI_V1);

  *linker_identifier = "mold";
  *linker_version = get_mold_version();

  is_gcc_linker_api_v1 = true;
  return LAPI_V1;
}

but ignoring min_api_supported is wrong, and assuming max_api_supported > 0
is also wrong. It really should check how given [min; max] range intersects
with its own range of supported versions.

Alexander
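A sketch of that range-intersection check (illustration only; it reuses the
declarations and helpers from the quoted mold snippet, assumes the linker
implements exactly version 1, and reduces error handling to abort()):

static PluginLinkerAPIVersion
get_api_version(const char *plugin_identifier, unsigned plugin_version,
                PluginLinkerAPIVersion minimal_api_supported,
                PluginLinkerAPIVersion maximal_api_supported,
                const char **linker_identifier, unsigned *linker_version) {
  *linker_identifier = "mold";
  *linker_version = get_mold_version();

  /* Fail loudly if [min; max] does not contain any version we implement,
     instead of silently assuming V1.  */
  if (minimal_api_supported > LAPI_V1 || maximal_api_supported < LAPI_V1)
    abort();

  is_gcc_linker_api_v1 = true;
  return LAPI_V1;
}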
Re: [PATCH 3/3] lto-plugin: implement LDPT_GET_API_VERSION
On Mon, 11 Jul 2022, Rui Ueyama wrote:

> > but ignoring min_api_supported is wrong, and assuming max_api_supported > 0
> > is also wrong. It really should check how given [min; max] range intersects
> > with its own range of supported versions.
>
> Currently only one version is defined which is LAPI_V1. I don't think
> LAPI_UNSPECIFIED is a version number; rather, it's an unspecified
> value. No ordering should be defined between a defined value and an
> unspecified value. If LAPI_UNSPECIFIED < LAPI_V1, it should be renamed
> LAPI_V0.

You still cannot rely on API guarantees of LAPI_V1 when the plugin does not
advertise it (thread safety of claim_file in this particular case).

And you still should check the intersection of supported API ranges and give
a sane diagnostic when min_api_supported advertised by the plugin exceeds
LAPI_V1 (though, granted, the plugin could error out as well in this case).

Alexander
Re: [PATCH 3/3] lto-plugin: implement LDPT_GET_API_VERSION
On Mon, 11 Jul 2022, Rui Ueyama wrote:

> I updated my patch to support the proposed API:
> https://github.com/rui314/mold/commit/22bbfa9bba9beeaf40b76481d175939ee2c62ec8

This still seems to ignore the thread safety aspect.

Alexander
Re: [PATCH 3/3] lto-plugin: implement LDPT_GET_API_VERSION
On Mon, 11 Jul 2022, Martin Liška wrote:

> I've clarified that linker should return a value that is in range
> [minimal_api_supported, maximal_api_supported] and added an abort
> if it's not the case.

I noticed that we are placing a trap for C++ consumers such as mold by
passing min/max_api_supported as enum values. Unlike C, C++ disallows
out-of-range enum values, so when mold does

enum PluginLinkerAPIVersion {
  LAPI_V0 = 0,
  LAPI_V1,
};

get_api_version(const char *plugin_identifier, unsigned plugin_version,
                PluginLinkerAPIVersion minimal_api_supported,
                PluginLinkerAPIVersion maximal_api_supported,
                const char **linker_identifier, const char **linker_version) {

checks such as 'min_api_supported > LAPI_V1' can be optimized out. Also, if a
future tool passes LAPI_V2, it will trigger Clang's UBSan
(GCC -fsanitize-undefined=enum instruments loads but not retrieval of
function arguments).

I'd suggest to fix this on both sides by changing the arguments to plain
integer types.

Alexander
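The suggested fix would make the hook trade in plain integers, roughly like
this (a hypothetical shape, for illustration only):

int
get_api_version(const char *plugin_identifier, unsigned plugin_version,
                int minimal_api_supported, int maximal_api_supported,
                const char **linker_identifier, const char **linker_version);

With 'int' parameters a range check such as 'minimal_api_supported > LAPI_V1'
cannot be folded away on the assumption that an enum value is always in range,
and a future LAPI_V2 argument stays representable without undefined behavior
on either side.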
Re: [PATCH 2/3] tree-cfg: do not duplicate returns_twice calls
Apologies for the prolonged silence Richard, it is a bit of an obscure topic, and I was unsure I'd be able to handle any complications in a timely manner. I'm ready to revisit it now, please see below. On Mon, 17 Jan 2022, Richard Biener wrote: > On Fri, Jan 14, 2022 at 7:21 PM Alexander Monakov wrote: > > > > A returns_twice call may have associated abnormal edges that correspond > > to the "second return" from the call. If the call is duplicated, the > > copies of those edges also need to be abnormal, but e.g. tracer does not > > enforce that. Just prohibit the (unlikely to be useful) duplication. > > The general CFG copying routines properly duplicate those edges, no? No (in fact you say so in the next paragraph). In general I think they cannot, abnormal edges are a special case, so it should be the responsibility of the caller. > Tracer uses duplicate_block so it should also get copies of all successor > edges of that block. It also only traces along normal edges. What it might > miss is abnormal incoming edges - is that what you are referring to? Yes (I think its entire point is to build a "trace" of duplicated blocks that does not have incoming edges in the middle, abnormal or not). > That would be a thing we don't handle in duplicate_block on its own but > that callers are expected to do (though I don't see copy_bbs doing that > either). I wonder if we can trigger this issue for some testcase? Oh yes (in fact my desire to find a testcase delayed this quite a bit). When compiling the following testcase with -O2 -ftracer: __attribute__((returns_twice)) int rtwice_a(int), rtwice_b(int); int f(int *x) { volatile unsigned k, i = (*x); for (k = 1; (i = rtwice_a(i)) * k; k = 2); for (; (i = rtwice_b(i)) * k; k = 4); return k; } tracer manages to eliminate the ABNORMAL_DISPATCHER block completely, so the possibility of transferring control back to rtwice_a from rtwice_b is no longer modeled in the IR. I could spend some time "upgrading" this to an end-to-end miscompilation, but I hope you agree this is quite broken already. > The thing to check would be incoming abnormal edges in > can_duplicate_block_p, not (only) returns twice functions? Unfortunately not, abnormal edges are also used for computed gotos, which are less magic than returns_twice edges and should not block tracer I think. This implies patch 1/3 [1] unnecessary blocks sinking to computed goto targets. [1] https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588498.html How would you like to proceed here? Is my initial patch ok? Alexander > > Richard. > > > gcc/ChangeLog: > > > > * tree-cfg.c (gimple_can_duplicate_bb_p): Reject blocks with > > calls that may return twice. > > --- > > gcc/tree-cfg.c | 7 +-- > > 1 file changed, 5 insertions(+), 2 deletions(-) > > > > diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c > > index b7fe313b7..a99f1acb4 100644 > > --- a/gcc/tree-cfg.c > > +++ b/gcc/tree-cfg.c > > @@ -6304,12 +6304,15 @@ gimple_can_duplicate_bb_p (const_basic_block bb) > > { > >gimple *g = gsi_stmt (gsi); > > > > - /* An IFN_GOMP_SIMT_ENTER_ALLOC/IFN_GOMP_SIMT_EXIT call must be > > + /* Prohibit duplication of returns_twice calls, otherwise associated > > +abnormal edges also need to be duplicated properly. > > +An IFN_GOMP_SIMT_ENTER_ALLOC/IFN_GOMP_SIMT_EXIT call must be > > duplicated as part of its group, or not at all. > > The IFN_GOMP_SIMT_VOTE_ANY and IFN_GOMP_SIMT_XCHG_* are part of > > such a > > group, so the same holds there. 
*/ > >if (is_gimple_call (g) > > - && (gimple_call_internal_p (g, IFN_GOMP_SIMT_ENTER_ALLOC) > > + && (gimple_call_flags (g) & ECF_RETURNS_TWICE > > + || gimple_call_internal_p (g, IFN_GOMP_SIMT_ENTER_ALLOC) > > || gimple_call_internal_p (g, IFN_GOMP_SIMT_EXIT) > > || gimple_call_internal_p (g, IFN_GOMP_SIMT_VOTE_ANY) > > || gimple_call_internal_p (g, IFN_GOMP_SIMT_XCHG_BFLY) > > -- > > 2.33.1 > > >
Re: [PATCH 2/3] tree-cfg: do not duplicate returns_twice calls
On Wed, 13 Jul 2022, Richard Biener wrote: > > > The thing to check would be incoming abnormal edges in > > > can_duplicate_block_p, not (only) returns twice functions? > > > > Unfortunately not, abnormal edges are also used for computed gotos, which > > are > > less magic than returns_twice edges and should not block tracer I think. > > I think computed gotos should use regular edges, only non-local goto should > use abnormals... Yeah, afaict it's not documented what "abnormal" is supposed to mean :/ > I suppose asm goto also uses abnormal edges? Heh, no, asm goto appears to use normal edges, but there's an old gap in their specification: can you use them like computed gotos, i.e. can asm-goto jump to a computed target? Or must they be similar to plain gotos where the jump label is redirectable (because it's substitutable in the asm template)? If you take a restrictive interpretation (asm goto may not jump to a computed label) then using regular edges looks fine. > Btw, I don't see how they in general are "less magic". Sure, we have an > explicit receiver (the destination label), but we can only do edge inserts > if we have a single computed goto edge into a block (we can "move" the > label to the block created when splitting the edge). Sure, they are a bit magic, but returns_twice edges are even more magic: their destination looks tied to a label in the IR, but in reality their destination is inside a call that returns twice (hence GCC must be careful not to insert anything between the label and the call, like in patch 1/3). > > This implies patch 1/3 [1] unnecessary blocks sinking to computed goto > > targets. > > [1] https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588498.html > > > > How would you like to proceed here? Is my initial patch ok? > > Hmm, so for returns twice calls duplicate_block correctly copies the call > and redirects the provided incoming edge to it. The API does not > handle adding any further incoming edges - the caller would be responsible > for this. So I still somewhat fail to see the point here. If tracer does not > handle extra incoming edges properly then we need to fix tracer? I think abnormal edges corresponding to computed gotos are fine: we are attempting to create a chain of blocks with no incoming edges in the middle, right? Destinations of computed gotos remain at labels of original blocks. Agreed about correcting this in the tracer. > This also includes non-local goto (we seem to copy non-local labels just > fine - wasn't there a bugreport about this!?). Sorry, no idea about this. > So I think can_duplicate_block_p is the wrong place to fix (the RTL side > would need a similar fix anyhow?) Right. I'm happy to leave both RTL and GIMPLE can_duplicate_block_p as is, and instead constrain just the tracer. Alternative patch below: * tracer.cc (analyze_bb): Disallow duplication of returns_twice calls. 
diff --git a/gcc/tracer.cc b/gcc/tracer.cc
index 64517846d..422e2b6a7 100644
--- a/gcc/tracer.cc
+++ b/gcc/tracer.cc
@@ -132,14 +132,19 @@ analyze_bb (basic_block bb, int *count)
   gimple *stmt;
   int n = 0;
 
+  bool can_dup = can_duplicate_block_p (CONST_CAST_BB (bb));
+
   for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
     {
       stmt = gsi_stmt (gsi);
       n += estimate_num_insns (stmt, &eni_size_weights);
+      if (can_dup && cfun->calls_setjmp && gimple_code (stmt) == GIMPLE_CALL
+          && gimple_call_flags (stmt) & ECF_RETURNS_TWICE)
+        can_dup = false;
     }
   *count = n;
 
-  cache_can_duplicate_bb_p (bb, can_duplicate_block_p (CONST_CAST_BB (bb)));
+  cache_can_duplicate_bb_p (bb, can_dup);
 }
 
 /* Return true if E1 is more frequent than E2.  */
Re: [PATCH 2/3] tree-cfg: do not duplicate returns_twice calls
On Thu, 14 Jul 2022, Richard Biener wrote: > Indeed. Guess that's what __builtin_setjmp[_receiver] for SJLJ_EH got > "right". > > When copying a block we do not copy labels so any "jumps" remain to the > original > block and thus we are indeed able to isolate normal control flow. Given that > returns_twice functions _do_ seem to be special, and also special as to how > we handle other abnormal receivers in duplicate_block. We do? Sorry, I don't see what you mean here, can you point me to specific lines? > So it might indeed make sense to special-case them in can_duplicate_block_p > ... (sorry for going back-and-forth ...) > > Note that I think this detail of duplicate_block (the function) and the hook > needs to be better documented (the semantics on incoming edges, not > duplicating > labels used for incoming control flow). > > Can you see as to how to adjust the RTL side for this? It looks like at least > some places set a REG_SETJMP note on call_insns (emit_call_1), I wonder > if in rtl_verify_flow_info[_1] (or its callees) we can check that such > calls come > first ... they might not since IIRC we do _not_ preserve abnormal edges when > expanding RTL (there's some existing bug about this and how this breaks some > setjmp tests) (but we try to recompute them?). No, we emit arguments/return value handling before/after a REG_SETJMP call, and yeah, we don't always properly recompute abnormal edges, so improving RTL in this respect seems hopeless. For example, it is easy enough to create a testcase where bb-reordering duplicates a returns_twice call, although it runs so late that perhaps later passes don't care: // gcc -O2 --param=max-grow-copy-bb-insns=100 __attribute__((returns_twice)) int rtwice(int); int g1(int), g2(int); void f(int i) { do { i = i%2 ? g1(i) : g2(i); } while (i = rtwice(i)); } FWIW, I also investigated https://gcc.gnu.org/PR101347 > Sorry about the back-and-forth again ... your original patch looks OK for the > GIMPLE side but can you amend the cfghooks.{cc,h} documentation to > summarize our findings and > the desired semantics of duplicate_block in this respect? Like below? ---8<--- Subject: [PATCH v3] tree-cfg: do not duplicate returns_twice calls A returns_twice call may have associated abnormal edges that correspond to the "second return" from the call. If the call is duplicated, the copies of those edges also need to be abnormal, but e.g. tracer does not enforce that. Just prohibit the (unlikely to be useful) duplication. gcc/ChangeLog: * cfghooks.cc (duplicate_block): Expand comment. * tree-cfg.cc (gimple_can_duplicate_bb_p): Reject blocks with calls that may return twice. --- gcc/cfghooks.cc | 13 ++--- gcc/tree-cfg.cc | 7 +-- 2 files changed, 15 insertions(+), 5 deletions(-) diff --git a/gcc/cfghooks.cc b/gcc/cfghooks.cc index e435891fa..c6ac9532c 100644 --- a/gcc/cfghooks.cc +++ b/gcc/cfghooks.cc @@ -1086,9 +1086,16 @@ can_duplicate_block_p (const_basic_block bb) return cfg_hooks->can_duplicate_block_p (bb); } -/* Duplicates basic block BB and redirects edge E to it. Returns the - new basic block. The new basic block is placed after the basic block - AFTER. */ +/* Duplicate basic block BB, place it after AFTER (if non-null) and redirect + edge E to it (if non-null). Return the new basic block. + + If BB contains a returns_twice call, the caller is responsible for recreating + incoming abnormal edges corresponding to the "second return" for the copy. + gimple_can_duplicate_bb_p rejects such blocks, while RTL likes to live + dangerously. 
+ + If BB has incoming abnormal edges for some other reason, their destinations + should be tied to label(s) of the original BB and not the copy. */ basic_block duplicate_block (basic_block bb, edge e, basic_block after, copy_bb_data *id) diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc index f846dc2d8..5bcf78198 100644 --- a/gcc/tree-cfg.cc +++ b/gcc/tree-cfg.cc @@ -6346,12 +6346,15 @@ gimple_can_duplicate_bb_p (const_basic_block bb) { gimple *g = gsi_stmt (gsi); - /* An IFN_GOMP_SIMT_ENTER_ALLOC/IFN_GOMP_SIMT_EXIT call must be + /* Prohibit duplication of returns_twice calls, otherwise associated +abnormal edges also need to be duplicated properly. +An IFN_GOMP_SIMT_ENTER_ALLOC/IFN_GOMP_SIMT_EXIT call must be duplicated as part of its group, or not at all. The IFN_GOMP_SIMT_VOTE_ANY and IFN_GOMP_SIMT_XCHG_* are part of such a group, so the same holds there. */ if (is_gimple_call (g) - && (gimple_call_internal_p (g, IFN_GOMP_SIMT_ENTER_ALLOC) + && (gimple_call_flags (g) & ECF_RETURNS_TWICE + || gimple_call_internal_p (g, IFN_GOMP_SIMT_ENTER_ALLOC) || gimple_call_internal_p (g, IFN_GOMP_SIMT_EXIT) || gimple_call_internal_p (g, IFN_GOMP_SIMT_VOTE_ANY) || gimple_call_internal_p (g, IFN_GOMP_SIMT_XCHG_
[committed] .gitignore: do not ignore config.h
GCC does not support in-tree builds at the moment, so .gitignore concealing artifacts of accidental in-tree ./configure run may cause confusion. Un-ignore config.h, which is known to break the build. ChangeLog: * .gitignore: Do not ignore config.h. --- .gitignore | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/.gitignore b/.gitignore index 021a8c741..5cc4a0fdf 100644 --- a/.gitignore +++ b/.gitignore @@ -23,7 +23,8 @@ autom4te.cache config.cache -config.h +# GCC does not support in-tree builds, do not conceal a stray config.h: +# config.h config.intl config.log config.status -- 2.35.1
[PATCH 1/2] Remove unused remove_node_from_expr_list
This function remains unused since remove_node_from_insn_list was cloned from it. gcc/ChangeLog: * rtl.h (remove_node_from_expr_list): Remove declaration. * rtlanal.cc (remove_node_from_expr_list): Remove (no uses). --- gcc/rtl.h | 1 - gcc/rtlanal.cc | 29 - 2 files changed, 30 deletions(-) diff --git a/gcc/rtl.h b/gcc/rtl.h index 488016bb4..645c009a3 100644 --- a/gcc/rtl.h +++ b/gcc/rtl.h @@ -3712,7 +3712,6 @@ extern unsigned hash_rtx_cb (const_rtx, machine_mode, int *, int *, extern rtx regno_use_in (unsigned int, rtx); extern int auto_inc_p (const_rtx); extern bool in_insn_list_p (const rtx_insn_list *, const rtx_insn *); -extern void remove_node_from_expr_list (const_rtx, rtx_expr_list **); extern void remove_node_from_insn_list (const rtx_insn *, rtx_insn_list **); extern int loc_mentioned_in_p (rtx *, const_rtx); extern rtx_insn *find_first_parameter_load (rtx_insn *, rtx_insn *); diff --git a/gcc/rtlanal.cc b/gcc/rtlanal.cc index d78cc6024..ec95ecd6c 100644 --- a/gcc/rtlanal.cc +++ b/gcc/rtlanal.cc @@ -2878,35 +2878,6 @@ in_insn_list_p (const rtx_insn_list *listp, const rtx_insn *node) return false; } -/* Search LISTP (an EXPR_LIST) for an entry whose first operand is NODE and - remove that entry from the list if it is found. - - A simple equality test is used to determine if NODE matches. */ - -void -remove_node_from_expr_list (const_rtx node, rtx_expr_list **listp) -{ - rtx_expr_list *temp = *listp; - rtx_expr_list *prev = NULL; - - while (temp) -{ - if (node == temp->element ()) - { - /* Splice the node out of the list. */ - if (prev) - XEXP (prev, 1) = temp->next (); - else - *listp = temp->next (); - - return; - } - - prev = temp; - temp = temp->next (); -} -} - /* Search LISTP (an INSN_LIST) for an entry whose first operand is NODE and remove that entry from the list if it is found. -- 2.35.1
[PATCH 2/2] Avoid registering __builtin_setjmp_receiver label twice [PR101347]
The testcase in the PR demonstrates how it is possible for one __builtin_setjmp_receiver label to appear in nonlocal_goto_handler_labels list twice (after the block with __builtin_setjmp_setup referring to it was duplicated). remove_node_from_insn_list did not account for this possibility and removed only the first copy from the list. Add an assert verifying that duplicates are not present. To avoid adding a label to the list twice, move registration of the label from __builtin_setjmp_setup handling to __builtin_setjmp_receiver. gcc/ChangeLog: PR rtl-optimization/101347 * builtins.cc (expand_builtin) [BUILT_IN_SETJMP_SETUP]: Move population of nonlocal_goto_handler_labels from here ... (expand_builtin) [BUILT_IN_SETJMP_RECEIVER]: ... to here. * rtlanal.cc (remove_node_from_insn_list): Verify that a duplicate is not present in the remainder of the list. --- gcc/builtins.cc | 15 +++ gcc/rtlanal.cc | 1 + 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/gcc/builtins.cc b/gcc/builtins.cc index e6816d5c8..12a688dd8 100644 --- a/gcc/builtins.cc +++ b/gcc/builtins.cc @@ -7467,15 +7467,7 @@ expand_builtin (tree exp, rtx target, rtx subtarget, machine_mode mode, tree label = TREE_OPERAND (CALL_EXPR_ARG (exp, 1), 0); rtx_insn *label_r = label_rtx (label); - /* This is copied from the handling of non-local gotos. */ expand_builtin_setjmp_setup (buf_addr, label_r); - nonlocal_goto_handler_labels - = gen_rtx_INSN_LIST (VOIDmode, label_r, -nonlocal_goto_handler_labels); - /* ??? Do not let expand_label treat us as such since we would -not want to be both on the list of non-local labels and on -the list of forced labels. */ - FORCED_LABEL (label) = 0; return const0_rtx; } break; @@ -7488,6 +7480,13 @@ expand_builtin (tree exp, rtx target, rtx subtarget, machine_mode mode, rtx_insn *label_r = label_rtx (label); expand_builtin_setjmp_receiver (label_r); + nonlocal_goto_handler_labels + = gen_rtx_INSN_LIST (VOIDmode, label_r, +nonlocal_goto_handler_labels); + /* ??? Do not let expand_label treat us as such since we would +not want to be both on the list of non-local labels and on +the list of forced labels. */ + FORCED_LABEL (label) = 0; return const0_rtx; } break; diff --git a/gcc/rtlanal.cc b/gcc/rtlanal.cc index ec95ecd6c..56da7435a 100644 --- a/gcc/rtlanal.cc +++ b/gcc/rtlanal.cc @@ -2899,6 +2899,7 @@ remove_node_from_insn_list (const rtx_insn *node, rtx_insn_list **listp) else *listp = temp->next (); + gcc_checking_assert (!in_insn_list_p (temp->next (), node)); return; } -- 2.35.1
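For context, a user-level view of the machinery involved (my own sketch; this
is not the PR101347 testcase): __builtin_setjmp is lowered into an internal
__builtin_setjmp_setup call plus a receiver label, and it is that receiver
label which must appear exactly once in nonlocal_goto_handler_labels even if
the block containing the setup gets duplicated.

void *buf[5];              /* __builtin_setjmp uses a five-word buffer.  */

extern int work (void);    /* may call __builtin_longjmp (buf, 1) */

int f (void)
{
  if (__builtin_setjmp (buf))
    return 1;              /* the "second return", entered via the receiver */
  return work ();
}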
Re: [PATCH 2/3] tree-cfg: do not duplicate returns_twice calls
On Tue, 19 Jul 2022, Richard Biener wrote:

> > Like below?
>
> Yes.
>
> Thanks and sorry for the back and forth - this _is_ a mightly
> complicated area ...

No problem! This is the good, healthy kind of back-and-forth, and I am
grateful.

Pushed, including the tree-cfg validator enhancement in patch 3/3.

Alexander
Re: [PATCH v2] ipa-visibility: Optimize TLS access [PR99619]
Ping. On Thu, 7 Jul 2022, Alexander Monakov via Gcc-patches wrote: > From: Artem Klimov > > Fix PR99619, which asks to optimize TLS model based on visibility. > The fix is implemented as an IPA optimization: this allows to take > optimized visibility status into account (as well as avoid modifying > all language frontends). > > 2022-04-17 Artem Klimov > > gcc/ChangeLog: > > * ipa-visibility.cc (function_and_variable_visibility): Promote > TLS access model afer visibility optimizations. > * varasm.cc (have_optimized_refs): New helper. > (optimize_dyn_tls_for_decl_p): New helper. Use it ... > (decl_default_tls_model): ... here in place of 'optimize' check. > > gcc/testsuite/ChangeLog: > > * gcc.dg/tls/vis-attr-gd.c: New test. > * gcc.dg/tls/vis-attr-hidden-gd.c: New test. > * gcc.dg/tls/vis-attr-hidden.c: New test. > * gcc.dg/tls/vis-flag-hidden-gd.c: New test. > * gcc.dg/tls/vis-flag-hidden.c: New test. > * gcc.dg/tls/vis-pragma-hidden-gd.c: New test. > * gcc.dg/tls/vis-pragma-hidden.c: New test. > > Co-Authored-By: Alexander Monakov > Signed-off-by: Artem Klimov > --- > > v2: run the new loop in ipa-visibility only in the whole-program IPA pass; > in decl_default_tls_model, check if any referring function is optimized > when 'optimize == 0' (when running in LTO mode) > > > Note for reviewers: I noticed there's a place which tries to avoid TLS > promotion, but the comment seems wrong and I could not find a testcase. > I'd suggest we remove it. The compiler can only promote general-dynamic > to local-dynamic and initial-exec to local-exec. The comment refers to > promoting x-dynamic to y-exec, but that cannot happen AFAICT: > https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=8e1ba78f1b8eedd6c65c6f0e6d6d09a801de5d3d > > > gcc/ipa-visibility.cc | 19 +++ > gcc/testsuite/gcc.dg/tls/vis-attr-gd.c| 12 +++ > gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c | 13 > gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c| 12 +++ > gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c | 13 > gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c| 12 +++ > .../gcc.dg/tls/vis-pragma-hidden-gd.c | 17 ++ > gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c | 16 ++ > gcc/varasm.cc | 32 ++- > 9 files changed, 145 insertions(+), 1 deletion(-) > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-gd.c > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden-gd.c > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c > > diff --git a/gcc/ipa-visibility.cc b/gcc/ipa-visibility.cc > index 8a27e7bcd..3ed2b7cf6 100644 > --- a/gcc/ipa-visibility.cc > +++ b/gcc/ipa-visibility.cc > @@ -873,6 +873,25 @@ function_and_variable_visibility (bool whole_program) > } > } > > + if (symtab->state >= IPA_SSA) > +{ > + FOR_EACH_VARIABLE (vnode) > + { > + tree decl = vnode->decl; > + > + /* Upgrade TLS access model based on optimized visibility status, > + unless it was specified explicitly or no references remain. 
*/ > + if (DECL_THREAD_LOCAL_P (decl) > + && !lookup_attribute ("tls_model", DECL_ATTRIBUTES (decl)) > + && vnode->ref_list.referring.length ()) > + { > + enum tls_model new_model = decl_default_tls_model (decl); > + gcc_checking_assert (new_model >= decl_tls_model (decl)); > + set_decl_tls_model (decl, new_model); > + } > + } > +} > + >if (dump_file) > { >fprintf (dump_file, "\nMarking local functions:"); > diff --git a/gcc/varasm.cc b/gcc/varasm.cc > index 4db8506b1..de149e82c 100644 > --- a/gcc/varasm.cc > +++ b/gcc/varasm.cc > @@ -6679,6 +6679,36 @@ init_varasm_once (void) > #endif > } > > +/* Determine whether SYMBOL is used in any optimized function. */ > + > +static bool > +have_optimized_refs (struct symtab_node *symbol) > +{ > + struct ipa_ref *ref; > + > + for (int i = 0; symbol->iterate_referring (i, ref); i++) > +{ > + cgraph_node *cnode = dyn_cast (ref->referring); > + > + if (cnode && opt_for_fn (cnode->decl, optimize)) > +
Re: [PATCH 2/2] Avoid registering __builtin_setjmp_receiver label twice [PR101347]
On Wed, 20 Jul 2022, Eric Botcazou wrote:

> > Eric is probably most familiar with this, but can you make sure to bootstrap
> > and test this on a SJLJ EH target? I'm not sure --enable-sjlj-exceptions
> > is well tested anywhere but on targets not supporting DWARF EH and the
> > configury is a bit odd suggesting the option is mostly ignored ...
>
> This is a specific circuitry for __builtln_setjmp so it is *not* exercised by
> the SJLJ exception scheme. It used to be exercised by the GNAT bootstrap, but
> that's no longer the case either.
>
> I think that the fix is sensible, assuming that it passes the C testsuite.

Yes, it passes the usual regtest. Thanks, applying to trunk.

Alexander
Re: [PATCH v2] ipa-visibility: Optimize TLS access [PR99619]
Ping^2. On Wed, 20 Jul 2022, Alexander Monakov wrote: > > Ping. > > On Thu, 7 Jul 2022, Alexander Monakov via Gcc-patches wrote: > > > From: Artem Klimov > > > > Fix PR99619, which asks to optimize TLS model based on visibility. > > The fix is implemented as an IPA optimization: this allows to take > > optimized visibility status into account (as well as avoid modifying > > all language frontends). > > > > 2022-04-17 Artem Klimov > > > > gcc/ChangeLog: > > > > * ipa-visibility.cc (function_and_variable_visibility): Promote > > TLS access model afer visibility optimizations. > > * varasm.cc (have_optimized_refs): New helper. > > (optimize_dyn_tls_for_decl_p): New helper. Use it ... > > (decl_default_tls_model): ... here in place of 'optimize' check. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.dg/tls/vis-attr-gd.c: New test. > > * gcc.dg/tls/vis-attr-hidden-gd.c: New test. > > * gcc.dg/tls/vis-attr-hidden.c: New test. > > * gcc.dg/tls/vis-flag-hidden-gd.c: New test. > > * gcc.dg/tls/vis-flag-hidden.c: New test. > > * gcc.dg/tls/vis-pragma-hidden-gd.c: New test. > > * gcc.dg/tls/vis-pragma-hidden.c: New test. > > > > Co-Authored-By: Alexander Monakov > > Signed-off-by: Artem Klimov > > --- > > > > v2: run the new loop in ipa-visibility only in the whole-program IPA pass; > > in decl_default_tls_model, check if any referring function is optimized > > when 'optimize == 0' (when running in LTO mode) > > > > > > Note for reviewers: I noticed there's a place which tries to avoid TLS > > promotion, but the comment seems wrong and I could not find a testcase. > > I'd suggest we remove it. The compiler can only promote general-dynamic > > to local-dynamic and initial-exec to local-exec. The comment refers to > > promoting x-dynamic to y-exec, but that cannot happen AFAICT: > > https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=8e1ba78f1b8eedd6c65c6f0e6d6d09a801de5d3d > > > > > > gcc/ipa-visibility.cc | 19 +++ > > gcc/testsuite/gcc.dg/tls/vis-attr-gd.c| 12 +++ > > gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c | 13 > > gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c| 12 +++ > > gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c | 13 > > gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c| 12 +++ > > .../gcc.dg/tls/vis-pragma-hidden-gd.c | 17 ++ > > gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c | 16 ++ > > gcc/varasm.cc | 32 ++- > > 9 files changed, 145 insertions(+), 1 deletion(-) > > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-gd.c > > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c > > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c > > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c > > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c > > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden-gd.c > > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c > > > > diff --git a/gcc/ipa-visibility.cc b/gcc/ipa-visibility.cc > > index 8a27e7bcd..3ed2b7cf6 100644 > > --- a/gcc/ipa-visibility.cc > > +++ b/gcc/ipa-visibility.cc > > @@ -873,6 +873,25 @@ function_and_variable_visibility (bool whole_program) > > } > > } > > > > + if (symtab->state >= IPA_SSA) > > +{ > > + FOR_EACH_VARIABLE (vnode) > > + { > > + tree decl = vnode->decl; > > + > > + /* Upgrade TLS access model based on optimized visibility status, > > +unless it was specified explicitly or no references remain. 
*/ > > + if (DECL_THREAD_LOCAL_P (decl) > > + && !lookup_attribute ("tls_model", DECL_ATTRIBUTES (decl)) > > + && vnode->ref_list.referring.length ()) > > + { > > + enum tls_model new_model = decl_default_tls_model (decl); > > + gcc_checking_assert (new_model >= decl_tls_model (decl)); > > + set_decl_tls_model (decl, new_model); > > + } > > + } > > +} > > + > >if (dump_file) > > { > >fprintf (dump_file, "\nMarking local functions:"); > > diff --git a/gcc/varasm.cc b/gcc/varasm.cc > > index 4db8506b1..de149e82c 100644 > > --- a/gcc/varasm.cc > > +++ b/gcc/varasm.cc &
Re: [PATCH] c-family: implement -ffp-contract=on
Ping. OK for trunk? On Mon, 5 Jun 2023, Alexander Monakov wrote: > Ping for the front-end maintainers' input. > > On Mon, 22 May 2023, Richard Biener wrote: > > > On Thu, May 18, 2023 at 11:04 PM Alexander Monakov via Gcc-patches > > wrote: > > > > > > Implement -ffp-contract=on for C and C++ without changing default > > > behavior (=off for -std=cNN, =fast for C++ and -std=gnuNN). > > > > The documentation changes mention the defaults are changed for > > standard modes, I suppose you want to remove that hunk. > > > > > gcc/c-family/ChangeLog: > > > > > > * c-gimplify.cc (fma_supported_p): New helper. > > > (c_gimplify_expr) [PLUS_EXPR, MINUS_EXPR]: Implement FMA > > > contraction. > > > > > > gcc/ChangeLog: > > > > > > * common.opt (fp_contract_mode) [on]: Remove fallback. > > > * config/sh/sh.md (*fmasf4): Correct flag_fp_contract_mode test. > > > * doc/invoke.texi (-ffp-contract): Update. > > > * trans-mem.cc (diagnose_tm_1): Skip internal function calls. > > > --- > > > gcc/c-family/c-gimplify.cc | 78 ++ > > > gcc/common.opt | 3 +- > > > gcc/config/sh/sh.md| 2 +- > > > gcc/doc/invoke.texi| 8 ++-- > > > gcc/trans-mem.cc | 3 ++ > > > 5 files changed, 88 insertions(+), 6 deletions(-) > > > > > > diff --git a/gcc/c-family/c-gimplify.cc b/gcc/c-family/c-gimplify.cc > > > index ef5c7d919f..f7635d3b0c 100644 > > > --- a/gcc/c-family/c-gimplify.cc > > > +++ b/gcc/c-family/c-gimplify.cc > > > @@ -41,6 +41,8 @@ along with GCC; see the file COPYING3. If not see > > > #include "c-ubsan.h" > > > #include "tree-nested.h" > > > #include "context.h" > > > +#include "tree-pass.h" > > > +#include "internal-fn.h" > > > > > > /* The gimplification pass converts the language-dependent trees > > > (ld-trees) emitted by the parser into language-independent trees > > > @@ -686,6 +688,14 @@ c_build_bind_expr (location_t loc, tree block, tree > > > body) > > >return bind; > > > } > > > > > > +/* Helper for c_gimplify_expr: test if target supports fma-like FN. */ > > > + > > > +static bool > > > +fma_supported_p (enum internal_fn fn, tree type) > > > +{ > > > + return direct_internal_fn_supported_p (fn, type, OPTIMIZE_FOR_BOTH); > > > +} > > > + > > > /* Gimplification of expression trees. */ > > > > > > /* Do C-specific gimplification on *EXPR_P. PRE_P and POST_P are as in > > > @@ -739,6 +749,74 @@ c_gimplify_expr (tree *expr_p, gimple_seq *pre_p > > > ATTRIBUTE_UNUSED, > > > break; > > >} > > > > > > +case PLUS_EXPR: > > > +case MINUS_EXPR: > > > + { > > > + tree type = TREE_TYPE (*expr_p); > > > + /* For -ffp-contract=on we need to attempt FMA contraction only > > > + during initial gimplification. Late contraction across > > > statement > > > + boundaries would violate language semantics. */ > > > + if (SCALAR_FLOAT_TYPE_P (type) > > > + && flag_fp_contract_mode == FP_CONTRACT_ON > > > + && cfun && !(cfun->curr_properties & PROP_gimple_any) > > > + && fma_supported_p (IFN_FMA, type)) > > > + { > > > + bool neg_mul = false, neg_add = code == MINUS_EXPR; > > > + > > > + tree *op0_p = &TREE_OPERAND (*expr_p, 0); > > > + tree *op1_p = &TREE_OPERAND (*expr_p, 1); > > > + > > > + /* Look for ±(x * y) ± z, swapping operands if necessary. */ > > > + if (TREE_CODE (*op0_p) == NEGATE_EXPR > > > + && TREE_CODE (TREE_OPERAND (*op0_p, 0)) == MULT_EXPR) > > > + /* '*EXPR_P' is '-(x * y) ± z'. This is fine. 
*/; > > > + else if (TREE_CODE (*op0_p) != MULT_EXPR) > > > + { > > > + std::swap (op0_p, op1_p); > > > + std::swap (neg_mul, neg_add); > > > + } > > > + if (TREE_CODE (*op0_p) == NEGATE_EXPR) > > > + { > > > + op0_p = &am
Re: [PATCH] Break false dependence for vpternlog by inserting vpxor or setting constraint of input operand to '0'
On Mon, 10 Jul 2023, liuhongt via Gcc-patches wrote: > False dependency happens when destination is only updated by > pternlog. There is no false dependency when destination is also used > in source. So either a pxor should be inserted, or input operand > should be set with constraint '0'. > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > Ready to push to trunk. Shouldn't this patch also remove uses of vpternlog in standard_sse_constant_opcode? A couple more questions below: > --- a/gcc/config/i386/sse.md > +++ b/gcc/config/i386/sse.md > @@ -1382,6 +1382,29 @@ (define_insn "mov_internal" > ] > (symbol_ref "true")))]) > > +; False dependency happens on destination register which is not really > +; used when moving all ones to vector register > +(define_split > + [(set (match_operand:VMOVE 0 "register_operand") > + (match_operand:VMOVE 1 "int_float_vector_all_ones_operand"))] > + "TARGET_AVX512F && reload_completed > + && ( == 64 || EXT_REX_SSE_REG_P (operands[0])) > + && optimize_function_for_speed_p (cfun)" Yan's patch used optimize_insn_for_speed_p (), which looks more appropriate. Doesn't it work here as well? > + [(set (match_dup 0) (match_dup 2)) > + (parallel > + [(set (match_dup 0) (match_dup 1)) > + (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)])] > + "operands[2] = CONST0_RTX (mode);") > + > +(define_insn "*vmov_constm1_pternlog_false_dep" > + [(set (match_operand:VMOVE 0 "register_operand" "=v") > + (match_operand:VMOVE 1 "int_float_vector_all_ones_operand" > "")) > + (unspec [(match_operand:VMOVE 2 "register_operand" "0")] > UNSPEC_INSN_FALSE_DEP)] > + "TARGET_AVX512VL || == 64" > + "vpternlogd\t{$0xFF, %0, %0, %0|%0, %0, %0, 0xFF}" > + [(set_attr "type" "sselog1") > + (set_attr "prefix" "evex")]) > + > ;; If mem_addr points to a memory region with less than whole vector size > bytes > ;; of accessible memory and k is a mask that would prevent reading the > inaccessible > ;; bytes from mem_addr, add UNSPEC_MASKLOAD to prevent it to be transformed > to vpblendd > @@ -9336,7 +9359,7 @@ (define_expand "_cvtmask2" > operands[3] = CONST0_RTX (mode); >}") > > -(define_insn "*_cvtmask2" > +(define_insn_and_split "*_cvtmask2" >[(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v,v") > (vec_merge:VI48_AVX512VL > (match_operand:VI48_AVX512VL 2 "vector_all_ones_operand") > @@ -9346,11 +9369,35 @@ (define_insn "*_cvtmask2" >"@ > vpmovm2\t{%1, %0|%0, %1} > vpternlog\t{$0x81, %0, %0, %0%{%1%}%{z%}|%0%{%1%}%{z%}, > %0, %0, 0x81}" > + "&& !TARGET_AVX512DQ && reload_completed > + && optimize_function_for_speed_p (cfun)" > + [(set (match_dup 0) (match_dup 4)) > + (parallel > +[(set (match_dup 0) > + (vec_merge:VI48_AVX512VL > + (match_dup 2) > + (match_dup 3) > + (match_dup 1))) > + (unspec [(match_dup 0)] UNSPEC_INSN_FALSE_DEP)])] > + "operands[4] = CONST0_RTX (mode);" >[(set_attr "isa" "avx512dq,*") > (set_attr "length_immediate" "0,1") > (set_attr "prefix" "evex") > (set_attr "mode" "")]) > > +(define_insn "*_cvtmask2_pternlog_false_dep" > + [(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v") > + (vec_merge:VI48_AVX512VL > + (match_operand:VI48_AVX512VL 2 "vector_all_ones_operand") > + (match_operand:VI48_AVX512VL 3 "const0_operand") > + (match_operand: 1 "register_operand" "Yk"))) > + (unspec [(match_operand:VI48_AVX512VL 4 "register_operand" "0")] > UNSPEC_INSN_FALSE_DEP)] > + "TARGET_AVX512F && !TARGET_AVX512DQ" > + "vpternlog\t{$0x81, %0, %0, %0%{%1%}%{z%}|%0%{%1%}%{z%}, > %0, %0, 0x81}" > + [(set_attr "length_immediate" "1") > + (set_attr "prefix" "evex") > 
+ (set_attr "mode" "")]) > + > (define_expand "extendv2sfv2df2" >[(set (match_operand:V2DF 0 "register_operand") > (float_extend:V2DF > @@ -17166,20 +17213,32 @@ (define_expand "one_cmpl2" > operands[2] = force_reg (mode, operands[2]); > }) > > -(define_insn "one_cmpl2" > - [(set (match_operand:VI 0 "register_operand" "=v,v") > - (xor:VI (match_operand:VI 1 "bcst_vector_operand" "vBr,m") > - (match_operand:VI 2 "vector_all_ones_operand" "BC,BC")))] > +(define_insn_and_split "one_cmpl2" > + [(set (match_operand:VI 0 "register_operand" "=v,v,v") > + (xor:VI (match_operand:VI 1 "bcst_vector_operand" " 0, m,Br") > + (match_operand:VI 2 "vector_all_ones_operand" "BC,BC,BC")))] >"TARGET_AVX512F > && (! > || mode == SImode > || mode == DImode)" > { > + if (! && which_alternative > + && optimize_function_for_speed_p (cfun)) > +return "#"; > + >if (TARGET_AVX512VL) > return "vpternlog\t{$0x55, %1, %0, > %0|%0, %0, %1, 0x55}"; >else > return "vpternlog\t{$0x55, %g1, %g0, > %g0|%g0, %g0, %g1, 0x55}"; > } > + "&& reload_completed && !REG_P (operands[1]) && ! > + && optimize_function_for_speed_p (cfun)" > +
Re: [x86-64] RFC: Add nosse abi attribute
On Mon, 10 Jul 2023, Michael Matz via Gcc-patches wrote: > Hello, > > the ELF psABI for x86-64 doesn't have any callee-saved SSE > registers (there were actual reasons for that, but those don't > matter anymore). This starts to hurt some uses, as it means that > as soon as you have a call (say to memmove/memcpy, even if > implicit as libcall) in a loop that manipulates floating point > or vector data you get saves/restores around those calls. > > But in reality many functions can be written such that they only need > to clobber a subset of the 16 XMM registers (or do the save/restore > themself in the codepaths that needs them, hello memcpy again). > So we want to introduce a way to specify this, via an ABI attribute > that basically says "doesn't clobber the high XMM regs". I think the main question is why you're going with this (weak) form instead of the (strong) form "may only clobber the low XMM regs": as Richi noted, surely for libcalls we'd like to know they preserve AVX-512 mask registers as well? (I realize this is partially answered later) Note this interacts with anything that interposes between the caller and the callee, like the Glibc lazy binding stub (which used to zero out high halves of 512-bit arguments in ZMM registers). Not an immediate problem for the patch, just something to mind perhaps. > I've opted to do only the obvious: do something special only for > xmm8 to xmm15, without a way to specify the clobber set in more detail. > I think such half/half split is reasonable, and as I don't want to > change the argument passing anyway (whose regs are always clobbered) > there isn't that much wiggle room anyway. > > I chose to make it possible to write function definitions with that > attribute with GCC adding the necessary callee save/restore code in > the xlogue itself. But you can't trivially restore if the callee is sibcalling — what happens then (a testcase might be nice)? > Carefully note that this is only possible for > the SSE2 registers, as other parts of them would need instructions > that are only optional. What is supposed to happen on 32-bit x86 with -msse -mno-sse2? > When a function doesn't contain calls to > unknown functions we can be a bit more lenient: we can make it so that > GCC simply doesn't touch xmm8-15 at all, then no save/restore is > necessary. What if the source code has a local register variable bound to xmm15, i.e. register double x asm("xmm15"); asm("..." : "+x"(x)); ? Probably "dont'd do that", i.e. disallow that in the documentation? > If a function contains calls then GCC can't know which > parts of the XMM regset is clobbered by that, it may be parts > which don't even exist yet (say until avx2048 comes out), so we must > restrict ourself to only save/restore the SSE2 parts and then of course > can only claim to not clobber those parts. Hm, I guess this is kinda the reason a "weak" form is needed. But this highlights the difference between the two: the "weak" form will actively preserve some state (so it cannot preserve future extensions), while the "strong" form may just passively not touch any state, preserving any state it doesn't know about. > To that end I introduce actually two related attributes (for naming > see below): > * nosseclobber: claims (and ensures) that xmm8-15 aren't clobbered This is the weak/active form; I'd suggest "preserve_high_sse". 
> * noanysseclobber: claims (and ensures) that nothing of any of the > registers overlapping xmm8-15 is clobbered (not even future, as of > yet unknown, parts) This is the strong/passive form; I'd suggest "only_low_sse". > Ensuring the first is simple: potentially add saves/restore in xlogue > (e.g. when xmm8 is either used explicitely or implicitely by a call). > Ensuring the second comes with more: we must also ensure that no > functions are called that don't guarantee the same thing (in addition > to just removing all xmm8-15 parts alltogether from the available > regsters). > > See also the added testcases for what I intended to support. > > I chose to use the new target independend function-abi facility for > this. I need some adjustments in generic code: > * the "default_abi" is actually more like a "current" abi: it happily > changes its contents according to conditional_register_usage, > and other code assumes that such changes do propagate. > But if that conditonal_reg_usage is actually done because the current > function is of a different ABI, then we must not change default_abi. > * in insn_callee_abi we do look at a potential fndecl for a call > insn (only set when -fipa-ra), but doesn't work for calls through > pointers and (as said) is optional. So, also always look at the > called functions type (it's always recorded in the MEM_EXPR for > non-libcalls), before asking the target. > (The function-abi accessors working on trees were already doing that, > its just the RTL accessor that missed this) > > Accordingly I also implement some more
Re: [x86-64] RFC: Add nosse abi attribute
On Mon, 10 Jul 2023, Alexander Monakov wrote: > > I chose to make it possible to write function definitions with that > > attribute with GCC adding the necessary callee save/restore code in > > the xlogue itself. > > But you can't trivially restore if the callee is sibcalling — what > happens then (a testcase might be nice)? Sorry, when the caller is doing the sibcall, not the callee. Alexander
Re: [x86-64] RFC: Add nosse abi attribute
On Tue, 11 Jul 2023, Richard Biener wrote: > > > If a function contains calls then GCC can't know which > > > parts of the XMM regset is clobbered by that, it may be parts > > > which don't even exist yet (say until avx2048 comes out), so we must > > > restrict ourself to only save/restore the SSE2 parts and then of course > > > can only claim to not clobber those parts. > > > > Hm, I guess this is kinda the reason a "weak" form is needed. But this > > highlights the difference between the two: the "weak" form will actively > > preserve some state (so it cannot preserve future extensions), while > > the "strong" form may just passively not touch any state, preserving > > any state it doesn't know about. > > > > > To that end I introduce actually two related attributes (for naming > > > see below): > > > * nosseclobber: claims (and ensures) that xmm8-15 aren't clobbered > > > > This is the weak/active form; I'd suggest "preserve_high_sse". > > Isn't it the opposite? "preserves_low_sse", unless you suggest > the name applies to the caller which has to preserve high parts > when calling nosseclobber. This is the form where the function annotated with this attribute consumes 128 bytes on the stack to "blindly" save/restore xmm8-15 if it calls anything with a vanilla ABI. (actually thinking about it more, I'd like to suggest shelving this part and only implementing the zero-cost variant, noanysseclobber) > > > * noanysseclobber: claims (and ensures) that nothing of any of the > > > registers overlapping xmm8-15 is clobbered (not even future, as of > > > yet unknown, parts) > > > > This is the strong/passive form; I'd suggest "only_low_sse". > > Likewise. Sorry if I managed to sow confusion here. In my mind, this is the form where only xmm0-xmm7 can be written in the function annotated with the attribute, including its callees. I was thinking that writing to zmm16-31 would be disallowed too. The initial example was memcpy, where eight vector registers are sufficient for the job. > As for mask registers I understand we'd have to split the 8 register > set into two halves to make the same approach work, otherwise > we'd have no registers left to allocate from. I'd suggest looking at how many mask registers OpenMP SIMD AVX-512 clones can receive as implicit arguments, as one data point. Alexander
Re: [x86-64] RFC: Add nosse abi attribute
On Tue, 11 Jul 2023, Michael Matz wrote: > > > To that end I introduce actually two related attributes (for naming > > > see below): > > > * nosseclobber: claims (and ensures) that xmm8-15 aren't clobbered > > > > This is the weak/active form; I'd suggest "preserve_high_sse". > > But it preserves only the low parts :-) You swapped the two in your > mind when writing the reply? Ahhh. By "high SSE" I mean the high-numbered SSE regs, i.e. xmm8-15, not the higher halves of (unspecified subset of) SSE regs. If you look from AVX viewpoint, yes, it preserves lower 128 bits of the high-numbered vector registers. Alexander
Re: [x86-64] RFC: Add nosse abi attribute
On Tue, 11 Jul 2023, Michael Matz wrote: > Hey, > > On Tue, 11 Jul 2023, Alexander Monakov via Gcc-patches wrote: > > > > > > * nosseclobber: claims (and ensures) that xmm8-15 aren't clobbered > > > > > > > > This is the weak/active form; I'd suggest "preserve_high_sse". > > > > > > But it preserves only the low parts :-) You swapped the two in your > > > mind when writing the reply? > > > > Ahhh. By "high SSE" I mean the high-numbered SSE regs, i.e. xmm8-15, not > > the higher halves of (unspecified subset of) SSE regs. > > Ah, gotcha :-) It just shows that all these names are confusing. Maybe > I'll just go with "attribute1" and "attribute2" and rely on docu. (SCNR) Heh, that reminds me that decimal digits are allowed in attribute names. Let me offer "preserve_xmm_8_15" and "only_xmm_0_7" then. One more thing to keep in mind is interaction with SSE-AVX transition. If the function with a new attribute is using classic non-VEX-encoded SSE, but its caller is using 256-bit ymm0-15, it will incur a substantial penalty on Intel CPUs. There's no penalty on AMD (afaik) and no penalty for zmm16-31, since those are inaccessible in non-EVEX code. Alexander
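To make the naming suggestion concrete, declarations using the proposed spellings might look like the sketch below. The attribute names are just the suggestion from this message; nothing here is implemented in GCC.

  /* Hypothetical spelling; save/restore of xmm8-15 would be emitted in the
     xlogue when needed.  */
  __attribute__((preserve_xmm_8_15))
  void *my_memcpy (void *dst, const void *src, unsigned long n);

  /* Hypothetical zero-cost variant: the function and its callees may only
     write xmm0-xmm7 (and, per the discussion, not zmm16-31 either).  */
  __attribute__((only_xmm_0_7))
  void scale (float *buf, unsigned long n, float k);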
Re: RISC-V: Add divmod instruction support
On Mon, 20 Feb 2023, Richard Biener via Gcc-patches wrote: > On Sun, Feb 19, 2023 at 2:15 AM Maciej W. Rozycki wrote: > > > > > The problem is you don't see it as a divmod in expand_divmod unless you > > > expose > > > a divmod optab. See tree-ssa-mathopts.cc's divmod handling. > > > > That's the kind of stuff I'd expect to happen at the tree level though, > > before expand. > > The GIMPLE pass forming divmod could indeed choose to emit the > div + mul/sub sequence instead if an actual divmod pattern isn't available. > It could even generate some fake mul/sub/mod RTXen to cost the two > variants against each other but I seriously doubt any uarch that implements > division/modulo has a slower mul/sub. Making a correct decision requires knowing to which degree the divider is pipelined, and costs won't properly reflect that. If the divider accepts a new div/mod instruction every couple of cycles, it's faster to just issue a div followed by a mod with the same operands. Therefore I think in this case it's fair for GIMPLE level to just check if the divmod pattern is available, and let the target do the fine tuning via the divmod expander. It would make sense for tree-ssa-mathopts to emit div + mul/sub when neither 'divmod' nor 'mod' patterns are available, because RTL expansion will do the same, just later, and we'll rely on RTL CSE to clean up the redundant div. But RISC-V has both 'div' and 'mod', so as I tried to explain in the first paragraph we should let the target decide. Alexander
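For concreteness, the two expansion strategies being compared look like this in plain C (just a sketch):

  /* div followed by mod with the same operands: on a pipelined divider the
     second divide can be issued a couple of cycles after the first.  */
  void f1 (long n, long d, long *q, long *r)
  {
    *q = n / d;
    *r = n % d;
  }

  /* div + mul/sub: avoids the second divide, but the mul and sub are
     serialized behind the long-latency div.  */
  void f2 (long n, long d, long *q, long *r)
  {
    long t = n / d;
    *q = t;
    *r = n - t * d;
  }

Which variant wins depends on the divider, which is again an argument for letting the target decide.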
Re: [PATCH] [RFC] RAII auto_mpfr and autp_mpz
Hi, On Mon, 6 Mar 2023, Richard Biener via Gcc-patches wrote: > --- a/gcc/realmpfr.h > +++ b/gcc/realmpfr.h > @@ -24,6 +24,26 @@ > #include > #include > > +class auto_mpfr > +{ > +public: > + auto_mpfr () { mpfr_init (m_mpfr); } > + explicit auto_mpfr (mpfr_prec_t prec) { mpfr_init2 (m_mpfr, prec); } > + ~auto_mpfr () { mpfr_clear (m_mpfr); } > + > + operator mpfr_t& () { return m_mpfr; } > + > + auto_mpfr (const auto_mpfr &) = delete; > + auto_mpfr &operator=(const auto_mpfr &) = delete; Shouldn't this use the idiom suggested in ansidecl.h, i.e. private: DISABLE_COPY_AND_ASSIGN (auto_mpfr); Alexander
Re: [PATCH] [RFC] RAII auto_mpfr and autp_mpz
On Tue, 7 Mar 2023, Jonathan Wakely wrote: > > Shouldn't this use the idiom suggested in ansidecl.h, i.e. > > > > private: > > DISABLE_COPY_AND_ASSIGN (auto_mpfr); > > > Why? A macro like that (or a base class like boost::noncopyable) has > some value in a code base that wants to work for both C++03 and C++11 > (or later). But in GCC we know we have C++11 now, so we can just > delete members. I don't see what the macro adds. Evidently it's possible to forget to delete one of the members, as showcased in this very thread. The idiom is also slightly easier to read. Alexander
Re: [PATCH] MATCH: Add pattern for `signbit(x) ? x : -x` into abs (and swapped)
On Sat, 13 May 2023, Andrew Pinski via Gcc-patches wrote: > +/* signbit(x) != 0 ? -x : x -> abs(x) > + signbit(x) == 0 ? -x : x -> -abs(x) */ > +(for sign (SIGNBIT) Surprised to see a dummy iterator here. Was this meant to include float and long double versions of the builtin too (SIGNBITF and SIGNBITL)? > + (for neeq (ne eq) > + (simplify > + (cond (neeq (sign @0) integer_zerop) (negate @0) @0) > +(if (neeq == NE_EXPR) > + (abs @0) > + (negate (abs @0)) > + > (simplify > /* signbit(x) -> 0 if x is nonnegative. */ > (SIGNBIT tree_expr_nonnegative_p@0) Thanks. Alexander
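For reference, in C terms the first pattern corresponds to the following folding (sketch for the double case):

  /* signbit(x) != 0 ? -x : x  ->  abs(x)  */
  double f (double x) { return __builtin_signbit (x) != 0 ? -x : x; }
  /* ... is equivalent to ... */
  double g (double x) { return __builtin_fabs (x); }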
Re: [PATCH] MATCH: Add pattern for `signbit(x) ? x : -x` into abs (and swapped)
On Sun, 14 May 2023, Alexander Monakov wrote: > On Sat, 13 May 2023, Andrew Pinski via Gcc-patches wrote: > > > +/* signbit(x) != 0 ? -x : x -> abs(x) > > + signbit(x) == 0 ? -x : x -> -abs(x) */ > > +(for sign (SIGNBIT) > > Surprised to see a dummy iterator here. Was this meant to include > float and long double versions of the builtin too (SIGNBITF and SIGNBITL)? On the other hand, the following clauses both use SIGNBIT directly, and it would be nice to be consistent. > > + (for neeq (ne eq) > > + (simplify > > + (cond (neeq (sign @0) integer_zerop) (negate @0) @0) > > +(if (neeq == NE_EXPR) > > + (abs @0) > > + (negate (abs @0)) > > + > > (simplify > > /* signbit(x) -> 0 if x is nonnegative. */ > > (SIGNBIT tree_expr_nonnegative_p@0) > > Thanks. > Alexander >
Re: [PATCH] MATCH: Add pattern for `signbit(x) ? x : -x` into abs (and swapped)
On Sun, 14 May 2023, Andrew Pinski wrote: > It is NOT a dummy iterator. SIGNBIT is a operator list that expands to > "BUILT_IN_SIGNBITF BUILT_IN_SIGNBIT BUILT_IN_SIGNBITL IFN_SIGNBIT". Ah, it's in cfn-operators.pd in the build tree, not the source tree. > > On the other hand, the following clauses both use SIGNBIT directly, and > > it would be nice to be consistent. > > You cannot use the operator list directly if you have a for loop > expansion too. So it is internally consistent already. I see. Wasn't aware of the limitation. Thanks. Alexander
[committed] tree-ssa-math-opts: correct -ffp-contract= check
Since tree-ssa-math-opts may freely contract across statement boundaries we should enable it only for -ffp-contract=fast instead of disabling it for -ffp-contract=off. No functional change, since -ffp-contract=on is not exposed yet. gcc/ChangeLog: * tree-ssa-math-opts.cc (convert_mult_to_fma): Enable only for FP_CONTRACT_FAST (no functional change). --- Preapproved in PR 106092, pushed to trunk. gcc/tree-ssa-math-opts.cc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/tree-ssa-math-opts.cc b/gcc/tree-ssa-math-opts.cc index b58a2ac9e6..d71c51dc0e 100644 --- a/gcc/tree-ssa-math-opts.cc +++ b/gcc/tree-ssa-math-opts.cc @@ -3320,7 +3320,7 @@ convert_mult_to_fma (gimple *mul_stmt, tree op1, tree op2, imm_use_iterator imm_iter; if (FLOAT_TYPE_P (type) - && flag_fp_contract_mode == FP_CONTRACT_OFF) + && flag_fp_contract_mode != FP_CONTRACT_FAST) return false; /* We don't want to do bitfield reduction ops. */ -- 2.39.2
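The distinction matters for code like the sketch below: contracting across the statement boundary is only permitted under -ffp-contract=fast, which is why convert_mult_to_fma is now gated on FP_CONTRACT_FAST rather than on merely not being FP_CONTRACT_OFF.

  double f (double a, double b, double c)
  {
    double t = a * b;  /* separate statement */
    return t + c;      /* fusing this into fma (a, b, c) crosses the
                          statement boundary */
  }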
[PATCH] c-family: implement -ffp-contract=on
Implement -ffp-contract=on for C and C++ without changing default behavior (=off for -std=cNN, =fast for C++ and -std=gnuNN). gcc/c-family/ChangeLog: * c-gimplify.cc (fma_supported_p): New helper. (c_gimplify_expr) [PLUS_EXPR, MINUS_EXPR]: Implement FMA contraction. gcc/ChangeLog: * common.opt (fp_contract_mode) [on]: Remove fallback. * config/sh/sh.md (*fmasf4): Correct flag_fp_contract_mode test. * doc/invoke.texi (-ffp-contract): Update. * trans-mem.cc (diagnose_tm_1): Skip internal function calls. --- gcc/c-family/c-gimplify.cc | 78 ++ gcc/common.opt | 3 +- gcc/config/sh/sh.md| 2 +- gcc/doc/invoke.texi| 8 ++-- gcc/trans-mem.cc | 3 ++ 5 files changed, 88 insertions(+), 6 deletions(-) diff --git a/gcc/c-family/c-gimplify.cc b/gcc/c-family/c-gimplify.cc index ef5c7d919f..f7635d3b0c 100644 --- a/gcc/c-family/c-gimplify.cc +++ b/gcc/c-family/c-gimplify.cc @@ -41,6 +41,8 @@ along with GCC; see the file COPYING3. If not see #include "c-ubsan.h" #include "tree-nested.h" #include "context.h" +#include "tree-pass.h" +#include "internal-fn.h" /* The gimplification pass converts the language-dependent trees (ld-trees) emitted by the parser into language-independent trees @@ -686,6 +688,14 @@ c_build_bind_expr (location_t loc, tree block, tree body) return bind; } +/* Helper for c_gimplify_expr: test if target supports fma-like FN. */ + +static bool +fma_supported_p (enum internal_fn fn, tree type) +{ + return direct_internal_fn_supported_p (fn, type, OPTIMIZE_FOR_BOTH); +} + /* Gimplification of expression trees. */ /* Do C-specific gimplification on *EXPR_P. PRE_P and POST_P are as in @@ -739,6 +749,74 @@ c_gimplify_expr (tree *expr_p, gimple_seq *pre_p ATTRIBUTE_UNUSED, break; } +case PLUS_EXPR: +case MINUS_EXPR: + { + tree type = TREE_TYPE (*expr_p); + /* For -ffp-contract=on we need to attempt FMA contraction only + during initial gimplification. Late contraction across statement + boundaries would violate language semantics. */ + if (SCALAR_FLOAT_TYPE_P (type) + && flag_fp_contract_mode == FP_CONTRACT_ON + && cfun && !(cfun->curr_properties & PROP_gimple_any) + && fma_supported_p (IFN_FMA, type)) + { + bool neg_mul = false, neg_add = code == MINUS_EXPR; + + tree *op0_p = &TREE_OPERAND (*expr_p, 0); + tree *op1_p = &TREE_OPERAND (*expr_p, 1); + + /* Look for ±(x * y) ± z, swapping operands if necessary. */ + if (TREE_CODE (*op0_p) == NEGATE_EXPR + && TREE_CODE (TREE_OPERAND (*op0_p, 0)) == MULT_EXPR) + /* '*EXPR_P' is '-(x * y) ± z'. This is fine. */; + else if (TREE_CODE (*op0_p) != MULT_EXPR) + { + std::swap (op0_p, op1_p); + std::swap (neg_mul, neg_add); + } + if (TREE_CODE (*op0_p) == NEGATE_EXPR) + { + op0_p = &TREE_OPERAND (*op0_p, 0); + neg_mul = !neg_mul; + } + if (TREE_CODE (*op0_p) != MULT_EXPR) + break; + auto_vec ops (3); + ops.quick_push (TREE_OPERAND (*op0_p, 0)); + ops.quick_push (TREE_OPERAND (*op0_p, 1)); + ops.quick_push (*op1_p); + + enum internal_fn ifn = IFN_FMA; + if (neg_mul) + { + if (fma_supported_p (IFN_FNMA, type)) + ifn = IFN_FNMA; + else + ops[0] = build1 (NEGATE_EXPR, type, ops[0]); + } + if (neg_add) + { + enum internal_fn ifn2 = ifn == IFN_FMA ? 
IFN_FMS : IFN_FNMS; + if (fma_supported_p (ifn2, type)) + ifn = ifn2; + else + ops[2] = build1 (NEGATE_EXPR, type, ops[2]); + } + for (auto &&op : ops) + if (gimplify_expr (&op, pre_p, post_p, is_gimple_val, fb_rvalue) + == GS_ERROR) + return GS_ERROR; + + gcall *call = gimple_build_call_internal_vec (ifn, ops); + gimple_seq_add_stmt_without_update (pre_p, call); + *expr_p = create_tmp_var (type); + gimple_call_set_lhs (call, *expr_p); + return GS_ALL_DONE; + } + break; + } + default:; } diff --git a/gcc/common.opt b/gcc/common.opt index a28ca13385..3daec85aef 100644 --- a/gcc/common.opt +++ b/gcc/common.opt @@ -1662,9 +1662,8 @@ Name(fp_contract_mode) Type(enum fp_contract_mode) UnknownError(unknown floating EnumValue Enum(fp_contract_mode) String(off) Value(FP_CONTRACT_OFF) -; Not implemented, fall back to conservative FP_CONTRACT_OFF. EnumValue -Enum(fp_contract_mode) String(on) Value(FP_CONTRACT_OFF) +Enum(fp_contract_mode) String(on) Value(FP_CONTRACT_ON)
Re: [PATCH] c-family: implement -ffp-contract=on
On Mon, 22 May 2023, Richard Biener wrote: > On Thu, May 18, 2023 at 11:04 PM Alexander Monakov via Gcc-patches > wrote: > > > > Implement -ffp-contract=on for C and C++ without changing default > > behavior (=off for -std=cNN, =fast for C++ and -std=gnuNN). > > The documentation changes mention the defaults are changed for > standard modes, I suppose you want to remove that hunk. No, the current documentation is incomplete, and that hunk extends it to match the current GCC behavior. Should I break it out to a separate patch? I see this drive-by fix could look confusing — sorry about that. > it would be possible to do > > *expr_p = build_call_expr_internal (ifn, type, ops[0], ops[1]. ops[2]); > return GS_OK; > > and not worry about temporary creation and gimplifying of the operands. > That would in theory also leave the possibility to do this during > genericization instead (and avoid the guard against late invocation of > the hook). Ah, no, I deliberately decided against that, because that way we would go via gimplify_arg, which would emit all side effects in *pre_p. That seems wrong if arguments had side-effects that should go in *post_p. Thanks. Alexander > Otherwise it looks OK, but I'll let frontend maintainers have a chance to look > as well. > > Thanks for tackling this long-standing issue. > Richard.
Re: [PATCH] c-family: implement -ffp-contract=on
On Tue, 23 May 2023, Richard Biener wrote: > > Ah, no, I deliberately decided against that, because that way we would go > > via gimplify_arg, which would emit all side effects in *pre_p. That seems > > wrong if arguments had side-effects that should go in *post_p. > > Ah, true - that warrants a comment though. Incrementally fixed up in my tree like this: diff --git a/gcc/c-family/c-gimplify.cc b/gcc/c-family/c-gimplify.cc index f7635d3b0c..17b0610a89 100644 --- a/gcc/c-family/c-gimplify.cc +++ b/gcc/c-family/c-gimplify.cc @@ -803,6 +803,7 @@ c_gimplify_expr (tree *expr_p, gimple_seq *pre_p ATTRIBUTE_UNUSED, else ops[2] = build1 (NEGATE_EXPR, type, ops[2]); } + /* Avoid gimplify_arg: it emits all side effects into *PRE_P. */ for (auto &&op : ops) if (gimplify_expr (&op, pre_p, post_p, is_gimple_val, fb_rvalue) == GS_ERROR) Alexander
[PATCH] doc: clarify semantics of vector bitwise shifts
Explicitly say that bitwise shifts for narrow types work similar to element-wise C shifts with integer promotions, which coincides with OpenCL semantics. gcc/ChangeLog: * doc/extend.texi (Vector Extensions): Clarify bitwise shift semantics. --- gcc/doc/extend.texi | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi index e426a2eb7d..6b4e94b6a1 100644 --- a/gcc/doc/extend.texi +++ b/gcc/doc/extend.texi @@ -12026,7 +12026,12 @@ elements in the operand. It is possible to use shifting operators @code{<<}, @code{>>} on integer-type vectors. The operation is defined as following: @code{@{a0, a1, @dots{}, an@} >> @{b0, b1, @dots{}, bn@} == @{a0 >> b0, a1 >> b1, -@dots{}, an >> bn@}}@. Vector operands must have the same number of +@dots{}, an >> bn@}}@. When the base type is narrower than @code{int}, +element-wise shifts are performed as if operands underwent C integer +promotions, like in OpenCL. This makes vector shifts by up to 31 bits +well-defined for vectors with @code{char} and @code{short} base types. + +Operands of binary vector operations must have the same number of elements. For convenience, it is allowed to use a binary vector operation -- 2.39.2
Re: [PATCH] doc: clarify semantics of vector bitwise shifts
On Wed, 24 May 2023, Richard Biener wrote: > On Wed, May 24, 2023 at 2:54 PM Alexander Monakov via Gcc-patches > wrote: > > > > Explicitly say that bitwise shifts for narrow types work similar to > > element-wise C shifts with integer promotions, which coincides with > > OpenCL semantics. > > Do we need to clarify that v << w with v being a vector of shorts > still yields a vector of shorts and not a vector of ints? I don't think so, but if necessary we could add "and the result was truncated back to the base type": When the base type is narrower than @code{int}, element-wise shifts are performed as if operands underwent C integer promotions, and the result was truncated back to the base type, like in OpenCL. > Btw, I don't see this promotion reflected in the IL. For > > typedef short v8hi __attribute__((vector_size(16))); > > v8hi foo (v8hi a, v8hi b) > { > return a << b; > } > > I get no masking of 'b' and vector lowering if the target doens't handle it > yields > > short int _5; > short int _6; > > _5 = BIT_FIELD_REF ; > _6 = BIT_FIELD_REF ; > _7 = _5 << _6; > > which we could derive ranges from for _6 (apparantly we don't yet). Here it depends on how we define the GIMPLE-level semantics of bit-shift operators for narrow types. To avoid changing lowering we could say that shifting by up to 31 bits is well-defined for narrow types. RTL-level semantics are also undocumented, unfortunately. > Even > > typedef int v8hi __attribute__((vector_size(16))); > > v8hi x; > int foo (v8hi a, v8hi b) > { > x = a << b; > return (b[0] > 33); > } > > isn't optimized currently (but could - note I've used 'int' elements here). Yeah. But let's constrain the optimizations first. > So, I don't see us making sure the hardware does the right thing for > out-of bound values. I think in practice it worked out even if GCC did not pay attention to it, because SIMD instructions had to facilitate autovectorization for C with corresponding shift semantics. Alexander > > Richard. > > > gcc/ChangeLog: > > > > * doc/extend.texi (Vector Extensions): Clarify bitwise shift > > semantics. > > --- > > gcc/doc/extend.texi | 7 ++- > > 1 file changed, 6 insertions(+), 1 deletion(-) > > > > diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi > > index e426a2eb7d..6b4e94b6a1 100644 > > --- a/gcc/doc/extend.texi > > +++ b/gcc/doc/extend.texi > > @@ -12026,7 +12026,12 @@ elements in the operand. > > It is possible to use shifting operators @code{<<}, @code{>>} on > > integer-type vectors. The operation is defined as following: @code{@{a0, > > a1, @dots{}, an@} >> @{b0, b1, @dots{}, bn@} == @{a0 >> b0, a1 >> b1, > > -@dots{}, an >> bn@}}@. Vector operands must have the same number of > > +@dots{}, an >> bn@}}@. When the base type is narrower than @code{int}, > > +element-wise shifts are performed as if operands underwent C integer > > +promotions, like in OpenCL. This makes vector shifts by up to 31 bits > > +well-defined for vectors with @code{char} and @code{short} base types. > > + > > +Operands of binary vector operations must have the same number of > > elements. > > > > For convenience, it is allowed to use a binary vector operation > > -- > > 2.39.2 > > >
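To spell out the proposed wording in scalar terms, a per-element shift on a vector of short would behave as if each element was computed like this (sketch):

  short elem_shift (short a, int n)
  {
    /* Promote, shift, truncate back; well-defined for 0 <= n <= 31.  */
    return (short) ((int) a << n);
  }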
Re: [PATCH] doc: clarify semantics of vector bitwise shifts
On Wed, 24 May 2023, Richard Biener via Gcc-patches wrote: > I’d have to check the ISAs what they actually do here - it of course depends > on RTL semantics as well but as you say those are not strictly defined here > either. Plus, we can add the following executable test to the testsuite: #include <stdint.h> #define CHECK(TYPE, WIDTH, OP, COUNT, INVERT) \ { \ typedef TYPE vec __attribute__((vector_size(WIDTH))); \ \ static volatile vec zero; \ vec tmp = (zero-2) OP (COUNT);\ vec ref = INVERT zero;\ if (__builtin_memcmp(&tmp, &ref, sizeof tmp)) \ __builtin_abort();\ } int main(void) { CHECK( uint8_t, 16, <<, 8, ) CHECK( uint8_t, 16, <<, 31, ) CHECK( uint8_t, 16, >>, 8, ) CHECK( uint8_t, 16, >>, 31, ) CHECK( int8_t, 16, <<, 8, ) CHECK( int8_t, 16, <<, 31, ) CHECK( int8_t, 16, >>, 8, ~) CHECK( int8_t, 16, >>, 31, ~) CHECK(uint16_t, 16, <<, 16, ) CHECK(uint16_t, 16, <<, 31, ) CHECK(uint16_t, 16, >>, 16, ) CHECK(uint16_t, 16, >>, 31, ) CHECK( int16_t, 16, <<, 16, ) CHECK( int16_t, 16, <<, 31, ) CHECK( int16_t, 16, >>, 16, ~) CHECK( int16_t, 16, >>, 31, ~) // Per-lane-variable shifts: CHECK( uint8_t, 16, <<, zero+8, ) CHECK( uint8_t, 16, <<, zero+31, ) CHECK( uint8_t, 16, >>, zero+8, ) CHECK( uint8_t, 16, >>, zero+31, ) CHECK( int8_t, 16, <<, zero+8, ) CHECK( int8_t, 16, <<, zero+31, ) CHECK( int8_t, 16, >>, zero+8, ~) CHECK( int8_t, 16, >>, zero+31, ~) CHECK(uint16_t, 16, <<, zero+16, ) CHECK(uint16_t, 16, <<, zero+31, ) CHECK(uint16_t, 16, >>, zero+16, ) CHECK(uint16_t, 16, >>, zero+31, ) CHECK( int16_t, 16, <<, zero+16, ) CHECK( int16_t, 16, <<, zero+31, ) CHECK( int16_t, 16, >>, zero+16, ~) CHECK( int16_t, 16, >>, zero+31, ~) // Repeat for WIDTH=32 and WIDTH=64 } Alexander
Re: [Patch] [middle-end & nvptx] gcc/tracer.c: Don't split BB with SIMT LANE [PR95654]
Can you supply a tar with tree dumps for me to look at please? Also, if you can check if the problem can be triggered without a collapsed loop (e.g. try removing collapse(2), remove mentions of d2) and if so supply dumps from that instead, I'd appreciate that too. Alexander On Wed, 16 Sep 2020, Tom de Vries wrote: > [ cc-ing author omp support for nvptx. ] > > On 9/16/20 12:39 PM, Tobias Burnus wrote: > > Hi Tom, hi Richard, hello all, > > > > @Richard: does it look okay from the ME side? > > @Tom: Can you check which IFN_GOMP_SIMT should be > > excluded with -ftracer? > > > > Pre-remark, I do not know much about SIMT – except that they > > only appear with nvptx and somehow relate to lanes on the > > GPU. > > > > In any case, as the testcase libgomp.fortran/pr66199-5.f90 shows, > > if a basic block with GOMP_SIMT_VOTE_ANY in it is duplicated, > > which happens via -ftracer for this testcase, the result is wrong. > > > > The testcase ignores all the loop work but via "lastprivate" takes > > the upper loop bound (as assigned to the loop indices); instead of > > the expected 32*32 = 1024, either some number (like 4 or very large > > or negative) is returned. > > > > While GOMP_SIMT_VOTE_ANY fixes the issue for this testcase, I > > have the feeling that at least GOMP_SIMT_LAST_LANE should be > > not copied - but I might be wrong. > > > > Tom: Do you think GOMP_SIMT_LAST_LANE should be removed from > > that list – or any of the following added as well? > > GOMP_USE_SIMT, GOMP_SIMT_ENTER, GOMP_SIMT_ENTER_ALLOC, GOMP_SIMT_EXIT, > > GOMP_SIMT_VF, GOMP_SIMT_ORDERED_PRED, GOMP_SIMT_XCHG_BFLY, > > GOMP_SIMT_XCHG_IDX > > > > OK for mainline? > > > > Tobias > > > > - > > Mentor Graphics (Deutschland) GmbH, Arnulfstraße 201, 80634 München / > > Germany > > Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, > > Alexander Walter >
Re: [Patch] [middle-end & nvptx] gcc/tracer.c: Don't split BB with SIMT LANE [PR95654]
On Wed, 16 Sep 2020, Tom de Vries wrote: > [ cc-ing author omp support for nvptx. ] The issue looks familiar. I recognized it back in 2017 (and LLVM people recognized it too for their GPU targets). In an attempt to get agreement to fix the issue "properly" for GCC I found a similar issue that affects all targets, not just offloading, and filed it as PR 80053. (yes, there are no addressable labels involved in offloading, but nevertheless the nature of the middle-end issue is related) Alexander
Re: [PATCH] gcov: fix TOPN streaming from shared libraries
On Mon, 21 Sep 2020, Martin Liška wrote: > On 9/6/20 1:24 PM, Sergei Trofimovich wrote: > > From: Sergei Trofimovich > > > > Before the change gcc did not stream correctly TOPN counters > > if counters belonged to a non-local shared object. > > > > As a result zero-section optimization generated TOPN sections > > in a form not recognizable by '__gcov_merge_topn'. > > > > The problem happens because in a case of multiple shared objects > > '__gcov_merge_topn' function is present in address space multiple > > times (once per each object). > > > > The fix is to never rely on function address and predicate on TOPN > > counter types. > > Hello. > > Thank you for the analysis! I think it's the correct fix and it's probably > similar to what we used to see for indirect_call_tuple. > > @Alexander: Am I right? Yes, analysis presented by Sergei in Bugzilla looks correct. Pedantically I wouldn't say the indirect call issue was similar: it's a different gotcha arising from mixing static and dynamic linking. There we had some symbols preempted by the main executable (but not all symbols), here we have lack of preemption/unification as relevant libgcov symbol is hidden. I cannot judge if the fix is correct (don't know the code that well) but it looks reasonable. If you could come up with a clearer wording for the new comment it would be nice, I struggled to understand it. Thanks. Alexander
Re: [PATCH][omp, ftracer] Don't duplicate blocks in SIMT region
On Mon, 5 Oct 2020, Tom de Vries wrote: > I've had to modify this patch in two ways: > - the original test-case stopped failing, though not the > minimized one, so I added that one as a test-case > - only testing for ENTER_ALLOC and EXIT, and not explicitly for VOTE_ANY > in ignore_bb_p also stopped working, so I've added that now. > > Re-tested and committed. I don't understand, was the patch already approved somewhere? It has some issues. > --- a/gcc/tracer.c > +++ b/gcc/tracer.c > @@ -108,6 +108,24 @@ ignore_bb_p (const_basic_block bb) > return true; > } > > + for (gimple_stmt_iterator gsi = gsi_start_bb (CONST_CAST_BB (bb)); > + !gsi_end_p (gsi); gsi_next (&gsi)) > +{ > + gimple *g = gsi_stmt (gsi); > + > + /* An IFN_GOMP_SIMT_ENTER_ALLOC/IFN_GOMP_SIMT_EXIT call must be > + duplicated as part of its group, or not at all. What does "its group" stand for? It seems obviously copy-pasted from the description of IFN_UNIQUE treatment, where it is even less clear what the "group" is. (I know what it means, but the comment is not explaining things well at all) > + The IFN_GOMP_SIMT_VOTE_ANY is currently part of such a group, > + so the same holds there, but it could be argued that the > + IFN_GOMP_SIMT_VOTE_ANY could be generated after that group, > + in which case it could be duplicated. */ No, something like that cannot be argued, as VOTE_ANY may have data dependencies to storage that is deallocated by SIMT_EXIT. You seem to be claiming something that is simply not possible with the current design. > + if (is_gimple_call (g) > + && (gimple_call_internal_p (g, IFN_GOMP_SIMT_ENTER_ALLOC) > + || gimple_call_internal_p (g, IFN_GOMP_SIMT_EXIT) > + || gimple_call_internal_p (g, IFN_GOMP_SIMT_VOTE_ANY))) Hm? So you are leaving SIMT_XCHG_* be until the next testcase breaks? > + return true; > +} > + >return false; > } Alexander
Re: [PATCH] [x86]Don't optimize cmp mem, 0 to load mem, reg + test reg, reg
On Fri, 16 Sep 2022, Uros Bizjak via Gcc-patches wrote: > On Fri, Sep 16, 2022 at 3:32 AM Jeff Law via Gcc-patches > wrote: > > > > > > On 9/15/22 19:06, liuhongt via Gcc-patches wrote: > > > There's peephole2 submit in 1990s which split cmp mem, 0 to load mem, > > > reg + test reg, reg. I don't know exact reason why gcc do this. > > > > > > For latest x86 processors, ciscization should help processor frontend > > > also codesize, for processor backend, they should be the same(has same > > > uops). > > > > > > So the patch deleted the peephole2, and also modify another splitter to > > > generate more cmp mem, 0 for 32-bit target. > > > > > > It will help instruction fetch. > > > > > > for minmax-1.c minmax-2.c minmax-10, pr96891.c, it's supposed to scan > > > there's no > > > comparison to 1 or -1, so adjust the testcase since under 32-bit > > > target, we now generate cmp mem, 0 instead of load + test. > > > > > > Similar for pr78035.c. > > > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} > > > No performance impact for SPEC2017 on ICX/Znver3. > > > > > It was almost certainly for PPro/P2 given it was rth's work from > > 1999.Probably should have been conditionalized on PPro/P2 at the > > time. No worries losing it now... > > Please add a tune flag in x86-tune.def under "Historical relics" and > use it in the relevant peephole2 instead of deleting it. When the next instruction after 'load mem; test reg, reg' is a conditional branch, this disables macro-op fusion because Intel CPUs do not macro-fuse 'cmp mem, imm; jcc'. It would be nice to rephrase the commit message to acknowledge this (the statement 'has same uops' is not always true with this considered). AMD CPUs can fuse some 'cmp mem, imm; jcc' under some conditions, so this should be beneficial for AMD. Alexander
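To illustrate the fusion point (a sketch, not the output of any particular compiler version): for code like

  extern int flag;
  void g (void);
  void f (void) { if (flag) g (); }

the split form loads 'flag' into a register and branches via 'test reg, reg; jne', a pair that macro-fuses on Intel CPUs, whereas the ciscized 'cmp mem, 0; jne' pair does not fuse there, as noted above.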
Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
Hi. On the high level, I'd be highly uncomfortable with this. I guess we are in vague agreement that it cannot be efficiently implemented. It also goes against the good practice of accelerator programming, which requires queueing work on the accelerator and letting it run asynchronously with the CPU with high occupancy. (I know libgomp still waits for the GPU to finish in each GOMP_offload_run, but maybe it's better to improve *that* instead of piling on new slowness) What I said above also applies to MPI+GPU scenarios: a well-designed algorithm should arrange for MPI communications to happen in parallel with some useful offloaded calculations. I don't see the value in implementing the ability to invoke an MPI call from the accelerator in such inefficient fashion. (so yes, I disagree with "it is better to provide a feature even if it is slow – than not providing it at all", when it is advertised as a general-purpose feature, not a purely debugging helper) On to the patch itself. IIRC one of the questions was use of CUDA managed memory. I think it is unsafe because device-issued atomics are not guaranteed to appear atomic to the host, unless compiling for compute capability 6.0 or above, and using system-scope atomics ("atom.sys"). And for non-USM code path you're relying on cudaMemcpy observing device-side atomics in the right order. Atomics aside, CUDA pinned memory would be a natural choice for such a tiny structure. Did you rule it out for some reason? Some remarks on the diff below, not intended to be a complete review. Alexander > --- a/libgomp/config/nvptx/target.c > +++ b/libgomp/config/nvptx/target.c > @@ -26,7 +26,29 @@ > #include "libgomp.h" > #include > > +#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var Shouldn't this be in a header (needs to be in sync with the plugin). > + > +/* Reverse offload. Must match version used in plugin/plugin-nvptx.c. */ > +struct rev_offload { > + uint64_t fn; > + uint64_t mapnum; > + uint64_t addrs; > + uint64_t sizes; > + uint64_t kinds; > + int32_t dev_num; > + uint32_t lock; > +}; Likewise. > + > +#if (__SIZEOF_SHORT__ != 2 \ > + || __SIZEOF_SIZE_T__ != 8 \ > + || __SIZEOF_POINTER__ != 8) > +#error "Data-type conversion required for rev_offload" > +#endif Huh? This is not a requirement that is new for reverse offload, it has always been like that for offloading (all ABI rules regarding type sizes, struct layout, bitfield layout, endianness must match). > + > + > extern int __gomp_team_num __attribute__((shared)); > +extern volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS; > +volatile struct rev_offload *GOMP_REV_OFFLOAD_VAR; > > bool > GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper, > @@ -88,16 +110,32 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t > mapnum, >void **hostaddrs, size_t *sizes, unsigned short *kinds, >unsigned int flags, void **depend, void **args) > { > - (void) device; > - (void) fn; > - (void) mapnum; > - (void) hostaddrs; > - (void) sizes; > - (void) kinds; >(void) flags; >(void) depend; >(void) args; > - __builtin_unreachable (); > + > + if (device != GOMP_DEVICE_HOST_FALLBACK > + || fn == NULL > + || GOMP_REV_OFFLOAD_VAR == NULL) > +return; Shouldn't this be an 'assert' instead? 
> + > + while (__sync_lock_test_and_set (&GOMP_REV_OFFLOAD_VAR->lock, (uint8_t) 1)) > +; /* spin */ > + > + __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->mapnum, mapnum, __ATOMIC_SEQ_CST); > + __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->addrs, hostaddrs, > __ATOMIC_SEQ_CST); > + __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->sizes, sizes, __ATOMIC_SEQ_CST); > + __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->kinds, kinds, __ATOMIC_SEQ_CST); > + __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->dev_num, > + GOMP_ADDITIONAL_ICVS.device_num, __ATOMIC_SEQ_CST); Looks like all these can be plain stores, you only need ... > + > + /* 'fn' must be last. */ > + __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->fn, fn, __ATOMIC_SEQ_CST); ... this to be atomic with 'release' semantics in the usual producer-consumer pattern. > + > + /* Processed on the host - when done, fn is set to NULL. */ > + while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_SEQ_CST) != 0) > +; /* spin */ > + __sync_lock_release (&GOMP_REV_OFFLOAD_VAR->lock); > } > > void > diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c > index 9d4cc62..316de74 100644 > --- a/libgomp/libgomp-plugin.c > +++ b/libgomp/libgomp-plugin.c > @@ -78,3 +78,15 @@ GOMP_PLUGIN_fatal (const char *msg, ...) >gomp_vfatal (msg, ap); >va_end (ap); > } > + > +void > +GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t > devaddrs_ptr, > + uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num, > + void (*dev_to_host_cpy) (void *, const void *, size_t, > +
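For reference, the producer-consumer shape I mean is sketched below in plain C; this is illustrative only, not the actual libgomp or plugin code, and it ignores the separate question of host visibility of device-issued atomics.

  /* Producer (device side), given the 'struct rev_offload *var' from the
     patch: plain stores for the payload, one releasing store to publish it.  */
  var->mapnum = mapnum;
  var->addrs = (uint64_t) hostaddrs;
  /* ... remaining fields as plain stores ... */
  __atomic_store_n (&var->fn, (uint64_t) fn, __ATOMIC_RELEASE);

  /* Consumer side: acquiring load of 'fn', then plain loads of the rest.  */
  while (__atomic_load_n (&var->fn, __ATOMIC_ACQUIRE) == 0)
    ; /* spin */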
Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
Hi. My main concerns remain unaddressed: 1) what I said in the opening paragraphs of my previous email; 2) device-issued atomics are not guaranteed to appear atomic to the host unless using atom.sys and translating for CUDA compute capability 6.0+. Item 2 is a correctness issue. Item 1 I think is a matter of policy that is up to you to hash out with Jakub. On Mon, 26 Sep 2022, Tobias Burnus wrote: > In theory, compiling with "-m32 -foffload-options=-m64" or "-m32 -foffload-options=-m32" or "-m64 -foffload-options=-m32" is supported. I have no words. Alexander
Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
On Tue, 27 Sep 2022, Tobias Burnus wrote: > Ignoring (1), does the overall patch and this part otherwise look okay(ish)? > > > Caveat: The .sys scope works well with >= sm_60 but not does not handle > older versions. For those, the __atomic_{load/store}_n are used. I do not > see a good solution beyond documentation. In the way it is used (one > thread only setting only on/off flag, no atomic increments etc.), I think > it is unlikely to cause races without .sys scope, but as always is > difficult to rule out some special unfortunate case where it does. At > lease we do have now some documentation (in general) - which still needs > to be expanded and improved. For this feature, I did not add any wording > in this patch: until the feature is actually enabled, it would be more > confusing than helpful. If the implication is that distros will ship a racy-by-default implementation, unless they know about the problem and configure for sm_60, then no, that doesn't look fine to me. A possible solution is not enabling a feature that has a known correctness issue. Alexander
Re: [PATCH v2] ipa-visibility: Optimize TLS access [PR99619]
Ping^3. On Fri, 5 Aug 2022, Alexander Monakov wrote: > Ping^2. > > On Wed, 20 Jul 2022, Alexander Monakov wrote: > > > > > Ping. > > > > On Thu, 7 Jul 2022, Alexander Monakov via Gcc-patches wrote: > > > > > From: Artem Klimov > > > > > > Fix PR99619, which asks to optimize TLS model based on visibility. > > > The fix is implemented as an IPA optimization: this allows to take > > > optimized visibility status into account (as well as avoid modifying > > > all language frontends). > > > > > > 2022-04-17 Artem Klimov > > > > > > gcc/ChangeLog: > > > > > > * ipa-visibility.cc (function_and_variable_visibility): Promote > > > TLS access model afer visibility optimizations. > > > * varasm.cc (have_optimized_refs): New helper. > > > (optimize_dyn_tls_for_decl_p): New helper. Use it ... > > > (decl_default_tls_model): ... here in place of 'optimize' check. > > > > > > gcc/testsuite/ChangeLog: > > > > > > * gcc.dg/tls/vis-attr-gd.c: New test. > > > * gcc.dg/tls/vis-attr-hidden-gd.c: New test. > > > * gcc.dg/tls/vis-attr-hidden.c: New test. > > > * gcc.dg/tls/vis-flag-hidden-gd.c: New test. > > > * gcc.dg/tls/vis-flag-hidden.c: New test. > > > * gcc.dg/tls/vis-pragma-hidden-gd.c: New test. > > > * gcc.dg/tls/vis-pragma-hidden.c: New test. > > > > > > Co-Authored-By: Alexander Monakov > > > Signed-off-by: Artem Klimov > > > --- > > > > > > v2: run the new loop in ipa-visibility only in the whole-program IPA pass; > > > in decl_default_tls_model, check if any referring function is > > > optimized > > > when 'optimize == 0' (when running in LTO mode) > > > > > > > > > Note for reviewers: I noticed there's a place which tries to avoid TLS > > > promotion, but the comment seems wrong and I could not find a testcase. > > > I'd suggest we remove it. The compiler can only promote general-dynamic > > > to local-dynamic and initial-exec to local-exec. 
The comment refers to > > > promoting x-dynamic to y-exec, but that cannot happen AFAICT: > > > https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=8e1ba78f1b8eedd6c65c6f0e6d6d09a801de5d3d > > > > > > > > > gcc/ipa-visibility.cc | 19 +++ > > > gcc/testsuite/gcc.dg/tls/vis-attr-gd.c| 12 +++ > > > gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c | 13 > > > gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c| 12 +++ > > > gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c | 13 > > > gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c| 12 +++ > > > .../gcc.dg/tls/vis-pragma-hidden-gd.c | 17 ++ > > > gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c | 16 ++ > > > gcc/varasm.cc | 32 ++- > > > 9 files changed, 145 insertions(+), 1 deletion(-) > > > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-gd.c > > > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c > > > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c > > > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c > > > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c > > > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden-gd.c > > > create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c > > > > > > diff --git a/gcc/ipa-visibility.cc b/gcc/ipa-visibility.cc > > > index 8a27e7bcd..3ed2b7cf6 100644 > > > --- a/gcc/ipa-visibility.cc > > > +++ b/gcc/ipa-visibility.cc > > > @@ -873,6 +873,25 @@ function_and_variable_visibility (bool whole_program) > > > } > > > } > > > > > > + if (symtab->state >= IPA_SSA) > > > +{ > > > + FOR_EACH_VARIABLE (vnode) > > > + { > > > + tree decl = vnode->decl; > > > + > > > + /* Upgrade TLS access model based on optimized visibility status, > > > + unless it was specified explicitly or no references remain. */ > > > + if (DECL_THREAD_LOCAL_P (decl) > > > + && !lookup_attribute ("tls_model", DECL_ATTRIBUTES (decl)) > > > + && vnode->ref_list.referring.length ()) > > > + { > > > + enum tls_model new_model = decl_default_tls_model (decl
[PATCH] i386: avoid zero extension for crc32q
The crc32q instruction takes 64-bit operands, but ignores high 32 bits of the destination operand, and zero-extends the result from 32 bits. Let's model this in the RTL pattern to avoid zero-extension when the _mm_crc32_u64 intrinsic is used with a 32-bit type. PR target/106453 gcc/ChangeLog: * config/i386/i386.md (sse4_2_crc32di): Model that only low 32 bits of operand 0 are consumed, and the result is zero-extended to 64 bits. gcc/testsuite/ChangeLog: * gcc.target/i386/pr106453.c: New test. --- gcc/config/i386/i386.md | 6 +++--- gcc/testsuite/gcc.target/i386/pr106453.c | 13 + 2 files changed, 16 insertions(+), 3 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/pr106453.c diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 58fcc382f..b5760bb23 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -23823,10 +23823,10 @@ (define_insn "sse4_2_crc32di" [(set (match_operand:DI 0 "register_operand" "=r") - (unspec:DI - [(match_operand:DI 1 "register_operand" "0") + (zero_extend:DI (unspec:SI + [(match_operand:SI 1 "register_operand" "0") (match_operand:DI 2 "nonimmediate_operand" "rm")] - UNSPEC_CRC32))] + UNSPEC_CRC32)))] "TARGET_64BIT && TARGET_CRC32" "crc32{q}\t{%2, %0|%0, %2}" [(set_attr "type" "sselog1") diff --git a/gcc/testsuite/gcc.target/i386/pr106453.c b/gcc/testsuite/gcc.target/i386/pr106453.c new file mode 100644 index 0..bab5b1cb2 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106453.c @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-options "-msse4.2 -O2 -fdump-rtl-final" } */ +/* { dg-final { scan-rtl-dump-not "zero_extendsidi" "final" } } */ + +#include +#include + +uint32_t f(uint32_t c, uint64_t *p, size_t n) +{ +for (size_t i = 0; i < n; i++) +c = _mm_crc32_u64(c, p[i]); +return c; +} -- 2.35.1
Re: [PATCH] i386: avoid zero extension for crc32q
On Tue, 23 Aug 2022, Alexander Monakov via Gcc-patches wrote: > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106453.c > @@ -0,0 +1,13 @@ > +/* { dg-do compile } */ > +/* { dg-options "-msse4.2 -O2 -fdump-rtl-final" } */ > +/* { dg-final { scan-rtl-dump-not "zero_extendsidi" "final" } } */ I noticed that the test is 64-bit only and added the following fixup in my tree: --- a/gcc/testsuite/gcc.target/i386/pr106453.c +++ b/gcc/testsuite/gcc.target/i386/pr106453.c @@ -1,4 +1,4 @@ -/* { dg-do compile } */ +/* { dg-do compile { target { ! ia32 } } } */ /* { dg-options "-msse4.2 -O2 -fdump-rtl-final" } */ /* { dg-final { scan-rtl-dump-not "zero_extendsidi" "final" } } */
Re: [PATCH v2] ipa-visibility: Optimize TLS access [PR99619]
On Fri, 26 Aug 2022, Martin Jambor wrote: > > +/* Check if promoting general-dynamic TLS access model to local-dynamic is > > + desirable for DECL. */ > > + > > +static bool > > +optimize_dyn_tls_for_decl_p (const_tree decl) > > +{ > > + if (optimize) > > +return true; > > ...this. This is again an access to optimize which in LTO IPA phase is > not really meaningful. Can the test be simply removed? If not (but > please look at why), I'd suggest to test the overall optimize level only > if there is a non-NULL cfun. I'd prefer to keep it. This code is also called from the front-ends when assigning initial TLS access model (during parsing, at the point where visibility attributes, if present, have not been processed yet). There we don't have IPA structures, but 'optimize' is set up. I also want to avoid iterating over referring functions in non-LTO mode where trusting 'optimize' should be fine during this IPA pass I think? Alexander
Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
On Fri, 26 Aug 2022, Tobias Burnus wrote: > @Tom and Alexander: Better suggestions are welcome for the busy loop in > libgomp/plugin/plugin-nvptx.c regarding the variable placement and checking > its value. I think to do that without polling you can use the PTX 'brkpt' instruction on the device and the CUDA Debugger API on the host (but you'd have to be careful about interactions with the real debugger). What did the standardization process for this feature look like? How did it pass if it's not efficiently implementable for the major offloading targets? Alexander
Re: [PATCH v2] ipa-visibility: Optimize TLS access [PR99619]
On Tue, 30 Aug 2022, Martin Jambor wrote: > There is still the optimize attribute so in fact no, even in non-LTO > mode if there is no current function, you cannot trust the "global" > "optimize" thing. > > Ideally we would assert that no "analysis" phase of an IPA pass reads > the global optimization flags, so please don't add new places where we > do. > > You can either add a parameter to decl_default_tls_model to tell it > what context it is called from and IMHO it would also be acceptable to > check whether we have a non-NULL cfun and decide based on that (but here > I only hope it is not something others might object to). I see, thank you for explaining the issue, and sorry if I was a bit stubborn. Does the attached patch (incremental change below) look better? It no longer has the 'shortcut' where iterating over referrers is avoided for the common case of plain 'gcc -O2' and no 'optimize' attributes, but fortunately TLS variables are not so numerous to make chasing that worthwhile. --- a/gcc/varasm.cc +++ b/gcc/varasm.cc @@ -6703,8 +6703,8 @@ have_optimized_refs (struct symtab_node *symbol) static bool optimize_dyn_tls_for_decl_p (const_tree decl) { - if (optimize) -return true; + if (cfun) +return optimize; return symtab->state >= IPA && have_optimized_refs (symtab_node::get (decl)); } Alexander
Re: [PATCH v2] ipa-visibility: Optimize TLS access [PR99619]
> I see, thank you for explaining the issue, and sorry if I was a bit stubborn. > > Does the attached patch (incremental change below) look better? It no longer > has the 'shortcut' where iterating over referrers is avoided for the common > case of plain 'gcc -O2' and no 'optimize' attributes, but fortunately TLS > variables are not so numerous to make chasing that worthwhile. ... and of course I forgot to add the attachment. Revised patch below. ---8<--- >From b245015ec465604799aef60b224b1e1e264d4cb8 Mon Sep 17 00:00:00 2001 From: Artem Klimov Date: Wed, 6 Jul 2022 17:02:01 +0300 Subject: [PATCH] ipa-visibility: Optimize TLS access [PR99619] Fix PR99619, which asks to optimize TLS model based on visibility. The fix is implemented as an IPA optimization: this allows to take optimized visibility status into account (as well as avoid modifying all language frontends). 2022-04-17 Artem Klimov gcc/ChangeLog: * ipa-visibility.cc (function_and_variable_visibility): Promote TLS access model afer visibility optimizations. * varasm.cc (have_optimized_refs): New helper. (optimize_dyn_tls_for_decl_p): New helper. Use it ... (decl_default_tls_model): ... here in place of 'optimize' check. gcc/testsuite/ChangeLog: * gcc.dg/tls/vis-attr-gd.c: New test. * gcc.dg/tls/vis-attr-hidden-gd.c: New test. * gcc.dg/tls/vis-attr-hidden.c: New test. * gcc.dg/tls/vis-flag-hidden-gd.c: New test. * gcc.dg/tls/vis-flag-hidden.c: New test. * gcc.dg/tls/vis-pragma-hidden-gd.c: New test. * gcc.dg/tls/vis-pragma-hidden.c: New test. Co-Authored-By: Alexander Monakov Signed-off-by: Artem Klimov --- gcc/ipa-visibility.cc | 19 +++ gcc/testsuite/gcc.dg/tls/vis-attr-gd.c| 12 +++ gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c | 13 gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c| 12 +++ gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c | 13 gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c| 12 +++ .../gcc.dg/tls/vis-pragma-hidden-gd.c | 17 ++ gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c | 16 ++ gcc/varasm.cc | 32 ++- 9 files changed, 145 insertions(+), 1 deletion(-) create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-gd.c create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden-gd.c create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c diff --git a/gcc/ipa-visibility.cc b/gcc/ipa-visibility.cc index 8a27e7bcd..3ed2b7cf6 100644 --- a/gcc/ipa-visibility.cc +++ b/gcc/ipa-visibility.cc @@ -873,6 +873,25 @@ function_and_variable_visibility (bool whole_program) } } + if (symtab->state >= IPA_SSA) +{ + FOR_EACH_VARIABLE (vnode) + { + tree decl = vnode->decl; + + /* Upgrade TLS access model based on optimized visibility status, +unless it was specified explicitly or no references remain. 
*/ + if (DECL_THREAD_LOCAL_P (decl) + && !lookup_attribute ("tls_model", DECL_ATTRIBUTES (decl)) + && vnode->ref_list.referring.length ()) + { + enum tls_model new_model = decl_default_tls_model (decl); + gcc_checking_assert (new_model >= decl_tls_model (decl)); + set_decl_tls_model (decl, new_model); + } + } +} + if (dump_file) { fprintf (dump_file, "\nMarking local functions:"); diff --git a/gcc/varasm.cc b/gcc/varasm.cc index 4db8506b1..ffc559431 100644 --- a/gcc/varasm.cc +++ b/gcc/varasm.cc @@ -6679,6 +6679,36 @@ init_varasm_once (void) #endif } +/* Determine whether SYMBOL is used in any optimized function. */ + +static bool +have_optimized_refs (struct symtab_node *symbol) +{ + struct ipa_ref *ref; + + for (int i = 0; symbol->iterate_referring (i, ref); i++) +{ + cgraph_node *cnode = dyn_cast (ref->referring); + + if (cnode && opt_for_fn (cnode->decl, optimize)) + return true; +} + + return false; +} + +/* Check if promoting general-dynamic TLS access model to local-dynamic is + desirable for DECL. */ + +static bool +optimize_dyn_tls_for_decl_p (const_tree decl) +{ + if (cfun) +return optimize; + return symtab->state >= IPA && have_optimized_refs (symtab_node::get (decl)); +} + + enum tls_model decl_default_tls_model (const_tree decl) { @@ -6696,7 +6726,7 @@ decl_default_tls_model (const_tree decl) /* Local dynamic is inefficient when we're not combining the parts of the address. */ - else if (optimize && is_local) + else if (is_local && optimize_dyn_tls_for_decl_p (decl)) kind = TLS_MODEL_LOCAL_DYNAMIC;
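For reference, the situation the pass improves looks like the sketch below (in the spirit of the new tests, not an exact copy of any of them): built with -fpic, the hidden visibility makes the variable bind locally, so the IPA pass can upgrade its access model from general-dynamic to local-dynamic.

  __attribute__((visibility("hidden"))) __thread int counter;

  int next (void) { return ++counter; }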
Re: [PATCH] riscv: implement TARGET_MODE_REP_EXTENDED
On Mon, 5 Sep 2022, Philipp Tomsich wrote: > +riscv_mode_rep_extended (scalar_int_mode mode, scalar_int_mode mode_rep) > +{ > + /* On 64-bit targets, SImode register values are sign-extended to DImode. > */ > + if (TARGET_64BIT && mode == SImode && mode_rep == DImode) > +return SIGN_EXTEND; I think this leads to a counter-intuitive requirement that a hand-written inline asm must sign-extend its output operands that are bound to either signed or unsigned 32-bit lvalues. Will compiler users be aware of that? Moreover, without adjusting TARGET_TRULY_NOOP_TRUNCATION this should cause miscompilation when a 64-bit variable is truncated to 32 bits: the pre-existing hook says that nothing needs to be done to truncate, but the new hook says that the result of the truncation is properly sign-extended. The documentation for TARGET_MODE_REP_EXTENDED warns about that: In order to enforce the representation of mode, TARGET_TRULY_NOOP_TRUNCATION should return false when truncating to mode. Alexander
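P.S. For reference, here is a minimal sketch of what the companion TARGET_TRULY_NOOP_TRUNCATION change could look like; this is only an illustration based on the documentation quoted above (loosely following the MIPS port), not something from the posted patch, and the exact condition is an assumption:

  static bool
  riscv_truly_noop_truncation (poly_uint64 outprec, poly_uint64 inprec)
  {
    /* If SImode values are defined to be kept sign-extended in DImode
       registers, truncating DImode to SImode is no longer a no-op: the
       result has to be re-canonicalized by sign-extending it.  */
    if (TARGET_64BIT && known_le (outprec, 32) && known_gt (inprec, 32))
      return false;
    return true;
  }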
Re: [PATCH] lto-plugin: add support for feature detection
On Sun, 15 May 2022, Rui Ueyama wrote: [snip] > > So get_symbols_v3 allows the linker to discard an LTO .o file to solve this. > > > > In absence of get_symbols_v3 mold tries to ensure correctness by restarting > > itself while appending a list of .o files to be discarded to its command > > line. > > > > I wonder if mold can invoke plugin cleanup callback to solve this without > > restarting. > > We can call the plugin cleanup callback from mold, but there are a few > problems: > > First of all, it looks like it is not clear what state the plugin cleanup > callback resets to. It may reset it to the initial state with which we > need to restart everything from calling `onload` callback, or it may not > deregister functions registered by the previous `onload` call. Since the > exact semantics is not documented, the LLVM gold plugin may behave > differently than the GCC plugin. Ack, that looks risky (and probably unnecessary, see below). > Second, if we reset a plugin's internal state, we need to register all > input files by calling the `claim_file_hook` callback, which in turn calls > the `add_symbols` callback. But we don't need any symbol information at > this point because mold already knows what are in LTO object files as it > calls `claim_file_hook` already on the same sets of files. So the > `add_symbols` invocations would be ignored, which is a waste of resources. > > So, I prefer get_symbols_v3 over calling the plugin cleanup callback. Oh, to be clear I wouldn't argue against implementing get_symbols_v3 in GCC. I was wondering if mold can solve this in another fashion without the self-restart trick. If I understood your design correctly, mold disregards the index in static archives, because it doesn't give you the dependency graph of contained objects (it only lists defined symbols, not used, I was mistaken about that in the previous email), and you wanted to let mold parse all archived objects in parallel instead of doing a multiphase scan where each phase extracts only the needed objects (in parallel). Is that correct? Is that a good tradeoff in the LTO case though? I believe you cannot assume the plugin to be thread-safe, so you're serializing its API calls, right? But the plugin is doing a lot of work, so using the index to feed it with as few LTO objects as possible should be a significant win, no? (even if it was thread-safe) And with the index, it should be rare that a file is spuriously added to the plugin, so maybe you could get away with issuing a warning or an error when the v2 API is used, but mold needs to discard a file? > > (also, hm, it seems to confirm my idea that LTO .o files should have had the > > correct symbol table so normal linker algorithms would work) > > I agree. If GCC LTO object file contains a correct ELF symbol table, we > can also eliminate the need of the special LTO-aware ar command. It looks > like it is a very common error to use an ar command that doesn't > understand the LTO object file, which results in mysterious "undefined > symbol" errors even though the object files in an archive file provide > that very symbols. Thanks. Alexander
Re: [PATCH] lto-plugin: add support for feature detection
On Sun, 15 May 2022, Rui Ueyama wrote: > > Is that a good tradeoff in the LTO case though? I believe you cannot assume > > the plugin to be thread-safe, so you're serializing its API calls, right? > > But the plugin is doing a lot of work, so using the index to feed it with as > > few LTO objects as possible should be a significant win, no? (even if it > > was thread-safe) > > Oh, I didn't know that claim_file_hook isn't thread-safe. I need to add a > lock to guard it then. But is it actually the case? You can see for yourself at https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=lto-plugin/lto-plugin.c (e.g. how claim_file_handler increments the global variable num_claimed_files) > As to the tradeoff, speculatively loading all object files from archives > may not be beneficial if the loaded files are LTO object files. But we do > this for consistency. We don't have a multi-phase name resolution pass at > all in mold; all symbols are resolved at once in parallel. We don't want > to implement another name resolution pass just for LTO for the following > reasons: > > 1. It bloats up the linker's code. > > 2. We don't know whether an archive file contains an LTO object file or > not until we actually read archive members, so there's no chance to switch > to the multi-pass name resolution algorithm before we read files. > > 3. If we have two different name resolution algorithms, it is very hard to > guarantee that both algorithms produce the same result. As a result, the > output with -flto may behave differently without -flto. Well, -flto can result in observably different results for other reasons anyway. > 4. We still have to handle --start-libs and --end-libs, so feeding an > object file that will end up not being included into the output is > unavoidable. Makes sense, but I still don't understand why mold wants to discover in advance whether the plugin is going to use get_symbols_v3. How would it help with what mold does today to handle the _v2 case? Alexander
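P.S. To make the claim_file hazard mentioned above concrete, the pattern in lto-plugin.c is roughly as follows (a heavily simplified sketch, not the verbatim source):

  #include <plugin-api.h>

  static unsigned int num_claimed_files;

  static enum ld_plugin_status
  claim_file_handler (const struct ld_plugin_input_file *file, int *claimed)
  {
    /* ... read the symbol table of FILE and pass it to add_symbols ...  */
    num_claimed_files++;      /* unsynchronized read-modify-write */
    *claimed = 1;
    return LDPS_OK;
  }

Two linker threads invoking claim_file concurrently would race on this and the other global state in the plugin.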
Re: [PATCH] lto-plugin: add support for feature detection
On Sun, 15 May 2022, Rui Ueyama wrote: > > Makes sense, but I still don't understand why mold wants to discover in > > advance whether the plugin is going to use get_symbols_v3. How would it > > help with what mold does today to handle the _v2 case? > > Currently, mold restarts itself to reset the internal state of the plugin. > If we know in advance that get_symbols_v3 is supported, we can avoid that > restart. That should make the linker a bit faster. Also, restarting the > linker is a hack, so we want to avoid it if possible. Can you simply restart the linker on first call to get_symbols_v2 instead? Alexander
Re: [PATCH] lto-plugin: add support for feature detection
On Sun, 15 May 2022, Rui Ueyama wrote: > > Can you simply restart the linker on first call to get_symbols_v2 instead? > > I could, but it may not be a safe timing to call exec(2). I believe we > are expected to call cleanup_hook after calling all_symbols_read_hook, > and it is not clear what will happen if we abruptly terminate and > restart the current process. For example, doesn't it leave temporary > files on disk? Regarding files, as far as I can tell, GCC plugin will leave a 'resolution file' on disk, but after re-exec it would recreate it anyway. Alexander
Re: [PATCH] lto-plugin: add support for feature detection
On Sun, 15 May 2022, Rui Ueyama wrote: > > Regarding files, as far as I can tell, GCC plugin will leave a 'resolution > > file' > > on disk, but after re-exec it would recreate it anyway. > > Does it recreate a temporary file with the same file name so that > there's no temporary file left on the disk after the linker finishes > doing LTO? Resolution file name is taken from the command line option '-fresolution=', so it's a stable name (supplied by the compiler driver). Alexander
Re: [PATCH] lto-plugin: add support for feature detection
On Mon, 16 May 2022, Rui Ueyama wrote: > If it is a guaranteed behavior that GCC of all versions that support > only get_symbols_v2 don't leave a temporary file behind if it is > suddenly disrupted during get_symbols_v2 execution, then yes, mold can > restart itself when get_symbols_v2 is called for the first time. > > Is this what you want? I'm fine with that approach if it is guaranteed > to work by GCC developers. I cannot answer that; hopefully someone in Cc: will. This sub-thread started with Richard proposing an alternative solution for API level discovery [1] (invoking onload twice, first with only the _v3 entrypoint in the "transfer vector"), and then suggesting an onload_v2 variant that would allow discovering which entrypoints the plugin is going to use [2]. [1] https://gcc.gnu.org/pipermail/gcc-patches/2022-May/594058.html [2] https://gcc.gnu.org/pipermail/gcc-patches/2022-May/594059.html ... at which point I butted in, because the whole _v3 thing was shrouded in mystery. Hopefully, now it makes more sense. From my side I want to add that thread-safety remains a major unsolved point. The compiler driver _always_ adds the plugin to the linker command line, so I expect that if you add a mutex around your claim_file hook invocation, you'll find that it serializes the linker too much. Have you given some thought to that? Will you be needing a plugin API upgrade to discover thread-safe entrypoints, or do you have another solution in mind? Alexander
Re: [PATCH] lto-plugin: add support for feature detection
On Mon, 16 May 2022, Richard Biener wrote: > Is there an API document besides the header itself somewhere? It's on the wiki: https://gcc.gnu.org/wiki/whopr/driver (sadly the v3 entrypoint was added there without documentation) Alexander
Re: [PATCH] lto-plugin: add support for feature detection
On Mon, 16 May 2022, Rui Ueyama wrote: > > @Rui: Am I correct that you're interested in thread-safe claim_file? Is > > there any > > other function being called paralely? > > Yes, I want a thread-safe claim_file. And that function seems to be > the only function in mold that is called in parallel. But note that you'll have to provide a guarantee that all entrypoints that the plugin may invoke when multithreaded (i.e. add_symbols, which is called from claim_file) are also thread-safe. Alexander
Re: [PATCH] lto-plugin: add support for feature detection
On Mon, 16 May 2022, Martin Liška wrote: > I've implemented first version of the patch, please take a look. I'll comment on the patch, feel free to inform me when I should back off with forcing my opinion in this thread :) > --- a/include/plugin-api.h > +++ b/include/plugin-api.h > @@ -483,6 +483,18 @@ enum ld_plugin_level >LDPL_FATAL > }; > > +/* The linker's interface for API version negotiation. */ > + > +typedef > +int (*ld_plugin_get_api_version) (char *linker_identifier, int > linker_version, > + int preferred_linker_api, > + const char **compiler_identifier, > + int *compiler_version); > + > +typedef > +enum ld_plugin_status > +(*ld_plugin_register_get_api_version) (ld_plugin_get_api_version handler); > + > /* Values for the tv_tag field of the transfer vector. */ > > enum ld_plugin_tag > @@ -521,6 +533,7 @@ enum ld_plugin_tag >LDPT_REGISTER_NEW_INPUT_HOOK, >LDPT_GET_WRAP_SYMBOLS, >LDPT_ADD_SYMBOLS_V2, > + LDPT_REGISTER_GET_API_VERSION, > }; > > /* The plugin transfer vector. */ > @@ -556,6 +569,7 @@ struct ld_plugin_tv > ld_plugin_get_input_section_size tv_get_input_section_size; > ld_plugin_register_new_input tv_register_new_input; > ld_plugin_get_wrap_symbols tv_get_wrap_symbols; > +ld_plugin_register_get_api_version tv_register_get_api_version; >} tv_u; > }; Here I disagree with the overall design. Rui already pointed out how plugin API seems to consist of callbacks-that-register-callbacks, and I'm with him on that, let's not make that worse. On a more serious note, this pattern: * the linker provides register_get_api_version entrypoint * the plugin registers its get_api_version implementation * the linker uses the provided function pointer is problematic because the plugin doesn't know when the linker is going to invoke its callback (or maybe the linker won't do that at all). I'd recommend to reduce the level of indirection, remove the register_ callback, and simply require that if LDPT_GET_API_VERSION is provided, the plugin MUST invoke it before returning from onload, i.e.: * the linker invokes onload with LDPT_GET_API_VERSION in 'transfer vector' * the plugin iterates over the transfer vector and notes if LDPT_GET_API_VERSION is seen * if not, the plugin knows the linker is predates its introduction * if yes, the plugin invokes it before returning from onload * the linker now knows the plugin version (either one provided via LDPT_GET_API_VERSION, or 'old' if the callback wasn't invoked). > diff --git a/lto-plugin/lto-plugin.c b/lto-plugin/lto-plugin.c > index 00b760636dc..49484decd89 100644 > --- a/lto-plugin/lto-plugin.c > +++ b/lto-plugin/lto-plugin.c > @@ -69,6 +69,7 @@ along with this program; see the file COPYING3. If not see > #include "../gcc/lto/common.h" > #include "simple-object.h" > #include "plugin-api.h" > +#include "ansidecl.h" > > /* We need to use I64 instead of ll width-specifier on native Windows. > The reason for this is that older MS-runtimes don't support the ll. */ > @@ -166,6 +167,10 @@ static ld_plugin_add_input_file add_input_file; > static ld_plugin_add_input_library add_input_library; > static ld_plugin_message message; > static ld_plugin_add_symbols add_symbols, add_symbols_v2; > +static ld_plugin_register_get_api_version register_get_api_version; > + > +/* By default, use version 1 if there is not negotiation. 
*/ > +static int used_api_version = 1; > > static struct plugin_file_info *claimed_files = NULL; > static unsigned int num_claimed_files = 0; > @@ -1407,6 +1412,29 @@ process_option (const char *option) >verbose = verbose || debug; > } > > +static int > +get_api_version (char *linker_identifier, int linker_version, > + int preferred_linker_api, The 'preferred' qualifier seems vague. If you go with my suggestion above, I'd suggest to pass lowest and highest supported version number, and the linker can check if that intersects with its range of supported versions, and error out if the intersection is empty (and otherwise return the highest version they both support as the 'negotiated' one). > + const char **compiler_identifier, > + int *compiler_version) > +{ > + *compiler_identifier = "GCC"; > + *compiler_version = GCC_VERSION; Note that I'm chiming in here because I worked on a tool that used LTO plugin API to discover symbol/section dependencies with high accuracy. So I'd like to remind that implementors/consumers of this API are not necessarily compilers! Alexander
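P.S. To illustrate the proposed flow from the plugin side, a rough sketch (the tv_get_api_version field, its signature and the version numbers are assumptions made up for this example; this is not the submitted patch):

  #include <plugin-api.h>

  static int used_api_version = 1;  /* Linker predates negotiation unless told otherwise.  */

  enum ld_plugin_status
  onload (struct ld_plugin_tv *tv)
  {
    for (; tv->tv_tag != LDPT_NULL; tv++)
      if (tv->tv_tag == LDPT_GET_API_VERSION)
        /* Pass our identity plus the lowest and highest API versions we
           support; the linker returns the highest version both sides
           understand (or errors out if the ranges do not intersect).  */
        used_api_version = tv->tv_u.tv_get_api_version ("GCC", 13, 1, 2);
    return LDPS_OK;
  }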
Re: [PATCH] ipa-visibility: Optimize TLS access [PR99619]
On Mon, 9 May 2022, Jan Hubicka wrote: > > On second thought, it might be better to keep the assert, and place the loop > > under 'if (optimize)'? > > The problem is that at IPA level it does not make sense to check > optimize flag as it is function specific. (shlib is OK to check it > anywhere since it is global.) > > So I think we really want to run the code only at the WPA time > (symtab_state>=IPA_SSA) and we want to see what is optimization flag of > those function referring the variable since that is what decided codegen > we will produce. Perhaps I misunderstood the issue. Are you saying that there might be no -O option on lto1 command line, because lto1 is supposed to take optimization level from function summaries, but during pass_ipa_whole_program_visibility there's no "current function" so 'optimize' is at its default value (zero)? And the solution is to iterate over referring functions to see if at least one of them satisfies 'opt_for_fn (decl, optimize) > 0'? Alexander
Re: [PATCH] Add divide by zero side effect.
On Fri, 20 May 2022, Richard Biener via Gcc-patches wrote: > > Still waiting for a suggestion, since "side effect" is the description > > that made sense to me :-) > > I think side-effect captures it quite well even if it overlaps with a term > used in language standards. Doing c = a << b has the side-effect on > imposing a range on 'b' rather than just affecting 'c' (and its range). > You could call it 'alternate effect' but that sounds just awkward ;) I suggest 'deduce', 'deduction', 'deducing a range'. What the code is actually doing is deducing that 'b' in 'a / b' cannot be zero. Function in GCC might be called like 'deduce_ranges_from_stmt'. Please don't overload 'side effect' if possible. Alexander
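P.S. A tiny example of the kind of deduction I mean (illustrative only):

  int f (int a, int b)
  {
    int c = a / b;        /* executing this lets us deduce b != 0 ...  */
    return c + (b == 0);  /* ... so ranger can fold this to just 'c'.  */
  }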
Re: [PATCH] Add divide by zero side effect.
On Fri, 20 May 2022, Richard Biener wrote: > On Fri, May 20, 2022 at 8:38 AM Alexander Monakov wrote: > > > > On Fri, 20 May 2022, Richard Biener via Gcc-patches wrote: > > > > > > Still waiting for a suggestion, since "side effect" is the description > > > > that made sense to me :-) > > > > > > I think side-effect captures it quite well even if it overlaps with a term > > > used in language standards. Doing c = a << b has the side-effect on > > > imposing a range on 'b' rather than just affecting 'c' (and its range). > > > You could call it 'alternate effect' but that sounds just awkward ;) > > > > I suggest 'deduce', 'deduction', 'deducing a range'. What the code is > > actually > > doing is deducing that 'b' in 'a / b' cannot be zero. Function in GCC might > > be > > called like 'deduce_ranges_from_stmt'. > > So how would you call determining the range of 'c' from the ranges of > 'a' and 'b', isn't that 'deduction' as well? Kind of, yes, but for this sort of forward inference I imagine you're already using 'propagate [ranges through a stmt]', like in 'value range propagation'. If you'd like to avoid 'propagate'/'deduce' asymmetry, I could suggest 'forward inference' / 'backward inference', but I like it a bit less. Alexander
Re: [PATCH] Add divide by zero side effect.
On Fri, 20 May 2022, Richard Biener wrote: > > > > I suggest 'deduce', 'deduction', 'deducing a range'. What the code is > > > > actually > > > > doing is deducing that 'b' in 'a / b' cannot be zero. Function in GCC > > > > might be > > > > called like 'deduce_ranges_from_stmt'. > > > > > > So how would you call determining the range of 'c' from the ranges of > > > 'a' and 'b', isn't that 'deduction' as well? > > > > Kind of, yes, but for this sort of forward inference I imagine you're > > already > > using 'propagate [ranges through a stmt]', like in 'value range > > propagation'. > > > > If you'd like to avoid 'propagate'/'deduce' asymmetry, I could suggest > > 'forward inference' / 'backward inference', but I like it a bit less. > > Hmm, maybe "guarantees" - if the stmt executed (without traps) then > it's guaranteed that the divisor isn't zero. I've almost said 'assertions' > but then asserts also have separate meanings, not to mention ASSERT_EXPR > as currently used by the old-style VRP. I feel the word 'assumptions' captures that nicely. Alexander
Re: [PATCH] ipa-visibility: Optimize TLS access [PR99619]
On Mon, 16 May 2022, Alexander Monakov wrote: > On Mon, 9 May 2022, Jan Hubicka wrote: > > > > On second thought, it might be better to keep the assert, and place the > > > loop > > > under 'if (optimize)'? > > > > The problem is that at IPA level it does not make sense to check > > optimize flag as it is function specific. (shlib is OK to check it > > anywhere since it is global.) > > > > So I think we really want to run the code only at the WPA time > > (symtab_state>=IPA_SSA) and we want to see what is optimization flag of > > those function referring the variable since that is what decided codegen > > we will produce. > > Perhaps I misunderstood the issue. Are you saying that there might be no -O > option on lto1 command line, because lto1 is supposed to take optimization > level from function summaries, but during pass_ipa_whole_program_visibility > there's no "current function" so 'optimize' is at its default value (zero)? > > And the solution is to iterate over referring functions to see if at least > one of them satisfies 'opt_for_fn (decl, optimize) > 0'? Do you want to see a patch implementing the above solution? Alexander
Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch
On Mon, 26 Oct 2020, Jakub Jelinek wrote: > On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote: > > This patch adds caching for the stack block allocated for offloaded > > OpenMP kernel launches on NVPTX. This is a performance optimisation -- > > we observed an average 11% or so performance improvement with this patch > > across a set of accelerated GPU benchmarks on one machine (results vary > > according to individual benchmark and with hardware used). In this patch you're folding two changes together: reuse of allocated stacks and removing one host-device synchronization. Why is that? Can you report performance change separately for each change (and split out the patches)? > > A given kernel launch will reuse the stack block from the previous launch > > if it is large enough, else it is freed and reallocated. A slight caveat > > is that memory will not be freed until the device is closed, so e.g. if > > code is using highly variable launch geometries and large amounts of > > GPU RAM, you might run out of resources slightly quicker with this patch. > > > > Another way this patch gains performance is by omitting the > > synchronisation at the end of an OpenMP offload kernel launch -- it's > > safe for the GPU and CPU to continue executing in parallel at that point, > > because e.g. copies-back from the device will be synchronised properly > > with kernel completion anyway. I don't think this explanation is sufficient. My understanding is that OpenMP forbids the host to proceed asynchronously after the target construct unless it is a 'target nowait' construct. This may be observable if there's a printf in the target region for example (or if it accesses memory via host pointers). So this really needs to be a separate patch with more explanation why this is okay (if it is okay). > > In turn, the last part necessitates a change to the way "(perhaps abort > > was called)" errors are detected and reported. As already mentioned using callbacks is problematic. Plus, I'm sure the way you lock out other threads is a performance loss when multiple threads have target regions: even though they will not run concurrently on the GPU, you still want to allow host threads to submit GPU jobs while the GPU is occupied. I would suggest to have a small pool (up to 3 entries perhaps) of stacks. Then you can arrange reuse without totally serializing host threads on target regions. Alexander
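P.S. To illustrate the concern about the dropped synchronisation, a minimal hypothetical example (not from the benchmark set):

  #include <stdio.h>

  int main (void)
  {
    #pragma omp target
      printf ("from the target region\n");
    /* The construct above has no 'nowait', so the host may not proceed past
       it until the target region has completed; omitting the device
       synchronisation after the kernel launch risks letting the host reach
       the line below while the region is still running.  */
    printf ("from the host\n");
    return 0;
  }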
Re: [PATCH] libgomp, nvptx, v3: Honor OpenMP 5.1 num_teams lower bound
Hello Jakub, On Fri, 12 Nov 2021, Jakub Jelinek via Gcc-patches wrote: > On Fri, Nov 12, 2021 at 02:27:16PM +0100, Jakub Jelinek via Gcc-patches wrote: > > On Fri, Nov 12, 2021 at 02:20:23PM +0100, Jakub Jelinek via Gcc-patches > > wrote: > > > This patch assumes that .shared variables are initialized to 0, > > > https://docs.nvidia.com/cuda/parallel-thread-execution/index.html lists > > > in Table 7. .shared as non-initializable. If that isn't the case, > > > we need to initialize it somewhere for the case of #pragma omp target > > > without #pragma omp teams in it, maybe in libgcc/config/nvptx/crt0.c ? > > > > A quick look at libgcc/config/nvptx/crt0.c shows the target supports > > __attribute__((shared)), so perhaps either following instead, or, if > > .shared isn't preinitialized to zero, defining the variable in > > libgcc/config/nvptx/crt0.c , adding there __gomp_team_num = 0; > > and adding extern keyword before int __gomp_team_num > > __attribute__((shared)); > > in libgomp/config/nvptx/target.c. > > And finally here is a third version, which fixes a typo in the previous > patch (in instead of int) and actually initializes the shared var because > PTX documentation doesn't say anything about how the shared vars are > initialized. > > Tested on x86_64-linux with nvptx-none offloading, ok for trunk? I suspect there may be a misunderstanding here, or maybe your explanation is incomplete. I don't think the intention of the standard was to force such complexity. You can launch as many blocks on the GPU as you like, limited only by the bitwidth of the indexing register used in hardware, NVIDIA guarantees at least INT_MAX blocks (in fact almost 1<<63 blocks if you launch a three-dimensional grid with INT_MAX x 65535 x 65535 blocks). The hardware will schedule blocks automatically (so for example if the hardware can run 40 blocks simultaneously and you launch 100, the hardware may launch blocks 0 to 39, then when one of those finishes it will launch the 40'th block and so on). So isn't the solution simply to adjust the logic around nvptx_adjust_launch_bounds in GOMP_OFFLOAD_run, that is, if there's a lower bound specified, use it instead of what adjust_launch_bounds is computing as max_blocks? Yours, Alexander
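P.S. In other words, something along these lines (a self-contained sketch with made-up names, not the actual libgomp plugin code):

  /* TEAMS_LOWER is the lower bound from the num_teams clause (0 when absent);
     OCCUPANCY_MAX is what the nvptx_adjust_launch_bounds-style logic computed
     from hardware limits.  Launching more blocks than can be simultaneously
     resident is fine: the hardware starts surplus blocks as earlier ones
     retire.  */
  static int
  choose_num_blocks (int teams_lower, int occupancy_max)
  {
    return teams_lower > occupancy_max ? teams_lower : occupancy_max;
  }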
Re: [PATCH] libgomp, nvptx, v3: Honor OpenMP 5.1 num_teams lower bound
On Fri, 12 Nov 2021, Jakub Jelinek via Gcc-patches wrote: > --- libgomp/config/nvptx/team.c.jj 2021-05-25 13:43:02.793121350 +0200 > +++ libgomp/config/nvptx/team.c 2021-11-12 17:49:02.847341650 +0100 > @@ -32,6 +32,7 @@ > #include > > struct gomp_thread *nvptx_thrs __attribute__((shared,nocommon)); > +int __gomp_team_num __attribute__((shared)); It's going to be weird to have two declarations next to each other, one with 'nocommon', one without. Could you have 'nocommon' also on the new one, and then, if you like, add extern declarations for both variables and drop the attribute (in a separate patch)? Alexander
Re: [PATCH] libgomp, nvptx, v3: Honor OpenMP 5.1 num_teams lower bound
On Fri, 12 Nov 2021, Jakub Jelinek via Gcc-patches wrote: > On Fri, Nov 12, 2021 at 08:47:09PM +0100, Jakub Jelinek wrote: > > The problem is that the argument of the num_teams clause isn't always known > > before target is launched. > > There was a design mistake that the clause has been put on teams rather than > on target (well, for host teams we need it on teams), and 5.1 actually > partially fixes this up for thread_limit by allowing that clause on both, > but not for num_teams. If this is a mistake in the standard, can GCC say "the spec is bad; fix the spec" and refuse to implement support, since it penalizes the common case? Technically, this could be implemented without penalizing the common case via CUDA "dynamic parallelism" where you initially launch just one block on the device that figures out the dimensions and then performs a GPU-side launch of the required amount of blocks, but that's a nontrivial amount of work. I looked over your patch. I sent a small nitpick about 'nocommon' in a separate message, and I still think it's better to adjust GOMP_OFFLOAD_run to take into account the lower bound when it's known on the host side (otherwise you do static scheduling of blocks which is going to be inferior to dynamic scheduling: imagine lower bound is 3, and maximum resident blocks is 2: then you first do teams 0 and 1 in parallel, then you do team 2 from the 0'th block, while in fact you want to do it from whichever block finished its initial team first). Alexander
Re: [PATCH] Expose stable sort algorithm to gcc_sort_r and add vec::stablesort
On Thu, 10 Jun 2021, Richard Biener wrote: > This makes it possible to apply GCCs stable sort algorithm to vec<> > and also use it with the qsort_r compatible interface. > > Alex, any comments? I'm afraid the patch is not correct, see below; (I'll also point out errors in comments while at it). > Bootstrapped & tested on x86_64-unknown-linux-gnu (with some > not here included changes to actually use stablesort) > > 2021-06-10 Richard Biener > > * system.h (gcc_stablesort_r): Declare. > * sort.cc (gcc_sort_r): Support stable sort. > (gcc_stablesort_r): Define. > * vec.h (vec<>::stablesort): Add. > --- > gcc/sort.cc | 14 +- > gcc/system.h | 1 + > gcc/vec.h| 24 > 3 files changed, 38 insertions(+), 1 deletion(-) > > diff --git a/gcc/sort.cc b/gcc/sort.cc > index fe499b5ec73..e27b90ebbdd 100644 > --- a/gcc/sort.cc > +++ b/gcc/sort.cc > @@ -277,8 +277,12 @@ gcc_sort_r (void *vbase, size_t n, size_t size, > sort_r_cmp_fn *cmp, void *data) > { >if (n < 2) > return; > + size_t nlim = 5; > + bool stable = (ssize_t) size < 0; > + if (stable) > +nlim = 3, size = ~size; >char *base = (char *)vbase; > - sort_r_ctx c = {data, cmp, base, n, size, 5}; > + sort_r_ctx c = {data, cmp, base, n, size, nlim}; >long long scratch[32]; >size_t bufsz = (n / 2) * size; >void *buf = bufsz <= sizeof scratch ? scratch : xmalloc (bufsz); > @@ -296,3 +300,11 @@ gcc_stablesort (void *vbase, size_t n, size_t size, > cmp_fn *cmp) > { >gcc_qsort (vbase, n, ~size, cmp); > } > + > +/* Stable sort, signature-compatible to C qsort_r. */ "Glibc qsort_r" (no _r variant in C, and BSD signature differs). > +void > +gcc_stablesort_r (void *vbase, size_t n, size_t size, sort_r_cmp_fn *cmp, > + void *data) > +{ > + gcc_sort_r (vbase, n, ~size, cmp, data); > +} > diff --git a/gcc/system.h b/gcc/system.h > index 3c856266cc2..adde3e264b6 100644 > --- a/gcc/system.h > +++ b/gcc/system.h > @@ -1250,6 +1250,7 @@ void gcc_sort_r (void *, size_t, size_t, sort_r_cmp_fn > *, void *); > void gcc_qsort (void *, size_t, size_t, int (*)(const void *, const void *)); > void gcc_stablesort (void *, size_t, size_t, >int (*)(const void *, const void *)); > +void gcc_stablesort_r (void *, size_t, size_t, sort_r_cmp_fn *, void *data); > /* Redirect four-argument qsort calls to gcc_qsort; one-argument invocations > correspond to vec::qsort, and use C qsort internally. */ > #define PP_5th(a1, a2, a3, a4, a5, ...) a5 > diff --git a/gcc/vec.h b/gcc/vec.h > index 24df2db0eeb..c02a834c171 100644 > --- a/gcc/vec.h > +++ b/gcc/vec.h > @@ -612,6 +612,7 @@ public: >void block_remove (unsigned, unsigned); >void qsort (int (*) (const void *, const void *)); >void sort (int (*) (const void *, const void *, void *), void *); > + void stablesort (int (*) (const void *, const void *, void *), void *); >T *bsearch (const void *key, int (*compar)(const void *, const void *)); >T *bsearch (const void *key, > int (*compar)(const void *, const void *, void *), void *); > @@ -1160,6 +1161,17 @@ vec::sort (int (*cmp) (const void *, > const void *, void *), > gcc_sort_r (address (), length (), sizeof (T), cmp, data); > } > > +/* Sort the contents of this vector with qsort. CMP is the comparison > + function to pass to qsort. */ Not with 'qsort', but gcc_stablesort_r. > + > +template > +inline void > +vec::stablesort (int (*cmp) (const void *, const void *, > + void *), void *data) > +{ > + if (length () > 1) > +gcc_stablesort_r (address (), length (), ~sizeof (T), cmp, data); > +} I think this is wrong. 
You're passing inverted size to gcc_stablesort_r, which will invert it again, and you end up with normal non-stable sorting function. With that fixed, I think the patch would be correct. > > /* Search the contents of the sorted vector with a binary search. > CMP is the comparison function to pass to bsearch. */ > @@ -1488,6 +1500,7 @@ public: >void block_remove (unsigned, unsigned); >void qsort (int (*) (const void *, const void *)); >void sort (int (*) (const void *, const void *, void *), void *); > + void stablesort (int (*) (const void *, const void *, void *), void *); >T *bsearch (const void *key, int (*compar)(const void *, const void *)); >T *bsearch (const void *key, > int (*compar)(const void *, const void *, void *), void *); > @@ -2053,6 +2066,17 @@ vec::sort (int (*cmp) (const void > *, const void *, > m_vec->sort (cmp, data); > } > > +/* Sort the contents of this vector with qsort. CMP is the comparison > + function to pass to qsort. */ Like above, copy-paste issue in comment. > + > +template > +inline void > +vec::stablesort (int (*cmp) (const void *, const void *, > + void *), void *data) > +{ > + if (m_vec) > +m_
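In other words, the vec<> wrappers should pass the plain element size and let gcc_stablesort_r do the inversion internally, i.e. something like (sketch of the corrected vl_embed overload):

  template<typename T, typename A>
  inline void
  vec<T, A, vl_embed>::stablesort (int (*cmp) (const void *, const void *,
                                               void *), void *data)
  {
    if (length () > 1)
      gcc_stablesort_r (address (), length (), sizeof (T), cmp, data);
  }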
Re: [x86_64 PATCH]: Improvement to signed division of integer constant.
On Thu, 8 Jul 2021, Richard Biener via Gcc-patches wrote: > You made me lookup idiv and I figured we're not optimally > handling > > int foo (long x, int y) > { > return x / y; > } > > by using a 32:32 / 32 bit divide. combine manages to > see enough to eventually do this though. We cannot do that in general because idiv will cause an exception if the signed result is not representable in 32 bits, but GCC defines signed conversions to truncate without trapping. Alexander
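P.S. Concretely (assuming a 64-bit long):

  #include <stdio.h>

  int foo (long x, int y)
  {
    return x / y;
  }

  int main (void)
  {
    /* The 64-bit quotient here is 1L << 32; the C semantics GCC provides
       truncate it to 0 on conversion to int, whereas a direct 64:32 idiv
       would raise #DE because the quotient does not fit in 32 bits.  */
    printf ("%d\n", foo (1L << 32, 1));
    return 0;
  }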
Re: [PATCH] i386: correct division modeling in lujiazui.md
Ping. If there are any questions or concerns about the patch, please let me know: I'm interested in continuing this cleanup at least for older AMD models. I noticed I had an extra line in my Changelog: > (lua_sseicvt_si): Ditto. It got there accidentally and I will drop it. Alexander On Fri, 9 Dec 2022, Alexander Monakov wrote: > Model the divider in Lujiazui processors as a separate automaton to > significantly reduce the overall model size. This should also result > in improved accuracy, as pipe 0 should be able to accept new > instructions while the divider is occupied. > > It is unclear why integer divisions are modeled as if pipes 0-3 are all > occupied. I've opted to keep a single-cycle reservation of all four > pipes together, so GCC should continue trying to pack instructions > around a division accordingly. > > Currently top three symbols in insn-automata.o are: > > 106102 r lujiazui_core_check > 106102 r lujiazui_core_transitions > 196123 r lujiazui_core_min_issue_delay > > This patch shrinks all lujiazui tables to: > > 3 r lujiazui_decoder_min_issue_delay > 20 r lujiazui_decoder_transitions > 32 r lujiazui_agu_min_issue_delay > 126 r lujiazui_agu_transitions > 304 r lujiazui_div_base > 352 r lujiazui_div_check > 352 r lujiazui_div_transitions > 1152 r lujiazui_core_min_issue_delay > 1592 r lujiazui_agu_translate > 1592 r lujiazui_core_translate > 1592 r lujiazui_decoder_translate > 1592 r lujiazui_div_translate > 3952 r lujiazui_div_min_issue_delay > 9216 r lujiazui_core_transitions > > This continues the work on reducing i386 insn-automata.o size started > with similar fixes for division and multiplication instructions in > znver.md [1][2]. I plan to submit corresponding fixes for > b[td]ver[123].md as well. > > [1] > https://inbox.sourceware.org/gcc-patches/23c795d6-403c-5927-e610-f0f1215f5...@ispras.ru/T/#m36e069d43d07d768d4842a779e26b4a0915cc543 > [2] > https://inbox.sourceware.org/gcc-patches/20221101162637.14238-1-amona...@ispras.ru/ > > gcc/ChangeLog: > > PR target/87832 > * config/i386/lujiazui.md (lujiazui_div): New automaton. > (lua_div): New unit. > (lua_idiv_qi): Correct unit in the reservation. > (lua_idiv_qi_load): Ditto. > (lua_idiv_hi): Ditto. > (lua_idiv_hi_load): Ditto. > (lua_idiv_si): Ditto. > (lua_idiv_si_load): Ditto. > (lua_idiv_di): Ditto. > (lua_idiv_di_load): Ditto. > (lua_fdiv_SF): Ditto. > (lua_fdiv_SF_load): Ditto. > (lua_fdiv_DF): Ditto. > (lua_fdiv_DF_load): Ditto. > (lua_fdiv_XF): Ditto. > (lua_fdiv_XF_load): Ditto. > (lua_ssediv_SF): Ditto. > (lua_ssediv_load_SF): Ditto. > (lua_ssediv_V4SF): Ditto. > (lua_ssediv_load_V4SF): Ditto. > (lua_ssediv_V8SF): Ditto. > (lua_ssediv_load_V8SF): Ditto. > (lua_ssediv_SD): Ditto. > (lua_ssediv_load_SD): Ditto. > (lua_ssediv_V2DF): Ditto. > (lua_ssediv_load_V2DF): Ditto. > (lua_ssediv_V4DF): Ditto. > (lua_ssediv_load_V4DF): Ditto. > (lua_sseicvt_si): Ditto. > --- > gcc/config/i386/lujiazui.md | 58 +++-- > 1 file changed, 30 insertions(+), 28 deletions(-) > > diff --git a/gcc/config/i386/lujiazui.md b/gcc/config/i386/lujiazui.md > index 9046c09f2..58a230c70 100644 > --- a/gcc/config/i386/lujiazui.md > +++ b/gcc/config/i386/lujiazui.md > @@ -19,8 +19,8 @@ > > ;; Scheduling for ZHAOXIN lujiazui processor. > > -;; Modeling automatons for decoders, execution pipes and AGU pipes. > -(define_automaton "lujiazui_decoder,lujiazui_core,lujiazui_agu") > +;; Modeling automatons for decoders, execution pipes, AGU pipes, and divider. 
> +(define_automaton "lujiazui_decoder,lujiazui_core,lujiazui_agu,lujiazui_div") > > ;; The rules for the decoder are simple: > ;; - an instruction with 1 uop can be decoded by any of the three > @@ -55,6 +55,8 @@ (define_reservation "lua_decoder01" > "lua_decoder0|lua_decoder1") > (define_cpu_unit "lua_p0,lua_p1,lua_p2,lua_p3" "lujiazui_core") > (define_cpu_unit "lua_p4,lua_p5" "lujiazui_agu") > > +(define_cpu_unit "lua_div" "lujiazui_div") > + > (define_reservation "lua_p03" "lua_p0|lua_p3") > (define_reservation "lua_p12" "lua_p1|lua_p2") > (define_reservation "lua_p1p2" "lua_p1+lua_p2") > @@ -229,56 +231,56 @@ (define_insn_reservation "lua_idiv_qi" 21 > (and (eq_attr "memory" "none") > (and (eq_attr "mode" "QI") > (eq_attr "type" "idiv" > - "lua_decoder0,lua_p0p1p2p3*21") > + "lua_decoder0,lua_p0p1p2p3,lua_div*21") > > (define_insn_reservation "lua_idiv_qi_load" 25 >(and (eq_attr "cpu" "lujiazui") > (and (eq_attr "memory" "load") > (and (eq_attr "mode" "QI") > (eq_attr "type" "idiv" > -
Re: [PATCH V2] Disable sched1 in functions that call setjmp
On Thu, 22 Dec 2022, Jose E. Marchesi via Gcc-patches wrote: > The first instruction scheduler pass reorders instructions in the TRY > block in a way `b=true' gets executed before the call to the function > `f'. This optimization is wrong, because `main' calls setjmp and `f' > is known to call longjmp. > > As discussed in BZ 57067, the root cause for this is the fact that > setjmp is not properly modeled in RTL, and therefore the backend > passes have no normalized way to handle this situation. > > As Alexander Monakov noted in the BZ, many RTL passes refuse to touch > functions that call setjmp. This includes for example gcse, > store_motion and cprop. This patch adds the sched1 pass to that list. > > Note that the other instruction scheduling passes are still allowed to > run on these functions, since they reorder instructions within basic > blocks, and therefore they cannot cross function calls. > > This doesn't fix the fundamental issue, but at least assures that > sched1 wont perform invalid transformation in correct C programs. I think scheduling across calls in the pre-RA scheduler is simply an oversight, we do not look at dataflow information and with 50% chance risk extending lifetime of a pseudoregister across a call, causing higher register pressure at the point of the call, and potentially an extra spill. Therefore I would suggest to indeed solve the root cause, with (untested): diff --git a/gcc/sched-deps.cc b/gcc/sched-deps.cc index 948aa0c3b..343fe2bfa 100644 --- a/gcc/sched-deps.cc +++ b/gcc/sched-deps.cc @@ -3688,7 +3688,13 @@ deps_analyze_insn (class deps_desc *deps, rtx_insn *insn) CANT_MOVE (insn) = 1; - if (find_reg_note (insn, REG_SETJMP, NULL)) + if (!reload_completed) + { + /* Do not schedule across calls, this is prone to extending lifetime +of a pseudo and causing extra spill later on. */ + reg_pending_barrier = MOVE_BARRIER; + } + else if (find_reg_note (insn, REG_SETJMP, NULL)) { /* This is setjmp. Assume that all registers, not just hard registers, may be clobbered by this call. */ Alexander
Re: [PATCH V2] Disable sched1 in functions that call setjmp
On Thu, 22 Dec 2022, Qing Zhao wrote: > > I think scheduling across calls in the pre-RA scheduler is simply an > > oversight, > > we do not look at dataflow information and with 50% chance risk extending > > lifetime of a pseudoregister across a call, causing higher register > > pressure at > > the point of the call, and potentially an extra spill. > > I am a little confused, you mean pre-RA scheduler does not look at the data > flow > information at all when scheduling insns across calls currently? I think it does not inspect liveness info, and may extend lifetime of a pseudo across a call, transforming call foo reg = 1 ... use reg to reg = 1 call foo ... use reg but this is undesirable, because now register allocation cannot select a call-clobbered register for 'reg'. Alexander
Re: [PATCH V2] Disable sched1 in functions that call setjmp
On Fri, 23 Dec 2022, Qing Zhao wrote: > >> I am a little confused, you mean pre-RA scheduler does not look at the > >> data flow > >> information at all when scheduling insns across calls currently? > > > > I think it does not inspect liveness info, and may extend lifetime of a > > pseudo > > across a call, transforming > > > > call foo > > reg = 1 > > ... > > use reg > > > > to > > > > reg = 1 > > call foo > > ... > > use reg > > > > but this is undesirable, because now register allocation cannot select a > > call-clobbered register for 'reg’. > Okay, thanks for the explanation. > > Then, why not just check the liveness info instead of inhibiting all > scheduling across calls? Because there's almost nothing to gain from pre-RA scheduling across calls in the first place. Remember that the call transfers control flow elsewhere and therefore the scheduler has no idea about the pipeline state after the call and after the return, so modeling-wise it's a gamble. For instructions that lie on a critical path such scheduling can be useful when it substantially reduces the difference between the priority of the call and nearby instructions of the critical path. But we don't track which instructions are on critical path(s) and which are not. (scheduling across calls in sched2 is somewhat dubious as well, but it doesn't risk register pressure issues, and on VLIW CPUs it at least can result in better VLIW packing) Alexander
Re: [PATCH V2] Disable sched1 in functions that call setjmp
On Fri, 23 Dec 2022, Jose E. Marchesi wrote: > > (scheduling across calls in sched2 is somewhat dubious as well, but > > it doesn't risk register pressure issues, and on VLIW CPUs it at least > > can result in better VLIW packing) > > Does sched2 actually schedule across calls? All the comments in the > source code stress the fact that the second scheduler pass (after > register allocation) works in regions that correspond to basic blocks: > "(after reload, each region is of one block)". A call instruction does not end a basic block. (also, with -fsched2-use-superblocks sched2 works on regions like sched1) Alexander
Re: [PATCH V2] Disable sched1 in functions that call setjmp
On Fri, 23 Dec 2022, Qing Zhao wrote: > Then, sched2 still can move insn across calls? > So does sched2 have the same issue of incorrectly moving the insn across a > call which has unknown control flow? I think problems are unlikely because the register allocator assigns pseudos that cross setjmp to memory. I think you hit the problem with sched1 because most testing is done on x86 and sched1 is not enabled there, otherwise the problem would have been noticed much earlier. Alexander
Re: [PATCH V2] Disable sched1 in functions that call setjmp
On Fri, 23 Dec 2022, Qing Zhao wrote: > BTW, Why sched1 is not enabled on x86 by default? Register allocation is tricky on x86 due to small number of general-purpose registers, and sched1 can make it even more difficult. I think before register pressure modeling was added, sched1 could not be enabled because then allocation would sometimes fail, and now there's no incentive to enable it, as it is not so important for modern x86 CPUs. Perhaps someone else has a more comprehensive answer. > Another question is: As discussed in the original bug PR57067: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57067 The root cause of this > issue related to the abnormal control flow edges (from setjmp/longjmp) cannot > be represented correctly at RTL stage, shall we fix this root cause instead? You'd need an experienced reviewer to work with you, especially on high-level design decisions such as "How ABNORMAL_DISPATCHER should be represented on RTL". I'm afraid it's not just a matter of applying a small patch in one place. Alexander
Re: [PATCH V2] Disable sched1 in functions that call setjmp
On Sat, 24 Dec 2022, Jose E. Marchesi wrote: > However, there is something I don't understand: wouldn't sched2 > introduce the same problem when -fsched2-use-superblocks is specified? Superblocks are irrelevant here: a call instruction does not end a basic block, and the problematic motion happens within a BB in your testcase. Didn't you ask about this already? > In that case, the option a) would need to be expanded to disable sched2 > as well, and b) wouldn't have effect (!after_reload)? See my response to Qing Zhao: I think that, due to the special-casing of pseudos that are live at setjmp during register allocation, sched2 will not move them in such a manner (they should be assigned to memory and I don't expect sched2 will move such MEMs across calls). But of course there may be holes in this theory. On some targets disabling sched2 is not so easy because it's responsible for VLIW packing (bundling on ia64). Alexander
Re: [PATCH][X86_64] Separate znver4 insn reservations from older znvers
On Tue, 3 Jan 2023, Jan Hubicka wrote: > > * gcc/common/config/i386/i386-common.cc (processor_alias_table): > > Use CPU_ZNVER4 for znver4. > > * config/i386/i386.md: Add znver4.md. > > * config/i386/znver4.md: New. > OK, > thanks! Honza, I'm curious what your further plans are for this; you mentioned merging znver4.md back into znver.md, if I recall correctly? Alexander
Re: [PATCH][RFC] API extension for binutils (type of symbols).
On Wed, 11 Mar 2020, Martin Liška wrote: > > Is there a comprehensive list of plugins out in the wild using the LD > > plugin API? > > I know only about: > $ ls /usr/lib/bfd-plugins > liblto_plugin.so.0.0.0 LLVMgold.so > > and I know about Alexander Monakov (some dead code elimination plug-in). I don't recall seeing any other plugin when working on our "debloating" stuff, but of course I suppose people could have used the plugin API for experimentation. I don't think a comprehensive list could exist. > > Note we also have to bring in gold folks (not sure if lld also > > implements the same plugin API) > > I don't see how can they be affected? As discussed downthread the other linkers could freely ignore the new callback, but fwiw LLD has no plans to implement the plugin-api.h interface: https://bugs.llvm.org/show_bug.cgi?id=42446 Alexander
Re: [PATCH v3] debug/93751 Generate DIEs for external variables with -g1
On Sat, 14 Mar 2020, Alexey Neyman wrote: > Attached is a patch that does it: at -g1, the type attributes are not > generated. Two small issues I pointed out the last time are still present: https://gcc.gnu.org/legacy-ml/gcc-patches/2020-02/msg01646.html (I did not review the new patch on a more substantial level) Alexander
Re: [PATCH] Do not modify tab options in vimrc for .py files.
On Thu, 16 Apr 2020, Martin Liška wrote: > On 4/16/20 9:57 AM, Richard Biener wrote: > > Ah, tab vs. spaces. Changed to all spaces now and pushed. > > Ah, I've also hit the issue. That's caused by our local vimrc. > We should exclude tab options for .py files. I think your patch is correct. It's possible to also add an 'else' branch with correct settings for Python (expandtab, sw=sts=4, ts=8). Have you considered that? Thanks. Alexander
Re: [PATCH] Do not modify tab options in vimrc for .py files.
On Thu, 16 Apr 2020, Martin Liška wrote: > To be honest I have: > autocmd Filetype python setlocal expandtab tabstop=4 shiftwidth=4 > softtabstop=4 > > in my default vim config. > But I'm wondering what's default for 'python' Filetype? Since October 2013 Vim ftplugin/python.vim has: " As suggested by PEP8. setlocal expandtab shiftwidth=4 softtabstop=4 tabstop=8 So the default is correct. Please disregard my suggestion then, no need to add an 'else' branch there. Thanks. Alexander
Re: [PATCH] fix build of targets not implementing add_stmt_cost
On Tue, 5 May 2020, Richard Biener wrote: > > Pushed as obvious. > > C++ makes mismatched prototype and implementation OK. (because of overloads) I think this would have been caught if GCC enabled -Wmissing-declarations during bootstrap. The main reason we have this problem is that the corresponding C warning is named differently (-Wmissing-prototypes), and we didn't switch from -Wmissing-prototypes to -Wmissing-declarations when transitioning to C++, even though the former is a complete no-op for C++. This is PR bootstrap/91972. Alexander
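P.S. A toy illustration of the mechanism (the signatures are made up; only the function name is taken from the patch in question):

  // In a header:
  void add_stmt_cost (int);

  // In the .cc file.  In C this definition would conflict with the
  // declaration above; in C++ it is silently accepted as another overload,
  // and only -Wmissing-declarations would point out that it has no prior
  // declaration.
  void add_stmt_cost (long) { }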
Re: [PATCH] make minmax detection work with FMIN/FMAX IFNs
On Fri, 8 May 2020, Richard Biener wrote: > > Currently we fail to optimize those which are used when MIN/MAX_EXPR > cannot be used for FP values but the target has IEEE conforming > implementations. i386 ieee_s{min,max} patterns are definitely not IEEE-compliant, their comment alludes to that: ;; These versions of the min/max patterns implement exactly the operations ;; min = (op1 < op2 ? op1 : op2) ;; max = (!(op1 < op2) ? op1 : op2) ;; Their operands are not commutative, and thus they may be used in the ;; presence of -0.0 and NaN. I don't understand why the patch is correct if the IFNs refer to fully IEEE-compliant operations (which is in itself a bit ambiguous w.r.t behavior when exactly one operand is a NaN). Am I missing something? Alexander
Re: [PATCH] make minmax detection work with FMIN/FMAX IFNs
On Fri, 8 May 2020, Uros Bizjak wrote: > > Am I missing something? > > Is the above enough to declare min/max as IEEE compliant? No. SSE min/max instructions semantics match C expression x < y ? x : y. IEEE min/max operations are commutative when exactly one operand is a NaN, and so are C fmin/fmax functions: fmin(x, NaN) == fmin(NaN, x) == x // x is not a NaN In contrast, (x < y ? x : y) always returns y when x or y is a NaN, and likewise the corresponding SSE instructions are not commutative. Therefore they are explicitly non-compliant in presence of NaNs. I don't know how GCC defines the semantics of GIMPLE min/max IFNs. Alexander
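P.S. For example (illustrative):

  #include <math.h>
  #include <stdio.h>

  int main (void)
  {
    double x = 1.0, n = NAN;
    /* The open-coded ternary (which is what the SSE min/max instructions
       compute) yields the second operand whenever a NaN is involved, so it
       is not commutative; fmin returns the non-NaN operand for both orders.  */
    printf ("%g %g\n", x < n ? x : n, n < x ? n : x);   /* prints: nan 1 */
    printf ("%g %g\n", fmin (x, n), fmin (n, x));       /* prints: 1 1 */
    return 0;
  }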
Re: [PATCH] make minmax detection work with FMIN/FMAX IFNs
On Sun, 10 May 2020, Uros Bizjak wrote: > So, I found [1], that tries to explain this issue. > > [1] https://2pi.dk/2016/05/ieee-min-max I would also recommend reading this report that covers a few more architectures and issues with IEEE754 definitions: http://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/minNum_maxNum_Removal_Demotion_v3.pdf Alexander
Re: [PATCH] make minmax detection work with FMIN/FMAX IFNs
On Mon, 11 May 2020, Richard Sandiford wrote: > Like you say, the idea is that since the operation is commutative and > is the same in both vector and scalar form, there's no reason to require > any -ffast-math flags. Note that PR88540, which Richard is referencing, uses the open-coded x < y ? x : y (non-commutative), and we want to use SSE minpd there even without -ffast-math, as the SSE min/max insns match the semantics of open-coded ternary operators. (Unlike Arm SIMD, SSE does not have a way to compute fmin/fmax with a single instruction in the presence of NaNs.) Alexander
Re: [PATCH 2/2] x86: Add cmpmemsi for -minline-all-stringops
On Sun, 31 May 2020, H.J. Lu via Gcc-patches wrote: > --- a/gcc/config/i386/i386-expand.c > +++ b/gcc/config/i386/i386-expand.c > @@ -7656,6 +7656,90 @@ ix86_expand_set_or_cpymem (rtx dst, rtx src, rtx > count_exp, rtx val_exp, >return true; > } > > +/* Expand cmpstrn or memcmp. */ > + > +bool > +ix86_expand_cmpstrn_or_cmpmem (rtx result, rtx src1, rtx src2, > +rtx length, rtx align, bool is_cmpstrn) > +{ > + if (optimize_insn_for_size_p () && !TARGET_INLINE_ALL_STRINGOPS) > +return false; > + > + /* Can't use this if the user has appropriated ecx, esi or edi. */ > + if (fixed_regs[CX_REG] || fixed_regs[SI_REG] || fixed_regs[DI_REG]) > +return false; > + > + if (is_cmpstrn) > +{ > + /* For strncmp, length is the maximum length, which can be larger > + than actual string lengths. We can expand the cmpstrn pattern > + to "repz cmpsb" only if one of the strings is a constant so > + that expand_builtin_strncmp() can write the length argument to > + be the minimum of the const string length and the actual length > + argument. Otherwise, "repz cmpsb" may pass the 0 byte. */ > + tree t1 = MEM_EXPR (src1); > + tree t2 = MEM_EXPR (src2); > + if (!((t1 && TREE_CODE (t1) == MEM_REF > + && TREE_CODE (TREE_OPERAND (t1, 0)) == ADDR_EXPR > + && (TREE_CODE (TREE_OPERAND (TREE_OPERAND (t1, 0), 0)) > + == STRING_CST)) > + || (t2 && TREE_CODE (t2) == MEM_REF > + && TREE_CODE (TREE_OPERAND (t2, 0)) == ADDR_EXPR > + && (TREE_CODE (TREE_OPERAND (TREE_OPERAND (t2, 0), 0)) > + == STRING_CST > + return false; > +} > + else > +{ > + /* Expand memcmp to "repz cmpsb" only for -minline-all-stringops > + since "repz cmpsb" can be much slower than memcmp function > + implemented with vector instructions, see > + > + https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052 > + */ > + if (!TARGET_INLINE_ALL_STRINGOPS) > + return false; > +} This check seems to be misplaced, "rep cmps" is slower than either memcmp or strcmp. The test for TARGET_INLINE_ALL_STRINGOPS should happen regardless of is_cmpstrn, so it should go earlier in the function. Alexander
Re: [PATCH] GDB hooks: improve documentation
Hi, On Wed, 2 Dec 2020, Martin Liška wrote: > Hey. > > I see the current help description of GCC hooks not much useful: > > $ help user-defined [snip] > trt -- GCC hook: trt [tree] > > It's quite hard to be familiar what each hooks means and rather suggest: > [snip] > trt -- GCC hook: TREE_TYPE (tree) > > Thoughts? Yes please! I did not know about 'help user-defined' when updating those mini-doc snippets. I think your patch is a nice improvement. It seems you are leaving 'pp' unchanged: > pp -- GCC hook: pp [any] I'd suggest perhaps "GCC hook: debug ()". And one nit: > @@ -69,7 +69,7 @@ call debug_rtx_list ($debug_arg, debug_rtx_count) > end > document prl > -GCC hook: prl [rtx] > +GCC hook: prl debug_rtx_list (rtx) I think 'prl' before 'debug_rtx_list' should have been deleted. Thanks. Alexander
Re: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch
On Tue, 8 Dec 2020, Julian Brown wrote: > Ping? This has addressed my concerns, thanks. Alexander > On Fri, 13 Nov 2020 20:54:54 + > Julian Brown wrote: > > > Hi Alexander, > > > > Thanks for the review! Comments below. > > > > On Tue, 10 Nov 2020 00:32:36 +0300 > > Alexander Monakov wrote: > > > > > On Mon, 26 Oct 2020, Jakub Jelinek wrote: > > > > > > > On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote: > > > > > This patch adds caching for the stack block allocated for > > > > > offloaded OpenMP kernel launches on NVPTX. This is a performance > > > > > optimisation -- we observed an average 11% or so performance > > > > > improvement with this patch across a set of accelerated GPU > > > > > benchmarks on one machine (results vary according to individual > > > > > benchmark and with hardware used). > > > > > > In this patch you're folding two changes together: reuse of > > > allocated stacks and removing one host-device synchronization. Why > > > is that? Can you report performance change separately for each > > > change (and split out the patches)? > > > > An accident of the development process of the patch, really -- the > > idea for removing the post-kernel-launch synchronisation came from the > > OpenACC side, and adapting it to OpenMP meant the stacks had to remain > > allocated after the return of the GOMP_OFFLOAD_run function. > > > > > > > A given kernel launch will reuse the stack block from the > > > > > previous launch if it is large enough, else it is freed and > > > > > reallocated. A slight caveat is that memory will not be freed > > > > > until the device is closed, so e.g. if code is using highly > > > > > variable launch geometries and large amounts of GPU RAM, you > > > > > might run out of resources slightly quicker with this patch. > > > > > > > > > > Another way this patch gains performance is by omitting the > > > > > synchronisation at the end of an OpenMP offload kernel launch -- > > > > > it's safe for the GPU and CPU to continue executing in parallel > > > > > at that point, because e.g. copies-back from the device will be > > > > > synchronised properly with kernel completion anyway. > > > > > > I don't think this explanation is sufficient. My understanding is > > > that OpenMP forbids the host to proceed asynchronously after the > > > target construct unless it is a 'target nowait' construct. This may > > > be observable if there's a printf in the target region for example > > > (or if it accesses memory via host pointers). > > > > > > So this really needs to be a separate patch with more explanation > > > why this is okay (if it is okay). > > > > As long as the offload kernel only touches GPU memory and does not > > have any CPU-visible side effects (like the printf you mentioned -- I > > hadn't really considered that, oops!), it's probably OK. > > > > But anyway, the benefit obtained on OpenMP code (the same set of > > benchmarks run before) of omitting the synchronisation at the end of > > GOMP_OFFLOAD_run seems minimal. So it's good enough to just do the > > stacks caching, and miss out the synchronisation removal for now. (It > > might still be something worth considering later, perhaps, as long as > > we can show some given kernel doesn't use printf or access memory via > > host pointers -- I guess the former might be easier than the latter. I > > have observed the equivalent OpenACC patch provide a significant boost > > on some benchmarks, so there's probably something that could be gained > > on the OpenMP side too.) 
> > > > The benefit with the attached patch -- just stacks caching, no > > synchronisation removal -- is about 12% on the same set of benchmarks > > as before. Results are a little noisy on the machine I'm benchmarking > > on, so this isn't necessarily proof that the synchronisation removal > > is harmful for performance! > > > > > > > In turn, the last part necessitates a change to the way > > > > > "(perhaps abort was called)" errors are detected and reported. > > > > > > > > > > > As already mentioned using callbacks is problematic. Plus, I'm sure > > > the way you lock out other threads is a performance loss when > > > multiple threads have target regions: even though they will not run > > > concurrently on the GPU, you still want to allow host threads to > > > submit GPU jobs while the GPU is occupied. > > > > > > I would suggest to have a small pool (up to 3 entries perhaps) of > > > stacks. Then you can arrange reuse without totally serializing host > > > threads on target regions. > > > > I'm really wary of the additional complexity of adding a stack pool, > > and the memory allocation/freeing code paths in CUDA appear to be so > > slow that we get a benefit with this patch even when the GPU stream > > has to wait for the CPU to unlock the stacks block. Also, for large > > GPU launches, the size of the soft-stacks block isn't really trivial > > (I've seen something like 50MB on the ha
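Two small sketches may help make the discussion above concrete; neither is taken from the patch itself.

First, the observability concern: a plain 'target' construct (no 'nowait') requires the host thread to wait for the offloaded region to complete, so in a minimal example like the following (assuming an offloading-enabled compiler) the second printf must not execute before the region has finished:

#include <stdio.h>

int main (void)
{
  /* Synchronous target region: the host thread may not proceed past it
     until the offloaded region has completed.  */
#pragma omp target
  printf ("device\n");

  /* If the plugin skipped the synchronization after the kernel launch,
     this statement could run while the kernel is still executing.  */
  printf ("host\n");
  return 0;
}

Second, the suggestion of a small pool of stack blocks. A rough sketch of the idea (simplified relative to what the libgomp nvptx plugin would actually need; 'device_alloc'/'device_free' are placeholder names standing in for the CUDA driver calls):

#include <pthread.h>
#include <stddef.h>

#define STACK_POOL_SIZE 3

struct stack_entry
{
  void *dev_mem;   /* cached device allocation */
  size_t size;     /* its size in bytes */
  int in_use;      /* lent out to an in-flight kernel launch */
};

static struct stack_entry pool[STACK_POOL_SIZE];
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Placeholders for the real device memory management calls.  */
extern void *device_alloc (size_t size);
extern void device_free (void *ptr);

/* Grab a stacks block of at least SIZE bytes: reuse a free entry,
   growing it if needed.  Returns NULL if every entry is busy, so the
   caller can fall back to a one-off allocation.  */
static void *
stacks_acquire (size_t size)
{
  void *res = NULL;
  pthread_mutex_lock (&pool_lock);
  for (int i = 0; i < STACK_POOL_SIZE; i++)
    if (!pool[i].in_use)
      {
        if (pool[i].size < size)
          {
            if (pool[i].dev_mem)
              device_free (pool[i].dev_mem);
            pool[i].dev_mem = device_alloc (size);
            pool[i].size = size;
          }
        pool[i].in_use = 1;
        res = pool[i].dev_mem;
        break;
      }
  pthread_mutex_unlock (&pool_lock);
  return res;
}

/* Return a block to the pool once the kernel using it has completed.  */
static void
stacks_release (void *dev_mem)
{
  pthread_mutex_lock (&pool_lock);
  for (int i = 0; i < STACK_POOL_SIZE; i++)
    if (pool[i].dev_mem == dev_mem)
      pool[i].in_use = 0;
  pthread_mutex_unlock (&pool_lock);
}

With a pool like this, several host threads can have kernels in flight at once without serializing on a single cached block; only the short bookkeeping section runs under the lock.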
Re: [PATCH] riscv: implement TARGET_MODE_REP_EXTENDED
On Wed, 9 Nov 2022, Philipp Tomsich wrote: > > To give a specific example that will be problematic if you go far enough > > down > > the road of matching MIPS64 behavior: > > > > long f(void) > > { > > int x; > > asm("" : "=r"(x)); > > return x; > > } > > > > here GCC (unlike LLVM) omits sign extension of 'x', assuming that asm output > > must have been sign-extended to 64 bits by the programmer. > > In fact, with the proposed patch (but also without it), GCC will sign-extend: > f: > sext.w a0,a0 > ret > .size f, .-f I'm aware. I said "will be problematic if ...", meaning that GCC omits sign extension when targeting MIPS64, and if you match MIPS64 behavior on RISC-V, you'll get in that situation as well. > To make sure that this is not just an extension to promote the int to > long for the function return, I next added another empty asm to > consume 'x'. > This clearly shows that the extension is performed to postprocess the > output of the asm-statement: > > f: > # ./asm2.c:4: asm("" : "=r"(x)); > sext.w a0,a0 # x, x > # ./asm2.c:5: asm("" : : "r"(x)); > # ./asm2.c:7: } > ret No, you cannot distinguish post-processing the output of the first asm vs. pre-processing the input of the second asm. Try asm("" : "+r"(x)); as the second asm instead, and you'll get f: # t.c:17: asm("" : "=r"(x)); # t.c:18: asm("" : "+r"(x)); # t.c:20: } sext.w a0,a0 #, x ret If it helps, here's a Compiler Explorer link comparing with MIPS64: https://godbolt.org/z/7eobvdKdK Alexander
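For reference, the two cases discussed above can be placed side by side in one self-contained test (this merely restates the snippets from the thread; assuming something like a riscv64 GCC at -O2, the exact assembly will vary with compiler version):

long f (void)
{
  int x;
  asm ("" : "=r" (x));   /* output-only asm */
  asm ("" : : "r" (x));  /* plain input: the sext.w appears before this use */
  return x;
}

long g (void)
{
  int x;
  asm ("" : "=r" (x));
  asm ("" : "+r" (x));   /* read-write operand: the sext.w moves after both asms */
  return x;
}

Comparing where the sext.w lands in the two functions shows the extension is tied to the use of 'x' (the second asm's input operand, or the return value), not to post-processing of the first asm's output.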
Re: [RFC] docs: remove documentation for unsupported releases
On Wed, 9 Nov 2022, Martin Liška wrote: > Hi. > > I think we should remove documentation for unsupported GCC releases > as it's indexed by the Google engine. I'd agree with previous responses that outright removing the links is undesirable, and pointing Google to recent documentation should be done by annotating links, e.g. via rel=canonical as indicated by Joseph. I would add that adding rel=canonical links seems doable without modifying old files, by configuring the web server to add an HTTP Link: header. > Second reason is that the page is long > and one can't easily jump to the Current development documentation. For this I would suggest using the <details> tag to neatly fold links for old releases. Please see the attached patch. Alexander

From ab6ce8c24aa17dba8ed79f3c3f7a5e8038dd3205 Mon Sep 17 00:00:00 2001
From: Alexander Monakov
Date: Wed, 9 Nov 2022 22:17:16 +0300
Subject: [PATCH] Fold doc links for old releases using <details> tag

---
 htdocs/onlinedocs/index.html | 151 ++-
 1 file changed, 96 insertions(+), 55 deletions(-)

diff --git a/htdocs/onlinedocs/index.html b/htdocs/onlinedocs/index.html
index 3410f731..03cbdbeb 100644
--- a/htdocs/onlinedocs/index.html
+++ b/htdocs/onlinedocs/index.html
[hunks covering the entries from GCC 12.2 down through GCC 4.4.7: each release's "GCC N.M manuals" block, with its Manual and Texinfo-sources links, is wrapped in the new <details> folding; the HTML markup of the hunk bodies was stripped when this message was archived and cannot be reproduced here.]
Re: [RFC] docs: remove documentation for unsupported releases
On Thu, 10 Nov 2022, Martin Liška wrote: > On 11/10/22 08:29, Gerald Pfeifer wrote: > > On Wed, 9 Nov 2022, Alexander Monakov wrote: > >> For this I would suggest using the <details> tag to neatly fold links > >> for old releases. Please see the attached patch. > > > > Loving it, Alexander! > > > > What do you guys think about unfolding all releases we, the GCC project, > > currently support (per https://gcc.gnu.org that'd be 12.x, 11.x, and 10.x > > at this point)? > > Works for me! > > > > > Either way: yes, please (aka approved). :-) > > Alexander, can you please install such a change? Yes, pushed: https://gcc.gnu.org/onlinedocs/ Alexander