Re: [PATCH v2] Add a GCC Security policy

2023-10-04 Thread Alexander Monakov


On Thu, 28 Sep 2023, Siddhesh Poyarekar wrote:

> Define a security process and exclusions to security issues for GCC and
> all components it ships.

Some typos and wording suggestions below.

> --- /dev/null
> +++ b/SECURITY.txt
> @@ -0,0 +1,205 @@
> +What is a GCC security bug?
> +===========================
> +
> +A security bug is one that threatens the security of a system or
> +network, or might compromise the security of data stored on it.
> +In the context of GCC there are multiple ways in which this might
> +happen and some common scenarios are detailed below.
> +
> +If you're reporting a security issue and feel like it does not fit
> +into any of the descriptions below, you're encouraged to reach out
> +through the GCC bugzilla or if needed, privately by following the
> +instructions in the last two sections of this document.
> +
> +Compiler drivers, programs, libgccjit and support libraries
> +------------------------------------------------------------
> +
> +The compiler driver processes source code, invokes other programs
> +such as the assembler and linker and generates the output result,
> +which may be assembly code or machine code.  Compiling untrusted
> +sources can result in arbitrary code execution and unconstrained
> +resource consumption in the compiler. As a result, compilation of
> +such code should be done inside a sandboxed environment to ensure
> +that it does not compromise the development environment.

"... the host environment" seems more appropriate.

> +
> +The libgccjit library can, despite the name, be used both for
> +ahead-of-time compilation and for just-in-time compilation.  In both
> +cases it can be used to translate input representations (such as
> +source code) in the application context; in the latter case the
> +generated code is also run in the application context.
> +
> +Limitations that apply to the compiler driver apply here too in
> +terms of sanitizing inputs and it is recommended that both the

s/sanitizing inputs/trusting inputs/ (I suggested it earlier; just unsure
whether you disagree or it simply fell through the cracks)

> +compilation *and* execution context of the code are appropriately
> +sandboxed to contain the effects of any bugs in libgccjit, the
> +application code using it, or its generated code to the sandboxed
> +environment.
> +
> +Libraries such as libiberty, libcc1 and libcpp are not distributed
> +for runtime support and have similar challenges to compiler drivers.
> +While they are expected to be robust against arbitrary input, they
> +should only be used with trusted inputs when linked into the
> +compiler.
> +
> +Libraries such as zlib that bundled into GCC to build it will be

'are bundled with' (missing 'are', s/into/with/)

> +treated the same as the compiler drivers and programs as far as
> +security coverage is concerned.  However if you find an issue in
> +these libraries independent of their use in GCC, you should reach
> +out to their upstream projects to report them.
> +
> +As a result, the only case for a potential security issue in the
> +compiler is when it generates vulnerable application code for
> +trusted input source code that is conforming to the relevant
> +programming standard or extensions documented as supported by GCC
> +and the algorithm expressed in the source code does not have the
> +vulnerability.  The output application code could be considered
> +vulnerable if it produces an actual vulnerability in the target
> +application, specifically in the following cases:

It seems ambiguous whether the list that follows is meant to be an exhaustive
enumeration. I think it is meant to give examples without covering all
possibilities; if that's the case, I would suggest

s/specifically in the following cases/for example/

If I misunderstood and the list is really meant to be exhaustive,
it would be nice to make that clear and perhaps refer the reader
to the second paragraph when their scenario does not fit.

> +
> +- The application dereferences an invalid memory location despite
> +  the application sources being valid.
> +- The application reads from or writes to a valid but incorrect
> +  memory location, resulting in an information integrity issue or an
> +  information leak.
> +- The application ends up running in an infinite loop or with
> +  severe degradation in performance despite the input sources having
> +  no such issue, resulting in a Denial of Service.  Note that
> +  correct but non-performant code is not a security issue candidate;
> +  this only applies to incorrect code that may result in performance
> +  degradation severe enough to amount to a denial of service.
> +- The application crashes due to the generated incorrect code,
> +  resulting in a Denial of Service.
> +
> +Language

[PATCH] doc: discourage const/volatile on register variables

2018-07-26 Thread Alexander Monakov
Hi,

when using explicit register variables ('register int foo asm("%ebp");'),
using const/volatile type qualifiers on the declaration does not result in
the logically expected effect.

The main issue is that gcc-8 got "better" at propagating initializers of
'register const' variables to their uses in asm operands, losing the
association with the register and thus causing the operand to
unexpectedly appear in some other register. This caused build issues for
the Linux kernel and was reported a couple of times in the GCC Bugzilla.
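
To make the failure mode concrete, here is a minimal example of the
pattern in question (the register name and value are made up for
illustration):

  void f (void)
  {
    /* With the 'const' qualifier, GCC 8 may substitute 42 for 'arg' in
       the asm operand below, so the value can arrive in a register
       other than the requested r10.  */
    register const long arg asm ("r10") = 42;
    asm volatile ("# use %0" : : "r" (arg));
  }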

This patch adds a few lines to the documentation to say that qualifiers
won't work as expected. OK for trunk?

Thanks.
Alexander

PR target/86673
* doc/extend.texi (Global Register Variables): Discourage use of type
qualifiers.
(Local Register Variables): Likewise.

diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index 7b471ec40f7..9a41f2753e9 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -9591,6 +9591,11 @@ a global variable the declaration appears outside a function.  The
 @code{static}. The register name must be a valid register name for the
 target platform.
 
+Do not use type qualifiers such as @code{const} and @code{volatile}, as
+the result may be contrary to your expectations.  In particular, using
+the @code{volatile} qualifier does not fully prevent the compiler from
+optimizing accesses to the register.
+
 Registers are a scarce resource on most systems and allowing the 
 compiler to manage their usage usually results in the best code. However, 
 under special circumstances it can make sense to reserve some globally.
@@ -9698,6 +9703,12 @@ but for a local variable the declaration appears within a function.  The
 @code{static}.  The register name must be a valid register name for the
 target platform.
 
+Do not use type qualifiers such as @code{const} and @code{volatile}, as
+the result may be contrary to your expectations. In particular, when
+the @code{const} qualifier is used, the compiler may substitute the
+variable with its initializer in @code{asm} statements, which may cause
+the corresponding operand to appear in a different register.
+
 As with global register variables, it is recommended that you choose 
 a register that is normally saved and restored by function calls on your 
 machine, so that calls to library routines will not clobber it.


Re: [PATCH] Fix bootstrap failure (with g++ 4.8.5) in tree-if-conv.cc.

2023-07-17 Thread Alexander Monakov


On Mon, 17 Jul 2023, Richard Biener wrote:

> > > > > OK.   Btw, while I didn't spot this during review I would appreciate
> > > > > if the code could use vec.[q]sort, this should work with a lambda as
> > > > > well I think.
> > > >
> > > > That was my first use, but that hits
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99469
> > >
> > > That is not hitting PR 99469 but rather it means your comparison is not
> > > correct for an (unstable) sort.
> > > That is, a qsort comparator should have this relationship `f(a,b) == !f(b,a)`
> > > and `f(a,a)` should also return false.
> >
> > I'm using the standard std::pair comparator which indicates that f(a,a) is 
> > true,
> > https://en.cppreference.com/w/cpp/utility/pair/operator_cmp
> >
> > > If you are running into this for qsort here, you will most likely run 
> > > into issues
> > > with std::sort later on too.
> >
> > Don't see why or how. It needs to have a consistent relationship which 
> > std::pair
> > maintains.  So why would using the standard tuple comparator with a standard
> > std::sort cause problem?
> 
> At least for
> 
>  return left.second < right.second;
> 
> f(a,a) doesn't hold.  Note qsort can end up comparing an element to
> itself (not sure if GCCs implementation now can).

(it cannot, but that is not important here)

Tamar, while std::sort receives a "less-than" comparison predicate, qsort
needs a tri-state comparator that returns a negative value for "less-than"
relation, positive for "more-than", and zero when operands are "equal".

Passing output of std::pair::operator< straight to qsort is not correct,
and qsort_chk catches that mistake at runtime.
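
For illustration (my sketch, not part of the patch), a conforming
tri-state comparator over std::pair<int, int> would look like:

  /* qsort comparator: returns <0, 0 or >0, and compares an element
     to itself as equal.  */
  static int
  cmp_pair (const void *pa, const void *pb)
  {
    const std::pair<int, int> &a = *(const std::pair<int, int> *) pa;
    const std::pair<int, int> &b = *(const std::pair<int, int> *) pb;
    if (a.second != b.second)
      return a.second < b.second ? -1 : 1;
    /* Tie-break on the first member to keep the order total and the
       results platform-independent.  */
    return a.first < b.first ? -1 : a.first > b.first;
  }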

std::sort is not a stable sort and therefore can cause code generation
differences by swapping around elements that are not bitwise-identical
but "equal" according to the comparator. This is the main reason for
preferring our internal qsort, which yields the same results on all platforms.

Let me also note that #include <algorithm> is pretty heavy-weight, and so
I'd suggest avoiding it, so as not to needlessly increase bootstrap times.

Alexander


Re: [PATCH] Reduce floating-point difficulties in timevar.cc

2023-07-21 Thread Alexander Monakov


On Fri, 21 Jul 2023, Xi Ruoyao via Gcc-patches wrote:

> Perhaps -ffp-contract=on (not off) is enough to fix the issue (if you
> are building GCC 14 snapshot).  The default is "fast" (if no -std=
> option is used), which allows some contractions disallowed by the
> standard.

Not fully, see below.

> But GCC is in C++ and I'm not sure if the C++ standard has the same
> definition for allowed contractions as C.

It doesn't, but in GCC we should aim to provide the same semantics in C++
as in C.

> > (Or is the severity of lack of support sufficiently different in the two 
> > cases that this is fine -- i.e. not compile vs may trigger floating 
> > point rounding inaccuracies?)
> 
> It's possible that the test itself is flaky.  Can you provide some
> detail about how it fails?

See also PR 99903 for an earlier known issue which appears due to x87
excess precision and so tweaking -ffp-contract wouldn't help:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99903

Now that multiple platforms are hitting this, can we _please_ get rid
of the questionable attempt to compute time in a floating-point variable
and just use a uint64_t storing nanoseconds?
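
A minimal sketch of what I mean, assuming POSIX clock_gettime (this is
not the actual timevar.cc interface):

  #include <stdint.h>
  #include <time.h>

  /* Wall-clock time in integer nanoseconds: no rounding, no excess
     precision, identical results on all hosts.  */
  static uint64_t
  get_time_ns (void)
  {
    struct timespec ts;
    clock_gettime (CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000 + ts.tv_nsec;
  }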

Alexander


Re: [PATCH] Reduce floating-point difficulties in timevar.cc

2023-07-21 Thread Alexander Monakov


On Fri, 21 Jul 2023, Xi Ruoyao wrote:

> > See also PR 99903 for an earlier known issue which appears due to x87
> > excess precision and so tweaking -ffp-contract wouldn't help:
> > 
> >   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99903
> 
> Does it affect AArch64 too?

Well, not literally (AArch64 doesn't have excess precision), but absence
of intermediate rounding in FMA is similar to excess precision.

I'm saying it's the same issue manifesting via different pathways on x86
and aarch64. Sorry if I misunderstood your question.

Alexander


RE: [PATCH] Replace invariant ternlog operands

2023-08-03 Thread Alexander Monakov


On Thu, 27 Jul 2023, Liu, Hongtao via Gcc-patches wrote:

> > +;; If the first and the second operands of ternlog are invariant and
> > +;; the third operand is memory
> > +;; then we should add load third operand from memory to register and
> > +;; replace first and second operands with this register
> > +(define_split
> > +  [(set (match_operand:V 0 "register_operand")
> > +	(unspec:V
> > +	  [(match_operand:V 1 "register_operand")
> > +	   (match_operand:V 2 "register_operand")
> > +	   (match_operand:V 3 "memory_operand")
> > +	   (match_operand:SI 4 "const_0_to_255_operand")]
> > +	  UNSPEC_VTERNLOG))]
> > +  "ternlog_invariant_operand_mask (operands) == 3 && !reload_completed"
> Maybe better with "!reload_completed && ternlog_invariant_operand_mask (operands) == 3"

I made this change (in both places), plus some style TLC. Ok to apply?

From d24304a9efd049e8db6df5ac78de8ca2d941a3c7 Mon Sep 17 00:00:00 2001
From: Yan Simonaytes 
Date: Tue, 25 Jul 2023 20:43:19 +0300
Subject: [PATCH] Eliminate irrelevant operands of VPTERNLOG

As mentioned in PR 110202, GCC may be presented with input where the control
word of the VPTERNLOG intrinsic implies that some of its operands do not
affect the result.  In that case, we can eliminate irrelevant operands
of the instruction by substituting any other operand in their place.
This removes false dependencies.

For instance, instead of (252 = 0xfc = _MM_TERNLOG_A | _MM_TERNLOG_B)

vpternlogq  $252, %zmm2, %zmm1, %zmm0

emit

vpternlogq  $252, %zmm0, %zmm1, %zmm0

When VPTERNLOG is invariant w.r.t first and second operands, and the
third operand is memory, load memory into the output operand first, i.e.
instead of (85 = 0x55 = ~_MM_TERNLOG_C)

vpternlogq  $85, (%rdi), %zmm1, %zmm0

emit

vmovdqa64   (%rdi), %zmm0
vpternlogq  $85, %zmm0, %zmm0, %zmm0
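
As a quick worked check of the mask logic (my illustration, not part of
the patch): imm8 is a truth table indexed by bits of A, B and C, and an
operand is irrelevant exactly when the two table halves it selects
between coincide.  For 85 = 0x55 = ~C we get
(0x55 >> 4) & 0x0f == 0x55 & 0x0f == 0x5 (A irrelevant) and
(0x55 >> 2) & 0x33 == 0x55 & 0x33 == 0x11 (B irrelevant), so the mask
is 3, which is exactly the condition the new define_split tests before
loading the memory operand.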

gcc/ChangeLog:

* config/i386/i386-protos.h (vpternlog_irrelevant_operand_mask):
Declare.
(substitute_vpternlog_operands): Declare.
* config/i386/i386.cc (vpternlog_irrelevant_operand_mask): New
helper.
(substitute_vpternlog_operands): New function.  Use them...
* config/i386/sse.md: ... here in new VPTERNLOG define_splits.

gcc/testsuite/ChangeLog:

* gcc.target/i386/invariant-ternlog-1.c: New test.
* gcc.target/i386/invariant-ternlog-2.c: New test.
---
 gcc/config/i386/i386-protos.h |  3 ++
 gcc/config/i386/i386.cc   | 43 +++
 gcc/config/i386/sse.md| 42 ++
 .../gcc.target/i386/invariant-ternlog-1.c | 21 +
 .../gcc.target/i386/invariant-ternlog-2.c | 12 ++
 5 files changed, 121 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/invariant-ternlog-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/invariant-ternlog-2.c

diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 27fe73ca65..12e6ff0ebc 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -70,6 +70,9 @@ extern machine_mode ix86_cc_mode (enum rtx_code, rtx, rtx);
 extern int avx_vpermilp_parallel (rtx par, machine_mode mode);
 extern int avx_vperm2f128_parallel (rtx par, machine_mode mode);
 
+extern int vpternlog_irrelevant_operand_mask (rtx[]);
+extern void substitute_vpternlog_operands (rtx[]);
+
 extern bool ix86_expand_strlen (rtx, rtx, rtx, rtx);
 extern bool ix86_expand_set_or_cpymem (rtx, rtx, rtx, rtx, rtx, rtx,
   rtx, rtx, rtx, rtx, bool);
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 32851a514a..9a7c1135a0 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -19420,6 +19420,49 @@ avx_vperm2f128_parallel (rtx par, machine_mode mode)
   return mask + 1;
 }
 
+/* Return a mask of VPTERNLOG operands that do not affect output.  */
+
+int
+vpternlog_irrelevant_operand_mask (rtx *operands)
+{
+  int mask = 0;
+  int imm8 = XINT (operands[4], 0);
+
+  if (((imm8 >> 4) & 0x0F) == (imm8 & 0x0F))
+    mask |= 1;
+  if (((imm8 >> 2) & 0x33) == (imm8 & 0x33))
+    mask |= 2;
+  if (((imm8 >> 1) & 0x55) == (imm8 & 0x55))
+    mask |= 4;
+
+  return mask;
+}
+
+/* Eliminate false dependencies on operands that do not affect output
+   by substituting other operands of a VPTERNLOG.  */
+
+void
+substitute_vpternlog_operands (rtx *operands)
+{
+  int mask = vpternlog_irrelevant_operand_mask (operands);
+
+  if (mask & 1) /* The first operand is irrelevant.  */
+    operands[1] = operands[2];
+
+  if (mask & 2) /* The second operand is irrelevant.  */
+    operands[2] = operands[1];
+
+  if (mask & 4) /* The third operand is irrelevant.  */
+    operands[3] = operands[1];
+  else if (REG_P (operands[3]))
+    {
+      if (mask & 1)
+	operands[1] = operands[3];
+      if (mask & 2)
+	operands[2] = operands[3];
+    }
+}
+
 /* Return a register priority for hard reg R

Re: RISC-V: Added support for CRC.

2023-08-08 Thread Alexander Monakov


On Thu, 3 Aug 2023, Jeff Law wrote:

> The end goal here is to actually detect bitwise CRC implementations in the
> gimple optimizers and turn them into table lookups or carryless multiplies in
> RTL.
> 
> Mariam has that working end-to-end and has proposed a talk for the Cauldron on
> the topic.
> 
> The idea here is to carve out the RTL side which we think provides potential
> value to end users (the ability to use the builtin to get an performant CRC
> implementation) and to get community feedback on the implementation.

Jeff, as I understand it, this is all happening only because Coremark contains
a use of bitwise CRC that affects benchmark scores. In another universe where

- Coremark was careful to checksum outputs outside of timed sections, or
- implemented CRC in a manner that is not transparent to the compiler, or
- did not use CRC at all

we would not be spending effort on this, correct? At best we might have
a discussion on providing a __builtin_clmul for carry-less multiplication
(which _is_ a fundamental primitive, unlike __builtin_crc), and move on.

Note that proposed __builtin_crc does not match how a high-performance CRC
over a variable-size array is implemented. You don't want to do two
back-to-back CLMULs to compute a new CRC given an old CRC. That makes your
loop latency-constrained to 2*L*N where L is latency of the CLMUL instruction
and N is the number of loop iterations.
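
To put rough numbers on that (illustrative values only): with L = 7 and
a 4096-byte buffer consumed 64 bits per iteration, N = 512 and the
latency bound is 2 * 7 * 512 = 7168 cycles, about 1.75 cycles per byte,
whereas throughput-bound folding loops achieve a small fraction of a
cycle per byte.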

Instead, efficient CRC loops have the following structure:

- they carry an unreduced remainder in the loop, performing final reduction
  modulo polynomial only once after the loop — this halves the CLMUL count;

- they overlap multiple CLMUL chains to make the loop throughput-bound
  rather than latency-bound. The typical unroll factor is about 4x-8x.

A relatively easy to follow explanation is provided by Pete Cawley at
https://www.corsix.org/content/alternative-exposition-crc32_4k_pclmulqdq
(there are other sources for similar optimization of table-based CRC).

Also note that in __builtin_crc care is needed regarding how the
polynomial is specified (which term is dropped, and what bit order is used).

Hence, I am concerned that proposed __builtin_crc is not useful for FOSS
that actually needs high-performance CRC (the Linux kernel, compression
and image libraries).

I think this crosses the line of "cheating in benchmarks" and not something
we should do in GCC.

Alexander


Re: RISC-V: Added support for CRC.

2023-08-08 Thread Alexander Monakov


On Tue, 8 Aug 2023, Jeff Law wrote:

> That was my thinking at one time.  Then we started looking at the distros and
> found enough crc implementations in there to change my mind about the overall
> utility.

The ones I'm familiar with are all table-based and look impossible to
pattern-match (and hence already fairly efficient compared to the bitwise
loop in Coremark).

> If we need to do something to make it more useful, we're certainly open to
> that.

So... just provide a library? Library code is easier to develop and audit,
it can be released independently, and people can use it with their compiler
of choice. Not everything needs to be in libgcc.

> > - they overlap multiple CLMUL chains to make the loop throughput-bound
> >rather than latency-bound. The typical unroll factor is about 4x-8x.
> We do have the ability to build longer chains.  We actually use that in the
> coremark benchmark where the underlying primitives are 8-bit CRCs that are
> composed into 16/32 bit CRCs.

I'm talking about factoring a long chain into multiple independent chains
for latency hiding.

> > Hence, I am concerned that proposed __builtin_crc is not useful for FOSS
> > that actually needs high-performance CRC (the Linux kernel, compression
> > and image libraries).
> > 
> > I think this crosses the line of "cheating in benchmarks" and not something
> > we should do in GCC.
> Certainly not the intention.  The intention is to provide a useful builtin_crc

Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
These consumers need high-performance blockwise CRC, offering them
a latency-bound elementwise CRC primitive is a disservice. And what
should they use as a fallback when __builtin_crc is unavailable?

> while at the same time putting one side of the infrastructure we need for
> automatic detection of CRC loops and turning them into table lookups or
> CLMULs.
> 
> With that in mind I'm certain Mariam & I would love feedback on a builtin API
> that would be more useful.

I think offering a conventional library for CRC has substantial advantages.

Alexander


Re: RISC-V: Added support for CRC.

2023-08-08 Thread Alexander Monakov


On Tue, 8 Aug 2023, Jeff Law wrote:

> If the compiler can identify a CRC and collapse it down to a table or clmul,
> that's a major win and such code does exist in the real world. That was the
> whole point behind the Fedora experiment -- to determine if these things are
> showing up in the real world or if this is just a benchmarking exercise.

Can you share the results of the experiment and give your estimate of what
sort of real-world improvement is expected? I already listed the popular
FOSS projects where CRC performance is important: the Linux kernel and
a few compression libraries. Those projects do not use a bitwise CRC loop,
except sometimes for table generation on startup (which needs less time
than a page fault that may be necessary to bring in a hardcoded table).

For those projects that need a better CRC, why is the chosen solution
to optimize it in the compiler instead of offering them a library they
could use with any compiler?

Was there any thought given to embedded projects that use bitwise CRC
exactly because they have little space to spare for a hardcoded table?

> > Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
> > These consumers need high-performance blockwise CRC, offering them
> > a latency-bound elementwise CRC primitive is a disservice. And what
> > should they use as a fallback when __builtin_crc is unavailable?
> The point is builtin_crc would always be available.  If there is no clmul,
> then the RTL backend can expand to a table lookup version.

No, not if the compiler is not GCC, or its version is less than 14. And
those projects are not going to sacrifice their portability just for
__builtin_crc.

> > I think offering a conventional library for CRC has substantial advantages.
> That's not what I asked.  If you think there's room for improvement to a
> builtin API, I'd love to hear it.
> 
> But it seems you don't think this is worth the effort at all.  That's
> unfortunate, but if that's the consensus, then so be it.

I think it's a strange application of development effort. You'd get more
done coding a library.

> I'll note LLVM is likely going forward with CRC detection and optimization at
> some point in the next ~6 months (effectively moving the implementation from
> the hexagon port into the generic parts of their loop optimizer).

I don't see CRC detection in the Hexagon port. There is a recognizer for
polynomial multiplication (CRC is division, not multiplication).

Alexander


Re: [PATCH] Handle in-order reductions when SLP vectorizing non-loops

2023-08-09 Thread Alexander Monakov


On Wed, 9 Aug 2023, Richard Biener via Gcc-patches wrote:

> The following teaches the non-loop reduction vectorization code to
> handle non-associatable reductions.  Using the existing FOLD_LEFT_PLUS
> internal functions might be possible but I'd have to convince myself
> that +0.0 + x[0] is a safe extra operation in every rounding mode
> (I also have no way to test the resulting code).

It's not. Under our default -fno-signaling-nans -fno-rounding-math
negative zero is the neutral element for addition, so '-0.0 + x[0]'
might be used instead (but negative zero costs more to materialize).

If the reduction has at least two elements, then 

-0.0 + x[0] + x[1]

has the same behavior w.r.t SNaNs as 'x[0] + x[1]', but unfortunately
yields negative zero when x[0] = x[1] = +0.0 and rounding towards
negative infinity (unlike x[0] + x[1], which is +0.0).
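
A quick check of that corner case (my example, not from the patch;
compile with -frounding-math and link with -lm so the dynamic rounding
mode is respected) prints "0 -0":

  #include <fenv.h>
  #include <stdio.h>

  int main (void)
  {
    fesetround (FE_DOWNWARD);  /* round towards negative infinity */
    volatile double x0 = 0.0, x1 = 0.0;
    volatile double plain = x0 + x1;          /* +0.0 */
    volatile double seeded = -0.0 + x0 + x1;  /* -0.0 in this mode */
    printf ("%g %g\n", plain, seeded);
    return 0;
  }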

Alexander


Re: [PATCH] tree-optimization/110979 - fold-left reduction and partial vectors

2023-08-11 Thread Alexander Monakov


On Fri, 11 Aug 2023, Richard Biener wrote:

> When we vectorize fold-left reductions with partial vectors but
> no target operation available we use a vector conditional to force
> excess elements to zero.  But that doesn't correctly preserve
> the sign of zero.  The following patch disables partial vector
> support in that case.
> 
> Bootstrap and regtest running on x86_64-unknown-linux-gnu.
> 
> Does this look OK?  With -frounding-math -fno-signed-zeros we are
> happily using the masking again, but that's OK, right?  An additional
> + 0.0 shouldn't do anything here.

I think it converts SNaN to QNaN (when the partial vector has just one
element which is SNaN), so is a test for -fsignaling-nans missing?

In the default -fno-rounding-math -fno-signaling-nans mode I think we
can do the reduction by substituting negative zero for masked-off
elements — maybe it's worth diagnosing that case separately (i.e.
as "not yet implemented", not an incorrect transform)?

(note that in avx512 it's possible to materialize negative zeroes
by mask with a single vpternlog instruction, which is cheap)

Alexander


Re: [PATCH] tree-optimization/110979 - fold-left reduction and partial vectors

2023-08-11 Thread Alexander Monakov


On Fri, 11 Aug 2023, Richard Biener wrote:

> > I think it converts SNaN to QNaN (when the partial vector has just one
> > element which is SNaN), so is a test for -fsignaling-nans missing?
> 
> Hm, I guess that's a corner case that could happen when there's no
> runtime profitability check on more than one element and when the
> element accumulated is directly loaded from memory.  OTOH the
> loop vectorizer always expects an initial value for the reduction
> and thus we perform either no add (when the loop isn't entered)
> or at least a single add (when it is).  So I think this particular
> situation cannot occur?

Yes, that makes sense, thanks for the elaboration.
(it's a bit subtle so maybe worth a comment? not sure)

> > In the default -fno-rounding-math -fno-signaling-nans mode I think we
> > can do the reduction by substituting negative zero for masked-off
> > elements — maybe it's worth diagnosing that case separately (i.e.
> > as "not yet implemented", not an incorrect transform)?
> 
> Ah, that's interesting.  So the only case we can't handle is
> -frounding-math -fsigned-zeros then.  I'll see to adjust the patch
> accordingly, like the following incremental patch:

Yeah, nice!

> > (note that in avx512 it's possible to materialize negative zeroes
> > by mask with a single vpternlog instruction, which is cheap)
> 
> It ends up loading the { -0.0, ... } constant from memory, the
> { 0.0, ... } mask is handled by using a zero-masked load, so
> indeed cheaper.

I was thinking it could be easily done without a memory load,
but got confused, sorry.

Alexander


Re: [RFC] GCC Security policy

2023-08-14 Thread Alexander Monakov


On Mon, 14 Aug 2023, Siddhesh Poyarekar wrote:

> 1. It makes it clear to users of the project the scope in which the project
> could be used and what safety it could reasonably expect from the project.  In
> the context of GCC for example, it cannot expect the compiler to do a safety
> check of untrusted sources; the compiler will consider #include "/etc/passwd"
> just as valid code as #include  and as a result, the onus is on the
> user environment to validate the input sources for safety.

Whoa, no. We shouldn't make such statements unless we are prepared to explain
to users how such validation can be practically implemented, which I'm sure
we cannot in this case, due to future extensions such as the #embed directive,
and ability to obfuscate filenames using the preprocessor.

I think it would be more honest to say that crafted sources can result in
arbitrary code execution with the privileges of the user invoking the compiler,
and hence the operator may want to ensure that no sensitive data is available
to that user (via measures ranging from plain UNIX permissions, to chroots,
to virtual machines, to air-gapped computers, depending on threat model).

Resource consumption is another good reason to sandbox compilers.

Alexander


Re: [RFC] GCC Security policy

2023-08-14 Thread Alexander Monakov


On Mon, 14 Aug 2023, Siddhesh Poyarekar wrote:

> There's no practical (programmatic) way to do such validation; it has to be a
> manual audit, which is why source code passed to the compiler has to be
> *trusted*.

No, I do not think that is a logical conclusion. What is the problem with
passing untrusted code to a sandboxed compiler?

> Right, that's what we're essentially trying to convey in the security policy
> text.  It doesn't go into mechanisms for securing execution (because that's
> really beyond the scope of the *project's* policy IMO) but it states
> unambiguously that input to the compiler must be trusted:
> 
> """
>   ... It is necessary that
> all source code inputs to the compiler are trusted, since it is
> impossible for the driver to validate input source code beyond
> conformance to a programming language standard...
> """

I see two issues with this. First, it reads as if people wishing to build
not-entirely-trusted sources need to seek some other compiler, as somehow
we seem to imply that sandboxing GCC is out of the question.

Second, I take issue with the last part of the quoted text (language
conformance): verifying standards conformance is also impossible
(consider UB that manifests only during linking or dynamic loading)
so GCC is only doing that on a best-effort basis with no guarantees.

Alexander


Re: [RFC] GCC Security policy

2023-08-15 Thread Alexander Monakov


On Tue, 15 Aug 2023, Siddhesh Poyarekar wrote:

> Does this as the first paragraph address your concerns:

Thanks, this is nicer (see notes below). My main concern is that we shouldn't
pretend there's some method of verifying that arbitrary source code is "safe"
to pass to an unsandboxed compiler, nor should we push the responsibility of
doing that on users.

> The compiler driver processes source code, invokes other programs such as the
> assembler and linker and generates the output result, which may be assembly
> code or machine code.  It is necessary that all source code inputs to the
> compiler are trusted, since it is impossible for the driver to validate input
> source code for safety.

The statement begins with "It is necessary", but the next statement offers
an alternative in case the code is untrusted. This is a contradiction.
Is it necessary or not in the end?

I'd suggest to drop this statement and instead make a brief note that
compiling crafted/untrusted sources can result in arbitrary code execution
and unconstrained resource consumption in the compiler.

> For untrusted code should compilation should be done
 ^^
 typo (spurious 'should')
 
> inside a sandboxed environment to ensure that it does not compromise the
> development environment.  Note that this still does not guarantee safety of
> the produced output programs and that such programs should still either be
> analyzed thoroughly for safety or run only inside a sandbox or an isolated
> system to avoid compromising the execution environment.

The last statement seems to be a new addition. It is too broad and again
makes a reference to analysis that appears quite theoretical. It might be
better to drop this (and instead talk in more specific terms about any
guarantees that produced binary code matches security properties intended
by the sources; I believe Richard Sandiford raised this previously).

Thanks.
Alexander


Re: [RFC] GCC Security policy

2023-08-15 Thread Alexander Monakov


On Tue, 15 Aug 2023, Siddhesh Poyarekar wrote:

> > Thanks, this is nicer (see notes below). My main concern is that we
> > shouldn't pretend there's some method of verifying that arbitrary source
> > code is "safe" to pass to an unsandboxed compiler, nor should we push
> > the responsibility of doing that on users.
> 
> But responsibility would be pushed to users, wouldn't it?

Making users responsible for verifying that sources are "safe" is not okay
(we cannot teach them how to do that since there's no general method).
Making users responsible for sandboxing the compiler is fine (there's
a range of sandboxing solutions, from which they can choose according
to their requirements and threat model). Sorry about the ambiguity.

> So:
> 
> The compiler driver processes source code, invokes other programs such as the
> assembler and linker and generates the output result, which may be assembly
> code or machine code.  Compiling untrusted sources can result in arbitrary
> code execution and unconstrained resource consumption in the compiler. As a
> result, compilation of such code should be done inside a sandboxed environment
> to ensure that it does not compromise the development environment.

I'm happy with this, thanks for bearing with me.

> >> inside a sandboxed environment to ensure that it does not compromise the
> >> development environment.  Note that this still does not guarantee safety of
> >> the produced output programs and that such programs should still either be
> >> analyzed thoroughly for safety or run only inside a sandbox or an isolated
> >> system to avoid compromising the execution environment.
> > 
> > The last statement seems to be a new addition. It is too broad and again
> > makes a reference to analysis that appears quite theoretical. It might be
> > better to drop this (and instead talk in more specific terms about any
> > guarantees that produced binary code matches security properties intended
> > by the sources; I believe Richard Sandiford raised this previously).
> 
> OK, so I actually cover this at the end of the section; Richard's point AFAICT
> was about hardening, which I added another note for to make it explicit that
> missed hardening does not constitute a CVE-worthy threat:

Thanks for the reminder. To illustrate what I was talking about, let me give
two examples:

1) safety w.r.t timing attacks: even if the source code is written in
a manner that looks timing-safe, it might be transformed in a way that
mounting a timing attack on the resulting machine code is possible;

2) safety w.r.t information leaks: even if the source code attempts
to discard sensitive data (such as passwords and keys) immediately
after use, (partial) copies of that data may be left on stack and
in registers, to be leaked later via a different vulnerability.

For both 1) and 2), GCC is not engineered to respect such properties
during optimization and code generation, so it's not appropriate for such
tasks (a possible solution is to isolate such sensitive functions to
separate files, compile to assembly, inspect the assembly to check that it
still has the required properties, and use the inspected asm in subsequent
builds instead of the original high-level source).
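
A concrete sketch of 2) (my example): the compiler is entitled to
delete the final memset below because 'key' is dead afterwards, so the
"data is wiped" property silently disappears from the machine code:

  #include <string.h>

  void handle_secret (void)
  {
    char key[32];
    /* ... derive and use the key ... */
    memset (key, 0, sizeof key);  /* dead store: may be removed by DSE,
                                     leaving key bytes in memory */
  }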

Cheers.
Alexander


Re: [RFC] GCC Security policy

2023-08-15 Thread Alexander Monakov


On Tue, 15 Aug 2023, David Edelsohn wrote:

> > Making users responsible for verifying that sources are "safe" is not okay
> > (we cannot teach them how to do that since there's no general method).
> > Making users responsible for sandboxing the compiler is fine (there's
> > a range of sandboxing solutions, from which they can choose according
> > to their requirements and threat model). Sorry about the ambiguity.
> >
> 
> Alex.
> 
> The compiler should faithfully implement the algorithms described by the
> programmer.  The compiler is responsible if it generates incorrect code for
> a well-defined, language-conforming program.  The compiler cannot be
> responsible for security issues inherent in the user code, whether that
> causes the compiler to function in a manner that deteriorates or adversely
> affects the system or generates code that behaves in a manner that
> adversely affects the system.
> 
> If "safe" is the wrong word. What word would you suggest?

I think "safe" is the right word here. We also used "trusted" in a similar
sense. I believe we were on the same page about that.

> > For both 1) and 2), GCC is not engineered to respect such properties
> > during optimization and code generation, so it's not appropriate for such
> > tasks (a possible solution is to isolate such sensitive functions to
> > separate files, compile to assembly, inspect the assembly to check that it
> > still has the required properties, and use the inspected asm in subsequent
> > builds instead of the original high-level source).
> >
> 
> At some point the system tools need to respect the programmer or operator.
> There is a difference between writing "Hello, World" and writing
> performance critical or safety critical code.  That is the responsibility
> of the programmer and the development team to choose the right software
> engineers and right tools.  And to have the development environment and
> checks in place to ensure that the results are meeting the requirements.
> 
> It is not the role of GCC or its security policy to tell people how to do
> their job or hobby.  This isn't a safety tag required to be attached to a
> new mattress.

Yes (though I'm afraid the analogy with the mattress is a bit lost on me).
Those examples were meant to illustrate the point I tried to make earlier,
not as additions proposed for the Security Policy. They are specific examples
where we can tell people in advance that compiler output needs to be
verified, because the compiler is not engineered to preserve those
security-relevant properties from the source code (and we would not
accept such accidents as security bugs).

Granted, it is a bit of a stretch since the notion of timing-safety is
not really well-defined for C source code, but I didn't come up with
better examples.

Alexander


Re: [RFC] GCC Security policy

2023-08-16 Thread Alexander Monakov


On Tue, 15 Aug 2023, Paul Koning wrote:

> Now I'm confused.  I thought the whole point of what GCC is trying to, and
> wants to document, is that it DOES preserve security properties.  If the
> source code is standards-compliant and contains algorithms free of security
> holes, then the compiler is supposed to deliver output code that is likewise
> free of holes -- in other words, the transformation performed by GCC does not
> introduce holes in a hole-free input.

Yes, we seem to broadly agree here. The text given by Siddhesh enumerates
scenarios where an incorrect transform could be considered a security bug.
My examples explore situations outside of those scenarios, picking two
popular security properties that cannot always be attained by writing
C source that vaguely appears to conform and expecting the compiler
to translate it into machine code that actually conforms.

> > Granted, it is a bit of a stretch since the notion of timing-safety is
> > not really well-defined for C source code, but I didn't come up with
> > better examples.
> 
> Is "timing-safety" a security property?  Not the way I understand that
> term.  It sounds like another way to say that the code meets real time
> constraints or requirements.

I meant in the sense of not admitting timing attacks:
https://en.wikipedia.org/wiki/Timing_attack

> No, compilers don't help with that (at least C doesn't -- Ada might be
> better here but I don't know enough).  For sufficiently strict
> requirements you'd have to examine both the generated machine code and
> understand, in gruesome detail, what the timing behaviors of the executing
> hardware are.  Good luck if it's a modern billion-transistor machine.

Yes. On the other hand, the reality in the FOSS ecosystem is that
cryptographic libraries heavily lean on the ability to express
a constant-time algorithm in C and get machine code that is actually
constant-time. There's a bit of a conflict here between what we
can promise and what people might expect of GCC, and it seems
relevant when discussing what goes into the Security Policy.

Thanks.
Alexander


Re: [RFC] GCC Security policy

2023-08-16 Thread Alexander Monakov


On Tue, 15 Aug 2023, David Malcolm via Gcc-patches wrote:

> I'd prefer to reword this, as libgccjit was a poor choice of name for
> the library (sorry!), to make it clearer it can be used for both ahead-
> of-time and just-in-time compilation, and that as used for compilation,
> the host considerations apply, not just those of the generated target
> code.
> 
> How about:
> 
>  The libgccjit library can, despite the name, be used both for
>  ahead-of-time compilation and for just-in-time compilation.  In both
>  cases it can be used to translate input representations (such as
>  source code) in the application context; in the latter case the
>  generated code is also run in the application context.
>  Limitations that apply to the compiler driver apply here too in
>  terms of sanitizing inputs, so it is recommended that inputs are

Unfortunately the lines that follow:

>  either sanitized by an external program to allow only trusted,
>  safe compilation and execution in the context of the application,

again make a reference to a purely theoretical "external program" that
is not going to exist in reality, and I made a fuss about that in another
subthread (sorry Siddhesh). We shouldn't speak as if this solution is
actually available to users.

I know this is not the main point of your email, but we came up with
a better wording for the compiler driver, and it would be good to align
this text with that.

Thanks.
Alexander


Re: [RFC] GCC Security policy

2023-08-16 Thread Alexander Monakov
> > Unfortunately the lines that follow:
> > 
> >>   either sanitized by an external program to allow only trusted,
> >>   safe compilation and execution in the context of the application,
> > 
> > again make a reference to a purely theoretical "external program" that
> > is not going to exist in reality, and I made a fuss about that in another
> > subthread (sorry Siddhesh). We shouldn't speak as if this solution is
> > actually available to users.
> > 
> > I know this is not the main point of your email, but we came up with
> > a better wording for the compiler driver, and it would be good to align
> > this text with that.
> 
> How about:
> 
> The libgccjit library can, despite the name, be used both for
> ahead-of-time compilation and for just-in-time compilation.  In both
> cases it can be used to translate input representations (such as
> source code) in the application context; in the latter case the
> generated code is also run in the application context.
> 
> Limitations that apply to the compiler driver apply here too in
> terms of sanitizing inputs and it is recommended that both the

I'd prefer 'trusting inputs' instead of 'sanitizing inputs' above.

> compilation *and* execution context of the code are appropriately
> sandboxed to contain the effects of any bugs in libgccjit, the
> application code using it, or its generated code to the sandboxed
> environment.

*thumbs up*

Thanks.
Alexander


Re: [RFC] GCC Security policy

2023-08-16 Thread Alexander Monakov


On Wed, 16 Aug 2023, Siddhesh Poyarekar wrote:

> No I understood the distinction you're trying to make, I just wanted to point
> out that the effect isn't all that different.  The intent of the wording is
> not to prescribe a solution, but to describe what the compiler cannot do and
> hence, users must find a way to do this.  I think we have a consensus on this
> part of the wording though because we're not really responsible for the
> prescription here and I'm happy with just asking users to sandbox.

Nice!

> I suppose it's kinda like saying "don't try this at home".  You know many will
> and some will break their leg while others will come out of it feeling
> invincible.  Our job is to let them know that they will likely break their leg
> :)

Continuing this analogy, I was protesting against doing our job by telling
users "when trying this at home, make sure to wear vibranium shielding"
while knowing for sure that nobody can, in fact, obtain said shielding,
making our statement not helpful and rather tautological.

> How about this in the last section titled "Security features implemented in
> GCC", since that's where we also deal with security hardening.
> 
> Similarly, GCC may transform code in a way that the correctness of
> the expressed algorithm is preserved but supplementary properties
> that are observable only outside the program or through a
> vulnerability in the program, may not be preserved.  This is not a
> security issue in GCC and in such cases, the vulnerability that
> caused exposure of the supplementary properties must be fixed.

Yeah, indicating scenarios that fall outside of intended guarantees should
be helpful. I feel the exact text quoted above will be hard to decipher
without knowing the discussion that led to it. Some sort of supplementary
section with examples might help there.

In any case, I hope further discussion, clarification and wordsmithing
goes productively for you both here on the list and during the Cauldron.

Thanks.
Alexander


Re: [RFC] GCC Security policy

2023-08-16 Thread Alexander Monakov


On Wed, 16 Aug 2023, Siddhesh Poyarekar wrote:

> > Yeah, indicating scenarios that fall outside of intended guarantees should
> > be helpful. I feel the exact text quoted above will be hard to decipher
> > without knowing the discussion that led to it. Some sort of supplementary
> > section with examples might help there.
> 
> Ah, so I had started out by listing examples but dropped them before emailing.
> How about:
> 
> Similarly, GCC may transform code in a way that the correctness of
> the expressed algorithm is preserved but supplementary properties
> that are observable only outside the program or through a
> vulnerability in the program, may not be preserved.  Examples
> of such supplementary properties could be the state of memory after
> it is no longer in use, performance and timing characteristics of a
> program, state of the CPU cache, etc. Such issues are not security
> vulnerabilities in GCC and in such cases, the vulnerability that
> caused exposure of the supplementary properties must be fixed.

I would say that as follows:

Similarly, GCC may transform code in a way that the correctness of
the expressed algorithm is preserved, but supplementary properties
that are not specifically expressible in a high-level language
are not preserved. Examples of such supplementary properties
include absence of sensitive data in the program's address space
after an attempt to wipe it, or data-independent timing of code.
When the source code attempts to express such properties, failure
to preserve them in resulting machine code is not a security issue
in GCC.

Alexander


Re: RISC-V: Added support for CRC.

2023-08-16 Thread Alexander Monakov


On Tue, 15 Aug 2023, Jeff Law wrote:

> Because if the compiler can optimize it automatically, then the projects have
> to do literally nothing to take advantage of it.  They just compile normally
> and their bitwise CRC gets optimized down to either a table lookup or a clmul
> variant.  That's the real goal here.

The only high-profile FOSS project that carries a bitwise CRC implementation
I'm aware of is the 'xz' compression library. There bitwise CRC is used for
populating the lookup table under './configure --enable-small':

https://github.com/tukaani-project/xz/blob/2b871f4dbffe3801d0da3f89806b5935f758d5f3/src/liblzma/check/crc64_small.c

It's a well-reasoned choice and your compiler would be undoing it
(reintroducing the table when the bitwise CRC is employed specifically
to avoid carrying the table).

> One final note.  Elsewhere in this thread you described performance concerns.
> Right now clmuls can be implemented in 4c, fully piped.

Pipelining doesn't matter in the implementation being proposed here, because
the builtin is expanded to

	li	a4,quotient
	li	a5,polynomial
	xor	a0,a1,a0
	clmul	a0,a0,a4
	srli	a0,a0,crc_size
	clmul	a0,a0,a5
	slli	a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size
	srli	a0,a0,GET_MODE_BITSIZE (word_mode) - crc_size

making CLMULs data-dependent, so the second can only be started one cycle
after the first finishes, and consecutive invocations of __builtin_crc
are likewise data-dependent (with three cycles between CLMUL). So even
when you get CLMUL down to 3c latency, you'll have two CLMULs and 10 cycles
per input block, while state of the art is one widening CLMUL per input block
(one CLMUL per 32-bit block on a 64-bit CPU) limited by throughput, not latency.

> I fully expect that latency to drop within the next 12-18 months.  In that
> world, there's not going to be much benefit to using hand-coded libraries vs
> just letting the compiler do it.

...

Alexander


Re: RISC-V: Added support for CRC.

2023-08-17 Thread Alexander Monakov


On Wed, 16 Aug 2023, Philipp Tomsich wrote:

> > > I fully expect that latency to drop within the next 12-18 months.  In that
> > > world, there's not going to be much benefit to using hand-coded libraries 
> > > vs
> > > just letting the compiler do it.
> 
> I would also hope that the hand-coded libraries would eventually have
> a code path for compilers that support the built-in.

You seem to be working with the false assumption that the interface of the
proposed builtin matches how high-performance CRC computation is structured.
It is not. State-of-the-art CRC keeps unreduced intermediate residual, split
over multiple temporaries to allow overlapping CLMULs in the CPU. The
intermediate residuals are reduced only once, when the final CRC value is
needed. In contrast, the proposed builtin has data dependencies between
all adjacent instructions, and cannot allow the CPU to work at IPC > 1.0.

Shame how little you apparently understand of the "mindbending math".

Alexander


Re: [Patch][v5] libgomp/nvptx: Prepare for reverse-offload callback handling

2022-10-11 Thread Alexander Monakov
On Tue, 11 Oct 2022, Jakub Jelinek wrote:

> So, does this mean one has to have gcc configured --with-arch=sm_70
> or later to make reverse offloading work (and then on the other
> side no support for older PTX arches at all)?
> If yes, I was kind of hoping we could arrange for it to be more
> user-friendly, build libgomp.a normally (sm_35 or what is the default),
> build the single TU in libgomp that needs the sm_70 stuff with -march=sm_70
> and arrange for mkoffload to link in the sm_70 stuff only if the user
> wants reverse offload (or has requires reverse_offload?).  In that case
> ignore sm_60 and older devices, if reverse offload isn't wanted, don't link
> in the part that needs sm_70 and make stuff working on sm_35 and later.
> Or perhaps have 2 versions of target.o, one sm_35 and one sm_70 and let
> mkoffload choose among them.

My understanding is such trickery should not be necessary with
the barrier-based approach, i.e. the sequence of PTX instructions

  st   % plain store
  membar.sys
  st.volatile

should be enough to guarantee that the former store is visible on the host
before the latter, and work all the way back to sm_20.

Alexander


Re: [Patch][v5] libgomp/nvptx: Prepare for reverse-offload callback handling

2022-10-19 Thread Alexander Monakov
On Wed, 12 Oct 2022, Tobias Burnus wrote:

> On 11.10.22 13:12, Alexander Monakov wrote:
> > My understanding is such trickery should not be necessary with
> > the barrier-based approach, i.e. the sequence of PTX instructions
> >
> >st   % plain store
> >membar.sys
> >st.volatile
> >
> > should be enough to guarantee that the former store is visible on the host
> > before the latter, and work all the way back to sm_20.
> 
> If I understand it correctly, you mean:
> 
>   GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
> 
>   __sync_synchronize ();  /* membar.sys */
>   asm volatile ("st.volatile.global.u64 [%0], %1;"
> : : "r"(addr_struct_fn), "r" (fn) : "memory");
> 
> 
> And then directly followed by the busy wait:
> 
>   while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
> ;  /* spin  */
> 
> which GCC expands to:
> 
>   /* ld.global.u64 %r64,[__gomp_rev_offload_var];
>  ld.u64 %r36,[%r64];
>  membar.sys;  */
> 
> The such updated patch is attached.

I think the topic for which I was Cc'ed (memory space and access method for
the synchronization variable) has been resolved nicely. I am not satisfied
with some other points raised in the conversation, I hope they are noted.

Alexander

> (This is the only change + removing the mkoffload.cc part is the only
> larger change. Otherwise, it only handles the minor comments by Jakub.
> The now removed CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT was used
> until commit r10-304-g1f4c5b9bb2eb81880e2bc725435d596fcd2bdfef i.e.
> it is a really old left over!)
> 
> Otherwise, tested* to work with sm_30 (error by mkoffload, unchanged),
> sm_35 and sm_70.
> 
> Tobias
> 
> *With some added code; until GOMP_OFFLOAD_get_num_devices accepts
> GOMP_REQUIRES_UNIFIED_SHARED_MEMORY and GOMP_OFFLOAD_load_image
> gets passed a non-NULL for rev_fn_table, the current patch is a no op.
> 
> Planned next is the related GCN patch – and the actual change
> in libgomp/target.c (+ accepting USM in GOMP_OFFLOAD_get_num_devices)


RE: [PATCH] [X86_64]: Enable support for next generation AMD Zen4 CPU

2022-10-24 Thread Alexander Monakov
> > > This grew insn-automata.cc from 201502 lines to 639968 lines and the
> > > build of the automata (genautomata) to several minutes in my dev tree.
> >
> > Yeah, in my unoptimized non-bootstrapped development tree genautomata
> > now takes over 12 minutes on a fast box, that is simply not acceptable.
> 
> Thank you for notifying us.
> 
> tejassanjay.jo...@amd.com has posted a patch for review to fix this (as per 
> Honza's comments).
> Ref: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604144.html

By the way, it appears pre-existing znver[123] models are also causing some kind
of combinatorial blow-up, but before znver4 it was not a blocking issue:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832

Alexander


Re: [PATCH] [X86_64]: Enable support for next generation AMD Zen4 CPU

2022-10-24 Thread Alexander Monakov
On Mon, 24 Oct 2022, Jan Hubička wrote:

> > By the way, it appears pre-existing znver[123] models are also causing
> > some kind
> > of combinatorial blow-up, but before znver4 it was not a blocking issue:
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87832
> 
> 
> It is really easy to make DFA size to grow if there are possibly many
> instructions in the pipeline (as every possible state of a modelled pipeline
> needs to be a new state of the automaton). This is essentially
> depth_of_pipeline * number_of_units with additional states to represent
> special instructions and this naturally keeps growing.
> 
> We could try to break the FP automata into multiple ones, but there are
> instructions that can go down any pipe which makes this hard
> or we can try toreduce number of different reservation types (possibly by
> breaking the automaton to znver1-3 and 4 or so).
> With znver2 model I experimented with broken up version and common one and
> ended up with smaller binary for combined one.

Looking at znver1.md again, I think the problem is caused by incorrect modeling
of division instructions: they have descriptions like

(define_insn_reservation "znver1_idiv_DI" 41
			 (and (eq_attr "cpu" "znver1,znver2")
			      (and (eq_attr "type" "idiv")
				   (and (eq_attr "mode" "DI")
					(eq_attr "memory" "none"))))
			 "znver1-double,znver1-ieu2*41")

which says that DImode idiv has latency 41 (which is correct) and that it
occupies the 2nd integer execution unit for 41 consecutive cycles, but that is
not correct:

1) the division instruction is partially pipelined, and has throughput 1/14

2) for the most part it occupies a separate division unit, not the general
arithmetic unit.

(incidentally, I think the blowup is caused by interaction of such super-long
41-cycle paths with the rest of reservations)

I think we should fix this by modeling the separate division unit properly, and
fixing reservations to use the measured reciprocal throughput of those
instructions (available from uops.info). The following patch does that for
integer divisions and completely eliminates the integer part of the problem; the
issue with floating-point divisions remains.

Top 5 znver table sizes, before:

68692 r znver1_ieu_check
68692 r znver1_ieu_transitions
99792 r znver1_ieu_min_issue_delay
428108 r znver1_fp_min_issue_delay
856216 r znver1_fp_transitions

After:

1454 r znver1_ieu_translate
1454 r znver1_translate
2304 r znver1_ieu_transitions
428108 r znver1_fp_min_issue_delay
856216 r znver1_fp_transitions

Will you help getting this reviewed for trunk?



diff --git a/gcc/config/i386/znver1.md b/gcc/config/i386/znver1.md
index 9c25b4e27..39b59343d 100644
--- a/gcc/config/i386/znver1.md
+++ b/gcc/config/i386/znver1.md
@@ -24,7 +24,7 @@
 ;; AMD znver1, znver2 and znver3 Scheduling
 ;; Modeling automatons for zen decoders, integer execution pipes,
 ;; AGU pipes and floating point execution units.
-(define_automaton "znver1, znver1_ieu, znver1_fp, znver1_agu")
+(define_automaton "znver1, znver1_ieu, znver1_fp, znver1_agu, znver1_idiv")
 
 ;; Decoders unit has 4 decoders and all of them can decode fast path
 ;; and vector type instructions.
@@ -50,6 +50,7 @@
 (define_cpu_unit "znver1-ieu1" "znver1_ieu")
 (define_cpu_unit "znver1-ieu2" "znver1_ieu")
 (define_cpu_unit "znver1-ieu3" "znver1_ieu")
+(define_cpu_unit "znver1-idiv" "znver1_idiv")
 (define_reservation "znver1-ieu" 
"znver1-ieu0|znver1-ieu1|znver1-ieu2|znver1-ieu3")
 
 ;; 2 AGU pipes in znver1 and 3 AGU pipes in znver2 and znver3
@@ -176,28 +177,28 @@
  (and (eq_attr "type" "idiv")
   (and (eq_attr "mode" "DI")
(eq_attr "memory" "none"
-"znver1-double,znver1-ieu2*41")
+"znver1-double,znver1-idiv*14")
 
 (define_insn_reservation "znver1_idiv_SI" 25
 (and (eq_attr "cpu" "znver1,znver2")
  (and (eq_attr "type" "idiv")
   (and (eq_attr "mode" "SI")
(eq_attr "memory" "none"
-"znver1-double,znver1-ieu2*25")
+"znver1-double,znver1-idiv*14")
 
 (define_insn_reservation "znver1_idiv_HI" 17
 (and (eq_attr "cpu" "znver1,znver2")
  (and (eq_attr "type" "idiv")
   (and (eq_attr "mode" "HI")
(eq_attr "memory" "none"
-"znver1-double,znver1-ieu2*17")
+"znver1-double,znver1-idiv*14")
 
 (define_insn_reservation "znver1_idiv_QI" 12
 (and (eq_attr "cpu" "znver1,znver2")
  (and (eq_attr "type" "idiv")
   

[PATCH] ipa-visibility: remove assert in TLS optimization [PR107353]

2022-10-26 Thread Alexander Monakov
When upgrading TLS access model based on optimized symbol visibility
status, we attempted to assert that recomputing the model would not
weaken it. It turns out that C, C++, and Fortran front-ends all can
(unintentionally) assign a stronger model than what can be derived
from the declaration.

Let's act conservatively instead of asserting, at least as long as
such pre-existing issues remain.

gcc/ChangeLog:

PR other/107353
* ipa-visibility.cc (function_and_variable_visibility):
Conditionally upgrade TLS model instead of asserting.
---
 gcc/ipa-visibility.cc | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/gcc/ipa-visibility.cc b/gcc/ipa-visibility.cc
index 3ed2b7cf6..238f7eb84 100644
--- a/gcc/ipa-visibility.cc
+++ b/gcc/ipa-visibility.cc
@@ -886,8 +886,12 @@ function_and_variable_visibility (bool whole_program)
  && vnode->ref_list.referring.length ())
{
  enum tls_model new_model = decl_default_tls_model (decl);
- gcc_checking_assert (new_model >= decl_tls_model (decl));
- set_decl_tls_model (decl, new_model);
+ STATIC_ASSERT (TLS_MODEL_GLOBAL_DYNAMIC < TLS_MODEL_LOCAL_DYNAMIC);
+ STATIC_ASSERT (TLS_MODEL_INITIAL_EXEC < TLS_MODEL_LOCAL_EXEC);
+ /* We'd prefer to assert that recomputed model is not weaker than
+    what the front-end assigned, but cannot: see PR 107353.  */
+ if (new_model >= decl_tls_model (decl))
+   set_decl_tls_model (decl, new_model);
}
}
 }
-- 
2.37.2



RE: [PATCH] [X86_64]: Enable support for next generation AMD Zen4 CPU

2022-10-26 Thread Alexander Monakov
On Wed, 26 Oct 2022, Kumar, Venkataramanan wrote:

> > Looking at znver1.md again, I think the problem is caused by incorrect
> > modeling of division instructions: they have descriptions like
> >
> > (define_insn_reservation "znver1_idiv_DI" 41
> > (and (eq_attr "cpu" "znver1,znver2")
> >  (and (eq_attr "type" "idiv")
> >   (and (eq_attr "mode" "DI")
> >(eq_attr "memory" "none"))))
> > "znver1-double,znver1-ieu2*41")
> >
> > which says that DImode idiv has latency 41 (which is correct) and that it
> > occupies the 2nd integer execution unit for 41 consecutive cycles, but that is
> > not correct:
> 
> Yes you are correct. It does not block the 2nd integer execution pipe 
> consecutively for 41 cycles.
> 
> >
> > 1) the division instruction is partially pipelined, and has throughput 1/14
> 
> "Div" unit takes one instruction and in the worst case the latency will be 41 
> cycles in znver1/2.
> But I agree that we can put best case latency of 14 cycles for the scheduler 
> model in znver1/2 .

It is not latency. It is reciprocal throughput. For example, the multiplication
instruction has latency 3 and reciprocal throughput 1, and the corresponding
execution unit can accept a new multiplication instruction each cycle. In the
.md file we are modeling that by saying that multiplication occupies some unit
for one cycle (but has latency 3).
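
For illustration, a fully pipelined multiply could be described along these
lines (a hypothetical sketch following znver1.md conventions, not an actual
excerpt from the model): the first number (3) is only the result latency,
while the reservation string ties up the unit for a single cycle, which is
what expresses the reciprocal throughput of 1.

(define_insn_reservation "sketch_imul_DI" 3
(and (eq_attr "cpu" "znver1")
 (and (eq_attr "type" "imul")
  (and (eq_attr "mode" "DI")
   (eq_attr "memory" "none"))))
"znver1-direct,znver1-ieu1")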

Alexander


Re: [PATCH] [X86_64]: Enable support for next generation AMD Zen4 CPU

2022-11-01 Thread Alexander Monakov

On Mon, 31 Oct 2022, Jan Hubička wrote:

> Hello,
> thanks for checking the performance.  The patch is OK.

Thanks, pushed the attached patch, and working on a corresponding change for
floating-point divisions.

Alexander

From 1962a8b22d3d3fb5b6bb5598295a4571daf8876f Mon Sep 17 00:00:00 2001
From: Alexander Monakov 
Date: Mon, 31 Oct 2022 17:35:57 +0300
Subject: [PATCH] i386: correct integer division modeling in znver.md

In znver.md, division instructions have descriptions like

(define_insn_reservation "znver1_idiv_DI" 41
(and (eq_attr "cpu" "znver1,znver2")
 (and (eq_attr "type" "idiv")
  (and (eq_attr "mode" "DI")
   (eq_attr "memory" "none"
"znver1-double,znver1-ieu2*41")

which says that DImode idiv has latency 41 (which is correct) and that
it occupies the 2nd integer execution unit for 41 consecutive cycles, but
that is not correct:

1) the division instruction is partially pipelined, and has throughput
   1/14, not 1/41;

2) for the most part it occupies a separate division unit, not the
   general arithmetic unit.

Evidently, interaction of such 41-cycle paths with the rest of
reservations causes a combinatorial explosion in the automaton.

Fix this by modeling the integer division unit properly, and correcting
reservations to use the measured reciprocal throughput of those
instructions (available from uops.info). A similar correction for
floating-point divisions is left for a followup patch.

Top 5 znver table sizes, before:

68692 r znver1_ieu_check
68692 r znver1_ieu_transitions
99792 r znver1_ieu_min_issue_delay
428108 r znver1_fp_min_issue_delay
856216 r znver1_fp_transitions

After:

1454 r znver1_ieu_translate
1454 r znver1_translate
2304 r znver1_ieu_transitions
428108 r znver1_fp_min_issue_delay
856216 r znver1_fp_transitions

gcc/ChangeLog:

PR target/87832
* config/i386/znver.md (znver1_idiv): New automaton.
(znver1-idiv): New unit.
(znver1_idiv_DI): Correct unit and cycles in the reservation.
(znver1_idiv_SI): Ditto.
(znver1_idiv_HI): Ditto.
(znver1_idiv_QI): Ditto.
(znver1_idiv_mem_DI): Ditto.
(znver1_idiv_mem_SI): Ditto.
(znver1_idiv_mem_HI): Ditto.
(znver1_idiv_mem_QI): Ditto.
(znver3_idiv_DI): Ditto.
(znver3_idiv_SI): Ditto.
(znver3_idiv_HI): Ditto.
(znver3_idiv_QI): Ditto.
(znver3_idiv_mem_DI): Ditto.
(znver3_idiv_mem_SI): Ditto.
(znver3_idiv_mem_HI): Ditto.
(znver3_idiv_mem_QI): Ditto.
---
 gcc/config/i386/znver.md | 39 +++++++++++++++++++++------------------
 1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/gcc/config/i386/znver.md b/gcc/config/i386/znver.md
index 9c25b4e27..4aa098fd8 100644
--- a/gcc/config/i386/znver.md
+++ b/gcc/config/i386/znver.md
@@ -23,8 +23,8 @@ (define_attr "znver1_decode" "direct,vector,double"
 
 ;; AMD znver1, znver2 and znver3 Scheduling
 ;; Modeling automatons for zen decoders, integer execution pipes,
-;; AGU pipes and floating point execution units.
-(define_automaton "znver1, znver1_ieu, znver1_fp, znver1_agu")
+;; SIMD/FP domain, AGU pipes, and dividers.
+(define_automaton "znver1, znver1_ieu, znver1_fp, znver1_agu, znver1_idiv")
 
 ;; Decoders unit has 4 decoders and all of them can decode fast path
 ;; and vector type instructions.
@@ -93,6 +93,9 @@ (define_reservation "znver2-fvector" "znver1-fp0+znver1-fp1
  +znver1-fp2+znver1-fp3
  +znver1-agu0+znver1-agu1+znver2-agu2")
 
+;; Dividers
+(define_cpu_unit "znver1-idiv" "znver1_idiv")
+
 ;; Call instruction
 (define_insn_reservation "znver1_call" 1
 (and (eq_attr "cpu" "znver1")
@@ -176,28 +179,28 @@ (define_insn_reservation "znver1_idiv_DI" 41
  (and (eq_attr "type" "idiv")
   (and (eq_attr "mode" "DI")
(eq_attr "memory" "none"
-"znver1-double,znver1-ieu2*41")
+"znver1-double,znver1-idiv*14")
 
 (define_insn_reservation "znver1_idiv_SI" 25
 (and (eq_attr "cpu" "znver1,znver2")
  (and (eq_attr "type" "idiv")
   (and (eq_attr "mode" "SI")
(eq_attr "memory" "none"
-"znver1-double,znver1-ieu2*25"

[PATCH 0/2] i386: slim down insn-automata [PR 87832]

2022-11-01 Thread Alexander Monakov
Hi,

I'm sending followup fixes for combinatorial explosion of znver scheduling
automaton tables as described in the earlier thread:

https://inbox.sourceware.org/gcc-patches/23c795d6-403c-5927-e610-f0f1215f5...@ispras.ru/T/#m36e069d43d07d768d4842a779e26b4a0915cc543

I think lujiazui.md and b[dt]ver[123].md have similar issues.

Alexander Monakov (2):
  i386: correct x87&SSE division modeling in znver.md
  i386: correct x87&SSE multiplication modeling in znver.md

 gcc/config/i386/znver.md | 67 ++++++++++++++++++++++++++++++++++---------------------------------
 1 file changed, 34 insertions(+), 33 deletions(-)

-- 
2.37.2



[PATCH 1/2] i386: correct x87&SSE division modeling in znver.md

2022-11-01 Thread Alexander Monakov
Correct modeling of division instructions in the SIMD/FP domain for
AMD Zen architectures and avoid combinatorial explosion of automaton
tables by modeling the separate floating-point division unit and
correcting reservations to reflect reciprocal throughput of the
corresponding instructions, similar to earlier commit
5cee5f94000 ("i386: correct integer division modeling in znver.md").

Division is partially pipelined and some instructions have fractional
throughput (e.g. Zen 3 can issue divss and divsd once every 3.5 and 4.5
cycles on average, respectively). Considering these CPUs implement
out-of-order execution, the model doesn't need to be exact to the last
cycle, so simplify it by using 4/5 cycles for SF/DF modes, and not
modeling the fact that FP3 pipe is occupied for one cycle.

Top znver table sizes in insn-automata.o:

Before:

428108 r znver1_fp_min_issue_delay
856216 r znver1_fp_transitions

After:

30056 r znver1_fp_min_issue_delay
120224 r znver1_fp_transitions

gcc/ChangeLog:

PR target/87832
* config/i386/znver.md (znver1_fdiv): New automaton.
(znver1-fdiv): New unit.
(znver1_fp_op_div): Correct unit and cycles in the reservation.
(znver1_fp_op_div_load): Ditto.
(znver1_fp_op_idiv_load): Ditto.
(znver2_fp_op_idiv_load): Ditto.
(znver1_ssediv_ss_ps): Ditto.
(znver1_ssediv_ss_ps_load): Ditto.
(znver1_ssediv_sd_pd): Ditto.
(znver1_ssediv_sd_pd_load): Ditto.
(znver1_ssediv_avx256_ps): Ditto.
(znver1_ssediv_avx256_ps_load): Ditto.
(znver1_ssediv_avx256_pd): Ditto.
(znver1_ssediv_avx256_pd_load): Ditto.
---
 gcc/config/i386/znver.md | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/gcc/config/i386/znver.md b/gcc/config/i386/znver.md
index 4aa098fd8..c52f8b532 100644
--- a/gcc/config/i386/znver.md
+++ b/gcc/config/i386/znver.md
@@ -24,7 +24,7 @@ (define_attr "znver1_decode" "direct,vector,double"
 ;; AMD znver1, znver2 and znver3 Scheduling
 ;; Modeling automatons for zen decoders, integer execution pipes,
 ;; SIMD/FP domain, AGU pipes, and dividers.
-(define_automaton "znver1, znver1_ieu, znver1_fp, znver1_agu, znver1_idiv")
+(define_automaton "znver1, znver1_ieu, znver1_fp, znver1_agu, znver1_idiv, znver1_fdiv")
 
 ;; Decoders unit has 4 decoders and all of them can decode fast path
 ;; and vector type instructions.
@@ -95,6 +95,7 @@ (define_reservation "znver2-fvector" "znver1-fp0+znver1-fp1
 
 ;; Dividers
 (define_cpu_unit "znver1-idiv" "znver1_idiv")
+(define_cpu_unit "znver1-fdiv" "znver1_fdiv")
 
 ;; Call instruction
 (define_insn_reservation "znver1_call" 1
@@ -591,27 +592,27 @@ (define_insn_reservation "znver1_fp_op_div" 15
 (and (eq_attr "cpu" "znver1,znver2,znver3")
  (and (eq_attr "type" "fdiv")
   (eq_attr "memory" "none")))
-"znver1-direct,znver1-fp3*15")
+"znver1-direct,znver1-fdiv*6")
 
 (define_insn_reservation "znver1_fp_op_div_load" 22
 (and (eq_attr "cpu" "znver1,znver2,znver3")
  (and (eq_attr "type" "fdiv")
   (eq_attr "memory" "load")))
-"znver1-direct,znver1-load,znver1-fp3*15")
+"znver1-direct,znver1-load,znver1-fdiv*6")
 
 (define_insn_reservation "znver1_fp_op_idiv_load" 27
 (and (eq_attr "cpu" "znver1")
  (and (eq_attr "type" "fdiv")
   (and (eq_attr "fp_int_src" "true")
(eq_attr "memory" "load"
-"znver1-double,znver1-load,znver1-fp3*19")
+"znver1-double,znver1-load,znver1-fdiv*6")
 
 (define_insn_reservation "znver2_fp_op_idiv_load" 26
 (and (eq_attr "cpu" "znver2,znver3")
  (and (eq_attr "type" "fdiv")
   (and (eq_attr "fp_int_src" "true")
(eq_attr "memory" "load"
-"znver1-double,znver1-load,znver1-fp3*19")
+"znver1-double,znver1-load,znver1-fdiv*6")
 
 
 ;; MMX, SSE, SSEn.n, AVX, AVX2 instructions
@@ -1088,7 +1089,7 @@ (define_insn_reservation "znver1_ssediv_ss_ps" 10
  (eq_attr "mode" "V8SF,V4SF,SF")))
  (and (eq_attr "type" "ssediv")
   (eq_attr "memory" "none")))
-"znver1-direct,znver1-fp3*10")
+"znver1-direct,znver1-fdiv*4")
 
 (define_insn_reservation "znver1_ssediv_ss_ps_load" 17
 (and (ior (and (eq_attr "cpu" "znver1")
@@ -1099,7 +1100,7 @@ (define_insn_reservation "znver1_ssediv_ss_ps_load" 17
  

[PATCH 2/2] i386: correct x87&SSE multiplication modeling in znver.md

2022-11-01 Thread Alexander Monakov
All multiplication instructions are fully pipelined, except AVX256
instructions on Zen 1, which issue over two cycles on a 128-bit unit.
Correct the model accordingly to reduce combinatorial explosion in
automaton tables.

Top znver table sizes in insn-automata.o:

Before:

30056 r znver1_fp_min_issue_delay
120224 r znver1_fp_transitions

After:

6720 r znver1_fp_min_issue_delay
53760 r znver1_fp_transitions

gcc/ChangeLog:

PR target/87832
* config/i386/znver.md: (znver1_fp_op_mul): Correct cycles in
the reservation.
(znver1_fp_op_mul_load): Ditto.
(znver1_mmx_mul): Ditto.
(znver1_mmx_load): Ditto.
(znver1_ssemul_ss_ps): Ditto.
(znver1_ssemul_ss_ps_load): Ditto.
(znver1_ssemul_avx256_ps): Ditto.
(znver1_ssemul_avx256_ps_load): Ditto.
(znver1_ssemul_sd_pd): Ditto.
(znver1_ssemul_sd_pd_load): Ditto.
(znver2_ssemul_sd_pd): Ditto.
(znver2_ssemul_sd_pd_load): Ditto.
(znver1_ssemul_avx256_pd): Ditto.
(znver1_ssemul_avx256_pd_load): Ditto.
(znver1_sseimul): Ditto.
(znver1_sseimul_avx256): Ditto.
(znver1_sseimul_load): Ditto.
(znver1_sseimul_avx256_load): Ditto.
(znver1_sseimul_di): Ditto.
(znver1_sseimul_load_di): Ditto.
---
 gcc/config/i386/znver.md | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/gcc/config/i386/znver.md b/gcc/config/i386/znver.md
index c52f8b532..882f250f1 100644
--- a/gcc/config/i386/znver.md
+++ b/gcc/config/i386/znver.md
@@ -573,13 +573,13 @@ (define_insn_reservation "znver1_fp_op_mul" 5
 (and (eq_attr "cpu" "znver1,znver2,znver3")
  (and (eq_attr "type" "fop,fmul")
   (eq_attr "memory" "none")))
-"znver1-direct,znver1-fp0*5")
+"znver1-direct,znver1-fp0")
 
 (define_insn_reservation "znver1_fp_op_mul_load" 12 
 (and (eq_attr "cpu" "znver1,znver2,znver3")
  (and (eq_attr "type" "fop,fmul")
   (eq_attr "memory" "load")))
-"znver1-direct,znver1-load,znver1-fp0*5")
+"znver1-direct,znver1-load,znver1-fp0")
 
 (define_insn_reservation "znver1_fp_op_imul_load" 16
 (and (eq_attr "cpu" "znver1,znver2,znver3")
@@ -684,13 +684,13 @@ (define_insn_reservation "znver1_mmx_mul" 3
 (and (eq_attr "cpu" "znver1,znver2,znver3")
  (and (eq_attr "type" "mmxmul")
   (eq_attr "memory" "none")))
- "znver1-direct,znver1-fp0*3")
+ "znver1-direct,znver1-fp0")
 
 (define_insn_reservation "znver1_mmx_load" 10
 (and (eq_attr "cpu" "znver1,znver2,znver3")
  (and (eq_attr "type" "mmxmul")
   (eq_attr "memory" "load")))
-"znver1-direct,znver1-load,znver1-fp0*3")
+"znver1-direct,znver1-load,znver1-fp0")
 
 ;; TODO
 (define_insn_reservation "znver1_avx256_log" 1
@@ -1161,7 +1161,7 @@ (define_insn_reservation "znver1_ssemul_ss_ps" 3
  (eq_attr "mode" 
"V8SF,V4SF,SF,V4DF,V2DF,DF")))
  (and (eq_attr "type" "ssemul")
   (eq_attr "memory" "none")))
-"znver1-direct,(znver1-fp0|znver1-fp1)*3")
+"znver1-direct,znver1-fp0|znver1-fp1")
 
 (define_insn_reservation "znver1_ssemul_ss_ps_load" 10 
 (and (ior (and (eq_attr "cpu" "znver1")
@@ -1172,47 +1172,47 @@ (define_insn_reservation "znver1_ssemul_ss_ps_load" 10
  (eq_attr "mode" "V8SF,V4SF,SF")))
  (and (eq_attr "type" "ssemul")
   (eq_attr "memory" "load")))
-"znver1-direct,znver1-load,(znver1-fp0|znver1-fp1)*3")
+"znver1-direct,znver1-load,znver1-fp0|znver1-fp1")
 
 (define_insn_reservation "znver1_ssemul_avx256_ps" 3
 (and (eq_attr "cpu" "znver1")
  (and (eq_attr "mode" "V8SF")
   (and (eq_attr "type" "ssemul")
(eq_attr "memory" "none"
-"znver1-double,(znver1-fp0|znver1-fp1)*3")
+"znver1-double,znver1-fp0*2|znver1-fp1*2")
 
 (define_insn_reservation "znver1_ssemul_avx256_ps_load" 10
 (and (eq_attr "cpu" "znver1")
  (and (eq_attr "mode" "V8SF")
   (and (eq_attr "type" "ssemul")
(eq_attr "mem

Re: [PATCH] ipa-visibility: Optimize TLS access [PR99619]

2022-05-02 Thread Alexander Monakov
Hi,

I'd like to ping this patch on behalf of Artem.

Alexander

On Sun, 17 Apr 2022, Artem Klimov wrote:

> Fix issue PR99619, which asks to optimize TLS access based on
> visibility. The fix is implemented as an IPA optimization, which allows
> to take optimized visibility status into account (as well as avoid
> modifying all language frontends).
> 
> 2022-04-17  Artem Klimov  
> 
> gcc/ChangeLog:
>   PR middle-end/99619
>   * ipa-visibility.cc (function_and_variable_visibility): Add an
>   explicit TLS model update after visibility optimisation loops.
> 
> gcc/testsuite/ChangeLog:
>   PR middle-end/99619
>   * gcc.dg/tls/vis-attr-gd.c: New test.
>   * gcc.dg/tls/vis-attr-hidden-gd.c: New test.
>   * gcc.dg/tls/vis-attr-hidden.c: New test.
>   * gcc.dg/tls/vis-flag-hidden-gd.c: New test.
>   * gcc.dg/tls/vis-flag-hidden.c: New test.
>   * gcc.dg/tls/vis-pragma-hidden-gd.c: New test.
>   * gcc.dg/tls/vis-pragma-hidden.c: New test.
> 
> Co-Authored-By:  Alexander Monakov  
> Signed-off-by: Artem Klimov 
> ---
>  gcc/ipa-visibility.cc   | 16 
>  gcc/testsuite/gcc.dg/tls/vis-attr-gd.c  | 10 ++
>  gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c   | 11 +++
>  gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c  | 10 ++
>  gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c   | 11 +++
>  gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c  | 10 ++
>  gcc/testsuite/gcc.dg/tls/vis-pragma-hidden-gd.c | 15 +++
>  gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c| 14 ++
>  8 files changed, 97 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-gd.c
>  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c
>  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c
>  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden-gd.c
>  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-flag-hidden.c
>  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden-gd.c
>  create mode 100644 gcc/testsuite/gcc.dg/tls/vis-pragma-hidden.c
> 
> diff --git a/gcc/ipa-visibility.cc b/gcc/ipa-visibility.cc
> index e95a0dd252f..ca5b9a95f5e 100644
> --- a/gcc/ipa-visibility.cc
> +++ b/gcc/ipa-visibility.cc
> @@ -872,6 +872,22 @@ function_and_variable_visibility (bool whole_program)
>   }
>   }
>  }
> +  FOR_EACH_VARIABLE (vnode)
> +{
> +  tree decl = vnode->decl;
> +  
> +  /* Optimize TLS model based on visibility (taking into account
> + optimizations done in the preceding loop), unless it was
> + specified explicitly.  */
> +  
> +  if (DECL_THREAD_LOCAL_P (decl)
> +  && !lookup_attribute ("tls_model", DECL_ATTRIBUTES (decl)))
> +{
> +  enum tls_model new_model = decl_default_tls_model (decl);
> +  gcc_checking_assert (new_model >= decl_tls_model (decl));
> +  set_decl_tls_model (decl, new_model);
> +}
> +}
>  
>if (dump_file)
>  {
> diff --git a/gcc/testsuite/gcc.dg/tls/vis-attr-gd.c 
> b/gcc/testsuite/gcc.dg/tls/vis-attr-gd.c
> new file mode 100644
> index 000..473c7846f74
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tls/vis-attr-gd.c
> @@ -0,0 +1,10 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target fpic } */
> +/* { dg-require-effective-target tls } */
> +/* { dg-options "-O2 -fPIC -fdump-ipa-visibility" } */
> +
> +// tls_model should be global-dynamic due to explicitly specified attribute
> +__attribute__((tls_model("global-dynamic")))
> +__thread int x;
> +
> +/* { dg-final { scan-ipa-dump "Varpool flags: tls-global-dynamic" 
> "visibility" } } */
> diff --git a/gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c 
> b/gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c
> new file mode 100644
> index 000..8f592052361
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tls/vis-attr-hidden-gd.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target fpic } */
> +/* { dg-require-effective-target tls } */
> +/* { dg-options "-O2 -fPIC -fdump-ipa-visibility" } */
> +
> +// tls_model should be global-dynamic due to explicitly specified attribute
> +__attribute__((visibility("hidden")))
> +__attribute__((tls_model("global-dynamic")))
> +__thread int x;
> +
> +/* { dg-final { scan-ipa-dump "Varpool flags: tls-global-dynamic" 
> "visibility" } } */
> diff --git a/gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c 
> b/gcc/testsuite/gcc.dg/tls/vis-attr-hidden.c
> new file mode 100644

Re: [PATCH] ipa-visibility: Optimize TLS access [PR99619]

2022-05-02 Thread Alexander Monakov
On Mon, 2 May 2022, Martin Jambor wrote:

> > Co-Authored-By:  Alexander Monakov  
> 
> (Minor nit and I don't care too much, but in GCC we traditionally
> specify co-authors in the ChangeLog entry beginning by providing more
> names, one per line.  But perhaps we want to adapt more widely used
> practices.)

I believe this is the recommended way to specify co-authors now (after
Git migration) according to https://gcc.gnu.org/codingconventions.html

(patch discussion below)

> > --- a/gcc/ipa-visibility.cc
> > +++ b/gcc/ipa-visibility.cc
> > @@ -872,6 +872,22 @@ function_and_variable_visibility (bool whole_program)
> > }
> > }
> >  }
> > +  FOR_EACH_VARIABLE (vnode)
> > +{
> > +  tree decl = vnode->decl;
> > +  
> > +  /* Optimize TLS model based on visibility (taking into account
> > + optimizations done in the preceding loop), unless it was
> > + specified explicitly.  */
> > +  
> > +  if (DECL_THREAD_LOCAL_P (decl)
> > +  && !lookup_attribute ("tls_model", DECL_ATTRIBUTES (decl)))
> > +{
> > +  enum tls_model new_model = decl_default_tls_model (decl);
> > +  gcc_checking_assert (new_model >= decl_tls_model (decl));
> > +  set_decl_tls_model (decl, new_model);
> > +}
> > +}
> >  
> 
> decl_default_tls_model depends on the global optimize flag, which is
> almost always problematic in IPA passes.  I was able to make your patch
> ICE using the vis-attr-hidden.c testcase from your patch with:
> 
>   mjambor@virgil:~/gcc/small/tests/tls$ ~/gcc/small/inst/bin/gcc -O2 -fPIC 
> -flto -c vis-attr-hidden.c
>   mjambor@virgil:~/gcc/small/tests/tls$ ~/gcc/small/inst/bin/gcc -fPIC -O0 
> -shared -flto vis-attr-hidden.o
>   during IPA pass: whole-program
>   lto1: internal compiler error: in function_and_variable_visibility, at 
> ipa-visibility.cc:888
[snip]
> Note the use of LTO, mismatching -O flags and the -shared flag in the
> link step.

Ah, right. The assert is checking that we don't accidentally downgrade decl's
TLS access model, e.g. from local-dynamic to global-dynamic, and you've shown
how to trigger that. I didn't realize this code can run twice, and with
different 'optimize' levels.

I would suggest to solve this by checking if the new TLS model is stronger,
i.e. instead of this:

  gcc_checking_assert (new_model >= decl_tls_model (decl));
  set_decl_tls_model (decl, new_model);

do this:

  if (new_model >= decl_tls_model (decl))
set_decl_tls_model (decl, new_model);

Does this look reasonable?

> A simple but somewhat lame way to avoid the ICE would be to run your
> loop over variables only from pass_ipa_function_and_variable_visibility
> and not from pass_ipa_whole_program_visibility.
> 
> I am afraid a real solution would involve copying relevant entries from
> global_options to the symtab node representing the variable when it is
> created/finalized, properly streaming them for LTO, and modifying
> decl_default_tls_model to rely on those rather than global_options
> itself.

If we agree on the solution above, then this will not be necessary, after all
this transformation looks at optimized whole-program visibility status,
and so initial optimization level should not be relevant.

> But maybe Honza has some other clever idea.
> 
> Also, please be careful not to unnecessarily commit trailing blank
> spaces, the empty lines in your patch are not really empty.

Yep, I can take care of whitespace issues.

Thank you!
Alexander


Re: [PATCH] lto-plugin: add support for feature detection

2022-05-04 Thread Alexander Monakov
On Wed, 4 May 2022, Martin Liška wrote:

> The patch is a follow-up of the discussion we've got in:
> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/593901.html
> 
> Mold linker would appreciate knowing in advance if get_symbols_v3 is supported
> by a GCC plug-in or not.
> 
> Ready to be installed?

Quick note: if the linker is supposed to check for presence of this symbol
via dlsym(), I expect it won't work as-is since lto-plugin.map hides every
symbol except 'onload'.
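
For reference, a detection along these lines (a hypothetical sketch; the
plugin path and messages are illustrative, and on older glibc it needs
-ldl at link time) would come back empty as long as the version script
keeps the symbol local:

#include <dlfcn.h>
#include <stdio.h>

int main (void)
{
  /* Try to resolve the new marker symbol from the plugin.  */
  void *h = dlopen ("liblto_plugin.so", RTLD_NOW);
  if (h && dlsym (h, "supports_get_symbols_v3"))
    puts ("plugin advertises get_symbols_v3");
  else
    puts ("symbol not visible (or plugin not found)");
  return 0;
}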

(also it might be nicer to reword the comment to say 'Presence of the following
symbols is used for ...', because you're leaving the value as 'false').

Alexander

> Thanks,
> Martin
> 
> lto-plugin/ChangeLog:
> 
>   * lto-plugin.c (supports_get_symbols_v3): Add symbol.
> ---
>  lto-plugin/lto-plugin.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/lto-plugin/lto-plugin.c b/lto-plugin/lto-plugin.c
> index 47378435612..049f3841d5b 100644
> --- a/lto-plugin/lto-plugin.c
> +++ b/lto-plugin/lto-plugin.c
> @@ -1554,3 +1554,8 @@ onload (struct ld_plugin_tv *tv)
>  
>return LDPS_OK;
>  }
> +
> +/* The following symbols are used for dynamic detection of plug-in features
> +   from linker side.  */
> +
> +bool supports_get_symbols_v3;
> 


Re: [PATCH] lto-plugin: add support for feature detection

2022-05-04 Thread Alexander Monakov
On Wed, 4 May 2022, Martin Liška wrote:

> On 5/4/22 14:32, Alexander Monakov wrote:
> > On Wed, 4 May 2022, Martin Liška wrote:
> > 
> >> The patch is a follow-up of the discussion we've got in:
> >> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/593901.html
> >>
> >> Mold linker would appreciate knowing in advance if get_symbols_v3 is 
> >> supported
> >> by a GCC plug-in or not.

Out of curiosity, I looked at mold, and I don't understand what problem this
detection is solving, nor why this is the best way to solve that. Was there
some discussion with mold author I should check out?

Note that mold takes this not only as 'v3 API is supported', but, more
importantly, as 'v2 entrypoint will not be called'.

Alexander


Re: [PATCH] lto-plugin: add support for feature detection

2022-05-05 Thread Alexander Monakov
On Thu, 5 May 2022, Richard Biener via Gcc-patches wrote:

> > I think they should simply try to not register LDPT_GET_SYMBOLS or
> > LDPT_GET_SYMBOLS_V2 with the plugin in the onload hook and if
> > that fails they will know the plugin doesn't support V3 only.  I suppose
> > it should work to call onload() multiple times (when only increasing the
> > set of supported functions) until it returns LDPS_OK without intermediately
> dlclosing it (maybe call cleanup_handler in between).  This should work
> > for old plugin versions.
> >
> > That said, a better API extension compared to adding some random
> > symbol like you propose is to enhance the return value from onload(),
> > maybe returning an alternate transfer vector specifying symbol entries
> > that will not be used (or return a transfer vector that will be used).
> > We've been mostly versioning the symbol related hooks here.
> >
> > That said, I do not like at all this proposed add of a special symbol
> > to flag exclusive v3 use.  That's a hack and not extensible at all.
> 
> Speaking of which, onload_v2 would be in order and should possibly
> return some instantiation handle of the plugin that the linker could
> instruct the plugin to dispose (reset ()?).  I see the GCC implementation
> of the plugin just has a single global state and it doesn't seem that it's
> prepared for multiple onload() calls (but it might work by accident if
> you never remove things from the support vector but only add).
> 
> Without revamping the whole API onload_v2 could set the current
> global state for the plugin based on the transfer vector and the reset()
> API would discard the state (might also be redundant and implicitely
> performed by the next onload_v2 call).
> 
> onload_v2 could then also have an "output" transfer vector where the
> plugin simply copies the entries it picked and dropped those it will
> never call.  We'd document the plugin may only pick _one_ of the versioned
> API variants.
> 
> That said, the heuristic outlined above might still work with the present
> onload() API and existing implementations.

Feels a bit weird to ask, but before entertaining such an API extension,
can we step back and understand the v3 variant of get_symbols? It is not
documented, and from what little I saw I did not get the "motivation" for
its existence (what it is doing that couldn't be done with the v2 api).

To me lack of documentation looks like a serious issue :/

Likewise, I don't really understand why mold cannot be flexible and
efficiently service both the v2 and v3 variants without committing 
to one of those in advance.

Alexander


Re: [PATCH] ipa-visibility: Optimize TLS access [PR99619]

2022-05-05 Thread Alexander Monakov
On Thu, 5 May 2022, Jan Hubicka wrote:

> Also note that visibility pass is run twice (once at compile time before
> early optimizations and then again at LTO). Since LTO linking may
> promote public symbols to local/hidden, perhaps we want to do this only
> second time the pass is executed?

The first pass appears to be redundant and could be avoided, yes.

> What is the reson we avoid using LOCAL_DYNAMIC with !optimize while we
> are happy to use LOCAL_EXEC with !optimize !flag_shlib?

It follows from how local-dynamic model is defined: we call __tls_get_addr
with an argument that identifies the current DSO (not the individual
thread-local variable), and then compute the address of the variable with
a simple addition, so when there are two or more TLS variables, we can
call __tls_get_addr just once (but at -O0 we will end up with redundant
calls).
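
A minimal illustration (hypothetical file and names): built with -O2 -fPIC
into a shared object, both accesses below can share a single __tls_get_addr
call computing the module's TLS base, while at -O0 each access would emit
its own call, which is why local-dynamic is avoided there.

/* Both variables live in this module's TLS block.  */
static __thread int a;
static __thread int b;

int
sum (void)
{
  return a + b;  /* one __tls_get_addr call at -O2, two at -O0 */
}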

There's no such concern for local-exec vs initial-exec promotion.

Alexander


Re: [PATCH] ipa-visibility: Optimize TLS access [PR99619]

2022-05-05 Thread Alexander Monakov
On Thu, 5 May 2022, Jan Hubicka wrote:

> > It follows from how local-dynamic model is defined: we call __tls_get_addr
> > with an argument that identifies the current DSO (not the individual
> > thread-local variable), and then compute the address of the variable with
> > a simple addition, so when there are two or more TLS variables, we can
> > call __tls_get_addr just once (but at -O0 we will end up with redundant
> > calls).
> 
> Thanks for explanation.
> So this is something that really depends on optimization flags of the
> function referring the variable rather than on optimization flags of the
> variable itself and only makes difference if there is -O0 function that
> contains more than one reference to a TLS var?

Well, for an -O0 function it doesn't matter how many different TLS variables
it is referencing. The interesting case is an -O2 function referencing one
local-dynamic TLS variable.

> I guess then a correct answer would be to search for such references.

Presumably at RTL generation time, i.e. let the middle end discover the
most specific TLS access model, and then selectively downgrade local-dynamic
to global-dynamic when lowering an -O0 function.

> What happens when there are multiple object files with a hidden TLS var
> where some gts LOCAL_DYNAMIC and others GLOBAL_DYNAMIC? (Which may
> happen when linking together object files compiled with different
> versions of compiler if we go ahead with this patch on hidden symbols).

They have different relocations, so there's an increase in number of GOT
entries, but otherwise I don't think there's any problem.

Alexander


Re: [PATCH] lto-plugin: add support for feature detection

2022-05-06 Thread Alexander Monakov



On Thu, 5 May 2022, Martin Liška wrote:

> On 5/5/22 12:52, Alexander Monakov wrote:
> > Feels a bit weird to ask, but before entertaining such an API extension,
> > can we step back and understand the v3 variant of get_symbols? It is not
> > documented, and from what little I saw I did not get the "motivation" for
> > its existence (what it is doing that couldn't be done with the v2 api).
> 
> Please see here:
> https://github.com/rui314/mold/issues/181#issuecomment-1037927757

Thanks. I've also re-read [1] and [2] which provided some relevant ideas.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86490
[2] https://sourceware.org/bugzilla/show_bug.cgi?id=23411


OK, so the crux of the issue is that sometimes the linker needs to feed the
compiler plugin with LTO .o files extracted from static archives. This is
not really obvious, because normally .a archives have an index that enumerates
symbols defined/used by its .o files, and even during LTO the linker can simply
consult the index to find out which members to extract.  In theory, at least.

The theory breaks in the following cases:

 - ld.bfd and common symbols (I wonder if weak/comdat code is also affected?):
 archive index does not indicate which definitions are common, so ld.bfd
 extracts the member and feeds it to the plugin to find out;

 - ld.gold and emulated archives via --start-lib a.o b.o ... --end-lib: here
 there's no index to consult and ld.gold feeds each .o to the plugin.

In those cases it may happen that the linker extracts an .o file that would
not be extracted during non-LTO link, and if that happens, the linker needs to
inform the plugin.  This is not the same as marking each symbol from a
spuriously extracted .o file as PREEMPTED: when the .o file has constructors,
the plugin will assume the constructors are kept while the linker needs to
discard them.

So get_symbols_v3 allows the linker to discard an LTO .o file to solve this.

In the absence of get_symbols_v3, mold tries to ensure correctness by restarting
itself while appending a list of .o files to be discarded to its command line.

I wonder if mold can invoke plugin cleanup callback to solve this without
restarting.

(also, hm, it seems to confirm my idea that LTO .o files should have had the
correct symbol table so normal linker algorithms would work)

Hopefully this was useful.
Alexander


Re: [PATCH] ipa-visibility: Optimize TLS access [PR99619]

2022-05-09 Thread Alexander Monakov
On Mon, 2 May 2022, Alexander Monakov wrote:
> > > --- a/gcc/ipa-visibility.cc
> > > +++ b/gcc/ipa-visibility.cc
> > > @@ -872,6 +872,22 @@ function_and_variable_visibility (bool whole_program)
> > >   }
> > >   }
> > >  }
> > > +  FOR_EACH_VARIABLE (vnode)
> > > +{
> > > +  tree decl = vnode->decl;
> > > +  
> > > +  /* Optimize TLS model based on visibility (taking into account
> > > + optimizations done in the preceding loop), unless it was
> > > + specified explicitly.  */
> > > +  
> > > +  if (DECL_THREAD_LOCAL_P (decl)
> > > +  && !lookup_attribute ("tls_model", DECL_ATTRIBUTES (decl)))
> > > +{
> > > +  enum tls_model new_model = decl_default_tls_model (decl);
> > > +  gcc_checking_assert (new_model >= decl_tls_model (decl));
> > > +  set_decl_tls_model (decl, new_model);
> > > +}
> > > +}
> > >  
> > 
> > decl_default_tls_model depends on the global optimize flag, which is
> > almost always problematic in IPA passes.  I was able to make your patch
> > ICE using the vis-attr-hidden.c testcase from your patch with:
> > 
> >   mjambor@virgil:~/gcc/small/tests/tls$ ~/gcc/small/inst/bin/gcc -O2 -fPIC 
> > -flto -c vis-attr-hidden.c
> >   mjambor@virgil:~/gcc/small/tests/tls$ ~/gcc/small/inst/bin/gcc -fPIC -O0 
> > -shared -flto vis-attr-hidden.o
> >   during IPA pass: whole-program
> >   lto1: internal compiler error: in function_and_variable_visibility, at 
> > ipa-visibility.cc:888
> [snip]
> > Note the use of LTO, mismatching -O flags and the -shared flag in the
> > link step.
> 
> Ah, right. The assert is checking that we don't accidentally downgrade decl's
> TLS access model, e.g. from local-dynamic to global-dynamic, and you've shown
> how to trigger that. I didn't realize this code can run twice, and with
> different 'optimize' levels.
> 
> I would suggest to solve this by checking if the new TLS model is stronger,
> i.e. instead of this:
> 
>   gcc_checking_assert (new_model >= decl_tls_model (decl));
>   set_decl_tls_model (decl, new_model);
> 
> do this:
> 
>   if (new_model >= decl_tls_model (decl))
> set_decl_tls_model (decl, new_model);
> 
> Does this look reasonable?

On second thought, it might be better to keep the assert, and place the loop
under 'if (optimize)'?

> > A simple but somewhat lame way to avoid the ICE would be to run your
> > loop over variables only from pass_ipa_function_and_variable_visibility
> > and not from pass_ipa_whole_program_visibility.
> > 
> > I am afraid a real solution would involve copying relevant entries from
> > global_options to the symtab node representing the variable when it is
> > created/finalized, properly streaming them for LTO, and modifying
> > decl_default_tls_model to rely on those rather than global_options
> > itself.
> 
> If we agree on the solution above, then this will not be necessary, after all
> this transformation looks at optimized whole-program visibility status,
> and so initial optimization level should not be relevant.

Alexander


Re: [PATCH] ipa-visibility: Optimize TLS access [PR99619]

2022-05-09 Thread Alexander Monakov
On Mon, 9 May 2022, Jan Hubicka wrote:

> > On second thought, it might be better to keep the assert, and place the loop
> > under 'if (optimize)'?
> 
> The problem is that at IPA level it does not make sense to check
> optimize flag as it is function specific.  (shlib is OK to check it
> anywhere since it is global.)
> 
> So I think we really want to run the code only at the WPA time
> (symtab_state>=IPA_SSA) and we want to see what the optimization flag of
> the functions referring the variable is, since that is what decides the
> codegen we will produce.

I'm not sure about the latter. Are you suggesting we give up on upgrading
general-dynamic to local-dynamic if in a mixed-O scenario there is at least
one -O0 function referring to the variable? Why? That function will end up
even more deoptimized if we do that!

Alexander


Re: [PATCH] Refactor rust-demangle to be independent of C++ demangling.

2019-10-23 Thread Alexander Monakov
On Wed, 23 Oct 2019, Eduard-Mihai Burtescu wrote:

> @@ -384,6 +384,14 @@ rust_demangle_callback (const char *mangled, int options,
>  return 0;
>rdm.sym_len--;
>  
> +  /* Legacy Rust symbols also always end with a path segment
> + that encodes a 16 hex digit hash, i.e. '17h[a-f0-9]{16}'.
> + This early check, before any parse_ident calls, should
> + quickly filter out most C++ symbols unrelated to Rust. */
> +  if (!(rdm.sym_len > 19
> +&& !strncmp (&rdm.sym[rdm.sym_len - 19], "17h", 3)))

This can be further optimized by using memcmp in place of strncmp, since from
the length check you know that you won't see the null terminator among the three
chars you're checking.

The compiler can expand memcmp(buf, "abc", 3) inline as two comparisons against
a 16-bit immediate and an 8-bit immediate.  It can't do the same for strncmp.
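
Something along these lines (a standalone sketch; the helper name is
hypothetical, while sym/sym_len mirror the fields used in the patch):

#include <string.h>

static int
ends_with_legacy_hash (const char *sym, size_t sym_len)
{
  /* The length check guarantees the three compared bytes precede the
     NUL terminator, so memcmp is equivalent to strncmp here and can
     be expanded inline by the compiler.  */
  return sym_len > 19 && memcmp (&sym[sym_len - 19], "17h", 3) == 0;
}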

Alexander


[PING][PATCH] optimize costly division in rtx_cost

2020-01-10 Thread Alexander Monakov
Ping.

On Sun, 5 Jan 2020, Alexander Monakov wrote:

> Hi,
> 
> I noticed there's a costly signed 64-bit division in rtx_cost on x86 as well 
> as
> any other target where UNITS_PER_WORD is implemented like TARGET_64BIT ? 8 : 
> 4.
> It's also evident that rtx_cost does redundant work for a SET rtx argument.
> 
> Obviously the variable named 'factor' rarely exceeds 1, so in the majority of
> cases it can be computed with a well-predictable branch rather than a 
> division.
> 
> This patch makes rtx_cost do the division only in case mode is wider than
> UNITS_PER_WORD, and also moves a test for a SET up front to avoid redundancy.
> No functional change.
> 
> Bootstrapped on x86_64, ok for trunk?
> 
> To illustrate the improvement this buys, for tramp3d -O2 compilation, I got
> 
> before:
>73887675319  instructions:u
> 
>72438432200  cycles:u
>  924298569  idq.ms_uops:u
>   102603799255  uops_dispatched.thread:u
> 
> after:
>73888371724  instructions:u
> 
>72386986612  cycles:u
>  802744775  idq.ms_uops:u
>   102096987220  uops_dispatched.thread:u
> 
> (this is on Sandybridge, idq.ms_uops are uops going via the microcode 
> sequencer,
> so the unneeded division is responsible for a good fraction of them)
> 
>   * rtlanal.c (rtx_cost): Handle a SET up front. Avoid division if the
>   mode is not wider than UNITS_PER_WORD.
> 
> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
> index 9a7afccefb8..c7ab86e228b 100644
> --- a/gcc/rtlanal.c
> +++ b/gcc/rtlanal.c
> @@ -4207,18 +4207,23 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code 
> outer_code,
>const char *fmt;
>int total;
>int factor;
> +  unsigned mode_size;
>  
>if (x == 0)
>  return 0;
>  
> -  if (GET_MODE (x) != VOIDmode)
> +  if (GET_CODE (x) == SET)
> +/* A SET doesn't have a mode, so let's look at the SET_DEST to get
> +   the mode for the factor.  */
> +mode = GET_MODE (SET_DEST (x));
> +  else if (GET_MODE (x) != VOIDmode)
>  mode = GET_MODE (x);
>  
> +  mode_size = estimated_poly_value (GET_MODE_SIZE (mode));
> +
>/* A size N times larger than UNITS_PER_WORD likely needs N times as
>   many insns, taking N times as long.  */
> -  factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> -  if (factor == 0)
> -factor = 1;
> +  factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD : 1;
>  
>/* Compute the default costs of certain things.
>   Note that targetm.rtx_costs can override the defaults.  */
> @@ -4243,14 +4248,6 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code 
> outer_code,
>/* Used in combine.c as a marker.  */
>total = 0;
>break;
> -case SET:
> -  /* A SET doesn't have a mode, so let's look at the SET_DEST to get
> -  the mode for the factor.  */
> -  mode = GET_MODE (SET_DEST (x));
> -  factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> -  if (factor == 0)
> - factor = 1;
> -  /* FALLTHRU */
>  default:
>total = factor * COSTS_N_INSNS (1);
>  }
> 


Re: [PATCH] optimize costly division in rtx_cost

2020-01-20 Thread Alexander Monakov
Ping.

On Sun, 5 Jan 2020, Alexander Monakov wrote:

> Hi,
> 
> I noticed there's a costly signed 64-bit division in rtx_cost on x86 as well 
> as
> any other target where UNITS_PER_WORD is implemented like TARGET_64BIT ? 8 : 
> 4.
> It's also evident that rtx_cost does redundant work for a SET rtx argument.
> 
> Obviously the variable named 'factor' rarely exceeds 1, so in the majority of
> cases it can be computed with a well-predictable branch rather than a 
> division.
> 
> This patch makes rtx_cost do the division only in case mode is wider than
> UNITS_PER_WORD, and also moves a test for a SET up front to avoid redundancy.
> No functional change.
> 
> Bootstrapped on x86_64, ok for trunk?
> 
> To illustrate the improvement this buys, for tramp3d -O2 compilation, I got
> 
> before:
>73887675319  instructions:u
> 
>72438432200  cycles:u
>  924298569  idq.ms_uops:u
>   102603799255  uops_dispatched.thread:u
> 
> after:
>73888371724  instructions:u
> 
>72386986612  cycles:u
>  802744775  idq.ms_uops:u
>   102096987220  uops_dispatched.thread:u
> 
> (this is on Sandybridge, idq.ms_uops are uops going via the microcode 
> sequencer,
> so the unneeded division is responsible for a good fraction of them)
> 
>   * rtlanal.c (rtx_cost): Handle a SET up front. Avoid division if the
>   mode is not wider than UNITS_PER_WORD.
> 
> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
> index 9a7afccefb8..c7ab86e228b 100644
> --- a/gcc/rtlanal.c
> +++ b/gcc/rtlanal.c
> @@ -4207,18 +4207,23 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code 
> outer_code,
>const char *fmt;
>int total;
>int factor;
> +  unsigned mode_size;
>  
>if (x == 0)
>  return 0;
>  
> -  if (GET_MODE (x) != VOIDmode)
> +  if (GET_CODE (x) == SET)
> +/* A SET doesn't have a mode, so let's look at the SET_DEST to get
> +   the mode for the factor.  */
> +mode = GET_MODE (SET_DEST (x));
> +  else if (GET_MODE (x) != VOIDmode)
>  mode = GET_MODE (x);
>  
> +  mode_size = estimated_poly_value (GET_MODE_SIZE (mode));
> +
>/* A size N times larger than UNITS_PER_WORD likely needs N times as
>   many insns, taking N times as long.  */
> -  factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> -  if (factor == 0)
> -factor = 1;
> +  factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD : 1;
>  
>/* Compute the default costs of certain things.
>   Note that targetm.rtx_costs can override the defaults.  */
> @@ -4243,14 +4248,6 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code 
> outer_code,
>/* Used in combine.c as a marker.  */
>total = 0;
>break;
> -case SET:
> -  /* A SET doesn't have a mode, so let's look at the SET_DEST to get
> -  the mode for the factor.  */
> -  mode = GET_MODE (SET_DEST (x));
> -  factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> -  if (factor == 0)
> - factor = 1;
> -  /* FALLTHRU */
>  default:
>total = factor * COSTS_N_INSNS (1);
>  }
> 


Re: [PATCH] Make target_clones resolver fn static.

2020-01-20 Thread Alexander Monakov



On Mon, 20 Jan 2020, H.J. Lu wrote:
> We can that only if function is static:
> 
[snip asm]
> 
> In this case, foo must be global.

H.J., can you rephrase more clearly? Your response seems contradictory and
does not help to explain the matter.

Alexander


Re: [PATCH] Make target_clones resolver fn static.

2020-01-20 Thread Alexander Monakov



On Mon, 20 Jan 2020, H.J. Lu wrote:
> For,
> 
> ---
> __attribute__((target_clones("avx","default")))
> int
> foo ()
> {
>   return -2;
> }
> 
> 
> foo's resolver must be global.  For
> 
> ---
> __attribute__((target_clones("avx","default")))
> static int
> foo ()
> {
>   return -2;
> }
> ---
> 
> foo's resolver must be static.

Bare IFUNCs don't seem to have this restriction. Why do we want to
constrain target clones this way?

Alexander


Re: [PATCH] Make target_clones resolver fn static.

2020-01-20 Thread Alexander Monakov



On Mon, 20 Jan 2020, H.J. Lu wrote:

> > Bare IFUNCs don't seem to have this restriction. Why do we want to
> > constrain target clones this way?
> >
> 
> foo's resolver acts as foo.  It should have the same visibility as foo.

What do you mean by that? From the implementation standpoint, there are
two symbols of different type with the same value. There's no problem
allowing one of them have local binding and the other have global binding.

Is there something special about target clones that doesn't come into
play with ifuncs?
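
For comparison, a bare ifunc with a local resolver and a global entry point
(hypothetical names, using the documented attribute syntax; requires an
ELF/GNU target with ifunc support):

static int foo_impl (void) { return -2; }

/* The resolver has static (local) binding...  */
static void *
resolve_foo (void)
{
  return (void *) foo_impl;
}

/* ...while foo itself remains a global symbol.  */
int foo (void) __attribute__ ((ifunc ("resolve_foo")));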

Alexander


[PATCH] doc: clarify the situation with pointer arithmetic

2020-01-20 Thread Alexander Monakov
Hi,

we have this paragraph in the documentation that attempts to prohibit something
that is allowed by the language.  Instead, I think we should say that this
generally should work and explain that a problem in GCC implementation
breaks this.

OK to apply?

Thanks.
Alexander


* doc/implement-c.texi (Arrays and Pointers): Rewrite the paragraph
about pointer-as-integer arithmetic.

diff --git a/gcc/doc/implement-c.texi b/gcc/doc/implement-c.texi
index 692297b69c4..beee2510670 100644
--- a/gcc/doc/implement-c.texi
+++ b/gcc/doc/implement-c.texi
@@ -401,11 +401,15 @@ pointer representation is smaller than the integer type, 
extends according
 to the signedness of the integer type if the pointer representation
 is larger than the integer type, otherwise the bits are unchanged.
 
-When casting from pointer to integer and back again, the resulting
-pointer must reference the same object as the original pointer, otherwise
-the behavior is undefined.  That is, one may not use integer arithmetic to
-avoid the undefined behavior of pointer arithmetic as proscribed in
-C99 and C11 6.5.6/8.
+Arithmetic on integers derived from pointers can produce a value such
+that casting it back produces a valid pointer corresponding to one of
+the original pointers.  Thus, integer arithmetic allows one to express
+computations that might not be expressible as pointer arithmetic without
+undefined behavior.  However, at a certain point the distinction between
+pointers and integers is lost (when GCC translates from GIMPLE internal
+representation to RTL), but some optimizations still attempt to track
+pointer arithmetic beyond that point.  In some cases this may cause
+valid code to be incorrectly optimized.
 
 @item
 @cite{The size of the result of subtracting two pointers to elements


Re: [PATCH] doc: clarify the situation with pointer arithmetic

2020-01-20 Thread Alexander Monakov
On Mon, 20 Jan 2020, Sandra Loosemore wrote:

> I'm not happy with this -- we shouldn't be talking about internal concepts
> like GIMPLE and RTL in the GCC user manual.  Can we instead talk about which
> user-visible optimization options cause problems, so that users who feel the
> urgent need to write hacky code like that know what to disable?  (I'd guess it
> would be things involving alias analysis.)  Otherwise, I think we should stick
> with the current "don't do this" language or some variant of it that explains
> briefly that, while such code is technically allowed by the C standard, it
> confuses GCC's optimizers.

The problem is in code shared by multiple RTL optimizers, so I'm afraid we
cannot give a list of optimizations to disable. Instead, we can suggest that
users place optimization barriers on problematic pointers:

ptr = (void *) iptr;
__asm__ ("" : "+g"(ptr));
// use *ptr

i.e. prevent incorrect tracking of possible pointer targets by copying the
pointer value through an asm statement. To make the intent of the asm more
clear, users may wish to enumerate possible targets in input constraints
of such asm:

ptr = (void *) iptr;
__asm__ ("" : "+g"(ptr) : "X"(&obj1), "X"(&obj2), ...);
// use *ptr

Alexander


Re: [PATCH] doc: clarify the situation with pointer arithmetic

2020-01-21 Thread Alexander Monakov
On Mon, 20 Jan 2020, Joseph Myers wrote:

> On Mon, 20 Jan 2020, Alexander Monakov wrote:
> 
> > Hi,
> > 
> > we have this paragraph in the documentation that attempts to prohibit 
> > something that is allowed by the language.  Instead, I think we should 
> > say that this generally should work and explain that a problem in GCC 
> > implementation breaks this.
> 
> Is this based on the proposals to adopt a PNVI model (as described in 
> N2362 / N2363 / N2364) for C2x?  Do you have a more detailed analysis of 
> exactly which issues would need changes in GCC to follow the proposed 
> PNVI-ae-udi semantics?  Setting up a meta-bug in Bugzilla with 
> dependencies on such issues might be useful, for example - I know there 
> are existing bugs you've filed or commented on, but it would help to have 
> a list in a single place of issues for implementing PNVI-ae-udi.
> 
> It's not obvious that documentation relating to things that are 
> implementation-defined in existing C standard versions, where detailed 
> memory model questions were not defined, is an appropriate place to 
> discuss questions of how some optimizations might not follow a more 
> precise definition proposed for C2x.

My intent was more basic. I observed that the paragraph can be interpreted as
saying that if you have a cast 'I1 = (intptr_t) P1', then perform some
computations on I1 that do not in any way depend on values of other pointers,
then the pointer obtained by casting the result back cannot point to a
different object than P1.
But that is not very helpful, because this is the case where the user might have
used normal pointer arithmetic in the first place. The important cases
involve computations that depend on multiple pointers to unrelated objects.

I think that the case discussed in the proposed patch is already not ambiguous
and does not need further clarifications to the standard.

I also think we should provide users with a rationale when imposing such
restrictions, and, as Sandra noted, offer a way to work around the problem.

However I don't mind dropping the patch if it's not appropriate.  Richard,
can you share your opinion here?

Thanks.
Alexander


Re: [PATCH] doc: clarify the situation with pointer arithmetic

2020-01-21 Thread Alexander Monakov
On Tue, 21 Jan 2020, Richard Biener wrote:

> Fourth.  That PNVI (I assume it's the whole pointer-provenance stuff)
> wants to get the "best" of both which can never be done since a compiler
> needs to have a way to be conservative - in this area it's conflicting
> conservative treatment which is impossible.

This paragraph is unclear, I don't immediately see what the conflicting goals
are. The rest is clear enough given the previous discussions I saw.

Did you mean the restriction that you cannot do arithmetic involving two
integers based on pointers, get a value corresponding to one of them,
cast it back and get a pointer suitable for accessing either of two
originally pointed-to objects? I don't see that as a conflict because
it places a restriction on users, not the compiler.

Alexander


Re: [PATCH] Fix comparison of trees via tree_cmp

2020-01-22 Thread Alexander Monakov



On Wed, 22 Jan 2020, Stefan Schulze Frielinghaus wrote:

> Hi David,
> 
> In function `tree_cmp` an invariant [1] is assumed which does not necessarily
> exist. In case both input trees are finally compared via `strcmp`, then
> 
>   tree_cmp (t1, t2) == -tree_cmp (t2, t1)
> 
> does not hold in general, since function `strcmp (x, y)` guarantees only that 
> a
> negative integer is returned in case x<y and a positive integer in case x>y.
> Currently this breaks s390x where, for example, for certain inputs x,y
> `tree_cmp (x, y) == 1` and `tree_cmp (y, x) == -2` hold.  The attached patch
> normalizes the output from `strcmp` to -1, 0, or 1 while using an auxiliary
> function `sign` (stolen from the Hacker's Delight book ;-)).
> 
> Bootstrapped and tested on s390x. Any thoughts?

It's more appropriate to fix the assert rather than the comparator, like

  gcc_assert (sign (reversed) == -sign (result));

But qsort_chk already checks that, and more, so why is the assert there?
Shouldn't it be simply removed?
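
A self-contained sketch of the sign-based check (the helper follows the
"sign" function mentioned in the patch description; the rest is purely
illustrative):

#include <assert.h>
#include <string.h>

/* Map any comparator result to -1, 0, or 1.  */
static int
sign (int x)
{
  return (x > 0) - (x < 0);
}

int
main (void)
{
  int result = strcmp ("a", "b");    /* any negative value */
  int reversed = strcmp ("b", "a");  /* any positive value */
  /* Comparators only promise the sign of the result, so assert on
     signs rather than on exact values.  */
  assert (sign (reversed) == -sign (result));
  return 0;
}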

Alexander


Re: [PATCH] Make target_clones resolver fn static.

2020-01-23 Thread Alexander Monakov


On Thu, 23 Jan 2020, Martin Liška wrote:
> > So this doesn't help review including two different target changes.  Making
> > ifunc dispatchers of public functions non-public looks like an unrelated
> > thing
> > to the bug (sorry if I mis-suggested).  So I feel comfortable approving the
> > earlier patch which just dropped the extra mangling for non-public
> > dispatchers
> > in the x86 backend.
> 
> Works for me.

If you will be revising the patch, can you please improve the new comment?

I mean this addition:

  /* Make the resolver function static.
  ... */

but it's not what the following code does.

Thanks.
Alexander

Re: [PATCH] Fix comparison of trees via tree_cmp

2020-01-23 Thread Alexander Monakov
On Thu, 23 Jan 2020, David Malcolm wrote:

> Removing the assertions fixes it for me (a stage1 build, at least, and
> it then passes the testsuite).
> 
> I've made this blunder in four places in the analyzer:
> 
>   call-string.cc:162:  call_string::cmp
>   program-point.cc:461:  function_point::cmp_within_supernode

These two don't raise concerns on brief inspection.

>   engine.cc:1820:  worklist::key_t::cmp

This one tries to use signed difference of hashes as return value, this is
not going to work in general while passing your assert almost always (except
when difference of hashes is exactly -2^31). The problem is, difference of
hashes leads to non-transitive comparison and qsort_chk can catch it on a
suitable testcase.

>   region-model.cc:1878:  tree_cmp

This is the one under discussion here.

> IIRC, I added these checks as I was finding it difficult to debug
> things when qsort_chk failed - the first three of the comparators in
> the list above build on each other:  worklist::key_t::cmp uses both
> function_point::cmp_within_supernode and call_string::cmp (and various
> other things), so if the worklist qsort_chk fails, it sometimes
> required a fair bit of digging into which of the nested comparators had
> failed (also, all the void * don't help; I'd love to have a template
> that does a typesafe qsort without needing casts in the comparator).

FWIW qsort_chk_error is a separate function to make that easier: if you place a
breakpoint on qsort_chk_error you just need to step a few times to get into a
comparator that is under suspicion, and don't need to type out casts in gdb.

Alexander


Re: [PATCH] doc: clarify the situation with pointer arithmetic

2020-01-27 Thread Alexander Monakov
On Tue, 28 Jan 2020, Uecker, Martin wrote:

> > (*) this also shows the level of "obfuscation" needed to fool compilers
> > to lose provenance knowledge is hard to predict.
> 
> Well, this is exactly the problem we want to address by defining
> a clear way to do this. Casting to an integer would be the way
> to state: "consider the pointer as escaped and forget the 
> provenance"  and casting an integer to a  pointer would
> mean "this pointer may point to all objects whose pointer has
> escaped". This would give the programmer explicit control about
> this aspect and make most existing code using pointer-to-integer
> casts well-defined. At the same time, this should be simple
> to add to existing points-to analysis.

Can you explain why you make it required for the compiler to treat the
points-to set unnecessarily broader than it could prove? In the Matlab
example, there's a simple chain of computations that the compiler is
following to prove that the pointer resulting from the final cast is
derived from exactly one other pointer (no other pointers have
participated in the computations).

Or, in other words:

is there an example where a programmer can distinguish between the
requirement you explain above vs. the fine-grained interpretation
that GIMPLE aims to implement (at least as I understand it), which is:

  when the program creates a pointer by means of non-pointer computations
  (casts, representation access, etc), the resulting pointer may point to:

* any object whose address could have participated in the computation
  (which is at worst the entire set of "exposed" objects up to that
   program point, but can be much narrower if the compiler can see
   the entire chain of computations)

* any object which is not "exposed" but could have a known address, e.g.
  because it is placed at a specific address during linking
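
To make the contrast concrete, a toy variant of the chain-of-computations
case (my example; 'h' is a made-up name):

  #include <stdint.h>

  int
  h (void)
  {
    int x = 1, y = 2;
    uintptr_t i = (uintptr_t) &x;  /* exposes 'x'; 'y' stays unexposed */
    int *p = (int *) i;   /* whole chain visible: points-to set is {x} */
    return *p + y;
  }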

Thanks.
Alexander

Re: [PATCH] doc: clarify the situation with pointer arithmetic

2020-01-28 Thread Alexander Monakov
On Tue, 28 Jan 2020, Uecker, Martin wrote:

> Unfortunately, this is not as simple. It is not trivial to
> define the set of objects whose "address could have participated
> in the computation"
> [...]

I am not satisfied with the response, but I'm not sure I should
continue arguing here.

Alexander


Re: [PATCH 3/3] Implementation of -Wclobbered on tree-ssa

2020-01-29 Thread Alexander Monakov
On Tue, 28 Jan 2020, Jeff Law wrote:

> On Tue, 2019-10-08 at 18:04 +0300, Alexander Monakov wrote:
> > On Thu, 3 Oct 2019, Jeff Law wrote:
> > 
> > > You may want to review the 2018 discussion:
> > > 
> > > https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg185287.html
> > > 
> > > THe 2018 discussion was primarily concerned with fixing the CFG
> > > inaccuracies which would in turn fix 61118.  But would not fix 21161.
> > 
> > Well, PR 61118 was incorrectly triaged to begin with.
> > 
> > Let me show starting from a simplified testcase:
> > 
> > void bar(jmp_buf);
> > void foo(int n)
> > {
> > jmp_buf jb;
> > for (; n; n--)
> > if (setjmp(jb))
> > bar(jb);
> > }
> > 
> > Here we warn that "parameter 'n' might be clobbered..." despite that
> > "obviously" there is no way 'n' might be modified between setjmp and
> > bar... right?
> Not necessarily. You have to look at what objects are live.  N is
> clearly going to be live throughout the entire loop and thus it will be
> subject to setjmp/longjmp warnings.

That N is live is not enough: we want to warn if N is also modified
before an abnormal return (otherwise we'd attempt to warn for live
readonly variables). But if you're agreeing with me that warning here
is *not a false positive*, then I'm happy.

> The inaccuracies in our CFG at the setjmp call cause objects which are
> not actually live across the call to appear to be live.  See PR 21161.

For the example shown in comment #14, agreed.

> > Now suppose 'bar' looks like
> [ ... ]
> It doesn't really matter what bar looks like.

Well it did for the purpose of demonstration, but if you're in agreement
anyway then we can say it's moot.

> > Still, I can imagine one can argue that the warning in the PR is bogus, as
> > the loop assigns invariant values to problematic variables:
> I'm not sure which PR you're referring to (21161, 65041 or some other?)

I'm arguing that PR 61118 was incorrectly triaged (first sentence of my email).

> > void foo(struct ctx *arg)
> > {
> > while (arg->cond) {
> > void *var1 = &arg->field;
> > void (*var2)(void*) = globalfn;
> > jmp_buf jb;
> > if (!setjmp(jb))
> > var2(var1);
> > }
> > etc(arg);
> > }
> > 
> > and since "obviously" &arg->field is invariant, there's no way setjmp will
> > overwrite 'var1' with a different value, right?.. Except:
> > 
> > If the while loop is not entered, a longjmp from inside 'etc' would restore
> > default (uninitialized) values; note the compiler doesn't pay attention to 
> > the
> > lifetime of jmp_buf, it simply assumes 'etc' may cause setjmp to return
> > abnormally (this is correct in case of other returns_twice functions such as
> > vfork).
> Whether or not we should warn for restoring an uninitialized state I
> think is something we haven't really considered.  Depending on how many
> assignments to the pseudo there are and the exact control flow we may
> or may not warn.

That restored values are uninitialized is not important; the problem remains
if var1 is declared before the 'while' and initialized to 0.

> > Now regarding claims that we emit "incorrect" CFG that is causing extra phi
> > nodes, and that "fixing CFG" would magically eliminate those and help with
> > false positives.
> > 
> > While in principle it is perhaps nicer to have IR that more closely models
> > the abnormal flow, I don't see how it would reduce phi nodes; we'd go from
> It certainly does move PHI nodes and it can reduce the number of false
> positives.  It's not as effective as it could be because of lameness on
> the RTL side (again see PR21161).

There's some mixup here because overall this was discussing the complete
reimplementation of the warning on GIMPLE, so RTL should not be relevant.

Currently there's no concrete proposal on the table for making GIMPLE CFG
somehow more accurate for setjmps. I can see that RTL CFG could be more
accurate indeed.

The best proposal for GIMPLE so far is moving the outgoing edge from the
abnormal dispatcher to just after the setjmp:

BB1                     BB2
  r1 = setjmp(env);       r2 = ABNORMAL_DISPATCHER()
  |  +--------------------+
  V  V
BB3
  r3 = phi(r1, r2)
  if (r3) goto BB4 else goto BB5

where for setjmp in particular the BB1->BB2 edge is not necessary, but
should be there for unknown returns_twice functions.

But I can't see how this proposal substantially aids analysis on GIMPLE.
In particular, we cannot easily make some kind of threading and separate
BB1-BB3-BB5 and BB2-BB3-BB4 paths because then we wouldn't be able to
easily lower it to RTL.

Alexander


Re: [PATCH, v3] wwwdocs: e-mail subject lines for contributions

2020-02-03 Thread Alexander Monakov
On Mon, 3 Feb 2020, Richard Earnshaw (lists) wrote:

> I've not seen any follow-up to this version.  Should we go ahead and adopt
> this?

Can we please go with 'committed' (lowercase) rather than all-caps COMMITTED?
Spelling this in all caps seems like a recent thing on gcc-patches; before
that, everyone used the lowercase version, which makes more sense (no need
to shout about the one thing that didn't need any discussion before applying
the patch).

Also, while tools like 'git format-patch' will automatically put [PATCH] in
the subject, for '[COMMITTED]' it will be a human typing that out, and it
makes little sense to require people to meticulously type it out in caps,
especially when the previous practice was the opposite.

Thanks.
Alexander


Re: [PATCH, v3] wwwdocs: e-mail subject lines for contributions

2020-02-03 Thread Alexander Monakov
On Mon, 3 Feb 2020, Richard Earnshaw (lists) wrote:

> Upper case is what glibc has, though it appears that it's a rule that is not
> strictly followed.  If we change it, then it becomes another friction point
> between developer groups.  Personally, I'd leave it as is, then turn a blind
> eye to such minor non-conformance.

In that case can we simply say that both 'committed' and 'COMMITTED' are okay,
if we know glibc doesn't follow that rule and anticipate we will not follow
it either?

Thanks.
Alexander


[PING^3][PATCH] optimize costly division in rtx_cost

2020-02-13 Thread Alexander Monakov
Ping^3.

On Sun, 5 Jan 2020, Alexander Monakov wrote:

> Hi,
> 
> I noticed there's a costly signed 64-bit division in rtx_cost on x86 as well 
> as
> any other target where UNITS_PER_WORD is implemented like TARGET_64BIT ? 8 : 
> 4.
> It's also evident that rtx_cost does redundant work for a SET rtx argument.
> 
> Obviously the variable named 'factor' rarely exceeds 1, so in the majority of
> cases it can be computed with a well-predictable branch rather than a 
> division.
> 
> This patch makes rtx_cost do the division only in case mode is wider than
> UNITS_PER_WORD, and also moves a test for a SET up front to avoid redundancy.
> No functional change.
> 
> Bootstrapped on x86_64, ok for trunk?
> 
> To illustrate the improvement this buys, for tramp3d -O2 compilation, I got
> 
> before:
>73887675319  instructions:u
> 
>72438432200  cycles:u
>  924298569  idq.ms_uops:u
>   102603799255  uops_dispatched.thread:u
> 
> after:
>73888371724  instructions:u
> 
>72386986612  cycles:u
>  802744775  idq.ms_uops:u
>   102096987220  uops_dispatched.thread:u
> 
> (this is on Sandybridge, idq.ms_uops are uops going via the microcode 
> sequencer,
> so the unneeded division is responsible for a good fraction of them)
> 
>   * rtlanal.c (rtx_cost): Handle a SET up front. Avoid division if the
>   mode is not wider than UNITS_PER_WORD.
> 
> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
> index 9a7afccefb8..c7ab86e228b 100644
> --- a/gcc/rtlanal.c
> +++ b/gcc/rtlanal.c
> @@ -4207,18 +4207,23 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code 
> outer_code,
>const char *fmt;
>int total;
>int factor;
> +  unsigned mode_size;
>  
>if (x == 0)
>  return 0;
>  
> -  if (GET_MODE (x) != VOIDmode)
> +  if (GET_CODE (x) == SET)
> +/* A SET doesn't have a mode, so let's look at the SET_DEST to get
> +   the mode for the factor.  */
> +mode = GET_MODE (SET_DEST (x));
> +  else if (GET_MODE (x) != VOIDmode)
>  mode = GET_MODE (x);
>  
> +  mode_size = estimated_poly_value (GET_MODE_SIZE (mode));
> +
>/* A size N times larger than UNITS_PER_WORD likely needs N times as
>   many insns, taking N times as long.  */
> -  factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> -  if (factor == 0)
> -factor = 1;
> +  factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD : 1;
>  
>/* Compute the default costs of certain things.
>   Note that targetm.rtx_costs can override the defaults.  */
> @@ -4243,14 +4248,6 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code 
> outer_code,
>/* Used in combine.c as a marker.  */
>total = 0;
>break;
> -case SET:
> -  /* A SET doesn't have a mode, so let's look at the SET_DEST to get
> -  the mode for the factor.  */
> -  mode = GET_MODE (SET_DEST (x));
> -  factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
> -  if (factor == 0)
> - factor = 1;
> -  /* FALLTHRU */
>  default:
>total = factor * COSTS_N_INSNS (1);
>  }
> 


Re: Fix overflows in -fprofile-reorder-functions

2019-12-08 Thread Alexander Monakov
On Sun, 8 Dec 2019, Jan Hubicka wrote:

> Other explanation would be that our new qsort with broken comparator due to
> overflow can actually remove some entries in the array, but that sounds a bit
> crazy.

gcc_qsort only reorders elements, making it possible for gcc_qsort_chk (that
runs afterwards) to catch crazy comparators in a sound manner.

> Bootstrapped/regtested x86_64-linux. Committed.

I have a few comments, please see below.

> --- cgraphunit.c  (revision 279076)
> +++ cgraphunit.c  (working copy)
> @@ -2359,19 +2359,33 @@ cgraph_node::expand (void)
>  /* Node comparator that is responsible for the order that corresponds
> to time when a function was launched for the first time.  */
>  
> -static int
> -node_cmp (const void *pa, const void *pb)
> +int
> +tp_first_run_node_cmp (const void *pa, const void *pb)
>  {
>const cgraph_node *a = *(const cgraph_node * const *) pa;
>const cgraph_node *b = *(const cgraph_node * const *) pb;
> +  gcov_type tp_first_run_a = a->tp_first_run;
> +  gcov_type tp_first_run_b = b->tp_first_run;
> +
> +  if (!opt_for_fn (a->decl, flag_profile_reorder_functions)
> +  || a->no_reorder)
> +tp_first_run_a = 0;
> +  if (!opt_for_fn (b->decl, flag_profile_reorder_functions)
> +  || b->no_reorder)
> +tp_first_run_b = 0;
> +
> +  if (tp_first_run_a == tp_first_run_b)
> +return b->order - a->order;
>  
>/* Functions with time profile must be before these without profile.  */
> -  if (!a->tp_first_run || !b->tp_first_run)
> -return a->tp_first_run - b->tp_first_run;
> +  if (!tp_first_run_a || !tp_first_run_b)
> +return tp_first_run_a ? 1 : -1;

The code does the opposite of the comment: when tp_first_run_b is 0, it will
return 1, indicating a > b, causing b to appear in front of a in the sorted
array.

I would recommend to make these variables uint64_t, then you can simply do

  tp_first_run_a--;
  tp_first_run_b--;

making 0 wrap around to UINT64_MAX. Then they will naturally sort after all
other nodes.
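
Spelled out as a self-contained sketch ('tp_cmp' and its shape are mine;
the field loads and the order-based tie-breaker are elided):

  #include <stdint.h>

  static int
  tp_cmp (uint64_t tp_first_run_a, uint64_t tp_first_run_b)
  {
    tp_first_run_a--;  /* 0 wraps to UINT64_MAX, sorting such nodes last */
    tp_first_run_b--;
    if (tp_first_run_a != tp_first_run_b)
      return tp_first_run_a < tp_first_run_b ? -1 : 1;
    return 0;
  }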

> -  return a->tp_first_run != b->tp_first_run
> -  ? b->tp_first_run - a->tp_first_run
> -  : b->order - a->order;
> +  /* Watch for overflow - tp_first_run is 64bit.  */
> +  if (tp_first_run_a > tp_first_run_b)
> +return -1;
> +  else
> +return 1;

This also sorts in the reverse order -- please fix the comments if that's really
intended.

> +  /* Output functions in RPO so callers get optimized before callees.  This
> + makes ipa-ra and other propagators to work.
> + FIXME: This is far from optimal code layout.  */

I think this should have said "callees get optimized before callers".


Thanks.
Alexander


Re: Fix overflows in -fprofile-reorder-functions

2019-12-10 Thread Alexander Monakov
On Tue, 10 Dec 2019, Jan Hubicka wrote:

> > I would recommend to make these variables uint64_t, then you can simply do
> > 
> >   tp_first_run_a--;
> >   tp_first_run_b--;
> > 
> > making 0 wrap around to UINT64_MAX. Then they will naturally sort after all
> > other nodes.
> 
> Then we would still have to watch the overflow before returning? I
> actually find the condtional sort of more readable than intentional wrap
> around the range, so I kept it in the code espeically because I made the
> value 32bit again and without this trick I no longer need to watch
> overflows.

For int-typed timestamps, you'd need to warp 0 to INT_MAX, e.g. like this:

  tp_first_run_a = (tp_first_run_a - 1) & INT_MAX;
  tp_first_run_b = (tp_first_run_b - 1) & INT_MAX;

which keeps them in 0..INT_MAX range so 'return tp_first_run_a - tp_first_run_b'
still works.

> > > +  /* Output functions in RPO so callers get optimized before callees.  
> > > This
> > > + makes ipa-ra and other propagators to work.
> > > + FIXME: This is far from optimal code layout.  */
> > 
> > I think this should have said "callees get optimized before callers".
> 
> Indeed.

So technically this would be just postorder, not RPO :)

> --- cgraph.c  (revision 279093)
> +++ cgraph.c  (working copy)
> @@ -3074,6 +3074,11 @@ cgraph_node::verify_node (void)
>inlined_to->count.debug ();
>error_found = true;
>  }
> +  if (tp_first_run < 0)
> +{
> +  error ("tp_first_run must be positive");
> +  error_found = true;
> +}

"non-negative"?

Alexander


[PATCH] gdbinit.in: make shorthands accept an explicit argument

2020-01-04 Thread Alexander Monakov
Hi,

I'm posting an updated patch for GCC gdbinit originally proposed by Konstantin.
The patch amends debugging shorthands such as 'pr' to accept an explicit
argument, in addition to taking an implicit argument from $ as today. In other
words,

  pr x

will be accepted to call 'debug_rtx (x)' just like

  p x
  pr

does today.

I've noticed Konstantin's patch left documentation bits inconsistent with the
new implementation, so I amended them manually and introduced a new
'help-gcc-hooks' command with the index and some general info. I also fixed a
leftover '$' remaining in 'prc' command.

Okay to apply?

Alexander

* gdbinit.in (help-gcc-hooks): New command.
(pp, pr, prl, pt, pct, pgg, pgq, pgs, pge, pmz, ptc, pdn, ptn, pdd, prc,
pi, pbm, pel, trt): Take $arg0 instead of $ if supplied. Update
documentation.

diff --git a/gcc/gdbinit.in b/gcc/gdbinit.in
index 4a5b682451b..8c5f9911ec5 100644
--- a/gcc/gdbinit.in
+++ b/gcc/gdbinit.in
@@ -16,154 +16,216 @@
 # along with GCC; see the file COPYING3.  If not see
 # <http://www.gnu.org/licenses/>.
 
+define help-gcc-hooks
+help help-gcc-hooks
+end
+
+document help-gcc-hooks
+GCC gdbinit file introduces several debugging shorthands:
+
+pr [rtx], prl [rtx], prc [rtx], pi [rtx_insn],
+pt [tree], pct [tree], ptc [tree], trt [tree],
+pgs [tree], pge [tree], pdn [tree], ptn [tree],
+pgg [gimple], pgq [gimple_seq],
+pmz [mpz_t],
+pdd [dw_die_ref],
+pbm [bitmap],
+pel [location_t],
+pp, pbs, pcfun
+
+They are generally implemented by calling a function that prints to stderr,
+and therefore will not work when the compiler is not executing.
+
+Most shorthands accept an optional argument. When it is not supplied,
+they use the value in GDB register $, i.e. the last printed value.
+end
+
 define pp
-call debug ($)
+eval "set $debug_arg = $%s", $argc ? "arg0" : ""
+call debug ($debug_arg)
 end
 
 document pp
-Print a representation of the GCC data structure that is $.
-Works only when an inferior is executing.
+GCC hook: pp [any]
+Print a representation of any GCC data structure for which an instance of
+overloaded function 'debug' is available.
+See also 'help-gcc-hooks'.
 end
 
 define pr
-call debug_rtx ($)
+eval "set $debug_arg = $%s", $argc ? "arg0" : ""
+call debug_rtx ($debug_arg)
 end
 
 document pr
-Print the full structure of the rtx that is $.
-Works only when an inferior is executing.
+GCC hook: pr [rtx]
+Print the full structure of given rtx.
+See also 'help-gcc-hooks'.
 end
 
 define prl
-call debug_rtx_list ($, debug_rtx_count)
+eval "set $debug_arg = $%s", $argc ? "arg0" : ""
+call debug_rtx_list ($debug_arg, debug_rtx_count)
 end
 
 document prl
-Print the full structure of all rtx insns beginning at $.
-Works only when an inferior is executing.
+GCC hook: prl [rtx]
+Print the full structure of all rtx insns beginning at given rtx.
 Uses variable debug_rtx_count to control number of insns printed:
-  debug_rtx_count > 0: print from $ on.
-  debug_rtx_count < 0: print a window around $.
+  debug_rtx_count > 0: print from given rtx on.
+  debug_rtx_count < 0: print a window around given rtx.
 
 There is also debug_rtx_find (rtx, uid) that will scan a list for UID and print
 it using debug_rtx_list. Usage example: set $foo=debug_rtx_find(first, 42)
 end
 
 define pt
-call debug_tree ($)
+eval "set $debug_arg = $%s", $argc ? "arg0" : ""
+call debug_tree ($debug_arg)
 end
 
 document pt
-Print the full structure of the tree that is $.
-Works only when an inferior is executing.
+GCC hook: pt [tree]
+Print the full structure of given tree.
+See also 'help-gcc-hooks'.
 end
 
 define pct
-call debug_c_tree ($)
+eval "set $debug_arg = $%s", $argc ? "arg0" : ""
+call debug_c_tree ($debug_arg)
 end
 
 document pct
-Print the tree that is $ in C syntax.
-Works only when an inferior is executing.
+GCC hook: pct [tree]
+Print given tree in C syntax.
+See also 'help-gcc-hooks'.
 end
 
 define pgg
-call debug_gimple_stmt ($)
+eval "set $debug_arg = $%s", $argc ? "arg0" : ""
+call debug_gimple_stmt ($debug_arg)
 end
 
 document pgg
-Print the Gimple statement that is $ in C syntax.
-Works only when an inferior is executing.
+GCC hook: pgg [gimple]
+Print given GIMPLE statement in C syntax.
+See also 'help-gcc-hooks'.
 end
 
 define pgq
-call debug_gimple_seq ($)
+eval "set $debug_arg = $%s", $argc ? "arg0" : ""
+call debug_gimple_seq ($debug_arg)
 end
 
 document pgq
-Print the Gimple sequence that is $ in C syntax.
-Works only when an inferior is executing.
+GCC hook: pgq [gimple_seq]
+Print given GIMPLE sequence in C syntax.
+See also 'help-gcc-hooks'.
 end
 
 define pgs
-call debug_generic_stmt ($)
+eval "set $debug_arg = $%s", $argc ? "arg0" : ""
+call debug_generic_stmt ($debug_arg)
 end
 
 document pgs
-Print the statement that is $ in C syntax.
-Works only when an inferior is executing.
+GCC hook: pgs [tree]
+Print given GENERIC statement in C syntax.
+See also 'help-gcc-hooks'.
 end
 
 define pge
-call debu

[PATCH] optimize costly division in rtx_cost

2020-01-05 Thread Alexander Monakov
Hi,

I noticed there's a costly signed 64-bit division in rtx_cost on x86 as well as
any other target where UNITS_PER_WORD is implemented like TARGET_64BIT ? 8 : 4.
It's also evident that rtx_cost does redundant work for a SET rtx argument.

Obviously the variable named 'factor' rarely exceeds 1, so in the majority of
cases it can be computed with a well-predictable branch rather than a division.

This patch makes rtx_cost do the division only in case mode is wider than
UNITS_PER_WORD, and also moves a test for a SET up front to avoid redundancy.
No functional change.

Bootstrapped on x86_64, ok for trunk?

To illustrate the improvement this buys, for tramp3d -O2 compilation, I got

before:
   73887675319  instructions:u

   72438432200  cycles:u
 924298569  idq.ms_uops:u
  102603799255  uops_dispatched.thread:u

after:
   73888371724  instructions:u

   72386986612  cycles:u
 802744775  idq.ms_uops:u
  102096987220  uops_dispatched.thread:u

(this is on Sandybridge, idq.ms_uops are uops going via the microcode sequencer,
so the unneeded division is responsible for a good fraction of them)

* rtlanal.c (rtx_cost): Handle a SET up front. Avoid division if the
mode is not wider than UNITS_PER_WORD.

diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
index 9a7afccefb8..c7ab86e228b 100644
--- a/gcc/rtlanal.c
+++ b/gcc/rtlanal.c
@@ -4207,18 +4207,23 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code 
outer_code,
   const char *fmt;
   int total;
   int factor;
+  unsigned mode_size;
 
   if (x == 0)
 return 0;
 
-  if (GET_MODE (x) != VOIDmode)
+  if (GET_CODE (x) == SET)
+/* A SET doesn't have a mode, so let's look at the SET_DEST to get
+   the mode for the factor.  */
+mode = GET_MODE (SET_DEST (x));
+  else if (GET_MODE (x) != VOIDmode)
 mode = GET_MODE (x);
 
+  mode_size = estimated_poly_value (GET_MODE_SIZE (mode));
+
   /* A size N times larger than UNITS_PER_WORD likely needs N times as
  many insns, taking N times as long.  */
-  factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
-  if (factor == 0)
-factor = 1;
+  factor = mode_size > UNITS_PER_WORD ? mode_size / UNITS_PER_WORD : 1;
 
   /* Compute the default costs of certain things.
  Note that targetm.rtx_costs can override the defaults.  */
@@ -4243,14 +4248,6 @@ rtx_cost (rtx x, machine_mode mode, enum rtx_code 
outer_code,
   /* Used in combine.c as a marker.  */
   total = 0;
   break;
-case SET:
-  /* A SET doesn't have a mode, so let's look at the SET_DEST to get
-the mode for the factor.  */
-  mode = GET_MODE (SET_DEST (x));
-  factor = estimated_poly_value (GET_MODE_SIZE (mode)) / UNITS_PER_WORD;
-  if (factor == 0)
-   factor = 1;
-  /* FALLTHRU */
 default:
   total = factor * COSTS_N_INSNS (1);
 }


[PATCH v2] object lifetime instrumentation for Valgrind [PR66487]

2023-12-22 Thread Alexander Monakov
From: Daniil Frolov 

PR 66487 is asking to provide sanitizer-like detection for C++ object
lifetime violations that are worked around with -fno-lifetime-dse or
-flifetime-dse=1 in Firefox, LLVM (PR 106943), OpenJade (PR 69534).

The discussion in the PR was centered around extending MSan, but MSan
was not ported to GCC (and requires rebuilding everything with
instrumentation).

Instead, allow Valgrind to see lifetime boundaries by emitting client
requests along *this = { CLOBBER }.  The client request marks the
"clobbered" memory as undefined for Valgrind; clobbering assignments
mark the beginning of ctor and end of dtor execution for C++ objects.
Hence, attempts to read object storage after the destructor, or
"pre-initialize" its fields prior to the constructor will be caught.

Valgrind client requests are offered as macros that emit inline asm.
For use in code generation, let's wrap them as libgcc builtins.

gcc/ChangeLog:

* Makefile.in (OBJS): Add gimple-valgrind-interop.o.
* builtins.def (BUILT_IN_VALGRIND_MAKE_UNDEFINED): New.
* common.opt (-fvalgrind-annotations): New option.
* doc/install.texi (--enable-valgrind-interop): Document.
* doc/invoke.texi (-fvalgrind-annotations): Document.
* passes.def (pass_instrument_valgrind): Add.
* tree-pass.h (make_pass_instrument_valgrind): Declare.
* gimple-valgrind-interop.cc: New file.

libgcc/ChangeLog:

* Makefile.in (LIB2ADD_ST): Add valgrind-interop.c.
* config.in: Regenerate.
* configure: Regenerate.
* configure.ac (--enable-valgrind-interop): New flag.
* libgcc2.h (__gcc_vgmc_make_mem_undefined): Declare.
* valgrind-interop.c: New file.

gcc/testsuite/ChangeLog:

* g++.dg/valgrind-annotations-1.C: New test.
* g++.dg/valgrind-annotations-2.C: New test.

Co-authored-by: Alexander Monakov 
---
Changes in v2:

* Take new clobber kinds into account.
* Do not link valgrind-interop.o into libgcc_s.so.

 gcc/Makefile.in   |   1 +
 gcc/builtins.def  |   3 +
 gcc/common.opt|   4 +
 gcc/doc/install.texi  |   5 +
 gcc/doc/invoke.texi   |  27 
 gcc/gimple-valgrind-interop.cc| 125 ++
 gcc/passes.def|   1 +
 gcc/testsuite/g++.dg/valgrind-annotations-1.C |  22 +++
 gcc/testsuite/g++.dg/valgrind-annotations-2.C |  12 ++
 gcc/tree-pass.h   |   1 +
 libgcc/Makefile.in|   3 +
 libgcc/config.in  |   6 +
 libgcc/configure  |  22 ++-
 libgcc/configure.ac   |  15 ++-
 libgcc/libgcc2.h  |   2 +
 libgcc/valgrind-interop.c |  40 ++
 16 files changed, 287 insertions(+), 2 deletions(-)
 create mode 100644 gcc/gimple-valgrind-interop.cc
 create mode 100644 gcc/testsuite/g++.dg/valgrind-annotations-1.C
 create mode 100644 gcc/testsuite/g++.dg/valgrind-annotations-2.C
 create mode 100644 libgcc/valgrind-interop.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 9373800018..d027548203 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1507,6 +1507,7 @@ OBJS = \
gimple-ssa-warn-restrict.o \
gimple-streamer-in.o \
gimple-streamer-out.o \
+   gimple-valgrind-interop.o \
gimple-walk.o \
gimple-warn-recursion.o \
gimplify.o \
diff --git a/gcc/builtins.def b/gcc/builtins.def
index f03df32f98..b05e20e062 100644
--- a/gcc/builtins.def
+++ b/gcc/builtins.def
@@ -1194,6 +1194,9 @@ DEF_GCC_BUILTIN (BUILT_IN_LINE, "LINE", BT_FN_INT, 
ATTR_NOTHROW_LEAF_LIST)
 /* Control Flow Redundancy hardening out-of-line checker.  */
 DEF_BUILTIN_STUB (BUILT_IN___HARDCFR_CHECK, "__builtin___hardcfr_check")
 
+/* Wrappers for Valgrind client requests.  */
+DEF_EXT_LIB_BUILTIN (BUILT_IN_VALGRIND_MAKE_UNDEFINED, 
"__gcc_vgmc_make_mem_undefined", BT_FN_VOID_PTR_SIZE, ATTR_NOTHROW_LEAF_LIST)
+
 /* Synchronization Primitives.  */
 #include "sync-builtins.def"
 
diff --git a/gcc/common.opt b/gcc/common.opt
index d263a959df..2be5b8d0a6 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -3377,6 +3377,10 @@ Enum(auto_init_type) String(pattern) 
Value(AUTO_INIT_PATTERN)
 EnumValue
 Enum(auto_init_type) String(zero) Value(AUTO_INIT_ZERO)
 
+fvalgrind-annotations
+Common Var(flag_valgrind_annotations) Optimization
+Annotate lifetime boundaries with Valgrind client requests.
+
 ; -fverbose-asm causes extra commentary information to be produced in
 ; the generated assembly code (to make it more readable).  This option
 ; is generally only of use to those who actually need to read the
diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi
index d20b43a5b2..d6e5e5fdaf 100644
--- a/gc

[PATCH 0/1] Detecting lifetime-dse issues via Valgrind [PR66487]

2023-12-08 Thread Alexander Monakov
I would like to propose Valgrind integration previously sent as RFC for trunk.

Arsen and Sam, since you commented on the RFC I wonder if you can have
a look at the proposed configure and documentation changes and let me
know if they look fine for you? For reference, gccinstall.info will say:

‘--enable-valgrind-interop’
 Provide wrappers for Valgrind client requests in libgcc, which are
 used for ‘-fvalgrind-annotations’.  Requires Valgrind header files
 for the target (in the build-time sysroot if building a
 cross-compiler).

and GCC manual will document the new option as:

 -fvalgrind-annotations
 Emit Valgrind client requests annotating object lifetime
 boundaries.  This makes it possible to detect attempts to access
 fields of a C++ object after its destructor has completed (but its
 storage has not been deallocated yet), or to initialize its fields in
 advance from "operator new" rather than from the constructor.

 This instrumentation relies on the presence of the
 "__gcc_vgmc_make_mem_undefined" function that wraps the
 corresponding Valgrind client request. It is provided by libgcc
 when it is configured with --enable-valgrind-interop.  Otherwise,
 you can implement it like this:

 #include <valgrind/memcheck.h>

 void
 __gcc_vgmc_make_mem_undefined (void *addr, size_t size)
 {
   VALGRIND_MAKE_MEM_UNDEFINED (addr, size);
 }

Changes since the RFC:

* Add documentation and tests.

* Drop 'emit-' from -fvalgrind-emit-annotations.

* Use --enable-valgrind-interop instead of overloading
  --enable-valgrind-annotations.

* Do not build the wrapper unless --enable-valgrind-interop is given and
  Valgrind headers are present.

* Clean up libgcc configure changes.
* Reword comments.

Daniil Frolov (1):
  object lifetime instrumentation for Valgrind [PR66487]

 gcc/Makefile.in   |   1 +
 gcc/builtins.def  |   3 +
 gcc/common.opt|   4 +
 gcc/doc/install.texi  |   5 +
 gcc/doc/invoke.texi   |  27 +
 gcc/gimple-valgrind-interop.cc| 112 ++
 gcc/passes.def|   1 +
 gcc/testsuite/g++.dg/valgrind-annotations-1.C |  22 
 gcc/testsuite/g++.dg/valgrind-annotations-2.C |  12 ++
 gcc/tree-pass.h   |   1 +
 libgcc/Makefile.in|   3 +
 libgcc/config.in  |   6 +
 libgcc/configure  |  22 +++-
 libgcc/configure.ac   |  15 ++-
 libgcc/libgcc2.h  |   2 +
 libgcc/valgrind-interop.c |  40 +++
 16 files changed, 274 insertions(+), 2 deletions(-)
 create mode 100644 gcc/gimple-valgrind-interop.cc
 create mode 100644 gcc/testsuite/g++.dg/valgrind-annotations-1.C
 create mode 100644 gcc/testsuite/g++.dg/valgrind-annotations-2.C
 create mode 100644 libgcc/valgrind-interop.c

-- 
2.39.2



[PATCH 1/1] object lifetime instrumentation for Valgrind [PR66487]

2023-12-08 Thread Alexander Monakov
From: Daniil Frolov 

PR 66487 is asking to provide sanitizer-like detection for C++ object
lifetime violations that are worked around with -fno-lifetime-dse or
-flifetime-dse=1 in Firefox, LLVM (PR 106943), OpenJade (PR 69534).

The discussion in the PR was centered around extending MSan, but MSan
was not ported to GCC (and requires rebuilding everything with
instrumentation).

Instead, allow Valgrind to see lifetime boundaries by emitting client
requests along *this = { CLOBBER }.  The client request marks the
"clobbered" memory as undefined for Valgrind; clobbering assignments
mark the beginning of ctor and end of dtor execution for C++ objects.
Hence, attempts to read object storage after the destructor, or
"pre-initialize" its fields prior to the constructor will be caught.

Valgrind client requests are offered as macros that emit inline asm.
For use in code generation, let's wrap them as libgcc builtins.

gcc/ChangeLog:

* Makefile.in (OBJS): Add gimple-valgrind-interop.o.
* builtins.def (BUILT_IN_VALGRIND_MAKE_UNDEFINED): New.
* common.opt (-fvalgrind-annotations): New option.
* doc/install.texi (--enable-valgrind-interop): Document.
* doc/invoke.texi (-fvalgrind-annotations): Document.
* passes.def (pass_instrument_valgrind): Add.
* tree-pass.h (make_pass_instrument_valgrind): Declare.
* gimple-valgrind-interop.cc: New file.

libgcc/ChangeLog:

* Makefile.in (LIB2ADD): Add valgrind-interop.c.
* config.in: Regenerate.
* configure: Regenerate.
* configure.ac (--enable-valgrind-interop): New flag.
* libgcc2.h (__gcc_vgmc_make_mem_undefined): Declare.
* valgrind-interop.c: New file.

gcc/testsuite/ChangeLog:

* g++.dg/valgrind-annotations-1.C: New test.
* g++.dg/valgrind-annotations-2.C: New test.

Co-authored-by: Alexander Monakov 
---
 gcc/Makefile.in   |   1 +
 gcc/builtins.def  |   3 +
 gcc/common.opt|   4 +
 gcc/doc/install.texi  |   5 +
 gcc/doc/invoke.texi   |  27 +
 gcc/gimple-valgrind-interop.cc| 112 ++
 gcc/passes.def|   1 +
 gcc/testsuite/g++.dg/valgrind-annotations-1.C |  22 
 gcc/testsuite/g++.dg/valgrind-annotations-2.C |  12 ++
 gcc/tree-pass.h   |   1 +
 libgcc/Makefile.in|   3 +
 libgcc/config.in  |   6 +
 libgcc/configure  |  22 +++-
 libgcc/configure.ac   |  15 ++-
 libgcc/libgcc2.h  |   2 +
 libgcc/valgrind-interop.c |  40 +++
 16 files changed, 274 insertions(+), 2 deletions(-)
 create mode 100644 gcc/gimple-valgrind-interop.cc
 create mode 100644 gcc/testsuite/g++.dg/valgrind-annotations-1.C
 create mode 100644 gcc/testsuite/g++.dg/valgrind-annotations-2.C
 create mode 100644 libgcc/valgrind-interop.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 68410a86af..4db18387c1 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1506,6 +1506,7 @@ OBJS = \
gimple-ssa-warn-restrict.o \
gimple-streamer-in.o \
gimple-streamer-out.o \
+   gimple-valgrind-interop.o \
gimple-walk.o \
gimple-warn-recursion.o \
gimplify.o \
diff --git a/gcc/builtins.def b/gcc/builtins.def
index f03df32f98..b05e20e062 100644
--- a/gcc/builtins.def
+++ b/gcc/builtins.def
@@ -1194,6 +1194,9 @@ DEF_GCC_BUILTIN (BUILT_IN_LINE, "LINE", BT_FN_INT, 
ATTR_NOTHROW_LEAF_LIST)
 /* Control Flow Redundancy hardening out-of-line checker.  */
 DEF_BUILTIN_STUB (BUILT_IN___HARDCFR_CHECK, "__builtin___hardcfr_check")
 
+/* Wrappers for Valgrind client requests.  */
+DEF_EXT_LIB_BUILTIN (BUILT_IN_VALGRIND_MAKE_UNDEFINED, 
"__gcc_vgmc_make_mem_undefined", BT_FN_VOID_PTR_SIZE, ATTR_NOTHROW_LEAF_LIST)
+
 /* Synchronization Primitives.  */
 #include "sync-builtins.def"
 
diff --git a/gcc/common.opt b/gcc/common.opt
index f070aff8cb..b53565fc1a 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -3372,6 +3372,10 @@ Enum(auto_init_type) String(pattern) 
Value(AUTO_INIT_PATTERN)
 EnumValue
 Enum(auto_init_type) String(zero) Value(AUTO_INIT_ZERO)
 
+fvalgrind-annotations
+Common Var(flag_valgrind_annotations) Optimization
+Annotate lifetime boundaries with Valgrind client requests.
+
 ; -fverbose-asm causes extra commentary information to be produced in
 ; the generated assembly code (to make it more readable).  This option
 ; is generally only of use to those who actually need to read the
diff --git a/gcc/doc/install.texi b/gcc/doc/install.texi
index c1128d9274..aaf0213bbf 100644
--- a/gcc/doc/install.texi
+++ b/gcc/doc/install.texi
@@ -1567,6 +1567,11 @@ Disable TM clone registry in libgcc. It is

Re: [PATCH 0/1] Detecting lifetime-dse issues via Valgrind [PR66487]

2023-12-08 Thread Alexander Monakov


On Fri, 8 Dec 2023, Jakub Jelinek wrote:

> Does VALGRIND_MAKE_MEM_UNDEFINED macro ever change on arches once implemented
> there?

It seems Valgrind folks take binary compatibility seriously, so that sounds
unlikely.

> Wouldn't this be better done by emitting the sequence inline?
> Even if it is done in libgcc, it is part of ABI.

I'd rather keep it as simple as possible. We could drop the libgcc parts,
users can drop in the wrapper as explained in the manual.

> So, basically add a new optab, valgrind_request, where each target would
> define_insn whatever is needed (it will need to be a single pattern, it
> can't be split among multiple) and sorry on -fvalgrind-annotations if the
> optab is not defined.

There are going to be complications since the request needs a descriptor
structure (on the stack), plus it needs more effort on the GCC side than
Valgrind side (when Valgrind is ported to a new target). I'd rather not
go that way.

> Advantage would be that --enable-valgrind-interop nor building against
> valgrind headers is not needed.

Alternatively, how about synthesizing an auxiliary translation unit with
the wrapper from the driver for -fvalgrind-annotations?

> In your version, did the new function go just to libgcc.a or to
> libgcc_s.so.1?

It participates in libgcc_s link, but it's not listed in the version script,
so it's not exported from libgcc_s (and -gc-sections should eliminate it).

Alexander


Re: [PATCH 0/1] Detecting lifetime-dse issues via Valgrind [PR66487]

2023-12-08 Thread Alexander Monakov



On Fri, 8 Dec 2023, Jakub Jelinek wrote:

> On Fri, Dec 08, 2023 at 06:43:19PM +0300, Alexander Monakov wrote:
> > On Fri, 8 Dec 2023, Jakub Jelinek wrote:
> > > In your version, did the new function go just to libgcc.a or to
> > > libgcc_s.so.1?
> > 
> > It participates in libgcc_s link, but it's not listed in the version script,
> > so it's not exported from libgcc_s (and -gc-sections should eliminate it).
> 
> Then it at least should not participate in that link.
> There are various other objects which are libgcc.a only (e.g. all of dfp
> stuff, etc.).

Thanks, changing

LIB2ADD += $(srcdir)/valgrind-interop.c

to

LIB2ADD_ST += $(srcdir)/valgrind-interop.c

in my tree achieved that.

Alexander


Re: [PATCH] c++: End lifetime of objects in constexpr after destructor call [PR71093]

2023-12-10 Thread Alexander Monakov

On Sun, 10 Dec 2023, Richard Biener wrote:

> > It seems wrong to me: CLOBBER_EOL is documented to mean that the storage is
> > expiring at that point as well, which a (pseudo-)destructor does not imply;
> > it's perfectly valid to destroy an object and then create another in the
> > same storage.
> > 
> > We probably do want another clobber kind for end of object lifetime. And/or
> > one for beginning of object lifetime.
> 
> There’s not much semantically different between UNDEF and end of object but
> not storage lifetime?  At least for what middle-end optimizations do.
> 
> EOL is used by stack slot sharing and that operates on the underlying storage,
> not individual objects live in it.

I thought EOL implies that ASan may poison underlying memory. In the respin
of the Valgrind interop patch we instrument CLOBBER_UNDEF, but not CLOBBER_EOL.

Alexander

Re: [PATCH] c++: End lifetime of objects in constexpr after destructor call [PR71093]

2023-12-12 Thread Alexander Monakov



On Tue, 12 Dec 2023, Jakub Jelinek wrote:

> On Mon, Dec 11, 2023 at 05:00:50PM -0500, Jason Merrill wrote:
> > In discussion of PR71093 it came up that more clobber_kind options would be
> > useful within the C++ front-end.
> > 
> > gcc/ChangeLog:
> > 
> > * tree-core.h (enum clobber_kind): Rename CLOBBER_EOL to
> > CLOBBER_STORAGE_END.  Add CLOBBER_STORAGE_BEGIN,
> > CLOBBER_OBJECT_BEGIN, CLOBBER_OBJECT_END.
> > * gimple-lower-bitint.cc
> > * gimple-ssa-warn-access.cc
> > * gimplify.cc
> > * tree-inline.cc
> > * tree-ssa-ccp.cc: Adjust for rename.

Doesn't build_clobber_this in the C++ front-end need to be adjusted too?
I think it is used to place clobbers at start of the ctor (should be
CLOBBER_OBJECT_BEGIN in the new nomenclature) and end of the dtor (i.e.
CLOBBER_OBJECT_END).

Alexander


Re: Disable FMADD in chains for Zen4 and generic

2023-12-12 Thread Alexander Monakov

On Tue, 12 Dec 2023, Richard Biener wrote:

> On Tue, Dec 12, 2023 at 3:38 PM Jan Hubicka  wrote:
> >
> > Hi,
> > this patch disables use of FMA in matrix multiplication loop for generic 
> > (for
> > x86-64-v3) and zen4.  I tested this on zen4 and Xenon Gold Gold 6212U.
> >
> > For Intel this is neutral both on the matrix multiplication microbenchmark
> > (attached) and spec2k17 where the difference was within noise for Core.
> >
> > On core the micro-benchmark runs as follows:
> >
> > With FMA:
> >
> >578,500,241  cycles:u #3.645 GHz 
> > ( +-  0.12% )
> >753,318,477  instructions:u   #1.30  insn 
> > per cycle  ( +-  0.00% )
> >125,417,701  branches:u   #  790.227 M/sec   
> > ( +-  0.00% )
> >   0.159146 +- 0.000363 seconds time elapsed  ( +-  0.23% )
> >
> >
> > No FMA:
> >
> >577,573,960  cycles:u #3.514 GHz 
> > ( +-  0.15% )
> >878,318,479  instructions:u   #1.52  insn 
> > per cycle  ( +-  0.00% )
> >125,417,702  branches:u   #  763.035 M/sec   
> > ( +-  0.00% )
> >   0.164734 +- 0.000321 seconds time elapsed  ( +-  0.19% )
> >
> > So the cycle count is unchanged and discrete multiply+add takes same time 
> > as FMA.
> >
> > While on zen:
> >
> >
> > With FMA:
> >  484875179  cycles:u #3.599 GHz 
> >  ( +-  0.05% )  (82.11%)
> >  752031517  instructions:u   #1.55  insn 
> > per cycle
> >  125106525  branches:u   #  928.712 M/sec   
> >  ( +-  0.03% )  (85.09%)
> > 128356  branch-misses:u  #0.10% of all 
> > branches  ( +-  0.06% )  (83.58%)
> >
> > No FMA:
> >  375875209  cycles:u #3.592 GHz 
> >  ( +-  0.08% )  (80.74%)
> >  875725341  instructions:u   #2.33  insn 
> > per cycle
> >  124903825  branches:u   #1.194 G/sec   
> >  ( +-  0.04% )  (84.59%)
> >   0.105203 +- 0.000188 seconds time elapsed  ( +-  0.18% )
> >
> > The difference is that Cores understand the fact that fmadd does not need
> > all three parameters to start computation, while Zen cores don't.
> 
> This came up in a separate thread as well, but when doing reassoc of a
> chain with multiple dependent FMAs.

> I can't understand how this uarch detail can affect performance when as in
> the testcase the longest input latency is on the multiplication from a
> memory load.

The latency from the memory operand doesn't matter since it's not a part
of the critical path. The memory uop of the FMA starts executing as soon
as the address is ready.

> Do we actually understand _why_ the FMAs are slower here?

It's simple, on Zen4 FMA has latency 4 while add has latency 3, and you
clearly see it in the quoted numbers: zen-with-fma has slightly below 4
cycles per branch, zen-without-fma has exactly 3 cycles per branch.

Please refer to uops.info for latency data:
https://uops.info/html-instr/VMULPS_YMM_YMM_YMM.html
https://uops.info/html-instr/VFMADD231PS_YMM_YMM_YMM.html

> Do we know that Cores can start the multiplication part when the add
> operand isn't ready yet?  I'm curious how you set up a micro benchmark to
> measure this.

Unlike some of the Arm cores, none of x86 cores can consume the addend
of an FMA on a later cycle than the multiplicands, with Alder Lake-E
being the sole exception, apparently (see 6/10/10 latencies in the
aforementioned uops.info FMA page).

> There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMA per
> cycle.  So in theory we can at most do 2 FMA per cycle but with latency
> (FMA) == 4 for Zen3/4 and latency (FADD/FMUL) == 3 we might be able to
> squeeze out a little bit more throughput when there are many FADD/FMUL ops
> to execute?  That works independent on whether FMAs have a head-start on
> multiplication as you'd still be bottle-necked on the 2-wide issue for
> FMA?

It doesn't matter here since all FMAs/FMULs are dependent on each other
so the processor can start a new FMA only each 4th (or 3rd cycle), except
when starting a new iteration of the outer loop.
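
For concreteness, the dependence pattern is a serial accumulation like the
following sketch (my paraphrase of the benchmark kernel, not its source):

  /* Each iteration's FMA consumes the previous value of 'acc', so the
     loop runs at one FMA per 4 cycles on Zen 4 (FMA latency), versus
     one FADD per 3 cycles when the multiply is issued separately.  */
  float
  dot (const float *a, const float *b, int n)
  {
    float acc = 0.f;
    for (int i = 0; i < n; i++)
      acc += a[i] * b[i];  /* may be contracted into a single FMA */
    return acc;
  }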

> On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have a
> latency of four.  So you should get worse results there (looking at the
> numbers above you do get worse results, slightly so), probably the higher
> number of uops is hidden by the latency.

A simple solution would be to enable AVOID_FMA_CHAINS when FMA latency 
exceeds FMUL latency (all Zens and Broadwell).

> > Since this seems noticeable win on zen and not loss on Core it seems like 
> > go

[PATCH] sched-deps: do not schedule pseudos across calls [PR108117]

2023-01-13 Thread Alexander Monakov


On Fri, 23 Dec 2022, Jose E. Marchesi wrote:

> > +1 for trying this FWIW.  There's still plenty of time to try an
> > alternative solution if there are unexpected performance problems.
> 
> Let me see if Alexander's patch fixes the issue at hand (it must) and
> will also do some regression testing.

Hi, I'm not sure at which court the ball is, but in the interest at moving
things forward here's the complete patch with the testcase. OK to apply?

---8<---

From: Alexander Monakov 
Date: Fri, 13 Jan 2023 21:04:02 +0300
Subject: [PATCH] sched-deps: do not schedule pseudos across calls [PR108117]

Scheduling across calls in the pre-RA scheduler is problematic: we do
not take liveness info into account, and are thus prone to extending
lifetime of a pseudo over the loop, requiring a callee-saved hardreg
or causing a spill.

If the current function calls setjmp, lifting an assignment over a call
may be incorrect if a longjmp happens before the assignment.

Thanks to Jose Marchesi for testing on AArch64.

gcc/ChangeLog:

PR rtl-optimization/108117
PR rtl-optimization/108132
* sched-deps.cc (deps_analyze_insn): Do not schedule across
calls before reload.

gcc/testsuite/ChangeLog:

PR rtl-optimization/108117
PR rtl-optimization/108132
* gcc.dg/pr108117.c: New test.
---
 gcc/sched-deps.cc   |  9 -
 gcc/testsuite/gcc.dg/pr108117.c | 30 ++
 2 files changed, 38 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr108117.c

diff --git a/gcc/sched-deps.cc b/gcc/sched-deps.cc
index 948aa0c3b..5dc4fa4cd 100644
--- a/gcc/sched-deps.cc
+++ b/gcc/sched-deps.cc
@@ -3688,7 +3688,14 @@ deps_analyze_insn (class deps_desc *deps, rtx_insn *insn)
 
   CANT_MOVE (insn) = 1;
 
-  if (find_reg_note (insn, REG_SETJMP, NULL))
+  if (!reload_completed)
+   {
+ /* Scheduling across calls may increase register pressure by extending
+live ranges of pseudos over the call.  Worse, in presence of setjmp
+it may incorrectly move up an assignment over a longjmp.  */
+ reg_pending_barrier = MOVE_BARRIER;
+   }
+  else if (find_reg_note (insn, REG_SETJMP, NULL))
 {
   /* This is setjmp.  Assume that all registers, not just
  hard registers, may be clobbered by this call.  */
diff --git a/gcc/testsuite/gcc.dg/pr108117.c b/gcc/testsuite/gcc.dg/pr108117.c
new file mode 100644
index 0..ae151693e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr108117.c
@@ -0,0 +1,30 @@
+/* { dg-do run } */
+/* { dg-require-effective-target nonlocal_goto } */
+/* { dg-options "-O2 -fschedule-insns" } */
+
+#include <setjmp.h>
+#include <stdio.h>
+
+jmp_buf ex_buf;
+
+__attribute__((noipa))
+void fn_throw(int x)
+{
+   if (x)
+  longjmp(ex_buf, 1);
+}
+
+int main(void)
+{
+int vb = 0; // NB: not volatile, not modified after setjmp
+
+if (!setjmp(ex_buf)) {
+fn_throw(1);
+vb = 1; // not reached in the abstract machine
+}
+
+if (vb) {
+printf("Failed, vb = %d!\n", vb);
+return 1;
+}
+}
-- 
2.37.2



Re: [PATCH 1/3] tree-ssa-sink: do not sink to in front of setjmp

2023-11-08 Thread Alexander Monakov

On Wed, 8 Nov 2023, Richard Biener wrote:

> >> --- /dev/null
> >> +++ b/gcc/testsuite/gcc.dg/setjmp-7.c
> >> @@ -0,0 +1,13 @@
> >> +/* { dg-do compile } */
> >> +/* { dg-options "-O2 -fno-guess-branch-probability -w" } */
> >> +/* { dg-require-effective-target indirect_jumps } */
> >> +
> >> +struct __jmp_buf_tag { };
> >> +typedef struct __jmp_buf_tag jmp_buf[1];
> >> +struct globals { jmp_buf listingbuf; };
> >> +extern struct globals *const ptr_to_globals;
> >> +void foo()
> >> +{
> >> +if ( _setjmp ( ((*ptr_to_globals).listingbuf )))
> >> +;
> >> +}
> > 
> > Is the implicit declaration of _setjmp important to this test?
> > Could we declare it explicitly instead?
> 
> It shouldn’t be important.

Yes, it's an artifact from testcase minimization, sorry about that.

Florian, I see you've sent a patch to fix this up — thank you!

Alexander

Re: PING^1 [PATCH v3] sched: Change no_real_insns_p to no_real_nondebug_insns_p [PR108273]

2023-11-09 Thread Alexander Monakov


On Thu, 9 Nov 2023, Maxim Kuvyrkov wrote:

> Hi Kewen,
> 
> Below are my comments.  I don't want to override Alexander's review, and if
> the patch looks good to him, it's fine to ignore my concerns.
> 
> My main concern is that this adds a new entity -- forceful skipping of
> DEBUG_INSN-only basic blocks -- to the scheduler for a somewhat minor change
> in behavior.  Unlike NOTEs and LABELs, DEBUG_INSNs are INSNS, and there is
> already quite a bit of logic in the scheduler to skip them _as part of normal
> operation_.

I agree with the concern. I hoped that solving the problem by skipping the BB
like the (bit-rotted) debug code needs to would be a minor surgery. As things
look now, it may be better to remove the non-working sched_block debug counter
entirely and implement a good solution for the problem at hand.

> 
> Would you please consider 2 ideas below.
> 
> #1:
> After a brief look, I'm guessing this part is causing the problem:
> haifa-sched.cc :schedule_block():
> === [1]
>   /* Loop until all the insns in BB are scheduled.  */
>   while ((*current_sched_info->schedule_more_p) ())
> {
>   perform_replacements_new_cycle ();
>   do
>   {
> start_clock_var = clock_var;
> 
> clock_var++;
> 
> advance_one_cycle ();

As I understand, we have spurious calls to advance_one_cycle on basic block
boundaries, which don't model the hardware (the CPU doesn't see BB boundaries)
and cause divergence when passing through a debug-only BB which would not be
present at all without -g.

Since EBBs and regions may not have jump targets in the middle, advancing
a cycle on BB boundaries does not seem well motivated. Can we remove it?

Can we teach haifa-sched to emit RTX NOTEs with hashes of DFA states on BB
boundaries when -fcompare-debug is enabled? It should make the problem
readily detectable by -fcompare-debug even when scheduling did not diverge.

Alexander


Re: PING^1 [PATCH v3] sched: Change no_real_insns_p to no_real_nondebug_insns_p [PR108273]

2023-11-10 Thread Alexander Monakov


On Thu, 9 Nov 2023, Jeff Law wrote:

> > Yeah, I noticed that the scheduler takes care of DEBUG_INSNs as normal
> > operations.  When I started to work on this issue, initially I wanted to try
> > something similar to your idea #2, but when checking the APIs, I realized
> > why not just skip the basic block with NOTEs and LABELs, DEBUG_INSNs as
> > well.  IMHO there is no value to try to schedule this kind of BB (to be
> > scheduled range), skipping it can save some resource allocation (like block
> > dependencies) and make it more efficient (not enter function schedule_block
> > etc.), from this perspective it seems an enhancement.  Does it sound
> > reasonable to you?
> It sounds reasonable, but only if doing so doesn't add significant
> implementation complexity.  ie, the gains from doing less work here are likely
> to be very marginal, so I'm more interested in clean, easy to maintain code.

I'm afraid ignoring debug-only BBs goes contrary to overall var-tracking design:
DEBUG_INSNs participate in dependency graph so that schedulers can remove or
mutate them as needed when moving real insns across them.

Cc'ing Alexandre Oliva who can correct me on that if necessary.

Alexander


Re: PING^1 [PATCH v3] sched: Change no_real_insns_p to no_real_nondebug_insns_p [PR108273]

2023-11-10 Thread Alexander Monakov


On Fri, 10 Nov 2023, Richard Biener wrote:

> > I'm afraid ignoring debug-only BBs goes contrary to overall var-tracking 
> > design:
> > DEBUG_INSNs participate in dependency graph so that schedulers can remove or
> > mutate them as needed when moving real insns across them.
> 
> Note that debug-only BBs do not exist - the BB would be there even without 
> debug
> insns!

Yep, sorry, I misspoke when I earlier said

>> and cause divergence when passing through a debug-only BB which would not be
>> present at all without -g.

They are present in the region, but skipped via no_real_insns_p.

> So instead you have to handle BBs with just debug insns the same you
> handle a completely empty BB.

Yeah. There would be no problem if the scheduler never used no_real_insns_p
and handled empty and non-empty BBs the same way.

Alexander


Re: PING^1 [PATCH v3] sched: Change no_real_insns_p to no_real_nondebug_insns_p [PR108273]

2023-11-10 Thread Alexander Monakov

On Fri, 10 Nov 2023, Richard Biener wrote:

> On Fri, Nov 10, 2023 at 3:18 PM Alexander Monakov  wrote:
> >
> >
> > On Fri, 10 Nov 2023, Richard Biener wrote:
> >
> > > > I'm afraid ignoring debug-only BBs goes contrary to overall 
> > > > var-tracking design:
> > > > DEBUG_INSNs participate in dependency graph so that schedulers can 
> > > > remove or
> > > > mutate them as needed when moving real insns across them.
> > >
> > > Note that debug-only BBs do not exist - the BB would be there even 
> > > without debug
> > > insns!
> >
> > Yep, sorry, I misspoke when I earlier said
> >
> > >> and cause divergence when passing through a debug-only BB which would 
> > >> not be
> > >> present at all without -g.
> >
> > They are present in the region, but skipped via no_real_insns_p.
> >
> > > So instead you have to handle BBs with just debug insns the same you
> > > handle a completely empty BB.
> >
> > Yeah. There would be no problem if the scheduler never used no_real_insns_p
> > and handled empty and non-empty BBs the same way.
> 
> And I suppose it would be OK to do that.  Empty BBs are usually removed by
> CFG cleanup so the situation should only happen in rare corner cases where
> the fix would be to actually run CFG cleanup ...

Yeah, sel-sched invokes 'cfg_cleanup (0)' up front, and I suppose that
may be a preferable compromise for sched-rgn as well.

I'm afraid one does not simply remove all uses of no_real_insns_p from
sched-rgn, but would be happy to be wrong about that.

Alexander

Re: [RFC PATCH] Detecting lifetime-dse issues via Valgrind

2023-11-12 Thread Alexander Monakov

On Sat, 11 Nov 2023, Arsen Arsenović wrote:

> > +#else
> > +# define VALGRIND_MAKE_MEM_UNDEFINED(ptr, sz) __builtin_trap ()
> > +#endif
> > +
> > +void __valgrind_make_mem_undefined (void *ptr, unsigned long sz)
> > +{
> > +  VALGRIND_MAKE_MEM_UNDEFINED (ptr, sz);
> > +}
> 
> Would it be preferable to have a link-time error here if missing?

Indeed, thank you for the suggestion, will keep that in mind for resending.
That will allow to notice the problem earlier, and the user will be able
to drop in this snippet in their project to resolve the issue.

Alexander

Re: [RFC PATCH] Detecting lifetime-dse issues via Valgrind [PR66487]

2023-11-12 Thread Alexander Monakov


On Sat, 11 Nov 2023, Sam James wrote:

> > Valgrind client requests are offered as macros that emit inline asm.  For 
> > use
> > in code generation, we need to wrap it in a built-in.  We know that 
> > implementing
> > such a built-in in libgcc is undesirable, [...].
> 
> Perhaps less objectionable than you think, at least given the new CFR
> stuff from oliva from the other week that landed.

Yeah; we haven't found any better solution anyway.

> This is a really neat idea (it also makes me wonder if there's any other
> opportunities for Valgrind integration like this?).

To (attempt to) answer the parenthetical question, note that the patch is not
limited to instrumenting C++ cdtors; it annotates all lifetime CLOBBER marks,
so Valgrind should see lifetime boundaries of various on-stack arrays too.

(I hope positioning the new pass after build_ssa is sufficient to avoid
annotating too much, like non-address-taken local scalars)

> LLVM was the most recent example but it wasn't the first, and this has
> come up with LLVM in a few other places too (same root cause, wasn't
> obvious at all).

I'm very curious what you mean by "this has come up with LLVM [] too": ttbomk,
LLVM doesn't do such lifetime-based optimization yet, which is why compiling
LLVM with LLVM doesn't break it. Can you share some examples? Or do you mean
instances when libLLVM-miscompiled-with-GCC was linked elsewhere, and that
program crashed mysteriously as a result?

Indeed this work is inspired by the LLVM incident in PR 106943. Unfortunately
we don't see many other instances with -flifetime-dse workarounds in public.
Grepping Gentoo Portage reveals only openjade. Arch applies the workaround to
a jvm package too, and we know that Firefox and LLVM apply it on their own.

This patch finds the issue in LLVM and openjade; testing it on Spidermonkey
is TODO. Suggestions for other interesting tests would be welcome.

> > --- a/libgcc/configure.ac
> > +++ b/libgcc/configure.ac
> > @@ -269,6 +269,54 @@ GCC_CHECK_SJLJ_EXCEPTIONS
> >  GCC_CET_FLAGS(CET_FLAGS)
> >  AC_SUBST(CET_FLAGS)
> >  
> > +AC_CHECK_HEADER(valgrind.h, have_valgrind_h=yes, have_valgrind_h=no)
> 
> Consider using PKG_CHECK_MODULES and falling back to a manual search.

Thanks. The autotools bits in this patch are a one-to-one copy of the
pre-existing Valgrind detection in the 'gcc' subdirectory, where it's
necessary for running the compiler under Valgrind without false positives.

I guess the right solution is to move Valgrind detection into the top-level
'config' directory (and apply the cleanups you mention), but as we are not
familiar with autotools we simply copy-pasted it for this RFC.

With the patch, --enable-valgrind-annotations becomes "overloaded" to
simultaneously instrument the compiler and enhance libgcc to support
-fvalgrind-emit-annotations, but those are independent and in practice
people may need the latter without the former.

Alexander


Re: [RFC PATCH] Detecting lifetime-dse issues via Valgrind

2023-11-13 Thread Alexander Monakov


On Mon, 13 Nov 2023, Richard Biener wrote:

> Another generic comment - placing a built-in call probably pessimizes code
> generation unless we handle it specially during alias analysis (or in
> builtin_fnspec).

But considering the resulting code is intended to be run under Valgrind,
isn't slightly worse code quality acceptable? Note that we don't want loads
following the built-in to be optimized out; they are necessary, as they
will be flagged by Valgrind as attempts to read uninitialized memory.

I suspect positioning the pass immediately after build_ssa as we do now
is quite imperfect because we will then instrument 'x' in 

  void f()
  {
int x, *p;
p = &x;
  }

Ideally we'd position it such that more locals are put in SSA form,
but not too late to miss some UB, right? Perhaps after first pass_ccp?

> I also don't like having another pass for this - did you
> investigate to do the instrumentation at the point the CLOBBERs are
> introduced?

I don't see a better approach: some CLOBBERs are emitted by the C++
front-end via build_clobber_this, some by omp-expand, some during
gimplification. I'm not a fan of useless IR rescans either, but
this pass is supposed to run very rarely, not by default.

> Another possibility would be to make this more generic
> and emit the instrumentation when we lower GIMPLE_BIND during
> the GIMPLE lowering pass, you wouldn't then rely on the CLOBBERs
> some of which only appear when -fstack-reuse=none is not used.

The CLOBBERs that trigger on Firefox and LLVM are emitted not during
gimplification, but via build_clobber_this in the front-end.

Alexander


Re: PING^1 [PATCH v3] sched: Change no_real_insns_p to no_real_nondebug_insns_p [PR108273]

2023-11-15 Thread Alexander Monakov


On Wed, 15 Nov 2023, Kewen.Lin wrote:

> >> And I suppose it would be OK to do that.  Empty BBs are usually removed by
> >> CFG cleanup so the situation should only happen in rare corner cases where
> >> the fix would be to actually run CFG cleanup ...
> > 
> > Yeah, sel-sched invokes 'cleanup_cfg (0)' up front, and I suppose that
> > may be a preferable compromise for sched-rgn as well.
> 
> Inspired by this discussion, I tested the attached patch 1 which is to run
> cleanup_cfg (0) first in haifa_sched_init, it's bootstrapped and
> regress-tested on x86_64-redhat-linux and powerpc64{,le}-linux-gnu.

I don't think you can run cleanup_cfg after sched_init. I would suggest
putting it early in schedule_insns.

> Then I assumed some of the current uses of no_real_insns_p won't encounter
> empty blocks any more, so I made a patch 2 with some explicit assertions, but
> unfortunately I got ICEs during bootstrapping, happening in function
> compute_priorities.  I'm going to investigate further and post more
> findings, but this is just a heads-up to check whether this is on the right track.

I suspect this may be caused by invoking cleanup_cfg too late.

Alexander


Re: PING^1 [PATCH v3] sched: Change no_real_insns_p to no_real_nondebug_insns_p [PR108273]

2023-11-17 Thread Alexander Monakov


On Fri, 17 Nov 2023, Kewen.Lin wrote:
> > I don't think you can run cleanup_cfg after sched_init. I would suggest
> > putting it early in schedule_insns.
> 
> Thanks for the suggestion; I placed it at the beginning of haifa_sched_init
> instead, since schedule_insns invokes haifa_sched_init.  Although the
> calls to rgn_setup_common_sched_info and rgn_setup_sched_infos are executed
> ahead of it, they are all "setup" functions and shouldn't affect or be
> affected by this placement.

I was worried because sched_init invokes df_analyze, and I'm not sure if
cleanup_cfg can invalidate it.

> > I suspect this may be caused by invoking cleanup_cfg too late.
> 
> By looking into some failures, I found that although cleanup_cfg is executed,
> there would still be some empty blocks left; analyzing a few failures shows
> at least these cases:
>   1. empty function body
>   2. block holding a label for return.
>   3. block without any successor.
>   4. block which becomes empty after scheduling some other block.
>   5. block which looks mergeable with its always-taken successor but is left.
>   ...
> 
> For 1 and 2, there is a single successor EXIT block; I think they don't affect
> state transitions, and for 3 it's the same.  For 4, it depends on whether we
> can assume this kind of empty block never holds a debug insn (that is, any
> associated debug insn should be moved along); I'm not sure.  For 5,
> a reduced test case is:

Oh, I should have thought of cases like these; really sorry about the lapse
of attention, and thanks for showing a testcase for item 5. As Richard was
saying in his response, cleanup_cfg cannot be a fix here. The thing to check
would be changing no_real_insns_p to always return false and seeing if the
situation looks recoverable (if it breaks bootstrap, regtest statistics of
a non-bootstrapped compiler are still informative).

Alexander


Re: [PATCH 1/1] sched-deps.cc (find_modifiable_mems): Avoid exponential behavior

2023-11-20 Thread Alexander Monakov


On Mon, 20 Nov 2023, Maxim Kuvyrkov wrote:

> This patch avoids sched-deps.cc:find_inc() creating exponential number
> of dependencies, which become memory and compilation time hogs.
> Consider example (simplified from PR96388) ...
> ===
> sp=sp-4 // sp_insnA
> mem_insnA1[sp+A1]
> ...
> mem_insnAN[sp+AN]
> sp=sp-4 // sp_insnB
> mem_insnB1[sp+B1]
> ...
> mem_insnBM[sp+BM]
> ===
> ... in this example find_modifiable_mems() will arrange for mem_insnA*
> to be able to pass sp_insnA, and, while doing this, will create
> dependencies between all mem_insnA*s and sp_insnB -- because sp_insnB
> is a consumer of sp_insnA.  After this sp_insnB will have N new
> backward dependencies.
> Then find_modifiable_mems() gets to mem_insnB*s and starts to create
> N new dependencies for _every_ mem_insnB*.  This gets us N*M new
> dependencies.

It's a bit hard to read this without knowing which value of 'backwards'
is assumed.

Say 'backwards' is true and we are inspecting producer sp_insnB of mem_insnB1.
This is a true dependency. We know we can break it by adjusting B1 by -4, but
we need to be careful not to move such modified mem_insnB1 above sp_insnA, so
we need to iterate over *incoming true dependencies* of sp_insnB and add them.

But the code seems to be iterating over *all incoming dependencies*, so it
will e.g. take anti-dependency mem_insnA1 -> sp_insnB and create a true
dependency mem_insnA1 -> mem_insnB1'. This seems utterly inefficient, if my
understanding is correct.

Alexander


Re: [PATCH 1/1] sched-deps.cc (find_modifiable_mems): Avoid exponential behavior

2023-11-20 Thread Alexander Monakov


On Mon, 20 Nov 2023, Maxim Kuvyrkov wrote:

> > On Nov 20, 2023, at 17:52, Alexander Monakov  wrote:
> > 
> > 
> > On Mon, 20 Nov 2023, Maxim Kuvyrkov wrote:
> > 
> >> This patch avoids sched-deps.cc:find_inc() creating exponential number
> >> of dependencies, which become memory and compilation time hogs.
> >> Consider example (simplified from PR96388) ...
> >> ===
> >> sp=sp-4 // sp_insnA
> >> mem_insnA1[sp+A1]
> >> ...
> >> mem_insnAN[sp+AN]
> >> sp=sp-4 // sp_insnB
> >> mem_insnB1[sp+B1]
> >> ...
> >> mem_insnBM[sp+BM]
> >> ===
> >> ... in this example find_modifiable_mems() will arrange for mem_insnA*
> >> to be able to pass sp_insnA, and, while doing this, will create
> >> dependencies between all mem_insnA*s and sp_insnB -- because sp_insnB
> >> is a consumer of sp_insnA.  After this sp_insnB will have N new
> >> backward dependencies.
> >> Then find_modifiable_mems() gets to mem_insnB*s and starts to create
> >> N new dependencies for _every_ mem_insnB*.  This gets us N*M new
> >> dependencies.
> 
> [For avoidance of doubt, below discussion is about the general implementation
> of find_modifiable_mems() and not about the patch.]

I was saying the commit message is hard to read (unless it's just me).

> > It's a bit hard to read this without knowing which value of 'backwards'
> > is assumed.
> > 
> > Say 'backwards' is true and we are inspecting producer sp_insnB of 
> > mem_insnB1.
> > This is a true dependency. We know we can break it by adjusting B1 by -4, 
> > but
> > we need to be careful not to move such modified mem_insnB1 above sp_insnA, 
> > so
> > we need to iterate over *incoming true dependencies* of sp_insnB and add 
> > them.
> > 
> > But the code seems to be iterating over *all incoming dependencies*, so it
> > will e.g. take anti-dependency mem_insnA1 -> sp_insnB and create a true
> > dependency mem_insnA1 -> mem_insnB1'. This seems utterly inefficient, if my
> > understanding is correct.
> 
> Yeap, your understanding is correct.  However, this is what
> find_modifiable_mems() has to do to avoid complicated analysis of second-level
> dependencies.

What is the reason it cannot simply skip anti-dependencies in the
'if (backwards)' loop, and true dependencies in the 'else' loop?

Alexander


Re: [RFC PATCH] Detecting lifetime-dse issues via Valgrind

2023-11-20 Thread Alexander Monakov

On Mon, 13 Nov 2023, Richard Biener wrote:

> > Ideally we'd position it such that more locals are put in SSA form,
> > but not too late to miss some UB, right? Perhaps after first pass_ccp?
> 
> I guess it’s worth experimenting.  Even doing it right before RTL expansion
> might work.  Note if you pick ccp you have to use a separate place for -O0

While Daniil is experimenting with this, I want to raise my concern about
attempting this instrumentation too late. Consider the main thing we are
trying to catch:

// inlined operator new:
this->foo = 42;
// inlined constructor:
*this = { CLOBBER };
// caller:
int tmp = this->foo;
return tmp;

Our instrumentation adds

__valgrind_make_mem_undefined(this, sizeof *this);

immediately after the clobber.

I am concerned that if GCC ever learns to leave out the following access
to 'this->foo', leaving tmp uninitialized, we will end up with:

this->foo = 42;
*this = { CLOBBER };
__valgrind_make_mem_undefined(this, sizeof *this);
int tmp(D);
return tmp(D); // uninitialized

and Valgrind will not report anything since the invalid load is optimized out.

With early instrumentation such optimization is not going to happen, since the
builtin may modify *this.

Is my concern reasonable?

Thanks.
Alexander

Re: [RFC PATCH] Detecting lifetime-dse issues via Valgrind

2023-11-21 Thread Alexander Monakov


On Tue, 21 Nov 2023, Alexander Monakov wrote:

> I am concerned that if GCC ever learns to leave out the following access
> to 'this->foo', leaving tmp uninitialized, we will end up with:
> 
>   this->foo = 42;

Sorry, this store will be DSE'd out, of course, but my question stands.

Alexander

>   *this = { CLOBBER };
>   __valgrind_make_mem_undefined(this, sizeof *this);
>   int tmp(D);
>   return tmp(D); // uninitialized
> 
> and Valgrind will not report anything since the invalid load is optimized out.
> 
> With early instrumentation such optimization is not going to happen, since the
> builtin may modify *this.
> 
> Is my concern reasonable?
> 
> Thanks.
> Alexander


Re: [RFC PATCH] Detecting lifetime-dse issues via Valgrind

2023-11-21 Thread Alexander Monakov


On Tue, 21 Nov 2023, Richard Biener wrote:

> and this, too, btw. - the DSE actually happens, the latter transform does not.
> We specifically "opt out" of doing that for QOI to not make undefined
> behavior worse.  The more correct transform would be to replace the
> load with a __builtin_trap () during path isolation (or wire in path isolation
> to value-numbering where we actually figure out there's no valid definition
> to reach for the load).
> 
> So yes, if you want to avoid these kind of transforms earlier instrumentation
> is better.

And then attempting to schedule it immediately after pass_ccp in the early-opts
pipeline is already too late, right?

Thanks!
Alexander


Re: [PATCH 1/1] sched-deps.cc (find_modifiable_mems): Avoid exponential behavior

2023-11-21 Thread Alexander Monakov


On Tue, 21 Nov 2023, Maxim Kuvyrkov wrote:

>  This patch avoids sched-deps.cc:find_inc() creating exponential number
>  of dependencies, which become memory and compilation time hogs.
>  Consider example (simplified from PR96388) ...
>  ===
>  sp=sp-4 // sp_insnA
>  mem_insnA1[sp+A1]
>  ...
>  mem_insnAN[sp+AN]
>  sp=sp-4 // sp_insnB
>  mem_insnB1[sp+B1]
>  ...
>  mem_insnBM[sp+BM]
>  ===
>  ... in this example find_modifiable_mems() will arrange for mem_insnA*
>  to be able to pass sp_insnA, and, while doing this, will create
>  dependencies between all mem_insnA*s and sp_insnB -- because sp_insnB
>  is a consumer of sp_insnA.  After this sp_insnB will have N new
>  backward dependencies.
>  Then find_modifiable_mems() gets to mem_insnB*s and starts to create
>  N new dependencies for _every_ mem_insnB*.  This gets us N*M new
>  dependencies.
> >> 
> >> [For avoidance of doubt, below discussion is about the general 
> >> implementation
> >> of find_modifiable_mems() and not about the patch.]
> > 
> > I was saying the commit message is hard to read (unless it's just me).
> > 
> >>> It's a bit hard to read this without knowing which value of 'backwards'
> >>> is assumed.
> 
> Oh, sorry, I misunderstood your comment.
> 
> In the above example I want to describe the outcome that the current code
> generates, without going into details about exactly how.  I'm not sure how to
> make it more readable, and would appreciate suggestions.

I think it would be easier to follow if you could fix a specific value of
'backwards' up front, and then ensure all following statements are consistent
with that, like I did in my explanation. Please feel free to pick up my text
into the commit message, if it helps.

> >>> Say 'backwards' is true and we are inspecting producer sp_insnB of 
> >>> mem_insnB1.
> >>> This is a true dependency. We know we can break it by adjusting B1 by -4, 
> >>> but
> >>> we need to be careful not to move such modified mem_insnB1 above 
> >>> sp_insnA, so
> >>> we need to iterate over *incoming true dependencies* of sp_insnB and add 
> >>> them.
> >>> 
> >>> But the code seems to be iterating over *all incoming dependencies*, so it
> >>> will e.g. take anti-dependency mem_insnA1 -> sp_insnB and create a true
> >>> dependency mem_insnA1 -> mem_insnB1'. This seems utterly inefficient, if 
> >>> my
> >>> understanding is correct.
> >> 
> >> Yeap, your understanding is correct.  However, this is what
> >> find_modifiable_mems() has to do to avoid complicated analysis of 
> >> second-level
> >> dependencies.
> > 
> > What is the reason it cannot simply skip anti-dependencies in the
> > 'if (backwards)' loop, and true dependencies in the 'else' loop?
> 
> I /think/ this should be possible.  However, rather than improving the current
> implementation, my preference is to rework it by integrating with the main
> dependency analysis.  This should provide both faster and more precise
> dependency analysis, which would generate breakable addr/mem dependencies.

I see, thank you.

Alexander


[PATCH 0/2] Clean up Valgrind configury

2023-11-23 Thread Alexander Monakov
We have an RFC patch [1] that adds a libgcc wrapper for a Valgrind client
request.  GCC has autoconf detection for Valgrind in the compiler proper
as well as libcpp (where it is actually unnecessary).

It's grown rusty; let's clean it up.

[1] 
https://inbox.sourceware.org/gcc-patches/20231024141124.210708-1-exactl...@ispras.ru/

Alexander Monakov (2):
  libcpp: configure: drop unused Valgrind detection
  gcc: configure: drop Valgrind 3.1 compatibility

 gcc/config.in   | 12 ---
 gcc/configure   | 80 +++--
 gcc/configure.ac| 49 +++
 gcc/system.h| 23 ++---
 libcpp/config.in| 15 ++---
 libcpp/configure| 70 +--
 libcpp/configure.ac | 51 ++---
 libcpp/lex.cc   |  4 +--
 8 files changed, 29 insertions(+), 275 deletions(-)

-- 
2.39.2



[PATCH 1/2] libcpp: configure: drop unused Valgrind detection

2023-11-23 Thread Alexander Monakov
When top-level configure has either --enable-checking=valgrind or
--enable-valgrind-annotations, we want to activate a couple of workarounds
in libcpp.  These workarounds do not use anything from the Valgrind API,
so just delete all detection.
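
(For context, the workarounds concern Memcheck's leak checker: libcpp keeps
only interior pointers into some of its allocations, roughly as in this
sketch, so the blocks would otherwise be reported as possibly lost.)

  #include <stdlib.h>

  /* Only an interior pointer survives, so the leak checker cannot see
     a pointer to the start of the block.  */
  char *alloc_cursor (void)
  {
    char *base = malloc (4096);
    return base + 64;
  }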

libcpp/ChangeLog:

* config.in: Regenerate.
* configure: Regenerate.
* configure.ac (ENABLE_VALGRIND_CHECKING): Delete.
(ENABLE_VALGRIND_ANNOTATIONS): Rename to
ENABLE_VALGRIND_WORKAROUNDS.  Delete Valgrind header checks.
* lex.cc (new_buff): Adjust for renaming.
(_cpp_free_buff): Ditto.
---
 libcpp/config.in| 15 ++
 libcpp/configure| 70 +
 libcpp/configure.ac | 51 ++---
 libcpp/lex.cc   |  4 +--
 4 files changed, 9 insertions(+), 131 deletions(-)

diff --git a/libcpp/config.in b/libcpp/config.in
index df4fd44c9e..253ef03a3d 100644
--- a/libcpp/config.in
+++ b/libcpp/config.in
@@ -24,12 +24,9 @@
language is requested. */
 #undef ENABLE_NLS
 
-/* Define to get calls to the valgrind runtime enabled. */
-#undef ENABLE_VALGRIND_ANNOTATIONS
-
-/* Define if you want to workaround valgrind (a memory checker) warnings about
-   possible memory leaks because of libcpp use of interior pointers. */
-#undef ENABLE_VALGRIND_CHECKING
+/* Define if you want to workaround Valgrind warnings about possible memory
+   leaks because of libcpp use of interior pointers. */
+#undef ENABLE_VALGRIND_WORKAROUNDS
 
 /* Define to 1 if you have `alloca', as a function or macro. */
 #undef HAVE_ALLOCA
@@ -201,9 +198,6 @@
 /* Define to 1 if you have the <locale.h> header file. */
 #undef HAVE_LOCALE_H
 
-/* Define if valgrind's memcheck.h header is installed. */
-#undef HAVE_MEMCHECK_H
-
 /* Define to 1 if you have the <memory.h> header file. */
 #undef HAVE_MEMORY_H
 
@@ -252,9 +246,6 @@
 /* Define to 1 if you have the <unistd.h> header file. */
 #undef HAVE_UNISTD_H
 
-/* Define if valgrind's valgrind/memcheck.h header is installed. */
-#undef HAVE_VALGRIND_MEMCHECK_H
-
 /* Define as const if the declaration of iconv() needs const. */
 #undef ICONV_CONST
 
diff --git a/libcpp/configure b/libcpp/configure
index 452e4c1f6c..8a38c0546e 100755
--- a/libcpp/configure
+++ b/libcpp/configure
@@ -9116,12 +9116,6 @@ $as_echo "#define ENABLE_ASSERT_CHECKING 1" >>confdefs.h
 
 fi
 
-if test x$ac_valgrind_checking != x ; then
-
-$as_echo "#define ENABLE_VALGRIND_CHECKING 1" >>confdefs.h
-
-fi
-
 # Check whether --enable-canonical-system-headers was given.
 if test "${enable_canonical_system_headers+set}" = set; then :
   enableval=$enable_canonical_system_headers;
@@ -9405,62 +9399,6 @@ case x$enable_languages in
 esac
 
 
-ac_fn_c_check_header_mongrel "$LINENO" "valgrind.h" "ac_cv_header_valgrind_h" 
"$ac_includes_default"
-if test "x$ac_cv_header_valgrind_h" = xyes; then :
-  have_valgrind_h=yes
-else
-  have_valgrind_h=no
-fi
-
-
-
-# It is certainly possible that there's valgrind but no valgrind.h.
-# GCC relies on making annotations so we must have both.
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for VALGRIND_DISCARD in 
<valgrind/memcheck.h>" >&5
-$as_echo_n "checking for VALGRIND_DISCARD in <valgrind/memcheck.h>... " >&6; }
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-#include <valgrind/memcheck.h>
-#ifndef VALGRIND_DISCARD
-#error VALGRIND_DISCARD not defined
-#endif
-_ACEOF
-if ac_fn_c_try_cpp "$LINENO"; then :
-  gcc_cv_header_valgrind_memcheck_h=yes
-else
-  gcc_cv_header_valgrind_memcheck_h=no
-fi
-rm -f conftest.err conftest.i conftest.$ac_ext
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: 
$gcc_cv_header_valgrind_memcheck_h" >&5
-$as_echo "$gcc_cv_header_valgrind_memcheck_h" >&6; }
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for VALGRIND_DISCARD in 
<memcheck.h>" >&5
-$as_echo_n "checking for VALGRIND_DISCARD in <memcheck.h>... " >&6; }
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-#include <memcheck.h>
-#ifndef VALGRIND_DISCARD
-#error VALGRIND_DISCARD not defined
-#endif
-_ACEOF
-if ac_fn_c_try_cpp "$LINENO"; then :
-  gcc_cv_header_memcheck_h=yes
-else
-  gcc_cv_header_memcheck_h=no
-fi
-rm -f conftest.err conftest.i conftest.$ac_ext
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $gcc_cv_header_memcheck_h" >&5
-$as_echo "$gcc_cv_header_memcheck_h" >&6; }
-if test $gcc_cv_header_valgrind_memcheck_h = yes; then
-
-$as_echo "#define HAVE_VALGRIND_MEMCHECK_H 1" >>confdefs.h
-
-fi
-if test $gcc_cv_header_memcheck_h = yes; then
-
-$as_echo "#define HAVE_MEMCHECK_H 1" >>confdefs.h
-
-fi
-
 # Check whether --enable-valgrind-annotations was given.
 if test "${enable_valgrind_annotations+set}" = set; then :
   enableval=$enable_valgrind_annotations;
@@ -9470,14 +9408,8 @@ fi
 
 if test x$enable_valgrind_annotations != xno \
 || test x$ac_valgrind_checking != x; then
-  if (test $have_valgrind_h = no \
-  && test $gcc_cv_header_memcheck_h = no \
-  && test $gcc_cv_header_valgrind_memcheck_h = no); then
-as_fn_error $? "*** valgrind annotations requested, but" "$LINENO" 5
-as_fn_error $? "*** Can't find 

[PATCH 2/2] gcc: configure: drop Valgrind 3.1 compatibility

2023-11-23 Thread Alexander Monakov
Our system.h and configure.ac try to accommodate valgrind-3.1, but it is
more than 15 years old at this point. As Valgrind-based checking is a
developer-oriented feature, drop the compatibility stuff and streamline
the detection.

gcc/ChangeLog:

* config.in: Regenerate.
* configure: Regenerate.
* configure.ac: Delete manual checks for old Valgrind headers.
* system.h (VALGRIND_MAKE_MEM_NOACCESS): Delete.
(VALGRIND_MAKE_MEM_DEFINED): Delete.
(VALGRIND_MAKE_MEM_UNDEFINED): Delete.
(VALGRIND_MALLOCLIKE_BLOCK): Delete.
(VALGRIND_FREELIKE_BLOCK): Delete.
---
 gcc/config.in| 12 
 gcc/configure| 80 
 gcc/configure.ac | 49 +++--
 gcc/system.h | 23 ++
 4 files changed, 20 insertions(+), 144 deletions(-)

diff --git a/gcc/config.in b/gcc/config.in
index e100c20dcd..3dfc65b844 100644
--- a/gcc/config.in
+++ b/gcc/config.in
@@ -1868,12 +1868,6 @@
 #endif
 
 
-/* Define if valgrind's memcheck.h header is installed. */
-#ifndef USED_FOR_TARGET
-#undef HAVE_MEMCHECK_H
-#endif
-
-
 /* Define to 1 if you have the <memory.h> header file. */
 #ifndef USED_FOR_TARGET
 #undef HAVE_MEMORY_H
@@ -2136,12 +2130,6 @@
 #endif
 
 
-/* Define if valgrind's valgrind/memcheck.h header is installed. */
-#ifndef USED_FOR_TARGET
-#undef HAVE_VALGRIND_MEMCHECK_H
-#endif
-
-
 /* Define to 1 if you have the `vfork' function. */
 #ifndef USED_FOR_TARGET
 #undef HAVE_VFORK
diff --git a/gcc/configure b/gcc/configure
index cc0c3aad67..5be4592ba0 100755
--- a/gcc/configure
+++ b/gcc/configure
@@ -7679,63 +7679,6 @@ $as_echo "#define ENABLE_FOLD_CHECKING 1" >>confdefs.h
 fi
 valgrind_path_defines=
 valgrind_command=
-
-ac_fn_cxx_check_header_mongrel "$LINENO" "valgrind.h" 
"ac_cv_header_valgrind_h" "$ac_includes_default"
-if test "x$ac_cv_header_valgrind_h" = xyes; then :
-  have_valgrind_h=yes
-else
-  have_valgrind_h=no
-fi
-
-
-
-# It is certainly possible that there's valgrind but no valgrind.h.
-# GCC relies on making annotations so we must have both.
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for VALGRIND_DISCARD in 
<valgrind/memcheck.h>" >&5
-$as_echo_n "checking for VALGRIND_DISCARD in <valgrind/memcheck.h>... " >&6; }
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-#include <valgrind/memcheck.h>
-#ifndef VALGRIND_DISCARD
-#error VALGRIND_DISCARD not defined
-#endif
-_ACEOF
-if ac_fn_cxx_try_cpp "$LINENO"; then :
-  gcc_cv_header_valgrind_memcheck_h=yes
-else
-  gcc_cv_header_valgrind_memcheck_h=no
-fi
-rm -f conftest.err conftest.i conftest.$ac_ext
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: 
$gcc_cv_header_valgrind_memcheck_h" >&5
-$as_echo "$gcc_cv_header_valgrind_memcheck_h" >&6; }
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for VALGRIND_DISCARD in 
<memcheck.h>" >&5
-$as_echo_n "checking for VALGRIND_DISCARD in <memcheck.h>... " >&6; }
-cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-#include <memcheck.h>
-#ifndef VALGRIND_DISCARD
-#error VALGRIND_DISCARD not defined
-#endif
-_ACEOF
-if ac_fn_cxx_try_cpp "$LINENO"; then :
-  gcc_cv_header_memcheck_h=yes
-else
-  gcc_cv_header_memcheck_h=no
-fi
-rm -f conftest.err conftest.i conftest.$ac_ext
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $gcc_cv_header_memcheck_h" >&5
-$as_echo "$gcc_cv_header_memcheck_h" >&6; }
-if test $gcc_cv_header_valgrind_memcheck_h = yes; then
-
-$as_echo "#define HAVE_VALGRIND_MEMCHECK_H 1" >>confdefs.h
-
-fi
-if test $gcc_cv_header_memcheck_h = yes; then
-
-$as_echo "#define HAVE_MEMCHECK_H 1" >>confdefs.h
-
-fi
-
 if test x$ac_valgrind_checking != x ; then
 
 # Prepare PATH_SEPARATOR.
@@ -7804,11 +7747,8 @@ else
 $as_echo "no" >&6; }
 fi
 
-  if test "x$valgrind_path" = "x" \
-|| (test $have_valgrind_h = no \
-   && test $gcc_cv_header_memcheck_h = no \
-   && test $gcc_cv_header_valgrind_memcheck_h = no); then
-   as_fn_error $? "*** Can't find both valgrind and valgrind/memcheck.h, 
memcheck.h or valgrind.h" "$LINENO" 5
+  if test "x$valgrind_path" = "x"; then
+as_fn_error $? "*** Cannot find valgrind" "$LINENO" 5
   fi
   valgrind_path_defines=-DVALGRIND_PATH='\"'$valgrind_path'\"'
   valgrind_command="$valgrind_path -q"
@@ -7864,12 +7804,16 @@ else
   enable_valgrind_annotations=no
 fi
 
+ac_fn_cxx_check_header_mongrel "$LINENO" "valgrind/memcheck.h" 
"ac_cv_header_valgrind_memcheck_h" "$ac_includes_default"
+if test "x$ac_cv_header_valgrind_memcheck_h" = xyes; then :
+
+fi
+
+
 if test x$enable_valgrind_annotations != xno \
 || test x$ac_valgrind_checking != x; then
-  if (test $have_valgrind_h = no \
-  && test $gcc_cv_header_memcheck_h = no \
-  && test $gcc_cv_header_valgrind_memcheck_h = no); then
-as_fn_error $? "*** Can't find valgrind/memcheck.h, memcheck.h or 
valgrind.h" "$LINENO" 5
+  if test $ac_cv_header_valgrind_memcheck_h = no; then
+as_fn_error $? "*** Cannot find valgrind/memcheck.h" "$LINENO" 5
   fi
 
 $as_echo "#define ENABLE_VALGRIND_ANNOTATIONS 1" >>confdefs.h
@@ -21602

[committed] sort.cc: fix mentions of sorting networks in comments

2023-11-26 Thread Alexander Monakov
Avoid using 'network sort' (a misnomer) in sort.cc; the correct term is
'sorting networks'.

gcc/ChangeLog:

* sort.cc: Use 'sorting networks' in comments.
---
 gcc/sort.cc | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/sort.cc b/gcc/sort.cc
index 9a0113fb62f..feef345830c 100644
--- a/gcc/sort.cc
+++ b/gcc/sort.cc
@@ -25,7 +25,7 @@ along with GCC; see the file COPYING3.  If not see
- deterministic (but not necessarily stable)
- fast, especially for common cases (0-5 elements of size 8 or 4)
 
-   The implementation uses a network sort for up to 5 elements and
+   The implementation uses sorting networks for up to 5 elements and
a merge sort on top of that.  Neither stage has branches depending on
comparator result, trading extra arithmetic for branch mispredictions.  */
 
@@ -53,7 +53,7 @@ struct sort_ctx
   char   *out; // output buffer
   size_t n;// number of elements
   size_t size; // element size
-  size_t nlim; // limit for network sort
+  size_t nlim; // limit for using sorting networks
 };
 
 /* Like sort_ctx, but for use with qsort_r-style comparators.  Several
@@ -151,7 +151,7 @@ cmp1 (char *e0, char *e1, sort_ctx *c)
   return x & (c->cmp (e0, e1) >> 31);
 }
 
-/* Execute network sort on 2 to 5 elements from IN, placing them into C->OUT.
+/* Apply a sorting network to 2 to 5 elements from IN, placing them into 
C->OUT.
IN may be equal to C->OUT, in which case elements are sorted in place.  */
 template
 static void
-- 
2.33.0



Re: [PATCH][RFC] middle-end/106811 - document GENERIC/GIMPLE undefined behavior

2023-09-20 Thread Alexander Monakov


On Fri, 15 Sep 2023, Richard Biener via Gcc-patches wrote:

> +@itemize @bullet
> +@item
> +When the result of negation, addition, subtraction or division of two signed
> +integers or signed integer vectors not subject to @option{-fwrapv} cannot be
> +represented in the type.

It would be a bit awkward to add 'or vectors' everywhere it applies; perhaps
say something general about elementwise vector operations up front?

> +
> +@item
> +The value of the second operand of any of the division or modulo operators
> +is zero.
> +
> +@item
> +When incrementing or decrementing a pointer not subject to
> +@option{-fwrapv-pointer} wraps around zero.
> +
> +@item
> +An expression is shifted by a negative number or by an amount greater
> +than or equal to the width of the shifted operand.
> +
> +@item
> +Pointers that do not point to the same object are compared using
> +relational operators.

This does not apply to '==' and '!='. Maybe say

  Ordered comparison operators are applied to pointers
  that do not point to the same object.
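
A minimal example of what the clause covers:

  int f (void)
  {
    int a, b;
    /* Undefined: ordered comparison of pointers to different objects.
       Equality tests such as &a == &b remain well-defined.  */
    return &a < &b;
  }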

> +
> +@item
> +An object which has been modified is accessed through a restrict-qualified
> +pointer and another pointer that are not both based on the same object.
> +
> +@item
> +The @} that terminates a function is reached, and the value of the function
> +call is used by the caller.
> +
> +@item
> +When program execution reaches __builtin_unreachable.
> +
> +@item
> +When an object has its stored value accessed by an lvalue that
> +does not have one of the following types:
> +@itemize @minus
> +@item
> +a (qualified) type compatible with the effective type of the object
> +@item
> +a type that is the (qualified) signed or unsigned type corresponding to
> +the effective type of the object
> +@item
> +a character type, a ref-all qualified type or a type subject to
> +@option{-fno-strict-aliasing}
> +@item
> +a pointer to void with the same level of indirection as the accessed
> +pointer object
> +@end itemize

This list seems to miss a clause that allows aliasing between
scalar types and their vector counterparts?
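
For instance, presumably an access like the following is meant to remain
valid (a sketch using the GNU vector_size extension):

  typedef int v4si __attribute__ ((vector_size (16)));

  int first_elem (v4si *p)
  {
    /* The stored value of a vector object accessed through an lvalue of
       its element type; none of the clauses above covers this.  */
    return *(int *) p;
  }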

Thanks.
Alexander


Re: RISC-V: Added support for CRC.

2023-09-24 Thread Alexander Monakov


On Sun, 24 Sep 2023, Joern Rennecke wrote:

> It is a stated goal of coremark to test performance for CRC.

I would expect a good CRC benchmark to print CRC throughput in
bytes per cycle or megabytes per second.

I don't see where Coremark states that goal. In the readme at
https://github.com/eembc/coremark/blob/main/README.md
it enumerates the three subcategories (linked list, matrix ops,
state machine) and indicates that CRC is used for validation.

If it claims that elsewhere, the way its code employs CRC does not
correspond to real-world use patterns, such as in the Linux kernel for
protocol and filesystem checksumming, or in decompression libraries.

> They do not use a library call to implement CRC, but a specific
> bit-banging algorithm they have chosen.  That algorithm is, for the
> vast majority of processors, not representative of the targets
> performance potential in calculating CRCs,

It is, however, representative of the target CPU's ability to run
those basic bitwise ops with good overlap with the rest of computation,
which is far more relevant for the real-world performance of the CPU.

> thus if a compiler fails to translate this into the CRC implementation
> that would be used for performance code, the compiler frustrates this
> goal of coremark to give a measure of CRC calculation performance.

Are you seriously saying that if a customer chooses CPU A over CPU B
based on Coremark scores, and then discovers that actual performance
in, say, zlib (which uses slice-by-N for CRC) is better on CPU B, that's
entirely fair and the benchmark scores they saw were not misleading?

> > At best we might have
> > a discussion on providing a __builtin_clmul for carry-less multiplication
> > (which _is_ a fundamental primitive, unlike __builtin_crc), and move on.
> 
> Some processors have specialized instructions for CRC computations.

Only for one or two fixed polynomials. For that matter, some processors
have instructions for AES and SHA, but that doesn't change that clmul is
a more fundamental and flexible primitive than "CRC".

> If you want to recognize a loop that does a CRC on a block, it makes
> sense to start with recognizing the CRC computation for single array
> elements first.  We have to learn to walk before we can run.

If only the "walk before you run" logic was applied in favor of
implementing a portable clmul builtin prior to all this.
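
For reference, the primitive is tiny to specify. A generic C sketch of a
32x32-to-64-bit carry-less multiply follows (the function name is made up;
an actual builtin's name and semantics would be up for discussion):

  unsigned long long clmul32 (unsigned int a, unsigned int b)
  {
    unsigned long long acc = 0;
    for (int i = 0; i < 32; i++)
      if (b & (1u << i))
        acc ^= (unsigned long long) a << i;  /* XOR instead of add: no carries */
    return acc;
  }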

> A library can be used to implement built-ins in gcc (we still need to
> define one for block operations, one step at a time...).  However,
> someone or something needs to rewrite the existing code to use the
> library.  It is commonly accepted that an efficient way to do this is
> to make a compiler do the necessary transformations, as long as it can
> be made to churn out good enough code.

How does this apply to the real world? Among CRC implementations in the
Linux kernel, ffmpeg, zlib, bzip2, xz-utils, and zstd I'm aware of only
a single instance where bitwise CRC is used. It's in the table
initialization function in xz-utils. The compiler would transform that
to copying one table into another. Is that a valuable transform?
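
For concreteness, such a table initializer looks roughly like this (a generic
CRC-32 sketch, not the actual xz-utils code); the "transform" would amount to
precomputing a 1 KiB table at compile time:

  #include <stdint.h>

  static uint32_t crc32_table[256];

  static void crc32_init (void)
  {
    for (uint32_t i = 0; i < 256; i++)
      {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)  /* the bit-at-a-time kernel */
          c = (c >> 1) ^ ((c & 1) ? 0xEDB88320u : 0);
        crc32_table[i] = c;
      }
  }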

> Alexander Monakov:
> > Useful to whom? The Linux kernel? zlib, bzip2, xz-utils? ffmpeg?
> > These consumers need high-performance blockwise CRC, offering them
> > a latency-bound elementwise CRC primitive is a disservice. And what
> > should they use as a fallback when __builtin_crc is unavailable?
> 
> We can provide a fallback implementation for all targets with table
> lookup and/or shifts .

How would it help when they are compiled with LLVM, or with a GCC version
earlier than 14?

Alexander


Re: RISC-V: Added support for CRC.

2023-09-26 Thread Alexander Monakov


On Tue, 26 Sep 2023, Jeff Law wrote:

> What ultimately pushed us to keep moving forward on this effort was
> discovering numerous CRC loop implementations out in the wild, including 4
> implementations (IIRC) in the kernel itself.

The kernel employs bitwise CRC only in look-up table generators, which run
at build time. You are quite literally slowing down the compiler in order
to speed up generators that don't account for even one millisecond of
kernel build time, and have no relation to its run-time performance.

(incidentally you can't detect the actual CRC impls using those tables)

> And as I've stated before, the latency of clmuls is dropping.   I wouldn't be
> terribly surprised to see single-cycle clmul implementations showing up
> within the next 18-24 months.  It's really just a matter of gate budget vs
> expected value.

In a commercial implementation? I'll take that bet. You spend gate budget
like that after better avenues for raising ILP are exhausted (like adding
more ALUs that can do clmul at a reasonable 3c-4c latency).

> To reiterate the real goal here is to take code as-is and make it
> significantly faster.

Which code? Table generators in the kernel and xz-utils? 

> While the original target was Coremark, we've found similar bitwise
> implementations of CRCs all over the place. There's no good reason that code
> should have to change.

But did you look at them? There's no point in optimizing table generators either.

Alexander


Re: [PATCH][RFC] c/106800 - support vector condition operation in C

2024-07-18 Thread Alexander Monakov


On Thu, 18 Jul 2024, Richard Biener wrote:

> The following adds support for vector conditionals in C.  The support
> was nearly there already, but c_objc_common_truthvalue_conversion
> rejected vector types.  Instead of letting them pass there unchanged,
> I chose to skip it when parsing conditionals, as a
> variant with less possible fallout.  The missing part was promotion
> of a scalar second or third operand to vector; I copied the logic
> from binary operator handling for this.
> 
> I've moved the testcases I could easily spot over to c-c++-common.
> 
> The C++ frontend implements vector ? '1' : '0' with promotion of
> both scalars to a vector but without applying integer promotions
> first (r0-124280-g07298ffd6f7c2d).  I don't think this is
> what we document, nor is it backed by OpenCL or C++ valarray.

I think we do document that:

If both b and c are scalars and the type of true?b:c has the same size
as the element type of a, then b and c are converted to a vector type
whose elements have this type and with the same number of elements as a.

(in https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html )

and I added

In contrast to scalar operations in C and C++, operands of integer vector
operations do not undergo integer promotions.

when spelling out the behavior of vector shifts.

Clang also supports 'vector ? scalar : scalar'.
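
A sketch of the documented promotion rule in action, using the GNU
vector_size extension (accepted by the C++ front-end today, and by the
C front-end with this patch):

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si f (v4si a)
  {
    /* 1 and 0 are ints, matching the element size of v4si, so both are
       converted to v4si; the result is 1 where a[i] != 0, else 0.  */
    return a ? 1 : 0;
  }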

> I've left
> this out for now.  I think we need better coverage in the testsuite
> for this and others; for example, the patch below now accepts
> vector(int) ? vector(short) : 0 but the C++ frontend rejects this
> (I think OpenCL is fine with this).  OpenCL says the conditional
> op is equivalent to select(exp3,exp2,exp1) which is defined as
> result[i] = (MSB of c[i] set) ? b[i] : a[i], with no restriction
> on the conditional.

I think we require that sizes match because that's natural considering
how it is lowered (bitwise blending of comparison mask with op2/op3).
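
That is, conceptually something like the following (a sketch of the lowering,
not the actual implementation):

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si vcond (v4si c, v4si a, v4si b)
  {
    v4si m = c != 0;            /* per-element mask of -1 or 0 (GNU semantics) */
    return (m & a) | (~m & b);  /* bitwise blend of the two operands */
  }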

Note that in Clang ext_vector_type has OpenCL semantics for c?a:b
(taking the MSB of 'c') while vector_size has GNU semantics ('c' is
compared with zero).

Thanks.
Alexander

